[Figure 1. DMS and Interfaces. Diagram elements: Common Operating Infrastructure (ESB); Sensing & Acquisition Subsystem; Data Management Subsystem; Knowledge, Planning & Execution Subsystems; External Interfaces: OPeNDAP, THREDDS, LAS,…]
Data Management Subsystems report from PDR meeting
Arcot Rajasekar
Data Intensive Computing Environment Group San Diego Supercomputer Center. UCSD
With participation of Kevin Gomes, Michael Wan and Michael Meisinger
The Data Management Subsystem (DMS) is the data and information handling component of the ORION CI. The work breakdown structure for the DMS is concentrated in Releases 1 through 3 of the OOI Cyberinfrastructure. This write-up covers Release 1 in detail, with brief outlines of the work breakdown for Releases 2 and 3.
In Release 1, an initial end-to-end automated data preservation and distribution network will be established. The main task of the data management group is to create a virtual data network with distributed storage repositories and data archives, and to provide data distribution and access mechanisms for soft real-time sensor data as well as archived data collections. The ingestion of soft real-time data into the DMS is governed by interactions with the Sensing and Acquisition Subsystem and its CI Channel system. Similarly, access to data in the DMS for application and human consumption is governed through interactions with the Analysis and Synthesis Subsystem. The internal activities in the DMS are coordinated through close interactions with the Common Operating Infrastructure Subsystem. In later releases, the DMS will also interact with the Knowledge Management, Planning & Prosecution, and Common Execution Infrastructure Subsystems. Another Release 1 goal for the DMS is to provide the Data Management services for the GSO CyberPoP endpoints by extending the virtual data networks to the CyberPoP deployment systems. The interaction of the DMS with other subsystems is shown in Figure 1.
The ORION project will begin assembling collections of digital holdings, both real-time sensor data and static collections and archives, that comprise the intellectual content on which current and future research will be based. These digital holdings can be massive in size, measured in petabytes and tens of millions of files, and must be maintained for decades. In addition, the digital holdings are distributed across multiple institutions, and are published in digestible collections for access by other researchers.
[Comment, sekar, 11/13/07 1:44 PM. Originally from Kevin Gomes: What is the GSO? Is this supposed to be CGSN? Not sure about this. -Raja]
[Comment, sekar, 11/13/07 1:44 PM. Originally from Kevin Gomes: Not sure I understand what this last sentence means. Is this just basically saying that the CGSN and RSN will have access to read from the DMS so that they can make decisions based on information from data processing or other observatory assets? Not sure about this. -Raja]
The result is an ever-growing demand for software cyberinfrastructure that simultaneously supports data sharing on a day-to-day basis, data publication for reference and alternate analysis, and data preservation for long-term access.
The OOI data management system must not only handle sensor data in conjunction with static and archived data; it must also work across multiple organizations, each with autonomous ownership and stewardship of its holdings. This requires a federation architecture with transparent policy-based mechanisms for storage and access within each autonomous organization, balanced against the need for single sign-on and a uniform access protocol. To meet these competing needs we will implement a federated data grid architecture with transparent and customizable policy-based data management. Table 1 provides a list of services provided by the DMS.
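As a rough illustration of what policy-based management across autonomous organizations could look like, the following Python sketch lets each organization (a "zone") declare its own access rules while a single federation entry point evaluates them uniformly. It is only a sketch under stated assumptions: `Request`, `Zone`, `Federation`, the zone names, and the path layout are all hypothetical and are not iRODS or SRB interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical request record; not part of any real iRODS/SRB interface.
@dataclass(frozen=True)
class Request:
    user: str
    home_zone: str
    action: str   # "read" or "write"
    path: str     # assumed layout: "/<zone>/<collection>/<file>"

# A policy is a predicate that each organization defines for itself.
Policy = Callable[[Request], bool]

def target_zone(path: str) -> str:
    """Extract the zone name from the assumed path layout."""
    return path.strip("/").split("/")[0]

@dataclass
class Zone:
    name: str
    policies: List[Policy] = field(default_factory=list)

    def permits(self, req: Request) -> bool:
        # A request is allowed only if every local policy agrees.
        return all(policy(req) for policy in self.policies)

class Federation:
    """Uniform entry point; each zone keeps control of its own rules."""

    def __init__(self) -> None:
        self.zones: Dict[str, Zone] = {}

    def add_zone(self, zone: Zone) -> None:
        self.zones[zone.name] = zone

    def authorize(self, req: Request) -> bool:
        zone = self.zones.get(target_zone(req.path))
        return zone is not None and zone.permits(req)

def local_writes_only(req: Request) -> bool:
    """Example policy: only users from the owning zone may write."""
    return req.action != "write" or req.home_zone == target_zone(req.path)

fed = Federation()
fed.add_zone(Zone("zoneA", [local_writes_only]))
fed.add_zone(Zone("zoneB", [local_writes_only, lambda r: r.user != "guest"]))
```

With this setup, a user whose home is `zoneA` can read from `zoneB` but cannot write there, and `zoneB` additionally refuses the `guest` account without `zoneA` having to adopt the same rule.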
Table 1: List of Data Management Services
• Online Data Repository: Data and metadata organization
• Persistent Archive Service: Persistent naming, preservation processes
• Asset Validation Service: Integrity and authenticity
• Aggregation Service: Classification, categorization, grouping
• Attribution Service: Community attributes, semantic ontology
• Metadata Search & Navigation Services: Query and browse by context
• Dynamic Data Distribution Services: Publish, subscribe and query for dynamic data resources
• Data Access Services: Suite of external interfaces (OPeNDAP, THREDDS, LAS,…)
• Management Policy Enforcement: Community-specific policies, access and processing controls
To meet the goal of providing a distributed data management and preservation infrastructure, we propose an architecture based on proven technologies that are currently used across a wide variety of projects: 1) the Storage Resource Broker (SRB) and the integrated Rule Oriented Data Management System (iRODS), both developed by the Data Intensive Computing Environment (DICE) Group at the San Diego Supercomputer Center (SDSC) at the University of California at San Diego, and 2) the Shore Side Data System (SSDS) developed at the Monterey Bay Aquarium Research Institute (MBARI). The SRB/iRODS middleware provides all of the features needed to implement a production-level data grid, including facilities for collection building, sharing, management, querying, accessing, and preserving data in a federated, distributed framework with uniform access to diverse, heterogeneous storage resources across administrative domains. The middleware uses an integrated metadata catalog that holds system metadata and application- or domain-dependent metadata about resources, datasets, methods, and users. Together, the data management and metadata management systems provide a scalable information discovery and data access system for publishing and analyzing scientific data and metadata.
The DMS will also incorporate a component based on the Antelope Real Time System (ARTS) developed by BRTT, Colorado. Although ARTS is not open source, it is used in many projects for acquisition and dissemination of sensor stream data. The Sensing and Acquisition System will integrate ARTS as a subsystem. Since ARTS also provides database capabilities for storing stream data, we may provide an interface to it through the iRODS system. Other sensor network systems such as Data Turbine will also be considered as needs arise and as their use is indicated at one of the observatories. Complementary features offered by the iRODS and SSDS systems are listed in Table 2.
The DMS for Release 1 will be developed in stages. Since iRODS and SSDS are established open source software, we propose to integrate them tightly at the software level by creating system-specific drivers. The Enterprise Service Bus access point of SSDS will be used for interfacing with the COI. In later releases, the iRODS subsystem itself will provide this interface directly, giving multiple ways of interfacing with the COI. Similarly, the sensor interface provided by SSDS will be leveraged to provide access to sensor streams. Since the products coming out of SSDS are files and relational metadata, these will be made accessible through the iRODS infrastructure. The method of access is still under design: one can envision (a) replicating the information directly into iRODS or (b) providing registered access to products that remain under the control of SSDS. The pros and cons of the two approaches will be discussed in the near future to find a path forward.
The iRODS system will be used as the main access point for many applications. To this end, access services based on OPeNDAP, THREDDS and other systems will be developed in addition to the native C, Java and Web interfaces of iRODS. In addition, the SRB system (forerunner of iRODS) has already been integrated with the ARTS sensor repository and distribution system. We propose porting this capability to iRODS and providing similar access to the ARTS products and sensor streams through iRODS. The interactions between the two sub-subsystems are captured in Figures 2 and 3.
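The two options under consideration for exposing SSDS products, (a) replication and (b) registration, differ mainly in where the bytes live and which system stays authoritative. The Python sketch below illustrates that difference with in-memory stand-ins. Everything in it is an assumption made for illustration: `SourceStore`, `GridCatalog`, the `ssds://` URI scheme, and the logical paths are hypothetical and do not correspond to actual SSDS or iRODS interfaces.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class CatalogEntry:
    logical_name: str
    local_copy: Optional[bytes]   # set for option (a), None for option (b)
    source_uri: Optional[str]     # set for option (b), None for option (a)

class SourceStore:
    """Stand-in for the system that produced the data product."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, uri: str, data: bytes) -> None:
        self._objects[uri] = data

    def get(self, uri: str) -> bytes:
        return self._objects[uri]

class GridCatalog:
    """Stand-in for the data grid's logical namespace."""

    def __init__(self, source: SourceStore) -> None:
        self._source = source
        self._entries: Dict[str, CatalogEntry] = {}

    def replicate(self, logical_name: str, uri: str) -> None:
        # Option (a): copy the bytes; the grid now holds its own replica.
        data = self._source.get(uri)
        self._entries[logical_name] = CatalogEntry(logical_name, data, None)

    def register(self, logical_name: str, uri: str) -> None:
        # Option (b): record only a pointer; the source stays authoritative.
        self._entries[logical_name] = CatalogEntry(logical_name, None, uri)

    def read(self, logical_name: str) -> bytes:
        entry = self._entries[logical_name]
        if entry.local_copy is not None:
            return entry.local_copy
        return self._source.get(entry.source_uri)

# Hypothetical product URI and logical names, for illustration only.
source = SourceStore()
source.put("ssds://products/1", b"version 1")

grid = GridCatalog(source)
grid.replicate("/ooi/replicated/product1", "ssds://products/1")
grid.register("/ooi/registered/product1", "ssds://products/1")

# The source system later updates the product in place.
source.put("ssds://products/1", b"version 2")
replicated_view = grid.read("/ooi/replicated/product1")
registered_view = grid.read("/ooi/registered/product1")
```

After the update, the replicated entry still returns the bytes captured at copy time, while the registered entry follows the source. This is the trade-off the design discussion would need to weigh: a replica survives loss of the source but can go stale, whereas a registered pointer is always current but depends on the source being reachable.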
Table 2: Features Supported in SSDS and iRODS Components of DMS
iRODS
• Resource: Archives, Unix File Systems
• API: Ingest/Access, Register, Metadata
• Form: Hierarchical (POSIX)
• Access: C, Java, PHP
• Protocol: Native Bin/XML
• Metadata Extraction: Link DataProcessing/Extraction Micro-Services, Rules
• Replication: supported
• Metadata Catalog: RDB (System: Owner, ACL, Chksum, Audit,…; User-defined: KVU-Triplets)
SSDS
• Resource: RDB (File System for Backup)
• API: Ingest/Access, Register, Metadata
• Form: URI
• Access: Java, REST, WS
• Protocol: HTTP
• Metadata Extraction: API/XML and Services
• Replication: not supported
• Metadata Catalog: RDB (System: Ownership, provenance; User-defined: No)
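Table 2 lists user-defined metadata in iRODS as "KVU-Triplets". Reading that as key, value, unit triplets attached to a data object (an interpretation, not something the table spells out), a catalog of such triplets could be sketched in Python as follows. `Triplet`, `TripletCatalog`, and the sample paths and attributes are hypothetical and do not reflect the actual iRODS catalog schema.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set

class Triplet(NamedTuple):
    key: str
    value: str
    unit: str   # empty string when the attribute has no unit

class TripletCatalog:
    """In-memory stand-in for a user-defined metadata catalog."""

    def __init__(self) -> None:
        self._by_object: Dict[str, List[Triplet]] = defaultdict(list)

    def annotate(self, obj: str, key: str, value: str, unit: str = "") -> None:
        # Attach one key/value/unit triplet to a data object.
        self._by_object[obj].append(Triplet(key, value, unit))

    def find(self, key: str, value: str) -> Set[str]:
        # Return every object that carries a matching key/value pair.
        return {
            obj
            for obj, triplets in self._by_object.items()
            if any(t.key == key and t.value == value for t in triplets)
        }

catalog = TripletCatalog()
catalog.annotate("/ooi/ctd/cast1.nc", "instrument", "CTD")
catalog.annotate("/ooi/ctd/cast1.nc", "max_depth", "950", "m")
catalog.annotate("/ooi/adcp/run7.nc", "instrument", "ADCP")
ctd_objects = catalog.find("instrument", "CTD")
```

Because the triplets are free-form, each community can attach its own attributes without a schema change, which is the practical difference from the SSDS column marked "User-defined: No".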
[Figure 2: SSDS Data Flow. Diagram elements: DP, DataStream, File, Ingest, Registration (XML), RDB, backup, WAN, file and RDB stores for data and metadata, API, HTTP/REST, JMS, URI, ESB]
[Figure 3: caption missing. Diagram elements: DP, File, Register, API (Put: POSIX, Web), File System, RDB, API (Get: Web, OPeNDAP, THREDDS, LAS), DataStreams, Distributed/Replicated, ESB, SSDS, WS, HDF]
The DMS will provide distributed, replicated access to heterogeneous data from sensor streams in soft real-time and to static data in files and relational databases. In addition, the metadata and semantic ontology support provided by the subsystem will enhance querying and discovery of such data products. The system will provide a facility to arrange the data products into collections that enable curation and policy-driven preservation. With features for access control, auditing, and integrity and authenticity checking, the system will be well suited for disseminating validated ocean data products through various interfaces to researchers, to the public, and to policy makers.
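Integrity checking of the kind mentioned above is commonly done by recording a checksum at ingest and recomputing it on access. The Python sketch below shows that pattern with SHA-256 from the standard library. `IntegrityLedger` and the sample path are hypothetical, the choice of SHA-256 is an assumption (the source does not name an algorithm), and authenticity checking, which would need signatures or a trusted catalog, is not covered.

```python
import hashlib
from typing import Dict

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

class IntegrityLedger:
    """Records a checksum per object at ingest and verifies it later."""

    def __init__(self) -> None:
        self._digests: Dict[str, str] = {}

    def record(self, name: str, data: bytes) -> str:
        digest = sha256_hex(data)
        self._digests[name] = digest
        return digest

    def verify(self, name: str, data: bytes) -> bool:
        # Unknown objects fail verification rather than passing silently.
        expected = self._digests.get(name)
        return expected is not None and expected == sha256_hex(data)

original = b"temperature,salinity\n4.1,34.9\n"
ledger = IntegrityLedger()
ledger.record("/ooi/ctd/cast1.csv", original)

intact = ledger.verify("/ooi/ctd/cast1.csv", original)
corrupted = ledger.verify("/ooi/ctd/cast1.csv", original + b"x")
```

The same check can be run against each replica of an object, so a damaged copy can be detected and replaced from a copy that still matches the recorded digest.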
[Comment, sekar, 11/13/07 1:46 PM. Originally from Kevin Gomes: There is no mention in the document of R2 or R3. It makes it look like everything will be done in R1 even though the intro text mentions R2/R3. I think it would be helpful to actually lay out the WBS by time and list the various development pieces and where they go. After reading the Sensing and Acquisition texts and the DMS texts, they seem very much like islands of development. There should be text describing the two WBS schedules and milestones, and they should align and complement each other. That may not be part of this text, but somebody should put together the big picture that ties it all together. Also, I think there should be some LOE tied to the various tasks. For example, is the development of the OPeNDAP mechanism into iRODS/SRB a two-week effort or a two-year effort? It's hard to tell whether this is all feasible because no times are listed with the development tasks. Also, the diagrams make sense to me because I know the technologies, but at a PDR level there really is no text in here that describes how iRODS/SRB/SSDS will work together. Table 2 helps, but doesn't really address Figures 2 & 3. Not sure about this. -Raja. What should we do? How can we map from one to another? We didn't discuss Releases 2 and 3 in our meeting because of time constraints. Can the items for these releases be given as a chart/table?]