Data Storage Model - R2E1-DMG

1.2.3.11 S Data Management The Data Management Subsystem is responsible for providing life cycle management, federation, preservation and presentation of OOI data holdings and associated metadata via data streams, repositories and catalogs. See: https://confluence.oceanobservatories.org/display/syseng/Release+Construction+Plan+Overview https://confluence.oceanobservatories.org/display/CIDev/Product+Description+Release+2 Public source repos at https://github.com/ooici. More docs at https://confluence.oceanobservatories.org/display/syseng/CIAD+DM+Data+Management

1.2.3.11.2.22.2 SV Dynamic Data Distribution Services - DM R2 Elaboration

1) Provides publish/subscription services for routing the different types of data messages. 2) Provides request/response services that enable users and services to query for and retrieve data messages. 3) In combination with the CEI Processing Service (https://confluence.oceanobservatories.org/display/syseng/CIAD+CEI+OV+Process+Execution+Management), this can drive a policy decision to execute a process ---- [WAS: Provides publication, subscription, and query services associated with variant and dynamic data resources. This is the DM Distribution service. Used in combination with the CEI Processing Service to drive the policy decision to execute a process.]

CO 1 Define content and format needed to describe messages DEPENDENCIES: https://confluence.oceanobservatories.org/display/CIDev/UC.R2.13+Acquire+Data+From+Instrument, https://confluence.oceanobservatories.org/display/CIDev/UC.R1.06+Distribute+Data+Product 4 https://confluence.oceanobservatories.org/display/CIDev/DM+Data+Type+Representations GIT like version in the message? Consumed by internal OOI programs, won't be used outside. JBG: What would it mean for a message to have a version 2? I think messages are atomic and don't get versioned. (Is this a reference to annotations creating new resources?)

IT 0 R2E1 Prototype data streaming Based on the mechanisms of pubsub and the Exchange, prototype describing, registering and using data streams such that arbitrary consumers can (a) find and (b) subscribe to data streams. Consumers get a certain metadata context before actual real-time streaming would occur.

X 4 5 dstuebe Architecture Team + Team Leads https://confluence.oceanobservatories.org/display/CIDev/Prototype+data+streaming
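The find/subscribe flow described for this prototype could be sketched roughly as follows. This is only an illustrative in-memory model, not the R2 pubsub service or the Exchange: the names `Stream`, `StreamRegistry`, `register`, `find`, `subscribe` and `publish` are hypothetical, and the metadata keys are invented for the example. It shows a consumer locating a stream by its description and receiving the metadata context before any data message arrives.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Stream:
    """A registered data stream and its descriptive metadata (hypothetical model)."""
    stream_id: str
    metadata: Dict[str, str]
    subscribers: List[Callable[[dict], None]] = field(default_factory=list)


class StreamRegistry:
    """In-memory stand-in for describing, registering and using data streams."""

    def __init__(self) -> None:
        self._streams: Dict[str, Stream] = {}

    def register(self, stream_id: str, metadata: Dict[str, str]) -> Stream:
        stream = Stream(stream_id, dict(metadata))
        self._streams[stream_id] = stream
        return stream

    def find(self, **criteria: str) -> List[str]:
        """(a) Find streams whose metadata matches every given key/value pair."""
        return [
            s.stream_id
            for s in self._streams.values()
            if all(s.metadata.get(k) == v for k, v in criteria.items())
        ]

    def subscribe(self, stream_id: str, callback: Callable[[dict], None]) -> None:
        """(b) Subscribe; the metadata context is delivered before any data."""
        stream = self._streams[stream_id]
        callback({"type": "metadata", "stream_id": stream_id,
                  "metadata": dict(stream.metadata)})
        stream.subscribers.append(callback)

    def publish(self, stream_id: str, payload: Any) -> None:
        """Deliver one data message to every subscriber of the stream."""
        for callback in self._streams[stream_id].subscribers:
            callback({"type": "data", "stream_id": stream_id, "payload": payload})


if __name__ == "__main__":
    registry = StreamRegistry()
    registry.register("ctd-1", {"parameter": "temperature"})
    received = []
    for sid in registry.find(parameter="temperature"):
        registry.subscribe(sid, received.append)
    registry.publish("ctd-1", 12.5)
    print([message["type"] for message in received])
```

In a real deployment the registry and delivery would sit on the messaging infrastructure; the sketch only fixes the order of events (describe, find, subscribe with context, then stream).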

IT R2E1 Prototype Science Data Processing: Part 1

Prototype HDF as a means of transport for science data X 4 4 dstuebe Tlennan https://confluence.oceanobservatories.org/display/CIDev/R2+Scientific+Data+Model+Development+Support+and+Java+Prototyping

IT R2E1 Prototype Science Data Processing: Part 2

Prototype YAML objects which are locators in an HDF Data System X 4 4 dstuebe Tlennan https://confluence.oceanobservatories.org/display/CIDev/Prototype+Science+Data+Representation

IT R2E2 Define the types of data messages

The DM needs to deal with messages of all types: events, data, metadata, queries, etc. Find and describe the appropriate level of abstraction for these messages to support their detailed definition in later iterations. Review work done to date on message types and evaluate its suitability.

Outline of current and additional message types and definitions required to meet R2 deliverables

X 4 3 dstuebe Moved to E1 as messages can be more accurately defined when we know more about the services.

https://confluence.oceanobservatories.org/display/syseng/CIAD+DM+SV+Notifications+and+Events

IT R2E3 Define the content of the data messages

Confirm the circumstances under which data messages: 1) contain data in a standardized (internal canonical) format, 2) are represented as raw data (e.g., output from an instrument/instrument service), 3) contain notification about data available via some external interface. Evaluate existing content formats against known requirements.

Draft of R2 message types, scenarios for use and content of each

X 2 dstuebe MBARI Moved to E1 as messages can be more accurately defined when we know more about the services.

IT R2E3 Define the headers of the data messages

Specify how the FIPA headers are read and processed (COI) and how priority can be handled in transformation (DM) services.

One page design draft for DM data headers

X 2 dstuebe Moved to E1 as messages can be more accurately defined when we know more about the services.

CO 2 Data distribution control Capabilities to manage the routing underlying the data distribution services.

4 COI messaging. Need to understand the core messaging before we can address this. What do we mean by 'underlying data distribution services'? MManning - Architecture presentations being arranged for design teams.

IT R2E2 Refactor R1 pubsub service to define data streams.

The R1 pubsub service has CRUD operations for topics and queues. Refactor the service to manage streams and their definitions.

A new pubsub service with the correct object model and service interface for R2.

X 5 5 dstuebe Architecture team

IT R2E2 Define data distribution policies and services

Define default/initial policies and service functional interface for distribution

X

IT Implement tools for users to input data distribution rules

This tool would allow users to define routes, and policies on those routes, for distributing data throughout the network

How does the user visualize and adjust the routing and priority? Like running a railroad? Is this Subscription?

IT Implement routing in message bus

Somehow the messaging backbone must be able to look up user-defined distribution routing and policy definitions and implement them in the messaging bus.

This is a COI function!

1.2.3.11.2.23 SV OOI Common Data and Metadata Model - DM R2 Elaboration

Extends the canonical data and metadata model for the Integrated Observatory with respect to associations (provenance, lineage, versions) as well as semantic interpretation, and enables the transformation to and from the canonical data format.

https://confluence.oceanobservatories.org/display/CIDev/Versioning%2C+Provenance%2C+and+Related+Concepts lineage == provenance. Depends on 1.2.3.11.1.2. JG: Example: NetCDF is decomposed into components (CDM) and attributes (semantic elements). Data streams do not have version, but they do have provenance.

SV

CO 3 Information resource associations model

The data model as resources and associations for information resources and descriptions of provenance, citation, lineage and versions.

5 Information about a resource, as well as how resources are related. Is it part of this use case: https://confluence.oceanobservatories.org/display/CIDev/UC.R2.20+Annotate+Resources. MManning - Yes

IT R2E2 Define what resources are to be associated

Provenance, lineage (a subset of provenance), versions, and citations are all characterized using annotations on resources. These annotations are likely created using associations of particular types (essentially the relations describing the resources). It is also possible to link two resources using annotations. Here we need to define what resources can be annotated with what kind of associations.

List of current and R2 resources which can be annotated, and what kind of associations apply to each.

X 5 2 mmanning The simplest case is to associate one file with another. But can cached messages have provenance, versions? JBG: A resource of any type can be associated to other resources. (Not sure what is meant by 'cached messages' here.) Can we have a scenario where a version subtype is used?

CO 4 Information resource versioning

Capabilities to define, identify and retrieve different versions of a data set.

4 https://confluence.oceanobservatories.org/display/CIDev/UC.R2.22+Version+Resource

Will need to have a consistent understanding of the concept of 'version'.

IT 4 1 R2E2 Define versioning policy Define what products, resources and other entities will be versioned, policies for each, a service design to support this capability, and a prototype implementation using CouchDB

X 3 2 dstuebe https://github.com/twitter/snowflake

IT R2E2 Investigate existing version control tools

Investigate existing version control tools (Hg, Git, Alfresco). Determine if they, or their model, are applicable to managing the versions of information resources in this system. Compare the existing use of the Git model in this process.

Short analysis of gaps in existing model and how version control tools address those gaps.

X 4 3 dstuebe

CO Data transformation service

Capability to register transformation process definitions and to execute them based on the data distribution services.

oceanobservatories.org/display/CIDev/UC.R2.21+Transform+Data+in+Workflow

IT 0 R2E2 Prototype process execution with CEI

Prototype executing a pre-registered algorithm within an existing execution engine (e.g. Matlab script inside Matlab, or C program controlled by CC), connected to input queue and producing into output queue X 4 mmanning Architecture Team + Team Leads https://confluence.oceanobservatories.org/display/CIDev/Prototype+process+execution

IT R2E2 Create a process definition

Define a framework for the creation of a process resource 1-2 page transformation service design

X 3 5 David Chris, Maurice

IT R2E2 Prototype transformation processing

Execute a transformation. Prototype of how a transformation is defined, executed and documented

X 4 4 dstuebe Chris, Adam Include handling of data transforms that may take a relatively long time to process.

IT R2E1 Create Data transformation service skeleton

Create the service with initial methods and interfaces (message definitions). Note responsibilities of each method.

X mmanning

IT R2E3 Implement basic Data transformation service capabilities

Complete capabilities required to meet LCA deliverables. X

WBS Type CO-ID IT-ID Iteration Task Description Outcome of task (deliverables) Risk (COs) R2E1-DM R2E2-DM R2E3-DM Jira Task# Priority Work days Assignee Support Roles Description Page URL

IT R2E2 Investigate using OGC or other standard process definition

Review the OGC and other standards for applicability and gaps. One page overview of external workflow engines

X 3 2 mmanning Chris, DavidS In particular, WPS, and deegree implementation

CO Data transformation process repository

Register, retrieve and apply data format transformation processes. 3

IT R2E2 Define storage mechanism for process definitions

Where and how will the process definitions be stored? How will processes be versioned, tracked and governed?

Define transform resource metadata and associations, etc.

DavidS

IT R2E3 Create a framework and API that allow users to easily annotate their transformations with metadata.

This should be some sort of library or API with which users can instrument their own transformations so that, when a transformation runs, the process and provenance descriptions can be captured by the system into the repository. This could be a service that is called by the transformation process as it is running.

DavidS
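One way such an instrumentation library might look is a decorator that a user places on a transformation function. The sketch below is an assumption-laden illustration, not a proposed ION interface: `record_provenance`, `PROVENANCE_LOG` and the record fields are hypothetical names, and a plain list stands in for the repository (or for the service that would be called while the transformation runs).

```python
import functools
from datetime import datetime, timezone

# Hypothetical stand-in for the process/provenance repository.
PROVENANCE_LOG = []


def record_provenance(description, log=PROVENANCE_LOG):
    """Decorator that captures a provenance record each time a transformation runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc)
            result = func(*args, **kwargs)
            # Recorded only after the transformation completes successfully.
            log.append({
                "process": func.__name__,
                "description": description,
                "inputs": {"args": args, "kwargs": kwargs},
                "started": started.isoformat(),
                "finished": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@record_provenance("Convert temperature from Celsius to Kelvin")
def celsius_to_kelvin(values):
    return [v + 273.15 for v in values]


if __name__ == "__main__":
    celsius_to_kelvin([0.0, 10.0])
    print(PROVENANCE_LOG[0]["process"], PROVENANCE_LOG[0]["description"])
```

The user's transformation code stays unchanged apart from the decorator line, which is the main appeal of this style; a service-call variant would replace the `log.append` with a request to the repository.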

CO Semantics resource model The data model as resources and associations representing vocabularies, ontologies, inferencing related to resources

4

IT R2E3 Prototype bridging of vocabularies

Create mappings to/from exemplar vocabularies from/to the OOI common data model and services to make those conversions

X

CO Vocabulary repository service

Register and access domain-specific vocabularies; update and extend vocabularies. Naming and versioning of vocabularies. Unique identification of terms and vocabularies with their versions.

4

IT R2E2 Create Vocabulary repository service skeleton

X

IT R2E3 Implement basic Vocabulary repository service capabilities

CO Information resource provenance

Capabilities to define, identify and retrieve data sets related by provenance to other data sets.

4

IT R2E2 Create provenance and workflow schema

Create a schema that contains the necessary attributes to track provenance and workflow. Reflect SSDS and other existing provenance resources.

Draft schema that identifies critical attributes

X 4 5

IT R2E2 Define data set retrieval Need clarification on what it means to retrieve a dataset. Consider the possibility of storing a function/workflow and a reference and retrieving the result of the workflow. (Check out of version control, FTP, OpenDAP, file download via http/s, etc.) Describe the highest-priority retrieval protocols to implement.

Short analysis on data set retrieval capabilities and challenges

X 5 3 dstuebe

IT R2E2 Define versioning scheme/mechanism

What are versioning requirements for vocabularies? What schemes are supported by the storage mechanisms reviewed above? Propose a versioning scheme to satisfy ION needs.

One page proposal for versioning vocabularies, needs met and unanswered challenges

X 3 Different types of vocabulary updates: add/delete/update terms; add/delete/update relationships; add/update term metadata

IT R2E2 Survey existing tools that provide SKOS and OWL services, characterize their interfaces, and define use cases enabled by them

Research the semantic tools that provide services beyond simple storage of vocabularies/ontologies.

Short analysis for tools reviewed and recommendations

X 4

IT R2E3 Implement or adopt a Python SPARQL wrapper

X e.g. http://sparql-wrapper.sourceforge.net/

This presumes SPARQL is the search service that's selected above.

IT R2E2 Integrate vocabulary services with model of associations/annotations

Provide the glue between the CI model for associations, and the services and information presented by the vocabulary repository. Includes recreating vocabularies for internal use from a vocabulary repository, using associations that are provided by the vocabulary repository, and publishing ION associations to a vocabulary.

X how shall we govern vocabularies - if they are built bottom-up from associations? jg: this would be done by limiting the mechanisms for users creating those associations that can be turned into vocabularies. (All associations should be publishable as RDF, but not all associations should be turned into vocabularies.)

IT R2E3 Crowdsourcing strategy for vocabulary development and maintenance

Describe how expertise within and beyond the OOI community can be used to develop vocabularies and mappings. Identify the OOI needs that will not be met without using this approach.

One page description of proposed approach, plus a scenario.

X 3 graybeal kstocks, Ilya also a governance issue - need use case

IT R2E3 Identify or create vocabularies for several domains.

For the domains of features, parameters, functional sensor type, platform type, observing medium, data quality, and institutions: review existing alternatives and select from them or define new ones

List of reviewed and potential vocabularies and needs met.

X 5 what are the vocabularies we actually need at the beginning? - from use cases. Platform type, institution name would be 2 particularly useful ones; functional sensor type and parameter are next.

IT R2E3 Create or select a hierarchical vocabulary, to demonstrate navigation/search

Identify a topic that can use a hierarchical vocabulary to inform semantic searches, then find an existing vocabulary for that topic, or create one. Edit the vocabulary as needed to support effective search results.

Vocabulary, maintained in a controlled repository.

X Need to define how this hierarchy will be managed. Will there be multiple hierarchies, e.g. custom to users?

IT R2E3 Determine discovery keyword scheme

Consider existing (e.g. GCMD), pick one, map with others. X

IT R2E3 Mapping between vocabulary terms and values

For the mappings, need to define storage model, querying, versioning One example mapping and guidelines for additional mappings

X what does 'between terms and values' mean? if it means what I think it means, I don't think we want it as a task. jbg

CO 5 Ontology repository service

Register and retrieve ontological representations, representing universal domain knowledge and connecting specific vocabularies. Define a standard ontology language and representation format.

3 Same as 'Vocabulary repository service'. (At least, they can be implemented with the same tools.) Might or might not be similar enough to treat them identically.

No Task

CO 6 Semantics UI components Screens and plug-ins to the Web UI and application integration services for using and managing ontologies, vocabularies and related semantics-based functions. Definition of vocabularies, ontologies and mappings

1

IT R2E2 Engage UX team w.r.t. semantic UI components

Discuss search and navigation capabilities; how semantic integration services and available semantic UI technologies may be leveraged.

Meetings with UX team to describe internal processing and interface to search service. Definition of key semantic-related UI components (screens, workflows, etc).

X 2 mmanning Susanne, Carolanne, kstocks, John

1.2.3.11.2.24 SV Persistent Archive Services - DM R2 Elaboration

Extensions of the persistent archive services provide cataloging, validation & curation to organize, persist and maintain data holdings with their associated metadata for an individual, group and/or community.

org/display/CIDev/Persistent+Archive+Storage+Architecture Q: Is the data transformed to engineering units upstream? Or do we store native output of instruments? Or is it a mix of both? REFS: https://confluence.oceanobservatories.org/display/syseng/CIAD+DM+OV+Preservation https://confluence.oceanobservatories.org/display/syseng/CIAD+DM+SV+Persistence+Architecture SV

CO 7 Persistent archive policy Capabilities for the definition of storage constraints, replication policies, access policies and their application.

5 IRODS is a strong leading candidate. Do we leverage IRODS policy? But we want an abstraction layer in case we don't use IRODS.

CO 8 IRODS persistent archive Implementation of persistent archive using the iRODS technology 4

IT R2E2 Investigate other archives Include investigation of federation issues, update use cases and abstraction layer

Short analysis of potential archive technologies for consideration

X 4 dstuebe If you are already using EC2 for processes, why are we not looking at using cloud storage for archival purposes? See also "Persistent archive replication service" and "long-term data archive service" components for related tasks. Where does the National Archives requirement play?

IT R2E2 Prototype persistent archive services using CouchDB, MongoDB, Zookeeper - and compare performance

Prototype an alternate data object model and investigate performance gains with the alternate DBs (also see "Get requirements on latency" task)

X 4 https://confluence.oceanobservatories.org/display/CIDev/Datastore+and+Association+Service+performance+testing Need this to address data store and registry performance issues

IT Implement IRODS persistent archive

Implementation of persistent archive using the iRODS technology dstuebe If it is the chosen archive.

CO 9 Persistent archive replication service

Capabilities to replicate, based on policy, the content of persistent archives to distributed locations or other persistent archives.

oceanobservatories.org/display/CIDev/UC.R2.27+Manage+Replicated+Archive

IT R2E2 Investigate how other archives handle replication

As part of the investigation of archival technologies, investigate how they handle replication.

Short analysis of potential archive technologies and/or capabilities for consideration. Propose a design to support R2 requirements.

X 4 dstuebe

IT R2E2 Create service skeleton Create the service with initial methods and interfaces (message definitions). Note responsibilities of each method.

X

IT R2E3 Implement basic service capabilities

CO 10 Data caching service Capabilities to manage and operate short term caches of information in the network, with configurable cache fill and replacement policies. Caches may be geographically proximate to locations of use, or hold data of recurring interest.

4 What's the purpose of the caching? What are we caching? If the latency between fetch, transform and delivery from base storage is low do we need caching? What is acceptable latency? MManning - these questions can be part of scoping of this task.
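To make "configurable cache fill and replacement policies" concrete during scoping, one possible shape is sketched below. It is an assumption for discussion only: the class name `PolicyCache`, the choice of least-recently-used replacement, and a single time-to-live expiry are illustrative picks, not decisions recorded in this plan.

```python
import time
from collections import OrderedDict


class PolicyCache:
    """Small cache with two configurable policies (illustrative choices):
    replacement: evict the least recently used entry when full;
    expiry: drop entries older than ttl_seconds when they are read."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock                # injectable so expiry can be tested
        self._entries = OrderedDict()      # key -> (stored_at, value)

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)     # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key, default=None):
        item = self._entries.get(key)
        if item is None:
            return default
        stored_at, value = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]         # expired under the expiry policy
            return default
        self._entries.move_to_end(key)     # a hit refreshes recency
        return value


if __name__ == "__main__":
    cache = PolicyCache(max_entries=2, ttl_seconds=60.0)
    cache.put("granule-a", b"...")
    cache.put("granule-b", b"...")
    cache.get("granule-a")
    cache.put("granule-c", b"...")         # evicts granule-b
    print(cache.get("granule-b"))
```

Whether such a cache is needed at all still depends on the latency questions raised above; the sketch only shows that size limit and expiry can be kept as plain configuration values.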

IT R2E2 Design the caching mechanism

Detailed design for local data cache X

IT R2E3 Define caching policies Draft initial caching policies; types of data to be cached, expiry policy, update policy

X

IT R2E2 Create Data caching service skeleton

X

caching service capabilities

CO 11 Long-term data archive service

Manage information in dark or offline archives. Maintain information integrity. Provide estimates for data retrieval time. Move data to and from archive to online repository. Notify data requester of completed retrieval.

3 This is within IRODS capabilities. How is this different from 68?

IT R2E2 Investigate National