Long Term Ecological Research Network Office
EcoTrends
Cyber-infrastructure
Development
Mark Servilla, LTER Network Office
LTER Information Managers Annual Meeting – San Jose, California, 2–5 August 2007
Building Blocks to Success
• EcoTrends NIS module
• PASTA NIS Module Framework
• Metacat/EML metadata and data management
• PostgreSQL RDBMS
• Java Servlet, JSP, and R programming
• Community support for data collection, documentation, and accessibility
Diagram labels: EcoTrends; PASTA; Metacat/EML; PostgreSQL/Java/Tomcat; Community
LNO NIS
PASTA Architecture
Diagram components: Source A, Source B, Source C; Metacat-Harvester; EML; Parser-Loader; Dataset Registry; Cache; Workflow Engine; Metadata; Derived Data; Web API (HTML, SOAP, EML.xml)
• Existing LTER metadata infrastructure (Metacat and EML)
• Data loading for synthetic processing based on events (e.g., new data, metadata change)
• Source data cache available to all workflow engines
• Support for multiple scientific workflow engines (e.g., R script, Kepler, Chimera, D2K)
• Metadata and derived data products; metadata as EML
• Standard interfaces to support various web portals (e.g., Trends, GEOSS, GEON, NEON, WATERS) and web service APIs
• Metadata describing derived data, including data provenance and data versioning
– expand on community provenance research
Diagram legend: Site data/metadata; Derived data management; Existing infrastructure; New infrastructure; Pluggable work flows
EcoTrends Development 2007
Diagram components: Source A, Source B, Source C; Metacat-Harvester; EML; Parser-Loader; Dataset Registry; Cache; Workflow Engine; Metadata; Derived Data; Web API (HTML, SOAP, EML.xml)
Development Process
Diagram stages: Use-case; Project Plan; Requirements; Coding; Testing; Release Milestones
Iterative solutions
Diagram participants:
– Editorial and technical committees, and LNO
– Technical committee, NISAC, and LNO
– Editorial and technical committees, and LNO
Major Milestones
• EML generation
• Derived data loading
• Website presentation/integration
• Data discovery and presentation
– Browse (by site, by topic/sub-topic)
– Search (simple keyword, advanced)
– Result (result set display, dataset display, plot display)
• Data exploration
– Graphing (single and multiple datasets)
– Aggregation (temporal)
– Download (data and metadata)
• Site auditing/DAS
EML Generation
Step 1: Core Metadata
– Define core metadata (e.g., contact information) that is repeated in all EML documents
Step 2: File Name Parsing
– Parse the derived data file names for site/station, variable, unit, and timescale metadata
Step 3: Derived Data Analysis
– Analyze derived data for temporal coverage and data value bounds
Step 4: R Script Analysis and Inclusion
– Include in the methods section of EML the R script used to generate derived data and any annotation associated with a specific derived data product
Step 5: Manual Documentation
– Include both non-automated metadata and tacit knowledge metadata into the EML
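Step 3 above (temporal coverage and data value bounds) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a derived data file is CSV text with START_DATE, END_DATE, and OBS columns holding ISO dates and numeric values; the actual EcoTrends file layout and the `summarize` helper are not described in these slides and are hypothetical.

```python
import csv
import io
from datetime import date


def summarize(csv_text):
    """Return temporal coverage and value bounds for one derived data file.

    Assumes (hypothetically) CSV columns START_DATE, END_DATE, and OBS.
    Rows with an empty OBS are ignored when computing the value bounds.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    starts = [date.fromisoformat(r["START_DATE"]) for r in rows]
    ends = [date.fromisoformat(r["END_DATE"]) for r in rows]
    values = [float(r["OBS"]) for r in rows if r["OBS"] != ""]
    return {
        "begin": min(starts),   # earliest period start
        "end": max(ends),       # latest period end
        "min": min(values),     # lower bound of observed values
        "max": max(values),     # upper bound of observed values
    }


# Made-up sample data, not taken from any LTER site.
SAMPLE = """START_DATE,END_DATE,OBS
2001-01-01,2001-12-31,4.2
2002-01-01,2002-12-31,
2003-01-01,2003-12-31,3.1
"""

coverage = summarize(SAMPLE)
```

The resulting dictionary holds the kind of values that could populate the temporal coverage and attribute bounds elements of a generated EML document.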
Derived Data Loading
• Parse data and load relational database
• Record level attributes
– PRIMARY_KEY :: INTEGER
– START_DATE :: DATESTAMP
– END_DATE :: DATESTAMP
– OBS :: FLOAT
– N_EXPECTED :: INTEGER
– S_DEV :: FLOAT
– S_ERR :: FLOAT
– PROP_MISSING :: FLOAT
– PROP_QUESTIONABLE :: FLOAT
– PROP_ESTIMATED :: FLOAT
– PROP_TRACE :: FLOAT
– PROP_INVALID :: FLOAT
– COMMENT :: TEXT
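As a rough sketch of what loading records with these attributes might look like, the Python example below creates a table with the listed columns and inserts two made-up rows. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere; the table name `derived_record`, the mapping of DATESTAMP to ISO date text, and the `load_records` helper are assumptions made for illustration, not the EcoTrends schema.

```python
import sqlite3

# Column names follow the record-level attributes listed above.
# SQLite has no date type, so dates are stored as ISO-8601 text here.
DDL = """
CREATE TABLE derived_record (
    primary_key        INTEGER PRIMARY KEY,
    start_date         TEXT NOT NULL,
    end_date           TEXT NOT NULL,
    obs                REAL,
    n_expected         INTEGER,
    s_dev              REAL,
    s_err              REAL,
    prop_missing       REAL,
    prop_questionable  REAL,
    prop_estimated     REAL,
    prop_trace         REAL,
    prop_invalid       REAL,
    comment            TEXT
)
"""


def load_records(conn, records):
    """Create the table and insert (start, end, obs, n_expected, prop_missing, comment) tuples."""
    conn.execute(DDL)
    conn.executemany(
        "INSERT INTO derived_record "
        "(start_date, end_date, obs, n_expected, prop_missing, comment) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        records,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_records(conn, [
    ("2001-01-01", "2001-12-31", 4.2, 365, 0.0, None),
    ("2002-01-01", "2002-12-31", None, 365, 1.0, "no observations"),
])
row_count = conn.execute("SELECT COUNT(*) FROM derived_record").fetchone()[0]
```

Parameterized inserts keep the parsed values separate from the SQL text, which matters when the values come from files supplied by many sites.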
Website Presentation
• Initial design and development
– EcoTrends editorial committee
– Electric Sage Designs, LLC
– Laura Downey, Usability Engineer, SEEK Project
Website Integration
Stage 1: Apache, PHP, CSS, Javascript, and MySQL
– Refactor: refactor original website to reflect consistency and modularity; modify CSS for application-specific design (e.g., table layout)
Stage 2: Apache, PHP, CSS, Javascript, and MySQL
– Refactor: convert all PHP functionality to equivalent Java Server Page (JSP); integrate Metacat-based content
Stage 3: Tomcat, Servlet, JSP, CSS, Javascript, and Metacat
Data Exploration
• Graphing (single and multiple datasets)
• Aggregation (temporal)
Site Auditing/DAS
• Web page auditing
• Data access auditing
• Plot auditing
PASTA Application Stack
Stack layers (diagram labels):
• EML, Metacat, and Harvester
• Registry/Parser/Loader
• Workflow Engine
• Cache Database
• Derived Database
• Metadata Harvest
• Web API - Portal
• Site Data and EML Metadata
Layer roles (diagram labels):
• Network-level synthesis
• Site-level data archive
• Data transformation and integration
• Standardized data products
• Dataset identification and loading
• Network interface
• Existing EML harvesting
Diagram axis: Scale (temporal-spatial-organizational)
Generalized Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, …)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into “cache” database
5. Workflow Engine transforms “cache” data into “derived” data
6. Transformed data is stored in “derived” database
7. EML is generated for derived data and is stored in Metacat
Decomposed Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, …)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into “cache” database
5. Workflow Engine transforms “cache” data into “derived” data
6. Transformed data is stored in “derived” database
7. EML is generated for derived data and is stored in Metacat
LTER Site Data Collection
• Time-series data
– Physical environment (e.g., climate, …)
– Human population and economy
– Biogeochemistry
– Biotic structure
• Data/metadata
– Relational Database
– Spreadsheet
– Text file
– HTML/XML
EML, Metacat, and the Harvester
• EML Package ID: knb-lter-site.XX.YY (e.g., knb-lter-sev.354.1, knb-lter-sev.354.2, knb-lter-sev.354.3)
• Metacat stores the XML of EML; new revisions take precedence; old revisions are deprecated, but not deleted
• Harvester is a time-based update process that “pulls” site EML and inserts it into Metacat
• “existing LTER investment in technology”
Diagram components: Source A, Source B, Source C; Metacat-Harvester; EML
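The package ID convention shown on this slide (scope, then identifier, then revision, separated by dots) lends itself to a small parsing sketch. The Python below splits an ID into its three parts and picks the highest revision per dataset, mirroring the statement that new revisions take precedence. The `PackageId` type and both helper functions are hypothetical names for illustration; they are not part of Metacat or the Harvester.

```python
import re
from typing import NamedTuple


class PackageId(NamedTuple):
    scope: str       # e.g. "knb-lter-sev"
    identifier: int  # e.g. 354
    revision: int    # e.g. 3


_PATTERN = re.compile(
    r"^(?P<scope>[A-Za-z0-9-]+)\.(?P<identifier>\d+)\.(?P<revision>\d+)$"
)


def parse_package_id(text):
    """Split 'scope.identifier.revision' into a PackageId."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a package ID: {text!r}")
    return PackageId(
        match["scope"], int(match["identifier"]), int(match["revision"])
    )


def current_revisions(package_ids):
    """Map (scope, identifier) to the PackageId with the highest revision."""
    latest = {}
    for pid in map(parse_package_id, package_ids):
        key = (pid.scope, pid.identifier)
        if key not in latest or pid.revision > latest[key].revision:
            latest[key] = pid
    return latest


latest = current_revisions(
    ["knb-lter-sev.354.1", "knb-lter-sev.354.3", "knb-lter-sev.354.2"]
)
```

Comparing revisions as integers rather than strings avoids ordering revision 10 before revision 9.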
EML Loader/Parser
• Dataset registry identifies Trends data in Metacat
• New revisions assert a “new” data load. The EML parser/loader*
– Translates the site EML into the RDBMS DDL
– Creates a new DB table in the primary database based on the revision
– Loads the new data into the primary database
– Trigger to continue workflow
Diagram components: Source A, Source B, Source C; Metacat-Harvester; EML; Parser-Loader; Dataset Registry; Cache
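The translation from site EML to RDBMS DDL can be sketched roughly as follows. This Python example does not parse real EML; it assumes the attribute names and storage types have already been extracted into a simple list, and it derives a per-revision table name by replacing punctuation in the package ID with underscores. The `TYPE_MAP`, the function names, and the pairing of the example package ID with wind columns are illustrative assumptions, not the actual parser/loader.

```python
import re

# Hypothetical mapping from simplified attribute types to PostgreSQL types.
TYPE_MAP = {
    "dateTime": "TIMESTAMP",
    "float": "DOUBLE PRECISION",
    "integer": "INTEGER",
    "string": "TEXT",
}


def table_name(package_id):
    """Turn 'scope.identifier.revision' into a table name, one per revision."""
    return re.sub(r"[^A-Za-z0-9]", "_", package_id).lower()


def create_table_ddl(package_id, attributes):
    """Build a CREATE TABLE statement from (name, type) attribute pairs."""
    columns = []
    for name, kind in attributes:
        # Reject anything that is not a plain SQL identifier.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"unsafe column name: {name!r}")
        columns.append(f"    {name} {TYPE_MAP[kind]}")
    body = ",\n".join(columns)
    return f"CREATE TABLE {table_name(package_id)} (\n{body}\n);"


# Illustrative attribute list; not taken from a real EML document.
ddl = create_table_ddl(
    "knb-lter-sev.354.3",
    [("date_time", "dateTime"), ("wspd", "float"), ("wdir", "float")],
)
```

Because the revision is part of the table name, each new revision yields a new table, consistent with creating a table "based on the revision".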
Workflow Data Transformation
• “Cache” database stores site data in native site schema and based on snap-shot version
• Workflow Engine
– reads native schema
– performs transformation/integration
– writes to global schema
– produces EML metadata
• “Derived” database stores derived data in consistent global schema
Diagram components: Workflow Engine; Cache; Metadata; Derived Data
Site to Global Schema Mapping
Site schema: MCM Canada Glacier Wind
  Column      Description                            Unit / interval
  date_time   Timestamp of observation               15 min interval
  wdir        Wind direction (azimuth)
  wdirstd     Standard deviation of wind direction
  wspd        Wind speed                             meters/second
  wpsdmin     Minimum wind speed                     meters/second
  wspdmax     Maximum wind speed                     meters/second

Global Schema (“triggered by data load”)
  Derived product        Package ID            Columns
  Wind direction         knb-eco-trends.1.1    Timestamp (daily), value
  Wind direction std dev knb-eco-trends.2.1    Timestamp (daily), value
  Wind speed max         knb-eco-trends.5.1    Timestamp (daily), value
  …

Derived table name: knb_eco_trends_1_1 (diagram labels: scope, identifier)
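One way to picture the mapping from the 15-minute site schema to a daily global schema is a simple per-day aggregation. The Python sketch below reduces (timestamp, value) pairs to one maximum per calendar day, as might be done to turn wspdmax readings into a daily "Wind speed max" product. The slides do not state the actual aggregation rules, so the use of a plain maximum, the skipping of missing values, and the `daily_max` helper are all assumptions.

```python
from collections import defaultdict
from datetime import datetime


def daily_max(observations):
    """Aggregate (timestamp, value) pairs to one maximum per calendar day.

    Missing values (None) are skipped; days with no valid value are omitted.
    """
    by_day = defaultdict(list)
    for stamp, value in observations:
        if value is not None:
            by_day[stamp.date()].append(value)
    return [(day, max(values)) for day, values in sorted(by_day.items())]


# Made-up 15-minute readings, not real MCM data.
site_rows = [
    (datetime(2007, 1, 1, 0, 0), 3.5),
    (datetime(2007, 1, 1, 0, 15), 7.25),
    (datetime(2007, 1, 1, 0, 30), None),
    (datetime(2007, 1, 2, 0, 0), 2.0),
]

derived = daily_max(site_rows)
```

Each output pair corresponds to one row of a derived table with "Timestamp (daily)" and "value" columns. A directional variable such as wdir would need circular statistics rather than a plain maximum or mean.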
EML for “derived” data
• Workflow Engine produces EML metadata for the derived data and inserts it into Metacat
• Derived data is now accessible through “all” Metacat user interfaces
Diagram components: Metacat-Harvester; EML; Workflow Engine; Metadata; Derived Data; EML.xml
Web API
• Store Front provides API to derived data products in secondary DB
• HTML – today
• Web service – tomorrow
• Issues:
– Authentication
– Authorization
– Provenance
– Quality
– Interactive Plots
http://www.ecotrends.info (beta site location)
Diagram components: Metacat-Harvester; EML; Metadata; Derived Data; Web API (HTML, SOAP, EML.xml)