Parallel storage, mining and visualization
of environmental data archives
Mikhail Zhizhin, Dmitry Medvedev, Alexey
Poyda, Dmitry Mishin and Sergei Berezin
Space Research Institute and
Geophysical Center
Collaboration with Microsoft Research
• 2006-2009 Environmental Scenario Search Engine (ESSE)
– Site: Geophysical Center, Moscow State University and MSR Cambridge
– PIs: Mikhail Zhizhin (RAS), Eric Kihn (NOAA) and Vassily Lyutsarev (MSRC)
– PhDs: Alexey Poyda (Moscow State University), Dmitry Mishin and Dmitry Medvedev (RAS)
– Summary: ESSE is an interactive search engine for data mining in environmental data archives. Unlike conventional text-based search engines, it searches inside the numeric datasets, using fuzzy logic to describe transitions between environmental states
• 2007-2009 Climate Induced Vegetation Change Analysis Tool (CLIVT)
– Site: Space Research Institute and MSR
– PIs: Eugeny Lupian and Mikhail Zhizhin (RAS)
– PhDs: Maria Medvedeva, Alexey Poyda, Dmitry Medvedev, Dmitry Voytsehovsky
– Summary: In the CLIVT project we bring together large archives of satellite images and historical data on vegetation and climate for the territory of Northern Eurasia and develop a new technique to study relations between ecosystem dynamics and climate change
Joint Research Center IKI-MSR
in Moscow
• Framework Agreement signed at Moscow State University on March 17th, 2009
• Administrative structure and research project agreements will be elaborated in June 2009
• Main directions of research:
– The Parties wish to collaborate on research concerning global change of climate, ecology and space environment in their interrelation, which will require satellite and ground-based sensor observations together with data intensive high-performance computing for environmental
What are the challenges
• Repetitive tasks to design (very large) databases for new data products. Interactive access times for any projection of the data array
• Never delete/overwrite data, lineage-accreditation-quality-type flags
• Multilayer (catalog, inventory, order, process) distributed metadata storage [STANDARD]
• Semantically rich common data model [STANDARD] and query language [STANDARD] for (environmental) scientific datasets
• Functionally rich data services supporting data extraction, processing and mining implemented at the data server [STANDARD]
• Distributed algorithms to balance network/database load
• Data export – modeling – visualize – ingest workflow; reference web-services for basic datasets and models
• “Clever” and seamless integration of MS Virtual Earth, Google Maps, WMS and scientific visualization libraries
• Parallel visualization algorithms (GIS?), applications and viewers for very large images, maps and video streams on tiled displays
Data processing, analysis and visualization workflow
[Diagram labels: ActiveStorage; Virtual Observatory XML metadata and portal; OGSA-DAI Grid data services; MM5 and WRF mesoscale weather models; Matlab; Metadata; WMS, WCS; Virtual Earth; NetCDF API; REST and SOAP templates; KML and tile servers; NetCDF and NcML]
Virtual Observatory
XML metadata search engine
• Open Source middleware VxOware
• Tiers: 1) Web application; 2) REST services; 3) native XML database backend + native object stores with indexing (documents, images …)
• XML: multiple catalog-level metadata schemas, e.g. FGDC, ECHO, SPASE, NGDC Ordering Extensions
• Distributed metadata search over VO federation
Virtual Observatory for Metadata:
A Complete Data Environment is More than Just
the Bits
[Diagram labels: data sources CLASS, SPIDR, ActiveStorage; CLASS products; FGDC records; Ordering Extensions; Wiki; Documents; Presentations; Metadata store; SEARCH in metadata; Search result: ResourceID_1, ResourceID_2, ResourceID_3; per ResourceID: FGDC Metadata, OE (1…*), User Guide, Slideshow; Virtual Observatory Web application or portal; Web service API for Data Sources: Visualization service, Inventory service, Order service; OGSA-DAI Resource and Activities Plugin; Data Request; REST or SOAP API; OGSA-DAI client toolkit; Ordering Extensions XML; Ordering Extensions XML schema: station map; XSLT]
Why OGSA-DAI service container?
• Standard tool in the Grid community
• Supports distributed workflow (in version 3.*)
• Built in support for asynchronous transactions
• Compatible with Web (Axis) and Grid (OMII, UNICORE, GT4)
• Looked at alternatives like OpenDAP, WCS, …; documentation of our analysis is available
• Problem 1: it is very complex
– Solution: REST wrapper
• Problem 2: supports only File, SQL and XML data types and queries
– Solution: implement additional data sources and functions for data in multidimensional arrays
ESSE / OGSA-DAI extensions
• Provide catalog and inventory level metadata about a data source
• Support multidimensional array data model (in addition to SQL/XML/BLOB)
• Handle SOAP and REST requests for data export
• Have local data processing and fuzzy logic data mining functions
• Provide persistent storage for the data processing and environmental models output (as a new dataset)
• Can be chained into asynchronous distributed data
OGSA-DAI Data Order Flow
[Diagram labels: OE Web Form; Servlet; XSLT; Client; OGSA-DAI; Adapter; Data Server; Storage; Get Data; Process; Mine; SQL; XML; Granule; Time Series; XML Result; Error Message. Numbered steps 1 to 3, including "Process Document via SOAP"]
Data Types
• Time-series – Sunspot number
• Grids – NCEP Reanalysis
• Stations – Ionospheric Soundings
• Swath – AVHRR
• Profiles – Ocean Profile
• Maps – Nighttime lights
Environmental Scenario Search Engine
Time series as a trajectory in the two-dimensional phase space (P: pressure, T: temperature)
State S1, corresponding to the red (upper-right) region, is the fuzzy expression:
S1 = (Very Large P) and (Very Large T)
State S2, corresponding to the cyan (lower-left) region, is:
S2 = (Very Small P) and (Very Small T)
Combining the descriptions of the states with the time shift operator shift_dt, we can write the following symbolic expression for the Environmental Scenario “very low temperature and pressure after very high temperature and pressure”:
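A minimal Python sketch of how such a scenario could be scored against a time series. It assumes the scenario has the form shift_dt(S1) and S2, that "Very Large" and "Very Small" are squared linear ramps over the observed range, and that fuzzy "and" is the pointwise minimum. None of these choices is stated on the slide, and the sample values are invented for illustration.

```python
def normalize(xs):
    """Scale a series to [0, 1] using its own minimum and maximum."""
    lo, hi = min(xs), max(xs)
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in xs]

def very_large(xs):
    """Assumed membership: squared ramp, close to 1 only near the maximum."""
    return [v ** 2 for v in normalize(xs)]

def very_small(xs):
    """Assumed membership: squared ramp, close to 1 only near the minimum."""
    return [(1.0 - v) ** 2 for v in normalize(xs)]

def fuzzy_and(a, b):
    """Fuzzy conjunction taken as the pointwise minimum (an assumption)."""
    return [min(u, v) for u, v in zip(a, b)]

def shift(membership, dt):
    """Membership value dt steps earlier (dt >= 0); 0.0 where no history exists."""
    return [membership[i - dt] if i >= dt else 0.0
            for i in range(len(membership))]

def scenario_score(p, t, dt):
    """Degree to which 'S2 now, S1 dt steps ago' holds at every time step."""
    s1 = fuzzy_and(very_large(p), very_large(t))
    s2 = fuzzy_and(very_small(p), very_small(t))
    return fuzzy_and(shift(s1, dt), s2)

# Invented sample: pressure and temperature both fall sharply.
pressure = [1030, 1028, 1010, 990, 985, 1000]
temperature = [25, 24, 15, 2, 0, 10]

score = scenario_score(pressure, temperature, dt=3)
best = max(range(len(score)), key=score.__getitem__)
print(best, round(score[best], 3))
```

The time step with the highest score is the best candidate event; a search engine would rank such candidates across locations and time windows.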
Web editor for a multi-state
environmental scenario
Parallel Active Data Storage
• Open Source software developed in collaboration with MSR Cambridge
• Data provided by NCAR and NGDC NOAA
• Common Data Model and API compatible with Unidata CDM for NetCDF/HDF
• Scalable parallel storage and processing engine based on MS SQL Server
• Capable of storing terabytes of gridded output of numerical weather models and raw meteorological station reports
• Special client library with API and an OGSA-DAI plugin. The OGSA-DAI receives from ActiveStorage a CDM object and transforms it into different ES formats such as NcML,
Common Data Model (CDM)
Common Data Model (CDM) is an ES standard used in OpenDAP, netCDF4 and HDF5 as a general representation of multivariate numeric arrays. Sub-models such as grids (geophysical fields), points (observatories) and trajectories (ships, airplanes, satellites) are supported.
[Class diagram]
Dataset: name
Group: name
Attribute: name, value, dataType
Variable: name, shape, dataType
Dimension: name, length
DataType: char, byte, short, int, long, float, double, String
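The CDM entities named above (Dataset, Group, Attribute, Variable, Dimension, DataType) can be sketched as plain Python data classes. The field names follow the diagram; the containment relationships (a group holding dimensions, variables, attributes and subgroups) and the example dimension lengths are assumptions made for illustration, not the ActiveStorage schema.

```python
from dataclasses import dataclass, field
from math import prod
from typing import Any, List

# Primitive type names as listed in the diagram.
DATA_TYPES = ("char", "byte", "short", "int", "long", "float", "double", "String")

@dataclass
class Dimension:
    name: str
    length: int

@dataclass
class Attribute:
    name: str
    value: Any
    data_type: str

@dataclass
class Variable:
    name: str
    shape: List[Dimension]
    data_type: str
    attributes: List[Attribute] = field(default_factory=list)

    def __post_init__(self):
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")

    def element_count(self):
        """Number of array elements implied by the shape."""
        return prod(d.length for d in self.shape)

@dataclass
class Group:
    name: str
    dimensions: List[Dimension] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)
    attributes: List[Attribute] = field(default_factory=list)
    groups: List["Group"] = field(default_factory=list)  # assumed nesting

@dataclass
class Dataset:
    name: str
    root: Group

# Hypothetical example: a short 'air' variable on a 73 x 144 grid.
dims = [Dimension("time", 4), Dimension("level", 1),
        Dimension("lat", 73), Dimension("lon", 144)]
air = Variable("air", dims, "short", [Attribute("units", "degK", "String")])
dataset = Dataset("example", Group("root", dimensions=dims, variables=[air]))
print(air.element_count())
```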
Database schema to map CDM into
ActiveStorage
Data retrieval scheme:
single MS SQL server
[Diagram: client library; SQL Server database; requested region from xmin, ymin to xmax, ymax]
1. Call the client library with array coordinates as call parameters
2. Issue commands to the database server
3. Select the requested data parts from the appropriate chunks
4. Return the data parts to the client library
5. Merge the data parts and return the whole array to the user
The database engine performs only the basic array selection and subsetting. The client library does all the rest (merging chunks, type conversion, etc.).
Two versions of the client library: .NET and Java.
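The five retrieval steps can be sketched in Python with an in-memory dictionary standing in for the database. The chunk size, the function names and the 2-D layout are assumptions for illustration; the actual client libraries are .NET and Java and their API is not shown here.

```python
CHUNK = 4  # assumed chunk edge length, in array cells

def build_chunk_store(array, chunk=CHUNK):
    """Split a 2-D list into chunks keyed by (chunk_row, chunk_col)."""
    store = {}
    for r, row in enumerate(array):
        for c, value in enumerate(row):
            key = (r // chunk, c // chunk)
            store.setdefault(key, {})[(r % chunk, c % chunk)] = value
    return store

def select_part(store, key, r0, r1, c0, c1):
    """'Server side': cells of one chunk within local bounds [r0, r1) x [c0, c1)."""
    cells = store[key]
    return {(r, c): cells[(r, c)] for r in range(r0, r1) for c in range(c0, c1)}

def read_region(store, xmin, ymin, xmax, ymax, chunk=CHUNK):
    """'Client side': query every overlapping chunk and merge the parts.

    Bounds are inclusive; x indexes columns and y indexes rows.
    """
    height, width = ymax - ymin + 1, xmax - xmin + 1
    result = [[None] * width for _ in range(height)]
    for cr in range(ymin // chunk, ymax // chunk + 1):
        for cc in range(xmin // chunk, xmax // chunk + 1):
            # Translate the global bounds into this chunk's local coordinates.
            r0 = max(ymin, cr * chunk) - cr * chunk
            r1 = min(ymax, cr * chunk + chunk - 1) - cr * chunk + 1
            c0 = max(xmin, cc * chunk) - cc * chunk
            c1 = min(xmax, cc * chunk + chunk - 1) - cc * chunk + 1
            part = select_part(store, (cr, cc), r0, r1, c0, c1)
            for (r, c), value in part.items():
                result[cr * chunk + r - ymin][cc * chunk + c - xmin] = value
    return result

grid = [[10 * r + c for c in range(10)] for r in range(8)]
store = build_chunk_store(grid)
region = read_region(store, xmin=2, ymin=3, xmax=6, ymax=5)
print(region)
```

Keeping the server-side step to a simple selection mirrors the division of labour described above: merging happens in the client.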
Distributed queries:
MS SQL database cluster
[Diagram: client library querying several SQL Server databases]
Portions of the global array can be stored on several database servers to increase performance
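A minimal sketch of reading from an array spread over several servers, in Python. The round-robin placement of chunks, the chunk size and all function names are assumptions; each "server" is a dictionary rather than a real database connection.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 100  # assumed number of elements per chunk

def partition(data, n_servers, chunk=CHUNK):
    """Place chunk k on server k % n_servers (round-robin, an assumption)."""
    servers = [dict() for _ in range(n_servers)]
    for start in range(0, len(data), chunk):
        k = start // chunk
        servers[k % n_servers][k] = data[start:start + chunk]
    return servers

def fetch(server, k, lo, hi):
    """Stand-in for one database query: elements [lo, hi) of chunk k."""
    return server[k][lo:hi]

def read_range(servers, start, stop, chunk=CHUNK):
    """Read elements [start, stop) by querying the servers concurrently."""
    if stop <= start:
        return []
    requests = []
    for k in range(start // chunk, (stop - 1) // chunk + 1):
        lo = max(start, k * chunk) - k * chunk
        hi = min(stop, (k + 1) * chunk) - k * chunk
        requests.append((servers[k % len(servers)], k, lo, hi))
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        # pool.map yields results in request order, so parts merge in chunk order.
        parts = pool.map(lambda req: fetch(*req), requests)
        return [value for part in parts for value in part]

data = list(range(1000))
servers = partition(data, n_servers=3)
values = read_range(servers, 250, 620)
print(len(values), values[0], values[-1])
```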
% Java classes: the ActiveStorage connector and the SQL Server JDBC driver
import ru.wdcb.mdb.NcConnector
import com.microsoft.sqlserver.jdbc.SQLServerDriver

% Open the NCEP database and look up the 'air' variable
s = 'jdbc:sqlserver://localhost:1433;databaseName=NCEP_01;user=guest;password=guest';
connector = NcConnector();
ncid = connector.nc_open(s,0);
varid = connector.nc_inq_varid(ncid,'air');

% Time series: 80000 values along the first dimension at a single grid point
origin = [0 0 10 10]; size = [80000 1 1 1]; stride = [1 1 1 1];
A = connector.nc_get_vars_short(ncid,varid,origin,size,stride);
plot(A, 'DisplayName', 'A', 'YDataSource', 'A'); figure

% Global map: a single 73 x 144 slice
origin = [0 0 0 0]; size = [1 1 73 144]; stride = [1 1 1 1];
B = connector.nc_get_vars_shortm(ncid,varid,origin,size,stride);
B = reshape(B,[73 144]);
imagesc (B); figure(gcf);
NCEP/NCAR Weather Reanalysis
• Continually updating gridded data set
• Global Circulation Model output
• 74 weather parameters
• 5000 netCDF files, 30 – 500 MB each
Time coverage:
• 1948 – 2008
• 4 times daily (6-hourly) values
Grids:
• Regular grid, 2.5 x 2.5 degrees
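A 2.5-degree regular grid has 73 latitude rows and 144 longitude columns. The Python sketch below maps a location to its nearest grid node, assuming rows run from 90N to 90S and columns from 0E to 357.5E; that ordering and the function name are assumptions, so check them against the dataset's own coordinate variables.

```python
STEP = 2.5  # grid spacing in degrees

N_LAT = int(180 / STEP) + 1  # both poles are grid rows
N_LON = int(360 / STEP)      # 360E wraps back to 0E

def grid_index(lat, lon, step=STEP):
    """Nearest grid node (row, col) for a latitude/longitude in degrees.

    Assumes row 0 is 90N and column 0 is 0E.
    """
    n_lon = int(360 / step)
    row = round((90.0 - lat) / step)
    col = round((lon % 360.0) / step) % n_lon
    return row, col

moscow = grid_index(55.75, 37.62)
print(N_LAT, N_LON, moscow)
```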
NCDC Meteorological Observations Records
• 1901 – 2008 time coverage.
• 30 million sensors.
• 1.7 billion observations.
Fixed ground stations, ships, mobile stations, buoys
• 470 000 ASCII files packed with gzip.
• 50 GB packed; 400 GB unpacked.
Map of the meteorological stations
Integration of remote sensing and
climate data in CLIVT
Multi-annual NDVI time-series by land cover types
Multi-annual time-series of meteorological data
Regular cell-grid for data integration
Land cover map GLC2000
Multi-annual NDVI time-series
Integrated analysis
NDVI averaging for 2.5° x 2.5° cell-grid by land cover types
[Figure panels: 1st decade of June 1999; 2nd decade of June 1999; 3rd decade of June 2007]
NDVI for Evergreen Needleleaf Forest
[Plot against air temperature; axis ticks run from 0 to 1 and from -25 to 25]
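The averaging of NDVI over a 2.5-degree cell-grid by land cover type can be sketched in Python as a simple group-by. The cell numbering (rows from 90S, columns from 180W), the input record layout and the sample pixels are all assumptions for illustration, not CLIVT code or data.

```python
from collections import defaultdict
from math import floor

CELL = 2.5  # cell size in degrees

def cell_index(lat, lon, cell=CELL):
    """Row/column of the cell containing a point (assumed origin: 90S, 180W)."""
    return floor((lat + 90.0) / cell), floor((lon + 180.0) / cell)

def average_ndvi(pixels, cell=CELL):
    """Mean NDVI per (row, col, land_cover).

    pixels: iterable of (lat, lon, land_cover, ndvi) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for lat, lon, cover, ndvi in pixels:
        key = cell_index(lat, lon, cell) + (cover,)
        sums[key][0] += ndvi
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Invented sample pixels: two forest pixels share a cell, one lies in the next row.
pixels = [
    (60.1, 90.2, "evergreen needleleaf forest", 0.50),
    (61.0, 91.9, "evergreen needleleaf forest", 0.75),
    (60.5, 90.5, "cropland", 0.25),
    (63.0, 90.5, "evergreen needleleaf forest", 1.0),
]
means = average_ndvi(pixels)
for key in sorted(means):
    print(key, means[key])
```

Running the same grouping on each ten-day composite gives one NDVI time series per cell and land cover type, which can then be compared with meteorological series on the same grid.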
Web technologies for visualization of different
data types with geolocation
KML & geoRSS; Web-services for CDM data sources; OGC Web Map Services WMS/WFS/WCS; MS Virtual Earth; Google Maps
VisualESSE plugin for NASA World Wind desktop client
MS Virtual Earth, OGC Web Map Service
and NcML grid overlays
OGC WMS web map image with transparency control
Stable world nighttime lights by NGDC NOAA
NcML grid extracted from ActiveStorage
Current surface temperature by NWS NOAA
Reanalysis and forecast
weather data fusion
Related to a selected pushpin
• 50 years of weather history from NCEP/NCAR Reanalysis database
Fuzzy search and Virtual Earth
mapping of environmental events
Search for events at given locations
Select a set of fuzzy scenarios from the VO library and a time interval (history and forecast)
XSL transform of the search engine XML output into KML
Map the KML: any location, any events, any time window
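The slide describes an XSL transform; as an illustration of the target format only, the Python sketch below builds the same kind of KML directly with the standard library. The event record fields and the sample event are hypothetical, and this is not the ESSE stylesheet.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def _tag(name):
    """Qualified tag name in the KML namespace."""
    return f"{{{KML_NS}}}{name}"

def events_to_kml(events):
    """Serialize events as KML placemarks with a time span.

    events: iterable of dicts with keys name, lat, lon, begin, end
    (begin/end as ISO 8601 strings). This layout is an assumption.
    """
    ET.register_namespace("", KML_NS)
    kml = ET.Element(_tag("kml"))
    doc = ET.SubElement(kml, _tag("Document"))
    for ev in events:
        pm = ET.SubElement(doc, _tag("Placemark"))
        ET.SubElement(pm, _tag("name")).text = ev["name"]
        span = ET.SubElement(pm, _tag("TimeSpan"))
        ET.SubElement(span, _tag("begin")).text = ev["begin"]
        ET.SubElement(span, _tag("end")).text = ev["end"]
        point = ET.SubElement(pm, _tag("Point"))
        # KML coordinates are written as longitude,latitude.
        ET.SubElement(point, _tag("coordinates")).text = f"{ev['lon']},{ev['lat']}"
    return ET.tostring(kml, encoding="unicode")

# Hypothetical search result.
events = [{"name": "cold front", "lat": 55.75, "lon": 37.62,
           "begin": "2008-01-10T00:00:00Z", "end": "2008-01-11T00:00:00Z"}]
kml_text = events_to_kml(events)
print(kml_text)
```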
UIC SAGE 3.0 ported on MS Windows
Fully functional, not only local display
PsTools utilities instead of rsh
Uses Windows built-in security
Existing applications
JuxtaView
bitplayer, mplayer
Library for .NET interoperability
WorldWind for SAGE
MultiViewer application
[Diagram: UI Controller and rendering clients; display tiles numbered 1 to 6]
Each node performs data fetching, processing and rendering
Better utilization of videocluster resources
Transparent Data Cube
All HPC components from the CLIVT toolbox can run on the same parallel cluster. At the IKI Computing Center in Moscow we utilize a 12-node cluster with WCC MPI for MM5, MS SQL Server databases for ActiveStorage and a 12-display videowall for MultiViewer. We call this parallel installation for storage, modeling and visualization the Transparent Data Cube.