
(1)

Parallel storage, mining and visualization of environmental data archives

Mikhail Zhizhin, Dmitry Medvedev, Alexey Poyda, Dmitry Mishin and Sergei Berezin

Space Research Institute and Geophysical Center

(2)

Collaboration with Microsoft Research

• 2006-2009 Environmental Scenario Search Engine (ESSE)

– Site: Geophysical Center, Moscow State University and MSR Cambridge

– PIs: Mikhail Zhizhin (RAS), Eric Kihn (NOAA) and Vassily Lyutsarev (MSRC)

– PhDs: Alexey Poyda (Moscow State University), Dmitry Mishin and Dmitry Medvedev (RAS)

– Summary: ESSE is an interactive search engine for data mining in environmental data archives. Unlike conventional text-based search engines, it searches inside the numeric datasets, using fuzzy logic to describe transitions between environmental states

• 2007-2009 Climate Induced Vegetation Change Analysis Tool (CLIVT)

– Site: Space Research Institute and MSR

– PIs: Eugeny Lupian and Mikhail Zhizhin (RAS)

– PhDs: Maria Medvedeva, Alexey Poyda, Dmitry Medvedev, Dmitry Voytsehovsky

– Summary: In the CLIVT project we bring together large archives of satellite images and historical data on vegetation and climate for the territory of Northern Eurasia and develop a new technique to study relations between ecosystem dynamics and climate change

(3)

Joint Research Center IKI-MSR in Moscow

Framework Agreement signed at Moscow State University on March 17th, 2009

Administrative structure and research project agreements will be elaborated in June 2009

Main directions of research:

The Parties wish to collaborate on research concerning global change of climate, ecology and space environment in their interrelation, which will require satellite and ground-based sensor observations together with data intensive high-performance computing for environmental

(4)

What are the challenges

• Repetitive tasks to design (very large) databases for new data products. Interactive access times for any projection of the data array

• Never delete/overwrite data, lineage-accreditation-quality-type flags

• Multilayer (catalog, inventory, order, process) distributed metadata storage [STANDARD]

• Semantically rich common data model [STANDARD] and query language [STANDARD] for (environmental) scientific datasets

• Functionally rich data services supporting data extraction, processing and mining implemented at the data server [STANDARD]

• Distributed algorithms to balance network/database load

• Data export – modeling – visualize – ingest workflow; reference web-services for basic datasets and models

• “Clever” and seamless integration of MS Virtual Earth, Google Maps, WMS and scientific visualization libraries

• Parallel visualization algorithms (GIS?), applications and viewers for very large images, maps and video streams on tiled displays

(5)

Data processing, analysis and visualization workflow

[Diagram labels: ActiveStorage; Virtual Observatory (XML metadata and portal); OGSA-DAI Grid data services; MM5 and WRF mesoscale weather models; Matlab; Metadata; WMS, WCS; Virtual Earth; NetCDF API; REST and SOAP templates; KML and tile servers; NetCDF and NcML]

(6)

Virtual Observatory XML metadata search engine

• Open Source middleware VxOware

• Tiers: 1) Web application; 2) REST services; 3) native XML database backend + native object stores with indexing (documents, images …)

• XML: multiple catalog-level metadata schemas, e.g. FGDC, ECHO, SPASE, NGDC Ordering Extensions

• Distributed metadata search over VO federation

(7)

Virtual Observatory for Metadata: A Complete Data Environment is More than Just the Bits

[Diagram labels: data sources (CLASS, SPIDR, ActiveStorage); CLASS products; FGDC records; Ordering Extensions; Wiki; Documents; Presentations; SEARCH in metadata; search result (ResourceID_1, ResourceID_2, ResourceID_3); per ResourceID: FGDC Metadata, OE (1…*), User Guide, Slideshow; Metadata store; Virtual Observatory Web application or portal; Web service API for Data Sources (Visualization service, Inventory service, Order service); OGSA-DAI Resource and Activities Plugin; Data Request via REST or SOAP API; OGSA-DAI client toolkit; Ordering Extensions XML]

(8)

Ordering Extensions XML schema: station map

XSLT

(9)

Why OGSA-DAI service container?

• Standard tool in the Grid community

• Supports distributed workflow (in version 3.*)

• Built in support for asynchronous transactions

• Compatible with Web (Axis) and Grid (OMII, UNICORE, GT4)

• Looked at alternatives like OPeNDAP, WCS, …; documentation of our analysis is available

• Problem 1: it is very complex

– Solution: REST wrapper

• Problem 2: supports only File, SQL and XML data types and queries

– Solution: implement additional data sources and functions for data in multidimensional arrays

(10)

ESSE / OGSA-DAI extensions

• Provide catalog and inventory level metadata about a data source

• Support multidimensional array data model (in addition to SQL/XML/BLOB)

• Handle SOAP and REST requests for data export

• Have local data processing and fuzzy logic data mining functions

• Provide persistent storage for the data processing and environmental models output (as a new dataset)

• Can be chained into asynchronous distributed data

(11)

OGSA-DAI Data Order Flow

[Diagram labels: OE Web Form; Servlet; XSLT; Client; OGSA-DAI XML; Process Document via SOAP; Data Server (Get Data, Process, Mine); Adapter; Storage; SQL; XML; Granule; Time Series; Result; Error Message; steps numbered 1 to 3]

Data Types

• Time-series – Sunspot number

• Grids – NCEP Reanalysis

• Stations – Ionospheric Soundings

• Swath – AVHRR

• Profiles – Ocean Profile

• Maps – Nighttime lights

(12)

Environmental Scenario Search Engine

Time series as a trajectory in the two-dimensional phase space (P: pressure, T: temperature)

State S1, corresponding to the red (upper-right) region, is the fuzzy expression:

S1 = (Very Large P) and (Very Large T)

State S2, corresponding to the cyan (lower-left) region, is:

S2 = (Very Small P) and (Very Small T)

Combining the descriptions of the states with the time shift operator shift_dt, we can write the following symbolic expression for the Environmental Scenario "very low temperature and pressure after very high temperature and pressure":
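The formula itself is not present in the extracted slide text. One plausible reading of the description is (shift_dt S1) and S2: state S1 held dt steps earlier and state S2 holds now. The sketch below evaluates that reading on a short series. The membership functions (a squared linear ramp for "very"), the min operator for fuzzy "and", and all function names are assumptions made for illustration, not the ESSE implementation.

```python
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Scale a series linearly to [0, 1] using its own min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def very_large(x: float) -> float:
    """Assumed membership: squaring sharpens 'large' into 'very large'."""
    return x ** 2


def very_small(x: float) -> float:
    """Assumed membership: mirror image of very_large."""
    return (1.0 - x) ** 2


def fuzzy_and(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Pointwise fuzzy conjunction, taken here as the minimum."""
    return [min(x, y) for x, y in zip(a, b)]


def shift(membership: Sequence[float], dt: int) -> List[float]:
    """Delay a series by dt steps, so result[t] = membership[t - dt]."""
    n = len(membership)
    return ([0.0] * dt + list(membership))[:n]


def scenario_score(pressure, temperature, dt: int) -> List[float]:
    """Score of 'S2 now, S1 dt steps earlier' at every time step."""
    p, t = normalize(pressure), normalize(temperature)
    s1 = fuzzy_and([very_large(v) for v in p], [very_large(v) for v in t])
    s2 = fuzzy_and([very_small(v) for v in p], [very_small(v) for v in t])
    return fuzzy_and(shift(s1, dt), s2)


# Made-up sample data: a warm high-pressure spell followed by a cold low.
pressure = [1000, 1030, 1020, 990, 985, 1005]
temperature = [10, 25, 24, 2, 0, 12]
scores = scenario_score(pressure, temperature, dt=2)
best = max(range(len(scores)), key=scores.__getitem__)
```

The time step with the highest score is the best candidate event. A search engine would typically return the top few candidates ranked by score.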

(13)

Web editor for a multi-state environmental scenario

(14)

Parallel Active Data Storage

• Open Source software developed in collaboration with MSR Cambridge

• Data provided by NCAR and NGDC NOAA

• Common Data Model and API compatible with Unidata CDM for NetCDF/HDF

• Scalable parallel storage and processing engine based on MS SQL Server

• Capable of storing terabytes of gridded output of numerical weather models and raw meteorological station reports

• Special client library with API and an OGSA-DAI plugin. The OGSA-DAI receives from ActiveStorage a CDM object and transforms it into different ES formats such as NcML,

(15)

Common Data Model (CDM)

Common Data Model (CDM) is an ES standard used in OPeNDAP, netCDF4 and HDF5 as a general representation of multivariate numeric arrays. Sub-models such as grids (geophysical fields), points (observatories) and trajectories (ships, airplanes, satellites) are supported.

[Class diagram: Dataset (name); Group (name); Dimension (name, length); Variable (name, shape, dataType); Attribute (name, value, dataType); DataType (char, byte, short, int, long, float, double, String)]
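The classes named on this slide can be written down as plain data structures. The sketch below is illustrative only: the field names follow the slide, but the containment relations (a Group holding dimensions, variables, attributes and nested groups) are an assumption based on the general Unidata CDM layout, and none of this is the ActiveStorage API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

# Data types listed on the slide.
DATA_TYPES = ("char", "byte", "short", "int", "long", "float", "double", "String")


@dataclass
class Dimension:
    name: str
    length: int


@dataclass
class Attribute:
    name: str
    data_type: str
    value: Any


@dataclass
class Variable:
    name: str
    data_type: str
    dimensions: List[Dimension]
    attributes: List[Attribute] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")

    @property
    def shape(self) -> Tuple[int, ...]:
        """Shape is derived from the lengths of the variable's dimensions."""
        return tuple(d.length for d in self.dimensions)


@dataclass
class Group:
    name: str
    dimensions: List[Dimension] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)
    attributes: List[Attribute] = field(default_factory=list)
    groups: List["Group"] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    root: Group


# Hypothetical example: a gridded air temperature variable.
dims = [Dimension("time", 8), Dimension("level", 17),
        Dimension("lat", 73), Dimension("lon", 144)]
air = Variable("air", "short", dims, [Attribute("units", "String", "degK")])
dataset = Dataset("example", Group("root", dimensions=dims, variables=[air]))
```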

(16)

Database schema to map CDM into ActiveStorage

(17)

Data retrieval scheme: single MS SQL server

[Diagram labels: SQL Server database; client library; request corners (xmin, ymin) and (xmax, ymax)]

1. Call the client library with array coordinates as call parameters

2. Issue commands to the database server

3. Select the requested data parts from the appropriate chunks

4. Return the data parts to the client library

5. Merge the data parts and return the whole array to the user

The database engine performs only the basic array selection and subsetting. The client library does all the rest (merging chunks, type conversion, etc.). Two versions of the client library: .NET and Java.
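The five steps can be illustrated with a small in-memory sketch for a two-dimensional array. A dictionary of chunks stands in for the database tables, and slicing a chunk stands in for the server-side subsetting. The chunk size, the key layout and the function names are assumptions, not the ActiveStorage schema or client API.

```python
from typing import Dict, List, Tuple

Chunk = List[List[int]]
ChunkKey = Tuple[int, int]  # (chunk row, chunk column)


def split_into_chunks(array: Chunk, chunk: int) -> Dict[ChunkKey, Chunk]:
    """Cut a 2-D array into chunk x chunk blocks (edge blocks may be smaller)."""
    store: Dict[ChunkKey, Chunk] = {}
    for y0 in range(0, len(array), chunk):
        for x0 in range(0, len(array[0]), chunk):
            block = [row[x0:x0 + chunk] for row in array[y0:y0 + chunk]]
            store[(y0 // chunk, x0 // chunk)] = block
    return store


def read_region(store: Dict[ChunkKey, Chunk], chunk: int,
                xmin: int, ymin: int, xmax: int, ymax: int) -> Chunk:
    """Return rows ymin..ymax and columns xmin..xmax (both inclusive)."""
    result = [[0] * (xmax - xmin + 1) for _ in range(ymax - ymin + 1)]
    # Steps 2-3: visit only the chunks that intersect the requested region.
    for cy in range(ymin // chunk, ymax // chunk + 1):
        for cx in range(xmin // chunk, xmax // chunk + 1):
            block = store[(cy, cx)]  # stands in for one database query
            y_lo = max(ymin, cy * chunk)
            y_hi = min(ymax, cy * chunk + len(block) - 1)
            x_lo = max(xmin, cx * chunk)
            x_hi = min(xmax, cx * chunk + len(block[0]) - 1)
            # Steps 4-5: copy the selected part into its place in the result.
            for y in range(y_lo, y_hi + 1):
                part = block[y - cy * chunk][x_lo - cx * chunk:x_hi - cx * chunk + 1]
                result[y - ymin][x_lo - xmin:x_hi - xmin + 1] = part
    return result


array = [[r * 10 + c for c in range(7)] for r in range(5)]
store = split_into_chunks(array, 3)
region = read_region(store, 3, xmin=2, ymin=1, xmax=5, ymax=4)
```

Because only intersecting chunks are touched, the cost of a request depends on the size of the region rather than on the size of the whole array.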

(18)

Distributed queries: MS SQL database cluster

[Diagram labels: client library; SQL Server database; SQL Server database]

...

Portions of the global array can be stored on several database servers to increase performance
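A minimal sketch of the idea, assuming chunks are assigned to servers by a simple modulo rule and fetched concurrently. The slide does not say how portions are actually placed, so the placement rule, the dictionaries standing in for database servers and the function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Hashable, List, Sequence, Tuple

ChunkKey = Tuple[int, int]


def server_for(key: ChunkKey, n_servers: int) -> int:
    """Assumed placement rule: spread neighbouring chunks over servers."""
    cy, cx = key
    return (cy + cx) % n_servers


def distribute(chunks: Dict[ChunkKey, Hashable], n_servers: int) -> List[dict]:
    """Place every chunk on the server chosen by the placement rule."""
    servers: List[dict] = [dict() for _ in range(n_servers)]
    for key, block in chunks.items():
        servers[server_for(key, n_servers)][key] = block
    return servers


def fetch(servers: Sequence[dict], keys: Sequence[ChunkKey]) -> dict:
    """Fetch the requested chunks, querying the servers concurrently."""
    n = len(servers)

    def fetch_one(key: ChunkKey):
        # Stands in for one query sent to the server that owns the chunk.
        return key, servers[server_for(key, n)][key]

    with ThreadPoolExecutor(max_workers=n) as pool:
        return dict(pool.map(fetch_one, keys))


chunks = {(cy, cx): f"chunk-{cy}-{cx}" for cy in range(2) for cx in range(3)}
servers = distribute(chunks, 3)
parts = fetch(servers, [(0, 0), (1, 2), (0, 1)])
```

With this rule a rectangular request touches all servers roughly equally, so the load is shared instead of falling on one machine.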

(19)

import ru.wdcb.mdb.NcConnector
import com.microsoft.sqlserver.jdbc.SQLServerDriver

% Open the NCEP_01 database through a JDBC connection string
s = 'jdbc:sqlserver://localhost:1433;databaseName=NCEP_01;user=guest;password=guest';
connector = NcConnector();
ncid = connector.nc_open(s,0);
varid = connector.nc_inq_varid(ncid,'air');

% Series of 80000 values along the first dimension at grid indices (10, 10)
origin = [0 0 10 10];
size = [80000 1 1 1];
stride = [1 1 1 1];
A = connector.nc_get_vars_short(ncid,varid,origin,size,stride);
plot(A, 'DisplayName', 'A', 'YDataSource', 'A');
figure

% One 73 x 144 field at the first index of the first two dimensions
origin = [0 0 0 0];
size = [1 1 73 144];
stride = [1 1 1 1];
B = connector.nc_get_vars_shortm(ncid,varid,origin,size,stride);
B = reshape(B,[73 144]);
imagesc(B);
figure(gcf);

(20)

NCEP/NCAR Weather Reanalysis

• Continually updating gridded data set

• Global Circulation Model output

• 74 weather parameters

• 5000 netCDF files, 30 – 500 MB each

Time coverage:

• 1948 – 2008

• 6-hourly values (4 times daily)

Grids:

• Regular grid, 2.5 x 2.5 degrees
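A 2.5 degree global grid has 73 latitudes and 144 longitudes, which matches the 73 x 144 shape used in the Matlab example. The sketch below builds the two axes and looks up the nearest grid cell for a location. The axis orientation (latitude from north to south, longitude from 0 to 357.5 east) and the function names are assumptions made for illustration.

```python
from typing import List


def grid_axis(start: float, stop: float, step: float) -> List[float]:
    """Evenly spaced axis from start to stop inclusive."""
    n = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(n)]


def nearest_index(axis: List[float], value: float) -> int:
    """Index of the grid point closest to the given coordinate."""
    return min(range(len(axis)), key=lambda i: abs(axis[i] - value))


lats = grid_axis(90.0, -90.0, -2.5)  # 180 / 2.5 + 1 = 73 points
lons = grid_axis(0.0, 357.5, 2.5)    # 360 / 2.5 = 144 points

# Nearest grid cell to Moscow (about 55.75 N, 37.62 E).
lat_index = nearest_index(lats, 55.75)
lon_index = nearest_index(lons, 37.62)
```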

(21)

NCDC Meteorological Observations Records

• 1901 – 2008 time coverage.

• 30 million sensors.

• 1.7 billion observations.

• Fixed ground stations, ships, mobile stations, buoys.

• 470 000 ASCII files packed with gzip.

• 50 GB packed; 400 GB unpacked.

Map of the meteorological stations

(22)

Integration of remote sensing and climate data in CLIVT

[Diagram: multi-annual NDVI time-series, the land cover map GLC2000 and multi-annual time-series of meteorological data are combined on a regular cell-grid for data integration; NDVI is averaged over the 2.5° x 2.5° cell-grid by land cover types, giving multi-annual NDVI time-series by land cover types for integrated analysis]

[Figure panels: 1st decade of June 1999, 2nd decade of June 1999, 3rd decade of June 2007. Plot: NDVI for Evergreen Needleleaf Forest; axis labelled Air-Temperature, tick ranges 0 to 1 and -25 to 25]
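The averaging step can be sketched as a group-by over (grid cell, land cover type). This is an illustration under stated assumptions, not the CLIVT code: pixels are given as (lat, lon, land cover, NDVI) tuples, cells are identified by integer indices obtained by flooring the coordinates, and all sample values are made up.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

CELL = 2.5  # cell size in degrees, as on the slide

Pixel = Tuple[float, float, str, float]  # lat, lon, land cover type, NDVI
Key = Tuple[Tuple[int, int], str]        # (cell index, land cover type)


def cell_of(lat: float, lon: float) -> Tuple[int, int]:
    """Integer index of the 2.5 x 2.5 degree cell containing the point."""
    return (math.floor(lat / CELL), math.floor(lon / CELL))


def average_ndvi(pixels: Iterable[Pixel]) -> Dict[Key, float]:
    """Mean NDVI for every (cell, land cover type) pair that has pixels."""
    sums = defaultdict(lambda: [0.0, 0])
    for lat, lon, land_cover, ndvi in pixels:
        acc = sums[(cell_of(lat, lon), land_cover)]
        acc[0] += ndvi
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


pixels = [
    (56.1, 36.2, "evergreen needleleaf forest", 0.5),
    (56.9, 37.0, "evergreen needleleaf forest", 0.75),
    (56.5, 36.9, "cropland", 0.25),
    (61.0, 36.2, "evergreen needleleaf forest", 0.5),
]
means = average_ndvi(pixels)
```

Repeating the averaging for every ten-day composite gives one NDVI time series per cell and land cover type, which can then be compared with the meteorological series on the same grid.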

(23)

Web technologies for visualization of different data types with geolocation

• KML & geoRSS

• Web-services for CDM data sources

• OGC Web Map Services WMS/WFS/WCS

• MS Virtual Earth

• Google Maps

(24)

VisualESSE plugin for NASA World Wind desktop client

(25)

MS Virtual Earth, OGC Web Map Service and NcML grid overlays

OGC WMS web map image with transparency control

Stable world nighttime lights by NGDC NOAA

NcML grid extracted from ActiveStorage

Current surface temperature by NWS NOAA

(26)

Reanalysis and forecast weather data fusion

Related to a selected pushpin

• 50 years of weather history from NCEP/NCAR Reanalysis database

(27)

Fuzzy search and Virtual Earth mapping of environmental events

• Search for events at given locations

• Select a set of fuzzy scenarios from the VO library and a time interval (history and forecast)

• XSL transform of the search engine XML output into KML

• Map the KML: any location, any events, any time window
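The slide describes an XSL transform of the search output; the sketch below reaches the same kind of KML by building it directly in Python, which is a substitute technique chosen for brevity. The event fields (scenario, time, lat, lon) are hypothetical, since the search engine's XML schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable

KML_NS = "http://www.opengis.net/kml/2.2"


def _tag(name: str) -> str:
    """Qualified tag name in the KML namespace."""
    return f"{{{KML_NS}}}{name}"


def events_to_kml(events: Iterable[Dict]) -> str:
    """Build a KML document with one timestamped placemark per event."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(_tag("kml"))
    document = ET.SubElement(kml, _tag("Document"))
    for event in events:
        placemark = ET.SubElement(document, _tag("Placemark"))
        ET.SubElement(placemark, _tag("name")).text = event["scenario"]
        stamp = ET.SubElement(placemark, _tag("TimeStamp"))
        ET.SubElement(stamp, _tag("when")).text = event["time"]
        point = ET.SubElement(placemark, _tag("Point"))
        # KML coordinates are written as longitude,latitude.
        coords = f'{event["lon"]},{event["lat"]}'
        ET.SubElement(point, _tag("coordinates")).text = coords
    return ET.tostring(kml, encoding="unicode")


# Made-up events for illustration.
events = [
    {"scenario": "cold front", "time": "1999-06-01T06:00:00Z",
     "lat": 55.75, "lon": 37.62},
    {"scenario": "heat wave", "time": "2007-06-21T12:00:00Z",
     "lat": 59.94, "lon": 30.31},
]
kml_text = events_to_kml(events)
```

The resulting string can be served to any KML-aware map client, which places one pushpin per event and can filter by the time stamp.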

(28)

UIC SAGE 3.0 ported to MS Windows

• Fully functional, not only local display

• PsTools utilities instead of rsh

• Uses Windows built-in security

• Existing applications

– JuxtaView

– bitplayer, mplayer

• Library for .NET interoperability

• WorldWind for SAGE

(29)

MultiViewer application

[Diagram labels: UI Controller; rendering clients; display tiles numbered 1 to 6]

• Each node performs data fetching, processing and rendering

• Better utilization of videocluster resources
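If each node fetches and renders only its own part of the picture, the controller has to tell every node which region it owns. A minimal sketch of that bookkeeping follows, assuming a rectangular wall with one rendering node per display and row-major node numbering. The wall layout, the image size and the function name are hypothetical and not taken from MultiViewer.

```python
from typing import Dict, Tuple

Viewport = Tuple[int, int, int, int]  # x0, y0, x1, y1 (x1 and y1 exclusive)


def tile_viewports(image_w: int, image_h: int,
                   cols: int, rows: int) -> Dict[int, Viewport]:
    """Pixel region of the full image assigned to each display node."""
    viewports: Dict[int, Viewport] = {}
    for r in range(rows):
        for c in range(cols):
            x0 = c * image_w // cols
            x1 = (c + 1) * image_w // cols
            y0 = r * image_h // rows
            y1 = (r + 1) * image_h // rows
            viewports[r * cols + c] = (x0, y0, x1, y1)
    return viewports


# Assumed example: an 8000 x 3000 pixel image on a 4 x 3 wall of displays.
viewports = tile_viewports(8000, 3000, cols=4, rows=3)
```

Each node then requests only the data inside its own viewport, so fetching, processing and rendering are spread across the whole cluster.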

(30)

Transparent Data Cube

All HPC components from the CLIVT toolbox can run on the same parallel cluster. At the IKI Computing Center in Moscow we utilize a 12-node cluster with WCC MPI for MM5, MS SQL Server databases for ActiveStorage and a 12-display videowall for MultiViewer. We call this parallel installation for storage, modeling and visualization the Transparent Data Cube.

(31)

Directions for further research

• Continue analysis of climate-biosphere interactions

• Sun-Earth connections, including climate, ionosphere, magnetosphere, cosmic rays

• Data-intensive and cloud computing on Microsoft HPC/Azure platform in remote sensing, environmental databases and sensor networks

• Tiled display / Virtual Earth / Deep Zoom / SAGE visualization platform / World Wide Telescope

• Multispectral micro-remote sensing for art
