Dimitris Kotzinos
FORTH-ICS & University of Cergy - Pontoise
Yannis Roussakis, Kyriakos Kritikos, Athina Kritsotaki, Martin Doerr FORTH-ICS
Linked Data Management and
INSPIRE Compliant Data Export
•
Data integration
•
Data retrieval in a uniform way
•
Data export to other formats
•
The
Web of data
becomes a reality by connecting so far
isolated islands or data silos such as
– Enterprise silos (Disparate DBMS Engines & Development
Frameworks)
– Social Media silos (Social Networking Services, Discussion Forums,
Blogs, Wikis etc.)
– Scientific silos (in biology, physics, earth sciences, etc.)
•
Linked Open Data
(LOD) is a way of publishing data on
the (Semantic) Web that:
Data Integration Layer
• Abstraction layer for data access
abstract the applications from the
specific setup of the data management service (such as local vs. remote,
federation, and distribution)
• Beyond Data Access • Enabling automation of
discovery, composition, and use of datasets
• Data Markets
• Online Visualization Services • Data Publishing Solutions • Data Aggregators
• BI / Analytics as a Service
Linked Open Data as Service
4 November 26, 2013 Query Engine Rel DB Rel DB files Excel files files XML files
A P I Query Update Import Export
Linked Data
Extensible Application Pool
Visualization Collaboration Cross Data sets' Querying
•
Query results viewed as:
– RDF/XML – JSON – GML – KML – Shapefiles•
Transformation services to user provided
•
In this project we deal with scientific data described
in different formats and notions from diverse areas
– ground waters, boreholes, chemical analyses, etc. • GEUS, EKBAA, BRGM
– earthquakes, recordings, stations, etc. • EPPO
– Landslides, susceptibility maps, etc. • GEO-ZS, EKBAA
•
Purpose: integrate, describe and query such
heterogeneous data in a uniform way
Introduction on Available Datasets in the
project
•
Approach
– Creation of a Conceptual Model to integrate and cover all the thematic fields
– Map the source relational data into RDF data compliant to the Conceptual Model
– Provide unique URIs for data
– Rely on a scalable RDF Triple Store to enforce the mappings and enable the storage and query of the RDF data
Models, Mappings and Integration
November 26, 2013
InGeoCloudS Workshop, Brussels 21 November
8 Providers’ Data GSOM mappings query tsv/csv xml json results xml trig trix n3 … export formats Inspire compliant GeoJSON KML Shapefile …. transform
• Our model:
– needs to be a generic, extendable and interoperable, which relies not only to geographic data
– Integrates different datasets created from sources w.r.t. their semantics – developed for the documentation of scientific observation –
measurement and sampling
• An event-centric model that captures complex contexts where
time, place and participants are associated.
• Capable to describe complex relationships – to integrate or connect rich information
E39 Actor P14 carried out by E53 Place P7 took place at E52 Time-span P4 has time-span E7 Activity E70 Thing P12B was present at E55 Type P2 has type E41 Appellation P1 is identified by E62 String E1 CRM Entity P3 has note
E73 Information Object P67B is referred to by
S18 Map
P101 had as general use
S19 Observable Entity S21 Spatial Distribution Model
P16 used specific object
dc:Agent rdf:Literal dc:coverage dc:date dc:format dc:creator dc:subject dc:LocationPeriodOrJurisdiction dc:MediaTypeOrExtent
E42 Place Appellation
Basic Concepts
Proposed (New) Concepts
Reused Concepts (CIDOC) Reused Concepts (DC, RDF)
The model in details - Observation,
Measurement, Sample Taking
Proposed (New) Concepts
Reused Concepts (CIDOC) Reused Concepts (DC, RDF)
Data Mappings to Conceptual Model
•
The data providers’ relational data were informally
mapped into our Conceptual Model
•
Such mappings can be formally expressed using
R2RML
– R2RML is a W3C Recommendation
– http://www.w3.org/TR/r2rml/
– Enables the automatic creation and synchronization of the Linked Data w.r.t. the Relational Data
S16.Borehole S17.Borehole_Collar O7.consists_of P1F.is_identified_by P43F.has_ Dimension E42.Identifier BOREHOLENO S15.Acquifer_Concept INTAKE P1F.is_identified_ by P87F.is_identified_by P2F.has_type P87F.is_identified_by O9.contains_or_ confines E41.Appelation BOREHOLENAME E54.Dimension DRILLDEPTH P90F.has_value P2F.has_type E55.Type “Depth” E60.Number E42.Identifier INTAKENO E47.Spatial_Coordinates LATITUDE E47.Spatial_Coordinates E47.Spatial_Coordinates XUTM E47.Spatial_Coordinates YUTM E47.Spatial_Coordinates ELEVATION E55.Type “EPSG:32632”
GEUS data -> Conceptual Model – Borehole
and Location
P1F.is_identified_by P4F.has_ time-span O4.sampled_at O12.upper_vertical_limit O13.lower_vertical_limit S13.Sample O5.removed P2F.has_type E55.Type “GrWater_Sampling” E53.Place O9.contains_or_ confines E52.Time-Span SAMPLEDATE E47.Spatial_Coordinates TOP E47.Spatial_Coordinates BOTTOM E42.Identifier SAMPLEID S16.Borehole S15.Acquifer_Concept INTAKE P1F.is_identified_ by E42.Identifier INTAKENO S2.Sample_Taking P53F_has_former_or_ current_location P16F.used_specific_ object O9.contains_or_ confines
GEUS data -> Conceptual Model – Ground
Water Sample
E16.Measurement P1F.is_identified_by P2F.has_type P14F.carried_out_by P40F.observed_dimension P4F.has_time-span O9.contains_or_ confines S16.Borehole S15.Acquifer_Concept INTAKE E52.Time-Span TIMEOFMEAS E42.Identifier WATLEVELNO E55.Type QUALITY E39.Actor CATEGORY E54.Dimension WATERLEVEL P90F.has_value E54.Dimension WATLEVMSL P90F.has_value E54.Dimension WATLEVGRSU P90F.has_value E54.Dimension WATLEVMP P90F.has_value P7F.took_place_at P16F.used_specific_ object E55.Type METHOD P32F.used_general_technique
GEUS data -> Conceptual Model – Water Level
Measurement
E16.Measurement
S13.Sample P39F.measured
P1F.is_identified_by
P40F.observed_ dimension
E54.Dimension P91F.has_unit E58.Measurement_UnitUNIT
P90F.has_value P1F.is_identified_by P1F.is_identified_by P1F.is_identified_by P2F.has_type P2F.has_type O15.is_bounded_by E42.Identifier ANALYSISNO E55.Type “Chemical_Analysis” E60.Number AMOUNT E55.Type Comp_Descr E60.Number DETECTIONLIMIT E42.Identifier COMPOUNDNO E42.Identifier EUNO E42.Identifier CASNO
GEUS data -> Conceptual Model – Ground
Water Chemical Analysis
•
INSPIRE (
Infrastructure for Spatial Information in
Europe) covers 34 Spatial Data Themes. It uses ISO
19100 series standards; oriented to
environmental
and
spatial
data
•
The Generic conceptual schema is
not event-oriented
(especially the schema that uses the
observation-measurement specification)
Our data and INSPIRE
•
Data are not INSPIRE-compliant but there is a specific
•
INSPIRE is not made for information integration
•
INSPIRE lacks semantic clarity of concepts in various
places, i.e.
– “An observation is an act that results in the estimation of the value of a feature property…”
– Confusion of the action (event) aka observation with the result
– “The observation itself is also a feature, since it has properties and identity.”
– Observation is the value of a feature property and also a feature
Geosciml: borehole
Geology core: borehole
Borehole is used by 2
Why do not use
INSPIRE directly?
INSPIRE schemas
How many observations definitions (structures) INSPIRE includes?
1) ISO 19156 Observations and Measurements/OM Observation
2) GML Observation
3) Observation OGC-WaterML:Observation Process, MeasurementTimeseries `and
TimeseriesObservation: What is their relation with the other observation schemas(ISO 19156)?
1) ISO 19156 2) GML 3) WaterML
•
Model Mapped to INSPIRE
•
Virtuoso Universal Service is a hybrid of DB engine &
middleware combining the func. of a RDBMS,
ODBMS, and a RDFStore
•
Supports various Internet & Web protocols:
– HTTP(s), WebDav, SOAP, UDDI, SPARQL
•
Implemented various data access APIs:
– ODBC, JDBC, OLE DB, ADO . NET
•
Provides a Web-based DB Administration UI
•
Provides an open-source version
•
Reasons for selection:
– Scalable to handle millions of triples + exhibits a very good performance
– Offers API for RDF data management
– Supports SPARQL and SPARUL
– Supports 2-D features and offers spatial operators
– Supports R2RML and provides the respective relational-to-RDF data mapping mechanism
– Offers a cloud-based version
•
Scalability / Elasticity
– Planned at least 2 instances of Elastic LD Storage to be available in the platform to handle the query load (cross-provider queries are more demanding)
– The elastic capabilities of Amazon Cloud are exploited to increase/decrease the number of instances when the load surpasses or goes below a specific threshold
– Stress tests are currently conducted to determine these upper and lower thresholds
• Generated Linked Data for ground waters (EKBAA, BRGM, GEUS)
• Successfully inserted ~1B triples into Virtuoso
• Created specific queries – For each partner’s dataset
– Combining similar notions from cross provider datasets
• Relative small execution times depending on the number of fetched results + query complexity
• At the moment using only 1 instance in the Amazon Cloud
Conceptual Data Integration & Linking API
• Methods for:
– Querying data/metadata via SPARQL (both provider-specific & cross-provider queries are supported)
– Importing data/metadata in various formats (both blocking and non-blocking depending on the size of meta-data to be imported)
• Direct through respective API method
– Data provider should be responsible of synchronizing his/her relational and RDF data
• Indirect through the API method for the definition of relational-to-RDF mappings
– Apart from the definition of mappings, the data provider does not need to perform any kind of synchronization, as the system will be responsible for performing such a synchronization on his/her behalf
– Exporting data/metadata in various formats (inline in the response or available in a specific URL)
– Updating data/metadata via SPARUL
Conceptual Data Integration & Linking API
Data Integration Layer
Query Engine
A P I Query Update Import Export
Linked Data V ir tu o s o Different datasets Different formats, models e.g. INSPIRE
Conclusions
• Easy integration
• Easy querying
• Hide complexities
•
Cross query over datasets
– Find the boreholes and their locations in all datasets if ground water samples where taken from them after the date 01/01/2005.
select ?bhole ?date ?latitude ?longitude where { {
select ?bhole ?date ?latitude ?longitude where { ?st sci:O5.removed ?sample;
crm:P4F.has_time-span ?date; sci:O4.sampled_at ?place;
crm:P2F.has_type "GRWATER_SAMPLING". ?bhole sci:O9.contains_or_confines ?place;
a sci:S16.Borehole; sci:O7.consists_of ?bcollar.
?bcollar crm:P87F.is_identified_by ?latitude; crm:P87F.is_identified_by ?longitude. filter (regex(?latitude, "Latitude")). filter (regex(?longitude, "Longitude")). filter(?date > "2005-01-01"^^xsd:date).
} limit 100 }
UNION {
select ?bhole ?date ?latitude ?longitude where { ?anal crm:P2F.has_type "QUALITY_MEASUREMENT";
Demo Example
UNION {
select ?bhole ?date ?latitude ?longitude where { ?st crm:P4F.has_time-span ?date;
sci:O4.sampled_at ?bhole. ?bhole a sci:S16.Borehole;
sci:O7.consists_of ?bcollar.
?bcollar crm:P87F.is_identified_by ?latitude; crm:P87F.is_identified_by ?longitude. filter (regex(?latitude, "Latitude")). filter (regex(?longitude, "Longitude")).
filter(?date > "2005-01-01"^^xsd:date). } limit 100
} } Cont’d
•
How to publish your data as Linked Open Data using
GSOM?
•
How to query your data?
•
How to export in different formats?
•
How to export your data as INSPIRE compliant data?
17:15 (45min)
Room E “Next Door”
Yannis Roussakis