CrEDIBLE: fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE
(federation of distributed data and knowledge in biomedical imaging)
Data fusion, semantic alignment, distributed queries
Johan Montagnat
CNRS, I3S lab, Modalis team
on behalf of the CrEDIBLE consortium
Motivations
● Biomedical data
– High heterogeneity: images, clinical data, biomarkers, biology...
– Increasing amount / number of (open) sources – Big Data
● Large-scale medical studies
(statistical medical studies, epidemiology...)
– Need for cross-factor analysis – Linked Data
● Data (re)analysis opportunities
● Translational research
● Centralized approaches encounter limitations
– Multiple data source kinds
– Large data volumes to transfer / archive / search
– Sensitive patient data / complex access control policies
– Need to adopt uniform data model & format
Biomedical data mediation & federation
– Data federation through distributed querying and query rewriting
[Figure: federation architecture. A Client queries a Federator (query decomposition, planning & results federation), which contacts one Mediator (ETL or query rewriting) per site (Site 1, Site 2); each Mediator provides query-based access to the site's data.]
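The client / federator / mediator roles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the term maps and the two toy sites are hypothetical and do not come from the CrEDIBLE software. Each mediator rewrites reference-model terms into its site's local column names; the federator forwards the request to every mediator and concatenates the answers.

```python
from typing import Dict, List

Row = Dict[str, str]


class Mediator:
    """Hypothetical mediator: maps reference-model terms to local columns."""

    def __init__(self, records: List[Row], term_map: Dict[str, str]):
        self.records = records      # site-local records
        self.term_map = term_map    # reference term -> local column name

    def query(self, wanted: List[str]) -> List[Row]:
        # Query rewriting: read the local column for each reference term
        # and return rows expressed in the reference vocabulary.
        return [{term: rec[self.term_map[term]] for term in wanted}
                for rec in self.records]


class Federator:
    """Hypothetical federator: forwards the query and merges the results."""

    def __init__(self, mediators: List[Mediator]):
        self.mediators = mediators

    def query(self, wanted: List[str]) -> List[Row]:
        results: List[Row] = []
        for mediator in self.mediators:
            results.extend(mediator.query(wanted))
        return results


# Two toy sites with different local schemas (illustrative data only).
site1 = Mediator([{"pat_name": "Alice", "modality": "MR"}],
                 {"subject": "pat_name", "imaging": "modality"})
site2 = Mediator([{"nom": "Bob", "type_image": "CT"}],
                 {"subject": "nom", "imaging": "type_image"})

federator = Federator([site1, site2])
rows = federator.query(["subject", "imaging"])
```

The client only sees the reference vocabulary (`subject`, `imaging`); the local schemas stay hidden behind the mediators.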
Domain ontology-based federation
[Figure: Client, Federator, one Mediator per site (Site 1, Site 2) with query-based access to data. Annotations: remote sub-queries; medical domain ontology (reference model); data querying; data.]
Challenges and expertise
● Challenges
– Representation of data semantics for heterogeneous data sources → biomedical ontology building
– Data federation → distributed query engine
– Data mediation → RDB2RDF, ontology alignment
● Partnership
– I3S/Modalis, INRIA/I3S/Wimmics, LTSI/MediCIS
– Expertise: semantic representation, distributed query engine
Scientific networking – annual workshop
● Objectives
– Multidisciplinary workshop gathering experts in biomedical data representation, semantics, distribution, federation, integration...
● October 2012: data integration, ontologies, data models,
reasoning
– Semantic models are now widely accepted, although ontology resources are still insufficient
– Existing systems are mostly centralized, but there is a stringent need to support multi-centric studies
● October 2013: data reuse, ontologies, mediation,
federation, graphs
– Both horizontal and vertical data partition schemes need to be
addressed
– Data models in use are often constructed bottom-up
Bibliography study
● Distributed (semantic) Query Processing
– Report CrEDIBLE-12-2, November 2012
– A. Gaignard PhD thesis, March 2013, U. Nice Sophia Antipolis
● Relational data mediation
– Report I3S 2013-04-FR, November 2013
– Submitted to Journal of Web Semantics
● Ontology of scientific measures (data acquisition,
Reference ontology
● 3-levels structure: foundational (DOLCE), core, domain
● Domain-specific rules
– Inference abilities
[Figure: ontology concept hierarchy. Labels: Particular, Endurant, Perdurant, Action, Dataset acquisitions, Conceptualization, Expression, Inscription, Dataset acquisition equipment, Studies, Examinations, Dataset processings, Datasets, Medical image expressions, Medical image files, Subjects, Investigators, Centres, Languages, Medical image formats]
● DataTop ontology
– Current focus on measurements
● Derived relational schema
Ontology modules
● Modularized ontology to improve reuse and lightweightness
– ONL-MR-DA: MR Dataset Acquisition
– ONL-DP: Data Processing
– ONL-MSA: Mental State Assessment
– OntoVIP: Medical Image Simulation
● Wide diffusion
Data query and federation engine
● KGRAM (Knowledge Graph Abstract Machine) semantic query engine:
– Full support of SPARQL 1.1
– Generic interface for heterogeneous backends
– Flexible architecture facilitating different deployment scenarios
● Mediation interface to access relational data
Distributed Query Processing
● Query federator decoupled from data sources
● Asynchronous querying of multiple data sources
● Query planning and parallel querying
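Asynchronous, parallel querying of several sources can be illustrated with Python's standard `concurrent.futures` module. This is a minimal sketch, not the KGRAM implementation: `make_source` and `federated_match` are hypothetical helpers, and each "data source" is just an in-memory list of triples. The same triple pattern is submitted to every source at once, and results are merged in whatever order the sources answer.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_source(triples):
    """Hypothetical data source: matches an (s, p, o) pattern, None = wildcard."""
    def match(pattern):
        s, p, o = pattern
        return [t for t in triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]
    return match


def federated_match(sources, pattern):
    """Send the same pattern to all sources in parallel and union the results."""
    results = set()
    if not sources:
        return results
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [pool.submit(source, pattern) for source in sources]
        # as_completed yields each future as soon as its source has answered,
        # so a slow source does not block the merging of faster ones.
        for future in as_completed(futures):
            results.update(future.result())
    return results


# Illustrative data only.
sources = [
    make_source([("ex:p1", "foaf:name", "Bob Smith")]),
    make_source([("ex:p2", "foaf:name", "Alice Martin"),
                 ("ex:p2", "ex:age", "42")]),
]
names = federated_match(sources, (None, "foaf:name", None))
```

Collecting into a set makes the merged result independent of the order in which sources respond.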
● KGRAM query processing
SELECT ?name ?date
WHERE {
  ?x foaf:name ?name .
  ?x dbpedia:birthDate ?date .
  FILTER (CONTAINS (?name, 'Bob'))
}
● Asynchronous execution
[Figure: animated sequence. The query Q enters through the Interface into the KGRAM core, which issues sub-queries Q1, then Q2' and Q2'', to Data source #1 and Data source #2.]
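One plausible reading of the Q1 / Q2 labels is a bind join: the first triple pattern is evaluated on every source, and each resulting binding of ?x is injected into the second pattern before it is sent out again. The sketch below assumes that reading; it is not taken from the KGRAM code, the `match` and `answer` helpers and the toy data are hypothetical, and the loops run sequentially for clarity.

```python
def match(source, predicate, subject=None):
    """Return (subject, object) pairs for a predicate, optionally with a bound subject."""
    return [(s, o) for (s, p, o) in source
            if p == predicate and (subject is None or s == subject)]


def answer(sources):
    """Evaluate the slide's example query over a list of triple sources."""
    solutions = []
    for source in sources:
        # Q1: ?x foaf:name ?name, sent to every source, with the FILTER applied.
        for x, name in match(source, "foaf:name"):
            if "Bob" not in name:
                continue
            # Q2 with ?x bound: <x> dbpedia:birthDate ?date, again sent to every source.
            for other in sources:
                for _, date in match(other, "dbpedia:birthDate", subject=x):
                    solutions.append((name, date))
    return solutions


# Illustrative data: names live in one source, birth dates in the other.
source1 = [("ex:p1", "foaf:name", "Bob Smith"),
           ("ex:p2", "foaf:name", "Alice Martin")]
source2 = [("ex:p1", "dbpedia:birthDate", "1970-01-01"),
           ("ex:p2", "dbpedia:birthDate", "1980-05-05")]

solutions = answer([source1, source2])
```

Binding ?x before sending the second pattern keeps remote results small, at the cost of one remote call per binding.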
Performance results
● Mixed (relational / semantic) stores querying
Exploitation in ANR GINSENG
(epidemiology)
● Partnership with GINSENG
consortium and Mnemotix SME
● Federation of heterogeneous
epidemiology repositories
– Multiple epidemiology data
acquisition networks
– Cross-correlation with “external”
data (e.g. demographic: IGN)
Collaboration with Pitié-Salpêtrière
● CAC database
– Neurodegenerative diseases (esp. Alzheimer's)
– Mediation of relational CAC schema
● Comparative study of different relational-to-RDF
mediation techniques
– R2RML transformation language
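To make relational-to-RDF mediation concrete, here is a hand-rolled static conversion (the ETL style) written with Python's standard `sqlite3` module. It is a sketch under assumptions and does not implement R2RML: the `subject` table, the `http://example.org/` base IRI and the column-to-predicate map are all hypothetical and unrelated to the actual CAC schema.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base IRI


def rows_to_triples(conn, table, key, column_to_predicate):
    """Turn each row into triples: one subject per row, one predicate per column.

    Table and column names are trusted here (illustration only), and literal
    values are not escaped.
    """
    columns = [key] + list(column_to_predicate)
    cursor = conn.execute(f"SELECT {', '.join(columns)} FROM {table}")
    triples = []
    for row in cursor:
        subject = f"<{BASE}{table}/{row[0]}>"
        for column, value in zip(columns[1:], row[1:]):
            if value is not None:  # NULL columns produce no triple
                predicate = f"<{column_to_predicate[column]}>"
                triples.append((subject, predicate, f'"{value}"'))
    return triples


# Toy relational source (illustrative schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT, birth TEXT)")
conn.execute("INSERT INTO subject VALUES (1, 'Bob', '1970-01-01')")
conn.execute("INSERT INTO subject VALUES (2, 'Alice', NULL)")

triples = rows_to_triples(
    conn, "subject", "id",
    {"name": BASE + "name", "birth": BASE + "birthDate"},
)
```

A declarative mapping language such as R2RML expresses the same table-to-subject and column-to-predicate correspondences as data rather than code.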
Perspectives
● DataTop ontology development
– Data distribution (spatial and temporal) and data collections
structure
– Alzheimer's disease
● Query engine architecture
– Code modularity for alternative optimization strategy
implementation
● Query performance optimization
– Query plan improvement (especially when dealing with both horizontal and vertical partitioning)
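Why the partitioning scheme matters for query planning can be shown on toy data. In the sketch below (hypothetical helpers and data, not project code), a join between names and birth dates can be pushed entirely to the sites when data is horizontally partitioned, but must be computed by the federator when the two properties are stored on different sites (vertical partitioning).

```python
def local_join(triples):
    """Join name and birthDate on the subject within one set of triples."""
    names = {s: o for s, p, o in triples if p == "name"}
    dates = {s: o for s, p, o in triples if p == "birthDate"}
    return {(names[s], dates[s]) for s in names if s in dates}


def union_of_local_joins(sites):
    """Plan A: each site evaluates the whole join; the federator only unions."""
    return set().union(*(local_join(site) for site in sites))


def federated_join(sites):
    """Plan B: the federator gathers matching triples and joins them itself."""
    return local_join([triple for site in sites for triple in site])


# Horizontal partitioning: each site holds complete records for its own subjects.
horizontal = [
    [("p1", "name", "Bob"), ("p1", "birthDate", "1970")],
    [("p2", "name", "Alice"), ("p2", "birthDate", "1980")],
]
# Vertical partitioning: each site holds one property for all subjects.
vertical = [
    [("p1", "name", "Bob"), ("p2", "name", "Alice")],
    [("p1", "birthDate", "1970"), ("p2", "birthDate", "1980")],
]

expected = {("Bob", "1970"), ("Alice", "1980")}
```

On `horizontal`, both plans return `expected`; on `vertical`, plan A returns an empty set and only plan B finds the answers, which is why a planner that handles both schemes cannot always delegate joins to the sources.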
Perspectives: collaborations
● Université Laval, Québec
– Ontology design: Alzheimer's disease data
● ANR CONTINT 2013 BIOMIST
– Industrial system for large database management
(Product Lifecycle Management - PLM)
– Application to medical databases in the area of brain function analysis
● France Life Imaging excellence infrastructure
– National-level infrastructure for medical data sharing
Publications
● Peer-reviewed
– Medical domain: iDash image informatics workshop, Sept. 2012; DCICTIA-MICCAI workshop, Oct. 2012
– Distributed Query Processing: Web Intelligence, Dec. 2012; Journal of Web Semantics (submitted)
– Mediation: Journal of Web Semantics (submitted)
● PhD theses
– A. Gaignard (March 2013, U. Nice Sophia Antipolis)
– N. Cerezo (December 2013, U. Nice Sophia Antipolis)
● Research reports
Conclusions
● Query-based data federation grounded in semantic web standards (SPARQL, RDF, RDFS)
– Emphasis on query language expressivity
– Support for both horizontal and vertical data partitioning
– Broad applicability (provided that ontologies are available)
● Ontology-based
– Reference model for data alignment and query terms
● Currently using Extract-Transform-Load
– Static conversion of data sources to RDF format
– Work on dynamic data mediation ongoing
● Optimization work ongoing
– Flexible software architecture