Knowledge discovery from biological Big Data: scalability issues
Marie-Dominique Devignes, Malika Smaïl, Emmanuel
Bresso, Adrien Coulet, Chedy Raïssi, Amedeo Napoli
Université de Lorraine, LORIA laboratory and INRIA
Nancy Grand-Est, Orpailleur team, Nancy, France
http://orpailleur.loria.fr/
http://www.loria.fr
From (big) data to knowledge
Raw Data -> Information -> Knowledge (K)
KDD: Knowledge Discovery from Databases
An iterative and interactive process
Knowledge supports problem solving and decision making
A « big data » story in the life sciences
Presented by Russ Altman (PharmGKB) on YouTube
◦ EngX webinar at Stanford Engineering School, Nov 12, 2013
Data -> Information -> Knowledge (K)
Adverse event interpretation in electronic medical records
1. FDA Adverse Event Reporting System (FAERS)
2. Data subset related to eight classes of side effects
3. Supervised statistical machine learning
4. Correlation models: adverse reactions due to drug-drug interactions
Tatonetti et al., A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. JAMIA, 19:79-85, 2012
The KDD bottlenecks in the life sciences
Data and data sources
◦ Noisy, complex, heterogeneous, distributed, dynamic, etc.
◦ Need for « knowledge/model-driven » data integration
Data selection
◦ Example and feature selection for machine learning
◦ Need for guidelines
Parameters of data mining programs
◦ Experimental approach
◦ Need for efficient execution platforms
Pattern evaluation and interpretation
◦ Big data mining can yield a large volume of patterns!
◦ How to evaluate the novelty, significance and consistency of a pattern at large scale?
Objectives of the talk
1. How do big data and biological databases cooperate?
2. How can bio-ontologies help in knowledge discovery?
3. Big data opportunities for knowledge discovery
Biological databases are Big Data
More than 1500 biological databases today
◦ Curated data (not always)
◦ Complex schema
◦ Time-consuming update and integration
UniProt statistics, Nov 2013:
Swiss-Prot: > 542 thousand sequences, 192 million amino acids
TrEMBL: > 48 million sequences, 15 billion amino acids
Linked Open Data (LOD)
◦ Interconnected data
◦ Freely accessible on the web
◦ RDF (Resource Description Framework)
{Subject, Property, Object} triples
URI (Uniform Resource Identifier)
◦ Bio2RDF project: 1 Tera triple graph in July 2013
Uniprot
Semantic web* as emerging biological Big Data
*The Semantic Web is a group of technologies that allow computers to process information resources autonomously, without human intervention, by annotating them with their meaning, or "semantics" (term coined by Tim Berners-Lee in 1998).
Figure: KeggPathway hsa:nnn, KeggGene hsa:ggg, Uniprot sp:ppp and Interpro ipr:ddd entries, linked by Has_gene, See_Also and Xref properties
From databases to RDF triples
« RDFization » of database contents
◦ Database fact -> RDF triple
◦ Database -> Graph
◦ e.g. A protein P:pppp containing a domain D:ddd
= « EBI SPARQL endpoint »
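The « RDFization » step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Bio2RDF or EBI pipeline: the row layout, the `http://example.org/` namespace and the `contains_domain` property are hypothetical names chosen for the example.

```python
# Hypothetical namespace; real LOD datasets define their own URIs.
BASE = "http://example.org/"

def row_to_triples(row):
    """Turn one relational row (protein, domain) into RDF-style triples."""
    protein = BASE + "protein/" + row["protein_id"]
    domain = BASE + "domain/" + row["domain_id"]
    # contains_domain is an illustrative property, not a real vocabulary term.
    return [(protein, BASE + "contains_domain", domain)]

def to_ntriples(triples):
    """Serialize (subject, property, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

# One database fact: protein P:pppp contains domain D:ddd.
rows = [{"protein_id": "pppp", "domain_id": "ddd"}]
triples = [t for row in rows for t in row_to_triples(row)]
print(to_ntriples(triples))
```

Each row of the table becomes one edge of the graph, which is why a whole database maps naturally onto an RDF graph.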
Cooperation between LOD and databases
Classical databases can provide reliable curated information
to complement and enrich information extracted from LOD
Project EXPLOD-BioMed (Adrien Coulet)
◦ Exploring LOD for the purpose of mining biomedical data
◦ Collect data about the genes responsible for intellectual disability
Use Bio2RDF or EBI/RDF SPARQL endpoints
◦ Incomplete « RDFization » -> complete the datasets by querying classical databases + RDF representation of results
Storing retrieved RDF triples into a triple store
◦ Or… back to a relational DB (!) for easy design of KDD workflows using Knowledge Discovery Environments (such as KNIME)
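The « back to a relational DB » option can be illustrated with Python's built-in sqlite3 module. This is only a sketch: the triples below are made-up stand-ins for results retrieved from a SPARQL endpoint, and the single three-column `triple` table is an assumed layout, not the schema used in EXPLOD-BioMed.

```python
import sqlite3

# Toy triples standing in for results retrieved from a SPARQL endpoint.
triples = [
    ("gene:G1", "associated_with", "disease:intellectual_disability"),
    ("gene:G1", "encodes", "protein:P1"),
    ("gene:G2", "associated_with", "disease:intellectual_disability"),
]

# Assumed layout: one table with one column per triple component.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (subject TEXT, property TEXT, object TEXT)")
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

# A plain SQL query can now feed a KDD workflow.
genes = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT subject FROM triple "
        "WHERE property = ? AND object = ? ORDER BY subject",
        ("associated_with", "disease:intellectual_disability"),
    )
]
print(genes)
```

Once the triples sit in a relational table, any workflow tool that reads SQL sources can consume them without a dedicated RDF connector.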
Flexibility versus Semantics: research opportunities
Moving from relational DB to NoSQL storage systems
◦ Schema-less data -> lack of documentation, loss of semantics
◦ New management systems to be invented
Analytic tools need to be adapted to such systems
◦ Mahout …
◦ MOA …
◦ PEGASUS …
Fayyad UM (2012) Big data everywhere and No SQL in sight.
SIGKDD explorations, 14: i-ii
KDDK: Knowledge Discovery guided by Domain Knowledge in the Orpailleur team
Data integration: DB1, DB2, DB3, … etc.
1. Data extraction and formatting
2. Data mining
3. Result interpretation
Domain knowledge: Knowledge Base (KB)
Bio-ontologies, an asset in the life sciences
Ontologies = knowledge representation
◦ From hierarchical vocabularies
e.g. MeSH, MedDRA, GO, SNOMED, ICD…
◦ To logical representation of concepts and relationships
e.g. SIO (Semanticscience Integrated Ontology), UMLS Semantic Types, SOPharm…
Usages (semantic web technologies)
◦ Model layer of knowledge bases
◦ Semantic enrichment
e.g. Onto-Tools, IntelliGO
◦ Cross-resource data retrieval
National Center for Biomedical Ontology: NCBO BioPortal
Bio-ontologies and LOD exploration
39 biological resources (UniProt, GO, ArrayExpress, GEO, PharmGKB, etc.): 5 million records
366 bio-ontologies at the NCBO BioPortal: 6 million concepts
24.8 billion annotations
(Jonquet C et al. (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324)
Bio-ontologies and dimension reduction
Big data often means high-dimensional data
◦ Statistical methods for feature selection
Many possible methods
◦ Clustering similar features using a terminology and a semantic similarity measure
E.g. semantic clustering of 1288 MedDRA adverse effect terms -> 112 term clusters
Enables execution of symbolic data mining methods such as frequent itemset search
Bresso et al. (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics, 14(1):207.
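The idea of reducing dimensions by semantic clustering can be sketched as follows. This is a toy illustration, not the method of Bresso et al.: the four-term hierarchy is invented (it is not real MedDRA content), Jaccard similarity over ancestor sets is just one possible semantic similarity measure, and the greedy grouping with a 0.8 threshold is an assumption made for brevity.

```python
# Hypothetical toy hierarchy: term -> set of ancestor terms (not real MedDRA).
ANCESTORS = {
    "nausea": {"gastrointestinal disorder", "disorder"},
    "vomiting": {"gastrointestinal disorder", "disorder"},
    "rash": {"skin disorder", "disorder"},
    "pruritus": {"skin disorder", "disorder"},
}

def similarity(a, b):
    """Jaccard similarity of the ancestor sets (one of many possible measures)."""
    sa, sb = ANCESTORS[a], ANCESTORS[b]
    return len(sa & sb) / len(sa | sb)

def cluster_terms(terms, threshold=0.8):
    """Greedy clustering: a term joins the first cluster whose seed is similar enough."""
    clusters = []
    for term in terms:
        for cluster in clusters:
            if similarity(term, cluster[0]) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters

clusters = cluster_terms(sorted(ANCESTORS))

# Replace each term by the index of its cluster: 4 features become 2.
term_to_cluster = {t: i for i, c in enumerate(clusters) for t in c}
profile = {"nausea", "vomiting", "rash"}  # side effects of a hypothetical drug
reduced = {term_to_cluster[t] for t in profile}
print(clusters, reduced)
```

With fewer, coarser features, each drug profile becomes a short itemset, which is what makes symbolic methods such as frequent itemset search tractable.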
Big data as a reservoir of data for validating hypotheses and models
Huge data sets become available for mining
◦ “The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference” (Edd Dumbill, Big Data Now, 2012)
◦ Adverse events -> grouping medical records from different hospitals is useful to enlarge the dataset
Data mining often generates more than one model, and sometimes a huge number of patterns
◦ Training set requires integrated curated data
The critical « Vs » of Big Data in the Life Sciences
Variety and variability
◦ New data types provided by high-throughput technologies (OMICS data but also images from microscopy devices…)
Value
◦ FAERS and drug-drug interactions -> better control of drug treatments
◦ Individual genomes -> personalized medicine
Veracity
◦ Multiple source integration means detecting and managing possible inconsistencies
◦ Quality and provenance metadata in the LOD
Bio2RDF uses Dublin Core metadata triples and calculates 9 metrics for each dataset
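Detecting inconsistencies across integrated sources can be sketched as a grouping problem, keeping the source of each value as provenance. This is a minimal sketch on invented facts; the (entity, attribute, value, source) layout and the DB1/DB2/DB3 names are assumptions for illustration, and it does not reproduce the Bio2RDF metrics.

```python
from collections import defaultdict

# Hypothetical facts: (entity, attribute, value, source).
facts = [
    ("P1", "organism", "Homo sapiens", "DB1"),
    ("P1", "organism", "Homo sapiens", "DB2"),
    ("P2", "organism", "Mus musculus", "DB1"),
    ("P2", "organism", "Homo sapiens", "DB3"),
]

def find_conflicts(facts):
    """Return the (entity, attribute) pairs for which sources disagree."""
    by_key = defaultdict(lambda: defaultdict(set))
    for entity, attribute, value, source in facts:
        by_key[(entity, attribute)][value].add(source)
    # A conflict is any key with more than one distinct value.
    return {
        key: {value: sorted(sources) for value, sources in values.items()}
        for key, values in by_key.items()
        if len(values) > 1
    }

conflicts = find_conflicts(facts)
print(conflicts)
```

Keeping the source next to each conflicting value lets a curator decide which database to trust instead of silently picking one.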
New paradigms for knowledge discovery
Cooperation between symbolic and statistical methods
◦ Statistical feature selection before symbolic data mining
◦ Automatic filtering and/or ranking of patterns using statistical significance measurements before expert interpretation
Adaptive learning systems
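The cooperation between symbolic mining and statistical ranking can be sketched on toy data. The transactions are invented, the frequent itemset search is brute force and limited to pairs, and lift is used as a simple stand-in for a statistical measure; a real pipeline would more likely apply a proper significance test before handing patterns to an expert.

```python
from itertools import combinations

# Toy transactions: each set lists the features observed for one object.
transactions = [
    {"A", "B"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "C"},
    {"C"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_pairs(min_support):
    """Brute-force search for frequent 2-itemsets (fine only for toy data)."""
    items = sorted(set().union(*transactions))
    return [
        frozenset(pair)
        for pair in combinations(items, 2)
        if support(frozenset(pair)) >= min_support
    ]

def lift(pair):
    """Observed co-occurrence divided by the one expected under independence."""
    a, b = sorted(pair)
    return support(pair) / (support({a}) * support({b}))

# Symbolic step: mine frequent pairs. Statistical step: rank them by lift.
ranked = sorted(frequent_pairs(0.5), key=lift, reverse=True)
# Keep only pairs that co-occur more often than expected by chance.
kept = [pair for pair in ranked if lift(pair) > 1]
print(ranked, kept)
```

Here both {A, B} and {A, C} are frequent, but only {A, B} has a lift above 1, so the expert would be shown one pattern instead of two.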
Other projects in the Orpailleur team
Research projects
◦ Parallelization of CORON tools (http://coron.loria.fr)
A suite of tools for symbolic data mining and formal concept analysis
◦ Text mining (ANR Hybride: http://hybride.loria.fr/)
Collaboration with Orphanet
◦ Graph mining for chemical reactions
Pennerath F, Niel G, Vismara P, Jauffret P, Laurenço C, Napoli A (2010) A graph-mining algorithm for the evaluation of bond formability. Journal of Chemical Information and Modeling, 50:221-239.
◦ Spatio-temporal mining of agronomical data
Mari JF, Lazrak E-G, Benoît M (2013) Time space stochastic modelling of agricultural landscapes for environmental issues. Environmental Modelling and Software 46:219-227
Education: TELECOM Nancy (http://www.telecomnancy.eu/)
◦ Training engineers as « Data Scientists », Masters level
Conclusion
LOD and biological databases can cooperate in the KDD process
Bio-ontologies are a major asset in the Life Sciences
◦ For data exploration
◦ For dimension reduction
Semantic web technology scales up at RDF level
◦ But not yet at the OWL and reasoning level
HPC resources and programs can process big data
References
Big Data Now. O’Reilly Media Inc., 1st edition, October 2012, www.it-ebooks.info (123 p.)
Bresso E, Grisoni R, Marchetti G, Karaboga AS, Souchet M, Smaïl-Tabbone M (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics. 14:207.
Callahan A, Cruz-Toledo J, Dumontier M (2013) Ontology-Based Querying with Bio2RDF's Linked Open Data. J Biomed Semantics. 15:4
Coakley MF, Leerkes MR, Barnett J, Gabrielian AE, Noble K, Weber MN and Huyen Y (2013) Unlocking the power of big data at the NIH (Meeting Report). Big Data, September 2013, 183-186.
Coulet A, Smaïl-Tabbone M, Napoli A, Devignes MD (2011) Ontology-based knowledge discovery in pharmacogenomics. Adv Exp Med Biol. 696:357-66
Fan W and Bifet A (2012) Mining big data: current status and forecast to the future. SIGKDD explorations, 14:1-5
Fayyad U (2012) Big data everywhere and No SQL in sight. SIGKDD explorations, 14: i-ii
Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N and Kolker E (2013) Unraveling the complexities of life sciences data. Big Data, March 2013, 42-50
Hoehndorf R, Dumontier M and Gkoutos G (2012) Evaluation of research in biomedical ontologies. Briefings in Bioinformatics. Sept 8, 2012, 1-17.
Jonquet C, Lependu P, Falconer S, Coulet A, Noy NF, Musen MA, Shah NH (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324.
Tatonetti NP, Fernald GH, Altman RB (2012) A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc. 19:79-85