Knowledge discovery from biological Big Data: scalability issues

(1)

Marie-Dominique Devignes, Malika Smaïl, Emmanuel Bresso, Adrien Coulet, Chedy Raïssi, Amedeo Napoli

Université de Lorraine, LORIA laboratory and INRIA Nancy Grand-Est, Orpailleur team, Nancy, France

http://orpailleur.loria.fr/

http://www.loria.fr

(2)

From (big) data to knowledge

Raw Data -> Information -> Knowledge (K)

KDD: Knowledge Discovery from Databases, an iterative and interactive process

Problem solving, decision making

(3)

A « big data » story in the life sciences



• Presented by Russ Altman (PharmGKB) on YouTube
  ◦ EngX webinar at Stanford Engineering School, Nov 12, 2013

Data -> Information -> Knowledge (K)

Adverse event interpretation in electronic medical records

1. FDA Adverse Event Reporting System (FAERS)
2. Data subset related to eight classes of side effects
3. Supervised statistical machine learning
4. Correlation models: adverse reactions due to drug-drug interactions

Tatonetti et al., A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. JAMIA, 19:79-85, 2012

(4)

The KDD bottlenecks in the life sciences



• Data and data sources
  ◦ Noisy, complex, heterogeneous, distributed, dynamic, etc.
  ◦ Need for « knowledge/model-driven » data integration
• Data selection
  ◦ Example and feature selection for machine learning
  ◦ Need for guidelines
• Parameters of data mining programs
  ◦ Experimental approach
  ◦ Need for efficient execution platforms
• Pattern evaluation and interpretation
  ◦ Big data mining can yield a big volume of patterns!
  ◦ How to evaluate the novelty, significance and consistency of a pattern at large scale?

(5)

Objectives of the talk

1. How do big data and biological databases cooperate?
2. How can bio-ontologies help in knowledge discovery?
3. Big data opportunities for knowledge discovery

(6)

Biological databases are Big Data



• More than 1500 biological databases today
  ◦ Curated data (not always)
  ◦ Complex schema
  ◦ Time-consuming update and integration

UniProt statistics, Nov 2013:
  ◦ SwissProt: > 542 thousand sequences, 192 million amino acids
  ◦ TrEMBL: > 48 million sequences, 15 billion amino acids

(7)



Semantic Web* as emerging biological Big Data

• Linked Open Data (LOD)
  ◦ Interconnected data
  ◦ Freely accessible on the web
  ◦ RDF (Resource Description Framework)
    ▪ {Subject, Property, Object}
    ▪ URI (Uniform Resource Identifier)
  ◦ Bio2RDF project
    ▪ 1 Tera triple graph in July 2013

[Figure: example graph linking KeggPathway hsa:nnn, KeggGene hsa:ggg, Uniprot sp:ppp and Interpro ipr:ddd through Has_gene, See_Also and Xref properties]

*The Semantic Web is a group of technologies that allow computers to process information resources autonomously, without human intervention, by annotating their meaning (their "semantics"). The term was coined by Tim Berners-Lee in 1998.

(8)

From databases to RDF triples



• « RDFization » of database contents
  ◦ Database fact -> RDF triple
  ◦ Database -> Graph
  ◦ e.g. a protein P:pppp containing a domain D:ddd

[Figure: « EBI SPARQL endpoint »]
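The "database fact -> RDF triple" step can be illustrated with a minimal Python sketch. It is not the Bio2RDF or EBI conversion code: the `http://example.org/` namespace, the `has_domain` and `label` properties, and the identifiers `P0001` and `D001` are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace, used for illustration only.
BASE = "http://example.org/"


def row_to_triples(row: Dict[str, str]) -> Iterator[Triple]:
    """Turn one database fact (a protein containing a domain) into triples."""
    protein = f"{BASE}protein/{row['protein_id']}"
    domain = f"{BASE}domain/{row['domain_id']}"
    yield (protein, f"{BASE}vocab/has_domain", domain)
    yield (protein, f"{BASE}vocab/label", row["protein_name"])


def to_ntriples(triple: Triple) -> str:
    """Serialize a triple as one N-Triples line; non-URI objects become literals."""
    s, p, o = triple
    if o.startswith("http://"):
        obj = f"<{o}>"
    else:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{s}> <{p}> {obj} ."


# One relational row (hypothetical identifiers) becomes two graph edges.
row = {"protein_id": "P0001", "protein_name": "Example protein", "domain_id": "D001"}
triples = list(row_to_triples(row))
lines = [to_ntriples(t) for t in triples]
for line in lines:
    print(line)
```

Applied to every row of a table, the same mapping turns the whole database into a graph whose nodes are the URIs.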

(9)

Cooperation between LOD and databases



• Classical databases can provide reliable curated information to complement and enrich information extracted from LOD
• Project EXPLOD-BioMed (Adrien Coulet)
  ◦ Exploring LOD for the purpose of mining biomedical data
  ◦ Collect data about the genes responsible for intellectual disability
• Use Bio2RDF or EBI/RDF SPARQL endpoints
  ◦ Incomplete « RDFization » -> complete the datasets by querying classical databases + RDF representation of results
• Storing retrieved RDF triples into a triple store
  ◦ Or… back to a relational DB (!) for easy design of KDD workflows using Knowledge Discovery Environments (such as KNIME)
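The "back to a relational DB" option can be sketched as follows, assuming the retrieved triples fit in a single three-column table. This is not the EXPLOD-BioMed implementation: the gene, protein and domain identifiers and the property names are invented, and an in-memory SQLite database stands in for whatever relational system a KDD workflow would use.

```python
import sqlite3

# Hypothetical triples, standing in for results retrieved from a SPARQL endpoint.
triples = [
    ("gene:G1", "associated_with", "disease:intellectual_disability"),
    ("gene:G2", "associated_with", "disease:intellectual_disability"),
    ("gene:G1", "encodes", "protein:P1"),
    ("gene:G2", "encodes", "protein:P2"),
    ("protein:P1", "has_domain", "domain:D1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (subject TEXT, property TEXT, object TEXT)")
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

# A self-join follows two graph edges: gene -> protein -> domain.
query = """
SELECT g.subject AS gene, d.object AS domain
FROM triple AS g
JOIN triple AS d ON d.subject = g.object
WHERE g.property = 'encodes' AND d.property = 'has_domain'
ORDER BY gene
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

Each hop in the graph costs one self-join, which is the price of keeping graph data in a table that workflow tools can read directly.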

(10)

Flexibility versus Semantics: research opportunities

• Moving from relational DB to NoSQL storage systems
  ◦ Schema-less data -> lack of documentation, loss of semantics
  ◦ New management systems to be invented
• Analytic tools need to be adapted to such systems
  ◦ Mahout …
  ◦ MOA …
  ◦ PEGASUS …

Fayyad UM (2012) Big data everywhere and No SQL in sight. SIGKDD Explorations, 14: i-ii

(11)

Objectives of the talk

1. How do big data and biological databases cooperate?
2. How can bio-ontologies help in knowledge discovery?
3. Big data opportunities for knowledge discovery

(12)

KDDK: Knowledge Discovery guided by Domain Knowledge in the Orpailleur team

[Figure: KDDK workflow. Data from DB1, DB2, DB3, etc. go through data integration into a Knowledge Base (KB) holding data and domain knowledge. Data mining then proceeds in three steps:
1. Data extraction and formatting
2. Data mining
3. Result interpretation]

(13)

Bio-ontologies, an asset in the life sciences



• Ontologies = knowledge representation
  ◦ From hierarchical vocabularies
    ▪ e.g. MeSH, MedDRA, GO, SNOMED, ICD…
  ◦ To logical representation of concepts and relationships
    ▪ e.g. SIO (Semanticscience Integrated Ontology), UMLS Semantic Types, SOPharm…
• Usages (semantic web technologies)
  ◦ Model layer of knowledge bases
  ◦ Semantic enrichment
    ▪ e.g. Onto-Tools, IntelliGO
  ◦ Cross-resource data retrieval

(14)

National Center for Biomedical Ontology: NCBO BioPortal

(15)

Bio-ontologies and LOD exploration

• 39 biological resources: UniProt, GO, ArrayExpress, GEO, PharmGKB, etc.
  ◦ 5 Mega records
  ◦ 24.8 Giga annotations
• 366 bio-ontologies at the NCBO BioPortal
  ◦ 6 Mega concepts

(Jonquet C et al. (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324)

(16)

(17)

Bio-ontologies and dimension reduction



• Big data often mean high-dimensional data
  ◦ Statistical methods for feature selection
    ▪ Many possible methods
  ◦ Clustering similar features using a terminology and a semantic similarity measure
    ▪ E.g. semantic clustering of 1288 MedDRA adverse effect terms -> 112 term clusters
    ▪ Enables execution of symbolic data mining methods such as frequent itemset search

Bresso et al. (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics, 14(1):207.
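The idea of clustering features with a terminology and a semantic similarity measure can be sketched on a toy hierarchy. This is not the method of Bresso et al.: the terms below are invented rather than real MedDRA codes, the similarity is a Wu-Palmer-style measure chosen as an assumption, and the clustering is a simple single-link grouping with an arbitrary threshold.

```python
from typing import Dict, List, Optional

# Toy is-a hierarchy (child -> parent); invented terms, not real MedDRA codes.
PARENT: Dict[str, Optional[str]] = {
    "adverse_effect": None,
    "cardiac": "adverse_effect",
    "gastric": "adverse_effect",
    "tachycardia": "cardiac",
    "bradycardia": "cardiac",
    "nausea": "gastric",
    "vomiting": "gastric",
}


def ancestors(term: str) -> List[str]:
    """Path from a term up to the root, the term itself included."""
    path: List[str] = []
    node: Optional[str] = term
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def similarity(a: str, b: str) -> float:
    """Wu-Palmer-style score: 2 * depth(common ancestor) / (depth(a) + depth(b))."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    lowest_common = next(t for t in anc_a if t in set(anc_b))
    return 2 * len(ancestors(lowest_common)) / (len(anc_a) + len(anc_b))


def cluster(terms: List[str], threshold: float) -> List[List[str]]:
    """Single-link grouping: a term joins every cluster holding a similar term."""
    clusters: List[List[str]] = []
    for term in terms:
        linked = [c for c in clusters
                  if any(similarity(term, t) >= threshold for t in c)]
        merged = sorted([term] + [t for c in linked for t in c])
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


terms = ["tachycardia", "bradycardia", "nausea", "vomiting"]
clusters = cluster(terms, threshold=0.6)

# Dimension reduction: describe a profile by cluster index instead of raw term.
term_to_cluster = {t: i for i, c in enumerate(clusters) for t in c}
profile = {"tachycardia", "bradycardia", "nausea"}
reduced = sorted({term_to_cluster[t] for t in profile})
print(clusters, reduced)
```

Four term-level features collapse into two cluster-level features here. The same mechanism, at a different scale, is what makes symbolic methods such as frequent itemset search tractable.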

(18)

Objectives of the talk

1. How do big data and biological databases cooperate?
2. How can bio-ontologies help in knowledge discovery?
3. Big data opportunities for knowledge discovery

(19)

Big data as a reservoir of data for validating hypotheses and models

• Huge data sets become available for mining
  ◦ “The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference” (Edd Dumbill, Big Data Now, 2012)
  ◦ Adverse events -> grouping medical records from different hospitals is useful to enlarge the dataset
• Data mining often generates more than one model, sometimes a huge number of patterns
  ◦ Training set requires integrated curated data

(20)

The critical « Vs » of Big Data in the Life Sciences

• Variety and variability
  ◦ New data types provided by high-throughput technologies (OMICS data but also images from microscopy devices…)
• Value
  ◦ FAERS and drug-drug interactions -> better control of drug treatments
  ◦ Individual genomes -> personalized medicine
• Veracity
  ◦ Multiple source integration means detecting and managing possible inconsistencies
  ◦ Quality and provenance metadata in the LOD
    ▪ Bio2RDF uses Dublin Core metadata triples and calculates 9 metrics for each dataset
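Detecting inconsistencies during multi-source integration can be sketched in a few lines, assuming each fact keeps a provenance tag naming its source. The source names, gene identifiers and values below are invented. The sketch does not reproduce the Bio2RDF metrics; it only shows why provenance has to travel with the data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Fact = Tuple[str, str, str, str]  # (source, subject, property, value)

# Hypothetical facts collected from three sources.
facts: List[Fact] = [
    ("db_a", "gene:G1", "chromosome", "X"),
    ("db_b", "gene:G1", "chromosome", "X"),
    ("lod_c", "gene:G1", "chromosome", "7"),
    ("db_a", "gene:G2", "chromosome", "3"),
    ("lod_c", "gene:G2", "chromosome", "3"),
]


def find_conflicts(items: List[Fact]) -> Dict[Tuple[str, str], Dict[str, List[str]]]:
    """Group values by (subject, property); keep groups with more than one value."""
    values: Dict[Tuple[str, str], Dict[str, set]] = defaultdict(lambda: defaultdict(set))
    for source, subject, prop, value in items:
        values[(subject, prop)][value].add(source)
    return {
        key: {value: sorted(sources) for value, sources in by_value.items()}
        for key, by_value in values.items()
        if len(by_value) > 1
    }


conflicts = find_conflicts(facts)
print(conflicts)
```

The output lists, for each conflicting fact, which sources support which value, so that a curator or a trust rule can decide how to resolve it.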

(21)

New paradigms for knowledge discovery



• Cooperation between symbolic and statistical methods
  ◦ Statistical feature selection before symbolic data mining
  ◦ Automatic filtering and/or ranking of patterns using statistical significance measurements before expert interpretation
• Adaptive learning systems
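Ranking symbolic patterns by a statistical significance measure can be sketched as follows. The transactions are invented, the mining step is restricted to item pairs for brevity, and the one-sided Fisher exact test (hypergeometric right tail) is one possible significance measure among many, chosen here as an assumption.

```python
from itertools import combinations
from math import comb
from typing import List, Set, Tuple

# Toy transactions (e.g. feature sets per drug); invented for illustration.
transactions: List[Set[str]] = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "D"},
    {"C", "D"},
    {"A", "B", "C"},
    {"D"},
    {"C"},
    {"A", "D"},
]


def frequent_pairs(data: List[Set[str]], min_support: int) -> List[Tuple[str, str, int]]:
    """Symbolic step: item pairs that co-occur in at least min_support transactions."""
    items = sorted(set().union(*data))
    result = []
    for x, y in combinations(items, 2):
        support = sum(1 for t in data if x in t and y in t)
        if support >= min_support:
            result.append((x, y, support))
    return result


def right_tail_p(k: int, n_x: int, n_y: int, n: int) -> float:
    """Statistical step: P(co-occurrence >= k) if x and y were independent."""
    upper = min(n_x, n_y)
    favourable = sum(comb(n_x, i) * comb(n - n_x, n_y - i) for i in range(k, upper + 1))
    return favourable / comb(n, n_y)


n = len(transactions)
count = {item: sum(1 for t in transactions if item in t)
         for item in set().union(*transactions)}
patterns = frequent_pairs(transactions, min_support=2)

# Rank patterns from most to least surprising before showing them to an expert.
ranked = sorted(
    ((x, y, s, right_tail_p(s, count[x], count[y], n)) for x, y, s in patterns),
    key=lambda r: r[3],
)
for x, y, support, p in ranked:
    print(f"{x}-{y}: support={support}, p={p:.3f}")
```

Four pairs pass the support threshold, but only one stands out once independence is taken into account. With a large volume of patterns, a correction for multiple testing would also be needed before any interpretation.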

(22)

Other projects in the Orpailleur team



• Research projects
  ◦ Parallelization of CORON tools (http://coron.loria.fr)
    ▪ A suite of tools for symbolic data mining and formal concept analysis
  ◦ Text mining (ANR Hybride: http://hybride.loria.fr/)
    ▪ Collaboration with Orphanet
  ◦ Graph mining for chemical reactions
    ▪ Pennerath F, Niel G, Vismara P, Jauffret P, Laurenço C, Napoli A (2010) "A graph-mining algorithm for the evaluation of bond formability". Journal of Chemical Information and Modeling, 50:221-239.
  ◦ Spatio-temporal mining of agronomical data
    ▪ Mari JF, Lazrak E-G, Benoît M (2013) Time space stochastic modelling of agricultural landscapes for environmental issues. Environmental Modelling and Software 46:219-227
• Education: TELECOM Nancy (http://www.telecomnancy.eu/)
  ◦ Training engineers as « Data Scientists », Masters level

(23)

Conclusion



• LOD and biological databases can cooperate in the KDD process
• Bio-ontologies are a major asset in the Life Sciences
  ◦ For data exploration
  ◦ For dimension reduction
• Semantic web technology scales up at RDF level
  ◦ But not yet at the OWL and reasoning level
• HPC computing and programs can process big data

(24)

References

• Big Data Now. O’Reilly Media Inc., 1st edition, October 2012, www.it-ebooks.info (123 p.)
• Bresso E, Grisoni R, Marchetti G, Karaboga AS, Souchet M, Smaïl-Tabbone M (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics. 14:207.
• Callahan A, Cruz-Toledo J, Dumontier M (2013) Ontology-Based Querying with Bio2RDF's Linked Open Data. J Biomed Semantics. 15:4
• Coakley MF, Leerkes MR, Barnett J, Gabrielian AE, Noble K, Weber MN, Huyen Y (2013) Unlocking the power of big data at the NIH (Meeting Report). Big Data, September 2013, 183-186.
• Coulet A, Smaïl-Tabbone M, Napoli A, Devignes MD (2011) Ontology-based knowledge discovery in pharmacogenomics. Adv Exp Med Biol. 696:357-66
• Fan W, Bifet A (2012) Mining big data: current status and forecast to the future. SIGKDD Explorations, 14:1-5
• Fayyad U (2012) Big data everywhere and No SQL in sight. SIGKDD Explorations, 14: i-ii
• Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N, Kolker E (2013) Unraveling the complexities of life sciences data. Big Data, March 2013, 42-50
• Hoehndorf R, Dumontier M, Gkoutos G (2012) Evaluation of research in biomedical ontologies. Briefings in Bioinformatics. Sept 8, 2012, 1-17.
• Jonquet C, Lependu P, Falconer S, Coulet A, Noy NF, Musen MA, Shah NH (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324.
• Tatonetti NP, Fernald GH, Altman RB (2012) A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc. 19:79-85

(25)

Thank you for your attention!