(1)

Knowledge discovery from biological Big Data: scalability issues

Marie-Dominique Devignes, Malika Smaïl, Emmanuel Bresso, Adrien Coulet, Chedy Raïssi, Amedeo Napoli

Université de Lorraine, LORIA laboratory and INRIA Nancy Grand-Est, Orpailleur team, Nancy, France

http://orpailleur.loria.fr/

http://www.loria.fr

(2)

From (big) data to knowledge

Raw Data

Information

K

KDD : Knowledge Discovery from Databases

iterative and interactive process

Problem solving, decision making

(3)

A "big data" story in the life sciences

Presented by Russ Altman (PharmGKB) on YouTube

EngX webinar at Stanford Engineering School, Nov 12, 2013

Data

Information

K

Adverse event interpretation in electronic medical records

1. FDA Adverse Event Reporting System (FAERS)

2. Data subset related to eight classes of side effects

3. Supervised statistical machine learning

4. Correlation models: adverse reactions due to drug-drug interactions

Tatonetti et al., A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. JAMIA, 19:79-85, 2011

(4)

The KDD bottlenecks in the life sciences

Data and data sources

Noisy, complex, heterogeneous, distributed, dynamic, etc.

Need for "knowledge/model-driven" data integration

Data selection

Example and feature selection for machine learning

Need for guidelines

Parameters of data mining programs

Experimental approach

Need for efficient execution platforms

Pattern evaluation and interpretation

Big data mining can yield a big volume of patterns!

How to evaluate novelty, significance and consistency of a pattern at large scale?

(5)

Objectives of the talk

1. How do big data and biological databases cooperate?

2. How can bio-ontologies help in knowledge discovery?

3. Big data opportunities for knowledge discovery

(6)

Biological databases are Big Data

More than 1500 biological databases today

Curated data (not always)

Complex schema

Time-consuming update and integration

UniProt statistics, Nov 2013:

Swiss-Prot: > 542,000 sequences for 192 million amino acids

TrEMBL: > 48 million sequences for 15 billion amino acids

(7)

Linked Open Data (LOD)

Interconnected data

Freely accessible on the web

RDF: Resource Description Framework

{Subject, Property, Object}

URI (Uniform Resource Identifier)

Bio2RDF project

1 Tera triple graph in July 2013

Uniprot

Semantic web* as emerging biological Big Data

*The Semantic Web is a group of technologies that allow computers to process information resources autonomously, without human intervention, by annotating them with their meaning, or "semantics" (term coined by Tim Berners-Lee in 1998).

KeggPathway hsa:nnn

KeggGene hsa:ggg

Uniprot sp:ppp

Interpro ipr:ddd

Has_gene

See_Also, Xref
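The cross-links in this diagram can be read as a small graph of triples. Below is a minimal sketch in Python, not taken from the talk. It assumes made-up identifiers that mirror the slide's placeholders, an assumed assignment of the Has_gene, Xref and See_Also properties to the links, and a hypothetical helper named `reachable` that walks from a pathway to every resource linked to it.

```python
from collections import deque

# Toy triples mirroring the slide's placeholders; identifiers and the
# property attached to each link are illustrative assumptions.
TRIPLES = [
    ("keggpathway:hsa:nnn", "has_gene", "kegggene:hsa:ggg"),
    ("kegggene:hsa:ggg", "xref", "uniprot:sp:ppp"),
    ("uniprot:sp:ppp", "see_also", "interpro:ipr:ddd"),
]

def reachable(start, triples):
    """Breadth-first walk over subject -> object links; returns nodes in visit order."""
    index = {}
    for s, _p, o in triples:
        index.setdefault(s, []).append(o)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in index.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

path = reachable("keggpathway:hsa:nnn", TRIPLES)
print(path)
```

On a real LOD graph the same walk would typically be expressed as a SPARQL query rather than done in memory.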

(8)

From databases to RDF triples

"RDFization" of database contents

Database fact

RDF triple

Database

Graph

e.g. A protein P:pppp containing a domain D:ddd

= "EBI SPARQL endpoint"
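To make the mapping from a database fact to an RDF triple concrete, here is a small Python sketch. It is only an illustration. The `http://example.org/` namespace, the `contains_domain` property name and the helpers `rdfize` and `to_ntriples` are assumptions, not the URIs or tools used by the EBI or Bio2RDF.

```python
# Hypothetical namespace; real datasets define their own URIs.
BASE = "http://example.org/"

def rdfize(rows, subject_col, property_name, object_col):
    """Turn each relational row into one (subject, property, object) triple of URIs."""
    return [
        (
            BASE + "protein/" + row[subject_col],
            BASE + "vocab/" + property_name,
            BASE + "domain/" + row[object_col],
        )
        for row in rows
    ]

def to_ntriples(triples):
    """Serialize triples in N-Triples syntax: <s> <p> <o> ."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

# One database fact: protein P:pppp contains domain D:ddd (placeholders from the slide).
rows = [{"protein_id": "pppp", "domain_id": "ddd"}]
triples = rdfize(rows, "protein_id", "contains_domain", "domain_id")
print(to_ntriples(triples))
```

Each row of the relational table becomes one edge of the graph, which is why a database maps to a graph once every fact has been converted.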

(9)

Cooperation between LOD and databases

Classical databases can provide reliable curated information to complement and enrich information extracted from LOD

Project EXPLOD-BioMed (Adrien Coulet)

Exploring LOD for the purpose of mining biomedical data

Collect data about the genes responsible for intellectual disability

Use Bio2RDF or EBI/RDF SPARQL endpoints

Incomplete "RDFization" -> complete the datasets by querying classical databases + RDF representation of results

Storing retrieved RDF triples into a triple store

Or… back to a relational DB (!) for easy design of KDD workflows using Knowledge Discovery Environments (such as KNIME)
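The "back to a relational DB" option can be sketched with Python's standard `sqlite3` module. This is not the EXPLOD-BioMed implementation. The table layout, gene identifiers and source labels are made up for illustration. Triples retrieved from LOD or from a classical database are stored in one table, with their source, and then queried with plain SQL.

```python
import sqlite3

# Illustrative triples with their source; gene, disease and source names are made up.
triples = [
    ("gene:G1", "associated_with", "disease:intellectual_disability", "LOD"),
    ("gene:G2", "associated_with", "disease:intellectual_disability", "curated_db"),
    ("gene:G1", "encodes", "protein:P1", "curated_db"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triple (subject TEXT, property TEXT, object TEXT, source TEXT)"
)
conn.executemany("INSERT INTO triple VALUES (?, ?, ?, ?)", triples)

# Plain SQL now answers a question that a workflow node could consume.
genes = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT subject FROM triple "
        "WHERE property = ? AND object = ? ORDER BY subject",
        ("associated_with", "disease:intellectual_disability"),
    )
]
print(genes)
conn.close()
```

Keeping the `source` column preserves the distinction between facts extracted from LOD and facts added from curated databases.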

(10)

Flexibility versus Semantics: research opportunities

Moving from relational DB to NoSQL storage systems

Schema-less data -> lack of documentation, loss of semantics

New management systems to be invented

Analytic tools need to be adapted to such systems

Mahout …

MOA …

PEGASUS …

Fayyad UM (2012) Big data everywhere and No SQL in sight. SIGKDD Explorations, 14: i-ii

(11)

Objectives of the talk

1. How do big data and biological databases cooperate?

2. How can bio-ontologies help in knowledge discovery?

3. Big data opportunities for knowledge discovery

(12)

KDDK: Knowledge Discovery guided by Domain Knowledge in the Orpailleur team

[Figure: KDDK workflow diagram. Labels: DB1, DB2, DB3, etc.; Data integration; Data; 1. Data extraction and formatting; 2. Data mining; 3. Result interpretation; Domain Knowledge; Knowledge Base (KB)]

(13)

Bio-ontologies, an asset in the life sciences

Ontologies = knowledge representation

From hierarchical vocabularies

e.g. MeSH, MedDRA, GO, SNOMED, ICD…

To logical representation of concepts and relationships

e.g. SIO (Semanticscience Integrated Ontology), UMLS Semantic Types, SOPharm

Usages (semantic web technologies)

Model layer of knowledge bases

Semantic enrichment

e.g. Onto-Tools, IntelliGO

Cross-resource data retrieval

(14)

National Center for Biomedical Ontology: NCBO BioPortal

(15)

BIO-Ontologies and LOD exploration

39 biological resources: UniProt, GO, ArrayExpress, GEO, PharmGKB, etc.

5 Mega records

24.8 Giga annotations

366 bio-ontologies at the NCBO BioPortal: 6 Mega concepts

(Jonquet C et al. (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324)

(16)
(17)

Bio-ontologies and dimension reduction

Big data often mean high-dimensional data

Statistical methods for feature selection

Many possible methods

Clustering similar features using a terminology and a semantic similarity measure

E.g. semantic clustering of 1288 MedDRA adverse effect terms -> 112 term clusters

Enables execution of symbolic data mining methods such as frequent itemset search

Bresso et al. (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics, 14(1):207.
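The idea of grouping features by semantic similarity can be sketched as follows in Python. This is not the method of Bresso et al. It is a toy example under stated assumptions: a made-up is-a hierarchy (not real MedDRA content), a Wu-Palmer-style similarity computed from term depths, and a simple single-linkage grouping with an arbitrary threshold.

```python
from itertools import combinations

# Toy is-a hierarchy (child -> parent); terms are illustrative, not real MedDRA content.
PARENT = {
    "cardiac disorder": "adverse effect",
    "skin disorder": "adverse effect",
    "tachycardia": "cardiac disorder",
    "bradycardia": "cardiac disorder",
    "rash": "skin disorder",
    "pruritus": "skin disorder",
}

def ancestors(term):
    """Return the path from a term up to the root, term included."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(term):
    """Depth of a term in the hierarchy; the root has depth 1."""
    return len(ancestors(term))

def similarity(a, b):
    """Wu-Palmer-style similarity: 2 * depth(common ancestor) / (depth(a) + depth(b))."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = next(t for t in path_a if t in path_b)  # deepest shared ancestor
    return 2 * depth(common) / (depth(a) + depth(b))

def cluster(terms, threshold):
    """Single-linkage grouping: merge terms whose similarity reaches the threshold."""
    group = {t: {t} for t in terms}
    for a, b in combinations(terms, 2):
        if group[a] is not group[b] and similarity(a, b) >= threshold:
            merged = group[a] | group[b]
            for t in merged:
                group[t] = merged
    unique = {frozenset(g) for g in group.values()}
    return sorted(sorted(g) for g in unique)

terms = ["tachycardia", "bradycardia", "rash", "pruritus"]
clusters = cluster(terms, threshold=0.6)
print(clusters)
```

Replacing individual terms by their cluster shrinks the attribute set, which is what makes symbolic methods such as frequent itemset search tractable afterwards.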

(18)

Objectives of the talk

1. How do big data and biological databases cooperate?

2. How can bio-ontologies help in knowledge discovery?

3. Big data opportunities for knowledge discovery

(19)

Big data as a reservoir of data for validating hypotheses and models

Huge data sets become available for mining

“The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference” (Edd Dumbill, Big Data Now, 2012)

Adverse events -> grouping medical records from different hospitals is useful to enlarge the dataset

Data mining often generates more than one model, sometimes a huge number of patterns

Training set requires integrated curated data

(20)

The critical "Vs" of Big Data in the Life Sciences

Variety and variability

New data types provided by high-throughput technologies (OMICS data but also images from microscopy devices…)

Value

FAERS and drug-drug interactions -> better control of drug treatments

Individual genomes -> personalized medicine

Veracity

Multiple source integration means detecting and managing possible inconsistencies

Quality and provenance metadata in the LOD

Bio2RDF uses Dublin Core metadata triples and calculates 9 metrics for each dataset
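Detecting inconsistencies across sources, as mentioned under Veracity, can be illustrated with a short Python sketch. It is a simplification, not how Bio2RDF computes its metrics. It assumes that each property is expected to hold a single value, and the facts, source names and the helper `find_conflicts` are made up.

```python
from collections import defaultdict

# Illustrative facts as (subject, property, value, source); all names are made up.
facts = [
    ("gene:G1", "chromosome", "7", "source_A"),
    ("gene:G1", "chromosome", "7", "source_B"),
    ("gene:G2", "chromosome", "X", "source_A"),
    ("gene:G2", "chromosome", "11", "source_B"),
]

def find_conflicts(facts):
    """Group facts by (subject, property); report keys with more than one distinct value."""
    values = defaultdict(lambda: defaultdict(set))
    for subject, prop, value, source in facts:
        values[(subject, prop)][value].add(source)
    return {
        key: {v: sorted(srcs) for v, srcs in by_value.items()}
        for key, by_value in values.items()
        if len(by_value) > 1
    }

conflicts = find_conflicts(facts)
print(conflicts)
```

Keeping the source next to each value is a minimal form of provenance: it tells the curator which dataset to check when two values disagree.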

(21)

New paradigms for knowledge discovery

Cooperation between symbolic and statistical methods

Statistical feature selection before symbolic data mining

Automatic filtering and/or ranking of patterns using statistical significance measurements before expert interpretation

Adaptive learning systems
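The combination of symbolic mining followed by statistical ranking can be sketched in Python as below. This is an illustration only. The transactions are made up, only itemsets of size two are searched, and lift is used as a simple interestingness measure standing in for a proper significance test.

```python
from collections import Counter
from itertools import combinations

# Toy transactions: each set lists side-effect clusters observed for one drug (made up).
transactions = [
    {"nausea", "headache"},
    {"nausea", "headache", "rash"},
    {"nausea", "rash"},
    {"headache"},
    {"nausea", "headache"},
]

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs found in at least min_support of transactions."""
    n = len(transactions)
    counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

def rank_by_lift(pairs, transactions):
    """Sort pairs by lift = support(a, b) / (support(a) * support(b)), highest first."""
    n = len(transactions)
    item_count = Counter(item for t in transactions for item in t)
    scored = [
        (pair, support / ((item_count[pair[0]] / n) * (item_count[pair[1]] / n)))
        for pair, support in pairs.items()
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)

pairs = frequent_pairs(transactions, min_support=0.4)
ranking = rank_by_lift(pairs, transactions)
for pair, lift in ranking:
    print(pair, round(lift, 2))
```

Here the most frequent pair is not the top-ranked one: a lift above 1 flags a co-occurrence that is more frequent than independence would predict, and such patterns are the ones passed to the expert first.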

(22)

Other projects in the Orpailleur team

Research projects

Parallelization of CORON tools (http://coron.loria.fr)

A suite of tools for symbolic data mining and formal concept analysis

Text mining (ANR Hybride: http://hybride.loria.fr/)

Collaboration with Orphanet

Graph mining for chemical reactions

Pennerath F, Niel G, Vismara P, Jauffret P, Laurenço C, Napoli A (2010) "A graph-mining algorithm for the evaluation of bond formability". Journal of Chemical Information and Modeling, 50:221-239.

Spatio-temporal mining of agronomical data

Mari JF, Lazrak E-G, Benoît M (2013) Time space stochastic modelling of agricultural landscapes for environmental issues. Environmental Modelling and Software 46:219-227

Education: TELECOM Nancy (http://www.telecomnancy.eu/)

Training engineers as "Data Scientists", Master's level

(23)

Conclusion

LOD and biological databases can cooperate in the KDD process

Bio-ontologies are a major asset in the Life Sciences

For data exploration

For dimension reduction

Semantic web technology scales up at RDF level

But not yet at the OWL and reasoning level

HPC platforms and programs can process big data

(24)

References

• Big Data Now. O'Reilly Media Inc., 1st edition, October 2012, www.it-ebooks.info (123 p.)

• Bresso E, Grisoni R, Marchetti G, Karaboga AS, Souchet M, Smaïl-Tabbone M (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics 14:207.

• Callahan A, Cruz-Toledo J, Dumontier M (2013) Ontology-Based Querying with Bio2RDF's Linked Open Data. J Biomed Semantics. 15:4

• Coakley MF, Leerkes MR, Barnett J, Gabrielian AE, Noble K, Weber MN, Huyen Y (2013) Unlocking the power of big data at the NIH (Meeting Report). Big Data, September 2013, 183-186.

• Coulet A, Smaïl-Tabbone M, Napoli A, Devignes MD (2011) Ontology-based knowledge discovery in pharmacogenomics. Adv Exp Med Biol. 696:357-66

• Fan W, Bifet A (2012) Mining big data: current status and forecast to the future. SIGKDD Explorations, 14:1-5

• Fayyad U (2012) Big data everywhere and No SQL in sight. SIGKDD Explorations, 14: i-ii

• Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N, Kolker E (2013) Unraveling the complexities of life sciences data. Big Data, March 2013, 42-50

• Hoehndorf R, Dumontier M, Gkoutos G (2012) Evaluation of research in biomedical ontologies. Briefings in Bioinformatics. Sept 8, 2012, 1-17.

• Jonquet C, Lependu P, Falconer S, Coulet A, Noy NF, Musen MA, Shah NH (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324.

• Tatonetti NP, Fernald GH, Altman RB (2012) A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc. 19:79-85

(25)

Thank you for your attention !
