Data integration for
metagenomics: current status and
future plans
Neil Wipat
Computing Science
University of Newcastle
Overview
•
MetaMicrobase
•
Current method of data integration
•
Problems in data and database integration
•
Future possibilities: Semantic data integration using the
Metagenomics Informatics
pipeline overview
[Figure: pipeline components: Environmental Sample; Clone Library (Clone 1, Clone 2, etc.); Sequencing Facility; Sequence; Sequence QC; Sequence DB & Cache; First Pass YAMAP; MetaMicrobase; EMBL Submission; Barcode DB; Sample Metadata]
MetaMicrobase overview
[Figure: MetaMicrobase components: Annotated Sequence (several inputs); MetaGenome Cache; Collector; Notification server; Data Warehouse; Primary Data; Secondary Data. Analyses: BlastN, BlastP, Ssearch, Mummer, Promer, Sharkhunt, User defined, Orfans, Orthologues. Categories: Sequence Comparison, Metabolic reconstruction, Secondary analysis]
Data integration - current
approach
[Figure: Secondary Data from the analyses (BlastN, BlastP, Ssearch, Mummer, Promer, Sharkhunt, User defined, Orfans, Orthologues) held as separate result tables, each with its own attributes: BlastN, Orfans, SSearch, Mummer, EC numbers, Promer, BlastP results, Orthologues; plus Sequence Features with Locus tag and Accession no.]
Data integration - user defined
tasks cause problems
[Figure: Data Warehouse; Secondary Data; User defined; myGrid Workflows; User defined Queries; multiple separate Data outputs]
myGrid workflows don’t solve
data integration
They call Web services, carry out analysis
and gather more data, and produce their
We need a way to join it all together ...
Problems in data and
database integration
•
Variation in computational access to data sources
•
Attribute and table names vary between databases (often not descriptive)
•
The use of non-standard terminology or different ‘standard terminologies’ to describe data entries
•
Mismatch between attribute data types and their formats (domain types).
•
Not all data describing the contents of tables is readily computationally accessible. Some metadata and data are assumed by context.
Approaches to data and
database integration
•
Variation in computational access to data sources
•
SQL-based computational access to databases can be standardised using interfaces such as JDBC, ODBC or OGSA-DAI.
•
Approaches to distributed query processing using such standards have been developed, e.g. OGSA-DQP and BioMart.
•
These tackle the technical issues of querying over data stored in multiple locations but don’t ensure queries are semantically meaningful (you still need to know the database schema).
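A minimal sketch of the last point, assuming Python's DB-API (the standard `sqlite3` module) as a stand-in for JDBC/ODBC-style standardised access. The two in-memory databases, the table `enzyme_tab`, the column meanings and all rows are hypothetical. The cross-database join runs without trouble, but only because the query author already knows that `org` in one schema and `spec` in the other both hold species names.

```python
import sqlite3


def run_cross_database_join():
    """Join two separately defined databases through one standard interface."""
    conn = sqlite3.connect(":memory:")
    # Two independent in-memory databases with hypothetical schemas.
    conn.execute("ATTACH DATABASE ':memory:' AS enzymes")
    conn.execute("ATTACH DATABASE ':memory:' AS species")
    conn.execute("CREATE TABLE enzymes.enzyme_tab (org TEXT, ec TEXT, ename TEXT)")
    conn.execute("CREATE TABLE species.bact_spec_tab (spec TEXT, name TEXT)")

    conn.executemany(
        "INSERT INTO enzymes.enzyme_tab VALUES (?, ?, ?)",
        [
            ("Bacillus subtilis", "1.1.1.1", "alcohol dehydrogenase"),
            ("Homo sapiens", "3.2.1.1", "alpha-amylase"),
        ],
    )
    conn.executemany(
        "INSERT INTO species.bact_spec_tab VALUES (?, ?)",
        [("Bacillus subtilis", "168")],
    )

    # Nothing in either schema says that `org` and `spec` mean the same thing;
    # that knowledge lives only in the head of whoever wrote this query.
    rows = conn.execute(
        "SELECT s.spec, e.ec, e.ename "
        "FROM species.bact_spec_tab AS s "
        "JOIN enzymes.enzyme_tab AS e ON e.org = s.spec "
        "ORDER BY e.ec"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_cross_database_join())
```

The interface solves the access problem; the meaning of each column still has to be supplied by the person writing the join condition.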
Semantic data integration
using ComparaGrid
•
Developing Grid-based systems for integrating comparative genomics data
•
EBI, Institute for Food Research, John Innes Centre, Manchester, Newcastle University (Maths & Stats & Computing Science), SCRI
•
A generic architecture for semantic data and database integration
Approach
•
Problems relating to differences in table and attribute names can be addressed by semantically defining a database: tables and attributes are mapped to a formal ontology.
•
(attribute semantics, attribute value semantics and table semantics)
•
Differences in the terminology describing data entries can be overcome by reference to controlled vocabularies/ontologies or by reference to an equivalence table.
[Figure: database tables (organism_tab, bact_spec_tab, verteb_tab; labelled Species, Bacteria, Vertebrates) and their attributes (org, enz, spec, ec, name, ename) mapped to ontology terms (thing, organism, plant, animal, enzyme, bacterium, invertebrate, vertebrate)]
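The mapping idea on this slide can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not ComparaGrid's implementation: which attribute belongs to which table, the equivalence entries and the sample rows are all hypothetical. Attributes with different names are mapped to the same ontology term, and an equivalence table reconciles different spellings of the same value.

```python
from collections import defaultdict

# Attribute semantics (hypothetical): (table, attribute) -> ontology term.
ATTRIBUTE_SEMANTICS = {
    ("organism_tab", "org"): "organism",
    ("bact_spec_tab", "spec"): "organism",
    ("verteb_tab", "name"): "organism",
}

# Attribute value semantics (hypothetical): variant spelling -> preferred term.
EQUIVALENCE = {
    "B. subtilis": "Bacillus subtilis",
    "E. coli": "Escherichia coli",
}

# Sample rows (hypothetical); verteb_tab is left empty on purpose.
TABLES = {
    "organism_tab": [{"org": "B. subtilis"}, {"org": "Homo sapiens"}],
    "bact_spec_tab": [{"spec": "Bacillus subtilis"}, {"spec": "E. coli"}],
}


def values_for_term(term):
    """Return {normalised value: sorted tables it occurs in} for one ontology term."""
    found = defaultdict(set)
    for (table, attribute), mapped_term in ATTRIBUTE_SEMANTICS.items():
        if mapped_term != term:
            continue
        for row in TABLES.get(table, []):
            raw = row[attribute]
            # Fall back to the raw value when no equivalence is recorded.
            found[EQUIVALENCE.get(raw, raw)].add(table)
    return {value: sorted(tables) for value, tables in found.items()}


if __name__ == "__main__":
    for value, tables in sorted(values_for_term("organism").items()):
        print(value, tables)
```

A query phrased against the term `organism` now reaches both `org` and `spec` without the user knowing either column name, and "B. subtilis" and "Bacillus subtilis" are recognised as the same entry.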
Approach
•
To resolve issues where table data is assumed by context, tables can be further defined by reference to a term in an ontology (in a similar fashion to attribute names)
•
The reasoning system interprets data from the tables, takes the extra information into account, effectively populating the missing properties.
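A rough sketch of how table semantics can fill in properties that a table leaves implicit. Plain Python stands in for a real reasoner here, and the subclass hierarchy, the table-to-term assignments and the sample row are assumptions made for illustration; only the term and table names come from the slides.

```python
# Assumed subclass hierarchy over the ontology terms: child -> parent.
SUBCLASS_OF = {
    "organism": "thing",
    "enzyme": "thing",
    "plant": "organism",
    "animal": "organism",
    "bacterium": "organism",
    "invertebrate": "animal",
    "vertebrate": "animal",
}

# Table semantics (assumed): what every row of a table implicitly describes.
TABLE_SEMANTICS = {
    "bact_spec_tab": "bacterium",
    "verteb_tab": "vertebrate",
}


def ancestors(term):
    """Walk up the hierarchy and return every superclass of `term`, nearest first."""
    chain = []
    while term in SUBCLASS_OF:
        term = SUBCLASS_OF[term]
        chain.append(term)
    return chain


def annotate(table, row):
    """Copy `row` and add the types implied by the table it came from."""
    term = TABLE_SEMANTICS[table]
    enriched = dict(row)
    enriched["types"] = [term] + ancestors(term)
    return enriched


if __name__ == "__main__":
    print(annotate("verteb_tab", {"name": "Mus musculus"}))
```

The row itself never says that it describes a vertebrate, an animal or an organism; those properties follow from the table's ontology term and the hierarchy above it.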