Data integration for metagenomics: current status and future plans

Academic year: 2021

(1)

Data integration for metagenomics: current status and future plans

Neil Wipat

Computing Science

University of Newcastle

(2)

Overview

MetaMicrobase

Current method of data integration

Problems in data and database integration

Future possibilities: Semantic data integration using the

(3)

Metagenomics Informatics pipeline overview

[Diagram components: Environmental Sample; Clone Library (Clone 1, Clone 2, etc.); Sequencing Facility; Sequence; Sequence QC; Sequence DB & Cache; First Pass YAMAP; MetaMicrobase; EMBL Submission; Barcode DB; Sample Metadata]

(5)

MetaMicrobase overview

[Diagram components: Annotated Sequence (x4); MetaGenome Cache; Collector; Notification server; tools: BlastN, BlastP, Ssearch, Mummer, Promer, Sharkhunt, User defined, Orfans, Orthologues; Data Warehouse: Primary Data, Secondary Data; groupings: Sequence Comparison, Metabolic reconstruction, Secondary analysis]

(6)

Data integration - current approach

[Diagram components: tools: BlastN, BlastP, Ssearch, Mummer, Promer, Sharkhunt, User defined, Orfans, Orthologues; Secondary Data; groupings: Sequence Comparison, Metabolic reconstruction, Secondary analysis; result tables, each with its own attributes: BlastN, Orfans, SSearch, Mummer, EC numbers, Promer, BlastP results, Orthologues; Sequence Features (Locus tag, Accession no.)]

(8)

Data integration - user defined tasks cause problems

[Diagram components: User defined; Data Warehouse; Secondary Data; myGrid Workflows; User defined Queries]

(9)

Data integration - user defined tasks cause problems

[Diagram components: User defined; Data Warehouse; Secondary Data; myGrid; User defined Queries; Data (x6)]

(10)

myGrid workflows don’t solve data integration

[Diagram: Data (x4)]

They call Web services, carry out analysis and gather more data, and produce their

(11)

myGrid workflows don’t solve data integration

We need a way to join it all together ...

[Diagram: Data (x4)]

(12)

Problems in data and database integration

Variation in computational access to data sources

Attribute and table names vary between databases (and are often not descriptive)

The use of non-standard terminology, or of different ‘standard terminologies’, to describe data entries

Mismatch between attribute data types and their formats (domain types)

Not all data describing the contents of tables is readily computationally accessible. Some metadata and data are assumed by context.

(13)

Approaches to data and database integration

Variation in computational access to data sources

SQL-based computational access to databases can be standardised using interfaces such as JDBC, ODBC or OGSA-DAI.

Approaches to distributed query processing using such standards have been developed, e.g. OGSA-DQP and BioMart.

These tackle the technical issues of querying over data stored in multiple locations but don’t ensure that queries are semantically meaningful (you still need to know the database schema).
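The point about standardised access can be illustrated with a minimal sketch. It uses Python's DB-API (via the standard `sqlite3` module) as a stand-in for JDBC/ODBC, which is an assumption made for illustration only. The two in-memory databases, their table and column names, and the sample rows are all hypothetical. The access call is identical for both sources, yet each query still has to be written against that source's own schema.

```python
import sqlite3

def make_db(ddl, insert_sql, rows):
    """Create a throwaway in-memory database (hypothetical example data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Two sources holding the same kind of data under different names.
db_a = make_db(
    "CREATE TABLE organism_tab (org TEXT, ec TEXT)",
    "INSERT INTO organism_tab VALUES (?, ?)",
    [("Bacillus subtilis", "1.1.1.1")],
)
db_b = make_db(
    "CREATE TABLE bact_spec_tab (spec TEXT, enz TEXT)",
    "INSERT INTO bact_spec_tab VALUES (?, ?)",
    [("Bacillus subtilis", "1.1.1.1"), ("Escherichia coli", "2.7.1.1")],
)

def fetch_all(conn, sql):
    """Uniform access: the same call works for any DB-API connection."""
    return conn.execute(sql).fetchall()

# The interface is standard, but the SQL is schema-specific:
# the caller must already know each source's table and column names.
rows_a = fetch_all(db_a, "SELECT org, ec FROM organism_tab")
rows_b = fetch_all(db_b, "SELECT spec, enz FROM bact_spec_tab")

combined = sorted(set(rows_a) | set(rows_b))
print(combined)
```

Nothing in the access layer records that `org` and `spec` mean the same thing; that knowledge lives only in the hand-written queries.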

(14)

Semantic data integration using ComparaGrid

Developing Grid-based systems for integrating comparative genomics data

EBI, Institute for Food Research, John Innes Centre, Manchester, Newcastle University (Maths & Stats & Computing Science), SCRI

A generic architecture for semantic data and database integration

(15)

Approach

Problems relating to differences in table and attribute names can be addressed by semantically defining a database: tables and attributes are mapped to a formal ontology (attribute semantics, attribute value semantics and table semantics).

Problems relating to differences in the terminology describing data entries can be overcome by reference to controlled vocabularies/ontologies or by reference to an equivalence table.

[Diagram components: database tables organism_tab, bact_spec_tab, verteb_tab (labelled Species, Bacteria, Vertebrates); attributes org, enz, spec, ec, name, ename; ontology terms thing, organism, plant, animal, enzyme, bacterium, invertebrate, vertebrate]
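A minimal sketch of this idea follows, assuming plain Python dictionaries in place of a formal ontology. The table, attribute and ontology term names are taken from the diagram, but which attribute belongs to which table, which term each attribute maps to, and the abbreviated species name in the equivalence table are all assumptions made for illustration.

```python
# Attribute semantics: (table, attribute) -> ontology term.
# The pairing of attributes with tables and terms is assumed here.
ATTRIBUTE_SEMANTICS = {
    ("organism_tab", "org"): "organism",
    ("organism_tab", "ec"): "enzyme",
    ("bact_spec_tab", "spec"): "organism",
    ("bact_spec_tab", "enz"): "enzyme",
}

# Equivalence table: source-specific term -> agreed term (hypothetical entry).
EQUIVALENCE = {"B. subtilis": "Bacillus subtilis"}

def to_shared(table, row):
    """Re-express one row using ontology terms and normalised values."""
    shared = {}
    for attribute, value in row.items():
        term = ATTRIBUTE_SEMANTICS[(table, attribute)]
        shared[term] = EQUIVALENCE.get(value, value)
    return shared

rec_a = to_shared("organism_tab", {"org": "Bacillus subtilis", "ec": "1.1.1.1"})
rec_b = to_shared("bact_spec_tab", {"spec": "B. subtilis", "enz": "1.1.1.1"})

# Once both rows are in the shared vocabulary they can be compared directly.
print(rec_a == rec_b)
```

Keeping the mappings as data rather than code means a new source can be added by describing it, without rewriting the queries that run over the shared terms.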

(16)

Approach

To resolve issues where table data is assumed by context, tables can be further defined by reference to a term in an ontology (in a similar fashion to attribute names).

The reasoning system interprets data from the tables and takes the extra information into account, effectively populating the missing properties.

[Diagram components: e.g. table bact_spec_tab (labelled Bacteria); attribute ec; ontology terms thing, organism, plant, animal, enzyme, bacterium, invertebrate, vertebrate]
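How table semantics can supply information that the rows themselves never state is sketched below. This is a toy stand-in for a reasoner, not the ComparaGrid reasoning system: the subclass hierarchy is an assumed arrangement of the ontology terms shown in the diagram, and the sample row is hypothetical.

```python
# Assumed subclass hierarchy over the diagram's ontology terms.
SUBCLASS_OF = {
    "organism": "thing",
    "enzyme": "thing",
    "plant": "organism",
    "animal": "organism",
    "bacterium": "organism",
    "invertebrate": "animal",
    "vertebrate": "animal",
}

# Table semantics: every row of the table describes an instance of this term.
TABLE_SEMANTICS = {
    "bact_spec_tab": "bacterium",
    "verteb_tab": "vertebrate",
}

def ancestors(term):
    """Return the term followed by all of its superclasses, most specific first."""
    chain = [term]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain

def annotate(table, row):
    """Add the types implied by the table's context to a copy of the row."""
    enriched = dict(row)
    enriched["types"] = ancestors(TABLE_SEMANTICS[table])
    return enriched

row = annotate("bact_spec_tab", {"spec": "Bacillus subtilis", "ec": "1.1.1.1"})
print(row["types"])
```

The row never says that it describes a bacterium; that fact comes from the table's mapping, and the superclasses follow from the hierarchy.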

(17)

Approach

Mismatches between domain types CANNOT (in general) be solved using a semantic integration approach.

Pairs of ‘mapping functions’ can be used to convert between data representations.

e.g. ensPep.InterproID holds values such as IP:012345, whereas IPEntry.ID holds 012345:
IPEntry.ID = TRIM("IP:" ensPep.InterproID)
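A pair of mapping functions for this example might look like the sketch below. The function names are hypothetical; the only facts taken from the slide are the two representations (`IP:012345` and `012345`) and that the mapping strips the `IP:` prefix. The reverse function is the assumed second half of the pair.

```python
PREFIX = "IP:"

def enspep_to_ipentry(interpro_id: str) -> str:
    """Map an ensPep.InterproID value (e.g. 'IP:012345') to an IPEntry.ID value."""
    if not interpro_id.startswith(PREFIX):
        raise ValueError(f"expected a value starting with {PREFIX!r}: {interpro_id!r}")
    return interpro_id[len(PREFIX):]

def ipentry_to_enspep(entry_id: str) -> str:
    """Inverse mapping: restore the prefix used by ensPep.InterproID."""
    return PREFIX + entry_id

print(enspep_to_ipentry("IP:012345"))
print(ipentry_to_enspep("012345"))
```

Treating the identifier as a string throughout preserves the leading zero, which would be lost if the value were converted to an integer.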

(18)

ComparaGrid Architecture

[Diagram components: source databases DB 'A', DB 'X', DB 'Y', DB 'Z'; access via JDBC and WS/HTTP; SQL result sets (RS); Wrapper; Transformer; Integrator (all data); OWL (raw) and OWL (shared ComparaGrid ontology); layers labelled SYNTAX, SEMANTICS, LOCATION; graphical user query and visualisation client (PUSSYCAT)]

(20)

Very long term aim

[Diagram components: Annotated Sequence (x4); MetaGenome Cache; Collector; Notification server; tools: BlastN, BlastP, Ssearch, Mummer, Promer, Sharkhunt, User defined, Orfans, Orthologues; Primary Data; groupings: Sequence Comparison, Metabolic reconstruction, Secondary analysis; ComparaGrid; internal databases (int DB1, x5); external databases (ext DB1, x4)]

(21)

Acknowledgements

NERC Metagenomics project

ComparaGrid team

Microbase team
