• No results found

Processing Genome Data using Scalable Database Technology. My Background

N/A
N/A
Protected

Academic year: 2021

Share "Processing Genome Data using Scalable Database Technology. My Background"

Copied!
17
0
0

Loading.... (view fulltext now)

Full text

(1)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Johann Christoph Freytag, Ph.D.

[email protected]

http://www.dbis.informatik.hu-berlin.de

Stanford University, February 2004

Processing Genome Data

using Scalable Database Technology

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

My Background

ERCR (European

Computer Industry

Research Centre),

München (87-89)

DEC’s Database

Technology Center,

München (90-93)

Professor of DBIS

@HUB

Starburst project, IBM Almaden Research Center (85-87)

Visiting Scientist, Almaden Research Center (97/98)

Visiting Scientist, IBM SVL (2001)

PhD @ Harvard Univ.

Visiting Scientist,

Microsoft Res. (2002)

(2)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

“What’s the Meaning of Life”

!

?

DNA

RNA

Protein

Transcription

Genomic

„Transmitter“

Messenger

Gene product

Translation

Replication

?

?

Overview

• (Biological) Motivation/Problems

• Using Database Technology

– Gene-EYe Integration-Platform

– Data Cleansing

– BLAST-Integration into GDB

– In-and-Out-the-Database: Using Workflow for

“dry” Experiments

(3)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

View of Biological Areas

Biological

Motivation

DNA

Genome

RNA

Transcriptome

Amino Acids

Proteome

Pathways

Life

Evolution

Environment

Diseases

Experiments

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

View of Data Data Source

Biological

Motivation

DNA

Genome

RNA

Transcriptome

Amino Acids

Proteome

Pathways

Life

Evolution

Environment

Diseases

Experiments

EMBL

RefSeq

LocusLink

EMBL (EST)

SWISS-PROT

Interpro

OMIM

Taxonomy

Brenda

KEGG

Express

Gene Ontology

(4)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Complex Relationships

A graph depicting the relationships

between 400+ biological data

sources served by the EBI via SRS

Database Growth of

EMBL (# of records)

More than 400 Data Sources

on the WEB

Source: http://www3.ebi.ac.uk/Services/DBStats/

DBIS (our) Approach

SwissProt

ESEMBLE

...

KABAT

EMBL

Database

Model of the

Biological World

(5)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

• (Biological) Motivation/Problems

• Using Scalable Database Technology

– Gene-EYe Integration-Platform

– Data Cleansing

– BLAST-Integration into GDB

– In-and-Out-the-Database: Using Workflow

Concepts for “dry” Experiments

• Summary

Overview

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Gene-EYe Integration-Platform

Vision

• Provide mechanisms for

– unified handling of different data sources

– data source integration

– change management

– user defined data preparation

• Provide

– relevant tools for sequence manipulation and

retrieval

– work flow support for operation and administration

(6)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Gene-EYe Integration-Platform

The Big Picture

Genome Data Store Layer (GDS Schema)

Flat File Data -> Relational Entities (e.g. EMBL)

DATA

Genome DataBase Layer (GDB Schema)

Relational Entities -> Biological Entities (e.g. Gene)

CONTENT

Genome Data Warehouse Layer (GDW Schema)

Biological Entities -> Biological Concepts (e.g. Life Cycle)

KNOWLEDGE

Design

GDS: From Flat File to Database

Genome Data Store Layer (GDS Schema)

EMBL scanner

SW

ALL scann

er

T

A

XO scanner

InterPro scanner

GDS Admin Tools

ENSEMBL sca

nner

Data Storage

Data Cleansing

Update/Admin

GDS Load Tools

EMBL DDL

SWALL DDL

T

A

XO DDL

InterPro DDL

ENSEMBL DDL

(7)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

The Data Import Pipeline - Revisited

Data

File

Scanner

Load

Files

CLOB

Files

Load Spec.

Format

Spec.

Content

Spec.

DDL-Gen.

DDL Script

Gene-EYe

GDS

Loader

Controller

Instance

Spec.

Summary

1: CWM compliant GeneEYe Metadata Repository

Phase 1: Property Files

Perl scripts

Hand crafted

Phase 2: GEM

1

Repository

de.hui.dbis.geneeye.* (Java) Autogenerate from Metadata

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

(8)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

GDB-Layer: From Data to Biology

Data Storage

Data Cleansing (Syn.)

Update/Admin

Data Integration

Data Cleansing (Sem.)

Queries

Defined by and

in cooperation w/

domain experts

[Data]

Genome Data Store Layer (GDS Schema)

Genome Database Layer (GDB Schema)

EMBL

SWALL

TA

X

O

InterPro

ENSEMBL

GDB Builder (IBM Clio?)

Gene

Protein

T

ranscript

T

issue

Variant

GDB Mapper (IBM Clio)

[Definition]

Schema

Data

Schema Mapping with Clio

DB

DB

Source

Schema

DB

DB

Target

Schema

SQL or XQuery

Clio

User mapping

Clio

with permission of Dr. Felix Naumann

IBM Almaden Research Center

(9)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Clio Features

• Schema Viewer

– Visual mapping between schema elements

• Attribute Matcher

– Intelligent suggestions of likely mappings

• Data Viewer

– Data examples for mapping queries

• Queries

– SQL, XSLT, Xquery

– Use and adhere to source and target schema

constraints

with permission of Dr. Felix Naumann

IBM Almaden Research Center

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

GDW: Providing Facts for Research

Genome Database Layer (GDB Schema)

Genome Data Warehouse Layer (GDW Schema)

Data Mining

Ontology Mapping

Process Simulation

Data Integration

Data Cleansing (Sem.)

Queries

GDW Miner

Gene

Protein

T

ranscript

T

issue

Variant

GDB Explorer

Gen

e

Pr

ote

in

T

ra

n

script

T

issu

e

Var

ia

n

t

Ontology

(10)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

• (Biological) Motivation/Problems

• Using Scalable Database Technology

– Gene-EYe Integration-Platform

– Data Cleansing

– BLAST-Integration into GDB

– In-and-Out-the-Database: Using Workflow

Concepts for “dry” Experiments

• Summary

Overview

DNA Sequence

Determination

Errors in Genome Data

Genome

Feature

Annotation

Protein

Sequence

Determination

Protein

Function

Annotation

DNA, RNA Sequence

Structure Annotation

Protein Sequence

Function Annotation

agagattagcgcgctagatcgatatgataga

gctatatcatccgagatagcagatagctcta

gcacactattacacgagcagcgaccttatat

MDDREDLVYQAKLAEQAERYDEMVESMKKVD

AGMDVELTVEERNLLSVAYKNVIGARRASWY

RIISSIEQKEENKGGEDKLKMIREYRQMVER

DNA

Gene

mRNA

INVOLVED IN ACUTE LEUKEMIAS BINDS DIRECTLY TO ZO-1 MAY ACT AS INTRACELLULAR SIGNALING COMPONENT ... SUBUNIT DISEASE FUNCTION

0,23%

2,58%

5% - 30%

5% - 40%

ƒ

Classes of errors in genome

data production

ƒ

Experimental errors

ƒ

Analysis errors

ƒ

Transformation errors

ƒ

Propagated errors

ƒ

Stale data

[Müller, Naumann, Freytag, ICIQ, 2003]

only

1.3 %

difference

The main difference are transcript

copy numbers in the brain

(11)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Reliability-based Merging (cont.)

• Domain expert identifies reliable parts for merging

• Definition of a set of views for integration

ID

A

1

A

2

A

3

1

A

1.2

1900

2

A

2.3

1900

3

B

3.1

1900

4

B

3.4

2000

5

C

5.6

2002

r

1

ID

A1

A2

A3

1

B

2000

2

B

2000

3

B

2000

4

B

2000

5

D

2002

r

2

1

2

3.1

3.4

5.6

Merge

ID

A1

A2

A3

1

A

1.2

2000

2

A

2.3

2000

3

B

3.1

2000

4

B

3.4

2000

5

C

5.6

2002

r

1

e.g. MIN()

Current work:

Which are the relevant

mismatch patterns

?

How to assess their

relevance

& importance?

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

• (Biological) Motivation/Problems

• Using Scalable Database Technology

– Gene-EYe Integration-Platform

– BLAST-Integration into GDB

– In-and-Out-the-Database: Using Workflow

Concepts for “dry” Experiments

• Summary

(12)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

BLAST: General Introduction

Algorithm/Package: Similarity Search

Devloped by Altschul et al. (1990)

Three Steps:

1. Search for Word Pairs (Iseq, DSeq) of

Length

L

on the Data Collection of

Sequences above Threshhold T

2. Expansion of each Word Pair until the

Value V of their Alignment is

away

from the local maximum

3. Output of complete alignment (

High-scoring Segment Pair, HSP

), if

Value(Alignment) >

S

DC-File

FormatDB

Sequences Features Index

BLAST

Report Query sequence

Preprocessing

BLAST Call

Output: Powerset of Alignments

BLAST UDF Implementation

• Goal:

Using BLAST in SQL-statements

• How?

– BLAST-UDF implemented as Table Function

– Use in SQL Query

SELECT *

FROM TABLE( BLAST(<Parameter>, <Query Sequence>,

<Comparison Sequence> ))

• Each call returns a set of alignments

over Sequences in the Database

(13)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Structure of UDB Table Function

• Implementation:

– Mapping of program into

calling structure for table

functions

– Communiaction between the

different calls via

scratchpad

– scratchpad: Storage area

which remains intact and

unchanged between UDF calls

• Storage of data structures for

different steps

• especially for output from

postprocessing:

SeqAlign

Initialize

For all

SEQUENCES

Alignment without gaps

Postprocessing

For all

ALIGNMENTS

Output of the

results

Release data structures

related with sequences

UDF BLAST

FIRST

OPEN

FETCH

CLOSE

FINAL

Release global data structures

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

• (Biological) Motivation/Problems

• Using Scalable Database Technology

– Gene-EYe Integration-Platform

– BLAST-Integration into GDB

– In-and-Out-the-Database: Using Workflow

Concepts for “dry” Experiments

• Summary

(14)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

The Challenge: Exon Skipping

Gene

Protein

n Exons within one Gene

linearly combined

(splicing)

for Protein Generation

Used as Pattern

One Gen with 100 Exons

2

100

~ 10

30

Variations

Challenge: Exon Skipping

Do

alternative fusion points

new

(15)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

110110000111111

New Functionality!

Functional Genomics: Gain of New Insight

First Horizon: Simple Exon Skipping

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Flow of Processing Steps

Generate

Exon

Sequence

Find

Similarity

Search

Check

for Biological

Validity

Local Database

(automatic)

Remote Tool

(Web Based)

Supported by

local DB

(16)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Implementation

¾

¾

Some facts

Some facts

¾

60 days 100% load

¾

One splice form per minute

¾

So far: ca. 90.000 splice forms

¾

First biolog. meaningful

results

Cooperations

• Cooperation with

– Univ. of Jena (Rolf Backofen)

– Berlin Center of Bioinformatics (BCB)

• Charite, FU, Max-Planck-Institut (M.

Vingron)

– Industry: IBM, small companies, …

(17)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Database Environment

CPU

CPU

CPU

CPU

CPU

CPU

CPU

CPU

DB2

2.3 TByte

IBM p-Server

(sponsored by IBM)

Processing Genome Data

using Scalable Database Technology

© 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag

Summary

• Lesson learnt

– Highly Dynamic

Environment

• Data: changes frequently

• User: changes frequently

– Provide a framework for

• Date integration

• Data processing

• Data changes

• Data dependencies…..

• Meta data management

• Future Work

– Query processing

• Include domain knowledge

– Data cleansing

– Set of UDFs for biological

data processing

– Visualization of Data

References

Related documents

authors emphasized the strategic impact of not having spare parts when one of the n machines fails and they included in the cost model parameters from inventory, operations

The two authors independently extracted the data from the forty included articles using an Excel spreadsheet with predefined fields for the following information in each

Usually, you run most of your user programs in the user's Shell process, using the input and transcript pads already created by the Display Manager command,

random XORed canary or off-stack approach helps here Arbitrary memory reads could disclose canary value Format string bugs, /proc/mem/, info leakage... Obfuscation

Abstract. The objective of this study was to introduce the recommender system based on expert and item category to match the right items to users. In this study, the

In the present study, label-free profiling com- bined with RT-qPCR analyses showed that mRNA and protein expression levels of ABA1 and ABA2 were signifi- cantly decreased

For pediatric patients, IV attempts should be considered if the patient is presenting with signs and symptoms of dehydration, in need of medications, or in critical condition and is

If we look at this Arabic Nouns Prefix: Filgurfa ةفرغلا يف , Fisougue قوسلا يف, Dahel algurfa ةفرغلا لخاد , we will find that all these Noun prefixes have a