Data Integration of Bioinformatics and Web-Based Software Development

Information and Communications Univ. Bioinformatics & Software Systems Lab.

Woo-Hyuk Jang

Integration of Biological XML data

Ph. D. Lecture

Where are we?

Server-Side Info. Management

Client-Side Info. Management

Business related Issues

Web Services

Internationalization and Privacy

XML & XML Processing

HTML, JavaScript, Plug-in, Applet…

WWW Concepts & Web-based Info. Management

S.S. Info. Management Concept CGI, Java Servlets

JDBC, MySQL

App. Of Web-based tech. Semantic Web

ICE0534 - Web-Based Software Development, Summer 2005

Can you remember?

• Problems in Integrating Heterogeneous Information

- Heterogeneity of formats, data types, units, or semantics.

• Information Mediation

Fig 1. Mediator in Lecture 7.

This Lecture Contains…

• Information Integration in Bioinformatics

- Bioinformatics Overview

• Is there any relationship between Web and Bioinformatics?

- Difficulties in handling biological XML data

- What is it? Why?

• Cultures: Schema-driven, Data-driven

• Models: Federation, Warehousing, Mediation

• Integration of XML format data

- Problems

- Issues

Bioinformatics

• A narrow sense

- The application of information technology to life science research

• Modeling (abstraction)

• Analysis and collection

• Data integration and information retrieval

- Enables the discovery and analysis of biomolecules and their properties (Structure, function, interactions)

• A wide sense

- The use of computers to collect, analyze, and interpret biological information at the molecular level

Web and Bioinformatics

[Figure: the Web links Biological Data (Experiment, Publish, Use) and Bio Applications (Make, Publish, Use)]

Difficulties in Handling Biological XML data

• Lack of standards

- Different data models and schemas

- Different handling methods are needed

- Different formats

• Monstrous volume of data

- It is growing exponentially

- Data are updated very frequently

• Newly introduced data, error-corrected data

Why Integration?

• In the post-human genome sequencing era, many analyses on the genome scale are possible

• The majority of human diseases are the product of multi-step pathophysiological processes

• The biggest challenge in interpreting the results of these analyses lies in the data integration problem

Two Cultures of Integration

• Database Integration

- Schema level view

- Focus on outside of data

• Data Integration

- Data level view

- Focus on inside of data

[Figure: Schema 1 to Schema 4 (database integration) and Data 1 to Data 4 (data integration)]

Two Cultures of Integration

• Schema-driven (computer scientists)

- Much smaller than data, (hopefully) well-defined elements

- Resolve redundancy and heterogeneity at the schema level

- High degree of automation once the system is set up

- Focus on methods: you rarely publish a "data paper"

• Data-driven (biologists)

- Value is in the data; abstraction is a result of analysis

- Don't bother with schemas

• Abstraction is volatile and depends on experimental technique

- Manual integration at data level, constant high effort

- You rarely publish a (database) "method paper"

Models of Integration

• Federation (Multi-database)

• Warehousing (Materialized in house)

• Mediation (Virtual integration)

Models of Integration

• Federation (Multi-database)

- K2/BioKleisli, Entrez

[Figure: a federated DBMS (FDBMS) over Component DB 1, Component DB 2, ..., Component DB n. A component may itself be a centralised DBMS (Comp DB 1), a distributed DBMS (Comp DB 2.1, Comp DB 2.2), or another FDBMS.]

Models of Integration

• Warehousing (Materialized in house)

- GUS (Genome Unified Schema), SRS (Sequence Retrieval System)

[Figure: warehousing architecture. Labels: Network / Internet, R2, R3, Integration & Storage, Local Operational Warehouse, Decision Support & Mining]

Models of Integration

• Mediation (Virtual integration)

- TAMBIS (Transparent Access to Multiple Bioinformatics Information Source)

[Figure: a Mediator placed between the user and the sources on the Network / Internet]

Models of Integration

• Federation represents a more "static" approach, using agreed couplings to allow view creation.

• Warehousing and Mediation address integration in a more "dynamic" way, using extraction, transformation and integration processes.

Warehousing vs. Mediation

• Warehouse

- Update-driven: integrated data is kept in the warehouse repository and refreshed when the sources are updated

- Heterogeneous data is integrated in advance and stored in-house for direct query and analysis.

• Mediation

- Wrapper and Mediator layer on top of source DBs.

- Query-driven: a query against the mediated schema is translated into queries appropriate to each source.
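The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory "sources", their field names (`acc`, `protein_name`), the wrapper functions and the mediated schema `(id, name)` are all hypothetical and do not come from the lecture.

```python
import sqlite3

# Hypothetical sources with different schemas (stand-ins for remote databases).
SOURCE_A = [{"acc": "P1", "protein_name": "BAXA_HUMAN"}]
SOURCE_B = [{"id": "P2", "name": "BCL2_HUMAN"}]


def wrap_a(name=None):
    """Wrapper for source A: answer in the mediated schema (id, name)."""
    rows = [{"id": r["acc"], "name": r["protein_name"]} for r in SOURCE_A]
    return [r for r in rows if name is None or r["name"] == name]


def wrap_b(name=None):
    """Wrapper for source B, whose fields already match the mediated schema."""
    rows = [{"id": r["id"], "name": r["name"]} for r in SOURCE_B]
    return [r for r in rows if name is None or r["name"] == name]


WRAPPERS = (wrap_a, wrap_b)


def build_warehouse():
    """Warehousing: integrate in advance and store the result in-house."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT)")
    for wrapper in WRAPPERS:
        conn.executemany("INSERT INTO protein VALUES (:id, :name)", wrapper())
    conn.commit()
    return conn


def mediated_query(name):
    """Mediation: forward the query to every source at query time."""
    results = []
    for wrapper in WRAPPERS:
        results.extend(wrapper(name))
    return results
```

In this sketch a record added to a source after `build_warehouse()` has run is visible to `mediated_query` immediately, but reaches the warehouse only when it is rebuilt or refreshed. That is the trade-off between direct local queries and up-to-date answers.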

Now let’s study the…

• Information Integration in Bioinformatics

- Bioinformatics Overview

• Is there any relationship between Web and Bioinformatics?

- Difficulties in handling biological XML data

- Why Integration?

• Cultures: Schema-driven, Data-driven

• Models: Federation, Warehousing, Mediation

• Integration of XML format data

- Problems

- Issues

• Discussion about Reading Question #6

Integration of XML format data

• Why XML?

- Biology is a complex discipline

- Wide variety of data resources and repositories

• No standard protocol exists to interrogate biological data stores

• No standard data format exists to exchange biological data.

• No standard data model exists.

- Difficulties in using and exchanging data

Integration of XML format data

• Problems

- We focus on schema-driven integration

- The warehousing model is efficient

• Have to analyze data

• Performance

• Implementing a perfect mediation model is extremely difficult

- XML data should be converted into a relational database (RDB)

- We want to make our own DB schema accommodating the data from XML files

- We need to design the DB schema with efficiency and our own purposes in mind

- Heterogeneity and Large scale

Integration of XML format data

• PreSPI (Prediction System for Protein Interaction)

[Figure: PreSPI architecture. Labels: XML sources on the Web (Sequence, Structure, Function, Domain, ...), General XML Wrapper (SAX), Integration Rule, Local DB1, Local DB2, Local DB3, Warehouse]
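A general SAX wrapper driven by an integration rule might look like the sketch below. It is not the PreSPI implementation. The rule format, the table and column names, and the `protein` / `domain` element names are assumptions made for illustration. Table and column names are taken from the trusted rule table, so they are interpolated into the SQL directly.

```python
import sqlite3
import xml.sax

# Hypothetical integration rule: element name -> (table, {attribute: column}).
RULES = {
    "protein": ("protein", {"id": "protein_id", "name": "name"}),
    "domain": ("domain", {"id": "domain_id", "protein": "protein_id"}),
}


class RuleDrivenWrapper(xml.sax.ContentHandler):
    """Streams an XML file and inserts one row per element named in the rules."""

    def __init__(self, conn, rules):
        super().__init__()
        self.conn = conn
        self.rules = rules

    def startElement(self, name, attrs):
        rule = self.rules.get(name)
        if rule is None:
            return  # element not covered by any rule: skip it
        table, mapping = rule
        cols = ", ".join(mapping.values())
        marks = ", ".join("?" for _ in mapping)
        values = [attrs.get(attr) for attr in mapping]
        self.conn.execute(
            f"INSERT INTO {table} ({cols}) VALUES ({marks})", values
        )


def load(xml_bytes, conn, rules=RULES):
    """Create the target tables from the rules, then stream the document."""
    for table, mapping in rules.values():
        cols = ", ".join(f"{col} TEXT" for col in mapping.values())
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    xml.sax.parseString(xml_bytes, RuleDrivenWrapper(conn, rules))
    conn.commit()
```

Because SAX delivers events as the file is read, the wrapper never holds the whole document in memory. Supporting a new source format then means adding entries to the rule table rather than writing a new parser.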

Issues of Using XML Biological data

• Structure

- Semi-structured: can be expressed as trees or graphs

- Theoretically, it is ideal to map them into a DB in a way that reflects their structural features

• Method for storing XML

- File system

• Has overhead for query

• Text file, inverted list, compressed file

- Specific storing method

• Use XML’s own structure

- DB system

• In particular, mapping into an RDB has been researched a lot

• Has overhead for converting into the appropriate model

Issues of Using XML Biological data

• Object view of the XML

- Use DOM

- A class can be mapped to a table; PCDATA or an ATTRIBUTE can be a column

         XML               Objects             Tables
    =============       ============       ==============
                                            Table A
    <A>                 object A {          -------------
      <B>bbb</B>          B = "bbb"         B    C    D
      <C>ccc</C>   <=>    C = "ccc"   <=>   ---  ---  ---
      <D>ddd</D>          D = "ddd"         ...  ...  ...
    </A>                }                   bbb  ccc  ddd
                                            ...  ...  ...

• XML-view

    CREATE XMLVIEW xview_1 ( id char(20), email char(30) )
    AS ( 'select p.personnel.person@id, p.personnel.person@email
          from "file:/home/user1/personal.xml", p;' );

→ "A generic load/extract utility for data transfer between XML documents and relational databases", Bourret, R.; Bornhovd, C.; Buchmann, A.; Advanced Issues of E-Commerce and Web-Based Information Systems, 2000.
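The object view can be tried out with a small DOM-based sketch. It is an assumption-laden illustration rather than the utility described in the cited paper: the element becomes a table, and its attributes and the PCDATA of its child elements become TEXT columns. The helper names `element_to_row` and `store` are hypothetical.

```python
import sqlite3
from xml.dom import minidom


def element_to_row(elem):
    """Attributes and the PCDATA of child elements become column values."""
    row = {name: value for name, value in elem.attributes.items()}
    for child in elem.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            text = "".join(
                n.data for n in child.childNodes if n.nodeType == n.TEXT_NODE
            )
            row[child.tagName] = text
    return row


def store(xml_text, conn):
    """Map the document element to a table named after its tag."""
    root = minidom.parseString(xml_text).documentElement
    row = element_to_row(root)
    table = root.tagName
    # Identifiers come from the document itself; quoting them is enough for
    # a sketch, but real code should validate them first.
    cols = ", ".join(f'"{col}" TEXT' for col in row)
    marks = ", ".join("?" for _ in row)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    conn.execute(f'INSERT INTO "{table}" VALUES ({marks})', tuple(row.values()))
    conn.commit()
```

For example, `store("<A><B>bbb</B><C>ccc</C><D>ddd</D></A>", conn)` creates table `A` with columns `B`, `C`, `D`. Note that `minidom` builds the whole tree in memory, which suits small documents only.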

Issues of Using XML Biological data

• Direct method

[Figure: direct method. An XML Document and a Mapping Rule are input to the XML Saver, which outputs and executes Insert Statements.]

→ "A direct method of data exchange between XML and relational database", Bei Jia; Cai Fei; Tao Lie-Jun; Pan Jin-Gui; Information Technology Interfaces, 2004. 26th International Conference on, 2004, pp. 127-132, Vol. 1

Issues of Using XML Biological data

• Direct Method (cont’d)

Issues of Using XML Biological data

• Current methods force the DB to follow the XML schema

• Complex structured XML

- Elements share the same name even though they should be different columns in the DB (DIP, InterPro…)

• Files are large, so we cannot use DOM

• XML is updated frequently, so the process should be easy

    <protein id="ID_A" name="PROTEIN_A">
      <ref db="B" id="ID_B" />
      <ref db="C" id="ID_C" />
      ...

We want:

    ID    NAME       B     C     D     E
    ID_A  PROTEIN_A  ID_B  ID_C  ID_D  ID_E

Rather than:

    ID    DB  ID
    ID_A  B   ID_B
    ID_A  C   ID_C
    ID_A  D   ID_D
    ID_A  E   ID_E
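Since DOM is ruled out for large files, the wide row can be built while streaming. The sketch below uses `xml.etree.ElementTree.iterparse` as one possible approach. The enclosing `<proteins>` root and the fixed column order `B, C, D, E` are assumptions added for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the <proteins> root element is assumed.
SAMPLE = b"""<proteins>
  <protein id="ID_A" name="PROTEIN_A">
    <ref db="B" id="ID_B"/>
    <ref db="C" id="ID_C"/>
    <ref db="D" id="ID_D"/>
    <ref db="E" id="ID_E"/>
  </protein>
</proteins>"""


def wide_rows(source, dbs=("B", "C", "D", "E")):
    """Yield one (id, name, B, C, D, E) tuple per <protein>, streaming."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "protein":
            continue
        # Same element name <ref>, but each db value goes to its own column.
        refs = {ref.get("db"): ref.get("id") for ref in elem.findall("ref")}
        yield (elem.get("id"), elem.get("name"), *(refs.get(db) for db in dbs))
        elem.clear()  # drop the processed subtree to keep memory use low


rows = list(wide_rows(io.BytesIO(SAMPLE)))
```

A missing reference simply yields `None` in its column, so proteins with different sets of cross-references still fit the same table.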

Issues of Using XML Biological data

The direct method cannot cover the following type of XML. It cannot integrate two or more files, and constraints are needed.

Excerpt (some opening tags are not shown on the slide):

    <node id="G:1" uid="DIP:232N" name="BAXA_HUMAN" class="protein">
      <xref db="DIP" id="232N" type="src"/>
      <feature name="swp_ref" class="cref">
        <src>SwissProt</src>
        <val>SWP:Q07812</val>
        <xref db="SWP" id="Q07812" type="src"/>
      </feature>
      ...
        <xref db="PIR" id="A47538" type="src"/>
      </feature>
      ...
        <xref db="gi" id="539664" type="src"/>
      </feature>
      ...
        <val>bcl-2-associated protein x, alpha splice form</val>
      </att>
      <att name="organism">
        <val>Homo sapiens</val>
        <xref db="TXID" id="9606" type="ont"/>
      </att>
    </node>

We want:

    ID   NAME        DIP_ID    PIR_ID  GI_ID   SWP_ID
    G:1  BAXA_HUMAN  DIP:232N  A47538  539664  Q07812

But, we get:

    ID   DB   Ref_ID
    G:1  DIP  DIP:232N
    G:1  SWP  Q07812
    G:1  PIR  A47538
    G:1  gi   539664
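One way to reach the wide row is to ignore how deeply each `xref` is nested and to filter on its attributes. The sketch below assumes a simplified, well-formed stand-in for the excerpt: the bare `<feature>` wrappers around the PIR and gi references are placeholders for the parts not shown on the slide. It also assumes `DIP_ID` is taken from the `uid` attribute and that only `type="src"` references are wanted.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the excerpt; bare <feature> wrappers are assumed.
NODE_XML = """<node id="G:1" uid="DIP:232N" name="BAXA_HUMAN" class="protein">
  <xref db="DIP" id="232N" type="src"/>
  <feature name="swp_ref" class="cref">
    <src>SwissProt</src>
    <val>SWP:Q07812</val>
    <xref db="SWP" id="Q07812" type="src"/>
  </feature>
  <feature><xref db="PIR" id="A47538" type="src"/></feature>
  <feature><xref db="gi" id="539664" type="src"/></feature>
  <att name="organism">
    <val>Homo sapiens</val>
    <xref db="TXID" id="9606" type="ont"/>
  </att>
</node>"""


def node_to_wide_row(node, dbs=("PIR", "gi", "SWP")):
    """Flatten every nested <xref type="src"> into one column per database."""
    refs = {
        x.get("db"): x.get("id")
        for x in node.iter("xref")  # iter() visits descendants at any depth
        if x.get("type") == "src"   # constraint: skip e.g. the TXID ontology ref
    }
    row = {"ID": node.get("id"), "NAME": node.get("name"),
           "DIP_ID": node.get("uid")}
    for db in dbs:
        row[f"{db.upper()}_ID"] = refs.get(db)
    return row


row = node_to_wide_row(ET.fromstring(NODE_XML))
```

The `type == "src"` test plays the role of the constraint mentioned on the slide: without it the `TXID` reference would be collected along with the database identifiers.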

Issues of Using XML Biological data

• Make a data set for a tuple, which ignores sub-document tree nodes

• Define an SQL-like syntax

- A WHERE condition on each column expresses constraints

- Data from multiple files can be populated into one table

    CREATE TABLE PROTEIN_IDs (
        ID_A CHAR(20), NAME CHAR(20),
        B CHAR(20), C CHAR(20), D CHAR(20), E CHAR(20)
    ) AS (
        SELECT (
            FILE.protein@id,
            FILE.protein@name,
            [FILE.protein.ref]@id WHERE @db = B,
            [FILE.protein.ref]@id WHERE @db = C,
            [FILE.protein.ref]@id WHERE @db = D,
            [FILE.protein.ref]@id WHERE @db = E,
            [FILE_2.ELEMENT]@value WHERE @id = ID_A
        )
        FROM "file/protein.xml" AS FILE,
             "file/file.xml" AS FILE_2
    );
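The meaning of such a statement can be approximated in Python. This is a sketch of the semantics only, not a parser for the syntax above. Each column is described by a path, an attribute and a WHERE constraint, and a second file is joined on the row's `ID_A`. The sample documents, their root elements, and the extra `VALUE` column name are assumptions, since the statement lists seven selected items for six declared columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for "file/protein.xml" and "file/file.xml".
PROTEIN_XML = """<proteins>
  <protein id="ID_A" name="PROTEIN_A">
    <ref db="B" id="ID_B"/><ref db="C" id="ID_C"/>
    <ref db="D" id="ID_D"/><ref db="E" id="ID_E"/>
  </protein>
</proteins>"""
FILE_2_XML = '<elements><ELEMENT id="ID_A" value="some value"/></elements>'

# column -> (path relative to the row element, attribute, WHERE constraints)
COLUMNS = {
    "ID_A": (".", "id", {}),
    "NAME": (".", "name", {}),
    "B": ("ref", "id", {"db": "B"}),
    "C": ("ref", "id", {"db": "C"}),
    "D": ("ref", "id", {"db": "D"}),
    "E": ("ref", "id", {"db": "E"}),
}


def select(row_elem, path, attr, where):
    """Return attr of the first element on path whose attributes match where."""
    for elem in row_elem.findall(path):
        if all(elem.get(key) == val for key, val in where.items()):
            return elem.get(attr)
    return None


def populate(xml_text, row_tag, columns, file_2_text=None):
    """Build one dict per row element; optionally join a second file on ID_A."""
    rows = [
        {col: select(row_elem, *spec) for col, spec in columns.items()}
        for row_elem in ET.fromstring(xml_text).iter(row_tag)
    ]
    if file_2_text is not None:
        lookup = {
            e.get("id"): e.get("value")
            for e in ET.fromstring(file_2_text).iter("ELEMENT")
        }
        for row in rows:
            row["VALUE"] = lookup.get(row["ID_A"])  # WHERE @id = ID_A
    return rows


rows = populate(PROTEIN_XML, "protein", COLUMNS, FILE_2_XML)
```

Keeping the column specification as data means the target schema is chosen by the integrator rather than dictated by the XML structure. The sketch parses both files fully for brevity; a streaming parser would be needed for large inputs.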