Information and Communications Univ. Bioinformatics & Software Systems Lab.
Woo-Hyuk Jang
Integration of Biological XML data
Ph. D. Lecture
2
Where are we?
Server-Side Info. Management
Client-Side Info. Management Business related Issues
Web Services Internationalization and Privacy XML & XML Processing HTML, JavaScript, Plug-in, Applet… WWW Concepts & Web-based Info. Management
S.S. Info. Management Concept CGI, Java Servlets
JDBC, MySQL
App. Of Web-based tech. Semantic Web
ICE0534 - Web-Based Software Development, Summer 2005 3
Can you remember?
• Problems in Integrating Heterogeneous
Information
- Heterogeneity of formats, data types, units, or semantics.
• Information Mediation
Fig 1. Mediator in Lecture 7.
This Lecture Contains…
• Information Integration in Bioinformatics
- Bioinformatics Overview
• Is there any relationship between Web and Bioinformatics?
- Difficulties to handle Biological XML data - What is it? Why?
• Cultures : Schema-driven, Data-driven • Models : Federation, Warehousing, Mediation
• Integration of XML format data
- Problems - Issues
ICE0534 - Web-Based Software Development, Summer 2005 5
Bioinformatics
• A narrow sense
- The application of information technology to life science research
• Modeling (abstraction) • Analysis and collection
• Data integration and information retrieval
- Enables the discovery and analysis of biomolecules and their properties (Structure, function, interactions)
• A wide sense
- The use of computers to collect, analyze, and interpret biological information at the molecular level
6
Web and Bioinformatics
Experiment, Publish
Use
Make, Publish Use
Biological Data Bio Applications
ICE0534 - Web-Based Software Development, Summer 2005 7
Difficulties to Handle Biological XML data
• Lack of standard
- Different data model and schemas - Different handling methods are needed - Different formats
• Monstrous volume of data
- It is growing exponentially - Data are updated very frequently
• Newly introduced data, error fixed data
Why Integration?
• In the post-human genome sequencing era,
many analyses on the genome scale are
possible
• Majority of human diseases are the product
of multi-step pathophysiological processes
• The biggest challenge in interpreting the
results of these analyses lies in the data
integration problem
ICE0534 - Web-Based Software Development, Summer 2005 9
Two Cultures of Integration
• Database Integration
- Schema level view - Focus on outside of data
• Data Integration
- Data level view
- Focus on inside of data
Schema 1 Schema 2 Schema 3 Schema 4 Data 1 Data 3 Data 2 Data 4 10
Two Cultures of Integration
• Schema-driven (computer scientists)
- Much smaller than data, (hopefully) well-defined elements - Resolve redundancy and heterogeneity at the schema level - High degree of automation once system is set-up
- Focus on methods - you rarely publish a “data paper”
• Data-driven (biologists)
- Value is in the data, abstraction is a result of analysis - Don‘t bother with schemas
• Abstraction is volatile and depends on experimental technique
- Manual integration at data level, constant high effort - You rarely publish a (database) “method paper”
ICE0534 - Web-Based Software Development, Summer 2005 11
Models of Integration
• Federation (Multi-database)
• Warehousing (Materialized in house)
• Mediation (Virtual integration)
Models of Integration
• Federation (Multi-database)
- K2/BioKleisli, Entrez
Component DB 1 Component DB 2 Component DB n
FDBMS
Comp DB 1 Comp DB 2.1 Comp DB 2.2
(another FDBMS) (Distributed DBMS)
(Centralised DBMS)
Component DB 1 Component DB 2 Component DB n
FDBMS
Comp DB 1 Comp DB 2.1 Comp DB 2.2
(another FDBMS) (Distributed DBMS)
ICE0534 - Web-Based Software Development, Summer 2005 13
Models of Integration
• Warehousing (Materialized in house)
- GUS (Genome Unified Schema), SRS (Sequence Retrieval System)
-Local Operational Warehouse Decision Support & Mining Network Internet Integration & Storage R3 R2 14
Models of Integration
• Mediation (Virtual integration)
- TAMBIS (Transparent Access to Multiple Bioinformatics Information Source)
Mediator
Network Internet
ICE0534 - Web-Based Software Development, Summer 2005 15
Models of Integration
• Federation represents a more “static”
approach – using agreed couplings to allow
view creation.
• Warehousing and Mediation addresses
integration in a more “dynamic” way – using
extraction, transformation and integration
processes.
Warehousing vs. Mediation
• Warehouse
- Update-driven: i.e. in warehouse repository
- Heterogeneous data is integrated in advance and stored in-house for direct query and analysis.
• Mediation
- Wrapper and Mediator layer on top of source DBs.
- Query-driven: Query to mediated schema then translated into queries appropriate to sources.
ICE0534 - Web-Based Software Development, Summer 2005 17
Now let’s study the…
• Information Integration in Bioinformatics
- Bioinformatics Overview
• Is there any relationship between Web and Bioinformatics? - Difficulties to handle Biological XML data
- Why Integration?
• Cultures : Schema-driven, Data-driven • Models : Federation, Warehousing, Mediation
• Integration of XML format data
- Problems - Issues
• Discussion about Reading Question #6
18
Integration of XML format data
• Why XML?
- Biology is a complex discipline
- Wide variety of data resources and repositories
• No standard protocol exists to interrogate biological data stores
• No standard data format exists to exchange biological data.
• No standard data model exists.
- Difficulties in using and exchanging data
ICE0534 - Web-Based Software Development, Summer 2005 19
Integration of XML format data
• Problems
- We focus on schema-driven integration - Warehousing model is efficient
• Have to analyze data • Performance
• To implement perfect mediation model is extremely difficult
- XML data should be converted into RDB
- We want to make our own DB schema accommodating the data from XML files
- We need to make the DB schema regarding efficiency and our own purpose
- Heterogeneity and Large scale
Integration of XML format data
• PreSPI (Prediction System for Protein Interaction)
General XML Wrapper (SAX)
Sequence Structure Function
٠
٠
٠
DomainXML
XML XMLXML XMLXML XMLXML
Integration Rule
Local DB1 Local DB2 Local DB3Warehouse
Local
Web
ICE0534 - Web-Based Software Development, Summer 2005 21
Issues of Using XML Biological data
• Structure
- Semi-structured: Can be expressed as trees, graphs - Theoretically, it is ideal to map them into DB regarding
structural feature
• Method for storing XML
- File system
• Has overhead for query
• Text file, invert list, compression file
- Specific storing method
• Use XML’s own structure
- DB system
• Especially, mapping into RDB has been researched a lot • Has overhead for converting into the appropriate model
22
Issues of Using XML Biological data
z Object view of the XML9 use DOM
9 A Class can be mapped into a Table, PCDATA or ATTRIBUTE can be column
z XML Objects Tables z ============= ============ ============== z Table A z <A> object A { ---z <B>bbb</B> B = "bbb" B C D z <C>ccc</C> <=> C = "ccc" <=> -- --- ---z <D>ddd</D> D = "ddd" ... ... ... z </A> } bbb ccc ddd z ... ... ... z XML-view
z CREATE XMLVIEW xview_1( id char(20), email char (30) ) z AS ( ‘select p.personnel.person@id, p.personnel.person@email z from “file:/home/user1/personal.xml”, p; ‘);
ΓA generic load/extract utility for data transfer between XML documents and relational databases” Bourret, R.; Bornhovd, C.; Buchmann, A.;Advanced Issues of E-Commerce and Web-Based Information Systems, 2000.
ICE0534 - Web-Based Software Development, Summer 2005 23
Issues of Using XML Biological data
• Direct method
XML
Document StatementInsert
Mapping Rule XML Saver input Output & execute input
ΓA direct method of data exchange between XML and relational database” Bei Jia; Cai Fei; Tao Lie-Jun; Pan Jin-Gui; Information Technology Interfaces, 2004. 26th International Conference on 2004 Page(s):127 - 132 Vol.1
Issues of Using XML Biological data
• Direct Method (cont’d)
ICE0534 - Web-Based Software Development, Summer 2005 25
Issues of Using XML Biological data
• Current methods force DB to follow XML schema
•
Complex structured XML
- Share the same element name even thought they should be different columns in DB (DIP, InterPro…)
• Large size of file; we cannot use DOM
• XML updated frequently; the process should be easy
ID_B C … ... ID_B … B ... <protein id=“ID_A" name=“PROTEIN_A“>
<ref db=“B" id=“ID_B" /> <ref db=“C" id=“ID_C" /> ……….. ID_E E ID_A ID_D D ID_A ID_C C ID_A ID_B B ID_A ID DB ID ID_C C ID_E E ID_D D ID_B PROTEIN_A ID_A B NAME ID Rather than 26
Issues of Using XML Biological data
Direct Method cannot cover following XML type Cannot integrate two more files ; Needs constraint <node id="G:1" uid="DIP:232N" name="BAXA_HUMAN" class="protein">
<xref db="DIP" id="232N" type="src"/> <feature name="swp_ref" class="cref">
<src>SwissProt</src> <val>SWP:Q07812</val>
<xref db="SWP" id="Q07812" type="src"/> </feature>
<feature name="pir_ref" class="cref"> <src>PIR</src>
<val>PIR:A47538</val>
<xref db="PIR" id="A47538" type="src"/> </feature>
<feature name="gi_ref" class="cref"> <src>NCBI</src>
<val>GI:539664</val>
<xref db="gi" id="539664" type="src"/> </feature>
<att name="descr">
<val>bcl-2-associated protein x, alpha splice form</val> </att>
<att name="organism"> <val>Homo sapiens</val> <xref db="TXID" id="9606" type="ont"/> </att> </node> Q07812 SWP_ID 539664 GI_ID A47538 PIR_ID DIP:232N BAXA_HUMAN G:1 DIP_ID NAME ID 539664 gi G:1 A47538 PIR G:1 Q07812 SWP G:1 DIP:232N DIP G:1 Ref_ID DB ID We want But,
ICE0534 - Web-Based Software Development, Summer 2005 27
Issues of Using XML Biological data
• Make a data set for a tuple, which ignore sub
document tree nodes
• Define SQL like syntax
- Where condition of each column for constraints
- Multiple files can be populated into one table by manipulation
CREATE TABLE PROTEIN_IDs(ID_A CHAR(20), NAME CHAR(20), B CHAR(20), C CHAR(20), D CHAR(20) , E CHAR(20) ) AS ( SELECT ( FILE.protein@id, FILE.protein@name, [FILE.protein.ref]@id WHERE @db = B, [FILE.protein.ref]@id WHERE @db = C, [FILE.protein.ref]@id WHERE @db = D, [FILE.protein.ref]@id WHERE @db = E,
[FILE_2.ELEMENT]@value WHERE @id=ID_A) FROM “file/protein.xml”AS FILE, “file/file.xml”AS FILE_2);