EDIT: Easy-to-use Bio-Data Extraction, Integration and Refreshing Tools


Tianyun Ni^1,3, Stanley Su^3, and Lei Zhou^1,2

^1 Shands Cancer Center, ^2 Department of Molecular Genetics and Microbiology, College of Medicine, ^3 Database Systems R&D Center, College of Engineering, University of Florida, Gainesville, FL, USA

[email protected]

Abstract

Systems biology research demands the ability to construct custom-designed bio-entities from heterogeneous data sources. EDIT is an ongoing project aimed at building a system that allows bio-researchers with minimal computer proficiency to define the structure of a desired bio-entity and to identify the data sources for its data fields. The system then automatically extracts the data and outputs it in the predefined structure and format. The system consists of a host site, which maintains and updates the configuration files specified for source databases, and user sites, which perform data extraction and integration by using the configuration files. An Event-Trigger-Rule Server is used to support data refreshing and to notify biologists about changes made to source databases.

1. Introduction

1.1 Current problems:

Biological and biomedical research is a multidisciplinary endeavor. Decades, even centuries, of research have accumulated a rich body of information ranging from human to virus and from ecosystem to protein structure. The accumulation of information at different levels of biological systems (DNA, proteins, cells, tissues, etc.) provides a new opportunity for the study of genetic architecture, biological variation and complex phenotypes. However, the realization of this potential hinges on the ability of bio-researchers to effectively extract and analyze multidisciplinary data from public data sources. Researchers often face the following problems:

1) Fragmented information in a sea of databases.

There are now well over 1000 databases and servers providing biomedical or biological information over the Internet. Different databases provide different, yet related, information: for example, PROSITE for protein motifs and TRANSFAC for transcription factors and binding sites. As a result, information about one bio-entity (e.g., a gene) is scattered across many databases.

2) The lack of easy-to-use and effective tools for defining customized data structures and formats, and for automatically extracting and integrating data retrieved from these public databases to form the instances of the defined entity types. A bio-researcher typically has to go through a laborious process to visit different data systems and manually copy data from various data sources to construct a desired biological entity.

3) The lack of a data refreshing and user notification facility. Biological and biomedical data are highly dynamic, not only because data refinement is a long, ongoing process but also because new types of data continue to emerge as the field of biomedical research progresses rapidly. When data stored in public databases are refined or modified and new data are added, the changes affect the currency, completeness and validity of the data that bio-researchers have already accessed or downloaded. Either the constructed bio-entities should be automatically refreshed, or the bio-researchers should be informed of the changes so that they can take proper action.

1.2 Current needs:

In the past decade, many research and development efforts have been devoted to dealing with the data extraction and integration problem. Two general approaches have been taken by the biomedical community to access and integrate data from heterogeneous data sources: the federated approach and the data warehousing approach [1-6]. The former approach builds a layer of software on top of distributed and heterogeneous database systems and provides a query facility to access physically distributed data. The latter approach extracts data from different data sources and integrates and stores them in a local database (a data warehouse) to support local computation and analysis. These two approaches correspond to the “virtual” way and the “materialized” way of data integration, which are addressed in [7]. The federated approach has the advantages of (1) preserving the local autonomy of existing systems in the manipulation and management of their databases, (2) accessing up-to-date data from different data sources at run-time, and (3) giving users an integrated view for querying distributed data. Most of the current large-scale data integration projects lean towards the federated approach.

1.2.1 The need for integrating data into custom-designed data structures

The various federated data integration projects give biomedical researchers the capability to query and browse information across different database systems. However, the federated approach also has major disadvantages, including performance (e.g., run-time query translation and access to distributed data over the network), availability and reliability (e.g., possible site and network failures, which make some data inaccessible), and cost (e.g., the costs of purchase or development, installation and maintenance).

Figure 1. The EDIT system has two components: a host site and user sites. The host site deals with the data structures, formats and access mechanisms of data sources, thus allowing the users to design data integration without worrying about the details of data extraction mechanisms.

More importantly, federated approaches are not suitable for those biomedical researchers who want to create customized data structures to meet their specific research needs and objectives. A customized data structure designed for a specific research application may include information that can be extracted from multiple heterogeneous public database systems. In addition, the information extracted from public databases often needs to be integrated with specialized information that is either publicly available but not in a database, or unpublished data obtained by the research group. For instance, during our construction of the Integrated Knowledge Base for Apoptosis Research (IKBAR, http://www.ikbar.org/), we needed to integrate information extracted from public database systems, such as genomic sequences (from Ensembl) and protein sequences and domain information (from Swiss-Prot), with information generated by researchers, such as information on genetic interactions and regulatory pathways.

A persistent local data repository that integrates data from the public domain with proprietary data is becoming a necessity for many biomedical research groups as we progress into the era of systems biology. Without a centralized persistent database at his/her own site, it is very difficult for a biomedical researcher to perform complex follow-up computations and analyses efficiently, e.g., analyzing data obtained from genomic and proteomic studies or supporting system simulation experiments.

1.2.2 The need for easy-to-use data integration tools

As various genomic and proteomic techniques become the standard of laboratory research, the ability to extract and integrate data from heterogeneous data sources is becoming indispensable to many bench scientists.

Although a local data warehouse designed to satisfy the specific research needs of a biomedical researcher is ideal, the resources and expertise needed to build such a system are usually beyond the means of a typical researcher. Very often the researcher does not need a full-blown, complicated data warehouse. Rather, he or she only wants to access some specific data from a limited number of databases and display them for comparison or analysis purposes. For instance, to fully analyze a list of Probe_sets coming out of a DNA array experiment, the scientist will need to extract related information from a variety of databases for each gene on the list.

Traditionally, large labs and centers have their own database systems and professional programming support. In most small to medium-sized research groups, however, the researcher has to go through the laborious process of visiting different web sites, making queries for each individual gene, and often copying and pasting from the resulting web pages to compile the information needed for annotating the list of genes. An easy-to-use data extraction and integration tool that does not demand programming ability is highly desirable for a vast population of researchers.

1.2.3 The need for automatic data refreshing and notification tools

Biomedical research is such a dynamic field that central databanks, such as GenBank, are updated daily. In addition, new types of data continue to emerge as the field of biomedical research progresses rapidly. For example, several new data types have emerged from the fast development and divergence of DNA microarray technologies. Different data structures are needed for recording data generated by using the cDNA-based two-color array [8], oligo pair-based arrays such as the GeneChip (© Affymetrix Corp) [9], and the newer Agilent Printed Microarrays (http://www.chem.agilent.com/). In addition, new data types are formulated as the research community endeavors to capture functional information that is traditionally not recorded in a structured format. For example, there are many emerging and evolving data models designed and tested for recording regulatory pathways [10-12].

When data stored in public databases are refined or modified and new data are added, the changes affect the currency, completeness and validity of the data that biomedical researchers have accessed or downloaded to their local sites. The local data need to be refreshed (automatically if possible) to account for the changes made to the source databases. At the very least, the biomedical researchers should be informed of the changes so that they will not continue to use out-of-date and possibly invalid data. The needs for automatic periodic checking, notification, and refreshing to keep local data repositories up to date and accurate are fully addressed in our EDIT system.

2. Objectives

The goal of our research and development is to make data extraction and integration much easier and more efficient for bio-researchers who want to create bio-entities with customized data structures to meet their specific research needs and objectives, but who have only limited computer knowledge and computational resources. The tools provided by the EDIT system allow bio-researchers to:

• Define bio-entities with customized data structures through an easy-to-use graphical interface.

• Specify source data systems and integration strategies through a user-friendly GUI without having to know the detailed structures and formats of source databases.

• Choose the output format of the constructed bio-entity in XML and, based on the user’s choice, import it into Excel for display and analysis, or Microsoft Access or other database systems for storage and retrieval, or store it in an XML file for electronic transfer and data sharing.

The aim of these tools is to 1) enable scientists with no computer programming background to extract information from multiple sources and integrate it into simple data structures for comparing sets of genes/proteins; and 2) help biologists with limited computer proficiency build local integrated databases for their specialized research interests.
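As an illustration of the idea of a user-defined bio-entity emitted in XML, the following is a minimal Python sketch and not the EDIT implementation. The `FieldSpec` class, the `build_entity_xml` function, the source labels `source_A` and `source_B`, and the sample values are all hypothetical names introduced here; the actual GUI tools and their file formats are not specified in this paper.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """One data field of a bio-entity and the source it is drawn from."""
    name: str    # field name chosen by the researcher
    source: str  # hypothetical label for a public data source


def build_entity_xml(entity_type, fields, values):
    """Serialize one bio-entity instance in the user-defined structure."""
    root = ET.Element(entity_type)
    for spec in fields:
        child = ET.SubElement(root, spec.name, source=spec.source)
        # Fields with no extracted value are emitted as empty elements.
        child.text = values.get(spec.name, "")
    return ET.tostring(root, encoding="unicode")


# Hypothetical entity definition: a "Gene" with two fields from two sources.
gene_fields = [
    FieldSpec("symbol", "source_A"),
    FieldSpec("protein_sequence", "source_B"),
]
xml_text = build_entity_xml(
    "Gene", gene_fields, {"symbol": "demo1", "protein_sequence": "MAQI"}
)
print(xml_text)
```

Keeping the entity definition as plain data, separate from the serialization step, is one way the same definition could feed several output targets (spreadsheet, database, XML file), as the list above describes.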

EDIT also provides the following mechanisms:

• A data pulling mechanism, at the host site, to automatically pull data from public data systems in order to detect different types of changes made in the public databases.

• A rule-based mechanism to automatically refresh the constructed bio-entities to account for content changes made to the source databases, and to automatically inform all affected bio-researchers when other types of changes made in the source databases can affect the currency, completeness and validity of their bio-entities.

3. System Design

The EDIT system consists of a host site located at the University of Florida and managed by the project administrator, and any number of user sites established in typical biomedical research labs. The host site’s main function is to gather detailed structural and format information of data accessible from public data systems.

This information, stored in configuration files, is used by the tools running at the user sites to automatically integrate data extracted from public data systems (Figure 1). The main functions of the GUI tools running at the user sites are to help users 1) define customized data structures, 2) identify data sources, and 3) provide mapping information for constructing bio-entities.

For the past two and a half years, we have been working on the design and development of a prototype data integration system, JXP4BIGI [13]. JXP4BIGI (Java XML Page for Biological Information Gathering and Integration) provides a system-independent framework that generalizes and streamlines the steps of accessing, extracting, transforming and integrating the data retrieved from heterogeneous data sources to build a customized data warehouse. The desired bio-entities are defined in XML templates (Java XML pages, or JXP pages) that contain embedded, extended SQL statements. By running these templates in the JXP4BIGI framework with a number of generalized wrappers, the required data can be efficiently extracted from public databases and integrated to construct the bio-entities in XML format. This can be done without hard-coding the extraction logic for each data source. The data source configuration information provided by the host site, together with the integration strategy specified by the user, is used to automatically generate JXP pages for use by the JXP4BIGI Processor to extract data from public data systems. JXP4BIGI is being extended and integrated into the EDIT system.
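The combination of host-site configuration and user-specified integration strategy can be sketched as follows. This is a hedged Python illustration only: it produces a plain list of extraction steps rather than actual JXP pages, whose syntax is not given here. The dictionary layouts, the source labels, the `access` and `locator` keys, and the function `build_extraction_plan` are all assumptions made for the example.

```python
# Hypothetical host-site configuration: how each source exposes its fields.
HOST_CONFIG = {
    "source_A": {"access": "http", "fields": {"gene_symbol": "SYMBOL"}},
    "source_B": {"access": "flatfile", "fields": {"sequence": "SQ"}},
}

# Hypothetical user-site integration strategy:
# entity field -> (source, field name within that source).
USER_MAPPING = {
    "symbol": ("source_A", "gene_symbol"),
    "protein_sequence": ("source_B", "sequence"),
}


def build_extraction_plan(config, mapping):
    """Combine host configuration and user mapping into per-field steps."""
    plan = []
    for entity_field, (source, source_field) in mapping.items():
        if source not in config:
            raise KeyError(f"no configuration for source {source!r}")
        source_cfg = config[source]
        if source_field not in source_cfg["fields"]:
            raise KeyError(f"{source!r} has no field {source_field!r}")
        plan.append({
            "entity_field": entity_field,
            "source": source,
            "access": source_cfg["access"],
            "locator": source_cfg["fields"][source_field],
        })
    return plan


plan = build_extraction_plan(HOST_CONFIG, USER_MAPPING)
for step in plan:
    print(step)
```

The point of the split is that the user mapping never mentions access methods or source-specific locators; those details live only in the configuration maintained at the host site.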

In EDIT, automatic detection of data source changes is carried out by source data pulling facilities (yet to be implemented) running at both the host site and the user sites. Data refreshing and notification functions are handled by an implemented Event-Trigger-Rule (ETR) Server [13]. ETR is a generalization and extension of the event-condition-action (ECA) rule model used in active database management systems. The ETR model separates the specifications of events from those of rules, so that events can be defined independently of rules. Triggers are specifications that link distributed events to distributed rules. In the distributed environment of EDIT, the detection of a change made to a source database is treated as an event. Events can be defined at one site and rules can be defined at many other sites. An event can be associated with different rules at different users’ sites by different trigger specifications. Each subscriber of an event (i.e., a bio-researcher) can optionally specify conditions that the event data should meet (i.e., an event filter) before he/she is notified or the software running at his/her site is automatically activated. Once a database change has been detected, the subscribers’ event filters are checked to determine which subscribers should be notified, or which subscribers’ rules should be processed to trigger the execution of some procedures.
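The separation of events, triggers and rules described above can be sketched in a few lines of Python. This is a single-process toy model under stated assumptions, not the ETR Server itself: the class names `Event`, `Rule`, `Trigger` and `EtrServer`, the `post` method, and the event names are hypothetical, and distribution across sites is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """A detected change in a source database, defined independently of rules."""
    name: str
    data: Dict[str, str]


@dataclass
class Rule:
    """An action to run at a user site."""
    name: str
    action: Callable[[Event], str]


@dataclass
class Trigger:
    """Links an event name to a rule, guarded by the subscriber's event filter."""
    event_name: str
    rule: Rule
    event_filter: Callable[[Event], bool] = lambda event: True


@dataclass
class EtrServer:
    triggers: List[Trigger] = field(default_factory=list)

    def post(self, event):
        """Run every rule whose trigger matches the event and whose filter passes."""
        results = []
        for trig in self.triggers:
            if trig.event_name == event.name and trig.event_filter(event):
                results.append(trig.rule.action(event))
        return results


server = EtrServer()
refresh = Rule("refresh_entity", lambda e: f"refresh {e.data['record']}")
notify = Rule("notify_user", lambda e: f"notify about {e.data['record']}")
# This subscriber only cares about content changes in one (hypothetical) source.
server.triggers.append(
    Trigger("content_change", refresh, lambda e: e.data["source"] == "source_A")
)
server.triggers.append(Trigger("format_change", notify))

results = server.post(
    Event("content_change", {"source": "source_A", "record": "rec1"})
)
print(results)
```

Because a trigger is a separate object, the same event can be bound to different rules for different subscribers without touching either the event or the rule definitions.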

In this work, we categorize database changes into the following three categories: content change, format change, and structural change. Content change can be subcategorized into the insertion of new data, the updating of existing data fields (e.g., sequence data is changed from "AANNTG" to "AACTTGGA"), and the deletion of existing data instances. The format of a data field can also be changed in a source database. For example, the representation of a protein sequence is changed from upper case to lower case (e.g., MAQI to maqi). Worse still, the structure of data retrievable from a public data system may also be changed. For instance, new data fields may be added, existing data fields may be restructured, or new substructures may be added to an existing data structure. A data pulling facility running at the host site would periodically retrieve data from public data systems to detect structural and format changes. Once a change is detected, the system will update the configuration file and notify the users and/or activate the software components running at the user sites. Another source database pulling facility running at each user site is used to pull data from relevant public data systems to detect content changes, which would trigger rules to automatically perform data refreshing operations. Communications and data transfers between the host site and the user sites take place through events, which are posted and received by replicas of the Event-Trigger-Rule (ETR) system installed at all sites (Figure 2).
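The three change categories can be illustrated by comparing two snapshots of pulled data. The Python sketch below is an assumption-laden illustration and not the (yet to be implemented) pulling facility: it treats a snapshot as a dictionary of records, treats a case-only difference as a format change (following the MAQI/maqi example), and treats any difference in the set of field names as a structural change. It also checks all three categories in one function, whereas the design above splits detection between the host site and the user sites. The function name and record identifiers are hypothetical.

```python
def classify_changes(old, new):
    """Compare two snapshots {record_id: {field: value}} and label differences."""
    changes = []
    for rec_id in new.keys() - old.keys():
        changes.append(("content", "insert", rec_id))
    for rec_id in old.keys() - new.keys():
        changes.append(("content", "delete", rec_id))
    for rec_id in old.keys() & new.keys():
        before, after = old[rec_id], new[rec_id]
        if before.keys() != after.keys():
            # Fields were added or removed: the record structure changed.
            changes.append(("structure", "fields_changed", rec_id))
        for name in before.keys() & after.keys():
            if before[name] == after[name]:
                continue
            if before[name].lower() == after[name].lower():
                # Same value, different letter case: a format change.
                changes.append(("format", name, rec_id))
            else:
                changes.append(("content", "update:" + name, rec_id))
    return sorted(changes)


old_snapshot = {
    "r1": {"seq": "AANNTG"},
    "r2": {"protein": "MAQI"},
    "r3": {"seq": "ACGT"},
    "r5": {"seq": "GG"},
}
new_snapshot = {
    "r1": {"seq": "AACTTGGA"},
    "r2": {"protein": "maqi"},
    "r4": {"seq": "TTTT"},
    "r5": {"seq": "GG", "length": "2"},
}
for change in classify_changes(old_snapshot, new_snapshot):
    print(change)
```

Each labeled change could then be posted as an event, so that content changes lead to refreshing and format or structural changes lead to configuration updates and notification.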

4. Summary

EDIT is a general-purpose data extraction and integration system for the biomedical research community. The laborious tasks of extracting and integrating data from heterogeneous data sources are hidden from bio-researchers by the data source configuration files constructed and updated at the host site. Bio-researchers are free to construct custom-designed data structures and specify integration strategies on their own computers, and they receive the desired bio-entities that suit their analysis or simulation needs.

5. Acknowledgements:

This work is supported in part by NIH grant CA095542 to LZ.

6. References:

[1] Shoop, E., K.A. Silverstein, J.E. Johnson, and E.F. Retzel, MetaFam: a unified classification of protein families. II. Schema and query capabilities. Bioinformatics, 2001. 17(3): p. 262-71.

[2] Letovsky, Introduction, in Bioinformatics Databases and Systems. 1999, Kluwer Academic Publishers: Boston. p. 1-7.

[3] Markowitz, V.M., I.A. Chen, and A.S. Kosky, Exploring Heterogeneous Molecular Biology Databases in the Context of the Object-Protocol Model. 1996.

[4] Markowitz, V.M., Heterogeneous molecular biology databases. J Comput Biol, 1995. 2(4): p. 537-8.

[5] Markowitz, V.M. and O. Ritter, Characterizing heterogeneous molecular biology database systems. J Comput Biol, 1995. 2(4): p. 547-56.

[6] Karp, P.D., A strategy for database interoperation. J Comput Biol, 1995. 2(4): p. 573-86.

[7] Davidson, S.B., C. Overton, and P. Buneman, Challenges in integrating biological data sources. J Comput Biol, 1995. 2(4): p. 557-72.

[8] Brown, P.O. and D. Botstein, Exploring the new world of the genome with DNA microarrays. Nat Genet, 1999. 21(1 Suppl): p. 33-7.

[9] Affymetrix, Affymetrix Microarray Suite User's Guide Version 5. 2001.

[10] Bader, G.D., I. Donaldson, C. Wolting, B.F. Ouellette, T. Pawson, and C.W. Hogue, BIND--The Biomolecular Interaction Network Database. Nucleic Acids Res, 2001. 29(1): p. 242-5.

[11] Karp, P.D., M. Riley, S.M. Paley, and A. Pellegrini-Toole, The MetaCyc Database. Nucleic Acids Res, 2002. 30(1): p. 59-61.

[12] Kanehisa, M., S. Goto, S. Kawashima, and A. Nakaya, The KEGG databases at GenomeNet. Nucleic Acids Res, 2002. 30(1): p. 42-6.

[13] Huang, Y., T. Ni, L. Zhou, and S.Y.W. Su, JXP4BIGI: a generalized, Java XML-based approach for biological information gathering and integration. Bioinformatics, 2003. 19(18): p. 2351-2358.

Figure 2. Detailed architecture of the EDIT system.