Data integration from scientific data sources: A Survey


Rochester Institute of Technology

RIT Scholar Works

Presentations and other scholarship

2005


Rajeev Agrawal

Manjeet Rege


Recommended Citation

Rajeev Agrawal and Manjeet Rege, "Data Integration from Scientific Data Sources: A Survey", Proceedings of the 7th International Workshop on Computer Science and Information Technologies, 2005. Ufa, Russia


Data Integration from Scientific Data Sources: A Survey

Rajeev Agrawal
Department of Science and Math
Kettering University
Flint, USA
e-mail: [email protected]

Manjeet Rege
Department of Computer Science
Wayne State University
Detroit, USA
e-mail: [email protected]

Abstract


For the last few decades, data integration from heterogeneous data sources has been an area of active research in the database field. In this paper, we survey five scientific data integration systems. The purpose of the paper is not to provide an in-depth study of the different works on scientific data integration but to give the reader a brief overview of some of the research being conducted. Readers seeking more detailed information on the topic may refer to the references listed at the end of this paper.

1. Introduction

Scientific data integration raises all the issues of traditional data integration [1,2,3,4]. In addition, several other issues arise, such as data representation and reformatting, information extraction and metadata, data translation, and deriving semantics from heterogeneous scientific sources.

Currently, there are more than 400 publicly available scientific data sources, which use different data formats, access methods, and terminologies. These heterogeneous data sources store text and other multimedia components [Figure 1]. To answer a scientific researcher’s query, all or some of these data sources need to be queried. The results returned need to be integrated and presented in a format that suits the user’s requirements. Since these resources have been developed independently, each one has its own data definition format, vocabulary, and user interface.

There have already been some attempts to adopt a common vocabulary, such as the Gene Ontology [5]. For example, if one is searching for new targets for antibiotics, one may want to find all the gene products that are involved in bacterial protein synthesis and that have significantly different sequences or structures from those in humans. But if one database describes these molecules as being involved in 'translation', whereas another uses the phrase 'protein synthesis', then it will be difficult for the user, and even harder for a computer, to find functionally equivalent terms. The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. There are three separate aspects to this effort: first, writing and maintaining the ontologies; second, making associations between the ontologies and the genes and gene products in the collaborating databases; and third, developing tools that facilitate the creation, maintenance and use of ontologies.
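The vocabulary mismatch described above can be illustrated with a small sketch. The synonym table, record layout, gene product identifiers and function names below are hypothetical and hand-written for illustration; they are not taken from the Gene Ontology files or tools.

```python
# Hypothetical synonym table mapping free-text labels to one canonical term.
# The entries are hand-written for illustration, not loaded from GO.
CANONICAL_TERM = {
    "translation": "translation",
    "protein synthesis": "translation",
    "protein biosynthesis": "translation",
}

def normalize(label):
    """Return the canonical term for a label, or the cleaned label if unknown."""
    key = label.strip().lower()
    return CANONICAL_TERM.get(key, key)

def find_equivalent(records, query_label):
    """Return gene products whose annotation matches the query after normalization."""
    target = normalize(query_label)
    return [r["gene_product"] for r in records if normalize(r["annotation"]) == target]

# Two databases describing the same function with different phrases.
# The identifiers gp_A1 and gp_B7 are placeholders, not real gene products.
db_a = [{"gene_product": "gp_A1", "annotation": "translation"}]
db_b = [{"gene_product": "gp_B7", "annotation": "Protein synthesis"}]

matches = find_equivalent(db_a + db_b, "protein synthesis")
print(matches)
```

Without the shared table, a plain string comparison would return only the record from the second database; with it, both records are found regardless of which phrase each source used.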

Most of the user interfaces for scientific data sources are designed to be read by humans. Manually querying a large number of data sources is not efficient. Querying mechanisms such as wrappers [6,7,8] need to be modified whenever the data presentation format of a source changes. In order to query multiple data sources, some form of API must be provided to enable machine-readable querying.

Even though a significant amount of research is currently being conducted on issues related to scientific data integration, reviewing all of it would be beyond the scope of this paper. Instead, we review five scientific data integration systems.

Section 2 deals with the University of Manchester’s research project called TAMBIS, which aims at providing a single point of access to multiple bioinformatics data sources. Section 3 surveys the MBM project, which introduces the University of California, San Diego’s Model Based Mediation, in which views are defined and executed at the level of conceptual models rather than at the structural level. MOMIS, a research project at the University of Modena and Reggio Emilia, is reviewed in Section 4; it is a framework to perform information extraction and integration from both structured and semistructured data sources. We review IBM’s DiscoveryLink in Section 5, while in Section 6 we introduce Pegasys, which is a flexible, modular and customizable software system that facilitates the execution and data integration from heterogeneous biological sequence tools. We conclude our survey in Section 7 with possible directions for future work.

Figure 1. Typical Scenario of Scientific Data Search

2. TAMBIS

TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) is a joint research project between the School of Biological Sciences and the Information Management Group, part of Computer Science, at the University of Manchester in the UK. The idea behind the TAMBIS system is to provide transparency to users by means of ontologies. TAMBIS provides one uniform interface by hiding the heterogeneous biological data sources from the end user. It gives the illusion of a single query language, a single data model and a single data location [9]. It uses a mediator and many source wrappers to give the illusion of a single data source, as shown in the system architecture [Figure 2]. The main components, labeled A to E in the figure, are as follows:

1. Biological Concept Model:

The biological concept model, shown as A in Figure 2, is an ontology of biological terms expressed in a Description Logic called GRAIL. It is used to describe the metadata of the underlying data sources, representing an overarching universal schema. It is also used to express queries in the modeling language and to drive a graphical user interface for query formulation.

2. Knowledge-driven Graphical User Interface

B is a knowledge-driven query formulation interface. In TAMBIS, a concept is interchangeable with a database query [10]. The TAMBIS GUI allows the user to construct a concept in order to retrieve the desired information from the Bioinformatics data sources. The user might want to select a particular concept like “phenotype”. A Query Manipulation tool helps the user to add more information about the concept and/or make modifications to the selections previously made. This query is then used to retrieve relevant biological information.

3. The Source Model, Query Transformation Module and Query Execution Module

C shows the services model linking the biological ontology with the source schemas.

D is a query transformation and rewriting process, and E is a wrapper service dealing with external sources.

Queries expressed in GRAIL specify only “what” is required. They do not say “how” or “from where” the request has to be serviced. This information is provided by the query planning and transformation layer, which takes a GRAIL query as input and produces an execution plan as output. The planning and translation process consists of the following three steps:

• Translation into Query Internal Form (QIF): The GRAIL query is unnested and certain query constructs are simplified.

• Query Planning: A search algorithm considers alternative evaluation orders for the components of the QIF generated in the first step, with a view to identifying both valid and efficient ways of evaluating the query.

• Code Generation: The query plan that results from the planning phase is converted into a CPL program for execution.
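The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the TAMBIS implementation: the nested tuple format is not GRAIL syntax, the flat component list only imitates the idea of a QIF, the cost table and the exhaustive search are invented for the example, and the generated lines are a plain-text stand-in rather than real CPL.

```python
from itertools import permutations

# A toy nested query written as (concept, [nested sub-queries]).
# This is not GRAIL syntax; the concepts are illustrative.
nested_query = ("protein", [("motif", []), ("function", [("process", [])])])

# Hypothetical per-component cost estimates (assumed, e.g. expected result size).
COST = {"protein": 100, "motif": 5, "function": 20, "process": 2}

def to_internal_form(query, parent=None, out=None):
    """Step 1: unnest the query into flat (concept, depends_on) components."""
    if out is None:
        out = []
    concept, children = query
    out.append((concept, parent))
    for child in children:
        to_internal_form(child, concept, out)
    return out

def is_valid(order, components):
    """A component may run only after the component it depends on."""
    position = {concept: i for i, concept in enumerate(order)}
    return all(dep is None or position[dep] < position[c] for c, dep in components)

def plan(components):
    """Step 2: search all evaluation orders, keep valid ones, pick the cheapest."""
    names = [c for c, _ in components]
    n = len(names)

    def cost(order):
        # Earlier positions get larger weights, so cheap components go first.
        return sum(COST[c] * (n - i) for i, c in enumerate(order))

    valid = [o for o in permutations(names) if is_valid(o, components)]
    return min(valid, key=cost)

def generate_code(order):
    """Step 3: emit one line per step; a plain-text stand-in for a CPL program."""
    return [f"step {i + 1}: evaluate {c}" for i, c in enumerate(order)]

qif = to_internal_form(nested_query)
best = plan(qif)
program = generate_code(best)
print("\n".join(program))
```

Enumerating every permutation is only workable for very small queries; a real planner would need a smarter search, which is why the text speaks of a search algorithm rather than exhaustive enumeration.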


Figure 2: TAMBIS System Architecture

3. MBM

Most data integration systems function by having wrappers retrieve data from heterogeneous data sources and then mapping this data to some standard data model using XML at the mediator level. The drawback of this method is that the mediator has no knowledge of semantic relationships, class structures, or application-specific domain constraints.

MBM (Model Based Mediation), a research project at the University of California, San Diego, is a proposed solution to the problem stated above. MBM extends the traditional mediator architecture from the level of uninterpreted, semistructured data in XML syntax to the semantically rich level of conceptual models with domain knowledge (CMs) [11]. Every wrapper needs to register its CM with the mediator and provide the mediator with information such as exported class schemas, relationship schemas and semantic rules. The wrapper also needs to inform the mediator about its query capability with respect to the source. The mediator builds the Domain Map based on the CMs received from all the wrappers. The Domain Map is essentially a knowledge map built from the information received from the various wrappers. It helps in integrating information coming from different source domains by establishing relationships such as ‘is a’ and ‘has a’ to add more meaning to the mapping. The Domain Map can be used by the sources to situate their data in the global context. A wrapper has the freedom to change the mediator’s Domain Map by adding and refining the information about its source.
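The registration step and the resulting Domain Map can be pictured with the following minimal sketch. The class name, method names, wrapper names and the example concepts are all assumptions made for illustration; they do not reproduce the interfaces described in [11].

```python
class Mediator:
    """Toy mediator that merges registered conceptual models into a domain map."""

    def __init__(self):
        self.domain_map = set()   # (subject, relation, object) triples
        self.capabilities = {}    # wrapper name -> classes it can be queried for

    def register(self, wrapper_name, classes, relationships, queryable):
        """Accept a wrapper's conceptual model and its query capabilities."""
        for subject, relation, obj in relationships:
            if relation not in ("is_a", "has_a"):
                raise ValueError(f"unsupported relation: {relation}")
            self.domain_map.add((subject, relation, obj))
        # A wrapper can only be queried for classes it actually exports.
        self.capabilities[wrapper_name] = set(queryable) & set(classes)

    def ancestors(self, concept):
        """Follow is_a links transitively to place a concept in the global context."""
        found, frontier = set(), [concept]
        while frontier:
            current = frontier.pop()
            for s, r, o in self.domain_map:
                if s == current and r == "is_a" and o not in found:
                    found.add(o)
                    frontier.append(o)
        return found

    def sources_for(self, concept):
        """Wrappers queryable for the concept itself or for one of its subclasses."""
        return sorted(
            name
            for name, classes in self.capabilities.items()
            if any(c == concept or concept in self.ancestors(c) for c in classes)
        )

mediator = Mediator()
# Hypothetical wrappers for two independent sources.
mediator.register(
    "lab_a",
    classes=["purkinje_cell", "neuron"],
    relationships=[("purkinje_cell", "is_a", "neuron"), ("neuron", "has_a", "dendrite")],
    queryable=["purkinje_cell"],
)
mediator.register(
    "lab_b",
    classes=["dendrite", "spine"],
    relationships=[("dendrite", "has_a", "spine"), ("neuron", "is_a", "cell")],
    queryable=["dendrite", "spine"],
)

print(mediator.ancestors("purkinje_cell"))
print(mediator.sources_for("neuron"))
```

In this sketch the second wrapper refines the map by stating that a neuron is a cell, so a class exported by the first wrapper gains an ancestor it never declared itself. That is the sense in which each source is situated in a global context.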

Query processing is performed in a model-based mediator architecture called KIND (Knowledge-based Integration of Neuroscience Data) [12]. The idea behind this architecture and system is to integrate data coming from disjoint neuroscience domains. A detailed explanation of the query plan is provided in [11].


4. MOMIS

MOMIS (Mediator envirOnment for Multiple Information Sources) is a research project at the University of Modena and Reggio Emilia [13, 14]. The MOMIS research is not strictly concerned with scientific data integration; it addresses traditional data integration issues, but its approach can be applied to scientific data integration as well. It uses a semantic approach to data integration based on the conceptual schema, or metadata, of the information sources. The MOMIS system architecture consists of the following three components:

1. Common Data Model:

The ODL language is used to describe source schemas, which can then be used to integrate the data from different data sources.

2. Wrappers:

Wrappers translate the metadata information into the ODL representation. They also translate the global OQL query into a query format acceptable to the source and return the result set to the mediator.

3. Mediator:

The mediator consists of the SI-Designer and the Query Manager (QM). The SI-Designer is responsible for processing the metadata information received from the wrappers for integration purposes. The QM poses the sub-queries of the original global query to each wrapper and presents a unified answer based on the result sets received back from the wrappers.

The data integration process in MOMIS consists of the following four phases:

1. Generation of a Common Thesaurus:

The Common Thesaurus is a set of relationships describing knowledge about classes and attributes of source schemas. The first phase involves generation of this Common Thesaurus.

2. Affinity Analysis of Classes:

Relationships among classes in the Common Thesaurus are used to find the affinity between different classes. Affinity is determined on the basis of the class names, structures and relationships found in the Common Thesaurus. It is computed both for classes belonging to the same data source and for classes belonging to different data sources.

3. Clustering Classes:

Classes found to have affinity with one another are clustered together in a group. The purpose of this phase is to identify classes that need to be integrated.

4. Generation of the Mediated Schema:

The mediated or the global schema is formed by the unification of all the clusters obtained in the previous phase. This global schema consists of all the classes derived from the clusters and serves as an interface to pose queries against the different data sources.
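The four phases can be illustrated with a deliberately simplified sketch. The source classes, the synonym-based thesaurus, the equal weighting of name and structural affinity, the 0.5 threshold and the single-link grouping are all assumptions chosen for the example; they are not the coefficients or the clustering procedure used by MOMIS.

```python
from itertools import combinations

# Hypothetical source classes: (source, class name) -> attribute set.
classes = {
    ("S1", "Protein"): {"id", "name", "sequence"},
    ("S2", "Polypeptide"): {"id", "sequence", "length"},
    ("S2", "Publication"): {"title", "year"},
}

# Phase 1 (toy Common Thesaurus): pairs of class names declared as synonyms.
thesaurus = {frozenset({"Protein", "Polypeptide"})}

def affinity(a, b):
    """Phase 2: blend name affinity (thesaurus) with attribute overlap (Jaccard)."""
    name_aff = 1.0 if a[1] == b[1] or frozenset({a[1], b[1]}) in thesaurus else 0.0
    attrs_a, attrs_b = classes[a], classes[b]
    struct_aff = len(attrs_a & attrs_b) / len(attrs_a | attrs_b)
    return 0.5 * name_aff + 0.5 * struct_aff

def cluster(threshold=0.5):
    """Phase 3: group classes whose pairwise affinity reaches the threshold."""
    groups = {c: {c} for c in classes}
    for a, b in combinations(classes, 2):
        if affinity(a, b) >= threshold and groups[a] is not groups[b]:
            merged = groups[a] | groups[b]
            for member in merged:
                groups[member] = merged
    unique = {frozenset(g) for g in groups.values()}
    return sorted(unique, key=lambda g: sorted(g))

def mediated_schema(clusters):
    """Phase 4: one global class per cluster with the union of member attributes."""
    return [sorted(set().union(*(classes[c] for c in group))) for group in clusters]

clusters = cluster()
schema = mediated_schema(clusters)
print(schema)
```

Here the two protein-like classes from different sources end up in one cluster and contribute a single global class, while the publication class remains on its own.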

5. DiscoveryLink

IBM’s DiscoveryLink [15] uses database middleware technology to provide integrated access to data sources used in the life sciences industry. It allows the integration of discovery, clinical trial, regulatory and even marketing data throughout the product development, approval and deployment cycle. DiscoveryLink, like TAMBIS, gives the researcher the impression of querying a single virtual database. In reality, the data may be physically scattered across various data sources, and no single data source alone may be sufficient to answer the query. The user queries the “virtual” database using the high-level, nonprocedural query language SQL (Structured Query Language).

Because DiscoveryLink queries the original, distributed sources without modifying or copying the data, it can eliminate many database synchronization issues. It also allows the life science researcher to add new data sources while keeping the original data intact, and it enables easy creation of wrappers for nonrelational sources. To ensure fast response times, it includes query optimization technology that automatically searches for the most efficient means of executing a query and assembling the results.
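The idea of one SQL query spanning physically separate stores can be sketched with Python's built-in sqlite3 module. This is only an analogy: SQLite's ATTACH DATABASE stands in for the middleware, and the table names, columns and values are invented for illustration. None of it reflects DiscoveryLink's actual interfaces, wrappers or optimizer.

```python
import sqlite3

# One connection plays the role of the "virtual" database.
conn = sqlite3.connect(":memory:")
# A second, separate in-memory database stands in for another data source.
conn.execute("ATTACH DATABASE ':memory:' AS assays")

conn.execute("CREATE TABLE main.compounds (compound_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE assays.results (compound_id INTEGER, activity REAL)")
conn.executemany(
    "INSERT INTO main.compounds VALUES (?, ?)", [(1, "cmpd-A"), (2, "cmpd-B")]
)
conn.executemany(
    "INSERT INTO assays.results VALUES (?, ?)", [(1, 0.82), (2, 0.10)]
)

# A single SQL statement spans both stores; no table is copied into the other.
rows = conn.execute(
    """
    SELECT c.name, r.activity
    FROM main.compounds AS c
    JOIN assays.results AS r ON r.compound_id = c.compound_id
    WHERE r.activity > 0.5
    ORDER BY c.name
    """
).fetchall()
print(rows)
conn.close()
```

The user writes ordinary SQL against what looks like one schema, while the engine decides how to reach each underlying store. In a federated system the stores would be remote and heterogeneous, which is where wrappers and query optimization come in.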

IBM recently enhanced DiscoveryLink with IBM Information Integrator software, which integrates in-memory text searches, internal and external data sources and applications, and information from the web. With DiscoveryLink, locally stored data generated by individual R&D teams can now be accessed and shared across networks using a security-rich, Web-enabled interface.

6. Pegasys

Pegasys is a flexible, modular and customizable software system that facilitates the execution and data integration from heterogeneous biological sequence tools. It uses a novel data structure for creating workflows of sequence analyses and a unified data model to store their results. To speed up data generation, all analyses that are not serially dependent are run in parallel. The design of the system permits new tools to be added with little programming overhead. The system provides a framework to facilitate data integration of analysis results from different tools that were computed on the same input.

The software has been implemented in the Java programming language and is open source, released under the GNU General Public License. The following modules have been developed so far: analysis modules for pair-wise and multiple sequence alignment, ab initio gene prediction, masking of repetitive elements, prediction of RNA sequences and eukaryotic splice site prediction. Each workflow has the following qualities: a) analyses can be linked together such that the output from one analysis can be used as input to a subsequent analysis, b) an analysis can accept outputs from more than one analysis as input, and c) analyses that are not serially dependent can be executed in parallel. The results can be exported in General Feature Format [16] and Genome Annotation Markup Elements (GAME) XML [17] for import into the Apollo genome editor [18].

Architecture and data flow: The system uses a layered topology with a client/server model (Figure 4). Workflows are created using a graphical user interface. The server, which is made up of separate layers for job scheduling, execution, database interaction, and adaptors, executes the workflow. The application layer converts the workflow, rendered in XML, into a directed acyclic graph (DAG) of analyses in memory. This DAG is traversed and the analyses are scheduled accordingly. The results are inserted into the backend database layer. The data is exported from the system via the adaptor layer in various formats for human interpretation.

Figure 4: Pegasys Architecture

Pegasys data structure: The core data structure of the Pegasys system is a DAG G(V, E), consisting of a set of nodes V and a set of edges E connecting the nodes. This data structure models a workflow created by a user of the system. The DAG is created dynamically at run time as the user makes selections in the GUI. After all of the parameters have been entered, the information for each node and the relationships between nodes are compiled into a structured XML file, which becomes the input to the Pegasys server.
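How a DAG of analyses yields batches that can run in parallel can be sketched with Python's standard graphlib module (Python 3.9 or later). Pegasys itself is written in Java, so this is a conceptual illustration only; the workflow contents and the function name are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each analysis maps to the analyses whose output it needs.
workflow = {
    "mask_repeats": set(),
    "gene_prediction": {"mask_repeats"},
    "alignment": {"mask_repeats"},
    "merge_results": {"gene_prediction", "alignment"},
}

def parallel_batches(dag):
    """Return batches of analyses; members of one batch have no serial dependency."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are complete
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch finished, unlocking successors
    return batches

batches = parallel_batches(workflow)
print(batches)
```

Each inner list holds analyses that could be dispatched at the same time. An analysis with two inputs, such as the hypothetical merge step, only becomes ready once both of its predecessors have finished.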

Program module: The Program module is the fundamental unit of the nodes of the DAG in the application layer of the server and is a real instance of a node. A PegasysProgram class has been created that extends Program by adding an input sequence attribute and a PegasysResultSet to store the results.


Database: The main goal of the backend database, which is relational, is to maximize information capture during execution of a workflow. The PegasysDB class has been created to provide communication with the database. The derived PegasysAdaptor classes implement a print method to output data in a specific format. The Biojava toolkit [19], an extensive set of packages written in Java for sequence manipulation, analysis and processing, has been integrated into Pegasys. The system is available for download at http://bioinformatics.ubc.ca/pegasys.

7. Conclusion

We have presented an overview of five scientific data integration systems. This is not an exhaustive study of all the research currently being conducted on the topic. Other research, both current and past, has not been mentioned in this paper due to space limitations.

The goal of TAMBIS is to provide a single interface to multiple bioinformatics data sources. It gives the illusion of a single data location and provides a single query language to retrieve data from heterogeneous bioinformatics data sources. MBM provides a new mediator architecture to integrate data from disjoint bioinformatics data sources, whereas MOMIS is a framework to perform information extraction and integration from both structured and semistructured data sources, and tackles problems related to traditional data integration. We have included MOMIS in this paper because its approach is applicable to data integration from scientific data sources. IBM’s DiscoveryLink uses database middleware technology to provide integrated access to data sources used in the life sciences industry.

Pegasys, an open source project, is a flexible, modular and customizable software system that facilitates the execution and data integration from heterogeneous biological sequence tools. Finally, we have tried to provide a representative list of references on the topic of scientific data integration.

There is no shortage of possible directions for future research in this area. Most of the current work focuses on developing appropriate data models for scientific data integration. Relatively little work has been conducted on the problems of query optimization and execution planning in this setting, which remain challenges for future research. Also, we believe that the emerging field of the Semantic Web [20] can make a significant contribution towards querying and integrating structured and semistructured data from heterogeneous scientific data sources.

References:

1. C. Batini, M. Lenzerini, S. B. Navathe. A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, Vol. 18, 1986.

2. M. R. Genesereth, A. M. Keller, O. M. Duschka. Infomaster: An information integration system. ACM SIGMOD Conference, 1997.

3. C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. G. Philpot, S. Tejada. Modeling Web sources For Information Integration. National Conference on Artificial Intelligence, 1998.

4. Richard Hull, Gang Zhou. A framework for supporting data integration using the materialized and virtual approaches. ACM SIGMOD Record, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, Volume 25 Issue 2.

5. Gene Ontology Consortium, http://www.geneontology.org

6. Xiaofeng Meng, Dongdong Hu, Chen Li. Web data extraction and structure mining: Schema-guided wrapper maintenance for web-data extraction. Proceedings of the fifth ACM international workshop on Web information and data management. November 2003.

7. Joachim Hammer, Héctor García-Molina, Svetlozar Nestorov, Ramana Yerneni, Marcus Breunig, Vasilis Vassalos. Template-based wrappers in the TSIMMIS system. ACM SIGMOD Record, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, Volume 26 Issue 2.

8. Massimo Mecella, Barbara Pernici. Designing wrapper components for e-services in integrating heterogeneous systems. The VLDB Journal - The International Journal on Very Large Data Bases, Volume 10 Issue 1, August 2001.

9. Omar Boucelma, Silvana Castano, Carole Goble. Report on the EDBT’02 Panel on Scientific Data Integration. ACM SIGMOD, Dec 2002.

10. P.G. Baker, A. Brass, S. Bechhofer, C. Goble, N. Paton, R. Stevens. TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources. An Overview. In Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology, ISMB98, Montreal, 1998.

11. Amarnath Gupta, Bertram Ludascher, Maryann Martone. Model-Based Mediation with Domain Maps. International Conference on Data Engineering, 2001.

12. http://www.npaci.edu/DICE/Neuro/kind01.html

13. http://dbgroup.unimo.it/Momis/

14. D. Beneventano, S. Bergamaschi, F. Guerra, M. Vincini. The MOMIS approach to Information Integration. IEEE and AAAI International Conference on Enterprise Information Systems (ICEIS01), Setúbal, Portugal, 7-10 July, 2001.

15. http://www.ibm.com/industries/lifesciences

16. General Feature Format, http://www.sanger.ac.uk/Software/formats/GFF/index.shtml

17. GAME XML DTD, http://flybase.bio.indiana.edu/annot/gamexml.dtd.txt

18. S.E. Lewis, S.M. Searle, N. Harris, M. Gibson, V. Iyer, J. Richter, C. Wiel, L. Bayraktaroglu, E. Birney, M.A. Crosby, J.S. Kaminker, B.B. Matthews, S.E. Prochnik, C.D. Smith, J.L. Tupy, G.M. Rubin, S. Misra, C.J. Mungall, M.E. Clamp. Apollo: A sequence annotation editor. Genome Biol 2002, 3(12). Epub 2002.

19. http://www.Biojava.org

20. http://www.semanticweb.org