View Generation - SCHEMA MATCHING AND MAPPING-BASED DATA INTEGRATION

To explore the relationships between molecular-biological objects, scientists often have to ask queries in the form "Given a set of LOCUSLINK genes, identify those that are located at some given cytogenetic positions (LOCATION), and annotated with some given GO functions, but not associated with some given OMIM diseases". Such queries exhibit the following properties:

• A query involves one or more mappings between objects of a single source, e.g., LOCUSLINK, and one or more targets providing the annotations of interest, e.g., LOCATION, GO and OMIM. Both the source and the targets can be confined to subsets of relevant objects.

• The mappings can be used to evaluate logical conditions between objects, i.e., whether they have/do not have some associated annotations. The mappings can be combined using the logical operators AND or OR and individually negated using the logical operator NOT.

GENMAPPER supports the specification and processing of such queries by means of tailored annotation views, which can be flexibly constructed using a set of high-level GAM-based operators. In the following, we briefly present some simple operations, such as Map, Range, and Domain (see Table 15.1), and then discuss in more detail the two most important operations for determining annotation views, Compose and GenerateView. Note that the operations are described declaratively and leave room for optimizations in the implementation.

Simple Operations

The Map operation takes as input a source S to be annotated and a target T providing annotations. It searches the database for an existing mapping between S and T and returns the object associations contained in the mapping. Domain and Range identify the source and the target objects, respectively, involved in a mapping. RestrictDomain and RestrictRange return a subset of a mapping covering a given set of objects from the source and from the target, respectively.

Table 15.1 Definitions and examples for simple mapping operations

Operation               Definition                                        Example
Map(S, T)               Identify object correspondences between S and T   map = Map(S, T) = {s₁↔t₁, s₂↔t₂}
Domain(map)             SELECT DISTINCT S FROM map                        Domain(map) = {s₁, s₂}
Range(map)              SELECT DISTINCT T FROM map                        Range(map) = {t₁, t₂}
RestrictDomain(map, s)  SELECT * FROM map WHERE S in s                    RestrictDomain(map, {s₁}) = {s₁↔t₁}
RestrictRange(map, t)   SELECT * FROM map WHERE T in t                    RestrictRange(map, {t₂}) = {s₂↔t₂}
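The simple operations in Table 15.1 can be mirrored with plain sets of pairs. The following is a minimal sketch, assuming a mapping is held in memory as a set of (source, target) tuples. The function names are illustrative and not part of GENMAPPER, and Map itself is omitted because it is a database lookup.

```python
from typing import Set, Tuple

# Assumption: a mapping is a set of (source accession, target accession) pairs.
Mapping = Set[Tuple[str, str]]


def domain(mapping: Mapping) -> Set[str]:
    """Source objects involved in the mapping (SELECT DISTINCT S)."""
    return {s for s, _ in mapping}


def range_(mapping: Mapping) -> Set[str]:
    """Target objects involved in the mapping (SELECT DISTINCT T)."""
    return {t for _, t in mapping}


def restrict_domain(mapping: Mapping, sources: Set[str]) -> Mapping:
    """Correspondences whose source object is in the given set."""
    return {(s, t) for s, t in mapping if s in sources}


def restrict_range(mapping: Mapping, targets: Set[str]) -> Mapping:
    """Correspondences whose target object is in the given set."""
    return {(s, t) for s, t in mapping if t in targets}


# The example mapping from Table 15.1.
example = {("s1", "t1"), ("s2", "t2")}
```

With this representation, `restrict_domain(example, {"s1"})` yields the single correspondence s1↔t1, matching the last column of the table.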

Compose

Like the MatchCompose operation for metadata-level mappings (see Section 6.3), Compose is based on the same intuition, transitivity of associations, to derive new mappings from existing ones. For example, if a locus l in LOCUSLINK is annotated with some GO terms, so are the UNIGENE entries associated with locus l. Compose takes as input a so-called mapping path consisting of two or more mappings connecting two sources with each other, for which a direct mapping is required. For example, it can use a relational join operation to combine map₁: S₁↔S₂ and map₂: S₂↔S₃, which share a common source S₂, and produce as output a mapping between S₁ and S₃.

Compose represents a simple but very effective way to derive new useful mappings. The operation can be used to derive new annotations, which are not directly available in existing sources and their cross-references. However, Compose may lead to wrong correspondences when the transitivity assumption does not hold. This effect can be restricted by allowing Compose to be performed with explicit user confirmation on the involved mapping path. The use of mappings containing correspondences of reduced evidence is a promising subject for future research.
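The join along a mapping path can be sketched as follows. This is an illustration under the assumption that each mapping is an in-memory set of pairs; the accessions are placeholders, not real identifiers, and the function is not GENMAPPER's implementation.

```python
from typing import Dict, Set, Tuple

Mapping = Set[Tuple[str, str]]


def compose(*path: Mapping) -> Mapping:
    """Join two or more mappings along a path S1<->S2, S2<->S3, ...

    Each step joins the current result with the next mapping on the
    shared intermediate source, relying on transitivity of associations.
    """
    if len(path) < 2:
        raise ValueError("a mapping path needs at least two mappings")
    result = set(path[0])
    for next_mapping in path[1:]:
        # Index the next mapping by its source side for the join.
        by_source: Dict[str, Set[str]] = {}
        for mid, tgt in next_mapping:
            by_source.setdefault(mid, set()).add(tgt)
        result = {
            (src, tgt)
            for src, mid in result
            for tgt in by_source.get(mid, ())
        }
    return result


# Placeholder data: UNIGENE entries linked to loci, loci annotated with GO terms.
unigene_locus = {("ug1", "ll1"), ("ug2", "ll1"), ("ug3", "ll2")}
locus_go = {("ll1", "go1"), ("ll1", "go2")}
unigene_go = compose(unigene_locus, locus_go)
```

Here ug1 and ug2 inherit both GO terms of locus ll1, while ug3 receives none because ll2 has no GO annotation. The sketch performs no validity check, so any correspondence that violates the transitivity assumption propagates into the result unchanged.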

GenerateView

This operation assumes a source S to be annotated and a set of targets T₁, ..., T_m, providing required annotations. The relevant source and target objects are given in the corresponding subsets s and t₁, ..., t_m, respectively, each of which may also cover all existing objects of a source. Finally, the GenerateView operation requires a method for combining the mappings (AND or OR), and a list of targets for which the obtained mappings are to be negated. The result of such a query is a view of m+1 attributes, S, T₁, …, and T_m, containing tuples of related objects from the corresponding sources. In particular, GenerateView implements the pseudo-code shown in Figure 15.5 to build the required annotation view V.

Figure 15.5 The algorithm for GenerateView

GenerateView(S, s, T₁, t₁, ..., T_m, t_m, [AND|OR], {negated})
  //Start with all given source objects
  V = s
  //Iterate over all targets
  For i = 1..m {
    Determine mapping M_i: S↔T_i            //Using either Map or Compose
    m_i = RestrictDomain(M_i, s)            //Consider the given source objects
    m_i = RestrictRange(m_i, t_i)           //Consider the given target objects
    If negated[T_i] {                       //The mapping is specified as negated
      s_î = s \ Domain(m_i)                 //Source objs not involved in the sub-mapping
      m_î = RestrictDomain(M_i, s_î)        //Find correspondences for these objects
      m_i = m_î RightOuterJoin s_î on S     //Preserve objs without correspondences
    }
    V = V InnerJoin/LeftOuterJoin m_i on S  //AND: inner, OR: left outer
  }

GenerateView performs a script of multiple steps based on other operators to execute mappings and manipulate their results. V is first set to the given set s of relevant source objects. For each target T_i, a mapping M_i between S and T_i is to be determined. It may already exist in the database or, in many cases, may not yet be available. In the former case, the required mapping is directly retrieved using the Map operation. In the latter case, we try to derive such a mapping from the existing ones using the Compose operation. A subset m_i is then extracted from M_i to only cover the relevant source objects s and target objects t_i. If necessary, the negation of m_i is built from the subset s_î of s containing the objects not involved in m_i. Finally, V is incrementally extended by performing a left outer join (OR) or inner join (AND) operation with the sub-mapping m_i.
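The steps of Figure 15.5 can be followed in a small executable sketch. It assumes that every mapping M_i is already available as an in-memory set of pairs (whether obtained by Map or Compose), that a missing target subset stands for "all objects of the target", and that None marks a missing correspondence in an outer join. All names and data are illustrative.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Mapping = Set[Tuple[str, Optional[str]]]


def generate_view(
    s: Set[str],
    targets: List[str],
    mappings: Dict[str, Mapping],
    subsets: Dict[str, Optional[Set[str]]],
    combine: str = "AND",
    negated: FrozenSet[str] = frozenset(),
) -> List[tuple]:
    """Build the annotation view V following the pseudo-code of Figure 15.5."""
    if combine not in ("AND", "OR"):
        raise ValueError("combine must be 'AND' or 'OR'")
    view = [(obj,) for obj in sorted(s)]                   # V = s
    for name in targets:
        full = mappings[name]                              # M_i
        sub = {(a, b) for a, b in full if a in s}          # RestrictDomain(M_i, s)
        wanted = subsets.get(name)                         # None: no restriction
        if wanted is not None:
            sub = {(a, b) for a, b in sub if b in wanted}  # RestrictRange(m_i, t_i)
        if name in negated:
            rest = s - {a for a, _ in sub}                 # s \ Domain(m_i)
            found = {(a, b) for a, b in full if a in rest}
            unmatched = rest - {a for a, _ in found}
            # Right outer join with rest: keep objects without correspondences.
            sub = found | {(a, None) for a in unmatched}
        by_source: Dict[str, list] = {}
        for a, b in sub:
            by_source.setdefault(a, []).append(b)
        joined = []
        for row in view:
            hits = by_source.get(row[0])
            if hits:
                joined.extend(row + (b,) for b in sorted(hits, key=str))
            elif combine == "OR":                          # left outer join
                joined.append(row + (None,))
        view = joined
    return view


# Placeholder data for the example query: genes at position p1, annotated
# with function f1, but NOT associated with disease d1.
genes = {"g1", "g2", "g3"}
maps = {
    "LOCATION": {("g1", "p1"), ("g2", "p1"), ("g3", "p2")},
    "GO": {("g1", "f1"), ("g2", "f1"), ("g3", "f1")},
    "OMIM": {("g1", "d1"), ("g2", "d2")},
}
view = generate_view(
    genes,
    ["LOCATION", "GO", "OMIM"],
    maps,
    {"LOCATION": {"p1"}, "GO": {"f1"}, "OMIM": {"d1"}},
    combine="AND",
    negated=frozenset({"OMIM"}),
)
```

In this example g1 is dropped because it is associated with d1, g3 is dropped because it lies at p2, and g2 remains with its other disease d2 shown in the OMIM column.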

15.5 Implementation and Use

GENMAPPER is fully operational and currently integrates more than 60 public sources, including those for gene annotations, such as LOCUSLINK [116] and UNIGENE [127], and for protein annotations, such as INTERPRO [103] and SWISSPROT [18]. Furthermore, it includes various divisions of NETAFFX [86], a vendor-based data source for annotations of genes used in microarray experiments to measure their expression. GENMAPPER supports both interactive use via a web-based user interface and integration in automatic analysis pipelines using its high-level operations. In the following we present the basic functionalities of the interactive user interface and discuss the use of GENMAPPER in a large-scale analysis application.

Interactive Query Interface

The interactive interface of GENMAPPER allows the user to pose queries and retrieve annotations for a set of given objects from a particular source. First, the relevant source can be selected from the list of currently imported sources. The accessions of the objects of interest can be uploaded from a file or manually copied and pasted. If no accessions are specified, the entire source will be considered. For example, Figure 15.6a shows in the text input field several LOCUSLINK identifiers, for which the user might want to look for annotations. The source for the objects is set correspondingly to LOCUSLINK.

In the next step, the user can specify all targets of interest from the available sources. To construct the annotation view shown in Figure 15.2, the user can select the targets as shown in Figure 15.6b. GENMAPPER internally manages a graph of all available sources and mappings. Using a shortest path algorithm, GENMAPPER is able to automatically determine a mapping path to traverse from the source to any specified target. The user can also search in the graph for specific paths, for example, with a particular intermediate source. With a high degree of inter-connectivity between the sources, many paths may be possible. Hence, GENMAPPER also allows the user to manually build and save a path customized for specific analysis requirements.
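The text does not name the shortest path algorithm that GENMAPPER uses. As an illustration only, breadth-first search finds a path with the fewest mappings in an unweighted source graph. The graph below is a made-up excerpt, and the function name is hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def shortest_mapping_path(
    graph: Dict[str, Set[str]], source: str, target: str
) -> Optional[List[str]]:
    """Return a path with the fewest mappings from source to target, or None."""
    if source == target:
        return [source]
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Sorted only to make the chosen path deterministic among ties.
        for neighbour in sorted(graph.get(node, ())):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == target:
                # Walk the predecessor links back to the source.
                path = [neighbour]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Assumed excerpt of the source graph: an edge means a stored mapping exists.
source_graph = {
    "NETAFFX": {"UNIGENE"},
    "UNIGENE": {"NETAFFX", "LOCUSLINK"},
    "LOCUSLINK": {"UNIGENE", "GO", "OMIM"},
    "GO": {"LOCUSLINK"},
    "OMIM": {"LOCUSLINK"},
}
path = shortest_mapping_path(source_graph, "NETAFFX", "GO")
```

The resulting list of sources is exactly the kind of mapping path that Compose takes as input. A user-defined path with a particular intermediate source would simply bypass this search.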

When the relevant paths have been selected or manually constructed, the user can specify the target accessions of interest, the method for combining the mappings, and the negation of single mappings as shown in Figure 15.6c. GENMAPPER then applies the GenerateView operation to construct the annotation view. The interesting accessions among the retrieved ones can be selected to start a new query. Alternatively, the user can retrieve the names and other information of the corresponding objects as illustrated for the GO gene functions in Figure 15.6d. All results can be saved and downloaded for further analysis in external tools.

Figure 15.6 Annotation query for LOCUSLINK genes

[Figure panels: A) Source objects; B) Target selection; C) Mapping paths; D) Object information. Targets shown: Standard gene symbol, Enzyme class, Gene functions, Disease, Publications.]

Large-scale Automatic Gene Functional Profiling

In an ongoing cooperation project aiming at a comparative analysis between humans and their closest relatives, chimpanzees [43], GENMAPPER has been successfully integrated within an automated analysis pipeline to perform complex and large-scale functional profiling of genes.

Gene expression measurements have been performed using AFFYMETRIX microarray technology. From a total of approximately 40,000 genes, the expression of around 20,000 genes was detected, of which around 2,500 show a significantly different expression pattern between the species, thus representing candidates for further examination [75, 102]. Functional profiling of the differentially expressed genes was based on the analysis of the annotations about their known functions as specified by GENEONTOLOGY (GO) terms. In particular, the genes are classified according to the GO function taxonomy in order to identify the functions that are conserved or have changed between humans and chimpanzees.

Using the mappings provided by GENMAPPER, the proprietary genes of AFFYMETRIX microarrays were mapped to the generally accepted gene representations UNIGENE and LOCUSLINK, for which GO annotations were in turn derived from the mappings provided by LOCUSLINK. Furthermore, using the structure information of the sources, i.e., IS_A and Subsumed relationships, comprehensive statistical analysis over the entire GO taxonomy was possible to determine significant genes. The adopted analysis methodology is also applicable to other taxonomies, e.g., ENZYME, to gain additional insights.
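The statistical procedure itself is not described here. The sketch below shows only one plausible preparatory step, stated as an assumption: rolling gene annotations up the IS_A hierarchy so that each term also counts the genes annotated with its more specific terms. Term and gene names are placeholders.

```python
from typing import Dict, Iterable, Set


def ancestors(term: str, is_a: Dict[str, Iterable[str]]) -> Set[str]:
    """All terms reachable from term by following IS_A links upwards."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_per_term(
    gene_terms: Dict[str, Set[str]], is_a: Dict[str, Iterable[str]]
) -> Dict[str, Set[str]]:
    """Assign each gene to its annotated terms and to all their ancestors."""
    result: Dict[str, Set[str]] = {}
    for gene, terms in gene_terms.items():
        covered: Set[str] = set()
        for term in terms:
            covered.add(term)
            covered |= ancestors(term, is_a)
        for term in covered:
            result.setdefault(term, set()).add(gene)
    return result


# Placeholder taxonomy: child term -> parent terms.
is_a = {
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "binding": ["molecular_function"],
}
annotations = {"g1": {"dna_binding"}, "g2": {"rna_binding"}, "g3": {"dna_binding"}}
profile = genes_per_term(annotations, is_a)
```

Comparing such per-term gene sets between the differentially expressed genes and all detected genes would be the input to a significance test, whichever test is chosen.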

15.6 Summary

We presented the GENMAPPER system for flexible integration of heterogeneous annotation data. GENMAPPER uses a generic schema to uniformly represent annotations from different sources. Existing correspondences between objects are explicitly captured to drive data integration and combine annotation knowledge from different sources to enhance analysis tasks. From the generic representation, tailored annotation views are derived to serve specific analysis needs and queries. Such views are flexibly constructed using a set of powerful high-level operators, e.g., to combine annotations imported from different sources. GENMAPPER is fully operational, integrates data from many sources and is currently used by biologists for large-scale functional profiling of genes.

CHAPTER 16

THE HYBRID INTEGRATION APPROACH

GENMAPPER represents a flexible approach to capture and exploit object correspondences for annotation analysis. However, the employed generic data model GAM focuses specifically on mapping data and cannot handle unstructured data, such as comment fields, and data with complex structures, such as geometric data of protein folding structures and genomic sequences, which are typically provided as source-specific attributes besides weblinks to encode object correspondences. To overcome this limitation and support access to both kinds of data, we have extended GENMAPPER to a hybrid system by integrating it with a mediator for virtual integration of source-specific annotation data. The key aspects of our new approach are the following:

• We combine a materialized and a virtual data integration to exploit their advantages in a new hybrid approach. On the one hand, the materialization offers high performance for join queries to inter-relate large numbers of objects from different sources. On the other hand, up-to-date source-specific annotation data can be retrieved for analysis when needed.

• Mapping data is explicitly captured from the data sources and stored in a separate database, the so-called Mapping Database, backed by GENMAPPER. This separation allows us to determine different join paths between two sources to relate their objects with each other and to pre-compute them for good query performance.

• Data sources are uniformly integrated and accessed through SRS, the widely accepted commercial mediator tool, which offers wrapper interfaces to a large number of molecular-biological sources, including flat files and relational databases. Hence, we avoid the re-implementation of import functions and can easily add sources supported by SRS.

The next section gives an overview of our hybrid integration approach. Section 16.2 focuses on the management and exploitation of mappings. Section 16.3 describes the query processing mechanism. Finally, Section 16.4 concludes this chapter.

16.1 Overview

Figure 16.1 shows the architecture of our hybrid integration approach. It comprises four components, which are introduced in the following:

Figure 16.1 The hybrid integration approach and its components

[Figure components: Web-Browser, Query-Mediator, SRS Server (GeneOntology, Ensembl, LocusLink), Mapping-DB/GenMapper, ADM-DB.]

Query processing steps:
1. Retrieve source metadata
2. GUI generation
3. Query specs: filters, joins
4. Creation of SRS query
5. SRS call
6. SRS query processing
7. Result streams (XML)
8. Transformation of result streams
9. Display of results

• The commercial mediator tool SRS is used to query and retrieve annotations from the relevant public sources. Currently, several sources offering gene annotations, including GENEONTOLOGY (GO) [52], LOCUSLINK [116], ENSEMBL [17], UNIGENE [127], and NETAFFX [86] are integrated to support gene expression analysis.

• GENMAPPER is used to pre-compute alternative mappings between the data sources using different join paths. These mappings are stored as views in a sub-division of GENMAPPER, the Mapping Database. In particular, for each source, a mapping table is maintained storing all correspondences between the source and a pre-selected central source. This star-like schema supports efficient join operations through the central source to inter-relate objects of different sources.

• The Query Mediator is our new development, offering a uniform interface to exploit both mapping and annotation data in GENMAPPER and SRS, respectively. In particular, it captures and transforms user-specified queries into SRS-specific queries, which are then forwarded to SRS for execution. Finally, the Query Mediator combines the results delivered by SRS, performs necessary transformations, and visualizes them on the user web interface.

• The ADM Database serves administration purposes and stores metadata about the integrated sources, such as their names and attributes, and the information about the available mappings, e.g., mapping names and join paths used to compute them. We utilize this metadata to automatically generate the web interface for query formulation.

In the remainder of this section, we briefly describe the overall interaction between the components in two main processes, integration of data sources and processing of user queries. The main components, in particular, the Mapping Database, the ADM Database, and the Query Mediator, will be discussed in detail in the subsequent sections.

Source Integration

The comprehensive wrapper library provided by SRS supports numerous data sources available in the bioinformatics domain and allows us to easily add new sources. In particular, we use these wrappers to integrate the flatfile-based source LOCUSLINK and two relational databases, ENSEMBL and GO. To achieve good performance for interactive queries, we maintain local copies of these data sources for integration in SRS. The ADM Database holds metadata about the sources, especially the names of the sources and their attributes.

In our approach, data sources are organized in a star-like schema supporting efficient join queries. For each object type, one of the sources is chosen as the central source, to which mappings from all other sources of this type are pre-computed. For example, LOCUSLINK is a reference data source for gene annotations. Its identifier is linked in many other sources and often used for citations in scientific publications. Hence, we choose LOCUSLINK as the central gene source in our current implementation to support gene expression analysis. The Mapping Database consists of the mappings from LOCUSLINK to all other sources, e.g., UNIGENE, ENSEMBL, NETAFFX and GENEONTOLOGY, which are pre-computed and provided by GENMAPPER. To link a source with the central source, alternative mappings can be obtained using different join paths. The mappings are then registered in the ADM Database with the paths employed to compute them (see Section 16.2).
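The benefit of the star-like schema is that any two sources can be related with a single join through the central source. The sketch below illustrates this with an in-memory SQLite database. The table and column names are assumptions made for the example and do not reflect the actual layout of the Mapping Database.

```python
import sqlite3

# Assumed layout: one mapping table per source, keyed by the central source.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE map_unigene (locuslink TEXT, unigene TEXT);
    CREATE TABLE map_ensembl (locuslink TEXT, ensembl TEXT);
    INSERT INTO map_unigene VALUES ('ll1', 'ug1'), ('ll2', 'ug2');
    INSERT INTO map_ensembl VALUES ('ll1', 'en1'), ('ll1', 'en2'), ('ll3', 'en3');
    """
)

# Relate UNIGENE and ENSEMBL objects through the central source LOCUSLINK.
rows = conn.execute(
    """
    SELECT u.unigene, e.ensembl
    FROM map_unigene AS u
    JOIN map_ensembl AS e ON u.locuslink = e.locuslink
    ORDER BY u.unigene, e.ensembl
    """
).fetchall()
conn.close()
```

Only ug1 is related to ENSEMBL entries here, because ll2 has no ENSEMBL correspondence and ll3 has no UNIGENE correspondence. With n sources, this layout needs n mapping tables rather than one table per pair of sources.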

Query Processing

Figure 16.1 also shows the general workflow of query processing, which will be discussed in detail in Section 16.3. The workflow starts with querying metadata about the available sources, attributes and mappings from the ADM Database (Step 1). Using this metadata, the web interface is automatically generated (Step 2). Then, the user can formulate the query by selecting the data sources and relevant attributes, and specifying filter conditions and join paths (Step 3). The Query Mediator interprets the user query and generates a query plan, which consists of one or multiple SRS-specific queries (Step 4). The query plan is passed to the SRS server for execution (Steps 5 and 6). While subqueries for selection and projection are performed within the corresponding sources, SRS uses GENMAPPER to perform join operations. The query result is then returned as one or multiple XML streams (Step 7). The Query Mediator parses the streams to extract the relevant data (Step 8), which is then prepared in different formats for display in the web browser or for download (Step 9).
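Step 8 can be pictured as a small XML-to-rows transformation. The format of the XML streams returned by SRS is not shown in the text, so the element names below (`result`, `entry`, and one child element per attribute) are purely hypothetical, as is the function name. The sketch only illustrates the idea of extracting the requested attributes.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def extract_rows(
    xml_stream: str, attributes: List[str]
) -> List[Dict[str, Optional[str]]]:
    """Extract the requested attributes from each entry of a result stream.

    Assumes a hypothetical layout: <result><entry><attr>value</attr>...</entry></result>.
    Attributes missing from an entry are returned as None.
    """
    root = ET.fromstring(xml_stream)
    rows = []
    for entry in root.iter("entry"):
        rows.append({name: entry.findtext(name) for name in attributes})
    return rows


# Made-up stream with placeholder values.
stream = (
    "<result>"
    "<entry><accession>ll1</accession><symbol>sym1</symbol></entry>"
    "<entry><accession>ll2</accession></entry>"
    "</result>"
)
table = extract_rows(stream, ["accession", "symbol"])
```

The resulting list of dictionaries is a neutral intermediate form that could then be rendered for the browser or written to a download file (Step 9).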

16.2 Mapping Management

The Mapping Database manages the actual mapping data, i.e., correspondences, while the ADM Database stores the metadata on the mappings and the involved sources for GUI generation and query processing. In the following, we describe each database in more detail.

The Mapping Database

Previous integration systems, such as SRS, determine corresponding objects between two sources using a multi-way join operation along the shortest, automatically determined path connecting them with each other. This approach leads to several problems. First, the shortest path may not always be the best one for inter-relating objects of two sources. Other (probably longer) paths may deliver better data, e.g., if the involved sources are updated more frequently than those in the shortest path. Second, the composition of many mappings can lead to performance problems, even for the shortest paths, if they are to be evaluated at runtime. One solution to improve query time is to pre-compute all possible paths to obtain direct mappings. However, this would lead to an enormous amount
