Because of the diversity and sheer volume of its data, the bioinformatics domain has recently become a major focus of data integration research. Several approaches have been developed and published (see [78, 67, 129] for surveys). Previous solutions can be characterized by how they integrate metadata and instance data. For each of these two sub-problems there are two alternatives, giving the four combinations shown in Table 14.1 together with sample implementations and the number of sources each supports (as an indicator of scalability). At the metadata level, we distinguish whether or not an application- or domain-specific, consistent global schema is provided for the user to formulate queries. At the instance level, we distinguish between materialized and virtual integration. Systems may also follow a hybrid approach combining both techniques at one level.
Table 14.1 Data integration approaches and systems in bioinformatics
Criteria | Application-specific global schema | No application-specific global schema
Materialized | Data warehouse; prototypes: IGD, GIMS (3). Hybrid (mediated warehouse); prototype: DATAFOUNDRY (4) | Generic warehouse; prototypes: COLUMBA (7), GENMAPPER (60). Generic-hybrid; prototype: GENMAPPER extension (5)
Virtual | Mediator/Federation; prototypes: TAMBIS (5) | Generic mediator; prototypes: DISCOVERYLINK, KLEISLI (60), SRS (700)
Traditional data integration approaches include data warehousing and federation (or mediation). The former physically stores integrated data in a central database or data warehouse, which can offer high performance for data-intensive analysis tasks. The latter uses a mediator to access the data at run time and thus provides the most current data. Both build on an application- or domain-specific global schema to consistently represent and access the integrated data. However, constructing the global schema (schema integration) is very difficult and does not scale for molecular-biological data sources, because the sources are heterogeneous and many of them have no fixed structure [67, 27]. Frequent changes in the structure and contents of the sources would require continuous evolution of the global schema and of the corresponding routines that transform and clean the data.
Data warehouses of molecular-biological data include IGD [124] and GIMS [113], early ambitious projects to collect and integrate, in a single database, all molecular-biological data available for a particular organism: C. elegans (IGD) and Saccharomyces cerevisiae (GIMS). The most representative example of the federation approach is TAMBIS [55], which uses an existing domain ontology as the global schema, to which source schemas are semantically mapped. DATAFOUNDRY [26], in contrast, follows a hybrid integration approach (denoted as hybrid in Table 14.1) by extending a mediator with a data warehouse that stores aggregated data and copies of the most frequently accessed source data. However, as with traditional data warehouses and mediators, a global schema is required to present a coherent view of the integrated data and to formulate user queries. Because of the high effort required for constructing the global schema and for transforming and importing source data, these approaches do not scale well and can only deal with a limited number of sources in support of specific analysis tasks.
To achieve more flexibility, other systems do not pursue a (laborious) semantic integration of all data sources by constructing an application-specific global schema. Instead, they use a simple generic, i.e., application-independent, data model, into which data from new data sources can be easily transformed and added. Biological sources mostly have a simple entry-attribute-based structure, which makes a generic representation feasible. Depending on how integration is performed at the instance level, we divide the approaches without a consistent application-specific global schema into what we call generic warehouses and generic mediators.
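To illustrate why an entry-attribute structure lends itself to a generic representation, the following minimal sketch flattens entries from differently structured sources into uniform rows. The row layout, the function name `to_generic_rows`, and the sample sources and values are assumptions made for illustration only; they are not taken from any of the systems discussed here.

```python
from typing import Any, Dict, Iterator, Tuple

# Assumed generic row layout: (source, entry accession, attribute, value).
GenericRow = Tuple[str, str, str, str]


def to_generic_rows(source: str, entry_id: str,
                    record: Dict[str, Any]) -> Iterator[GenericRow]:
    """Flatten one source entry into generic rows, one row per attribute value."""
    for attribute, value in record.items():
        # Multi-valued attributes produce one row per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            yield (source, entry_id, attribute, str(v))


# Hypothetical entries from two sources with different attribute sets.
rows = list(to_generic_rows("SourceA", "A1",
                            {"name": "kinase", "xref": ["B7", "B9"]}))
rows += list(to_generic_rows("SourceB", "B7", {"description": "binds ATP"}))
```

Adding a new source in this sketch requires no schema change: its entries simply contribute further rows with their own attribute names.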
Generic mediators include DISCOVERYLINK [57] (see footnote 17), KLEISLI [24, 139] and SRS (Sequence Retrieval System) [46, 142]. Their schema is simply the union of the local schemas, which are transformed to a uniform format, such as relational (DISCOVERYLINK), nested relational (KLEISLI), or attribute-based (SRS). Currently, KLEISLI offers interfaces to more than 60 public sources [24] and SRS provides wrappers for about 700 data sources [46]. Typically, complete copies of data sources are maintained locally and periodically updated for availability and performance reasons. As the price for this flexibility, DISCOVERYLINK and KLEISLI leave the task of semantic integration to the user. In particular, the user has to explicitly specify join conditions in queries to relate objects from different sources to each other. SRS, in contrast, addresses this problem by capturing and utilizing existing object cross-references, i.e., correspondences at the instance level. SRS maintains indices on these correspondences and can thus achieve high query performance. To determine correspondences between objects from two given sources, SRS determines the shortest path between the two sources and performs a join-like operation by traversing the object correspondences along the path.
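The path-based traversal described above can be sketched as follows. This is only an illustration of the idea, not the SRS implementation: the in-memory `mappings` index, the source names S1 to S5, the object identifiers, and the treatment of cross-references as directed links are all assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical cross-reference index:
# (source, target) -> {object id in source -> linked object ids in target}
mappings: Dict[Tuple[str, str], Dict[str, Set[str]]] = {
    ("S1", "S2"): {"a1": {"b1"}, "a2": {"b2", "b3"}},
    ("S2", "S3"): {"b1": {"c1"}, "b3": {"c2"}},
    ("S1", "S4"): {"a1": {"d1"}},
    ("S4", "S5"): {"d1": {"e1"}},
    ("S5", "S3"): {"e1": {"c9"}},
}


def shortest_source_path(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search over sources; edges are the available mappings."""
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parent links to rebuild the path.
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for src, dst in mappings:
            if src == node and dst not in parents:
                parents[dst] = node
                queue.append(dst)
    return None  # no chain of mappings connects the two sources


def traverse(path: List[str], start_ids: Iterable[str]) -> Set[str]:
    """Join-like step: follow object correspondences along each edge of the path."""
    current = set(start_ids)
    for src, dst in zip(path, path[1:]):
        step = mappings[(src, dst)]
        current = {t for obj in current for t in step.get(obj, set())}
    return current


path = shortest_source_path("S1", "S3")
related = traverse(path, {"a1", "a2"}) if path else set()
```

In this sketch the two-step route via S2 is preferred over the longer route via S4 and S5, and objects without a cross-reference along the chosen path simply drop out of the result.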
In contrast, we currently observe only a few prototypes following the generic warehousing approach. A recent example of a generic warehouse is COLUMBA [125], which physically integrates protein annotations from several sources into a local relational database. Source data is imported in its original schema to reduce the effort required for schema integration and data import as much as possible. For each source, the main integration work consists of establishing a mapping table containing all correspondences between its objects and the objects of a previously selected central source, the Protein Data Bank (PDB). In this star-like organization, objects from two arbitrary sources can be efficiently related to each other by joining through the central source.
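A minimal sketch of such a star-like join is given below. It assumes one mapping table per source that links each object to entries of the central source; the table contents, the source names, and the placeholder identifiers standing in for central entries are hypothetical and not taken from COLUMBA.

```python
from typing import Dict, Set

# Hypothetical mapping tables: for each source,
# object id -> ids of linked entries in the central source (placeholders).
to_central: Dict[str, Dict[str, Set[str]]] = {
    "SourceA": {"a1": {"1ABC"}, "a2": {"2XYZ"}},
    "SourceB": {"b1": {"1ABC", "2XYZ"}, "b2": {"3DEF"}},
}


def relate(source_x: str, source_y: str) -> Dict[str, Set[str]]:
    """Relate objects of two sources by joining their mapping tables on the central id."""
    # Invert the mapping table of source_y: central id -> objects of source_y.
    central_to_y: Dict[str, Set[str]] = {}
    for y_obj, centrals in to_central[source_y].items():
        for c in centrals:
            central_to_y.setdefault(c, set()).add(y_obj)

    # For each object of source_x, collect the source_y objects sharing a central entry.
    result: Dict[str, Set[str]] = {}
    for x_obj, centrals in to_central[source_x].items():
        linked: Set[str] = set()
        for c in centrals:
            linked |= central_to_y.get(c, set())
        result[x_obj] = linked
    return result


a_to_b = relate("SourceA", "SourceB")
```

With n sources, this organization needs only n mapping tables rather than one per pair of sources, at the cost of relating only those objects that are linked to a central entry.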
As indicated in Table 14.1, our system GENMAPPER [32] also belongs to the family of generic warehouses. It uses a generic data model to uniformly represent objects and their annotations physically imported from different data sources. Existing correspondences between objects are explicitly captured and utilized to drive data integration and to combine annotation knowledge from different sources. To serve specific analysis needs, powerful operators are provided to derive tailored annotation views from the generic data representation. We further extended GENMAPPER by coupling it with a generic mediator to combine the advantages of materialized and virtual integration [76, 77]. This is denoted as the generic-hybrid approach in Table 14.1. In particular, while GENMAPPER exploits existing mappings to inter-relate objects of interest, SRS is used to retrieve, on demand, up-to-date source-specific annotations for the corresponding objects. GENMAPPER and its hybrid extension are described in detail in the two subsequent chapters.

17. DISCOVERYLINK is now distributed as a commercial product under the name IBM Websphere Informa-
CHAPTER 15
THE GENMAPPER APPROACH
GENMAPPER (Generic Mapper) represents a new approach to flexibly integrating heterogeneous data sources for large-scale analysis, one that preserves and utilizes the semantic knowledge represented in cross-references between the sources. The key aspects of our approach are the following:
• GENMAPPER physically integrates all data in a central database to support flexible, high performance analysis across data from many sources.
• In contrast to previous data warehouse approaches, we do not employ an application-specific global schema (e.g., a star or snowflake schema). Instead, we use a generic schema called GAM (Generic Annotation Model) to uniformly represent object and annotation data from different data sources, including ontologies. This makes it much easier to integrate new data sources and perform the corresponding data transformations, thereby improving scalability to a large number of sources. Moreover, the generic schema is robust against changes in the external sources, which simplifies update and maintenance.
• We store existing mappings between sources and correspondences between objects and annotations (cross-references), and exploit them to combine annotation knowledge from different sources.
• To support specific analysis needs and queries, we derive tailored annotation views from the generic data representation. This task is supported by a new approach utilizing a set of high-level operators, e.g., to combine mappings. Results of such operators that are of general interest, e.g., new mappings derived from existing mappings, can be materialized in the central database. Separating the generic data representation from the provision of application-specific views permits GENMAPPER and its (imported and derived) data to be used for a large variety of applications.
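The interplay of generic storage, mapping combination, and view derivation can be sketched with an in-memory SQLite database. The three tables below are a deliberately simplified stand-in and not the actual GAM schema, which is presented in Section 15.2; the table and column names, the `compose` function, and the sample sources and objects are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (source_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE object (object_id INTEGER PRIMARY KEY,
                     source_id INTEGER REFERENCES source(source_id),
                     accession TEXT);
CREATE TABLE correspondence (mapping TEXT,
                             object1_id INTEGER REFERENCES object(object_id),
                             object2_id INTEGER REFERENCES object(object_id));
""")

# Hypothetical imported data: three sources and two existing mappings.
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(1, "Genes"), (2, "Proteins"), (3, "Functions")])
conn.executemany("INSERT INTO object VALUES (?, ?, ?)",
                 [(10, 1, "g1"), (20, 2, "p1"), (30, 3, "f1"), (31, 3, "f2")])
conn.executemany("INSERT INTO correspondence VALUES (?, ?, ?)",
                 [("Genes-Proteins", 10, 20),
                  ("Proteins-Functions", 20, 30),
                  ("Proteins-Functions", 20, 31)])


def compose(db: sqlite3.Connection, first: str, second: str, derived: str) -> None:
    """Derive a new mapping by joining two existing mappings and materialize it."""
    db.execute("""
        INSERT INTO correspondence (mapping, object1_id, object2_id)
        SELECT DISTINCT ?, a.object1_id, b.object2_id
        FROM correspondence a
        JOIN correspondence b ON a.object2_id = b.object1_id
        WHERE a.mapping = ? AND b.mapping = ?
    """, (derived, first, second))


compose(conn, "Genes-Proteins", "Proteins-Functions", "Genes-Functions")

# A tailored view: gene accessions annotated with function accessions.
view = conn.execute("""
    SELECT o1.accession, o2.accession
    FROM correspondence c
    JOIN object o1 ON o1.object_id = c.object1_id
    JOIN object o2 ON o2.object_id = c.object2_id
    WHERE c.mapping = ?
    ORDER BY o1.accession, o2.accession
""", ("Genes-Functions",)).fetchall()
```

Because the derived mapping is stored in the same generic table as the imported ones, later queries can use it like any other mapping without recomputing the join.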
In the next section, we give an overview of the data integration approach implemented in GENMAPPER. Section 15.2 presents the generic data model GAM. Sections 15.3 and 15.4 discuss the data import phase and the generation of annotation views, respectively. Section 15.5 describes additional aspects of the technical implementation as well as an application scenario of GENMAPPER. A brief summary of the chapter is given in Section 15.6.