Data Integration - Dynamic Discovery, Creation and Invocation of Type Adaptors for Web Service Workflow Harmonisation

The workflow problem we present in our use case emanates from the variety of data formats assumed by service providers. Data Integration (the means of gathering information from multiple, heterogeneous sources) also addresses this problem, providing solutions which enable the harvesting of information across differing syntactic representations. Given the similarity of the problem, we investigate the following data integration research: a bioinformatics application that utilises ontologies to capture the meaning of information content; a physics Grid technology that enables transparent access to data ranging over multiple, divergent sources; a geographic dataset integration solution; a semantic data integration system for the web; and a Grid data definition language to support the meaningful interchange of Grid data.

3.4.1 TAMBIS - Data Integration for Bioinformatics

The Transparent Access to Multiple Bioinformatics Information Sources [87] (TAMBIS) framework is designed to support the gathering of information from various data sources through a high-level, conceptually-driven query interface. In this system, information sources are typically proprietary flat-file structures, the outputs of programs, or the product of services, with no structured query interface such as SQL or XQuery [22], and no standard representation format. A molecular biology ontology, expressed using a description logic, is used in conjunction with functions that specify how every concept is accessed within each data source. Together, these deliver an advanced querying interface that supports the retrieval of data from multiple information sources assuming different data representations. The requirements for syntactic mediation are similar to those of data integration: syntactic mediation requires a common way to view and present information from syntactically incongruous sources; data integration systems, such as TAMBIS, have achieved this by using conceptual models to describe information sources in a way that is independent of representation. While the TAMBIS approach is useful when considering the consolidation of Web Service outputs, it does not support the creation of new data in a concrete format, a process that is required when creating inputs to Web Services.
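The pairing of a conceptual model with per-source access functions can be illustrated with a minimal Python sketch. This is not the TAMBIS implementation: a plain dictionary stands in for the description logic ontology, and the sources, the `Protein` concept, and all function names are hypothetical.

```python
import csv
import io

# Two hypothetical sources holding protein records in different formats.
FLAT_FILE = "P12345|Kinase A|Homo sapiens\nQ67890|Lyase B|Mus musculus\n"
CSV_SOURCE = "organism,accession,name\nHomo sapiens,P99999,Ligase C\n"


def proteins_from_flat_file(text):
    """Yield protein records from a pipe-delimited flat file."""
    for line in text.splitlines():
        accession, name, organism = line.split("|")
        yield {"accession": accession, "name": name, "organism": organism}


def proteins_from_csv(text):
    """Yield protein records from a CSV source with a header row."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"accession": row["accession"], "name": row["name"],
               "organism": row["organism"]}


# Stand-in for the conceptual model (an assumption, not a description logic):
# each concept is associated with one access function per source.
CONCEPT_ACCESSORS = {
    "Protein": [
        lambda: proteins_from_flat_file(FLAT_FILE),
        lambda: proteins_from_csv(CSV_SOURCE),
    ],
}


def query(concept, **constraints):
    """Return instances of a concept from every source that match the constraints."""
    results = []
    for accessor in CONCEPT_ACCESSORS[concept]:
        for record in accessor():
            if all(record.get(k) == v for k, v in constraints.items()):
                results.append(record)
    return results


human_proteins = query("Protein", organism="Homo sapiens")
print([p["accession"] for p in human_proteins])
```

The caller phrases the query in terms of the concept only; the format of each source is hidden behind its access function. Note that the sketch only reads data, which mirrors the limitation identified above: nothing here can produce a new document in a concrete format.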

3.4.2 XDTM - Supporting Transparent Data Access

The need to integrate data from heterogeneous sources has also been addressed by Moreau et al [78] within the Grid Physics Network, GriPhyN [50]. As in the bioinformatics domain, data sources used in physics Grids range across a variety of legacy flat file formats. To provide a homogeneous access model to these varying data sources, Moreau et al [78] propose a separation between logical and physical file structures. This allows access to data sources to be expressed in terms of the logical structure of the information rather than the way in which it is physically represented. To achieve this, an XML Schema is used to express the logical structure of an information source, and mappings are used to relate XML Schema elements to their corresponding parts within a physical representation. The XML Data Type and Mapping for Specifying Datasets (XDTM) prototype provides an implementation which allows data sources to be navigated using XPath. This enables users to retrieve and iterate across data stored across multiple, heterogeneous sources. While this approach is useful when amalgamating data from different physical representations, it does not address the problem of data represented using different logical representations. Within a Web Service environment where services are described using WSDL, we can assume that logical structure is described in a homogeneous way, because interface types are described using XML Schema. Our workflow harmonisation problem arises from the fact that service providers use different logical representations of conceptually equivalent information, i.e. differently organised XML schemas to hold the same conceptual items.
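A minimal Python sketch of the logical/physical separation follows. It is not the XDTM prototype: the two physical formats, the `dataset`/`measurement`/`energy` element names, and the mapping functions are all hypothetical, and the path expression uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two hypothetical physical representations of the same kind of data.
CSV_DATA = "run,energy\n1,13.6\n2,27.2\n"
PIPE_DATA = "3|40.8\n"


def csv_to_logical(text):
    """Mapping: CSV rows -> <dataset><measurement run=".."><energy/></measurement></dataset>."""
    root = ET.Element("dataset")
    for row in csv.DictReader(io.StringIO(text)):
        measurement = ET.SubElement(root, "measurement", run=row["run"])
        ET.SubElement(measurement, "energy").text = row["energy"]
    return root


def pipe_to_logical(text):
    """Mapping: pipe-delimited lines -> the same logical structure."""
    root = ET.Element("dataset")
    for line in text.splitlines():
        run, energy = line.split("|")
        measurement = ET.SubElement(root, "measurement", run=run)
        ET.SubElement(measurement, "energy").text = energy
    return root


def energies(logical_root):
    """Navigate the logical structure with a path expression, ignoring storage."""
    return [float(e.text) for e in logical_root.findall("./measurement/energy")]


print(energies(csv_to_logical(CSV_DATA)))
print(energies(pipe_to_logical(PIPE_DATA)))
```

The same `energies` function works over both sources because both are mapped to one logical structure. If a second provider organised the logical structure differently (for example, `energy` as an attribute), the path expression would no longer apply, which is the gap described in the paragraph above.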

3.4.3 Ontology-based Geographic Data Set Integration

Geographic data comes in a variety of formats: digitised maps, graphs and tables can be used to capture and visualise a range of information, from precipitation levels to population densities. As new data instances appear, it is important with geographic data sets to recognise their spatial attributes so that information can be organised and discovered by regional features such as longitude and latitude, as well as political or geographic location. Uitermark et al [92] address the problem of geographic data set integration: the process of establishing relationships between corresponding object instances from disparate, autonomously produced information sources. Their work is situated in the context of update propagation, so that geographically equivalent data instances from different sources, in different formats, can be identified and viewed as the same instance. Abstraction rules dictate the correspondence between elements from different data models, which means the relationship between instances of data in different models can be derived, e.g. they are in the same location or they fall within the same region.

3.4.4 IBIS: Semantic Data Integration

The Internet-Based Information System (IBIS) [29] is an architecture for the semantic integration of heterogeneous data sources. A global-as-view approach [19, 32] is employed, meaning a single view is constructed over disparate information sources by associating each element in a data source to an element in a global schema. A relational model is used as the global schema, with non-relational sources wrapped so that legacy file formats, Web data and database models can all be queried using a standard access model. A novel feature of the IBIS architecture is the ability to deal with information gathered via Web forms. This is achieved by exploiting and implementing techniques developed for querying sources with binding patterns [69].
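The global-as-view idea can be sketched with an in-memory SQLite database, where a global relation is defined as a view (a query) over the source relations. This is an illustration only, not the IBIS architecture: the source tables, the `employee` global relation, and the sample rows are all hypothetical, and wrapping of non-relational sources is assumed to have already happened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical, already-wrapped sources with different layouts.
    CREATE TABLE src_a_staff (id INTEGER, full_name TEXT, dept TEXT);
    CREATE TABLE src_b_people (surname TEXT, forename TEXT, unit TEXT);
    INSERT INTO src_a_staff VALUES (1, 'Ada Lovelace', 'Maths');
    INSERT INTO src_b_people VALUES ('Turing', 'Alan', 'Computing');

    -- Global-as-view: the global relation is defined as a query over the sources.
    CREATE VIEW employee AS
        SELECT full_name AS name, dept AS department FROM src_a_staff
        UNION ALL
        SELECT forename || ' ' || surname AS name, unit AS department
        FROM src_b_people;
""")

# Clients query the global schema only; the view unfolds to the sources.
rows = conn.execute(
    "SELECT name, department FROM employee ORDER BY name"
).fetchall()
print(rows)
```

Because the view definition carries the correspondence between source and global elements, answering a query over `employee` reduces to unfolding the view, which is the main practical attraction of the global-as-view style.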

3.4.5 Data Format Definition Language

The Data Format Definition Language (DFDL) [16] is a proposed standard for the description of data formats that intends to facilitate the meaningful interchange of data on the Grid. Rather than trying to impose standardised data formats across vendors, DFDL can be used to specify the structure and contents of a file format at an abstract level, with mappings that define how abstract data elements are serialised within the data format. The DFDL API can then be used to parse data and operate over it without regard for the physical representation of the data. This approach has the benefit that information providers can choose to represent their data using the most appropriate format. This is an important consideration for Grid applications because data sets can be large and complex, and therefore enforcing a particular representation language such as XML is not feasible.
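The separation between an abstract description and its serialisation mappings can be sketched in Python as follows. This does not use DFDL syntax or any DFDL API: the abstract record, the binary layout, the text layout, and every function name are assumptions made for illustration.

```python
import struct

# Hypothetical abstract description: field names and types, independent of storage.
ABSTRACT_RECORD = [("station", int), ("temperature", float)]


def _to_record(values):
    """Build one abstract record from raw values, in declared field order."""
    return {name: typ(v) for (name, typ), v in zip(ABSTRACT_RECORD, values)}


def parse_binary(data):
    """Mapping 1 (assumed layout): big-endian 4-byte unsigned int, then 8-byte double."""
    layout = struct.Struct(">Id")
    for offset in range(0, len(data), layout.size):
        yield _to_record(layout.unpack_from(data, offset))


def parse_text(data):
    """Mapping 2 (assumed layout): one record per line, fields separated by ';'."""
    for line in data.splitlines():
        yield _to_record(line.split(";"))


def mean_temperature(records):
    """Operates on abstract records only, with no knowledge of the physical format."""
    temps = [r["temperature"] for r in records]
    return sum(temps) / len(temps)


binary_data = struct.pack(">Id", 7, 21.5) + struct.pack(">Id", 8, 19.0)
text_data = "7;21.5\n8;19.0\n"

print(mean_temperature(parse_binary(binary_data)))
print(mean_temperature(parse_text(text_data)))
```

The compact binary form never has to be converted to XML or any other imposed format; only the mapping differs between providers, while `mean_temperature` is written once against the abstract record.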

3.4.6 Reflection on Data Integration Techniques

Viewing information sources through a three-tier model [86] allows us to separate different data integration solutions and position our work against them.

[Figure 3.7: A Three Tier Model to separate physical storage, logical structure and conceptual meaning of information. Conceptual Layer (what the data means): Description Logic, Ontology, ER Model. Logical Layer (how the data is structured): XML Schema, Relational Schema, Java Bean. Physical Layer (how the data is stored): XML, CSV, DOC, BIN, XLS, RTF, VCARD.]

Figure 3.7 illustrates the relationship between physical representation, logical organisation, and the meaning of data:

1. Physical Representation - How the file is stored

Data can be stored in a variety of formats: proprietary binary files, text files, XML files and databases encompass the most common methods.

2. Logical Organisation - How the information is structured

On top of the physical representation layer, the logical organisation of the data dictates the structure of the information, e.g. XML Schema, relational models, etc.

3. Conceptual - What the data means

On top of the logical organisation layer, the conceptual model of an information source specifies what the information means at a high level of abstraction.

It is common for data integration solutions to use a common representation or uniform access model to facilitate the gathering and processing of information from different representations. In terms of the three-tier model presented in Figure 3.7, a set of heterogeneous formats in one layer can be abstracted in the layer above to support homogeneous data access. For example, different physical file formats can be integrated through a common structural representation, a technique used by DFDL, XDTM and IBIS. If different logical organisations of data exist, a common conceptual model can be used to access data sources through a single view, an approach used by TAMBIS and the integration of geographic datasets. In either case, some form of mapping or wrapper program is used to translate data. The workflow harmonisation problem that we presented earlier in Chapter 2 stems from the fact that different service providers assume different logical organisations of data (under the assumption that XML Schema is used to describe the input and output of Web Services). Therefore, a common conceptual model that describes the contents of different XML schemas can be used to drive the translation of data between different formats. To achieve this, some method is required to assign meaning, expressed in a high-level language such as a description logic or ontology, to XML Schema components. This notion, commonly referred to as XML semantics, is discussed in the following section.
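To make the idea of a conceptual model driving translation concrete, the following Python sketch translates the output of one hypothetical service into the input expected by another by passing through a shared conceptual instance. It is a sketch under stated assumptions, not the mechanism developed in this work: both XML layouts, the `SequenceRecord` concept, and the function names `lift_a` and `lower_b` are invented for illustration, and a dictionary stands in for an ontology instance.

```python
import xml.etree.ElementTree as ET


def lift_a(xml_text):
    """Provider A's XML (assumed layout) -> instance of a shared concept."""
    root = ET.fromstring(xml_text)
    return {
        "concept": "SequenceRecord",
        "accession": root.findtext("id"),
        "sequence": root.findtext("seq"),
    }


def lower_b(instance):
    """Instance of the shared concept -> Provider B's XML (assumed layout)."""
    entry = ET.Element("entry", accession=instance["accession"])
    ET.SubElement(entry, "sequence").text = instance["sequence"]
    return ET.tostring(entry, encoding="unicode")


# Provider A nests both items as child elements; Provider B expects the
# accession as an attribute and a differently named sequence element.
output_of_a = "<record><id>AB000001</id><seq>ACGT</seq></record>"
input_for_b = lower_b(lift_a(output_of_a))
print(input_for_b)
```

With a shared conceptual model, each provider needs only one mapping to the concept and one from it, rather than a direct translation to every other provider's schema.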
