The workflow problem we present in our use case emanates from the variety of data formats assumed by service providers. Data Integration (the means of gathering information from multiple, heterogeneous sources) also addresses this problem, providing solutions which enable the harvesting of information across differing syntactic representations. Given the similarity of the problem, we investigate the following data integration research: a bioinformatics application that utilises ontologies to capture the meaning of information content; a physics Grid technology that enables transparent access to data ranging over multiple, divergent sources; a geographic dataset integration solution; a semantic data integration system for the web; and a Grid data definition language to support the meaningful interchange of Grid data.
3.4.1 TAMBIS - Data Integration for Bioinformatics
The Transparent Access to Multiple Bioinformatics Information Sources [87] (TAMBIS) framework is designed to support the gathering of information from various data sources through a high-level, conceptually-driven query interface. In this system, information sources are typically proprietary flat-file structures, the outputs of programs, or the product of services, with no structured query interface such as SQL or XQuery [22], and no standard representation format. A molecular biology ontology, expressed using a description logic, is used in conjunction with functions that specify how every concept is accessed within each data source. Together, these deliver an advanced querying interface that supports the retrieval of data from multiple information sources assuming different data representations. The requirements for syntactic mediation are similar to those of data integration: syntactic mediation requires a common way to view and present information from syntactically incongruous sources; data integration systems, such as TAMBIS, have achieved this by using conceptual models to describe information sources in a way that is independent of representation. While the TAMBIS approach is useful when considering the consolidation of Web Service outputs, it does not support the creation of new data in a concrete format, a process that is required when creating inputs to Web Services.
3.4.2 XDTM - Supporting Transparent Data Access
The need to integrate data from heterogeneous sources has also been addressed by Moreau et al. [78] within the Grid Physics Network, GriPhyN [50]. As in the bioinformatics domain, data sources used in physics Grids range across a variety of legacy flat file formats. To provide a homogeneous access model to these varying data sources, Moreau et al. [78] propose a separation between logical and physical file structures. This allows access to data sources to be expressed in terms of the logical structure of the information rather than the way in which it is physically represented. To achieve this, an XML Schema is used to express the logical structure of an information source, and mappings are used to relate XML Schema elements to their corresponding parts within a physical representation. The XML Data Type and Mapping for Specifying Datasets (XDTM) prototype provides an implementation which allows data sources to be navigated using XPath. This enables users to retrieve and iterate across data stored in multiple, heterogeneous sources. While this approach is useful when amalgamating data from different physical representations, it does not address the problem of data represented using different logical representations. Within a Web Service environment where services are described using WSDL, we can assume a homogeneous physical representation because interface types are described using XML Schema. Our workflow harmonisation problem arises from the fact that service providers use different logical representations of conceptually equivalent information, i.e. differently organised XML Schemas to hold the same conceptual items.
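The separation between logical and physical structure can be illustrated with a minimal Python sketch. This is not the XDTM implementation: the CSV layout, the element names and the helper `csv_to_logical` are hypothetical, chosen only for illustration. A mapping function exposes a physical CSV file as an XML tree (the logical view), which is then navigated with the limited XPath subset supported by the standard `xml.etree.ElementTree` module.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical physical representation: a flat CSV file of measurements.
PHYSICAL_CSV = "run,energy\n1,13.5\n2,14.1\n"


def csv_to_logical(text):
    """Map each CSV row onto a <measurement> element of the logical view."""
    root = ET.Element("dataset")
    for row in csv.DictReader(io.StringIO(text)):
        measurement = ET.SubElement(root, "measurement", run=row["run"])
        ET.SubElement(measurement, "energy").text = row["energy"]
    return root


def energies(logical_root):
    """Navigate the logical structure, independent of how it is stored."""
    return [float(e.text) for e in logical_root.findall("./measurement/energy")]


logical = csv_to_logical(PHYSICAL_CSV)
print(energies(logical))
```

A second physical format (for example a binary file) would only need its own mapping function; `energies` would remain unchanged because it is written against the logical structure alone.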
3.4.3 Ontology-based Geographic Data Set Integration
Geographic data comes in a variety of formats: digitised maps, graphs and tables can be used to capture and visualise a range of information, from precipitation levels to population densities. As new data instances appear, it is important with geographic data sets to recognise their spatial attributes so that information can be organised and discovered by regional features such as longitude and latitude, as well as political or geographic location. Uitermark et al. [92] address the problem of geographic data set integration: the process of establishing relationships between corresponding object instances from disparate, autonomously produced information sources. Their work is situated in the context of update propagation, so that geographically equivalent data instances from different sources, in different formats, can be identified and viewed as the same instance. Abstraction rules dictate the correspondence between elements from different data models, which means the relationship between instances of data in different models can be derived, e.g. they are in the same location or they fall within the same region.
3.4.4 IBIS: Semantic Data Integration
The Internet-Based Information System (IBIS) [29] is an architecture for the semantic integration of heterogeneous data sources. A global-as-view approach [19, 32] is employed, meaning a single view is constructed over disparate information sources by associating each element in a data source with an element in a global schema. A relational model is used as the global schema, and non-relational sources are wrapped so that legacy file formats, Web data and database models can all be queried using a standard access model. A novel feature of the IBIS architecture is the ability to deal with information gathered via Web forms. This is achieved by exploiting and implementing techniques developed for querying sources with binding patterns [69].
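The global-as-view idea can be sketched in a few lines of Python. This is an illustration only and does not reflect the IBIS implementation: the two wrapped sources, the `person` relation and the `query` helper are all hypothetical. Each relation of the global schema is defined as a view (a query) over the wrapped sources, and a query over the global schema is answered by unfolding that view.

```python
# Hypothetical wrapped sources: each wrapper hides its own native format.
def source_csv():
    # e.g. a legacy flat file with rows of (full_name, mail)
    return [("Ada Lovelace", "ada@example.org")]


def source_web():
    # e.g. records extracted from a Web page, held as dicts
    return [{"n": "Alan Turing", "e": "alan@example.org"}]


# Global-as-view: each relation of the global schema is defined as a
# view over the sources.
GLOBAL_SCHEMA = {
    "person": lambda: (
        [{"name": n, "email": e} for n, e in source_csv()]
        + [{"name": r["n"], "email": r["e"]} for r in source_web()]
    ),
}


def query(relation, predicate=lambda row: True):
    """Answer a query over the global schema by unfolding the view."""
    return [row for row in GLOBAL_SCHEMA[relation]() if predicate(row)]


print(query("person", lambda r: r["name"].startswith("Alan")))
```

The caller only ever sees the `person` relation; adding a new source means extending the view definition, not the queries written against it.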
3.4.5 Data Format Definition Language
The Data Format Definition Language (DFDL) [16] is a proposed standard for the description of data formats that is intended to facilitate the meaningful interchange of data on the Grid. Rather than trying to impose standardised data formats across vendors, DFDL can be used to specify the structure and contents of a file format at an abstract level, with mappings that define how abstract data elements are serialised within the data format. The DFDL API can then be used to parse data and operate over it without regard for the physical representation of the data. This approach has the benefit that information providers can choose to represent their data using the most appropriate format. This is an important consideration for Grid applications because data sets can be large and complex, and therefore enforcing a particular representation language such as XML is not feasible.
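The principle of describing a format abstractly and letting a generic parser handle the serialisation can be sketched as follows. This is not DFDL syntax and does not use any DFDL API: the record layout, the field names and the `parse_record` helper are hypothetical, and Python's standard `struct` module stands in for the mapping between abstract fields and bytes.

```python
import struct

# Hypothetical format description: abstract field names paired with the
# struct codes that say how each field is serialised.
RECORD_FORMAT = [("station_id", "H"), ("temperature", "f"), ("flags", "B")]


def parse_record(data, fmt=RECORD_FORMAT):
    """Decode one big-endian binary record into a dict keyed by field name."""
    layout = ">" + "".join(code for _, code in fmt)
    values = struct.unpack(layout, data[:struct.calcsize(layout)])
    return dict(zip((name for name, _ in fmt), values))


# Build a sample record in the physical format, then read it back
# through the abstract description.
raw = struct.pack(">HfB", 42, 21.5, 1)
record = parse_record(raw)
print(record["station_id"], record["temperature"])
```

Code that consumes `record` refers only to the abstract field names, so the physical layout can change by editing `RECORD_FORMAT` alone.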
3.4.6 Reflection on Data Integration Techniques
Viewing information sources through a three-tier model [86] allows us to separate different data integration solutions and position our work against them.

[Figure 3.7: A Three Tier Model to separate physical storage, logical structure and conceptual meaning of information. Conceptual Layer (what the data means): Description Logic, Ontology, ER Model. Logical Layer (how the data is structured): XML Schema, Relational Schema, Java Bean. Physical Layer (how the data is stored): XML, CSV, DOC, BIN, XLS, RTF, VCARD.]

Figure 3.7 illustrates the relationship between physical representation, logical organisation, and the meaning of data:
1. Physical Representation - How the file is stored
Data can be stored in a variety of formats: proprietary binary files, text files, XML files and databases encompass the most common methods.
2. Logical Organisation - How the information is structured
On top of the physical representation layer, the logical organisation of the data dictates the structure of the information, e.g. XML Schema, relational models, etc.
3. Conceptual - What the data means
On top of the logical organisation layer, the conceptual model of an information source specifies what the information means at a high level of abstraction.
It is common for data integration solutions to use a common representation or uniform access model to facilitate the gathering and processing of information from different representations. In terms of the three-tier model presented in Figure 3.7, a set of heterogeneous formats in one layer can be abstracted in the layer above to support homogeneous data access. For example, different physical file formats can be integrated through a common structural representation, a technique used by DFDL, XDTM and IBIS. If different logical organisations of data exist, a common conceptual model can be used to access data sources through a single view, an approach used by TAMBIS and the integration of geographic datasets. In either case, some form of mapping or wrapper program is used to translate data. The workflow harmonisation problem that we presented earlier in Chapter 2 stems from the fact that different service providers assume different logical organisations of data (under the assumption that XML Schemas are used to describe the inputs and outputs of Web Services). Therefore, a common conceptual model that describes the contents of different XML Schemas can be used to drive the translation of data between different formats. To achieve this, some method is required to assign meaning, expressed in a high-level language such as a description logic or ontology, to XML Schema components. This notion, commonly referred to as XML semantics, is discussed in the following section.
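To make the idea of a conceptual model driving translation concrete, the following Python sketch translates a document between two differently organised XML layouts by way of shared concept names. It is a simplified illustration under stated assumptions, not the approach developed later in this work: the two layouts, the concept names (`SequenceId`, `SequenceData`) and the `lift` and `lower` helpers are hypothetical, and a plain dictionary stands in for a description logic or ontology.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from concept names to element paths in two
# differently organised schemas that hold the same conceptual items.
SCHEMA_A = {"SequenceId": "id", "SequenceData": "residues"}
SCHEMA_B = {"SequenceId": "header/accession", "SequenceData": "body/seq"}


def lift(doc, mapping):
    """Read a document into concept -> value pairs via its mapping."""
    return {concept: doc.findtext(path) for concept, path in mapping.items()}


def lower(concepts, mapping, root_tag):
    """Build a document in the target layout from concept values."""
    root = ET.Element(root_tag)
    for concept, path in mapping.items():
        node = root
        for tag in path.split("/"):
            child = node.find(tag)
            node = child if child is not None else ET.SubElement(node, tag)
        node.text = concepts[concept]
    return root


source = ET.fromstring("<record><id>P12345</id><residues>MKV</residues></record>")
target = lower(lift(source, SCHEMA_A), SCHEMA_B, "entry")
print(ET.tostring(target, encoding="unicode"))
```

Because each layout is mapped to the concepts rather than to every other layout, supporting a third layout requires one additional mapping instead of one per pair of formats.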