
Chapter 2   Literature Review

2.5   Introduction to Several Integration Systems

During the 1990s, the emergence of distributed computing, middleware technology, and standards allowed researchers to focus more closely on the heterogeneity that is intrinsic to data. These developments supported syntactic and structural interoperability in particular, and made it possible to address issues at the information level. As future information systems increasingly address information- and knowledge-level issues, they will require further semantic interoperability. Semantic interoperability requires that an information system understand the semantics of the information sources as well as of the user’s information requests, and use mediation or information brokering to satisfy those requests.

Over the past two decades, the growing adoption of ad hoc standards has led to significant progress towards system, syntactic, and structural interoperability. Structural interoperability and a limited form of semantic interoperability are achieved by adopting general-purpose metadata standards, such as Dublin Core [Mudumbai, 1997], as well as domain-specific metadata standards in areas such as bibliography [Beard and Smith, 1998], space and astronomy, geographical, environmental [Gunther and Voisard, 1998], and ecological [Reichman, et al., 1999] information. Early work focused on database-oriented data integration. Data integration is the process that takes a set of databases as input and produces as output a single unified description of the input schemas (the integrated schema), together with the associated mapping information that supports integrated access to existing data through the integrated schema [Parent and Spaccapietra, 1998].

For example, Clio+Garlic [Farquhua, et al., 1995], developed by IBM, mainly targets the transformation of legacy data into a new target schema. It introduced an interactive schema mapping paradigm based on value correspondences: a GUI lets users specify how a value of a target attribute can be created from a set of values of source attributes. From the user-specified value correspondences, the query/view definition is discovered automatically using DBMS query optimization techniques. The system also provides a mechanism for users to verify the mappings.
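The idea of value correspondences can be illustrated with a minimal Python sketch. It is not Clio's actual implementation: the `ValueCorrespondence` class, the `build_view_sql` function, and the table and attribute names are all hypothetical, and the join condition is supplied by hand here, whereas the system described above discovers the query/view definition automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueCorrespondence:
    """How one target attribute is derived from source attributes (hypothetical structure)."""
    target_attr: str
    source_attrs: tuple  # qualified source attributes, e.g. "emp.first_name"
    expression: str      # SQL expression over the source attributes


def build_view_sql(target, correspondences, join_condition=None):
    """Assemble a view definition from value correspondences.

    Assumption: the join condition is given explicitly; no automatic
    discovery of join paths is attempted in this sketch.
    """
    # Collect the source tables in order of first appearance.
    tables = []
    for vc in correspondences:
        for attr in vc.source_attrs:
            table = attr.split(".")[0]
            if table not in tables:
                tables.append(table)
    select_list = ", ".join(
        f"{vc.expression} AS {vc.target_attr}" for vc in correspondences
    )
    sql = f"CREATE VIEW {target} AS SELECT {select_list} FROM {', '.join(tables)}"
    if join_condition:
        sql += f" WHERE {join_condition}"
    return sql


# Hypothetical example: two legacy tables mapped into one target relation.
correspondences = [
    ValueCorrespondence(
        "full_name",
        ("emp.first_name", "emp.last_name"),
        "emp.first_name || ' ' || emp.last_name",
    ),
    ValueCorrespondence("dept_name", ("dept.name",), "dept.name"),
]
view_sql = build_view_sql("staff", correspondences, "emp.dept_id = dept.id")
print(view_sql)
```

Each correspondence contributes one column of the view, which mirrors the description of a target value being created from a set of source values.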

Early work on the SIMS system [Arens, et al., 1996] included a central domain model that is linked to the component databases and an AI-style planner that decomposes queries for efficient access. SIMS requires the system designer to build a model of the application domain and to define the contents of each source (database, Web server, etc.) in terms of this model. The SIMS planner provides a single point of access for all the information: the user expresses queries without needing to know anything about the individual sources. SIMS translates the user’s high-level request, expressed in a subset of SQL, into a query plan [Ambite and Knoblock, 2000], that is, a series of operations including queries to the sources of relevant data and manipulation of the retrieved data.
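The notion of a query plan can be sketched as follows. This is a simplified greedy illustration, not the SIMS planner itself: the source names, attribute names, and the `plan_query` function are hypothetical, and no cost-based reasoning is performed.

```python
# Hypothetical source descriptions: each source lists the domain-model
# attributes it can provide.
SOURCES = {
    "ports_db": {"port_name", "country", "depth"},
    "ships_web": {"ship_name", "draft", "home_port"},
}


def plan_query(requested_attrs):
    """Return a list of plan steps for the requested domain-model attributes.

    Greedy sketch: visit sources in name order, retrieve whatever each one
    covers, and add a final combine step when several sources are involved.
    """
    plan = []
    remaining = set(requested_attrs)
    for name in sorted(SOURCES):
        covered = remaining & SOURCES[name]
        if covered:
            plan.append(("retrieve", name, sorted(covered)))
            remaining -= covered
    if remaining:
        raise ValueError(f"no source provides: {sorted(remaining)}")
    if len(plan) > 1:
        plan.append(("combine", [step[1] for step in plan]))
    return plan


# The user names only domain-level attributes, never the sources.
plan = plan_query(["ship_name", "port_name", "depth"])
for step in plan:
    print(step)
```

The caller never refers to `ports_db` or `ships_web` directly, which reflects the single point of access described above.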

Later work employed ontologies to support integration at the concept level. By using ontologies for the explication and transformation of context knowledge, users can achieve interoperability at the semantic level [Calvanese, et al., 1998(1) and Stuckenschmidt and Wache, 2000].

For example, Information Manifold [Kirk, et al., 1995] employs a local-as-view approach with an explicit notion of a global schema/ontology. Its general mediator, which is independent of sources and queries, takes as input declarative descriptions of the contents and capabilities of a set of sources, expressed over the global concepts. A new source can be added by providing its description and a corresponding wrapper. A dialect of description logics, called CARIN, is used for source descriptions. The Bucket algorithm was developed in this project to rewrite a query over the global schema into queries to suitable sources.
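The bucket-building idea behind such query rewriting can be sketched in a heavily simplified form. The sketch below assumes that each source is described only by the set of global predicates it covers, and it omits variable matching and the containment check that a full rewriting algorithm needs; the source names, predicates, and function names are hypothetical.

```python
from itertools import product

# Hypothetical local-as-view descriptions: each source is described by the
# global-schema predicates it can answer.
SOURCE_DESCRIPTIONS = {
    "S1": {"flight", "airline"},
    "S2": {"flight"},
    "S3": {"hotel"},
}


def build_buckets(query_subgoals, descriptions):
    """For each query subgoal, collect the sources whose description covers it."""
    return {
        goal: sorted(src for src, preds in descriptions.items() if goal in preds)
        for goal in query_subgoals
    }


def candidate_rewritings(query_subgoals, descriptions):
    """Combine one source per bucket into candidate rewritings.

    Simplification: every combination is returned; a complete algorithm
    would keep only candidates that are contained in the original query.
    """
    buckets = build_buckets(query_subgoals, descriptions)
    return [
        dict(zip(query_subgoals, combo))
        for combo in product(*(buckets[goal] for goal in query_subgoals))
    ]


query = ["flight", "hotel"]
buckets = build_buckets(query, SOURCE_DESCRIPTIONS)
candidates = candidate_rewritings(query, SOURCE_DESCRIPTIONS)
print(buckets)
print(candidates)
```

Adding a new source only requires a new entry in `SOURCE_DESCRIPTIONS`, which mirrors how a local-as-view system is extended by describing the new source over the global concepts.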

In the BUSTER project, semantic integration is viewed as context integration [Visser, 2004], since information can only be well understood in its context. The context takes the form of assumptions about the meaning of information, but these assumptions are often not made explicit. Semantic integration can be achieved through context transformation, in which context information is explicated, descriptions of information entities are completed, and entities are interpreted in a new context. In context theory, a context is a collection of linguistic expressions providing an explicit description of the domain. Alternatively, it can be viewed as a set of parameters, each representing one particular aspect of the context being described, together with a set of values assigned to those parameters to describe the current context (e.g. {parameter1 = value1, parameter2 = value2, …, parametern = valuen}).
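The parameter-set view of a context can be illustrated with a short sketch. It is not the BUSTER mechanism: the `unit` parameter, the conversion table, and the `transform` function are hypothetical and serve only to show a value being reinterpreted in a new context.

```python
# Hypothetical conversion rules keyed by (parameter, source value, target value).
CONVERSIONS = {
    ("unit", "km", "m"): lambda v: v * 1000,
    ("unit", "m", "km"): lambda v: v / 1000,
}


def transform(value, source_ctx, target_ctx):
    """Reinterpret a value from the source context in the target context.

    Each context is a dict of explicit parameters. For every parameter on
    which the two contexts differ, a conversion rule must be available.
    """
    for param, target_val in target_ctx.items():
        if param not in source_ctx:
            raise KeyError(f"source context does not state '{param}'")
        source_val = source_ctx[param]
        if source_val == target_val:
            continue  # the contexts agree on this parameter
        rule = CONVERSIONS.get((param, source_val, target_val))
        if rule is None:
            raise ValueError(f"no rule for {param}: {source_val} -> {target_val}")
        value = rule(value)
    return value


# A distance recorded in kilometres, read in a context that assumes metres.
source_context = {"unit": "km"}
target_context = {"unit": "m"}
converted = transform(3, source_context, target_context)
print(converted)
```

The sketch fails when the source context leaves a parameter unstated, which corresponds to the observation above that assumptions must be explicated before a transformation is possible.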

Building on SIMS, the EDC project [Hovy, 2003] took the approach a step further, addressing the semi-automated construction of the single central model and its linking to a large general-purpose term taxonomy or ontology, Omega. The system provides dynamically planned access to data about petroleum products’ prices and volumes, provided in a variety of forms and on a variety of media by the Energy Information Administration, the Bureau of Labor Statistics, the Census Bureau, and the California Energy Commission, in the form of over 50,000 data tables. To construct the domain models more rapidly, systems were developed for automatically identifying terminology glossary files on websites, extracting and formalizing the glossary definitions, clustering them appropriately, and automatically embedding them into the existing ontology and domain model.