System Architecture - SCHEMA MATCHING AND MAPPING-BASED DATA INTEGRATION

Figure 5.1 shows the architecture of our COMA++ schema matching system. It consists of five components: the Repository to persistently store match-related data, the Schema and Mapping Pools to manage schemas and mappings in memory, the Match Customizer to configure matchers and match strategies, and the Execution Engine to execute match operations. All components are managed and used through a comprehensive graphical user interface (GUI).

Figure 5.1 COMA++ system architecture

[Figure: Graphical User Interface; Schema Pool (schema manipulation; external schemas and ontologies); Mapping Pool (mapping manipulation; exported mappings); Match Customizer (matcher library, match strategies); Execution Engine (element identification, matcher execution, similarity combination); Repository with the tables SOURCE (Source Id, Name, Structure, Content), OBJECT (Object Id, Source Id, Accession, Text, Number), SOURCE_REL (Source Rel Id, Source1 Id, Source2 Id, Type), and OBJECT_REL (Object Rel Id, Source Rel Id, Object1 Id, Object2 Id, Evidence), connected by 1:n relationships.]

The repository centrally stores various types of data related to match processing, in particular a) imported schemas, b) produced mappings, c) auxiliary information such as domain-specific taxonomies and synonym tables, and d) the definition and configuration of matchers and match strategies. We use a generic data model implemented in a relational DBMS (MySQL4) to uniformly store the different kinds of schemas as well as the mappings between them.
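The generic data model can be illustrated with a small sketch. The table and column names below are taken from Figure 5.1; the column types, the constraints, and the reading that a SOURCE row holds a schema, an OBJECT row a schema element, a SOURCE_REL row a mapping, and an OBJECT_REL row a single correspondence are assumptions made for illustration. The sketch uses an in-memory SQLite database in place of MySQL so that it is self-contained; the helper names `build_repository` and `correspondences` are hypothetical.

```python
import sqlite3

# Hypothetical DDL: names follow Figure 5.1, types and keys are assumptions.
DDL = """
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    structure TEXT,
    content   TEXT
);
CREATE TABLE object (
    object_id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES source(source_id),
    accession TEXT,
    text      TEXT,
    number    REAL
);
CREATE TABLE source_rel (
    source_rel_id INTEGER PRIMARY KEY,
    source1_id    INTEGER NOT NULL REFERENCES source(source_id),
    source2_id    INTEGER NOT NULL REFERENCES source(source_id),
    type          TEXT
);
CREATE TABLE object_rel (
    object_rel_id INTEGER PRIMARY KEY,
    source_rel_id INTEGER NOT NULL REFERENCES source_rel(source_rel_id),
    object1_id    INTEGER NOT NULL REFERENCES object(object_id),
    object2_id    INTEGER NOT NULL REFERENCES object(object_id),
    evidence      REAL
);
"""


def build_repository() -> sqlite3.Connection:
    """Create the four tables and store the sample mapping of Figure 5.2."""
    con = sqlite3.connect(":memory:")
    con.executescript(DDL)
    # Two schemas (assumed to be stored as sources).
    con.executemany(
        "INSERT INTO source VALUES (?, ?, ?, ?)",
        [(1, "S1", "relational", None), (2, "S2", "xsd", None)],
    )
    # Schema elements (assumed to be stored as objects, keyed by their path).
    con.executemany(
        "INSERT INTO object VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "ShipTo.shipToStreet", None, None),
            (2, 1, "ShipTo.shipToCity", None, None),
            (3, 2, "DeliverTo.Address.Street", None, None),
            (4, 2, "DeliverTo.Address.City", None, None),
        ],
    )
    # One mapping between S1 and S2, with two correspondences.
    con.execute("INSERT INTO source_rel VALUES (1, 1, 2, 'match result')")
    con.executemany(
        "INSERT INTO object_rel VALUES (?, ?, ?, ?, ?)",
        [(1, 1, 1, 3, 0.7), (2, 1, 2, 4, 0.7)],
    )
    return con


def correspondences(con: sqlite3.Connection, source_rel_id: int):
    """Return (element1, element2, evidence) rows of one stored mapping."""
    return con.execute(
        """
        SELECT o1.accession, o2.accession, r.evidence
        FROM object_rel AS r
        JOIN object AS o1 ON o1.object_id = r.object1_id
        JOIN object AS o2 ON o2.object_id = r.object2_id
        WHERE r.source_rel_id = ?
        ORDER BY o1.accession
        """,
        (source_rel_id,),
    ).fetchall()
```

Because schemas and mappings share the same four tables, adding a new schema type in this sketch would require no new tables, only new rows.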


Schemas are uniformly represented by directed acyclic graphs as the internal format for matching. The Schema Pool provides different functions to import external schemas, to load and save them from/to the repository, and to preprocess them for matching. Currently, we support W3C XML Schema Definition (XSD), Web Ontology Language (OWL), XML Data Reduced (XDR), and relational schemas imported via the Open Database Connectivity (ODBC) interface. From the Schema Pool, two arbitrary schemas can be selected to start a match operation. Generated mappings are maintained by the Mapping Pool, which, like the Schema Pool, also offers different functions for further manipulation of the mappings.

The match operation is performed in the Execution Engine according to a match strategy configured in the Match Customizer. As indicated in Figure 5.1, it is based on iterating three steps: element identification to determine the relevant schema elements for matching, matcher execution applying multiple matchers to compute the element similarities, and similarity combination to combine matcher-specific similarities and derive a mapping with the best correspondences between the elements. The obtained mapping can in turn be used as input in the next iteration for further refinement. Each iteration can be individually configured using the various alternatives supported by the Match Customizer, in particular, the type of elements to be considered, the matchers for similarity computation, and the strategies for similarity combination, which will all be discussed in detail in Chapter 7.
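The three steps of one iteration can be sketched as follows. This is only an illustration under stated assumptions: the two matchers (an edit-based name matcher and a trigram matcher), the averaging of matcher results, and the selection of the single best candidate above a threshold are placeholders chosen for brevity, not the alternatives actually offered by the Match Customizer. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional, Tuple

Matcher = Callable[[str, str], float]
Mapping = Dict[Tuple[str, str], float]


def name_matcher(a: str, b: str) -> float:
    """Edit-based similarity of two element names (assumed matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def trigram_matcher(a: str, b: str) -> float:
    """Dice coefficient over character trigrams (assumed matcher)."""
    def grams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    ga, gb = grams(a), grams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def match_iteration(
    s1: List[str],
    s2: List[str],
    matchers: List[Matcher],
    threshold: float,
    previous: Optional[Mapping] = None,
) -> Mapping:
    # Step 1, element identification: all pairs in the first iteration,
    # only the pairs of the previous mapping in a refinement iteration.
    if previous is None:
        candidates = [(a, b) for a in s1 for b in s2]
    else:
        candidates = list(previous)

    # Step 2, matcher execution: one similarity per matcher and pair.
    cube = {pair: [m(*pair) for m in matchers] for pair in candidates}

    # Step 3, similarity combination: average the matcher results, then
    # keep for each S1 element its best S2 candidate above the threshold.
    combined = {pair: sum(v) / len(v) for pair, v in cube.items()}
    best: Dict[str, Tuple[str, float]] = {}
    for (a, b), sim in combined.items():
        if sim >= threshold and sim > best.get(a, ("", 0.0))[1]:
            best[a] = (b, sim)
    return {(a, b): sim for a, (b, sim) in best.items()}
```

A second call that passes the first result as `previous` with a stricter threshold mimics the refinement described above, since only the previously found correspondences are re-evaluated.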

Using this infrastructure, match processing is supported as a workflow of several match steps. For large schemas, we implement specific workflows (i.e., strategies) for context-dependent and fragment-based matching. We briefly introduce these match strategies, which will be discussed in detail in Chapter 8:

• Context-dependent matching: Shared schema elements occur in multiple contexts, such as an address used for both delivery and invoice, and these contexts should be differentiated for correct matching. In addition to a simple NoContext strategy, i.e., no consideration of element contexts, we support two context-sensitive strategies, AllContext and FilteredContext. AllContext identifies and matches all contexts by considering for a shared element all paths (sequences of nodes) from the root to the element. Unfortunately, this strategy turns out to be expensive and impractical for large schemas with many shared elements due to an explosion of the search space. Therefore, we devised the FilteredContext strategy, which performs matching in two steps and restricts context evaluation to the most similar nodes.

• Fragment-based matching: In a match task with large schemas, it is likely that large portions of one or both input schemas have no matching counterparts. We thus propose fragment-based schema matching, i.e., a divide-and-conquer strategy which decomposes schemas into several smaller fragments and only matches the fragment pairs with a high similarity. In addition to user-selected fragments, we currently support three static fragment types, Schema, Subschema, and Shared, which consider the complete schema, single subschemas (e.g., message formats in an XML schema), and shared subgraphs, respectively.
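The divide-and-conquer idea behind fragment-based matching can be sketched in a few lines. The sketch assumes that each schema has already been decomposed into named fragments, that fragment similarity is estimated cheaply from the fragment names alone, and that elements are compared with a simple string similarity; how fragments are actually identified and compared is the subject of Chapter 8. The fragment data and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Dict, List, Tuple

Fragments = Dict[str, List[str]]  # fragment name -> element names

# Hypothetical fragments of two purchase-order schemas.
S1_FRAGMENTS: Fragments = {
    "Address": ["street", "city", "zip"],
    "LineItem": ["itemNo", "quantity"],
}
S2_FRAGMENTS: Fragments = {
    "DeliveryAddress": ["Street", "City", "ZipCode"],
    "Item": ["number", "qty"],
}


def sim(a: str, b: str) -> float:
    """String similarity used for fragments and elements (assumed)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fragment_match(
    frags1: Fragments,
    frags2: Fragments,
    frag_threshold: float = 0.5,
    elem_threshold: float = 0.5,
) -> Tuple[Dict[Tuple[str, str], float], List[Tuple[str, str]]]:
    """Match elements only within sufficiently similar fragment pairs."""
    mapping: Dict[Tuple[str, str], float] = {}
    compared: List[Tuple[str, str]] = []
    for (n1, f1), (n2, f2) in product(frags1.items(), frags2.items()):
        # Step 1: skip fragment pairs whose similarity is too low.
        if not f2 or sim(n1, n2) < frag_threshold:
            continue
        compared.append((n1, n2))
        # Step 2: match the elements of the remaining fragment pairs.
        for a in f1:
            best = max(f2, key=lambda b: sim(a, b))
            score = sim(a, best)
            if score >= elem_threshold:
                mapping[(a, best)] = score
    return mapping, compared
```

In this example only two of the four fragment pairs reach the element-level comparison, which is where the saving for large schemas would come from.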

Figure 5.2 External and internal schema representation

A) A relational schema and an XML schema

    CREATE TABLE ShipTo (
      poNo INT,
      shipToStreet VARCHAR(200),
      shipToCity VARCHAR(200),
      shipToZip VARCHAR(20),
      PRIMARY KEY (poNo)
    );

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="DeliverTo" type="Address"/>
      <xsd:element name="BillTo" type="Address"/>
      <xsd:complexType name="Address">
        <xsd:sequence>
          <xsd:element name="Street" type="xsd:string"/>
          <xsd:element name="City" type="xsd:string"/>
          <xsd:element name="Zip" type="xsd:decimal"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>

B) Graph representation and sample mapping

    S₁: ShipTo -> shipToStreet, shipToCity, shipToZip
    S₂: DeliverTo -> Address; BillTo -> Address; Address -> Street, City, Zip

    S₁ Element            S₂ Element                 Sim
    ShipTo.shipToStreet   DeliverTo.Address.Street   0.7
    ShipTo.shipToCity     DeliverTo.Address.City     0.7
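To make the graph representation concrete, the XML schema of Figure 5.2 can be encoded as a directed acyclic graph and its root-to-node paths enumerated. The adjacency-list encoding, the treatment of DeliverTo and BillTo as roots, and the function names are assumptions made for this sketch; it only illustrates why a shared element such as Address yields several contexts.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # node -> child nodes

# XML schema S2 of Figure 5.2; Address is shared by DeliverTo and BillTo.
S2: Graph = {
    "DeliverTo": ["Address"],
    "BillTo": ["Address"],
    "Address": ["Street", "City", "Zip"],
    "Street": [],
    "City": [],
    "Zip": [],
}


def roots(graph: Graph) -> List[str]:
    """Nodes that are not a child of any other node."""
    children = {c for cs in graph.values() for c in cs}
    return [n for n in graph if n not in children]


def paths(graph: Graph) -> List[str]:
    """All root-to-node paths, written in dotted notation."""
    out: List[str] = []

    def walk(node: str, prefix: List[str]) -> None:
        path = prefix + [node]
        out.append(".".join(path))
        for child in graph[node]:
            walk(child, path)

    for root in roots(graph):
        walk(root, [])
    return out


def contexts(graph: Graph, element: str) -> List[str]:
    """All paths that end in the given element, i.e., its contexts."""
    return [p for p in paths(graph) if p.split(".")[-1] == element]
```

For this schema, `contexts(S2, "Street")` yields the two paths DeliverTo.Address.Street and BillTo.Address.Street, so the six nodes of the graph expand to ten paths. Matching over paths instead of nodes therefore enlarges the search space as the number of shared elements grows.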
