Chapter 2   Literature Review

2.3   Schema-based Structural Approaches

2.3.3   Automatic Schema Matching

Schema matching can be performed manually. However, manually specifying schema

matches is tedious, time-consuming, error-prone, and therefore an expensive process

[Rahm and Bernstein, 2001], especially when the number of information sources is

growing rapidly and the systems are becoming larger and more complex. Therefore,

automated support for schema matching is required to provide faster and less

labor-intensive integration approaches.

Multiple match algorithms, or matchers, have been implemented based on

different methods. The matchers may consider only schema information, instance data

(i.e., data contents), or use hybrid methods.

A. Schema-level approaches

Schema-level matchers only consider schema information, not instance data. The

available information includes the usual properties of schema elements [Giunchiglia and

Yatskevich, 2004], such as name, description, data type, relationship types (part-of, is-a,

etc.), constraints, and schema structure (e.g., [Doan, et al., 2001 and Mitra, et al., 1999]).

A general implementation compares each S1 element with each S2 element and

determines a similarity metric in the range (0, 1) for each pair. Only the combinations

with a similarity value above a certain threshold are considered as match candidates. The

similarity metrics can be used to identify the best match candidates [Castano, et al., 2001

and Doan, et al., 2000]. On the other hand, structural-level matching can discover

matching combinations of elements that appear together in a structure.
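
The pairwise comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the threshold of 0.6, and the use of Python's difflib.SequenceMatcher as the similarity metric are choices made here, not the metrics used by the cited systems.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a, b):
    """Placeholder metric in [0, 1]; any element-level metric could be plugged in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_candidates(s1, s2, threshold=0.6, similarity=name_similarity):
    """Compare every S1 element with every S2 element; keep pairs above the threshold."""
    scored = [(e1, e2, similarity(e1, e2)) for e1, e2 in product(s1, s2)]
    kept = [t for t in scored if t[2] > threshold]
    # Highest similarity first, so the best candidates are listed at the top.
    return sorted(kept, key=lambda t: t[2], reverse=True)


# Hypothetical element names for two schemas S1 and S2.
s1 = ["CustomerName", "Address", "Phone"]
s2 = ["CustName", "StreetAddress", "Fax"]
for e1, e2, score in match_candidates(s1, s2):
    print(f"{e1} ~ {e2}: {score:.2f}")
```

Sorting by score is one simple way to surface the best match candidates; a different threshold or metric would change which pairs survive.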

Linguistic approaches are useful for schema-level matching. Two categories of important

approaches, name matching and description matching, are discussed in [Rahm and

Bernstein, 2001]. Name matching takes schema elements with equal or similar names

into consideration. The similarity of names can be defined and measured in various ways,

including:

Equality of names (the exact same names). An important sub-case is the

equality of names from the same XML namespace which ensures that the same

names indeed bear the same semantics.

Equality of canonical name representations after stemming and other

preprocessing. This is useful to deal with special prefix/suffix symbols (e.g.,

CName → customer name and EmpNO → employee number).

Equality of synonyms. For example, car can be matched to automobile. General

natural language dictionaries and domain-specific dictionaries are useful to deal

with synonyms.

Equality of hypernyms (name of a class’s super-class). E.g., book is-a

publication and article is-a

publication imply that book can be matched to

article.

Similarity of names based on common substrings, edit distance, pronunciation,

soundex (an encoding of names based on how they sound rather than how they

are spelled), etc. [Bell and Sethi, 2001]. For example, representedBy can be

matched to representative,

ShipTo can be matched to Ship2, and

Business-to-Business can be matched to B2B.

User-provided name matches, such as reportsTo = manager and issue = bug.
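
Several of the name similarity measures listed above can be combined in a small sketch. The abbreviation and synonym tables, the function names, and the order in which the criteria are tried are assumptions made for illustration; pronunciation-based measures such as soundex and hypernym lookups are left out for brevity.

```python
import re

# Hypothetical lookup tables; a real matcher would consult dictionaries or thesauri.
ABBREVIATIONS = {"c": "customer", "emp": "employee", "no": "number"}
SYNONYMS = {frozenset({"car", "automobile"})}


def tokens(name):
    """Split a camelCase name such as EmpNO into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def canonical(name):
    """Expand known abbreviations to obtain a canonical representation."""
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens(name))


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def name_similarity(a, b):
    """Return (score, reason), trying the stricter criteria first."""
    if a == b:
        return 1.0, "equal names"
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return 1.0, "equal canonical forms"
    if frozenset({ca, cb}) in SYNONYMS:
        return 1.0, "synonyms"
    longest = max(len(ca), len(cb)) or 1
    return 1.0 - edit_distance(ca, cb) / longest, "edit distance"


for pair in [("CName", "CustomerName"), ("car", "automobile"),
             ("representedBy", "representative")]:
    print(pair, name_similarity(*pair))
```

Returning the reason alongside the score makes it easier for a user to judge why two names were proposed as a match.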

Homonyms, which are equal or similar names referring to different concepts, are an

exception that can mislead name matching. For example, the term “class” can have

different interpretations in different situations, e.g., a group of students or a lesson of a

course. By providing context information such as the domain of discourse, the ambiguity

can be resolved or reduced.

Description matching uses the comments and descriptions provided along with the schemas

(usually written in natural language to express the intended semantics of schema

structures and elements), which can be evaluated linguistically to determine the similarity

between schema elements. Both simple approaches, such as extracting keywords from the

descriptions, and sophisticated technologies, such as natural language understanding, can

be applied to look for semantically equivalent elements. For example, the iMAP system

pays attention to the description of elements, in addition to other schema information

[Dhamankar, et al., 2004].
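
The simple keyword-based variant of description matching might look like the sketch below. The stop-word list, the function names, and the use of Jaccard overlap as the score are assumptions made here; this is not how iMAP or any other cited system processes descriptions.

```python
import re

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "that", "the", "this", "to"}


def keywords(description):
    """Extract the set of lowercase keywords from a free-text description."""
    words = re.findall(r"[a-z]+", description.lower())
    return {w for w in words if w not in STOPWORDS}


def description_similarity(d1, d2):
    """Jaccard overlap of the two keyword sets, in the range 0 to 1."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


print(description_similarity("The name of the customer", "Full name of a customer"))
```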

Another category of schema matching methods uses constraint information

contained in schemas to determine the similarity of schema elements [Larson, et al.,

1989]. The constraints include data types, value ranges, uniqueness, optionality,

relationship types, cardinalities, repeatability, reference, etc. For example, similarity can

be based on the equivalence of data types and domains, of key characteristics (e.g.,

unique, primary, foreign), of relationship cardinality (e.g., 1:1 relationships), or of is-a

relationships.
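
A constraint-based comparison can be sketched as a count of agreeing constraints. The ElementConstraints fields, the table of compatible data types, and the equal weighting of the checks are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementConstraints:
    """Hypothetical subset of the constraints attached to a schema element."""
    data_type: str
    unique: bool = False
    optional: bool = True
    cardinality: str = "1:1"


# Hypothetical pairs of data types that are treated as equivalent.
COMPATIBLE_TYPES = {frozenset({"int", "integer"}), frozenset({"varchar", "string"})}


def types_compatible(t1, t2):
    return t1 == t2 or frozenset({t1, t2}) in COMPATIBLE_TYPES


def constraint_similarity(a, b):
    """Fraction of the compared constraints on which the two elements agree."""
    checks = [
        types_compatible(a.data_type, b.data_type),
        a.unique == b.unique,
        a.optional == b.optional,
        a.cardinality == b.cardinality,
    ]
    return sum(checks) / len(checks)


emp_id = ElementConstraints("int", unique=True, optional=False)
staff_no = ElementConstraints("integer", unique=True, optional=False)
print(constraint_similarity(emp_id, staff_no))
```

Constraint agreement alone rarely identifies a match, so a score like this is better suited to narrowing down candidates found by other matchers.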

Rule-based matching techniques constitute another collection of schema matching

solutions [Madhavan, et al., 2001 and Melnik, et al., 2002]. Rule-based techniques

discover similar schema elements by exploiting schema-level information using

hand-crafted rules. For example, two elements match if they have the same name and the

same number of sub-elements. The rules can exploit all possible information, including

element name, data types, structures, number of sub-elements, and integrity constraints.
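
The example rule above (same name and same number of sub-elements) can be written as a hand-crafted rule in a few lines. The Element class and the representation of a rule as a list of predicates are assumptions for this sketch, not the rule language of any cited system.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element with a name and a list of sub-elements."""
    name: str
    children: list = field(default_factory=list)


def same_name(a, b):
    return a.name.lower() == b.name.lower()


def same_child_count(a, b):
    return len(a.children) == len(b.children)


# Each rule is a conjunction of predicates; two elements match if any rule holds.
RULES = [[same_name, same_child_count]]


def rule_match(a, b, rules=RULES):
    return any(all(pred(a, b) for pred in rule) for rule in rules)


addr1 = Element("Address", [Element("Street"), Element("City")])
addr2 = Element("address", [Element("Line1"), Element("Town")])
print(rule_match(addr1, addr2))
```

Further rules, for instance on data types or integrity constraints, could be appended to RULES without changing rule_match.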

B. Instance-level approaches

Instance-level data can give important insight into the contents and meaning of schema

elements [Rahm and Bernstein, 2001]. When the useful schema information is limited or

the schemas are ambiguous, as is often the case for many structured or semi-structured

information sources, the analysis of data instances becomes very helpful. Even when

substantial schema information is available, the use of instance-level matching can also

be valuable to uncover incorrect interpretations of schema information. For example, it

can help disambiguate between equally plausible schema-level matches by choosing to

match the elements whose instances are more similar.

Many approaches in schema-level matching can be applied to instance-level matching.

For text elements, a linguistic characterization, based on information retrieval techniques,

is the preferred approach. This approach evaluates the similarity of two schema elements

by comparing the relative frequencies of words and combinations of words in their data

instances. For numerical data types, a statistical characterization, such as numerical value

ranges, averages, or value patterns, can provide insight into the similarity of the

corresponding schema elements.
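
Both characterizations can be sketched briefly. For text, the fragment below compares relative word frequencies with cosine similarity; for numbers, it compares value ranges. The function names and both measures are assumptions picked for illustration, and word combinations, averages, and value patterns are omitted.

```python
import math
import re
from collections import Counter


def term_frequencies(values):
    """Relative frequency of each word across all instances of an element."""
    counts = Counter(w for v in values for w in re.findall(r"\w+", v.lower()))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def text_similarity(values1, values2):
    """Cosine similarity between the word-frequency vectors of two elements."""
    f1, f2 = term_frequencies(values1), term_frequencies(values2)
    dot = sum(f1[w] * f2[w] for w in f1.keys() & f2.keys())
    norm = (math.sqrt(sum(x * x for x in f1.values()))
            * math.sqrt(sum(x * x for x in f2.values())))
    return dot / norm if norm else 0.0


def range_overlap(values1, values2):
    """Shared part of the two value ranges divided by their combined span."""
    low = max(min(values1), min(values2))
    high = min(max(values1), max(values2))
    span = max(max(values1), max(values2)) - min(min(values1), min(values2))
    if span == 0:
        return 1.0  # both elements hold a single identical value
    return max(0.0, high - low) / span


print(text_similarity(["New York", "Boston"], ["Boston", "Chicago"]))
print(range_overlap([0, 10], [5, 15]))
```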

Various approaches have been proposed to perform instance-level matching, such as

rules, neural networks, and machine learning techniques [Berlin and Motro, 2001; Doan,

et al., 2000; Li and Clifton, 1994; Li, et al., 2000]. Learning-based approaches can exploit

data instance-level information. For example, Doan et al. proposed the LSD system,

which employs the Naive Bayes learning method over data instances [Doan, et al., 2001].

The Naive Bayes method can easily construct probabilistic rules based on the

analysis of data instances; these rules find similarities between schema elements whose

names do not reveal enough similarity clues. Note that the learning-based approaches are classified

as instance-level approaches, but in fact they can also utilize schema-level information.
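
To illustrate how a Naive Bayes learner can work over data instances, the following sketch trains on instances labeled with a target schema element and then predicts the element for an unlabeled column by majority vote. It is not the LSD implementation: the class name, the tokenization (digit runs are replaced by their length), the smoothing, and the voting step are all assumptions made here.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(value):
    """Lowercase words; digit runs become shape tokens such as '#3' (an assumption)."""
    raw = re.findall(r"[a-z]+|\d+", value.lower())
    return [f"#{len(t)}" if t.isdigit() else t for t in raw]


class NaiveBayesMatcher:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token counts
        self.instance_counts = Counter()          # label -> number of instances
        self.vocabulary = set()

    def train(self, label, instances):
        for value in instances:
            toks = tokenize(value)
            self.token_counts[label].update(toks)
            self.vocabulary.update(toks)
            self.instance_counts[label] += 1

    def log_score(self, label, value):
        """Log prior plus log likelihood of the tokens, with add-one smoothing."""
        total = sum(self.instance_counts.values())
        score = math.log(self.instance_counts[label] / total)
        counts = self.token_counts[label]
        # The extra 1 reserves probability mass for tokens never seen in training.
        denom = sum(counts.values()) + len(self.vocabulary) + 1
        for tok in tokenize(value):
            score += math.log((counts[tok] + 1) / denom)
        return score

    def predict(self, instances):
        """Classify each instance, then return the label with the most votes."""
        if not instances:
            return None
        votes = Counter(
            max(self.instance_counts, key=lambda lab: self.log_score(lab, v))
            for v in instances
        )
        return votes.most_common(1)[0][0]


# Hypothetical training data: instances already mapped to target elements.
matcher = NaiveBayesMatcher()
matcher.train("phone", ["555-1234", "555-9876", "206-555-0100"])
matcher.train("city", ["Seattle", "New York", "Boston"])
print(matcher.predict(["425-555-0199", "555-0000"]))
```

With so little training data the predictions are fragile; the sketch only shows how instance contents, rather than element names, can drive the match.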

C. Hybrid approaches

Since each matching approach has a specific applicability for a given match task, a

matcher that uses just one single approach is unlikely to achieve as many good match

candidates as one that combines several approaches [Rahm and Bernstein, 2001].

Therefore, hybrid approaches have been proposed. They fall into two categories: hybrid

matchers, which integrate multiple matching approaches based on multiple criteria or

information sources (e.g., by using name matching with namespaces and thesauri combined

with data type compatibility), and composite matchers, which combine the results of

independently executed matchers, including hybrid matchers.
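
A composite matcher can be sketched as a weighted average of independently executed matchers. The two component matchers, the dictionary representation of elements, and the weights are assumptions for illustration; other combination strategies, such as taking the maximum score, are equally possible.

```python
def name_matcher(a, b):
    """Hypothetical matcher: 1.0 if the names agree ignoring case, else 0.0."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


def type_matcher(a, b):
    """Hypothetical matcher: 1.0 if the data types are identical, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_matcher(matchers):
    """Build a matcher that averages the given (matcher, weight) pairs by weight."""
    total = sum(weight for _, weight in matchers)

    def combined(a, b):
        return sum(weight * m(a, b) for m, weight in matchers) / total

    return combined


match = composite_matcher([(name_matcher, 2), (type_matcher, 1)])
print(match({"name": "Phone", "type": "string"}, {"name": "phone", "type": "int"}))
```

Because each component only has to accept two elements and return a score, a hybrid matcher can be passed in as one of the components.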

One important issue is the impossibility of determining, fully automatically, all

matches between two schemas, primarily because most schemas have some semantics

that affect the matching criteria but are not formally expressed and often not even

documented [Rahm and Bernstein, 2001]. Therefore, the result of the match operation is

only a set of match candidates, which can be accepted, rejected, or modified by the user.

Furthermore, the user should be able to specify meaningful matches that the system

fails to discover.
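
The review step described above can be sketched as a function that applies user decisions to the match candidates. The function name, the encoding of decisions, and the choice to drop undecided candidates are assumptions made for this illustration.

```python
def review(candidates, decisions, user_matches=()):
    """Apply user decisions to candidate pairs and add user-specified matches.

    decisions maps a candidate pair to "accept", "reject", or a replacement
    pair (a modification). Candidates without a decision are left out.
    """
    final = []
    for pair in candidates:
        decision = decisions.get(pair)
        if decision == "accept":
            final.append(pair)
        elif decision is None or decision == "reject":
            continue
        else:
            final.append(decision)  # the user replaced the candidate
    for pair in user_matches:
        if pair not in final:
            final.append(pair)  # matches the system failed to discover
    return final


# Hypothetical candidates produced by an automatic matcher.
candidates = [("CName", "CustomerName"), ("Class", "Lesson"), ("EmpNO", "Phone")]
decisions = {
    ("CName", "CustomerName"): "accept",
    ("Class", "Lesson"): "reject",
    ("EmpNO", "Phone"): ("EmpNO", "EmployeeNumber"),
}
print(review(candidates, decisions, user_matches=[("reportsTo", "manager")]))
```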