Chapter 2 Literature Review
2.3 Schema-based Structural Approaches
2.3.3 Automatic Schema Matching
Schema matching can be performed manually. However, manually specifying schema
matches is tedious, time-consuming, error-prone, and therefore an expensive process
[Rahm and Bernstein, 2001], especially when the number of information sources is
growing rapidly and the systems are becoming larger and more complex. Therefore,
automated support for schema matching is required to provide faster and less
labor-intensive integration approaches.
Multiple match algorithms, or matchers, have been implemented based on different
methods. A matcher may consider only schema information, consider instance data
(i.e., data contents), or use a hybrid method.
A. Schema-level approaches
Schema-level matchers only consider schema information, not instance data. The
available information includes the usual properties of schema elements [Giunchiglia and
Yatskevich, 2004], such as name, description, data type, relationship types (part-of, is-a,
etc.), constraints, and schema structure (e.g., [Doan, et al., 2001 and Mitra, et al., 1999]).
A general implementation compares each element of schema S1 with each element of schema
S2 and determines a similarity metric in the range 0 to 1 for each pair. Only the combinations
with a similarity value above a certain threshold are considered as match candidates. The
similarity metrics can be used to identify the best match candidates [Castano, et al., 2001
and Doan, et al., 2000]. On the other hand, structural-level matching can discover
matching combinations of elements that appear together in a structure.
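As a minimal sketch of the pairwise comparison described above (not taken from any of the
cited systems), the following Python fragment compares every element of one schema with every
element of another and keeps the pairs above a threshold. The function names, the use of
difflib's SequenceMatcher ratio as the similarity metric, the threshold of 0.6, and the sample
element names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity in the range 0 to 1 based on the lower-cased names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_candidates(s1, s2, threshold=0.6, sim=name_similarity):
    """Compare each S1 element with each S2 element and keep pairs above the threshold."""
    candidates = []
    for e1 in s1:
        for e2 in s2:
            score = sim(e1, e2)
            if score > threshold:
                candidates.append((e1, e2, score))
    # Sort so that the best match candidates come first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical element names for two small schemas.
s1 = ["CustomerName", "EmpNo", "Address"]
s2 = ["customer_name", "employee_number", "addr"]
for e1, e2, score in match_candidates(s1, s2):
    print(f"{e1} ~ {e2}: {score:.2f}")
```

Note that a purely string-based metric such as this one does not pair EmpNo with
employee_number, which is one motivation for the linguistic techniques discussed next.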
Linguistic approaches are useful for schema-level matching. Two important categories of
approaches, name matching and description matching, are discussed in [Rahm and
Bernstein, 2001]. Name matching considers schema elements with equal or similar names.
The similarity of names can be defined and measured in various ways, including:
• Equality of names (the exact same names). An important sub-case is the
equality of names from the same XML namespace, which ensures that the same
names indeed bear the same semantics.
• Equality of canonical name representations after stemming and other
preprocessing. This is useful for dealing with special prefix/suffix symbols (e.g.,
CName → customer name and EmpNO → employee number).
• Equality of synonyms. For example, car can be matched to automobile. General
natural language dictionaries and domain-specific dictionaries are useful for dealing
with synonyms.
• Equality of hypernyms (the name of a class’s super-class). E.g., book is-a
publication and article is-a publication imply that book can be matched to
article.
• Similarity of names based on common substrings, edit distance, pronunciation,
soundex (an encoding of names based on how they sound rather than how they
are spelled), etc. [Bell and Sethi, 2001]. For example, representedBy can be
matched to representative, ShipTo can be matched to Ship2, and
Business-to-Business can be matched to B2B.
• User-provided name matches, such as reportsTo = manager and issue = bug.
Homonyms, which are equal or similar names that refer to different concepts, are an
exception that can mislead name matching. For example, the term “class” can have
different interpretations in different situations, e.g., a group of students or a lesson of a
course. Providing context information, such as the domain of discourse, can resolve or
reduce the ambiguity.
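Several of the name-similarity measures listed above can be sketched in a few lines of Python.
The fragment below is illustrative only: the lookup tables ABBREVIATIONS, SYNONYMS and
HYPERNYMS are hypothetical stand-ins for real dictionaries, the tokenising regular expression
is one possible choice, and the soundex function follows the commonly described American
Soundex rules rather than any variant used by the cited works.

```python
import re

# Hypothetical lookup tables; a real matcher would draw on general or
# domain-specific dictionaries instead.
ABBREVIATIONS = {"c": "customer", "emp": "employee", "no": "number"}
SYNONYMS = {frozenset({"car", "automobile"})}
HYPERNYMS = {"book": "publication", "article": "publication"}


def canonical(name):
    """Split a name at case changes and digits, then expand known abbreviations."""
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return " ".join(ABBREVIATIONS.get(t.lower(), t.lower()) for t in tokens)


def are_synonyms(a, b):
    """True if the two names are listed as synonyms of each other."""
    return frozenset({a.lower(), b.lower()}) in SYNONYMS


def share_hypernym(a, b):
    """True if both names have the same recorded super-class."""
    parent = HYPERNYMS.get(a.lower())
    return parent is not None and parent == HYPERNYMS.get(b.lower())


def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


SOUNDEX_DIGITS = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}


def soundex(name):
    """Four-character code that groups names by sound rather than spelling."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_DIGITS.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_DIGITS.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


print(canonical("CName"), "|", canonical("EmpNO"))
print(are_synonyms("car", "automobile"), share_hypernym("book", "article"))
print(edit_distance("ShipTo", "Ship2"), soundex("Robert"), soundex("Rupert"))
```

None of these functions addresses homonyms; distinguishing them would require context
information beyond the names themselves.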
Description matching uses the comments and descriptions provided along with the schemas,
which are usually written in natural language to express the intended semantics of schema
structures and elements. These can also be evaluated linguistically to determine the
similarity between schema elements. Both simple approaches, such as extracting keywords
from the descriptions, and sophisticated technologies, such as natural language
understanding, can be applied to look for semantically equivalent elements. For example,
the iMAP system considers the descriptions of elements in addition to other schema
information [Dhamankar, et al., 2004].
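The simple keyword-extraction approach mentioned above can be sketched as follows. This is
an assumption-laden illustration, not the method of iMAP or any other cited system: the
stop-word list is a small hypothetical sample, and the Jaccard coefficient is just one
possible way to compare two keyword sets.

```python
import re

# Small hypothetical stop-word list; real systems use much larger ones.
STOPWORDS = {"the", "of", "a", "an", "to", "for", "and", "in", "is", "that"}


def keywords(description):
    """Extract the set of lower-cased words that are not stop words."""
    words = re.findall(r"[a-z]+", description.lower())
    return {w for w in words if w not in STOPWORDS}


def description_similarity(d1, d2):
    """Jaccard coefficient of the two keyword sets, in the range 0 to 1."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


print(description_similarity("The name of the customer", "Full name for a customer"))
```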
Another category of schema matching methods uses the constraint information
contained in schemas to determine the similarity of schema elements [Larson, et al.,
1989]. The constraints include data types, value ranges, uniqueness, optionality,
relationship types, cardinalities, repeatability, references, etc. For example, similarity can
be based on the equivalence of data types and domains, of key characteristics (e.g.,
unique, primary, foreign), of relationship cardinality (e.g., 1:1 relationships), or of is-a
relationships.
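A constraint-based comparison might be sketched as below. The Column class, the
TYPE_FAMILIES table, and the choice to score similarity as the fraction of agreeing
constraints are all hypothetical; they are meant only to show how data types, key
characteristics, and optionality could contribute to a similarity value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical description of a schema element and some of its constraints."""
    name: str
    data_type: str
    is_key: bool = False
    nullable: bool = True


# Assumed grouping of concrete data types into compatible families.
TYPE_FAMILIES = {"int": "numeric", "decimal": "numeric", "float": "numeric",
                 "char": "text", "varchar": "text", "text": "text",
                 "date": "temporal"}


def constraint_similarity(c1, c2):
    """Fraction of the compared constraints on which the two elements agree."""
    family1 = TYPE_FAMILIES.get(c1.data_type, c1.data_type)
    family2 = TYPE_FAMILIES.get(c2.data_type, c2.data_type)
    checks = [
        family1 == family2,          # compatible data types
        c1.is_key == c2.is_key,      # same key characteristic
        c1.nullable == c2.nullable,  # same optionality
    ]
    return sum(checks) / len(checks)


emp_no = Column("EmpNo", "int", is_key=True, nullable=False)
employee_number = Column("employee_number", "decimal", is_key=True, nullable=False)
notes = Column("notes", "text")
print(constraint_similarity(emp_no, employee_number), constraint_similarity(emp_no, notes))
```

Constraints alone rarely identify a unique match, so a score of this kind is more naturally
used to prune or re-rank candidates found by other means.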
Rule-based matching techniques constitute another collection of schema matching
solutions [Madhavan, et al., 2001 and Melnik, et al., 2002]. Rule-based techniques
discover similar schema elements by exploiting schema-level information using
hand-crafted rules. For example, two elements match if they have the same name and the
same number of sub-elements. The rules can exploit any available schema information,
including element names, data types, structures, number of sub-elements, and integrity
constraints.
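The example rule above (same name and same number of sub-elements) can be written down
directly. The Element class and the rule functions below are hypothetical and serve only to
illustrate how hand-crafted rules might be expressed and combined; they are not drawn from
the cited systems.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element with a name and a list of sub-elements."""
    name: str
    children: list = field(default_factory=list)


def same_name(e1, e2):
    """Rule: the element names are equal, ignoring case."""
    return e1.name.lower() == e2.name.lower()


def same_number_of_subelements(e1, e2):
    """Rule: both elements have the same number of sub-elements."""
    return len(e1.children) == len(e2.children)


RULES = [same_name, same_number_of_subelements]


def rule_match(e1, e2, rules=RULES):
    """Two elements match only if every hand-crafted rule holds."""
    return all(rule(e1, e2) for rule in rules)


a = Element("Address", [Element("street"), Element("city")])
b = Element("address", [Element("line1"), Element("town")])
c = Element("address", [Element("full_text")])
print(rule_match(a, b), rule_match(a, c))
```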
Instance-level data can give important insight into the contents and meaning of schema
elements [Rahm and Bernstein, 2001]. When the useful schema information is limited or
the schemas are ambiguous, as is often the case for structured or semi-structured
information sources, the analysis of data instances becomes very helpful. Even when
substantial schema information is available, instance-level matching can be valuable
for uncovering incorrect interpretations of schema information. For example, it
can help disambiguate between equally plausible schema-level matches by choosing to
match the elements whose instances are more similar.
Many approaches used in schema-level matching can also be applied to instance-level
matching. For text elements, a linguistic characterization based on information retrieval
techniques is the preferred approach. This approach evaluates the similarity of two schema
elements by comparing the relative frequencies of words and combinations of words in their
data instances. For numerical data, a statistical characterization, such as numerical value
ranges, averages, or value patterns, can provide insight into the similarity of the
corresponding schema elements.
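Both characterizations can be illustrated with a short sketch. The code below assumes that
cosine similarity over relative word frequencies is an acceptable stand-in for the
information retrieval techniques mentioned above, and that the overlap of value ranges is
an acceptable stand-in for a statistical characterization; the function names and sample
data are hypothetical.

```python
import math
import re
from collections import Counter


def word_profile(values):
    """Relative frequency of each word across the data instances of an element."""
    counts = Counter(w for v in values for w in re.findall(r"\w+", v.lower()))
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def cosine(p, q):
    """Cosine similarity of two word-frequency profiles."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = (math.sqrt(sum(x * x for x in p.values()))
            * math.sqrt(sum(x * x for x in q.values())))
    return dot / norm if norm else 0.0


def range_overlap(xs, ys):
    """Length of the shared value range divided by the length of the combined range."""
    low, high = max(min(xs), min(ys)), min(max(xs), max(ys))
    span = max(max(xs), max(ys)) - min(min(xs), min(ys))
    if span == 0:
        return 1.0
    return max(0.0, high - low) / span


cities_a = word_profile(["New York", "New Jersey"])
cities_b = word_profile(["new york", "new orleans"])
print(cosine(cities_a, cities_b))
print(range_overlap([0, 10], [5, 15]))
```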
Various approaches have been proposed to perform instance-level matching, such as
rules, neural networks, and machine learning techniques [Berlin and Motro, 2001; Doan,
et al., 2000; Li and Clifton, 1994; Li, et al., 2000]. Learning-based approaches can exploit
data instance-level information. For example, Doan et al. proposed the LSD system,
which employs the Naive Bayes learning method over data instances [Doan, et al., 2001].
By analyzing data instances, the Naive Bayes method can construct probabilistic rules
that find similarities between schema elements whose names do not reveal enough
similarity clues. Note that although learning-based approaches are classified
as instance-level approaches, they can also utilize schema-level information.
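To make the idea of learning over data instances concrete, the following is a small
multinomial Naive Bayes classifier with add-one smoothing. It is a generic textbook sketch
and not the LSD implementation: the class name, the tokenisation (in which every digit is
replaced by 9 so that values with the same numeric shape share tokens), and the training data
are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(value):
    """Lower-case word tokens; digits are replaced by 9 to capture value shape."""
    return re.findall(r"[a-z]+|9+", re.sub(r"\d", "9", value.lower()))


class NaiveBayesMatcher:
    """Learns which schema element a set of data instances most likely belongs to."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.instance_counts = Counter()          # label -> number of instances
        self.vocabulary = set()

    def train(self, label, values):
        for value in values:
            self.instance_counts[label] += 1
            for t in tokens(value):
                self.token_counts[label][t] += 1
                self.vocabulary.add(t)

    def predict(self, values):
        """Return the label with the highest log-probability for the given instances."""
        total = sum(self.instance_counts.values())
        scores = {}
        for label, n in self.instance_counts.items():
            counts = self.token_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n / total)  # prior of the label
            for value in values:
                for t in tokens(value):
                    # Add-one (Laplace) smoothing for unseen tokens.
                    score += math.log((counts[t] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


matcher = NaiveBayesMatcher()
matcher.train("address", ["12 Main Street", "98 Oak Avenue Seattle"])
matcher.train("phone", ["555-1234", "(206) 555-9876"])
print(matcher.predict(["47 Pine Street"]), matcher.predict(["555-0000"]))
```

In this sketch, a column of an unmatched schema is assigned to the trained element whose
instances it most resembles, even when the column names share no similarity.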
Since each matching approach has a specific applicability for a given match task, a
matcher that uses just one single approach is unlikely to achieve as many good match
candidates as one that combines several approaches [Rahm and Bernstein, 2001].
Therefore, hybrid approaches have been proposed. They are of two kinds: hybrid matchers,
which integrate multiple matching approaches based on multiple criteria or information
sources (e.g., by using name matching with namespaces and thesauri combined with data
type compatibility), and composite matchers, which combine the results of independently
executed matchers, including hybrid matchers.
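A composite matcher of the kind described above can be sketched as a function that combines
the scores of independently executed matchers. The weighted average used here is only one
possible combination strategy, and the two component matchers, the weights, and the
dictionary-based element representation are hypothetical.

```python
from difflib import SequenceMatcher


def name_matcher(e1, e2):
    """Independent matcher based on string similarity of the element names."""
    return SequenceMatcher(None, e1["name"].lower(), e2["name"].lower()).ratio()


def type_matcher(e1, e2):
    """Independent matcher based on equality of the declared data types."""
    return 1.0 if e1["type"] == e2["type"] else 0.0


def composite(matchers, weights):
    """Build a matcher that returns the weighted average of the given matchers."""
    total = sum(weights)

    def combined(e1, e2):
        return sum(w * m(e1, e2) for m, w in zip(matchers, weights)) / total

    return combined


matcher = composite([name_matcher, type_matcher], weights=[0.7, 0.3])
price_a = {"name": "price", "type": "decimal"}
price_b = {"name": "Price", "type": "int"}
print(matcher(price_a, price_b))
```

Because each component is executed independently, matchers can be added, removed, or
re-weighted without changing the others.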
One important issue is that it is impossible to determine, fully automatically, all
matches between two schemas, primarily because most schemas have some semantics
that affect the matching criteria but are not formally expressed and are often not even
documented [Rahm and Bernstein, 2001]. Therefore, the result of the match operation is
only a set of match candidates, which can be accepted, rejected, or modified by the user.
Furthermore, the user should be able to specify meaningful matches that the system
fails to discover.
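The role of the user can be illustrated with a very small sketch in which the system's match
candidates are filtered and extended by user decisions. The function name and the
representation of a match as a pair of element names are assumptions; modifying a candidate
is modelled here simply as rejecting it and adding a corrected pair.

```python
def apply_review(candidates, rejected=(), added=()):
    """Drop the candidates the user rejected and append the matches the user added."""
    rejected = set(rejected)
    kept = [pair for pair in candidates if pair not in rejected]
    extra = [pair for pair in added if pair not in kept]
    return kept + extra


# Hypothetical system output and user decisions.
candidates = [("CName", "customer_name"), ("class", "lesson"), ("EmpNO", "emp_number")]
final = apply_review(
    candidates,
    rejected=[("class", "lesson")],    # homonym: here "class" means a group of students
    added=[("class", "student_group"), ("reportsTo", "manager")],
)
print(final)
```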
Source: Ontological View-driven Semantic Integration in Open Environments (pages 31-35)