FUZZYCONSTRAINT-BASED SCHEMA MATCHING FORMULATION
ALSAYED ALGERGAWY, EIKE SCHALLEHN, AND GUNTER SAAKE
∗
Abstrat. The deepWeb has manyhallenges to be solved. Among themis shemamathing. Inthispaper, webuild
aoneptual onnetionbetween the shemamathingproblem SMPand the fuzzyonstraint optimizationproblem FCOP.In
partiular,weproposethe useofthe fuzzyonstraintoptimizationproblemas aframeworktomodeland formalizethe shema
mathingproblem. Byformalizingthe SMP asa FCOP,wegain manybenets. First,weouldexpress itas a ombinatorial
optimizationproblem witha set of softonstraints whihare able toopewithunertainty inshema mathing. Seond, the
atualalgorithmsolutionbeomesindependentoftheonretegraphmodel,allowingustohangethemodelwithoutaetingthe
algorithmbyintroduinganewlevelofabstration. Moreover,weoulddisoveromplexmatheseasily. Finally,weouldmake
atrade-obetweenshemamathingperformaneaspets.
Keywords: shemamathing,onstraintprogramming,fuzzyonstraints,objetivefuntion
1. Introdution. Thedeep Web (alsoknownasDeepnet orthehidden Web)refersto the WorldWide
Web ontent that is not a part of the surfae Web. It is estimated that the deep Web is several orders of
magnitude largerthan the surfaeWeb [4℄. As the numberof deep Web soureshas been inreasingas the
eorts needed to enable users to explore and integrate these soures beome essential. As a result software
systemshavebeendevelopedtoopenthedeepWebtousers.Shemamathing istheoretaskofthesesystems.
Shema mathing is the task of identifying semanti orrespondenes among elements of two or more
shemas. It plays a entral role in many data appliation senarios[22, 17℄: in data integration, to identify
and haraterize inter-shema relationships between multiple (heterogeneous) shemas; in data warehousing,
to map data souresto awarehouse shema; in E-business, to help to map messagesbetweendierentXML
formats; in the Semanti Web, to establishsemantiorrespondenesbetweenonepts of dierent web sites
ontologies;andin datamigration, tomigratelegaydatafrommultiple souresinto anewone[10℄.
Duetotheomplexityofshemamathing,itwasmostlyperformedmanuallybyahumanexpert. However,
manualreoniliation tendstobeaslowand ineientproessespeially in large-saleand dynami
environ-ments. Therefore,theneedforautomatishemamathinghasbeomeessential. Consequently,manyshema
mathingsystemshavebeendevelopedforautomatingtheproess,suhasCupid[17℄,COMA/COMA++[6,1℄,
LSD[8℄, SimilarityFlooding[20℄,OntoBuilder [13℄, QOM[12℄, BTreeMath[11℄, S-Math[14℄,and Spiy [3℄.
Manualsemantimathingoveromesmismatheswhihexistinelementnamesandalsodierentiatesbetween
dierenesofdomains. Hene,weouldassumethatmanualmathingisaperfetproess. Ontheotherhand,
automatimathingmayarrywithitadegreeofunertainty,asitisbasedonsyntati,ratherthansemanti,
means. Furthermore, reently, there has been renewedinterestin building database systemsthat handle
un-ertaindatain aprinipledway[9℄. Heneashortrantabouttherelationshipbetweendatabasesthatmanage
unertaintyanddataintegrationsystemsappears. Therefore,weshould surfforasuitablemodelwhihisable
tomeettheaboverequirements.
A rst step in disovering an eetive and eient way to solve any diult problem suh as shema
mathingistoonstrutaompleteproblemspeiation. Asuitableandpreisedenitionofshemamathing
is essentialfor investigating approahes to solve it. Shema mathing has been extensively researhed, and
manymathingsystemshavebeendeveloped. Someofthese systemsarerule-based[6,17,20℄ andothers are
learning-based[16,7,8℄. However,formalspeiationsofproblemsbeingsolvedbythesesystemsdonotexist,
orarepartial. Littleworkisdonetowardsshemamathingproblemformulatione.g. in[25,23℄.
Inthe rule-basedapproahes, agraph isused to desribethestate of amodeled systemat a giventime,
and graph rules are used to desribe the operations on the system's state. As a onsequene in pratie,
using graph rules has a worst ase omplexity whih is exponential to the size of the graph. Of ourse, an
algorithm of exponential time omplexity is unaeptable for serious system implementation. In general, to
ahieveaeptable performaneitis inevitableto onsequentlyexploit thespeialpropertiesof bothshemas
tobemathed. Besidethat,thereisastrikingommonalityinallrule-basedapproahes;theyareallbasedon
baktraking paradigms. Knowingthat the overwhelmingmajority of theoretialas well asempirial studies
ontheoptimization of baktrakingalgorithmsis basedonthe ontextof onstraint problem(CP),it isnear
∗
to hand to openthis knowledgebase forshema mathing algorithms byreformulatingthe shema mathing
problemasaCP[24,18,5℄.
Tosummarize,weareinaneedtoaframeworkwhihisabletofae thefollowinghallenges:
1. formalizing the shema mathing problem: Althoughmanymathing systemshavebeendeveloped to
solvetheshemamathingproblem,butnoompleteworktoaddresstheformulationproblem. Shema
mathingresearhmostlyfousesonhowwellshemamathingsystemsreognizeorrespondenes. On
theotherhand,notenoughresearhhasbeendoneonformalbasisoftheshemamathing problem.
2. trading-o between shema mathing performane aspets: The performane of a shema mathing
systemomprisestwoequallyimportantfators;namelymathingeetivenessandmathingeieny.
The eetiveness is onerned with the auray and the orretness of the math result while the
eienyisonernedwiththesystemresouressuhastheresponsetimeofthemathsystem. Reent
shemamathingsystemsreportonsiderableeetiveness[6℄,however,theeienyaspetsremaina
missingareaandrepresentanopenhallengefortheshemamathingommunity. Improvingshema
mathingeienyresultsindereasingmathingeetiveness,soatrade-obetweenthetwoaspets
shouldbeonsidered.
3. dealing with unertainty of shema mathing: Shema mathing systems should be able to handle
unertaintyarisesduringthemathingproessfromdierentsoures. Reently,therehasbeenrenewed
interestinbuildingdatabasesystemsthathandleunertaindataanditslineageinaprinipledway,so
ashort rantaboutthe relationshipbetweendatabasesthatmanage unertaintyand lineageand data
integrationsystemsappears. Inadditionto, inordertofullyautomatethemathingproess,wemake
useofextratortoolswhihextratdierentdatamodelsandrepresentthemasaommonmodel. The
extrationproessbringserrorsandunertaintiestothemathingproess
Inthispaper,webuildaoneptualonnetionbetweenthe shemamathing problem(SMP) andthe fuzzy
onstraintoptimization problem (FCOP). Ononehand,weonsider shemamathing asanewappliation of
fuzzyonstraints;ontheother hand,weproposetheuseofthefuzzyonstraintsatisfationproblemasanew
approah forshemamathing. Inpartiular,in thispaper,weproposetheuseoftheFCOPto formulatethe
SMP.However,ourapproahshouldbegeneri,i.e. havetheabilitytoopewithdierentdatamodelsandbe
usedfordierentappliationdomains. Therefore,wersttransformshemastobemathedintoaommondata
modelalledrootedlabeledgraphs. Thenwereformulatethegraphmathingproblemasaonstraintproblem.
Therearemanybenetsbehindthisformulation. First,wegaindiretaesstotherihresearhndingsinthe
CP area;insteadof inventingnew algorithmsfor graphmathing fromsrath. Seond, theatual algorithm
solutionbeomesindependentof theonrete graphmodel, allowingusto hangethemodelwithoutaeting
the algorithm by introduing a new level of abstration. Third, formalizing the SMP asa FCOP failitates
handlingunertaintyin theshema mathingproess. Finally,weould simplydealwithsimpleandomplex
mappings.
The paperis organizedasfollows: Setion 2introduesneessarypreliminaries. Ourframework to unify
shemamathing is presentedin Setion 3inorder to illustratethe sopeof thispaper. Setion4showshow
toformulate theshemamathingproblemasaonstraintproblem. Setion5desribestherelatedwork. The
onludingremarksandongoingfuture workarepresentedin Setion6.
2. Preliminaries. Thispaperisbasedmainlyontwoexistingbodiesofresearh,namelygraphtheory[2℄
andonstraintprogramming[24,18,5℄. Tokeepthispaperself-ontained,webrieypresentinthissetionthe
basioneptsofthem.
2.1. Graph Model. Ashemaisthedesriptionofthestrutureandtheontentofamodelandonsists
of a set of related elements suh as tables, olumns, lasses, or XML elements and attributes. There are
manykindsofdatamodels,suhasrelationalmodel,objet-orientedmodel,ERmodel,XMLshema,et. By
shemastrutureand shemaontent,wemeanitsshema-basedpropertiesanditsinstane-basedproperties,
respetively. In this subsetionwepresentformally rooted (multi-)labeled direted graphs used to represent
shemastobemathedastheinternalommonmodel.
A rooted labeled graphis a direted graphsuh that nodes and edges are assoiated with labels, and in
whihonenodeislabeledin aspeialwaytodistinguishitfrom thegraph'sothernodes. This speialnodeis
alledtherootofthegraph. Withoutlossofgenerality,weshallassumethateverynodeandedgeisassoiated
Definition2.1. ARootedLabeledGraphGisa6-tuple G=(N
G
,EG
,LabG
,sr, tar,l) where:•
N
G
=
{n
root
, n
2
, . . . , n
n}
isanitesetofnodes,eahofthemisuniquelyidentiedbyanobjetidentier (OID),wheren
root
isthe graphroot.•
E
G
=
{
(
n
i
, n
j
)
|n
i
, n
j
∈
N
G}
is anite set of edges, eah edge represents the relationship between two nodes.•
LabG
={LabN G
, LabEG
}is anite set of node labels LabN G
,anda nite setof edge labelsLabEG
.Theselabelsarestrings for desribingthe properties(features)of nodesandedges.
•
sr andtar:EG
7→
NG
aretwo mappings(soureandtarget), assigning asoure andatarget node to eahedge(i. e. ife
= (
ni, nj
)
thensrc
(
e
) =
ni
andtar
(
e
) =
nj
).•
l
:
NG
∪
EG
7→
LabG
is amapping labelassigning alabel from the givenLabG
toeah node andeah edge.• |NG|
=
n
isthe graphsize.Nowthatwehavedenedaonretegraphmodel,inthefollowingsubsetionwepresentbasisofonstraint
programming.
2.2. Constraint Programming. Manyproblemsin omputersiene,mostnotablyin Artiial
Intelli-gene,anbeinterpreted asspeial asesofonstraintproblems. Semanti shema mathing is also an
intel-ligene proess whih aims atmimiking the behavior of humans in nding semanti orrespondenes between
shemas' elements. Therefore, onstraintprogrammingis asuitable sheme torepresent the shema mathing
problem.
Constraintprogrammingis ageneriframeworkfor delarativedesriptionandeetivesolvingforlarge,
partiularly ombinatorial, problems. Not only it is based on a strong theoretial foundation but also it is
attratingwidespreadommerialinterestaswell,inpartiular,inareasofmodelingheterogeneousoptimization
andsatisfationproblems. We,here,onentrateonlyononstraintsatisfationproblems(CSPs)and present
denitionsforCSPs,onstraints,and solutionsfortheCSPs.
Definition2.2. AConstraintSatisfationProblem Pisdenedby a3-tupleP=(X,D,C)where,
•
X
=
{x
1
, x
2
, . . . , xn}
is anitesetof variables.•
D
=
{D
1
, D
2
, . . . , Dn}
is a olletion of nite domains. Eah domain Di
is the set ontaining the possiblevalues forthe orresponding variable xi
∈
X
.•
C
=
{C
1
, C
2
, . . . , Cm}
isaset ofonstraintsonthe variables of X.Definition 2.3. AConstraintC
s
on aset of variablesS
=
{x
1
, x
2
, . . . xr}
isa pairCs
= (
S, Rs
)
, where Rs
isasubset onthe produtofthese variables' domains:Rs
⊆
D
1
× · · · ×
Dr
→ {
0
,
1
}
.Thenumber
r
ofvariablesaonstraintisdeneduponisalledarityoftheonstraint. Thesimplesttypeis theunaryonstraint,whihrestritsthevalueofasinglevariable. Ofspeialinterestaretheonstraintsofaritytwo,alledbinaryonstraints. Aonstraintthatisdenedonmorethantwovariablesisalledaglobalonstraint.
Solving a CSP is nding assignments of values from the respetive domains to the variables so that all
onstraintsaresatised.
Definition2.4. (SolutionofaCSP)Anassignment
Λ
isasolutionofaCSPifitsatisesalltheonstraintsofthe problem,wherethe assignment
Λ
denotes anassignmentof eahvariablex
i
with theorrespondingvaluea
i
suhthatx
i
∈
X
anda
i
∈
D
i
.Example1. (MapColoring)Wewantto olortheregionsofamap, shown inFig.2.1, in awaythat no
twoadjaentregionshavethesameolor. Theatualproblemisthatonlyaertainlimitednumberofolorsis
available. Let'swehavefourregionsandonlythreeolors. Wenowformulatethisproblemas
CSP
= (
X, D, C
)
where:•
X
=
{x
1
, x
2
, x
3
, x
4
}
representsthefourregions,•
D
=
{D
1
, D
2
, D
3
, D
4
}
represents the domains of the variables suh thatD
1
=
D
2
=
D
3
=
D
4
=
{red, green, blue}
,and•
C
=
{C
(
x
1
,x
2)
, C
(
x
1
,x
3)
, C
(
x
1
,x
4)
, C
(
x
2
,x
4)
, C
(
x
3
,x
4)
}
represents the set of onstraints must be satised suh thatC
(
xi,xj
)
=
{
(
vi, vj
)
∈
Di
×
Dj|vi
6
=
vj}
.AsshowninExample1,thereareanumberofsolutionstothespeiedCSP.Anyoneofthemisonsidered
asolutionto theproblem. However,in theshemamathingeld, wedonotonlysearh foranysolutionbut
alsothebest one. Thequalityofsolutionis usuallymeasuredbyanappliationdependentfuntionalled the
Fig.2.1. Mapoloringexample
Definition 2.5. A Constraint Optimization Problem Q is dened by ouple Q =(P,g) suhthat Pis a
CSP and
g
:
D
1
× · · · ×
Dn
→
[0
,
1]
isanobjetive funtion thatmaps eah solutiontuple intoavalue.Example2. (TravelingSalesman)Thetravelingsalesmanproblem is tond theshortestlosed pathby
whihityoutofasetof
n
itiesisvisitedoneandonlyone.Whilepowerful,bothCSP andCOPpresentsomelimitations. Inpartiular,allonstraintsareonsidered
mandatory. Inmanyreal-worldproblems,suhastheshemamathingproblem,thereareonstraintsthatould
beviolatedin solutionswithoutausingsuh solutionsto beunaeptable. If these onstraintsare treatedas
mandatory,thisoftenausesproblemstobeunsolved. Iftheseonstraintsareignored,solutionsofbadquality
arefound.ThisisamotivationtoextendtheCSPshemeandmakeuseofsoftonstraints. Awaytoirumvent
inonsistent onstraintsproblems is to make them fuzzy [15℄. Theidea is to assoiatefuzzy values with the
elementsoftheonstraints,andombine theminareasonableway.
A onstrain, as dened before, is usually dened as apair onsisting of a set of variables and arelation
on these variables. This denition givesus the availability to model dierent typesof unertainty in shema
mathing. In[9℄,authorsidentifydierentsouresforunertaintyindataintegration. Unertaintyinsemanti
mappingsbetweendatasouresanbemodeledbyexploitingfuzzyrelationswhileothersouresofunertainty
anbemodeled bymakingthevariablesetafuzzyset. Inthispaper,wetaketherstoneintoaountwhile
theothersouresareleft forourongoingwork.
Definition 2.6. (Fuzzy Constraint) A Fuzzy Constraint
Cµ
on a set of variablesS
=
{x
1
, x
2
, . . . , xr}
is a pairCµ
= (
S, Rµ
)
, where the fuzzy relationRµ
, dened byµR
:
Q
xi
∈
var
(
C
)
Di
7→
[0
,
1]
whereµR
is the membership funtion indiating towhatextent atuplev satisesCµ
.•
µ
R
(
v
) = 1
means vtotallysatisesC
µ
,•
µ
R
(
v
) = 0
means vtotallyviolatesC
µ
,while•
0
< µR
(
v
)
<
1
meansvpartiallysatisesCµ
.Definition 2.7. A FuzzyConstraintC
µ
on asetof variablesS
=
{x
1
, x
2
, . . . xr}
isapairCµ
= (
S, Rµ
)
, where the fuzzy relation Rµ
, dened byµR
:
Q
xi
∈
var
(
C
)
Di
→
[0
,
1]
whereµR
is the membership funtion indiating towhat extentatuple vsatisesCµ
.•
µR
(
v
) = 1
means vtotaly satisesCµ
,•
µR
(
v
) = 0
means vtotaly violatesCµ
,while•
0
< µR
(
v
)
<
1
meansvpartiallysatisesCµ
.Definition 2.8. A FuzzyConstraintOptimization Problem Q
µ
isa4-tuple Qµ
=(X,D, Cµ
, g)where X isalist of variables,D isa listof domains ofpossible valuesfor the variables, Cµ
isalist of fuzzyonstraintseah ofthemreferringtosomeof thegiven variables, andgisan objetivefuntion tobeoptimized.
In thefollowing setion we shed the light on our shema mathing framework to determine thesope of
shemamathingunderstanding.
3. A uniedshemamathingframework. Eahoftheexistingshemamathingsystemsdealswith
theshemamathingproblem fromitspointof view. As aresulttheneedto ageneriframework thatunies
thesolutionofthisintriateproblemindependentonthedomainofshemastobemathedandindependenton
themodelrepresentationsbeomesessential.Tothisend,wesuggestthefollowinggeneralphasestoaddressthe
shemamathing problem. Figure2showsthese phaseswith themainsopeofthis paper. Thefour dierent
phasesare:
Fig.3.1. MathingProess Phases
•
identifyingelementstobemathed;Pr-mathing Phase,•
applyingthemathingalgorithms;Mathing Phase,and•
exportingthemathresult;MapTrans Phase.Inthefollowingsubsetionweintrodueaframeworkfordeningdierentdatamodelsandhowtotransform
themintoshemagraphs. Thispartfollowsthesameproedurefoundin[25℄toshowthatdierentdatamodels
ouldberepresentedbyshemagraphs.
3.1. ShemaGraph. Tomakethemathingproessamoregeneriproess,shemastobemathedshould
berepresentedinternally byaommonrepresentation. Thisuniform representationreduestheomplexityof
themathing proess by nothavingto opewith dierent representations. Bydeveloping suh import tools,
shemamathimplementationanbeappliedtoshemasofanydatamodelsuhasSQL,XML,UML,andet.
Therefore,therststepin ourapproahistotransformshemasto bemathedintoaommonmodelin order
toapplymathingalgorithms. Wemakeuseofrootedlabeledgraphsastheinternalmodel. Weallthisphase
TransMat; TransformationforMathingproess.
Ingeneral,to representshemasanddatainstanes, startingfromtheroot,theshemaispartitionedinto
relationsandfurther downinto attributesand instanes. Inpartiular, to representrelationalshemas,XML
shemas,et. asrooted labeledgraphs, independently ofthespei soureformat,webenet from therules
foundin[25,21℄. Theserulesarerewrittenasfollows:
•
Every preparedmathing objet in a shemasuh as theshema, relations, elements, attributes et.is representedby anode, suh that theshemaitself is represented bythe root node. Let shema
S
onsistofm
elements(elem
),then∀
elem∈
S
∃
ni
∈
NG
∧
S7→
n
root
, s.t.
1
≤
i
≤
m
•
The features of the prepared mathing objet are represented by node labelsLabN G
. Let features (f eatS
)bethepropertysetofanelement(elem
),then∀
feat∈
featS∃
Lab∈
LabN G
•
Therelationshipbetweentwopreparedmathingobjetsisrepresentedbyanedge. Lettherelationshipsbetweenshemaelementsbe(
relS
),then∀
rel∈
relS∃
e(ni
,nj
)∈
EG
s. t.src
(
e
) =
ni
∈
NG
∧
tar
(
e
) =
nj
∈
NG
•
ThepropertiesoftherelationshipbetweenpreparedobjetsarerepresentedbyedgelabelsLabEG
. Let featuresrf eatS
bethepropertyset ofarelationshiprel
,then,(a)Tworelationalshemas (b)Shemagraphs
Fig.3.2. TwoRelationalShemas&theirShemaGraphs(withoutlabels)
(a)TwoXMLshemas (b)Shemagraphs
Fig.3.3.TwoXMLshemas&theirshemagraphs(withoutlabels
Thefollowingtwoexamplesillustratethathowtheserulesanbeappliedtodierentdatamodelsinorder
tomakeourapproahamoregeneriapproah.
Example3.(Relational DatabaseShemas)ConsidershemasSandTdepitedinFig.3.2(a)(from[20℄).
TheelementsofS andT aretablesand attributes. Applyingtheaboverules, thetwoshemasShemaS and
Shema T are representedby SG1 and SG2 respetively, suh that SG1
= (
NGS, EGS,
LabGS,
srS
,
tarS, lS
)
, whereNGS
=
{n
1
S, n
2
S, n
3
S
, n
4
S
, n
5
S
, n
6
S},
EGS
=
{e
1
−
2
, e
2
−
3
, e
2
−
4
, e
2
−
5
, e
2
−
6
},
Lab
GS
=
LabN S
∪
LabES
=
{
name,
type,
datatype} ∪ {
part-of,
assoiate},
sr
S
,tarS
,lS
aremappingssuhthatsrS
(
e
1
−
2
) =
n
1
S
,tarS
(
e
2
−
3
) =
n
3
S
andl
S
(
e
1
−
2
) =
part-of. Figure3.2(b) showsonlythenodesandedgesof theshemagraphs(SG2 anbedened similarly).Inthisexample,weexploit dierentfeaturesofmathingobjetssuhasname,datatype,andtype. These
features are represented as nodes' labels. These features shall be the input parameters to the next phase.
For example, the nameof a mathing objetin
SG
1
will be used to measure linguisti similarity between it and another mathing objet fromSG
2
, its datatype is to measure datatype ompatibility, and its type is used to determine semanti relationships. However, our approah is exible in the sense that it is able toexploit more features as needed. Moreover, in this example, we exploit one strutural feature part-of to
represent strutural relationships between nodes at dierent levels. Other strutural features e.g.
assoia-tion relationship, that is a strutural relationship speifying both nodes are oneptually at the same level,
is represented between keys. One assoiation relationship is represented in Fig. 3.2(b) between the nodes
n
6
T
andn
9
T
to speify a key/foreign key relation. Visually, assoiation edges are represented as dashed lines.Example 4. (XMLShemas)This examplethat wedisuss illustrates howour uniedshema mathing
framework opes with dierent hoies of the models to be mathed. Now onsider two XML shemas in
Fig.3.3(a)(from[25℄). TheshemasarespeiedusingtheXMLlanguagedeployedonthewebsitebiztalk.org
designedforeletronidoumentsusedin e-business. Theshemagraphs(withoutlabels)oftheseshemasare
shownin Fig.3.3(b). ThelabelsofnodesandedgesarethesameasExample3.
Examples 3and 4illustrate that using Trans-Matphase aims atmathing dierentshema models. The
mathing algorithm (Mathing Phase) does not have to deal with a large number of dierent models. The
weextendtheinternalrepresentation,shemagraphs,andreformulatethegraphmathingproblemasafuzzy
onstraintoptimizationproblem.
4. Shema Mathingas a FCOP.
4.1. Shema Mathing as Graph Mathing. Shemas to be mathed are transformed into rooted
labeled graphs and, hene, the shema mathing problem is onverted into graph mathing. There are two
typesofgraphmathing: graphisomorphism andgraph homomorphism. Ingeneral,amathofonegraphinto
anotherisgivenbyagraphmorphism,whihisamappingofonegraph'sobjetsetsintotheother's,withsome
restritionstopreservethegraph'sstrutureanditstypinginformation.
Definition4.1. AGraph Morphism
φ
:
SG
1
→
SG
2
between twoshema graphsSG
1 = (
NGS, EGS, LabGS, srcS
, tarS, lS
)
andSG
2 = (
NGT
, EGT
, LabGT
, srcT
, tarT
, lT
)
is a pair of mappings
φ
= (
φ
N
, φ
E
)
suh thatφ
N
:
N
GS
→
N
GT
(φ
N
is a node mapping funtion) andφE
:
EGS
→
EGT
(φE
isanedgemapping funtion)and thefollowing restritions apply: 1.∀n
∈
N
GS
∃
l
S
(
n
) =
l
T
(
φ
N
(
n
))
2.
∀e
∈
E
GS
∃
l
S
(
e
) =
l
T
(
φ
E
(
e
))
3.∀e
∈
E
GS
∃
a pathp
′
∈
N
GT
×
E
GT
suh thatp
′
=
φ
E
(
e
)
andφ
N
(
src
S
(
e
)) =
src
T
(
φ
E
(
e
))
∧
φN
(
tarS
(
e
)) =
tarT
(
φE
(
e
))
.The rst two onditions preserve both nodes and edges labeling information, while the third ondition
preservesgraph'sstruture. Graphmathingisanisomorphimathingproblemwhen
|N
GS|
=
|N
GT
|
otherwise itishomomorphi. Obviously,theshemamathingproblemis ahomomorphiproblem.Example5. ForthetworelationalshemasdepitedinFig.3.2(a)anditsassoiatedshemagraphsshown
inFig.3.2(b),theshemamathingproblembetweenshema
S
andshemaT
isonvertedintoahomomorphi graphmathingproblem betweenSG
1
andSG
2
.Graphmathingisonsideredtobeoneofthemostomplexproblemsinomputersiene. Itsomplexity
is due to two major problems. The rst problem is the omputational omplexity of graph mathing. The
timerequiredbybaktrakingin searhtree algorithmsmayin theworst asebeome exponentialin thesize
of thegraph. Graph homomorphismhas been provento be NP-omplete problem [19℄. Theseond problem
is thefat that allof thealgorithms for graphmathing mentionedso faran onlybeapplied to twographs
atatime. Therefore, ifthereare morethantwoshemasthat mustbemathed, thenthe onventionalgraph
mathingalgorithms mustbeappliedto eahpairsequentially. Forappliationsdealingwithlargedatabases,
this may be prohibitive. Hene, hoosing graphmathing asplatform to solvethe shema mathing problem
may be eetive proess but ineient. Therefore, we propose transforming graph homomorphism into a
FCOP.
Nowthatwehavedened agraphmodelanditshomomorphism,letusonsiderhowtoonstrutaFCOP
outofagivengraphmathingproblem.
4.2. Graph Mathingas a FCOP. Intheshemamathing problem, wearetryingtondamapping
between the elements of twoshemas. Multiple onditions should be applied to make these mappings valid
solutionstothemathingproblem,andsomeobjetivefuntionsaretobeoptimizedtoseletthebestmappings
amongmathingresult. Theanalogytoonstraintproblemisquiteobvious: herewemakeamappingbetween
twosets, namely between aset of variables anda set of domains, where someonditions should be satised.
Sobasially, whatwehavetodo toobtainanequivalentonstraintproblem CPfor agivenshemamathing
problem(knowingthatshemastobemathedaretransformedintoshemagraphs)are:
1. takeobjetsofoneshemagraphtobemathedastheCP'ssetofvariables,
2. takeobjetsofothershemagraphstobemathedasthevariables'domain,
3. ndapropertranslationoftheonditionsthatapplytoshemamathingintoasetoffuzzyonstraints,
and
4. formobjetivefuntionstobeoptimized.
Wehavedened theshema mathing problem asagraphmathing homomorphism
φ
. We nowproeed by formalizing the problemφ
as a FCOP problemQµ
= (
X, D, Cµ, g
)
. To onstrut a FCOP out of this problem, wefollowtheaboverules. Throughthese rules,wetakethetworelationaldatabaseshemasshownin Fig.3.2(a) and itsassoiatedshema graphsshown in Fig.3.2(b) asanexample, takinginto aountthat
•
The set of variablesX
is given byX
=
NGS
S
EGS
where the variables fromNGS
are alled node variablesX
N
andfromE
GS
areallededgevariablesX
E
X
=
X
N
S
X
E
=
{x
n
1
, x
n
2
, x
n
3
, x
n
4
, x
n
5
, x
n
6
}
S
{x
e
1
−
2
, x
e
2
−
3
, x
e
2
−
4
, x
e
2
−
5
, x
e
2
−
6
}
•
The set of domainD
is given byD
=
N
GT
S
E
GT
, where the domains fromN
GT
are alled node domainsD
N
andfromE
GT
areallededgedomainsD
E
,=
{Dn
1
, Dn
2
, Dn
3
, Dn
4
, Dn
5
, Dn
6
}
S
{De
1
−
2
, De
2
−
3
, De
2
−
4
, De
2
−
5
, De
2
−
6
}
whereDn
1
=
Dn
2
=
Dn
3
=
Dn
4
=
Dn
5
=
Dn
6
=
{n
1
T
, n
2
T
, n
3
T
, n
4
T
, n
5
T
, n
6
T
, n
7
T
, n
8
T
, n
9
T
, n
10
T
}
(i.e. thenodedomainontainsalltheseondshema graphnodes)andDe
1
−
2
=
De
2
−
3
=
De
2
−
4
=
De
2
−
5
=
De
2
−
6
=
{e
1
−
2
T
, e
1
−
3
T
, e
2
−
4
T
, . . . , p
1
−
2
−
4
T
, . . .
}
(i.e. theedgedomainontainsalltheavailableedgesandpaths in theseondshemagraph)(the edgee
1
−
2
reads theedgeextends betweenthetwonodesn
1
andn
2
suhthate
1
−
2
=
e
(
n
1
, n
2
)
).Usingthisformalizationenablesustodealwithholistimathing. Thisanbeahievedbytakingtheobjets
ofoneshemaasthevariable set,while theobjetsof othershemasasthevariable's domain. Letwehave
n
shemaswhiharetransformedintoshemagraphsSG
1
, SG
2
, . . . , SGn
thenX
=
XN
S
XE
,DN
=
P
n
i
=2
DN i
,DE
=
P
n
i
=2
DEi
. Anotherbenetbehindthisapproahisthatourapproahisabletodisoveromplexmathes oftypes 1:n andn:1 veryeasily. This anbeahievedbyallowingavaluemayhavemultiple valuesfrom itsorrespondingdomainandavaluemaybeassignedtomultiple variables.
Inthefollowingsubsetions,wedemonstratehowtoonstrutbothonstraintsandobjetivefuntionsin
ordertoobtainaompleteproblemdenition.
4.3. Constraint Constrution. Theexploited onstraintsshould reet thegoalsofshemamathing.
Shemamathingbasedonlyonshemaelementpropertieshasbeenattempted. However,itdoesnotprovide
anyfailitytooptimizemathing. Furthermore,additionalonstraintinformation,suhassemantirelationships
andother domainonstraints,isnotinludedand shemasmaynotompletelyapturethesemantisof data
theydesribe. Therefore,inordertoimproveperformaneandorretnessofmathing, additionalinformation
shouldbeinluded. Inthispaper,weareonernedwithbothsyntatiandsemantimathing. Therefore,we
shalllassifyonstraintsthatshouldbeinorporatedin theCPmodelinto: syntationstraintsandsemanti
onstraints. Inthe following, we onsider onlythe onstraintsonstrutionwhile the fuzzyrelations of fuzzy
onstraintare notonsider sineit depends ontheappliation domain. Forexample,asshownbelow,domain
onstraints are risp onstraints, i. e.
µC
(
v
) = 1
, while the strutural onstraints are soft onstraints with dierentdegreeof satisfation.4.3.1. Syntati Constraints.
1. DomainConstraints: Itstatesthatanodevariablemustbeassignavalueorasetofvaluesfromits
orrespondingnodedomain,andanedgevariablemustbeassignedavaluefromitsorrespondingedge
domain. That is
∀xni
∈
XN
andxej
∈
XE∃
aunary onstraintC
dom
µ
(
xni
)
andC
dom
µ
(
xei
)
ensuring domainonsistenyofthemath where,
C
µ
dom
(
xni
)
=
{d
i
∈
D
N i},
C
dom
µ
(
xei
)
=
{d
i
∈
D
Ei}
2. Strutural Constraints: Therearemanystruturalrelationshipsbetweenshemagraphnodessuh
as:
•
Edge Constraint: It states that if an edge exists between twovariable nodes, then an edge (orpath) should exist between their orresponding images. That is
∀xei
∈
XE
and its soure and targetnodesarexns
andxnt
∈
X
∃
twobinaryonstraintsC
src
µ
(
xei,xns
)
andC
tar
µ
(
xei,xnt
)
representingthestruturalbehaviorofmathing,where:
C
µ
src
(
xei,xns
)
=
{
(
di, dj
)
∈
DE
×
DN
|
src
(
di
) =
dj}
C
(
tar
xei,xnt
)
=
{
(
d
i
, d
j
)
∈
D
E
×
D
N
|
tar
(
d
i
) =
d
j
}
(a) Parent Constraint
C
parent
µ
(
xni,xnj
)
representing the strutural behavior of parent relationship,where
C
µ
parent
(
xni,xnj
)
=
{
(
di, dj
)
∈
DN
×
DN
| ∃e
(
di, dj
)
s.t.src
(
e
) =
di}
(b) ChildConstraint
C
child
µ
(
xni,xnj
)
representingthestruturalbehaviorofhildrelationship,whereC
µ
child
(
xni,xnj
)
=
{
(
di, dj
)
∈
DN
×
DN
| ∃e
(
di, dj
)
s.t.tar
(
e
) =
dj
}
() Sibling Constraint
C
sibl
µ
(
xni,xnj
)
representing the strutural behavior of sibling relationship,where
C
µ
sibl
(
xni,xnj
)
=
{
(
di, dj
)
∈
DN
×
DN
| ∃dn
s.t.parent
(
dn, di
)
∧
parent
(
dn, dj
)
}
4.3.2. Semanti Constraints. The rst onstraint type onsiders only the strutural and hierarhial
relationshipsbetweenshemagraphnodes. Inordertoapture theotherfeaturesofshemagraphnodessuh
asthesemantifeaturewemakeuseofthefollowingonstraint.
1. Label Constraints:
∀xni
∈
XN
and∀xei
∈
XE∃
aunary onstraintC
Lab
µ
(
xni
)
andC
Lab
µ
(
xei
)
ensuring thesemantisoftheprediatesintheshemasuhthat:
C
µ
Lab
(
xni
)
=
{dj
∈
DN
|lsim
(
lS
(
x
ni
)
, lT
(
d
j
))
≥
t}
C
µ
Lab
(
xei
)
=
{dj
∈
DE|lsim
(
lS
(
x
ei
)
, lT
(
d
j
))
≥
t}
where
lsim
is alinguisti similarityfuntion determining thesemantisimilarity betweennodes/edges labels andt
isapredenedthreshold.The abovesyntatiand semantionstraintsare by no meansthe ontextual relationships between
ele-ments. Otherkindsofdomainknowledgeanalsoberepresentedthroughonstraints.Moreover,eahonstraint
isassoiatedwithamembershipfuntion
µ
(
v
)
∈
[0
,
1]
toindiate towhatextenttheonstraintshould be sat-ised. Ifµ
(
v
) = 0
, this meansv
totally violates the onstraint andµ
(
v
) = 1
meansv
totally satises it. Constraintsrestritthesearhspae forthemathingproblemsomaybenettheeienyof thesearhpro-ess. On the other hand, if too omplex, onstraints introdue additional omputational omplexity to the
problemsolver.
4.4. ObjetiveFuntionConstrution. Theobjetivefuntionisthefuntionassoiatedwithan
opti-mizationproesswhihdetermineshowgoodasolutionisanddependsontheobjetparameters. Theobjetive
funtion onstitutes the implementation of the problem to be solved. The input parameters are the objet
parameters. The output is the objetive value representingthe evaluation/quality of the individual. In the
shema mathing problem, the objetive funtion simulates human reasoning on similarity between shema
graphobjets.
In this framework, we should onsider two funtion omponents whih onstitute theobjetive funtion.
Therstisalledost funtion
f
ostwhihdeterminestheostofasetonstraintovervariables. Theseond
isalledenergyfuntion
f
energywhihmapseverypossiblevariableassignmentto aost. Then, theobjetive
funtionouldbeexpressedas follows:
g
=
mis|
max(
X
setofonstraints
f
ost+
X
set ofassignment
f
energy)
5. RelatedWork. Shemamathingisafundamentalproessinmanydomainsdealingwithshareddata
suh as data integration, data warehouse, E-ommere, semanti query proessing, and the web semantis.
Mathingsolutionsweredevelopedusingdierentkindofheuristis,butusuallywithoutpriorformaldenition
oftheproblemtheyaresolving.Althoughmanymathingsystems,suhasCupid[17℄,COMA/COMA++[6,1℄,
LSD[8℄, SimilarityFlooding[20℄,OntoBuilder [13℄, QOM[12℄, BTreeMath[11℄, S-Math[14℄,and Spiy [3℄,
havebeendevelopedand dierentapproaheshavebeenproposed tosolvetheshemamathing problem,but
noompleteworkto addresstheformulationproblem. Shema mathingresearhmostly fouses onhowwell
Mostof theexisting work [22℄ denemath asafuntion that takestwoshemas (models)asinput, may
beinthepreseneofauxiliaryinformationsouressuhasuserfeedbakandpreviousmappings,andprodues
amappingasoutput. Ashemaonsists ofasetofrelated elementssuhastables,olumns, lasses,orXML
elements and attributes. A mappingis a set of mapping elements speifying the mathing shema elements
together. Eahmappingelementisspeiedby4-tupleelement
hID, S
1
i
, S
j
2
, Ri
whereID
isanidentierforthe mappingelementthatmathesbetweentheelementS
1
i
oftherstshemaandtheelementS
2
j
oftheseondoneand
R
indiatesthesimilarityvaluebetween0and1. Thevalueof0meansstrongdissimilaritywhilethevalue of 1meansstrong similarity. But, in general, amappingelement indiates that ertain element(s) of shemaS
1
are related to ertain element(s) of shemaS
2
. Eah mapping element an have an assoiated mapping expression whih speies how thetwo elements (or more) are related. Shema mathing is onsidered onlywithidentifyingthemappingsnotdeterminingtheassoiatedexpressions.
IntheworkofA.Doan[7℄,theyformalizetheshemamathingproblemasfourdierentproblems:
1. The basi 1-1 Mathing; given two shemas
S
andT
(representations), for eah elements of S, nd themostsemantiallysimilarelementtofT,utilizingallavailableinformation. This problemisoftenreferredas a one-to-onemathing problem, beause itmathes eah elements with asingle element.
Forexample,the
h
ID1, S.Address, T.CAddress,0.8i
mappingelementindiatesthat there amapping betweentheelementS.Address ofshemaSandtheelementT.CAddress ofshemaTwithadegreeofsimilarity0.8.
2. Mathing for Data Integration; givensoureshemas S1,S2,...,Sn and mediatedshema T,foreah
elementsofSindthemostsimilarelementtofT.
3. Complex Mathing; letS and T be twodata representations. Let O ={O1, O2,...,Ok} be aset of
operatorsthatanbeappliedtotheelementsofTaordingtoasetofrulesRtoFigure2: Mathing
Funtion onstrutformulas. Foreahelementsof S,ndthe mostsimilar elementt,where tanbe
eitheranelementofToraformulafromtheelementsofT,usingOandR.
4. Mathing for Taxonomies; giventwo taxonomiesof onepts S and T, foreah onept node sof S,
ndthemostsimilaroneptnodeofT.
Foreahoftheseproblems,Doanshowsinputinformation,solutionoutput,andtheevaluationofasolution
output. Ingeneral,theinputtoaproblemaninludeanytypeofknowledgeabouttheshemastobemathed
andtheirdomainssuhasshemainformation,instanedata,previousmathings,domainonstraints,anduser
feedbak.
Zhangand et. el.[25℄ formulate the shema mathing problemasa ombinatorial optimizationproblem.
The authors ast the shema mathing problem into a multi-labeled graph mathing problem. The authors
propose ameta-meta model of shema: multi-labeled graph model, whih views shemas asnite strutures
overthe spei signatures. Based on this multi-labeled shema, they propose amulti-labeled graphmodel,
whih is an instane of multi-label shema, to desribe various shemas, where eah node and edge an be
assoiated with aset of labelsdesribing itsproperties. Thenthey onstruta generigraphsimilarity
mea-surebasedontheontrastmodeland proposean optimizationfuntion to omparetwomulti-labeledgraphs.
Usingthe greedyalgorithm, theydesignanoptimization algorithmto solvethemulti-labeledgraphmathing
problem.
Gal and et al. [13℄ propose a fuzzy framework to model theunertainty of the shema mathing proess
outome. The framework aims at identifying and analyzing fators that impat the eetiveness of shema
mathingalgorithmsbyreduingtheunertaintyofexistingalgorithms. Tospeifytheirbelief inthemapping
quality, theauthors assoiate aondene measure with any mapping among attributes' sets. They use the
frameworktodene themonotoniitypropertyasadesiredpropertyoftheshemamathingproblem, soone
ansafelyinterpretahigh ondenemeasureasagoodsemantimapping.
The reentwork for [23℄ introdues a formal speiation for the XML mathing problem. The authors
dene the ingredientsof theXML shema mathing problem using onstraint logi programming. Mathing
problems an be dened through variables, variable domains, onstraints and an objetive funtion. They
distinguishbetweentheonstraintsatisfationproblemandonstraintoptimizationproblemandshowthatthe
optimization problem is more suitable for the shema mathing problem. They make use of ombination of
lusteringmethods andthebranhandbound algorithmtosolvetheshemamathingproblem.
Inourformulationapproah,wehavesomeommonanddistintfeatureswiththeotherrelatedwork. The
onstraint optimization problem. However, ourapproah introduesdistintly theuse of fuzzy onstraint in
ordertoreetthenature oftheshemamathingproblem. Aswellastheuseofthefuzzyonstraintenables
ustomodelunertaintyintheshemamathingproess.
6. Summaryand Future Work. In thispaper,wehaveinvestigated anintriate problem; theshema
mathingproblem. Inpartiular,wehaveintroduedafuzzyonstraint-basedframeworktomodeltheshema
mathingproblem. Tothisend, webuildaoneptualonnetionbetweentheshemamathingproblemand
fuzzy onstraintoptimization problem. On one hand, we onsider shema mathing as anew appliation of
fuzzyonstraintoptimization, andontheotherhandweproposetheuseoffuzzyonstraintoptimization asa
newapproahforshemamathing.
Ourproposed approah isageneriframeworkwhihhasthe featureto dealwithdierentshema
repre-sentations by transformingthe shema mathing problem into graph mathing. Insteadof solving thegraph
mathing problem whih has been proven to be an NP-omplete problem, we reformulate it as aonstraint
problem. Wehave identied two typesof onstraintssyntatiand semantito ensure math semantis. As
wellas,wemakeuseofthefuzzyonstraintsinordertoenableusmodelingunertaintyintheshemamathing
proess. Wealsoshed lightonhowtoonstrutobjetivefuntions.
ThemainbenetofthisapproahisthatwegaindiretaesstotherihresearhndingsintheCParea;
insteadofinventingnewalgorithmsforgraphmathingfromsrath. Anotherimportantadvantageisthatthe
atualalgorithmsolutionbeomesindependentof theonretegraphmodel, allowingus tohangethemodel
withoutaeting thealgorithmbyintroduinganewlevelofabstration.
Understandingtheshemamathing problemis onsideredtherststeptowardsaneetiveandeient
solutionfortheproblem. Inourongoingwork,wewillexploitonstraintsolveralgorithmstoreah ourgoal.
REFERENCES
[1℄ D.Aumueller,H.H. Do,S.Massmann, andE. Rahm,Shema and ontology mathingwithCOMA++,inSIGMOD
Conferene,2005,pp.906908.
[2℄ R.BabakrishnanandK.Ranganathan,Atextbookofgraphtheory,SpringVerlag,1999.
[3℄ A.Bonifati, G. Mea, A.Pappalardo, and S. Raunih, Thespiy projet: Anew approah todata mathing, in
SEBD,Turkey,2006.
[4℄ S. C.Chang,B.He, C.Li,M.Patel,andZ.Zhang,Strutureddatabasesontheweb: Observationsandimpliations,
SIGMODReord,33(2004),pp.6170.
[5℄ R. Dehter,ConstraintProessing,MorganKaufmann,2003.
[6℄ H.H.DoandE.Rahm,COMA-asystemforexibleombination ofshemamathingapproahes,inVLDB2002,2002,
pp.610621.
[7℄ A.Doan,Learningtomapbetweenstrutured representationsof datag,inPh.DThesis,WashingtonUniversity,2002.
[8℄ A. Doan,P. Domingos,andA.Halevy,Reonilingshemasof disparatedatasoures: Amahine-learning approah,
SIGMOD,(2001),pp.509520.
[9℄ X.Dong,A.Halevy,andC.Yu,Dataintegration withunertainty,inVLDB'07,2007,pp.687698.
[10℄ C. Drumm,M.Shmitt,H.-H.Do,andE.Rahm,Quikmig-automati shemamathingfordatamigration projets,
inPro.ACMCIKM07,Portugal,2007.
[11℄ F. Duhateau,Z.Bellahsene,andM.Rohe,Anindexingstrutureforautomatishemamathing,inSMDB
Work-shop,Turkey,2007.
[12℄ M.EhrigandS.Staab,QOM-quikontologymapping,inInternationalSemantiWebConferene,2004,pp.683697.
[13℄ A. Gal, A. Tavor, A. Trombetta, and D. Montesi,A framework for modeling and evaluating automati semanti
reoniliation,VLDBJournal,14(2005),pp.5067.
[14℄ F.Giunhiglia,M.Yatskevih,andP.Shvaiko,Semantimathing: Algorithmsandimplementation,JournalonData
Semantis,IX(2007).
[15℄ H.W.GuesgenandA.Philpott,Heuristisforfuzzyonstraintsatisfation,inANNES'95,1995,pp.132135.
[16℄ W. LiandC. Clifton,Semint: Atoolforidentifying attributeorrespondenes inheterogeneousdatabases usingneural
networks,DataandKnowledgeEngineering,33(2000),pp.4984.
[17℄ J.Madhavan,P.A.Bernstein,andE.Rahm,Generishema mathingwithupid,inVLDB2001,Roma,Italy,2001,
pp.4958.
[18℄ K.MarriottandP.Stukey,Programming withConstraints: AnIntrodution,MITPress,1998.
[19℄ S.Medasani,R.Krishnapuram,andY.Choi,Graphmathingbyrelaxationoffuzzyassignments,IEEETrans.onFuzzy
Systems,9(2001),pp.173182.
[20℄ S.Melnik,H.Garia-Molina,andE.Rahm,Similarityooding:Aversatilegraphmathingalgorithmanditsappliation
toshemamathing,inICDE'02,2002.
[21℄ L.Palopoli,D.Rossai,G.Terraina,andD.Ursino,Agraph-basedapproahforextratingterminologialproperties
frominformationsoureswithheterogeneousformats,KnowledgeandInformationSystems,8(2005),pp.462497.
[23℄ M.Smiljani,M.vanKeulen,andW.Jonker,Formalizingthexmlshemamathingproblemasaonstraint
optimiza-tionproblem,inDEXA,K.V.Andersen,J.K.Debenham,andR.Wagner,eds.,vol.3588ofLetureNotesinComputer
Siene,Springer,2005,pp.333342.
[24℄ E.Tsang,Foundationsof ConstraintSatisfation,AadimiPress,1993.
[25℄ Z.Zhang, H.Che, P.Shi, Y.Sun, andJ. Gu,Formulationshema mathingproblemforombinatorial optimization
problem,IBIS,1(2006),pp.3360.
Editedby: DominikFlejter,TomaszKazmarek,MarekKowalkiewiz
Reeived: February3rd,2008
Aepted: Marh 19th,2008