• No results found

Fuzzy Constraint-Based Schema Matching Formulation

N/A
N/A
Protected

Academic year: 2020

Share "Fuzzy Constraint-Based Schema Matching Formulation"

Copied!
12
0
0

Loading.... (view fulltext now)

Full text

(1)

FUZZYCONSTRAINT-BASED SCHEMA MATCHING FORMULATION

ALSAYED ALGERGAWY, EIKE SCHALLEHN, AND GUNTER SAAKE

Abstrat. The deepWeb has manyhallenges to be solved. Among themis shemamathing. Inthispaper, webuild

aoneptual onnetionbetween the shemamathingproblem SMPand the fuzzyonstraint optimizationproblem FCOP.In

partiular,weproposethe useofthe fuzzyonstraintoptimizationproblemas aframeworktomodeland formalizethe shema

mathingproblem. Byformalizingthe SMP asa FCOP,wegain manybenets. First,weouldexpress itas a ombinatorial

optimizationproblem witha set of softonstraints whihare able toopewithunertainty inshema mathing. Seond, the

atualalgorithmsolutionbeomesindependentoftheonretegraphmodel,allowingustohangethemodelwithoutaetingthe

algorithmbyintroduinganewlevelofabstration. Moreover,weoulddisoveromplexmatheseasily. Finally,weouldmake

atrade-obetweenshemamathingperformaneaspets.

Keywords: shemamathing,onstraintprogramming,fuzzyonstraints,objetivefuntion

1. Introdution. Thedeep Web (alsoknownasDeepnet orthehidden Web)refersto the WorldWide

Web ontent that is not a part of the surfae Web. It is estimated that the deep Web is several orders of

magnitude largerthan the surfaeWeb [4℄. As the numberof deep Web soureshas been inreasingas the

eorts needed to enable users to explore and integrate these soures beome essential. As a result software

systemshavebeendevelopedtoopenthedeepWebtousers.Shemamathing istheoretaskofthesesystems.

Shema mathing is the task of identifying semanti orrespondenes among elements of two or more

shemas. It plays a entral role in many data appliation senarios[22, 17℄: in data integration, to identify

and haraterize inter-shema relationships between multiple (heterogeneous) shemas; in data warehousing,

to map data souresto awarehouse shema; in E-business, to help to map messagesbetweendierentXML

formats; in the Semanti Web, to establishsemantiorrespondenesbetweenonepts of dierent web sites

ontologies;andin datamigration, tomigratelegaydatafrommultiple souresinto anewone[10℄.

Duetotheomplexityofshemamathing,itwasmostlyperformedmanuallybyahumanexpert. However,

manualreoniliation tendstobeaslowand ineientproessespeially in large-saleand dynami

environ-ments. Therefore,theneedforautomatishemamathinghasbeomeessential. Consequently,manyshema

mathingsystemshavebeendevelopedforautomatingtheproess,suhasCupid[17℄,COMA/COMA++[6,1℄,

LSD[8℄, SimilarityFlooding[20℄,OntoBuilder [13℄, QOM[12℄, BTreeMath[11℄, S-Math[14℄,and Spiy [3℄.

Manualsemantimathingoveromesmismatheswhihexistinelementnamesandalsodierentiatesbetween

dierenesofdomains. Hene,weouldassumethatmanualmathingisaperfetproess. Ontheotherhand,

automatimathingmayarrywithitadegreeofunertainty,asitisbasedonsyntati,ratherthansemanti,

means. Furthermore, reently, there has been renewedinterestin building database systemsthat handle

un-ertaindatain aprinipledway[9℄. Heneashortrantabouttherelationshipbetweendatabasesthatmanage

unertaintyanddataintegrationsystemsappears. Therefore,weshould surfforasuitablemodelwhihisable

tomeettheaboverequirements.

A rst step in disovering an eetive and eient way to solve any diult problem suh as shema

mathingistoonstrutaompleteproblemspeiation. Asuitableandpreisedenitionofshemamathing

is essentialfor investigating approahes to solve it. Shema mathing has been extensively researhed, and

manymathingsystemshavebeendeveloped. Someofthese systemsarerule-based[6,17,20℄ andothers are

learning-based[16,7,8℄. However,formalspeiationsofproblemsbeingsolvedbythesesystemsdonotexist,

orarepartial. Littleworkisdonetowardsshemamathingproblemformulatione.g. in[25,23℄.

Inthe rule-basedapproahes, agraph isused to desribethestate of amodeled systemat a giventime,

and graph rules are used to desribe the operations on the system's state. As a onsequene in pratie,

using graph rules has a worst ase omplexity whih is exponential to the size of the graph. Of ourse, an

algorithm of exponential time omplexity is unaeptable for serious system implementation. In general, to

ahieveaeptable performaneitis inevitableto onsequentlyexploit thespeialpropertiesof bothshemas

tobemathed. Besidethat,thereisastrikingommonalityinallrule-basedapproahes;theyareallbasedon

baktraking paradigms. Knowingthat the overwhelmingmajority of theoretialas well asempirial studies

ontheoptimization of baktrakingalgorithmsis basedonthe ontextof onstraint problem(CP),it isnear

(2)

to hand to openthis knowledgebase forshema mathing algorithms byreformulatingthe shema mathing

problemasaCP[24,18,5℄.

Tosummarize,weareinaneedtoaframeworkwhihisabletofae thefollowinghallenges:

1. formalizing the shema mathing problem: Althoughmanymathing systemshavebeendeveloped to

solvetheshemamathingproblem,butnoompleteworktoaddresstheformulationproblem. Shema

mathingresearhmostlyfousesonhowwellshemamathingsystemsreognizeorrespondenes. On

theotherhand,notenoughresearhhasbeendoneonformalbasisoftheshemamathing problem.

2. trading-o between shema mathing performane aspets: The performane of a shema mathing

systemomprisestwoequallyimportantfators;namelymathingeetivenessandmathingeieny.

The eetiveness is onerned with the auray and the orretness of the math result while the

eienyisonernedwiththesystemresouressuhastheresponsetimeofthemathsystem. Reent

shemamathingsystemsreportonsiderableeetiveness[6℄,however,theeienyaspetsremaina

missingareaandrepresentanopenhallengefortheshemamathingommunity. Improvingshema

mathingeienyresultsindereasingmathingeetiveness,soatrade-obetweenthetwoaspets

shouldbeonsidered.

3. dealing with unertainty of shema mathing: Shema mathing systems should be able to handle

unertaintyarisesduringthemathingproessfromdierentsoures. Reently,therehasbeenrenewed

interestinbuildingdatabasesystemsthathandleunertaindataanditslineageinaprinipledway,so

ashort rantaboutthe relationshipbetweendatabasesthatmanage unertaintyand lineageand data

integrationsystemsappears. Inadditionto, inordertofullyautomatethemathingproess,wemake

useofextratortoolswhihextratdierentdatamodelsandrepresentthemasaommonmodel. The

extrationproessbringserrorsandunertaintiestothemathingproess

Inthispaper,webuildaoneptualonnetionbetweenthe shemamathing problem(SMP) andthe fuzzy

onstraintoptimization problem (FCOP). Ononehand,weonsider shemamathing asanewappliation of

fuzzyonstraints;ontheother hand,weproposetheuseofthefuzzyonstraintsatisfationproblemasanew

approah forshemamathing. Inpartiular,in thispaper,weproposetheuseoftheFCOPto formulatethe

SMP.However,ourapproahshouldbegeneri,i.e. havetheabilitytoopewithdierentdatamodelsandbe

usedfordierentappliationdomains. Therefore,wersttransformshemastobemathedintoaommondata

modelalledrootedlabeledgraphs. Thenwereformulatethegraphmathingproblemasaonstraintproblem.

Therearemanybenetsbehindthisformulation. First,wegaindiretaesstotherihresearhndingsinthe

CP area;insteadof inventingnew algorithmsfor graphmathing fromsrath. Seond, theatual algorithm

solutionbeomesindependentof theonrete graphmodel, allowingusto hangethemodelwithoutaeting

the algorithm by introduing a new level of abstration. Third, formalizing the SMP asa FCOP failitates

handlingunertaintyin theshema mathingproess. Finally,weould simplydealwithsimpleandomplex

mappings.

The paperis organizedasfollows: Setion 2introduesneessarypreliminaries. Ourframework to unify

shemamathing is presentedin Setion 3inorder to illustratethe sopeof thispaper. Setion4showshow

toformulate theshemamathingproblemasaonstraintproblem. Setion5desribestherelatedwork. The

onludingremarksandongoingfuture workarepresentedin Setion6.

2. Preliminaries. Thispaperisbasedmainlyontwoexistingbodiesofresearh,namelygraphtheory[2℄

andonstraintprogramming[24,18,5℄. Tokeepthispaperself-ontained,webrieypresentinthissetionthe

basioneptsofthem.

2.1. Graph Model. Ashemaisthedesriptionofthestrutureandtheontentofamodelandonsists

of a set of related elements suh as tables, olumns, lasses, or XML elements and attributes. There are

manykindsofdatamodels,suhasrelationalmodel,objet-orientedmodel,ERmodel,XMLshema,et. By

shemastrutureand shemaontent,wemeanitsshema-basedpropertiesanditsinstane-basedproperties,

respetively. In this subsetionwepresentformally rooted (multi-)labeled direted graphs used to represent

shemastobemathedastheinternalommonmodel.

A rooted labeled graphis a direted graphsuh that nodes and edges are assoiated with labels, and in

whihonenodeislabeledin aspeialwaytodistinguishitfrom thegraph'sothernodes. This speialnodeis

alledtherootofthegraph. Withoutlossofgenerality,weshallassumethateverynodeandedgeisassoiated

(3)

Definition2.1. ARootedLabeledGraphGisa6-tuple G=(N

G

,E

G

,Lab

G

,sr, tar,l) where:

N

G

=

{n

root

, n

2

, . . . , n

n}

isanitesetofnodes,eahofthemisuniquelyidentiedbyanobjetidentier (OID),where

n

root

isthe graphroot.

E

G

=

{

(

n

i

, n

j

)

|n

i

, n

j

N

G}

is anite set of edges, eah edge represents the relationship between two nodes.

Lab

G

={Lab

N G

, Lab

EG

}is anite set of node labels Lab

N G

,anda nite setof edge labelsLab

EG

.

Theselabelsarestrings for desribingthe properties(features)of nodesandedges.

sr andtar:

EG

7→

NG

aretwo mappings(soureandtarget), assigning asoure andatarget node to eahedge(i. e. if

e

= (

ni, nj

)

then

src

(

e

) =

ni

and

tar

(

e

) =

nj

).

l

:

NG

EG

7→

LabG

is amapping labelassigning alabel from the given

LabG

toeah node andeah edge.

• |NG|

=

n

isthe graphsize.

Nowthatwehavedenedaonretegraphmodel,inthefollowingsubsetionwepresentbasisofonstraint

programming.

2.2. Constraint Programming. Manyproblemsin omputersiene,mostnotablyin Artiial

Intelli-gene,anbeinterpreted asspeial asesofonstraintproblems. Semanti shema mathing is also an

intel-ligene proess whih aims atmimiking the behavior of humans in nding semanti orrespondenes between

shemas' elements. Therefore, onstraintprogrammingis asuitable sheme torepresent the shema mathing

problem.

Constraintprogrammingis ageneriframeworkfor delarativedesriptionandeetivesolvingforlarge,

partiularly ombinatorial, problems. Not only it is based on a strong theoretial foundation but also it is

attratingwidespreadommerialinterestaswell,inpartiular,inareasofmodelingheterogeneousoptimization

andsatisfationproblems. We,here,onentrateonlyononstraintsatisfationproblems(CSPs)and present

denitionsforCSPs,onstraints,and solutionsfortheCSPs.

Definition2.2. AConstraintSatisfationProblem Pisdenedby a3-tupleP=(X,D,C)where,

X

=

{x

1

, x

2

, . . . , xn}

is anitesetof variables.

D

=

{D

1

, D

2

, . . . , Dn}

is a olletion of nite domains. Eah domain D

i

is the set ontaining the possiblevalues forthe orresponding variable x

i

X

.

C

=

{C

1

, C

2

, . . . , Cm}

isaset ofonstraintsonthe variables of X.

Definition 2.3. AConstraintC

s

on aset of variables

S

=

{x

1

, x

2

, . . . xr}

isa pair

Cs

= (

S, Rs

)

, where R

s

isasubset onthe produtofthese variables' domains:

Rs

D

1

× · · · ×

Dr

→ {

0

,

1

}

.

Thenumber

r

ofvariablesaonstraintisdeneduponisalledarityoftheonstraint. Thesimplesttypeis theunaryonstraint,whihrestritsthevalueofasinglevariable. Ofspeialinterestaretheonstraintsofarity

two,alledbinaryonstraints. Aonstraintthatisdenedonmorethantwovariablesisalledaglobalonstraint.

Solving a CSP is nding assignments of values from the respetive domains to the variables so that all

onstraintsaresatised.

Definition2.4. (SolutionofaCSP)Anassignment

Λ

isasolutionofaCSPifitsatisesalltheonstraints

ofthe problem,wherethe assignment

Λ

denotes anassignmentof eahvariable

x

i

with theorrespondingvalue

a

i

suhthat

x

i

X

and

a

i

D

i

.

Example1. (MapColoring)Wewantto olortheregionsofamap, shown inFig.2.1, in awaythat no

twoadjaentregionshavethesameolor. Theatualproblemisthatonlyaertainlimitednumberofolorsis

available. Let'swehavefourregionsandonlythreeolors. Wenowformulatethisproblemas

CSP

= (

X, D, C

)

where:

X

=

{x

1

, x

2

, x

3

, x

4

}

representsthefourregions,

D

=

{D

1

, D

2

, D

3

, D

4

}

represents the domains of the variables suh that

D

1

=

D

2

=

D

3

=

D

4

=

{red, green, blue}

,and

C

=

{C

(

x

1

,x

2)

, C

(

x

1

,x

3)

, C

(

x

1

,x

4)

, C

(

x

2

,x

4)

, C

(

x

3

,x

4)

}

represents the set of onstraints must be satised suh that

C

(

xi,xj

)

=

{

(

vi, vj

)

Di

×

Dj|vi

6

=

vj}

.

AsshowninExample1,thereareanumberofsolutionstothespeiedCSP.Anyoneofthemisonsidered

asolutionto theproblem. However,in theshemamathingeld, wedonotonlysearh foranysolutionbut

alsothebest one. Thequalityofsolutionis usuallymeasuredbyanappliationdependentfuntionalled the

(4)

Fig.2.1. Mapoloringexample

Definition 2.5. A Constraint Optimization Problem Q is dened by ouple Q =(P,g) suhthat Pis a

CSP and

g

:

D

1

× · · · ×

Dn

[0

,

1]

isanobjetive funtion thatmaps eah solutiontuple intoavalue.

Example2. (TravelingSalesman)Thetravelingsalesmanproblem is tond theshortestlosed pathby

whihityoutofasetof

n

itiesisvisitedoneandonlyone.

Whilepowerful,bothCSP andCOPpresentsomelimitations. Inpartiular,allonstraintsareonsidered

mandatory. Inmanyreal-worldproblems,suhastheshemamathingproblem,thereareonstraintsthatould

beviolatedin solutionswithoutausingsuh solutionsto beunaeptable. If these onstraintsare treatedas

mandatory,thisoftenausesproblemstobeunsolved. Iftheseonstraintsareignored,solutionsofbadquality

arefound.ThisisamotivationtoextendtheCSPshemeandmakeuseofsoftonstraints. Awaytoirumvent

inonsistent onstraintsproblems is to make them fuzzy [15℄. Theidea is to assoiatefuzzy values with the

elementsoftheonstraints,andombine theminareasonableway.

A onstrain, as dened before, is usually dened as apair onsisting of a set of variables and arelation

on these variables. This denition givesus the availability to model dierent typesof unertainty in shema

mathing. In[9℄,authorsidentifydierentsouresforunertaintyindataintegration. Unertaintyinsemanti

mappingsbetweendatasouresanbemodeledbyexploitingfuzzyrelationswhileothersouresofunertainty

anbemodeled bymakingthevariablesetafuzzyset. Inthispaper,wetaketherstoneintoaountwhile

theothersouresareleft forourongoingwork.

Definition 2.6. (Fuzzy Constraint) A Fuzzy Constraint

on a set of variables

S

=

{x

1

, x

2

, . . . , xr}

is a pair

= (

S, Rµ

)

, where the fuzzy relation

, dened by

µR

:

Q

xi

var

(

C

)

Di

7→

[0

,

1]

where

µR

is the membership funtion indiating towhatextent atuplev satises

.

µ

R

(

v

) = 1

means vtotallysatises

C

µ

,

µ

R

(

v

) = 0

means vtotallyviolates

C

µ

,while

0

< µR

(

v

)

<

1

meansvpartiallysatises

.

Definition 2.7. A FuzzyConstraintC

µ

on asetof variables

S

=

{x

1

, x

2

, . . . xr}

isapair

= (

S, Rµ

)

, where the fuzzy relation R

µ

, dened by

µR

:

Q

xi

var

(

C

)

Di

[0

,

1]

where

µR

is the membership funtion indiating towhat extentatuple vsatisesC

µ

.

µR

(

v

) = 1

means vtotaly satises

,

µR

(

v

) = 0

means vtotaly violates

,while

0

< µR

(

v

)

<

1

meansvpartiallysatises

.

Definition 2.8. A FuzzyConstraintOptimization Problem Q

µ

isa4-tuple Q

µ

=(X,D, C

µ

, g)where X isalist of variables,D isa listof domains ofpossible valuesfor the variables, C

µ

isalist of fuzzyonstraints

eah ofthemreferringtosomeof thegiven variables, andgisan objetivefuntion tobeoptimized.

In thefollowing setion we shed the light on our shema mathing framework to determine thesope of

shemamathingunderstanding.

3. A uniedshemamathingframework. Eahoftheexistingshemamathingsystemsdealswith

theshemamathingproblem fromitspointof view. As aresulttheneedto ageneriframework thatunies

thesolutionofthisintriateproblemindependentonthedomainofshemastobemathedandindependenton

themodelrepresentationsbeomesessential.Tothisend,wesuggestthefollowinggeneralphasestoaddressthe

shemamathing problem. Figure2showsthese phaseswith themainsopeofthis paper. Thefour dierent

phasesare:

(5)

Fig.3.1. MathingProess Phases

identifyingelementstobemathed;Pr-mathing Phase,

applyingthemathingalgorithms;Mathing Phase,and

exportingthemathresult;MapTrans Phase.

Inthefollowingsubsetionweintrodueaframeworkfordeningdierentdatamodelsandhowtotransform

themintoshemagraphs. Thispartfollowsthesameproedurefoundin[25℄toshowthatdierentdatamodels

ouldberepresentedbyshemagraphs.

3.1. ShemaGraph. Tomakethemathingproessamoregeneriproess,shemastobemathedshould

berepresentedinternally byaommonrepresentation. Thisuniform representationreduestheomplexityof

themathing proess by nothavingto opewith dierent representations. Bydeveloping suh import tools,

shemamathimplementationanbeappliedtoshemasofanydatamodelsuhasSQL,XML,UML,andet.

Therefore,therststepin ourapproahistotransformshemasto bemathedintoaommonmodelin order

toapplymathingalgorithms. Wemakeuseofrootedlabeledgraphsastheinternalmodel. Weallthisphase

TransMat; TransformationforMathingproess.

Ingeneral,to representshemasanddatainstanes, startingfromtheroot,theshemaispartitionedinto

relationsandfurther downinto attributesand instanes. Inpartiular, to representrelationalshemas,XML

shemas,et. asrooted labeledgraphs, independently ofthespei soureformat,webenet from therules

foundin[25,21℄. Theserulesarerewrittenasfollows:

Every preparedmathing objet in a shemasuh as theshema, relations, elements, attributes et.

is representedby anode, suh that theshemaitself is represented bythe root node. Let shema

S

onsistof

m

elements(

elem

),then

elem

S

n

i

N

G

S

7→

n

root

, s.t.

1

i

m

The features of the prepared mathing objet are represented by node labels

LabN G

. Let features (

f eatS

)bethepropertysetofanelement(

elem

),then

feat

featS

Lab

Lab

N G

Therelationshipbetweentwopreparedmathingobjetsisrepresentedbyanedge. Lettherelationships

betweenshemaelementsbe(

relS

),then

rel

relS

e(n

i

,n

j

)

E

G

s. t.

src

(

e

) =

ni

NG

tar

(

e

) =

nj

NG

Thepropertiesoftherelationshipbetweenpreparedobjetsarerepresentedbyedgelabels

LabEG

. Let features

rf eatS

bethepropertyset ofarelationship

rel

,then,

(6)

(a)Tworelationalshemas (b)Shemagraphs

Fig.3.2. TwoRelationalShemas&theirShemaGraphs(withoutlabels)

(a)TwoXMLshemas (b)Shemagraphs

Fig.3.3.TwoXMLshemas&theirshemagraphs(withoutlabels

Thefollowingtwoexamplesillustratethathowtheserulesanbeappliedtodierentdatamodelsinorder

tomakeourapproahamoregeneriapproah.

Example3.(Relational DatabaseShemas)ConsidershemasSandTdepitedinFig.3.2(a)(from[20℄).

TheelementsofS andT aretablesand attributes. Applyingtheaboverules, thetwoshemasShemaS and

Shema T are representedby SG1 and SG2 respetively, suh that SG1

= (

NGS, EGS,

Lab

GS,

sr

S

,

tar

S, lS

)

, where

NGS

=

{n

1

S, n

2

S, n

3

S

, n

4

S

, n

5

S

, n

6

S},

EGS

=

{e

1

2

, e

2

3

, e

2

4

, e

2

5

, e

2

6

},

Lab

GS

=

Lab

N S

Lab

ES

=

{

name

,

type

,

datatype

} ∪ {

part-of

,

assoiate

},

sr

S

,tar

S

,l

S

aremappingssuhthatsr

S

(

e

1

2

) =

n

1

S

,tar

S

(

e

2

3

) =

n

3

S

and

l

S

(

e

1

2

) =

part-of. Figure3.2(b) showsonlythenodesandedgesof theshemagraphs(SG2 anbedened similarly).

Inthisexample,weexploit dierentfeaturesofmathingobjetssuhasname,datatype,andtype. These

features are represented as nodes' labels. These features shall be the input parameters to the next phase.

For example, the nameof a mathing objetin

SG

1

will be used to measure linguisti similarity between it and another mathing objet from

SG

2

, its datatype is to measure datatype ompatibility, and its type is used to determine semanti relationships. However, our approah is exible in the sense that it is able to

exploit more features as needed. Moreover, in this example, we exploit one strutural feature part-of to

represent strutural relationships between nodes at dierent levels. Other strutural features e.g.

assoia-tion relationship, that is a strutural relationship speifying both nodes are oneptually at the same level,

is represented between keys. One assoiation relationship is represented in Fig. 3.2(b) between the nodes

n

6

T

and

n

9

T

to speify a key/foreign key relation. Visually, assoiation edges are represented as dashed lines.

Example 4. (XMLShemas)This examplethat wedisuss illustrates howour uniedshema mathing

framework opes with dierent hoies of the models to be mathed. Now onsider two XML shemas in

Fig.3.3(a)(from[25℄). TheshemasarespeiedusingtheXMLlanguagedeployedonthewebsitebiztalk.org

designedforeletronidoumentsusedin e-business. Theshemagraphs(withoutlabels)oftheseshemasare

shownin Fig.3.3(b). ThelabelsofnodesandedgesarethesameasExample3.

Examples 3and 4illustrate that using Trans-Matphase aims atmathing dierentshema models. The

mathing algorithm (Mathing Phase) does not have to deal with a large number of dierent models. The

(7)

weextendtheinternalrepresentation,shemagraphs,andreformulatethegraphmathingproblemasafuzzy

onstraintoptimizationproblem.

4. Shema Mathingas a FCOP.

4.1. Shema Mathing as Graph Mathing. Shemas to be mathed are transformed into rooted

labeled graphs and, hene, the shema mathing problem is onverted into graph mathing. There are two

typesofgraphmathing: graphisomorphism andgraph homomorphism. Ingeneral,amathofonegraphinto

anotherisgivenbyagraphmorphism,whihisamappingofonegraph'sobjetsetsintotheother's,withsome

restritionstopreservethegraph'sstrutureanditstypinginformation.

Definition4.1. AGraph Morphism

φ

:

SG

1

SG

2

between twoshema graphs

SG

1 = (

NGS, EGS, LabGS, srcS

, tarS, lS

)

and

SG

2 = (

NGT

, EGT

, LabGT

, srcT

, tarT

, lT

)

is a pair of mappings

φ

= (

φ

N

, φ

E

)

suh that

φ

N

:

N

GS

N

GT

(

φ

N

is a node mapping funtion) and

φE

:

EGS

EGT

(

φE

isanedgemapping funtion)and thefollowing restritions apply: 1.

∀n

N

GS

l

S

(

n

) =

l

T

(

φ

N

(

n

))

2.

∀e

E

GS

l

S

(

e

) =

l

T

(

φ

E

(

e

))

3.

∀e

E

GS

a path

p

N

GT

×

E

GT

suh that

p

=

φ

E

(

e

)

and

φ

N

(

src

S

(

e

)) =

src

T

(

φ

E

(

e

))

φN

(

tarS

(

e

)) =

tarT

(

φE

(

e

))

.

The rst two onditions preserve both nodes and edges labeling information, while the third ondition

preservesgraph'sstruture. Graphmathingisanisomorphimathingproblemwhen

|N

GS|

=

|N

GT

|

otherwise itishomomorphi. Obviously,theshemamathingproblemis ahomomorphiproblem.

Example5. ForthetworelationalshemasdepitedinFig.3.2(a)anditsassoiatedshemagraphsshown

inFig.3.2(b),theshemamathingproblembetweenshema

S

andshema

T

isonvertedintoahomomorphi graphmathingproblem between

SG

1

and

SG

2

.

Graphmathingisonsideredtobeoneofthemostomplexproblemsinomputersiene. Itsomplexity

is due to two major problems. The rst problem is the omputational omplexity of graph mathing. The

timerequiredbybaktrakingin searhtree algorithmsmayin theworst asebeome exponentialin thesize

of thegraph. Graph homomorphismhas been provento be NP-omplete problem [19℄. Theseond problem

is thefat that allof thealgorithms for graphmathing mentionedso faran onlybeapplied to twographs

atatime. Therefore, ifthereare morethantwoshemasthat mustbemathed, thenthe onventionalgraph

mathingalgorithms mustbeappliedto eahpairsequentially. Forappliationsdealingwithlargedatabases,

this may be prohibitive. Hene, hoosing graphmathing asplatform to solvethe shema mathing problem

may be eetive proess but ineient. Therefore, we propose transforming graph homomorphism into a

FCOP.

Nowthatwehavedened agraphmodelanditshomomorphism,letusonsiderhowtoonstrutaFCOP

outofagivengraphmathingproblem.

4.2. Graph Mathingas a FCOP. Intheshemamathing problem, wearetryingtondamapping

between the elements of twoshemas. Multiple onditions should be applied to make these mappings valid

solutionstothemathingproblem,andsomeobjetivefuntionsaretobeoptimizedtoseletthebestmappings

amongmathingresult. Theanalogytoonstraintproblemisquiteobvious: herewemakeamappingbetween

twosets, namely between aset of variables anda set of domains, where someonditions should be satised.

Sobasially, whatwehavetodo toobtainanequivalentonstraintproblem CPfor agivenshemamathing

problem(knowingthatshemastobemathedaretransformedintoshemagraphs)are:

1. takeobjetsofoneshemagraphtobemathedastheCP'ssetofvariables,

2. takeobjetsofothershemagraphstobemathedasthevariables'domain,

3. ndapropertranslationoftheonditionsthatapplytoshemamathingintoasetoffuzzyonstraints,

and

4. formobjetivefuntionstobeoptimized.

Wehavedened theshema mathing problem asagraphmathing homomorphism

φ

. We nowproeed by formalizing the problem

φ

as a FCOP problem

= (

X, D, Cµ, g

)

. To onstrut a FCOP out of this problem, wefollowtheaboverules. Throughthese rules,wetakethetworelationaldatabaseshemasshown

in Fig.3.2(a) and itsassoiatedshema graphsshown in Fig.3.2(b) asanexample, takinginto aountthat

(8)

The set of variables

X

is given by

X

=

NGS

S

EGS

where the variables from

NGS

are alled node variables

X

N

andfrom

E

GS

areallededgevariables

X

E

X

=

X

N

S

X

E

=

{x

n

1

, x

n

2

, x

n

3

, x

n

4

, x

n

5

, x

n

6

}

S

{x

e

1

2

, x

e

2

3

, x

e

2

4

, x

e

2

5

, x

e

2

6

}

The set of domain

D

is given by

D

=

N

GT

S

E

GT

, where the domains from

N

GT

are alled node domains

D

N

andfrom

E

GT

areallededgedomains

D

E

,

=

{Dn

1

, Dn

2

, Dn

3

, Dn

4

, Dn

5

, Dn

6

}

S

{De

1

2

, De

2

3

, De

2

4

, De

2

5

, De

2

6

}

where

Dn

1

=

Dn

2

=

Dn

3

=

Dn

4

=

Dn

5

=

Dn

6

=

{n

1

T

, n

2

T

, n

3

T

, n

4

T

, n

5

T

, n

6

T

, n

7

T

, n

8

T

, n

9

T

, n

10

T

}

(i.e. thenodedomainontainsalltheseondshema graphnodes)and

De

1

2

=

De

2

3

=

De

2

4

=

De

2

5

=

De

2

6

=

{e

1

2

T

, e

1

3

T

, e

2

4

T

, . . . , p

1

2

4

T

, . . .

}

(i.e. theedgedomainontainsalltheavailableedgesandpaths in theseondshemagraph)(the edge

e

1

2

reads theedgeextends betweenthetwonodes

n

1

and

n

2

suhthat

e

1

2

=

e

(

n

1

, n

2

)

).

Usingthisformalizationenablesustodealwithholistimathing. Thisanbeahievedbytakingtheobjets

ofoneshemaasthevariable set,while theobjetsof othershemasasthevariable's domain. Letwehave

n

shemaswhiharetransformedintoshemagraphs

SG

1

, SG

2

, . . . , SGn

then

X

=

XN

S

XE

,

DN

=

P

n

i

=2

DN i

,

DE

=

P

n

i

=2

DEi

. Anotherbenetbehindthisapproahisthatourapproahisabletodisoveromplexmathes oftypes 1:n andn:1 veryeasily. This anbeahievedbyallowingavaluemayhavemultiple valuesfrom its

orrespondingdomainandavaluemaybeassignedtomultiple variables.

Inthefollowingsubsetions,wedemonstratehowtoonstrutbothonstraintsandobjetivefuntionsin

ordertoobtainaompleteproblemdenition.

4.3. Constraint Constrution. Theexploited onstraintsshould reet thegoalsofshemamathing.

Shemamathingbasedonlyonshemaelementpropertieshasbeenattempted. However,itdoesnotprovide

anyfailitytooptimizemathing. Furthermore,additionalonstraintinformation,suhassemantirelationships

andother domainonstraints,isnotinludedand shemasmaynotompletelyapturethesemantisof data

theydesribe. Therefore,inordertoimproveperformaneandorretnessofmathing, additionalinformation

shouldbeinluded. Inthispaper,weareonernedwithbothsyntatiandsemantimathing. Therefore,we

shalllassifyonstraintsthatshouldbeinorporatedin theCPmodelinto: syntationstraintsandsemanti

onstraints. Inthe following, we onsider onlythe onstraintsonstrutionwhile the fuzzyrelations of fuzzy

onstraintare notonsider sineit depends ontheappliation domain. Forexample,asshownbelow,domain

onstraints are risp onstraints, i. e.

µC

(

v

) = 1

, while the strutural onstraints are soft onstraints with dierentdegreeof satisfation.

4.3.1. Syntati Constraints.

1. DomainConstraints: Itstatesthatanodevariablemustbeassignavalueorasetofvaluesfromits

orrespondingnodedomain,andanedgevariablemustbeassignedavaluefromitsorrespondingedge

domain. That is

∀xni

XN

and

xej

XE∃

aunary onstraint

C

dom

µ

(

xni

)

and

C

dom

µ

(

xei

)

ensuring domain

onsistenyofthemath where,

C

µ

dom

(

xni

)

=

{d

i

D

N i},

C

dom

µ

(

xei

)

=

{d

i

D

Ei}

2. Strutural Constraints: Therearemanystruturalrelationshipsbetweenshemagraphnodessuh

as:

Edge Constraint: It states that if an edge exists between twovariable nodes, then an edge (or

path) should exist between their orresponding images. That is

∀xei

XE

and its soure and targetnodesare

xns

and

xnt

X

twobinaryonstraints

C

src

µ

(

xei,xns

)

and

C

tar

µ

(

xei,xnt

)

representing

thestruturalbehaviorofmathing,where:

C

µ

src

(

xei,xns

)

=

{

(

di, dj

)

DE

×

DN

|

src

(

di

) =

dj}

C

(

tar

xei,xnt

)

=

{

(

d

i

, d

j

)

D

E

×

D

N

|

tar

(

d

i

) =

d

j

}

(9)

(a) Parent Constraint

C

parent

µ

(

xni,xnj

)

representing the strutural behavior of parent relationship,

where

C

µ

parent

(

xni,xnj

)

=

{

(

di, dj

)

DN

×

DN

| ∃e

(

di, dj

)

s.t.

src

(

e

) =

di}

(b) ChildConstraint

C

child

µ

(

xni,xnj

)

representingthestruturalbehaviorofhildrelationship,where

C

µ

child

(

xni,xnj

)

=

{

(

di, dj

)

DN

×

DN

| ∃e

(

di, dj

)

s.t.

tar

(

e

) =

dj

}

() Sibling Constraint

C

sibl

µ

(

xni,xnj

)

representing the strutural behavior of sibling relationship,

where

C

µ

sibl

(

xni,xnj

)

=

{

(

di, dj

)

DN

×

DN

| ∃dn

s.t.

parent

(

dn, di

)

parent

(

dn, dj

)

}

4.3.2. Semanti Constraints. The rst onstraint type onsiders only the strutural and hierarhial

relationshipsbetweenshemagraphnodes. Inordertoapture theotherfeaturesofshemagraphnodessuh

asthesemantifeaturewemakeuseofthefollowingonstraint.

1. Label Constraints:

∀xni

XN

and

∀xei

XE∃

aunary onstraint

C

Lab

µ

(

xni

)

and

C

Lab

µ

(

xei

)

ensuring the

semantisoftheprediatesintheshemasuhthat:

C

µ

Lab

(

xni

)

=

{dj

DN

|lsim

(

lS

(

x

ni

)

, lT

(

d

j

))

t}

C

µ

Lab

(

xei

)

=

{dj

DE|lsim

(

lS

(

x

ei

)

, lT

(

d

j

))

t}

where

lsim

is alinguisti similarityfuntion determining thesemantisimilarity betweennodes/edges labels and

t

isapredenedthreshold.

The abovesyntatiand semantionstraintsare by no meansthe ontextual relationships between

ele-ments. Otherkindsofdomainknowledgeanalsoberepresentedthroughonstraints.Moreover,eahonstraint

isassoiatedwithamembershipfuntion

µ

(

v

)

[0

,

1]

toindiate towhatextenttheonstraintshould be sat-ised. If

µ

(

v

) = 0

, this means

v

totally violates the onstraint and

µ

(

v

) = 1

means

v

totally satises it. Constraintsrestritthesearhspae forthemathingproblemsomaybenettheeienyof thesearh

pro-ess. On the other hand, if too omplex, onstraints introdue additional omputational omplexity to the

problemsolver.

4.4. ObjetiveFuntionConstrution. Theobjetivefuntionisthefuntionassoiatedwithan

opti-mizationproesswhihdetermineshowgoodasolutionisanddependsontheobjetparameters. Theobjetive

funtion onstitutes the implementation of the problem to be solved. The input parameters are the objet

parameters. The output is the objetive value representingthe evaluation/quality of the individual. In the

shema mathing problem, the objetive funtion simulates human reasoning on similarity between shema

graphobjets.

In this framework, we should onsider two funtion omponents whih onstitute theobjetive funtion.

Therstisalledost funtion

f

ost

whihdeterminestheostofasetonstraintovervariables. Theseond

isalledenergyfuntion

f

energy

whihmapseverypossiblevariableassignmentto aost. Then, theobjetive

funtionouldbeexpressedas follows:

g

=

mis

|

max(

X

setofonstraints

f

ost

+

X

set ofassignment

f

energy

)

5. RelatedWork. Shemamathingisafundamentalproessinmanydomainsdealingwithshareddata

suh as data integration, data warehouse, E-ommere, semanti query proessing, and the web semantis.

Mathingsolutionsweredevelopedusingdierentkindofheuristis,butusuallywithoutpriorformaldenition

oftheproblemtheyaresolving.Althoughmanymathingsystems,suhasCupid[17℄,COMA/COMA++[6,1℄,

LSD[8℄, SimilarityFlooding[20℄,OntoBuilder [13℄, QOM[12℄, BTreeMath[11℄, S-Math[14℄,and Spiy [3℄,

havebeendevelopedand dierentapproaheshavebeenproposed tosolvetheshemamathing problem,but

noompleteworkto addresstheformulationproblem. Shema mathingresearhmostly fouses onhowwell

(10)

Mostof theexisting work [22℄ denemath asafuntion that takestwoshemas (models)asinput, may

beinthepreseneofauxiliaryinformationsouressuhasuserfeedbakandpreviousmappings,andprodues

amappingasoutput. Ashemaonsists ofasetofrelated elementssuhastables,olumns, lasses,orXML

elements and attributes. A mappingis a set of mapping elements speifying the mathing shema elements

together. Eahmappingelementisspeiedby4-tupleelement

hID, S

1

i

, S

j

2

, Ri

where

ID

isanidentierforthe mappingelementthatmathesbetweentheelement

S

1

i

oftherstshemaandtheelement

S

2

j

oftheseondone

and

R

indiatesthesimilarityvaluebetween0and1. Thevalueof0meansstrongdissimilaritywhilethevalue of 1meansstrong similarity. But, in general, amappingelement indiates that ertain element(s) of shema

S

1

are related to ertain element(s) of shema

S

2

. Eah mapping element an have an assoiated mapping expression whih speies how thetwo elements (or more) are related. Shema mathing is onsidered only

withidentifyingthemappingsnotdeterminingtheassoiatedexpressions.

IntheworkofA.Doan[7℄,theyformalizetheshemamathingproblemasfourdierentproblems:

1. The basi 1-1 Mathing; given two shemas

S

and

T

(representations), for eah elements of S, nd themostsemantiallysimilarelementtofT,utilizingallavailableinformation. This problemisoften

referredas a one-to-onemathing problem, beause itmathes eah elements with asingle element.

Forexample,the

h

ID1, S.Address, T.CAddress,0.8

i

mappingelementindiatesthat there amapping betweentheelementS.Address ofshemaSandtheelementT.CAddress ofshemaTwithadegreeof

similarity0.8.

2. Mathing for Data Integration; givensoureshemas S1,S2,...,Sn and mediatedshema T,foreah

elementsofSindthemostsimilarelementtofT.

3. Complex Mathing; letS and T be twodata representations. Let O ={O1, O2,...,Ok} be aset of

operatorsthatanbeappliedtotheelementsofTaordingtoasetofrulesRtoFigure2: Mathing

Funtion onstrutformulas. Foreahelementsof S,ndthe mostsimilar elementt,where tanbe

eitheranelementofToraformulafromtheelementsofT,usingOandR.

4. Mathing for Taxonomies; giventwo taxonomiesof onepts S and T, foreah onept node sof S,

ndthemostsimilaroneptnodeofT.

Foreahoftheseproblems,Doanshowsinputinformation,solutionoutput,andtheevaluationofasolution

output. Ingeneral,theinputtoaproblemaninludeanytypeofknowledgeabouttheshemastobemathed

andtheirdomainssuhasshemainformation,instanedata,previousmathings,domainonstraints,anduser

feedbak.

Zhangand et. el.[25℄ formulate the shema mathing problemasa ombinatorial optimizationproblem.

The authors ast the shema mathing problem into a multi-labeled graph mathing problem. The authors

propose ameta-meta model of shema: multi-labeled graph model, whih views shemas asnite strutures

overthe spei signatures. Based on this multi-labeled shema, they propose amulti-labeled graphmodel,

whih is an instane of multi-label shema, to desribe various shemas, where eah node and edge an be

assoiated with aset of labelsdesribing itsproperties. Thenthey onstruta generigraphsimilarity

mea-surebasedontheontrastmodeland proposean optimizationfuntion to omparetwomulti-labeledgraphs.

Usingthe greedyalgorithm, theydesignanoptimization algorithmto solvethemulti-labeledgraphmathing

problem.

Gal and et al. [13℄ propose a fuzzy framework to model theunertainty of the shema mathing proess

outome. The framework aims at identifying and analyzing fators that impat the eetiveness of shema

mathingalgorithmsbyreduingtheunertaintyofexistingalgorithms. Tospeifytheirbelief inthemapping

quality, theauthors assoiate aondene measure with any mapping among attributes' sets. They use the

frameworktodene themonotoniitypropertyasadesiredpropertyoftheshemamathingproblem, soone

ansafelyinterpretahigh ondenemeasureasagoodsemantimapping.

The reentwork for [23℄ introdues a formal speiation for the XML mathing problem. The authors

dene the ingredientsof theXML shema mathing problem using onstraint logi programming. Mathing

problems an be dened through variables, variable domains, onstraints and an objetive funtion. They

distinguishbetweentheonstraintsatisfationproblemandonstraintoptimizationproblemandshowthatthe

optimization problem is more suitable for the shema mathing problem. They make use of ombination of

lusteringmethods andthebranhandbound algorithmtosolvetheshemamathingproblem.

Inourformulationapproah,wehavesomeommonanddistintfeatureswiththeotherrelatedwork. The

(11)

onstraint optimization problem. However, ourapproah introduesdistintly theuse of fuzzy onstraint in

ordertoreetthenature oftheshemamathingproblem. Aswellastheuseofthefuzzyonstraintenables

ustomodelunertaintyintheshemamathingproess.

6. Summaryand Future Work. In thispaper,wehaveinvestigated anintriate problem; theshema

mathingproblem. Inpartiular,wehaveintroduedafuzzyonstraint-basedframeworktomodeltheshema

mathingproblem. Tothisend, webuildaoneptualonnetionbetweentheshemamathingproblemand

fuzzy onstraintoptimization problem. On one hand, we onsider shema mathing as anew appliation of

fuzzyonstraintoptimization, andontheotherhandweproposetheuseoffuzzyonstraintoptimization asa

newapproahforshemamathing.

Ourproposed approah isageneriframeworkwhihhasthe featureto dealwithdierentshema

repre-sentations by transformingthe shema mathing problem into graph mathing. Insteadof solving thegraph

mathing problem whih has been proven to be an NP-omplete problem, we reformulate it as aonstraint

problem. Wehave identied two typesof onstraintssyntatiand semantito ensure math semantis. As

wellas,wemakeuseofthefuzzyonstraintsinordertoenableusmodelingunertaintyintheshemamathing

proess. Wealsoshed lightonhowtoonstrutobjetivefuntions.

ThemainbenetofthisapproahisthatwegaindiretaesstotherihresearhndingsintheCParea;

insteadofinventingnewalgorithmsforgraphmathingfromsrath. Anotherimportantadvantageisthatthe

atualalgorithmsolutionbeomesindependentof theonretegraphmodel, allowingus tohangethemodel

withoutaeting thealgorithmbyintroduinganewlevelofabstration.

Understandingtheshemamathing problemis onsideredtherststeptowardsaneetiveandeient

solutionfortheproblem. Inourongoingwork,wewillexploitonstraintsolveralgorithmstoreah ourgoal.

REFERENCES

[1℄ D.Aumueller,H.H. Do,S.Massmann, andE. Rahm,Shema and ontology mathingwithCOMA++,inSIGMOD

Conferene,2005,pp.906908.

[2℄ R.BabakrishnanandK.Ranganathan,Atextbookofgraphtheory,SpringVerlag,1999.

[3℄ A.Bonifati, G. Mea, A.Pappalardo, and S. Raunih, Thespiy projet: Anew approah todata mathing, in

SEBD,Turkey,2006.

[4℄ S. C.Chang,B.He, C.Li,M.Patel,andZ.Zhang,Strutureddatabasesontheweb: Observationsandimpliations,

SIGMODReord,33(2004),pp.6170.

[5℄ R. Dehter,ConstraintProessing,MorganKaufmann,2003.

[6℄ H.H.DoandE.Rahm,COMA-asystemforexibleombination ofshemamathingapproahes,inVLDB2002,2002,

pp.610621.

[7℄ A.Doan,Learningtomapbetweenstrutured representationsof datag,inPh.DThesis,WashingtonUniversity,2002.

[8℄ A. Doan,P. Domingos,andA.Halevy,Reonilingshemasof disparatedatasoures: Amahine-learning approah,

SIGMOD,(2001),pp.509520.

[9℄ X.Dong,A.Halevy,andC.Yu,Dataintegration withunertainty,inVLDB'07,2007,pp.687698.

[10℄ C. Drumm,M.Shmitt,H.-H.Do,andE.Rahm,Quikmig-automati shemamathingfordatamigration projets,

inPro.ACMCIKM07,Portugal,2007.

[11℄ F. Duhateau,Z.Bellahsene,andM.Rohe,Anindexingstrutureforautomatishemamathing,inSMDB

Work-shop,Turkey,2007.

[12℄ M.EhrigandS.Staab,QOM-quikontologymapping,inInternationalSemantiWebConferene,2004,pp.683697.

[13℄ A. Gal, A. Tavor, A. Trombetta, and D. Montesi,A framework for modeling and evaluating automati semanti

reoniliation,VLDBJournal,14(2005),pp.5067.

[14℄ F.Giunhiglia,M.Yatskevih,andP.Shvaiko,Semantimathing: Algorithmsandimplementation,JournalonData

Semantis,IX(2007).

[15℄ H.W.GuesgenandA.Philpott,Heuristisforfuzzyonstraintsatisfation,inANNES'95,1995,pp.132135.

[16℄ W. LiandC. Clifton,Semint: Atoolforidentifying attributeorrespondenes inheterogeneousdatabases usingneural

networks,DataandKnowledgeEngineering,33(2000),pp.4984.

[17℄ J.Madhavan,P.A.Bernstein,andE.Rahm,Generishema mathingwithupid,inVLDB2001,Roma,Italy,2001,

pp.4958.

[18℄ K.MarriottandP.Stukey,Programming withConstraints: AnIntrodution,MITPress,1998.

[19℄ S.Medasani,R.Krishnapuram,andY.Choi,Graphmathingbyrelaxationoffuzzyassignments,IEEETrans.onFuzzy

Systems,9(2001),pp.173182.

[20℄ S.Melnik,H.Garia-Molina,andE.Rahm,Similarityooding:Aversatilegraphmathingalgorithmanditsappliation

toshemamathing,inICDE'02,2002.

[21℄ L.Palopoli,D.Rossai,G.Terraina,andD.Ursino,Agraph-basedapproahforextratingterminologialproperties

frominformationsoureswithheterogeneousformats,KnowledgeandInformationSystems,8(2005),pp.462497.

(12)

[23℄ M.Smiljani,M.vanKeulen,andW.Jonker,Formalizingthexmlshemamathingproblemasaonstraint

optimiza-tionproblem,inDEXA,K.V.Andersen,J.K.Debenham,andR.Wagner,eds.,vol.3588ofLetureNotesinComputer

Siene,Springer,2005,pp.333342.

[24℄ E.Tsang,Foundationsof ConstraintSatisfation,AadimiPress,1993.

[25℄ Z.Zhang, H.Che, P.Shi, Y.Sun, andJ. Gu,Formulationshema mathingproblemforombinatorial optimization

problem,IBIS,1(2006),pp.3360.

Editedby: DominikFlejter,TomaszKazmarek,MarekKowalkiewiz

Reeived: February3rd,2008

Aepted: Marh 19th,2008

References

Related documents

The hyperspectral imagery enabled the extraction of thermal, radiance, and reflectance spectra from 260 spectral bands from each plot for the calculation of indices

By contrast to the changes we observed in other pyramidal neurons in motor cortex, the MPFC and somatosensory cortex, there were no differences in the soma volume, total arbor

Vitamin A Delivery in Total Parenteral Nutrition

Signaling alkaline pH stress in the yeast Saccharomyces cerevisiae through the Wsc1 cell surface sensor and the Slt2 MAPK pathway. The Saccharomyces cerevisiae

Skeide, Dilation theory and continuous tensor product systems of Hilbert modules, PQQP: Quantum Probability and White Noise Analysis XV (2003), World Scientific..

Given the topology of a 3D data bus that has m terminals, n bus wires, p different timing periods, and each timing period has at least one data access timing transferred from a

This study was conducted to know the socio-demographic profile and pattern of burn injury in patients admitted in burn unit of tertiary care hospital.. Methods: A hospital based

The aim of the qualitative AMR study was to explore AB knowledge, attitudes and be- haviours of health care professionals (HCPs – i.e., both physicians and community pharmacists)