Contents lists available at ScienceDirect
Information Processing and Management
journal homepage: www.elsevier.com/locate/ipm
FCA based ontology development for data integration
Gaihua Fu ∗
School of Civil Engineering and Geosciences, Newcastle University, Newcastle upon Tyne, UK
Article info
Article history: Received 12 May 2015; Revised 5 November 2015; Accepted 22 February 2016; Available online 14 March 2016
Keywords: Ontology development; Formal concept analysis; Data integration; Information sharing
Abstract
Data is a valuable asset to our society. Effective use of data can enhance the productivity of businesses and create economic benefit for customers. However, with data growing at unprecedented rates, organisations are struggling to take full advantage of available data. One main reason for this is that data usually originates from disparate sources. This can result in data heterogeneity and prevent data from being digested easily. Among other techniques developed, ontology based approaches are one promising method for overcoming heterogeneity and improving data interoperability. This paper contributes a formal and semi-automated approach for ontology development based on Formal Concept Analysis (FCA), with the aim of integrating data that exhibits implicit and ambiguous information. A case study has been carried out on several non-trivial industrial datasets, and our experimental results demonstrate that the proposed method offers an effective mechanism that enables organisations to interrogate and curate heterogeneous data, and to create the knowledge that meets the needs of business.
© 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
Business productivity and competitiveness are increasingly being driven by the effective access and use of data. Data provides a mine of information that can help us spot undiscovered patterns of business importance and create the knowledge that will be needed to tackle the challenges of the future. However, with data becoming available and growing at unprecedented rates, organisations struggle to take full advantage of valuable data. One main reason for this is that data is usually created and maintained by a range of organisations. This results in mismatches between datasets, i.e., datasets differ from one organisation to another not only in what is encoded but also in how it is encoded.
In order for organisations to use and digest heterogeneous data and uncover the untold business patterns, there is a growing interest in developing techniques that investigate complex data phenomena and facilitate better data interoperability (Doan, Halevy, & Ives, 2012; Doan, Noy, & Halevy, 2004; Duckham & Worboys, 2005; Huang, Lin, & Chan, 2012; Jiang, Zhang, Tang, & Nie, 2015; Lenzerini, 2002). Among the various techniques developed, ontology research is one discipline that can deal with data heterogeneity and improve data sharing (Kalfoglou & Schorlemmer, 2003; Mate et al., 2015; Noy, 2004). Ontology-based integration systems are usually characterised by a global ontology which represents a reconciled, integrated view of the underlying data sources. Systems taking this approach usually provide users with a uniform interface: all queries made to source data are expressed in terms of a global ontology, as are the query results. This frees the user from the need to understand each individual data source. Unfortunately, in many domains one faces the problem of either having no established ontology that can be readily employed in the integration work, or having existing ontologies that do not fit the purpose (e.g., not containing knowledge that sufficiently captures the semantics of the information under investigation).
∗ Corresponding author. Tel: +44 1912086822.
E-mail address: [email protected], [email protected]
http://dx.doi.org/10.1016/j.ipm.2016.02.003
In this paper we contribute a formal and semi-automated approach for ontology development. Rather than starting from scratch, we build an ontology by effectively discovering and using the knowledge that is buried in the datasets to be integrated. The method is based on Formal Concept Analysis (FCA) (Ganter & Wille, 1999; Ganter, Stumme, & Wille, 2005), which is a mathematical approach for data analysis. FCA supports ontology development by abstracting conceptual structures from attribute-based object descriptions, and it enables a considerable part of the ontology development activities to be automated.
Our research extends classical FCA theory to support ontology development for integrating datasets that exhibit implicit and ambiguous information. Implicit information is caused by the fact that some organisations tend to take some domain knowledge for granted, and do not explicitly specify it in their design documents or datasets. This can lead to an ontology that is ill-formed and does not correctly capture critical concepts and the semantics of the domain. Ambiguous information is due to the fact that organisations differ from each other in culture, conventions and requirements in system development, hence they may vary in how they choose to represent a business object, and at what levels of granularity such information is encoded. This causes inconsistencies between the datasets of different organisations.
We consider that overcoming this implicitness and ambiguity is an important step in ontology development. The work reported here is follow-on research to Beck et al. (2013), Fu and Cohn (2008a) and Fu and Cohn (2008b). In this paper we report further technical advances we have made. To restore implicit information, we introduce a rule based method. We discuss how rules are derived and deployed for recovering implicit information. To resolve ambiguous information, we define a set of primitive operations to deal with simple matches in data alignment. These operations are then composed to deal with more complicated matches. Finally, we report on our experiments, which were carried out to construct an ontology for integrating non-trivial datasets from several UK water companies. We measure the quality of the developed ontology by utilising the metrics of classical information theory and also in terms of its fitness to the application domain. Our experimental results demonstrate that the techniques described in this paper provide an effective mechanism for reconciling and harmonising heterogeneous data from disparate sources, and that they support the development of ontologies that better fit and respect the underlying knowledge structures of domains.
The remaining part of the paper is organised as follows. Section 2 reviews related research. Section 3 recalls relevant notions of FCA and outlines our framework for ontology development. Sections 4 and 5 present techniques that deal with implicit and ambiguous information. Section 6 discusses how to derive an ontology by using results generated from Sections 4 and 5. Section 7 reports our experimental results. Section 8 concludes the paper and suggests future research.
2. Related research
Several areas of research are relevant to this work. Firstly, integration techniques investigated in database and information integration are quite relevant. Various topics have been studied by these communities; the ones of most interest here are mapping discovery and schema integration, and techniques have been developed to support these (Bahga & Madisetti, 2015; Do & Rahm, 2002; Doan et al., 2004; Lenzerini, 2002; Liu & Zhang, 2014; Madhavan & Halevy, 2003; Pedersen, Pedersen, & Riis, 2013; Rahm & Bernstein, 2001). Mapping discovery takes two or more database schemas as input and produces a mapping between elements of the input schemas that correspond semantically to each other. Many of the early as well as current mapping solutions employ hand-crafted rules or heuristics to match schemas (Madhavan, Bernstein, & Rahm, 2001; Rahm & Bernstein, 2001). Examples of such heuristics include linguistic matching of schema element names, detecting similarity of structures of schema elements, and considering the patterns in relationships of the schema elements. Techniques have also been proposed that use learning based methods (Doan, Domingos, & Halevy, 2001; Neumann, Ho, Tian, Haas, & Meggido, 2002).
Schema integration constructs a global schema based on the inter-schema relationships produced in mapping discovery. Each mapping element is analysed to decide which representation of related elements should be included in the global schema. When a mapping describes the corresponding schema elements as identical, their integration is straightforward: one of the schema elements is simply included in the global schema. More frequently, the corresponding schema elements are not the same but are mutually related by some semantic properties, and schema merging is performed manually or semi-automatically with the assistance of domain engineers to guide the designers in their resolution.
Ontology research is another discipline that deals with data integration. A common definition of an ontology is that it is a formal, explicit specification of a domain of discourse (Gruber, 1993). As it provides a shared understanding and explicit specification of a domain, an ontology is considered to have a key role to play in data integration (Bakhtouchi, Bellatreche, & Ait-Ameur, 2011; Bian, Zhang, & Peng, 2011; Noy, 2004; Uschold & Grüninger, 2004; Yu et al., 2012). Unfortunately, for many domains one faces the need to develop ontologies from scratch (as there is no existing ontology that can be used readily), and a growing number of methods have been proposed in recent years to address the issues of ontology design and development. Most methods are based on the traditional knowledge engineering approach (Brockmans et al., 2006; Pinto & Martins, 2004; Sure, Tempich, & Vrandecic, 2006). These methods usually start with defining the domain and scope of ontologies. This is followed by a data acquisition process: important concepts are collected, a concept hierarchy is derived, and properties and semantic constraints are attached to concepts.
As developing ontologies from scratch is an expensive process, there has been increasing interest in reusing or merging existing ontologies (or other knowledge structures such as thesauri) that are developed independently in different applications (Duong, Truong, & Nguyen, 2012; Truong & Nguyen, 2012; Xie, Liu, & Guan, 2011; Yang, 2011). Central to these studies is research on ontology mapping and ontology integration. Approaches to ontology mapping are similar to the ones for matching database schemas and other structured data, and they use lexical and structural components of definitions to find correspondences. However, as an ontology captures richer data semantics than traditional database schemas, the methods for finding mappings tend to exploit these extra data semantics (Kalfoglou & Schorlemmer, 2003; Nguyen, 2007; Rodriguez & Egenhofer, 2003; Truong & Nguyen, 2012). For example, in Noy and Musen (2000) a tool has been developed that uses linguistic similarity matches between concepts for initiating mappings, and then uses the underlying ontological structures (classes, slots, facets) to suggest a set of heuristics for identifying further matches between the ontologies. In Duong and Jo (2012), a method has been proposed for mapping ontological concepts by propagating Priorly Matchable Concepts. The method exploits information such as concept types, relations and constraints to provide suggestions for possible concept matches. The method guides how to check the similarity between concepts in advance, and it reduces computational complexity by avoiding checking similarity among unmatchable concepts. In Nguyen (2006), an approach has been proposed to resolve three levels of ontology conflicts (instance level, concept level and relation level) using a consensus method. The techniques developed in Doan, Madhavan, Domingos, & Halevy (2003) and Spohr, Hollink, and Cimiano (2011) employ learning based techniques to find ontology mappings. They exploit information in data instances and the taxonomic structure of ontologies, and then use a probabilistic model to combine the results of different learners.
Based on the inter-ontology mappings derived in mapping discovery, a merging process integrates the source ontologies and generates a global ontology. However, deriving a meaningful ontology is a hard problem even when the ground set of inter-ontology mappings is provided, and most methods that support the merging process are performed in an interactive manner with the assistance of human users, as is done in database and information integration research.
Another branch of research studies ontology development and integration with formal methods. Of particular interest here is research based on Formal Concept Analysis (FCA) (Ganter & Wille, 1999; Ganter et al., 2005; Wille, 1982). FCA is a formal method for concept classification and conceptual structure derivation. FCA related tools enable considerable knowledge processing activities to be automated, particularly concept generation and hierarchy derivation. As a result, FCA has been attracting great interest as a means to support systematic, semi-automated development and integration of ontologies (Bai & Zhou, 2011; Formica, 2006; He & Wang, 2011; Nanda, Simpson, Kumara, & Shooter, 2006; Xia, 2013). For example, in Rouane, Valtchev, Sahraoui, & Huchard (2004) ontological hierarchy merging is studied in the framework of FCA by taking into account both taxonomic and other semantic relationships of ontologies. A method, FCA-MERGE, has been developed in Stumme and Maedche (2001) that uses FCA to support ontology integration. FCA-MERGE takes as input the two ontologies and a set of natural language documents, and computes a concept lattice from the two source ontologies using FCA techniques. The concept lattice is then exploited by domain experts to derive a merged ontology. In Zhao, Wang, and Halang (2006) a similarity method has been introduced to map ontology concepts based on Rough Set and Formal Concept Analysis theory. The idea is to construct a concept lattice from two source ontologies with FCA; the similarity measure of two concepts is then computed using Rough Set theory. In Chen, Bau, and Yeh (2011) the authors proposed a method that combines WordNet and Fuzzy Formal Concept Analysis techniques for merging ontologies. WordNet is first used to align concepts from a source ontology to concepts in a base ontology, and the remaining unmapped concepts are then aligned to the base ontology using a similarity measure based on fuzzy FCA.
Our approach is in line with FCA based research, yet it differs from previous studies in several aspects. Firstly, while most research focuses on similarity measures of ontology concepts, we contribute an integrated framework that offers a structural and systematic description of the ontology merging process. Secondly, with FCA as the backbone we investigate how to resolve implicit and ambiguous information. Previous research is either implicit on how these problems are resolved, or only addresses particular types of these problems. For example, in Rouane et al. (2004) there is an interesting discussion on attribute conflicts, but the authors do not address in detail how these problems are resolved. Thirdly, while most previous research considers one to one mappings between concepts, our method is able to deal with more complicated issues, i.e., an ontology concept may have multiple mappings from another ontology, which has not been investigated sufficiently in the literature. Finally, we applied the proposed techniques to non-trivial industrial datasets, and examined how effectively the proposed method can help with improving data interoperability. This has rarely been reported in other FCA-based works.
3. FCA terminologies and an FCA based framework for ontology development
In this section, we introduce the basic concepts of FCA and outline our framework for ontology development. We will use data and examples from the water infrastructure domain to present the techniques developed in this research.
FCA theory was developed in Wille (1982), and a typical task that FCA can perform is data analysis, making the conceptual structure of the data visible and accessible (Ganter & Wille, 1999; Ganter et al., 2005). Central to FCA is the notion of a formal context, which is defined as a triple K := ⟨G, M, I⟩, where G is a set of objects, M is a set of attributes, and I ⊆ G × M is a binary relation between G and M. A relation ⟨g, m⟩ ∈ I is read as “object g has the attribute m”. A formal context can be depicted by a cross table as shown in Fig. 1(a), where the elements on the left side are objects, the elements at the top are attributes, and the relations between them are represented by the crosses. A formal concept of a context K := ⟨G, M, I⟩ is defined as a pair (A, B), where A ⊆ G, B ⊆ M, A′ = B and B′ = A. A′ is the set of attributes common to all the objects in A, and B′ is the set of objects having the attributes in B. The extent of the concept (A, B) is A and its intent is B. The set of all formal concepts ordered by sub- and super-concept relations forms a concept lattice. Fig. 1(b) shows the concept lattice for the context in Fig. 1(a), where a node represents a concept labelled with its intensional and extensional description. The links represent the sub- and super-concept relations.
Fig. 1. (a) A formal context. (b) The concept lattice for the context in (a).
Fig. 2. (a) A many valued context. (b) A one valued context after conceptual scaling.
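The definitions above can be illustrated with a minimal Python sketch. The toy context below is hypothetical (it is not the context of Fig. 1), and the brute-force enumeration over all object subsets is only an illustration under that assumption; it is exponential in the number of objects and is not the algorithm used in this work.

```python
from itertools import combinations

# Hypothetical toy context (NOT the one in Fig. 1): object -> set of attributes.
context = {
    "o1": {"a", "b"},
    "o2": {"b", "c"},
    "o3": {"a", "b", "c"},
}
attributes = set().union(*context.values())


def common_attributes(objs):
    """A': attributes shared by every object in objs (all attributes if objs is empty)."""
    result = set(attributes)
    for o in objs:
        result &= context[o]
    return result


def common_objects(attrs):
    """B': objects that have every attribute in attrs."""
    return {o for o, has in context.items() if attrs <= has}


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    objs = sorted(context)
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            intent = common_attributes(subset)   # A'
            extent = common_objects(intent)      # A'' (closure of the subset)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts()
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Each pair produced this way satisfies A′ = B and B′ = A by construction, so ordering the pairs by inclusion of their extents gives the concept lattice of the toy context.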
The formal contexts introduced above are not the ones that occur most frequently in applications of FCA. Most often data is encoded in many valued contexts. A many valued context K := ⟨G, M, W, I⟩ consists of a set of objects G, a set of attributes M, a set of attribute values W, and a ternary relation I ⊆ G × M × W. A relation ⟨g, m, w⟩ ∈ I is read as “object g has the attribute m and its value is w”. Fig. 2(a) shows a many valued context which lists different water pipes having different attribute values. In order for FCA theory to be applied to a many valued context, it needs to be unfolded into a one valued context through conceptual scaling (Ganter & Wille, 1999). Fig. 2(b) shows the one valued context for the many valued context in Fig. 2(a) after conceptual scaling.
As the extent and intent of a concept overlap with those of its super- and sub-concepts, redundancy exists in a concept lattice. To prevent this, reduced labelling is introduced. A lattice with reduced labelling is obtained by replacing each concept (A, B) with (N(A), N(B)), where N(A) contains the non-redundant elements in A, and N(B) contains the non-redundant elements in B. An object o will appear in N(A) if the corresponding concept is the greatest lower bound of all concepts containing o. An attribute a will appear in N(B) if the corresponding concept is the least upper bound of all concepts containing a. Fig. 3(a) shows the lattice derived from the one in Fig. 1(b) with reduced labelling. Furthermore, we can eliminate from a lattice the concepts which do not possess their own attributes or objects. This leads to a structure called a Galois Sub Hierarchy (GSH). A GSH only consists of so called attribute concepts and object concepts. An object concept represents the smallest concept with this object in its extension, and an attribute concept represents the largest concept with this attribute in its intension. The GSH of the lattice in Fig. 3(a) is depicted in Fig. 3(b), where concepts 1, 5, 8 and 12 are removed due to the empty N(A) and N(B). Concept 2 is an attribute concept and concept 9 is an object concept.
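As an illustration of the conceptual scaling step mentioned earlier in this section, the following Python sketch unfolds a many valued context into a one valued one. It assumes plain nominal scaling (one binary attribute per attribute/value pair), which is only one of the scales described by Ganter and Wille; the pipe data and the attribute names `material` and `shape` are hypothetical and are not taken from Fig. 2.

```python
# Hypothetical many valued context: object -> {attribute: value}.
# A missing key means the object has no value for that attribute.
many_valued = {
    "pipe1": {"material": "clay", "shape": "circular"},
    "pipe2": {"material": "clay"},
    "pipe3": {"material": "pvc", "shape": "egg"},
}


def nominal_scale(mv_context):
    """Turn each (attribute, value) pair into a one valued attribute 'attribute=value'."""
    columns = sorted(
        {f"{a}={v}" for row in mv_context.values() for a, v in row.items()}
    )
    table = {
        obj: {f"{a}={v}" for a, v in row.items()}
        for obj, row in mv_context.items()
    }
    return columns, table


columns, one_valued = nominal_scale(many_valued)

# Print the result as a cross table.
print(columns)
for obj in sorted(one_valued):
    print(obj, ["X" if c in one_valued[obj] else "" for c in columns])
```

The resulting table has the form required by the formal context definition, so the classical FCA operations can be applied to it directly.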
With FCA theory as the backbone, we have developed a framework to support ontology development. The framework essentially consists of three components: Context Formation, Context Composition and Ontology Derivation, as illustrated in Fig. 4. To generate an integrated ontology for two datasets, Context Formation takes the datasets as inputs and generates a one valued context for each of them. The generated contexts are then fed to Context Composition to produce an integrated GSH. Ontology Derivation takes the GSH generated in Context Composition and generates an integrated ontology as well as concept mappings between the two datasets. We will describe Context Formation in Section 4, and elaborate on Context Composition and Ontology Derivation in Sections 5 and 6.
4. Context formation
Fig. 5 shows the components of Context Formation. Given a dataset, Data Acquisition derives the concepts encoded in the dataset as well as their attribute definitions, and the result is a many valued context for the dataset. The component looks at sources from which various feature types (concepts) and their definitions can be extracted. The most common sources here are text/web documents created by system designers/developers for specifying system requirements and design. Other important sources are conceptual/logical data models of the dataset concerned. The generated context is then fed to the Information Explication component to restore implicit information. The Conceptual Scaling component transforms a many valued context into a one valued context, in order for classic FCA techniques to be applicable.
Fig. 3. (a) A lattice with reduced labelling. (b) A Galois sub hierarchy.
Fig. 4. An FCA based framework for ontology development.
Fig. 5. Context formation.
4.1. Implicit information
The main challenge here is to deal with implicit information. Implicit information is caused by several factors. As an example in the water infrastructure domain, when defining a feature type, organisations tend to explicitly state specific properties, but leave common ones unarticulated in their design documents. For instance, a sewer pipe is characterised by how it conveys sewage: either by gravity or by pressure, with the gravity distribution employed more often than the pressurised form. Most water companies explicitly specify the pressurised characteristic of a sewer pipe, but not the gravity one. Furthermore, many organisations take some domain knowledge for granted, and do not encode it explicitly. For example, a sludge sewer is usually pressurised rather than gravity. As this is well understood in the domain, many water companies choose not to encode this information explicitly. Table 1 shows a portion of a many valued context that is generated for a sewerage dataset, where many blank cells exist due to implicit or unarticulated domain knowledge.
Table 1
A many valued context table.
Object    | Size | What   | How         | Location
pipeType1 | Main |        | Pressurised |
pipeType2 | Main |        |             | Aboveground
pipeType3 | Main | Sludge |             |
pipeType4 | Main |        |             |
Fig. 6. The GSH for the formal context in Table 1.
The main consequence of this is that it can lead to an ontology that is ill-formed and does not correctly capture critical concepts and semantics of the domain. Fig. 6 shows the GSH for the context in Table 1. Due to implicit information, many important concepts, such as gravity sewer and underground sewer, are missing from the hierarchy and therefore from the resultant ontology. Furthermore, different organisations may differ in what they choose not to articulate in their datasets. We believe this hidden knowledge is one of the main reasons that hinder data compatibility or interoperability across organisations.
4.2. Rule elicitation
We classify implicit information into two groups: attribute-specific and object-specific. Attribute-specific implicit information is concerned with a particular attribute, and is applicable to all objects having that attribute. Object-specific implicit information is concerned with an attribute of particular objects only. An example of the former is the how attribute in Table 1. The unarticulated domain knowledge here is that a sewer pipe carries sewage by gravity if not explicitly specified, and this applies to all sewerage pipes having the how attribute. An example of object-specific implicit information is the how attribute of pipeType3. The implicit information here is that if a pipe carries sludge sewage, by default it carries it by pressure. This is relevant to the how attribute, but applies to pipeType3 only (pipes that carry sludge) and therefore is classified as object-specific implicit information.
We use a rule based approach to recover implicit information. As implicit information is largely unarticulated domain knowledge, we need to work closely with domain experts to acquire these rules. We have two types of rules: attribute rules dealing with attribute-specific implicit information, and object rules dealing with object-specific implicit information.
To elicit attribute rules, we iterate over each attribute. An attribute has implicit information if it has missing values for some objects. Each attribute with implicit information in a context table incurs a rule. Involvement of domain experts is required at this point to generate such a rule. For example, Rule 2 in Fig. 7 is collected for the how attribute in Table 1.
To elicit object rules, we iterate over each object in the context, and examine each of its attributes that do not have a value. If an attribute has implicit information which cannot be recovered with an attribute rule, an object rule is elicited to recover the implicit information with the help of domain experts. For example, for object pipeType3, the attribute how has implicit information. As the implicit information for how in this case is pressurised, it cannot be recovered with Rule 2 discussed above. An object rule, Rule 4 in Fig. 7, is acquired in this case for pipeType3.
1) a sewerage pipe usually carries waste water unless it is specified otherwise;
2) if a sewerage pipe is not explicitly specified as a pressurised pipe, then it is a gravity sewer;
3) if a sewerage pipe is not explicitly specified as an above ground pipe, then it is an underground pipe;
4) a sludge sewerage pipe is a pressurised pipe unless it is specified otherwise.
Fig. 7 shows a set of rules elicited for the context table in Table 1, where Rules 1, 2 and 3 are attribute rules. Rule 4 is an object rule, which works for the how attribute of the object pipeType3 only.
4.3. Rule deployment
This step is concerned with how a context table can be manipulated to restore implicit information. To recover implicit information for an object, we first identify a set of rules applicable to it. This includes all relevant attribute rules and object rules for the object. Each attribute of the object is examined to see if it has implicit information. If the answer is yes, the relevant attribute rule is identified. The identification of an object rule is straightforward as it is linked to the object concerned directly. For an object, if both an attribute rule and an object rule are identified as relevant to an attribute, the object rule overrides the attribute rule when restoring implicit information. For example, for pipeType3 (in Table 1), both Rule 2 and Rule 4 (in Fig. 7) deal with the how attribute, but only Rule 4 is applied when restoring implicit information for how of this object.
Once applicable rules have been identified, we generate new objects by applying different combinations of the rules. This allows objects with different combinations of attributes to be identified. Each derived object retains the existing object attribute relationships of the original object and derives new ones (for attributes having missing values) by applying the corresponding rules. For example, for sewer pipeType1, there are two attributes that have implicit information, what and location. Accordingly, two attribute rules are identified: Rule 1 for the what attribute and Rule 3 for the location attribute. There is no object rule identified for pipeType1. By applying different combinations of the rules, three new objects are derived from pipeType1: pipeType1_object1 by applying Rule 1, pipeType1_object2 by applying Rule 3, and pipeType1_object3 by applying Rules 1 and 3. All new objects retain the existing object attribute relationships of pipeType1, with different relationships derived due to the different rules applied. Depending on the number of rules applicable, each original context object derives a different number of new objects: n applicable rules yield 2^n − 1 non-empty combinations. For example, there are 2 applicable rules each for pipeType1, pipeType2 and pipeType3, and the combinations of these rules generate 3 derived objects for each original object. pipeType4 has 3 applicable rules and 7 new objects are derived.
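The rule deployment step can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the encoding of rules as default values, the dictionary names and the helper functions are assumptions made for the example, while the data comes from Table 1 and the rules from Fig. 7.

```python
from itertools import combinations

# Many valued context of Table 1; a missing key means the value is not stated.
table1 = {
    "pipeType1": {"size": "Main", "how": "Pressurised"},
    "pipeType2": {"size": "Main", "location": "Aboveground"},
    "pipeType3": {"size": "Main", "what": "Sludge"},
    "pipeType4": {"size": "Main"},
}

# Attribute rules (Rules 1 to 3 in Fig. 7), encoded here as a default value per attribute.
attribute_rules = {"what": "Wastewater", "how": "Gravity", "location": "Underground"}
# Object rule (Rule 4 in Fig. 7): a default for one attribute of one object.
object_rules = {("pipeType3", "how"): "Pressurised"}


def applicable_rules(obj, row):
    """Return {attribute: value} for each missing attribute; object rules override."""
    rules = {}
    for attr, default in attribute_rules.items():
        if attr not in row:
            rules[attr] = object_rules.get((obj, attr), default)
    return rules


def derive_objects(obj, row):
    """Create one derived object per non-empty combination of applicable rules."""
    rules = applicable_rules(obj, row)
    attrs = list(rules)
    derived = {}
    count = 0
    for r in range(1, len(attrs) + 1):
        for combo in combinations(attrs, r):
            count += 1
            new_row = dict(row)  # keep the existing relationships
            new_row.update({a: rules[a] for a in combo})
            derived[f"{obj}_object{count}"] = new_row
    return derived


for obj, row in table1.items():
    print(obj, len(derive_objects(obj, row)))
```

With two applicable rules an object yields three derived objects and with three applicable rules it yields seven, which agrees with the counts given above.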
Table 2 lists the many valued context after implicit information has been restored with rules. This many valued context is then fed to the Conceptual Scaling component (as shown in Fig. 5) to generate a one valued context table. Table 3 lists the one valued context table after the conceptual scaling of the context in Table 2.
Table 2
A many valued context table after information explication.
Object            | Size | What       | How         | Location
pipeType1         | Main |            | Pressurised |
pipeType1_object1 | Main | Wastewater | Pressurised |
pipeType1_object2 | Main |            | Pressurised | Underground
pipeType1_object3 | Main | Wastewater | Pressurised | Underground
pipeType2         | Main |            |             | Aboveground
pipeType2_object1 | Main | Wastewater |             | Aboveground
pipeType2_object2 | Main |            | Gravity     | Aboveground
pipeType2_object3 | Main | Wastewater | Gravity     | Aboveground
pipeType3         | Main | Sludge     |             |
pipeType3_object1 | Main | Sludge     | Pressurised |
pipeType3_object2 | Main | Sludge     |             | Underground
pipeType3_object3 | Main | Sludge     | Pressurised | Underground
pipeType4         | Main |            |             |
pipeType4_object1 | Main | Wastewater |             |
pipeType4_object2 | Main |            | Gravity     |
pipeType4_object3 | Main |            |             | Underground
pipeType4_object4 | Main | Wastewater | Gravity     |
pipeType4_object5 | Main | Wastewater |             | Underground
pipeType4_object6 | Main |            | Gravity     | Underground
pipeType4_object7 | Main | Wastewater | Gravity     | Underground
Table 3
A one valued context table after conceptual scaling of the many valued context in Table 2.
Object            | Main | Wastewater | Sludge | Pressurised | Gravity | Underground | Aboveground
pipeType1         | X    |            |        | X           |         |             |
pipeType1_object1 | X    | X          |        | X           |         |             |
pipeType1_object2 | X    |            |        | X           |         | X           |
pipeType1_object3 | X    | X          |        | X           |         | X           |
pipeType2         | X    |            |        |             |         |             | X
pipeType2_object1 | X    | X          |        |             |         |             | X
pipeType2_object2 | X    |            |        |             | X       |             | X
pipeType2_object3 | X    | X          |        |             | X       |             | X
pipeType3         | X    |            | X      |             |         |             |
pipeType3_object1 | X    |            | X      | X           |         |             |
pipeType3_object2 | X    |            | X      |             |         | X           |
pipeType3_object3 | X    |            | X      | X           |         | X           |
pipeType4         | X    |            |        |             |         |             |
pipeType4_object1 | X    | X          |        |             |         |             |
pipeType4_object2 | X    |            |        |             | X       |             |
pipeType4_object3 | X    |            |        |             |         | X           |
pipeType4_object4 | X    | X          |        |             | X       |             |
pipeType4_object5 | X    | X          |        |             |         | X           |
pipeType4_object6 | X    |            |        |             | X       | X           |
pipeType4_object7 | X    | X          |        |             | X       | X           |
5. Context composition
Context composition takes two formal contexts as input, and generates an integrated GSH. The main components of Context Composition are Context Integration and Hierarchy Generation, as shown in Fig. 8.
Fig. 8. Context composition.
The main challenge here is to deal with ambiguous information during context integration, i.e., different terms may be employed to refer to the same attribute, and attributes may be modelled at different levels of granularity. An example here is that one dataset may model a sewerage pipe as either main or lateral and another may classify it as trunk main, non-trunk main, or private pipe. Attribute disambiguation is a process to match attributes from different datasets. In this research we use a pre-defined data dictionary developed in Fu and Cohn (2008a) to disambiguate attributes. The data dictionary maintains a set of terms that describe concepts in a domain, as well as their terminological relationships, e.g. BT/NT (Broader/Narrower Term) etc. Using the data dictionary, we can decide the semantic relationship of two attributes. In what follows, we will use the context tables K1 and K2 shown in Tables 4 and 5 to illustrate the context integration process.
Table 4
Context K1.
Main | Operational | Abandoned | Abandoned intact | Proposed recommission
K1O1 X X
K1O2 X X
K1O3 X X
K1O4 X X X
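A lookup against such a data dictionary could be sketched in Python as below. The structure is an assumption made for illustration: the actual dictionary of Fu and Cohn (2008a) is not reproduced here, the names `BROADER`, `EQUIVALENT` and `relationship` are hypothetical, and the few entries only mirror the term relationships used in the worked examples of Section 5.1.

```python
# Hypothetical dictionary entries: term -> its broader term (BT).
BROADER = {
    "abandoned destroyed": "abandoned",
    "proposed recommission": "proposed",
}
# Hypothetical equivalence entries, stored as unordered pairs of terms.
EQUIVALENT = {frozenset({"live", "operational"})}


def relationship(term, other):
    """Classify how `other` relates to `term`: equivalent, broader, narrower or none."""
    if term == other or frozenset({term, other}) in EQUIVALENT:
        return "equivalent"
    # Walk up the BT chain from `term`: reaching `other` means `other` is broader.
    t = term
    while t in BROADER:
        t = BROADER[t]
        if t == other:
            return "broader"
    # Walk up the BT chain from `other`: reaching `term` means `other` is narrower.
    t = other
    while t in BROADER:
        t = BROADER[t]
        if t == term:
            return "narrower"
    return "none"


print(relationship("abandoned destroyed", "abandoned"))
print(relationship("proposed", "proposed recommission"))
print(relationship("live", "operational"))
print(relationship("standby", "main"))
```

The four possible answers correspond to the four mapping types that the context integration process distinguishes.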
Given two contexts K1 := ⟨G1, M1, I1⟩ and K2 := ⟨G2, M2, I2⟩, the integrated context K := ⟨G, M, I⟩ is computed by first performing a disjoint union of the object sets of the two contexts, that is,
G = G1 ∪∗ G2 (1)
M and I are assigned M1 and I1 from K1 at this stage, i.e., M = M1 and I = I1. Table 6 shows the context K after the above operations.
The next step identifies the semantic relationship between an attribute in M2 and an attribute in M. For each attribute in M2, we derive the new attributes and relationships to be added to the context table K. We use attExt to denote the set of attributes to be added to M, and use relExt to denote the set of relationships to be added to I. After each round of mapping, the attribute set M and relationship set I of K are calculated as:
M = M ∪ attExt (2)
I = I ∪ relExt (3)
Table 5
Context K2.
Object | Main | Live | Abandoned destroyed | Proposed | Standby | Abandoned intact
K2O1   | X    |      | X                   |          |         |
K2O2   | X    | X    |                     |          |         |
K2O3   | X    |      |                     | X        |         |
K2O4   | X    |      |                     |          | X       |
K2O5   | X    |      |                     |          |         | X
Table 6
The context table after object union operation.
Main | Operational | Abandoned | Abandoned intact | Proposed recommission
K1O1 X X
K1O2 X X
K1O3 X X
K1O4 X X X
K2O1
K2O2
K2O3
K2O4
K2O5
Table 7
Primitive matches.
Ai finds an equivalent term Aj in M: attExt = ∅; relExt = {⟨O, Aj⟩ : ⟨O, Ai⟩ ∈ I2}
Ai finds a broader term Aj in M: attExt = {Ai}; relExt = {⟨O, Ai⟩, ⟨O, Aj⟩ : ⟨O, Ai⟩ ∈ I2}
Ai finds a narrower term Aj in M: attExt = {Ai}; relExt = {⟨Om, Ai⟩ : ⟨Om, Ai⟩ ∈ I2} ∪ {⟨On, Ai⟩ : ⟨On, Aj⟩ ∈ I}
Ai finds no match in M: attExt = {Ai}; relExt = {⟨O, Ai⟩ : ⟨O, Ai⟩ ∈ I2}
An attribute could find mappings of different types, including 1 to 1 mappings or 1 to many mappings, and accordingly different operations are needed for context table manipulation. In what follows, we first describe primitive operations, which deal with 1 to 1 mappings, and we will then discuss how the primitive operations can be composed to deal with 1 to many mappings.
5.1. Primitive operations
For a given attribute Ai ∈ M2 of K2, four types of mapping can be identified from M of K. Table 7 summarises the types of mapping, as well as the attributes (i.e. attExt) and relationships (i.e. relExt) to be added to K for each mapping type, where Aj denotes the match from M of K.
I. Ai finds an equivalent attribute Aj ∈ M of K. In this case, Ai will be unified with Aj. The context table K is expanded with relationships between Aj and the objects that have relationships with Ai in K2. For example, for the context K2 in Table 5, the attribute live finds an equivalent attribute operational in K. They are unified as operational in K, i.e., attExt = ∅ (no new attribute to be added to K). Since there exists a relationship ⟨K2O2, live⟩ in K2, a new relationship is established in K, i.e., relExt = {⟨K2O2, operational⟩}. Similarly we found an equivalent match main in K for attribute main in K2. The resultant context table is shown in Table 8, where the newly added relationships are shaded for the purpose of readability.
II. Ai finds a match Aj that is more generic than it. In this case, the resulting context K is expanded with attribute Ai and the relationships between Ai and objects from K2. New relationships are also established in K between those objects having attribute Ai and the attribute Aj, as objects that have Ai also have attribute Aj. For example, the closest match for abandoned destroyed in K2 is abandoned, which is a broader term to it. Then attribute abandoned destroyed is added to K, and the following two relationships are added to K:
⟨K2O1, abandoned destroyed⟩
⟨K2O1, abandoned⟩
where the first originates from K2 due to the existence of ⟨K2O1, abandoned destroyed⟩ in K2. The second is derived due to the fact that abandoned destroyed is a more specific feature than abandoned, and therefore the existence of ⟨K2O1, abandoned destroyed⟩ derives ⟨K2O1, abandoned⟩. The result of these context extensions is highlighted in Table 9.
Table 8
The context table after an equivalent match.
Main | Operational | Abandoned | Abandoned intact | Proposed recommission
K1O1 X X
K1O2 X X
K1O3 X X
K1O4 X X X
K2O1 X
K2O2 X X
K2O3 X
K2O4 X
K2O5 X
Table 9
The context table after a broader match.
Main | Operational | Abandoned | Abandoned intact | Proposed recommission | Abandoned destroyed
K1O1 X X
K1O2 X X
K1O3 X X
K1O4 X X X
K2O1 X X X
K2O2 X X
K2O3 X
K2O4 X
K2O5 X
Table 10
The context table after a narrower match.
Main | Operational | Abandoned | Abandoned intact | Proposed recommission | Abandoned destroyed | Proposed
K1O1 X X
K1O2 X X X
K1O3 X X
K1O4 X X X
K2O1 X X X
K2O2 X X
K2O3 X X
K2O4 X
K2O5 X
III. Ai finds a match Aj that is more specific than it. In this case, the context K is expanded with Ai and the existing relationships between Ai and objects from K2. New binary relationships are established in K between the attribute Ai (originally from K2) and those objects having relationships with Aj (originally from K1). The rationale is that if Ai is a more generic feature than Aj, then any object which has attribute Aj should also have attribute Ai. For example, the closest match for the attribute proposed in K2 is proposed recommission in K, which is a narrower term, so the attribute proposed is added to K. The following two relationships are added to K:
<K2O3, proposed>
<K1O2, proposed>
where the first originates from K2 due to the existence of <K2O3, proposed> in K2. The second is established due to the existence of <K1O2, proposed recommission> in K and the fact that proposed is a more generic feature than proposed recommission. The resulting context is shown in Table 10.

Table 11
The context table after a non-match.
Main | Operational | Abandoned | Abandoned intact | Proposed recommission | Abandoned destroyed | Proposed | Standby
K1O1 X X
K1O2 X X X
K1O3 X X
K1O4 X X X
K2O1 X X X
K2O2 X X
K2O3 X X
K2O4 X X
K2O5 X

Table 12
The context table after a composite match (the integrated context table).
Main | Operational | Abandoned | Abandoned intact | Proposed recommission | Abandoned destroyed | Proposed | Standby
K1O1 X X
K1O2 X X X
K1O3 X X
K1O4 X X X
K2O1 X X X
K2O2 X X
K2O3 X X
K2O4 X X
K2O5 X X X

IV. Ai finds no match in K. In this case, the context K is simply expanded with Ai and the existing relationships between Ai and objects originating from K2. For example, there is no semantic match in K for the attribute standby of K2. In this case, K is extended with attExt = {standby} and relExt = {<K2O4, standby>}, as shown in Table 11.
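The four primitive operations can be sketched in a few lines of Python. This is only an illustrative reading of the mapping types described above, not the implementation used in the paper: the function names, the dictionary representation of a context and the match-type labels are assumptions, and the toy contexts hold only the relationships quoted in the text.

```python
from typing import Dict, Optional, Set, Tuple

Context = Dict[str, Set[str]]   # object -> attributes it is related to (assumed layout)
Relation = Tuple[str, str]      # (object, attribute)


def objects_with(context: Context, attribute: str) -> Set[str]:
    """Objects of a context that carry the given attribute."""
    return {obj for obj, attrs in context.items() if attribute in attrs}


def primitive_extension(k: Context, k2: Context, a_i: str, kind: str,
                        a_j: Optional[str] = None) -> Tuple[Set[str], Set[Relation]]:
    """Return (attExt, relExt) for attribute a_i of K2 and its match a_j in K."""
    from_k2 = objects_with(k2, a_i)
    if kind == "equivalent":      # type I: a_i is unified with a_j
        return set(), {(o, a_j) for o in from_k2}
    if kind == "broader":         # type II: a_j is more generic than a_i
        return {a_i}, {(o, a) for o in from_k2 for a in (a_i, a_j)}
    if kind == "narrower":        # type III: a_j is more specific than a_i
        return {a_i}, {(o, a_i) for o in from_k2 | objects_with(k, a_j)}
    if kind == "none":            # type IV: no match in K
        return {a_i}, {(o, a_i) for o in from_k2}
    raise ValueError(f"unknown match type: {kind}")


# Toy fragments holding only the relationships quoted in the text.
K = {"K1O2": {"proposed recommission"}}
K2 = {"K2O1": {"abandoned destroyed"}, "K2O2": {"live"},
      "K2O3": {"proposed"}, "K2O4": {"standby"}}

att_ext, rel_ext = primitive_extension(K, K2, "abandoned destroyed", "broader", "abandoned")
```

For the broader match in the example, `att_ext` holds the new attribute abandoned destroyed and `rel_ext` holds the two relationships listed under mapping type II.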
5.2. Composite operations
In many situations, an attribute may have multiple matches, each of a different type, e.g. an equivalent match and a broader match at the same time. The primitive operations discussed above can be composed to deal with these complex cases. For an attribute A ∈ M2 of K2, if a set of matches {A1, A2, …, An} is identified from M of K, the context K is extended as follows:
$$M = M \cup \bigcup_{j=1}^{n} \mathit{attExt}_{A_j} \tag{4}$$
$$I = I \cup \bigcup_{j=1}^{n} \mathit{relExt}_{A_j} \tag{5}$$
where $\mathit{attExt}_{A_j}$ and $\mathit{relExt}_{A_j}$ respectively denote the attribute and relationship sets that are derived when A is matched to Aj with the primitive operations discussed in Section 5.1.
For example, for the attribute abandoned intact, two matches are found in K: the equivalent match abandoned intact and the generic match abandoned. For the equivalent match abandoned intact, the following are generated:
attExt = ∅
relExt = {<K2O5, abandoned intact>}
For the generic match abandoned, the following are generated:
attExt = {abandoned intact}
relExt = {<K2O5, abandoned intact>, <K2O5, abandoned>}
Adding these into K results in the formal context shown in Table 12, which is also the final integrated context table. The GSH constructed from this integrated context is illustrated in Fig. 9.
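The composition in Eqs. (4) and (5) amounts to taking set unions over the per-match extensions. A minimal sketch follows, assuming M is held as a set of attribute names and I as a set of (object, attribute) pairs; the function name `extend_context` is hypothetical, and the starting relation set I contains only two relationships quoted earlier in the text rather than the full context.

```python
from typing import Iterable, Set, Tuple

Relation = Tuple[str, str]                    # (object, attribute)
Extension = Tuple[Set[str], Set[Relation]]    # (attExt, relExt) of one match


def extend_context(m: Set[str], i: Set[Relation],
                   extensions: Iterable[Extension]) -> Tuple[Set[str], Set[Relation]]:
    """Apply Eqs. (4) and (5): union all attExt into M and all relExt into I."""
    new_m, new_i = set(m), set(i)
    for att_ext, rel_ext in extensions:
        new_m |= att_ext
        new_i |= rel_ext
    return new_m, new_i


# Attribute set of K before the composite match (the column headers of Table 11).
M = {"main", "operational", "abandoned", "abandoned intact",
     "proposed recommission", "abandoned destroyed", "proposed", "standby"}
# Only a fragment of I, for illustration.
I = {("K2O1", "abandoned destroyed"), ("K2O1", "abandoned")}

# The two matches found for the attribute "abandoned intact" of K2.
equivalent = (set(), {("K2O5", "abandoned intact")})
broader = ({"abandoned intact"},
           {("K2O5", "abandoned intact"), ("K2O5", "abandoned")})

M_new, I_new = extend_context(M, I, [equivalent, broader])
```

Because the unions are idempotent, the relationship <K2O5, abandoned intact> that both matches produce is stored once, and M is unchanged since abandoned intact is already an attribute of K.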
Fig. 9. The GSH derived from the context in Table 12.
Fig. 10. Ontology derivation.
6. Ontology derivation
The ontology derivation component of our framework takes the GSH generated in Section 5 and generates an ontological structure¹. Fig. 10 shows the components of ontology derivation. The GSH is exploited to derive several types of information, including ontological concepts, subsumption relationships between concepts, and attributes of concepts. The information identified forms an ontological structure from which a full ontology can be developed. The mappings between concepts of different datasets can also be identified from the GSH.
6.1. Mapping identification
This subcomponent derives mappings between concepts of two datasets. Given a formal concept in a GSH, if its extent contains more than one object (e.g. the extent of node 3 is {K1O1, K2O2} in Fig. 9), this indicates a potential mapping between these source concepts. Domain engineers are asked at the evaluation stage to judge whether an identified mapping is correct. If the answer is negative, features need to be identified to differentiate one concept from another. This often involves identifying new attributes or relationships of the concepts concerned. The existence of incorrect matches triggers the need to iterate the context composition or integration operations.
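The extent test described above can be sketched as follows. This is a hedged illustration rather than the paper's procedure: it assumes each GSH node is available as a set of object names, that the dataset of an object can be read from the first two characters of its name (K1, K2), and that only cross-dataset pairs are of interest. Node 5 and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def dataset_of(obj: str) -> str:
    """Assumed naming convention: 'K1O1' belongs to dataset 'K1'."""
    return obj[:2]


def candidate_mappings(extents: Dict[int, Set[str]]) -> List[Tuple[int, str, str]]:
    """List (node, object, object) triples for extents holding several objects."""
    found = []
    for node, extent in sorted(extents.items()):
        if len(extent) > 1:
            for a, b in combinations(sorted(extent), 2):
                if dataset_of(a) != dataset_of(b):   # keep cross-dataset pairs only
                    found.append((node, a, b))
    return found


# Node 3 is the example from Fig. 9; node 5 is a made-up single-object node.
extents = {3: {"K1O1", "K2O2"}, 5: {"K2O4"}}
mappings = candidate_mappings(extents)
```

Each returned triple is only a candidate; as the text notes, a domain engineer still has to confirm or reject it.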
6.2. Concept/relationship/attribute identification
As we employ a GSH in this research, intermediate, abstract concepts are removed in the context integration step, and the resulting hierarchy consists only of object concepts and attribute concepts. Object concepts have to be kept in the resultant ontological hierarchy, as they correspond to the initial concepts (either explicit or implicit) of the datasets and therefore need to remain in the ontological structure to respect the initial class specification of the datasets. For an attribute concept, the assistance of domain engineers is required to decide whether it should be kept or discarded, taking into account its significance or interest to the application. When an attribute concept is discarded from a GSH, all elements in its intent are passed on to its sub-concepts, and super-/sub-concept relationships are established between its super-concepts and sub-concepts.

¹ The term ontological structure is used here to mean that the derived conceptual structure only contains limited data semantics, i.e. only concepts, attributes and is-a relationships are identified. Further development is still required to capture other data semantics to generate a full ontology.

Table 13
Source datasets.
Dataset   No of concepts   No of attributes
D1        54               18
D2        47               21
D3        43               22
D4        66               23
After a decision has been made on which concepts are to be kept in the resultant ontology, the rules for identifying the relationships and attributes of a concept are straightforward:
• All elements in the intent of a formal concept are declared as attributes of the ontological concept.
• Sub/super relations between two formal concepts are identified as is-a relationships between the corresponding ontological concepts.
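The discard step and the two rules above can be sketched as below. This is a minimal illustration under stated assumptions: the hierarchy is a dictionary of nodes, each holding the attributes attached to it and the names of its direct super-concepts, and the three-node hierarchy, the class `Concept` and the functions `discard` and `derive` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Concept:
    intent: Set[str]                                   # attributes attached to the node
    parents: Set[str] = field(default_factory=set)     # names of direct super-concepts


def discard(hierarchy: Dict[str, Concept], name: str) -> None:
    """Drop an attribute concept: pass its intent down and relink its sub-concepts."""
    removed = hierarchy.pop(name)
    for concept in hierarchy.values():
        if name in concept.parents:
            concept.parents.discard(name)
            concept.parents |= removed.parents   # link sub-concept to the super-concepts
            concept.intent |= removed.intent     # sub-concept inherits the intent


def derive(hierarchy: Dict[str, Concept]) -> Tuple[Dict[str, Set[str]], Set[Tuple[str, str]]]:
    """Rule 1: intent elements become attributes. Rule 2: sub/super links become is-a."""
    attributes = {name: set(c.intent) for name, c in hierarchy.items()}
    is_a = {(name, parent) for name, c in hierarchy.items() for parent in c.parents}
    return attributes, is_a


# Hypothetical hierarchy: two attribute concepts and one object concept.
gsh = {
    "A_main": Concept({"main"}),
    "A_abandoned": Concept({"abandoned"}, {"A_main"}),
    "obj_x": Concept({"abandoned intact"}, {"A_abandoned"}),
}
discard(gsh, "A_abandoned")
attributes, is_a = derive(gsh)
```

After the discard, `obj_x` carries both abandoned intact and abandoned and is linked directly to `A_main`, so no attribute information is lost when the intermediate node is removed.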
7. Empirical evaluation
An evaluation of the proposed techniques has been performed on several industrial datasets. We first describe the experimental setup and the ontology similarity measures employed in the evaluation. We then report on the evaluation results.
7.1. Experiment setup and ontology similarity measures
The datasets we used for our experiments were sourced from four UK water companies. These datasets essentially encode the same types of information, including various water pipes, metering and treatment facilities for transporting fresh water/waste water for customers across the UK. However, each organisation records its information with little thought towards interoperability with others. This results in data heterogeneities. Due to the data confidentiality agreements we have with our industrial partners, we cannot publish these datasets. Nevertheless, we list the statistics on the datasets in Table 13.
The mapping and integration was carried out in a semi-automated manner: data acquisition and attribute disambiguation were conducted manually, the open source tool Galicia (Valtchev et al., 2003) was employed for context manipulation and GSH generation, and all other processes, such as information explication and conceptual scaling, were completed with Java and SQL code. The evaluation was performed in three phases.
• Phase I experiments constructed local ontologies for each dataset involved. Pairwise comparison was conducted to measure the similarity of these local ontologies, and the results then served as benchmarks for the subsequent evaluation.
• Phase II experiments studied how implicit information impacts on ontology interoperability, and demonstrated how information explication can help with ontology alignment.
• Phase III compared an ontology developed in this research with a handcrafted ontology developed with a traditional knowledge engineering approach. The performance of the two ontologies was evaluated by studying how well they fit and respect the knowledge structures of the datasets to be integrated.
Evaluation was performed at two levels: the lexical level and the taxonomic level. Lexical level evaluation reflects how well the lexical terms of a source ontology cover those of a target ontology. Taxonomic level evaluation examines how well the conceptual hierarchy of a source ontology resembles that of a target ontology. We employ the ontology measures proposed in Dellschaft and Staab (2006) and Maedche and Staab (2002) in our experiments. Lexical precision and recall of a source ontology $O_S$ against a target ontology $O_T$ are computed as:
$$LP(O_S, O_T) = \frac{|C_S \cap C_T|}{|C_S|} \tag{6}$$
$$LR(O_S, O_T) = \frac{|C_S \cap C_T|}{|C_T|} \tag{7}$$
where $C_S$ (or $C_T$) is the set of terms describing concepts in $O_S$ (or $O_T$).
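The lexical F-measure, LF, is used for balancing the precision and recall values, and is calculated as the harmonic mean of LP and LR:
$$LF(O_S, O_T) = \frac{2 \cdot LP(O_S, O_T) \cdot LR(O_S, O_T)}{LP(O_S, O_T) + LR(O_S, O_T)} \tag{8}$$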
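The lexical measures reduce to set operations on the concept terms of the two ontologies. A minimal sketch, assuming each ontology's terms are given as a Python set; the term sets below are invented for illustration and do not come from the confidential datasets.

```python
from typing import Set


def lexical_precision(cs: Set[str], ct: Set[str]) -> float:
    """LP: share of source terms that also occur in the target ontology."""
    return len(cs & ct) / len(cs)


def lexical_recall(cs: Set[str], ct: Set[str]) -> float:
    """LR: share of target terms that also occur in the source ontology."""
    return len(cs & ct) / len(ct)


def harmonic_mean(p: float, r: float) -> float:
    """F-measure of two scores; defined as 0.0 when both are zero (assumption)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# Hypothetical term sets for a source and a target ontology.
source_terms = {"pipe", "valve", "meter", "pump"}
target_terms = {"pipe", "valve", "hydrant"}

lp = lexical_precision(source_terms, target_terms)   # 2 shared terms out of 4
lr = lexical_recall(source_terms, target_terms)      # 2 shared terms out of 3
lf = harmonic_mean(lp, lr)
```

Swapping the two arguments of `lexical_precision` gives the recall, which is why LP tables computed for both directions of a pair carry the recall values as well.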
Taxonomic level measures are divided into local and global measures. Local measures compare the similarity of the hierarchical positions of two concepts in the source and the target ontologies. For local taxonomic precision, the similarity of two concepts is computed based on the common semantic cotopy from the concept hierarchies. The common semantic cotopy includes all the common super- and sub-concepts of a concept pair. Given such a semantic cotopy ce, the local taxonomic precision tp and recall tr of two concepts $c_1 \in C_S$ and $c_2 \in C_T$ are defined as
$$tp(c_1, c_2, O_S, O_T) = \frac{|ce(c_1, O_S) \cap ce(c_2, O_T)|}{|ce(c_1, O_S)|} \tag{9}$$
$$tr(c_1, c_2, O_S, O_T) = \frac{|ce(c_1, O_S) \cap ce(c_2, O_T)|}{|ce(c_2, O_T)|} \tag{10}$$
Since $tp(c_2, c_1, O_T, O_S) = \frac{|ce(c_1, O_S) \cap ce(c_2, O_T)|}{|ce(c_2, O_T)|}$, we have
$$tr(c_1, c_2, O_S, O_T) = tp(c_2, c_1, O_T, O_S) \tag{11}$$
Global taxonomic precision and recall are defined by summing up the local taxonomic precision and recall of the common concepts in the two ontologies and normalising by the number of common concepts.
$$TP(O_S, O_T) = \frac{1}{|C_S \cap C_T|} \sum_{c \in C_S \cap C_T} tp(c, c, O_S, O_T) \tag{12}$$
$$TR(O_S, O_T) = \frac{1}{|C_S \cap C_T|} \sum_{c \in C_S \cap C_T} tr(c, c, O_S, O_T) \tag{13}$$
Since $TP(O_T, O_S) = \frac{1}{|C_S \cap C_T|} \sum_{c \in C_S \cap C_T} tp(c, c, O_T, O_S)$ and $tp(c, c, O_T, O_S) = tr(c, c, O_S, O_T)$ due to Eq. (11), we have
$$TR(O_S, O_T) = TP(O_T, O_S) \tag{14}$$
The taxonomic F-measure TF is used to balance TP and TR to generate a combined taxonomic measure.
$$TF(O_S, O_T) = \frac{2 \cdot TP(O_S, O_T) \cdot TR(O_S, O_T)}{TP(O_S, O_T) + TR(O_S, O_T)} \tag{15}$$
A combined measure GF, which balances the lexical and taxonomic measures, is used to give a summarising overview of the similarity of $O_S$ against $O_T$, and is computed as the harmonic mean of LF and TF:
$$GF(O_S, O_T) = \frac{2 \cdot LF(O_S, O_T) \cdot TF(O_S, O_T)}{LF(O_S, O_T) + TF(O_S, O_T)} \tag{16}$$
7.2. Experimental results
7.2.1. Phase I
The concepts and attributes from the four datasets were identified and used to generate context tables. The context tables were then fed to Galicia to derive GSHs. An ontology was generated from each GSH by discarding all attribute concepts and keeping the object concepts. Four ontologies were generated, one for each dataset (i.e. an ontology Oi is generated for dataset Di, where i = 1, 2, 3 and 4).
The metrics described in Section 7.1 were used to measure the similarity of these ontologies. We observed that the four water companies differ greatly from each other in what business objects they record in their systems, which leads to ontologies that are incompatible with each other both lexically and taxonomically. These local ontologies only agreed with each other to a small extent: only a relatively small percentage of terms in one ontology were also found in another ontology. This was measured with lexical precision LP (Table 14). Ontology O2 is the one that has the fewest terms in common with the other ontologies. Manual inspection of these ontologies found that this lexical disagreement was mainly due to the different aspects of the domain that an organisation chose to encode in its data management systems, which resulted in different ontology concepts. The poor performance of the O2 ontology was due to granularity issues: it encoded concepts at a finer level than the other ontologies, which resulted in lexical mismatches with them.
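As a quick consistency check, the entries of Table 14 can be related to the dataset sizes in Table 13 through Eq. (6), assuming that the number of terms in each local ontology Oi equals the number of concepts listed for Di. The shared-term count of 11 used below is not stated in the paper; it is the integer inferred from the reported percentages.

$$LP(O_1, O_3) = \frac{|C_1 \cap C_3|}{|C_1|} = \frac{11}{54} \approx 20.37\%, \qquad LP(O_3, O_1) = \frac{|C_1 \cap C_3|}{|C_3|} = \frac{11}{43} \approx 25.58\%$$

Both values agree with the corresponding cells of Table 14 (row O1, column O3 and row O3, column O1). The example also shows why the LP matrix is not symmetric: the numerator is shared, but the denominator is the size of the source ontology.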
Table 14
Baseline LP.
      O1       O2       O3       O4
O1    −        9.26%    20.37%   25.93%
O2    10.38%   −        8.51%    17.02%
O3    25.58%   9.30%    −        27.91%
O4    21.21%   12.12%   18.19%   −

Table 15
Baseline TP.
      O1       O2       O3       O4
O1    −        10.00%   40.90%   34.52%
O2    11.83%   −        11.07%   26.82%
O3    36.60%   12.60%   −        30.75%
O4    33.29%   24.58%   30.65%   −

Table 16
Rule sets.
      Total no of rules   No of attribute rules   No of object rules
O1    15                  6                       9
O2    12                  5                       6
O3    11                  6                       5
O4    15                  6                       9

Fig. 11. Lexical precision.
Fig. 12. Taxonomic precision.
The taxonomic level similarity of these ontologies was slightly better, but scores were still quite low, as shown in Table 15. The presence of different concepts in the hierarchies of these ontologies led to disappointing results. Again, ontology O2 performed the worst: it has a much lower taxonomic precision when compared with the other ontologies. Examination revealed that the granularity mismatch was again the main cause. As the O2 ontology encoded business objects at a finer granularity than the others, it had a very different hierarchy from those of the other ontologies.
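The taxonomic scores discussed above follow Eqs. (9) to (15) of Section 7.1. The sketch below is one possible reading of those definitions, not the evaluation code used in the paper: it assumes an ontology is a dictionary mapping every concept to the set of its direct super-concepts, that the cotopy of a concept excludes the concept itself, and that a local score is 0.0 when the cotopy is empty. The two small hierarchies are invented.

```python
from typing import Dict, Set

Ontology = Dict[str, Set[str]]   # concept -> direct super-concepts (assumed layout)


def ancestors(onto: Ontology, c: str) -> Set[str]:
    """All transitive super-concepts of c."""
    seen: Set[str] = set()
    stack = list(onto.get(c, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(onto.get(p, ()))
    return seen


def cotopy(onto: Ontology, c: str, common: Set[str]) -> Set[str]:
    """Super- and sub-concepts of c that are shared by both ontologies."""
    subs = {d for d in onto if c in ancestors(onto, d)}
    return (ancestors(onto, c) | subs) & common


def tp_local(c1: str, c2: str, o_s: Ontology, o_t: Ontology, common: Set[str]) -> float:
    ce1, ce2 = cotopy(o_s, c1, common), cotopy(o_t, c2, common)
    return len(ce1 & ce2) / len(ce1) if ce1 else 0.0   # empty cotopy scored 0.0 (assumption)


def taxonomic_precision(o_s: Ontology, o_t: Ontology) -> float:
    common = set(o_s) & set(o_t)
    if not common:
        return 0.0
    return sum(tp_local(c, c, o_s, o_t, common) for c in common) / len(common)


def taxonomic_recall(o_s: Ontology, o_t: Ontology) -> float:
    return taxonomic_precision(o_t, o_s)   # Eq. (14)


# Hypothetical hierarchies: O_S nests b under a, O_T keeps everything flat.
O_S = {"root": set(), "a": {"root"}, "b": {"a"}, "c": {"root"}}
O_T = {"root": set(), "a": {"root"}, "b": {"root"}, "c": {"root"}}

TP = taxonomic_precision(O_S, O_T)
TR = taxonomic_recall(O_S, O_T)
TF = 2 * TP * TR / (TP + TR)
```

In this toy case the extra nesting in the source hierarchy lowers precision to 0.75 while recall stays at 1.0, which mirrors the granularity effect described for O2: a finer hierarchy introduces super- and sub-concept links that the other ontology does not share.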
7.2.2. Phase II
Experiments were first performed to restore implicit information for the ontologies generated in Phase I; the resultant ontologies were then compared with each other using the same measures. To do this, rules for restoring implicit information were acquired for each dataset with the help of domain engineers. Table 16 shows the statistics on these rule sets.
The rules were applied to the formal contexts generated in the Phase I experiments to restore implicit information as well as to derive new feature types. The resultant contexts were used to generate ontologies in the same way as in the Phase I experiments. The similarity measures were calculated for these ontologies, and the results were compared with those obtained in the Phase I experiments, as shown in Figs. 11 and 12. Compared with the baseline similarity scores, we can see a substantial improvement in the similarity of these ontologies, both at the lexical level and at the taxonomic level. The average lexical precision increased from below 20% in the Phase I study to around 60% (Fig. 11). This was mainly due to the increase in common feature types, which were restored in the information explication process.
Table 17
Attribute disambiguation.
Target      Source   Exact match   BT match   NT match   Non-match   Multi-match
O1          O2       12            0          2          5           2
O1+O2       O3       15            1          1          4           1
O1+O2+O3    O4       16            2          2          2           1

Table 18
Performance of FCA ontology and KE ontology.
      FCA ontology                 KE ontology
      LP       TP       GF         LP       TP       GF
O1    64.11%   67.07%   83.81%     62.88%   53.29%   44.39%
O2    64.92%   74.61%   88.02%     53.61%   51.81%   39.67%
O3    47.18%   64.99%   82.93%     52.58%   49.67%   46.25%
O4    72.39%   75.54%   86.38%     71.13%   52.41%   42.17%
Taxonomic precision improved similarly, from 20% to around 60% on average (Fig. 12). This improvement was mainly due to the resulting ontologies bearing a similar level of detail in their hierarchies once they were enriched with derived objects generated with rules. A concept in one ontology had an increased number of common super- and sub-concepts with its matching concept in another ontology. This resulted in improved local taxonomic similarity and therefore improved global taxonomic similarity. This led to the conclusion that implicit information impacts greatly on the similarity of the local ontologies, and that the similarity of these ontologies can be improved significantly if implicit information is restored.
7.2.3. Phase III
The four local ontologies, which had implicit information restored in the Phase II experiments, were then integrated to build a global ontology. This was achieved by first performing the context integration described in Section 5. The contexts of O1 and O2 were integrated first, the resultant context was then integrated with the O3 context, and so on, as shown in Table 17. The main activity performed here was attribute disambiguation. Table 17 shows the types of attribute matches found during the various stages of the integration process. For example, of the 21 attributes of the O2 context, 12 found an exact match in the O1 context, 2 found narrower matches, 5 did not find any match, and 2 found multiple matches. After attribute disambiguation, the integrated context was used to generate a GSH, from which an integrated ontology was derived. The total number of concepts in the integrated ontology was 248 and the depth of the hierarchy was 6.
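One way to read Table 17 is that every attribute of the source context falls into exactly one match category. That reading is an inference rather than something the paper states, but it can be checked against the attribute counts in Table 13, assuming the context of Oi has the attributes listed for Di. The variable names below are illustrative.

```python
# Attribute counts of the source contexts, taken from Table 13 (D2, D3, D4).
attributes_per_source = {"O2": 21, "O3": 22, "O4": 23}

# Match counts per integration step, taken from Table 17.
table17 = {
    "O2": {"exact": 12, "broader": 0, "narrower": 2, "non_match": 5, "multi": 2},
    "O3": {"exact": 15, "broader": 1, "narrower": 1, "non_match": 4, "multi": 1},
    "O4": {"exact": 16, "broader": 2, "narrower": 2, "non_match": 2, "multi": 1},
}

# Sum the categories of each row and compare with the attribute counts.
row_totals = {source: sum(counts.values()) for source, counts in table17.items()}
consistent = row_totals == attributes_per_source
```

The row totals equal the attribute counts for all three integration steps, which supports reading the five columns as a partition of the source attributes.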
To evaluate the quality of this integrated ontology (FCA ontology for short), we compared it against a handcrafted ontology that was developed with a traditional knowledge engineering approach as described in Fu and Cohn (2008a) (KE ontology for short). Both the FCA ontology and the KE ontology had the same local ontologies (i.e. O1, O2, O3 and O4) as major inputs (and therefore the comparisons made in this research are unbiased), but they differ from each other in how the ontological hierarchies were built and how implicit/unarticulated information was recovered. The hierarchy of the FCA ontology was generated automatically with the FCA tool Galicia based on the attribute definitions of objects, whereas the hierarchy of the KE ontology was generated manually based on the knowledge of domain experts. The FCA ontology achieved information explication via the domain rules discussed in Section 4. The KE ontology did this through a manual semantic enrichment process. Extra data semantics of the KE ontology were manually derived from both system design documents and domain engineers. The resultant KE ontology consists of 216 concepts organised in 5 hierarchical levels.
We evaluate the two ontologies in a similar fashion to that of Brewster, Alani, Dasmahapatra, and Wilks (2004). We consider an ontology to be of good quality when it conforms to, and has a good coverage of, the knowledge structures of the datasets to be integrated. This was assessed by comparing the FCA ontology and the KE ontology against the local ontologies O1, O2, O3 and O4 as developed in the Phase II experiments. Table 18 summarises the results. Both ontologies had similar scores for the lexical precision LP when compared against these ontologies. This can be largely explained by the fact that both the FCA ontology and the KE ontology had these local ontologies as input, i.e., concepts in these ontologies were the major lexical sources of both ontologies. The FCA ontology outperformed the KE ontology in its similarity to the local ontologies at the taxonomic level. This is because the FCA ontology was generated systematically based on the attribute definitions of the input feature types (of the local ontologies), and sub- and super-concept relationships between concepts were identified in the same fashion as in the local ontologies. This led to the improved taxonomic precision of the FCA ontology. In contrast, the ontological hierarchy generated with the KE method is rather subjective, i.e. it depends upon human judgement about which intermediate concepts to add and when a sub-/super-concept relationship should be established. The hierarchy tends to be distorted, with missing sub- and super-concept links, as the number of concepts increases. The FCA ontology also outperformed the KE ontology on the overall similarity measure GF. This leads to the conclusion that the FCA ontology fits and respects the local ontologies better and therefore better serves the integration purpose in this case.
8. Conclusions and future work
The availability of vast quantities of data presents organisations with both opportunities and challenges. Data integration techniques offer a promising way of addressing the issue of data heterogeneities and promoting data sharing and interoperability across organisations. In this paper we present a formal and semi-automated method for ontology development, with the aim of reconciling heterogeneous data and supporting data integration. The research extends classical FCA theory to address the issues of implicit and ambiguous information, which, we consider, are important but have not been sufficiently investigated by previous studies. The research enables a considerable share of ontology engineering activities to be automated, including concept derivation and hierarchy generation. In contrast to studies that draw upon either small or simplified datasets, we evaluate the proposed techniques on non-trivial industrial datasets. Our experimental results demonstrate that the techniques described in this paper can help curate and fuse data from dispersed sources, and support the development of ontologies that better fit and respect the underlying knowledge structure of the domain. There are several pieces of work which we plan to undertake in the future, including developing techniques to deal with incomplete information in data integration, and validating the proposed techniques on datasets in other application domains.
Acknowledgments
This research is follow-on work from UK EPSRC grant EP/C014707/1. Gaihua Fu is currently funded by the EPSRC grants EP/I035781/1 and EP/K012398/1.
References
Bahga, A., & Madisetti, V. K. (2015). Healthcare data integration and informatics in the cloud. Computer, 48(2), 50–57.
Bai, X., & Zhou, X. Z. (2011). Development of ontology-based information system using formal concept analysis and association rules. Advances in Computer Science, Intelligent System and Environment, 106, 121–126.
Bakhtouchi, A., Bellatreche, L., & Ait-Ameur, Y. (2011). Ontologies and functional dependencies for data integration and reconciliation. Advances in Conceptual Modeling: Recent Developments and New Directions, 6999, 98–107.
Beck, A., Boukhelifa, N., Fu, G., Hickinbotham, S., Parker, J., Bennett, B., et al. (2013). Utility data integration and knowledge representation in the UK: The VISTA project. GeoHydroinformatics - Integrating GIS and Water Engineering. United Kingdom: Chapman & Hall/CRC Press.
Bian, J., Zhang, H., & Peng, X. G. (2011). The research and implementation of heterogeneous data integration under ontology mapping mechanism. Web Information Systems and Mining, 6988, 87–94.
Brewster, C., Alani, H., Dasmahapatra, S., & Wilks, Y. (2004). Data-driven ontology evaluation. In Proceedings of the Language Resources and Evaluation Conference (pp. 164–168).
Brockmans, S., Colomb, R. M., Haase, P., Kendall, E. F., Wallace, E. K., Welty, C., et al. (2006). A model driven approach for building OWL DL and OWL full ontologies. In Proceedings of the International Semantic Web Conference (ISWC): 2006 (pp. 187–200).
Chen, R. C., Bau, C. T., & Yeh, C. J. (2011). Merging domain ontologies based on the WordNet system and Fuzzy Formal Concept Analysis techniques. Applied Soft Computing, 11(2), 1908–1923.
Dellschaft, K., & Staab, S. (2006). On how to perform a gold standard based evaluation of ontology learning. In Proceedings of the International Semantic Web Conference (ISWC): 2006 (pp. 228–241).
Do, H., & Rahm, E. (2002). COMA - a system for flexible combination of schema matching approaches. In Proceedings of the 28th International Conference on Very Large Data Bases, August 20-23, 2002 (pp. 610–621). Hong Kong, China.
Doan, A., Domingos, P., & Halevy, A. Y. (2001). Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, May 21-24, 2001 (pp. 509–520). Santa Barbara, CA, USA.
Doan, A., Halevy, A., & Ives, Z. G. (2012). Principles of Data Integration. Waltham, MA: Morgan Kaufmann.
Doan, A., Madhavan, J., Domingos, P., & Halevy, A. Y. (2003). Ontology matching: A machine learning approach. In Staab Steffen, et al. (Eds.), Handbook on Ontologies (pp. 397–416). Berlin Heidelberg: Springer.
Doan, A., Noy, N. F., & Halevy, A. Y. (2004). Introduction to the special issue on semantic integration. Sigmod Record, 33(4), 11–13.
Duckham, M., & Worboys, M. (2005). An algebraic approach to automated information fusion. International Journal of Geographic Information Systems, 19(5), 537–557.
Duong, T. H., & Jo, G. S. (2012). Enhancing performance and accuracy of ontology integration by propagating priorly matchable concepts. Neurocomputing, 88, 3–12.
Duong, T. H., Truong, H. B., & Nguyen, N. T. (2012). Local neighbor enrichment for ontology integration. Intelligent Information and Database Systems, 7196, 156–166.
Formica, A. (2006). Ontology-based concept similarity in formal concept analysis. Information Sciences, 176(18), 2624–2641.
Fu, G., & Cohn, A. G. (2008a). Semantic integration for mapping the underworld. In Geoinformatics 2008 and Joint Conference on GIS and Built Environment: Geo-Simulation and Virtual GIS Environments, June 28-29, 2008. Guangzhou, China. Doi:7143/714327-714327-714329.
Fu, G., & Cohn, A. G. (2008b). Utility ontology development with formal concept analysis. In 5th International Conference on Formal Ontology in Information Systems (pp. 297–310).
Ganter, B., Stumme, G., & Wille, R. (2005). Formal Concept Analysis: Foundations and Applications. Berlin, Heidelberg: Springer.
Ganter, B., & Wille, R. (1999). Formal Concept Analysis: Mathematical Foundations. Berlin, New York: Springer.
Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199–220.
He, L. J., & Wang, Q. T. (2011). Construction of ontology information system based on formal concept analysis. Advances in Computer Science, Intelligent System and Environment, 104, 83–88.
Huang, S. L., Lin, S. C., & Chan, Y. C. (2012). Investigating effectiveness and user acceptance of semantic social tagging for knowledge sharing. Information Processing & Management, 48(4), 599–617.
Jiang, Y. C., Zhang, X. P., Tang, Y., & Nie, R. H. (2015). Feature-based approaches to semantic similarity assessment of concepts using Wikipedia. Information Processing & Management, 51(3), 215–234.
Kalfoglou, Y., & Schorlemmer, M. (2003). Ontology mapping: The state of the art. The Knowledge Engineering Review, 19(1), 1–31.
Lenzerini, M. (2002). Data integration: A theoretical perspective. In Proceedings of the 21st ACM Sigmod-Sigact-Sigart Symposium on Principles of Database Systems (pp. 233–246).
Liu, J., & Zhang, X. X. (2014). Data integration in fuzzy XML documents. Information Sciences, 280, 82–97.
Madhavan, J., Bernstein, P. A., & Rahm, E. (2001). Generic schema matching with cupid. In Proceedings of the 27th International Conference on Very Large Data Bases (pp. 49–58).
Madhavan, J., & Halevy, A. (2003). Composing mappings among data sources. In Proceedings of the 29th International Conference on Very Large Data Bases (pp. 572–583).
Maedche, A., & Staab, S. (2002). Measuring similarity between ontologies. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management: 2473 (pp. 251–263).
Mate, S., Kopcke, F., Toddenroth, D., Martin, M., Prokosch, H. U., Burkle, T., et al. (2015). Ontology-based data integration between clinical and research systems. Plos One, 10(1), E0122172.
Nanda, J., Simpson, T. W., Kumara, S. R. T., & Shooter, S. B. (2006). A methodology for product family ontology development using formal concept analysis and Web ontology language. Journal of Computing and Information Science in Engineering, 6(2), 103–113.
Neumann, F., Ho, C., Tian, X., Haas, L., & Meggido, N. (2002). Attribute classification using feature analysis. In Proceedings of the 18th International Conference on Data Engineering (p. 271).
Nguyen, N. T. (2006). Conflicts of ontologies - classification and consensus-based methods for resolving. In Knowledge-Based Intelligent Information and Engineering Systems, Pt 2, Proceedings: 4252 (pp. 267–274).
Nguyen, N. T. (2007). A method for ontology conflict resolution and integration on relation level. Cybernetics and Systems, 38(8), 781–797.
Noy, N. F. (2004). Semantic Integration: A survey of ontology-based approaches. Sigmod Record, 33(4), 65–70.
Noy, N. F., & Musen, M. A. (2000). PROMPT: Algorithm and tool for automated ontology merging and alignment. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence (pp. 450–455).
Pedersen, T. B., Pedersen, D., & Riis, K. (2013). On-demand multidimensional data integration: Toward a semantic foundation for cloud intelligence. Journal of Supercomputing, 65(1), 217–257.
Pinto, H. S., & Martins, J. P. (2004). Ontologies: How can they be built? Knowledge and Information Systems, 6(4), 441–464.
Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334–350.
Rodriguez, M., & Egenhofer, M. (2003). Determining semantic similarity among entity classes from different ontologies. IEEE Transactions on Knowledge and Data Engineering, 15(2), 442–456.
Rouane, M., Valtchev, P., Sahraoui, H., & Huchard, M. (2004). Merging conceptual hierarchies using concept lattices. In Proceedings of the 3rd Workshop on Managing SPEcialization/Generalization Hierarchies, June 15, 2004 (pp. 51–58). Oslo, Norway.
Spohr, D., Hollink, L., & Cimiano, P. (2011). A machine learning approach to multilingual and cross-lingual ontology matching. In Proceedings of the 10th International Conference on the Semantic Web (ISWC) 2011: 7031 (pp. 665–680).
Stumme, G., & Maedche, A. (2001). FCA-MERGE: Bottom-up merging of ontologies. In International Joint Conference On AI, August 4-10, 2001 (pp. 225–234). Seattle, Washington, USA.
Sure, Y., Tempich, C., & Vrandecic, D. (2006). Ontology engineering methodologies. Semantic Web Technologies: Trends and Research in Ontology-based Systems (pp. 171–190). Wiley.
Truong, H. B., & Nguyen, N. T. (2012). A multi-attribute and multi-valued model for fuzzy ontology integration on instance level. Intelligent Information and Database Systems, 7196, 187–197.
Uschold, M., & Grüninger, M. (2004). Ontologies and semantics for seamless connectivity. Sigmod Record, 33(4), 58–64.
Valtchev, P., Grosser, D., Roume, C., & Hacene, M. R. (2003). Galicia: an open platform for lattices. In Proceedings of the 11th International Conference on Conceptual Structures, July 21-25, 2003 (pp. 241–254). Dresden, Germany.
Wille, R., & Reidel, I. R. (1982). Restructuring lattice theory: An approach based on hierarchies of concepts. Ordered Sets (pp. 445–470). Dordrecht, Boston.
Xia, H. (2013). Semantic web ontology integration based on formal concept analysis. Mechatronics, Robotics and Automation, 373-375, 1714–1718.
Xie, J., Liu, F., & Guan, S. U. (2011). Tree-structure based ontology integration. Journal of Information Science, 37(6), 594–613.
Yang, S. (2011). Efficient ontology integration model for better inference in context aware computing. Computational Materials Science, 268-270, 841–846.
Yu, T., Chen, H. J., Mi, J. H., Gu, P. Q., Wu, T., & Pan, J. Z. (2012). DartWiki: A semantic wiki for ontology-based knowledge integration in the biomedical domain. Current Bioinformatics, 7(3), 278–288.
Zhao, Y., Wang, X., & Halang, W. (2006). Ontology mapping based on rough formal concept analysis. In Advanced International Conference on Telecommunications and International Conference on Internet and Web Applications and Services (AICT/ICIW 2006), 19-25 February 2006. French Caribbean: Guadeloupe.