FCA based ontology development for data integration


Contents lists available at ScienceDirect

Information Processing and Management

journal homepage: www.elsevier.com/locate/ipm


Gaihua Fu

School of Civil Engineering and Geosciences, Newcastle University, Newcastle upon Tyne, UK

Article info

Article history:

Received 12 May 2015
Revised 5 November 2015
Accepted 22 February 2016
Available online 14 March 2016

Keywords:

Ontology development; Formal concept analysis; Data integration; Information sharing

Abstract

Data is a valuable asset to our society. Effective use of data can enhance the productivity of businesses and create economic benefit for customers. However, with data growing at unprecedented rates, organisations are struggling to take full advantage of available data. One main reason for this is that data usually originates from disparate sources. This can result in data heterogeneity and prevent data from being digested easily. Among other techniques developed, ontology-based approaches are one promising method for overcoming heterogeneity and improving data interoperability. This paper contributes a formal and semi-automated approach for ontology development based on Formal Concept Analysis (FCA), with the aim of integrating data that exhibits implicit and ambiguous information. A case study has been carried out on several non-trivial industrial datasets, and our experimental results demonstrate that the proposed method offers an effective mechanism that enables organisations to interrogate and curate heterogeneous data, and to create the knowledge that meets the needs of business.

© 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Business productivity and competitiveness are increasingly being driven by the effective access and use of data. Data provides a mine of information that can help us spot undiscovered patterns of business importance and create the knowledge that will be needed to tackle the challenges of the future. However, with data becoming available and growing at unprecedented rates, organisations struggle to take full advantage of valuable data. One main reason for this is that data is usually created and maintained by a range of organisations. This results in mismatches between datasets, i.e., datasets differ from one organisation to another not only in what is encoded but also in how it is encoded.

In order for organisations to use and digest heterogeneous data and uncover the untold business patterns, there is a growing interest in developing techniques that investigate complex data phenomena and facilitate better data interoperability (Doan, Halevy, & Ives, 2012; Doan, Noy, & Halevy, 2004; Duckham & Worboys, 2005; Huang, Lin, & Chan, 2012; Jiang, Zhang, Tang, & Nie, 2015; Lenzerini, 2002). Among various techniques developed, ontology research is one discipline that can deal with data heterogeneity and improve data sharing (Kalfoglou & Schorlemmer, 2003; Mate et al., 2015; Noy, 2004). Ontology-based integration systems are usually characterised by a global ontology which represents a reconciled, integrated view of the underlying data sources. Systems taking this approach usually provide users with a uniform interface: all queries made to source data are expressed in terms of a global ontology, as are the query results. This frees the user from the need to understand each individual data source. Unfortunately, in many domains one faces the problem of either having no established ontology that can be readily employed in the integration work, or having existing ontologies that do not fit the purpose (e.g., not consisting of knowledge that sufficiently captures the semantics of the information under investigation).

Corresponding author. Tel: +44 191 208 6822.
http://dx.doi.org/10.1016/j.ipm.2016.02.003

In this paper we contribute a formal and semi-automated approach for ontology development. Rather than starting from scratch, we build an ontology by effectively discovering and using the knowledge that is buried in the datasets to be integrated. The method is based on Formal Concept Analysis (FCA) (Ganter & Wille, 1999; Ganter, Stumme, & Wille, 2005), which is a mathematical approach for data analysis. FCA supports ontology development by abstracting conceptual structures from attribute-based object descriptions, and it enables considerable ontology development activities to be automated.

Our research extends classical FCA theory to support ontology development for integrating datasets that exhibit implicit and ambiguous information. Implicit information is caused by the fact that some organisations tend to take some domain knowledge for granted, and do not explicitly specify it in their design documents or datasets. This can lead to an ontology that is ill-formed and does not correctly capture critical concepts and the semantics of the domain. Ambiguous information is due to the fact that organisations differ from each other in culture, conventions and requirements in system development; hence they may vary in how they choose to represent a business object, and at what level of granularity such information is encoded. This causes inconsistencies between the datasets of different organisations.

We consider that overcoming this implicitness and ambiguity is an important step in ontology development. The work reported here is follow-on research to Beck et al. (2013), Fu and Cohn (2008a) and Fu and Cohn (2008b). In this paper we report further technical advances we have made. To restore implicit information, we introduce a rule based method. We discuss how rules are derived and deployed for recovering implicit information. To resolve ambiguous information, we define a set of primitive operations to deal with simple matches in data alignment. These operations are then composed to deal with more complicated matches. Finally, we report on our experiments that were carried out to construct an ontology for integrating non-trivial datasets from several UK water companies. We measure the quality of the developed ontology by utilising the metrics of classical information theory and also in terms of its fitness to the application domain. Our experimental results demonstrate that the techniques described in this paper provide an effective mechanism for reconciling and harmonising heterogeneous data from disparate sources, and they support development of ontologies that better fit and respect the underlying knowledge structures of domains.

The remaining part of the paper is organised as follows. Section 2 reviews related research. Section 3 recalls relevant notions of FCA and briefly describes our framework for ontology development. Sections 4 and 5 present techniques that deal with implicit and ambiguous information. Section 6 discusses how to derive an ontology by using results generated from Sections 4 and 5. Section 7 reports our experimental results. Section 8 concludes the paper and suggests future research.

2. Related research

Several areas of research are relevant to this work. Firstly, integration techniques investigated in database and information integration are quite relevant. Various topics have been studied by these communities; the ones most interesting here are mapping discovery and schema integration, and techniques have been developed to support these (Bahga & Madisetti, 2015; Do & Rahm, 2002; Doan et al., 2004; Lenzerini, 2002; Liu & Zhang, 2014; Madhavan & Halevy, 2003; Pedersen, Pedersen, & Riis, 2013; Rahm & Bernstein, 2001). Mapping discovery takes two or more database schemas as input and produces a mapping between elements of the input schemas that correspond semantically to each other. Many of the early as well as current mapping solutions employ hand-crafted rules or heuristics to match schemas (Madhavan, Bernstein, & Rahm, 2001; Rahm & Bernstein, 2001). Examples of such heuristics include linguistic matching of schema element names, detecting similarity of structures of schema elements, and considering the patterns in relationships of the schema elements. Techniques have also been proposed to use learning based methods (Doan, Domingos, & Halevy, 2001; Neumann, Ho, Tian, Haas, & Meggido, 2002).

Schema integration constructs a global schema based on the inter-schema relationships produced in mapping discovery. Each mapping element is analysed to decide which representation of related elements should be included in the global schema. When a mapping describes the corresponding schema elements as identical, their integration is straightforward: simply include one of the schema elements in the global schema. More frequently, the corresponding schema elements are not the same but are mutually related by some semantic properties, and schema merging is performed manually or semi-automatically with the assistance of domain engineers to guide the designers in their resolution.

Ontology research is another discipline that deals with data integration. A common definition of an ontology is that it is a formal, explicit specification of a domain of discourse (Gruber, 1993). As it provides a shared understanding and explicit specification of a domain, an ontology is considered to have a key role to play in data integration (Bakhtouchi, Bellatreche, & Ait-Ameur, 2011; Bian, Zhang, & Peng, 2011; Noy, 2004; Uschold & Grüninger, 2004; Yu et al., 2012). Unfortunately, for many domains one faces the need to develop ontologies from scratch (as there is no existing ontology that can be used readily), and a growing number of methods have been proposed in recent years to address the issues of ontology design and development. Most methods are based on the traditional knowledge engineering approach (Brockmans et al., 2006; Pinto & Martins, 2004; Sure, Tempich, & Vrandecic, 2006). These methods usually start with defining the domain and scope of ontologies. This is followed by a data acquisition process: important concepts are collected; a concept hierarchy is derived, and properties and semantic constraints are attached to concepts.

As developing ontologies from scratch is an expensive process, there has been increasing interest in reusing or merging existing ontologies (or other knowledge structures such as thesauri) that are developed independently in different applications (Duong, Truong, & Nguyen, 2012; Truong & Nguyen, 2012; Xie, Liu, & Guan, 2011; Yang, 2011). Central to these studies is research on ontology mapping and ontology integration. Approaches to ontology mapping are similar to those for matching database schemas and other structured data, and they use lexical and structural components of definitions to find correspondences. However, as an ontology captures richer data semantics than traditional database schemas, the methods for finding mappings tend to exploit these extra data semantics (Kalfoglou & Schorlemmer, 2003; Nguyen, 2007; Rodriguez & Egenhofer, 2003; Truong & Nguyen, 2012). For example, in Noy and Musen (2000) a tool has been developed that uses linguistic similarity matches between concepts for initiating mappings, and then uses the underlying ontological structures (classes, slots, facets) to suggest a set of heuristics for identifying further matches between the ontologies. In Duong and Jo (2012), a method has been proposed for mapping ontological concepts by propagating Priorly Matchable Concepts. The method exploits information such as concept types, relations and constraints to provide suggestions for possible concept matches. It gives guidance on how to check the similarity between concepts in advance, and it reduces computational complexity by avoiding similarity checks among unmatchable concepts. In Nguyen (2006), an approach has been proposed to resolve three levels of ontology conflicts (instance level, concept level and relation level) using a consensus method. The techniques developed in Doan, Madhavan, Domingos, and Halevy (2003) and Spohr, Hollink, and Cimiano (2011) employ learning based techniques to find ontology mappings. They exploit information in data instances and the taxonomic structure of ontologies, and then use a probabilistic model to combine the results of different learners.

Based on the inter-ontology mappings derived in mapping discovery, a merging process integrates the source ontologies and generates a global ontology. However, deriving a meaningful ontology is a hard problem even with the ground set of inter-ontology mappings provided, and most methods that support the merging process are performed in an interactive manner with the assistance of human users, as is done in database and information integration research.

Another branch of research studies ontology development and integration with formal methods. Of particular interest here is research based on Formal Concept Analysis (FCA) (Ganter & Wille, 1999; Ganter et al., 2005; Wille, 1982). FCA is a formal method for concept classification and conceptual structure derivation. FCA related tools enable considerable knowledge processing activities to be automated, particularly concept generation and hierarchy derivation. As a result, FCA has been attracting great interest as a means to support systematic, semi-automated development and integration of ontologies (Bai & Zhou, 2011; Formica, 2006; He & Wang, 2011; Nanda, Simpson, Kumara, & Shooter, 2006; Xia, 2013). For example, in Rouane, Valtchev, Sahraoui, and Huchard (2004) ontological hierarchy merging is studied in the framework of FCA by taking into account both taxonomic and other semantic relationships of ontologies. A method, FCA-MERGE, has been developed in Stumme and Maedche (2001) to use FCA to support ontology integration. FCA-MERGE takes as input the two ontologies and a set of natural language documents, and computes a concept lattice from the two source ontologies using FCA techniques. The concept lattice is then exploited by domain experts to derive a merged ontology. In Zhao, Wang, and Halang (2006) a similarity method has been introduced to map ontology concepts based on Rough Set and Formal Concept Analysis theory. The idea is to construct a concept lattice from two source ontologies with FCA; the similarity measure of two concepts is then computed using Rough Set theory. In Chen, Bau, and Yeh (2011) the authors proposed a method that combines WordNet and Fuzzy Formal Concept Analysis techniques for merging ontologies. WordNet is first used to align concepts from a source ontology to concepts in a base ontology, and the remaining unmapped concepts are then aligned to the base ontology using a similarity measure based on fuzzy FCA.

Our approach is in line with FCA based research. Yet it differs from previous studies in several aspects. Firstly, while most research focuses on similarity measures of ontology concepts, we contribute an integrated framework that offers a structural and systematic description of the ontology merging process. Secondly, with FCA as the backbone we investigate how to resolve implicit and ambiguous information. Previous research is either implicit on how these problems are resolved, or only addresses particular types of these problems. For example, in Rouane et al. (2004) there is an interesting discussion on attribute conflicts, but the authors do not address in detail how these problems are resolved. Thirdly, while most previous research considers one to one mapping between concepts, our method is able to deal with more complicated issues, i.e., an ontology concept may have multiple mappings from another ontology, which has not been investigated sufficiently in the literature. Finally, we applied the proposed techniques to non-trivial industrial datasets, and examined how effectively the proposed method can help with improving data interoperability. This has rarely been reported in other FCA-based works.

3. FCA terminologies and an FCA based framework for ontology development

In this section, we introduce the basic concepts of FCA and briefly describe our framework for ontology development. We will use data and examples from the water infrastructure domain to present the techniques developed in this research.

FCA theory was developed in Wille (1982). A typical task that FCA can perform is data analysis, making the conceptual structure of the data visible and accessible (Ganter & Wille, 1999; Ganter et al., 2005). Central to FCA is the notion of a formal context, which is defined as a triple $K := \langle G, M, I \rangle$, where $G$ is a set of objects, $M$ is a set of attributes, and $I \subseteq G \times M$ is a binary relation between $G$ and $M$. A relation $\langle g, m \rangle \in I$ is read as "object $g$ has the attribute $m$". A formal context can be depicted by a cross table as shown in Fig. 1(a), where the elements on the left side are objects; the elements at the top are attributes; and the relations between them are represented by the crosses.

A formal concept of a context $K := \langle G, M, I \rangle$ is defined as a pair $(A, B)$, where $A \subseteq G$, $B \subseteq M$, $A' = B$ and $B' = A$. $A'$ is the set of attributes common to all the objects in $A$, and $B'$ is the set of objects having the attributes in $B$. The extent of the concept $(A, B)$ is $A$ and its intent is $B$. The set of all formal concepts ordered by sub- and super-concept relations forms a concept lattice. Fig. 1(b) shows the concept lattice for the context in Fig. 1(a), where a node represents a concept labelled with its intensional and extensional description. The links represent the sub- and super-concept relations.

Fig. 1. (a) A formal context. (b) The concept lattice for the context in (a).

Fig. 2. (a) A many valued context. (b) A one valued context after conceptual scaling.
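The two derivation operators and the definition of a formal concept can be illustrated with a minimal Python sketch. The toy context below (objects `pipe1` to `pipe3` and their attributes) is hypothetical and is not the context of Fig. 1; the function names are our own. The enumeration closes every subset of objects, which is exponential in the number of objects and is meant only for small illustrative contexts.

```python
from itertools import combinations


def common_attributes(objects, incidence, all_attributes):
    """A': the attributes shared by every object in `objects`."""
    return frozenset(m for m in all_attributes
                     if all((g, m) in incidence for g in objects))


def common_objects(attributes, incidence, all_objects):
    """B': the objects that have every attribute in `attributes`."""
    return frozenset(g for g in all_objects
                     if all((g, m) in incidence for m in attributes))


def formal_concepts(all_objects, all_attributes, incidence):
    """Enumerate all (extent, intent) pairs by closing every object subset."""
    concepts = set()
    for size in range(len(all_objects) + 1):
        for subset in combinations(sorted(all_objects), size):
            intent = common_attributes(subset, incidence, all_attributes)
            extent = common_objects(intent, incidence, all_objects)
            concepts.add((extent, intent))
    return concepts


# Hypothetical toy context, chosen only for illustration.
G = {"pipe1", "pipe2", "pipe3"}
M = {"main", "pressurised", "gravity"}
I = {("pipe1", "main"), ("pipe1", "pressurised"),
     ("pipe2", "main"), ("pipe2", "gravity"),
     ("pipe3", "main")}

concepts = formal_concepts(G, M, I)
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

For this toy context the sketch yields four concepts: a top concept whose intent is only `main`, one concept each for the pressurised and gravity pipes, and a bottom concept with an empty extent.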

The formal contexts introduced above are not the ones that occur most frequently in applications of FCA. Most often data is encoded in many valued contexts. A many valued context $K := \langle G, M, W, I \rangle$ consists of a set of objects $G$, a set of attributes $M$, a set of attribute values $W$, and a ternary relation $I \subseteq G \times M \times W$. A relation $\langle g, m, w \rangle \in I$ is read as "object $g$ has the attribute $m$ and its value is $w$". Fig. 2(a) shows a many valued context which lists different water pipes having different attribute values. In order for FCA theory to be applied to a many valued context, it needs to be unfolded into a one valued context through conceptual scaling (Ganter & Wille, 1999). Fig. 2(b) shows the one valued context for the many valued context in Fig. 2(a) after conceptual scaling.
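As a rough illustration of conceptual scaling, the sketch below applies a simple nominal scale in Python: every non-blank attribute value becomes a one valued attribute of its own. The input rows are taken from Table 1; the function name `nominal_scale` and the dictionary layout are assumptions made for this example, and naming the scaled attribute after the value alone assumes that values are unique across attributes.

```python
def nominal_scale(many_valued_context):
    """Map {object: {attribute: value}} to {object: set of scaled attributes}.

    Each non-blank value becomes a one valued attribute named after the
    value itself, which assumes values are unique across attributes.
    """
    return {
        obj: {value for value in row.values() if value}
        for obj, row in many_valued_context.items()
    }


# Rows of Table 1; None marks a blank cell.
TABLE_1 = {
    "pipeType1": {"size": "Main", "what": None, "how": "Pressurised", "location": None},
    "pipeType2": {"size": "Main", "what": None, "how": None, "location": "Aboveground"},
    "pipeType3": {"size": "Main", "what": "Sludge", "how": None, "location": None},
    "pipeType4": {"size": "Main", "what": None, "how": None, "location": None},
}

one_valued = nominal_scale(TABLE_1)
columns = sorted(set().union(*one_valued.values()))

# Print the resulting cross table.
for obj, attributes in one_valued.items():
    print(obj, ["X" if c in attributes else "" for c in columns])
```

Other scales (for example ordinal or interval scales for numeric attributes) are possible; the nominal scale is used here only because the example values are categorical.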

As the extent and intent of a concept overlap with those of its super- and sub-concepts, redundancy exists in a concept lattice. To prevent this, reduced labelling is introduced. A lattice with reduced labelling is obtained by replacing each concept $(A, B)$ with $(N(A), N(B))$, where $N(A)$ contains the non-redundant elements in $A$, and $N(B)$ contains the non-redundant elements in $B$. An object $o$ will appear in $N(A)$ if the corresponding concept is the greatest lower bound of all concepts containing $o$. An attribute $a$ will appear in $N(B)$ if the corresponding concept is the least upper bound of all concepts containing $a$. Fig. 3(a) shows the lattice derived from the one in Fig. 1(b) with reduced labelling. Furthermore, we can eliminate from a lattice the concepts which do not possess their own attributes or objects. This leads to a structure called a Galois Sub Hierarchy (GSH). A GSH consists only of so-called attribute concepts and object concepts. An object concept represents the smallest concept with this object in its extension, and an attribute concept represents the largest concept with this attribute in its intension. The GSH of the lattice in Fig. 3(a) is depicted in Fig. 3(b), where concepts 1, 5, 8 and 12 are removed due to the empty $N(A)$ and $N(B)$. Concept 2 is an attribute concept and concept 9 is an object concept.
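Because a GSH keeps only object concepts and attribute concepts, it can be computed directly without building the full lattice. The Python sketch below is one possible way to do this under our own assumptions: the context is given as a dictionary from objects to attribute sets, and the example data is the one valued form of Table 1 (blank cells ignored). The function name is hypothetical.

```python
def galois_sub_hierarchy(context):
    """Return the object concepts and attribute concepts of a context.

    `context` maps each object to its set of attributes. Each concept is
    returned as an (extent, intent) pair of frozensets.
    """
    objects = set(context)
    attributes = set().union(*context.values())

    def intent(objs):
        return frozenset(a for a in attributes
                         if all(a in context[o] for o in objs))

    def extent(attrs):
        return frozenset(o for o in objects if attrs <= context[o])

    concepts = set()
    for o in objects:
        # Object concept: the smallest concept with o in its extent.
        i = intent({o})
        concepts.add((extent(i), i))
    for a in attributes:
        # Attribute concept: the largest concept with a in its intent.
        e = extent({a})
        concepts.add((e, intent(e)))
    return concepts


# One valued form of Table 1 (blank cells ignored).
CONTEXT = {
    "pipeType1": {"Main", "Pressurised"},
    "pipeType2": {"Main", "Aboveground"},
    "pipeType3": {"Main", "Sludge"},
    "pipeType4": {"Main"},
}

gsh = galois_sub_hierarchy(CONTEXT)
for e, i in sorted(gsh, key=lambda c: -len(c[0])):
    print(sorted(e), sorted(i))
```

In this example the bottom concept of the full lattice (empty extent, all attributes) is neither an object concept nor an attribute concept, so it does not appear in the result.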

With FCA theory as the backbone, we have developed a framework to support ontology development. The framework essentially consists of three components: Context Formation, Context Composition and Ontology Derivation, as illustrated in Fig. 4. To generate an integrated ontology for two datasets, Context Formation takes the datasets as inputs and generates a one valued context for each of them. The generated contexts are then fed to Context Composition to produce an integrated GSH. Ontology Derivation takes the GSH generated in Context Composition and generates an integrated ontology as well as concept mappings between the two datasets. We will describe Context Formation in Section 4, and elaborate on Context Composition and Ontology Derivation in Sections 5 and 6.

Fig. 3. (a) A lattice with reduced labelling. (b) A Galois sub hierarchy.

Fig. 4. An FCA based framework for ontology development.

4. Context formation

Fig. 5 shows the components of Context Formation. Given a dataset, Data Acquisition derives concepts encoded in the dataset as well as their attribute definitions, and the result is a many valued context for the dataset. The component looks at sources from which various feature types (concepts) and their definitions can be extracted. The most common sources here are text/web documents created by system designers/developers for specifying system requirements and design. Other important sources are conceptual/logical data models of the dataset concerned. The generated context is then fed to the Information Explication component to restore implicit information. The Conceptual Scaling component transforms a many valued context into a one valued context, in order for classic FCA techniques to be applicable.

Fig. 5. Context formation.

4.1. Implicit information

The main challenge here is to deal with implicit information. Implicit information is caused by several factors. As an example in the water infrastructure domain, when defining a feature type, organisations tend to explicitly state specific properties, but leave common ones unarticulated in their design documents. For instance, a sewer pipe is characterised by how it conveys sewage: either by gravity or by pressure, with gravity distribution employed more often than the pressurised form. Most water companies explicitly specify the pressurised characteristic of a sewer pipe, but not the gravity one. Furthermore, many organisations take some domain knowledge for granted, and do not encode it explicitly. For example, a sludge sewer is usually pressurised rather than gravity-driven. As this is well understood in the domain, many water companies choose not to encode this information explicitly. Table 1 shows a portion of a many valued context that is generated for a sewerage dataset, where many blank cells exist due to implicit or unarticulated domain knowledge.

Table 1
A many valued context table.

| | Size | What | How | Location |
|---|---|---|---|---|
| pipeType1 | Main | | Pressurised | |
| pipeType2 | Main | | | Aboveground |
| pipeType3 | Main | Sludge | | |
| pipeType4 | Main | | | |

Fig. 6. The GSH for the formal context in Table 1.

The main consequence of this is that it can lead to an ontology that is ill-formed and does not correctly capture critical concepts and semantics of the domain. Fig. 6 shows the GSH for the context in Table 1. Due to implicit information, many important concepts, such as gravity sewer and underground sewer, are missing from the hierarchy and therefore from the resultant ontology. Furthermore, different organisations may make different choices about what not to articulate in their datasets. We believe this hidden knowledge is one of the main reasons that hinder data compatibility or interoperability across organisations.

4.2. Rule elicitation

We classify implicit information into two groups: attribute-specific and object-specific. Attribute-specific implicit information is concerned with a particular attribute, and is applicable to all objects having that attribute. Object-specific implicit information is concerned with an attribute of particular objects only. An example of the former is the how attribute in Table 1. The unarticulated domain knowledge here is that a sewer pipe carries sewage by gravity if not explicitly specified, and this applies to all sewerage pipes having the how attribute. An example of object-specific implicit information is the how attribute of pipeType3. The implicit information here is that if a pipe carries sludge sewage, by default it carries it by pressure. This is relevant to the how attribute, but applies to pipeType3 only (pipes that carry sludge) and is therefore classified as object-specific implicit information.

We use a rule based approach to recover implicit information. As implicit information is largely unarticulated domain knowledge, we need to work closely with domain experts to acquire these rules. We have two types of rules: attribute rules dealing with attribute-specific implicit information, and object rules dealing with object-specific implicit information.

To elicit attribute rules, we iterate over each attribute. An attribute has implicit information if it has missing values for some objects. Each attribute with implicit information in a context table incurs a rule. Involvement of domain experts is required at this point to generate such a rule. For example, Rule 2 in Fig. 7 is collected for the how attribute in Table 1.

To elicit object rules, we iterate over each object in the context, and examine each of its attributes that do not have a value. If an attribute has implicit information which cannot be recovered with an attribute rule, an object rule is elicited to recover the implicit information with the help of domain experts. For example, for object pipeType3, the attribute how has implicit information. As the implicit information for how in this case is pressurised, it cannot be recovered with Rule 2 discussed above. An object rule, Rule 4 in Fig. 7, is acquired in this case for pipeType3.

Fig. 7 shows the set of rules elicited for the context table in Table 1, where Rules 1, 2 and 3 are attribute rules. Rule 4 is an object rule, which works for the how attribute of the object pipeType3 only.

1) a sewerage pipe usually carries waste water unless it is specified otherwise;
2) if a sewerage pipe is not explicitly specified as a pressurised pipe, then it is a gravity sewer;
3) if a sewerage pipe is not explicitly specified as an above ground pipe, then it is an underground pipe;
4) a sludge sewerage pipe is a pressurised pipe unless it is specified otherwise.

Fig. 7. Rules elicited for the context table in Table 1.

4.3. Rule deployment

This step is concerned with how a context table can be manipulated to restore implicit information. To recover implicit information for an object, we first identify the set of rules applicable to it. This includes all relevant attribute rules and object rules for the object. Each attribute of the object is examined to see if it has implicit information. If the answer is yes, the relevant attribute rule is identified. The identification of an object rule is straightforward as it is linked to the object concerned directly. For an object, if both an attribute rule and an object rule are identified as relevant to an attribute, the object rule overrides the attribute rule when restoring implicit information. For example, for pipeType3 (in Table 1), both Rule 2 and Rule 4 (in Fig. 7) deal with the how attribute, but only Rule 4 is applied when restoring implicit information for how of this object.

Once applicable rules have been identified, we generate new objects by applying different combinations of the rules. This allows objects with different combinations of attributes to be identified. Each derived object retains the existing object attribute relationships of the original object and derives new ones (for attributes having missing values) by applying the corresponding rules. For example, for sewer pipeType1, there are two attributes that have implicit information, what and location. Accordingly, two attribute rules are identified: Rule 1 for the what attribute and Rule 3 for the location attribute. There is no object rule identified for pipeType1. By applying different combinations of the rules, three new objects are derived from pipeType1: pipeType1_object1 by applying Rule 1, pipeType1_object2 by applying Rule 3, and pipeType1_object3 by applying Rules 1 and 3. All new objects retain the existing object attribute relationships of pipeType1, with different relationships derived according to the rules applied. Depending on the number of rules applicable, each original context object derives a different number of new objects. For example, there are 2 applicable rules for each of pipeType1, pipeType2 and pipeType3. The combinations of these rules generate 3 derived objects for each original object. pipeType4 has 3 applicable rules, and 7 new objects have been derived.
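The rule deployment step can be sketched in Python as follows. This is only an illustrative reading of the procedure, not the implementation used in our experiments: the rules of Fig. 7 are encoded as default values, the dictionaries `ATTRIBUTE_RULES` and `OBJECT_RULES` and both function names are assumptions of this example, and the rows are those of Table 1 with `None` marking a blank cell.

```python
from itertools import combinations

# Rules 1-3 of Fig. 7, encoded as default values per attribute.
ATTRIBUTE_RULES = {"what": "Wastewater", "how": "Gravity", "location": "Underground"}
# Rule 4 of Fig. 7: an object rule for the how attribute of pipeType3.
OBJECT_RULES = {("pipeType3", "how"): "Pressurised"}

TABLE_1 = {
    "pipeType1": {"size": "Main", "what": None, "how": "Pressurised", "location": None},
    "pipeType2": {"size": "Main", "what": None, "how": None, "location": "Aboveground"},
    "pipeType3": {"size": "Main", "what": "Sludge", "how": None, "location": None},
    "pipeType4": {"size": "Main", "what": None, "how": None, "location": None},
}


def applicable_rules(obj, row):
    """Collect one default per blank attribute; object rules take precedence."""
    rules = {}
    for attribute, value in row.items():
        if value is not None:
            continue
        if (obj, attribute) in OBJECT_RULES:
            rules[attribute] = OBJECT_RULES[(obj, attribute)]
        elif attribute in ATTRIBUTE_RULES:
            rules[attribute] = ATTRIBUTE_RULES[attribute]
    return rules


def derive_objects(obj, row):
    """Apply every non-empty combination of applicable rules to the row."""
    rules = applicable_rules(obj, row)
    derived = []
    for size in range(1, len(rules) + 1):
        for combo in combinations(rules, size):
            new_row = dict(row)  # keep the existing relationships
            for attribute in combo:
                new_row[attribute] = rules[attribute]
            derived.append(new_row)
    return derived


for name, row in TABLE_1.items():
    print(name, len(derive_objects(name, row)))
```

With n applicable rules, the number of derived objects is 2^n - 1, which gives 3 derived objects for pipeType1, pipeType2 and pipeType3 and 7 for pipeType4.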

Table 2 lists the many valued context after implicit information has been restored with rules. This many valued context is then fed to the Conceptual Scaling component (as shown in Fig. 5) to generate a one valued context table. Table 3 lists the one valued context table after the conceptual scaling of the context in Table 2.

Table 2
A many valued context table after information explication.

| | Size | What | How | Location |
|---|---|---|---|---|
| pipeType1 | Main | | Pressurised | |
| pipeType1_object1 | Main | Wastewater | Pressurised | |
| pipeType1_object2 | Main | | Pressurised | Underground |
| pipeType1_object3 | Main | Wastewater | Pressurised | Underground |
| pipeType2 | Main | | | Aboveground |
| pipeType2_object1 | Main | Wastewater | | Aboveground |
| pipeType2_object2 | Main | | Gravity | Aboveground |
| pipeType2_object3 | Main | Wastewater | Gravity | Aboveground |
| pipeType3 | Main | Sludge | | |
| pipeType3_object1 | Main | Sludge | Pressurised | |
| pipeType3_object2 | Main | Sludge | | Underground |
| pipeType3_object3 | Main | Sludge | Pressurised | Underground |
| pipeType4 | Main | | | |
| pipeType4_object1 | Main | Wastewater | | |
| pipeType4_object2 | Main | | Gravity | |
| pipeType4_object3 | Main | | | Underground |
| pipeType4_object4 | Main | Wastewater | Gravity | |
| pipeType4_object5 | Main | Wastewater | | Underground |
| pipeType4_object6 | Main | | Gravity | Underground |
| pipeType4_object7 | Main | Wastewater | Gravity | Underground |

Table 3
A one valued context table after conceptual scaling of the many valued context in Table 2.

| | Main | Wastewater | Sludge | Pressurised | Gravity | Underground | Aboveground |
|---|---|---|---|---|---|---|---|
| pipeType1 | X | | | X | | | |
| pipeType1_object1 | X | X | | X | | | |
| pipeType1_object2 | X | | | X | | X | |
| pipeType1_object3 | X | X | | X | | X | |
| pipeType2 | X | | | | | | X |
| pipeType2_object1 | X | X | | | | | X |
| pipeType2_object2 | X | | | | X | | X |
| pipeType2_object3 | X | X | | | X | | X |
| pipeType3 | X | | X | | | | |
| pipeType3_object1 | X | | X | X | | | |
| pipeType3_object2 | X | | X | | | X | |
| pipeType3_object3 | X | | X | X | | X | |
| pipeType4 | X | | | | | | |
| pipeType4_object1 | X | X | | | | | |
| pipeType4_object2 | X | | | | X | | |
| pipeType4_object3 | X | | | | | X | |
| pipeType4_object4 | X | X | | | X | | |
| pipeType4_object5 | X | X | | | | X | |
| pipeType4_object6 | X | | | | X | X | |
| pipeType4_object7 | X | X | | | X | X | |

5. Context composition

Context Composition takes two formal contexts as input, and generates an integrated GSH. The main components of Context Composition are Context Integration and Hierarchy Generation, as shown in Fig. 8.

Fig. 8. Context composition.

The main challenge here is to deal with ambiguous information during context integration, i.e., different terms may be employed to refer to the same attribute, and attributes may be modelled at different levels of granularity. An example here is that one dataset may model a sewerage pipe as either main or lateral and another may classify it as trunk main, non-trunk main, or private pipe. Attribute disambiguation is a process to match attributes from different datasets. In this research we use a pre-defined data dictionary developed in Fu and Cohn (2008a) to disambiguate attributes. The data dictionary maintains a set of terms that describe concepts in a domain, as well as their terminological relationships, e.g. BT/NT (Broader/Narrower Term) etc. Using the data dictionary, we can decide the semantic relationship of two attributes. In what follows, we will use the context tables $K_1$ and $K_2$ shown in Tables 4 and 5 to illustrate the context integration process.

Table 4
Context $K_1$.

Columns: Main, Operational, Abandoned, Abandoned intact, Proposed recommission

K1O1 X X
K1O2 X X
K1O3 X X
K1O4 X X X
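A data dictionary lookup of this kind might be sketched in Python as below. The structure of the real data dictionary of Fu and Cohn (2008a) is not described here, so the two tables `EQUIVALENT` and `BROADER_TERM`, the function name `relationship` and the returned labels are all assumptions of this example. The entries are limited to the term pairs discussed in this section, and only direct BT/NT links are followed.

```python
# Hypothetical fragment of a data dictionary.
EQUIVALENT = {frozenset({"live", "operational"})}
# Narrower term -> its broader term (BT/NT links).
BROADER_TERM = {
    "abandoned destroyed": "abandoned",
    "proposed recommission": "proposed",
}


def relationship(term, candidate):
    """Describe how `candidate` relates to `term`.

    Returns "equivalent", "broader" (candidate is a broader term),
    "narrower" (candidate is a narrower term) or None for no match.
    """
    if term == candidate or frozenset({term, candidate}) in EQUIVALENT:
        return "equivalent"
    if BROADER_TERM.get(term) == candidate:
        return "broader"
    if BROADER_TERM.get(candidate) == term:
        return "narrower"
    return None


print(relationship("live", "operational"))
print(relationship("abandoned destroyed", "abandoned"))
print(relationship("proposed", "proposed recommission"))
print(relationship("standby", "operational"))
```

A fuller version would probably follow BT/NT links transitively and pick the closest match among all attributes of the target context; that is left out to keep the sketch short.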

Given two contexts $K_1 := \langle G_1, M_1, I_1 \rangle$ and $K_2 := \langle G_2, M_2, I_2 \rangle$, the integrated context $K := \langle G, M, I \rangle$ is computed by first performing a disjoint union of the object sets of the two contexts, that is,

$$G = G_1 \cup^{*} G_2 \qquad (1)$$

$M$ and $I$ are assigned $M_1$ and $I_1$ from $K_1$ at this stage, i.e., $M = M_1$ and $I = I_1$. Table 6 shows the context $K$ after the above operations.

The next step identifies the semantic relationship between an attribute in $M_2$ and an attribute in $M$. For each attribute in $M_2$, we derive the new attributes and relationships to be added to the context table $K$. We use $\mathit{attExt}$ to denote the set of attributes to be added to $M$, and use $\mathit{relExt}$ to denote the set of relationships to be added to $I$. After each round of mapping, the attribute set $M$ and relationship set $I$ of $K$ are calculated as:

$$M = M \cup \mathit{attExt} \qquad (2)$$

$$I = I \cup \mathit{relExt} \qquad (3)$$

An attribute could find mappings of different types, including 1 to 1 mappings or 1 to many mappings, and accordingly different operations are needed for context table manipulation. In what follows, we first describe primitive operations, which deal with 1 to 1 mappings, and we will then discuss how the primitive operations can be composed to deal with 1 to many mappings.

Table 5
Context $K_2$.

Columns: Main, Live, Abandoned destroyed, Proposed, Standby, Abandoned intact

K2O1 X X
K2O2 X X
K2O3 X X
K2O4 X X
K2O5 X X

Table 6
The context table after object union operation.

Columns: Main, Operational, Abandoned, Abandoned intact, Proposed recommission

K1O1 X X
K1O2 X X
K1O3 X X
K1O4 X X X
K2O1
K2O2
K2O3
K2O4
K2O5

Table 7
Primitive matches.

| Match type | $\mathit{attExt}$ | $\mathit{relExt}$ |
|---|---|---|
| $A_i$ finds an equivalent term $A_j$ in $M$ | $\emptyset$ | $\{\langle O, A_j \rangle \mid \langle O, A_i \rangle \in I_2\}$ |
| $A_i$ finds a broader term $A_j$ in $M$ | $\{A_i\}$ | $\{\langle O, A_i \rangle, \langle O, A_j \rangle \mid \langle O, A_i \rangle \in I_2\}$ |
| $A_i$ finds a narrower term $A_j$ in $M$ | $\{A_i\}$ | $\{\langle O_m, A_i \rangle \mid \langle O_m, A_i \rangle \in I_2\} \cup \{\langle O_n, A_i \rangle \mid \langle O_n, A_j \rangle \in I\}$ |
| $A_i$ finds no match in $M$ | $\{A_i\}$ | $\{\langle O, A_i \rangle \mid \langle O, A_i \rangle \in I_2\}$ |
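The four primitive matches of Table 7 and the updates in Eqs. (2) and (3) can be written down as a short Python sketch. The function names `primitive_match` and `integrate`, the string labels for the match types and the representation of $I$ and $I_2$ as sets of (object, attribute) pairs are assumptions of this example. The relations used below are only the ones stated explicitly in the text, so `I` is a partial stand-in for the full relation of $K$.

```python
def primitive_match(ai, aj, relation, i2, i):
    """Return (attExt, relExt) for attribute `ai` of K2, following Table 7.

    `aj` is the match found in M (None when there is no match), `relation`
    is "equivalent", "broader", "narrower" or None, and `i2` and `i` are
    sets of (object, attribute) pairs for K2 and K.
    """
    from_k2 = {o for (o, a) in i2 if a == ai}
    if relation == "equivalent":
        return set(), {(o, aj) for o in from_k2}
    if relation == "broader":
        return {ai}, {(o, ai) for o in from_k2} | {(o, aj) for o in from_k2}
    if relation == "narrower":
        inherited = {(o, ai) for (o, a) in i if a == aj}
        return {ai}, {(o, ai) for o in from_k2} | inherited
    return {ai}, {(o, ai) for o in from_k2}


def integrate(m, i, att_ext, rel_ext):
    """Eqs. (2) and (3): M = M ∪ attExt, I = I ∪ relExt."""
    return m | att_ext, i | rel_ext


# Attributes of K1 (Table 4) and the relations mentioned in the text.
M = {"main", "operational", "abandoned", "abandoned intact", "proposed recommission"}
I = {("K1O2", "proposed recommission")}
I2 = {("K2O1", "abandoned destroyed"), ("K2O2", "live"), ("K2O3", "proposed")}

matches = [
    ("live", "operational", "equivalent"),
    ("abandoned destroyed", "abandoned", "broader"),
    ("proposed", "proposed recommission", "narrower"),
]
for ai, aj, relation in matches:
    att_ext, rel_ext = primitive_match(ai, aj, relation, I2, I)
    M, I = integrate(M, I, att_ext, rel_ext)

print(sorted(M))
print(sorted(I))
```

Keeping the computation of $\mathit{attExt}$ and $\mathit{relExt}$ separate from the update of $M$ and $I$ mirrors the way the text describes each round of mapping, and makes it straightforward to compose several primitive matches for one attribute.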

5.1. Primitive operations

For a given attribute $A_i \in M_2$ of $K_2$, four types of mapping can be identified from $M$ of $K$. Table 7 summarises the types of mapping, as well as the attributes (i.e. $\mathit{attExt}$) and relationships (i.e. $\mathit{relExt}$) to be added to $K$ for each mapping type, where $A_j$ denotes the match from $M$ of $K$.

I. $A_i$ finds an equivalent attribute $A_j \in M$ of $K$. In this case, $A_i$ will be unified with $A_j$. The context table $K$ is expanded with relationships between $A_j$ and objects that have relationships with $A_i$ in $K_2$. For example, for the context $K_2$ in Table 5, the attribute live finds an equivalent attribute operational in $K$. They are unified as operational in $K$, i.e., $\mathit{attExt} = \emptyset$ (no new attribute to be added to $K$). Since there exists a relationship $\langle \text{K2O2}, \text{live} \rangle$ in $K_2$, a new relationship is established in $K$, i.e., $\mathit{relExt} = \{\langle \text{K2O2}, \text{operational} \rangle\}$. Similarly, we find an equivalent match main in $K$ for the attribute main in $K_2$. The resultant context table is shown in Table 8.

Table 8
The context table after an equivalent match.

Columns: Main, Operational, Abandoned, Abandoned intact, Proposed recommission

K1O1 X X
K1O2 X X
K1O3 X X
K1O4 X X X
K2O1 X
K2O2 X X
K2O3 X
K2O4 X
K2O5 X

II. $A_i$ finds a match $A_j$ that is more generic than it. In this case, the resulting context $K$ is expanded with attribute $A_i$ and relationships between $A_i$ and objects from $K_2$. New relationships are established in $K$ between those objects having attribute $A_i$ and the attribute $A_j$, since an object that has $A_i$ should also have attribute $A_j$. For example, the closest match for abandoned destroyed in $K_2$ is abandoned, which is a broader term. Then attribute abandoned destroyed is added to $K$, and the following two relationships are added to $K$:

$\langle \text{K2O1}, \text{abandoned destroyed} \rangle$
$\langle \text{K2O1}, \text{abandoned} \rangle$

where the first originates from $K_2$ due to the existence of $\langle \text{K2O1}, \text{abandoned destroyed} \rangle$ in $K_2$. The second is derived due to the fact that abandoned destroyed is a more specific feature than abandoned, and therefore the existence of $\langle \text{K2O1}, \text{abandoned destroyed} \rangle$ derives $\langle \text{K2O1}, \text{abandoned} \rangle$. The result of these context extensions is shown in Table 9.

Table 9
The context table after a broader match.

Columns: Main, Operational, Abandoned, Abandoned intact, Proposed recommission, Abandoned destroyed

K1O1 X X
K1O2 X X
K1O3 X X
K1O4 X X X
K2O1 X X X
K2O2 X X
K2O3 X
K2O4 X
K2O5 X

Table 10
The context table after a narrower match.

Columns: Main, Operational, Abandoned, Abandoned intact, Proposed recommission, Abandoned destroyed, Proposed

K1O1 X X
K1O2 X X X
K1O3 X X
K1O4 X X X
K2O1 X X X
K2O2 X X
K2O3 X X
K2O4 X
K2O5 X

III. Ai finds a match Aj that is more specific than it. In this case the context K is expanded with Ai and the existing relationships between Ai and objects from K2. New binary relationships are established in K between the attribute Ai (originally from K2) and those objects that have relationships with Aj (originally from K1). The rationale is that if Ai is a more generic feature than Aj, then any object which has attribute Aj should also have attribute Ai. For example, if the closest match for the attribute proposed in K2 is proposed recommission in K, which is a narrower term, then the attribute proposed is added to K. The following two relationships are added to K:

<K2O3, proposed>
<K1O2, proposed>

where the first originates from K2 due to the existence of <K2O3, proposed> in K2. The second is established due to the existence of <K1O2, proposed recommission> in K, together with the fact that proposed is a more generic feature than proposed recommission (Table 10).

Table 11
The context table after a non-match.

Main | Operational | Abandoned | Abandoned intact | Proposed recommission | Abandoned destroyed | Proposed | Standby

K1O1 X X
K1O2 X X X
K1O3 X X
K1O4 X X X
K2O1 X X X
K2O2 X X
K2O3 X X
K2O4 X X
K2O5 X

Table 12
The context table after a composite match (the integrated context table).

Main | Operational | Abandoned | Abandoned intact | Proposed recommission | Abandoned destroyed | Proposed | Standby

K1O1 X X
K1O2 X X X
K1O3 X X
K1O4 X X X
K2O1 X X X
K2O2 X X
K2O3 X X
K2O4 X X
K2O5 X X X

IV. Ai finds no match in K. In this case the context K is simply expanded with Ai and the existing relationships between Ai and objects originating from K2. For example, there is no semantic match in K for the attribute standby of K2. In this case, K is extended with attExt = {standby} and relExt = {<K2O4, standby>}, as shown in Table 11.
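The four primitive operations can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the research: the function names `primitive_extension` and `extend`, the match-type labels, and the representation of a context as a set of (object, attribute) pairs are all assumptions made for the example. Only the relationships quoted in the text above are used as data.

```python
from typing import Optional, Set, Tuple

Relation = Tuple[str, str]  # (object, attribute)


def primitive_extension(
    I: Set[Relation],            # relationships already in K
    I2: Set[Relation],           # relationships of the incoming context K2
    a_i: str,                    # attribute of K2 being mapped
    match_type: str,             # "equivalent" | "broader" | "narrower" | "none"
    a_j: Optional[str] = None,   # matched attribute in K (None when there is no match)
) -> Tuple[Set[str], Set[Relation]]:
    """Return (attExt, relExt) for one 1-to-1 mapping, following cases I-IV."""
    owners = {o for (o, a) in I2 if a == a_i}   # K2 objects that have a_i
    own_rels = {(o, a_i) for o in owners}       # relationships carried over from K2

    if match_type == "equivalent":              # case I: a_i is unified with a_j
        return set(), {(o, a_j) for o in owners}
    if match_type == "broader":                 # case II: a_j is more generic than a_i
        return {a_i}, own_rels | {(o, a_j) for o in owners}
    if match_type == "narrower":                # case III: a_j is more specific than a_i
        inherited = {(o, a_i) for (o, a) in I if a == a_j}
        return {a_i}, own_rels | inherited
    if match_type == "none":                    # case IV: no match in K
        return {a_i}, own_rels
    raise ValueError(f"unknown match type: {match_type}")


def extend(M: Set[str], I: Set[Relation], att_ext: Set[str], rel_ext: Set[Relation]):
    """Eqs. (2) and (3): M = M ∪ attExt and I = I ∪ relExt."""
    return M | att_ext, I | rel_ext


# Data limited to the relationships quoted in the text, not the full tables.
I = {("K1O2", "proposed recommission")}
I2 = {("K2O1", "abandoned destroyed"), ("K2O2", "live"),
      ("K2O3", "proposed"), ("K2O4", "standby")}

case_i = primitive_extension(I, I2, "live", "equivalent", "operational")
case_ii = primitive_extension(I, I2, "abandoned destroyed", "broader", "abandoned")
case_iii = primitive_extension(I, I2, "proposed", "narrower", "proposed recommission")
case_iv = primitive_extension(I, I2, "standby", "none")
```

Each call returns the pair (attExt, relExt) given for the corresponding case in the text; passing that pair to `extend` applies Eqs. (2) and (3).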

5.2. Composite operations

In many situations, an attribute may have multiple matches, each of a different type, e.g. both an equivalent and a broader match at the same time. The primitive operations discussed above can be composed to deal with these complex cases. For an attribute A ∈ M2 of K2, if a set of matches {A1, A2, …, An} is identified from M of K, the context K is extended as follows:

$$M = M \cup \bigcup_{j=1}^{n} \mathit{attExt}_{A_j} \tag{4}$$

$$I = I \cup \bigcup_{j=1}^{n} \mathit{relExt}_{A_j} \tag{5}$$

where attExt_{Aj} and relExt_{Aj} respectively denote the attribute and relationship sets that are derived when A is matched to Aj with the primitive operations discussed in Section 5.1.

For example, for the attribute abandoned intact, two matches are found from K: the equivalent match abandoned intact and the generic match abandoned. For the equivalent match abandoned intact, the following are generated:

attExt = ∅
relExt = {<K2O5, abandoned intact>}

For the generic match abandoned, the following are generated:

attExt = {abandoned intact}
relExt = {<K2O5, abandoned intact>, <K2O5, abandoned>}

Adding these into K results in the formal context shown in Table 12, which is also the final integrated context table. The GSH constructed from this integrated context is illustrated in Fig. 9.
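The composition in Eqs. (4) and (5) amounts to taking the union of the per-match extensions. The sketch below illustrates this with the abandoned intact example; the function name `composite_extension` and the set-based data layout are assumptions made for illustration, and the existing relationships of K are left out for brevity.

```python
from typing import Iterable, Set, Tuple

Relation = Tuple[str, str]  # (object, attribute)


def composite_extension(
    M: Set[str],
    I: Set[Relation],
    extensions: Iterable[Tuple[Set[str], Set[Relation]]],
) -> Tuple[Set[str], Set[Relation]]:
    """Union every per-match (attExt, relExt) pair into M and I (Eqs. 4 and 5)."""
    for att_ext, rel_ext in extensions:
        M = M | att_ext
        I = I | rel_ext
    return M, I


# Attributes of K before the match; existing relationships omitted for brevity.
M = {"main", "operational", "abandoned", "abandoned intact"}
I: Set[Relation] = set()

extensions = [
    # equivalent match "abandoned intact"
    (set(), {("K2O5", "abandoned intact")}),
    # generic match "abandoned"
    ({"abandoned intact"}, {("K2O5", "abandoned intact"), ("K2O5", "abandoned")}),
]

M_new, I_new = composite_extension(M, I, extensions)
```

Because the operations are set unions, the order in which the matches are processed does not affect the result, and the attribute abandoned intact is not duplicated even though it appears in attExt for the generic match.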

Fig. 9. The GSH derived from the context in Table 12.

Fig. 10. Ontology derivation.

6. Ontology derivation

The ontology derivation component of our framework takes the GSH generated in Section 5 and generates an ontological structure¹. Fig. 10 shows the components of ontology derivation. The GSH is exploited to derive several types of information, including ontological concepts, subsumption relationships between concepts, and attributes of concepts. The information identified forms an ontological structure from which a full ontology can be developed. The mappings between concepts of different datasets can also be identified from the GSH.

6.1. Mapping identification

This subcomponent derives mappings between the concepts of two datasets. Given a formal concept in a GSH, if its extent contains more than one object (e.g. the extent of node 3 is {K1O1, K2O2} in Fig. 9), this indicates a potential mapping between these source concepts. Domain engineers are asked at the evaluation stage to judge whether an identified mapping is correct. If the answer is negative, features need to be identified to differentiate one concept from another. This often involves identifying new attributes or relationships of the concepts concerned. The existence of incorrect matches triggers the need to iterate the context composition or integration operations.
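A simple way to surface such candidate mappings is to group objects by their attribute sets, as sketched below. This assumes that the extent of a GSH node is read in the reduced-labelling sense, i.e. it lists the objects that share exactly the same attributes; that reading, the helper names, the dataset-prefix convention (`K1`, `K2`) and the sample attribute sets are all assumptions for illustration and are not copied from Table 12.

```python
from collections import defaultdict
from typing import Dict, List, Set


def dataset_of(obj: str) -> str:
    """Assumed naming convention: the first two characters identify the dataset."""
    return obj[:2]


def candidate_mappings(context: Dict[str, Set[str]]) -> List[List[str]]:
    """Group objects with identical attribute sets; keep groups spanning several datasets."""
    groups = defaultdict(set)
    for obj, attrs in context.items():
        groups[frozenset(attrs)].add(obj)
    return sorted(
        sorted(objs)
        for objs in groups.values()
        if len({dataset_of(o) for o in objs}) > 1
    )


# Illustrative context (object -> attributes); values are hypothetical.
context = {
    "K1O1": {"main", "operational"},
    "K2O2": {"main", "operational"},
    "K2O4": {"main", "standby"},
}

mappings = candidate_mappings(context)
```

Each returned group is only a candidate: as described above, a domain engineer still has to confirm or reject it.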

6.2. Concept/relationship/attribute identification

As we employ a GSH in the research, intermediate, abstract concepts are reduced in the context integration step and the resulting hierarchy consists only of object concepts and attribute concepts. Object concepts have to be kept in the resultant ontological hierarchy as they correspond to the initial concepts (either explicit or implicit) of datasets and therefore need to remain in the ontological structure to respect the initial class specification of the datasets. For an attribute concept, the assistance of domain engineers is required to decide whether it should be kept or discarded, taking into account its significance or interest to the application. When an attribute concept is discarded in a GSH, all elements in its intent are passed on to its sub-concepts, and super-/sub-concept relationships are established between its super-concepts and sub-concepts.

¹ The term ontological structure is used here to mean that the derived conceptual structure only contains limited data semantics, i.e. only concepts, attributes and is-a relationships are identified. Further development is still required to capture other data semantics to generate a full ontology.

Table 13
Source datasets.

Dataset  No. of concepts  No. of attributes
D1       54               18
D2       47               21
D3       43               22
D4       66               23

After a decision has been made on which concepts are to be kept in the resultant ontology, the rules for identifying relationships and attributes of a concept are straightforward:

• All elements in the intent of a formal concept are declared as attributes of the ontological concept.

• Sub/super relations between two formal concepts are identified as is-a relationships between the corresponding ontological concepts.
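The two rules can be sketched as follows, assuming each formal concept is available with its full intent. The function name `derive_ontology`, the concept names and the intents are hypothetical. Sub/super relations are recovered here from strict intent containment followed by a transitive reduction, which is one possible way to obtain the direct is-a links; with full intents, discarding an attribute concept automatically reconnects its sub-concepts to its super-concepts.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def derive_ontology(
    intents: Dict[str, Set[str]],   # formal concept -> full intent
    keep: Iterable[str],            # concepts retained after expert review
) -> Tuple[Dict[str, FrozenSet[str]], Set[Tuple[str, str]]]:
    """Return (attributes per concept, direct is-a pairs as (sub, super))."""
    kept = {c: frozenset(intents[c]) for c in keep}

    # c lies below d when the intent of c strictly contains the intent of d.
    below = {(c, d) for c in kept for d in kept if kept[d] < kept[c]}

    # Transitive reduction: keep (c, d) only if no kept concept sits between them.
    is_a = {
        (c, d)
        for (c, d) in below
        if not any((c, e) in below and (e, d) in below for e in kept)
    }
    return kept, is_a


# Hypothetical concepts: two attribute concepts and one object concept.
intents = {
    "C_main": {"main"},
    "C_abandoned": {"main", "abandoned"},
    "X1": {"main", "abandoned", "abandoned intact"},
}

attrs_all, is_a_all = derive_ontology(intents, ["C_main", "C_abandoned", "X1"])
attrs_cut, is_a_cut = derive_ontology(intents, ["C_main", "X1"])  # C_abandoned discarded
```

In the second call the object concept X1 keeps the attribute abandoned in its intent and is linked directly to C_main, mirroring the treatment of discarded attribute concepts described above.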

7. Empirical evaluation

An evaluation of the proposed techniques has been performed on several industrial datasets. We first describe the experimental setup and the ontology similarity measures employed in the evaluation. We then report on the evaluation results.

7.1. Experiment setup and ontology similarity measures

The datasets used in our experiments were sourced from four UK water companies. These datasets essentially encode the same types of information, including various water pipes, metering and treatment facilities for transporting fresh water/waste water for customers across the UK. However, each organisation records its information with little thought towards interoperability with others, which results in data heterogeneities. Due to data confidentiality agreements with our industrial partners, we cannot publish these datasets. Nevertheless, we list the statistics on the datasets in Table 13.

The mapping and integration was carried out in a semi-automated manner: data acquisition and attribute disambiguation were conducted manually, the open source tool Galicia (Valtchev et al., 2003) was employed for context manipulation and GSH generation, and all other processes such as information explication and conceptual scaling were completed with Java and SQL code. The evaluation was performed in three phases.

• Phase I experiments constructed local ontologies for each dataset involved. Pairwise comparison was conducted to measure the similarity of these local ontologies, and the results then served as benchmarks for the subsequent evaluation.

• Phase II experiments studied how implicit information impacts on ontology interoperability, and demonstrated how information explication can help with ontology alignment.

• Phase III compared an ontology developed in this research with a handcrafted ontology developed with a traditional knowledge engineering approach. The performance of the two ontologies was evaluated by studying how well they fit and respect the knowledge structures of the datasets to be integrated.

Evaluation was performed at two levels: the lexical level and the taxonomic level. Lexical level evaluation reflects how well the lexical terms of a source ontology cover those of a target ontology. Taxonomic level evaluation examines how well the conceptual hierarchy of a source ontology resembles that of a target ontology. We employ the ontology measures proposed in Dellschaft and Staab (2006) and Maedche and Staab (2002) in our experiments. Lexical precision and recall of a source ontology OS against a target ontology OT are computed as:

$$LP(O_S, O_T) = \frac{|C_S \cap C_T|}{|C_S|} \tag{6}$$

$$LR(O_S, O_T) = \frac{|C_S \cap C_T|}{|C_T|} \tag{7}$$

where CS (or CT) is the set of terms describing concepts in OS (or OT).

Lexical F-measure, LF, is used for balancing the precision and recall values, and is calculated as the harmonic mean of LP and LR:

$$LF(O_S, O_T) = \frac{2 \cdot LP(O_S, O_T) \cdot LR(O_S, O_T)}{LP(O_S, O_T) + LR(O_S, O_T)} \tag{8}$$
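The lexical measures translate directly into set operations. The sketch below assumes each ontology is represented simply by the set of its concept terms and that both sets are non-empty; the function names and the sample terms are illustrative only.

```python
from typing import Set


def lexical_precision(cs: Set[str], ct: Set[str]) -> float:
    """LP: share of source terms that also occur in the target (cs must be non-empty)."""
    return len(cs & ct) / len(cs)


def lexical_recall(cs: Set[str], ct: Set[str]) -> float:
    """LR: share of target terms that also occur in the source (ct must be non-empty)."""
    return len(cs & ct) / len(ct)


def harmonic_mean(p: float, r: float) -> float:
    """F-measure of two scores; defined as 0 when both scores are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# Hypothetical concept terms of a source and a target ontology.
cs = {"pipe", "valve", "hydrant", "meter"}
ct = {"pipe", "valve", "pump"}

lp = lexical_precision(cs, ct)   # 2 shared terms out of 4 source terms
lr = lexical_recall(cs, ct)      # 2 shared terms out of 3 target terms
lf = harmonic_mean(lp, lr)
```

Note that LP and LR swap roles when source and target are exchanged, which is why the baseline tables reported later are not symmetric.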

Taxonomic level measures are divided into local and global measures. Local measures compare the similarity of the hierarchical positions of two concepts in the source and the target ontologies. For local taxonomic precision, the similarity of two concepts is computed based on the common semantic cotopies from the concept hierarchies. The common semantic cotopy includes all the common super- and sub-concepts of a concept pair. Given such a semantic cotopy ce, the local taxonomic precision tp and recall tr of two concepts c1 ∈ CS and c2 ∈ CT are defined as

$$tp(c_1, c_2, O_S, O_T) = \frac{|ce(c_1, O_S) \cap ce(c_2, O_T)|}{|ce(c_1, O_S)|} \tag{9}$$

$$tr(c_1, c_2, O_S, O_T) = \frac{|ce(c_1, O_S) \cap ce(c_2, O_T)|}{|ce(c_2, O_T)|} \tag{10}$$

Since

$$tp(c_2, c_1, O_T, O_S) = \frac{|ce(c_1, O_S) \cap ce(c_2, O_T)|}{|ce(c_2, O_T)|},$$

we have

$$tr(c_1, c_2, O_S, O_T) = tp(c_2, c_1, O_T, O_S) \tag{11}$$
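The local measures can be sketched as below. The representation of an ontology as a mapping from each concept to its direct super-concepts, the function names, and the two small hierarchies are assumptions made for illustration. The sketch also adopts the convention that tp is 0 when the cotopy in the denominator is empty, which the definitions above do not specify.

```python
from typing import Dict, Set

Hierarchy = Dict[str, Set[str]]  # concept -> direct super-concepts


def concepts(h: Hierarchy) -> Set[str]:
    """All concepts mentioned in the hierarchy, as keys or as super-concepts."""
    return set(h) | {p for parents in h.values() for p in parents}


def ancestors(h: Hierarchy, c: str) -> Set[str]:
    """All direct and indirect super-concepts of c."""
    seen: Set[str] = set()
    stack = list(h.get(c, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(h.get(p, ()))
    return seen


def cotopy(h: Hierarchy, c: str, common: Set[str]) -> Set[str]:
    """Common semantic cotopy: super- and sub-concepts of c found in both ontologies."""
    subs = {d for d in concepts(h) if c in ancestors(h, d)}
    return (ancestors(h, c) | subs) & common


def tp(c1: str, c2: str, o_s: Hierarchy, o_t: Hierarchy) -> float:
    """Local taxonomic precision, Eq. (9); 0.0 if the source cotopy is empty (assumed)."""
    common = concepts(o_s) & concepts(o_t)
    ce1, ce2 = cotopy(o_s, c1, common), cotopy(o_t, c2, common)
    return len(ce1 & ce2) / len(ce1) if ce1 else 0.0


def tr(c1: str, c2: str, o_s: Hierarchy, o_t: Hierarchy) -> float:
    """Local taxonomic recall via Eq. (11): tr(c1, c2, OS, OT) = tp(c2, c1, OT, OS)."""
    return tp(c2, c1, o_t, o_s)


# Two hypothetical hierarchies that place "main" differently.
o_s = {"pipe": set(), "main": {"pipe"}, "abandoned main": {"main"}}
o_t = {"pipe": {"asset"}, "main": {"asset"}, "abandoned main": {"main"}}

local_tp = tp("main", "main", o_s, o_t)
local_tr = tr("main", "main", o_s, o_t)
```

In this example the cotopy of main is {pipe, abandoned main} in the source and {abandoned main} in the target, so only half of the source cotopy is confirmed by the target while the whole target cotopy is covered by the source.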

Global taxonomic precision and recall are defined by averaging the local taxonomic precision and recall over the concepts common to the two ontologies:

$$TP(O_S, O_T) = \frac{1}{|C_S \cap C_T|} \sum_{c \in C_S \cap C_T} tp(c, c, O_S, O_T) \tag{12}$$

$$TR(O_S, O_T) = \frac{1}{|C_S \cap C_T|} \sum_{c \in C_S \cap C_T} tr(c, c, O_S, O_T) \tag{13}$$

Since

$$TP(O_T, O_S) = \frac{1}{|C_S \cap C_T|} \sum_{c \in C_S \cap C_T} tp(c, c, O_T, O_S)$$

and tp(c, c, OT, OS) = tr(c, c, OS, OT) due to Eq. (11), we have

$$TR(O_S, O_T) = TP(O_T, O_S) \tag{14}$$

Taxonomic F-measure TF is used to balance TP and TR to generate a combined taxonomic measure:

$$TF(O_S, O_T) = \frac{2 \cdot TP(O_S, O_T) \cdot TR(O_S, O_T)}{TP(O_S, O_T) + TR(O_S, O_T)} \tag{15}$$

A combined measure GF, which balances the lexical and taxonomic measures, is used to give a summarising overview of the similarity of OS against OT, and is computed as the harmonic mean of LF and TF:

$$GF(O_S, O_T) = \frac{2 \cdot LF(O_S, O_T) \cdot TF(O_S, O_T)}{LF(O_S, O_T) + TF(O_S, O_T)} \tag{16}$$
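Given the local scores for the common concepts, the global measures in Eqs. (12) to (16) reduce to an average followed by two harmonic means. The sketch below assumes the local scores have already been computed and are supplied as dictionaries keyed by concept; the function names and all numeric values are hypothetical.

```python
from typing import Dict


def global_taxonomic(local_scores: Dict[str, float]) -> float:
    """Average of the local scores over the common concepts (Eqs. 12 and 13)."""
    return sum(local_scores.values()) / len(local_scores)


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean used for TF (Eq. 15) and GF (Eq. 16); 0 when both inputs are 0."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# Hypothetical local precision and recall for three common concepts.
tp_scores = {"main": 0.5, "abandoned main": 1.0, "pipe": 0.0}
tr_scores = {"main": 1.0, "abandoned main": 1.0, "pipe": 0.0}

TP = global_taxonomic(tp_scores)   # (0.5 + 1.0 + 0.0) / 3
TR = global_taxonomic(tr_scores)   # (1.0 + 1.0 + 0.0) / 3
TF = harmonic_mean(TP, TR)

LF = 0.8                           # assumed lexical F-measure
GF = harmonic_mean(LF, TF)
```

Because GF is a harmonic mean, it is pulled towards the lower of LF and TF, so an ontology cannot score well overall by matching terms while ignoring the hierarchy, or the reverse.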

7.2. Experimental results

7.2.1. Phase I

The concepts and attributes from the four datasets were identified and used to generate context tables. The context tables were then fed to Galicia to derive GSHs. An ontology was generated from each GSH by discarding all attribute concepts and keeping the object concepts. Four ontologies were generated, one for each dataset (i.e. an ontology Oi is generated for a dataset Di, where i = 1, 2, 3 and 4).

The metrics described in Section 7.1 were used to measure the similarity of these ontologies. We observed that the four water companies differ greatly from each other in what business objects they record in their systems, which leads to ontologies that are incompatible with each other both lexically and taxonomically. These local ontologies agreed with each other only to a small extent: only a relatively small percentage of terms in one ontology were also found in another ontology, as measured by the lexical precision LP (Table 14). Ontology O2 is the one that has the fewest terms in common with the other ontologies. Manual inspection of these ontologies found that this lexical disagreement was mainly due to the different aspects of the domain that an organisation chose to encode in its data management systems, which resulted in different ontology concepts. The poor performance of the O2 ontology was due to granularity issues: it encoded concepts at a finer level than the other ontologies, which resulted in lexical mismatches with them.

Table 14
Baseline LP.

      O1      O2      O3      O4
O1    −       9.26%   20.37%  25.93%
O2    10.38%  −       8.51%   17.02%
O3    25.58%  9.30%   −       27.91%
O4    21.21%  12.12%  18.19%  −

Table 15
Baseline TP.

      O1      O2      O3      O4
O1    −       10.00%  40.90%  34.52%
O2    11.83%  −       11.07%  26.82%
O3    36.60%  12.60%  −       30.75%
O4    33.29%  24.58%  30.65%  −

Table 16
Rule sets.

      Total no. of rules  No. of attribute rules  No. of object rules
O1    15                  6                       9
O2    12                  5                       6
O3    11                  6                       5
O4    15                  6                       9

Fig. 11. Lexical precision.

Fig. 12. Taxonomic precision.

The taxonomic level similarity of these ontologies was slightly better, but scores were still quite low, as shown in Table 15. The presence of different concepts in the hierarchies of these ontologies led to disappointing results. Again, ontology O2 performed the worst: it has a much lower taxonomic precision than the other ontologies. Examination revealed that the granularity mismatch was again the main cause. As the O2 ontology encoded business objects at a finer granularity than the others, it had a very different hierarchy from those of the other ontologies.

7.2.2. Phase II

Experiments were first performed to restore implicit information for the ontologies generated in Phase I, and the resultant ontologies were then compared with each other using the same measures. To do this, rules for restoring implicit information were acquired for each dataset with the help of domain engineers. Table 16 shows the statistics on these rule sets.

The rules were applied to the formal contexts generated in the Phase I experiments to restore implicit information as well as derive new feature types. The resultant contexts were used to generate ontologies in the same way as in the Phase I experiments. The similarity measures were calculated for these ontologies, and the results were compared with the ones obtained in the Phase I experiments, as shown in Figs. 11 and 12. Compared with the baseline similarity scores, we can see a substantial improvement in the similarity of these ontologies, both at the lexical level and at the taxonomic level. The average lexical precision increased to around 60%, from below 20% in the Phase I study (Fig. 11). This was mainly due to the increase in common feature types, which were restored in the information explication process.

Table 17
Attribute disambiguation.

Target      Source  Exact match  BT match  NT match  Non-match  Multi-match
O1          O2      12           0         2         5          2
O1+O2       O3      15           1         1         4          1
O1+O2+O3    O4      16           2         2         2          1

Table 18
Performance of FCA ontology and KE ontology.

      FCA ontology              KE ontology
      LP      TP      GF        LP      TP      GF
O1    64.11%  67.07%  83.81%    62.88%  53.29%  44.39%
O2    64.92%  74.61%  88.02%    53.61%  51.81%  39.67%
O3    47.18%  64.99%  82.93%    52.58%  49.67%  46.25%
O4    72.39%  75.54%  86.38%    71.13%  52.41%  42.17%

Taxonomic precision improved similarly, from 20% to around 60% on average (Fig. 12). This improvement was mainly due to the resulting ontologies bearing a similar level of detail in their hierarchies once they were enriched with derived objects generated with rules. A concept in one ontology had an increased number of common super- and sub-concepts with its matching concept in another ontology. This resulted in improved local taxonomic similarity and therefore improved global taxonomic similarity. This led to the conclusion that implicit information greatly affects the similarity of the local ontologies, and that the similarity of these ontologies can be improved significantly if implicit information is restored.

7.2.3. Phase III

The four local ontologies, which had implicit information restored in the Phase II experiments, were then integrated to build a global ontology. This was achieved by first performing the context integration as described in Section 5. The contexts of O1 and O2 were integrated first, and the resultant context was then integrated with the O3 context and so on, as shown in Table 17. The main activity performed here was attribute disambiguation. Table 17 shows the types of attribute matches found during the various stages of the integration process. For example, of the 21 attributes of the O2 context, 12 found an exact match in the O1 context, 2 found narrower matches, 5 did not find any match, and 2 found multiple matches. After attribute disambiguation, the integrated context was used to generate a GSH, from which an integrated ontology was derived. The total number of concepts in the integrated ontology was 248 and the depth of the hierarchy was 6.
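The incremental order of integration (O1 with O2, the result with O3, and so on) is a left fold over the list of contexts. The sketch below shows only that control flow: the `integrate` function is a deliberately simplified stand-in that treats attributes with identical names as exact matches and performs no disambiguation, and all object identifiers and attribute sets are hypothetical.

```python
from functools import reduce
from typing import Dict, Set

Context = Dict[str, Set[str]]  # object -> attributes


def integrate(target: Context, source: Context) -> Context:
    """Simplified merge: same-named attributes are assumed to be exact matches."""
    merged = {obj: set(attrs) for obj, attrs in target.items()}
    for obj, attrs in source.items():
        merged.setdefault(obj, set()).update(attrs)
    return merged


# Hypothetical contexts standing in for three local ontologies.
c1: Context = {"K1O1": {"main", "operational"}}
c2: Context = {"K2O4": {"main", "standby"}}
c3: Context = {"K3O1": {"main", "abandoned"}}

# ((c1 + c2) + c3): each step integrates the next context into the running result.
integrated = reduce(integrate, [c1, c2, c3])
```

In the procedure described above, each step of the fold would instead apply the primitive and composite operations of Section 5 after attribute disambiguation.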

To evaluate the quality of this integrated ontology (FCA ontology for short), we compared it against a handcrafted ontology that was developed with a traditional knowledge engineering approach as described in Fu and Cohn (2008a) (KE ontology for short). Both the FCA ontology and the KE ontology had the same local ontologies (i.e. O1, O2, O3 and O4) as major inputs (and therefore the comparisons made in this research are unbiased), but they differ from each other in how the ontological hierarchies were built and how implicit/unarticulated information was recovered. The hierarchy of the FCA ontology was generated automatically with the FCA tool Galicia based on the attribute definitions of objects, and the hierarchy of the KE ontology was generated manually based on the knowledge of domain experts. The FCA ontology achieved information explication via the domain rules as discussed in Section 4. The KE ontology did this through a manual semantic enrichment process. Extra data semantics of the KE ontology were manually derived from both system design documents and domain engineers. The resultant KE ontology consists of 216 concepts, organised in 5 hierarchical levels.

We evaluate the two ontologies in a similar fashion to Brewster, Alani, Dasmahapatra, and Wilks (2004). We consider that an ontology is of good quality when it conforms to and has a good coverage of the knowledge structures of the datasets to be integrated. The evaluation was performed by comparing the FCA ontology and the KE ontology against the local ontologies O1, O2, O3 and O4 as developed in the Phase II experiments. Table 18 summarises the results. Both ontologies had similar scores for the lexical precision LP when compared against these ontologies. This can be largely explained by the fact that both the FCA ontology and the KE ontology had these local ontologies as input, i.e., concepts in these ontologies were the major lexical sources of both ontologies. The FCA ontology outperformed the KE ontology on its similarity to the local ontologies at the taxonomic level. This is because the FCA ontology was generated systematically based on the attribute definitions of the input feature types (of the local ontologies), and sub- and super-concept relationships between concepts were identified in the same fashion as in the local ontologies. This led to the improved taxonomic precision of the FCA ontology. However, the ontological hierarchy generated with the KE method is rather subjective, i.e. it depends upon human judgement on what intermediate concepts to add and when a sub-/super-concept relationship should be established. The hierarchy tends to be distorted, with missing sub- and super-concept links, when the number of concepts increases. The FCA ontology also outperformed the KE ontology on the overall similarity measure GF. This leads to the conclusion that the FCA ontology fits and respects the local ontologies better and therefore better serves the integration purpose in this case.

8. Conclusions and future work

The availability of vast quantities of data presents organisations with both opportunities and challenges. Data integration techniques offer a promising way of addressing the issue of data heterogeneities and promoting data sharing and interoperability across organisations. In this paper we present a formal and semi-automated method for ontology development, with the aim of reconciling heterogeneous data and supporting data integration. The research extends classical FCA theory to address the issues of implicit and ambiguous information, which, we consider, are important but have not been sufficiently investigated by previous studies. The research enables a considerable number of ontology engineering activities to be automated, including concept derivation and hierarchy generation. In contrast to studies that draw upon either small or simplified datasets, we evaluate the proposed techniques on non-trivial industrial datasets. Our experimental results demonstrate that the techniques described in this paper can help curate and fuse data from disparate sources, and support the development of ontologies that better fit and respect the underlying knowledge structure of the domain. There are several lines of work which we plan to undertake in the future, including developing techniques to deal with incomplete information in data integration, and validating the proposed techniques on datasets in other application domains.

Acknowledgments

This research is a follow-on work of UK EPSRC grant EP/C014707/1. Gaihua Fu is currently funded by the EPSRC grants EP/I035781/1 and EP/K012398/1.

References

Bahga, A., & Madisetti, V. K. (2015). Healthcare data integration and informatics in the cloud. Computer, 48(2), 50–57.

Bai, X., & Zhou, X. Z. (2011). Development of ontology-based information system using formal concept analysis and association rules. Advances in Computer Science, Intelligent System and Environment, 106, 121–126.

Bakhtouchi, A., Bellatreche, L., & Ait-Ameur, Y. (2011). Ontologies and functional dependencies for data integration and reconciliation. Advances in Conceptual Modeling: Recent Developments and New Directions, 6999, 98–107.

Beck, A., Boukhelifa, N., Fu, G., Hickinbotham, S., Parker, J., Bennett, B., et al. (2013). Utility data integration and knowledge representation in the UK: The VISTA project. GeoHydroinformatics - Integrating GIS and Water Engineering. United Kingdom: Chapman & Hall/CRC Press.

Bian, J., Zhang, H., & Peng, X. G. (2011). The research and implementation of heterogeneous data integration under ontology mapping mechanism. Web Information Systems and Mining, 6988, 87–94.

Brewster, C., Alani, H., Dasmahapatra, S., & Wilks, Y. (2004). Data-driven ontology evaluation. In Proceedings of the Language Resources and Evaluation Conference (pp. 164–168).

Brockmans, S., Colomb, R. M., Haase, P., Kendall, E. F., Wallace, E. K., Welty, C., et al. (2006). A model driven approach for building OWL DL and OWL full ontologies. In Proceedings of the International Semantic Web Conference (ISWC): 2006 (pp. 187–200).

Chen, R. C., Bau, C. T., & Yeh, C. J. (2011). Merging domain ontologies based on the WordNet system and Fuzzy Formal Concept Analysis techniques. Applied Soft Computing, 11(2), 1908–1923.

Dellschaft, K., & Staab, S. (2006). On how to perform a gold standard based evaluation of ontology learning. In Proceedings of the International Semantic Web Conference (ISWC): 2006 (pp. 228–241).

Do, H., & Rahm, E. (2002). COMA - a system for flexible combination of schema matching approaches. In Proceedings of the 28th International Conference on Very Large Data Bases, August 20-23, 2002 (pp. 610–621). Hong Kong, China.

Doan, A., Domingos, P., & Halevy, A. Y. (2001). Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, May 21-24, 2001 (pp. 509–520). Santa Barbara, CA, USA.

Doan, A., Halevy, A., & Ives, Z. G. (2012). Principles of Data Integration. Waltham, MA: Morgan Kaufmann.

Doan, A., Madhavan, J., Domingos, P., & Halevy, A. Y. (2003). Ontology matching: A machine learning approach. In Staab Steffen, et al. (Eds.), Handbook on Ontologies (pp. 397–416). Berlin Heidelberg: Springer.

Doan, A., Noy, N. F., & Halevy, A. Y. (2004). Introduction to the special issue on semantic integration. Sigmod Record, 33(4), 11–13.

Duckham, M., & Worboys, M. (2005). An algebraic approach to automated information fusion. International Journal of Geographic Information Systems, 19(5), 537–557.

Duong, T. H., & Jo, G. S. (2012). Enhancing performance and accuracy of ontology integration by propagating priorly matchable concepts. Neurocomputing, 88, 3–12.

Duong, T. H., Truong, H. B., & Nguyen, N. T. (2012). Local neighbor enrichment for ontology integration. Intelligent Information and Database Systems, 7196, 156–166.

Formica, A. (2006). Ontology-based concept similarity in formal concept analysis. Information Sciences, 176(18), 2624–2641.

Fu, G., & Cohn, A. G. (2008a). Semantic integration for mapping the underworld. In Geoinformatics 2008 and Joint Conference on GIS and Built Environment: Geo-Simulation and Virtual GIS Environments, June 28-29, 2008. Guangzhou, China. Doi:7143/714327-714327-714329.

Fu, G., & Cohn, A. G. (2008b). Utility ontology development with formal concept analysis. In 5th International Conference on Formal Ontology in Information Systems (pp. 297–310).

Ganter, B., Stumme, G., & Wille, R. (2005). Formal Concept Analysis: Foundations and Applications. Berlin, Heidelberg: Springer.

Ganter, B., & Wille, R. (1999). Formal Concept Analysis: Mathematical Foundations. Berlin, New York: Springer.

Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199–220.

He, L. J., & Wang, Q. T. (2011). Construction of ontology information system based on formal concept analysis. Advances in Computer Science, Intelligent System and Environment, 104, 83–88.

Huang, S. L., Lin, S. C., & Chan, Y. C. (2012). Investigating effectiveness and user acceptance of semantic social tagging for knowledge sharing. Information Processing & Management, 48(4), 599–617.

Jiang, Y. C., Zhang, X. P., Tang, Y., & Nie, R. H. (2015). Feature-based approaches to semantic similarity assessment of concepts using Wikipedia. Information Processing & Management, 51(3), 215–234.

Kalfoglou, Y., & Schorlemmer, M. (2003). Ontology mapping: The state of the art. The Knowledge Engineering Review, 19(1), 1–31.

Lenzerini, M. (2002). Data integration: A theoretical perspective. In Proceedings of the 21st ACM Sigmod-Sigact-Sigart Symposium on Principles of Database Systems (pp. 233–246).

Liu, J., & Zhang, X. X. (2014). Data integration in fuzzy XML documents. Information Sciences, 280, 82–97.

Madhavan, J., Bernstein, P. A., & Rahm, E. (2001). Generic schema matching with cupid. In Proceedings of the 27th International Conference on Very Large Data Bases (pp. 49–58).

Madhavan, J., & Halevy, A. (2003). Composing mappings among data sources. In Proceedings of the 29th International Conference on Very Large Data Bases (pp. 572–583).

Maedche, A., & Staab, S. (2002). Measuring similarity between ontologies. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management: 2473 (pp. 251–263).

Mate, S., Kopcke, F., Toddenroth, D., Martin, M., Prokosch, H. U., Burkle, T., et al. (2015). Ontology-based data integration between clinical and research systems. Plos One, 10(1), E0122172.

Nanda, J., Simpson, T. W., Kumara, S. R. T., & Shooter, S. B. (2006). A methodology for product family ontology development using formal concept analysis and Web ontology language. Journal of Computing and Information Science in Engineering, 6(2), 103–113.

Neumann, F., Ho, C., Tian, X., Haas, L., & Meggido, N. (2002). Attribute classification using feature analysis. In Proceedings of the 18th International Conference on Data Engineering (p. 271).

Nguyen, N. T. (2006). Conflicts of ontologies - classification and consensus-based methods for resolving. In Knowledge-Based Intelligent Information and Engineering Systems, Pt 2, Proceedings: 4252 (pp. 267–274).

Nguyen, N. T. (2007). A method for ontology conflict resolution and integration on relation level. Cybernetics and Systems, 38(8), 781–797.

Noy, N. F. (2004). Semantic Integration: A survey of ontology-based approaches. Sigmod Record, 33(4), 65–70.

Noy, N. F., & Musen, M. A. (2000). PROMPT: Algorithm and tool for automated ontology merging and alignment. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence (pp. 450–455).

Pedersen, T. B., Pedersen, D., & Riis, K. (2013). On-demand multidimensional data integration: Toward a semantic foundation for cloud intelligence. Journal of Supercomputing, 65(1), 217–257.

Pinto, H. S., & Martins, J. P. (2004). Ontologies: How can they be built? Knowledge and Information Systems, 6(4), 441–464.

Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334–350.

Rodriguez, M., & Egenhofer, M. (2003). Determining semantic similarity among entity classes from different ontologies. IEEE Transactions on Knowledge and Data Engineering, 15(2), 442–456.

Rouane, M., Valtchev, P., Sahraoui, H., & Huchard, M. (2004). Merging conceptual hierarchies using concept lattices. In Proceedings of the 3rd Workshop on Managing SPEcialization/Generalization Hierarchies, June 15, 2004 (pp. 51–58). Oslo, Norway.

Spohr, D., Hollink, L., & Cimiano, P. (2011). A machine learning approach to multilingual and cross-lingual ontology matching. In Proceedings of the 10th International Conference on the Semantic Web (ISWC) 2011: 7031 (pp. 665–680).

Stumme, G., & Maedche, A. (2001). FCA-MERGE: Bottom-up merging of ontologies. In International Joint Conference On AI, August 4-10, 2001 (pp. 225–234). Seattle, Washington, USA.

Sure, Y., Tempich, C., & Vrandecic, D. (2006). Ontology engineering methodologies. Semantic Web Technologies: Trends and Research in Ontology-based Systems (pp. 171–190). Wiley.

Truong, H. B., & Nguyen, N. T. (2012). A multi-attribute and multi-valued model for fuzzy ontology integration on instance level. Intelligent Information and Database Systems, 7196, 187–197.

Uschold, M., & Grüninger, M. (2004). Ontologies and semantics for seamless connectivity. Sigmod Record, 33(4), 58–64.

Valtchev, P., Grosser, D., Roume, C., & Hacene, M. R. (2003). Galicia: an open platform for lattices. In Proceedings of the 11th International Conference on Conceptual Structures, July 21-25, 2003 (pp. 241–254). Dresden, Germany.

Wille, R., & Reidel, I. R. (1982). Restructuring lattice theory: An approach based on hierarchies of concepts. Ordered Sets (pp. 445–470). Dordrecht, Boston.

Xia, H. (2013). Semantic web ontology integration based on formal concept analysis. Mechatronics, Robotics and Automation, 373-375, 1714–1718.

Xie, J., Liu, F., & Guan, S. U. (2011). Tree-structure based ontology integration. Journal of Information Science, 37(6), 594–613.

Yang, S. (2011). Efficient ontology integration model for better inference in context aware computing. Computational Materials Science, 268-270, 841–846.

Yu, T., Chen, H. J., Mi, J. H., Gu, P. Q., Wu, T., & Pan, J. Z. (2012). DartWiki: A semantic wiki for ontology-based knowledge integration in the biomedical domain. Current Bioinformatics, 7(3), 278–288.

Zhao, Y., Wang, X., & Halang, W. (2006). Ontology mapping based on rough formal concept analysis. In Advanced International Conference on Telecommunications and International Conference on Internet and Web Applications and Services (AICT/ICIW 2006) 19-25 February 2006. French Caribbean: Guadeloupe.
