Exact learning description logic ontologies from data retrieval examples

(1)

Exact Learning Description Logic Ontologies from Data

Retrieval Examples

Boris Konev, Ana Ozaki, and Frank Wolter

Department of Computer Science, University of Liverpool, UK

Abstract. We investigate the complexity of learning description logic ontologies in Angluin et al.’s framework of exact learning via queries posed to an oracle. We consider membership queries of the form “is individualaa certain answer to a data retrieval queryqin a given ABox and the unkown target TBox?” and equivalence queries of the form “is a given TBox equivalent to the unknown target TBox?”. We show that (i) DL-Lite TBoxes with role inclusions andELIconcept expressions on the right-hand side of inclusions and (ii)ELTBoxes without complex concept expressions on the right-hand side of inclusions can be learned in polynomial time. Both results are proved by a non-trivial reduction to learning from subsumption examples. We also show that arbitraryELTBoxes cannot be learned in polynomial time.

1 Introduction

Building an ontology is prone to errors, time consuming, and costly. The research com-munities has addressed this problem in many different ways, for example, by supplying tool support for editing ontologies [15, 4, 9], developing reasoning support for debug-ging ontologies [18], supporting modular ontology design [17], and by investigating automated ontology generation from data or text [8, 6, 5, 14]. One major problem when building an ontology is the fact that domain experts are rarely ontology engineering ex-perts and that, conversely, ontology engineers are typically not familiar with the domain of the ontology. An ontology building project therefore often relies on the successful communication between an ontology engineer (familiar with the semantics of ontology languages) and a domain expert (familiar with the domain of interest). In this paper, we consider a simple model of this communication process and analyse, within this model, the computational complexity of reaching a correct domain ontology. We assume that

– the domain expert knows the domain ontology and its vocabulary without being able to formalize or communicate this ontology;

– the domain expert is able to communicate the vocabulary of the ontology and shares it with the the ontology engineer. Thus, the domain expert and ontology engineer have a common understanding of the vocabulary of the ontology. The ontology engineer knows nothing else about the domain.

(2)

• assume a data instanceAand a queryq(x)are given. Is the individualaa certain answer to queryq(x)inAand the ontologyO?

In addition, we require a way for the ontology engineer to find out whether she has reconstructed the target ontology already and, if this is not the case, to request an example illustrating the incompleteness of the reconstruction. We abstract from defining a communication protocol for this, but assume for simplicity that the following query can be posed by the ontology engineer:

• Is this ontologyHcomplete? If not, return a data instanceA, a queryq(x), and an individualasuch thatais a certain answer toq(x)inAand the ontologyO and it is not a certain answer toq(x)inAand the ontologyH.

Given this model, our question is whether the ontology engineer can learn the target ontologyOand which computational resources are required for this depending on the ontology language in which the ontologyOand the hypothesis ontologiesHare formu-lated. Our model obviously abstracts from a number of fundamental problems in building ontologies and communicating about them. In particular, it makes the assumption that the domain expert knows the domain ontology and its vocabulary (without being able to formalize it) despite the fact that finding an appropriate vocabulary for a domain of interest is a major problem in ontology design [8]. We make this assumption here in order to isolate the problem of communication about the logical relationships between known vocabulary items and its dependence on the ontology language within which the relationships can be formulated.

(3)

oracle [12]. Our move from concept subsumptions to data retrieval examples is motivated by the observation that domain experts are often more familiar with querying data in their domain than with the logical notion of subsumption between complex concepts. Detailed proofs are provided athttp://csc.liv.ac.uk/˜frank/publ/publ.html.

2 Preliminaries

LetNCandNRbe countably infinite sets ofconceptandrolenames, respectively. The dialect DL-Lite∃_Rof DL-Lite is defined as follows [7]. Aroleis a role name or an inverse roler−withr ∈ NR. Arole inclusion (RI)is of the formr v s, whererandsare roles. Abasic conceptis either a concept name or of the form∃r.>, withra role. A DL-Lite∃_R concept inclusion (CI) is of the formB vC, whereB is a basic concept expression andCis anELIconcept expression, that is,Cis formed according to the ruleC, D :=A | > |CuD | ∃r.C | ∃r−.CwhereAranges overNCandrranges overNR. A DL-Lite∃RTBox is a finite set of DL-Lite∃RCIs and RIs. As usual, anEL

concept expressionis anELIconcept expression that does not use inverse roles, anEL concept inclusionhas the formCvDwithCandDELconcept expressions, and a (general)ELTBoxis a finite set ofELconcept inclusions [2]. We also consider the restrictionELlhsof generalELTBoxes where only concept names are allowed on the right-hand side of concept inclusions. Thesizeof a concept expressionC, denoted with |C|, is the length of the string that represents it, where concept names and role names are considered to be of length one. A TBoxsignatureis the set of concept and role names occurring in the TBox. Thesizeof a TBoxT, denoted with|T |, isP

CvD∈T |C|+|D|.

LetNIbe a countably infinite set ofindividual names. An ABoxAis a finite non-empty set containingconcept name assertionsA(a)androle assertionsr(a, b), where

a, bare individuals inNI,Ais a concept name andris a role.Ind(A)denotes the set of individuals that occur inA.Ais asingletonABox if it contains only one ABox assertion. Assertions of the formC(a)andr(a, b), wherea, b∈NI,CanELIconcept expression, andr∈NR, are calledinstance assertions. Note that instance assertions of the form

C(a)withCnot a concept name norC =>do not occur in ABoxes. The semantics of description logic is defined as usual [3]. We writeI |=αto say that an inclusion or assertionαis true inI. An interpretationIis amodelof a KB(T,A)ifI |=αfor all

α∈ T ∪ A.(T,A)|=αmeans thatI |=αfor all modelsIof(T,A).

Alearning frameworkFis a triple(X,L, µ), whereXis a set ofexamples(also called domain or instance space),Lis a set oflearning concepts1_and_µ_{is a mapping} fromLto2X. Thesubsumptionlearning frameworkFS, studied in [12], is defined as

(XS,L, µS), whereLis the set of all TBoxes that are formulated in a given DL;XS is

the set ofsubsumption examplesof the formCvD, whereC, Dare concept expressions of the DL under consideration; andµS(T)is defined as{CvD∈XS | T |=CvD},

for everyT ∈ L. It should be clear thatµS(T) =µS(T0)if, and only if, the TBoxesT

andT0_{entail the same set of inclusions, that is, they are logically equivalent.}

1

(4)

We study thedata retrievallearning frameworkFD defined as(XD,L, µD), where

L is same as inFS; X is the set ofdata retrieval examplesof the form(A, D(a)),

whereAis an ABox,D(a)is a concept assertion of the DL under consideration, and

a ∈ Ind(A); andµ(T) = {(A, D(a)) ∈ XD | (T,A)|= D(a)}. As in the case of

learning from subsumptions,µS(T) =µS(T0)if, and only if, the TBoxesT andT0are

logically equivalent.

Given a learning frameworkF = (X,L, µ), we are interested in the exact identi-fication of atargetlearning conceptl ∈ Lby posing queries to oracles. LetMEMl,X be the oracle that takes as input somex∈ X and returns ‘yes’ ifx∈ µ(l)and ‘no’ otherwise. We say thatxis apositive exampleforlifx∈µ(l)and anegative example forlifx6∈µ(l). Then amembership queryis a call to the oracleMEMl,X. Similarly, for everyl∈ L, we denote byEQl,X the oracle that takes as input ahypothesislearning concepth∈ Land returns ‘yes’, ifµ(h) =µ(l), or acounterexamplex∈µ(h)⊕µ(l)

otherwise, where⊕denotes the symmetric set difference. Anequivalence queryis a call to the oracleEQl,X.

We say that a learning framework(X,L, µ)isexact learnableif there is an algorithm

Asuch that for any targetl∈ Lthe algorithmAalways halts and outputsl0 ∈ Lsuch thatµ(l) =µ(l0)using membership and equivalence queries answered by the oracles MEMl,X andEQl,X, respectively. A learning framework (X,L, µ) ispolynomially exact learnable if it is exact learnable by an algorithmAsuch that at every step2 _of computation the time used byAup to that step is bounded by a polynomialp(|l|,|x|), wherelis the target andx∈Xis the largest counterexample seen so far3_{. As argued in} the introduction, for learning subsumption and data retrieval learning frameworks we additionally assume that the signature of the target TBox is always known to the learner. An important class of learning algorithms—in particular, all algorithms presented in [12, 10, 16] fit in this class—always make equivalence queries with hypothesesh

which are polynomial in the size ofland such thatµ(h)⊆µ(l), so that counterexamples returned by theEQl,X oracles are always positive. We say that such algorithms use positive bounded equivalence queries.

3 Polynomial Time Learnability

In this section we prove polynomial time exact learnability of the DL-Lite∃_RandELlhs data retrieval learning frameworks. These frameworks are instances of the general definition given above, where the concept expressionD in a data retrieval example

(A, D(a))is anELIconcept expression in the DL-Lite∃_Rframework and anELconcept expression in theELlhsframework, respectively.

The proof is by reduction to learning from subsumptions. We illustrate its idea for ELlhs. To learn a TBox from data retrieval examples we run a learning from subsumptions algorithm as a ‘black box’. Every time the learning from subsumptions algorithm makes a membership or an equivalence query we rewrite the query into the data setting and pass it on to the data retrieval oracle. The oracle’s answer, rewritten back to the subsumption

2

We count each call to an oracle as one step of computation. 3

(5)

A r,s

[image:5.612.134.484.115.195.2]

A A .. . A s A r s .. . A s A r r s A .. . A s A r s .. . A s A r r r

Fig. 1:An ABoxA={r(a, a), s(a, a), A(a)}and its unravelling up to leveln.

setting, is given to the learning from subsumptions algorithm. When the learning from subsumptions algorithm terminates we return the learnt TBox. This reduction is made possible by the close relationship between data retrieval and subsumption examples. For every TBoxT and inclusionsC vD, one can interpret a concept expressionCas a labelled tree and encode this tree as an ABoxACwith rootρCsuch thatT |=CvD iff(T,AC)|=D(ρC).

Then, membership queries in the subsumption setting can be answered with the help of a data retrieval oracle due to the relation between subsumptions and instance queries described above. An inclusionC v D is a (positive) subsumption example for some target TBox T if, and only if, (AC, D(ρC))is a (positive) data retrieval example for the same targetT. To handle equivalence queries, we need to be able to rewrite data retrieval counterexamples returned by the data retrieval oracle into the subsumption setting. For every TBoxT and data retrieval query(A, D(a))one can construct a concept expressionCAsuch that(T,A)|=D(a)iffT |=CAvD. Such

a concept expressionCAcan be obtained by unravellingAinto a tree-shaped ABox

and representing it as a concept expression. This unravelling, however, can increase the ABox size exponentially. Thus, to obtain a polynomial bound on the running time of the learning process,CAvDcannot be simply returned as an answer to a subsumption

equivalence query. For example, for a target TBoxT ={∃rn_.A_v_B_}_{and a hypothesis} H = ∅ the data retrieval query (A, B(a)), whereA = {r(a, a), s(a, a), A(a)}, is a positive counterexample. The tree-shaped unravelling ofA up to levelnis a full binary tree of depthn, as shown in Fig. 1. On the other hand, the non-equivalence of T andHcan already be witnessed by(A0_{, B(a))}_{, where}_A0 ₌_{_{r(a, a), A(a)}_}_{. The}

unravelling ofA0 _{up to level}_n_{produces a linear size ABox}_{_{r(a, a2), r(a2, a3), . . . ,}

r(an−1, an), A(a), A(a2), . . . , A(an)}, corresponding to the left-most path in Fig. 1, which, in turn, is linear-size w.r.t. the target inclusion∃rn_.A _v _B_{. Notice that} _A0

is obtained fromAby removing thes(a, a)edge and checking, using membership queries, whether(T,A0)|=qstill holds. In other words, one might need to ask further membership queries in order to rewrite answers to data retrieval equivalence queries given by the data retrieval oracle into the subsumption setting.

(6)

We say that a learning frameworkF = (X,L, µ)polynomially reduces toF0 = (X0,L, µ0)if for some polynomialsp1(·),p2(·)andp3(·,·)there exist a functionf : X0→X and a partial functiong:L × L ×X →X0, defined for every(l, h, x)such that|h|=p1(|l|),µ(h)⊆µ(l)andx∈X, for which the following conditions hold.

– For allx0∈X0we havex0∈µ0(l)if, and only if,f(x0)∈µ(l);

– For allx∈X we havex∈µ(l)\µ(h)if, and only if,g(l, h, x)∈µ0(l)\µ0(h);

– f(x0)is computable in timep2(|x0|);

– g(l, h, x)is computable in timep3(|l|,|x|)andlcan only be accessed by calls to the

membership oracleMEMl,X.

As in the case of learning algorithms, we consider every call to the oracle as one step of computation. Notice also that even thoughgtakeshas input, the polynomial time bound on computingg(l, h, x)does not depend on the size ofhasgis only defined for

hpolynomial in the size ofl.

Theorem 1. Let (X,L, µ)and(X0,L, µ0)be learning frameworks. If there exists a polynomial reduction from(X,L, µ)to(X0_,_L_{, µ}0₎_{and a polynomial learning algorithm}

for(X0_,_L_{, µ}0₎_{that uses membership queries and positive bounded equivalence queries}

then(X,L, µ)is polynomially exact learnable.

We use Theorem 1 to prove that DL-Lite∃_R andELlhsTBoxes can be learned in polynomial time from data retrieval examples. We employ the following result:

Theorem 2 ([12]). The DL-Lite∃_R and ELlhs subsumption learning frameworks are polynomial time exact learnable with membership and positive bounded equivalence queries.

As the existence off is guaranteed by the following lemma, in what follows we prove the existence ofgand establish the corresponding time bounds.

Lemma 1. LetL∈ {DL-Lite∃_R,ELlhs}and letCvDbe anLconcept inclusion. Then

(T,AC)|=D(ρC)if, and only if,T |=CvD.

Polynomial Reduction for DL-Lite∃_R TBoxes We show for any target T and hy-pothesisHpolynomial in the size of T that Algorithm 1 transforms every positive counterexample in polynomial time to a positive counterexample with a singleton ABox (i.e., of the form{A(a)}or{r(a, b)}). Using the equivalences(T,{A(a)})|=C(a)iff T |=AvCand(T,{r(a, b)})|=C(a)iffT |=∃r.> vC, we then obtain a positive subsumption counterexample, sog(l, h, x)is computable in polynomial time.

Given a positive data retrieval counterexample(A, C(a)), Algorithm 1 exhaustively applies therole saturationandparent-child mergingrules introduced in [12]. We say that an instance assertionC(a)isrole saturatedfor(T,A)if(T,A)6|=C0(a)whenever

C0is the result of replacing a rolerby some roles∈NR∩ΣT withT 6|=rvsand

T |= s v r, whereΣT is the signature of the target TBoxT known to the learner.

(7)

Algorithm 1Reducing the positive counterexample

1: LetC(a)be an instance assertion such that(H,A)6|=C(a)and(T,A)|=C(a) 2: functionREDUCECOUNTEREXAMPLE(A,C(a))

3: Find a role saturated and parent/child mergedC(a)(membership queries) 4: ifC=C0u...uCnthen

5: FindCi,0≤i≤n, such that(H,A)6|=Ci(a)

6: C:=Ci

7: ifC=∃r.C0

and there isr(a, b)∈ Asuch that(T,A)|=C0(b)then 8: REDUCECOUNTEREXAMPLE(A,C0(b))

9: else

10: Find a singletonA0_{⊆ A}

such that(T,A0

)|=C(a)but

11: (H,A0

)6|=C(a)(membership queries) 12: return(A0

,C(a))

TC. Now, we say that an instance assertionC(a)isparent/child mergedforT andA if for nodesn1, n2, n3inTCsuch thatn2is anr-successor ofn1,n3is ans-successor

ofn2andT |=r− ≡swe have(T,A)6|=C0(a)ifC0is the concept that results from

identifyingn1andn3. For instance, the concept in Fig. 2c is the result of identifying the

leaf labeled withBin Fig. 2b with the parent of its parent.

We present a run of Algorithm 1 forT ={Av ∃s.B, svr}andH={svr}. As-sume the oracle gives as counterexample(A, C(a)), whereA={t(a, b), A(b), s(a, c)} and C(a) = ∃t.(Au ∃r.∃r−.∃r.B)u ∃s.>(a)(Fig. 2a). Role saturation produces

C(a) =∃t.(Au ∃s.∃s−.∃s.B)u ∃s.>(a)(Fig. 2b). Then, applying parent/child merg-ing twice we obtainC(a) =∃t.(Au ∃s.B)u ∃s.>(a)(Fig. 2c and 2d).

A t

B

s r

r

s

t B

s

s s

A

(a) (b)

s

A s t

B s

s t A sB

[image:7.612.135.481.433.501.2]

(c) (d)

Fig. 2:ConceptCbeing role saturated and parent/child merged.

Since(H,A) 6|= ∃t.(Au ∃s.B)(a), after Lines 4-6, Algorithm 1 updatesC by choosing the conjunct∃t.(Au ∃s.B). AsCis of the form∃t.C0and there ist(a, b)∈ A such that(T,A)|=C0(b), the algorithm recursively calls the function “ReduceCoun-terExample” with Au ∃s.B(b). Now, since(H,A) 6|= ∃s.B(b), after Lines 4-6,C

is updated to∃s.B. Finally,Cis of the form∃t.C0 and there is not(b, c) ∈ Asuch that (T,A) |= C0(c). So the algorithm proceeds to Lines 11-12, where it chooses

A(b)∈ A. Since(T,{A(b)}) |=∃s.B(b)and(H,{A(b)})6|=∃s.B(b)we have that T |=Av ∃s.BandH 6|=Av ∃s.B.

(8)

Algorithm 2Minimizing an ABoxA

1: LetAbe an ABox such that(T,A)|=A(a)but(H,A)6|=A(a), forA∈NC,a∈Ind(A). 2: functionMINIMIZEABOX(A)

3: Concept saturateAwithH

4: foreveryA∈NC∩ΣT anda∈Ind(A)such that 5: (T,A)|=A(a)and(H,A)6|=A(a)do 6: Domain MinimizeAwithA(a) 7: Role MinimizeAwithA(a) 8: return(A)

2. ifC is of the form∃r.C0 (or∃r−.C0) andC is role saturated and parent/child merged then either there isr(a, b)∈ A(orr(b, a)∈ A) such that(T,A)|=C0(b)

or there is a singletonA0_{⊆ A}_{such that}₍_T_,_A0₎_|₌_C(a)_.

Lemma 3. For any target DL-Lite∃_RTBoxT and hypothesis DL-Lite∃_RTBoxHgiven a positive data retrieval counterexample(A, C(a)), Algorithm 1 computes in time polynomial in|T |,|H|,|A|and|C|a counterexampleC0(b)such that(T,A0₎_|₌_C0_(b)_,

whereA0_{⊆ A}_{is a singleton ABox.}

Proof. (Sketch) Let(A, C(a))be the input of “ReduceCounterExample”. The number of membership queries in Line 3 is polynomial in|C|and|T |. IfC has more than one conjunct then it is updated in Lines 4-6, soCbecomes either (1) a basic concept or (2) of the form ∃r.C0 (or∃r−.C0). By Lemma 2 in case (1) there is a singleton A0 _{⊆ A}_{such that}₍_T_,_A0₎_|₌_C(a)_{, computed by Line 11 of Algorithm 1. In case (2)}

either there is a singletonA0 _{⊆ A}_{such that}₍_T_,_A0₎_|₌_C(a)_{, computed by Line 11 of}

Algorithm 1, or we obtain a counterexample with a refinedC. Since the size of the refined counterexample is strictly smaller after every recursive call of “ReduceCounterExample”,

the total number of calls is bounded by|C|. _o

Using Theorem 2 and Theorem 1 we obtain:

Theorem 3. The DL-Lite∃_Rdata retrieval framework is polynomially exact learnable.

Polynomial Reduction forELlhsTBoxes In this section we give a polynomial

algo-rithm computinggforELlhs. First we note that the concept assertion in data retrieval counterexamples(A, D(a))can always be made atomic. LetΣT be the signature of the

target TBoxT.

Lemma 4. If(A, D(a))is a positive counterexample then by posing polynomially many membership queries one can find a concept nameA∈ΣT and an individualb∈Ind(A)

such that(A, A(b))is also a counterexample.

Thus it suffices to show that given a positive counterexample(A, D(a))withD ∈ NC, one can compute anELconcept expressionCbounded in size by|T |such that

(9)

Algorithm 3Computing a tree shaped ABox 1: functionFINDTREE(A)

2: MINIMIZEABOX(A)

3: whilethere is a cyclecinAdo 4: Unfolda∈Ind(A)in cyclec 5: MINIMIZEABOX(A)

6: LetCbe the concept expression corresponding toAwith counterexampleA(a). 7: return(C(a),A(a))

Algorithm 2, and unfolding. Algorithm 2minimizesa given ABox with the following rules.

(Concept saturateAwithH) IfA(a)∈ A/ and(H,A)|=A(a)then replaceAby A ∪ {A(a)}, whereA∈NC∩ΣT anda∈Ind(A).

(Domain MinimizeAwithA(a)) IfA(a)is a counterexample and(T,A−b₎_|₌_A(a) then replaceAbyA−b_{, where}_A−b_{is the result of removing from}_A_{all ABox assertions} in whichboccurs.

(Role MinimizeAwithA(a)) IfA(a)is a counterexample and(T,A−r(b,c)₎ _|₌ A(a) then replace Aby A−r(b,c)_{, where} _A−r(b,c) _{be obtained by removing a role}

assertionr(b, c)fromA.

Lemma 5. Given a positive counterexample(A, D(a))with D ∈ NC, Algorithm 2 computes in polynomially many steps with respect to|A|,|H|, and|T |an ABoxA0_such

that|Ind(A0₎_{| ≤ |T |}_and₍_A0_{, A(b))}_{is a positive counterexample, for some}_A_∈_NC_and

b∈Ind(A0₎_.

It remains to show thatAcan be made tree-shaped. We say thatAhas an (undirected) cycle if there is a finite sequencea0·r1·a1·...·rk·aksuch that (i)a0=akand (ii) there are mutually distinct assertions of the formri+1(ai, ai+1)orri+1(ai+1, ai)inA, for

0≤i < k. Theunfoldingof a cyclec=a0·r1·a1·...·rk·akin a given ABoxAis obtained by replacingcby the cyclec0 =a0·r1·a1·...·rk·ak−1·rk·ba0·r1· · ·abk−1·rk·a0, where b

aiare fresh individual names,0≤i≤k−1, in such a way that (i) ifr(ai, d)∈ A, for an individualdnot in the cycle, thenr(_bai, d)∈ A; and (ii) ifA(ai)∈ AthenA(bai)∈ A.

We prove in the full version that after every unfolding-minimisation step in Algo-rithm 3 the ABoxAon the one hand becomes strictly larger and on the other does not exceed the size of the target TBoxT. Thus Algorithm 3 terminates after a polynomial number of steps yielding a tree-shaped counterexample.

Lemma 6. Algorithm 3 computes a minimal tree shaped ABoxAwith size polynomial in|T |and runs in polynomially many steps in|T |and|A|.

Using Theorem 2 and Theorem 1 we obtain:

Theorem 4. TheELlhsdata retrieval framework is polynomially exact learnable.

4 Limits of Polynomial Time Learnability

(10)

subsumptions [12]. We start by giving a brief overview of the construction in [12], show that it fails in the data retrieval setting and then demonstrate how it can be modified.

The non-learnability proof in [12] proceeds as follows. A learner tries to exactly identify one of the possible target TBoxes{TL |L ∈Ln}, for a superpolynomial in

nsetLn defined below. At every stage of computation the oracle maintains a set of TBoxesS, which the learner is unable to distinguish based on the answers given so far. InitiallyS={TL |L∈Ln}. It has been proved that for anyELinclusionCvDeither TL |= C vD for everyL∈ Ln or the number ofL ∈ Ln such thatTL |= C vD does not exceed|C|. When a polynomial learner asks a membership queryCvDthe oracle answers ‘yes’ ifTL |=C v D for everyL ∈ Ln and ‘no’ otherwise. In the latter case the oracle removes polynomially manyTLsuch thatTL |=C vDfromS. Similarly, for any equivalence query with hypothesisHasked by a polynomial learning algorithm there exists a polynomial size inclusionCvD, which can be returned as a counterexample and that excludes only polynomially many TBoxes fromS. Thus, every query to the oracle reduces the size ofSat most polynomially inn, but then the learner cannot distinguish between the remaining TBoxes of our initial superpolynomial setS. The set of indicesLn and the target TBoxesTL are defined as follows. Fix two role namesrands. Ann-tupleLis a sequence of role sequences(σ1, . . . ,σn), where everyσiis a sequence of role namesrands, that isσi=σ1

iσ2i . . . σinwithσ j

i ∈ {r, s}. ThenLn is a set ofn-tuples such that for everyL, L0 ∈ LnwithL = (σ1, . . . ,σn),

L0 = (σ₁0, . . . ,σ0_n), ifσi=σ_j0 thenL=L0andi=j. There areN=b2n_/n_c_different tuples inLn. For everyn >0and everyn-tupleL= (σ1, . . . ,σn)we define an acyclic ELTBoxTLas the union ofT0 ={Xi v ∃r.Xi+1u ∃s.Xi+1 |0 ≤i < n}and the

following inclusions:

A1v ∃σ1.MuX0 B1v ∃σ1.MuX0 . . .

Anv ∃σn.MuX0

Bnv ∃σn.MuX0

A≡X0u ∃σ1.Mu · · · u ∃σn.M.

where the expression∃σ.C stands for∃σ1_._∃_σ2_{. . .}_∃_σn_.C_,_M _{is a concept name that} ‘marks’ a designated path given byσandT0generates a full binary tree whose edges are

labelled with the role namesrandsand withX0at the root,X1at level1and so on. In contrast to the subsumption framework, everyTLcanbe exactly identified using data retrieval queries. For example, as X0u ∃σ1.M u · · · u ∃σn.M v A ∈ TL, a learning from data retrieval queries algorithm can learn all the sequences in then -tupleL= (σ1, . . . ,σn), by defining an ABoxA={X0(a1), r(a1, a2), s(a1, a2), . . . , r(an−1, an), s(an−1, an), M(an)}and then proceeding with unfolding and minimizing Avia membership queries of the form(TL,A)|=A(a1).

To show the non-tractability for data retrieval queries, we first modifySin such a way that the concept expression which ‘marks’ the sequences inL= (σ1, . . . ,σn)is now given by the setBnof all conjunctionsF1u · · · uFn, whereFi ∈ {Ei,E¯i}, for

1≤i≤n. Intuitively, every member ofBnencodes a binary string of lengthnwithEi encoding1andE¯iencoding0. For everyL∈Lnand everyB∈Bnwe defineTLBas the union ofT0and the concept inclusions defined above withBreplacingM.

(11)

Bi v ∃σ.B. Notice that the size of eachTLBis polynomial innand soLn contains superpolynomially manyn-tuples in the size of eachTB

L , withL∈Ln andB∈Bn. EveryTB

L entails, among other inclusions, dn

i=1CivA, where everyCiis eitherAior

Bi. LetΣnbe the signature of the TBoxesTLBand consider a TBoxT∗defined as the following set of concept inclusions:

∃r.(E1uE1)¯ v(E1uE1),¯

∃s.(E1uE1)¯ v(E1uE1),¯

(E1uE1)¯ v ∃r.(E1uE1),¯ (E1uE1)¯ v ∃s.(E1uE1),¯ (EiuE¯i)vA for every1≤i≤nand everyA∈Σn∩NC The basic idea of extending our TBoxes withT∗ _{is that if}_a_∈ _(E

iuE¯i)IA, for an ABoxAand individuala ∈ Ind(A), then for allL ∈Ln andB ∈Bn, we have

(TB

L ,A)|=D(b), whereDis anyELconcept expression overΣnandb∈Ind(A)is any successor or predecessor ofa(oraitself). This means that for each individual in Aat most oneBof the2nbinary strings inBncan be distinguished by data retrieval queries. The following lemma enables us to respond to membership queries without eliminating too manyL∈LnandB∈Bnused to encodeTLBin the set of TBoxes that the learner cannot distinguish.

Lemma 7. For any ABoxA, anyELconcept assertionD(a)overΣn, and anya ∈ Ind(A), if there isL∈LnandB∈Bnsuch that(TLB∪ T∗,A)|=D(a)then:

– either(TB

L ∪ T∗,A)|=D(a), for everyL∈LnandB∈Bn, or

– (TB

L ∪ T∗,A)|=D(a)for at most|D|elementsL∈Ln, or

– (TB

L ∪ T∗,A)|=D(a)for at most|A|elementsB∈Bn.

The next lemma is immediate from Lemma 15 presented in [12]. It shows how the oracle can answer equivalence queries eliminating at most oneL∈Lnused to encode TB

L in the setSof TBoxes that the learner cannot distinguish.

Lemma 8. For anyn >1and anyELTBoxHinΣnwith|H|<2n, there exists an ABoxA, an individuala∈Ind(A)and anELconcept expressionDoverΣnsuch that (i) the size ofAplus the size ofDdoes not exceed6nand (ii) if(H,A)|=D(a)then

(TB

L ,A)|=D(a)for at most oneL∈Lnand if(H,A)6|=D(a)then for everyL∈Ln we have(TB

L ∪ T∗,A)|=D(a).

Then, by Lemmas 7 and 8, we have that: (i) any polynomial size membership query can distinguish at most polynomially many TBoxes fromS; and (ii) if the learner’s hypothesis is polynomial size then there exists a polynomial size counterexample that the oracle can give which distinguishes at most polynomially many TBoxes fromS.

Theorem 5. TheELdata retrieval framework is not polynomially exact learnable.

5 Future Work

(12)

Bibliography

[1] D. Angluin. Queries and concept learning.Machine Learning, 2(4):319–342, 1987. [2] F. Baader, S. Brandt, and C. Lutz. Pushing theELenvelope. InIJCAI, pages

364–369. Professional Book Center, 2005.

[3] F. Baader, D. Calvanese, D. McGuiness, D. Nardi, and P. Patel-Schneider.The De-scription Logic Handbook: Theory, implementation and applications. Cambridge University Press, 2003.

[4] S. Bechhofer, I. Horrocks, C. Goble, and R. Stevens. Oiled: a reason-able ontology editor for the semantic web. InKI 2001: Advances in Artificial Intelligence, pages 396–408. Springer, 2001.

[5] D. Borchmann and F. Distel. Mining ofEL-GCIs. InThe 11th IEEE International Conference on Data Mining Workshops, Vancouver, Canada, 11 December 2011. IEEE Computer Society.

[6] P. Buitelaar, P. Cimiano, and B. Magnini, editors.Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, 2005.

[7] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: TheDL-Litefamily. Journal of Automated reasoning, 39(3):385–429, 2007.

[8] P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using formal concept analysis.J. Artif. Intell. Res. (JAIR), 24:305–339, 2005. [9] J. Day-Richter, M. A. Harris, M. Haendel, S. Lewis, et al. Obo-edit an ontology

editor for biologists.Bioinformatics, 23(16):2198–2200, 2007.

[10] M. Frazier and L. Pitt. Learning From Entailment: An Application to Propositional Horn Sentences. InICML, pages 120–127, 1993.

[11] B. Konev, M. Ludwig, D. Walther, and F. Wolter. The logical difference for the lightweight description logic EL.J. Artif. Intell. Res. (JAIR), 44:633–708, 2012. [12] B. Konev, C. Lutz, A. Ozaki, and F. Wolter. Exact learning of lightweight

descrip-tion logic ontologies. InPrinciples of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference, KR 2014, Vienna, Austria, July 20-24, 2014, 2014.

[13] C. Lutz, R. Piro, and F. Wolter. Description logic TBoxes: Model-theoretic charac-terizations and rewritability. InIJCAI, pages 983–988, 2011.

[14] Y. Ma and F. Distel. Learning formal definitions for snomed CT from text. In Artificial Intelligence in Medicine - 14th Conference on Artificial Intelligence in Medicine, AIME 2013, Murcia, Spain, May 29 - June 1, 2013. Proceedings, pages 73–77, 2013.

[15] M. A. Musen. Prot´eg´e ontology editor.Encyclopedia of Systems Biology, pages 1763–1765, 2013.

(13)

[17] H. Stuckenschmidt, C. Parent, and S. Spaccapietra, editors. Modular Ontologies: Concepts, Theories and Techniques for Knowledge Modularization, volume 5445 ofLecture Notes in Computer Science. Springer, 2009.