Computational Complexity of Coverage Checking

5.2 Coverage Checking

5.2.1 Computational Complexity of Coverage Checking

The complexity of coverage checking for any concept C ∈ L under a closed-world interpretation (I,U) is different from the complexity of the typical instance checking problem for DLs which assume an open-world interpretation I. As discussed in Section 3.2.1.1 of Chapter 3, under an open-world interpretation, instance checking is reducible to satisfiability checking for most DLs, where sound and complete satisfiability checking can be as computationally expensive as N2ExpTime [47].

Under a closed-world interpretation, concepts can be thought of as queries over the fixed model (I,U), which can be thought of as a database. In this way, the complexity of coverage checking can be analysed as a function of the complexity of the concept C as the query, and the size of the interpretation (I,U) as the database. Central to analysing the complexity of coverage checking is the complexity of the instanceOf function of Algorithm 10, which closely models Definition 3.5.2 of (I,U) in computing whether some individual i is an instance of a concept C, with optimisations around context-specific local domains.

We will analyse the complexity of the instanceOf function on a case-by-case basis. Firstly, performing an instance check i ∈ C^(I,U) where C is a simple concept such as any atomic A, negated atomic ¬A, or nominal concept {i} is an O(n) operation where n = |C^(I,U)|, as these concepts will already have their closed-world interpretation under (I,U) pre-computed, so the instance check is as complex as set membership.

For concepts which are conjunctions A_0 ⊓ … ⊓ A_j or disjunctions A_0 ⊔ … ⊔ A_j of simple concepts A_i for 2 ≤ i ≤ j, the complexity is at most O(j·n), where n is the size of the largest interpretation |A_i^(I,U)| of any conjunct or disjunct operand A_i.

We now consider the complexity of instance checking for quantified role expressions such as ◇r.(D). The complexity of instance checking i ∈ (◇r.(D))^(I,U) is a function of the maximum possible size of the set of all r-successors for any predecessor i, which we will denote b, and the worst-case complexity O(ψ) of performing the instance check against concept D. If D is a simple concept, or a conjunction or disjunction of simple concepts, the complexity of the instance check is then O(b·j·n), as we potentially check all b of i's r-successors in D.

Now assume that D in ◇r.(D) is a quantified role expression ◇r.(E) where the instance check in E has complexity O(ψ). The complexity of checking i ∈ (◇r.(◇r.(E)))^(I,U) is O(b·(b·ψ)) = O(b²·ψ), as at worst, we are required to check that all r-successors of r-successors of i are in E. For further nestings of quantified role expressions, the complexity of instance checking is at least O(b^d·ψ), where b is the maximum number of r-successors of any individual, and d is the maximum depth of nested quantified role expressions.

Finally, we consider expressions which may permit simple concepts and nested quantified role expressions along with conjunctions and disjunctions, which represents the full expressivity of concepts which may be generated by refinement operators such as ρ_λ̄. Assume that the maximum number of r-successors for any role r and any individual i is b, and that the maximum number of operands in any conjunction or disjunction is j. For example, consider ◇r.(C_0 ⊓ … ⊓ C_j) where any C_i = ◇r.(A) for 2 ≤ i ≤ j where A is a simple concept. The complexity of instance checking for such expressions is O(b · Σ_j b·n) = O(b²·j·n).

Now assume each operand C_i is itself a nested role expression which has, as a filler, further conjunctions or disjunctions of nested role expressions, where the maximum depth of any nested role expression is d. At the outermost conjunction or disjunction with j operands, there are b·j instance checks for each role expression, and a subsequent b·j for each operand of the conjunction or disjunction in the fillers of each, until we eventually reach simple concepts or conjunctions or disjunctions thereof with instance check complexity n. This results in an overall worst-case complexity of O((j·b)^d · n), as we check the r-successors of all j operands in conjunctions or disjunctions as role fillers with a maximum depth of d.
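The recursion behind this bound can be illustrated with a minimal Python sketch. It is not Algorithm 10: the class names, the `Model` structure and the choice of the existential restriction as the only quantifier are assumptions made for illustration, and local-domain optimisations are omitted. Each nesting level visits up to b successors and up to j operands, while the base case is a set-membership test.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple, Union


@dataclass(frozen=True)
class Atom:            # simple concept with a pre-computed extension
    name: str


@dataclass(frozen=True)
class And:             # conjunction of up to j operands
    operands: Tuple["Concept", ...]


@dataclass(frozen=True)
class Or:              # disjunction of up to j operands
    operands: Tuple["Concept", ...]


@dataclass(frozen=True)
class Some:            # existential restriction (assumed example quantifier)
    role: str
    filler: "Concept"


Concept = Union[Atom, And, Or, Some]


@dataclass
class Model:           # hypothetical stand-in for the fixed model (I,U)
    extensions: Dict[str, Set[str]]               # concept name -> individuals
    successors: Dict[Tuple[str, str], Set[str]]   # (role, individual) -> successors


def instance_of(i: str, c: Concept, m: Model) -> bool:
    """Closed-world check of whether individual i is an instance of c."""
    if isinstance(c, Atom):
        return i in m.extensions.get(c.name, set())
    if isinstance(c, And):
        return all(instance_of(i, op, m) for op in c.operands)
    if isinstance(c, Or):
        return any(instance_of(i, op, m) for op in c.operands)
    if isinstance(c, Some):
        # At most b successors are visited, each checked against the filler.
        succ = m.successors.get((c.role, i), set())
        return any(instance_of(s, c.filler, m) for s in succ)
    raise TypeError(f"unsupported concept: {c!r}")
```

With a role depth of d, the nested calls in the `Some` and `And`/`Or` branches give the (j·b)^d factor discussed above.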

In practice, the cost of computing set membership in the pre-computed closed-world interpretation is closer to O(1) when implemented with hash tables, so the dominating factors in this result are the maximum number of r-successors b for any predecessor individual and role r, the maximum nested role depth d, and the maximum number of operands j for any subexpression which is a conjunction or disjunction. Furthermore, we observe that role depth in concepts is often limited to small values such as less than 10, but this ultimately depends on the structure of the examples in the knowledge base.

At most, this places the complexity of closed-world instance checking over a concept C in the class of ExpTime problems. By comparison, instance checking by open-world reasoning requires the integration of C into a SROIQ knowledge base for re-classification to permit instance checking by entailment. This is a comparatively expensive operation which is a function of the size of the background knowledge as well as the ABox, and is an N2ExpTime problem [47]. In practice, we observe that classification of certain knowledge bases may take minutes whereas closed-world instance checking will often take milliseconds over the same knowledge base, and is therefore clearly preferable for learning by generate-and-test methods. We analyse this behaviour in practice in Chapter 6, Section 6.3.5.

When computing the coverage of a concept C relative to a set of example individuals E, we perform the instance check procedure at most |E| times with the instanceOf function. As both E and the maximum number of r-successors b for any role are of constant size, the computational complexity of coverage checking remains the same as that of instance checking, namely O((j·b)^d · n).

Despite the expensive exponential worst-case computational complexity of instance checking, there are several practical optimisations which can significantly improve the performance. Firstly, given that quantified role expressions are the most expensive concepts against which to check instance membership, it is prudent to perform instance checking in conjunctive and disjunctive expressions against any simple operands first, as it is cheaper to recognise failure (in the case of conjunctions) or success (in the case of disjunctions) against such atomic expressions before checking more expensive role expressions. In the definition of the operator ρ_λ̄, we find that the precedence operator ensures that atomic operands always appear before quantified role expressions, which supports this optimisation. Secondly, note that the implementation of instanceOf as shown in Algorithm 10 incorporates approximate local domains ∆_λ for all most-applicable domains λ̄ for any subexpression. By testing if an individual i is not an instance of some approximate local domain ∆_λ, we can be assured that i is also not an instance of any concept C for which λ̄ was most applicable, where C^(I,U) ⊆ ∆_λ. This approach therefore permits fast-failure on checking potentially expensive role expressions.
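The first optimisation, checking cheap operands before expensive ones, can be sketched as follows. This is an illustrative assumption rather than the precedence operator of ρ_λ̄: concepts are encoded as nested tuples, and the hypothetical helper `role_depth` is used as a proxy for the cost of an instance check.

```python
def role_depth(concept):
    """Maximum nesting depth of role expressions, used here as a cost estimate.

    Assumed encoding: ("atom", name), ("role", role_name, filler),
    ("and", [operands]) or ("or", [operands]).
    """
    kind = concept[0]
    if kind == "atom":
        return 0
    if kind == "role":
        return 1 + role_depth(concept[2])
    # Conjunction or disjunction: as deep as its deepest operand.
    return max((role_depth(op) for op in concept[1]), default=0)


def order_operands(operands):
    """Order operands so that atomic ones are checked before role expressions."""
    # sorted() is stable, so operands of equal depth keep their relative order.
    return sorted(operands, key=role_depth)
```

Evaluating a conjunction or disjunction in this order with short-circuiting means a failing (or succeeding) atomic operand is found before any role successors are explored.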

Two other optimisations are the caching of concept covers (§5.2.1.1) and fast-failure given minimum bounds on concept performance relative to a convex measure function (§5.2.1.2), which we will now describe.

5.2.1.1 Caching of Concept Covers

In the computation of a stamp point such as ⟨x_0, y_0⟩ for some concept C over a binary labelled set of examples E = E⁺ ∪ E⁻, the stamp point ⟨x_1, y_1⟩ of any refinement D ∈ ρ*(C) will necessarily have x_1 ≤ x_0 and y_1 ≤ y_0, as the cover of any refinement D of C will always be a non-strict subset of the cover of C:

D ⊑ C → cover(D, E) ⊆ cover(C, E)

because if D ⊑ C, then by definition of the closed-world interpretation (I,U), it must be the case that D^(I,U) ⊆ C^(I,U). Therefore, when computing the stamp point for any concept D, it suffices to begin computation from the cover of its parent concept C where C ρ D. We may thereby trade the computational cost of the time spent computing the cover of any concept D over the entire set of examples E for the space required to maintain the cover of its parent concept, C. While this may increase the space used by a learning algorithm, it may reduce the computation time of coverage checking, which is useful when the knowledge base contains a large amount of data, and where candidate hypotheses often cover fewer examples than are in E. However, as we observe in Section 6.3.5, if candidates in the search often cover a significant proportion of examples in E, the difference in the performance of coverage testing with caching can be negligible at the cost of increased memory usage.
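The trade-off can be made concrete with a minimal sketch. The function name `cover` and its `parent_cover` parameter are hypothetical, and the instance check is passed in as an arbitrary predicate. Restricting the search to the parent's cover is only valid when the concept is a refinement of the parent, so that its cover is a subset of the parent's.

```python
def cover(is_instance, examples, parent_cover=None):
    """Return the subset of examples accepted by is_instance.

    If parent_cover (the cached cover of the parent concept) is given,
    only those examples are tested; this assumes the concept being
    checked is subsumed by the parent.
    """
    candidates = examples if parent_cover is None else parent_cover
    return {e for e in candidates if is_instance(e)}
```

When the parent covers few examples, far fewer instance checks are performed for each refinement; when it covers most of E, the saving is small and the cached set still has to be held in memory.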

5.2.1.2 Fast-Failure of Instance Checking with Bounded Convex Measures

Another optimisation to coverage checking which does not require additional space is to leverage the upper bounds on values of a convex measure function σ_f given the cover of a candidate C. As shown in Algorithm 7, any candidate C which does not have an upper bound ub_σf which exceeds a minimum threshold on quality τ_min will be pruned from the search, as it can never lead to a concept with a performance which exceeds τ_min for σ_f. As shown in Definition 5.1.3, the upper bound function ub_σf is defined as the maximum of either σ_f(⟨x, 0⟩) or σ_f(⟨0, y⟩), over which the threshold τ_min is imposed, at least for binary labelled examples. By re-arranging the definition of σ_f in terms of x for σ_f(⟨x, 0⟩) ≥ τ_min and for y where σ_f(⟨0, y⟩) ≥ τ_min, we obtain two inequalities which impose minimum bounds on the values of x and y, which correspond to the number of labelled examples from each class x = |C^(I,U) ∩ E⁺| and y = |C^(I,U) ∩ E⁻|, as demonstrated in Example 5.2.1.

Example 5.2.1. Given a stamp point ⟨x, y⟩, the MCC measure σ_mcc(⟨x, y⟩) and upper bound ub_σmcc(⟨x, y⟩) as defined on page 120, we re-arrange to obtain the following two inequalities:

\[
x \ge \frac{\tau^{2} P (N + P)}{\tau^{2} P + N}
\qquad
y \ge \frac{\tau^{2} N (N + P)}{\tau^{2} N + P}
\]

These define the minimum number of examples x which a candidate C with stamp point ⟨x, y⟩ must cover from E⁺, or the minimum number of examples y candidate C must cover from E⁻, for the value of measure σ_mcc to meet or exceed τ.
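As a numerical check of these bounds, the sketch below evaluates both inequalities and compares them against the standard Matthews correlation coefficient. It assumes P = |E⁺| and N = |E⁻|, and treats x as true positives and y as false positives; the function names are hypothetical. The definition of σ_mcc on page 120 is not reproduced here, so the standard MCC formula is used, under which the bound on y corresponds to the magnitude of the (negative) value at ⟨0, y⟩.

```python
import math


def mcc(x, y, P, N):
    """Standard MCC for a stamp point (x, y): x covered positives, y covered negatives."""
    tp, fp = x, y
    fn, tn = P - x, N - y
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def min_x(tau, P, N):
    """Smallest x for which the MCC at (x, 0) reaches tau."""
    return tau ** 2 * P * (N + P) / (tau ** 2 * P + N)


def min_y(tau, P, N):
    """Smallest y for which the magnitude of the MCC at (0, y) reaches tau."""
    return tau ** 2 * N * (N + P) / (tau ** 2 * N + P)
```

For instance, with P = N = 50 and τ = 0.5, both bounds evaluate to 20: a candidate covering only 19 positives and no negatives falls just short of the threshold.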

Generally, for stamp points ⟨c_1, …, c_n⟩ in labelled learning problems where |Ω| = n, each variable c_i where 1 ≤ i ≤ n corresponds to the number of examples in the cover of some concept which are labelled with ω_i ∈ Ω. As shown in Example 5.2.1, we computed the minimum bounds on each variable c_i which satisfy an inequality σ_f(⟨c_1, …, c_n⟩) ≥ τ where f was MCC by re-arranging for each c_i to produce c_i ≥ Φ(σ_f, i, τ), where Φ(σ_f, i, τ) denotes the right hand side of the re-arrangement of the inequality for c_i.

For a candidate C with stamp point ⟨c_1, …, c_n⟩ to be pruned based on insufficient upper bounds on the performance of σ_f, it must be the case that c_i < Φ(σ_f, i, τ) for all 1 ≤ i ≤ n, as all downward refinements D of C where D ∈ ρ*(C) with stamp point ⟨c′_1, …, c′_n⟩ will never satisfy σ_f(⟨c′_1, …, c′_n⟩) ≥ τ.

Algorithm 11 computes a coverage for a concept C as a tuple ⟨I_1, …, I_n⟩ where each I_i for 1 ≤ i ≤ n is the set of examples which are instances of C labelled ω_i ∈ Ω. Such tuples can be used to compute the stamp point of C relative to a set of labelled examples E as ⟨|I_1|, …, |I_n|⟩. However, Algorithm 11 is designed such that if it can determine over the course of execution that, for each I_i where 1 ≤ i ≤ n, the inequality |I_i| ≥ Φ(σ_f, i, τ) cannot be satisfied, each I_i may contain fewer examples than are actually in the cover of C, as it fails fast on the expectation that C will be pruned from the search because its stamp point will not satisfy the minimum bound τ on the measure σ_f. Furthermore, Algorithm 11 ensures that, if any I_i does satisfy its related inequality |I_i| ≥ Φ(σ_f, i, τ), each I_i in the computed tuple ⟨I_1, …, I_n⟩ will contain the exact number of examples which are instances of C with label ω_i.
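Algorithm 11 itself is not reproduced in this section, so the following is only a sketch of one way the behaviour described above could be realised; the function name, its parameters and the returned completeness flag are assumptions. It stops as soon as no label can still reach its minimum bound, even if every unchecked example of that label turned out to be an instance.

```python
def fast_fail_cover(is_instance, examples_by_label, min_bounds):
    """Compute per-label covers, aborting once every bound is unreachable.

    examples_by_label: one sequence of examples per label.
    min_bounds: the minimum count required for each label.
    Returns (covers, complete); covers may be partial when complete is False.
    """
    covers = [set() for _ in examples_by_label]
    remaining = [len(es) for es in examples_by_label]
    for idx, examples in enumerate(examples_by_label):
        for e in examples:
            remaining[idx] -= 1
            if is_instance(e):
                covers[idx].add(e)
            # A label is hopeless if even covering all unchecked examples
            # of that label would leave it below its minimum bound.
            if all(len(c) + r < b
                   for c, r, b in zip(covers, remaining, min_bounds)):
                return covers, False
    return covers, True
```

A label whose final count meets its bound is never hopeless at any earlier point, so in that case the loop runs to completion and every set in the tuple is exact, which matches the guarantee stated above.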

In document OWL-Miner: Concept Induction in OWL Knowledge Bases (Page 153-158)