Validating Object Matches using Description Logic

follows:

{a} ≡ {b},    partOf(a, b),

where a, b are individual names, and partOf is a role in description logic or an object property in OWL 2. A sameAs axiom {a} ≡ {b} states that a and b refer to the same object in the real world. A partOf axiom partOf(a, b) states that the object represented by a is part of the object represented by b in the real world. In this thesis, we use a = b as an abbreviation for the sameAs axiom {a} ≡ {b} and a ≠ b for the different-individual axiom {a} ⊑ ¬{b}. We also use sameAs(a, b) to refer to a sameAs match between a and b.

This section explains the use of description logic for validating object matches generated in the different matching cases described in Section 5.3, expanding the descriptions of MatchMaps Steps 6 and 7 in Section 4.3.

Like a terminology match, an object match can be verified using concept hierarchies and disjointness axioms. An object match sameAs(a, b) is wrong if a or b can be shown to belong to both a concept C and its complement ¬C. To validate partOf matches, for each input ontology Ti, i ∈ {1, 2}, we manually generate ‘partOf-disjointness’ axioms and add them to Di. A ‘partOf-disjointness’ axiom C ⊑ ∀partOf.¬D prohibits objects of one type C from being partOf objects of another type D. For example, if School ⊑ ∀partOf.¬Pub, School(a) and Pub(b) hold, then partOf(a, b) is wrong. For a set of object matches S, such errors can be detected and removed by checking the consistency of T1 ∪ T2 ∪ D1 ∪ D2 ∪ M ∪ S using Pellet, calculating minimal inconsistent assumption sets (MIAs, Definition 4.7), and following a validation process similar to that described for terminology matches.
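The School/Pub example can be illustrated with a minimal sketch. It is not the Pellet-based consistency check described above: it only looks up asserted types directly, ignores concept hierarchies, and computes no MIAs. The function name `violating_partof_matches` and the toy data are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy axioms: each pair (C, D) stands for C ⊑ ∀partOf.¬D.
PARTOF_DISJOINT: Set[Tuple[str, str]] = {("School", "Pub")}


def violating_partof_matches(
    types: Dict[str, Set[str]],
    partof_matches: List[Tuple[str, str]],
    disjoint: Set[Tuple[str, str]] = PARTOF_DISJOINT,
) -> List[Tuple[str, str]]:
    """Return the partOf(a, b) matches that contradict some axiom C ⊑ ∀partOf.¬D.

    Only asserted types are consulted; a real reasoner would also use
    subsumption between concepts.
    """
    wrong = []
    for a, b in partof_matches:
        a_types = types.get(a, set())
        b_types = types.get(b, set())
        if any((c, d) in disjoint for c in a_types for d in b_types):
            wrong.append((a, b))
    return wrong


# School(a), Pub(b), School(c); candidate matches partOf(a, b) and partOf(a, c).
types = {"a": {"School"}, "b": {"Pub"}, "c": {"School"}}
matches = [("a", "b"), ("a", "c")]
print(violating_partof_matches(types, matches))  # [('a', 'b')]
```

In this toy setting only partOf(a, b) is flagged, mirroring the example in the text.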

Definition 6.1 (UNA & NPH). In a dataset, the Unique Name Assumption (UNA) holds if, for any two individual names a and b in the dataset, a ≠ b; ‘No PartOf Hierarchy’ (NPH) holds if there exist no individual names a, b in the dataset such that partOf(a, b).

Unlike a terminology match, an object match can also be checked with respect to the Unique Name Assumption (UNA) and ‘No PartOf Hierarchy’ (NPH) in each input dataset. This is motivated by the facts that there is usually no duplicated representation of the same individual in a dataset, and that an individual is not represented both as a whole and as parts of it at the same time. Similar to disjointness axioms, we could generate different-individual axioms as retractable assumptions, and use them to check object matches. However, this makes the data too large to be reasoned with. Instead, for any pair of individual names a and b in the same dataset, we use Pellet to check whether a = b or partOf(a, b) is entailed, and compute sets of axioms as explanations (similar to MIAs) if UNA/NPH is violated. Since we do not add any axioms like a ≠ b, no inconsistency arises. If in a crowd-sourced dataset some spatial feature is represented twice, or there is a genuine partOf relationship determined by human checking, we skip this ‘error’ and do not retract any assumption.
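The UNA part of this check can be sketched as follows. The sketch assumes that the only source of entailed equalities is the symmetric and transitive closure of the sameAs matches, which is far weaker than what Pellet derives. The names `una_violations` and `dataset_of` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def una_violations(
    same_as: Iterable[Tuple[str, str]],
    dataset_of: Dict[str, str],
) -> List[List[str]]:
    """Group names that are forced equal by sameAs matches and report
    every group containing two or more names from the same dataset."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in list(parent):
        groups[(find(name), dataset_of[name])].append(name)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Toy matches between datasets A and B: a1 = b1 and b1 = a2 force a1 = a2.
same_as = [("a1", "b1"), ("b1", "a2"), ("a3", "b2")]
dataset_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
print(una_violations(same_as, dataset_of))  # [['a1', 'a2']]
```

Each reported group is only a candidate error: as noted above, a genuinely duplicated feature in a crowd-sourced dataset would be confirmed by a human and skipped.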

The described validation of object matches requires domain experts to make the ultimate decisions. To minimize human effort, several heuristics are designed to allow users to retract several ‘similar’ statements at a time, and spatial logic reasoning is employed to detect and remove obvious errors before checking UNA/NPH. As explained in Chapter 4, spatial logic reasoning complements description logic reasoning, and helps validate object matches using location information. The use of spatial logic for validating matches, as well as the heuristics provided to users for removing wrong matches, is explained in Chapter 10. Before that, Chapters 7-9 introduce a series of new qualitative spatial logics for validating object matches using location information.

A Logic of NEAR and FAR for Buffered Points

From this chapter to Chapter 9, a series of new qualitative spatial logics is introduced to reason about ‘possibly sameAs’ and ‘possibly partOf’ relations between geometries represented in different geospatial datasets, in particular crowd-sourced datasets. In Section 5.1, BEQ and BPT are defined to formalize ‘possibly sameAs’ and ‘possibly partOf’ relations respectively. In the new spatial logics, two additional spatial relations NEAR and FAR are defined, which mean ‘possibly connected’ and ‘definitely disconnected’ respectively. The intuition is that, for any geometry X in one dataset, its corresponding geometry X′ in another dataset is somewhere within buffer(X, σ). As shown in Fig. 7.1, two geometries X, Y are NEAR if their corresponding geometries X′, Y′ could be connected, i.e. distance(X, Y) ∈ [0, 2σ]. Two geometries X, Y are FAR if their corresponding geometries X′, Y′ are not NEAR, i.e. distance(X, Y) ∈ (4σ, +∞).

FIGURE 7.1: NEAR (left); FAR (right)

The logic of NEAR and FAR for buffered points (LNF) presented in this chapter is for points, whilst the logics in Chapter 8 and Chapter 9 are for arbitrary geometries (non-empty sets of points). For any two points X, Y, by Definition 5.1, BEQ(X, Y) iff BPT(X, Y). Therefore, this logic includes BEQ but not BPT as a predicate. We start with this logic for points because it is easier and has simpler proofs, which can be reused and extended to the more complicated cases for arbitrary geometries. LNF can be used for reasoning about points (several geospatial datasets only have point geometries). The syntax, semantics and axiomatisation of LNF are introduced in Section 7.1. Section 7.2 shows that the axiomatisation is sound and complete for models based on a metric space. Section 7.3 shows that the LNF satisfiability problem in a metric space is NP-complete. In Section 7.4, a new semantics based on a two-dimensional Euclidean space R² is introduced for LNF, and we show that the LNF satisfiability problem in R² is still decidable, and its complexity is in PSPACE.

7.1 Syntax, Semantics and Axioms of LNF

The language L(LNF) is defined as

φ, ψ := BEQ(a, b) | NEAR(a, b) | FAR(a, b) | ¬φ | φ ∧ ψ.

φ → ψ =def ¬(φ ∧ ¬ψ).

Definition 7.1 (Metric Model). A metric model M is a tuple (∆, d, I, σ), where (∆, d) is a metric space, I is an interpretation function which maps each individual name to an element in ∆, and σ ∈ R≥0 is a margin of error. The notion of M |= φ (φ is true in model M) is defined as follows:

M |= BEQ(a, b) iff d(I(a), I(b)) ∈ [0, σ],

M |= NEAR(a, b) iff d(I(a), I(b)) ∈ [0, 2σ],

M |= FAR(a, b) iff d(I(a), I(b)) ∈ (4σ, +∞),

M |= ¬φ iff M ⊭ φ,

M |= φ ∧ ψ iff M |= φ and M |= ψ,

where a, b are individual names, φ, ψ are formulas in L(LNF).
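The truth definition can be made concrete with a small evaluator. This is a sketch under stated assumptions: the metric space is taken to be the Euclidean plane, BEQ, NEAR and FAR are read as distance at most σ, at most 2σ and more than 4σ (the thresholds used in the proof of Lemma 7.3), and the tuple encoding of formulas and the class name `MetricModel` are choices made only for this illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]

# Formula encoding chosen for this sketch (nested tuples):
#   ("BEQ", a, b), ("NEAR", a, b), ("FAR", a, b), ("not", f), ("and", f, g)


@dataclass
class MetricModel:
    interp: Dict[str, Point]  # I: individual name -> point in the plane
    sigma: float              # margin of error

    def d(self, a: str, b: str) -> float:
        """Euclidean distance between the interpretations of a and b."""
        return math.dist(self.interp[a], self.interp[b])

    def holds(self, f: tuple) -> bool:
        """Evaluate a formula recursively, following the truth definition."""
        op = f[0]
        if op == "BEQ":
            return self.d(f[1], f[2]) <= self.sigma
        if op == "NEAR":
            return self.d(f[1], f[2]) <= 2 * self.sigma
        if op == "FAR":
            return self.d(f[1], f[2]) > 4 * self.sigma
        if op == "not":
            return not self.holds(f[1])
        if op == "and":
            return self.holds(f[1]) and self.holds(f[2])
        raise ValueError(f"unknown connective: {op!r}")


m = MetricModel({"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0)}, sigma=1.0)
print(m.holds(("BEQ", "a", "b")))   # True: distance 1 <= sigma
print(m.holds(("FAR", "a", "c")))   # True: distance 5 > 4 * sigma
print(m.holds(("NEAR", "b", "c")))  # False: distance 4 > 2 * sigma
print(m.holds(("FAR", "b", "c")))   # False: distance 4 is not > 4 * sigma
```

The pair b, c shows that NEAR and FAR are not complements: points whose distance lies in (2σ, 4σ] are neither NEAR nor FAR.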

The notions of validity and satisfiability in metric models are standard. A formula is satisfiable if it is true in some metric model. A formula φ is valid (|= φ) if it is true in all metric models (hence if its negation is not satisfiable). The logic LNF is the set of all valid formulas of L(LNF). It is proved below that LNF is a proper fragment of the logic MS(M) described in Section 3.3.3. Strictly speaking, this only holds when σ ∈ Q≥0, but later we will show that a finite set of LNF formulas is satisfiable when σ ∈ R≥0, if it is satisfiable when σ = 1. In other words, σ acts as a scaling factor.

Lemma 7.2. For individual names a, b, the MS(M) formula {a} ⊑ ¬{b} is not expressible in LNF.

Proof. Let M1, M2 be metric models¹. M1 = (∆1, d, I1, σ), where ∆1 = {o1, o2}, d(o1, o2) = σ, I1(a) = o1, I1(b) = o2. For any x differing from a, b, I1(x) = o1. M2 = (∆2, d, I2, σ), where ∆2 = {o}, I2(a) = o, I2(b) = o. For any x differing from a, b, I2(x) = o. For any individual name y, Ii({y}) = {Ii(y)}, i ∈ {1, 2}.

¹ Note that we can construct models in a one-dimensional or two-dimensional Euclidean

By the definitions of M1, M2, for any individual names x, y, d(I1(x), I1(y)) ∈ [0, σ] and d(I2(x), I2(y)) = 0. If φ is an atomic LNF formula about x, y, then by Definition 7.1, M1 |= φ iff M2 |= φ. By an easy induction on logical connectives, for any LNF formula φ, M1 |= φ iff M2 |= φ.

Since I1({a}) = {o1}, I1({b}) = {o2} and I2({a}) = I2({b}) = {o}, by the truth definition of MS(M) formulas, M1 |= ({a} ⊑ ¬{b}) and M2 ⊭ ({a} ⊑ ¬{b}). Hence, {a} ⊑ ¬{b} is not equivalent to any LNF formula.
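The two models in this proof can be checked mechanically on the names a and b. The sketch below embeds them in a one-dimensional Euclidean space with σ = 1, which is an assumption made for illustration, and reads BEQ, NEAR and FAR as distance at most σ, at most 2σ and more than 4σ.

```python
from itertools import product

SIGMA = 1.0
M1 = {"a": 0.0, "b": SIGMA}  # two distinct points at distance sigma
M2 = {"a": 0.0, "b": 0.0}    # a single point interpreting both names


def atoms(model, sigma=SIGMA):
    """Truth values of every atomic LNF formula over the model's names."""
    values = {}
    for x, y in product(model, repeat=2):
        dist = abs(model[x] - model[y])
        values[("BEQ", x, y)] = dist <= sigma
        values[("NEAR", x, y)] = dist <= 2 * sigma
        values[("FAR", x, y)] = dist > 4 * sigma
    return values


same_atoms = atoms(M1) == atoms(M2)   # LNF cannot tell the models apart
distinct_in_m1 = M1["a"] != M1["b"]   # {a} ⊑ ¬{b} holds in M1
distinct_in_m2 = M2["a"] != M2["b"]   # {a} ⊑ ¬{b} fails in M2
print(same_atoms, distinct_in_m1, distinct_in_m2)  # True True False
```

Since the atomic formulas agree and the connectives are truth-functional, every LNF formula over a and b has the same value in both models, while the different-individual axiom does not.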

Lemma 7.3. The logic LNF is a proper fragment of the logic MS(M).

Proof. Every atomic LNF formula is expressible in MS(M):

• BEQ(a, b) ≡ (0 ≤ δ(a, b) ≤ σ);

• NEAR(a, b) ≡ (0 ≤ δ(a, b) ≤ 2σ);

• FAR(a, b) ≡ (δ(a, b) > 4σ).

LNF and MS(M) both have logical connectives ¬ and ∧. Hence every LNF formula is expressible in MS(M). By Lemma 7.2, LNF is a proper fragment of MS(M).

The following calculus (which we will also refer to as LNF) will be shown to be sound and complete for LNF:

Axiom 0 All tautologies of classical propositional logic

Axiom 1 BEQ(a, a);

Axiom 2 BEQ(a, b) → BEQ(b, a);

Axiom 3 NEAR(a, b) → NEAR(b, a);

Axiom 4 FAR(a, b) → FAR(b, a);

Axiom 5 BEQ(a, b) ∧ BEQ(b, c) → NEAR(c, a);

Axiom 6 BEQ(a, b) ∧ NEAR(b, c) ∧ BEQ(c, d) → ¬FAR(d, a);

Axiom 7 NEAR(a, b) ∧ NEAR(b, c) → ¬FAR(c, a);

MP Modus ponens: φ, φ → ψ ⊢ ψ.
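Axioms 5 to 7 rest on the triangle inequality, so they can be sanity-checked on sampled points. The sketch below is not a soundness proof (that is the subject of Section 7.2). It assumes points on an integer grid in the plane and an integer σ, so that comparisons of squared distances are exact, and reads BEQ, NEAR and FAR as distance at most σ, at most 2σ and more than 4σ. The function name `find_counterexample` is hypothetical.

```python
import random

SIGMA = 2  # integer margin, so squared-distance comparisons are exact


def d2(p, q):
    """Squared Euclidean distance between two integer grid points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def beq(p, q):
    return d2(p, q) <= SIGMA ** 2


def near(p, q):
    return d2(p, q) <= (2 * SIGMA) ** 2


def far(p, q):
    return d2(p, q) > (4 * SIGMA) ** 2


def implies(p, q):
    return (not p) or q


def find_counterexample(samples=20000, seed=0, span=10):
    """Return four points violating Axiom 5, 6 or 7, or None if none is found."""
    rng = random.Random(seed)

    def pt():
        return (rng.randint(-span, span), rng.randint(-span, span))

    for _ in range(samples):
        a, b, c, d = pt(), pt(), pt(), pt()
        ok = (
            implies(beq(a, b) and beq(b, c), near(c, a))                      # Axiom 5
            and implies(beq(a, b) and near(b, c) and beq(c, d), not far(d, a))  # Axiom 6
            and implies(near(a, b) and near(b, c), not far(c, a))             # Axiom 7
        )
        if not ok:
            return (a, b, c, d)
    return None


print(find_counterexample())  # None
```

No counterexample can occur here: for instance, in Axiom 6 the premises bound the distance between a and d by σ + 2σ + σ = 4σ, which rules out FAR.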

The notion of derivability Γ ⊢ φ in LNF is standard. A formula φ is LNF-derivable if ⊢ φ. A set Γ is (LNF) inconsistent if for some formula φ it derives both φ and ¬φ.

We have the following derivable formulas (which we will refer to as facts in the completeness proof):

Fact 8 NEAR(a, b) ∧ BEQ(b, c) ∧ BEQ(c, d) → ¬FAR(d, a);

Fact 9 BEQ(a, b) → NEAR(a, b);

Fact 10 NEAR(a, b) → ¬FAR(a, b);

Fact 11 NEAR(a, b) ∧ BEQ(b, c) → ¬FAR(c, a);

Fact 12 BEQ(a, b) → ¬FAR(a, b);

Fact 13 BEQ(a, b) ∧ BEQ(b, c) → ¬FAR(c, a);

Fact 14 BEQ(a, b) ∧ BEQ(b, c) ∧ BEQ(c, d) → ¬FAR(d, a);

Fact 15 BEQ(a, b) ∧ BEQ(b, c) ∧ BEQ(c, d) ∧ BEQ(d, e) → ¬FAR(e, a).

As shown by Facts 12-15, a chain of at most four BEQs implies the negation of FAR, because FAR is defined as being > 4σ distance away in Definition 7.1.
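To illustrate how such facts arise, the block below gives one possible derivation of Fact 9 from Axioms 1, 3 and 5, followed by the metric reading of Fact 15. The step layout is an illustration and is not taken from the thesis.

```latex
\begin{align*}
&1.\quad \mathit{BEQ}(b,b)
  && \text{Axiom 1}\\
&2.\quad \mathit{BEQ}(a,b)\wedge \mathit{BEQ}(b,b)\rightarrow \mathit{NEAR}(b,a)
  && \text{Axiom 5 with } c := b\\
&3.\quad \mathit{BEQ}(a,b)\rightarrow \mathit{NEAR}(b,a)
  && \text{1, 2, Axiom 0, MP}\\
&4.\quad \mathit{NEAR}(b,a)\rightarrow \mathit{NEAR}(a,b)
  && \text{Axiom 3}\\
&5.\quad \mathit{BEQ}(a,b)\rightarrow \mathit{NEAR}(a,b)
  && \text{3, 4, Axiom 0, MP}
\end{align*}

% Metric reading of Fact 15: four BEQ steps cover a distance of at most 4*sigma.
\begin{align*}
d(I(a),I(e)) &\le d(I(a),I(b)) + d(I(b),I(c)) + d(I(c),I(d)) + d(I(d),I(e))\\
             &\le \sigma+\sigma+\sigma+\sigma = 4\sigma ,
\end{align*}
so $d(I(a),I(e)) \notin (4\sigma,+\infty)$ and $\mathit{FAR}(e,a)$ is false.
```

The same bound explains why the chain stops at four: a fifth BEQ step would only give a bound of 5σ, which does not exclude FAR.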
