Probabilistic Range Queries on Uncertain Data: Uncertain Query

Uncertain Query

Again adhering to possible world semantics and object-based answer semantics, a spatial range query for an uncertain query object $Q$ on uncertain data is defined as follows:

Definition 19 (Probabilistic Range Query). Let DB be an uncertain spatial database, let $Q$ be an uncertain query object, and let $\varepsilon$ be a positive real value. A spatial range query computes, for each database object, the probability of having a distance of at most $\varepsilon$ to $Q$:

$$\varepsilon\text{-range}(Q, \mathit{DB}) := \{(U \in \mathit{DB},\ P(\operatorname{dist}(Q, U) \le \varepsilon))\}$$

Definition 19 is almost identical to Definition 18, except that the query object is no longer assumed to be a certain point but may itself be an uncertain object.

To compute the probability $P(\operatorname{dist}(Q, U) \le \varepsilon)$, we first need to formally define the distance $\operatorname{dist}(Q, U)$ between two uncertain objects. Clearly, for two objects having uncertain locations, the distance between them is also uncertain, i.e., a random variable. In the discrete case, a probabilistic distance function is defined as follows.

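Definition 20 (Probabilistic Distance). Let DB be an uncertain spatial database and let $\operatorname{dist} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_0^+$ be a distance function on (certain) points. Furthermore, let $U_i$ and $U_j$ be two uncertain objects. A probabilistic distance $\operatorname{dist}(U_i, U_j)$ returns the cumulative distribution function (CDF) of the distance between $U_i$ and $U_j$:

$$\operatorname{dist} : \mathit{DB} \times \mathit{DB} \to (\mathbb{R}_0^+ \to [0,1]),$$

$$\operatorname{dist}(U_i, U_j) = \{(d \in \mathbb{R}_0^+,\ P(\operatorname{dist}(U_i, U_j) \le d) \in [0,1])\},$$

where

$$P(\operatorname{dist}(U_i, U_j) \le d) = \sum_{w \in W} I(\operatorname{dist}(U_i, U_j) \le d,\ w) \cdot P(w).$$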
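To make the possible-world sum in Definition 20 concrete, the following Python sketch evaluates it literally by enumerating every world. It is an illustration only and rests on assumptions that are not part of the definition: each uncertain object is represented as a hypothetical list of (location, probability) pairs, objects are mutually independent so that the probability of a world is the product of the probabilities of its instances, and dist is the Euclidean distance. The name `distance_cdf_naive` and the toy data are made up for this example.

```python
import itertools
import math

def distance_cdf_naive(db, i, j, d):
    """P(dist(U_i, U_j) <= d), summing P(w) over all possible worlds w.

    Assumptions (illustrative only): db is a list of uncertain objects,
    each a list of (location, probability) pairs; objects are independent;
    dist is the Euclidean distance.
    """
    total = 0.0
    # A world picks exactly one (location, probability) instance per object.
    for world in itertools.product(*db):
        p_world = math.prod(p for _, p in world)  # P(w) under independence
        x, y = world[i][0], world[j][0]
        if math.dist(x, y) <= d:                  # indicator I(dist <= d, w)
            total += p_world
    return total

# Hypothetical toy database with three uncertain objects in the plane.
toy_db = [
    [((0.0, 0.0), 0.5), ((1.0, 0.0), 0.5)],
    [((0.0, 1.0), 0.4), ((3.0, 0.0), 0.6)],
    [((5.0, 5.0), 0.3), ((6.0, 6.0), 0.7)],
]
p = distance_cdf_naive(toy_db, 0, 1, 1.5)
```

The loop visits one world per combination of instances, so the number of worlds is the product of the instance counts of all objects. A literal evaluation of this kind is therefore only feasible for very small inputs.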

Note that the function $\operatorname{dist}(\cdot,\cdot)$ has previously been defined between two certain points. This function is overloaded deliberately. It should be clear from the context whether the traditional distance $\operatorname{dist} : \mathbb{R}^d \times \mathbb{R}^d$ defined on (certain) points or the uncertain version $\operatorname{dist} : \mathit{DB} \times \mathit{DB} \to (\mathbb{R}_0^+ \to [0,1])$ defined on uncertain objects is used.

Lemma 6. The probability $P(\operatorname{dist}(U_i, U_j) \le d)$ can be computed in polynomial time.

Proof. To compute the probability $P(\operatorname{dist}(U_i, U_j) \le d)$, we observe that, exploiting object independence, this probability depends only on the positions of the uncertain objects $U_i$ and $U_j$ and is independent of any other database object in $\mathit{DB} \setminus \{U_i, U_j\}$. This observation allows us to easily find sets of possible worlds that are equivalent with respect to the random event $\operatorname{dist}(U_i, U_j) \le d$: let $x \in U_i$ be a possible location of $U_i$, and let $y \in U_j$ be a possible location of $U_j$. Then all worlds $w \in C_{x,y} := \{w \in W \mid U_i = x, U_j = y\}$ are equivalent with respect to the random event $\operatorname{dist}(U_i, U_j) \le d$. Thus, the equivalence

$$\forall w_1, w_2 \in C_{x,y} : \operatorname{dist}(w_1.U_i, w_1.U_j) \le d \Leftrightarrow \operatorname{dist}(w_2.U_i, w_2.U_j) \le d$$

holds. Formally, this equivalence follows by substituting $w_1.U_i = w_2.U_i = x$ and $w_1.U_j = w_2.U_j = y$. Since there exists one equivalence class $C_{x,y}$ for each $x \in U_i$ and each $y \in U_j$, the number of equivalence classes equals $|U_i| \cdot |U_j|$, where $|U_i|$ ($|U_j|$) is the number of possible locations of $U_i$ ($U_j$). Thus, conditions II and III of Lemma 4 are satisfied. Condition I is satisfied trivially, assuming that the distance $\operatorname{dist}(x, y)$ between two (certain) points $x$ and $y$ can be computed in polynomial time, which is the case for Euclidean distance. Finally, condition IV requires that the total probability of an equivalence class can be computed efficiently, i.e., the probability

$$P(C_{x,y}) = \sum_{w \in C_{x,y}} P(w)$$

has to be computed efficiently. By definition of $C_{x,y}$, this equation can be rewritten as

$$\sum_{w \in C_{x,y}} P(w) = \sum_{\{w \in W \mid U_i = x,\, U_j = y\}} P(w).$$

The right-hand side of the above equation aggregates the probabilities of all worlds where $U_i = x$ and $U_j = y$. Using the indicator function $I(U_i = x \wedge U_j = y,\ w)$, which returns 1 if the predicate $U_i = x \wedge U_j = y$ holds in world $w$ and 0 otherwise, this can be rewritten as

$$\sum_{\{w \in W \mid U_i = x,\, U_j = y\}} P(w) = \sum_{w \in W} I(U_i = x \wedge U_j = y,\ w) \cdot P(w).$$

Using the definition of possible world semantics (Equation 2.3), we obtain

$$\sum_{w \in W} I(U_i = x \wedge U_j = y,\ w) \cdot P(w) = P(U_i = x \wedge U_j = y).$$

Exploiting independence between $U_i$ and $U_j$, we finally obtain

$$P(U_i = x \wedge U_j = y) = P(U_i = x) \cdot P(U_j = y),$$

which can be computed in constant time by looking up the probabilities $P(U_i = x)$ and $P(U_j = y)$ given by the models of $U_i$ and $U_j$.

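Thus, condition IV holds, and Lemma 4 is applicable to compute $P(\operatorname{dist}(U_i, U_j) \le d)$ efficiently by

$$P(\operatorname{dist}(U_i, U_j) \le d) = \sum_{C_{x,y},\, x \in U_i,\, y \in U_j} I(\operatorname{dist}(w.U_i, w.U_j) \le d,\ w) \cdot P(C_{x,y}) \tag{4.2}$$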

This equation requires iterating over all equivalence classes $C_{x,y}$, $x \in U_i$, $y \in U_j$, summing up the probabilities $P(C_{x,y})$ of each class where $\operatorname{dist}(w.U_i, w.U_j) \le d$ holds for a world $w \in C_{x,y}$. Since the probability $P(C_{x,y})$ of each class can be computed in constant time, the total time complexity of computing $P(\operatorname{dist}(U_i, U_j) \le d)$ is in $O(|C|)$, where $|C|$ is the number of classes. This number equals $|U_i| \cdot |U_j|$, where $|U_i|$ and $|U_j|$ denote the numbers of possible locations of objects $U_i$ and $U_j$.
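The class-based evaluation of Equation 4.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: uncertain objects are assumed to be lists of (location, probability) pairs, dist is assumed to be the Euclidean distance, and the name `distance_cdf` is hypothetical. Each pair (x, y) visited by the loops stands for one class $C_{x,y}$, whose probability is the product $P(U_i = x) \cdot P(U_j = y)$.

```python
import math

def distance_cdf(u_i, u_j, d):
    """P(dist(U_i, U_j) <= d), evaluated over the classes C_{x,y}.

    Assumptions (illustrative only): u_i and u_j are lists of
    (location, probability) pairs of two independent uncertain objects;
    dist is the Euclidean distance.
    """
    total = 0.0
    for x, p_x in u_i:                # possible locations of U_i
        for y, p_y in u_j:            # possible locations of U_j
            if math.dist(x, y) <= d:  # same truth value for every w in C_{x,y}
                total += p_x * p_y    # P(C_{x,y}) = P(U_i = x) * P(U_j = y)
    return total

# Hypothetical example objects with two possible locations each.
u_a = [((0.0, 0.0), 0.5), ((1.0, 0.0), 0.5)]
u_b = [((0.0, 1.0), 0.4), ((3.0, 0.0), 0.6)]
p = distance_cdf(u_a, u_b, 1.5)
```

The two nested loops perform one distance evaluation per class, i.e., $|U_i| \cdot |U_j|$ evaluations in total, and no other database object is touched.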

To answer a probabilistic $\varepsilon$-range query as defined in Definition 19, we can apply Equation 4.2 to compute the probability $P(\operatorname{dist}(Q, o) \le \varepsilon)$ for each $o \in \mathit{DB}$ by substituting $d$ by $\varepsilon$, $U_i$ by $Q$, and $U_j$ by $o$. This yields a total run-time of

$$O\Big(\sum_{o \in \mathit{DB}} |Q| \cdot |o|\Big),$$

which is in $O(|\mathit{DB}| \cdot |Q| \cdot \max_{o \in \mathit{DB}} |o|)$.
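As a rough illustration of the query procedure just described, the following Python sketch computes the result of a range query with an uncertain query object. It is only a sketch under assumptions: the database is modeled as a hypothetical dictionary from object names to lists of (location, probability) pairs, objects are independent, dist is the Euclidean distance, and the names `prob_within` and `eps_range` are invented for this example.

```python
import math

def prob_within(q, o, eps):
    """P(dist(Q, o) <= eps) for two independent discrete uncertain objects."""
    return sum(
        p_x * p_y
        for x, p_x in q               # possible locations of the query Q
        for y, p_y in o               # possible locations of object o
        if math.dist(x, y) <= eps
    )

def eps_range(query, db, eps):
    """Map every database object to its probability of lying within eps of Q."""
    return {name: prob_within(query, obj, eps) for name, obj in db.items()}

# Hypothetical uncertain query object and database.
query = [((0.0, 0.0), 0.5), ((1.0, 0.0), 0.5)]
db = {
    "a": [((0.0, 1.0), 0.4), ((3.0, 0.0), 0.6)],
    "b": [((10.0, 10.0), 1.0)],
}
result = eps_range(query, db, 1.5)
```

Every database object appears in the result together with its probability, including objects with probability 0, which mirrors the object-based answer semantics of the definition. The work per object is proportional to the number of location pairs of the query and that object.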

In document Züfle, Andreas (2013): Similarity search and mining in uncertain spatial and spatio-temporal databases. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 86-88)