11.4 Accelerated Probability Computation
11.4.3 Dynamic-Programming-Based Algorithm
In the following, an algorithm is introduced that accelerates the computation by several orders of magnitude. This algorithm utilizes a dynamic-programming scheme, also known as the Poisson Binomial Recurrence, first introduced in [147]. In the context of uncertain top-k queries, this scheme was originally proposed in [214] on the x-relation model, which was the first approach to solve probabilistic queries efficiently by means of dynamic-programming techniques. Here, this scheme is extended to spatial data and computes the probability that an uncertain object X ∈ D is assigned to a certain ranking position w.r.t. the distance to a query observation q.
The probabilities of PT can be computed efficiently with a complexity of O(N^3). The key idea of this approach is based on the following property. Consider a query observation q, an observation x of an uncertain database object X, and a set of h objects S = {Z_1, Z_2, ..., Z_h} for which the probability P_x(Z) that Z ∈ S is closer to q than x is known (i.e., all objects Z for which at least one observation has been retrieved from the distance browsing B). The probability P_x(Z) can be computed according to the following lemma.
Lemma 11.3 Let q be the query object and let (x, P(X = x)) be the observation x of an object X fetched from the distance browsing B in the current processing iteration. The probability that an object Z ≠ X is closer to q than x is
P_x(Z) = \sum_{i=1}^{j} P(Z = z_i),
where z_i ∈ Z, 1 ≤ i ≤ j, are the observations of Z fetched in previous processing iterations. Lemma 11.3 states that the probabilities of all observations seen so far can be accumulated per object in overall linear space and then used to compute P_x(Z), given the current observation x and any object Z ∈ D \ {X}.
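The accumulation described by Lemma 11.3 can be illustrated with a minimal Python sketch. The function name `closer_probabilities` and the representation of the distance browsing B as an iterable of `(obj_id, p_obs)` pairs in ascending distance order are assumptions made for illustration only; they are not part of the original framework.

```python
from collections import defaultdict

def closer_probabilities(stream):
    """Yield (obj_id, p_obs, closer) for each observation of the stream.

    stream: iterable of (obj_id, p_obs) pairs, assumed to arrive in
            ascending order of distance to the query observation q.
    closer: dict mapping each other object Z seen so far to P_x(Z),
            i.e. the summed probability of its earlier observations.
    """
    seen = defaultdict(float)  # one accumulator per object: linear space
    for obj_id, p_obs in stream:
        # P_x(Z) for every object Z != X that has been seen so far
        closer = {z: p for z, p in seen.items() if z != obj_id}
        yield obj_id, p_obs, closer
        # only afterwards does x itself count as "fetched previously"
        seen[obj_id] += p_obs

# Usage with two hypothetical objects "A" and "B":
steps = list(closer_probabilities([("A", 0.5), ("B", 0.4), ("A", 0.5), ("B", 0.6)]))
```

One accumulator per object suffices because P_x(Z) depends only on the probability mass of Z that has already been retrieved, not on the individual observations.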
Now, the probability P_{i,S,x} that exactly i objects Z ∈ S are ranked higher than x w.r.t. the distance to q can be computed efficiently, utilizing the following lemma.
Lemma 11.4 The event that i objects of S are closer to q than x occurs if one of the following conditions holds. If an object Z ∈ S is closer to q than x, then i−1 objects of S \ {Z} must be closer to q than x. Otherwise, if Z is farther from q than x, then i objects of S \ {Z} must be closer to q than x.
Figure 11.3: Visualization of the dynamic-programming scheme.
The above lemma leads to the following recursion, which allows P_{i,S,x} to be computed by means of dynamic programming:
P_{i,S,x} = P_{i-1,S \setminus \{Z\},x} \cdot P_x(Z) + P_{i,S \setminus \{Z\},x} \cdot (1 - P_x(Z)), where
P_{0,\emptyset,x} = 1 and P_{i,S,x} = 0 if i < 0 \lor i > |S|. (11.1)
An illustration of this dynamic-programming scheme is given in Figure 11.3, where the size of S is marked along the x-axis and the number of objects that are closer to q than the currently processed observation x is marked along the y-axis. The shaded cells represent the probabilities that have to be determined during the RPD computation. As illustrated, each grid cell (one example is marked with a dot in Figure 11.3) can be computed using the values contained in the left and the lower-left cells. If the ranking depth is restricted to k, the probabilities that up to k−1 out of N−1 objects are closer to q than x are needed (N−1 rather than N, as x cannot be preceded by the object it belongs to). In each iteration of the dynamic-programming algorithm, O(N·k) cells have to be computed (which is O(N^2) in the setting of this chapter). Performing this for each observation retrieved from the distance browsing B yields an overall runtime of O(N^3), as it can be assumed that the total number of observations in the database is linear in the number of database objects.
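Recursion (11.1) can be sketched in a few lines of Python. The function name `rank_distribution` and its parameters are hypothetical; `closer_probs` is assumed to hold the values P_x(Z) for all Z ∈ S, and `k` is the optional ranking depth. This is a minimal sketch of the recurrence, not the original implementation.

```python
def rank_distribution(closer_probs, k=None):
    """Return [P_0, ..., P_limit], where P_i is the probability that
    exactly i objects of S are closer to q than the observation x.

    closer_probs: list of P_x(Z), one entry per object Z in S.
    k: optional ranking depth; only P_0 .. P_{k-1} are then computed.
    """
    n = len(closer_probs)
    limit = n if k is None else min(k - 1, n)
    row = [1.0] + [0.0] * limit  # base case: P_{0,empty set,x} = 1
    for p in closer_probs:  # add one object Z to S per step
        # Walk i downwards so that row[i - 1] still refers to S \ {Z}.
        for i in range(limit, 0, -1):
            row[i] = row[i - 1] * p + row[i] * (1.0 - p)
        row[0] *= 1.0 - p  # P_{-1,...} = 0, so only the second term remains
    return row

print(rank_distribution([0.5, 0.5]))  # [0.25, 0.5, 0.25]
```

Updating the row in place is a design choice of this sketch: because the index i is traversed downwards, a single array of at most N entries is sufficient, and each object Z costs O(k) operations when the ranking depth is bounded.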
Regarding the storage requirements for the probability values, the computation of each probability P_{i,S,x} only requires information stored in the current line and the previous line to access the probabilities P_{i−1,S\{Z},x} and P_{i,S\{Z},x}. Therefore, only these two lines (of length N) need to be preserved, requiring O(N) space. In contrast, the probability table PT used in the straightforward and in the divide-and-conquer-based approach (cf. Subsection 11.3.2) had to store N^2·m values, resulting in an overall space requirement of O(N^3). While the bisection-based algorithm still requires exponential asymptotic runtime for the computation of the RPD, the dynamic-programming-based algorithm only requires a worst-case runtime of O(N^3). This can be further reduced to a runtime quadratic in N if the ranking depth k is assumed to be a small constant, which yields a complexity of O(k·N^2). In Chapter 12, a solution will be presented that computes the RPD in linear time w.r.t. the database size by enhancing the above dynamic-programming scheme. Chapter 12 will also show how the proposed framework, enhanced with the linear-time solution, can be used to support and significantly boost the performance of state-of-the-art probabilistic ranking queries.

Figure 11.4: Uncertain object distribution in 60×60 space for different degrees of uncertainty (N = 40, m = 20). (a) UD = 2.0. (b) UD = 5.0.
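The overall procedure of this subsection, with its O(k·N) work per observation, can be sketched as follows. The function name `rank_probabilities`, the stream representation, and the aggregation of the per-observation values into a per-object rank distribution (weighting each row by P(X = x), with rank r corresponding to r−1 closer objects) are assumptions of this sketch rather than details confirmed by the text above.

```python
from collections import defaultdict

def rank_probabilities(stream, k):
    """Sketch of the dynamic-programming-based rank computation.

    stream: iterable of (obj_id, p_obs) pairs in ascending distance to q
            (a stand-in for the distance browsing B; hypothetical format).
    k: ranking depth.
    Returns a dict obj_id -> list of k values; entry r is the assumed
    probability that the object is ranked at position r + 1.
    """
    seen = defaultdict(float)  # accumulated P_x(Z) per object (Lemma 11.3)
    result = defaultdict(lambda: [0.0] * k)
    for obj_id, p_obs in stream:
        row = [1.0] + [0.0] * (k - 1)  # P_{0,empty set,x} = 1
        for z, p in seen.items():
            if z == obj_id:
                continue  # x cannot be preceded by its own object
            for i in range(k - 1, 0, -1):  # recursion (11.1), in place
                row[i] = row[i - 1] * p + row[i] * (1.0 - p)
            row[0] *= 1.0 - p
        for i in range(k):
            # i closer objects put x on rank i + 1; weight by P(X = x)
            result[obj_id][i] += p_obs * row[i]
        seen[obj_id] += p_obs
    return dict(result)

# Usage with two hypothetical objects:
ranks = rank_probabilities([("A", 0.5), ("B", 1.0), ("A", 0.5)], k=2)
```

With a number of observations linear in N, the two nested loops over the seen objects and the k rank positions give the O(k·N^2) total runtime stated above, while the working memory stays linear in N apart from the output.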
11.5 Experimental Evaluation
11.5.1 Datasets and Experimental Setup
This section examines the effectiveness and efficiency of the proposed probabilistic similarity ranking approaches. Since [45] only provides a sparse experimental part, this section comprises the evaluation provided in [49]. As the computation is highly CPU-bound, efficiency is measured by the overall runtime required to compute an entire ranking, averaged over ten queries.
The following experiments are based on artificial and real-world datasets. The artificial datasets ART, which were used for the efficiency experiments, contain 10 to 1,000 ten-dimensional uncertain objects whose positions follow a Gaussian distribution in the data space. Each object consists of m = 10 observations that are uniformly distributed around the mean position of the object with a variance (in the following referred to as degree of uncertainty (UD)) of 10% of the data space, if not stated otherwise. Figure 11.4 depicts examples of the distribution of uncertain objects when varying UD. A growing degree of uncertainty leads to an increase of the overlap between the observations.
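A dataset in the spirit of ART could be generated as in the following sketch. All names and numeric choices here are assumptions, since the text does not specify them: the function `make_art_dataset`, the Gaussian centered in the middle of the data space with standard deviation `extent / 6`, the interpretation of UD as the side length of a hypercube relative to the extent of the data space, and equal observation probabilities 1/m.

```python
import random

def make_art_dataset(n_objects=100, dims=10, m=10, extent=1.0, ud=0.1, seed=0):
    """Generate n_objects uncertain objects with m observations each.

    Returns a list of objects; each object is a list of
    (coordinates, probability) pairs whose probabilities sum to 1.
    """
    rng = random.Random(seed)
    half = ud * extent / 2.0  # assumed: UD as a fraction of the space extent
    dataset = []
    for _ in range(n_objects):
        # Assumed placement: Gaussian around the center of the data space.
        center = [rng.gauss(extent / 2.0, extent / 6.0) for _ in range(dims)]
        observations = [
            ([c + rng.uniform(-half, half) for c in center], 1.0 / m)
            for _ in range(m)
        ]
        dataset.append(observations)
    return dataset

art = make_art_dataset(n_objects=40, m=20)
```

Increasing `ud` in this sketch widens the region covered by each object and thereby increases the overlap between the observations of different objects.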
For the evaluation of the effectiveness of the proposed methods, two real-world datasets were used: O3 and NSP. The O3 dataset is an environmental dataset consisting of 30 uncertain objects created from time series, each comprising a set of measurements of the ozone concentration in the air measured within one month¹. Each observation represents a daily ozone concentration curve. The dataset covers observations from the years 2000 to 2004 and is labeled according to the months of the year. NSP is a chronobiologic dataset describing the cell activity of Neurospora² within sequences of day cycles. This dataset is used to investigate endogenous rhythms. It can be classified w.r.t. two parameters, among others: day cycle and fungal type. For the experiments, two subsets of the NSP dataset were used: NSP_h and NSP_frq. NSP_h is labeled according to the day cycle length. It consists of 36 objects that form three classes of day cycle (16, 18, and 20 hours). The NSP_frq dataset consists of 48 objects and is labeled w.r.t. the fungal type (frq1, frq7, and frq+).

Dataset    PRQ_MC   PRQ_MAC   PRQ_EM   MP
O3         0.51     0.65      0.53     0.63
NSP_h      0.36     0.43      0.29     0.35
NSP_frq    0.62     0.70      0.41     0.60
Table 11.3: Avg. precision for probabilistic ranking queries on different real-world datasets.