10.3 Probabilistic Query Processing
10.3.3 Further Probabilistic Query Types
Beyond probabilistic ranking, there is a variety of work tackling other query types in uncertain data, including probabilistic range queries, probabilistic nearest neighbor (PNN) queries and some variants, and probabilistic reverse nearest neighbor (PRNN) queries. Probabilistic range queries have been addressed in [76, 78, 112, 138, 195].
There exist approaches for PNN queries based on certain query objects [77] and for uncertain queries [110, 139]. The authors of [74] add threshold constraints and propose the constrained PNN query for certain query points in order to retrieve only objects whose probability of being the nearest neighbor exceeds a user-specified threshold. A combination of the concepts of PNN queries and top-k retrieval in probabilistic databases is provided by top-k PNN queries [50]. Here, the idea is to return the k objects with the highest probability of being the nearest neighbor of a single-observation (certain) query point.
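The query semantics described above can be illustrated with a brute-force sketch. It is not the method of any of the cited works; it assumes a discrete uncertainty model (each object is a list of weighted observations), mutually independent objects, and a certain query point. The names nn_probabilities, constrained_pnn and top_k_pnn as well as the example data are hypothetical.

```python
from math import dist, prod

def nn_probabilities(db, q):
    """P(object is the nearest neighbor of the certain query point q)."""
    result = {}
    for name, observations in db.items():
        p_nn = 0.0
        for pos, p in observations:
            d = dist(pos, q)
            # Probability that every other object is at least as far away as d.
            # Assumption: objects are independent; ties count as "farther".
            p_others_farther = prod(
                sum(p2 for pos2, p2 in other if dist(pos2, q) >= d)
                for other_name, other in db.items()
                if other_name != name
            )
            p_nn += p * p_others_farther
        result[name] = p_nn
    return result

def constrained_pnn(db, q, tau):
    """Objects whose nearest-neighbor probability exceeds the threshold tau."""
    return {n: p for n, p in nn_probabilities(db, q).items() if p > tau}

def top_k_pnn(db, q, k):
    """The k objects with the highest nearest-neighbor probability."""
    probs = nn_probabilities(db, q)
    return sorted(probs, key=probs.get, reverse=True)[:k]

# Hypothetical data: object name -> list of (position, probability).
db = {
    "A": [((1.0, 0.0), 0.5), ((4.0, 0.0), 0.5)],
    "B": [((2.0, 0.0), 1.0)],
    "C": [((3.0, 0.0), 0.5), ((5.0, 0.0), 0.5)],
}
q = (0.0, 0.0)
print(nn_probabilities(db, q))
print(constrained_pnn(db, q, tau=0.4))
print(top_k_pnn(db, q, k=2))
```

The sketch enumerates all observations of all objects for every candidate, which is quadratic in the database size; the cited approaches exist precisely to avoid this cost.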
[162] proposed a solution for probabilistic k-nearest neighbor (Pk-NN) queries based on expected distances. [75] introduced the probabilistic threshold k-NN (PTk-NN) query, which requires an uncertain object to exceed a probability threshold of being part of the Pk-NN result. Here, the query is assumed to be a single-observation object.
The framework proposed in [34] introduced the concept of probabilistic domination in order to efficiently answer Pk-NN and PTk-NN queries as well as probabilistic ranking and inverse ranking queries for uncertain query objects, applying Uncertain Generating Functions, an extended variant of the Generating Functions introduced in [154].
The PRNN query returns the set of all objects for which the probability that an uncertain query object Q is their nearest neighbor exceeds a user-defined probability threshold. This problem has been tackled by [157] for the continuous model and by [67] for the discrete case. The approach of [37] was shown to achieve superior results compared to these approaches; furthermore, it proposes an extension to PRk-NN queries.
10.4 Probabilistic Data Mining
The identification of hot items, i.e., objects that are similar to a given number of other objects, is the basis of several density-based mining applications [61, 90, 143, 184, 194]. The detection of hot items can be efficiently supported by a similarity join query used in a preprocessing step, in particular the distance-range self-join. A survey of probabilistic join queries in uncertain databases can be found in [134]. Approaches for an efficient join are proposed in [138]. Their main advantage is that discrete positions in space can be indexed efficiently using traditional spatial access methods, which reduces the high computational cost of processing complex query types.
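To make the distance-range self-join concrete, the following sketch joins a small set of discrete uncertain objects with themselves. It is an illustration only and not the approach of [138]: the threshold-based join semantics (report a pair if its probability of lying within eps is at least tau), the independence of objects, and all names and data are assumptions.

```python
from itertools import combinations
from math import dist

def within_range_probability(obj_a, obj_b, eps):
    """P(dist(A, B) <= eps) for two independent discrete uncertain objects."""
    return sum(
        pa * pb
        for pos_a, pa in obj_a
        for pos_b, pb in obj_b
        if dist(pos_a, pos_b) <= eps
    )

def probabilistic_self_join(db, eps, tau):
    """Pairs of distinct objects whose within-range probability is at least tau."""
    pairs = {}
    for (name_a, a), (name_b, b) in combinations(db.items(), 2):
        p = within_range_probability(a, b, eps)
        if p >= tau:
            pairs[(name_a, name_b)] = p
    return pairs

# Hypothetical data: object name -> list of (position, probability).
db = {
    "A": [((0.0, 0.0), 0.5), ((3.0, 0.0), 0.5)],
    "B": [((1.0, 0.0), 1.0)],
    "C": [((10.0, 0.0), 1.0)],
}
print(probabilistic_self_join(db, eps=1.5, tau=0.4))
```

The nested loop compares every pair of observations. Because the observations are discrete points, a spatial index over them could replace the inner scans, which is the advantage the text attributes to indexing discrete positions.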
Apart from the analysis of spatial objects, there are various data mining applications that have to cope with the presence of uncertainty. For example, the detection of frequent itemsets as a preprocessing step for rule mining is one of the most important problems in data mining. There is a large body of research on Frequent Itemset Mining (FIM), but so far very little work has addressed FIM in uncertain databases [79, 80, 150]. The approach proposed in [80] computes the expected support of itemsets by summing all itemset probabilities in the U-Apriori algorithm. Later, in [79], the same authors additionally proposed a probabilistic filter in order to prune candidates early. In [150], the UF-growth algorithm is proposed. Like U-Apriori, UF-growth computes frequent itemsets by means of the expected support, but it uses the FP-tree approach [104] in order to avoid expensive candidate generation. UF-growth considers itemsets to be frequent if the expected support exceeds a minimum support threshold. The main drawback of this estimator is that information about the uncertainty of the support is lost; [79, 80, 150] ignore the number of possible worlds in which an itemset is frequent. [217] proposes exact and sampling-based algorithms to find likely frequent items in streaming probabilistic data. However, these algorithms do not consider itemsets with more than one item. Finally, except for [199], existing FIM algorithms assume binary-valued items, which precludes simple adaptation to uncertain databases. The approaches proposed in Chapters 15 [46] and 16 [47] are the first methods that find frequent itemsets in an uncertain transaction database in a probabilistic way.
A different tree-based algorithm is presented in [151], which suggests an upper bound of the expected support by dealing with projected transactions.
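The difference between the expected support and a probabilistic notion of frequentness can be sketched as follows. This is a generic dynamic-programming illustration, not the algorithms of [79, 80, 150] or of Chapters 15 and 16. It assumes that items and transactions are mutually independent; the function names and the example database are hypothetical.

```python
from math import prod

# Hypothetical uncertain transaction database: each transaction maps an item
# to its existence probability (assumption: items and transactions are independent).
transactions = [
    {"a": 0.8, "b": 0.5},
    {"a": 0.5, "c": 1.0},
    {"a": 1.0, "b": 0.5},
]

def itemset_probabilities(itemset, transactions):
    """P(itemset is contained in t) for every transaction t."""
    return [prod(t.get(item, 0.0) for item in itemset) for t in transactions]

def expected_support(itemset, transactions):
    """Sum of the itemset probabilities over all transactions."""
    return sum(itemset_probabilities(itemset, transactions))

def support_distribution(itemset, transactions):
    """dist[s] = P(support == s), built up one transaction at a time."""
    dist = [1.0]
    for p in itemset_probabilities(itemset, transactions):
        nxt = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            nxt[s] += mass * (1.0 - p)   # itemset absent from this transaction
            nxt[s + 1] += mass * p       # itemset present in this transaction
        dist = nxt
    return dist

def frequentness_probability(itemset, transactions, min_sup):
    """P(support >= min_sup), i.e. the mass of worlds in which the itemset is frequent."""
    return sum(support_distribution(itemset, transactions)[min_sup:])

print(expected_support(("a",), transactions))
print(support_distribution(("a",), transactions))
print(frequentness_probability(("a",), transactions, min_sup=3))
```

In this toy database the itemset {a} has an expected support of 2.3, yet its support reaches 3 with a probability of only 0.4. The single expected value hides this distribution, which is the information loss criticized above.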
Chapter 11
Probabilistic Similarity Ranking on Spatially Uncertain Data
11.1 Introduction
A probabilistic ranking query on uncertain objects computes for each object X ∈ D the probability that X is the i-th nearest neighbor (1 ≤ i ≤ |D|) of a given query object Q. The simplest solution to perform queries on spatially uncertain objects is to represent each object by exactly one observation, e.g., the mean vector, and perform query processing in a traditional way. The advantage of this straightforward solution is that established query and indexing techniques can be applied. However, this solution is accompanied by information loss, since the similarity between uncertain objects is more meaningful when the full information of the object uncertainty is taken into account. An example of the mean-based representation is shown in Figure 11.1(a), which depicts uncertain objects A, . . . , U, each represented by its mean value. The results of a distance range query w.r.t. the query object Q are shown in the upper right box. There are critical objects like P, which is included in the result, and O, which is not included, though they are very close to each other. The result based on the full probabilistic object representation is shown in Figure 11.1(b). Here, the gray shaded fields indicate those objects which are also included in the non-probabilistic result. The objects O and P have quite similar probabilities (P(O) = 53%, P(P) = 60%) of belonging to the result. Additionally, it can be observed that the objects E, F, G and M are certain results, i.e., have a probability of 1 to appear in the result set.
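The contrast between the two representations can be reproduced with a small sketch. The data below are hypothetical and do not correspond to the values of Figure 11.1; the sketch assumes a discrete uncertainty model and a certain query point, and the helper names are illustrative.

```python
from math import dist

def mean_position(observations):
    """Probability-weighted mean of the observations of one object."""
    dims = len(observations[0][0])
    return tuple(sum(p * pos[d] for pos, p in observations) for d in range(dims))

def range_probability(observations, q, eps):
    """P(object lies within distance eps of q) under the discrete model."""
    return sum(p for pos, p in observations if dist(pos, q) <= eps)

# Hypothetical data: object name -> list of (position, probability).
db = {
    "O": [((0.9, 0.0), 0.5), ((1.3, 0.0), 0.5)],  # mean close to (1.1, 0)
    "P": [((0.7, 0.0), 0.5), ((1.1, 0.0), 0.5)],  # mean close to (0.9, 0)
}
q, eps = (0.0, 0.0), 1.0

# Traditional query on the mean positions: a hard yes/no decision per object.
mean_result = [n for n, obs in db.items() if dist(mean_position(obs), q) <= eps]

# Probabilistic query on the full representation: one probability per object.
prob_result = {n: range_probability(obs, q, eps) for n, obs in db.items()}

print(mean_result)
print(prob_result)
```

In this toy setting the mean-based query reports only P, although both objects have the same probability of 0.5 of lying inside the query range.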
This chapter will tackle the problem of similarity ranking on spatially uncertain data exploiting the full probabilistic information of uncertain objects. First, diverse forms of ranking outputs will be suggested which differ in the order in which the objects are reported to the user. Then, a framework based on iterative distance browsing will be proposed that supports efficient computation of the probabilistic similarity ranking.
(a) Query on objects represented by mean positions. (b) Query on objects with full uncertainty.
Figure 11.1: Distance range query on objects with different uncertainty representations.
The representation of uncertain objects will be based on Definition 9.2 (cf. Chapter 9), where the objects are assumed to incorporate positional uncertainty, i.e., each uncertain object consists of a set of multiple observations which are mutually exclusive.
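Under this observation-based model, the rank probabilities from the beginning of this section can be computed by brute force, as the following sketch shows. It is not the framework proposed in this chapter; it assumes a certain query point, mutually independent objects, and no distance ties, and the function name and data are hypothetical.

```python
from math import dist

def rank_probabilities(db, q):
    """ranks[name][i] = P(object is the (i+1)-th nearest neighbor of q)."""
    ranks = {}
    for name, observations in db.items():
        rank_dist = [0.0] * len(db)
        others = [obs for other, obs in db.items() if other != name]
        for pos, p in observations:
            d = dist(pos, q)
            # closer[j] = P(exactly j of the other objects are closer to q than pos)
            closer = [1.0]
            for other in others:
                p_closer = sum(p2 for pos2, p2 in other if dist(pos2, q) < d)
                nxt = [0.0] * (len(closer) + 1)
                for j, mass in enumerate(closer):
                    nxt[j] += mass * (1.0 - p_closer)
                    nxt[j + 1] += mass * p_closer
                closer = nxt
            # j closer objects put this observation at rank j + 1;
            # the observations of one object are mutually exclusive, so we add up.
            for j, mass in enumerate(closer):
                rank_dist[j] += p * mass
        ranks[name] = rank_dist
    return ranks

# Hypothetical data: object name -> list of (position, probability).
db = {
    "A": [((1.0, 0.0), 0.5), ((4.0, 0.0), 0.5)],
    "B": [((2.0, 0.0), 1.0)],
    "C": [((3.0, 0.0), 0.5), ((5.0, 0.0), 0.5)],
}
for name, dist_over_ranks in rank_probabilities(db, (0.0, 0.0)).items():
    print(name, dist_over_ranks)
```

Each object's rank probabilities sum to 1, and, in the absence of ties, so do the probabilities of all objects for a fixed rank.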
In the following, Section 11.2 will formally define different semantics of probabilistic ranking on uncertain objects. Then, Section 11.3 will introduce a framework containing the essential modules for computing the rank probabilities of uncertain observations. In Section 11.4, two approaches will be presented to speed up the computation of the rank probability distribution. These approaches will be evaluated w.r.t. effectiveness and efficiency in Section 11.5. Finally, Section 11.6 will conclude this chapter.