11.4 Accelerated Probability Computation
11.5.3 Efficiency Experiments
The next experiment evaluates the probabilistic ranking acceleration strategies proposed in Section 11.4 w.r.t. the query processing time. The strategies were compared with the straightforward solution that uses no additional strategy. The competing methods are the following:
• IT: Iterative fetching of the observations from the distance browsing B and computation of the probability table PT entries without any acceleration strategy.
• TP: Table pruning strategy where the reduced table space was used.
• BS: Bisection-based computation of the probability permutations.
• TP+BS: Combination of TP and BS.
• DP: Dynamic-programming-based computation of the probability permutations.
Influence of the Degree of Uncertainty
The first experiment compares all strategies (including the straightforward solution) for the computation of the RPD on the artificial datasets with different values of UD. The evaluation of the query processing time of the proposed approaches is illustrated in Figure 11.5.
In particular, the differences between the used computation strategies are depicted for two different numbers of observations per object (m = 10 and m = 30). Here, a database size of 20 uncertain objects in a ten-dimensional vector space was utilized.
The plain iterative fetching of observations (IT) is hardly affected by an increasing UD value, as it has to consider all possible worlds for the computation of the probabilistic rank distribution anyway. The table pruning strategy TP significantly decreases the required computation time. For a low UD, many objects cover only a small range of ranking positions and can, thus, be neglected. An increasing UD leads to a higher overlap of the objects and requires more computational effort. For the divide-and-conquer-based computation of BS, the query time increases only slightly when increasing UD. However, the required runtime is quite high even for a low UD value. The runtime of TP is much lower than that of BS for low degrees of uncertainty; here, TP is likely to prune a high number of objects that are completely processed or not yet seen at all. Combining the benefits of the TP and BS strategies results in quite good performance, but the combination is outperformed by the DP approach. This is because the dynamic-programming iterations are independent of the degree of uncertainty; they require quadratic runtime in any case.
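The gap between IT and DP can be illustrated with a minimal Python sketch. It rests on a simplifying assumption that is not taken from the source: each other object is closer to the query than the observation under consideration with an independent probability, given in the hypothetical list `closer_probs`. The first function enumerates all possible worlds, as the IT baseline does in spirit. The second applies a Poisson-binomial-style recurrence, as DP does in spirit. Both function names are illustrative only.

```python
from itertools import product


def rank_distribution_bruteforce(closer_probs):
    """Enumerate all 2^n possible worlds (IT-style baseline, exponential)."""
    n = len(closer_probs)
    dist = [0.0] * (n + 1)
    for world in product((0, 1), repeat=n):
        weight = 1.0
        for is_closer, p in zip(world, closer_probs):
            weight *= p if is_closer else 1.0 - p
        # sum(world) objects are closer, so the observation is at rank sum(world) + 1
        dist[sum(world)] += weight
    return dist


def rank_distribution_dp(closer_probs):
    """Recurrence over the objects: O(n^2) time, O(n) space."""
    dist = [1.0]  # with no other objects, zero objects are closer
    for p in closer_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)   # this object is not closer
            nxt[j + 1] += mass * p       # this object is closer
        dist = nxt
    return dist
```

Under the stated assumption, both functions return the same distribution. Entry `i` is the probability that exactly `i` objects are closer. Only the second function's cost is independent of how the probabilities are spread, which matches the observation that DP is insensitive to UD.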
Finally, it can be observed that the behavior of each approach with an increasing UD remains stable for different values of m. However, a higher number of observations per object increases the computational requirements of each approach by about an order of magnitude. Thus, these experiments confirm that the runtime required to compute the RPD depends heavily on m, which underlines the need for efficient solutions.
Scalability
The next experiment evaluates the scalability based on the ART datasets of different sizes. The BS approach is omitted in the following, as the combination TP+BS proved to be more effective. Here again, different combinations of strategies were considered. The results are depicted in Figure 11.6 for two different values of UD.
Figure 11.6(a) illustrates the results for a low UD value. Since it considers all possible worlds, the simple approach IT incurs exponential cost, so that experiments for a database size above 30 objects are not feasible. The application of TP yields a significant performance gain. Assuming a low UD value, the ranges of possible ranking positions of the objects hardly overlap. Furthermore, there are objects that do not have to be considered for all ranking positions, since the minimum and maximum ranking positions of all objects are known (cf. Subsection 11.4.1). It can clearly be observed that the combination TP+BS significantly outperforms the case where only TP is applied, as the split of the r-sets reduces the number of combinations of higher ranked objects that have to be considered when computing a rank probability for an observation. For small databases where N < 100, there is a splitting and merging overhead of the BS optimization, which, however, pays off for an increasing database size. For N < 700, TP+BS even beats the DP approach owing to the combination of two optimizations, whereas the dynamic-programming algorithm DP requires cubic runtime complexity anyway (cf. Subsection 11.4.3). However, for higher values of N, DP outperforms the other optimizations, as the presence of more objects also leads to a higher overlap among uncertain objects as well as to an increasing size of the r-sets.
Figure 11.6: Comparison of the scalability of all strategies on the ART datasets with different degrees of uncertainty. Panels: (a) UD = 0.5; (b) UD = 5.0. Axes: database size (x) versus query time in ms (y, logarithmic). Curves: TP+BS, TP, DP, IT.
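The effect of table pruning at a low UD can be sketched as follows. This is a simplified reading of the idea, not the TP strategy of Subsection 11.4.1 itself. It assumes, hypothetically, that for a given observation each other object is closer with an independent probability in `closer_probs`. Objects with probability 1 (completely processed) only shift the rank, and objects with probability 0 (not yet seen) contribute nothing, so only the remaining objects enter the recurrence.

```python
def rank_distribution_pruned(closer_probs):
    """Rank distribution that skips objects with closer-probability 0 or 1.

    Returns a dict mapping 1-based rank -> probability.
    """
    # Completely processed objects are certainly closer: they shift the rank.
    certain = sum(1 for p in closer_probs if p >= 1.0)
    # Only objects that are partially seen need to enter the recurrence.
    uncertain = [p for p in closer_probs if 0.0 < p < 1.0]

    dist = [1.0]
    for p in uncertain:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)
            nxt[j + 1] += mass * p
        dist = nxt

    return {certain + j + 1: mass for j, mass in enumerate(dist)}
```

For example, `rank_distribution_pruned([1.0, 1.0, 0.0, 0.5, 0.0])` runs the recurrence over a single object instead of five and returns `{3: 0.5, 4: 0.5}`. With a high UD, few probabilities are exactly 0 or 1, so the pruned set shrinks little. This is in line with TP degrading as the overlap grows.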
With a high value of UD (cf. Figure 11.6(b)), the behavior of IT does not change, as it has to consider all possible worlds anyway, regardless of the distribution of the observations of uncertain objects. Also, TP is not applicable even for very small databases because of an increased possible range of ranking positions and an increased overlap among the objects. Even TP+BS degenerates soon, although the BS optimization has a higher effect than the TP optimization for high degrees of uncertainty. Finally, as observed before, DP is not much affected by the value of UD and, thus, achieves an improvement of several orders of magnitude in comparison with the other approaches.
11.6 Summary
This chapter introduced a framework that efficiently computes the rank probability distribution (RPD) in order to solve probabilistic similarity ranking queries on spatially uncertain data. In particular, methods were introduced that break down the high computational complexity required to compute, for an object X, the probability that X appears on each ranking position according to the distance to a query object Q. This complexity, in the first approach still exponential in the number of retrieved observations, could be reduced to a polynomial runtime by extending a dynamic-programming technique called Poisson Binomial Recurrence. The following chapter will introduce an incremental approach of computing the RPD that enhances the dynamic-programming algorithm and finally achieves an overall runtime complexity which is linear in the number of accessed observations.
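For reference, the Poisson binomial recurrence is commonly stated in the following general form. The notation is an assumption made here and is not taken from Section 11.4: p_i denotes the probability that the i-th object is closer to Q than the observation under consideration, and P_{i,j} denotes the probability that exactly j of the first i objects are closer.

```latex
% Assumed notation: p_i and P_{i,j} as described in the lead-in above.
\begin{align*}
P_{0,0} &= 1, \qquad P_{i,j} = 0 \quad \text{for } j < 0 \text{ or } j > i, \\
P_{i,j} &= p_i \, P_{i-1,j-1} + (1 - p_i) \, P_{i-1,j}, \qquad 1 \le i \le N .
\end{align*}
```

In this form, P_{N,j} is the probability that the observation appears on ranking position j + 1. Each table entry depends on only two entries of the previous row, which is what makes a polynomial-time evaluation possible.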
Chapter 12
Incremental Probabilistic Similarity Ranking
12.1 Introduction
The step to compute the Rank Probability Distribution (RPD) that solves the bipartite graph between uncertain objects and ranking positions w.r.t. the distance to a (potentially uncertain) query object represents the main bottleneck of solving the problem of probabilistic ranking. Chapter 11 already adopted a dynamic-programming technique from [214] for the use in spatial data, which can perform this computation in quadratic time and linear space w.r.t. the number of observations required to be accessed until the solution is confirmed. These requirements can finally be regarded w.r.t. the database size, as it can basically be assumed that the total number of observations in the database is linear in the number of database objects. This assumption holds for this chapter. The solution that will be presented in this chapter will further extend the dynamic-programming-based algorithm and reduce the former quadratic time complexity to linear time complexity. Similarly to Chapter 11, an assumption that will be made here is that the observations can be accessed in increasing distance order to the query observation.
This chapter utilizes the definition of spatially uncertain objects according to Definition 9.2 of Chapter 9. However, the proposed method applies in general to x-relations [25] and can be used irrespective of whether uncertain objects or x-tuples are assumed. Thus, it can be used as a module in various semantics that rank the objects or observations according to their rank probabilities.
The main contributions of this chapter can be summarized as follows:
• This chapter will utilize the framework of Chapter 11, which is based on iterative distance browsing and which, thus, efficiently supports probabilistic similarity ranking on spatially uncertain data.
• This chapter will present a theoretically founded approach for computing the RPD, which corresponds to Module 2 of the framework presented in Chapter 11. It will be proved that the proposed method reduces the computational cost from O(k · N²), achieved by [214] and Chapter 11, to O(k · N), where N is the size of the database and k denotes the ranking depth; in this chapter, k < N will be assumed. The key idea is to use the ranking probabilities of the previously accessed observation to derive those of the currently accessed observation in O(k) time.
• Similarly to Chapter 11, the objective is to find an unambiguous ranking where each object or observation is uniquely assigned to one rank. Here, any user-defined ranking method (including those suggested in Chapter 11) can be plugged in, as the RPD is required in order to compute unique positions. This will be illustrated for several well-known probabilistic ranking queries that make use of such distributions. In particular, it will be demonstrated that, by using the proposed framework, such queries can be processed in O(N · log(N) + k · N) time, as opposed to existing approaches that require O(k · N²) time.
• Finally, an experimental evaluation will be conducted, using real-world and synthetic data, which demonstrates the applicability of the framework and verifies the theoretical findings.
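The key idea named in the contributions, deriving the ranking probabilities of the current observation from the state left by the previous one in O(k) time, can be sketched as follows. This is one possible realization under stated assumptions and not necessarily the algorithm of Section 12.2. It assumes that objects are mutually independent, that each accessed observation carries a probability `weight`, and that observations arrive in ascending distance order. The names `IncrementalRanker`, `add_factor`, and `remove_factor` are hypothetical.

```python
def add_factor(dist, p):
    """Fold an object with closer-probability p into the truncated distribution, O(k)."""
    out = [0.0] * len(dist)
    for j, mass in enumerate(dist):
        out[j] += mass * (1.0 - p)
        if j + 1 < len(dist):
            out[j + 1] += mass * p
    return out


def remove_factor(dist, p):
    """Undo add_factor for one object, O(k); requires p < 1."""
    out = [0.0] * len(dist)
    prev = 0.0
    for j, mass in enumerate(dist):
        prev = (mass - prev * p) / (1.0 - p)
        out[j] = prev
    return out


class IncrementalRanker:
    def __init__(self, k):
        self.k = k                           # ranking depth, assumed >= 1
        self.dist = [1.0] + [0.0] * (k - 1)  # P(j partially seen objects are closer)
        self.seen = {}                       # object -> accumulated probability
        self.complete = 0                    # objects whose observations were all seen

    def access(self, obj, weight):
        """Return [P(this observation is at rank r) for r = 1..k] in O(k)."""
        p_old = self.seen.get(obj, 0.0)
        # Distribution over the *other* partially seen objects only.
        others = remove_factor(self.dist, p_old) if p_old > 0.0 else self.dist
        ranks = [0.0] * self.k
        for j, mass in enumerate(others):
            if self.complete + j < self.k:   # completed objects shift the rank
                ranks[self.complete + j] = weight * mass
        p_new = p_old + weight
        self.seen[obj] = p_new
        if p_new >= 1.0 - 1e-12:             # object completely processed
            self.complete += 1
            self.dist = others
        else:
            self.dist = add_factor(others, p_new)
        return ranks
```

With two objects A and B, each with two observations of weight 0.5, accessed in the order A, B, A, B and with k = 2, the four calls return `[0.5, 0.0]`, `[0.25, 0.25]`, `[0.25, 0.25]`, and `[0.0, 0.5]`. Summed per object, A is ranked first with probability 0.75, which agrees with enumerating the four possible worlds. One caveat of this design is that the division in `remove_factor` may amplify rounding errors when the accumulated probability is close to 1.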
The rest of this chapter is organized as follows: Section 12.2 will introduce an efficient approach to compute the RPD. The complete algorithm exploiting the framework will be presented in Section 12.3. Section 12.4 will apply the approach to different probabilistic ranking query types, including U-kRanks [192, 214], PT-k [108] and Global top-k [219] (cf. Chapter 10). The efficiency of the proposed approach will be experimentally evaluated in Section 12.5. Section 12.6 will conclude this chapter. The notations used in Chapter 11 will also be used throughout this chapter.
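To indicate how the three query types named above can consume an RPD, the following Python sketch uses their commonly cited definitions; the exact semantics used in Chapter 10 and Section 12.4 may differ in detail. The RPD is assumed to be given as a hypothetical dictionary `rpd` that maps each object to a list whose entry `i` is the probability of that object appearing at rank `i + 1`. The function names are illustrative.

```python
def u_k_ranks(rpd, k):
    """U-kRanks: for each rank 1..k, the object most likely to appear there."""
    return [max(rpd, key=lambda obj: rpd[obj][i]) for i in range(k)]


def pt_k(rpd, k, tau):
    """PT-k: all objects whose probability of being in the top k is at least tau."""
    return sorted(obj for obj, probs in rpd.items() if sum(probs[:k]) >= tau)


def global_top_k(rpd, k):
    """Global top-k: the k objects with the highest top-k probability."""
    ordered = sorted(rpd, key=lambda obj: sum(rpd[obj][:k]), reverse=True)
    return ordered[:k]
```

For `rpd = {"A": [0.75, 0.25], "B": [0.25, 0.75]}`, `u_k_ranks(rpd, 2)` returns `["A", "B"]`, while `pt_k(rpd, 1, 0.5)` and `global_top_k(rpd, 1)` both return `["A"]`. All three functions only read the first k entries per object, so a ranking depth of k is sufficient once the RPD is available.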