this neighbor search continues completing the comparisons with all other untouched data points. Therefore a huge amount of CPU resources can be wasted.
Arbitrary Win and Slide Case. In this experiment, we investigate the performance of SOP when varying both window-specific parameters. We fix k as 30 and r as 200. The window and slide sizes are arbitrarily selected for each query from the range of 1K to 500K and from 50 to 50K respectively.
As illustrated in Fig. 11.6, the average CPU time consumed by SOP increases only from 28ms to 282ms (10 folds) as the number of queries increases from 10 to 1000 (100 folds). This continues to outperform the alternative algorithms by at least two orders of magnitudes. Clearly, results shown in Fig. 11.6 demonstrate the effectiveness of the swift query strategy for handling arbitrary win and arbitrary slide case.
0.010.1 1 5 x 104 101 102 103 104 105 106 Number of Queries CPU Time (ms) on Log Scale SOP
MCOD LEAP
(a) CPU (log scale)
0.010.1 1 5 x 104 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Queries Memory Footprint (MB) SOP MCOD LEAP
(b) Memory (log scale)
Figure 11.7:SOP: Varying K, R, W and S values on synthetic dataset
11.4
Varying Pattern and Window Parameters
In this most general case, we prepare four workloads composed of 100, 1000, 10,000, 50,000 queries respectively by varying all window-specific and pattern-specific parame-
11.4 VARYING PATTERN AND WINDOW PARAMETERS
ters.
We observe from Fig. 11.7(a) that similar to the cases of independently varying pattern-specific parameters and window-specific parameters SOP achieves tremendous gain in CPU utilization compared to (augmented) MCOD and (the non-shared) LEAP. Furthermore SOP shows excellent scalability in the cardinality of the workload. As the number of queries rises from 1000 to the extremely large cardinality of 50,000 queries, the CPU costs of SOP only increase from 32ms to 892ms. As the number of the queries increases, the sharing opportunities among the given set of queries also increase. Since SOP achieves full sharing across queries, it effectively reduces the CPU burden caused by the huge workload.
The memory usage of SOP also consistently outperforms the alternatives solutions as shown in Fig. 11.7(b). As previously stated, the reason is that LEAP detects outliers on the same streaming data for each query independently. Hence the memory consumed by the workload queries accumulates as the number of queries grows. On the other hand, MCOD always detects outliers by discovering and maintaining all neighbors for each point. However, SOP only requires minimal information to prove the outlier status of each point. As a result, SOP effectively avoids the usage of unnecessary space by purging redundant intermediate results. Therefore significant memory utilization is reduced by SOP.
12
Related Work
Distance-based Outliers on Streaming Data. With the emergence of digital devices generating data streams, outliers on streaming data are one type of anomalies recently studied [19, 26, 27]. However as described in Chapter Chapter 7 existing work [19, 26, 27] focuses primarily on processing a single outlier detection request.
kNN Queries on Streams. ContinuouskNN queries over sliding windows have in- deed been studied in [54, 59]. Both works use a grid to index the stream data. To improve response time, they either postpone the processing of the new points which are not likely to be inkNN set [59] or eagerly pre-compute the possiblekNN set for each future window as new data arrives [54].
However to determine the outlier status ofpit is not necessary to always discover the full kNN of p. Rather one algorithm should stop evaluating p as long as any k neigh- bors of p have been discovered. Therefore the use of the streaming kNN algorithm to continuously detect outliers is not an efficient approach. Instead our customized skyband algorithm K-SKY always discovers the minimal information necessary to determine the outlier status ofp.
ied as a general optimization problem in streaming environments. Previous research on sharing computations studied traditional SQL queries such as selection, join, and ag- gregation [29, 31, 32, 33]. Their methods include rewriting queries to expose common subexpressions, sharing indices, or segmenting input into partitions and sharing partial results over the partitions. However the key problem we address in this work, namely correctly answering multiple outlier detection queries by only collecting minimal infor- mation, is different from the more general purpose optimization effort required by the traditional SQL query sharing.
Part III
Distributed Outlier Detection:
Distance-Based Outlier
13
Distributed Outlier Detection
Framework
In this section, we introduce the basicDODdistributed framework that correctly discovers distance-based outliers using a single MapReduce job. This framework represents the solid foundation upon which key optimization strategies are then introduced. Below we will prove that under a uniform distribution of the data values, this proposed partitioning scheme is guaranteed to generate minimal communication overhead between the map and reduce phases in the MapReduce job.
13.1
The DOD Framework
DODinvolves three key steps, namelygrid cell partitioning,enhancement with support- ing area, andparallelized outlier detection(refer to Figure 13.1).
Step 1: Grid Cell Partitioning. As a first step, DOD partitions the entire domain space of a datasetDDomain(D)intondisjoint grid cellsCisuch thatC1 ∪C2 ∪...∪Cn
13.1 THE DOD FRAMEWORK Map 1 Map 2 Map N Object p1 Object p4 (K= C5, V =“0-p1”) (K= C1, V =“1-p1”) (K= C2, V =“0-p4”) (K= C3, V =“1-p4”) (K= C6, V =“1-p4”) (K= C7, V =“1-p4”) Output of p1 Output of p4 C1 C12 C4 C3 C2 C7 C5 Supporting area of C1 Supporting area of C7 Object p4 Object p1 S h u ff li n g & S o rt in g Reduce 1 Reduce M
Each Reducer receives as a group one cell_id (key), and the cell’s core objects (tag = 0) plus the cell’s supporting objects (tag = 1)
HDFS Data Blocks
Partitioning plan is given as input to Mappers
(b) MapReduce Implementation of DOD (a) DOD Framework: (1) The space is divided into grid
cells across all dimensions. (2) To detect the outliers in each grid cell independently from others, each grid cell is assigned to it both core points plus its support points. (3) Finally, the outliers in each grid cell can be detected in a fully-distributed fashion.
Object p2
Object p3
Figure 13.1:DOD: Distributed Outlier Detection Using the DOD Framework.
Definition 13.1 Grid Cell. A grid cell Ci in a d-dimensional domain space is a hyper
rectangleCi =h(low1i,high1i), (low2i,high2i), ...,(lowdi,highdi)i, where(lowxi,highxi)
are the boundaries ofCiin thexthdimension, where1 ≤x ≤d.
The points inside cell Ci, referred to as the Ci’s core points, are denoted as Ci.core ={pj |pj ∈Ci}. The areas of the domain space covered by each grid cell may
or may not be of equal size. In general, any partitioning strategy could be utilized to produce such grid cells. Figure 13.1(a) depicts a two-dimensional space partitioned into grid cells using anequi-width partitioningmethod or in shortuniSpace. That is, it divides each dimension of the domain space into equal width segments. Then points are grouped based on their membership in a particular grid cell. Fig. 13.1(a) shows such a grouping by representing the points in each grid cell with the same shape.
Step 2: Enhancement with Supporting Areas.The data points inside each grid cell are not sufficient to detect the outliers in each cell independent from the other cells. For example, data pointp2in grid cellC1 appears to be an outlier when considering only grid
cellC1. However,p2 may have neighbor points in grid cellC5, e.g.,p1, which may make