
Continuous Outlier Detection Over Data Streams

In document Outlier Detection In Big Data (Page 38-44)

1.4 Proposed Solutions

1.4.1 Continuous Outlier Detection Over Data Streams

Single Outlier Detection Request. We design an efficient continuous processing strategy called LEAP to mine outliers from extremely fast streams [19]. To satisfy the stringent response time requirement of online monitoring applications, the design of this strategy exploits the fundamental optimization opportunities enabled by the general properties of unsupervised outlier definitions in streaming data.

First, this strategy takes advantage of the rarity property of outliers. Given a dataset D, the majority of points in D are guaranteed to be inliers. Furthermore, given a data point p in D, examining a small subset of the points in D will potentially be sufficient to prove that p is an inlier. Therefore an efficient outlier detection algorithm should be able to quickly eliminate inliers by collecting the least amount of evidence necessary to prove the inlier status of the data points (inlier evidence).

Second, this strategy fully utilizes the temporal relationships among stream data points. The data points that arrived later in the window are guaranteed to have a more decisive impact on the outlier detection process than earlier ones. This is because the younger a data point p is, the longer its contribution to determining the outlier status of other points will persist into the future. Since the key task of the outlier detection process is to eliminate any guaranteed inliers, identifying enough longer-lasting inlier evidence is likely to eliminate the need to examine shorter-lasting evidence at all.
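The two observations above can be illustrated with a minimal sketch. It assumes the distance-threshold outlier definition (a point is an inlier if at least k other points lie within distance r) and a window small enough to sort in memory. The function name `probe_newest_first` and its return values are hypothetical and chosen for illustration only; this is not the LEAP implementation itself.

```python
from math import dist

def probe_newest_first(p, window, k, r):
    """Probe the window from newest to oldest and stop at k neighbors.

    window: list of (timestamp, point) pairs for the other points
            currently in the window (hypothetical layout).
    Returns (is_inlier, evidence_timestamps, probes).
    """
    evidence = []   # timestamps of the neighbors found so far
    probes = 0      # number of distance computations performed
    for ts, q in sorted(window, key=lambda e: e[0], reverse=True):
        probes += 1
        if dist(p, q) <= r:
            evidence.append(ts)
            if len(evidence) == k:
                # Enough evidence: p stays an inlier until the oldest
                # of these k neighbors expires from the window.
                return True, evidence, probes
    return False, evidence, probes

window = [(1, (0.0, 0.0)), (2, (0.1, 0.0)), (3, (5.0, 5.0)),
          (4, (0.0, 0.1)), (5, (0.2, 0.2))]
print(probe_newest_first((0.05, 0.05), window, k=2, r=1.0))
```

With k = 2 the loop stops after two probes, and the collected evidence consists of the two youngest neighbors, so the inlier status does not need to be re-examined until the older of the two leaves the window.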


1. We present the first result on efficiently supporting the major distance-based outlier classes. In particular, neither the $O^{kmax}_{(k,n)}$ [9] nor the $O^{kavg}_{(k,n)}$ [47] outliers had been handled in the streaming outlier detection literature to date.

2. We propose the minimal probing optimization principle, which frees detection algorithms from the burden, experienced by the state-of-the-art methodologies, of having to conduct range query searches [26, 27, 48].

3. We introduce the lifespan-aware prioritization principle, which guides the outlier detection algorithms to probe neighbors of stream data points in a time-aware manner so as to minimize the frequency of probing operations.

4. We integrate these two principles into a general framework called LEAP, which is proven to be optimal in terms of the CPU costs for determining the outlier status of each point.

5. Our experimental studies based on real and synthetic data show that our proposed algorithms achieve three orders of magnitude performance gain compared to the state-of-the-art techniques in a rich variety of scenarios.

Multiple Outlier Detection Requests. We propose the SOP strategy to efficiently handle an outlier analytics workload composed of a large number of outlier detection requests with arbitrary parameter settings. This includes optimizations that guarantee the full sharing of both CPU computation and memory utilization when processing the outlier analytics workload. In particular, computation-wise, each active window requires only a single pass through the batch of data points to answer all requests. Memory-wise, SOP assures that only one single copy of the neighbor information, shared across all requests, is maintained. The key observation here is that certain relationships exist among the outliers generated by different mining requests. For example, some mining requests might generate the same set of outliers although they are configured with different parameter settings. Some parameter settings are “more restricted” than others in terms of recognizing outliers, and are therefore guaranteed to generate more outliers than the others.

The contributions in this area include:

1. Our SOP framework is the first to tackle the problem of shared execution of multiple outlier requests with arbitrary pattern-specific and window-specific parameters in the stream context.

2. The key innovation of SOP is to transform the multi-query outlier problem into a single-query skyband problem. The output of the skyband query is proven to be minimal yet sufficient for determining the outlier status of each point under any parameter setting in the workload.

3. Our customized skyband algorithm, K-SKY, is tuned to process outlier requests with diverse parameter settings. K-SKY is proven to be optimal in the number of points being evaluated.

4. Leveraging the commonality and dominance among the data populations, we are able to utilize one specific skyband query to support multiple queries with varying window-specific parameters. In this way, full sharing is achieved across the query windows.

5. Our extensive experiments demonstrate that SOP routinely achieves a speedup of three orders of magnitude or more over the state-of-the-art methods [19, 27].
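The idea of answering many requests from one skyband can be sketched as follows. The sketch assumes distance-threshold requests of the form (k, r, window start) and represents each neighbor of a point as a (distance, timestamp) pair. A neighbor is treated as dominated by any other neighbor that is at least as close and strictly newer. The names `k_skyband` and `is_outlier` are hypothetical, and the quadratic construction is for illustration only; it is not the K-SKY algorithm.

```python
def k_skyband(neighbors, k_max):
    """Keep the neighbors dominated by fewer than k_max other neighbors.

    neighbors: list of (distance, timestamp) pairs with distinct timestamps.
    """
    sky = []
    for d, ts in neighbors:
        dominators = sum(1 for d2, ts2 in neighbors if d2 <= d and ts2 > ts)
        if dominators < k_max:
            sky.append((d, ts))
    return sky

def is_outlier(sky, k, r, since=0):
    """Answer one request (k <= k_max) from the shared skyband alone."""
    close = sum(1 for d, ts in sky if d <= r and ts >= since)
    return close < k

neighbors = [(0.5, 1), (0.2, 2), (0.9, 3), (0.1, 4), (0.4, 5), (0.3, 6)]
sky = k_skyband(neighbors, k_max=2)
print(sky)                                   # the shared neighbor summary
print(is_outlier(sky, k=2, r=0.15))          # one request
print(is_outlier(sky, k=1, r=0.35, since=5)) # another request, shorter window
```

The skyband is sufficient because, for any radius and window start, the k newest qualifying neighbors can only be dominated by other qualifying neighbors that are newer, so each of them has fewer than k dominators and survives in the skyband.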

1.4.2 Distributed Outlier Detection

We design scalable distributed approaches to detect both distance-based and density-based outliers in high-volume static data. In general, our approaches feature novel partitioning strategies to be deployed on mappers and outlier detection strategies to be deployed on reducers. Together these strategies minimize the overall execution cost, which in a distributed system comprises both communication and processing costs.

First, at the mapper side, our supporting area partitioning strategy makes each partition self-sufficient while introducing only a minimal amount of data replication. This minimizes the cost of having to redistribute data repeatedly between mappers and reducers. Furthermore, an imbalanced workload may not only slow down processing significantly, but also risk job failure in some cases. Therefore the partitioning strategy has to ensure that each reducer is assigned a balanced workload. The key observation here is that assigning an equal number of data points to each reducer is not sufficient to achieve load balancing. Instead, the partitioning strategy should also take into consideration the data distribution of the mined dataset and the costs estimated by the cost model for the detection algorithm to be applied.
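A self-sufficient partition can be sketched in one dimension as follows. The sketch assumes distance-threshold outliers with radius r, so that every point within r of a cell boundary is replicated into the neighboring cell as a read-only support point. The function names and the interval layout are hypothetical simplifications; the actual strategy operates on multi-dimensional data.

```python
def partition_with_support(points, boundaries, r):
    """Split 1-D points into cells [lo, hi) plus their supporting areas."""
    parts = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        core = [x for x in points if lo <= x < hi]
        # Support points lie outside the cell but within r of its border.
        support = [x for x in points
                   if (lo - r <= x < lo) or (hi <= x < hi + r)]
        parts.append((core, support))
    return parts

def local_outliers(core, support, k, r):
    """Classify core points only; support points just supply neighbors."""
    everyone = core + support
    result = []
    for x in core:
        # Subtract 1 so that x does not count as its own neighbor.
        neighbors = sum(1 for y in everyone if abs(x - y) <= r) - 1
        if neighbors < k:
            result.append(x)
    return result

points = [0.0, 0.1, 0.9, 1.0, 1.1, 5.0]
parts = partition_with_support(points, boundaries=[0, 1, 2, 6], r=0.25)
found = [x for core, support in parts
         for x in local_outliers(core, support, k=1, r=0.25)]
print(found)
```

Without its supporting area, the first cell would wrongly report 0.9 as an outlier, because its only neighbor within r (the point 1.0) lives in the next cell. With the supporting area, each cell can be processed independently and the union of the local results matches the global result.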

At the reducer side, traditional MapReduce-based mining algorithms assume that one single mining algorithm is applied on all reducers. However, the data partitions on different nodes might have different characteristics, so the best algorithm selected based on the overall characteristics of the whole dataset might not serve any of the data partitions well. We therefore propose a novel multi-tactic paradigm. This model makes use of the fact that the reducers in a computer cluster execute independently of each other. Hence they may run different outlier detection algorithms as needed, namely different algorithms for different reducers.
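The multi-tactic idea can be sketched as a per-partition dispatch between two interchangeable detectors. Everything in this sketch is an assumption made for illustration: the data is one-dimensional with integer coordinates, the two detectors (`nested_loop` and `sorted_scan`) stand in for whatever algorithms a real deployment would offer, and the density threshold rule in `choose_tactic` is a hypothetical placeholder rather than the selection criterion developed in this work.

```python
from bisect import bisect_left, bisect_right

def nested_loop(points, k, r):
    """Quadratic scan that stops probing once k neighbors are found."""
    result = []
    for i, x in enumerate(points):
        found = 0
        for j, y in enumerate(points):
            if i != j and abs(x - y) <= r:
                found += 1
                if found == k:
                    break
        if found < k:
            result.append(x)
    return result

def sorted_scan(points, k, r):
    """Sort once, then count neighbors in [x - r, x + r] by binary search."""
    ordered = sorted(points)
    result = []
    for x in points:
        count = bisect_right(ordered, x + r) - bisect_left(ordered, x - r) - 1
        if count < k:
            result.append(x)
    return result

def choose_tactic(points, density_threshold):
    """Hypothetical rule: dense partitions favor early termination."""
    if len(points) < 2:
        return nested_loop
    extent = (max(points) - min(points)) or 1
    density = len(points) / extent
    return nested_loop if density >= density_threshold else sorted_scan

for partition in ([1, 2, 3, 4, 5, 6], [0, 100, 200, 201]):
    tactic = choose_tactic(partition, density_threshold=0.5)
    print(tactic.__name__, tactic(partition, k=1, r=2))
```

Both detectors return the same outliers for a given partition, so each reducer is free to pick whichever is expected to be cheaper on its local data without affecting the final result.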

The key contributions in this area include:

1. We propose the first distributed approach called DLOF that effectively solves the problem of detecting density-based outliers from large volume data.

2. [...] technique, detects all distance-based outliers in a single MapReduce job. DOD is proven to involve minimal communication overhead.

3. For the first time, we theoretically analyze and contrast the costs of distinct classes of outlier detection algorithms under various data distributions. Based on this theoretical foundation we prove that the traditional frequency-based load balancing assumption does not hold in the outlier detection context, and propose a novel cost-driven strategy that effectively generates partitions of balanced workloads.

4. We propose a multi-tactic strategy that automatically selects the best outlier detection algorithm for a given data partition. Our proposed density-aware technique successfully separates the two interdependent problems of partition generation and algorithm selection.

5. We experimentally evaluate our DLOF and DOD techniques using terabytes of data. The results demonstrate that our techniques outperform the baseline solutions by a factor of 15.
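The difference between frequency-based and cost-driven balancing in contribution 3 can be illustrated with a small sketch. It assumes that an estimated detection cost is already available for each partition; the partition names, point counts, and cost figures below are invented for illustration, and the greedy heaviest-first assignment is a standard heuristic swapped in here, not the partitioning strategy proposed in this work.

```python
import heapq

def assign(partitions, n_reducers, weight):
    """Greedy assignment: heaviest partition first, to the lightest reducer."""
    heap = [(0.0, rid, []) for rid in range(n_reducers)]
    heapq.heapify(heap)
    for part in sorted(partitions, key=weight, reverse=True):
        load, rid, names = heapq.heappop(heap)
        names.append(part["name"])
        heapq.heappush(heap, (load + weight(part), rid, names))
    return {rid: names for _, rid, names in heap}

def max_cost(assignment, partitions):
    """Estimated cost of the most heavily loaded reducer."""
    cost = {p["name"]: p["cost"] for p in partitions}
    return max(sum(cost[n] for n in names) for names in assignment.values())

# Hypothetical partitions: similar sizes, very different estimated costs.
partitions = [
    {"name": "A", "points": 120, "cost": 400},
    {"name": "B", "points": 110, "cost": 50},
    {"name": "C", "points": 100, "cost": 50},
    {"name": "D", "points": 90, "cost": 400},
]

by_count = assign(partitions, 2, weight=lambda p: p["points"])
by_cost = assign(partitions, 2, weight=lambda p: p["cost"])
print(by_count, max_cost(by_count, partitions))
print(by_cost, max_cost(by_cost, partitions))
```

Balancing by point count gives both reducers 210 points but places the two expensive partitions on the same reducer (estimated cost 800), whereas balancing by estimated cost caps the busiest reducer at 450.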

1.4.3 Interactive Outlier Exploration

In this area we propose to design an interactive outlier exploration paradigm called ONION to meet the requirements of online outlier analytics applications. First, ONION is able to answer any outlier detection request with real-time responsiveness. Second, it assists analysts in quickly and systematically pinpointing a good parameter setting for the datasets to be mined. Furthermore, it facilitates the understanding and interpretation of the mined outliers.

In general, ONION is composed of two phases, namely an offline phase and an online phase. In the offline phase we extract and abstract the key components of outlier analytics, namely the input data, the input parameter settings, and the possible outliers generated from the data, into a comprehensive yet compact knowledge base by utilizing the power of computer clusters. The key observation here is that most of the data points are guaranteed to be inliers no matter how the parameter setting changes. The offline phase only needs to be conducted once.

In the online phase we provide the users with a rich set of analytics tools. ONION not only supports the traditional outlier mining operation, that is, given a particular input parameter setting, returning the generated outliers. It also supports other novel analytics operations, such as returning the parameter settings that generate a given set of outliers, or, given a set of sample outliers, returning other outliers similar to these samples. Furthermore, these analytics operations no longer need to be conducted on the raw big dataset and the large parameter space. Instead, they are supported by directly consulting the aggregated knowledge base built in the offline phase, and can therefore be answered in real time. In combination, these analytics operations provide a powerful yet flexible tool for users to explore the data, interpret the generated outliers, and obtain recommendations for appropriate parameter settings.
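The split between an expensive offline phase and a cheap online phase can be sketched as follows. The sketch assumes distance-threshold outliers, for which a point is an outlier under (k, r) exactly when the distance to its k-th nearest neighbor exceeds r. It works on one-dimensional data, builds the index by brute force, and uses hypothetical names (`build_knowledge_base`, `outliers`); it is a toy stand-in for the idea, not the ONION knowledge base.

```python
from bisect import bisect_right

def build_knowledge_base(points, k_max):
    """Offline: for each k, sort the points by their k-th neighbor distance."""
    knn = []
    for i, x in enumerate(points):
        dists = sorted(abs(x - y) for j, y in enumerate(points) if j != i)
        knn.append(dists[:k_max])
    kb = {}
    for k in range(1, k_max + 1):
        ranked = sorted((knn[i][k - 1], i) for i in range(len(points)))
        kb[k] = ([d for d, _ in ranked], [i for _, i in ranked])
    return kb

def outliers(kb, k, r):
    """Online: binary search for the first point whose k-th distance > r."""
    keys, ids = kb[k]
    return sorted(ids[bisect_right(keys, r):])

points = [0, 1, 2, 10, 11, 30]
kb = build_knowledge_base(points, k_max=2)
print(outliers(kb, k=1, r=5))   # indices of outlying points
print(outliers(kb, k=2, r=5))
```

Once the index is built, each request costs one binary search plus the size of the answer, and the raw data is never scanned again. Trying a different k or r is therefore cheap enough to support interactive parameter exploration.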

In particular the key contributions in this area include:

1. We propose the first interactive outlier analytics platform that enables analysts to pinpoint appropriate parameter settings and explore outliers in a systematic way.

2. We establish for the analysts an “outlier-centric panorama” into big datasets by integrating the input data and parameter space into a comprehensive multi-space ONION knowledge base.

3. We design logarithmic-complexity algorithms that process each outlier exploration operation with real-time responsiveness by leveraging the compact ONION knowledge base.
