Local Outlier Detection in High-Dimensional Data

Probably the first (although not local) approach to outlier detection in high-dimensional data was the subspace-oriented method by Aggarwal and Yu [AY01]. It is a grid-based approach that uses an evolutionary search strategy to find unusually sparse subspaces. The data is first divided into $\phi$ equi-depth ranges in each dimension, so that each range contains a fraction of approximately $f = 1/\phi$ of the total objects. When the one-dimensional grids are intersected in $k$ dimensions, the expected number of objects in such a hypercube is $N \cdot f^k$ (for $N$ objects in total), with a standard deviation of $\sqrt{N \cdot f^k \cdot (1 - f^k)}$. Any object in a hypercube that contains significantly fewer objects than expected is considered an outlier by this algorithm.
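
The statistic described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [AY01]: the function names `equi_depth_cells` and `sparsity_score`, the Gaussian toy data and the chosen cube are all assumptions made for this example, and the evolutionary search itself is left out.

```python
import math
import random

def equi_depth_cells(column, phi):
    """Assign each value of one dimension to one of phi equi-depth ranges."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    cells = [0] * n
    for rank, i in enumerate(order):
        cells[i] = rank * phi // n  # range index 0 .. phi-1, about n/phi objects each
    return cells

def sparsity_score(grid, dims, ranges, phi):
    """Standardized deviation of a cube's object count from N * f^k.

    grid[j][i] is the range index of object i in dimension j (hypothetical layout).
    """
    n = len(grid[0])
    fk = (1.0 / phi) ** len(dims)
    count = sum(
        all(grid[j][i] == r for j, r in zip(dims, ranges)) for i in range(n)
    )
    expected = n * fk
    std = math.sqrt(n * fk * (1.0 - fk))
    return (count - expected) / std, count

random.seed(0)
N, D, PHI = 1000, 5, 5
# Toy data stored column-wise: data[j] holds dimension j of all N objects.
data = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(D)]
grid = [equi_depth_cells(col, PHI) for col in data]

# Cube: lowest range in dimension 0 combined with highest range in dimension 1.
score, count = sparsity_score(grid, dims=[0, 1], ranges=[0, 4], phi=PHI)
print(f"count={count}, expected={N / PHI**2:.0f}, score={score:.2f}")
```

A strongly negative score marks a cube that holds far fewer objects than expected; objects inside such a cube would be the outlier candidates.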

This method, however, has various problems, which will be discussed in more detail in Chapter 4. First of all, the expected values themselves quickly become tiny for high-dimensional data, in particular when $\phi$ or $k$ are not very small, making detection with this statistic impossible. Assuming just 10 ranges per dimension and 6-dimensional subspaces, the expected number of objects per grid cell is one millionth of the data set, i.e. we need a data set containing several million objects to make this approach feasible. Furthermore, scores for different dimensionalities $k$ are not comparable, so this approach only works for finding outliers in a single subspace dimensionality $k$. The proposed evolutionary search therefore focuses on preserving $k$, but since it does not have a very good control function (in contrast to clusters, outliers will often not surface gradually), the search for appropriate subspaces is not very effective but more or less just randomized. Probably the worst problem of this method is that, due to the equi-depth partitioning strategy (which is required for estimating the number of objects in each cell), the method is very likely to put outliers into the same grid cell as a nearby cluster. Last but not least, the presence of clusters of different sizes is not really taken into account in the statistical test, which in fact assumes the data to be uniformly distributed except for the outlier grid cells.
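
The arithmetic behind the first objection can be checked directly. The sketch below assumes the expected count $N \cdot f^k$ and standard deviation $\sqrt{N \cdot f^k \cdot (1 - f^k)}$ given above; the helper name `expected_and_floor` is made up for this example. It also reports the score of a completely empty cell, which is the most extreme value the statistic can take.

```python
import math

def expected_and_floor(n, phi, k):
    """Expected cell count N * f^k and the standardized score of an empty cell."""
    fk = (1.0 / phi) ** k
    expected = n * fk
    std = math.sqrt(n * fk * (1.0 - fk))
    return expected, (0 - expected) / std

for n in (10_000, 1_000_000, 100_000_000):
    expected, floor = expected_and_floor(n, phi=10, k=6)
    print(f"N={n:>11,}  expected={expected:8.2f}  empty-cell score={floor:7.2f}")
```

With one million objects the expected count is 1, so even an empty cell lies only about one standard deviation below the expectation, which is not enough to call it significantly sparse.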

An interesting approach for high-dimensional data is angle-based outlier detection (ABOD), introduced in [KSZ08], because it does not pay explicit attention to the dimensionality. In contrast to many other methods, it does not rely on a density estimation, but looks at the variance of the angular spectrum instead. For performance reasons, it can optionally use only the kNN as a comparison set; this variant, FastABOD, is a local method. Recent improvements on ABOD include an $O(n \log n)$ approximation [PP12].
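
The idea of scoring by the variance of angles can be illustrated as follows. This is a simplified sketch, not the factor defined in [KSZ08]: it takes the plain (unweighted) variance of the distance-scaled term $\langle AB, AC \rangle / (\|AB\|^2 \cdot \|AC\|^2)$ over all pairs, whereas the published factor additionally weights each pair. The function name `angle_variance` and the toy data are assumptions.

```python
import itertools
import random
import statistics

def angle_variance(a, others):
    """Variance of distance-scaled angles seen from `a` over all pairs in `others`.

    Simplification: unweighted variance of <AB, AC> / (|AB|^2 * |AC|^2).
    Assumes no point in `others` coincides with `a`.
    """
    diffs = [[x - y for x, y in zip(b, a)] for b in others]
    sq = [sum(v * v for v in d) for d in diffs]
    terms = []
    for i, j in itertools.combinations(range(len(diffs)), 2):
        dot = sum(u * v for u, v in zip(diffs[i], diffs[j]))
        terms.append(dot / (sq[i] * sq[j]))
    return statistics.pvariance(terms)

random.seed(1)
cluster = [[random.gauss(0.0, 1.0) for _ in range(20)] for _ in range(60)]
outlier = [8.0] * 20   # far away from the cluster in every dimension
inlier = cluster[0]

print("inlier :", angle_variance(inlier, cluster[1:]))
print("outlier:", angle_variance(outlier, cluster))
```

A point far outside the cluster sees all other points in roughly the same direction, so its variance is small, and a low value indicates an outlier. Passing only the k nearest neighbors as `others` would correspond to the local, FastABOD-style use of a restricted comparison set.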

HOS-Miner [Zha+04] tries to approach the subspace search from the other direction: for each object, it tries to find the subspace in which it appears to be most unusual. In order to perform an Apriori-like search, the authors exploit the fact that kNN weight (Equation 3.4) outlier scores increase monotonically with increasing dimensionality. However, they neglect the fact that this outlier score is not comparable across different dimensionalities because of the very same monotonicity (as also noted in [NGA11]): in particular, the monotonicity implies that the maximum score will always be found in the full-dimensional space.
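
The monotonicity argument can be observed on toy data. The sketch assumes that the kNN weight of Equation 3.4 is the sum of the Euclidean distances to the k nearest neighbors (the equation is not reproduced in this section); the function name `knn_weight`, the uniform random data and the choice of nested subspaces are illustrative assumptions.

```python
import math
import random

def knn_weight(data, i, dims, k):
    """Sum of distances from object i to its k nearest neighbors within `dims`."""
    dists = sorted(
        math.sqrt(sum((data[i][d] - data[j][d]) ** 2 for d in dims))
        for j in range(len(data)) if j != i
    )
    return sum(dists[:k])

random.seed(2)
data = [[random.random() for _ in range(6)] for _ in range(200)]

# Score of object 0 in the nested subspaces {0}, {0,1}, ..., {0,...,5}.
scores = [knn_weight(data, 0, range(m), k=5) for m in range(1, 7)]
print([round(s, 3) for s in scores])
```

Adding a dimension can only increase each pairwise Euclidean distance, so the score never decreases along such a chain, and the largest value is reached in the full space regardless of where the object is actually unusual.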

OutRank [Ass+07b; Mül+08; Mül+12] elegantly avoids the data snooping bias (see Chapter 4), since the method searches for subspace clusters instead of looking for outliers. These should be much easier to find in high-dimensional data, as they are not rare objects and can be expected to be recognizable even if the subspace is not yet optimal (of course, subspace clustering still poses a number of nontrivial challenges). Such clusters can be found, for example, using DUSC (dimensionality unbiased subspace clustering, [Ass+07a]) or EDSC (efficient density-based subspace clustering, [Ass+08]). The outlierness of an object is then estimated from the number of times the object is contained in such subspace clusters, as well as from the dimensionality and size of these clusters. This method, however, relies on the subspace clustering algorithms to produce a highly redundant clustering (subspace clusters are commonly allowed to overlap). Furthermore, outliers can only be found when the data set clusters well: on a data set where no clusters are found, all objects will be considered outliers. But even a good density-based clustering result will often already contain a large number of unclustered objects, so that the method mostly discovers outliers that are the least often contained in clusters. On the other hand, clustering algorithms are usually not designed to exclude outliers, but will often include outliers in a nearby cluster. Furthermore, there is little insight into or control over which types of outliers are detected, as they are merely a statistical side product of multiple clusterings.

Subspace Outlier Degree (SOD) [Kri+09b] is an outlier detection method for high-dimensional data. Instead of using the kNN, it uses the more robust concept of shared nearest neighbors (SNN, see Section 4.4), and it chooses a subspace in which to compute the similarities. It will be discussed in more detail in Section 5.2.1.
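
As a rough illustration of the shared nearest neighbor concept, the sketch below uses one common definition: the SNN similarity of two objects is the number of objects that appear in both of their kNN sets. The function names, the parameters `k` and `size`, and the toy data are assumptions for this example; the subspace selection step of SOD is not shown.

```python
import math
import random

def knn_sets(data, k):
    """k nearest neighbors (Euclidean distance) of every object, as sets of indices."""
    result = []
    for i, p in enumerate(data):
        order = sorted(
            (j for j in range(len(data)) if j != i),
            key=lambda j: math.dist(p, data[j]),
        )
        result.append(set(order[:k]))
    return result

def snn_similarity(neighbors, i, j):
    """Number of neighbors that objects i and j have in common."""
    return len(neighbors[i] & neighbors[j])

def snn_reference_set(neighbors, i, size):
    """The `size` objects that share the most neighbors with object i."""
    others = [j for j in range(len(neighbors)) if j != i]
    others.sort(key=lambda j: snn_similarity(neighbors, i, j), reverse=True)
    return others[:size]

random.seed(3)
data = [[random.gauss(0.0, 1.0) for _ in range(10)] for _ in range(100)]
neighbors = knn_sets(data, k=15)
reference = snn_reference_set(neighbors, 0, size=10)
print(reference, [snn_similarity(neighbors, 0, j) for j in reference])
```

Because the similarity depends only on neighborhood rankings rather than on absolute distances, it tends to stay meaningful when distances themselves lose contrast in high-dimensional data.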

OutRES [MSS10] is a density-based subspace method. For every object, it locates the neighbors in different subspaces and compares their kernel densities. It tries to take the different dimensionalities into account by adjusting kernel size and distances depending on the dimensionality. In order not to have to compute densities in all $2^d - 1$ subspaces, an Apriori-style search strategy is employed, combined with a test against uniform distribution.

HiCS (High-contrast subspaces for density-based outlier ranking, [KMB12]) can be seen as a meta outlier detection method. It first identifies “high contrast” subspaces, then runs an outlier detection method such as LOF in these subspaces. This allows the use of LOF on high-dimensional data that it could normally not process. However, the scores are not normalized across different dimensionalities. Depending on the data set, HiCS may find an excessive number of subspace combinations, and the subspace search dominates the total running time.

Correlation Outlier Probabilities (COP) [Zim08; Kri+12], which will be covered in detail in Section 5.2.2, try to identify outliers near arbitrarily oriented correlation clusters by their deviation from a local trend.

In document Schubert, Erich (2013): Generalized and efficient outlier detection for spatial, temporal, and high-dimensional data mining. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 50-52)