5.2 Local Outlier Detection in Higher Dimensions

5.2.2 Detecting Outliers in Arbitrarily Oriented Subspaces: COP

The method outlined in this section was published as:

H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. “Outlier Detection in Arbitrarily Oriented Subspaces”. In: Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium. 2012, pp. 379–388. doi: 10.1109/ICDM.2012.21

Real data often contain clusters that are not restricted to axis-parallel subspaces. A common approach to detecting such arbitrarily oriented clusters is to apply principal component analysis (PCA) to a local subset of the data, as done for example by the methods ORCLUS [AY00], 4C [Böh+04b], HiCO [Ach+06c] and ERiC [Ach+07]. Achtert et al. [Ach+06b] discuss the benefits of PCA for analyzing correlation clusters. A modified variant of PCA for use in correlation clustering is introduced in [Kri+08]; it is less susceptible to outliers because it reduces the weight of far-away points and tries different values of $k$. See the survey by Kriegel et al. [KKZ09] for an in-depth overview of correlation clustering methods.

To solve the same problems in outlier detection, for example detecting outliers close to an arbitrarily oriented subspace cluster, a similar approach based on PCA can be employed. This brings along the same bootstrapping problem that still plagues many other correlation-based methods: to find the proper neighborhood in high-dimensional data, one needs to know the correlations, but to find the correlations, one already needs the proper neighborhood. Nevertheless, there are good reasons to stick to the common best practice of using the $k$ nearest neighbors. This initial neighborhood is usually unbiased (due to the symmetry of the query sphere), and with index structures it can often be retrieved in reasonable computation time. Furthermore, the parameter $k$ allows good control of the computational cost.

The method COP does not rely on a prior cluster analysis (in particular because correlation clustering itself is a rather computationally intensive task) but instead evaluates individual objects one at a time. It also does not need the correlation models of the neighbors; the score of each object depends only on the neighbor set itself. It can therefore also be evaluated for single candidate observations only, if these have been preselected by a different algorithm.

The COP score is computed by first determining the neighbor set $N(p)$ and then performing principal component analysis (PCA). For PCA, the covariance matrix $\Sigma$ of the neighborhood $N(p)$ is computed and then decomposed into a rotation matrix $V$ and a diagonal scaling matrix $\Lambda$ such that $V \Lambda V^{-1} = \Sigma$. Each column $v_i$ of the matrix $V$ corresponds to an eigenvector, and the entry $\lambda_i$ of the diagonal matrix $\Lambda$ to the associated eigenvalue of $\Sigma$. Intuitively, the eigenvectors point into the principal directions of the data set, and the associated eigenvalues give the variance along these axes. We build another diagonal matrix $\Omega$ with $\omega_i := 1/\sqrt{\lambda_i}$, which satisfies $\Omega\Omega = \Lambda^{-1}$. Using $\Omega = \Omega^T$ and $V^{-1} = V^T$, we can decompose $\Sigma^{-1}$ as follows:

\[
\Sigma^{-1} = (V \Lambda V^{-1})^{-1} = V \Lambda^{-1} V^{-1} = V \Omega \Omega V^T = (\Omega V^T)^T (\Omega V^T)
\]


By mapping $x \mapsto (\Omega V^T)(x - \mu)$, where $\mu$ denotes the mean of the neighborhood, we can transform the original data so that the dimensions are pairwise uncorrelated and have zero mean and unit variance. Furthermore, the dimensions are ordered by their original variance. By retaining only the last $d'$ dimensions of this space, we can measure the deviations within the subspace orthogonal to the assumed cluster (which is modeled by the first $d - d'$ dimensions). This is somewhat the opposite of the common approach to (global) dimensionality reduction, where one would focus on the first eigenvectors. For outlier detection, however, we do not want to represent the structure of the (assumed) cluster, but instead measure the deviation from the norm. Furthermore, on a global level, the user can choose the dimensionality $d'$ experimentally for optimal results. In the context of correlation clustering and correlation outliers, however, we must assume that different parts of the data set require different dimensionalities. Therefore, we need an automatic method for choosing the dimensionality.
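The mapping described above can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the implementation used for the experiments: the function name `whiten_neighborhood`, the synthetic neighborhood, the query point and the choice $d' = 2$ are assumptions made for this example.

```python
import numpy as np

def whiten_neighborhood(neighbors):
    """Return (mu, T, eigenvalues) with T = Omega V^T and eigenvalues sorted descending."""
    mu = neighbors.mean(axis=0)
    sigma = np.cov(neighbors, rowvar=False)   # covariance matrix Sigma of the neighborhood
    lam, V = np.linalg.eigh(sigma)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]             # reorder: largest variance first
    lam, V = lam[order], V[:, order]
    omega = np.diag(1.0 / np.sqrt(lam))       # omega_i = 1 / sqrt(lambda_i)
    return mu, omega @ V.T, lam

rng = np.random.default_rng(0)
# Hypothetical neighborhood: a noisy 1-dimensional line embedded in 3 dimensions.
t = rng.normal(size=(200, 1))
neighbors = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))

mu, T, lam = whiten_neighborhood(neighbors)
Z = (neighbors - mu) @ T.T                    # rows are (Omega V^T)(x - mu)

d_prime = 2                                   # assumed number of low-variance dimensions
x = np.array([1.0, 0.0, 0.0])                 # hypothetical query point
z = T @ (x - mu)
deviation_sq = float(np.sum(z[-d_prime:] ** 2))  # squared deviation orthogonal to the line

# The full squared norm of z equals the squared Mahalanobis distance of x.
sigma_inv = np.linalg.inv(np.cov(neighbors, rowvar=False))
mahalanobis_sq = float((x - mu) @ sigma_inv @ (x - mu))
```

Using all dimensions reproduces the Mahalanobis distance; restricting the sum to the last `d_prime` coordinates keeps only the deviation from the assumed cluster.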

Heuristics for Choosing a Subspace Dimensionality: In the correlation clustering algorithm ERiC [Ach+07], the authors suggest using a relative variance threshold $\alpha \in\, ]0,1[$ and computing $d'$ such that it satisfies the condition

\[
d'_{\text{percentage}} := \min_{\delta} \left\{ \delta \;\middle|\; \sum_{i=1}^{\delta} \lambda_i \geq \alpha \sum_{i=1}^{d} \lambda_i \right\}.
\]

Intuitively, this states that the chosen dimensions should explain the relative share $\alpha$ of the total variance. The authors reported good results with $\alpha = 85\%$.
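As a minimal sketch of this percentage heuristic, assuming the eigenvalues are given as a plain list (the function name `d_percentage` is hypothetical and not taken from ERiC):

```python
def d_percentage(eigenvalues, alpha=0.85):
    """Smallest delta whose leading eigenvalues explain at least alpha of the total variance."""
    lam = sorted(eigenvalues, reverse=True)   # largest eigenvalue first
    total = sum(lam)
    running = 0.0
    for delta, value in enumerate(lam, start=1):
        running += value
        if running >= alpha * total:
            return delta
    return len(lam)                           # fallback for rounding effects

# Hypothetical eigenvalue spectrum: two strong directions, two weak ones.
example = d_percentage([5.0, 3.0, 1.0, 1.0])
```

For the spectrum above, the first two eigenvalues explain 80% of the variance, so a third dimension is needed to reach 85%.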

While SOD [Kri+12] looked at the original data dimensions only, it essentially faced the same problem of choosing a subspace dimensionality. The heuristic proposed in Equation 5.3 can trivially be rewritten for use in correlation clustering and correlation outlier detection:

\[
d'_{\text{weak}} := \max_{\delta} \left\{ \delta \;\middle|\; \lambda_\delta \geq \alpha_{\text{weak}} \cdot \operatorname{mean}_{i=1}^{d} \lambda_i \right\}
\]

This heuristic is based on the assumption that the variances of all dimensions should be approximately equal, and selects all dimensions that have at least $\alpha_{\text{weak}}$ times the average variance. For SOD, a threshold of $\alpha_{\text{weak}} = 1.1$ worked well, but for correlation outlier detection much lower values such as $\alpha_{\text{weak}} = 0.95$ performed much better. There is a simple reason for this: while SOD used axis-parallel subspaces without changing the orientation, PCA actively rotates the data to maximize the variance in the first dimensions and to minimize it in the later dimensions. This way, PCA actually maximizes the differences between the projected variances. For small sample sizes, there exist natural differences in variance, and PCA will unfortunately emphasize these effects. To show this effect, we generated uniform $U[0; 1]$ data with 20 dimensions and up to 10000 samples, and computed the standard deviations before and after applying PCA. Figure 5.4 visualizes the mean standard deviation (which is not affected by PCA; the expected value for a uniform distribution is $\sqrt{1/12} \approx 0.289$), the minimum and maximum standard deviation, and the standard deviation of the standard deviations before and after applying PCA to this (uniform) data set. In Figure 5.4c we display the quotient of maximum and minimum standard deviation. By design, PCA maximizes this difference. At around 1000 samples (for 20-dimensional data), the results stabilize, which aligns with the rule of thumb of needing $3 \cdot d^2$ samples for PCA.

[Figure 5.4: PCA emphasizes the differences in variance of the axes. Panels: (a) linear scale, (b) logarithmic scale, (c) relative (max stddev / min stddev). Axes: sample size versus standard deviation; series: average, ± stddev and min/max, each before and after PCA.]
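The experiment can be approximated with the following sketch. It is not the original experiment code: the function name `stddev_spread`, the random seed and the three sample sizes are assumptions, and only the max/min quotient of panel (c) is computed.

```python
import numpy as np

def stddev_spread(n_samples, n_dims=20, seed=0):
    """Return (max/min stddev before PCA, max/min stddev after PCA) for U[0;1] data."""
    rng = np.random.default_rng(seed)
    data = rng.uniform(0.0, 1.0, size=(n_samples, n_dims))
    before = data.std(axis=0, ddof=1)         # per-axis standard deviations
    # After PCA, the variances along the new axes are the eigenvalues of the covariance.
    eigenvalues = np.linalg.eigvalsh(np.cov(data, rowvar=False))
    after = np.sqrt(eigenvalues)
    return before.max() / before.min(), after.max() / after.min()

for n in (100, 1000, 10000):
    ratio_before, ratio_after = stddev_spread(n)
    print(n, round(ratio_before, 3), round(ratio_after, 3))
```

The quotient after PCA can never be smaller than the quotient before PCA, because the largest eigenvalue is at least the largest per-axis variance and the smallest eigenvalue is at most the smallest per-axis variance.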

Furthermore, when the data set contains, for example, one strong correlation, it can easily happen that the first dimension already explains 85% of the variance, and both of these heuristics would then choose only this one, maybe even global, correlation. To take this into account, instead of comparing each eigenvalue to the total mean, we can compare it to the mean of the remaining dimensions only:

\[
d'_{\text{relative}} := \max_{\delta} \left\{ \delta \;\middle|\; \lambda_\delta \geq \alpha_{\text{relative}} \cdot \operatorname{mean}_{i=\delta}^{d} \lambda_i \right\}
\]

This threshold needs to be chosen larger than 1: since the eigenvalues are ordered, the largest of the remaining eigenvalues can never be smaller than their mean, so any threshold of at most 1 would be satisfied trivially. To eliminate this parameter, we can also drop the threshold and choose the dimension with the largest quotient, based on the idea that when going from relevant to irrelevant dimensions, the relative drop should be maximal:

\[
d'_{\text{significant}} := \operatorname{argmax}_{\delta} \; \lambda_\delta \,/\, \operatorname{mean}_{i=\delta}^{d} \lambda_i
\]

This last heuristic, however, is again easily distracted by a strong global correlation (which would cause the largest decrease to be at $\delta = 1$). Instead of fixing a dimensionality, we can also test all $d$ dimensionalities and see which one gives the best result. This, however, means that we can no longer use the same scoring measure that was used, for example, in SOD (see Section 5.2.1): distances measured at different dimensionalities are not comparable. Therefore, we need to normalize the deviations in a way that makes them comparable even when they have different dimensionality.
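The weak, relative and significant heuristics discussed above can be sketched as follows. The function names, the default $\alpha_{\text{relative}} = 1.1$ and the return value 0 when no dimension qualifies are assumptions made for this illustration; only $\alpha_{\text{weak}} = 0.95$ is taken from the text.

```python
def _mean(values):
    return sum(values) / len(values)

def d_weak(eigenvalues, alpha_weak=0.95):
    """Largest delta with lambda_delta >= alpha_weak * mean of all eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    threshold = alpha_weak * _mean(lam)
    hits = [delta for delta, v in enumerate(lam, start=1) if v >= threshold]
    return max(hits) if hits else 0           # 0 (assumed) if no dimension qualifies

def d_relative(eigenvalues, alpha_relative=1.1):
    """Largest delta with lambda_delta >= alpha_relative * mean(lambda_delta, ..., lambda_d)."""
    lam = sorted(eigenvalues, reverse=True)
    hits = [delta for delta in range(1, len(lam) + 1)
            if lam[delta - 1] >= alpha_relative * _mean(lam[delta - 1:])]
    return max(hits) if hits else 0

def d_significant(eigenvalues):
    """Delta that maximizes lambda_delta / mean(lambda_delta, ..., lambda_d)."""
    lam = sorted(eigenvalues, reverse=True)
    ratios = [lam[delta - 1] / _mean(lam[delta - 1:])
              for delta in range(1, len(lam) + 1)]
    return ratios.index(max(ratios)) + 1

spectrum = [4.0, 3.0, 0.2, 0.1]               # hypothetical: two strong, two weak axes
dominated = [100.0, 1.0, 1.0, 1.0]            # hypothetical: one strong global correlation
```

On `dominated`, `d_significant` returns 1, which illustrates how a single strong correlation distracts this heuristic.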

Distribution of Distances: In order to obtain a comparable value for different dimensionalities, we need to look at the expected deviations in different dimensionalities. Let us initially assume a very basic case, in which the data is approximately standard normally distributed, $X_i \sim \mathcal{N}(0; 1)$, and i.i.d. in each dimension $i$. The Euclidean norm of $d'$ such random variables is then $\sqrt{\sum_{i=1}^{d'} x_i^2}$. If we remove the square root and thus look at the squared Euclidean norm, we immediately get the chi-squared distribution with $d'$ degrees of freedom by definition:

\[
\sum_{i=1}^{d'} X_i^2 \sim \chi^2(d') \quad \text{if } \forall i \; X_i \sim \mathcal{N}(0; 1).
\]

The actual Euclidean norms are then of course $\chi(d')$ distributed, but we will continue to use the squared norms instead. Chi-squared distributions are well understood, since they are a special case of the Gamma distribution with shape $d'/2$ and scale 2: $\chi^2(d') \sim \Gamma(d'/2, 2)$.
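To make this relation concrete, the following sketch evaluates the chi-squared cdf through the Gamma cdf using only the Python standard library. The function names are hypothetical, and the power series used here is one common way to evaluate the regularized incomplete gamma function; it is adequate for moderate arguments only.

```python
import math

def regularized_gamma_p(a, x, max_terms=1000, eps=1e-14):
    """Regularized lower incomplete gamma P(a, x) via its power series (a > 0)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a                            # n = 0 term of sum x^n / (a (a+1) ... (a+n))
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < eps * total:
            break
    return min(1.0, total * math.exp(-x + a * math.log(x) - math.lgamma(a)))

def gamma_cdf(x, shape, scale):
    """Cdf of the Gamma distribution with the given shape and scale."""
    return regularized_gamma_p(shape, x / scale)

def chi2_cdf(x, dof):
    """Cdf of the chi-squared distribution, expressed as Gamma(dof / 2, 2)."""
    return gamma_cdf(x, dof / 2.0, 2.0)

# Known closed forms for comparison: 1 - exp(-x/2) for 2 degrees of freedom,
# erf(sqrt(x/2)) for 1 degree of freedom.
check_two = (chi2_cdf(3.0, 2), 1.0 - math.exp(-1.5))
check_one = (chi2_cdf(1.0, 1), math.erf(math.sqrt(0.5)))
```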

Since our data may actually not be normally distributed around the mean, we can try to improve the results by estimating the distribution parameters from the data instead of using the expected value $d'$. There exists no known closed-form solution for the maximum likelihood estimation of the parameters of the Gamma distribution; however, the function is numerically stable, and the parameters can be found for example via Newton's method or the algorithm described by Choi and Wette [CW69], which we implemented for our experiments. The results using estimated parameters were, however, not substantially different from the naïve approach based solely on the dimensionalities. We did not yet exploit the robust statistics based on the median absolute deviation (MAD) and L-Moments (LMM) that we used in later experiments.
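A maximum likelihood fit via Newton's method could look roughly like the sketch below. It is a generic Newton iteration on the likelihood equation $\ln k - \psi(k) = \ln \bar{x} - \overline{\ln x}$ with a commonly used closed-form starting value, and it is not claimed to match the exact variant of [CW69] or the implementation used for the experiments. All function names, and the digamma and trigamma approximations (recurrence plus asymptotic series), are assumptions of this example.

```python
import math
import random

def digamma(x):
    """Digamma function for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))

def trigamma(x):
    """Trigamma function for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv * (1.0 + 0.5 * inv
                           + inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0)))

def fit_gamma_mle(samples, iterations=20):
    """Return (shape, scale) maximizing the Gamma likelihood of positive samples."""
    n = len(samples)
    mean = sum(samples) / n
    s = math.log(mean) - sum(math.log(v) for v in samples) / n
    # Closed-form starting value for the shape parameter.
    k = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    for _ in range(iterations):
        step = (math.log(k) - digamma(k) - s) / (1.0 / k - trigamma(k))
        k -= step                             # Newton update on the likelihood equation
        if abs(step) < 1e-12 * k:
            break
    return k, mean / k

random.seed(42)
# Squared norms of 5-dimensional standard normal vectors, i.e. chi-squared(5) samples.
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(5)) for _ in range(20000)]
shape, scale = fit_gamma_mle(samples)
```

For these synthetic samples the fitted parameters should be close to the theoretical values $d'/2 = 2.5$ and 2.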

Assuming that the squared deviations from the mean are thus $\Gamma(\_,\_)$ distributed, we can now compute the cumulative distribution function (cdf), which yields the quantile at which the potential outlier observation lies. A low quantile indicates that the observation is central, while a high value indicates that it has an unusually large deviation. Most importantly, these values are now on a probabilistic scale and are comparable across different dimensionalities.

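This allows us to iterate over the different dimensionalities and obtain the correlation outlier score (COS) as the maximum quantile (using either $\chi^2$ or $\Gamma$ distributions):

\[
\operatorname{COS}(o) := \max_{\delta} \; \operatorname{cdf}_{\Gamma(\_,\_)}\left( d_\delta(x - \mu)^2 \right) \tag{5.4}
\]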

In contrast to most earlier methods, this score has a strong probabilistic interpretation. However, since this does not align with the intuitive interpretation for end users, COP [Kri+12] added an additional normalization to a scale that is easier to interpret. Details on this follow in Section 5.3 with Equation 5.9.
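Putting the pieces together, the score of Equation 5.4 might be computed as in the following sketch. Several points are assumptions rather than statements from the text: $d_\delta(x - \mu)^2$ is interpreted as the squared norm of the whitened deviation in the $\delta$ lowest-variance principal axes, the naïve $\chi^2(\delta)$ distribution is used instead of fitted Gamma parameters, the neighbor set is passed in directly, and the final COP normalization is omitted. The function names and the synthetic data are hypothetical.

```python
import math
import numpy as np

def chi2_cdf(x, dof):
    """Chi-squared cdf via the series for the regularized incomplete gamma function."""
    a, x = dof / 2.0, x / 2.0
    if x <= 0.0:
        return 0.0
    if x > a + 200.0:                         # far in the tail: cdf is 1 in double precision
        return 1.0
    term = total = 1.0 / a
    for n in range(1, 1000):
        term *= x / (a + n)
        total += term
        if term < 1e-14 * total:
            break
    return min(1.0, total * math.exp(-x + a * math.log(x) - math.lgamma(a)))

def correlation_outlier_score(x, neighbors):
    """Maximum chi-squared quantile over all subspace dimensionalities delta."""
    mu = neighbors.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(neighbors, rowvar=False))  # ascending eigenvalues
    z = (V.T @ (x - mu)) / np.sqrt(lam)       # whitened deviation, weakest axis first
    sq = np.cumsum(z ** 2)                    # squared deviation in the delta weakest axes
    return max(chi2_cdf(float(sq[delta - 1]), delta)
               for delta in range(1, len(lam) + 1))

rng = np.random.default_rng(0)
direction = np.array([1.0, 2.0, -1.0])
# Hypothetical neighbor set: a noisy line in 3 dimensions.
neighbors = rng.normal(size=(200, 1)) * direction + 0.05 * rng.normal(size=(200, 3))

center = neighbors.mean(axis=0)
score_inlier = correlation_outlier_score(center + 0.5 * direction, neighbors)
score_outlier = correlation_outlier_score(center + np.array([1.0, 0.0, 0.0]), neighbors)
```

A point that moves along the line stays at a low quantile, whereas a point that leaves the line receives a score close to 1 even though its Euclidean distance to the mean is smaller.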