The basic operation of my methodology is to decide whether a given distribution ofe.g. response times for a particular server is well described by the historical population of distributions of response times for the same server. If it is, the new distribution is considered “normal”, or “non-anomalous”. Otherwise, the distribution is considered anomalous.
This is a difficult operation because of the requirement that it work equally well on a wide variety of input data. The population of distributions might be normally distributed in the
quantile-function space—but typically it is not. The population might have a wide “spread”, or it might not. The shape of the distributions is completely arbitrary. In general, there are very few assumptions that I can make globally for all servers, and my methods must work well in all cases.
Let’s assume we have a set of ndistributions, each a b-bin quantile function, as the input data. We represent this dataset as an×bdata matrix, X. As detailed in the previous section, this data matrix is first zero-centered by subtracting the mean vector, and the means of each column are saved in the vector µ. The result is then analyzed according to PCA to determine the primary (orthogonal) components of variation (the eigenvectors or principal components,
Vi), and the amount of variation accounted for by each component (the eigenvalues or singular
values λi).
I then form a basis, which defines the sort of distribution that is considered “normal”, or typical, based on the population seen so far. First, let L be the total variation of all the principal components, or Pb
i=1λi. Then let Ri be the amount of variation remaining after
principal component i: Ri = b X j=i+1 λ j L , i= 1. . . b
Note that Ri is always less than one, Rb is zero, and the sequence of Ri is nonincreasing.
The “normal” basis is formed by including principal components until Ri is less than some
threshold value, which I callbasis error. In other words, if there are kprincipal components in the basis, then Rk is the last value of R that is greater than basis error. Intuitively,
thebasis error parameter controls the specificity of the basis: the lower the value, the more specific to the input the basis will be, and the higher the value, the looser it will be. This parameter is discussed in more detail in Section 4.7.1.
The basisB is ab×kmatrix, wherekis the number of principal components included. This basis represents a k-dimensional subspace of the original b dimensions, and it is the inherent dimensionality of the data at a tolerance of basis error. Each of the input points (which are the original distributions) have a certainprojection error onto this basis; that is, there is some distance between the original distribution and its projection onto the k-dimensional subspace.
If Z is a (b-dimensional) point and Z′ is its (k-dimensional) projection onto the basis, then let Z′′ be the projection expressed in terms of the original b dimensions, or Z′BT. Then the
projection error of Z is defined as the Euclidean distance from Z toZ′′:
P E(Z) = b X i=1 Zi−Zi′′ 2 ! 1 2
I call the mean of the projection errorsµp, and the variance of the projection errorsλp. In
this way, I essentially “roll up” all of the leftover dimensions into a single dimension, which I call the “error” dimension. Thus, the number of dimensions in the basis is controlled by the inherent dimensionality of the data, rather than an arbitrary choice ofb.
I now describe the basic operation of the discrimination process, which I call the anomaly test. The idea is to compare a distribution to be tested for being anomalous against the basis formed by a population of distributions, and to quantify the extent to which the tested distribution is not described by the basis. First, the tested distribution (which is, in its QF representation, a b-dimensional point) is zero-centered by subtracting the mean vector, µ, to get a (row) vector I will callT. T is then multiplied by the basisB to getT′, the projection ofT
onto the basis. One can think of this operation as changing the coordinate system ofT from the original coordinates (the quantile bin values from 1. . . b) to the PCA-defined coordinates (the distance along principal components 1. . . k). Then, I computeP E(T)−µp, or the zero-centered
projection error of T, and call it p.
At this point, I have T′, the zero-centered coordinates of the test distribution in terms of the principal components,λ, the variances of the original data along each of the principal com- ponents, p, the zero-centered projection error of the test distribution, and λp, the variance of
the original projection errors. In other words, I havek+ 1 zero-centered values along with the variance of each of these values as defined by the original population. I can now compute the standardized distance from the basis to the tested distribution. The standardized distance is a special case of the Mahalanobis distance in which the covariance matrix is diagonal. Our covari- ance matrix is diagonal because PCA returns a set of components that are mutually orthogonal, thus removing correlations among the original components. The standardized distance between two d-dimensional points X and Y is defined as:
SD(X, Y) = d X i=1 (Xi−Yi)2 σ2 i ! ,
...whereσi2is the variance along dimensioni. In my application, I am measuring the distance of the test distribution to the origin to compute the output of the anomaly test, which I call theanomaly score:
AS(T) = k X i=1 (T′ i)2 λi ! +p 2 λp
Intuitively, the anomaly score measures the extent to which the test distribution is not described by the basis. If the test distribution is well described by the basis, then its anomaly score will be near zero; otherwise, it will be large. Note that, because we are dealing with a standardized distance metric, the anomaly score is unaffected by the scale and the spread of the input population. In other words, an anomaly score of twenty is comparable, even between servers with drastically different variances. An anomaly score less than a parameter anom thres is considered non-anomalous; otherwise, it is classified as an anomaly of some severity. Section 4.7.5 explores theanom thres setting in more detail.
If the test distribution is deemed to be normal (or non-anomalous), then it is added to the set of distributions used to define the normal basis. This is done to make the process adaptive to gradual shifts in the behavior of a server, whether because of seasonal changes in the demand placed on the server, the gradual aging of the server, or other reasons. At a high level, this is a “windowing” operation: the oldest day in the input set is dropped out of the window as the newest normal day is incorporated. The size of the window, window size, is a parameter, which we will explore in Section 4.7.2. Once the input set is established, the normal basis is recomputed using PCA, as detailed above. On the other hand, if the test distribution is considered anomalous, it is not added to the normal set, and the current basis is reused.
An astute observer might ask: why must you use PCA? Why not just compute the Maha- lanobis distance on the original b dimensional input data and test distribution? The (gener- alized) Mahalanobis distance incorporates the correlation among dimensions, so that the test
0.001 0.01 0.1 1 10 0 5 10 15 20 25 30 35 40 response time (s) QF bin index 2009-01-16
Figure 4.5: Quantile functions for example server, plotted on a logarithmic scale, with the anomalous QF highlighted
distribution is not penalized multiple times for being on the edge of some envelope. However, the Mahalanobis distance uses the covariance matrix, which will be singular if the number of sam- ples is less than the number of bins—the so-called “high-dimension low-sample-size (HDLSS) situation”. The HDLSS situation is common in my application, and a singular covariance ma- trix will yield infinite or otherwise nonsensical Mahalanobis distances, so we use PCA as an approach more robust to the HDLSS situation.