3.3 Outlier Detection Via Minimum Description Length
3.3.2 Independent Component Analysis
ICA [Com94] provides a basis for the calculation of independent components in a mixture of statistically independent random variables. Let a vector $\vec{x}$ consist of n statistically independent random variables. In order to apply ICA, at most one of these random variables is allowed to follow a Gaussian distribution. It has been determined that the best decomposition of a mixture of signals can be established by searching for data not following a Gaussian distribution. One reason could be that, in line with the central limit theorem, mixtures of independent signals from arbitrary distributions tend to be more Gaussian than the original signals. In our approach we apply ICA to maximize non-Gaussianity, which is used as a measure of statistical independence. This can be achieved since ICA favors directions of the data which are not Gaussian distributed. We determine coding costs (measured by entropy), which have to be minimized in order to obtain the best possible compression efficiency. Since, for a given variance, the entropy of a Gaussian distribution is maximal and all other distributions have a lower entropy, it is desirable to maximize non-Gaussianity.
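The statement that mixtures are closer to a Gaussian than their sources can be checked numerically. The following is a minimal sketch and not part of the described approach: it assumes excess kurtosis (zero for a Gaussian) as a simple stand-in measure of non-Gaussianity, and the helper name `excess_kurtosis` as well as the uniform source signals are purely illustrative.

```python
import numpy as np

def excess_kurtosis(s):
    """Excess kurtosis of a 1-D sample; it is 0 for a Gaussian distribution."""
    s = (s - s.mean()) / s.std()
    return float(np.mean(s ** 4) - 3.0)

rng = np.random.default_rng(0)
# Two independent, clearly non-Gaussian (uniform) source signals (illustrative).
s1 = rng.uniform(-1.0, 1.0, size=200_000)
s2 = rng.uniform(-1.0, 1.0, size=200_000)

# An equal-weight mixture of the two sources.
mix = (s1 + s2) / np.sqrt(2.0)

k_source = excess_kurtosis(s1)
k_mix = excess_kurtosis(mix)
print(f"source: {k_source:.2f}, mixture: {k_mix:.2f}")
```

In theory the excess kurtosis of a uniform distribution is -1.2 and that of the sum of two independent, identically distributed uniform variables is -0.6, so the mixture lies closer to the Gaussian value of zero than either source.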
Most real-world data is distorted in the data space; hence the assumption of equally dense data distributions does not hold for such data sets. To overcome this drawback we apply ICA to the data. One step in the ICA algorithm is the whitening of the data, leading to a de-correlation and a normalization of the data to unit variance. This transformation to the so-called white space makes it possible to implicitly handle data with diverse density.
3.3.2.1 Data Preprocessing
In general, ICA needs centered data, meaning data with zero mean, as input. If this is not the case, the data has to be centered. In Figure 3.1 the different steps of the ICA algorithm are illustrated. The first step is centering of the data. This can be achieved by subtracting the empirical mean $\vec{m}$ of a data set $DB$, in the example $\vec{m} = (100, 50)$, from each data point $\vec{x} \in DB$:
$$\vec{c} = \vec{x} - \vec{m} \tag{3.1}$$
whereby the empirical mean is defined as
$$\vec{m} = \frac{\sum_{\vec{x} \in DB} \vec{x}}{n} \tag{3.2}$$
with $n = |DB|$ being the cardinality of the data set.
[Figure 3.1 omitted; steps shown between panels: center, PCA: normalize + whiten, ICA]
Figure 3.1: The principle of the Independent Component Analysis. Shown are the different steps from the original data until the data is transformed into independent components. First the data is centered, then the data is normalized to unit variance and whitened by PCA, and finally the data is transformed into independent components by ICA.
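The centering step of Equations 3.1 and 3.2 can be sketched as follows. This is only an illustration under stated assumptions: the data set `DB` is synthetic, generated around the example mean (100, 50), and points are stored one per row.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data set DB: n = 500 points scattered around (100, 50).
DB = rng.normal(loc=[100.0, 50.0], scale=[4.0, 3.0], size=(500, 2))

n = DB.shape[0]            # cardinality |DB|
m = DB.sum(axis=0) / n     # empirical mean, Equation 3.2
C = DB - m                 # centered points, Equation 3.1 (one point per row)
```

After this step every coordinate of `C` has zero mean, which is the input condition stated above.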
Now, since the data is centered, PCA [Ait84], which is a subpart of ICA, can be applied. PCA is used to transform the data while losing as little information as possible and combining existing redundancy in terms of correlations in the data. Thereby, given a set of centered points $\vec{c}$, PCA identifies those directions in a d-dimensional vector space which have maximal variance. For this purpose the centered data $\vec{c}$ need to be normalized to unit variance in all directions. To achieve this, first the covariance matrix $\Sigma$ has to be determined by multiplying each centered data vector $\vec{c} = \vec{x} - \vec{m}$ with its transpose and averaging over the data set
$$\Sigma = \frac{1}{n} \sum_{\vec{x} \in DB} \vec{c} \cdot \vec{c}^T. \tag{3.3}$$
Then an eigenvalue decomposition of the covariance matrix is conducted
$$\Sigma := V D V^T, \tag{3.4}$$
resulting in the Eigenvectors $V$ and the Eigenvalues $D$ of the covariance matrix. The Eigenvector matrix $V$ is an orthogonal matrix, and the Eigenvalue matrix is a diagonal matrix $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. The Eigenvectors build a rotation matrix, and the Eigenvalues correspond to the variances of the main components, so that their square roots are the standard deviations.
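The properties of the decomposition can be verified on a small example. The sketch below assumes synthetic correlated data, one point per row, and a covariance matrix averaged over the n points; `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated 2-D data, centered as in Equation 3.1.
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
C = X - X.mean(axis=0)

Sigma = C.T @ C / C.shape[0]     # covariance matrix of the centered data
lam, V = np.linalg.eigh(Sigma)   # eigenvalues (ascending) and eigenvectors (columns)
D = np.diag(lam)

# Variance of the data along each eigenvector direction.
var_along_axes = (C @ V).var(axis=0)
```

Here `V @ D @ V.T` reproduces `Sigma`, `V` is orthogonal, and the variances along the eigenvector directions coincide with the eigenvalues.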
The PCA transform of vector $\vec{x}$ is obtained by
$$\vec{y} := \sqrt{D^{-1}} \times V^T \times \vec{c}. \tag{3.5}$$
Note that, since the Eigenvalue matrix is a diagonal matrix, its inverse is obtained simply by inverting each diagonal entry, so that $\sqrt{D^{-1}} = \mathrm{diag}(\sqrt{1/\lambda_1}, \ldots, \sqrt{1/\lambda_d})$. Multiplying by this diagonal matrix divides each main component by its standard deviation, so that the variance of every main component is normalized to one. The effect of normalization and whitening of a data set by PCA is depicted in Figure 3.1. The combination of redundancy in terms of correlations in the data can be clearly seen.
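A minimal sketch of the whitening transform of Equation 3.5 is given below. The function name `pca_whiten` and the synthetic data are illustrative assumptions. Points are stored one per row, so the transform is applied in its row-wise form, and all eigenvalues are assumed to be strictly positive.

```python
import numpy as np

def pca_whiten(C):
    """Whiten centered data C (one point per row) following Equation 3.5."""
    Sigma = C.T @ C / C.shape[0]
    lam, V = np.linalg.eigh(Sigma)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(lam))   # diag(sqrt(1/lambda_i))
    # Row-wise form of y = D^{-1/2} V^T c.
    Y = C @ V @ D_inv_sqrt
    return Y, V, lam

rng = np.random.default_rng(3)
# Hypothetical correlated data around the example mean (100, 50).
X = rng.normal(size=(2000, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]]) + [100.0, 50.0]
C = X - X.mean(axis=0)

Y, V, lam = pca_whiten(C)
cov_Y = Y.T @ Y / Y.shape[0]   # covariance of the whitened data
```

The covariance of the whitened data `cov_Y` is the identity matrix up to rounding, which means the components are de-correlated and have unit variance.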
3.3.2.2 Identification Of Independent Components
For solving the problem of finding independent components, which is the major goal of ICA, we used the efficient FastICA algorithm [Hyv99, HKO01]. FastICA is based on a fixed-point iteration scheme which determines the weighting matrix $W = \{\vec{w}_1, \ldots, \vec{w}_d\}$ to discover the independent components. So far, the directions of maximal variance have been determined by PCA, but since we are rather interested in the optimal projection of the data we need to determine the directions of minimal entropy, which can be obtained by ICA. Since the iterative optimization of $W$ expects whitened data as input, the whitened data produced by PCA can be used directly. In order to optimize $W$ using the fixed-point iteration of the FastICA algorithm, the weight vectors $\vec{w}$ of the matrix $W$ are updated by
$$\vec{w} = E\left(\vec{y} \cdot g(\vec{w}^T \cdot \vec{y})\right) - E\left(g'(\vec{w}^T \cdot \vec{y})\right) \cdot \vec{w}. \tag{3.6}$$
Thereby, $E(\cdots)$ is the expected value, $g(\cdots)$ is a non-linear contrast function, and $g'(\cdots)$ is the derivative of the non-linear function $g$. We decided to use $\tanh(a)$ for $g(a)$, resulting in $g'(a) = \frac{d \tanh(a)}{da} = 1 - \tanh^2(a)$. The optimization process is finished in case of convergence of $W$, followed by the orthonormalization of $W$. At this point the orthogonal weighting matrix $W$ has been determined, but the random variables are not yet stochastically independent. In order to project the original data $\vec{x}$ into the independent components we need to determine the de-mixing matrix $M^{-1}$, which is composed of the Eigenvectors $V$ and the Eigenvalues $D$ of the covariance matrix as well as the weighting matrix $W$. Since the mixing matrix is
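$$M = V \times \sqrt{D} \times W \tag{3.7}$$
we can obtain the de-mixing matrix $M^{-1}$ by
$$M^{-1} = W^T \times \frac{1}{\sqrt{D}} \times V^T \tag{3.8}$$
where $W$ and $V$ are both orthonormal matrices. Hence, since the determinant of an orthonormal matrix is $\pm 1$, the determinant of the de-mixing matrix $M^{-1}$ can be written, up to sign, as
$$\det(M^{-1}) = \prod_{1 \le i \le d} \sqrt{\frac{1}{\lambda_i}}. \tag{3.9}$$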
Note that the rotation or weighting matrix W is responsible for the rotation in the white space after the data has been whitened by the scaled Eigenvector matrix of the original data vector.
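The fixed-point update of Equation 3.6 with $g = \tanh$ can be sketched as follows. This is a simplified illustration, not the reference FastICA implementation: it assumes the common symmetric variant, which orthonormalizes $W$ after every update rather than only once after convergence, and the function names, the uniform sources, and the mixing matrix are hypothetical.

```python
import numpy as np

def sym_orthonormalize(W):
    """Symmetric orthonormalization W <- (W W^T)^{-1/2} W."""
    s, U = np.linalg.eigh(W @ W.T)
    return U @ np.diag(1.0 / np.sqrt(s)) @ U.T @ W

def fastica(Y, max_iter=200, tol=1e-8, seed=0):
    """Estimate W from whitened data Y (n x d); the columns of W are the w_i."""
    n, d = Y.shape
    rng = np.random.default_rng(seed)
    Wr = sym_orthonormalize(rng.normal(size=(d, d)))  # rows hold the w_i
    for _ in range(max_iter):
        P = Y @ Wr.T                      # projections w_i^T y, shape (n, d)
        G = np.tanh(P)                    # g
        G_prime = 1.0 - G ** 2            # g'
        # Equation 3.6 for every w_i: E[y g(w^T y)] - E[g'(w^T y)] w
        W_new = (G.T @ Y) / n - G_prime.mean(axis=0)[:, None] * Wr
        W_new = sym_orthonormalize(W_new)
        # Converged when each new w_i is parallel to the old one (sign is arbitrary).
        delta = np.max(np.abs(np.abs(np.sum(W_new * Wr, axis=1)) - 1.0))
        Wr = W_new
        if delta < tol:
            break
    return Wr.T

rng = np.random.default_rng(4)
S = rng.uniform(-1.0, 1.0, size=(5000, 2))        # hypothetical independent sources
X = S @ np.array([[1.0, 0.5], [0.3, 1.0]]).T      # observed mixtures
C = X - X.mean(axis=0)
lam, V = np.linalg.eigh(C.T @ C / len(C))
Y = C @ V @ np.diag(1.0 / np.sqrt(lam))           # whitened data (Equation 3.5)

W = fastica(Y)
Z = Y @ W                                         # estimated independent components
# Absolute correlation between each estimated component and each true source.
corr = np.abs(np.corrcoef(Z.T, S.T)[:2, 2:])
```

Each estimated component in `Z` should correlate strongly with exactly one of the sources in `S`, since ICA recovers the sources only up to permutation and sign.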
Finally, to convert the centered data $\vec{c}$, which has been obtained from the original data $\vec{x}$, into independent components $\vec{z}$, we have to project $\vec{c}$ into the independent component space by
$$\vec{z} = M^{-1} \times \vec{c}. \tag{3.10}$$
The last step of Figure 3.1 shows the impact of the transformation of the data into independent components. After ICA the redundancy in the data is minimal.
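The composition of the mixing and de-mixing matrices in Equations 3.7 to 3.10 can be checked numerically. The sketch below rests on explicit assumptions: the data are synthetic, and a random orthonormal matrix obtained by QR decomposition serves as a stand-in for the FastICA result $W$, since the identities hold for any orthonormal $W$.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical 3-D data: uniform sources mixed by a fixed, well-conditioned matrix.
A = np.array([[2.0, 0.5, 0.0], [0.3, 1.0, 0.2], [0.0, 0.4, 1.5]])
X = rng.uniform(-1.0, 1.0, size=(1000, 3)) @ A.T + [10.0, 20.0, 30.0]
C = X - X.mean(axis=0)
lam, V = np.linalg.eigh(C.T @ C / len(C))

# Stand-in for the FastICA result: any orthonormal W suffices for this check.
W, _ = np.linalg.qr(rng.normal(size=(3, 3)))

M = V @ np.diag(np.sqrt(lam)) @ W                    # Equation 3.7
M_inv = W.T @ np.diag(1.0 / np.sqrt(lam)) @ V.T      # Equation 3.8

Z = C @ M_inv.T                                      # row-wise form of Equation 3.10
det_from_eigenvalues = np.prod(1.0 / np.sqrt(lam))   # Equation 3.9
cov_Z = Z.T @ Z / len(Z)
```

In this sketch `M_inv @ M` is the identity, the absolute value of `np.linalg.det(M_inv)` matches the product in Equation 3.9, and the projected data `Z` have identity covariance.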