Independent Component Analysis


Seungjin Choi

Abstract Independent component analysis (ICA) is a statistical method whose goal is to decompose multivariate data into a linear sum of non-orthogonal basis vectors with coefficients (encoding variables, latent variables, hidden variables) that are statistically independent. ICA generalizes widely used subspace analysis methods such as principal component analysis (PCA) and factor analysis, allowing latent variables to be non-Gaussian and basis vectors to be non-orthogonal in general. ICA is a density estimation method in which a linear model is learned such that the probability distribution of the observed data is best captured, whereas factor analysis aims at best modeling the covariance structure of the observed data. We begin with the fundamental theory and then present various principles and algorithms for ICA.

1 Introduction

Independent component analysis (ICA) is a widely used multivariate data analysis method that plays an important role in various applications such as pattern recognition, medical image analysis, bioinformatics, digital communications, and computational neuroscience. ICA seeks a decomposition of multivariate data into a linear sum of non-orthogonal basis vectors with coefficients that are as statistically independent as possible.

We consider a linear generative model, where the m-dimensional observed data $x \in \mathbb{R}^m$ is assumed to be generated by a linear combination of n basis vectors $\{a_i \in \mathbb{R}^m\}$,

$$x = a_1 s_1 + a_2 s_2 + \cdots + a_n s_n, \tag{1}$$

Seungjin Choi

Department of Computer Science, Pohang University of Science and Technology, San 31 Hyoja-dong, Nam-gu, Pohang 790-784, Korea, e-mail: [email protected]


where $\{s_i \in \mathbb{R}\}$ are encoding variables representing the extent to which each basis vector is used to reconstruct the data vector. Given N samples, the model (1) can be written in a compact form:

X = AS, (2)

where $X = [x(1), \ldots, x(N)] \in \mathbb{R}^{m \times N}$ is a data matrix, $A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n}$ is a basis matrix, and $S = [s(1), \ldots, s(N)] \in \mathbb{R}^{n \times N}$ is an encoding matrix with $s(t) = [s_1(t), \ldots, s_n(t)]^\top$.

The basis-encoding model (2) admits a dual interpretation:

• When the columns of X are treated as data points in m-dimensional space, the columns of A are basis vectors and each column of S is an encoding that represents the extent to which each basis vector is used to reconstruct the corresponding data vector.

• Alternatively, when the rows of X are treated as data points in N-dimensional space, the rows of S are basis vectors and each row of A is an encoding.
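The two readings of X = AS can be checked numerically. The following is a minimal NumPy sketch; the sizes, the random seed, and the Laplacian encodings are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 3, 2, 5               # illustrative sizes (assumed)

A = rng.normal(size=(m, n))     # basis (mixing) matrix
S = rng.laplace(size=(n, N))    # encodings, one column per sample
X = A @ S                       # compact model (2)

# Column view: x(t) = sum_i a_i s_i(t), as in model (1).
t = 0
x_t = sum(A[:, i] * S[i, t] for i in range(n))

# Row view: the j-th row of X is a combination of the rows of S,
# weighted by the entries of the j-th row of A.
j = 1
row_j = sum(A[j, i] * S[i, :] for i in range(n))
```

Both `x_t` and `row_j` reproduce the corresponding column and row of `X`, which is all the dual interpretation claims.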

A prominent application of ICA is blind source separation (BSS), whose goal is to restore the sources S (associated with encodings) from the data matrix X without knowledge of A. ICA and BSS have often been treated as an identical problem since they are closely related. In BSS, the matrix A is referred to as the mixing matrix. In practice, we find a linear transformation W, referred to as the demixing matrix, such that the rows of the output matrix

Y = W X, (3)

are statistically independent. Assume that the sources (rows of S) are statistically independent. In such a case, it is well known that WA becomes a transparent transformation when the rows of Y are statistically independent. The transparent transformation is given by WA = PΛ, where P is a permutation matrix and Λ is a nonsingular diagonal matrix involving scaling. This transparent transformation reflects two indeterminacies in ICA [1]: (1) scaling ambiguity; (2) permutation ambiguity. In other words, the rows of Y correspond to scaled and permuted rows of S.
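To make the two ambiguities concrete, the sketch below builds a transparent transformation G = PΛ and applies it to a small source matrix. The helper name `is_transparent` and the numerical values are hypothetical, chosen only for illustration.

```python
import numpy as np

def is_transparent(G, tol=1e-8):
    """True if G has exactly one nonzero entry per row and per column (G = P @ Lambda)."""
    mask = np.abs(G) > tol
    return bool(np.all(mask.sum(axis=0) == 1) and np.all(mask.sum(axis=1) == 1))

P = np.array([[0.0, 1.0], [1.0, 0.0]])   # permutation matrix
Lam = np.diag([2.0, -0.5])               # nonsingular diagonal scaling
G = P @ Lam

S = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Y = G @ S   # row 0 of Y is -0.5 * (row 1 of S); row 1 of Y is 2 * (row 0 of S)
```

The sources are recovered, but their order and their scale (including sign) are not determined.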

Since Jutten and Herault’s first solution [2] to ICA, various methods have been developed, including a neural network approach [3], information maximization [4], natural gradient (or relative gradient) learning [5, 6, 7], maximum likelihood estimation [8, 9, 10, 11], and nonlinear principal component analysis (PCA) [12, 13, 14]. Several books on ICA [15, 16, 17, 18, 19] are available, serving as good resources for a thorough review and tutorial of ICA. In addition, tutorial papers on ICA [20, 21] are useful resources as well.

In this chapter, we begin with a fundamental idea, emphasizing why independent components are sought. Then we introduce well-known principles to tackle ICA, leading to an objective function to be optimized. We explain the natural gradient algorithm for ICA. We also elucidate how we incorporate nonstationarity or temporal information into the standard ICA framework.

2 Why Independent Components?

Principal component analysis (PCA) is a popular subspace analysis method that has been used for dimensionality reduction and feature extraction. Given a data matrix $X \in \mathbb{R}^{m \times N}$, the covariance matrix $R_{xx}$ is computed by

$$R_{xx} = \frac{1}{N} X H X^\top,$$

where $H = I_{N \times N} - \frac{1}{N} 1_N 1_N^\top$ is the centering matrix, $I_{N \times N}$ is the N × N identity matrix, and $1_N = [1, \ldots, 1]^\top \in \mathbb{R}^N$. The rank-n approximation of the covariance matrix $R_{xx}$ is of the form

$$R_{xx} \approx U \Lambda U^\top,$$

where $U \in \mathbb{R}^{m \times n}$ contains in its columns the n eigenvectors associated with the n largest eigenvalues of $R_{xx}$, and the corresponding eigenvalues are the diagonal entries of the diagonal matrix Λ. The principal components z(t) are then determined by projecting the data points x(t) onto these eigenvectors, leading to

$$z(t) = U^\top x(t),$$

or in a compact form,

$$Z = U^\top X.$$

It is well known that rows of Z are uncorrelated with each other.
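The decorrelation property can be verified with a short NumPy sketch. The data below are synthetic and the sizes are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, n = 3, 500, 2                      # illustrative sizes (assumed)
X = rng.normal(size=(m, m)) @ rng.laplace(size=(m, N))

H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
R_xx = X @ H @ X.T / N                   # covariance matrix

# eigh returns eigenvalues in ascending order; keep the n largest.
eigvals, eigvecs = np.linalg.eigh(R_xx)
U = eigvecs[:, ::-1][:, :n]

Z = U.T @ X                              # principal components
R_zz = Z @ H @ Z.T / N                   # covariance of the rows of Z
off_diag = R_zz - np.diag(np.diag(R_zz))
```

The off-diagonal entries of `R_zz` vanish up to rounding, since $R_{zz} = U^\top R_{xx} U$ is diagonal when the columns of U are eigenvectors of $R_{xx}$.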

ICA generalizes PCA in the sense that latent variables (components) are non-Gaussian and A is allowed to be a non-orthogonal transformation, whereas PCA considers only orthogonal transformations and implicitly assumes Gaussian components.

Fig. 1 shows a simple example, emphasizing the main difference between PCA and ICA.

We present a core theorem that plays an important role in ICA. It provides a fundamental principle for various unsupervised learning algorithms for ICA and BSS.

Theorem 1 (Skitovich-Darmois). Let $\{s_1, s_2, \ldots, s_n\}$ be a set of independent random variables. Consider two random variables $y_1$ and $y_2$ which are linear combinations of $\{s_i\}$,

$$y_1 = \alpha_1 s_1 + \cdots + \alpha_n s_n, \qquad y_2 = \beta_1 s_1 + \cdots + \beta_n s_n, \tag{4}$$

where $\{\alpha_i\}$ and $\{\beta_i\}$ are real constants. If $y_1$ and $y_2$ are statistically independent, then each variable $s_i$ for which $\alpha_i \beta_i \neq 0$ is Gaussian.

Fig. 1 Two-dimensional data with two main arms are fitted by two different basis vectors: (a) PCA makes the implicit assumption that the data have a Gaussian distribution and determines the optimal basis vectors that are orthogonal, which are not efficient at representing non-orthogonal distributions; (b) ICA does not require that the basis vectors be orthogonal and considers non-Gaussian distributions, which is more suitable for fitting more general types of distributions.

Consider the linear model (2). Throughout this chapter, we consider the simplest case where m = n (square mixing). Let us define the global transformation as G = WA, where A is the mixing matrix and W is the demixing matrix. With this definition, we write the output y(t) as

y(t) = W x(t) = Gs(t). (5)

Let us assume that both A and W are nonsingular, hence G is nonsingular. Under this assumption, one can easily see that if $\{y_i(t)\}$ are mutually independent non-Gaussian signals, then invoking Theorem 1, G has the following decomposition

G = P Λ . (6)

This justifies why ICA performs BSS.

3 Principles

The task of ICA is to estimate the mixing matrix A or its inverse $W = A^{-1}$ (referred to as the demixing matrix) such that the elements of the estimate $y = A^{-1}x = Wx$ are as independent as possible. For the sake of simplicity, we often leave out the index t if the time structure does not have to be considered. In this section we review four different principles: (1) maximum likelihood estimation; (2) mutual information minimization; (3) information maximization; (4) negentropy maximization.

3.1 Maximum likelihood estimation

Suppose that the sources s are independent with marginal distributions $q_i(s_i)$,

$$q(s) = \prod_{i=1}^{n} q_i(s_i). \tag{7}$$

In the linear model, x = As, a single factor in the likelihood function is given by

$$p(x|A, q) = \int p(x|s, A)\, q(s)\, ds = \int \prod_{j=1}^{n} \delta\!\left(x_j - \sum_{i=1}^{n} A_{ji} s_i\right) \prod_{i=1}^{n} q_i(s_i)\, ds \tag{8}$$

$$= |\det A|^{-1} \prod_{i=1}^{n} q_i\!\left(\sum_{j=1}^{n} A^{-1}_{ij} x_j\right). \tag{9}$$

Then, we have

$$p(x|A, q) = |\det A|^{-1}\, q(A^{-1}x). \tag{10}$$

The log-likelihood is written as

$$\log p(x|A, q) = -\log|\det A| + \log q(A^{-1}x), \tag{11}$$

which can also be written as

$$\log p(x|W, q) = \log|\det W| + \log p(y), \tag{12}$$

where $W = A^{-1}$ and y is the estimate of s, with the true distribution q(·) replaced by a hypothesized distribution p(·). Since the sources are assumed to be statistically independent, (12) is written as

$$\log p(x|W, q) = \log|\det W| + \sum_{i=1}^{n} \log p_i(y_i). \tag{13}$$

The demixing matrix W is determined by

$$\widehat{W} = \arg\max_{W} \left\{ \log|\det W| + \sum_{i=1}^{n} \log p_i(y_i) \right\}. \tag{14}$$
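As an illustration of the criterion, the sketch below evaluates the sample average of the log-likelihood (13) for a given W. It assumes a hyperbolic-secant prior $p_i(y) = 1/(\pi\cosh y)$ as the hypothesized density; this prior, the function name `log_likelihood`, and the mixing matrix are choices made for the example, not prescribed by the text.

```python
import numpy as np

def log_likelihood(W, X):
    """Sample average of (13) with p_i(y) = 1 / (pi * cosh(y)) (assumed prior)."""
    Y = W @ X
    log_p = -np.log(np.pi) - np.log(np.cosh(Y))
    _, logabsdet = np.linalg.slogdet(W)
    return logabsdet + log_p.sum(axis=0).mean()

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.8], [0.8, 1.0]])                  # illustrative mixing matrix
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, 5000))   # unit-variance sources
X = A @ S

ll_true = log_likelihood(np.linalg.inv(A), X)   # W = A^{-1}
ll_identity = log_likelihood(np.eye(2), X)      # no demixing at all
```

With these super-Gaussian sources, the demixing matrix $A^{-1}$ scores a higher average log-likelihood than the identity, which is the behaviour that the maximization in (14) exploits.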

It is well known that maximum likelihood estimation is equivalent to Kullback matching, where the optimal model is estimated by minimizing the Kullback-Leibler (KL) divergence between the empirical distribution and the model distribution.

Consider the KL divergence from the empirical distribution $\tilde{p}(x)$ to the model distribution $p_\theta(x) = p(x|A, q)$,

$$KL[\tilde{p}(x) \,\|\, p_\theta(x)] = \int \tilde{p}(x) \log \frac{\tilde{p}(x)}{p_\theta(x)}\, dx = -H(\tilde{p}) - \int \tilde{p}(x) \log p_\theta(x)\, dx, \tag{15}$$

where $H(\tilde{p}) = -\int \tilde{p}(x) \log \tilde{p}(x)\, dx$ is the entropy of $\tilde{p}$. Given a set of data points $\{x_1, \ldots, x_N\}$ drawn from the underlying distribution p(x), the empirical distribution $\tilde{p}(x)$ puts probability $\frac{1}{N}$ on each data point, leading to

$$\tilde{p}(x) = \frac{1}{N} \sum_{t=1}^{N} \delta(x - x_t). \tag{16}$$

It follows from (15) that

$$\arg\min_{\theta} KL[\tilde{p}(x) \,\|\, p_\theta(x)] \equiv \arg\max_{\theta} \langle \log p_\theta(x) \rangle_{\tilde{p}}, \tag{17}$$

where $\langle \cdot \rangle_{\tilde{p}}$ represents the expectation with respect to the distribution $\tilde{p}$. Plugging (16) into the right-hand side of (17) leads to

$$\langle \log p_\theta(x) \rangle_{\tilde{p}} = \frac{1}{N} \int \sum_{t=1}^{N} \delta(x - x_t) \log p_\theta(x)\, dx = \frac{1}{N} \sum_{t=1}^{N} \log p_\theta(x_t). \tag{18}$$

Apart from the scaling factor $\frac{1}{N}$, this is just the log-likelihood function. In other words, maximum likelihood estimation is obtained from the minimization of (15).

3.2 Mutual information minimization

Mutual information is a measure of statistical dependence that vanishes if and only if the variables are independent. The demixing matrix W is learned such that the mutual information of y = Wx is minimized, leading to the following objective function:

$$J_{mi} = \int p(y) \log \frac{p(y)}{\prod_{i=1}^{n} p_i(y_i)}\, dy = -H(y) - \left\langle \sum_{i=1}^{n} \log p_i(y_i) \right\rangle_{y}, \tag{19}$$

where H(·) represents the entropy, i.e.,

$$H(y) = -\int p(y) \log p(y)\, dy, \tag{20}$$

and $\langle \cdot \rangle_y$ denotes the statistical average with respect to the distribution p(y). Note that $p(y) = \frac{p(x)}{|\det W|}$. Thus, the objective function (19) is given by

$$J_{mi} = -\log|\det W| - \sum_{i=1}^{n} \langle \log p_i(y_i) \rangle_y, \tag{21}$$

where $\langle \log p(x) \rangle$ is left out since it does not depend on the parameters W. For on-line learning, we consider only the instantaneous value, leading to

$$J_{mi} = -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i). \tag{22}$$

3.3 Information maximization

Infomax [4] involves the maximization of the entropy of the output z = g(y), where y = Wx and g(·) is a squashing function (e.g., $g_i(y_i) = \frac{1}{1 + e^{-y_i}}$). It was shown that infomax contrast maximization is equivalent to the minimization of the KL divergence between the distribution of y = Wx and the distribution $p(s) = \prod_{i=1}^{n} p_i(s_i)$. In fact, infomax is nothing but mutual information minimization in the ICA framework.

The infomax contrast function is given by

$$J_I(W) = H(g(Wx)), \tag{23}$$

where $g(y) = [g_1(y_1), \ldots, g_n(y_n)]^\top$. If $g_i(\cdot)$ is differentiable, then it is the cumulative distribution function of some probability density function $q_i(\cdot)$,

$$g_i(y_i) = \int_{-\infty}^{y_i} q_i(s_i)\, ds_i.$$

Let us choose a squashing function $g_i(y_i)$ as

$$g_i(y_i) = \frac{1}{1 + e^{-y_i}}, \tag{24}$$

where $g_i(\cdot): \mathbb{R} \to (0, 1)$ is a monotonically increasing function.

Let us consider an n-dimensional random vector $\hat{s}$, the joint distribution of which is factored into the product of marginal distributions:

$$q(\hat{s}) = \prod_{i=1}^{n} q_i(\hat{s}_i). \tag{25}$$

Then $g_i(\hat{s}_i)$ is distributed uniformly on (0, 1), since $g_i(\cdot)$ is the cumulative distribution function of $\hat{s}_i$. Define $u = g(\hat{s}) = [g_1(\hat{s}_1), \ldots, g_n(\hat{s}_n)]^\top$, which is distributed uniformly on $(0, 1)^n$.
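The uniformity claim is the probability integral transform, and it can be checked empirically. The sketch below assumes the logistic squashing function (24), whose derivative is the logistic density, and draws samples with NumPy's logistic generator; the sample size is arbitrary.

```python
import numpy as np

def g(y):
    """Logistic squashing function (24)."""
    return 1.0 / (1.0 + np.exp(-y))

rng = np.random.default_rng(0)
s_hat = rng.logistic(size=(2, 100_000))   # components with density q_i = g_i'
u = g(s_hat)                              # each row should be ~ Uniform(0, 1)

mean_u = u.mean(axis=1)   # a uniform variable on (0, 1) has mean 1/2
var_u = u.var(axis=1)     # and variance 1/12
```

The empirical mean and variance of each row of `u` come out close to 1/2 and 1/12, as expected for a uniform distribution on (0, 1).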

Define v = g(Wx). Then the infomax contrast function is rewritten as

$$\begin{aligned}
J_I(W) &= H(g(Wx)) \\
&= H(v) \\
&= -\int p(v) \log p(v)\, dv \\
&= -\int p(v) \log \left( \frac{p(v)}{\prod_{i=1}^{n} 1_{(0,1)}(v_i)} \right) dv \\
&= -KL[v \,\|\, u] \\
&= -KL[g(Wx) \,\|\, u],
\end{aligned} \tag{26}$$

where $1_{(0,1)}(\cdot)$ denotes the uniform distribution on (0, 1). Note that the KL divergence is invariant under an invertible transformation f,

$$KL[f(u) \,\|\, f(v)] = KL[u \,\|\, v] = KL[f^{-1}(u) \,\|\, f^{-1}(v)].$$

Therefore we have

$$J_I(W) = -KL[g(Wx) \,\|\, u] = -KL[Wx \,\|\, g^{-1}(u)] = -KL[Wx \,\|\, \hat{s}]. \tag{27}$$

It follows from (27) that maximizing J I (W ) (Infomax principle) is identical to the minimization of the KL divergence between the distribution of the output vector y = W x and the distribution bs whose entries are statistically independent. In other words, Infomax is equivalent to mutual information minimization in a framework of ICA.

3.4 Negentropy maximization

Negative entropy, or negentropy, is a measure of distance to Gaussianity, yielding a larger value for a random variable whose distribution is far from Gaussian. Negentropy is always nonnegative and vanishes if and only if the random variable is Gaussian. Negentropy is defined as

$$J(y) = H(y^G) - H(y), \tag{28}$$

where $H(y) = E\{-\log p(y)\}$ represents the entropy and $y^G$ is a Gaussian random vector whose mean vector and covariance matrix are the same as those of y. In fact, negentropy is the KL divergence of $p(y^G)$ from p(y), i.e.,

$$J(y) = KL\left[ p(y) \,\|\, p(y^G) \right] = \int p(y) \log \frac{p(y)}{p(y^G)}\, dy, \tag{29}$$

leading to (28).
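For scalar distributions with closed-form entropies, definition (28) can be evaluated directly. The sketch below compares a Gaussian, a Laplacian, and a uniform variable of equal variance; the helper names are hypothetical, and the entropy formulas used are the standard ones for these three families.

```python
import math

def gaussian_entropy(var):
    """Differential entropy of a scalar Gaussian with the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def negentropy(entropy, var):
    """J(y) = H(y^G) - H(y), eq. (28), for a scalar y with known entropy and variance."""
    return gaussian_entropy(var) - entropy

var = 1.0
b = math.sqrt(var / 2)         # Laplace scale b gives variance 2 b^2
width = math.sqrt(12 * var)    # uniform of this width has variance width^2 / 12

J_gauss = negentropy(gaussian_entropy(var), var)    # Gaussian: zero by construction
J_laplace = negentropy(1 + math.log(2 * b), var)    # Laplace entropy: 1 + log(2b)
J_uniform = negentropy(math.log(width), var)        # uniform entropy: log(width)
```

Both non-Gaussian cases give strictly positive values, while the Gaussian gives zero, in line with the nonnegativity property stated above.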

Let us derive a relation between negentropy and mutual information. To this end, we consider the mutual information I(y):

$$\begin{aligned}
I(y) &= I(y_1, \ldots, y_n) \\
&= \sum_{i=1}^{n} H(y_i) - H(y) \\
&= \sum_{i=1}^{n} H(y_i^G) - \sum_{i=1}^{n} J(y_i) + J(y) - H(y^G) \\
&= J(y) - \sum_{i=1}^{n} J(y_i) + \frac{1}{2} \log \frac{\prod_{i=1}^{n} [R_{yy}]_{ii}}{\det R_{yy}},
\end{aligned} \tag{30}$$

where $R_{yy} = E\{yy^\top\}$ and $[R_{yy}]_{ii}$ denotes the ith diagonal entry of $R_{yy}$.

Assume that y is already whitened (decorrelated), i.e., $R_{yy} = I$. Then the sum of marginal negentropies is given by

$$\begin{aligned}
\sum_{i=1}^{n} J(y_i) &= J(y) - I(y) + \underbrace{\frac{1}{2} \log \frac{\prod_{i=1}^{n} [R_{yy}]_{ii}}{\det R_{yy}}}_{0} \\
&= -H(y) - \int p(y) \log p(y^G)\, dy - I(y) \\
&= -H(x) - \log|\det W| - I(y) - \int p(y) \log p(y^G)\, dy.
\end{aligned} \tag{31}$$

Invoking $R_{yy} = I$, (31) becomes

$$\sum_{i=1}^{n} J(y_i) = -I(y) - H(x) - \log|\det W| + \frac{1}{2} \log|\det R_{yy}|. \tag{32}$$

Note that

$$\frac{1}{2} \log|\det R_{yy}| = \frac{1}{2} \log|\det(W R_{xx} W^\top)|. \tag{33}$$

Therefore, we have

$$\sum_{i=1}^{n} J(y_i) = -I(y), \tag{34}$$

where irrelevant terms are omitted. It follows from (34) that maximizing the sum of marginal negentropies is equivalent to minimizing the mutual information.

4 Natural gradient algorithm

In the previous section, four different principles led to the same objective function

$$J = -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i). \tag{35}$$

That is, ICA boils down to learning the W that minimizes (35),

$$\widehat{W} = \arg\min_{W} \left\{ -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i) \right\}. \tag{36}$$

An easy way to solve (36) is the gradient descent method, which gives a learning algorithm for W of the form

$$\Delta W = -\eta \frac{\partial J}{\partial W} = \eta \left\{ W^{-\top} - \varphi(y) x^\top \right\}, \tag{37}$$

where η > 0 is the learning rate and $\varphi(y) = [\varphi_1(y_1), \ldots, \varphi_n(y_n)]^\top$ is the negative score function whose ith element $\varphi_i(y_i)$ is given by

$$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{d y_i}. \tag{38}$$
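The gradient of (35) with respect to W is $-W^{-\top} + \varphi(y)x^\top$, which can be sanity-checked against finite differences. The sketch below assumes the prior $p_i(y) = 1/(\pi\cosh y)$, for which $\varphi(y) = \tanh(y)$; the prior, the matrix size, and the step `eps` are illustrative choices.

```python
import numpy as np

def J(W, x):
    """Instantaneous objective (35) with p_i(y) = 1 / (pi * cosh(y)) (assumed)."""
    y = W @ x
    log_p = -np.log(np.pi) - np.log(np.cosh(y))
    return -np.linalg.slogdet(W)[1] - log_p.sum()

def grad_J(W, x):
    """Analytic gradient dJ/dW = -W^{-T} + phi(y) x^T, with phi(y) = tanh(y)."""
    y = W @ x
    return -np.linalg.inv(W).T + np.outer(np.tanh(y), x)

rng = np.random.default_rng(0)
W = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
x = rng.normal(size=3)

# Central finite differences, perturbing one entry of W at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (J(W + E, x) - J(W - E, x)) / (2 * eps)

max_error = np.abs(numeric - grad_J(W, x)).max()
```

The numerical and analytic gradients agree to within the finite-difference error, which supports the sign convention used in the gradient descent rule.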

A popular ICA algorithm is based on the natural gradient [22], which is known to be efficient since it uses the steepest descent direction when the parameter space is a Riemannian manifold. We derive the natural gradient ICA algorithm [5].

Invoking (38), we have

$$d \left\{ -\sum_{i=1}^{n} \log p_i(y_i) \right\} = \sum_{i=1}^{n} \varphi_i(y_i)\, dy_i \tag{39}$$

$$= \varphi^\top(y)\, dy, \tag{40}$$

where $\varphi(y) = [\varphi_1(y_1), \ldots, \varphi_n(y_n)]^\top$ and dy is given in terms of dW as

$$dy = dW\, W^{-1} y. \tag{41}$$

Define a modified coefficient differential dV as

$$dV = dW\, W^{-1}. \tag{42}$$

With this definition, we have

$$d \left\{ -\sum_{i=1}^{n} \log p_i(y_i) \right\} = \varphi^\top(y)\, dV\, y. \tag{43}$$

We calculate an infinitesimal increment of log |det W|, which gives

$$d\{\log|\det W|\} = \mathrm{tr}\{dV\}, \tag{44}$$

where tr{·} denotes the trace, which adds up all diagonal elements.

Combining (43) and (44) gives

$$dJ = \varphi^\top(y)\, dV\, y - \mathrm{tr}\{dV\}. \tag{45}$$

The differential in (45) is in terms of the modified coefficient differential matrix dV. Note that dV is a linear combination of the coefficient differentials $dW_{ij}$. Thus, as long as W is nonsingular, dV represents a valid search direction to minimize (35), because dV spans the same tangent space of matrices as spanned by dW. This leads to a stochastic gradient learning algorithm for V given by

$$\Delta V = -\eta \frac{dJ}{dV} = \eta \left\{ I - \varphi(y) y^\top \right\}. \tag{46}$$

Thus the learning algorithm for updating W is described by

$$\Delta W = \Delta V\, W = \eta \left\{ I - \varphi(y) y^\top \right\} W. \tag{47}$$
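A batch version of the natural gradient update can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the expectation is replaced by a sample average, $\varphi(y) = \tanh(y)$ is used as a nonlinearity for super-Gaussian sources, and the mixing matrix, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

def natural_gradient_ica(X, eta=0.1, n_iter=500):
    """Batch form of update (47) with phi(y) = tanh(y) (assumed, super-Gaussian)."""
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        # Sample average of I - phi(y) y^T, then right-multiply by W.
        W = W + eta * (np.eye(n) - np.tanh(Y) @ Y.T / N) @ W
    return W

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # illustrative mixing matrix
S = rng.laplace(size=(2, 10_000))        # independent super-Gaussian sources
X = A @ S

W = natural_gradient_ica(X)
G = W @ A                                # global transformation, ideally P @ Lambda

# Per row: ratio of the second-largest to the largest magnitude
# (zero would mean perfect separation).
absG = np.sort(np.abs(G), axis=1)
interference = absG[:, -2] / absG[:, -1]
```

Note that the update needs no matrix inversion, in contrast to the ordinary gradient (37), and that its behaviour depends on W and A only through the global matrix G = WA.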

5 Flexible ICA

The optimal nonlinear function $\varphi_i(y_i)$ is given by (38). However, it requires knowledge of the probability distributions of the sources, which are not available to us. A variety of hypothesized density models have been used. For example, for super-Gaussian source signals, the unimodal or hyperbolic-Cauchy distribution model [9] leads to the nonlinear function given by

$$\varphi_i(y_i) = \tanh(\beta y_i). \tag{48}$$

Such a sigmoid function was also used in [4]. For sub-Gaussian source signals, the cubic nonlinear function $\varphi_i(y_i) = y_i^3$ has been a favorite choice. For mixtures of sub- and super-Gaussian source signals, the nonlinear function can be selected from two different choices according to the estimated kurtosis of the extracted signals [23].

Flexible ICA [24] incorporates the generalized Gaussian density model into the natural gradient ICA algorithm, so that the parameterized nonlinear function provides flexibility in learning. The generalized Gaussian probability distribution is a set of distributions parameterized by a positive real number α, which is usually referred to as the Gaussian exponent of the distribution. The Gaussian exponent α controls the “peakiness” of the distribution. The probability density function (PDF) for a generalized Gaussian is described by

$$p(y; \alpha) = \frac{\alpha}{2 \lambda \Gamma\!\left(\frac{1}{\alpha}\right)}\, e^{-\left| \frac{y}{\lambda} \right|^{\alpha}}, \tag{49}$$

where Γ(x) is the Gamma function given by

$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t}\, dt. \tag{50}$$

Note that if α = 1, the distribution becomes the standard “Laplacian” distribution. If α = 2, the distribution is a Gaussian (normal) distribution (see Figure 2).
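The density (49) is easy to evaluate numerically. The following sketch checks the normalization by a simple Riemann sum and compares the α = 1 and α = 2 cases with the Laplacian and normal densities; the grid and the choice λ = √2 for the standard normal case are assumptions made for the check.

```python
import math
import numpy as np

def generalized_gaussian_pdf(y, alpha, lam=1.0):
    """Density (49): alpha / (2 lam Gamma(1/alpha)) * exp(-|y / lam|^alpha)."""
    norm = alpha / (2.0 * lam * math.gamma(1.0 / alpha))
    return norm * np.exp(-np.abs(y / lam) ** alpha)

y = np.linspace(-30.0, 30.0, 600_001)
dy = y[1] - y[0]

# Riemann-sum check that each density integrates to approximately one.
areas = {a: generalized_gaussian_pdf(y, a).sum() * dy for a in (0.8, 1.0, 2.0, 4.0)}

# alpha = 1 (lam = 1) is Laplacian; alpha = 2 with lam = sqrt(2) is standard normal.
laplace = 0.5 * np.exp(-np.abs(y))
normal = np.exp(-y**2 / 2) / math.sqrt(2 * math.pi)
```

Smaller exponents give a sharper peak and heavier tails, while larger exponents flatten the density toward a uniform shape.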

Fig. 2 The generalized Gaussian distribution p(y) is plotted for several different values of the Gaussian exponent, α = 0.8, 1, 2, 4.

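For a generalized Gaussian distribution, the kurtosis can be expressed in terms of the Gaussian exponent as

$$\kappa_\alpha = \frac{\Gamma\!\left(\frac{5}{\alpha}\right) \Gamma\!\left(\frac{1}{\alpha}\right)}{\Gamma^2\!\left(\frac{3}{\alpha}\right)} - 3. \tag{51}$$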
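Formula (51) can be evaluated with the standard-library Gamma function. The sketch below computes it for three exponents; the function name `gg_kurtosis` is a hypothetical helper.

```python
import math

def gg_kurtosis(alpha):
    """Kurtosis (51) of a generalized Gaussian with Gaussian exponent alpha."""
    g1 = math.gamma(1.0 / alpha)
    g3 = math.gamma(3.0 / alpha)
    g5 = math.gamma(5.0 / alpha)
    return g5 * g1 / g3**2 - 3.0

kurt_laplace = gg_kurtosis(1.0)   # alpha = 1: Laplacian, leptokurtic
kurt_gauss = gg_kurtosis(2.0)     # alpha = 2: Gaussian
kurt_flat = gg_kurtosis(4.0)      # alpha = 4: platykurtic
```

For α = 1 the Gamma values are Γ(5) = 24, Γ(1) = 1, and Γ(3) = 2, so the kurtosis is 24/4 − 3 = 3; for α = 2 it is zero, and for α > 2 it becomes negative.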

Plots of the kurtosis $\kappa_\alpha$ versus the Gaussian exponent α for leptokurtic and platykurtic signals are shown in Fig. 3.

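From the parameterized generalized Gaussian density model, the nonlinear function in the algorithm (47) follows from (38). Dropping the positive constant factor $\alpha_i / \lambda^{\alpha_i}$, it is given by

$$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{d y_i} \;\propto\; |y_i|^{\alpha_i - 1}\, \mathrm{sgn}(y_i), \tag{52}$$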

Fig. 3 The plot of kurtosis $\kappa_\alpha$ versus Gaussian exponent α (logarithmic vertical axis): (a) for leptokurtic signal, kurtosis plotted for α up to 2; (b) for platykurtic signal, −kurtosis plotted for α from 2 to 10.

where sgn(y i ) is the signum function of y i .

Note that for $\alpha_i = 1$, $\varphi_i(y_i)$ in (52) becomes a signum function (which can also be derived from the Laplacian density model for sources). The signum nonlinear function is favorable for the separation of speech signals since natural speech is often modeled by a Laplacian distribution. Note also that for $\alpha_i = 4$, $\varphi_i(y_i)$ in (52) becomes a cubic function, which is known to be a good choice for sub-Gaussian sources.

In order to select a proper value of the Gaussian exponent $\alpha_i$, we estimate the kurtosis of the output signal $y_i$ and select the corresponding $\alpha_i$ from the relationship in Figure 3. The kurtosis of $y_i$, $\kappa_i$, can be estimated via the following iterative algorithm:

$$\kappa_i(t+1) = \frac{M_{4i}(t+1)}{M_{2i}^2(t+1)} - 3, \tag{53}$$

where

$$M_{4i}(t+1) = (1 - \delta) M_{4i}(t) + \delta |y_i(t)|^4, \tag{54}$$

$$M_{2i}(t+1) = (1 - \delta) M_{2i}(t) + \delta |y_i(t)|^2, \tag{55}$$

where δ is a small constant, say, 0.01.
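The recursion (53)-(55) amounts to tracking the second and fourth moments with exponential moving averages. A minimal sketch follows; the initial moment values are an assumption (those of a unit-variance Gaussian), since the text does not specify them.

```python
import numpy as np

def online_kurtosis(y, delta=0.01, m2_init=1.0, m4_init=3.0):
    """Track kurtosis with the moving averages (53)-(55).

    The initial moments are assumptions (those of a unit-variance Gaussian).
    """
    m2, m4 = m2_init, m4_init
    kappa = np.empty(len(y))
    for t, y_t in enumerate(y):
        m4 = (1 - delta) * m4 + delta * abs(y_t) ** 4   # (54)
        m2 = (1 - delta) * m2 + delta * abs(y_t) ** 2   # (55)
        kappa[t] = m4 / m2**2 - 3.0                     # (53)
    return kappa

# A +/-1 signal has E[y^4] = E[y^2] = 1, so its kurtosis is 1 - 3 = -2.
binary = np.tile([1.0, -1.0], 1000)
kappa_binary = online_kurtosis(binary)
```

With δ = 0.01 the effective averaging window is on the order of a hundred samples, so for random signals the estimate is noisy; it is mainly the sign of the estimate that is informative.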

In general, the estimated kurtosis of the demixing filter output does not exactly match the kurtosis of the original source. However, it indicates whether the estimated source is a sub-Gaussian or a super-Gaussian signal. Moreover, it was shown [11, 25] that the performance of source separation is not degraded even if the hypothesized density does not match the true density. For these reasons, we suggest a practical method where only several different forms of nonlinear functions are used.

6 Differential ICA

In a wide sense, most ICA algorithms based on unsupervised learning belong to the Hebb-type rule or its generalization with nonlinear functions. Motivated by the differential Hebb rule [26] and differential decorrelation [27, 28], we introduce an ICA algorithm employing differential learning and the natural gradient, which leads to a differential ICA algorithm. We first introduce a random walk model for latent variables, in order to show that differential learning can be interpreted as maximum likelihood estimation of a linear generative model. Then the detailed derivation of the differential ICA algorithm is presented.

6.1 Random walk model for latent variables

Given a set of observation data, {x(t)}, the task of learning the linear generative model (1) under a constraint of latent variables being statistically independent, is a semiparametric estimation problem. The maximum likelihood estimation of basis vectors {a i } involves a probabilistic model for latent variables which are treated as nuisance parameters.

In order to show a link between the differential learning and maximum likelihood estimation, we consider a random walk model for latent variables s _i (t), which is a simple Markov chain, i.e.,

$$s_i(t) = s_i(t-1) + \varepsilon_i(t), \tag{56}$$

where the innovation $\varepsilon_i(t)$ is assumed to have zero mean with a density function $q_i(\varepsilon_i(t))$. In addition, the innovation sequences $\{\varepsilon_i(t)\}$ are assumed to be mutually independent white sequences, i.e., they are spatially independent and temporally white as well.

Let us consider the latent variables $s_i(t)$ over an N-point time block. We define the vector $s_i$ as

$$s_i = [s_i(0), \ldots, s_i(N-1)]^\top. \tag{57}$$

Then the joint probability density function of $s_i$ can be written as

$$p_i(s_i) = p_i(s_i(0), \ldots, s_i(N-1)) = \prod_{t=0}^{N-1} p_i(s_i(t) \,|\, s_i(t-1)), \tag{58}$$

where $s_i(t) = 0$ for t < 0 and the statistical independence of the innovation sequences was taken into account.

It follows from the random walk model (56) that the conditional probability density of $s_i(t)$ given its past samples can be written as

$$p_i(s_i(t) \,|\, s_i(t-1)) = q_i(\varepsilon_i(t)). \tag{59}$$

Combining (58) and (59) leads to

$$p_i(s_i) = \prod_{t=0}^{N-1} q_i(\varepsilon_i(t)) = \prod_{t=0}^{N-1} q_i(s_i'(t)), \tag{60}$$

where $s_i'(t) = s_i(t) - s_i(t-1)$, which is the first-order approximation of differentiation.

Taking the statistical independence of the latent variables and (60) into account, we can write the joint density $p(s_1, \ldots, s_n)$ as

$$p(s_1, \ldots, s_n) = \prod_{i=1}^{n} p_i(s_i) = \prod_{t=0}^{N-1} \prod_{i=1}^{n} q_i(s_i'(t)). \tag{61}$$

The factorial model given in (61) will be used as an optimization criterion in deriving the differential ICA algorithm.
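The random walk model and its factorized density can be illustrated with a short simulation. In this sketch the innovations are drawn from a Laplacian density, which is an assumption made for the example; the model itself only requires zero-mean, independent, white innovations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 2, 1000                      # illustrative sizes (assumed)

eps = rng.laplace(size=(n, N))      # innovations: spatially independent, temporally white
s = np.cumsum(eps, axis=1)          # random walk (56) with s_i(-1) = 0

# First-order difference s'(t) = s(t) - s(t-1), using s(-1) = 0.
s_diff = np.diff(s, axis=1, prepend=0.0)

def log_joint_density(s_diff):
    """Log of (61) for the Laplacian density q_i(e) = exp(-|e|) / 2 (assumed)."""
    return np.sum(-np.abs(s_diff) - np.log(2.0))
```

Differencing recovers the innovations exactly, so the joint density of a strongly time-correlated source block reduces to a product over independent innovation samples.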

6.2 Algorithm

Denote a set of observation data by

$$X = \{x_1, \ldots, x_n\}, \tag{62}$$

where

$$x_i = [x_i(0), \ldots, x_i(N-1)]^\top. \tag{63}$$

Then the normalized log-likelihood is given by

$$\frac{1}{N} \log p(X|A) = -\log|\det A| + \frac{1}{N} \log p(s_1, \ldots, s_n) = -\log|\det A| + \frac{1}{N} \sum_{t=0}^{N-1} \sum_{i=1}^{n} \log q_i(s_i'(t)). \tag{64}$$

Let us denote the inverse of A by $W = A^{-1}$. The estimate of the latent variables is denoted by y(t) = Wx(t). With these definitions, the objective function (that is, the negative normalized log-likelihood) is given by

$$\widehat{J}_{di} = -\frac{1}{N} \log p(X|A) = -\log|\det W| - \frac{1}{N} \sum_{t=0}^{N-1} \sum_{i=1}^{n} \log q_i(y_i'(t)), \tag{65}$$

where $s_i$ is replaced by its estimate $y_i$ and $y_i'(t) = y_i(t) - y_i(t-1)$ (the first-order approximation of differentiation).

For on-line learning, the sample average is replaced by the instantaneous value. Hence the on-line version of the objective function (65) is given by

$$J_{di} = -\log|\det W| - \sum_{i=1}^{n} \log q_i(y_i'(t)). \tag{66}$$

Note that the objective function (66) is slightly different from (35) used in conventional ICA based on the minimization of mutual information or maximum likelihood estimation.

We derive a natural gradient learning algorithm which finds a minimum of (66).

To this end, we follow the approach discussed in [29, 22, 24]. We calculate the total differential $dJ_{di}(W)$ due to the change dW,

$$dJ_{di} = J_{di}(W + dW) - J_{di}(W) = d\{-\log|\det W|\} + d\left\{ -\sum_{i=1}^{n} \log q_i(y_i'(t)) \right\}. \tag{67}$$

Define

$$\varphi_i(y_i') = -\frac{d \log q_i(y_i')}{d y_i'} \tag{68}$$

and construct a vector $\varphi(y') = [\varphi_1(y_1'), \ldots, \varphi_n(y_n')]^\top$. With this definition, we have

$$d\left\{ -\sum_{i=1}^{n} \log q_i(y_i'(t)) \right\} = \sum_{i=1}^{n} \varphi_i(y_i'(t))\, dy_i'(t) = \varphi^\top(y'(t))\, dy'(t). \tag{69}$$

One can easily see that

$$d\{-\log|\det W|\} = -\mathrm{tr}\left\{ dW\, W^{-1} \right\}. \tag{70}$$

Define a modified differential matrix dV by

$$dV = dW\, W^{-1}. \tag{71}$$

Then, with this modified differential matrix and $dy'(t) = dV\, y'(t)$, the total differential $dJ_{di}(W)$ is computed as

$$dJ_{di} = -\mathrm{tr}\{dV\} + \varphi^\top(y'(t))\, dV\, y'(t). \tag{72}$$

A gradient descent learning algorithm for updating V is given by

$$V(t+1) = V(t) - \eta_t \frac{dJ_{di}}{dV} = V(t) + \eta_t \left\{ I - \varphi(y'(t))\, y'^\top(t) \right\}. \tag{73}$$

Hence, it follows from the relation (71) that the updating rule for W has the form

$$W(t+1) = W(t) + \eta_t \left\{ I - \varphi(y'(t))\, y'^\top(t) \right\} W(t). \tag{74}$$
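In practice the differential update is the natural gradient ICA update applied to differenced signals. The following batch sketch illustrates this on mixed random-walk sources; the nonlinearity $\varphi(y') = \tanh(y')$, the Laplacian innovations, the mixing matrix, and the step size are all assumptions made for the example.

```python
import numpy as np

def differential_ica(X, eta=0.1, n_iter=500):
    """Batch form of update (74) with phi(y') = tanh(y') (assumed)."""
    n, N = X.shape
    X_diff = np.diff(X, axis=1, prepend=0.0)   # x'(t) = x(t) - x(t-1), with x(-1) = 0
    W = np.eye(n)
    for _ in range(n_iter):
        Y_diff = W @ X_diff                    # y'(t) = W x'(t)
        W = W + eta * (np.eye(n) - np.tanh(Y_diff) @ Y_diff.T / N) @ W
    return W

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.6], [0.4, 1.0]])                  # illustrative mixing matrix
S = np.cumsum(rng.laplace(size=(2, 10_000)), axis=1)    # random-walk sources (56)
X = A @ S

W = differential_ica(X)
G = W @ A
absG = np.sort(np.abs(G), axis=1)
interference = absG[:, -2] / absG[:, -1]   # small values indicate good separation
```

The sources themselves are strongly correlated in time, but their differences are white, which is what the random walk model exploits.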

7 Nonstationary Source Separation

So far, we have assumed that the sources are stationary random processes whose statistics do not vary over time. In this section, we show how the natural gradient ICA algorithm is modified to handle nonstationary sources. As in [30], the following assumptions are made in this section.

AS1 The mixing matrix A has full column rank.

AS2 Source signals $\{s_i(t)\}$ are statistically independent with zero mean. This implies that the covariance matrix of the source signal vector, $R_s(t) = E\{s(t)s^\top(t)\}$, is a diagonal matrix, i.e.,

$$R_s(t) = \mathrm{diag}\{r_1(t), \ldots, r_n(t)\}, \tag{75}$$

where $r_i(t) = E\{s_i^2(t)\}$ and E denotes the statistical expectation operator.

AS3 The ratios $\frac{r_i(t)}{r_j(t)}$ (i, j = 1, . . . , n and i ≠ j) are not constant with time.

We have to point out that the first two assumptions (AS1, AS2) are common in most existing approaches to source separation, whereas the third assumption (AS3) is critical here. For nonstationary sources, the third assumption is satisfied and it allows us to separate linear mixtures of sources via second-order statistics (SOS).

For stationary source separation, the typical cost function is based on the mutual information, which requires knowledge of the underlying distributions of the sources.

Since the probability distributions of the sources are not known in advance, most ICA algorithms rely on hypothesized distributions (for example, see [24] and references therein). Higher-order statistics (HOS) should be incorporated either explicitly or implicitly.

For nonstationary sources, Matsuoka et al. have shown that the decomposition (6) is satisfied if the cross-correlations $E\{y_i(t)y_j(t)\}$ (i, j = 1, . . . , n, i ≠ j) are zero at any time instant t, provided that the assumptions (AS1)-(AS3) are satisfied. To eliminate cross-correlations, the following cost function was proposed in [30],

$$J(W) = \frac{1}{2} \left\{ \sum_{i=1}^{n} \log E\{y_i^2(t)\} - \log \det E\left\{ y(t) y^\top(t) \right\} \right\}, \tag{76}$$

where det(·) denotes the determinant of a matrix. The cost function given in (76) is a non-negative function which attains its minimum if and only if $E\{y_i(t)y_j(t)\} = 0$ for i, j = 1, . . . , n, i ≠ j. This is a direct consequence of Hadamard’s inequality, which is summarized below.

Theorem 2 (Hadamard’s Inequality). Suppose $K = [k_{ij}]$ is a non-negative definite symmetric n × n matrix. Then,

$$\det(K) \leq \prod_{i=1}^{n} k_{ii}, \tag{77}$$

with equality iff $k_{ij} = 0$ for i ≠ j.

Take the logarithm on both sides of (77) to obtain

$$\sum_{i=1}^{n} \log k_{ii} - \log \det(K) \geq 0. \tag{78}$$

Replacing the matrix K by $E\{y(t)y^\top(t)\}$, one can easily see that the cost function (76) attains its minimum iff $E\{y_i(t)y_j(t)\} = 0$ for i, j = 1, . . . , n and i ≠ j.
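The behaviour of the cost (76) can be checked numerically on a given correlation matrix. The sketch below uses a randomly generated positive definite matrix, which is an illustrative assumption; the function name `decorrelation_cost` is hypothetical.

```python
import numpy as np

def decorrelation_cost(R):
    """Cost (76) evaluated on a given correlation matrix R = E{y y^T}."""
    _, logdet = np.linalg.slogdet(R)
    return 0.5 * (np.sum(np.log(np.diag(R))) - logdet)

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
R_full = B @ B.T + 1e-3 * np.eye(4)     # positive definite, with cross-correlations
R_diag = np.diag(np.diag(R_full))       # same variances, cross-correlations removed

cost_full = decorrelation_cost(R_full)  # strictly positive
cost_diag = decorrelation_cost(R_diag)  # zero up to rounding
```

The cost is positive whenever cross-correlations are present and drops to zero once they are removed, as Hadamard's inequality predicts.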

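Writing $C(t) = E\{x(t)x^\top(t)\}$, so that $E\{y(t)y^\top(t)\} = W C(t) W^\top$, we compute

$$d\left\{ \log \det\left( E\{y(t)y^\top(t)\} \right) \right\} = 2\, d\{\log \det W\} + d\{\log \det C(t)\} = 2\, \mathrm{tr}\left\{ W^{-1} dW \right\} + d\{\log \det C(t)\}, \tag{79}$$

where the last term vanishes because C(t) does not depend on W. Define a modified differential matrix dV as

$$dV = dW\, W^{-1}, \tag{80}$$

so that $\mathrm{tr}\{W^{-1} dW\} = \mathrm{tr}\{dV\}$ by the cyclic property of the trace. Then, we have

$$d\left\{ \sum_{i=1}^{n} \log E\{y_i^2(t)\} \right\} = 2 E\left\{ y^\top(t)\, \Lambda^{-1}(t)\, dV\, y(t) \right\}, \tag{81}$$

where $\Lambda(t) = \mathrm{diag}\{E\{y_1^2(t)\}, \ldots, E\{y_n^2(t)\}\}$. Similarly, we can derive the learning algorithm for W that has the form

$$\Delta W(t) = \eta_t \left\{ I - \Lambda^{-1}(t)\, y(t) y^\top(t) \right\} W(t) = \eta_t \Lambda^{-1}(t) \left\{ \Lambda(t) - y(t) y^\top(t) \right\} W(t). \tag{82}$$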
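One step of this second-order update can be sketched as follows. The sketch assumes that the expectations in Λ(t) and in the outer product are replaced by averages over a short block of samples; this block averaging, the function name `nonstationary_step`, and the step size are choices made for illustration.

```python
import numpy as np

def nonstationary_step(W, X_block, eta=0.05):
    """One update (82), with expectations replaced by averages over a short block."""
    Y = W @ X_block
    R = Y @ Y.T / Y.shape[1]               # block estimate of E{y y^T}
    Lam_inv = np.diag(1.0 / np.diag(R))    # Lambda^{-1}(t)
    return W + eta * (np.eye(len(W)) - Lam_inv @ R) @ W

# If the outputs in a block are exactly uncorrelated, W is left unchanged.
Y_uncorr = np.array([[1.0, -1.0, 1.0, -1.0],
                     [2.0, 2.0, -2.0, -2.0]])
W_same = nonstationary_step(np.eye(2), Y_uncorr)

# Correlated outputs produce a nonzero, purely off-diagonal correction.
Y_corr = np.array([[1.0, -1.0, 1.0, -1.0],
                   [1.5, -0.5, 1.5, -0.5]])
W_moved = nonstationary_step(np.eye(2), Y_corr)
```

Because the diagonal of $I - \Lambda^{-1} E\{yy^\top\}$ is zero, the update only acts on cross-correlations; no higher-order statistics or source densities are involved.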


8 Spatial, Temporal, and Spatio-Temporal ICA

The ICA decomposition, X = AS, is inherently dual. Considering the data matrix $X \in \mathbb{R}^{m \times N}$, where each row is assumed to be the time course of an attribute, ICA decomposition produces n independent time courses. On the other hand, regarding the data matrix in the form of $X^\top$, ICA decomposition leads to n independent patterns (for instance, images in fMRI or arrays in DNA microarray data).

The standard ICA (where X is considered) is treated as temporal ICA (tICA). Its dual decomposition (regarding $X^\top$) is known as spatial ICA (sICA). Combining these two ideas leads to spatio-temporal ICA (stICA). These variations of ICA were first investigated in [31]. Spatial ICA and spatio-temporal ICA were shown to be useful in fMRI image analysis [31] and gene expression data analysis [32, 33].

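Suppose that the singular value decomposition (SVD) of X is given by

$$X = U D V^\top = \left( U D^{1/2} \right) \left( V D^{1/2} \right)^\top = \widetilde{U} \widetilde{V}^\top, \tag{83}$$

where $U \in \mathbb{R}^{m \times n}$, $D \in \mathbb{R}^{n \times n}$, and $V \in \mathbb{R}^{N \times n}$ for n ≤ min(m, N).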
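The factorization (83) can be computed with a truncated SVD. The sketch below uses a synthetic data matrix of exact rank n, which is an assumption made so that the rank-n factors reproduce X exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, n = 6, 50, 3                                        # illustrative sizes (assumed)
X = rng.normal(size=(m, n)) @ rng.laplace(size=(n, N))    # rank-n data matrix

U_full, d_full, Vt_full = np.linalg.svd(X, full_matrices=False)
U, d, V = U_full[:, :n], d_full[:n], Vt_full[:n, :].T     # keep the n leading components

U_tilde = U * np.sqrt(d)      # U D^{1/2}, shape (m, n)
V_tilde = V * np.sqrt(d)      # V D^{1/2}, shape (N, n)
X_tilde = U_tilde @ V_tilde.T
```

Splitting the singular values evenly between the two factors treats the spatial and temporal sides symmetrically, which is convenient when both are later unmixed.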

8.1 Temporal ICA

Temporal ICA finds a set of independent time courses and a corresponding set of dual unconstrained spatial patterns. It embodies the assumption that each row vector in $\widetilde{V}^\top$ consists of a linear combination of n independent sequences, i.e., $\widetilde{V}^\top = \widetilde{A}_T S_T$, where $S_T \in \mathbb{R}^{n \times N}$ has a set of n independent temporal sequences of length N and $\widetilde{A}_T \in \mathbb{R}^{n \times n}$ is an associated mixing matrix.

Unmixing by $Y_T = W_T \widetilde{V}^\top$, where $W_T = P \widetilde{A}_T^{-1}$, leads us to recover the n dual patterns $A_T$ associated with the n independent time courses, by calculating $A_T = \widetilde{U} W_T^{-1}$, which is a consequence of $\widetilde{X} = A_T Y_T = \widetilde{U} \widetilde{V}^\top = \widetilde{U} W_T^{-1} Y_T$.

8.2 Spatial ICA

Spatial ICA seeks a set of independent spatial patterns $S_S$ and a corresponding set of dual unconstrained time courses $A_S$. It embodies the assumption that each row vector in $\tilde{U}^\top$ is composed of a linear combination of $n$ independent spatial patterns, i.e., $\tilde{U}^\top = \tilde{A}_S S_S$, where $S_S \in \mathbb{R}^{n \times m}$ contains a set of $n$ independent $m$-dimensional patterns and $\tilde{A}_S \in \mathbb{R}^{n \times n}$ is an encoding variable matrix (mixing matrix).

Define $Y_S = W_S \tilde{U}^\top$, where $W_S$ is a permuted version of $\tilde{A}_S^{-1}$. With this definition, the $n$ dual time courses $A_S \in \mathbb{R}^{N \times n}$ associated with the $n$ independent patterns are computed by $A_S = \tilde{V} W_S^{-1}$, since $\tilde{X}^\top = A_S Y_S = \tilde{V} \tilde{U}^\top = \tilde{V} W_S^{-1} Y_S$. Each column vector of $A_S$ corresponds to a temporal mode.
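The dual reconstructions used in temporal and spatial ICA can be checked with a short sketch. It assumes NumPy; the factors `U_tilde` and `V_tilde` are random stand-ins for $U D^{1/2}$ and $V D^{1/2}$, and `W_T`, `W_S` are arbitrary invertible placeholders rather than matrices learned by an ICA algorithm. The helper `random_invertible` is hypothetical.

```python
import numpy as np

def random_invertible(k, rng):
    """Well-conditioned invertible matrix: orthogonal factor times a column scaling."""
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q * np.arange(1, k + 1)

rng = np.random.default_rng(1)
m, N, n = 5, 40, 3                        # hypothetical sizes
U_tilde = rng.standard_normal((m, n))     # stands in for U D^{1/2}
V_tilde = rng.standard_normal((N, n))     # stands in for V D^{1/2}
X_tilde = U_tilde @ V_tilde.T

# Placeholder unmixing matrices (an ICA algorithm would estimate these).
W_T = random_invertible(n, rng)
W_S = random_invertible(n, rng)

# Temporal ICA: time courses Y_T (n x N) and dual spatial patterns A_T (m x n).
Y_T = W_T @ V_tilde.T
A_T = U_tilde @ np.linalg.inv(W_T)

# Spatial ICA: spatial patterns Y_S (n x m) and dual time courses A_S (N x n).
Y_S = W_S @ U_tilde.T
A_S = V_tilde @ np.linalg.inv(W_S)
```

Whatever invertible unmixing matrix is used, `A_T @ Y_T` reproduces `X_tilde` and `A_S @ Y_S` reproduces its transpose, which is why the dual factors are called unconstrained.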

8.3 Spatio-temporal ICA

In linear decomposition, sICA enforces independence constraints over space to find a set of independent spatial patterns, whereas tICA enforces independence constraints over time to seek a set of independent time courses. Spatio-temporal ICA finds a linear decomposition by maximizing the degree of independence over space as well as over time, without necessarily producing independence in either space or time. In fact, it allows a trade-off between the independence of arrays and the independence of time courses.

Given $\tilde{X} = \tilde{U} \tilde{V}^\top$, stICA finds the following decomposition:

$$\tilde{X} = S_S^\top \Lambda S_T, \qquad (84)$$

where $S_S \in \mathbb{R}^{n \times m}$ contains a set of $n$ independent $m$-dimensional patterns, $S_T \in \mathbb{R}^{n \times N}$ has a set of $n$ independent temporal sequences of length $N$, and $\Lambda$ is a diagonal scaling matrix. There exist two $n \times n$ unmixing matrices, $W_S$ and $W_T$, such that $S_S = W_S \tilde{U}^\top$ and $S_T = W_T \tilde{V}^\top$. The following relation

$$\tilde{X} = S_S^\top \Lambda S_T = \tilde{U} W_S^\top \Lambda W_T \tilde{V}^\top = \tilde{U} \tilde{V}^\top, \qquad (85)$$

implies that $W_S^\top \Lambda W_T = I$, which leads to

$$W_T = \Lambda^{-1} W_S^{-\top}. \qquad (86)$$
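The constraint linking the two unmixing matrices can be verified numerically. The sketch below assumes NumPy, random stand-ins for $\tilde{U}$ and $\tilde{V}$, and arbitrary illustrative values for $W_S$ and $\Lambda$. It solves $W_S^\top \Lambda W_T = I$ for $W_T$, that is, $W_T = \Lambda^{-1} W_S^{-\top}$, and checks that $S_S^\top \Lambda S_T$ reproduces $\tilde{U} \tilde{V}^\top$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, N, n = 5, 40, 3                         # hypothetical sizes
U_tilde = rng.standard_normal((m, n))      # stands in for U D^{1/2}
V_tilde = rng.standard_normal((N, n))      # stands in for V D^{1/2}

# Illustrative spatial unmixing matrix and diagonal scaling (not learned).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
W_S = Q * np.array([1.0, 2.0, 0.5])
Lam = np.diag([2.0, 0.5, 1.5])

# Solve W_S^T Lam W_T = I for W_T.
W_T = np.linalg.solve(W_S.T @ Lam, np.eye(n))

S_S = W_S @ U_tilde.T                      # n x m spatial patterns
S_T = W_T @ V_tilde.T                      # n x N time courses
X_rec = S_S.T @ Lam @ S_T                  # should equal U_tilde V_tilde^T
```

Because $W_T$ is fixed by $W_S$ and $\Lambda$, an stICA optimizer only needs to search over $W_S$ and the diagonal of $\Lambda$.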

The linear transforms $W_S$ and $W_T$ are found by jointly optimizing objective functions associated with sICA and tICA. That is, the objective function for stICA has the form

$$J_{\mathrm{stICA}} = \alpha J_{\mathrm{sICA}} + (1 - \alpha) J_{\mathrm{tICA}}, \qquad (87)$$

where $J_{\mathrm{sICA}}$ and $J_{\mathrm{tICA}}$ could be infomax criteria or log-likelihood functions and $\alpha$ defines the relative weighting for spatial independence and temporal independence.

More details on stICA can be found in [31].


9 Algebraic Methods for BSS

Up to now, we have introduced on-line ICA algorithms in a framework of unsupervised learning. In this section, we explain several algebraic methods for BSS in which matrix decomposition plays a critical role.

9.1 Fundamental principle for algebraic BSS

Algebraic methods for BSS often make use of the eigen-decomposition of correlation matrices or cumulant matrices. Exemplary algebraic methods for BSS include FOBI [34], AMUSE [35], JADE [36], SOBI [37], and SEONS [38]. Some of these methods (FOBI and AMUSE) are based on simultaneous diagonalization of two symmetric matrices. Methods such as JADE, SOBI, and SEONS make use of joint approximate diagonalization of multiple matrices (more than two). The following theorem provides a fundamental principle for algebraic BSS, justifying why simultaneous diagonalization of two symmetric data matrices (one of which is assumed to be positive definite) provides a solution to BSS.

Theorem 3. Let $\Lambda_1, D_1 \in \mathbb{R}^{n \times n}$ be diagonal matrices with positive diagonal entries and $\Lambda_2, D_2 \in \mathbb{R}^{n \times n}$ be diagonal matrices with non-zero diagonal entries. Suppose that $G \in \mathbb{R}^{n \times n}$ satisfies the following decompositions:

$$D_1 = G \Lambda_1 G^\top, \qquad (88)$$

$$D_2 = G \Lambda_2 G^\top. \qquad (89)$$

Then the matrix $G$ is a generalized permutation matrix, i.e., $G = P \Lambda$, if $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ have distinct diagonal entries.

Proof. It follows from (88) that there exists an orthogonal matrix $Q$ such that

$$G \Lambda_1^{1/2} = D_1^{1/2} Q. \qquad (90)$$

Hence,

$$G = D_1^{1/2} Q \Lambda_1^{-1/2}. \qquad (91)$$

Substitute (91) into (89) to obtain

$$D_1^{-1} D_2 = Q \Lambda_1^{-1} \Lambda_2 Q^\top. \qquad (92)$$

Since the right-hand side of (92) is the eigen-decomposition of the left-hand side of (92), the diagonal elements of $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ are the same. From the assumption that the diagonal elements of $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ are distinct, the orthogonal matrix $Q$ must have the form $Q = P \Psi$, where $\Psi$ is a diagonal matrix whose diagonal elements are either $+1$ or $-1$. Hence, we have

$$G = D_1^{1/2} P \Psi \Lambda_1^{-1/2} = P P^\top D_1^{1/2} P \Psi \Lambda_1^{-1/2} = P \Lambda, \qquad (93)$$

where

$$\Lambda = P^\top D_1^{1/2} P \Psi \Lambda_1^{-1/2},$$

which completes the proof.
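The two steps that the proof states without detail, the existence of $Q$ in (90) and the passage from (91) to (92), can be spelled out as follows. This is a worked expansion using only the symbols of Theorem 3 and the fact that diagonal matrices commute.

```latex
% Existence of Q in (90): define Q := D_1^{-1/2} G \Lambda_1^{1/2}. Then, by (88),
\begin{align*}
Q Q^\top &= D_1^{-1/2} \, G \Lambda_1 G^\top \, D_1^{-1/2}
          = D_1^{-1/2} D_1 D_1^{-1/2} = I .
\end{align*}
% From (91) to (92): insert G = D_1^{1/2} Q \Lambda_1^{-1/2} into (89),
\begin{align*}
D_2 &= D_1^{1/2} Q \Lambda_1^{-1/2} \Lambda_2 \Lambda_1^{-1/2} Q^\top D_1^{1/2}
     = D_1^{1/2} Q \, \Lambda_1^{-1} \Lambda_2 \, Q^\top D_1^{1/2}, \\
D_1^{-1/2} D_2 D_1^{-1/2} &= Q \, \Lambda_1^{-1} \Lambda_2 \, Q^\top, \\
D_1^{-1} D_2 &= Q \, \Lambda_1^{-1} \Lambda_2 \, Q^\top .
\end{align*}
```

The last line uses $D_1^{-1/2} D_2 D_1^{-1/2} = D_1^{-1} D_2$, which holds because $D_1$ and $D_2$ are both diagonal.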

9.2 AMUSE

As an example of Theorem 3, we briefly explain AMUSE [35] where a BSS solution is determined by simultaneously diagonalizing the equal-time correlation matrix of x(t) and a time-delayed correlation matrix of x(t).

Let us assume that sources $\{s_i(t)\}$ (entries of $s(t)$) are uncorrelated stochastic processes with zero mean, i.e.,

$$E\{s_i(t) s_j(t - \tau)\} = \delta_{ij} \gamma_i(\tau), \qquad (94)$$

where $\delta_{ij}$ is the Kronecker delta and $\gamma_i(\tau)$ are distinct for $i = 1, \ldots, n$, given $\tau$. In other words, the equal-time correlation matrix of the sources, $R_{ss}(0) = E\{s(t) s^\top(t)\}$, is a diagonal matrix with distinct diagonal entries. Moreover, a time-delayed correlation matrix of the sources, $R_{ss}(\tau) = E\{s(t) s^\top(t - \tau)\}$, is diagonal as well, with distinct non-zero diagonal entries.

It follows from (2) that the correlation matrices of the observation vector $x(t)$ satisfy

$$R_{xx}(0) = A R_{ss}(0) A^\top, \qquad (95)$$

$$R_{xx}(\tau) = A R_{ss}(\tau) A^\top, \qquad (96)$$

for some non-zero time-lag $\tau$, and both $R_{ss}(0)$ and $R_{ss}(\tau)$ are diagonal matrices since sources are assumed to be spatially uncorrelated.

Invoking Theorem 3, one can easily see that the inverse of the mixing matrix, $A^{-1}$, can be identified up to its re-scaled and permuted version by the simultaneous diagonalization of $R_{xx}(0)$ and $R_{xx}(\tau)$, provided that $R_{ss}^{-1}(0) R_{ss}(\tau)$ has distinct diagonal elements. In other words, we determine a linear transformation $W$ such that $R_{yy}(0)$ and $R_{yy}(\tau)$ of the output $y(t) = W x(t)$ are simultaneously diagonalized:

$$R_{yy}(0) = (W A) R_{ss}(0) (W A)^\top, \qquad R_{yy}(\tau) = (W A) R_{ss}(\tau) (W A)^\top.$$

It follows from Theorem 3 that $W A$ becomes the transparent transformation, i.e., a generalized permutation matrix.

9.3 Simultaneous diagonalization

We explain how two symmetric matrices are simultaneously diagonalized by a linear transformation. More details on simultaneous diagonalization can be found in [39]. Simultaneous diagonalization consists of two steps (whitening followed by a unitary transformation):

(1) First, the matrix $R_{xx}(0)$ is whitened by

$$z(t) = D_1^{-1/2} U_1^\top x(t), \qquad (97)$$

where $D_1$ and $U_1$ are the eigenvalue and eigenvector matrices of $R_{xx}(0)$ as

$$R_{xx}(0) = U_1 D_1 U_1^\top. \qquad (98)$$

Then we have

$$R_{zz}(0) = D_1^{-1/2} U_1^\top R_{xx}(0) U_1 D_1^{-1/2} = I_m, \qquad R_{zz}(\tau) = D_1^{-1/2} U_1^\top R_{xx}(\tau) U_1 D_1^{-1/2}.$$

(2) Second, a unitary transformation is applied to diagonalize the matrix $R_{zz}(\tau)$. The eigen-decomposition of $R_{zz}(\tau)$ has the form

$$R_{zz}(\tau) = U_2 D_2 U_2^\top. \qquad (99)$$

Then $y(t) = U_2^\top z(t)$ satisfies

$$R_{yy}(0) = U_2^\top R_{zz}(0) U_2 = I_m, \qquad R_{yy}(\tau) = U_2^\top R_{zz}(\tau) U_2 = D_2.$$

Thus both matrices $R_{xx}(0)$ and $R_{xx}(\tau)$ are simultaneously diagonalized by a linear transform $W = U_2^\top D_1^{-1/2} U_1^\top$. It follows from Theorem 3 that $W = U_2^\top D_1^{-1/2} U_1^\top$ is a valid demixing matrix if all the diagonal elements of $D_2$ are distinct.
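The two steps above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and sample estimates of the correlation matrices; the function names `lagged_corr` and `amuse`, the toy sinusoidal sources, and the mixing matrix `A` are hypothetical. The sample estimate of $R_{zz}(\tau)$ is symmetrized before the eigen-decomposition, which is a common practical step added here and not stated in the text.

```python
import numpy as np

def lagged_corr(x, tau):
    """Sample estimate of E{x(t) x(t - tau)^T} for zero-mean rows of x."""
    n_samples = x.shape[1]
    return x[:, tau:] @ x[:, :n_samples - tau].T / (n_samples - tau)

def amuse(x, tau=1):
    """Two-step simultaneous diagonalization of R_xx(0) and R_xx(tau)."""
    x = x - x.mean(axis=1, keepdims=True)
    # Step 1: whiten using the eigen-decomposition of R_xx(0), cf. (97)-(98).
    d1, U1 = np.linalg.eigh(lagged_corr(x, 0))
    whitener = (U1 / np.sqrt(d1)).T           # D1^{-1/2} U1^T
    z = whitener @ x
    # Step 2: diagonalize R_zz(tau), cf. (99). The estimate is symmetrized
    # first (an added assumption for numerical convenience).
    Rz = lagged_corr(z, tau)
    _, U2 = np.linalg.eigh((Rz + Rz.T) / 2)
    return U2.T @ whitener                    # W = U2^T D1^{-1/2} U1^T

# Toy example: two sinusoids with different lagged autocorrelations.
N = 1000
t = np.arange(N)
S = np.vstack([np.sin(2 * np.pi * 10 * t / N),
               np.sin(2 * np.pi * 37 * t / N)])
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                    # hypothetical mixing matrix
W = amuse(A @ S, tau=5)
G = W @ A                                     # global transformation
```

In this toy setting each row of `G` should have a single dominant entry, so the outputs are scaled and possibly reordered copies of the sources.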


9.4 Generalized eigenvalue problem

The simultaneous diagonalization of two symmetric matrices can be carried out without going through the two-step procedure. From the discussion in Section 9.3, we have

$$W R_{xx}(0) W^\top = I_n, \qquad (100)$$

$$W R_{xx}(\tau) W^\top = D_2. \qquad (101)$$

The linear transformation $W$ which satisfies (100) and (101) is given by the eigenvectors of $R_{xx}^{-1}(0) R_{xx}(\tau)$ [39]: the rows of $W$ (the columns of $W^\top$) are eigenvectors, since (100) and (101) imply $R_{xx}^{-1}(0) R_{xx}(\tau) W^\top = W^\top D_2$. In other words, $W^\top$ is the generalized eigenvector matrix of the pencil $R_{xx}(\tau) - \lambda R_{xx}(0)$ [40].
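One way to obtain such a $W$ in a single pass is to reduce the symmetric-definite generalized eigenproblem to a standard symmetric one with a Cholesky factor. The sketch below assumes NumPy, exact (not estimated) correlation matrices built from a hypothetical mixing matrix, and a hypothetical function name `simultaneous_diagonalizer`.

```python
import numpy as np

def simultaneous_diagonalizer(R0, Rtau):
    """Return (W, lam) with W R0 W^T = I and W Rtau W^T = diag(lam).

    Assumes R0 is symmetric positive definite and Rtau is symmetric.
    """
    L = np.linalg.cholesky(R0)             # R0 = L L^T
    L_inv = np.linalg.inv(L)
    C = L_inv @ Rtau @ L_inv.T             # equivalent standard symmetric problem
    lam, V = np.linalg.eigh(C)             # C = V diag(lam) V^T
    return V.T @ L_inv, lam

# Hypothetical mixing matrix and diagonal source correlation matrices.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
R0 = A @ np.diag([1.0, 2.0]) @ A.T         # plays the role of R_xx(0)
Rtau = A @ np.diag([0.8, -0.5]) @ A.T      # plays the role of R_xx(tau)

W, lam = simultaneous_diagonalizer(R0, Rtau)
# lam holds the ratios of source correlations: -0.5/2.0 and 0.8/1.0.
```

The rows of `W` satisfy the eigenvector relation for `inv(R0) @ Rtau`, and the returned eigenvalues are the diagonal entries of $R_{ss}^{-1}(0) R_{ss}(\tau)$ in ascending order.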

Recently Chang et al. proposed the matrix pencil method for BSS [41], where they exploited $R_{xx}(\tau_1)$ and $R_{xx}(\tau_2)$ for $\tau_1 \neq \tau_2 \neq 0$. Since the noise vector was assumed to be temporally white, the two matrices $R_{xx}(\tau_1)$ and $R_{xx}(\tau_2)$ are not theoretically affected by the noise vector, i.e.,

$$R_{xx}(\tau_1) = A R_{ss}(\tau_1) A^\top, \qquad (102)$$

$$R_{xx}(\tau_2) = A R_{ss}(\tau_2) A^\top. \qquad (103)$$

Thus we can find an estimate of the demixing matrix that is not sensitive to the white noise. A similar idea was also exploited in [42, 43].

In general, the generalized eigenvalue decomposition requires a symmetric-definite pencil (one matrix is symmetric and the other is symmetric and positive definite). However, $R_{xx}(\tau_2) - \lambda R_{xx}(\tau_1)$ is not symmetric-definite, which might cause numerical instability and result in complex-valued eigenvectors.
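A small example illustrates why definiteness matters. The sketch assumes NumPy and uses two hypothetical symmetric $2 \times 2$ matrices, neither of which is positive definite; they are toy values, not correlation matrices estimated from data. Even in exact arithmetic, the generalized eigenvalues of such a pencil need not be real.

```python
import numpy as np

# Two symmetric matrices; R2 is indefinite, so the pencil R1 - lambda R2
# is symmetric but not symmetric-definite (toy values for illustration).
R1 = np.array([[0.0, 1.0],
               [1.0, 0.0]])
R2 = np.array([[1.0, 0.0],
               [0.0, -1.0]])

# Generalized eigenvalues of the pencil are the eigenvalues of R2^{-1} R1,
# which here is the rotation-like matrix [[0, 1], [-1, 0]].
lam = np.linalg.eigvals(np.linalg.solve(R2, R1))
print(lam)   # a complex-conjugate pair, +1j and -1j
```

With a symmetric-definite pencil the same computation always yields real eigenvalues, which is the situation covered by the theorem that follows.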

The set of all matrices of the form $R_1 - \lambda R_2$ with $\lambda \in \mathbb{R}$ is said to be a pencil. Frequently we encounter the case where $R_1$ is symmetric and $R_2$ is symmetric and positive definite. Pencils of this variety are referred to as symmetric-definite pencils [44].

Theorem 4 (p. 468 in [44]). If $R_1 - \lambda R_2$ is symmetric-definite, then there exists a nonsingular matrix $U = [u_1, \ldots, u_n]$ such that

$$U^\top R_1 U = \mathrm{diag}\{\gamma_1(\tau_1), \ldots, \gamma_n(\tau_1)\}, \qquad (104)$$

$$U^\top R_2 U = \mathrm{diag}\{\gamma_1(\tau_2), \ldots, \gamma_n(\tau_2)\}. \qquad (105)$$

Moreover, $R_1 u_i = \lambda_i R_2 u_i$ for $i = 1, \ldots, n$, and $\lambda_i = \dfrac{\gamma_i(\tau_1)}{\gamma_i(\tau_2)}$.

It is apparent from Theorem 4 that $R_1$ should be symmetric and $R_2$ should be symmetric and positive definite so that the generalized eigenvector matrix $U$ can be a valid solution if $\{\lambda_i\}$ are distinct.
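The eigenvalue formula in Theorem 4 follows directly from (104) and (105). A short derivation is given below; $\Gamma_1$ and $\Gamma_2$ are shorthand introduced here for the two diagonal matrices on the right-hand sides of (104) and (105).

```latex
% Shorthand: \Gamma_k = \mathrm{diag}\{\gamma_1(\tau_k), \ldots, \gamma_n(\tau_k)\}, k = 1, 2.
% \Gamma_2 is invertible because R_2 is positive definite and U is nonsingular.
\begin{align*}
U^\top R_1 U = \Gamma_1 \;&\Longrightarrow\; R_1 U = U^{-\top} \Gamma_1, \\
U^\top R_2 U = \Gamma_2 \;&\Longrightarrow\; U^{-\top} = R_2 U \Gamma_2^{-1}, \\
\text{hence}\quad R_1 U &= R_2 U \, \Gamma_2^{-1} \Gamma_1
\;\Longleftrightarrow\;
R_1 u_i = \frac{\gamma_i(\tau_1)}{\gamma_i(\tau_2)} \, R_2 u_i , \quad i = 1, \ldots, n .
\end{align*}
```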


10 Software

A variety of ICA software packages are available. ICA Central was created in 1999 to promote research on ICA and blind source separation by means of public mailing lists, a repository of data sets, a repository of ICA/BSS algorithms, and so on. ICA Central might be the first place where you can find data sets and ICA algorithms. In addition, several widely used software packages include:

• ICALAB Toolboxes (http://www.bsp.brain.riken.go.jp/ICALAB/): ICALAB is an ICA Matlab software toolbox developed in the Laboratory for Advanced Brain Signal Processing at the RIKEN Brain Science Institute, Japan. It consists of two independent packages, ICALAB for signal processing and ICALAB for image processing, and each package contains a variety of algorithms.

• FastICA (http://www.cis.hut.fi/projects/ica/fastica/): The FastICA Matlab package implements fast fixed-point algorithms for non-Gaussianity maximization [16]. It was developed at Helsinki University of Technology, Finland, and versions for other environments (R, C++, Python) are also available.

• Infomax ICA (http://www.cnl.salk.edu/~tewon/ica_cnl.html): Matlab and C codes for Bell and Sejnowski's Infomax algorithm [4] and extended infomax [15], where a parametric density model is incorporated into Infomax to handle both super-Gaussian and sub-Gaussian sources.

• EEGLAB (http://sccn.ucsd.edu/eeglab/): EEGLAB is an interactive Matlab toolbox for processing continuous and event-related EEG, MEG and other electrophysiological data using ICA, time/frequency analysis, artifact rejection, and several modes of data visualization.

• ICA: DTU Toolbox (http://isp.imm.dtu.dk/toolbox/ica/): 'ICA: DTU Toolbox' is a collection of ICA algorithms that includes: (1) 'icaML', an efficient implementation of Infomax; (2) 'icaMF', an iterative algorithm that offers a variety of possible source priors and mixing matrix constraints (e.g. positivity) and can also handle over- and under-complete mixing; (3) 'icaMS', a 'one shot' fast algorithm that requires time correlation between samples.

11 Further Issues

• Overcomplete representation: Overcomplete representation enforces the latent space dimension n to be greater than the data dimension m in the linear model (1). Sparseness constraints on latent variables are necessary to learn a fruitful representation [45].

• Bayesian ICA: Bayesian ICA incorporates uncertainty and prior distributions of latent variables into the model (1). Independent factor analysis [46] is a pioneering work along this direction. An EM algorithm for ICA was developed in [47] and a full Bayesian ICA (also known as ensemble learning) was developed in [48].
