• No results found

Sparsifying Basis Learning

Sec.5.4.2shows that if an event is “sparsifiable” we can better separate H0 and H1 by projecting

(multiplying by a basis B) the observations y onto a different coordinate system where the signal is concentrated into fewer components (a “sparser representation” of the signal). The Haar wavelet basis is an example of a basis that improves detection of signals with certain structured (hierarchical) dependencies. In general, can we construct or learn a sparsifying basis without assuming such dependencies?

LetBbe an orthonormal matrix andxa vector of uncorrupted binary activations, Theorem5.4.1

states that the sparsification ratio ||x||0

||BTx||0 directly impacts the amount of separation betweenH0and

H1. In fact, given that||x||0is fixed, the two hypotheses are maximally separated when||BTx||0 is

problem:

arg min

B

||BTX||0, subject toBBT =I (5.3)

whereXis a matrix that contains binary observations as its columns and||·||0is the sum of non-zero

elements in the matrix. The constraintBBT =Iensures thatBremains orthonormal.

However, direct minimization of ||BTX||

0 is NP-hard in general [30]. In practice, the`0-norm

is often replaced by the convex and “sparsity-promoting”`1-norm [18]. This suggests the following

relaxation heuristic for (5.3).

arg min

B

||BTX||1, subject toBBT =I (5.4)

where|| · ||1is the maximum absolute column sum of the matrix.

Direct approximation. For large problems, we are interested in efficiently computable heuristics for Eq. (5.4). Independent Component Analysis (ICA) is one such approximation, and solves the following optimization problem

arg min

B

G(BTX), subject to BBT =I (5.5)

where Gis a nonlinear convex smooth approximation to the `1 penalty function, e.g., log cosh(x),

−exp(−x2/2), and x4 [65]. Fig. 5.5 illustrates these functions in relation to the linear penalty function.

Eq. (5.5) can be solved with stochastic gradient algorithm by taking the derivative ofG. However this approach is often slow and requires fine tuning; this leads to the development of “FastICA”, an efficient fixed-point algorithm. Implementation details of FastICA and in-depth analysis can be found in [65].

Letg=G0, the one unit algorithm for fastICA is given below for completeness.

Algorithm 2: ICAone-unit solution

b←random unit vector

while b not converged do

b←Exg(bTx)−Eg0(bTx)x;

b←b/kbk;

There exists two variations of FastICA, both of which use Alg.2as subroutine. Thedeflationary orthogonalization greedily finds and fixes one component bi at a time. The symmetric orthogo-

nalization finds all componentsbi,∀i= 1, . . . , p and orthogonalizes all of them at the end of each

iteration [64]. While the symmetric approach utilizes parallelization and is multitudes faster than

−2

−1.5

−1

−0.5

0

0.5

1

1.5

2

−1.5

−1

−0.5

0

0.5

1

1.5

2

L

1

penalty

log(cosh(x))

y

4

Figure 5.8: Smooth `1 approximation functions used in ICA with the linear `1 penalty function

plotted in blue solid line.

ICA is a mature technique that finds a linear transformation for non-Gaussian data so that the components are as independent as possible in the transform domain. For unlabeled training data

x∈Rp×n where pis the dimension of the data andn the number of samples, ICA is traditionally

defined as the following optimization problem [65]:

Noise-tolerant relaxed approximation. Ideally we want to learn from noise-free observations

X. However, training data constructed from real-world measurements will contain noise or outliers, and instead we are forced to train withY, which is the observation matrixXcorrupted with noise. Consequently, we may not be able to obtain the “best” basis by optimizing BTY as in ICA.

Instead, we may wish to find a basis that sparsely represents “most of” the observations. More formally, we introduce a latent matrixZ, which can be thought of as the “cause”, in the transform domain, of the noise-free signalX. In other wordsX=BZ. We desireZto be sparse, andBZto be close to the observed signalY. This motivates the next optimization:

arg min

B,Z

kY−BZk2

F+λkZk1, subject toBBT =I (5.6)

where || · ||F is the matrix Frobenius norm, and λ > 0 is a free parameter. Eq. (5.6) essentially

balances the difference betweenYandXwith the sparsity ofZ: increasingλmore strongly penalizes choices ofZthat are not sparse.

to the other convex. The objective can then be solved in an iterative two-step convex optimization process — Orthogonal Procrustes [54] and LASSO with orthonormal design [15]. The two-step procedure is given below.

Algorithm 3: SLSA two-step convex optimization procedure

Step 1: Orthogonal Procrustes

FixZ, solve minB||Y−BZ||2F,:BB

T =I

M ←YZT;

M =UΣVT ;

B←U V;

Step 2: LASSO with orthonormal design

FixB, solve minZkY−BZk2F+λkZk1 K←ZTY;

Z←sign(K)×max(|K| −λ);

The formulation of Eq. (5.6) and solution in Alg. 3 is equivalent to Sparse Latent Semantic Analysis (SLSA) [19], which was introduced for applications involving topic models for text data. Here we adopt the name for consistency.

We note that both Eq. (5.5) and Eq. (5.6) should be viewed as efficiently computable heuristics for Eq. (5.3), which is a non-convex optimization over the Stiefel manifold of all size-porthonormal matrices. As such, they are practical expedients towards the goal of obtaining a sparsifying basis.