CHAPTER 3 DEEP MODELS MADE INTERPRETABLE: A SPARSE

3.1 Background and Related Work

Albeit effective, conventional sparse coding models rely on iterative approximation algorithms, whose inherently sequential structure and data-dependent latency often constitute a major bottleneck in computational efficiency. Besides, the joint optimization of the (unsupervised) feature learning and the supervised steps often has to rely on solving a complex bi-level optimization problem [165], which constitutes another efficiency bottleneck. Further, to effectively represent datasets of growing sizes, sparse coding has to resort to larger dictionaries. Since the inference complexity of sparse coding increases more than linearly with respect to the dictionary size [165], its scalability turns out to be limited. Other “shallow” models suffer from similar problems.

Deep learning has recently attracted great attention [89]. The advantages of deep learning lie in its composition of multiple non-linear transformations to yield more abstract and descriptive representations. The feed-forward networks can be tuned jointly with task-driven loss functions [163]. With the aid of gradient descent, training also scales linearly in time and space with the number of training samples. There has been a booming interest in bridging “shallow” optimization and deep learning models. Our work shares the spirit of this prior work, while exploring many more novel models and gaining further insights. In this chapter, we derive deep models from the $\ell_0$ sparse approximation model, the graph-regularized $\ell_1$ sparse approximation model, the $\ell_\infty$-constrained model, and the dual-sparsity model. The structural priors inferred from all those models act as effective network regularizations, and lead to improved generalization ability. The resulting deep models also enjoy faster inference, larger learning capacity, and better scalability, compared to their “shallow” counterparts.

$\ell_0$- and $\ell_1$-based Sparse Approximations

Finding the sparsest, or minimum $\ell_0$-norm, representation of a signal given a dictionary of basis atoms is an important problem in many application domains. Consider a data sample $\mathbf{x} \in \mathbb{R}^{m \times 1}$ that is encoded into its sparse code $\mathbf{a} \in \mathbb{R}^{p \times 1}$ using a learned dictionary $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \cdots, \mathbf{d}_p]$, where $\mathbf{d}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \cdots, p$, are the learned atoms. The sparse codes are obtained by solving the $\ell_0$-regularized problem ($\lambda$ is a constant):

$$\mathbf{a} = \arg\min_{\mathbf{a}} \frac{1}{2}\|\mathbf{x} - \mathbf{D}\mathbf{a}\|_F^2 + \lambda\|\mathbf{a}\|_0. \tag{3.1}$$

Alternatively, one could explicitly impose a constraint on the number of non-zero coefficients of the solution by solving the $M$-sparse problem:

$$\mathbf{a} = \arg\min_{\mathbf{a}} \|\mathbf{x} - \mathbf{D}\mathbf{a}\|_F^2 \quad \text{s.t.} \quad \|\mathbf{a}\|_0 \leq M. \tag{3.2}$$

Unfortunately, these optimization problems are often intractable because there is a combinatorial increase in the number of local minima as the number of candidate basis vectors increases. One potential remedy is to employ a convex surrogate measure, such as the $\ell_1$-norm, in place of the $\ell_0$-norm, which leads to a more tractable optimization problem. For example, (3.1) could be relaxed as:

$$\mathbf{a} = \arg\min_{\mathbf{a}} \frac{1}{2}\|\mathbf{x} - \mathbf{D}\mathbf{a}\|_F^2 + \lambda\|\mathbf{a}\|_1. \tag{3.3}$$

This relaxation creates a unimodal optimization problem that can be solved via linear programming techniques. The downside is that we have now introduced a mismatch between the ultimate goal and the objective function [169]. Under certain conditions, the minimum $\ell_1$-norm solution equals the minimum $\ell_0$-norm one [56]. But in practice, the $\ell_1$ approximation is often used well beyond these conditions, and is thus quite heuristic. As a result, we often get a solution that does not exactly minimize the original $\ell_0$-norm.
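To make the mismatch concrete, consider the special case $\mathbf{D} = \mathbf{I}$ (an assumption made here purely for illustration; the text does not restrict the dictionary in this way). Both (3.1) and (3.3) then decouple across coordinates and admit closed-form solutions: (3.1) keeps $x_i$ unchanged when $|x_i| > \sqrt{2\lambda}$ and sets it to zero otherwise (hard thresholding), whereas (3.3) shrinks every surviving coefficient by $\lambda$ (soft thresholding). The sketch below uses hypothetical helper names and toy numbers chosen only for this example.

```python
import math

def hard_threshold(x, lam):
    """Minimizer of 0.5*||x - a||^2 + lam*||a||_0 when D = I (assumed)."""
    cutoff = math.sqrt(2.0 * lam)
    return [xi if abs(xi) > cutoff else 0.0 for xi in x]

def soft_threshold(x, lam):
    """Minimizer of 0.5*||x - a||^2 + lam*||a||_1 when D = I (assumed)."""
    out = []
    for xi in x:
        mag = max(abs(xi) - lam, 0.0)
        out.append(0.0 if mag == 0.0 else math.copysign(mag, xi))
    return out

def l0_objective(x, a, lam):
    """Value of the objective in (3.1) with D = I."""
    residual = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
    nonzeros = sum(1 for ai in a if ai != 0.0)
    return 0.5 * residual + lam * nonzeros

# Toy data, chosen only for illustration.
x = [3.0, -0.5, 1.2, -2.0]
lam = 1.0

a_l0 = hard_threshold(x, lam)
a_l1 = soft_threshold(x, lam)

print("l0 solution:", a_l0)
print("l1 solution:", a_l1)
print("l0 objective at l0 solution:", l0_objective(x, a_l0, lam))
print("l0 objective at l1 solution:", l0_objective(x, a_l1, lam))
```

In this toy case the two solutions differ in support as well as in magnitude: the coefficient 1.2 survives soft thresholding because it exceeds $\lambda = 1$, but not hard thresholding because it lies below $\sqrt{2} \approx 1.41$. The $\ell_1$ solution therefore attains a strictly larger value of the objective in (3.1) than the $\ell_0$ solution does.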

That said, the $\ell_1$ approximation is found to work well in practice for many sparse coding problems. Yet in certain applications, such as basis selection [169], we intend to control the exact number of non-zero elements, and the $\ell_0$ approximation becomes indispensable. Beyond that, the $\ell_0$ approximation is desirable for performance reasons in many ways. In the compressive sensing literature, empirical evidence [26] suggested that using an iterative reweighted $\ell_1$ scheme to approximate the $\ell_0$ solution often improved the quality of signal recovery. In image enhancement, it was shown in [180] that $\ell_0$ data fidelity was more suitable for reconstructing images corrupted with impulse noise. For the purpose of image smoothing, the authors of [173] utilized $\ell_0$ gradient minimization to globally control how many non-zero gradients are used to approximate prominent structures in a structure-sparsity-management manner. Recent work [159] revealed that $\ell_0$ sparse subspace clustering can completely characterize the set of minimal union-of-subspace structures, without the additional separation conditions required by its $\ell_1$ counterpart.

Network Implementation of $\ell_1$-Approximation

Figure 3.1: A LISTA network [68] with two time-unfolded stages.

In [68], a feed-forward neural network, as illustrated in Figure 3.1, was proposed to efficiently approximate the $\ell_1$-based sparse code $\mathbf{a}$ of the input signal $\mathbf{x}$; the sparse code $\mathbf{a}$ is obtained by solving (3.3) for a dictionary $\mathbf{D}$ given in advance. The network has a finite number of stages, each of which updates the intermediate sparse code $\mathbf{z}^k$ ($k = 1, 2$) according to

$$\mathbf{z}^{k+1} = s_\theta(\mathbf{W}\mathbf{x} + \mathbf{S}\mathbf{z}^k), \tag{3.4}$$

where $s_\theta$ is an element-wise shrinkage function ($\mathbf{u}$ is a vector and $u_i$ is its $i$-th element, $i = 1, 2, \ldots, p$):

$$[s_\theta(\mathbf{u})]_i = \mathrm{sign}(u_i)(|u_i| - \theta_i)_+. \tag{3.5}$$

The parameterized encoder, named learned ISTA (LISTA), is a natural network implementation of the iterative shrinkage and thresholding algorithm (ISTA). LISTA learns all its parameters $\mathbf{W}$, $\mathbf{S}$ and $\theta$ from training data using a back-propagation algorithm [93]. In this way, a good approximation of the underlying sparse code can be obtained after a small, fixed number of stages.
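The correspondence between ISTA and the update (3.4) can be illustrated with a short sketch. For (3.3), one ISTA step with step size $1/L$ reads $\mathbf{z}^{k+1} = s_{\lambda/L}\big(\mathbf{z}^k + \tfrac{1}{L}\mathbf{D}^\top(\mathbf{x} - \mathbf{D}\mathbf{z}^k)\big)$, which has the form of (3.4) with $\mathbf{W} = \mathbf{D}^\top/L$, $\mathbf{S} = \mathbf{I} - \mathbf{D}^\top\mathbf{D}/L$ and $\theta_i = \lambda/L$. The pure-Python code below rests on several assumptions that are not stated in the text: the initial code is zero, $L$ is taken as the squared Frobenius norm of $\mathbf{D}$ (a valid but loose upper bound on the largest eigenvalue of $\mathbf{D}^\top\mathbf{D}$), and all function names and toy values are hypothetical. It evaluates the forward pass only; the back-propagation training that distinguishes LISTA from ISTA is not shown.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def shrink(u, theta):
    """Element-wise shrinkage (3.5): sign(u_i) * max(|u_i| - theta_i, 0)."""
    out = []
    for ui, ti in zip(u, theta):
        mag = max(abs(ui) - ti, 0.0)
        if mag == 0.0:
            out.append(0.0)
        else:
            out.append(mag if ui > 0 else -mag)
    return out

def ista_init(D, lam):
    """Hypothetical helper: (W, S, theta) that make (3.4) equal to one ISTA step."""
    m, p = len(D), len(D[0])
    # Assumption: L = ||D||_F^2, an upper bound on lambda_max(D^T D).
    L = sum(d * d for row in D for d in row)
    Dt = [[D[i][j] for i in range(m)] for j in range(p)]  # D^T, shape p x m
    W = [[v / L for v in row] for row in Dt]
    S = [[(1.0 if j == k else 0.0)
          - sum(Dt[j][i] * D[i][k] for i in range(m)) / L
          for k in range(p)] for j in range(p)]
    theta = [lam / L] * p
    return W, S, theta

def lista_forward(x, W, S, theta, num_stages=2):
    """Apply update (3.4) num_stages times, starting from z = 0 (assumed)."""
    Wx = matvec(W, x)
    z = [0.0] * len(Wx)
    for _ in range(num_stages):
        Sz = matvec(S, z)
        z = shrink([a + b for a, b in zip(Wx, Sz)], theta)
    return z

def lasso_objective(D, x, a, lam):
    """Value of the objective in (3.3)."""
    r = [xi - yi for xi, yi in zip(x, matvec(D, a))]
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(ai) for ai in a)

# Toy dictionary with m = 2 and p = 3 unit-norm atoms, for illustration only.
D = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
x = [2.0, 1.0]
lam = 0.1

W, S, theta = ista_init(D, lam)
for stages in (2, 20, 200):
    z = lista_forward(x, W, S, theta, num_stages=stages)
    print(stages, "stages -> objective", lasso_objective(D, x, z, lam))
```

With these untrained, ISTA-equivalent parameters, the objective in (3.3) is non-increasing as stages are added, since each stage is an ISTA step with a step size of at most $1/\lambda_{\max}(\mathbf{D}^\top\mathbf{D})$. Learning $\mathbf{W}$, $\mathbf{S}$ and $\theta$ from data, as described above, aims to reach a comparable approximation with far fewer stages.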

In [146], the authors leveraged a similar idea on fast trainable regressors and constructed feed-forward network approximations of the learned sparse models. Such a process-centric view was later extended in [145] to develop a principled process of learned deterministic fixed-complexity pursuits, in lieu of iterative proximal gradient descent algorithms, for structured sparse and robust low-rank models. Recently, [75] summarized the methodology of problem-level and model-based “deep unfolding”, and developed new architectures as inference algorithms for both Markov random fields and non-negative matrix factorization.
