Universal Regularizers For Robust Sparse Coding and Modeling

(1)

Universal Regularizers For Robust Sparse Coding

and Modeling

Ignacio Ram´ırez and Guillermo Sapiro Department of Electrical and Computer Engineering

University of Minnesota {ramir048,guille}@umn.edu

Abstract

Sparse data models, where data is assumed to be well represented as a linear combination of a few elements from a dictionary, have gained considerable attention in recent years, and their use has led to state-of-the-art results in many signal and image processing tasks. It is now well understood that the choice of the sparsity regularization term is critical in the success of such models. Based on a codelength minimization interpretation of sparse coding, and using tools from universal coding theory, we propose a framework for designing sparsity regularization terms which have theoretical and practical advantages when compared to the more standard `0or `1ones. The presentation

of the framework and theoretical foundations is complemented with examples that show its practical advantages in image denoising, zooming and classification.

I. INTRODUCTION

Sparse modeling calls for constructing a succinct representation of some data as a combination of a few typical patterns (atoms) learned from the data itself. Significant contributions to the theory and practice of learning such collections of atoms (usually called dictionaries or codebooks), e.g., [1], [14], [33], and of representing the actual data in terms of them, e.g., [8], [11], [12], have been developed in recent years, leading to state-of-the-art results in many signal and image processing tasks [24], [26], [27], [34]. We refer the reader for example to [4] for a recent review on the subject.

A critical component of sparse modeling is the actual sparsity of the representation, which is controlled by a regularization term (regularizer for short) and its associated parameters. The choice of the functional form of the regularizer and its parameters is a challenging task. Several solutions to this problem have been proposed in the literature, ranging from the automatic tuning of the parameters [20] to Bayesian models, where these parameters are themselves considered as random variables [17], [20], [51]. In this work we adopt the interpretation of sparse coding as a codelength minimization problem. This is a natural and objective method for assessing the quality of a statistical model for describing given data, and which is based on the Minimum Description Length (MDL) principle [37]. In this framework, the regularization term in the sparse coding formulation is interpreted as the cost in bits of describing the sparse linear coefficients used to reconstruct the data. Several works on image coding using this approach were developed in the 1990’s under the name of “complexity-based” or “compression-based” coding, following the popularization of MDL as a powerful statistical modeling tool [9], [31], [40]. The focus on these early works was in denoising using wavelet basis, using either generic asymptotic results from MDL or fixed probability models, in order to compute the description length of the coefficients. A later, major breakthrough in MDL theory was the adoption of universal coding tools to compute optimal codelengths. In this work, we improve and extend on previous results in this line of work by designing regularization terms based on such universal codes for image coefficients, meaning that the codelengths obtained when encoding the coefficients of any (natural) image with such codes will be close to the shortest codelengths that can be obtained with any model fitted specifically for that particular instance of coefficients. The resulting framework not only formalizes sparse coding from the MDL and universal coding perspectives but also leads to a family of universal regularizers which we show to consistently improve results in image processing tasks such as denoising and classification. These models also enjoy several desirable theoretical and practical properties such as statistical consistency (in certain cases), improved robustness to outliers in the data, and improved sparse signal recovery (e.g., decoding of sparse signals from a compressive sensing point of view [5]) when compared with the traditional `0 and `1-based techniques in practice. These models

also yield to the use of a simple and efficient optimization technique for solving the corresponding sparse coding

(2)

problems as a series of weighted `1 subproblems, which in turn, can be solved with off-the-shelf algorithms such

as LARS [12] or IST [11]. Details are given in the sequel.

Finally, we apply our universal regularizers not only for coding using fixed dictionaries, but also for learning the dictionaries themselves, leading to further improvements in all the aforementioned tasks.

The remainder of this paper is organized as follows: in SectionIIwe introduce the standard framework of sparse modeling. Section III is dedicated to the derivation of our proposed universal sparse modeling framework, while Section IV deals with its implementation. Section V presents experimental results showing the practical benefits of the proposed framework in image denoising, zooming and classification tasks. Concluding remarks are given in Section VI.

II. SPARSE MODELING AND THE NEED FOR BETTER MODELS

Let X ∈ RM ×N be a set of N column data samples xj ∈ RM, D ∈ RM ×K a dictionary of K column atoms

dk∈ RM, and A ∈ RK×N, aj ∈ RK, a set of reconstruction coefficients such that X = D A. We use aT_k to denote

the k-th row of A, the coefficients associated to the k-th atom in D. For each j = 1, . . . , N we define the active set of aj as Aj = {k : akj 6= 0, 1 ≤ k ≤ K}, and kajk₀ = |Aj| as its cardinality. The goal of sparse modeling

is to design a dictionary D such that for all or most data samples xj, there exists a coefficients vector aj such

that xj ≈ D aj and kajk₀ is small (usually below some threshold L K). Formally, we would like to solve the

following problem min D,A N X j=1 ψ(aj) s.t. kxj− D ajk2₂≤ , j = 1, . . . , N, (1)

where ψ(·) is a regularization term which induces sparsity in the columns of the solution A. Usually the constraint kd_kk₂ ≤ 1, k = 1, . . . , K, is added, since otherwise we can always decrease the cost function arbitrarily by multiplying D by a large constant and dividing A by the same constant. When D is fixed, the problem of finding a sparse aj for each sample xj is called sparse coding,

aj = arg min

a ψ(aj) s.t. kxj− D ajk 2

2≤ . (2)

Among possible choices of ψ(·) are the `0 pseudo-norm, ψ(·) = k·k₀, and the `1 norm. The former tries to

solve directly for the sparsest aj, but since it is non-convex, it is commonly replaced by the `1 norm, which is

its closest convex approximation. Furthermore, under certain conditions on (fixed) D and the sparsity of aj, the

solutions to the `0 and `1-based sparse coding problems coincide (see for example [5]). The problem (1) is also

usually formulated in Lagrangian form, min D,A N X j=1 kxj− D ajk2₂+ λψ(aj), (3)

along with its respective sparse coding problem when D is fixed, aj = arg min

a kxj− D ak 2

2+ λψ(a). (4)

Even when the regularizer ψ(·) is convex, the sparse modeling problem, in any of its forms, is jointly non-convex in (D, A). Therefore, the standard approach to find an approximate solution is to use alternate minimization: starting with an initial dictionary D(0), we minimize (3) alternatively in A via (2) or (4) (sparse coding step), and D (dictionary update step). The sparse coding step can be solved efficiently when ψ(·) = k·k₁ using for example IST

[11] or LARS [12], or with OMP[28] when ψ(·) = k·k₀. The dictionary update step can be done using for example

MOD [14] or K-SVD [1].

A. Interpretations of the sparse coding problem

We now turn our attention to the sparse coding problem: given a fixed dictionary D, for each sample vector xj,

compute the sparsest vector of coefficients aj that yields a good approximation of xj. The sparse coding problem

admits several interpretations. What follows is a summary of these interpretations and the insights that they provide into the properties of the sparse models that are relevant to our derivation.

(3)

1) Model selection in statistics: Using the `0 norm as ψ(·) in (4) is known in the statistics community as the

Akaike’s Information Criterion (AIC) when λ = 1, or the Bayes Information Criterion (BIC) when λ = 1₂log M , two popular forms of model selection (see [22, Chapter 7]). In this context, the `1 regularizer was introduced in

[43], again as a convex approximation of the above model selection methods, and is commonly known (either in its constrained or Lagrangian forms) as the Lasso. Note however that, in the regression interpretation of (4), the role of D and X is very different.

2) Maximum a posteriori: Another interpretation of (4) is that of a maximum a posteriori (MAP) estimation of aj in the logarithmic scale, that is

aj = arg max

a {log P (a|xj)} = arg maxa {log P (xj|a) + log P (a)}

= arg min

a {− log P (xj|a) − log P (a)}, (5)

where the observed samples xj are assumed to be contaminated with additive, zero mean,IID Gaussian noise with

variance σ2, P (xj|a) ∝ e− 1

2σ2kxj−D ak 2

2, and a prior probability model on a with the form P (a) ∝ e−θψ(a) is

considered. The energy term in Equation (4) follows by plugging the previous two probability models into (5) and factorizing 2σ2 into λ = 2σ2θ. According to (5), the `1 regularizer corresponds to an IID Laplacian prior with

mean 0 and inverse-scale parameter θ, P (a) =QK

k=1θe−θ|ak|= θKe−θkak1, which has a special meaning in signal

processing tasks such as image or audio compression. This is due to the widely accepted fact that representation coefficients derived from predictive coding of continuous-valued signals, and, more generally, responses from zero-mean filters, are well modeled using Laplacian distributions. For example, for the special case of DCT coefficients of image patches, an analytical study of this phenomenon is provided in [25], along with further references on the subject.

3) Codelength minimization: Sparse coding, in all its forms, has yet another important interpretation. Suppose that we have a fixed dictionary D and that we want to use it to compress an image, either losslessly by encoding the reconstruction coefficients A and the residual X − DA, or in a lossy manner, by obtaining a good approximation X ≈ DA and encoding only A. Consider for example the latter case. Most modern compression schemes consist of two parts: a probability assignment stage, where the data, in this case A, is assigned a probability P (A), and an encoding stage, where a code C(A) of length L(A) bits is assigned to the data given its probability, so that L(A) is as short as possible. The techniques known as Arithmetic and Huffman coding provide the best possible solution for the encoding step, which is to approximate the Shannon ideal codelength L(A) = − log P (A) [10, Chapter 5]. Therefore, modern compression theory deals with finding the coefficients A that maximize P (A), or, equivalently, that minimize − log P (A). Now, to encode X lossily, we obtain coefficients A such that each data sample xj is approximated up to a certain `2 distortion , kxj− Dajk2₂ ≤ . Therefore, given a model P (a) for a

vector of reconstruction coefficients, and assuming that we encode each sample independently, the optimum vector of coefficients aj for each sample xj will be the solution to the optimization problem

aj = arg min

a − log P (a) s.t. kxj− D ajk 2

2 ≤ , (6)

which, for the choice P (a) ∝ e−ψ(a)coincides with the error constrained sparse coding problem (2). Suppose now that we want lossless compression. In this case we also need to encode the reconstruction residual xj− Daj. Since

P (x, a) = P (x|a)P (a), the combined codelength will be

L(xj, aj) = − log P (xj, aj) = − log P (xj|aj) − log P (aj). (7)

Therefore, obtaining the best coefficients aj amounts to solving minaL(xj, aj), which is precisely the MAP

formulation of (5), which in turn, for proper choices of P (x|a) and P (a), leads to the Lagrangian form of sparse coding (4).1

1

Laplacian models, as well as Gaussian models, are probability distributions over R, characterized by continuous probability density functions, f (a) = F0(a), F (a) = P (x ≤ a). If the reconstruction coefficients are considered real numbers, under any of these distributions, any instance of A ∈ RK×N will have measure 0, that is, P (A) = 0. In order to use such distributions as our models for the data, we assume that the coefficients in A are quantized to a precision ∆, small enough for the density function f (a) to be approximately constant in any interval [a − ∆/2, a + ∆/2], a ∈ R, so that we can approximate P (a) ≈ ∆f (a), a ∈ R. Under these assumptions, − log P (a) ≈ − log f (a) − log ∆, and the effect of ∆ on the codelength produced by any model is the same. Therefore, we will omit ∆ in the sequel, and treat density functions and probability distributions interchangeably as P (·). Of course, in real compression applications, ∆ needs to be tuned.

(4)

Fig. 1. Standard 8×8DCTdictionary (a), global empirical distribution of the coefficients in A (b, log scale), empirical distributions of the coefficients associated to each of the K = 64 DCTatoms (c, log scale). The distributions in (c) have a similar heavy tailed shape (heavier than Laplacian), but the variance in each case can be significantly different. (d) Histogram of the K = 64 different ˆθk values obtained by fitting a Laplacian distribution to each row aT

k of A. Note that there are significant occurrences between ˆθ = 5 to ˆθ = 25. The coefficients A used in (b-d) were obtained from encoding 106 8×8 patches (after removing their DC component) randomly sampled from the Pascal 2006 dataset of natural images [15]. (e) Histograms showing the spatial variability of the best local estimations of ˆθk for a few rows of A across different regions of an image. In this case, the coefficients A correspond to the sparse encoding of all 8×8 patches from a single image, in scan-line order. For each k, each value of ˆθk was computed from a random contiguous block of 250 samples from aTk. The procedure was repeated 4000 times to obtain an empirical distribution. The wide supports of the empirical distributions indicate that the estimated ˆθ can have very different values, even for the same atom, depending on the region of the data from where the coefficients are taken.

As one can see, the codelength interpretation of sparse coding is able to unify and interpret both the constrained and unconstrained formulations into one consistent framework. Furthermore, this framework offers a natural and objective measure for comparing the quality of different models P (x|a) and P (a) in terms of the codelengths obtained.

4) Remarks on related work: As mentioned in the introduction, the codelength interpretation of signal coding was already studied in the context of orthogonal wavelet-based denoising. An early example of this line of work considers a regularization term which uses the Shannon Entropy function P pilog pi to give a measure of the

sparsity of the solution [9]. However, the Entropy function is not used as measure of the ideal codelength for describing the coefficients, but as a measure of the sparsity (actually, group sparsity) of the solution. The MDL principle was applied to the signal estimation problem in [40]. In this case, the codelength term includes the description of both the location and the magnitude of the nonzero coefficients. Although a pioneering effort, the model assumed in [40] for the coefficient magnitude is a uniform distribution on [0, 1], which does not exploit a priori knowledge of image coefficient statistics, and the description of the support is slightly wasteful. Furthermore, the codelength expression used is an asymptotic result, actually equivalent to BIC (see Section II-A1) which can be misleading when working with small sample sizes, such as when encoding small image patches, as in current state of the art image processing applications. The uniform distribution was later replaced by the universal code for integers [38] in [31]. However, as in [40], the model is so general that it does not perform well for the specific case of coefficients arising from image decompositions, leading to poor results. In contrast, our models are derived following a careful analysis of image coefficient statistics. Finally, probability models suitable to image coefficient statistics of the form P (a) ∝ e−|a|β (known as generalized Gaussians) were applied to the MDL-based signal coding and estimation framework in [31]. The justification for such models is based on the empirical observation that sparse coefficients statistics exhibit “heavy tails” (see next section). However, the choice is ad hoc and no optimality criterion is available to compare it with other possibilities. Moreover, there is no closed form solution for performing parameter estimation on such family of models, requiring numerical optimization techniques. In Section III, we derive a number of probability models for which parameter estimation can be computed efficiently in closed form, and which are guaranteed to optimally describe image coefficients.

B. The need for a better model

As explained in the previous subsection, the use of the `1 regularizer implies that all the coefficients in A

share the same Laplacian parameter θ. However, as noted in [25] and references therein, the empirical variance of coefficients associated to different atoms, that is, of the different rows aT_k of A, varies greatly with k = 1 . . . , K. This is clearly seen in Figures 1(a-c), which show the empirical distribution of DCT coefficients of 8×8 patches.

(5)

As the variance of a Laplacian is 2/θ2, different variances indicate different underlying θ. The histogram of the set n ˆ_θ_k_{, k = 1, . . . , K}o

of estimated Laplacian parameters for each row k, Figure 1(d), shows that this is indeed the case, with significant occurrences of values of ˆθ in a range of 5 to 25.

The straightforward modification suggested by this phenomenon is to use a model where each row of A has its own weight associated to it, leading to a weighted `1 regularizer. However, from a modeling perspective, this

results in K parameters to be adjusted instead of just one, which often results in poor generalization properties. For example, in the cases studied in SectionV, even with thousands of images for learning these parameters, the results of applying the learned model to new images were always significantly worse (over 1dB in estimation problems) when compared to those obtained using simpler models such as an unweighted `1.2One reason for this failure may

be that real images, as well as other types of signals such as audio samples, are far from stationary. In this case, even if each atom k is associated to its own θk (λk), the optimal value of θk can have significant local variations

at different positions or times. This effect is shown in Figure 1(e), where, for each k, each θk was re-estimated

several times using samples from different regions of an image, and the histogram of the different estimated values of ˆθk was computed. Here again we used the DCT basis as the dictionary D.

The need for a flexible model which at the same time has a small number of parameters leads naturally to Bayesian formulations where the different possible λk are “marginalized out” by imposing an hyper-prior distribution on λ,

sampling λ using its posterior distribution, and then averaging the estimates obtained with the sampled sparse-coding problems. Examples of this recent line of work, and the closely related Bayesian Compressive Sensing, are developed for example in [23], [44], [49], [48]. Despite of its promising results, the Bayesian approach is often criticized due to the potentially expensive sampling process (something which can be reduced for certain choices of the priors involved [23]), arbitrariness in the choice of the priors, and lack of proper theoretical justification for the proposed models [48].

In this work we pursue the same goal of deriving a more flexible and accurate sparse model than the traditional ones, while avoiding an increase in the number of parameters and the burden of possibly solving several sampled instances of the sparse coding problem. For this, we deploy tools from the very successful information-theoretic field of universal coding, which is an extension of the compression scenario summarized above in Section II-A, when the probability model for the data to be described is itself unknown and has to be described as well.

III. UNIVERSAL MODELS FOR SPARSE CODING

Following the discussion in the preceding section, we now have several possible scenarios to deal with. First, we may still want to consider a single value of θ to work well for all the coefficients in A, and try to design a sparse coding scheme that does not depend on prior knowledge on the value of θ. Secondly, we can consider an independent (but not identically distributed) Laplacian model where the underlying parameter θ can be different for each atom dk, k = 1, . . . , K. In the most extreme scenario, we can consider each single coefficient akj in A

to have its own unknown underlying θkj and yet, we would like to encode each of these coefficients (almost) as if

we knew its hidden parameter.

The first two scenarios are the ones which fit the original purpose of universal coding theory [29], which is the design of optimal codes for data whose probability models are unknown, and the models themselves are to be encoded as well in the compressed representation.

Now we develop the basic ideas and techniques of universal coding applied to the first scenario, where the problem is to describe A as an IID Laplacian with unknown parameter θ. Assuming a known parametric form for the prior, with unknown parameter θ, leads to the concept of a model class. In our case, we consider the class M = {P (A|θ) : θ ∈ Θ} of all IID _{Laplacian models over A ∈ R}K×N, where

and Θ ⊆ R+. The goal of universal coding is to find a probability model Q(A) which can fit A as well as the model in M that best fits A after having observed it. A model Q(A) with this property is called universal (with respect to the model M).

2

Note that this is the case when the weights are found by maximum likelihood. Other applications of weighted `1regularizers, using other types of weighting strategies, are known to improve over `1-based ones for certain applications (see e.g. [51]).

(6)

For simplicity, in the following discussion we consider the coefficient matrix A to be arranged as a single long column vector of length n = K×N , a = (a1, . . . , an). We also use the letter a without sub-index to denote the

value of a random variable representing coefficient values.

First we need to define a criterion for comparing the fitting quality of different models. In universal coding theory this is done in terms of the codelengths L(a) required by each model to describe a.

If the model consists of a single probability distribution P (·), we know from Section II-A3 that the optimum codelength corresponds to LP(a) = − log P (a). Moreover, this relationship defines a one-to-one correspondence

between distributions and codelengths, so that for any coding scheme LQ(a), Q(a) = 2−LQ(a). Now suppose

that we are restricted to a class of models M, and that we need choose the model ˆP ∈ M that assigns the shortest codelength to a particular instance of a. We then have that ˆP is the model in M that assigns the maximum probability to a. For a class M parametrized by θ, this corresponds to ˆP = P (a|ˆθ(a)), where ˆθ(a) is the maximum likelihood estimator (MLE) of the model class parameter θ given a (we will usually omit the argument and just write ˆθ). Unfortunately, we also need to include the value of ˆθ in the description of a for the decoder to be able to reconstruct it from the code C(a). Thus, we have that any model Q(a) inducing valid codelengths LQ(a) will have

LQ(a) > − log P (a|ˆθ). The overhead of LQ(a) with respect to − log P (a|ˆθ) is known as the codelength regret,

R(a, Q) := L_Q(a) − (− log P (a|ˆθ(a))) = − log Q(a) + log P (a|ˆθ(a))).

A model Q(a) (or, more precisely, a sequence of models, one for each data length n) is called universal if R(a, Q) grows sublinearly in n for all possible realizations of a, that is _n1R(a, Q) → 0 , ∀ a ∈ Rn_{, so that the codelength}

regret with respect to the MLE becomes asymptotically negligible.

There are a number of ways to construct universal probability models. The simplest one is the so called two-part code, where the data is described in two two-parts. The first two-part describes the optimal parameter ˆθ(a) and the second part describes the data according to the model with the value of the estimated parameter ˆθ, P (a|ˆθ(a)). For uncountable parameter spaces Θ, such as a compact subset of R, the value of ˆθ has to be quantized in order to be described with a finite number of bits d. We call the quantized parameter ˆθd. The regret for this model is thus

R(a, Q) = L(ˆθd) + L(a|ˆθd) − L(a|ˆθ) = L(ˆθd) − log P (a|ˆθd) − (− log P (a|ˆθ)).

The key for this model to be universal is in the choice of the quantization step for the parameter ˆθ, so that both its description L(ˆθd), and the difference − log P (a|ˆθd) − (− log P (a|ˆθ)), grow sublinearly. This can be achieved

by letting the quantization step shrink as O(1/√n) [37], thus requiring d = O(0.5 log n) bits to describe each dimension of ˆθd. This gives us a total regret for two-part codes which grows as dim(Θ)₂ log n, where dim(Θ) is the

dimension of the parameter space Θ.

Another important universal code is the so called Normalized Maximum Likelihood (NML) [42]. In this case the universal model Q∗(a) corresponds to the model that minimizes the worst case regret,

Q∗(a) = min

Q maxa {− log Q(a) + log P (a|ˆθ(a))},

which can be written in closed form as Q∗(a) = P (a|ˆ_C(M,n)θ(a)), where the normalization constant

C(M, n) := X

a∈Rn

P (a|ˆθ(a))da

is the value of the minimax regret and depends only on M and the length of the data n.3 Note that theNMLmodel requires C(M, n) to be finite, something which is often not the case.

The two previous examples are good for assigning a probability to coefficients a that have already been computed, but they cannot be used as a model for computing the coefficients themselves since they depend on having observed them in the first place. For this and other reasons that will become clearer later, we concentrate our work on a third important family of universal codes derived from the so called mixture models (also called Bayesian mixtures). In

3

The minimax optimality of Q∗(a) derives from the fact that it defines a complete uniquely decodable code for all data a of length n, that is, it satisfies the Kraft inequality with equality.P

a∈Rn2−LQ∗(a)= 1. Since every uniquely decodable code with lengths {LQ(a) : a ∈ Rn} must satisfy the Kraft inequality (see [10, Chapter 5]), if there exists a value of a such that LQ(a) < LQ∗(a) (that is 2−LQ(a)_{> 2}−LQ∗(a)_),

then there exists a vector a0 for which LQ(a0) > LQ∗_(a0_{) for the Kraft inequality to hold. Therefore the regret of Q for a}0 _{is necessarily}

(7)

a mixture model, Q(a) is a convex mixture of all the models P (a|θ) in M, indexed by the model parameter θ, Q(a) = R

ΘP (a|θ)w(θ)dθ, where w(θ) specifies the weight of each model. Being a convex mixture implies that

w(θ) ≥ 0 and R

Θw(θ)dθ = 1, thus w(θ) is itself a probability measure over Θ. We will restrict ourselves to the

particular case when a is considered a sequence of independent random variables,4 Q(a) = n Y j=1 Qj(aj), Qj(aj) = Z Θ P (aj|θ)wj(θ)dθ, (8)

where the mixing function wj(θ) can be different for each sample j. An important particular case of this scheme

is the so called Sequential Bayes code, in which wj(θ) is computed sequentially as a posterior distribution based

on previously observed samples, that is wj(θ) = P (θ|a1, a2, . . . , an−1) [21, Chapter 6]. In this work, for simplicity,

we restrict ourselves to the case where wj(θ) = w(θ) is the same for all j. The result is an IID model where the

probability of each sample aj is a mixture of some probability measure over R,

Qj(aj) = Q(aj) =

Z

Θ

P (aj|θ)w(θ)dθ, ∀ j = 1, . . . , N. (9)

A well known result for IIDmixture (Bayesian) codes states that their asymptotic regret is O(dim(Θ)₂ log n), thus stating their universality, as long as the weighting function w(θ) is positive, continuous and unimodal over Θ (see for example [21, Theorem 8.1],[41]). This gives us great flexibility on the choice of a weighting function w(θ) that guarantees universality. Of course, the results are asymptotic and the o(log n) terms can be large, so that the choice of w(θ) can have practical impact for small sample sizes.

In the following discussion we derive several IID mixture models for the Laplacian model class M. For this purpose, it will be convenient to consider the corresponding one-sided counterpart of the Laplacian, which is the exponential distribution over the absolute value of the coefficients, |a|, and then symmetrize back to obtain the final distribution over the signed coefficients a.

A. The conjugate prior

In general, (9) can be computed in closed form if w(θ) is the conjugate prior of P (a|θ). When P (a|θ) is an exponential (one-sided Laplacian), the conjugate prior is the Gamma distribution,

w(θ|κ, β) = Γ(κ)−1θκ−1βκe−βθ, θ ∈ R+,

where κ and β are its shape and scale parameters respectively. Plugging this in (9) we obtain the Mixture of exponentials model (MOE), which has the following form (see Appendix Afor the full derivation),

QMOE(a|β, κ) = κβ

κ_{(a + β)}−(κ+1)

, a ∈ R+. (10)

With some abuse of notation, we will also denote the symmetric distribution on a as MOE,

QMOE(a|β, κ) =

1 2κβ

κ_{(|a| + β)}−(κ+1)

, a ∈ R. (11)

Although the resulting prior has two parameters to deal with instead of one, we know from universal coding theory that, in principle, any choice of κ and β will give us a model whose codelength regret is asymptotically small.

Furthermore, beingIIDmodels, each coefficient of a itself is modeled as a mixture of exponentials, which makes the resulting model over a very well suited to the most flexible scenario where the “underlying” θ can be different for each aj. In SectionV-Bwe will show that a single MOEdistribution can fit each of the K rows of A better than

K separate Laplacian distributions fine-tuned to these rows, with a total of K parameters to be estimated. Thus, not only we can deal with one single unknown θ, but we can actually achieve maximum flexibility with only two parameters (κ and β). This property is particular of the mixture models, and does not apply to the other universal models presented.

(8)

Finally, if desired, both κ and β can be easily estimated using the method of moments (see AppendixA). Given sample estimates of the first and second non-central moments, ˆµ1 = 1_nPnj=1|aj| and ˆµ2= _n1 Pnj=1|aj|2, we have

that

ˆ

κ = 2(ˆµ2− ˆµ21)/(ˆµ2− 2ˆµ21) and β = (ˆˆ κ − 1)ˆµ1. (12)

When the MOE prior is plugged into (5) instead of the standard Laplacian, the following new sparse coding formulation is obtained, aj = arg min a kxj− Dak 2 2+ λMOE K X k=1 log (|ak| + β) , (13)

where λMOE= 2σ2(κ + 1). An example of the MOEregularizer, and the thresholding function it induces, is shown

in Figure 2 (center column) for κ = 2.5, β = 0.05. Smooth, differentiable non-convex regularizers such as the one in in (13) have become a mainstream robust alternative to the `1 norm in statistics [16], [51]. Furthermore,

it has been shown that the use of such regularizers in regression leads to consistent estimators which are able to identify the relevant variables in a regression model (oracle property) [16]. This is not always the case for the `1

regularizer, as was proved in [51]. TheMOEregularizer has also been recently proposed in the context of compressive

sensing [6], where it is conjectured to be better than the `1-term at recovering sparse signals in compressive sensing

applications.5_{This conjecture was partially confirmed recently for non-convex regularizers of the form ψ(a) = kak} r

with 0 < r < 1 in [39], [18], and for a more general family of non-convex regularizers including the one in (13) in [47]. In all cases, it was shown that the conditions on the sensing matrix (here D) can be significantly relaxed to guarantee exact recovery if non-convex regularizers are used instead of the `1 norm, provided that the exact

solution to the non-convex optimization problem can be computed. In practice, this regularizer is being used with success in a number of applications here and in [7], [46].6 _{Our experimental results in Section} _V _{provide further}

evidence on the benefits of the use of non-convex regularizers, leading to a much improved recovery accuracy of sparse coefficients compared to `1 and `0. We also show in Section V that the MOE prior is much more accurate

than the standard Laplacian to model the distribution of reconstruction coefficients drawn from a large database of image patches. We also show in Section V how these improvements lead to better results in applications such as image estimation and classification.

B. The Jeffreys prior

The Jeffreys prior for a parametric model class M = {P (a|θ), θ ∈ Θ}, is defined as w(θ) = _R p|I(θ)|

Θp|I(ξ)|dξ

, θ ∈ Θ, (14)

where |I(θ)| is the determinant of the Fisher information matrix I(θ) = E_{P (a|˜}_θ) − ∂ 2 ∂ ˜θ2 log P (a|˜θ) _˜ θ=θ . (15)

The Jeffreys prior is well known in Bayesian theory due to three important properties: it virtually eliminates the hyper-parameters of the model, it is invariant to the original parametrization of the distribution, and it is a “non-informative prior,” meaning that it represents well the lack of prior information on the unknown parameter θ [3]. It turns out that, for quite different reasons, the Jeffreys prior is also of paramount importance in the theory of universal coding. For instance, it has been shown in [2] that the worst case regret of the mixture code obtained using the Jeffreys prior approaches that of theNMLas the number of samples n grows. Thus, by using Jeffreys, one can attain the minimum worst case regret asymptotically, while retaining the advantages of a mixture (not needing hindsight of a), which in our case means to be able to use it as a model for computing a via sparse coding.

For the exponential distribution we have that I(θ) = _θ12. Clearly, if we let Θ = (0, ∞), the integral in (14)

evaluates to ∞. Therefore, in order to obtain a proper integral, we need to exclude 0 and ∞ from Θ (note that

5

In [6], the logarithmic regularizer arises from approximating the `0 pseudo-norm as an `1-normalized element-wise sum, without the insight and theoretical foundation here reported.

6

While these works support the use of such non-convex regularizers, none of them formally derives them using the universal coding framework as in this paper.

(9)

Fig. 2. Left to right: `1 (green), MOE (red) and JOE (blue) regularizers and their corresponding thresholding functions thres(x) := arg mina{(x − a)2+ λψ(|a|)}. The unbiasedness of MOE is due to the fact that large coefficients are not shrank by the thresholding function. Also, although theJOEregularizer is biased, the shrinkage of large coefficients can be much smaller than the one applied to small coefficients.

this was not needed for the conjugate prior). We choose to define Θ = [θ1, θ2], 0 < θ1 < θ2 < ∞, leading to

w(θ) = _ln(θ2/θ1)1 1_θ, θ ∈ [θ1, θ2].

The resulting mixture, after being symmetrized around 0, has the following form (see Appendix A): QJOE(a|θ1, θ2) = 1 2 ln(θ2/θ1) 1 |a| e−θ1|a|− e−θ2|a|, a ∈ R+. (16) We refer to this prior as a Jeffreys mixture of exponentials (JOE), and again overload this acronym to refer to

the symmetric case as well. Note that although QJOE is not defined for a = 0, its limit when a → 0 is finite and

evaluates to _{2 ln(θ2/θ1)}θ2−θ1 . Thus, by defining QJOE(0) =

θ2−θ1

2 ln(θ2/θ1), we obtain a prior that is well defined and continuous

for all a ∈ R. When plugged into (5), we get the JOE-based sparse coding formulation,

min a kxj− Dak 2 2+ λJOE K X k=1

{log |a_k| − log(e−θ1|ak|− e−θ2|ak|)}, (17) where, according to the convention just defined for QJOE(0), we define ψJOE(0) := log(θ2− θ1). According to the

MAP interpretation we have that λJOE = 2σ2, coming from the Gaussian assumption on the approximation error as

explained in Section II-A.

As with MOE, theJOE-based regularizer, ψJOE(·) = − log QJOE(·), is continuous and differentiable in R+, and its

derivative converges to a finite value at zero, lima→0ψ0JOE(a) =

θ2 2−θ

2 1

θ2−θ1. As we will see later in Section IV, these

properties are important to guarantee the convergence of sparse coding algorithms using non-convex priors. Note from (17) that we can rewrite the JOE regularizer as

ψJOE(ak) = log |ak| − log e−θ1|a|(1 − e−(θ2−θ1)|a|) = θ1|ak| + log |ak| − log(1 − e−(θ2−θ1)|ak|),

so that for sufficiently large |ak|, log(1−e−(θ2−θ1)|ak|) ≈ 0, θ1|ak| log |ak|, and we have that ψJOE(|ak|) ≈ θ1|ak|.

Thus, for large |ak|, the JOE regularizer behaves like `1 with λ0 = 2σ2θ1. In terms of the probability model, this

means that the tails of the JOE mixture behave like a Laplacian with θ = θ1, with the region where this happens

determined by the value of θ2− θ1. The fact that the non-convex region of ψJOE(·) is confined to a neighborhood

around 0 could help to avoid falling in bad local minima during the optimization (see Section IV for more details on the optimization aspects). Finally, although having Laplacian tails means that the estimated a will be biased [16], the sharper peak at 0 allows us to perform a more aggressive thresholding of small values, without excessively clipping large coefficients, which leads to the typical over-smoothing of signals recovered using an `1 regularizer.

See Figure 2 (rightmost column) for an example regularizer based onJOE with parameters θ1= 20, θ2= 100, and

the thresholding function it induces.

The JOE regularizer has two hyper-parameters (θ1, θ2) which define Θ and that, in principle, need to be tuned.

One possibility is to choose θ1and θ2based on the physical properties of the data to be modeled, so that the possible

values of θ never fall outside of the range [θ1, θ2]. For example, in modeling patches from grayscale images with

a limited dynamic range of [0, 255] in a DCT basis, the maximum variance of the coefficients can never exceed 1282. The same is true for the minimum variance, which is defined by the quantization noise.

Having said this, in practice it is advantageous to adjust [θ1, θ2] to the data at hand. In this case, although no

closed form solutions exist for estimating [θ1, θ2] using MLE or the method of moments, standard optimization

(10)

C. The conditional Jeffreys

A recent approach to deal with the case when the integral over Θ in the Jeffreys prior is improper, is the conditional Jeffreys [21, Chapter 11]. The idea is to construct a proper prior, based on the improper Jeffreys prior and the first few n0 samples of a, (a1, a2, . . . , an0), and then use it for the remaining data. The key observation is that although

the normalizing integral R pI(θ)dθ in the Jeffreys prior is improper, the unnormalized prior w(θ) =pI(θ) can be used as a measure to weight P (a1, a2, . . . , an0|θ),

w(θ) = _R P (a1, a2, . . . , an0|θ)pI(θ)

ΘP (a1, a2, . . . , an0|ξ)pI(ξ)dξ

. (18)

It turns out that the integral in (18) usually becomes proper for small n0 in the order of dim(Θ). In our case we

have that for any n0 ≥ 1, the resulting prior is a Gamma(κ0, β0) distribution with κ0 := n0 and β0 :=Pn0j=1aj

(see Appendix A for details). Therefore, using the conditional Jeffreys prior in the mixture leads to a particular instance of MOE, which we denote byCMOE (although the functional form is identical toMOE), where the Gamma

parameters κ and β are automatically selected from the data. This may explain in part why the Gamma prior performs so well in practice, as we will see in Section V.

Furthermore, we observe that the value of β obtained with this approach (β0) coincides with the one estimated

using the method of moments for MOE if the κ in MOE is fixed to κ = κ0 + 1 = n0+ 1. Indeed, if computed

from n0 samples, the method of moments for MOE gives β = (κ − 1)µ1, with µ1 = _n01 P aj, which gives us

β = n0+1−1_n0 P aj = β0. It turns out in practice that the value of κ estimated using the method of moments gives

a value between 2 and 3 for the type of data that we deal with (see Section V), which is just above the minimum acceptable value for the CMOE prior to be defined, which is n0 = 1. This justifies our choice of n0 = 2 when

applying CMOE in practice.

As n0 becomes large, so does κ0 = n0, and the Gamma prior w(θ) obtained with this method converges to a

Kronecker delta at the mean value of the Gamma distribution, δκ0/β0(·). Consequently, when w(θ) ≈ δκ0/β0(θ), the

mixture R_ΘP (a|θ)w(θ)dθ will be close to P (a|κ0/β0). Moreover, from the definition of κ0 and β0 we have that

κ0/β0 is exactly the MLE of θ for the Laplacian distribution. Thus, for large n0, the conditional Jeffreys method

approaches the MLE Laplacian model.

Although from a universal coding point of view this is not a problem, for large n0 the conditional Jeffreys model

will loose its flexibility to deal with the case when different coefficients in A have different underlying θ. On the other hand, a small n0 can lead to a prior w(θ) that is overfitted to the local properties of the first samples, which

for non-stationary data such as image patches, can be problematic. Ultimately, n0 defines a trade-off between the

degree of flexibility and the accuracy of the resulting model.

IV. OPTIMIZATION AND IMPLEMENTATION DETAILS

All of the mixture models discussed so far yield convex regularizers, rendering the sparse coding problem non-convex in a. It turns out however that these regularizers satisfy certain conditions which make the resulting sparse coding optimization well suited to be approximated using a sequence of successive convex sparse coding problems, a technique known as Local Linear Approximation (LLA) [52] (see also [46], [19] for alternative optimization techniques for such non-convex sparse coding problems). In a nutshell, suppose we need to obtain an approximate solution to aj = arg min a kxj− D ak 2 2+ λ K X k=1 ψ(|ak|), (19)

where ψ(·) is a non-convex function over R+. At each LLA iteration, we compute a(t+1)_j by doing a first order expansion of ψ(·) around the K elements of the current estimate a(t)_kj,

˜

(11)

and solving the convex weighted `1 problem that results after discarding the constant terms ck, a(t+1)_j = arg min a kxj− D ak 2 2+ λ K X k=1 ˜ ψ(t)_k (|ak|) = arg min a kxj− D ak 2 2+ λ K X k=1

ψ0(|a(t)_kj|)|ak| = arg min

a kxj − D ak 2 2+ K X k=1 λ(t)_k |ak|. (20)

where we have defined λ(t)_k := λψ0(|a(t)_kj|). If ψ0(·) is continuous in (0, +∞), and right-continuous and finite at 0, then the LLA algorithm converges to a stationary point of (19) [51]. These conditions are met for both the MOE

and JOE regularizers. Although, for the JOE prior, the derivative ψ0(·) is not defined at 0, it converges to the limit

θ2 2−θ21

2(θ2−θ1) when |a| → 0, which is well defined for θ2 6= θ1. If θ2 = θ1, the JOE mixing function is a Kronecker

delta and the prior becomes a Laplacian with parameter θ = θ1 = θ2. Therefore we have that for all of the mixture

models studied, the LLAmethod converges to a stationary point. In practice, we have observed that 5 iterations are

enough to converge. Thus, the cost of sparse coding, with the proposed non-convex regularizers, is at most 5 times that of a single `1 sparse coding, and could be less in practice if warm restarts are used to begin each iteration.

Of course we need a starting point a(0)_j , and, being a non-convex problem, this choice will influence the approxima-tion that we obtain. One reasonable choice, used in this work, is to define a(0)_kj = a0, k = 1, . . . , K, j = 1, . . . , N ,

where a0 is a scalar so that ψ0(a0) = Ew[θ], that is, so that the first sparse coding corresponds to a Laplacian

regularizer whose parameter is the average value of θ as given by the mixing prior w(θ).

Finally, note that although the discussion here has revolved around the Lagrangian formulation to sparse coding of (4), this technique is also applicable to the constrained formulation of sparse-coding given by Equation (1) for a fixed dictionary D.

Expected approximation error: Since we are solving a convex approximation to the actual target optimization problem, it is of interest to know how good this approximation is in terms of the original cost function. To give an idea of this, after an approximate solution a is obtained, we compute the expected value of the difference between the true and approximate regularization term values. The expectation is taken, naturally, in terms of the assumed distribution of the coefficients in a. Since the regularizers are separable, we can compute the error in a separable way as an expectation over each k-th coefficient, ζq(ak) = Eν∼qh ˜ψk(ν) − ψ(ν)

i

, where ˜ψk(·) is the approximation

of ψk(·) around the final estimate of ak. For the case of q =MOE, the expression obtained is (see Appendix)

ζMOE(ak, κ, β) = Eν∼MOE(κ,β)h ˜ψk(ν) − ψ(ν) i = log(ak+ β) + 1 ak+ β ak+ β κ − 1 − log β − 1 κ. In the MOE case, for κ and β fixed, the minimum of ζMOE occurs when ak = _κ−1β = µ(β, κ). We also have

ζMOE(0) = (κ − 1)−1− κ−1.

The function ζq(·) can be evaluated on each coefficient of A to give an idea of its quality. For example, in

the experiments from Section V, we obtained an average value of 0.16, which lies between ζMOE(0) = 0.19 and

minaζMOE(a) = 0.09. Depending on the experiment, this represents 6% to 7% of the total sparse coding cost

function value, showing the efficiency of the proposed optimization.

Comments on parameter estimation: All the universal models presented so far, with the exception of the conditional Jeffreys, depend on hyper-parameters which in principle should be tuned for optimal performance (remember that they do not influence the universality of the model). If tuning is needed, it is important to remember that the proposed universal models are intended for reconstruction coefficients of clean data, and thus their hyper-parameters should be computed from statistics of clean data, or either by compensating the distortion in the statistics caused by noise (see for example [30]). Finally, note that when D is linearly dependent and rank(D) = RM, the coefficients matrix A resulting from an exact reconstruction of X will have many zeroes which are not properly explained by any continuous distribution such as a Laplacian. We sidestep this issue by computing the statistics only from the non-zero coefficients in A. Dealing properly with the case P (a = 0) > 0 is beyond the scope of this work.

(12)

V. EXPERIMENTAL RESULTS

In the following experiments, the testing data X are 8×8 patches drawn from the Pascal VOC2006 testing subset,7 which are high quality 640×480 RGBimages with 8 bits per channel. For the experiments, we converted the 2600 images to grayscale by averaging the channels, and scaled the dynamic range to lie in the [0, 1] interval. Similar results to those shown here are also obtained for other patch sizes.

A. Dictionary learning

For the experiments that follow, unless otherwise stated, we use a “global” overcomplete dictionary D with K = 4M = 256 atoms trained on the full VOC2006 training subset using the method described in [35], [36], which seeks to minimize the following cost during training,8

min D,A 1 N N X j=1 n kxj− D ajk2₂+ λψ(aj) o + µ DTD 2 F, (21)

where k·k_F denotes Frobenius norm. The additional term, µ DTD

2

F, encourages incoherence in the learned

dictionary, that is, it forces the atoms to be as orthogonal as possible. Dictionaries with lower coherence are well known to have several theoretical advantages such as improved ability to recover sparse signals [11], [45], and faster and better convergence to the solution of the sparse coding problems (1) and (3) [13]. Furthermore, in [35] it was shown that adding incoherence leads to improvements in a variety of sparse modeling applications, including the ones discussed below.

We used MOE as the regularizer in (21), with λ = 0.1 and µ = 1, both chosen empirically. See [1], [26], [35]

for details on the optimization of (3) and (21). B. MOE as a prior for sparse coding coefficients

We begin by comparing the performance of the Laplacian and MOE priors for fitting a single global distribution to the whole matrix A. We compute A using (1) with ≈ 0 and then, following the discussion in Section IV, restrict our study to the nonzero elements of A.

The empirical distribution of A is plotted in Figure 3(a), along with the best fitting Laplacian, MOE, JOE, and a particularly good example of the conditional Jeffreys (CMOE) distributions.9 _The _MLE _{for the Laplacian fit is}

ˆ

θ = N1/ kAk₁ = 27.2 (here N1 is the number of nonzero elements in A). ForMOE, using (12), we obtained κ = 2.8

and β = 0.07. For JOE, θ1= 2.4 and θ2 = 371.4. According to the discussion in SectionIII-C, we used the value

κ = 2.8 obtained using the method of moments for MOE as a hint for choosing n0 = 2 (κ0 = n0+ 1 = 3 ≈ 2.8),

yielding β0 = 0.07, which coincides with the β obtained using the method of moments. As observed in Figure3(a),

in all cases the proposed mixture models fit the data better, significantly better for both Gamma-based mixtures,

MOE and CMOE, and slightly better for JOE. This is further confirmed by the Kullback-Leibler divergence (KLD) obtained in each case. Note that JOE fails to significantly improve on the Laplacian mode due to the excessively large estimated range [θ1, θ2]. In this sense, it is clear that the JOE model is very sensitive to its hyper-parameters,

and a better and more robust estimation would be needed for it to be useful in practice.

Given these results, hereafter we concentrate on the best case which is theMOE prior (which, as detailed above, can be derived from the conditional Jeffreys as well, thus representing both approaches).

From Figure1(e) we know that the optimal ˆθ varies locally across different regions, thus, we expect the mixture models to perform well also on a per-atom basis. This is confirmed in Figure 3(b), where we show, for each row ak, k = 1, . . . , K, the difference in KLD between the globally fitted MOE distribution and the best per-atom fitted

MOE, the globally fitted Laplacian, and the per-atom fitted Laplacians respectively. As can be observed, the KLD

obtained with the global MOEis significantly smaller than the global Laplacian in all cases, and even the per-atom Laplacians in most of the cases. This shows that MOE, with only two parameters (which can be easily estimated, as

7

http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html#VOC2006

8

While we could have used off-the-shelf dictionaries such asDCTin order to test our universal sparse coding framework, it is important to use dictionaries that lead to the state-of-the-art results in order to show the additional potential improvement of our proposed regularizers.

9

To compute the empirical distribution, we quantized the elements of A uniformly in steps of 2−8, which for the amount of data available, gives us enough detail and at the same time reliable statistics for all the quantized values.

(13)

Fig. 3. (a) Empirical distribution of the coefficients in A for image patches (blue dots), best fitting Laplacian (green),MOE(red), CMOE

(orange) andJOE(yellow) distributions. The Laplacian (KLD= 0.17 bits) is clearly not fitting the tails properly, and is not sufficiently peaked at zero either. The two models based on a Gamma prior, MOE(KLD= 0.01 bits) and CMOE (KLD= 0.01 bits), provide an almost perfect fit. The fitted JOE (KLD= 0.14) is the most sharply peaked at 0, but doest not fit the tails as tight as desired. As a reference, the entropy of the empirical distribution is H = 3.00 bits. (b) KLD for the best fitting global Laplacian (dark green), per-atom Laplacian (light green), global MOE (dark red) and per-atom MOE (light red), relative to the KLD between the globally fitted MOE distribution and the empirical distribution. The horizontal axis represents the indexes of each atom, k = 1, . . . , K, ordered according to the difference in KLD between the global MOEand the per-atom Laplacian model. Note how the globalMOEoutperforms both the global and per-atom Laplacian models in all but the first 4 cases. (c) active set recovery accuracy of `1 andMOE, as defined in SectionV-C, for L = 5 and L = 10, as a function of σ. The improvement of MOE over `1 is a factor of 5 to 9. (d)PSNR of the recovered sparse signals with respect to the true signals. In this case significant improvements can be observed at the high SNRrange, specially for highly sparse (L = 5) signals. The performance of both methods is practically the same for σ ≥ 10.

detailed in the text), is a much better model than K Laplacians (requiring K critical parameters) fitted specifically to the coefficients associated to each atom. Whether these modeling improvements have a practical impact is explored in the next experiments.

C. Recovery of noisy sparse signals

Here we compare the active set recovery properties of the MOE prior, with those of the `1-based one, on data

for which the sparsity assumption |Aj| ≤ L holds exactly for all j. To this end, we obtain sparse approximations

to each sample xj using the `0-based Orthogonal Matching Pursuit algorithm (OMP) on D [28], and record the

resulting active sets Aj as ground truth. The data is then contaminated with additive Gaussian noise of variance σ

and the recovery is performed by solving (1) for A with = CM σ2 and either the `1 or the MOE-based regularizer

for ψ(·). We use C = 1.32, which is a standard value in denoising applications (see for example [27]).

For each sample j, we measure the error of each method in recovering the active set as the Hamming distance between the true and estimated support of the corresponding reconstruction coefficients. The accuracy of the method is then given as the percentage of the samples for which this error falls below a certain threshold T . Results are shown in Figure3(c) for L = (5, 10) and T = (2, 4) respectively, for various values of σ. Note the very significant improvement obtained with the proposed model.

Given the estimated active set Aj, the estimated clean patch is obtained by projecting xj onto the subspace

defined by the atoms that are active according to Aj, using least squares (which is the standard procedure for

denoising once the active set is determined). We then measure the PSNR of the estimated patches with respect to the true ones. The results are shown in Figure 3(d), again for various values of σ. As can be observed, the

MOE-based recovery is significantly better, specially in the high SNR range. Notoriously, the more accurate active set recovery of MOE does not seem to improve the denoising performance in this case. However, as we will see next, it does make a difference when denoising real life signals, as well as for classification tasks.

D. Recovery of real signals with simulated noise

This experiment is an analogue to the previous one, when the data are the original natural image patches (without forcing exact sparsity). Since for this case the sparsity assumption is only approximate, and no ground truth is available for the active sets, we compare the different methods in terms of their denoising performance.

A critical strategy in image denoising is the use of overlapping patches, where for each pixel in the image a patch is extracted with that pixel as its center. The patches are denoised independently as M -dimensional signals

(14)

Fig. 4. Sample image denoising results. Top: Barbara, σ = 30. Bottom: Boats, σ = 40. From left to right: noisy, `1/OMP, `1/`1,MOE/MOE. The reconstruction obtained with the proposed model is more accurate, as evidenced by a better reconstruction of the texture in Barbara, and sharp edges in Boats, and does not produce the artifacts seen in both the `1 and `0 reconstructions, which appear as black/white speckles all over Barbara, and ringing on the edges in Boats.

σ = 10 learning `1 MOE [1] coding `0 `1 `0 MOE barbara 30.4/34.4 31.2/33.8 30.5/34.4 30.9/34.4 34.4 boat 30.4/33.7 30.9/33.4 30.5/33.7 30.8/33.8 33.7 lena 31.8/35.5 32.4/35.1 32.1/35.6 32.3/35.6 35.5 peppers 31.6/34.8 32.1/34.6 31.8/34.9 32.0/34.9 34.8 man 29.6/33.0 30.6/32.9 29.7/33.0 30.2/33.1 32.8 AVERAGE 30.7/34.2 31.4/33.9 30.8/34.2 31.1/34.3 34.1 σ = 20 `1 MOE [1] `0 `1 `0 MOE 26.5/30.6 26.9/30.2 26.8/30.7 27.0/30.9 30.8 26.9/30.2 27.2/30.1 27.1/30.3 27.3/30.4 30.3 28.3/32.3 28.6/32.0 28.7/32.3 28.8/32.4 32.4 28.3/31.9 28.7/31.8 28.6/31.9 28.7/32.0 31.9 25.8/28.8 26.3/28.9 26.0/28.9 26.2/29.0 28.8 27.0/30.6 27.4/30.4 27.3/30.6 27.5/30.8 30.6 σ = 30 `1 MOE [1] `0 `1 `0 MOE 24.5/28.2 24.8/28.2 24.8/28.3 24.9/28.5 28.4 25.0/28.1 25.2/28.2 25.3/28.2 25.4/28.3 28.2 26.4/30.1 26.6/30.2 26.7/30.3 26.8/30.4 30.3 26.3/29.8 26.6/29.9 26.6/29.9 26.7/29.9 -23.9/26.5 24.2/26.8 24.1/26.6 24.2/26.7 26.5 25.1/28.3 25.4/28.5 25.4/28.4 25.5/28.5 28.4 TABLE I

DENOISING RESULTS:IN EACH TABLE,EACH COLUMN SHOWS THE DENOISING PERFORMANCE OF A LEARNING+CODING COMBINATION. RESULTS ARE SHOWN IN PAIRS,WHERE THE LEFT NUMBER IS THE PSNR BETWEEN THE CLEAN AND RECOVERED INDIVIDUAL PATCHES,

AND THE RIGHT NUMBER IS THE PSNR BETWEEN THE CLEAN AND RECOVERED IMAGES. BEST RESULTS ARE IN BOLD. THE PROPOSED MOE PRODUCES BETTER FINAL RESULTS OVER BOTH THE`0AND`1ONES IN ALL CASES,AND AT PATCH LEVEL FOR ALLσ > 10. NOTE

THAT THE AVERAGE VALUES REPORTED ARE THEPSNROF THE AVERAGEMSE,AND NOT THEPSNRAVERAGE.

and then recombined into the final denoised images by simple averaging. Although this consistently improves the final result in all cases, the improvement is very different depending on the method used to denoise the individual patches. Therefore, we now compare the denoising performance of each method at two levels: individual patches and final image.

To denoise each image, the global dictionary described in Section V-A is further adapted to the noisy image patches using (21) for a few iterations, and used to encode the noisy patches via (2) with = CM σ2. We repeated the experiment for two learning variants (`1andMOEregularizers), and two coding variants ((2) with the regularizer

used for learning, and `0 viaOMP. The four variants were applied to the standard images Barbara, Boats, Lena, Man

and Peppers, and the results summarized in Table I. We show sample results in Figure 4. Although the quantitative improvements seen in Table I are small compared to `1, there is a significant improvement at the visual level, as

can be seen in Figure 4. In all cases the PSNR obtained coincides or surpasses the ones reported in [1].10

E. Zooming

As an example of signal recovery in the absence of noise, we took the previous set of images, plus a particularly challenging one (Tools), and subsampled them to half each side. We then simulated a zooming effect by upsampling

10

Note that in [1], the denoised image is finally blended with the noisy image using an empirical weight, providing an extra improvement to the final PSNR in some cases. The results inIare already better without this extra step.

(15)

image cubic `

0

`

1 MOE

barbara 25.0 25.6 25.5 25.6

boat

28.9 29.8 29.8 29.9

lena

32.7 33.8 33.8 33.9

peppers 32.0 33.4 33.4 33.4

man

28.4 29.4 29.3 29.4

tools

21.0 22.3 22.2 22.3

AVER 22.8 24.0 24.0 24.1

Fig. 5. Zooming results. Left to right: summary, Tools image, detail of zooming results for the framed region, top to bottom and left to right: cubic, `0, `1,MOE. As can be seen, theMOEresult is as sharp as `0 but produces less artifacts. This is reflected in the 0.1dB overall improvement obtained withMOE, as seen in the leftmost summary table.

them and estimating each of the 75% missing pixels (see e.g., [50] and references therein). We use a technique similar to the one used in [32]. The image is first interpolated and then deconvoluted using a Wiener filter. The deconvoluted image has artifacts that we treat as noise in the reconstruction. However, since there is no real noise, we do not perform averaging of the patches, using only the center pixel of ˆxj to fill in the missing pixel at j. The

results are summarized in Figure 5, where we again observe that using MOE instead of `0 and `1 improves the

results.

F. Classification with universal sparse models

In this section we apply our proposed universal models to a classification problem where each sample xj is to

be assigned a class label yj = 1, . . . , c, which serves as an index to the set of possible classes, {C1, C2, . . . , Cc}. We

follow the procedure of [36], where the classifier assigns each sample xj by means of the maximum a posteriori

criterion (5) with the term − log P (a) corresponding to the assumed prior, and the dictionaries representing each class are learned from training samples using (21) with the corresponding regularizer ψ(a) = − log P (a). Each experiment is repeated for the baseline Laplacian model, implied in the `1 regularizer, and the universal modelMOE,

and the results are then compared. In this case we expect that the more accurate prior model for the coefficients will result in an improved likelihood estimation, which in turn should improve the accuracy of the system.

We begin with a classic texture classification problem, where patches have to be identified as belonging to one out of a number of possible textures. In this case we experimented with samples of c = 2 and c = 3 textures drawn at random from the Brodatz database,11_{, the ones actually used shown in Figure} ₆_{. In each case the experiment}

was repeated 10 times. In each repetition, a dictionary of K = 300 atoms was learned from all 16×16 patches of the leftmost halves of each sample texture. We then classified the patches from the rightmost halves of the texture samples. For the c = 2 we obtained an average error rate of 5.13% using `1 against 4.12% when usingMOE, which

represents a reduction of 20% in classification error. For c = 3 the average error rate obtained was 13.54% using `1 and 11.48% using MOE, which is 15% lower. Thus, using the universal model instead of `1 yields a significant

improvement in this case (see for example [26] for other results in classification of Brodatz textures).

The second sample problem presented is the Graz’02 bike detection problem,12 _{where each pixel of each testing}

image has to be classified as either background or as part of a bike. In the Graz’02 dataset, each of the pixels can belong to one of two classes: bike or background. On each of the training images (which by convention are the first 150 even-numbered images), we are given a mask that tells us whether each pixel belongs to a bike or to the background. We then train a dictionary for bike patches and another for background patches. Patches that contain pixels from both classes are assigned to the class corresponding to the majority of their pixels.

11_{http://www.ux.uis.no/}∼_{tranden/brodatz.html}

12

(16)

Fig. 6. Textures used in the texture classification example.

Fig. 7. Classification results. Left to right: precision vs. recall curve, a sample image from the Graz’02 dataset, its ground truth, and the corresponding estimated maps obtained with `1and MOEfor a fixed threshold. The precision vs. recall curve shows that the mixture model gives a better precision in all cases. In the example, the classification obtained with MOEyields less false positives and more true positives than the one obtained with `1.

In Figure 7 we show the precision vs. recall curves obtained with the detection framework when either the `1

or the MOE regularizers were used in the system. As can be seen, theMOE-based model outperforms the `1 in this

classification task as well, giving a better precision for all recall values.

In the above experiments, the parameters for the `1 prior (λ), the MOE model (λMOE) and the incoherence term

(µ) were all adjusted by cross validation. The only exception is the MOE parameter β, which was chosen based on the fitting experiment as β = 0.07.

VI. CONCLUDING REMARKS

A framework for designing sparse modeling priors was introduced in this work, using tools from universal coding, which formalizes sparse coding and modeling from a MDL perspective. The priors obtained lead to models with both theoretical and practical advantages over the traditional `0 and `1-based ones. In all derived cases, the

designed non-convex problems are suitable to be efficiently (approximately) solved via a few iterations of (weighted) `1 subproblems. We also showed that these priors are able to fit the empirical distribution of sparse codes of

image patches significantly better than the traditional IID Laplacian model, and even the non-identically distributed independent Laplacian model where a different Laplacian parameter is adjusted to the coefficients associated to each atom, thus showing the flexibility and accuracy of these proposed models. The additional flexibility, furthermore, comes at a small cost of only 2 parameters that can be easily and efficiently tuned (either (κ, β) in the MOEmodel, or (θ1, θ2) in the JOE model), instead of K (dictionary size), as in weighted `1 models. The additional accuracy

of the proposed models was shown to have significant practical impact in active set recovery of sparse signals, image denoising, and classification applications. Compared to the Bayesian approach, we avoid the potential burden of solving several sampled sparse problems, or being forced to use a conjugate prior for computational reasons (although in our case, a fortiori, the conjugate prior does provide us with a good model). Overall, as demonstrated in this paper, the introduction of information theory tools can lead to formally addressing critical aspects of sparse modeling.

(17)

Future work in this direction includes the design of priors that take into account the nonzero mass at a = 0 that appears in overcomplete models, and online learning of the model parameters from noisy data, following for example the technique in [30].

ACKNOWLEDGMENTS

Work partially supported by NGA, ONR, ARO, NSF, NSSEFF, and FUNDACIBA-ANTEL. We wish to thank Julien Mairal for providing us with his fast sparse modeling toolbox, SPAMS.13 We also thank Federico Lecumberry for his participation on the incoherent dictionary learning method, and helpful comments.

APPENDIX

DERIVATION OF THEMOEMODEL

In this case we have P (a|θ) = θe−θa and w(θ|κ, β) = _Γ(κ)1 θκ−1_βκ_e−βθ_{, which, when plugged into (}₉_{), gives}

Q(a|β, κ) = Z ∞ θ=0 θe−θa 1 Γ(κ)θ κ−1_βκ_e−βθ_{dθ =} βκ Γ(κ) Z ∞ θ=0 e−θ(a+β)θκdθ. After the change of variables u := (a + β)θ (u(0) = 0, u(∞) = ∞), the integral can be written as

Q(a|β, κ) = β κ Γ(κ) Z ∞ θ=0 e−u u a + β k du a + β = βκ Γ(κ)(a + β) −(κ+1) Z ∞ θ=0 e−uukdu = β κ Γ(κ)(a + β) −(κ+1)_{Γ(κ + 1) =} βκ Γ(κ)(a + β) −(κ+1)_κΓ(κ),

obtaining Q(a|β, κ) = κβκ(a + β)−(κ+1), since the integral on the second line is precisely the definition of Γ(κ + 1). The symmetrization is obtained by substituting a by |a| and dividing the normalization constant by two, Q(|a||β, κ) = 0.5κβκ(|a| + β)−(κ+1).

The mean of the MOE distribution (which is defined only for κ > 1) can be easily computed using integration

by parts, µ(β, κ) = κβκ Z ∞ 0 u (u + β)(κ+1)du = κβ − u κ(u + β)κ ∞ 0 +1 κ Z ∞ 0 du (u + β)k = β κ − 1. In the same way, it is easy to see that the non-central moments of order i are µi = ₍κ−1β

i )

.

The MLE estimates of κ and β can be obtained using any nonlinear optimization technique such as Newton method, using for example the estimates obtained with the method of moments as a starting point. In practice, however, we have not observed any significant improvement in using the MLE estimates over the moments-based ones.

Expected approximation error in cost function

As mentioned in the optimization section, the LLA approximates the MOE regularizer as a weighted `1. Here

we develop an expression for the expected error between the true function and the approximate convex one, where the expectation is taken (naturally) with respect to the MOE distribution. Given the value of the current iterate a(t) = a0, (assumed positive, since the function and its approximation are symmetric), the approximated regularizer

is ψ(t)(a) = log(a0+ β) +_|a0|+β1 (a − a0). We have

Ea∼MOE(κ,β) h ψ(t)(a) − ψ(a) i = Z ∞ 0 κβκ (a + κ)κ+1 log(|a0+ β) + 1 a0+ β (a − a0) − log(a + β) da = log(a0+ β) + a0 a0+ β + κβ κ a0+ β Z ∞ 0 a (a + β)κ+1da − κβ κ Z ∞ 0 log(a + β) (a + β)κ+1da = log(a0+ β) + a0 a0+ β + β (a0+ β)(κ − 1) − log β − 1 κ. 13 http://www.di.ens.fr/willow/SPAMS/

(18)

DERIVATION OF THE CONSTRAINEDJEFFREYS(JOE)MODEL

In the case of the exponential distribution, the Fisher Information Matrix in (15) evaluates to

I(θ) = E_{P (·|˜}_θ) ∂ 2 ∂ ˜θ2(− log θ + θ log a) _˜ θ=θ = E_{P (·|˜}_θ) 1 ˜ θ2 _˜ θ=θ = 1 θ2.

By plugging this result into (14) with Θ = [θ1, θ2], 0 < θ1 < θ2 < ∞ we obtain w(θ) = _ln(θ2/θ1)1 1_θ. We now

derive the (one-sided) JOE probability density function by plugging this w(θ) in (9),

Q(a) = Z θ2 θ1 θe−θa 1 ln(θ2/θ1) dθ θ = 1 ln(θ2/θ1) Z θ2 θ1 e−θadθ = 1 ln(θ2/θ1) 1 a e−θ1a− e−θ2a.

Although Q(a) cannot be evaluated at a = 0, the limit for a → 0 exists and is finite, so we can just define Q(0) as this limit, which is

lim

a→0Q(a) = a→0lim

1 ln(θ2/θ1)a 1 − θ1a + o(a2) − (1 − θ2a + o(a2)) = θ2− θ1 ln(θ2/θ1) .

Again, if desired, parameter estimation can be done for example using maximum likelihood (via nonlinear optimization), or using the method of moments. However, in this case, the method of moments does not provide a closed form solution for (θ1, θ2). The non-central moments of order i are

µi = Z ∞+ 0 ai ln(θ2/θ1) 1 a h e−θ1a− e−θ1aida = 1 ln(θ2/θ1) Z +∞ 0 ai−1e−θ1ada − Z +∞ 0 ai−1e−θ2ada . (22)

For i = 1, both integrals in (22) are trivially evaluated, yielding µ1= _ln(θ2/θ1)1 (θ1−1−θ −1

2 ). For i > 1, these integrals

can be solved using integration by parts: µ+_i =

Z +∞

0

ai−1e−θ1ada = ai−1 1 (−θ1) e−θ1a +∞ 0 − 1 (−θ1) (i − 1) Z +∞ 0 ai−2e−θ1ada µ−_i = Z +∞ 0

ai−1e−θ2ada = ai−1 1 (−θ2) e−θ2a +∞ 0 − 1 (−θ2) (i − 1) Z +∞ 0

ai−2e−θ2ada,

where the first term in the right hand side of both equations evaluates to 0 for i > 1. Therefore, for i > 1 we obtain the recursions µ+_i = i−1_θ1 µ+_i−1, µ−_i = i−1_θ2 µ−_i−1, which, combined with the result for i = 1, give the final expression for all the moments of order i > 0

µi = (i − 1)! ln(θ2/θ1) 1 θi 1 − 1 θi 2 , i = 1, 2, . . . . In particular, for i = 1 and i = 2 we have θ1 = ln(θ2/θ1)µ1+ θ−12

−1

, θ2 = ln(θ2/θ1)µ2+ θ−21

−1

, which, when combined, give us

θ1 = 2µ1 µ2+ ln(θ2/θ1)µ21 , θ2 = 2µ1 µ2− ln(θ2/θ1)µ21 . (23)

One possibility is to solve the nonlinear equation θ2/θ1 = µ2+ln(θ2/θ1)µ 2 1 µ2−ln(θ2/θ1)µ2

1 for u = θ1/θ2 by finding the roots of the

nonlinear equation u = µ2+ln uµ21 µ2−ln uµ2

1 and choosing one of them based on some side information. Another possibility

is to simply fix the ratio θ2/θ1 beforehand and solve for θ1 and θ2 using (23).

DERIVATION OF THE CONDITIONALJEFFREYS (CMOE)MODEL

The conditional Jeffreys method defines a proper prior w(θ) by assuming that n0 samples from the data to be

modeled a were already observed. Plugging the Fisher information for the exponential distribution, I(θ) = θ−2, into (18) we obtain w(θ) = P (a n0_|θ)θ−1 R ΘP (an0|ξ)ξ−1dξ = ( Qn0 j=1θe −θaj_)θ−1 R+∞ 0 ( Qn0 j=1ξe−ξaj)ξ−1dξ = θ n0−1_e−θPn0 j=1aj R+∞ 0 ξn0−1e −ξPn0 j=1ajdξ .