Denoising by Sparse Approximation: Error Bounds Based on Rate-Distortion Theory

(1)

Volume 2006, Article ID 26318, Pages1–19 DOI 10.1155/ASP/2006/26318

Denoising by Sparse Approximation: Error Bounds

Based on Rate-Distortion Theory

Alyson K. Fletcher,1_{Sundeep Rangan,}2_{Vivek K Goyal,}3_{and Kannan Ramchandran}4

1_{Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720-1770, USA} 2_{Flarion Technologies Inc., Bedminster, NJ 07921, USA}

3_{Department of Electrical Engineering and Computer Science and Research Laboratory of Electronics,}

Massachusetts Institute of Technology, Cambridge, MA 02139-4307, USA

4_{Department of Electrical Engineering and Computer Sciences, College of Engineering, University of California, Berkeley,}

CA 94720-1770, USA

Received 9 September 2004; Revised 6 June 2005; Accepted 30 June 2005

If a signalxis known to have a sparse representation with respect to a frame, it can be estimated from a noise-corrupted observation yby finding the best sparse approximation toy. Removing noise in this manner depends on the frame eﬃciently representing the signal while itineﬃcientlyrepresents the noise. The mean-squared error (MSE) of this denoising scheme and the probability that the estimate has the same sparsity pattern as the original signal are analyzed. First an MSE bound that depends on a new bound on approximating a Gaussian signal as a linear combination of elements of an overcomplete dictionary is given. Further analyses are for dictionaries generated randomly according to a spherically-symmetric distribution and signals expressible with single dictionary elements. Easily-computed approximations for the probability of selecting the correct dictionary element and the MSE are given. Asymptotic expressions reveal a critical input signal-to-noise ratio for signal recovery.

Copyright © 2006 Alyson K. Fletcher et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Estimating a signal from a noise-corrupted observation of the signal is a recurring task in science and engineering. This paper explores the limits of estimation performance in the case where the onlya prioristructure on the signalx ∈RN

is that it has known sparsityKwith respect to a given set of vectorsΦ = {ϕi}Mi=1 ⊂ RN. The setΦis called adictionary and is generally aframe[1,2]. The sparsity ofKwith respect toΦmeans that the signalxlies in the set

ΦK=

v∈RN _|_v₌ M

i=1

αiϕiwith at mostKnonzeroαi’s

.

(1) In many areas of computation, exploiting sparsity is mo-tivated by reduction in complexity [3]; ifK N, then cer-tain computations may be more eﬃciently made onαthan onx. In compression, representing a signal exactly or approx-imately by a member ofΦK is a common first step in eﬃ

-ciently representing the signal, though much more is known whenΦ is a basis or union of wavelet bases than is known in the general case [4]. Of more direct interest here is that

sparsity models are becoming prevalent in estimation prob-lems; see, for example, [5,6].

The parameters of dimensionN, dictionary sizeM, and sparsityKdetermine the importance of the sparsity model. Representative illustrations ofΦKare given inFigure 1. With

dimensionN=2, sparsity ofK=1 with respect to a dictio-nary of sizeM=3 indicates thatxlies on one of three lines, as shown inFigure 1a. This is a restrictive model, even if there is some approximation error in (1). WhenMis increased, the model stops seeming restrictive, even though the set of possi-ble values forxhas measure zero inR2_{. The reason is that} un-less the dictionary has gaps, all ofR2_{is nearly covered. This} paper presents progress in explaining the value of a sparsity model for signal denoising as a function of (N,M,K).

1.1. Denoising by sparse approximation with a frame

Consider the problem of estimating a signalx ∈ RN _from

the noisy observationy=x+d, whered∈RN _{has the i.i.d.}

GaussianN(0,σ2_I

N) distribution. Suppose we know thatx

lies in givenK-dimensional subspace ofRN_{. Then projecting}

(2)

(a) (b)

Figure1: Two sparsity models in dimensionN = 2. (a) Having sparsityK =1 with respect to a dictionary withM=3 elements restricts the possible signals greatly. (b) With the dictionary size in-creased toM=100, the possible signals still occupy a set of measure zero, but a much larger fraction of signals is approximately sparse.

without aﬀecting the signal component. Denoting the pro-jection operator byP, we would have

x=P y=P(x+d)=Px+Pd=x+Pd, (2)

andPdhas onlyK/N fraction of the power ofd.

In this paper, we consider the more general signal model x ∈ ΦK. The setΦK defined in (1) is the union of at most

J =M K

subspaces of dimensionK. We henceforth assume thatM > K (thus J > 1); if not, the model reduces to the classical case of knowing a single subspace that containsx. The distribution ofx, if available, could also be exploited to remove noise. However, in this paper the denoising operation is based only on the geometry of the signal modelΦKand the

distribution ofd.

With the addition of the noised, the observed vectory will (almost surely) not be represented sparsely, that is, not be inΦK. Intuitively, a good estimate forxis the point fromΦK

that is closest to yin Euclidean distance. Formally, because the probability density function ofdis a strictly decreasing function ofd2, this is the maximum-likelihood estimate ofxgiveny. The estimate is obtained by applying an optimal sparse approximation procedure toy. We will write

xSA=arg min

x∈ΦK

y−x2 (3)

for this estimate and call it the optimalK-term approxima-tion ofy. Henceforth, we omit the subscript 2 indicating the Euclidean norm.

The main results of this paper are bounds on the per-component mean-squared estimation error (1/N)E[x −

xSA2] for denoising via sparse approximation.1 These bounds depend on (N,M,K) but avoid further dependence on the dictionaryΦ(such as the coherence ofΦ); some re-sults hold for allΦand others are for randomly generatedΦ.

1_{The expectation is always over the noise}_d_{and is over the dictionary}_Φ and signalxin some cases. However, the estimator does not use the dis-tribution ofx.

To the best of our knowledge, the results diﬀer from any in the literature in several ways.

(a) We study mean-squared estimation error for additive Gaussian noise, which is a standard approach to per-formance analysis in signal processing. In contrast, analyses such as [7] impose a deterministic bound on the norm of the noise.

(b) We concentrate on having dependence solely on dic-tionary size rather than more fine-grained properties of the dictionary. In particular, most signal recov-ery results in the literature are based on noise being bounded above by a function of thecoherenceof the dictionary [8–14].

(c) Some of our results are for spherically symmetric ran-dom dictionaries. The series of papers [15–17] is su-perficially related because of randomness, but in these papers, the signals of interest are sparse with respect to a single known, orthogonal basis and the observations are random inner products. The natural questions in-clude a consideration of the number of measurements needed to robustly recover the signal.

(d) We use source-coding thought experiments in bound-ing estimation performance. This technique may be useful in answering other related questions, especially in sparse approximation source coding.

Our preliminary results were first presented in [18], with fur-ther details in [19,20]. Probability of error results in a rather diﬀerent framework for basis pursuit appear in a manuscript submitted while this paper was under review [21].

1.2. Connections to approximation

A signal with an exactK-term representation might arise be-cause it was generated synthetically, for example, by a com-pression system. A more likely situation in practice is that there is an underlying true signalxthat has a goodK-term approximationrather than an exactK-termrepresentation. At very least, this is the goal in designing the dictionaryΦfor a signal class of interest. It is then still reasonable to compute (3) to estimatexfromy, but there are tradeoﬀs in the selec-tions ofKandM.

Let fM,K denote the squared Euclidean approximation

error of the optimal K-term approximation using an M-element dictionary. It is obvious that fM,Kdecreases with

in-creasingK, and with suitably designed dictionaries, it also decreases with increasing M. One concern of approxima-tion theory is to study the decay of fM,Kprecisely. (For this,

we should consider Nvery large or infinite.) For piecewise smooth signals, for example, wavelet frames give exponential decay withK[4,22,23].

(3)

of subspaces, and thus increases the chance that the selected subspace is not the best one for approximating x. Loosely, whenMis very large and the dictionary elements are not too unevenly spread, there is some subspace very close toy, and thusxSA≈y. This was illustrated inFigure 1.

Fortunately, there are many classes of signals for which Mneed not grow too quickly as a function ofNto get good sparse approximations. Examples of dictionaries with good computational properties that eﬃciently represent audio sig-nals were given by Goodwin [24]. For iterative design proce-dures, see papers by Engan et al.[25] and Tropp et al.[26].

One initial motivation for this work was to give guidance for the selection ofM. This requires the combination of ap-proximation results (e.g., bounds on fM,K) with results such

as ours. The results presented here do not address approxi-mation quality.

1.3. Related work

Computing optimal K-term approximations is generally a diﬃcult problem. Given ∈ R+ _and_K _∈_Z+ _{determine if} there exists aK-term approximationxsuch thatx−x ≤ is an NP-complete problem [27,28]. This computational in-tractability of optimal sparse approximation has prompted study of heuristics. A greedy heuristic that is standard for finding sparse approximate solutions to linear equations [29] has been known as matching pursuit in the signal process-ing literature since the work of Mallat and Zhang [30]. Also, Chen, et al.[31] proposed a convex relaxation of the approx-imation problem (3) calledbasis pursuit.

Two related discoveries have touched oﬀa flurry of recent research.

(a)Stability of sparsity. Under certain conditions, the posi-tions of the nonzero entries in a sparse representation of a signal are stable: applying optimal sparse approx-imation to a noisy observation of the signal will give a coeﬃcient vector with the original support. Typical results are upper bounds (functions of the norm of the signal and the coherence of the dictionary) on the norm of the noise that allows a guarantee of stability [7–10,32].

(b)Eﬀectiveness of heuristics. Both basis pursuit and matching pursuit are able to find optimal sparse ap-proximations, under certain conditions on the dictio-nary and the sparsity of signal [7,9,12,14,33,34]. To contrast, in this paper, we consider noise with unbounded support and thus a positive probability of failing to satisfy a suﬃcient condition for stability as in (a) above; and we do not address algorithmic issues in finding sparse approxima-tions. It bears repeating that finding optimal sparse approx-imations is presumably computationally intractable except in the cases where a greedy algorithm or convex relaxation happens to succeed. Our results are thus bounds on the per-formance of the algorithms that one would probably use in practice.

Denoising by finding a sparse approximation is similar to the concept of denoising by compression popularized by

Saito [35] and Natarajan [36]. More recent works in this area include those by Krim et al.[37], Chang et al.[38], and Liu and Moulin [39]. All of these works use bases rather than frames. To put the present work into a similar framework would require a “rate” penalty for redundancy. Instead, the only penalty for redundancy comes from choosing a sub-space that does not contain the true signal (“overfitting” or “fitting the noise”). The literature on compression with frames notably includes [40–44].

This paper uses quantization and rate-distortion theory only as a proof technique; there are no encoding rates be-cause the problem is purely one of estimation. However, the “negative” results on representing white Gaussian sig-nals with frames presented here should be contrasted with the “positive” encoding results of Goyal et al.[42]. The posi-tive results of [42] are limited to low rates (and hence signal-to-noise ratios that are usually uninteresting). A natural ex-tension of the present work is to derive negative results for encoding. This would support the assertion that frames in compression are useful not universally, but only when they can be designed to yield very good sparseness for the signal class of interest.

1.4. Preview of results and outline

To motivate the paper, we present a set of numerical results from Monte Carlo simulations that qualitatively reflect our main results. In these experiments,N,M, andKare small be-cause of the high complexity of computing optimal approx-imations and because a large number of independent trials are needed to get adequate precision. Each data point shown is the average of 100 000 trials.

Consider a true signalx ∈R4_(N ₌_{4) that has an exact} 1-term representation (K = 1) with respect toM-element dictionaryΦ. We observey=x+dwithd∼N(0,σ2_I

4) and compute estimatexSAfrom (3). The signal is generated with unit norm so that the signal-to-noise ratio (SNR) is 1/σ2_or

−10 log₁₀σ2_{dB. Throughout, we use the following definition} for mean-squared error:

MSE= 1

NE x−xSA 2

. (4)

To have tunableM, we used dictionaries that areM max-imally separated unit vectors inRN_{, where separation is}

mea-sured by the minimum pairwise angle among the vectors and their negations. These are cases of Grassmannian pack-ings [45,46] in the simplest case of packing one-dimensional subspaces (lines). We used packings tabulated by Sloane et al.[47].

Figure 2shows the MSE as a function ofσfor several val-ues ofM. Note that for visual clarity, MSE/σ2_{is plotted, and} all of the same properties are illustrated forK=2 inFigure 3. For small values ofσ, the MSE is (1/4)σ2_{. This is an example} of the general statement that

MSE= K

Nσ

2 _{for small}_σ, ₍₅₎

(4)

10 0

−10 −20

−30

10 log₁₀σ2 0

0.2 0.4 0.6 0.8 1

MSE/

σ

2

M=80

M=40

M=20

M=10

M=7

M=5

M=4

Figure2: Performance of denoising by sparse approximation when the true signalx∈R4_{has an exact 1-term representation with} re-spect to a dictionary that is an optimalM-element Grassmannian packing.

20 10 0 −10 −20 −30 −40 −50 −60

10 log₁₀σ2 0.4

0.5 0.6 0.7 0.8 0.9 1 1.1

MSE/

σ

2

M=40

M=20

M=10

M=7

M=5

M=4

Figure3: Performance of denoising by sparse approximation when the true signalx∈R4_{has an exact 2-term representation with} re-spect to a dictionary that is an optimalM-element Grassmannian packing.

scaled MSE approaches a constant value:

lim

σ→∞

MSE

σ2 =gK,M, (6) where gK,M is a slowly increasing function of M and

limM→∞gK,M =1. This limiting value makes sense because

in the limit,xSA≈ y =x+dand each component ofdhas varianceσ2_{; the denoising does not do anything. The} charac-terization of the dependence ofgK,MonKandMis the main

contribution ofSection 3.

Another apparent pattern inFigure 2that we would like to explain is the transition between low- and high-SNR be-havior. The transition occurs at smaller values ofσfor larger

values ofM. Also, MSE/σ2_{can exceed 1, so in fact the sparse} approximation procedure canincreasethe noise. We are not able to characterize the transition well for general frames. However, inSection 4we obtain results for large frames that are generated by choosing vectors uniformly at random from the unit sphere inRN_{. There, we get a sharp transition}

be-tween low- and high-SNR behavior.

2. PRELIMINARY COMPUTATIONS

Recall from the introduction that we are estimating a sig-nal x ∈ ΦK ⊂ RN from an observation y = x+d, where

d∼N(0,σ2_I

N).ΦKwas defined in (1) as the set of vectors

that can be represented as a linear combination ofKvectors fromΦ= {ϕm}Mm=1. We are studying the performance of the estimator

xSA=arg min

x∈ΦK

y−x. (7)

This estimator is the maximum-likelihood estimator ofxin this scenario, in whichdhas a Gaussian density and the esti-mator has no probabilistic prior information onx. The sub-script SA denotes “sparse approximation” because the esti-mate is obtained by finding the optimal sparse approxima-tion ofy. There are values ofysuch thatxSAis not uniquely defined. These collectively have probability zero and we ig-nore them.

FindingxSAcan be viewed as a two-step procedure: first, find the subspace spanned byKelements ofΦthat contains

xSA; then, projectyto that subspace. The identification of a subspace and the orthogonality of y−xSAto that subspace will be used in our analyses. LetPK = {Pi}ibe the set of the

projections onto subspaces spanned byKof theMvectors in

Φ. Then,PK has at mostJ =

_M

K

elements,2_{and the} esti-mate of interest is given by

xSA=PTy, T=arg max i

_P_i_y _. ₍₈₎

The distribution of the errorx−xSAand the average per-formance of the estimator both depend on the true signalx. Where there is no distribution onx, the performance mea-sure analyzed here is the conditional MSE,

e(x)= 1

NE x−xSA 2

|x; (9)

one could say that showing conditioning in (9) is merely for emphasis.

In the case thatT is independent ofd, the projection in (8) is to a fixedK-dimensional subspace, so

e(x)=K

Nσ

2_. ₍₁₀₎

This occurs whenM =K(there is just one element inPK)

or in the limit of high-SNR (smallσ2_{). In the latter case, the} subspace selection is determined byx, unperturbed byd.

(5)

3. RATE-DISTORTION ANALYSIS AND LOW-SNR BOUND

In this section, we establish bounds on the performance of sparse approximation denoising that apply for any dictionary

Φ. One such bound qualitatively explains the low-SNR per-formance shown in Figures2and3, that is, the right-hand side asymptotes in these plots.

The denoising bound depends on a performance bound for sparse approximationsignal representationdeveloped in Section 3.1. The signal representation bound is empirically evaluated inSection 3.2and then related to low-SNR denois-ing inSection 3.3. We will also discuss the diﬃculties in ex-tending this bound for moderate SNR. To obtain interesting results for moderate SNR, we consider randomly generated

Φ’s inSection 4.

3.1. Sparse approximation of a Gaussian source

Before addressing the denoising performance of sparse ap-proximation, we give an approximationresult for Gaussian signals. This result is alower boundon the MSE when sparsely approximating a Gaussian signal; it is the basis for an up-per boundon the MSE for denoising when the SNR is low. These bounds are in terms of the problem size parameters (M,N,K).

Theorem 1. LetΦbe anM-element dictionary, letJ=M K

, and letv ∈ RN _{have the distribution}_N_(¯_v,_σ2_I

N). Ifvis the

optimalK-sparse approximation ofvwith respect toΦ, then

1 NE

v−v2_≥_σ2_c 1

1−K

N

, (11)

where

c1=J−2/(N−K)

K N

K/(N−K)

. (12)

Forv¯=0, the stronger bound 1

NE

v−v2_≥_σ2_· c1 1−c1·

1−K

N

(13)

also holds.

The proof follows fromTheorem 2, seeAppendix A. Remarks. (i)Theorem 1 shows that for anyΦ, there is an approximation error lower bound that depends only on the frame sizeM, the dimension of the signalN, and the dimen-sion of the signal modelK.

(ii) AsM→ ∞withKandNfixed,c1→0. This is consis-tent with the fact that it is possible to drive the approximation error to zero by letting the dictionary grow.

(iii) The decay ofc1 asMincreases is slow. To see this, define a sparsity measureα=K/Nand a redundancy factor ρ=M/N. Now using the approximation (see, e.g., [48, page 530])

ρN αN

≈

ρ α

αN _ρ

ρ−α (ρ−α)N

, (14)

we can compute the limit

lim

N→∞c1=

_α ρ

2α

1−α

ρ 2(ρ−α)

αα 1/(1−α)

. (15)

Thus, the decay of the lower bound in (11) asρis increased behaves asρ−2α/(1−α)_{. This is slow when}_α_{is small.}

The theorem below strengthensTheorem 1by having a dependence on the entropy of the subspace selection ran-dom variableTin addition to the problem size parameters (M,N,K). The entropy ofTis defined as

H(T)= − |PK|

i=1

pT(i) log2pT(i) bits, (16)

wherepT(i) is the probability mass function ofT.

Theorem 2. LetΦbe anM-element dictionary, and letv ∈ RN _{have the distribution} _N_(¯_v,_σ2_I

N). Ifvis the optimal K

-sparse approximation ofvwith respect toΦandTis the index of the subspace that containsv, then

1 NE

v−v2_≥_σ2_c 2

1−K

N

, (17)

where

c2=2−2H(T)/(N−K)

K N

K/(N−K)

. (18)

Forv¯=0, the stronger bound 1

NE

v−v2_≥_σ2_· c2 1−c2 ·

1−K

N

(19)

also holds.

For the proof, seeAppendix A.

3.2. Empirical evaluation of approximation error bounds

The bound inTheorem 1does not depend on any character-istics of the dictionary other thanMandN. Thus it will be nearest to tight when the dictionary is well suited to repre-senting the Gaussian signalv. That the expression (11) is not just a bound but also a useful approximation is supported by the Monte Carlo simulations described in this section.

To empirically evaluate the tightness of the bound, we compare it to the MSE obtained with Grassmannian frames and certain random frames. The Grassmannian frames are from the same tabulation described inSection 1.4[47]. The random frames are generated by choosing M vectors uni-formly at random from the surface of a unit sphere. One such vector can be generated, for example, by drawing an i.i.d. Gaussian vector and normalizing it.

(6)

uniformly distributed, and hence the bound is the tightest. Each of parts (a)–(c) cover a single value ofNand combine K = 1 andK = 2. Part (d) shows results forN = 10 and N =100 forK =1. In all cases, the bound holds and gives a qualitative match in the dependence of the approximation error onK andM. In particular, the slopes on these log-log plots correspond to the decay as a function ofρdiscussed in Remark (iii). We also find that the diﬀerence in approxima-tion error between using a Grassmannian frame or a random frame is small.

3.3. Bounds on denoising MSE

We now return to the analysis of the performance of sparse approximation denoising as defined in Section 2. We wish to bound the estimation errore(x) for a given signalxand frameΦ.

To create an analogy between the approximation prob-lem considered inSection 3.1and the denoising problem, let ¯

v = x,v−v¯ = d, andv = y. These correspondences fit perfectly, sinced∼N(0,σ2_I

N) and we apply sparse

approxi-mation toyto getxSA.Theorem 2gives the bound

1

NE y−xSA 2

|x≥σ2_c 2

1−K

N

, (20)

wherec2is defined as before. As illustrated inFigure 5, it is as if we are attempting to representdby sparse approximation and we obtaind=xSA−x. The quantity we are interested in ise(x)=(1/N)E[d2_|_x].

In the case thatxandxSAare in the same subspace,d−d is orthogonal todsod2₌_d2₊_d₋_d2_{. Thus knowing} E[d2_|_x]₌_Nσ2_{and having a lower bound on}_E[_d2_|_x] immediately give an upper bound one(x).

The interesting case is whenxandxSAare not necessar-ily in the same subspace. Recalling thatTis the index of the subspace selected in sparse, approximation orthogonally de-composedasd=dT⊕dT⊥ withd_T in the selected subspace and similarly decomposed. Then dT =dT and the expected

squared norm of this component can be bounded above as in the previous paragraph. Unfortunately,dT⊥can be larger thandT⊥in proportion tox, as illustrated inFigure 5. The worst case is fordT⊥ =2d_T⊥, when ylies equidis-tant from the subspace ofxand the subspace ofxSA.

From this analysis, we obtain the weak bound

e(x)= 1

NE x−xSA 2

|x≤4σ2 (21)

and the limiting low-SNR bound

e(0)= 1

NE x−xSA 2

|x|x=0≤σ2

1−c2

1−K

N

.

(22)

4. ANALYSIS FOR ISOTROPIC RANDOM FRAMES

In general, the performance of sparse approximation denois-ing is given by

e(x)= 1

NE x−xSA 2

= 1

N

RN

x−

arg min

x∈ΦK

x+η−x2

2f(η)dη, (23)

wheref(·) is the density of the noised. While this expression does not give any fresh insight, it does remind us that the per-formance depends on every element ofΦ. In this section, we improve greatly upon (21) with an analysis that depends on each dictionary element being an independent random vec-tor and on the dictionary being large. The results are expec-tations over both the noisedand the dictionary itself. In ad-dition to analyzing the MSE, we also analyze the probability of error in the subspace selection, that is, the probability that x andxSA lie in diﬀerent subspaces. In light of the simula-tions inSection 3.2, we expect these analyses to qualitatively match the performance of a variety of dictionaries.

Section 4.1delineates the additional assumptions made in this section. The probability of error and MSE analyses are then given inSection 4.2. Estimates of the probability of error and MSE are numerically validated inSection 4.3, and finally limits asN→ ∞are studied inSection 4.4.

4.1. Modeling assumptions

This section specifies the precise modeling assumptions in analyzing denoising performance with large, isotropic, ran-dom frames. Though the results are limited to the case of K =1, the model is described for generalK. Diﬃculties in extending the results to generalKare described in the con-cluding comments of the paper. While many practical prob-lems involveK >1, the analysis of theK=1 case presented here illustrates a number of unexpected qualitative phenom-ena, some of which have been observed for higher values of K.

The model is unchanged from earlier in the paper except that the dictionaryΦand signalxare random.

(a) Dictionary generation.The dictionaryΦconsists ofM i.i.d. random vectors uniformly distributed on the unit sphere inRN_.

(b) Signal generation.The true signalxis a linear combi-nation of the firstKdictionary elements so that

x=

K

i=1

αiϕi, (24)

for some random coeﬃcients {αi}. The coeﬃcients

{αi}are independent of the dictionary except in that

xis normalized to havex2₌_N_{for all realizations of} the dictionary and coeﬃcients.

(c) Noise.The noisy signalyis given byy=x+d, where, as before,d∼N(0,σ2_I

(7)

103 102

101

M

10−5 10−4 10−3 10−2 10−1 100

No

rm

al

iz

ed

M

S

E

Random Grassmannian

Bound

K

=

1

K=2

(a)

102 101

M

10−3 10−2 10−1 100

No

rm

al

iz

ed

M

S

E

Random Grassmannian

Bound

K=1

K=2

(b)

102 101

M

10−2 10−1 100

No

rm

al

iz

ed

M

S

E

Random Grassmannian

Bound

K

=

1

K

=

2

(c)

103 102

101 100

M

10−1 100

No

rm

al

iz

ed

M

S

E

Random Bound

N

=

100

N

=

10

(d)

Figure 4: Comparison between the bound inTheorem 1and the approximation errors obtained with Grassmannian and spherically-symmetric random frames: (a)N=4,K∈ {1, 2}, 105_{trials per point, (b)}_N₌_6,_K_{∈ {}_{1, 2}_}_{, 10}4_{trials per point, (c)}_N ₌_10,_K_{∈ {}_{1, 2}_}_, 104_{trials per point, and (d)}_N_{∈ {}_{10, 100}_}_,_K₌_{1, 10}2_{trials per point. The horizontal axis in all plots is}_M_.

We will let

γ= 1

σ2, (25)

which is the input SNR because of the scaling ofx. (d)Estimator. The estimator xSA is defined as before to

be the optimalK-sparse approximation of ywith re-spect toΦ. Specifically, we enumerate theJ=M

K

K-element subsets ofΦ. Thejth subset spans a subspace denoted byVjandPjdenotes the projection operator

ontoVj. Then,

xSA=PTy, T= arg min j∈{1,2,...,J}

_y₋_P_j_y 2

. (26)

For the special case whenMandNare large andK=1, we will estimate two quantities.

Definition 1. The subspace selection error probability perr is defined as

perr=Pr

T=jtrue

, (27)

(8)

0

x xSA

y

d−d

d

Figure5: Illustration of variables to relate approximation and de-noising problems. (An undesirable case in whichxSAis not in the same subspace asx.)

Definition 2. Thenormalized expected MSEis defined as

EMSE= 1

Nσ2E x−xSA 2

= γ

NE x−xSA 2

. (28)

Normalized expected MSE is the per-component MSE normalized by the per-component noise variance (1/N)Ed2₌ _σ2_{. The term “expected MSE” emphasizes} that the expectation in (28) is over not just the noised, but also the dictionaryΦand signalx.

We will give tractable computations to estimate bothperr andEMSE. Specifically,perrcan be approximated from a sim-ple line integral andEMSE can be computed from a double integral.

4.2. Analyses of subspace selection error and MSE

The first result shows that the subspace selection error prob-ability can be bounded by a double integral and approxi-mately computed as a single integral. The integrands are sim-ple functions of the problem parameters M,N,K, andγ. While the result is only proven for the case ofK = 1,K is left in the expressions to indicate the precise role of this pa-rameter.

Theorem 3. Consider the model described in Section 4.1. WhenK = 1andMand Nare large, the subspace selection error probability defined in(27)is bounded above by

perr<1− _∞

0 _∞

0 fr(u)fs(v)

×exp

−

CG(u,v)r 1−G(u,v)

1{G(u,v)≤Gmax}dv du,

(29)

andperris well approximated by

perr(N,M,K,γ)

=1− ∞

0 fr(u) exp

−

C(N−K)σ2_u N+ (N−K)σ2_u

r

du

=1−

∞

0 fr(u) exp

−

Cau 1 +au

r

du,

(30)

where

G(u,v)= au

au+1−σ√Kv/N2, Gmax=

rβ(r,s)1/(r−1),

(31)

C=

J−1 rβ(r,s)

1/r

, J=

M K

, (32)

r=N−K

2 , s= K

2, (33)

a=(N−K)σ2

N =

N−K

Nγ , (34)

fr(u)is the probability distribution

fr(u)=rrΓ(r)ur−1e−ru, u∈[0,∞), (35)

β(r,s)is the beta function, and Γ(r)is the gamma function [49].

For the proof, seeAppendix B.

It is interesting to evaluateperrin two limiting cases. First, suppose thatJ =1. This corresponds to the situation where there is only one subspace. In this case,C=0 and (30) gives

perr=0. This is expected since with one subspace, there is no chance of a subspace selection error.

At the other extreme, suppose thatN,K, andγare fixed andM → ∞. ThenC → ∞andperr →1. Again, this is ex-pected since as the size of the frame increases, the number of possible subspaces increases and the probability of error increases.

The next result approximates the normalized expected MSE with a double integral. The integrand is relatively sim-ple to evaluate and it decays quickly asρ → ∞andu→ ∞

so numerically approximating the double integral is not dif-ficult.

Theorem 4. Consider the model described in Section 4.1. WhenK=1andMandNare large, the normalized expected MSE defined in(28)is given approximately by

EMSE(N,M,K,γ)= K N +

_∞

0 _∞

0 fr(u)gr(ρ)F(ρ,u)dρ du, (36) wherefr(u)is given in(35),gr(ρ)is the probability distribution

gr(ρ)=rCrrr−1exp

−(Cρ)r_,

F(ρ,u)=

⎧ ⎨ ⎩γ

au(1−ρ) +ρ ifρ(1 +au)< au, 0 otherwise,

(37)

andC,r, andaare defined in(32)–(34). For the proof, seeAppendix C.

4.3. Numerical examples

(9)

35 30 25 20 15 10 5 0 −5 −10

SNR (dB) −5

−4 −3 −2 −1 0

log

10

(

per

r.

)

Simulated Theoretical

(10,100) (10,1000) (5,1000)

(a)

35 30 25 20 15 10 5 0 −5 −10

SNR (dB) 0

0.2 0.4 0.6 0.8 1

EMSE

Simulated Theoretical

(10,100) (10,1000) (5,1000)

(b)

Figure6: Simulation of subspace selection error probability and normalized expected MSE for isotropic random dictionaries. Cal-culations were made for integer SNRs (in dB), with 5×105 inde-pendent simulations per data point. In all cases,K=1. The curve pairs are labeled by (N,M). Simulation results are compared to the estimates from Theorems3and4.

For each integer SNR from−10 dB to 35 dB, the subspace se-lection and normalized MSE were measured for 5×105 inde-pendent experiments. The resulting empirical probabilities of subspace selection error and normalized expected MSEs are shown inFigure 6. Plotted alongside the empirical results are the estimatesperrandEMSEfrom (30) and (36).

Comparing the theoretical and measured values in Figure 6, we see that the theoretical values match the sim-ulation closely over the entire SNR range. Also note that Figure 6b shows qualitatively the same behavior as Figures 2and3(the direction of the horizontal axis is reversed). In particular,EMSE ≈K/N for high SNR and the low-SNR be-havior depends onMandNas described by (22).

4.4. Asymptotic analysis

The estimatesperrandEMSEare not diﬃcult to compute nu-merically, but the expressions (30) and (36) provide little di-rect insight. It is thus interesting to examine the asymptotic behavior of perr andEMSEasNandM grow. The following theorem gives an asymptotic expression for the limiting value of the error probability function.

Theorem 5. Consider the function perr(N,M,K,γ)defined in (30). Define the critical SNR as a function of M, N,

andKas

γcrit=C−1=

J−1 rβ(r,s)

1/r

−1, (38)

whereC,r,s, andJare defined in(32)and(33). ForK =1 and any fixedγandγcrit,

lim

N,M→∞

γcrit constant

perr(N,M,K,γ)= ⎧ ⎨ ⎩

1 ifγ < γcrit, 0 ifγ > γcrit,

(39)

where the limit is on any sequence ofMandNwithγcrit con-stant.

For the proof, seeAppendix D.

The theorem shows that, asymptotically, there is a criti-cal SNRγcrit above which the error probability goes to one and below which the probability is zero. Thus, even though the frame is random, the error event asymptotically becomes deterministic.

A similar result holds for the asymptotic MSE.

Theorem 6. Consider the functionEMSE(M,N,K,γ)defined in(36)and the critical SNRγcrit defined in(38). ForK =1 and any fixedγandγcrit,

lim

N,M→∞

γcrit constant

EMSE(M,N,K,γ)= ⎧ ⎨ ⎩

Elim(γ) ifγ < γcrit, 0 ifγ > γcrit,

(40)

where the limit is on any sequence ofMandNwithγcrit con-stant, and

Elim(γ)=γ +γcrit 1 +γcrit.

(41)

For the proof, seeAppendix E.

Remarks. (i) Theorems 5 and6 hold for any values of K. They are stated for K = 1 because the significance of

perr(N,M,K,γ) andEMSE(M,N,K,γ) is proven only forK= 1.

(ii) Both Theorems5and6involve limits withγcritconstant. It is useful to examine how M,N, andK must be related asymptotically for this condition to hold. One can use the definition of the beta function β(r,s) = Γ(r)Γ(s)/Γ(r +s) along with Stirling’s approximation, to show that whenK

N,

rβ(r,s)1/r≈1. (42)

Substituting (42) into (38), we see thatγcrit ≈J1/r−1. Also, forKNandKM,

J1/r=

M

K

2/(N−K)

≈

M

K 2K/N

(10)

101 100

10−1 10−2

SNR 0

0.2 0.4 0.6 0.8 1 1.2 1.4

No

rm

al

iz

ed

M

S

E

γcrit.=2

γcrit.=1

γcrit.=0.5

Figure7: Asymptotic normalized MSE asN → ∞(fromTheorem 6) for various critical SNRsγcrit.

so that

γcrit≈

M K

2K/N

−1 (44)

for smallKand largeMandN. Therefore, forγcritto be con-stant, (M/K)2K/N _{must be constant. Equivalently, the}

dictio-nary sizeM must grow asK(1 +γcrit)N/(2K), which is expo-nential in the inverse sparsityN/K.

The asymptotic normalized MSE is plotted inFigure 7for various values of the critical SNRγcrit. Whenγ > γcrit, the normalized MSE is zero. This is expected: fromTheorem 5, whenγ > γcrit, the estimator will always pick the correct sub-space. We know that for a fixed subspace estimator, the nor-malized MSE isK/N. Thus, asN→ ∞, the normalized MSE approaches zero.

What is perhaps surprising is the behavior forγ < γcrit. In this regime, the normalized MSE actually increaseswith increasing SNR. At the critical level,γ=γcrit, the normalized MSE approaches its maximum value

maxElim= 2γcrit 1 +γcrit

. (45)

Whenγcrit >1, the limit of the normalized MSEElim(γ) sat-isfiesElim(γ) > 1. Consequently, the sparse approximation results in noiseamplificationinstead of noise reduction. In the worst case, asγcrit → ∞,Elim(γ)→ 2. Thus, sparse ap-proximation can result in a noise amplification by a factor as large as 2. Contrast this with the factor of 4 in (21), which seems to be a very weak bound.

5. COMMENTS AND CONCLUSIONS

This paper has addressed properties of denoising by sparse approximation that are geometric in that the signal model is membership in a specified union of subspaces, without a

probability density on that set. The denoised estimate is the feasible signal closest to the noisy observed signal.

The first main result (Theorems1and2) is a bound on the performance of sparse approximation applied to a Gaus-sian signal. This lower bound on mean-squared approxima-tion error is used to determine an upper bound on denoising MSE in the limit of low input SNR.

The remaining results apply to the expected perfor-mance when the dictionary itself is random with i.i.d. en-tries selected according to an isotropic distribution. Easy-to-compute estimates for the probability that the subspace con-taining the true signal is not selected and for the MSE are given (Theorems3and4). The accuracy of these estimates is verified through simulations. Unfortunately, these results are proven only for the case ofK = 1. The main technical diﬃculty in extending these results to generalK is that the distances to the various subspaces are not mutually indepen-dent. (ThoughLemma 2does not extend toK >1, we expect that a relation similar to (B.10) holds.)

Asymptotic analysis (N→ ∞) of the situation with a ran-dom dictionary reveals a critical value of the SNR (Theorems 5 and6). Below the critical SNR, the probability of select-ing the subspace containselect-ing the true signal approaches zero and the expected MSE approaches a constant with a simple, closed form; above the critical SNR, the probability of select-ing the subspace containselect-ing the true signal approaches one and the expected MSE approaches zero.

Sparsity with respect to a randomly generated dictionary is a strange model for naturally occurring signals. However, most indications are that a variety of dictionaries lead to performance that is qualitatively similar to that of random dictionaries. Also, sparsity with respect to randomly gener-ated dictionaries occurs when the dictionary elements are produced as the random instantiation of a communication channel. Both of these observations require further investi-gation.

APPENDIX

A. PROOF OF THEOREMS1AND2

We begin with a proof ofTheorem 2;Theorem 1will follow easily. The proof is based on analyzing an idealized encoder forv. Note that despite the idealization and use of source-coding theory, the bounds hold for any values of (N,M,K)— the results are not merely asymptotic. Readers unfamiliar with the basics of source-coding theory are referred to any standard text, such as [50–52], though the necessary facts are summarized below.

Consider the encoder forvshown in Figure 8. The en-coder operates by first finding the optimal sparse approx-imation ofv, which is denoted byv. The subspaces in ΦK

are assumed to be numbered, and the index of the subspace containingvis denoted byT.vis then quantized with a K-dimensional,b-bit quantizer represented by the box “Q” to produce the encoded version ofv, which is denoted byvQ.

(11)

v

H(T) bits bbits

SA

v

T Q vQ

Figure8: The proof ofTheorem 2is based on the analysis of a hy-pothetical encoder forv. The sparse approximation box “SA” finds the optimalK-sparse approximation ofv, denoted byv, by com-putingv=PTv. The subspace selectionTcan be represented with

H(T) bits. The quantizer box “Q” quantizes vwith bbits, with knowledge ofT. The overall output of the encoder is denoted by

vQ.

communicateTto a receiver that knows the probability mass function ofTis given by the entropy ofT, which is denoted byH(T) [51]. In analyzing the encoder forv, we assume that a large number of independent realizations ofvare encoded at once. This allowsbto be an arbitrary real number (rather than an integer) and allows the average number of bits used to representT to be arbitrarily close toH(T). The encoder ofFigure 8can thus be considered to useH(T) +b bits to representvapproximately asvQ.

The crux of the proof is to represent the squared error that we are interested in,v−v2_{, in terms of squared errors} of the overall encoderv→vQand the quantizerv→vQ. We will show the orthogonality relationship below and bound both terms:

Ev−v2₌ _E _v₋_v Q 2

bounded below using fact (a)

− Ev−vQ2

bounded above using fact (b) .

(A.1) The two facts we need from rate-distortion theory are as fol-lows [50–52].

(a) The lowest possible per-component MSE for encoding an i.i.d. Gaussian source with per-component variance σ2_with_R_{bits per component is}_σ2₂−2R_.

(b) Any source with per-component variance σ2 _{can be} encoded with R bits per component to achieve per-component MSEσ2₂−2R_.

(The combination of facts (a) and (b) tells us that Gaussian sources are the hardest to represent when distortion is mea-sured by MSE.)

Applying fact (a) to thev→vQencoding, we get 1

NE v−vQ 2

≥σ2₂−2(H(T)+b)/N_. _(A.2)

Now we would like to define the quantizer “Q” inFigure 8to get the smallest possible upper bound onE[v−vQ2].

Since the distribution ofvdoes not have a simple form (e.g., it is not Gaussian), we have no better tool than fact (b), which requires us only to find (or upper bound) the variance

of the input to a quantizer. Consider a two-stage quantiza-tion process forv. The first stage (with access toT) applies an aﬃne, length-preserving transformation tovsuch that the result has zero mean and lies in aK-dimensional space. The output of the first stage is passed to an optimal b-bit quantizer. Using fact (b), the performance of such a quan-tizer must satisfy

1

KE v−vQ 2

≤σv2|T2−2b/K, (A.3)

whereσ2

v|T is the per-component conditional variance ofv,

in theK-dimensional space, conditioned onT.

From here on, we have slightly diﬀerent reasoning for the ¯

v=0 and ¯v=0 cases. For ¯v=0, we get an exact expression for the desired conditional variance; for ¯v = 0, we use an upper bound.

When ¯v = 0, symmetry dictates thatE[v | T] = 0 for allTandE[v]=0. Thus, the conditional varianceσ2

v|T and

unconditional varianceσ2

v are equal. Taking the expectation

of

v2₌_v2₊_v₋_v2 _(A.4) gives

Nσ2₌_Kσ2

v+E

v−v2_. _(A.5) Thus

σ2

v|T =σv2=

1 K

Nσ2₋_E_v₋_v2₌N K

σ2₋_D SA

, (A.6) where we have usedDSAto denote (1/N)E[v−v2]—which is the quantity we are bounding in the theorem. Substituting (A.6) into (A.3) now gives

1

KE v−vQ 2

≤N

σ2₋_D

SA

K 2

−2b/K_. _(A.7)

To usefully combine (A.2) and (A.7), we need one more orthogonality fact. Since the quantizer Q operates in sub-spaceT, its quantization error is also in subspaceT. On the other hand, becausevis produced by orthogonal projection to subspaceT,v−vis orthogonal to subspaceT. So

_v₋_v_Q 2

= v−vQ 2+v−v2. (A.8) Taking expectations, rearranging, and substituting (A.2) and (A.7) gives

Ev−v2₌_E _v₋_v Q

2

−E v−vQ 2

≥Nσ22−2(H(T)+b)/N−Nσ2−DSA

2−2b/K. (A.9) Recalling that the left-hand side of (A.9) isNDSAand rear-ranging gives

DSA≥σ2

2−2(H(T)+b)/N₋₂−2b/K

1−2−2b/K

(12)

Since this bound must be true for allb ≥0, one can max-imize with respect tobto obtain the strongest bound. This maximization is messy; however, maximizing the numerator is easier and gives almost as strong a bound. The numerator is maximized when

b= K

N−K

H(T) +N 2 log2

N K

, (A.11)

and substituting this value ofbin (A.10) gives

DSA≥σ2·

2−2H(T)/(N−K)₍₁₋_(K/N_))(K/N₎K/(N−K) 1−2−2H(T)/(N−K)_(K/N₎N/(N−K) .

(A.12) We have now completed the proof ofTheorem 2for ¯v=0.

For ¯v=0, there is no simple expression forσ2

v|Tthat does

not depend on the geometry of the dictionary, such as (A.6), to use in (A.3). Instead, use

σv2|T ≤σv2≤

N Kσ

2_, _(A.13)

where the first inequality holds because conditioning cannot increase variance and the second follows from the fact that the orthogonal projection ofvcannot increase its variance, even if the choice of projection depends onv. Now following the same steps as for the ¯v=0 case yields

DSA≥σ2

2−2(H(T)+b)/N₋₂−2b/K _(A.14)

in place of (A.10). The bound is optimized overbto obtain

DSA≥σ2·2−2H(T)/(N−K)

1−

K N

K/(N−K) .

(A.15) The proof ofTheorem 1now follows directly: sinceTis a discrete random variable that can take at mostJ values, H(T)≤log₂J.

B. PROOF OFTHEOREM 3

Using the notation ofSection 4.1, letVj, j = 1, 2,. . .,J, be

the subspaces spanned by theJ possibleK-element subsets of the dictionaryΦ. LetPj be the projection operator onto

Vj, and letTbe index of the subspace closest toy. Let jtrue be the index of the subspace containing the true signalx, so that the probability of error is

perr=Pr

T=jtrue

. (B.1)

For eachj, letxj=Pjy, so that the estimatorxSAin (26) can be rewritten asxSA=xT. Also, define random variables

ρj=

_y₋_x_j 2

y2 , j=1, 2,. . .,J, (B.2) to represent the normalized distances betweenyand theVj’s.

Henceforth, theρj’s will be calledangles, sinceρj =sin2θj,

whereθjis the angle betweenyandVj. The angles are well

defined sincey2_>_{0 with probability one.}

Lemma 1. For all j = jtrue, the angleρj is independent ofx

andd.

Proof. Given a subspaceVand vectory, define the function

R(y,V)= y−PVy

2

y2 , (B.3) where PV is the projection operator onto the subspace y.

Thus,R(y,V) is the angle betweenyandV. With this no-tation,ρj =R(y,Vj). Sinceρjis a deterministic function of

yandVjandy=x+d, to showρjis independent ofxandd,

it suﬃces to prove thatρjis independent ofy. Equivalently,

we need to show that for any functionG(ρ) and vectorsy0 andy1,

EGρj

|y=y0

=EGρj

|y=y1

. (B.4) This property can be proven with the following symmetry argument. LetUbe any orthogonal transformation. SinceU is orthogonal,PUV(U y) = UPVy for all subspacesV and

vectorsy. Combining this with the fact thatUv = vfor allv, we see that

R(U y,UV)= U y−PUV(U y)

2

U y2 = _U

y−PV(y) 2

U y2

= y−PV(y)

2

y2 =R(y,V).

(B.5) Also, for any scalarα >0, it can be verified thatR(αy,V)=

R(y,V).

Now, lety0andy1be any two possible nonzero values for the vectory. Then, there exist an orthogonal transformation Uand scalarα >0 such thaty1=αU y0. Since j= jtrueand K=1, the subspaceVjis spanned by vectorsϕi, independent

of the vectory. Therefore, EGρj

|y=y1

=EGRy1,Vj

=EGRαU y0,Vj

=EGRU y0,Vj

.

(B.6) Now since the elements ofΦare distributed uniformly on the unit sphere, the subspaceUVj is identically distributed

toVj. Combining this with (B.5) and (B.6),

EGρj

|y=y1

=EGRU y0,Vj

=EGRU y0,UVj

=EGRy0,Vj

=EGρj

|y=y0

,

(B.7)

and this completes the proof.

Lemma 2. The random anglesρj,j= jtrue, are i.i.d., each with

a probability density function given by the beta distribution

pρ(ρ)= 1

β(r,s)ρ

r−1₍₁₋_ρ)s−1_, ₀_≤_ρ_≤_1, _(B.8)

(13)

Proof. SinceK =1, each of the subspacesVjfor j = jtrueis spanned by a single, unique vector inΦ. Since the vectors in

Φare independent and the random variablesρjare the angles

betweenyand the spacesVj, the angles are independent.

Now consider a single angleρjforj=jtrue. The angleρj

is the angle betweenyand a random subspaceVj. Since the

distribution of the random vectors definingVjis spherically

symmetric andρj is independent of y,ρj is identically

dis-tributed to the angle between any fixed subspaceVand a ran-dom vectorzuniformly distributed on the unit sphere. One way to create such a random vectorzis to takez=w/w, wherew ∼N(0,IN). Letw1,w2,. . .,wK be the components

ofwinV, and letwK+1,wK+2,. . .,wN be the components in

the orthogonal complement toV. If we define

X=

K

i=1 w2

i, Y = N

i=K+1 w2

i, (B.9)

then the angle betweenzandVisρ=Y/(X+Y). SinceXand Y are the sums ofKandN−Ki.i.d. squared Gaussian ran-dom variables, they are Chi-squared ranran-dom variables with K andN−K degrees of freedom, respectively [53]. Now, a well-known property of Chi-squared random variables is that if X andY are Chi-squared random variables withm andndegrees of freedom,Y/(X+Y) will have the beta dis-tribution with parametersm/2 andn/2. Thus,ρ=Y/(X+Y) has the beta distribution, with parametersrandsdefined in (33). The probability density function for the beta distribu-tion is given in (B.8).

Lemma 3. Letρmin=minj=jtrueρj. Thenρminis independent

ofxanddand has the approximate distribution

Prρmin>

≈exp−(C)r _(B.10)

for small, whereCis given in(32). More precisely, Prρmin>

<exp−(C)r(1−)s−1 for all∈(0, 1), Prρmin>

>exp

− (C)r

(1−)

for0<rβ(r,s)1/(r−1). (B.11) Proof. SinceLemma 1shows that eachρjis independent ofx

andd, it follows thatρminis independent ofxanddas well. Also, for anyj=jtrue, by bounding the integrand of

Prρj<

= 1

β(r,s)

0 ρ

r−1₍₁₋_ρ)s−1_dρ _(B.12)

from above and below, we obtain the bounds (1−)s−1

β(r,s)

0ρ

r−1_{dρ <}_Pr_ρ

j<

< 1

β(r,s)

0ρ

r−1_dρ, (B.13) which simplify to

(1−)s−1r

rβ(r,s) <Pr

ρj<

<

r

rβ(r,s). (B.14)

Now, there are J −1 subspaces Vj where j = jtrue, and by Lemma 2, the ρj’s are mutually independent.

Con-sequently, if we apply the upper bound of (B.14) and 1−δ > exp(−δ/(1−δ)) forδ ∈ (0, 1), withδ = r_/(rβ(r_{,s)), we}

obtain Prρmin>

=

j=jtrue

Prρj>

>

1− r

rβ(r,s) J−1

>exp

− r(J−1)

rβ(r,s)(1−δ)

for 0<<(rβ(r,s))1/r_,

>exp

− r(J−1)

rβ(r,s)(1−)

for 0<<rβ(r,s)1/(r−1). (B.15) Similarly, using the lower bound of (B.14), we obtain

Prρmin>

=

j=jtrue

Prρj>

<

1−(1−)s−1r

rβ(r,s) J−1

<exp

−(1−)s−1r(J−1)

rβ(r,s)

.

(B.16)

Proof ofTheorem 3.LetVtrue be the “correct” subspace, that is,Vtrue =Vjforj = jtrue. LetDtruebe the squared distance fromytoVtrue, and letDminbe the minimum of the squared distances from yto the “incorrect” subspacesVj, j = jtrue. Since the estimator selects the closest subspace, there is an error if and only ifDmin≤Dtrue. Thus,

perr=Pr

Dmin≤Dtrue

. (B.17)

To estimate this quantity, we will approximate the probability distributions ofDminandDtrue.

First considerDtrue. Write the noise vectordasd=d0+ d1, whered0is the component inVtrueandd1is inVtrue⊥ . Let D0= d02andD1= d12. Sincey=x+dandx∈Vtrue, the squared distance fromytoVtrueisD1. Thus,

Dtrue=D1. (B.18) Now considerDmin. For any j,xj is the projection of y

ontoVj. Thus, the squared distance fromyto any spaceVj

isy−xj2=ρjy2. Hence, the minimum of the squared

distances fromyto the spacesVj,j=jtrue, is

Dmin=ρminy2. (B.19) We will bound and approximatey2 _{to obtain the bound} and approximation of the theorem. Notice that y=x+d=

x+d0+d1, wherex+d0∈Vtrueandd1∈Vtrue⊥ . Using this or-thogonality and the triangle inequality, we obtain the bound

y2₌ _x₊_d

0 2+ d1 2≥

x − d0 2+ d1 2

=N−D0 2

+D1.