Consistency of Simple Thresholding Sparse PCA

In Example 3.1.1, the first eigenvector of the sample covariance matrix, $\hat{u}_1$, is strongly inconsistent with $u_1$ when $\alpha < 1$, because it attempts to estimate too many parameters. Sparse data analytic methods assume that many of these parameters are zero, which can allow greatly improved estimation of the first PC direction $u_1$. Here, this issue is explored in the context of sparse PCA. The sample covariance matrix based estimator, $\hat{u}_1$, can be improved by exploiting the fact that $u_1$ has many zero elements.

We first study a natural simple thresholding (ST) method where entries with small absolute values are replaced by zero. (Starting with the ST approach makes it easier to demonstrate the key ideas that are also useful for establishing the consistency of a more sophisticated sparse PCA method in Section 3.3.) In HDLSS contexts, it is challenging to apply thresholding directly to $\hat{u}_1$, because the number of its entries grows rapidly as $d \to \infty$, which naturally shrinks their magnitudes given that $\hat{u}_1$ has norm one. Thresholding is more conveniently formulated in terms of the dual covariance matrix (Jung et al., 2012).

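Denote the dual sample covariance matrix by $S_D = \frac{1}{n} X^T X$ and the first dual eigenvector by $\tilde{v}_1$. The sample eigenvector $\hat{u}_1$ is connected with the dual eigenvector $\tilde{v}_1$ through the following transformation:

$$\tilde{u}_1 = (\tilde{u}_{1,1}, \ldots, \tilde{u}_{d,1})^T = X \tilde{v}_1. \tag{3.3}$$

The vector $\tilde{u}_1$ is proportional to $\hat{u}_1$, so that $\hat{u}_1 = \tilde{u}_1 / \|\tilde{u}_1\|$ up to sign.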
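The relation between the primal and dual eigenvectors can be checked numerically. The following is a minimal sketch, assuming a $d \times n$ data matrix whose columns are the observations and using standard normal entries purely for illustration; the variable names are not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 10                      # HDLSS: dimension much larger than sample size
X = rng.standard_normal((d, n))     # columns are the observations X_1, ..., X_n

# Primal route: leading eigenvector of the d x d sample covariance X X^T / n.
_, U = np.linalg.eigh(X @ X.T / n)
u_hat = U[:, -1]                    # eigh returns eigenvalues in ascending order

# Dual route: leading eigenvector of the n x n dual matrix X^T X / n.
_, V = np.linalg.eigh(X.T @ X / n)
v_tilde = V[:, -1]
u_tilde = X @ v_tilde               # transformation (3.3)

# Up to sign, normalizing u_tilde recovers u_hat, so this is close to 1.
alignment = abs(u_hat @ u_tilde) / np.linalg.norm(u_tilde)
```

The dual route only requires an $n \times n$ eigendecomposition, which is the cheaper of the two when $d \gg n$.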

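Given a sequence of threshold values $\zeta$, define the thresholded entries as

$$\breve{u}_{k,1} = \begin{cases} \tilde{u}_{k,1} & \text{if } |\tilde{u}_{k,1}| > \zeta, \\ 0 & \text{if } |\tilde{u}_{k,1}| \le \zeta, \end{cases} \qquad k = 1, \ldots, d. \tag{3.4}$$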

Denote $\breve{u}_1 = (\breve{u}_{1,1}, \ldots, \breve{u}_{d,1})^T$ and normalize it to get the simple thresholding (ST) estimator $\hat{u}_1^{ST} = \breve{u}_1 / \|\breve{u}_1\|$.
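The construction of the ST estimator can be sketched in a few lines. This is an illustrative sketch only: the function name `simple_thresholding`, the single-spike simulation, and the particular choices of $d$, $n$, $\alpha$ and $\gamma$ are assumptions made here for demonstration, not settings taken from the text.

```python
import numpy as np

def simple_thresholding(X, zeta):
    """ST estimate of u_1 from a d x n data matrix X (columns are samples)."""
    n = X.shape[1]
    # First eigenvector of the n x n dual sample covariance matrix.
    _, V = np.linalg.eigh(X.T @ X / n)
    v_tilde = V[:, -1]
    u_tilde = X @ v_tilde                                      # (3.3)
    u_breve = np.where(np.abs(u_tilde) > zeta, u_tilde, 0.0)   # (3.4)
    norm = np.linalg.norm(u_breve)
    if norm == 0.0:
        raise ValueError("threshold too large: every entry was zeroed out")
    return u_breve / norm

# Illustration (assumed setup): u_1 = e_1, lambda_1 = d**alpha, all other
# eigenvalues equal to 1.
rng = np.random.default_rng(1)
d, n, alpha, gamma = 2000, 20, 0.8, 0.6
X = rng.standard_normal((d, n))
X[0, :] *= np.sqrt(d ** alpha)      # inflate the variance of the first coordinate

u_st = simple_thresholding(X, zeta=d ** (gamma / 2))
```

In this setup the noise coordinates of $X\tilde{v}_1$ stay of order one while the first coordinate is of order $d^{\alpha/2}$, so a threshold of order $d^{\gamma/2}$ with $\gamma < \alpha$ separates the two.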

For the model in Example 3.1.1, given an eigenvalue of strength $\alpha \in (0,1)$ (recall that $\lambda_1 = d^\alpha$ and $\hat{u}_1$ is strongly inconsistent), we explore below the conditions on the threshold sequence $\zeta$ under which the ST estimator $\hat{u}_1^{ST}$ is in fact consistent with $u_1$. First, the threshold $\zeta$ cannot be too large; otherwise all the entries will be zeroed out. It will be seen in Theorem 3.2.1 that a sufficient condition for this is $\zeta \le d^{\gamma/2}$, where $\gamma \in (0, \alpha)$. Second, the threshold $\zeta$ cannot be too small, or pure noise terms will be included. A parallel sufficient condition is shown to be $\zeta \ge \log^{\delta}(d)\, \lambda_2^{1/2}$, where $\delta \in (1/2, \infty)$.
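For a concrete feel of the two bounds, the admissible range of $\zeta$ can be evaluated for given values of $d$, $\lambda_2$, $\gamma$ and $\delta$. The helper below is a hypothetical illustration (the name `threshold_window` and the example values are assumptions); the bounds are asymptotic, so for a finite $d$ the range can be empty.

```python
import math

def threshold_window(d, lambda2, gamma, delta):
    """Return (lower, upper) = (log(d)**delta * sqrt(lambda2), d**(gamma/2))."""
    if not delta > 0.5:
        raise ValueError("delta must exceed 1/2")
    lower = math.log(d) ** delta * math.sqrt(lambda2)
    upper = d ** (gamma / 2)
    return lower, upper

# Example values chosen only for illustration.
lower, upper = threshold_window(d=10_000, lambda2=1.0, gamma=0.6, delta=0.75)
```

Any $\zeta$ between `lower` and `upper` satisfies both sufficient conditions for these particular values.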

Below we formally establish conditions on the eigenvalues of the population covariance matrix $\Sigma_d$ and the thresholding parameter $\zeta$, which give consistency of $\hat{u}_1^{ST}$ to $u_1$. The proofs are provided in Section 3.8.

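To fix ideas, we first consider the extreme sparsity case $u_1 = (1, 0, \ldots, 0)^T$. Suppose that $\lambda_1 \sim d^\alpha$, in the sense that

$$0 < c_1 \le \liminf_{d \to \infty} \frac{\lambda_1}{d^\alpha} \le \limsup_{d \to \infty} \frac{\lambda_1}{d^\alpha} \le c_2$$

for two constants $c_1$ and $c_2$. WLOG, assume $\sum_{j=2}^{d} \lambda_j \sim d$. As in Jung and Marron (2009), denote the measure of sphericity for $\{\lambda_2, \ldots, \lambda_d\}$ as

$$\varepsilon_2 \equiv \frac{\left(\sum_{j=2}^{d} \lambda_j\right)^2}{d \sum_{j=2}^{d} \lambda_j^2},$$

which can be used as the basis of a hypothesis test for equality of eigenvalues, and assume the $\varepsilon_2$-condition: $\varepsilon_2 \gg 1/d$, i.e.,

$$(d\varepsilon_2)^{-1} = \frac{\sum_{j=2}^{d} \lambda_j^2}{\left(\sum_{j=2}^{d} \lambda_j\right)^2} \to 0, \quad \text{as } d \to \infty. \tag{3.5}$$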
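The sphericity measure is straightforward to evaluate for a given set of eigenvalues. The sketch below uses a hypothetical helper `sphericity` and two made-up eigenvalue configurations to contrast a case where $(d\varepsilon_2)^{-1}$ is small with one where a single dominant eigenvalue among $\lambda_2, \ldots, \lambda_d$ keeps it away from zero.

```python
import numpy as np

def sphericity(tail_eigenvalues, d):
    """epsilon_2 for the eigenvalues lambda_2, ..., lambda_d."""
    lam = np.asarray(tail_eigenvalues, dtype=float)
    return lam.sum() ** 2 / (d * np.sum(lam ** 2))

d = 1000

# Equal tail eigenvalues: epsilon_2 = (d - 1) / d, so 1 / (d * epsilon_2) is tiny.
eps_flat = sphericity(np.ones(d - 1), d)

# One tail eigenvalue of order d: 1 / (d * epsilon_2) stays near 1/4.
lam = np.ones(d - 1)
lam[0] = d
eps_spiked = sphericity(lam, d)

inv_flat = 1.0 / (d * eps_flat)
inv_spiked = 1.0 / (d * eps_spiked)
```

The second configuration illustrates how a tail eigenvalue growing as fast as $d$ can violate condition (3.5).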

• Assume that the $\varepsilon_2$-condition (3.5) is satisfied, which guarantees that the dual matrix $S_D$ has a limit. Hence the first dual eigenvector $\tilde{v}_1$ will have a limit, which in turn helps establish the consistency of $\hat{u}_1^{ST}$.

• In addition, the second eigenvalue $\lambda_2$ needs to be well separated from the first eigenvalue $\lambda_1$. If not, it will be hard to distinguish the first and second empirical eigenvectors, as observed by Jung and Marron (2009), among others. In that case the appropriate amount of thresholding on the first empirical eigenvector becomes unclear. Therefore, we assume that $\lambda_2 \sim d^\theta$, where $\theta < \alpha$.

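Theorem 3.2.1. Suppose that $X_1, \ldots, X_n$ are random samples from a $d$-dimensional normal distribution $N(0, \Sigma_d)$ and the first population eigenvector is $u_1 = (1, 0, \ldots, 0)^T$. If the following conditions are satisfied:

(a) $\lambda_1 \sim d^\alpha$, $\lambda_2 \sim d^\theta$, and $\sum_{j=2}^{d} \lambda_j \sim d$, where $\theta \in [0, \alpha)$ and $\alpha \in (0, 1]$,

(b) the $\varepsilon_2$-condition (3.5) is satisfied,

(c) $\log^{\delta}(d)\, d^{\theta/2} \le \zeta \le d^{\gamma/2}$, where $\delta \in (1/2, \infty)$ and $\gamma \in (\theta, \alpha)$,

then the simple thresholding estimator $\hat{u}_1^{ST}$ is consistent with $u_1$.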

In fact, $u_1 = (1, 0, \ldots, 0)^T$ in Theorem 3.2.1 is a very extreme case. The following theorem considers the general case $u_1 = (u_{1,1}, \ldots, u_{d,1})^T$, where only $\lfloor d^\beta \rfloor$ elements of $u_1$ are non-zero. WLOG, we assume that the first $\lfloor d^\beta \rfloor$ entries are non-zero, for notational convenience. Define

$$Z_i \equiv (z_{1,i}, \ldots, z_{d,i})^T = (X_i^T u_1, \ldots, X_i^T u_d)^T, \quad i = 1, \ldots, n. \tag{3.6}$$

We can show that the $Z_i$ are iid $N(0, \mathrm{diag}\{\lambda_1, \ldots, \lambda_d\})$ random vectors. Let

$$W_i \equiv (w_{1,i}, \ldots, w_{d,i})^T = (\lambda_1^{-1/2} z_{1,i}, \ldots, \lambda_d^{-1/2} z_{d,i})^T, \quad i = 1, \ldots, n, \tag{3.7}$$

so that the $W_i$ are iid $N(0, I_d)$ random vectors, where $I_d$ is the $d$-dimensional identity matrix.
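The distributional claims for $Z_i$ and $W_i$ follow from a short covariance calculation. The derivation below is a sketch that assumes the usual setup in which $u_1, \ldots, u_d$ are orthonormal eigenvectors of $\Sigma_d$ with eigenvalues $\lambda_1, \ldots, \lambda_d$, so that $\Sigma_d u_k = \lambda_k u_k$:

```latex
\begin{align*}
\operatorname{Cov}(z_{j,i}, z_{k,i})
  &= E\!\left[ u_j^T X_i X_i^T u_k \right]
   = u_j^T \Sigma_d u_k
   = \lambda_k \, u_j^T u_k
   = \lambda_j \mathbf{1}\{j = k\}, \\
\operatorname{Var}(w_{j,i})
  &= \lambda_j^{-1} \operatorname{Var}(z_{j,i}) = 1 .
\end{align*}
```

Since $Z_i$ is a linear transformation of the mean-zero Gaussian vector $X_i$, it is Gaussian with this diagonal covariance, and independence across $i$ is inherited from the $X_i$.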

• The non-zero entries of the population eigenvector $u_1$ need to be a certain distance away from zero. In fact, if the non-zero entries of the first population eigenvector are close to zero, the corresponding entries of the first empirical eigenvector would also be small and look like pure noise entries. Thus, we assume

$$\max_{1 \le k \le \lfloor d^\beta \rfloor} |u_{k,1}|^{-1} \sim d^{\eta/2}, \quad \text{where } \eta \in [0, \alpha).$$

• From (3.6), we have

$$X_i = \sum_{j=1}^{d} z_{j,i} u_j, \quad i = 1, \ldots, n.$$

Since $z_{1,i}$ has the largest variance $\lambda_1$, the term $z_{1,i} u_1$ contributes the most to the variance of $X_i$, $i = 1, \ldots, n$. Note that $z_{1,i} u_1$ is parallel to $u_1$, and so $z_{1,i} u_1$ is the key to making the simple thresholding method work. So we need to show that the remaining parts

$$H_i \equiv (h_{1,i}, \ldots, h_{d,i})^T = \sum_{j=2}^{d} z_{j,i} u_j, \quad i = 1, \ldots, n \tag{3.8}$$

have a negligible effect on the direction vector $\hat{u}_1^{ST}$.

• Suppose that the $H_i$ are iid $N(0, \Delta_d)$, where $\Delta_d = (m_{kl})_{d \times d}$. A sufficient condition to make their effect negligible is the following mixing condition of Leadbetter et al. (1983):

$$|m_{kl}| \le m_{kk}^{1/2}\, m_{ll}^{1/2}\, \rho_{|k-l|}, \quad 1 \le k \ne l \le \lfloor d^\beta \rfloor, \tag{3.9}$$

where $\rho_t < 1$ for all $t > 1$ and $\rho_t \log(t) \to 0$ as $t \to \infty$. This mixing condition can guarantee that $\max_{1 \le i \le n} |h_{1,i}|$ has a fast convergence rate as $d \to \infty$. It enables us to neglect the influence of $H_i$ for sufficiently large $d$ and makes $z_{1,i} u_1$ the dominant component, which then yields consistency with the first population eigenvector $u_1$.

We now state one of the main theorems:

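Theorem 3.2.2. Assume that $X_1, \ldots, X_n$ are random samples from a $d$-dimensional normal distribution $N(0, \Sigma_d)$. Define $Z_i$, $W_i$ and $H_i$ as in (3.6), (3.7), and (3.8) for $i = 1, \ldots, n$. The first population eigenvector is $u_1 = (u_{1,1}, \ldots, u_{d,1})^T$ with $u_{k,1} \ne 0$ for $k = 1, \ldots, \lfloor d^\beta \rfloor$, and $u_{k,1} = 0$ otherwise.

If the following conditions are satisfied:

(a) $\lambda_1 \sim d^\alpha$, $\lambda_2 \sim d^\theta$, and $\sum_{j=2}^{d} \lambda_j \sim d$, where $\theta \in [0, \alpha)$ and $\alpha \in (0, 1]$,

(b) the $\varepsilon_2$-condition (3.5) is satisfied,

(c) $\max_{1 \le k \le \lfloor d^\beta \rfloor} |u_{k,1}|^{-1} \sim d^{\eta/2}$, where $\eta \in [0, \alpha)$,

(d) $H_i$ satisfies the mixing condition (3.9), $i = 1, \ldots, n$,

(e) $\log^{\delta}(d)\, d^{\theta/2} \le \zeta \le d^{\gamma/2}$, where $\delta \in (1/2, \infty)$ and $\gamma \in (\theta, \alpha - \eta)$,

then the thresholding estimator $\hat{u}_1^{ST}$ is consistent with $u_1$.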

We offer a couple of remarks regarding Theorem 3.2.2. First of all, the theorem naturally reduces to Theorem 3.2.1 if we let the sparsity index $\beta = 0$. More importantly, this theorem, and the following ones in Sections 3.2 to 3.4, show that the concepts depicted in Figure 3.1 hold much more generally than just the models in Examples 3.1.1 and 3.1.2. In particular, in the above theorem, setting $\theta = 0$ and $\eta = \beta$ would give the results plotted in Figure 3.1.

In addition, for different values of the thresholding parameter $\zeta$, the ST estimator $\hat{u}_1^{ST}$ is consistent with different convergence rates, as stated in the following theorem. The notation $\zeta = o(d^{\rho})$ below means that $\zeta d^{-\rho} \to 0$ as $d \to \infty$.

Theorem 3.2.3. For the thresholding parameter $\zeta = o(d^{(\alpha - \eta - \kappa)/2})$, where $\kappa \in [0, \alpha - \eta - \theta)$, the corresponding thresholding estimator $\hat{u}_1^{ST}$ is consistent with $u_1$, with a convergence rate of
