In Example 3.1.1, the first eigenvector of the sample covariance matrix, $\hat u_1$, is strongly inconsistent with $u_1$ when $\alpha < 1$, because it attempts to estimate too many parameters. Sparse data analytic methods assume that many of these parameters are zero, which can allow greatly improved estimation of the first PC direction $u_1$. Here, this issue is explored in the context of sparse PCA. The sample covariance matrix based estimator, $\hat u_1$, can be improved by exploiting the fact that $u_1$ has many zero elements.
We first study a natural simple thresholding (ST) method, in which entries with small absolute values are replaced by zero. (Starting with the ST approach makes it easier to demonstrate the key ideas, which are also useful for establishing the consistency of a more sophisticated sparse PCA method in Section 3.3.) In HDLSS contexts, it is challenging to apply thresholding directly to $\hat u_1$, because the number of its entries grows rapidly as $d \to \infty$, which naturally shrinks their magnitudes given that $\hat u_1$ has norm one. Thresholding is more conveniently formulated in terms of the dual covariance matrix (Jung et al., 2012).
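Denote the dual sample covariance matrix by $S = \frac{1}{n} X^T X$ and the first dual eigenvector by $\tilde v_1$. The sample eigenvector $\hat u_1$ is connected with the dual eigenvector $\tilde v_1$ through the following transformation:
$$\tilde u_1 = (\tilde u_{1,1}, \ldots, \tilde u_{d,1})^T = X \tilde v_1. \tag{3.3}$$
Here $\tilde u_1$ is an unnormalized version of $\hat u_1$: the two vectors are proportional.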
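Given a sequence of threshold values $\zeta$, define the thresholded entries as
$$\breve u_{k,1} = \begin{cases} \tilde u_{k,1}, & \text{if } |\tilde u_{k,1}| > \zeta, \\ 0, & \text{if } |\tilde u_{k,1}| \le \zeta, \end{cases} \qquad k = 1, \ldots, d. \tag{3.4}$$
Denote $\breve u_1 = (\breve u_{1,1}, \ldots, \breve u_{d,1})^T$ and normalize it to get the simple thresholding (ST) estimator
$$\hat u_1^{ST} = \breve u_1 / \|\breve u_1\|.$$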
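The construction in (3.3) and (3.4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `simple_thresholding`, the $d \times n$ column layout of `X`, and the sizes and threshold in the usage lines are all assumptions made here for the example.

```python
import numpy as np

def simple_thresholding(X, zeta):
    """Simple thresholding (ST) estimate of the first PC direction.

    X    : (d, n) array whose columns are mean-zero observations (assumed layout).
    zeta : threshold applied to the entries of the unnormalized vector X @ v1.
    """
    n = X.shape[1]
    S_dual = X.T @ X / n                        # n x n dual sample covariance
    _, eigvecs = np.linalg.eigh(S_dual)         # eigh returns ascending eigenvalues
    v1 = eigvecs[:, -1]                         # first dual eigenvector
    u_tilde = X @ v1                            # transformation (3.3)
    u_breve = np.where(np.abs(u_tilde) > zeta, u_tilde, 0.0)   # rule (3.4)
    norm = np.linalg.norm(u_breve)
    if norm == 0.0:
        raise ValueError("threshold too large: every entry was zeroed out")
    return u_breve / norm

# Illustrative use on a single-spike model (all numbers are hypothetical).
rng = np.random.default_rng(0)
d, n, alpha = 2000, 20, 0.8
X = rng.standard_normal((d, n))
X[0] *= np.sqrt(d ** alpha)                     # lambda_1 = d^alpha along u_1 = e_1
u_st = simple_thresholding(X, zeta=10.0)
print(np.flatnonzero(u_st))                     # indices kept by the threshold
```

Working with the $n \times n$ dual matrix keeps the eigen-decomposition cheap when $d \gg n$, which is the HDLSS setting considered here.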
For the model in Example 3.1.1, given an eigenvalue of strength $\alpha \in (0,1)$ (recall that $\lambda_1 = d^{\alpha}$ and $\hat u_1$ is strongly inconsistent), we explore below conditions on the threshold sequence $\zeta$ under which the ST estimator $\hat u_1^{ST}$ is in fact consistent with $u_1$. First, the threshold $\zeta$ cannot be too large; otherwise all the entries will be zeroed out. It will be seen in Theorem 3.2.1 that a sufficient condition to avoid this is $\zeta \le d^{\gamma/2}$, where $\gamma \in (0, \alpha)$. Second, the threshold $\zeta$ cannot be too small, or pure noise terms will be included. A parallel sufficient condition is shown to be $\zeta \ge \log^{\delta}(d)\, \lambda_2^{1/2}$, where $\delta \in (\frac{1}{2}, \infty)$.
Below we formally establish conditions on the eigenvalues of the population covariance matrix $\Sigma_d$ and the thresholding parameter $\zeta$ that give consistency of $\hat u_1^{ST}$ with $u_1$. The proofs are provided in Section 3.8.
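To fix ideas, we first consider the extreme sparsity case $u_1 = (1, 0, \ldots, 0)^T$. Suppose that $\lambda_1 \sim d^{\alpha}$, in the sense that
$$0 < c_1 \le \liminf_{d \to \infty} \frac{\lambda_1}{d^{\alpha}} \le \limsup_{d \to \infty} \frac{\lambda_1}{d^{\alpha}} \le c_2$$
for two constants $c_1$ and $c_2$. WLOG, assume $\sum_{j=2}^{d} \lambda_j \sim d$. As in Jung and Marron (2009), denote the measure of sphericity for $\{\lambda_2, \ldots, \lambda_d\}$ as
$$\varepsilon_2 \equiv \frac{\left(\sum_{j=2}^{d} \lambda_j\right)^2}{d \sum_{j=2}^{d} \lambda_j^2},$$
which can be used as the basis of a hypothesis test for equality of eigenvalues, and assume the $\varepsilon_2$-condition: $\varepsilon_2 \gg \frac{1}{d}$, i.e.,
$$(d\varepsilon_2)^{-1} = \frac{\sum_{j=2}^{d} \lambda_j^2}{\left(\sum_{j=2}^{d} \lambda_j\right)^2} \to 0, \quad \text{as } d \to \infty. \tag{3.5}$$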
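To make the $\varepsilon_2$-condition concrete, the sketch below evaluates $\varepsilon_2$ and the ratio in (3.5) for two hypothetical spectra. The function names and both spectra are assumptions chosen for illustration; they are not taken from the examples in the text.

```python
import numpy as np

def sphericity_tail(eigenvalues):
    """epsilon_2 for {lambda_2, ..., lambda_d}; input is the full spectrum lambda_1..lambda_d."""
    lam = np.asarray(eigenvalues, dtype=float)
    d = lam.size
    tail = lam[1:]                               # drop lambda_1
    return tail.sum() ** 2 / (d * np.sum(tail ** 2))

def epsilon2_ratio(eigenvalues):
    """The quantity (d * epsilon_2)^{-1} that (3.5) requires to vanish as d grows."""
    d = len(eigenvalues)
    return 1.0 / (d * sphericity_tail(eigenvalues))

for d in (100, 1000, 10000):
    flat = np.r_[d ** 0.8, np.ones(d - 1)]               # equal tail eigenvalues
    spiky = np.r_[d ** 0.8, float(d), np.ones(d - 2)]    # one tail eigenvalue of order d
    print(d, epsilon2_ratio(flat), epsilon2_ratio(spiky))
```

For the flat tail the ratio equals $1/(d-1)$ and tends to zero, whereas for the spiky tail it approaches $1/4$, so the second spectrum would not satisfy (3.5).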
• Assume that the $\varepsilon_2$-condition (3.5) is satisfied, which guarantees that the dual matrix $S$ has a limit. Hence the first dual eigenvector $\tilde v_1$ will have a limit, which in turn helps establish the consistency of $\hat u_1^{ST}$.
• In addition, we need the second eigenvalue $\lambda_2$ to be well separated from the first eigenvalue $\lambda_1$. If not, it will be hard to distinguish the first and second empirical eigenvectors, as observed by Jung and Marron (2009), among others. In that case the appropriate amount of thresholding on the first empirical eigenvector becomes unclear. Therefore, we assume that $\lambda_2 \sim d^{\theta}$, where $\theta < \alpha$.
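Theorem 3.2.1. Suppose that $X_1, \ldots, X_n$ are random samples from a $d$-dimensional normal distribution $N(0, \Sigma_d)$ and the first population eigenvector is $u_1 = (1, 0, \ldots, 0)^T$. If the following conditions are satisfied:
(a) $\lambda_1 \sim d^{\alpha}$, $\lambda_2 \sim d^{\theta}$, and $\sum_{j=2}^{d} \lambda_j \sim d$, where $\theta \in [0, \alpha)$ and $\alpha \in (0, 1]$,
(b) the $\varepsilon_2$-condition (3.5) is satisfied,
(c) $\log^{\delta}(d)\, d^{\theta/2} \le \zeta \le d^{\gamma/2}$, where $\delta \in (\frac{1}{2}, \infty)$ and $\gamma \in (\theta, \alpha)$,
then the simple thresholding estimator $\hat u_1^{ST}$ is consistent with $u_1$.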
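Condition (c) is asymptotic, so for a given finite $d$ the interval it describes may or may not be non-empty. The following sketch simply evaluates the two bounds; the helper name `threshold_window` and the parameter values are hypothetical choices made here.

```python
import math

def threshold_window(d, theta, gamma, delta):
    """Bounds of condition (c): log^delta(d) * d^(theta/2) <= zeta <= d^(gamma/2)."""
    lower = math.log(d) ** delta * d ** (theta / 2)
    upper = d ** (gamma / 2)
    return lower, upper

# theta = 0: the window is already non-empty at d = 10^4.
print(threshold_window(10 ** 4, theta=0.0, gamma=0.7, delta=1.0))

# theta = 0.5: the lower bound still exceeds the upper bound at d = 10^4,
# even though d^(gamma/2) eventually dominates log(d) * d^(theta/2).
print(threshold_window(10 ** 4, theta=0.5, gamma=0.7, delta=1.0))
```

The second call illustrates that, when $\gamma - \theta$ is small, the dimension may need to be very large before a valid threshold exists.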
In fact, $u_1 = (1, 0, \ldots, 0)^T$ in Theorem 3.2.1 is a very extreme case. The following theorem considers the general case $u_1 = (u_{1,1}, \ldots, u_{d,1})^T$, where only $\lfloor d^{\beta} \rfloor$ elements of $u_1$ are non-zero. WLOG, we assume that the first $\lfloor d^{\beta} \rfloor$ entries are non-zero, purely for notational convenience. Define
$$Z_i \equiv (z_{1,i}, \ldots, z_{d,i})^T = (X_i^T u_1, \ldots, X_i^T u_d)^T, \quad i = 1, \ldots, n. \tag{3.6}$$
We can show that the $Z_i$ are iid $N(0, \mathrm{diag}\{\lambda_1, \ldots, \lambda_d\})$ random vectors. Let
$$W_i \equiv (w_{1,i}, \ldots, w_{d,i})^T = (\lambda_1^{-1/2} z_{1,i}, \ldots, \lambda_d^{-1/2} z_{d,i})^T, \quad i = 1, \ldots, n, \tag{3.7}$$
so that the $W_i$ are iid $N(0, I_d)$ random vectors, where $I_d$ is the $d$-dimensional identity matrix.
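The transformations (3.6) and (3.7) amount to rotating the data into the population eigenbasis and rescaling each coordinate. A small sketch follows, with a hypothetical spectrum and a randomly generated orthonormal basis standing in for $u_1, \ldots, u_d$; the function names are introduced only for this example.

```python
import numpy as np

def pc_scores(X, U):
    """Z = U^T X: column i holds (X_i^T u_1, ..., X_i^T u_d)^T, as in (3.6)."""
    return U.T @ X

def standardize_scores(Z, lam):
    """W with entries lam_j^{-1/2} z_{j,i}, as in (3.7)."""
    return Z / np.sqrt(lam)[:, None]

rng = np.random.default_rng(1)
d, n = 5, 4
lam = np.array([8.0, 3.0, 1.0, 1.0, 0.5])            # hypothetical eigenvalues
U, _ = np.linalg.qr(rng.standard_normal((d, d)))     # columns play the role of u_1..u_d
X = U @ (np.sqrt(lam)[:, None] * rng.standard_normal((d, n)))   # columns X_i ~ N(0, Sigma_d)

Z = pc_scores(X, U)
W = standardize_scores(Z, lam)
print(np.allclose(U @ Z, X))                         # X_i is recovered as sum_j z_{j,i} u_j
```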
• The non-zero entries of the population eigenvector $u_1$ need to be a certain distance away from zero. In fact, if the non-zero entries of the first population eigenvector are close to zero, the corresponding entries of the first empirical eigenvector would also be small and look like pure noise entries. Thus, we assume
$$\max_{1 \le k \le \lfloor d^{\beta} \rfloor} |u_{k,1}|^{-1} \sim d^{\eta/2}, \quad \text{where } \eta \in [0, \alpha).$$
• From (3.6), we have
$$X_i = \sum_{j=1}^{d} z_{j,i} u_j, \quad i = 1, \ldots, n.$$
Since $z_{1,i}$ has the largest variance $\lambda_1$, the term $z_{1,i} u_1$ contributes the most to the variance of $X_i$, $i = 1, \ldots, n$. Note that $z_{1,i} u_1$ is consistent with $u_1$, and so $z_{1,i} u_1$ is the key to making the simple thresholding method work. We therefore need to show that the remaining parts
$$H_i \equiv (h_{1,i}, \ldots, h_{d,i})^T = \sum_{j=2}^{d} z_{j,i} u_j, \quad i = 1, \ldots, n, \tag{3.8}$$
have a negligible effect on the direction vector $\hat u_1^{ST}$.
• Suppose that the $H_i$ are iid $N(0, \Delta_d)$, where $\Delta_d = (m_{kl})_{d \times d}$. A sufficient condition to make their effect negligible is the following mixing condition of Leadbetter et al. (1983):
$$|m_{kl}| \le m_{kk}^{1/2}\, m_{ll}^{1/2}\, \rho_{|k-l|}, \quad 1 \le k \ne l \le \lfloor d^{\beta} \rfloor, \tag{3.9}$$
where $\rho_t < 1$ for all $t > 1$ and $\rho_t \log(t) \to 0$ as $t \to \infty$. This mixing condition guarantees that $\max_{1 \le i \le n} |h_{1,i}|$ has a fast convergence rate as $d \to \infty$. It enables us to neglect the influence of $H_i$ for sufficiently large $d$ and makes $z_{1,i} u_1$ the dominant component, which then gives consistency with the first population eigenvector $u_1$.
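The left-hand side of (3.9) can be inspected numerically for a given covariance. The sketch below forms $\Delta_d = \sum_{j \ge 2} \lambda_j u_j u_j^T$, the covariance of $H_i$ implied by (3.8), and reports the largest absolute correlation at each lag among the first $m$ coordinates. The function names, the six-dimensional spectrum, and the choice of $u_1$ are assumptions for illustration; the sequence $\rho_t$ itself is not specified in the text, so the code only computes the ratios that any admissible $\rho_t$ would have to dominate.

```python
import numpy as np

def remainder_covariance(U, lam):
    """Delta_d = sum_{j>=2} lam_j u_j u_j^T, the covariance of H_i in (3.8)."""
    U_rest = U[:, 1:]
    return (U_rest * lam[1:]) @ U_rest.T

def max_abs_correlation_by_lag(Delta, m):
    """For t = 1..m-1: max |m_kl| / sqrt(m_kk m_ll) over pairs with |k - l| = t, k, l <= m."""
    block = Delta[:m, :m]
    sd = np.sqrt(np.diag(block))
    corr = np.abs(block) / np.outer(sd, sd)
    return np.array([np.max(np.diag(corr, k=t)) for t in range(1, m)])

# Hypothetical example: u_1 has four equal non-zero loadings, all other eigenvalues are 1.
d, m = 6, 4
u1 = np.zeros(d)
u1[:m] = 0.5
M = np.eye(d)
M[:, 0] = u1
U, _ = np.linalg.qr(M)                 # orthonormal basis whose first column is +/- u1
lam = np.array([20.0, 1.0, 1.0, 1.0, 1.0, 1.0])

Delta = remainder_covariance(U, lam)   # equals I - u1 u1^T for this spectrum
lag_corr = max_abs_correlation_by_lag(Delta, m)
print(lag_corr)
```

Because the requirement $\rho_t \log(t) \to 0$ is asymptotic, a finite-$d$ computation like this can only serve as a sanity check, not as a verification of the condition.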
We now state one of the main theorems:
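Theorem 3.2.2. Assume that $X_1, \ldots, X_n$ are random samples from a $d$-dimensional normal distribution $N(0, \Sigma_d)$. Define $Z_i$, $W_i$ and $H_i$ as in (3.6), (3.7), and (3.8) for $i = 1, \ldots, n$. The first population eigenvector is $u_1 = (u_{1,1}, \ldots, u_{d,1})^T$ with $u_{k,1} \ne 0$ for $k = 1, \ldots, \lfloor d^{\beta} \rfloor$, and $u_{k,1} = 0$ otherwise.
If the following conditions are satisfied:
(a) $\lambda_1 \sim d^{\alpha}$, $\lambda_2 \sim d^{\theta}$, and $\sum_{j=2}^{d} \lambda_j \sim d$, where $\theta \in [0, \alpha)$ and $\alpha \in (0, 1]$,
(b) the $\varepsilon_2$-condition (3.5) is satisfied,
(c) $\max_{1 \le k \le \lfloor d^{\beta} \rfloor} |u_{k,1}|^{-1} \sim d^{\eta/2}$, where $\eta \in [0, \alpha)$,
(d) $H_i$ satisfies the mixing condition (3.9), $i = 1, \ldots, n$,
(e) $\log^{\delta}(d)\, d^{\theta/2} \le \zeta \le d^{\gamma/2}$, where $\delta \in (\frac{1}{2}, \infty)$ and $\gamma \in (\theta, \alpha - \eta)$,
then the thresholding estimator $\hat u_1^{ST}$ is consistent with $u_1$.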
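A small simulation can illustrate the general sparse setting of Theorem 3.2.2. It is a sketch under stated assumptions rather than a reproduction of any experiment in the text: a single-spike covariance with $\theta = 0$, equal non-zero loadings (so that $\eta$ is roughly $\beta$), hypothetical values of $d$, $n$, $\alpha$, $\beta$, and a hand-picked threshold lying between $\log(d)$ and $d^{(\alpha - \eta)/2}$.

```python
import numpy as np

def st_estimate(X, zeta):
    """Simple thresholding via the dual eigenvector, following (3.3) and (3.4)."""
    n = X.shape[1]
    _, vecs = np.linalg.eigh(X.T @ X / n)        # eigenvalues in ascending order
    u_tilde = X @ vecs[:, -1]
    u_breve = np.where(np.abs(u_tilde) > zeta, u_tilde, 0.0)
    return u_breve / np.linalg.norm(u_breve)

rng = np.random.default_rng(0)
d, n = 2000, 20
alpha, beta = 0.8, 0.2                           # illustrative values, not from the source
m = int(np.floor(d ** beta))                     # number of non-zero entries of u_1
u1 = np.zeros(d)
u1[:m] = 1.0 / np.sqrt(m)                        # equal loadings: max |u_k1|^{-1} = sqrt(m)

# Sigma_d = I + (d^alpha - 1) u1 u1^T, so lambda_1 = d^alpha and lambda_j = 1 for j >= 2.
g = rng.standard_normal(n)
X = np.sqrt(d ** alpha - 1.0) * np.outer(u1, g) + rng.standard_normal((d, n))

zeta = 9.0                                       # between log(d) ~ 7.6 and d^0.3 ~ 9.8
u_st = st_estimate(X, zeta)
support = np.flatnonzero(u_st)                   # coordinates surviving the threshold
cosine = abs(u_st @ u1)                          # |cos| of the angle to u_1
print(support, cosine)
```

The absolute cosine is used because eigenvectors are only identified up to sign.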
We offer a couple of remarks regarding Theorem 3.2.2. First of all, the theorem naturally reduces to Theorem 3.2.1 if we let the sparsity index $\beta = 0$. More importantly, this theorem, and the following ones in Sections 3.2 to 3.4, show that the concepts depicted in Figure 3.1 hold much more generally than just the models in Examples 3.1.1 and 3.1.2. In particular, in the above theorem, setting $\theta = 0$ and $\eta = \beta$ would give the results plotted in Figure 3.1.
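The reduction to Theorem 3.2.1 can be spelled out as a short check, assuming the usual convention that $u_1$ has unit norm; all symbols are those of Theorem 3.2.2.

```latex
\begin{align*}
\beta = 0 \;&\Rightarrow\; \lfloor d^{\beta} \rfloor = 1
  \;\Rightarrow\; u_1 = (\pm 1, 0, \ldots, 0)^T, \\
\text{(c)}:\quad & \max_{1 \le k \le 1} |u_{k,1}|^{-1} = 1 \sim d^{0/2}
  \;\Rightarrow\; \text{one can take } \eta = 0, \\
\text{(d)}:\quad & \{(k,l) : 1 \le k \ne l \le 1\} = \emptyset
  \;\Rightarrow\; \text{(3.9) holds vacuously}, \\
\text{(e)}:\quad & \gamma \in (\theta, \alpha - \eta) = (\theta, \alpha).
\end{align*}
```

Conditions (a) and (b) are shared by the two theorems, so what remains is exactly the set of conditions of Theorem 3.2.1.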
In addition, for different choices of the thresholding parameter $\zeta$, the ST estimator $\hat u_1^{ST}$ is consistent with different convergence rates, as stated in the following theorem. The notation $\zeta = o(d^{\rho})$ below means that $\zeta d^{-\rho} \to 0$ as $d \to \infty$.
Theorem 3.2.3. For the thresholding parameter $\zeta = o(d^{(\alpha - \eta - \kappa)/2})$, where $\kappa \in [0, \alpha - \eta - \theta)$, the corresponding thresholding estimator $\hat u_1^{ST}$ is consistent with $u_1$, with a convergence rate of