Journal of Multivariate Analysis
journal homepage: www.elsevier.com/locate/jmva
Boundary behavior in High Dimension, Low Sample Size asymptotics of PCA
Sungkyu Jung a,∗, Arusharka Sen b, J.S. Marron c
a Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
b Department of Mathematics and Statistics, Concordia University, Montreal, Quebec H3G 1M8, Canada
c Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
Article info
Article history: Received 1 June 2011. Available online 23 March 2012.
AMS 2000 subject classifications: primary 62H25, 34L20; secondary 62F12.
Keywords: Principal component analysis; High Dimension, Low Sample Size; Geometric representation; ρ-mixing; Consistency and strong inconsistency; Spiked covariance model
Abstract
In High Dimension, Low Sample Size (HDLSS) data situations, where the dimension $d$ is much larger than the sample size $n$, principal component analysis (PCA) plays an important role in statistical analysis. Under which conditions does the sample PCA reflect the population covariance structure well? We answer this question in a relevant asymptotic context where $d$ grows and $n$ is fixed, under a generalized spiked covariance model. Specifically, we assume the largest population eigenvalues to be of the order $d^{\alpha}$, where $\alpha <$, $=$, or $> 1$. Earlier results show the conditions for consistency and strong inconsistency of eigenvectors of the sample covariance matrix. In the boundary case, $\alpha = 1$, where the sample PC directions are neither consistent nor strongly inconsistent, we show that eigenvalues and eigenvectors do not degenerate but have limiting distributions. The result smoothly bridges the phase transition represented by the other two cases, and thus gives a spectrum of limits for the sample PCA in the HDLSS asymptotics. While the results hold under a general situation, the limiting distributions under the Gaussian assumption are illustrated in greater detail. In addition, the geometric representation of HDLSS data is extended to give three different representations, which depend on the magnitude of variances in the first few principal components.
©2012 Elsevier Inc. All rights reserved.
1. Introduction
The study of the covariance matrix and its usual estimator, the sample covariance matrix, is an important issue in multivariate statistics. In particular, the sample covariance matrix provides the conventional estimator of principal component analysis (PCA) through the eigenvalue–eigenvector decomposition. PCA plays an important role in dimension reduction and visualization of important data structure. The High Dimension, Low Sample Size (HDLSS) data situation, where the dimension $d$ of the sample space is much larger than the sample size $n$, occurs in many areas of modern science, and thus dimension reduction through PCA is becoming more important for the analysis of such data. The sample PCA (through the eigen-decomposition of the sample covariance matrix) is still well-defined when $d > n$, and is thus frequently used in practice. Even when the dimension is much higher than the sample size, PCA has been shown to be successful, for example in microarray studies [5]. What is the underlying mechanism that leads to the success of PCA in the HDLSS situation? This is the question we answer in this paper.
A central question is whether the sample principal components reflect the true underlying distributional structure in the HDLSS context. This has been investigated by comparing the sample eigenvalues and eigenvectors with their population counterparts, in a relevant asymptotic context where the dimension $d$ grows while the sample size $n$ is fixed [2,14,25,24]. The asymptotic direction of $d$ growing and $n$ fixed is also studied in different contexts; see for instance [7,20] and Chapter 4.5 of [21]. While we focus on this asymptotic context in this paper, we also point out that there has been a different approach to the problem, where the limits are taken along the direction in which $d$ and $n$ grow at the same rate, i.e. $d/n \to c \in (0, \infty)$ as $d \to \infty$. For results of this type, we refer to [13,4,19,18,16] and references therein.
In both investigations, the majority of results are well described in a spiked covariance model, proposed by [13]. An exception we point out is the work of [8], which proposes to estimate the spectral distribution of eigenvalues without assuming a spike model.
∗ Corresponding author.
E-mail addresses: [email protected] (S. Jung), [email protected] (A. Sen), [email protected] (J.S. Marron).
0047-259X/$ – see front matter.
A spiked covariance model assumes that the first few eigenvalues are distinctively larger than the others. We use a generalized version of the spike model, as described in Section 3, which is different from that of [13,19]. Let $\Sigma_{(d)}$ denote the population covariance matrix and $S_{(d)}$ denote the sample covariance matrix. The eigen-decomposition of $\Sigma_{(d)}$ is $\Sigma_{(d)} = U_d \Lambda_d U_d'$, where $\Lambda_d$ is a diagonal matrix of eigenvalues $\lambda_{1,d} \ge \lambda_{2,d} \ge \cdots \ge \lambda_{d,d}$ in non-increasing order, $U_d = [u_{1,d}, \ldots, u_{d,d}]$ is a matrix of corresponding eigenvectors, and $'$ denotes the transpose of the preceding matrix. The eigen-decomposition of $S_{(d)}$ is similarly defined as $S_{(d)} = \hat{U}_d \hat{\Lambda}_d \hat{U}_d'$.
As a simple example of the spiked model, consider $\lambda_{1,d} = \sigma^2 d^{\alpha}$, $\lambda_{2,d} = \cdots = \lambda_{d,d} = \tau^2$, for $\alpha, \sigma^2, \tau^2 > 0$ fixed. The first eigenvector of $S_{(d)}$, corresponding to the largest eigenvalue, is of interest, as it contains the most important variation of the data. The first sample eigenvector $\hat{u}_{1,d}$ is assessed with the angle formed by itself and its population counterpart $u_{1,d}$. The direction $\hat{u}_{1,d}$ is said to be consistent with $u_{1,d}$ if $\mathrm{Angle}(\hat{u}_{1,d}, u_{1,d}) \to 0$ as $d \to \infty$. However, in the HDLSS context, a perhaps counter-intuitive phenomenon frequently occurs, where the two directions tend to be as far away from each other as possible. We say the direction $\hat{u}_{1,d}$ is strongly inconsistent with $u_{1,d}$ if $\mathrm{Angle}(\hat{u}_{1,d}, u_{1,d}) \to \pi/2$ as $d \to \infty$. In the one spike model above, the order of magnitude $\alpha$ of the first eigenvalue is the key condition for these two limiting phenomena. [14] have shown that
$$\mathrm{Angle}(\hat{u}_{1,d}, u_{1,d}) \to \begin{cases} 0, & \alpha > 1; \\ \pi/2, & \alpha < 1, \end{cases} \qquad (1)$$
in probability under some conditions. Although the gap between consistency and strong inconsistency is relatively thin, the case $\alpha = 1$ has not been investigated, and is a main focus of this paper.
It is natural to conjecture from (1) that when $\alpha = 1$, the angle does not degenerate but converges to a random quantity in $(0, \pi/2)$. This claim is established in the simple one spike model in the next section, where we describe a range of limits for the eigenvalues and eigenvectors, depending on the order of magnitude $\alpha$ of $\lambda_{1,d}$. In Section 3, the claim is generalized to multiple spike cases, and is proved in a much more general distributional setting.
The parameter $\alpha$ gives a sharp mathematical boundary for the set of HDLSS situations where the estimated PC direction converges to the population direction. On the boundary, $\alpha = 1$, it will be shown that the estimated PC direction is affected by the true PC directions, but not as strongly as in the $\alpha > 1$ case. The success of PCA in the HDLSS situation is understood as an example of the $\alpha \ge 1$ cases, as otherwise the estimated principal components are meaningless, as shown in (1).
In a multiple spike model with $m > 1$ spikes, where the first $m$ principal components contain the important signal of the distribution, the sample PCA can be assessed by simultaneously comparing the first $m$ principal components. In particular, we investigate the limits of the distance between two subspaces: the subspace generated by the first $m$ sample PC directions $\hat{u}_{1,d}, \ldots, \hat{u}_{m,d}$ and the subspace generated by the first $m$ population PC directions. The distance is usefully measured by canonical angles and metrics between subspaces, the limiting distributions of which will be investigated for the $\alpha = 1$ case, as well as the cases $\alpha \ne 1$, in Section 3.3. The probability density functions of the limiting distributions for $\alpha = 1$ are also derived and illustrated under a Gaussian assumption, to show the effect of parameters in the distributions.
The HDLSS data set has an interesting geometric representation in the limit $d \to \infty$, as shown in [11]. In Section 4, we extend the result and show that there are three different geometric representations, which coincide with the range of limits depending on $\alpha$.
2. Range of limits in the single spike model
Suppose we have a data matrix $X_{(d)} = [X_{1,d}, \ldots, X_{n,d}]$, with $d > n$, where the $d$-dimensional random vectors $X_{i,d}$ are independent and identically distributed. We assume for now that $X_{i,d}$ is normally distributed with mean zero and covariance matrix $\Sigma_{(d)}$, but the Gaussian assumption will be relaxed in the next section. The population covariance matrix $\Sigma_{(d)}$ is assumed to have one spike, that is, the eigenvalues of $\Sigma_{(d)}$ are $\lambda_{1,d} = \sigma^2 d^{\alpha}$, $\lambda_{2,d} = \cdots = \lambda_{d,d} = \tau^2$. The corresponding eigenvectors of $\Sigma_{(d)}$ are denoted by $u_{i,d}$. The sample covariance matrix is defined as $S_{(d)} = \frac{1}{n} X_{(d)} X_{(d)}'$, with its $i$th eigenvalue and eigenvector denoted by $\hat{\lambda}_{i,d}$ and $\hat{u}_{i,d}$, respectively.
The following theorem summarizes the spectrum of the limiting distributions of the eigenvalues and eigenvectors of $S_{(d)}$, depending on the order $\alpha$ of $\lambda_{1,d}$. Note that the angle between two vectors $u, \hat{u}$ is represented by the inner product through $\mathrm{Angle}(u, \hat{u}) = \cos^{-1}(u'\hat{u})$. For eigenvectors with common eigenvalues, there are of course infinitely many choices. The argument in the following theorem assumes that we choose a set of population eigenvectors $u_{j,d}$ before obtaining $\hat{u}_{j,d}$.
Theorem 1. Under the Gaussian assumption and the one spike case above,
(i) the limit of the first eigenvalue depends on $\alpha$:
$$\frac{\hat{\lambda}_{1,d}}{\max(d^{\alpha}, d)} \Longrightarrow \begin{cases} \sigma^2 \dfrac{\chi_n^2}{n}, & \alpha > 1; \\ \sigma^2 \dfrac{\chi_n^2}{n} + \dfrac{\tau^2}{n}, & \alpha = 1; \\ \dfrac{\tau^2}{n}, & \alpha < 1, \end{cases}$$
as $d \to \infty$, where $\Longrightarrow$ denotes convergence in distribution, and $\chi_n^2$ denotes a random variable with the $\chi^2$ distribution with $n$ degrees of freedom. The rest of the eigenvalues converge to the same quantity when scaled, that is, for any $\alpha \in [0, \infty)$ and $j = 2, \ldots, n$,
$$\frac{\hat{\lambda}_{j,d}}{d} \to \frac{\tau^2}{n}, \quad \text{as } d \to \infty,$$
in probability.
(ii) The limit of the first eigenvector depends on $\alpha$:
$$u_{1,d}'\hat{u}_{1,d} \Longrightarrow \begin{cases} 1, & \alpha > 1; \\ \left(1 + \dfrac{\tau^2}{\sigma^2 \chi_n^2}\right)^{-1/2}, & \alpha = 1; \\ 0, & \alpha < 1, \end{cases}$$
as $d \to \infty$. The rest of the eigenvectors are strongly inconsistent with their population counterparts: for any $\alpha \in [0, \infty)$ and $j = 2, \ldots, n$,
$$u_{j,d}'\hat{u}_{j,d} \to 0, \quad \text{as } d \to \infty,$$
in probability.
The case $\alpha = 1$ bridges the other two cases. In particular, the ratio of the sample and population eigenvalues $\hat{\lambda}_{1,d}/\lambda_{1,d}$ is asymptotically unbiased for 1 when $\alpha > 1$. It is asymptotically biased when $\alpha = 1$, and becomes completely deterministic in the case $\alpha < 1$, where the effect of $\sigma^2$ on $\hat{\lambda}_{1,d}$ becomes negligible. Moreover, when $\alpha = 1$, the angle $\mathrm{Angle}(u_{1,d}, \hat{u}_{1,d})$ to the optimal direction converges to a random quantity which is defined on $(0, \pi/2)$ and depends on $\sigma^2$, $\tau^2$, and $n$. The effect of those parameters on the limiting distribution of $\mathrm{Angle}(u_{1,d}, \hat{u}_{1,d})$ is illustrated in Fig. 1. The ratio $\sigma^2/\tau^2$ can be understood as a signal to noise ratio. A high value of $\sigma^2/\tau^2$ means that the major variation along the first PC direction is strong. Therefore, for larger values of $\sigma^2/\tau^2$, $\mathrm{Angle}(u_{1,d}, \hat{u}_{1,d})$ should be closer to zero than for smaller values of the ratio, as depicted in the upper panel of Fig. 1. Moreover, the sample PCA with a larger sample size $n$ should perform better than with a smaller sample size. The sample size $n$ becomes the degrees of freedom of the $\chi^2$ distribution in the limit, and the bottom panel of Fig. 1 shows that $\hat{u}_{i,d}$ is closer to $u_{i,d}$ for larger values of $n$.
The Gaussian assumption in the previous theorem appears as a driver of the limiting $\chi^2$ distributions. Under the general non-Gaussian assumption we state in the next section, the $\chi^2$ will be replaced by a distribution that depends heavily on the distribution of the population principal component scores, which may not be Gaussian.
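The dependence of the limiting angle on $\sigma^2/\tau^2$ and $n$ described above can be explored numerically. The following is a minimal sketch, using only the Python standard library, that draws from the $\alpha = 1$ limit in Theorem 1(ii) under the Gaussian assumption; the function names, the parameter values and the number of draws are illustrative assumptions, not part of the paper.

```python
import math
import random

def sample_limit_angle(sigma2, tau2, n, rng):
    """One draw of the limiting Angle(u_1, u_hat_1) for alpha = 1 (Theorem 1(ii))."""
    # chi^2_n variable generated as a sum of n squared standard normals.
    chi2_n = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
    cos_angle = (1.0 + tau2 / (sigma2 * chi2_n)) ** -0.5
    return math.acos(cos_angle)

def mean_limit_angle(sigma2, tau2, n, draws=5000, seed=0):
    """Monte Carlo average of the limiting angle (in radians)."""
    rng = random.Random(seed)
    total = sum(sample_limit_angle(sigma2, tau2, n, rng) for _ in range(draws))
    return total / draws

# Hypothetical settings: vary the signal to noise ratio, then the sample size.
low_snr = mean_limit_angle(sigma2=0.5, tau2=1.0, n=10)
high_snr = mean_limit_angle(sigma2=5.0, tau2=1.0, n=10)
small_n = mean_limit_angle(sigma2=1.0, tau2=1.0, n=5)
large_n = mean_limit_angle(sigma2=1.0, tau2=1.0, n=50)
```

Under these assumed settings, the average angle should be smaller for the larger signal to noise ratio and for the larger $n$, in line with the qualitative description of Fig. 1.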
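Remark 1. The results in Theorem 1 can be used to estimate the parameters $\sigma^2$ and $\tau^2$ in the model with $\alpha = 1$. As a simple example, one can set
$$\hat{\tau}^2 = \frac{n}{n-1} \sum_{j=2}^{n} \frac{\hat{\lambda}_{j,d}}{d} \quad \text{and} \quad \hat{\sigma}^2 = \frac{\hat{\lambda}_{1,d}}{d} - \frac{\hat{\tau}^2}{n}.$$
Then $\hat{\tau}^2 \to \tau^2$ and $\hat{\sigma}^2 \Longrightarrow \sigma^2 \chi_n^2 / n$ as $d \to \infty$ by Theorem 1 and Slutsky's theorem. The estimator $\hat{\sigma}^2$ is not consistent but can evidently be used to construct a confidence interval for $\sigma^2$:
$$P\left( \frac{n\hat{\sigma}^2}{\chi^2_{n,1-(a/2)}} \le \sigma^2 \le \frac{n\hat{\sigma}^2}{\chi^2_{n,a/2}} \right) \to (1 - a) \quad \text{as } d \to \infty.$$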
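The estimators and the interval in this remark can be sketched in a few lines of standard-library Python. The function names are hypothetical, the $\chi^2$ quantiles are supplied by the caller (the values used below are the usual table values for 10 degrees of freedom), and the eigenvalues in the example are idealized numbers chosen for illustration rather than data from the paper.

```python
def estimate_spike_parameters(sample_eigenvalues, d):
    """Return (sigma2_hat, tau2_hat) from the n non-zero sample eigenvalues."""
    eigs = sorted(sample_eigenvalues, reverse=True)
    n = len(eigs)
    # tau2_hat = n/(n-1) * sum_{j>=2} lambda_hat_j / d
    tau2_hat = n / (n - 1) * sum(eigs[1:]) / d
    # sigma2_hat = lambda_hat_1 / d - tau2_hat / n
    sigma2_hat = eigs[0] / d - tau2_hat / n
    return sigma2_hat, tau2_hat

def sigma2_interval(sigma2_hat, n, chi2_lower_quantile, chi2_upper_quantile):
    """Asymptotic interval [n*s/chi2_{n,1-a/2}, n*s/chi2_{n,a/2}] for sigma^2."""
    return (n * sigma2_hat / chi2_upper_quantile,
            n * sigma2_hat / chi2_lower_quantile)

# Idealized example (assumed values): n = 10, d = 1000, tau^2 = 1, and a first
# eigenvalue equal to d * (sigma^2 * c / n + tau^2 / n) with sigma^2 = 2, c = 12.
d, n = 1000, 10
eigenvalues = [2500.0] + [100.0] * (n - 1)
sigma2_hat, tau2_hat = estimate_spike_parameters(eigenvalues, d)
# 2.5% and 97.5% quantiles of chi^2 with 10 degrees of freedom (table values).
lower, upper = sigma2_interval(sigma2_hat, n, 3.247, 20.483)
```

The quantiles are passed in explicitly because the standard library has no $\chi^2$ quantile function; a statistics package could supply them instead.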
This can be extended to provide asymptotic confidence intervals for principal component scores. A similar estimation scheme can be found in a related but different setting where $d, n \to \infty$ together; see for example [19,16]. These papers do not discuss eigenvector estimation. The sample eigenvector $\hat{u}_{1,d}$ is difficult to improve upon, mainly because the direction of deviation of $\hat{u}_{1,d}$ from $u_{1,d}$ is quite random unless a more restrictive assumption (e.g. sparsity) is made.
3. Limits under the generalized spiked covariance model
The results in the previous section will be generalized to much broader situations, including a generalized spiked covariance model and a relaxation of the Gaussian assumption. We focus on the $\alpha = 1$ case, and describe the limiting distributions for eigenvalues and eigenvectors.
Fig. 1. Angle densities for the one spike case. The top panel shows an overlay of the densities with different $\sigma^2$, with other parameters fixed. The bottom panel shows an overlay of the densities with different degrees of freedom $n$ of the $\chi^2$ distribution. For a larger signal to noise ratio $\sigma^2/\tau^2$, and for a larger $n$, the angle to optimal is smaller.
3.1. Eigenvalues and eigenvectors
In the following, all the quantities depend on the dimension $d$, but the subscript $d$ is omitted when it does not cause any confusion.
We first describe some elementary facts from matrix algebra that are useful throughout the paper. The dimension of the sample covariance matrix $S$ increases as $d$ grows, so it is challenging to deal with $S$ directly. A useful approach is to use the dual of $S$, defined as the $n \times n$ symmetric matrix
$$S_D = \frac{1}{n} X'X,$$
obtained by switching the roles of the rows and columns of $X$. The $(i,j)$th element of $S_D$ is $\frac{1}{n} X_i'X_j$. An advantage of working with $S_D$ is that for large $d$, the finite dimensional matrix $S_D$ is positive definite with probability one, and its $n$ eigenvalues are the same as the non-zero eigenvalues of $S$. Moreover, the sample eigenvectors $\hat{u}_i$ are related to the eigen-decomposition of $S_D$, as shown next. Let $S_D = \hat{V}_n \hat{\Lambda}_n \hat{V}_n'$, where $\hat{\Lambda}_n = \mathrm{diag}(\hat{\lambda}_1, \ldots, \hat{\lambda}_n)$ and $\hat{V}_n$ is the $n \times n$ orthogonal matrix of eigenvectors $\hat{v}_i$ corresponding to $\hat{\lambda}_i$. Recall $S = \hat{U}\hat{\Lambda}\hat{U}'$. Since $S$ is of rank at most $n$, we can write $S = \hat{U}_n \hat{\Lambda}_n \hat{U}_n'$, where $\hat{U}_n = [\hat{u}_1, \ldots, \hat{u}_n]$ consists of the first $n$ columns of $\hat{U}$. The singular value decomposition of $X$ is given by
$$X = \hat{U}_n (n\hat{\Lambda}_n)^{1/2} \hat{V}_n' = \sum_{i=1}^{n} (n\hat{\lambda}_i)^{1/2} \hat{u}_i \hat{v}_i'.$$
Then the $k$th sample principal component direction $\hat{u}_k$, for $k \le n$, is proportional to $X\hat{v}_k$:
$$\hat{u}_k = (n\hat{\lambda}_k)^{-1/2} X \hat{v}_k. \qquad (2)$$
Therefore the asymptotic properties of the eigen-decomposition of $S$, as $d \to \infty$, can be studied via those of the finite dimensional matrix $S_D$.
It is also useful to represent $S_D$ in terms of the population principal components. Let $Z_{(d)}$ be the standardized principal components of $X$, defined by
$$Z_{(d)} = \begin{bmatrix} Z_1' \\ \vdots \\ Z_d' \end{bmatrix} = \Lambda_d^{-1/2} U_d' X,$$
where $Z_i' = (Z_{i1}, \ldots, Z_{in})$ is the $i$th row of $Z_{(d)}$, so that
$$Z_i' = \lambda_i^{-1/2} u_i' X. \qquad (3)$$
Under the Gaussian assumption of the previous section, each element of $Z_{(d)}$ is independently distributed as the standard normal distribution. By $X = U_d \Lambda^{1/2} Z_{(d)}$,
$$S_D = \frac{1}{n} X'X = \frac{1}{n} Z_{(d)}' \Lambda Z_{(d)} = \frac{1}{n} \sum_{i=1}^{d} \lambda_i Z_i Z_i'.$$
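The dual relation (2) can be checked numerically on a small example. The sketch below is an illustration in standard-library Python, restricted to $n = 2$ so that the $2 \times 2$ dual matrix has a closed-form eigen-decomposition; the helper names and the simulated standard normal data are assumptions made only for this example, and the closed form assumes the off-diagonal entry of $S_D$ is non-zero.

```python
import math
import random

def dual_pca_n2(X):
    """Eigenpairs (lambda_hat_k, u_hat_k) of S = XX'/n via the dual, for n = 2.

    X is a list of d rows with 2 entries each (a d x 2 data matrix).
    """
    n = 2
    # Entries of the dual matrix S_D = X'X / n = [[a, b], [b, c]].
    a = sum(row[0] * row[0] for row in X) / n
    b = sum(row[0] * row[1] for row in X) / n
    c = sum(row[1] * row[1] for row in X) / n
    mean, half_gap = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    results = []
    for lam in (mean + half_gap, mean - half_gap):
        # Eigenvector of S_D for eigenvalue lam (assumes b != 0), normalized.
        v = (b, lam - a)
        norm = math.hypot(v[0], v[1])
        v = (v[0] / norm, v[1] / norm)
        # Relation (2): u_hat_k = (n * lambda_hat_k)^(-1/2) X v_hat_k.
        scale = math.sqrt(n * lam)
        u = [(row[0] * v[0] + row[1] * v[1]) / scale for row in X]
        results.append((lam, u))
    return results

def apply_sample_covariance(X, u):
    """Compute S u = X (X'u) / n without forming the d x d matrix S."""
    n = 2
    w = [sum(row[j] * x for row, x in zip(X, u)) for j in range(n)]
    return [(row[0] * w[0] + row[1] * w[1]) / n for row in X]

rng = random.Random(1)
d = 200
X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(d)]
pairs = dual_pca_n2(X)
```

Each returned vector should have unit length and satisfy $S\hat{u}_k = \hat{\lambda}_k \hat{u}_k$ up to rounding, even though the $d \times d$ matrix $S$ is never formed.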
The Gaussian assumption on $X$ is relaxed as follows. We assume that each column $X_i$ of $X$ follows a $d$-dimensional multivariate distribution with mean zero and covariance matrix $\Sigma$. The entries of the standardized principal components, or the sphered variables, $Z_{(d)}$ are assumed to have finite fourth moments; they are uncorrelated but may in general be dependent on each other. We regulate the dependency of the principal components by a $\rho$-mixing condition (see [15,6]). We briefly describe a version of $\rho$-mixing for our purpose. For $-\infty \le J \le L \le \infty$, let $\mathcal{F}_J^L$ denote the $\sigma$-field of events generated by the random variables $Z_i$, $J \le i \le L$. For any $\sigma$-field $\mathcal{A}$, let $L_2(\mathcal{A})$ denote the space of square-integrable, $\mathcal{A}$-measurable real-valued random variables. For each $m \ge 1$, define the maximal correlation coefficient
$$\rho(m) := \sup |\mathrm{corr}(f, g)|, \quad f \in L_2(\mathcal{F}_{-\infty}^{j}), \ g \in L_2(\mathcal{F}_{j+m}^{\infty}),$$
where the supremum is over all $f$, $g$ and $j \in \mathbb{Z}$. The sequence $\{Z_i\}$ is said to be $\rho$-mixing if $\rho(m) \to 0$ as $m \to \infty$.
While the concept of $\rho$-mixing is useful as a mild condition for the development of laws of large numbers, its formulation depends critically on the ordering of the variables. Therefore we assume that there is some permutation of the data which is $\rho$-mixing. In particular, let $\{Z_{ij,(d)}\}_{i=1}^{d}$ be the components of the $j$th column vector of $Z_{(d)}$. We assume that for each $d$, there exists a permutation $\pi_d : \{1, \ldots, d\} \to \{1, \ldots, d\}$ so that the sequence $\{Z_{\pi_d(i)j,(d)} : i = 1, \ldots, d\}$ is $\rho$-mixing. This assumption makes the results invariant under a permutation of the variables. We denote these distributional assumptions as (c1).
We then define a generalized spiked covariance model. Recall that a simple one spike model was defined on the eigenvalues of the population covariance matrix $\Sigma$, for example, $\lambda_1 = \sigma^2 d^{\alpha}$, $\lambda_2 = \cdots = \lambda_d = \tau^2$. This is generalized by allowing multiple spikes, and by relaxing the uniform eigenvalue assumption in the tail to a decreasing sequence. The tail eigenvalues are regulated by a measure of sphericity $\epsilon_k$ in the limit $d \to \infty$. The measure of sphericity $\epsilon_k$, $k = 1, 2, \ldots$, is defined for $\{\lambda_k, \ldots, \lambda_d\}$ as
$$\epsilon_k(d) \equiv \frac{\left(\sum_{i=k}^{d} \lambda_i\right)^2}{d \sum_{i=k}^{d} \lambda_i^2},$$
which is away from 0 and close to 1 when $\{\lambda_k, \ldots, \lambda_d\}$ are close to each other. We shall assume that the tail eigenvalues do not decrease too fast. In particular, we say the $\epsilon_k$-condition is satisfied when $\epsilon_k(d)$ decreases at a rate slower than $d^{-1}$, i.e.
$$(d\epsilon_k)^{-1} = \frac{\sum_{i=k}^{d} \lambda_i^2}{\left(\sum_{i=k}^{d} \lambda_i\right)^2} \to 0 \quad \text{as } d \to \infty.$$
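To make the $\epsilon_k$-condition concrete, the following standard-library Python sketch evaluates $\epsilon_k(d)$ for a few eigenvalue sequences. The function name and the three example sequences are illustrative assumptions; the last one (a single eigenvalue of order $d$ left inside the tail) is included only to show a case where $(d\epsilon_k)^{-1}$ does not vanish.

```python
def sphericity(eigenvalues, k=1):
    """epsilon_k(d) = (sum_{i>=k} lambda_i)^2 / (d * sum_{i>=k} lambda_i^2)."""
    d = len(eigenvalues)
    tail = eigenvalues[k - 1:]          # lambda_k, ..., lambda_d (1-based k)
    return sum(tail) ** 2 / (d * sum(x * x for x in tail))

def inverse_d_epsilon(eigenvalues, k=1):
    """(d * epsilon_k)^(-1), which must tend to 0 for the epsilon_k-condition."""
    return 1.0 / (len(eigenvalues) * sphericity(eigenvalues, k))

# Equal eigenvalues: epsilon_1 is exactly 1.
flat = sphericity([2.0] * 50)

# Polynomially decreasing eigenvalues lambda_i = i^(-1/2).
poly_small = inverse_d_epsilon([i ** -0.5 for i in range(1, 101)])
poly_large = inverse_d_epsilon([i ** -0.5 for i in range(1, 10001)])

# One eigenvalue of order d kept in the tail: the quantity stays near 1/4.
spiked_tail = inverse_d_epsilon([10000.0] + [1.0] * 9999)
```

In this sketch the polynomial sequence gives a value that shrinks as $d$ grows, whereas the sequence with a dominant eigenvalue does not, which is why the spikes are separated from the tail in the generalized model.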
The $\epsilon_k$-condition holds for quite general settings [14, Sec. 2c]. As an example, a polynomially decreasing sequence $i^{-1/2}$ of eigenvalues satisfies the $\epsilon_k$-condition with $k = 1$. In the generalized spiked model, the eigenvalues are assumed to be of the form $\lambda_1 = \sigma_1^2 d^{\alpha}, \ldots, \lambda_m = \sigma_m^2 d^{\alpha}$, for $\sigma_1^2 \ge \cdots \ge \sigma_m^2 > 0$ and some $m \ge 1$, and the $\epsilon_{m+1}$-condition holds for $\lambda_{m+1}, \ldots, \lambda_d$. Also assume that $\frac{1}{d} \sum_{i=m+1}^{d} \lambda_i \to \tau^2$ as $d \to \infty$. These conditions for spike models are denoted by (c2).
The following theorem gives the limits of the sample eigenvalues and eigenvectors under the general assumptions in this section. We use the following notation. Let $\varphi(A)$ be the vector of eigenvalues of a real symmetric matrix $A$ arranged in non-increasing order and let $\varphi_i(A)$ be the $i$th largest eigenvalue of $A$. Let $v_i(A)$ denote the $i$th eigenvector of the matrix $A$ corresponding to the eigenvalue $\varphi_i(A)$ and $v_{ij}(A)$ be the $j$th loading of $v_i(A)$. Also note that there are many choices of eigenvectors of $S$, including sign changes. We use the convention that the sign of $\hat{u}_i$ is chosen so that $\hat{u}_i'u_i \ge 0$. Recall that the vector of the $i$th standardized principal component scores is $Z_i = (Z_{i1}, \ldots, Z_{in})'$. Denote the $n \times m$ matrix of the first $m$ principal component scores as $W = [\sigma_1 Z_1, \ldots, \sigma_m Z_m]$. The limiting distributions depend heavily on the finite dimensional random matrix $W$.
Theorem 2. Under the assumptions (c1) and (c2) with fixed $n \ge m \ge 1$, if $\alpha = 1$, then
(i) the sample eigenvalues satisfy
$$d^{-1} n \hat{\lambda}_{i,d} \Longrightarrow \begin{cases} \varphi_i(W'W) + \tau^2, & i = 1, \ldots, m; \\ \tau^2, & i = m+1, \ldots, n, \end{cases}$$
as $d \to \infty$.
(ii) The inner products between the sample and population eigenvectors have limiting distributions:
$$\hat{u}_{i,d}'u_{j,d} \Longrightarrow \frac{v_{ij}(W'W)}{\sqrt{1 + \tau^2/\varphi_i(W'W)}} \quad \text{as } d \to \infty,$$
jointly for $i, j = 1, \ldots, m$. The rest of the eigenvectors are strongly inconsistent with their population counterparts, i.e.
$$\hat{u}_{i,d}'u_{i,d} \to 0 \quad \text{as } d \to \infty \text{ for } i = m+1, \ldots, n,$$
in probability.
The theorem shows that the first $m$ eigenvectors are neither consistent nor strongly inconsistent with their population counterparts. The limiting distributions of the angles $\mathrm{Angle}(\hat{u}_{i,d}, u_{i,d})$ to the optimal directions are supported on $(0, \pi/2)$ and depend on the magnitude of the noise $\tau^2$ and the distribution of $W'W$. Note that the $m \times m$ symmetric matrix $W'W$ is the scaled covariance matrix of the principal component scores in the first $m$ directions. When the underlying distribution of $X$ is assumed to be Gaussian, $W'W$ is the Wishart matrix $W_m(n, \Lambda_m)$, where $\Lambda_m = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)$. If $m = 1$, the matrix becomes a scalar random variable $W'W = \varphi_1(W'W)$, which is distributed as $\sigma_1^2 \chi_n^2$ under the Gaussian assumption; this leads to Theorem 1.
In general, the distribution of $\varphi_i(W'W)$ is not simply described. We refer to [17, Ch. 9] for the Gaussian case, and to [3] for the case $m \to \infty$.
The limiting distributions for the cases $\alpha \ne 1$ can be found in a similar manner; we only state the result for the case $\alpha > 1$ and for the first $m$ components, and refer to [14] for more general results. For $i, j = 1, \ldots, m$, the eigenvalues satisfy $d^{-\alpha} n \hat{\lambda}_{i,d} \Longrightarrow \varphi_i(W'W)$ and the inner products satisfy $\hat{u}_{i,d}'u_{j,d} \Longrightarrow v_{ij}(W'W)$ as $d \to \infty$. In comparison to the $\alpha = 1$ case, if we set $\tau^2$ to be zero, the result becomes identical for all $\alpha \ge 1$.
Remark 2. When the sample size $n$ also grows, consistency of sample eigenvalues and eigenvectors can be achieved. In particular, for $i = 1, \ldots, m$, we have as $d$ grows
$$\frac{\hat{\lambda}_{i,d}}{\lambda_{i,d}} = \frac{d^{-1} n \hat{\lambda}_{i,d}}{n \sigma_i^2} \Longrightarrow \frac{\varphi_i(W'W/n)}{\sigma_i^2} + \frac{\tau^2}{n \sigma_i^2}$$
by Theorem 2. Since $W'W/n \to \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)$ as $n \to \infty$ by a law of large numbers, we get the consistency of eigenvalues, i.e. $\hat{\lambda}_{i,d}/\lambda_{i,d} \to 1$ as $d, n \to \infty$, where the limits are applied successively. For the sample eigenvectors, from Theorem 2(ii) and because $v_{ij}(W'W) \to 1$ if $i = j$ and $0$ otherwise as $n \to \infty$, and $\varphi_i(W'W) = O(n)$, we get $\hat{u}_{i,d}'u_{j,d} \to 1$ if $i = j$ and $0$ otherwise as $d, n \to \infty$. Therefore, the sample PC directions are consistent with the corresponding population PC directions, i.e. $\mathrm{Angle}(\hat{u}_{i,d}, u_{i,d}) \to 0$ as $d, n \to \infty$ (applied successively), for $i \le m$. It is therefore conjectured that when $d, n$ grow together with $d \gg n$, a similar conclusion to Theorem 2 can be drawn.
Proof of Theorem 2. The following lemma (Theorem 1 of [14]) gives a version of the law of large numbers for matrices that is useful in the proof. Recall that $Z_i' \equiv (Z_{i1}, \ldots, Z_{in})$ is the $i$th row of $Z_{(d)}$.
Lemma 1. If the assumption (c1) and the $\epsilon_k$-condition hold, then
$$c_d^{-1} \sum_{i=k}^{d} \lambda_{i,d} Z_i Z_i' \to I_n, \quad \text{as } d \to \infty,$$
in probability, where $c_d = n^{-1} \sum_{i=1}^{d} \lambda_{i,d}$ and $I_n$ denotes the $n \times n$ identity matrix. In particular, if $d^{-1} \sum_{i=k}^{d} \lambda_{i,d} \to \tau^2$, then
$$\frac{1}{d} \sum_{i=k}^{d} \lambda_{i,d} Z_i Z_i' \to \tau^2 I_n, \quad \text{as } d \to \infty,$$
in probability.
This lemma is used to show that the spectral decomposition of $d^{-1} n S_D$,
$$d^{-1} n S_D = \sum_{i=1}^{m} \sigma_i^2 Z_i Z_i' + d^{-1} \sum_{i=m+1}^{d} \lambda_i Z_i Z_i',$$
can be divided into two parts, the latter of which converges to a deterministic part. Applying Lemma 1, we have $d^{-1} n S_D \Longrightarrow S_0$ as $d \to \infty$, where
$$S_0 = WW' + \tau^2 I_n.$$
Then, since the eigenvalues of a symmetric matrix $A$ are a continuous function of the elements of $A$, we have
$$\varphi(d^{-1} n S_D) \Longrightarrow \varphi(S_0),$$
as $d \to \infty$. Noticing that for $i = 1, \ldots, m$,
$$\varphi_i(S_0) = \varphi_i(WW') + \tau^2 = \varphi_i(W'W) + \tau^2,$$
and for $i = m+1, \ldots, n$, $\varphi_i(S_0) = \tau^2$, gives the result.
For the eigenvectors, note that the eigenvectors $\hat{v}_i$ of $d^{-1} n S_D$ can be chosen so that they are continuous [1]. Therefore, we also have that $\hat{v}_i = v_i(d^{-1} n S_D) \Longrightarrow v_i(S_0)$ as $d \to \infty$, for all $i$. Also note that $v_i(S_0) = v_i(WW')$ for $i \le m$.
Similar to the dual approach for covariance matrices, the eigenvectors of the $n \times n$ matrix $WW'$ can be evaluated from the dual of the matrix. In particular, let $W = U_w \Lambda_w V_w' = \sum_{i=1}^{m} \lambda_{iw} u_{iw} v_{iw}'$, where $\lambda_{iw}^2 = \varphi_i(W'W)$ and $v_{iw} = v_i(W'W)$. Then
$$v_i(S_0) = u_{iw} = \frac{W v_{iw}}{\lambda_{iw}} = \frac{W v_i(W'W)}{\sqrt{\varphi_i(W'W)}}.$$
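Now from (2), (3) and the previous equation, for $1 \le i, j \le m$,
$$u_j'\hat{u}_i = \frac{u_j' X \hat{v}_i}{\sqrt{n\hat{\lambda}_i}} = \frac{\sqrt{\sigma_j^2 d}\, Z_j'\hat{v}_i}{\sqrt{n\hat{\lambda}_i}} = \frac{\sigma_j Z_j'\hat{v}_i}{\sqrt{d^{-1} n \hat{\lambda}_i}} \Longrightarrow \frac{\sigma_j Z_j' W v_i(W'W)}{\sqrt{\varphi_i(W'W) + \tau^2}\,\sqrt{\varphi_i(W'W)}}. \qquad (4)$$
Note that $\sigma_j Z_j'W = [\sigma_j\sigma_1 Z_j'Z_1 \ \cdots \ \sigma_j\sigma_m Z_j'Z_m]$ is the $j$th row of $W'W$ and $W'W v_i(W'W) = \varphi_i(W'W) v_i(W'W)$. Therefore, the limiting form (4) becomes
$$\frac{\varphi_i(W'W)\, v_{ij}(W'W)}{\sqrt{\varphi_i(W'W) + \tau^2}\,\sqrt{\varphi_i(W'W)}} = \frac{v_{ij}(W'W)}{\sqrt{1 + \tau^2/\varphi_i(W'W)}}.$$
For $i = m+1, \ldots, n$, again from (2) and (3), we get
$$u_i'\hat{u}_i = \frac{\sqrt{\lambda_i}\, Z_i'\hat{v}_i}{\sqrt{n\hat{\lambda}_i}} = \frac{d^{-1/2}\, \tau\, Z_i'\hat{v}_i}{\sqrt{n\hat{\lambda}_i/d}} = O_p\left(d^{-1/2}\right).$$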
3.2. Asymptotic results for the centered data matrix
In practice, the sample covariance matrix is usually derived from the centered data matrix. In such a case, we obtain a weaker result than Theorem 2. Let $\tilde{S}_{(d)} = n^{-1}(X_{(d)} - \bar{X})(X_{(d)} - \bar{X})'$, where $\bar{X} = n^{-1} \sum_{i=1}^{n} X_{i,d} 1_n'$ is a $d \times n$ matrix consisting of $n$ columns of the sample mean vector.
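For a general mean vector $\mu$, the representation of $X$ in terms of the standardized principal components $Z_{(d)}$ is $X = \mu 1_n' + U_d \Lambda^{1/2} Z_{(d)}$, and thus
$$X - \bar{X} = U_d \Lambda^{1/2} (Z_{(d)} - \bar{Z}),$$
where the $i$th row of $\bar{Z}$ is $\bar{z}_i' = n^{-1} \sum_{j=1}^{n} Z_{ij} 1_n' = n^{-1} Z_i' J_n$. The symbol $J_n$ represents the $n \times n$ matrix consisting entirely of ones.
The dual covariance matrix of $\tilde{S}_{(d)}$ is then
$$\tilde{S}_D = n^{-1} \sum_{i=1}^{d} \lambda_i (Z_i - \bar{z}_i)(Z_i - \bar{z}_i)' = n^{-1} (I_n - n^{-1} J_n) \sum_{i=1}^{d} \lambda_i Z_i Z_i' (I_n - n^{-1} J_n)'.$$
Then we have the following result.
Proposition 1. Let $\tilde{\lambda}_{i,d}$ be the $i$th largest eigenvalue of $\tilde{S}_{(d)}$. Under the assumptions (c1) and (c2) with fixed $n > m \ge 1$, if $\alpha = 1$, then for any $\epsilon > 0$,
$$\lim_{d \to \infty} P\left( \bigcap_{i=1}^{m} \left\{ \frac{n}{d}\tilde{\lambda}_{i,d} \in [\varphi_i(\tilde{W}'\tilde{W}),\ \varphi_i(\tilde{W}'\tilde{W}) + \tau^2] \right\} \cap \bigcap_{i=m+1}^{n-1} \left\{ \frac{n}{d}\tilde{\lambda}_{i,d} \in (\tau^2 - \epsilon,\ \tau^2 + \epsilon) \right\} \right) = 1,$$
where $\tilde{W} = (I_n - n^{-1} J_n) W$ denotes the centered version of $W$.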
Proof of Proposition 1. Note that $\tilde{\lambda}_{i,d} = \varphi_i(\tilde{S}_{(d)}) = \varphi_i(\tilde{S}_D)$ for $i = 1, \ldots, n-1$. Similar to the proof of Theorem 2, the spectral decomposition of $d^{-1} n \tilde{S}_D$ is divided into two parts,
$$d^{-1} n \tilde{S}_D = \sum_{i=1}^{m} \sigma_i^2 (Z_i - \bar{z}_i)(Z_i - \bar{z}_i)' + (I_n - n^{-1} J_n) \left( \frac{1}{d} \sum_{i=m+1}^{d} \lambda_i Z_i Z_i' \right) (I_n - n^{-1} J_n)',$$
and using Lemma 1, $d^{-1} n \tilde{S}_D$ converges in distribution to $\tilde{S}_0 = \tilde{W}\tilde{W}' + \tau^2 (I_n - n^{-1} J_n)$. Then the eigenvalues of $d^{-1} n \tilde{S}_D$ jointly converge to the eigenvalues of $\tilde{S}_0$.
For $i = 1, \ldots, m$, Weyl's inequality [14, p. 4121], [10] yields that
$$\varphi_i(\tilde{W}\tilde{W}') + \varphi_n\{\tau^2 (I_n - n^{-1} J_n)\} \le \varphi_i(\tilde{S}_0) \le \varphi_i(\tilde{W}\tilde{W}') + \varphi_1\{\tau^2 (I_n - n^{-1} J_n)\},$$
where $\varphi_j\{\tau^2 (I_n - n^{-1} J_n)\} = \tau^2$ for $j = 1, \ldots, n-1$ and $0$ for $j = n$, and $\varphi_i(\tilde{W}\tilde{W}') = \varphi_i(\tilde{W}'\tilde{W})$. For $i = m+1, \ldots, n-1$, applying Weyl's inequality again gives
$$\varphi_n(\tilde{W}\tilde{W}') + \varphi_i\{\tau^2 (I_n - n^{-1} J_n)\} \le \varphi_i(\tilde{S}_0) \le \varphi_i(\tilde{W}\tilde{W}') + \varphi_1\{\tau^2 (I_n - n^{-1} J_n)\},$$
and because the rank of $\tilde{W}\tilde{W}'$ is at most $m$, $\varphi_i(\tilde{W}\tilde{W}') = 0$ for $i > m$. Thus $\varphi_i(\tilde{S}_0) = \tau^2$, which leads to $d^{-1} n \tilde{\lambda}_{i,d} \to \tau^2$ in probability as $d \to \infty$. The result is derived by combining the cases $i = 1, \ldots, n-1$.
When the centered data matrix $X - \bar{X}$ is used, the scaled eigenvalue estimate no longer converges in distribution to $\varphi_i(\tilde{W}'\tilde{W}) + \tau^2$. However, the difference becomes smaller for larger $n$, since the centering matrix $I_n - n^{-1} J_n$ is close to $I_n$ for large $n$. For the rest of the paper, we assume that the mean is known and zero for the sake of clear presentation.
3.3. Angles between principal component spaces
Under the generalized spiked covariance model with $m > 1$, the first $m$ population principal directions provide a basis of the most important variation. Therefore, it would be more informative to investigate the deviation of each $\hat{u}_i$ from the subspace $\mathcal{L}_1^m(d)$ spanned by $\{u_{1,d}, \ldots, u_{m,d}\}$. Also, denote the subspace spanned by the first $m$ sample principal directions as $\hat{\mathcal{L}}_1^m(d) \equiv \mathrm{span}\{\hat{u}_{1,d}, \ldots, \hat{u}_{m,d}\}$. When performing dimension reduction, it is critical for the sample PC space $\hat{\mathcal{L}}_1^m$ to be close to the population PC space $\mathcal{L}_1^m$. The closeness of two subspaces can be measured in terms of canonical angles.
We briefly introduce the notion of canonical angles and metrics between subspaces, detailed discussions of which can be found in [23,9]. As a simple case, the canonical angle between a 1-dimensional subspace and an $m$-dimensional subspace is defined as follows. Let $\hat{\mathcal{L}}_i$ be the 1-dimensional linear space with basis $\hat{u}_i$. Infinitely many angles can be formed between $\hat{\mathcal{L}}_i$ and $\mathcal{L}_1^m$ with $m > 1$. The canonical angle, denoted by $\mathrm{Angle}(\hat{\mathcal{L}}_i, \mathcal{L}_1^m)$, is defined as the smallest angle formed, that is, the angle between $\hat{u}_i$ and its projection $\hat{u}_i^P$ onto $\mathcal{L}_1^m$. This angle is represented in terms of an inner product as
$$\mathrm{Angle}(\hat{\mathcal{L}}_i, \mathcal{L}_1^m) = \cos^{-1}\left( \frac{\hat{u}_i'\hat{u}_i^P}{\|\hat{u}_i^P\|\,\|\hat{u}_i\|} \right) = \min_{y \in \mathcal{L}_1^m} \mathrm{Angle}(\hat{u}_i, y) \quad \text{for } \|y\| > 0. \qquad (5)$$
When two multi-dimensional subspaces are considered, multiple canonical angles are defined. Among the angles between $\hat{\mathcal{L}}_1^m$ and $\mathcal{L}_1^m$, the first canonical angle is geometrically defined as
$$\theta_1(\hat{\mathcal{L}}_1^m, \mathcal{L}_1^m) = \max_{x \in \hat{\mathcal{L}}_1^m} \min_{y \in \mathcal{L}_1^m} \mathrm{Angle}(x, y) \quad \text{for } \|x\|, \|y\| > 0, \qquad (6)$$
where $\mathrm{Angle}(x, y)$ is the angle formed by the two vectors $x, y$. One can show that the second canonical angle is defined by the same geometric relation as above with $\hat{\mathcal{L}}_{-x}$ and $\mathcal{L}_{-y}$ for $x, y$ from (6), where $\hat{\mathcal{L}}_{-x}$ is the orthogonal complement of $x$ in $\hat{\mathcal{L}}_1^m$.
In practice, the canonical angles are found by the singular value decomposition of a matrix. Let $\hat{U}_m$ and $U_m$ be orthonormal bases for $\hat{\mathcal{L}}_1^m$ and $\mathcal{L}_1^m$, and let the $\gamma_i$'s be the singular values of $\hat{U}_m'U_m$. Then the canonical angles are $\theta_i(\hat{\mathcal{L}}_1^m, \mathcal{L}_1^m) = \cos^{-1}(\gamma_i)$, in descending order.
Distances between two subspaces can be defined using the canonical angles. We point out two metrics from Chapter II.4 of [23]:
1. gap metric $\rho_g(\hat{\mathcal{L}}_1^m, \mathcal{L}_1^m) = \sin(\theta_1)$,
2. Euclidean sine metric $\rho_s(\hat{\mathcal{L}}_1^m, \mathcal{L}_1^m) = \left\{ \sum_{i=1}^{m} \sin^2(\theta_i) \right\}^{1/2}$.
The gap metric is simple and only involves the largest canonical angle. The Euclidean sine metric makes use of all canonical angles and thus gives a more comprehensive understanding of the closeness between the two subspaces. We will use both metrics in the following discussion.
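As a small worked illustration of these definitions, the sketch below computes the canonical angles and both metrics for two 2-dimensional subspaces, using only the Python standard library. Because no SVD routine is available there, the singular values of the $2 \times 2$ matrix of inner products are obtained from the closed-form eigenvalues of its Gram matrix; the function names and the example subspaces in $\mathbb{R}^3$ are assumptions made for this example.

```python
import math

def canonical_angles_2d(A, B):
    """Canonical angles (descending) between span(A) and span(B).

    A and B are pairs of orthonormal vectors, each spanning a 2-dim subspace.
    """
    dot = lambda x, y: sum(p * q for p, q in zip(x, y))
    m = [[dot(a, b) for b in B] for a in A]            # M = A'B, a 2 x 2 matrix
    # Gram matrix G = M'M; its eigenvalues are the squared singular values.
    g11 = m[0][0] ** 2 + m[1][0] ** 2
    g22 = m[0][1] ** 2 + m[1][1] ** 2
    g12 = m[0][0] * m[0][1] + m[1][0] * m[1][1]
    mean, half_gap = (g11 + g22) / 2.0, math.hypot((g11 - g22) / 2.0, g12)
    # Smallest singular value first, so the largest angle theta_1 comes first.
    gammas = [math.sqrt(min(1.0, max(0.0, e)))
              for e in (mean - half_gap, mean + half_gap)]
    return [math.acos(g) for g in gammas]

def gap_metric(angles):
    return math.sin(angles[0])

def euclidean_sine_metric(angles):
    return math.sqrt(sum(math.sin(t) ** 2 for t in angles))

# Example: span{e1, e2} versus span{e1, cos(t) e2 + sin(t) e3} in R^3.
t = 0.3
A = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
B = [(1.0, 0.0, 0.0), (0.0, math.cos(t), math.sin(t))]
angles = canonical_angles_2d(A, B)
rho_g = gap_metric(angles)
rho_s = euclidean_sine_metric(angles)
```

In this example the two subspaces share the direction $e_1$, so one canonical angle is zero and the other equals $t$; both metrics then reduce to $\sin(t)$.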
First, we examine the limiting distribution of the angle between the sample PC direction $\hat{u}_i$ and $\mathcal{L}_1^m$.
Theorem 3. Under the assumptions (c1) and (c2) with fixed $n \ge m \ge 1$, if $\alpha = 1$, then for $i = 1, \ldots, m$, the canonical angle converges in distribution:
$$\cos\left( \mathrm{Angle}(\hat{\mathcal{L}}_i, \mathcal{L}_1^m) \right) \Longrightarrow \frac{1}{\sqrt{1 + \tau^2/\varphi_i(W'W)}} \quad \text{as } d \to \infty.$$
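Proof. Since $\hat{u}_i^P = \sum_{j=1}^{m} (\hat{u}_i'u_j) u_j$,
$$\frac{\hat{u}_i'\hat{u}_i^P}{\|\hat{u}_i^P\|\,\|\hat{u}_i\|} = \|\hat{u}_i^P\| = \sqrt{(\hat{u}_i'u_1)^2 + \cdots + (\hat{u}_i'u_m)^2}.$$
The result follows from (5), Theorem 2(ii) and the fact that $\sum_{j=1}^{m} v_{ij}(W'W)^2 = \|v_i(W'W)\|^2 = 1$.
We then investigate the limiting behavior of the closeness between $\hat{\mathcal{L}}_1^m$ and $\mathcal{L}_1^m$, in terms of either the canonical angles or the distances. Since $\hat{U}_m = [\hat{u}_1, \ldots, \hat{u}_m]$ and $U_m = [u_1, \ldots, u_m]$ are orthonormal bases of $\hat{\mathcal{L}}_1^m$ and $\mathcal{L}_1^m$ respectively, the cosines of the canonical angles are the singular values of $\hat{U}_m'U_m$. Since the $(i,j)$th element of $\hat{U}_m'U_m$ is $\hat{u}_i'u_j$, Theorem 2 leads to
$$\hat{U}_m'U_m \Longrightarrow \begin{bmatrix} \left(1 + \frac{\tau^2}{\varphi_1(W'W)}\right)^{-1/2} & & 0 \\ & \ddots & \\ 0 & & \left(1 + \frac{\tau^2}{\varphi_m(W'W)}\right)^{-1/2} \end{bmatrix} \begin{bmatrix} v_1(W'W) & \cdots & v_m(W'W) \end{bmatrix}',$$
as $d \to \infty$. Since the matrix of eigenvectors is orthogonal, the singular values of the limit are the diagonal entries. Therefore the canonical angles $(\theta_1, \ldots, \theta_m)$ between $\hat{\mathcal{L}}_1^m$ and $\mathcal{L}_1^m$ converge to the arccosines of
$$\left( \left(1 + \tau^2/\varphi_m(W'W)\right)^{-1/2}, \ldots, \left(1 + \tau^2/\varphi_1(W'W)\right)^{-1/2} \right), \qquad (7)$$
as $d \to \infty$. Notice that these canonical angles between subspaces converge to the same limits as in Theorem 3, except that the order is reversed. In particular, the limiting distribution of the largest canonical angle $\theta_1$ is the same as that of the angle between $\hat{u}_m$ and $\mathcal{L}_1^m$, and the smallest canonical angle $\theta_m$ corresponds to the angle between the first sample PC direction $\hat{u}_1$ and the population PC space $\mathcal{L}_1^m$.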
The limiting distributions of the distances between two subspaces are readily derived by the discussions so far. When using the gap metric,
ρ
g(
Lˆ
m1,
Lm1)
H⇒
1
+
ϕ
m(
W′W)/τ
2
− 12 asd
→ ∞
.
And by using the Euclidean sine metric,ρ
s(
Lˆ
m1,
L m 1)
H⇒
m
i=1 1 1+
ϕ
i(
W′W)/τ
2
12 asd→ ∞
.
(8)Remark 3. The convergence of the canonical angles for the case
α >
1 has been shown earlier. [14] introduced a notion ofsubspace consistency, where the directionu
ˆ
imay not be consistent touibut will tend to lie inLm1, i.e. Angle(
Lˆ
i,
Lm1)
→
0 asd
→ ∞
, fori≤
m. In this case, the canonical angles betweenLˆ
mi andLm1 and the distances will converge to 0 asdgrows. In that sense, the empirical PC spaceLˆ
mi is consistent toLm1. On the other hand, when
α <
1, all directionsuˆ
itend to behave as if they were from the eigen-decomposition of the identity matrix. Therefore, all angles tend to beπ/
2 and the distances will converge to their largest possible values, leading to the strong inconsistency.We now focus back on the
$\alpha = 1$ case, and illustrate the limiting distributions of the canonical angles and the Euclidean sine distance, to see the effect of parameters in the distribution. For simplicity and clear presentation, the results corresponding to $m = 2$ are presented under the Gaussian assumption. Note that the limiting distributions depend on the marginal distributions of the first few principal component scores. Therefore no common distribution is evaluated in the limit.

Let $\sigma^2 (= \sigma_1^2 + \sigma_2^2)$ denote the (scaled) total variance of the first two principal components. The ratio $\sigma^2/\tau^2$ is understood as a signal to noise ratio, similar to the single spike case. Since the ratio of $\sigma_1^2$ and $\sigma_2^2$ affects the limiting distributions, we use $(\lambda_1, \lambda_2)$ with $\lambda_1 + \lambda_2 = 1$ so that $\sigma^2 (\lambda_1, \lambda_2) = (\sigma_1^2, \sigma_2^2)$. Note that for $W'W \sim \mathcal{W}_2(n, \mathrm{diag}(\sigma_1^2, \sigma_2^2))$, $\varphi(W'W)$ has the same law as $\sigma^2 \varphi(\mathcal{W}_2(n, \mathrm{diag}(\lambda_1, \lambda_2)))$.

Fig. 2. Overlay of contours of densities of canonical angles for the $m = 2$ case, corresponding to different $(\sigma^2/\tau^2, \lambda_1/\lambda_2)$. Larger signal to noise ratios $\sigma^2/\tau^2$ lead to the diagonal shift of the density function, and both canonical angles will have smaller values. For a fixed $\sigma^2/\tau^2$, the ratio $\lambda_1/\lambda_2$ between the first and second PC variances is a driver for different distributions, depicted as the vertical shift of the density function.
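The reparametrization by $(\sigma^2/\tau^2, \lambda_1/\lambda_2)$ can be spelled out in two lines. This is a sketch under the Gaussian assumption; the matrix $V$ is a symbol introduced here only for illustration.

```latex
\begin{align*}
V \sim \mathcal{W}_2\big(n, \mathrm{diag}(\lambda_1, \lambda_2)\big)
&\quad\text{implies}\quad
\sigma^2 V \sim \mathcal{W}_2\big(n, \mathrm{diag}(\sigma_1^2, \sigma_2^2)\big),
\qquad \varphi_i(\sigma^2 V) = \sigma^2 \varphi_i(V), \\
\frac{\tau^2}{\varphi_i(W'W)}
&\overset{d}{=} \frac{1}{(\sigma^2/\tau^2)\, \varphi_i(V)}, \qquad i = 1, 2 \ \text{(jointly)}.
\end{align*}
```

Since $\lambda_1 + \lambda_2 = 1$, the pair $(\lambda_1, \lambda_2)$ is determined by $\lambda_1/\lambda_2$. For fixed $n$, the limits in (7) and (8) therefore depend on $(\sigma_1^2, \sigma_2^2, \tau^2)$ only through the two ratios $\sigma^2/\tau^2$ and $\lambda_1/\lambda_2$.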
The joint limiting distribution of the two canonical angles in (7), also in Theorem 3, is illustrated in Fig. 2, with various values of $(\sigma^2/\tau^2, \lambda_1/\lambda_2)$ and fixed $n$. Note that for large $d$, the first canonical angle $\theta_1 \approx \mathrm{Angle}(\hat{u}_{2,d}, L_1^m)$ and $\theta_2 \approx \mathrm{Angle}(\hat{u}_{1,d}, L_1^m)$, and that $\theta_1 \ge \theta_2$.
• The diagonal shift of the joint densities in Fig. 2 is driven by different $\sigma^2$s with other parameters fixed. Both $\theta_1$ and $\theta_2$ are smaller for larger signal to noise ratios.
• For fixed $\sigma^2/\tau^2$, several values of the ratio between the first and second variances ($\lambda_1/\lambda_2$) are considered, and the overlay of densities according to different $\lambda_1/\lambda_2$ is illustrated as the vertical shift in Fig. 2. When the variation along the first PC direction is much stronger than that along the second, i.e. when $\lambda_1/\lambda_2$ is large, $\theta_2$ becomes smaller but $\theta_1$ tends to be much larger. In other words, $\hat{u}_1$ is a reasonable estimate of $u_1$, but $\hat{u}_2$ becomes a poor estimate of $u_2$.
See (A.4) in the Appendix for the probability density function of the canonical angles.
The limiting distribution of the Euclidean sine distance between $\hat{L}_1^m$ and $L_1^m$ is also depicted in Fig. 3, again with various values of $(\sigma^2, \lambda_1, \lambda_2)$. It can be checked from the top panel of Fig. 3 that the distance to the optimal subspace is smaller when the signal to noise ratio is larger. The bottom panel illustrates the densities corresponding to different ratios of $\lambda_1/\lambda_2$. The effect of $\lambda_1/\lambda_2$ is relatively small compared to the effect of different $\sigma^2$s, unless $\lambda_2$ is too small.

4. Geometric representation of the HDLSS data
[11] first showed that the HDLSS data has an interesting geometric representation in the limit $d \to \infty$. In particular, for large $d$, the data tend to appear at vertices of a regular simplex and the variability is contained only in the random rotation of the simplex. In the spike model we consider, this geometric representation of the HDLSS data holds when $\alpha < 1$, as shown earlier in [14]. The representation in mathematical terms is
$$\|X_i\| = \tau\sqrt{d} + o_p(\sqrt{d}), \qquad \|X_i - X_j\| = \tau\sqrt{2d} + o_p(\sqrt{d}), \tag{9}$$
for $d$-dimensional $X_i$, $i = 1, \ldots, n$. This simplified representation has been used to show some high dimensional limit theory for discriminant analysis; see [2,22,12].

Similar types of representation can be derived in the $\alpha \ge 1$ case. When $\alpha > 1$, where consistency of PC directions happens, we have
$$\|X_i\| / d^{\alpha/2} \Longrightarrow \|Y_i\|, \qquad \|X_i - X_j\| / d^{\alpha/2} \Longrightarrow \|Y_i - Y_j\|, \tag{10}$$
where $Y_i = (\sigma_1 Z_{1i}, \ldots, \sigma_m Z_{mi})'$s are $m$-dimensional independent random vectors with mean zero and covariance matrix $\mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)$. To understand $Y_i$, let $X_i^P$ be the projection of $X_i$ onto the true PC space $L_1^m$. It can be checked from $d^{-1/2} X_i^P = \sum_{j=1}^{m} (\sigma_j Z_{ji}) u_j$ that $Y_i$ is the vector of loadings of the scaled $X_i^P$ in the first $m$ principal component coordinates. When $\alpha = 1$, a deterministic term is added:
$$\|X_i\|^2 / d \Longrightarrow \|Y_i\|^2 + \tau^2, \qquad \|X_i - X_j\|^2 / d \Longrightarrow \|Y_i - Y_j\|^2 + 2\tau^2. \tag{11}$$

Fig. 3. Overlay of densities of the distance between the sample and population PC spaces, measured by $\rho_s$ for the $m = 2$ case. The top panel shows a transition of the density function corresponding to different signal to noise ratios. The bottom panel illustrates the effect of the ratio $\lambda_1/\lambda_2$ between the first two eigenvalues. For a larger signal to noise ratio $\sigma^2/\tau^2$, and for a smaller value of $\lambda_1/\lambda_2$, the Euclidean sine distance is smaller.
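The representation (11) for $\alpha = 1$ can be checked numerically for a large but finite $d$. The sketch below is an illustration under simplifying assumptions that are not part of the original text: the data are Gaussian, the population eigenvectors $u_j$ are taken to be the coordinate axes, the first $m$ eigenvalues are $\sigma_j^2 d^{\alpha}$, and all remaining eigenvalues equal $\tau^2$. The function names, the values of $\sigma_j$, and the default $d$ and seed are hypothetical choices.

```python
import math
import random


def spiked_sample(d, n, sigmas, tau, alpha, rng):
    """Draw n Gaussian vectors with eigenvalues sigma_j^2 * d^alpha (j <= m), tau^2 otherwise.

    Assumption: the eigenvectors u_j are the coordinate axes e_j, so that
    coordinate j of X_i is its j-th principal component score.
    """
    m = len(sigmas)
    sds = [s * d ** (alpha / 2.0) for s in sigmas] + [tau] * (d - m)
    return [[sd * rng.gauss(0.0, 1.0) for sd in sds] for _ in range(n)]


def sq_norm(v):
    return sum(x * x for x in v)


def check_alpha_one(d=20000, n=3, sigmas=(1.5, 1.0), tau=1.0, seed=0):
    """Return (left side, right side) pairs for both statements of (11)."""
    rng = random.Random(seed)
    m = len(sigmas)
    xs = spiked_sample(d, n, sigmas, tau, 1.0, rng)
    # Y_i = (sigma_1 Z_1i, ..., sigma_m Z_mi): the first m scores divided by sqrt(d).
    ys = [[x[j] / math.sqrt(d) for j in range(m)] for x in xs]
    norms = [(sq_norm(x) / d, sq_norm(y) + tau ** 2) for x, y in zip(xs, ys)]
    dists = []
    for i in range(n):
        for k in range(i + 1, n):
            dx = [a - b for a, b in zip(xs[i], xs[k])]
            dy = [a - b for a, b in zip(ys[i], ys[k])]
            dists.append((sq_norm(dx) / d, sq_norm(dy) + 2 * tau ** 2))
    return norms, dists


if __name__ == "__main__":
    norms, dists = check_alpha_one()
    for lhs, rhs in norms:
        print(f"|X_i|^2/d = {lhs:.3f}   |Y_i|^2 + tau^2 = {rhs:.3f}")
    for lhs, rhs in dists:
        print(f"|X_i - X_j|^2/d = {lhs:.3f}   |Y_i - Y_j|^2 + 2 tau^2 = {rhs:.3f}")
```

In each printed pair the two values should be close, while the values themselves differ from sample to sample. The random part comes from $Y_i$ and the deterministic part from $\tau^2$.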
The representations (9) to (11) can be understood geometrically, as summarized and discussed in the following.
$\alpha > 1$: The variability of samples is restricted to the true PC space $L_1^m$ for large $d$, which coincides with the notion of subspace consistency discussed in Remark 3. The $d$-dimensional probability distribution degenerates to the $m$-dimensional subspace $L_1^m$.
$\alpha = 1$: (11) is understood with the help of the Pythagorean theorem, that is, the norm of $X_i$ is asymptotically decomposed into orthogonal random and deterministic parts. Thus, data tend to be $\tau\sqrt{d}$ away from $L_1^m$, and $X_i$s projected on $L_1^{m\perp} = \mathrm{span}\{u_{m+1}, \ldots, u_d\}$, the orthogonal complement of $L_1^m$, will follow the representation similar to the $\alpha < 1$ case.
$\alpha < 1$: The geometric representation (9) holds. Note that the case $\alpha = 1$ smoothly bridges the others.

An example elucidating these ideas is shown in Fig. 4. Each panel shows 3d scatterplots of 10 different samples (shown as different symbols) of $n = 3$ Gaussian random vectors in dimensions $d = 3, 30, 3000$ (shown in respective columns of Fig. 4). In the spiked model, we take $\sigma = \tau = 1$ and $m = 1$ for simplicity and investigate three different orders of magnitude $\alpha = 1/2, 1, 2$ of the first eigenvalue $\lambda_1 = d^{\alpha}$. For each pair of $(d, \alpha)$, each sample $X_i$ is projected onto the first true PC direction $u_1$, shown as the vertical axis. In the orthogonal $d - 1$ dimensional subspace, the 2-dimensional hyperplane that is generated by the data is found, and the data are projected onto that. Within the hyperplane, variation due to rotation is removed by optimally rotating the data onto edges of a regular triangle by a Procrustes method, to give the horizontal axes in each part of Fig. 4. These axes are scaled by dividing by $\max(d^{\alpha}, d)$ and the 10 samples give an impression of various types of convergence as a function of $d$, for each $\alpha$.

The asymptotic geometric representations summarized above are confirmed by the figure. For
$\alpha = 1/2$, it is expected from (9) that the data are close to the vertices of the regular triangle, with edge length $\sqrt{2d}$. The vertices of the triangle (in the horizontal plane) with vertical rays (representing $u_1$, the first PC direction) are shown as the dashed lines in the first two rows of Fig. 4. Note that for $d = 3$, in the top row, the points appear to be quite random, but for $d = 30$, there already is reasonable convergence to the vertices with notable variation along $u_1$. The case $d = 3000$ shows more rigid representation with much less variation along $u_1$. On the other hand, for the case $\alpha = 2$ in the last row of Fig. 4, most of the variations in the data are found along $u_1$, shown as the vertical dotted line, and the variation perpendicular to $u_1$ becomes negligible as $d$ grows, which confirms the degeneracy to $L_1^1$ in (10). From these examples, conditions for consistency and strong inconsistency can be checked heuristically. The sample eigenvector $\hat{u}_1$ is consistent with $u_1$ when $\alpha > 1$, since the variation along $u_1$ is so strong that $\hat{u}_1$ should be close to that. $\hat{u}_1$ is inconsistent with $u_1$ when $\alpha < 1$, since the variation along $u_1$ becomes negligible so that $\hat{u}_1$ will not be near $u_1$.

For the $\alpha = 1$ case, in the middle row of Fig. 4, it is expected from (11) that each data point will be asymptotically decomposed into a random and a deterministic part. This is confirmed by the scatterplots, where the order of variance along $u_1$ remains comparable to that of horizontal components, as $d$ grows. The convergence to the vertices is noticeable even for $d = 30$, which becomes stronger for larger $d$, while the randomness along $u_1$ remains for large $d$. Also observe that the distance from each $X_i$ to the space spanned by $u_1$ becomes deterministic for large $d$, supporting the first part of (11).

Fig. 4. Gaussian toy example, showing the geometric representations of HDLSS data, with $n = 3$, for three different choices of $\alpha = 1/2, 1, 2$ of the spiked model and increasing dimensions $d = 3, 30, 3000$. For $\alpha \ne 1$, data converge to vertices of a regular 3-simplex (case $\alpha < 1$) or to the first PC direction (case $\alpha > 1$). When $\alpha = 1$, data are decomposed into the deterministic part on the horizontal axes and the random part along the vertical axis.
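The three regimes of the toy example can also be summarized numerically without plotting. The sketch below is not the code behind Fig. 4 and omits the Procrustes rotation step. It only reports, for each point, the scaled score along $u_1$ and the scaled distance to the line spanned by $u_1$. It assumes Gaussian data with $m = 1$, takes $u_1$ to be the first coordinate axis, sets the remaining $d - 1$ eigenvalues to $\tau^2$, and scales lengths by $\sqrt{\max(d^{\alpha}, d)}$. The square root in the scaling and the function name `toy_summary` are assumptions made here.

```python
import math
import random


def toy_summary(d, alpha, n=3, sigma=1.0, tau=1.0, seed=0):
    """Scaled coordinates of n Gaussian points from the m = 1 spiked model.

    Returns a list of (|score along u_1|, distance to span(u_1)) pairs,
    both divided by sqrt(max(d^alpha, d)).
    """
    rng = random.Random(seed)
    scale = math.sqrt(max(d ** alpha, d))
    rows = []
    for _ in range(n):
        # Score along u_1 has variance sigma^2 * d^alpha.
        along = sigma * d ** (alpha / 2.0) * rng.gauss(0.0, 1.0)
        # Squared length of the component orthogonal to u_1 (d - 1 noise coordinates).
        orth_sq = sum((tau * rng.gauss(0.0, 1.0)) ** 2 for _ in range(d - 1))
        rows.append((abs(along) / scale, math.sqrt(orth_sq) / scale))
    return rows


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        for d in (3, 30, 3000):
            rows = toy_summary(d, alpha)
            text = "  ".join(f"({a:.2f}, {o:.2f})" for a, o in rows)
            print(f"alpha={alpha:3.1f}  d={d:5d}  {text}")
```

Under these assumptions, for $\alpha = 1/2$ the first entry of each pair shrinks as $d$ grows while the second approaches $\tau$. For $\alpha = 1$ the first entry stays random while the second approaches $\tau$. For $\alpha = 2$ the second entry shrinks towards 0.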
Appendix. Derivation of the density functions
The probability density functions of the limiting distributions in (7) and (8) will be derived for the case $m = 2$, under the normal assumption. The argument is readily generalized to all $m$, but when a non-normal distribution is assumed, such a derivation is much more challenging.

We first recall some necessary notions for treating the Wishart matrix $W'W$ and eigen-decompositions. Most of the results are adopted from [17]. Let $A \sim \mathcal{W}_m(n, \Sigma_m)$ and denote its eigen-decomposition as $A = HLH'$ with $L = \mathrm{diag}(l_1, \ldots, l_m)$. Assume $\Sigma_m$ is positive definite and $n \ge m$ so that $l_1 > l_2 > \cdots > l_m > 0$ with probability 1. Denote $O(m) = \{H_{m \times m} : H'H = I_m\}$ for the set of orthonormal $m \times m$ matrices and $(dH)$ for $H \in O(m)$ as the differential form representing the uniform probability measure on $O(m)$. The multivariate gamma function is defined as
$$\Gamma_m(a) = \pi^{m(m-1)/4} \prod_{i=1}^{m} \Gamma\!\left(a - \tfrac{1}{2}(i - 1)\right),$$
where $\Gamma(\cdot)$ is the usual gamma function. For $H \equiv [h_1, \ldots, h_m] \in O(m)$,
$$(dH) \equiv \frac{1}{\mathrm{Vol}(O(m))} (H'dH) = \frac{\Gamma_m\!\left(\frac{m}{2}\right)}{2^m \pi^{m^2/2}} (H'dH), \tag{A.1}$$
where $(H'dH) \equiv \bigwedge_{i>j}^{m} h_i' \, dh_j$.

We are now ready to state the density function of
$\varphi(W'W)$ for $m = 2$. Note that under the Gaussian assumption $W'W$ is the $2 \times 2$ Wishart matrix with degree of freedom $n$ and covariance matrix $\Sigma_W = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$. For simplicity, write $(L_1, L_2) = \varphi(W'W)$, and $L = \mathrm{diag}(L_1, L_2)$. Then the joint density function of $L_1$ and $L_2$ is given by e.g. Theorem 3.2.18 of [17] with $m$