Boundary behavior in High Dimension, Low Sample Size asymptotics of PCA

(1)

Contents lists available atSciVerse ScienceDirect

Journal of Multivariate Analysis

journal homepage:www.elsevier.com/locate/jmva

Boundary behavior in High Dimension, Low Sample Size asymptotics

of PCA

Sungkyu Jung

a,∗

, Arusharka Sen

b

, J.S. Marron

c

a_{Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA}

b_{Department of Mathematics and Statistics, Concordia University, Montreal, Quebec H3G 1M8, Canada}

c_{Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA}

a r t i c l e i n f o

Article history: Received 1 June 2011 Available online 23 March 2012 AMS 2000 subject classifications: primary 62H25

34L20 secondary 62F12 Keywords:

Principal component analysis High Dimension Low Sample Size Geometric representation ρ-mixing

Consistency and strong inconsistency Spiked covariance model

a b s t r a c t

In High Dimension, Low Sample Size (HDLSS) data situations, where the dimensiondis much larger than the sample sizen, principal component analysis (PCA) plays an important role in statistical analysis. Under which conditions does the sample PCA well reflect the population covariance structure? We answer this question in a relevant asymptotic context wheredgrows andnis fixed, under a generalized spiked covariance model. Specifically, we assume the largest population eigenvalues to be of the orderdα, whereα <,=, or>1. Earlier results show the conditions for consistency and strong inconsistency of eigenvectors of the sample covariance matrix. In the boundary case,α = 1, where the sample PC directions are neither consistent nor strongly inconsistent, we show that eigenvalues and eigenvectors do not degenerate but have limiting distributions. The result smoothly bridges the phase transition represented by the other two cases, and thus gives a spectrum of limits for the sample PCA in the HDLSS asymptotics. While the results hold under a general situation, the limiting distributions under Gaussian assumption are illustrated in greater detail. In addition, the geometric representation of HDLSS data is extended to give three different representations, that depend on the magnitude of variances in the first few principal components.

1. Introduction

The study of the covariance matrix and its usual estimator, the sample covariance matrix, is an important issue in multivariate statistics. In particular, the sample covariance matrix provides the conventional estimator of principal component analysis (PCA) through the eigenvalue–eigenvector decomposition. PCA plays an important role in dimension reduction and visualization of important data structure. The High Dimension, Low Sample Size (HDLSS) data situation, where the dimensiondof the sample space is much larger than the sample sizen, occurs in many areas of modern science, and thus the dimension reduction through PCA is becoming more important for analysis of such data. The sample PCA (through the eigen-decomposition of the sample covariance matrix) is still well-defined whend

>

n, and thus frequently used in practice. Even when the dimension is much higher than the sample size, the PCA has shown to be successful such as in microarray studies [5]. What is the underlying mechanism which leads to the success of PCA in the HDLSS situation? This is the question we answer in this paper.

A central question is whether the sample principal components reflect true underlying distributional structure in the HDLSS context. This has been investigated by comparing the sample eigenvalues and eigenvectors with their population

∗_{Corresponding author.}

(2)

counterparts, in a relevant asymptotic context where the dimensiondgrows while the sample sizenis fixed [2,14,25,24]. The asymptotic direction ofdgrowing andnfixed is also studied in different contexts; see for instance, [7,20] and Chapter 4.5 of [21]. While we focus on this asymptotic context in this paper, we also point out that there has been a different approach for the problem where the limits are taken along the direction wheredandngrow at the same rate, i.e.d

/

n

→

c

∈

(

0

,

∞

)

asd

→ ∞

. For the result of this type, we refer to [13,4,19,18,16] and references therein.

In both investigations, the majority of results are well described in a spiked covariance model, proposed by [13]. An exception we point out is a work by [8], which proposes to estimate the spectral distribution of eigenvalues without assuming a spike model.

A spiked covariance model assumes that the first few eigenvalues are distinctively larger than the others. We use a generalized version of the spike model, as described in Section 3, which is different from that of [13,19]. Let Σ₍d) denote the population covariance matrix andS₍d)denote the sample covariance matrix. The eigen-decomposition ofΣ(d) isΣ₍d)

=

UdΛdUd′, whereΛdis a diagonal matrix of eigenvalues

λ

1,d

≥

λ

2,d

≥ · · · ≥

λ

d,din non-increasing order,Udis a matrix of corresponding eigenvectors so thatUd

= [

u1,d

, . . . ,

ud,d

]

, and′denotes the transpose of the preceding matrix. The eigen-decomposition ofS₍d)is similarly defined asS(d)

= ˆ

UdΛ

ˆ

dU

ˆ

d′. As a simple example of the spiked model, consider

λ

1,d

=

σ

2dα,

λ

2,d

= · · · =

λ

d,d

=

τ

2, for

α, σ

2

, τ

2

>

0 fixed. The first eigenvector ofS(d)corresponding to the largest eigenvalue is of interest, as it contains the most important variation of the data. The first sample eigenvectoru

ˆ

1,dis assessed with the angle formed by itself and its population counterpartu1,d. The directionu

ˆ

1,dis said to beconsistentwithu1,dif Angle

(

u

ˆ

1,d

,

u1,d

)

→

0 asd

→ ∞

. However in the HDLSS context, a perhaps counter-intuitive phenomenon frequently occurs, where the two directions tend to be as far away as possible. We say the directionu

ˆ

1,disstrongly inconsistentwith

u1,dif Angle

(

u

ˆ

1,d

,

u1,d

)

→

π₂ asd

→ ∞

. In the one spike model above, the order of magnitude

α

of the first eigenvalue is the key condition for these two limiting phenomena. [14] have shown that

Angle

(

u

ˆ

1,d

,

u1,d

)

→



0

,

α >

1

;

π

2

, α <

1

,

(1) in probability under some conditions. Although the gap between consistency and strong inconsistency is relatively thin, the case

α

=

1 has not been investigated, and is a main focus of this paper.

It is natural to conjecture from(1)that when

α

=

1, the angle does not degenerate but converges to a random quantity in

(

0

, π/

2

)

. This claim is established in the simple one spike model in the next section, where we describe a range of limits for the eigenvalues and eigenvectors, depending on the order of magnitude

α

of

λ

1,d. In Section3, the claim is generalized for multiple spike cases, and is proved in a much more general distributional setting.

The parameter

α

gives a sharp mathematical boundary for the set of HDLSS situations where the estimated PC direction converges to the population direction. In the boundary,

α

=

1, it will be shown that the estimated PC direction is affected by the true PC directions, but not as strong as the

α >

1 case. The success of PCA in the HDLSS situation is understood as an example of the

α

≥

1 cases, as otherwise the estimated principal components are meaningless as shown in(1).

In a multiple spike model withm

>

1 spikes, where the firstmprincipal components contain the importantsignalof the distribution, the sample PCA can be assessed by simultaneously comparing the firstmprincipal components. In particular, we investigate the limits ofdistancebetween two subspaces: the subspace generated by the firstmsample PC directions

ˆ

u1,d

, . . . ,

u

ˆ

m,dand the subspace by the firstmpopulation PC directions. The distance is usefully measured by canonical angles and metrics between subspaces, the limiting distributions of which will be investigated for the

α

=

1 case, as well as the cases

α

̸=

1, in Section3.3. The probability density functions of the limiting distributions for

α

=

1 are also derived and illustrated under a Gaussian assumption, to show the effect of parameters in the distributions.

The HDLSS data set has an interesting geometric representation in the limitd

→ ∞

, as shown in [11]. In Section4, we extend the result and show that there are three different geometric representations, which coincide with the range of limits depending on

α

.

2. Range of limits in the single spike model

Suppose we have a data matrixX₍d)

= [

X1,d

, . . . ,

Xn,d

]

, withd

>

n, where theddimensional random vectorsXi,dare independent and identically distributed. We assume for now thatXi,dis normally distributed with mean zero and covariance matrix6₍d), but the Gaussian assumption will be relaxed in the next section. The population covariance matrix6(d)is assumed to have one spike, that is, the eigenvalues of6₍d)are

λ

1,d

=

σ

2dα

, λ

2,d

= · · · =

λ

d,d

=

τ

2. The corresponding eigenvectors of6₍d)are denoted byui,d. The sample covariance matrix is defined asS(d)

=

1_nX(d)X₍′d)with itsith eigenvalue

and eigenvector denoted by

λ

ˆ

i,dandu

ˆ

i,d, respectively.

The following theorem summarizes the spectrum of the limiting distributions of the eigenvalues and eigenvectors of

S₍d), depending on the different order

α

of

λ

1,d. Note that the angle between the two vectorsu

,

u

ˆ

is represented by the inner product through Angle

(

u

,

u

ˆ

)

=

cos−1

₍

_u′_u

_ˆ

₎

_{. For the eigenvectors with common eigenvalues, there are of course an infinite} number of choices. The argument in the following theorem assumes that wechoosea set of population eigenvectorsuj,d before obtainingu

ˆ

j,d.

(3)

Theorem 1. Under the Gaussian assumption and the one spike case above,

(

i

)

the limit of the first eigenvalue depends on

α

:

ˆ

λ

1,d max

(

dα

,

d

)

H⇒











σ

2

χ

n2 n

,

α >

1

;

σ

2

χ

2 n n

+

τ

2 n

, α

=

1

;

τ

2 n

,

α <

1

,

as d

→ ∞

, where

H⇒

denotes the convergence in distribution, and

χ

2

n denotes a random variable with the

χ

2distribution

with degrees of freedom n. The rest of the eigenvalues converge to the same quantity when scaled, that is for any

α

∈ [

0

,

∞

)

, j

=

2

, . . . ,

n,

ˆ

λ

j,d d

→

τ

2 n

,

as d

→ ∞

,

in probability.

(

ii

)

The limit of the first eigenvector depends on

α

: u′₁_,_du

ˆ

1,d

H⇒











1

α >

1

;



1

+

τ

2

σ

2

_χ

2 n



−12

α

=

1

;

0

,

α <

1

,

as d

→ ∞

. The rest of the eigenvectors are strongly inconsistent with their population counterpart, for any

α

∈ [

0

,

∞

),

j

=

2

, . . . ,

n,

u′_j_,_du

ˆ

j,d

→

0

,

as d

→ ∞

,

in probability.

The case

α

=

1 bridges the other two cases. In particular, the ratio of the sample and population eigenvalues

λ

ˆ

1,d

/λ

1,dis asymptotically unbiased to 1 when

α >

1. It is asymptotically biased when

α

=

1, and becomes completely deterministic in the case

α <

1, where the effect of

σ

2_on

_λ

ˆ

1,dbecomes negligible. Moreover, the angle Angle

(

u1,d

,

u

ˆ

1,d

)

to the optimal direction converges to a random quantity which is defined on

(

0

, π/

2

)

and depends on

σ

2

_{, τ}

2_{, and}_n_{. The effect of those}

parameters on the limiting distribution of Angle

(

u1,d

,

u

ˆ

1,d

)

is illustrated inFig. 1. The ratio

σ

2

/τ

2can be understood as a

signal to noiseratio. A high value of

σ

2

_/τ

2_{means that the major variation along the first PC direction is strong. Therefore, for}

larger values of

σ

2

_/τ

2

_,

_Angle

₍

_u

1,d

,

u

ˆ

1,d

)

should be closer to zero than smaller values of the ratio, as depicted in the upper panel ofFig. 1. Moreover, the sample PCA with larger sample sizenshould perform better than with smaller sample size. The sample sizenbecomes the degrees of freedom of the

χ

2_{distribution in the limit, and the bottom panel of}_{Fig. 1}_shows

that theu

ˆ

i,dis closer toui,dfor larger values ofn.

The Gaussian assumption in the previous theorem appears as a driver of the limiting

χ

2distributions. Under the general non-Gaussian assumption we state in the next section, the

χ

2_{will be replaced by a distribution that depends heavily on the}

distribution of the population principal component scores, which may not be Gaussian.

Remark 1. The results inTheorem 1can be used to estimate the parameters

σ

2_and

_τ

2_{in the model with}

_α

₌

_{1. As a}

simple example, one can set

τ

ˆ

2

₌

n n−1



n j=2 ˆ λj,d d and

σ

ˆ

2

_{= ˆ}

_λ

1,d

/

d

− ˆ

τ

2

/

n. Then

τ

ˆ

2

→

τ

2and

σ

ˆ

2

H⇒

σ

2χ 2 n n asd

→ ∞

by

Theorem 1and Slutsky’s theorem. The estimator

σ

ˆ

2is not consistent but can evidently be used to construct a confidence interval for

σ

2_: P



n

σ

ˆ

2

χ

2 n,1−(a/2)

≤

σ

2

≤

n

σ

ˆ

2

χ

2 n,a/2



→

(

1

−

a

)

asd

→ ∞

.

This can be extended to provide asymptotic confidence intervals for principal component scores. A similar estimation scheme can be found in a related but different setting whered

,

n

→ ∞

together; see for example [19,16]. These papers do not discuss eigenvector estimation. The sample eigenvector,u

ˆ

1,dis difficult to improve upon mainly because the direction of deviationu

ˆ

1,dfromu1,dis quite random unless a more restrictive assumption (e.g. sparsity) is made.

3. Limits under the generalized spiked covariance model

The results in the previous section will be generalized to much broader situations, including a generalized spiked covariance model and a relaxation of the Gaussian assumption. We focus on the

α

=

1 case, and describe the limiting distributions for eigenvalues and eigenvectors.

(4)

Fig. 1. Angle densities for the one spike case. The top panel shows an overlay of the densities with differentσ2_{, with other parameters fixed. The bottom}

panel shows an overlay of the densities with different degrees of freedomnof theχ2_{distribution. For a larger signal to noise ratio}σ2

τ2, and for a largern, the angle to optimal is smaller.

3.1. Eigenvalues and eigenvectors

In the following, all the quantities depend on the dimensiond, but the subscriptdis omitted when it does not cause any confusion. We first describe some elementary facts from matrix algebra, that are useful throughout the paper. The dimension of the sample covariance matrixSincreases asdgrows, so it is challenging to deal withSdirectly. A useful approach is to use the dual ofS, defined as then

×

nsymmetric matrix

SD

=

1

nX

′

X

,

by switching the role of rows and columns ofX. The

(

i

,

j

)

th element ofSDis1_nXi′Xj. An advantage of working withSDis that for larged, the finite dimensional matrixSDis positive definite with probability one, and itsneigenvalues are the same as the non-zero eigenvalues ofS. Moreover, the sample eigenvectorsu

ˆ

iare related to the eigen-decomposition ofSD, as shown next. LetSD

= ˆ

VnΛ

ˆ

nV

ˆ

n, whereΛ

ˆ

n

=

diag

(

λ

ˆ

1

, . . . ,

λ

ˆ

n

)

andV

ˆ

nis then

×

northogonal matrix of eigenvectors

v

ˆ

icorresponding to

λ

ˆ

i. RecallS

= ˆ

UΛ

ˆ

U

ˆ

′. SinceSis at most rankn, we can writeS

= ˆ

UnΛ

ˆ

nU

ˆ

n′, whereU

ˆ

n

= [ ˆ

u1

, . . . ,

u

ˆ

n

]

consists of the firstn columns ofU

ˆ

. The singular value decomposition ofXis given by

X

= ˆ

UnΛ

ˆ

nV

ˆ

n′

=

n



i=1

(

n

λ

ˆ

i

)

− 1 2u

ˆ

i

v

ˆ

i′

.

Then thekth sample principal component directionu

ˆ

kfork

≤

nis proportional toX

v

ˆ

k,

ˆ

uk

=

(

n

λ

ˆ

k

)

− 1

2_X

v

ˆ

_k

.

(2)

Therefore the asymptotic properties of the eigen-decomposition ofS, asd

→ ∞

, can be studied via those of the finite dimensional matrixSD.

It is also useful to representSDin terms of the population principal components. LetZ(d)be the standardized principal

components ofX, defined by Z₍d)

=







Z₁′

...

Z_d′







=

Λ −1/2 d U ′ dX

,

whereZ_i′

=

(

Zi1

, . . . ,

Zin

)

is theith row ofZ(d), so that

Z_i′

=

λ

− 1 2 i u ′ iX

.

(3)

(5)

Under the Gaussian assumption of the previous section, each element ofZ₍d)is independently distributed as the standard normal distribution. ByX

=

UdΛ1/2Z(d), SD

=

1 nX ′ X

=

1 nZ ′ (d)ΛZ(d)

=

1 n d



i=1

λ

iZiZi′

.

The Gaussian assumption onXis relaxed as follows. We assume that each column Xi ofX follows addimensional multivariate distribution with mean zero and covariance matrixΣ. Each entry of the standardized principal components, or the sphered variablesZ₍d)is assumed to have finite fourth moments, and is uncorrelated but in general dependent with each other. We regulate the dependency of the principal components by a

ρ

-mixing condition (see [15,6]). We briefly describe a version of

ρ

-mixing for our purpose. For

−∞ ≤

J

≤

L

≤ ∞

, letFJLdenote the

σ

-field of events generated by the random variablesZi

,

J

≤

i

≤

L. For any

σ

-fieldA, letL2

(

A

)

denote the space of square-integrable,Ameasurable real-valued random

variables. For eachm

≥

1, define the maximal correlation coefficient

ρ(

m

)

:=

sup

|

corr

(

f

,

g

)

|

,

f

∈

L2

(

F−∞j

),

g

∈

L2

(

Fj∞+m

),

where sup is over allf

,

gandj

∈

Z. The sequence

{

Zi

}

is said to be

ρ

-mixing if

ρ(

m

)

→

0 asm

→ ∞

.

While the concept of

ρ

-mixing is useful as a mild condition for the development of laws of large numbers, its formulation is critically dependent on the ordering of variables. Therefore we assume that there is some permutation of the data which is

ρ

-mixing. In particular, let

{

Zij,(d)

}

di=1be the components of thejth column vector ofZ(d). We assume that for eachd, there exists a permutation

π

d

: {

1

, . . . ,

d

} −→ {

1

, . . . ,

d

}

so that the sequence

{

Zπd(i)j,(d)

:

i

=

1

, . . . ,

d

}

is

ρ

-mixing. This

assumption makes the results invariant under a permutation of the variables. We denote these distributional assumptions as

(

c1

)

.

We then define a generalized spiked covariance model. Recall that a simple one spike model was defined on the eigenvalues of the population covariance matrixΣ, for example,

λ

1

=

σ

2dα

, λ

2

= · · · =

λ

d

=

τ

2. This is generalized by allowing multiple spikes, and by relaxing the uniform eigenvalue assumption in the tail to a decreasing sequence. The tail eigenvalues are regulated by a measure of sphericity

ϵ

kin the limitd

→ ∞

. The measure of sphericity

ϵ

k

,

k

=

1

,

2

, . . .

, is defined for

{

λ

_k

, . . . , λ

_d

}

as

ϵ

k

(

d

)

≡



_d



i=k

λ

i

2

d d



i=k

λ

2 i

,

which is away from 0 and close to 1 when

{

λ

_k

, . . . , λ

_d

}

are close to each other. Then we shall assume that the tail eigenvalues do not decrease too fast. In particular, we say the

ϵ

k-conditionis satisfied when

ϵ

k

(

d

)

decreases at a rate slower thand−1, i.e.

(

d

ϵ

k

)

−1

=

d



i=k

λ

2 i



_d



i=k

λ

i

2

→

0 asd

→ ∞

.

The

ϵ

kcondition holds for quite general settings [14, Sec. 2c]. As an example, a polynomially decreasing sequencei− 1 2 _of eigenvalues satisfies the condition

ϵ

kwithk

=

1. In the generalized spiked model, the eigenvalues are assumed to be of the form

λ

1

=

σ

12dα

, . . . , λ

m

=

σ

m2dα, for

σ

12

≥ · · · ≥

σ

m2

>

0 for somem

≥

1, and the

ϵ

m+1condition holds for

λ

m+1

, . . . , λ

d. Also assume that1_d



d

i=m+1

λ

i

→

τ

2asd

→ ∞

. These conditions for spike models are denoted by

(

c2

)

.

The following theorem gives the limits of the sample eigenvalues and eigenvectors under the general assumptions in this section. We use the following notations. Let

ϕ(

A

)

be a vector of eigenvalues of a real symmetric matrixAarranged in non-increasing order and let

ϕ

i

(

A

)

be theith largest eigenvalue ofA. Let

v

i

(

A

)

denote theith eigenvector of the matrix

Acorresponding to the eigenvalue

ϕ

i

(

A

)

and

v

ij

(

A

)

be thejth loading of

v

i

(

A

)

. Also note that there are many choices of eigenvectors ofS including the sign changes. We use the convention that the sign ofu

ˆ

iwill be chosen so thatu

ˆ

′iui

≥

0. Recall that the vector of theith standardized principal component scores isZi

=

(

Z1i

, . . . ,

Zni

)

′. Denote then

×

mmatrix of the firstmprincipal component scores asW

= [

σ

₁Z1

, . . . , σ

mZm

]

. The limiting distributions heavily depend on the finite dimensional random matrixW.

Theorem 2. Under the assumptions

(

c1

)

and

(

c2

)

with fixed n

≥

m

≥

1, if

α

=

1, then

(

i

)

the sample eigenvalues d−1n

λ

ˆ

i,d

H⇒



ϕ

i

(

W′W

)

+

τ

2

,

i

=

1

, . . . ,

m

;

τ

2

_,

_i

₌

_m

₊

₁

_{, . . . ,}

_n

_,

(6)

(

ii

)

The inner products between the sample and population eigenvectors have limiting distributions:

ˆ

u′_i_,_du_j_,_d

H⇒

v

ij

(

W ′_W

₎



1

+

τ

2

/ϕ

i

(

W′W

)

as d

→ ∞

jointly for i

,

j

=

1

, . . . ,

m

.

The rest of eigenvectors are strongly inconsistent with their population counterpart, i.e.

ˆ

u′_i_,_dui,d

→

0 as d

→ ∞

for i

=

m

+

1

, . . . ,

n

,

in probability.

The theorem shows that the firstmeigenvectors are neither consistent nor strongly inconsistent to the population counterparts. The limiting distributions of angles Angle

(

u

ˆ

_i_,_d

,

u_i_,_d

)

to optimal directions are supported on

(

0

, π/

2

)

and depend on the magnitude of the noise

τ

2 _{and the distribution of}_W′_{W. Note that the}_m

_×

_m_{symmetric matrix}_W′_W_is the scaled covariance matrix of the principal component scores in the firstmdirections. When the underlying distribution ofXis assumed to be Gaussian, thenW′_W

is the Wishart matrixWm

(

n

,

Λm

)

, whereΛm

=

diag

(σ

12

, . . . , σ

m2

)

. Ifm

=

1, the matrix becomes a scalar random variableW′_W

₌

_ϕ

1

(

W′W

)

, and is

χ

n2under the Gaussian assumption, which leads to

Theorem 1.

In general, the distribution of

ϕ

i

(

W′W

)

is not simply described. We refer to [17, Ch. 9] for the Gaussian case, and [3] for the casem

→ ∞

.

The limiting distributions of the cases

α

̸=

1 can be found in a similar manner, which we only state the result for the case

α >

1 and for the firstmcomponents. We refer to [14] for more general results. Fori

,

j

=

1

, . . . ,

m, the eigenvalues

d−α_n

_λ

ˆ

i,d

H⇒

φ

i

(

W′W

)

and the inner productsu

ˆ

′i,duj,d

H⇒

v

ij

(

W′W

)

asd

→ ∞

. In comparison to the

α

=

1 case, if we set

τ

2_{to be zero, the result becomes identical for all}

_α

_≥

_1.

Remark 2. When the sample sizenalso grows, consistency of sample eigenvalues and eigenvectors can be achieved. In particular, fori

=

1

, . . . ,

m, we have asdgrows

ˆ

λ

i,d

λ

i,d

=

d n d−1_n

_λ

ˆ

i,d

σ

2 i d

H⇒

ϕ

i

(

W ′_W

_/

_n

₎

σ

2 i

+

τ

2 n

σ

_i2

byTheorem 2. SinceW′_W

_/

_n

_→

_diag

_(σ

2

1

, . . . , σ

m2

)

by a law of large numbers, we get the consistency of eigenvalues, i.e.

ˆ

λ

i,d

/λ

i,d

→

1 asd

,

n

→ ∞

, where the limits are applied successively. For the sample eigenvectors, fromTheorem 2(ii) and because

v

ij

(

WTW

)

→

1 ifi

=

j, 0 otherwise asn

→ ∞

and

ϕ

i

(

W′W

)

=

O

(

n

)

, we getu

ˆ

′i,duj,d

→

1 ifi

=

j, 0 otherwise asd

,

n

→ ∞

. Therefore, the sample PC directions are consistent to the corresponding population PC directions, i.e. Angle

(

u

ˆ

i,d

,

ui,d

)

→

0 asd

,

n

→ ∞

(applied successively), fori

≤

m. Therefore it is conjectured that whend

,

ngrow together withd

≫

n, a similar conclusion toTheorem 2can be drawn.

Proof of Theorem 2. The following lemma (Theorem 1 of [14]) shows a version of the law of large numbers for matrices, that is useful in the proof. Recall thatZ′

i

≡

(

Z1i

, . . . ,

Zni

)

is theith row ofZ(d). Lemma 1. If the assumption

(

c1

)

and the

ϵ

k-condition hold, then

c_d−1 d



i=k

λ

i,dZiZi′

→

In

,

as d

→ ∞

in probability, where cd

=

n−1



d

i=1

λ

i,dand Indenotes the n

×

n identity matrix. In particular, if d−1



d i=k

λ

i,d

→

τ

2, then 1 d d



i=k

λ

i,dZiZi′

→

τ

2_I n

,

as d

→ ∞

in probability.

This lemma is used to show that the spectral decomposition ofd−1_nS

D, d−1nSD

=

m



i=1

σ

2 iZiZi′

+

d −1 d



i=m+1

λ

iZiZi′

,

can be divided into two parts, and the latter converges to a deterministic part. ApplyingLemma 1, we haved−1_nS

D

H⇒

S0

asd

→ ∞

, where

(7)

Then since the eigenvalues of a symmetric matrixAare a continuous function of elements ofA, we have

ϕ(

d−1nSD

)

H⇒

ϕ(

S0

),

asd

→ ∞

. Noticing that fori

=

1

, . . . ,

m,

ϕ

i

(

S0

)

=

ϕ

i

(

WW′

)

+

τ

2

=

ϕ

i

(

W′W

)

+

τ

2

,

and fori

=

m

+

1

, . . . ,

n

, ϕ

i

(

S0

)

=

τ

2gives the result.

For the eigenvectors, note that the eigenvectors

v

ˆ

_iofd−1_nS

Dcan be chosen so that they are continuous [1]. Therefore, we also have that

v

ˆ

_i

=

v

_i

(

d−1_nS

D

)

H⇒

v

i

(

S0

)

asd

→ ∞

, for alli. Also note that

v

i

(

S0

)

=

v

i

(

WW′

)

fori

≤

m.

Similar to the dual approach for covariance matrices, the eigenvectors of then

×

nmatrixWW′_{can be evaluated from} the dual of the matrix. In particular, letW

=

U_wΛ_wV_w′

=



m

i=1

λ

iwuiw

v

i′w, where

λ

2 iw

=

ϕ

i

(

W′W

)

and

v

iw

=

v

i

(

W′W

)

. Then

v

1

(

S0

)

=

uiw

=

W

v

iw

λ

iw

=

W

v

i

(

W ′ W

)

√

ϕ

i

(

W′W

)

.

Now from(2),(3)and the previous equation, for 1

≤

i

,

j

≤

m,

u′_ju

ˆ

i

=

u′j X

v

ˆ

_i



n

λ

ˆ

i

=

u ′ jX

v

ˆ

i



n

λ

ˆ

i

=



σ

2 jdZ ′ j

v

ˆ

i



n

λ

ˆ

i

=

σ

jZ ′ j

v

ˆ

i



d−1_n

_λ

ˆ

i

H⇒

σ

jZ ′ jW

v

i

(

W′W

)



ϕ

i

(

W′W

)

+

τ

2

√

ϕ

i

(

W′W

)

.

(4)

Note that

σ

jZj′W

= [

σ

j

σ

1Zj′Z1

· · ·

σ

j

σ

mZj′Zm

]

is thejth row ofW′WandW′W

v

i

(

W′W

)

=

ϕ

i

(

W′W

)v

i

(

W′W

)

. Therefore, the limiting form(4)becomes

ϕ

i

(

W′W

)v

ij

(

W′W

)



ϕ

i

(

W′W

)

+

τ

2

√

ϕ

i

(

W′W

)

.

Fori

=

m

+

1

, . . . ,

n, again from(2)and(3), we get

u′_iu

ˆ

i

=

√

λ

iZi′

v

ˆ

i



n

λ

ˆ

i

=

d−12

τ

Z ′ i

v

ˆ

i



n

λ

ˆ

i

/

d

=

Op



d−12



.

3.2. Asymptotic results for the centered data matrix

In practice, the sample covariance matrix is usually derived from the centered data matrix. In such a case, we obtain a weaker result thanTheorem 2. LetS

˜

₍d)

=

n−1

(

Xd

− ¯

X

)(

Xd

− ¯

X

)

′, whereX

¯

=

n−1



n

i=1Xi,d1′nis ad

×

nmatrix consisting of

ncolumns of the sample mean vector.

For a general mean vector

µ

, the representation ofXin terms of standardized principal componentsZ₍d)is X

=

µ

1′_n

+

UdΛ 1 2_Z₍_d₎ and thus X

− ¯

X

=

UdΛ 1 2

(

_Z₍_d₎

− ¯

Z

),

where theith row ofZ

¯

isz

¯

′

i

=

n

−1



n

j=1Zij1′n

=

n −1_Z′

iJn. The symbolJnrepresents then

×

nmatrix consisting entirely of ones.

The dual covariance matrix ofS

˜

₍d)is thenS

˜

D

=

n−1



d

i=1

λ

i

(

Zi

− ¯

zi

)(

Zi

− ¯

zi

)

′

=

n−1

(

In

−

n−1Jn

)



d

i=1

λ

iZiZi′

(

In

−

n−1Jn

)

′. Then we have the following result.

Proposition 1. Let

λ

˜

i,dbe the ith largest eigenvalue of S

˜

(d). Under the assumptions

(

c1

)

and

(

c2

)

with fixed n

>

m

≥

1, if

α

=

1, then for any

ϵ >

0,

lim d→∞P



_m



i=1



n d

˜

λ

i,d

∈ [

ϕ

i

(

W

˜

′W

˜

), ϕ

i

(

W

˜

′W

˜

)

+

τ

2

]



n

_

−1 i=m+1



n d

˜

λ

i,d

∈

(τ

2

−

ϵ, τ

2

+

ϵ)





=

1

,

(8)

Proof of Proposition 1.Note that

λ

˜

i,d

=

ϕ

i

(

S

˜

(d)

)

=

ϕ

i

(

S

˜

D

)

fori

=

1

, . . . ,

n

−

1. Similar to the proof ofTheorem 2, the spectral decomposition ofd−1_n_S

˜

Dis divided into two parts,

d−1nS

˜

D

=

m



i=1

σ

2 i

(

Zi

− ¯

zi

)(

Zi

− ¯

zi

)

′

+

(

In

−

n−1Jn

)

1 d d



i=m+1

λ

iZiZi′

(

In

−

n−1Jn

)

′

,

and usingLemma 1,d−1nS

˜

Dconverges in distribution toS

˜

0

= ˜

WW

˜

′

+

τ

2

(

In

−

n−1Jn

)

. Then the eigenvalues ofd−1nS

˜

Djointly converge to the eigenvalues ofS

˜

0.

Fori

=

1

, . . . ,

m, Weyl’s inequality [14, p. 4121], [10] yields that

ϕ

i

(

W

˜

W

˜

′

)

+

ϕ

n

{

τ

2

(

In

−

n−1Jn

)

} ≤

ϕ

i

(

S

˜

0

)

≤

ϕ

i

(

W

˜

W

˜

′

)

+

ϕ

1

{

τ

2

(

In

−

n−1Jn

)

}

,

where

ϕ

j

{

τ

2

(

In

−

n−1Jn

)

} =

τ

2forj

=

1

, . . . ,

n

−

1 and 0 forj

=

n, and

ϕ

i

(

W

˜

W

˜

′

)

=

ϕ

i

(

W

˜

′W

˜

)

. Fori

=

m

+

1

, . . . ,

n

−

1, also applying Weyl’s inequality gives

ϕ

n

(

W

˜

W

˜

′

)

+

ϕ

i

{

τ

2

(

In

−

n−1Jn

)

} ≤

ϕ

i

(

S

˜

0

)

≤

ϕ

i

(

W

˜

W

˜

′

)

+

ϕ

1

{

τ

2

(

In

−

n−1Jn

)

}

,

and because the rank ofW

˜

W

˜

′_{is at most}_m

_{, ϕ}

i

(

W

˜

W

˜

′

)

=

0 fori

>

m. Thus,

ϕ

i

(

S

˜

0

)

=

τ

2, which leads tod−1n

λ

˜

i,d

→

τ

2in probability asd

→ ∞

. The result is derived by combining the casesi

=

1

, . . . ,

n

−

1.

When the centered data matrixX

− ¯

X is used, the scaled eigenvalue estimate no longer converges in distribution to

ϕ

i

(

W

˜

′W

˜

)

+

τ

2. However, the difference becomes smaller for largern, since the centering matrixIn

−

n−1Jnis close toInfor largen. For the rest of the paper, we assume that the mean is known and zero for the sake of clear presentation.

3.3. Angles between principal component spaces

Under the generalized spiked covariance model withm

>

1, the firstmpopulation principal directions provide a basis of the most important variation. Therefore, it would be more informative to investigate the deviation of eachu

ˆ

ifrom the subspaceLm

1

(

d

)

spanned by

{

u1,d

, . . . ,

um,d

}

. Also, denote the subspace spanned by the firstmsample principal directions as

ˆ

Lm

1

(

d

)

≡

span

{ ˆ

u1,d

, . . . ,

u

ˆ

m,d

}

. When performing dimension reduction, it is critical for the sample PC spaceL

ˆ

m1 to be close

to the population PC spaceLm₁. The closeness of two subspaces can be measured in terms ofcanonical angles.

We briefly introduce the notion of canonical angles and metrics between subspaces, detailed discussions of which can be found in [23,9]. As a simple case, the canonical angle between a 1-dimensional subspace and anm-dimensional subspace is defined as follows. LetL

ˆ

ibe the 1-dimensional linear space with basisu

ˆ

i. Infinitely many angles can be formed between

ˆ

LiandLm1 withm

>

1. The canonical angle, denoted by Angle

(

L

ˆ

i

,

Lm1

)

, is defined by the smallest angle formed, that is the

angle betweenu

ˆ

iand its projectionu

ˆ

Pi ontoLm1. This angle is represented in terms of an inner product as

Angle

(

_L

ˆ

_i

_,

_Lm 1

)

=

cos −1



ˆ

u′ iu

ˆ

Pi



u

ˆ

P_i



u

ˆ

_i





=

min y∈Lm1 Angle

(

u

ˆ

i

,

y

)

for

∥

y

∥

>

0

.

(5)

When two multi-dimensional subspaces are considered, multiple canonical angles are defined. Among angles between_L

ˆ

m

1

andLm₁, the first canonical angle is geometrically defined as

θ

1

(

L

ˆ

m1

,

L m 1

)

=

max x∈ ˆLm₁ min y∈Lm 1 Angle

(

x

,

y

)

for

∥

x

∥

,

∥

y

∥

>

0

,

(6)

where Angle

(

x

,

y

)

is the angle formed by the two vectorsx

,

y. One can show that the second canonical angle is defined by the same geometric relation as above with_L

ˆ

_−x_and_L_−y_for_x

_,

_y_from₍₆₎_{, where}_L

ˆ

_−x_{is the orthogonal complement of}_x_in_L

ˆ

m

1.

In practice, the canonical angles are found by the singular value decomposition of a matrix. LetU

ˆ

mandUmbe orthonormal bases for_L

ˆ

m

1

,

Lm1 and

γ

i’s be the singular values ofU

ˆ

m′Um. Then the canonical angles are

θ

i

(

L

ˆ

m1

,

L m 1

)

=

cos −1

_(γ

i

)

in descending order.

Distances between two subspaces can be defined using the canonical angles. We point out two metrics from Chapter II.4 of [23]:

1. gap metric

ρ

g

(

L

ˆ

m1

,

Lm1

)

=

sin

(θ

1

)

,

2. Euclidean sine metric

ρ

s

(

L

ˆ

m1

,

Lm1

)

= {



m i=1sin 2

_(θ

i

)

}

1 2_.

The gap metric is simple and only involves the largest canonical angle. The Euclidean sine metric makes use of all canonical angles and thus gives a more comprehensive understanding of the closeness between the two subspaces. We will use both metrics in the following discussion.

(9)

At first, we examine the limiting distribution of the angle between the sample PC directionu

ˆ

iandLm1.

Theorem 3. Under the assumptions

(

c1

)

and

(

c2

)

with fixed n

≥

m

≥

1, if

α

=

1, then for i

=

1

, . . . ,

m, the canonical angle converges in distribution: cos



Angle

(

_L

ˆ

_i

_,

_Lm 1

)



H⇒

1



1

+

τ

2

_/ϕ

i

(

W′W

)

as d

→ ∞

.

Proof. Sinceu

ˆ

P i

=



m j=1

(

u

ˆ

′ iuj

)

uj,

ˆ

u′ iu

ˆ

Pi



u

ˆ

P_i



u

ˆ

_i



=



u

ˆ

P_i



=



(

u

ˆ

′ iu1

)

2

+ · · · +

(

u

ˆ

′ium

)

2

.

The result follows from(5),Theorem 2(ii) and the fact that



m

j=1



v

ij

(

W′W

)

2

=



v

_i

(

W′W

)



2

=

1.

We then investigate the limiting behavior of the distances betweenL

ˆ

m₁ andLm₁, in terms of either the canonical angles or the distances. From the fact thatU

ˆ

m

= [ ˆ

u1

, . . . ,

u

ˆ

m

]

andUm

= [

u1

, . . . ,

um

]

are orthonormal bases ofL

ˆ

m1 andLm1

respectively, cosines of the canonical angles are the singular values ofU

ˆ

′

mUm. Since the

(

i

,

j

)

th element ofU

ˆ

m′Umisu

ˆ

′iuj,

Theorem 2leads to

ˆ

U_m′Um

H⇒



v

1

(

W ′ W

)

· · ·

v

_m

(

W′W

)











1

+

τ

2

ϕ

1

(

W′W

)



− 1 2 0

...

0



1

+

τ

2

ϕ

m

(

W′W

)



−12







,

asd

→ ∞

. Therefore the canonical angles (

θ

1

, . . . , θ

m) betweenL

ˆ

m1 andL

m

1 converge to the arccosines of

(



1

+

τ

2

/ϕ

_m

(

W′W

)



−12

, . . . ,



1

+

τ

2

/ϕ

₁

(

W′W

)



−12

),

₍₇₎

asd

→ ∞

. Notice that these canonical angles between subspaces converge to the same limit as inTheorem 3, except that the order is reversed. In particular, the limiting distribution of the largest canonical angle

θ

1is the same as that of the angle

betweenu

ˆ

mandLm1, and the smallest canonical angle

θ

mcorresponds to the angle between the first sample PC directionu

ˆ

1

and the population PC spaceLm1.

The limiting distributions of the distances between two subspaces are readily derived by the discussions so far. When using the gap metric,

ρ

g

(

L

ˆ

m1

,

Lm1

)

H⇒



1

+

ϕ

_m

(

W′W

)/τ

2



− 1

2 _as_d

_{→ ∞}

.

And by using the Euclidean sine metric,

ρ

s

(

L

ˆ

m1

,

L m 1

)

H⇒



_m



i=1 1 1

+

ϕ

_i

(

W′_W

)/τ

2



1₂ asd

→ ∞

.

(8)

Remark 3. The convergence of the canonical angles for the case

α >

1 has been shown earlier. [14] introduced a notion of

subspace consistency, where the directionu

ˆ

imay not be consistent touibut will tend to lie inLm1, i.e. Angle

(

L

ˆ

i

,

Lm1

)

→

0 as

d

→ ∞

, fori

≤

m. In this case, the canonical angles betweenL

ˆ

m_i andLm₁ and the distances will converge to 0 asdgrows. In that sense, the empirical PC space_L

ˆ

m

i is consistent toLm1. On the other hand, when

α <

1, all directionsu

ˆ

itend to behave as if they were from the eigen-decomposition of the identity matrix. Therefore, all angles tend to be

π/

2 and the distances will converge to their largest possible values, leading to the strong inconsistency.

We now focus back on the

α

=

1 case, and illustrate the limiting distributions of the canonical angles and the Euclidean sine distance, to see the effect of parameters in the distribution. For simplicity and clear presentation, the results corresponding tom

=

2 are presented under the Gaussian assumption. Note that the limiting distributions depend on the marginal distributions of the first few principal component scores. Therefore no common distribution is evaluated in the limit.

Let

σ

2

₍

₌

_σ

2

1

+

σ

22

)

denote the (scaled) total variance of the first two principal components. The ratioσ

2

τ2 is understood as a signal to noise ratio, similar to the single spike case. Since the ratio of

σ

₁2and

σ

₂2affects the limiting distributions, we use

(λ

1

, λ

2

)

with

λ

1

+

λ

2

=

1 so that

σ

2

(λ

1

, λ

2

)

=

(σ

12

, σ

22

)

. Note that forW

′_W

_∼

_W

2

(

n

,

diag

(σ

12

, σ

22

))

,

ϕ(

W

′_W

₎

_{has the} same law as

σ

2

ϕ(

W2

(

n

,

diag

(λ

1

, λ

2

)))

.

(10)

Fig. 2. Overlay of contours of densities of canonical angles for them=2 case, corresponding to different(σ2_/τ2_{, λ}

1/λ2). Larger signal to noise ratios σ2_/τ2_{lead to the diagonal shift of the density function, and both canonical angles will have smaller values. For a fixed}_σ2_/τ2_{, the ratio}_λ

1/λ2between the

first and second PC variances is a driver for different distributions, depicted as the vertical shift of the density function.

The joint limiting distribution of the two canonical angles in (7), also in Theorem 3, is illustrated inFig. 2, with various values of (

σ

2

_/τ

2

_{, λ}

1

/λ

2) and fixedn. Note that for larged, the first canonical angle

θ

1

≈

Angle

(

u

ˆ

2,d

,

Lm1

)

and

θ

2

≈

Angle

(

u

ˆ

1,d

,

Lm1

)

, and that

θ

1

≥

θ

2.

•

The diagonal shift of the joint densities inFig. 2is driven by different

σ

2_{s with other parameters fixed. Both}

_θ

1and

θ

2are

smaller for larger signal to noise ratios.

•

For fixed

σ

2

/τ

2, several values of the ratio between the first and second variances (

λ

1

/λ

2) are considered, and the overlay

of densities according to different

λ

1

/λ

2is illustrated as the vertical shift inFig. 2. When the variation along the first PC

direction is much stronger than that along the second, i.e. when

λ

1

/λ

2is large,

θ

2becomes smaller but

θ

1tends to be

much larger. In other words,u

ˆ

1is a reasonable estimate ofu1, butu

ˆ

2becomes a poor estimate ofu2.

See(A.4)in theAppendixfor the probability density function of the canonical angles.

The limiting distribution of the Euclidean sine distance betweenL

ˆ

m₁ andLm₁ is also depicted inFig. 3, again with various values of (

σ

2

_{, λ}

1

, λ

2). It can be checked from the top panel ofFig. 3that the distance to the optimal subspace is smaller when

the signal to noise ratio is larger. The bottom panel illustrates the densities corresponding to different ratios of

λ

1

/λ

2. The

effect of

λ

1

/λ

2is relatively small compared to the effect of different

σ

2s, unless

λ

2is too small.

4. Geometric representation of the HDLSS data

[11] first showed that the HDLSS data has an interestinggeometric representationin the limitd

→ ∞

. In particular, for larged, the data tend to appear at vertices of a regular simplex and the variability is contained only in the random rotation of the simplex. In the spike model we consider, this geometric representation of the HDLSS data holds when

α <

1, as shown earlier in [14]. The representation in mathematical terms is

∥

Xi

∥ =

τ

√

d

+

op

(

√

d

),



X_i

−

X_j



=

τ

√

2d

+

op

(

√

d

),

(9)

forddimensionalXi

,

i

=

1

, . . . ,

n. This simplified representation has been used to show some high dimensional limit theory for discriminant analysis; see [2,22,12].

Similar types of representation can be derived in the

α

≥

1 case. When

α >

1, where consistency of PC directions happens, we have

∥

Xi

∥

/

dα/2

H⇒ ∥

Yi

∥

,



X_i

−

X_j



/

dα/2

H⇒



Y_i

−

Y_j



(10)

whereYi

=

(σ

1Z1i

, . . . σ

mZmi

)

′s arem-dimensional independent random vectors with mean zero and covariance matrix diag

(σ

₁2

, . . . , σ

2

m

)

. To understandYi, letXiP be the projection ofXi onto the true PC space Lm1. It can be checked from

d−1/2_XP

i

=



m

j=1

(σ

jZji

)

ujthatYiis the vector of loadings of the scaledXiPin the firstmprincipal component coordinates. When

α

=

1, a deterministic term is added:

∥

Xi

∥

2

/

d1

H⇒ ∥

Yi

∥

2

+

τ

2

,



X_i

−

X_j



2

/

d

H⇒



Y_i

−

Y_j



2

+

2

τ

2

.

(11)

Fig. 3. Overlay of densities of the distance between the sample and population PC spaces, measured byρsfor them=2 case. The top panel shows a transition of the density function corresponding to different signal to noise ratios. The bottom panel illustrates the effect of the ratioλ1/λ2between the

first two eigenvalues. For a larger signal to noise ratioσ2_/τ2_{, and for a smaller value of}_λ

1/λ2, the Euclidean sine distance is smaller.

These results can be understood geometrically, as summarized and discussed in the following.

α >

1: The variability of samples is restricted to the true PC spaceLm

1 for larged, which coincides with the notion of

subspace consistency discussed inRemark 3. Thed-dimensional probability distribution degenerates to them -dimensional subspaceLm₁.

α

=

1: (11)is understood with the help of the Pythagorean theorem, that is, the norm ofXiis asymptotically decomposed into orthogonal random and deterministic parts. Thus, data tend to be

τ

√

daway fromLm1, andXis projected on

Lm⊥

1

=

span

{

um+1

, . . . ,

ud

}

, the orthogonal complement ofLm1, will follow the representation similar to the

α <

1

case.

α <

1: The geometric representation(9)holds. Note that the case

α

=

1 smoothly bridges the others.

An example elucidating these ideas is shown inFig. 4. Each panel shows 3d scatterplots of 10 different samples (shown as different symbols) ofn

=

3 Gaussian random vectors in dimensionsd

=

3

,

30

,

3000 (shown in respective columns ofFig. 4). In the spiked model, we take

σ

=

τ

=

1 andm

=

1 for simplicity and investigate three different orders of magnitude

α

=

1

2

,

1

,

2 of the first eigenvalue

λ

1

=

dα. For each pair of

(

d

, α)

, each sampleXiis projected onto the first true

PC directionu1, shown as the vertical axis. In the orthogonald

−

1 dimensional subspace, the 2-dimensional hyperplane

that is generated by the data is found, and the data are projected onto that. Within the hyperplane, variation due to rotation is removed by optimally rotating the data onto edges of a regular triangle by a Procrustes method, to give the horizontal axes in each part ofFig. 4. These axes are scaled by dividing by max

(

dα

,

d

)

and the 10 samples give an impression of various types of convergence as a function ofd, for each

α

.

The asymptotic geometric representations summarized above are confirmed by the figure. For

α

=

1

2, it is expected

from(9)that the data are close to the vertices of the regular triangle, with edge length

√

2d. The vertices of the triangle (in the horizontal plane) with vertical rays (representingu1, the first PC direction) are shown as the dashed lines in the

first two rows ofFig. 4. Note that ford

=

3, in the top row, the points appear to be quite random, but ford

=

30, there already is reasonable convergence to the vertices with notable variation alongu1. The cased

=

3000 shows more rigid

representation with much less variation alongu1. On the other hand, for the case

α

=

2 in the last row ofFig. 4, most of the

variations in the data are found alongu1, shown as the vertical dotted line, and the variation perpendicular tou1becomes

negligible asdgrows, which confirms the degeneracy toL1

1in(10). From these examples, conditions for consistency and

strong inconsistency can be checked heuristically. The sample eigenvectoru

ˆ

1is consistent withu1when

α >

1, since the

variation alongu1is so strong thatu

ˆ

1should be close to that.u

ˆ

1is inconsistent withu1when

α <

1, since the variation

alongu1becomes negligible so thatu

ˆ

1will not be nearu1.

For the

α

=

1 case, in the middle row ofFig. 4, it is expected from(11)that each data point will be asymptotically decomposed into a random and a deterministic part. This is confirmed by the scatterplots, where the order of variance along

(12)

Fig. 4. Gaussian toy example, showing the geometric representations of HDLSS data, withn=3, for three different choices ofα=1/2,1,2 of the spiked model and increasing dimensionsd=3,30,3000. Forα̸=1, data converge to vertices of a regular 3-simplex (caseα <1) or to the first PC direction (caseα >1). Whenα=1, data are decomposed into the deterministic part on the horizontal axes and the random part along the vertical axis.

u1remains comparable to that of horizontal components, asdgrows. The convergence to the vertices is noticeable even

ford

=

30, which becomes stronger for largerd, while the randomness alongu1remains for larged. Also observe that the

distance from eachXito the space spanned byu1becomes deterministic for larged, supporting the first part of(11).

Appendix. Derivation of the density functions

The probability density functions of the limiting distributions in(7)and(8)will be derived for the casem

=

2, under the normal assumption. The argument is readily generalized to allm, but when non-normal distribution is assumed, such derivation is much challenging.

We first recall some necessary notions for treating the Wishart matrixW′_W_{and eigen-decompositions. Most of the} results are adopted from [17]. Let A

∼

Wm

(

n

,

Σm

)

and denote its eigen-decomposition as A

=

HLH′ with L

=

diag

(

l1

, . . . ,

lm

)

. AssumeΣmis positive definite andn

≥

mso thatl1

>

l2

>

· · ·

>

lm

>

0 with probability 1. Denote

O

(

m

)

= {

Hm×m

:

H′H

=

Im

}

for the set of orthonormalm

×

mmatrices and

(

dH

)

forH

∈

O

(

m

)

as the differential form representing the uniform probability measure onO

(

m

)

. The multivariate gamma function is defined as

Γm

(

a

)

=

π

m(m−1)/4 m



i=1 Γ



a

−

1 2

(

i

−

1

)



,

whereΓ

(

· )

is the usual gamma function. ForH

≡ [

h1

, . . . ,

hm

] ∈

O

(

m

)

,

(

dH

)

≡

1 Vol

(

O

(

m

))

(

H ′ dH

)

=

Γm



m 2



2m

_π

m2_/₂

(

H ′ dH

),

(A.1) where

(

H′dH

)

≡



m i>jh ′ idhj.

We are now ready to state the density function of

ϕ(

W′_W

₎

form

=

2. Note that under the Gaussian assumptionW′_W is the 2

×

2 Wishart matrix with degree of freedomnand covariance matrixΣW

=

diag

(σ

12

, σ

22

)

. For simplicity, write

(

L1

,

L2

)

=

ϕ(

W′W

)

, andL

=

diag

(

L1

,

L2

)

. Then the joint density function ofL1andL2is given by e.g. Theorem 3.2.18 of [17]

withm

=

2, and fL

(

l1

,

l2

)

=

π

2−n

_(σ

2 1

σ

22

)

−n₂ Γ2



n 2



(

l1l2

)

n−3 2

(

_l₁

−

l2

)



O(2) exp



trace



−

1 2Σ −1 W HLH ′



(

dH

).

(A.2)