Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations

(1)

Contents lists available atSciVerse ScienceDirect

Journal of Multivariate Analysis

journal homepage:www.elsevier.com/locate/jmva

Effective PCA for high-dimension, low-sample-size data with noise

reduction via geometric representations

Kazuyoshi Yata, Makoto Aoshima

∗

Institute of Mathematics, University of Tsukuba, Ibaraki 305-8571, Japan

a r t i c l e i n f o Article history:

Received 8 April 2010

Available online 16 September 2011

AMS subject classifications:

primary 62H25 62H30 secondary 34L20 Keywords: Consistency Discriminant analysis Eigenvalue distribution Geometric representation HDLSS Inverse matrix Noise reduction

Principal component analysis

a b s t r a c t

In this article, we propose a new estimation methodology to deal with PCA for high-dimension, low-sample-size (HDLSS) data. We first show that HDLSS datasets have different geometric representations depending on whether aρ-mixing-type dependency appears in variables or not. When theρ-mixing-type dependency appears in variables, the HDLSS data converge to ann-dimensional surface of unit sphere with increasing dimension. We pay special attention to this phenomenon. We propose a method called the noise-reduction methodologyto estimate eigenvalues of a HDLSS dataset. We show that the eigenvalue estimator holds consistency properties along with its limiting distribution in HDLSS context. We consider consistency properties of PC directions. We apply the noise-reduction methodology to estimating PC scores. We also give an application in the discriminant analysis for HDLSS datasets by using the inverse covariance matrix estimator induced by the noise-reduction methodology.

1. Introduction

The high-dimension, low-sample-size (HDLSS) data situation occurs in many areas of modern science such as genetic microarrays, medical imaging, text recognition, finance, chemometrics, and so on. The asymptotic studies of this type of data are becoming increasingly relevant. The asymptotic behavior of eigenvalues of the sample covariance matrix in the limit asd

→ ∞

was studied by Johnstone [6], Baik et al. [2] and Paul [10] under Gaussian assumptions, and Baik and Silverstein [3] under non-Gaussian but i.i.d. assumptions when the dimensiondand the sample sizenincrease at the same rate, i.e.n

/

d

→

c

>

0. In recent years, substantial work has been done on the HDLSS asymptotic theory, where onlyd

→ ∞

while nis fixed, by Hall et al. [5], Ahn et al. [1], Jung and Marron [7], and Yata and Aoshima [14–16]. Hall et al. [5] and Ahn et al. [1] explored conditions to give a geometric representation of HDLSS data. Jung and Marron [7] investigated consistency properties of both eigenvalues and eigenvectors of the sample covariance matrix in the HDLSS data situations. The HDLSS asymptotic theory had been created under the assumption that either the population distribution is normal or the random variables in the sphered data matrix have the

ρ

-mixing dependency (see [4]). However, Yata and Aoshima [14–16] developed the HDLSS asymptotic theory without assuming either the normality or the

ρ

-mixing condition. Yata and Aoshima [14] gave consistency properties of both eigenvalues and eigenvectors of the sample covariance matrix together with PC scores. Yata and Aoshima [16] proposed a method for dimensionality estimation of HDLSS data, and Yata and Aoshima [15] generalized the method to create a new PCA called the cross-data-matrix methodology.

In this paper, suppose we have a d

×

n data matrix X₍d)

=

[x

1(d)

, . . . ,

xn(d)

]

with d

>

n, where xj(d)

=

(

x1j(d)

, . . . ,

xdj(d)

)

T

,

j

=

1

, . . . ,

n, are independent and identically distributed (i.i.d.) as ad-dimensional distribution with

∗_{Corresponding author.}

(2)

mean zero and nonnegative definite covariance matrix6d. The eigen-decomposition of6dis6d

=

Hd3dHdT, where3dis a

diagonal matrix of eigenvalues

λ1

₍d)

≥ · · · ≥

λ

d(d)

(>

0

)

andHd

= [h

1(d)

, . . . ,

hd(d)

]

is a matrix of corresponding eigenvectors.

Then,Z₍d)

=

3

−1/2 d H

T

dX(d)is ad

×

nsphered data matrix from a distribution with the identity covariance matrix. Here, we

writeZ₍d)

= [z

1(d)

, . . . ,

zd(d)

]

Tandzj(d)

=

(

zj1(d)

, . . . ,

zjn(d)

)

T

,

j

=

1

, . . . ,

d. Hereafter, the subscriptdwill be omitted for

the sake of simplicity when it does not cause any confusion. We assume that the fourth moments of each variable inZare uniformly bounded. We assume that

∥z

_j

∥ ̸=

0 forj

=

1

, . . . ,

d, where

∥ · ∥

denotes the Euclidean norm. We consider a general setting as follows:

λ

i

=

aidαi

(

i

=

1

, . . . ,

m

)

and

λ

j

=

cj

(

j

=

m

+

1

, . . . ,

d

).

(1)

Here,ai

(>

0

),

cj

(>

0

)

and

α

i

(α1

≥ · · · ≥

α

m

>

0

)

are unknown constants preserving the order that

λ1

≥ · · · ≥

λ

d, andmis

an unknown non-negative integer. We assumen

>

m.

In Section 2, we show that HDLSS datasets have different geometric representations depending on whether a

ρ

-mixing-type dependency appears in variables or not. When the

ρ

-mixing-type dependency appears in variables, the HDLSS data converge to ann-dimensional surface of unit sphere with increasing dimension. We pay special attention to this phenomenon.AfterSection3,we assume that zjk

,

j

=

1

, . . . ,

d

(

k

=

1

, . . . ,

n

)

are independent.Note that the assumption

includes the case thatXis Gaussian. In Section3, we propose a method calledthe noise-reduction methodologyto estimate eigenvalues of a HDLSS dataset. We show that the eigenvalue estimator holds consistency properties along with its limiting distribution in HDLSS context. In Section4, we consider consistency properties of PC directions. In Section5, we apply the noise-reduction methodology to estimating PC scores. In Section6, we show performances of the noise-reduction methodology by conducting simulation experiments. In Section 7, we provide an inverse covariance matrix estimator induced by the noise-reduction methodology. Finally, in Section8, we give an application in the discriminant analysis for HDLSS datasets by using the inverse covariance matrix estimator.

2. Geometric representations

In this section, we consider several geometric representations. The sample covariance matrix isS

=

n−1XXT. We consider then

×

ndual sample covariance matrix defined byS_D

=

n−1XT_X_{. Let}

_λ1

ˆ

_{≥ · · · ≥ ˆ}

_λ

n

≥

0 be the eigenvalues of

SD. Let us write the eigen-decomposition ofSDasSD

=



n

j=1

λ

ˆ

ju

ˆ

ju

ˆ

Tj. Note thatSDandSshare non-zero eigenvalues and

E

{

(

n

/



d

i=1

λ

i

)

SD

} =

In. Ahn et al. [1] and Jung and Marron [7] claimed that when the eigenvalues of6are sufficiently

diffused in the sense that

d



i=1

λ

2 i



_d



i=1

λ

i



2

→

0 asd

→ ∞

,

(2)

the sample eigenvalues behave as if they are from a scaled identity covariance matrix. WhenXis Gaussian or the components ofZare

ρ

-mixing, it follows that

n d



i=1

λ

i SD

→

In (3)

in probability asd

→ ∞

for a fixednunder(2).

Remark 1. The concept of

ρ

-mixing was first developed by Kolmogorov and Rozanov [8]. See [4] for a clear and insightful discussion. See also [7]. For

−∞ ≤

J

≤

K

≤ ∞

, letFJKdenote that the

σ

-field of events generated by the random variables

(Yi

,

J

≤

i

≤

K). For any

σ

-fieldA, letL2(A

)

denote the space of square-integrable,Ameasurable (real-valued) random

variables. For eachr

≥

1, define the maximal correlation coefficient

ρ(

r

)

=

sup

|

corr

(

f

,

g

)

|

,

f

∈

L2(F−∞j

),

g

∈

L2(F ∞

j+r

),

where sup is over allf

,

gandjis a positive integer. The sequence

{

Yi

}

is said to be

ρ

-mixing if

ρ(

r

)

→

0 asr

→ ∞

. Note

that when

(

z1k

,

z2k

, . . .)

is

ρ

-mixing, it holds that forj

,

j′

=

1

,

2

, . . .

with

|

j

−

j′

| =

r,

|

E

((

z_jk2

−

1

)(

z_j2′_k

−

1

))

| ≤

ρ(

r

)

→

0 asr

→ ∞

.

Remark 2. LetRn

= {e

n

∈

Rn

: ∥e

n

∥ =

1

}

. Letwj

=

(

n

/



di=1

λ

i

)

SDu

ˆ

j

=

(

n

/



di=1

λ

i

)

λ

ˆ

ju

ˆ

j. WhenXis Gaussian or the

components ofZare

ρ

-mixing, it holds from(3)that

wj

∈

Rn

,

j

=

1

, . . . ,

n (4)

(3)

WhenXis non-Gaussian without

ρ

-mixing, Yata and Aoshima [16] claimed that n d



i=1

λ

i SD

→

Dn (5)

in probability asd

→ ∞

for a fixednunder(2), whereDnis a diagonal matrix with any diagonal element havingOp

(

1

)

.

Now, let us further consider the geometric representations given by(3)and(5). Letzk∗

=

(

z₁2_k

−

1

, . . . ,

z_dk2

−

1

)

T

,

k

=

1

, . . . ,

n. We denote the covariance matrix ofzk∗by8. Note that whenXis Gaussian (orzjk

,

j

=

1

, . . . ,

d

(

k

=

1

, . . . ,

n

)

are

independent),8is a diagonal matrix. Let8

=

(φ

_ij

)

andr

= |

i

−

j

|

. Note that when the components ofZare

ρ

-mixing, it holds that

φ

ij

→

0 asr

→ ∞

. WhenXis non-Gaussian without

ρ

-mixing, we may claim that

φ

ij

̸=

0 fori

̸=

j. However, it

should be noted that the geometric representation given by(3)is still claimed even in a case whenXis non-Gaussian without

ρ

-mixing. Let us writeDk

=





d j=1

λ

j



−1



d j=1

λ

jz2jkas a diagonal element of

(

n

/



d j=1

λ

j

)

SD. Note thatDk

= ∥x

k

∥

2

/

tr

(6)

andE

(

Dk

)

=

1. LetV

(

x

)

denote the variance of a random variablex. We have for the variance of eachDkthat

V

(

Dk

)

=

E







d



j=1

λ

j

(

z2jk

−

1

)



2







d



j=1

λ

j



2

=



i,j

λ

i

λ

j

φ

ij



d



j=1

λ

j



2

.

Hence, we consider a regular condition of

ρ

-mixing-type dependency given by



i,j

λ

i

λ

j

φ

ij



d



j=1

λ

j



2

→

0 asd

→ ∞

.

(6)

Note that it holds(6)under(2)whenXis Gaussian or the components ofZ are

ρ

-mixing. Then, we obtain the following theorem.

Theorem 1.When the components of Zsatisfy the condition given by(6), we have(3)as d

→ ∞

for a fixed n. Otherwise, we

have(5)as d

→ ∞

for a fixed n under(2).

Remark 3. We consider the case thatzjk

,

j

=

1

, . . . ,

d

(

k

=

1

, . . . ,

n

)

are distributed as continuous distributions. Let

f

(

Dk

)

be the p.d.f. ofDk. Assume thatZ does not satisfy(6). Assume further thatf

(

Dk

) <

∞

w.p.1 asd

→ ∞

. Let

Rn∗

= {e

(1)

=

(

1

,

0

, . . . ,

0

)

T

,

e(2)

=

(

0

,

1

, . . . ,

0

)

T

, . . . ,

e(n)

=

(

0

,

0

, . . . ,

1

)

T

}

. Then, we have that

ˆ

uj

∈

Rn∗

,

j

=

1

, . . . ,

n

.

(7)

in probability asd

→ ∞

for a fixednunder(2).

Let us observe geometric representations induced by(3)with(4)and(5)with(7). Now, we consider an easy example such as

λ1

= · · · =

λ

_d

=

1 andn

=

2. Note that it is satisfying(2).Fig. 1(a), (b), (c) and (d) give scatter plots of 20 independent pairs of

±w

j

(

j

=

1

,

2

)

generated from the normal distribution,Nd

(

0

,

Id

)

, with mean zero and covariance matrixIdind

(

=

2

,

20

,

200, and 2000)-dimensional Euclidean space, respectively.

Fig. 1shows the geometric representation induced by(3)with(4). Whend

=

2, the plots ofw1appeared quite random

and the plots ofw2appeared around0. However, whend

=

200, the approximation in(3)with(4)became quite good. It

reflected that the plots ofwi

(

i

=

1

,

2

)

appeared around the surface of ann-dimensional unit sphere. As expected, when

d

=

2000, it showed an even more rigid geometric representation.

Fig. 2(a)–(d) give scatter plots of 20 independent pairs of

±w

j

(

j

=

1

,

2

)

generated from thed-variatet-distribution,

t_d

(

0

,

I_d

, ν)

with mean zero, covariance matrixI_d and degree of freedom (d.f.)

ν

=

5 in d

(

=

2

,

20

,

200

,

and 2000

)

-dimensional Euclidean space.

Fig. 2shows the geometric representation induced by(5)with(7). Whend

=

2, the plots ofwi

(

i

=

1

,

2

)

appeared quite

random. Whend

=

200, the approximation in(5)with(7)became moderate. Whend

=

2000, the approximation became quite good. It reflected that the plots ofw_i

(

i

=

1

,

2

)

appeared in close to axes.

Here, we consider the case thatd

→ ∞

andn

→ ∞

. Lete_nbe an arbitrary element ofR_nthat is defined inRemark 2. Then, we have the following theorem.

(4)

(a)d=2. (b)d=20.

(c)d=200. (d)d=2000.

Fig. 1. Gaussian toy example forn=2, illustrating the geometric representation ofw1(plotted as⃝) andw2(plotted as△), and the convergence to an n-dimensional surface of unit sphere with increasing dimension: (a)d=2, (b)d=20, (c)d=200, and (d)d=2000.

Theorem 2. We assume that

n



i,j

λ

i

λ

j

φ

ij



d



j=1

λ

j



2

→

0 and n 2 p



i=1

λ

2 i



d



j=1

λ

j



2

→

0 (8)

when d

→ ∞

and n

→ ∞

. Then, it holds that

n d



i=1

λ

i eT_nSDen

=

1

+

op

(

1

).

(9)

FromTheorem 2,

_λ

ˆ

_j_{’s are mutually equivalent under}₍₈₎_{in the sense that}_n





d

i=1

λ

i



−1

ˆ

λ

j

=

1

+

op

(

1

)

for allj

=

1

, . . . ,

n.

Note thatn

/

d

→

0 under(8).Theorem 2claims that a geometric representation appeared inFig. 1still remains even when

n

→ ∞

in the HDLSS context.

In this article, we pay special attention to the geometric representation given by(3)or(9), that is appeared inFig. 1.After

Section3,we assume that zjk

,

j

=

1

, . . . ,

d

(

k

=

1

, . . . ,

n

)

are independent.This assumption is milder than that the population

distribution is Gaussian. We propose a new estimation method calledthe noise-reduction methodologyto deal with PCA in HDLSS data situations. WhenXmay have the geometric representation given by(5), Yata and Aoshima [15] proposed a different method called the cross-data-matrix methodology. We compare those two methodologies by simulations in Section6.

(5)

(a)d=2. (b)d=20.

(c)d=200. (d)d=2000.

Fig. 2. Non-Gaussian toy example forn=2, illustrating the geometric representation ofw1(plotted as⃝) andw2(plotted as△), and the concentration

on axes with increasing dimension: (a)d=2, (b)d=20, (c)d=200, and (d)d=2000.

3. Noise-reduction methodology

Hereafter we assume that zjk

,

j

=

1

, . . . ,

d

(

k

=

1

, . . . ,

n

)

are independent.We denotenbyn

(

d

)

only whenn

=

dγ,

where

γ

is a positive constant. Yata and Aoshima [14] gave consistency properties of the sample eigenvalues. Their result is summarized as follows: It holds forj

(

=

1

, . . . ,

m

)

that

ˆ

λ

j

λ

j

=

1

+

op

(

1

)

(10)

under the conditions:

(YA-i) d

→ ∞

andn

→ ∞

forjsuch that

α

j

>

1;

(YA-ii)d

→ ∞

andd1−αj

/

_n

(

_d

)

_→

_{0 for}_j_{such that}

α

j

∈

(

0

,

1

]

.

The condition described by bothd

→ ∞

andn

→ ∞

is a mild condition fornin the sense that one can choosenfree from

d. The above result given by Yata and Aoshima [14] draws our attention to the limitations of the capabilities of naive PCA in HDLSS data situations. Let us see a case, say, thatd

=

10 000

, λ1

=

d1/2_and

_λ2

_{= · · · =}

_λ

d

=

1. Then, we observe from

(YA-ii) that one requires the sample of sizen

≫

d1−α1

₌

_d1/2

₌

_{100. It is somewhat inconvenient for the experimenter to}

handle HDLSS data situations. We have thatSD

=

n−1



d

j=1

λ

jzjzjT. Let us write thatU1

=

n−1



m

j=1

λ

jzjzjT andU2

=

n−1



d

j=m+1

λ

jzjzjT so that

(6)

d



j=m+1

λ

2 j



d



j=m+1

λ

j



2

→

0 asd

→ ∞

,

(11)

the noise part holds the geometric representation similar to(3)or(9). Leten

=

(

e1, . . . ,en

)

Tbe an arbitrary element ofRn

that is defined inRemark 2. Then, from(3)andTheorem 2, we have that

n d



j=m+1

λ

j eT_nU2en

=

1

+

op

(

1

)

(12)

asd

→ ∞

either whennis fixed orn

=

n

(

d

)

satisfyingn

(

d

)



d j=m+1

λ

2 j

/





d j=m+1

λ

j



2

→

0.This geometric representation

for the noise part influences the estimation scheme proposed in this article.

We consider an easy example such asm

=

2 and

λ1

=

dα1

, λ2

₌

_dα2

, λ

j

=

cj

,

j

=

3

, . . . ,

d, where

α1

> α2

>

1

/

2

andcj’s are positive constants. Note that it is satisfying(11). Then, we write that

λ

−11SD

=

n−1z1z1T

+

(

n

λ1)

−1

_λ2

_z 2z2T

+

(

n

λ1)

−1



d

j=3

λ

jzjzjT. Let us consider the behavior ofeTn

(

U2

−





d

j=m+1

λ

j

/

n



In

)

enin(12). By using Chebyshev’s inequality

for any

τ >

0 and the uniform boundM

(>

0

)

for the fourth moments condition, one has for all diagonal elements of

λ

−1 1

(

U2

−





d j=m+1

λ

j

/

n



I_n

)

that n



i=1 P





(

n

λ1)

−1 d



j=m+1

λ

j

(

zji2

−

1

)



> τ



≤

M

τ

−2n−1

λ

2_m+1d1 −2α1

₌

_o

(

₁

)

₍₁₃₎

asd

→ ∞

either whenn

→ ∞

ornis fixed. Since we have that

(

n

λ1)

−1



d

j=m+1

λ

j

(

zji2

−

1

)

=

op

(

1

)

for alli

=

1

, . . . ,

n, it

holds that all diagonal elements of

λ

−₁1

(

U2

−





d

j=m+1

λ

j

/

n



In

)

converge to 0 in probability. By using Markov’s inequality

for any

τ >

0, one has for all off-diagonal elements of

λ

−₁1

(

U2

−





d j=m+1

λ

j

/

n



In

)

that P







i̸=i′



(

n

λ1)

−1 d



j=m+1

λ

jzjizji′



2

> τ





≤

τ

−1

_λ

2 m+1d 1−2α1

₌

_o

(

₁

).

Thus we have that



i̸=i′

((

n

λ1)

−1



d j=m+1

λ

jzjizji′

)

2

=

o_p

(

1

)

so that





i̸=i′ eiei′ d



j=m+1

(

n

λ1)

−1

λ

jzjizji′



≤







i̸=i′



(

n

λ1)

−1 d



j=m+1

λ

jzjizji′



2





1/2

=

op

(

1

).

(14)

Then, we obtain that

λ

−₁1eT n

(

U2

−





d

j=m+1

λ

j

/

n



In

)

en

=

op

(

1

)

asd

→ ∞

either whenn

→ ∞

ornis fixed. Note that

λ

−1

1

λ2

→

0 asd

→ ∞

and

∥

n

−1/2_z

1

∥ =

1

+

op

(

1

)

asn

→ ∞

. Hence, by noting that maxen

(

e T nSDen

)

= ˆ

uT1SDu

ˆ

1, it holds that

λ

−1 1 u

ˆ

T 1



S_D

−

n−1 d



j=m+1

λ

jIn



ˆ

u₁

=

u

ˆ

T 1U1u

ˆ

1

λ1

+

op

(

1

)

=

(

u

ˆ

T1z1/n 1/2

₎

2

₊

op

(

1

)

=

1

+

op

(

1

).

Hence, we claim asd

→ ∞

andn

→ ∞

that

λ

−1 1



ˆ

uT₁SDu

ˆ

1

−

n−1 d



j=m+1

λ

j



=

ˆ

λ1

−

n−1



d j=m+1

λ

j

λ1

=

1

+

op

(

1

).

From the proof ofCorollary 4andTheorem 6inAppendix, we can obtain thatn−1/2_zT

1u

ˆ

2

=

op

(

dα2−α1

)

asd

→ ∞

and n

→ ∞

. By noting that

∥

n−1/2_z 2

∥ =

1

+

op

(

1

)

asn

→ ∞

, we have that

λ

−1 2



ˆ

uT₂SDu

ˆ

2

−

n−1 d



j=m+1

λ

j



= ˆ

uT₂

λ1

z1z T 1

λ2

n u

ˆ

2

+ ˆ

u T 2 z2z2T n u

ˆ

2

+

op

(

1

)

=

1

+

op

(

1

).

(7)

Now, we consider estimating the noise part from the fact that asd

→ ∞

andn

→ ∞

λ

−1 j







tr

(

S_D

)

−

j



i=1

ˆ

λ

i n

−

j

−

n −1 d



i=m+1

λ

i







=

op

(

1

)

forj

=

1

,

2. (SeeLemma 7inAppendixfor the details.) Then, we have asd

→ ∞

andn

→ ∞

that

λ

−1 j







ˆ

λ

j

−

tr

(

S_D

)

−

j



i=1

ˆ

λ

i n

−

j







=

1

+

op

(

1

)

forj

=

1

,

2. Hence, we have a consistent estimator for

λ

j

=

dαjwith

α

j

>

1

/

2 that is a milder condition than (YA-ii).

In general, we propose the new estimation methodology as follows:

[Noise-reduction methodology]

˜

λ

j

= ˆ

λ

j

−

tr

(

SD

)

−

j



i=1

ˆ

λ

i n

−

j

(

j

=

1

, . . . ,

n

−

1

).

(15)

Note that

λ

˜

j

≥

0

(

j

=

1

, . . . ,

n

−

1

)

w.p.1 forn

≤

d. Then, we claim the following theorem. Theorem 3.For j

=

1

, . . . ,

m, we have that

˜

λ

j

λ

j

=

1

+

op

(

1

)

under the conditions:

(i) d

→ ∞

and n

→ ∞

for j such that

α

j

>

1

/

2;

(ii)d

→ ∞

and d1−2αj

/

_n

(

_d

)

_→

₀_{for j such that}

α

j

∈

(

0

,

1

/

2

]

. Theorem 4.Let V

(

z2

jk

)

=

Mj

(<

∞

)

for j

=

1

, . . . ,

m

(

k

=

1

, . . . ,

n

)

. Assume that

λ

j

(

j

≤

m

)

has multiplicity one. Under the

conditions:

(i) d

→ ∞

and n

→ ∞

for j such that

α

j

>

1

/

2;

(ii)d

→ ∞

and d2−4αj

/

_n

(

_d

)

_→

₀_{for j such that}

α

j

∈

(

0

,

1

/

2

]

, we have that



_n Mj



˜

λ

j

λ

j

−

1



⇒

N

(

0

,

1

),

where ‘‘

⇒

’’ denotes the convergence in distribution and N

(

0

,

1

)

denotes a random variable distributed as the standard normal

distribution.

Remark 4. Yata and Aoshima [14] gave the asymptotic normality of

λ

ˆ

j’s. Under the assumption that

λ

m+1

= · · · =

λ

d

=

1,

Lee et al. [9] considered an estimate of

λ

jsuch as

˙

λ

j

=

ˆ

λ

j

+

1

−

η

+



(

_λ

ˆ

j

+

1

−

η)

2

−

4

λ

ˆ

j 2

,

whered

/

n

→

η

≥

0 andn

→ ∞

. If 1

/

_λ

ˆ

_j

=

op

(

1

)

, we claim that

λ

˙

j

=

(

λ

ˆ

j

−

η)(

1

+

op

(

1

))

. By noting thatn−1



dj=m+1

λ

j

→

η

when

λ

m+1

= · · · =

λ

d

=

1, it holds that

λ

˙

j

=

(

λ

ˆ

j

−

n−1



dj=m+1

λ

j

)(

1

+

op

(

1

))

. Hence, we may consider

λ

˙

jas a noise

reduction. However, we emphasize thatthe noise-reduction methodology allows users to have a consistent estimator,

λ

˜

j,when

λ

m+1

≥ · · · ≥

λ

d

(>

0

)

and

λ

j

=

cj

(

j

=

m

+

1

, . . . ,

d

)

are unknown constants.

(8)

Remark 5. WhenXis Gaussian,

α1

>

1

/

2 and either when

α1

> α2

orm

=

1, we have asd

→ ∞

that

˜

λ1

⇒

χ

2 n n

for fixedn, where

χ

_n2denotes a random variable distributed as the

χ

2distribution with d.f.n. Jung and Marron [7] claimed a similar result for

λ

ˆ

jwith

α

j

>

1.

Remark 6. Whenzjk

,

j

=

1

, . . . ,

d

(

k

=

1

, . . . ,

n

)

are not independent but the components ofZare

ρ

-mixing, we can claim

the assertions similar toTheorems 3and4under the conditions: (i) d

→ ∞

andn

→ ∞

forjsuch that

α

j

>

1;

(ii) d

→ ∞

,

n

→ ∞

andd2−2αj

/

_n

(

_d

) <

_∞

_for_j_{such that}

α

j

∈

(

0

,

1

]

.

The conditions (i)–(ii) are milder than the ones given by Theorem 3.1 in Yata and Aoshima [14] for a non-

ρ

-mixing case.

Corollary 1. When the population mean may not be zero, let us write that SoD

=

(

n

−

1

)

−1

(

X

−

X

)

T

(

X

−

X

)

, where

X

= [¯

xn

, . . . ,

x

¯

n

]

is having d-vectorx

¯

n

=



n

s=1xs

/

n. We redefine

λ

˜

j

(

j

=

1

, . . . ,

m

)

in(15)by replacingSDand n with

SoDand n

−

1. Then, the assertions inTheorems3and4are still justified under the convergence conditions.

4. Consistency properties of PC directions

In this section, we consider PC direction vectors. Jung and Marron [7], and Yata and Aoshima [14] studied consistency properties of PC direction vectors in the context of naive PCA. LetH

ˆ

= [ ˆ

h1, . . . ,h

ˆ

d

]

such thatH

ˆ

TSH

ˆ

=

3

ˆ

and 3

ˆ

=

diag

(

λ1, . . . ,

ˆ

λ

ˆ

d

)

. Note thath

ˆ

jcan be calculated byh

ˆ

j

=

(

n

λ

ˆ

j

)

−1/2Xu

ˆ

j, whereu

ˆ

jis an eigenvector ofSD. Then, Yata and

Aoshima [14] gave consistency properties of the sample eigenvectors with their population counterparts: Assume that

λ

j

(

j

≤

m

)

has multiplicity one as

λ

j

̸=

λ

j′for allj′

(

̸=

j

)

. Then, the firstmsample eigenvectors are consistent in the sense that

Angle

(

h

ˆ

_j

,

h_j

)

−→

p 0

(

j

=

1

, . . . ,

m

)

under (YA-i)–(YA-ii). The following result can be obtained as a corollary of Theorem 4.1 in Yata and Aoshima [14].

Corollary 2. The first m sample eigenvectors are inconsistent in the sense that

Angle

(

h

ˆ

j

,

hj

)

p

−→

π/

2

(

j

=

1

, . . . ,

m

)

(16)

under the condition that d

→ ∞

and d

/(

n

(

d

)λ

j

)

→ ∞

.

Remark 7. Under the condition described above, we have thath

ˆ

T

jhj

=

op

(

1

) (

j

=

1

, . . . ,

m

)

. Jung and Marron [7] gave(16)

asd

→ ∞

for a fixedn.

Remark 8. When the population mean may not be zero, we still haveCorollary 2by usingSoDdefined inCorollary 1. 5. PC scores with noise-reduction methodology

In this section, we apply the noise-reduction methodology to principal component scores (Pcss). Thej-th Pcs ofxkis given

byhT jxk

=

zjk



λ

j(

=

sjk, say). However, sincehjis unknown, one may useh

ˆ

j

=

(

n

λ

ˆ

j

)

−1/2Xu

ˆ

jas a sample eigenvector. The

j-th Pcs ofxkis estimated byh

ˆ

Tjxk

= ˆ

ujk



n

λ

ˆ

j

(

=ˆ

sjk

,

say

)

, whereu

ˆ

Tj

=

(

u

ˆ

j1, . . . ,u

ˆ

jn

)

. Let us define a sample mean square

error of thej-th Pcs by MSE

(

ˆ

sj

)

=

n−1



n

k=1(

ˆ

sjk

−

sjk

)

2. Then, Yata and Aoshima [14] evaluated the sample Pcs as follows:

Assume that

λ

j

(

j

≤

m

)

has multiplicity one. Then, it holds that

MSE

(

s

ˆ

j

)

λ

j

=

op

(

1

)

(17)

under (YA-i)–(YA-ii).

Now, we modify

ˆ

sjkby using

λ

˜

jdefined by(15). Let us write thatu

ˆ

jk



n

λ

˜

j(

=˜

sjk, say). Then, we obtain the following result. Theorem 5. Assume that

λ

j

(

j

≤

m

)

has multiplicity one. Then, we have that

MSE

(

˜

sj

)

λ

j

=

op

(

1

)

(18)

(9)

Foru

ˆ

_j, we can claim the consistency for a Pcs vectorn−1/2z_j.

Corollary 3. Assume that

λ

j

(

j

≤

m

)

has multiplicity one. Then, the j-th sample eigenvector is consistent in the sense that

Angle

(

u

ˆ

j

,

n−1/2zj

)

p

−→

0 (19)

under the conditions

(

i

)

–

(

ii

)

of Theorem3.

Remark 9. Lee et al. [9] gave a result similar to(19). Under the assumption that

λ

m+1

= · · · =

λ

d

=

1 and the multiplicity

one assumption, they claimed asd

/

n

→

η

≥

0 andn

→ ∞

that

| ˆ

uT_jzj

/

n1/2

| =









1

−

η

(λ

j

−

1

)

2

+

op

(

1

)

when

λ

j

>

1

+

η,

op

(

1

)

when 1

< λ

j

≤

1

+

η.

Here, it holds that

| ˆ

uT

jzj

/

n1/2

| =

1

+

op

(

1

)

under

η/(λ

j

−

1

)

2

=

O

(

d1−2αj

/

n

)

→

0. Then, by noting that

∥z

j

/

n1/2

∥ =

1

+

op

(

1

)

asn

→ ∞

, we have(19)under the conditions (i)–(ii) ofTheorem 3. Thus their result corresponds toCorollary 3when

λ

m+1

= · · · =

λ

d

=

1.

Letxnewbe a new sample from the distribution and independent ofX. Thej-th Pcs ofxnewis given byhjTxnew

(

=

sj(new)

,

say

)

.

Note thatV

(

sj(new)

/



λ

j

)

=

1. We consider a consistent estimator ofsj(new). Then, we have the following result.

Corollary 4. Assume that

λ

j

(

j

≤

m

)

has multiplicity one. Forh

ˆ

j, it holds that

ˆ

hT_jxnew



λ

j

=

sj(new)



λ

j

+

op

(

1

)

under the conditions that

(

i

)

d

→ ∞

and n

→ ∞

for j such that

α

j

>

1,

(

ii

)

d

→ ∞

and d1−αj

/

n

(

d

)

→

0for j such that

α

j

∈

(

1

/

3

,

1

]

, and

(

iii

)

d

→ ∞

and d2−4αj

/

n

(

d

)

→

0for j such that

α

j

∈

(

0

,

1

/

3

]

,

Remark 10. Lee et al. [9] also considered a predict Pcs forsj(new)when

λ

m+1

= · · · =

λ

d

=

1.

Now, we consider applying the noise-reduction methodology to the PC direction vectors. Let us defineh

˜

_j

=

(

n

λ

˜

j

)

−1/2Xu

ˆ

j.

Then, we considerh

˜

_jas an estimate of the PC direction vector,h_j. By usingh

˜

_j

,

j

=

1

, . . . ,

m, we have the following theorem.

Theorem 6.Assume that

λ

j

(

j

≤

m

)

has multiplicity one. Forh

˜

j, it holds that

˜

hT_jxnew



λ

j

=

sj(new)



λ

j

+

op

(

1

)

under the conditions

(

i

)

–

(

ii

)

of Theorem4.

Remark 11. Assume that

λ

j

(

j

≤

m

)

has multiplicity one. Then, thej-th sample eigenvector is consistent in the sense that

˜

hT_jhj

=

1

+

Op

(

n−1

)

+

Op

(

d1−2αjn−1

)

under the conditions (i)–(ii) ofTheorem 3. For the norm, it holds that

∥ ˜

hj

∥ =

1

+

op

(

1

)

under (YA-i)–(YA-ii).

Remark 12. When the population mean may not be zero, we still have the above results by usingSoDdefined inCorollary 1. 6. Performances of noise-reduction methodology

When we observe naive PCA, the sample sizenshould be determined depending ondfor

α

i

∈

(

0

,

1

]

in (YA-ii). On the

other hand, the noise-reduction methodology allows the experimenter to choosenfree fromdfor the case that

α

i

>

1

/

2 as

seen in the theorems given in Sections3and5. The noise-reduction methodology is promising to give feasible estimation for HDLSS data with extremely small order ofncompared tod. In this section, we examine its performance with the help of Monte Carlo simulations.

Independent pseudo-random normal observations were generated fromNd

(

0

,

6) withd

=

1600. We considered

λ1

=

d4/5

_{, λ2}

₌

_d3/5

_{, λ3}

₌

_d2/5_and

_λ4

_{= · · · =}

_λ

d

=

1 in(1). We used the sample of sizen

∈ [

20

,

120

]

to define the

data matrixX

:

d

×

nfor the calculation ofSD. The findings were obtained by averaging the outcomes from 2000 (

=

R, say)

replications. Under a fixed scenario, suppose that ther-th replication ends with estimates of

λ

j

,

λ

ˆ

jrand

λ

˜

jr

(

r

=

1

, . . . ,

R

)

,

given by using(10)and(15). Let us simply write

λ

ˆ

j

=

R−1



R

r=1

λ

ˆ

jrand

λ

˜

j

=

R−1



R

(10)

(a) For the first eigenvalue. (b) For the second eigenvalue.

(c) For the third eigenvalue.

Fig. 3. The behaviors of A:λˆj/λjand B:λ˜j/λjfor (a) the first eigenvalue, (b) the second eigenvalue and (c) the third eigenvalue when the samples, of size n=20(20)120, were taken fromNd(0,6)withd=1600.

A:

_λ

ˆ

_j

_/λ

_j_{and B:}

_λ

˜

_j

_/λ

_j_._{Fig. 3}_{shows the behaviors of both A and B for the first three eigenvalues. By observing the behavior}

of A,(10)seems not to give a feasible estimation within the range ofn. The sample sizenwas not large enough to use the eigenvalues ofSDfor such a high-dimensional space. On the other hand, in view of the behavior of B,(15)gives a reasonable

estimation surprisingly well for such HDLSS datasets. The noise-reduction methodology seems to perform excellently as expected theoretically.

We also considered the Monte Carlo variability. Let Var

(

λ

ˆ

j

/λ

j

)

=

(

R

−

1

)

−1



R

r=1(

λ

ˆ

jr

− ˆ

λ

j

)

2

/λ

2j and Var

(

λ

˜

j

/λ

j

)

=

(

R

−

1

)

−1



R

r=1(

λ

˜

jr

− ˜

λ

j

)

2

/λ

2j. We considered two quantities, A: Var

(

λ

ˆ

j

/λ

j

)

and B: Var

(

λ

˜

j

/λ

j

)

, inFig. 4to show the behaviors

of sample variances of both A and B for the first three eigenvalues.

By observing the behaviors of the sample variances, both the behaviors seem not to make much difference between A and B. Note that it holdsMj

=

2

(

j

=

1

, . . . ,

m

)

for GaussianX. From Theorem 3.2 given in Yata and Aoshima [14], the

limiting distribution of

(

n

/

2

)

1/2

₍

_λ

ˆ

j

/λ

j

−

1

)

isN

(

0

,

1

)

, so that the variance of A is approximately given by Var

(

λ

ˆ

j

/λ

j

)

=

2

/

n.

On the other hand, in view ofTheorem 4, the limiting distribution of

(

n

/

2

)

1/2

₍

_λ

˜

j

/λ

j

−

1

)

isN

(

0

,

1

)

. Hence, the variance of

B is approximately given by Var

(

λ

˜

j

/λ

j

)

=

2

/

n; that is approximately equal to the variance of A.

Next, we considered the Pcs. Independent pseudo-random normal observations were generated from Nd

(

0

,

6). We

considered the case that

λ1

=

d4/5

, λ2

=

d3/5

λ3

=

d2/5and

λ4

= · · · =

λ

_d

=

1 in(1)as before. We fixed the sample size asn

=

60. We set the dimension asd

=

800

(

200

)

1800. Under a fixed scenario, suppose that ther-th replication ends with MSE

(

ˆ

sj

)

r and MSE

(

˜

sj

)

r

(

r

=

1

, . . . ,

R

)

, given by using(17)and(18). Let us simply write MSE

(

ˆ

sj

)

=

R−1



R

r=1MSE

(

ˆ

sj

)

r

and MSE

(

˜

sj

)

=

R−1



R

r=1MSE

(

˜

sj

)

r. We considered two quantities, A: MSE

(

s

ˆ

j

)/λ

jand B: MSE

(

s

˜

j

)/λ

j, inFig. 5to show the

behaviors of both A and B for the first three Pcss.

Again, the noise-reduction methodology seems to perform much better than naive PCA. We conducted simulation studies for other settings as well and verified the superiority of the noise-reduction methodology to naive PCA in HDLSS data situations.

Finally, we compare the noise-reduction methodology with the cross-data-matrix methodology. Yata and Aoshima [15] gave a Pcs estimator by using the cross-data-matrix methodology. Let

´

sjkbe thej-th Pcs estimator ofxkgiven in Section5

in Yata and Aoshima [15]. Independent pseudo-random observations were generated from thed-variatet-distribution,

(11)

(a) For the first eigenvalue. (b) For the second eigenvalue.

(c) For the third eigenvalue.

Fig. 4. The behaviors of A: Var(_λˆ_j_/λ_j₎_{and B: Var}₍_λ˜_j_/λ_j₎_{for (a) the first eigenvalue, (b) the second eigenvalue and (c) the third eigenvalue when the samples,} of sizen=20(20)120, were taken fromNd(0,6)withd=1600.

and

λ4

= · · · =

λ

_d

=

1 in(1). We setd

=

1600 andn

=

60. We considered three quantities, A: MSE

(

ˆ

sj

)/λ

j, B: MSE

(

˜

sj

)/λ

j

and C: MSE

(

´

sj

)/λ

j, inFig. 6to show the behaviors of A, B and C for the first three Pcss.

Note thattd

(

0

,

6, ν)

⇒

Nd

(

0

,

6)as

ν

→ ∞

. When

ν

is small,X has the geometric representation given by(5). In

the case, the cross-data-matrix methodology seems to perform better than the noise-reduction methodology. On the other hand, when

ν

is large,Xhas the geometric representation given by(3). In the case, the noise-reduction methodology seems to perform best among the three estimators.

7. Inverse covariance matrix estimator

In this section, we apply the noise-reduction methodology to estimating the inverse covariance matrix of6. The inverse covariance matrix,6−1, is the key to constructing inference procedures in many statistical problems. However, it should be noted thatS−1_{does not exist in the HDLSS context. Srivastava [}₁₁_,₁₂_{] used the Moore–Penrose inverse of}_S _{for several}

inference problems. Srivastava and Kubokawa [13] proposed an empirical Bayes inverse matrix estimator of6for the discriminant analysis and compared the performance with that of the Moore–Penrose inverse or the inverse matrix defined by only diagonal elements ofS. Then, they concluded that the discriminant rule using the empirical Bayes inverse matrix estimator was better than the others. The empirical Bayes inverse matrix estimator was defined byS_δ−1

=

(

S

+

δ

Id

)

−1with

δ

=

tr

(

S

)/

n. Then, let us consider the eigen-decomposition ofS_δ−1as

S_δ−1

=

n



j=1

(

_λ

ˆ

j

+

δ)

−1_h

ˆ

jh

ˆ

Tj

+

δ

−1



I_d

−

n



j=1

ˆ

h_jh

ˆ

T_j



.

LetV_δ

=

(v

_ij_(δ)

)

=

31/2HTS_δ−1H31/2, where31/2

=

diag

(λ

1₁/2

, . . . , λ

1_d/2

)

. Note that31/2HT6−1H31/2

=

Id. Let us write

κ

=

n−1



d

(12)

(a) For the first Pcs. (b) For the second Pcs.

(c) For the third Pcs.

Fig. 5. The behaviors of A: MSE(ˆsj)/λjand B: MSE(s˜j)/λjfor (a) the first Pcs, (b) the second Pcs and (c) the third Pcs when the samples, of sizen=60,

were taken fromNd(0,6)withd=800(200)1800.

Theorem 7. Assume that n

=

n

(

d

), α1

<

1

−

γ /

2

, γ <

1and the first m population eigenvalues are distinct as

λ1

>

· · ·

> λ

_m.

Under the condition that

(

i

)

d

→ ∞

and

κ/λ

j

=

O

(

d1−γ−αj

) <

∞

for j

=

1

, . . . ,

m, we have that

v

jj(δ)

=

2

λ

j

λ

j

+

2

κ

+

op

(

1

),

v

jj′_(δ)

=

o_p

(

1

),

j′

=

j

+

1

, . . . ,

d

.

For j such that

λ

j

/κ

→

0as d

→ ∞

, we have as d

→ ∞

that

v

jj(δ)

=

λ

j

κ

−1

+

op

(λ

j

κ

−1

),

v

jj′_(δ)

=

o_p

(λ

_j

κ

−1

),

j′

=

j

+

1

, . . . ,

d

.

Letj_δbe the maximum integerj

(

≤

m

)

such that

α

j

−

1

+

γ >

0. We assume that

α

j

−

1

+

γ

̸=

0

(

j

=

1

, . . . ,

m

)

. We

observe that

(v11

_(δ)

, . . . , v

jδjδ(δ)

, v

jδ+1jδ+1(δ)

, . . . , v

dd(δ)

)

is close to

(

2

, . . . ,

2

, λ

jδ+1κ−1

, . . . , λ

d

κ

−1

)

. One should note that

V_δis far fromId, so thatHTS_δ−1His far from3−1. Let us consider a different inverse matrix estimator of6by using the

noise-reduction methodology. Let

ω

=

min

(

tr

(

S

)/(

d1/2_n1/4

_{), δ)}

_and

_λ

´

_j

=

max

(

_λ

˜

_j

_{, ω)}

_{. Then, we define a new inverse matrix}

estimator as S_ω−1

=

n−1



j=1

´

λ

−1 j h

˜

jh

˜

Tj

+

ω

−1



Id

−

n−1



j=1

˜

hjh

˜

Tj



,

(20)

whereh

˜

jis the same one as inTheorem 6. LetVω

=

(v

ij(ω)

)

=

31/2HTS_ω−1H31/2. Then, we obtain the following theorem. Theorem 8. Assume that n

=

n

(

d

), γ <

1

, α1

<

min

(

1

/

2

+

γ /

4

,

1

−

γ /

2

)

and the first m population eigenvalues are distinct as

λ1

>

· · ·

> λ

_m. Let

ψ

=

min

(

n3/4

_κ/

_d1/2

_{, κ)}

_{. Under the conditions that}

₍

_i

₎

_d

_{→ ∞}

_and

_ψ/λ

j

=

O

(

d1/2−γ /4−αj

) <

∞

for

γ <

2

/

3,

(

ii

)

d

→ ∞

and

ψ/λ

j

=

O

(

d1−γ−αj

) <

∞

for

γ

∈ [

2

/

3

,

1

)

, we have that

v

jj(ω)

=

1 max

(

1

, ψ/λ

j

)

+

op

(

1

),

v

jj′_(ω)

=

o_p

(

1

),

j′

=

j

+

1

, . . . ,

d

.

(13)

(a) For the first Pcs. (b) For the second Pcs.

(c) For the third Pcs.

Fig. 6. The behaviors of A: MSE(ˆsj)/λj, B: MSE(s˜j)/λjand C: MSE(´sj)/λjfor (a) the first Pcs, (b) the second Pcs and (c) the third Pcs when the samples, of

sizen=60, were taken fromtd(0,6, ν)withν=20(10)80.

For j such that

λ

j

/ψ

→

0as d

→ ∞

, we have as d

→ ∞

that

v

jj(ω)

=

λ

_ψ

j

+

op

(λ

j

/ψ),

v

jj′_(ω)

=

o_p

(λ

_j

/ψ),

j′

=

j

+

1

, . . . ,

d

.

Letj_ωbe the maximum integerj

(

≤

m

)

such that

α

j

−

1

/

2

+

γ /

4

>

0. We assume that

α

j

−

1

/

2

+

γ /

4

̸=

0

(

j

=

1

, . . . ,

m

)

.

We observe that

(v11

_(ω)

, . . . , v

jωjω(ω)

, v

jω+1jω+1(ω)

, . . . , v

dd(ω)

)

is close to

(

1

, . . . ,

1

, λ

jω+1ψ−1

, . . . , λ

d

ψ

−1

)

. Note that

ψ < κ

w.p.1 when

γ <

2

/

3. We can claim thatV_ωis surely closer toIdthanVδunder (i)–(ii) ofTheorem 8. Remark 13. It should be noted thath

ˆ

T

jS

−1

ω h

ˆ

j

≤

0 w.p.1 as

λ

˜

j

/

λ

ˆ

j

=

op

(

1

)

. Letedbe ad-dimensional unit vector. Assume

further inTheorem 8thatedis a constant vector oredandS_ω−1are independent. Then, we claim asd

→ ∞

thateTdS

−1 ω ed

≥

0

w.p.1.

8. Application

In this section, we apply the inverse covariance matrix estimator given by(20)to the discriminant analysis. Suppose that we have twod

×

Ni data matrices,Xi

= [x

i1, . . . ,xiNi

]

,

i

=

1

,

2. We assume thatx11, . . . ,x1N1 andx21, . . . ,x2N2

are independent and identically distributed as

π1

:

Nd

(

µ

1,6)and

π2

:

Nd

(

µ

2,6), respectively. Let us write the

eigen-decomposition of6as6

=



d

j=1

λ

jhjhjT. We assume(1)about6. Letx0be an observation vector on an individual belonging

to

π1

or to

π2

. We estimate

µ

1,

µ

₂and6by

¯

xi

=

Ni−1 Ni



j=1 xij

,

i

=

1

,

2

,

and S

=

n−1 2



i=1 Ni



j=1

(

xij

− ¯

xi

)(

xij

− ¯

xi

)

T

,

wheren

=

N1

+

N2

−

2. We assumed

>

n. We consider the discriminant rule based on the maximum likelihood ratio under

which we classifyx0into

π1

if

(

1

+

N₁−1

)

−1

(

x0

− ¯

x1)TS−1

(

x0

− ¯

x1) < (1

+

N

−1 2

)

−1

₍

_x

0

− ¯

x2)TS−1

(

x0

− ¯

x2), (21)

(14)

Table 1

The correct classification rate, 1−e1, when(N1,N2)=(10,10)and (20, 20). (N1,N2)=(10,10) ρ S−1 ω S−δ1 Sdiag−1 CDR 0.2 0.864 0.849 0.841 0.977 0.4 0.806 0.770 0.716 0.953 0.6 0.787 0.717 0.623 0.952 (N1,N2)=(20,20) 0.2 0.905 0.897 0.882 0.975 0.4 0.849 0.817 0.740 0.949 0.6 0.839 0.811 0.666 0.949 Table 2

The correct classification rates,(1−e1,1−e2), when(N1,N2)=(10,20)and (20, 40). (N1,N2)=(10,20) ρ S−1 ω Sδ−1 Sdiag−1 CDR 0.2 (0.893, 0.872) (0.881, 0.860) (0.865, 0.848) (0.977, 0.973) 0.4 (0.815, 0.830) (0.783, 0.781) (0.722, 0.707) (0.950, 0.948) 0.6 (0.817, 0.814) (0.769, 0.759) (0.642, 0.636) (0.945, 0.957) (N1,N2)=(20,40) 0.2 (0.924, 0.920) (0.922, 0.913) (0.904, 0.901) (0.973, 0.974) 0.4 (0.861, 0.865) (0.837, 0.841) (0.761, 0.752) (0.949, 0.950) 0.6 (0.855, 0.856) (0.831, 0.835) (0.655, 0.663) (0.954, 0.951)

In the HDLSS context (d

>

n), there does not exist the inverse matrix ofS. We observe fromTheorems 7and8that the inverse matrix estimatorS−1

ω given by(20)is better than the empirical Bayes inverse matrix estimatorSδ−1. Let us compare

the performances ofS−1

ω andSδ−1by conducting simulation studies.

LetS

=

(

sij

)

. DefineSdiag−1 byS

−1

diag

=

diag

(

s

−1 11

, . . . ,

s

−1

dd

)

. We considered the discriminant rule given by applyingS

−1 ω

,

Sδ−1

and S_diag−1 toS−1 _in₍₂₁₎_{. We examined its performance with the help of Monte Carlo simulations. We set}_d

₌

_1600.

We set

µ

₁

=

(

1

, . . . ,

1

,

0

, . . . ,

0

)

T _{whose first 80 elements are 1, and}

_µ

2

=

(

0

, . . . ,

0

)

T. We generated the datasets

(

xi1, . . . ,xiNi

),

i

=

1

,

2, by setting a common covariance matrix as6

=

(ρ

|i−j|1/7

)

, where

ρ

∈

(

0

,

1

)

. Note that tr

(6)

=

d. We considered three levels of correlation as

ρ

=

0

.

2

,

0

.

4

,

0

.

6. Then, the eigenvalues,

(λ1, λ2, λ3, . . .)

, of6were calculated as (44.88, 19.07, 13.55, . . . ) when

ρ

=

0

.

2, (198.05, 47.32, 29.77, . . . ) when

ρ

=

0

.

4, and (491.78, 64.75, 37.80, . . . ) when

ρ

=

0

.

6. We used a testing sample,x0in(21), by generating 50 times randomly from

π1

or

π2

. The experiment was iterated

100 times. The correct classification was estimated by the average rate of correct classification over the 5000 iterations. Note that the standard deviation of this simulation study is less than 0.0071. We denoted the error of misclassifying an individual from

π1

(into

π2

) and from

π2

(into

π1

) bye1ande2, respectively. We also considered the correct discriminant rule (CDR)

given by replacing(21)with

(

x0

−

µ

1)T6−1

(

x0

−

µ

1) < (x0

−

µ

2)T6−1

(

x0

−

µ

2).

InTable 1, we reported the correct classification rate, 1

−

e1, when

(

N1,N2)

=

(

10

,

10

)

and (20, 20). InTable 2, we

reported the correct classification rates,

(

1

−

e1,1

−

e2), when

(

N1,N2)

=

(

10

,

20

)

and (20, 40). When the correlation was low such as

ρ

=

0

.

2, we observed that the rule given byS_diag−1 is as good as the others except CDR. This result is quite natural because6becomes close to a diagonal matrix as

ρ

→

0. As the variables were highly correlated, the rule given byS_diag−1

became worse. It should be noted that the variables in actual HDLSS situations are highly correlated each other. When the correlation was high such as

ρ

=

0

.

4

,

0

.

6, we observed that the rule given byS−1

ω was best among them. It should be noted

that as the correlation between variables gets high, the first few eigenvalues of6tend to become extremely large. We may observe that the noise-reduction methodology effectively works for estimating eigenvalues inS−1

ω .

Acknowledgments

The authors would like to thank the anonymous referees for their constructive comments that made the exposition of the paper more accessible. Research of the second author was partially supported by Grant-in-Aid for Scientific Research (B), Japan Society for the Promotion of Science (JSPS), under Contract Number 22300094.