Large Covariance Matrix Estimation by Composite Minimization

Alma Mater Studiorum – Università di Bologna

PhD Programme (Dottorato di Ricerca) in Statistical Sciences

Cycle XXVIII

Competition sector (Settore Concorsuale di afferenza): 13/D1

Scientific disciplinary sector (Settore Scientifico disciplinare): SECS/S-01

Thesis title: Large covariance matrix estimation by composite minimization

Presented by: Matteo Farnè

PhD Coordinator: Prof. Alessandra Luati

Supervisor: Prof. Angela Montanari

Dipartimento di Scienze Statistiche, Università di Bologna

"Illi autem octo cursus, in quibus eadem vis est duorum, septem efficiunt distinctos intervallis sonos, qui numerus rerum omnium fere nodus est; quod docti homines nervis imitati atque cantibus aperuerunt sibi reditum in hunc locum, sicut alii, qui praestantibus ingeniis in vita humana divina studia coluerunt."

Marcus Tullius Cicero, Somnium Scipionis, 6.18

Acknowledgments

At the end of a particularly exciting and laborious work, I feel the need to address my thanks to the people who accompanied and helped me to complete such a huge effort.

First of all, I wish to express my gratitude to Professor Angela Montanari, my supervisor and mentor, for the continuous support and encouragement, for the valuable advice and suggestions, for her passion and for the unique chances of growth and maturation she offered me across all my study path.

I want to thank Professor Trevor Hastie, for the brief but decisive research period spent at Stanford, which gave me the necessary insights to develop my own theory and to effectively provide original solutions to my research problem.

I have to thank Dr Yuan Liao for helpful discussions and precious references, Professor Roberto Rocci for his punctual annotations, Professor Enrico Bernardi for his useful ideas on the mathematical part and Professor Nicola Guglielmi for his strong directions on the algorithmic part.

A particularly warm acknowledgment goes to the Supervisory Statistics Division of the European Central Bank, where I spent a semester as PhD trainee: they welcomed me and allowed me to complete my PhD studies developing crucial skills for real data analysis in large dimensions. Specific thanks go to Angelos Vouldis, wonderful research supervisor, to Micheal Fedesin, for his trust and enthusiasm, and to Patrick Hogan for his valuable managing.

Finally, I want to thank my family, for the long and patient guide, aid and support, my friends, for they often alleviated this labour, and Silvia, for her continuous, passionate and enthusiastic listening and incitement.

Abstract

The present thesis concerns large covariance matrix estimation via composite minimization under the assumption of low rank plus sparse structure. Existing methods like POET (Principal Orthogonal complEment Thresholding) perform estimation by extracting principal components and then applying a soft thresholding algorithm. In contrast, our method recovers the low rank plus sparse decomposition of the covariance matrix by least squares minimization under nuclear norm plus $\ell_1$ norm penalization. This non-smooth convex minimization procedure is based on semidefinite programming and subdifferential methods, resulting in two separable problems solved by a singular value thresholding plus soft thresholding algorithm.

The most recent estimator in the literature is called LOREC (LOw Rank and sparsE Covariance estimator) and provides non-asymptotic error rates as well as identifiability conditions in the context of algebraic geometry. Our work shows that the unshrinkage of the estimated eigenvalues of the low rank component improves the performance of LOREC considerably. The same method also recovers covariance structures with very spiked latent eigenvalues like in the POET setting, thus overcoming the necessary condition $p \le n$. In addition, it is proved that our method recovers structures with intermediate degrees of spikiness, obtaining a loss which is bounded accordingly.

Then, an ad hoc model selection criterion which detects the optimal point in terms of composite penalty is proposed. Empirical results coming from a wide original simulation study, where various low rank plus sparse settings are simulated according to different parameter values, are described, outlining in detail the improvements upon existing methods. Two real data-sets are finally explored, highlighting the usefulness of our method in practical applications.

Keywords: covariance matrix, nuclear norm, thresholding, low rank plus sparse decomposition, unshrinkage.

Contents

Acknowledgments

Abstract

1 Introduction

2 State of the art

2.1 Sample covariance matrix estimators

2.1.1 The Maximum Likelihood covariance estimator

2.1.2 The unbiased covariance estimator: fixed $n$ context

2.1.3 Covariance matrix estimation: the IID data context

2.2 Conditioning properties

2.2.1 Matrix conditioning as an ill-posed inverse problem

2.3 Ledoit and Wolf's approach

2.3.1 General Asymptotics

2.4 Sparse covariance matrix estimation

2.5 Factor analysis based estimators

2.5.1 Strict factor model

2.5.2 PCA and factor analysis

2.5.3 Approximate factor model

2.5.4 POET estimator

3 Numerical and computational aspects

3.1 An historical review

3.1.1 $\ell_1$ norm heuristics

3.1.2 Nuclear norm heuristics

3.1.3 $\ell_1$ norm plus nuclear norm

3.2 Analytical and algorithmic aspects

3.2.1 Numerical context: a semidefinite program

3.2.2 Solution methods

4 Low rank plus sparse decomposition

4.1 Identification and recovery

4.1.2 Approximate recovery: a functional approach

4.1.3 Approximate recovery: an extended algebraic approach

4.1.4 Approximate recovery: LOREC approach

5 Improving LOREC

5.1 Theoretical advances

5.2 Simulation setting

5.2.1 Simulation algorithm

5.2.2 Simulated settings and comparison quantities

5.2.3 A new model selection criterion

5.3 Data analysis results

5.3.1 Simulation results

5.3.2 Real data results

Introduction

The present thesis concerns large dimensional covariance matrix estimation. Estimation of population covariance matrices from samples of multivariate data is of interest in many high-dimensional inference problems: principal components analysis, classification by discriminant analysis, inferring a graphical model structure, and others. Depending on the goal, the interest is sometimes in inferring the eigenstructure of the covariance matrix (as in PCA) and sometimes in estimating its inverse (as in discriminant analysis or in graphical models). Examples of application areas where these problems arise include gene arrays, fMRI, text retrieval, image classification, spectroscopy, climate studies, finance and macro-economic analysis.

The theory of multivariate analysis for normal variables has been well worked out, see, for example, Anderson ([2]). However, it became apparent that exact expressions were cumbersome, and that multivariate data were rarely Gaussian. The remedy was asymptotic theory for large samples and fixed, relatively small dimensions.

In recent years, datasets that do not fit into this framework have become very common: the data are very high-dimensional and sample sizes can be very small relative to dimension. The most traditional covariance estimator, the sample covariance matrix, is shown to be dramatically ill-conditioned in a large dimensional context, where the process dimension $p$ is close to or even larger than the sample size $n$, even in the case that the true covariance matrix is well-conditioned. Some solutions to this drawback have been proposed in the asymptotic context (for example [75], [15], [45]). An alternative recent approach is numerical optimization, which provides, in the non-asymptotic context, some solutions improving upon the mentioned ones.

As described in the existing literature, two key properties of the matrix estimation process assume a particular relevance in large dimensions:

1. well conditioning, i.e. numerical stability;

2. identifiability.

Both properties are crucial for the theoretical recovery and the practical use of the estimate. A badly conditioned estimate suffers from collinearity and causes its inverse, the precision matrix, to amplify dramatically any error in the data. A large dimension may make it impossible to identify the unknown covariance structure and difficult to interpret the results.

The first property is strongly related to regularization techniques. A basic reference in this respect is Tibshirani (1996) ([108]), where the LASSO estimation algorithm in the context of regression models was first proposed. The second property can be ensured by dimensionality reduction methods, which can be used to reduce the parameter space dimensionality.

Regularization approaches to large covariance matrix estimation have therefore started to be presented in the literature, both from theoretical and practical points of view. Some authors propose shrinkage towards the identity matrix ([75]), others consider tapering the sample covariance matrix, that is, gradually shrinking the off-diagonal elements toward zero ([54]). At the same time, a common approach is to encourage sparsity, either by a penalized likelihood approach ([53]) or by thresholding the sample covariance matrix ([100]).

For this reason, our research studies a specific regularization problem under the assumption of low rank plus sparse decomposition for the covariance matrix. Such a problem is solved exploiting non-smooth convex optimization methods. This approach allows us to properly address both reconditioning and dimensionality reduction issues and is proved to be effective even in a large dimensional context.

Our dissertation moves from a detailed outline of asymptotic approaches. In Chapter 2, we provide a thorough description of the motivation for our work and a review of some relevant asymptotic methods for covariance estimation. Maximum likelihood estimators and unbiased finite sample estimators are described ([2]). Specific treatment is given to the conditioning problem for covariance matrix estimates. The covariance shrinkage estimator derived by Ledoit and Wolf in the general asymptotic framework is described ([75]). Sparse covariance estimators are shown together with the underlying assumptions and the estimation error rates, with particular reference to the thresholding estimator of [15]. The POET (Principal Orthogonal complEment Thresholding) estimator ([45]), which combines Principal Component Analysis and thresholding algorithms, is analyzed in detail.

In Chapter 3, we define the regularization problem mentioned above. It is a nuclear norm plus $\ell_1$ norm approximation problem, and works under the assumption of low rank plus sparse structure for the covariance matrix. It is composed of a least squares loss and a composite non-smooth penalty, which is the sum of the nuclear norm of the low rank component and the $\ell_1$ norm of the sparse component.
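As a minimal illustration of the ingredients just listed, the following Python sketch (assuming NumPy is available) evaluates a composite objective of this form and implements entrywise soft thresholding and singular value thresholding, the proximal maps commonly associated with the $\ell_1$ norm and the nuclear norm respectively. The function names, the penalty weights `psi` and `rho`, and the factor 1/2 in the loss are illustrative assumptions, not the exact formulation developed in Chapter 3.

```python
import numpy as np


def soft_threshold(a, tau):
    """Entrywise soft thresholding: shrink every entry toward zero by tau."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)


def singular_value_threshold(a, tau):
    """Soft-threshold the singular values of a, keeping the singular vectors."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt


def composite_objective(sigma_hat, low_rank, sparse, psi, rho):
    """Least squares loss plus nuclear norm and l1 norm penalties.

    psi and rho are hypothetical names for the two penalty weights.
    """
    residual = sigma_hat - low_rank - sparse
    loss = 0.5 * np.sum(residual ** 2)
    return (loss
            + psi * np.linalg.norm(low_rank, "nuc")
            + rho * np.abs(sparse).sum())


# Small check on toy inputs.
shrunk_vector = soft_threshold(np.array([3.0, -0.5, -2.0]), 1.0)
shrunk_matrix = singular_value_threshold(np.diag([3.0, 1.0]), 2.0)
```

Soft thresholding sets small entries exactly to zero, which encourages sparsity, while singular value thresholding sets small singular values to zero, which encourages low rank.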

The numerical rationale behind the problem formulation is provided. It is shown how this problem can be recast from the point of view of numerical analysis as a semi-definite program (SDP). Nonstandard optimization tools, such as subgradient minimization methods, are needed to solve it. We describe the most recent solution algorithm and point out its effectiveness.

In Chapter 4, we provide a wide review of existing non-asymptotic methods. The evolution path of the most recent works is outlined. The most recent developments of the numerical approach under the assumption of low rank plus sparse structure for the covariance matrix are described, starting from the basic contribution by Chandrasekaran et al. ([30]), which first proves the exact recovery of the covariance matrix in the noiseless context. This result is achieved by minimizing a specific convex non-smooth objective, which is the sum of the nuclear norm of the low rank component and the $\ell_1$ norm of the sparse component.

Then, the first approximate solution to recovery and identifiability in the noisy context, coming from [1], is described. In the following, the extension of [30] providing the first exact solution of the numerical problem in the noisy graphical model setting ([31]) is shown in detail. In that context, the objective is a least squares loss penalized by the above mentioned composite penalty, and its optimization allows us to recover the inverse covariance matrix. In conclusion, the extension of this framework to the covariance matrix estimation context, coming from [77], is explained. The resulting estimator is called LOREC (LOw Rank and sparsE Covariance estimator).

In the last chapter (Chapter 5), an improvement over the solution described in [77] is proposed, based on the unshrinkage of the estimated eigenvalues of the low rank component. Luo's approach is completed by deriving the rates of the sparse component estimate, and the conditions for its positive definiteness and invertibility. In addition, the rates of LOREC under the conditions of POET and, more importantly, in a context where the eigenvalues of the low rank component are allowed to grow with $p^{\alpha}$, $\alpha \in [0, 1]$ (generalized spikiness context), are provided.

In the following, we show the results of our procedure on both simulated and real data sets. We illustrate a new model selection criterion which is proved to be effective in our context. An original simulation study is presented, where extensive simulation results are reported together with the simulation algorithm and the estimation assessment framework.

In the end, the performance of our newly proposed estimator is compared to that of LOREC and POET under various settings. Two real examples are provided where our model is effective with respect to the competitors. In particular, the second example is a banking supervisory dataset which collects supervisory reporting indicators of the most relevant Euro Area banks. We explicitly thank the Supervisory Statistics Division of the European Central Bank, where the author spent a semester as a PhD trainee, for the permission to use these data in anonymous form for research purposes.

Covariance matrix estimation: state of the art

In this chapter, a short review of existing solutions to the problem of covariance matrix estimation is provided. Particular attention is given to the two properties displayed in the Introduction (well conditioning and identifiability) and to the performance of existing methods in the large dimensional context. An exhaustive review can be found in Pourahmadi (2013) ([95]).

This Chapter shows a path across existing estimators aimed at outlining the two mentioned features (well conditioning and identifiability) for each estimation setting, especially when $p$ is very large compared to the sample size $n$ or even larger. This is why, for each estimator, a detailed discussion of the asymptotic framework and the assumptions needed to ensure consistency (i.e. the convergence to the theoretical covariance matrix) is provided.

Existing approaches to the estimation problem are described in this Chapter, while non-asymptotic approaches will be the object of the next chapters. The description of past approaches is intended to display the main issues encountered by existing methods, with particular reference to the large dimensional context, and the reasons why we need to develop an alternative numerical approach to the covariance estimation problem.

The first section (2.1) is devoted to covariance matrix estimation under the assumption of normality for the data. The maximum likelihood estimator, i.e. the sample covariance matrix, is introduced and justified. The unbiased sample covariance matrix, under the assumption of fixed $n$, is then outlined. A specific remark on the asymptotic distribution of the sample covariance matrix under the assumption of independence and identical distribution for the data concludes the section.

In the second section (2.2) the conditioning properties of the sample covariance matrix are explored. The reason why the sample covariance matrix is badly conditioned when the dimension is close to the sample size is explained and analyzed in depth, as well as the reason why the inverse covariance matrix dramatically amplifies the estimation error in case of bad conditioning.

The third section (2.3) describes at length a successful attempt to address the problem of reconditioning the sample covariance matrix when the dimension is larger than the sample size: the shrinkage estimator by Ledoit & Wolf ([75]). Their motivations, their results and their asymptotic context are highlighted, trying to retain the key elements of their approach.

The fourth section (2.4) briefly outlines existing sparsity estimators, with particular reference to the thresholding estimator by Bickel & Levina ([15]), which is described in detail with respect to model assumptions and convergence rates. There we point out the strong link between sparsity assumptions and shrinkage thresholding. That family of estimators shows how it is possible to use sparsity to recondition the covariance estimate and to significantly reduce the number of parameters.

The fifth section (2.5) describes covariance matrix estimators based on factor model assumptions. A brief overview of factor model specifications and underlying assumptions across history is provided, discussing the different asymptotic contexts. The relationship between Principal Component Analysis (PCA, [72]) and factor modelling (see [59]) is crucial in this respect. Finally, the POET estimator ([45]), based on the assumption of an approximate factor model with a sparse residual matrix, is illustrated at length, pointing out the crucial assumptions for consistency and identifiability.

In [45], the population covariance matrix is assumed to be the sum of a low rank and a sparse component. POET works under the assumption of a sparse residual covariance matrix and pervasive eigenvalues of the low rank component (as $p \to \infty$). This structure is particularly convenient in a large dimensional context, and tackles both the issues mentioned above, as we will explain at length. For the same reasons, the factor analysis assumption is a key to approach covariance estimation in large dimensions. The asymptotic correspondence between PCA and factor estimation is there established according to the underlying assumptions and then exploited.

Before starting, we describe the basic matrix terminology. We restrict our analysis to the real case. The spectral theorem ensures that, when $M$ is a positive semidefinite square $p$-dimensional real matrix with rank $r$, there exist a $p \times r$ matrix $U$ with orthonormal columns and a diagonal $r \times r$ matrix $\Lambda$ such that

$$M = U \Lambda U' = \sum_{i=1}^{r} \lambda_i u_i u_i', \tag{2.1}$$

which is the eigenvalue decomposition of $M$. The scalars $\lambda_1, \ldots, \lambda_r$ are the (nonzero) eigenvalues of $M$ and are strictly larger than $0$. The $r$ columns of $U$ are the corresponding eigenvectors of $M$. If $M$ is symmetric, its singular values $\sigma_1, \ldots, \sigma_r$, which are the square roots of the eigenvalues of $M'M$, are the absolute values of the eigenvalues of $M$; they therefore coincide with the eigenvalues whenever $M$ is positive semidefinite. A fortiori, this happens if $M$ is a covariance matrix, which is symmetric and positive definite.
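The decomposition (2.1) can be checked numerically. The sketch below, which assumes NumPy and uses an arbitrary randomly generated matrix (the sizes `p`, `r` and the seed are illustrative choices), builds a positive semidefinite matrix of rank $r$, recovers $U$ and $\Lambda$, and compares the nonzero eigenvalues with the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 5, 2

# A p x p positive semidefinite matrix of rank r (almost surely).
b = rng.standard_normal((p, r))
m = b @ b.T

# eigh returns eigenvalues in ascending order for symmetric matrices;
# keep the r largest ones and the matching eigenvectors.
eigvals, eigvecs = np.linalg.eigh(m)
order = np.argsort(eigvals)[::-1][:r]
lam = eigvals[order]
u = eigvecs[:, order]

# Rebuild M = U Lambda U' from the r retained pairs.
m_rebuilt = (u * lam) @ u.T

# For a PSD matrix the leading singular values equal the eigenvalues.
singular_values = np.linalg.svd(m, compute_uv=False)[:r]
```

Here `m_rebuilt` agrees with `m` up to rounding error, and `lam` agrees with `singular_values`, as stated above for positive semidefinite matrices.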

The relevant norms we are going to use throughout the entire thesis are (see also [62]):

• $\|M\|_2 = \sqrt{\lambda_{\max}(M'M)}$ is the spectral norm of $M$, which is its largest singular value.

• $\|M\|_\infty = \max_{i,j} |m_{ij}|$ is the infinity norm of $M$, which is the largest entry in magnitude.

• $\|M\|_F = \sqrt{\mathrm{trace}(M'M)} = \sqrt{\sum_i \sum_j m_{ij}^2}$ is the Frobenius norm of $M$, which is the square root of the sum of the squared entries of $M$.

• $\|M\|_* = \mathrm{trace}(\sqrt{M'M}) = \sum_{i=1}^{p} \sigma_i$ is the sum of the singular values of $M$. $\|M\|_*$ is called the nuclear norm. If $M$ is a positive semidefinite (PSD) matrix, $\|M\|_* = \mathrm{tr}(M)$, because the eigenvalues and the singular values exactly coincide.

• $\|M\|_1 = \sum_i \sum_j |m_{ij}|$ is the sum of the absolute values of the entries of $M$.

For a $p$-dimensional vector $x$, the relevant norms for our purpose are:

• $\|x\|_2 = \sqrt{\sum_i x_i^2}$, the Euclidean norm of $x$.

• $\|x\|_1 = \sum_{i=1}^{p} |x_i|$, the $\ell_1$ norm of $x$.

• $\|x\|_\infty = \max_i |x_i|$, the maximum norm of $x$.
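For concreteness, the norms listed above can be computed with NumPy as in the following sketch; the small matrix and vector are arbitrary examples. Note that the entrywise infinity and $\ell_1$ matrix norms used in this thesis are not what `np.linalg.norm(m, np.inf)` and `np.linalg.norm(m, 1)` return (those are the maximum row sum and maximum column sum), so they are computed directly from the entries.

```python
import numpy as np

m = np.array([[2.0, -1.0],
              [-1.0, 2.0]])   # symmetric PSD, eigenvalues 3 and 1

spectral = np.linalg.norm(m, 2)        # largest singular value
max_abs = np.abs(m).max()              # entrywise infinity norm
frobenius = np.linalg.norm(m, "fro")   # sqrt of the sum of squared entries
nuclear = np.linalg.norm(m, "nuc")     # sum of the singular values
l1_entrywise = np.abs(m).sum()         # sum of absolute entries

x = np.array([3.0, -4.0])
euclidean = np.linalg.norm(x, 2)
l1_vector = np.linalg.norm(x, 1)
max_vector = np.linalg.norm(x, np.inf)
```

Since `m` is positive semidefinite, its nuclear norm equals its trace, in agreement with the remark in the list above.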

2.1 Sample covariance matrix estimators

In this section we focus on the most used estimator of the covariance matrix: the sample covariance matrix. First, we will derive it as the maximum likelihood estimator of the covariance matrix under the assumption of multivariate normality for our data (2.1.1). Maximum likelihood estimators are consistent when $n \to \infty$. This is why we then derive the unbiased covariance estimator under the assumption of finite $n$ (2.1.2), which is a slightly modified version of the sample covariance matrix. These two estimators asymptotically converge when $n \to \infty$, under the assumption of fixed $p$. At the end of this section, we give a brief account of the behaviour of this estimator under the assumption of independence and identical distribution for our data when $n \to \infty$ (2.1.3).

Our main reference for this argument is the famous book by Anderson ([2]).

2.1.1 The Maximum Likelihood covariance estimator

Suppose we have a sample $(x_1, \ldots, x_n)$ from a real-valued $p$-dimensional normal random variable $x \sim N_p(\mu^*, \Sigma^*)$, with $p \le n$. The $p \times p$ matrix $\Sigma^* = E((x - \mu^*)(x - \mu^*)')$ is real, positive definite and symmetric, while $\mu^* = E(x)$ is a $p \times 1$ vector.

The density of $x$ is the following:

$$f(x \mid \mu^*, \Sigma^*) = (2\pi)^{-\frac{1}{2}p} \, |\Sigma^*|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(x - \mu^*)' \Sigma^{*-1} (x - \mu^*)\right],$$

where $\mu^*$ is a $p \times 1$ vector and $\Sigma^*$ is a $p \times p$ invertible (positive definite) matrix.

The likelihood function is

$$L(\mu^*, \Sigma^*) = \prod_{i=1}^{n} N(x_i \mid \mu^*, \Sigma^*) = (2\pi)^{-\frac{1}{2}pn} \, |\Sigma^*|^{-\frac{1}{2}n} \exp\left[-\frac{1}{2}\sum_{i=1}^{n} (x_i - \mu^*)' \Sigma^{*-1} (x_i - \mu^*)\right].$$

The log-likelihood is then

$$\log L(\mu^*, \Sigma^*) = -\frac{1}{2}pn \log 2\pi - \frac{1}{2}n \log|\Sigma^*| - \frac{1}{2}\sum_{i=1}^{n} (x_i - \mu^*)' \Sigma^{*-1} (x_i - \mu^*).$$

We denote by $\hat{\mu}_{ML}$ and $\hat{\Sigma}_{ML}$ the vector and the positive definite matrix maximizing $\log L$. They are the maximum likelihood estimators of $\mu^*$ and $\Sigma^*$. Since $\log L$ is an increasing function of $L$, $\log L$ and $L$ attain their maximum at the same parameter values.

The following important theorem holds:

Theorem 2.1.1. If $x_1, \ldots, x_n$ constitute a sample from $N(\mu^*, \Sigma^*)$ with $p < n$, the maximum likelihood estimators of $\mu^*$ and $\Sigma^*$ are $\hat{\mu}_{ML} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\hat{\Sigma}_{ML} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ respectively.

The proof can be found in Anderson (1958), page 67 and following. It exploits the properties of the arithmetic mean and of positive definite matrices. The key argument is that $\log L$ can be rewritten in the following way:

$$-\frac{1}{2}pn \log 2\pi - \frac{1}{2}n \log|\Sigma^*| - \frac{1}{2}\mathrm{tr}(\Sigma^{*-1} D) - \frac{1}{2}n(\bar{x} - \mu^*)' \Sigma^{*-1} (\bar{x} - \mu^*),$$

where $D = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$.

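In order to perform maximization, the necessary assumption is that $\Sigma^*$ is a positive definite matrix. This condition ensures that the quadratic form $n(\bar{x} - \mu^*)' \Sigma^{*-1} (\bar{x} - \mu^*)$ is non-negative and vanishes only at $\mu^* = \bar{x}$, so that the corresponding term of $\log L$ is maximized there, and that the term $-n\log|\Sigma^*| - \mathrm{tr}(\Sigma^{*-1} D)$ achieves its maximum for $\Sigma^* = \frac{1}{n}D$.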
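The maximization argument can be verified numerically on simulated data. The following sketch (assuming NumPy; the sample size, dimension and seed are arbitrary illustrative choices) evaluates the Gaussian log-likelihood at the ML estimates and at the true parameters, and compares the maximized value with the closed form $-\frac{1}{2}n\,(p\log 2\pi + \log|\hat{\Sigma}_{ML}| + p)$ that follows from $\mathrm{tr}(\hat{\Sigma}_{ML}^{-1} D) = np$.

```python
import numpy as np


def gaussian_loglik(x, mu, sigma):
    """Log-likelihood of the rows of x under N(mu, sigma)."""
    n, p = x.shape
    centered = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    # Sum over observations of (x_i - mu)' sigma^{-1} (x_i - mu).
    quad = np.sum((centered @ np.linalg.inv(sigma)) * centered)
    return -0.5 * n * p * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad


rng = np.random.default_rng(5)
n, p = 40, 3
x = rng.standard_normal((n, p))      # true mean 0, true covariance I

mu_ml = x.mean(axis=0)
sigma_ml = (x - mu_ml).T @ (x - mu_ml) / n

ll_ml = gaussian_loglik(x, mu_ml, sigma_ml)
ll_true = gaussian_loglik(x, np.zeros(p), np.eye(p))

_, logdet_ml = np.linalg.slogdet(sigma_ml)
closed_form = -0.5 * n * (p * np.log(2 * np.pi) + logdet_ml + p)
```

By construction the log-likelihood at the ML estimates is never smaller than at any other admissible parameter pair, including the true one.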

ML estimators show a number of interesting optimality properties. In particular, they are consistent and asymptotically efficient ([34]). A theorem by Cramér ensures that $\hat{\mu}_{ML}$ and $\hat{\Sigma}_{ML}$ are minimum variance (asymptotically) unbiased estimators. These properties hold only in the limit $n \to \infty$.

Note that the condition $p < n$ is also necessary in order to perform maximization. To see this point, we need to recall a basic theorem ([2], p.77):

Theorem 2.1.2. The maximum likelihood estimator $\hat{\mu}_{ML} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, from $N(\mu^*, \Sigma^*)$, is distributed according to $N(\mu^*, \frac{1}{n}\Sigma^*)$ and independently of $\hat{\Sigma}_{ML} = \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$. $n\hat{\Sigma}$ is distributed according to $\sum_{i=1}^{n-1} z_i z_i'$, where $z_i \sim N(0, \Sigma^*)$, and $z_1, \ldots, z_{n-1}$ are independent.

This theorem states that under the multivariate normality assumption for the data, $n\hat{\Sigma}$ is the sum of $n - 1$ square $p$-dimensional matrices having rank $1$. If $p \ge n$, $n\hat{\Sigma}$ will never have full rank $p$.
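This rank deficiency is easy to observe in practice. The sketch below (assuming NumPy; the dimension, the sample sizes and the seed are arbitrary) computes the rank of the sample covariance matrix of simulated Gaussian data for several values of $n$ around $p$.

```python
import numpy as np


def ml_covariance(x):
    """Sample covariance with divisor n; rows of x are observations."""
    centered = x - x.mean(axis=0)
    return centered.T @ centered / x.shape[0]


rng = np.random.default_rng(1)
p = 6
ranks = {}
for n in (4, 6, 7, 20):
    x = rng.standard_normal((n, p))
    ranks[n] = int(np.linalg.matrix_rank(ml_covariance(x)))
```

Centering removes one degree of freedom, so the rank is at most $n - 1$: full rank $p$ is attained here only for the sample sizes with $n > p$.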

In addition, it has been shown by Wishart ([113]) that $D = n\hat{\Sigma}$ is a random matrix having the following density:

$$f(D \mid \Sigma^*) = \frac{|D|^{\frac{1}{2}(\nu - p - 1)} \exp\left(-\frac{1}{2}\mathrm{tr}(\Sigma^{*-1} D)\right)}{2^{\frac{1}{2}\nu p} \, \pi^{p(p-1)/4} \, |\Sigma^*|^{\frac{1}{2}\nu} \prod_{i=1}^{p} \Gamma\left[\frac{1}{2}(\nu + 1 - i)\right]},$$

which is a Wishart distribution with $\nu = n - 1$ degrees of freedom, where $\Gamma(t) = \int_{0}^{\infty} x^{t-1} e^{-x} \, dx$ is the usual Gamma function. The proof is reported in [2] (p.252 and following). It relies heavily on linear transforms of random variables, and is based on the properties of the Gram-Schmidt orthogonalization algorithm.

This result was first derived for a bivariate distribution by Fisher ([51]), where the distribution of the correlation coefficient (first defined by Karl Pearson in [91]) was also derived.

We can now understand why $p < n$ is a necessary condition. If $n \le p$, $f(D \mid \Sigma^*)$ is no longer a density, so that it is no longer possible to derive the asymptotic distribution of $\hat{\Sigma}$ (i.e., all the usual optimality properties of ML estimators are lost). In fact, $|D|$ would be zero, and the distribution would thus be degenerate, being supported on a set of null measure in $R^{p \times p}$. Note also that when the exponent of $|D|$ in $f(D \mid \Sigma^*)$ vanishes, the density has no interior mode, analogously to the $\chi^2$ distribution with two degrees of freedom.

In the same way, denoting by $T^2$ the quantity $T^2 = n(\bar{x} - \mu^*)' W^{-1} (\bar{x} - \mu^*)$, where $W = \frac{D}{n-1}$, it has been shown by Hotelling ([64]) that

$$\frac{\nu - p + 1}{\nu p} \, T^2 \sim F,$$

where $F$ is Fisher's distribution with $p$ and $\nu - p + 1$ degrees of freedom ($\nu = n - 1$).

$T^2$ is called Hotelling's T-squared statistic. Its distribution is non-singular if and only if both $\hat{\mu}$ and $\hat{\Sigma}$ are non-singular, i.e. if $\Sigma^*$ is positive definite and $\nu - p + 1 > 0$ (equivalent to $n > p$).

So, both the sample mean and the sample covariance matrix are ML estimators of the true mean and the true covariance matrix if and only if the true covariance matrix is positive definite and the dimension $p$ is strictly smaller than the sample size $n$. In particular, the distribution of the sample covariance matrix is $n^{-1}\,\mathrm{Wishart}(\Sigma^*, n - 1)$. This means that $\hat{\Sigma}$ is biased if $n$ is finite, since its expected value is $\frac{n-1}{n}\Sigma^*$. Note that this distribution does not change even when the true mean $\mu^*$ is known, unless $\bar{x}$ is replaced by the true $\mu^*$. In that case, the degrees of freedom are $n$ and the resulting estimator, $\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu^*)(x_i - \mu^*)'$, is unbiased.

2.1.2 The unbiased covariance estimator: fixed $n$ context

In order to derive the finite sample unbiased estimator of the covariance matrix, the key result is Theorem 2.1.2 about the distribution of $D = n\hat{\Sigma} = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ shown above. A corollary of that theorem states:

Corollary 2.1.1. Let $x_1, \ldots, x_n$ $(n > p)$ be independently distributed, each according to $N(\mu^*, \Sigma^*)$. The distribution of $\hat{\Sigma}_{\nu} = \frac{1}{\nu}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ is $\nu^{-1}\,\mathrm{Wishart}(\Sigma^*, \nu)$, where $\nu = n - 1$.

This result means that $\hat{\Sigma}_{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ is the unbiased estimator of the covariance matrix when the sample size $n$ is finite. This estimator will be the input of our new estimation procedure in Chapter 4. Clearly, $\hat{\Sigma}_{n-1}$ and $\hat{\Sigma}_{n}$ converge asymptotically to the same estimator. We are now going to derive the asymptotic (normal) distribution of the sample covariance matrix in the more general case of IID data.
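The two estimators of Sections 2.1.1 and 2.1.2 differ only in the divisor applied to $D$. The following sketch (assuming NumPy; the data are simulated and the sizes are arbitrary) computes both and checks them against `np.cov`, which uses the divisor $n - 1$ by default and the divisor $n$ when `bias=True`.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
x = rng.standard_normal((n, p))          # rows are observations

centered = x - x.mean(axis=0)
d = centered.T @ centered                # D = sum of outer products

sigma_ml = d / n                         # ML estimator (biased)
sigma_unbiased = d / (n - 1)             # unbiased estimator

reference_unbiased = np.cov(x, rowvar=False)
reference_ml = np.cov(x, rowvar=False, bias=True)
```

The ratio between the two estimators is $(n-1)/n$, which tends to one as $n$ grows, in line with the asymptotic equivalence noted above.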

2.1.3 Covariance matrix estimation: the IID data context

Let us suppose $x_i \sim IID(\mu^*, \Sigma^*)$, $i = 1, \ldots, n$. We want to derive the asymptotic distribution of $\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$.

Under the IID hypothesis, we have:

$$E(x_i x_i') = V(x_i) + E(x_i)E(x_i') = \Sigma^* + \mu^* \mu^{*\prime},$$

$$V(x_i x_i') = V(x_i) + V(x_i) = \Sigma^* + \Sigma^* = 2\Sigma^*.$$

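Our target can be rewritten as the sum of three components:

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' = \sum_{i=1}^{n} \frac{x_i x_i'}{n} - 2\bar{x}\sum_{i=1}^{n} \frac{x_i'}{n} + \sum_{i=1}^{n} \frac{\bar{x}\bar{x}'}{n}.$$

Since $\sum_{i=1}^{n} \frac{x_i}{n} \xrightarrow{prob} \mu^*$, we have that

$$-2\bar{x}\sum_{i=1}^{n} \frac{x_i'}{n} + \sum_{i=1}^{n} \frac{\bar{x}\bar{x}'}{n} = -2\bar{x}\bar{x}' + \bar{x}\bar{x}' = -\bar{x}\bar{x}'$$

converges in probability as follows:

$$-\bar{x}\bar{x}' \xrightarrow{prob} -\mu^* \mu^{*\prime}. \tag{2.2}$$

Now, the first component $\sum_{i=1}^{n} \frac{x_i x_i'}{n}$ can be rewritten as

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{x_i x_i'}{\sqrt{n}}.$$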

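So, by the Central Limit Theorem, the centered and rescaled sum satisfies

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[x_i x_i' - (\Sigma^* + \mu^* \mu^{*\prime})\right] \xrightarrow{CLT} N(0, 2\Sigma^*),$$

that is, $\sum_{i=1}^{n} \frac{x_i x_i'}{n}$ is asymptotically normal around $\mu^* \mu^{*\prime} + \Sigma^*$ with fluctuations of order $\frac{1}{\sqrt{n}}$. In the shorthand used here, this is written $\frac{1}{\sqrt{n}} N(\mu^* \mu^{*\prime} + \Sigma^*, 2\Sigma^*)$.

Recalling (2.2), we have that

$$\hat{\Sigma}_n \xrightarrow{distrib} \frac{1}{\sqrt{n}} N(\Sigma^*, 2\Sigma^*). \tag{2.3}$$

These results find confirmation in [58].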

2.2 The sample covariance matrix: conditioning properties

We are now going to briefly discuss matrix conditioning. Let us suppose $p$ and $n$ are fixed. If $n > p$, the expected value of $\hat{\Sigma}_{\nu}$, $\nu = n - 1$, is $\Sigma^*$, and the variances of its entries are $V(\hat{\sigma}_{n,ij}) = \frac{\sigma_{ij}^{*2} + \sigma_{ii}^{*}\sigma_{jj}^{*}}{n - 1}$. This highlights why the variance of $\hat{\Sigma}_n$ increases as the true condition number of $\Sigma^*$ increases. If the condition number $c = \sigma_{\max}/\sigma_{\min}$ increases, the correlation between the components $x_i$ and $x_j$ increases, because $\Sigma^*$ is closer to collinearity. Consequently, $V(\hat{\sigma}_{n,ij})$ increases, because $\sigma_{ij}^{*2}$ is closer to its maximum, which is $\sigma_{ii}^{*}\sigma_{jj}^{*}$ (by the Cauchy-Schwarz inequality).

Coming back to the main point, it is crucial to study the behaviour of the sample eigenvalues. In the matrix estimation context there is a relevant issue about numerical conditioning, i.e. the behaviour of the sample maximum and minimum singular values of a $p \times n$ data matrix $X$.

Theorem 2.2.1 ([39]). Given natural numbers $n, p$ with $p < n + 1$, let $X$ be a $p \times n$ matrix with i.i.d. Gaussian entries that have zero mean and variance $\frac{1}{n}$. Then the largest and smallest singular values $\sigma_{\max}(X)$ and $\sigma_{\min}(X)$ are such that

$$\max\left\{ \Pr\left[\sigma_{\max}(X) \ge 1 + \sqrt{\frac{p}{n}} + t\right], \ \Pr\left[\sigma_{\min}(X) \le 1 - \sqrt{\frac{p}{n}} - t\right] \right\} \le \exp\left(-\frac{n t^2}{2}\right),$$

for any $t > 0$.
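A small Monte Carlo experiment gives a feeling for how this bound behaves. The sketch below (assuming NumPy; the values of `p`, `n`, `t`, the number of trials and the seed are arbitrary illustrative choices) draws Gaussian matrices with entry variance $1/n$, counts how often either extreme singular value leaves the interval $[1 - \sqrt{p/n} - t, \ 1 + \sqrt{p/n} + t]$, and compares the observed frequency with $\exp(-nt^2/2)$.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, t, trials = 20, 200, 0.2, 200

upper = 1 + np.sqrt(p / n) + t
lower = 1 - np.sqrt(p / n) - t
tail_bound = np.exp(-n * t ** 2 / 2)

violations = 0
for _ in range(trials):
    # p x n matrix with i.i.d. N(0, 1/n) entries.
    x = rng.standard_normal((p, n)) / np.sqrt(n)
    s = np.linalg.svd(x, compute_uv=False)
    if s.max() >= upper or s.min() <= lower:
        violations += 1

frequency = violations / trials
```

In experiments of this kind the observed frequency is typically far below the theoretical bound, which is a worst-case guarantee rather than a sharp estimate.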

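This theorem was proved by using arguments from random matrix theory and the geometry of Banach spaces. It is an essential result to provide a probabilistic bound for the error distance $\|\hat{\Sigma}_n - \Sigma^*\|_2$, where $\hat{\Sigma}_n = \frac{1}{n} X X' = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'$ and $x_1, \ldots, x_n$ denote the columns of the $p \times n$ matrix $X$.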

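In fact, the following Lemma holds:

Lemma 2.2.1. Let $\psi = \|\Sigma^*\|_2$. Given any $\delta > 0$ and $\phi > 0$ with $\psi \le 8\phi$, let the number of samples $n$ be such that $n \ge 64\,\frac{p\phi^2}{\delta^2}$. Then we have that

$$\Pr\left[\|\hat{\Sigma}_n - \Sigma^*\|_2 \ge \delta\right] \le 2\exp\left(-\frac{n\delta^2}{128\,\psi^2}\right).$$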

This Lemma is based on a specific assumption on $\psi$, the largest eigenvalue of $\Sigma^*$. By appropriately setting the parameter $\psi$, we can obtain the probabilistic bound accordingly.

This Lemma relies on the fact that the spectral norm is unitarily invariant, so that it is possible to assume a diagonal structure for $\hat{\Sigma}$ without loss of generality and then apply the previous Theorem 2.2.1.

It is remarkable that, without further assumptions, $\hat{\Sigma}_n$ is not invertible if $p > n$ (since it is perfectly collinear, having clearly at most rank $n$, with all the remaining eigenvalues equal to zero). Even if $p \le n$, in the case where the ratio $p/n$ is less than 1 but not negligible, the estimated (maximum and minimum) eigenvalues are numerically unstable, since the probabilistic bound is too large. This may result in bad conditioning (i.e. too large a condition number) for $\hat{\Sigma}_n$. This is why in the Big Data context, when $p$ is very large, it is frequent to have an ill-conditioned sample covariance matrix, since it is difficult to have enough observations to keep the ratio $p/n$ negligible ([75]).

The example in Figure 2.1 clearly outlines the described drawback. The eigenvalues of the sample covariance matrix of a simulated $n \times p$ process $\epsilon_i \sim N_p(0, \frac{1}{n} I)$, $p = 100$, $n \in \{10, 50, 100, 500, 1000, 10000\}$, are plotted. The figure displays how the dispersion of the eigenvalues decreases as $p/n$ decreases. All distributions tend to the Marčenko-Pastur distribution, which is proved to be the limiting distribution of the sample eigenvalues for IID random variables (in the Kolmogorov asymptotic framework, see [79]). The rank is always equal to $\min(p, n - 1)$. If $p = n$, the matrix is thus singular.
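The qualitative behaviour described above can be reproduced with a few lines of code. The sketch below (assuming NumPy) is not the simulation behind Figure 2.1: for simplicity it draws standard normal data with identity covariance, which is an assumption made here, and records the spread between the largest and smallest sample eigenvalue for each sample size.

```python
import numpy as np


def sample_eigenvalues(n, p, rng):
    """Eigenvalues of the sample covariance of n draws from N(0, I_p)."""
    x = rng.standard_normal((n, p))
    centered = x - x.mean(axis=0)
    sigma_hat = centered.T @ centered / n
    return np.linalg.eigvalsh(sigma_hat)


rng = np.random.default_rng(3)
p = 100
sample_sizes = (10, 50, 100, 500, 1000, 10000)

spread = {}
for n in sample_sizes:
    eig = sample_eigenvalues(n, p, rng)
    spread[n] = float(eig.max() - eig.min())
```

Although every true eigenvalue equals one, the sample eigenvalues are widely dispersed when $p/n$ is large, and the spread shrinks steadily as $n$ grows with $p$ fixed.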

We have provided this simple example to state that, without further assumptions on the eigen-structure (values and vectors) of $\Sigma^*$, the condition $p \le n$ is unavoidable in order to guarantee the positive definiteness (and thus the invertibility) of our covariance estimate. Anyway, the recovery of the eigen-structure of a covariance matrix is strongly related to the underlying assumptions and to the asymptotic context.

Figure 2.1: Eigenvalues of the sample covariance matrix of $\epsilon_i \sim N_p(0, \frac{1}{n} I)$, $p = 100$, $n$ varying (one panel for each of $n = 10, 50, 100, 500, 1000, 10000$; vertical axis: eigenvalues).

We now enumerate three parameter settings relevant for our dissertation:

1. $p$ and $n$ fixed: this is the case of $\hat{\Sigma}_{n-1}$, and of all the numerical estimators we will analyze in the next chapters ([31], [1], [77], [15]).

2. $p$ fixed, $n \to \infty$: this is the case of $\hat{\Sigma}_{ML}$, or of the approximate factor model ([29]).

3. $\frac{p_n}{n} \to c$ when $n \to \infty$: here we find the General asymptotic framework, used by Ledoit and Wolf to ensure the consistency of their estimator ([75]), and the Kolmogorov asymptotic framework (where also $p \to \infty$). The consistency properties of the thresholding estimator ([15]) and of the POET estimator ([45]) are also derived under a similar framework, where a function of $p$ and $n$ tends to 0 while $n \to \infty$. See Sections (2.4) and (2.5) for more explanations.

In the first context, with fixed $p$ and $n$, the outlined results concerning numerical conditioning for the sample covariance matrix hold, and the condition $p \le n$ is unavoidable, without further assumptions, to derive finite sample bounds. This is why one of the aims of the present work is to exploit results from the third asymptotic framework (in terms of model assumptions) to establish bounds in the finite sample context while dropping the condition $p \le n$.


2.2.1 Matrix conditioning as an ill-posed inverse problem

We now explain in detail why a badly conditioned sample matrix is a fatal drawback for us. The reason lies in the consequences of inverting a badly conditioned matrix.

Let us now consider the standard linear system $Ax = b$, where $A$ is $p \times p$, and $x$, $b$ are $p \times 1$. If our aim is to derive $b$ (the output), we are solving the direct problem. If our aim is to derive $x$ (the input), we are solving the inverse problem. If $A$ is full rank, Cramer's theorem ensures that the inverse problem has the exact solution $x^* = A^{-1} b$. Otherwise, if $A$ has rank $r < p$, we need to solve the least squares problem

\[ \min_{x \in \mathbb{R}^p} \|Ax - b\|^2, \]

and we have

\[ x^* = \sum_{i=1}^{r} \frac{|u_i' b|}{\lambda_i} u_i \tag{2.4} \]

\[ \|Ax^* - b\|^2 = \sum_{i=r+1}^{p} \|u_i' b\|^2. \]

This fundamental result was proved in [40]. How reliable is the solution $x^*$? Hadamard ([57]) outlined the three characteristics of a well-posed problem:

• existence: the problem admits at least one solution

• uniqueness: the problem has at most one solution

• stability: the problem is not sensitive to data perturbation.

In our context, if $A$ is full rank, the inverse problem may be ill-posed since it violates the stability condition. If $A$ is not full rank, the inverse problem is ill-posed since it violates the existence and the uniqueness conditions (there are only approximate solutions, no exact ones). The least squares system serves to identify a solution in any case, even if there were none.

Anyway, (2.1) and (2.4) enable us to understand why the inverses of badly conditioned matrices are numerically unstable. The solution of the direct problem is $Ax = U \Lambda U' x = \sum_{i=1}^{p} \lambda_i (u_i' x) u_i$, which dampens the components corresponding to the smallest eigenvalues of $A$. On the contrary, (2.4) shows us that the solution of the inverse problem amplifies the effects of the same components. If we assume that $b$ is perturbed, i.e. $b_\epsilon = b + \epsilon$, we note that $x_\epsilon = x^* + \sum_{i=1}^{r} \frac{|u_i' \epsilon|}{\lambda_i} u_i$. So, if $A$ is badly conditioned (i.e. we have very small eigenvalues), the effect of data perturbation is amplified, and the solution may not be effectively usable in applications.
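The amplification effect can be checked numerically with a small sketch. The matrix, its eigenvalues, the seed and the size of the perturbation below are hypothetical choices made only for illustration. The code uses the signed projections $u_i' \epsilon$, which is what solving the linear system actually computes.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
p = 5
# Hypothetical symmetric matrix A = U diag(lam) U' with one very small eigenvalue.
U, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthogonal eigenvectors
lam = np.array([10.0, 5.0, 1.0, 0.5, 1e-6])
A = U @ np.diag(lam) @ U.T

x_true = rng.normal(size=p)
b = A @ x_true                                  # direct problem
eps = 1e-4 * U[:, -1]                           # perturbation along the weakest direction
x_eps = np.linalg.solve(A, b + eps)             # inverse problem on perturbed data

# Error predicted by the expansion: sum_i (u_i' eps / lam_i) u_i (signed projections).
predicted_error = U @ ((U.T @ eps) / lam)
amplification = np.linalg.norm(x_eps - x_true) / np.linalg.norm(eps)
print(f"||eps|| = {np.linalg.norm(eps):.1e}, "
      f"||x_eps - x*|| = {np.linalg.norm(x_eps - x_true):.3f}")
print(f"amplification = {amplification:.3e}, 1 / lambda_min = {1 / lam.min():.3e}")
```

A perturbation of norm $10^{-4}$ aligned with the smallest eigenvalue is magnified by a factor close to $1/\lambda_{\min}$, while the direct problem would dampen the same component.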


This is why Picard ([93]) elaborated a condition under which the inverse solution is reliable. It states that $x^* = \sum_{i=1}^{r} \frac{|u_i' b|}{\lambda_i} u_i < \infty$ if and only if $|u_i' b|$ decays more rapidly than the corresponding $\lambda_i$ for all $i$, which occurs if $\lambda_i > \tau$ $\forall i$, where $\tau$ is the threshold at which the singular values are levelled by the noise.

If this condition is violated, a regularization method is needed, like the truncated singular value decomposition (TSVD, see [55]), Tikhonov's regression method ([109]) or other regression methods (like the ridge one). This is why the nonasymptotic approach to covariance matrix estimation essentially consists in specifying appropriate regularization problems under suitable conditions for deriving improved error rates, as we will widely describe in the following chapters.
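As an illustration of the two regularization methods just mentioned, the sketch below compares the naive inverse with a truncated and a Tikhonov-type solution on a hypothetical badly conditioned system. It assumes a symmetric positive semidefinite $A$, so that the singular value decomposition coincides with the eigen-decomposition. The function names, the threshold `tau` and the penalty `alpha` are illustrative choices, not quantities taken from [55] or [109].

```python
import numpy as np

def tsvd_solve(A, b, tau):
    """Truncated solution: keep only eigen-directions with eigenvalue above tau."""
    lam, U = np.linalg.eigh(A)                  # A assumed symmetric positive semidefinite
    keep = lam > tau
    return U[:, keep] @ ((U[:, keep].T @ b) / lam[keep])

def tikhonov_solve(A, b, alpha):
    """Minimiser of ||Ax - b||^2 + alpha ||x||^2, via filter factors lam / (lam^2 + alpha)."""
    lam, U = np.linalg.eigh(A)
    return U @ ((lam / (lam**2 + alpha)) * (U.T @ b))

rng = np.random.default_rng(2)                  # arbitrary seed
p = 5
U0, _ = np.linalg.qr(rng.normal(size=(p, p)))
A = U0 @ np.diag([10.0, 5.0, 1.0, 0.5, 1e-6]) @ U0.T
A = (A + A.T) / 2                               # enforce exact symmetry
x_true = rng.normal(size=p)
b_eps = A @ x_true + 1e-4 * U0[:, -1]           # perturbed right-hand side

err_naive = np.linalg.norm(np.linalg.solve(A, b_eps) - x_true)
err_tsvd = np.linalg.norm(tsvd_solve(A, b_eps, tau=1e-3) - x_true)
err_tikh = np.linalg.norm(tikhonov_solve(A, b_eps, alpha=1e-6) - x_true)
print(f"naive: {err_naive:.3f}  TSVD: {err_tsvd:.3f}  Tikhonov: {err_tikh:.3f}")
```

Both regularized solutions give up the component along the weakest direction (introducing some bias) in exchange for not amplifying the noise carried by that direction.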

Note that there is a huge literature dealing with the distribution of eigenvalues. We mention again the Marčenko-Pastur law, which describes the behaviour of the singular values of a rectangular random matrix having Gaussian entries ([79]). Tracy and Widom ([107]) found the limiting distribution of the singular values of a large dimensional random Hermitian matrix. Johnstone ([70]) found out the limiting distribution of the largest eigenvalue in principal component analysis (for $n \le p$, under the assumption of independent normality for the columns of the data matrix) which is proportional to a Wishart of order $1$. A recent work by Chiani ([33]) derived the exact distribution of the largest eigenvalues for real Wishart matrices and Gaussian Orthogonal Ensembles.

The work in [70], in particular, outlined that for large $p$ it can be easier to recover the top $r$ eigenvalues if they are particularly spiked, because the distribution of the $(r + 1)$-th eigenvalue is bounded by a Tracy-Widom law of lower dimensions ($n \times (p - r)$ with respect to $n \times p$). Thus, the $(r + 1)$-th eigenvalue of a set of $p$ eigenvalues where $r$ are spiked is stochastically smaller than the largest eigenvalue of a setting of $(p - r) < p$ non-spiked variables. This fact suggests that large dimensions ($p \to \infty$) can help the recovery of strong eigenvalues and somehow justifies the use of the "scree-plot" to choose the number of eigenvalues.

There are also some results on the distribution of the smallest eigenvalues. We refer to [8] for a general review.

All in all, the problem of reconditioning our covariance matrix estimate is approached differently according to the related asymptotic context. In Chapter 4 we will focus on the non-asymptotic context, outlining various solutions recently provided. Now, we will focus on the description of key covariance estimators in the asymptotic context where both $p$ and $n$ are allowed to tend to $\infty$. The estimator we are about to describe belongs to the class of shrinkage estimators ([68]), which represent a widely used approach in this context as an effective regularization method. It is relevant to note that the distributional assumption of normality is no longer needed, since the approach we are going to describe is distribution-free.

2.3 Shrinkage towards the identity: Ledoit and Wolf's approach

Ledoit and Wolf were the first to derive in [75] a consistent estimator of the covariance matrix in a new asymptotic framework, called general asymptotic framework. They proposed a way to temper the numerical instability of sample eigenvalues, explicitly reconditioning them by shrinkage. The adoption of a new asymptotic framework was needed to ensure that the shrinkage intensity stays positive and does not vanish in the limit. Their estimator is also Bayesian in nature, since it is a combination of a priori and sample information. They call it Empirical Bayesian estimator.

The motivating result of their analysis is reported below.

Theorem 2.3.1. The eigenvalues are the most dispersed diagonal elements that can be obtained by rotation of a symmetric matrix.

The proof exploits the invariance by rotation of the trace.
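Theorem 2.3.1 can be illustrated numerically with the short sketch below. The symmetric matrix and the rotation are randomly generated and purely hypothetical. The diagonal of the rotated matrix has the same mean as the eigenvalues (trace invariance) but a smaller variance.

```python
import numpy as np

rng = np.random.default_rng(3)                   # arbitrary seed
p = 6
B = rng.normal(size=(p, p))
R = (B + B.T) / 2                                # arbitrary symmetric matrix
eigenvalues = np.linalg.eigvalsh(R)

G, _ = np.linalg.qr(rng.normal(size=(p, p)))     # random rotation (orthogonal matrix)
rotated_diag = np.diag(G.T @ R @ G)

# Same mean (trace invariance), but the eigenvalues are more dispersed.
print(f"means:     {eigenvalues.mean():.4f} vs {rotated_diag.mean():.4f}")
print(f"variances: {eigenvalues.var():.4f} vs {rotated_diag.var():.4f}")
```

The inequality holds for any rotation, because the sum of squared diagonal elements cannot exceed the squared Frobenius norm, which equals the sum of squared eigenvalues.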

As a consequence, the largest sample eigenvalues are positively biased, while the smallest are negatively biased, and the bias increases in $p/n$ (recall Theorem 2.2.1). The pattern of sample eigenvalues depends on the Marčenko-Pastur distribution, which holds in the Kolmogorov asymptotic framework. As described, under Kolmogorov asymptotics the ratio $p/n$ tends to a specific constant, while both $p$ and $n$ tend to infinity.

Here we report the solution proposed by Ledoit and Wolf to the described problem. Their idea is to shrink the sample covariance matrix towards the identity matrix, solving the following optimization problem (thus reconditioning the eigenvalues):

\[ \min_{\rho_1, \rho_2} E\left[\|\Sigma - \Sigma^*\|^2\right] \quad \text{s.t.} \quad \Sigma = \rho_1 I_p + \rho_2 \hat{\Sigma}_n, \]

where $\rho_1$ and $\rho_2$ are nonrandom coefficients.

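The theoretical solution to this problem is the optimal linear shrinkage estimator

\[ \Sigma_{LW} = \frac{\beta^2}{\gamma^2} \mu I + \frac{\alpha^2}{\gamma^2} \hat{\Sigma}_n \tag{2.5} \]

with $E\left[\|\Sigma_{LW} - \Sigma^*\|^2\right] = \frac{\alpha^2 \beta^2}{\gamma^2}$, where:

\[ \mu = \langle \Sigma^*, I \rangle; \quad \alpha^2 = \|\Sigma^* - \mu I\|^2; \quad \beta^2 = E\left[\|\hat{\Sigma}_n - \Sigma^*\|^2\right]; \quad \gamma^2 = E\left[\|\hat{\Sigma}_n - \mu I\|^2\right]. \]

Their derivation exploits the natural Pythagorean relationship

\[ \alpha^2 + \beta^2 = \gamma^2. \tag{2.6} \]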

In this view, the ratio $\frac{\beta^2}{\gamma^2}$ is called optimal shrinkage intensity.
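For completeness, a short sketch of how the weights in (2.5) and relationship (2.6) arise is given here. It assumes, as in [75], that $E[\hat{\Sigma}_n] = \Sigma^*$ and that the norm is induced by the inner product $\langle \cdot, \cdot \rangle$ with $\|I\| = 1$. The reparametrisation $\rho_1 = \rho\nu$, $\rho_2 = 1 - \rho$ is introduced only for this sketch.

```latex
\begin{align*}
E\|\rho\nu I + (1-\rho)\hat{\Sigma}_n - \Sigma^*\|^2
  &= \rho^2 \|\nu I - \Sigma^*\|^2 + (1-\rho)^2 E\|\hat{\Sigma}_n - \Sigma^*\|^2
     && \text{(the cross term vanishes since } E[\hat{\Sigma}_n] = \Sigma^*\text{)} \\
  &= \rho^2 \left[ (\nu - \mu)^2 + \alpha^2 \right] + (1-\rho)^2 \beta^2
     && \text{(since } \langle \mu I - \Sigma^*, I \rangle = 0\text{)} .
\end{align*}
% The minimum over \nu is attained at \nu = \mu; setting the derivative in \rho to zero gives
\begin{align*}
2\rho\alpha^2 - 2(1-\rho)\beta^2 = 0
  \;\Longrightarrow\;
  \rho = \frac{\beta^2}{\alpha^2 + \beta^2} = \frac{\beta^2}{\gamma^2},
  \qquad
  1 - \rho = \frac{\alpha^2}{\gamma^2}, \\
E\|\Sigma_{LW} - \Sigma^*\|^2
  = \frac{\beta^4 \alpha^2 + \alpha^4 \beta^2}{\gamma^4}
  = \frac{\alpha^2 \beta^2}{\gamma^2}.
\end{align*}
% The Pythagorean relationship follows from the same orthogonality argument:
\begin{align*}
\gamma^2 = E\|\hat{\Sigma}_n - \mu I\|^2
  = E\|\hat{\Sigma}_n - \Sigma^*\|^2 + \|\Sigma^* - \mu I\|^2
  = \beta^2 + \alpha^2 .
\end{align*}
```

The weight on the identity therefore grows with the estimation error $\beta^2$ of the sample covariance matrix and shrinks with the dispersion $\alpha^2$ of the true eigenvalues.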

The most important interpretation of this approach for our purposes is the following. It is well known (Theorem 2.2.1) that the sample eigenvalues of IID data have bounded error with respect to the true ones, so that, under the condition $p \le n$ ($p$ and $n$ fixed), $\frac{1}{p} E\left(\sum_{i=1}^{p} \hat{\lambda}_i\right) = \frac{1}{p} \sum_{i=1}^{p} \lambda_i$, i.e. the trace of $\Sigma^*$ is unbiasedly estimated.

At the same time, Theorem 2.3.1 shows that sample eigenvalues have a larger dispersion around their grand mean with respect to the true ones (assuming that the eigenvectors are reliable). From (2.6) we can argue that

\[ \frac{1}{p} E\left[\sum_{i=1}^{p} (\hat{\lambda}_i - \mu)^2\right] = \frac{1}{p} \sum_{i=1}^{p} (\lambda_i - \mu)^2 + E\left[\|\hat{\Sigma}_n - \Sigma^*\|^2\right], \]

i.e. the excess dispersion of the sample eigenvalues is the error of the sample covariance matrix. This is why here the authors bound $E\left[\|\hat{\Sigma}_n - \Sigma^*\|^2\right]$ by bounding $\frac{1}{p} E\left[\sum_{i=1}^{p} (\hat{\lambda}_i - \mu)^2\right]$, where $\mu = 1$.

So, $\Sigma_{LW}$ implicitly does the reconditioning of eigenvalues, since

\[ \lambda_{i,LW} = \frac{\beta^2}{\gamma^2} \mu + \frac{\alpha^2}{\gamma^2} \hat{\lambda}_i, \quad \forall i = 1, \ldots, p. \]

$\frac{1}{p} E\left[\sum_{i=1}^{p} (\hat{\lambda}_{i,LW} - \mu)^2\right]$ is equal to $\frac{\alpha^2}{\gamma^2} \alpha^2$, and is even smaller than the dispersion $\alpha^2$ of the true ones, for the reasons described above. Note that this method is very similar in its meaning to the max log-det heuristics for nuclear norm minimization (see [49]).

2.3.1 General Asymptotics

In order to derive a feasible estimator, we now need to get into a new asymptotic framework, since the optimal shrinkage intensity $\beta^2$ vanishes as $\|\hat{\Sigma}_n - \Sigma^*\|^2$ vanishes when $n \to \infty$ in the standard asymptotic framework (as proved in paragraph 2.1.3, see convergence (2.3)). This fact, when $p$ is close to $n$ or even larger, is inconsistent with reality. So, a new asymptotic framework, called General Asymptotics, is needed, where $\beta^2$ does not vanish in the limit.

Consider $n = 1, 2, \ldots$ indexing a sequence of statistical models, and for every $n$, $X_n$ is a $p_n \times n$ matrix of $n$ iid observations on a system of $p_n$ random zero mean variables with covariance matrix $\Sigma_n$.

The following assumption characterizes this context:

A1. There exists a constant $K_1$ independent of $n$ such that $p_n / n \le K_1$.

It is remarkable that in this setting $p$ can change and even go to infinity, but this is not required. Differently from the Kolmogorov asymptotic framework (the one of the Marčenko-Pastur law), it is not even necessary that this ratio tends to a finite constant.

Two further assumptions are needed to derive a consistent estimator of $\Sigma_{LW}$. If $\Sigma_n = \Gamma_n \Lambda_n \Gamma_n'$, the product $Y_n = \Gamma_n' X_n$ is a set of uncorrelated variables spanning the same space as the original variables. The following restrictions on the higher moments of $Y_n$ are imposed:

A2. There exists a constant $K_2$ independent of $n$ such that

\[ \frac{1}{p_n} \sum_{i=1}^{p_n} E\left[(y^n_{i1})^8\right] \le K_2, \]

A3.

\[ \lim_{n \to \infty} \frac{p_n^2}{n^2} \times \frac{\sum_{(i,j,k,l) \in Q_n} \mathrm{Cov}(y_{i1} y_{j1}, y_{k1} y_{l1})}{\text{Cardinal of } Q_n} = 0, \]

where $Q_n$ denotes the set of all the quadruples that are made of four distinct integers between $1$ and $p_n$.

Assumption 2 states that the eighth moment of $y$ is bounded (on average). Assumption 3 states that products of uncorrelated random variables are themselves uncorrelated (on average, in the limit). In the case when general asymptotics degenerate into standard asymptotics ($p/n \to 0$), Assumption 3 is trivially verified as a consequence of Assumption 2.

From what was previously stated, Assumption 3 is verified when the random variables are normally or even elliptically distributed, since the sample covariance of (uncorrelated) normal variables is asymptotically unbiased. Anyway, A3 is much weaker than that situation.

These assumptions are specifically needed to derive the sample counterparts of $\mu$, $\gamma^2$, $\beta^2$.

Note that these two assumptions heavily involve the eigenstructure (eigenvalues and eigenvectors) of the true covariance matrix. Here we need to impose restrictions on eighth moments, because of the particular nature of their optimal weights. Anyway, the need to control the pervasiveness of the latent structure in the covariance matrix is crucial for model recovery. We also underline how much latent factorial assumptions can impact on covariance estimation. This is why we are going to specifically discuss the relationship between factor modelling and covariance estimation in paragraph (2.5).

Under these assumptions, Ledoit and Wolf approach the study of the consistency of their estimator. In their context, the reference norm is $\|A\|_n = \sqrt{\frac{1}{p_n} \mathrm{tr}(AA')}$, such that the identity matrix always has norm one, and the reference cross product is $\langle A_1, A_2 \rangle_n = \frac{1}{p_n} \mathrm{tr}(A_1 A_2')$. The problem of obtaining meaningful absolute rates in high dimensions is another relevant issue. As we will see, in [45] the authors derive asymptotic rates for the relative error matrix (and not the covariance matrix itself). Instead, under the nonasymptotic setting (Chapter 4), we will obtain finite absolute rates, even under the same assumptions of [45].

We are now going to show why the sample covariance matrix is not consistent in this context, differently from the finite $p$ context, where the covariance matrix is asymptotically consistent under the assumption of normality. The authors show that the quantities $\mu_n = \langle \Sigma_n, I \rangle$, $\alpha_n^2 = \|\Sigma_n - \mu_n I\|^2$, $\beta_n^2 = E\left[\|\hat{\Sigma}_n - \Sigma_n\|^2\right]$, $\gamma_n^2 = E\left[\|\hat{\Sigma}_n - \mu_n I\|^2\right]$ are bounded in the general asymptotic framework when $n \to \infty$. Then, they prove the following important Theorem:

Theorem 2.3.2. Define $\theta_n^2 = \mathrm{Var}\left(\frac{1}{p_n} \sum_{i=1}^{p_n} (y^n_{i1})^2\right)$. $\theta_n^2$ is bounded as $n \to \infty$, and we have:

\[ \lim_{n \to \infty} E\left[\|\hat{\Sigma}_n - \Sigma_n\|^2\right] = \frac{p_n}{n} (\mu_n^2 + \theta_n^2). \]

This result states that the sample covariance matrix is not consistent under the general asymptotic framework, since its expected loss is lower bounded by $\frac{p_n}{n} \mu_n^2$, which does not usually vanish. (Recall that $\theta_n^2$ vanishes asymptotically under the assumption of normality, by convergence (2.2)).

There are two interesting exceptions:

• when $\frac{p_n}{n} \to 0$, we fall into the standard asymptotic context, where the sample covariance matrix is consistent. The only difference is that the more general case $p = o(n)$ is allowed, i.e. $p$ is allowed to be unbounded and grow towards infinity;

• $\mu_n^2 \to 0$ and $\theta_n^2 \to 0$. The condition $\mu_n^2 \to 0$ implies that most of the random variables have vanishing variances, i.e. there are $O(n)$ asymptotically degenerate variables. So, if the number of nondegenerate random variables is NOT negligible with respect to the number of observations, the sample covariance matrix is not consistent.

Inconsistency is due to the disequilibrium between the number of data points $n p_n$ and the number of parameters $p_n (p_n + 1)/2$. This is a key point in our analysis, which is left unsolved by the approach of Ledoit and Wolf. In fact, they write that there is no DIRECT consistent estimator of the covariance matrix under the general asymptotics. Their strategy is to derive a consistent estimator of their theoretical estimator, which is proved to have the minimum risk among all the linear combinations of $I_p$ and $\hat{\Sigma}_n$ and is shown to be better conditioned than the sample covariance matrix.

So, shrinkage matters unless $\frac{p_n}{n}$ is negligible with respect to $\frac{\gamma^2}{\mu^2}$, that is, unless the dispersion of sample eigenvalues is much larger than $\frac{p_n}{n}$.

To conclude this section, we are now going to explain how Ledoit and Wolf derive a consistent estimator for $\Sigma_{LW}$.

They introduce sample counterparts of their key quantities:

\[ m_n = \langle \hat{\Sigma}_n, I \rangle_n, \]

\[ d_n^2 = \|\hat{\Sigma}_n - m_n I\|^2, \]

\[ \bar{b}_n^2 = \frac{1}{n^2} \sum_{k=1}^{n} \|x^n_{\cdot k} (x^n_{\cdot k})' - \hat{\Sigma}_n\|^2, \]

\[ b_n^2 = \min(\bar{b}_n^2, d_n^2), \]

\[ a_n^2 = d_n^2 - b_n^2, \]

where $x^n_{\cdot k}$ denotes the $k$-th column of $X_n$.

All these sample counterparts are consistent in the general asymptotic framework, i.e. $m_n$, $a_n^2$, $b_n^2$, $d_n^2$ converge to $\mu_n$, $\alpha_n^2$, $\beta_n^2$, $\gamma_n^2$ respectively in quadratic mean.

Then, their feasible consistent estimator is

\[ \hat{\Sigma}_{LW} = \frac{b_n^2}{d_n^2} m_n I_n + \frac{a_n^2}{d_n^2} \hat{\Sigma}_n \tag{2.7} \]
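A minimal sketch of how (2.7) can be computed is reported here. It is not the authors' code: the function name `ledoit_wolf_identity`, the seed and the simulated data are assumptions made for illustration. The observations are taken to have zero mean (so no centring is applied), and $\bar{b}_n^2$ is computed as $\frac{1}{n^2}$ times the sum of the squared norms, following the definition in [75].

```python
import numpy as np

def ledoit_wolf_identity(X):
    """Shrinkage towards the identity for a p x n matrix X of zero-mean observations."""
    p, n = X.shape
    S = X @ X.T / n                                  # sample covariance (zero mean assumed)
    I = np.eye(p)

    def sq_norm(A):
        return np.sum(A * A) / p                     # ||A||^2 = tr(AA') / p

    m = np.trace(S) / p                              # m_n = <S, I>
    d2 = sq_norm(S - m * I)                          # d_n^2
    b2_bar = sum(sq_norm(np.outer(x, x) - S) for x in X.T) / n**2
    b2 = min(b2_bar, d2)                             # b_n^2
    a2 = d2 - b2                                     # a_n^2
    Sigma_lw = (b2 / d2) * m * I + (a2 / d2) * S
    return Sigma_lw, S, (m, d2, b2, a2)

rng = np.random.default_rng(4)                       # arbitrary seed
p, n = 50, 30                                        # p > n: S is singular
X = rng.normal(size=(p, n))
Sigma_lw, S, (m, d2, b2, a2) = ledoit_wolf_identity(X)
eig_S = np.linalg.eigvalsh(S)
eig_lw = np.linalg.eigvalsh(Sigma_lw)
print(f"shrinkage intensity b2/d2 = {b2 / d2:.3f}")
print(f"smallest eigenvalue: sample {eig_S.min():.2e}, shrunk {eig_lw.min():.3f}")
```

Since the estimator is a convex combination of $m_n I$ and $\hat{\Sigma}_n$, the trace is preserved and each sample eigenvalue is pulled towards $m_n$, so the estimate is positive definite even when $p > n$.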

This estimator is consistent in the general asymptotic framework with respect to $\Sigma_{LW}$, i.e. they share the same asymptotic expected loss. Thus, the expected quadratic loss $\frac{\alpha^2 \beta^2}{\gamma^2}$ can be consistently estimated in quadratic mean by $\frac{a_n^2 b_n^2}{d_n^2}$.

$\hat{\Sigma}_{LW}$ is shown to have an important optimality property: it has the same asymptotic risk as the theoretical optimal linear combination of $\hat{\Sigma}_n$ and $I_n$ with random coefficients. In addition, its condition number is proved to be bounded in probability, which is very important for practical use.

The approach by Ledoit and Wolf is undoubtedly very elegant. However, there is still one main difficulty: their estimator is excessively better conditioned than the true covariance matrix, i.e. it is often too biased, because of the presence of the identity matrix in the estimator. This is why another major point of our dissertation will deal with the need of "unshrinking" the estimated eigenvalues.

In fact, the numerical issue is not the only relevant reason for desiring a well conditioned estimate of the covariance matrix. Deep statistical reasons lie behind this need: we suppose that the true covariance matrix $\Sigma^*$ is well conditioned, that is, there is no multi-collinearity among our $p$ variables. In this respect, a well conditioned estimate is crucial also for fitting purposes, i.e. to improve the statistical properties of the estimate.