Alma Mater Studiorum – Università di Bologna
PhD PROGRAMME IN
Statistical Sciences
Cycle XXVIII
Competition sector (Settore Concorsuale): 13/D1
Scientific disciplinary sector (Settore Scientifico Disciplinare): SECS/S-01
THESIS TITLE
Large covariance matrix estimation by composite minimization
Presented by: Matteo Farnè
PhD Coordinator: Prof. Alessandra Luati
Supervisor: Prof. Angela Montanari
Dipartimento di Scienze Statistiche, Università di Bologna
"Illi autem octo cursus, in quibus eadem vis est duorum, septem efficiunt distinctos intervallis sonos, qui numerus rerum omnium fere nodus est; quod docti homines nervis imitati atque cantibus aperuerunt sibi reditum in hunc locum, sicut alii, qui praestantibus ingeniis in vita humana divina studia coluerunt." Marcus Tullius Cicero, Somnium Scipionis, 6.18
At the end of a particularly exciting and laborious work, I feel the need to thank the people who accompanied me and helped me to complete such a huge effort.
First of all, I wish to express my gratitude to Professor Angela Montanari, my supervisor and mentor, for her continuous support and encouragement, for her valuable advice and suggestions, for her passion, and for the unique chances of growth and maturation she offered me across my whole course of study.
I want to thank Professor Trevor Hastie for the brief but decisive research period spent at Stanford, which gave me the necessary insights to develop my own theory and to effectively provide original solutions to my research problem.
I have to thank Dr Yuan Liao for helpful discussions and precious references, Professor Roberto Rocci for his punctual annotations, Professor Enrico Bernardi for his useful ideas on the mathematical part and Professor Nicola Guglielmi for his strong directions on the algorithmic part.
A particularly warm acknowledgment goes to the Supervisory Statistics Division of the European Central Bank, where I spent a semester as a PhD trainee: they welcomed me and allowed me to complete my PhD studies while developing crucial skills for real data analysis in large dimensions. Specific thanks go to Angelos Vouldis, a wonderful research supervisor, to Micheal Fedesin, for his trust and enthusiasm, and to Patrick Hogan, for his valuable management.
Finally, I want to thank my family, for their long and patient guidance, aid and support, my friends, for they often alleviated this labour, and Silvia, for her continuous, passionate and enthusiastic listening and incitement.
The present thesis concerns large covariance matrix estimation via composite minimization under the assumption of low rank plus sparse structure. Existing methods like POET (Principal Orthogonal complEment Thresholding) perform estimation by extracting principal components and then applying a soft thresholding algorithm. In contrast, our method recovers the low rank plus sparse decomposition of the covariance matrix by least squares minimization under nuclear norm plus $l_1$ norm penalization. This non-smooth convex minimization procedure is based on semidefinite programming and subdifferential methods, resulting in two separable problems solved by a singular value thresholding plus soft thresholding algorithm. The most recent estimator in the literature is called LOREC (LOw Rank and sparsE Covariance estimator) and provides non-asymptotic error rates as well as identifiability conditions in the context of algebraic geometry. Our work shows that the unshrinkage of the estimated eigenvalues of the low rank component improves the performance of LOREC considerably. The same method also recovers covariance structures with very spiked latent eigenvalues like in the POET setting, thus overcoming the necessary condition $p \le n$. In addition, it is proved that our method recovers structures with intermediate degrees of spikiness, obtaining a loss which is bounded accordingly. Then, an ad hoc model selection criterion which detects the optimal point in terms of composite penalty is proposed. Empirical results coming from a wide original simulation study, where various low rank plus sparse settings are simulated according to different parameter values, are described, outlining in detail the improvements upon existing methods. Two real data sets are finally explored, highlighting the usefulness of our method in practical applications.
Keywords: covariance matrix, nuclear norm, thresholding, low rank plus sparse decomposition, unshrinkage.
Acknowledgments
Abstract
1 Introduction
2 State of the art
2.1 Sample covariance matrix estimators
2.1.1 The Maximum Likelihood covariance estimator
2.1.2 The unbiased covariance estimator: fixed $n$ context
2.1.3 Covariance matrix estimation: the IID data context
2.2 Conditioning properties
2.2.1 Matrix conditioning as an ill-posed inverse problem
2.3 Ledoit and Wolf's approach
2.3.1 General Asymptotics
2.4 Sparse covariance matrix estimation
2.5 Factor analysis based estimators
2.5.1 Strict factor model
2.5.2 PCA and factor analysis
2.5.3 Approximate factor model
2.5.4 POET estimator
3 Numerical and computational aspects
3.1 An historical review
3.1.1 $l_1$ norm heuristics
3.1.2 Nuclear norm heuristics
3.1.3 $l_1$ norm plus nuclear norm
3.2 Analytical and algorithmic aspects
3.2.1 Numerical context: a semidefinite program
3.2.2 Solution methods
4 Low rank plus sparse decomposition
4.1 Identification and recovery
4.1.2 Approximate recovery: a functional approach
4.1.3 Approximate recovery: an extended algebraic approach
4.1.4 Approximate recovery: LOREC approach
5 Improving LOREC
5.1 Theoretical advances
5.2 Simulation setting
5.2.1 Simulation algorithm
5.2.2 Simulated settings and comparison quantities
5.2.3 A new model selection criterion
5.3 Data analysis results
5.3.1 Simulation results
5.3.2 Real data results
Introduction
The present thesis concerns large dimensional covariance matrix estimation. Estimation of population covariance matrices from samples of multivariate data is of interest in many high-dimensional inference problems: principal components analysis, classification by discriminant analysis, inferring a graphical model structure, and others. Depending on the goal, the interest is sometimes in inferring the eigenstructure of the covariance matrix (as in PCA) and sometimes in estimating its inverse (as in discriminant analysis or in graphical models). Examples of application areas where these problems arise include gene arrays, fMRI, text retrieval, image classification, spectroscopy, climate studies, finance and macro-economic analysis.
The theory of multivariate analysis for normal variables has been well worked out; see, for example, Anderson ([2]). However, it became apparent that exact expressions were cumbersome, and that multivariate data were rarely Gaussian. The remedy was asymptotic theory for large samples and fixed, relatively small dimensions.
In recent years, data sets that do not fit into this framework have become very common: the data are very high-dimensional and sample sizes can be very small relative to dimension. The most traditional covariance estimator, the sample covariance matrix, is shown to be dramatically ill-conditioned in a large dimensional context, where the process dimension $p$ is close to or even larger than the sample size $n$, even in the case that the true covariance matrix is well-conditioned. Some solutions to this drawback have been proposed in the asymptotic context (for example [75], [15], [45]). An alternative recent approach is by numerical optimization, which provides, in the non-asymptotic context, some solutions improving upon the mentioned ones.
As described in the existing literature, two key properties of the matrix estimation process assume a particular relevance in large dimensions:
1. well conditioning, i.e. numerical stability;
2. identifiability.
Both properties are crucial for the theoretical recovery and the practical use of the estimate. A badly conditioned estimate suffers from collinearity and causes its inverse, the precision matrix, to dramatically amplify any error in the data. A large dimension may make it impossible to identify the unknown covariance structure and difficult to interpret the results.
The first property is strongly related to regularization techniques. A basic reference in this respect is Tibshirani (1996) ([108]), where the LASSO estimation algorithm in the context of regression models was first proposed. The second property can be ensured by dimensionality reduction methods, which can be used to reduce the dimensionality of the parameter space.
Regularization approaches to large covariance matrix estimation have therefore started to be presented in the literature, both from theoretical and practical points of view. Some authors propose shrinkage towards the identity matrix ([75]), others consider tapering the sample covariance matrix, that is, gradually shrinking the off-diagonal elements toward zero ([54]). At the same time, a common approach is to encourage sparsity, either by a penalized likelihood approach ([53]) or by thresholding the sample covariance matrix ([100]).
For this reason, our research studies a specific regularization problem under the assumption of low rank plus sparse decomposition for the covariance matrix. Such a problem is solved by exploiting non-smooth convex optimization methods. This approach allows us to properly address both reconditioning and dimensionality reduction issues and is proved to be effective even in a large dimensional context.
Our dissertation starts from a detailed outline of asymptotic approaches. In Chapter 2, we provide a thorough description of the motivation for our work and a review of some relevant asymptotic methods for covariance estimation. Maximum likelihood estimators and unbiased finite-sample estimators are described ([2]). Specific treatment of the conditioning problem for covariance matrix estimates is given. The covariance shrinkage estimator derived by Ledoit and Wolf in the general asymptotic framework is described ([75]). Sparse covariance estimators are shown together with the underlying assumptions and the estimation error rates, with particular reference to the thresholding estimator of [15]. The POET (Principal Orthogonal complEment Thresholding) estimator ([45]), which combines Principal Component Analysis and thresholding algorithms, is analyzed in detail.
In Chapter 3, we define the above-mentioned regularization problem. It is a nuclear norm plus $l_1$ norm approximation problem, and works under the assumption of low rank plus sparse structure for the covariance matrix. It is composed of a least squares loss and a composite non-smooth penalty, which is the sum of the nuclear norm of the low rank component and the $l_1$ norm of the sparse component.
The numerical rationale behind the problem formulation is provided. It is shown how this problem can be recast from the point of view of numerical analysis as a semidefinite program (SDP). Non-standard optimization tools, such as subgradient minimization methods, are needed to solve it. We describe the most recent solution algorithm and point out its effectiveness.
In Chapter 4, we provide a wide review of existing non-asymptotic methods. The evolution path of the most recent works is traced. The most recent developments of the numerical approach under the assumption of low rank plus sparse structure for the covariance matrix are described, starting from the basic contribution by Chandrasekaran et al. ([30]), which first proves the exact recovery of the covariance matrix in the noiseless context. This result is achieved by minimizing a specific convex non-smooth objective, which is the sum of the nuclear norm of the low rank component and the $l_1$ norm of the sparse component.
Then, the first approximate solution to recovery and identifiability in the noisy context, coming from [1], is described. Next, the extension of [30] providing the first exact solution of the numerical problem in the noisy graphical model setting ([31]) is shown in detail. In that context, the objective is a least squares loss penalized by the above-mentioned composite penalty, and its optimization allows one to recover the inverse covariance matrix. Finally, the extension of this framework to the covariance matrix estimation context, coming from [77], is explained. The resulting estimator is called LOREC (LOw Rank and sparsE Covariance estimator).
In the last chapter (Chapter 5), an improvement over the solution described in [77] is proposed, based on the unshrinkage of the estimated eigenvalues of the low rank component. Luo's approach is completed by deriving the rates of the sparse component estimate, and the conditions for its positive definiteness and invertibility. In addition, the rates of LOREC under the conditions of POET and, more importantly, in a context where the eigenvalues of the low rank component are allowed to grow with $p^{\alpha}$, $\alpha \in [0, 1]$ (generalized spikiness context), are provided.
We then show the results of our procedure on both simulated and real data sets. We illustrate a new model selection criterion which is proved to be effective in our context. An original simulation study is presented, where extensive simulation results are reported together with the simulation algorithm and the estimation assessment framework.
In the end, the performance of our newly proposed estimator is compared to that of LOREC and POET under various settings. Two real examples are provided where our model is effective with respect to the competitors. In particular, the second example is a banking supervisory data set which collects supervisory reporting indicators of the most relevant Euro Area banks. We explicitly thank the Supervisory Statistics Division of the European Central Bank, where the author spent a semester as a PhD trainee, for the permission to use these data in anonymous form for research purposes.
Covariance matrix estimation: state of the art
In this chapter, a short review of existing solutions to the problem of covariance matrix estimation is provided. Particular attention is given to the two properties displayed in the Introduction (well conditioning and identifiability) and to the performance of existing methods in the large dimensional context. An exhaustive review can be found in Pourahmadi (2013) ([95]).
This chapter shows a path across existing estimators aimed at outlining the two mentioned features (well conditioning and identifiability) for each estimation setting, especially when $p$ is large relative to the sample size $n$, or even exceeds it. This is why, for each estimator, a detailed discussion of the asymptotic framework and of the assumptions needed to ensure consistency (i.e. the convergence to the theoretical covariance matrix) is provided.
Existing asymptotic approaches to the estimation problem are described in this chapter, while non-asymptotic approaches will be the object of the next chapters. The description of past approaches is intended to display the main issues encountered by existing methods, with particular reference to the large dimensional context, and the reasons why we need to develop an alternative numerical approach to the covariance estimation problem.
The first section (2.1) is devoted to covariance matrix estimation under the assumption of normality for the data. The maximum likelihood estimator, i.e. the sample covariance matrix, is introduced and justified. The unbiased sample covariance matrix, under the assumption of fixed $n$, is then outlined. A specific remark on the asymptotic distribution of the sample covariance matrix under the assumption of independence and identical distribution for the data concludes the section.
In the second section (2.2) the conditioning properties of the sample covariance matrix are explored. The reason why the sample covariance matrix is badly conditioned when the dimension is close to the sample size is explained and analyzed in depth, as well as the reason why the inverse covariance matrix dramatically amplifies the estimation error in case of bad conditioning.
The third section (2.3) describes at length a successful attempt to address the problem of reconditioning the sample covariance matrix when the dimension is larger than the sample size: the shrinkage estimator by Ledoit & Wolf ([75]). Their motivations, their results and their asymptotic context are highlighted, trying to retain the key elements of their approach.
The fourth section (2.4) briefly outlines existing sparsity estimators, with particular reference to the thresholding estimator by Bickel & Levina ([15]), which is described in detail with respect to model assumptions and convergence rates. There we point out the strong link between sparsity assumptions and shrinkage thresholding. That family of estimators shows how it is possible to use sparsity to recondition the covariance estimate and to significantly reduce the number of parameters.
The fifth section (2.5) describes covariance matrix estimators based on factor model assumptions. A brief overview of factor model specifications and underlying assumptions across history is provided, discussing the different asymptotic contexts. The relationship between Principal Component Analysis (PCA, [72]) and factor modelling (see [59]) is crucial in this respect. Finally, the POET estimator ([45]), based on the assumption of an approximate factor model with a sparse residual matrix, is illustrated at length, pointing out the crucial assumptions for consistency and identifiability.
In [45], the population covariance matrix is assumed to be the sum of a low rank and a sparse component. POET works under the assumption of a sparse residual covariance matrix and pervasive eigenvalues of the low rank component (as $p \to \infty$). This structure is particularly convenient in a large dimensional context, and tackles both the issues mentioned above, as we will explain at length. For the same reasons, the factor analysis assumption is a key to approaching covariance estimation in large dimensions. The asymptotic correspondence between PCA and factor estimation is there established according to the underlying assumptions and then exploited.
Before starting, we describe the basic matrix terminology. We restrict our analysis to the real case. The spectral theorem ensures that, when $M$ is a positive semidefinite square $p$-dimensional real matrix with rank $r$, there exist a $p \times r$ matrix $U$ with orthonormal columns and a diagonal $r \times r$ matrix $\Lambda$ such that
$$M = U \Lambda U' = \sum_{i=1}^{r} \lambda_i u_i u_i', \tag{2.1}$$
which is the eigenvalue decomposition of $M$. The scalars $\lambda_1, \ldots, \lambda_r$ are called the (nonzero) eigenvalues of $M$ and are strictly larger than $0$. The $r$ columns of $U$ are the eigenvectors of $M$. If $M$ is symmetric, the singular values $\sigma_1, \ldots, \sigma_r$, which are the square roots of the eigenvalues of $M'M$, are the absolute values of the eigenvalues of $M$; for a positive semidefinite $M$ they therefore coincide with the eigenvalues. A fortiori, this happens if $M$ is a covariance matrix, which is symmetric and positive definite.
The relevant norms we are going to use throughout the entire thesis are (see also [62]):
• $\|M\|_2 = \sqrt{\sigma_{\max}(M'M)}$ is the spectral norm of $M$, which is its largest singular value.
• $\|M\|_\infty = \max_{i,j} |m_{ij}|$ is the infinity norm of $M$, which is the largest entry in magnitude.
• $\|M\|_F = \sqrt{\mathrm{trace}(M'M)} = \sqrt{\sum_i \sum_j m_{ij}^2}$ is the Frobenius norm of $M$, which is the square root of the sum of the squared entries of $M$.
• $\|M\|_* = \mathrm{trace}(\sqrt{M'M}) = \sum_{i=1}^{p} \sigma_i$ is the sum of the singular values of $M$. $\|M\|_*$ is called nuclear norm. If $M$ is a Positive SemiDefinite (PSD) matrix, $\|M\|_* = \mathrm{tr}(M)$, because the eigenvalues and the singular values exactly coincide.
• $\|M\|_1 = \sum_i \sum_j |m_{ij}|$ is the sum of the absolute values of the entries of $M$.
For a $p$-dimensional vector $x$, the relevant norms for our purpose are:
• $\|x\|_2 = \sqrt{\sum_i x_i^2}$, the Euclidean norm of $x$.
• $\|x\|_1 = \sum_{i=1}^{p} |x_i|$, the $l_1$ norm of $x$.
• $\|x\|_\infty = \max_i |x_i|$, the maximum norm of $x$.
2.1 Sample covariance matrix estimators
In this section we focus on the most widely used estimator of the covariance matrix: the sample covariance matrix. First, we derive it as the maximum likelihood estimator of the covariance matrix under the assumption of multivariate normality for our data (2.1.1). Maximum likelihood estimators are consistent when $n \to \infty$. This is why we then derive the unbiased covariance estimator under the assumption of finite $n$ (2.1.2), which is a slightly modified version of the sample covariance matrix. These two estimators converge asymptotically when $n \to \infty$, under the assumption of fixed $p$. At the end of this section, we give a brief account of the behaviour of this estimator under the assumption of independence and identical distribution for our data when $n \to \infty$ (2.1.3).
Our main reference for this topic is the famous book by Anderson ([2]).
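As a minimal numerical sketch of the two estimators discussed in this section, the snippet below computes the maximum likelihood estimate (normalized by $n$) and the unbiased estimate (normalized by $n-1$) on simulated Gaussian data. The function name `sample_covariances`, the toy covariance `true_cov`, the seed and the sizes are illustrative assumptions, not taken from the thesis.

```python
import numpy as np


def sample_covariances(X):
    """Return (ML estimate, unbiased estimate) for an n x p data matrix X.

    Both estimates share the same scatter matrix D = sum_i (x_i - xbar)(x_i - xbar)';
    they differ only in the normalizing constant (n versus n - 1).
    """
    n = X.shape[0]
    centered = X - X.mean(axis=0)
    D = centered.T @ centered
    return D / n, D / (n - 1)


# Hypothetical toy setting: p = 3 variables, n = 20 Gaussian observations.
rng = np.random.default_rng(0)
p, n = 3, 20
true_cov = np.array([[2.0, 0.5, 0.0],
                     [0.5, 1.0, 0.3],
                     [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)

sigma_ml, sigma_unb = sample_covariances(X)

# numpy.cov uses the n - 1 normalization by default and n when bias=True.
print(np.allclose(sigma_unb, np.cov(X, rowvar=False)))
print(np.allclose(sigma_ml, np.cov(X, rowvar=False, bias=True)))
# The two estimates differ by the factor (n - 1) / n, which tends to 1 as n grows.
print(np.allclose(sigma_ml, (n - 1) / n * sigma_unb))
```

The check against `numpy.cov` is only a sanity test of the normalizations; it says nothing about how close either estimate is to `true_cov` for such a small sample.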
2.1.1 The Maximum Likelihood covariance estimator
Suppose we have a sample $(x_1, \ldots, x_n)$ from a real-valued $p$-dimensional normal random variable $x \sim N_p(\mu^*, \Sigma^*)$, with $p \le n$. The $p \times p$ matrix $\Sigma^* = E((x - \mu^*)(x - \mu^*)')$ is real, symmetric and positive definite (hence invertible), while $\mu^* = E(x)$ is a $p \times 1$ vector.
The density of $x$ is the following:
$$f(x \mid \mu^*, \Sigma^*) = (2\pi)^{-\frac{1}{2}p} \, |\Sigma^*|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} (x - \mu^*)' \Sigma^{*-1} (x - \mu^*) \right\}.$$
The likelihood function is
$$L(\mu^*, \Sigma^*) = \prod_{i=1}^{n} N(x_i \mid \mu^*, \Sigma^*) = (2\pi)^{-\frac{1}{2}pn} \, |\Sigma^*|^{-\frac{1}{2}n} \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} (x_i - \mu^*)' \Sigma^{*-1} (x_i - \mu^*) \right].$$
The log-likelihood is then
$$\log L(\mu^*, \Sigma^*) = -\frac{1}{2} pn \log 2\pi - \frac{1}{2} n \log |\Sigma^*| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu^*)' \Sigma^{*-1} (x_i - \mu^*).$$
We denote by $\hat{\mu}_{ML}$ and $\hat{\Sigma}_{ML}$ the vector and the positive definite matrix maximizing $\log L$. They are the maximum likelihood estimators of $\mu^*$ and $\Sigma^*$. Since $\log L$ is an increasing function of $L$, $\log L$ and $L$ attain their maximum at the same parameter values.
The following important theorem holds:
Theorem 2.1.1. If $x_1, \ldots, x_n$ constitute a sample from $N(\mu^*, \Sigma^*)$ with $p < n$, the maximum likelihood estimators of $\mu^*$ and $\Sigma^*$ are $\hat{\mu}_{ML} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\hat{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ respectively.
The proof can be found in Anderson (1958), page 67 and following. It exploits the properties of the arithmetic mean and of positive definite matrices. The key argument is that $\log L$ can be rewritten in the following way:
$$-\frac{1}{2} pn \log 2\pi - \frac{1}{2} n \log |\Sigma^*| - \frac{1}{2} \mathrm{tr}(\Sigma^{*-1} D) - \frac{1}{2} n (\bar{x} - \mu^*)' \Sigma^{*-1} (\bar{x} - \mu^*),$$
where $D = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$.
In order to perform maximization, the necessary assumption is that $\Sigma^*$ is a positive definite matrix. This condition is necessary to ensure that the term $-\frac{1}{2} n (\bar{x} - \mu^*)' \Sigma^{*-1} (\bar{x} - \mu^*)$ achieves its maximum (zero) at $\mu^* = \bar{x}$ and that the term $-n \log |\Sigma^*| - \mathrm{tr}(\Sigma^{*-1} D)$ achieves its maximum at $\Sigma^* = \frac{1}{n} D$.
ML estimators show a number of interesting optimality properties. In particular, they are consistent and asymptotically efficient ([34]). A theorem by Cramer ensures that $\hat{\mu}_{ML}$ and $\hat{\Sigma}_{ML}$ are minimum variance (asymptotically) unbiased estimators. These properties hold only asymptotically, as $n \to \infty$.
Note that the condition $p < n$ is also necessary in order to perform maximization. In order to see this point, we need to recall a basic theorem ([2], p.77):
Theorem 2.1.2. The maximum likelihood estimator $\hat{\mu}_{ML} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, from $N(\mu^*, \Sigma^*)$, is distributed according to $N(\mu^*, \frac{1}{n} \Sigma^*)$ and independently of $\hat{\Sigma}_{ML} = \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$. $n\hat{\Sigma}$ is distributed according to $\sum_{i=1}^{n-1} z_i z_i'$, where $z_i \sim N(0, \Sigma^*)$, and $z_1, \ldots, z_{n-1}$ are independent.
This theorem states that under the multivariate normality assumption for the data, $n\hat{\Sigma}$ is the sum of $n - 1$ square $p$-dimensional matrices having rank $1$. If $p \ge n$, $n\hat{\Sigma}$ will never have full rank $p$.
In addition, it has been shown by Wishart ([113]) that $D = n\hat{\Sigma}$ is a random matrix having the following distribution:
$$f(D \mid \Sigma^*) = \frac{|D|^{\frac{1}{2}(\nu - p - 1)} \exp\left\{ -\frac{1}{2} \mathrm{tr}(\Sigma^{*-1} D) \right\}}{2^{\frac{1}{2}\nu p} \, \pi^{\frac{p(p-1)}{4}} \, |\Sigma^*|^{\frac{1}{2}\nu} \prod_{i=1}^{p} \Gamma\left[\frac{1}{2}(\nu + 1 - i)\right]},$$
which is a Wishart distribution with $\nu = n - 1$ degrees of freedom, where $\Gamma(t) = \int_{0}^{\infty} x^{t-1} e^{-x} dx$ is the usual Gamma function. The proof is reported in [2] (p.252 and following). It makes heavy use of linear transforms of random variables, and is based on the properties of the Gram-Schmidt orthogonalization algorithm.
This result was first derived for a bivariate distribution by Fisher ([51]), where the distribution of the correlation coefficient (first defined by Karl Pearson in [91]) was also derived.
We can now understand why $p < n$ is a necessary condition. If $n \le p$, $f(D \mid \Sigma^*)$ is no longer a density, so that it is no longer possible to derive the asymptotic distribution for $\hat{\Sigma}$ (i.e., all the usual optimality properties of ML estimators are lost). In fact, $|D|$ would be zero, and the distribution would thus be degenerate, being concentrated on a set of null measure in $R^{p \times p}$. Note also that if $n = p + 1$, $f(D \mid \Sigma^*)$ does not have a mode, analogously to the $\chi^2$ distribution with two degrees of freedom.
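The rank argument above can be checked numerically. The following minimal sketch, under assumed toy sizes ($p = 8$ and a few values of $n$ chosen only for illustration), builds the scatter matrix $D$ from simulated Gaussian data and compares its rank with $\min(p, n - 1)$; the helper name `scatter_matrix` and the tolerance passed to `matrix_rank` are assumptions of this example.

```python
import numpy as np


def scatter_matrix(X):
    """D = sum_i (x_i - xbar)(x_i - xbar)' for an n x p data matrix X."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered


rng = np.random.default_rng(1)
p = 8
ranks = {}
for n in (5, 8, 9, 30):
    X = rng.standard_normal((n, p))
    D = scatter_matrix(X)
    # An explicit tolerance separates numerically zero eigenvalues from small positive ones.
    ranks[n] = int(np.linalg.matrix_rank(D, tol=1e-8))
    print(f"n={n:2d}  rank(D)={ranks[n]}  min(p, n-1)={min(p, n - 1)}")
```

In this sketch $D$ has full rank $p$ only when $n > p$, which is the situation in which the Wishart density above is well defined.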
In the same way, denoting by $T^2$ the quantity $T^2 = n(\bar{x} - \mu^*)' W^{-1} (\bar{x} - \mu^*)$, where $W = \frac{D}{n-1}$, it has been shown by Hotelling ([64]) that
$$\frac{\nu - p + 1}{\nu p} T^2 \sim F,$$
where $F$ is Fisher's distribution with $p$ and $\nu - p + 1$ degrees of freedom ($\nu = n - 1$). The distribution of $T^2$ is called Hotelling's T-squared distribution. It is non-singular if and only if $\hat{\Sigma}$ is non-singular, i.e. if $\Sigma^*$ is positive definite and $\nu - p + 1 > 0$ (equivalent to $n > p$).
So, both the sample mean and the sample covariance matrix are ML estimators of the true mean and the true covariance matrix if and only if the true covariance matrix is positive definite and the dimension $p$ is strictly smaller than the sample size $n$. In particular, the sample covariance matrix is $\hat{\Sigma} = \frac{1}{n} D$ with $D \sim \mathrm{Wishart}(\Sigma^*, n - 1)$, so that $E(\hat{\Sigma}) = \frac{n-1}{n} \Sigma^*$. This means that $\hat{\Sigma}$ is biased if $n$ is finite. Note that this distribution does not change even when the true mean $\mu^*$ is known, unless $\bar{x}$ is replaced by the true $\mu^*$. In that case, the degrees of freedom are $n$ and the resulting estimator ($\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu^*)(x_i - \mu^*)'$) is unbiased.
2.1.2 The unbiased covariance estimator: fixed $n$ context
In order to derive the finite sample unbiased estimator of the covariance matrix, the key result is Theorem 2.1.2 about the distribution of $D = n\hat{\Sigma} = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ shown above. A corollary of that theorem states:
Corollary 2.1.1. Let $x_1, \ldots, x_n$ $(n > p)$ be independently distributed, each according to $N(\mu^*, \Sigma^*)$. The distribution of $\hat{\Sigma}_{\nu} = \frac{1}{\nu} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ is $\mathrm{Wishart}(\frac{1}{\nu}\Sigma^*, \nu)$, where $\nu = n - 1$.
This result means that $\hat{\Sigma}_{n-1} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ is the unbiased estimator of the covariance matrix when the sample size $n$ is finite. This estimator will be the input of our new estimation procedure in Chapter 4. Clearly, $\hat{\Sigma}_{n-1}$ and $\hat{\Sigma}_{n}$ converge asymptotically to the same estimator. We are now going to derive the asymptotic (normal) distribution of the sample covariance matrix in the more general case of IID data.
2.1.3 Covariance matrix estimation: the IID data context
Let us suppose $x_i \sim IID(\mu^*, \Sigma^*)$, $i = 1, \ldots, n$. We want to derive the asymptotic distribution of $\hat{\Sigma}_n = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$. Under the IID hypothesis, we have:
$$E(x_i x_i') = V(x_i) + E(x_i) E(x_i') = \Sigma^* + \mu^* \mu^{*\prime},$$
$$V(x_i x_i') = V(x_i) + V(x_i) = \Sigma^* + \Sigma^* = 2\Sigma^*.$$
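Our target can be rewritten as the sum of three components:
$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' = \sum_{i=1}^{n} \frac{x_i x_i'}{n} - 2 \sum_{i=1}^{n} \frac{\bar{x} x_i'}{n} + \sum_{i=1}^{n} \frac{\bar{x} \bar{x}'}{n}.$$
Since $\sum_{i=1}^{n} \frac{x_i}{n} \xrightarrow{prob} \mu^*$, we have that
$$-2 \bar{x} \sum_{i=1}^{n} \frac{x_i'}{n} + \sum_{i=1}^{n} \frac{\bar{x} \bar{x}'}{n} = -2 \bar{x} \bar{x}' + \bar{x} \bar{x}' = -\bar{x} \bar{x}'$$
converges in probability as follows:
$$-\bar{x} \bar{x}' \xrightarrow{prob} -\mu^* \mu^{*\prime}. \tag{2.2}$$
Now, the first component $\sum_{i=1}^{n} \frac{x_i x_i'}{n}$ can be rewritten as
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{x_i x_i'}{\sqrt{n}}.$$
So, by the Central Limit Theorem, we have
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i x_i' - (\Sigma^* + \mu^* \mu^{*\prime}) \sqrt{n} \xrightarrow{CLT} N(0, 2\Sigma^*).$$
Recalling (2.2), we have that
$$\hat{\Sigma}_n \xrightarrow{distrib} \frac{1}{\sqrt{n}} N(\Sigma^*, 2\Sigma^*). \tag{2.3}$$
These results find confirmation in [58].
2.2 The sample covariance matrix: conditioning properties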
We now briefly discuss matrix conditioning. Let us suppose $p$ and $n$ are fixed. If $n > p$, the expected value of $\hat{\Sigma}_{\nu = n-1}$ is $\Sigma^*$, and the variances of its entries are $V(\hat{\sigma}_{n,ij}) = \frac{\sigma_{ij}^{*2} + \sigma_{ii}^* \sigma_{jj}^*}{n - 1}$. This highlights why the variance of $\hat{\Sigma}_n$ increases as the true condition number of $\Sigma^*$ increases. If the condition number $c = \sigma_{\max} / \sigma_{\min}$ increases, the correlation between the components $x_i$ and $x_j$ increases, because $\Sigma^*$ is closer to collinearity. Consequently, $V(\hat{\sigma}_{n,ij})$ increases, because $\sigma_{ij}^{*2}$ is closer to its maximum, which is $\sigma_{ii}^* \sigma_{jj}^*$ (by the Cauchy-Schwarz inequality).
Coming back to the main point, it is crucial to study the behaviour of the sample eigenvalues. In the matrix estimation context there is a relevant issue about numerical conditioning, i.e. the behaviour of the sample maximum and minimum singular values of a $p \times n$ data matrix $X$.
Theorem 2.2.1 (Theorem ([39])). Given natural numbers $n, p$ with $p < n + 1$, let $X$ be a $p \times n$ matrix with i.i.d. Gaussian entries that have zero mean and variance $\frac{1}{n}$. Then the largest and smallest singular values $\sigma_{\max}(X)$ and $\sigma_{\min}(X)$ are such that
$$\max\left\{ Pr\left[ \sigma_{\max}(X) \ge 1 + \sqrt{\frac{p}{n}} + t \right], \; Pr\left[ \sigma_{\min}(X) \le 1 - \sqrt{\frac{p}{n}} - t \right] \right\} \le \exp\left\{ -\frac{n t^2}{2} \right\},$$
for any $t > 0$.
This theorem was proved by using arguments from random matrix theory and the geometry of Banach spaces. It is an essential result to provide a probabilistic bound for the error distance $\|\hat{\Sigma}_n - \Sigma^*\|_2$, where $\hat{\Sigma}_n = \frac{1}{n} X'X = \frac{1}{n} \sum_{i=1}^{n} x_i x_i'$.
In fact, the following Lemma holds:
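Lemma 2.2.1. Let $\psi = \|\Sigma^*\|_2$. Given any $\delta > 0$ and $\phi > 0$ with $\psi \le 8\phi$, let the number of samples $n$ be such that $n \ge \frac{64 p \phi^2}{\delta^2}$. Then we have that
$$Pr\left[ \|\hat{\Sigma}_n - \Sigma^*\|_2 \ge \delta \right] \le 2 \exp\left\{ -\frac{n \delta^2}{128 \psi^2} \right\}.$$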
This Lemma is based on a specific assumption on $\psi$, the largest eigenvalue of $\Sigma^*$. By appropriately setting the parameter $\psi$, we can obtain the probabilistic bound accordingly.
This Lemma relies on the fact that the spectral norm is unitarily invariant, so that it is possible to assume a diagonal structure for $\hat{\Sigma}$ without loss of generality and then apply the previous Theorem 2.2.1.
It is remarkable that, without further assumptions, $\hat{\Sigma}_n$ is not invertible if $p > n$ (since it is perfectly collinear, clearly having rank at most $n$ and null eigenvalues for the rest). Even if $p \le n$, in the case where the ratio $p/n$ is less than 1 but not negligible, the estimated (maximum and minimum) eigenvalues are numerically unstable, since the probabilistic bound is too large. This may result in bad conditioning (i.e. a too large condition number) for $\hat{\Sigma}_n$. This is why in the Big Data context, when $p$ is very large, it is frequent to have an ill-conditioned sample covariance matrix, since it is difficult to have enough observations to keep the ratio $p/n$ negligible ([75]).
The example in Figure 2.1 clearly outlines the described drawback. The eigenvalues of the covariance matrix of a simulated $n \times p$ process $\epsilon_i = N_p(0, \frac{1}{n} I)$, $p = 100$, $n = [10, 50, 100, 500, 1000, 10000]$ are plotted. The figure displays how the dispersion of the eigenvalues decreases as $p/n$ decreases. All distributions tend to the Marčenko-Pastur distribution, which is proved to be the limiting distribution of the eigenvalues of IID random variables (in the Kolmogorov asymptotic framework, see [79]). The rank is always equal to $\min(p, n - 1)$. If $p = n$, the matrix is thus singular.
Figure 2.1: Eigenvalues of the sample covariance matrix of $\epsilon_i = N_p(0, \frac{1}{n} I)$, $p = 100$, $n$ varying ($n = 10, 50, 100, 500, 1000, 10000$; vertical axis: eigenvalues).
We have provided this simple example to state that, without further assumptions on the eigen-structure (values and vectors) of $\Sigma^*$, the condition $p \le n$ is unavoidable in order to guarantee the positive definiteness (and thus the invertibility) of our covariance estimate. Anyway, the recovery of the eigen-structure of a covariance matrix is strongly related to the underlying assumptions and to the asymptotic context.
We now enumerate three parameter settings relevant for our dissertation:
1. $p$ and $n$ fixed: this is the case of $\hat{\Sigma}_{n-1}$, and of all the numerical estimators we will analyze in the next chapters ([31], [1], [77], [15]);
2. $p$ fixed, $n \to \infty$: this is the case of $\hat{\Sigma}_{ML}$, or of the approximate factor model ([29]);
3. $\frac{p_n}{n} \to c$ when $n \to \infty$: here we find the General asymptotic framework, used by Ledoit and Wolf to ensure the consistency of their estimator ([75]), and the Kolmogorov asymptotic framework (where also $p \to \infty$). The consistency properties of the thresholding estimator ([15]) and of the POET estimator ([45]) are also derived under a similar framework, where a function of $p$ and $n$ tends to 0 while $n \to \infty$. See Sections (2.4) and (2.5) for more explanations.
In the first context, with fixed $p$ and $n$, the outlined results concerning numerical conditioning for the sample covariance matrix hold, and the condition $p \le n$ is unavoidable without further assumptions to derive finite sample bounds. This is why one of the aims of the present work is to exploit results from the third asymptotic framework (in terms of model assumptions) to establish bounds in the finite sample context while dropping the condition $p \le n$.
2.2.1 Matrix conditioning as an ill-posed inverse problem
We now explain in detail why a badly conditioned sample matrix is a fatal drawback for us. The reason lies in the consequences of inverting a badly conditioned matrix.
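To make the notion of bad conditioning concrete before turning to the inverse problem, the following minimal sketch simulates Gaussian data with identity covariance (whose true condition number is 1) and reports the condition number of the sample covariance matrix for $p = 100$ and several sample sizes. It is an illustration under stated assumptions, not the simulation used for the thesis figures: the identity covariance, the seed, the chosen values of $n$ (all larger than $p$) and the helper name `sample_cov_condition_number` are choices made for this example. The ratio of the Marčenko-Pastur support edges, $((1 + \sqrt{p/n}) / (1 - \sqrt{p/n}))^2$, is printed as a reference.

```python
import numpy as np


def sample_cov_condition_number(n, p, rng):
    """Condition number of the unbiased sample covariance of n draws from N_p(0, I)."""
    X = rng.standard_normal((n, p))
    S = np.cov(X, rowvar=False)        # p x p matrix, normalized by n - 1
    eigvals = np.linalg.eigvalsh(S)    # eigenvalues in ascending order
    return eigvals[-1] / eigvals[0]


rng = np.random.default_rng(2)
p = 100
conds = {}
for n in (200, 500, 1000, 10000):
    conds[n] = sample_cov_condition_number(n, p, rng)
    # Ratio of the Marchenko-Pastur support edges for an identity covariance.
    c = p / n
    mp_ratio = ((1 + np.sqrt(c)) / (1 - np.sqrt(c))) ** 2
    print(f"n={n:6d}  p/n={c:.3f}  cond(S)={conds[n]:8.2f}  MP edge ratio={mp_ratio:8.2f}")
```

Even though the true covariance matrix is perfectly conditioned, the sample condition number in this sketch is large when $p/n$ is not negligible and approaches 1 only when $n$ is much larger than $p$.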
Let us now consider the standard linear system $Ax = b$, where $A$ is $p \times p$, and $x$, $b$ are $p \times 1$. If our aim is to derive $b$ (the output), we are solving the direct problem. If our aim is to derive $x$ (the input), we are solving the inverse problem. If $A$ is full rank, Cramer's theorem ensures that the inverse problem has the exact solution $x^* = A^{-1} b$. Otherwise, if $A$ has rank $r < p$, we need to solve the least squares problem
$$\min_{x \in \mathbb{R}^p} \|Ax - b\|^2,$$
and, denoting by $\lambda_i$ and $u_i$ the eigenvalues and eigenvectors of $A$, we have
$$x^* = \sum_{i=1}^{r} \frac{u_i' b}{\lambda_i} u_i \tag{2.4}$$
$$\|Ax^* - b\|^2 = \sum_{i=r+1}^{p} |u_i' b|^2.$$
This fundamental result was proved in [40]. How reliable is the solution $x^*$? Hadamard ([57]) outlined the three characteristics of a well-posed problem:
• existence: the problem admits one solution
• uniqueness: the problem has at most one solution
• stability: the problem is not sensitive to data perturbation.

In our context, if $A$ is full rank, the inverse problem may be ill-posed since it violates the stability condition. If $A$ is not full rank, the inverse problem is ill-posed since it violates the existence and the uniqueness conditions (there are only approximate solutions, no exact ones). The least squares system serves to identify a solution in any case, even when no exact one exists.

Anyway, (2.1) and (2.4) enable us to understand why the inverses of ill-conditioned matrices are numerically unstable. The solution of the direct problem is $Ax = U \Lambda U' x = \sum_{i=1}^{p} \lambda_i (u_i' x) u_i$, which dampens the components corresponding to the smallest eigenvalues of $A$. On the contrary, (2.4) shows us that the solution of the inverse problem amplifies the effects of the same components. If we assume that $b$ is perturbed, i.e. $b_\epsilon = b + \epsilon$, we note that $x_\epsilon = x^* + \sum_{i=1}^{r} \frac{u_i' \epsilon}{\lambda_i} u_i$. So, if $A$ is ill-conditioned (i.e. we have very small eigenvalues), the effect of data perturbation is amplified, and the solution may not be effectively usable in applications.

This is why Picard ([93]) elaborated a condition under which the inverse solution is reliable. It states that
the solution $x^* = \sum_{i=1}^{r} \frac{u_i' b}{\lambda_i} u_i$ has finite norm if and only if $|u_i' b|$ decays more rapidly than the corresponding $\lambda_i$ for all $i$, which occurs if $\lambda_i > \tau$ $\forall i$, where $\tau$ is the threshold at which the singular values are levelled by the noise.

If this condition is violated, a regularization method is needed, like the truncated singular value decomposition (TSVD, see [55]), Tikhonov's regression method ([109]) or other regression methods (like the ridge one). This is why the nonasymptotic approach to covariance matrix estimation essentially consists in specifying appropriate regularization problems under suitable conditions for deriving improved error rates, as we will widely describe in the following chapters.
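The amplification effect described above can be illustrated numerically. The following is only a minimal sketch under assumptions of ours: the matrix size, the eigenvalue grid, the noise level, the truncation threshold `tau` and the Tikhonov parameter `delta` are arbitrary illustrative choices, and the function names `spectral_solutions` and `demo` are hypothetical. It compares the plain inverse solution with a truncated solution and a Tikhonov-type solution, all written through the eigendecomposition of a symmetric positive definite $A$.

```python
import numpy as np


def spectral_solutions(A, b, tau, delta):
    """Solve Ax = b for a symmetric positive definite A in three ways,
    all expressed through the eigendecomposition A = U diag(lam) U'."""
    lam, U = np.linalg.eigh(A)
    coef = U.T @ b                                   # u_i' b for every i
    naive = U @ (coef / lam)                         # divides by every eigenvalue
    keep = lam > tau                                 # truncation: drop lam_i <= tau
    truncated = U[:, keep] @ (coef[keep] / lam[keep])
    tikhonov = U @ (lam * coef / (lam**2 + delta))   # filter lam_i / (lam_i^2 + delta)
    return naive, truncated, tikhonov


def demo(p=20, noise=1e-6, tau=1e-4, delta=1e-8, seed=0):
    """Return the errors of the three solutions when b is slightly perturbed."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))     # random orthonormal eigenvectors
    lam = np.logspace(0, -10, p)                          # eigenvalues from 1 down to 1e-10
    A = (Q * lam) @ Q.T                                   # A = Q diag(lam) Q'
    x_true = Q[:, :3] @ np.array([1.0, -2.0, 0.5])       # lies on the strongest directions
    b_eps = A @ x_true + noise * rng.standard_normal(p)  # perturbed right-hand side
    sols = spectral_solutions(A, b_eps, tau, delta)
    return tuple(np.linalg.norm(x - x_true) for x in sols)


if __name__ == "__main__":
    err_naive, err_trunc, err_tikh = demo()
    print(f"naive: {err_naive:.3e}  truncated: {err_trunc:.3e}  Tikhonov: {err_tikh:.3e}")
```

In this setting the perturbation is of order $10^{-6}$, yet the plain inverse divides it by eigenvalues as small as $10^{-10}$, while both regularized solutions stay close to the true input.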
Note that there is a huge literature dealing with the distribution of eigenvalues. We mention again the Marčenko-Pastur law, which describes the behaviour of the singular values of a rectangular random matrix having Gaussian entries ([79]). Tracy and Widom ([107]) found the limiting distribution of the largest eigenvalue of a large dimensional random Hermitian matrix. Johnstone ([70]) found the limiting distribution of the largest eigenvalue in principal component analysis (for $n \le p$, under the assumption of independent normality for the columns of the data matrix), which, suitably centred and scaled, is a Tracy-Widom law of order $1$. A recent work by Chiani ([33]) derived the exact distribution of the largest eigenvalues for real Wishart matrices and Gaussian Orthogonal Ensembles.

The work in [70], in particular, outlined that for large $p$ it can be easier to recover the top $r$ eigenvalues if they are particularly spiked, because the distribution of the $(r+1)$-th eigenvalue is bounded by a Tracy-Widom law of lower dimensions ($n \times (p-r)$ instead of $n \times p$). Thus, the $(r+1)$-th eigenvalue of a set of $p$ eigenvalues of which $r$ are spiked is stochastically smaller than the largest eigenvalue of a setting of $(p-r) < p$ non-spiked variables. This fact suggests that large dimensions ($p \to \infty$) can help the recovery of strong eigenvalues and somehow justifies the use of the "scree-plot" to choose the number of eigenvalues.

There are also some results on the distribution of the smallest eigenvalues. We refer to [8] for a general review.
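As an illustration of the spreading of sample eigenvalues, the following sketch simulates data with identity population covariance and compares the extreme sample eigenvalues with the edges $(1 \pm \sqrt{p/n})^2$ of the Marčenko-Pastur support. The setting ($p = 100$, two values of $n$, known zero mean) and the function names `sample_eigen_range` and `mp_edges` are assumptions of ours, chosen only for illustration.

```python
import numpy as np


def sample_eigen_range(p, n, seed=0):
    """Smallest and largest eigenvalue of the sample covariance of
    n iid draws from N_p(0, I_p) (the mean is assumed known and zero)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))      # rows are observations
    S = X.T @ X / n                      # p x p sample covariance
    lam = np.linalg.eigvalsh(S)          # eigenvalues in ascending order
    return lam[0], lam[-1]


def mp_edges(p, n):
    """Edges of the Marcenko-Pastur support for identity covariance, p <= n."""
    c = p / n
    return (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2


if __name__ == "__main__":
    for n in (200, 10000):
        lo, hi = sample_eigen_range(100, n)
        e_lo, e_hi = mp_edges(100, n)
        print(f"n={n}: sample range [{lo:.3f}, {hi:.3f}], MP edges [{e_lo:.3f}, {e_hi:.3f}]")
```

All true eigenvalues are equal to one here, so any spread in the output is due to sampling alone, and it widens as $p/n$ grows.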
All in all, the problem of reconditioning our covariance matrix estimate is approached differently according to the related asymptotic context. In Chapter 4 we will focus on the non-asymptotic context, outlining various solutions recently provided. Now, we will focus on the description of key covariance estimators in the asymptotic context where both $p$ and $n$ are allowed to tend to $\infty$. The estimator we are about to describe belongs to the class of shrinkage estimators ([68]), which represent a widely used approach in this context as an effective regularization method. It is relevant to note that the distributional assumption of normality is no longer needed, since the approach we are going to describe is distribution-free.
2.3 Shrinkage towards the identity: Ledoit and Wolf's approach

Ledoit and Wolf were the first to derive in [75] a consistent estimator of the covariance matrix in a new asymptotic framework, called the general asymptotic framework. They proposed a way to temper the numerical instability of sample eigenvalues, explicitly reconditioning them by shrinkage. The adoption of a new asymptotic framework was needed to ensure that the shrinkage intensity is positive and does not vanish in the limit. Their estimator is also Bayesian in nature, since it is a combination of a priori and sample information. They call it the Empirical Bayesian estimator.

The motivating result of their analysis is reported below.

Theorem 2.3.1. The eigenvalues are the most dispersed diagonal elements that can be obtained by rotation of a symmetric matrix.

The proof exploits the invariance of the trace under rotation.
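The argument behind Theorem 2.3.1 can be sketched in a few lines. The symbols $R$, $G$, $r_{ij}$ and $m$ are introduced here only for illustration and are not part of the original statement.

```latex
% R: symmetric p x p matrix with eigenvalues \lambda_1, ..., \lambda_p
% G: any rotation (orthogonal matrix); r_{ij}: entries of G'RG
\begin{align*}
\sum_{i=1}^{p} r_{ii} &= \operatorname{tr}(G'RG) = \operatorname{tr}(R)
  = \sum_{i=1}^{p} \lambda_i
  && \text{(common mean } m = \operatorname{tr}(R)/p \text{)} \\
\sum_{i=1}^{p} r_{ii}^2 &\le \sum_{i,j=1}^{p} r_{ij}^2
  = \operatorname{tr}\big((G'RG)^2\big) = \operatorname{tr}(R^2)
  = \sum_{i=1}^{p} \lambda_i^2 \\
\sum_{i=1}^{p} (r_{ii} - m)^2 &= \sum_{i=1}^{p} r_{ii}^2 - p\,m^2
  \le \sum_{i=1}^{p} \lambda_i^2 - p\,m^2
  = \sum_{i=1}^{p} (\lambda_i - m)^2
\end{align*}
```

Equality holds exactly when all the off-diagonal entries $r_{ij}$ vanish, that is when $G$ diagonalises $R$.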
As a consequence of Theorem 2.3.1, the largest sample eigenvalues are positively biased, while the smallest are negatively biased, and the bias increases in $p/n$ (recall Theorem 2.2.1). The pattern of sample eigenvalues depends on the Marčenko-Pastur distribution, which holds in the Kolmogorov asymptotic framework. As described, under Kolmogorov asymptotics the ratio $p/n$ tends to a specific constant, while both $p$ and $n$ tend to infinity.

Here we report the solution proposed by Ledoit and Wolf to the described problem. Their idea is to shrink the sample covariance matrix towards the identity matrix, solving the following optimization problem (thus reconditioning the eigenvalues):
$$\min_{\rho_1, \rho_2} E[\|\Sigma - \Sigma^*\|^2] \quad \text{s.t. } \Sigma = \rho_1 I_p + \rho_2 \hat\Sigma_n,$$
where $\rho_1$ and $\rho_2$ are nonrandom coefficients.

The theoretical solution to this problem is the optimal linear shrinkage estimator
$$\Sigma_{LW} = \frac{\beta^2}{\gamma^2} \mu I + \frac{\alpha^2}{\gamma^2} \hat\Sigma_n \tag{2.5}$$
with $E[\|\Sigma_{LW} - \Sigma^*\|^2] = \frac{\alpha^2 \beta^2}{\gamma^2}$, where:
$$\mu = \langle \Sigma^*, I \rangle; \quad \alpha^2 = \|\Sigma^* - \mu I\|^2; \quad \beta^2 = E[\|\hat\Sigma_n - \Sigma^*\|^2]; \quad \gamma^2 = E[\|\hat\Sigma_n - \mu I\|^2].$$
Their derivation exploits the natural Pythagorean relationship
$$\alpha^2 + \beta^2 = \gamma^2. \tag{2.6}$$
In this view, the ratio $\frac{\beta^2}{\gamma^2}$ is called the optimal shrinkage intensity.

The most important interpretation of this approach for our purposes is the following. It is well known (Theorem 2.2.1) that the sample eigenvalues of IID data have bounded error with respect to the true ones, so that, under the condition $p \le n$ ($p$ and $n$ fixed), $\frac{1}{p} E\left(\sum_{i=1}^{p} \hat\lambda_i\right) = \frac{1}{p} \sum_{i=1}^{p} \lambda_i$, i.e. the trace of $\Sigma^*$ is unbiasedly estimated.

At the same time, Theorem 2.3.1 shows that sample eigenvalues have a larger dispersion around their grand mean than the true ones (assuming that the eigenvectors are reliable). From (2.6) we can argue that
$$\frac{1}{p} E\left[\sum_{i=1}^{p} (\hat\lambda_i - \mu)^2\right] = \frac{1}{p} \sum_{i=1}^{p} (\lambda_i - \mu)^2 + E[\|\hat\Sigma_n - \Sigma^*\|^2],$$
i.e. the excess dispersion of the sample eigenvalues is the error of the sample covariance matrix. This is why here the authors bound $E[\|\hat\Sigma_n - \Sigma^*\|^2]$ by bounding $\frac{1}{p} E\left[\sum_{i=1}^{p} (\hat\lambda_i - \mu)^2\right]$, where $\mu = 1$.

So, $\Sigma_{LW}$ implicitly does the reconditioning of eigenvalues, since
$$\lambda_{i,LW} = \frac{\beta^2}{\gamma^2} \mu + \frac{\alpha^2}{\gamma^2} \hat\lambda_i, \quad \forall i = 1, \ldots, p.$$
The dispersion $\frac{1}{p} E\left[\sum_{i=1}^{p} (\lambda_{i,LW} - \mu)^2\right]$ is equal to $\frac{\alpha^2}{\gamma^2} \alpha^2$, and is even smaller than the dispersion $\alpha^2$ of the true ones, since $\alpha^2 \le \gamma^2$ by (2.6). Note that this method is very similar in its meaning to the max log-det heuristics for nuclear norm minimization (see [49]).

2.3.1 General Asymptotics
In order to derive a feasible estimator, we now need to get into a new asymptotic framework, since $\beta^2$, and with it the optimal shrinkage intensity, vanishes as $\|\hat\Sigma_n - \Sigma^*\|^2$ vanishes when $n \to \infty$ in the standard asymptotic framework (as proved in paragraph 2.1.3, see convergence (2.3)). This fact, when $p$ is close to $n$ or even larger, is inconsistent with reality. So, a new asymptotic framework, called General Asymptotics, is needed, where $\beta^2$ does not vanish in the limit.
Consider $n = 1, 2, \ldots$ indexing a sequence of statistical models, where for every $n$, $X_n$ is a $p_n \times n$ matrix of $n$ iid observations on a system of $p_n$ random zero mean variables with covariance matrix $\Sigma_n$.

The following assumption characterizes this context:
A1. There exists a constant $K_1$ independent of $n$ such that $p_n/n \le K_1$.

It is remarkable that in this setting $p_n$ can change and even go to infinity, but this is not required. Differently from the Kolmogorov asymptotic framework (the one of the Marčenko-Pastur law), it is not even necessary that this ratio tends to a finite constant.

Two further assumptions are needed to derive a consistent estimator of $\Sigma_{LW}$. If $\Sigma_n = \Gamma_n \Lambda_n \Gamma_n'$, the product $Y_n = \Gamma_n' X_n$ is a set of uncorrelated variables spanning the same space as the original variables. The following restrictions on the higher moments of $Y_n$ are imposed:

A2. There exists a constant $K_2$ independent of $n$ such that
$$\frac{1}{p_n} \sum_{i=1}^{p_n} E[(y^n_{i1})^8] \le K_2.$$

A3.
$$\lim_{n \to \infty} \frac{p_n^2}{n^2} \times \frac{\sum_{(i,j,k,l) \in Q_n} \left(\mathrm{Cov}(y^n_{i1} y^n_{j1}, y^n_{k1} y^n_{l1})\right)^2}{\text{Cardinal of } Q_n} = 0,$$
where $Q_n$ denotes the set of all the quadruples that are made of four distinct integers between $1$ and $p_n$.

Assumption 2 states that the eighth moment of $y$ is bounded (on average). Assumption 3 states that products of uncorrelated random variables are themselves uncorrelated (on average, in the limit). In the case when general asymptotics degenerate into standard asymptotics ($p/n \to 0$), Assumption 3 is trivially verified as a consequence of Assumption 2.

From what was previously stated, Assumption 3 is verified when the random variables are normally or even elliptically distributed, since the sample covariance of (uncorrelated) normal variables is asymptotically unbiased. Anyway, A3 is much weaker than that situation.
These assumptions are specifically needed to derive the sample counterparts of $\mu$, $\gamma^2$, $\beta^2$.

Note that these two assumptions heavily involve the eigenstructure (eigenvalues and eigenvectors) of the true covariance matrix. Here we need to impose restrictions on eighth moments, due to the particular nature of their optimal weights. Anyway, the need to control the pervasiveness of the latent structure in the covariance matrix is crucial for model recovery. We also underline how much latent factorial assumptions can impact on covariance estimation. This is why we are going to specifically discuss the relationship between factor modelling and covariance estimation in paragraph (2.5).
Under these assumptions, Ledoit and Wolf approach the study of the consistency of their estimator. In their context, the reference norm is $\|A\|_n^2 = \frac{1}{p_n} \mathrm{tr}(AA')$, such that the identity matrix always has norm one, and the reference cross product is $\langle A_1, A_2 \rangle_n = \frac{1}{p_n} \mathrm{tr}(A_1 A_2')$. The problem of obtaining meaningful absolute rates in high dimensions is another relevant issue. As we will see, in [45] the authors derive asymptotic rates for the relative error matrix (and not for the covariance matrix itself). Instead, under the nonasymptotic setting (Chapter 4), we will obtain finite absolute rates, even under the same assumptions of [45].

We are now going to show why the sample covariance matrix is not consistent in this context, differently from the finite $p$ context, where the sample covariance matrix is asymptotically consistent under the assumption of normality. The authors show that the quantities $\mu_n = \langle \Sigma_n, I \rangle$, $\alpha_n^2 = \|\Sigma_n - \mu_n I\|^2$, $\beta_n^2 = E[\|\hat\Sigma_n - \Sigma_n\|^2]$, $\gamma_n^2 = E[\|\hat\Sigma_n - \mu_n I\|^2]$ are bounded in the general asymptotic framework when $n \to \infty$. Then, they prove the following important Theorem:

Theorem 2.3.2. Define $\theta_n^2 = \mathrm{Var}\left(\frac{1}{p_n} \sum_{i=1}^{p_n} (y^n_{i1})^2\right)$. $\theta_n^2$ is bounded as $n \to \infty$, and we have:
$$\lim_{n \to \infty} \left\{ E[\|\hat\Sigma_n - \Sigma_n\|^2] - \frac{p_n}{n} (\mu_n^2 + \theta_n^2) \right\} = 0.$$
This result states that the sample covariance matrix is not consistent under the general asymptotic framework, since its expected loss is asymptotically lower bounded by $\frac{p_n}{n} \mu_n^2$, which does not usually vanish. (Recall that $\theta_n^2$ vanishes asymptotically under the assumption of normality, by convergence (2.2)).

There are two interesting exceptions:
• when $\frac{p_n}{n} \to 0$, we fall into the standard asymptotic context, where the sample covariance matrix is consistent. The only difference is that the more general case $p = o(n)$ is allowed, i.e. $p$ is allowed to be unbounded and grow towards infinity;
• $\mu_n^2 \to 0$ and $\theta_n^2 \to 0$. $\mu_n^2 \to 0$ implies that most of the random variables have vanishing variances, i.e. there are $O(n)$ asymptotically degenerate variables.

So, if the number of nondegenerate random variables is NOT negligible with respect to the number of observations, the sample covariance matrix is not consistent.

Inconsistency is due to the disequilibrium between the number of data-points $n p_n$ and the number of parameters $p_n (p_n + 1)/2$. This is a key point in our analysis, which is left unsolved by the approach of Ledoit and Wolf. In fact, they write that there is no DIRECT consistent estimator of the covariance matrix under the general asymptotics. Their strategy is to derive a consistent estimator of their theoretical estimator, which is proved to have the minimum risk among all the linear combinations of $I_p$ and $\hat\Sigma_n$ and is shown to be better conditioned than the sample covariance matrix.

So, shrinkage matters unless $\frac{p_n}{n}$ is negligible with respect to $\frac{\gamma^2}{\mu^2}$, i.e. unless the dispersion of sample eigenvalues is much larger than $\frac{p_n}{n}$.

To conclude this section, we are now going to explain how Ledoit and Wolf derive a consistent estimator for $\Sigma_{LW}$.

They introduce sample counterparts of their key quantities:
$$m_n = \langle \hat\Sigma_n, I \rangle_n,$$
$$d_n^2 = \|\hat\Sigma_n - m_n I\|^2,$$
$$\bar b_n^2 = \frac{1}{n^2} \sum_{k=1}^{n} \|x^n_{\cdot k} (x^n_{\cdot k})' - \hat\Sigma_n\|^2,$$
$$b_n^2 = \min(\bar b_n^2, d_n^2),$$
$$a_n^2 = d_n^2 - b_n^2,$$
where $x^n_{\cdot k}$ denotes the $k$-th column of $X_n$.

All these sample counterparts are consistent in the general asymptotic framework, i.e. $m_n$, $a_n^2$, $b_n^2$, $d_n^2$ converge to $\mu_n$, $\alpha_n^2$, $\beta_n^2$, $\gamma_n^2$ respectively in quadratic mean.

Then, their feasible consistent estimator is
$$\hat\Sigma_{LW} = \frac{b_n^2}{d_n^2} m_n I_n + \frac{a_n^2}{d_n^2} \hat\Sigma_n \tag{2.7}$$
This estimator is consistent in the general asymptotic framework with respect to $\Sigma_{LW}$, i.e. they share the same asymptotic expected loss. Thus, the expected quadratic loss $\frac{\alpha^2 \beta^2}{\gamma^2}$ can be consistently estimated in quadratic mean by $\frac{a_n^2 b_n^2}{d_n^2}$.

$\hat\Sigma_{LW}$ is shown to have an important optimality property: it has the same asymptotic risk as the theoretical optimal linear combination of $\hat\Sigma_n$ and $I_n$ with random coefficients. In addition, its condition number is proved to be bounded in probability, which is very important for practical use.

The approach by Ledoit and Wolf is undoubtedly very elegant. However, there is still one main difficulty: their estimator is excessively better conditioned than the true covariance matrix, i.e. it is often too biased, due to the presence of the identity matrix in the estimator. This is why another major point of our dissertation will deal with the need of "unshrinking" the estimated eigenvalues.
In fact, the numerical issue is not the only relevant reason for desiring a well conditioned estimate of the covariance matrix. Deep statistical reasons lie behind this need: we suppose that the true covariance matrix $\Sigma^*$ is well conditioned, that is there is no multi-collinearity among our