Christophe Biernacki, Cathy Maugis
To cite this version:
Christophe Biernacki, Cathy Maugis.
High-dimensional clustering.
Choix de mod`
eles et
agr´
egation, Sous la direction de J-J. DROESBEKE, G. SAPORTA, C. THOMAS-AGNAN
Edition: Technip., 2015. <hal-01252673v2>
HAL Id: hal-01252673
https://hal.archives-ouvertes.fr/hal-01252673v2
Submitted on 12 Jan 2016
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not.
The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destin´
ee au d´
epˆ
ot et `
a la diffusion de documents
scientifiques de niveau recherche, publi´
es ou non,
´
emanant des ´
etablissements d’enseignement et de
recherche fran¸cais ou ´
etrangers, des laboratoires
publics ou priv´
es.
2 High-dimensional lustering
ChristopheBierna ki andCathyMaugis-Rabusseau 1
2.1 Introdu tion. . . 1
2.2 HD lustering: Curseorblessing? . . . 4
2.2.1 HDdensityestimation: Curse. . . 4
2.2.2 HD lustering: Amix of urseandblessing . . . 6
2.2.3 Intermediate on lusion . . . 8
2.3 Non- anoni almodels . . . 10
2.3.1 Gaussianmixtureoffa toranalysers . . . 10
2.3.2 HDGaussianmixturemodels . . . 11
2.3.3 Fun tionaldata. . . 12
2.3.4 Intermediate on lusion . . . 16
2.4 Canoni almodels . . . 16
2.4.1 Parsimoniousmixturemodels . . . 17
2.4.2 Variable sele tionthroughregularization. . . 20
2.4.3 Variable rolemodelling . . . 24
2.4.4 Co- lustering . . . 27
2.4.5 Intermediate on lusion . . . 33
2.5 Futuremethodologi al hallenges . . . 35
High-dimensional lustering
Christophe Bierna ki and Cathy Maugis-Rabusseau
2.1 Introdu tion
High-dimensional(HD)datasetsarenowfrequent,mostlymotivated by
te h-nologi al reasons whi h on ern automation in variable a quisition, heaper
availability of data storage and more powerful standard omputers for qui k
data management possibility. All elds are impa ted by this general
phe-nomenon ofvariable numberination, only thedenition ofhigh being
do-maindependent. Inmarketing,thisnumber anbeoforder
10
2
,inmi roarray
gene expression between
10
2
and10
4
, in text mining10
3
or more, of order10
6
forsinglenu leotidepolymorphism(SNP)data, et . Note alsothat
some-timesmu hmorevariables anbeinvolved,what anbetypi allythe asewith
dis retized urves,forinstan e urves omingfrom temporalsequen es.
Here aretworelatedillustrations. Figure 2.1(a)displaysatext mining
ex-ample 1
. It mixesMedline (1033medi alabstra ts) andCraneld (1398
aero-nauti al abstra ts) making a total of 2431 do uments. Furthermore, all the
words(ex ludingstopwords)are onsideredasfeaturesmakingatotalof
9275
uniquewords. Thedatamatrix onsistsof do umentsontherowsandwords
onthe olumnswithea hentrygivingthetermfrequen y,thatisthenumberof
o urren esof orrespondingword in orrespondingdo ument. Figure2.1(b)
displaysa urveexample. ThisKneadingdataset omesfromDanoneVitapole
ParisResear hCenterand on ernsthequalityof ookiesandtherelationship
withtheourkneadingpro ess(Lévéderetal.[2004℄). Itis omposed by115
dierentoursforwhi hthedoughresistan eismeasuredduringthekneading
pro essfor480se onds. Wenoti ethattheequispa edinstantsoftimeinthe
interval [0; 480℄ (here 241 measures) ould be mu h more large than 241 if
measuresweremorefrequentlyre orded.
(a) (b)
Figure2.1: Examplesofhigh-dimensionaldatasets: (a) Textmining:
n = 2431
do umentsandthefrequen ythatd = 9275
uniquewordso urs inea hdo ument(awhiter ellindi atesahigherfrequen y);(b)Curves:
n = 115
kneading urvesobservedatd = 241
equispa edinstantsoftimeintheinterval[0;480℄.
Su hate hnologi alrevolutionhasahugeimpa tinother s ienti elds,
as so ietal or also mathemati al ones. In parti ular, high-dimensional data
managementbrings somenew hallenges to statisti ianssin estandard
(low-dimensional)dataanalysismethodsstruggletodire tlyapplytothenew
(high-dimensional)datasets. Thereason anbetwofold,sometimeslinked,involving
either ombinatorialdi ultiesordisastrouslylargeestimatevarian ein rease.
Dataanalysismethodsareessentialforprovidingasyntheti viewofdatasets,
allowing data summary and data exploratory for future de ision making for
instan e. Thisneedisevenmorea uteinthehigh-dimensionalsettingsin eon
theonehand thelargenumberofvariablessuggeststhat alot ofinformation
is onveyedby databut, in theother hand, su h information maybehidden
behindtheirvolume.
Clusteranalysisisoneofthemaindataanalysismethod. Itaimsat
parti-tioningadataset
x
= (x
1
, . . . , x
n
)
, omposedbyn
individuals andlyingin aspa e
X
ofdimensiond
intoK
groupsG
1
, . . . , G
K
. This partitionis denotedby
z
= (z
1
, . . . , z
n
)
, lying in aspa eZ
, wherez
i
= (z
i1
, . . . , z
iK
)
′
isave tor
of
{0, 1}
K
su hthat
z
ik
= 1
ifindividualx
i
belongstothek
thgroupG
k
,andz
ik
= 0
otherwise(i = 1, . . . , n
,k = 1, . . . , K
). Figure 2.2givesanillustrationof this prin iple when
d = 2
. Model-based lustering allows to reformulateluster analysis as a well-posed estimation problem both for the partition
z
andfor thenumber
K
of groups. It onsiders datax
1
, . . . , x
n
asn
i.i.d.real-izationsofamixture pdf
f (
·; θ
K
) =
P
K
k=1
π
k
f (
·; α
k
)
,wheref (
·; α
k
)
indi atesthe pdf, parameterizedby
α
k
, asso iatedto the groupk
, whereπ
k
indi atesthemixture proportionof this omponent(
P
K
k=1
π
k
= 1
,π
k
≥ 0
)and whereθ
K
= (π
k
, α
k
, k = 1, . . . , K)
indi atesthewholemixtureparameters. Fromthewholedataset
x
itisthenpossibletoobtainamixtureparameterestimateθ
ˆ
K
It is also possible to derivean estimate
K
ˆ
from an estimate of the marginalprobability
f (x
ˆ
|K)
. Moredetailsonmixturemodels,relatedestimationofθ
K
,z
andK
aregiventhroughout Chapter??.−2
0
2
4
−2
0
2
4
X
1X
2−2
0
2
4
−2
0
2
4
X
1X
2x
= (x
1
, . . . , x
n
)
−→
ˆ
z
= (ˆ
z
1
, . . . , ˆ
z
n
)
,K = 3
ˆ
Figure 2.2: The lusteringpurposeillustratedinthetwo-dimensionalsetting.
Beyond the ni e mathemati al ba kground it provides, model-based
lus-teringhasledalsotonumerousandsigni antpra ti alsu essesin the
low-dimensional settingasChapter??relates,withreferen estherein. Extending
thegeneralframeworkofmodel-based lusteringtothehigh-dimensional
set-tingisthusanaturalanddesirablepurpose. Inprin iple,themoreinformation
we haveaboutea h individual, the better a lusteringmethod is expe ted to
perform. Howeverthestru tureofinterestmayoftenbe ontainedinasubset
oftheavailablevariablesandalotofvariablesmaybeuselessorevenharmful
to dete tareasonable lustering stru ture. It isthus importantto sele tthe
relevantvariablesfrom the luster analysis viewpoint. Itis are entresear h
topi in ontrast to variable sele tion in regression and lassi ation models
(KohaviandJohn[1997℄;GuyonandElissee [2003℄;Miller[1990℄). This new
interestforvariablesele tionin lustering omesfromthein reasinglyfrequent
useofthesemethodsonhigh-dimensionaldatasets,su hastrans riptomedata
sets.
Threetypesofapproa hesdealingwithvariablesele tionin lusteringhave
beenproposed. The rstonein ludes lustering methods withweighted
vari-ables(seeforinstan eFriedmanandMeulman[2004℄)anddimensionredu tion
methods. Forthislater,M La hlanetal.[2002℄useamixtureoffa tor
analyz-erstoredu etheextremelyhighdimensionalityofageneexpressionproblem. A
suitableGaussianmixturefamilyis onsideredinBouveyronetal.[2007℄totake
into a ountthedimensionredu tionand thedata lustering simultaneously.
In ontrastto thisrstmethod type,thelast twoapproa hessele texpli itly
relevantvariables. Theso- alledlterapproa hessele tthevariablesbeforea
lusteringanalysis(seeforinstan eDashetal.[2002℄;JouveandNi oloyannis
variable sele tion and lustering. For distan e-based methods, one an ite
Fowlkes et al. [1988℄ for a forward sele tion approa h with omplete linkage
hierar hi al lustering,DevaneyandRam[1997℄whoproposeastepwise
algo-rithmwhere thequalityof thefeature subsetsis measuredwith the obweb
algorithm or the method of Brus o and Cradit [2001℄based on the adjusted
Rand index for
K
-means lustering. There exists also wrapper methods inthe model-based lustering setting. Whenthe number of variablesis greater
thanthenumberof individuals,Tadesse etal.[2005℄proposeafullyBayesian
method usingareversiblejump algorithm to simultaneously hoosethe
num-berofmixture omponentsandsele tvariables. Kimetal.[2006℄useasimilar
approa h byformulating lustering intermsof Diri hletpro ess mixtures. In
Gaussian mixture model lustering, Lawet al. [2004℄proposeto evaluatethe
importan eofthevariablesinthe lusteringpro essviafeaturesalien iesand
use the Minimum Message Length riterion. Raftery and Dean [2006℄re ast
theproblem of omparing twonestedvariable subsetsasamodel omparison
problemandaddressitusingBayesfa tor. Aninterestingaspe toftheirmodel
formulationis that irrelevantvariables are notrequired to be independentof
the lusteringvariables. Theyavoidthustheunrealisti independen e
assump-tionbetweentherelevantandirrelevantvariablesforthe lustering, onsidered
inTadesseetal.[2005℄,Kimetal.[2006℄andLawetal.[2004℄.Intheirmodel,
the whole irrelevant variable subset depends on the whole relevant variables
throughalinearregressionequation. However,somerelevantvariablesarenot
ne essarilyrequired to explain allirrelevant variables in the linearregression
andtheirintrodu tioninvolvesadditionalparameterswithoutasigni ant
in- rease ofthe loglikelihood. Therelated extensionsproposedby Maugiset al.
[2009a,b℄followthisremark.
Many model proposals already exist, in luding asso iated parameter
esti-mation and, sometimes, spe i model sele tion strategies. We will divide
these models into anoni al and non- anoni al ones, indi ating if parameter
onstraintsare respe tivelydened relatively to the initial data spa eor
rel-ativelyto atransformation (afa torialmappingtypi ally). Beforepresenting
su hmodels, andtheirrelatedmodelsele tionpro ess, wedrawwhatarethe
pros(blessing) andthe ons( urse)ofhavingmanyvariablesforperforminga
lusteranalysispro ess.
2.2 HD lustering: Curse or blessing?
2.2.1 HD density estimation: Curse
Inthe previousse tion, weprovided someexamplesofhigh-dimensionaldata
sets. Inthe presentse tion, the aim is to givea somewhat moretheoreti al
ontheparametri ases. Italsoreliesonsomeasymptoti arguments. Remind
thatwe onsideradataset
x
= (x
1
, . . . , x
n
)
,x
i
beingdes ribedbyd
variables.Non-parametri ase
In the non-parametri situation, usually
x
i
is onsidered to rely in ahigh-dimensional spa eassoonas
n = o e
d
,thusassoonasthelogarithmofthe
samplesize,
ln n
,isnegligiblebesidethespa edimensiond
. Arstjusti ationof this laim is given by Bellman [1961℄: To approximate within error
ǫ > 0
a (Lips hitz) fun tion of
d
variables, about(1/ǫ)
d
evaluations (provided by
the sample size
n
...) on a grid are required. A se ond justi ation is alsogiven bySilverman[1986℄: ApproximatingaGaussian distribution withxed
Gaussian kernelsandwith approximateerror ofabout10%requires asample
size
log
10
n(d)
≈ 0.6(d − 0.25)
. For instan e, withd = 10
,n(10)
≈ 7.10
5
,
implyingalreadyahugesamplesizeforaquitemoderatedimensionalsetting.
Parametri ase
In theparametri situation, let
S
m
be a model des ribed byD
m
ontinuousparameters,likelydependingonthedimension
d
. Insu ha ase,thedatasetx
issaidtorelyinahigh-dimensionalspa eassoonasn
issmallin omparisonto a parti ular fun tion
g
ofD
m
, namelyn = o(g(D
m
))
. As an illustrationfor
g
, we onsider theheteros edasti Gaussian mixture with true parameterθ
∗
andK
omponents. Wenoteθ
ˆ
K
theGaussian MLE withK
omponents.Inthat situation,
g
isalinearfun tion fromthe followingresult(MaugisandMi hel[2012℄): Itexistspositive onstants
κ
andA
su hthatE
x
[d
2
H
(f (
·; θ
∗
), f (
·| ˆ
θ
K
ˆ
))]
≤ κ
inf
K
{
KL(f (
·; θ
∗
), f (
·; ˆ
θ
K
)) +
pen(K)
} +
1
n
where
d
H
denotes theHellingerdistan e, KLtheKullba k-Leiblerdivergen eand pen
(K)
≥ κ
D
K
n
2A ln d + 1
− ln
1
∧
D
K
n
A ln d
.
ThustheHDnon-parametri andparametri situationsaredrasti ally
dif-ferentinmagnitude. However,inpra ti e,
D
K
anbehighsin eD
K
∼ d
2
/2
inthis Gaussiansituation, ombinedwithpotentiallylarge onstants. For
high-lightingthisfa t, onsiderthefollowingtwo- omponentmultivariateGaussian
mixture:
π
1
= π
2
=
1
2
,
X
1
|Z
11
= 1
∼
N(0, I),
X
1
|Z
12
= 1
∼
N(1, I),
(2.1) witha
= (a . . . a)
′
a real ve tor of size
d
. An illustration of this setting isdisplayedinFigure 2.3(a). Notethat thetwo omponentsaremoreandmore
separated when
d
grows sin ek1 − 0k
I
=
√
mixturedensityestimatedegrades(the Kullba k-Leiblerdivergen ein reases)
whendimensionin reasesasitisillustratedinFigure2.3(b)witha
homos edas-ti modelandwithequalmixingproportions.
−3
−2
−1
0
1
2
3
4
−4
−3
−2
−1
0
1
2
3
4
x1
x2
1
2
3
4
5
6
7
8
9
10
12
12.5
13
13.5
14
14.5
15
15.5
16
16.5
d
Kullback−Leibler
(a) (b)Figure2.3: HD urseintheparametri densityestimation ontext: (a)A
bivariatedata setexamplewithisodensityofea h omponentand(b)the
Kullba k-Leiblerdivergen eof thedensityestimatewhen
d
in reases.2.2.2 HD lustering: A mix of urse and blessing
Contrarytodensityestimationwherein reasingdimensionhasa learnegative
ee t, dimensionmay havebothpositiveand negativeee ts onthe
luster-ing task. We distinguish now whi h fa tors favorsu h blessing or urse
out omes.
Blessingfa tors
Weretrievethemodeldesign(2.1). Wedisplayagaina orrespondingsample
in Figure 2.4(a). We have already mentioned that the two omponents are
moreand moreseparatedwhen
d
in reases. The reasonis that ea h variableuniformly provides its own separation information su h that the asso iated
theoreti alerrorde reaseswhen
d
grows. Indeed,thiserrorisequaltoerr
theo
=
Φ(
−
√
d/2)
,whereΦ
isthe dfofN(0, 1)
. We anseethisde reasewithd
byadashlineinFigure2.4(b). Aninteresting onsequen eisthenthattheempiri al
error rate de reases also with
d
as it ould be noti ed in ontinuous line inFigure 2.4(b). It meansthat in reasing dimensionmayhave apositiveee t
onthe lustering task assoonasallvariables onveymeaningfulinformation
onthehiddenpartition.
−3
−2
−1
0
1
2
3
4
−4
−3
−2
−1
0
1
2
3
4
x1
x2
1
2
3
4
5
6
7
8
9
10
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
d
err
Empirical
Theoretical
(a) (b)Figure2.4: HDblessingin the lustering ontextwhenmostvariables onvey
independentpartitioning information: (a)A bivariatedatasetexamplewith
isodensityofea h omponentand(b)thetheoreti al(dashline)andthe
empiri al( ontinuousline)errorratewhen
d
in reases.sians,allmoreandmoreseparatedwhen
d
in reases:π
1
= π
2
= π
3
=
1
3
,
X
1
|Z
11
= 1
∼
N(0, I),
X
1
|Z
12
= 1
∼
N(2, I),
X
1
|Z
13
= 1
∼
N(
−2, I), .
Then Figure 2.5(a)-(d) displaysa related sample of size
n = 1000
fordier-ent dimensionson the main two axes of the Fa torial Dis riminant Analysis
(FDA)mapping. It learlyappearsthat omponentsaremoreandmoreeasily
re ognizedwhendimensionin reases,althoughitisasimplevisualization
pro- ess. Atthelimit,no omplex lusteringalgorithmwouldbeenoughtoidentify
lusters...
Cursefa tors
Infa t,in reasingdimensionmayhaveapositiveee t on lustering retrieval
onlyifvariablesinje tsomepartioninginformation. Inaddition,su h
informa-tionhastobenotredundant. Weillustratenowthesetwoparti ularfeatures.
Firstly,we onsidermanyvariableswhi hprovidenoseparationinformation.
We retrieve thesame parametersetting as (2.1)ex ept that the omponents
arenot moreseparatedwhen
d
growssin ekµ
2
− µ
1
k
I
= 1
, whereµ
1
= 0
isthe enter ofthe rstGaussianand where
µ
2
= (1 0 . . . 0)
′
istheoneofthe
se ond,thus(
k = 1, 2
)X
1
|Z
1k
= 1
∼
N(µ
k
, I).
(2.2)A sampleisdisplayedonFigure2.6(a). Figure 2.6(b)showsin dashline that
thetheoreti alerrorrateis onstant(it orrespondsto
err
theo
= Φ(
−
1
2
)
)when−4
−3
−2
−1
0
1
2
3
4
−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
1st axis FDA
2nd axis FDA
d=2
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
−3
−2
−1
0
1
2
3
4
1st axis FDA
2nd axis FDA
d=20
(a) (b)−1.5
−1
−0.5
0
0.5
1
1.5
−4
−3
−2
−1
0
1
2
3
4
5
1st axis FDA
2nd axis FDA
d=200
−1.5
−1
−0.5
0
0.5
1
1.5
−3
−2
−1
0
1
2
3
1st axis FDA
2nd axis FDA
d=400
( ) (d)Figure2.5: Fa torialDis riminantAnalysis(FDA)onthemain twofa torial
axesofthree Gaussian omponentsmoreandmoreseparatedwhenthespa e
dimensionin reases: (a)
d = 2
,(b)d = 20
,( )d = 200
andd = 400
.Se ondly,we onsidera asewheremanyvariablesprovideseparation,but
redundantinformation,inthefollowingsense: Itisthesameparametersetting
asbeforefortherstdimensionex eptforallotherones
X
1j
= X
11
+ ε
j
,
whereε
j
iid
∼
N(0, 1) (j = 2, . . . , d).
(2.3)SeeadataexampleinFigure2.7(a). Thus, omponentsarenotmoreseparated
when
d
growssin ekµ
2
−µ
1
k
Σ
= 1
,Σ
denotingthe ommon ovarian ematrixof ea h Gaussian omponent, and
µ
k
denoting the enter of the omponentk = 1, 2
(notethatbothµ
k
andΣ
ouldbeeasily omputedfromEquation(2.2)and(2.3)). Consequently,
err
theo
= Φ(
−
1
2
)
is onstantandtheempiri alerrorin reaseswith
d
,asillustratedin Figure2.7(b)withprevious onventions.2.2.3 Intermediate on lusion
−4
−3
−2
−1
0
1
2
3
4
−5
−4
−3
−2
−1
0
1
2
3
4
x1
x2
1
2
3
4
5
6
7
8
9
10
0.3
0.32
0.34
0.36
0.38
0.4
0.42
0.44
d
err
Empirical
Theoretical
(a) (b)Figure2.6: HD ursein the lustering ontextwhenvariables onveyno
partitioninginformation: (a)Abivariatedatasetexamplewithisodensityof
ea h omponentand(b)thetheoreti al(dashline)andtheempiri al
( ontinuousline)errorratewhen
d
in reases.−3
−2
−1
0
1
2
3
4
−6
−4
−2
0
2
4
6
x1
x2
1
2
3
4
5
6
7
8
9
10
0.3
0.32
0.34
0.36
0.38
0.4
0.42
d
err
Empirical
Theoretical
(a) (b)
Figure2.7: HD urseinthe lustering ontextwhenvariables onvey
redundantpartitioninginformation: (a)Abivariatedatasetexamplewith
isodensityofea h omponentand(b)thetheoreti al(dashline)andthe
empiri al( ontinuousline)errorratewhen
d
in reases.spa e. Inparti ular, lter methods performingvariable sele tionbeforethe
lusteringtaskhavetobeex luded,theriskofremovingdis riminantfeatures
beingtoolarge. Theremainingquestionisthenwhi hwrappermethodstobe
used? Su hmethodsshouldmanagewithprioritythefa tthatsomevariables
have negative ee ts for lustering. The general answeris to design spe i
parsimoniousmodelsfor lustering,themostemblemati onesrelyingonsome
variable sele tion prin iple. Wewill see also several alternativestrategies, in
parti ularvariable lustering(tonotbemingledwithindividual lustering,our
atransformedspa e) evenif itis notofteninitially des ribed withthis point
ofview.
Behingthismodeldesignwhi histherststepofhigh-dimensional
model-based lustering,thequestionofmodelsele tionisthenasked. Insome
situa-tions,traditionalmodelsele tion riteria ouldbedire tlyapplied. However,in
many ases,twokindsofdi ulties mayhappen. Firstly,thenumberof
om-peting models avoidsto enumerateall possible modelswhi h ompete.
Typi- ally,inavariablesele tion ontextthenumberofpossibilitiesis ombinatorial.
Insu h a ase,strategiesfordesigninganintelligentpathinarelevantsubset
ofmodelsisapossibleanswer. Se ondly,validityoftraditionalmodelsele tion
riteriathemselves an be hallenged,requiringsomeoriginalproposals.
In the rest of this hapter, we will give an overview of the main
high-dimensional lustering methods. We will systemati ally highlight novelty of
theproposedmodels,possible onne tionsbetweenthem(variablesele tionor
variable lustering,initial spa eornon- anoni al spa e)and issuesfor model
sele tion( riteriaandstrategiesofuse).
2.3 Non- anoni al models
As dis ussed previously, modelsdesigned forhigh-dimensional lustering rely
on parsimonious denition of related parameters. In this se tion, we fo us
on situations where parsimony is inje ted through parameters dened in a
transformedfeaturespa e, alledherenon- anoni alfeaturespa e. We onsider
this asebefore the anoni alfeaturespa esituation (next se tion)sin eitis
somewhatrelatedtothepioneering ideaof ltering. Indeed,fa torialanalysis
(for instan e prin ipal omponent analysis in the ontinuous ase) was rst
ondu tedforsele ting(new)variablesbeforeapplyingany lusteringmethod
on them. Here, ideas are related but with a wrapper point of view. Most
situationsaddress ontinuousfeatures.
2.3.1 Gaussian mixture of fa tor analysers
In Gaussian model-based lustering, in reasing the number of variables has
its main ee t on the number of parametersin luded in the ovarian e
ma-tri es
Σ
k
, sin eit is ofquadrati order. Consequently, mostmethods aim atintrodu ingparsimonyrston
Σ
k
. Historyanddetails ouldbefoundinBou-veyronand Brunet[2014℄. In parti ular, Ghahramaniand Hinton[1997℄and
M La hlan[2003℄designthefollowingreparameterizationof
Σ
k
:Σ
k
= B
k
B
′
k
+ ω
k
Λ
k
where
B
k
isaloadingsd
× q
non-squarereal matrix(1
≤ q ≤ q
max
,
q
max
< d
),ω
k
isapositiverealnumberandΛ
k
isad
× d
diagonalpositivedenitematrixit is equivalentto assuming
X
1
∈ R
d
to be generated bythe following latent
variable
Y
1
∈ R
q
lyingin asmaller(latent)spa ethan
R
d
X
1
|Y
1
, Z
1k
= 1 = B
k
Y
1
+ µ
k
+ ε
k
where
Y
1
⊥ ε
k
(⊥
denotingindependen e),Y
1
∼
N(0, I)
andε
k
∼
N(0, ω
k
Λ
k
)
.In this layout,
Y
1
is alled the fa tor, by straightforward analogy to fa toranalysismethods. Estimationisperformedthroughanalternating
expe tation- onditionmaximization(AECM)algorithm(MengandvanDyke[1997℄).
Complexityofsu hamodelisequalto
D
m
= (K
− 1) + Kd + Kq[d − (q −
1)/2] + Kd
,whereit anbeseenthatthequadrati parthasvanished. Infa t,it orrespondstothe most omplexmodel ofawholefamily, M Ni holas and
Murphy [2008℄havingdened 12 asso iated parsimoniousversions, in luding
forinstan e inter- lassequality between
B
k
,identityofΛ
k
= I
, et . Finally,models in ompetition
(
S
m
)
m
∈M
gather the ombinations non only of these12parsimoniousversionsbutalsoofthe ouples
(q, K)
ofthelatentdimensionand of the number of omponents. In pra ti e,
q
max
is expe ted to be quite
low for parsimonious reasons and thus the ardinal of
M
is not ex essivelyhigh. Traditionalmodelsele tion riteria(asBIC) anthenbedire tlyapplied
onthis olle tion. The rpa kage pgmm
2
providesanimplementationof this
method.
2.3.2 HD Gaussian mixture models
Bouveyron et al. [2007℄propose another wayfor obtaining parsimony onthe
ovarian ematri es
Σ
k
. Itrelaysonthefollowingspe tralde ompositionΣ
k
= D
k
∆
k
D
′
k
where
D
k
is the orthogonal matrix of the eigenve tors ofΣ
k
and∆
k
is adiagonalmatrix ontainingtherelatedeigenvalues. Theyimpose
∆
k
tofollowtheparsimoniousstru ture
∆
k
=
a
k1
0
. . .0
a
kq
k
0
0
b
k
0
. . .0
b
k
q
k
(d
− q
k
)
with
a
kj
≥ b
k
> 0
, forj = 1, ..., q
k
andq
k
< d
. Su h an assumption anbesomewhat relatedto a kindof prin ipal omponentanalysis perGaussian
group. It ould also be viewed asakind of variable lustering sele tion, the
d
− q
k
remainingvariablesof∆
k
orrespondingto agroupofnoisy features.Figure2.8illustratesathreedimensional(
d = 3
)andtwo omponentssituation(
K = 2
)wherebothsubspa edimensionsq
1
andq
2
areequal(q
1
= q
2
= 2
)butdierinorientation. Estimation aneasilyperformedthroughanEMalgorithm
forinstan e.
Figure2.8: IllustrationoftheHD lusteringmixtureGaussianmodelin atwo
omponentssituation (providedbyBouveyronetal.[2007℄).
Complexityof su ha model is given
D
m
= (K
− 1) + Kd +
P
K
k=1
q
k
[d
−
(q
k
+ 1)/2] +
P
K
k=1
q
k
+ 2K
. In addition, Bouveyron et al. [2007℄ proposeeight parsimonious versions by imposing for instan e equality between
sub-spa e dimensions (
q
k
= q
, for allk
), et . Finally, the whole model family(S
m
)
m
∈M
in ludes ouples((q
1
, . . . , q
K
), K)
ofsubspa edimensionandnum-berof omponents, ombinedwiththeeightmodels. Sin e
q
k
maydependonthe omponent, ontrarytotheGaussianmixtureoffa toranalysersdes ribed
inthepreviousse tion,thenumberofmodelsbe omes ombinatorial. Then,it
maybedi ultintheHD settingtobrowseallmodelsforapplyingaBIC-like
riterionforinstan e. Consequently,Bouveyronet al.[2007℄proposeakindof
ruleof thumb riterion forsele tingea h
q
k
, lookingfor abreak in theeigen-values reeof theempiri al ovarian e matrixforea hgroup omponent,the
so- alleds reetestof Cattell[Cattell,1966℄. Thermixmodpa kage
3
(Lebret
etal.[2015℄)implementsthesemodels.
2.3.3 Fun tional data
Fun tional and dis retized data
Stri tlyspeaking, realfun tionaldata(RamsayandSilverman[2005℄,Ferraty
and Vieu [2006℄) orrespond to
i = 1, . . . , n
urveswhi h are realizations ofn
randomvariableslinkedton L
2
- ontinuousreal-valuedsto hasti pro esses
Y
i
=
{Y
i
(t)
∈ R, t ∈ [0, T ]}
taking values in aHilbert spa eH
of fun tionsdened on the (time) interval
[0, T ]
. Thus, it orresponds to an innitedi-mensional spa e. Sin e most fun tional data arelongitudinal, we adopt here
the onventionof parameterizingmodelsintermsoftime. However,itapplies
equallywell withanyother featuresasangle, length, et . Inaddition,
exten-sionsarepossibleformultivariate urves,itmeansthatindividual
i
isdes ribedbyseveral urves(seeforinstan eJamesandSugar[2003℄orJa quesandPreda
[2014b℄).
Inpra ti e,ea h
Y
i
isunobservedfortwo,essentiallyte hnologi al,reasons.Firstly,the
n
urvesY
i
aredis retizedea hinm
i
time-points{Y
i
(t
is
), 0
≤ s ≤
m
i
, t
is
∈ [0, T ]}
. Se ondly, an error on observation is usually present su hthat only
m
i
ordered time-points{X
i
(t
is
), 0
≤ s ≤ m
i
, t
is
∈ [0, T ]}
(i =
1, . . . , n
)are availablefor ea h urve. For instan e, thefollowingrelationshipbetween dis retized (unobserved) values
Y
i
(t
is
)
and noisy (observed) valuesX
i
(t
is
)
ouldbeassumed:X
i
(t
is
) = Y
i
(t
is
) + ε
is
,
(2.4)where
ε
is
haszeromeanandisun orrelatedwithea hotherandY
i
(t
is
)
. Otherassumptionsarepossibleaswewill seebelow.
WerefertoJa quesandPreda[2014a℄forageneralreviewon lusteringfor
fun tionaldata,in ludingthemodel-basedone. Di ultyofperforming
unan-imous lusteringongenerativedistributions omesfromthefa tthat, ontrary
tothenite-dimensionalsetting,thenotionofdensityprobabilityis generally
not dened for fun tional random variable (Delaigle and Hall [2010℄).
Con-sequently, relatedte hniquesrequiredening densityprobabilities in a
nite-dimensionalspa e,leadingtomultiple anddierentimplementations.
Inthis hapter,wedividemodel-based lusteringte hniquesintotwo
dier-ent ategories: these ones wherethe generativemodel isexpli itlydened on
theobserved values
X
i
=
{X(t
is
), 0
≤ s ≤ m
i
, t
is
∈ [0, T ]}
,i = 1, . . . , n
, andthese ones forwhi h itis notthe ase. Indeed,this split will haveimportant
onsequen esforsomeaspe ts on erningmodelsele tion.
Clustering with noexpli it distributionon
X
i
Usually, therst stepbefore a lusteringmethod isto re onstru t theinitial
fun tionalformofdata. It anthenbeviewedasaprepro essingstep
(lter-ing method). Itoftenreliesontheassumptionthattheunobserved urve
Y
i
anbeexpressed in abasisof
d
fun tions{φ
j
}
j=1,...,d
, forinstan e B-splinesorwavelets,inthefollowingform:
Y
i
(t) =
d
X
j=1
γ
ij
φ
j
(t).
Usingthentheregression(2.4)hypothesis,traditionalleastsquared oe ients
estimatesareobtainedby
ˆ
where
Φ
i
= (φ
j
(t
is
))
isam
i
× d
matrixgatheringthevalueofea hbasisfun -tionforea htimedis retizationknot. Finally,standardmodel-based lustering
te hniques(typi allymultivariateGaussian mixtures, eventuallyHD variants
previously des ribed in Se tions 2.3.1 and 2.3.2) an be dire tly applied on
theestimated oe ients
γ
ˆ
i
. Thepartition onindividualsX
i
isobtainedasasimpleby-produ t,beingthesameasthisone ofindividuals
γ
ˆ
i
.Instead of partitioning the basis oe ients
γ
ˆ
i
, a model-based lusteringte hnique an be alternatively applied to some prin ipal omponent s ores
resulting from fun tional prin ipal omponent analysis (FPCA) of the
pre-vious re onstru ted urves. In pra ti e, the omputational pro ess for
im-plementing FPCA onsists of performing a standard ( entered) PCA to the
matrix
ΓW ˜
˜
Γ
′
T
, where
Γ
= (ˆ
γ
ij
)
isthen
× d
matrixofestimated oe ients,T
=
1
n
I
is then
× n
matrix of weights for urves,Γ
˜
is then
× d
matrix ofentered oe ients of
Γ
andW
is thed
× d
matrix of the inner produ tsw
jj
′
=
R
T
0
φ
j
(t)φ
j
′
(t)dt
(1
≤ j, j
′
≤ d
) (it a ts like ametri ). Thus, thej
thprin ipal omponents ore
C
j
isthej
theigenve torasso iated tothe largestj
theigenvalue:˜
ΓW ˜
Γ
′
TC
j
= α
j
C
j
.
As usual with PCA, FPCA performs a kind of variable ordering. Finally,
lusteringisperformedonatrun atingprin ipal omponents ores
C
1
, . . . , C
q
,with
q
≤ d
.Fromamodel sele tionpointof view,bothpreviousmethods allowtouse
someinformation riterialikeBICforsele tingthenumber
K
of omponents.However,it is notreally possibleto use them forsele ting other parts ofthe
model whi h are thefun tional basis
{φ
j
}
j=1,...,d
and, spe i ally to FPCA,thetrun ationoforder
q
.Clustering with expli it distributionon
X
i
Ideally, for bene ingfrom thewhole mathemati alstatisti s orpus,
model-based lusteringte hniqueswouldrequireadistributiononall
X
i
= (X
i
(t
is
), 0
≤
s
≤ m
i
, t
is
∈ [0, T ])
,i = 1, . . . , n
. Firstofall,itisimportanttonoti ethatper-formingthe lustering taskdire tly withobservedvalues
X
i
'sasiftheywouldorrespondto lassi almultivariatevaluesisnotdesirable,evenifit ouldmeet
thisgoal. Therstreasonisthatea h
X
i
doesnotne essarilyrelyinthesamespa e dimension (here
m
i
for ea h), even if in pra ti e it ould be oftenthease. These ondandthemostimportantreasonisthatworkingwithsu hraw
datawastesorderinformationonthem.
Contrarytotherawdata ase,severalte hniquesproposedistributionson
X
i
whi h take all the fun tional data spe i ity into a ount. Ja ques andPreda [2013℄ perform FPCA by group, leading to prin ipal omponents per
It leads to the following Gaussian mixture model, relying on atrun ation of
order
1
≤ q
k
≤ d
forea h omponent:f (x
i
; θ)
≈
K
X
k=1
π
k
q
k
Y
j=1
φ(C
ijk
; 0, α
jk
)
where
φ(
·; 0, α
jk
)
is theunivariateGaussiandensityofmeanzero(s oresC
ijk
are entered) andvarian e
α
kj
( orrespondingalsoto eigenvalues). Then,pa-rameterestimation is provided throughanEM-likealgorithm formaximizing
the(pseudo)log-likelihood,wherebothstepsarethefollowing:
E-step Compute onditionalprobabilities
t
ik
∝ π
k
Q
q
k
j=1
φ(C
ijk
; 0, α
kj
)
asusual.M-step First, prin ipal s ores are updated. Noti e that weights
T
k
dependnowon
t
ik
's,Γ
k
too. Se ond, perform theq
k
trun ation order sele tionbydete tingakindofelbowintheeigenvaluesbythes reetestof
Cat-tell (Cattell[1966℄). Finally, parameters
π
k
are omputed asusual andparameters
α
k
arealreadygivenfrom previous onditionalFPCA.Thispro essisimplementedin therfun lust
4
pa kage. Asan illustration,
thispa kageisappliedtokneading urves,whi haredes ribedin Se tion2.1,
inFigure2.9. Fromamodelsele tionpointofview,therearesomeimportant
remarks. Stri kly speaking, it is just apseudo likelihood method sin e data
C
ijk
are hangingat ea h iterationstepofEM. Consequently, using sele tionriterialikeBIC ouldbehazardousfor hoosing
K
,q
k
orthefun tionalbasis.However,in pra ti e,BICworks wellfor hoosing
K
. However,itisnotusedforsele ting
q
k
,asprevioussaid,forlimiting omputingtime. Noattemptforhoosing thebasisisperformed.
(a) (b)
Figure2.9:
n = 115
kneading urvesobservedatd = 241
equispa edinstantsoftimein theinterval[0;480℄: (a)raw urves,(b)threegroupspartioning
urveswiththefun lustpa kage.
Alternatively,JamesandSugar[2003℄ onsiderrandomnessdire tly onthe
basis oe ients
γ
i
. Theyassumethatγ
i
arisesfromahomos edasti Gaussianmultivariate model whi h, oupling with (2.4), provides the following
regres-sion model, onditionally on the
i
th urve belonging to thek
th luster (soonditionalto
Z
ik
= 1
):X
i
= Φ
i
(µ
k
+ ǫ
i
) + ε
i
,
where
ǫ
i
∼
N(0, Σ)
andε
i
∼
N(0, σ
2
I
)
. Also,someparsimoniousassumptions
aremadeon enters
µ
k
. Then,anEMalgorithmallowstoestimateallparam-eters. ContrarytothemodelofJa quesandPreda[2013℄des ribedjustabove,
wearenowfa edtoanunambiguousgenerativeapproa hallowing
straightfor-ward model sele tionwith any lassi al riterion for hoosing everyquantity
of interest (the number
K
of lusters, the basis{φ
j
}
and the parsimony ofall means
µ
k
), even if the authors prefer to use a so- alled distortionfun -tion riterionforsele ting
K
fastersin eavoidingEM omputationsforallK
values.
In the samespirit asJames and Sugar [2003℄,Samé et al. [2011℄give
an-other regression model providing a full generative, exible and parsimonious
distribution onthe
X
i
's. They assume that the urvesarise from amixtureof regressions on a basis of polynomial fun tions (the order to be given by
modelsele tion),withpossible hangesin regimeatea hinstantoftime. The
mixing proportionsaredenedbylogisti fun tions forallowingsegmentation
in time. An EM pro edure is performed forestimation and several
parsimo-niousversions aredes ribed. Thisfullgenerativedistributionallowsagainfull
modelsele tion(numberof lusters,polynomialorderofthebasisfun tionand
number of regime hanges) in any standard way. However, as in many
pre-vioussettings, thenumberof ompeting models an in reasedrasti ally. For
instan e,thebasi fun tions an hangebyregime,multiplying ombinations.
2.3.4 Intermediate on lusion
Many parsimoniousmodelling solutionsexist fordealing with HD data,
on- erningaswellindependentand fun tionaldata, evenifsomegaps remainto
be lledlike ategori al fun tional data or also mixed ( ontinuous and
ate-gori al typi ally) multivariate fun tional data. Most of existing models rely
onagenerativedistributiononthedataspa e,allowingdire tuseofstandard
sele tion riteria. However,the ru ial questionisfo usedonthemultipli ity
ofmodelstobe ompared. Itisthereasonwhysomeauthorsfavorsomemore
empiri al,butfast,rulesformodelsele tion.
Weguessthatfutureresear hesshouldaddressnewadvan esforfast
sele -tionofmultiplemodelsin ashortallo atedtime. Inthenextse tion,devoted
to anoni almodelsetting,wewillseeearlyseveralattemptsforthispurpose,
2.4 Canoni al models
We address now models for HD data whi h position parsimony assumptions
dire tly on the initial (or anoni al) variable spa e. Advantage of su h
ap-proa hes,besidenon- anoni alones,is agreatmodel readibilityforthe
pra -titioner. Indeed, thisoneis usually morea ustomedto his variableset than
toasomewhatmorearti ialset,asthefa torialfeatures ouldbesometimes.
In this ontext, this hapter ta klesimportantnotions: variablesele tion,
variable lustering,modelsele tionvalidityandalsostrategiesfordealingwith
modelmultipli ity.
2.4.1 Parsimonious mixture models
Classi al mixture models have already been presented in Chapter ??,
Se -tion??. Itgathersinparti ulartheGaussianmixturemodelforthe ontinuous
aseandthelatentmultinomialmixturemodelforthe ategori al ase,
in lud-ingalsomanyparsimoniousvariants. DealingwithHDdataimposeto onsider
essentiallysomeofthemostparsimoniousonesthusthereisaneedtoprovide
moredetailsin thisse tion. Then,extensionto themixed ase(merging
on-tinuous and ategori alfeatures) is presentedasa straightforwardextension.
Allthesemodelsareimplementedintherpa kagermixmod
5
. Finally,wewill
presentanewattemptforvariablesele tioninthe ontinuous, ategori aland
mixedsituations.
Spheri al and diagonalGaussian mixturesfor ontinuousvariables
We onsiderdatasets
x
= (x
1
, . . . , x
n
)
,withx
i
∈ R
d
. Themostparsimonious
Gaussianmixture modelsdened byCeleux andGovaert[1995℄belongtothe
so- alledspheri alanddiagonalfamilies. Anexampleofdiagonalmodelisgiven
inFigure 2.10. UsingnotationsalreadyprovidedinSe tion?? ofChapter?? ,
their most omplexversionsrespe tively orrespond to onstraints
Σ
k
= λ
k
I
and
Σ
k
= λ
k
B
k
on the ovarian e matrixΣ
k
of thek
th omponent, whereλ
k
=
|Σ
k
|
1/d
andB
k
diagonal with|B
k
| = 1
. In luding some parsimoniousversions,whi h allowsome parts tovary ornot between omponents,a total
oftwospheri alandfourdiagonalmodelsareavailable. All models,and their
respe tivenumberof parameters,are displayed in Table2.1. Model sele tion
anbeeasilyperformedbytraditional riteria,likeBIC.
Latent lass model for ategori al variables
We onsider now data sets
x
= (x
1
, . . . , x
n
)
, ea hx
i
ontainingd
ategori- al variables, the
j
th havingm
j
response levels. The odingx
i
= (x
jh
i
; j =
5Figure 2.10: Isodensityofatwo- omponentsdiagonalGaussianmixture in
thethree-dimensionalspa e.
Family Model Numberofparameters
diagonal
[λB]
dim(π) + Kd + d
[λ
k
B
]
dim(π) + Kd + d + K
− 1
[λB
k
℄dim(π) + 2Kd
− K + 1
[λ
k
B
k
]
dim(π) + 2Kd
spheri al[λI]
dim(π) + Kd + 1
[λ
k
I
]
dim(π) + Kd + K
Table2.1: Some hara teristi softhetwospheri alandthefour diagonal
models. Wehave
dim(π) = K
− 1
inthe aseof freeproportionsanddim(π) = 0
inthe aseofequalproportions.1, . . . , d; h = 1, . . . , m
j
)
indi atesthatx
jh
i
= 1
ifi
hasresponselevelh
forvari-able
j
andx
jh
i
= 0
otherwise. Thestandardmodel for lusteringobservationsdes ribedthrough ategori alvariablesisthe so- alledlatent lass model(see
forinstan e Goodman[1974℄). Data areassumedto ariseindependentlyfrom
amixture of
K
multivariatemultinomial distributionswithpdff (x
i
; θ) =
K
X
k=1
π
k
d
Y
j=1
m
j
Y
h=1
(α
jh
k
)
x
jh
i
,
(2.5)where
θ
= (π, α)
denotestheve torparameterofthelatent lassmodeltobeestimated,with
α
= (α
1
, . . . , α
K
)
andα
k
= (α
jh
k
; j = 1, . . . , d; h = 1, . . . , m
j
)
,α
jh
k
denotingtheprobabilitythat variablej
haslevelh
ifobje ti
isin lusterLebret et al. [2015℄ propose four parsimonious versions, with thus a
to-tal of ve models. They orrespond to an extension of the
parameteriza-tion of Bernoulli distributions used by Celeux and Govaert [1991℄ for
lus-tering andalso byAit hinson andAitken[1976℄for kerneldis riminant
anal-ysis. Thebasi ideaisto impose theve tor
α
j
k
= (α
j1
k
, . . . , α
jm
j
k
)
to takethe form(β
j
k
, . . . , β
j
k
, γ
j
k
, β
j
k
, . . . , β
j
k
)
withγ
j
k
> β
j
k
. Sin eP
m
j
h=1
α
jh
k
= 1
, we have(m
j
− 1)β
k
j
+ γ
k
j
= 1
and, onsequently,β
j
k
= (1
− γ
j
k
)/(m
j
− 1)
. The onstraintγ
k
j
> β
j
k
be omesnallyγ
j
k
> 1/m
j
. Then, the ve torα
j
k
anbe broken upintothetwofollowingparameters:
• a
j
k
= (a
j1
k
, . . . , a
jm
j
k
)
wherea
jh
k
= 1
ifh
orrespondstotherankofγ
j
k
(inthefollowing,thisrankwillbenoted
h(k, j)
),0otherwise;• ε
j
k
= 1
− γ
j
k
whi h orrespondstotheprobabilitythatthedatax
i
arisingfrom the
k
th omponentaresu hthatx
jh(k,j)
i
6= 1
.In other words, the multinomial distribution asso iated to the
j
th variableof the
k
th omponent is reparameterized by a entera
j
k
and the dispersionε
j
k
around this enter. Thus, itallows us to givean interpretation similar tothe enter and thevarian e matrixused for ontinuousdata in the Gaussian
mixture ontext. Finally,therelationshipbetweentheinitialparameterization
andthenewoneisgivenby:
α
jh
k
=
1
− ε
j
k
ifh = h(k, j)
ε
j
k
/(m
j
− 1)
otherwise.(2.6)
Inthefollowing,thismodelwillbedenotedby
[ε
j
k
]
. Inthis ontext,threeothermodels anbeeasilydedu ed. Wenote
[ε
k
]
themodelwhereε
j
k
isindependentof thevariable
j
,[ε
j
]
themodelwhere
ε
j
k
is independentof the omponentk
and, nally,
[ε]
themodelwhereε
j
k
is independentofboththevariablej
andthe omponent
k
. In order to maintain someunity in the notation, we willdenote also
[ε
jh
k
]
the mostgeneralmodel initially introdu ed. Thenumberoffreeparametersasso iated toea h modelis given inTable2.2. Again, model
sele tion anbeeasily performedbytraditional riteria,likeBIC.
Mixeddata models
It is frequent in pra ti e to mix ontinuous and ategori al data. Thus the
i
th individual is omposed by two parts,x
i
= (x
cont
i
, x
cat
i
)
,x
cont
i
andx
cat
i
designingthe ontinuousandthe ategori alonesrespe tively. Inthat ase,it
iseasyto ombine(diagonal)parsimoniousGaussianmixtureandlatent lass
modelby onditionalindependen e [MoustakiandPapageorgiou,2005℄:
Model Numberofparameters
[ε]
dim(π) + 1
[ε
j
]
dim(π) + d
[ε
k
]
dim(π) + K
[ε
j
k
]
dim(π) + Kd
[ε
jh
k
]
dim(π) + K
P
d
j=1
(m
j
− 1)
Table2.2: Numberoffreeparametersofthevemultinomialmodels. We
have
dim(π) = K
− 1
in the aseoffreeproportionsanddim(π) = 0
intheaseofequalproportions.
with
α
k
= (α
cont
k
, α
cat
k
)
(seealsoSe tion??inChapter??). Then,theprevioussixGaussian mixturemodelsandthevemultinomial mixturemodels anbe
ombined, dening straightforwardly30newmixed models. Classi al riteria
anbeusedforsele tingthem, withalsothenumberof lusters
K
.Although previously des ribed models, in the ontinuous, ategori al or
mixeddatasituations,arethemostparsimoniousonesintheirrespe tive
fam-ilies, theyare notreallydesigned forrealisti HDsituations involvingseveral
thousandsof variables for instan e. Indeed, theirparameternumber remains
toohighin su h ases.
Variable sele tion has alwaysbeen anatural answer for HD lustering as
alreadydis ussedinthebeginningofthis hapter. Typi ally,lteringmethods
relyingonapreliminaryfa torialanalysisstepthen utthenumberoffa torial
variables to be retained. However, in model-based lustering involving afull
wrapping approa h, the di ulty is to integrate properly this sele tion step
in the model itself. Thus, wedis uss nowmoresuitable methods forthe HD
situation.
2.4.2 Variable sele tion through regularization
In this se tion, we fo us on the variable sele tion problem in the Gaussian
mixture lustering ontext.
ℓ
1
-penalizationpro eduresInspiredbythesu essoftheLassoregression,PanandShen[2007℄proposeto
takeadvantageofthesparsitypropertyof
ℓ
1
-penalizationofthelikelihood toperformautomati variablesele tionforhigh-dimensionalmodel-based
luster-ing. Theirpro edure, alled PS-Lassoin thesequel, onsistsof usingaLasso
methodtosele trelevant lusteringvariablesandestimatemixtureparameters
inthesameexer ise. The ovarian ematri esareassumedtobeidenti aland
diagonal(
Σ
k
= V =
diag(σ
2
parameters. Forany
K
∈ N
∗
,thefollowingfun tionhastobemaximized:
θ
K
7→
n
X
i=1
ln
"
K
X
k=1
π
k
φ (¯
x
i
; µ
k
, V)
#
− λ
K
X
k=1
kµ
k
k
1
,
(2.7) whereθ
K
= (π, µ
1
, . . . , µ
K
, V)
,kµ
k
k
1
=
d
P
j=1
µ
kj
,¯
x
i
= (x
ij
− ¯x
j
)
1≤j≤p
with¯
x
j
=
1
n
P
n
i=1
x
ij
,λ
is anon-negative regularizationparameterandφ(
·; µ, Σ)
denotes the multivariateGaussian density of enter
µ
and ovarian e matrixΣ
. An EM-algorithmisproposedtosolvethisparameterestimationproblem.Next,amodiedBIC riterionisusedtosele t
K
andλ
:BIC
(K,λ)
=
−2 ln
"
n
Y
i=1
K
X
k=1
π
k
φ(x
i
; µ
k
, V)
#
+ ln(n)D
(K,λ)
where
D
(K,λ)
= (K
− 1) + Kd + d − q
,q
denotingthenumberofthemaximumpenalizedlikelihoodestimatemean omponentsthatareequalto
0
.This approa h was su essively extended in Zhou et al. [2009℄(Gaussian
mixtureswith diagonal ovarian ematri es)and nally in Zhou etal. [2009℄.
In this last paper, aregularized Gaussian mixture model with un onstrained
ovarian ematri esisproposed.Theyemploya
ℓ
1
penaltyonmeanparametersandon ovarian ematri esasfollows:
θ
K
7→
n
X
i=1
ln
"
K
X
k=1
π
k
φ (¯
x
i
; µ
k
, Σ
k
)
#
− λ
K
X
k=1
kµ
k
k
1
− ρ
K
X
k=1
Σ
−1
k
1
,
(2.8) wherekµ
k
k
1
=
d
X
j=1
µ
kj
,
Σ
−1
k
1
=
d
X
j,j
′
=1
j6=j′
(Σ
−1
k
)
jj
′
,
and where
λ
andρ
are two non-negativeregularization parameters. Thispa-rameterestimationproblemissolvedusinganEMalgorithmwheretheso- alled
glassoalgorithm (Friedman et al. [2007℄) is used to estimate sparse pre ision
matri es
Σ
−1
k
.Lasso-MLEpro edure
In Meynet [2012℄ and Meynet and Maugis-Rabusseau [2012℄, they highlight
that the
ℓ
1
-penalizationindu es shrinkageof the oe ients and thus biasedestimators with high estimation risk. Moreover, the use of a BIC-type
ri-terion for the model sele tion an be unsuitable for high-dimensional data.
Consequently,theyproposetoonlyusean
ℓ
1
-penalizedlikelihoodapproa htoon-high-dimensional situations. The evaluation of the MLE rather than the
ℓ
1
-penalizedestimatorforea hmodelis onsideredtoavoidestimationproblems
due to
ℓ
1
-penalization shrinkage. More pre isely, the datax
= (x
1
, . . . , x
n
)
areassumedtohaveanullexpe tation (inpra ti e,empiri al enteringofthe
data isperformedto ensurethis assumption)and theirunknown density
f
isestimated byanite spheri alGaussian mixture. The lustersare
hara ter-ized by the meanparameters
(µ
k
)
1≤k≤K
anda variablej
is alled irrelevantforthe lusteringif
µ
kj
= 0
forallk = 1, . . . , K
;otherwiseitis alledrelevant.Therelevantvariablesubset(resp. irrelevantvariablesubset)isdenotedby
J
r
(resp.
J
c
r
=
{1, . . . , d} \ J
r
). Consequently, the variable sele tionproblem isre astintoamodel sele tionproblem, wherethemodel olle tionis
(S
(K,J
r
)
)
withS
(K,J
r
)
=
x
i
∈ R
d
7→ f(x
i
; θ) =
P
K
k=1
π
k
φ(x
J
i
r
; µ
k
, σ
2
I
)
φ(x
J
r
c
i
; 0, σ
2
I
)
θ
= π
1
, . . . , π
K
, µ
1
, . . . , µ
K
, σ
2
∈ Π
K
× R
|J
r
|
K
× R
∗
+
,
x
J
r
i
denoting the restri tion ofx
i
onJ
r
,|J
r
|
orresponding to the ardinalof
J
r
andΠ
K
denoting thesimplex relatedto parameters(π
1
, . . . , π
K
)
. Thedimensionofamodel
S
(K,J
r
)
orrespondstothetotalnumberoffreeparametersestimatedin themodel:
D
(K,J
r
)
= K(1 +
|J
r
|)
.Theso- alledLasso-MLEpro edureproposedinMeynetandMaugis-Rabusseau
[2012℄isde omposed intothreemainsteps. Intherststep,asPanandShen
[2007℄,an
ℓ
1
-approa his onsidered: Forea h(K, λ)
∈ N
∗
× G
λ
(G
λ
isagivengrid on
λ
), theLasso estimatorθ
ˆ
L
(K,λ)
is omputed by maximizing (2.7) andtheasso iatedrelevantvariablesubsetis
J
(K,λ)
=
{j ∈ {1, . . . , d} : ∃ k ∈ {1, . . . , K}
su hthatµ
ˆ
kj
6= 0}.
Thus a random model sub olle tion
{S
(K,J
r
)
: (K, J
r
)
∈ M
L
}
is obtained, whereM
L
=
{(K, J
r
) : K
∈ N
∗
, J
r
∈
[
λ∈G
λ
J
(K,λ)
}.
The se ond step onsists of omputing the MLE
θ
ˆ
(K,J
r
)
using the standardEM algorithm for ea h model
(K, J
r
)
∈ M
L
. The third step is devoted to
model sele tion. Asin Maugis andMi hel[2012℄,anonasymptoti penalized
riterionis proposed to solvethemodel sele tionproblem. Byextendingthe
general model sele tion theorem of Massart [2007℄ (Theorem 7.11) (see also
Se tion?? inChapter??) (demanderrefàPas aldanslebook),Meynet[2012℄
provesthatthepenaltyis
pen
(K,J
r
)
= κ
1
D
(K,J
r
)
n
1 + κ
2
ln
d
D
(K,J
r
)
,
(2.9)model olle tion omplexitybytakinginto a ountthepossiblelargenumber
ofmodelswithidenti aldimension. Neverthelessthislogarithmtermbe omes
unne essaryifthenumberofmodelswiththesamedimensionissmallenough.
Forinstan e, forniteGaussianmixture modelsin alow-dimensionalsetting,
a penalty proportional to the dimension is su ient to sele t a model lose
to the ora le(Maugis and Mi hel [2011℄). But in the high-dimensional
on-text, the numberof models having the same dimension is expe ted to grow.
Nonetheless,thankstotherandompresele tionofrelevantvariablessubsets,a
ompletevariable sele tionisnotperformedhere. Thus, iftherandommodel
sub olle tionismu hpoorerthanthewholemodel olle tionand ontainsfew
modelswiththesamedimension,apenaltyproportionalto thedimension
pen
(K,J
r
)
=
D
(K,J
r
)
n
(2.10)mightbesu ienttosele tamodelwithproperdimension. Next,thepenalty
dependingonunknownmultipli ative onstantsis alibratedusingtheso- alled
slopeheuristi s[BirgéandMassart,2007;Baudryetal.,2012℄.
ComparingPS-Lasso and Lasso-MLE
To omparetheLasso-MLEandPS-Lassopro edures,thefollowingsimulated
example is proposed in Meynet and Maugis-Rabusseau[2012℄. The data set
onsistsof
n = 200
observationsdes ribedbyd = 1 000
variables. Thedataaresimulateda ordingtoamixtureoftwoGaussiandistributions
π
1
φ(
·; 0
d
, I) +
(1
− π
1
)φ(
·; µ
2
, I)
whereµ
2
= (1.5, . . . , 1.5, 0
950
)
andπ
1
= 0.85
. Therelevantvariablesare therstfty variables(
J
⋆
r
=
{1, . . . , 50}
).20
simulationsofthedatasetareperformed. Forea hsimulation,modelswith
K
∈ {1, 2, 3}
lustersare onsidered. TheresultsaresummarizedinTable2.3. Table2.3showsthat
Pro edure Estimator TR FR
ˆ
K
ARI 1 2 3 PS-Lasso ora le50.3 (0.2)
214.6 (79.0)
0
16
4
0.90 (0.03)
BIC49.7 (0.8)
14.3 (3.4)
0
18
2
0.86 (0.02)
Lasso-MLE ora le50.0 (0.0)
0.2 (0.2)
0
20
0
0.95 (0.02)
AIC50.0 (0.0)
17.1 (4.2)
0
14
6
0.90 (0.04)
BIC49.8 (0.4)
4.4 (2.2)
0
20
0
0.92 (0.02)
DDSE50.0 (0.0)
2.4 (1.7)
0
20
0
0.94 (0.02)
Table2.3: Averagednumberoftruerelevant(TR)andfalserelevant(FR)
variables(
±
standarddeviation);numberoftimesa lusteringwithK = 1
ˆ
,2
and