High-dimensional clustering

(1)

Christophe Biernacki, Cathy Maugis

To cite this version:

Christophe Biernacki, Cathy Maugis.

High-dimensional clustering.

Choix de mod`

eles et

agr´

egation, Sous la direction de J-J. DROESBEKE, G. SAPORTA, C. THOMAS-AGNAN

Edition: Technip., 2015. <hal-01252673v2>

HAL Id: hal-01252673

https://hal.archives-ouvertes.fr/hal-01252673v2

Submitted on 12 Jan 2016

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not.

The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destin´

ee au d´

epˆ

ot et `

a la diffusion de documents

scientifiques de niveau recherche, publi´

es ou non,

´

emanant des ´

etablissements d’enseignement et de

recherche fran¸cais ou ´

etrangers, des laboratoires

publics ou priv´

es.

(2)

2 High-dimensional lustering

ChristopheBierna ki andCathyMaugis-Rabusseau 1

2.1 Introdu tion. . . 1

2.2 HD lustering: Curseorblessing? . . . 4

2.2.1 HDdensityestimation: Curse. . . 4

2.2.2 HD lustering: Amix of urseandblessing . . . 6

2.2.3 Intermediate on lusion . . . 8

2.3 Non- anoni almodels . . . 10

2.3.1 Gaussianmixtureoffa toranalysers . . . 10

2.3.2 HDGaussianmixturemodels . . . 11

2.3.3 Fun tionaldata. . . 12

2.4 Canoni almodels . . . 16

2.4.1 Parsimoniousmixturemodels . . . 17

2.4.2 Variable sele tionthroughregularization. . . 20

2.4.3 Variable rolemodelling . . . 24

2.4.4 Co- lustering . . . 27

2.5 Futuremethodologi al hallenges . . . 35

(3)

(4)

High-dimensional lustering

Christophe Bierna ki and Cathy Maugis-Rabusseau

2.1 Introdu tion

High-dimensional(HD)datasetsarenowfrequent,mostlymotivated by

te h-nologi al reasons whi h on ern automation in variable a quisition, heaper

availability of data storage and more powerful standard omputers for qui k

data management possibility. All elds are impa ted by this general

phe-nomenon ofvariable numberination, only thedenition ofhigh being

do-maindependent. Inmarketing,thisnumber anbeoforder

10

2

,inmi roarray

gene expression between

10

2

and

10

4

, in text mining

10

3

or more, of order

10

6

forsinglenu leotidepolymorphism(SNP)data, et . Note alsothat

some-timesmu hmorevariables anbeinvolved,what anbetypi allythe asewith

dis retized urves,forinstan e urves omingfrom temporalsequen es.

Here aretworelatedillustrations. Figure 2.1(a)displaysatext mining

ex-ample 1

. It mixesMedline (1033medi alabstra ts) andCraneld (1398

aero-nauti al abstra ts) making a total of 2431 do uments. Furthermore, all the

words(ex ludingstopwords)are onsideredasfeaturesmakingatotalof

9275

uniquewords. Thedatamatrix onsistsof do umentsontherowsandwords

onthe olumnswithea hentrygivingthetermfrequen y,thatisthenumberof

o urren esof orrespondingword in orrespondingdo ument. Figure2.1(b)

displaysa urveexample. ThisKneadingdataset omesfromDanoneVitapole

ParisResear hCenterand on ernsthequalityof ookiesandtherelationship

withtheourkneadingpro ess(Lévéderetal.[2004℄). Itis omposed by115

dierentoursforwhi hthedoughresistan eismeasuredduringthekneading

pro essfor480se onds. Wenoti ethattheequispa edinstantsoftimeinthe

interval [0; 480℄ (here 241 measures) ould be mu h more large than 241 if

measuresweremorefrequentlyre orded.

(5)

(a) (b)

Figure2.1: Examplesofhigh-dimensionaldatasets: (a) Textmining:

n = 2431

do umentsandthefrequen ythat

d = 9275

uniquewordso urs in

ea hdo ument(awhiter ellindi atesahigherfrequen y);(b)Curves:

n = 115

kneading urvesobservedat

d = 241

equispa edinstantsoftimein

theinterval[0;480℄.

Su hate hnologi alrevolutionhasahugeimpa tinother s ienti elds,

as so ietal or also mathemati al ones. In parti ular, high-dimensional data

managementbrings somenew hallenges to statisti ianssin estandard

(low-dimensional)dataanalysismethodsstruggletodire tlyapplytothenew

(high-dimensional)datasets. Thereason anbetwofold,sometimeslinked,involving

either ombinatorialdi ultiesordisastrouslylargeestimatevarian ein rease.

Dataanalysismethodsareessentialforprovidingasyntheti viewofdatasets,

allowing data summary and data exploratory for future de ision making for

instan e. Thisneedisevenmorea uteinthehigh-dimensionalsettingsin eon

theonehand thelargenumberofvariablessuggeststhat alot ofinformation

is onveyedby databut, in theother hand, su h information maybehidden

behindtheirvolume.

Clusteranalysisisoneofthemaindataanalysismethod. Itaimsat

parti-tioningadataset

x

= (x

1 , . . . , x

n

)

, omposedby

n

individuals andlyingin a

spa e

X

ofdimension

d

into

K

groups

G

1 , . . . , G

K

. This partitionis denoted

by

z

= (z

1 , . . . , z

n

)

, lying in aspa e

Z

, where

z

i

= (z

i1

, . . . , z

iK

)

′

isave tor

of

{0, 1}

K

su hthat

z

ik

= 1

ifindividual

x

i

belongstothe

k

thgroup

G

k

,and

z

ik

= 0

otherwise(

i = 1, . . . , n

,

k = 1, . . . , K

). Figure 2.2givesanillustration

of this prin iple when

d = 2

. Model-based lustering allows to reformulate

luster analysis as a well-posed estimation problem both for the partition

z

andfor thenumber

K

of groups. It onsiders data

x

1 , . . . , x

n

as

n

i.i.d.

real-izationsofamixture pdf

f (

·; θ

K

) =

P

K

k=1

π

k

f (

·; α

k

)

,where

f (

·; α

k

)

indi ates

the pdf, parameterizedby

α

k

, asso iatedto the group

k

, where

π

k

indi ates

themixture proportionof this omponent(

P

K

k=1

π

k

= 1

,

π

k

≥ 0

)and where

θ

K

= (π

k

, α

k

, k = 1, . . . , K)

indi atesthewholemixtureparameters. Fromthe

wholedataset

x

itisthenpossibletoobtainamixtureparameterestimate

θ

ˆ

K

(6)

It is also possible to derivean estimate

K

ˆ

from an estimate of the marginal

probability

f (x

ˆ

|K)

. Moredetailsonmixturemodels,relatedestimationof

θ

K

,

z

and

K

aregiventhroughout Chapter??.

−2

0

2

4 −2

0

2

4 X

1

X

2

−2

0

2

4 −2

0

2

4 X

1

X

2

x

_{= (x}

₁

_{, . . . , x}

_n

₎

_−→

_ˆ

z

_{= (ˆ}

z

₁

_{, . . . , ˆ}

z

_n

₎

,

K = 3

ˆ

Figure 2.2: The lusteringpurposeillustratedinthetwo-dimensionalsetting.

Beyond the ni e mathemati al ba kground it provides, model-based

lus-teringhasledalsotonumerousandsigni antpra ti alsu essesin the

low-dimensional settingasChapter??relates,withreferen estherein. Extending

thegeneralframeworkofmodel-based lusteringtothehigh-dimensional

set-tingisthusanaturalanddesirablepurpose. Inprin iple,themoreinformation

we haveaboutea h individual, the better a lusteringmethod is expe ted to

perform. Howeverthestru tureofinterestmayoftenbe ontainedinasubset

oftheavailablevariablesandalotofvariablesmaybeuselessorevenharmful

to dete tareasonable lustering stru ture. It isthus importantto sele tthe

relevantvariablesfrom the luster analysis viewpoint. Itis are entresear h

topi in ontrast to variable sele tion in regression and lassi ation models

(KohaviandJohn[1997℄;GuyonandElissee [2003℄;Miller[1990℄). This new

interestforvariablesele tionin lustering omesfromthein reasinglyfrequent

useofthesemethodsonhigh-dimensionaldatasets,su hastrans riptomedata

sets.

Threetypesofapproa hesdealingwithvariablesele tionin lusteringhave

beenproposed. The rstonein ludes lustering methods withweighted

vari-ables(seeforinstan eFriedmanandMeulman[2004℄)anddimensionredu tion

methods. Forthislater,M La hlanetal.[2002℄useamixtureoffa tor

analyz-erstoredu etheextremelyhighdimensionalityofageneexpressionproblem. A

suitableGaussianmixturefamilyis onsideredinBouveyronetal.[2007℄totake

into a ountthedimensionredu tionand thedata lustering simultaneously.

In ontrastto thisrstmethod type,thelast twoapproa hessele texpli itly

relevantvariables. Theso- alledlterapproa hessele tthevariablesbeforea

lusteringanalysis(seeforinstan eDashetal.[2002℄;JouveandNi oloyannis

(7)

variable sele tion and lustering. For distan e-based methods, one an ite

Fowlkes et al. [1988℄ for a forward sele tion approa h with omplete linkage

hierar hi al lustering,DevaneyandRam[1997℄whoproposeastepwise

algo-rithmwhere thequalityof thefeature subsetsis measuredwith the obweb

algorithm or the method of Brus o and Cradit [2001℄based on the adjusted

Rand index for

K

-means lustering. There exists also wrapper methods in

the model-based lustering setting. Whenthe number of variablesis greater

thanthenumberof individuals,Tadesse etal.[2005℄proposeafullyBayesian

method usingareversiblejump algorithm to simultaneously hoosethe

num-berofmixture omponentsandsele tvariables. Kimetal.[2006℄useasimilar

approa h byformulating lustering intermsof Diri hletpro ess mixtures. In

Gaussian mixture model lustering, Lawet al. [2004℄proposeto evaluatethe

importan eofthevariablesinthe lusteringpro essviafeaturesalien iesand

use the Minimum Message Length riterion. Raftery and Dean [2006℄re ast

theproblem of omparing twonestedvariable subsetsasamodel omparison

problemandaddressitusingBayesfa tor. Aninterestingaspe toftheirmodel

formulationis that irrelevantvariables are notrequired to be independentof

the lusteringvariables. Theyavoidthustheunrealisti independen e

assump-tionbetweentherelevantandirrelevantvariablesforthe lustering, onsidered

inTadesseetal.[2005℄,Kimetal.[2006℄andLawetal.[2004℄.Intheirmodel,

the whole irrelevant variable subset depends on the whole relevant variables

throughalinearregressionequation. However,somerelevantvariablesarenot

ne essarilyrequired to explain allirrelevant variables in the linearregression

andtheirintrodu tioninvolvesadditionalparameterswithoutasigni ant

in- rease ofthe loglikelihood. Therelated extensionsproposedby Maugiset al.

[2009a,b℄followthisremark.

Many model proposals already exist, in luding asso iated parameter

esti-mation and, sometimes, spe i model sele tion strategies. We will divide

these models into anoni al and non- anoni al ones, indi ating if parameter

onstraintsare respe tivelydened relatively to the initial data spa eor

rel-ativelyto atransformation (afa torialmappingtypi ally). Beforepresenting

su hmodels, andtheirrelatedmodelsele tionpro ess, wedrawwhatarethe

pros(blessing) andthe ons( urse)ofhavingmanyvariablesforperforminga

lusteranalysispro ess.

2.2 HD lustering: Curse or blessing?

2.2.1 HD density estimation: Curse

Inthe previousse tion, weprovided someexamplesofhigh-dimensionaldata

sets. Inthe presentse tion, the aim is to givea somewhat moretheoreti al

(8)

ontheparametri ases. Italsoreliesonsomeasymptoti arguments. Remind

thatwe onsideradataset

x

= (x

1 , . . . , x

n

)

,

x

i

beingdes ribedby

d

variables.

Non-parametri ase

In the non-parametri situation, usually

x

i

is onsidered to rely in a

high-dimensional spa eassoonas

n = o e

d

,thusassoonasthelogarithmofthe

samplesize,

ln n

,isnegligiblebesidethespa edimension

d

. Arstjusti ation

of this laim is given by Bellman [1961℄: To approximate within error

ǫ > 0

a (Lips hitz) fun tion of

d

variables, about

(1/ǫ)

d

evaluations (provided by

the sample size

n

...) on a grid are required. A se ond justi ation is also

given bySilverman[1986℄: ApproximatingaGaussian distribution withxed

Gaussian kernelsandwith approximateerror ofabout10%requires asample

size

log

10 n(d)

≈ 0.6(d − 0.25)

. For instan e, with

d = 10

,

n(10)

≈ 7.10

5

,

implyingalreadyahugesamplesizeforaquitemoderatedimensionalsetting.

Parametri ase

In theparametri situation, let

S

m

be a model des ribed by

D

m

ontinuous

parameters,likelydependingonthedimension

d

. Insu ha ase,thedataset

x

issaidtorelyinahigh-dimensionalspa eassoonas

n

issmallin omparison

to a parti ular fun tion

g

of

D

m

, namely

n = o(g(D

m

))

. As an illustration

for

g

, we onsider theheteros edasti Gaussian mixture with true parameter

θ

∗

and

K

omponents. Wenote

θ

ˆ

K

theGaussian MLE with

K

omponents.

Inthat situation,

g

isalinearfun tion fromthe followingresult(Maugisand

Mi hel[2012℄): Itexistspositive onstants

κ

and

A

su hthat

E

x

[d

2 _H

(f (

·; θ

∗

), f (

·| ˆ

θ

_K

ˆ

))]

≤ κ

inf

K

{

KL

(f (

·; θ

∗

_{), f (}

·; ˆ

θ

K

)) +

pen

(K)

} +

1 n

where

d

H

denotes theHellingerdistan e, KLtheKullba k-Leiblerdivergen e

and pen

(K)

≥ κ

D

K

n

2A ln d + 1

_{− ln}

1 _∧

D

K

n

A ln d

.

ThustheHDnon-parametri andparametri situationsaredrasti ally

dif-ferentinmagnitude. However,inpra ti e,

D

K

anbehighsin e

D

K

∼ d

2 _/2

in

this Gaussiansituation, ombinedwithpotentiallylarge onstants. For

high-lightingthisfa t, onsiderthefollowingtwo- omponentmultivariateGaussian

mixture:

π

1 = π

2 =

1

2 ,

X

1 |Z

11 = 1

∼

N

(0, I),

X

1 |Z

12 = 1

∼

N

(1, I),

(2.1) with

a

= (a . . . a)

′

a real ve tor of size

d

. An illustration of this setting is

displayedinFigure 2.3(a). Notethat thetwo omponentsaremoreandmore

separated when

d

grows sin e

k1 − 0k

I

=

√

(9)

mixturedensityestimatedegrades(the Kullba k-Leiblerdivergen ein reases)

whendimensionin reasesasitisillustratedinFigure2.3(b)witha

homos edas-ti modelandwithequalmixingproportions.

−3

−2

−1

0

1

2

3

4 −4

−3

−2

−1

0

1

2

3

4 x1

x2

1

2

3

4

5

6

7

8

9

10

12

12.5

13

13.5

14

14.5

15

15.5

16

16.5 d

Kullback−Leibler

(a) (b)

Figure2.3: HD urseintheparametri densityestimation ontext: (a)A

bivariatedata setexamplewithisodensityofea h omponentand(b)the

Kullba k-Leiblerdivergen eof thedensityestimatewhen

d

in reases.

2.2.2 HD lustering: A mix of urse and blessing

Contrarytodensityestimationwherein reasingdimensionhasa learnegative

ee t, dimensionmay havebothpositiveand negativeee ts onthe

luster-ing task. We distinguish now whi h fa tors favorsu h blessing or urse

out omes.

Blessingfa tors

Weretrievethemodeldesign(2.1). Wedisplayagaina orrespondingsample

in Figure 2.4(a). We have already mentioned that the two omponents are

moreand moreseparatedwhen

d

in reases. The reasonis that ea h variable

uniformly provides its own separation information su h that the asso iated

theoreti alerrorde reaseswhen

d

grows. Indeed,thiserrorisequalto

err

theo

=

Φ(

₋

√

d/2)

,where

Φ

isthe dfofN

(0, 1)

. We anseethisde reasewith

d

bya

dashlineinFigure2.4(b). Aninteresting onsequen eisthenthattheempiri al

error rate de reases also with

d

as it ould be noti ed in ontinuous line in

Figure 2.4(b). It meansthat in reasing dimensionmayhave apositiveee t

onthe lustering task assoonasallvariables onveymeaningfulinformation

onthehiddenpartition.

(10)

−3

−2

−1

0

1

2

3

4 −4

−3

−2

−1

0

1

2

3

4 x1

x2

1

2

3

4

5

6

7

8

9

10

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4 d

err

Empirical

Theoretical

(a) (b)

Figure2.4: HDblessingin the lustering ontextwhenmostvariables onvey

independentpartitioning information: (a)A bivariatedatasetexamplewith

isodensityofea h omponentand(b)thetheoreti al(dashline)andthe

empiri al( ontinuousline)errorratewhen

d

in reases.

sians,allmoreandmoreseparatedwhen

d

in reases:

π

1 = π

2 = π

3 =

1 ₃

,

X

₁

_|Z

₁₁

_{= 1}

_∼

N

(0, I),

X

1 |Z

12 = 1

∼

N

(2, I),

X

1 |Z

13 = 1

∼

N

(

−2, I), .

Then Figure 2.5(a)-(d) displaysa related sample of size

n = 1000

for

dier-ent dimensionson the main two axes of the Fa torial Dis riminant Analysis

(FDA)mapping. It learlyappearsthat omponentsaremoreandmoreeasily

re ognizedwhendimensionin reases,althoughitisasimplevisualization

pro- ess. Atthelimit,no omplex lusteringalgorithmwouldbeenoughtoidentify

lusters...

Cursefa tors

Infa t,in reasingdimensionmayhaveapositiveee t on lustering retrieval

onlyifvariablesinje tsomepartioninginformation. Inaddition,su h

informa-tionhastobenotredundant. Weillustratenowthesetwoparti ularfeatures.

Firstly,we onsidermanyvariableswhi hprovidenoseparationinformation.

We retrieve thesame parametersetting as (2.1)ex ept that the omponents

arenot moreseparatedwhen

d

growssin e

kµ

2 − µ

1 k

I

= 1

, where

µ

1 = 0

is

the enter ofthe rstGaussianand where

µ

2 = (1 0 . . . 0)

′

istheoneofthe

se ond,thus(

k = 1, 2

)

X

₁

_|Z

_1k

_{= 1}

_∼

N

(µ

k

, I).

(2.2)

A sampleisdisplayedonFigure2.6(a). Figure 2.6(b)showsin dashline that

thetheoreti alerrorrateis onstant(it orrespondsto

err

theo

= Φ(

−

1

2 )

)when

(11)

−4

−3

−2

−1

0

1

2

3

4 −2.5

−2

−1.5

−1

−0.5

0

0.5

1

1.5

2

2.5 1st axis FDA

2nd axis FDA

d=2

−2

−1.5

−1

−0.5

0

0.5

1

1.5

2 −3

−2

−1

0

1

2

3

4 1st axis FDA

2nd axis FDA

d=20

(a) (b)

−1.5

−1

−0.5

0

0.5

1

1.5 −4

−3

−2

−1

0

1

2

3

4

5 1st axis FDA

2nd axis FDA

d=200

−1.5

−1

−0.5

0

0.5

1

1.5 −3

−2

−1

0

1

2

3 1st axis FDA

2nd axis FDA

d=400

( ) (d)

Figure2.5: Fa torialDis riminantAnalysis(FDA)onthemain twofa torial

axesofthree Gaussian omponentsmoreandmoreseparatedwhenthespa e

dimensionin reases: (a)

d = 2

,(b)

d = 20

,( )

d = 200

and

d = 400

.

Se ondly,we onsidera asewheremanyvariablesprovideseparation,but

redundantinformation,inthefollowingsense: Itisthesameparametersetting

asbeforefortherstdimensionex eptforallotherones

X

_1j

_{= X}

₁₁

_{+ ε}

_j

_,

where

ε

j

iid

∼

N

(0, 1) (j = 2, . . . , d).

(2.3)

SeeadataexampleinFigure2.7(a). Thus, omponentsarenotmoreseparated

when

d

growssin e

kµ

2 −µ

1 k

Σ

= 1

,

Σ

denotingthe ommon ovarian ematrix

of ea h Gaussian omponent, and

µ

k

denoting the enter of the omponent

k = 1, 2

(notethatboth

µ

k

and

Σ

ouldbeeasily omputedfromEquation(2.2)

and(2.3)). Consequently,

err

theo

= Φ(

−

1

2 )

is onstantandtheempiri alerror

in reaseswith

d

,asillustratedin Figure2.7(b)withprevious onventions.

2.2.3 Intermediate on lusion

(12)

−4

−3

−2

−1

0

1

2

3

4 −5

−4

−3

−2

−1

0

1

2

3

4 x1

x2

1

2

3

4

5

6

7

8

9

10

0.3

0.32

0.34

0.36

0.38

0.4

0.42

0.44 d

err

Empirical

Theoretical

(a) (b)

Figure2.6: HD ursein the lustering ontextwhenvariables onveyno

partitioninginformation: (a)Abivariatedatasetexamplewithisodensityof

ea h omponentand(b)thetheoreti al(dashline)andtheempiri al

( ontinuousline)errorratewhen

d

in reases.

−3

−2

−1

0

1

2

3

4 −6

−4

−2

0

2

4

6 x1

x2

1

2

3

4

5

6

7

8

9

10

0.3

0.32

0.34

0.36

0.38

0.4

0.42 d

err

Empirical

_Theoretical

(a) (b)

Figure2.7: HD urseinthe lustering ontextwhenvariables onvey

redundantpartitioninginformation: (a)Abivariatedatasetexamplewith

isodensityofea h omponentand(b)thetheoreti al(dashline)andthe

empiri al( ontinuousline)errorratewhen

d

in reases.

spa e. Inparti ular, lter methods performingvariable sele tionbeforethe

lusteringtaskhavetobeex luded,theriskofremovingdis riminantfeatures

beingtoolarge. Theremainingquestionisthenwhi hwrappermethodstobe

used? Su hmethodsshouldmanagewithprioritythefa tthatsomevariables

have negative ee ts for lustering. The general answeris to design spe i

parsimoniousmodelsfor lustering,themostemblemati onesrelyingonsome

variable sele tion prin iple. Wewill see also several alternativestrategies, in

parti ularvariable lustering(tonotbemingledwithindividual lustering,our

(13)

atransformedspa e) evenif itis notofteninitially des ribed withthis point

ofview.

Behingthismodeldesignwhi histherststepofhigh-dimensional

model-based lustering,thequestionofmodelsele tionisthenasked. Insome

situa-tions,traditionalmodelsele tion riteria ouldbedire tlyapplied. However,in

many ases,twokindsofdi ulties mayhappen. Firstly,thenumberof

om-peting models avoidsto enumerateall possible modelswhi h ompete.

Typi- ally,inavariablesele tion ontextthenumberofpossibilitiesis ombinatorial.

Insu h a ase,strategiesfordesigninganintelligentpathinarelevantsubset

ofmodelsisapossibleanswer. Se ondly,validityoftraditionalmodelsele tion

riteriathemselves an be hallenged,requiringsomeoriginalproposals.

In the rest of this hapter, we will give an overview of the main

high-dimensional lustering methods. We will systemati ally highlight novelty of

theproposedmodels,possible onne tionsbetweenthem(variablesele tionor

variable lustering,initial spa eornon- anoni al spa e)and issuesfor model

sele tion( riteriaandstrategiesofuse).

2.3 Non- anoni al models

As dis ussed previously, modelsdesigned forhigh-dimensional lustering rely

on parsimonious denition of related parameters. In this se tion, we fo us

on situations where parsimony is inje ted through parameters dened in a

transformedfeaturespa e, alledherenon- anoni alfeaturespa e. We onsider

this asebefore the anoni alfeaturespa esituation (next se tion)sin eitis

somewhatrelatedtothepioneering ideaof ltering. Indeed,fa torialanalysis

(for instan e prin ipal omponent analysis in the ontinuous ase) was rst

ondu tedforsele ting(new)variablesbeforeapplyingany lusteringmethod

on them. Here, ideas are related but with a wrapper point of view. Most

situationsaddress ontinuousfeatures.

2.3.1 Gaussian mixture of fa tor analysers

In Gaussian model-based lustering, in reasing the number of variables has

its main ee t on the number of parametersin luded in the ovarian e

ma-tri es

Σ

k

, sin eit is ofquadrati order. Consequently, mostmethods aim at

introdu ingparsimonyrston

Σ

k

. Historyanddetails ouldbefoundin

Bou-veyronand Brunet[2014℄. In parti ular, Ghahramaniand Hinton[1997℄and

M La hlan[2003℄designthefollowingreparameterizationof

Σ

k

:

Σ

_k

_{= B}

_k

B

′

k

+ ω

k

Λ

k

where

B

k

isaloadings

d

× q

non-squarereal matrix(

1 ≤ q ≤ q

max

,

q

max

< d

),

ω

k

isapositiverealnumberand

Λ

k

isa

d

× d

diagonalpositivedenitematrix

(14)

it is equivalentto assuming

X

1 ∈ R

d

to be generated bythe following latent

variable

Y

1 ∈ R

q

lyingin asmaller(latent)spa ethan

R

d

X

₁

_|Y

₁

_{, Z}

_1k

_{= 1 = B}

_k

Y

₁

_{+ µ}

_k

_{+ ε}

_k

where

Y

1 ⊥ ε

k

(

⊥

denotingindependen e),

Y

1 ∼

N

(0, I)

and

ε

k

∼

N

(0, ω

k

Λ

k

)

.

In this layout,

Y

1

is alled the fa tor, by straightforward analogy to fa tor

analysismethods. Estimationisperformedthroughanalternating

expe tation- onditionmaximization(AECM)algorithm(MengandvanDyke[1997℄).

Complexityofsu hamodelisequalto

D

m

= (K

− 1) + Kd + Kq[d − (q −

1)/2] + Kd

,whereit anbeseenthatthequadrati parthasvanished. Infa t,

it orrespondstothe most omplexmodel ofawholefamily, M Ni holas and

Murphy [2008℄havingdened 12 asso iated parsimoniousversions, in luding

forinstan e inter- lassequality between

B

k

,identityof

Λ

k

= I

, et . Finally,

models in ompetition

(

S

m

)

m

∈M

gather the ombinations non only of these

12parsimoniousversionsbutalsoofthe ouples

(q, K)

ofthelatentdimension

and of the number of omponents. In pra ti e,

q

max

is expe ted to be quite

low for parsimonious reasons and thus the ardinal of

M

is not ex essively

high. Traditionalmodelsele tion riteria(asBIC) anthenbedire tlyapplied

onthis olle tion. The rpa kage pgmm

2

providesanimplementationof this

method.

2.3.2 HD Gaussian mixture models

Bouveyron et al. [2007℄propose another wayfor obtaining parsimony onthe

ovarian ematri es

Σ

k

. Itrelaysonthefollowingspe tralde omposition

Σ

_k

_{= D}

_k

∆

_k

D

′

k

where

D

k

is the orthogonal matrix of the eigenve tors of

Σ

k

and

∆

k

is a

diagonalmatrix ontainingtherelatedeigenvalues. Theyimpose

∆

k

tofollow

theparsimoniousstru ture

∆

_k

₌







a

k1

0

. . .

0 a

kq

k

0

0 b

k

0

. . .

0 b

k













q

k







(d

− q

k

)

with

a

kj

≥ b

k

> 0

, for

j = 1, ..., q

k

and

q

k

< d

. Su h an assumption an

besomewhat relatedto a kindof prin ipal omponentanalysis perGaussian

group. It ould also be viewed asakind of variable lustering sele tion, the

(15)

d

_{− q}

k

remainingvariablesof

∆

k

orrespondingto agroupofnoisy features.

Figure2.8illustratesathreedimensional(

d = 3

)andtwo omponentssituation

(

K = 2

)wherebothsubspa edimensions

q

1

and

q

2

areequal(

q

1 = q

2 = 2

)but

dierinorientation. Estimation aneasilyperformedthroughanEMalgorithm

forinstan e.

Figure2.8: IllustrationoftheHD lusteringmixtureGaussianmodelin atwo

omponentssituation (providedbyBouveyronetal.[2007℄).

Complexityof su ha model is given

D

m

= (K

− 1) + Kd +

P

K

k=1

q

k

[d

−

(q

k

+ 1)/2] +

P

K

_k=1

q

k

+ 2K

. In addition, Bouveyron et al. [2007℄ propose

eight parsimonious versions by imposing for instan e equality between

sub-spa e dimensions (

q

k

= q

, for all

k

), et . Finally, the whole model family

(S

m

)

m

∈M

in ludes ouples

((q

1 , . . . , q

K

), K)

ofsubspa edimensionand

num-berof omponents, ombinedwiththeeightmodels. Sin e

q

k

maydependon

the omponent, ontrarytotheGaussianmixtureoffa toranalysersdes ribed

inthepreviousse tion,thenumberofmodelsbe omes ombinatorial. Then,it

maybedi ultintheHD settingtobrowseallmodelsforapplyingaBIC-like

riterionforinstan e. Consequently,Bouveyronet al.[2007℄proposeakindof

ruleof thumb riterion forsele tingea h

q

k

, lookingfor abreak in the

eigen-values reeof theempiri al ovarian e matrixforea hgroup omponent,the

so- alleds reetestof Cattell[Cattell,1966℄. Thermixmodpa kage

3

(Lebret

etal.[2015℄)implementsthesemodels.

2.3.3 Fun tional data

Fun tional and dis retized data

Stri tlyspeaking, realfun tionaldata(RamsayandSilverman[2005℄,Ferraty

and Vieu [2006℄) orrespond to

i = 1, . . . , n

urveswhi h are realizations of

n

randomvariableslinkedto

n L

2

- ontinuousreal-valuedsto hasti pro esses

Y

_i

₌

_{Y

_i

_(t)

_{∈ R, t ∈ [0, T ]}}

taking values in aHilbert spa e

H

of fun tions

dened on the (time) interval

[0, T ]

. Thus, it orresponds to an innite

di-mensional spa e. Sin e most fun tional data arelongitudinal, we adopt here

(16)

the onventionof parameterizingmodelsintermsoftime. However,itapplies

equallywell withanyother featuresasangle, length, et . Inaddition,

exten-sionsarepossibleformultivariate urves,itmeansthatindividual

i

isdes ribed

byseveral urves(seeforinstan eJamesandSugar[2003℄orJa quesandPreda

[2014b℄).

Inpra ti e,ea h

Y

i

isunobservedfortwo,essentiallyte hnologi al,reasons.

Firstly,the

n

urves

Y

i

aredis retizedea hin

m

i

time-points

{Y

i

(t

is

), 0

≤ s ≤

m

i

, t

is

∈ [0, T ]}

. Se ondly, an error on observation is usually present su h

that only

m

i

ordered time-points

{X

i

(t

is

), 0

≤ s ≤ m

i

, t

is

∈ [0, T ]}

(

i =

1, . . . , n

)are availablefor ea h urve. For instan e, thefollowingrelationship

between dis retized (unobserved) values

Y

i

(t

is

)

and noisy (observed) values

X

i

(t

is

)

ouldbeassumed:

X

i

(t

is

) = Y

i

(t

is

) + ε

is

,

(2.4)

where

ε

is

haszeromeanandisun orrelatedwithea hotherand

Y

i

(t

is

)

. Other

assumptionsarepossibleaswewill seebelow.

WerefertoJa quesandPreda[2014a℄forageneralreviewon lusteringfor

fun tionaldata,in ludingthemodel-basedone. Di ultyofperforming

unan-imous lusteringongenerativedistributions omesfromthefa tthat, ontrary

tothenite-dimensionalsetting,thenotionofdensityprobabilityis generally

not dened for fun tional random variable (Delaigle and Hall [2010℄).

Con-sequently, relatedte hniquesrequiredening densityprobabilities in a

nite-dimensionalspa e,leadingtomultiple anddierentimplementations.

Inthis hapter,wedividemodel-based lusteringte hniquesintotwo

dier-ent ategories: these ones wherethe generativemodel isexpli itlydened on

theobserved values

X

i

=

{X(t

is

), 0

≤ s ≤ m

i

, t

is

∈ [0, T ]}

,

i = 1, . . . , n

, and

these ones forwhi h itis notthe ase. Indeed,this split will haveimportant

onsequen esforsomeaspe ts on erningmodelsele tion.

Clustering with noexpli it distributionon

X

i

Usually, therst stepbefore a lusteringmethod isto re onstru t theinitial

fun tionalformofdata. It anthenbeviewedasaprepro essingstep

(lter-ing method). Itoftenreliesontheassumptionthattheunobserved urve

Y

i

anbeexpressed in abasisof

d

fun tions

{φ

j

}

j=1,...,d

, forinstan e B-splines

orwavelets,inthefollowingform:

Y

i

(t) =

d

X

j=1

γ

ij

φ

j

(t).

Usingthentheregression(2.4)hypothesis,traditionalleastsquared oe ients

estimatesareobtainedby

ˆ

(17)

where

Φ

i

= (φ

j

(t

is

))

isa

m

i

× d

matrixgatheringthevalueofea hbasis

fun -tionforea htimedis retizationknot. Finally,standardmodel-based lustering

te hniques(typi allymultivariateGaussian mixtures, eventuallyHD variants

previously des ribed in Se tions 2.3.1 and 2.3.2) an be dire tly applied on

theestimated oe ients

γ

ˆ

i

. Thepartition onindividuals

X

i

isobtainedasa

simpleby-produ t,beingthesameasthisone ofindividuals

γ

ˆ

i

.

Instead of partitioning the basis oe ients

γ

ˆ

i

, a model-based lustering

te hnique an be alternatively applied to some prin ipal omponent s ores

resulting from fun tional prin ipal omponent analysis (FPCA) of the

pre-vious re onstru ted urves. In pra ti e, the omputational pro ess for

im-plementing FPCA onsists of performing a standard ( entered) PCA to the

matrix

ΓW ˜

˜

Γ

′

_T

, where

Γ

= (ˆ

γ

ij

)

isthe

n

× d

matrixofestimated oe ients,

T

₌

1 n

I

is the

n

× n

matrix of weights for urves,

Γ

˜

is the

n

× d

matrix of

entered oe ients of

Γ

and

W

is the

d

× d

matrix of the inner produ ts

w

jj

′

=

R

T

0 φ

j

(t)φ

j

′

(t)dt

(

1 ≤ j, j

′

≤ d

) (it a ts like ametri ). Thus, the

j

th

prin ipal omponents ore

C

j

isthe

j

theigenve torasso iated tothe largest

j

theigenvalue:

˜

ΓW ˜

Γ

′

TC

_j

_{= α}

_j

C

_j

_.

As usual with PCA, FPCA performs a kind of variable ordering. Finally,

lusteringisperformedonatrun atingprin ipal omponents ores

C

1 , . . . , C

q

,

with

q

≤ d

.

Fromamodel sele tionpointof view,bothpreviousmethods allowtouse

someinformation riterialikeBICforsele tingthenumber

K

of omponents.

However,it is notreally possibleto use them forsele ting other parts ofthe

model whi h are thefun tional basis

{φ

j

}

j=1,...,d

and, spe i ally to FPCA,

thetrun ationoforder

q

.

Clustering with expli it distributionon

X

i

Ideally, for bene ingfrom thewhole mathemati alstatisti s orpus,

model-based lusteringte hniqueswouldrequireadistributiononall

X

i

= (X

i

(t

is

), 0

≤

s

≤ m

i

, t

is

∈ [0, T ])

,

i = 1, . . . , n

. Firstofall,itisimportanttonoti ethat

per-formingthe lustering taskdire tly withobservedvalues

X

i

'sasiftheywould

orrespondto lassi almultivariatevaluesisnotdesirable,evenifit ouldmeet

thisgoal. Therstreasonisthatea h

X

i

doesnotne essarilyrelyinthesame

spa e dimension (here

m

i

for ea h), even if in pra ti e it ould be oftenthe

ase. These ondandthemostimportantreasonisthatworkingwithsu hraw

datawastesorderinformationonthem.

Contrarytotherawdata ase,severalte hniquesproposedistributionson

X

_i

whi h take all the fun tional data spe i ity into a ount. Ja ques and

Preda [2013℄ perform FPCA by group, leading to prin ipal omponents per

(18)

It leads to the following Gaussian mixture model, relying on atrun ation of

order

1 ≤ q

k

≤ d

forea h omponent:

f (x

i

; θ)

≈

K

X

k=1

π

k

q

k

Y

j=1

φ(C

ijk

; 0, α

jk

)

where

φ(

·; 0, α

jk

)

is theunivariateGaussiandensityofmeanzero(s ores

C

ijk

are entered) andvarian e

α

kj

( orrespondingalsoto eigenvalues). Then,

pa-rameterestimation is provided throughanEM-likealgorithm formaximizing

the(pseudo)log-likelihood,wherebothstepsarethefollowing:

E-step Compute onditionalprobabilities

t

ik

∝ π

k

Q

q

k

j=1

φ(C

ijk

; 0, α

kj

)

asusual.

M-step First, prin ipal s ores are updated. Noti e that weights

T

k

depend

nowon

t

ik

's,

Γ

k

too. Se ond, perform the

q

k

trun ation order sele tion

bydete tingakindofelbowintheeigenvaluesbythes reetestof

Cat-tell (Cattell[1966℄). Finally, parameters

π

k

are omputed asusual and

parameters

α

k

arealreadygivenfrom previous onditionalFPCA.

Thispro essisimplementedin therfun lust

4

pa kage. Asan illustration,

thispa kageisappliedtokneading urves,whi haredes ribedin Se tion2.1,

inFigure2.9. Fromamodelsele tionpointofview,therearesomeimportant

remarks. Stri kly speaking, it is just apseudo likelihood method sin e data

C

ijk

are hangingat ea h iterationstepofEM. Consequently, using sele tion

riterialikeBIC ouldbehazardousfor hoosing

K

,

q

k

orthefun tionalbasis.

However,in pra ti e,BICworks wellfor hoosing

K

. However,itisnotused

forsele ting

q

k

,asprevioussaid,forlimiting omputingtime. Noattemptfor

hoosing thebasisisperformed.

(a) (b)

Figure2.9:

n = 115

kneading urvesobservedat

d = 241

equispa edinstants

oftimein theinterval[0;480℄: (a)raw urves,(b)threegroupspartioning

urveswiththefun lustpa kage.

Alternatively,JamesandSugar[2003℄ onsiderrandomnessdire tly onthe

basis oe ients

γ

i

. Theyassumethat

γ

i

arisesfromahomos edasti Gaussian

(19)

multivariate model whi h, oupling with (2.4), provides the following

regres-sion model, onditionally on the

i

th urve belonging to the

k

th luster (so

onditionalto

Z

ik

= 1

):

X

_i

_{= Φ}

_i

_(µ

_k

_{+ ǫ}

_i

_{) + ε}

_i

_,

where

ǫ

i

∼

N

(0, Σ)

and

ε

i

∼

N

(0, σ

2 _I

₎

. Also,someparsimoniousassumptions

aremadeon enters

µ

k

. Then,anEMalgorithmallowstoestimateall

param-eters. ContrarytothemodelofJa quesandPreda[2013℄des ribedjustabove,

wearenowfa edtoanunambiguousgenerativeapproa hallowing

straightfor-ward model sele tionwith any lassi al riterion for hoosing everyquantity

of interest (the number

K

of lusters, the basis

{φ

j

}

and the parsimony of

all means

µ

k

), even if the authors prefer to use a so- alled distortion

fun -tion riterionforsele ting

K

fastersin eavoidingEM omputationsforall

K

values.

In the samespirit asJames and Sugar [2003℄,Samé et al. [2011℄give

an-other regression model providing a full generative, exible and parsimonious

distribution onthe

X

i

's. They assume that the urvesarise from amixture

of regressions on a basis of polynomial fun tions (the order to be given by

modelsele tion),withpossible hangesin regimeatea hinstantoftime. The

mixing proportionsaredenedbylogisti fun tions forallowingsegmentation

in time. An EM pro edure is performed forestimation and several

parsimo-niousversions aredes ribed. Thisfullgenerativedistributionallowsagainfull

modelsele tion(numberof lusters,polynomialorderofthebasisfun tionand

number of regime hanges) in any standard way. However, as in many

pre-vioussettings, thenumberof ompeting models an in reasedrasti ally. For

instan e,thebasi fun tions an hangebyregime,multiplying ombinations.

2.3.4 Intermediate on lusion

Many parsimoniousmodelling solutionsexist fordealing with HD data,

on- erningaswellindependentand fun tionaldata, evenifsomegaps remainto

be lledlike ategori al fun tional data or also mixed ( ontinuous and

ate-gori al typi ally) multivariate fun tional data. Most of existing models rely

onagenerativedistributiononthedataspa e,allowingdire tuseofstandard

sele tion riteria. However,the ru ial questionisfo usedonthemultipli ity

ofmodelstobe ompared. Itisthereasonwhysomeauthorsfavorsomemore

empiri al,butfast,rulesformodelsele tion.

Weguessthatfutureresear hesshouldaddressnewadvan esforfast

sele -tionofmultiplemodelsin ashortallo atedtime. Inthenextse tion,devoted

to anoni almodelsetting,wewillseeearlyseveralattemptsforthispurpose,

(20)

2.4 Canoni al models

We address now models for HD data whi h position parsimony assumptions

dire tly on the initial (or anoni al) variable spa e. Advantage of su h

ap-proa hes,besidenon- anoni alones,is agreatmodel readibilityforthe

pra -titioner. Indeed, thisoneis usually morea ustomedto his variableset than

toasomewhatmorearti ialset,asthefa torialfeatures ouldbesometimes.

In this ontext, this hapter ta klesimportantnotions: variablesele tion,

variable lustering,modelsele tionvalidityandalsostrategiesfordealingwith

modelmultipli ity.

2.4.1 Parsimonious mixture models

Classi al mixture models have already been presented in Chapter ??,

Se -tion??. Itgathersinparti ulartheGaussianmixturemodelforthe ontinuous

aseandthelatentmultinomialmixturemodelforthe ategori al ase,

in lud-ingalsomanyparsimoniousvariants. DealingwithHDdataimposeto onsider

essentiallysomeofthemostparsimoniousonesthusthereisaneedtoprovide

moredetailsin thisse tion. Then,extensionto themixed ase(merging

on-tinuous and ategori alfeatures) is presentedasa straightforwardextension.

Allthesemodelsareimplementedintherpa kagermixmod

5

. Finally,wewill

presentanewattemptforvariablesele tioninthe ontinuous, ategori aland

mixedsituations.

Spheri al and diagonalGaussian mixturesfor ontinuousvariables

We onsiderdatasets

x

= (x

1 , . . . , x

n

)

,with

x

i

∈ R

d

. Themostparsimonious

Gaussianmixture modelsdened byCeleux andGovaert[1995℄belongtothe

so- alledspheri alanddiagonalfamilies. Anexampleofdiagonalmodelisgiven

inFigure 2.10. UsingnotationsalreadyprovidedinSe tion?? ofChapter?? ,

their most omplexversionsrespe tively orrespond to onstraints

Σ

k

= λ

k

I

and

Σ

k

= λ

k

B

k

on the ovarian e matrix

Σ

k

of the

k

th omponent, where

λ

k

=

|Σ

k

|

1/d

and

B

k

diagonal with

|B

k

| = 1

. In luding some parsimonious

versions,whi h allowsome parts tovary ornot between omponents,a total

oftwospheri alandfourdiagonalmodelsareavailable. All models,and their

respe tivenumberof parameters,are displayed in Table2.1. Model sele tion

anbeeasilyperformedbytraditional riteria,likeBIC.

Latent lass model for ategori al variables

We onsider now data sets

x

= (x

1 , . . . , x

n

)

, ea h

x

i

ontaining

d

ategori- al variables, the

j

th having

m

j

response levels. The oding

x

i

= (x

jh

i

; j =

5

(21)

Figure 2.10: Isodensityofatwo- omponentsdiagonalGaussianmixture in

thethree-dimensionalspa e.

Family Model Numberofparameters

diagonal

[λB]

dim(π) + Kd + d

[λ

k

B

]

dim(π) + Kd + d + K

− 1

[λB

k

℄

dim(π) + 2Kd

− K + 1

[λ

k

B

k

]

dim(π) + 2Kd

spheri al

[λI]

dim(π) + Kd + 1

[λ

k

I

]

dim(π) + Kd + K

Table2.1: Some hara teristi softhetwospheri alandthefour diagonal

models. Wehave

dim(π) = K

− 1

inthe aseof freeproportionsand

dim(π) = 0

inthe aseofequalproportions.

1, . . . , d; h = 1, . . . , m

j

)

indi atesthat

x

jh

i

= 1

if

i

hasresponselevel

h

for

vari-able

j

and

x

jh

i

= 0

otherwise. Thestandardmodel for lusteringobservations

des ribedthrough ategori alvariablesisthe so- alledlatent lass model(see

forinstan e Goodman[1974℄). Data areassumedto ariseindependentlyfrom

amixture of

K

multivariatemultinomial distributionswithpdf

f (x

i

; θ) =

K

X

k=1

π

k

d

Y

j=1

m

j

Y

h=1

(α

jh

_k

)

x

jh

_i

_,

(2.5)

where

θ

= (π, α)

denotestheve torparameterofthelatent lassmodeltobe

estimated,with

α

= (α

1 , . . . , α

K

)

and

α

k

= (α

jh

k

; j = 1, . . . , d; h = 1, . . . , m

j

)

,

α

jh

_k

denotingtheprobabilitythat variable

j

haslevel

h

ifobje t

i

isin luster

(22)

Lebret et al. [2015℄ propose four parsimonious versions, with thus a

to-tal of ve models. They orrespond to an extension of the

parameteriza-tion of Bernoulli distributions used by Celeux and Govaert [1991℄ for

lus-tering andalso byAit hinson andAitken[1976℄for kerneldis riminant

anal-ysis. Thebasi ideaisto impose theve tor

α

j

k

= (α

j1

k

, . . . , α

jm

j

k

)

to takethe form

(β

j

k

, . . . , β

j

k

, γ

j

k

, β

j

k

, . . . , β

j

k

)

with

γ

j

k

> β

j

k

. Sin e

P

m

j

h=1

α

jh

k

= 1

, we have

(m

j

− 1)β

_k

j

+ γ

_k

j

= 1

and, onsequently,

β

j

k

= (1

− γ

j

k

)/(m

j

− 1)

. The onstraint

γ

_k

j

> β

j

_k

be omesnally

γ

j

k

> 1/m

j

. Then, the ve tor

α

j

k

anbe broken up

intothetwofollowingparameters:

• a

j

k

= (a

j1

k

, . . . , a

jm

j

k

)

where

a

jh

k

= 1

if

h

orrespondstotherankof

γ

j

k

(in

thefollowing,thisrankwillbenoted

h(k, j)

),0otherwise;

• ε

j

k

= 1

− γ

j

k

whi h orrespondstotheprobabilitythatthedata

x

i

arising

from the

k

th omponentaresu hthat

x

jh(k,j)

i

6= 1

.

In other words, the multinomial distribution asso iated to the

j

th variable

of the

k

th omponent is reparameterized by a enter

a

j

k

and the dispersion

ε

j

_k

around this enter. Thus, itallows us to givean interpretation similar to

the enter and thevarian e matrixused for ontinuousdata in the Gaussian

mixture ontext. Finally,therelationshipbetweentheinitialparameterization

andthenewoneisgivenby:

α

jh

_k

=

1 _{− ε}

j

_k

if

h = h(k, j)

ε

j

_k

/(m

j

− 1)

otherwise.

(2.6)

Inthefollowing,thismodelwillbedenotedby

[ε

j

k

]

. Inthis ontext,threeother

models anbeeasilydedu ed. Wenote

[ε

k

]

themodelwhere

ε

j

k

isindependent

of thevariable

j

,

[ε

j

_]

themodelwhere

ε

j

k

is independentof the omponent

k

and, nally,

[ε]

themodelwhere

ε

j

k

is independentofboththevariable

j

and

the omponent

k

. In order to maintain someunity in the notation, we will

denote also

[ε

jh

k

]

the mostgeneralmodel initially introdu ed. Thenumberof

freeparametersasso iated toea h modelis given inTable2.2. Again, model

sele tion anbeeasily performedbytraditional riteria,likeBIC.

Mixeddata models

It is frequent in pra ti e to mix ontinuous and ategori al data. Thus the

i

th individual is omposed by two parts,

x

i

= (x

cont

i

, x

cat

i

)

,

x

cont

i

and

x

cat

i

designingthe ontinuousandthe ategori alonesrespe tively. Inthat ase,it

iseasyto ombine(diagonal)parsimoniousGaussianmixtureandlatent lass

modelby onditionalindependen e [MoustakiandPapageorgiou,2005℄:

(23)

Model Numberofparameters

[ε]

dim(π) + 1

[ε

j

_]

_{dim(π) + d}

[ε

k

]

dim(π) + K

[ε

j

_k

]

dim(π) + Kd

[ε

jh

k

]

dim(π) + K

P

d

j=1

(m

j

− 1)

Table2.2: Numberoffreeparametersofthevemultinomialmodels. We

have

dim(π) = K

− 1

in the aseoffreeproportionsand

dim(π) = 0

inthe

aseofequalproportions.

with

α

k

= (α

cont

k

, α

cat

k

)

(seealsoSe tion??inChapter??). Then,theprevious

sixGaussian mixturemodelsandthevemultinomial mixturemodels anbe

ombined, dening straightforwardly30newmixed models. Classi al riteria

anbeusedforsele tingthem, withalsothenumberof lusters

K

.

Although previously des ribed models, in the ontinuous, ategori al or

mixeddatasituations,arethemostparsimoniousonesintheirrespe tive

fam-ilies, theyare notreallydesigned forrealisti HDsituations involvingseveral

thousandsof variables for instan e. Indeed, theirparameternumber remains

toohighin su h ases.

Variable sele tion has alwaysbeen anatural answer for HD lustering as

alreadydis ussedinthebeginningofthis hapter. Typi ally,lteringmethods

relyingonapreliminaryfa torialanalysisstepthen utthenumberoffa torial

variables to be retained. However, in model-based lustering involving afull

wrapping approa h, the di ulty is to integrate properly this sele tion step

in the model itself. Thus, wedis uss nowmoresuitable methods forthe HD

situation.

2.4.2 Variable sele tion through regularization

In this se tion, we fo us on the variable sele tion problem in the Gaussian

mixture lustering ontext.

ℓ

1

-penalizationpro edures

Inspiredbythesu essoftheLassoregression,PanandShen[2007℄proposeto

takeadvantageofthesparsitypropertyof

ℓ

1

-penalizationofthelikelihood to

performautomati variablesele tionforhigh-dimensionalmodel-based

luster-ing. Theirpro edure, alled PS-Lassoin thesequel, onsistsof usingaLasso

methodtosele trelevant lusteringvariablesandestimatemixtureparameters

inthesameexer ise. The ovarian ematri esareassumedtobeidenti aland

diagonal(

Σ

k

= V =

diag

(σ

2

(24)

parameters. Forany

K

∈ N

∗

,thefollowingfun tionhastobemaximized:

θ

K

7→

n

X

i=1

ln

"

_K

X

k=1

π

k

φ (¯

x

i

; µ

k

, V)

#

− λ

K

X

k=1

kµ

k

1 ,

(2.7) where

θ

K

= (π, µ

1 , . . . , µ

K

, V)

,

kµ

k

1 =

d

P

j=1

µ

kj

,

¯

x

i

= (x

ij

− ¯x

j

)

1≤j≤p

with

¯

x

_j

₌

1 n

P

n

i=1

x

ij

,

λ

is anon-negative regularizationparameterand

φ(

·; µ, Σ)

denotes the multivariateGaussian density of enter

µ

and ovarian e matrix

Σ

. An EM-algorithmisproposedtosolvethisparameterestimationproblem.

Next,amodiedBIC riterionisusedtosele t

K

and

λ

:

BIC

(K,λ)

=

−2 ln

"

_n

Y

i=1

K

X

k=1

π

k

φ(x

i

; µ

k

, V)

#

+ ln(n)D

(K,λ)

where

D

(K,λ)

= (K

− 1) + Kd + d − q

,

q

denotingthenumberofthemaximum

penalizedlikelihoodestimatemean omponentsthatareequalto

0

.

This approa h was su essively extended in Zhou et al. [2009℄(Gaussian

mixtureswith diagonal ovarian ematri es)and nally in Zhou etal. [2009℄.

In this last paper, aregularized Gaussian mixture model with un onstrained

ovarian ematri esisproposed.Theyemploya

ℓ

1

penaltyonmeanparameters

andon ovarian ematri esasfollows:

θ

K

7→

n

X

i=1

ln

"

_K

X

k=1

π

k

φ (¯

x

i

; µ

k

, Σ

k

)

#

− λ

K

X

k=1

kµ

k

1 − ρ

K

X

k=1

Σ

−1

k

1 ,

(2.8) where

kµ

k

1 =

d

X

j=1

µ

kj

,

Σ

−1

k

1 =

d

X

j,j

′

₌₁

j6=j′

(Σ

−1

k

)

jj

′

,

and where

λ

and

ρ

are two non-negativeregularization parameters. This

pa-rameterestimationproblemissolvedusinganEMalgorithmwheretheso- alled

glassoalgorithm (Friedman et al. [2007℄) is used to estimate sparse pre ision

matri es

Σ

−1

k

.

Lasso-MLEpro edure

In Meynet [2012℄ and Meynet and Maugis-Rabusseau [2012℄, they highlight

that the

ℓ

1

-penalizationindu es shrinkageof the oe ients and thus biased

estimators with high estimation risk. Moreover, the use of a BIC-type

ri-terion for the model sele tion an be unsuitable for high-dimensional data.

Consequently,theyproposetoonlyusean

ℓ

1

-penalizedlikelihoodapproa hto

(25)

on-high-dimensional situations. The evaluation of the MLE rather than the

ℓ

1

-penalizedestimatorforea hmodelis onsideredtoavoidestimationproblems

due to

ℓ

1

-penalization shrinkage. More pre isely, the data

x

= (x

1 , . . . , x

n

)

areassumedtohaveanullexpe tation (inpra ti e,empiri al enteringofthe

data isperformedto ensurethis assumption)and theirunknown density

f

is

estimated byanite spheri alGaussian mixture. The lustersare

hara ter-ized by the meanparameters

(µ

k

)

1≤k≤K

anda variable

j

is alled irrelevant

forthe lusteringif

µ

kj

= 0

forall

k = 1, . . . , K

;otherwiseitis alledrelevant.

Therelevantvariablesubset(resp. irrelevantvariablesubset)isdenotedby

J

r

(resp.

J

c

r

=

{1, . . . , d} \ J

r

). Consequently, the variable sele tionproblem is

re astintoamodel sele tionproblem, wherethemodel olle tionis

(S

(K,J

r

)

with

S

(K,J

r

)

=











x

_i

_{∈ R}

d

_{7→ f(x}

_i

_{; θ) =}

_P

_K

k=1

π

k

φ(x

J

i

r

; µ

k

, σ

2 I

)

φ(x

J

r

c

i

; 0, σ

2 I

)

θ

= π

1 , . . . , π

K

, µ

1 , . . . , µ

K

, σ

2 ∈ Π

K

× R

|J

r

|

K

× R

∗

+











,

x

J

r

i

denoting the restri tion of

x

i

on

J

r

,

|J

r

|

orresponding to the ardinal

of

J

r

and

Π

K

denoting thesimplex relatedto parameters

(π

1 , . . . , π

K

)

. The

dimensionofamodel

S

(K,J

r

)

orrespondstothetotalnumberoffreeparameters

estimatedin themodel:

D

(K,J

r

)

= K(1 +

|J

r

|)

.

Theso- alledLasso-MLEpro edureproposedinMeynetandMaugis-Rabusseau

[2012℄isde omposed intothreemainsteps. Intherststep,asPanandShen

[2007℄,an

ℓ

1

-approa his onsidered: Forea h

(K, λ)

∈ N

∗

× G

λ

(

G

λ

isagiven

grid on

λ

), theLasso estimator

θ

ˆ

L

(K,λ)

is omputed by maximizing (2.7) and

theasso iatedrelevantvariablesubsetis

J

(K,λ)

=

{j ∈ {1, . . . , d} : ∃ k ∈ {1, . . . , K}

su hthat

µ

ˆ

kj

6= 0}.

Thus a random model sub olle tion

{S

(K,J

r

)

: (K, J

r

)

∈ M

L

}

is obtained, where

M

L

=

_{{(K, J}

r

) : K

∈ N

∗

, J

r

∈

[

λ∈G

λ

J

(K,λ)

}.

The se ond step onsists of omputing the MLE

θ

ˆ

(K,J

r

)

using the standard

EM algorithm for ea h model

(K, J

r

)

∈ M

L

. The third step is devoted to

model sele tion. Asin Maugis andMi hel[2012℄,anonasymptoti penalized

riterionis proposed to solvethemodel sele tionproblem. Byextendingthe

general model sele tion theorem of Massart [2007℄ (Theorem 7.11) (see also

Se tion?? inChapter??) (demanderrefàPas aldanslebook),Meynet[2012℄

provesthatthepenaltyis

pen

(K,J

r

)

= κ

1 D

(K,J

r

)

n

1 + κ

2 ln

d

D

(K,J

r

)

,

(2.9)

(26)

model olle tion omplexitybytakinginto a ountthepossiblelargenumber

ofmodelswithidenti aldimension. Neverthelessthislogarithmtermbe omes

unne essaryifthenumberofmodelswiththesamedimensionissmallenough.

Forinstan e, forniteGaussianmixture modelsin alow-dimensionalsetting,

a penalty proportional to the dimension is su ient to sele t a model lose

to the ora le(Maugis and Mi hel [2011℄). But in the high-dimensional

on-text, the numberof models having the same dimension is expe ted to grow.

Nonetheless,thankstotherandompresele tionofrelevantvariablessubsets,a

ompletevariable sele tionisnotperformedhere. Thus, iftherandommodel

sub olle tionismu hpoorerthanthewholemodel olle tionand ontainsfew

modelswiththesamedimension,apenaltyproportionalto thedimension

pen

(K,J

r

)

=

D

(K,J

r

)

n

(2.10)

mightbesu ienttosele tamodelwithproperdimension. Next,thepenalty

dependingonunknownmultipli ative onstantsis alibratedusingtheso- alled

slopeheuristi s[BirgéandMassart,2007;Baudryetal.,2012℄.

ComparingPS-Lasso and Lasso-MLE

To omparetheLasso-MLEandPS-Lassopro edures,thefollowingsimulated

example is proposed in Meynet and Maugis-Rabusseau[2012℄. The data set

onsistsof

n = 200

observationsdes ribedby

d = 1 000

variables. Thedataare

simulateda ordingtoamixtureoftwoGaussiandistributions

π

1 φ(

·; 0

d

, I) +

(1

_{− π}

1 )φ(

·; µ

2 , I)

where

µ

2 = (1.5, . . . , 1.5, 0

950 )

and

π

1 = 0.85

. Therelevant

variablesare therstfty variables(

J

⋆

r

=

{1, . . . , 50}

).

20

simulationsofthe

datasetareperformed. Forea hsimulation,modelswith

K

∈ {1, 2, 3}

lusters

are onsidered. TheresultsaresummarizedinTable2.3. Table2.3showsthat

Pro edure Estimator TR FR

ˆ

K

ARI 1 2 3 PS-Lasso ora le

50.3 (0.2)

214.6 (79.0)

0

16

4 0.90 (0.03)

BIC

49.7 (0.8)

14.3 (3.4)

0

18

2 0.86 (0.02)

Lasso-MLE ora le

50.0 (0.0)

0.2 (0.2)

0

20

0 0.95 (0.02)

AIC

50.0 (0.0)

17.1 (4.2)

0

14

6 0.90 (0.04)

BIC

49.8 (0.4)

4.4 (2.2)

0

20

0 0.92 (0.02)

DDSE

50.0 (0.0)

2.4 (1.7)

0

20

0 0.94 (0.02)

Table2.3: Averagednumberoftruerelevant(TR)andfalserelevant(FR)

variables(

±

standarddeviation);numberoftimesa lusteringwith

K = 1

ˆ

,

2

and

3

omponentsissele ted;AveragedARI(

±

standarddeviation)overthe

20

simulations. DDSE standsfordata-drivenslopeestimation.