Presentation of two competitor variable selection procedures


4. determination of a data clustering by the MAP principle using the estimated mixture parameters of $\hat{s}_{(\hat{K},\hat{J}_r)}$.

In Section 4.5.4, we shall propose a variable selection procedure both identifying the set of relevant variables for the clustering and providing a data clustering. Before introducing it, we present two competitor variable selection procedures (Maugis and Michel, 2011a; Pan and Shen, 2007) from which we have drawn our inspiration to construct our new procedure as a mixture of these two procedures.

4.3 Presentation of two competitor variable selection procedures

For each procedure, we detail the four steps mentioned above. The model collection is obtained by considering a range of numbers of clusters $\mathcal{K} = \{K_{\min}, \dots, K_{\max}\}$ and a collection $\mathcal{J}_0$ of index sets representing sets of relevant variables. The collection $\mathcal{K}$ does not depend on the procedure, whereas the collection $\mathcal{J}_0$ does. In this thesis, we only consider isotropic spherical finite Gaussian mixture models defined by (4.3). Thus, we present each procedure in this context, although Maugis and Michel (2011a) and Pan and Shen (2007) studied less parsimonious models.

In the sequel, we consider a dataset $Y = (Y_1, \dots, Y_n)$ drawn from a probability distribution with density $s$. We assume that the data come from different subpopulations. For clustering purposes, $s$ is to be estimated by a finite Gaussian mixture density.

4.3.1 Maugis and Michel’s procedure for low-dimensional data

Maugis and Michel (2011a) propose a clustering procedure selecting the relevant variables in a low-dimensional context with $p \ll n$. The novelty of their method lies more in their model selection criterion than in their model collection. Indeed, their model collection involves all the subsets of $\{1, \dots, p\}$ as index sets of potentially relevant variables. In this case,

\[
\mathcal{J}_0 = \bigl\{ \{j_1, \dots, j_d\};\ 1 \le d \le p,\ j_l \in \{1, \dots, p\} \text{ for all } l \in \{1, \dots, d\},\ j_l \ne j_{l'} \text{ for } l \ne l' \bigr\}. \tag{4.10}
\]

This corresponds to complete variable selection and results in a model collection of size $2^p \times \operatorname{card}(\mathcal{K})$. Calculating an estimator in each model is feasible only when $p$ is sufficiently small, in practice $p \le 10$. When $p \ge 10$, Maugis and Michel (2011a) must restrict themselves to ordered variable selection and consider instead the collection of all nested subsets of $\{1, \dots, p\}$ in order to get a practicable method. In this case,

\[
\mathcal{J}_0 = \bigl\{ \{1, \dots, d\};\ 1 \le d \le p \bigr\}. \tag{4.11}
\]

Nonetheless, even ordered variable selection becomes unfeasible as soon as p ≈ 30. Thus, Maugis and Michel’s procedure is only valid for very low-dimensional data. One aim of this thesis is to extend


their procedure to high-dimensional data by introducing a collection with fewer index sets in order to reduce the number of models and the number of estimators to calculate.
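To make the difference between the two collections concrete, the following minimal sketch enumerates the index sets of (4.10) and (4.11) for a small p and counts the resulting models. The function names are hypothetical and chosen for illustration only. Note that (4.10) requires d ≥ 1, so the enumeration yields 2^p − 1 non-empty subsets, which is of the order 2^p stated above.

```python
from itertools import combinations


def complete_collection(p):
    """All non-empty subsets of {1, ..., p}, as in (4.10)."""
    variables = range(1, p + 1)
    return [frozenset(c)
            for d in range(1, p + 1)
            for c in combinations(variables, d)]


def ordered_collection(p):
    """Nested subsets {1, ..., d} for 1 <= d <= p, as in (4.11)."""
    return [frozenset(range(1, d + 1)) for d in range(1, p + 1)]


def number_of_models(index_sets, k_min, k_max):
    """Number of pairs (K, J_r) when K ranges over {k_min, ..., k_max}."""
    return len(index_sets) * (k_max - k_min + 1)


if __name__ == "__main__":
    p = 4
    print(len(complete_collection(p)))  # 2**4 - 1 = 15 index sets
    print(len(ordered_collection(p)))   # 4 index sets
```

The complete collection doubles in size with each additional variable, whereas the ordered collection grows only linearly in p but requires the variables to be ranked beforehand.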

Step 0. Empirical centering

Maugis and Michel (2011a) center the dataset $Y = (Y_1, \dots, Y_n)$ by subtracting $\sum_{l=1}^{n} Y_{lj}/n$ from $Y_{ij}$ for all $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, p\}$. This leads to a new dataset $\overline{Y} = (\overline{Y}_1, \dots, \overline{Y}_n)$¹. The density $\bar{s}$ of $\overline{Y}_i$ is to be estimated to get a clustering of the dataset $\overline{Y}$.
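As an illustration of Step 0, here is a minimal sketch of empirical centering written with plain Python lists to stay dependency-free. The function name and the toy dataset are assumptions made for this example.

```python
def center_columns(data):
    """Subtract from each entry Y_ij the empirical mean of column j."""
    n = len(data)
    p = len(data[0])
    column_means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[row[j] - column_means[j] for j in range(p)] for row in data]


if __name__ == "__main__":
    Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    Y_bar = center_columns(Y)
    # Every column of Y_bar now has empirical mean zero.
    print(Y_bar)
```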

Step 1. Model collection

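Since the data are empirically centered, Maugis and Michel (2011a) consider that the variables irrelevant for the clustering of $\overline{Y}$ have a homogeneous behavior around a null mean². Put $\mathcal{M}_r = \mathcal{K} \times \mathcal{J}_0$. Then, Maugis and Michel's model collection is $\{\overline{S}_{(K,J_r)}\}_{(K,J_r) \in \mathcal{M}_r}$ with

\[
\overline{S}_{(K,J_r)} = \Bigl\{ y \in \mathbb{R}^p \mapsto s_{\bar\theta}(y) = \Phi\bigl(y_{[J_r^c]} \mid 0, \sigma^2 I\bigr) \sum_{k=1}^{K} \pi_k \Phi\bigl(y_{[J_r]} \mid \bar\mu_k, \sigma^2 I\bigr);\ \bar\theta = (\pi_1, \dots, \pi_K, \bar\mu_1, \dots, \bar\mu_K, \sigma) \in \Pi_K \times \bigl(\mathbb{R}^{|J_r|}\bigr)^K \times \mathbb{R}_+ \Bigr\}. \tag{4.12}
\]

Each model $\overline{S}_{(K,J_r)}$ is the set of finite Gaussian mixture densities with $K$ components and $J_r$ as index set representing the relevant variables.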
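The structure of a density in (4.12) can be sketched as follows: the irrelevant coordinates contribute a single centered spherical Gaussian factor, while the relevant coordinates contribute the K-component mixture. This is an illustrative sketch only. The function names, the 0-based indexing of variables and the requirement that all proportions be strictly positive are assumptions of the example.

```python
import math


def log_spherical_gaussian(y, mean, sigma):
    """Log-density of N(mean, sigma^2 I) at y (lists of equal length)."""
    d = len(y)
    sq_dist = sum((a - b) ** 2 for a, b in zip(y, mean))
    return -0.5 * d * math.log(2 * math.pi * sigma ** 2) - sq_dist / (2 * sigma ** 2)


def log_density(y, relevant, proportions, means, sigma):
    """Log of s_theta(y) in (4.12).

    relevant    -- 0-based indices forming J_r (assumed convention)
    proportions -- K strictly positive mixing proportions summing to one
    means       -- K mean vectors, each of length |J_r|
    sigma       -- common standard deviation
    """
    relevant_set = set(relevant)
    y_rel = [y[j] for j in relevant]
    y_irr = [y[j] for j in range(len(y)) if j not in relevant_set]
    # Irrelevant variables: one centered spherical Gaussian factor.
    log_irrelevant = log_spherical_gaussian(y_irr, [0.0] * len(y_irr), sigma)
    # Relevant variables: log-sum-exp over the K mixture components.
    terms = [math.log(pi_k) + log_spherical_gaussian(y_rel, mu_k, sigma)
             for pi_k, mu_k in zip(proportions, means)]
    top = max(terms)
    log_mixture = top + math.log(sum(math.exp(t - top) for t in terms))
    return log_irrelevant + log_mixture
```

Working on the log scale with a log-sum-exp avoids numerical underflow when p is large or sigma is small.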

Step 2. Calculation of an estimator of s in each model

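For each $(K, J_r) \in \mathcal{M}_r$, Maugis and Michel (2011a) compute the maximum likelihood estimator of $\bar{s}$ in the model $\overline{S}_{(K,J_r)}$:

\[
\hat{\bar{s}}_{(K,J_r)} = \arg\min_{s_{\bar\theta} \in \overline{S}_{(K,J_r)}} \gamma_n(s_{\bar\theta}), \qquad \gamma_n(s_{\bar\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \ln s_{\bar\theta}(\overline{Y}_i).
\]

For all $y \in \mathbb{R}^p$,

\[
\hat{\bar{s}}_{(K,J_r)}(y) = \Phi\bigl(y_{[J_r^c]} \mid 0, \hat\sigma^2 I\bigr) \sum_{k=1}^{K} \hat\pi_k \Phi\bigl(y_{[J_r]} \mid \hat{\bar\mu}_k, \hat\sigma^2 I\bigr).
\]

The estimated mixture parameters $\hat{\bar\theta}_{(K,J_r)} = (\hat\pi_1, \dots, \hat\pi_K, \hat{\bar\mu}_1, \dots, \hat{\bar\mu}_K, \hat\sigma)$ are computed by the EM algorithm for model-based clustering (Dempster et al., 1977) described in Section 4.A.3. In practice, Maugis and Michel (2011a) use the MIXMOD software (Biernacki et al., 2006) to run the EM algorithm.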
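Since Section 4.A.3 is not reproduced here, the following is only a minimal sketch of what an EM iteration for a model of the form (4.12) may look like. It is not the MIXMOD implementation. The function name, the uniform initial proportions, the initial variance of 1, the fixed number of iterations and the small numerical floors are all assumptions of the example. The variance update pools the relevant and irrelevant coordinates because a single σ is shared by both factors of (4.12).

```python
import math


def em_spherical_mixture(data, relevant, init_means, n_iter=50):
    """EM sketch for (4.12): spherical mixture on J_r, N(0, sigma^2) elsewhere.

    data       -- n rows of length p, assumed empirically centered
    relevant   -- 0-based indices forming J_r (assumed convention)
    init_means -- K initial mean vectors of length |J_r| (hypothetical choice)
    Returns (proportions, means, sigma, responsibilities).
    """
    n, p, K = len(data), len(data[0]), len(init_means)
    rel = list(relevant)
    rel_set = set(rel)
    irr = [j for j in range(p) if j not in rel_set]
    # Irrelevant coordinates only enter through their total sum of squares.
    irrelevant_ss = sum(row[j] ** 2 for row in data for j in irr)
    proportions = [1.0 / K] * K
    means = [list(mu) for mu in init_means]
    sigma2 = 1.0
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probabilities; factors common to all
        # components (normalizing constant, irrelevant part) cancel out.
        resp = []
        for row in data:
            log_w = [
                math.log(max(proportions[k], 1e-300))
                - sum((row[j] - means[k][a]) ** 2
                      for a, j in enumerate(rel)) / (2 * sigma2)
                for k in range(K)
            ]
            top = max(log_w)
            w = [math.exp(v - top) for v in log_w]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: proportions, means, then the shared variance.
        weights = [max(sum(r[k] for r in resp), 1e-300) for k in range(K)]
        proportions = [w_k / n for w_k in weights]
        means = [
            [sum(r[k] * row[j] for r, row in zip(resp, data)) / weights[k]
             for j in rel]
            for k in range(K)
        ]
        relevant_ss = sum(
            r[k] * (row[j] - means[k][a]) ** 2
            for r, row in zip(resp, data)
            for k in range(K)
            for a, j in enumerate(rel)
        )
        sigma2 = max((relevant_ss + irrelevant_ss) / (n * p), 1e-12)
    return proportions, means, math.sqrt(sigma2), resp
```

In practice, EM is sensitive to initialization, so several starting points are usually tried and the run with the largest likelihood is kept.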

¹ We add a horizontal bar to quantities modified by empirical centering, that is to say the dataset $Y$, the density $s$, the mean vectors $\mu_k$, the global parameter vector $\theta$, the model collection as well as the estimators $\hat{s}$, $\hat\mu_k$, $\hat\theta$. The proportions $\pi_k$ and the variance $\sigma^2$ are not modified by empirical centering (see Proposition 4.4.1).


Step 3. Model selection

The model selection criterion used by Maugis and Michel (2011a) is the data-driven penalized criterion derived from the slope heuristics introduced by Birgé and Massart (2006) and recalled in Section 6.3.1. First, models are grouped according to their dimension $D$ in order to obtain a model collection $\{\overline{S}_D\}_{D \in \mathcal{D}}$. The dimension of a model $\overline{S}_{(K,J_r)}$ is the total number of free parameters estimated in the model: $K - 1$ mixing proportions, $K|J_r|$ mean parameters and 1 variance parameter, which gives a total dimension equal to $K(1 + |J_r|)$. For each dimension $D \in \mathcal{D}$, let $\hat{\bar{s}}_D$ be the estimator maximizing the likelihood among the estimators associated with a model of dimension $D$. There exists $(K_D, J_D)$ such that $\hat{\bar{s}}_D = \hat{\bar{s}}_{(K_D,J_D)}$. Then, the function $D/n \mapsto -\gamma_n(\hat{\bar{s}}_D)$ is plotted. One has to check that this function has a linear behavior for large dimensions; otherwise the slope heuristics cannot be applied. If a linear behavior is indeed observed, the slope $\hat{c}$ of the linear part is evaluated with the graphical interface CAPUSHE (Baudry et al., 2011). The calibrated penalty function is $\mathrm{pen}(D) = 2\hat{c}D/n$. Finally, the minimizer $\hat{D}$ of the penalized criterion

\[
\hat{D} = \arg\min_{D \in \mathcal{D}} \Bigl\{ \gamma_n(\hat{\bar{s}}_D) + 2\hat{c}\,\frac{D}{n} \Bigr\} \tag{4.13}
\]

is determined and the model $\overline{S}_{(\hat{K},\hat{J}_r)} := \overline{S}_{(K_{\hat{D}},J_{\hat{D}})}$ is selected.
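The calibration described in Step 3 can be sketched as follows. This is a deliberately simplified stand-in for CAPUSHE, not a description of that software: the slope is estimated by ordinary least squares on the n_largest biggest dimensions, and n_largest is a hypothetical tuning parameter that the user would choose after checking the linear behavior on the plot. The function names are assumptions of the example.

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_heuristics(dims, contrasts, n, n_largest):
    """Select a dimension with the penalty pen(D) = 2 * c_hat * D / n.

    dims      -- distinct model dimensions D
    contrasts -- gamma_n of the best estimator for each dimension
    n         -- sample size
    n_largest -- number of largest dimensions used for the linear fit
                 (hypothetical parameter, at least 2)
    Returns (c_hat, d_hat).
    """
    order = sorted(range(len(dims)), key=lambda i: dims[i])
    tail = order[-n_largest:]
    # Slope of D/n -> -gamma_n on the region assumed to be linear.
    c_hat = fit_slope([dims[i] / n for i in tail],
                      [-contrasts[i] for i in tail])
    penalized = {dims[i]: contrasts[i] + 2 * c_hat * dims[i] / n
                 for i in order}
    d_hat = min(penalized, key=penalized.get)
    return c_hat, d_hat
```

The factor 2 in the penalty reflects the slope heuristics principle that the final penalty is twice the minimal penalty estimated from the linear part.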

Step 4. Data clustering

The variables declared as relevant for the clustering are indexed by $\hat{J}_r$. The density $\bar{s}$ is estimated by $\hat{\bar{s}}_{(\hat{K},\hat{J}_r)}$. A clustering of the dataset $\overline{Y}$ is derived from $\hat{\bar\theta}_{(\hat{K},\hat{J}_r)}$ by applying the MAP principle.
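The MAP principle assigns each observation to the component with the largest posterior probability. A minimal sketch under the model (4.12) is given below. The function name and the 0-based indexing are assumptions of the example, and the proportions are assumed strictly positive. The Gaussian factor of the irrelevant variables and the normalizing constants are common to all components, so they do not affect the argmax and are omitted.

```python
import math


def map_clustering(data, relevant, proportions, means, sigma):
    """Return, for each row, the index of the most probable component."""
    labels = []
    for row in data:
        scores = [
            math.log(pi_k)
            - sum((row[j] - mu_k[a]) ** 2
                  for a, j in enumerate(relevant)) / (2 * sigma ** 2)
            for pi_k, mu_k in zip(proportions, means)
        ]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels
```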

4.3.2 Pan and Shen’s Lasso procedure for high-dimensional data

Complete or ordered variable selection as considered by Maugis and Michel (2011a) is intractable unless $p$ is very small. Alternative variable selection procedures are to be considered for high-dimensional data. Pan and Shen (2007) propose a penalized model-based clustering approach. In light of the success of variable selection via $\ell_1$-penalization in regression, Pan and Shen (2007) conjecture that $\ell_1$-penalization may also be viable for variable selection in clustering. They use an appropriate $\ell_1$-penalty function to adaptively shrink the mean parameters towards the average cluster means, resulting in automatic variable selection³. The selection of the relevant variables and the estimation of the parameters are performed during the same process. Contrary to Maugis and Michel (2011a), the novelty of Pan and Shen's procedure lies more in the construction of their model collection than in their model selection criterion. Unlike Maugis and Michel (2011a), Pan and Shen (2007) do not consider a deterministic model collection. They rather construct a random model collection derived from a collection of sets of potentially relevant variables determined by $\ell_1$-penalization.


Step 0. Empirical centering

Pan and Shen (2007) center the dataset $Y = (Y_1, \dots, Y_n)$ by subtracting $\sum_{l=1}^{n} Y_{lj}/n$ from $Y_{ij}$ for all $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, p\}$. This leads to a new dataset $\overline{Y} = (\overline{Y}_1, \dots, \overline{Y}_n)$. The density $\bar{s}$ of $\overline{Y}_i$ is to be estimated to get a clustering of the dataset $\overline{Y}$.

Steps 1 and 2. Construction of a model collection and calculation of an estimator of s in each model

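• Fix $K \in \mathcal{K}$ and consider

\[
\overline{S}_K = \Bigl\{ y \in \mathbb{R}^p \mapsto s_{\bar\theta}(y) = \sum_{k=1}^{K} \pi_k \Phi\bigl(y \mid \bar\mu_k, \sigma^2 I\bigr);\ \bar\theta = (\pi_1, \dots, \pi_K, \bar\mu_1, \dots, \bar\mu_K, \sigma) \in \Theta_K \Bigr\}
\]

with $\Theta_K := \Pi_K \times (\mathbb{R}^p)^K \times \mathbb{R}_+$. Since the dataset $\overline{Y}$ is centered, irrelevant variables are expected to have a homogeneous behavior around a null mean⁴. Then, to detect such variables, Pan and Shen (2007) penalize the empirical contrast

\[
\gamma_n(s_{\bar\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \ln s_{\bar\theta}(\overline{Y}_i)
\]

by an $\ell_1$-penalty on the mean parameters,

\[
\lambda |\bar\theta|_1 := \lambda \sum_{j=1}^{p} \sum_{k=1}^{K} |\bar\mu_{kj}| \tag{4.14}
\]

where $\lambda > 0$ is a regularization parameter.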

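Introduce $\mathcal{G}_K$, a grid of regularization parameters, and let $\lambda \in \mathcal{G}_K$.

Consider the Lasso estimator defined by

\[
\hat{\bar\theta}_{(K,\lambda)} = \arg\min_{\bar\theta \in \Theta_K} \bigl\{ \gamma_n(s_{\bar\theta}) + \lambda |\bar\theta|_1 \bigr\}. \tag{4.15}
\]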

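To compute $\hat{\bar\theta}_{(K,\lambda)} = (\hat\pi_k, \hat{\bar\mu}_{kj}, \hat\sigma)_{1 \le k \le K,\, 1 \le j \le p}$, Pan and Shen (2007) construct an EM algorithm for $\ell_1$-penalized model-based clustering (see Section 4.A.2). The index set

\[
J_{(K,\lambda)} = \bigl\{ j \in \{1, \dots, p\} : \exists\, k \in \{1, \dots, K\} \text{ such that } \hat{\bar\mu}_{kj} \ne 0 \bigr\}
\]

represents the set of relevant variables selected by the Lasso $\hat{\bar\theta}_{(K,\lambda)}$. The density $\bar{s}$ is estimated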


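by the Lasso solution defined for all $y \in \mathbb{R}^p$ by

\[
\hat{\bar{s}}_{(K,J_{(K,\lambda)})}(y) = \Phi\bigl(y_{[J_{(K,\lambda)}^c]} \mid 0, \hat\sigma^2 I\bigr) \sum_{k=1}^{K} \hat\pi_k \Phi\bigl(y_{[J_{(K,\lambda)}]} \mid \hat{\bar\mu}_k, \hat\sigma^2 I\bigr). \tag{4.16}
\]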

By keeping $K$ fixed and by varying $\lambda \in \mathcal{G}_K$, one gets a collection $\mathcal{J}_K = \bigcup_{\lambda \in \mathcal{G}_K} \{J_{(K,\lambda)}\}$ of index sets representing sets of potentially relevant variables.

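• By varying $K \in \mathcal{K}$, one gets a collection of indices $\mathcal{M}_r = \{(K, J_r);\ K \in \mathcal{K},\ J_r \in \mathcal{J}_K\}$. This leads to a model collection $\{\overline{S}_{(K,J_r)}\}_{(K,J_r) \in \mathcal{M}_r}$ with

\[
\overline{S}_{(K,J_r)} = \Bigl\{ y \mapsto s_{\bar\theta}(y) = \Phi\bigl(y_{[J_r^c]} \mid 0, \sigma^2 I\bigr) \sum_{k=1}^{K} \pi_k \Phi\bigl(y_{[J_r]} \mid \bar\mu_k, \sigma^2 I\bigr);\ \bar\theta = (\pi_1, \dots, \pi_K, \bar\mu_1, \dots, \bar\mu_K, \sigma) \in \Pi_K \times \bigl(\mathbb{R}^{|J_r|}\bigr)^K \times \mathbb{R}_+ \Bigr\}. \tag{4.17}
\]

Each model $\overline{S}_{(K,J_r)}$ is the set of finite Gaussian mixture densities with $K$ components and $J_r$ as index set representing the relevant variables. In each model $\overline{S}_{(K,J_r)}$, the density $\bar{s}$ is estimated by the Lasso $\hat{\bar{s}}_{(K,J_r)}$ defined by (4.16).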
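The way the ℓ1-penalty produces exact zeros, and hence the index sets J_(K,λ), can be sketched as follows. Section 4.A.2 is not reproduced here, so the mean update below is an assumption: it is the soft-thresholding step obtained by maximizing the penalized expected complete log-likelihood coordinate by coordinate for a common variance, with threshold nλσ²/Σ_i τ_ik, where the factor n comes from the 1/n normalization of the contrast in (4.15) and τ_ik denotes the posterior probabilities of the E-step. All function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero; exactly 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def penalized_mean_update(weighted_means, weights, sigma2, lam, n):
    """Assumed l1-penalized M-step for the means.

    weighted_means -- K x p unpenalized M-step means
    weights        -- K sums of posterior probabilities (sum_i tau_ik)
    sigma2, lam, n -- common variance, regularization parameter, sample size
    """
    return [
        [soft_threshold(m, n * lam * sigma2 / weights[k])
         for m in weighted_means[k]]
        for k in range(len(weights))
    ]


def selected_variables(means):
    """J_(K,lambda): 0-based indices j with at least one nonzero mu_kj."""
    p = len(means[0])
    return frozenset(j for j in range(p)
                     if any(mu[j] != 0.0 for mu in means))


def index_set_collection(means_by_lambda):
    """J_K: distinct index sets selected along the grid of lambda values."""
    return {selected_variables(m) for m in means_by_lambda.values()}
```

Several values of λ may select the same index set, which is why the collection is stored as a set of distinct index sets: the number of models is at most the size of the grid and is often much smaller.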

Step 3. Model selection

Pan and Shen (2007) consider BIC as model selection criterion. First, models are grouped according to their dimension $D$ in order to obtain a model collection $\{\overline{S}_D\}_{D \in \mathcal{D}}$. Following a conjecture of Efron et al. (2004) and a result of Zou et al. (2007), Pan and Shen (2007) calculate a model dimension by taking into account the sparsity of the model: the mean parameters set to zero by the EM algorithm are not considered in the dimension calculation. For a model $\overline{S}_{(K,J_r)}$, there are $K - 1$ free mixing parameters, 1 variance parameter and $K|J_r|$ non-zero mean parameters, which gives a dimension equal to $K(1 + |J_r|)$. For each dimension $D \in \mathcal{D}$, denote by $\hat{\bar{s}}_D$ the Lasso estimator maximizing the likelihood among the Lasso estimators associated with a model of dimension $D$. There exists $(K_D, J_D)$ such that $\hat{\bar{s}}_D = \hat{\bar{s}}_{(K_D,J_D)}$. Then, the minimizer $\hat{D}$ of the BIC criterion

\[
\hat{D} = \arg\min_{D \in \mathcal{D}} \Bigl\{ \gamma_n(\hat{\bar{s}}_D) + \frac{\ln n}{2}\,\frac{D}{n} \Bigr\} \tag{4.18}
\]

is determined and the model $\overline{S}_{(\hat{K},\hat{J}_r)} := \overline{S}_{(K_{\hat{D}},J_{\hat{D}})}$ is selected.
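The dimension count and the criterion (4.18) translate directly into a few lines of code. The sketch below is illustrative only: the function names and the toy contrast values are assumptions, and the contrasts are supposed to have been computed beforehand for the best estimator of each dimension.

```python
import math


def model_dimension(K, n_relevant):
    """(K - 1) proportions + 1 variance + K * |J_r| means = K * (1 + |J_r|)."""
    return (K - 1) + 1 + K * n_relevant


def bic_select(contrasts_by_dim, n):
    """Return the dimension D minimizing gamma_n + (ln n / 2) * D / n.

    contrasts_by_dim -- dict mapping each dimension D to gamma_n of the
                        best estimator of that dimension
    n                -- sample size
    """
    return min(
        contrasts_by_dim,
        key=lambda D: contrasts_by_dim[D] + 0.5 * math.log(n) * D / n,
    )


if __name__ == "__main__":
    # Hypothetical contrasts for three dimensions and n = 100 observations.
    print(bic_select({4: 1.30, 9: 1.10, 15: 1.09}, 100))
```

Unlike the slope heuristics penalty of (4.13), the BIC penalty is fixed in advance and involves no data-driven calibration of a constant.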

Step 4. Data clustering

The variables declared as relevant for the clustering are indexed by $\hat{J}_r$. The density $\bar{s}$ is estimated by the Lasso $\hat{\bar{s}}_{(\hat{K},\hat{J}_r)}$. A clustering of the dataset $\overline{Y}$ is derived from the estimated Lasso parameter vector $\hat{\bar\theta}_{(\hat{K},\hat{J}_r)}$


In document Variable selection in model-based clustering for high-dimensional data (Page 135-140)