3.5 A penalized approach in mixture models
3.5.2 Variable selection for the background
Under the assumption of component covariance matrices equal to the identity, considered by Pan and Shen (2007), the penalty on the mean parameters leads to automatic variable selection. If, for a given $p$th variable, all the component mean parameters are equal to 0, then the $p$th variable is uninformative for the cluster classification and the posterior probabilities (3.5) take the following form:
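\[
\tau_{il} = \frac{\pi_l \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_l, I_P)}{\sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_k, I_P)}
= \frac{\pi_l \, \phi(x_{ip} \mid 0, 1) \, \phi(\mathbf{x}_{i,-p} \mid \boldsymbol{\mu}_{l,-p}, I_{P-1})}{\sum_{k=1}^{K} \pi_k \, \phi(x_{ip} \mid 0, 1) \, \phi(\mathbf{x}_{i,-p} \mid \boldsymbol{\mu}_{k,-p}, I_{P-1})}
\tag{3.12}
\]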
where $I_P$ is the $P$-dimensional identity matrix and the index $i,-p$ denotes removal of the $p$th variable from the $i$th vector. The factor $\phi(x_{ip} \mid 0, 1)$ is common to the numerator and to every term of the denominator, so it cancels, and the data from the $p$th variable do not contribute to the classification.
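The cancellation in Equation 3.12 can be checked numerically. The following is a minimal sketch in plain Python, not the implementation used in this work: the function names, mixing weights, means and the observation are illustrative assumptions, and the middle variable is given a zero mean in every component.

```python
import math

def normal_pdf(x, mean, var=1.0):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors_identity(x, weights, means, keep=None):
    """Posterior membership probabilities under identity covariances.

    x: one observation (list of P floats); means: K lists of P floats.
    keep: indices of the variables to use (all variables if None).
    """
    idx = list(range(len(x))) if keep is None else list(keep)
    joint = [
        w * math.prod(normal_pdf(x[p], mu[p]) for p in idx)
        for w, mu in zip(weights, means)
    ]
    total = sum(joint)
    return [j / total for j in joint]

# Illustrative (assumed) values: variable 1 has mean 0 in both components.
weights = [0.3, 0.7]
means = [[1.5, 0.0, -1.0], [-0.5, 0.0, 2.0]]
x = [0.4, 2.3, 0.9]

tau_full = posteriors_identity(x, weights, means)
tau_reduced = posteriors_identity(x, weights, means, keep=[0, 2])
print(tau_full)
print(tau_reduced)
```

Up to floating-point error the two printed vectors coincide, whatever value the zero-mean variable takes for the observation.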
For the more general case of component-specific diagonal covariance matrices (Xie, 2008), two conditions have to be met in order to remove the $p$th variable. The first is that the mean estimates are equal to 0 (as previously), while the second concerns the marginal component variances: for the $p$th variable, all the component variances have to be equal to 1. Under these conditions, an analogue of Equation 3.12 can be written for the posterior probability, and the $p$th variable does not contribute to the classification.
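To illustrate why the variance condition is needed in the diagonal case, the sketch below compares the posteriors computed with and without a zero-mean variable, once with unit variances in every component and once with component-specific variances. It is an illustration under assumed toy values, with hypothetical function names, not the estimation procedure itself.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors_diag(x, weights, means, variances, keep=None):
    """Posterior membership probabilities under diagonal covariances."""
    idx = list(range(len(x))) if keep is None else list(keep)
    joint = [
        w * math.prod(normal_pdf(x[p], mu[p], var[p]) for p in idx)
        for w, mu, var in zip(weights, means, variances)
    ]
    total = sum(joint)
    return [j / total for j in joint]

# Illustrative (assumed) values: variable 1 has mean 0 in both components.
weights = [0.5, 0.5]
means = [[1.0, 0.0], [-1.0, 0.0]]
x = [0.3, 1.7]

var_unit = [[0.5, 1.0], [2.0, 1.0]]     # both conditions hold for variable 1
var_unequal = [[0.5, 0.2], [2.0, 3.0]]  # zero means only

tau_reduced = posteriors_diag(x, weights, means, var_unit, keep=[0])
tau_full_unit = posteriors_diag(x, weights, means, var_unit)
tau_full_unequal = posteriors_diag(x, weights, means, var_unequal)
print(tau_reduced, tau_full_unit, tau_full_unequal)
```

With unit variances the full and reduced posteriors agree. With unequal variances the zero-mean variable still shifts the posteriors, so zero means alone do not make it uninformative.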
In the general case of unconstrained covariance matrices (where correlations are also modeled), such a simple factorization cannot be performed, and the conditions for removing variables within the MESP approach need to be derived. Without loss of generality, let us divide the variables into two sets, $A$ and $B$, so that $A$ contains the first $R$ variables and $B$ the rest, for any $R \in \{1, \dots, P-1\}$. Denote by $X = (X_A, X_B)$ the corresponding partition of the data, by $\boldsymbol{\mu}_k = (\boldsymbol{\mu}_{k,A}, \boldsymbol{\mu}_{k,B})$ the component mean vectors, and by
\[
\Sigma_k = \begin{pmatrix} \Sigma_{k,AA} & \Sigma_{k,AB} \\ \Sigma_{k,BA} & \Sigma_{k,BB} \end{pmatrix}
\]
the component covariance matrices, where $\Sigma_{k,AB}$ is the block of $\Sigma_k$ built from the rows in $A$ and the columns in $B$.
Herein we aim at formulating the joint distribution $f(X_A, X_B)$ as the product of the marginal distribution of $X_B$ and the conditional distribution of $X_A$ given $X_B$. In the previous cases this is automatic, because uncorrelated Gaussian variables are independent. We generalize Equation 3.12 by the conditional factorization $f(X_A, X_B) = f(X_A \mid X_B) f(X_B)$ and obtain the following formula:
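\[
\tau_{il} = \frac{\pi_l \, \phi(\mathbf{x}_{i,B} \mid \boldsymbol{\mu}_{l,B}, \Sigma_{l,BB}) \, \phi\big(\mathbf{x}_{i,A} \mid \boldsymbol{\mu}_{l,A} + \tilde{\Sigma}_{l,AB}(\mathbf{x}_{i,B} - \boldsymbol{\mu}_{l,B}), \ \Sigma_{l,AA} - \tilde{\Sigma}_{l,AB}\Sigma_{l,BA}\big)}{\sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x}_{i,B} \mid \boldsymbol{\mu}_{k,B}, \Sigma_{k,BB}) \, \phi\big(\mathbf{x}_{i,A} \mid \boldsymbol{\mu}_{k,A} + \tilde{\Sigma}_{k,AB}(\mathbf{x}_{i,B} - \boldsymbol{\mu}_{k,B}), \ \Sigma_{k,AA} - \tilde{\Sigma}_{k,AB}\Sigma_{k,BA}\big)},
\]
where, to ease the notation, $\tilde{\Sigma}_{k,AB} = \Sigma_{k,AB}\Sigma_{k,BB}^{-1}$.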
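The conditional factorization can be verified numerically in the smallest non-trivial setting, $P = 2$ with $A$ and $B$ each holding one variable, so that all covariance blocks are scalars. The sketch below uses plain Python. The function names and parameter values are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bivariate_pdf(xa, xb, mu_a, mu_b, s_aa, s_ab, s_bb):
    """Joint density of (X_A, X_B) with covariance [[s_aa, s_ab], [s_ab, s_bb]]."""
    det = s_aa * s_bb - s_ab ** 2
    da, db = xa - mu_a, xb - mu_b
    quad = (s_bb * da * da - 2.0 * s_ab * da * db + s_aa * db * db) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def factorized_pdf(xa, xb, mu_a, mu_b, s_aa, s_ab, s_bb):
    """f(X_B) * f(X_A | X_B), using the shorthand s_tilde = s_ab / s_bb."""
    s_tilde = s_ab / s_bb
    cond_mean = mu_a + s_tilde * (xb - mu_b)
    cond_var = s_aa - s_tilde * s_ab  # s_ba equals s_ab in the scalar case
    return normal_pdf(xb, mu_b, s_bb) * normal_pdf(xa, cond_mean, cond_var)

# Illustrative (assumed) parameters and observation.
params = dict(mu_a=1.0, mu_b=-0.5, s_aa=2.0, s_ab=0.7, s_bb=1.5)
joint = bivariate_pdf(0.3, 1.2, **params)
factorized = factorized_pdf(0.3, 1.2, **params)
print(joint, factorized)
```

The two values agree up to rounding, which is the identity used component by component in the numerator and denominator of the posterior probability.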
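The first necessary condition for removing the variables belonging to $B$ as uninformative is to have null mean estimates, $\hat{\mu}_{kp} = 0$ for all $k = 1, \dots, K$ and $p \in B$. In that case, writing $\tilde{\Sigma}_{k,AB} = \Sigma_{k,AB}\Sigma_{k,BB}^{-1}$ as above, the posterior probability of observation membership is
\[
\tau_{il} = \frac{\pi_l \, \phi(\mathbf{x}_{i,B} \mid \mathbf{0}, \Sigma_{l,BB}) \, \phi\big(\mathbf{x}_{i,A} \mid \boldsymbol{\mu}_{l,A} + \tilde{\Sigma}_{l,AB}\mathbf{x}_{i,B}, \ \Sigma_{l,AA} - \tilde{\Sigma}_{l,AB}\Sigma_{l,BA}\big)}{\sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x}_{i,B} \mid \mathbf{0}, \Sigma_{k,BB}) \, \phi\big(\mathbf{x}_{i,A} \mid \boldsymbol{\mu}_{k,A} + \tilde{\Sigma}_{k,AB}\mathbf{x}_{i,B}, \ \Sigma_{k,AA} - \tilde{\Sigma}_{k,AB}\Sigma_{k,BA}\big)},
\]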
which is still implicitly a function of the parameters of the presumably uninformative variables in set $B$. Hence, as in the approach of Xie (2008), a second condition is necessary: the covariance matrix blocks have to be equal across components, i.e. for all $k = 1, \dots, K$, $\Sigma_{k,BB} = \Sigma_{BB}$ and $\Sigma_{k,AB} = \Sigma_{AB}$ for fixed $\Sigma_{BB}$ and $\Sigma_{AB}$, where $\Sigma_{BB}$ is expressed as a weighted average of the component-specific blocks
\[
\Sigma_{BB} = \sum_{k=1}^{K} \pi_k \Sigma_{k,BB}
\]
and $\Sigma_{AB}$ is a matrix of zeros, $0_{AB}$. If the two conditions are met, then the cluster membership probability is:
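\[
\tau_{il} = \frac{\pi_l \, \phi(\mathbf{x}_{i,B} \mid \mathbf{0}_B, \Sigma_{BB}) \, \phi\big(\mathbf{x}_{i,A} \mid \boldsymbol{\mu}_{l,A} + 0_{AB}\mathbf{x}_{i,B}, \ \Sigma_{l,AA} - 0_{AB}\Sigma_{BA}\big)}{\sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x}_{i,B} \mid \mathbf{0}_B, \Sigma_{BB}) \, \phi\big(\mathbf{x}_{i,A} \mid \boldsymbol{\mu}_{k,A} + 0_{AB}\mathbf{x}_{i,B}, \ \Sigma_{k,AA} - 0_{AB}\Sigma_{BA}\big)}
= \frac{\pi_l \, \phi(\mathbf{x}_{i,A} \mid \boldsymbol{\mu}_{l,A}, \Sigma_{l,AA})}{\sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x}_{i,A} \mid \boldsymbol{\mu}_{k,A}, \Sigma_{k,AA})}.
\tag{3.13}
\]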
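Equation 3.13 can be illustrated in the bivariate setting (one variable in $A$, one in $B$). The sketch below is a toy check under assumed parameter values, with hypothetical function names. It computes the posteriors from the full joint densities when both conditions hold, from the $A$ variable alone, and from the joint densities when only the zero-mean condition holds.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bivariate_pdf(xa, xb, mu_a, mu_b, s_aa, s_ab, s_bb):
    """Joint density of (X_A, X_B) with covariance [[s_aa, s_ab], [s_ab, s_bb]]."""
    det = s_aa * s_bb - s_ab ** 2
    da, db = xa - mu_a, xb - mu_b
    quad = (s_bb * da * da - 2.0 * s_ab * da * db + s_aa * db * db) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

xa, xb = 0.2, 1.1          # illustrative observation
weights = [0.4, 0.6]

# Components as (mu_a, mu_b, s_aa, s_ab, s_bb); all values are assumptions.
both_conditions = [(1.0, 0.0, 0.8, 0.0, 1.5), (-2.0, 0.0, 1.2, 0.0, 1.5)]
zero_means_only = [(1.0, 0.0, 0.8, 0.5, 1.5), (-2.0, 0.0, 1.2, -0.3, 1.5)]

tau_joint = normalize([w * bivariate_pdf(xa, xb, *c)
                       for w, c in zip(weights, both_conditions)])
tau_a_only = normalize([w * normal_pdf(xa, c[0], c[2])
                        for w, c in zip(weights, both_conditions)])
tau_corr = normalize([w * bivariate_pdf(xa, xb, *c)
                      for w, c in zip(weights, zero_means_only)])
print(tau_joint, tau_a_only, tau_corr)
```

When both conditions hold, the joint and $A$-only posteriors coincide. When the cross-covariances differ between components, the posteriors change even though the means in $B$ are zero.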
As a result, the variables from set $B$ do not influence the membership probabilities. Hence, if the two listed conditions are met, the variables from set $B$ should be removed as uninformative. While the first condition is obtained automatically by the component mean shrinkage, the second condition requires a model selection step. Let the set $A$ consist of all the features that do not meet the first condition, so that the set $B$ consists of the potentially uninformative variables. Let us denote by $\mathcal{C}$ the set of all possible subsets of $B$ ($\mathcal{C} = \{C_1, \dots, C_{N_B}\}$ for an appropriate value of $N_B$) and by $D_i = C_i^{c}$ the complement of $C_i$. The Bayesian Information Criterion is then used to select among the candidate models, one per subset $C_i$, in which for all $k$ in $1, \dots, K$ the penalized likelihood estimates $\hat{\Sigma}_{k,C_iC_i}$, $\hat{\Sigma}_{k,D_iC_i}$ and $\hat{\Sigma}_{k,C_iD_i}$ are replaced with the fixed $\Sigma_{C_iC_i}$, $\Sigma_{D_iC_i}$ and $\Sigma_{C_iD_i}$ respectively. The described method might seem computationally expensive; however, there is no need to scan all the $P!$ models. The first necessary condition already filters out most of the truly informative variables.
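The search over subsets of $B$ can be sketched as follows. This is a hedged illustration only: `candidate_subsets`, `bic`, `select_subset` and `toy_fit` are hypothetical names, the criterion is assumed to take the common form $-2\log L + d\log n$ (smaller is better), and `toy_fit` returns made-up numbers standing in for a real constrained model fit.

```python
import math
from itertools import combinations

def candidate_subsets(b_vars):
    """All subsets of B, including the empty one (no variable constrained)."""
    b_vars = sorted(b_vars)
    return [frozenset(c)
            for r in range(len(b_vars) + 1)
            for c in combinations(b_vars, r)]

def bic(loglik, n_params, n_obs):
    """Assumed BIC convention: -2 log L + d log n, smaller is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def select_subset(b_vars, fit):
    """Score every candidate subset; fit(subset) -> (loglik, n_params, n_obs)."""
    scored = {c: bic(*fit(c)) for c in candidate_subsets(b_vars)}
    best = min(scored, key=scored.get)
    return best, scored

def toy_fit(subset):
    """Placeholder for a constrained fit; the numbers are purely illustrative."""
    loglik = -500.0 - (20.0 if 4 in subset else 0.0) - 0.5 * len(subset - {4})
    n_params = 40 - 6 * len(subset)
    return loglik, n_params, 200

candidates = candidate_subsets([3, 4, 5])
best, scores = select_subset([3, 4, 5], toy_fit)
print(len(candidates), sorted(best))
```

In this toy setting, constraining variables 3 and 5 costs little likelihood and saves parameters, whereas constraining variable 4 is too costly, so the selected subset is {3, 5}. The number of candidates grows with the size of $B$ only, which is why the pre-filtering by the first condition keeps the search manageable.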