Parameter Estimation for the Normal Mixture Model


4.3 The One Dimensional Normal Mixture Model

4.3.2 Parameter Estimation for the Normal Mixture Model

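Suppose that a random sample of N observations is obtained from a normal mixture as defined in Section 4.3.1. The likelihood for the N observations is given by the joint density of the sample:

$$L = \prod_{i=1}^{N} \left[ \sum_{k=1}^{C} \frac{\pi_k}{\sqrt{2\pi\sigma_k^2}} \exp\left\{ -\frac{(y_i - \mu_k)^2}{2\sigma_k^2} \right\} \right]$$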

To obtain the maximum likelihood estimates (MLEs), the log likelihood,

$$\log L = \sum_{i=1}^{N} \log \left[ \sum_{k=1}^{C} \frac{\pi_k}{\sqrt{2\pi\sigma_k^2}} \exp\left\{ -\frac{(y_i - \mu_k)^2}{2\sigma_k^2} \right\} \right],$$

is maximized with respect to the parameters π_k, µ_k, and σ_k². The MLEs are found by taking the first partial derivatives of the log likelihood with respect to the parameters of interest, setting them equal to zero, and solving.
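The log likelihood can also be evaluated numerically, which is useful for monitoring any iterative fit. The following is a minimal sketch in Python with NumPy; the function name `mixture_log_likelihood` and the array layout (one row per observation, one column per cluster) are assumptions made for illustration, not part of the source.

```python
import numpy as np

def mixture_log_likelihood(y, pi, mu, var):
    """Log likelihood of a one-dimensional normal mixture.

    y   : observations, length N
    pi  : mixing proportions, length C (assumed to sum to 1)
    mu  : component means, length C
    var : component variances, length C
    """
    y = np.asarray(y, dtype=float)
    pi, mu, var = (np.asarray(a, dtype=float) for a in (pi, mu, var))
    # dens[i, k] is the k-th normal density evaluated at y[i]
    dens = np.exp(-(y[:, None] - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    # dens @ pi is the mixture density at each observation
    return float(np.sum(np.log(dens @ pi)))

# Two identical components reduce to a single standard normal.
ll_single = mixture_log_likelihood([0.0], [1.0], [0.0], [1.0])
ll_double = mixture_log_likelihood([0.0], [0.3, 0.7], [0.0, 0.0], [1.0, 1.0])
```

For observations far in the tails, the densities can underflow to zero; a production implementation would typically work on the log scale instead.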

The first derivatives were given for the normal mixture distribution by Hasselblad (1966). For convenience, Hasselblad’s notation is introduced below and is used from this point forward. Let N be the total number of observations with i = 1, …, N. Let C be the number of clusters with k = 1, …, C. Let y_i be the ith observation, π_k be the proportion of total observations contained in cluster k, and µ_k and σ_k² be the mean and variance of the observations in cluster k, respectively. Let the distribution of the kth normal mixture component be represented as:

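$$f_{ik} = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left\{ -\frac{(y_i - \mu_k)^2}{2\sigma_k^2} \right\} \tag{4.13}$$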

Note that the i subscript in Equation 4.13 is not strictly necessary, as the distribution only changes based on the cluster, not the observation. However, the i is included in order to be consistent with Hasselblad’s notation. The normal mixture distribution of the random variable Y is written as:

$$f(y_i) = \sum_{k=1}^{C} \pi_k f_{ik},$$

where f_ik denotes the kth component density of Equation 4.13 evaluated at y_i.

The derivatives are calculated under the constraint that $\sum_{k=1}^{C} \pi_k = 1$.

The first partial derivatives of the log likelihood are shown below.
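As a sketch of these derivatives in their standard form, write f_ik for the kth normal component density evaluated at y_i and g_i = Σ_k π_k f_ik for the mixture density at y_i (g_i is shorthand introduced here for illustration and is not the source's notation). Assuming the constraint is imposed by substituting π_C = 1 − Σ_{k<C} π_k, differentiating the log likelihood gives:

$$\frac{\partial \log L}{\partial \pi_k} = \sum_{i=1}^{N} \frac{f_{ik} - f_{iC}}{g_i}, \qquad k = 1, \ldots, C-1,$$

$$\frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \frac{\pi_k f_{ik}}{g_i} \cdot \frac{y_i - \mu_k}{\sigma_k^2}, \qquad k = 1, \ldots, C,$$

$$\frac{\partial \log L}{\partial \sigma_k^2} = \sum_{i=1}^{N} \frac{\pi_k f_{ik}}{g_i} \cdot \frac{(y_i - \mu_k)^2 - \sigma_k^2}{2\sigma_k^4}, \qquad k = 1, \ldots, C.$$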

These equations are nonlinear and therefore must be solved iteratively. The usual method for this circumstance is the Newton-Raphson (NR) algorithm. However, the NR algorithm may fail to converge and sometimes converges to a local maximum (Heath, 1997). Therefore, the Expectation/Maximization (EM) algorithm is suggested. No EM iteration can decrease the likelihood, so the algorithm converges in most cases. Like NR, however, it is only guaranteed to reach a stationary point of the likelihood, so whether it reaches the global maximum depends on the starting values.

To develop the EM algorithm, the first requirement is the definition of the incomplete data (Dempster et al., 1977). For the mixture model, the incomplete data are defined by the indicator variables that assign the observations to specific clusters. These indicator variables may be written as:

$$I_{ik} = \begin{cases} 1 & \text{if observation } i \text{ belongs to cluster } k, \\ 0 & \text{otherwise.} \end{cases}$$

The distributions of these indicator variables are specified by the posterior probabilities, α_ik. In other words, P(I_ik = 1 | y_i) = α_ik. In the Expectation stage of the EM algorithm, the complete data are obtained by estimating α_ik. The value α_ik represents the posterior probability of the ith observation falling in the kth cluster. The posterior probability is estimated as the weighted density of the kth component divided by the weighted sum over all C component densities. That is,

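$$\alpha_{ik} = \frac{\pi_k f_{ik}}{\sum_{j=1}^{C} \pi_j f_{ij}}, \tag{4.19}$$

where f_ik is the kth component density of Equation 4.13 evaluated at y_i. For the first iteration, 3C−1 starting values need to be specified: C−1 for the proportions, C for the means, and C for the variances. After the initial iteration, new estimates for these values are obtained from the maximization step of the EM algorithm.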
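The expectation step described above can be sketched in Python with NumPy as follows. The function names `normal_pdf` and `e_step` and the (N, C) array layout are assumptions for illustration; this is not the author's implementation.

```python
import numpy as np

def normal_pdf(y, mu, var):
    """Normal density with mean mu and variance var, evaluated at y."""
    return np.exp(-(y - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def e_step(y, pi, mu, var):
    """Return alpha, where alpha[i, k] is the posterior probability
    that observation i belongs to cluster k."""
    y = np.asarray(y, dtype=float)
    pi, mu, var = (np.asarray(a, dtype=float) for a in (pi, mu, var))
    # weighted[i, k] = pi_k * f_ik; broadcasting pairs (N, 1) with (C,)
    weighted = pi * normal_pdf(y[:, None], mu, var)
    # Normalize each row so the C posterior probabilities sum to 1
    return weighted / weighted.sum(axis=1, keepdims=True)

# Starting values for C = 2: equal proportions, means at -2 and 2, unit variances.
y = np.array([-2.1, -1.9, 0.0, 2.0, 2.2])
alpha = e_step(y, pi=[0.5, 0.5], mu=[-2.0, 2.0], var=[1.0, 1.0])
```

In this example the observation at 0.0 is equidistant from both means, so its two posterior probabilities are equal, while the observations near −2 are assigned almost entirely to the first cluster.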

The maximization step of the EM algorithm uses the completed data to estimate the parameters of the distribution using maximum likelihood techniques. Once the observations falling into different groups are identified, obtaining the 3C−1 estimates of the proportions, means, and variances is straightforward, and explicit solutions exist.

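Equations 4.20 through 4.22 give the maximum likelihood estimates for the proportions, means, and variances for the normal mixture problem. For details of these derivations, see Appendix 4.1. The estimate for the proportion of observations falling in the kth cluster, π_k, is:

$$\hat{\pi}_k = \frac{1}{N} \sum_{i=1}^{N} \alpha_{ik}. \tag{4.20}$$

The corresponding estimates for the mean and variance of the kth cluster are:

$$\hat{\mu}_k = \frac{\sum_{i=1}^{N} \alpha_{ik} y_i}{\sum_{i=1}^{N} \alpha_{ik}}, \tag{4.21}$$

$$\hat{\sigma}_k^2 = \frac{\sum_{i=1}^{N} \alpha_{ik} (y_i - \hat{\mu}_k)^2}{\sum_{i=1}^{N} \alpha_{ik}}. \tag{4.22}$$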

For the case of homogeneous variances among the C clusters, the σ_k² values are first estimated for all of the clusters. Once these estimates are found, the new common variance is found by:

$$\hat{\sigma}^2 = \sum_{k=1}^{C} \hat{\pi}_k \hat{\sigma}_k^2.$$
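The maximization step can be sketched in Python with NumPy as follows. The function name `m_step` and the `common_variance` flag are hypothetical names chosen for illustration. The pooled variance is computed here as the proportion-weighted average of the per-cluster variances, which is the standard maximum likelihood result and is assumed to match the source.

```python
import numpy as np

def m_step(y, alpha, common_variance=False):
    """Update proportions, means, and variances from posterior probabilities.

    y     : observations, length N
    alpha : posterior probabilities, shape (N, C), rows summing to 1
    """
    y = np.asarray(y, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n_k = alpha.sum(axis=0)                        # sum over i of alpha_ik, per cluster
    pi = n_k / y.size                              # proportions
    mu = (alpha * y[:, None]).sum(axis=0) / n_k    # posterior-weighted means
    var = (alpha * (y[:, None] - mu) ** 2).sum(axis=0) / n_k  # weighted variances
    if common_variance:
        # Homogeneous case: pool the per-cluster variances by proportion
        var = np.full_like(var, np.sum(pi * var))
    return pi, mu, var

# Hard (0/1) memberships make the result easy to check by hand.
y = np.array([1.0, 3.0, 10.0, 14.0])
alpha = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
pi, mu, var = m_step(y, alpha)
pi_c, mu_c, var_c = m_step(y, alpha, common_variance=True)
```

With these memberships the clusters are {1, 3} and {10, 14}, so the means are 2 and 12, the variances are 1 and 4, and the pooled variance is 2.5. The sketch does not guard against a cluster whose posterior probabilities sum to zero.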

Each iteration of the EM algorithm involves calculating Equations 4.19 through 4.22 in sequence. For the first iteration, the starting values are used to calculate α_ik. At the end of each iteration, a check is made to see if the EM algorithm converged. There are a number of ways that this check can be performed. The approach taken by McLachlan and Peel (2000) is used in this dissertation. They check for convergence by examining all C of the proportion (π_k) estimates and seeing if the change from the previous iteration’s estimate is less than some tolerance value. If any one of the C estimates has changed by some amount greater than the tolerance value, the algorithm continues. The justification for this stopping rule is that the proportions are the most important parameters to estimate accurately because they indicate the relative sizes of the clusters to the user and contain the $\sum_{i=1}^{N} \alpha_{ik}$ terms, which are involved in both the mean and variance estimates. Other stopping rules are possible, such as terminating when the means or variances change by less than a specified tolerance between iterations. However, such rules would place too many restrictions on the algorithm and the convergence time could drastically increase.
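The iteration and the proportion-based stopping rule can be sketched together in Python with NumPy. The function name `fit_mixture_em`, the default tolerance `tol=1e-6`, and the `max_iter` safeguard are assumptions for illustration; the source does not state a tolerance value. The sketch covers the heterogeneous-variance case only.

```python
import numpy as np

def fit_mixture_em(y, pi, mu, var, tol=1e-6, max_iter=1000):
    """EM for a one-dimensional normal mixture, stopping when every
    proportion estimate changes by less than tol between iterations."""
    y = np.asarray(y, dtype=float)
    pi, mu, var = (np.asarray(a, dtype=float) for a in (pi, mu, var))
    converged = False
    for iteration in range(1, max_iter + 1):
        # E-step: posterior probabilities alpha[i, k]
        dens = np.exp(-(y[:, None] - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
        weighted = pi * dens
        alpha = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: proportions, means, variances
        n_k = alpha.sum(axis=0)
        new_pi = n_k / y.size
        mu = (alpha * y[:, None]).sum(axis=0) / n_k
        var = (alpha * (y[:, None] - mu) ** 2).sum(axis=0) / n_k
        # Stopping rule: continue if any proportion moved by tol or more
        converged = bool(np.all(np.abs(new_pi - pi) < tol))
        pi = new_pi
        if converged:
            break
    return pi, mu, var, alpha, iteration, converged

# Simulated data (an assumption for the demo): two well-separated clusters.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu, var, alpha, n_iter, converged = fit_mixture_em(
    y, pi=[0.5, 0.5], mu=[-1.0, 1.0], var=[1.0, 1.0]
)
```

The returned `alpha` comes from the last E-step, so it reflects the parameter values from the start of the final iteration; at convergence the difference is negligible.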

Cluster membership is assigned by calculating the C estimates of the α_ik’s for all of the y_i’s and assigning each observation to the cluster for which its posterior probability of belonging is the greatest. However, for values of α_ik that are very close to each other, assigning an individual to a cluster based on the maximum posterior probability may not be optimal, since competing cluster assignments may be equally “good”. All of the posterior probabilities are available from the mixture model fit, and the user can evaluate the effects of different cluster assignments. If two posterior probabilities are tied at the maximum value, one can either randomly select a cluster to assign the observation to or place the observation in both of the clusters. For posterior probabilities that are very close to each other, one could try different cluster assignments and evaluate how well the clusters fit the data using the statistics introduced in Section 4.3.5.
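The assignment rule, together with a simple flag for observations whose top two posterior probabilities are close, can be sketched as follows. The function name `assign_clusters` and the `margin` threshold of 0.05 are hypothetical choices for illustration; the source does not define what counts as "very close". The sketch assumes at least two clusters.

```python
import numpy as np

def assign_clusters(alpha, margin=0.05):
    """Assign each observation to its maximum-posterior cluster and flag
    observations whose two largest posteriors differ by less than margin."""
    alpha = np.asarray(alpha, dtype=float)
    # np.argmax returns the first index on exact ties
    labels = alpha.argmax(axis=1)
    # After sorting each row, the last two columns hold the two largest values
    top_two = np.sort(alpha, axis=1)[:, -2:]
    ambiguous = (top_two[:, 1] - top_two[:, 0]) < margin
    return labels, ambiguous

alpha = np.array([[0.90, 0.10],
                  [0.48, 0.52],
                  [0.50, 0.50]])
labels, ambiguous = assign_clusters(alpha)
```

Note that `np.argmax` resolves an exact tie by taking the first cluster, which is neither of the two tie-handling options described above; the `ambiguous` flag identifies the observations for which random selection, dual membership, or a comparison of alternative assignments could be applied instead.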