The EM/Newton-Raphson Hybrid Algorithm - The One Dimensional Normal Mixture Model


4.3 The One Dimensional Normal Mixture Model

4.3.3 The EM/Newton-Raphson Hybrid Algorithm

The EM algorithm for mixture models, although an improvement over the Newton-Raphson algorithm, is often slow to converge. The convergence rate depends on several factors, such as the number of clusters, the number of observations, how close the parameters for the different clusters are to each other, which starting values are chosen, and so on. A number of approaches for speeding up the convergence of the EM algorithm in the mixture model case have been proposed in the literature (Aitkin and Aitkin, 1996; Neal and Hinton, 1998; Bradley et al., 1999). This dissertation presents the hybrid method proposed by Aitkin and Aitkin (1996) that switches back and forth between the EM algorithm and the Newton-Raphson algorithm.

The EM algorithm never decreases the likelihood between iterations and therefore converges to a stationary point of the likelihood, typically a local maximum (Dempster et al., 1977), while the NR algorithm might converge to a local maximum or minimum or decrease the likelihood between iterations. Although the EM algorithm reaches the right neighborhood quickly, it is slow to reach the maximum itself. On the other hand, the NR algorithm is faster in reaching the maximum provided that it starts in the right neighborhood. When the NR algorithm converges, its convergence rate is usually quadratic, compared to linear for the EM algorithm (Aitkin and Aitkin, 1996). The EM algorithm traverses the likelihood surface in larger steps than the NR algorithm. Aitkin and Aitkin (1996) proposed the hybrid EM/NR algorithm to combine the EM algorithm's rapid approach to the right neighborhood with the NR algorithm's fast convergence near the maximum.

The EM algorithm is described for the normal mixture model case in previous sections.

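The Newton-Raphson algorithm makes use of a first order Taylor series expansion of a function f about a point h (Heath, 1997). In the maximization setting, f is the first derivative of the function being maximized, so that a root of f is a stationary point: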

$$f(x) \approx f(h) + f'(h)(x - h) \qquad (4.24)$$

Setting the right-hand side to zero and solving for x motivates the following iteration scheme, known as Newton’s method:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \qquad (4.25)$$

where n represents the iteration number.
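As a minimal sketch of Equation 4.25, the following Python function iterates Newton's method for a function of one variable. The function name `newton_1d`, the tolerance, the iteration cap, and the example f(x) = x² − 2 are illustrative assumptions and are not taken from the cited sources.

```python
def newton_1d(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n) / f'(x_n) until the step is below tol."""
    x = x0
    for _ in range(max_iter):
        slope = f_prime(x)
        if slope == 0.0:
            raise ZeroDivisionError("f'(x) is zero, so the Newton step is undefined")
        step = f(x) / slope
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")


# Illustrative example (assumed): the positive root of f(x) = x^2 - 2.
root = newton_1d(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

In a maximization problem, `f` would be the first derivative of the objective and `f_prime` its second derivative.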

The Newton-Raphson method is readily extended to find the solutions to a set of simultaneous equations (Bickel and Doksum, 2001). The form of the new iterate may be expressed in matrix notation as:

$$X_{n+1} = X_n - H^{-1}(X_n)\, B(X_n), \qquad (4.26)$$

where n is the iteration number, X is the vector of parameter estimates, B is the vector of first derivatives with respect to the parameters, and H is the Hessian matrix of second derivatives.
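The matrix update in Equation 4.26 can be sketched in a few lines of Python. To keep the example short, the sketch maximizes the log likelihood of a single normal distribution in (μ, σ) and not a mixture. This simplification, the function names `score_and_hessian` and `newton_raphson`, the data, and the starting values are all illustrative assumptions. The 2x2 inverse is written out explicitly so that no external library is needed.

```python
def score_and_hessian(data, mu, sigma):
    """Return B (first derivatives) and H (second derivatives) of the
    log likelihood of a single normal distribution with respect to (mu, sigma)."""
    n = len(data)
    s1 = sum(x - mu for x in data)          # sum of residuals
    s2 = sum((x - mu) ** 2 for x in data)   # sum of squared residuals
    b = [s1 / sigma ** 2, -n / sigma + s2 / sigma ** 3]
    h = [[-n / sigma ** 2, -2.0 * s1 / sigma ** 3],
         [-2.0 * s1 / sigma ** 3, n / sigma ** 2 - 3.0 * s2 / sigma ** 4]]
    return b, h


def newton_raphson(data, mu, sigma, tol=1e-10, max_iter=50):
    """Iterate X_{n+1} = X_n - H^{-1}(X_n) B(X_n) for X = (mu, sigma)."""
    for _ in range(max_iter):
        b, h = score_and_hessian(data, mu, sigma)
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        if det == 0.0:
            raise ZeroDivisionError("singular Hessian, NR step undefined")
        # step = H^{-1} B, using the closed-form inverse of a 2x2 matrix
        step_mu = (h[1][1] * b[0] - h[0][1] * b[1]) / det
        step_sigma = (-h[1][0] * b[0] + h[0][0] * b[1]) / det
        mu, sigma = mu - step_mu, sigma - step_sigma
        if max(abs(step_mu), abs(step_sigma)) < tol:
            return mu, sigma
    raise RuntimeError("Newton-Raphson did not converge")


# Assumed toy data; the maximum is at the sample mean and the
# (biased) sample standard deviation.
mu_hat, sigma_hat = newton_raphson([1.0, 2.0, 3.0, 4.0, 5.0], mu=2.5, sigma=1.2)
```

For larger parameter vectors, solving the linear system H·step = B with a numerical library is generally preferable to forming the inverse explicitly.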

For example, in the case of a two component normal mixture, there are 3C − 1 = 3(2) − 1 = 5 parameters present. The 5x1 vector of parameters, X, holds the two component means, the two component standard deviations, and the one free mixing proportion.

The 5x5 Hessian matrix, H, is composed of the second derivatives of the log likelihood with respect to the 5 parameters. Finally, the 5x1 vector, B, contains the first derivatives of the log likelihood with respect to the 5 parameters. The matrices H and B are shown in Equations 4.28 and 4.29, respectively. In general, a mixture model with C clusters has 3C − 1 parameters that must be estimated, so the dimensions of X, H, and B increase by 3 with each additional cluster. The Newton-Raphson algorithm starts with the initial values and iterates Equation 4.26 until convergence is reached. Note that each iteration requires the inversion of a (3C − 1) x (3C − 1) matrix. Matrix inversion is computationally intensive, so each iteration becomes much slower for models having large numbers of clusters. The algorithm is said to converge when successive iterations change the proportion estimates by less than a specified tolerance.
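The parameter count and the stopping rule described above can be written down directly. In this sketch the function names and the default tolerance are hypothetical, and measuring the change as the largest absolute difference between successive proportion estimates is an assumption, since the text does not say how the change is measured.

```python
def n_parameters(c):
    """Free parameters in a univariate normal mixture with c clusters:
    c means, c standard deviations, and c - 1 free proportions."""
    return 3 * c - 1


def proportions_converged(old_props, new_props, tol=1e-6):
    """True when no proportion estimate changed by tol or more (assumed rule)."""
    return max(abs(new - old) for old, new in zip(old_props, new_props)) < tol


# Dimension of X (and of the square matrix H) for 2, 3, and 4 clusters.
dims = [n_parameters(c) for c in (2, 3, 4)]
```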

In order to implement the Newton-Raphson algorithm, the first derivatives of the log likelihood (for the vector B) and the second derivatives of the log likelihood (for the matrix H) are needed. These derivatives were derived for the normal mixture distribution by Hasselblad (1966). The notation and first derivatives were defined in Section 4.3.2. The first derivatives are given in Equations 4.15 – 4.17.


The second derivatives of the log likelihood are shown in Equations 4.31 – 4.36. They are obtained by taking the appropriate first derivatives of Equations 4.15 – 4.17. Define δ_mk as the Kronecker delta,

$$\delta_{mk} = \begin{cases} 1, & \text{if } m = k \\ 0, & \text{if } m \neq k. \end{cases} \qquad (4.30)$$


The implementation of the NR algorithm is straightforward. However, the NR algorithm requires the observed information or Hessian matrix, which is complex to calculate for the mixture problem. The NR algorithm is also much more sensitive to starting values than the EM algorithm is. The NR algorithm sometimes returns negative estimates for σ_k. When this happens, we follow Aitkin and Aitkin’s (1996) suggestion and change the sign of these estimates. There is also a possibility that the observed information matrix may not be positive definite, in which case the NR step may be undefined (if the matrix is singular) or may fail to increase the likelihood. This possibility increases if the starting values are poor. When this happens, the EM algorithm must be used instead.
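The sign correction for negative σ estimates amounts to taking absolute values after each NR step. A minimal sketch, in which the function name `repair_sigma_signs` and the example values are hypothetical:

```python
def repair_sigma_signs(sigmas):
    """Change the sign of any negative standard deviation estimate."""
    return [abs(s) for s in sigmas]


# Assumed example: the second component came back with a negative estimate.
repaired = repair_sigma_signs([1.2, -0.8])
```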

The formulas needed to implement the EM and NR algorithms for the normal mixture case are given in Equations 4.15 – 4.36. The EM/Newton-Raphson hybrid algorithm was introduced by Aitkin and Aitkin (1996) to take advantage of the best features of both the EM and NR algorithms. The algorithm is outlined in the flowchart shown in Figure 4.7.

Figure 4.7: Flowchart for the EM/Newton-Raphson Hybrid Algorithm

The steps are referenced by the numbers in parentheses. The first step is to run the EM algorithm 5 times (1). This helps to ensure that the log-likelihood is non-decreasing so that the subsequent NR step will not diverge or decrease the log-likelihood. The choice of 5 EM iterations comes from Redner and Walker’s (1984) experience that 95% of the change in log-likelihood from the initial to the maximum value generally occurred in the first five EM iterations. If the EM algorithm has not converged after 5 iterations, the NR algorithm is run until it converges or the likelihood decreases (5 – 14). If the likelihood decreases, the parameter values are set to the average of their values before the likelihood decreased and their current values (11). The NR algorithm is run again with the new parameter values as starting values (13). This process is called step halving. If 5 step-halvings do not increase the log-likelihood, then the EM algorithm is run 5 more times (1). This process is repeated as shown in Figure 4.7 until convergence is obtained or a user defined maximum number of iterations is reached. Once the EM/Newton-Raphson hybrid algorithm converges, the parameter estimates may be used to calculate confidence intervals (Section 4.3.4) and to assign observations to clusters.
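The control flow described above can be sketched as a generic Python function that receives the EM step, the NR step, and the log-likelihood as callables. This is one reading of the flowchart description and not a reproduction of Aitkin and Aitkin's (1996) code. The function name `hybrid_em_nr`, its arguments, and the convergence test on the largest parameter change are assumptions. The toy problem at the end is also an assumption: it maximizes −log cosh(θ − 2), with a damped gradient step standing in for the EM step, because a full mixture implementation would not fit in a short example.

```python
import math


def hybrid_em_nr(theta, em_step, nr_step, loglik, tol=1e-8,
                 max_iter=200, em_block=5, max_halvings=5):
    """Alternate blocks of EM steps with NR steps that use step halving."""
    iterations = 0
    halvings_used = 0
    while iterations < max_iter:
        # Run a block of EM iterations (5 by default).
        for _ in range(em_block):
            new_theta = em_step(theta)
            iterations += 1
            change = max(abs(a - b) for a, b in zip(new_theta, theta))
            theta = new_theta
            if change < tol:
                return theta, iterations, halvings_used
        # Run NR until it converges or step halving cannot repair a decrease.
        while iterations < max_iter:
            candidate = nr_step(theta)
            iterations += 1
            halvings = 0
            while loglik(candidate) < loglik(theta) and halvings < max_halvings:
                # Step halving: average the previous and the current values.
                candidate = [(a + b) / 2.0 for a, b in zip(theta, candidate)]
                halvings += 1
                halvings_used += 1
            if loglik(candidate) < loglik(theta):
                break  # halving failed, so return to the EM block
            change = max(abs(a - b) for a, b in zip(candidate, theta))
            theta = candidate
            if change < tol:
                return theta, iterations, halvings_used
    raise RuntimeError("hybrid algorithm did not converge")


# Toy stand-ins (assumed): objective -log cosh(theta - 2), maximized at 2.
def toy_loglik(theta):
    return -math.log(math.cosh(theta[0] - 2.0))


def toy_em_step(theta):
    # Damped gradient step; like EM, it never decreases this objective.
    return [theta[0] - 0.5 * math.tanh(theta[0] - 2.0)]


def toy_nr_step(theta):
    u = theta[0] - 2.0
    gradient = -math.tanh(u)
    hessian = -1.0 / math.cosh(u) ** 2
    return [theta[0] - gradient / hessian]


estimate, n_iter, n_halvings = hybrid_em_nr(
    [5.0], toy_em_step, toy_nr_step, toy_loglik)
```

In the toy problem, a pure NR iteration started at θ = 5 would diverge, while the initial block of EM-like steps brings the estimate close enough for NR to converge.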

One must be careful to recognize that the EM algorithm estimates the σ_k² parameters, while the NR algorithm estimates the σ_k parameters (as derived by Hasselblad in 1966). The σ_k² estimates from the EM step in the hybrid algorithm must be converted by taking the square root before using them as inputs for the NR step. Similarly, one should square the NR estimates of σ_k before using them as inputs for the EM step.
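The conversion between the two parameterizations is a one-line transformation in each direction. A minimal sketch, with hypothetical function names and example values:

```python
import math


def em_to_nr_scale(variances):
    """Convert EM variance estimates (sigma_k squared) to NR inputs (sigma_k)."""
    return [math.sqrt(v) for v in variances]


def nr_to_em_scale(sigmas):
    """Convert NR standard deviation estimates back to EM variances."""
    return [s * s for s in sigmas]


# Assumed example values for a two component mixture.
nr_inputs = em_to_nr_scale([4.0, 0.25])
em_inputs = nr_to_em_scale(nr_inputs)
```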

The application of the EM/Newton-Raphson hybrid algorithm usually results in a significantly reduced number of EM algorithm iterations. McLachlan and Peel (2000) report that, in their experience, the hybrid algorithm converges in 50 – 70 percent of the time required for the EM algorithm to converge. However, the hybrid algorithm requires more overhead for implementation, and the Hessian matrix is complex to calculate. For the univariate mixture models presented in this dissertation, the EM algorithm usually converges much faster than the hybrid algorithm. The convergence time depends on many factors, including the number of variables, and the examples McLachlan and Peel (2000) discussed were all multivariate applications. For univariate mixtures, the EM algorithm therefore appears to be more desirable. The hybrid algorithm is best applied to multivariate mixture problems and will not be used in this dissertation. However, the 2-DCluster software package does support the hybrid algorithm.