2.3 Iterative Optimization Methods to Estimate MON-MNL
2.3.5 Standard Errors
Train (2009) and James (2017) suggest computing the information matrix from the cross-product of the M-step scores, which can then be used to obtain standard errors of EM and MM estimates. We derive equations 2.44 and 2.45 to compute simulated scores for $\alpha_c$ and $\langle \gamma_c, \Delta_c \rangle$, respectively.13
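\[
\frac{\partial Q_i(\alpha_c \mid \psi^{m})}{\partial \alpha_c}
= \sum_{r=1}^{R} h_{icr}(\cdot \mid \psi^{m}) \sum_{t=1}^{T} \sum_{j=1}^{J} x_{itj} \left[ d_{itj} - P_{itj}(\alpha_c, \beta_{icr}) \right]
\tag{2.44}
\]

\[
\begin{aligned}
\frac{\partial \left[ Q_i(\gamma_c, \Delta_c \mid \psi^{m}) \right]}{\partial \gamma_c}
&= - \sum_{r=1}^{R} h_{icr}(\cdot \mid \psi^{m}) \, \Delta_c^{-1} (\beta_{icr} - \gamma_c) \\
\frac{\partial \left[ Q_i(\gamma_c, \Delta_c \mid \psi^{m}) \right]}{\partial \Delta_c}
&= \sum_{r=1}^{R} h_{icr}(\cdot \mid \psi^{m}) \left[ -\frac{1}{2} \Delta_c^{-1} + \frac{1}{2} \Delta_c^{-1} \left[ (\beta_{icr} - \gamma_c)(\beta_{icr} - \gamma_c)^{T} \right] \Delta_c^{-1} \right]
\end{aligned}
\tag{2.45}
\]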
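To make the cross-product step concrete, the following is a minimal sketch of how standard errors could be obtained once the individual scores are available. It assumes that the per-individual scores from equations 2.44 and 2.45 have already been evaluated at the final estimates and stacked row-wise into an N x K array (with the matrix-valued score for $\Delta_c$ vectorized over its free elements). The function name `opg_standard_errors` and the toy normal-mean example are hypothetical and are not part of the MON-MNL implementation.

```python
import numpy as np


def opg_standard_errors(scores):
    """Standard errors from the cross-product (outer product) of scores.

    scores: (N, K) array whose row i holds the score vector of individual i
    evaluated at the final estimates (assumed input layout).
    """
    scores = np.asarray(scores, dtype=float)
    info = scores.T @ scores        # K x K cross-product information estimate
    cov = np.linalg.inv(info)       # estimated variance-covariance matrix
    return np.sqrt(np.diag(cov)), cov


# Toy check (not the MON-MNL model): mean of a normal with known unit variance.
# The individual score with respect to mu is (x_i - mu).
rng = np.random.default_rng(0)
n_obs = 5000
x = rng.normal(loc=2.0, scale=1.0, size=n_obs)
mu_hat = x.mean()
toy_scores = (x - mu_hat).reshape(-1, 1)
se, cov = opg_standard_errors(toy_scores)
print(se)  # should be close to 1 / sqrt(n_obs)
```

In the toy case the result can be compared with the textbook value 1/sqrt(N); for the model in this section the same function would simply receive the stacked simulated scores.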
In a simulation study, we compared three methods to compute standard errors: M-step scores in MM, bootstrapping in MM, and the information matrix in MSLE. The standard error estimates of all three methods matched quite precisely for all parameters, except for some (especially off-diagonal) elements of the variance-covariance matrix (see Table 2.11 in section 2.5.2). However, the standard error estimates of MM bootstrapping and MSLE matched fairly well for all parameters, which calls into question the use of M-step scores to compute standard errors in EM and MM algorithms.
13 As an alternative method, bootstrapping can also be used to derive the standard errors, but it
We further reviewed existing methods to compute standard errors in EM. Meng and Rubin (1991) suggest using the SEM algorithm to compute EM standard errors, but SEM is unattractive for two reasons (Jamshidian and Jennrich, 2000). First, it requires estimation of the Jacobian and Hessian matrices of the M-step objective function, which generally involves cumbersome algebraic operations. Second, SEM is highly sensitive to slow convergence of the algorithm and can result in very high standard errors. In fact, in simulation studies, Jamshidian and Jennrich (2000) and Camilleri (2009) found that methods involving the information matrix of the complete-data loglikelihood (e.g., M-step scores and SEM) perform poorly in practice, as also verified in our simulation study. They suggest using the information matrix of the observed (incomplete) loglikelihood at convergence to obtain the correct standard errors of EM. Intuitively, this approach is analogous to switching from EM or MM to Newton-type methods near convergence, but with the motivation of obtaining correct standard errors instead of faster convergence. We suggest passing the EM or MM estimates to the MSLE with a numerical gradient routine. This method does not add any algebraic operations to the original EM or MM algorithm because the loglikelihood is evaluated in the E-step anyway. The mathematical simplicity of MM and EM remains intact at the cost of higher total computation time (see section 2.5.2).
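As an illustration of obtaining standard errors from the observed loglikelihood at convergence, the sketch below approximates the Hessian by central finite differences and inverts the resulting information matrix. It is only one possible numerical routine and is an assumption on our part, not the routine used in the study. The functions `numerical_hessian` and `standard_errors_from_loglik` are hypothetical names, and a simple normal loglikelihood stands in for the simulated loglikelihood of MON-MNL.

```python
import numpy as np


def numerical_hessian(f, theta, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at theta."""
    theta = np.asarray(theta, dtype=float)
    k = theta.size
    hess = np.zeros((k, k))
    for a in range(k):
        for b in range(a, k):
            ea = np.zeros(k)
            eb = np.zeros(k)
            ea[a] = eps
            eb[b] = eps
            val = (f(theta + ea + eb) - f(theta + ea - eb)
                   - f(theta - ea + eb) + f(theta - ea - eb)) / (4.0 * eps ** 2)
            hess[a, b] = hess[b, a] = val
    return hess


def standard_errors_from_loglik(loglik, theta_hat, eps=1e-4):
    """Invert the observed information (negative Hessian) at theta_hat."""
    info = -numerical_hessian(loglik, theta_hat, eps)
    cov = np.linalg.inv(info)
    return np.sqrt(np.diag(cov))


# Toy stand-in for the observed loglikelihood: normal data, theta = (mu, log sigma).
rng = np.random.default_rng(1)
n_obs = 2000
data = rng.normal(loc=1.0, scale=2.0, size=n_obs)


def toy_loglik(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return np.sum(-log_sigma - 0.5 * ((data - mu) / sigma) ** 2)


# Pretend these estimates were delivered by EM or MM.
theta_hat = np.array([data.mean(), np.log(data.std())])
se_hat = standard_errors_from_loglik(toy_loglik, theta_hat)
print(se_hat)
```

For this toy model the analytical standard errors at the maximum are sigma/sqrt(N) for the mean and 1/sqrt(2N) for log sigma, which gives a quick check of the finite-difference step size.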
2.4 Discussion: advantages and disadvantages of MM over EM and MSLE
James (2017) shows that, unlike in EM and MSLE, the Hessian and its inverse need to be computed only once in MM and can be reused at each iteration. This feature makes MM more attractive than other methods when the dimensionality of the Hessian is large and inversion is costly. However, we show that this observation is not universally true. For example, computing the inverses of the Hessians, $B_{\phi}^{-1}$ and $B_{\alpha}^{-1}$, once and reusing them works favorably in MM for LML. However, the Hessian $B^{m}_{\alpha_c}$ and its inverse need to be computed at each iteration, using equation 2.39, in MM for MON-MNL. Nonetheless, the entire computational advantage of MM is not lost because a computationally intensive part of the Hessian ($B^{F}_{\alpha_c}$) can still be pre-computed in the initialization step.
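The idea of inverting a fixed curvature bound once and reusing it can be sketched on a much simpler model than LML or MON-MNL. The example below uses a binary logit, for which the logistic weights p(1 - p) never exceed 1/4, so the matrix -X'X/4 bounds the Hessian from below at every iterate. The function `mm_logit` and the simulated data are hypothetical illustrations, not the estimator of this chapter.

```python
import numpy as np


def mm_logit(X, y, tol=1e-8, max_iter=10_000):
    """MM iterations for a binary logit with a fixed curvature bound."""
    n, k = X.shape
    # The Hessian is -X'WX with 0 < W <= 1/4, so B = -X'X/4 is a global bound.
    B_inv = np.linalg.inv(-0.25 * X.T @ X)   # inverted once, reused below
    theta = np.zeros(k)
    for it in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p)                 # sum of individual gradients
        step = -B_inv @ grad                 # no new inversion per iteration
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta, it


# Simulated data for illustration only.
rng = np.random.default_rng(2)
n_obs = 500
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 2))])
theta_true = np.array([0.5, -1.0, 0.8])
prob = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = (rng.random(n_obs) < prob).astype(float)

theta_hat, n_iter = mm_logit(X, y)
print(theta_hat, n_iter)
```

Each iteration costs only a gradient evaluation and a matrix-vector product. When the bound depends on the current iterate, as with $B^{m}_{\alpha_c}$ above, the inversion would move inside the loop and only the iterate-independent part could be cached.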
Since parameter updates in MM require only sufficient statistics and the sample gradient can be written as the sum (in no specific order) of individual gradients, MM estimation is suitable for parallel computation. The E-step of the EM algorithm is also suitable for parallelization, but the optimization problem in the M-step requires storing the weights and simulation draws of the E-step. The communication overhead and storage of these multi-dimensional arrays in EM neutralize the potential benefits of parallelization. We also note that the Hessian and gradient in MSLE can be broken into an unordered sum over individuals, and thus estimation can be parallelized. We illustrate the extent of computation time savings due to parallelization of MM and MSLE in the Monte Carlo study (section 2.5).
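Because the sample gradient is an unordered sum of individual gradients, it can be accumulated from partial sums computed on separate workers. The following is a minimal sketch of that decomposition, assuming a binary logit gradient as a stand-in for the model gradients of this chapter and a thread pool as the worker mechanism. The names `chunk_gradient` and `parallel_gradient` are hypothetical, and a process pool or a distributed backend could be substituted.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def chunk_gradient(X_chunk, y_chunk, theta):
    """Sum of individual logit gradients over one block of individuals."""
    p = 1.0 / (1.0 + np.exp(-X_chunk @ theta))
    return X_chunk.T @ (y_chunk - p)


def parallel_gradient(X, y, theta, n_workers=4):
    """Split individuals into blocks, compute partial sums, then add them."""
    X_parts = np.array_split(X, n_workers)
    y_parts = np.array_split(y, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(chunk_gradient, X_parts, y_parts,
                            [theta] * n_workers)
        # Only one K-vector per worker is communicated back.
        return sum(partials)


# Simulated data for illustration only.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.5).astype(float)
theta_demo = np.array([0.1, -0.2, 0.3])

g_parallel = parallel_gradient(X_demo, y_demo, theta_demo)
g_serial = chunk_gradient(X_demo, y_demo, theta_demo)
print(np.allclose(g_parallel, g_serial))
```

Each worker returns a single vector of length K, which is why the communication cost stays small compared with passing the E-step weights and draws between workers.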
Whereas the per-iteration time of MM is lower than that of EM, MM generally takes more iterations than EM due to a smaller step size. The computational efficiency of MM then hinges upon the trade-off between extra iterations and per-iteration time savings. In fact, we illustrate in the Monte Carlo study that the tightness of the Hessian approximation is a key factor in determining the number of extra iterations in MM. Nevertheless, the faster-MM that we have implemented (section 2.2.4) can alleviate the concern of a poor approximation by augmenting the step size while keeping the simplicity of MM.
Previous comparison studies of iterative optimization estimation (Cherchi and Guevara, 2012; James, 2017) often overlooked the issue that MSLE and MM (or EM) maximize two different objective functions; even common tolerance criteria cannot provide a fair comparison of the computational efficiency of the two methods. Furthermore, the computation time of standard errors in EM and MM is often ignored when comparing with MSLE. This is also inappropriate: standard errors can be computed directly from the estimated information matrix in MSLE, whereas we argue that standard errors obtained from M-step scores in MM and EM may not be correct, so additional computation is required (see section 2.3.5 for details). In the Monte Carlo study, we try different convergence tolerances for MM and also report the computation time of standard errors to make a fair comparison between MM and MSLE.