C.4 Known structure, partial observability, frequentist
C.4.3 EM vs gradient methods
The title of this section is a little misleading, since EM is implicitly a gradient method, as we show below. However, it is still an interesting question whether one should explicitly try to follow the gradient, or just use EM.^5
We shall restrict our attention to deterministic algorithms with the following form of additive update rule:^6
$$\Theta^{(t+1)} = \Theta^{(t)} + \lambda_t d_t, \tag{C.7}$$
where $d_t$ is the direction in which to move at iteration $t$, and $\lambda_t$ is the step size. Even within the confines of this form, many variations are possible. For example, how do we choose the direction? How do we choose the step size? How do we maintain the parameter constraints? Are $\lambda_t$ and $d_t$ just functions of the $t$'th training case (i.e., an online algorithm), or can they depend on all the training data (i.e., a batch algorithm)? We discuss these and other issues below.
The performance of the algorithms can be measured in two ways: the quality of the local optimum which is reached, and the speed with which it is reached. Clearly, both answers will depend on the topology of the space, the starting point, and the direction of the walk. The topology of the space may depend on the network structure, the amount of missing data, and the number of training cases. In our experimental setup, we vary all three of these factors to see how robust our conclusions are. For any fixed space, we start all algorithms off at the same point, and use the same stopping criterion. An algorithm which converges faster is always better, since, in any fixed amount of time, we can afford to try restarting from many different points; the final “answer” can then be the best point visited, or some combination of all of them. We vary the starting point to test the robustness of our conclusions.
The direction
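The most obvious choice for the direction vector is the gradient of the log-likelihood function
$$g_t = \left.\left(\frac{\partial l}{\partial \theta_1}, \cdots, \frac{\partial l}{\partial \theta_n}\right)\right|_{\Theta=\Theta^{(t)}}.$$
As we saw in Section C.4.1, for tabular CPDs with parameters $\theta_{ijk} \stackrel{\text{def}}{=} P(X_i = k \mid \mathrm{Pa}(X_i) = j) = w_{ikj}$, this is given by
$$\frac{\partial \log \Pr(V|\Theta)}{\partial \theta_{ijk}} = \sum_{m=1}^{M} \frac{\Pr(X_i=k, \Pi_i=j \mid V_m)}{\theta_{ijk}} \tag{C.8}$$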
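As a minimal sketch of equation C.8, consider a toy model that is our own assumption and not one of the networks studied here: a hidden root node H with prior theta[k] = P(H=k) and no parents (so the index j drops out), plus one observed child V whose CPD `lik` is known and fixed. The data values are made up. The gradient is accumulated from the posteriors Pr(H=k | V_m) and compared with a finite-difference estimate that treats each theta[k] as a free variable.

```python
import math

# Hypothetical toy model: hidden root H (2 states) with prior theta[k] = P(H=k),
# and one observed binary child V with a fixed, known CPD lik[k][v] = P(V=v | H=k).
lik = [[0.9, 0.1],
       [0.3, 0.7]]
data = [0, 0, 1, 0, 1]   # made-up observations of V

def log_likelihood(theta):
    """log Pr(V | theta), summing the hidden node out of each training case."""
    return sum(math.log(sum(theta[k] * lik[k][v] for k in range(len(theta))))
               for v in data)

def gradient(theta):
    """Equation C.8: sum over cases of Pr(H=k | V_m) / theta[k]."""
    grad = [0.0] * len(theta)
    for v in data:
        joint = [theta[k] * lik[k][v] for k in range(len(theta))]
        evidence = sum(joint)
        for k in range(len(theta)):
            posterior = joint[k] / evidence   # Pr(H=k | V_m)
            grad[k] += posterior / theta[k]
    return grad

theta = [0.6, 0.4]
g = gradient(theta)

# Finite-difference check, ignoring the sum-to-one constraint on theta.
eps = 1e-6
fd = []
for k in range(len(theta)):
    bumped = list(theta)
    bumped[k] += eps
    fd.append((log_likelihood(bumped) - log_likelihood(theta)) / eps)
```

Since the posteriors for each case sum to one, the inner product of theta and g equals the number of training cases, which gives a cheap sanity check on any implementation of C.8.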
Another choice is the generalized gradient $\tilde{g}_t = \Pi(\Theta^{(t)}) g_t$, where $\Pi$ is some positive definite projection matrix. It can be shown [JJ93] that this is essentially what EM is doing, where $\Pi(\Theta^{(t)}) \approx -\ddot{Q}(\hat\Theta,\hat\Theta)^{-1}$, and
$$[\ddot{Q}(\hat\Theta,\hat\Theta)]_{i,j} = \left.\frac{\partial^2 Q(\Theta',\hat\Theta)}{\partial\theta'_i \, \partial\theta'_j}\right|_{\Theta'=\hat\Theta}$$
is the Hessian of $Q$ evaluated at $\hat\Theta$, some interior point of the space, and $Q$ is the expected complete-data log-likelihood
$$Q(\Theta'|\Theta) = \sum_h \Pr(h|V,\Theta) \log \Pr(V,h|\Theta') = E_h[\log \Pr(V,H|\Theta') \mid V,\Theta]$$
with $H$ being the hidden variables and $V$ the visibles. Thus EM is a quasi-Newton (variable metric) optimization method.

^5 This subsection is based on my class project for Stats 200B in 1997.
^6 Multiplicative update rules (exponentiated gradient methods) are also possible, but [BKS97] has shown them to be inefficient in this
In [XJ96] and [JX95] they give exact formulas for $\Pi$ for a mixture of Gaussians model and a mixture of experts model, but in general $\Pi$ will be unknown. However, we can still compute the generalized gradient as follows
$$\tilde{g}_t = U(\Theta^{(t)}) - \Theta^{(t)} \tag{C.9}$$
where the EM update operator is
$$\Theta^{(t+1)} = U(\Theta^{(t)}) = \arg\max_{\Theta} Q(\Theta|\Theta^{(t)}).$$
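The following sketch illustrates equation C.9 on a toy model of our own choosing (a hidden root H with prior theta, and one observed child V with a known CPD `lik`; the data are made up). Here U has a closed form, namely the normalized expected counts, so the generalized gradient is one EM step minus the current parameters. The sketch also computes the ordinary gradient of C.8 so that the two directions can be compared.

```python
import math

# Hypothetical toy model: hidden root H with prior theta[k] = P(H=k) and one
# observed child V with a known, fixed CPD lik[k][v] = P(V=v | H=k).
lik = [[0.9, 0.1],
       [0.3, 0.7]]
data = [0, 0, 1, 0, 1]   # made-up observations of V

def posterior(theta, v):
    """Pr(H=k | V=v) for every state k."""
    joint = [theta[k] * lik[k][v] for k in range(len(theta))]
    z = sum(joint)
    return [p / z for p in joint]

def log_likelihood(theta):
    return sum(math.log(sum(theta[k] * lik[k][v] for k in range(len(theta))))
               for v in data)

def em_update(theta):
    """U(theta): expected counts (E step), then normalization (M step)."""
    counts = [0.0] * len(theta)
    for v in data:
        for k, p in enumerate(posterior(theta, v)):
            counts[k] += p
    return [c / len(data) for c in counts]

def generalized_gradient(theta):
    """Equation C.9: U(theta) - theta."""
    return [u - t for u, t in zip(em_update(theta), theta)]

def gradient(theta):
    """Equation C.8 specialised to this model."""
    grad = [0.0] * len(theta)
    for v in data:
        for k, p in enumerate(posterior(theta, v)):
            grad[k] += p / theta[k]
    return grad

theta = [0.6, 0.4]
g_tilde = generalized_gradient(theta)
g = gradient(theta)
ascent = sum(a * b for a, b in zip(g, g_tilde))   # inner product of g and g~
```

In this model the components of `g_tilde` sum to zero, so a step along it stays on the probability simplex, and its inner product with the ordinary gradient is positive, i.e., it is an ascent direction.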
Conjugate directions
It is well known that making the search directions conjugate can dramatically speed up gradient descent algorithms. The Fletcher-Reeves-Polak-Ribiere formula [PVTF88] is
$$d_0 = g_0, \qquad d_{t+1} = g_{t+1} + \gamma_t d_t, \qquad \gamma_t = \frac{g_{t+1}^T g_{t+1}}{g_t^T g_t}$$
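To make the recursion concrete, here is a small sketch that applies it to a problem where the answer is known: maximizing the concave quadratic f(x) = b.x - 0.5 x.A.x, whose gradient is b - A x. The matrix `A`, vector `b`, starting point, and the use of an exact line maximization are our own illustrative assumptions; on a likelihood surface the step would come from a numerical line search instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradient_ascent(A, b, x, iters):
    """Maximize f(x) = b.x - 0.5 x.A.x with d_{t+1} = g_{t+1} + gamma_t d_t."""
    g = [bi - ai for bi, ai in zip(b, matvec(A, x))]   # gradient of f at x
    d = list(g)                                         # d_0 = g_0
    for _ in range(iters):
        if dot(g, g) < 1e-24:                           # already at the optimum
            break
        # Exact maximizer of f along d (available only because f is quadratic).
        step = dot(g, d) / dot(d, matvec(A, d))
        x = [xi + step * di for xi, di in zip(x, d)]
        g_new = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        gamma = dot(g_new, g_new) / dot(g, g)           # gamma_t as in the formula
        d = [gn + gamma * di for gn, di in zip(g_new, d)]
        g = g_new
    return x

# Illustrative 2-D problem; the maximizer is A^{-1} b = [1/11, 7/11].
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]
x_opt = conjugate_gradient_ascent(A, b, [2.0, 1.0], iters=2)
```

With exact line maximization, conjugate directions reach the optimum of an n-dimensional quadratic in at most n steps, which is why two iterations suffice here.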
It is also possible to compute conjugate generalized gradients. Jamshidian and Jennrich [JJ93] propose the following update rule:
$$d_0 = \tilde{g}_0, \qquad d_{t+1} = \tilde{g}_{t+1} + \gamma_t d_t, \qquad \gamma_t = \frac{\tilde{g}_{t+1}^T (g_t - g_{t+1})}{d_t^T (g_{t+1} - g_t)}$$
They show that this method is faster than EM for a variety of non-belief net problems. Thiesson [Thi95] has applied this method to belief nets, but does not report any experimental results.
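One way to read the update rule above is that $\gamma_t$ is chosen so that the new direction $d_{t+1}$ is orthogonal to the change in the ordinary gradient, $g_{t+1} - g_t$. The sketch below only checks this algebraic property; the function name `next_direction` and all vectors are arbitrary illustrative values of our own, not output from any belief net.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def next_direction(g_tilde_new, g_old, g_new, d_old):
    """Return (d_{t+1}, gamma_t) for the conjugate generalized gradient rule."""
    num = dot(g_tilde_new, [a - b for a, b in zip(g_old, g_new)])  # g~.(g_t - g_{t+1})
    den = dot(d_old, [b - a for a, b in zip(g_old, g_new)])        # d_t.(g_{t+1} - g_t)
    gamma = num / den   # assumes the gradient changed along d_old, so den != 0
    d_new = [gt + gamma * d for gt, d in zip(g_tilde_new, d_old)]
    return d_new, gamma

# Arbitrary vectors standing in for gradients and directions.
g_old = [1.0, -2.0, 0.5]
g_new = [0.4, -1.0, 0.1]
g_tilde_new = [0.3, -0.2, -0.1]
d_old = [0.5, -0.1, -0.4]

d_new, gamma = next_direction(g_tilde_new, g_old, g_new, d_old)
delta_g = [b - a for a, b in zip(g_old, g_new)]
orthogonality = dot(d_new, delta_g)   # zero up to rounding
```

Because the change in gradient along a step is approximately the Hessian applied to $d_t$, this orthogonality is the usual conjugacy condition between successive directions.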
The step size
By substituting equation C.9 into equation C.7, we can rewrite the EM update rule as
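$$\Theta^{(t+1)} = \Theta^{(t)} + \lambda_t U(\Theta^{(t)}) - \lambda_t \Theta^{(t)} = (1-\lambda_t)\Theta^{(t)} + \lambda_t U(\Theta^{(t)}). \tag{C.10}$$
We shall call this the EM($\lambda$) rule. (In [BKS97], they derive this rule from an on-line learning perspective.)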
The line minimization method suggests choosing a step size of $\lambda_t = \arg\max_\lambda l(\Theta^{(t)} + \lambda d_t)$. Since this can be quite slow, a simplification is to use a constant step size $\lambda_t = \lambda$. For $\lambda = 1$, this corresponds to the standard EM rule, and is guaranteed to converge to a local maximum, and to satisfy the positivity and summation constraints. Bauer et al. [BKS97] show that sometimes the optimal learning rate is given by $\lambda > 1$; however, if $\lambda > 1$, the algorithm is only guaranteed to converge to a local maximum if it starts close enough to that maximum; and if $\lambda > 2$, there are no convergence guarantees at all. In the experiments of [JX95] on mixtures of experts, they found that using $\lambda > 1$ often led to divergence, whereas in the experiments of [BKS97], on relatively large belief nets (the Alarm network and the Insurance network), they found that using $\lambda = 1.8$ always led to convergence, presumably because of the greater number of local maxima. This rule is also studied in [RW84] and elsewhere.
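The constant-step EM($\lambda$) rule can be sketched on a toy problem as follows. The model (a hidden root H with prior theta and one observed child V with known CPD `lik`), the data, the starting point, and the stopping tolerance are all our own assumptions, chosen so that the maximum likelihood estimate is theta = [0.5, 0.5]. The guard against non-positive parameters reflects the fact that only $\lambda = 1$ is guaranteed to respect the positivity constraints.

```python
import math

# Hypothetical toy model: hidden root H with prior theta[k] = P(H=k), and an
# observed child V with a known, fixed CPD lik[k][v] = P(V=v | H=k).
lik = [[0.9, 0.1],
       [0.3, 0.7]]
data = [0, 0, 1, 0, 1]   # made-up observations of V

def log_likelihood(theta):
    return sum(math.log(sum(theta[k] * lik[k][v] for k in range(len(theta))))
               for v in data)

def em_update(theta):
    """Standard EM operator U(theta) for the prior of H."""
    counts = [0.0] * len(theta)
    for v in data:
        joint = [theta[k] * lik[k][v] for k in range(len(theta))]
        z = sum(joint)
        for k in range(len(theta)):
            counts[k] += joint[k] / z
    return [c / len(data) for c in counts]

def em_lambda(theta, lam, tol=1e-8, max_iter=1000):
    """Iterate theta <- (1 - lam) * theta + lam * U(theta), as in C.10."""
    trace = [log_likelihood(theta)]
    iters = 0
    for iters in range(1, max_iter + 1):
        u = em_update(theta)
        new = [(1.0 - lam) * t + lam * uk for t, uk in zip(theta, u)]
        if min(new) <= 0.0:
            raise ValueError("step left the simplex; try a smaller lam")
        change = max(abs(a - b) for a, b in zip(new, theta))
        theta = new
        trace.append(log_likelihood(theta))
        if change < tol:
            break
    return theta, iters, trace

theta_em, iters_em, trace_em = em_lambda([0.9, 0.1], lam=1.0)
theta_fast, iters_fast, trace_fast = em_lambda([0.9, 0.1], lam=1.8)
```

On this particular problem both settings reach the same maximum and $\lambda = 1.8$ needs fewer iterations; this is a property of the toy example and should not be read as a general claim about over-relaxed EM.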
[Figure C.1 image not reproduced. Node labels: OtherCarCost, SocioEcon, Age, GoodStudent, ExtraCar, Mileage, VehicleYear, RiskAversion, SeniorTrain, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, OwnDamage, OwnCarCost, PropertyCost, LiabilityCost, MedicalCost, Cushioning, Ruggedness, Accident.]
Figure C.1: A simple model of the car insurance domain, from [RN95]. Shaded nodes are hidden.
Experimental comparison
We generated 25 training cases from a small Bayes net (Figure C.1) with random tabular CPDs; we then randomly hid 20% and 50% of the nodes, and timed how long it took each algorithm to find a MLE.^7 Some results are shown in Tables C.2 and C.3. It can be seen that all methods reach similar quality local minima, but that conjugate generalized gradient without linesearch is the fastest, and non-conjugate generalized gradient without linesearch (which is equivalent to EM) is the second fastest. Conjugate methods are faster in all but one case (the plain gradient without linesearch in Table C.2), but, perhaps surprisingly, linesearching only pays off for the conjugate plain-gradient method, and never for the generalized gradient methods, at least in this experiment.
Note that a direct implementation of EM (i.e., one which does not incur the overhead of explicitly computing the generalized gradient) is about 1.6× faster than the results shown in the tables, which eliminates much of the performance advantage of the conjugate method. ([JJ93, Thi95] suggest starting with EM, and then only switching over to a conjugate gradient method when near a local optimum.)
On balance, it seems that EM has several advantages over gradient ascent.
• There is no step-size parameter.
• It automatically takes care of parameter constraints.
• It can handle (conjugate) priors easily.
• It can handle deterministic constraints (e.g., $w_{ijk} = 0$).
• It is simple to implement.
^7 The gradient descent code is closely based on the implementation given in [PVTF88]. Constraints were ensured by reparameterizing as follows: $\theta_{ijk} = \beta_{ijk}^2 / \sum_{k'} \beta_{ijk'}^2$; gradients were computed wrt $\beta_{ijk}$ using the chain rule. The stopping criterion was when the percentage change in the negative log likelihood dropped below 0.1%. The belief net inference package was written by Geoff Zweig in C++. All experiments were performed on a Sun Ultra Sparc, where 1 CPU second corresponds to roughly 1 real (wall clock) second (this was 1997!).
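The reparameterization in footnote 7 can be sketched for a single CPD row as follows, taking the normalizer to be the sum of squared $\beta$'s (an assumption on our part, but the one that makes the $\theta$'s sum to one). The chain rule then gives $\partial l/\partial\beta_j = (2\beta_j/S)\,(g_j - \sum_k \theta_k g_k)$ with $S = \sum_k \beta_k^2$ and $g_k = \partial l/\partial\theta_k$. The objective used for the check, a complete-data log-likelihood with made-up counts, and all function names are hypothetical.

```python
import math

def theta_from_beta(beta):
    """theta_k = beta_k^2 / sum_k' beta_k'^2: positive and summing to one."""
    s = sum(b * b for b in beta)
    return [b * b / s for b in beta]

def grad_wrt_beta(beta, grad_theta):
    """Chain rule: dl/dbeta_j = (2 beta_j / S) * (g_j - sum_k theta_k g_k)."""
    s = sum(b * b for b in beta)
    theta = [b * b / s for b in beta]
    mean_g = sum(t * g for t, g in zip(theta, grad_theta))
    return [2.0 * b / s * (g - mean_g) for b, g in zip(beta, grad_theta)]

# Hypothetical objective for checking: l(theta) = sum_k counts[k] * log(theta[k]).
counts = [3.0, 1.0, 4.0]

def loglik(theta):
    return sum(c * math.log(t) for c, t in zip(counts, theta))

beta = [1.0, -0.5, 2.0]
theta = theta_from_beta(beta)
grad_theta = [c / t for c, t in zip(counts, theta)]
analytic = grad_wrt_beta(beta, grad_theta)

# Central finite differences in beta as an independent check.
eps = 1e-6
numeric = []
for j in range(len(beta)):
    up = list(beta)
    down = list(beta)
    up[j] += eps
    down[j] -= eps
    numeric.append((loglik(theta_from_beta(up)) - loglik(theta_from_beta(down)))
                   / (2.0 * eps))
```

Because any real-valued $\beta$ (other than all zeros) maps to a valid distribution, an unconstrained optimizer can be applied to $\beta$ directly, at the cost of the extra chain-rule factor in each gradient evaluation.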
Linemin.?  Gen. grad.?  Conj.?  final NLL  #iter  CPU/s  #fn.
×          ×            ×       300.0      105    74.5   210
×          ×            √       300.4      115    88.2   230
×          √            ×       294.8      70     29.6   70
×          √            √       297.9      33     14.0   66
√          ×            ×       297.1      62     229.6  677
√          ×            √       296.3      26     50.0   301
√          √            ×       294.0      60     83.6   547
√          √            √       296.4      28     41.3   129

Table C.2: Performance of the various algorithms for a single trial with 20% hidden data and 25 training cases. Remember, the goal is to minimize the negative log-likelihood (NLL). “#iter” counts the number of iterations of the outer loop. “#fn” counts the number of times the likelihood function was called. The step size is fixed at 1.0. EM corresponds to the third line.
Linemin.?  Gen. grad.?  Conj.?  final NLL  #iter  CPU/s  #fn.
×          ×            ×       210.5      129    121.8  258
×          ×            √       210.4      122    112.6  244
×          √            ×       202.5      80     44.3   160
×          √            √       215.3      23     12.6   46
√          ×            ×       215.4      53     215.3  580
√          ×            √       210.3      46     100.3  524
√          √            ×       201.4      79     143.0  737
√          √            √       204.1      55     102.4  528

Table C.3: Performance of the various algorithms with 50% hidden data. The starting point is the same as the previous experiment. Surprisingly, the best NLL is lower in this case. EM corresponds to the third line.
On the other hand, if the CPD does not have a closed form MLE, it is necessary to use gradient methods in the M step (this is called generalized EM). In this case, the argument in favor of EM is somewhat less compelling. Unfortunately, we did not compare the algorithms in this case.