C.4 Known structure, partial observability, frequentist

C.4.3 EM vs gradient methods

The title of this section is a little misleading, since EM is implicitly a gradient method, as we show below. However, it is still an interesting question whether one should explicitly try to follow the gradient, or just use EM.⁵

We shall restrict our attention to deterministic algorithms with the following form of additive update rule:⁶

$$ \Theta^{(t+1)} = \Theta^{(t)} + \lambda_t d_t, \tag{C.7} $$

where $d_t$ is the direction in which to move at iteration $t$, and $\lambda_t$ is the step size. Even within the confines of this form, many variations are possible. For example, how do we choose the direction? How do we choose the step size? How do we maintain the parameter constraints? Are $\lambda_t$ and $d_t$ just functions of the $t$'th training case (i.e., an online algorithm), or can they depend on all the training data (i.e., a batch algorithm)? We discuss these and other issues below.

The performance of the algorithms can be measured in two ways: the quality of the local optimum which is reached, and the speed with which it is reached. Clearly, both answers will depend on the topology of the space, the starting point, and the direction of the walk. The topology of the space may depend on the network structure, the amount of missing data, and the number of training cases. In our experimental setup, we vary all three of these factors to see how robust our conclusions are. For any fixed space, we start all algorithms off at the same point, and use the same stopping criterion. An algorithm which converges faster is always better, since, in any fixed amount of time, we can afford to try restarting from many different points; the final “answer” can then be the best point visited, or some combination of all of them. We vary the starting point to test the robustness of our conclusions.

The direction

The most obvious choice for the direction vector is the gradient of the log-likelihood function

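$$ g_t = \left( \frac{\partial l}{\partial \theta_1}, \ldots, \frac{\partial l}{\partial \theta_n} \right) \Big|_{\Theta = \Theta^{(t)}}. $$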

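As we saw in Section C.4.1, for tabular CPDs with parameters $\theta_{ijk} \stackrel{\mathrm{def}}{=} P(X_i = k \mid \mathrm{Pa}(X_i) = j) = w_{ikj}$, this is given by

$$ \frac{\partial \log \Pr(V \mid \Theta)}{\partial \theta_{ijk}} = \sum_{m=1}^{M} \frac{\Pr(X_i = k, \Pi_i = j \mid V_m)}{\theta_{ijk}} \tag{C.8} $$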
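As a sanity check on equation C.8, the following Python sketch evaluates the formula on a hypothetical two-node network H → X, where H is hidden and X is observed. The network, its CPD values, the data, and all function names are illustrative assumptions, not part of the experiments reported here, and the family posteriors are obtained by brute-force enumeration rather than by a real inference engine. As in C.8, each $\theta_{ijk}$ is treated as a free parameter, ignoring the sum-to-one constraint.

```python
import math

# Hypothetical two-node network H -> X: H is hidden, X is observed, both binary.
PRIOR = [0.6, 0.4]                    # P(H = j), held fixed in this sketch
THETA = [[0.7, 0.3], [0.2, 0.8]]      # THETA[j][k] = P(X = k | H = j)
DATA = [0, 1, 1, 0, 1]                # observed value of X in each training case


def log_likelihood(theta):
    """log Pr(V | Theta), marginalizing the hidden parent H in every case."""
    return sum(math.log(sum(PRIOR[j] * theta[j][x] for j in range(2)))
               for x in DATA)


def family_posterior(theta, x, j, k):
    """Pr(X = k, H = j | X = x), computed by brute-force enumeration."""
    if k != x:
        return 0.0
    evidence = sum(PRIOR[h] * theta[h][x] for h in range(2))
    return PRIOR[j] * theta[j][x] / evidence


def gradient(theta):
    """Equation C.8: sum over cases of the family posterior, divided by theta_jk."""
    return [[sum(family_posterior(theta, x, j, k) for x in DATA) / theta[j][k]
             for k in range(2)]
            for j in range(2)]


if __name__ == "__main__":
    for j, row in enumerate(gradient(THETA)):
        print(f"d logL / d theta[{j}][:] =", [round(g, 4) for g in row])
```

Comparing `gradient` against finite differences of `log_likelihood` is a cheap way to validate an implementation before plugging in a real inference routine for the posteriors.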

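Another choice is the generalized gradient $\tilde{g}_t = \Pi(\Theta^{(t)}) g_t$, where $\Pi$ is some positive definite projection matrix. It can be shown [JJ93] that this is essentially what EM is doing, where $\Pi(\Theta^{(t)}) \approx -\ddot{Q}(\hat{\Theta}, \hat{\Theta})^{-1}$, and

$$ [\ddot{Q}(\hat{\Theta}, \hat{\Theta})]_{i,j} = \frac{\partial^2 Q(\Theta', \hat{\Theta})}{\partial \theta'_i \, \partial \theta'_j} \Big|_{\Theta' = \hat{\Theta}} $$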

⁵ This subsection is based on my class project for Stats 200B in 1997.

⁶ Multiplicative update rules (exponentiated gradient methods) are also possible, but [BKS97] has shown them to be inefficient in this

with $H$ being the hidden variables and $V$ the visibles. Thus EM is a quasi-Newton (variable metric) optimization method.

In [XJ96] and [JX95] they give exact formulas for $\Pi$ for a mixture of Gaussians model and a mixture of experts model, but in general $\Pi$ will be unknown. However, we can still compute the generalized gradient as follows:

$$ \tilde{g}_t = U(\Theta^{(t)}) - \Theta^{(t)} \tag{C.9} $$

where the EM update operator is

$$ \Theta^{(t+1)} = U(\Theta^{(t)}) = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)}). $$
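To make equation C.9 concrete, here is a minimal Python sketch that computes the generalized gradient as the difference between one EM update and the current parameters. The model is an assumption chosen for brevity, not one used in the experiments: a mixture of two biased coins, where the hidden variable is which coin produced each case. The data, the parameter layout `theta = [P(coin 0), P(heads | coin 0), P(heads | coin 1)]`, and all function names are hypothetical.

```python
import math

N_FLIPS = 10                          # tosses per training case
HEADS = [9, 8, 2, 1, 7, 3, 9, 2]      # hypothetical head counts, one per case


def component_weights(theta, h):
    """Unnormalized Pr(coin, h heads) for both coins (binomial coefficient dropped)."""
    w, p0, p1 = theta
    a = w * p0 ** h * (1 - p0) ** (N_FLIPS - h)
    b = (1 - w) * p1 ** h * (1 - p1) ** (N_FLIPS - h)
    return a, b


def log_likelihood(theta):
    """Log-likelihood up to an additive constant that does not depend on theta."""
    return sum(math.log(sum(component_weights(theta, h))) for h in HEADS)


def em_update(theta):
    """U(theta): one E step (responsibilities) followed by a closed-form M step."""
    resp = []
    for h in HEADS:
        a, b = component_weights(theta, h)
        resp.append(a / (a + b))      # Pr(coin 0 | h, theta)
    n0 = sum(resp)
    n1 = len(HEADS) - n0
    p0 = sum(r * h for r, h in zip(resp, HEADS)) / (N_FLIPS * n0)
    p1 = sum((1 - r) * h for r, h in zip(resp, HEADS)) / (N_FLIPS * n1)
    return [n0 / len(HEADS), p0, p1]


def generalized_gradient(theta):
    """Equation C.9: U(theta) - theta; it vanishes at a fixed point of EM."""
    return [u - t for u, t in zip(em_update(theta), theta)]


if __name__ == "__main__":
    theta = [0.5, 0.6, 0.4]
    print("generalized gradient:", generalized_gradient(theta))
```

Note that computing the generalized gradient this way costs one full EM step plus a subtraction, which is the overhead that a direct implementation of EM avoids.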

Conjugate directions

It is well known that making the search directions conjugate can dramatically speed up gradient descent algorithms. The Fletcher-Reeves-Polak-Ribiere formula [PVTF88] is

$$ d_0 = g_0, \qquad d_{t+1} = g_{t+1} + \gamma_t d_t, \qquad \gamma_t = \frac{g_{t+1}^T g_{t+1}}{g_t^T g_t}. $$

It is also possible to compute conjugate generalized gradients. Jamshidian and Jennrich [JJ93] propose the following update rule:

$$ d_0 = \tilde{g}_0, \qquad d_{t+1} = \tilde{g}_{t+1} + \gamma_t d_t, \qquad \gamma_t = \frac{\tilde{g}_{t+1}^T (g_t - g_{t+1})}{d_t^T (g_{t+1} - g_t)}. $$

They show that this method is faster than EM for a variety of non-belief net problems. Thiesson [Thi95] has applied this method to belief nets, but does not report any experimental results.
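The ordinary conjugate-direction recursion given earlier in this subsection, $d_{t+1} = g_{t+1} + \gamma_t d_t$ with $\gamma_t = g_{t+1}^T g_{t+1} / g_t^T g_t$, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the objective is a hypothetical concave quadratic (so that an exact line search has a closed form), we maximize rather than minimize to match the log-likelihood setting, and the matrix `A`, vector `B`, and function names are made up for the example.

```python
# Hypothetical concave quadratic l(theta) = b.theta - 0.5 * theta^T A theta.
A = [[3.0, 1.0], [1.0, 2.0]]          # symmetric positive definite
B = [1.0, 1.0]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


def gradient(theta):
    """Gradient of l at theta: b - A theta."""
    return [bi - ai for bi, ai in zip(B, matvec(A, theta))]


def conjugate_gradient_ascent(theta, n_iter):
    """Maximize l using d_{t+1} = g_{t+1} + gamma_t d_t with exact line search."""
    g = gradient(theta)
    d = list(g)                        # d_0 = g_0
    for _ in range(n_iter):
        if dot(g, g) < 1e-24:          # already at the maximum
            break
        lam = dot(g, d) / dot(d, matvec(A, d))   # exact step for a quadratic
        theta = [t + lam * di for t, di in zip(theta, d)]
        g_new = gradient(theta)
        gamma = dot(g_new, g_new) / dot(g, g)    # gamma_t from the recursion
        d = [gn + gamma * di for gn, di in zip(g_new, d)]
        g = g_new
    return theta


if __name__ == "__main__":
    print(conjugate_gradient_ascent([0.0, 0.0], n_iter=2))
```

On an n-dimensional quadratic with exact line search, conjugate directions reach the optimum in at most n steps, so two iterations suffice here and the result is approximately (0.2, 0.4), the solution of $A\theta = b$. A log-likelihood surface is not quadratic, so in practice the step size must be found numerically or fixed in advance.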

The step size

By substituting equation C.9 into equation C.7, we can rewrite the EM update rule as

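$$ \Theta^{(t+1)} = \Theta^{(t)} + \lambda_t U(\Theta^{(t)}) - \lambda_t \Theta^{(t)} = (1 - \lambda_t)\Theta^{(t)} + \lambda_t U(\Theta^{(t)}). \tag{C.10} $$

We shall call this the EM($\lambda$) rule. (In [BKS97], they derive this rule from an on-line learning perspective.)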

The line minimization method suggests choosing a step size of $\lambda_t = \arg\max_{\lambda} l(\Theta^{(t)} + \lambda d_t)$. Since this can be quite slow, a simplification is to use a constant step size $\lambda_t = \lambda$. For $\lambda = 1$, this corresponds to the standard EM rule, which is guaranteed to converge to a local maximum and to satisfy the positivity and summation constraints. Bauer et al. [BKS97] show that sometimes the optimal learning rate is given by $\lambda > 1$; however, if $\lambda > 1$, the algorithm is only guaranteed to converge to a local maximum if it starts close enough to that maximum, and if $\lambda > 2$, there are no convergence guarantees at all. In the experiments of [JX95] on mixtures of experts, they found that using $\lambda > 1$ often led to divergence, whereas in the experiments of [BKS97] on relatively large belief nets (the Alarm network and the Insurance network), they found that using $\lambda = 1.8$ always led to convergence, presumably because of the greater number of local maxima. This rule is also studied in [RW84] and elsewhere.
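The EM($\lambda$) rule of equation C.10 with a constant step size can be sketched as follows. To keep the example short we assume a one-parameter toy problem that is not from this section: the classic genetic-linkage multinomial often used to introduce EM, with observed counts (125, 18, 20, 34) and cell probabilities $(1/2 + \theta/4, (1-\theta)/4, (1-\theta)/4, \theta/4)$, where the first cell hides a split into two sub-cells. The function names, tolerance, and starting point are hypothetical, and no attempt is made to project $\theta$ back into [0, 1] if a large $\lambda$ overshoots.

```python
COUNTS = (125, 18, 20, 34)            # observed counts for the four cells


def em_update(theta):
    """U(theta) for cell probabilities (1/2 + theta/4, (1-theta)/4, (1-theta)/4, theta/4)."""
    y1, y2, y3, y4 = COUNTS
    # E step: expected part of the first cell that belongs to its hidden theta/4 sub-cell.
    x = y1 * (theta / 4) / (0.5 + theta / 4)
    # M step: closed-form binomial MLE on the completed counts.
    return (x + y4) / (x + y2 + y3 + y4)


def em_lambda(theta, lam, tol=1e-10, max_iter=1000):
    """Iterate equation C.10 until the step is below tol; return (theta, iterations)."""
    for it in range(1, max_iter + 1):
        new = (1 - lam) * theta + lam * em_update(theta)
        if abs(new - theta) < tol:
            return new, it
        theta = new
    return theta, max_iter


if __name__ == "__main__":
    for lam in (1.0, 1.15, 1.8):
        theta, iters = em_lambda(0.5, lam)
        print(f"lambda = {lam}: theta = {theta:.6f} after {iters} iterations")
```

Setting `lam=1.0` recovers plain EM. On this toy problem a value slightly above 1 tends to need fewer iterations than plain EM, which is in the spirit of the [BKS97] observation, but how far $\lambda$ can be pushed is problem dependent.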

Figure C.1: A simple model of the car insurance domain, from [RN95]. Shaded nodes are hidden. (Figure not reproduced; its nodes are OtherCarCost, SocioEcon, Age, GoodStudent, ExtraCar, Mileage, VehicleYear, RiskAversion, SeniorTrain, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, OwnDamage, OwnCarCost, PropertyCost, LiabilityCost, MedicalCost, Cushioning, Ruggedness, Accident.)

Experimental comparison

We generated 25 training cases from a small Bayes net (Figure C.1) with random tabular CPDs; we then randomly hid 20% and 50% of the nodes, and timed how long it took each algorithm to find an MLE.⁷ Some results are shown in Tables C.2 and C.3. It can be seen that all methods reach local minima of similar quality, but that conjugate generalized gradient without linesearch is the fastest, and non-conjugate generalized gradient without linesearch (which is equivalent to EM) is the second fastest. Conjugate methods are always faster, but, perhaps surprisingly, linesearching never seems to pay off, at least in this experiment.

Note that a direct implementation of EM (i.e., one which does not incur the overhead of explicitly computing the generalized gradient) is about 1.6× faster than the results shown in the tables, which eliminates much of the performance advantage of the conjugate method. ([JJ93, Thi95] suggest starting with EM, and then only switching over to a conjugate gradient method when near a local optimum.)

On balance, it seems that EM has several advantages over gradient ascent.

• There is no step-size parameter.
• It automatically takes care of parameter constraints.
• It can handle (conjugate) priors easily.
• It can handle deterministic constraints (e.g., $w_{ijk} = 0$).
• It is simple to implement.

⁷ The gradient descent code is closely based on the implementation given in [PVTF88]. Constraints were ensured by reparameterizing as follows: $\theta_{ijk} = \beta_{ijk}^2 / \sum_{k'} \beta_{ijk'}^2$; gradients were computed with respect to $\beta_{ijk}$ using the chain rule. The stopping criterion was when the percentage change in the negative log likelihood dropped below 0.1%. The belief net inference package was written by Geoff Zweig in C++. All experiments were performed on a Sun Ultra Sparc, where 1 CPU second corresponds to roughly 1 real (wall clock) second (this was 1997!).
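The reparameterization mentioned in footnote 7 can be sketched in Python for a single CPD row (fixed $i$ and $j$). We assume the normalizer is a sum of squares, $\theta_k = \beta_k^2 / \sum_{k'} \beta_{k'}^2$, since that is what makes each row positive and sum to one; the chain rule then gives $\partial l / \partial \beta_k = (2\beta_k / S)\,(\partial l / \partial \theta_k - \sum_{k'} \theta_{k'} \, \partial l / \partial \theta_{k'})$ with $S = \sum_{k'} \beta_{k'}^2$. The toy objective, the counts, and the function names below are hypothetical and are not taken from the original code.

```python
import math

COUNTS = [3.0, 1.0, 6.0]              # hypothetical expected counts for one CPD row


def theta_from_beta(beta):
    """theta_k = beta_k^2 / sum_k' beta_k'^2: positive and sums to one by construction."""
    total = sum(b * b for b in beta)
    return [b * b / total for b in beta]


def grad_wrt_beta(beta, grad_theta):
    """Chain rule: dl/dbeta_k = (2 beta_k / S) * (dl/dtheta_k - sum_k' theta_k' dl/dtheta_k')."""
    total = sum(b * b for b in beta)
    theta = theta_from_beta(beta)
    weighted = sum(t * g for t, g in zip(theta, grad_theta))
    return [2.0 * b / total * (g - weighted) for b, g in zip(beta, grad_theta)]


def toy_log_lik(beta):
    """Toy objective l = sum_k n_k log theta_k, used only to exercise the chain rule."""
    return sum(n * math.log(t) for n, t in zip(COUNTS, theta_from_beta(beta)))


if __name__ == "__main__":
    beta = [0.5, -1.2, 0.9]           # unconstrained: any nonzero reals are allowed
    theta = theta_from_beta(beta)
    grad_theta = [n / t for n, t in zip(COUNTS, theta)]
    print("theta:", theta)
    print("dl/dbeta:", grad_wrt_beta(beta, grad_theta))
```

Because the optimizer works on the unconstrained $\beta$ values, an off-the-shelf conjugate gradient routine can be used without any projection step.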

Linemin.?  Gen. grad.?  Conj.?  Final NLL  #iter  CPU (s)  #fn
×          ×            ×       300.0      105    74.5     210
×          ×            √       300.4      115    88.2     230
×          √            ×       294.8      70     29.6     70
×          √            √       297.9      33     14.0     66
√          ×            ×       297.1      62     229.6    677
√          ×            √       296.3      26     50.0     301
√          √            ×       294.0      60     83.6     547
√          √            √       296.4      28     41.3     129

Table C.2: Performance of the various algorithms for a single trial with 20% hidden data and 25 training cases. Remember, the goal is to minimize the negative log-likelihood (NLL). “#iter” counts the number of iterations of the outer loop. “#fn” counts the number of times the likelihood function was called. The step size is fixed at 1.0. EM corresponds to the third line.

Linemin.?  Gen. grad.?  Conj.?  Final NLL  #iter  CPU (s)  #fn
×          ×            ×       210.5      129    121.8    258
×          ×            √       210.4      122    112.6    244
×          √            ×       202.5      80     44.3     160
×          √            √       215.3      23     12.6     46
√          ×            ×       215.4      53     215.3    580
√          ×            √       210.3      46     100.3    524
√          √            ×       201.4      79     143.0    737
√          √            √       204.1      55     102.4    528

Table C.3: Performance of the various algorithms with 50% hidden data. The starting point is the same as the previous experiment. Surprisingly, the best NLL is lower in this case. EM corresponds to the third line.

On the other hand, if the CPD does not have a closed form MLE, it is necessary to use gradient methods in the M step (this is called generalized EM). In this case, the argument in favor of EM is somewhat less compelling. Unfortunately, we did not compare the algorithms in this case.

Source: Dynamic Bayesian Networks: Representation, Inference and Learning, Kevin Patrick Murphy (pages 182-185)