The Forward-Backward Algorithm for Trees

6.2 The Method

6.2.1 The Forward-Backward Algorithm for Trees

In order to compute $\tau_{v,t} = \mathbb{E}_{T \mid V,E,R,\theta}\,\delta(T_v, t)$ for a fixed graph $(V, E)$, let us recall the well-known Forward-Backward algorithm used for Hidden Markov Models.³ The HMMs employed for POS tagging operate on sentences, which are linear sequences of words (Fig. 6.2). The sum over all possible tag sequences is handled by introducing the so-called forward probability (usually written as $\alpha$) and backward probability ($\beta$), which are defined as follows:

$$\alpha_{v_i,t} = P(v_1, \ldots, v_i, T_i = t) \tag{6.7}$$

$$\beta_{v_i,t} = P(v_{i+1}, \ldots, v_n \mid T_i = t) \tag{6.8}$$

It is easy to see that the product of both, $\alpha_{v_i,t}\beta_{v_i,t}$, gives us the probability of the whole sequence and the node $v_i$ having tag $t$. This can be used to derive $\tau_{v_i,t}$:

$$\tau_{v_i,t} = \frac{\alpha_{v_i,t}\,\beta_{v_i,t}}{\sum_{t'} \alpha_{v_i,t'}\,\beta_{v_i,t'}} \tag{6.9}$$

The point of forward-backward computation is that, due to the Markov property of HMMs, the forward and backward probabilities can be computed with recursive formulas, thus avoiding the combinatorial explosion caused by summing over all possible tag sequences:

$$\alpha_{v_i,t} = \sum_{t'} \alpha_{v_{i-1},t'}\, P(t \mid t')\, P(v_i \mid t) \tag{6.10}$$

$$\beta_{v_i,t} = \sum_{t'} P(t' \mid t)\, P(v_{i+1} \mid t')\, \beta_{v_{i+1},t'} \tag{6.11}$$
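The recursions (6.10) and (6.11) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy tag set, the toy probabilities and the initial tag distribution `INIT` (needed as a base case for $\alpha$ at the first word) are all hypothetical and do not come from the text. The brute-force function enumerates every tag sequence, which is exactly the sum that the recursions avoid, and serves as a sanity check.

```python
from itertools import product

def forward_backward(words, tags, init, trans, emit):
    """Return (alpha, beta, tau) for one sentence, following (6.7)-(6.11).

    trans[s][t] is P(t | s) and emit[t][w] is P(w | t).
    """
    n = len(words)
    alpha = [dict() for _ in range(n)]
    beta = [dict() for _ in range(n)]
    # Base case (assumption): the first tag is drawn from `init`.
    for t in tags:
        alpha[0][t] = init[t] * emit[t][words[0]]
    # Forward recursion (6.10).
    for i in range(1, n):
        for t in tags:
            alpha[i][t] = sum(alpha[i - 1][s] * trans[s][t] for s in tags) * emit[t][words[i]]
    # Base case: nothing follows the last word, so beta is 1 there.
    for t in tags:
        beta[n - 1][t] = 1.0
    # Backward recursion (6.11).
    for i in range(n - 2, -1, -1):
        for t in tags:
            beta[i][t] = sum(trans[t][s] * emit[s][words[i + 1]] * beta[i + 1][s] for s in tags)
    # Posterior tag probabilities (6.9).
    tau = []
    for i in range(n):
        z = sum(alpha[i][t] * beta[i][t] for t in tags)
        tau.append({t: alpha[i][t] * beta[i][t] / z for t in tags})
    return alpha, beta, tau

def brute_force_tau(words, tags, init, trans, emit):
    """Same posteriors, computed by enumerating all tag sequences."""
    n = len(words)
    totals = [dict.fromkeys(tags, 0.0) for _ in range(n)]
    for seq in product(tags, repeat=n):
        p = init[seq[0]] * emit[seq[0]][words[0]]
        for i in range(1, n):
            p *= trans[seq[i - 1]][seq[i]] * emit[seq[i]][words[i]]
        for i, t in enumerate(seq):
            totals[i][t] += p
    z = sum(totals[0].values())
    return [{t: totals[i][t] / z for t in tags} for i in range(n)]

# Hypothetical toy model, for illustration only.
TAGS = ["N", "V"]
INIT = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"fish": 0.6, "swim": 0.4}, "V": {"fish": 0.3, "swim": 0.7}}
WORDS = ["fish", "swim", "fish"]

alpha, beta, tau = forward_backward(WORDS, TAGS, INIT, TRANS, EMIT)
expected = brute_force_tau(WORDS, TAGS, INIT, TRANS, EMIT)
```

The recursive version needs a number of operations linear in the sentence length, whereas the enumeration grows exponentially with it.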

The general idea of forward-backward computation can be extended beyond linear sequences, which are a special case of trees, to arbitrary trees.⁴ In this case, the backward probability of a node is the probability of the subtree rooted in it,

³ See, for example, Manning and Schütze [1999] or Jelinek [1997] for an introduction to HMMs and the Forward-Backward algorithm.

⁴ I was unable to find any publications describing a generalization of Forward-Backward computation to tree models. I do not believe that this is my innovation, though. I would be grateful for any hints on this topic from reviewers and readers.

[Figure 6.3: tree diagram with nodes $v_1, \ldots, v_9$]

Figure 6.3: The Forward-Backward computation for a tree. Also here, $\alpha_{v_6,t} = P(v_1, \ldots, v_6, T_6 = t)$ and $\beta_{v_6,t} = P(v_7, v_8, v_9 \mid T_6 = t)$.

given a tag, while the forward probability is the probability of the rest of the tree and the tag (Fig. 6.3). Note that the forward probability involves not only the path leading from the root to the node in question ($v_1, v_2, v_4, v_6$ in Fig. 6.3), but also all side branches sprouting from this path ($v_3, v_5$).

In order to derive recursive formulas similar to (6.10)–(6.11) for the tree case, let us introduce the concept of a transition matrix. A transition matrix $T^{(v,v',r)}$ associated with an untagged edge $(v, v', r)$ is a matrix corresponding to edge probabilities for every possible tagging of the source and target node. More specifically:

$$T^{(v,v',u(r))}_{t,t'} = \frac{p_{\mathrm{edge}}(v, t, v', t', r)}{1 - p_{\mathrm{edge}}(v, t, v', t', r)} \tag{6.12}$$

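Continuing the example from 6.1.2, the probabilities in (6.1) yield the following transition matrix:

$$T^{(\text{machen},\,\text{mache},\,/Xen/ \to /Xe/)} =
\begin{array}{l|ccc}
 & \text{NN} & \text{VVINF} & \text{VVFIN} \\ \hline
\text{NN} & \frac{0.3}{1-0.3} & 0 & 0 \\
\text{VVINF} & 0 & 0 & \frac{0.01}{1-0.01} \\
\text{VVFIN} & 0 & 0 & 0
\end{array}
\tag{6.13}$$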
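As a small illustration of (6.12), the following sketch builds a transition matrix from a table of edge probabilities. The function name and the dictionary layout are assumptions made for this example. The two nonzero probabilities (0.3 and 0.01) are simply read off the matrix in (6.13), and every tag pair not listed is taken to have edge probability 0.

```python
TAGS = ["NN", "VVINF", "VVFIN"]

def transition_matrix(p_edge, tags):
    """Apply (6.12) entry by entry: p / (1 - p).

    p_edge maps (source_tag, target_tag) to an edge probability;
    pairs that are missing are treated as probability 0 (assumption).
    """
    matrix = []
    for t in tags:
        row = []
        for t2 in tags:
            p = p_edge.get((t, t2), 0.0)
            row.append(p / (1.0 - p))
        matrix.append(row)
    return matrix

# Edge probabilities read off (6.13) for machen -> mache.
p_edge = {("NN", "NN"): 0.3, ("VVINF", "VVFIN"): 0.01}
T = transition_matrix(p_edge, TAGS)
```

Rows are indexed by the tag of the source node and columns by the tag of the target node, matching the layout of (6.13).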

Furthermore, let $\lambda_{v,t}$ be the probability that the node $v$ with tag $t$ is a leaf, i.e. it has no outgoing edges. This value can be computed as follows:⁵

$$\lambda_{v,t} = \prod_{r \in R}\; \prod_{(v',t') \in r(v,t)} \left[ 1 - p_{\mathrm{edge}}(v, t, v', t', r) \right] \tag{6.14}$$

⁵ Two things can be said about the $\lambda$-values: they are extremely expensive to compute, because they involve a product over all hypothetical edges, including those leading to non-existent words, and they are of virtually no importance, since they tend to differ only slightly from tag to tag. As the terms of the product are mostly numbers very close to 1, although they are many, the result is still going to be fairly close to 1. Thus, although I include this value in the formulas for the sake of soundness of the theory, I actually ignored it in experiments, setting $\lambda_{v,t} = 1$ everywhere.

Now we can turn to computing the backward probability. A trivial observation is that $\beta_{v,t} = \lambda_{v,t}$ for leaf nodes. For a non-leaf node $v$, the backward probability is equal to the product of the backward probabilities of all children of $v$, multiplied by the probabilities of all outgoing edges of $v$. This has to be summed over all possible taggings of the child nodes. For example, taking the node $v_6$ in Fig. 6.3, we would have:

$$\begin{aligned}
\beta_{v_6,t} &= \lambda_{v_6,t} \sum_{t_7} \sum_{t_8} T^{(v_6,v_7,r)}_{t,t_7}\, \beta_{v_7,t_7}\, T^{(v_6,v_8,r')}_{t,t_8}\, \beta_{v_8,t_8} \\
&= \lambda_{v_6,t} \Big( \sum_{t_7} T^{(v_6,v_7,r)}_{t,t_7}\, \beta_{v_7,t_7} \Big) \Big( \sum_{t_8} T^{(v_6,v_8,r')}_{t,t_8}\, \beta_{v_8,t_8} \Big)
\end{aligned}$$

The term $\lambda_{v_6,t}$ is due to the fact that $v_6$ has no further outgoing edges apart from the two mentioned explicitly. The elements of the transition matrix contain 'one minus edge probability' in the denominator in order to remove this term from the product introduced by $\lambda$ for edges that are present. The second line is due to a simple transformation: $\sum_i \sum_j a_i b_j = (\sum_i a_i)(\sum_j b_j)$. $r$ and $r'$ are simply the rules corresponding to the respective edges.

Using matrix and vector notation, let $\beta_v$ and $\lambda_v$ be $|T|$-dimensional vectors. Further, let $\mathrm{out}_G(v)$ be the set of outgoing edges of $v$ in graph $G$ (which is our current graph). A general formula for the backward probability can be expressed as follows:⁶

$$\beta_v = \lambda_v * \prod_{(v,v',r) \in \mathrm{out}_G(v)} T^{(v,v',r)} \beta_{v'} \tag{6.15}$$
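The backward recursion (6.15) can be sketched as a post-order traversal. Everything concrete here is an assumption for illustration: the node names, the toy matrices, the dictionary-based tree representation, and the reading of the product over children as an element-wise product of vectors. All $\lambda$ values are set to 1. The helper `double_sum` evaluates the explicit double sum over the children's tags, i.e. the first line of the worked example for a node with two children, so that both forms can be compared.

```python
def mat_vec(matrix, vector):
    """Matrix-vector product: sums over the tag of the child node."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def backward(node, children, T, lam, beta=None):
    """Fill beta for the subtree rooted in `node`, following (6.15)."""
    if beta is None:
        beta = {}
    result = list(lam[node])  # leaf case: beta equals lambda
    for child in children.get(node, []):
        backward(child, children, T, lam, beta)
        message = mat_vec(T[(node, child)], beta[child])
        # Element-wise product over the children (assumed reading of 6.15).
        result = [r * m for r, m in zip(result, message)]
    beta[node] = result
    return beta

# Hypothetical tree with two tags: p has children c1 and c2, c2 has child g.
children = {"p": ["c1", "c2"], "c2": ["g"]}
T = {
    ("p", "c1"): [[0.5, 0.1], [0.2, 0.4]],
    ("p", "c2"): [[0.3, 0.0], [0.1, 0.6]],
    ("c2", "g"): [[0.2, 0.2], [0.5, 0.0]],
}
lam = {v: [1.0, 1.0] for v in ["p", "c1", "c2", "g"]}

beta = backward("p", children, T, lam)

def double_sum(t):
    """Explicit sum over the tags of both children of p."""
    return lam["p"][t] * sum(
        T[("p", "c1")][t][t1] * beta["c1"][t1] * T[("p", "c2")][t][t2] * beta["c2"][t2]
        for t1 in range(2)
        for t2 in range(2)
    )
```

The factored form needs one matrix-vector product per edge, while the explicit sum grows exponentially with the number of children.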

The rough idea for computing the forward probability is to take the forward probability of the parent node and multiply it by the probability of the edge leading to the node in question. However, the parent node might also have other children, which are not included in its forward probability. Looking at Fig. 6.3, the forward probability of $v_2$ must involve not only the forward probability of $v_1$ and the edge leading from $v_1$ to $v_2$, but also the subtree rooted in $v_3$. Thus, the general formula is as follows:

$$\alpha_v = \prod_{(v',v,r) \in \mathrm{in}_G(v)} \Bigg[ \alpha_{v'} * \prod_{\substack{(v',v'',r') \in \mathrm{out}_G(v') \\ v'' \neq v}} T^{(v',v'',r')} \beta_{v''} \Bigg] \cdot T^{(v',v,r)} \tag{6.16}$$

$\mathrm{in}_G(v)$ denotes the set of incoming edges of $v$. Note that the set notation

⁶ In the matrix formulas, the asterisk denotes element-wise multiplication and the dot or no symbol denotes dot product.


and the outer product is only for notational convenience, as the set is always a singleton. In this case, it means: 'pick $v'$, that is the parent node of $v$'. The inner product goes over all children of $v'$ except for $v$ and includes the edges leading to them and the probabilities of the subtrees rooted in them. Finally, the last product corresponds to the edge leading from $v'$ to $v$.

The last remaining issue is the forward probability of root nodes. It is simply equal to the probability of the root node defined by the model, which we call $\rho_{v,t}$ and compute as follows:

$$\rho_{v,t} = P_{\mathrm{root}}(v \mid \theta_{\mathrm{root}})\, P_{\mathrm{roottag}}(t \mid v, \theta_{\mathrm{roottag}}) \tag{6.17}$$

Thus, $\alpha_v = \rho_v$ for root nodes.
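Putting the pieces together, the sketch below runs the backward pass (6.15), the forward pass (6.16) with $\alpha_v = \rho_v$ at the root, and then normalizes $\alpha_v * \beta_v$ to obtain the tag posteriors $\tau_v$. It is a minimal illustration under assumptions, not the implementation used in the experiments: the tree, the matrices and the vector `RHO` are hypothetical toy values, all $\lambda$ values are set to 1 (as in the experiments, according to the footnote on $\lambda$), and the product over children is read as element-wise. A brute-force enumeration of all taggings is included as a check.

```python
from itertools import product

def mat_vec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def backward(node, children, T, k, beta):
    """(6.15) with lambda = 1 for every node and tag."""
    result = [1.0] * k
    for child in children.get(node, []):
        backward(child, children, T, k, beta)
        message = mat_vec(T[(node, child)], beta[child])
        result = [r * m for r, m in zip(result, message)]
    beta[node] = result
    return beta

def forward(root, children, T, rho, beta):
    """(6.16): walk down from the root, where alpha = rho."""
    alpha = {root: list(rho)}
    stack = [root]
    while stack:
        parent = stack.pop()
        for node in children.get(parent, []):
            # Parent's forward probability times the subtrees of all siblings.
            context = list(alpha[parent])
            for sibling in children[parent]:
                if sibling != node:
                    message = mat_vec(T[(parent, sibling)], beta[sibling])
                    context = [c * m for c, m in zip(context, message)]
            edge = T[(parent, node)]
            # Row vector times matrix: sums over the tag of the parent.
            alpha[node] = [
                sum(context[s] * edge[s][t] for s in range(len(context)))
                for t in range(len(edge[0]))
            ]
            stack.append(node)
    return alpha

def posteriors(alpha, beta):
    tau = {}
    for v in alpha:
        joint = [a * b for a, b in zip(alpha[v], beta[v])]
        z = sum(joint)
        tau[v] = [j / z for j in joint]
    return tau

def brute_force(nodes, root, T, rho, k):
    """Tag posteriors by enumerating every tagging of the tree."""
    totals = {v: [0.0] * k for v in nodes}
    for tagging in product(range(k), repeat=len(nodes)):
        tag = dict(zip(nodes, tagging))
        weight = rho[tag[root]]
        for (parent, child), matrix in T.items():
            weight *= matrix[tag[parent]][tag[child]]
        for v in nodes:
            totals[v][tag[v]] += weight
    z = sum(totals[root])
    return {v: [x / z for x in totals[v]] for v in nodes}

# Hypothetical tree with two tags: r has children a and b, a has child c.
NODES = ["r", "a", "b", "c"]
CHILDREN = {"r": ["a", "b"], "a": ["c"]}
T = {
    ("r", "a"): [[0.5, 0.1], [0.2, 0.4]],
    ("r", "b"): [[0.3, 0.05], [0.1, 0.6]],
    ("a", "c"): [[0.2, 0.2], [0.5, 0.1]],
}
RHO = [0.7, 0.3]

beta = backward("r", CHILDREN, T, 2, {})
alpha = forward("r", CHILDREN, T, RHO, beta)
tau = posteriors(alpha, beta)
check = brute_force(NODES, "r", T, RHO, 2)
```

With $\lambda = 1$, the sum of $\alpha_{v,t}\beta_{v,t}$ over $t$ is the same for every node $v$, which is a convenient consistency check on an implementation.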

In document Statistical and Computational Models for Whole Word Morphology (Page 124-127)