5.1 Promising ideas, poor performance

5.1.1 Poutsma’s Implementation

Computing the translation space. The DOT translation space for any input string comprises all possible representations which can be assigned to that string according to the grammar. This space is very similar to the DOP parse space; the main difference is that each fragment comprises a pair of linked subtrees rather than a single subtree. Thus, as for DOP, we use a chart to store all fragments relevant to the input string, along with pointers to those fragments with which they can compose to form valid representations and, therefore, translations. An example DOT translation space is given in Figure 5.1. This translation space again comprises a two-dimensional chart of size $N^2$, where $N$ is the length of the input string. Each token in the input string is assigned a number $i$ such that $0 \le i < N$. These numbers appear along the horizontal axis; the numbers which appear on the vertical axis (generally represented by $j$) indicate the number of input tokens spanned. Each open substitution site pair in every fragment present on the chart explicitly points to a chart position; any fragment composed at a substitution site must be selected from this position.
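The chart layout described above can be sketched as follows. This is only an illustration under stated assumptions, not Poutsma's implementation or the one used in this work: the names `Fragment` and `make_chart` are hypothetical, fragments are reduced to root labels and frontiers, the two-token input is invented, and the convention that cell `(i, j)` holds fragments starting at token `i` and spanning `j` tokens is read off the axis description.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """A linked source/target subtree pair, reduced to labels and frontiers."""
    root: tuple[str, str]            # (source root label, target root label)
    source_frontier: list[str]
    target_frontier: list[str]
    # Each open substitution-site pair points to the chart cell (i, j)
    # from which a fragment must be selected to fill that site.
    site_pointers: dict[tuple[str, str], tuple[int, int]] = field(default_factory=dict)


def make_chart(n: int) -> dict[tuple[int, int], list[Fragment]]:
    """Create empty cells (i, j): start token i, span of j tokens, inside the input."""
    return {(i, j): [] for i in range(n) for j in range(1, n - i + 1)}


tokens = ["scanning", "documents"]   # hypothetical two-token input
chart = make_chart(len(tokens))

# A lexical fragment covering the second token only.
chart[(1, 1)].append(Fragment(("N", "N"), ["documents"], ["documents"]))

# A fragment spanning both tokens whose open <N, N> site points at cell (1, 1).
chart[(0, 2)].append(
    Fragment(
        ("VPv", "NPpp"),
        ["scanning", "N"],
        ["numérisation", "de", "N"],
        site_pointers={("N", "N"): (1, 1)},
    )
)
```

Storing the pointer on the site pair, rather than searching the chart at composition time, mirrors the statement that each open site explicitly names the position from which its filler must come.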

Poutsma adapts Bod's (1998) approach to implementing Tree-DOP, described in section […]. In the analysis phase, each fragment is viewed as a rewrite rule where the left-hand side of each rule corresponds to the root node pair of a fragment and the right-hand side to the source and target frontiers of that fragment; this generic rule is shown in (5.1):

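$$\langle \mathit{root}(t_s),\ \mathit{root}(t_t) \rangle \longrightarrow \langle (\mathit{frontier}(t_{s_1} \ldots t_{s_n})),\ (\mathit{frontier}(t_{t_1} \ldots t_{t_n})) \rangle \tag{5.1}$$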

Each rewrite rule also has a pointer to the fragment which it represents, meaning that two identical rewrite rules which point to fragments with different internal structures remain distinct. Where frontiers are open substitution sites, the links between source and target sites are maintained. Thus, the topmost fragment in position [0][2] of the translation space in Figure 5.1 corresponds to the rewrite rule given in (5.2):

$$\langle \mathit{VPv},\ \mathit{NPpp} \rangle \longrightarrow \langle (\textit{scanning},\ N),\ (\textit{numérisation},\ \textit{de},\ N) \rangle \tag{5.2}$$

This rewrite rule can be combined with rules that have the label $\langle N, N \rangle$ on the left-hand side.
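A linked rewrite rule of this kind might be represented as in the sketch below. The class `LinkedRule`, the helper `can_substitute`, the integer `fragment_id` standing in for the fragment pointer, and the lexical rule for "documents" are all hypothetical names and data introduced for illustration; the source does not describe how Poutsma's system encodes rules.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LinkedRule:
    lhs: tuple[str, str]                  # root node pair of the fragment
    source_rhs: tuple[str, ...]           # source frontier
    target_rhs: tuple[str, ...]           # target frontier
    # Linked open substitution sites as (source index, target index) pairs.
    links: tuple[tuple[int, int], ...]
    fragment_id: int                      # pointer to the fragment represented


def can_substitute(parent: LinkedRule, child: LinkedRule) -> bool:
    """True if the child's root pair matches a linked open site pair of the parent."""
    return any(
        (parent.source_rhs[s], parent.target_rhs[t]) == child.lhs
        for s, t in parent.links
    )


# The rule in (5.2): the source N at index 1 is linked to the target N at index 2.
rule_5_2 = LinkedRule(
    lhs=("VPv", "NPpp"),
    source_rhs=("scanning", "N"),
    target_rhs=("numérisation", "de", "N"),
    links=((1, 2),),
    fragment_id=0,
)

# Same rewrite rule, different underlying fragment: the two remain distinct.
same_rule_other_fragment = replace(rule_5_2, fragment_id=1)

# A hypothetical lexical rule with <N, N> on its left-hand side.
noun_rule = LinkedRule(("N", "N"), ("documents",), ("documents",), (), fragment_id=2)

print(rule_5_2 == same_rule_other_fragment)   # False
print(can_substitute(rule_5_2, noun_rule))    # True
```

Including the fragment pointer in the rule's identity is what keeps two otherwise identical rules apart when their fragments differ in internal structure.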

When fragments are viewed as rewrite rules in this manner, existing algorithms for context-free grammars can be applied to construct the derivation forest for a given input string. Poutsma (2000, 2003), however, gives no information as to which algorithm is used in his system, or whether or not it required adaptation to handle linked, bilingual rewrite rules.

The translation space contains all fragments which can be used to form a representation for the current input. Since every fragment in the derivation space comprises both a source and a target tree, each derivation read from this space automatically comprises both a source-language parse tree and a target-language parse tree. Consequently, generating a translation simply involves extracting the ordered frontiers from the target-language parse tree.
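Reading a translation off the target-language parse tree amounts to a left-to-right traversal of its leaves, as in the sketch below. The `Node` class, the `frontier` function, and the internal shape of the example target tree are assumptions made for illustration; only the target words "numérisation" and "de" come from the rule in (5.2).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)


def frontier(node: Node) -> list[str]:
    """Return the leaf labels of the tree in left-to-right order."""
    if not node.children:
        return [node.label]
    words: list[str] = []
    for child in node.children:
        words.extend(frontier(child))
    return words


# Hypothetical target-language parse tree for a complete derivation.
target_tree = Node("NPpp", [
    Node("N", [Node("numérisation")]),
    Node("PP", [
        Node("P", [Node("de")]),
        Node("N", [Node("documents")]),
    ]),
])

translation = " ".join(frontier(target_tree))
print(translation)  # numérisation de documents
```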

Selecting the best translation. DOP parse probabilities are established by summing over the probabilities of the derivations yielding each parse. Similarly, DOT translation probabilities are established by summing over the probabilities of the representations which yield each source and target string. Thus, the problem of finding the most probable translation (MPT) for DOT is computationally analogous to the problem of finding the MPP for
DOP and, again, exhaustive search of the translation space is not possible. Poutsma adopts the top-down breadth-first Monte Carlo sampling algorithm described by Hoogweg (2000). As the model requires maximisation of translation probability rather than parse probability, the frequencies of the translations in the sampled set should correspond to their DOT probabilities. However, while the analysis space built comprises paired source and target fragments, Poutsma (2003, p. 347) states that “a random selection method to generate derivations from the target derivation forest” is used. As the DOT probability of sampling any fragment depends on both the source and target subtree root nodes of that fragment, sampling over target subtrees only means that the distribution of sampled translations is unlikely to correspond to their DOT distribution. Furthermore, Poutsma (op. cit.) also states that “the random choices of derivations are based on the probabilities of the underlying subderivations” but does not discuss how these sampling probabilities are computed in his system. As discussed in section 2.5.1, there are several ways to compute sampling probabilities, each with different implications. Finally, Poutsma does not allow variation in the number of samples taken to reflect the level of translational ambiguity present in the input string: in every case, 1500 derivations are sampled and the most frequently occurring translation in this set is returned.¹
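The final selection step, drawing a fixed number of derivations and returning the translation they most often yield, can be sketched as below. This is not the top-down breadth-first sampler of Hoogweg (2000): for brevity it draws directly from a toy list of derivations with invented probabilities. The names `most_frequent_translation`, `toy_sampler`, and `TOY_DERIVATIONS` are hypothetical; only the sample size of 1500 comes from the text.

```python
import random
from collections import Counter


def most_frequent_translation(sample_derivation, num_samples=1500, seed=0):
    """Sample derivations and return the most frequently yielded translation."""
    rng = random.Random(seed)
    counts = Counter(sample_derivation(rng) for _ in range(num_samples))
    return counts.most_common(1)[0][0], counts


# Toy derivation set (assumed values): each entry is (translation yielded,
# derivation probability). Translation B is yielded by two derivations, so its
# total probability (0.6) exceeds that of A (0.4), even though the single most
# probable derivation yields A.
TOY_DERIVATIONS = [
    ("translation A", 0.4),
    ("translation B", 0.3),
    ("translation B", 0.3),
]


def toy_sampler(rng: random.Random) -> str:
    """Draw one derivation according to its probability; return its translation."""
    translations = [t for t, _ in TOY_DERIVATIONS]
    weights = [p for _, p in TOY_DERIVATIONS]
    return rng.choices(translations, weights=weights, k=1)[0]


best, counts = most_frequent_translation(toy_sampler)
print(best, counts)
```

The toy data illustrates why translation frequencies, not derivation frequencies, are counted: the relative frequency of each translation in the sample estimates the sum of the probabilities of all derivations yielding it, provided derivations are sampled according to their true probabilities.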

Pruning the fragment set. Poutsma (2000, 2003) prunes the set of DOT fragments by varying maximum depth. As we will discuss in detail in section 5.2.1, however, it is not necessarily the case that the source and target subtrees in each fragment are of the same depth. As Poutsma does not address this issue, it is not clear whether he calculates fragment depth over the source subtree, the target subtree or by some other method.

¹ If each target linked node were transformed into a double category label comprising both the source and target node labels and sampling probabilities were calculated correctly, then correct trees and strings could be generated with the correct probabilities. However, from the description presented in Poutsma (2000, 2003), we cannot know exactly the properties of the sample set computed for each input string.
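The ambiguity in depth-based pruning of linked fragments can be made concrete with a small sketch. Everything here is assumed for illustration: trees are `(label, children)` tuples, depth is taken to be the number of edges on the longest path from root to frontier, the internal structure of the example fragment is invented, and the three depth measures (source only, target only, maximum of both) are simply candidate readings, none of which is attributed to Poutsma.

```python
def depth(tree):
    """Number of edges on the longest root-to-frontier path of (label, children)."""
    _label, children = tree
    return 0 if not children else 1 + max(depth(child) for child in children)


def prune(fragments, max_depth, measure):
    """Keep fragments whose depth, under the given measure, is within max_depth."""
    return [
        (src, tgt) for src, tgt in fragments
        if measure(depth(src), depth(tgt)) <= max_depth
    ]


# Hypothetical linked fragment: source subtree of depth 2, target subtree of depth 3.
source = ("VPv", [("V", [("scanning", [])]), ("N", [])])
target = ("NPpp", [
    ("N", [("numérisation", [])]),
    ("PP", [("P", [("de", [])]), ("N", [])]),
])
fragments = [(source, target)]

by_source = prune(fragments, 2, lambda s, t: s)   # fragment kept
by_target = prune(fragments, 2, lambda s, t: t)   # fragment pruned
by_max = prune(fragments, 2, max)                 # fragment pruned
print(len(by_source), len(by_target), len(by_max))  # 1 0 0
```

With the same depth limit, the fragment survives under one measure and is discarded under the other two, so the choice of measure changes the fragment set that the reported results depend on.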
