5.1 Promising ideas, poor performance
5.1.1 Poutsma’s Implementation
Computing the translation space
The DOT translation space for any input string comprises all possible representations which can be assigned to that string according to
the grammar. This space is very similar to the DOP parse space; the main difference is
that each fragment comprises a pair of linked subtrees rather than a single subtree. Thus, as for DOP, we use a chart to store all fragments relevant to the input string, along with
pointers to those fragments with which they can compose to form valid representations
and, therefore, translations. An example DOT translation space is given in Figure 5.1.
This translation space again comprises a two-dimensional chart of size N², where N is the
length of the input string. Each token in the input string is assigned a number i such that 0 ≤ i < N. These numbers appear along the horizontal axis; the numbers on the vertical axis (generally represented by j) indicate the number of input tokens spanned. Each open substitution site pair in every fragment present on the chart explicitly points
to a chart position; any fragment composed at a substitution site must be selected from this position.
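The chart layout described above can be illustrated with a minimal Python sketch. The class and function names (`ChartFragment`, `build_chart`, `fillers`) are hypothetical, as is the two-token input; the sketch also creates only those cells whose span fits inside the input rather than a full N × N grid. It is an illustration of the data structure under these assumptions, not a description of Poutsma's system.

```python
from dataclasses import dataclass, field


@dataclass
class ChartFragment:
    """A linked source/target fragment pair, reduced to what the chart needs."""
    source_root: str
    target_root: str
    # One entry per open substitution site pair:
    # ((source label, target label), (start i, span j)).
    sites: list = field(default_factory=list)


def build_chart(n):
    """Create one empty cell per (start i, span j) that fits in n tokens."""
    return {(i, j): [] for i in range(n) for j in range(1, n - i + 1)}


def fillers(chart, fragment):
    """For each open site pair, list the fragments that may be composed there.

    A filler must sit at the chart position the site points to and carry
    the same pair of root labels as the site.
    """
    result = []
    for labels, position in fragment.sites:
        matching = [f for f in chart[position]
                    if (f.source_root, f.target_root) == labels]
        result.append(matching)
    return result


# Hypothetical two-token input: one fragment spans both tokens and has an
# open <N, N> site pointing at the cell for token 1 with span 1.
chart = build_chart(2)
noun = ChartFragment("N", "N")
top = ChartFragment("VPv", "NPpp", sites=[(("N", "N"), (1, 1))])
chart[(1, 1)].append(noun)
chart[(0, 2)].append(top)
```

Storing the target chart position on each site pair means that composition never has to search the whole chart: the only candidates are the fragments already stored in the cell the site points to.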
Poutsma adapts Bod (1998)’s approach to implementing Tree-DOP, described in sec-
the analysis phase, each fragment is viewed as a rewrite rule where the left-hand side of each rule corresponds to the root node pair of a fragment and the right-hand side to the
source and target frontiers of that fragment; this generic rule is shown in (5.1):
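\[ \langle \mathit{root}(t_s),\ \mathit{root}(t_t) \rangle \longrightarrow \langle (\mathit{frontier}(t_{s_1} \ldots t_{s_n})),\ (\mathit{frontier}(t_{t_1} \ldots t_{t_n})) \rangle \tag{5.1} \]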
Each rewrite rule also has a pointer to the fragment which it represents, meaning that two
identical rewrite rules which point to fragments with different internal structures remain
distinct. Where frontiers are open substitution sites, the links between source and target
sites are maintained. Thus, the topmost fragment in position [0][2] of the translation space
in Figure 5.1 corresponds to the rewrite rule given in (5.2):
\[ \langle \mathit{VPv},\ \mathit{NPpp} \rangle \longrightarrow \langle (\text{scanning},\ N),\ (\text{numérisation},\ \text{de},\ N) \rangle \tag{5.2} \]
This rewrite rule can be combined with rules that have the label <N, N> on the left-hand side.
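The rewrite-rule view of a fragment can be sketched as a small Python record. The names `LinkedRule`, `site_labels` and `can_fill` are hypothetical, and representing links as index pairs into the two frontiers is an assumption made for illustration; the source does not say how Poutsma's system stores them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedRule:
    lhs: tuple          # (source root label, target root label)
    source_rhs: tuple   # source frontier, left to right
    target_rhs: tuple   # target frontier, left to right
    links: tuple        # (source index, target index) for each linked open site
    fragment_id: int    # pointer to the fragment this rule represents


def site_labels(rule):
    """Return the <source, target> label pair of every linked open site."""
    return [(rule.source_rhs[s], rule.target_rhs[t]) for s, t in rule.links]


def can_fill(rule, filler):
    """True if filler's left-hand side matches one of rule's open site pairs."""
    return filler.lhs in site_labels(rule)


# Rule (5.2): the open N at source index 1 is linked to the N at target index 2.
rule_52 = LinkedRule(
    lhs=("VPv", "NPpp"),
    source_rhs=("scanning", "N"),
    target_rhs=("numérisation", "de", "N"),
    links=((1, 2),),
    fragment_id=0,
)

# Two rules with identical labels and frontiers stay distinct because they
# point to different fragments (the fragment ids here are made up).
twin = LinkedRule(rule_52.lhs, rule_52.source_rhs, rule_52.target_rhs,
                  rule_52.links, fragment_id=1)

# A hypothetical lexical rule with <N, N> on its left-hand side.
noun_rule = LinkedRule(("N", "N"), ("documents",), ("documents",), (), 2)
```

Including `fragment_id` in the record is what keeps `rule_52` and `twin` apart, mirroring the requirement that identical rewrite rules pointing to fragments with different internal structures remain distinct.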
When fragments are viewed as rewrite rules in this manner, existing algorithms for context-free grammars can be applied to construct the derivation forest for a given input
string. Poutsma (2000, 2003), however, gives no information as to which algorithm is used
in his system, or whether or not it required adaptation to handle linked, bilingual rewrite
rules.
The translation space contains all fragments which can be used to form a representation
for the current input. Since every fragment in the derivation space comprises both a source
and a target tree, each derivation read from this space automatically comprises both a
source-language parse tree and a target-language parse tree. Consequently, generating a
translation simply involves extracting the ordered frontiers from the target-language parse tree.
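Reading a translation off a target-language parse tree amounts to a left-to-right traversal of its leaves. The sketch below assumes trees are nested tuples of the form `(label, child, ...)` with words as plain strings; the tree shown, including the word "documents" and its internal structure, is a hypothetical example rather than one taken from Figure 5.1.

```python
def frontier(tree):
    """Return the words at the leaves of a tree, in left-to-right order."""
    if isinstance(tree, str):          # a leaf: a single target-language word
        return [tree]
    _label, *children = tree
    words = []
    for child in children:
        words.extend(frontier(child))
    return words


# Hypothetical target-language parse tree.
target_tree = ("NPpp",
               ("N", "numérisation"),
               ("PP", ("P", "de"), ("N", "documents")))

translation = " ".join(frontier(target_tree))
```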
Selecting the best translation
DOP parse probabilities are established by summing
over the probabilities of the derivations yielding each parse. Similarly, DOT translation
probabilities are established by summing over the probabilities of the derivations yielding
each source and target string. Thus, the problem of finding the most probable translation (MPT) for DOT is computationally analogous to the problem of finding the MPP for
DOP and, again, exhaustive search of the translation space is not possible. Poutsma
adopts the top-down breadth-first Monte Carlo sampling algorithm described by Hoogweg
(2000). As the model requires maximisation of translation probability rather than parse
probability, the frequencies of the translations in the sampled set should correspond to
their DOT probabilities. However, while the analysis space built comprises paired source
and target fragments, Poutsma (2003):347 states that “a random selection method to
generate derivations from the target derivation forest” is used. As the DOT probability
of sampling any fragment depends on both the source and target subtree root nodes of that fragment, sampling over target subtrees only means that the distribution of sampled
translations is unlikely to correspond to their DOT distribution. Furthermore, Poutsma
(op. cit.) also states that “the random choices of derivations are based on the probabilities of the underlying subderivations” but does not discuss how these sampling probabilities are
computed in his system. As discussed in section 2.5.1, there are several ways to compute
sampling probabilities, each with different implications. Finally, Poutsma does not allow
variation in the number of samples taken to reflect the level of translational ambiguity
present in the input string: in every case, 1500 derivations are sampled and the most
frequently occurring translation in this set is returned.¹
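The relationship between derivation probabilities, translation probabilities and a sampled estimate of the MPT can be illustrated with a toy Python sketch. It assumes the derivations are listed explicitly as (translation, probability) pairs with made-up values, whereas a real system would sample them from the derivation forest; the function names are hypothetical, and the default of 1500 samples simply mirrors the fixed sample size reported above.

```python
import random
from collections import Counter


def exact_translation_probs(derivations):
    """Sum derivation probabilities over all derivations of each translation."""
    totals = Counter()
    for translation, prob in derivations:
        totals[translation] += prob
    return dict(totals)


def sample_mpt(derivations, n_samples=1500, seed=0):
    """Estimate the most probable translation from sampled derivations.

    Each derivation is drawn with probability proportional to its weight,
    so the relative frequency of a translation in the sample approximates
    its summed probability.
    """
    rng = random.Random(seed)
    translations = [t for t, _ in derivations]
    weights = [p for _, p in derivations]
    sampled = rng.choices(translations, weights=weights, k=n_samples)
    counts = Counter(sampled)
    return counts.most_common(1)[0][0], counts


# Toy derivation set (made-up numbers): the single most probable derivation
# yields "A", but "B" is produced by two derivations whose mass sums higher.
toy = [("A", 0.30), ("B", 0.25), ("B", 0.25), ("C", 0.20)]

exact = exact_translation_probs(toy)
best, counts = sample_mpt(toy)
```

The sampled frequencies track the summed probabilities only because each derivation is drawn according to its full probability; drawing from a different distribution, such as one defined over target subtrees alone, would change what the frequencies estimate.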
Pruning the fragment set
Poutsma (2000, 2003) prunes the set of DOT fragments
by varying maximum depth. As we will discuss in detail in section 5.2.1, however, it is
not necessarily the case that the source and target subtrees in each fragment are of the
same depth. As Poutsma does not address this issue, it is not clear whether he calculates
fragment depth over the source subtree, the target subtree or by some other method.

¹ If each target linked node was transformed into a double category label comprising both the source
and target node labels and sampling probabilities were calculated correctly, then correct trees and strings
could be generated with the correct probabilities. However, from the description presented in (Poutsma, 2000, 2003), we cannot know exactly the properties of the sample set computed for each input string.
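To see why the way fragment depth is measured matters for pruning, consider the following Python sketch. It assumes trees are nested tuples `(label, child, ...)`, that depth is the longest path from the root to a leaf, and that an open substitution site is a node with no children. The three policies (`"source"`, `"target"`, `"max"`) and the internal structure of the example fragment are hypothetical; none of them is claimed to be what Poutsma's system does.

```python
def depth(tree):
    """Length of the longest path from the root of a tree to any leaf."""
    if isinstance(tree, str):      # a word
        return 0
    _label, *children = tree
    if not children:               # an open substitution site
        return 0
    return 1 + max(depth(child) for child in children)


def keep(source, target, max_depth, policy):
    """Decide whether a fragment survives pruning under a given depth policy."""
    if policy == "source":
        measured = depth(source)
    elif policy == "target":
        measured = depth(target)
    elif policy == "max":
        measured = max(depth(source), depth(target))
    else:
        raise ValueError(f"unknown policy: {policy}")
    return measured <= max_depth


# Hypothetical fragment whose source and target subtrees differ in depth.
source = ("VPv", ("V", "scanning"), ("N",))
target = ("NPpp", ("N", "numérisation"), ("PP", ("P", "de"), ("N",)))

decisions = {p: keep(source, target, 2, p) for p in ("source", "target", "max")}
```

With a maximum depth of 2, this fragment is kept when depth is measured over the source subtree only and discarded under the other two policies, so the choice of measure changes the fragment set.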