5.2 A new implementation of the Tree-DOT model
5.2.2 Translation-space construction
As discussed in section 5.1.3, we do not feel that the translation algorithm proposed
by Poutsma (2000, 2003) – which is based on the ‘fragments as rewrite rules’ technique
proposed for DOP by Bod (1992) and described in section 2.4.1 – will facilitate the experi-
ments required to fully assess the performance of the DOT model. Thus, in this section we
focus on the adaptation of more efficient DOP parsing algorithms to the DOT translation model. Firstly, we discuss how the fragments in the DOT translation space for an input
string relate to the fragments in the DOP parse space for that same string. In light of
this relationship, we then outline the elements which building the parse and translation
spaces have in common, and give the general intuition as to how the former can be used
in creating the latter. Finally, we discuss in detail the possibility of adapting the DOP
parsing algorithms developed by Goodman (1996a, 1998, 2003) and Sima’an (1995a, 1999)
to accomplish the task of translation space computation.
From parsing to translation: the general model
Conceptually, the source- and target-language halves of each DOT fragment, along with
the translational links between them, form a single unit. It is useful on a practical level,
however, to make explicit the relationships between (i) the two halves of the set of bilingual
DOT fragments which can be extracted from a set of linked training trees and (ii) the two
sets of monolingual fragments which can be extracted from that same set of linked training trees by placing the source and target trees in separate sets, discarding the links and apply-
ing the DOP fragmentation operations. In other words, if one of the languages represented
in the bilingual treebank is language L, what is the relationship between (i) the fragment
set Fb generated by applying the DOT fragmentation operations to the bilingual treebank
and then stripping away the links and corresponding-language parts of each extracted
fragment, leaving only representations for L and (ii) the fragment set Fm generated by
taking the bilingual treebank, stripping away the links and corresponding-language trees
and applying the DOP fragmentation operations to this monolingual treebank? As the
fragmentation operations defined for Tree-DOT can only select linked nodes to be either root or frontier nodes, it follows that non-linked nodes are always internal to the fragments
in which they occur. Thus, set Fb comprises a subset of the fragments in Fm such that all
root nodes and substitution sites of fragments in Fb are linked to target-language nodes
in the bilingual treebank.
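The subset relationship between Fb and Fm can be made concrete with a minimal sketch. The `Fragment` class, the `restrict_to_linked` function and the toy tree A → B C (in which only A and B are assumed to be linked) are illustrative assumptions; fragments are reduced here to the identifiers of their root and substitution-site nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A monolingual fragment, reduced to the node ids relevant to linking."""
    root: str                      # id of the fragment's root node
    substitution_sites: frozenset  # ids of its open substitution sites


def restrict_to_linked(monolingual_fragments, linked_nodes):
    """Keep only fragments whose root and substitution sites are all linked."""
    return {
        f for f in monolingual_fragments
        if f.root in linked_nodes and f.substitution_sites <= linked_nodes
    }


# Hypothetical monolingual fragment set Fm for a toy tree A -> B C.
fm = {
    Fragment("A", frozenset()),
    Fragment("A", frozenset({"B"})),
    Fragment("A", frozenset({"C"})),
    Fragment("A", frozenset({"B", "C"})),
    Fragment("B", frozenset()),
    Fragment("C", frozenset()),
}

# Assumption: only nodes A and B carry translational links.
fb = restrict_to_linked(fm, linked_nodes=frozenset({"A", "B"}))
```

Under these assumptions every fragment that is rooted at C, or that leaves C open as a substitution site, is excluded, so `fb` is a proper subset of `fm`.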
In Tree-DOT, the process of building the translation space is driven by the input
string, and the building of target language representations can be viewed as a by-product
of parsing with bilingual fragments. It is possible, therefore, to build a first approximation
of the translation space by simply parsing with the source-language half of the bilingual
fragment base only, i.e. fragment set Fb. Once this has been accomplished, the one or
more target-language subtrees which correspond to each source-language fragment in the
approximated space are retrieved. However, according to the DOT composition operation, the target-language subtrees effectively act as constraints on the source-language
fragments which can combine to form analyses: in order for fragment fx with root node
categories <Rsx, Rtx> to compose with fragment fy with leftmost substitution site categories <LSSsy, LSSty>, not only must the source root in fx, Rsx, correspond to the source
leftmost substitution site in fy, LSSsy, but the target root Rtx must also correspond to
the substitution site category LSSty. Effectively, this means that source-language fragments
which can combine freely in a monolingual model are now constrained by their
target-language links. Thus, fragments which, due to translational constraints, cannot be
composed with any other fragments to form valid analyses are removed from the approximated space, giving us the bilingual parse and translation space for the input string.
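The effect of the target-side constraint on composition can be sketched as follows. The `Pair` and `DotFragment` types and the three example fragments are hypothetical; the sketch only contrasts a source-only category check with the paired check described above and does not attempt the chart-based pruning itself.

```python
from typing import NamedTuple, Optional


class Pair(NamedTuple):
    """A linked <source, target> category pair."""
    source: str
    target: str


class DotFragment(NamedTuple):
    name: str
    root: Pair
    leftmost_site: Optional[Pair]  # None if no open substitution site


def source_only_match(fy: DotFragment, fx: DotFragment) -> bool:
    """Monolingual-style check: only the source categories must agree."""
    return fy.leftmost_site is not None and fy.leftmost_site.source == fx.root.source


def can_compose(fy: DotFragment, fx: DotFragment) -> bool:
    """DOT check: source and target categories must both agree."""
    return fy.leftmost_site is not None and fy.leftmost_site == fx.root


# Hypothetical fragments: fy expects an <NP, PP> pair at its leftmost site.
fy = DotFragment("fy", Pair("S", "S"), Pair("NP", "PP"))
fx_ok = DotFragment("fx_ok", Pair("NP", "PP"), None)
fx_bad = DotFragment("fx_bad", Pair("NP", "NP"), None)
```

Here `fx_bad` would combine with `fy` in a monolingual model, because the source categories agree, but it is rejected once the target categories are taken into account.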
When the task of building the DOT translation space is viewed from this perspec-
tive, adaptation of the parsing algorithms of Goodman (1996a, 1998, 2003) and Sima’an
(1995a, 1999) to accomplish this task seems worthy of investigation. However, we find
that Sima’an’s two-phase analysis method gives the required flexibility whereas Good-
man’s PCFG-reduction method does not. In the remainder of this section, we detail why
this is the case.
Translating with Goodman’s PCFG-reduction approach
Recall that, as described in section 2.4.3, the algorithm of Goodman (1996a, 1998, 2003) for DOP parsing reduces the fragment set to
a PCFG containing at most eight rules for each node in the training treebank. Each training-tree node A is assigned a unique address k and, correspondingly, one new non-terminal node Ak is created; such non-terminals are called “interior” nodes and the original
nodes “exterior” nodes. In addition, the number of subtrees ak with root node Ak is also
calculated.
A@j
B@k C@l
(5.7)
For any node grouping such as the one in example (5.7), the eight PCFG rules and their
corresponding probabilities in example (5.8) are then extracted.
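\[
\begin{array}{llll}
(1)\ A_j \rightarrow B\,C & (1/a_j) & (5)\ A \rightarrow B\,C & (1/a) \\
(2)\ A_j \rightarrow B_k\,C & (b_k/a_j) & (6)\ A \rightarrow B_k\,C & (b_k/a) \\
(3)\ A_j \rightarrow B\,C_l & (c_l/a_j) & (7)\ A \rightarrow B\,C_l & (c_l/a) \\
(4)\ A_j \rightarrow B_k\,C_l & (b_k c_l/a_j) & (8)\ A \rightarrow B_k\,C_l & (b_k c_l/a)
\end{array}
\tag{5.8}
\]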
These rules correspond to the eight possible contexts in which the node grouping in ex-
ample (5.7) can occur in fragments extracted from the corresponding treebank tree; each
of the three nodes can be either interior or exterior (i.e. root node or substitution site) to
any fragment in which the grouping occurs. Thus, every relevant DOP fragment can be
constructed using one or more PCFG derivations by converting each internal node to an
external node and, furthermore, the probability of each of these DOP fragments can be calculated by summing over the PCFG derivations yielding that fragment.
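As an illustration, the eight rules of example (5.8) can be generated mechanically for a single node grouping. This is a minimal sketch: the function name `goodman_rules`, the `label@address` naming of interior nodes and the example counts are assumptions. The subtree count of the parent is taken to be a_j = (b_k + 1)(c_l + 1), following Goodman's recurrence for binary nodes, and `a_total` stands for the number of subtrees headed by any node labelled A.

```python
from fractions import Fraction


def goodman_rules(parent, left, right, b_k, c_l, a_total):
    """Return the eight (lhs, rhs, probability) rules for one node grouping.

    parent, left, right: (label, address) pairs, e.g. ("A", "j").
    b_k, c_l: numbers of subtrees headed by the two child nodes.
    a_total: number of subtrees headed by any node with the parent's label.
    """
    (A, j), (B, k), (C, l) = parent, left, right
    a_j = (b_k + 1) * (c_l + 1)  # subtrees headed by the interior parent
    Aj, Bk, Cl = f"{A}@{j}", f"{B}@{k}", f"{C}@{l}"
    # Right-hand sides with their unnormalised weights, in the order of (5.8).
    weighted = [(B, C, 1), (Bk, C, b_k), (B, Cl, c_l), (Bk, Cl, b_k * c_l)]
    rules = [(Aj, (x, y), Fraction(w, a_j)) for x, y, w in weighted]      # (1)-(4)
    rules += [(A, (x, y), Fraction(w, a_total)) for x, y, w in weighted]  # (5)-(8)
    return rules


# Hypothetical counts: b_k = 2, c_l = 3, and 24 subtrees headed by label A.
rules = goodman_rules(("A", "j"), ("B", "k"), ("C", "l"), b_k=2, c_l=3, a_total=24)
```

With these counts a_j = 12, so rules (1) to (4) sum to 1 and rules (5) to (8) sum to a_j/a = 1/2, the share of A-headed subtrees contributed by this particular node.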
This is a very attractive algorithm for DOP as the size of the extracted PCFG is far
smaller than the corresponding fragment set and because looking back to the fragment set
is not necessary. However, the inflexibility of this approach – discussed in detail in section
3.1.1 – makes it unsuitable for use in a DOT system on several levels. Importantly, the
advantage of not having to look back to the fragment base has, in the context of translation,
turned into a disadvantage: it is extremely computationally expensive to look back to the
fragment base in situations where that becomes necessary.
As the set of source-language DOT fragments is simply a subset of the corresponding DOP fragment set such that certain treebank tree nodes are not permitted to be external,
Goodman’s PCFG reduction method can also be used to characterise the source-language
fragments. To do this, we simply extract the PCFG rules from the source side of the bilingual fragment set subject to the restriction that rules specifying that an unlinked node is external are not
generated. If, for example, in the node grouping given in example (5.7) only nodes A@j
and B@k were linked and, consequently, node C@l was never external to a fragment, then only rules 3, 4, 7 and 8 from example (5.8) would be extracted.
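The restriction can be sketched as a filter over the eight rule shapes of example (5.8). The table `RULE_SHAPES` and the function `allowed_rule_numbers` are hypothetical names; each shape records whether the parent, left child and right child appear as exterior nodes in that rule.

```python
# Rule number -> (parent_exterior, left_exterior, right_exterior),
# following the ordering of example (5.8).
RULE_SHAPES = {
    1: (False, True, True),
    2: (False, False, True),
    3: (False, True, False),
    4: (False, False, False),
    5: (True, True, True),
    6: (True, False, True),
    7: (True, True, False),
    8: (True, False, False),
}


def allowed_rule_numbers(parent_linked, left_linked, right_linked):
    """Drop every rule in which an unlinked node would appear as exterior."""
    linked = (parent_linked, left_linked, right_linked)
    return [
        number
        for number, exterior in RULE_SHAPES.items()
        if all(is_linked or not is_ext for is_linked, is_ext in zip(linked, exterior))
    ]


# The case from the text: A@j and B@k are linked, C@l is not.
kept = allowed_rule_numbers(parent_linked=True, left_linked=True, right_linked=False)
```

For the case described in the text, `kept` contains exactly rules 3, 4, 7 and 8, the four rules in which C appears as the interior node C@l.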
As well as characterising the subtree structures relevant
to the input string, the PCFG reduction must also characterise the parse space probabilistically. In other words, the rule probabilities must be estimated such that the probability of deriving
each valid fragment is equal to its relative frequency in the DOT fragment base. As it
stands, the rule probabilities given correspond to the frequency distribution of the source-language subtrees in the bilingual fragment base rather than the frequency distribution of
the source and target subtree pairs. We can augment each linked source-language subtree
node with the category of the target-subtree node to which it is linked. For example,
source-language node NP linked to target node PP would be assigned the category label
NP.PP; this ‘category’ would thus be distinct from, for example, source-language node NP linked to target node NP, which would be labelled NP.NP. As DOT fragment probabilities
are conditioned on root node pairs, this transformation allows us to correctly establish
the counts for the number of subtrees headed by each root node pair. (The counts for
subtrees whose root nodes are internal to the source-language fragment are calculated as for DOP.)
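The relabelling and the resulting root-pair counts can be illustrated with a short sketch. The function names and the four example root pairs are hypothetical; the point is only that, once linked nodes carry paired labels, fragments rooted in NP.PP and NP.NP are counted separately.

```python
from collections import Counter
from fractions import Fraction


def paired_label(source_cat, target_cat=None):
    """Return e.g. 'NP.PP' for a linked node, or the bare label if unlinked."""
    return source_cat if target_cat is None else f"{source_cat}.{target_cat}"


# Hypothetical root pairs of four fragment tokens in a DOT fragment base.
fragment_roots = [("NP", "PP"), ("NP", "NP"), ("NP", "NP"), ("VP", "VP")]
root_counts = Counter(paired_label(s, t) for s, t in fragment_roots)


def relative_frequency(fragment_count, root_label):
    """Frequency of a fragment relative to all fragments with the same root pair."""
    return Fraction(fragment_count, root_counts[root_label])
```

A fragment rooted in NP.NP that occurs once is then assigned probability 1/2 rather than the 1/3 it would receive if all NP-rooted source subtrees were pooled.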
However, we see no way of adapting this PCFG reduction so that the target-language
subtrees are also characterised. At best, we could use the PCFG space to rebuild each
source-language subtree and recover its target-language counterpart by matching it against
the training data. However, this involves explicitly recreating every fragment relevant to
the input string which, in turn, requires that we prune the fragment set. As discussed in
section 3.1.1, pruning the fragment set so that the parse space is computable unfortunately
results in a large increase in the size of the PCFG-reduction (if, indeed, it is even possible to compute the corresponding PCFG-reduction) and this algorithm loses its advantage.
Thus, we do not use the PCFG-reduction method of Goodman (1996a, 1998, 2003) in our DOT implementation.
Translating with Sima’an’s two-phase analysis approach
As described in section 2.4.2, Sima’an (1995a, 1999)’s two-phase analysis approach takes
the context-free grammar underlying the fragment set and uses it to approximate the parse
space of the input string. Correspondences between these CFG rules and the fragments in
which they occur then facilitate the transition from this CFG parse space to the required
DOP parse space for the input. The underlying CFG is, however, non-probabilistic; fragment probabilities are estimated by looking back to the full fragment set. This algorithm
can be applied to the computation of the DOT translation space for a given input string
in a very straightforward manner.
Each DOT fragment is associated with a unique identifier. The CFG underlying the
source side of the fragment set is extracted such that each rule in the CFG is associated
with the set of fragment identifiers in which it occurs. The two-phase analysis algorithm
is then applied exactly as for DOP, as described in section 2.4.2. This algorithm generates
a monolingual parse space comprising those source-subtrees which can be used to parse
the input string. However, as we also retain the fragment identifiers of each of these source-subtrees, recovering the translational counterpart of each subtree, as well as the
DOT probability of the fragment as a whole, is trivial. Finally, fragments which, due
to translational constraints, cannot be composed with any other fragments to form valid
analyses are removed from the approximated space, giving us the bilingual parse and
translation space for the input string. As we discuss in section 5.2.4, several different
disambiguation strategies can now be applied to this translation space in order to select
the best translation to output.
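The bookkeeping that makes target retrieval trivial can be sketched as an index from source-side CFG rules to fragment identifiers. The fragment base `FRAGMENTS`, its bracketed target strings and the helper names are hypothetical, and the sketch covers only the rule-to-identifier association, not the two-phase parsing algorithm itself.

```python
from collections import defaultdict

# Hypothetical DOT fragment base: identifier -> source CFG rules and target subtree.
FRAGMENTS = {
    1: {"source_rules": [("S", ("NP", "VP"))], "target": "(S NP VP)"},
    2: {"source_rules": [("S", ("NP", "VP")), ("VP", ("V", "NP"))],
        "target": "(S NP (VP NP V))"},
    3: {"source_rules": [("VP", ("V", "NP"))], "target": "(VP NP V)"},
}


def index_rules(fragments):
    """Associate each source CFG rule with the ids of fragments containing it."""
    index = defaultdict(set)
    for fid, fragment in fragments.items():
        for rule in fragment["source_rules"]:
            index[rule].add(fid)
    return dict(index)


RULE_INDEX = index_rules(FRAGMENTS)


def targets_for_rule(rule):
    """Look up the target subtrees of every fragment in which a rule occurs."""
    return {fid: FRAGMENTS[fid]["target"] for fid in sorted(RULE_INDEX.get(rule, ()))}
```

Because each identifier names a whole bilingual fragment, a source subtree recovered during parsing leads directly to its target counterpart through a dictionary lookup.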
5.2.3 Compact fragment representation
Explicitly creating the DOP fragment base is expensive due to the very large numbers
of fragments that must be extracted, counted, stored and compiled. As the two-phase
algorithm used to compute the parse space for each input string requires only an indication
as to which fragments each underlying CFG rule appears in, it is not necessary to explicitly extract and store the fragment set. Thus, in section 3.1.2 we introduced a dynamic method
of fragment extraction which does not require that the fragment set be stored. The same issues with regard to fragment set extraction arise for DOT. However, the expense of storing and compiling the DOT fragment set is even greater because each
fragment now comprises two subtrees, along with the links between them. Fortunately,
our on-the-fly fragment set extraction can also be applied to bilingual linked treebanks
in a straightforward manner. Explicit fragment characterisation is done over source trees
only and the target subtrees retrieved when converting from the monolingual derivation
space to the bilingual derivation space.
We first apply the DOT root operation to each of the paired treebank representations,
yielding a set of ‘intermediate’ fragments as for DOP but, this time, the size of this set is
linear in the number of linked node pairs in the treebank. The DOT frontier operation is then applied by assigning to each node n in the source side of each intermediate fragment a set of fragment identifiers such that if its left and right child nodes nl and nr are present
in a fragment then the corresponding fragment identifier appears in the node’s identifier
set. Either both nl and nr are present in the fragment or neither is present, in which
case node n is itself either a substitution site or not in the fragment. Thus, the presence of fragment identifier fid at node n in the source subtree signifies that the CFG rule n −→ nl nr occurs in the source side of fragment fid.
When assigning DOT fragment identifiers to each node in each source subtree, we
must again account for the distinction between linked and unlinked nodes. Recall that, for DOP, ECNF nodes of the form X y (which are inserted into the treebank trees during conversion to binary format) never occur as either root or frontier nodes because they
must always be internal to those fragments in which they appear. In fact, unlinked nodes
in DOT fragments can be treated in the same way as these ECNF nodes as they also must
always be internal to those fragments in which they occur.
We partition the set of identifiers at source node n with left and right child nodes nl and nr into four sets representing the four possible combinations of internal and external child
nodes: <nls,nrs>, <nls,nri>, <nli,nrs> and <nli,nri>. However, if node nl is unlinked
then sets <nls,nrs> and <nls,nri> remain empty as this node is never a substitution site.
Similarly, if node nr is unlinked then sets <nls,nrs> and <nli,nrs> are empty, and if both are unlinked then only set <nli,nri> can be non-empty. This is illustrated
(A) Root-generated ‘intermediate’ fragment whose source subtree has been converted to ECNF (through which source node B x has been inserted), with each source node annotated with the number of different fragments yielded through application of the frontier operation.

[Tree diagram: source tree rooted at A paired with target tree rooted at O (nodes P, Q, R, S, T, U, V over terminals p, s, t, u, v). Source side after ECNF conversion, in bracket notation: A(16)[ B(1)[b] B x(8)[ C(4)[ E(1)[e] F(1)[f] ] D(1)[d] ] ]]

(B) Source node annotations:

A(16)    <Bs,B xs>: {}          <Bs,B xi>: {1-8}         <Bi,B xs>: {}             <Bi,B xi>: {9-16}
B x(8)   <Cs,Ds>: {}            <Cs,Di>: {}              <Ci,Ds>: {1-4,9-12}       <Ci,Di>: {5-8,13-16}
C(4)     <Es,Fs>: {1,5,9,13}    <Es,Fi>: {2,6,10,14}     <Ei,Fs>: {3,7,11,15}      <Ei,Fi>: {4,8,12,16}
B(1)     <b>: {9-16}
E(1)     <e>: {3,4,7,8,11,12,15,16}
F(1)     <f>: {2,4,6,8,10,12,14,16}
D(1)     <d>: {5-8,13-16}

Figure 5.8: The ‘intermediate’ fragment in (A) was generated by the root operation. (B) gives the source node annotations representing all possible frontier operations, where the total number of frontier operations possible is 16 and the fragments corresponding to each of these frontier operations have been allocated identifiers from the set of integers 1-16.
by the example in Figure 5.8, where the node annotations in (B) correspond to the DOT
‘intermediate’ tree in (A). Consider, for example, the annotation for node B x. Its left child node, C, is an unlinked node whereas its right child node, D, is a linked node. Thus, the annotation sets specifying node C as a substitution site are empty.
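The partitioned identifier sets of Figure 5.8 (B) can be reproduced with a short sketch. The names below are hypothetical, and the numbering scheme is an assumption inferred from the figure: identifier fid is decoded from the bits of fid - 1, with one bit per linked node recording whether that node is internal. Because A is the root and neither B x nor C can be a substitution site, every binary rule occurs in all 16 fragments.

```python
# Binary rules in the source side of the ECNF intermediate fragment.
CHILDREN = {"A": ("B", "B x"), "B x": ("C", "D"), "C": ("E", "F")}

# Assumed numbering: the bit of (fid - 1) that is set when a node is internal.
# C (unlinked) and B x (ECNF node) have no bit as they are always internal.
INTERNAL_BIT = {"B": 8, "D": 4, "E": 2, "F": 1}


def status(node, fid):
    """Return 'i' if node is internal to fragment fid, 's' if a substitution site."""
    bit = INTERNAL_BIT.get(node)
    if bit is None:
        return "i"
    return "i" if (fid - 1) & bit else "s"


def partition(node):
    """Split identifiers 1..16 into the four <left, right> status sets at node."""
    left, right = CHILDREN[node]
    sets = {("s", "s"): [], ("s", "i"): [], ("i", "s"): [], ("i", "i"): []}
    for fid in range(1, 17):
        sets[(status(left, fid), status(right, fid))].append(fid)
    return sets
```

For node B x, the two sets in which C is a substitution site come out empty, matching the annotation in the figure.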
Extracting these partitioned sets of fragment identifiers along with each source-language
CFG rule extracted gives us the correspondence between the source side of the fragment
set and this CFG. Thus, we can transition from phase 1, in which the source-CFG space
is constructed, to phase 2, thereby generating the monolingual space comprising those
source-subtrees which can be used to analyse the input string. For DOP, we stated that retrieval of any fragment can be accomplished easily by simply checking for its absence or
presence, as an internal node or substitution site, at each node in the intermediate tree. Although this is not strictly necessary for DOP parsing as these fragments are recon-
structed automatically using the annotated CFG during phase 2, it is crucial for DOT as
it allows us to retrieve the target-language subtree corresponding to each source-language
subtree in the derivation space.
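Retrieval by checking each node's annotation sets can be sketched as follows. The `ANNOTATIONS` table transcribes the non-empty sets for the binary nodes in Figure 5.8 (B); the data layout and the function `substitution_sites` are illustrative assumptions. Given a fragment identifier, the function lists the child nodes marked as substitution sites in the source subtree.

```python
# node -> (children, {(left_status, right_status): fragment ids}); empty sets omitted.
ANNOTATIONS = {
    "A": (("B", "B x"), {
        ("s", "i"): set(range(1, 9)),
        ("i", "i"): set(range(9, 17)),
    }),
    "B x": (("C", "D"), {
        ("i", "s"): {1, 2, 3, 4, 9, 10, 11, 12},
        ("i", "i"): {5, 6, 7, 8, 13, 14, 15, 16},
    }),
    "C": (("E", "F"), {
        ("s", "s"): {1, 5, 9, 13},
        ("s", "i"): {2, 6, 10, 14},
        ("i", "s"): {3, 7, 11, 15},
        ("i", "i"): {4, 8, 12, 16},
    }),
}


def substitution_sites(fid):
    """List the source nodes that are open substitution sites in fragment fid."""
    sites = []
    for children, sets in ANNOTATIONS.values():
        for statuses, ids in sets.items():
            if fid in ids:
                sites += [c for c, s in zip(children, statuses) if s == "s"]
    return sites
```

For identifier 11 this yields D and F as the open substitution sites; the nodes linked to them on the target side would then mark where the corresponding target subtree is cut.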
Essentially, the set of nodes identified as open substitution sites in any source subtree
also characterises its linked target-language counterpart. Consider, for example, the situation
where the source subtree of fragment f11 in Figure 5.8 is relevant to the input string,
and so we wish to retrieve the corresponding target subtree. We first look at the retrieval
of the source subtree. The sets corresponding to node A indicate that nodes B and B x
are both internal to fragment f11. Trivially, this also means that terminal symbol b is a
frontier node. The sets corresponding to node B x signify that while node C is internal