5.2 A new implementation of the Tree-DOT model

5.2.2 Translation-space construction

As discussed in section 5.1.3, we do not feel that the translation algorithm proposed

by Poutsma (2000, 2003) – which is based on the ‘fragments as rewrite rules’ technique

proposed for DOP by Bod (1992) and described in section 2.4.1 – will facilitate the experiments

required to fully assess the performance of the DOT model. Thus, in this section we

focus on the adaptation of more efficient DOP parsing algorithms to the DOT translation model. Firstly, we discuss how the fragments in the DOT translation space for an input

string relate to the fragments in the DOP parse space for that same string. In light of

this relationship, we then outline the elements which building the parse and translation

spaces have in common, and give the general intuition as to how the former can be used

in creating the latter. Finally, we discuss in detail the possibility of adapting the DOP

parsing algorithms developed by Goodman (1996a, 1998, 2003) and Sima’an (1995a, 1999)

to accomplish the task of translation space computation.

From parsing to translation: the general model

Conceptually, the source- and target-language halves of each DOT fragment, along with

the translational links between them, form a single unit. It is useful on a practical level,

however, to make explicit the relationships between (i) the two halves of the set of bilingual

DOT fragments which can be extracted from a set of linked training trees and (ii) the two

sets of monolingual fragments which can be extracted from that same set of linked training trees by placing the source and target trees in separate sets, discarding the links and applying

the DOP fragmentation operations. In other words, if one of the languages represented

in the bilingual treebank is language L, what is the relationship between (i) the fragment

set Fb generated by applying the DOT fragmentation operations to the bilingual treebank

and then stripping away the links and corresponding-language parts of each extracted

fragment, leaving only representations for L and (ii) the fragment set Fm generated by

taking the bilingual treebank, stripping away the links and corresponding-language trees

and applying the DOP fragmentation operations to this monolingual treebank? As the

fragmentation operations defined for Tree-DOT can only select linked nodes to be either root or frontier nodes, it follows that non-linked nodes are always internal to the fragments

in which they occur. Thus, set Fb comprises a subset of the fragments in Fm such that all

root nodes and substitution sites of fragments in Fb are linked to target-language nodes

in the bilingual treebank.

In Tree-DOT, the process of building the translation space is driven by the input

string, and the building of target language representations can be viewed as a by-product

of parsing with bilingual fragments. It is possible, therefore, to build a first approximation

of the translation space by simply parsing with the source-language half of the bilingual

fragment base only, i.e. fragment set Fb. Once this has been accomplished, the one or

more target-language subtrees which correspond to each source-language fragment in the

approximated space are retrieved. However, according to the DOT composition operation, the target-language subtrees effectively act as constraints on the source-language

fragments which can combine to form analyses: in order for fragment fx with root node

categories <Rsx,Rtx> to compose with fragment fy with leftmost substitution site categories

<LSSsy,LSSty>, not only must the source root in fx, Rsx, correspond to the source

leftmost substitution site in fy, LSSsy, but the target root Rtx must also correspond to

the substitution site category LSSty. Effectively, this means that source-language fragments

which can combine freely in a monolingual model are now constrained by their

target-language links. Thus, fragments which, due to translational constraints, cannot be

composed with any other fragments to form valid analyses are removed from the approximated space, giving us the bilingual parse and translation space for the input string.
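The composition constraint described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Fragment` class, its field names and the example categories are hypothetical, and a fragment is reduced to the linked category pairs at its root and at its leftmost substitution site.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# A linked node pair, written as (source_category, target_category).
NodePair = Tuple[str, str]


@dataclass(frozen=True)
class Fragment:
    """Hypothetical, minimal view of a DOT fragment: only the linked
    categories that matter for composition are kept."""
    name: str
    root: NodePair                      # <Rs, Rt>
    leftmost_site: Optional[NodePair]   # <LSSs, LSSt>, or None if no open site


def can_compose_monolingual(fy: Fragment, fx: Fragment) -> bool:
    """DOP-style check: only the source categories have to match."""
    return fy.leftmost_site is not None and fy.leftmost_site[0] == fx.root[0]


def can_compose_dot(fy: Fragment, fx: Fragment) -> bool:
    """DOT check: the source AND the target categories have to match."""
    return fy.leftmost_site is not None and fy.leftmost_site == fx.root


# Toy fragments (invented for illustration only).
fy = Fragment("fy", root=("S", "S"), leftmost_site=("NP", "NP"))
fx_np_np = Fragment("fx_np_np", root=("NP", "NP"), leftmost_site=None)
fx_np_pp = Fragment("fx_np_pp", root=("NP", "PP"), leftmost_site=None)
```

Here `fx_np_pp` would be allowed to fill the site of `fy` in a purely monolingual model, but the mismatch between the target categories NP and PP rules the composition out in DOT.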

When the task of building the DOT translation space is viewed from this perspective,

adaptation of the parsing algorithms of Goodman (1996a, 1998, 2003) and Sima’an

(1995a, 1999) to accomplish this task seems worthy of investigation. However, we find

that Sima’an’s two-phase analysis method gives the required flexibility whereas Goodman’s

PCFG-reduction method does not. In the remainder of this section, we detail why

this is the case.

Translating with Goodman’s PCFG-reduction approach

Recall that, as described in section 2.4.3, Goodman’s (1996a, 1998, 2003) algorithm for DOP parsing constructs

a PCFG containing maximally eight rules for each node in the training treebank. Each training-tree node A is assigned a unique address k and, correspondingly, one new non-terminal node A_k is created; such non-terminals are called “interior” nodes and the original

nodes “exterior” nodes. In addition, the number of subtrees a_k with root node A_k is also

calculated.

        A@j
       /   \
    B@k     C@l        (5.7)

For any node grouping such as the one in example (5.7), the eight PCFG rules and their

corresponding probabilities in example (5.8) are then extracted.

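\[
\begin{array}{lll@{\qquad}lll}
(1) & A_j \rightarrow B\,C & (1/a_j) & (5) & A \rightarrow B\,C & (1/a) \\
(2) & A_j \rightarrow B_k\,C & (b_k/a_j) & (6) & A \rightarrow B_k\,C & (b_k/a) \\
(3) & A_j \rightarrow B\,C_l & (c_l/a_j) & (7) & A \rightarrow B\,C_l & (c_l/a) \\
(4) & A_j \rightarrow B_k\,C_l & (b_k c_l/a_j) & (8) & A \rightarrow B_k\,C_l & (b_k c_l/a)
\end{array}
\tag{5.8}
\]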

These rules correspond to the eight possible contexts in which the node grouping in example

(5.7) can occur in fragments extracted from the corresponding treebank tree; each

of the three nodes can be either interior or exterior (i.e. root node or substitution site) to

any fragment in which the grouping occurs. Thus, every relevant DOP fragment can be

constructed using one or more PCFG derivations by converting each internal node to an

external node and, furthermore, the probability of each of these DOP fragments can be calculated by summing over the PCFG derivations yielding that fragment.
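As a rough illustration, the eight rules of example (5.8) can be generated mechanically for one node grouping. The sketch below rests on two assumptions that are not stated in this section but follow Goodman's construction: the subtree count at a binary node is a_j = (b_k + 1)(c_l + 1), and a is the total number of subtrees headed by category A over all addresses. The function name and the toy counts are hypothetical.

```python
from fractions import Fraction


def goodman_rules(parent, left, right, b_k, c_l, a_total):
    """Build the eight PCFG rules for the grouping parent -> left right.

    parent, left and right are (label, address) pairs; b_k and c_l are the
    subtree counts of the two children; a_total is the assumed total number
    of subtrees headed by the parent's label in the whole treebank.
    Returns (a_j, rules), each rule being (lhs, (rhs1, rhs2), probability).
    """
    (A, j), (B, k), (C, l) = parent, left, right
    # Assumption taken from Goodman's construction: a_j = (b_k + 1)(c_l + 1).
    a_j = (b_k + 1) * (c_l + 1)
    A_j, B_k, C_l = f"{A}_{j}", f"{B}_{k}", f"{C}_{l}"   # interior non-terminals

    rules = []
    # Rules (1)-(4) have the interior node A_j on the left, rules (5)-(8) the
    # exterior node A; only the denominator of the probability differs.
    for lhs, denom in ((A_j, a_j), (A, a_total)):
        rules.append((lhs, (B, C), Fraction(1, denom)))
        rules.append((lhs, (B_k, C), Fraction(b_k, denom)))
        rules.append((lhs, (B, C_l), Fraction(c_l, denom)))
        rules.append((lhs, (B_k, C_l), Fraction(b_k * c_l, denom)))
    return a_j, rules


# Toy counts, invented for illustration.
a_j, rules = goodman_rules(("A", "j"), ("B", "k"), ("C", "l"),
                           b_k=2, c_l=3, a_total=20)
```

Under these assumptions the four rules rewriting A_j have probabilities summing to one, since 1 + b_k + c_l + b_k c_l = (b_k + 1)(c_l + 1) = a_j.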

This is a very attractive algorithm for DOP as the size of the extracted PCFG is far

smaller than the corresponding fragment set and because looking back to the fragment set

is not necessary. However, the inflexibility of this approach – discussed in detail in section

3.1.1 – makes it unsuitable for use in a DOT system on several levels. Importantly, the

advantage of not having to look back to the fragment base has, in the context of translation,

turned into a disadvantage: it is extremely computationally expensive to look back to the

fragment base in situations where that becomes necessary.

As the set of source-language DOT fragments is simply a subset of the corresponding DOP fragment set such that certain treebank tree nodes are not permitted to be external,

Goodman’s PCFG reduction method can also be used to characterise the source-language

side of the DOT fragment set. To do this, we simply extract the PCFG rules from the source side of the bilingual fragment set subject to the restriction that rules specifying that an unlinked node is external are not

generated. If, for example, in the node grouping given in example (5.7) only nodes A@j

and B@k were linked and, consequently, node C@l was never external to a fragment, then only rules 3, 4, 7 and 8 from example (5.8) would be extracted.
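The restriction can be checked with a small enumeration. The sketch below is illustrative only: it numbers the rules as in example (5.8), treats a node as external when it appears without its address (A on the left-hand side, B or C on the right), and drops every rule in which an unlinked node would be external. The function name is hypothetical.

```python
def kept_rule_numbers(a_linked, b_linked, c_linked):
    """Return the numbers (1-8, as in example (5.8)) of the rules that
    survive when unlinked nodes may never be external."""
    kept = []
    number = 0
    # Rules 1-4 rewrite the interior node A_j, rules 5-8 the exterior node A.
    for a_interior in (True, False):
        # Within each group the right-hand sides are ordered:
        # B C, B_k C, B C_l, B_k C_l.
        for b_interior, c_interior in ((False, False), (True, False),
                                       (False, True), (True, True)):
            number += 1
            external_unlinked = (
                (not a_interior and not a_linked)
                or (not b_interior and not b_linked)
                or (not c_interior and not c_linked)
            )
            if not external_unlinked:
                kept.append(number)
    return kept


# Only A@j and B@k are linked, so C@l is never external.
print(kept_rule_numbers(a_linked=True, b_linked=True, c_linked=False))
```

With C@l unlinked the function returns rules 3, 4, 7 and 8, as in the text; with all three nodes linked it returns all eight rules.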

As well as characterising the subtree structures relevant

to the input string, the PCFG reduction must also characterise the parse space probabilistically. In other words, the rule probabilities must be estimated such that the probability of deriving

each valid fragment is equal to its relative frequency in the DOT fragment base. As it

stands, the rule probabilities given correspond to the frequency distribution of the source-language subtrees in the bilingual fragment base rather than the frequency distribution of

the source and target subtree pairs. To correct this, we can augment each linked source-language subtree

node with the category of the target-subtree node to which it is linked. For example,

source-language node NP linked to target node PP would be assigned the category label

NP.PP; this ‘category’ would thus be distinct from, for example, source-language node NP linked to target node NP, which would be labelled NP.NP. As DOT fragment probabilities

are conditioned on root node pairs, this transformation allows us to correctly establish

the counts for the number of subtrees headed by each root node pair. (The counts for

subtrees whose root nodes are internal to the source-language fragment are calculated as for DOP.)
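The label augmentation amounts to a simple string transformation followed by counting per augmented root category. The following is a minimal sketch with invented toy data; the function name `augment` and the list `fragment_roots` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def augment(source_label, target_label=None):
    """Attach the linked target category to a source category, e.g. NP.PP.
    Unlinked source nodes keep their plain label."""
    if target_label is None:
        return source_label
    return f"{source_label}.{target_label}"


# Toy data: the (source root, linked target root) pair of four fragments.
fragment_roots = [("NP", "NP"), ("NP", "PP"), ("NP", "NP"), ("VP", "VP")]

# Number of fragments headed by each root node pair.
root_counts = Counter(augment(s, t) for s, t in fragment_roots)

# Relative frequency of a fragment that occurs once with root pair NP.NP.
relative_frequency = Fraction(1, root_counts["NP.NP"])
```

Without the augmentation all three NP-rooted fragments would share one denominator; with it, NP.NP and NP.PP are counted separately, which matches the conditioning on root node pairs.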

However, we see no way of adapting this PCFG reduction so that the target-language

subtrees are also characterised. At best, we could use the PCFG space to rebuild each

source-language subtree and recover its target-language counterpart by matching it against

the training data. However, this involves explicitly recreating every fragment relevant to

the input string which, in turn, requires that we prune the fragment set. As discussed in

section 3.1.1, pruning the fragment set so that the parse space is computable unfortunately

results in a large increase in the size of the PCFG-reduction (if, indeed, it is even possible to compute the corresponding PCFG-reduction) and this algorithm loses its advantage.

Thus, we do not use Goodman’s (1996a, 1998, 2003) PCFG-reduction method in our DOT implementation.

Translating with Sima’an’s two-phase analysis approach

As described in section 2.4.2, Sima’an’s (1995a, 1999) two-phase analysis approach takes

the context-free grammar underlying the fragment set and uses it to approximate the parse

space of the input string. Correspondences between these CFG rules and the fragments in

which they occur then facilitate the transition from this CFG parse space to the required

DOP parse space for the input. The underlying CFG is, however, non-probabilistic; fragment probabilities are estimated by looking back to the full fragment set. This algorithm

can be applied to the computation of the DOT translation space for a given input string

in a very straightforward manner.

Each DOT fragment is associated with a unique identifier. The CFG underlying the

source side of the fragment set is extracted such that each rule in the CFG is associated

with the set of fragment identifiers in which it occurs. The two-phase analysis algorithm

is then applied exactly as for DOP, as described in section 2.4.2. This algorithm generates

a monolingual parse space comprising those source-subtrees which can be used to parse

the input string. However, as we also retain the fragment identifiers of each of these source-subtrees, recovering the translational counterpart of each subtree, as well as the

DOT probability of the fragment as a whole, is trivial. Finally, fragments which, due

to translational constraints, cannot be composed with any other fragments to form valid

analyses are removed from the approximated space, giving us the bilingual parse and

translation space for the input string. As we discuss in section 5.2.4, several different

disambiguation strategies can now be applied to this translation space in order to select

the best translation to output.
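The bookkeeping behind this approach can be illustrated with a toy fragment base. The sketch below covers only the two lookups involved (from CFG rule to fragment identifiers, and from identifier to target subtree and probability); it does not implement the two-phase parsing algorithm itself. All fragments, bracketings and probabilities are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fragment base:
# identifier -> (source CFG rules in the fragment, target subtree, probability)
FRAGMENTS = {
    1: ([("S", ("NP", "VP"))], "(S NP VP)", 0.5),
    2: ([("S", ("NP", "VP")), ("NP", ("john",))], "(S (NP jean) VP)", 0.25),
    3: ([("NP", ("john",))], "(NP jean)", 0.25),
}


def index_rules(fragments):
    """Associate each source-side CFG rule with the identifiers of the
    fragments in which it occurs."""
    index = defaultdict(set)
    for fid, (rules, _target, _prob) in fragments.items():
        for rule in rules:
            index[rule].add(fid)
    return index


def target_side(fragments, fid):
    """Recover the target subtree and the probability of a fragment."""
    _rules, target, prob = fragments[fid]
    return target, prob


rule_index = index_rules(FRAGMENTS)
candidates = rule_index[("S", ("NP", "VP"))]        # fragments using this rule
targets = {fid: target_side(FRAGMENTS, fid) for fid in candidates}
```

Because the identifiers travel with the CFG rules through the monolingual analysis, the target side never has to be searched for: it is a dictionary lookup on the identifier.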

5.2.3 Compact fragment representation

Explicitly creating the DOP fragment base is expensive due to the very large numbers

of fragments that must be extracted, counted, stored and compiled. As the two-phase

algorithm used to compute the parse space for each input string requires only an indication

as to which fragments each underlying CFG rule appears in, it is not necessary to explicitly extract and store the fragment set. Thus, in section 3.1.2 we introduced a dynamic method of on-the-fly fragment set extraction which does not require that the fragment set

be stored. The same issues with regard to fragment set extraction arise for DOT. However, the expense of storing and compiling the DOT fragment set is even greater because each

fragment now comprises two subtrees, along with the links between them. Fortunately,

our on-the-fly fragment set extraction can also be applied to bilingual linked treebanks

in a straightforward manner. Explicit fragment characterisation is done over source trees

only, and the target subtrees are retrieved when converting from the monolingual derivation

space to the bilingual derivation space.

We first apply the DOT root operation to each of the paired treebank representations,

yielding a set of ‘intermediate’ fragments as for DOP but, this time, the size of this set is

linear in the number of linked node pairs in the treebank. The DOT frontier operation is then applied by assigning to each node n in the source side of each intermediate fragment a set of fragment identifiers such that if its left and right child nodes nl and nr are present

in a fragment then the corresponding fragment identifier appears in the node’s identifier

set. Either both nl and nr are present in the fragment or neither is present, in which

case node n is itself either a substitution site or not in the fragment. Thus, the presence of fragment identifier fid at node n in the source subtree signifies that the CFG rule n → nl nr occurs in the source side of fragment fid.

When assigning DOT fragment identifiers to each node in each source subtree, we

must again account for the distinction between linked and unlinked nodes. Recall that, for DOP, ECNF nodes of the form X y (which are inserted into the treebank trees during conversion to binary format) never occur as either root or frontier nodes because they

must always be internal to those fragments in which they appear. In fact, unlinked nodes

in DOT fragments can be treated in the same way as these ECNF nodes as they also must

always be internal to those fragments in which they occur.

We partition the set of identifiers at source node n with left and right child nodes nl and nr into four sets representing the four possible combinations of internal and external child

nodes <nls,nrs>, <nls,nri>, <nli,nrs> and <nli,nri>, where subscript s marks a child as a substitution site and subscript i marks it as internal. However, if node nl is unlinked

then sets <nls,nrs> and <nls,nri> remain empty as this node is never a substitution site.

Similarly, if node nr is unlinked then sets <nls,nrs> and <nli,nrs> are empty, and if both are unlinked then all three of these sets are empty and only <nli,nri> can be non-empty.

(A) Root-generated ‘intermediate’ fragment whose source subtree has been converted to ECNF (through which source node B x has been inserted) and each source node annotated with the number of different fragments yielded through application of the frontier operation:

[Tree diagram. Source tree before conversion: A → B C D, B → b, C → E F, D → d, E → e, F → f. Source tree after ECNF conversion, annotated with fragment counts: A(16) → B(1) B x(8), B x(8) → C(4) D(1), C(4) → E(1) F(1), B(1) → b, D(1) → d, E(1) → e, F(1) → f. The linked target tree is rooted in O and contains nodes P, Q, R, S, T, U, V over terminals p, s, t, u, v.]

(B) Source node annotations representing all possible frontier operations where the total number of frontier operations possible is 16 and the fragments corresponding to each of these frontier operations have been allocated identifiers from the set of integers 1 - 16:

A(16)     <Bs,B xs>: {}           <Bs,B xi>: {1-8}         <Bi,B xs>: {}            <Bi,B xi>: {9-16}

B x(8)    <Cs,Ds>: {}             <Cs,Di>: {}              <Ci,Ds>: {1-4,9-12}      <Ci,Di>: {5-8,13-16}

C(4)      <Es,Fs>: {1,5,9,13}     <Es,Fi>: {2,6,10,14}     <Ei,Fs>: {3,7,11,15}     <Ei,Fi>: {4,8,12,16}

B(1)      <b>: {9-16}

E(1)      <e>: {3,4,7,8,11,12,15,16}

F(1)      <f>: {2,4,6,8,10,12,14,16}

D(1)      <d>: {5-8,13-16}

Figure 5.8: The ‘intermediate’ fragment in (A) was generated by the root operation. (B) gives the source node annotations representing all possible frontier operations where the total number of frontier operations possible is 16 and the fragments corresponding to each of these frontier operations have been allocated identifiers from the set of integers 1 - 16.

This partitioning is illustrated by the example in Figure 5.8, where the node annotations in (B) correspond to the DOT

‘intermediate’ tree in (A). Consider, for example, the annotation for node B x. Its left child node, C, is an unlinked node whereas its right child node, D, is a linked node. Thus, the annotation sets specifying node C as a substitution site are empty.
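The annotations in Figure 5.8 (B) can be reproduced by enumerating the frontier choices. The sketch below is an illustration, not the thesis implementation: it hard-codes the source tree of the figure, assumes that B, D, E and F are the linked nodes (as the figure's non-empty substitution-site sets suggest), and uses an identifier numbering inferred from the figure, in which identifiers run over the states of B, D, E and F with substitution site ordered before internal.

```python
from itertools import product


def annotate_figure_5_8():
    """Return {node: {annotation key: set of fragment identifiers}} for the
    ECNF source tree A -> B Bx, Bx -> C D, C -> E F of Figure 5.8.
    State 's' means substitution site, 'i' means internal."""
    annotations = {}

    def add(node, key, fid):
        annotations.setdefault(node, {}).setdefault(key, set()).add(fid)

    # Each linked node (B, D, E, F) is either a substitution site or internal;
    # the unlinked nodes B x and C are always internal.
    # Assumed numbering: product order over (B, D, E, F), 's' before 'i'.
    for fid, (b, d, e, f) in enumerate(product("si", repeat=4), start=1):
        add("A", (f"B{b}", "B xi"), fid)
        add("B x", ("Ci", f"D{d}"), fid)
        add("C", (f"E{e}", f"F{f}"), fid)
        # A lexical rule is part of the fragment only if its parent is internal.
        for node, state, word in (("B", b, "b"), ("D", d, "d"),
                                  ("E", e, "e"), ("F", f, "f")):
            if state == "i":
                add(node, (word,), fid)
    return annotations


annotations = annotate_figure_5_8()
```

The sets with C as a substitution site never receive an identifier, which corresponds to the empty sets <Cs,Ds> and <Cs,Di> in the figure.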

Extracting these partitioned sets of fragment identifiers along with each source-language

CFG rule extracted gives us the correspondence between the source side of the fragment

set and this CFG. Thus, we can transition from phase 1, in which the source-CFG space

is constructed, to phase 2, thereby generating the monolingual space comprising those

source-subtrees which can be used to analyse the input string. For DOP, we stated that retrieval of any fragment can be accomplished easily by simply checking for its absence or

presence, as an internal node or substitution site, at each node in the intermediate tree. Although this is not strictly necessary for DOP parsing as these fragments are reconstructed

automatically using the annotated CFG during phase 2, it is crucial for DOT as

it allows us to retrieve the target-language subtree corresponding to each source-language

subtree in the derivation space.
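Retrieval by identifier can be sketched directly from the annotation sets of Figure 5.8 (B). The code below is a minimal illustration: the table `BINARY` transcribes the non-empty sets of the figure, while the helper names and the bracketed output format are hypothetical.

```python
def ids(*parts):
    """Expand entries such as (1, 8) into {1, ..., 8}; plain integers are
    added as they are."""
    out = set()
    for part in parts:
        out |= set(range(part[0], part[1] + 1)) if isinstance(part, tuple) else {part}
    return out


# node -> list of (left child, left state, right child, right state, identifiers)
# transcribed from Figure 5.8 (B); 's' = substitution site, 'i' = internal.
BINARY = {
    "A": [("B", "s", "B x", "i", ids((1, 8))),
          ("B", "i", "B x", "i", ids((9, 16)))],
    "B x": [("C", "i", "D", "s", ids((1, 4), (9, 12))),
            ("C", "i", "D", "i", ids((5, 8), (13, 16)))],
    "C": [("E", "s", "F", "s", ids(1, 5, 9, 13)),
          ("E", "s", "F", "i", ids(2, 6, 10, 14)),
          ("E", "i", "F", "s", ids(3, 7, 11, 15)),
          ("E", "i", "F", "i", ids(4, 8, 12, 16))],
}
LEXICAL = {"B": "b", "D": "d", "E": "e", "F": "f"}


def source_subtree(node, fid, sites):
    """Rebuild the bracketed source subtree of fragment fid below node and
    append its open substitution sites to the list `sites`."""
    if node in LEXICAL:                       # internal preterminal
        return f"({node} {LEXICAL[node]})"
    for left, lstate, right, rstate, members in BINARY[node]:
        if fid in members:
            parts = []
            for child, state in ((left, lstate), (right, rstate)):
                if state == "i":
                    parts.append(source_subtree(child, fid, sites))
                else:
                    sites.append(child)
                    parts.append(child)
            return f"({node} {' '.join(parts)})"
    raise ValueError(f"fragment {fid} does not contain node {node}")


sites_11 = []
tree_11 = source_subtree("A", 11, sites_11)
```

For identifier 11 this yields the subtree (A (B b) (B x (C (E e) F) D)) with open substitution sites F and D; it is these sites that would then be followed across the links to delimit the target subtree.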

Essentially, the set of nodes identified as open substitution sites in any source subtree

also characterises its linked target-language counterpart. Consider, for example, the situation

where the source subtree of fragment f11 in Figure 5.8 is relevant to the input string,

and so we wish to retrieve the corresponding target subtree. We first look at the retrieval

of the source subtree. The sets corresponding to node A indicate that nodes B and B x

are both internal to fragment f11. Trivially, this also means that terminal symbol b is a

frontier node. The sets corresponding to node B x signify that while node C is internal
