

5.2 A new implementation of the Tree-DOT model

5.2.1 Pruning the fragment space: link depth

The refinement of the fragmentation process to account for translational links may (and often does) result in a smaller number of DOT fragments per tree pair than would be the case with DOP. Recall that, as given in section 2.3, the number of monolingual DOP fragments $F_{DOP}(A_T)$ projected from non-terminal node $A_T$ in treebank tree $T$ which has children $C_T = \{C_{T_1} \ldots C_{T_n}\}$ is calculated according to equation (5.3).

$$F_{DOP}(A_T) = \prod_{C_{T_x} \in C_T} \bigl(F_{DOP}(C_{T_x}) + 1\bigr) \tag{5.3}$$

Recall also that the total number of fragments $TF_{DOP}(T)$ which can be extracted from treebank tree $T$ is the sum of the numbers of fragments which can be projected from each of its nodes, as stated in equation (5.4).

$$TF_{DOP}(T) = \sum_{A_T \in T} F_{DOP}(A_T) \tag{5.4}$$

Consider, for example, the English tree on the left-hand side of the Tree-DOT representation given in Figure 5.2.⁴ According to equations (5.3) and (5.4), this tree yields 357 monolingual DOP fragments. Furthermore, the French tree on the right-hand side of the Tree-DOT representation given in Figure 5.2 yields 87 fragments according to these equations.

Source: [HEADER [CPint [NPint [Dint how much] [N memory]] [AUXdo does] [S [NP [PRON your] [N PC]] [V have]]] [INT-MARK ?]]

Target: [HEADER [NPpp [N capacité mémoire] [PP [P de] [NPdet [D votre] [N PC]]]] [PERIOD .]]

Figure 5.2: Tree-DOT representation.

⁴ While English and French are considered to be syntactically similar languages, in certain contexts they exhibit strong stylistic divergences. In this particular translation example, the English printer manual section header is phrased as a question, whereas the corresponding French translation of that header is realised as a declarative sentence. We provide further discussion regarding translational divergences
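The two counts quoted above can be checked mechanically. The following is a minimal Python sketch of equations (5.3) and (5.4), not the implementation used in this work. It assumes a hypothetical nested-tuple encoding in which each non-terminal is `(label, child, ...)` and terminals are plain strings; the function names `f_dop` and `tf_dop` are illustrative, the tree structures are a reading of Figure 5.2, and multi-word leaves such as how much are encoded as a single terminal.

```python
# Assumed encoding: (label, child, child, ...); terminals are plain strings.

def f_dop(node):
    """Equation (5.3): number of DOP fragments rooted at a node."""
    if isinstance(node, str):          # terminals project no fragments
        return 0
    count = 1
    for child in node[1:]:
        count *= f_dop(child) + 1      # cut at the child, or expand it
    return count

def tf_dop(node):
    """Equation (5.4): sum of f_dop over every non-terminal node."""
    if isinstance(node, str):
        return 0
    return f_dop(node) + sum(tf_dop(child) for child in node[1:])

# The two trees of Figure 5.2 (structure as read from the figure).
english = ("HEADER",
           ("CPint",
            ("NPint", ("Dint", "how much"), ("N", "memory")),
            ("AUXdo", "does"),
            ("S", ("NP", ("PRON", "your"), ("N", "PC")), ("V", "have"))),
           ("INT-MARK", "?"))

french = ("HEADER",
          ("NPpp",
           ("N", "capacité mémoire"),
           ("PP", ("P", "de"), ("NPdet", ("D", "votre"), ("N", "PC")))),
          ("PERIOD", "."))

print(tf_dop(english), tf_dop(french))  # 357 87
```

Under this encoding the sketch reproduces the 357 English and 87 French fragments stated in the text.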

The definitions of the root and frontier operations given for Tree-DOT in section 4.2.1, which are used to extract DOT fragments from linked tree pairs, distinguish between linked and unlinked nodes. Thus, calculating the number of bilingual fragments extracted from each tree pair also requires that we distinguish between linked and unlinked nodes. Accordingly, the number of bilingual DOT fragments $F_{DOT}(A_T)$ projected from non-terminal node $A_T$ in (source or target) treebank tree $T$ which has children $C_T = \{C_{T_1} \ldots C_{T_n}\}$ is calculated according to equation (5.5).

$$F_{DOT}(A_T) = \prod_{\mathit{linked}(C_{T_x}) \in C_T} \bigl(F_{DOT}(C_{T_x}) + 1\bigr) \prod_{\mathit{unlinked}(C_{T_x}) \in C_T} F_{DOT}(C_{T_x}) \tag{5.5}$$

Note that, for each linked tree pair, applying equation (5.5) to any linked node in the source tree and to the node to which this source node is linked in the target tree yields exactly the same result. For example, applying this formula at the root node of the English tree in Figure 5.2 (labelled HEADER) and the root node of the French tree (labelled HEADER) to which it is linked indicates that 10 subtrees are projected from each. Furthermore, each of the 10 English subtrees corresponds to one of the 10 French subtrees, meaning that a total of 10 bilingual paired subtrees (i.e. DOT fragments) are projected from the node pair <HEADER,HEADER> of this tree pair. The total number of fragments $TF_{DOT}(T)$ which can be extracted from each pair of linked trees $T$ is the sum of the numbers of fragments which can be projected from each of the linked nodes of either the source or the target tree, as stated in equation (5.6).

$$TF_{DOT}(T) = \sum_{\mathit{linked}(A_T) \in T_{s|t}} F_{DOT}(A_T) \tag{5.6}$$

Source: [ROOT [CPint [NPint [Dint how much] N] [AUXdo does] [S NP [V have]]] [INT-MARK ?]]

Target: [LISTITEM [NPpp N [PP [P de] NPdet]] [PERIOD .]]

Figure 5.3: Tree-DOT fragment extracted from the representation in Figure 5.2.

Thus, according to equations (5.5) and (5.6), the number of DOT fragments which can be

extracted from the linked tree pair in Figure 5.2 is 17.
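Equations (5.5) and (5.6) can be sketched in the same way. Because the node links of Figure 5.2 cannot be stated here with certainty, the sketch below instead uses the small tree pair scanning options / options de numérisation that appears later in Figure 5.6, which the text says yields three DOT fragments. The encoding `(label, is_linked, child, ...)` and the function names are illustrative assumptions, as are the links themselves (the two root nodes and the two N nodes dominating options) and the choice to let terminal children contribute a factor of 1.

```python
# Assumed encoding: (label, is_linked, child, ...); terminals are plain strings.

def f_dot(node):
    """Equation (5.5): DOT fragments projected from one non-terminal node."""
    count = 1
    for child in node[2:]:
        if isinstance(child, str):      # assumption: terminals contribute factor 1
            continue
        if child[1]:
            count *= f_dot(child) + 1   # linked child: cut here or expand
        else:
            count *= f_dot(child)       # unlinked child: must be expanded
    return count

def tf_dot(node):
    """Equation (5.6): sum of f_dot over the linked nodes of one side."""
    if isinstance(node, str):
        return 0
    own = f_dot(node) if node[1] else 0
    return own + sum(tf_dot(child) for child in node[2:])

source = ("NPadj", True, ("A", False, "scanning"), ("N", True, "options"))
target = ("NPpp", True,
          ("N", True, "options"),
          ("PP", False, ("P", False, "de"), ("N", False, "numérisation")))

print(tf_dot(source), tf_dot(target))  # 3 3
```

Both sides give the same total, as the text requires, and the total of 3 agrees with the fragments f1, f2 and f3 listed in Figure 5.6(B).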

As we model translational dependencies rather than monolingual dependencies, the

number of DOT fragments extracted from a linked tree pair is generally less than the

number of DOP fragments which can be extracted from each of the source and target

trees comprising that tree pair. We have already illustrated this above: the source tree in

Figure 5.2 yields 357 monolingual fragments and the target tree 87 monolingual fragments

but the bilingual tree pair yields just 17 DOT fragments. Nevertheless, pruning methods

to constrain the size of the fragment base are still necessary. We discussed the relative merits of several pruning methods for Tree-DOP fragments in section 2.3. These methods

involve excluding fragments on the basis of fragment properties such as depth, number of

lexicalised frontiers, number of non-headword frontiers and number of open substitution

sites. The only one of these pruning criteria which can be applied directly to the DOT

fragment base is the restriction on the number of open substitution sites per fragment.

As, in every DOT fragment, each source non-terminal frontier node is linked to exactly one target non-terminal frontier node and vice versa, the source and target subtrees in each fragment always have the same number of open substitution sites. Thus, the number of open substitution sites in a fragment can be calculated as the number of links

Source: [NPadj [A scanning] [N options]]

Target: [NPpp [N options] [PP [P de] [N numérisation]]]

Figure 5.4: Tree-DOT fragment.

fragment property is straightforward. As pointed out in (Way, 2001, p. 187), however, it is not necessarily the case that the source and target subtrees in each fragment have the same number of terminal frontier nodes. For example, although the fragment in Figure 5.3 has exactly 2 open substitution sites in each subtree, the source subtree has 4 terminal frontiers⁵ whereas the target subtree has only 2. Thus, calculating the number of terminal frontiers in a fragment involves deciding whether source or target terminal frontiers should be counted. Way (op. cit.) also observes that if, for example, the number of terminals is counted on the source subtree and the maximum is set to 3, then fragments such as the one given in Figure 5.3 will be excluded. He suggests that manual intervention may be necessary to prevent such fragments from being pruned. Furthermore, we note that pruning the fragment base by placing an upper limit on fragment depth is also problematic for DOT, as fragments such as the one given in Figure 5.4 do not necessarily comprise source and target subtrees of the same depth. Again, using depth restrictions directly as they are used for DOP involves making an arbitrary decision as to whether source or target depth should be calculated. However, we observe that this issue is merely a surface symptom of the fundamental difference between the dependencies modeled by the DOP and DOT fragment sets rather than constituting a problem in itself: DOP models monolingual dependencies whereas DOT models bilingual dependencies.

As discussed in section 2.3, the full set of DOP fragments captures all arbitrary dependencies occurring in a given training treebank. Use of pruning techniques reduces this set such that only a subset of the dependencies present is actually captured. Although this subset may be specified over quantitative rather than linguistic characteristics of the full fragment set, it is nevertheless the case that the choice of dependencies modeled is no longer arbitrary.

⁵ As the first two words of the source string (how much) share the same parent node, they are treated

(A) A phrase-structure tree representing the French string options de numérisation:

[NPpp [N options] [PP [P de] [N numérisation]]]

(B) Organisation of the DOP fragments extracted from (A) according to the number of frontier terminals in each:

lex = 3: [NPpp [N options] [PP [P de] [N numérisation]]]

lex = 2: [NPpp [N options] [PP [P de] N]]; [NPpp [N options] [PP P [N numérisation]]]; [NPpp N [PP [P de] [N numérisation]]]; [PP [P de] [N numérisation]]

lex = 1: [NPpp N [PP P [N numérisation]]]; [NPpp N [PP [P de] N]]; [NPpp [N options] [PP P N]]; [NPpp [N options] PP]; [PP [P de] N]; [PP P [N numérisation]]; [P de]; [N options]; [N numérisation]

lex = 0: [NPpp N [PP P N]]; [NPpp N PP]; [PP P N]

(C) Organisation of the DOP fragments extracted from (A) according to the depth of each:

depth = 3: [NPpp [N options] [PP [P de] [N numérisation]]]; [NPpp [N options] [PP [P de] N]]; [NPpp [N options] [PP P [N numérisation]]]; [NPpp N [PP [P de] [N numérisation]]]; [NPpp N [PP P [N numérisation]]]; [NPpp N [PP [P de] N]]

depth = 2: [NPpp [N options] [PP P N]]; [NPpp [N options] PP]; [PP [P de] N]; [PP P [N numérisation]]; [NPpp N [PP P N]]; [PP [P de] [N numérisation]]

depth = 1: [P de]; [N options]; [N numérisation]; [NPpp N PP]; [PP P N]

Figure 5.5: 17 unique DOP fragments can be extracted from the phrase-structure tree in (A). (B) shows these 17 fragments organised according to the number of terminal frontier nodes in each and (C) shows these same 17 fragments organised according to fragment depth.

Consider, for example, the treebank tree given in Figure 5.5(A). This tree yields the 17 unique DOP fragments given in Figure 5.5(B) and (repeated in) (C). In Figure 5.5(B), these fragments are organised in terms of how many lexicalised frontiers each fragment has, and in (C) they are organised according to fragment depth. Looking firstly at Figure 5.5(B), we see that if we prune the fragment base by excluding all lexicalised fragments, then just 3 fragments remain. These fragments model structural dependencies only. Clearly, in this case, the input string must be tagged before it can be parsed with this DOP grammar. If we relax the restriction to allow fragments with maximally one lexicalised frontier then 12 fragments are included in the fragment base. The fragment base now provides varying amounts of information about the structural contexts in which each terminal can occur. Relaxing the restriction further to incorporate fragments with maximally two lexicalised frontiers allows us to model bilexical dependencies also. However, the set of fragments comprising maximally one lexicalised frontier encodes information about all the terminals in the treebank tree in Figure 5.5(A). Looking at Figure 5.5(C), we see that including only fragments of depth 1 in the fragment base restricts us to capturing local dependencies. However, as is the case when we allow at most 1 terminal per fragment, information about every terminal in the treebank tree is encoded at depth 1. Thus, adding fragments of increasingly greater depths allows us simply to capture more and more probabilistic information about the lexical items already present in the fragment base.
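The groupings in Figure 5.5(B) and (C) can be reproduced by enumerating fragments and recording two properties of each. The following Python sketch is illustrative only: it assumes a hypothetical nested-tuple encoding `(label, child, ...)` with terminals as plain strings, and it tracks each fragment as a pair (number of lexicalised frontiers, depth) rather than building the fragment itself. The function names are not taken from the original implementation.

```python
from collections import Counter
from itertools import product

def fragment_profiles(node):
    """One (lexicalised_frontiers, depth) pair per DOP fragment rooted at node."""
    per_child = []
    for child in node[1:]:
        if isinstance(child, str):
            per_child.append([(1, 0)])   # a terminal child is always kept
        else:
            # cut at the child (open substitution site) or expand it
            per_child.append([(0, 0)] + fragment_profiles(child))
    return [(sum(lex for lex, _ in combo), 1 + max(d for _, d in combo))
            for combo in product(*per_child)]

def all_profiles(node):
    """Profiles of the fragments rooted at every non-terminal node."""
    if isinstance(node, str):
        return []
    profiles = fragment_profiles(node)
    for child in node[1:]:
        profiles += all_profiles(child)
    return profiles

# The tree of Figure 5.5(A).
tree = ("NPpp", ("N", "options"), ("PP", ("P", "de"), ("N", "numérisation")))
profiles = all_profiles(tree)
by_lex = Counter(lex for lex, _ in profiles)
by_depth = Counter(depth for _, depth in profiles)

print(len(profiles))                     # 17
print(sorted(by_lex.items()))            # [(0, 3), (1, 9), (2, 4), (3, 1)]
print(sorted(by_depth.items()))          # [(1, 5), (2, 6), (3, 6)]
```

The tallies agree with the text: 3 unlexicalised fragments, 12 fragments with at most one lexicalised frontier, and 5 fragments of depth 1, among which every terminal of the tree appears.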

Clearly, the effects of applying pruning thresholds to the DOP fragment base are predictable. In particular, we know that (with the exception of restricting to unlexicalised fragments only, which is not generally done in practice) the minimum amount of information encoded about each word in the treebank is its part-of-speech tag. If we apply these same pruning techniques to the DOT fragment space by calculating fragment properties over either the source or target subtrees, however, the effects on the dependencies modeled are not predictable. In particular, if we proceed using this methodology then we can no longer be sure, as we were for DOP, that there is some minimal amount of information encoded about each word in the treebank. This is due to the fact that the DOT fragment

(A) A linked tree pair:

Source: [NPadj [A scanning] [N options]]

Target: [NPpp [N options] [PP [P de] [N numérisation]]]

(B) The set of DOT fragments extracted from the linked tree pair in (A):

f1: [NPadj [A scanning] [N options]] / [NPpp [N options] [PP [P de] [N numérisation]]]

f2: [NPadj [A scanning] N] / [NPpp N [PP [P de] [N numérisation]]]

f3: [N options] / [N options]

Figure 5.6: The linked pair of trees given in (A) yields the set of DOT fragments given in (B).

ones. Consider, for example, Figure 5.6, where the linked pair of trees given in (A) yields the set of DOT fragments given in (B). (Note that the target subtree in (A) is exactly the DOP representation provided in Figure 5.5.) If we restrict the fragment base such that it includes fragments of depth 1 only, then regardless of whether we measure depth over the source or target subtrees, the fragment base will comprise fragment f3 only. Thus, by omitting all other fragments we retain no information about the English word scanning and the French words de and numérisation. Similarly, if we restrict the fragment base such that it includes fragments with maximally 1 target terminal frontier then fragment f3 will again be the only fragment remaining. (While calculating degree of lexicalisation over the source subtrees will, in this particular instance, result in the retention of fragment f2 also, this is by no means predictable.)

In section 4.2.1 we saw that the DOT fragmentation operations work over linked nodes only. Correspondingly, in order to calculate the number of fragments yielded by each DOT

representation, we differentiate between linked and unlinked nodes (equations (5.5) and

(5.6)) as linked nodes are productive whereas unlinked nodes are not. Accordingly, here

we conclude that direct application of DOP pruning methods to the DOT fragment base

Source: [VPv [V setting] [NPzero [N printer] [N options]]]

Target: [NP [N configuration] [PP [P de] [NPdet [D les] [NPpp [N options] [PP [P de] [N impression]]]]]]

Figure 5.7: source depth = 3, target depth = 6, link depth = 2

Consequently, we replace the notion of fragment depth (the greatest number of steps taken to get from the root node to any frontier node) with the notion of link depth for fragments comprising linked subtree pairs (Hearne and Way, 2003). The link depth of a fragment is the greatest number of steps departing from a linked node on any path from the root node to a frontier node. This yields the same result whether calculated over the source or target side of the fragment. For example, for the pair of subtrees given in Figure 5.7, the depth of the source language subtree (on the left) is 3 whereas the depth of the target language subtree is 6. If, however, we simply calculate the depth of the fragment as a whole using the concept of link depth, we arrive at a fragment depth of 2.

Consider again the fragment set in Figure 5.6(B). According to the definition of link depth, both fragments f2 and f3 are of depth 1, meaning that the minimal fragment set

comprises these two fragments only. Clearly, these two fragments encode the minimum

amount of information about each word in the treebank as each word is contained in one

of these fragments. Fragment f1 is of link depth 2 and adds (only) further structural and

contextual information about the words already contained in the fragment base. Thus,

not only does link depth characterise each bilingual fragment as a whole, but pruning the

DOT fragment base according to link depth changes the dependencies occurring in the

fragment base in a predictable way. Henceforth, this is the method we use to calculate DOT fragment depth.
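The definition of link depth lends itself to a short recursive sketch. The Python below is an illustration under stated assumptions rather than the implementation used in this work: nodes are encoded as `(label, is_linked, child, ...)`, terminals are plain strings, an open substitution site is a node with no children, and the links in the fragments of Figure 5.6(B) are taken to hold between the root nodes and between the N nodes dominating options. The names `link_depth` and `depth` are hypothetical.

```python
# Assumed encoding: (label, is_linked, child, ...); terminals are plain strings;
# an open substitution site is a childless node such as ("N", True).

def is_frontier(node):
    return isinstance(node, str) or len(node) == 2

def depth(node):
    """Ordinary depth: greatest number of steps from the root to any frontier node."""
    if is_frontier(node):
        return 0
    return 1 + max(depth(child) for child in node[2:])

def link_depth(node):
    """Link depth: as depth, but only steps departing from a linked node count."""
    if is_frontier(node):
        return 0
    step = 1 if node[1] else 0
    return step + max(link_depth(child) for child in node[2:])

pp = ("PP", False, ("P", False, "de"), ("N", False, "numérisation"))
fragments = {
    "f1": (("NPadj", True, ("A", False, "scanning"), ("N", True, "options")),
           ("NPpp", True, ("N", True, "options"), pp)),
    "f2": (("NPadj", True, ("A", False, "scanning"), ("N", True)),
           ("NPpp", True, ("N", True), pp)),
    "f3": (("N", True, "options"), ("N", True, "options")),
}

for name, (src, tgt) in fragments.items():
    print(name, depth(src), depth(tgt), link_depth(src), link_depth(tgt))
# f1 2 3 2 2
# f2 2 3 1 1
# f3 1 1 1 1
```

For each fragment the source and target link depths coincide even where the ordinary depths differ, and a link depth threshold of 1 retains f2 and f3, which together contain every word of the tree pair.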