Representational Aims - Translation Structures

4.4 Translation Structures

4.4.3 Representational Aims

To study HATs, we need to be able to generate HATs for every sentence, and store them in a suitable fashion. Generating and storing all trees separately would be both time and space consuming, and would impede a flexible search through them. Hence, a suitable representation of the set of HATs is required. This representation should be easy to search, preferably memory efficient, flexible to abstract information from and easy to combine and compare with other representations of sets of HATs.

(0-6]→(0-1] (1-6] (4-6]→(4-5] (5-6] (0-1]→My (0-6]→(0-2] (2-6] (0-4]→(0-1] (1-4] (1-2]→dog (0-6]→(0-4] (4-6] (0-4]→(0-2] (2-4] (2-3]→also (1-6]→(1-4] (4-6] (1-4]→(1-2] (2-4] (4-5]→eating (1-6]→(1-2] (2-6] (0-2]→(0-1] (1-2] (5-6]→sausages (2-6]→(2-4] (4-6] (2-4]→(2-3] likes

Figure 4.4: The grammar generating the translation tree forest for the sentence ‘My dog also likes eating sausages’, with set-permutation

h0{0}1, 1{1}2, 2{3}3, 3{2,4}4, 4{6}5, 5{5}6i would be (the subscripts indicating the span annotation, which is left exclusive and right inclusive)

The original HAT paper also presented an implementation for generating and storing HATs. The HAT-forests of sentences were represented as a compact chart. Unfortunately, this implementation could not be used for the purpose of this paper. The chart representation was not easily accessible as an object, and modifying the program that generated it to obtain the desired properties did not seem within reach. Furthermore, unaligned words were completely ignored. In some cases, this might be a reasonable simplification, but for empirical analysis this is not desirable.

For this thesis, we aimed to find a new representation of a set of HATs that has both the efficient properties of the current implementation, but is more flexible and accessible, and easy to use for other purposes than precisely the one of this paper. In this subsection, we will explain how we will represent and generate HATs. Documentation of the complete implementation can be found in Appendix??. To make the representation easily accessable for people unfamiliar with the code, we have implemented several demo’s to illustrate the use and possibilities of the HATs.

4.4.4 Representation

We have represented a set of HATs implicitly by a CFG that describes for every (translation admissible) span of the sentence how it is allowed to expand. By default, the nodes are labelled with the span they dominate, but labels for spans can be provided. Figure 4.4 provides an example.

HATS represented as context free grammars are easy to wield. As the labels are unique, the HAT CFG can be easily unpacked into all HATs for a sentence (which we will never do), but can also recursively be investigated. E.g., the number of HATs can be determined efficiently in a top-down fashion, and parsing with the HATgrammar can provide information about several of its properties.

4.4.5 Generation

To generate the HAT forest for an alignment, it suffices to find the minimal segmentations for all contiguous translation admissible subsequences of the alignment. We generated the set of contiguous translation admissible subsequences (i.e., the source sides of all phrase pairs) with the shift-reduce algorithm presented in Zhang et al. (2008). Some small adaptations were necessary to generate the full set of phrase pairs, rather than only the ‘tight’ subset, in which unaligned words at the boundaries were not included. To find the minimal segmentations of all phrases we used Dijkstra’s (1959) shortest path algorithm (Algorithm 1), that we adapted to search forall shortest paths rather than just one, and to

incorporate the information that, given the direction of the edges, only searching in nodes with a larger value could actually lead to a shortest path, which made the search more efficient. Details are provided in Algorithm 1. For memory efficiency, all paths were stored in a linked list with path points that were shared among different paths.

Algorithm 1Shortest Paths

Input: A graphG= (V, E)describing an alignment and two verticesi andj for which(i, j)_∈E is true.

Output: All non-trivial shortest paths fromitoj

#Initialization

visited =∅, depth = 0, paths =_{j}

∀n∈N:reachable(n) =∅; reachable(0) ={j} depth_finished = False

# Start backwards search through graph whilenot depth_finished ori /_∈visiteddo

whilereachable(depth)6=_∅do

depth_finished←False

current_node←N an arbitrary elementv from reachable(depth) reachable(depth)←reachable(depth)− {current_node}

for (l,current_node)∈E do

if l /_∈visited_∪reachable(depth)anddepth₆= 0 then

reachable(depth+1)←reachable(depth+1)∪ {l_}

forpath (current_node,. . . ,j)_∈pathsdo

path←(l,current_node,. . . ,j) end for end if visited←visited∪ {l} end for depth_finished←True depth ←depth+1 end while end while Returnpaths

4.5 Dependency parses

As we have elaborately discussed dependency parses earlier in Section 3.6, we will only briefly revisit them in this section, and discuss how we obtain them.

Several assumptions can be made about dependency parses. In this thesis it is assumed, following common practice, that a dependency structure satisfies the following two conditions:

1. When seen as a relation,D constitutes a single-headed acyclic connected graph in which the words insare the nodes. (tree-constraint)

2. When the words are placed in the original order, the branches of the dependency tree do not cross. (projectivity)

4.5.1 Representation

We will represent a dependency parse of a sentence s=w0· · ·wn as a set of relationsD=_{(i, j)_|there is a dependency arrow from wordwi to wordwj_}.

Formally, dependency grammars are not interpreted as compositional grammars, as they do not postulate the existence of non-terminal syntactic categories. That is, dependency parses describe predicate-argument relations betweenwords

and not between larger parts, and do thus not explicitly specify the recursive structure of a sentence. However, dependency graphs do give rise to a hierarchical structure, that specifies from which smaller parts the sentence was composed. For instance, the dependency graph depicted in Figure 4.5 tells us that ‘likes’ is the head word of the sentence, and that the sentence is composed of 4 parts: the head ‘likes’, its modifier ‘also’, its noun subject whose head is ‘dog’ and the open clausal complement whose head is ‘eating’. The complement and subject are further divisible in ‘My’ and ‘dog’, and ‘eating’ and ‘sausage’, respectively. As the tree is projective, all parts are continuous. Also the graph in Figure 3.11 we saw earlier prescribes an hierarchical structure: it is composed of the subject ‘I’, the headword ‘know’, and the phrase headed by ‘liet’, that is in its turn built up from its head ‘liet’, ‘dat’, ‘hij’ and the discontinous phrase ‘me winnen’. Such an hierarchical structure cannot be captured by a phrase structure grammar.

My dog also likes eating sausage

poss

nsubj

xvmod xcomp dobj

4.5.2 Generation

To assign dependency structures to sentences, we used the Stanford Dependency Parser, that can be downloaded from their website, as well as used online (De Marneffe et al., 2006). The parser provides 5 variants of a typed dependency representation, of which the most basic one corresponds to the earlier imposed conditions on dependency structures (De Marneffe and Manning, 2008). A picture of a dependency parse that was generated by the parser is depicted in Figure 4.5.

Unfortunately, the Stanford dependency parser used does not cover all tokens of the input sentences, as dependency relations between words and punctuation are not present in the Stanford Dependency Representation. The resulting dependency trees do thus not always cover the entire sentence.

4.6 Summary

In this chapter we have presented the basis for the original research conducted for this paper. At the start of this chapter we have stated the aims of this research, and provided a short summary of the background we presented in earlier chapters. We formulated three questions

1. Are dependency structures universal for languages?

2. What are the reasons that dependency parses are not entirely preserved during translation?

3. Can dependency grammars be used to construct a bilingual compositional grammar?

We have listed the main assumptions that we made in our empirical investigation of the three questions presented, and we have revisited the two main ingredients of this investigation: HATs and dependency parses.

Chapter 5

A Bilingual Perspective on

Dependency Parses

After having discussed the main ingredients for our research, as well as its foun- dations, we will now present our findings on the consistency between dependency parses and HATs and the experiments that led to these findings. This chapter consists of 5 sections. Firstly, in Section 5.1, we will give a description of the data that were used for all experiments. In the subsequent sections 5.2 and 5.3 we will report and analyse the consistency of dependency parses and HATs according to two different consistency metrics. In Section 5.4, we will further analyse the consistency between HATs and dependency parses, by performing a manual analysis on a part of the data. As usual, we end the chapter with a short summary.

5.1 Data

We had 4 large automatically aligned parallel corpora available (see 5.1), for the language pairs English-Dutch, English-French, English-German and English- Chinese. The first thee corpora were data from the European Parliament taken from the Europarl corpus (Koehn, 2005), while the English-Chinese data came from the Hong Kong Parallel Corpus. The word-alignments were induced using GIZA++ (Och and Ney, 2003), using the ‘grow-diag-and-final’-heuristic mentioned in Section 3.3, with 4 iterations on model 1, and 3 iterations with the hmm model, model 3 and model 4. The corpora were tokenised and lowercased before GIZA++ was run. In general, the guidelines for building a baseline for the WMT workshops were followed.1

In addition to the automatically aligned corpora, we had access to 5 manually aligned corpora (see 5.2). These corpora were much smaller than the automatically aligned corpora, and covered the language pairs English-French (Graca

Language pair Source Size Alignments

Eng - Du. European Parliament 945167 GIZA++

Eng - Ge. European Parliament 995909 GIZA++

Eng - Ch. Hong Kong Parallel Corpus 1723487 GIZA++

Eng - Fr. European Parliament 949408 GIZA++

Table 5.1: The automatically aligned datasets used for the experiments in this thesis.

Language pair Source Size Alignments

Europarl 987 Padó and Lapata (2006)

Eng - Ch. Hong Kong Parallel Corpus 1723487 GIZA++

Eng - Fr. Hansard 447 Och and Ney (2000)

Europarl 100 Graca et al. (2008)

Eng. - Sp. Europarl 100 Graca et al. (2008)

Eng. - Port. Europarl 100 Graca et al. (2008)

Table 5.2: The manually aligned the datasets used for the experiments in this thesis.

et al., 2008; Och and Ney, 2000), English-Spanish, English-Portuguese (Graca et al., 2008) and English-German (Padó and Lapata, 2006).

5.2 Direct Consistency

Quantifying the consistency between a dependency tree and an alignment is not a trivial task. When the parts described by a dependency tree are all parts according to the alignment, there is no dubiety: such an alignment should get an optimal score. If this is not the case it is less clear, as there are multiple things that should be taken into account. Not only should the measure of consistency be based on whether the head and its dependent are both phrases according to the alignment, they should also be combined in a reasonable fashion.

To give an abstract example, consider the following dependency parse and alignment trees:

A B C D _A _B C D A B C D

A B C D

All the parts that exist in the dependency tree also exist in all alignment trees. The third alignment tree is obviously most similar to the dependency parse, as it prescribes the same compositional structure and uses the same number of rules. However, the first and second alignment tree are indistinguishable if only the number of correct (and possibly incorrect) nodes is considered, as they both contain two correct and one incorrect node. However, the second alignment tree

seems more in line with the dependency parse than the first one, because it does not only prescribe that A should be combined with B into a new part, but also that C and D are combined with one rule.

5.2.1 Experiment 1

The analysis of the example provides an insight: as it seems, alignment trees and dependency parses are intuitively compatible if the relations in the dependency parse are respected by the alignment tree, which gives rise to a straightforward definition of consistency between dependency relations and HATs:

Definition 11 (Direct Consistency).

Let s= w1 w2 · · · wn be a sentence, and D ={(i, j)| there is a dependency

arrow from wordwi to word wj}a set of dependencies describing a dependency tree for s. Let [k, l] denote the subsequence of s from word k to word l. Let span(j) =[m, n], iff mand nare the minimum and maximum position that can be reached from wj by following the directed dependency arrows, respectively. A dependency relation(i, j)is said to be respected by an alignment treeT over sif and only if there is a node inT of which both [i, i]and span(j) are children.

In other words, a dependency relation (i, j) is consistent with a HAT if

the head wordiand the phrase headed by thej are siblings in the HAT. The consistencyC(D, H)of a set of dependency relationsD forming a dependency parse with a HATH can now be expressed as follows:

C(D, H) =|D 0_∩_SH_|

|D0_|

WhereD0 ₌_{₍_{i, span}₍_j₎_|₍_{i, j}₎_∈_D_}_{, the set of span relations prescribed by}_D,

andSH=_{(i, j)_|spaniandj are siblings inH}.

The score of a HAT can be computed recursively, as is expressed in the following definition:

Definition 12 (Score of an Alignment Tree).

s= w1w2. . . wn be a sentence, and D = {(i, j)| there is a dependency arrow

from wordwi to wordwj_}a set of dependencies describing a dependency tree for

s. LetD0 ₌_{₍_{i, span}₍_j₎_|₍_{i, j}₎_∈_D_}_{, where}_span₍_j₎ _{is as defined before. Let}_H be an alignment tree of s. The (unnormalised) score ofH withD is now defined as the score of its highest nodeN:

E(Na, D) = X c∈C_Na E(c, D) + X c1∈CNa X c2∈CNa B(c1, c2)

With base case E(N, D) = 0,B(c1, c2) = 1iff (c1, c2)∈D0, and CN the set of

child nodes of N.

Under this definition of consistency, the first alignment tree in the previous example would receive a score of 1/3, the second alignment tree would receive a score of 2/3, and the third alignment tree a score of 3/3. Which seems to correspond with their adequacy of describing the dependency structure.

Note that the score we assign to alignment trees (the number of retrieved dependency relations divided by the number of dependency relations) corresponds to therecall of the dependency relations in the tree. Typically, recall is used in

combination with precision, as both of them can often easily be fooled individually, but cheating the one will result in a low score for the other.

For HATs and dependency relations, it is not immediately clear how the precision measure should be constructed, as HATs are in some aspects under- specified with respect to dependency parses. It is not immediately clear which relations the HAT guesses should be counted as wrong, because a HAT guesses many relations than cannot coexist in one dependency parse (e.g., if it can make (x,y) true, it can also make (y,x) true), and if all these relations were to be considered, no HAT would ever receive the optimal score for the precision measure. It seems therefore more sensible to consider the maximal number of relations a HAT can make true all at once. However, any dependency parse of the sentence the leafnodes of the HAT dominate will have the same number of dependency relations, and the number of relations a HAT can maximally make true all at once will thus be equal to|D|, rendering the measure for precision and recall identical.2

5.2.2 Results

We have measured the consistency of the first 10.000 alignments of our automatically aligned datasets, and all the manually aligned sentence pairs. We have found the best scoring HATs by assigning weights to the grammar rules of the HATs, that were associated with the number of dependency relations a grammar rule made true, and parsing the HAT with this weighted grammar. We reported on the scores for sentences shorter than 10, 20 and 40 words.

Consistency Score Language Pair |s| < 10 |s| < 20 |s| < 40 English-Dutch 0.47 0.42 0.40 English-French 0.46 0.42 041 English-German 0.44 0.41 0.38 English-Chinese 0.59 0.48 0.42

Table 5.3: The empirically determined consistency scores of all available automatically aligned corpora according to consistency definition 11.

2_{Another way to positively bias the recall measure could be through manipulating the}

branching factors of the nodes in the HATs, as are of influence to thenumber of different groups of relations a HAT can make true all at once. Fortunately, this is not an issue when scoring HATs, as their maximum recursivity does not allow much flexibility in the branching factors of their wnodes.

Consistency Score Language Pair |s| < 10 |s| < 20 |s| < 40 English-French (Hansard) 0.63 0.54 0.51 English-French (LREC) 0.49 0.47 0.47 English-German (Pado) 0.43 0.42 0.41 English-Portuguese (LREC) 0.47 0.45 0.45

English- Spanish (LREC) 0.51 0.48 0.48

Table 5.4: The empirically determined consistency scores of all available manually aligned corpora according to consistency definition 11

5.2.3 Analysis

The consistency scores, which are depicted in Table 5.3 and 5.4, are very low. On average not even half of the dependency relations of the English sentence are respected by any HAT. The dependency relations of shorter sentences are generally somewhat better respected than the dependency relations of longer sentences. The difference between the scores of the automatically aligned datasets and the manually aligned datasets is lower than we expected, which could indicate the influence of mistakes in the automatically aligned datasets is relatively small. However, the large difference between the two manually aligned French datasets indicates that is more likely due to the fact that the manually aligned datasets are too small to get a significant result.

Without further elaboration, we have previously mentioned that the maximally recursive HATs and the linguistically motivated dependency structures possibly exploit compositionality in a different fashion. We suspect that this issue causes the scores to be lower under the current consistency definition, as we will illustrate with two examples.

Firstly, consider the sentence ‘I give you flowers’, and its (word-for-word) translation ‘Ik geef jou bloemen’, with dependency parse:

I give you flowers

subj iobj dobj

I give you flowers

The tree depicted next to the dependency parse is the only tree that respects all dependency relations. The sentence is very short and the dependency parse completely flat, as the sentence consists of a predicate and three single word arguments. In a tree that respects all relations according to Definition 11, ‘I’, ‘give’, ‘you’, and ‘flowers’ are all siblings, which means the tree depicted next to the dependency parse is the only tree obtaining a maximal score. However, as the sentence is a word for word translation, all subsequences are translation admissible, and all HATs will be completely binary. Even though the translation

seems intuitively compositional, none of the HATs will receive a score higher than 1/4, because the dependency parse is not maximally branching.

A similar situation arises when two arguments are translated into one, which happens, e.g., when arguments are translated as pre- or suffixes, when verbs

In document MoL 2013 29: An Empirical Account of Compositionality of Translation through Translation Data (Page 56-79)