• No results found

Ranking algorithms: efficiency vs accuracy

6.5 Discussion

6.5.3 Ranking algorithms: efficiency vs accuracy

As discussed in section 3.2.5, algorithms such as the Monte Carlo method which approxi-

mate an NP-hard search problem are generally not adopted if a deterministic alternative

can be found which does not introduce an unacceptable degradation in performance. We

have presented experiments where four ranking strategies are compared; two of these

strategies – MPT and MPP – use random sampling to approximate the search space

whereas the other two – MPD and SDer – use the Viterbi algorithm. In this section, we

look at how these ranking strategies compare in terms of efficiency.

Table 6.11 gives the average number of seconds required to translate each sentence at

each depth and using each of the four ranking strategies – times for English to French

translation are given on the left and times for French to English translation are given on

the right. The times given represent full processing time for each sentence, i.e. the time

taken to apply the two-phase analysis algorithm in order to determine the translation

space and the time taken to select the best output translation.9 Table 6.12 gives, for MPT

and MPP ranking, the average number of samples taken when disambiguating at each

fragment depth – again the left table refers to English as source language and the right to

French as source language.

Focussing firstly on the average times over all translations given in Table 6.11, we see

that – not surprisingly – the time taken to translate each sentence increases as fragment

link depth increases, with a large increase from link depth 3 to link depth 4. The extra time 9We do not give separate disambiguation times as parsing and disambiguation are inextricably linked

ALL ENGLISH TRANSLATIONS ALL FRENCH TRANSLATIONS samples/sentence MPT MPP 1 206.3 273.1 2 166.9 260.2 3 134.3 214.2 4 106.7 197.9 samples/sentence MPT MPP 1 56.1 160.9 2 39.2 95.7 3 34.6 91.6 4 30.3 93.1

Table 6.12: Average number of samples taken when selecting an output translation for MPT and MPP ranking, for all link depths and both translation directions.

taken for each sentence at greater depths is spent building the translation space (which

contains increasing numbers of fragments) rather than ranking the output translations. In

fact, we see from Table 6.12 that fewer samples are taken per sentence for both MPT and

MPP ranking as fragment depth increases; as seen for parsing in section 3.2.5, as more

contextual information is introduced it becomes easier to determine which translation (for MPT) and representation (for MPP) is most probable.

Again looking at Table 6.11 but this time comparing ranking algorithms, we observe

that for English to French translation at each depth, MPT ranking takes longest, followed

by MPP ranking and MPD, and SDer rankings are fastest but the difference between the

fastest and slowest at link depth 4 is just 1.7 seconds. The opposite, however, holds for

French to English translation: MPP ranking is slowest, followed by MPD and SDer, and

MPT ranking is consistently fastest. (Again, the difference in time taken between fastest

and slowest at link depth 4 is small.) This allows us to conclude that not only do the

ranking methods which require random sampling not take significantly longer to process each sentence than our ranking strategies based on the Viterbi algorithm, but in some

instances they actually arrive at a solution more quickly. In order to investigate why this

is the case, we look more closely at the differences in times taken by the MPT and MPP

ranking methods.

For English to French translation, calculating the MPT takes longer than calculating

the MPP at each depth whereas, for French to English translation, the opposite holds.

There are two issues at play here. Firstly, we see from Table 6.12 that more samples

per sentence are required to determine the MPP than the MPT. It is sometimes the case

that although more than one pair of analyses can be generated for a given input string, all of those analyses yield the same target translation. In this situation, sampling is

required to determine the most probable analysis but not to determine the most probable translation. Clearly, this has an effect on the average numbers of samples taken for each

strategy and, consequently, the average sentence processing times. However, while this

explains why, for French to English translation, searching for the MPT is faster, it does

not explain why the opposite holds for English to French translation. Recall that one

of the factors upon which the decision to stop sampling is based is the total number of

items that are being ranked, i.e. when ranking according to translation probability, we

consider the total number of translations possible, and when ranking according to (paired)

analysis probability we consider the total number of analyses possible. However, while the

total number of analyses possible can be computed efficiently during construction of the translation space, computing the total number of translations possible is more costly as

the combination of terminal symbols and frontier substitution sites yielded by each target

language fragment in the translation space must be considered. Whether or not the time

invested in determining exactly how many translations there are for each string is time

well spent depends on whether ranking according to translation probability gives higher

quality output than ranking according to analysis probability. Looking back to the results

for complete translations only given in Tables 6.3 and 6.7, we see that MPT ranking

significantly outperforms MPP ranking for both translation directions at all depths and

for all metrics. Thus, we conclude that, if choosing between MPT and MPP ranking, determining the number of possible translations for each string istime well spent.

In sections 6.3 and 6.4 we looked at the accuracy achieved when complete transla-

tions (i.e. translations which, according to the training data, are fully grammatical) are

evaluated separately from partial translations (i.e. translations which, according to the

training data, are incomplete and/or ungrammatical), and found that not distinguishing

between complete and partial translations distorts the evidence as to which system con-

figuration gives higher translation accuracy. We now turn our attention to the differences

in efficiency over complete and partial translations. Table 6.13 gives the average seconds per sentence where full translations were generated and the average seconds per sentence

where only partial translations were generated for English as source language (in the left

FULL ENGLISH TRANSLATIONS FULL FRENCH TRANSLATIONS CPU seconds/sentence MPT MPP MPD SDer 1 0.73 1.09 0.19 0.18 2 1.63 1.10 0.38 0.41 3 2.44 1.81 0.99 1.00 4 9.10 11.5 7.59 7.51 CPU seconds/sentence MPT MPP MPD SDer 1 0.44 3.76 1.49 1.45 2 0.83 3.94 1.78 1.81 3 1.83 4.97 2.69 2.67 4 15.5 21.5 16.2 16.0

PARTIAL ENGLISH TRANSLATIONS PARTIAL FRENCH TRANSLATIONS CPU seconds/sentence MPT MPP MPD SDer 1 2.65 3.26 0.47 0.52 2 2.96 5.15 0.97 0.94 3 4.32 6.02 2.27 2.25 4 20.5 15.4 19.0 18.7 CPU seconds/sentence MPT MPP MPD SDer 1 1.38 3.36 6.88 7.03 2 2.02 2.64 8.18 8.31 3 3.60 4.83 9.75 9.86 4 27.8 21.2 34.2 33.8

Table 6.13: Average time taken to translate each sentence for all link depths and ranking strategies, and both translation direc- tions, where we distinguish between complete and partial translations.

FULL ENGLISH TRANSLATIONS FULL FRENCH TRANSLATIONS samples/sentence MPT MPP 1 67.7 226.9 2 44.3 170.2 3 25.3 113.2 4 21.5 106.3 samples/sentence MPT MPP 1 28.2 138.0 2 18.4 82.9 3 13.8 67.5 4 13.4 69.4

PARTIAL ENGLISH TRANSLATIONS PARTIAL FRENCH TRANSLATIONS samples/sentence MPT MPP 1 471.0 637.4 2 424.1 970.7 3 362.9 1011.0 4 285.3 920.6 samples/sentence MPT MPP 1 120.5 442.9 2 94.8 254.1 3 90.0 388.1 4 75.4 385.6

Table 6.14: Average number of samples taken when selecting an output translation for MPT and MPP ranking, for all link depths and both translation directions, where we distinguish between complete and partial translations.

samples per sentence (for MPT and MPP ranking) where full translations were generated

and, separately, where partial translations were generated; again, results for English as

source language are given in the left column and for French as source language in the right

column.

Table 6.13 indicates clearly that sentences for which only partial translations can be

generated take significantly longer to process at all depths than those for which full trans-

lations can be generated for both translation directions; the exception to this is MPP ranking for French to English translation where we see that times are similar for both.

plete translations are slightly higher than MPD and SDer ranking for English as source language and slightly lower for French as source language. Although the times taken are

significantly longer, similar trends are observed for the processing of sentences for which

only partial translations are generated. We see from Table 6.14 that while, as before,

sample sizes decrease as fragment depth increases, significantly more samples are taken

when generating partial translations; this is in accordance with the increased time taken

to process these translations.

Given that the time taken to generate partial translations is longer than to generate

complete translations, that the quality of those translations is significantly poorer and

that they are, in any case, assigned zero probability by the DOT model, whether or not we should attempt to generate such translations at all is debatable. It is important to

point out, however, that much of the time required to process these sentences is spent in

determiningthat they cannot be translated grammatically by the model, so not outputting such translations may not save us a great deal of time. Furthermore, we note that while

partial translations are not of as high quality as complete translations, they still receive

reasonable evaluation scores. Thus, we conclude that generating translations which are

ill-formed with respect to the model and training data is worthwhile.

We looked in detail in sections 6.3 and 6.4 at the accuracy of each system configuration

and concluded that, overall, the highest quality translations were produced by searching for the shortest derivation and using all fragments of link depth 4 or less. Having also

considered the efficiency of each configuration and observed that for the configurations

which give the best accuracy (MPT and SDer at link depth 4), there is little difference

in efficiency – MPT takes, on average, 1.7 seconds per sentence longer than SDer when

translating from English to French but SDer takes 1.9 seconds per sentence longer when

translating from French to English. Thus, for the DOT model over the HomeCentre

corpus, we conclude that there is no need to sacrifice accuracy for efficiency as the most

FULL ENGLISH TRANSLATIONS FULL FRENCH TRANSLATIONS Exact Match Scores

1 2 3 4

MPT 45.40 56.00 59.08 60.92

MPP 33.57 40.85 42.72 43.66

MPD 45.40 54.46 57.54 60.31

SDer 38.10 56.00 60.92 62.15

Exact Match Scores

1 2 3 4

MPT 61.49 66.76 67.05 67.91 MPP 47.07 53.15 52.48 53.38

MPD 62.39 66.76 67.91 68.48

SDer 50.45 65.62 68.19 68.77

Table 6.15: Exact match accuracy for complete translations only.