6.5 Discussion
6.5.3 Ranking algorithms: efficiency vs accuracy
As discussed in section 3.2.5, algorithms such as the Monte Carlo method which approxi-
mate an NP-hard search problem are generally not adopted if a deterministic alternative
can be found which does not introduce an unacceptable degradation in performance. We
have presented experiments where four ranking strategies are compared; two of these
strategies – MPT and MPP – use random sampling to approximate the search space
whereas the other two – MPD and SDer – use the Viterbi algorithm. In this section, we
look at how these ranking strategies compare in terms of efficiency.
Table 6.11 gives the average number of seconds required to translate each sentence at
each depth and using each of the four ranking strategies – times for English to French
translation are given on the left and times for French to English translation are given on
the right. The times given represent full processing time for each sentence, i.e. the time
taken to apply the two-phase analysis algorithm in order to determine the translation
space and the time taken to select the best output translation.9 Table 6.12 gives, for MPT
and MPP ranking, the average number of samples taken when disambiguating at each
fragment depth – again the left table refers to English as source language and the right to
French as source language.
Focussing firstly on the average times over all translations given in Table 6.11, we see
that – not surprisingly – the time taken to translate each sentence increases as fragment
link depth increases, with a large increase from link depth 3 to link depth 4. The extra time 9We do not give separate disambiguation times as parsing and disambiguation are inextricably linked
ALL ENGLISH TRANSLATIONS ALL FRENCH TRANSLATIONS samples/sentence MPT MPP 1 206.3 273.1 2 166.9 260.2 3 134.3 214.2 4 106.7 197.9 samples/sentence MPT MPP 1 56.1 160.9 2 39.2 95.7 3 34.6 91.6 4 30.3 93.1
Table 6.12: Average number of samples taken when selecting an output translation for MPT and MPP ranking, for all link depths and both translation directions.
taken for each sentence at greater depths is spent building the translation space (which
contains increasing numbers of fragments) rather than ranking the output translations. In
fact, we see from Table 6.12 that fewer samples are taken per sentence for both MPT and
MPP ranking as fragment depth increases; as seen for parsing in section 3.2.5, as more
contextual information is introduced it becomes easier to determine which translation (for MPT) and representation (for MPP) is most probable.
Again looking at Table 6.11 but this time comparing ranking algorithms, we observe
that for English to French translation at each depth, MPT ranking takes longest, followed
by MPP ranking and MPD, and SDer rankings are fastest but the difference between the
fastest and slowest at link depth 4 is just 1.7 seconds. The opposite, however, holds for
French to English translation: MPP ranking is slowest, followed by MPD and SDer, and
MPT ranking is consistently fastest. (Again, the difference in time taken between fastest
and slowest at link depth 4 is small.) This allows us to conclude that not only do the
ranking methods which require random sampling not take significantly longer to process each sentence than our ranking strategies based on the Viterbi algorithm, but in some
instances they actually arrive at a solution more quickly. In order to investigate why this
is the case, we look more closely at the differences in times taken by the MPT and MPP
ranking methods.
For English to French translation, calculating the MPT takes longer than calculating
the MPP at each depth whereas, for French to English translation, the opposite holds.
There are two issues at play here. Firstly, we see from Table 6.12 that more samples
per sentence are required to determine the MPP than the MPT. It is sometimes the case
that although more than one pair of analyses can be generated for a given input string, all of those analyses yield the same target translation. In this situation, sampling is
required to determine the most probable analysis but not to determine the most probable translation. Clearly, this has an effect on the average numbers of samples taken for each
strategy and, consequently, the average sentence processing times. However, while this
explains why, for French to English translation, searching for the MPT is faster, it does
not explain why the opposite holds for English to French translation. Recall that one
of the factors upon which the decision to stop sampling is based is the total number of
items that are being ranked, i.e. when ranking according to translation probability, we
consider the total number of translations possible, and when ranking according to (paired)
analysis probability we consider the total number of analyses possible. However, while the
total number of analyses possible can be computed efficiently during construction of the translation space, computing the total number of translations possible is more costly as
the combination of terminal symbols and frontier substitution sites yielded by each target
language fragment in the translation space must be considered. Whether or not the time
invested in determining exactly how many translations there are for each string is time
well spent depends on whether ranking according to translation probability gives higher
quality output than ranking according to analysis probability. Looking back to the results
for complete translations only given in Tables 6.3 and 6.7, we see that MPT ranking
significantly outperforms MPP ranking for both translation directions at all depths and
for all metrics. Thus, we conclude that, if choosing between MPT and MPP ranking, determining the number of possible translations for each string istime well spent.
In sections 6.3 and 6.4 we looked at the accuracy achieved when complete transla-
tions (i.e. translations which, according to the training data, are fully grammatical) are
evaluated separately from partial translations (i.e. translations which, according to the
training data, are incomplete and/or ungrammatical), and found that not distinguishing
between complete and partial translations distorts the evidence as to which system con-
figuration gives higher translation accuracy. We now turn our attention to the differences
in efficiency over complete and partial translations. Table 6.13 gives the average seconds per sentence where full translations were generated and the average seconds per sentence
where only partial translations were generated for English as source language (in the left
FULL ENGLISH TRANSLATIONS FULL FRENCH TRANSLATIONS CPU seconds/sentence MPT MPP MPD SDer 1 0.73 1.09 0.19 0.18 2 1.63 1.10 0.38 0.41 3 2.44 1.81 0.99 1.00 4 9.10 11.5 7.59 7.51 CPU seconds/sentence MPT MPP MPD SDer 1 0.44 3.76 1.49 1.45 2 0.83 3.94 1.78 1.81 3 1.83 4.97 2.69 2.67 4 15.5 21.5 16.2 16.0
PARTIAL ENGLISH TRANSLATIONS PARTIAL FRENCH TRANSLATIONS CPU seconds/sentence MPT MPP MPD SDer 1 2.65 3.26 0.47 0.52 2 2.96 5.15 0.97 0.94 3 4.32 6.02 2.27 2.25 4 20.5 15.4 19.0 18.7 CPU seconds/sentence MPT MPP MPD SDer 1 1.38 3.36 6.88 7.03 2 2.02 2.64 8.18 8.31 3 3.60 4.83 9.75 9.86 4 27.8 21.2 34.2 33.8
Table 6.13: Average time taken to translate each sentence for all link depths and ranking strategies, and both translation direc- tions, where we distinguish between complete and partial translations.
FULL ENGLISH TRANSLATIONS FULL FRENCH TRANSLATIONS samples/sentence MPT MPP 1 67.7 226.9 2 44.3 170.2 3 25.3 113.2 4 21.5 106.3 samples/sentence MPT MPP 1 28.2 138.0 2 18.4 82.9 3 13.8 67.5 4 13.4 69.4
PARTIAL ENGLISH TRANSLATIONS PARTIAL FRENCH TRANSLATIONS samples/sentence MPT MPP 1 471.0 637.4 2 424.1 970.7 3 362.9 1011.0 4 285.3 920.6 samples/sentence MPT MPP 1 120.5 442.9 2 94.8 254.1 3 90.0 388.1 4 75.4 385.6
Table 6.14: Average number of samples taken when selecting an output translation for MPT and MPP ranking, for all link depths and both translation directions, where we distinguish between complete and partial translations.
samples per sentence (for MPT and MPP ranking) where full translations were generated
and, separately, where partial translations were generated; again, results for English as
source language are given in the left column and for French as source language in the right
column.
Table 6.13 indicates clearly that sentences for which only partial translations can be
generated take significantly longer to process at all depths than those for which full trans-
lations can be generated for both translation directions; the exception to this is MPP ranking for French to English translation where we see that times are similar for both.
plete translations are slightly higher than MPD and SDer ranking for English as source language and slightly lower for French as source language. Although the times taken are
significantly longer, similar trends are observed for the processing of sentences for which
only partial translations are generated. We see from Table 6.14 that while, as before,
sample sizes decrease as fragment depth increases, significantly more samples are taken
when generating partial translations; this is in accordance with the increased time taken
to process these translations.
Given that the time taken to generate partial translations is longer than to generate
complete translations, that the quality of those translations is significantly poorer and
that they are, in any case, assigned zero probability by the DOT model, whether or not we should attempt to generate such translations at all is debatable. It is important to
point out, however, that much of the time required to process these sentences is spent in
determiningthat they cannot be translated grammatically by the model, so not outputting such translations may not save us a great deal of time. Furthermore, we note that while
partial translations are not of as high quality as complete translations, they still receive
reasonable evaluation scores. Thus, we conclude that generating translations which are
ill-formed with respect to the model and training data is worthwhile.
We looked in detail in sections 6.3 and 6.4 at the accuracy of each system configuration
and concluded that, overall, the highest quality translations were produced by searching for the shortest derivation and using all fragments of link depth 4 or less. Having also
considered the efficiency of each configuration and observed that for the configurations
which give the best accuracy (MPT and SDer at link depth 4), there is little difference
in efficiency – MPT takes, on average, 1.7 seconds per sentence longer than SDer when
translating from English to French but SDer takes 1.9 seconds per sentence longer when
translating from French to English. Thus, for the DOT model over the HomeCentre
corpus, we conclude that there is no need to sacrifice accuracy for efficiency as the most
FULL ENGLISH TRANSLATIONS FULL FRENCH TRANSLATIONS Exact Match Scores
1 2 3 4
MPT 45.40 56.00 59.08 60.92
MPP 33.57 40.85 42.72 43.66
MPD 45.40 54.46 57.54 60.31
SDer 38.10 56.00 60.92 62.15
Exact Match Scores
1 2 3 4
MPT 61.49 66.76 67.05 67.91 MPP 47.07 53.15 52.48 53.38
MPD 62.39 66.76 67.91 68.48
SDer 50.45 65.62 68.19 68.77
Table 6.15: Exact match accuracy for complete translations only.