Analysis - Automatic post-editing of phrase-based machine translation outputs

Treex (Popel and Žabokrtský, 2010) provides a scenario to analyze the sentences up to t-layer, with a prerequisite of already having the a-layer. The analysis is rule-based.

We found the quality of the existing analysis to be sufficient, with the exception of the analysis of verb tenses, described in Section 7.2.1. We developed an improved version of the analysis, which we use instead of the original one. We describe our improvements in Section 7.2.2.

7.2.1 Original Verb Tense Analysis

The verb tense is represented by a set of grammatemes, as was briefly described in Section 7.1.3.

As the grammatemes were originally designed for Czech, they match the system of Czech verb tenses very well. The grammatemes are filled by a small set of rules, which map the values of morphological tags to the values of the verb grammatemes.

However, the analysis of English tenses is more complicated, not only because the system of English verb tenses is much more complex than the Czech one, but also because the question of how to match it to the system of tense grammatemes is not yet resolved.

Originally, the analysis of English tenses was done partly heuristically and could not capture many complex compound verb forms. Moreover, it did not capture the tenses fully – some distinctions, such as the perfect or the continuous aspect, were not made at all. Provided that the heuristics worked correctly, all of the following tenses were grouped under a single label (tense=ant): present perfect simple, present perfect continuous, past simple, past continuous, past perfect simple, past perfect continuous.

7.2.2 Our Adaptations of the Verb Tense Analysis

We found the coarse approach to verb tense analysis unsuitable for our needs, and therefore developed a full analysis of the English tenses. In our approach, the tense of an English verb form is represented by a set of flags, listed in Table 7.2. The flags are binary, except for the modality flag, which is multiclass – its value is a modality type, such as debitive modality (‘must’), possibilitive modality (‘can’, ‘could’) or permissive modality (‘may’, ‘might’). The “going to” flag typically marks future, but is not considered to do so in the case of ‘were going to’.

The analysis is rule-based. It relies on the underlying analyses being correct, especially on the components of compound verb forms having been identified correctly – in a tectogrammatical tree, each compound verb form is represented by one t-node, which groups together all the tokens that the compound form consists of. However, it does tolerate errors in VBD (past simple) / VBN (past participle) tagging.

The main principle of the analysis is to transcribe the verb forms into a normalized form which uses a set of only 10 different tokens, corresponding to the possible forms of the verbs ‘be’, ‘have’, and full verbs. These are able to capture the following flags: past, passive, perfect, continuous.4
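The normalization idea can be illustrated with a minimal sketch. The lookup tables and the function names `normalize` and `read_flags` below are assumptions made for illustration, not the actual Treex implementation; the ten tokens are be, being, were, been, have, having, had, love, loving and loved. Mapping both VBD and VBN of a full verb to the same token is one possible way to tolerate VBD/VBN tagging errors.

```python
# Hypothetical lookup tables: POS tag -> normalized token (10 tokens in total).
BE = {"VB": "be", "VBP": "be", "VBZ": "be",
      "VBG": "being", "VBD": "were", "VBN": "been"}
HAVE = {"VB": "have", "VBP": "have", "VBZ": "have",
        "VBG": "having", "VBD": "had", "VBN": "had"}
FULL = {"VB": "love", "VBP": "love", "VBZ": "love",
        "VBG": "loving", "VBD": "loved", "VBN": "loved"}

BE_FORMS = ("be", "being", "were", "been")
HAVE_FORMS = ("have", "having", "had")


def normalize(tokens):
    """Transcribe (lemma, tag) pairs into normalized tokens, or return None.

    The last token is always treated as the full verb (an assumption).
    """
    out = []
    for i, (lemma, tag) in enumerate(tokens):
        if i == len(tokens) - 1:
            table = FULL
        elif lemma == "be":
            table = BE
        elif lemma == "have":
            table = HAVE
        else:
            return None  # only 'be' and 'have' may precede the full verb
        if tag not in table:
            return None
        out.append(table[tag])
    return out


def read_flags(norm):
    """Read the four flags off a normalized form; None if it is ill-formed."""
    flags = {"past": norm[0] in ("were", "had", "loved"),
             "passive": False, "perfect": False, "continuous": False}
    for prev, cur in zip(norm, norm[1:]):
        if prev in HAVE_FORMS and cur in ("been", "loved"):
            flags["perfect"] = True
        elif prev in BE_FORMS and cur in ("being", "loving"):
            flags["continuous"] = True
        elif prev in BE_FORMS and cur == "loved":
            flags["passive"] = True
        else:
            return None
    return flags
```

For instance, ‘had been being loved’ is transcribed to the tokens had, been, being, loved, and all four flags are set; ‘has loved’ with the participle mistagged as VBD still yields only the perfect flag.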

The other flags, such as future or modality, are triggered by modifiers which are not present in the normalized form, to alleviate the combinatorial complexity of listing all of their possible combinations. They are removed during the transcription, setting the flags immediately. Some of the modifiers trigger multiple flags, such as ‘should’ which we understand as marking both the hortative modality and the conditionality; this is influenced by the ultimate objective to match the verb forms to their Czech counterparts, as e.g. the best translation of ‘he should’ – ‘měl by’ – should be both modal and conditional. Most of the modifiers have only one form and do not carry any other tense information, except for the following: ‘have to’, ‘want to’, ‘do’, ‘be going to’, ‘be able to’. For these, the POS tag has to be carried onto the following word.

1. get all relevant tokens – verbs (VB.*), modals (MD) and the word ‘able’

2. transcribe the tokens, removing modality markers (e.g. ‘must’ and ‘should’), conditional markers (e.g. ‘would’ and ‘should’), future markers (forms of ‘will’, ‘shall’ and ‘be going to’), and forms of ‘do’, and setting the corresponding flags triggered by the markers

3. set the flags corresponding to the transcription

4. if the compound form could not be analysed, delete the first token and go back to step 2

5. set the negation flag if ‘not’ is found among the lemmas of the auxiliary nodes

6. change “present perfect conditional” to past conditional (e.g. ‘would have loved’)

7. change “present perfect modal” to past modal (e.g. ‘must have loved’)
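The seven steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the names `MARKERS`, `core_flags`, `strip_markers` and `analyse` are hypothetical, modal lemmas are assumed to equal their lowercased word forms, `core_flags` is a compressed stand-in for the lookup over normalized forms, and only the single-word markers, ‘do’ and ‘be going to’ are handled (‘have to’, ‘want to’ and ‘be able to’ are omitted).

```python
# Hypothetical marker table: modal lemma -> flags set when the marker is removed.
MARKERS = {
    "must": {"modality": "debitive"},
    "can": {"modality": "possibilitive"},
    "could": {"modality": "possibilitive"},
    "may": {"modality": "permissive"},
    "might": {"modality": "permissive"},
    "should": {"modality": "hortative", "conditional": True},
    "would": {"conditional": True},
    "will": {"future": True},
    "shall": {"future": True},
}
PARTICIPLE = {"VBD", "VBN"}  # treated alike to tolerate tagging errors


def core_flags(tokens):
    """Flags of a marker-free form such as 'had been loved'; None if ill-formed."""
    if not tokens or not all(tag.startswith("VB") for _, tag in tokens):
        return None
    flags = {"past": tokens[0][1] in PARTICIPLE,
             "perfect": False, "continuous": False, "passive": False}
    for k in range(1, len(tokens)):
        prev, tag = tokens[k - 1][0], tokens[k][1]
        if prev == "have" and tag in PARTICIPLE:
            flags["perfect"] = True
        elif prev == "be" and tag == "VBG":
            flags["continuous"] = True
        elif prev == "be" and tag in PARTICIPLE and k == len(tokens) - 1:
            flags["passive"] = True
        else:
            return None
    return flags


def strip_markers(tokens, flags):
    """Step 2: remove markers, set their flags, carry POS tags where needed."""
    rest, carried, i = [], None, 0
    while i < len(tokens):
        lemma, tag = tokens[i]
        last = i == len(tokens) - 1
        if tag == "MD" and lemma in MARKERS:
            flags.update(MARKERS[lemma])
        elif lemma == "do" and not last:
            carried = tag  # 'did love' -> past is carried onto 'love'
        elif (lemma == "be" and not last and tokens[i + 1] == ("go", "VBG")
              and i + 2 < len(tokens)):
            flags["going_to"] = True
            if tag != "VBD":  # 'were going to' is not treated as future
                flags["future"] = True
            carried = tag
            i += 1  # skip 'going'
        else:
            rest.append((lemma, carried if carried and tag == "VB" else tag))
            carried = None
        i += 1
    return rest


def analyse(nodes):
    """nodes: (lemma, tag) pairs of one compound verb form, in surface order."""
    # Step 1: keep verbs, modals and 'able' ('able' is not interpreted here).
    tokens = [(l, t) for l, t in nodes
              if t.startswith("VB") or t == "MD" or l == "able"]
    while tokens:
        flags = {}
        core = core_flags(strip_markers(tokens, flags))  # steps 2 and 3
        if core is not None:
            flags.update(core)
            break
        tokens = tokens[1:]  # step 4: drop the first token and retry
    else:
        return None
    if any(lemma == "not" for lemma, _ in nodes):  # step 5
        flags["negation"] = True
    # Steps 6 and 7: present perfect conditional/modal becomes past.
    if flags["perfect"] and not flags["past"] and (
            flags.get("conditional") or flags.get("modality")):
        flags["perfect"], flags["past"] = False, True
    return flags
```

In this sketch, ‘would have loved’ ends up with the conditional and past flags set and the perfect flag cleared, and ‘was going to love’ receives the “going to” and past flags but no future flag.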

The result is a set of flags, which is returned. Some of the flags are then mapped to grammatemes:

• the tense grammateme, which reflects the syntactic tense (past/present/future), with the exception of the present perfect tenses, which are considered to be past tenses

• the diathesis grammateme, which marks the passive

• the deontmod grammateme, which marks the presence of a modal verb (the set of values is the same as for the modality flag)

• the verbmod grammateme, which distinguishes the indicative, imperative and conditional modality

• the negation grammateme, which marks negated verbs

4 The tokens used are naturalistic: be, being, were, been; have, having, had; love, loving, loved. Thus, e.g. the past perfect tense is represented by had loved, and the past perfect continuous passive tense is represented by had been being loved. This is only for convenience of coding; any other set of tokens could be used.
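The mapping from flags to the grammatemes listed above might look roughly like the following sketch. The function name `flags_to_grammatemes` and the flag keys are hypothetical, and apart from the tense values sim, ant and post, the grammateme values are descriptive placeholders rather than the actual value inventory. The imperative is left out, since this illustration has no flag from which it could be derived.

```python
def flags_to_grammatemes(flags):
    """Map a dict of tense flags to grammateme values (illustrative only)."""
    if flags.get("future"):
        tense = "post"   # syntactic future
    elif flags.get("past") or flags.get("perfect"):
        tense = "ant"    # past tenses, including the present perfect
    else:
        tense = "sim"    # present
    return {
        "tense": tense,
        # placeholder values below; the real value sets may differ
        "diathesis": "passive" if flags.get("passive") else "active",
        "deontmod": flags.get("modality"),  # same values as the modality flag
        "verbmod": "conditional" if flags.get("conditional") else "indicative",
        "negation": bool(flags.get("negation")),
    }
```

Checking the future flag first means that a future perfect form is mapped to post, while a present perfect form without any future marker is mapped to ant.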

The only omission that we are aware of is an insufficient analysis of infinitives. We rely on preceding analysis steps to correctly identify the infinitives, but we are unsure which flags to assign to them; the system of infinitive forms in Czech is much poorer than that of English, providing little support for our decisions.

Otherwise, the accuracy of the tense analysis is close to 100%; when manually inspecting the results, we only encountered errors that were caused by errors in the preceding analysis steps, such as an auxiliary attached to an incorrect full verb or a mistagged full verb.

It must be noted, however, that the current approach to the analysis of both English and Czech verb tenses is still a relaxation of the original idea of the tectogrammatical layer: Sgall (1967) supposed that the attributes of tectogrammatical nodes should capture, among others, the real (semantic or even pragmatic) tense of the verb. At present, this idea is only reflected in the set of values of the tense grammateme, which was meant to reflect the tense of a clause relative to the tense of the parent clause. The values are sim for actions happening simultaneously with the “parent action”, post for actions happening after it and ant for actions happening before it. However, in practice, these values represent the absolute tense – i.e. sim means present, post means future and ant means past.

For many applications, it would be beneficial if the tense identification was able to capture the pragmatic tense. However, this seems to be too hard to do at the moment, since not even the syntactic tense analysis is perfect now. Applications related to machine translation therefore have to implicitly employ the assumption that a syntactic tense A in one language will usually correspond to a syntactic tense B in the other language, no matter what the semantics or the pragmatics of the tense are.

Practice has shown this assumption to be rather reasonable for English-to-Czech translation. For example, both the English present simple tense and the Czech present tense can express both a repeated action in the present, as in ‘I paint pictures.’ – ‘Maluju obrazy.’, and an action in the future that happens according to a given schedule, as in ‘My plane leaves at 8.’ – ‘Moje letadlo odlétá v 8.’.
