Recently, bias in AI has rightfully gained considerable attention, not only in the research community but also in the media, as the scope of AI applications has been growing. Machine-learning architectures like SMT and NMT (described in Section 2.2.1 and Section 2.2.2) learn by maximizing overall prediction accuracy. As such, the algorithms learn to optimize for the more frequently appearing patterns or observations. If a specific group of individuals appears more frequently than others in the training data, the program will optimize for those individuals because this boosts its overall accuracy (Zhou and Schiebinger, 2018). Computer scientists evaluate algorithms on test sets, but these are typically random sub-samples of the original training set and are thus likely to contain the same biases as observed during training. As such, biased behaviour is rewarded rather than punished; consequently, the outputs these systems produce show a lack of diversity on multiple levels. In Chapter 7 we discuss the loss of linguistic diversity in more detail and relate this to the algorithmic bias observed.
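The effect described above can be illustrated with a minimal sketch. The toy data, the group names "A" and "B", and the helper functions are all hypothetical and chosen only for illustration; a constant predictor stands in for a trained model, and a systematic sample stands in for a random test split.

```python
from collections import Counter

# Hypothetical toy data: (group, label) pairs. Group "A" is over-represented
# and always carries label "x"; group "B" always carries label "y".
data = [("A", "x")] * 90 + [("B", "y")] * 10

def majority_label(examples):
    """Return the most frequent label: the best constant prediction
    when the only objective is overall accuracy."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def accuracy(examples, prediction):
    """Fraction of examples whose label equals the constant prediction."""
    return sum(label == prediction for _, label in examples) / len(examples)

def per_group_accuracy(examples, prediction):
    """Accuracy computed separately for each group."""
    groups = {}
    for group, label in examples:
        groups.setdefault(group, []).append((group, label))
    return {g: accuracy(ex, prediction) for g, ex in groups.items()}

prediction = majority_label(data)

# Every fifth example: a deterministic stand-in for a random sub-sample,
# which keeps the same 9:1 skew as the training data.
test_set = data[::5]

overall = accuracy(data, prediction)
test_overall = accuracy(test_set, prediction)
by_group = per_group_accuracy(data, prediction)

print(overall, test_overall, by_group)
```

Under these assumptions the overall score looks strong on both the full data and the sub-sample, while the per-group breakdown shows that the under-represented group is never predicted correctly. Reporting accuracy per group is one simple way to make this kind of skew visible.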
In Chapter 6, we zoom in on issues related to gender agreement in MT. Our work focuses on morphological agreement by incorporating gender features into an NMT pipeline. The analysis we conducted prior to our gender-related experiments revealed that Europarl (Koehn, 2005) has a 2:1 male-female speaker ratio. As some languages express gender agreement with the speaker, this can lead to a higher frequency of male pronouns or male endings for nouns. This, in turn, can influence the translations and lead to an exacerbation of the observed phenomena, as statistical approaches learn by generalizing over the seen patterns. Similar to our observation, recent studies (Garg et al., 2018; Lu et al., 2018; Prates et al., 2019) have highlighted issues with biased training and testing data, and some have already suggested that the algorithms themselves might exacerbate the observed biases (Lu et al., 2018; Zhao et al., 2018). The systematic bias problem extends to a range of AI applications. This is particularly problematic as one of the reasons for employing such applications is that they ought to be more objective than humans. We dedicate a section to bias and related issues as we should strive towards fair algorithms that do not sustain or worsen observed data biases.
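How a skew in the data can be exacerbated rather than merely reproduced can be sketched as follows. This is a deliberate caricature, not a description of any actual NMT system: it assumes that an ambiguous source sentence was seen with masculine and feminine target forms in a 2:1 ratio (mirroring the speaker ratio mentioned above) and that the system always outputs the single most frequent form. The word forms, counts, and function name are illustrative assumptions.

```python
from collections import Counter

# Hypothetical training observations for one ambiguous source sentence
# ('I am happy'): the target form seen in each training pair, at a 2:1 ratio.
train_targets = ["heureux"] * 200 + ["heureuse"] * 100

counts = Counter(train_targets)
train_share_male = counts["heureux"] / len(train_targets)

def most_likely_form(form_counts):
    """Always pick the most frequent form, as a system maximising
    expected accuracy on this data would."""
    return form_counts.most_common(1)[0][0]

# 'Translate' the same ambiguous sentence 300 times.
outputs = [most_likely_form(counts) for _ in range(300)]
output_share_male = outputs.count("heureux") / len(outputs)

print(round(train_share_male, 2), output_share_male)
```

In this sketch a two-thirds majority in the data becomes the only form in the output. Real systems condition on context and do not behave this rigidly, but the example shows why choosing the most probable form tends to push an existing imbalance further.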
Because bias is relatively well hidden in MT, it did not receive a lot of attention until recently. In order to find examples of bias in MT, one would need to find a sentence that is ambiguous in the source but unambiguous in the target. This ambiguity arises when one language makes something explicit which is left implicit in the other language. An example could be the implicit natural gender of a speaker in a language like English compared to the explicit natural gender markers in a language such as French. For example, ‘I am happy’ is not marked for gender in English. Its French translation, however, requires the translator to pick between ‘Je suis heureux’ (male) or ‘Je suis heureuse’ (female). Similarly, the word ‘sister’ would be translated into Basque as ‘arreba’ (male) or ‘ahizpa’ (female), depending on the gender of the person whose sister is referred to. When no gender is explicitly mentioned in the source text, most of these choices could be interpreted as being rather innocent. However, the mere fact that we do not exactly know or control these endings is problematic. For example, when passing a list¹³ of 111 French reflexive verbs¹⁴ through Google Translate, none of them received a female ending. This corresponds to what we have observed before, i.e. the male endings can almost be considered the default form in Google Translate. As such, the female endings are in a sense already marked because they appear infrequently. However, when translating Example (7), the verb ‘violée’ (raped) has a female ending.
¹³ https://www.frenchtoday.com/blog/french-verb-conjugation/french-reflexive-verbs-list-exercises/. May 2019.
(7) ‘I was raped’ ‘J’ai été violée’
These ‘uncontrolled’ fluctuations between male and female endings depend on the training data, which reflect conscious and unconscious biases present in our day-to-day communication, but also on the further generalizations made by the algorithms themselves. Users might not always be aware of this when using MT systems, and currently there is nothing to notify the user about the assumptions the algorithms have made. For users who do grasp the target language well enough, it is understandable that ‘marked’¹⁵ translations such as Example (7) can be considered inappropriate or even offensive, especially when claims about reaching human parity are becoming commonplace (Hassan et al., 2018; Läubli et al., 2018; Toral et al., 2018).
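The kind of probe described above, passing gender-ambiguous source sentences through an MT system and counting which target form comes back, can be sketched as follows. This is only an illustrative sketch: `tally_gender_forms`, `stub_translate`, and the canned outputs are hypothetical names and data, the stub does not call any real MT service, and exact string matching against two reference forms is a simplification of how endings would be inspected in practice.

```python
from typing import Callable, Dict, Iterable, Tuple

def tally_gender_forms(
    sources: Iterable[str],
    translate: Callable[[str], str],
    forms: Dict[str, Tuple[str, str]],
) -> Dict[str, int]:
    """Count how often a translator returns the masculine or the feminine
    reference form for gender-ambiguous source sentences.

    `forms` maps each source sentence to (masculine, feminine) translations.
    """
    tally = {"masculine": 0, "feminine": 0, "other": 0}
    for source in sources:
        masculine, feminine = forms[source]
        output = translate(source)
        if output == feminine:
            tally["feminine"] += 1
        elif output == masculine:
            tally["masculine"] += 1
        else:
            tally["other"] += 1
    return tally

# Reference forms for three ambiguous English sources.
forms = {
    "I am happy": ("Je suis heureux", "Je suis heureuse"),
    "I got up": ("Je me suis levé", "Je me suis levée"),
    "I was raped": ("J'ai été violé", "J'ai été violée"),
}

# Canned outputs standing in for an MT system (assumed for illustration).
canned_outputs = {
    "I am happy": "Je suis heureux",
    "I got up": "Je me suis levé",
    "I was raped": "J'ai été violée",
}

def stub_translate(source: str) -> str:
    """Hypothetical stand-in for a call to an MT system."""
    return canned_outputs[source]

tally = tally_gender_forms(forms, stub_translate, forms)
print(tally)
```

Because every source sentence in such a probe is ambiguous by construction, each counted output corresponds to an assumption the system made silently. The same list could in principle be used to flag these cases to the user.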
We address the issue of gender bias to some extent with the integration of gender features in NMT. Nevertheless, our approach is still limited, and more research needs to be conducted in this direction. Until an appropriate way of handling gender agreement, controlling it, and presenting it to users has been proposed, it is important that researchers and users are aware of these issues and that the necessary checks are put in place.