
Chapter 4 : Methodology and corpus design

4.3 Construction of the corpus

Due to the lack of available corpora of Egyptian films and their English subtitles, the researcher had to compile a corpus from scratch. To build the corpus, the film dialogues in Egyptian Arabic and their English subtitles were transcribed and annotated manually. The UAM CorpusTool was used to assist in the annotation of each unit and the collection of statistical data. This tool was developed by the computational linguist Mick O’Donnell to be easy to use for linguists and computational linguists who have no knowledge of programming (O’Donnell, 2008). The UAM CorpusTool allows the creation of multiple layers and provides a hierarchically organised tagging scheme for each layer. The categories are organised in a tree structure in which they are connected to one another, so that every main category can have different subcategories.

The tree allowed the researcher to organise the analysis in layers. The annotation in the UAM CorpusTool can be conducted manually or semi-automatically. The textual annotation in this study was carried out manually because the classification of linguistic varieties is based on pragmatic features which cannot be easily identified automatically (O’Donnell, 2008). The annotation scheme can be designed by the researcher according to the features that she/he wants to code. The tool allows the annotation of a range of texts at multiple linguistic levels as desired (e.g. classifying the text as a whole, tagging sections of text by function, or tagging sentences/clauses). A statistical analysis can be generated for the text itself (e.g. lexical density, pronominal usage, word and segment length), or according to the frequency of annotations. The UAM CorpusTool therefore has powerful features for creating and annotating film dialogue and its subtitles, as well as for analysing the multimodal relations in the source and target products.

For the purpose of the present study, a relatively small-scale translational film corpus was developed consisting of orthographic transcriptions of the film lines as they were uttered on screen. However, the corpus does not include all of the spoken units in the films. Manual transcription of all spoken units for each character in each film would be a complex and time-consuming task (Harris and Salama-Carr, 2000), because transcription software is not accurate enough when transcribing film dialogue for the representation of dialect. Moreover, given the fact that it is important to have more than one film in order not to bias the data in terms of genre, director and content of a specific film, it was not possible to include all the spoken units of the films in the corpus. Regarding the TTs, the SubRip program was used to extract the English subtitles of the selected scenes from the DVD. SubRip is a software program for Windows which “rips” (extracts) subtitles and their timings from DVD discs. It is a free software program, released under the GNU General Public License.
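The extracted subtitles and timings can be read back into a structured form before they are organised into units. The following is a minimal sketch, assuming the subtitles were saved in the common SubRip text layout (a counter line, a timing line of the form `hh:mm:ss,mmm --> hh:mm:ss,mmm`, then one or more text lines, with a blank line between subtitles). The names `Subtitle` and `parse_srt` and the sample text are invented for illustration and are not part of the study.

```python
import re
from dataclasses import dataclass


@dataclass
class Subtitle:
    index: int        # running number of the subtitle
    start: str        # start time as written in the file
    end: str          # end time as written in the file
    lines: list[str]  # the subtitle text, one entry per display line


# Timing line, e.g. "00:01:02,500 --> 00:01:05,000"
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text: str) -> list[Subtitle]:
    """Parse SubRip-style text into a list of Subtitle records."""
    subtitles = []
    # Subtitle blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = [row.strip() for row in block.splitlines()]
        if len(rows) < 3:
            continue  # a usable block needs counter, timing and text
        match = TIMING.match(rows[1])
        if not rows[0].isdigit() or match is None:
            continue  # skip blocks that do not follow the assumed layout
        subtitles.append(
            Subtitle(int(rows[0]), match.group(1), match.group(2), rows[2:])
        )
    return subtitles


# Invented sample, for illustration only.
SAMPLE = """1
00:01:02,500 --> 00:01:05,000
Welcome to Cairo.

2
00:01:05,200 --> 00:01:08,900
This is the oldest
university in Egypt.
"""

subs = parse_srt(SAMPLE)
```

Keeping each display line as a separate list entry preserves the line breaks of the subtitle, which matter later when the units are marked up for annotation.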

The corpus has been constructed in accordance with two main criteria: 1) type of diegetic function and scene, and 2) participation of specific characters. The films are divided into scenes or sequences. For the first criterion, ‘scene’ is defined according to Bordwell and Thompson, who describe scenes as the “distinct phases of the action occurring within a relatively unified space and time” (Bordwell and Thompson, 1979/2008, pp.97-98). A new scene is counted when there is a change of setting. Following Ellender (2015) and Ramos Pinto (2017), the corpus included the initial scene of each film in which the character(s) under analysis appear and all other scenes that contained one or more linguistic varieties. Selecting the initial scene and other scenes throughout the film enables an examination of the extent to which the film tradition follows literary traditions. For example, researchers such as Blake (1981), Chapman (1994) and Page (1988) have found that non-standard varieties occur more prominently at the beginning of a literary text and reduce progressively towards its end. All scenes with fewer than five spoken units were excluded.

Following these criteria, fourteen scenes were selected in Sayed the Romantic, for example, including the initial scene set at the Cairo University campus. In this scene, Abu Rāwiya and Sayyid appear with tourists and talk about Egyptian civilisation. A dialogue starts between them and one of the tourists. Abu Rāwiya tries to use a high-prestige variety (English) but immediately returns to his regional dialect. His inability to speak Standard English shows that he cannot fit in with the middle class. Thus, the findings show that most of the elements in the spoken and the mise-en-scène modes (accent, vocabulary/morphosyntax, clothes and figure behaviour) identify Abu Rāwiya as a speaker of a ‘sub-standard social’ variety and consequently as a poor and low-educated man with low social status; the exception is the setting. The non-compliance of this one mode serves the diegetic functions of introducing a comic moment and irony.

After the identification of scene 1, a limited number of scenes were selected. This selection was based on changes in the function of linguistic varieties throughout the films. For example, if the linguistic variety used in scene 1 to portray the selected character as belonging to a low social class with a low educational level is used in scene 2 to fulfil a similar function, scene 2 is excluded. This makes it possible to account for critical moments in the films where the use of linguistic varieties plays a crucial role in fulfilling specific diegetic functions. It helps in identifying the strategies and procedures used to translate the linguistic varieties and in assessing their impact in preserving, cancelling or modifying the intermodal relations established in the ST and, consequently, the diegetic functions they support in that key moment. In the same film, the eighth scene, for example, was selected as one of the fourteen scenes. Abu Rāwiya uses a ‘sub-standard social’ variety, and all elements in the mise-en-scène mode confirm this variety in identifying him as a poor and low-educated man with low social status. This relation of confirmation-equivalence serves the diegetic purpose of defining interpersonal relationships of solidarity (between Abu Rāwiya and the other characters in the scene). Although Abu Rāwiya uses the same linguistic variety in the first and eighth scenes, the diegetic functions are different because of the different intermodal relations established between the spoken and the mise-en-scène modes. Consideration of the intermodal relations established between the two modes enables the examination of the possible impact of the adopted strategy on preserving, cancelling or modifying the intermodal relations and the diegetic functions they serve.
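The exclusion rules for scenes can be expressed as a simple filter. The sketch below is only an illustration of the logic, not the procedure used in the study, where the selection relied on the researcher's judgement. It assumes that each scene has already been labelled with one variety and one diegetic function, and that a scene is compared with all previously retained scenes rather than with scene 1 alone. The `Scene` record, the `select_scenes` function, the unit counts and the "standard" row are all invented for the example.

```python
from dataclasses import dataclass

MIN_UNITS = 5  # scenes with fewer than five spoken units are excluded


@dataclass(frozen=True)
class Scene:
    number: int    # position of the scene in the film
    units: int     # number of spoken units (invented values below)
    variety: str   # linguistic variety used by the character
    function: str  # diegetic function the variety fulfils


def select_scenes(scenes: list[Scene]) -> list[Scene]:
    """Keep scenes that are long enough and add a new variety/function pairing."""
    selected = []
    seen = set()
    for scene in sorted(scenes, key=lambda s: s.number):
        if scene.units < MIN_UNITS:
            continue  # too short to be included
        key = (scene.variety, scene.function)
        if key in seen:
            continue  # same variety fulfilling a similar function: excluded
        seen.add(key)
        selected.append(scene)
    return selected


# Hypothetical labels loosely based on the Abu Rāwiya example.
scenes = [
    Scene(1, 12, "sub-standard social", "comic moment and irony"),
    Scene(2, 9, "sub-standard social", "comic moment and irony"),
    Scene(3, 3, "standard", "formality"),
    Scene(8, 10, "sub-standard social", "solidarity"),
]

selected = select_scenes(scenes)
```

With these invented labels, scenes 1 and 8 are retained because the same variety serves different diegetic functions in them, while scene 2 repeats the pairing of scene 1 and scene 3 falls below the five-unit threshold.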

Film names                 Number of words   Number of units   Number of scenes
Karkar               ST    1844              408               13
                     TT    2327              402               13
Wesh Egram           ST    2102              301               15
                     TT    2503              303               15
Sayed the Romantic   ST    2127              281               14
                     TT    1929              274               14
Harameya fi KG2      ST    989               176               13
                     TT    1171              175               13

Table 2. Number of words, units and scenes in both the STs and the TTs

The second criterion concerns the participation of specific characters. Given that this study aims to examine the extent to which the Egyptian film tradition is influenced by Western cinema traditions in favouring one specific variety in the speech of major and/or minor characters, two types of characters were selected in each film on the basis of the major or minor role they assume. The criterion used to define the major and minor characters is how often the character appears in the film. In terms of the TTs, investigating the use of linguistic varieties in the subtitles of the speech of the selected major and minor characters enables the examination of whether the TTs maintain the same pattern as the STs, and whether the strategies employed to translate linguistic varieties differ between major and minor characters.

Film names           Major character   Minor character
Karkar               Karkar            Abu Karkar
Wesh Egram           Taha              Um Taha
Sayed the Romantic   Um Sayyid         Abu Rāwiya
Harameya fi KG2      Ḥasan             Sibāʿī

Table 3. Names of major and minor characters in the films

The speech of the selected characters was transcribed and organised according to the sentences in the TT. According to Hervey et al., “sentences are marked by a capital letter at the start of the first word, and a full stop, question mark, or exclamation mark at the end of the final word” (2006, p.115). The sentence was selected as the unit of analysis because sentences are what occupies the minds of most translators in the normal process of translating (Maia, 1996). For the annotation, given that the STs are not in a written medium, the characters’ utterances were organised according to the English sentence in the TTs, i.e. the equivalent utterance to the sentence in the TT was identified and taken as the ST’s unit. In the cases in which the subtitle did not have a full stop as observed in the corpus, the start of the English sentence was identified with a capital letter.

Two Word files were created for each scene: one for the spoken mode in the ST and the other for the subtitles mode in the TT. One slash (/) was used to indicate the subtitle’s line break and two slashes (//) were used to indicate that one sentence spreads over two different subtitles. After the organisation of the utterances and the English subtitles in Word files, the Word files were converted into plain text, the only format accepted by the UAM CorpusTool. A new project was created for each character and each character has two corpora: one with the units extracted from the ST and another with the units extracted from the TT. Each corpus folder contains different sub-corpora with one scene each.
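The slash notation can be illustrated with a short sketch. It assumes that each subtitle is available as a list of display lines, that a subtitle holds at most one sentence or sentence fragment, and that a sentence is complete when it ends in a full stop, question mark or exclamation mark. The function name `mark_units`, the spacing around the slashes and the sample lines are invented for illustration; the sketch does not cover subtitles without final punctuation, which in the study were handled by looking for the capital letter that opens the next sentence.

```python
SENTENCE_END = (".", "?", "!")


def mark_units(subtitles: list[list[str]]) -> list[str]:
    """Turn subtitles into sentence units marked with / and //.

    "/"  marks a line break inside one subtitle.
    "//" marks a sentence that continues in the next subtitle.
    """
    units = []
    pending = ""  # unfinished sentence carried over from earlier subtitles
    for lines in subtitles:
        text = " / ".join(line.strip() for line in lines)
        if pending:
            text = pending + " // " + text
        if text.endswith(SENTENCE_END):
            units.append(text)
            pending = ""
        else:
            pending = text
    if pending:
        units.append(pending)  # keep a trailing fragment rather than drop it
    return units


# Invented subtitles: one two-line subtitle, then a sentence split over two.
units = mark_units([
    ["I am from Cairo,", "not from here."],
    ["Do you know"],
    ["the pyramids?"],
])
```

Each resulting string corresponds to one unit of analysis, so a plain-text file with one unit per line could be passed on to the annotation tool.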

Figure 8. An example from Wesh Egram of how the annotation of the subtitles of Taha’s speech in scene 100 was conducted using the UAM CorpusTool

To investigate whether linguistic varieties occur more prominently at the beginning of the film and reduce progressively towards the end of the film, the scenes were classified into two groups. The first group contains the scenes from the first half of the films and the second group includes the scenes from the second half of the films.
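The two-group classification can be sketched as follows. The text does not state whether the halves were determined by running time or by scene count; this sketch assumes running time, with a scene assigned according to the minute at which it starts. The function names, the start times and the film length are invented for illustration.

```python
from collections import defaultdict


def film_half(scene_start_min: float, film_length_min: float) -> str:
    """Return the half of the film in which a scene starts."""
    if not 0 <= scene_start_min < film_length_min:
        raise ValueError("scene start must fall within the film's running time")
    return "first half" if scene_start_min < film_length_min / 2 else "second half"


def group_scenes(scene_starts: dict[int, float],
                 film_length_min: float) -> dict[str, list[int]]:
    """Group scene numbers by the half of the film they belong to."""
    groups = defaultdict(list)
    for number, start in sorted(scene_starts.items()):
        groups[film_half(start, film_length_min)].append(number)
    return dict(groups)


# Invented start times (in minutes) for a hypothetical 100-minute film.
example = group_scenes({1: 2.0, 8: 41.5, 12: 63.0, 14: 95.0},
                       film_length_min=100.0)
```

Once the scenes are grouped in this way, the frequency of each linguistic variety can be compared between the two groups.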

In document Representation and subtitling of linguistic varieties in Egyptian films (Page 101-106)