Corpus processing and corpus analysis tools

Chapter 6. Methodology

6.1 Corpus design and corpus analysis tools

6.1.2 Corpus processing and corpus analysis tools

Various tools are now available for the automatic processing of digital corpora, including software for studying metaphor in discourse such as Atlas.ti, NVivo and Vis Dis (for a detailed overview of these applications, see Kimmel, 2012). For the purposes of this study, two software products were initially chosen: WordSmith Tools version 4 (Scott, 2004) and Paraconc version 1 (Barlow, 2008). WordSmith Tools was used at an early stage of this research because of its popularity in metaphor studies. Despite the existence of more recent versions, version 4 remains the only version compatible with the Arabic language. WordSmith Tools, however, has limitations when it comes to bilingual or multilingual corpora. Its major shortcoming is the alignment option: although texts can be aligned, there is no way to query the resulting parallel corpus. Paraconc, on the other hand, is designed to work with parallel corpora and offers both alignment and parallel query options.

Both WordSmith and Paraconc are desktop-based tools and are limited in the size of the data they can handle and in processing power. For this reason, a switch was made later in the research to a different tool, namely Sketch Engine. Sketch Engine is a powerful internet-based tool, offering more space for data storage and an easy way to query the corpus (Arts et al., 2014; Kilgarriff et al., 2004).

Sketch Engine presents the following advantages: its structure is well developed, and it is designed to work with different languages, including English and Arabic. Once uploaded to Sketch Engine, the corpus can be tagged for parts of speech (POS) in both languages. Additional tagging options are available. Both the monolingual corpus and the parallel corpus can be searched for specific annotation or mark-up.

Sketch Engine also gives access to large corpora, such as the TenTen corpus family, which are searchable online with its tools. In the present study, we use the Arabic TenTen (arTenTen) corpus as a reference corpus, as an alternative to a general language dictionary, to help identify linguistic metaphors in the A&A Arabic subcorpus, as will be seen in 6.2.2.

The decision to switch from WordSmith (Scott, 2004) and Paraconc (Barlow, 2008) was made after experiencing difficulties in aligning and querying the parallel bilingual English-Arabic corpus. Paraconc allows a parallel search for specific lexical units. However, it does not allow a search for specific annotated units. Sketch Engine, on the other hand, makes it possible to search for annotated lexical items using the ‘corpus query language’ (CQL) option.

Sketch Engine was launched in 2004 (Kilgarriff et al., 2004) and has been improved over the last 12 years. In what follows, the main functions of Sketch Engine are described. Arts et al. (2014) report that Sketch Engine is both a tool and a service. As a tool, it offers core functions, some of which, such as the keyword list and the concordance, are similar to those offered by other available tools. In addition, it is an internet-based service. Users take out a subscription (a 2-year subscription in the case of the present study) and are offered several options: searching the corpora available via Sketch Engine, such as the BNC; building new corpora using the WebBootCat tool; or uploading their own corpora. The last option was chosen for the current study. The A&A corpus texts were first aligned using Trados SDL 2015 software and then uploaded to Sketch Engine in TMX format.

The key functions of Sketch Engine that were used for the analysis of the corpus are:

- Word Sketch, which is “a one-page summary of a word’s grammatical and collocational behaviour” (Arts et al., 2014:9).

Figure 6.1 below shows a word sketch for the lexical unit ‘universe’ in the A&A English subcorpus. This function offers information on the use of the word in the A&A English subcorpus, such as statistical information (364 occurrences in 75,124, i.e., 4.2 per million). It also summarises all modifiers of the word ‘universe’ and their frequencies (e.g. early, observable, entire, inflationary), as well as the verbs, prepositions and adjectives with which ‘universe’ collocates. This information is useful not only to the language learner but also to the translator, as it provides corpus-based evidence of how words are used in context.

Figure 6.1 Word Sketch for the English word ‘universe’ in the A&A English subcorpus

- Concordance: The concordance tool in Sketch Engine is similar to all other concordancers. It provides all occurrences of a word within a specified breadth of context. The results of the concordance can be sorted in various ways. The search can be conducted in different ways: as a simple search, or by word, lemma, phrase, character or corpus query language (CQL).
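The idea of a concordance with a specified breadth of context can be illustrated with a minimal keyword-in-context (KWIC) sketch. This is not how Sketch Engine itself is implemented; the function name `kwic`, the `width` parameter and the sample sentence are assumptions introduced here purely for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, not taken from the A&A corpus.
sample = "The early universe was hot. The observable universe is vast."
for line in kwic(sample, "universe", width=15):
    print(line)
```

Widening or narrowing `width` corresponds to changing the breadth of context shown around each hit.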

The CQL search is relevant as it allows the retrieval of linguistic metaphors from the bilingual corpus, either as a whole or by type (direct, indirect). The linguistic metaphors and the metaphor signals were each assigned a code, as specified in Table 6.2 below.

Table 6.2 Special codes used to annotate the corpus

CODE   MEANING
MRW    Linguistic metaphors
FMM    Direct metaphors
WXF    Indirect metaphors
Mf     Metaphor signal (metaphor flag)

The code MRW is the abbreviation for metaphor-related words, the term used by Steen et al. (2010b) for linguistic metaphors. FMM is an arbitrary code chosen at an earlier stage to allow a search in Microsoft Word for direct metaphors. The code had to avoid matching any recognisable chunk of a word that might be highlighted when using the search function in Word. For the same reason, indirect metaphors were annotated using the code “WXF”. The code “Mf” is the abbreviation used by Steen et al. (2010b) for metaphor flag, their term for metaphor signal.
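The requirement that an arbitrary code should not match any chunk of an ordinary word can be checked mechanically. The sketch below assumes a case-insensitive substring search, similar to a default text search; the function name `collides`, the toy vocabulary and the counter-example code `MET` are hypothetical and are not part of the annotation scheme.

```python
def collides(code, vocabulary):
    """Return the words that contain `code` as a substring, ignoring case."""
    needle = code.lower()
    return sorted(word for word in vocabulary if needle in word.lower())


# Toy vocabulary for illustration only, not the vocabulary of the A&A corpus.
vocabulary = {"universe", "galaxy", "inflationary", "observable", "metaphor"}

# FMM and WXF are the arbitrary codes from Table 6.2; MET is a hypothetical
# code that would be a poor choice because it occurs inside 'metaphor'.
for code in ("FMM", "WXF", "MET"):
    print(code, collides(code, vocabulary))
```

A code is safe for this purpose when the returned list is empty for the full vocabulary of the texts being annotated.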

The CQL function allows the user to query the corpus using one of these annotations. For instance, a search for all direct metaphors is conducted by typing the command <FMM/>, as shown in Figure 6.2 below.
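As a minimal sketch, the query strings for the annotation codes can be generated from the code table. Only the query <FMM/> is attested in this section; the queries for the other three codes are built by analogy, on the assumption that every code was uploaded as the same kind of markup. The names `CODES` and `structure_query` are introduced here for illustration.

```python
# Annotation codes and their meanings, as listed in Table 6.2.
CODES = {
    "MRW": "linguistic metaphors",
    "FMM": "direct metaphors",
    "WXF": "indirect metaphors",
    "Mf": "metaphor signal (metaphor flag)",
}


def structure_query(code):
    """Build a CQL query of the form <CODE/> for a known annotation code."""
    if code not in CODES:
        raise ValueError(f"unknown annotation code: {code!r}")
    # Assumption: each code is queried like the attested <FMM/> example.
    return f"<{code}/>"


for code, meaning in CODES.items():
    print(f"{structure_query(code):8} {meaning}")
```

Rejecting unknown codes guards against typing errors in a query, which would otherwise simply return no hits.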

Figure 6.2. CQL for direct metaphors in the A&A English corpus

For a detailed description of the CQL function in Sketch Engine and the underlying computational model, see Jakubíček, Kilgarriff et al. (2013).

All queries except CQL queries can be carried out on both monolingual and parallel corpora in Sketch Engine. The CQL query, however, is not available for parallel corpora with customised markup uploaded to Sketch Engine as TMX files. The Sketch Engine support team was contacted and agreed to fix the problem manually in the A&A corpus as a one-off task. This function is used to retrieve all the linguistic metaphors from the A&A parallel corpus in one go. Figure 6.3 shows a sample of the results of a search for direct metaphors in the English A&A corpus and Figure 6.4 shows a sample of the CQL search in the parallel A&A corpus.

It is worth noting here that alignment is an important step in building the parallel corpus. Following Frankenberg-Garcia (2009b), the alignment was done at the ST sentence level. Whenever an ST sentence corresponded to more or less than one sentence in the TT, the TT sentences were merged or split to match the original sentence. However, unlike Frankenberg-Garcia (2009b), who used a blank space to match sentences that are not translated in the TT, the ST sentence was copied into the TT whenever a sentence was not translated, to ensure that it was imported into the translation memory. This was done to keep a record of instances where whole passages containing a metaphor are deleted in the TT, because the alignment function in Trados SDL does not import segments with no matching translation into the translation memory.
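The alignment convention described above can be sketched in a few lines. The sketch is not the Trados workflow used in the study: it assumes that the TT sentences have already been grouped by hand under the ST sentence they translate (with a TT sentence that spans two ST sentences split beforehand), and it writes a bare-bones TMX file with the Python standard library. The function names, the header values and the language codes `en` and `ar` are assumptions, and the segment texts are placeholders.

```python
import xml.etree.ElementTree as ET

# Qualified attribute name that ElementTree serialises as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def align(groups):
    """Turn (ST sentence, [TT sentences]) groups into one-to-one pairs."""
    pairs = []
    for source, targets in groups:
        if targets:
            # Several TT sentences for one ST sentence are merged.
            target = " ".join(targets)
        else:
            # Untranslated ST sentence: copy it into the TT slot so that
            # the segment is not dropped from the translation memory.
            target = source
        pairs.append((source, target))
    return pairs


def to_tmx(pairs, src_lang="en", tgt_lang="ar"):
    """Serialise aligned pairs as a minimal TMX 1.4 document (a string)."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",  # placeholders
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        unit = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Placeholder segments, not taken from the A&A corpus.
groups = [
    ("ST sentence 1", ["TT sentence 1a", "TT sentence 1b"]),
    ("ST sentence 2", []),  # not translated in the TT
]
pairs = align(groups)
print(to_tmx(pairs))
```

Because an untranslated segment carries identical text on both sides, such segments can later be found by comparing the two sides of each translation unit.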

Figure 6.3 Sample of the results of the CQL for annotated direct metaphors in the English A&A corpus

Figure 6.4 Sample of the results of the CQL for linguistic metaphors in the A&A parallel corpus

- Word list and keyword list: As its name indicates, this function produces a list of the words in the corpus. It can be either a simple list (all words and their frequencies) or a keyword list. The keyword list is obtained by selecting the desired reference corpus, and Sketch Engine offers a wide range of reference corpora to choose from (BNC, COCA, Europarl, TenTen, etc.). The word list can be obtained for words, lemmas, collocates (n-grams to be specified), and terms.
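The principle behind a keyword list, namely comparing the frequency of a word in the focus corpus with its frequency in a reference corpus, can be sketched as follows. This section does not state which keyness statistic was applied; the sketch assumes the ‘simple maths’ ratio of smoothed per-million frequencies described in the Sketch Engine documentation, with a smoothing constant of 1. The function names and all counts are invented for illustration.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Ratio of smoothed per-million frequencies (focus over reference)."""
    focus_fpm = per_million(focus_count, focus_size)
    ref_fpm = per_million(ref_count, ref_size)
    # Adding n to both sides avoids division by zero for unseen words.
    return (focus_fpm + n) / (ref_fpm + n)


# Invented counts: (count in focus corpus, count in reference corpus).
FOCUS_SIZE, REF_SIZE = 100_000, 1_000_000
counts = {"universe": (50, 20), "the": (6_000, 60_000)}

scores = {
    word: keyness(focus, FOCUS_SIZE, ref, REF_SIZE)
    for word, (focus, ref) in counts.items()
}
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:10} {score:.2f}")
```

In this toy example a domain word scores well above 1, whereas a function word that is equally frequent in both corpora scores exactly 1 and would not appear near the top of a keyword list.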

Sketch Engine offers other functions which are not discussed here as they are not relevant to this study, namely the thesaurus and sketch difference options which are of interest to lexicographers (Arts et al., 2014; Kilgarriff et al., 2004).

This section has covered the main questions related to the design of the corpus and the analysis tools as well as the different query types used to search for linguistic metaphors in the A&A English subcorpus and the A&A bilingual corpus. The next section describes the method used in the present study to identify linguistic metaphors in English, then in Arabic.

In document The translation of metaphors in popular science from English into Arabic in the domain of astronomy and astrophysics. (Page 103-110)