CHAPTER 4 Methodology

4.2 Corpora, data collection and analysis

4.2.2 Data collection and methods of analysis

Large samples of instances of green and zielony were retrieved from each corpus. The aim was to analyse 5,000 examples in each language in each period of time, giving a total number of c.20,000 examples analysed. This number was decided on for two reasons: it was considered a large sample in terms of detailed semantic analysis (for example Gieroń-Czepczor (2011) analysed samples ranging between 1,500 and 3,000 examples), and also, due to the corpora limitations discussed below (4.2.2.1), it would not have been possible to obtain larger samples that were fully comparable across all four datasets. Due to the problems and limitations discussed in 4.2.2.1, the earlier sets of data (Polish and English) contained fewer examples than the later sets and there were slightly fewer examples in the earlier Polish data than in the earlier English data (4,643 and 4,764 respectively). Datasets for the later periods of time, however, contained 5,000 examples each.

4.2.2.1 Limitations of corpora

Over the years corpora have proved fruitful in many aspects of semantic analysis (e.g. Sinclair, 1991; Geeraerts, Gevaert, and Speelman, 2012; Sagi, Kaufmann and Clark, 2012). In order to analyse how colour terms are used in a language, corpora of real texts are the best tools. Although using corpora is extremely advantageous, there were some limitations to my study. One such limitation is the different composition of corpora, especially when working with two languages. This is an issue that many researchers working on two or more languages face. Gieroń-Czepczor (2011), for example, used the BNC for English and PWN (Polskie Wydawnictwo Naukowe) corpus for Polish in her semantic analysis. Additionally she used COCA for comparing the frequencies of BCTs in British and American English and the PELCRA Reference Corpus of Polish for the frequencies of BCTs in Polish. Both PWN (40 million words) and PELCRA (100 million words) are now part of the NKJP corpus. She acknowledges, in connection with the corpora she used, that ‘the Polish and British corpora, let alone the American one, are incompatible in terms of size, composition, tagging and statistical tools’ (Gieroń-Czepczor, 2011:36). As far as this thesis is concerned, tagging and statistical tools are not an issue. The problem of size was dealt with by making and analysing a sample of 5,000 examples per period of time. The three corpora are not identical in terms of composition: some have small amounts of genres that do not appear in others. But all three (BNC, COCA and NKJP) are large, so the overwhelming comparability of the written data reduces the significance of minor differences.

The fact that the BNC is a corpus of British English, and COCA a corpus of American English, offers an opportunity to compare the two varieties. Due to increasing globalization and contact between them, it was anticipated that semantic differences between American and British uses of green would not be significant enough to undermine the diachronic aspect of the study. Any that were found, however, would be of interest in their own right, and are discussed in Chapter 5.

As far as the Polish corpus is concerned, the main problem that was encountered was that only part of the corpus is balanced. This was especially problematic for the earlier data. In order to analyse 5,000 examples, I had to use the unbalanced part of the corpus, as otherwise the number of results would have been much smaller (80% of the texts in NKJP were written after 1990). Indeed, even though I used the unbalanced part of the corpus, I still did not have 5,000 examples, but slightly fewer (4,760); therefore, in order for my data to be as similar as possible in terms of the numbers of analysed words, I analysed the same number of examples from the BNC. However, once I had collected the data it turned out that there were duplicates in the texts that I could not replace with new examples (see 4.2.2.2); the earlier Polish data therefore contains 4,643 examples.

4.2.2.2 Retrieval

Samples of examples were retrieved from English and Polish corpora. For English green, the search was a simple ‘green’ in both search engines in the BNC and COCA. In COCA, the dates 2001-2010 were selected. There was no need to select dates in the BNC, as the vast majority of the texts are from the target period 1985-1994. The sample of 5,000 had to be collected in several stages: I was not able to retrieve a sample larger than 1,000 at a time and, due to my access level restrictions, I was not able to save more than 3,000 examples a day. Therefore, the data were collected over a period of a few days, in smaller samples, which together gave the required number of examples. If duplicates were found, they were removed and replaced with new examples.
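The staged collection described above (several smaller batches pooled into one sample, with duplicates dropped) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pool_batches`, the whitespace and case normalisation used to detect duplicates, and the example sentences are all hypothetical and are not taken from the BNC or COCA interfaces.

```python
def pool_batches(batches, target_size):
    """Merge retrieved batches into one sample, skipping duplicate lines.

    Assumption: two concordance lines count as duplicates when they are
    identical after collapsing whitespace and lowercasing.
    """
    seen = set()
    sample = []
    for batch in batches:
        for line in batch:
            key = " ".join(line.split()).lower()
            if key in seen:
                continue  # duplicate: skip it, a later line fills the slot
            seen.add(key)
            sample.append(line)
            if len(sample) == target_size:
                return sample
    return sample  # fewer than target_size unique lines were available


# Invented example lines, standing in for separately retrieved batches
batches = [
    ["The green door was open.", "She wore a green dress."],
    ["She wore a  green dress.", "Green policies were debated."],
]
sample = pool_batches(batches, target_size=3)
```

When the batches do not contain enough unique lines, the function simply returns a shorter sample, which mirrors the situation reported for the earlier Polish data.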

For Polish zielony, in the search engine the word zielony was followed by a wild card [**] (inflectional search) which allowed me to find inflectional variants (Pęzik, 2012:257). Polish has a rich inflectional system, so such a wild card was necessary in order to include inflectional variants. Searching for zielony only would not include, for example, feminine or neuter forms, such as zielona sukienka (green dress) or zielone jabłko (green apple) respectively. Such a search also listed the word zieleni, which can be a verb or a noun zieleń in the genitive, dative, locative or vocative case (the word zieleń did not appear in the results). This demonstrates that languages as different as English and Polish require different approaches when it comes to retrieving data from corpora. Retrieval of data from the Polish corpus can be done by means of two search engines: Poliqarp and PELCRA (Pęzik, 2012:253-254). For the purpose of my research, the latter was used. PELCRA proved especially useful because it allowed me to search for zielony** in two separate periods of time.
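The difference between an exact search and an inflectional search can be illustrated with a small Python sketch. It does not reproduce PELCRA's query engine: the hand-listed set of adjectival forms, the function name `find_forms` and the example sentence are assumptions made for illustration only, and the set deliberately leaves out related words such as zieleni.

```python
import re

# Hand-listed positive-degree forms of the adjective "zielony".
# Assumption: an illustrative list, not output from the PELCRA search engine.
ZIELONY_FORMS = {
    "zielony", "zielona", "zielone", "zielonego", "zielonej", "zielonemu",
    "zielonym", "zieloną", "zieloni", "zielonych", "zielonymi",
}


def find_forms(text, forms):
    """Return the tokens of `text` (lowercased) that belong to `forms`."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token in forms]


# Invented example sentence
text = "Zielona sukienka i zielone jabłko leżały obok zielonego stołu."

exact = find_forms(text, {"zielony"})      # search for the base form only
inflected = find_forms(text, ZIELONY_FORMS)  # search for all listed forms
```

In this sentence the exact search finds nothing, whereas the inflectional search finds the feminine, neuter and genitive forms.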

I searched for zielony** in two periods of time: 1985-1994, which was roughly equivalent to the data from the BNC, and 2001-2010, which was equivalent to the dates in COCA. As already explained, because of the lack of a sufficient number of examples in the balanced part of the earlier data, I used the unbalanced part of the corpus and analysed all the examples of zielony**. As far as the later data is concerned, because there were over 18,000 examples of zielony**, I had to choose the best way of grouping the examples in order to get as much variety as possible. Therefore I retrieved a sample of 5,000 by getting 5,000 examples from 5,000 different texts. Because there were a few (c.10) duplicates, I removed them and replaced them with new examples. This time, however, I did not select the option of choosing one text, because this set of examples would be added to the existing one and the probability of repetition was high: instead, therefore, I retrieved a sample of ten examples of zielony**, and this time the result was ten examples from two texts. The basic information about a text is its title and author (Górski and Łaziński, 2012:22), so for example in my sample there would be a few examples of zielony** from the same daily paper or magazine, but each example would be from a different article in it. This procedure was followed in order to compensate for the unbalanced part of the corpus that was used for getting the earlier sample of zielony**.
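The idea of drawing examples from as many different texts as possible can be sketched as follows. This is a minimal illustration, not the PELCRA option itself: the function name `sample_one_per_text`, the `(text_id, concordance_line)` data layout, the fixed random seed and the example hits are all hypothetical.

```python
import random


def sample_one_per_text(hits, k, seed=0):
    """Draw k hits so that no two of them come from the same text.

    `hits` is a list of (text_id, concordance_line) pairs (assumed layout).
    """
    rng = random.Random(seed)
    by_text = {}
    for text_id, line in hits:
        by_text.setdefault(text_id, []).append(line)
    # Keep one randomly chosen line per text, then sample k of those
    one_each = [(text_id, rng.choice(lines)) for text_id, lines in by_text.items()]
    if k > len(one_each):
        raise ValueError("not enough distinct texts for the requested sample")
    return rng.sample(one_each, k)


# Invented hits: three from one article, one each from two others
hits = [
    ("article_A", "zielone światło dla projektu"),
    ("article_A", "zielona trawa"),
    ("article_A", "zielony sweter"),
    ("article_B", "zielone jabłko"),
    ("article_C", "zielona herbata"),
]
picked = sample_one_per_text(hits, k=2, seed=1)
```

Restricting the draw to one hit per text trades sample size for variety: a text with many hits cannot dominate the sample.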

4.2.2.3 Qualitative analysis and categorization

Once the data had been retrieved, all examples of green and zielony were analysed in the contexts in which they appeared. I analysed their meanings with a view to identifying prototypical examples of each sense alongside examples which might border on other senses and illustrate semantic change in progress. While it was possible to identify meanings for most of the examples, there were ambiguous examples in each set of data. Whenever possible, meanings were assigned to them; in really problematic cases they were left unanalysed, although they are still included in the statistics.

As the networks of senses in Chapters 5 and 6 will demonstrate, there were many meanings of green and zielony identified in my data. Categories were created while analysing the data: no prior categories were assumed. Although most of the categories were included in the analyses and networks, some were excluded. The excluded examples are names and titles of all kinds, such as company names, club names, geographical names, group names, nicknames, place names, surnames and titles. As will be presented in Chapters 5 and 6, in cases where names are important, these are referred to and/or discussed in detail. Such exceptions are explained in the analysis. Additionally, some that are not mentioned in Chapters 5 or 6 (such as place names and surnames) are briefly referred to in Chapter 7, but they are not part of the semantic networks. Although the categories that were included in the analysis should, in most cases, be self-explanatory, there were occasions where categories were included within other categories. Such information, whenever necessary, is included in the categories in question in Chapters 5 and 6.

The categories are the result of my own research and analysis. Although I had access to previous studies (see Chapter 3), the OED and other dictionaries, the categories are the result of the data from the corpora used.

Once the categories were established, it became evident that each sense of green and zielony could be treated as a separate prototype, although, as was explained in Chapter 3, green and zielony have their etymological prototype which is plants and their parts.

4.2.2.3.1 Networks of senses

The networks of senses are a visual representation of the qualitative analysis presented in Chapters 5 and 6. They were created in order to show the polysemous characters of green and zielony in a graphic form. My networks differ from those discussed in Chapter 3 in four main aspects. Firstly, my networks show the semantic changes and developments in greater detail than those presented in Chapter 3. Moreover, a number of the senses that are included in my networks are not present in the previous networks. Secondly, as the examples included in the discussion in Chapters 5 and 6 were selected in order to illustrate both prototypical examples of each sense, and peripheral examples that might throw light on semantic change (see section 4.3), each sense is considered to be a separate prototype. Thirdly, unlike previous networks, my English and Polish networks are diachronic, that is they demonstrate green and zielony in two periods of time and show if a given meaning was present in both or only one period. Fourthly, the analysis demonstrated that not only are metaphor and metonymy (and metaphtonymy) the main mechanisms of change, but blending is too, therefore meanings which developed as a result of blending are also included in the network. As already mentioned, Steinvall’s (2002) theory of type modification is incorporated, and it is argued that type modification is a form of blending. Whereas the networks are the result of my own analysis of the corpus examples, showing semantic change was aided by the information in the OED: that is, dates of the first recorded uses of certain senses. There were, however, problematic cases where the development was considered ambiguous, as discussed in Chapters 5 and 6.

One of the challenges in creating the Polish network of senses was the lack of an equivalent to the OED in Polish, a dictionary which would list first recorded meanings of zielony. The only first recorded meaning of zielony is its original meaning of colour (see Chapter 6). For this reason, the English network was created first and used as a template to throw into relief the differences between the two networks.

In the networks and analyses in Chapters 5 and 6, all meanings of zielony and green and all stages of development are presented in the form of codes. The codes start with E and P for English and Polish respectively. Moreover, the networks provide information on whether a given meaning was found in just one or both periods of time. This is shown through the numbers 1 and 2: 1 refers to the later and 2 to the earlier period of time. For example E1 refers to the later English period, whereas E2 refers to the earlier English period. Additionally, each stage is considered a separate prototype.

Therefore my networks provide three types of information. They show the category centre: the prototype and connections between more and less central categories; they show how different senses of these polysemous words developed; and they indicate whether a given meaning is found in one or both periods of time analysed.
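The language and period prefix of the codes can be made explicit with a small lookup sketch. Only the two-character prefix described above (E or P, followed by 1 or 2) is modelled here; any further structure of the codes used in Chapters 5 and 6 is not assumed. The function name `decode_prefix` and the wording of the period labels are hypothetical.

```python
LANGUAGES = {"E": "English", "P": "Polish"}
# 1 marks the later period and 2 the earlier one, as stated in the text
PERIODS = {"1": "later (2001-2010)", "2": "earlier (1985-1994)"}


def decode_prefix(code):
    """Return (language, period) for a code beginning with e.g. 'E1' or 'P2'."""
    if len(code) < 2 or code[0] not in LANGUAGES or code[1] not in PERIODS:
        raise ValueError(f"unrecognised code prefix: {code!r}")
    return LANGUAGES[code[0]], PERIODS[code[1]]


english_later = decode_prefix("E1")
polish_earlier = decode_prefix("P2")
```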

4.2.2.4 Quantitative analysis

Once all examples were categorized, I was able to analyse the data quantitatively. The results of both the qualitative and quantitative analyses are presented in Chapters 5 and 6, and tables with the frequencies from all sets of data are also included. The number of occurrences of each prototype might indicate which meaning or meanings are the most productive and commonly used in a given language at a certain point in time.

While most categories were unproblematic in terms of what a given example of green or zielony means and which category it belongs to, some examples were considered to belong to more than one category. Many such interesting cases are discussed in Chapters 5 and 6. Therefore, although I tried to give the most exact frequency of each sense, this was not always possible due to the complex character of certain senses.
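The frequency counts, and the difficulty created by examples that belong to more than one category, can be illustrated with a short sketch. The text does not state how multi-category examples were counted, so the choice made here (counting such an example once under each of its categories) is an assumption, as are the function name `sense_frequencies` and the invented category labels.

```python
from collections import Counter


def sense_frequencies(coded_examples):
    """Return {category: (count, percentage of all examples)}.

    Each item in `coded_examples` is a tuple of the categories assigned to
    one example. Assumption: an example with several categories is counted
    once under each of them, so percentages may sum to more than 100.
    """
    counts = Counter()
    for categories in coded_examples:
        counts.update(categories)
    total = len(coded_examples)
    return {sense: (n, round(100 * n / total, 1)) for sense, n in counts.items()}


# Invented coding of four examples; the last one belongs to two categories
coded = [
    ("colour",),
    ("colour",),
    ("environment",),
    ("colour", "environment"),
]
freqs = sense_frequencies(coded)
```

Because the fourth example is counted twice, the category counts add up to five although only four examples were coded, which is one reason why exact per-sense frequencies cannot always be given.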

In document Colour and semantic change: a corpus-based comparison of English green and Polish zielony (Page 80-85)