METHODOLOGY FOR CORPUS ANALYSIS

6.4 Data gathering

There were several stages involved in gathering and preparing relevant data for the present study. As the EMAS corpus consists of untagged data, the first stage (Stage 1) involved a tagging process using CLAWS, an automated part-of-speech (POS) tagger from the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University, which is available online. As the size of the EMAS corpus is small (see 6.2), the data was tagged using the CLAWS free trial service at http://www.comp.lancs.ac.uk/ucrel/claws/trial.html. The CLAWS tagger was reported to have an accuracy rate of 96% and an error rate of 1.15% of all words in the British National Corpus (BNC) (Leech, Garside and Bryant 1994), which is considered acceptable as the BNC is a large corpus and consists of texts produced by native speakers.

As far as the EMAS corpus is concerned, tagging errors were expected to occur at a slightly higher rate than in the BNC, since EMAS is a small corpus and consists of texts produced by ESL learners, which contain many errors. Therefore, in order to increase the accuracy level, all examples extracted from the EMAS corpus were manually scrutinized to ensure they were accurately tagged before further analysis was carried out.

The purpose of POS tagging is to facilitate the analysis of data: every word in the texts is automatically classified according to its category (noun, pronoun, verb, etc.). Particles are expected to be tagged as AVP by the CLAWS tagger, and the majority of PVs are made up of LV+AVP combinations. This procedure therefore helps to extract all instances of PVs in the LV+AVP structure and to eliminate prepositional verbs (PRPVs) in the LV+PRP form. This distinction is necessary because ‘up’, for instance, can function as a PRP (in PRPVs) or as an AVP (in PVs), as shown in the examples below.

They went up the hill. (PRP): went up is a PRPV
He went up to his room. (AVP): went up is a PV

However, in a number of instances, particles (Prt) were still tagged by the CLAWS tagger as PRPs instead of as AVPs. For instance, the particle up in the PV clean up, extracted from the tagged EMAS corpus, was inaccurately tagged as a PRP in some cases (e.g. cleaned_VVD up_PRP myself_PNX) and accurately tagged as an AVP in others (e.g. to_TO0 clean_VVI up_AVP ourselves_PNX). Therefore, to increase the accuracy of the tagged data, and to ensure that the examples taken were true examples of PVs, they were also scrutinized manually.
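The kind of mistagging described above could be surfaced for manual review with a short script. The following is only a minimal sketch, not the procedure used in the study: it assumes the tagged text is in the word_TAG format shown in the examples, that lexical verb tags begin with VV (as in VVD and VVI), and the function names parse_tagged and flag_for_review are hypothetical.

```python
# Sketch only: flag verb + particle sequences where the particle was
# tagged PRP, so that they can be checked by hand.
# Assumption: tokens are whitespace-separated and look like word_TAG.

PARTICLES = {"up", "down", "out", "off"}


def parse_tagged(text):
    """Split 'word_TAG' tokens into (lowercased word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if sep:  # ignore tokens that carry no tag
            pairs.append((word.lower(), tag))
    return pairs


def flag_for_review(pairs):
    """Return (verb, particle, tag) for a lexical verb directly
    followed by one of the target particles tagged as PRP."""
    flagged = []
    for (w1, t1), (w2, t2) in zip(pairs, pairs[1:]):
        if t1.startswith("VV") and w2 in PARTICLES and t2 == "PRP":
            flagged.append((w1, w2, t2))
    return flagged


sample = "cleaned_VVD up_PRP myself_PNX to_TO0 clean_VVI up_AVP ourselves_PNX"
print(flag_for_review(parse_tagged(sample)))
# prints [('cleaned', 'up', 'PRP')]
```

A flagged item is only a candidate: a sequence such as went up the hill is a genuine PRPV, so each hit still has to be judged manually.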

After the EMAS corpus was tagged, the second stage (Stage 2) was to transfer the tagged data to language analysis software, WordSmith Tools version 5 (WS5), for further analysis. WS5 was chosen not only because it allows me to access and analyze corpus data conveniently on my computer but, most importantly, because the reliability of this corpus analysis tool has been verified by previous studies that used WordSmith Tools to analyze texts in various corpora (e.g. Flowerdew 2003; Nelson 2000; Mukundan 2004; Scott 2001; Henry & Roseberry 2001; Bondi 2001). As the present study focuses on lexical and grammatical patterns, I found WS5 very helpful. WS5 is also presented as “software for finding patterns in text” (http://www.lexicalanalysis.co.uk/LexicalAnalysisSoftware/index.htmlxxx). In addition, the Concord function in particular is useful for analyzing patterns as it “renders keywords in context ... in numerous contexts and with co-texts to the left and right....the ability to re-sort lines based on whether words preceded or followed the keyword” (Prinsloo and Prinsloo 2011: 99). Furthermore, other major functions of Concord (e.g. collocates, plot, patterns, clusters) offer detailed analysis of phraseological behaviour and patterns, the main focus of the present study.

In the second stage of data gathering (Stage 2), WS5 was used to identify and report every instance of the AVPs up, out, off and down found in the EMAS corpus. The AVP could be immediately adjacent to the lexical verb (LV+AVP) or separated from it by one word (LV+X+AVP), as in:

pick (LV) up (AVP) the phone
pick (LV) it (X) up (AVP)
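The two search patterns above can be expressed as a simple scan over tagged tokens. This is a hedged sketch rather than a reproduction of the WS5 queries: it assumes (word, tag) pairs as input, assumes that lexical verb tags begin with VV, and uses illustrative tags (AT0, NN1, PNP) for the example sentences; find_pv_candidates is a hypothetical name.

```python
# Sketch only: find LV+AVP and LV+X+AVP candidates in tagged tokens.
# Assumption: input is a list of (word, tag) pairs; lexical verbs carry
# tags starting with "VV" and particles carry the tag "AVP".

TARGET_AVPS = {"up", "down", "out", "off"}


def find_pv_candidates(pairs):
    """Return (verb, intervening word or None, particle) triples."""
    hits = []
    for i, (word, tag) in enumerate(pairs):
        if tag != "AVP" or word not in TARGET_AVPS:
            continue
        if i >= 1 and pairs[i - 1][1].startswith("VV"):
            # LV+AVP: verb immediately before the particle
            hits.append((pairs[i - 1][0], None, word))
        elif i >= 2 and pairs[i - 2][1].startswith("VV"):
            # LV+X+AVP: exactly one word between verb and particle
            hits.append((pairs[i - 2][0], pairs[i - 1][0], word))
    return hits


tokens = [
    ("pick", "VVB"), ("up", "AVP"), ("the", "AT0"), ("phone", "NN1"),
    ("pick", "VVB"), ("it", "PNP"), ("up", "AVP"),
]
print(find_pv_candidates(tokens))
# prints [('pick', None, 'up'), ('pick', 'it', 'up')]
```

Longer separations (LV+X+X+AVP) are deliberately not matched, which mirrors the restriction to two-word and three-word constructions.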

Even though the verb and particle of a PV can also be separated by two or more words (LV+X+X+AVP), the present study focuses only on the two-word (e.g. pick up) and three-word (e.g. pick it up) varieties, because occurrences of PVs with longer separations are reported to be relatively infrequent (Gardner and Davies 2007: 345). Moreover, considering the learners’ level of language learning (primary and secondary level), it would be fairly difficult for them to produce PVs with longer separations. Table 42 below summarizes the results from the data gathering process in Stage 2.

Table 42: Frequency of AVP up, down, out and off in the EMAS corpus

Adverb particles (AVPs)    LV+AVP and LV+X+AVP constructions (f)
up                         946
down                       630
out                        538
off                        80

Table 42 above shows 946 instances of AVP up, 630 instances of AVP down, 538 instances of AVP out and 80 instances of AVP off in the form of two-word (LV+AVP) and three-word (LV+X+AVP) constructions, extracted from the EMAS corpus, which will be a focus of analysis in the present study.

Following this, the next stage (Stage 3) involved WS5 queries to identify all instances of LVs and their inflections that are frequently attached to the identified AVPs (up, down, out, off) in LV+AVP and LV+X+AVP structures. As a result of this stage, a list was produced of the LV lemmas frequently attached to each of the AVPs in the two structures. For the purpose of the present study, only the top six LV lemmas most frequently attached to each of the AVPs were considered for further investigation. These top six LV lemmas were chosen because they provide sufficiently rich data for analysis purposes. A summary of the results obtained from data gathering in Stage 3 is presented in Table 43 below.
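The Stage 3 ranking of LV lemmas per AVP amounts to a grouped frequency count. The sketch below illustrates the idea under stated assumptions: the LEMMAS lookup table and the hits list are invented for illustration and are not EMAS data, and top_lemmas is a hypothetical helper, not a WS5 function.

```python
# Sketch only: rank the verb lemmas most often attached to each particle.
# Assumptions: LEMMAS is a tiny hand-made lookup and `hits` is invented
# toy data, not counts from the EMAS corpus.
from collections import Counter, defaultdict

LEMMAS = {"woke": "wake", "picked": "pick", "fell": "fall"}


def top_lemmas(hits, n=6):
    """Map each particle to its n most frequent verb lemmas."""
    counts = defaultdict(Counter)
    for verb, particle in hits:
        lemma = LEMMAS.get(verb, verb)  # fall back to the word form
        counts[particle][lemma] += 1
    return {avp: c.most_common(n) for avp, c in counts.items()}


hits = [
    ("woke", "up"), ("wake", "up"), ("picked", "up"),
    ("fell", "down"), ("wake", "up"),
]
print(top_lemmas(hits))
# prints {'up': [('wake', 3), ('pick', 1)], 'down': [('fall', 1)]}
```

With n=6 and four particles, a list built this way contains at most 24 verb and particle combinations.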

[...] was finally produced for further analysis, which includes the top four AVPs (up, down, out and off) that were found to be problematic to learners, as well as the top six LV lemmas frequently attached to these selected AVPs. Therefore, the final data consists of a total of 24 PVs, which include wake up, pick up, get up, set up, go up and make up (LV lemmas + AVP up); fall down, calm down, go down, jump down, put down and drop down (LV lemmas + AVP down); go out, come out, pull out, find out, get out and take out (LV lemmas + AVP out); and, finally, take off, show off, switch off, go off, set off and get off (LV lemmas + AVP off). To summarize, below is a chart to illustrate the three main stages involved in the data gathering process.

In document A study on the use of phrasal verbs by Malaysian learners of English (Page 188-194)