Challenges for the Direct Approach
3.4 Corpora and Writing: Challenges
3.4.7 Focus on surface form
The final, and most serious, challenge to be addressed here is the inevitable focus on surface forms in corpus work. We are basically restricted to studying features that our corpus tools can find, which means that we run the risk of focusing exclusively on the word and the phrase level when using computer-assisted methods. The challenge lies in connecting surface forms (which are easy to search for by computer) to meaning (which tends to require human analysis) – whether lexical, collocational, pragmatic or discursive. Related concerns have been voiced by Swales (2002), who argues that the computer-based orientation of corpus studies leads to atomized, bottom-up investigations of language use (see also Flowerdew 2005).
In a writing context, we are often interested in exploring specific functions of language, which do not stand in a one-to-one relation to formal realizations. For example, compare the retrieval of (a) modal verbs and (b) hedges in a corpus search. While a modal verb represents a specific lexicogrammatical form, a hedge represents a linguistic function that can be realized through several different surface forms.
In order to find instances of (a) in a corpus, we can simply list all existing modal verb forms using a grammar for reference, and then search the corpus. It would help if the corpus were tagged for part of speech, otherwise cases of homonymy would have to be checked (e.g. the verb ‘can’ versus the noun ‘can’). In order to find instances of (b) in a corpus, we would first have to try to compile a list of all the possible linguistic forms that could function as hedges. Such a list would not be found in a grammar, let alone in a dictionary. Furthermore, different people’s lists might be quite different. If we assume that this initial hurdle can be overcome, though, all the instances in the corpus would have to be retrieved (e.g. modals ‘may’ and ‘could’), and then every single one would have to be checked in order to exclude those examples that do not function as hedges, such as ‘We may now turn to the following aspect of the problem . . .’ or ‘We could not detect any statistically significant difference . . .’ (examples from Salager-Meyer 2001). Although hedging, or the degree of (un)certainty toward the propositions expressed by a writer, is a feature of great interest in writing instruction, retrieving actual instances in the classroom can be prohibitively time-consuming.
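The two-step procedure described above (retrieve every candidate form, then check each hit by hand) can be sketched roughly as follows. This is only an illustration under stated assumptions: the list CANDIDATE_HEDGES and the function kwic are hypothetical, the list is deliberately short, and the third sample sentence is invented here to supply one hit that does function as a hedge. The first two sample sentences are the non-hedging examples quoted from Salager-Meyer (2001).

```python
import re

# Hypothetical starter list of candidate hedge forms. Any real list would be
# longer and, as noted in the text, open to disagreement between analysts.
CANDIDATE_HEDGES = ["may", "might", "could", "perhaps", "possibly", "seems", "suggest"]


def kwic(text, forms, width=30):
    """Return (left context, match, right context) for every candidate form."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, forms)) + r")\b", re.IGNORECASE
    )
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


sample = (
    "We may now turn to the following aspect of the problem. "
    "We could not detect any statistically significant difference. "
    "These results may indicate a weak effect."  # invented sentence
)

hits = kwic(sample, CANDIDATE_HEDGES)
for left, word, right in hits:
    # Each line still has to be judged by a human reader:
    # the search cannot tell a hedge from a non-hedge.
    print(f"{left:>30} [{word}] {right}")
```

The search returns all three modal hits without distinguishing between them, which is the point at issue: the surface form is retrieved automatically, while the decision about function remains manual.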
50 Corpus-Based Approaches to English Language Teaching
Another example, through which we can illustrate some simple solutions to this problem, is attribution, or references to other sources. How does one locate examples of attribution in the corpus classroom? Three ways that have been suggested for finding simple indicators of, or proxies for, attribution are the following: (1) Tribble (2002: 143–144) reports that asking learners to ‘look at where and how parentheses are used is an excellent way of beginning an investigation of citation practices – especially in that once the parenthesised citations have been identified, it is then easy to follow up how (and with which verbs, in which structures) the proper nouns which occur in such lists are used in the text’. (2) Provided that the texts involved are not too quantitative in character, very simple search strings such as 200* and 19** would retrieve common citation years.2 (3) Proper names can also be retrieved, based on the names listed in the bibliographies of the corpus texts (cf. Ädel and Garretson 2006). There is only so much time we can ask our students to put into interpretation and analysis, so there is an urgent need to be inventive in retrieving relevant examples.
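The first two proxies can be approximated with regular expressions, as in the sketch below. It rests on assumptions that should be stated plainly: the wildcard strings 200* and 19** are read here as four-digit numbers beginning with 200 or 19, which may differ from how a given concordancer interprets them, and the names PAREN, YEAR and citation_candidates are hypothetical. The last sample sentence is invented to show the kind of false hit that quantitative text produces.

```python
import re

# Text inside one pair of parentheses (no nesting), as a proxy for
# parenthesised citations.
PAREN = re.compile(r"\(([^()]*)\)")

# Assumed regex reading of the search strings 200* and 19**:
# a four-digit number from 1900-1999 or 2000-2009.
YEAR = re.compile(r"\b(?:19\d\d|200\d)\b")


def citation_candidates(text):
    """Return the parenthesised stretches that contain a year-like number."""
    return [m.group(1) for m in PAREN.finditer(text) if YEAR.search(m.group(1))]


sample = (
    "Related concerns have been voiced by Swales (2002), who argues that "
    "corpus studies lead to atomized investigations (see also Flowerdew 2005). "
    "The method (described below) was applied to 1950 texts."  # invented sentence
)

year_hits = YEAR.findall(sample)
paren_hits = citation_candidates(sample)
print(year_hits)
print(paren_hits)
```

Searching for years alone also retrieves the text count in the final sentence, whereas combining the year pattern with parentheses keeps only the two citations. Neither search says anything about how the cited source is used, so the hits remain a starting point for the learner's own analysis.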
The examples above illustrate the fundamental issue at stake in the seventh challenge: the form-function split in human language. The challenge for corpus work is to find mappings between functional categories (such as politeness, evaluation or metadiscourse), which are very important in writing, and surface forms. Corpus linguists need to consider this split more thoroughly in order to make progress in corpus-based analysis of text and discourse.
One way in which we can make progress is through the use of annotation, or mark-up. The annotation of corpus text to allow for searches above the word level has been suggested in corpus stylistics (see Wynne 2005), and it also represents a promising avenue for creating corpora that are more useful in the classroom. A corpus annotated for features such as hedging and attribution, exemplified above, would be highly useful in the EAP classroom.
One relatively simple example of annotating a written corpus is marking all quoted material (typically explicitly marked by quotation marks, or set off in block quotes) as distinct from the running text. It is a serious restriction in present-day written corpora that there is no automatic way of making a distinction between the current writer’s text on the one hand and quoted text on the other. This presents an unnecessary obstacle to many studies of writing, specifically those concerning the choices that writers make in their texts. In the case of students researching corpus frequencies, for example, it means having to spend time looking at the co-text of every single occurrence in order to determine whether it is a case of the current writer’s text or quoted material. To give an example, when investigating writer visibility in research papers, one may end up with thousands of hits for ‘I’, particularly in papers from the social sciences, which tend to include text from informants, questionnaires, etc.
Figure 3.1 shows a concordance sample of ‘I’ from a collection of research articles in the humanities, where the occurrences have been grouped into three categories.
In the first group, what you see is not an academic writer, but an informant or interviewee, speaking. While the second group consists of metalinguistic quotes, the third group is the only relevant one for anyone researching academic writer visibility. This illustrates how a relatively simple annotation of quoted material would enable more efficient corpus searches.
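A minimal sketch of what such an annotation might make possible is given below. It is an illustration only: the `<quote>` element name, the functions mark_quotes and hits_outside_quotes, and the sample text are all hypothetical, and the pattern recognizes double quotation marks only. Block quotes, and single quotation marks (which are hard to tell apart from apostrophes), would need separate treatment.

```python
import re

# Double-quoted stretches, straight or curly. Block quotes and single
# quotation marks are not handled in this sketch.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')


def mark_quotes(text):
    """Wrap quoted stretches in a hypothetical <quote> element."""
    return QUOTED.sub(lambda m: f"<quote>{m.group(0)}</quote>", text)


def hits_outside_quotes(annotated, word):
    """Count whole-word matches of `word` outside <quote>...</quote> spans."""
    running_text = re.sub(r"<quote>.*?</quote>", " ", annotated, flags=re.DOTALL)
    return len(re.findall(rf"\b{re.escape(word)}\b", running_text))


# Invented sample: two instances of 'I' by the writer, two by an informant.
sample = (
    "In this section I argue that visibility varies. "
    'One informant said, "I never know when I should cite." '
    "I return to this point below."
)

annotated = mark_quotes(sample)
all_hits = len(re.findall(r"\bI\b", sample))
writer_hits = hits_outside_quotes(annotated, "I")
print(all_hits, writer_hits)
```

Once quoted material is marked, a search restricted to the running text leaves only the writer's own uses of ‘I’, so the co-text of each hit no longer has to be inspected to rule out informant speech.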