My approach to the analysis of the influence of prescriptivism on language change conceptualises language change by drawing on descriptive or usage-based accounts of language change, which analyse frequency patterns of linguistic features. Patterns of variation and change are thus determined by correlating linguistic variables with extralinguistic factors such as time, genre, register, etc. (Nevalainen 2006a: 560–561). Studies of language variation and change in this vein, associated for instance with historical sociolinguistics, have established that when it comes to language change, “the picture that emerges is one of gradual evolution rather than abrupt change” (Mair and Leech 2006: 319). Language variation and change is thus assessed on the basis of identifying “shifting frequencies of use for competing variants”, one of which is prescriptively targeted (Mair and Leech 2006: 319); this allows for the importance of prescriptive influence or effect to be assessed. The approach taken in sociolinguistic and descriptive corpus-based research on language variation takes changes in frequency patterns to be indicators of language change (cf. Mair 2006: 2). Consequently, if we take this approach to language change, and if we manage to show that changes in frequencies across time are constrained by prescriptivism-related factors, we can posit an influence of prescriptivism on language change.
The data for the analysis of the actual usage patterns for each feature were extracted from the Corpus of Contemporary American English (COCA, Davies 2008–present) and the Corpus of Historical American English (COHA, Davies 2010b), both created and maintained by Mark Davies at Brigham Young University. Although both corpora are available for online use via the BYU interface, the analysis conducted for this study required the use of the full-text data, which can be purchased in three different formats: plain text, word/lemma/part-of-speech-tagged files, and SQL-database format. This makes wide-ranging exploration of the data possible, in terms of the analysis of many different features, as well as many different aspects
106 4.4. Actual language use: data and analysis
of the texts included in the corpora, such as the type-token ratio of each separate text in the corpus. The full-text data come with some limitations, due to texts in the corpora being affected by copyright restrictions. Consequently, the version of the corpus data available for purchase has been transformed, in order to abide by copyright restrictions. This transformation consists in the removal of ten out of every 200 words of text. This ensures legal use of the copyrighted text in such a way that the strength of the corpus and the validity of the results obtained are not seriously affected, because, as stated on the website, “95% of the data is still there”.4This means however, that the results obtained in this analysis will be somewhat different, though not significantly so, from the same results obtained through the corpora’s online interface.
The COCA corpus is described by its creator as “the first reliable monitor corpus of English” spanning the period 1990–2015 (Davies 2010a: 447; see also Davies 2008–present, 2009), and, as a monitor corpus, it is regularly updated with new data. As of March 2017, according to its website, COCA contained more than 520 million words of text, or 20 million words for each year. The full-text corpus data used for this analysis cover the period 1990–2012; thus the total number of words is lower than 520 million. The COCA corpus is divided into five sections: academic, fiction, magazines, newspapers, and spoken. Davies (2009: 161–162) describes in more detail the selection and sources of materials in COCA, and the specific subcategories for each section are given in Table 4.2. COHA can perhaps be described to some extent as the historical counterpart to COCA. It contains 400 million words, divided into four sections: fiction, magazines, newspapers, and non-fiction books (Davies 2012). These four sections are further subdivided into subgenre categories, given in Table 4.2. A complete list of sources for both corpora is available on the website, while Davies (2009, 2010a, 2012) describes in detail how the corpora have been created and how they can be used to study language variation and change in different ways.5
It is important at this point to raise the question of what the different corpus sections, and their subcategories, represent in terms of register variation in historical and present-day American English. The understanding of the nature of texts in the two corpora is instrumental in interpreting the results, for two main reasons. The first one relates to the comparability of the two corpora. In other words, it is important to understand in what ways the two corpora are similar, as well as where they differ, so that the interpretation of the results can take this into account. The general intention
4For more details see http://corpus.byu.edu/full-text/limitations.asp.
5I use the terms ‘genre’ and ‘register’ interchangeably when referring to the various corpus sections.
COCA this text COHA Academic Non-fiction History Education Geography/Social Science Law/Political Science Humanities Philosophy/Religion Science/Technology Medicine Miscellaneous Fiction Fiction
General (Books) Drama (Plays)
General (Journal) Movie Scripts
Science Fiction Novels
Juvenile Poetry
Movie Scripts Short stories
Magazine Magazine News/Opinion No subgenres Financial Science/Technology Society/Arts Religion Sports Entertain Home/Health African-American Children Women/Men Newspaper Newspaper Miscellaneous No subgenres International News National News Local News Money Life Sports Editorial Spoken No subgenres
108 4.4. Actual language use: data and analysis
underlying the use of these two corpora was that they would provide comparable evidence, at least for some registers, so that the development of usage could be traced over a longer period of time. In the case of fiction, for instance, this is quite straightforward (Table 4.2).
In terms of comparability, COCA and COHA contain similar material for the fiction, magazines, and newspapers sections in both corpora. For the historical corpus, some text types, such as movie scripts, newspapers, and magazines, are only available from the twentieth century onwards. A spoken section is only available in COCA, so my analysis of spoken data is restricted to the period 1990–2012. Finally, COCA has an academic section, while COHA has a non-fiction section. While these two are different categories, it seems a priori not unreasonable to assume that they are not dissimilar, given that they both contain a preponderance of relatively formal texts. As we will see in Chapter 6, this assumption is supported by the frequency patterns observed in these two genres. An additional reason for this consideration is that the organisation of the subcategories of texts for both the academic section of COCA and the non-fiction section of COHA is according to the general Library of Congress classification system (cf. Davies 2009: 162; Davies 2012: 124–125).
Finally, a limitation of COCA, when it comes to working with these subcategories, is that the subsections included in the spoken section are not based on actual genres of spoken language, but on the sources from where the texts were obtained. On the naturalness of the spoken data, Davies notes that they “are based almost entirely on transcripts ofunscriptedconversation on television and radio programs” (Davies 2009: 162) and that these texts are accurate and spontaneous; in other words, they provide a good representation of non-media English. The corpus data were then used to extract all occurrences of the various linguistic features with the help of Python scripts. In most cases, occurrences were extracted from the part-of-speech-tagged files of the corpora, using regular expressions. The use of the part-of-speech-tagged data greatly facilitated further analysis of each occurrence.
The data analysis can be divided into two parts. In the first part of the analysis, I look at the frequency of occurrence of a particular feature across subsections of the corpora, as well as across time periods. For the analysis of patterns of use, I will rely on the two approaches used in corpus-based linguistic analysis of variation and change suggested by Biber et al. (2016): the text-linguistic approach and the variationist approach. As Biber et al. (2016) show, these two approaches can sometimes produce different results, as an analysis of the frequency distributions of one variant is a good indication of the rate of occurrence of that variation, while the results of a
variationist analysis allow us to “report proportional preference; they do not actually tell us how often a listener/reader will encounter these structures in texts” (Biber et al. 2016: 359). Leech and Smith (2009: 176–178) also discuss the difference between variationist and text-linguistic measures of frequencies of usage, but use the terms “proportionate” method and “normalisation” method to refer to variationist and text-linguistic approaches, respectively. What a variationist account shows us is how often a variant occurs, in the context of two or more options, out of a total number of possible occurrences. In the second part of the analysis, I specifically look at whether and how additional prescriptively targeted features may predict the use of a proscribed feature. This analysis is done using the split infinitive as a case study. In what follows, I discuss each of these two types of analysis in more detail.
When establishing the variable context, it is important to distinguish between contexts in which variation is a choice, as opposed to contexts where one variant is categorically used (cf. Poplack and Dion 2009: 571: “in contexts where speakers must choose among the major variants”). With reference to the features used in this study, the assumption is made that theoretically, both the standard and the proscribed variants are possible options, given that the notion of ‘incorrectness’, prescriptively speaking, is linguistically arbitrary. Of course, establishing variants was not possible in the context of the discourse particlelikeand non-literalliterally(cf. Section 2.6), since it is difficult to conceptualise a variable in whichlikeis one of the variants; in addition the environments in which this variant could occur would be almost impossible to predict or determine (but see D’Arcy 2007).
The analysis ofain’twas done on the basis of both text-linguistic and variationist frequencies. These were analysed in the context of two variables: present tense negativebe notandhave not. The variants are given in Table 4.3. It is also important to address a point of disagreement present in the methodological approaches employed in previous studies to determine the variables in which ain’t is used, that is, the forms with which it alternates. Wolfram (1974: 153), for instance, considers ain’t as alternating only with the’m/’re /’s notforms of be not, noting thataren’tand isn’twere not observed in his sample of speech from Puerto Rican English speakers, and were thus excluded from the analysis. Cheshire (1981) considersain’tas a variant of all contracted forms of presentbe notandhave not, including both auxiliary and copular be not, but not the full forms of these verbs (this is not stated explicitly, but can be inferred from the data presented). Weldon (1994) considers ain’t as a variant of both full and contracted copula inbe notenvironments, while Anderwald analyses the alternations of ain’twithbe notandhave not, including both auxiliary
110 4.4. Actual language use: data and analysis
and copula be not, but does not specify whether the total number of environments includes full forms. Taking into account these differing approaches, in my analysis I take both full forms and contracted forms ofbe notandhave not, primarily because my data come from a general standard American English corpus, in which both full and contracted forms occur. Second, I do not distinguish between copula and auxiliary be notenvironments, because (a) such a distinction is not particularly relevant for the issue of the influence of prescriptivism on usage, and (b) studies of variation have shown that it is hard to identify regularities in the patterns of variation between copular be notand auxiliarybe not, as opposed tohave not, leading Anderwald (2002: 139) to conclude that “a distinction of BE into auxiliary and copular uses is perhaps not particularly warranted”.
be not have not
prescribed am/are/is not I am not coming. have/has not I have not left.
’m/’re/’s not I’m not coming. ’ve/’s not I’ve not left.
aren’t/isn’t He isn’t coming. haven’t/hasn’t I haven’t left.
proscribed ain’t I ain’t coming. ain’t I ain’t left.
Table 4.3: The variants ofbe notandhave not
For the analysis of the newer uses oflike, as discussed in Section 3.4 above, I will focus only on the discourse particlelike. While some of the occurrences of the discourse particlelikeare tagged with appropriate part-of-speech tags in the corpora used, it was noticed that the accuracy of the tagging was not very high, and that in most cases a much better indicator of whetherlikeis used as a discourse particle or not was the transcription of the data. Accordingly, since the discourse particlelikeis set off with commas in the majority of the cases, this was used as a more reliable way of identifying and extracting those instances oflike. In the cases oflikeandliterally, for instance, a variationist analysis was not undertaken, due to the complicated issue of establishing the variable context (but see D’Arcy 2007 for a variationist analysis of vernacularlike). In the case oflike, only normalised frequencies of occurrence were therefore used to track the patterns of usage of this feature, based on the tags in the corpora.
In the case ofliterally, two analyses were carried out: one on the basis of the entire set of occurrences ofliterally, and another on the basis of a subset of occurrences. This decision was made in view of the fact that the process of change whichliterally is undergoing entails a significant amount of variation in its uses, and, consequently, it
is difficult to neatly disambiguate the meanings and functions of these uses, especially of the newer ones. The three uses ofliterallydistinguished in the analysis are given in Table 4.4; these were distinguished on the basis of categorisations of the uses of literallyfound in previous studies (see Section 3.5).
uses ofliterally
prescribed primary This is literally translated.
dual There were literally thousands of people.
proscribed non-literal This book literally blew my mind.
Table 4.4: The uses ofliterally
In its primary use, literallyrefers to what something means, how something is said or meant, or how something is translated or interpreted. In such cases,literally has a clear denotational meaning, as it functions as a manner adverb; hereliterallyis the answer to the question how something is done (e.g.How do you mean? I mean literally.) In all other cases, however, the meaning of literally is more ambiguous and elusive, and in such cases it is almost always possible that the speaker is using literally to signal that something that may be understood figuratively must be understood literally (cf. creative cases in Powell 1992, discussed in Section 3.5 above). In these cases, the meaning ofliterallyis highly dependent on context. In view of this, a manual analysis of all occurrences ofliterally was deemed too laborious to be worthwhile, given that it would be unlikely to reveal any major insights. A middle-ground solution was to perform both an automatic binary disambiguation of the entire set of occurrences ofliterallyusing Python scripts and a manual analysis classifying a subset of the occurrences ofliterallyinto three categories. The first type of analysis distinguishes between cases that are very clearly instances of the so-called primary use ofliterally, and all other cases (for more details on how this was done, see Appendix C). An important consideration in this decision was the fact that prescriptive attitudes are highly conservative with respect to most types of changes in the language. In the treatment ofliterally, especially, only the primary (i.e. denotational) meaning of the word was therefore accepted, while all other uses (i.e. intensifying, subjective or metapragmatic ones) are seen as ‘incorrect’. This first part of the corpus analysis is based on data from both corpora, and allows us to track the usage of the proscribed variants over time.
The manual analysis was carried out on a sample of the total number of occurrences ofliterallyin the corpora. This sample was extracted by selecting every
112 4.4. Actual language use: data and analysis
fifth case in the set of all occurrences of literally in both COCA and COHA. This manual analysis was done for two reasons. First, it was meant to provide additional evidence for the distribution of the three uses of literally, which was not possible with the automatic disambiguation of all its occurrences. Second, the manual analysis also served to identify any possible differences between the results of the automatic disambiguation and the manual analysis of the data. In this way, I hope to be able to provide a rough estimate of the level of accuracy of the automatic disambiguation.
An analysis of the frequency patterns of negative concord also requires a variationist account of the ratio of the stigmatised forms in comparison to all others; here, we cannot rely on frequency counts of negative concord constructions alone. In order to delimit the total number of possible environments in which negative concord may occur, it is important to consider the variants of negative expression and the circumstances in which negative concord can be expected. The rule of negative attraction (first formulated by Klima 1964: 267, 289, cited in Labov et al. 1968: 268; see also Wolfram 1974: 163) is the starting point for determining the contexts for negative concord. According to this rule, in standard English sentences with indefinites, there are two possible options. If the indefinite is in pre-verbal position, the negative marker is attracted to the first indefinite before the verb, which accounts for sentences such asNobody knows anything, to use Labov et al.’s example (1968: 268). If the indefinite is post-verbally located, the negative marker may optionally be attached to the verb, as inJohn doesn’t know anything, or to the indefinite, as inJohn knows nothing. In other words, if the indefinite comes after the verb, “the negative attraction rule may or may not apply” (Wolfram 1974: 164). In relation to these