CHAPTER 3 – Method and research design
3.6. Quantitative analysis of linguistic data
This investigation uses variable rule analysis, which employs generalised linear models that enable the formal mathematical evaluation of the relationship between a binary dependent variable and several independent variables (Tagliamonte, 2012). According to Sankoff, who developed the first such programme in the 1960s, variable rules are ‘the probabilistic modelling and the statistical treatment of discrete choices and their conditioning’ (Sankoff, 1988:2). In other words, they determine whether a pattern is statistically significant, or merely occurs by chance. Three lines of evidence may be used to create an argument that explains variable phenomena: statistical significance, effect magnitude, and the constraint hierarchy (Poplack & Tagliamonte, 2001:92; Tagliamonte, 2002:731).
A factor is statistically significant if the p-value associated with its effect on the dependent variable is 0.05 or less (p ≤ 0.05). This means that the observed result would be very unlikely to occur under the null hypothesis, that is, if there were no relationship between the dependent variable and the factor in question. It is necessary to find out not just which factors are statistically significant, but also which factors are not, since both are useful indications of how variation operates (Baker, 2010b). The effect magnitude, or effect size, is concerned with the strength of factors; factor groups with a larger range have a stronger effect than factor groups with a smaller range. Finally, the constraint hierarchy presents the statistically significant factor groups in order of effect size, from the largest range to the smallest, while the factor weights within each group indicate the direction of effects.
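One common way of obtaining such a p-value in logistic regression is a likelihood-ratio test, which compares the deviance of a model with and without the factor group in question. The following is a minimal sketch of that calculation, not a description of Rbrul's internals; the function names and the deviance values are hypothetical and are used for illustration only.

```python
import math

def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-square variable with integer df (closed forms)."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite correction series
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total

def likelihood_ratio_p(deviance_without, deviance_with, df_added):
    """p-value for adding a factor group, based on the drop in deviance."""
    drop = deviance_without - deviance_with
    return chi2_upper_tail(drop, df_added)

# Hypothetical deviances for illustration only (not results from this study).
p = likelihood_ratio_p(deviance_without=812.4, deviance_with=805.1, df_added=1)
significant = p <= 0.05
```

Here `df_added` is the number of extra parameters that the factor group contributes, which is one for a factor group with two factors.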
This investigation uses Rbrul, a version of the variable rule programme, to conduct multi-effect logistic regression analysis [60]. Rbrul was developed by Daniel Johnson (2009) and has an important advantage over its predecessor, Goldvarb, in that it can fit mixed-effect models (Tagliamonte, 2012:138). This means that Rbrul estimates a random intercept for each signer/speaker, a value that captures how far that individual departs from the overall pattern once the other factors in the model have been taken into account. It is then possible to abstract away from individual variation and generalise to a population:
we can test whether there are differences among groups that are robustly present across the dataset, and we can be more confident that the trends are not carried by one or two individuals (Drager and Hay, 2012:60).
60 Rbrul is an open-source programme that can be run from www.danielezrajohnson.com/Rbrul.R and uses
The dependent variable must be binary (Johnson, 2009:359), although independent variables may include many factors.
There are several points to be aware of when using Rbrul, since these affect the design of the analyses conducted in chapters 5 and 6. While several independent variables, or factor groups, can be included in an Rbrul model, the viability of the model depends on how many tokens are included. Factors that occur only a small number of times are unlikely to produce a viable model, and models that include too few tokens for the number of factor groups are also unworkable. Although it is possible to check for interactions between different factors (see, for example, Clark and Watson, 2011), this again requires enough tokens to render the model viable. When interaction effects were included in the present analyses, the models failed to converge because the token counts (N) were too small. The annotation of more tokens in future will make this more feasible. Finally, independent variables may be discrete or continuous, and age is treated as a continuous variable for all runs of Rbrul.
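The check on token counts described above can be sketched as follows. This is an illustrative sketch only: the function name, the example factor group and the threshold of ten tokens are all hypothetical, since no fixed minimum is stipulated here.

```python
from collections import Counter

def sparse_factors(tokens, factor_group, min_tokens):
    """Return the factors in a factor group that occur fewer than min_tokens times."""
    counts = Counter(token[factor_group] for token in tokens)
    return {factor: n for factor, n in counts.items() if n < min_tokens}

# Hypothetical tokens: each token records the factor it belongs to in each factor group.
tokens = (
    [{"region": "north"}] * 40
    + [{"region": "south"}] * 35
    + [{"region": "east"}] * 4
)

# The threshold of 10 is an assumption for illustration, not a recommended value.
too_few = sparse_factors(tokens, "region", min_tokens=10)
```

Factors flagged in this way might be combined with a related factor or excluded before the model is run.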
The tables in sections 5.4 and 6.4 to 6.7 that show the findings of Rbrul runs begin by stating the deviance, degrees of freedom (DF) and grand mean of the variable. The profile of the binary dependent variable is shown on a p scale from 0 to 1, and it is important to note the variant that corresponds to the application value in order to see which variant is favoured, or preferred, by a given factor. This is necessary for interpreting the Centred Factor Weight (CFW), which gives information about the effect size of each factor. Factor weights range from 0 to 1: values above 0.5 favour the application value, and values below 0.5 disfavour it. For example, if a variable can be realised as x and y, with x as the application value, and the factor weight for female signers were 0.8, this would suggest that female signers strongly favour x and disfavour y. Conversely, if y were the application value, the same findings would suggest that female signers strongly favour y and disfavour x. The log-odds (L-O) are also shown, since these may be more familiar to those working in other fields. Log-odds give similar information to the CFW, but on a scale that is centred on 0. A positive value (such as 0.255) shows how strongly a factor favours the application value, while a negative value (such as -0.255) shows how strongly a factor disfavours the application value. In keeping with the way in which such findings are described in the literature on sociolinguistic variation, the analyses in chapters 5 and 6 focus on the centred factor weights, but the log-odds are also reported for the benefit of practitioners from other fields.
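The relationship between the two scales can be made concrete with a short sketch. It assumes that a centred factor weight is the inverse-logit (logistic) transformation of the corresponding log-odds value; the function names are illustrative and are not part of Rbrul.

```python
import math

def log_odds_to_weight(log_odds):
    """Convert a log-odds value (centred on 0) to a factor weight (centred on 0.5)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def weight_to_log_odds(weight):
    """Convert a factor weight strictly between 0 and 1 to log-odds."""
    return math.log(weight / (1.0 - weight))

# The example log-odds values from the text, expressed as factor weights.
favouring = log_odds_to_weight(0.255)      # slightly above 0.5
disfavouring = log_odds_to_weight(-0.255)  # slightly below 0.5

# The example factor weight of 0.8, expressed as log-odds.
strong = weight_to_log_odds(0.8)
```

Under this assumption, log-odds of 0.255 and -0.255 correspond to factor weights of roughly 0.56 and 0.44, and a factor weight of 0.8 corresponds to log-odds of roughly 1.39.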
Where enough factor groups are found to be significant for a constraint hierarchy to be viable, the hierarchy is established using the range of effect sizes. The range, which is shown in the findings tables for all factor groups, is calculated by subtracting the smallest factor weight in a factor group from the largest, multiplying by 100, and rounding to the nearest whole number. The constraint hierarchy presents statistically significant factor groups in order from those with the largest range to those with the smallest (Tagliamonte, 2012).
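The calculation of the range and the ordering of the constraint hierarchy can be sketched as follows. The factor groups, factors and weights are hypothetical placeholders rather than findings from this study, and the input is assumed to contain only factor groups that have already been found to be statistically significant.

```python
def factor_range(weights):
    """Range of a factor group: (largest weight - smallest weight) * 100, rounded."""
    values = list(weights.values())
    return round((max(values) - min(values)) * 100)

def constraint_hierarchy(factor_groups):
    """Order factor groups from the largest range to the smallest."""
    ranges = {name: factor_range(weights) for name, weights in factor_groups.items()}
    return sorted(ranges.items(), key=lambda item: item[1], reverse=True)

# Hypothetical factor weights for three significant factor groups.
factor_groups = {
    "C": {"c1": 0.53, "c2": 0.47},
    "A": {"a1": 0.71, "a2": 0.29},
    "B": {"b1": 0.58, "b2": 0.50, "b3": 0.42},
}

hierarchy = constraint_hierarchy(factor_groups)
```

With these placeholder weights, the hierarchy places factor group A first (range 42), followed by B (range 16) and C (range 6).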