Chapter 5: Quantitative analysis of MNZE rhoticity
5.2 Modelling the MNZE data
5.2.7 Model 1
5.2.7.1 Model 1 variables
It is important to consider how the model treats the variables entered into it. The model estimates a baseline intercept, which represents the probability of /r/ (expressed in log odds) in the hypothetical situation where all variables are at their default (treatment) values or conditions.
For any continuous variable entered into the model, the model treats zero as the default condition and estimates a coefficient (slope) for a one-unit increment of the variable. The coefficient represents the degree to which the predicted log odds shift away from the model's baseline intercept as the value of that variable changes. For the categorical (factor) variables, the model treats one factor level as the default (reference) condition and estimates coefficients for the other level(s) based on a contrast with this default condition. By default, the model assigns the reference condition to the factor level that is numerically or alphabetically first. However, the default condition can be set manually according to theory-driven assumptions and research goals. For the models fitted to the MNZE data (with the phrase-final and absolute-final tokens removed), the explanatory variables available for use in the models (with the default condition listed first) were as in 7:
7.
(i) following phonological context: vowel versus consonant
(ii) preceding vowel context: START versus FIRE versus lettER versus NEAR versus NORTH versus NURSE versus OUR versus SQUARE
(iii) word frequency: a continuous variable (default = 0)
(iv) age: adult versus teenager
(v) region: central versus northern
(vi) gender: female versus male
(vii) MCI: a continuous variable⁹ (default = 0)
In order to take into account variation with regard to individual speakers and individual word forms over and above these explanatory (grouping) variables, each of the models described below also included the random effects:
(i) speaker
(ii) word form
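The treatment (reference-level) coding described above can be illustrated with a minimal sketch. It is not the software used for the MNZE models; the function name `treatment_code` is hypothetical, and the case-insensitive sort is only an assumption about how "alphabetically first" is resolved. It shows that the automatic choice of reference level for preceding vowel would be FIRE, whereas START is obtained only by setting the default manually.

```python
def treatment_code(levels, reference):
    """Build 0/1 indicator rows for a factor; the reference level is all zeros.

    Hypothetical helper for illustration only.
    """
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference}")
    # One indicator column per non-reference level.
    columns = [lv for lv in levels if lv != reference]
    rows = {lv: [1 if lv == col else 0 for col in columns] for lv in levels}
    return columns, rows


# Preceding vowel levels as listed in 7(ii).
VOWELS = ["START", "FIRE", "lettER", "NEAR", "NORTH", "NURSE", "OUR", "SQUARE"]

# Automatic choice: the alphabetically first level (case-insensitive here,
# which is an assumption of this sketch).
automatic_reference = sorted(VOWELS, key=str.lower)[0]

# Manual choice, matching the default condition listed first in 7(ii).
columns, rows = treatment_code(VOWELS, reference="START")

print("automatic reference:", automatic_reference)
print("indicator columns:", columns)
print("START row:", rows["START"])
print("NURSE row:", rows["NURSE"])
```

Each non-reference level receives its own coefficient, interpreted as a contrast with the all-zero reference row.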
Many descriptions of how to fit regression models (e.g. Gries 2009; Gorman & Johnson 2013) advocate starting with a “full” model (i.e. a model with all potentially relevant factors and interactions included) and subsequently removing variables which are not identified as significant in the model. Non-significant variables or interactions between variables are removed sequentially and the resulting models are compared. The aim is to identify the simplest model which best fits the data, in accordance with the principle of Occam's razor (Gries 2009: 260). This means avoiding model over-fitting, i.e. not creating a model that is too complex for the data, with too many variables. While a complex model with many variables may perfectly predict the actual data sample, it is not useful for predicting behaviour in the wider population. The aim is to strike a balance between a “good fit” and an “over-fit” model. As expressed by K. Johnson (2008: 90), the aim is to “get as good a fit as possible with a minimum of predictive variables.”
The model returns several measures of “goodness of fit” which can be used to identify the best-fitting model. These are the AIC (Akaike Information Criterion), the BIC (Bayesian Information Criterion), the deviance for the maximum likelihood criterion and the log-likelihood. With the exception of the log-likelihood, smaller values for each of these criteria indicate a better fit; a greater log-likelihood value represents a better-fitting model. It is not clear which of these values should be given the greatest importance. For example, Starkweather (2010a) recommends using BIC values for model comparison, whereas Gorman (2010: 70) uses the AIC. A useful alternative method is to employ a likelihood ratio test (i.e. Anova) to compare the degree of improvement between related (i.e. nested) models (see Jaeger 2008: 439; Gries 2009: 261).
⁹ An alternative approach in which MCI is treated as a factor is also described below.
In order to verify that the combination of variables and interactions included in the model contributes significantly to the model fit (and should be retained), it is good practice to compare the fitted model with a null model, i.e. one that includes only the intercept and the random effects. For a specific variable or interaction that appears to be significant, it is also informative to compare the model that includes it against the same model without it. The results of these Anova comparisons indicate whether the removal of a given variable is significantly detrimental or beneficial to the model fit.
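The fit criteria and the nested-model comparison discussed above follow standard formulas, which can be sketched as follows. The log-likelihoods, parameter counts and sample size in the example are hypothetical placeholders, not values from the MNZE models, and the function names are illustrative only. The chi-square tail probability is computed with the standard library for integer degrees of freedom.

```python
import math


def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 log L (smaller is better)."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 log L (smaller is better)."""
    return k * math.log(n) - 2 * loglik


def deviance(loglik):
    """Deviance under the maximum likelihood criterion: -2 log L."""
    return -2 * loglik


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    # Closed-form starting points: df = 1 (erfc) or df = 2 (exponential).
    if df % 2 == 1:
        p, a = math.erfc(math.sqrt(half)), 0.5
    else:
        p, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2:
        p += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1
    return p


def likelihood_ratio_test(ll_reduced, k_reduced, ll_full, k_full):
    """Compare two nested models; returns (statistic, df, p-value)."""
    stat = 2 * (ll_full - ll_reduced)
    df = k_full - k_reduced
    return stat, df, chi2_sf(stat, df)


# Hypothetical values for a null model and a fuller nested model.
ll_null, k_null = -1200.0, 3
ll_full, k_full = -1150.0, 10
n_tokens = 4000

print("AIC:", aic(ll_null, k_null), aic(ll_full, k_full))
print("BIC:", bic(ll_null, k_null, n_tokens), bic(ll_full, k_full, n_tokens))
print("LRT:", likelihood_ratio_test(ll_null, k_null, ll_full, k_full))
```

Because BIC penalises each additional parameter by ln(n) rather than 2, it favours simpler models than AIC once the sample contains more than about seven observations.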
As noted earlier, collinearity can create difficulties for model fitting. In order to counteract potential complications from collinearity in the present models, I checked the correlation matrices provided in the model outputs, performed Pearson correlation tests and examined Variance Inflation Factors (VIFs) for potentially collinear explanatory variables. I performed centering (see Gries 2009: 121) on variables with VIF values of 3 or more.
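The checks described above can be sketched in simplified form. The data values are invented for illustration, the function names are hypothetical, and the VIF formula shown, 1 / (1 - r²), holds only for the special case of two predictors; with more predictors each VIF requires regressing one predictor on all the others, which this sketch does not attempt.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def vif_two_predictors(r):
    """VIF when the model has exactly two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)


def center(xs):
    """Subtract the mean so that the centred variable has mean zero."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]


# Invented values standing in for two continuous predictors.
predictor_a = [1.2, 2.9, 3.1, 4.8, 5.0, 6.7]
predictor_b = [0.5, 1.1, 1.9, 2.2, 3.4, 3.3]

r = pearson_r(predictor_a, predictor_b)
vif = vif_two_predictors(r)
print("r =", r, "VIF =", vif)

# Threshold of 3 taken from the procedure described in the text.
if vif >= 3:
    predictor_a = center(predictor_a)
    predictor_b = center(predictor_b)
    print("centred means:", sum(predictor_a), sum(predictor_b))
```

Centering shifts a variable's zero point without changing its correlation with another predictor; its main effect is on the relationship between a variable and the interaction terms built from it.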
For each of the full models described below I first entered into the model all of the relevant potential linguistic and social explanatory variables listed in 7 above, as well as any appropriate interactions between factors. I then removed items sequentially in the following order: 1) interactions between variables identified as non-significant (least significant first), 2) variables identified as not having any significant effect (least significant first), unless the variable was implicated in a significant interaction. After each removal I checked the BIC value for goodness of fit and applied Anova. At each stage I retained the best fitting model (i.e. if a model was a better fit with a non-significant variable included then I retained the non-significant variable).
Model 1 included the fixed effects: Following context, Preceding vowel, Word frequency, Age, Region, MCI and Gender. It also included interactions between Following context and each of Region, Age, MCI and Gender. It included the random effects: Speaker and Word form.
The best-fitting version of Model 1 retained the variables following context, preceding vowel, region, age and MCI, together with the interactions between following context and each of region, age and MCI. The intercept and estimated coefficients for Model 1 are provided in Table 5.5. I discuss the effects of the linguistic factors first, followed by the social factors.
Table 5.5: Best-fitting model estimates for Model 1 (pre-vocalic and pre-consonantal /r/).
                          Estimate    Std. Error   z value    Pr(>|z|)
Intercept / baseline       2.55073    0.52186        4.888    <0.001
Vowel FIRE                 0.85468    2.32057        0.368    0.71265
Vowel lettER               0.25421    0.42699        0.595    0.55161
Vowel NEAR                 0.63274    0.56038        1.129    0.25884
Vowel NORTH               -0.03923    0.47624       -0.082    0.93435
Vowel NURSE                2.84125    0.45550        6.238    <0.001
Vowel OUR                 -0.34305    0.75967       -0.452    0.65157
Vowel SQUARE               0.17363    0.47581        0.365    0.71517
Following C              -11.26990    0.59385      -18.978    <0.001
Region N                  -0.49862    0.18863       -2.643    <0.01
Age Young                 -1.20441    0.30188       -3.990    <0.001
MCI                       -0.20386    0.03458       -5.895    <0.001
Following C: Region N      1.97274    0.27844        7.085    <0.001
Following C: Age Young     2.79910    0.51742        5.410    <0.001
Following C: MCI           0.31827    0.04270        7.454    <0.001
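As an aid to reading Table 5.5, the following sketch shows how the fixed-effect estimates combine into a predicted log odds value and how that value converts to a probability. It uses only the coefficients reported in the table; the random effects for speaker and word form are set to zero, which is a simplifying assumption of the sketch, and the function names are hypothetical.

```python
import math

# Fixed-effect estimates copied from Table 5.5 (log odds scale).
INTERCEPT = 2.55073
VOWEL = {
    "START": 0.0,  # reference level, absorbed in the intercept
    "FIRE": 0.85468,
    "lettER": 0.25421,
    "NEAR": 0.63274,
    "NORTH": -0.03923,
    "NURSE": 2.84125,
    "OUR": -0.34305,
    "SQUARE": 0.17363,
}
FOLLOWING_C = -11.26990
REGION_N = -0.49862
AGE_YOUNG = -1.20441
MCI = -0.20386
FOLLOWING_C_X_REGION_N = 1.97274
FOLLOWING_C_X_AGE_YOUNG = 2.79910
FOLLOWING_C_X_MCI = 0.31827


def log_odds(vowel="START", following_c=False, northern=False,
             young=False, mci=0.0):
    """Predicted log odds of /r/ from the fixed effects only.

    Random effects for speaker and word form are assumed to be zero.
    """
    c = 1.0 if following_c else 0.0
    n = 1.0 if northern else 0.0
    y = 1.0 if young else 0.0
    return (INTERCEPT + VOWEL[vowel]
            + c * FOLLOWING_C + n * REGION_N + y * AGE_YOUNG + mci * MCI
            + c * n * FOLLOWING_C_X_REGION_N
            + c * y * FOLLOWING_C_X_AGE_YOUNG
            + c * mci * FOLLOWING_C_X_MCI)


def probability(eta):
    """Convert log odds to a probability with the inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


# Baseline: START vowel, following vowel, central region, adult, MCI = 0.
baseline = log_odds()
# Same speaker profile, but with a following consonant.
pre_consonantal = log_odds(following_c=True)

print("baseline:", baseline, probability(baseline))
print("pre-consonantal:", pre_consonantal, probability(pre_consonantal))
```

With every variable at its default condition the prediction is simply the intercept; the interaction terms enter only when a following consonant is combined with the relevant social variable.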