3.3 Further topics in regression
3.3.2 Coding of categorical variables
The formulas for logistic regression (and regression more broadly) require summing a vector of predictors weighted by regression coefficients. However, it is not possible to add together “negative declarative clause type” and “unaccusative verb.” Thus, it is necessary to transform these predictors into numerical form. This is referred to as defining a contrast matrix for the categorical (i.e. non-numeric) predictor.
The conceptually simplest way to do this, for a categorical variable with N categories X1, …, XN, is to define N indicator variables I1, …, IN. I1 takes the value 1 if an observation has the category X1 and 0 otherwise; the other indicator variables behave likewise. However, including all N indicator variables causes problems when fitting the model. Regression models generally have an intercept term, a 𝛽0 that is included for each observation regardless of its properties. Since the N indicator variables completely partition the data, it is possible to offset any constant added to all of the corresponding 𝛽s by subtracting the same constant from 𝛽0. The model-fitting algorithm will not converge, since there is an infinite family of models, all of which fit the data equally well. One of the indicator variables must therefore be left out. 𝛽0 then includes the effect of the category that does not have an indicator variable (among other effects); this category is called the “reference category.” The remaining 𝑁 − 1 indicator variables (and their corresponding 𝛽s) each yield an estimate of the difference between their category and the reference category.5 This regime of contrasts is called “treatment contrasts” because it canonically corresponds to an experimental paradigm in which there is one control group (the reference category) and the experimental hypotheses are whether any of several treatments (the other categories) provokes a significant difference from the control. The 𝛽s in such a regression, with their associated confidence intervals and 𝑝-values, provide estimates of the effects of the treatments.
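The coding scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the level names are borrowed from the clause types discussed in this section, the four observations are invented, and the helper names `indicator_row` and `treatment_row` are hypothetical rather than part of any library.

```python
# Clause-type labels used only for illustration; the observations are invented.
LEVELS = ["Aff. Q.", "Neg. Decl.", "Neg. Q."]


def indicator_row(category, levels):
    """Return one 0/1 indicator per level for a single observation."""
    return [1 if category == level else 0 for level in levels]


def treatment_row(category, levels, reference):
    """Return an intercept entry followed by indicators for the non-reference levels."""
    kept = [level for level in levels if level != reference]
    return [1] + indicator_row(category, kept)


observations = ["Neg. Q.", "Aff. Q.", "Neg. Decl.", "Neg. Q."]

# With all N indicators, every row sums to 1. The indicators therefore add up
# to the intercept column of ones, which is the redundancy that leaves the
# full model without a unique solution.
full_rows = [indicator_row(obs, LEVELS) for obs in observations]
row_sums = [sum(row) for row in full_rows]

# Treatment coding: intercept plus N - 1 indicators, reference level dropped.
treatment_rows = [
    treatment_row(obs, LEVELS, reference="Aff. Q.") for obs in observations
]

print(row_sums)
print(treatment_rows)
```

An observation from the reference category is coded with zeros in every indicator column, so its linear predictor contains only the intercept (and any other predictors in the model).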
4 Although these differences are for the main effects of clause type, which should not be scrutinized for their contribution to the model in the presence of an interaction term, which these models both have.
5 For each N-category predictor, there is only enough information to add 𝑁 − 1 terms to the model. An alternative strategy could be imagined: removing the 𝛽0 term from the regression so as to compensate for the Nth indicator variable and its 𝛽. This works in the case of a single categorical variable; however, it does not generalize. After adding a second categorical variable there is not another term that can be removed. 𝛽0 must remain in the model to serve as the base case for all categorical predictors.
Table 3.2: A comparison of two logistic regression models on do-support data from the PPCHE. One has affirmative questions as the reference level, whereas the other has negative declaratives fulfilling this role.

                           Ref = Aff. Q.          Ref = Neg. Decl.
                           Estimate   𝑝-value     Estimate   𝑝-value
(Intercept)                −1.38      0.00        −1.76      1.27⋅10^−16
year.std                    0.18      0.32         0.18      0.32
Type[Treat: Neg. Decl.]    −0.38      0.40         —         —
Type[Treat: Neg. Q.]        0.85      0.20         1.23      0.03
Type[Treat: Aff. Q.]        —         —            0.38      0.40
Treatment contrasts are R’s default, and (unless other arrangements are made) the first level of a factor (in alphabetical order) is treated as the reference level. However, there are drawbacks to this default. It is rare in syntactic analysis to be able to specify that one member of a set of contexts is the basic or control context, and that all others are derived from it. This means that the choice of the reference level is somewhat arbitrary. It is regrettably common practice to examine the significance values associated with individual coefficients. (Better approaches are discussed in section 3.3.3 immediately following.) Doing this without having made a considered choice of reference level does not yield sensible results. Whether the reference level has an intermediate or extreme estimated 𝛽 can affect whether the treatment effects appear significant. Table 3.2 illustrates this phenomenon using a subsample of data from the PPCHE.6 In the left-hand model, with affirmative questions as the reference level, neither of the clause type effects has a significant 𝑝-value. Under some interpretations, it would be said that there is no effect of clause type in this data. In the right-hand model, however, negative declaratives are treated as the reference level (this is the only difference between the two models). The effect for negative questions is now significant at the 𝛼 = 0.05 level, indicating an effect of clause type even though the underlying data have not changed.
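That the two models in table 3.2 describe the same fit can be checked by hand from the point estimates. The sketch below assumes the models contain no interaction involving Type (an assumption suggested by the rows of the table, not stated in it), and the function name `rereference` is hypothetical. It reproduces only the estimates; the 𝑝-values cannot be recomputed this way, because they also depend on the standard error of each particular comparison.

```python
# Point estimates copied from the left-hand model of table 3.2
# (reference level = Aff. Q.).
intercept_aff_q = -1.38
effects_aff_q = {"Neg. Decl.": -0.38, "Neg. Q.": 0.85}


def rereference(intercept, effects, old_ref, new_ref):
    """Re-express treatment-coded estimates against a different reference level.

    Assumes no interaction term involves the categorical predictor.
    """
    # Log-odds of each level when the other predictors are zero.
    cell = {old_ref: intercept}
    for level, beta in effects.items():
        cell[level] = intercept + beta
    # The new intercept is the log-odds of the new reference level; every other
    # level is expressed as a difference from it.
    new_intercept = cell[new_ref]
    new_effects = {
        level: value - new_intercept
        for level, value in cell.items()
        if level != new_ref
    }
    return new_intercept, new_effects


new_intercept, new_effects = rereference(
    intercept_aff_q, effects_aff_q, old_ref="Aff. Q.", new_ref="Neg. Decl."
)
print(round(new_intercept, 2))
print({level: round(value, 2) for level, value in new_effects.items()})
```

To two decimal places the results agree with the right-hand column of the table (intercept −1.76, Aff. Q. 0.38, Neg. Q. 1.23). What changes with the reference level is which pairwise comparisons receive a coefficient, and hence a 𝑝-value, of their own.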
6 It is necessary to use a subsample because the full dataset is large enough that a significant difference may be detected between
Table 3.3: A comparison of two logistic regression models on a subsample of do-support data from the PPCHE. Both models use sum contrasts.

                          Model 1                 Model 2
                          Estimate   𝑝-value      Estimate   𝑝-value
(Intercept)               −1.22      4.72⋅10^−8   −1.22      4.72⋅10^−8
year.std                   0.18      0.32          0.18      0.32
Type[Sum: Aff. Q.]        −0.16      0.63         −0.16      0.63
Type[Sum: Neg. Decl.]     −0.54      0.03          —         —
Type[Sum: Neg. Q.]         —         —             0.70      0.07
An improvement on treatment contrasts is provided by R’s built-in sum contrasts. Instead of comparing the value of one category to another, these contrasts compare the mean of a single category to the mean of all category means.7 It is still the case that one of the levels is left out of the comparison. This is illustrated in table 3.3, which reuses the same subsample of the PPCHE data as table 3.2. The point estimate of the omitted level’s difference from the mean can be calculated by taking the negative of the sum of the other levels’ coefficients. With reference to the table, 0.70 = −(−0.54 + (−0.16)). As the regression output reflects, there is a significant difference in this data between negative declaratives and the mean of all clause types.8 There is no significant difference from the mean for the other two clause types, however.9 In this dissertation I’ll use sum contrasts for models.
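The arithmetic behind the omitted level can be sketched as follows. The helper names `sum_contrast_matrix` and `omitted_level_estimate` are hypothetical. The matrix uses the usual sum-coding layout in which the last level is the omitted one and is coded −1 in every column, which is assumed here to correspond to the coding behind table 3.3.

```python
def sum_contrast_matrix(n_levels):
    """Build a sum-contrast matrix: rows are levels, columns are coded variables.

    Each of the first n_levels - 1 levels gets its own column; the last
    (omitted) level is coded -1 in every column.
    """
    rows = [
        [1 if i == j else 0 for j in range(n_levels - 1)]
        for i in range(n_levels - 1)
    ]
    rows.append([-1] * (n_levels - 1))
    return rows


def omitted_level_estimate(estimates):
    """Deviation of the omitted level from the mean of the level means."""
    return -sum(estimates)


contrasts = sum_contrast_matrix(3)

# Model 1 of table 3.3 reports Aff. Q. and Neg. Decl.; Neg. Q. is omitted.
neg_q = omitted_level_estimate([-0.16, -0.54])

# Model 2 reports Aff. Q. and Neg. Q.; Neg. Decl. is omitted.
neg_decl = omitted_level_estimate([-0.16, 0.70])

print(contrasts)
print(round(neg_q, 2), round(neg_decl, 2))
```

Because every column of the matrix sums to zero, the deviations of all levels from the mean must themselves sum to zero. That is why the omitted level’s estimate is recoverable from the others, while its 𝑝-value is available only from a model in which that level is not the omitted one.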