Coding of categorical variables - Further topics in regression

3.3 Further topics in regression

3.3.2 Coding of categorical variables

The formulas for logistic regression (and regression more broadly) require summing a vector of predictors weighted by regression coefficients. However, it is not possible to add together “negative declarative clause type” and “unaccusative verb.” Thus, it is necessary to transform these predictors into numerical form. This is referred to as defining a contrast matrix for the categorical (i.e. non-numeric) predictor.

The conceptually simplest way to do this, for a categorical variable with N categories X₁, …, X_N, is to define N indicator variables I₁, …, I_N. I₁ takes the value 1 if an observation has the category X₁ and 0 otherwise; the other indicator variables behave likewise. However, including all N indicator variables leads to problems with fitting the model. Regression models generally have an intercept term, a β₀ which is included for each observation regardless of its properties. Since the N indicator variables completely partition the data, it is possible to balance arbitrary changes in the corresponding βs with equal and opposite changes to β₀. The model-fitting algorithm will not converge, since there is an infinite family of models all of which fit the data equally well. One of the indicator variables must therefore be left out. β₀ will then include the effect of the category which does not have an indicator variable (among other effects); this category is called the “reference category.” The remaining N − 1 indicator variables (and their corresponding βs) will each yield an estimate of the difference between their category and the reference category.⁵ This regime of contrasts is called “treatment contrasts” because it canonically corresponds to an experimental paradigm where there is one control group (the reference category) and the experimental hypotheses are whether any of several treatments (the other categories) provokes a significant difference from the control. The βs in such a regression, together with their associated confidence intervals and p-values, provide an estimate of the effect of a certain treatment.
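The indicator coding described above can be sketched in a few lines of plain Python. The observations below are hypothetical toy data and the helper names (`indicator_row`, `treatment_row`) are invented for illustration; only the three clause-type labels come from the tables in this section. The sketch shows why the full set of N indicators is redundant with the intercept and what the design looks like once the reference level is dropped.

```python
# Hypothetical toy data: one clause-type label per observation.
LEVELS = ["Aff. Q.", "Neg. Decl.", "Neg. Q."]
observations = ["Neg. Q.", "Aff. Q.", "Neg. Decl.", "Aff. Q."]


def indicator_row(category, levels):
    """Return one 0/1 indicator per level for a single observation."""
    return [1 if category == level else 0 for level in levels]


def treatment_row(category, levels, reference):
    """Return indicators for every level except the reference level."""
    kept = [level for level in levels if level != reference]
    return indicator_row(category, kept)


# Full coding: N indicators per observation.
full = [indicator_row(obs, LEVELS) for obs in observations]

# Each row sums to 1, so the indicators together reproduce the intercept's
# column of ones. This is the redundancy that prevents a unique fit.
row_sums = [sum(row) for row in full]

# Treatment coding: an intercept column plus N - 1 indicators,
# here with "Aff. Q." as the reference level.
design = [[1] + treatment_row(obs, LEVELS, reference="Aff. Q.")
          for obs in observations]

for obs, row in zip(observations, design):
    print(obs, row)
```

An observation in the reference category is coded as the intercept alone, so its predicted value is carried entirely by β₀.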

⁴ Although these differences are for the main effects of clause type, which should not be scrutinized for their contribution to the model in the presence of an interaction term, which these models both have.

⁵ For each N-category predictor, there is only enough information to add N − 1 terms to the model. An alternative strategy could be imagined: removing the β₀ term from the regression so as to compensate for the Nth indicator variable and its β. This works in the case of a single categorical variable; however, it does not generalize. After adding a second categorical variable there is not another term that can be removed. β₀ must remain in the model to serve as the base case for all categorical predictors.

Table 3.2: A comparison of two logistic regression models on do-support data from the PPCHE. One has affirmative questions as the reference level, whereas the other has negative declaratives fulfilling this role.

                           Ref = Aff. Q.           Ref = Neg. Decl.
                           Estimate   p-value      Estimate   p-value
(Intercept)                −1.38      0.00         −1.76      1.27⋅10⁻¹⁶
year.std                    0.18      0.32          0.18      0.32
Type[Treat: Neg. Decl.]    −0.38      0.40          —         —
Type[Treat: Neg. Q.]        0.85      0.20          1.23      0.03
Type[Treat: Aff. Q.]        —         —             0.38      0.40

Treatment contrasts are the R software’s default, and (unless other arrangements are made) the first level of a factor (in alphabetical order) is treated as the reference level. However, there are drawbacks to this default. It is rare in syntactic analysis to be able to specify that one member of a set of contexts is the basic or control context, and all others are derived from this one. This means that the choice of the reference level is somewhat arbitrary. It is regrettably common practice to examine the significance values associated with individual coefficients. (Better approaches are discussed in section 3.3.3 immediately following.) Doing this without having made a considered choice of reference level does not yield sensible results. Whether the reference level has an intermediate or extreme estimated β can affect whether the treatment effects appear significant. Table 3.2 illustrates this phenomenon using a subsample of data from the PPCHE.⁶ In the left-hand model, with affirmative questions as the reference level, neither of the clause type effects has a significant p-value. Under some interpretations, it would be said that there is no effect of clause type in this data. In the right-hand model, however, negative declaratives are treated as the reference level (this is the only difference between the two models). The effect for negative questions is now significant at the α = 0.05 level, indicating an effect of clause type despite there being no change in the underlying data.
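The two models in table 3.2 are the same fit expressed against different baselines, which can be checked with a little arithmetic. The following is a minimal sketch, not the procedure used to produce the table: the function name `rebase` is hypothetical, and the inputs are the rounded estimates from the left-hand column. Only point estimates can be converted this way. The p-values cannot, because the standard error of a difference between two levels depends on the covariance of their estimates, which the table does not report.

```python
# Rounded estimates copied from the left-hand model of table 3.2
# (reference level = Aff. Q.).
intercept_aff_q = -1.38
effects_aff_q = {"Neg. Decl.": -0.38, "Neg. Q.": 0.85}


def rebase(intercept, effects, old_ref, new_ref):
    """Re-express treatment-coded point estimates against a new reference level."""
    shift = effects[new_ref]           # new reference relative to the old one
    new_intercept = intercept + shift  # the intercept now describes new_ref
    new_effects = {level: value - shift
                   for level, value in effects.items()
                   if level != new_ref}
    new_effects[old_ref] = -shift      # the old reference becomes an ordinary level
    return new_intercept, new_effects


intercept_neg_decl, effects_neg_decl = rebase(
    intercept_aff_q, effects_aff_q, old_ref="Aff. Q.", new_ref="Neg. Decl.")

print(round(intercept_neg_decl, 2))
print({level: round(value, 2) for level, value in effects_neg_decl.items()})
```

The converted values agree with the right-hand column of the table (−1.76, 1.23 and 0.38). What changes between the columns is which pairwise comparison each coefficient tests, not the fitted model.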

⁶ It is necessary to use a subsample because the full dataset is large enough that a significant difference may be detected between

Table 3.3: A comparison of two logistic regression models on a subsample of do-support data from the PPCHE. Both models use sum contrasts.

                         Model 1                 Model 2
                         Estimate   p-value      Estimate   p-value
(Intercept)              −1.22      4.72⋅10⁻⁸    −1.22      4.72⋅10⁻⁸
year.std                  0.18      0.32          0.18      0.32
Type[Sum: Aff. Q.]       −0.16      0.63         −0.16      0.63
Type[Sum: Neg. Decl.]    −0.54      0.03          —         —
Type[Sum: Neg. Q.]        —         —             0.70      0.07

An improvement on treatment contrasts is provided by R’s built-in sum contrasts. These contrasts, instead of comparing the value of one category to another, compare the mean of a single category to the mean of all category means.⁷ It is still the case that one of the levels is left out of the comparison. This is illustrated in table 3.3, which reuses the same subsample of the PPCHE data as table 3.2. The point estimate of the omitted level’s difference from the mean can be calculated by taking the negative of the sum of the other levels’ estimates. With reference to the table, 0.70 = −(−0.54 + −0.16). As the regression output reflects, there is a significant difference in this data between negative declaratives and the mean of all clause types.⁸ There is no significant difference from the mean for the other two clause types, however.⁹ In this dissertation I’ll use sum contrasts for models.
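The sum-contrast coding and the recovery of the omitted level’s estimate can be sketched as follows. This is an illustrative reconstruction rather than the code behind table 3.3: `sum_contrast_matrix` is a hypothetical helper that reproduces the layout of R’s `contr.sum` (the last level is coded −1 in every column), and the numeric inputs are the rounded estimates from Model 1.

```python
def sum_contrast_matrix(n_levels):
    """Rows are levels, columns are the n_levels - 1 coded variables.

    Each of the first n_levels - 1 levels gets a 1 in its own column;
    the last level is coded -1 in every column.
    """
    rows = [[1 if i == j else 0 for j in range(n_levels - 1)]
            for i in range(n_levels - 1)]
    rows.append([-1] * (n_levels - 1))
    return rows


matrix = sum_contrast_matrix(3)

# Every column sums to zero, which is what makes the intercept
# the mean of the category means.
column_sums = [sum(column) for column in zip(*matrix)]

# Rounded estimates from Model 1 of table 3.3.
model_1 = {"Aff. Q.": -0.16, "Neg. Decl.": -0.54}

# The omitted level's deviation from the mean is the negative
# of the sum of the reported deviations.
neg_q = -sum(model_1.values())

print(matrix)
print(round(neg_q, 2))
```

Because the deviations of all levels sum to zero, any one of them is determined by the rest, which is why a model with N levels still reports only N − 1 coefficients.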

In document A Multi-Step Analysis of the Evolution of English Do-Support (Page 48-50)