

5.2.5 Regression modelling

5.2.5.2 Categorical independent variable

Regression modelling can also be used when the independent variable is not a number, but a variable that can take one of a specific set of values (a categorical variable or ‘factor’). The etymology data contains information regarding the inflectional paradigm for each verb. The inflectional paradigm of a verb can be either regular or irregular, so those two options are the two values for this factor (its ‘levels’). The frequencies of verbs may differ between the two: verbs with irregular inflectional paradigms, for example, may be more frequent. Fig. 5.36 suggests this, at least: the two boxes represent the distributions of frequencies for verbs with regular and irregular inflection. The thick black line inside each box shows the median frequency for the respective group of verbs. The lower and upper ends of the boxes represent the first and third quartiles: one quarter of the verbs with regular inflections have frequency values of less than 7.1 and thus fall below the box; one quarter of the verbs with regular inflections have frequency values of more than 8.5 and thus fall above the box; and it follows that the remaining two quarters (or one half) have frequency values that fall inside the box. The vertical lines beyond the ends of the boxes extend to the most extreme value within 1.5 times the difference between the first and third quartile (the ‘interquartile range’, or the height of the box). That is, the end-point of the vertical line extending from the top of each box is the largest value in the data that is less than or equal to the third quartile value plus 1.5 times the interquartile range, and the end-point of the line extending from the bottom is the smallest value that is equal to or larger than the first quartile value minus 1.5 times the interquartile range. (This explains why these lines are not of equal length in plots of this type: for example, the top quarter of data points (above the box) may be very clustered, so that the largest value is much less than 1.5 interquartile ranges above the third quartile and the upper line ending at it is therefore very short.) Verbs with values more extreme than that (lower than the first quartile minus 1.5 times the interquartile range, or higher than the third quartile plus 1.5 times the interquartile range) are shown individually, as grey dots. The grey X shows the mean value for each group; here, these are almost exactly the same as the medians. It is thus apparent that the irregular verbs, on average, have a higher frequency than the regular verbs. However, the difference is not enormous, and there is much variation even within each group of verbs.
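The quantities that make up such a box plot can be computed directly. The following is a minimal sketch in Python, not the code used to draw Fig. 5.36: the function name `box_plot_summary` and the toy frequency values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since plotting software may interpolate quartiles slightly differently.

```python
from statistics import mean, quantiles


def box_plot_summary(values):
    """Return the quantities a box plot displays for one group."""
    data = sorted(values)
    # First quartile, median, third quartile (one of several conventions).
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the height of the box
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # The whiskers end at the most extreme data points inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Points beyond the fences are drawn individually (the grey dots).
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
        "mean": mean(data),  # the grey X
    }


# Hypothetical toy frequencies, not taken from the etymology data.
toy = [5.0, 7.0, 7.5, 8.0, 8.2, 8.4, 8.5, 8.6, 12.0]
summary = box_plot_summary(toy)
for name, value in summary.items():
    print(name, value)
```

In this toy group the values above the box are tightly clustered, so the upper whisker ends only just above the third quartile, while the lower whisker is longer. This is the asymmetry described in the parenthesis above.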

The relationship between this regularity factor and the written corpus frequency can be tested statistically with a linear regression model. Table 5.6 shows the coefficients for this model, and the orange⁷⁵ line in Fig. 5.36 is a graphical representation of the model. The orange line ends exactly at the grey X’s that show the mean values. With just one categorical independent variable with two levels, this is precisely what linear regression does: the fitted or predicted values of the dependent variable for either level are the mean values of the dependent variable within that level. The parameter estimates for categorical independent variables have a different meaning than the estimates for numerical independent variables. For example, with the numerical independent variable of the ratio of writing/speech corpus frequencies in the model in Table 5.5, discussed above, the intercept parameter is the fitted value for when this independent variable is 0. A categorical variable cannot be 0 by definition: it can only take one of a set of values (for example ‘regular’ or ‘irregular’). With binary variables (like regularity here), a common practice is to choose one of these values as the ‘reference level’ and construct a numerical variable based on that, which takes the value 0 for all data points that have the reference level value and 1 for the data points that have the other level. The intercept thus becomes the fitted value for the reference level. In the present model, the reference level is ‘regular’, and the intercept value of 7.79 is the mean frequency value for regular verbs. For these verbs, the constructed numerical variable is 0, so the coefficient for this variable is multiplied by 0 and does not change the predicted value. For the non-reference level, however, the numerical variable is 1. This means the fitted values for irregular verbs are calculated by taking the intercept and adding the coefficient for this variable. With the present model, that means the fitted value for all irregular verbs is 7.79 + (1 × 0.40) = 8.19, which is the mean of the frequencies of all the irregular verbs.
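The 0/1 coding described above can be checked with a small sketch in Python. It is an illustration on made-up data, not the model reported in Table 5.6: the helper `fit_simple_ols` and the six toy verbs are hypothetical, and the fit uses the textbook closed-form least-squares solution for a single predictor.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one numerical predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data: (level of the regularity factor, frequency).
verbs = [
    ("regular", 7.0), ("regular", 8.0), ("regular", 8.4),
    ("irregular", 7.9), ("irregular", 8.3), ("irregular", 8.7),
]

# Dummy variable: 0 for the reference level 'regular', 1 for 'irregular'.
dummy = [0 if level == "regular" else 1 for level, _ in verbs]
freq = [f for _, f in verbs]

intercept, slope = fit_simple_ols(dummy, freq)
mean_regular = mean([f for level, f in verbs if level == "regular"])
mean_irregular = mean([f for level, f in verbs if level == "irregular"])

# The intercept is the mean of the reference level, and the intercept
# plus the coefficient is the mean of the other level.
print(intercept, mean_regular)
print(intercept + 1 * slope, mean_irregular)

# The same arithmetic with the rounded estimates from Table 5.6.
fitted_irregular = 7.79 + 1 * 0.40
print(round(fitted_irregular, 2))
```

Swapping the reference level would change the intercept to the irregular mean and flip the sign of the coefficient, but the fitted values for both groups would stay the same.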

Happily, the standard error, Wald test, and probability values work much the same way with categorical independent variables as they did with numerical ones (discussed above): the difference between the estimate and the expected value (if the expected value is 0, this is just the estimate) is divided by the standard error of the estimate, and the resulting statistic is located in a standard normal distribution to find a probability value. If this p is less than 0.05, I assume that the difference expressed by the variable estimate is significant, meaning the variable in question has a significant effect. In the model of frequency by regularity, that is the case (p = 0.04): irregular verbs are significantly more frequent than regular ones overall.

⁷⁵ There is no meaning attached to the color of the line here. It was chosen merely to make the line more visible in a figure with several other lines.
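The test just described can be traced with a short sketch in Python. It follows the description in the text (a statistic located in a standard normal distribution, two-sided) and is not the output of any particular statistics package; the function names `wald_statistic` and `two_sided_normal_p` are hypothetical. The rounded estimate and standard error from Table 5.6 give a statistic of about 2.0 rather than the reported 2.03, presumably because the table was computed from unrounded values, so the sketch also evaluates the reported statistic directly.

```python
from math import erfc, sqrt


def wald_statistic(estimate, std_error, expected=0.0):
    """Distance of the estimate from the expected value, in standard errors."""
    return (estimate - expected) / std_error


def two_sided_normal_p(z):
    """Probability of a value at least as extreme as z under a standard normal."""
    return erfc(abs(z) / sqrt(2))


# Rounded values for the 'irregular' coefficient in Table 5.6.
z_rounded = wald_statistic(0.40, 0.20)
p_rounded = two_sided_normal_p(z_rounded)

# The statistic as reported in Table 5.6.
p_reported = two_sided_normal_p(2.03)

print(z_rounded, p_rounded)
print(round(p_reported, 2))
```

The probability for the reported statistic rounds to the 0.04 shown in the table, whereas the one computed from the rounded inputs lands just above 0.045. Conclusions close to the 0.05 threshold can therefore be sensitive to rounding.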

variable       parameter estimate   standard error   t       p
(Intercept)    7.79                 0.14             55.90   < 0.01
irregular      0.40                 0.20             2.03    0.04

Table 5.6: Coefficients of regression model for writing corpus frequency on regularity of inflectional paradigm

[Figure 5.36 plot area: box plots with group means marked by X; x-axis: inflection (regular, irregular); y-axis: frequency in writing, ticks at 4, 6, 8, 10, 12]

Figure 5.36: Frequency in writing of Dutch verbs by regularity of inflectional paradigm (boxes) and linear regression model thereof (orange line)