5.2.4 Analyzing eye gaze data

5.2.4.4 Comparing these methods

I have argued here that comparing mean percentages across time windows is a conceptually misguided method of analyzing eye gaze data, and that GCA and SSANOVA seem better suited to that task. Which of the latter two is better? There is no immediately apparent answer to that question. It is true that GCA (like mixed-effects models generally) seems most appropriate when there are many independent variables and one wants to test which of them are significant predictors of the data while controlling for all others. SSANOVA, on the other hand, is designed to model (and investigate the differences between) just a few curves, based on only one or two grouping variables of interest. In eyetracking studies that fit broadly into the visual world paradigm, there are generally only a few curves that are compared. While this appears to make SSANOVA a better fit, GCA is not incapable of dealing with just a few variables merely because it is designed to handle many. To answer the question of which method is better, this section compares the performance of GCA and SSANOVA using the eyetrack dataset and two sets of randomly generated, simulated data.⁶⁶

… (2009), Chanethom (2011), Ardestani (2013), Stevenson et al. (2014), and Zhang et al. (2014).

The eyetrack dataset has data for all timepoints from all participants in all trials, as mentioned in Section 5.2.4.3. To see how GCA and SSANOVA perform on less ideal data, this section will use only a subset of the dataset, consisting of six (out of 48) sessions. Fig. 5.20 shows that this subset exhibits the same broad pattern of proportion over time, but with certain differences: for example, gazes on the ‘empty’ region appear to increase and gazes on the color-matching distractor image appear to decrease at around 500 ms in the subset, whereas the whole dataset (shown in Fig. 5.16 above) does not have this peak and trough.

[Plot: proportion (0.00 to 1.00) against time (−500 to 1000 ms), with one line per image region: color distractor, empty, object distractor, target]

Figure 5.20: Average proportions of gazes on four different regions of the screen over time (in milliseconds) in the subset

⁶⁶ I am grateful to Dave Kleinschmidt for suggesting simulated data as a test of performance. Any errors in the implementation of this idea are of course my own.

As demonstrated in Section 5.2.4.2 above, a linear model with orthogonalized polynomials of time as predictors (as in GCA) can approximate data with non-linear changes over time, while still being linear and therefore allowing the use of tools developed for linear models. Fig. 5.21 shows the best fit of a GCA model with third-order polynomial terms to the subset used here. While the parameters of a GCA model are not easy to interpret, this method allows the relatively simple statistical method of regression modelling to be extended to non-linear data. This model is still a linear regression model: for example, the formula in (5.7) describes the model curve of gazes on ‘empty’ regions (orange dashed line in Fig. 5.21).

(5.7) Y = 0.626 + (−2.598 × t₁) + (0.407 × t₂) + (0.480 × t₃) + ε

This formula multiplies predictors by parameters and adds constants (the intercept 0.626 and the error term). It is entirely ignorant of the fact that the predictors t₂ and t₃ were calculated using quadratic and cubic polynomials.
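The linearity of (5.7) can be illustrated with a minimal sketch. The following Python code is not the analysis code used in this thesis (which was written in R); it builds orthogonalized polynomial predictors by Gram-Schmidt orthogonalization and then evaluates (5.7) as a plain weighted sum. The 50 ms time bins, the unit-length scaling of the predictors, the omission of the error term, and all function names are assumptions made for illustration, so the resulting values should not be expected to reproduce the curve in Fig. 5.21.

```python
import math


def orthogonal_polynomials(times, degree):
    """Return `degree` orthonormal polynomial columns for `times`.

    Uses Gram-Schmidt orthogonalization. Unit-length scaling is an
    assumption; the scaling used in the thesis is not stated in the text.
    """
    n = len(times)
    mean = sum(times) / n
    centered = [t - mean for t in times]
    basis = [[1.0] * n]  # constant column, used only for orthogonalization
    for power in range(1, degree + 1):
        column = [c ** power for c in centered]
        for previous in basis:
            scale = (sum(a * b for a, b in zip(column, previous))
                     / sum(b * b for b in previous))
            column = [a - scale * b for a, b in zip(column, previous)]
        norm = math.sqrt(sum(a * a for a in column))
        basis.append([a / norm for a in column])
    return basis[1:]  # drop the constant column


# Intercept and slopes as printed in (5.7); the error term is left out.
COEFFICIENTS = (0.626, -2.598, 0.407, 0.480)


def model_curve(times):
    """Evaluate (5.7): a weighted sum of predictors plus an intercept."""
    t1, t2, t3 = orthogonal_polynomials(times, 3)
    b0, b1, b2, b3 = COEFFICIENTS
    return [b0 + b1 * x1 + b2 * x2 + b3 * x3
            for x1, x2, x3 in zip(t1, t2, t3)]


time_bins = list(range(-500, 1001, 50))  # hypothetical 50 ms bins
curve = model_curve(time_bins)
```

Note that `model_curve` contains no powers of time at all: the non-linearity lives entirely in how the predictor columns were constructed, which is the point made above.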

[Plot: proportion (0.00 to 1.00) against time (−500 to 1000 ms), with one line per image region: color distractor, empty, object distractor, target]

Figure 5.21: Average proportions of gazes in the subset (solid lines) and third-order GCA model fits (dashed lines)

Third-order polynomials are essentially limited to one peak and one trough. Therefore, this model does fairly well on those curves that can be approximated with one peak and trough (target and object-matching distractor), but only captures a very rough overall shape of more complex curves (color-matching distractor and empty region). To allow for more peaks and troughs and thus enable GCA to fit the curve shape more accurately, orthogonalized polynomial terms of higher order are necessary.
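The limit on peaks and troughs follows from the fact that the slope of a polynomial of order n is itself a polynomial of order n − 1, which can change sign at most n − 1 times. As a rough numerical check (a sketch only; the function name and the two example polynomials are chosen here for illustration and are unrelated to the fitted models), the number of turning points can be counted on a fine grid:

```python
def count_turning_points(coefficients, start, stop, steps=2000):
    """Count sign changes in the slope of a polynomial on [start, stop].

    `coefficients` are ordered from the constant term upwards, so
    [0, -1, 0, 1] stands for x**3 - x.
    """
    def slope(x):
        return sum(k * c * x ** (k - 1)
                   for k, c in enumerate(coefficients) if k > 0)

    xs = [start + (stop - start) * i / steps for i in range(steps + 1)]
    signs = [slope(x) > 0 for x in xs if slope(x) != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# A third-order polynomial: one peak and one trough.
cubic_turns = count_turning_points([0, -1, 0, 1], -2, 2)

# A seventh-order polynomial (the Chebyshev polynomial T7): up to six.
seventh_turns = count_turning_points(
    [0, -7, 0, 56, 0, -112, 0, 64], -1, 1)
```

Here `cubic_turns` is 2 and `seventh_turns` is 6, the maximum possible for each order.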

The best-fitting GCA model using seventh-order polynomial terms is shown in Fig. 5.22, and (5.8) is its formula for the ‘empty’ (orange) curve. This model is obviously a closer fit than the third-order GCA above, but that comes at a cost. Firstly, fitting it is computationally very expensive. The somewhat arbitrary order of 7 was chosen here simply because models with higher orders took very long to fit on a typical desktop computer. Moreover, the model fitting function (lmer() from the lme4 package; Bates et al. 2014) warned that it failed to converge on the data with seventh-order polynomials, which suggests (put very briefly) that the model as specified is not appropriate for the data as it is. Secondly, the model is all but impossible to interpret: most of its 32 parameters achieve significance (under the assumption that the t-values of estimated parameters approximate a standard normal distribution and that parameters whose t has an absolute value of more than 1.96 are therefore significant at α = 0.05). While this at first glance seems to reflect the fact that the seventh-order curves are obviously better fits than the third-order ones, there is no objective way of finding the best order for a GCA model. A model of lower order (fifth, say) would be less likely to overfit, but that would mean dropping the ‘significant’ higher-order terms. Thirdly, the seventh-order model appears to be overfitting the data. Consider for example the color-matching distractor and empty region gazes before −500 ms, where this model suggests that there is an initial rise in the proportion of gazes toward the color-matching distractor (and a concurrent fall in gazes on empty regions), followed by a drop before the major rise in proportion. This pattern is not apparent in the raw data, meaning the model is inaccurate at best and misleading at worst here. This is a known problem: terms of higher order “are just capturing differences in the tails” (Mirman 2014; “tails” here means the beginning and end of the time window under analysis).

(5.8) Y = 0.626 + (−2.598 × t₁) + (0.407 × t₂) + (0.480 × t₃) + (0.472 × t₄) + (0.738 × t₅) + (0.280 × t₆) + (0.240 × t₇) + ε

[Plot: proportion (0.00 to 1.00) against time (−500 to 1000 ms), with one line per image region: color distractor, empty, object distractor, target]

Figure 5.22: Average proportions of gazes in the subset (solid lines) and seventh-order GCA model fits (dashed lines)
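The significance criterion applied to the parameters of the seventh-order model (|t| > 1.96 at α = 0.05) rests on treating each t-value as a draw from a standard normal distribution. A minimal sketch of that approximation, with function names chosen here for illustration (this is not part of lme4 or of the analysis code):

```python
import math


def two_sided_p(t_value):
    """Two-sided p-value for t_value under a standard normal distribution."""
    return math.erfc(abs(t_value) / math.sqrt(2))


def is_significant(t_value, alpha=0.05):
    """Apply the normal-approximation criterion described in the text."""
    return two_sided_p(t_value) < alpha


p_at_threshold = two_sided_p(1.96)  # very close to 0.05
```

The approximation ignores degrees of freedom altogether, which is one reason to read the count of ‘significant’ parameters with caution.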

Fig. 5.23 shows that the SSANOVA model captures the initial minor increase in gazes on the object-matching distractor image (around −500 ms), but without modelling a fictitious drop shortly thereafter. At the same time, it provides smooth curves: SSANOVA correctly does not model the color distractor and empty curves to be on the same level at around 200 ms, even though the data suggest this. The smoothness of splines makes them unlikely to fit this extreme peak, and comparing the SSANOVA fits in Fig. 5.23 to the full data in Fig. 5.16 shows that the larger dataset is very well approximated by SSANOVA even based on the small subset.

[Plot: proportion (0.00 to 1.00) against time (−500 to 1000 ms), with one line per image region: color distractor, empty, object distractor, target]

Figure 5.23: Average proportions of gazes in the subset (solid lines) and SSANOVA model fits (dashed lines) with their confidence intervals (faded bands)

Thus, on this one realistic dataset, SSANOVA appears to perform better than GCA: it is less computationally expensive, it is less likely to overfit, and it makes finding areas of difference between curves easier thanks to the 95% confidence intervals.

However, this may be coincidence: it is conceivable that the eyetrack dataset is singularly well suited to SSANOVA (the dataset is provided as part of an SSANOVA package, after all). Therefore, I tested the performance of GCA and SSANOVA on simulated data in two sets of simulations. This allowed more robust testing, since it is easy to simulate thousands of simple random datasets. It also allowed me to make these simulated datasets much less perfect than the eyetrack data by introducing a large random error, and thus to make them more like the real eyetracking data collected in the present study. (These simulations were run in R (R Development Core Team 2011); see Appendix G for details and code.)

The starting points for these two sets of simulations were data-generating functions. A basic polynomial of the form given in (5.9) was defined for each simulation by specifying the twelve β-values (resulting in what is effectively a GCA-like model, to counter the possible bias for SSANOVA in the eyetrack dataset). The variables x₁, x₂, and x₃ are the values of orthogonalized polynomials of first, second, and third order for the range of a continuous integer variable x (simulating time bins); d₁ and d₂ are two of the three levels of the grouping variable (thus simulating a study with three areas of interest; note that there is no variable for the third level because that was used as the reference level); and the variables xₙ × dₙ are the interaction terms between the orthogonal polynomials and the two grouping variable levels (which are necessary in order to allow for different curve shapes over time for the different levels).

(5.9) y = β₀ + (β₁ × x₁) + (β₂ × x₂) + (β₃ × x₃) + (β₄ × d₁) + (β₅ × d₂) + (β₆ × (x₁ × d₁)) + (β₇ × (x₁ × d₂)) + (β₈ × (x₂ × d₁)) + (β₉ × (x₂ × d₂)) + (β₁₀ × (x₃ × d₁)) + (β₁₁ × (x₃ × d₂))

Each of the two sets of simulations had two of these data-generating functions: a null function with β₁ … β₁₁ set to 0 and an effect function with some of those β-values different from 0. To introduce randomness,⁶⁷ grouped errors were added to simulate 20 participants: firstly, the constant β-values were changed by adding or subtracting random values for each simulated participant, and secondly, a random error was added to each y-value. Both of these randomizations were based on normal distributions with mean 0. Thus, each simulated participant had their individual generation function, and the output of these was subject to further noise. As a result, each dataset in these simulations had 20 distinct y-values for each combination of x (time) and d (area), one for each simulated participant.
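The generation procedure just described can be sketched as follows. This is a Python illustration rather than the R code in Appendix G, and several details are assumptions that the text does not specify: the two standard deviations, the 0/1 coding of d₁ and d₂, the level names, the four-bin time variable, the intercept of 0.5, and the choice of β₈ as the single non-zero effect.

```python
import random


def generate_dataset(betas, polynomials, n_participants=20,
                     beta_sd=0.1, noise_sd=0.5, seed=None):
    """Simulate one dataset from the data-generating function in (5.9).

    betas: the twelve values β0 ... β11.
    polynomials: one (x1, x2, x3) triple per time bin.
    beta_sd, noise_sd: hypothetical standard deviations (assumptions).
    Returns rows of (participant, bin index, level, y).
    """
    rng = random.Random(seed)
    # Assumed 0/1 dummy coding; the third level is the reference level.
    dummies = {"reference": (0, 0), "level1": (1, 0), "level2": (0, 1)}
    rows = []
    for participant in range(n_participants):
        # Each participant gets their own perturbed copy of the betas.
        own = [b + rng.gauss(0, beta_sd) for b in betas]
        for bin_index, (x1, x2, x3) in enumerate(polynomials):
            for level, (d1, d2) in dummies.items():
                predictors = [1, x1, x2, x3, d1, d2,
                              x1 * d1, x1 * d2, x2 * d1, x2 * d2,
                              x3 * d1, x3 * d2]
                y = sum(b * p for b, p in zip(own, predictors))
                y += rng.gauss(0, noise_sd)  # observation-level noise
                rows.append((participant, bin_index, level, y))
    return rows


# Orthogonal (unnormalized) polynomial values for four time bins.
four_bins = [(-3, 1, -1), (-1, -1, 3), (1, -1, -3), (3, 1, 1)]

null_betas = [0.5] + [0.0] * 11   # hypothetical intercept, no effects
effect_betas = list(null_betas)
effect_betas[8] = 1.0             # hypothetical: x2 × d1 interaction

null_data = generate_dataset(null_betas, four_bins, seed=1)
effect_data = generate_dataset(effect_betas, four_bins, seed=1)
```

Each call yields 20 y-values per combination of time bin and level, matching the structure described above; repeating the call with different seeds would give the repeated runs.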

Each function was run 1000 times (with the randomization differing from run to run). GCA and SSANOVA models were then fitted to each of these datasets, and their performance was compared. The following paragraphs describe the results of these simulations in two sets: the first set consists of 1000 datasets based on a function with just one non-zero β and 1000 datasets based on the corresponding function with β₁ … β₁₁ all set to 0. The second set consists of 1000 datasets based on a more realistic and complex function with several non-zero β-values and 1000 datasets based on its counterpart with β₁ … β₁₁ all set to 0.

In the first set of simulations, only one β in the effect model (apart from the intercept β₀) was not 0: the parameter for the interaction between the ‘response’ level of the image factor and the second-order orthogonalized polynomial value was set to 1, thus giving a parabola-like trajectory to the curves for the percentages of gazes at the ‘response’ and ‘other’⁶⁸ images (see Fig. 5.24b). The null model had this and all other β₁ … β₁₁ set to 0 (and the same intercept β₀ as the effect model), thus simulating the three absolutely identical flat lines in Fig. 5.24a (where the three lines have been given different line types to make it more apparent that they overlap completely).

⁶⁷ Since all simulation data was computer-generated, strictly speaking the errors were only pseudorandom. I do not consider the distinction to be relevant for these simulations, and will continue to use the term ‘random’ for simplicity’s sake.

⁶⁸ The ‘other’ level of the image variable received the inverse of this effect due to the way contrasts were handled in setting up the dummy variables for that categorical variable in this simple simulation.

[Two panels, (a) Null model and (b) Effect model, each with one line per image level: other, response, big]

Figure 5.24: Data generation functions for the first set of simulations (before being randomized for simulated participants)

Note that this was the data before randomization. With the random errors added to parameters and y-values, the data was much messier, as is apparent from comparing the two simulated datasets in Fig. 5.25 to the underlying functions in Fig. 5.24.

[Two panels, (a) Null model and (b) Effect model, each with one line per image level: other, response, big]

Figure 5.25: Examples of datasets in the first set of simulations (averaged across simulated participants)

1000 datasets like this were generated, a GCA model was fitted to each simulated dataset, and each model’s parameter estimates, their standard errors, and the t-values resulting from these values were saved for analysis. On the assumption that the t-values approximate a standard normal distribution (which is the assumption that underlies how the p-values for parameter estimates are often calculated), the t-values for the parameter of interest should ideally fall between −1.96 and 1.96 in 95% of the simulations without an effect; in the simulation, they fall in this interval 96.4% of the time (see Fig. 5.26). The t-values for the other predictors (which were all zero in the data-generating function) should also be in this interval 95% of the time, and they are 95.1–97.4% of the time.⁶⁹ Similarly, the t-values for this parameter should ideally be outside of that interval in 97.5% of the simulations with an effect, and they are in 91.4% of those simulated models.⁷⁰ Thus, the GCA approach to this kind of data does not have textbook-level (nominal) performance (apparently it slightly underestimates the effect and is thus conservative), but it does get very close.
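The evaluation criterion used here amounts to counting how many of the saved t-values lie between −1.96 and 1.96. A minimal sketch, assuming the t-values for one parameter are available as a plain list; the simulated standard normal draws below merely stand in for the saved t-values and are not results from this study:

```python
import random


def share_within(t_values, bound=1.96):
    """Proportion of t-values inside the interval [-bound, bound]."""
    inside = sum(1 for t in t_values if -bound <= t <= bound)
    return inside / len(t_values)


# Stand-in data: 1000 draws from a standard normal distribution, which is
# what the t-values of a true null parameter should look like under the
# assumption stated above. The seed is arbitrary.
rng = random.Random(2014)
simulated_t = [rng.gauss(0, 1) for _ in range(1000)]
null_share = share_within(simulated_t)
```

With only 1000 models per condition, shares a percentage point or so either side of 95% are to be expected from sampling variation alone.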

[Plot: density of t-values, shown separately for null model data and effect model data; x-axis: t-value, with −1.96 and 1.96 marked]

Figure 5.26: Density plot of t-values of GCA models in the first set of simulations

⁶⁹ The intercept (grand mean) was far from zero in the simulated data, and consequently its t-value is outside the same interval in all 1000 simulation models.

⁷⁰ The other predictors are again zero, and their t-values should again be between −1.96 and 1.96 in 95% of models; they are in 94.3–97.4% of the models.

SSANOVA models were also fitted to these 1000 simulated datasets. Since SSANOVA models are analyzed by whether the 95% confidence intervals of different splines overlap (see Section 5.2.4.3 above), this was how these models were evaluated. Fig. 5.27 shows the SSANOVA splines with 95% confidence intervals for nine null-model and nine effect-model datasets as examples. The number of models that had non-overlapping confidence intervals for at least one time bin was 130 for the simulations without an effect and 331 for the models with an effect; in other words, the false positive rate was 13% and the false negative rate an astonishing 66.9% in this simulation.

[Two panels, (a) Without effect and (b) With effect, each showing nine example models numbered 131 to 139, with one spline per image level: other, response]

Figure 5.27: Examples of SSANOVA models in the first set of simulations

The size of the differences did not differ between the true and the false positives: some models (like #134 in Fig. 5.27a and #139 in Fig. 5.27b) had only small differences in the time interval (sequence of bins) where they did have a difference; others (#135 in both Fig. 5.27a and Fig. 5.27b) showed a much larger difference. Fig. 5.28a is a density plot of the largest vertical differences (one from each model that did have a difference), with the false positives (shown in blue, based on null-model data) and true positives (shown in orange, based on effect-model data) being virtually indistinguishable. This is puzzling, but the difference between the two models #135 in Fig. 5.27 is enlightening: the false positive #135 has a difference all the way through the simulated range of time (it looks like a constant offset rather than an interaction between image and time), while the true positive #135 (like all the models with a difference in Fig. 5.27b) only shows a difference for part of that range. In fact, 79 of the 130 false positives (60.8%) did show a difference all the way through, whereas only 32 of the 331 true positives (9.7%) did. Fig. 5.28b, a histogram of the length of these differences, makes it very obvious that false positives mostly showed their difference for all bins and that true positives mostly had a difference for only a part of the time range.
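The two quantities compared here (the largest difference and the number of bins with a difference) can be derived from the confidence bands of two splines. The following is a sketch under assumptions: the bands are given as per-bin lower and upper bounds, only two curves are compared, and a ‘difference’ is measured as the vertical gap between the band edges. How exactly the differences were measured in the simulations is not spelled out in the text, and all names are illustrative.

```python
def band_gaps(lower_a, upper_a, lower_b, upper_b):
    """Per-bin gap between two confidence bands (0.0 where they overlap)."""
    return [max(la - ub, lb - ua, 0.0)
            for la, ua, lb, ub in zip(lower_a, upper_a, lower_b, upper_b)]


def summarize(gaps):
    """Summarize where and how strongly two bands are separated."""
    differing = [g for g in gaps if g > 0]
    return {
        "any_difference": bool(differing),
        "bins_with_difference": len(differing),
        "whole_range": len(differing) == len(gaps),
        "largest_gap": max(differing, default=0.0),
    }


# Hypothetical bands over three time bins: overlap in the first bin only.
summary = summarize(band_gaps(
    lower_a=[0.1, 0.1, 0.1], upper_a=[0.3, 0.3, 0.3],
    lower_b=[0.2, 0.4, 0.5], upper_b=[0.4, 0.6, 0.7]))
```

In this toy example the bands separate in two of three bins, so `whole_range` is false; a constant offset between two curves would instead set it to true.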

[Two panels, each comparing null model data with effect model data: (a) Density plot of models by largest difference between the two confidence intervals (x-axis: maximum difference; y-axis: density); (b) Histogram of models by number of bins showing significant difference (x-axis: number of bins; y-axis: count)]

Figure 5.28: Comparison of false and true positives in SSANOVA models in the first set of simulations

Redefining a ‘positive’ SSANOVA finding as showing a difference between confidence intervals and this difference not lasting for the entire time range seems warranted. This, of course, changes the numbers: there are 51 false and 299 true positives under that definition, so 299/(51 + 299) = 85.4% of the positives are now true ones (up from 331/(130 + 331) = 71.8% under the previous definition). The SSANOVA approach to this type of data as a whole still appears to be overly conservative, especially when compared to the GCA’s performance above, but it must be remembered here that the data was generated by what is essentially a GCA model, and an unrealistically simple one at that. As the examples in Mirman et al. (2008), Barr (2008), and Mirman (2014) show (and as attempts at modelling subsets of the data gathered in the present study showed), GCA models of real data generally have either no significant parameter βₙ or several, but not just one. Simulations where the effect model had several significant parameters would therefore be more realistic.
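The arithmetic behind the redefinition can be retraced from the counts reported above. The short Python sketch below uses only those counts; the variable and function names are illustrative.

```python
N_PER_CONDITION = 1000  # datasets per condition (null and effect)

# Models with non-overlapping intervals in at least one time bin.
false_all, true_all = 130, 331
# Of those, models whose difference lasted for the entire time range.
false_full_range, true_full_range = 79, 32

# Redefinition: a positive must not span the whole range.
false_positives = false_all - false_full_range   # 51
true_positives = true_all - true_full_range      # 299


def share_true(true_count, false_count):
    """Proportion of all positives that come from effect-model data."""
    return true_count / (true_count + false_count)


share_before = share_true(true_all, false_all)              # 331/461
share_after = share_true(true_positives, false_positives)   # 299/350

false_positive_rate = false_positives / N_PER_CONDITION
false_negative_rate = 1 - true_positives / N_PER_CONDITION
```

The same counts also show the price of the redefinition: the false positive rate drops from 13% to 5.1%, but the false negative rate rises from 66.9% to 70.1%.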

The second set of simulations was designed to meet that goal. These simulations were like the ones in the first set, except that the underlying data-generation function for the effect model had non-zero values for the intercept, the main effect for the third-order time polynomial values, and three interactions between time polynomials and image levels. These were chosen from the effects with p < .05 in a GCA model of the adult participants’ data from experiment 2 of this thesis (the data that is presented in Section 5.3), split by whether the area contained their ultimate response choice, one of the other two choices, or the larger image representing the explicit object in the instruction. All parameters corresponding to effects which had p ≥ .05 in that empirical model were set to 0 in the effect model here. Fig. 5.29b shows this underlying model. The null model again had all parameters except the intercept set to 0 (see Fig. 5.29a). Just as in the first set of simulations, random errors were introduced into these models to generate 2000 datasets (1000 null-effect datasets and 1000 effect datasets). Fig. 5.29c shows the resulting gaze percentages over time from one simulated dataset generated by the null model, and Fig. 5.29d shows the gaze percentages for one dataset generated by the effect model. GCA and SSANOVA models were fitted to these 2000 simulated datasets.

[Four panels, each with one line per image level (other, response, big): (a) Null model; (b) Effect model; (c) and (d) example datasets based on the null model and the effect model, respectively]

Figure 5.29: Data generation functions for the second set of simulations (top panels) and examples of simulated datasets (bottom panels)

For each of these 2000 GCA models, the parameter estimates, their standard errors, and the t-values resulting from these values were calculated. On the assumption that the t-values approximate a standard normal distribution (Mirman 2014), the t-values for all parameters⁷¹ should ideally fall between −1.96 and 1.96 in 95% of the null-model simulations. As the panels on the left in Fig. 5.30 show, they do so 93.4% to 96.3% of the time (slightly different percentages for the different parameters). In the effect-model simulations, the t-values for the parameters that were not zero should ideally be outside of that interval in 97.5% of the simulations with an

In document The acquisition of sentence alternations : how children understand and use the English dative alternation (Page 110-125)