Trellis graphics - Exploratory data analysis. An introduction to R for the language sciences

A trellis is a wooden grid for growing roses and other flowers that need vertical support.

Trellis graphics are displays in which data are visualized by many systematically organized graphs simultaneously. We have encountered one trellis function already, the pairs() function that produces a pairwise scatterplot matrix, where each plot is a hole in the 'trellis'. More advanced functions for more complex trellis plots are available in the lattice library. In order to use these functions, we first have to load this library:

library(lattice)

Trellis graphics become important when you are dealing with different groups of data.

For instance, the words in the items data frame fall into two groups: animals on the one hand, and the produce of plants (fruits, vegetables, nuts) on the other hand. Therefore, the factor Class (with levels animal and plant) is a grouping factor for the words.

Another possible grouping factor is whether the word is morphologically complex (e.g., woodpecker) or morphologically simple (e.g., snake). With respect to the lexical decision data in lexdec, the factor Subject is a grouping factor: Each subject completed the same experiment with 79 words and 79 nonwords. In turn, the subjects can be grouped by their first language, English, or some other language.

A question that arises when running a lexical decision experiment with native and non-native speakers of English is whether there might be systematic differences in how they perform this task. It is to be expected that the non-native speakers require more time for a lexical decision. But the way they make errors might differ as well. In order to explore this possibility, we make boxplots of the reaction times for correct and incorrect responses, and we do this separately for the native speakers and for the non-native speakers in the experiment. In other words, we use the factor NativeLanguage as a grouping factor. In order to make this grouped boxplot, we use the bwplot() function from the lattice library, as follows:

> bwplot(RT ~ Correct | NativeLanguage, data = lexdec)

The result is shown in Figure 1.7. As you can see, bwplot() requires two arguments, a formula and a data frame, lexdec in this example.

Figure 1.7: Trellis box and whiskers plot for log reaction time by accuracy (correct versus incorrect response) grouped by the first language of the subject. (Panels: English, Other; horizontal axis: correct, incorrect; vertical axis: RT.)

The formula

RT ~ Correct | NativeLanguage

is read as 'consider RT as a function of (or, as depending on) Correct (with levels correct and incorrect), grouped by the levels of NativeLanguage (with levels English and Other)'. Note that the vertical bar is the grouping operator. Another paraphrase within the context of bwplot() is 'create box and whisker plots for the distributions of reaction times for the levels of Correct, conditioned on the levels of NativeLanguage'. The result is a plot with two panels, one for each level of the main grouping factor, native language. Within each of these panels, we have two box and whiskers plots, one for each level of Correct.

This trellis graph shows some remarkable differences between the native and non-native speakers of English. First of all, we see that the boxes (and medians) for the non-native speakers are shifted upwards compared to those for the native speakers, indicating that they required more time for their decisions, as expected. Interestingly, we also see that the incorrect responses were associated with shorter decision latencies for the native speakers, but with longer latencies for the non-native speakers. Finally, note that there are many outliers only for the correct responses, for both groups of subjects. Later in this course, we shall see how we can test whether the pattern that we see here is indeed reason for surprise. What is clear at this point is that there is a pattern in the data that is worth examining in greater detail.
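To make the role of the grouping operator concrete, here is a minimal sketch that applies the same kind of formula to simulated data. The data frame d and all of its values are made up for illustration (they are not the lexdec data); only the column names mirror those used in the text. The tapply() call at the end gives, as numbers, the medians that the plot displays as dots inside the boxes.

```r
library(lattice)

# Hypothetical stand-in for lexdec: simulated log reaction times (not real data).
set.seed(1)
n <- 200
d <- data.frame(
  RT = rnorm(n, mean = 6.4, sd = 0.25),
  Correct = factor(sample(c("correct", "incorrect"), n, replace = TRUE,
                          prob = c(0.9, 0.1))),
  NativeLanguage = factor(sample(c("English", "Other"), n, replace = TRUE))
)

# One panel per level of NativeLanguage, one box per level of Correct.
p <- bwplot(RT ~ Correct | NativeLanguage, data = d)
print(p)

# The same grouping summarized numerically: median RT per cell.
med <- tapply(d$RT, list(d$Correct, d$NativeLanguage), median)
med
```

Swapping the two factors in the formula (RT ~ NativeLanguage | Correct) would instead give one panel per accuracy level with the two language groups side by side within each panel.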

Figure 1.8 illustrates the powerful but also more complex xyplot() function. For each of the subjects in the weight rating experiment, it shows the weight rating as a function of log frequency. The initials of the subjects (the grouping factor) appear in the title bars above each panel. This graph was made with the function xylowess(), which is available in the scripts file FUNCTIONS/cap1.q. This function facilitates the use of xyplot() but is much less flexible. We discuss this function and xyplot() in some more detail in the next section. We first load the function into R using source(), and then run xylowess():

> source("FUNCTIONS/cap1.q")

> xylowess(Rating ~ Frequency | Subject, data = weight,
+    xlab = "log Frequency", ylab = "Weight Rating")

The dependent variable (Rating) appears on the vertical axes, the predictor (Frequency) is graphed on the horizontal axes, and there is one panel for each of the levels of the grouping factor, Subject. As can be seen in Figure 1.8, weight ratings appear to increase with increasing (log) frequency. There seems to be some variation in how strong the effect is. To judge from the scatterplot smoothers, subject G (third on the bottom row) does not seem to have this frequency effect, in contrast to, for instance, subject R5, for whom the effect seems quite large.
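Since xylowess() lives in a separate scripts file, it may help to see how a similar display can be built directly with xyplot(). The following is a sketch under stated assumptions, not the definition of xylowess(): it uses a custom panel function that draws the points and then adds a loess smoother with panel.loess(). The data frame w, the subject labels, and all values are simulated for illustration.

```r
library(lattice)

# Hypothetical stand-in for the weight ratings: 4 subjects, 40 words each.
set.seed(2)
w <- expand.grid(Word = 1:40, Subject = c("A1", "A2", "G", "R5"))
freq <- runif(40, min = 1, max = 8)   # one simulated log frequency per word
w$Frequency <- freq[w$Word]
w$Rating <- 2 + 0.4 * w$Frequency + rnorm(nrow(w), sd = 1)

# Scatterplot plus a loess smoother in every panel, one panel per subject.
p <- xyplot(Rating ~ Frequency | Subject, data = w,
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)            # the data points
              panel.loess(x, y, span = 0.9, ...) # the scatterplot smoother
            },
            xlab = "log Frequency", ylab = "Weight Rating")
print(p)
```

The span value of 0.9 is an arbitrary choice for this sketch; smaller values make the smoother follow the points more closely.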

A similar plot for weight rating by number of synsets is shown in Figure 1.9. What we observe here for almost all subjects is a shallow U-shaped curve. A problem that arises here, however, is that words with many synsets also tend to have high frequencies. Hence, the right part of the U-shaped curves might reflect the effect of frequency rather than number of synsets. In order to explore this possibility, we make use of the conditioning plot shown in Figure 1.10.

Figure 1.8: Weight rating as a function of frequency grouped by subject.

Figure 1.9: Weight rating as a function of the synset count, grouped by subject.

> xylowess(Rating ~ SynsetCount | equal.count(Frequency), data = weight,
+    xlab = "log Synset Count", ylab = "Weight Rating")

This plot graphs the ratings as a function of the synset counts, conditioning on equal counts of Frequency.

The function equal.count() splits the frequencies into six overlapping frequency bands with roughly equal numbers of observations in each band. For each of these six frequency bands, xylowess() produces a scatterplot with a scatterplot smoother. The strips above each plot highlight the range of the frequencies used for that panel. As you can see in Figure 1.10, the lowest frequency band is found in the lower left plot, and the highest frequency band in the upper right plot. In other words, the panels are arranged from left to right and from bottom to top for the frequency count on which we have conditioned.
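The bands that equal.count() creates can also be inspected directly. The sketch below assumes a simulated vector of log frequencies (not the weight data) and spells out what we take to be the defaults, six bands with an overlap of one half, as explicit arguments. It then counts how many observations fall within the limits of each band.

```r
library(lattice)

# Hypothetical log frequencies (simulated, not the weight data).
set.seed(3)
Frequency <- runif(81, min = 1, max = 8)

# Six overlapping intervals; neighbouring intervals share about half their points.
bands <- equal.count(Frequency, number = 6, overlap = 0.5)
levels(bands)   # lower and upper limit of each band

# Count the observations that fall within the limits of each band.
counts <- sapply(unclass(levels(bands)),
                 function(iv) sum(Frequency >= iv[1] & Frequency <= iv[2]))
counts
```

Because the bands overlap, the counts add up to more than the number of observations: a word near a band boundary is plotted in two adjacent panels.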

Note that as we proceed from bottom left to upper right, the curve moves upward. This is the frequency effect that we observed earlier: The higher the frequency, the more likely the weight rating is to be high as well. In addition, we see that within most frequency bands, the effect of the synset count seems to be negative, i.e., for larger synset counts, the weight rating is lower. This suggests that semantic ambiguity leads to decreased estimates of weight, in contrast to frequency, which gives rise to increased estimates of weight.

1.2.4 Summary

Before proceeding to the next section, first make sure you are familiar with the new functions and parameters:

vector functions: range() min() max() sum() mean() quantile()

simple plot functions: plot() barplot() hist() plot(density()) lines() mtext()

trellis graphics: bwplot() xylowess() equal.count()

graphical parameters: par mfrow cex xlab ylab lty xaxt

formula: | ~

libraries: library(lattice) library(MASS)

Figure 1.10: A conditioning plot for weight rating by number of synsets for equal counts of frequency. (Horizontal axis: log Synset Count; vertical axis: Weight Rating; six panels, one for each level of equal.count(Frequency).)
