Visualizing two or more variables - Exploratory data analysis. An introduction to R for the lan

In the lexical decision experiment, some subjects were native speakers of English, others were not. And some were men, and others women. Let’s cross-tabulate the subjects by native language and sex. We first create a data frame with the subject-specific information only,

> subjects = unique(lexdec[, c("Subject", "NativeLanguage", "Sex")])

> subjects[1:4,]

    Subject NativeLanguage Sex
1        A1        English   F
475      A2        English   M
712      A3          Other   F
949       C        English   F

and then use table() with two input factors instead of one:

> subjects.tab = table(subjects$NativeLanguage, subjects$Sex)

> subjects.tab
          F M
  English 7 5
  Other   7 2
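To see what unique() and table() contribute here, the two steps can be tried on a small made-up data frame. This is only an illustrative sketch: the data frame toy and all of its values are invented for the example and are not part of lexdec.

```r
# Made-up stand-in for lexdec: several rows (trials) per subject.
toy <- data.frame(
  Subject        = c("S1", "S1", "S2", "S2", "S3", "S4", "S5"),
  NativeLanguage = c("English", "English", "Other", "Other",
                     "English", "Other", "English"),
  Sex            = c("F", "F", "M", "M", "M", "F", "F")
)

# Step 1: keep one row per subject.
toy.subjects <- unique(toy[, c("Subject", "NativeLanguage", "Sex")])

# Step 2: cross-tabulate; rows are native languages, columns are sexes.
toy.tab <- table(toy.subjects$NativeLanguage, toy.subjects$Sex)
print(toy.tab)
```

Without the unique() step, table() would count trials rather than subjects, so subjects with many rows would dominate the cell counts.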

We can make a barplot for this two-way contingency table with the same barplot() function we used above. The upper left panel of Figure 1.4 was produced with

> barplot(subjects.tab, beside = T,
+   legend.text=c("English", "other"),
+   col=c("black", "white"))

With beside=T we tell R to plot the two values of a column beside each other, instead of stacking them above each other. The legend is added with the argument legend.text, which in this case is a vector specifying the rows of the contingency table.
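The effect of beside can be checked by drawing both variants from the same counts. The sketch below rebuilds the two-by-two table as a plain matrix, so that it runs without the lexdec data. Taking the legend labels from rownames() is a choice made for this example, not the call used for Figure 1.4.

```r
# Counts copied from the contingency table shown above.
tab <- matrix(c(7, 7, 5, 2), nrow = 2,
              dimnames = list(c("English", "Other"), c("F", "M")))

# beside = TRUE: one bar per cell, grouped by column (sex).
mids.beside <- barplot(tab, beside = TRUE,
                       legend.text = rownames(tab),
                       col = c("black", "white"))

# beside = FALSE (the default): one stacked bar per column.
mids.stacked <- barplot(tab, beside = FALSE,
                        legend.text = rownames(tab),
                        col = c("black", "white"))
```

barplot() returns the midpoints of the bars it has drawn, which is why the grouped version yields a matrix (one midpoint per cell) and the stacked version a vector (one midpoint per column).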

[Figure 1.4 panels: barplot of subject counts by sex (F, M) with legend English/other; two scatterplots of items$FamilySize against items$Frequency, the second with a smoother; scatterplot of items$SynsetCount against items$Length with a smoother.]

Figure 1.4: A barplot for a 2 by 2 contingency table, and scatterplots with scatterplot smoothers.

The remaining panels of Figure 1.4 illustrate how the relation between two numerical variables can be visualized by means of scatterplots. The upper right panel plots the 81 words in items in the plane spanned by log Frequency and log Family Size. You can see that words with a very high frequency tend to have a very high family size. In other words, the two variables are positively correlated. At the same time, it is also clear that there is a lot of noise, and that the scatter (or variance) in family sizes is greater for lower frequencies. Such an uneven pattern is referred to as heteroskedastic, and is endemic in lexical statistics. The following lines of code illustrate how to create the three scatterplots of Figure 1.4.

> plot(items$Frequency, items$FamilySize)          # upper right panel
> plot(items$Frequency, items$FamilySize)          # lower left panel
> lines(lowess(items$Frequency, items$FamilySize))
> plot(items$Length, items$SynsetCount)            # lower right panel
> lines(lowess(items$Length, items$SynsetCount))

The lower left panel illustrates how you can use a scatterplot smoother to bring out the main trend in the data. The function that we have used here is lowess(), the output of which is fed into lines(). There are many other smoothers; for further details we refer the reader to Venables & Ripley (2000:228–232). As with histograms and density estimation, the shape of the smooth curve running through the data points depends on a 'bin' width, which specifies the points in the plot that influence the smooth at each value.

The default settings for this bin width (or smoother span) are a sensible first guess, but when you think there is undersmoothing or oversmoothing you can try out other spans.

For further details, the reader should consult the on-line help for lowess().
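Trying out other spans amounts to changing the f argument of lowess(), which defaults to 2/3. The following sketch uses simulated data rather than items, so the numbers and the particular spans chosen are assumptions made for illustration only.

```r
set.seed(1)                              # reproducible made-up data
x <- runif(100, min = 2, max = 8)
y <- 0.4 * x + rnorm(100, sd = 0.6)

plot(x, y)
lines(lowess(x, y))                      # default span, f = 2/3
lines(lowess(x, y, f = 0.2), lty = 2)    # smaller span: follows local wiggles
lines(lowess(x, y, f = 0.9), lty = 3)    # larger span: flatter, smoother curve

# lowess() returns a list with components x and y, ready for lines().
sm <- lowess(x, y, f = 0.2)
str(sm)
```

Overlaying a few spans in one plot, with different line types, is a quick way to judge whether the default is undersmoothing or oversmoothing for a given data set.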

The lower right panel shows a scatterplot for word length and number of synsets, again with a lowess smoother. Comparing the two graphs, we see that the correlation in the left panel seems more robust than the one in the right panel. This is as far as visual inspection of the data can lead us. We will need more formal methods to guide us with respect to the question of whether there are grounds for assuming these patterns would be observed again in new samples of the same kind of words.

The plot of family size by frequency raises the question of which words in the data set have both high frequencies and high family sizes. A plot that is quite helpful here is a scatterplot in which the circles are replaced by the corresponding words, as shown in Figure 1.5. This figure was produced with

> plot(items$Frequency, items$FamilySize, type="n",
+   xlab="log frequency", ylab="log family size")
> text(items$Frequency, items$FamilySize,
+   as.character(items$Item), cex=0.8)   # convert factor to strings

It is easy to see that horse and dog are the words with the highest frequency and family size in the sample. The text() function is the crucial tool here. It requires three vectors of equal length: a vector of x-coordinates, a vector of y-coordinates, and a vector of strings.

[Figure 1.5 plot area: words used as point labels, among them apricot, squirrel, butterfly, potato and beetroot.]

Figure 1.5: Scatterplot of log family size by log frequency with labeled points for 81 words denoting animals and plants.

[Figure 1.6 diagonal panels, from upper left to lower right: Frequency, FamilySize, SynsetCount, Length, DerivEntropy.]

Figure 1.6: A pairs plot for the five numerical variables in the items data frame.

In order to avoid plotting both strings and plot symbols, we specified type = "n" in the plot() command, so that the axes, labels and tick marks are properly set up, but no actual points are shown.
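The combination of an empty plot and text() can be tried on a handful of points. The coordinates and words in this sketch are made up, and the factor words merely mimics a column such as items$Item.

```r
# Made-up coordinates and labels for three points.
x <- c(2.1, 4.5, 6.8)
y <- c(0.3, 1.2, 2.9)
words <- factor(c("beetroot", "apricot", "horse"))

# Convert the factor to a character vector, one string per point.
word.labels <- as.character(words)

# Set up axes, labels and tick marks only, then draw the words.
plot(x, y, type = "n", xlab = "log frequency", ylab = "log family size")
text(x, y, word.labels, cex = 0.8)
```

Because plot() is still given the full x and y vectors, the axis ranges are computed from the data even though no plot symbols are drawn.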

Thus far, we have considered plots involving two variables only. Often, we have more than two variables, and although we might look at all possible combinations with a series of scatterplots, it is often more convenient and insightful to make a single multipanel figure that shows all pairwise scatterplots. Figure 1.6 shows such a scatterplot matrix for all two by two combinations of the five numerical variables in items. The panels on the main diagonal provide the labels for the panels. Furthermore, each pair of variables is plotted twice, once with a given variable on the horizontal axis, and once with the same variable on the vertical axis. Such pairs of plots have coordinates that are mirrored in the main diagonal. Thus, panel (1,2) is the mirror image of panel (2,1), which we just studied in Figure 1.5. Similarly, panel (5,1) in the lower left has its opposite in the upper right corner at location (1,5). This pairs plot was produced with

> pairs(items[,-c(1,6)])

The condition on the columns with the minus sign, -c(1,6), allowed all columns except columns 1 and 6 (both factors) into the plot. Note that there seem to be correlations among many of these variables, a phenomenon that is known as multicollinearity. The problem that multicollinearity causes is that when we seek to understand how these variables affect lexical decision latencies or weight ratings, it may be quite difficult to ascertain what the independent contribution of the different variables might be.
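Negative column indexing and the pairs plot can be explored on simulated data. In the sketch below the data frame toy, its column names and its values are all hypothetical. The call to cor() is an addition not used in this section; it is included only as one possible way of putting numbers on the correlations seen in the panels.

```r
set.seed(2)                               # reproducible made-up data
n <- 30
f <- rnorm(n)
toy <- data.frame(
  Word       = factor(paste0("w", 1:n)),              # factor, column 1
  Frequency  = f,
  FamilySize = 0.8 * f + rnorm(n, sd = 0.5),          # built to covary with f
  Length     = rnorm(n),
  Class      = factor(rep(c("animal", "plant"), length.out = n))  # column 5
)

num <- toy[, -c(1, 5)]    # drop columns 1 and 5, keeping the numeric ones
pairs(num)                # scatterplot matrix of the remaining columns
round(cor(num), 2)        # pairwise correlations for the same columns
```

Like the pairs plot, the correlation matrix is symmetric around its main diagonal, so each pair of variables appears twice.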

In document Exploratory data analysis. An introduction to R for the language sciences (Page 32-37)