
Clustering and classification

3.1 Covariance and correlation matrices

3.2.1 Principal components analysis

The data frame DATA/affixes.txt lists, for 44 texts with varying authors and genres, a productivity index for 27 derivational affixes. For full details, the reader is referred to Baayen [1994]. The question of interest is whether there is any structure in this large table of numbers. More specifically, the 44 texts represent rather different text types: religious texts (e.g., the Book of Mormon), books written for children (e.g., Alice's Adventures in Wonderland), literary texts (e.g., novels by Austen, Conrad, James), and officialese (e.g., texts from the US government accounting office). A classification into these text types is provided in the column labeled Registers of this data frame.

> dat = read.table("DATA/affixes.txt", T)

> dat[c("Mormon", "Austen", "Carroll", "Gao"),c(5:10, 29)]

        ian    ful      y   ness   able     ly Registers
Mormon    0 0.1887 0.5660 2.0755 0.0000 2.2642         B
Austen    0 1.2891 1.5654 1.6575 1.0129 6.2615         L
Carroll   0 0.2717 1.0870 0.2717 0.4076 6.3859         C
Gao       0 0.3306 1.9835 0.8264 0.8264 4.4628         O

Principal components analysis is a useful tool for exploring the structure of this kind of multivariate data through the covariance or correlation matrix. In order to understand what this technique does, consider Figure 3.1. The upper left panel shows a cube, and the grey coloring of the cube indicates that data points are spread out everywhere in the cube. In order to describe a point in the cube, we need all three axes. The cube in the upper right describes the situation in which all the points are located on the grey plane. We could describe the location of a point on this plane using the three axes of the cube, but a more economical description would use two (orthogonal) vectors in the grey plane.

The cube in the lower left panel also involves a plane, but now there is more variation (a greater range of values) in the Y and Z direction than in the X direction. The final cube depicts the case where all the points are located on a line. To describe the location of these points, a single axis is sufficient.

What principal components analysis does is try to reduce the number of dimensions. For the upper left cube, this is impossible. For the upper right cube, this is possible: we can get rid of one dimension. Principal components analysis does this by rotating the axes in such a way that we obtain two new axes that are sufficient to reach each data point on the diagonal plane of the (unrotated) cube. In the case of the lower left cube, principal components analysis will take its first axis (principal component 1, henceforth PC1) to be the axis going up and back, because this is the dimension in which the variation is greatest. The second axis (PC2) will be, in this example, the original X axis.
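The situation in the upper right cube can be imitated with a small simulation. The following is a minimal sketch on invented data, not on the affix data: the names u, v, cube, and cube.pr are made up for illustration, and the plane Z = X is just one convenient choice of a diagonal plane.

```r
# Toy data, invented for illustration: 200 points on the diagonal plane Z = X
set.seed(1)
u <- runif(200)
v <- runif(200)
cube <- cbind(X = u, Y = v, Z = u)

# Principal components analysis of the three original axes
cube.pr <- prcomp(cube)

# Two standard deviations are clearly positive; the third is numerically zero,
# so two rotated axes suffice to reach every point
print(round(cube.pr$sdev, 4))

# The third column of the rotation matrix is the direction perpendicular
# to the plane, (1, 0, -1)/sqrt(2), up to sign
print(round(cube.pr$rotation, 3))
```

Because no point ever leaves the plane, the third rotated axis carries no variation and can be dropped without any loss of information.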

Returning to our data, we can regard the 44 texts as 44 points in a 27-dimensional space. Do we need all these 27 dimensions, or can we reduce the number of dimensions to a (much) smaller number? It turns out that this is possible, and that this reduction sheds light on the correlational structure in our data matrix. The function carrying out principal components analysis is prcomp(). Its input is a matrix (or a data frame, in which case only its numerical columns should be supplied). As the last two columns of our data frame contain descriptions of the texts, we remove them from the input before running prcomp().

> dat = read.table("DATA/affixes.txt", T)

> dat.pr = prcomp(dat[,1:(ncol(dat)-2)])

> names(dat.pr)

[1] "sdev" "rotation" "center" "scale" "x"

As shown by names() applied to the principal components object dat.pr, this object has 5 components. The first component, sdev, lists the standard deviation associated with each PC.

Figure 3.1: Different distributions of points (highlighted in grey) in a cube.

> dat.pr$sdev

 [1] 1.859784816 1.106792595 0.704414914 0.539529911 0.532013999 0.434277456
 [7] 0.409533141 0.377801947 0.330303242 0.295235260 0.257427979 0.226976732
[13] 0.211334348 0.189316061 0.161693081 0.150288381 0.126535199 0.112576297
[19] 0.103904452 0.086989626 0.074193167 0.067361823 0.058533722 0.042922322
[25] 0.026037970 0.009769321 0.008716893

Note that there are 27 standard deviations, just as there were 27 affixes (columns). Also note that the principal components are ordered by the magnitude of their standard deviations. Recall that the variance is the square of the standard deviation. In order to see what proportion of the variance in the data is accounted for by the successive principal components, we square dat.pr$sdev and then divide each variance by the total variance:

> props = dat.pr$sdev^2/sum(dat.pr$sdev^2)
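Why squaring the standard deviations gives the right quantities can be checked directly. The sketch below uses R's built-in iris measurements as a stand-in, since it is meant to run without the affix data; the names X, pr, and ev are illustrative only. It assumes the default settings of prcomp(), which center the columns but do not rescale them.

```r
# Stand-in data: the four numeric columns of R's built-in iris data set
X <- as.matrix(iris[, 1:4])
pr <- prcomp(X)                       # default: centered, not rescaled

# The squared standard deviations are the eigenvalues of the covariance matrix
ev <- eigen(cov(X))$values
print(all.equal(pr$sdev^2, ev))

# Their sum is the total variance, i.e. the sum of the column variances,
# so the proportions necessarily add up to 1
props <- pr$sdev^2 / sum(pr$sdev^2)
print(all.equal(sum(pr$sdev^2), sum(diag(cov(X)))))
print(round(props, 4))

# With scale. = TRUE the analysis is based on the correlation matrix instead
pr.cor <- prcomp(X, scale. = TRUE)
print(all.equal(pr.cor$sdev^2, eigen(cor(X))$values))
```

Whether to work from the covariance or the correlation matrix is a modeling decision: rescaling gives every column the same weight, whereas the unscaled analysis lets columns with large variances dominate.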

Figure 3.2 plots these proportions of explained variance. One rule of thumb holds that a principal component is important if it accounts for at least 5% of the variance. The components meeting this criterion have been colored green. Another rule of thumb is to locate the cutoff point where there is a clear discontinuity as you move from left to right. In this example, the two rules of thumb converge, but this need not be the case.

> barplot(props, col=as.numeric(props>0.05)+2,

+ xlab="principal components",ylab="proportion of variance explained")

> abline(h=0.05)

It is clear that the first principal component accounts for more than half of the variance, and that the fourth and later principal components have little or nothing to contribute. In short, this screeplot tells us that we can reduce 27 dimensions to 3 dimensions without losing much variance: the first three PCs account for roughly three quarters of the variance.

> sum(props[1:3])

[1] 0.7663088
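The same bookkeeping can be done for all components at once with cumsum(). The following sketch again uses the built-in iris measurements as a stand-in so that it runs without the affix data; X, pr, cum, and keep are illustrative names, and the 5% threshold is the rule of thumb mentioned above rather than a fixed standard.

```r
# Stand-in data: the four numeric columns of R's built-in iris data set
X <- as.matrix(iris[, 1:4])
pr <- prcomp(X)
props <- pr$sdev^2 / sum(pr$sdev^2)

# Running total of explained variance: element k is the share of the
# variance retained when only the first k components are kept
cum <- cumsum(props)
print(round(cum, 4))

# Rule of thumb: components accounting for at least 5% of the variance
keep <- which(props >= 0.05)
print(keep)

# summary() tabulates the same quantities (rounded to five decimals)
print(summary(pr)$importance)
```

Reading off cum at the last retained component gives the analogue of sum(props[1:3]) for whatever cutoff is chosen.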

The coordinates of the texts in the cube spanned by the first three principal components are available in the component of dat.pr labelled x:

> dat.pr$x[c("Mormon", "Austen", "Carroll", "Gao"),1:3]

               PC1        PC2        PC3
Mormon  -3.7613247  1.5552693  1.4117837
Austen  -0.1745206 -1.5247233  0.3285241
Carroll  0.3363524  1.5711792 -0.2937536
Gao     -1.8250509 -0.8581186 -1.2897237
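These coordinates are simply the centered data expressed on the rotated axes. As a hedged illustration, the sketch below reconstructs the x component by hand for the built-in iris measurements (a stand-in for the affix data); Xc and scores are illustrative names, and the default, unscaled settings of prcomp() are assumed.

```r
# Stand-in data: the four numeric columns of R's built-in iris data set
X <- as.matrix(iris[, 1:4])
pr <- prcomp(X)

# Subtract the column means stored in pr$center, then project the centered
# data onto the new axes stored in the columns of pr$rotation
Xc <- sweep(X, 2, pr$center)
scores <- Xc %*% pr$rotation

# The hand-computed coordinates agree with the x component
print(all.equal(unname(scores), unname(pr$x)))

# The standard deviation of each column of x is the corresponding sdev
print(round(apply(pr$x, 2, sd), 4))
print(round(pr$sdev, 4))
```

Seen this way, sdev, rotation, center, and x are different views of one computation: center and rotation define the new coordinate system, x holds the data in that system, and sdev summarizes the spread along each new axis.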

Figure 3.2: Screeplot for the principal components analysis of texts in affix productivity space (horizontal axis: principal components; vertical axis: proportion of variance explained).

Figure 3.3 plots the texts in this 3-dimensional space by means of a scatterplot matrix displaying all three pairwise combinations of PCs. You can think of this as looking into the cube from different sides: once from the top, once from the front, and once from the side. A convenient trellis function for doing so is splom() (for scatterplot matrices). This is a powerful function; for full details, the reader is referred to the on-line help.

> library(lattice)

> datpr = data.frame(dat.pr$x, Registers=dat$Registers)

> super.sym = trellis.par.get("superpose.symbol")

> splom(datpr[,1:3],
+   groups = Registers,
+   data = datpr,
+   panel = panel.superpose,
+   key = list(
+     title = "texts in productivity space",
+     text = list(c("Religious", "Children",
+       "Literary", "Officialese")),
+     points = list(pch = super.sym$pch[1:4],
+       col = super.sym$col[1:4])
+   )
+ )

Figure 3.3 suggests some clustering, especially in the panel for PC1 and PC2 (first panel of second row). The literary texts are in the center, the religious texts in the upper left, the texts for children more to the lower right, and the officialese tends towards the bottom of the graph. Later in this chapter, we will encounter methods for testing whether these clusters are really there, or whether we are overinterpreting the pattern that we see here.

A third important component of a principal components object is the rotation matrix, which looks like this:

> dim(dat.pr$rotation)

[1] 27 27

> dat.pr$rotation[1:10,1:4] # just a part of this large matrix

               PC1          PC2          PC3           PC4
semi  0.0018753121 -0.001359615  0.003074151 -0.0033841237
anti -0.0003107270 -0.002017771 -0.002695399  0.0005929162
ee   -0.0019930399  0.001106277 -0.017102260 -0.0033997410
ism   0.0087251807 -0.046360929  0.046553003  0.0300832267
ian  -0.0459376905 -0.008605163 -0.010271978 -0.0937441773
ful   0.0334764289  0.013734791  0.010000845 -0.0966573851
y     0.1113180755 -0.043908360 -0.276324337 -0.5719405630
ness  0.0297280626 -0.112768134  0.700249340 -0.1374734621
able  0.0084568997 -0.124364821  0.012313097  0.1119376764

[Figure: scatter plot matrix with panels for PC1, PC2, and PC3]