
3.3.3 Support vector machines

Support vector machines are another recent development in classification, and their performance is often quite good. In R, the library e1071 provides the function svm() that, fortunately, is much easier to use than the neural network function in the preceding section. A support vector machine for a binary classification problem tries to find a hyperplane in multidimensional space such that ideally all elements of a given class are on one side of that hyperplane, and all the other elements are on the other side. Furthermore, it allocates a margin around that hyperplane, and points that are exactly the margin distance away from the hyperplane are called its support vectors.
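The geometry described above can be made concrete with a small base R sketch. The data points, the weight vector w and the offset b are all hypothetical values chosen by hand for illustration; nothing here is fitted by svm(), and the variable names are assumptions of this sketch.

```r
# Toy 2-D data: two classes, labelled -1 and +1 (hypothetical values).
x <- rbind(c(1, 1), c(2, 1), c(1, 2),    # class -1
           c(4, 4), c(5, 4), c(4, 5))    # class +1
y <- c(-1, -1, -1, 1, 1, 1)

# A hand-picked separating hyperplane w . x + b = 0 (assumed, not fitted).
w <- c(1, 1)
b <- -5.5

# Signed score: positive on one side of the hyperplane, negative on the other.
score <- as.vector(x %*% w + b)
side  <- sign(score)

# Perpendicular distance of each point to the hyperplane.
dist_to_plane <- abs(score) / sqrt(sum(w^2))

# The points closest to the hyperplane lie on the edge of the margin;
# these are the points that play the role of support vectors.
closest <- which(abs(dist_to_plane - min(dist_to_plane)) < 1e-12)
```

In this toy configuration every point falls on the side matching its label, and the points in rows 2, 3 and 4 are the ones sitting at the margin distance.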

Let’s apply svm() to the classification problem of Figure 3.4. As you can see when you study the help pages for svm(), there are a lot of parameters that can be varied in order to optimize performance. We use parameter settings that work fine for our examples.

> library(e1071)
> dat = read.table("DATA/affixes.txt", T)
> dat.svm = svm(dat$Registers ~ ., data=dat[,1:27], cost=100,
+   kernel="linear", gamma=1)
> summary(dat.svm)
...
Number of Support Vectors:  32  ( 2 8 17 5 )
Number of Classes:  4
Levels:
 B C L O
> table(dat$Registers)

 B  C  L  O
 2  8 28  6

The algorithm requires 32 support vectors for 44 multidimensional data points; for the first two classes, it needs as many support vectors as there are data points. The result is a perfect classification,

> table(true=dat$Registers, predicted=predict(dat.svm))
    predicted
true  B  C  L  O
   B  2  0  0  0
   C  0  8  0  0
   L  0  0 28  0
   O  0  0  0  6

but we may ask to what extent we have overfitted the data. In order to address this question, svm() has an option cross which carries out n-fold cross-validation. We set cross to 10, so that svm() is trained on 9/10 of the data points, and then predicts the class for the remaining 1/10 of the data points, and this across 10 cross-validation runs:

> dat.svm = svm(dat$Registers ~ ., data=dat[,1:27], cost=100,
+   kernel="linear", gamma=1, cross=10)
> summary(dat.svm)
...
10-fold cross-validation on training data:

Total Accuracy: 75
Single Accuracies:
 50 75 80 100 60 75 100 60 100 60

The average accuracy is 75%, with maxima of 100% correct and one instance of chance performance (50%).
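The partitioning behind n-fold cross-validation can be sketched in base R as follows. This illustrates the general idea only and is not the code that svm() runs internally; the seed and the variable names are assumptions of this sketch.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 44       # number of texts in the example above
k <- 10       # number of folds

# Randomly assign each observation to one of k folds of near-equal size.
fold <- sample(rep(1:k, length.out = n))

for (i in 1:k) {
  test_rows  <- which(fold == i)   # held out in run i
  train_rows <- which(fold != i)   # used for fitting in run i
  # A model would be fitted on train_rows and evaluated on test_rows here.
}

# With 44 observations and 10 folds, each fold holds 4 or 5 observations.
fold_sizes <- table(fold)
```

Because each fold contains only 4 or 5 texts, a single fold's accuracy can only move in steps of 25% or 20%, which is consistent with the coarse single accuracies reported above.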

Now consider a larger data set with 1091 Dutch monomorphemic verbs, of which 190 are irregular verbs. How well can we predict a verb’s regularity from its lexical statistics?

We first apply the same options as we used above:

> verba = read.table("DATA/regirregverbs.txt", T)
> verba.svm = svm(Regularity ~ log(Frequency+1) + log(PastFreq+1) +
+   Length + Synsets + Bigram + Density + PhonBig + InflEntropy,
+   data=verba, cost=100, kernel="linear", gamma=1)
> summary(verba.svm)
...
Number of Support Vectors:  383  ( 193 190 )
Number of Classes:  2
Levels:
 irregular regular
> table(true=verba$Regularity, predicted=predict(verba.svm))
           predicted
true        irregular regular
  irregular        17     173
  regular           3     898

Performance is miserable: only 17 of the 190 irregular verbs are classified correctly.
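To see why a table like this can look acceptable overall and still be poor for one class, it helps to compute the proportion correct per class. The sketch below re-enters the counts from the table above by hand; the object names are hypothetical.

```r
# Confusion matrix for the linear kernel, copied from the table above.
conf <- matrix(c(17, 173,
                  3, 898),
               nrow = 2, byrow = TRUE,
               dimnames = list(true      = c("irregular", "regular"),
                               predicted = c("irregular", "regular")))

# Overall proportion of verbs classified correctly.
accuracy <- sum(diag(conf)) / sum(conf)

# Proportion correct within each true class (row-wise).
per_class <- diag(conf) / rowSums(conf)

round(100 * accuracy, 1)
round(100 * per_class, 1)
```

Overall accuracy comes out at about 84%, but this is carried almost entirely by the regular verbs: fewer than 9% of the irregular verbs are recovered.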

We now change the kernel option to its default (radial) and get

> verba.svm = svm(Regularity ~ log(Frequency+1) + log(PastFreq+1) +
+   Length + Synsets + Bigram + Density + PhonBig + InflEntropy,
+   data=verba, cost=100, gamma=1)
> table(true=verba$Regularity, predicted=predict(verba.svm))
           predicted
true        irregular regular
  irregular       188       2
  regular           1     900

with a very low misclassification rate of only 3 out of 1091. With 10-fold cross-validation the results remain remarkably good, with an average accuracy of 80%.

> verba.svm = svm(Regularity ~ log(Frequency+1) + log(PastFreq+1) +
+   Length + Synsets + Bigram + Density + PhonBig + InflEntropy,
+   data=verba, cost=100, gamma=1, cross=10)
> summary(verba.svm)
...
Number of Support Vectors:  720  ( 537 183 )
Number of Classes:  2
Levels:
 irregular regular
10-fold cross-validation on training data:

Total Accuracy: 80.20165
Single Accuracies:
 78.89908 75.22936 83.48624 88.99083 81.65138 82.5688 71.55963 81.65138 79.81651

However, if we construct our own random sample and check the accuracy there, we see that the good scores are predominantly the result of the model guessing that a verb is regular:

> verba = verba[sample(1:nrow(verba)),]
> verba.train = verba[1:982,]
> verba.heldout = verba[983:nrow(verba),]
> verba.train.svm = verba.svm = svm(Regularity ~ log(Frequency+1) +
+   log(PastFreq+1) + Length + Synsets + Bigram + Density + PhonBig +
+   InflEntropy, data=verba.train, cost=100, gamma=1)
> table(true=verba.heldout$Regularity,
+   predicted=predict(verba.train.svm, verba.heldout))
           predicted
true        irregular regular
  irregular        10      11
  regular           7      81
# and another run:
> table(true=verba.heldout$Regularity,
+   predicted=predict(verba.train.svm, verba.heldout))
           predicted
true        irregular regular
  irregular         3       8
  regular          11      87
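One way to check the claim that the scores mostly reflect guessing "regular" is to compare each held-out table with a baseline that labels every verb as regular. The sketch below re-enters the two held-out tables by hand; the function compare() and the object names are hypothetical helpers, not part of e1071.

```r
# Build a 2 x 2 confusion matrix from counts given row by row.
make_conf <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(true      = c("irregular", "regular"),
                         predicted = c("irregular", "regular")))
}
run1 <- make_conf(c(10, 11,  7, 81))   # first held-out table above
run2 <- make_conf(c( 3,  8, 11, 87))   # second held-out table above

# Accuracy of the model versus always predicting "regular".
compare <- function(conf) {
  c(model          = sum(diag(conf)) / sum(conf),
    always_regular = sum(conf["regular", ]) / sum(conf))
}

result <- sapply(list(run1 = run1, run2 = run2), compare)
round(100 * result, 1)
```

In the first run the model is only slightly ahead of the baseline (91 versus 88 of 109 verbs correct), and in the second run it falls below it (90 versus 98 of 109).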

3.4 Problems

1. Burrows [1992], in a study of English writers, observed that the time period in which an author was born was reflected on a principal component. The data frame DATA/affixes2.txt contains the productivity rates of 27 affixes for 27 literary writers. Analyse this data set with respect to the possibility that there is a temporal dimension with respect to productivity as well. This data frame has columns listing the dates of birth and death of the authors. Following Burrows, you can group the authors by whether they were born before 1850.

2. Load the data frames DATA/affixes.txt and DATA/affixes3.txt. The column labeled Class in the second data frame classifies the affixes according to their stratum (Latinate versus Germanic). The first data frame contains the affix productivities for authors (rows) by affixes (columns). Carry out a principal components analysis of affixes in author space (by flipping the matrix with the transpose function t()) and investigate whether there is evidence of clustering by lexical stratum.

3. Redo the analysis of register variation among the 44 texts discussed in the section on principal components analysis, but now using the correlation matrix instead of the covariance matrix.

4. Use the divisive cluster algorithm diana() (library cluster) to cluster the affixes in DATA/affixes.txt using the correlation matrix. (You should use only columns 1:27 of this data frame.) What is driving the main division in this data set?

5. Rerun the linear discriminant analysis on the affixes with the two-way distinction between Latinate and Germanic (in aff$Class) as classification. Plot and interpret the result.

6. Write a function around the code that produced Figure 3.12 that takes as its argument a vector of four seeds, and use this function to study the effect of the initial state of the neural net on its performance.

7. Run a support vector machine analysis on the affixes with the two-way distinction between Latinate and Germanic (in aff$Class).

Chapter 4

In document Exploratory data analysis. An introduction to R for the language sciences (Page 128-133)