Exploratory data analysis. An introduction to R for the language sciences


R. H. Baayen

Interfaculty research unit for language and speech, University of Nijmegen, & Max Planck Institute for Psycholinguistics, Nijmegen

e-mail: [email protected]

course materials Helsinki, version December 17, 2004

Introduction

This book provides an introduction to the statistical analysis of quantitative data for researchers studying aspects of language and language processing. For many of my colleagues, the statistical analysis of quantitative data is an onerous task that they would rather leave to others. They tend to use statistical packages as a kind of oracle, from which one elicits a verdict as to whether there are one or more significant effects in the data. In order to elicit a response from the oracle, one has to click one’s way through cascades of menus. After a magic button press, voluminous output tends to be produced that hides the p-values, the ultimate goal of the statistical pilgrimage, among lots of other numbers that are completely meaningless to the user, as befits a true oracle.

The approach to data analysis to which this book provides a guide is fundamentally different in several ways. First of all, we will make use of a radically different tool for doing statistics, the interactive programming environment known as R. R is an open source implementation of the S language and environment for data analysis originally developed at Bell Laboratories. Learning to work with R is in many ways similar to learning a new language. Once you have mastered its grammar, and once you have acquired some basic vocabulary, you will also have begun to acquire a new way of thinking about data analysis that is essential for understanding the structure in your data. The design of R is especially elegant in that it has a consistent uniform syntax for specifying statistical models, no matter which type of model is being fitted.

What is essential about working with R, and this brings us to the second difference in our approach, is that we will depend heavily on visualization. R has outstanding graphical facilities, which generally provide far more insight into the data than long lists of statistics that depend on often questionable simplifying assumptions. That is, this book provides an introduction to exploratory data analysis. Moreover, we will work incrementally and interactively. R is an object-oriented programming language. If you are not familiar with this term, you can think of R as a language in which a statistical model is an object, an object created by the researcher to capture the structure in the data. Once the object is created, there are many things you can do with that object. You can summarize the object in order to inspect parameters and their p-values, or you can plot the object in order to see how well it fits the data. Or you can update the object, or extract predictions from the object, and so on. The process of understanding the structure in your data is almost always an iterative process involving graphical inspection, model building, graphical inspection, updating and adjusting the model, etc. The flexibility of R is crucial for making this iterative process both easy and enjoyable.

A third, at first sight heretical aspect of this book is that we have avoided all formal maths. The focus of this book is on explaining the key concepts and on providing guidelines for the proper use of statistical techniques. A useful metaphor is learning to drive a car. In order to drive a car, you need to know the position and function of tools such as the steering wheel and the brake. You also need to know that you should not drive with the hand brake on. And you need to know the traffic rules. Without these three kinds of knowledge, driving a car is extremely dangerous. What you do not need to know is how to construct a combustion engine, or how to drill for oil and refine it so that you can use it to fuel a combustion engine. The aim of this book is to provide you with a driving licence for exploratory data analysis. There is one caveat here. To stretch the metaphor to its limit: with R, you are receiving driving lessons in an all-powerful car, a combination of a racing car, a lorry, a personal vehicle, and a limousine. Consequently, you have to be a responsible driver, which means that you will often find that you need additional driving lessons beyond those offered in this book. Moreover, it never hurts to consult professional drivers, that is, statisticians with a solid background in mathematical statistics who know the ins and outs of the tools and techniques, and their advantages and disadvantages.

Finally, the approach we have taken in this course is to work with real data sets rather than with small artificial examples. Real data are often messy, and it is important to know how to proceed when the data display all kinds of problems that standard introductory textbooks hardly ever mention.

An important reason for using R is that it is a carefully designed programming environment that allows you, in a very flexible way, to write your own code, or modify existing code, to tailor R to your specific needs. Moreover, you can call R from another program (e.g., from scripting languages such as Python, Perl, or AWK) so that you do not have to carry out a series of similar analyses by hand, one by one. To see why this is useful, consider a researcher studying similarities in meaning and form for a large number of words. Suppose that a separate model needs to be fitted for each of 1000 words to the data of the other 999 words. If you are used to thinking about statistical questions as paths through cascaded menus, you will discard such an analysis as impractical almost immediately. When you work in R, it is a piece of cake, because you can write the code for one word, and then cycle it through all other words. If all the data are available at once, you could do this within R. If the data become available one chunk at a time, and if the joint data set is too large to load into R all at once, you can call R from another program to analyse the separate chunks. We have seen many instances of researchers being limited in the questions they explored because they were thinking in a ’menu-driven’ language instead of in an interactive programming language like R. This is an area where language determines thought.
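The idea of writing the code for one word and then cycling it through all the others can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the names dat, len, rating and predicted are hypothetical, the number of words is reduced from 1000 to 20, and a simple linear model fitted with lm() stands in for whatever model the real analysis would require.

```r
# Hypothetical data: 20 "words" with a made-up predictor and response.
set.seed(1)                        # make the simulated numbers reproducible
n = 20
dat = data.frame(word = paste("w", 1:n, sep = ""),
                 len = sample(3:10, n, replace = TRUE))
dat$rating = 1 + 0.5 * dat$len + rnorm(n)

# For each word, fit a model to the other n - 1 words and
# predict the rating of the word that was left out.
predicted = numeric(n)             # empty vector to collect the results
for (i in 1:n) {
  model = lm(rating ~ len, data = dat[-i, ])   # dat[-i, ] drops row i
  predicted[i] = predict(model, newdata = dat[i, ])
}

# Compare observed and predicted values for the first few words.
head(data.frame(word = dat$word, observed = dat$rating, predicted))
```

The loop body is written once, for a single word. Changing n is all that is needed to run the same analysis for a thousand words.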

You will have to get used to getting your commands for R exactly right. Every comma, apostrophe, and bracket is important, and a single mistyped character will cause R to break and respond with a warning. There are command line editing facilities, and you can page through earlier commands with the up and down arrows of your keyboard. It is often more useful, however, to open a simple text editor (emacs, gvim, notepad), to prepare your commands in the editor, and to copy and paste finished commands into the R window. More complex commands especially tend to be used more than once, and it is often much easier to make copies in the editor and modify these than to try to edit multiple-line commands in the R window itself. Output from R that is worth remembering can be pasted back into the editor, which in this way retains a detailed history of both your commands and the relevant results.

There are several ways in which you can use this book. If you use this book as an introduction to statistics, it is important to work through the examples, not only by reading them through, but by trying them out in R. Each chapter also comes with a set of problems; the solutions to these problems are provided at the end of the book. If you use this book to learn how to apply in R particular techniques that you are already familiar with, then the quickest way to proceed is to study the structure of the relevant data files used to illustrate the technique. Once you have understood how the data are to be formatted, you can load the data into R and try out the example. Once you have got this working, it should not be difficult to try out the same technique on your own data.

This book is organized as follows. The first chapter describes the basics of the language, simple data structures, loading data, and exploring the structure in the data using various visualization techniques. The second chapter provides an introduction to random variables, distributions, and standard statistical tests for single random variables as well as tests for two random variables. Chapter 3 is an overview of exploratory techniques for clustering and classification. Chapter 4 introduces multiple regression, including analysis of (co)variance and multilevel modeling.


Chapter 1 Calculating and plotting in R

In order to learn to work with R, you have to learn to speak its language. The grammar of the R language is beautiful and easy to learn. It is important to master the basics of R’s grammar, as this grammar is designed to help you think about your data in a way that suggests how you might want to analyse them. In this chapter, we begin with very simple examples that show you how to talk with R. As soon as possible, however, we will begin to use examples from a large experimental data set.

When you start R, for instance by typing R at the prompt of your Linux console, R responds by providing you with its own prompt (>). It also checks whether there is a file named .RData in your current directory. If there already is such a file, indicating that you have worked on a problem in this directory before, it will load that file and make the objects stored in that file available to you. If there is no such file, one is created when you save your workspace on quitting R. It is advisable to keep different projects in different directories, in order to prevent your workspace from becoming cluttered with lots of unrelated objects. When you work with large data sets and complex objects in R, this is also the way to prevent your .RData file from becoming unmanageably large.

The way to learn a language is to start speaking it. The way to learn R is to use it. Reading through the examples in this chapter is not enough to become a confident user of R. For this, you need to actually try out the examples, by typing them at the R prompt. You have to be very precise in your commands, which requires a discipline that you will only master if you learn from experience, and learn from your mistakes. Don’t be put off if R complains about your initial attempts to use it. Just carefully compare what you typed, letter by letter and bracket by bracket, with the code in the examples.

1.1 Calculating with R

1.1.1 Numbers and strings

Once you have an R window, you can use R as an (overgrown) calculator, as shown in the following examples:

> 1 + 2 # addition
[1] 3
> 2 * 3 # multiplication
[1] 6
> 6 / 3 # division
[1] 2
> 2^3 # power
[1] 8
> sqrt(9) # square root
[1] 3
> sqrt(9)^3
[1] 27

Note, first of all, that R provides the answer as soon as you hit the return key. There is no need to supply the equal sign, and in fact you should not try to supply one. Second, the answers to all these examples are preceded by a [1]. We will return to why this is shortly. Third, the output of one expression, sqrt(9), can serve immediately as the input of a second expression, ^3.

You can save the output of any calculation in variables such as x or y by using the equal sign =, the assignment operator. You can also use the variables in calculations in the same way as numbers:

> x = 1 + 2 # assignment

> x # request to display value

[1] 3

> y = sqrt(16) # assignment

> y # another request to display value

[1] 4

> x ^ y # working with two variables

[1] 81

If you type the name of a variable, say x, to the R prompt, it returns the corresponding value, 3 in this example.

It is also possible to store sequences of letters, technically known as strings, in a variable:

> w = "word"
> w
[1] "word"

Note that strings are enclosed between double quotes.

1.1.2 Vectors

In most of your work with R, you will not be dealing with single numbers, or single strings, but with many of them at once. Suppose, for instance, that you want to add 1 to each of the numbers 1, 2, 3, and 4. The way to do this in R is to first combine these numbers into an ordered list, a vector, by means of the combination function c():

> x = c(1, 2, 3, 4) # combining numbers into a vector

> x

[1] 1 2 3 4

Now that we have combined the numbers 1, 2, 3 and 4 into a vector, we can add 1 to each element as follows:

> x + 1 # add one to each vector element

[1] 2 3 4 5

> y = x + 1 # store result in y

> y # display y

[1] 2 3 4 5

Vectors are very useful when the same calculations have to be carried out on many pairs of numbers:

> x = c(1, 2, 3, 4)
> y = c(5, 6, 7, 8)
> x + y
[1]  6  8 10 12
> x * y
[1]  5 12 21 32
> x^2
[1]  1  4  9 16
> x^2 * y
[1]   5  24  63 128

In these examples, you can conceptualize x and y as two columns of numbers. Calculations are performed on the numbers in these columns that are on the same row. (It will be helpful to think of vectors as columns rather than as rows for many tools in R.)

Vectors are fundamental to most of what you will be doing with R, and we will therefore discuss a number of ways for creating vectors, and for accessing the elements of a vector. The operator : creates an ascending or descending sequence of whole numbers (integers):

> 1:4
[1] 1 2 3 4
> 4:1
[1] 4 3 2 1
> c(1:4, 4:1)
[1] 1 2 3 4 4 3 2 1

For sequences with a step size other than 1, there is the function seq(), which takes the start, the end, and the step size as its arguments:

> seq(1, 2, 0.1)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
> seq(2, 0, -0.5)
[1] 2.0 1.5 1.0 0.5 0.0

The function rep() is useful for repeating single numbers but also vectors:

> rep(1, 4)
[1] 1 1 1 1
> rep(1:4, 4)
[1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
> rep(1:4, 4:1)
[1] 1 1 1 1 2 2 2 3 3 4
> rep(seq(1, 4, 0.5), 2)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Note that the second argument of rep() specifies the number of repetitions. If this second argument is itself a vector, it specifies the number of repetitions required for each of the elements of the first argument, which should then itself be a vector.

In the examples thus far, we have only examined vectors whose elements were numbers. However, vectors can also be constructed for strings, and the functions c() and rep() work just as before:

> determiners = c("the", "an", "a")
> determiners
[1] "the" "an"  "a"
> abbreviations = c("1st", "2nd", "3rd")
> abbreviations
[1] "1st" "2nd" "3rd"
> c(determiners, abbreviations)
[1] "the" "an"  "a"   "1st" "2nd" "3rd"
> rep(determiners, 3:1)
[1] "the" "the" "the" "an"  "an"  "a"

When presented with a vector, it is often necessary to access specific elements from that vector. This is done by means of a mechanism called subscripting. The position of the element to be extracted from the vector is added after the vector name between square brackets. When more than one element needs to be extracted, a vector of positions can be used instead of a single number. Here are some examples that illustrate the key principles.

> determiners[1]
[1] "the"
> determiners[2]
[1] "an"
> determiners[c(1, 3)]
[1] "the" "a"
> determiners[3:1]
[1] "a"   "an"  "the"

It is also possible to subscript with a condition that has to be met:

> words = c("the", "cat", "sat", "on", "the", "mat")
> words[words == "the"] # show the elements equal to "the"
[1] "the" "the"
> which(words == "the") # show the positions of these elements
[1] 1 5
> words[which(words == "the")]
[1] "the" "the"
> words == "the" # a vector of booleans
[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
> booleans = words == "the"
> words[booleans]
[1] "the" "the"

When you subscript a vector with a boolean vector (which should be of the same length), the result is a vector with those elements that correspond to the elements with the value TRUE in the boolean vector. The function which() is available for extracting the indexes of those elements in a vector that meet a given condition. Note that a double equal sign, ==, denotes equality, while a single equal sign is the assignment operator.
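Boolean vectors can also be counted and negated, which is a convenient way to check how many elements meet a condition and to select the ones that do not. The sketch below assumes three pieces of R not introduced in the text so far: sum(), the negation operator !, and the inequality operator !=. The variable name is.the is hypothetical.

```r
words = c("the", "cat", "sat", "on", "the", "mat")

is.the = words == "the"    # boolean vector, one TRUE/FALSE per element
sum(is.the)                # TRUE counts as 1, so this counts the matches
words[!is.the]             # ! negates: the elements that are not "the"
words[words != "the"]      # the same selection, written directly
```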

The function length() returns the number of elements in a vector, so if you need to access the last element in a vector, or the last but one, you can proceed as follows:

> words[length(words)]
[1] "mat"
> words[length(words) - 1]
[1] "the"
> words[(length(words) - 2) : (length(words) - 1)]
[1] "on"  "the"

Note that the left and right arguments of the : operator are enclosed in parentheses. This is because this operator has a higher precedence than the minus operator, and therefore comes into effect first. Compare:

> (length(words) - 2) : (length(words) - 1)
[1] 4 5
> length(words) - 2 : length(words) - 1
[1]  3  2  1  0 -1

In the second example, the vector

> 2 : length(words)
[1] 2 3 4 5 6

is created first. This vector is then subtracted from length(words), which is automatically expanded to a vector of 5 sixes. From the resulting vector,

> length(words) - 2 : length(words)
[1] 4 3 2 1 0

a one is subtracted to give the final result. When an operation is carried out on two vectors that do not have the same length, the shorter one is recycled until it has the same length as the longer vector:

> v1 = c(1, 2, 3, 4)
> v2 = c(5, 6)
> v1 * v2
[1]  5 12 15 24
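Both points, the precedence of the : operator and the recycling of the shorter vector, are easy to get wrong in practice. The following lines are a small sketch of two common cases. The names n and v3 are hypothetical, and the warning mentioned in the comment is what R is expected to report when the longer length is not a multiple of the shorter one.

```r
n = 6
1:n - 1        # parsed as (1:n) - 1, giving 0 1 2 3 4 5
1:(n - 1)      # parentheses force the subtraction first: 1 2 3 4 5

v1 = c(1, 2, 3, 4)
v3 = c(10, 20, 30)
# Lengths 4 and 3: v3 is still recycled (10 20 30 10), but because 4 is
# not a multiple of 3, R issues a warning about the object lengths.
v1 + v3
```

When such a warning appears, it is usually a sign that one of the two vectors does not have the length you intended.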

If you want the elements of your vector to be sorted, use sort(). To reverse the order of the elements, use rev(). For a random reordering of the elements of a vector, a permutation, there is sample():

> sort(words)
[1] "cat" "mat" "on"  "sat" "the" "the"
> sort(words)[length(words):1]
[1] "the" "the" "sat" "on"  "mat" "cat"
> rev(sort(words))
[1] "the" "the" "sat" "on"  "mat" "cat"
> numbers = c(1, 3, 5, 7, 11, 13, 17, 19)
> sample(numbers)
[1] 17 19 13 11  1  3  7  5
> numbers = sample(numbers)
> numbers
[1] 19  7  1  5 11 17  3 13
> sort(numbers)
[1]  1  3  5  7 11 13 17 19

The function unique() removes repeated entries in a vector:

> unique(words)
[1] "the" "cat" "sat" "on"  "mat"
> z = rep(numbers, 1:8)
> z
 [1] 19  7  7  1  1  1  5  5  5  5 11 11 11 11 11 17 17 17
[19] 17 17 17  3  3  3  3  3  3  3 13 13 13 13 13 13 13 13
> unique(z)
[1] 19  7  1  5 11 17  3 13
> sort(unique(z))
[1]  1  3  5  7 11 13 17 19

Note that when we ask R to show the contents of the variable z, we get two lines of output. The first line is preceded by [1], indicating that the first number listed on this line is the first element of z. The second line is preceded by [19], which tells you that the next element of z is the 19th element of this vector.

Finally, the table() function tabulates the frequencies with which the items of a vector occur:

> table(words)
words
cat mat  on sat the
  1   1   1   1   2
> table(z)
z
 1  3  5  7 11 13 17 19
 3  7  4  2  5  8  6  1
> z.table = table(z)
> z.table
z
 1  3  5  7 11 13 17 19
 3  7  4  2  5  8  6  1

The output of the table() function is displayed on three successive lines. First of all, it lists the name of the object for which a table is calculated. In the above examples, these objects were the vectors words and z. On the next line, we find the elements of these vectors, and on the third line, the counts of how often these elements occurred in these vectors.
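Because a table behaves much like a vector of counts, the vector tools introduced above can be applied to it. The lines below are a minimal sketch of three things one often wants from a frequency table. The use of the decreasing argument of sort() and of sum() goes beyond what the text has introduced and is an addition here.

```r
words = c("the", "cat", "sat", "on", "the", "mat")
words.table = table(words)

sort(words.table, decreasing = TRUE)   # most frequent item first
words.table / sum(words.table)         # relative frequencies
words.table[words.table > 1]           # only items occurring more than once
```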

1.1.3 Objects

The S-language on which R is based is an object-oriented language. Everything that exists in R is an object that has specific methods associated with it. There are different types of objects, and each type of object has its own methods. One object type that we encountered above is the vector. One of the methods associated with a vector is the print method. For vectors, the print method is simple: it displays the contents of the vector, adding the position in the vector for the first element on each new line in the R window.

Functions are also objects. Consider, for instance, the function ls(), which lists the contents of your current work space:

> ls()

[1] "abbreviations" "determiners" "numbers" "words"

As ls() is itself an object, it has a print method. For functions, the print method displays the function code. Therefore, if you type ls at the prompt without the parentheses, you ask R to print the object on the screen. The code for ls() is too long to repeat here. Instead, we show the code for the command to quit R, q():

> q
function (save = "default", status = 0, runLast = TRUE)
.Internal(quit(save, status, runLast))
<environment: namespace:base>
> q()
Save workspace image? [y/n/c]:

If you want to know more details about a function, the on-line help is very useful: just type a question mark followed by the function name. Type ?q at the prompt in order to see the details of what you can do with q(). When invoked as a function, q() will ask whether the workspace image should be saved. If you respond with yes, the objects in your workspace will be available the next time you start up R with that workspace.

Another type of object is produced by the table() function. Let’s have a closer look at the table for our words vector. The function names() extracts the names of the elements that have been counted, and the counts themselves, without their names, can be extracted with the function as.numeric(). And you can access the count associated with a name by subscripting with the relevant name.

> words.table = table(words)
> words.table
cat mat  on sat the
  1   1   1   1   2
> names(words.table)
[1] "cat" "mat" "on"  "sat" "the"
> as.numeric(words.table)
[1] 1 1 1 1 2
> words.table["the"]
the
  2
> words.table[c("the", "cat")]
words
the cat
  2   1

It is crucial to keep in mind that subscripting by name always requires double quotes: the name is a string, and should be marked as such. Make sure you understand the following examples.

> z.table
z
 1  3  5  7 11 13 17 19
 3  7  4  2  5  8  6  1
> z.table[5]
11
 5
> z.table["11"]
11
 5
> z.table["5"]
5
4

If you subscript with the number 5, you ask for the fifth element, which has the label 11 and the value 5. It is often much clearer to address the table by the name itself, in which case you need double quotes. So with z.table["11"] you ask for the count of elevens in the vector.
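The difference between subscripting by position and subscripting by name can be checked directly. The sketch below rebuilds z from the shuffled numbers shown earlier, so that it runs on its own. The function as.character(), which turns a number into the corresponding string, is not introduced in the text and is an addition here.

```r
z = rep(c(19, 7, 1, 5, 11, 17, 3, 13), 1:8)
z.table = table(z)

names(z.table)                # the labels are strings: "1" "3" "5" ...
as.numeric(names(z.table))    # the same labels converted back to numbers
z.table[5] == z.table["11"]   # position 5 and label "11" are the same cell
z.table[as.character(5)]      # the count for the value 5, looked up by label
```

Converting a number to a string before subscripting is a simple way to make sure that you are asking for a label and not for a position.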

Up till now, we have created vectors in R. In order to load already existing data files with numbers or strings into R, there is the function scan().

> n = scan(file = "DATA/numbers.txt")
Read 12 items
> w = scan(file = "DATA/words.txt", what = "character")
Read 8 items
> n
 [1]  2  4  6  8 10 12 14 16 17 18 20 22
> w
[1] "an"      "example" "of"      "a"       "file"
[6] "with"    "some"    "words"
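If the files in the DATA directory are not at hand, scan() can be tried out on a small file created on the spot. The following is a sketch under that assumption: the file contents are made up, and tempfile(), cat() with a file argument, and unlink() are helper functions that the text itself does not discuss.

```r
# Write a small hypothetical file to a temporary location.
tmp = tempfile(fileext = ".txt")
cat("2 4 6 8\n10 12\n", file = tmp)

n = scan(file = tmp)                        # numbers are the default
w = scan(file = tmp, what = "character")    # the same file read as strings
n + 1        # arithmetic works on the numeric vector
w            # the strings "2", "4", ... do not support arithmetic
unlink(tmp)  # remove the temporary file again
```

Note that scan() reads across line breaks: the two lines of the file end up in one vector of six elements.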

Here is a summary of the functions and operators discussed thus far. Make sure you are confident about what they do before you read on.

arithmetic                   + - * / ^ sqrt()
creating vectors             : c() seq() rep()
loading vectors              scan()
reordering vector elements   rev() sort() sample()
summarizing vectors          length() unique() table()
vector indexes               which()
table objects                names() as.numeric()


1.1.4 Matrices

Just as it is often quite handy to bring numbers together in a vector, it is often the case that we will need to bring vectors together into tables. The functions cbind() and rbind() bind vectors by column and by row respectively:

> a = c(1, 2, 3, 4)
> b = c(5, 6, 7, 8)
> c = c(9, 10, 11, 12)
> A = cbind(a, b, c)
> A
     a b  c
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> B = rbind(a, b, c)
> B
  [,1] [,2] [,3] [,4]
a    1    2    3    4
b    5    6    7    8
c    9   10   11   12

Tables of numbers such as A and B are referred to as matrices. Note that when you use cbind(), the names of the vectors appear as column labels, while for rbind(), they appear as row labels.

As with vectors, we will often need to access specific elements, or rows, or columns of a matrix. To do so, we use the same subscripting mechanism with square brackets, but we now separate rows and columns by means of a comma. Information preceding the comma pertains to rows, information following the comma concerns columns. Have a look at these examples of subscripting:

> A[1, ] # select the first row
a b c
1 5 9
> A[2, ] # select the second row
 a  b  c
 2  6 10
> A[, "c"] # select the column labelled "c"
[1]  9 10 11 12
> A[, 3] # select the third column
[1]  9 10 11 12
> B["b", ] # select the row labelled "b"
[1] 5 6 7 8
> B[, 2] # select the second column
 a  b  c
 2  6 10
> C = B[1:2, 4:2]
> C
  [,1] [,2] [,3]
a    4    3    2
b    8    7    6

Note that the matrix C inherited the row labels from matrix B. The dimensions of a matrix can be queried with the function dim():

> dim(C)
[1] 2 3

What dim() returns is a vector with two elements specifying the number of rows and the number of columns.

When we created the matrices A and B, we did so by first creating individual vectors, which we then combined. When you have a vector that you want to reformat into a matrix, you can use the matrix() function:

> n # we made this vector above

[1] 2 4 6 8 10 12 14 16 17 18 20 22

> D = matrix(n, 6, 2) # a matrix of 6 rows and 2 columns

> D
     [,1] [,2]
[1,]    2   14
[2,]    4   16
[3,]    6   17
[4,]    8   18
[5,]   10   20
[6,]   12   22
> E = matrix(n, 4, 3) # a matrix of 4 rows and 3 columns
> E
     [,1] [,2] [,3]
[1,]    2   10   17
[2,]    4   12   18
[3,]    6   14   20
[4,]    8   16   22
> F = matrix(0, 2, 2) # a 2 by 2 matrix of zeros
> F
     [,1] [,2]
[1,]    0    0
[2,]    0    0

You can add or subtract matrices that have the same dimensions. If you multiply a matrix with a number, each element of the matrix is multiplied by that number. The same principle applies to arithmetic functions such as sqrt() or log().

> E * 2
     [,1] [,2] [,3]
[1,]    4   20   34
[2,]    8   24   36
[3,]   12   28   40
[4,]   16   32   44
> E + 2*E
     [,1] [,2] [,3]
[1,]    6   30   51
[2,]   12   36   54
[3,]   18   42   60
[4,]   24   48   66
> log(E + 1) # natural logarithm
         [,1]     [,2]     [,3]
[1,] 1.098612 2.397895 2.890372
[2,] 1.609438 2.564949 2.944439
[3,] 1.945910 2.708050 3.044522
[4,] 2.197225 2.833213 3.135494
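As the matrices D and E above show, matrix() fills a matrix column by column. If the numbers in your vector are ordered row by row, the sketch below shows one way to handle this. The byrow argument and the transpose function t() are not discussed in the text and are additions here, and the name G is hypothetical.

```r
n = c(2, 4, 6, 8, 10, 12, 14, 16, 17, 18, 20, 22)

E = matrix(n, 4, 3)                  # filled column by column (the default)
G = matrix(n, 4, 3, byrow = TRUE)    # filled row by row instead
E[1, ]                               # first row of E: 2 10 17
G[1, ]                               # first row of G: 2 4 6
t(G)                                 # t() transposes: 3 rows, 4 columns
```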

Vectors of strings can also be combined into tables, but these are referred to as arrays and not as matrices:

> X = rbind(c("this", "is"), c("an", "array"), c("and", "not"),
+ c("a", "matrix"))
> X
     [,1]   [,2]
[1,] "this" "is"
[2,] "an"   "array"
[3,] "and"  "not"
[4,] "a"    "matrix"
> X[3, 1] # extract the 1st element on 3rd row
[1] "and"

The plus on the second line of this example is the prompt that R gives instead of the > when the command on the previous line is not complete.

1.1.5 Data frames

The elements of an array and a matrix should all be of the same type. When you try to combine vectors of different types, one of the vectors will be converted to the type of the other, as in the following example:

> words.table # we made this table above
words
cat mat  on sat the
  1   1   1   1   2
> wrong = cbind(names(words.table), as.numeric(words.table))
> wrong
     [,1]  [,2]
[1,] "cat" "1"
[2,] "mat" "1"
[3,] "on"  "1"
[4,] "sat" "1"
[5,] "the" "2"

> rm(wrong) # delete wrong from the workspace

The second column of wrong is not a vector of numbers, but a vector of strings, so we cannot do any numerical operations on this vector any more. Fortunately, R provides a special kind of table in which you can bring together vectors of different types. This data type is known as a data frame. Here is how you can create a data frame to replace wrong:

> right = data.frame(words = names(words.table),
+ frequency = as.numeric(words.table))
> right
  words frequency
1   cat         1
2   mat         1
3    on         1
4   sat         1
5   the         2

We supplied two vectors to the data.frame() function, which we named words and frequency. These names appear as the column labels of the data frame right. There are three ways in which you can access the columns of a data frame, and two ways to access its rows.

> right[, 2] # the second column

[1] 1 1 1 1 2

> right[,"frequency"] # the column labelled "frequency" [1] 1 1 1 1 2

> right$frequency # the $ operator saves typing

[1] 1 1 1 1 2

> right["1", ] # the row with the rowname "1"
  words frequency
1   cat         1
> right[1, ] # the first row
  words frequency
1   cat         1

New is the $ operator, which provides a convenient way for addressing columns. Data frames have both row names and column names, which can be extracted with the functions rownames() and colnames():

> rownames(right)
[1] "1" "2" "3" "4" "5"
> colnames(right)
[1] "words"     "frequency"

The outputs of rownames() and colnames() are themselves vectors, and the lengths of these vectors are identical to the dimensions returned by dim():

> length(rownames(right)) == dim(right)[1]
[1] TRUE
> length(colnames(right)) == dim(right)[2]
[1] TRUE
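A few further ways of inspecting a data frame may be useful at this point. The sketch below rebuilds the data frame right by hand so that it runs on its own. The functions nrow(), ncol() and str() are not introduced in the text and are additions here. The last line applies the boolean subscripting described for vectors to the rows of a data frame.

```r
right = data.frame(words = c("cat", "mat", "on", "sat", "the"),
                   frequency = c(1, 1, 1, 1, 2))

nrow(right)                      # number of rows, same as dim(right)[1]
ncol(right)                      # number of columns, same as dim(right)[2]
str(right)                       # compact overview: one line per column
right[right$frequency > 1, ]     # rows meeting a condition on a column
```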

In order to illustrate the use of data frames, let’s consider a real example of a dataset studied by Baayen and Hay [2004]. Baayen and Hay were interested in the extent to which non-linguistic cognition is affected by one’s language. More specifically, they investigated to what extent one’s knowledge of the names for objects and the lexical properties of these names influence the way we think about these objects. The way they addressed this question is by means of an experiment with 81 concrete words, names for animals as well as fruits, nuts and vegetables. They asked 20 subjects to indicate, for each of these words, on a seven-point scale how heavy they thought the word’s referent was. The result is a data set with 20 * 81 = 1620 subjective weight estimates. The experimental data are available in the DATA directory as weight.ratings.txt. This file has 1620 lines, one for each rating elicited for a given subject and a given word. A given line also lists the sex of the speaker, as well as many different lexical variables. We load this data file into R with read.table():

> weight = read.table("DATA/weight.ratings.txt", header = TRUE)

The option header = TRUE specifies that the first line in weight.ratings.txt is a header that specifies the names for the different columns. The value TRUE can be abbreviated to T. It makes no sense to type weight at the R prompt, as weight is a very large object, as can be seen with dim():

> dim(weight)

[1] 1620 23

It is more useful to inspect the first few rows of the data frame,

> weight[1:4, ]
  Subject Rating Trial Sex     Word Frequency FamilySize
1      A1      5     1   F    horse  7.771910   3.332205
2      A1      1     2   F  gherkin  2.079442   0.000000
3      A1      3     3   F hedgehog  3.637586   0.000000
4      A1      1     4   F      bee  5.700444   1.791759
  SynsetCount Length  Class FreqSingular FreqPlural
1    2.079442      5 animal         1518        854
2    1.098612      7  plant            4          3
3    1.098612      8 animal           21         16
4    1.098612      3 animal          118        180
  DerivEntropy Complex      rInfl meanRT SubjFreq meanSize
1       1.0925 simplex  0.5747060 6.3364     4.84   4.8195
2       0.0000 simplex  0.2231435 6.5161     3.16   2.2484
3       0.0000 simplex  0.2578291 6.5924     3.32   3.5920
4       0.8521 simplex -0.4193735 6.2978     4.56   3.6399
       BNCw     BNCc      BNCd BNCcRatio BNCdRatio
1 79.173243 62.07120 59.855833  0.783992  0.756011
2  0.111433  0.16249  0.237523  1.458184  2.131531
3  2.718969  0.64996  3.325324  0.239047  1.223009
4  5.281931  1.62490  5.938079  0.307634  1.124225

or to ask for the column names:

> colnames(weight)

 [1] "Subject"      "Rating"       "Trial"
 [4] "Sex"          "Word"         "Frequency"
 [7] "FamilySize"   "SynsetCount"  "Length"
[10] "Class"        "FreqSingular" "FreqPlural"
[13] "DerivEntropy" "Complex"      "rInfl"
[16] "meanRT"       "SubjFreq"     "meanSize"
[19] "BNCw"         "BNCc"         "BNCd"
[22] "BNCcRatio"    "BNCdRatio"

The first column lists the (anonymized) subjects in the experiment, each of which contributed 81 ratings:

> table(weight$Subject)

A1 A2 G H I1 I2 J K L M1 M2 P R1 R2 R3 R4 R5 S1 S2 T1

81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81

In order to obtain just the names of the different subjects, we have two options.

> unique(weight$Subject)
 [1] A1 A2 G  H  I1 I2 J  K  L  M1 M2 P  R1 R2 R3 R4 R5 S1
[19] S2 T1
20 Levels: A1 A2 G H I1 I2 J K L M1 M2 P R1 R2 R3 R4 ... T1
> levels(weight$Subject)
 [1] "A1" "A2" "G"  "H"  "I1" "I2" "J"  "K"  "L"  "M1" "M2"
[12] "P"  "R1" "R2" "R3" "R4" "R5" "S1" "S2" "T1"

Note that when we apply unique() to the column of the data frame listing the sub-jects, we not only obtain the names of the subsub-jects, but also some additional information, namely, that there are twenty levels that are subsequently listed in summary form. The reason for this additional information is that string vectors in a data frame are converted automatically to factors. A factor in statistics is a variable that has strings as its possible values. Another factor in theweightdata frame isSex, which has as its levelsF(female) andM(male). The distinction between a string vector and a factor is crucial for many of the tools that we will be using in later chapters. The simplest way to see what subjects participated is to use the function that simply returns the levels of a factor,levels(). A related summary function isnlevels(), which returns the number of levels:

> nlevels(weight$Subject)
[1] 20
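The difference between a string vector and a factor can be tried out on a small invented vector. The following is only a sketch: the names subj and subj.f and the values are made up for illustration and are not part of the weight data.

```r
# A toy vector of strings; the values are invented for illustration.
subj = c("A1", "A2", "A1", "G", "A2", "A1")
class(subj)                     # a plain vector of strings

# Converting to a factor stores each distinct string once, as a level.
subj.f = factor(subj)
levels(subj.f)                  # the distinct values, in sorted order
nlevels(subj.f)                 # the number of distinct values
table(subj.f)                   # the number of observations per level

# unique() keeps the factor together with its levels;
# as.character() turns the result back into a vector of strings.
unique(subj.f)
as.character(unique(subj.f))
```

Applied to a vector of strings, levels() returns NULL, which is a quick way to check which of the two kinds of object you are dealing with.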

The column labelled Word specifies the word for which a rating was elicited. We extract a list of these words with levels():

> levels(weight$Word)

 [1] "almond"     "ant"        "apple"      "apricot"
 [5] "asparagus"  "avocado"    "badger"     "banana"
 [9] "bat"        "beaver"     "bee"        "beetroot"
[13] "blackberry" "blueberry"  "broccoli"   "bunny"
[17] "butterfly"  "camel"      "carrot"     "cat"
[21] "cherry"     "chicken"    "clove"      "crocodile"
[25] "cucumber"   "dog"        "dolphin"    "donkey"
[29] "eagle"      "eggplant"   "elephant"   "fox"
[33] "frog"       "gherkin"    "goat"       "goose"
[37] "grape"      "gull"       "hedgehog"   "horse"
[41] "kiwi"       "leek"       "lemon"      "lettuce"
[45] "lion"       "magpie"     "melon"      "mole"
[49] "monkey"     "moose"      "mouse"      "mushroom"
[53] "mustard"    "olive"      "orange"     "owl"
[57] "paprika"    "peanut"     "pear"       "pig"
[61] "pigeon"     "pineapple"  "potato"     "radish"
[65] "reindeer"   "shark"      "sheep"      "snake"
[69] "spider"     "squid"      "squirrel"   "stork"
[73] "strawberry" "swan"       "tomato"     "tortoise"
[77] "vulture"    "walnut"     "wasp"       "whale"
[81] "woodpecker"

The third column of weight lists the trial number: if a word has trial number 4, it was the fourth word in the experiment that a given subject was asked to rate. This variable ranges from 1 to 81, and is a control variable that allows us to trace possible effects of learning or fatigue that might take place in the course of the experiment.

The remaining 18 columns specify various properties of the words. These are briefly described in the appendix. Variables of interest for the present example are a word's frequency (Frequency), its family size (the number of complex words in which it appears as a constituent, FamilySize), the number of synonym sets (synsets) in which it is listed in WordNet [Miller, 1990, Beckwith et al., 1991, Fellbaum, 1998] (SynsetCount), its length in letters (Length), its Class (plant or animal), and its derivational entropy (DerivEntropy), a token-weighted variant of the family size count [Moscoso del Prado Martín et al., 2004].

All lexical variables for a given word are repeated on twenty rows of weight, once for each subject. In order to obtain a data frame that is restricted to the information pertaining only to the items, we use the unique() function as follows:

> # skip the columns with information specific to subject and trial
> items = unique(weight[, 5:23])
> dim(items)
[1] 81 19

The first four columns of weight contain information about the subject and the trial itself. This is information we want to discard in order to obtain a data frame that summarizes the properties of the words in the experiment. From column 5 onwards, we have the information that is specific to the items. The subsection of the data frame obtained by weight[, 5:23] has 81 unique lines, each of which is repeated 20 times, once for each subject. With unique(), we remove the redundant lines, and retain exactly one instance of each unique line.
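The way unique() collapses repeated lines can be checked on a miniature data set. The sketch below uses an invented data frame, ratings, with two subjects and three words; none of the names or values come from the weight data.

```r
# Toy data in the same layout: 2 subjects x 3 words (all values invented).
ratings = data.frame(
  Subject = rep(c("S1", "S2"), each = 3),
  Rating  = c(5, 2, 6, 4, 1, 7),
  Word    = rep(c("horse", "bee", "whale"), times = 2),
  Length  = rep(c(5, 3, 5), times = 2)
)

# Columns 3 and 4 hold the item-specific information,
# so each of their lines occurs once per subject.
words = unique(ratings[, 3:4])
dim(words)                          # one line per word, two columns

# duplicated() marks exactly those lines that unique() leaves out.
sum(duplicated(ratings[, 3:4]))
```

Note that unique() compares complete lines: two words with the same length (here horse and whale) are both retained, because their lines differ in the Word column.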

This data frame still has more columns than we need at this moment. Let's consider how we can create a smaller data frame with exactly the relevant information:

> items = items[, c(1:6, 9)]    # or
> items = items[, c("Word", "Frequency", "FamilySize",
+   "SynsetCount", "Length", "Class", "DerivEntropy")]
> items[1:4, ]
      Word Frequency FamilySize SynsetCount Length  Class
1    horse  7.771910   3.332205    2.079442      5 animal
2  gherkin  2.079442   0.000000    1.098612      7  plant
3 hedgehog  3.637586   0.000000    1.098612      8 animal
4      bee  5.700444   1.791759    1.098612      3 animal
  DerivEntropy
1       1.0925
2       0.0000
3       0.0000
4       0.8521

The first two commands select exactly the same subset of columns. The first uses the column numbers, the second the column names. It is sometimes convenient to relabel column names or rownames. For instance, we can rename the rownames with the words,

> rownames(items) = as.character(items$Word)
> items[1:4, 1:6]
             Word Frequency FamilySize SynsetCount Length  Class
horse       horse  7.771910   3.332205    2.079442      5 animal
gherkin   gherkin  2.079442   0.000000    1.098612      7  plant
hedgehog hedgehog  3.637586   0.000000    1.098612      8 animal
bee           bee  5.700444   1.791759    1.098612      3 animal

which makes it easy to extract information from the data frame by name:

> items["bat", "Frequency"]
[1] 5.918894
> items["pig", "Length"]
[1] 3

The function as.character() used above converts the factor items$Word to a vector of strings. This conversion is necessary because the row names and the column names are vectors of strings and not factors.

It is very important to become fluent in subscripting data frames. You should keep in mind that restrictions on rows precede the comma, and restrictions on columns follow the comma in the subscripting sequence [ , ]. Here are some examples:

> items[items$Length == 3, 2:4]
    Frequency FamilySize SynsetCount
bee  5.700444   1.791759   1.0986123
pig  6.660575   2.772589   2.3025851
fox  5.652489   1.791759   1.9459101
bat  5.918894   2.197225   2.3025851
dog  7.667626   3.135494   2.0794415
owl  4.859812   1.386294   0.6931472
cat  7.086738   2.772589   2.1972246
ant  5.347108   1.386294   1.0986123
> items2 = items[items$Length > 5 & items$Length < 7, c(1,3)][1:5,]
> items2
         Word FamilySize
peanut peanut  0.6931472
pigeon pigeon  1.0986123
donkey donkey  0.0000000
tomato tomato  0.0000000
magpie magpie  0.0000000
> items[items$Length > 7 | items$Length == 3,
+   c("Word", "FamilySize")][1:5,]
                 Word FamilySize
hedgehog     hedgehog  0.0000000
bee               bee  1.7917595
pineapple   pineapple  0.0000000
blackberry blackberry  0.6931472
tortoise     tortoise  0.0000000

The second and third examples illustrate the logical connectives and (&) and or (|). They also illustrate that you can subscript the part of the data frame that you just subscripted. After all, the result of subscripting a data frame is a new, smaller data frame, which can in turn be subscripted. Subscripting with [1:5, ] is a convenient way of inspecting the first couple of lines of a data frame.

You sort the lines in a data frame with the function order():

> items2[order(items2$Word), ]
         Word FamilySize
donkey donkey  0.0000000
magpie magpie  0.0000000
peanut peanut  0.6931472
pigeon pigeon  1.0986123
tomato tomato  0.0000000

When you call order() with items2$Word as argument, it returns a vector with the row numbers of the words such that the words themselves are sorted:

> order(items2$Word)
[1] 3 5 1 2 4

When this vector is inserted in the row slot of the subscript of items2, its rows are rearranged in this order. When order() is supplied with more than one argument, it will sort on the first argument, and resolve ties by looking at the second argument, or the third, if required, and so on.

> items2[order(items2$FamilySize, items2$Word), ]
         Word FamilySize
donkey donkey  0.0000000
magpie magpie  0.0000000
tomato tomato  0.0000000
peanut peanut  0.6931472
pigeon pigeon  1.0986123
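To see that order() returns row numbers rather than sorted values, and how ties are resolved, it may help to rebuild a data frame like items2 by hand. This is a sketch with rounded, hand-typed values; the name d and the use of a minus sign for sorting from large to small are our own additions.

```r
# A hand-made stand-in for items2 (values rounded, for illustration only).
d = data.frame(
  Word       = c("peanut", "pigeon", "donkey", "tomato", "magpie"),
  FamilySize = c(0.69, 1.10, 0, 0, 0)
)

# Sort on FamilySize; the three ties at 0 are resolved alphabetically by Word.
idx = order(d$FamilySize, d$Word)
idx                 # row numbers, not sorted values
d[idx, ]

# A minus sign reverses the direction for a numeric sorting key.
idx.rev = order(-d$FamilySize, d$Word)
d[idx.rev, ]
```

The row numbers in idx can be stored and reused, for instance to rearrange a second data frame that has its lines in the same order as d.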


1.1.6 Random variables

Thus far, we have encountered two kinds of vectors: numerical vectors and factors. Both are used to represent random variables. A random variable is the outcome of an experiment. Here are some examples of experiments and their associated random variables:

tossing a coin: a random variable with values 'head' or 'tail'.

throwing a die: a random variable with values 1, 2, . . . , 6.

counting words: a random variable with as values the frequency of occurrence in some corpus: 0, 1, 2, . . . , N (with N the size of the corpus).

familiarity rating: the subjective estimate of frequency, usually on a scale of 1 to 7, is the random variable in this experiment.

lexical decision: this kind of experiment has two associated random variables: the accuracy of a response (with levels 'correct' and 'incorrect') and the latency of the response (in milliseconds).

A random variable is random in the sense that the outcome of a given experiment is not fully predictable. The opposite of a random variable is a constant. The size of a given corpus such as the Brown corpus [Kučera and Francis, 1967] in word tokens is fixed, hence by itself this specific corpus size is not a random variable. On the other hand, corpus size, defined as ranging over many different corpora, is a random variable, because we cannot say what the corpus size is without being told what corpus we are dealing with. The art of statistics is to learn from prior experience with a given random variable (or sets of random variables) in order to optimize one's predictions as to what the most likely value of a random variable is.
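The first two experiments in the list above are easy to imitate with the function sample(), which draws values at random from a vector. The sketch below is our own illustration; the names coin and die and the choice of ten repetitions are arbitrary, and set.seed() is used only to make the run repeatable.

```r
set.seed(1)   # fix the random number generator so that the run can be repeated

# Ten tosses of a coin: a random variable represented as a factor.
coin = factor(sample(c("head", "tail"), size = 10, replace = TRUE),
              levels = c("head", "tail"))

# Ten throws of a die: a random variable represented as a numerical vector.
die = sample(1:6, size = 10, replace = TRUE)

table(coin)   # how often each outcome was observed
table(die)
```

Running the code again with a different seed gives different outcomes, which is exactly what makes these variables random rather than constant.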

1.1.7 Summary

Before starting with the section on visualization, make sure you are confident about the use of the functions that were introduced in this section.

creating tables: cbind(), rbind(), matrix()

data frames: $, data.frame(), rownames(), colnames(), read.table()

properties: dim()

sorting: order()

selecting: unique()

type conversion: as.character()


1.2 Visualization

An important first step in exploratory data analysis is to inspect your data graphically. It is difficult and often downright impossible to make sense of large tables of numbers. But patterns in the data often become visible thanks to the tools for data visualization that are now available. We first discuss tools for visualizing properties of single random variables (in vectors and uni-dimensional tables), and then proceed with an overview of tools for graphing groups of random variables (typically brought together in matrices or data frames).

1.2.1 Visualizing single random variables

Bar plots and histograms are useful for obtaining visual summaries of the distributions of random variables. Figure 1.1 illustrates this for the numeric variables in the items data frame that describes the main properties of the words used in the rating experiment eliciting subjective estimates of the referent's weight. This figure has six panels arranged in a matrix of three rows and two columns. In order to instruct R to make such a matrix of plots, we have to set the appropriate graphics parameter, mfrow, to the vector c(3, 2) using the function par(), which controls a large number of graphical parameters. Plots will be added one by one to the plot region, proceeding row by row from left to right.

par(mfrow=c(3,2))

The upper left panel is a bar plot of the counts of word lengths:

barplot(table(items$Length), xlab="word length", col="grey")

The option xlab sets the label for the X axis, and with the option col we set the color for the bars to grey. We see that word lengths range from 3 to 10, and that the distribution is somewhat asymmetric, with a mode (the value observed most often) at 5. The mean is 5.9, and the median is 6. (The median is obtained by ordering the observations from small to large, and then taking the value for which 50% of the data points are smaller.) Mean, median, and range are easy to extract with the corresponding functions mean(), median(), and range():

> mean(items$Length)
[1] 5.91358
> median(items$Length)
[1] 6
> range(items$Length)
[1]  3 10
> min(items$Length)
[1] 3
> max(items$Length)
[1] 10

The upper right panel of Figure 1.1 shows the histogram corresponding to the bar plot in the upper left panel. The main difference between the bar plot and the histogram is that the latter is scaled on the vertical axis in such a way that the total area of the bars is equal to 1. This allows us to see that the probability of the word lengths 5 and 6 jointly is close to 0.5. This histogram was produced with the truehist() function in the MASS library of Venables and Ripley [2003], which is part of any recent distribution of R. In order to access this function, we need to load this library with the library() function,

library(MASS)

after which we can produce the histogram in the upper right panel with

truehist(items$Length, xlab="word length", col="grey")
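The claim that the bars of such a histogram have a total area of 1 can be verified numerically. The sketch below uses the standard hist() function with plot = FALSE (so that the numbers are returned without drawing anything) on a small invented vector of word lengths; the vector len and the choice of breaks are assumptions made for illustration only.

```r
# Invented word lengths, for illustration only.
len = c(3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 10)

# The mode is the value with the highest count in the frequency table.
counts = table(len)
names(counts)[which.max(counts)]

# One bin per integer length; plot = FALSE returns the numbers only.
h = hist(len, breaks = seq(2.5, 10.5, by = 1), plot = FALSE)

# On the density scale, bar height times bar width sums to 1.
sum(h$density * diff(h$breaks))

# With bins of width 1, each bar height is the proportion of that length.
rbind(length = h$mids, proportion = h$density)
```

Because the areas sum to 1, the areas of neighbouring bars can be added to estimate the probability of a range of values, as was done above for lengths 5 and 6.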

The remaining panels of Figure 1.1 were made in the same way.

truehist(items$Frequency, xlab = "log word frequency", col = "grey")
truehist(items$SynsetCount, xlab = "log synset count", col = "grey")
truehist(items$FamilySize, xlab = "log family size", col = "grey")
truehist(items$DerivEntropy, xlab = "derivational entropy", col = "grey")

Note that the bottom panels show highly skewed distributions: Most of the words in this experiment have no morphological family members at all. Now that all panels have been filled, we reset the graphics parameter to one figure in the plot region:

par(mfrow=c(1, 1))

There are several ways in which plots can be saved as independent graphics files: as png or jpeg files, or as PostScript files. The corresponding functions are png(), jpeg(), and postscript(). We illustrate how these functions work for PostScript.

> postscript("barplot.ps", horizontal = FALSE, he = 6, wi = 6,
+   family = "Helvetica", paper = "special", onefile = FALSE)
> truehist(items$Frequency, xlab = "log word frequency")
> dev.off()

The first argument of postscript() is the name of the PostScript file to be created. Whether the plot should be in portrait or landscape mode is controlled by the horizontal argument. The parameters he and wi (abbreviations of height and width) control the height and width of the plot in inches. The font to be used is specified by family, and with paper="special" the output will be an encapsulated PostScript file that can be easily incorporated in, for instance, a LaTeX document. With onefile = FALSE, there is only a single plot in the file. There are many more options; check the on-line help for further details. The postscript() command opens a PostScript file, and all following plot commands are diverted to this PostScript file. To close the PostScript file, you use the function dev.off(). After this command, new plots will appear in the graphics window as usual. The dev.off() command is crucial: if you forget to close your file, you will run into all sorts of trouble when you try to view the file outside R, or if you try to make a new figure in R.

Figure 1.1: A bar plot and histograms for the variables describing the lexical properties of the words used in the weight rating experiment. [Panels, by horizontal axis: word length (bar plot), word length (histogram), log word frequency, log synset count, log family size, derivational entropy.]

The shape of a histogram depends, sometimes to a surprising extent, on the width of the bars and on the position of the left side of the first bar. The function truehist() has defaults that are chosen to minimize the risk of obtaining a rather arbitrarily shaped histogram (see also Haerdle, 1991). A function that further reduces this risk is density().

We illustrate this function for the reaction times elicited in a visual lexical decision experiment using the same words as in the weight rating experiment. In a visual lexical decision experiment, words are presented on a computer screen together with non-existing words like 'sulp'. Subjects are asked to indicate as quickly as possible by means of two push buttons whether the letter string presented on the screen is a real word. The time between the moment that the word is displayed on the screen and the moment at which a button response is recorded is the reaction time (also referred to as response latency). It is a measure of the complexity of lexical processing which is known to be co-determined by a wide range of lexical variables. The reaction times for 79 of the 81 words discussed above are available in the DATA directory as the text file lexdec.txt. For further details about the variables in this data set, see the appendix.

The upper left panel of Figure 1.2 shows the histogram as given by truehist() applied to the (logarithmically transformed) reaction times.

lexdec = read.table("DATA/lexdec.txt", header = TRUE)
truehist(lexdec$RT, col="lightgrey", xlab="log RT")

The distribution of reaction times is somewhat skewed, with an extended right tail of very long latencies.

The upper right panel of Figure 1.2 shows the histogram as produced by hist() instead of truehist(), together with the density curve. The two have roughly the same shape, but the density curve smooths the discrete jumps of the histogram. The lower left panel uses hist(), but now with the same bin widths as truehist(). The histogram and the density curve are now very similar estimates of the distribution of reaction times.

Plotting the upper right and lower left panels requires some careful preparation in order to make sure that the ranges of values for the two axes are set properly to accommodate both the histogram and the density function. We therefore begin with the standard function for making a histogram, hist(), which we force to make the same bins as truehist() in the case of the lower left panel, by specifying the breaks (the points where new bins should begin) explicitly. Instead of plotting the histogram, we save it, so that we can extract the range of values for the horizontal and vertical axes.

Figure 1.2: Histograms and density function for the response latencies of 21 subjects to 79 nouns referring to animals and plants (fruits and vegetables).

> h = hist(lexdec$RT, plot = FALSE,
+   breaks = seq(5.8, 7.6, by = 0.1))   # lower left panel

We then repeat this procedure for the density curve,

> d = density(lexdec$RT)

and then set the X and Y limits:

> xlimit = range(h$breaks, range(d$x))
> ylimit = range(0, h$density, d$y)

Finally, we plot the histogram, and add the curve for the density with the function lines(). The function lines() takes a vector of x coordinates and a vector of y coordinates, and connects the points specified by these coordinates with a line (in the order specified by the input vectors).

> hist(lexdec$RT, freq = FALSE, col = "lightgrey",
+   border = "darkgrey", ylab = "", xlab = "log RT",
+   xlim = xlimit, ylim = ylimit, main = "",
+   breaks = seq(5.8, 7.6, by = 0.1))
> lines(d)

The border option of hist() controls the color of the lines marking the bars of the histogram; the label for the vertical axis (ylab) and the main title (main) are both set to the empty string. The density() function returns an object of class density that comes with its own plotting method. So all we need to do is to apply the generic plot() and lines() functions to the output of density().

Figure 1.3: Quantiles, density plots, and boxplots for reaction times in a visual lexical decision experiment.
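The steps for combining a histogram with a density curve can be rehearsed without the lexdec data. The sketch below substitutes simulated values for the log reaction times (the vector rt, its mean and its standard deviation are invented), and adds a second curve with the adjust argument of density() to show how the amount of smoothing changes the estimate.

```r
set.seed(42)
rt = rnorm(500, mean = 6.4, sd = 0.25)   # simulated stand-in for log RT

h = hist(rt, plot = FALSE)               # save the histogram, do not plot it
d = density(rt)                          # default amount of smoothing

# Axis limits that accommodate both the bars and the curve.
xlimit = range(h$breaks, d$x)
ylimit = range(0, h$density, d$y)

hist(rt, freq = FALSE, col = "lightgrey", border = "darkgrey",
     xlim = xlimit, ylim = ylimit, main = "", xlab = "simulated log RT")
lines(d)

# adjust = 2 doubles the bandwidth, which gives a smoother, flatter curve.
d2 = density(rt, adjust = 2)
lines(d2, lty = 2)
```

The bandwidth plays the same role for the density curve as the bin width does for the histogram, so it is worth trying a few values when the default curve looks too bumpy or too flat.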

There are several other ways in which you can visualize the distribution of a random variable. Figure 1.3 shows plots based on the values sorted from small to large. The upper left panel plots the index (or rank) of the reaction time on the horizontal axis, and the corresponding reaction time on the vertical axis. This way of plotting the data reveals the range of values, as well as the presence of outliers. Outliers are data points with values that are surprisingly large or small given all data points considered jointly. There are a few outliers with short reaction times, and some more outliers with very long reaction times. The plot also shows that the distribution of reaction times is asymmetric, which we also observed with the density and histogram plots. This panel was produced with


> plot(sort(lexdec$RT), ylab = "log RT")

The upper right panel of Figure 1.3 shows the quartiles of the distribution of reaction times, and the lower left panel the deciles. The quartiles are the data points you get by dividing the sorted data into four equal parts. The 50% quartile is also known as the median. The deciles are the data points dividing the sorted data into 10 equal parts. The function quantile() calculates the quantiles for its input vector, and by default it produces the quartiles. By supplying a second vector with the required percentage points, the default can be changed. Here is the code that produced the quantile plots in Figure 1.3.

> plot(quantile(lexdec$RT), xaxt = "n",
+   xlab = "Quartiles", ylab = "log RT")
> mtext(c("0%", "25%", "50%", "75%", "100%"),
+   side = 1, at = 1:5, line = 1, cex = 0.7)
> plot(quantile(lexdec$RT, seq(0, 1, 0.1)),
+   xaxt = "n", xlab = "Deciles", ylab = "log RT")
> mtext(paste(seq(0, 100, 10), rep("%", 11), sep = ""),
+   side = 1, at = 1:11, line = 1, cex = 0.7, las = 2)

These plots require special attention with respect to the labels on the horizontal axis. We first instruct plot() to forget about tick marks with xaxt = "n", and we then use the mtext() function to add a text vector (the percentage points) to one of the margins of the plot, in this case, the bottom margin. The margins of a plot are labeled 1 (bottom), 2 (left), 3 (top) and 4 (right), and we direct mtext() to a margin with the option side. The option at specifies where the elements of the text vector should be placed, and with line = 1 we specify that the text should be placed one line away from the plot. We also set the font size to 0.7 of the default with cex. In the case of the lower left panel, the labels for the deciles are rotated with las = 2, which tells mtext() to place the strings perpendicular to the margin. Finally, note that the function paste() glues the elements of its input vectors together into strings. By default, the elements are separated by a space, but as we want the percentage sign to be immediately adjacent to the preceding number, we set the separator to the empty string with sep = "".
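The behaviour of quantile() and paste() is easiest to see on a handful of numbers. The following sketch uses an invented vector x; it also shows that paste() recycles a single "%" across all numbers, so that the call to rep() in the code above could be dispensed with.

```r
# Ten invented observations, already sorted for readability.
x = c(2, 4, 4, 5, 7, 9, 10, 12, 15, 20)

quantile(x)                       # the default: quartiles
q = quantile(x, seq(0, 1, 0.1))   # a second argument requests the deciles
q

# The 50% quantile is the median.
median(x)

# paste() glues numbers and the percentage sign together; the single "%"
# is reused for every element of the first vector.
labels = paste(seq(0, 100, 10), "%", sep = "")
labels
```

The strings produced by paste() are the same as the names that quantile() attaches to its output, so names(q) could also have been handed to mtext() directly.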

Figure ?? plots the estimated density, the ordered values, and a new summary plot, a box and whiskers plot or boxplot, for the reaction times, with the untransformed RTs in milliseconds on the upper row of panels, and log RT on the lower row of panels. The boxplots (the rightmost panels) were made with the function boxplot(), where we use the exponential function exp() to undo the logarithmic transformation of the reaction times:

boxplot(exp(lexdec$RT)) # upper right panel

boxplot(lexdec$RT) # lower right panel

The box in a box and whiskers plot shows the interquartile range, the range from the first to the third quartile. The whiskers in a boxplot extend to the most extreme data points that lie no further than 1.5 times the interquartile range from the box. Points falling outside the whiskers are plotted individually; they are potential outliers. The horizontal line in the box represents the median. The large number of individual points extending above the upper whiskers in these boxplots shows that we are dealing with a skewed, non-symmetrical distribution.
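The numbers behind a boxplot can be inspected with boxplot.stats(), the function that boxplot() uses for its calculations. The sketch below applies it to ten invented values, one of which is suspiciously large. Note that boxplot.stats() works with hinges, which are close to, but not always identical with, the quartiles returned by quantile().

```r
# Ten invented values with one suspiciously large observation.
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

b = boxplot.stats(x)
b$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out      # the points that are plotted individually

# Reconstructing the upper limit for the whisker by hand:
hinges = b$stats[c(2, 4)]
iqr = diff(hinges)                    # the height of the box
upper.limit = hinges[2] + 1.5 * iqr
x[x > upper.limit]                    # the same points as in b$out
```

The upper whisker ends at the largest observation that does not exceed the limit, which is why it is drawn at an observed data point rather than at the limit itself.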

A comparison of the upper and lower panels shows that the skewing is reduced, although not eliminated, by the logarithmic transformation. This is clearly visible in the boxplot in the lower right. There are still many high outliers, but their number is smaller and the box has moved more towards the center of the graph. The purpose of the logarithmic transformation is to reduce the skewing and to make the distribution more symmetric, as most statistical tools discussed in later chapters are built on the assumption that variables have reasonably symmetric distributions.
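The effect of the logarithmic transformation on skewing can be imitated with simulated data. In the sketch below, rt.ms is an invented vector of right-skewed 'reaction times' (it is not the lexdec data, and the parameters are arbitrary); the difference between mean and median serves as a rough index of asymmetry.

```r
set.seed(7)
# Simulated, right-skewed 'reaction times' in milliseconds (invented).
rt.ms = exp(rnorm(1000, mean = 6.4, sd = 0.3))
log.rt = log(rt.ms)

# In a right-skewed distribution the long tail pulls the mean above the median.
mean(rt.ms) - median(rt.ms)

# After the transformation, mean and median nearly coincide.
mean(log.rt) - median(log.rt)

# The number of points beyond the whiskers, before and after transforming.
length(boxplot.stats(rt.ms)$out)
length(boxplot.stats(log.rt)$out)
```

For these simulated data the transformation removes the skewing almost completely by construction; for real reaction times the improvement is typically only partial, as described above.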

1.2.2 Visualizing two or more variables

In the lexical decision experiment, some subjects were native speakers of English, others were not. And some were men, and others women. Let’s cross-tabulate the subjects by native language and sex. We first create a data frame with the subject-specific information only,

> subjects = unique(lexdec[, c("Subject", "NativeLanguage", "Sex")])
> subjects[1:4, ]
    Subject NativeLanguage Sex
1        A1        English   F
475      A2        English   M
712      A3          Other   F
949       C        English   F

and then use table() with two input factors instead of one:

> subjects.tab = table(subjects$NativeLanguage, subjects$Sex)
> subjects.tab
          F M
  English 7 5
  Other   7 2

We can make a barplot for this two-way contingency table with the same barplot() function we used above. The upper left panel of Figure 1.4 was produced with

> barplot(subjects.tab, beside = T,
+   legend.text = c("English", "other"),
+   col = c("black", "white"))

With beside=T we tell R to plot the two values of a column beside each other, instead of stacking them above each other. The legend is added with the argument legend.text.

Figure 1.4: A barplot for a 2 by 2 contingency table, and scatterplots with scatterplot smoothers.
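Cross-tabulation and the grouped barplot can be tried out on a small invented set of subjects. In the sketch below, the data frame subj and its six rows are made up; using rownames(tab) for the legend is our own suggestion, which avoids typing the labels by hand and guarantees that they match the order of the bars.

```r
# Six invented subjects.
subj = data.frame(
  NativeLanguage = c("English", "English", "Other", "English", "Other", "Other"),
  Sex            = c("F", "M", "F", "F", "F", "M")
)

# Rows of the table: first argument; columns: second argument.
tab = table(subj$NativeLanguage, subj$Sex)
tab

rowSums(tab)    # number of subjects per native language
colSums(tab)    # number of subjects per sex

# One group of bars per column (sex), one bar per row (language).
barplot(tab, beside = TRUE, legend.text = rownames(tab),
        col = c("black", "white"))
```

With beside = FALSE, the default, the same counts would be stacked on top of each other, giving one bar per sex.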


The remaining panels of Figure 1.4 illustrate how the relation between two numerical variables can be visualized by means of scatterplots. The upper right panel plots the 81 words in items in the plane spanned by log Frequency and log Family Size. You can see that words with a very high frequency tend to have a very high family size. In other words, the two variables are positively correlated. At the same time, it is also clear that there is a lot of noise, and that the scatter (or variance) in family sizes is greater for lower frequencies. Such an uneven pattern is referred to as heteroskedastic, and is endemic in lexical statistics. The following lines of code illustrate how to create the three scatterplots of Figure 1.4.

> plot(items$Frequency, items$FamilySize)    # upper right panel
> plot(items$Frequency, items$FamilySize)    # lower left panel
> lines(lowess(items$Frequency, items$FamilySize))
> plot(items$Length, items$SynsetCount)      # lower right panel
> lines(lowess(items$Length, items$SynsetCount))

The lower left panel illustrates how you can use a scatterplot smoother to bring out the main trend in the data. The function that we have used here is lowess(), the output of which is fed into lines(). There are many other smoothers; for further details we refer the reader to Venables & Ripley (2000:228–232). As for histograms and density estimation, the shape of the smooth curve running through the data points depends on the 'bin' width, which specifies the points in the plot that influence the smooth at each value. The default settings for this bin width (or smoother span) are a sensible first guess, but when you think there is undersmoothing or oversmoothing you can try out other spans. For further details, the reader should consult the on-line help for lowess().
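The influence of the smoother span is easy to explore with simulated data. The sketch below generates a noisy curve (the vectors x and y are invented) and draws two smoothers: one with the default span of lowess(), which is f = 2/3, and one with a much smaller span. The value 0.2 is an arbitrary choice for illustration.

```r
set.seed(3)
x = seq(1, 10, length.out = 100)
y = sin(x) + rnorm(100, sd = 0.3)     # a wavy trend plus noise (invented)

plot(x, y)

# Default span: two thirds of the points influence the smooth at each value.
smooth.default = lowess(x, y)
lines(smooth.default)

# A smaller span follows local bumps more closely (risk of undersmoothing).
smooth.small = lowess(x, y, f = 0.2)
lines(smooth.small, lty = 2)
```

lowess() returns a list with components x and y, which is why its output can be passed to lines() without further ado.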

The lower right panel shows a scatterplot for word length and number of synsets, again with a lowess smoother. Comparing the two graphs, the correlation in the left panel seems more robust than the one in the right panel. This is as far as visual inspection of the data can lead us. We will need more formal methods to guide us with respect to the question whether there are grounds for assuming these patterns would be observed again in new samples of the same kind of words.

The plot of family size by frequency raises the question which words in the data set have both high frequencies and high family sizes. A plot that is quite helpful here is a scatterplot in which the circles are replaced by the corresponding words, as shown in Figure 1.5. This figure was produced with

> plot(items$Frequency, items$FamilySize, type="n",
+   xlab="log frequency", ylab="log family size")
> text(items$Frequency, items$FamilySize,
+   as.character(items$Word), cex=0.8)   # convert factor to strings

It is easy to see that horse and dog are the words with the highest frequency and family size in the sample. The text() function is the crucial tool here. It requires three vectors: the x coordinates, the y coordinates, and the strings to be plotted at these coordinates. In order to avoid plotting both strings and plot symbols, we specified type = "n" in the plot() command, so that the axes, labels and tick marks are properly set up, but no actual points are shown.

Figure 1.5: Scatterplot of family size by frequency with labeled points for 81 words denoting animals and plants.

Figure 1.6: A pairs plot for the five numerical variables in the items data frame.

Thus far, we have considered plots involving two variables only. Often, we have more than two variables, and although we might look at all possible combinations with a series of scatterplots, it is often more convenient and insightful to make a single multipanel figure that shows all pairwise scatterplots. Figure 1.6 shows such a scatterplot matrix for all two by two combinations of the five numerical variables in items. The panels on the main diagonal provide the labels for the panels. Furthermore, each pair of variables is plotted twice, once with a given variable on the horizontal axis, and once with the same variable on the vertical axis. Such pairs of plots have coordinates that are mirrored in the main diagonal. Thus, panel (1,2) is the mirror image of panel (2,1), which we just studied in Figure 1.5. Similarly, panel (5,1) in the lower left has its opposite in the upper right corner at location (1,5). This pairs plot was produced with

pairs(items[,-c(1,6)])

The condition on the columns with the minus sign, -c(1,6), allowed all columns except columns 1 and 6 (both factors) into the plot. Note that there seem to be correlations among many of these variables, a phenomenon that is known as multicollinearity. The problem that multicollinearity causes is that when we seek to understand how these variables affect lexical decision latencies or weight ratings, it may be quite difficult to ascertain what the independent contribution of the different variables might be.
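A numerical companion to the pairs plot is the matrix of pairwise correlations computed by cor(). The sketch below builds a small data frame with deliberately correlated variables; the data frame d, its column names and all its values are simulated for illustration and have nothing to do with the items data.

```r
set.seed(11)
# Simulated variables that are correlated by construction (invented values).
freq = rnorm(50, mean = 5, sd = 1.5)
fam  = 0.5 * freq + rnorm(50, sd = 0.5)       # rises with freq
len  = 8 - 0.4 * freq + rnorm(50)             # falls with freq

d = data.frame(Word = paste("w", 1:50, sep = ""),
               Frequency = freq, FamilySize = fam, Length = len)

# A negative subscript drops a column: here the non-numeric first column.
pairs(d[, -1])

# The same pairwise relations as numbers between -1 and 1.
round(cor(d[, -1]), 2)
```

Like the pairs plot, the correlation matrix is symmetric around its main diagonal, so each pair of variables appears twice.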

1.2.3 Trellis graphics

A trellis is a wooden grid for growing roses and other flowers that need vertical support. Trellis graphics are graphs in which data are visualized by many systematically organized graphs simultaneously. We have encountered one trellis function already, the pairs() function that produces a pairwise scatterplot matrix, where each plot is a hole in the 'trellis'. There are more advanced functions for more complex trellis plots; they are available in the lattice library. In order to use these functions, we first have to load this library:

library(lattice)

Trellis graphics become important when you are dealing with different groups of data. For instance, the words in the items data frame fall into two groups: animals on the one hand, and the produce of plants (fruits, vegetables, nuts) on the other hand. Therefore, the factor Class (with levels animal and plant) is a grouping factor for the words. Another possible grouping factor is whether the word is morphologically complex (e.g., woodpecker) or morphologically simple (e.g., snake). With respect to the lexical decision data in lexdec, the factor Subject is a grouping factor: each subject completed the same experiment with 79 words and 79 nonwords. In turn, the subjects can be grouped by their first language, English, or some other language.

A question that arises when running a lexical decision experiment with native and non-native speakers of English is whether there might be systematic differences in how they perform this task. It is to be expected that the non-native speakers require more time for a lexical decision. But the way they make errors might differ as well. In order to explore this possibility, we make boxplots for the reaction times for correct and incorrect responses, and we do this both for the native speakers, and for the non-native speakers in the experiment. In other words, we use the factor NativeLanguage as a grouping factor. In order to make this grouped boxplot, we use the bwplot() function from the lattice library, as follows:

> bwplot(RT ~ Correct | NativeLanguage, data = lexdec)

Figure 1.7: Trellis box and whiskers plot for log reaction time by accuracy (correct versus incorrect response) grouped by the first language of the subject. [Two panels, English and Other; horizontal axis: correct, incorrect; vertical axis: RT, 6.0 to 7.5.]

The result is shown in Figure 1.7. As you can see, bwplot() requires two arguments, a formula and a data frame. The formula

RT ~ Correct | NativeLanguage

is read as 'consider RT as a function of (or as depending on) Correct (with levels correct and incorrect), grouped by the levels of NativeLanguage (with levels English and Other)'. Note that the vertical bar is the grouping operator. Another paraphrase within the context of bwplot() is 'create box and whisker plots for the distributions of reaction times for the levels of Correct conditioned on the levels of NativeLanguage'. The result is a plot with two panels, one for each level of the main grouping factor, native language. Within each of these panels, we have two box and whiskers plots, one for each level of Correct. This trellis graph shows some remarkable differences between the native and non-native speakers of English. First of all, we see that the boxes (and medians) for the non-native speakers are shifted upwards compared to those for the native speakers, indicating that they required more time for their decisions, as expected. Interestingly, we also see that the incorrect responses were associated with shorter decision latencies for the native speakers, but with longer latencies for the non-native speakers. Finally, note that there are many outliers only for the correct responses, for both groups of subjects. Later in this course, we shall see how we can test whether the pattern that we see here is indeed reason for surprise. What is clear at this point is that there is a pattern in the data that is worth examining in greater detail.
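The conditioning formula can be tried out without the course data. The following is a minimal sketch on a simulated data frame: toy and its values are hypothetical, and only the column names are borrowed from lexdec. It also tabulates the medians that the boxes display, using tapply().

```r
library(lattice)

# Hypothetical toy data (not the lexdec data): simulated log RTs for two
# language groups, each with 30 correct and 10 incorrect responses.
set.seed(1)
toy <- data.frame(
  NativeLanguage = factor(rep(c("English", "Other"), each = 40)),
  Correct        = factor(rep(rep(c("correct", "incorrect"),
                                  times = c(30, 10)), times = 2))
)
toy$RT <- rnorm(nrow(toy),
                mean = ifelse(toy$NativeLanguage == "English", 6.3, 6.6),
                sd = 0.2)

# One panel per level of NativeLanguage, one box per level of Correct.
p <- bwplot(RT ~ Correct | NativeLanguage, data = toy)
print(p)  # inside a script, a lattice plot is drawn when it is printed

# The medians for each combination of Correct (rows) and
# NativeLanguage (columns).
med <- tapply(toy$RT, list(toy$Correct, toy$NativeLanguage), median)
```

Because the toy reaction times are simulated without any effect of Correct, the boxes within a panel should look alike here; the pattern discussed above is a property of the real data only.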

Figure 1.8 illustrates the powerful but also more complex xyplot() function. For each of the subjects in the weight rating experiment, it shows the weight rating as a function of log frequency. The initials of the subjects (the grouping factor) appear in the title bars above each panel. This graph was made with the function xylowess(), which is available in the scripts file FUNCTIONS/cap1.q. This function facilitates the use of xyplot() but is much less flexible. We discuss this function and xyplot() in some more detail in the next section. We first load the function into R using source(), and then run xylowess():

> source("FUNCTIONS/cap1.q")

> xylowess(Rating ~ Frequency | Subject,
+   data = weight,
+   xlab = "log Frequency", ylab = "Weight Rating")

The dependent variable (Rating) appears on the vertical axes, the predictor (Frequency) is graphed on the horizontal axes, and there is one panel for each of the levels of the grouping factor, Subject. As can be seen in Figure 1.8, weight ratings appear to increase with increasing (log) frequency. There seems to be some variation in how strong the effect is. To judge from the scatterplot smoothers, subject G (third on the bottom row) does not seem to have this frequency effect, in contrast to, for instance, subject R5, for whom the effect seems quite large.
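For readers who do not have the scripts file at hand, a plot of this kind can be approximated with xyplot() and a custom panel function. The following is only a sketch of the general idea, not the definition of xylowess(): the data frame toy is simulated, and the panel function simply overlays the scatterplot smoother lowess() on the points.

```r
library(lattice)

# Hypothetical toy data (not the weight rating data): three subjects,
# each rating 30 words whose log frequency ranges from 1 to 7.
set.seed(2)
toy <- data.frame(
  Subject   = factor(rep(c("S1", "S2", "S3"), each = 30)),
  Frequency = rep(seq(1, 7, length.out = 30), times = 3)
)
toy$Rating <- 2 + 0.4 * toy$Frequency + rnorm(nrow(toy), sd = 0.5)

# One scatterplot per subject, with a lowess smoother drawn on top.
p <- xyplot(Rating ~ Frequency | Subject, data = toy,
            xlab = "log Frequency", ylab = "Weight Rating",
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)          # the data points
              smooth <- lowess(x, y)           # smoother from the stats package
              panel.lines(smooth$x, smooth$y)  # add the smoothed curve
            })
print(p)  # inside a script, a lattice plot is drawn when it is printed
```

The panel function is called once for each level of Subject, with x and y holding the observations for that subject only, which is why each panel gets its own smoother.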

A similar plot for weight rating by number of synsets is shown in Figure 1.9. What we observe here for almost all subjects is a shallow U-shaped curve. A problem that arises here, however, is that words with many synsets also tend to have high frequencies. Hence, the right part of the U-shaped curves might reflect the effect of frequency

[Figure 1.8: one panel per subject (A1, A2, G, H, I1, I2, J, K, L, M1, M2, P, R1, R2, R3, R4, R5, S1, S2, T1); horizontal axis: Frequency; vertical axis: Rating.]

[Figure 1.9: one panel per subject (A1, A2, G, H, I1, I2, J, K, L, M1, M2, P, R1, R2, R3, R4, R5, S1, S2, T1); horizontal axis: log Synset Count; vertical axis: Weight Rating.]