Data in statistics and in R
2.2 Objects that hold data
In addition to vectors, matrices, lists and data frames are object types that hold data.
Learning to work with these objects is essential to working with data in R.
2.2.1 Arrays and matrices
Arrays generalize the concept of vectors. Recall that a vector has dimension 1 and a length of at least 0. The ith element of a vector is accessed via the subscript notation, e.g., v[i]. Matrices are two-dimensional arrays; they are rectangular. The element at the intersection of the ith row and jth column is accessed with m[i, j]. Arrays have k dimensions. Each element of an array is accessed with k indices, a[i1, i2, ..., ik].
An array object of dimension 1 differs from a vector object by virtue of having a dimension vector. The dimension vector is a vector of positive integers. The length of this vector gives the dimension of the array. The dimension vector is an attribute of an array.
The name of the attribute is dim. Here are some statements that clarify these ideas:
> (v <- 1 : 10) # a vector
[1] 1 2 3 4 5 6 7 8 9 10
> c('vector?' = is.vector(v), 'array?' = is.array(v))
vector?  array?
   TRUE   FALSE
> dim(v) <- c(10) # endow v with dim and it is an array
> c('vector?' = is.vector(v), 'array?' = is.array(v))
vector?  array?
  FALSE    TRUE
> v # array of dimension 1 prints like a vector
[1] 1 2 3 4 5 6 7 8 9 10
A matrix object is a two-dimensional array. It therefore has a dim attribute. Its dimension vector has a length of 2. The first element indicates the number of rows and the second the number of columns. Here is an example of how to create a matrix with 3 columns and 2 rows with matrix():
> (m <- matrix(0, ncol = 3, nrow = 2))
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
Next, we verify that m is in fact an array with is.array():
> c('matrix?' = is.matrix(m), 'array?' = is.array(m))
matrix?  array?
   TRUE    TRUE
As is.matrix() and is.array() illustrate, m is both a matrix and an array object. In other words, every matrix is an array, but not every array is a matrix. As with any other object, you get information about the attributes of a matrix with attributes():
> attributes(m)
$dim
[1] 2 3
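To make the statement "every matrix is an array, but not every array is a matrix" concrete, here is a minimal sketch. The object names a3 and a2 are ours, introduced only for illustration:

```r
# A 3-dimensional array is an array, but it is not a matrix.
a3 <- array(0, dim = c(2, 3, 4))
is.array(a3)     # TRUE
is.matrix(a3)    # FALSE: a matrix has exactly two dimensions
length(dim(a3))  # 3, the number of dimensions

# A 2-dimensional array built with array() is a matrix.
a2 <- array(0, dim = c(2, 3))
is.matrix(a2)    # TRUE
```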
Just like vectors, you index and extract elements from arrays with index vectors (see Section 1.3.5). In the next example, we create a matrix of 5 columns and 4 rows and extract a submatrix from it:
> (m <- matrix(1 : 20, ncol = 5, nrow = 4))
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
> i <- c(2, 3) ; j <- 2 : 4 # index vectors
> m[i, j] # extract rows 2,3 and columns 2 to 4
     [,1] [,2] [,3]
[1,]    6   10   14
[2,]    7   11   15
Note how the matrix is created from a sequence of 20 numbers. The first column is filled in first, then the second and so on. This is a general rule: matrices are filled column-wise because the leftmost index runs the fastest. The rule also applies to dimensions higher than 2 (i.e., to arrays). You can fill a matrix by row with the named argument byrow in your call to matrix(). You can also name the matrix dimensions with the named argument dimnames.
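The byrow and dimnames arguments mentioned above can be sketched as follows. The object name m.row and the row and column labels are ours, chosen only for illustration:

```r
# Fill a 2 x 3 matrix row by row and name both dimensions.
m.row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                dimnames = list(c("r1", "r2"), c("a", "b", "c")))
m.row
#    a b c
# r1 1 2 3
# r2 4 5 6

# With dimnames in place, elements can be extracted by name.
m.row["r2", "b"]  # 5
```

Without byrow = TRUE, the same call fills the matrix column by column, so the first column would hold 1 and 2.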
Arrays are constructed with array():
> v <- 1 : 24 ; (a <- array(v, dim = c(3, 5, 2)))
, , 1

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15

, , 2

     [,1] [,2] [,3] [,4] [,5]
[1,]   16   19   22    1    4
[2,]   17   20   23    2    5
[3,]   18   21   24    3    6
The printing pattern follows the array filling rule: from the slowest running index (depth of 2), to the next slowest (5 columns), to the fastest (3 rows). Note the recycling: v is not long enough to fill the array, so after 24 elements its values start over from the beginning.
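The filling rule implies a simple relation between the three indices of an element and its position in the underlying vector. The following sketch checks that relation for one element. The names d, i, j, k and pos are ours, and the formula is the standard column-major offset rather than anything specific to this text:

```r
a <- array(1:24, dim = c(3, 5, 2))  # 1:24 is recycled to fill 30 cells
d <- dim(a)
i <- 2; j <- 4; k <- 2

# Position of a[i, j, k] in the underlying vector (leftmost index runs fastest)
pos <- i + (j - 1) * d[1] + (k - 1) * d[1] * d[2]
pos         # 26
a[i, j, k]  # 2: the 26th value, i.e. the 2nd value of 1:24 after recycling
a[pos]      # the same element, reached with a single linear index
```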
We have already seen how to construct matrices with matrix(). Matrices can also be constructed with the cbind() (column bind) and rbind() (row bind) functions.
We discussed them in Section 1.3.5.
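As a brief reminder, here is a minimal sketch of both functions applied to two short vectors. The names x and y are ours:

```r
x <- 1:3
y <- 4:6

cb <- cbind(x, y)  # x and y become the columns of a 3 x 2 matrix
rb <- rbind(x, y)  # x and y become the rows of a 2 x 3 matrix

dim(cb)       # 3 2
dim(rb)       # 2 3
colnames(cb)  # "x" "y": the names are taken from the arguments
```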
2.2.2 Lists
Lists are objects that can contain arbitrary objects. The elements of a list constitute an ordered collection of objects. To construct a list, use list(). In the next example, we make a list of a character vector, integer vector and a matrix. Each component of the list is named during construction:
> ch.v <- letters[1 : 5] # character vector
> int.v <- as.integer(1 : 7) # integer vector
> m <- matrix(runif(10), ncol = 5, nrow = 2) # matrix
> (hodge.podge <- list(integers = int.v, # list
+ letter = ch.v, floats = m))
$integers
[1] 1 2 3 4 5 6 7
$letter
[1] "a" "b" "c" "d" "e"
$floats
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 0.5116554 0.3470034 0.2139750 0.3776336 0.3646456
[2,] 0.5246382 0.8092359 0.4230139 0.7846506 0.7316200
When the components of the list are named, they can be accessed in two ways—by name or by index:
> rbind(hodge.podge$letter, hodge.podge[[2]]) # row bind
     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "b"  "c"  "d"  "e"
[2,] "a"  "b"  "c"  "d"  "e"
Single list components are accessed with double square brackets, not with single square brackets. Single brackets applied to a list return a sub-list; we also use them to access elements of an array or a vector. Here we extract the second and third elements from the second component of hodge.podge:
> cbind(hodge.podge$letter[2 : 3], # column bind
+ hodge.podge[[2]][2 : 3])
     [,1] [,2]
[1,] "b"  "b"
[2,] "c"  "c"
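The difference between single and double square brackets on a list is easy to overlook, so here is a minimal sketch. The list hp is a hypothetical stand-in for hodge.podge, without the random matrix:

```r
hp <- list(integers = 1:7, letter = letters[1:5])  # stand-in list (our name)

sub  <- hp[2]    # single brackets: a list of length 1 holding the component
comp <- hp[[2]]  # double brackets: the component itself

class(sub)   # "list"
class(comp)  # "character"

# Single brackets can select several components at once; the result is a list.
both <- hp[c("integers", "letter")]
length(both)  # 2
```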
The length of a list is the number of its components. Try to decipher the following lengths of various parts of hodge.podge:
> length(hodge.podge) # no. of list components
[1] 3
> length(hodge.podge$floats) # no. of elements in floats
[1] 10
> length(hodge.podge$floats[, 1]) # no. of rows in floats
[1] 2
> length(hodge.podge$floats[1, ]) # no. of columns in floats
[1] 5
> length(hodge.podge[[3]][1, ]) # no. of columns in floats
[1] 5
Another way to access named list components is using the name of the component in double square brackets. Compare the following:
> (x <- hodge.podge$integers)
[1] 1 2 3 4 5 6 7
> (y <- hodge.podge[['integers']])
[1] 1 2 3 4 5 6 7
Like vectors, lists can be concatenated with c().
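A minimal sketch of list concatenation follows. The names l1, l2 and l12 are ours:

```r
l1 <- list(a = 1:3, b = "x")
l2 <- list(c = matrix(0, nrow = 2, ncol = 2))

l12 <- c(l1, l2)  # one list holding the components of both
length(l12)       # 3
names(l12)        # "a" "b" "c"

# Caution: when a list is concatenated with a plain vector, each element
# of the vector becomes a separate list component.
length(c(l1, 4:5))  # 4
```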
2.2.3 Data frames
As you will quickly find out, we do much of our work with objects of type data.frame.
These objects fit somewhere in between matrices and lists. They are not as rigid as matrices—they can contain columns of different modes—but they are not as loose as lists—they are required to have a rectangular structure.
Data frames are closest to what we think of as data tables (see Example 2.7 and Figure 2.2). You refer to their elements as you do in matrices. Many functions in R use data frames as the starting point for analysis. For this reason alone, you should put your data into data frames before analysis. We shall frequently use the convenience that data frames provide.
We construct data frames with data.frame() from appropriate objects. These objects can be of almost any mode, but they must all have the same length:
> composers <- c('Sibelius', 'Wagner', 'Shostakovitch')
> grandiose <- c(1, 3, 2)
> (music <- data.frame(composers, grandiose))
      composers grandiose
1      Sibelius         1
2        Wagner         3
3 Shostakovitch         2
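The two properties stated above (columns of different modes, but of equal length) can be checked with a short sketch. We pass stringsAsFactors = FALSE explicitly, which is our assumption to keep the character column a character vector in all versions of R. The names modes and res are ours:

```r
composers <- c("Sibelius", "Wagner", "Shostakovitch")
grandiose <- c(1, 3, 2)
music <- data.frame(composers, grandiose, stringsAsFactors = FALSE)

# Each column keeps its own mode.
modes <- sapply(music, mode)
modes  # composers: "character", grandiose: "numeric"

# Columns whose lengths do not fit together are rejected with an error.
res <- tryCatch(data.frame(composers, grandiose = c(1, 3)),
                error = function(e) conditionMessage(e))
res    # an error message about differing numbers of rows
```

Note that data.frame() does recycle a shorter column when its length divides the number of rows evenly (a single value, for instance), so the error appears only when the lengths cannot be reconciled.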
We can also construct a data frame with read.table() which reads appropriately saved data from a text file directly into a data frame (see Section 2.4). Appropriate objects can be coerced into data.frames with as.data.frame():
> as.data.frame(matrix(1 : 24, nrow = 4, ncol = 6))
  V1 V2 V3 V4 V5 V6
1  1  5  9 13 17 21
2  2  6 10 14 18 22
3  3  7 11 15 19 23
4  4  8 12 16 20 24
In the case above, data.frame() will also work, except that it will name the columns X1, ..., X6 instead of V1, ..., V6. This small inconsistency (present at the time of writing) might trip you up if such calls are embedded in scripts that refer to columns by name.
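The naming difference is easy to verify, and one way to guard scripts against it is to set the column names explicitly. The following is a sketch of that idea, not the only remedy. The names m6 and df are ours:

```r
m6 <- matrix(1:24, nrow = 4, ncol = 6)

names(as.data.frame(m6))  # "V1" "V2" "V3" "V4" "V5" "V6"
names(data.frame(m6))     # "X1" "X2" "X3" "X4" "X5" "X6"

# Setting the names yourself removes the dependence on either default.
df <- data.frame(m6)
names(df) <- paste0("V", seq_len(ncol(df)))
names(df)                 # "V1" "V2" "V3" "V4" "V5" "V6"
```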
You can refer to columns of a data frame by index or by name. If by name, join the column name to the data frame name with $. For example, for the music data frame above, you access the composers column in one of three ways:
> noquote(cbind('BY NAME' = music$composers,
+ '|' = '|', 'BY INDEX' = music[, 1],
+ '|' = '|', 'BY NAMED-INDEX' = music[, 'composers']))
     BY NAME       | BY INDEX      | BY NAMED-INDEX
[1,] Sibelius      | Sibelius      | Sibelius
[2,] Wagner        | Wagner        | Wagner
[3,] Shostakovitch | Shostakovitch | Shostakovitch
To access a row, use, for example, music[1, ]. Index vectors work the usual way on rows and columns, depending on whether they come before or after the comma in the square brackets. Instead of accessing composers with music$composers, you can attach() the data frame and then simply indicate the column name:
> attach(music) ; composers
[1] "Sibelius" "Wagner" "Shostakovitch"
Attaching a data frame is by position. By default, attach() places music at position 2 of the search path, directly after the workspace; position 1 is reserved for the workspace itself. So if your workspace already holds a vector named composers, for example, then after attaching music, composers still refers to that vector and not to music’s composers. If you change the attached column’s data by their name, instead of with the data frame name followed by $ and the column name, then the data in the data frame do not change. Once done with your work with a data frame you can detach() it with
> detach(music)
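The warning about changing attached data by name deserves a small demonstration. In this sketch we rebuild the data frame so that the block runs on its own; stringsAsFactors = FALSE is our addition:

```r
music <- data.frame(composers = c("Sibelius", "Wagner", "Shostakovitch"),
                    grandiose = c(1, 3, 2), stringsAsFactors = FALSE)

attach(music)
# The right-hand side finds the attached column; the assignment then
# creates a new, separate vector instead of modifying the data frame.
grandiose <- grandiose * 10

music$grandiose  # still 1 3 2: the data frame is unchanged
grandiose        # 10 30 20: the new vector hides the attached column
detach(music)

# To change the data frame itself, assign through its name.
music$grandiose <- music$grandiose * 10
```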
attach() works on lists, data frames and environments; detach() also removes attached packages (see Section 1.7) from the search path. Judicious use of these two functions saves on typing. Data frames have many functions that assist in their manipulation. We will discuss them as the need arises.