Data manipulation - Data in statistics and in R

Data in statistics and in R

2.5 Data manipulation

The core of working with data is the ability to subset, merge, split and perform other such data operations. Applying various operations to subsets of the data wholesale is as important. These are the topics of this section. Unlike traditional programming languages (such as C and Fortran), executing loops in R is computationally inefficient even for mundane tasks. Many of the functions we discuss here help avoid using loops.

We shall meet others as we need them. We discussed how to subset data with index vectors in Section 1.3.5.

2.5.1 Flat tables and expand tables

If you chance upon data that appear in a contingency table format, you may read (or write) them with read.ftable() (or write.ftable()). If you use table() (we will meet it again later), you can expand.table(), a function in the package epitools.

Example 2.13. Back to the capital punishment data (Example 2.12). First, we load the data and view the unique records from two columns of interest:

> load('capital.punishment.rda')

> cp <- capital.punishment

> unique(cp[, c('Method', 'Sex')]) Method Sex

1 9 M

27 Electrocution M

130 Injection M

1022 Gas M

1800 Firing squad M

2175 Hanging M

7521 9 F

7546 Injection F

7561 Electrocution F

64 Data in statistics and in R

(9 stands for unknown or NA). unique() returns only unique records. This reveals the plethora of execution methods, including the best one, NA. Next, we want to count the number of executions by sex:

> (tbl <- table(cp[, c('Method', 'Sex')])) Sex

> (e.tbl <- expand.table(tbl[3 : 5, ])) Method Sex

You may need such an expansion for further analysis. For example, you can now do

> tapply(e.tbl$Method, e.tbl$Method, length)

Firing squad Gas Hanging

2 11 3

which is another way of counting records. tapply() takes three unnamed arguments.

In our case, the first is a vector, the second must be a factor vector and the third is a function to apply to the first based on the factor levels. You can use most functions

instead of length(). ut

2.5.2 Stack, unstack and reshape

From the R help page on stack() and unstack(): “Stacking vectors concatenates multiple vectors into a single vector along with a factor indicating where each obser-vation originated. Unstacking reverses this operation.” The need for stack() can best be explained with an example.

Example 2.14. Let us stack the 2000 U.S. presidential elections in Florida (see Example 2.2). First, we import the data and look at their head:

> e <- read.csv('elections-2000.csv')

> head(e)

County Gore Bush Buchanan Nader 1 ALACHUA 47365 34124 263 3226

2 BAKER 2392 5610 73 53

3 BAY 18850 38637 248 828

4 BRADFORD 3075 5414 65 84

5 BREVARD 97318 115185 570 4470 6 BROWARD 386561 177323 788 7101

A box plot is a way to summarize data (we will discuss it in detail later). The data on the x-axis are factors. So we stack() the data and, for posterity, add a column for the counties:

> stacked.e <- cbind(stack(e), county = e$County)

> head(stacked.e) values ind county 1 47365 Gore ALACHUA 2 2392 Gore BAKER

3 18850 Gore BAY

4 3075 Gore BRADFORD 5 97318 Gore BREVARD 6 386561 Gore BROWARD Next, we do

> plot(stacked.e[, 2 : 1]) (see Figure 2.3)

reshape() works much like stack().

Figure 2.3 Candidates and county votes for the Florida 2000 U.S. presidential

elections. ut

66 Data in statistics and in R

2.5.3 Split, unsplit and unlist

Occasionally, we need to split a data frame into a list based on some factor. A case in point might be when the frame is large and we wish to analyze part of it. Here is an example.

Example 2.15. The data refer to a large study about fish habitat in streams, con-ducted by the Minnesota Department of Natural Resources. The data were collected from two streams, coded as OT and YM under the factor CODE. We wish to split the data into two components, one for each of the streams. So we do this:

> load('fishA.rda')

> f.s <- split(fishA, fishA$CODE)

> (is.list(f.s)) [1] TRUE

> (names(f.s)) [1] "OT" "YM"

is.list() verifies that the result of split() is a list and names() gives the names

of the list components. ut

The functions unsplit() and unlist() do the opposite of what split() does.

2.5.4 Cut

From the R’s help page for cut(): “cut divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.” The need for cut is illustrated with the next example.

Example 2.16. The site http://icasualties.org maintains a list of all U.S.

Department of Defense confirmed military casualties in Iraq. To import the data, we save the HTM page, open it with a spreadsheet and save the date column. We then cut dates into 10-day intervals and use table() to count the dead. First, we import the data and turn them into “official” date format:

> casualties <- read.table('Iraq-casualties.txt', sep = '\t')

> casualties$V1 <- as.Date(casualties$V1, '%m/%d/%Y')

> head(casualties, 5) V1

1 2007-01-04 2 2006-12-30 3 2006-12-27 4 2006-12-26 5 2006-12-26

as.Date() turns the data in the only column of casualties to dates according to the format it was read: month, day and four-digit year ('%m/%d/%Y'). Next, we sort

the dates by increasing order, add a Julian day column that corresponds to the date of each casualty and display the first few rows of the data frame:

> casualties$V1 <- sort(casualties$V1)

> jd <- julian(casualties$V1)

> casualties <- data.frame(Date = casualties$V1, Julian = jd)

> head(casualties) Date Julian 1 2003-03-21 12132 2 2003-03-21 12132 3 2003-03-21 12132 4 2003-03-21 12132 5 2003-03-21 12132 6 2003-03-21 12132

Julian date is a count of the number of days that elapsed since some base date. We now determine the number of 10-day intervals, cut the Julian dates into these intervals and count the number of deaths within each interval:

> (b <- ceiling((jd[length(jd)] - jd[1]) / 10)) [1] 139

> cnts <- table(cut(jd, b))

ceiling() returns the smallest integer larger than a decimal number and cut() cuts the data in b equal intervals and returns a vector, like this:

> head(cut(jd, b))

[1] (1.213e+04,1.214e+04] (1.213e+04,1.214e+04]

[3] (1.213e+04,1.214e+04] (1.213e+04,1.214e+04]

[5] (1.213e+04,1.214e+04] (1.213e+04,1.214e+04]

(the intervals are factors named after the Julian date). Finally, we count the number of occurrences of each interval, i.e. the number of reported deaths during 10-day intervals. So the counts look like this:

> head(cnts, 5)

(1.213e+04,1.214e+04] (1.214e+04,1.215e+04]

60 57

(1.215e+04,1.216e+04] (1.216e+04,1.217e+04]

14 8

(1.217e+04,1.218e+04]

During the first 10-day interval, there were 60 reported dates (which refer to 60 casualties). Finally we plot the data (Figure 2.4). We shall see how the plot was

produced later. ut

68 Data in statistics and in R

Figure 2.4 U.S. military casualties in Iraq. Counts are shown in 10-day intervals.

2.5.5 Merge, union and intersect

These operations are best explained with an example.

Example 2.17. Let us create a vector of the first six upper case letters, the corre-sponding integer code of these letters and a data frame of these two:

> a <- data.frame(letter = LETTERS[1 : 6])

> a <- data.frame(a, code = apply(a, 1, utf8ToInt))

LETTERS(for upper case) and letters (for lower case) are data vectors supplied with R that hold the English letters. utf8ToInt() is a function that returns the integer code of the corresponding letter in the so called UTF-8 format. apply() applies to the data frame a, by rows (1) the function utf8ToInt(). Next, we create a similar data frame b and display both data frames:

> b <- data.frame(letter = LETTERS[4 : 9])

> b <- data.frame(b, code = apply(b, 1, utf8ToInt))

> cbind(a, '|' = '|', b) letter code | letter code

1 A 65 | D 68

2 B 66 | E 69

3 C 67 | F 70

4 D 68 | G 71

5 E 69 | H 72

6 F 70 | I 73

cbind() binds the columns of a and b and a column of separators between them.

Now here is what merge() does:

> merge(a, b) letter code

1 D 68

2 E 69

3 F 70

Contrast this with union():

> union(a, b) [[1]]

[1] A B C D E F Levels: A B C D E F [[2]]

[1] 65 66 67 68 69 70 [[3]]

[1] D E F G H I Levels: D E F G H I [[4]]

[1] 68 69 70 71 72 73

which creates a list of the columns in a and b. Note the asymmetry of intersect():

> cbind(intersect(a, b), '|' = '|', intersect(b, a)) letter | letter

1 D | A

2 E | B

3 F | C

4 G | D

5 H | E

6 I | F

The data frames do not have to have equal numbers of rows. ut

2.5.6 is.element()

This is a very useful functions that in “Data Speak” relates many records in one, say, data frame, to many records in another, based on common values. The function is.element() takes two vector arguments and checks for common elements in the two. It returns an index vector (TRUE or FALSE) that gives the common argument values (as TRUE) in its first argument. You can then use the returned logical vector as an index to extract desired elements from the first vector. Thus, the function is not symmetric. The next example illustrates one of the most commonly encountered problems in data manipulation. Its solution is not straightforward.

Example 2.18. In longitudinal studies, one follows some units (subjects) through time. Often, such units enter and leave the experiment after it began and before it ends. A two-year imaginary diet study started with six patients:

> begin.experiment name weight 1 A. Smith 270 2 B. Smith 263 3 C. Smith 294

70 Data in statistics and in R 4 D. Smith 218

5 E. Smith 305 6 F. Smith 261

After one year, three patients joined the study:

> middle.experiment name weight 1 G. Smith 169 2 H. Smith 181 3 I. Smith 201

Five patients, some joined at the beginning and some joined in the middle, finished the experiment:

> end.experiment name weight 1 C. Smith 107 2 D. Smith 104 3 A. Smith 104 4 H. Smith 102 5 I. Smith 100

Imagine that each of these data frames contains hundred of thousands of records.

The task is to merge the data for those who started and finished the exper-iment. First, we identify all the elements in end.experiment that are also in begin.experiment:

> (m <- is.element(begin.experiment$name, end.experiment$name)) [1] TRUE FALSE TRUE TRUE FALSE FALSE

Next, we create a vector of those patient names that started in the beginning and ended in the end:

> (begin.end <- begin.experiment[m, ]) name weight

1 A. Smith 270 3 C. Smith 294 4 D. Smith 218

> (p.names <- begin.experiment[m, 1]) [1] "A. Smith" "C. Smith" "D. Smith"

We merge the data for the weights at the beginning and end of the experiment:

> (patients <- cbind(begin.experiment[m, ],

+ end.experiment[is.element(end.experiment$name, p.names), ])) name weight name weight

1 A. Smith 270 C. Smith 107 3 C. Smith 294 D. Smith 104 4 D. Smith 218 A. Smith 104

patientsis still not very useful. Our goal is to obtain this:

name time weights 1 A. Smith begin 270 2 C. Smith begin 294 3 D. Smith begin 218 4 C. Smith end 107 5 D. Smith end 104 6 A. Smith end 104

(you will see in a moment why). To achieve this goal, we stack names and then weights:

> (p.names <- stack(patients[, c(1, 3)])) values ind

1 A. Smith name 2 C. Smith name 3 D. Smith name 4 C. Smith name.1 5 D. Smith name.1 6 A. Smith name.1

> (weights <- stack(patients[, c(2, 4)])[, 1]) [1] 270 294 218 107 104 104

Now create the data frame:

> (experiment <- data.frame(p.names, weights)) values ind weights

1 A. Smith name 270 2 C. Smith name 294 3 D. Smith name 218 4 C. Smith name.1 107 5 D. Smith name.1 104 6 A. Smith name.1 104

This is it. All that is left is to rename columns and factor levels:

> levels(experiment$ind) <- c('begin', 'end')

> names(experiment)[1 : 2] <- c('name', 'time')

and we achieved our goals. Why do we want this particular format for the data frame?

Because

> tapply(experiment$weights, experiment$time, mean)

begin end

260.6667 105.0000

is so easy. To handle data with hundreds of thousands of subjects, all you have to do

is change the indices in this example. ut

In document Statistics and Data With R (Page 78-86)