Figure 4.3 Scatterplot of gestation versus weight with smoking status

4.2 R basics: data frames and lists

4.2.2 Accessing values in a data frame

The values of a data frame can be accessed in several ways. We've seen that we can reference a variable in a data frame by name. In this section we also see how to access individual elements of each variable, or several elements at once.

Accessing variables in a data frame by name

Up to this point, most of the times that we have used the data in a data frame we have “attached” the data frame so that the variables are accessible in our work environment by their names. This is fine when the values will not be modified, but can be confusing otherwise. When R attaches a data frame it makes a copy of the variables. If we make changes to the variables, the data frame is not actually changed. This results in two variables with the same names but different values.

The following example makes a data frame using data.frame() and then attaches it. When a change is made, it alters the copy but not the data frame.

> x = data.frame(a=1:2,b=3:4)   # make a data frame
> a                             # a is not there
Error: Object "a" not found
> attach(x)                     # now a and b are there
> a                             # a is a vector
[1] 1 2
> a[1] = 5                      # assignment
> a                             # a has changed
[1] 5 2
> x                             # not x though
  a b
1 1 3
2 2 4
> detach(x)                     # remove x
> a                             # a is there and changed
[1] 5 2
> x                             # x is not changed
  a b
1 1 3
2 2 4

The with() function and the data= argument were mentioned in Chapter 1 as alternatives to attaching a data frame. We will use these when it is convenient.
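As a minimal sketch of the with() alternative, the following reuses the small data frame x from the example above. The name total is introduced here only for illustration, and the x$a form of naming a column is assumed to be familiar from earlier. Because nothing is attached, there is only one copy of each variable to keep track of.

```r
x = data.frame(a = 1:2, b = 3:4)   # the same data frame as above

## with() evaluates the expression using the columns of x;
## no copies of a and b are left in the work environment
total = with(x, a + b)
total

## a change made through the data frame is seen by with()
x$a[1] = 5
with(x, a)
```

Here the assignment changes x itself, so a later call to with() sees the new value.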

Accessing a data frame using [,] notation

When we use a spreadsheet, we refer to the cell entries by their column names and row numbers. Data-frame entries can be referred to by their column names (or numbers) and/or their row names (or numbers).

Entries of a data vector are accessed with the [] notation. This allows us to specify the entries we want by their indices or names. If df is the data frame, the basic notation is

df[row, column]

There are two positions to put values, though we may leave them blank on purpose. In particular, the value of row can be a single number (to access that row), a vector of numbers (to access those rows), a single name or vector of names (to match the row names), a logical vector of the appropriate length, or left blank to match all the rows. Similarly for the value of column. As with data vectors, rows and columns begin numbering at 1.

In this example, we create a simple data frame with two variables, each with three entries, and then we add row names. Afterward, we illustrate several styles of access.

> df=data.frame(x=1:3,y=4:6)                # add in column names
> rownames(df)=c("row 1","row 2","row 3")   # add row names
> df
      x y
row 1 1 4
row 2 2 5
row 3 3 6
> df[3,2]                                   # row=3,col=2
[1] 6
> df["row 3","y"]                           # by name
[1] 6
> df[1:3,1]                                 # rows 1, 2 and 3; column 1
[1] 1 2 3
> df[1:2,1:2]                               # rows 1 and 2, columns 1 and 2
      x y
row 1 1 4
row 2 2 5
> df[,1]                                    # all rows, column 1, returns vector
[1] 1 2 3
> df[1,]                                    # row 1, all columns
      x y
row 1 1 4
> df[c(T,F,T),]                             # rows 1 and 3 (T=TRUE)
      x y
row 1 1 4
row 3 3 6
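A logical row index rarely needs to be typed by hand; it is usually produced by a comparison. The sketch below rebuilds the data frame df from the example above and selects the rows where x is at least 2. The names keep and sub are introduced here only for illustration.

```r
df = data.frame(x = 1:3, y = 4:6)
rownames(df) = c("row 1", "row 2", "row 3")

keep = df[, "x"] >= 2     # logical vector, one entry per row
keep

sub = df[keep, ]          # the rows where keep is TRUE, all columns
sub

df[keep, "y"]             # the matching entries of y, as a vector
```

The comparison is done once and the result can be reused as the row index for any choice of columns.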

The data-frame notation allows us to take subsets of the data frames in a natural and efficient manner. To illustrate, let’s consider the data set babies (UsingR) again. We wish to see if any relationships appear between the gestation time (gestation), birth weight (wt), mother’s age (age), and family income (inc).

Again, we need to massage the data to work with R. Several of these variables have a special numeric code for data that is not available (NA). Looking at the documentation of babies (UsingR) (with ?babies), we see that gestation uses 999, age uses 99, and income is really a categorical variable that uses 98 for “not available.”

We can set these values to NA as follows:

## bad idea, doesn't change babies, only copies
> attach(babies)
> gestation[gestation == 999] = NA
> age[age == 99] = NA
> inc[inc == 98] = NA
> pairs(babies[,c("gestation","wt","age","inc")])

But the graphic produced by pairs() won’t be correct, as we didn’t actually change the values in the data frame babies; rather we modified the local copies produced by attach().

A better way to make these changes is to find the indices that are not good for the variables gestation, age, and inc and use these for extraction as follows:

> rm(gestation); rm(age); rm(inc)       # clear out copies
> detach(babies); attach(babies)        # really clear out
> not.these = (gestation == 999) | (age == 99) | (inc == 98)
## A logical not and named extraction
> tmp = babies[!not.these, c("gestation","age","wt","inc")]
> pairs(tmp)
> detach(babies)

The pairs() function produces the scatterplot matrix (Figure 4.4) of the new data frame tmp. We had to remove the copies of the variables gestation, age, and inc that were created in the previous try at this. To be sure that we used the correct variables, we detached and reattached the data set. Trends we might want to investigate are the relationship between gestation period and birth weight and the relationship of income and age.
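The same extraction can be written without attaching anything, which avoids the stray copies altogether. The sketch below is one possible approach, not the one used in the text. It works on a small hypothetical data frame d whose values are made up for illustration; only the codes 999, 99, and 98 are taken from the description of babies above.

```r
## hypothetical stand-in for babies: made-up values, same missing-data codes
d = data.frame(gestation = c(284, 999, 279, 282, 286),
               wt        = c(120, 113, 128, 108, 136),
               age       = c(27, 33, 99, 28, 25),
               inc       = c(1, 4, 2, 98, 5))

## with() finds the variables inside d, so nothing is attached
not.these = with(d, (gestation == 999) | (age == 99) | (inc == 98))
tmp = d[!not.these, c("gestation", "age", "wt", "inc")]
tmp

## the codes can also be set to NA in the data frame itself
d[d[, "gestation"] == 999, "gestation"] = NA
d[d[, "age"] == 99, "age"] = NA
d[d[, "inc"] == 98, "inc"] = NA
d
```

With the real data, the analogous commands would use babies in place of d, followed by pairs(tmp). Note that the extraction is done before the codes are replaced, since a comparison with NA returns NA rather than TRUE or FALSE.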

Figure 4.4 Scatterplot matrix of four