Table 4.4 Different ways to access a data frame

mtcars           mpg cyl  disp  hp drat    wt  qsec vs
Mazda RX4       21.0   6 160.0 110 3.90 2.620 16.46  0
Mazda RX4 Wag   21.0   6 160.0 110 3.90 2.875 17.02  0
…
Fiat 128        32.4   4  78.7  66 4.08 2.200 19.47  1
Honda Civic     30.4   4  75.7  52 4.93 1.615 18.52  1
Toyota Corolla  33.9   4  71.1  65 4.22 1.835 19.90  1
…
Maserati Bora   15.0   8 301.0 335 3.54 3.570 14.60  0
Volvo 142E      21.4   4 121.0 109 4.11 2.780 18.60  1

To access the row “Honda Civic”

mtcars['Honda Civic',]    By row name
mtcars['Honda',]          Can shorten the name if unique match
mtcars[19,]               It is also the 19th row in the data set

To access the column “mpg”

mtcars[,'mpg']            By column name
mtcars[,1]                It is column 1
mtcars$mpg                List access by name
mtcars[['mpg']]           Alternate list access. Note, mtcars['mpg'] is not a vector but a data frame.

To access the value “30.4”

mtcars['Honda','mpg']     By name (with match)
mtcars[19,1]              By row and column number
mtcars$mpg[19]            mtcars$mpg is a vector; this is the 19th entry.

output indicates that this isn’t the case. In the data-frame case, this list is then coerced into a data frame. The vector notation specifies the desired variables.

Table 4.4 summarizes the various ways to access elements of a data frame.
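The entries of Table 4.4 can be tried directly, since mtcars is one of R's built-in data sets. The following is a minimal sketch; the variable names on the left-hand sides are chosen here for illustration only.

```r
# mtcars ships with R, so no packages need to be loaded.

# Three ways to get the row "Honda Civic"
row_by_name  = mtcars['Honda Civic', ]
row_by_match = mtcars['Honda', ]       # partial match on the row names
row_by_index = mtcars[19, ]

# Four ways to get the column "mpg" as a vector
col_by_name  = mtcars[, 'mpg']
col_by_index = mtcars[, 1]
col_dollar   = mtcars$mpg
col_double   = mtcars[['mpg']]

# A single index inside single brackets keeps the data frame structure
col_as_df = mtcars['mpg']

# The single value for the Honda Civic
val = mtcars['Honda', 'mpg']
```

Calling is.data.frame(col_as_df) and is.numeric(col_double) is a quick way to see the difference between the single-bracket and double-bracket forms.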

4.2.3 Setting values in a data frame or list

We’ve seen that we can’t change a data frame’s values by attaching it and then assigning to the variables, as this modifies the copy. Rather, we must assign values to the data frame directly. Setting values in a data frame or list is similar to setting them in a data vector. The basic expressions have forms like

df[rows, cols]=values, lst$name=value, or lst$name[i]=value.

In the [,] notation, if values does not have the same size and type as the values that we are replacing, recycling will be done; if the length of values is too big, however, an error message is thrown. New rows and columns can be created by assigning to the desired indices. Keep in mind that the resulting data frame cannot have any empty columns (holes).

> df = data.frame(a=1:2,b=3:4) # with names
> df[1,1]=5 # first row, first column
> df[,2]=9:10 # all rows, second column
> df[1:2,3:4] = cbind(11:12,13:14) # rows and columns at once
> df # new columns added
  a  b  c  d
1 5  9 11 13
2 2 10 12 14
> df[1:2, 10:11]=cbind(11:12,13:14) # would create a hole
Error in "[<-.data.frame"('*tmp* …
new columns would leave holes after existing columns
> df[,2:3]=0 # recycling occurs
> df
  a b c  d
1 5 0 0 13
2 2 0 0 14
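The recycling rule and the error for an oversized replacement can be seen side by side in a small sketch. The data frame and the names df4 and res below are made up for illustration; tryCatch() is used only so that the failing assignment does not stop the script.

```r
df4 = data.frame(a = 1:4, b = 5:8)

# A shorter value whose length divides the number of rows is recycled
df4[, 2] = 1:2

# A value that is too long is rejected; tryCatch() captures the error
res = tryCatch({
  df4[, 2] = 1:5
  "ok"
}, error = function(e) "error")
```

After this runs, df4$b holds 1 2 1 2 and res holds "error", since the failed assignment leaves the data frame unchanged.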

Using $ with a list refers to a data vector that can be set accordingly, either all at once or position by position, as with:

> lst = list(a=1:2,b=1:4,c=c("A","B","C"))
> lst$a = 1:5 # replace the data vector
> lst$b[3] = 16 # replace single element
> lst$c[4]= "D" # appends to the vector
> lst
$a
[1] 1 2 3 4 5

$b
[1]  1  2 16  4

$c
[1] "A" "B" "C" "D"

The c() function can be used to combine lists using the top-level components. This can be used with data frames with the same number of rows, but the result is a list, not a data frame. It can be turned into a data frame again by using data.frame().
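As a brief illustration of combining with c(), the sketch below uses small made-up lists and data frames (the names x, y, df1, df2 and so on are not from the text).

```r
# Combining two lists joins their top-level components
x = list(a = 1:2, b = c("A", "B"))
y = list(c = 3:5)
xy = c(x, y)            # a list with components a, b, and c

# Combining two data frames with the same number of rows
df1 = data.frame(a = 1:2)
df2 = data.frame(b = 3:4)
both = c(df1, df2)      # a plain list, no longer a data frame

# Converting the list back into a data frame
back = data.frame(both)
```

Here is.data.frame(both) is FALSE while is.data.frame(back) is TRUE.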

4.2.4 Applying functions to a data frame or list

In Chapter 3 we noted that apply() could be used to apply a function to the rows or columns of a matrix. The same can be done for a data frame, as it is matrix-like. Although many functions in R adapt themselves to do what we would want, there are times when ambiguities force us to work a little harder by using this technique. For example, if a data frame contains just numbers, the function mean() will find the mean of each variable, whereas median() will find the median of the entire data set. We illustrate on the ewr (UsingR) data set:

> df = ewr[ , 3:10] # make a data frame of the times
> mean(df) # mean is as desired
   AA    CO    DL    HP    NW    TW    UA    US
17.83 20.02 16.63 19.60 15.80 16.28 17.69 15.49
> median(df) # median is not as desired
Error in median(df) : need numeric data
> apply(df,2,median) # median of columns
   AA    CO    DL    HP    NW    TW    UA    US
16.05 18.15 15.50 18.95 14.55 15.65 16.45 14.45

We can apply functions to lists as well as matrices with the lapply() function or its user-friendly version, sapply(). Either will apply a function to each top-level component of a list or the entries of a vector. The lapply() function will return a list, whereas sapply() will simplify the results into a vector or matrix when appropriate.

For example, since a data frame is also a list, the median of each variable above could have been found with

> sapply(df,median)
   AA    CO    DL    HP    NW    TW    UA    US
16.05 18.15 15.50 18.95 14.55 15.65 16.45 14.45

(Compare this to the output of lapply(df, median).)
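The three approaches can be compared on a small data frame without loading UsingR. The data frame times below is a made-up stand-in for the ewr times, not the real data. Note also that in recent versions of R, mean() applied to a whole data frame no longer returns the per-column means, so this sketch uses colMeans() for that step.

```r
# Hypothetical stand-in for the ewr times (two carriers, three months)
times = data.frame(AA = c(10, 12, 20), CO = c(15, 18, 30))

by_apply  = apply(times, 2, median)   # treats the data frame as a matrix
by_sapply = sapply(times, median)     # treats it as a list, simplifies the result
by_lapply = lapply(times, median)     # same values, returned as a list

# Column means that do not depend on how mean() treats data frames
col_means = colMeans(times)
```

The first two results are named numeric vectors with the same values; by_lapply holds those values as a list with one component per column.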

4.2.5 Problems

4.7 Use the data set mtcars.

1. Sort the data set by weight, heaviest first.

2. Which car gets the best mileage (largest mpg)? Which gets the worst?

3. The cars in rows c(1:3, 8:14, 18:21, 26:28, 30:32) were imported into the United States. Compare the variable mpg for imported and domestic cars using a boxplot. Is there a difference?

4. Make a scatterplot of weight, wt, versus miles per gallon, mpg. Label the points according to the number of cylinders, cyl. Describe any trends.

4.8 The data set cfb (UsingR) contains consumer finance data for 1,000 consumers. Create a data frame consisting of just those consumers with positive INCOME and negative NETWORTH. What is its size?

4.9 The data set hall.fame (UsingR) contains numerous baseball statistics, including Hall of Fame status, for 1,034 players.

1. Make a histogram of the number of home runs hit (HR).

2. Extract a data frame containing at bats (AB), hits (hits), home runs (HR), and runs batted in (RBI) for all players who are in the Hall of Fame. (The latter can be found with Hall.Fame.Membership!="not a member".) Save the data into the data frame hf.

3. For the new data frame, hf, make four boxplots using the command:

boxplot(lapply(hf,scale))

(The scale() function allows all four variables to be compared easily.) Which of the four variables has the most skew?

4.10 The data set dvdsales (UsingR) can be viewed graphically with the command

> barplot(t(dvdsales), beside=TRUE)

1. Remake the barplots so that the years increase from left to right.

2. Which R commands will find the year with the largest sales?

3. Which R commands will find the month with the largest sales?

4.11 Use the data set ewr (UsingR). We extract just the values for the times with df=ewr[,3:10]. The mean of each column is found by using mean(df). How would you find the mean of each row? Why might this be interesting?

4.12 The data set u2 (UsingR) contains the time in seconds for albums released by the band U2 from 1980 to 1997. The data is stored in a list.

1. Make a boxplot of the song lengths by album. Which album has the most spread? Are the means all similar?

2. Use sapply() to find the mean song time for each album. Which album has the shortest mean? Repeat with the median. Are the results similar?

3. What are the three longest songs? The unlist() function will turn a list into a vector. First unlist the song lengths, then sort.

Could you use a data frame to store this data?

4.13 The data set normtemp (UsingR) contains measurements for 130 healthy, randomly selected individuals. The variable temperature contains body temperature, and gender contains the gender, coded 1 for male and 2 for female. Make layered density plots of temperature, splitting the data by gender. Do the two distributions look to be the same?

4.14 What do you think this notation for data frames returns: df[,]?

4.3 Using model formula with multivariate data

In Example 4.5 we broke up a variable into two pieces based on the value of a second variable. This is a common task and works well. When the value of the second variable has many levels, it is more efficient to use the model-formula notation. We’ve already seen other advantages to this approach, such as being able to specify a data frame to find the variables using data= and to put conditions on the rows considered using subset=.

4.3.1 Boxplots from a model formula

In Example 4.4 the inc variable is discrete and not continuous, as it has been turned into a categorical factor by binning, using a bin size of $2,500. Rather than plot gestation versus inc with a scatterplot, as is done in a panel of Figure 4.4, a boxplot would be more appropriate. The boxplot allows us to compare centers and spreads much more easily. As there are nine levels to the income variable, we wouldn’t want to specify the data with commands like gestation[inc == 1]. Instead, we can use the model formula gestation ~ inc with boxplot(). We read the formula as gestation is modeled by inc, which is interpreted by boxplot() by splitting the variable gestation into pieces corresponding to the values of inc and creating boxplots for each.

We use this approach three times. The last appears in Figure 4.6, where the argument varwidth=TRUE is specified to show boxes with width depending on the relative size of the sample.

> boxplot(gestation ~ inc, data=babies) # not yet
> boxplot(gestation ~ inc, subset=gestation != 999 & inc != 98,
+ data=babies) # better
> boxplot(gestation ~ inc, subset=gestation != 999 & inc != 98,
+ data=babies,varwidth=TRUE, # variable width to see sizes
+ xlab="income level", ylab="gestation (days)")

Figure 4.6 Boxplot of gestation times
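To see what the formula interface does with the data, the same idea can be sketched on the built-in mtcars data set, which is used here only as a stand-in for babies. Passing plot=FALSE makes boxplot() return the summary it would have drawn, so the grouping can be inspected without a graphics device.

```r
# Formula interface: mpg is split into pieces by the values of cyl
b = boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# The same pieces, built by hand with split()
pieces = split(mtcars$mpg, mtcars$cyl)
piece_sizes = sapply(pieces, length)

# subset= restricts the rows before the split is made
b2 = boxplot(mpg ~ cyl, data = mtcars, subset = cyl != 6, plot = FALSE)
```

The component b$names holds the group labels and b$n the group sizes, which match names(pieces) and piece_sizes. In b2 the 6-cylinder group is absent. Dropping plot=FALSE draws the boxplots themselves.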
