Bar plots - Presenting data - Statistics and Data With R

Presenting data

3.2 Bar plots

Bar plots are the familiar rectangles where the height of the rectangle represents some quantity of interest. Each bar is labeled by the name of that quantity. Bar plots are particularly useful when you have two-column data, one categorical and the other numerical (usually counts). For example, you may have data where the first column holds species names and the second the number of individuals.
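As a minimal sketch of such two-column data, the following builds a small data frame and plots it. The species names and counts are invented for illustration only, and the data frame name census is hypothetical.

```r
# Hypothetical two-column data: a categorical column and a column of counts
census <- data.frame(species = c('sparrow', 'finch', 'warbler'),
                     count   = c(12, 7, 3),
                     stringsAsFactors = FALSE)

# One bar per species; the height of each bar is its count
mids <- barplot(census$count, names.arg = census$species,
                ylab = 'individuals', col = 'gray90')
```

barplot() returns the midpoints of the bars on the x-axis, stored here in mids, which is handy if you later want to add text or symbols above the bars.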

Example 3.3. The WHO data were introduced in Example 2.7. Figures 3.1 and 3.2 show data about the distribution of the population over age in two countries with very different cultures, economies and histories. The following script produces Figures 3.1 and 3.2.


1 load('who.pop.2000.rda') # population data
2 load('who.ccodes.rda') # country codes
3 load('who.pop.var.names.rda') # variable names in who.pop.2000

5 cn <- 'Austria' # country name
6 #cn <- 'Armenia' # uncomment for Armenia bar plot
7 cc <- who.ccodes$code[who.ccodes$name == cn] # country code

9 par(mfrow = c(2, 1))
10 bl <- as.character(pop.var.names$descr[2 : 26]) # bar labels
11 gender <- 1 # males
12 rows <- who.pop.2000$code == cc & # row to be plotted
13     who.pop.2000$sex == gender
14 columns <- 5 : 29 # columns to be plotted

16 barplot(t(who.pop.2000[rows, columns])[, 1]/1000,
17     names.arg = bl, main = paste(cn, ', males'),
18     las = 2, col = 'gray90')
19 gender <- 2 # females
20 rows <- who.pop.2000$code == cc &
21     who.pop.2000$sex == gender
22 barplot(t(who.pop.2000[rows, columns])[, 1]/1000,
23     names.arg = bl, main = paste(cn, ', females'),
24     las = 2, col = 'gray90')

The script illustrates several important features of R, in particular linking data from different data frames and using barplot(). It merits a detailed examination.

To produce the annotation in Figures 3.1 and 3.2, we need three data frames. who.ccodes matches each country code to a country name. pop.var.names matches variable names in who.pop.2000 to meaningful names. For example, in who.pop.2000, there is a variable named Pop10. This variable holds the population of age group 20 to 24. Using pop.var.names, we can display the variable description (the string 20-24 that corresponds to the variable Pop10). who.pop.2000 holds the data: population by age and country. Figure 2.2 shows typical observations (rows) for each frame and the links that we need to display the bar plot properly.

In lines 1 to 3 we load the data. In line 5 we assign Austria to cn. If you wish to produce the bar plot for Armenia, comment out line 5 and uncomment line 6. In line 7 we extract the country code from who.ccodes. This is how it is done: The statement inside the square brackets,

who.ccodes$name == cn

creates an unnamed logical vector. The length of this vector is the length of who.ccodes$code. All elements of this vector are FALSE except those for which the country name is Austria; these are TRUE. Because this logical vector appears in the square brackets, the indices of the TRUE elements are used to extract the desired values from who.ccodes$code. These extracted values are stored in the vector cc. Thus, using the country name we extract the country code. Country names and country codes are unique. Therefore, cc contains one element only.

Figure 3.1 Austria’s age distribution by gender.
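The lookup by logical vector can be tried on a toy data frame. This is only a sketch: the data frame ccodes stands in for who.ccodes, and its codes are invented rather than real WHO country codes.

```r
# Toy stand-in for who.ccodes; the codes are invented for illustration
ccodes <- data.frame(code = c(101, 102, 103),
                     name = c('Armenia', 'Austria', 'Belgium'),
                     stringsAsFactors = FALSE)

cn <- 'Austria'
is.cn <- ccodes$name == cn  # logical vector, one element per row
is.cn                       # FALSE  TRUE FALSE
cc <- ccodes$code[is.cn]    # keep only the codes where is.cn is TRUE
cc                          # 102
```

Because each name occurs once, exactly one element of is.cn is TRUE and cc has length one.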

In line 9 we divide the graphics window into two rows and one column, so that we can plot both sexes on the same graphics window. Preparing a graphics window (also called a device) to accept more than one plot is common. It is done by specifying the number of rows and columns with the argument mfrow, which we set with a call to par(). In our case, we specify 2 rows and 1 column with c(2, 1). In line 10, we assign labels to the bars we are going to produce. The labels we need are for variables 2 to 26. The labels reside in the descr column of the pop.var.names data frame. In line 11 we set the gender to males. In lines 12 to 14 we prepare the logical vector and the column indices that will be used to extract the necessary row and columns from who.pop.2000. We need to extract the row whose code value corresponds to the country code of Austria. We have this value in cc. The row we need to extract is for males, so the row we choose must have the value 1 (stored in gender) in the sex column of who.pop.2000. The condition for extraction is stored in rows. Columns 5 to 29 in who.pop.2000 contain the age group populations.

Figure 3.2 Armenia’s age distribution by gender.
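The row and column selection can be sketched on a small data frame. The frame pop below is a hypothetical stand-in for who.pop.2000 with only two age columns, and all of its values are invented.

```r
# Toy stand-in for who.pop.2000; all values are invented
pop <- data.frame(code = c(101, 101, 102, 102),
                  sex  = c(1, 2, 1, 2),
                  Pop1 = c(50, 48, 40, 39),
                  Pop2 = c(60, 58, 45, 44))

cc <- 102
gender <- 1
rows <- pop$code == cc & pop$sex == gender  # TRUE only where both conditions hold
columns <- 3 : 4                            # positions of the population columns

sum(rows)                        # how many rows satisfy the condition
selected <- pop[rows, columns]   # a data frame with the matching row only
selected
```

The operator & works element by element, so rows has one logical value per row of the data frame.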

We are now ready to call barplot(). In line 16, we extract the row and the needed columns from the data frame. The result is a data frame with a single row, so before plotting we transpose it with t() and take the first column. This gives barplot() a vector with one population value per age group. Note the division by 1 000, so that the bar heights are in units of 1 000. In line 17 we set the labels for the bars with the named argument names.arg. We also create the main title for the bar plot with paste(). In line 18 we set las to 2. This plots the tick labels perpendicular to the axes. The named argument col is set to gray90, a light shade of gray. To find out color names in R, type colors(). Lines 19 to 24 repeat the bar plot, this time for females.
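To see what the transpose step does, here is a small sketch. The one-row data frame one.row is a hypothetical stand-in for who.pop.2000[rows, columns]; its values, the bar labels and the country name Toyland are all invented.

```r
# Hypothetical one-row data frame, as produced by selecting a single row
one.row <- data.frame(Pop1 = 40000, Pop2 = 45000, Pop3 = 52000)

# t() turns the row into a one-column matrix; [, 1] extracts that column
# as a named numeric vector, which is then scaled by 1000
heights <- t(one.row)[, 1] / 1000
heights

barplot(heights, names.arg = c('0', '1-4', '5-9'),
        main = paste('Toyland', ', males'), las = 2, col = 'gray90')

# paste() separates its arguments with a space by default
paste('Toyland', ', males')   # "Toyland , males"
paste0('Toyland', ', males')  # "Toyland, males"
```

If the space before the comma in the title is unwanted, paste0() or paste() with sep = '' avoids it.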

In Example 3.3, the x-axis shows the age categories into which the populations are divided. For example, in both countries, ages 10–14 and 15–19 are the most prevalent in the population. The example reveals interesting differences between and within countries with respect to gender. Think about answers to these questions:

• Why is there a dip in both male and female populations at the ages of 25–40 in Armenia compared to Austria?

• Why are there more older females than older males in Austria?

• Why is there a big jump in the age group from age 4 to ages 5–10 in both countries?

3.3 Histograms

Histograms are close relatives of bar plots. The main difference is that in histograms we are interested in the distribution of data. In other words, we wish to know if there is regularity in the number of observations that fall within a category. This means that how the data are binned takes on an additional importance.

Example 3.4. One of the most important activities that field ecologists pursue is estimating population densities by recording distances to observed organisms (see Buckland et al., 2001). It turns out that the way distance data are binned affects the way the data are adjusted and later used to infer population densities. When it comes to endangered and rare species, the decision on how to bin the data can influence decision making—about conservation actions, court rulings, etc. A circular plot is used to census birds. The observer sits at the center of an imagined disc with radius r and records distance and species for spotted individuals. In a study of bird density in the Sierra Nevada, such data were recorded for the Nashville warbler. Here are the 84 observations of distances, recorded from 20 such plots (personal data), each with a radius of 50 m:

15 16 10 8 4 2 35 7 5 14 14 0 35 31 0 10 36 16 5 3 22 7 55 24 42 29 2 4 14 29 17 1 3 17 0 10 45 10 9 22 11 16 10 22 48 18 41 4 43 13 7 7 8 9 18 2 5 6 48 28 9 0 54 14 21 23 24 35 14 4 10 18 14 21 8 14 10 6 11 22 1 18 30 39

Figure 3.3 summarizes the data for different numbers of binning categories. breaks = 11 is the default chosen by R. For breaks = 4, the data clearly indicate a regular (monotonic) decay in detectability of Nashville warblers as distance increases. This is not so for the other binned histograms.
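The effect of binning can also be inspected numerically, without drawing anything. The sketch below types in the 84 distances listed above and calls hist() with plot = FALSE. Note that when breaks is a single number, hist() treats it as a suggestion and chooses round break points, so the number of bins actually used may differ from the number requested.

```r
distance <- c(15, 16, 10, 8, 4, 2, 35, 7, 5, 14, 14, 0, 35, 31, 0, 10, 36,
              16, 5, 3, 22, 7, 55, 24, 42, 29, 2, 4, 14, 29, 17, 1, 3, 17,
              0, 10, 45, 10, 9, 22, 11, 16, 10, 22, 48, 18, 41, 4, 43, 13,
              7, 7, 8, 9, 18, 2, 5, 6, 48, 28, 9, 0, 54, 14, 21, 23, 24, 35,
              14, 4, 10, 18, 14, 21, 8, 14, 10, 6, 11, 22, 1, 18, 30, 39)

# Compute the binning only; plot = FALSE suppresses the drawing
h.default <- hist(distance, plot = FALSE)
h.few <- hist(distance, breaks = 4, plot = FALSE)

h.default$breaks  # break points chosen by default
h.default$counts  # observations per bin
h.few$breaks      # break points when about 4 bins are requested
h.few$counts
```

Comparing the counts vectors shows directly whether the decay with distance is monotonic for a given binning.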

The following script produces Figure 3.3.

1 load('distance.rda')
2 par(mfrow = c(2, 2))
3 hist(distance, xlab = '', main = 'breaks = 11',
4     ylab = 'frequency', col = 'gray90')
5 hist(distance, xlab = '', main = 'breaks = 20',
6     ylab = '' , breaks = 20, col = 'gray90')
7 hist(distance, xlab = 'distance (m)', main = 'breaks = 8',
8     ylab = 'frequency', breaks = 8, col = 'gray90')
9 hist(distance, xlab = 'distance (m)', main = 'breaks = 4',
10     ylab = '', breaks = 4, col = 'gray90')

We use load() to load the R data (a vector) distance in line 1. In line 2 we instruct the graphics device to accept four figures in a 2 by 2 matrix with a call to par() and with the named argument mfrow set to c(2, 2). With mfrow, the matrix is filled row by row. If we draw more than four plots, the fifth starts a new page in the graphics window.

In lines 3 and 4 we call hist() to plot the data with the default number of breaks (which happens to be 11), with our own y-axis label (ylab) and with color (col) set to gray90. In lines 5 and 6 we plot the same data, but now we ask to break them into 20 categories of distances. In lines 7–10 we do the same for different numbers of breaks.
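The order in which the panels are filled depends on which par() argument is used: mfrow fills the matrix row by row, while mfcol fills it column by column. The following sketch draws two placeholder plots under each setting and queries par('mfg') to see which panel the second plot landed in. The variable names by.row and by.col are ours.

```r
old <- par(mfrow = c(2, 2))    # 2 by 2 matrix, filled row by row
plot(1, main = 'first')
plot(2, main = 'second')
by.row <- par('mfg')[1 : 2]    # row and column of the panel just drawn

par(mfcol = c(2, 2))           # 2 by 2 matrix, filled column by column
plot(1, main = 'first')
plot(2, main = 'second')
by.col <- par('mfg')[1 : 2]

par(old)                       # restore the previous layout
by.row
by.col
```

Saving the value returned by par() and restoring it afterwards keeps later plots from inheriting the multi-panel layout.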

Figure 3.3 Histograms of distances to 84 observed Nashville warblers in twenty 50 m circular plots. The histograms are shown for different numbers of binning categories (breaks) of the data.

Figure 3.3 is revealing. You may arrive at different conclusions about the distribution of the data based on different numbers of breaks. This provides an opportunity to question conclusions from data. You should strive to have some theoretical (mechanistic) idea about what the distribution of the data should look like. The fact that there are some “holes” in observations when breaks = 20 indicates that perhaps there are too many breaks. Histograms are useful in exploring differences among treatments in experiments. Here is an example.

Example 3.5. The data, included with R’s distribution, are about plant growth (Dobson, 1983). The data set compares yields, as measured by dry weight of plants, from a control and two treatments. There are 30 observations on 2 variables: weight (g) and group, the treatment, with three levels: ctrl, trt1 and trt2. From Figure 3.4 it seems that the most frequent weight under the control experiment was between 5 and 5.5 g. In treatment 1, it was between 4 and 5 and in treatment 2 between 5.25 and 5.75. Note the insistence on consistent scales among the histograms of the different treatments.

The following code was used in this example to produce Figure 3.4.

1 data(PlantGrowth) ; attach(PlantGrowth)
2 par(mfrow = c(1, 3))
3 xl <- c(3, 6.5) ; yl <- c(0, 4)
4 a <- hist(weight[group == 'ctrl'], xlim = xl, ylim = yl,
5     xlab = '', main = 'control',
6     ylab = 'frequency', col = 'gray90')
7 b <- hist(weight[group == 'trt1'], xlim = xl, ylim = yl,
8     xlab = 'weight', ylab = '', main = 'treatment 1',
9     col = 'gray90')
10 c <- hist(weight[group == 'trt2'], xlim = xl, ylim = yl,
11     xlab = '', ylab = '', main = 'treatment 2',
12     col = 'gray90')

Figure 3.4 Control and two treatments in a plant growth experiment. Weight refers to dry weight.


The PlantGrowth data come with R. To load them, we call data() in line 1. To avoid extra typing, we attach() the data frame (also in line 1). In line 2 we tell the graphics device to accept one row of 3 plots. Because we wish all the plots to scale identically for all figures, we set xl and yl in line 3 and then in line 4 we specify the x- and y-axis limits with the xlim and ylim arguments. We do the same for the other 2 histograms.

In line 4, we choose the subset of the weight data for which group == 'ctrl'. We do it similarly for the other two histograms in lines 7 and 10. We also set the x label to xlab = 'weight' in line 8. The y label (ylab) is set to frequency in line 6. Because we do not wish to clutter the graphs, we set ylab = '' for the other two histograms in lines 8 and 11. We distinguish between the histograms by specifying different main titles to each in lines 5, 8 and 11.
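The same three histograms can be produced without attach(). The sketch below is an alternative, not the method used in the script above: it uses split() to divide weight by group and then loops over the groups with a shared pair of axis limits. The names by.group and hists are ours.

```r
data(PlantGrowth)

# One numeric vector of weights per treatment level
by.group <- split(PlantGrowth$weight, PlantGrowth$group)

xl <- c(3, 6.5) ; yl <- c(0, 4)   # common axis limits for all panels
old <- par(mfrow = c(1, 3))

# Draw one histogram per group and keep the returned histogram objects
hists <- lapply(names(by.group), function(g)
    hist(by.group[[g]], xlim = xl, ylim = yl,
         xlab = 'weight', main = g, col = 'gray90'))

par(old)
```

Because every call receives the same xlim and ylim, the panels remain directly comparable, which is the point of setting xl and yl once.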

Note the assignment of the histograms to a, b and c. These create lists that store data about the histograms. This allows us to examine the breakpoints (breaks) and frequencies that hist() uses. We often use the data stored in the histogram list for further analysis. Let us see what a, for example, contains:

> a
$breaks
[1] 4.0 4.5 5.0 5.5 6.0 6.5

$counts
[1] 2 2 4 1 1

$intensities
[1] 0.4 0.4 0.8 0.2 0.2

$density
[1] 0.4 0.4 0.8 0.2 0.2

$mids
[1] 4.25 4.75 5.25 5.75 6.25

$xname
[1] "weight[group == \"ctrl\"]"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

a stores vectors of the breaks, their counts and their density. intensities give the same information as density. The mid (mids) values of the binned data are listed as well. If you do not specify xlab, hist() will label x with xname. In this case the label will be weight[group == "ctrl"]. The extra backslashes are called escape characters. They ensure that the quotes are treated as literal characters and not as string delimiters. Another piece of information is whether the histogram is equidistant or not. Finally, we see that the attribute (attr()) of a is a class and the class name is histogram. You can use this information to later build your own graphs or tables.
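As one possible use of the stored information, the sketch below turns the histogram list for the control group into a frequency table. The data frame name bins and its column names are ours; plot = FALSE is used so that only the list is computed.

```r
data(PlantGrowth)
ctrl <- PlantGrowth$weight[PlantGrowth$group == 'ctrl']
a <- hist(ctrl, plot = FALSE)   # compute the histogram list without drawing

# Each bin runs from one break point to the next
bins <- data.frame(from  = head(a$breaks, -1),
                   to    = a$breaks[-1],
                   mid   = a$mids,
                   count = a$counts)
bins

# The densities integrate to 1 over the bins
sum(a$density * diff(a$breaks))
```

The last line is a quick check that density, unlike counts, is scaled so that the total area of the bars is 1.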
