Presenting data
3.7 Three-dimensional plots and contours
Three-dimensional plots and contour plots are used to represent relief, scatter plots and surfaces. The main functions and their corresponding packages are listed in the data frame graphics.3d.rda (available at the book website). We use 3D plots occasionally and illustrate them as the need arises.
3.8 Assignments
Exercise 3.1. The following data appeared in the World Almanac and Book of Facts, 1975 (pp. 315–318). They were also cited by McNeil (1977) and are available in R as the discoveries data set. They list the number of discoveries per year between 1860 and 1959.
Start = 1860 End = 1959
5 3 0 2 0 3 2 3 6 1 2 1 2 1 3 3 3 5 2 4
4 0 2 3 7 12 3 10 9 2 3 7 7 2 3 3 6 2 4 3
5 2 2 4 0 4 2 5 2 3 3 6 5 8 3 6 6 0 5 2
2 2 6 3 4 4 2 2 4 7 5 3 3 0 2 2 2 1 3 4
2 2 1 1 1 2 1 4 4 3 2 1 4 1 1 1 0 0 2 0
1. Load the data. Then, in how many years were there at least 0 but fewer than 2 discoveries? At least 2 but fewer than 4, at least 4 but fewer than 6, and so on up to 12?
2. Based on the results in (1), plot a histogram of discoveries.
3. Do the data remind you of some regular curve that you may be familiar with? What is that curve?
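One possible way to start on parts 1 and 2 is sketched below. It assumes that the data are the discoveries time series shipped with R's datasets package and that each bin is closed on the left, with the last bin also including 12. The names x, brks, counts and h are illustrative only.

```r
# Yearly discoveries, 1860-1959, as a plain numeric vector
x <- as.numeric(discoveries)

# Bin edges 0, 2, ..., 12; right = FALSE gives [0,2), [2,4), ...,
# and include.lowest = TRUE closes the last bin so that 12 is counted
brks   <- seq(0, 12, by = 2)
counts <- table(cut(x, breaks = brks, right = FALSE, include.lowest = TRUE))
print(counts)

# The same bins as a histogram object; draw it only in an interactive session
h <- hist(x, breaks = brks, right = FALSE, plot = FALSE)
if (interactive()) plot(h, main = "Discoveries per year", xlab = "discoveries")
```

The counts in the table and the heights of the histogram bars should agree, because both use the same break points and the same left-closed convention.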
Exercise 3.2.
1. Compare the age distribution of males to females in Austria (Figure 3.1). Speculate about the reasons for the difference in the survival of females and males.
2. Compare the age distribution of males to females in Armenia (Figure 3.2). Speculate about the reasons for the difference in the survival of females and males.
3. Compare the age distribution of males in Austria (Figure 3.1) and Armenia (Figure 3.2). Speculate about the differences in the survival of males in these two countries.
4. Compare the age distribution of females in Austria (Figure 3.1) and Armenia (Figure 3.2). Speculate about the differences in the survival of females in these two countries.
5. What information would you need to verify that your speculations are reasonable?
Exercise 3.3. Go to http://www.google.com. In the search box, enter the following string exactly as shown (including the quotes) “wind energy in X” where X stands for a state name. Spell the state names fully, including upper case letters. For example, for X = New York, enter (including the quotes) “wind energy in New York”. Once you enter the string, click on the Google Search button. Under the search button, you will see how many items were found in the search. Record the number of items found and the state name. Repeat the search with X set to each of the contiguous U.S. states. Using the data you thus gathered (state name vs. the number of search items that came up):
1. Plot a histogram of the data.
2. What is the most common number of items found per state? How many states belong to this number?
3. What is the least common number of items found per state?
4. Using a histogram, identify a region in the U.S. (e.g. Northeast, Northwest, etc.) where most of the found items show up.
5. Why this particular region compared to others?
Exercise 3.4. Sexual dimorphism is a phenomenon where males and females of a species differ with respect to some trait. Among species of spiders, sexual dimorphism
is widespread. Females are usually much larger than males (so much so, that they often eat the male after mating). Plot—by hand or with R—an imaginary graph that reflects the histogram of weights of individuals from various spider species. Explain the plot. If you choose R to generate the data, you can use rnorm() (look it up in Help).
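If you choose R, a minimal sketch along the following lines could serve as a starting point. All numbers here (sample sizes, means and standard deviations, in grams) are made up for illustration and are not taken from any data set; the names males, females and weight are likewise hypothetical.

```r
set.seed(1)  # reproducible imaginary data

# Hypothetical weights (g): light males and much heavier females
males   <- rnorm(200, mean = 1, sd = 0.2)
females <- rnorm(200, mean = 3, sd = 0.5)
weight  <- c(males, females)

# Histogram of the pooled weights; draw it only in an interactive session
h <- hist(weight, breaks = 30, plot = FALSE)
if (interactive()) plot(h, main = "Imaginary spider weights", xlab = "weight (g)")
```

Pooling two normal samples with well-separated means is one simple way to obtain the two-humped histogram that a strongly dimorphic trait would suggest.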
Exercise 3.5. Use the discoveries data shown in Exercise 3.1:
1. Introduce a new factor variable named period. The variable should have 20 year periods as levels. So the first level of period is “1860–1879,” the second is “1880–1899” and so on.
2. Compute the mean number of discoveries for each of these periods.
3. Construct a dot chart for the data.
4. Draw conclusions from the chart.
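The steps above might be approached as in the following sketch. It assumes the discoveries time series from R's datasets package, and it writes the level labels with a plain hyphen rather than an en dash; the names x, year, starts and means are illustrative.

```r
x    <- as.numeric(discoveries)
year <- as.numeric(time(discoveries))      # 1860, 1861, ..., 1959

# Left-closed 20-year bins: [1860,1880), [1880,1900), ..., [1940,1960)
starts <- seq(1860, 1940, by = 20)
period <- cut(year, breaks = c(starts, 1960), right = FALSE,
              labels = paste(starts, starts + 19, sep = "-"))

# Mean number of discoveries per period, as a named numeric vector
means <- sapply(split(x, period), mean)
print(means)

# Dot chart of the period means; draw it only in an interactive session
if (interactive()) dotchart(means, xlab = "mean discoveries per year")
```

Using split() with sapply() rather than tapply() returns a plain named vector, which dotchart() accepts directly.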
Exercise 3.6. For this exercise, you will need to use the function rnorm() (see Help).
You have a vector that contains data about tree height (m). The first 30 observations pertain to aspen, the next 25 to spruce and the last 34 to fir.
1. Use a single statement to create imagined data from a normal distribution with means and standard deviations set to aspen: 5, 2; spruce: 8, 3; fir: 10, 4.
2. Use a single statement to create an appropriate factor vector.
3. Use a single statement to create a data frame from the two vectors.
No need to report the data. Just the code.
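A sketch of one way to write the three statements follows. It reads the values for fir as mean 10 and standard deviation 4, which is an assumption about the intended numbers; the helper vector n and the names height, species and trees are illustrative, and set.seed(1) is added only to make the example reproducible.

```r
set.seed(1)

# Number of observations per species, in the order they appear in the vector
n <- c(aspen = 30, spruce = 25, fir = 34)

# rnorm() is vectorized over mean and sd, so one call covers all three species
height <- rnorm(sum(n), mean = rep(c(5, 8, 10), n), sd = rep(c(2, 3, 4), n))

# Matching factor: 30 x aspen, 25 x spruce, 34 x fir
species <- factor(rep(names(n), n), levels = names(n))

# Data frame built from the two vectors
trees <- data.frame(species, height)
```

The key idea is that rep() expands the per-species means and standard deviations to one value per observation, so no loop is needed.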
Exercise 3.7. Use a single statement to compute the mean height for aspen, spruce and fir from the data.frame created in Exercise 3.6.
Exercise 3.8. Continuing with Exercise 3.6:
1. Use a single statement to reorder the levels of the species column in the data.frame you created in Exercise 3.6 such that species is an ordered factor with the levels aspen > spruce > fir.
2. Use an appropriate printout to prove that the factor is ordered.
3. Use a single statement to compute the means of the species height. The printout should arrange the means according to the ordered levels.
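The following sketch illustrates one possible approach. It rebuilds a small stand-in for the data frame of Exercise 3.6 (the name trees and the simulated heights are assumptions made for this example), then turns species into an ordered factor. Because R lists ordered levels from lowest to highest, aspen > spruce > fir is expressed as the level order fir, spruce, aspen.

```r
set.seed(1)

# Stand-in for the data frame from Exercise 3.6 (hypothetical values)
trees <- data.frame(
  species = rep(c("aspen", "spruce", "fir"), c(30, 25, 34)),
  height  = rnorm(89, mean = rep(c(5, 8, 10), c(30, 25, 34)), sd = 2)
)

# Ordered factor with fir < spruce < aspen
trees$species <- factor(trees$species,
                        levels = c("fir", "spruce", "aspen"), ordered = TRUE)

# Evidence that the factor is ordered
print(is.ordered(trees$species))
print(levels(trees$species))

# Means per species, reported in the order of the levels
species.means <- tapply(trees$height, trees$species, mean)
print(species.means)
```

Printing the factor itself also shows the ordering, since R displays the levels of an ordered factor joined by < signs.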
Exercise 3.9. In this exercise, before every call to a function that generates random numbers, call the function set.seed(1) exactly as shown. This ensures that you get the same set of random numbers every time you answer the exercise. You will also need to use the functions runif() and round().
1. Create a matrix with 30 columns and 40 rows. Each element is a random number from a normal distribution with mean 10 and standard deviation 2. Show the code, not the data.
2. Create a submatrix with 6 rows and 6 columns. The rows and columns are chosen at random from the matrix. Show the code, not the data.
3. Print the submatrix with 3 decimal digits and without the row and column counters shown in the printout (i.e. without the dimension names; see no.dimnames() on page 32). Show the printout; it should look like this:
 9.220  9.224 10.122 10.912  9.671 10.445
13.557  8.984 10.508  4.006 11.976 11.321
 7.905  8.579 11.953 10.150 10.004 10.389
 8.688  9.950  7.087 12.098 11.141 10.305
11.890  9.409  6.299  5.620 11.083 13.288
10.198 11.002 12.363  8.662  4.222  9.074
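A rough sketch of the three steps is given below. The way rows and columns are drawn with runif() and round() is only one plausible reading of the exercise, so the values it prints need not match the printout above. Blanking the dimension names by hand stands in for the book's no.dimnames(), whose definition is not reproduced here.

```r
# 40 x 30 matrix of normal random numbers, mean 10 and sd 2
set.seed(1)
m <- matrix(rnorm(40 * 30, mean = 10, sd = 2), nrow = 40, ncol = 30)

# Random row and column indices (assumed scheme; duplicates are possible)
set.seed(1)
rows <- round(runif(6, min = 1, max = nrow(m)))
set.seed(1)
cols <- round(runif(6, min = 1, max = ncol(m)))
sub <- m[rows, cols]

# Blank dimension names suppress the row and column counters in the printout
dimnames(sub) <- list(rep("", nrow(sub)), rep("", ncol(sub)))
print(round(sub, 3))
```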
Exercise 3.10. In 2003, there were 35 students in my statistics class. To protect their identity, they are labeled S1, . . . , S35.
1. With a single statement, including calls to factor() and paste(), create a factor vector that contains the student labels.
2. Here are the results of the midterm exam
68 76 66 90 78 66 79 82 80 71 90 78 68 52 86 74 74 84 83 80 84 82 75 55 81 74 73 60 70 79 88 73 78 74 61
and final
67 76 87 65 74 76 80 73 90 73 78 82 71 66 89 56 82 75 83 78 91 65 87 90 75 55 78 70 81 77 80 77 83 72 68
Both results above are sorted by student label. Save the data as text files, named midterm.txt and final.txt. Import the data to R.
3. Create a data frame, named exams, with student labels and their grades on the midterm and final. Name the columns student, midterm and final. You may need to use dimnames() to name the columns.
4. Create a vector, named average, that holds the mean of each student's midterm and final grades.
5. Add this vector to the exams data frame. When done with this part of the exercise, the exams data frame should look like this (only the first 5 records are shown; your frame should have 35 records).
> exams[1 : 5,]
student midterm final average
1 S1 68 67 67.5
2 S2 76 76 76.0
3 S3 66 87 76.5
4 S4 90 65 77.5
5 S5 78 74 76.0
6. Create a data frame named class. Your data frame should look as follows (your data frame should include values for grades instead of NA):
> class
     exam grade
1 midterm    NA
2   final    NA
3   total    NA
7. Create a list, named class.03. The list has two components, the class and exams data frames you created, stored under the names class.mean and student.grades. The list should look as shown next (your list should include data instead of NA). Only the first 5 rows of the second component are shown.
> class.03
$class.mean
     exam grade
1 midterm    NA
2   final    NA
3   total    NA
$student.grades
student midterm final average
1 S1 68 67 67.5
2 S2 76 76 76.0
3 S3 66 87 76.5
4 S4 90 65 77.5
5 S5 78 74 76.0
8. Show two ways to access the first 5 records of the student.grades data frame in the class.03 list.
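The whole sequence might look roughly like the sketch below. To keep the example self-contained, the two text files are written to a temporary directory before being read back. The grade column of class is filled with column means, and "total" is taken to be the mean of average; both are assumptions about what the exercise intends.

```r
# Student labels S1, ..., S35 as a factor, with levels kept in numeric order
student <- factor(paste("S", 1:35, sep = ""),
                  levels = paste("S", 1:35, sep = ""))

# Save the grades as text files (in a temporary directory here), then import them
midterm.file <- file.path(tempdir(), "midterm.txt")
final.file   <- file.path(tempdir(), "final.txt")
writeLines(paste("68 76 66 90 78 66 79 82 80 71 90 78 68 52 86 74 74 84",
                 "83 80 84 82 75 55 81 74 73 60 70 79 88 73 78 74 61"),
           midterm.file)
writeLines(paste("67 76 87 65 74 76 80 73 90 73 78 82 71 66 89 56 82 75",
                 "83 78 91 65 87 90 75 55 78 70 81 77 80 77 83 72 68"),
           final.file)
midterm <- scan(midterm.file, quiet = TRUE)
final   <- scan(final.file, quiet = TRUE)

# Data frame of grades, plus the per-student average
exams   <- data.frame(student, midterm, final)
average <- (exams$midterm + exams$final) / 2
exams$average <- average

# Class-level summary (assumed meaning: means over all students)
class <- data.frame(exam  = c("midterm", "final", "total"),
                    grade = c(mean(midterm), mean(final), mean(average)))

# List with the component names used in the printout
class.03 <- list(class.mean = class, student.grades = exams)

# Two ways to reach the first 5 records of student.grades
first.a <- class.03$student.grades[1:5, ]
first.b <- class.03[["student.grades"]][1:5, ]
print(first.a)
```

A third route is positional indexing, class.03[[2]][1:5, ], although access by name is less fragile if components are later added or reordered.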
Exercise 3.11. Download the file elections-2000.csv from the book’s website and:
1. Create a data frame named Florida.
2. How many counties are present in the data file?
3. In how many counties did the majority vote for Gore? For Bush?
4. Suppose that all the votes for Buchanan were to go to Bush and all the votes for Nader were to go to Gore. Who wins the election? By how many votes?