
2.13 Can you copyedit this paragraph from the August 16, 2003 New York Times? The median sales price, which increased to $575,000, almost 12 percent more than the median for the previous quarter and almost 13 percent more than the median for the period a year ago, was at its highest level since the first market overview report was issued in 1989. (The median price is midway between the highest and lowest prices.)

2.14 In real estate articles the median is often used to describe the center, as opposed to the mean. To see why, consider this example from the August 16, 2003 New York Times on apartment prices:

The average and the median sales prices of cooperative apartments were at record highs, with the average up almost 9 percent to $775,052 from the first quarter this year, and the median price at $479,000, also an increase of almost 9 percent.

Explain how using the median might affect the reader’s sense of the center.

2.15 The data set pi2000 (UsingR) contains the first 2,000 digits of π. What is the percentage of digits that are 3 or less? What percentage of the digits are 5 or more?

2.16 The data set rivers contains the lengths (in miles) of 141 major rivers in North America.

1. What proportion are less than 500 miles long?

2. What proportion are less than the mean length?

3. What is the 0.75 quantile?
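The kind of computation these questions call for can be sketched on a small made-up vector (the values below are hypothetical and are not the rivers data). The sketch relies on the fact that the mean of a logical vector is the proportion of TRUE values, and on the base R quantile() function.

```r
# Hypothetical lengths, made up for illustration only (not the rivers data)
x <- c(120, 250, 310, 480, 520, 760, 900, 1450)

# A comparison yields TRUE/FALSE values; their mean is the proportion of TRUEs
prop_below_500  <- mean(x < 500)
prop_below_mean <- mean(x < mean(x))

# The 0.75 quantile (R's default quantile definition, type 7)
q75 <- quantile(x, 0.75)

prop_below_500
prop_below_mean
q75
```

Replacing x with the data vector of interest gives the corresponding answers for that data set.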

2.17 The time variable in the nym.2002 (UsingR) data set contains the time to finish the 2002 New York City marathon for a random sample of the finishers.

1. What percent ran the race in under 3 hours?

2. What is the time cutoff for the top 10%? The top 25%?

3. What time cuts off the bottom 10%?

Do you expect this data set to be symmetrically distributed?

2.18 Compare values of the mean, median, and 25% trimmed mean on the built-in rivers data set. Is there a big difference among the three?

2.19 The built-in data set islands contains the size of the world’s land masses that exceed 10,000 square miles. Make a stem-and-leaf plot, then compare the mean, median, and 25% trimmed mean. Are they similar?

2.20 The data set OBP (UsingR) contains the on-base percentages for the 2002 major league baseball season. The value labeled bondsba01 contains this value for Barry Bonds. What is his z-score?

2.21 For the rivers data set, use the scale() function to find the z-scores. Verify that the z-scores have sample mean 0 and sample standard deviation 1.
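As a rough illustration of what scale() does, the sketch below compares it with the z-score definition on a hypothetical vector (the values are made up and are not the rivers data). Because of floating-point rounding, the mean of the z-scores is typically a tiny number rather than exactly 0.

```r
# Hypothetical data vector, made up for illustration only
x <- c(2, 4, 4, 5, 7, 9, 11)

# z-scores computed directly from the definition
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.vector() drops the matrix structure
z_scale <- as.vector(scale(x))

all.equal(z_manual, z_scale)   # do the two computations agree?
mean(z_scale)                  # zero, up to floating-point rounding
sd(z_scale)                    # one, up to floating-point rounding
```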

2.22 The median absolute deviation is defined as

mad(x) = 1.4826 · median(|x_i − median(x)|). (2.5)

This is a resistant measure of spread and is implemented in the mad() function. Explain in words what it measures. Compare the values of the sample standard deviation, IQR, and median absolute deviation for the exec.pay (UsingR) data set.
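A minimal sketch, on a hypothetical vector with one large outlier, of how the definition lines up with the built-in mad() function (whose default scaling constant is 1.4826). The data values are made up for illustration.

```r
# Hypothetical data with a single large outlier
x <- c(1, 2, 3, 4, 100)

# The definition, written out step by step
mad_manual <- 1.4826 * median(abs(x - median(x)))

# The built-in implementation
mad_builtin <- mad(x)

all.equal(mad_manual, mad_builtin)

# Side-by-side comparison of three measures of spread
c(sd = sd(x), IQR = IQR(x), mad = mad(x))
```

On data like this, the standard deviation is pulled up by the outlier far more than the IQR or the median absolute deviation, which is what "resistant" refers to.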

2.23 The data set npdb (UsingR) contains malpractice-award information. The variable amount is the size of malpractice awards in dollars. Find the mean and median award amount. What percentile is the mean? Can you explain why this might be the case?

2.24 The data set cabinet (UsingR) contains information on the amount each member of President George W. Bush’s cabinet saved due to the passing of a tax bill in 2003. This information is stored in the variable est.tax.savings. Compare the median and the mean. Explain the difference.

2.25 We may prefer the standard deviation to measure spread over the variance as the units are the same as the mean. Some disciplines, such as ecology, prefer to have a unitless measurement of spread. The coefficient of variation is defined as the standard deviation divided by the mean.

One advantage is that the coefficient of variation matches our intuition of spread. For example, the numbers 1, 2, 3, 4 and 1001, 1002, 1003, 1004 have the same standard deviation but very different coefficients of variation. Somehow, we mentally think of the latter set of numbers as closer together.
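The two sets of numbers in this example can be checked directly. The helper cv() below is a hypothetical name introduced for this sketch; it is not a base R function.

```r
# Hypothetical helper: coefficient of variation = standard deviation / mean
cv <- function(x) sd(x) / mean(x)

a <- 1:4
b <- 1001:1004

# Shifting every value by 1000 leaves the standard deviation unchanged
sd(a)
sd(b)

# The coefficient of variation, however, shrinks as the mean grows
cv(a)
cv(b)
```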

For the rivers and pi2000 (UsingR) data sets, find the coefficient of variation.

2.26 A lag plot of a data vector plots successive values of the data against each other. By using a lag plot, we can tell whether future values depend on previous values: if not, the graph is scattered; if so, there is often a pattern.

Making a lag plot (with lag 1) is quickly done with the indexing notation of negative numbers. For example, these commands produce a lag plot‡ of x:

> n = length(x)
> plot(x[-n], x[-1])

(The plot() function plots pairs of points when called with two data vectors.)

‡ This is better implemented in the lag.plot() function from the ts package.

Look at the lag plots of the following data sets:

1. x=rnorm(100) (random data)

Comment on any patterns you see.
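A runnable sketch of the lag-plot construction for random data follows. The seed is an arbitrary choice made here for reproducibility. In current versions of R, lag.plot() is found in the stats package, which is loaded by default; that detail is an observation about recent R releases rather than something stated in the text.

```r
set.seed(1)            # arbitrary seed, chosen only for reproducibility
x <- rnorm(100)        # random data
n <- length(x)

# x[-n] drops the last value, x[-1] drops the first,
# so the pairs plotted are (x[i], x[i + 1]) for i = 1, ..., n - 1
plot(x[-n], x[-1])

# The same idea using the dedicated function
lag.plot(x, lags = 1)
```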

2.27 Verify that the following are true for the summation notation:

2.28 Show that for any data set

2.29 The sample variance definition, Equation (2.3), has a nice interpretation, but the following formula is easier to compute by hand:

$$s^2 = \frac{n}{n-1}\left(\overline{x^2} - \bar{x}^2\right).$$

The term $\overline{x^2}$ means to square the data values, then find the sample average, whereas $\bar{x}^2$ finds the sample average, then squares the answer. Show that the equivalence follows from the definition.
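Before attempting the proof, it can help to check numerically that the two computations agree. The sketch below assumes Equation (2.3) is the usual sample variance, the sum of squared deviations divided by n − 1, and takes the shortcut to be n/(n − 1) times the difference between the average of the squares and the square of the average. The data vector is hypothetical.

```r
# Hypothetical data vector, made up for illustration only
x <- c(2, 3, 5, 7, 11)
n <- length(x)

# Definition (assumed form of Equation (2.3))
var_def <- sum((x - mean(x))^2) / (n - 1)

# Shortcut: average of the squares minus the square of the average,
# rescaled by n / (n - 1)
var_shortcut <- n / (n - 1) * (mean(x^2) - mean(x)^2)

all.equal(var_def, var_shortcut)
all.equal(var_def, var(x))     # var() is R's built-in sample variance
```

A numeric check like this is not a proof, but it guards against algebra slips while working one out.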

2.3 Shape of a distribution

The stem-and-leaf plot tells us more about the data at a glance than a few numeric summaries, although not as precisely. However, when a data set is large, it tells us too much about the data. Other graphical summaries are presented here that work for larger data sets too. These include the histogram, which at first glance looks like a barplot, and the boxplot, which is a graphical representation of the five-number summary.

In addition to learning these graphical displays, we will develop a vocabulary to describe the shape of a distribution. Concepts will include the notion of modes or peaks of a distribution, the symmetry or skew of a distribution, and the length of the tails of a distribution.

2.3.1 Histogram

A histogram is a visual representation of the distribution of a data set. At a glance, the viewer should be able to see where there is a relatively large amount of data, and where there is very little. Figure 2.11 is a histogram of the waiting variable from the data set faithful, recording the waiting time between eruptions of Old Faithful. The histogram is created with the hist() function. Its simplest usage is just hist(x), but many alternatives exist. This histogram has two distinct peaks or modes.
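A minimal sketch of the hist() usage described above, using the built-in faithful data set. The title, axis label, and number of breaks in the second call are illustrative choices made here, not settings taken from Figure 2.11.

```r
waiting <- faithful$waiting

# Simplest usage: default bins, default labels
hist(waiting)

# One of many alternatives: suggest more bins and supply labels
hist(waiting, breaks = 20,
     main = "Old Faithful waiting times",
     xlab = "Waiting time between eruptions")

# The bins and counts can be inspected without drawing anything
h <- hist(waiting, plot = FALSE)
h$breaks
h$counts
```

The breaks argument is treated as a suggestion when given as a single number, so the number of bins actually drawn may differ slightly from the value requested.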