2.2 Numerical summaries and box plots
2.2.3 Standard deviation as a measure of spread
The U.S. Census Bureau reported that in 2017, the median family income was $73,891 and the mean family income was $99,114.²² Is a family income of $60,000 far from the mean or somewhat close to the mean? To answer this question, it is not enough to know the center of the data set and its range (maximum value − minimum value). We must also know about the variability of the data set within that range. Low variability, or small spread, means that the values tend to be clustered together. High variability, or large spread, means that the values tend to be far apart.
EXAMPLE 2.29
Is it possible for two data sets to have the same range but different spread? If so, give an example. If not, explain why not.
Yes. An example is: 1, 1, 1, 1, 1, 9, 9, 9, 9, 9 and 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 9.
The first data set has a larger spread because its values tend to be farther from each other, while in the second data set the values are clustered together at the mean.
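The contrast in Example 2.29 can be checked numerically. The following is a minimal Python sketch, not part of the original example. It measures spread with the sample standard deviation (the measure introduced in the next paragraphs) using the standard library's `statistics.stdev`. The names `set_a`, `set_b`, and `value_range` are chosen here purely for illustration.

```python
import statistics

# The two data sets from Example 2.29.
set_a = [1, 1, 1, 1, 1, 9, 9, 9, 9, 9]
set_b = [1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 9]


def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


range_a = value_range(set_a)
range_b = value_range(set_b)

# Sample standard deviation (divides by n - 1).
sd_a = statistics.stdev(set_a)
sd_b = statistics.stdev(set_b)

print(f"range:     {range_a} vs {range_b}")
print(f"sample sd: {sd_a:.2f} vs {sd_b:.2f}")
```

Both data sets have the same range, yet the first has a noticeably larger standard deviation, which matches the verbal argument in the example.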
Here, we introduce the standard deviation as a measure of spread. Though its formula is a bit tedious to calculate by hand, the standard deviation is very useful in data analysis and roughly describes how far away, on average, the observations are from the mean.
We call the distance of an observation from its mean its deviation. Below are the deviations for the 1st, 2nd, 3rd, and 50th observations in the num_char variable. For computational convenience, the number of characters is listed in the thousands and rounded to the first decimal.
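\[
\begin{aligned}
x_1 - \bar{x} &= 21.7 - 11.6 = 10.1 \\
x_2 - \bar{x} &= 7.0 - 11.6 = -4.6 \\
x_3 - \bar{x} &= 0.6 - 11.6 = -11.0 \\
&\;\;\vdots \\
x_{50} - \bar{x} &= 15.8 - 11.6 = 4.2
\end{aligned}
\]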
If we square these deviations and then take an average, the result is about equal to the sample variance, denoted by \(s^2\):
\[
\begin{aligned}
s^2 &= \frac{10.1^2 + (-4.6)^2 + (-11.0)^2 + \cdots + 4.2^2}{50 - 1} \\
    &= \frac{102.01 + 21.16 + 121.00 + \cdots + 17.64}{49} \\
    &= 172.44
\end{aligned}
\]
We divide by \(n - 1\), rather than by \(n\), when computing the variance; you need not worry about this mathematical nuance for the material in this textbook. Notice that squaring the deviations does two things. First, it makes large values much larger, as seen by comparing \(10.1^2\), \((-4.6)^2\), \((-11.0)^2\), and \(4.2^2\). Second, it gets rid of any negative signs.
The standard deviation is defined as the square root of the variance:
\[
s = \sqrt{172.44} = 13.13
\]
The standard deviation of the number of characters in an email is about 13.13 thousand. A subscript of \(x\) may be added to the variance and standard deviation, i.e. \(s_x^2\) and \(s_x\), as a reminder that these are the variance and standard deviation of the observations represented by \(x_1, x_2, \ldots, x_n\). The \(x\) subscript is usually omitted when it is clear which data the variance or standard deviation is referencing.
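The arithmetic shown above can be re-checked with a short script. This is only a sketch in Python: it uses the four observations the text lists (the other 46 are not given, so they are not invented here) and takes the reported variance of 172.44 as given. The variable names are illustrative.

```python
import math

# Mean of num_char (in thousands of characters), as stated in the text.
mean = 11.6

# Only observations 1, 2, 3, and 50 are shown in the text.
shown_values = [21.7, 7.0, 0.6, 15.8]

# Deviation = observation - mean (rounded to one decimal, as in the text).
deviations = [round(x - mean, 1) for x in shown_values]

# Squared deviations (rounded to two decimals, as in the text).
squared = [round(d ** 2, 2) for d in deviations]

# The text reports s^2 = 172.44 after summing all 50 squared
# deviations and dividing by 50 - 1 = 49; we take that value as given.
variance = 172.44
sd = math.sqrt(variance)

print("deviations:", deviations)
print("squared deviations:", squared)
print("standard deviation:", round(sd, 2))
```

Squaring turns every deviation non-negative and enlarges the big deviations far more than the small ones, which is the effect described in the text.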
CALCULATING THE STANDARD DEVIATION
The standard deviation is the square root of the variance. It is roughly the “typical” distance of the observations from the mean.
\[
s_x = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}
\]
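The formula in the box translates directly into code. Below is a minimal Python sketch of that computation; the function name `sample_sd` and the small data set are hypothetical, chosen only to illustrate the steps. The result is compared with the standard library's `statistics.stdev`, which also divides by n − 1.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation: sqrt( sum((x_i - mean)^2) / (n - 1) )."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Sum of squared deviations from the mean.
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    # Divide by n - 1, then take the square root.
    return math.sqrt(sum_sq_dev / (n - 1))


# Hypothetical data, for illustration only.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

print("by formula:      ", sample_sd(data))
print("statistics.stdev:", statistics.stdev(data))
```

For this data set the mean is 5 and the squared deviations sum to 24, so the variance is 24 / 5 = 4.8 and the standard deviation is its square root.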
The variance is useful for mathematical reasons, but the standard deviation is easier to interpret because it has the same units as the data set. The units for variance will be the units squared (e.g. meters²). Formulas and methods used to compute the variance and standard deviation for a population are similar to those used for a sample.²³ However, like the mean, the population values have special symbols: \(\sigma^2\) for the variance and \(\sigma\) for the standard deviation. The symbol \(\sigma\) is the Greek letter sigma.
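The sample and population versions differ only in the divisor: n − 1 for a sample and n for a population. As a hedged illustration, Python's standard `statistics` module exposes both: `variance` and `stdev` divide by n − 1, while `pvariance` and `pstdev` divide by n. The data set below is hypothetical.

```python
import statistics

# Hypothetical data, for illustration only.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

# Treating the data as a sample: divide by n - 1.
s2 = statistics.variance(data)
s = statistics.stdev(data)

# Treating the data as an entire population: divide by n.
sigma2 = statistics.pvariance(data)
sigma = statistics.pstdev(data)

print(f"sample:     s^2 = {s2:.2f}, s = {s:.3f}")
print(f"population: sigma^2 = {sigma2:.2f}, sigma = {sigma:.3f}")
```

Because the population formula divides by the larger number, it always gives a value no greater than the sample formula for the same data, and the gap shrinks as n grows.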
THINKING ABOUT THE STANDARD DEVIATION
It is useful to think of the standard deviation as the “typical” or “average” distance that observations fall from the mean.
In Chapter 4, we encounter a bell-shaped distribution known as the normal distribution. The empirical rule tells us that for normal distributions, about 68% of the data will be within one standard deviation of the mean, about 95% will be within two standard deviations of the mean, and about 99.7% will be within three standard deviations of the mean. However, as seen in Figures 2.13 and 2.14, these percentages generally do not hold if the distribution is not bell-shaped.
[Dot plot of Number of Characters (in thousands); horizontal axis from 0 to 70]
Figure 2.13: In the num_char data, 40 of the 50 emails (80%) are within 1 standard deviation of the mean, and 47 of the 50 emails (94%) are within 2 standard deviations. The empirical rule does not hold well for skewed data, as shown in this example.
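To see how the empirical rule behaves for bell-shaped versus skewed data, one can count the share of observations within k standard deviations of the mean. The sketch below is an illustration only: it uses simulated data rather than the email data, with a normal sample (`random.gauss`) standing in for a bell-shaped distribution and an exponential sample (`random.expovariate`) as an assumed example of right-skewed data. The function name `fraction_within` is hypothetical.

```python
import random
import statistics


def fraction_within(values, k):
    """Share of values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for x in values if abs(x - mean) <= k * sd)
    return inside / len(values)


# Simulated data with a fixed seed so the run is repeatable.
rng = random.Random(0)
bell = [rng.gauss(0, 1) for _ in range(20_000)]          # bell-shaped
skewed = [rng.expovariate(1.0) for _ in range(20_000)]   # right skewed

for k in (1, 2, 3):
    print(
        f"within {k} sd: "
        f"bell-shaped {fraction_within(bell, k):.3f}, "
        f"skewed {fraction_within(skewed, k):.3f}"
    )
```

The bell-shaped sample should land near 68%, 95%, and 99.7%, while the skewed sample puts well over 68% of its values within one standard deviation, similar in spirit to the 80% seen for the email data in Figure 2.13.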
GUIDED PRACTICE 2.30
On page 70, the concept of shape of a distribution was introduced. A good description of the shape of a distribution should include modality and whether the distribution is symmetric or skewed to one side. Using Figure 2.14 as an example, explain why such a description is important.²⁴
²³ The only difference is that the population variance has a division by \(n\) instead of \(n - 1\).
²⁴ Figure 2.14 shows three distributions that look quite different, but all have the same mean, variance, and standard deviation. Using modality, we can distinguish between the first plot (bimodal) and the last two (unimodal). Using skewness, we can distinguish between the last plot (right skewed) and the first two. While a picture, like a histogram, tells a more complete story, we can use modality and shape (symmetry/skew) to characterize basic information about a distribution.
[Three plots of population distributions, each on a horizontal axis from −3 to 3]
Figure 2.14: Three very different population distributions with the same mean \(\mu = 0\) and standard deviation \(\sigma = 1\).
EXAMPLE 2.31
Earlier we reported that the mean family income in the U.S. in 2017 was $99,114. Estimating the standard deviation of income as approximately $50,000, is a family income of $60,000 far from the mean or relatively close to the mean?
Because $60,000 is less than one standard deviation from the mean ($99,114 − $60,000 = $39,114, which is less than $50,000), it is relatively close to the mean. If the value were more than 2 standard deviations away from the mean, we would consider it far from the mean.
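The comparison in Example 2.31 amounts to expressing the distance from the mean in units of standard deviations. A minimal Python sketch follows, using the figures from the example; the $50,000 standard deviation is the rough estimate given there, and the function name `sds_from_mean` is hypothetical.

```python
def sds_from_mean(value, mean, sd):
    """Signed distance of value from the mean, in standard deviations."""
    return (value - mean) / sd


mean_income = 99_114   # mean family income, 2017 (from the text)
sd_income = 50_000     # rough estimate of the standard deviation (from the text)
income = 60_000

distance = sds_from_mean(income, mean_income, sd_income)
print(f"{distance:.2f} standard deviations from the mean")

# Rule of thumb from the example: within 1 sd is relatively close,
# more than 2 sd away is far.
if abs(distance) < 1:
    print("relatively close to the mean")
elif abs(distance) > 2:
    print("far from the mean")
```

The negative sign indicates that the income lies below the mean; its size, about 0.78, is what places it within one standard deviation.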
When describing any distribution, comment on the three important characteristics of center, spread, and shape. Also note any especially unusual cases.
EXAMPLE 2.32
In the data’s context (the number of characters in emails), describe the distribution of the num_char variable shown in the histogram below.
[Histogram of Number of Characters (in thousands); vertical axis: Frequency]
The distribution of email character counts is unimodal and very strongly skewed to the right. Many of the counts fall near the mean at 11,600, and most fall within one standard deviation (13,130) of the mean. There is one exceptionally long email with about 65,000 characters.
In this chapter we use standard deviation as a descriptive statistic to describe the variability in a given data set. In Chapter 5 we will use the standard deviation to assess how close a sample mean is to the population mean.