2.2 Numerical summaries and box plots

2.2.5 Box plots and quartiles

A box plot summarizes a data set using five summary statistics while also plotting unusual observations, called outliers. Figure 2.15 provides a box plot of the num_char variable from the email50 data set.

The five summary statistics used in a box plot are known as the five-number summary, which consists of the minimum, the maximum, and the three quartiles (Q1, Q2, Q3) of the data set being studied.

Q2 represents the second quartile, which is equivalent to the 50th percentile (i.e. the median). Previously, we saw that Q2 (the median) for the email50 data set was the average of the two middle values: (6,768 + 7,012)/2 = 6,890.

Q1 represents the first quartile, which is the 25th percentile, and is the median of the smaller half of the data set. There are 25 values in the lower half of the data set, so Q1 is the middle value: 2,454 characters. Q3 represents the third quartile, or 75th percentile, and is the median of the larger half of the data set: 15,829 characters.

We calculate the variability in the data using the range of the middle 50% of the data: Q3 − Q1 = 13,375. This quantity is called the interquartile range (IQR, for short). It, like the standard deviation, is a measure of variability or spread in data. The more variable the data, the larger the standard deviation and IQR tend to be.
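As a rough illustration of the quartile rule described above (Q1 and Q3 as the medians of the smaller and larger halves), here is a minimal Python sketch. The function name five_number_summary and the sample values are hypothetical and are not taken from email50. When the number of observations is odd, the sketch assumes the middle value is left out of both halves.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # smaller half of the data
    upper = data[half + n % 2:]   # larger half; skips the middle value when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]

# Hypothetical values chosen only for illustration (not the email50 data).
sample = [2, 4, 4, 5, 7, 8, 10, 12]
minimum, q1, q2, q3, maximum = five_number_summary(sample)
iqr = q3 - q1  # interquartile range: the spread of the middle 50% of the data
print(minimum, q1, q2, q3, maximum, iqr)
```

Statistical software often uses other interpolation rules for quartiles, so its output may differ slightly from this hand method.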

25 Because the absolute value of the Z-score for the second observation (x2 = 85.8 mm → Z2 = −1.89) is larger than that of the first (x1 = 95.4 mm → Z1 = 0.78), the second observation has a more unusual head length.

26 (a) Its Z-score is given by Z = (x − µ)/σ = (5.19 − 3)/2 = 2.19/2 = 1.095. (b) The observation x is 1.095 standard deviations above the mean. We know it must be above the mean since Z is positive.

[Box plot image. Axis: Number of Characters (in thousands), from 0 to 70. Labeled parts: lower whisker, Q1 (first quartile), Q2 (median), Q3 (third quartile), upper whisker, max whisker reach, outliers.]

Figure 2.15: A labeled box plot for the number of characters in 50 emails. The median (6,890) splits the data into the bottom 50% and the top 50%. Explore dozens of boxplots with histograms using American Community Survey data on Tableau Public.

INTERQUARTILE RANGE (IQR)

The IQR is the length of the box in a box plot. It is computed as

IQR = Q3 − Q1

where Q1 and Q3 are the 25th and 75th percentiles.

OUTLIERS IN THE CONTEXT OF A BOX PLOT

In the context of a box plot, define an outlier as an observation that is more than 1.5 × IQR above Q3 or more than 1.5 × IQR below Q1. Such points are marked using a dot or asterisk in a box plot.
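The rule in the box above amounts to two cutoffs, one on each side of the box. The following Python sketch is only an illustration; the function name outlier_fences is hypothetical. It applies the rule to the email50 quartiles quoted earlier (Q1 = 2,454 and Q3 = 15,829 characters).

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs beyond which a point counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles for the number of characters in the email50 data, as given in the text.
lower, upper = outlier_fences(2454, 15829)
print(lower, upper)
```

The lower cutoff comes out negative. Since a character count cannot be negative, only unusually long emails can be flagged as outliers in this data set.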

To build a box plot, draw an axis (vertical or horizontal) and draw a scale. Draw a dark line denoting Q2, the median. Next, draw a line at Q1 and at Q3. Connect the Q1 and Q3 lines to form a rectangle. The length of the rectangle along the axis corresponds to the IQR, and the middle 50% of the data is in this interval.

Extending out from the rectangle, the whiskers attempt to capture all of the data remaining outside of the box, except outliers. In Figure 2.15, the upper whisker does not extend to the last three points, which are beyond Q3 + 1.5 × IQR and are outliers, so it extends only to the last point below this limit.27 The lower whisker stops at the lowest value, 33, since there are no additional data to reach. Outliers are each marked with a dot or asterisk. In a sense, the box is like the body of the box plot and the whiskers are like its arms trying to reach the rest of the data.
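The whisker rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function name whiskers_and_outliers is hypothetical, and the quartiles are passed in rather than computed. The data and quartiles (Q1 = 7, Q3 = 25) are those of Example 2.41 later in this section.

```python
def whiskers_and_outliers(values, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, list of outliers)."""
    iqr = q3 - q1
    low_limit, high_limit = q1 - k * iqr, q3 + k * iqr
    # Points within the limits are reachable by the whiskers.
    inside = [v for v in values if low_limit <= v <= high_limit]
    # Points strictly beyond a limit are outliers and are plotted individually.
    outliers = [v for v in values if v < low_limit or v > high_limit]
    # Each whisker stops at the most extreme observation still within its limit.
    return min(inside), max(inside), outliers

data = [5, 5, 9, 10, 15, 16, 20, 30, 80]
print(whiskers_and_outliers(data, q1=7, q3=25))
```

Note that each whisker ends at an actual observation, not at the limit itself, which is why the upper whisker here stops at 30.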

27 You might wonder, isn’t the choice of 1.5 × IQR for defining an outlier arbitrary? It is! In practical data analyses, we tend to avoid a strict definition since what is an unusual observation is highly dependent on the context of the data.

EXAMPLE 2.38

Compare the box plot to the graphs previously discussed: stem-and-leaf plot, dot plot, frequency and relative frequency histogram. What can we learn more easily from a box plot? What can we learn more easily from the other graphs?

It is easier to immediately identify the quartiles from a box plot. The box plot also more prominently highlights outliers. However, a box plot, unlike the other graphs, does not show the distribution of the data. For example, we cannot generally identify modes using a box plot.

EXAMPLE 2.39

Is it possible to identify skew from the box plot?

Yes. Looking at the lower and upper whiskers of this box plot, we see that the lower 25% of the data is squished into a shorter distance than the upper 25% of the data, implying that there is greater density in the low values and a tail trailing to the upper values. This box plot is right skewed.

GUIDED PRACTICE 2.40

True or false: there is more data between the median and Q3 than between Q1 and the median.28

EXAMPLE 2.41

Consider the following ordered data set.

5 5 9 10 15 16 20 30 80

Find the five-number summary and identify how small or large a value would need to be to be considered an outlier. Are there any outliers in this data set?

There are nine numbers in this data set. Because n is odd, the median is the middle number: 15. When finding Q1, we find the median of the lower half of the data, which in this case includes 4 numbers (we do not include the 15 as belonging to either half of the data set). Q1 then is the average of 5 and 9, which is Q1 = 7, and Q3 is the average of 20 and 30, so Q3 = 25. The min is 5 and the max is 80. To see how small a number needs to be to be an outlier on the low end we do:

Q1 − 1.5 × IQR = Q1 − 1.5 × (Q3 − Q1) = 7 − 1.5 × (25 − 7) = −20

On the high end we need:

Q3 + 1.5 × IQR = Q3 + 1.5 × (Q3 − Q1) = 25 + 1.5 × (25 − 7) = 52

There are no numbers less than −20, so there are no outliers on the low end. The observation at 80 is greater than 52, so 80 is an outlier on the high end.

28 False. Since Q1 is the 25th percentile and the median is the 50th percentile, 25% of the data fall between Q1 and the median. Similarly, 25% of the data fall between the median and Q3. The distance between the median and Q3 is larger because that 25% of the data is more spread out.