Syllabus reference

Topic:
• Applied statistical analysis

In this chapter

10A Measures of central tendency
10B Range and interquartile range
10C The standard deviation
10D Boxplots
10E Back-to-back stem plots
10F Parallel boxplots

Introduction

Yvonne works in the quality control department of a soft drink bottling company. Soft drink is bottled by two machines, each of which is set to pour one litre of soft drink into every bottle. It is part of Yvonne’s job to take 20 bottles of soft drink from each machine and measure the contents in millilitres. The results she obtained from one such check are shown below.

Yvonne must use these results to assess if the machines are sufficiently accurate in dispensing soft drink into the bottles.

One of the main tasks of a statistician is to summarise large volumes of data. It is useful to find one score that is typical of a whole set of data, or a few figures which can describe its distribution.

Finding the mean, median and mode are three different methods of arriving at a score that is typical or central to the data set. Mean, median and mode are often called measures of central tendency.

Machine A
1009 992 990 1018 1017 985 984 1008 1020 1005
992 983 1020 988 996 984 989 1014 995 1004

Machine B
1002 991 990 980 1004 1018 1008 997 992 999
1010 1004 1001 1003 1009 1004 1006 1001 997 994

Measures of central tendency

The mean

The mean of a set of data is what is referred to in everyday language as the average. For the set of data 4 7 9 12 18:

mean = \(\frac{4 + 7 + 9 + 12 + 18}{5}\) = 10

The symbol we use to represent the mean is \(\bar{x}\), that is, a lower-case x with a bar on top. So, in this case, \(\bar{x}\) = 10.

The formal definition of the mean is:

\[\bar{x} = \frac{\sum x}{n}\]

where Σx represents the sum of all of the observations in the data set and n represents the number of observations in the data set.

Note that the symbol, Σ, is the Greek letter sigma, which represents ‘the sum of’.

The mean is also referred to as a summary statistic and is a measure of the centre of a distribution. The mean is the point about which the distribution ‘balances’. Consider the masses of 7 potatoes, given in grams, below.

100 120 130 145 160 170 190

The mean is 145 g. The observations 130 and 160 ‘balance’ each other since they are each 15 g from the mean. Similarly, the observations 120 and 170 ‘balance’ each other since they are each 25 g from the mean, as do the observations 100 and 190. Note that the median is also 145 g. That is, for this set of data the mean and the median give the same value for the centre. This is because the distribution is symmetric.

Now consider two cases in which the distribution of data is not symmetric.

Case 1

Consider the masses of a different set of 7 potatoes, given in grams below.

100 105 110 115 120 160 200

The median of this distribution is 115 g and the mean is 130 g. There are 5 observations that are less than the mean and only 2 that are more. In other words, the mean does not give us a good indication of the centre of the distribution. However, there is still a ‘balance’ between observations below the mean and those above, in terms of the spread of all the observations from the mean. Therefore, the mean is still useful to give a measure of the central tendency of the distribution but in cases where the distribution is skewed, the median gives a better indication of the centre. For a positively skewed distribution, as in the case above, the mean will be greater than the median. For a negatively skewed distribution the mean will be less than the median.

Case 2

Consider the data below, showing the weekly income (to the nearest $10) of 10 families living in a suburban street.

$300 $670 $680 $690 $700 $710 $710 $720 $730 $750

In this case, \(\bar{x} = \frac{6660}{10}\) = $666, and the median is $705.

One of the values in this set, $300, is clearly an outlier. As a result, the value of the mean is below the weekly income of the other 9 households. In such a case the mean is not very useful in establishing the centre; however, the ‘balance’ still remains for this negatively skewed distribution.

The mean is calculated by using the values of the observations and because of this it becomes a less reliable measure of the centre of the distribution when the distribution is skewed or contains an outlier. Because the median is based on the order of the observations rather than their value, it is a better measure of the centre of such distributions.
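The comparison between the mean and the median in the three data sets above can be checked with a short Python sketch. It uses the standard `statistics` module; the function name `centre_summary` and the variable names are illustrative choices, not part of the original text.

```python
from statistics import mean, median

# Data sets quoted in the text above.
symmetric_potatoes = [100, 120, 130, 145, 160, 170, 190]             # grams
skewed_potatoes = [100, 105, 110, 115, 120, 160, 200]                # grams (Case 1)
weekly_incomes = [300, 670, 680, 690, 700, 710, 710, 720, 730, 750]  # dollars (Case 2)


def centre_summary(data):
    """Return (mean, median) for a list of observations."""
    return mean(data), median(data)


for label, data in [
    ("symmetric potatoes", symmetric_potatoes),
    ("positively skewed potatoes", skewed_potatoes),
    ("incomes with an outlier", weekly_incomes),
]:
    data_mean, data_median = centre_summary(data)
    print(f"{label}: mean = {data_mean}, median = {data_median}")
```

In the symmetric set the two measures agree, while the skewed set and the set with an outlier pull the mean away from the median, as described above.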

Calculating the mean using a graphics calculator

When data are presented in a frequency table with class intervals and we don’t know what the raw data are, we employ another method to find the mean of these grouped data. This other method is shown in the example that follows and uses the midpoints of the class intervals to represent the raw data.


Worked example 1

Calculate the mean of the set of data below.

10, 12, 15, 16, 18, 19, 22, 25, 27, 29

THINK / WRITE/DISPLAY

1. Enter the data in L1. (Press STAT and select 1:Edit.)

2. Calculate the mean.
   (a) Press STAT.
   (b) Highlight CALC in the top line.
   (c) Highlight 1:1–Var Stats and press ENTER.
   (d) Press 2nd [L1] and press ENTER.
   (e) Several values are given. The top entry, \(\bar{x}\) = 19.3, gives us the mean.

Remember that the Greek letter sigma, Σ, represents ‘the sum of’. So, Σf means the sum of the frequencies and is the total of all the numbers in the frequency column. To find the mean for grouped data,

\[\bar{x} = \frac{\sum (f \times m)}{\sum f}\]

where f represents the frequency of the data and m represents the midpoint of the class interval of the grouped data.

Worked example 2

The ages of a group of 30 people attending a superannuation seminar are recorded in the frequency table below. Calculate the mean age of those attending the seminar.

Age (class intervals)    Frequency f
20–29                    1
30–39                    6
40–49                    13
50–59                    6
60–69                    3
70–79                    1

THINK / WRITE

1. Since we don’t have individual raw ages, but rather a class interval, we need to decide on one particular age to represent each interval. We use the midpoint, m, of the class interval. Add an extra column to the table to display these.
   The midpoint of the first interval is \(\frac{20 + 29}{2}\) = 24.5, the midpoint of the second interval is 34.5 and so on.

2. Multiply each of the midpoints by the frequency and display these values in another column headed f × m. For the first interval we have 24.5 × 1 = 24.5. For the second interval we have 34.5 × 6 = 207 and so on.

Age (class intervals)    Frequency f    Midpoint of class interval m    f × m
20–29                    1              24.5                            24.5
30–39                    6              34.5                            207
40–49                    13             44.5                            578.5
50–59                    6              54.5                            327
60–69                    3              64.5                            193.5
70–79                    1              74.5                            74.5
                         Σf = 30                                        Σ(f × m) = 1405

3. Sum the products of the midpoints and the frequencies in the f × m column.
   24.5 + 207 + 578.5 + 327 + 193.5 + 74.5 = 1405

4. Divide this sum by the total number of people attending the seminar.
   So, \(\bar{x} = \frac{1405}{30}\) ≈ 46.8 (correct to 1 decimal place)
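The grouped-data method used in worked example 2 can be sketched in Python as follows. This is a minimal illustration that assumes each class interval is given as a (lower, upper) pair; the function name `grouped_mean` and the variable names are hypothetical.

```python
def grouped_mean(intervals, frequencies):
    """Estimate the mean of grouped data from class midpoints.

    intervals   -- list of (lower, upper) class limits
    frequencies -- frequency f of each class, in the same order
    """
    # The midpoint m of each class interval stands in for the raw data.
    midpoints = [(lower + upper) / 2 for lower, upper in intervals]
    # Sum of f x m, divided by the sum of the frequencies.
    total_fm = sum(f * m for f, m in zip(frequencies, midpoints))
    return total_fm / sum(frequencies)


# Seminar ages from worked example 2.
age_intervals = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]
age_frequencies = [1, 6, 13, 6, 3, 1]

mean_age = grouped_mean(age_intervals, age_frequencies)
print(round(mean_age, 1))
```

Because the midpoints only approximate the raw ages, the result is an estimate of the mean rather than its exact value.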

Solution to worked example 2 using a graphics calculator

THINK / DISPLAY

1. Since we don’t have individual raw ages, but rather a class interval, we need to decide on one particular age to represent each interval. We use the midpoint, m, of the class interval. The midpoint of the first interval is \(\frac{20 + 29}{2}\) = 24.5. The midpoint of the second interval is 34.5 and so on. Enter these midpoints as list 1.

2. Enter the frequency of each of the class intervals as list 2. For example, L1(3) = 44.5 and L2(3) = 13.

3. Calculate the mean.
   (a) Press STAT.
   (b) Highlight CALC in the top line.
   (c) Highlight 1:1–Var Stats and press ENTER.
   (d) Enter L1 and L2 by pressing 2nd [L1], the comma key and 2nd [L2]; then press ENTER.
   A number of values are given. The top entry, \(\bar{x}\) = 46.8, gives us the mean.

The median

The median is the midpoint of a set of data. Half the data are less than or equal to the median.

Consider the set of data: 2 5 6 8 11 12 15. These data are in ordered form (that is, from lowest to highest). There are 7 observations. The median in this case is the middle or fourth score; that is, 8.

Consider the set of data: 1 3 5 6 7 8 8 9 10 12. These data are in ordered form also; however, in this case there is an even number of scores; that is, there are 10 scores. The median in this case lies halfway between the 5th score (7) and the 6th score (8). So the median is 7.5. (Alternatively, median = \(\frac{7 + 8}{2}\) = 7.5.)

When there are n records in a set of ordered data, the median can be located at the \(\frac{n + 1}{2}\)th position.

Checking this against our previous example, we have n = 10; that is, there were 10 observations in the set. The median was located at the \(\frac{10 + 1}{2}\) = 5.5th position; that is, halfway between the 5th and the 6th terms.

A stem plot provides a quick way of locating a median since the data in a stem plot are already ordered.
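The position rule for the median can be sketched in Python as below. The function name `median_by_position` is an illustrative choice, and the sketch assumes a non-empty list of numbers.

```python
import math


def median_by_position(data):
    """Locate the median at the (n + 1) / 2 th position of the ordered data.

    Returns the (1-based) position and the median itself. When the position
    ends in .5, the median lies halfway between the two neighbouring scores.
    """
    ordered = sorted(data)
    n = len(ordered)
    position = (n + 1) / 2
    below = ordered[math.floor(position) - 1]  # score at or just before the position
    above = ordered[math.ceil(position) - 1]   # score at or just after the position
    return position, (below + above) / 2


# The two data sets discussed in the text above.
print(median_by_position([2, 5, 6, 8, 11, 12, 15]))
print(median_by_position([1, 3, 5, 6, 7, 8, 8, 9, 10, 12]))
```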

Mode

There are many examples where neither the mean nor the median is the appropriate measure of the typical score in a data set.

Consider the case of a clothing store. It needs to re-order a supply of dresses. To know what sizes to order it looks at past sales of this particular style and gathers the following data:

8 12 14 12 16 10 12 14 16 18

14 12 14 12 12 8 18 16 12 14

For this data set the mean dress size is 13.2. Dresses are not sold in size 13.2, so this has very little meaning. The median is 13, which also has little meaning as dresses are sold only in even-numbered sizes.

What is most important to the clothing store is the dress size that sells the most. In this case size 12 occurs most frequently. The score that has the highest frequency is called the mode.

When two scores share the highest frequency, both scores are given as the mode. In this situation the distribution is bimodal. If every score occurs the same number of times, then the distribution has no mode.

To find the mode from a frequency distribution table, we simply give the score that has the highest frequency.

Worked example 3

Consider the stem plot below which contains 22 observations. What is the median?

Stem | Leaf
2 | 3 3
2 | 5 7 9
3 | 1 3 3 4 4
3 | 5 8 9 9
4 | 0 2 2
4 | 6 8 8 8 9

THINK / WRITE

1. Find the median position, where n = 22.
   Median = \(\frac{n + 1}{2}\)th position = \(\frac{22 + 1}{2}\)th position = 11.5th position

2. Find the 11th and 12th terms.
   11th term = 35
   12th term = 38

3. The median is halfway between the 11th and 12th terms.
   Median = 36.5

When a table is presented using grouped data, we do not have a single mode. In these cases, the class with the highest frequency is called the modal class.

Worked example 4

For the frequency distribution below, state the mode.

Score    Frequency
14       3
15       6
16       11
17       14
18       10
19       7

THINK / WRITE

1. The highest frequency is 14, which belongs to the score 17, and so 17 is the mode.
   Mode = 17
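The rules for the mode (one mode, two modes, or no mode) can be sketched in Python using `collections.Counter`. The function names `modes` and `modes_from_table` are illustrative, and returning an empty list to mean ‘no mode’ is a convention assumed for this sketch only.

```python
from collections import Counter


def modes_from_table(table):
    """Return the mode(s) of a frequency table given as {score: frequency}."""
    highest = max(table.values())
    # If several different scores all occur equally often, there is no mode.
    if len(table) > 1 and all(f == highest for f in table.values()):
        return []
    return sorted(score for score, f in table.items() if f == highest)


def modes(scores):
    """Return the mode(s) of a list of raw scores."""
    return modes_from_table(Counter(scores))


# Dress sizes from the clothing store example.
dress_sizes = [8, 12, 14, 12, 16, 10, 12, 14, 16, 18,
               14, 12, 14, 12, 12, 8, 18, 16, 12, 14]
# Frequency table from worked example 4.
example_4 = {14: 3, 15: 6, 16: 11, 17: 14, 18: 10, 19: 7}

print(modes(dress_sizes))
print(modes_from_table(example_4))
print(modes([1, 1, 2, 2, 3]))  # hypothetical bimodal data
print(modes([1, 2, 3]))        # hypothetical data with no mode
```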

remember

1. The mean is given by \(\bar{x} = \frac{\sum x}{n}\), where Σx represents the sum of all the observations in the data set and n represents the number of observations in the data set.

2. The mean is calculated by using the values of the observations and because of this it becomes a less reliable measure of the centre of the distribution when the distribution is skewed or contains an outlier.

3. To find the mean for grouped data, \(\bar{x} = \frac{\sum (f \times m)}{\sum f}\), where f represents the frequency of the data and m represents the midpoint of the class interval of the grouped data.

4. The median is the midpoint of a set of data. Half the data are less than or equal to the median.

5. When there are n observations in a set of ordered data, the median can be located at the \(\frac{n + 1}{2}\)th position.

6. The mode is the score with the highest frequency.

Exercise 10A Measures of central tendency

A graphics calculator may be used for this exercise.

1 Find the mean of each of the following sets of data.

a 5 6 8 8 9

b 3 4 4 5 5 6 7 7 7 8 8 9 9 10 10 12

c 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.6

d 11 13 15 15 16 18 20 21 22

e 0.4 0.5 0.7 0.8 0.8 0.9 1.0 1.1 1.2 1.0 1.3

2 Calculate the mean of each of the following and explain whether or not it gives us a good picture of the centre of the data.

a 0.7 0.8 0.85 0.9 0.92 2.3

b 14 16 16 17 17 17 19 20

c 23 24 28 29 33 34 37 39

d 2 15 17 18 18 19 20

3 The number of people attending sculpture classes at the local TAFE college for each week during the first semester is given below.

15 12 15 11 14 8 14 15 11 10
7 11 12 14 15 14 15 9 10 11

What is the mean number of people attending each week? (Express your answer to the nearest whole number.)

4 (multiple choice) The ages of a group of junior pilots joining an international airline are indicated on the stem plot below.

Key: 2|1 = 21 yrs

Stem | Leaf
2 | 1
2 | 2
2 | 4 5
2 | 6 6 7
2 | 8 8 8 9
3 | 0 1 1
3 | 2 3
3 | 4 4
3 | 6
3 | 8

The mean age of this group of pilots is:
A 20
B 28
C 29
D 29.15
E 29.5

5 (multiple choice) The number of people present each week at a 15-week horticultural course is given by the stem plot below.

Key: 2|4 = 24 people

Stem | Leaf
0 | 4
0 | 7
1 | 2 4
1 | 5 5 6 7 8
2 | 1 2 4
2 | 7 7 7

The mean number of people attending each week was closest to:
A 17.7
B 18
C 19.5
D 20
E 21.2

6 For each of the following, write down whether the mean or the median would provide a better indication of the centre of the distribution.

a A positively skewed distribution

b A symmetric distribution

c A distribution with an outlier

d A negatively skewed distribution

7 Find the mean of each set of data given below.

a Class interval: 0–9 10–19 20–29 30–39 40–49 50–59
  Frequency, f: 1 3 6 17 12 5

b Class interval: 0–4 5–9 10–14 15–19 25–29 30–34
  Frequency, f: 2 5 7 13 8 6

c Class interval: 0–49 50–99 100–149 150–199 200–249 250–299
  Frequency, f: 2 7 8 14 12 5

d Class interval: 1–6 7–12 13–18 19–24 25–30 31–36
  Frequency, f: 14 19 23 22 20 14

8 The ages of people attending a beginner’s course in karate are indicated in the frequency table below.

Age: 10–14 15–19 20–24 25–29 30–34 35–39 40–44 45–49
Frequency, f: 5 5 7 4 3 2 2 1

What is the mean age of those attending the course? (Express your answer correct to the nearest whole number.)

9 Write down the median of the sets of data shown in the following stem plots. The key for each stem plot is 3|4 = 34.

a
Stem | Leaf
0 | 7
1 | 2 3
2 | 2 4 5 7 9
3 | 0 2 3 6 8 8
4 | 4 7 8 9 9
5 | 2 7 8
6 | 1 3

b
Stem | Leaf
0 | 0 0 1 1
0 | 2 2 3 3
0 | 4 4 5 5 5 5 5 5 5 5
0 | 6 6 6 6 7
0 | 8 8 8 9
1 | 0 0 1
1 | 3 3
1 | 5 5
1 | 7
1 |

c
Stem | Leaf
0 | 1
0 | 2
0 | 4 4 5
0 | 6 6 6 7
0 | 8 8 8 8 9 9
1 | 0 0 0 1 1 1 1
1 | 2 2 2 3 3 3
1 | 4 4 5 5
1 | 6 7 7
1 | 8 9

d
Stem | Leaf
3 | 1
3 |
3 |
3 | 6
3 | 8 9
4 | 0 0 1 1 1
4 | 2 2 3 3 3 3
4 | 4 5 5 5
4 | 6 7
4 | 9

e
Stem | Leaf
60 | 2 5 8
61 | 1 3 3 6 7 8 9
62 | 0 1 2 4 6 7 8 8 9
63 | 2 2 4 5 7 8
64 | 3 6 7
65 | 4 5 8
66 | 3 5
67 | 4

10 For each of the following sets of data, write down the median.

a 2 4 6 7 9

b 12 15 17 19 21

c 3 4 5 6 7 8 9

d 3 5 7 8 12 13 15 16

e 12 13 15 16 18 19 21 23 24 26

f 3 8 4 2 1 6 5

g 16 21 14 28 23 15 11 19 25

h 7 4 3 4 9 5 10 4 2 11

i 29 23 22 33 26 18 37 22 16

11 Find the mode for each of the following. (Hint: Some are bimodal and others have no mode.)

a 16, 17, 19, 15, 17, 19, 14, 16, 17

b 147, 151, 148, 150, 148, 152, 151

c 2, 3, 1, 9, 7, 6, 8

d 68, 72, 73, 72, 72, 71, 72, 68, 71, 68

e 2.6, 2.5, 2.9, 2.6, 2.4, 2.4, 2.3, 2.5, 2.6

12 Use the tables below to state the mode of the distribution.

a Score: 1 2 3 4 5
  Frequency: 2 4 5 6 3

b Score: 5 6 7 8 9 10
  Frequency: 1 3 5 8 5 3

c Score: 38 39 40 41 42 43 44 45
  Frequency: 2 4 1 5 6 3 6 2

13 For each of the following grouped distributions, state the modal class.

a Class: 1–4 5–8 9–12 13–16 17–20 21–24 25–28
  Frequency: 6 12 30 23 46 27 9

b Class: 1–7 8–14 15–21 22–28 29–35 36–42
  Frequency: 3 8 9 25 12 11

14 The following data give the age of 25 patients admitted to the emergency ward of a hospital.

18 16 6 75 24
23 82 74 25 21
43 19 84 72 31
74 24 20 63 79
80 20 23 17 19

a Represent the data in a frequency distribution table. (Use classes 0–14, 15–29, 30–44, etc.)

b Find the mean age of patients admitted.

c Find the median class of age of patients admitted.

d Find the modal class for age of patients admitted.

e Do any of your statistics (mean, median or mode) give a clear representation of the typical age of an emergency ward patient?

f Give some reasons that could explain the pattern of the distribution of data in this question.

15 The batting scores for two cricket players over six innings are as follows:

Player A 31, 34, 42, 28, 30, 41 Player B 0, 0, 1, 0, 250, 0

a Find the mean score for each player.

b Which player appears to be better if the mean result is used?

c Find the median score for each player.

d Which player appears to be better when the decision is based on the median result?

e Which player do you think would be more useful to have in a cricket team and why? How can the mean result sometimes lead to a misleading conclusion?

Mean and median amount of soft drink

Remember Yvonne’s quality control check? We are going to calculate the mean and median amount of soft drink dispensed by each machine into each bottle.

1 Use your graphics calculator to store the data for Machine A as a list. Name the list MA. The data for Machine A were:

1009 992 990 1018 1017 985 984 1008 1020 1005
992 983 1020 988 996 984 989 1014 995 1004

2 Use your graphics calculator to store the data for Machine B as a list. Name the list MB. The data for Machine B were:

1002 991 990 980 1004 1018 1008 997 992 999
1010 1004 1001 1003 1009 1004 1006 1001 997 994

3 Use the statistics function of the graphics calculator to find the mean and median amount of soft drink dispensed into bottles by each machine.

4 At this stage can you say which machine most accurately dispenses soft drink into bottles?

Note:

1. If you do not have a graphics calculator, you can still calculate the mean and median of each data set.

2. If you do have a graphics calculator, it would be easier to enter the two sets of data under L1 and L2 (press STAT and select 1:Edit); however, if the data are stored as named lists MA and MB these can be retained for use later in the chapter.

Range and interquartile range

We have now looked at measures of central tendency, but two sets of scores may have the same mean, median and mode and still be very different data sets. Consider the results obtained by two groups of 10 students on the same mathematics test.

Group A: 45, 46, 47, 48, 50, 50, 52, 53, 54, 55

Group B: 10, 20, 30, 40, 50, 50, 60, 70, 80, 90

In both groups the mean, median and mode mark is 50, but the data sets are clearly very different. Group A’s scores are tightly bunched, whereas Group B’s scores are very spread out.

The range and interquartile range are examples of measures of spread. These measures help us analyse how widely the values in a data set are spread.

The range

The range is the easiest of this group of summary statistics to calculate. The range of a set of data is the difference between the highest and lowest values in that set.

It is usually not too difficult to locate the highest and lowest values in a set of data. Only when there is a very large number of observations might the job be made more difficult. In the example above that compared the results on a mathematics test by two groups, the range for Group A is found by subtracting the lowest score (45) from the highest score (55), giving 55 − 45 = 10. Similarly, the range for Group B is 90 − 10 = 80.

Statistically we can write the lowest score as minX and the highest score as maxX and so the range can be found using the formula

Range = maxX − minX.

The values of maxX and minX can be found using a graphics calculator as you will see later in the chapter.

While the range gives us some idea about the spread of the data it is not terribly informative since it gives us no idea of how the data are distributed between the highest and lowest values.
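The formula Range = maxX − minX can be sketched in Python for the two groups of test results quoted earlier in this section. The function name `data_range` is an illustrative choice.

```python
from statistics import mean


def data_range(data):
    """Range = highest value (maxX) minus lowest value (minX)."""
    return max(data) - min(data)


group_a = [45, 46, 47, 48, 50, 50, 52, 53, 54, 55]
group_b = [10, 20, 30, 40, 50, 50, 60, 70, 80, 90]

# Both groups share the same mean, but their ranges differ greatly.
print(mean(group_a), data_range(group_a))
print(mean(group_b), data_range(group_b))
```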

The interquartile range

We have seen that the median divides a set of data in half. Similarly, quartiles divide a set of data in quarters. The symbols used to refer to these quartiles are Q₁, Q₂ and Q₃.

The middle quartile, Q₂, is the median.

The interquartile range is IQR = Q₃ − Q₁.

The interquartile range gives us the range of the middle 50% of values in our set of data. There are four steps to locating Q₁ and Q₃.

Step 1. Write down the data in ordered form from lowest to highest.

Step 2. Locate the median; that is, locate Q₂.

Step 3. Now consider just the lower half of the set of data. Find the middle score. This score is Q₁.

Step 4. Now consider just the upper half of the set of data. Find the middle score. This score is Q₃.

The four cases given below illustrate this method.

Case 1

Consider a set of data containing the 6 observations: 3 6 10 12 15 21. The data are already ordered. The median is 11.

Consider the lower half of the set, which is 3 6 10. The middle score is 6, so Q₁ = 6. Consider the upper half of the set, which is 12 15 21. The middle score is 15, so Q₃ = 15.

Case 2

Consider a set of data containing the 7 observations: 4 9 11 13 17 23 30. The data are already ordered. The median is 13.

Consider the lower half of the set, which is 4 9 11. The middle score is 9, so Q₁= 9.

Consider the upper half of the set, which is 17 23 30. The middle score is 23, so Q₃= 23.

Case 3

Consider a set of data containing the 8 observations: 1 3 9 10 15 17 21 26. The data are already ordered. The median is 12.5.

Consider the lower half of the set, which is 1 3 9 10. The middle score is 6, so Q₁= 6.

Consider the upper half of the set, which is 15 17 21 26. The middle score is 19, so Q₃= 19.

Case 4

Consider a set of data containing the 9 observations: 2 7 13 14 17 19 21 25 29. The data are already ordered. The median is 17.

Consider the lower half of the set, which is 2 7 13 14. The middle score is 10, so Q₁= 10.

Consider the upper half of the set, which is 19 21 25 29. The middle score is 23, so Q₃= 23.
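The four-step method can be sketched in Python as follows. This is a minimal sketch of the method described in this section, in which the median is left out of both halves when the number of observations is odd; the function names `quartiles` and `iqr` are illustrative, and at least two observations are assumed. Other quartile conventions exist, so software may give slightly different values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(data)              # Step 1: order the data
    n = len(ordered)
    half = n // 2
    q2 = median(ordered)                # Step 2: locate the median
    lower = ordered[:half]              # Step 3: lower half of the data
    # Step 4: upper half; skip the median itself when n is odd.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), q2, median(upper)


def iqr(data):
    """Interquartile range, IQR = Q3 - Q1."""
    q1, _, q3 = quartiles(data)
    return q3 - q1


print(quartiles([4, 9, 11, 13, 17, 23, 30]))            # Case 2
print(quartiles([1, 3, 9, 10, 15, 17, 21, 26]))         # Case 3
print(quartiles([2, 7, 13, 14, 17, 19, 21, 25, 29]))    # Case 4
```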

A graphics calculator provides possibly the fastest way of locating quartiles and hence finding the value of the interquartile range.

Worked example 5

The ages of the patients who attended the casualty department of an inner suburban hospital on one particular afternoon are shown below.

14 3 27 42 19 17 73 60 62 21 23 2 5 58 33 19 81 59 25 17 69

Find the interquartile range of these data.

THINK / WRITE

1. Order the data.
   2 3 5 14 17 17 19 19 21 23 25 27 33 42 58 59 60 62 69 73 81

2. Find the median.
   The median is 25 since ten scores lie below it and ten lie above it.

3. Find the middle score of the lower half of the data.
   For the scores 2 3 5 14 17 17 19 19 21 23, the middle score is 17. So, Q₁ = 17.

4. Find the middle score of the upper half of the data.
   For the scores 27 33 42 58 59 60 62 69 73 81, the middle score is 59.5. So, Q₃ = 59.5.

5. Calculate the interquartile range.
   IQR = Q₃ − Q₁ = 59.5 − 17 = 42.5

In most cases we are asked to find the interquartile range of a grouped distribution. This requires us to draw a cumulative frequency polygon (also called an ogive) and find the 25th and 75th percentiles. A percentile is a measure of where in a set of scores an individual score lies. For example, the 25th percentile has 25% of scores below it and 75% above it.

To find the interquartile range, draw a second vertical axis that shows the 25th, 50th and 75th percentiles. A line is drawn from each of these percentiles across to the ogive and then down to the horizontal axis. The values of the quartiles can then be read from the horizontal axis.

The median is the score that is found at the 50th percentile.
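Reading a percentile from an ogive amounts to interpolating along the straight segments of the cumulative frequency polygon. The Python sketch below illustrates this idea under stated assumptions: the polygon joins the upper boundary of each class, the data values are hypothetical (they do not come from this chapter), and the function name `percentile_from_ogive` is an illustrative choice.

```python
def percentile_from_ogive(upper_boundaries, frequencies, percent, lower_boundary=0.0):
    """Estimate a percentile by linear interpolation along an ogive.

    upper_boundaries -- upper boundary of each class, in increasing order
    frequencies      -- frequency of each class
    percent          -- percentile wanted, e.g. 25, 50 or 75
    lower_boundary   -- lower boundary of the first class (assumed 0 here)
    """
    total = sum(frequencies)
    target = total * percent / 100      # cumulative frequency to reach
    cumulative = 0
    left = lower_boundary
    for right, f in zip(upper_boundaries, frequencies):
        if f > 0 and cumulative + f >= target:
            # Move along this segment of the polygon in proportion.
            return left + (target - cumulative) / f * (right - left)
        cumulative += f
        left = right
    return left


# Hypothetical grouped data: classes 0-10, 10-20, 20-30, 30-40.
boundaries = [10, 20, 30, 40]
freqs = [5, 10, 20, 5]

q1 = percentile_from_ogive(boundaries, freqs, 25)
q2 = percentile_from_ogive(boundaries, freqs, 50)
q3 = percentile_from_ogive(boundaries, freqs, 75)
print(q1, q2, q3, q3 - q1)
```

Values read by eye from a hand-drawn ogive will only approximate the interpolated values.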

Worked example 6

Parents are often shocked at the amount of money their children spend. The data below give the amount spent (to the nearest whole dollar) by each child in a group that was taken on an excursion to the Exhibition.

15 12 17 23 21 19 16 11 17 18 23 24 25 21 20 37 17 25 22 21 19

Calculate the interquartile range for these data.

THINK / DISPLAY

1. Enter the data.
   (a) Press STAT.
   (b) Select 1:Edit by pressing ENTER.
   (c) Enter the data in L1.

2. Obtain the values of the quartiles.
   (a) Press STAT.
   (b) Select CALC.
   (c) Select 1:1–Var Stats by pressing ENTER.
   (d) Enter L1 (press 2nd [L1]). Press ENTER.

3. A list of statistics appears. We shall be using a number of these later. We are looking for the first and third quartiles. Scroll down the screen using the ▼ key.
   Q₁ = 17 and Q₃ = 23
   So, IQR = 23 − 17 = 6

Worked example 7

The cumulative frequency histogram and polygon at right show the number of customers who order different volumes of concrete from a readymix concrete company during a day.

[Figure: cumulative frequency histogram and polygon. Axis labels as printed: ‘Number of customers’ and ‘Cumulative frequency’ (0 to 50); classes centred on 0.25, 0.75, 1.25, 1.75, 2.25 and 2.75.]

Find the:

a median

b interquartile range for this distribution.

THINK / DISPLAY/WRITE

a 1. Draw a vertical axis showing the percentiles.
     [Figure: the same graph with a second vertical axis marked 0%, 25%, 50%, 75% and 100%.]

  2. Draw a line for the 50th percentile to the ogive and estimate the median.
     Median = 0.9

b 1. Draw a line for the 25th and 75th percentiles and estimate these values.
     Lower quartile = 0.4
     Upper quartile = 1.6

  2. Calculate the interquartile range by subtracting the lower quartile from the upper quartile.
     Interquartile range = 1.6 − 0.4 = 1.2

remember

1. The range of a set of data is the difference between the highest and lowest values in that set.

2. The interquartile range IQR =Q₃−Q₁.

3. The interquartile range gives us the range of the middle 50% of values in our set of data.

4. There are four steps to locating Q₁ and Q₃.

Step 1. Write down the set of data in ordered form from lowest to highest.

Step 2. Locate the median, that is, locate Q₂.

Step 3. Now consider just the lower half of the set of data. Find the middle score. This score is Q₁.

Step 4. Now consider just the upper half of the set of data. Find the middle score. This score is Q₃.

5. The interquartile range of a grouped distribution is estimated from an ogive.

Exercise 10B Range and interquartile range

1 Write down the range of the sets of data shown in the following stem plots. The key for each stem plot is 3|4 = 34.

a
Stem | Leaf
0 | 7
1 | 2 3
2 | 2 4 5 7 9
3 | 0 2 3 6 8 8
4 | 4 7 8 9 9
5 | 2 7 8
6 | 1 3

b
Stem | Leaf
0 | 0 0 1 1
0 | 2 2 3 3
0 | 4 4 5 5 5 5 5 5 5 5
0 | 6 6 6 6 7
0 | 8 8 8 9
1 | 0 0 1
1 | 3 3
1 | 5 5
1 | 7
1 |

c
Stem | Leaf
0 | 1
0 | 2
0 | 4 4 5
0 | 6 6 6 7
0 | 8 8 8 8 9 9
1 | 0 0 0 1 1 1 1
1 | 2 2 2 3 3 3
1 | 4 4 5 5
1 | 6 7 7
1 | 8 9

d
Stem | Leaf
3 | 1
3 |
3 |
3 | 6
3 | 8 9
4 | 0 0 1 1 1
4 | 2 2 3 3 3 3
4 | 4 5 5 5
4 | 6 7
4 | 9

e
Stem | Leaf
60 | 2 5 8
61 | 1 3 3 6 7 8 9
62 | 0 1 2 4 6 7 8 8 9
63 | 2 2 4 5 7 8
64 | 3 6 7
65 | 4 5 8
66 | 3 5
67 | 4

2 For each of the following sets of data, write down the range.

a 2 4 6 7 9

b 12 15 17 19 21

c 3 4 5 6 7 8 9

d 3 5 7 8 12 13 15 16

e 12 13 15 16 18 19 21 23 24 26

f 3 8 4 2 1 6 5

g 16 21 14 28 23 15 11 19 25

h 7 4 3 4 9 5 10 4 2 11

i 29 23 22 33 26 18 37 22 16

3 a On the 9th of August, the number of cars that stopped at the drive-in area at a McBurger restaurant during each hour (from 7.00 am until 10.00 pm) is shown below.

14 18 8 9 12 24 25 15 18 25 24 21 25 24 14

Find the interquartile range of this set of data.

b At the nearby Kenny’s Fried Chicken restaurant on the same day, the number of cars stopping during each hour that it was open is shown below.

7 9 13 16 19 12 11 18 20 19 21 20 18 10

Find the interquartile range of these data.

4 Write down a set of data for which n = 5, the median is 6 and the range is 7.

5 Write down a set of data for which n = 8, the median is 7.5 and the range is 10.

6 (multiple choice) The quartiles for a set of data are calculated and found to be Q₁ = 13, Q₂ = 18, and Q₃ = 25. Which of the following statements is true?

A The interquartile range of the data is 5.

B The interquartile range of the data is 7.

C The interquartile range of the data is 12.

D The median is 12.

E The median is 19.

It is recommended that a graphics calculator be used for questions 7 and 8.

7 For each of the following sets of data find the median, the interquartile range and the range.

a 16 19 12 11 8 6 7 15 26 32 32 18 15 43 51 31 29 23 45 23

b 22 23 25 25 27 21 36 19 31 29 32 28 39 31 29 27 20 22 30 29

c 1.2 6.1 2.3 3.7 4.1 5.4 2.4 3.7 1.5 5.2 3.7 3.8 6.1 6.3 2.4 7.1 3.6 4.9 1.2

8 For each set of data shown on the stem plots, find the median, the interquartile range and the range.

a Key: 4|2 = 42

Stem | Leaf
2 | 3 5 5 6 7 8 9 9
3 | 0 2 2 3 4 6 6 7 8 8
4 | 2 2 4 5 6 6 6 7 9
5 | 0 3 3 5 6
6 | 2 4 5 9
7 | 2 7
8 |
9 |
10 |
11 | 4

b Key: 2|5 = 25

Stem | Leaf
1 | 4
1 |
2 | 1 4
2 | 5 7 8 8 9
3 | 1 2 2 2 4 4 4 4
3 | 5 5 5 6
4 | 3 4
4 |

9 The cumulative frequency histogram and polygon at right display the results of a survey of 50 drivers who were asked about the number of speeding fines they have received.

[Figure: cumulative frequency histogram and polygon. Horizontal axis: No. of speeding fines received by drivers (0 to 5); vertical axis: Cumulative frequency (0 to 50).]

a Use the ogive to find the median of the distribution.

b Find the lower quartile.

c Find the upper quartile.

d Calculate the interquartile range.

10 The frequency distribution table below shows the result of a survey of 90 households who were asked about the number of times they had been the victim of crime.

Score    Frequency
0        26
1        31
2        22
3        8
4        3

a Add a column for cumulative frequency to the table.

b Draw a cumulative frequency histogram and polygon.

c Use your graph to find the median of the distribution.

d Calculate the interquartile range.

Range of soft drink amounts

Take another look at Yvonne’s quality control check. We have previously found two measures of central tendency for the two soft drink dispensing machines, but these alone are not enough to state which machine dispenses soft drink most accurately.

1 Find the range of amounts dispensed by each machine.

2 Find the interquartile range of amounts dispensed by each machine.

3 From these results, which machine appears to dispense soft drink most accurately?

The standard deviation

The standard deviation is the most sophisticated and also the most useful measure of spread.

The standard deviation σ can be calculated by using the following formula:

\[\sigma = \sqrt{\frac{\sum f(x_i - \bar{x})^2}{n}}\]

where \(x_i\) is each score, f is its frequency, \(\bar{x}\) is the mean and n is the number of scores.

To calculate the standard deviation by hand, use the following steps:

Step 1. Find the mean.

Step 2. Find the difference between each piece of data and the mean.

Step 3. Square the differences.

Step 4. Add the squared differences.

Step 5. Divide by the number of scores.

Step 6. Take the square root.
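The six steps can be followed one at a time in a short Python sketch. The function name `population_sd` is an illustrative choice, and the data are the lolly-packet counts from worked example 8.

```python
import math


def population_sd(data):
    """Standard deviation found by the six steps above (divisor n)."""
    n = len(data)
    mean = sum(data) / n                        # Step 1: find the mean
    differences = [x - mean for x in data]      # Step 2: difference from the mean
    squared = [d ** 2 for d in differences]     # Step 3: square the differences
    total = sum(squared)                        # Step 4: add the squared differences
    variance = total / n                        # Step 5: divide by the number of scores
    return math.sqrt(variance)                  # Step 6: take the square root


lollies = [14, 14, 13, 15, 16, 13, 14, 17]
sigma = population_sd(lollies)
print(round(sigma, 4))
```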

This algorithm is used to find the standard deviation in the following worked example.

WORKED Example 8

The following data give the number of lollies in each of 8 packets. Find the standard deviation of the data.

14, 14, 13, 15, 16, 13, 14, 17

1 THINK: Find the mean.
  WRITE: $\bar{x} = \frac{14 + 14 + 13 + 15 + 16 + 13 + 14 + 17}{8} = 14.5$

2 THINK: Find the difference between each score and the mean.
  WRITE: Differences from mean: −0.5, −0.5, −1.5, 0.5, 1.5, −1.5, −0.5, 2.5

3 THINK: Square each difference.
  WRITE: Squared differences: 0.25, 0.25, 2.25, 0.25, 2.25, 2.25, 0.25, 6.25

4 THINK: Add the squared differences.
  WRITE: 0.25 + 0.25 + 2.25 + 0.25 + 2.25 + 2.25 + 0.25 + 6.25 = 14

5 THINK: Divide by the number of scores.
  WRITE: 14 ÷ 8 = 1.75

6 THINK: Take the square root and round to 4 decimal places.
  WRITE: $\sqrt{1.75} = 1.3229$

The standard deviation σ = 1.3229.

Fortunately, you will not have to go through this series of steps each time you wish to calculate a standard deviation. Your calculator should have a built-in program for the computation of standard deviations.

Predicting the mean and standard deviation of a population from a sample

It is not always practical to measure a particular statistic for a whole population, so usually a sample of the population is taken. It is found that the mean of a sample is a reliable estimate of the mean of a population, but the standard deviation of a population tends to be slightly more than the standard deviation of a sample drawn from it. In other words, the complete population tends to show slightly more variability than a sample drawn from it.

A formula used to predict the standard deviation, s, of a complete population from a sample of scores is:

$$s = \sqrt{\frac{\sum f(x_i - \bar{x})^2}{n - 1}}$$

Notice that the only difference between the formulas is that the divisor has changed from n to n − 1. Your calculator is also equipped with a built-in program for the calculation of this formula. It is worth checking that you can obtain both results from your calculator and can distinguish between them. Try reworking the data from worked example 8. You should find that the standard deviation among the 8 packets of lollies was σ = 1.3229. If you were to use this sample to predict the standard deviation of all packets of lollies, then the estimate would be s = 1.4142.
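The two divisors can be compared side by side in Python. The sketch below assumes the standard library `statistics` module, where `pstdev` divides by n and `stdev` divides by n − 1; the variable names are illustrative only.

```python
import statistics

lollies = [14, 14, 13, 15, 16, 13, 14, 17]

sigma = statistics.pstdev(lollies)  # divisor n: spread of these 8 packets
s = statistics.stdev(lollies)       # divisor n - 1: estimate for all packets

print(round(sigma, 4))  # 1.3229
print(round(s, 4))      # 1.4142
```

These correspond to the two values a calculator reports for the same list of scores.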

A final point worth noting about the standard deviation is that, despite its sophistication, it is still influenced to a high degree by extreme values. Care should be taken when using this statistic with data that include such values.

WORKED Example 9

The following frequency distribution gives the prices paid by a car wrecking yard for 40 car wrecks.

Price ($)    Frequency
0–500        2
500–1000     4
1000–1500    8
1500–2000    10
2000–2500    7
2500–3000    6
3000–3500    3

a Find the mean and standard deviation in the price paid for these wrecks.

b Estimate the mean and standard deviation in the price paid for wrecks by this yard in general.

a THINK:
  1 Calculate the midpoint of each price range and enter this in L1 on your graphics calculator (press STAT, select EDIT and 1:Edit).
  2 Enter the frequency values in L2.
  3 Press STAT, select CALC and 1:1–Var Stats and enter L1, L2.
  4 Press ENTER to calculate the statistics.
  5 The mean is shown as $\bar{x}$ and the standard deviation as σ_x.
  WRITE: Mean price = $1825
         Standard deviation = $787

b THINK: Using the 1–Var Stats output screen, we use $\bar{x}$ to estimate the mean, and s_x to estimate the standard deviation of a wider population.
  WRITE: Population mean price (estimate) = $1825
         Population standard deviation (estimate) = $797.03
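The calculator results for the car wreck prices can be checked with a short Python sketch. It assumes, as the worked example does, that every wreck in a class is priced at the class midpoint; the names `classes`, `sigma_x` and `s_x` are illustrative and simply mirror the calculator's labels.

```python
import math

# (lower bound, upper bound, frequency) for each price class, in dollars
classes = [
    (0, 500, 2), (500, 1000, 4), (1000, 1500, 8), (1500, 2000, 10),
    (2000, 2500, 7), (2500, 3000, 6), (3000, 3500, 3),
]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]
n = sum(freqs)

# Mean of grouped data: each midpoint is weighted by its frequency
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))

sigma_x = math.sqrt(sum_sq / n)        # standard deviation of these 40 wrecks
s_x = math.sqrt(sum_sq / (n - 1))      # estimate for wrecks in general

print(n, mean)           # 40 1825.0
print(round(sigma_x))    # 787
print(round(s_x, 2))     # 797.03
```

Because the midpoints stand in for the actual prices, both standard deviations are approximations of the values the raw data would give.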

remember

1. The standard deviation of a group of scores can be found using the formula:

$$\sigma = \sqrt{\frac{\sum f(x_i - \bar{x})^2}{n}}$$

2. The standard deviation of a population can be predicted from a sample of scores by using the formula:

$$s = \sqrt{\frac{\sum f(x_i - \bar{x})^2}{n - 1}}$$

3. The lower the standard deviation, the closer together the scores.

4. On your calculator the population standard deviation is denoted σ_x and the sample standard deviation s_x.

10C The standard deviation

WORKED Example 8

1 Use the algorithm (series of steps) to find the standard deviation of the following data without using your calculator’s in-built program.

a 3, 5, 8, 2, 7, 1, 6, 5
b 11, 8, 7, 12, 10, 11, 14
c 25, 15, 78, 35, 56, 41, 17, 24
d 5.2, 4.7, 5.1, 12.6, 4.8

2 Now use the calculator’s in-built program to check each of the standard deviations that you calculated in the previous question.

You may use your calculator’s in-built program for finding the standard deviation and mean in the rest of the questions.

3 Consider the following two groups of people.

[Figure: Group A and Group B, height (cm); labels shown: 160 170 170 170 170 170 180]

a Calculate the mean height, median height and mode height for each group. What do you notice?

b Are the groups really the same?

c Which group would you expect to show the greatest range in heights?

d Which group would you expect to show the greatest interquartile range in heights?

e Which group would you expect to show the greatest standard deviation in heights?

f Calculate these statistics to confirm your predictions.

4 The following frequency distribution table shows the number of visitors that came to a city museum during the course of a month.

Visitor number   80–   90–   100–   110–   120–   130–
Frequency        1     4     11     9      4      2

a Find the range of the data.

b Find the mean of the data.

c Find the standard deviation and variance of the data.

5 multiple choice

Calculate the standard deviation of the following data to 3 decimal places.

Score       10–   20–   30–   40–   50–
Frequency   1     6     9     4     1

A 3.027 B 9.437 C 9.209 D 34.048 E None of the above.

6 The following frequency distribution table shows the life expectancy of 175 household light globes.

Life (h)   Frequency
200–       2
250–       5
300–       12
350–       25
400–       42
450–       38
500–       26
550–       15
600–       7
650–       3

a Find the range of the data.

b Find the mean and standard deviation in the lifetimes of this sample of light globes.

c Estimate the mean and standard deviation in the lifetimes of all light globes of this brand.

7 The following frequency distribution table shows the distribution of daily maximum temperatures during the course of a full year.

Maximum temperature (°C)   Number of days
0–                         4
5–                         22
10–                        95
15–                        124
20–                        94
25–                        19
30–                        5
35–                        2

a Add a cumulative frequency column to the table.

b Draw an ogive of the data.

c Find the upper and lower quartiles of the data and calculate the interquartile range.

d Use the ogive to find the median (50th percentile) of the data.

e Find the mean of the data.

f Find the standard deviation of the data.

g Find the range of the data.

8 The following data give the number of fruit that have formed on each of 30 trees in an orchard.

45 48 52 36 38 72 36 74 56 46
81 73 46 48 44 39 52 58 57 65
60 53 54 58 41 44 47 76 68 55

a Complete a frequency distribution table for the data.

b Draw an ogive of the data.

c Use the ogive to find the median, lower quartile and upper quartile of the data.

d Find the interquartile range of the data.

e Find the mean of the data.

f Find the standard deviation of the data.

g Estimate the standard deviation in the number of fruit for the whole orchard.

h Find the range of the data.

9 The polygons drawn at right show the lifetimes of two samples of different brands of toaster elements when subject to continued use.

[Figure: frequency polygons of lifetime in hours (axis from 150 to 230) for Hot Wire and Electric Mate]

a Which brand has the longest mean life?

b Estimate the mean life of each brand.

c Which brand has the greatest standard deviation in its performance?

d What does this say about the consistency of this element?

e Which brand is best? Give a case in support of Electric Mate. Give a case in support of Hot Wire.

10 Crunch and Crinkle are two brands of potato crisps. Each is sold in packets nominally of the same size and for the same price. Upon investigation of a sample of packets of each it is found that Crunch and Crinkle have the same mean weight (25 g). The standard deviation of the weights of Crunch is, however, 5 g and the standard deviation of the weights of Crinkle is 2 g. Which brand do you think would represent the best value for money under these circumstances? Why?

Standard deviation of soft drink amounts

Returning to Yvonne’s measurements, we are now ready to find the standard deviation of the amounts of soft drink dispensed by each machine.

1 Use the data lists MA and MB stored on your graphics calculator to find the standard deviation for each machine. If you do not have a graphics calculator you can obtain the standard deviation by re-entering each set of data.

2 Explain your choice of σ_x or s_x.

3 Interpret the results in terms of assessing which machine dispenses the soft drink most accurately.

Boxplots

Five number summary

A five number summary is a list consisting of the lowest score, lower quartile, median, upper quartile and greatest score of a set of data.

A five number summary gives information about the spread of a set of data. The convention is not to detail the numbers with labels but to present them in order; so, for example, the five number summary:

4 15 21 23 28

would be interpreted as lowest score 4, lower quartile 15, median 21, upper quartile 23 and greatest score 28.

Boxplots

A boxplot (or box-and-whisker plot) is a graph of the five number summary. It is a powerful way to show the spread of data. Boxplots consist of a central divided box with attached ‘whiskers’. The box spans the interquartile range. The median is marked by a vertical line inside the box. The whiskers indicate the range of scores:

[Figure: boxplot of the five number summary 4 15 21 23 28, with labels indicating the lowest score, the lower quartile, the median, the upper quartile and the greatest score]

Boxplots are always drawn to scale. They are presented either with the five number summary figures attached as labels (as in the diagram above) or with a scale presented alongside the boxplot.

WORKED Example 10

From the following five number summary find:

a the median b the interquartile range c the range.

29 37 39 44 48

THINK: The figures are presented in the order of lowest score, lower quartile, median, upper quartile, greatest score.
WRITE: X_min = 29, Q_L = 37, median = 39, Q_U = 44, X_max = 48

a THINK: The median is 39.
  WRITE: Median = 39

b THINK: The interquartile range is the difference between the upper and lower quartiles.
  WRITE: IQR = Q_U − Q_L
             = 44 − 37
             = 7

c THINK: The range is the difference between the greatest score and the lowest score.
  WRITE: Range = X_max − X_min
               = 48 − 29
               = 19

Interpreting a boxplot

The boxplot neatly divides the data into four sections. One-quarter of the scores lie between the lowest score and the lower quartile, one-quarter between the lower quartile and the median, one-quarter between the median and the upper quartile, and one-quarter between the upper quartile and the greatest score. The reader can easily see where clustering of the data occurs. For example, a small box with relatively long whiskers would indicate that half of the data (from Q_L to Q_U) would be confined to a small range and the data could be described as clustered. A wide box with relatively short whiskers would indicate that half of the data (from Q_L to Q_U) would be spread over a wide range and the data could be described as spread. Consider the boxplots below with their matching histograms.

[Figure: five boxplots with their matching histograms (frequency f against size): clustered data, spread data, normally distributed data, positively skewed data and negatively skewed data]

Identification of extreme values

Extreme values often make the whiskers appear longer than they should and hence give the appearance that the data are spread over a much greater range than they really are.

If an extreme value occurs in a set of data it can be denoted by a small cross on the boxplot. The whisker is then shortened to the next largest (or smallest) figure.

The boxplot below shows that the lowest score was 5. This was an extreme value as the rest of the scores were located within the range 15 to 42.

[Figure: boxplot drawn on a scale from 0 to 45, with the extreme value 5 marked by a cross]

WORKED Example 11

The following stem-and-leaf plot gives the speed of 25 cars caught by a roadside speed camera.

Key: 8  | 2 = 82 km/h
     8* | 6 = 86 km/h

Stem | Leaf
8    | 2 2 4 4 4 4
8*   | 5 5 6 6 7 9 9 9
9    | 0 1 1 2 4
9*   | 5 6 9
10   | 0 2
10*  |
11   | 4

a Prepare a five number summary of the data.

b Draw a boxplot of the data. (Identify any extreme values.)

c Describe the distribution of the data.

1 THINK: First identify the positions of the median and upper and lower quartiles. There are 25 pieces of data. The median is the $\frac{n+1}{2}$th score. The lower quartile is the median of the lower half of the data. The upper quartile is the median of the upper half of the data (each half contains 12 scores).
  WRITE: The median is the $\frac{25+1}{2}$th score, that is, the 13th score.
         The Q_L is the $\frac{12+1}{2}$th score in the lower half, that is, the 6.5th score: halfway between the 6th and 7th scores.
         The Q_U is halfway between the 6th and 7th scores in the upper half of the data.

2 THINK: Mark the position of the median and upper and lower quartiles on the stem plot.
  WRITE: [Stem plot as above, with Q_L marked between 84 and 85, the median marked at 89 (the 13th score) and Q_U marked between 94 and 95]

a THINK: Write the five number summary:
  The lowest score is 82.
  The lower quartile is between 84 and 85, that is, 84.5.
  The median is 89.
  The upper quartile is between 94 and 95, that is, 94.5.
  The greatest score is 114.
  WRITE: Five number summary: 82, 84.5, 89, 94.5, 114

b THINK: Start by ruling a suitable scale. Remember to include the units of measurement. The box represents the interquartile range so it runs from 84.5 to 94.5. The median is a vertical line in the box at 89. The whiskers should extend to the lowest score (82) and the highest score (114). But the score 114 is a great deal higher than any of the others in the set and might be regarded as an extreme value. It should be indicated by a cross and the whisker will extend only as far as 102 (the second largest number in the set).
  WRITE: [Figure: boxplot drawn on a scale from 80 to 110 km/h, with the extreme value 114 marked by a cross]

c THINK: Even when the extreme value is excluded the data appear to be skewed, with high values being spread over a much greater range.
  WRITE: The data are skewed (positively) and include one extremely high value.

Graphics Calculator tip! Creating a boxplot from a frequency table

1. Clear the Y= editor (press Y= and CLEAR) and turn off any existing plots by pressing 2nd [STAT PLOT] and choosing 4: PlotsOff.

2. Press STAT and select 1:Edit to enter x data in L1 and frequencies in L2 if data are grouped.

3. Press 2nd [STAT PLOT] then ENTER and select settings as below (use arrow keys and press ENTER to make each choice). If data are not grouped, leave Freq = 1.

4. Press ZOOM, choose 9: ZoomStat then press ENTER.

5. Explore the plot.
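The five number summary from worked example 11 can be reproduced in Python. The sketch below is an illustration, not part of the original text: the function name `five_number_summary` is hypothetical, the speeds are read from the stem-and-leaf plot, and the quartiles follow the chapter's method (the medians of the lower and upper halves, leaving out the middle score when the number of scores is odd). Statistical software often uses other quartile conventions, so its answers may differ slightly.

```python
import statistics


def five_number_summary(scores):
    """Return (lowest score, Q_L, median, Q_U, greatest score)."""
    data = sorted(scores)
    half = len(data) // 2          # size of each half; middle score excluded if n is odd
    lower_half = data[:half]
    upper_half = data[-half:]
    return (
        data[0],
        statistics.median(lower_half),
        statistics.median(data),
        statistics.median(upper_half),
        data[-1],
    )


speeds = [82, 82, 84, 84, 84, 84, 85, 85, 86, 86, 87, 89, 89, 89,
          90, 91, 91, 92, 94, 95, 96, 99, 100, 102, 114]

print(five_number_summary(speeds))   # (82, 84.5, 89, 94.5, 114)
```

The interquartile range and range follow directly from the result: 94.5 − 84.5 = 10 and 114 − 82 = 32.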

remember

1. A five number summary is a list consisting of the lowest score, lower quartile, median, upper quartile and greatest score of a set of data.

2. A boxplot is a graphical representation of a five number summary and is a powerful tool to show the spread of data.

3. The box spans the interquartile range; the median is marked by a vertical line inside the box and the whiskers extend to the lowest and greatest scores.

4. Boxplots are always drawn to scale.

5. If an extreme value occurs in a set of data, it can be denoted by a small cross; the whisker is then shortened to the next largest (or smallest) value.

10D Boxplots

WORKED Example 10

1 From the following five number summary find:

a the median

b the interquartile range

c the range.

6, 11, 13, 16, 32

2 From the following five number summary find:

a the median

b the interquartile range

c the range.

101, 119, 122, 125, 128

3 From the following five number summary find:

a the median

b the interquartile range

c the range.

39.2, 46.5, 49.0, 52.3, 57.8

4 The boxplot below shows the distribution of final points scored by a football team over a season’s roster.

a What was the team’s greatest points score?

b What was the team’s least points score?

c What was the team’s median points score?

d What was the range of points scored?

e What was the interquartile range of points scored?

5 The boxplot below shows the distribution of data formed by counting the number of honey bears in each of a large sample of packs.

[Figure: boxplot drawn on a scale from 30 to 60]

a What was the largest number of honey bears in any pack?

b What was the smallest number of honey bears in any pack?

c What was the median number of honey bears in any pack?

d What was the range of numbers of honey bears per pack?

e What was the interquartile range of honey bears per pack?

Questions 6 to 8 refer to the following boxplot.

[Figure: boxplot drawn on a scale from 5 to 30, with one value marked by a cross]

6 multiple choice

The median of the data is:

A 5 B 20 C 23 D 25 E 31

7 multiple choice

The interquartile range of the data:

A is 23
B is 26
C is 5
D is 20 to 25
E cannot be determined because of extreme values.

8 multiple choice

Which of the following is not true of the data represented by the boxplot?

A One-quarter of the scores are between 5 and 20.

B Half of the scores are between 20 and 25.

C The lowest quarter of the data is spread over a wide range.

D Most of the data are contained between the scores of 5 and 20.

E The data are skewed left.

9 The number of sales made each day by a salesperson is recorded over a 2-week period:

25, 31, 28, 43, 37, 43, 22, 45, 48, 33

a Prepare a five number summary of the data. (There is no need to draw a stem-and-leaf plot of the data. Just arrange them in order of size.)

b Draw a boxplot of the data.

10 The data below show monthly rainfall in millimetres.

J    F    M    A    M    J    J    A    S    O    N    D
10   12   21   23   39   22   15   11   22   37   45   30

a Prepare a five number summary of the data.

WORKED Example 11

11 The stemplot at right details the age of 25 offenders who were caught during random breath testing.

Key: 1 | 8 = 18 years

Stem | Leaf
1    | 8 8 9 9 9
2    | 0 0 0 1 1 3 4 6 9
3    | 0 1 2 7
4    | 2 5
5    | 3 6 8
6    | 6
7    | 4

c Describe the distribution of the data.

12 The following stem-and-leaf plot details the price at which 30 houses in a particular suburb sold.

(You might like to use a graphics calculator for this question.)

Key: 12 | 4 = $124 000

Stem 12 13 14 15 16 17
Leaf 4 7 9 0 0 2 5 5

13 The following data detail the number of hamburgers sold by a fast food outlet every day over a 4-week period.

M     T     W     T     F     S     S
125   144   132   148   187   172   181
134   157   152   126   155   183   188
131   121   165   129   143   182   181
152   163   150   148   152   179   181

a Prepare a stem-and-leaf plot of the data. (Use a class size of 10.)

14 The following data show the ages of 30 mothers upon the birth of their first baby.

22 25 31 21 29 22 18 32 19 33
18 17 17 19 23 23 22 48 22 23
25 24 24 18 24 28 23 20 20 20

a Prepare a stem-and-leaf plot of the data. (Use a class size of 5.)

b Draw a boxplot of the data. Indicate any extreme values appropriately.

c Describe the distribution in words. What does the distribution say about the age at which mothers have their first baby?

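15 multiple choice

Match the boxplot with its most likely histogram.

[Figure: a boxplot and five candidate histograms, A to E, each showing frequency f against size]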

Back-to-back stem plots

In chapter 9, we saw how to construct a stem plot for a set of univariate data. We can also extend a stem plot so that it displays bivariate data. Specifically, we shall create a stem plot that displays the relationship between a numerical variable and a categorical variable. We shall limit ourselves in this section to categorical variables with just two categories, for example sex. The two categories are used to provide two, back-to-back leaves of a stem plot.

A back-to-back stem plot is used to display bivariate data, involving a numerical variable and a categorical variable with 2 categories.

WORKED Example 12

The girls and boys in Grade 4 at Kingston Primary School submitted projects on the Olympic Games. The marks they obtained out of 20 are given below.

Girls’ marks   16 17 19 15 12 16 17 19 19 16
Boys’ marks    14 15 16 13 12 13 14 13 15 14

Display the data on a back-to-back stem plot.

1 THINK: Identify the highest and lowest scores in order to decide on the stems.
  WRITE: Highest score = 19
         Lowest score = 12
         Use a stem of 1, divided into fifths.

2 THINK: Create an unordered stem plot first. Put the boys’ scores on the left, and the girls’ scores on the right.
  WRITE:
  Key: 1 | 2 = 12

       Leaf | Stem | Leaf
       Boys |      | Girls
    3 2 3 3 |  1   | 2
  4 5 4 5 4 |  1   | 5
          6 |  1   | 6 7 6 7 6
            |  1   | 9 9 9

3 THINK: Now order the stem plot. The scores on the left should increase in value from right to left, while the scores on the right should increase in value from left to right.
  WRITE:
  Key: 1 | 2 = 12

       Leaf | Stem | Leaf
       Boys |      | Girls
    3 3 3 2 |  1   | 2
  5 5 4 4 4 |  1   | 5
          6 |  1   | 6 6 6 7 7
            |  1   | 9 9 9

The back-to-back stem plot allows us to make some visual comparisons of the two distributions. In the above example the centre of the distribution for the girls is higher than the centre of the distribution for the boys. The spread of each of the distributions seems to be about the same. For the boys, the marks are grouped around the 12–15 marks; for the girls, they are grouped around the 16–19 marks. On the whole, we can conclude that the girls obtained better marks than the boys did.

To get a more precise picture of the centre and spread of each of the distributions we can use the summary statistics discussed in chapter 1. Specifically, we are interested in:

1. the mean and the median (to measure the centre of the distributions), and

2. the interquartile range and the standard deviation (to measure the spread of the distributions).

We saw in chapter 1 that the calculation of these summary statistics is very straightforward and rapid using a graphics calculator.
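An ordered back-to-back stem plot like the one in worked example 12 can also be generated in Python. The following is a minimal sketch under stated assumptions: the function name `back_to_back` is hypothetical, the scores are whole numbers, and every stem is split into fifths (leaves 0–1, 2–3, 4–5, 6–7, 8–9), as in the worked example.

```python
def back_to_back(left, right):
    """Return the rows of an ordered back-to-back stem plot as strings."""
    rows = {}
    for side, scores in enumerate((left, right)):
        for score in scores:
            # score // 2 groups 12-13, 14-15, ... into the same row (fifths)
            rows.setdefault(score // 2, ([], []))[side].append(score % 10)

    table = []
    for row in range(min(rows), max(rows) + 1):
        left_leaves, right_leaves = rows.get(row, ([], []))
        table.append((
            # left leaves increase from right to left
            " ".join(map(str, sorted(left_leaves, reverse=True))),
            row * 2 // 10,                      # the stem for this row
            # right leaves increase from left to right
            " ".join(map(str, sorted(right_leaves))),
        ))

    width = max(len(left_text) for left_text, _, _ in table)
    return [f"{l:>{width}} | {stem} | {r}".rstrip() for l, stem, r in table]


boys = [14, 15, 16, 13, 12, 13, 14, 13, 15, 14]
girls = [16, 17, 19, 15, 12, 16, 17, 19, 19, 16]

for line in back_to_back(boys, girls):
    print(line)
```

Sorting the left-hand leaves in reverse is what makes them increase in value away from the stem, which is the ordering rule stated in step 3.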

WORKED Example 13

The number of ‘how to vote’ cards handed out by various Australian Labor Party and Liberal Party volunteers during the course of a polling day is shown below.

Display the data using a back-to-back stem plot and use this, together with summary statistics, to compare the distributions of the number of cards handed out by the Labor and Liberal volunteers.

1 THINK: Construct the stem plot.
  WRITE:
  Key: 18 | 0 = 180

     Leaf | Stem | Leaf
    Labor |      | Liberal
        0 |  18  |
        3 |  19  |
      4 2 |  20  | 4
    4 1 0 |  21  | 5
  9 6 6 2 |  22  | 6
    8 4 3 |  23  | 3
      7 6 |  24  | 4 5
      7 2 |  25  | 0 3
        3 |  26  | 1 3 6 7
        0 |  27  | 2 2 3 5 9
          |  28  | 0 5 7

2 THINK: Use a graphics calculator to calculate the summary statistics: the mean, the median, the standard deviation and the interquartile range. Enter each set of data as a separate list. (See worked example 6 on how to use your graphics calculator to calculate these values.)
  WRITE:
  For the Labor volunteers:
  Mean = 227.9
  Median = 227.5
  Interquartile range = 36
  Standard deviation = 23.9

  For the Liberal volunteers:
  Mean = 257.5
  Median = 264.5
  Interquartile range = 29.5
  Standard deviation = 23.4

3 THINK: Comment on the relationship.
  WRITE: From the stem plot we see that the Labor distribution is symmetric and therefore the mean and the median are very close, whereas the Liberal distribution is negatively skewed.

  Since the Liberal distribution is skewed, the median is a better indicator of the centre of the distribution than is the mean.

  Comparing the medians, therefore, we have the median number of cards handed out for Labor at 228 and for Liberal at 265, which is a big difference.

  The standard deviations were similar, as were the interquartile ranges. There was not a lot of difference in the spread of the data.

  In essence, the Liberal Party volunteers handed out a lot more ‘how to vote’ cards than the Labor Party volunteers did.
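The summary statistics quoted in this worked example can be checked in Python. This is a sketch under stated assumptions rather than the calculator procedure itself: the two lists are read from the stem plot, the helper names `quartiles` and `summarise` are hypothetical, the quartiles are the medians of the lower and upper halves, and the standard deviation uses the divisor n − 1 (the s_x value), which is the choice that matches the figures quoted.

```python
import statistics


def quartiles(scores):
    """Lower and upper quartiles as medians of the lower and upper halves."""
    data = sorted(scores)
    half = len(data) // 2
    return statistics.median(data[:half]), statistics.median(data[-half:])


def summarise(scores):
    q_lower, q_upper = quartiles(scores)
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "iqr": q_upper - q_lower,
        "sd": statistics.stdev(scores),   # divisor n - 1
    }


labor = [180, 193, 202, 204, 210, 211, 214, 222, 226, 226,
         229, 233, 234, 238, 246, 247, 252, 257, 263, 270]
liberal = [204, 215, 226, 233, 244, 245, 250, 253, 261, 263,
           266, 267, 272, 272, 273, 275, 279, 280, 285, 287]

for name, scores in (("Labor", labor), ("Liberal", liberal)):
    stats = summarise(scores)
    print(name, {key: f"{value:.2f}" for key, value in stats.items()})
```

Printing to two decimal places shows how each value rounds to the figure given in the worked example.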

remember

1. A back-to-back stem plot displays bivariate data involving a numerical variable and a categorical variable with two categories.

2. In the ordered stem plot, the scores on the left side of the stem increase in value from right to left.

3. Together with summary statistics, back-to-back stem plots can be used for comparing two distributions.

10E Back-to-back stem plots

WORKED Example 12

1 The marks (out of 50) obtained for the end-of-term test by the students in German and French classes are given below. Display the data on a back-to-back stem plot.

German   20 38 45 21 30 39 41 22 27 33 30 21 25 32 37 42 26 31 25 37
French   23 25 36 46 44 39 38 24 25 42 38 34 28 31 44 30 35 48 43 34

2 The birth masses of 10 boys and 10 girls (in kilograms, to the nearest 100 grams) are recorded in the table below. Display the data on a back-to-back stem plot.

Boys    3.4 5.0 4.2 3.7 4.9 3.4 3.8 4.8 3.6 4.3
Girls   3.0 2.7 3.7 3.3 4.0 3.1 2.6 3.2 3.6 3.1

3 The number of delivery trucks making deliveries to a supermarket each day over a 2-week period was recorded for two neighbouring supermarkets, supermarket A and supermarket B. The data are shown below.

A   11 15 20 25 12 16 21 27 16 17 17 22 23 24
B   10 15 20 25 30 35 16 31 32 21 23 26 28 29

a Display the data on a back-to-back stem plot.

b Use the stem plot, together with some summary statistics, to compare the distributions of the number of trucks delivering to supermarkets A and B.

4 The marks out of 20 for males and females on a science test for a Year-10 class are given below.

Females   12 13 14 14 15 15 16 17
Males     10 12 13 14 14 15 17 19

a Display the data on a back-to-back stem plot.

b Use the stem plot, together with some summary statistics, to compare the distributions of the marks of the males and the females.

5 The end-of-year English marks for 10 students in an English class were compared over 2 years. The marks for 1998 and for the same students in 1999 are shown below.

1998   30 31 35 37 39 41 41 42 43 46
1999   22 26 27 28 30 31 31 33 34 36

a Display the data on a back-to-back stem plot.

b Use the stem plot, together with some summary statistics, to compare the distributions of the marks obtained by the students in 1998 and 1999.

6 The age and gender of a group of people attending a fitness class are recorded below.

Female   23 24 25 26 27 28 30 31
Male     22 25 30 31 36 37 42 46

a Display the data on a back-to-back stem plot.

b Use the stem plot, together with some summary statistics, to compare the distributions of the ages of the female and male members of the fitness class.

7 The scores on a board game are recorded for a group of kindergarten children and for a group of children in a preparatory school.

Kindergarten   3 13 14 25 28 32 36 41 47 50
Prep. School   5 12 17 25 27 32 35 44 46 52

a Display the data on a back-to-back stem plot.

b Use the stem plot, together with some summary statistics, to compare the distributions of the scores of the kindergarten children and the preparatory school children.

8 multiple choice

The pair of variables that could be displayed on a back-to-back stem plot is:

A the height of a student and the number of people in the student’s household

B the time put into completing an assignment and a pass or fail score on the assignment

C the weight of a businessman and his age

D the religion of an adult and the person’s head circumference

E the income bracket of an employee and the time the employee has worked for the company

9 multiple choice

A back-to-back stem plot is a useful way of displaying the relationship between:

A the proximity to markets (km) and the cost of fresh foods on average per kilogram

B height and head circumference

C age and attitude to gambling (for or against)

D weight and age

E the money spent during a day of shopping and the number of shops visited on that day

Parallel boxplots

We saw in the previous section that we could display relationships between a numerical variable and a categorical variable with just two categories, using a back-to-back stem plot. When we want to display a relationship between a numerical variable and a categorical variable with more than two categories, a parallel boxplot can be used.

A parallel boxplot is obtained by constructing individual boxplots for each distribution, using a common scale.

Construction of individual boxplots was discussed in detail earlier in this chapter (see page 438). In this section we concentrate on comparing distributions represented by a number of boxplots (that is, on the interpretation of parallel boxplots).

CASIO WE 10-14

The four Year-7 classes at Western Secondary College complete the sa