4 Statistical distributions DATA ANALYSIS


Sydneysiders often like to claim that it’s always raining in Melbourne, but is Melbourne really such a wet city? The following data shows the average number of rainy days per month for the two capital cities, and is supplied by the Bureau of Meteorology.

Is it possible to use this data to compare the rainfalls of the two cities and decide which city is wetter? Does Melbourne generally have more rainy days than Sydney, or is the number of rainy days per month more consistent?

This chapter is about comparing the statistics of data sets and noting any similarities and differences. Do men spend more money than women? Do more people shop on weekends than weekdays? Are teachers generally younger than doctors? Data sets can be compared by examining the shapes of their graphs or by analysing their calculated measures of location and spread.

In this chapter you will learn how to:

• name and use the different types of data and random samples
• calculate measures of location (mean, median, mode)
• calculate measures of spread (range, interquartile range, standard deviation)
• analyse and interpret dot plots, stem-and-leaf plots, box plots and radar charts
• investigate outliers in small data sets and their effects on the mean, median and mode
• describe the shape of a distribution and draw conclusions about the data in the distribution
• display and compare two sets of data in double stem-and-leaf plots, double box-and-whisker plots, radar charts and area charts
• interpret data presented in a two-way table form
• use summary statistics and multiple displays to interpret and compare the relationships between two data sets.

Average number of rainy days per month

Month       J   F   M   A   M   J   J   A   S   O   N   D
Sydney     12  12  13  12  12  12  10  10  10  12  11  12
Melbourne   8   7   9  12  14  14  15  16  15  14  12  11
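One way to start answering the rainfall question is to compute a measure of location and a measure of spread for each city. The following is a minimal sketch in Python using the standard `statistics` module. The function name `summarise` is made up for illustration, and treating the twelve monthly figures as a sample (so that `stdev` is used) is an assumption.

```python
from statistics import mean, median, stdev

# Average number of rainy days per month (J to D), copied from the table.
sydney = [12, 12, 13, 12, 12, 12, 10, 10, 10, 12, 11, 12]
melbourne = [8, 7, 9, 12, 14, 14, 15, 16, 15, 14, 12, 11]

def summarise(days):
    """Return the mean, median, range and sample standard deviation."""
    return {
        "mean": mean(days),
        "median": median(days),
        "range": max(days) - min(days),
        "sd": stdev(days),  # assumption: the 12 months are treated as a sample
    }

for city, days in (("Sydney", sydney), ("Melbourne", melbourne)):
    s = summarise(days)
    print(f"{city}: mean {s['mean']:.2f}, median {s['median']}, "
          f"range {s['range']}, sd {s['sd']:.2f}")
```

On these figures Melbourne has a slightly higher mean number of rainy days (12.25 against 11.5), while Sydney's values are far more consistent (a range of 3 against 9).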


COLLECTING AND DISPLAYING DATA

Collecting data

Data or information can be collected by a variety of means:

• through observation, such as a naturalist observing animal behaviour
• by experiment, for example a medical researcher testing the effects of a new drug
• from a survey, usually via a telephone poll or a written questionnaire
• by taking a census, that is, surveying the whole population.

Do you still have your statistics file containing graphs and tables collected during the Preliminary Course? You should now add to your file by collecting articles from recent newspapers that contain graphs and tables, especially those that contain more than one statistical display or display data in an interesting way.

Use your library or explore the Internet to find real data. Here are three useful websites:

• Australian Bureau of Statistics (ABS): www.abs.gov.au or www.statistics.gov.au
• Bureau of Meteorology: www.bom.gov.au
• Morgan Surveys: www.roymorgan.com.au

Types of data

Data falls into one of the following types:

• quantitative (numerical) data that is discrete, such as the number of computers in schools
• quantitative (numerical) data that is continuous, such as the weights of gym members
• categorical (qualitative) data, such as the birthplaces of people living in Sydney.

Example 1

What type of data is each of these?

(a) the numbers of people attending Olympic Games

(b) the types of breakfast cereal in Cottonworths supermarket

(c) the body temperature of a hospital patient taken over a 24-hour period

Solution

(a) Quantitative and discrete since the data are distinct whole numbers.

(b) Categorical since the data are brand names of cereals.

(c) Quantitative and continuous since the data can be measured along a continuous scale.

• Quantitative or numerical data is best displayed in a column graph or line graph.
• Categorical or qualitative data is best displayed in a sector graph or divided bar graph.

Why do you think this is so?

Idea: Collecting statistical graphs and tables

Idea: Use the Internet to find real data

Think: Which graph is best?

Random sampling

It is not always convenient to collect data from all members of a population (that is, by taking a census). If a population is too large or too difficult to survey, a sample of items can be taken from the population and analysed, and the results used to reflect population characteristics.

• A simple random sample is one where each member of the population is equally likely to be chosen, for example choosing the winning balls in Lotto.
• A systematic sample is one where the first member is chosen at random and the others are chosen at regular intervals, for example every 8th toy on a production line.
• A stratified sample is one where a representative sample is taken from each stratum or layer of a population. For example, a stratified sample from a population containing 70% adults and 30% children would contain 70% adults and 30% children.
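The three sampling methods can be sketched in a few lines of Python. This is an illustration only: the population of 70 "adults" and 30 "children", the function names and the fixed seed are all assumptions made for the example, and the stratified version simply rounds each stratum's share, which works here because the shares are whole numbers.

```python
import random

def simple_random_sample(population, k, rng):
    """Every member is equally likely to be chosen."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    """Random starting point, then every step-th member."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, k, rng):
    """Sample each stratum in proportion to its share of the population."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        share = round(k * len(members) / total)  # assumes shares round cleanly
        sample.extend(rng.sample(members, share))
    return sample

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 70 adults and 30 children.
adults = [f"adult{i}" for i in range(70)]
children = [f"child{i}" for i in range(30)]
population = adults + children

srs = simple_random_sample(population, 10, rng)
every_8th = systematic_sample(population, 8, rng)
strat = stratified_sample({"adults": adults, "children": children}, 10, rng)

print(srs)
print(every_8th)
print(strat)
```

The stratified sample of 10 always contains 7 adults and 3 children, matching the 70% and 30% split described above, whereas the simple random sample may not.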

Exercise 4-01: Collecting and displaying data

1. State whether the data is (i) categorical, (ii) quantitative and discrete, or (iii) quantitative and continuous in each case.
(a) temperature of water in a swimming pool
(b) number of people who voted Liberal in the last four elections
(c) response time when a patient’s reflexes are tested
(d) religious denomination
(e) breeds of dogs
(f) speed of a car
(g) number of goals scored in a football match
(h) heights of girls in the school athletics team

2. Give two reasons for choosing a sample rather than a census.

3. What are biased and unbiased samples?

4. Which type of random sample (simple, systematic or stratified) would best suit each of the following situations?
(a) random breath testing
(b) opinion poll on whether Australia should change the flag
(c) taste testing a new brand of soft drink

5. Answer the questions that follow these three displays.
(a)

[Figure: ‘Drugs used by Australians’. Categories: Alcohol, Tobacco, Cannabis, Painkillers, Sleeping pills, Heroin, Amphetamines, Ecstasy, Cocaine, Hallucinogens. Vertical axis: Percentage, 0 to 100.]

(b) [Figure: ‘Newstart Allowance’. Age groups: Less than 20, 21–34, 35–54, 55–59, More than 60. Years: 1995, 1996, 1997. Vertical axis: 0 to 250 000.]

(c) [Figure: ‘Road fatalities in Australia’. Horizontal axis: Mar 95 to Mar 99 in six-month steps. Vertical axis: Fatalities per 10 000 population, 0.0 to 3.5.]

(i) State the type of display and information contained in the display.
(ii) Describe the type of data displayed.
(iii) Comment briefly on the strengths and weaknesses of the display.

SUMMARY STATISTICS

Measures of location (or averages)

Measures of location or averages are used to indicate the middle or centre of a data set. There are three measures of location: the mean, median and mode. You should use the measure that is best suited to the type and distribution of the data.

The mode is the most popular value or category.

The centre of a categorical data set is always described by the mode. For example, the modal dress size is the size worn by more women than any other.

The mean is the arithmetic average of all scores: $\bar{x} = \dfrac{\sum x}{n}$ or $\bar{x} = \dfrac{\sum fx}{\sum f}$.

The median is the middle score (or average of the two middle scores) when the scores are arranged in ascending order.

The centre of a quantitative data set is usually described by the mean or the median. The mean takes into account all scores in a data set and can be considered as the ‘balance point’ of the data set. It is, however, affected by very large or very small scores. In distributions where there are outliers, it is better to use the median as the measure of location.

Outliers

An outlier is a very high or very low score that is clearly apart from the other scores.

An outlier can occur for a variety of reasons and should always be investigated. If an outlier is found to be a value obtained through incorrect measurement or observation and is not a typical score, it can be excluded. If the outlier is a possible value from the population, it should be included in the distribution.

[Dot plot: temperatures on a scale from 36°C to 42°C.]

Here the outlier temperature is 42°C.

Example 2

Find the mean and median of the two data sets and state which is the more appropriate measure of location for each set:

A: 1 2 3 3 4 7 8

B: 1 2 3 3 4 7 29

Solution

A: Mean $\bar{x} = \dfrac{1+2+3+3+4+7+8}{7} = 4$

Median = 3

B: Mean $\bar{x} = \dfrac{1+2+3+3+4+7+29}{7} = 7$

Median = 3

For set A, either the mean or median could be used as the measure of location.

For set B, the median is the better measure of location as the outlier 29 affects the mean.

The mean can also be found using the statistics mode of a scientific or graphics calculator.
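The comparison in Example 2 can also be checked with a short program. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Data sets A and B from Example 2; they differ only in the last score.
set_a = [1, 2, 3, 3, 4, 7, 8]
set_b = [1, 2, 3, 3, 4, 7, 29]

for name, scores in (("A", set_a), ("B", set_b)):
    print(f"Set {name}: mean = {mean(scores)}, median = {median(scores)}")
```

Replacing 8 with the outlier 29 moves the mean from 4 to 7 but leaves the median at 3, which is why the median is preferred for set B.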

Just for the record

THE CHALLENGER DISASTER

In 1986, the space shuttle Challenger exploded just after takeoff and seven astronauts were killed. It was found that two rubber O-rings had failed because of the low air temperature. Here is some data from previous flights:

Air temperature (°C)   12   14   19   23   26
Damage index           11    4    0    0    0

An engineer had noticed the outlier (a damage index of 11 at a temperature of 12°C) and the fact that the expected air temperature at takeoff was below 0°C, and had recommended that the flight be delayed. Unfortunately, the outlier was not considered important and the flight ended in tragedy.

Can you give another example where an outlier should not be ignored?

Measures of spread

Measures of dispersion or spread are used to indicate how spread out a data set is. As with measures of location, you should use the measure that is best suited to the type and distribution of data.

Range and interquartile range

Range = highest score − lowest score

Interquartile range = upper quartile − lower quartile = Q3 − Q1

The interquartile range is the range of the middle 50% of scores. The upper quartile (Q3) is the median of the upper half of the scores and the lower quartile (Q1) is the median of the lower half. The interquartile range is often a better indicator of spread than the range as it does not take extreme scores into account.

Example 3

Find the range and interquartile range of the two data sets and state which is the more appropriate measure of spread for each set.

A: 1 2 3 3 4 7 8

B: 1 2 3 3 4 7 29

Solution

A:  1   2   3   3   4   7   8
        ↑       ↑       ↑
        Q1      Q2      Q3

Range = 8 − 1 = 7

Interquartile range = 7 − 2 = 5

B:  1   2   3   3   4   7   29
        ↑       ↑       ↑
        Q1      Q2      Q3

Range = 29 − 1 = 28

Interquartile range = 7 − 2 = 5

For set A, either the range or interquartile range could be used as the scores are fairly evenly spread.

For set B, the interquartile range is the better measure of spread as it does not take the outlier score 29 into account.
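The quartile rule used in Example 3 (the median of each half, leaving out the middle score when the number of scores is odd) can be written as a short function. This is a sketch of that rule only; the function names are made up here, and note that statistical software often uses other quartile conventions that give slightly different values.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(scores)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd number of scores the middle score belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

def spread(scores):
    """Return (range, interquartile range)."""
    q1, _, q3 = quartiles(scores)
    return max(scores) - min(scores), q3 - q1

set_a = [1, 2, 3, 3, 4, 7, 8]
set_b = [1, 2, 3, 3, 4, 7, 29]

print("A:", quartiles(set_a), spread(set_a))
print("B:", quartiles(set_b), spread(set_b))
```

Both sets have an interquartile range of 5, while the range jumps from 7 to 28 because of the single outlier.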

Standard deviation

Standard deviation is the most common measure of the spread of a distribution. It is the square root of the average of the squared deviations from the mean.

σₙ is the standard deviation of a population and σₙ₋₁ is the standard deviation of a sample.

σₙ₋₁ is used to approximate the population standard deviation σₙ and gets closer to the population value as the sample size increases.

Use σₙ₋₁ if the data is from a sample (or if you are unsure) and σₙ if all the possible data is given. In either case, always state which standard deviation you are using.
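Written as formulas, the two standard deviations differ only in the divisor. These are the standard definitions, stated here in the chapter's notation, where $\bar{x}$ is the mean and $n$ is the number of scores:

```latex
\sigma_n = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}
\qquad\qquad
\sigma_{n-1} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
```

Because $n - 1$ is smaller than $n$, $\sigma_{n-1}$ is always slightly larger than $\sigma_n$ for the same data, and the gap shrinks as $n$ grows.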


Mean and standard deviation from a calculator

The mean and standard deviation of a data set can be calculated using the statistics mode (STAT or SD) of your calculator.

Example 4

Here are the net weekly earnings of 8 labourers:

$730 $490 $600 $440 $490 $370 $700 $580

(a) What is the mean weekly earning?

(b) What is the standard deviation of the earnings?

Solution

Clear any previous data and check that n = 0.

Enter the separate values: 730 [DATA] 490 [DATA] 600 [DATA] … 580 [DATA]

Check that you have entered the correct number of scores by checking that n = 8.

(a) Mean x̄ = $550

(b) Standard deviation σₙ₋₁ ≈ $125.40

Example 5

Find the standard deviation of the two data sets and state which set is the more widely spread.

A: 1 2 3 3 4 7 8

B: 1 2 3 3 4 7 29

Solution

A: Sample standard deviation σₙ₋₁ ≈ 2.58

B: Sample standard deviation σₙ₋₁ ≈ 9.88

Set B is more widely spread as it has a much larger standard deviation.
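The calculator results in Example 5 can be reproduced in Python, where `statistics.stdev` plays the role of σₙ₋₁ and `statistics.pstdev` the role of σₙ. This is a minimal sketch; the variable names are chosen for illustration.

```python
from statistics import pstdev, stdev

set_a = [1, 2, 3, 3, 4, 7, 8]
set_b = [1, 2, 3, 3, 4, 7, 29]

for name, scores in (("A", set_a), ("B", set_b)):
    sample_sd = stdev(scores)       # divides by n - 1
    population_sd = pstdev(scores)  # divides by n
    print(f"Set {name}: sample sd = {sample_sd:.2f}, "
          f"population sd = {population_sd:.2f}")
```

Printing both values side by side shows why the text asks you to state which standard deviation you are using: for small data sets the two differ noticeably.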

Example 6

Twenty possums were captured, tagged and released in the Booderee National Park. Rangers recaptured several samples of 10 possums over a 2-month period and recorded the number of tagged possums in each sample.

No. tagged per sample (score)   0    1    2    3    4    5
No. of samples (frequency)      8   11    5    4    2    1

(a) What is the mean number of tagged possums per sample (correct to 2 decimal places)?

(b) What is the standard deviation of tagged possums (correct to 2 decimal places)?

Solution

Clear any previous data and check that n = 0.

Enter each score with its frequency: 0 [×] 8 [DATA] 1 [×] 11 [DATA] 2 [×] 5 [DATA] etc.

Check that you have entered the correct number of scores by checking that n = 31.

(a) Mean x̄ = 1.48 possums

(b) Standard deviation σₙ₋₁ = 1.36 possums (σₙ₋₁ is the sample standard deviation)
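When data comes as a frequency table, each score is weighted by its frequency, just as the score × frequency key sequence does on the calculator. The sketch below applies that idea to the possum data from Example 6. The function name `frequency_table_stats` is an assumption made for this illustration; the same function would work for grouped data if class centres are passed in as the scores.

```python
from math import sqrt

def frequency_table_stats(scores, freqs):
    """Return (n, mean, population sd, sample sd) for a frequency table."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(scores, freqs)) / n
    # Sum of squared deviations from the mean, weighted by frequency.
    ss = sum(f * (x - mean) ** 2 for x, f in zip(scores, freqs))
    return n, mean, sqrt(ss / n), sqrt(ss / (n - 1))

# Example 6: number of tagged possums per sample and number of samples.
scores = [0, 1, 2, 3, 4, 5]
freqs = [8, 11, 5, 4, 2, 1]

n, mean, sd_population, sd_sample = frequency_table_stats(scores, freqs)
print(f"n = {n}, mean = {mean:.2f}, sample sd = {sd_sample:.2f}")
```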

Example 7

The annual salaries of employees at the Nelson manufacturing company are tabulated.

Annual salary (× $1000)   Number of employees
20–<30                    16
30–<40                    12
40–<50                     5
50–<60                     5
60–<70                     6
70–<80                     2

20–<30 means from 20 up to but not including 30.

(a) How many people are employed at the company?

(b) Using class centres 25, 35, 45, …, find the estimated mean salary of the employees.

(c) What is the standard deviation of the salaries?

Solution

Annual salary (× $1000)   Class centre   No. of employees
20–<30                    25             16
30–<40                    35             12
40–<50                    45              5
50–<60                    55              5
60–<70                    65              6
70–<80                    75              2
Total                                    46

(a) Number of employees = 46.

(b) Clear any previous data and check that n = 0.

Enter each class centre with its frequency: 25 [×] 16 [DATA] 35 [×] 12 [DATA] 45 [×] 5 [DATA] etc. Check that n = 46.

Mean x̄ = 40.4.

Hence the estimated mean salary is $40 400 per annum.

(c) Standard deviation σₙ ≈ 15.7 (σₙ is the population standard deviation, used here because every employee is included).

Hence the standard deviation of the salaries is about $15 700 per annum.

Exercise 4-02: Summary statistics

1. Without using the statistical functions on your calculator, find the mean and median for each data set (correct to 1 decimal place where appropriate). State which is the better measure of location and why.
(a) 2 3 3 5 6 8 9
(b) 26 22 24 29 21 23 24 22
(c) 8 40 38 42 45 29 31 41 30
(d) 6 8 11 9 10 8 11 12 6 7

2. Find the range and interquartile range for each data set in question 1 and state which is the better measure of spread and why.

3. Using your calculator, find the mean and standard deviation (correct to 1 decimal place) for each of these sets of data.
(a) 42, 35, 63, 70, 81, 80, 85
(b) $300, $400, $600, $440, $300, $700, $250, $580, $260
(c) 37.4°F, 38.2°F, 39.0°F, 36.8°F, 38.5°F, 38.0°F, 36.8°F, 40.5°F
(d) 165 kg, 146 kg, 178 kg, 190 kg, 158 kg, 147 kg
(e) 23, 18, 24, 16, 17, 20, 15, 22, 19

4. The hair colours of 75 people were noted.

Hair colour   No. of people
Brown         16
Blonde        25
Black         14
Red            7

(a) What is the modal hair colour?
(b) Why is the mode the best measure of central tendency here?
(c) Why are the mean and median not appropriate measures here?

5. Joshua swam a kilometre each morning for 10 days in preparation for a swimming carnival. His times (in minutes) were:

28 24 22 24 25 24 26 26 24 27

(a) What is his median swim time?
(b) What is his mean swim time?
(c) What is his range of swim times?
(d) What is the interquartile range of his times?
(e) What is the standard deviation of his times (correct to 1 decimal place)?
(f) If Joshua asked you to tell him the most appropriate measures of location and spread for these times, which two would you choose? Justify your answer.

6. Ted and Julie were paid by piecework for making T-shirts. The numbers made each day over an 8-day period were:

Ted: 18 25 19 19 26 24 15 22
Julie: 16 20 21 28 12 26 18 19

(a) For each person find:
(i) the number of T-shirts made in the 8-day period
(ii) the interquartile range of T-shirts made
(iii) the mean number of T-shirts made
(iv) the standard deviation of T-shirts made (correct to 1 decimal place)
(b) Comment on the statement that Ted is a more consistent worker than Julie by comparing their means and standard deviations. Give reasons for your answer.

7. Numbers of motor accidents per week over a 9-week period at a busy intersection were:

4 3 6 0 4 9 2 3 5

(a) What is the median number of accidents?
(b) What is the mean number of accidents per week?
(c) Does the mean or median best describe the centre of this data set? Give reasons.
(d) What is the range of the data?
(e) What is the interquartile range?
(f) Find the standard deviation for the data (correct to 1 decimal place).
(g) Comment on the statement: ‘The number of accidents per week is fairly consistent’. Justify your answer.

8. This stem plot shows the waiting times in a medical centre (in minutes).

Stem | Leaf
1    | 2 4 5
2    | 3 6 9
3    | 0 1 4 5 7 7 8
4    | 2 4 5
5    | 1 3 7
6    | 2

(a) Find the mean waiting time (correct to 1 decimal place).
(b) Find the standard deviation of waiting times (correct to 1 decimal place).

9. The weekly wages of a group of teachers are shown in the table.

Weekly wage ($)   No. of teachers
500–<600           8
600–<700           5
700–<800          30
800–<900          25
900–<1000         12

(a) What is the mean weekly wage?
(b) What is the standard deviation of the weekly wages (correct to 1 decimal place)?

10. The percentage marks for 250 students in a Business Studies examination are listed.

Score             11–20  21–30  31–40  41–50  51–60  61–70  71–80  81–90  91–100
No. of students      14     18     15     26     20     31     44     39      43

(a) What is the mean mark (to the nearest whole number)?
(b) What is the standard deviation of the marks (correct to 1 decimal place)?
(c) Write a comment to the school principal describing the results of these students.

FEATURES OF A STATISTICAL DISPLAY

Shape

The shape of a statistical display shows how the data is distributed. When using a dot plot or histogram, a curve can be used to approximate the general shape.

Clustering

Clustering occurs when scores are close together or ‘bunched up’. In this stem-and-leaf plot, the scores are clustered in the 50s and 80s.

[Stem-and-leaf plot with stems 3 to 9, showing clusters of scores in the 50s and 80s.]

Symmetry

A distribution has symmetry if the scores are balanced or evenly spread about the centre of the distribution. In a symmetrical distribution, the mean, median and mode are usually the same. For this distribution, the mean, median and mode are all 5.

[Dot plot: a symmetrical distribution centred on 5.]

Skew

A distribution that is skewed is not symmetrical. The tail indicates the direction of the skew.

• If the scores are mostly low (or to the left), the distribution is positively skewed.
• If the scores are mostly high (or to the right), the distribution is negatively skewed.

This distribution is positively skewed. The tail points to the right, the positive direction.

[Dot plot: a positively skewed distribution.]

The data in this dot plot is negatively skewed. The tail points to the left, the negative direction.

[Dot plot: a negatively skewed distribution.]

The data in this stem-and-leaf display is negatively skewed as the scores are mostly high with a tail towards the low scores. Clustering also occurs in the 70s and 80s.

[Stem-and-leaf plot with stems 2 to 9; most leaves lie on stems 7 and 8.]

Peaks and modes

Peaks are the high points or ‘humps’ in a display. The highest peak is called the mode.

• No peaks: the display is uniform or flat and there is no mode.
• One peak: the display is unimodal. The mode here is 6.
• Two peaks: the display is bimodal. The mode here is 6. (The mode is the higher of the two peaks.)
• Many peaks: the display is multimodal. There are three peaks here and two modes: 5 and 7.

[Figures: four dot plots illustrating no peaks, one peak, two peaks and many peaks.]

The three distributions show the relative positions of the mean, median and mode.

• For a symmetrical distribution, the mean, median and mode are usually equal.
• For a skewed distribution, the median is usually between the mean and mode and is the better measure of location.

[Figure: three frequency curves (Frequency against Score), each marking the mean, median and mode: (a) Symmetrical, (b) Positively skewed, (c) Negatively skewed.]

Think: Shape and measures of location

Diagram (a) could represent results in an HSC General Mathematics examination.

Diagram (b) could represent traffic flow from 6 am to noon.

Diagram (c) could represent the heights of basketball players in a club.

Can you think of other situations that these diagrams could represent?

TEN HOT TIPS FOR TACKLING EXAMS

1. Find out about the format of the exam: the topics to be tested, the time allowed, the number and format of questions, the marks awarded, whether formulas are supplied.

2. Be prepared!

3. Spend the first 5 minutes browsing through the exam to see the work that is ahead of you. Note the harder questions, as you may need to spend more time on them.

4. Spend the first minute of each question planning and thinking.

5. Keep an eye on the time. Don’t spend too much time on one question.

6. Write clearly. Draw big diagrams. Spread out your working and set it out neatly. Write down the page, not across.

7. Make sure you have answered the question. Did you remember to round off and/or include units? Did you use all of the relevant information given?

8. Attempt every question.

9. If the working-out to a hard question is taking too long, then it’s probably wrong. Don’t get bogged down. If you’re getting nowhere, retrace your steps, start again, or skip the question and return later with a fresh mind.

10. Once you have completed the exam, go over it again. Double-check your answers, especially the harder ones or those of which you’re unsure.

Exercise 4-03: Features of a statistical display

1. Draw a curve representing a statistical display that:
(a) is symmetrical
(b) is positively skewed
(c) shows clustering
(d) is negatively skewed with clustering
(e) is symmetrical and bimodal

2. For each of the following displays state:
(i) if the data is symmetrical or skewed
(ii) if there are any clusters
(iii) if there are any outliers
(iv) how many peaks there are

[Figures (a) to (f): six statistical displays, including frequency graphs (Score against Frequency) and stem-and-leaf plots.]

3. The numbers of visits (or hits) to a popular Internet website were tabulated over a 10-hour period.

Time        Hits (× 1000)
1201–1300   1.3
1301–1400   0.8
1401–1500   0.4
1501–1600   2.1
1601–1700   2.6
1701–1800   4.5
1801–1900   3.9
1901–2000   5.3
2001–2100   2.3
2101–2200   1.2

Draw a histogram to represent this data and comment on the features of the display, such as shape, skew, clustering and peaks.

4. For the given information:

5 14 8 7 12 3 2 8 4 10 6 2 7 3 9 9 6 4 8 9

(a) draw a dot plot to display the data
(b) comment on the features of the display

5. Here is a set of data:

22 16 36 15 16 24 15 15 19 55
58 59 18 17 20 20 24 15 54 19
15 40 21 17 50 22 23 21 24 23
15 35 15 24 22 19 15 17 43 49

(a) Draw a stem-and-leaf display for this data set using stems 1, 2, 3, 4 and 5.
(b) Comment on the features of the display.
(c) Give the name of a possible population that this data could represent.

6. This dot plot represents the industrial accidents per month at a factory:

[Dot plot: number of accidents per month on a scale from 0 to 9.]

(a) What is the mean number of accidents in this period (correct to 1 decimal place)?
(b) What is the standard deviation (correct to 1 decimal place)?
(c) What could be a possible reason for the outlier 9?
(d) What are the mean and standard deviation if the outlier 9 is not included (correct to 1 decimal place)?
(e) Compare the means and standard deviations of the two groups of data.

INVESTIGATING OUTLIERS

Outliers often have the effect of raising or lowering a mean value but they can also affect the mode and median.

Example 8

A: 20 25 30 35 40 45

B: 20 25 30 35 40 60

C: 20 25 30 35 40 120

(a) Find the mean and median of each set of scores.

(b) The three data sets are the same except for the value of the last score. Investigate the effect of increasing the last score on the mean and median of set A.

(c) What are the values of the mean and median of set C if the outlier 120 is not included?

Solution

(a) In each set the median lies midway between the two middle scores, 30 and 35, so the median is 32.5.

Set   Mean   Median
A     32.5   32.5
B     35     32.5
C     45     32.5

(b) Increasing the last score has no effect on the median.

As the last score increases, so the value of the mean increases. The outlier of 120 has the greatest effect on the value of the mean.

(c) Set C without the score 120 has a mean and median of 30.
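The investigation in Example 8 can be repeated for any choice of last score. The following is a minimal sketch in Python; the variable names are chosen for illustration.

```python
from statistics import mean, median

# The first five scores are common to sets A, B and C.
base = [20, 25, 30, 35, 40]

results = []
for last in (45, 60, 120):
    scores = base + [last]
    results.append((last, mean(scores), median(scores)))
    print(f"last score {last}: mean = {mean(scores)}, median = {median(scores)}")

# Set C with the outlier 120 removed.
print(f"without outlier: mean = {mean(base)}, median = {median(base)}")
```

Trying other values for the last score confirms the pattern: the mean follows the outlier, while the median stays at 32.5 as long as the two middle scores are unchanged.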

Exercise 4-04: Investigating outliers

1. For each pair of data sets below find:
(i) the mean and median (correct to 1 decimal place)
(ii) the value of any outlier score
(iii) the effect on the mean and median of any outlier

(a) A: 10 12 14 16 18 20
    B: 10 12 14 16 18 40

(b) A: 5 37 41 53 56
    B: 36 37 41 53 56

(c) A: 3 4 8 9 12 14
    B: 3 6 7 10 13 25

(d) A: 110 120 130 135 135 140 140
    B: 55 115 135 140 145 145 150

2. For each data set below:
(i) find the mean, median and mode (correct to 1 decimal place where needed)
(ii) state the value of any outlier
(iii) say which measure of location is the most appropriate
(iv) sketch the shape

(a) 2 8 3 16 9 26 8
(b) 8 16 4 21 4 23 16 12
(c) 120 g 85 g 72 g 60 g 80 g 80 g
(d) 37°C 38°C 41°C 39°C 38°C 37°C 37°C

3. The 7 employees at the Bug and Beef Cafe earned the following wages in a week:

$350 $420 $510 $130 $635 $320 $460

(a) What is the mean wage?
(b) What is the median wage?
(c) Which is the more appropriate measure of location? Justify your answer.
(d) If each employee received a 10% pay rise, what would be the new mean and median wages?
(e) By what percentage would the mean increase?
(f) If the manager who earned $635 was not included in the data set, what would be the mean and median wages?

4. In a netball tournament of 5 matches, the numbers of points scored by three teams are:

The Wombats: 24 18 14 6 22
The Possums: 16 16 15 18 15
The Koalas: 36 8 14 16 12

(a) What are the mean and median for each team?
(b) Which team is more consistent? Why?
(c) An error was found in the recording for the Wombats. The score of 6 should have been 16. What are the new mean and median?

5. Pam and Percy sell photocopiers. The numbers of copiers sold over a 10-week period are shown.

Pam: 1 2 3 3 5 6 7 8 12 25

Percy: 3 3 3 14 16 18 18 24 32 35

(a) What is the modal number of copiers sold by each person?
(b) What could you say about each person if you only knew the mode?
(c) What is the median number of copiers sold by each?
(d) What is the mean number of copiers sold by each?
(e) Which measure of location is the best measure to compare the sales performances of Pam and Percy?
(f) Who is the better salesperson? Why?

6. Choose 5 scores that have the same mean and median. What effect will adding a score of 100 have on the mean and median?

7. Rupert’s bookstore employs the following people with annual wages as shown:

2 store managers              $64 300
4 cashiers                    $34 200
3 part-time clerical staff    $28 500
10 salespeople                $46 500
2 part-time cleaners          $13 500

(a) What is the modal wage? Why?
(b) What is the median wage?
(c) What is the mean wage (to the nearest dollar)?
(d) Which measure would Rupert use to make the salaries appear higher?
(e) Which measure of location (average) best represents the average wage for an employee at Rupert’s bookstore?

DISPLAYING AND COMPARING TWO DATA SETS

Double stem-and-leaf plots

By representing two related data sets in a double (back-to-back) stem-and-leaf display, similarities and differences, such as clustering and averages (measures of location), can be easily seen.

Example 9

This double stem-and-leaf plot shows the numbers of dollars spent by a group of students visiting the Easter show.

        Boys | Stem | Girls
 8 6 6 5 5 4 |  1   | 2 5 5 8
     6 4 3 2 |  2   | 0 2 4 5 5 5 6 7 8 9
       9 8 2 |  3   | 1 2 4
 5 3 2 1 1 0 |  4   | 0 2
           2 |  5   |

(a) How many students went to the show?

(b) Give two observations on the shape and features of the data.

(c) Calculate the mean and standard deviation (to the nearest 5 cents) of amounts spent by boys and by girls.

(d) Considering all the information you have, do you think that boys are the bigger spenders? Why?

Solution

(a) 39 students, consisting of 20 boys and 19 girls.

(b) The amounts spent by the girls show clustering at $20–$29, whereas the amounts spent by the boys are more evenly spread out.

The data for the girls is positively skewed.

(c) Use the statistical functions on a scientific or graphics calculator.

Girls: Mean x̄ = $25.80, standard deviation σₙ₋₁ = $8.00

Boys: Mean x̄ = $30.10, standard deviation σₙ₋₁ = $12.40

(d) Yes. The average amount spent by a boy was $30.10. This was $4.30 more than the average amount spent by a girl.
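The figures in part (c) can be checked by expanding each stem and leaf back into a dollar amount. This sketch assumes the plot is read as stems 1 to 5 for the boys and stems 1 to 4 for the girls, with the girls' two leaves 0 and 2 on stem 4; that reading is an assumption, chosen because it gives the 20 boys, 19 girls and the means quoted in the solution. The helper name `expand` is made up for this illustration.

```python
from statistics import mean, stdev

def expand(plot):
    """Turn {stem: [leaves]} into a list of scores (stem = tens, leaf = units)."""
    return [10 * stem + leaf for stem, leaves in plot.items() for leaf in leaves]

# Assumed reading of the back-to-back plot (leaves listed in ascending order).
boys_plot = {
    1: [4, 5, 5, 6, 6, 8],
    2: [2, 3, 4, 6],
    3: [2, 8, 9],
    4: [0, 1, 1, 2, 3, 5],
    5: [2],
}
girls_plot = {
    1: [2, 5, 5, 8],
    2: [0, 2, 4, 5, 5, 5, 6, 7, 8, 9],
    3: [1, 2, 4],
    4: [0, 2],
}

boys = expand(boys_plot)
girls = expand(girls_plot)

for label, amounts in (("Boys", boys), ("Girls", girls)):
    print(f"{label}: n = {len(amounts)}, mean = ${mean(amounts):.2f}, "
          f"sample sd = ${stdev(amounts):.2f}")
```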

Box plots

Whereas a stem-and-leaf plot gives a good visual comparison of the location of scores in a data set, a box plot (or box-and-whisker plot) shows the spread of the data. To compare data sets, find a five-number summary for each and draw the box plots on the same scale.

[Figure: a box plot labelled with the lower extreme, lower quartile (Q1), median (Q2), upper quartile (Q3) and upper extreme.]

The box contains the middle 50% of scores, and each whisker represents a further 25% of the scores.

Example 10

The box plots below show the ranges of unleaded petrol prices in six cities in Australia.

[Figure: box plots of unleaded petrol prices for Canberra, Sydney, Melbourne, Adelaide, Brisbane and Darwin.]

(a) (i) Which city’s petrol prices had the smallest range?
(ii) Which city’s had the largest range?

(b) In which city was petrol generally cheapest? Give a possible reason for this.

(c) Canberra, Sydney and Melbourne had the same range of prices.
(i) Which of these three cities had the lowest median price?
(ii) In which of these cities would you be more likely to pay a higher price for petrol?

(d) Write down one observation about petrol prices in Canberra.

Solution

(a) (i) Adelaide (ii) Darwin

(b) Brisbane. The government tax on petrol is lower than in the other cities and so the price paid by the consumer is lower.

(c) (i) Sydney (ii) Melbourne

(d) They were evenly spread across the city. The distribution of petrol prices is symmetrical.

Technology: Box plots on a graphics calculator

Using a graphics calculator is an easy and excellent way to compare box plots.

1. Enter the individual scores of the first data set in List 1.

2. Enter the individual scores of the second data set in List 2.

3. Set the GRAPH to a median box plot (some calculators have a mean box plot as well).

4. Make sure that both Graph 1 and Graph 2 are ON.

5. Draw the graphs. Both graphs will appear on the screen at the same time, giving you an excellent comparison of the two data sets.

The calculator will also give you the five-number summary.

Example 11

Liz and George deliver pamphlets to letterboxes in the same neighbourhood. The numbers of pamphlets delivered per hour over 12 hours are shown:

Liz: 24 25 26 27 28 28 31 32 32 32 35 35

George: 15 18 21 24 25 29 31 31 32 38 38 45

(a) Represent the data in a double stem-and-leaf plot.

(b) Find a five-number summary for each data set and hence draw two box plots.

(c) Write down one observation that is best seen in the stem-and-leaf plot.

(d) Write down one observation that is best seen in the box plots.

(e) Which worker showed the greater interquartile range of pamphlets delivered? Which display shows this the best?

(f) Can we conclude that Liz is a better worker than George?

Solution

(a)
         Liz | Stem | George
             |  1   | 5 8
 8 8 7 6 5 4 |  2   | 1 4 5 9
 5 5 2 2 2 1 |  3   | 1 1 2 8 8
             |  4   | 5

(b) Liz: 24 25 26 27 28 28 31 32 32 32 35 35

Lower extreme = 24
Lower quartile = $\dfrac{26+27}{2}$ = 26.5
Median = $\dfrac{28+31}{2}$ = 29.5
Upper quartile = $\dfrac{32+32}{2}$ = 32
Upper extreme = 35

George: 15 18 21 24 25 29 31 31 32 38 38 45

Lower extreme = 15
Lower quartile = 22.5
Median = 30
Upper quartile = 35
Upper extreme = 45

[Figure: box plots for Liz and George drawn on a common scale, Pamphlets/hour, from 15 to 45.]

(c) The stem-and-leaf plot shows that the number of pamphlets delivered per hour by Liz was always in the 20s and 30s.

(d) The box plots show the median number of pamphlets delivered per hour by both was about the same (around 30) but George’s range was greater.

(e) George. This is obvious from the box plots. The interquartile range is the length of ‘the box’.

(f) If an employer was looking for consistency, Liz is the more consistent worker as she had less variation in the number of pamphlets delivered per hour. However, for the total number of pamphlets delivered, both employees delivered approximately the same number of pamphlets. We cannot conclude that Liz is a better worker than George.
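The two claims in part (f), similar totals but different consistency, can be checked directly from the lists. This is a minimal sketch; the helper name `describe` is hypothetical, and it uses the sample standard deviation as one possible measure of consistency.

```python
from statistics import stdev

liz = [24, 25, 26, 27, 28, 28, 31, 32, 32, 32, 35, 35]
george = [15, 18, 21, 24, 25, 29, 31, 31, 32, 38, 38, 45]


def describe(scores):
    """Total delivered, range and sample standard deviation of hourly counts."""
    return {
        "total": sum(scores),
        "range": max(scores) - min(scores),
        "sd": round(stdev(scores), 1),
    }


liz_stats = describe(liz)
george_stats = describe(george)
print("Liz:", liz_stats)
print("George:", george_stats)
```

The totals (355 for Liz and 347 for George) are close, while George's range (30) is much larger than Liz's (11), which is the numerical side of the argument in part (f).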

What to do with outliers?

• If an outlier is considered to be feasible, you can include it in the whiskers.

• If an outlier is considered to be an error, you need not include it in the whiskers but can represent it as a separate point.
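The effect of the two choices on the whiskers can be seen in a small sketch. The data below is hypothetical (it is not taken from the text), and the function name `whisker_ends` is made up for illustration; deciding which score counts as an outlier is left to the person drawing the plot.

```python
def whisker_ends(scores, excluded=()):
    """Return (lower, upper) whisker ends, ignoring any scores listed in `excluded`."""
    kept = [s for s in scores if s not in excluded]
    return min(kept), max(kept)


# Hypothetical data: 11 is the suspected outlier.
scores = [2, 3, 4, 4, 5, 5, 6, 11]

included = whisker_ends(scores)                  # outlier kept in the whisker
excluded = whisker_ends(scores, excluded=[11])   # outlier drawn as a separate point
print("Outlier included:", included)
print("Outlier excluded:", excluded)
```

With the outlier included the upper whisker stretches to 11; with it excluded the whisker stops at 6 and the score of 11 is marked on its own.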

Can you describe a situation that these box plots could represent?

1. The numbers of dollars spent by a class of students visiting the Easter show were discussed in Example 9 (page 124).

(a) Find a five-number summary for each data set.

(b) What is the interquartile range of each?

(c) Draw two box plots representing the data sets.

(d) What information is seen more easily in the box plots?

2. A teacher proposes that ‘People always underestimate the length of a piece of string’. A group of students decide to investigate this theory. They each estimate the lengths of several pieces of string and then measure the actual lengths.

(a) Write down the median of the estimated lengths.

(b) Write down the median of the actual lengths.

(c) What are the range and interquartile range for each data set?

(d) Would you agree with the teacher’s theory? Justify your answer.

Think:

Is the outlier in or out?

[Figure: two box plots on a scale from 1 to 11, labelled ‘Outlier excluded’ and ‘Outlier included’]

Exercise 4-05:

Displaying and comparing two data sets

Boys Girls

8 6 6 5 5 4 6 4 3 2 9 8 2 5 3 2 1 1 0 2

1 2 3 4 5

2 5 5 8

0 2 4 5 5 5 6 7 8 9 1 2 4

0 2

[Figure for question 2: box plots of the actual and estimated lengths; axis: Length of string (cm), 5 to 35]

3. Here are two sets of scores represented in a stem-and-leaf display.

(a) Find the range and interquartile range of each set.

(b) Find the median for each set.

(c) Draw box plots representing the data sets.

(d) Write down one observation from the stem-and-leaf plot and one from the box plots.

4. The pulse rates (in beats/minute) of two groups of people were recorded:

Group X: 77 72 80 77 91 62 72 82 79 58 75 67 69 66 98 81

Group Y: 81 86 64 74 92 75 73 81 64 52 82 79 80 53 62 78

(a) Draw a back-to-back stem-and-leaf plot.

(b) What is the mean of each group (correct to 1 decimal place)?

(c) What is the median of each group?

(d) Which is the better measure of location? Why?

(e) Comment on the shape of each group in the stem-and-leaf plot.

5. A group of 20 people had their pulse rates taken before and after an exercise class.

(a) By how much did the median pulse rate increase?

(b) The lower extreme ‘before’ and ‘after’ the class did not change. Give a possible reason for this.

(c) Give a possible reason for the outlier pulse rates in the ‘after exercise’ box plot.

(d) How many people had a pulse rate between 64 and 72 before the exercise class?

(e) What was the interquartile range of pulse rates after the class?

6. Eighteen people took part in the QUIT smoking program. The numbers of cigarettes smoked per day were recorded before the start of the program and 6 weeks later:

Before: 21 10 36 42 16 23 32 42 9 14 21 18 34 45 12 18 16 28

6 weeks later: 6 24 31 38 21 25 16 19 16 18 28 32 8 13 40 38 16 28

(a) What is the interquartile range for each data set?

(b) Draw two box plots on the same scale showing ‘before’ and ‘6 weeks later’.

(c) Is the QUIT program working for these people? Justify your answer.

7. The following data shows the average number of rainy days per month for two capital cities, and is supplied by the Bureau of Meteorology.

Month       J   F   M   A   M   J   J   A   S   O   N   D
Sydney     12  12  13  12  12  12  10  10  10  12  11  12
Melbourne   8   7   9  12  14  14  15  16  15  14  12  11

Set A Set B

2 5 5

8

5 7 0 2

0 1 2 3 4 5 6 7 8 9

3

8 5 2 4 7 2

4

[Figure for question 5: box plots of pulse rates before and after exercise; axis: Pulse rate (beats/min), 40 to 140]

(a) Use a double stem-and-leaf plot to display the data.

(b) Draw box plots representing the data.

(c) Write down one observation from each display.

(d) ‘Melbourne is much wetter than Sydney.’ Do you agree with this statement? Justify your answer.

8. This display represents the lifetime in hours of two brands of light globes.

(a) How many of each brand of light globe were tested?

(b) What is the mean lifetime of ‘Oso Bright’ globes (correct to 1 decimal place)?

(c) What is the mean lifetime of ‘Brighta Longa’ globes (correct to 1 decimal place)?

(d) Find the standard deviation of the lifetime of each brand (correct to 1 decimal place).

(e) Draw box plots representing the data sets.

(f) Which brand of globe would you say is better? Explain your answer.

COMPARING DATA SETS USING CHARTS

Radar chart

A radar chart is used to plot changes over a certain period or cycle, such as temperature during a 24-hour period, but it is also useful for comparing two sets of data.

A radar plotting chart (or polar graph paper) can be used to manually plot data, but the best option is to generate the radar chart from a spreadsheet package on a computer.

Example 12

This radar chart shows air pollution levels at two different workplaces over a 10-day period.

(a) What was the air pollution level at the meatworks on day 10?

(b) What was the air pollution level at the oil refinery on day 1?

(c) On what days was the pollution level above 50 at the oil refinery?

(d) What were the maximum and minimum pollution levels? When and where did they occur?

(e) By comparing the areas contained within each graph, decide which workplace had the higher overall pollution level.

Stem-and-leaf display for question 8 (lifetime of light globes):

             Oso Bright | Stem | Brighta Longa
              6 5 5 4 2 |  10  | 3 4 5
    8 7 7 7 7 7 7 4 4 3 |  11  | 2 3 3 4 4 5 6
9 9 8 8 7 6 6 6 5 4 4 0 |  12  | 1 2 2 3 3 4 5 5 7 9 9 9
  8 8 8 7 7 7 6 5 4 3 1 |  13  | 0 2 3 3 4 4 4 5 6 6 8 8 9
        9 8 8 8 5 5 2 2 |  14  | 1 2 2 3 5 5 6 7 8 8
                7 7 5 1 |  15  | 0 3 3 4 6

[Figure: radar chart ‘Air pollution levels’ for the meatworks and the oil refinery; spokes Day 1 to Day 10; radial scale 0 to 100]

Solution

(a) About 60.

(b) About 45.

(c) Days 4, 6, 8 and 9.

(d) The maximum level was about 85 on day 4 at the oil refinery and the minimum level was about 25 on day 1 at the meatworks.

(e) The oil refinery graph seems to cover a slightly larger area and so had a higher level of pollution over the 10-day period.
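The area comparison in part (e) is made by eye, but the enclosed area of a radar chart can also be calculated. The exact readings for Example 12 are not listed, so the sketch below uses the monthly rainy-day figures for Sydney and Melbourne from question 7 of Exercise 4-05 instead. The function name `radar_area` is hypothetical, and the calculation assumes equally spaced spokes with neighbouring points joined by straight lines.

```python
import math


def radar_area(values):
    """Area of the polygon traced on a radar chart with equally spaced spokes.

    Neighbouring spokes of lengths r1 and r2 enclose a triangle of area
    0.5 * r1 * r2 * sin(angle between spokes); the polygon is the sum of
    these triangles around the full circle (needs at least 3 spokes).
    """
    n = len(values)
    angle = 2 * math.pi / n
    return 0.5 * math.sin(angle) * sum(
        values[i] * values[(i + 1) % n] for i in range(n)
    )


# Average rainy days per month, January to December.
sydney = [12, 12, 13, 12, 12, 12, 10, 10, 10, 12, 11, 12]
melbourne = [8, 7, 9, 12, 14, 14, 15, 16, 15, 14, 12, 11]

print("Sydney:   area", round(radar_area(sydney), 1), "total", sum(sydney))
print("Melbourne: area", round(radar_area(melbourne), 1), "total", sum(melbourne))
```

Melbourne's graph encloses the larger area and also has the larger total. Because the area depends on products of neighbouring values, it exaggerates large readings, so it is best treated as a visual guide and checked against the totals.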

Area chart

An area chart consists of different ‘areas’ or ‘bands’, each representing a data set over a given period of time. It shows the sum of the data over the given time as well as the relationship of the parts to a whole. Its main feature is to emphasise changes during this time. An area chart can be plotted on graph paper or drawn using the Chart option in a spreadsheet package. There are several chart subtypes that you can investigate.

Example 13

The table shows the numbers of males and females in full-time employment in January from 1990 to 2000.

Year                1990  1992  1994  1996  1998  2000
Males (×10 000)      350   160   200   320   360   450
Females (×10 000)     80    50    60   120   140   200

Construct an area chart showing the contribution of male and female employees to Australia’s full-time workforce.

Solution

Step 1 Draw a line graph for males using the values in the table and shade below it. This area represents the male employees.

Step 2 Draw a line graph for total employees by adding the values for females to those of males. Shade the area between the two lines. This area represents the female employees.

For example, in January 2000 the full-time workforce was 6 500 000, and this was made up of 4 500 000 males and 2 000 000 females.
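The heights of the two boundary lines in Steps 1 and 2 can be worked out before any drawing is done. This is a minimal sketch using the values from the table; the variable names are chosen only for illustration.

```python
years = [1990, 1992, 1994, 1996, 1998, 2000]
males = [350, 160, 200, 320, 360, 450]    # x 10 000
females = [80, 50, 60, 120, 140, 200]     # x 10 000

# Step 1: the lower boundary line is the male series itself.
# Step 2: the upper boundary line is males + females (the total workforce).
totals = [m + f for m, f in zip(males, females)]

for year, m, t in zip(years, males, totals):
    print(year, "male line:", m, "total line:", t, "female band height:", t - m)
```

The last row gives a total line of 650 for 2000, that is, 6 500 000 employees, which agrees with the figure quoted above.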

[Figure: area chart ‘Australia’s full-time workforce’; vertical axis: No. of employees (× 10 000), 0 to 700; horizontal axis: Year, 1990 to 2000; bands: Males, Females]

Example 14

This area chart compares the unemployment rates for males and females from 1981 to 1997.

(a) For the year 1985 find:

(i) the unemployment rate for males

(ii) the combined unemployment rate

(iii) the unemployment rate for females

(b) What was the unemployment rate in 1993?

(c) What trends in the unemployment rate can be seen over the period from 1981 to 1997?

Solution

(a) (i) About 8%.

(ii) About 17%.

(iii) About 9% (subtract the 8% rate for males from the 17% total rate).

(b) About 22%.

(c) The unemployment rate rose from about 12% in 1981 to 17% in 1997. A fall in the unemployment rate occurred from 1985 to 1989 followed by a rise before another fall from 1993 to 1997. The unemployment rate was at its highest in 1993.

Radar charts and area charts are drawn in a similar way using a spreadsheet package. Use a spreadsheet to draw the area chart for Australia’s full-time workforce (Example 13 on page 130).

1. The numbers of clear days for the ski resorts of Thredbo and Perisher in the Snowy Mountains area of NSW are shown in the radar chart.

(a) How many clear days did Thredbo have in March?

(b) What was the most number of clear days at either resort? When was this?

(c) How many days were not clear in Perisher in July?

(d) Which data set contains the largest area? What does this area refer to?

(e) ‘The weather is better for skiing at Perisher.’ Do you agree with this statement? Justify your answer.

[Figure for Example 14: area chart ‘Unemployment rates’; vertical axis: Percentage, 0 to 25; horizontal axis: Year, 1981 to 1997; bands: Males, Females]

Use your ruler to help you measure the vertical distances.

Technology:

Using a spreadsheet to draw an area chart or radar chart

Exercise 4-06:

Comparing data sets using charts

[Figure for question 1: radar chart ‘Clear days in the ski fields of NSW’ for Thredbo and Perisher; spokes Jan to Dec; radial scale 0 to 10]

2. The area chart shows the number of wage earners employed in the public and private sectors in Australia over different years.

(a) How many wage earners were there in the public sector in 1997?

(b) What was the total number of wage earners in 1993?

(c) How many wage earners were employed in the private sector in 1991?

(d) What trends can be seen over the period from 1991 to 1997?

(e) What similarities or differences can be seen between the public and private sectors?

3. The area chart shows the seasonal rainfall for an island group in the Pacific Ocean.

(a) What was the rainfall for the southeastern region in summer?

(b) What was the rainfall for the northern region in spring?

(c) What was the total rainfall in autumn?

(d) The southeast is the wettest region. How is this shown in the graph? What could be a possible reason for one area getting more rain than the others?

(e) What trends in the rainfall can be seen over the year?

(f) What similarities or differences in rainfall can be seen between the regions?

4. Mr Pappadopoulos was admitted to hospital with a suspected stomach ulcer. His fluid intake (e.g. water and medicine) and output (e.g. urine) over a 24-hour period are summarised in the following table.

(a) Represent the data in a radar chart.

(b) By considering the areas enclosed by each data set, what observation can you make about Mr Pappadopoulos’s intake and output over the 24-hour period?

(c) Write down two other observations from your radar chart.

Time          6 am   8 am   10 am   12 noon   2 pm   4 pm
Intake (mL)    170    240     150       110    250     90
Output (mL)    140    150      80       180    130     90

Time          6 pm   8 pm   10 pm   12 midnight   2 am   4 am
Intake (mL)    150     60     180           170    160    210
Output (mL)     60    220     110           160    100    140

[Figure for question 2: area chart ‘Wage earners in Australia’; vertical axis: No. of wage earners (× 1000), 0 to 8000; horizontal axis: Year, 1991 to 1997; bands: Private sector, Public sector]

[Figure for question 3: area chart ‘Seasonal rainfall for island group’; vertical axis 0 to 400; horizontal axis: Season (Summer, Autumn, Winter, Spring); bands: Southwestern region, Southeastern region, Northern region]

5. Clark and Lois earn extra money for writing articles for newspapers and magazines. They save these amounts in a joint holiday fund. Their monthly earnings last year are shown in the table.

(a) Represent the data in a radar chart.

(b) Represent the data in an area chart.

(c) What information is best seen in the radar chart?

(d) What trends are clearly seen in the area chart?

6. (a) What information is contained in the graph?

(b) How do you think data for the years 2021–2041 was obtained?

(c) Describe the features of the part of the graph for the 15–59 age group.

(d) In 1961, approximately what percentage of the population was between (i) 0 and 14 (ii) 15 and 59?

(e) Approximately what percentage of the population is expected to be over 60 in 2021?

(f) Give two facts about Australia’s population that can be seen in the graph.

(g) What does this area chart show about age groups in the future?

TWO-WAY TABLES

Two-way tables are used to compare two characteristics—for example, gender and health.

Example 15

A National Health Survey in 1995 compared the number of adults in a population who exercised regularly to those who didn’t. The data is displayed in a two-way table.

(a) How many people were surveyed?

(b) What percentage of the people surveyed were female? Give your answer correct to 1 decimal place.

(c) What percentage of females exercised regularly?

(d) What percentage of the population did not exercise regularly?

(e) Comment on the statement ‘Men and women are similar in their exercise habits’.

Table for question 5 of Exercise 4-06:

Month                 Jan  Feb  Mar  Apr  May   Jun  Jul  Aug  Sep  Oct  Nov  Dec
Clark’s earnings ($)  370  240  530  570  780  1030  770  620  790  520  430  490
Lois’s earnings ($)   150  420  480  530  850  1280  920  650  810  480  390  350

Two-way table for Example 15:

          Exercise   No exercise
Male        3028        1532
Female      1804         946

[Figure for question 6 of Exercise 4-06: area chart ‘Australia’s population by age groups’; vertical axis: Percentage, 0 to 100; horizontal axis: Year, 1921 to 2041; bands: Age 0–14, Age 15–59, Age 60+]

Solution

(a) Number of people surveyed = 3028 + 1532 + 1804 + 946 = 7310

(b) Number of females = 1804 + 946 = 2750

Percentage of people who were female = (2750/7310) × 100 = 37.6%

(c) Percentage of females who exercised = (1804/2750) × 100 = 65.6%

(d) Percentage of people who did not exercise = ((1532 + 946)/7310) × 100 = 33.9%

(e) Number of males = 3028 + 1532 = 4560

Percentage of males who exercised = (3028/4560) × 100 = 66.4%

Since the percentage of females who exercised was 65.6% and the percentage for males was 66.4%, there is no significant difference, so the statement is supported by this data.
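The same percentages can be produced from the two-way table with a short program. This is only a sketch: the dictionary layout and the helper name `pct` are choices made for illustration, not part of the survey.

```python
# Two-way table from Example 15: rows are gender, columns are exercise habit.
table = {
    "Male": {"Exercise": 3028, "No exercise": 1532},
    "Female": {"Exercise": 1804, "No exercise": 946},
}


def pct(part, whole):
    """Percentage of `whole` made up by `part`, correct to 1 decimal place."""
    return round(100 * part / whole, 1)


row_totals = {sex: sum(counts.values()) for sex, counts in table.items()}
grand_total = sum(row_totals.values())
no_exercise = sum(counts["No exercise"] for counts in table.values())

pct_female = pct(row_totals["Female"], grand_total)
pct_no_exercise = pct(no_exercise, grand_total)
pct_exercise_by_sex = {
    sex: pct(table[sex]["Exercise"], row_totals[sex]) for sex in table
}

print("People surveyed:", grand_total)
print("Female (%):", pct_female)
print("No regular exercise (%):", pct_no_exercise)
print("Exercised regularly, by gender (%):", pct_exercise_by_sex)
```

Note which total is used as the divisor each time: the grand total for questions about everyone surveyed, and a row total for questions about males only or females only.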

1. The population of a town was surveyed in 1990 and 1997 to find out who had private health insurance.

(a) What was the population of the town in 1990?

(b) What was the population of the town in 1997?

(c) What percentage of the town had private health insurance in 1990?

(d) What percentage of the town had private health insurance in 1997?

(e) Suggest a reason for the decrease in the percentage of people with private health insurance.

2. The percentages of Australians living in rural areas in 1911 and 1996 were compared.

(a) Copy and complete the table.

(b) What percentage of Australians lived in urban areas in 1911?

(c) Comment on the differences between 1911 and 1996.

3. In one area there are three phone companies providing a service for mobile phones. The number of people using each company as a provider was recorded over a 3-year period.

Table for question 1:

              1990   1997
Private       4563   4048
No private    5577   8602

Table for question 2:

                      1911   1996
Rural areas            43%   ____
Urban (city) areas    ____    87%

Table for question 3:

        Telstra    Optus      Vodaphone
1995    204 695    194 198    125 967
1996    315 144    216 276     86 510
1997    402 628    304 025    115 037


(a) How many people in this area owned mobile phones in (i) 1995 and (ii) 1997?

(b) What percentage of people used Telstra as their provider in 1996?

(c) What percentage of people used Optus as their provider in 1997?

(d) What share of the market did Vodaphone have in (i) 1995 and (ii) 1997?

(e) What happened to Telstra’s share of the market from 1995 to 1997?

(f) What happened to Optus’s share of the market from 1995 to 1997?

(g) Comment on the statement ‘Telstra users doubled from 1995 to 1997’.

4. A survey was taken on whether to change the Australian flag or not. The results are shown in the table, grouped by age in years.

(a) How many people surveyed voted to (i) change the flag and (ii) keep the flag?

(b) What percentage of those surveyed wanted to keep the flag?

(c) What percentage of 18–24-year-olds wanted to change the flag?

(d) Which group was most definite in its response? What was this response? Why do you think this is so?

                   18–24   25–39   40–54   55–69
Change the flag      790     640     450     140
Keep the flag       1240     860     930     620

TEN MORE HOT TIPS FOR TACKLING EXAMS

1. Bring all of your equipment: pens, paper, geometrical instruments, calculator (check calculator works).

2. Don’t worry if you feel nervous before an exam. This is normal and helps you perform better. However, being too casual or too anxious can be harmful to your performance.

3. Write in black or blue, not red. Don’t use liquid paper. Use pencil only for diagrams and constructions.

4. Read each question and identify what needs to be found.

5. You don’t need to be writing all of the time. What you are writing may be wrong and a waste of time. Spend some time thinking and considering the best approach.

6. Make sure your answer sounds reasonable and realistic, especially if it involves money or measurement.

7. If you make a mistake, cross it out with a neat line. Don’t scribble over it completely. You may still get marks for it if it is right. Don’t use liquid paper. It is both time-consuming and messy.

8. Don’t cross out or change an answer rashly. You may have been right the first time.

9. Don’t round off in the middle of a calculation. Round off at the end only.

10. Don’t be afraid to write words and sentences in your working, but don’t use abbreviations that you’ve just made up.

USING MULTIPLE DISPLAYS TO COMPARE DATA SETS

Relationships between data sets can often be interpreted and described more effectively by using more than one display. Looking at a variety of different displays allows a better comparison of data sets as some features are more obvious in one display than in another. Every day in the media you will find examples of multiple displays describing data sets. A company director compares this year’s figures with those of previous years. Medical researchers compare the effects of a new drug on men and women for similarities and differences. Local councils investigate the population mix in a new suburban area in order to provide the most appropriate facilities.

Let us start with two simple data sets and look at three different ways of comparing them.

Example 16

The data sets A and B are displayed as lists, dot plots, a frequency table and a clustered column graph.

Lists

A: 5 6 7 8 9

B: 5 5 7 9 9

Dot plots

Frequency table

Column graph

(a) Comment on the shape and features of each data set.

(b) Find the mean, median and mode for each set.

(c) Find the range, interquartile range and standard deviation of each set.

(d) Comment on the benefits of using multiple displays to describe the data sets and to find measures of location and spread.

Solution

(a) Set A is symmetrical and flat.

Set B is symmetrical and has two peaks; that is, it is bimodal.

(b) Set A: Mean = 7, Median = 7, No mode

Set B: Mean = 7, Median = 7, Mode = 5, 9

(c) Set A: Range = 4, Interquartile range = 3, Standard deviation σₙ₋₁ = 1.58

Set B: Range = 4, Interquartile range = 4, Standard deviation σₙ₋₁ = 2

(d) Multiple displays cater for differences in people’s preferences as well as allowing for different statistical needs. The dot plots and column graph give good visual representations of the data sets and are best used to describe the shape and features of the data sets. The measures of location and spread are best found from the lists or frequency table, although the other displays can also be used.
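The measures in parts (b) and (c) can be checked from the lists with a short program. This is a sketch under two assumptions: quartiles are taken as the medians of the lower and upper halves of the sorted scores, and a set in which every score occurs only once is reported as having no mode. The function names are made up for illustration.

```python
from collections import Counter
from statistics import mean, median, stdev


def modes(scores):
    """Most frequent score(s); an empty list when every score occurs only once."""
    counts = Counter(scores)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(score for score, count in counts.items() if count == top)


def iqr(scores):
    """Interquartile range, with quartiles as medians of the two halves."""
    data = sorted(scores)
    half = len(data) // 2
    return median(data[len(data) - half:]) - median(data[:half])


def summary(scores):
    return {
        "mean": mean(scores),
        "median": median(scores),
        "mode": modes(scores),
        "range": max(scores) - min(scores),
        "iqr": iqr(scores),
        "sd": round(stdev(scores), 2),   # sample standard deviation
    }


set_a = [5, 6, 7, 8, 9]
set_b = [5, 5, 7, 9, 9]
print("Set A:", summary(set_a))
print("Set B:", summary(set_b))
```

The two sets share the same mean, median and range; only the mode, interquartile range and standard deviation tell them apart, which is why more than one measure is needed.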

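Frequency table for Example 16:

Score   Set A   Set B
  5       1       2
  6       1       0
  7       1       1
  8       1       0
  9       1       2

[Figures for Example 16: dot plots of Set A and Set B (axis: Score, 5 to 9) and a clustered column graph of both sets (axes: Score, 5 to 9, and Frequency, 0 to 3)]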

1. Two groups, each containing 15 people, were given a small timer and asked to stop the timer when they thought 60 seconds had elapsed. The results, in seconds, for the ‘estimated minute’ are listed:

Group A: 34 43 45 50 62 64 65 65 66 68 69 70 71 75 81

Group B: 42 46 48 48 49 50 55 58 60 61 62 64 65 68 70

(a) Construct a double stem-and-leaf plot.

(b) Draw a clustered column graph with classes 30–39, 40–49, …

(c) Draw box plots to represent the data sets.

(d) Write down one piece of information that is clearly shown in each of the three displays you have drawn.

(e) Find the mean and standard deviation of each data set (correct to 1 decimal place).

(f) Comment on the ability of each group to estimate a minute.

2. A coach, deciding which team should win the ‘most consistent players’ award, compared the season’s scores for two netball teams:

The Birds: 55 23 35 51 56 48 70 52 64 72

The Bees: 18 41 23 46 48 24 56 27 36 48

(a) Display the data in a stem-and-leaf plot, box plots and a column graph.

(b) Use your displays to describe the shape and features of each data set.

(c) By finding suitable measures of location and spread, decide which team is more consistent. Justify your answer.

3. The populations of two regions were surveyed to find out who belongs to a workers’ union. The results are tabulated and shown in a back-to-back histogram.

Table

Age              15–24  25–34  35–44  45–54  55–64  65+
Eastern region    35%    49%    54%    51%    62%   11%
Western region    34%    36%    38%    42%    45%    4%

Back-to-back histogram

(a) Write down two comparisons you can make between the two data sets.

(b) Use the information to comment on the statement ‘People in the eastern region are more likely to join a union’. Justify your answer.

Exercise 4-08:

Using multiple displays to compare data sets

[Figure: back-to-back histogram ‘Union membership by age and region’; horizontal axis: % belonging to a workers’ union, 0 to 70 in each direction; age groups 15–24 to 65+]

4. The heights of a group of men and women were measured to the nearest centimetre. The data was then represented in a double stem-and-leaf display and also as box plots.

Stem-and-leaf

Box plots

(a) What information is better shown in the stem-and-leaf display?

(b) What information is better shown in the box plots?

(c) What are the medians and interquartile ranges of the heights of men and women?

(d) Calculate the means and standard deviations of the heights of men and women (correct to 1 decimal place).

(e) Write down two similarities between the heights of men and women.

(f) Write down two differences between the heights of men and women.

5. The table below gives the average number of rainy days per month for the Australian capital cities.

(a) Draw at least two suitable displays illustrating the data.

(b) Calculate the mean and median number of rainy days for each city.

(c) Find the range and standard deviation of the number of rainy days for each city.

(d) Use these statistical measures and displays to determine:

(i) which city is driest

(ii) which city is wettest

(iii) which city has the most consistent pattern of rainy days

(iv) which city has most variation in the number of rainy days per month

Men Women

8 9 7 7 5 2 9 9 8 8 6 5 5 4 4 4 2 1 8 6 3 2 4 15 16 17 18 19

2 4 4 5 6 8 8 9 0 2 3 3 4 5 5 5 5 6 7 8 8 2 3 4 4

3

Table for question 5:

City        J   F   M   A   M   J   J   A   S   O   N   D
Adelaide    5   4   6   9  14  13  17  16  13  11   8   7
Brisbane   13  13  15  11  10   8   7   7   7   9  10  12
Canberra    7   7   7   8   9   9  10  12  10  11  10   8
Darwin     21  20  19   9   2   1   0   1   2   6  12  16
Hobart     11  10  11  12  14  14  15  15  15  16  14  13
Melbourne   8   7   9  12  14  14  15  16  15  14  12  11
Perth       3   3   4   8  13  17  18  16  13  10   7   4
Sydney     12  12  13  12  12  12  10  10  10  11  11  12

[Figure for question 4: box plots of the heights of men and women; axis: Height (cm), 145 to 195]

6. Use the table in question 5 to consider the rainfall per season in Australia. The seasons are summer (D, J, F), autumn (M, A, M), winter (J, J, A) and spring (S, O, N).

(a) Draw at least two suitable displays to illustrate the data.

(b) Calculate the mean, median, range and standard deviation for each season.

(c) Use these statistical measures and displays to determine:

(i) which is the wettest season

(ii) which is the driest season

(d) Comment on the statement ‘Rainfall in Australia does not vary much between seasons’.

Just for the record

BABY BOOMERS

After World War II finished in 1945, there was a ‘baby boom’ in Australia, New Zealand, Britain and North America. This rapid growth in the number of babies born lasted until the mid-1960s. People born during this time are referred to as ‘baby boomers’. The result of the large increase in births during this period will affect Australia’s population statistics as this group of people age. The two graphs show the baby boomer population moving from 2001 to 2031.

In 2031, the baby boomers will be over 65 years. Approximately how many more persons aged over 65 will there be in 2031 compared with 2001?

[Figure: ‘Age distribution of Australian population’, two bar graphs for 2001 and 2031; age groups 0–5 to 86+; horizontal scale 0 to 1600 (× 1000); the baby boomer age groups are marked in each graph]

One of the main roles of a statistician is to critically analyse related data sets and report on the findings. Businesses often use the results of an analysis for promotional purposes and companies report to their shareholders.

To critically analyse data sets:

• Draw suitable displays.

• Find measures of location and spread.

• Write a report on the relationship between the data sets, commenting on any similarities and differences between the data sets, unusual features, outliers or patterns.

• Draw conclusions and make recommendations.
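The "measures of location and spread" step can be laid out side by side before the report is written. The sketch below uses the monthly rainy-day figures for Sydney and Melbourne from question 7 of Exercise 4-05; the function names `measures` and `compare` are hypothetical, and the closing line is only a starting point for a written conclusion, not a replacement for one.

```python
from statistics import mean, median, stdev


def measures(scores):
    """Measures of location and spread for one data set."""
    return {
        "mean": round(mean(scores), 2),
        "median": median(scores),
        "range": max(scores) - min(scores),
        "sd": round(stdev(scores), 2),   # sample standard deviation
    }


def compare(name_a, a, name_b, b):
    """Return the lines of a side-by-side comparison of two data sets."""
    ma, mb = measures(a), measures(b)
    lines = [f"{key:>6}: {name_a} {ma[key]:>6}  {name_b} {mb[key]:>6}" for key in ma]
    # If the two values are equal, the second data set is named.
    higher = name_a if ma["mean"] > mb["mean"] else name_b
    steadier = name_a if ma["sd"] < mb["sd"] else name_b
    lines.append(f"Higher mean: {higher}; smaller spread: {steadier}")
    return lines


sydney = [12, 12, 13, 12, 12, 12, 10, 10, 10, 12, 11, 12]
melbourne = [8, 7, 9, 12, 14, 14, 15, 16, 15, 14, 12, 11]

report = compare("Sydney", sydney, "Melbourne", melbourne)
print("\n".join(report))
```

A report would then explain these figures in words, for example noting that Melbourne averages slightly more rainy days per month while Sydney's figures vary less from month to month.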

1. Twenty overweight people enrolled in a weight loss program at Rhonda’s Weight Loss Centre. Their weights (in kilograms) before and after the program were:

Before: 128 159 85 76 93 125 102 74 88 82 97 84 106 125 76 80 92 77 115 102

After: 75 72 64 95 58 62 120 93 85 72 102 65 73 62 56 60 105 82 52 64

Critically analyse the data and report back to Rhonda on how she can best advertise the success of her centre.

2. The times taken (in seconds) to check a basket of 20 grocery items at 15 automated and 15 manual checkouts were:

Automated: 45 58 63 43 75 69 84 65 96 73 90 61 84 72 96

Manual: 95 105 82 110 125 148 136 137 86 99 145 119 101 97 124

Critically analyse the data and report back to the manager of a store on the benefits of installing automated checkouts based on this data.

Obtain published data from the media or Internet, collect data through experiment or simulation, or use data already collected for your statistics file. Critically analyse the data sets by drawing appropriate graphs and tables, determining measures of location and spread, and writing a report on your findings.

Some suggested data sets are:

• the performances of two sporting teams (e.g. football or netball) in a season

• the performance of a sporting team in home and away matches

• pulse rates of males and females before and after exercise

• spending patterns of men and women

• heights and weights of males and females

• scores in two subject tests

• waiting times at a checkout on different days

• pollution levels at different times in the same city or in two different cities

• rainfall in two different towns or regions

• part-time incomes of male and female students.

Modelling activity:

Analysing data sets


A population pyramid displays information about the ages of a population. The oldest age group is at the top and hence the display resembles a pyramid. A simple population pyramid (or back-to-back histogram) is shown in question 3 of Exercise 4-08 (page 137).

1. This population pyramid shows a profile of the Australian population from 1911 to 2051. It is actually three pyramids together, showing the years 1911, 1996 and the population projection for 2051.

(a) Compare the numbers of males and females over 60 in 1911 and in 2051.

(b) How many females were 35 in 1996?

(c) How many males were 20 in 1911?

(d) Find one age group where there are more males.

(e) Find one age group where there are more females.

(f) Write down three differences between the population in 1911 and in 1996.

2. Investigate the age of the Aboriginal and Torres Strait Islander population and compare with the general Australian population using a population pyramid. You can find the necessary information at the following website: www.abs.gov.au.

Investigation:

Population pyramids

[Figure: population pyramid ‘Profile of Australia’s population, 1911–2051’; Males on one side and Females on the other; horizontal scale 0 to 200 (thousand) in each direction; ages 0 to 100+ in steps of 5; profiles for 1911, 1996 and 2051]

Chapter review

Statistical distributions

1. Collecting and displaying data

2. Summary statistics

3. Features of a statistical display

4. Investigating outliers

5. Displaying and comparing two data sets

6. Comparing data sets using charts

7. Two-way tables

8. Using multiple displays to compare data sets

This chapter, Statistical distributions, revises and extends the statistics covered in the Preliminary Course. It compares two data sets in a variety of displays, including double stem-and-leaf plots, box plots, radar charts and area charts. You also used measures of location and spread to compare data sets and learned how to interpret information from different displays. Be sure to include area charts and the effect of outliers in your summary. You could also include a glossary of statistical terms.

Make a summary of this topic. Use the chapter outline above as a guide. An incomplete mind map has also been started below. Use your own words, symbols, diagrams, boxes and reminders. Use the questions in Your say below to think about your understanding of the topic. Gain a ‘whole picture’ view of the topic and identify any weak areas.

Topic summary

[Mind map: ‘Statistical distributions’ at the centre, with branches for Area charts, Two-way tables, Stem-and-leaf plots, Radar charts, Box plots, Outliers, Comparing data sets, Summary statistics and Measures of spread]

• Have you satisfied the outcomes listed at the front of this chapter?

• What was the most important thing that you learned?

• How did you feel about the topic? Did you enjoy it?

• What was new?

• What are your weaknesses? What will you need to study more?

• How will you revise and summarise this topic?

1. Classify the data as (i) quantitative and discrete, (ii) quantitative and continuous, or (iii) categorical.

(a) numbers of cows on farms in NSW

(b) numbers of letters delivered each day to households in Campbelltown

(c) annual water consumption in Sydney

(d) numbers of workers who travel to work by public transport

(e) ages of first-year university students

(f) favourite movie

2. Find the mean, median and mode for each data set and suggest a possible population from which each set of data was taken.

(a) 10 11 11 12 12 12 13 13

(b) 3 3 3 4 4 4 5 5 5

(c) 72 72 73 75 76 83 84 85 87 94

3. Consider the set of scores: 3 4 5 5 8 9 12 15 18 20

(a) What is the mean?

(b) What is the median?

(c) Without doing any calculations, say what the effect on the mean and median would be of adding:

(i) one score of 30

(ii) one score of 50

(iii) a score of zero

(iv) a score of 10

(d) What would be the effect on the mean and median if each score was:

(i) increased by 2?

(ii) decreased by 3?

4. For each statistical display below:

(i) find the mean and standard deviation of the data set (to 1 decimal place)

(ii) describe the shape and features of the distribution

Your say: Reflecting about the topic


Chapter assignment

(a) [Figure: frequency display of wages from a part-time job; axes: Wages from part-time job (× $10), 6 to 20, and Frequency, 5 to 15]

5. Match the box plots to the following data sets.

(a) a random sample of 30 spectators at a football match

(b) a group of 30 senior citizens on a bus trip

(c) a group of 30 dancers at a nightclub

(d) two teachers taking a group of 30 primary students to the zoo

6. A factory produces small metal rods, designed to have a mass of 50 g. Samples were taken from two different machines and compared.

(a) Find the mean and standard deviation for each machine (correct to 1 decimal place).

(b) What are the median and interquartile range for machine A?

(c) What are the median and interquartile range for machine B?

(d) Construct box plots for the two data sets.

(e) Comment on the statement ‘Machine B produces rods of a more consistent mass than machine A’.