4 Statistical distributions
DATA ANALYSIS
Sydneysiders often like to claim that it’s always raining in Melbourne, but is Melbourne really such a wet city? The following data shows the average number of rainy days per month for the two capital cities, and is supplied by the Bureau of Meteorology.
Month       J   F   M   A   M   J   J   A   S   O   N   D
Sydney     12  12  13  12  12  12  10  10  10  12  11  12
Melbourne   8   7   9  12  14  14  15  16  15  14  12  11
Is it possible to use this data to compare the rainfalls of the two cities and decide which city is wetter? Does Melbourne generally have more rainy days than Sydney, or is the number of rainy days per month more consistent?
This chapter is about comparing the statistics of data sets and noting any similarities and differences. Do men spend more money than women? Do more people shop on weekends than weekdays? Are teachers generally younger than doctors? Data sets can be compared by examining the shapes of their graphs or by analysing their calculated measures of location and spread.
In this chapter you will learn how to:
• name and use the different types of data and random samples
• calculate measures of location (mean, median, mode)
• calculate measures of spread (range, interquartile range, standard deviation)
• analyse and interpret dot plots, stem-and-leaf plots, box plots and radar charts
• investigate outliers in small data sets and their effects on the mean, median and mode
• describe the shape of a distribution and make conclusions about the data in the distribution
• display and compare two sets of data in double stem-and-leaf plots, double box-and-whisker plots, radar charts and area charts
• interpret data presented in a two-way table form
• use summary statistics and multiple displays to interpret and compare the relationships between two data sets.
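As a first look at the rainy-day question, the two rows of the table can be summarised with a measure of location and two measures of spread. The following is a minimal sketch only: the function name `summarise` is made up for illustration, and the population standard deviation is used on the assumption that the twelve monthly averages are the whole data set.

```python
from statistics import mean, pstdev

# Average number of rainy days per month (January to December), from the table.
sydney = [12, 12, 13, 12, 12, 12, 10, 10, 10, 12, 11, 12]
melbourne = [8, 7, 9, 12, 14, 14, 15, 16, 15, 14, 12, 11]

def summarise(days):
    """Return (mean, range, population standard deviation) of the monthly values.

    Hypothetical helper for this sketch; pstdev is used because all twelve
    months are given, which is an assumption about how to treat the data.
    """
    return mean(days), max(days) - min(days), pstdev(days)

for city, days in (("Sydney", sydney), ("Melbourne", melbourne)):
    centre, value_range, sd = summarise(days)
    print(f"{city}: mean = {centre:.2f}, range = {value_range}, sd = {sd:.2f}")
```

A higher mean suggests more rainy days overall, while a smaller range and standard deviation suggest a more consistent month-to-month pattern.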
COLLECTING AND DISPLAYING DATA
Collecting data
Data or information can be collected by a variety of means:
• through observation, such as a naturalist observing animal behaviour
• by experiment, for example a medical researcher testing the effects of a new drug
• from a survey, usually via a telephone poll or a written questionnaire
• by taking a census, that is, surveying the whole population.
Idea: Collecting statistical graphs and tables
Do you still have your statistics file containing graphs and tables collected during the Preliminary Course? You should now add to your file by collecting articles from recent newspapers that contain graphs and tables, especially those that contain more than one statistical display or display data in an interesting way.
Idea: Use the Internet to find real data
Use your library or explore the Internet to find real data. Here are three useful websites:
• Australian Bureau of Statistics (ABS): www.abs.gov.au or www.statistics.gov.au
• Bureau of Meteorology: www.bom.gov.au
• Morgan Surveys: www.roymorgan.com.au
Types of data
Data falls into one of the following types:
• quantitative (numerical) data that is discrete, such as the number of computers in schools
• quantitative (numerical) data that is continuous, such as the weights of gym members
• categorical (qualitative) data, such as the birthplaces of people living in Sydney.
Example 1
What type of data is each of these?
(a) the numbers of people attending Olympic Games
(b) the types of breakfast cereal in Cottonworths supermarket
(c) the body temperature of a hospital patient taken over a 24-hour period
Solution
(a) Quantitative and discrete since the data are distinct whole numbers.
(b) Categorical since the data are brand names of cereals.
(c) Quantitative and continuous since the data can be measured along a continuous scale.
Think: Which graph is best?
• Quantitative or numerical data is best displayed in a column graph or line graph.
• Categorical or qualitative data is best displayed in a sector graph or divided bar graph.
Why do you think this is so?
Random sampling
It is not always convenient to collect data from all members of a population—that is, by using
a census. If a population is too large or too difficult to survey, a sample of items can be taken
from the population and analysed, and the results used to reflect population characteristics.
• A simple random sample is one where each member of the population is equally likely to be chosen; for example, choosing the winning balls in Lotto.
• A systematic sample is one where the first member is chosen at random and the others are chosen at regular intervals; for example, every 8th toy on a production line.
• A stratified sample is one where a representative sample is taken from each stratum or layer of a population; for example, a stratified sample from a population containing 70% adults and 30% children would contain 70% adults and 30% children.
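The three sampling methods above can be sketched in a few lines of code. This is an illustration under stated assumptions, not a standard routine: the function names, the made-up population of 70 adults and 30 children, and the fixed seed are all hypothetical.

```python
import random

def simple_random_sample(population, k, rng):
    """Every member of the population is equally likely to be chosen."""
    return rng.sample(population, k)

def systematic_sample(population, interval, rng):
    """Random starting point in the first interval, then every interval-th member."""
    start = rng.randrange(interval)
    return population[start::interval]

def stratified_sample(strata, k, rng):
    """Sample each stratum in proportion to its share of the population.

    Shares are rounded, so for awkward proportions the total may differ
    slightly from k.
    """
    total = sum(len(members) for members in strata.values())
    chosen = []
    for members in strata.values():
        share = round(k * len(members) / total)
        chosen.extend(rng.sample(members, share))
    return chosen

rng = random.Random(1)  # fixed seed so the sketch is repeatable
adults = [f"adult{i}" for i in range(70)]      # hypothetical population
children = [f"child{i}" for i in range(30)]
population = adults + children

print(simple_random_sample(population, 10, rng))
print(systematic_sample(population, 8, rng))
print(stratified_sample({"adults": adults, "children": children}, 10, rng))
```

With 70% adults and 30% children, a stratified sample of 10 contains 7 adults and 3 children, matching the proportions in the population.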
Exercise 4-01: Collecting and displaying data
1. State whether the data is (i) categorical, (ii) quantitative and discrete, or (iii) quantitative and continuous in each case.
(a) temperature of water in a swimming pool
(b) number of people who voted Liberal in the last four elections
(c) response time when patient’s reflexes are tested
(d) religious denomination
(e) breeds of dogs
(f) speed of a car
(g) number of goals scored in a football match
(h) heights of girls in the school athletics team
2. Give two reasons for choosing a sample rather than a census.
3. What are biased and unbiased samples?
4. Which type of random sample (simple, systematic or stratified) would best suit each of the following situations?
(a) random breath testing
(b) opinion poll on whether Australia should change the flag
(c) taste testing a new brand of soft drink
5. Answer the questions that follow these three displays.
(a) [Display: Drugs used by Australians. Categories: Alcohol, Tobacco, Cannabis, Painkillers, Sleeping pills, Heroin, Amphetamines, Ecstasy, Cocaine, Hallucinogens. Scale: Percentage, 0 to 100.]
(b)
(c)
(i) State the type of display and information contained in the display.
(ii) Describe the type of data displayed.
(iii) Comment briefly on the strengths and weaknesses of the display.
SUMMARY STATISTICS
Measures of location (or averages)
Measures of location or averages are used to indicate the middle or centre of a data set. There are three measures of location: the mean, median and mode. You should use the measure that is best suited to the type and distribution of the data.
The mode is the most popular value or category.
The centre of a categorical data set is always described by the mode. For example, the modal dress size is the size worn by more women than any other.
The mean is the arithmetic average of all scores: $\bar{x} = \frac{\sum x}{n}$ or $\bar{x} = \frac{\sum fx}{\sum f}$.
The median is the middle score (or average of the two middle scores) when the scores are arranged in ascending order.
The centre of a quantitative data set is usually described by the mean or the median. The mean takes into account all scores in a data set and can be considered as the ‘balance point’ of the data set. It is, however, affected by very large or very small scores. In distributions where there are outliers, it is better to use the median as the measure of location.
[Display: Newstart Allowance, 1995 to 1997. Age groups: less than 20, 21–34, 35–54, 55–59, more than 60. Scale: 0 to 250 000.]
[Display: Road fatalities in Australia, March 1995 to March 1999. Scale: fatalities per 10 000 population, 0.0 to 3.5.]
Outliers
An outlier is a very high or very low score that is clearly apart from the other scores.
An outlier can occur for a variety of reasons and should always be investigated. If an outlier is found to be a value obtained through incorrect measurement or observation and is not a typical score, it can be excluded. If the outlier is a possible value from the population, it should be included in the distribution.
[Dot plot: temperatures on a scale from 36°C to 42°C.] Here the outlier temperature is 42°C.
Example 2
Find the mean and median of the two data sets and state which is the more appropriate measure of location for each set:
A: 1 2 3 3 4 7 8
B: 1 2 3 3 4 7 29
Solution
A: Mean $\bar{x} = \frac{1+2+3+3+4+7+8}{7} = 4$
Median = 3
B: Mean $\bar{x} = \frac{1+2+3+3+4+7+29}{7} = 7$
Median = 3
For set A, either the mean or median could be used as the measure of location.
For set B, the median is the better measure of location as the outlier 29 affects the mean.
The mean can also be found using the statistics mode of a scientific or graphics calculator.
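The working in Example 2 can also be checked with a short program. This is a minimal sketch using Python's standard `statistics` module; the variable names `set_a` and `set_b` are chosen here for illustration.

```python
from statistics import mean, median

set_a = [1, 2, 3, 3, 4, 7, 8]
set_b = [1, 2, 3, 3, 4, 7, 29]

for name, scores in (("A", set_a), ("B", set_b)):
    # The outlier 29 pulls the mean of set B upwards but leaves the median alone.
    print(f"Set {name}: mean = {mean(scores)}, median = {median(scores)}")
```

Only the mean changes between the two sets, which is why the median is preferred when an outlier is present.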
Just for the record
THE CHALLENGER DISASTER
In 1986, the space shuttle Challenger exploded just after takeoff and seven astronauts were killed. It was found that two rubber O-rings had failed because of the low air temperature. Here is some data from previous flights:
Air temperature (°C)   12  14  19  23  26
Damage index           11   4   0   0   0
An engineer had noticed the outlier (i.e. a damage index of 11 at a temperature of 12°C) and the fact that the expected air temperature at the takeoff time was below 0°C and had recommended that the flight be delayed. Unfortunately, the outlier was not considered important and the flight ended in tragedy.
Can you give another example where an outlier should not be ignored?
Measures of spread
Measures of dispersion or spread are used to indicate how spread out a data set is. As with measures of location, you should use the measure that is best suited to the type and
distribution of data.
Range and interquartile range
Range = highest score – lowest score
Interquartile range = upper quartile – lower quartile
= Q3 – Q1
The interquartile range is the range of the middle 50% of scores. The upper quartile (Q3) is
the median of the upper half of the scores and the lower quartile (Q1) is the median of the
lower half. The interquartile range is often a better indicator of spread than the range as it does not take extreme scores into account.
Example 3
Find the range and interquartile range of the two data sets and state which is the more appropriate measure of spread for each set.
A: 1 2 3 3 4 7 8
B: 1 2 3 3 4 7 29
Solution
A: 1 2 3 3 4 7 8
Q1 = 2, Q2 = 3, Q3 = 7
Range = 8 – 1 = 7
Interquartile range = 7 – 2 = 5
B: 1 2 3 3 4 7 29
Q1 = 2, Q2 = 3, Q3 = 7
Range = 29 – 1 = 28
Interquartile range = 7 – 2 = 5
For set A, either the range or interquartile range could be used as the scores are fairly evenly spread.
For set B, the interquartile range is the better measure of spread as it does not take the outlier score 29 into account.
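The quartile method used in Example 3 (the median of each half of the ordered scores) can be sketched as follows. The helper names `quartiles` and `spread` are hypothetical, and the sketch assumes at least two scores. Note that statistical software often uses slightly different quartile rules, so results from other tools may not match this method exactly.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of scores the middle score is left out of both halves.
    Assumes at least two scores.
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower = ordered[:half]    # scores below the median position
    upper = ordered[-half:]   # scores above the median position
    return median(lower), median(ordered), median(upper)

def spread(scores):
    """Return (range, interquartile range)."""
    q1, _, q3 = quartiles(scores)
    return max(scores) - min(scores), q3 - q1

print(spread([1, 2, 3, 3, 4, 7, 8]))    # set A
print(spread([1, 2, 3, 3, 4, 7, 29]))   # set B
```

The range changes from set A to set B, but the interquartile range does not, because the outlier lies outside the middle 50% of scores.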
Standard deviation
Standard deviation is the most common measure of the spread of a distribution. It is the
square root of the average of the squared deviations from the mean.
σn is the standard deviation of a population and σn − 1 is the standard deviation of a sample.
σn − 1 is used to approximate the population standard deviation σn and gets closer to the population value as the sample size increases.
Use σn − 1 if the data is from a sample (or if you are unsure) and σn if all the possible data is
given. In either case, always state which standard deviation you are using.
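In symbols, the two standard deviations can be written as below. This is a sketch using the same notation as the mean formula ($\bar{x}$ for the mean, $n$ for the number of scores); the subscripts follow the calculator labels σn and σn − 1.

```latex
\sigma_n = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}
\qquad
\sigma_{n-1} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
```

Dividing by $n - 1$ instead of $n$ makes the sample value slightly larger, and the difference shrinks as $n$ grows.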
Mean and standard deviation from a calculator
The mean and standard deviation of a data set can be calculated using the statistics mode (STAT or SD) of your calculator.
Example 4
Here are the net weekly earnings of 8 labourers:
$730 $490 $600 $440 $490 $370 $700 $580
(a) What is the mean weekly earning?
(b) What is the standard deviation of the earnings?
Solution
Clear any previous data and check that n = 0.
Enter the separate values: 730 DATA 490 DATA 600 DATA … 580 DATA
Check that you have entered the correct number of scores by checking that n = 8.
(a) Mean = $550
(b) Standard deviation σn − 1 ≈ $125.40
Example 5
Find the standard deviation of the two data sets and state which set is the more widely spread.
A: 1 2 3 3 4 7 8
B: 1 2 3 3 4 7 29
Solution
A: Sample standard deviation σn − 1 ≈ 2.58
B: Sample standard deviation σn − 1 ≈ 9.88
Set B is more widely spread as it has a much larger standard deviation.
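The calculator results in Examples 4 and 5 can be reproduced with Python's standard `statistics` module, where `stdev` corresponds to σn − 1 and `pstdev` to σn. This is a minimal sketch; the variable names are chosen for illustration.

```python
from statistics import mean, pstdev, stdev

earnings = [730, 490, 600, 440, 490, 370, 700, 580]  # Example 4
set_a = [1, 2, 3, 3, 4, 7, 8]                         # Example 5
set_b = [1, 2, 3, 3, 4, 7, 29]

# stdev divides by n - 1 (sample); pstdev divides by n (population).
print("Earnings mean:", mean(earnings))
print("Earnings sample sd:", round(stdev(earnings), 2))
print("Earnings population sd:", round(pstdev(earnings), 2))
print("Set A sample sd:", round(stdev(set_a), 2))
print("Set B sample sd:", round(stdev(set_b), 2))
```

Printing both versions for the earnings shows how much the choice between σn − 1 and σn matters for a small data set.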
Example 6
Twenty possums were captured, tagged and released in the Booderee National Park. Rangers recaptured several samples of 10 possums over a 2-month period and recorded the number of tagged possums in each sample.
No. tagged per sample (score)   0   1   2   3   4   5
No. of samples (frequency)      8  11   5   4   2   1
(a) What is the mean number of tagged possums per sample (correct to 2 decimal places)?
(b) What is the standard deviation of tagged possums (correct to 2 decimal places)?
Solution
Clear any previous data and check that n = 0.
Enter the data: 0 × 8 DATA 1 × 11 DATA 2 × 5 DATA etc.
Check that you have entered the correct number of scores by checking that n = 31.
(a) Mean = 1.48 possums
(b) Standard deviation σn − 1 = 1.36 possums
σn − 1 is the sample standard deviation.
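The frequency-table calculation in Example 6 can be sketched in code by weighting each score by its frequency. The function name `frequency_stats` is hypothetical, and the sketch assumes the table has at least two scores in total.

```python
from math import sqrt

def frequency_stats(scores, frequencies):
    """Return (n, mean, sample sd, population sd) for a frequency table.

    Each score x occurs f times, so sums are weighted by f.
    """
    n = sum(frequencies)
    total = sum(f * x for x, f in zip(scores, frequencies))
    mean = total / n
    squared_dev = sum(f * (x - mean) ** 2 for x, f in zip(scores, frequencies))
    return n, mean, sqrt(squared_dev / (n - 1)), sqrt(squared_dev / n)

# Example 6: number of tagged possums per sample and number of samples.
n, mean, sample_sd, population_sd = frequency_stats(
    [0, 1, 2, 3, 4, 5], [8, 11, 5, 4, 2, 1]
)
print(n, round(mean, 2), round(sample_sd, 2))
```

The same function can be applied to grouped data by using the class centres as the scores, in which case the mean and standard deviation are estimates.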
Example 7
The annual salaries of employees at the Nelson manufacturing company are tabulated.
Annual salary (× $1000)   Number of employees
20–<30                    16
30–<40                    12
40–<50                    5
50–<60                    5
60–<70                    6
70–<80                    2
20–<30 means from 20 up to but not including 30.
(a) How many people are employed at the company?
(b) Using class centres 25, 35, 45, …, find the estimated mean salary of the employees.
(c) What is the standard deviation of the salaries?
Solution
Annual salary (× $1000)   Class centre   No. of employees
20–<30                    25             16
30–<40                    35             12
40–<50                    45             5
50–<60                    55             5
60–<70                    65             6
70–<80                    75             2
Total                                    46
(a) Number of employees = 46.
(b) Clear any previous data and check that n = 0.
Enter the data: 25 × 16 DATA 35 × 12 DATA 45 × 5 DATA etc.
Check that n = 46.
Mean = 40.4.
Hence the estimated mean salary is $40 400 per annum.
(c) Standard deviation σn = 15.7.
Hence the standard deviation of the salaries is $15 700 per annum.
σn is the population standard deviation.
1. Without using the statistical functions on your calculator, find the mean and median for each data set (correct to 1 decimal place where appropriate). State which is the better measure of location and why.
(a) 2 3 3 5 6 8 9
(b) 26 22 24 29 21 23 24 22
(c) 8 40 38 42 45 29 31 41 30
(d) 6 8 11 9 10 8 11 12 6 7
2. Find the range and interquartile range for each data set in question 1 and state which is the better measure of spread and why.
3. Using your calculator, find the mean and standard deviation (correct to 1 decimal place) for each of these sets of data.
(a) 42, 35, 63, 70, 81, 80, 85
(b) $300, $400, $600, $440, $300, $700, $250, $580, $260
(c) 37.4°F, 38.2°F, 39.0°F, 36.8°F, 38.5°F, 38.0°F, 36.8°F, 40.5°F
(d) 165 kg, 146 kg, 178 kg, 190 kg, 158 kg, 147 kg
(e) 23, 18, 24, 16, 17, 20, 15, 22, 19
4. The hair colours of 75 people were noted.
Hair colour   No. of people
Brown         16
Blonde        25
Black         14
Red           7
(a) What is the modal hair colour?
(b) Why is the mode the best measure of central tendency here?
(c) Why are the mean and median not appropriate measures here?
5. Joshua swam a kilometre each morning for 10 days in preparation for a swimming carnival. His times (in minutes) were:
28 24 22 24 25 24 26 26 24 27
(a) What is his median swim time?
(b) What is his mean swim time?
(c) What is his range of swim times?
(d) What is the interquartile range of his times?
(e) What is the standard deviation of his times (correct to 1 decimal place)?
(f) If Joshua asked you to tell him the most appropriate measures of location and spread for these times, which two would you choose? Justify your answer.
6. Ted and Julie were paid by piecework for making T-shirts. The numbers made each day over an 8-day period were:
Ted: 18 25 19 19 26 24 15 22
Julie: 16 20 21 28 12 26 18 19
(a) For each person find:
(i) the number of T-shirts made in the 8-day period
(ii) the interquartile range of T-shirts made
(iii) the mean number of T-shirts made
(iv) the standard deviation of T-shirts made (correct to 1 decimal place)
(b) Comment on the statement that Ted is a more consistent worker than Julie by comparing their means and standard deviations. Give reasons for your answer.
7. Numbers of motor accidents per week over a 9-week period at a busy intersection were:
4 3 6 0 4 9 2 3 5
(a) What is the median number of accidents?
(b) What is the mean number of accidents per week?
(c) Does the mean or median best describe the centre of this data set? Give reasons.
(d) What is the range of the data?
(e) What is the interquartile range?
(f) Find the standard deviation for the data (correct to 1 decimal place).
(g) Comment on the statement: ‘The number of accidents per week is fairly consistent’. Justify your answer.
8. This stem plot shows the waiting times in a medical centre (in minutes).
Stem  Leaf
1     2 4 5
2     3 6 9
3     0 1 4 5 7 7 8
4     2 4 5
5     1 3 7
6     2
(a) Find the mean waiting time (correct to 1 decimal place).
(b) Find the standard deviation of waiting times (correct to 1 decimal place).
9. The weekly wages of a group of teachers are shown in the table.
Weekly wage ($)   No. of teachers
500–<600          8
600–<700          5
700–<800          30
800–<900          25
900–<1000         12
(a) What is the mean weekly wage?
(b) What is the standard deviation of the weekly wages (correct to 1 decimal place)?
10. The percentage marks for 250 students in a Business Studies examination are listed.
Score             11–20  21–30  31–40  41–50  51–60  61–70  71–80  81–90  91–100
No. of students   14     18     15     26     20     31     44     39     43
(a) What is the mean mark (to the nearest whole number)?
(b) What is the standard deviation of the marks (correct to 1 decimal place)?
(c) Write a comment to the school principal describing the results of these students.
FEATURES OF A STATISTICAL DISPLAY
Shape
The shape of a statistical display shows how the data is distributed. When using a dot plot or histogram, a curve can be used to approximate the general shape.
[Figure: display on a scale from 0 to 10.]
Clustering
Clustering occurs when scores are close together or ‘bunched up’. In this stem-and-leaf plot, the scores are clustered in the 50s and 80s.
[Figure: stem-and-leaf plot with stems 3 to 9.]
Symmetry
A distribution has symmetry if the scores are balanced or evenly spread about the centre of the distribution. In a symmetrical distribution, the mean, median and mode are usually the same. For this distribution, the mean, median and mode are all 5.
[Figure: symmetrical distribution on a scale from 3 to 7.]
Skew
A distribution that is skewed is not symmetrical. The tail indicates the direction of the skew.
• If the scores are mostly low (or to the left), the distribution is positively skewed.
• If the scores are mostly high (or to the right), the distribution is negatively skewed.
[Figure: positively skewed distribution.] This distribution is positively skewed. The tail points to the right, the positive direction.
[Figure: dot plot on a scale from 21 to 28.] The data in this dot plot is negatively skewed. The tail points to the left, the negative direction.
[Figure: stem-and-leaf display with stems 2 to 9.] The data in this stem-and-leaf display is negatively skewed as the scores are mostly high with a tail towards the low scores. Clustering also occurs in the 70s and 80s.
Peaks and modes
Peaks are the high points or ‘humps’ in a display. The highest peak is called the mode.
No peaks: display is uniform or flat and there is no mode.
One peak: display is unimodal. The mode here is 6.
Two peaks: display is bimodal. The mode here is 6. The mode is the higher of the two peaks.
Many peaks: display is multimodal. There are three peaks here and two modes: 5 and 7.
[Figures: four displays illustrating no peaks, one peak, two peaks and many peaks.]
Think: Shape and measures of location
The three distributions show the relative positions of the mean, median and mode.
[Figure: three frequency curves (Frequency against Score) with the mean, median and mode marked: (a) Symmetrical, (b) Positively skewed, (c) Negatively skewed.]
• For a symmetrical distribution, the mean, median and mode are usually equal.
• For a skewed distribution, the median is usually between the mean and mode and is the better measure of location.
Diagram (a) could represent results in an HSC General Mathematics examination.
Diagram (b) could represent traffic flow from 6 am to noon.
Diagram (c) could represent the heights of basketball players in a club.
Can you think of other situations that these diagrams could represent?
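The usual relationship between the mean and median can be turned into a rough check for the direction of skew. This is a rule-of-thumb sketch only, since the relationship holds "usually" rather than always; the function name `describe_skew` and the three small data sets are invented for illustration.

```python
from statistics import mean, median

def describe_skew(scores):
    """Rule of thumb: the mean is pulled towards the tail of the distribution."""
    centre_mean, centre_median = mean(scores), median(scores)
    if centre_mean > centre_median:
        return "positively skewed"
    if centre_mean < centre_median:
        return "negatively skewed"
    return "roughly symmetrical"

# Hypothetical data sets, made up for this sketch.
mostly_low = [1, 1, 1, 2, 2, 3, 4, 6, 9]
mostly_high = [9, 9, 9, 8, 8, 7, 6, 4, 1]
balanced = [3, 4, 4, 5, 5, 5, 6, 6, 7]

for scores in (mostly_low, mostly_high, balanced):
    print(describe_skew(scores))
```

A display of the data should still be drawn, because two quite different shapes can share the same mean and median.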
TEN HOT TIPS FOR TACKLING EXAMS
1. Find out about the format of the exam: the topics to be tested, the time allowed, the number and format of questions, the marks awarded, whether formulas are supplied.
2. Be prepared!
3. Spend the first 5 minutes browsing through the exam to see the work that is ahead of
you. Note the harder questions—you may need to spend more time on them.
4. Spend the first minute of each question planning and thinking.
5. Keep an eye on the time. Don’t spend too much time on one question.
6. Write clearly. Draw big diagrams. Spread out your working and set it out neatly. Write
down the page, not across.
7. Make sure you have answered the question. Did you remember to round off and/or
include units? Did you use all of the relevant information given?
8. Attempt every question.
9. If the working-out to a hard question is taking too long, then it’s probably wrong.
Don’t get bogged down. If you’re getting nowhere, retrace your steps, start again, or skip the question and return later with a fresh mind.
10. Once you have completed the exam, go over it again. Double-check your answers,
especially the harder ones or those of which you’re unsure.
Exercise 4-03: Features of a statistical display
1. Draw a curve representing a statistical display that:
(a) is symmetrical
(b) is positively skewed
(c) shows clustering
(d) is negatively skewed with clustering
(e) is symmetrical and bimodal
2. For each of the following displays state:
(i) if the data is symmetrical or skewed
(ii) if there are any clusters
(iii) if there are any outliers
(iv) how many peaks there are
[Displays (a) to (f): a histogram (Score 1 to 11, Frequency 0 to 12), a plot on a scale from 4 to 9, a stem-and-leaf plot with stems 1 to 5, a plot on a scale from 10 to 20, a histogram (Score 5 to 50, Frequency 0 to 6) and a stem-and-leaf plot with stems 7 to 9.]
Stem  Leaf
1     3 4 6 6 6 7 8 9 9
2     0 7
3     1 2 2 5 7 8 8 9
4     0 2 3
5     2 9
3. The numbers of visits (or hits) to a popular Internet website were tabulated over a 10-hour period.
Time        Hits (× 1000)
1201–1300   1.3
1301–1400   0.8
1401–1500   0.4
1501–1600   2.1
1601–1700   2.6
1701–1800   4.5
1801–1900   3.9
1901–2000   5.3
2001–2100   2.3
2101–2200   1.2
Draw a histogram to represent this data and comment on the features of the display, such as shape, skew, clustering and peaks.
4. For the given information:
5 14 8 7 12 3 2 8 4 10 6 2 7 3 9 9 6 4 8 9
(a) draw a dot plot to display the data
(b) comment on the features of the display
5. Here is a set of data:
22 16 36 15 16 24 15 15 19 55 58 59 18 17 20 20 24 15 54 19
15 40 21 17 50 22 23 21 24 23 15 35 15 24 22 19 15 17 43 49
(a) Draw a stem-and-leaf display for this data set using stems 1, 2, 3, 4 and 5.
(b) Comment on the features of the display.
(c) Give the name of a possible population that this data could represent.
6. This dot plot represents the industrial accidents per month at a factory:
[Dot plot: accidents per month on a scale from 0 to 9.]
(a) What is the mean number of accidents in this period (correct to 1 decimal place)?
(b) What is the standard deviation (correct to 1 decimal place)?
(c) What could be a possible reason for the outlier 9?
(d) What are the mean and standard deviation if the outlier 9 is not included (correct to 1 decimal place)?
(e) Compare the means and standard deviations of the two groups of data.
INVESTIGATING OUTLIERS
Outliers often have the effect of raising or lowering a mean value but they can also affect the mode and median.
Example 8
A: 20 25 30 35 40 45
B: 20 25 30 35 40 60
C: 20 25 30 35 40 120
(a) Find the mean and median of each set of scores.
(b) The three data sets are the same except for the value of the last score. Investigate the effect of increasing the last score on the mean and median of set A.
(c) What are the values of the mean and median of set C if the outlier 120 is not included?
Solution
(a) A: 20 25 30 35 40 45
B: 20 25 30 35 40 60
C: 20 25 30 35 40 120
In each set, the median = 32.5 (midway between the two middle scores, 30 and 35).
Set   Mean   Median
A     32.5   32.5
B     35     32.5
C     45     32.5
(b) Increasing the last score has no effect on the median.
As the last score increases, so the value of the mean increases. The outlier of 120 has the greatest effect on the value of the mean.
(c) Set C without the score 120 has a mean and median of 30.
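The investigation in Example 8 can be repeated in code by changing only the last score. This is a minimal sketch; the variable name `base` for the five shared scores is chosen for illustration.

```python
from statistics import mean, median

base = [20, 25, 30, 35, 40]  # the five scores common to sets A, B and C

for last in (45, 60, 120):
    scores = base + [last]
    # Only the mean responds to the size of the last score.
    print(f"last score {last}: mean = {mean(scores)}, median = {median(scores)}")

# Part (c): leaving the outlier out altogether.
print(f"without the last score: mean = {mean(base)}, median = {median(base)}")
```

Trying other values for the last score shows the same pattern: the median stays fixed while the mean follows the outlier.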
1. For each pair of data sets below find:
(i) the mean and median (correct to 1 decimal place)
(ii) the value of any outlier score
(iii) the effect on the mean and median of any outlier
(a) A: 10 12 14 16 18 20
B: 10 12 14 16 18 40
(b) A: 5 37 41 53 56
B: 36 37 41 53 56
(c) A: 3 4 8 9 12 14
B: 3 6 7 10 13 25
(d) A: 110 120 130 135 135 140 140
B: 55 115 135 140 145 145 150
2. For each data set below:
(i) find the mean, median and mode (correct to 1 decimal place where needed)
(ii) state the value of any outlier
(iii) say which measure of location is the most appropriate
(iv) sketch the shape
(a) 2 8 3 16 9 26 8
(b) 8 16 4 21 4 23 16 12
(c) 120 g 85 g 72 g 60 g 80 g 80 g
(d) 37°C 38°C 41°C 39°C 38°C 37°C 37°C
3. The 7 employees at the Bug and Beef Cafe earned the following wages in a week:
$350 $420 $510 $130 $635 $320 $460
(a) What is the mean wage?
(b) What is the median wage?
(c) Which is the more appropriate measure of location? Justify your answer.
(d) If each employee received a 10% pay rise, what would be the new mean and median wages?
(e) By what percentage would the mean increase?
(f) If the manager who earned $635 was not included in the data set, what would be the mean and median wages?
4. In a netball tournament of 5 matches, the numbers of points scored by three teams are:
The Wombats: 24 18 14 6 22
The Possums: 16 16 15 18 15
The Koalas: 36 8 14 16 12
(a) What are the mean and median for each team?
(b) Which team is more consistent? Why?
(c) An error was found in the recording for the Wombats. The score of 6 should have been 16. What are the new mean and median?
5. Pam and Percy sell photocopiers. The numbers of copiers sold over a 10-week period are shown.
Pam: 1 2 3 3 5 6 7 8 12 25
Percy: 3 3 3 14 16 18 18 24 32 35
(a) What is the modal number of copiers sold by each person?
(b) What could you say about each person if you only knew the mode?
(c) What is the median number of copiers sold by each?
(d) What is the mean number of copiers sold by each?
(e) Which measure of location is the best measure to compare the sales performances of Pam and Percy?
(f) Who is the better salesperson? Why?
6. Choose 5 scores that have the same mean and median. What effect will adding a score of 100 have on the mean and median?
7. Rupert’s bookstore employs the following people with annual wages as shown:
2 store managers $64 300
4 cashiers $34 200
3 part-time clerical staff $28 500
10 salespeople $46 500
2 part-time cleaners $13 500
(a) What is the modal wage? Why?
(b) What is the median wage?
(c) What is the mean wage (to the nearest dollar)?
(d) Which measure would Rupert use to make the salaries appear higher?
(e) Which measure of location (average) best represents the average wage for an employee at Rupert’s bookstore?
DISPLAYING AND COMPARING TWO DATA SETS
Double stem-and-leaf plots
By representing two related data sets in a double (back-to-back) stem-and-leaf display, similarities and differences, such as clustering and averages (measures of location), can be easily seen.
Example 9
This double stem-and-leaf plot shows the numbers of dollars spent by a group of students visiting the Easter show.
       Boys | Stem | Girls
8 6 6 5 5 4 |  1   | 2 5 5 8
    6 4 3 2 |  2   | 0 2 4 5 5 5 6 7 8 9
      9 8 2 |  3   | 1 2 4
5 3 2 1 1 0 |  4   | 0 2
          2 |  5   |
(a) How many students went to the show?
(b) Give two observations on the shape and features of the data.
(c) Calculate the mean and standard deviation (to the nearest 5 cents) of amounts spent by boys and by girls.
(d) Considering all the information you have, do you think that boys are the bigger spenders? Why?
Solution
(a) 39 students, consisting of 20 boys and 19 girls.
(b) The amounts spent by the girls show clustering at $20–$29, whereas the amounts spent by the boys are more evenly spread out.
The data for the girls is positively skewed.
(c) Girls: Mean = $25.80 Standard deviation σn − 1 = $8.00
Boys: Mean = $30.10 Standard deviation σn − 1 = $12.40
(d) Yes. The average amount spent by a boy was $30.10. This was about $4.30 more than the average amount spent by a girl.
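The summary statistics in part (c) can be checked in code. This is a sketch under an assumption: the individual amounts below are read off the back-to-back plot (with the girls' row for stem 4 taken from the copy of the plot in Exercise 4-05), so they should be treated as a reconstruction rather than as original data.

```python
from statistics import mean, stdev

# Amounts in dollars, read from the double stem-and-leaf plot (assumed reading).
boys = [14, 15, 15, 16, 16, 18, 22, 23, 24, 26,
        32, 38, 39, 40, 41, 41, 42, 43, 45, 52]
girls = [12, 15, 15, 18, 20, 22, 24, 25, 25, 25,
         26, 27, 28, 29, 31, 32, 34, 40, 42]

for group, amounts in (("Boys", boys), ("Girls", girls)):
    # stdev is the sample standard deviation (the calculator's σn − 1).
    print(f"{group}: n = {len(amounts)}, mean = {mean(amounts):.2f}, "
          f"sd = {stdev(amounts):.2f}")
```

Comparing both the means and the standard deviations gives a fuller answer to part (d) than the means alone.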
Use the statistical function on a scientific or graphics calculator.
Box plots
Whereas a stem-and-leaf plot gives a good visual comparison of the location of scores in a data set, a box plot (or box-and-whisker plot) shows the spread of the data. Find a five-number summary and draw each box plot on the same scale.
[Figure: box plot labelled with the lower extreme, lower quartile (Q1), median (Q2), upper quartile (Q3) and upper extreme.]
The box contains the middle 50% of scores, and each whisker represents a further 25% of the scores.
Example 10
The box plots below show the ranges of unleaded petrol prices in six cities in Australia.
[Figure: box plots of unleaded petrol prices for Canberra, Sydney, Melbourne, Adelaide, Brisbane and Darwin.]
(a) (i) Which city’s petrol prices had the smallest range?
(ii) Which city’s had the largest range?
(b) In which city was petrol generally cheapest? Give a possible reason for this.
(c) Canberra, Sydney and Melbourne had the same range of prices.
(i) Which of these three cities had the lowest median price?
(ii) In which of these cities would you be more likely to pay a higher price for petrol?
(d) Write down one observation about petrol prices in Canberra.
Solution
(a) (i) Adelaide (ii) Darwin
(b) Brisbane. The government tax on petrol is lower than in the other cities and so the price paid by the consumer is lower.
(c) (i) Sydney (ii) Melbourne
(d) They were evenly spread across the city. The distribution of petrol prices is symmetrical.
Technology: Box plots on a graphics calculator
Using a graphics calculator is an easy and excellent way to compare box plots.
1. Enter the individual scores of the first data set in List 1.
2. Enter the individual scores of the second data set in List 2.
3. Set the GRAPH to a median box plot (some calculators have a mean box plot as well).
4. Make sure that both Graph 1 and Graph 2 are ON.
5. Draw the graphs. Both graphs will appear on the screen at the same time, giving you an excellent comparison of the two data sets.
The calculator will also give you the five-number summary.
Example 11
Liz and George deliver pamphlets to letterboxes in the same neighbourhood. The numbers of pamphlets delivered per hour over 12 hours are shown:
Liz: 24 25 26 27 28 28 31 32 32 32 35 35
George: 15 18 21 24 25 29 31 31 32 38 38 45
(a) Represent the data in a double stem-and-leaf plot.
(b) Find a five-number summary for each data set and hence draw two box plots.
(c) Write down one observation that is best seen in the stem-and-leaf plot.
(d) Write down one observation that is best seen in the box plots.
(e) Which worker showed the greater interquartile range of pamphlets delivered? Which display shows this the best?
(f) Can we conclude that Liz is a better worker than George?
Solution
(a)
        Liz | Stem | George
            |  1   | 5 8
8 8 7 6 5 4 |  2   | 1 4 5 9
5 5 2 2 2 1 |  3   | 1 1 2 8 8
            |  4   | 5
(b) Liz: 24 25 26 27 28 28 31 32 32 32 35 35
Lower extreme = 24
Lower quartile = $\frac{26+27}{2} = 26.5$
Median = $\frac{28+31}{2} = 29.5$
Upper quartile = $\frac{32+32}{2} = 32$
Upper extreme = 35
George: 15 18 21 24 25 29 31 31 32 38 38 45
Lower extreme = 15
Lower quartile = 22.5
Median = 30
Upper quartile = 35
Upper extreme = 45
[Figure: box plots for Liz and George on a Pamphlets/hour scale from 15 to 45.]
(c) The stem-and-leaf plot shows that the number of pamphlets delivered per hour by Liz was always in the 20s and 30s.
(d) The box plots show the median number of pamphlets delivered per hour by both was about the same (around 30) but George’s range was greater.
(e) George. This is obvious from the box plots. The interquartile range is the length of ‘the box’.
(f) If an employer was looking for consistency, Liz is the more consistent worker as she had less variation in the number of pamphlets delivered per hour. However, for the total number of pamphlets delivered, both employees delivered approximately the same number of pamphlets. We cannot conclude that Liz is a better worker than George.
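The five-number summaries in Example 11 can be reproduced in code. This is a minimal sketch that uses the median-of-halves rule for the quartiles, as in the worked solution; the function name `five_number_summary` is hypothetical, and the sketch assumes at least two scores.

```python
from statistics import median

def five_number_summary(scores):
    """Return (lower extreme, Q1, median, Q3, upper extreme).

    Quartiles are the medians of the lower and upper halves of the ordered
    scores; for an odd number of scores the middle score is left out of both.
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    return (
        ordered[0],
        median(ordered[:half]),
        median(ordered),
        median(ordered[-half:]),
        ordered[-1],
    )

liz = [24, 25, 26, 27, 28, 28, 31, 32, 32, 32, 35, 35]
george = [15, 18, 21, 24, 25, 29, 31, 31, 32, 38, 38, 45]

for name, scores in (("Liz", liz), ("George", george)):
    low, q1, q2, q3, high = five_number_summary(scores)
    print(f"{name}: {low}, {q1}, {q2}, {q3}, {high}  (IQR = {q3 - q1})")
```

The interquartile range printed for each worker is the length of the box in the corresponding box plot.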
What to do with outliers?
• If an outlier is considered to be feasible, you can include it in the whiskers.
• If an outlier is considered to be an error, you need not include it in the whiskers but can represent it as a separate point.
Think: Is the outlier in or out?
[Figure: two box plots on a scale from 1 to 11, one labelled ‘Outlier excluded’ and one labelled ‘Outlier included’]
Can you describe a situation that these box plots could represent?
Exercise 4-05: Displaying and comparing two data sets
1. The numbers of dollars spent by a class of students visiting the Easter show were discussed in Example 9 (page 124).
(a) Find a five-number summary for each data set.
(b) What is the interquartile range of each?
(c) Draw two box plots representing the data sets.
(d) What information is seen more easily in the box plots?
2. A teacher proposes that ‘People always underestimate the length of a piece of string’. A group of students decide to investigate this theory. They each estimate the lengths of several pieces of string and then measure the actual lengths.
(a) Write down the median of the estimated lengths.
(b) Write down the median of the actual lengths.
(c) What are the range and interquartile range for each data set?
(d) Would you agree with the teacher’s theory? Justify your answer.
Boys Girls
8 6 6 5 5 4 6 4 3 2 9 8 2 5 3 2 1 1 0 2
1 2 3 4 5
2 5 5 8
0 2 4 5 5 5 6 7 8 9 1 2 4
0 2
[Box plots for question 2: Actual and Estimates. Axis: Length of string (cm), 5 to 35]
3. Here are two sets of scores represented in a stem-and-leaf display.
(a) Find the range and interquartile range of each set.
(b) Find the median for each set.
(c) Draw box plots representing the data sets.
(d) Write down one observation from the stem-and-leaf plot and one from the box plots.
4. The pulse rates (in beats/minute) of two groups of people were recorded:
Group X: 77 72 80 77 91 62 72 82 79 58 75 67 69 66 98 81
Group Y: 81 86 64 74 92 75 73 81 64 52 82 79 80 53 62 78
(a) Draw a back-to-back stem-and-leaf plot.
(b) What is the mean of each group (correct to 1 decimal place)?
(c) What is the median of each group?
(d) Which is the better measure of location? Why?
(e) Comment on the shape of each group in the stem-and-leaf plot.
5. A group of 20 people had their pulse rates taken before and after an exercise class.
(a) By how much did the median pulse rate increase?
(b) The lower extreme ‘before’ and ‘after’ the class did not change. Give a possible reason for this.
(c) Give a possible reason for the outlier pulse rates in the ‘after exercise’ box plot.
(d) How many people had a pulse rate between 64 and 72 before the exercise class?
(e) What was the interquartile range of pulse rates after the class?
6. Eighteen people took part in the QUIT smoking program. The numbers of cigarettes smoked per day were recorded before the start of the program and 6 weeks later:
Before: 21 10 36 42 16 23 32 42 9 14 21 18 34 45 12 18 16 28
6 weeks later: 6 24 31 38 21 25 16 19 16 18 28 32 8 13 40 38 16 28
(a) What is the interquartile range for each data set?
(b) Draw two box plots on the same scale showing ‘before’ and ‘6 weeks later’.
(c) Is the QUIT program working for these people? Justify your answer.
7. The following data shows the average number of rainy days per month for two capital
cities, and is supplied by the Bureau of Meteorology.
Month J F M A M J J A S O N D
Sydney 12 12 13 12 12 12 10 10 10 12 11 12
Melbourne 8 7 9 12 14 14 15 16 15 14 12 11
Set A Set B
2 5 5
8
5 7 0 2
0 1 2 3 4 5 6 7 8 9
3
8 5 2 4 7 2
4
[Box plots for question 5: pulse rates before and after exercise. Axis: Pulse rate (beats/min), 40 to 140]
(a) Use a double stem-and-leaf plot to display the data.
(b) Draw box plots representing the data.
(c) Write down one observation from each display.
(d) ‘Melbourne is much wetter than Sydney.’ Do you agree with this statement? Justify your answer.
8. This display represents the lifetime in hours of two brands of light globes.
(a) How many of each brand of light globe were tested?
(b) What is the mean lifetime of ‘Oso Bright’ globes (correct to 1 decimal place)?
(c) What is the mean lifetime of ‘Brighta Longa’ globes (correct to 1 decimal place)?
(d) Find the standard deviation of the lifetime of each brand (correct to 1 decimal place).
(e) Draw box plots representing the data sets.
(f) Which brand of globe would you say is better? Explain your answer.
COMPARING DATA SETS USING CHARTS
Radar chart
A radar chart is used to plot changes over a certain period or cycle, such as temperature during a 24-hour period, but it is also useful for comparing two sets of data.
A radar plotting chart (or polar graph paper) can be used to manually plot data, but the best option is to generate the radar chart from a spreadsheet package on a computer.
Example 12
This radar chart shows air pollution levels at two different workplaces over a 10-day period.
(a) What was the air pollution level at the meatworks on day 10?
(b) What was the air pollution level at the oil refinery on day 1?
(c) On what days was the pollution level above 50 at the oil refinery?
(d) What were the maximum and minimum pollution levels? When and where did they occur?
(e) By comparing the areas contained within each graph, decide which workplace had the higher overall pollution level.
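[Stem-and-leaf display for Exercise 4-05, question 8: lifetime in hours of two brands of light globes]
             Oso Bright | Stem | Brighta Longa
              6 5 5 4 2 |  10  | 3 4 5
    8 7 7 7 7 7 7 4 4 3 |  11  | 2 3 3 4 4 5 6
9 9 8 8 7 6 6 6 5 4 4 0 |  12  | 1 2 2 3 3 4 5 5 7 9 9 9
  8 8 8 7 7 7 6 5 4 3 1 |  13  | 0 2 3 3 4 4 4 5 6 6 8 8 9
        9 8 8 8 5 5 2 2 |  14  | 1 2 2 3 5 5 6 7 8 8
                7 7 5 1 |  15  | 0 3 3 4 6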
[Radar chart: Air pollution levels at the meatworks and the oil refinery. Axes: Day 1 to Day 10; radial scale 0 to 100]
Solution
(a) About 60.
(b) About 45.
(c) Days 4, 6, 8 and 9.
(d) The maximum level was about 85 on day 4 at the oil refinery and the minimum level was about 25 on day 1 at the meatworks.
(e) The oil refinery graph seems to cover a slightly larger area and so had a higher level of pollution over the 10-day period.
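The comparison of areas in part (e) can also be done by calculation. The sketch below, in Python, assumes the axes of the radar chart are equally spaced, so the enclosed shape is made up of triangular wedges, each with area ½ × r₁ × r₂ × sin(angle between axes). The function name radar_area and the two lists of readings are hypothetical: the readings are made up for illustration and are not taken from the chart in this example.

```python
import math

def radar_area(readings):
    """Area enclosed by a radar-chart polygon with equally spaced axes."""
    n = len(readings)
    angle = 2 * math.pi / n               # angle between neighbouring axes
    total = 0.0
    for i in range(n):
        r1 = readings[i]
        r2 = readings[(i + 1) % n]        # wrap around from the last axis to the first
        total += 0.5 * r1 * r2 * math.sin(angle)   # area of one triangular wedge
    return total

# Hypothetical readings for days 1 to 10 (illustration only, not read from the chart).
workplace_p = [30, 45, 50, 55, 40, 50, 45, 60, 55, 60]
workplace_q = [45, 50, 40, 85, 50, 60, 45, 55, 65, 50]

area_p = radar_area(workplace_p)
area_q = radar_area(workplace_q)
print("Area for P:", round(area_p))
print("Area for Q:", round(area_q))
print("Larger area:", "P" if area_p > area_q else "Q")
```

Because each wedge multiplies two neighbouring readings, the enclosed area exaggerates high readings and depends on the order of the axes. It is a rough visual guide to the overall level rather than an exact measure, so the mean of the readings is a useful check.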
Area chart
An area chart consists of different ‘areas’ or ‘bands’, each representing a data set over a given period of time. It shows the sum of the data over the given time as well as the relationship of the parts to a whole. Its main feature is to emphasise changes during this time. An area chart can be plotted on graph paper or drawn using the Chart option in a spreadsheet package. There are several chart subtypes that you can investigate.
Example 13
The table shows the numbers of males and females in full-time employment in January from 1990 to 2000.
Construct an area chart showing the contribution of male and female employees to Australia’s full-time workforce.
Solution
Step 1 Draw a line graph for males using the values in the table and shade below it. This area represents the male employees.
Step 2 Draw a line graph for total employees by adding the values for females to those of males. Shade the area between the two lines. This area represents the female employees.
Year 1990 1992 1994 1996 1998 2000
Males (×10 000) 350 160 200 320 360 450
Females (×10 000) 80 50 60 120 140 200
For example, in January 2000 the full-time workforce was 6 500 000, and this was made up of 4 500 000 males and 2 000 000 females.
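The two steps above amount to stacking the female values on top of the male values. A minimal sketch in Python, using the values from the table (in units of 10 000), shows the heights of the two boundary lines for each year. The variable names are our own.

```python
years = [1990, 1992, 1994, 1996, 1998, 2000]
males = [350, 160, 200, 320, 360, 450]      # values from the table, x 10 000
females = [80, 50, 60, 120, 140, 200]       # values from the table, x 10 000

# Step 1: the lower boundary line is the male values.
# Step 2: the upper boundary line is males + females (the total workforce).
totals = [m + f for m, f in zip(males, females)]

for year, m, t in zip(years, males, totals):
    print(year, "male line at", m, "| total line at", t,
          "| female band height", t - m)

# Converting the last total back to people: 650 x 10 000 = 6 500 000
print("Full-time workforce in 2000:", totals[-1] * 10_000)
```

The height of the female band in any year is the gap between the two lines, which is why a ruler is useful when reading values from a printed area chart.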
[Area chart: Australia’s full-time workforce. Vertical axis: No. of employees (× 10 000), 0 to 700; horizontal axis: Year, 1990 to 2000; bands: Males, Females]
Example 14
This area chart compares the unemployment rates for males and females from 1981 to 1997.
(a) For the year 1985 find:
(i) the unemployment rate for males
(ii) the combined unemployment rate
(iii) the unemployment rate for females
(b) What was the unemployment rate in 1993?
(c) What trends in the unemployment rate can be seen over the period from 1981 to 1997?
Solution
(a) (i) About 8%.
(ii) About 17%.
(iii) About 9% (subtract the 8% rate for males from the 17% total rate).
(b) About 22%.
(c) The unemployment rate rose from about 12% in 1981 to 17% in 1997. A fall in the unemployment rate occurred from 1985 to 1989 followed by a rise before another fall from 1993 to 1997. The unemployment rate was at its highest in 1993.
[Area chart for Example 14: Unemployment rates. Vertical axis: Percentage, 0 to 25; horizontal axis: Year, 1981 to 1997; bands: Males, Females]
Use your ruler to help you measure the vertical distances.
Technology: Using a spreadsheet to draw an area chart or radar chart
Radar charts and area charts are drawn in a similar way using a spreadsheet package. Use a spreadsheet to draw the area chart for Australia’s full-time workforce (Example 13 on page 130).
Exercise 4-06: Comparing data sets using charts
1. The numbers of clear days for the ski resorts of Thredbo and Perisher in the Snowy Mountains area of NSW are shown in the radar chart.
[Radar chart: Clear days in the ski fields of NSW. Axes: Jan to Dec; radial scale 0 to 10]
(a) How many clear days did Thredbo have in March?
(b) What was the greatest number of clear days at either resort? When was this?
(c) How many days were not clear in Perisher in July?
(d) Which data set contains the largest area? What does this area refer to?
(e) ‘The weather is better for skiing at Perisher.’ Do you agree with this statement? Justify your answer.
2. The area chart shows the number of wage earners employed in the public and private sectors in Australia over different years.
(a) How many wage earners were there in the public sector in 1997?
(b) What was the total number of wage earners in 1993?
(c) How many wage earners were employed in the private sector in 1991?
(d) What trends can be seen over the period from 1991 to 1997?
(e) What similarities or differences can be seen between the public and private sectors?
3. The area chart shows the seasonal rainfall for an island group in the Pacific Ocean.
(a) What was the rainfall for the southeastern region in summer?
(b) What was the rainfall for the northern region in spring?
(c) What was the total rainfall in autumn?
(d) The southeast is the wettest region. How is this shown in the graph? What could be a possible reason for one area getting more rain than the others?
(e) What trends in the rainfall can be seen over the year?
(f) What similarities or differences in rainfall can be seen between the regions?
4. Mr Pappadopoulos was admitted to hospital with a suspected stomach ulcer. His fluid
intake (e.g. water and medicine) and output (e.g. urine) over a 24-hour period are summarised in the following table.
(a) Represent the data in a radar chart.
(b) By considering the areas enclosed by each data set, what observation can you make about Mr Pappadopoulos’s intake and output over the 24-hour period?
(c) Write down two other observations from your radar chart.
Time 6 am 8 am 10 am 12 noon 2 pm 4 pm
Intake (mL) 170 240 150 110 250 90
Output (mL) 140 150 80 180 130 90
Time 6 pm 8 pm 10 pm 12 midnight 2 am 4 am
Intake (mL) 150 60 180 170 160 210
Output (mL) 60 220 110 160 100 140
[Area chart for question 2: Wage earners in Australia. Vertical axis: No. of wage earners (× 1000), 0 to 8000; horizontal axis: Year, 1991 to 1997; bands: Public sector, Private sector]
[Area chart for question 3: Seasonal rainfall for island group. Vertical axis: 0 to 400; horizontal axis: Season (Summer, Autumn, Winter, Spring); bands: Southwestern region, Southeastern region, Northern region]
5. Clark and Lois earn extra money for writing articles for newspapers and magazines. They save these amounts in a joint holiday fund. Their monthly earnings last year are shown in the table.
(a) Represent the data in a radar chart.
(b) Represent the data in an area chart.
(c) What information is best seen in the radar chart?
(d) What trends are clearly seen in the area chart?
6. (a) What information is contained in the graph?
(b) How do you think data for the years 2021–2041 was obtained?
(c) Describe the features of the part of the graph for the 15–59 age group.
(d) In 1961, approximately what percentage of the population was between (i) 0 and 14 (ii) 15 and 59?
(e) Approximately what percentage of the population is expected to be over 60 in 2021?
(f) Give two facts about Australia’s population that can be seen in the graph.
(g) What does this area chart show about age groups in the future?
TWO-WAY TABLES
Two-way tables are used to compare two characteristics—for example, gender and health.
Example 15
A National Health Survey in 1995 compared the number of adults in a population who exercised regularly to those who didn’t. The data is displayed in a two-way table.
(a) How many people were surveyed?
(b) What percentage of the people surveyed were female? Give your answer correct to 1 decimal place.
(c) What percentage of females exercised regularly?
(d) What percentage of the population did not exercise regularly?
(e) Comment on the statement ‘Men and women are similar in their exercise habits’.
[Table for Exercise 4-06, question 5]
Month: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Clark’s earnings ($): 370 240 530 570 780 1030 770 620 790 520 430 490
Lois’s earnings ($): 150 420 480 530 850 1280 920 650 810 480 390 350
Exercise No exercise
Male 3028 1532
Female 1804 946
[Area chart for Exercise 4-06, question 6: Australia’s population by age groups. Vertical axis: Percentage, 0 to 100; horizontal axis: Year, 1921 to 2041; bands: Age 0–14, Age 15–59, Age 60+]
Solution
(a) Number of people surveyed = 3028 + 1532 + 1804 + 946 = 7310
(b) Number of females = 1804 + 946 = 2750
Percentage of people who were female = 2750/7310 × 100 = 37.6%
(c) Percentage of females who exercised = 1804/2750 × 100 = 65.6%
(d) Percentage of people who did not exercise = (1532 + 946)/7310 × 100 = 33.9%
(e) Number of males = 3028 + 1532 = 4560
Percentage of males who exercised = 3028/4560 × 100 = 66.4%
Since the percentage of females who exercised was 65.6% and the percentage for males was 66.4%, there is no significant difference, so the statement is supported by this data.
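The row totals and percentages in this solution can be reproduced with a few lines of code. The following is a minimal sketch in Python. The dictionary layout and the helper name percent are our own choices for illustration.

```python
# Two-way table from Example 15: rows are gender, columns are exercise habit.
table = {
    "Male": {"Exercise": 3028, "No exercise": 1532},
    "Female": {"Exercise": 1804, "No exercise": 946},
}

def percent(part, whole):
    """Percentage of whole made up by part, correct to 1 decimal place."""
    return round(100 * part / whole, 1)

row_totals = {gender: sum(counts.values()) for gender, counts in table.items()}
grand_total = sum(row_totals.values())
no_exercise_total = sum(counts["No exercise"] for counts in table.values())

print("People surveyed:", grand_total)
print("% female:", percent(row_totals["Female"], grand_total))
print("% of females who exercised:",
      percent(table["Female"]["Exercise"], row_totals["Female"]))
print("% who did not exercise:", percent(no_exercise_total, grand_total))
print("% of males who exercised:",
      percent(table["Male"]["Exercise"], row_totals["Male"]))
```

Notice that the denominator changes with the question: the whole survey for parts (b) and (d), but only the females or only the males for parts (c) and (e).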
1. The population of a town was surveyed in 1990 and 1997 to find out who had private health insurance.
(a) What was the population of the town in 1990?
(b) What was the population of the town in 1997?
(c) What percentage of the town had private health insurance in 1990?
(d) What percentage of the town had private health insurance in 1997?
(e) Suggest a reason for the decrease in the percentage of people with private health insurance.
2. The percentages of Australians living in rural areas in 1911 and 1996 were compared.
(a) Copy and complete the table.
(b) What percentage of Australians lived in urban areas in 1911?
(c) Comment on the differences between 1911 and 1996.
3. In one area there are three phone companies providing a service for mobile phones. The number of people using each company as a provider was recorded over a 3-year period.
1990 1997
Private 4563 4048
No private 5577 8602
1911 1996
Rural areas 43%
Urban (city) areas 87%
Telstra Optus Vodaphone
1995 204 695 194 198 125 967
1996 315 144 216 276 86 510
1997 402 628 304 025 115 037
(a) How many people in this area owned mobile phones in (i) 1995 and (ii) 1997?
(b) What percentage of people used Telstra as their provider in 1996?
(c) What percentage of people used Optus as their provider in 1997?
(d) What share of the market did Vodaphone have in (i) 1995 and (ii) 1997?
(e) What happened to Telstra’s share of the market from 1995 to 1997?
(f) What happened to Optus’s share of the market from 1995 to 1997?
(g) Comment on the statement ‘Telstra users doubled from 1995 to 1997’.
4. A survey was taken on whether to change the Australian flag or not. The results are shown in the table, grouped by age in years.
(a) How many people surveyed voted to (i) change the flag and (ii) keep the flag?
(b) What percentage of those surveyed wanted to keep the flag?
(c) What percentage of 18–24-year-olds wanted to change the flag?
(d) Which group was most definite in its response? What was this response? Why do you think this is so?
18–24 25–39 40–54 55–69
Change the flag 790 640 450 140
Keep the flag 1240 860 930 620
TEN MORE HOT TIPS FOR TACKLING EXAMS
1. Bring all of your equipment: pens, paper, geometrical instruments, calculator (check calculator works).
2. Don’t worry if you feel nervous before an exam. This is normal and helps you perform better. However, being too casual or too anxious can be harmful to your performance.
3. Write in black or blue, not red. Don’t use liquid paper. Use pencil only for diagrams and constructions.
4. Read each question and identify what needs to be found.
5. You don’t need to be writing all of the time. What you are writing may be wrong and a waste of time. Spend some time thinking and considering the best approach.
6. Make sure your answer sounds reasonable and realistic, especially if it involves money or measurement.
7. If you make a mistake, cross it out with a neat line. Don’t scribble over it completely. You may still get marks for it if it is right. Don’t use liquid paper. It is both time-consuming and messy.
8. Don’t cross out or change an answer rashly. You may have been right the first time.
9. Don’t round off in the middle of a calculation. Round off at the end only.
10. Don’t be afraid to write words and sentences in your working, but don’t use abbreviations that you’ve just made up.
USING MULTIPLE DISPLAYS TO COMPARE DATA SETS
Relationships between data sets can often be interpreted and described more effectively by using more than one display. Looking at a variety of different displays allows a better comparison of data sets as some features are more obvious in one display than in another. Every day in the media you will find examples of multiple displays describing data sets. A company director compares this year’s figures with those of previous years. Medical researchers compare the effects of a new drug on men and women for similarities and differences. Local councils investigate the population mix in a new suburban area in order to provide the most appropriate facilities.
Let us start with two simple data sets and look at three different ways of comparing them.
Example 16
The data sets A and B are displayed as lists, dot plots, a frequency table and a clustered column graph.
Lists
A: 5 6 7 8 9
B: 5 5 7 9 9
Dot plots
Frequency table Column graph
(a) Comment on the shape and features of each data set.
(b) Find the mean, median and mode for each set.
(c) Find the range, interquartile range and standard deviation of each set.
(d) Comment on the benefits of using multiple displays to describe the data sets and to find measures of location and spread.
Solution
(a) Set A is symmetrical and flat.
Set B is symmetrical and has two peaks; that is, it is bimodal.
(b) Set A: Mean = 7, Median = 7, No mode
Set B: Mean = 7, Median = 7, Mode = 5, 9
(c) Set A: Range = 4, Interquartile range = 3, Standard deviation (σn−1) = 1.58
Set B: Range = 4, Interquartile range = 4, Standard deviation (σn−1) = 2
(d) Multiple displays cater for differences in people’s preferences as well as allowing for different statistical needs. The dot plots and column graph give good visual representations of the data sets and are best used to describe the shape and features of the data sets. The measures of location and spread are best found from the lists or frequency table, although the other displays can also be used.
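The measures in parts (b) and (c) can be checked from the lists with a short program. The sketch below, in Python, assumes the sample standard deviation (σn−1) and quartiles found as medians of the lower and upper halves of the data. The function name describe is our own, and treating a set in which every score occurs equally often as having ‘no mode’ is a convention adopted here to match the solution.

```python
from statistics import mean, median, multimode, stdev

def describe(data):
    """Summary measures of location and spread for a list of scores."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    q1 = median(values[:half])             # median of the lower half
    q3 = median(values[half + n % 2:])     # median of the upper half
    modes = multimode(values)
    if len(modes) == len(set(values)) and len(modes) > 1:
        modes = []                         # every score equally common: no mode
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": modes,
        "range": values[-1] - values[0],
        "iqr": q3 - q1,
        "sd": round(stdev(values), 2),     # sample standard deviation
    }

set_a = [5, 6, 7, 8, 9]
set_b = [5, 5, 7, 9, 9]

for name, data in (("A", set_a), ("B", set_b)):
    print("Set", name, describe(data))
```

The two sets have the same mean, median and range, so only the interquartile range and standard deviation pick up the fact that Set B’s scores sit further from the centre.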
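Frequency table
Score    Frequency (Set A)    Frequency (Set B)
5        1                    2
6        1                    0
7        1                    1
8        1                    0
9        1                    2
[Dot plots: Set A and Set B. Axis: Score, 5 to 9]
[Clustered column graph: Sets A and B. Vertical axis: Frequency, 0 to 3; horizontal axis: Score, 5 to 9]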
Exercise 4-08: Using multiple displays to compare data sets
1. Two groups, each containing 15 people, were given a small timer and asked to stop the timer when they thought 60 seconds had elapsed. The results, in seconds, for the ‘estimated minute’ are listed:
Group A: 34 43 45 50 62 64 65 65 66 68 69 70 71 75 81
Group B: 42 46 48 48 49 50 55 58 60 61 62 64 65 68 70
(a) Construct a double stem-and-leaf plot.
(b) Draw a clustered column graph with classes 30–39, 40–49, …
(c) Draw box plots to represent the data sets.
(d) Write down one piece of information that is clearly shown in each of the three displays you have drawn.
(e) Find the mean and standard deviation of each data set (correct to 1 decimal place).
(f) Comment on the ability of each group to estimate a minute.
2. A coach, deciding which team should win the ‘most consistent players’ award, compared the season’s scores for two netball teams:
The Birds: 55 23 35 51 56 48 70 52 64 72
The Bees: 18 41 23 46 48 24 56 27 36 48
(a) Display the data in a stem-and-leaf plot, box plots and a column graph.
(b) Use your displays to describe the shape and features of each data set.
(c) By finding suitable measures of location and spread, decide which team is more consistent. Justify your answer.
3. The populations of two regions were surveyed to find out who belongs to a workers’ union. The results are tabulated and shown in a back-to-back histogram.
Table
Age 15–24 25–34 35–44 45–54 55–64 65+
Eastern region 35% 49% 54% 51% 62% 11%
Western region 34% 36% 38% 42% 45% 4%
Back-to-back histogram
[Back-to-back histogram: Union membership by age and region. Horizontal axis: % belonging to a workers’ union, 0 to 70 on each side; age groups 15–24 to 65+]
(a) Write down two comparisons you can make between the two data sets.
(b) Use the information to comment on the statement ‘People in the eastern region are more likely to join a union’. Justify your answer.
4. The heights of a group of men and women were measured to the nearest centimetre. The data was then represented in a double stem-and-leaf display and also as box plots.
Stem-and-leaf
Box plots
(a) What information is better shown in the stem-and-leaf display?
(b) What information is better shown in the box plots?
(c) What are the medians and interquartile ranges of the heights of men and women?
(d) Calculate the means and standard deviations of the heights of men and women (correct to 1 decimal place).
(e) Write down two similarities between the heights of men and women.
(f) Write down two differences between the heights of men and women.
5. The table below gives the average number of rainy days per month for the Australian capital cities.
(a) Draw at least two suitable displays illustrating the data.
(b) Calculate the mean and median number of rainy days for each city.
(c) Find the range and standard deviation of the number of rainy days for each city.
(d) Use these statistical measures and displays to determine:
(i) which city is driest
(ii) which city is wettest
(iii) which city has the most consistent pattern of rainy days
(iv) which city has most variation in the number of rainy days per month
Men Women
8 9 7 7 5 2 9 9 8 8 6 5 5 4 4 4 2 1 8 6 3 2 4 15 16 17 18 19
2 4 4 5 6 8 8 9 0 2 3 3 4 5 5 5 5 6 7 8 8 2 3 4 4
3
[Table for question 5: average number of rainy days per month]
City         J   F   M   A   M   J   J   A   S   O   N   D
Adelaide     5   4   6   9  14  13  17  16  13  11   8   7
Brisbane    13  13  15  11  10   8   7   7   7   9  10  12
Canberra     7   7   7   8   9   9  10  12  10  11  10   8
Darwin      21  20  19   9   2   1   0   1   2   6  12  16
Hobart      11  10  11  12  14  14  15  15  15  16  14  13
Melbourne    8   7   9  12  14  14  15  16  15  14  12  11
Perth        3   3   4   8  13  17  18  16  13  10   7   4
Sydney      12  12  13  12  12  12  10  10  10  11  11  12
[Box plots for question 4: heights of men and women. Axis: Height (cm), 145 to 195]
6. Use the table in question 5 to consider the rainfall per season in Australia. The seasons are summer (D, J, F), autumn (M, A, M), winter (J, J, A) and spring (S, O, N).
(a) Draw at least two suitable displays to illustrate the data.
(b) Calculate the mean, median, range and standard deviation for each season.
(c) Use these statistical measures and displays to determine:
(i) which is the wettest season
(ii) which is the driest season
(d) Comment on the statement ‘Rainfall in Australia does not vary much between seasons’.
Just for the record
BABY BOOMERS
After World War II finished in 1945, there was a ‘baby boom’ in Australia, New Zealand, Britain and North America. This rapid growth in the number of babies born lasted until the mid-1960s. People born during this time are referred to as ‘baby boomers’. The result of the large increase in births during this period will affect Australia’s population statistics as this group of people age. The two graphs show the baby boomer population moving from 2001 to 2031.
In 2031, the baby boomers will be over 65 years. Approximately how many more persons aged over 65 will there be in 2031 compared with 2001?
[Two bar graphs: Age distribution of Australian population, 2001 and 2031. Age groups 0–5 to 86+; horizontal scale 0 to 1600 (× 1000); baby boomer age groups marked on each graph]
One of the main roles of a statistician is to critically analyse related data sets and report on the findings. Businesses often use the results of an analysis for promotional purposes and companies report to their shareholders.
To critically analyse data sets:
• Draw suitable displays.
• Find measures of location and spread.
• Write a report on the relationship between the data sets, commenting on any similarities and differences between the data sets, unusual features, outliers or patterns.
• Draw conclusions and make recommendations.
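The second and third steps can be started with a short program that calculates the summary measures for both data sets and lists them side by side, ready for the written report. The following is a minimal sketch in Python that reuses the pamphlet data for Liz and George from Example 11. The function names summarise and compare are our own, and the code does not handle ties.

```python
from statistics import mean, median, stdev

def summarise(data):
    """Measures of location and spread, rounded to 1 decimal place."""
    return {
        "mean": round(mean(data), 1),
        "median": median(data),
        "range": max(data) - min(data),
        "sd": round(stdev(data), 1),   # sample standard deviation
    }

def compare(name_a, data_a, name_b, data_b):
    """Return report lines comparing two data sets (ties are not handled)."""
    a, b = summarise(data_a), summarise(data_b)
    lines = [f"{name_a}: {a}", f"{name_b}: {b}"]
    higher = name_a if a["mean"] > b["mean"] else name_b
    steadier = name_a if a["sd"] < b["sd"] else name_b
    lines.append(f"Higher mean: {higher}")
    lines.append(f"More consistent (smaller standard deviation): {steadier}")
    return lines

# Data from Example 11 (pamphlets delivered per hour).
liz = [24, 25, 26, 27, 28, 28, 31, 32, 32, 32, 35, 35]
george = [15, 18, 21, 24, 25, 29, 31, 31, 32, 38, 38, 45]

for line in compare("Liz", liz, "George", george):
    print(line)
```

The program only supplies the numbers. Commenting on unusual features, drawing conclusions and making recommendations still need to be done in words, supported by suitable displays.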
1. Twenty overweight people enrolled in a weight loss program at Rhonda’s Weight Loss Centre. Their weights (in kilograms) before and after the program were:
Before: 128 159 85 76 93 125 102 74 88 82 97 84 106 125 76 80 92 77 115 102
After: 75 72 64 95 58 62 120 93 85 72 102 65 73 62 56 60 105 82 52 64
Critically analyse the data and report back to Rhonda on how she can best advertise the success of her centre.
2. The times taken (in seconds) to check a basket of 20 grocery items at 15 automated and 15 manual checkouts were:
Automated: 45 58 63 43 75 69 84 65 96 73 90 61 84 72 96
Manual: 95 105 82 110 125 148 136 137 86 99 145 119 101 97 124
Critically analyse the data and report back to the manager of a store on the benefits of installing automated checkouts based on this data.
Obtain published data from the media or Internet, collect data through experiment or simulation, or use data already collected for your statistics file. Critically analyse the data sets by drawing appropriate graphs and tables, determining measures of location and spread, and writing a report on your findings.
Some suggested data sets are:
• the performances of two sporting teams (e.g. football or netball) in a season
• the performance of a sporting team in home and away matches
• pulse rates of males and females before and after exercise
• spending patterns of men and women
• heights and weights of males and females
• scores in two subject tests
• waiting times at a checkout on different days
• pollution levels at different times in the same city or in two different cities
• rainfall in two different towns or regions
• part-time incomes of male and female students.
Modelling activity: Analysing data sets
A population pyramid displays information about the ages of a population. The oldest age group is at the top and hence the display resembles a pyramid. A simple population pyramid (or back-to-back histogram) is shown in question 3 of Exercise 4-08 (page 137).
1. This population pyramid shows a profile of the Australian population from 1911 to 2051. It is actually three pyramids together, showing the years 1911, 1996 and the population projection for 2051.
(a) Compare the numbers of males and females over 60 in 1911 and in 2051.
(b) How many females were 35 in 1996?
(c) How many males were 20 in 1911?
(d) Find one age group where there are more males.
(e) Find one age group where there are more females.
(f) Write down three differences between the population in 1911 and in 1996.
2. Investigate the age of the Aboriginal and Torres Strait Islander population and compare with the general Australian population using a population pyramid. You can find the necessary information at the following website: www.abs.gov.au.
Investigation: Population pyramids
[Population pyramid: Profile of Australia’s population, 1911–2051. Males on the left, Females on the right; horizontal scale in thousands, 0 to 200 on each side; vertical scale: age 0 to 100+; series: 1911, 1996, 2051]
Chapter review
Statistical distributions
1. Collecting and displaying data
2. Summary statistics
3. Features of a statistical display
4. Investigating outliers
5. Displaying and comparing two data sets
6. Comparing data sets using charts
7. Two-way tables
8. Using multiple displays to compare data sets
This chapter, Statistical distributions, revises and extends the statistics covered in the Preliminary Course. It compares two data sets in a variety of displays, including double stem-and-leaf plots, box plots, radar charts and area charts. You also used measures of location and spread to compare data sets and learned how to interpret information from different displays. Be sure to include area charts and the effect of outliers in your summary. You could also include a glossary of statistical terms.
Make a summary of this topic. Use the chapter outline above as a guide. An incomplete mind map has also been started below. Use your own words, symbols, diagrams, boxes and reminders. Use the questions in Your say below to think about your understanding of the topic. Gain a ‘whole picture’ view of the topic and identify any weak areas.
Topic summary
[Mind map: Statistical distributions, with branches Area charts, Two-way tables, Stem-and-leaf plots, Radar charts, Box plots, Outliers, Comparing data sets, Summary statistics, Measures of spread]
• Have you satisfied the outcomes listed at the front of this chapter?
• What was the most important thing that you learned?
• How did you feel about the topic? Did you enjoy it?
• What was new?
• What are your weaknesses? What will you need to study more?
• How will you revise and summarise this topic?
1. Classify the data as (i) quantitative and discrete, (ii) quantitative and continuous, or (iii) categorical.
(a) numbers of cows on farms in NSW
(b) numbers of letters delivered each day to households in Campbelltown
(c) annual water consumption in Sydney
(d) numbers of workers who travel to work by public transport
(e) ages of first-year university students
(f) favourite movie
2. Find the mean, median and mode for each data set and suggest a possible population from which each set of data was taken.
(a) 10 11 11 12 12 12 13 13
(b) 3 3 3 4 4 4 5 5 5
(c) 72 72 73 75 76 83 84 85 87 94
3. Consider the set of scores: 3 4 5 5 8 9 12 15 18 20
(a) What is the mean?
(b) What is the median?
(c) Without doing any calculations, say what the effect on the mean and median would be of adding:
(i) one score of 30 (ii) one score of 50 (iii) a score of zero (iv) a score of 10
(d) What would be the effect on the mean and median if each score was:
(i) increased by 2? (ii) decreased by 3?
4. For each statistical display below:
(i) find the mean and standard deviation of the data set (to 1 decimal place)
(ii) describe the shape and features of the distribution
Your say: Reflecting about the topic
Chapter assignment
(a) [Histogram. Vertical axis: Frequency, 5 to 15; horizontal axis: Wages from part-time job (× $10), 6 to 20]
5. Match the box plots to the following data sets.
(a) a random sample of 30 spectators at a football match
(b) a group of 30 senior citizens on a bus trip
(c) a group of 30 dancers at a nightclub
(d) two teachers taking a group of 30 primary students to the zoo
6. A factory produces small metal rods, designed to have a mass of 50 g. Samples were taken from two different machines and compared.
(a) Find the mean and standard deviation for each machine (correct to 1 decimal place).
(b) What are the median and interquartile range for machine A?
(c) What are the median and interquartile range for machine B?
(d) Construct box plots for the two data sets.
(e) Comment on the statement ‘Machine B produces rods of a more consistent mass than machine A’.