Contents
4.1 Review of summary statistics
4.2 Populations and samples
4.3 Outliers and their effects
4.4 Comparing sets of data
Chapter review
4
Samples and estimates
Syllabus subject matter
Exploring and understanding data
■ What a sample represents, and whether it is appropriate
■ Summary statistics as sample statistics and estimates of parameters, including the interpretation and use of sample averages and medians as estimates of underlying population values or of values in a model
4.1
Review of summary statistics
In Year 11, you studied measures of central tendency, measures of spread (dispersion) and quantiles. These are more generally called summary statistics because they summarise the statistical information. They can be calculated from ungrouped data or grouped data, although slightly different methods are needed for grouped data.
Summary statistics
Measures of central tendency
The mode of a set of data is the most common score.
The mean is the arithmetic average of the scores. It is calculated by either:
• adding the individual scores and dividing by the number of scores, or
• adding the products of the scores and frequencies and dividing by the total frequency.
These are written as $\bar{x} = \frac{\sum x}{n}$ or $\bar{x} = \frac{\sum fx}{\sum f}$, where the symbol Σ means sum.
The median is the middle score (or average of the middle two scores), when the scores are written in order from smallest to largest. It is the $\frac{n+1}{2}$th score, provided the scores are in order.
The mode is the most probable, the median the most central and the mean the most commonly used measure of central tendency. The mean may not be appropriate for some discrete data, when the median would be used instead.
Measures of dispersion
Range = highest score − lowest score
Interquartile range = third quartile − first quartile
The standard deviation measures how far every data item is from the mean. It is abbreviated as SD, has the symbol σ and is calculated using the formula
$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$ for individual scores
or $\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$ for tables, where Σ means the sum.
Quantiles
A quantile is a score that divides the data in a frequency distribution into particular quantities.
Percentiles divide the data into percentage groups. The 35th percentile (shown as P35) is the score below which 35% of all scores lie.
Deciles divide the data into tenths. The 7th decile (shown as D7) is the score below which $\frac{7}{10}$ of the data lies.
Quartiles divide the data into quarters. The 3rd quartile (shown as Q3) is the score below which $\frac{3}{4}$ of the data lies.
• If n is odd: First quartile (Q1) = $\frac{n+1}{4}$th data item
• If n is even: First quartile (Q1) = $\frac{n+2}{4}$th data item
In both cases, the third quartile (Q3) is the corresponding item, counting back from the last.
Quantiles may be calculated using interpolation or graphs.
Grouped data
For calculation of the mean and standard deviation, the class midpoints are used. For calculation of the median, range and interquartile range, the true class limits are used.
Interpolation is used to find the median, Q1 and Q3.
The formula for the value of the mth term in a class is $L + \frac{m}{f} \times w$, where L is the lower class limit, f is the frequency of the class and w is the class width.
A cumulative frequency polygon or ogive can also be used in these calculations, in which case the median and quartiles are found using 50%, 25% and 75% of the total frequency respectively.
We cannot find a mode, but only a modal class.
Example 1
A test resulted in these scores:
5 8 9 5 7 3 6 8 6 5 4 2 9 5 6 7 4
Find the:
a mean b mode c median
d range e interquartile range f standard deviation.
Solution
a Write the formula for the mean. $\bar{x} = \frac{\sum x}{n}$
Find the sum of the scores. Σx = 5 + 8 + 9 + … + 7 + 4 = 99
There are 17 scores. n = 17
Substitute in the formula. $\bar{x} = \frac{99}{17}$
Calculate the mean. Mean = 5.8235… ≈ 5.8
b Choose the most common score. Mode = 5
c Arrange scores in order. 2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9
Choose the middle score. Median = 6
d Find highest minus lowest. Range = 9 − 2 = 7
e Find the middle of the first half. Q1 = $\frac{4+5}{2}$ = 4.5
Find the middle of the second half. Q3 = $\frac{7+8}{2}$ = 7.5
Subtract. Interquartile range = Q3 − Q1 = 7.5 − 4.5 = 3
f Find the sum of the squares. Σx² = 25 + 64 + 81 + … + 49 + 16 = 641
Write the formula for the SD. $\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$
Substitute. $= \sqrt{\frac{641}{17} - (5.8235\ldots)^2}$
Calculate the SD. Standard deviation ≈ 1.95
When the third quartile is found by counting back from the last item, you can find which item it is by subtracting 1 less than the required number from the total frequency. If there are 50 items, then the 50th is the last, the 49th the 2nd-last, the 48th the 3rd-last, and so on. Thus the 10th-last is the (50 − 9) = 41st item.
The summary statistics can also be worked out by entering the data in List 1 of a graphics calculator.
Casio CFX-9850GB PLUS
To enter the data, choose the STAT menu.
If there is already data in List 1, delete it by pressing F6 F4 F1.
Enter the scores in List 1, pressing EXE after each item.
Choose the CALC submenu by pressing F2.
Choose the SET submenu by pressing F6. Set the 1Var XList to List 1.
Set the 1Var Freq to 1.
Press EXIT and choose 1-Var by pressing F1.
Use the cursor keys to move down the list.
The mean is about 5.8, median 6, range (maxX − minX) 7, interquartile range (Q3 − Q1) 3, standard deviation (xσn) about 1.95, and the mode is 5.
Texas Instruments TI-83
The TI-83 works in a similar way to the Casio.
Press the STAT key and choose the Edit menu.
If there is already data in a list, clear it by moving up to the heading and pressing CLEAR ENTER. Enter the scores in L1.
Press the STAT key and choose the CALC menu.
Choose 1-Var Stats, press ENTER and put in L1 by pressing 2nd 1. Use the cursor keys to move down the list.
The mean is about 5.8, median 6, range (maxX − minX) 7, interquartile range (Q3 − Q1) 3, standard deviation (σx) about 1.95, and the mode is not given.
Sharp EL-9650
The Sharp instructions are given on the CD-ROM.
All methods
Write the answers. Mean ≈ 5.8, mode = 5, median = 6, range = 7, interquartile range = 3, standard deviation ≈ 1.95
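The hand working for the test scores can be cross-checked with a short program. The following is a minimal Python sketch, assuming the quartile rule stated earlier in this section (the (n + 1)/4 th item for odd n, the (n + 2)/4 th item for even n, with Q3 counted back from the last item). The names `item_at` and `summary_statistics` are illustrative only and are not part of the text.

```python
import math
from collections import Counter

def item_at(ordered, position):
    """Item at a 1-based position; a position ending in .5 is taken
    as the average of the two neighbouring items."""
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2

def summary_statistics(scores):
    ordered = sorted(scores)
    n = len(ordered)
    mean = sum(ordered) / n
    counts = Counter(ordered)
    highest = max(counts.values())
    modes = [x for x, c in counts.items() if c == highest]
    median = item_at(ordered, (n + 1) / 2)
    # Quartile position: (n + 1)/4 for odd n, (n + 2)/4 for even n.
    q_pos = (n + 1) / 4 if n % 2 == 1 else (n + 2) / 4
    q1 = item_at(ordered, q_pos)
    q3 = item_at(ordered, n + 1 - q_pos)  # same position counted back from the last item
    # Population standard deviation: sqrt(sum of x squared / n - mean squared).
    sd = math.sqrt(sum(x * x for x in ordered) / n - mean ** 2)
    return {
        "mean": mean, "modes": modes, "median": median,
        "range": ordered[-1] - ordered[0], "iqr": q3 - q1, "sd": sd,
    }

scores = [5, 8, 9, 5, 7, 3, 6, 8, 6, 5, 4, 2, 9, 5, 6, 7, 4]
stats = summary_statistics(scores)
print(stats)
```

Other software may use a different quartile convention, so its interquartile range can differ slightly from the value found by this rule.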
Example 2
For the following masses, find the:
a mean b mode c median
d range e interquartile range f standard deviation.
Mass (kg) 45–49 50–54 55–59 60–64 65–69 70–74 75–79
Frequency 3 5 8 12 10 7 5
Solution
Redraw the table with true class limits, class midpoint (x), frequency (f), cumulative frequency, fx (frequency × score), x² and fx² (frequency × score²) columns. Use the class midpoint in calculations for the score.
True class limits (kg)   x    f    Cumulative frequency   fx     x²     fx²
44.5–49.5                47   3    3                      141    2209   6 627
49.5–54.5                52   5    8                      260    2704   13 520
54.5–59.5                57   8    16                     456    3249   25 992
59.5–64.5                62   12   28                     744    3844   46 128
64.5–69.5                67   10   38                     670    4489   44 890
69.5–74.5                72   7    45                     504    5184   36 288
74.5–79.5                77   5    50                     385    5929   29 645
Totals: Σf = 50, Σfx = 3160, Σfx² = 203 090
a Write the formula for the mean. $\bar{x} = \frac{\sum fx}{\sum f}$
Substitute. $= \frac{3160}{50}$
Calculate the mean. Mean = 63.2 kg
b The highest frequency is 12. The modal class is 60–64 kg.
c The median is the $\frac{n+1}{2}$ = 25.5th score. The 25.5th score is in the 59.5–64.5 class.
Use interpolation to find the median. It is the (25.5 − 16) = 9.5th score out of 12.
Calculate median using class width of 5. Median = 59.5 + $\frac{9.5}{12}$ × 5 ≈ 63.5 kg
d The lowest possible score is 44.5 and the highest is 79.5, so use these for the range.
Range = 79.5 − 44.5 = 35 kg
e First quartile is the $\frac{n+2}{4}$ = 13th score, as there are an even number of scores.
The 13th score is in the 54.5–59.5 class. It is the (13 − 8) = 5th score out of 8.
Use interpolation to find Q1. Q1 = 54.5 + $\frac{5}{8}$ × 5 = 57.625 kg
Third quartile is the 13th-last or (50 − 12) = 38th data item.
The 38th score is in the 64.5–69.5 class. It is at the end of the class.
Write Q3. Q3 = 69.5 kg
Subtract. Interquartile range = Q3 − Q1 = 69.5 − 57.625 ≈ 11.9 kg
f Write the formula for the SD. $\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$
Substitute. $= \sqrt{\frac{203\,090}{50} - (63.2)^2}$
Calculate the SD. Standard deviation ≈ 8.22 kg
A cumulative frequency polygon or ogive can be used to find the median, quartiles and other percentiles. A polygon has straight lines connecting points but an ogive has a smooth curve. You should remember from Year 11 that cumulative frequency polygons and ogives have points placed at the upper ends of class intervals.
The summary statistics can also be worked out by entering the data in List 1 and List 2 of a graphics calculator, but only the mean and standard deviation are given correctly.
Casio CFX-9850GB PLUS
To enter the data, choose the STAT menu. If there is already data in List 1, delete it by pressing F6 F4 F1.
Enter the class centres into List 1, and the frequencies into List 2, pressing EXE after each item and using the cursor arrows to move between the lists.
Choose the CALC submenu by pressing F2.
Choose the SET submenu by pressing F6.
Set the 1Var XList to List 1. Set the 1Var Freq to List 2.
Press EXIT and choose 1-Var by pressing F1.
Use the cursor keys to move down the list.
The mean is 63.2, median is given as 62, range (maxX − minX) is given as 30, interquartile range (Q3 − Q1) is given as 10, standard deviation (xσn) is about 8.22 and mode is given as 62. The use of class midpoints means that only the mean and standard deviation are reliable.
Texas Instruments TI-83
Press the STAT key and choose the Edit menu. If there is already data in a list, clear it by moving up to the heading and pressing CLEAR ENTER.
Enter the class centres into L1 and the frequencies into L2, pressing ENTER after each item and using the cursor arrows to move between the lists.
Press the STAT button and choose the CALC menu.
Choose 1-Var Stats, press ENTER and put in L1, L2 by pressing 2nd 1 , 2nd 2.
Use the cursor keys to move down the list.
The mean is 63.2, median is given as 62, range (maxX − minX) is given as 30, interquartile range (Q3 − Q1) is given as 10, standard deviation (σx) is about 8.22 and mode is not given.
The use of class midpoints means that only the mean and standard deviation are reliable.
Sharp EL-9650
The Sharp instructions are given on the CD-ROM.
All methods
Write the answers. Mean = 63.2 kg, modal class = 60–64 kg, median ≈ 63.5 kg, range = 35 kg, interquartile range ≈ 11.9 kg, standard deviation ≈ 8.22 kg
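The grouped-data working for the masses can be checked in the same way. This is a minimal Python sketch, assuming class midpoints for the mean and standard deviation and the interpolation formula L + (m/f) × w for the median and quartiles. The function names `grouped_mean_sd` and `interpolate_item` are hypothetical helpers, not taken from the text.

```python
import math

def grouped_mean_sd(midpoints, freqs):
    """Mean and standard deviation from class midpoints and frequencies."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
    mean_square = sum(f * x * x for x, f in zip(midpoints, freqs)) / total
    return mean, math.sqrt(mean_square - mean ** 2)

def interpolate_item(position, lower_limits, freqs, width):
    """Value of the item at a 1-based position, using L + (m / f) * w."""
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if position <= cumulative + f:
            m = position - cumulative  # how far into this class the item lies
            return lower + m / f * width
        cumulative += f
    raise ValueError("position is beyond the total frequency")

midpoints = [47, 52, 57, 62, 67, 72, 77]
lower_limits = [44.5, 49.5, 54.5, 59.5, 64.5, 69.5, 74.5]  # true lower class limits
freqs = [3, 5, 8, 12, 10, 7, 5]
n = sum(freqs)

mean, sd = grouped_mean_sd(midpoints, freqs)
median = interpolate_item((n + 1) / 2, lower_limits, freqs, 5)
q1 = interpolate_item((n + 2) / 4, lower_limits, freqs, 5)          # n is even
q3 = interpolate_item(n + 1 - (n + 2) / 4, lower_limits, freqs, 5)  # 13th-last item
print(mean, sd, median, q1, q3)
```

The quartile positions are written for an even total frequency, as in this example; for an odd total the position (n + 1)/4 would be used instead.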
Example 3
Use an ogive to find the median and interquartile range of the following heights of Year 12 students.
Height (cm)   Number
145–149       1
150–154       1
155–159       3
160–164       6
165–169       11
170–174       13
175–179       4
180–184       1
Solution
Redraw the table with true class limits and cumulative frequencies.
Height (cm)    Number   Cumulative frequency
144.5–149.5    1        1
149.5–154.5    1        2
154.5–159.5    3        5
159.5–164.5    6        11
164.5–169.5    11       22
169.5–174.5    13       35
174.5–179.5    4        39
179.5–184.5    1        40
Draw the ogive, starting from 0 at 144.5.
[Figure: ogive titled 'Year 12 heights'. Horizontal axis: Height (cm), 140 to 190. Vertical axis: Cumulative frequency, 0 to 40.]
Find the median and quartiles.
For Q1: 25% of 40 = 10
For Q2: 50% of 40 = 20
For Q3: 75% of 40 = 30
From the graph: Q1 ≈ 163.5, Median ≈ 168.5, Q3 ≈ 172.5
So Q3 − Q1 = 9
Write the answers. The median is about 168.5 cm and the interquartile range is about 9 cm.
You should be able to complete the work in the following exercise as revision of Year 11 work.
Exercise 4.1
Review of summary statistics
1 Find the mean, mode, median, range, interquartile range and standard deviation of each of the following sets of scores.
a 19, 19, 12, 19, 19, 14, 16, 20, 13, 16, 13, 18, 16, 19
b 46, 49, 53, 54, 48, 47, 44, 47, 59, 62, 61, 47, 56
c 6, 7, 9, 2, 6, 6, 9, 8, 5, 4, 2, 3, 7, 8, 6, 4, 7
2 Find the mean, mode, median, range, interquartile range and standard deviation of each of the following sets of scores.
a Score 7 8 9 10 11 12 13 14 15 16
Frequency 1 4 5 6 8 6 3 3 2 2
b Score 0 1 2 3 4 5 6 7 8 9 10
Frequency 3 7 4 11 12 12 7 5 2 2 1
c Score 26 27 28 29 30 31 32 33 34
Frequency 6 7 11 17 22 24 19 9 5
3 Use a graphics calculator to find the mean, mode, median, range, interquartile range and standard deviation of each of the following sets of scores.
a 23, 19, 27, 25, 23, 24, 20, 21, 22, 23, 24, 20, 25, 23, 24, 20
b 47, 43, 44, 42, 46, 45, 44, 48, 44, 43, 42, 48, 44, 40, 41
c Score 22 23 24 25 26 27 28 29 30 31 32
Frequency 2 4 6 10 12 8 6 5 2 1 3
d Score 4 5 6 7 8 9 10 11 12
Frequency 2 0 5 8 12 9 6 4 1
4 Find the mean, modal class, median, range, interquartile range and standard deviation for each table below.
a Class 10–14 15–19 20–24 25–29 30–34
Frequency 2 6 8 11 1
b Class 20–29 30–39 40–49 50–59 60–69 70–79
Frequency 3 8 12 19 9 5
c Class 105–109 110–114 115–119 120–124 125–129
Frequency 16 28 10 10 9
d Class 70–89 90–109 110–129 130–149 150–169 170–189
Frequency 4 11 11 18 37 18
5 Use an ogive to find the median and interquartile range for each of the following data sets.
a Heights of some Year 11 students
Height (cm) 145–149 150–154 155–159 160–164 165–169 170–174 175–179 180–184
Number 1 4 8 15 20 40 8 4
b Resting heart rate for a group of people
Pulse (beats/min) 45–49 50–54 55–59 60–64 65–69 70–74 75–79 80–84
Number 3 4 8 15 20 35 10 3
Modelling and problem solving
6 The masses of eggs collected one day at a free-range farm were as follows.
Mass (g) 36–39 40–43 44–47 48–51 52–55 56–59 60–63 64–67
Number 3 24 48 115 220 55 40 8
a What is the average egg mass?
b What range of masses would you expect to include the middle 50% of eggs?
7 The numbers of days for which factory workers were absent over a year were as follows.
0 2 3 4 5 2 2 8 0 1 1 0 2 4 3 2 12 3 2 2 1 4 3 0 1 1 1 6 10 4 7 3 9 2
a For a worker chosen at random, what is the most probable number of days absent?
b What number of days absent should the manager use to estimate worker-hours lost in a year?
8 The reliability of popular cars up to 4 years old was surveyed in 2001 by a consumer organisation. Results for makes where over 400 cars were surveyed are given in the table below. What percentage of Australians with one of these cars could expect their car to break down in the next 12 months?
Percentage by make of newer cars breaking down over 12 months
Make Honda Mazda Toyota Subaru Mitsubishi Nissan Ford Holden
Number surveyed 433 440 1675 911 907 546 1027 1333
Source: Choice, September 2001
4.2
Populations and samples
Many variables involve the collection of data from very large groups. In this case it may not be practical to use the whole group. It may be more practical to collect information from only part of the group.
When we use a sample to find a statistic, we want the statistic to be as close as possible to the population parameter we are estimating. The sample needs to be chosen so that it is as representative of the population as we can make it. Very small samples will not usually be representative, as shown by the extreme case of a sample consisting of only one. The method of selection of the sample will also influence how representative the sample is.
Populations
For any variable or group of variables, the population is the whole group from which data could be collected.
In a census, data is collected from the whole population. A sample is a part of the population.
In a survey, data is collected from a sample.
A parameter is a clearly defined value about a particular population, such as the mean. A statistic is an estimate of a parameter obtained using a sample.
Example 4
Twenty people waiting in a queue at 7:30 am at an ATM were asked how much they intended to withdraw. The smallest amount was $20, the average amount was $78 and the greatest amount was $500.
Identify the population, parameters and statistics for this sample.
Solution
The population is the whole group that could be asked about the amount they withdraw from an ATM.
The population is all the people who use ATMs.
Parameters are clearly defined values from the whole population. We don’t need to know the value to define it clearly.
There are 3 parameters:
• the minimum withdrawal
• the average amount withdrawn
• the maximum withdrawal.
The statistics are the values obtained from the sample. We are more likely to know these values. The number of people (20) is not a statistic because it is not an estimate of a parameter.
There are 3 statistics:
• The minimum withdrawal is $20.
• The average withdrawal is $78.
• The maximum withdrawal is $500.
Samples
If 4000 survey forms are sent out and only 1500 are returned, the sample is likely to be biased. This is an example of non-response bias, where the bias arises through a significant number of people not responding to a survey. There are three main methods of sampling used to minimise sampling bias.
Investigation
Sample size
Janita has a swimming pool but the filter has been broken for months. The pool is now green with algae, but it has attracted many frogs. She has been spotlighting in the pool area on rainy nights and has found and drawn many different frogs. You can work in groups to work out the average size of the frogs.
Work in groups of about four people. Your teacher will give you some paper models of the frogs to measure.
1 Work out the average length of a sample of 3 frogs.
2 Then use a sample of 6 frogs to recalculate the average length and work out the standard deviation.
3 Now use a sample of 10 frogs.
4 Compare your results with the results of other groups.
5 Does the average length change as the sample size is increased?
6 What does change as the sample size is increased?
Sampling methods
A systematic sample has items selected at regular intervals from a list. It is best to use a randomly prepared list and begin your selection from a randomly chosen starting point.
A random sample has items selected so that every item has an equal chance of being selected. Items may be selected by drawing lots ‘out of a hat’ or by using random numbers. Tables of random numbers are published for this purpose.
A stratified random sample is selected so that all identifiable groups within the population are represented in the sample in the ratio in which they appear in the population.
Example 5
A sample of 8 students must be selected from a class list of 26 to meet the local business community. The list is alphabetical and includes enrolment numbers from 100 to 125.
Solution
We calculate the proportion we need. There are 26 students and we want 8. 26 ÷ 8 ≈ 3
Choose every third student.
Start at a random place. Start at number 113.
The selections are then every third number in the list below, beginning with 113. Notice that if we run out, we go back to the start.
113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 100, 101, 102, 103, 104, 105, 106, 107, 108
Write the selection. Choose students 102, 105, 108, 113, 116, 119, 122 and 125.
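The systematic selection of every third student can be sketched in a few lines of Python. This is an illustration only: the function name `systematic_sample` is hypothetical, and the step is taken as the whole-number part of (list size ÷ sample size), which is an assumption made here so that wrapping around the list never selects the same item twice.

```python
import random

def systematic_sample(items, sample_size, start_index=None, rng=random):
    """Select items at regular intervals, wrapping back to the start of the list."""
    step = max(1, len(items) // sample_size)  # 26 // 8 = 3: every third item
    if start_index is None:
        start_index = rng.randrange(len(items))  # random starting place
    return [items[(start_index + k * step) % len(items)] for k in range(sample_size)]

enrolment_numbers = list(range(100, 126))  # 100 to 125 inclusive
# Start at number 113, which is at index 13 in the list.
chosen = systematic_sample(enrolment_numbers, 8, start_index=13)
print(sorted(chosen))
```

Leaving out `start_index` lets the program pick the starting place at random, as the text recommends.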
In Example 5, it is unlikely that students from the same family would be selected because the list is alphabetical. This is a subtle bias that means that the sample is not completely random. Systematic sampling commonly uses lists that are prepared alphabetically so that twins will not be selected, but other pairs can be selected. Random sampling avoids this problem.
Random samples can also be selected using spinners, dice or other mechanical aids, including computers and calculators. However, computers and calculators do not give truly random numbers. They are properly called pseudo-random, because they will give the same group of numbers over and over again. In most cases (provided you don’t keep resetting the calculator), the pseudo-random numbers generated are good enough.
Example 6
A group of 50 Mathematics A students at a school are to be surveyed regarding their career preferences. Use the two-digit random number table on page 112 to select a sample of 10.
Solution
First, assign each student a number from (say) 1 to 50.
Next, randomly select a starting position on a random number table. We have reproduced part of the table below and, using a pin, chosen 95 (row 9, column 25) as our starting point. Move along the table and ignore numbers greater than 50. Ignore 00 and repeated numbers, such as the second 11. Record the other numbers until you have selected 10 numbers.
37 49 95 38 08 68 70 32 88 65 89 70 13 93 24 40 05 41 34 72 49 10 50 78 95
04 92 03 87 51 08 13 11 48 36 98 73 32 94 11 01 78 95 19* 70 13 84 91 57 67
05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88
30 64 49 26 22 93 66 84 39 90 57 91 05 63 53 86 05 39 32 61 67 10 68 26 73
Write the selection. The sample is students 4, 3, 8, 13, 11, 48, 36, 32, 1 and 19.
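The rule used in Example 6 (skip numbers out of range, skip repeats, stop at 10) can be written as a short Python sketch. The function name `select_from_table` is illustrative, and the list below is simply the row of the table that follows the starting point.

```python
def select_from_table(table_numbers, how_many, low, high):
    """Read along the table, keeping numbers in range and skipping repeats."""
    chosen = []
    for number in table_numbers:
        if low <= number <= high and number not in chosen:
            chosen.append(number)
            if len(chosen) == how_many:
                break
    return chosen

# The row of the random number table that follows the starting point (95).
row = [4, 92, 3, 87, 51, 8, 13, 11, 48, 36, 98, 73, 32,
       94, 11, 1, 78, 95, 19, 70, 13, 84, 91, 57, 67]
sample = select_from_table(row, how_many=10, low=1, high=50)
print(sample)
```

Because the lowest allowed number is 1, an entry of 00 is ignored automatically.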
Example 7
Use the two-digit random number table on page 112 to randomly select 6 numbers between 450 and 700.
Solution
It is usual to continue from the last place used in the random number table. This was marked with an asterisk in Example 6. Because we want three-digit numbers, take adjacent pairs of numbers and discard the last digit. Then 70 13 becomes 701, 84 91 becomes 849, etc.
04 92 03 87 51 08 13 11 48 36 98 73 32 94 11 01 78 95 19* 70 13 84 91 57 67
05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88
30 64 49* 26 22 93 66 84 39 90 57 91 05 63 53 86 05 39 32 61 67 10 68 26 73
Write the selection. The numbers are 576, 689, 631, 566, 682 and 644.
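The pairing step in Example 7 can be sketched in Python as follows. The function name `three_digit_numbers` is hypothetical. The worked answer passes over 700 (from the pair 70 05), so the sketch assumes that 'between 450 and 700' excludes the end values.

```python
def three_digit_numbers(two_digit_entries):
    """Join adjacent pairs of two-digit entries and discard the last digit."""
    numbers = []
    for i in range(0, len(two_digit_entries) - 1, 2):
        four_digit = two_digit_entries[i] * 100 + two_digit_entries[i + 1]
        numbers.append(four_digit // 10)  # 70 13 -> 7013 -> 701
    return numbers

# Table entries from just after the first asterisk up to the second asterisk.
entries = [70, 13, 84, 91, 57, 67,
           5, 4, 13, 40, 88, 75, 68, 99, 63, 19, 56, 69, 99, 33, 68, 24,
           70, 5, 25, 64, 42, 41, 85, 4, 88,
           30, 64, 49]
candidates = three_digit_numbers(entries)
selection = [n for n in candidates if 450 < n < 700][:6]
print(selection)
```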
Random number generation program
You can write a simple program on your graphics calculator to generate a number of random whole numbers automatically. A random number generator program that can be typed in or loaded directly into your calculator is included on the CD-ROM.
Example 8
Use the random number generator on your graphics calculator to obtain a random number between 130 and 150.
Solution
Graphics (and scientific) calculators generate a pseudo-random number from 0 to 1. For numbers between 130 and 150, use the first three digits after the decimal point.
Casio CFX-9850GB PLUS
In the RUN mode, press OPTN F6 F3 F4 to obtain Ran#.
Press EXE to repeat until a number in the correct range is obtained.
Texas Instruments TI-83
In the MATH menu, select PRB 1 to choose rand.
Press ENTER to repeat until a number in the correct range is obtained.
Sharp EL-9650
The Sharp instructions are given on the CD-ROM.
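The same repeat-until-in-range idea can be imitated in Python, whose `random.random()` also returns a pseudo-random number from 0 up to 1. This is a minimal sketch: the function name `random_between` is illustrative, and treating 'between' as excluding 130 and 150 themselves is an assumption.

```python
import random

def random_between(low, high, rng=random):
    """Repeat until the first three decimal digits form a number in range."""
    while True:
        candidate = int(rng.random() * 1000)  # roughly the first three digits after the point
        if low < candidate < high:
            return candidate

rng = random.Random(2024)  # fixed seed, so the same numbers come out each run
value = random_between(130, 150, rng)
print(value)
```

Fixing the seed shows what pseudo-random means in practice: the program gives the same sequence every time it is restarted with that seed.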
Example 9
A manufacturing firm employs 42 assembly workers, 10 office staff and 3 supervisors. It is decided that a stratified sample of 15 staff should be surveyed regarding the formulation of a policy on smoking. How many people from each group of workers should be included?
Solution
Get the total number. Number of employees = 42 + 10 + 3 = 55
Work out the fraction in each group.
Fraction of assembly workers = $\frac{42}{55}$
Fraction of office staff = $\frac{10}{55}$
Fraction of supervisors = $\frac{3}{55}$
Work out the number of each in the sample.
Assembly = $\frac{42}{55}$ × 15 ≈ 11.45
Office = $\frac{10}{55}$ × 15 ≈ 2.73
Supervisors = $\frac{3}{55}$ × 15 ≈ 0.82
You cannot have part people. The sample should include 11 assembly workers, 3 office staff and 1 supervisor.
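The stratified allocation for the manufacturing firm can be reproduced with a short Python sketch. The function name `stratified_allocation` is hypothetical. Rounding each group separately is the approach used in the worked solution, but it is not guaranteed to add up to the required sample size for every set of group sizes, so the total should always be checked.

```python
def stratified_allocation(group_sizes, sample_size):
    """Share the sample among groups in proportion to their sizes."""
    total = sum(group_sizes.values())
    return {group: round(size / total * sample_size)
            for group, size in group_sizes.items()}

staff = {"assembly": 42, "office": 10, "supervisors": 3}
allocation = stratified_allocation(staff, 15)
print(allocation, sum(allocation.values()))
```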
It is not always easy to collect data to directly calculate a desired statistic to estimate a population parameter. Sometimes it is better to use a sample to find a related statistic that can be used in combination with other statistics to obtain the information wanted.
Example 10
Which statistic is easier to collect?
A: The proportion of full-time workers who are women
B: The proportion of women who are full-time workers
Solution
To find an unbiased sample of full-time workers might prove difficult, as people work in so many different industries and have such different work hours. However, it is relatively easy to obtain an unbiased sample of women by reference to the electoral rolls, so B is the easier (and thus cheaper) statistic to collect.
Exercise 4.2
Populations and samples
1 Identify the population, parameters and statistics for each of the following situations.
a Thirty people leaving a supermarket were asked how much they had spent. The smallest amount was $4.30, the average was $87.64 and the largest amount was $198.75.
b People with telephone numbers starting with (07) 3275 were rung at 8 pm one night to ask what TV channel they were watching. Twenty said Channel 9, 15 said Channel 7, 8 said Channel 10 and 7 said Channel 2.
c The cars crossing a bridge were noted by a traffic surveyor. In one 15-minute period there were 18 white cars, 9 yellow cars, 5 blue cars, 7 red cars and 8 other cars of various colours.
d A fast-food shop sold 75 hamburgers, 30 fish’n’chips, 15 salad rolls, 42 mini pizzas, 52 sausage rolls and 27 pies between 5 pm and 6 pm.
e The bicycles in a rack inside the entrance of a school were checked. From 30 bicycles, 20 had chromoly frames. There were more mountain bikes than other types.
2 Use systematic sampling to select a sample of 4 pies from a batch of 50, starting at number 17.
3 Start at row 3, column 21 of the two-digit random number table on page 112 and successively select 10 different numbers that are:
a 1- or 2-digit numbers between 1 and 50
b 4-digit numbers between 1000 and 2000
c 3-digit numbers between 200 and 700
d 5-digit numbers between 50 000 and 99 999.
4 Use stratified random sampling to state how many of each group should be chosen in each case to obtain the following samples.
a Sample of 10 from 120 Year 8s, 150 Year 9s, 130 Year 10s, 80 Year 11s and 90 Year 12s
b Sample of 15 from 28 sales reps, 8 managers, 35 clerical staff and 15 stores staff
c Sample of 8 from 40 surfers, 55 boogie-boarders and 30 sailboarders
5 Explain some of the difficulties that may occur when an attempt is made to take a census of a population.
6 Which statistic is easier to collect?
A: The proportion of smokers who die from lung cancer
B: The proportion of people dying from lung cancer who were smokers
7 Which statistic is easier to collect?
A: The proportion of butchers with lost fingers
B: The proportion of people with lost fingers who are butchers
8 Some female koalas are infected with chlamydia. Which statistic is easier to collect?
A: The proportion of female koalas without babies that have chlamydia
B: The proportion of female koalas with chlamydia that do not have babies
Modelling and problem solving
9 The figures in the table below show the Australian population in 2000. Use stratified random sampling to state how many should be chosen from each state and territory to make a sample of 500, if:
a males and females are selected in proportion
b persons are selected regardless of sex.
Australian population, March 2000
Source: ABS
State or territory Males Females Persons
New South Wales 3 206 357 3 242 435 6 448 792
Victoria 2 352 283 2 401 581 4 753 864
Queensland 1 775 649 1 773 607 3 549 256
South Australia 739 797 756 292 1 496 089
Western Australia 945 331 932 203 1 877 534
Tasmania 231 545 238 745 470 290
Northern Territory 103 003 91 456 194 459
Australian Capital Territory 154 924 156 162 311 086
10 A TV channel recently ran a ‘telephone poll’ on the issue of capital punishment. The poll was conducted because of a particularly callous and brutal murder. Do you think that this poll could have been biased? How/why?
11 Each year, the students of Griffith University are surveyed to estimate how many have part-time jobs.
a What is the population?
b What is the parameter?
12 A manager wishes to select 10 items from a group of 150 that appear on an inventory. Use the table of random numbers on page 112 to select the 10 items for the manager. (Start at row 2, column 5.)
13 Many research companies use the telephone book to select samples to be surveyed.
a Discuss various methods of selecting a random sample of:
i 10 pages from the telephone book
ii 10 names from a particular page of the telephone book
iii 10 names from the telephone book
iv 10 Wilsons from the telephone book.
b Obtain a telephone book and select the four samples.
14 A medical research team wish to randomly select 6 people to test the possible side effects of an experimental drug. They have 40 volunteers seated in 5 rows with 8 chairs in each row.
a In order to select their sample, they number 5 cards from 1 to 5 and place the letters A to H on another 8 cards. The sets of cards are then shuffled separately and one card from each ‘deck’ is selected. In this way a selection of cards 2 C would indicate the person in the 2nd row sitting on the 3rd chair. The cards are then returned to the separate decks, which are reshuffled, and selections continue in this way until the required sample of 6 people has been chosen. Comment on the suitability of this selection method.
b If there were 28 people seated in the 40 chairs, comment on the suitability or otherwise of this method.
4.3
Outliers and their effects
Sometimes a set of data may include some scores that are so different from the group as to cause suspicion. Scores that are markedly different from the majority are called outliers. An outlier should be investigated to decide whether it should be left in the data set or discarded as erroneous.
Outliers
Items outside the following bounds are generally considered to be outliers:
• For discrete data: twice the interquartile range from the median.
In general, the removal of an outlier has more effect on the mean than on the median.
Example 11
The following data shows the yields reported by different home gardeners. Each person was asked how many tomatoes they picked from each tomato bush. Should any of the claims be checked?
30 25 35 32 33 28 31 24 27 54 27 35 27 24 22 31 29 32 30
Solution
The data is discrete, so the interquartile range should be used.
Arrange in order. 22, 24, 24, 25, 27, 27, 27, 28, 29, 30, 30, 31, 31, 32, 32, 33, 35, 35, 54
Median = 30, Interquartile range = 32 − 27 = 5
Reasonable bounds for data are 30 − 2 × 5 to 30 + 2 × 5, that is, 20 to 40.
The value 54 is an outlier and should be investigated.
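The check in Example 11 can be sketched in Python. This is an illustration under stated assumptions: it applies the rule for discrete data given above (more than twice the interquartile range from the median) and the quartile positions from Section 4.1. The names `item_at` and `discrete_outliers` are hypothetical.

```python
import math

def item_at(ordered, position):
    """Item at a 1-based position; a position ending in .5 is taken
    as the average of the two neighbouring items."""
    return (ordered[math.floor(position) - 1] + ordered[math.ceil(position) - 1]) / 2

def discrete_outliers(data):
    """Items more than twice the interquartile range from the median."""
    ordered = sorted(data)
    n = len(ordered)
    median = item_at(ordered, (n + 1) / 2)
    q_pos = (n + 1) / 4 if n % 2 == 1 else (n + 2) / 4
    iqr = item_at(ordered, n + 1 - q_pos) - item_at(ordered, q_pos)
    low, high = median - 2 * iqr, median + 2 * iqr
    return [x for x in ordered if x < low or x > high]

tomatoes = [30, 25, 35, 32, 33, 28, 31, 24, 27, 54,
            27, 35, 27, 24, 22, 31, 29, 32, 30]
outliers = discrete_outliers(tomatoes)
print(outliers)
```

A program can only flag the value; deciding whether 54 is a genuine yield or a mistake still needs investigation.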
1 Find any outliers in the following sets of discrete data.
a 15, 18, 16, 17, 19, 20, 22, 14, 14, 15, 16, 18, 13, 25, 14, 15, 17
b 24, 25, 26, 23, 24, 26, 27, 29, 23, 25, 26, 27, 28, 24, 24, 25, 26, 25, 20, 26
c 7, 5, 6, 8, 4, 3, 5, 2, 7, 9, 2, 4, 5, 6, 3, 2, 7, 8, 4, 6, 3, 1, 9, 8, 4
d 45, 46, 45, 47, 51, 44, 46, 47, 44, 45, 46, 44, 45, 37, 44, 46, 43, 46, 49, 44, 46
2 Find any outliers in the following sets of continuous data.
a 3.8, 3.9, 3.7, 3.2, 3.6, 4.9, 4.1, 3.8, 3.6, 3.8, 3.7, 3.6, 3.8, 3.7, 3.7
b 171, 175, 165, 169, 176, 178, 166, 169, 172, 174, 197, 159, 167, 171, 177, 172
c 32.4, 35.4, 38.7, 33.5, 32.5, 30.1, 29.7, 22.1, 34.5, 37.5, 38.5, 34.7, 39.6, 37.4
d 7, 9, 11, 12.2, 14, 10, 8, 15, 17, 4.8, 11.6, 8.5, 9.3, 12.7, 13.5, 7.4
3 The following data was recorded for the numbers of siblings of some Year 12 students:
2 3 0 1 2 2 3 4 5 2 0 1 2 0 1 2 9 1 1 2 3 0 0 1 1 2 2 1
a Find any outliers.
b Calculate the mean and median of the whole set of data.
c Calculate the mean and median of the set of data with any outliers removed.
d Compare the results of parts b and c.
4 The following data was recorded for masses (in kg) of some Year 12 students:
55 60 72 80 75 81 73 58 44 61 55 68 70 124 75 83 78 67 59 69 75 88
a Find any outliers.
b Calculate the mean and median of the whole set of data.
c Calculate the mean and median of the set of data with any outliers removed.
d Compare the results of parts b and c.
4.4
Comparing sets of data
In Year 11, you saw how to construct back-to-back stemplots, boxplots and histograms. These can all be used to compare related sets of data. A boxplot shows the median, quartiles and extremes of the data and is sometimes called a five-number-summary of the data. Instead of using the median and interquartile range, we can use the mean and standard deviation to compare sets of data. This is actually the most common method, since they can both be calculated easily by computer and used to work out probabilities. The mean and standard deviation can also be used to compare individual scores from related sets of data.
Comparison of sets of data
For two related sets of data:
• The set with the higher mean (or median) is considered to be the higher group.
• The set with the higher standard deviation (or interquartile range) is considered to have a greater spread.
Individual scores from related sets of data may be compared in terms of standard deviations from the mean.
The following scores were obtained by students from the same class in Maths and English:
English: 13 14 16 12 8 6 15 18 12 14 13 11 10 9 7 9 12 8 9 7 10 10 9 11 13
Maths: 5 2 9 7 9 12 8 9 7 10 10 9 11 18 11 14 16 17 8 6 20 18 12 4 6
Compare the results.
Solution
Find the mean and standard deviation of both sets of scores.
English: Mean = 11.04, σ ≈ 2.95
Maths: Mean = 10.32, σ ≈ 4.58
Compare the means. The class did better in English than in Maths.
Compare the standard deviations. Maths had a greater spread of scores.
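The same summary statistics can be checked by computer instead of a graphics calculator. The sketch below uses Python's built-in `statistics` module. It assumes the population standard deviation (`pstdev`, which divides by the number of scores), since that is the version that reproduces the values of σ quoted in the solution. The helper name `summarise` is made up for illustration.

```python
from statistics import mean, pstdev

# Scores from the worked example above.
english = [13, 14, 16, 12, 8, 6, 15, 18, 12, 14, 13, 11, 10,
           9, 7, 9, 12, 8, 9, 7, 10, 10, 9, 11, 13]
maths = [5, 2, 9, 7, 9, 12, 8, 9, 7, 10, 10, 9, 11,
         18, 11, 14, 16, 17, 8, 6, 20, 18, 12, 4, 6]


def summarise(scores):
    """Return (mean, population standard deviation), each rounded to 2 decimal places."""
    return round(mean(scores), 2), round(pstdev(scores), 2)


english_mean, english_sd = summarise(english)
maths_mean, maths_sd = summarise(maths)

print("English:", english_mean, english_sd)
print("Maths:  ", maths_mean, maths_sd)

# The higher mean marks the higher group; the higher SD marks the greater spread.
higher_group = "English" if english_mean > maths_mean else "Maths"
greater_spread = "English" if english_sd > maths_sd else "Maths"
print("Higher group:", higher_group, "| Greater spread:", greater_spread)
```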
Example 12
In English, Aya got 28 out of 40 and in Maths she got 17 out of 25. The mean and standard deviation for English were 24 and 6 respectively, while for Maths they were 15 and 5. Compare Aya’s results in terms of the standard deviations.
Solution
For English, find the difference with the mean. For English, Aya was 4 above the mean.
Write as a fraction of the standard deviation. This is 4/6 ≈ 0.67 standard deviations.
For Maths, find the difference with the mean. For Maths, Aya was 2 above the mean.
Write as a fraction of the standard deviation. This is 2/5 = 0.4 standard deviations.
Write the conclusion. Aya did better in English than in Maths.
Write a reason. She is more standard deviations above the mean.
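The calculation in this example follows one pattern: subtract the mean from the score, then divide by the standard deviation. The following Python sketch applies that pattern to Aya's results. The function name `standard_score` is chosen here for illustration and is not a term used in the text above.

```python
def standard_score(score, mean, sd):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / sd


# Aya's results from the example: score, class mean, class standard deviation.
english_z = standard_score(28, 24, 6)
maths_z = standard_score(17, 15, 5)

print("English:", round(english_z, 2), "standard deviations above the mean")
print("Maths:  ", round(maths_z, 2), "standard deviations above the mean")

# The subject with the larger value is the one in which she did better
# relative to the rest of the class.
better = "English" if english_z > maths_z else "Maths"
print("Better subject:", better)
```

A negative result would mean the score is below the mean, so the same comparison works for scores on either side of the mean.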
Did you know?
When OP scores are calculated, statistical methods are used to compare the scores of all people eligible for an OP. This is done across Queensland for all subjects for all schools. Although the methods are more complicated than those shown here, the same principles are used. More complicated methods are necessary to take account of the fact that in the QCS test some students will ‘have a bad day’ or ‘have a good day’, among other considerations to make sure that the system is fair to all students.
Modelling and problem solving
1 The scores of some students doing both Physics and Chemistry were as follows:
Physics: 43 57 59 64 78 56 43 49 34 28 42 55 67 69 62 54
Chemistry: 51 56 62 68 84 59 44 29 26 48 67 78 75 84 58 65
Use the mean and standard deviation to compare the results in Physics and Chemistry for these students.
2 The relative humidities at 3 pm in two towns over a period of 2 weeks were:
Town A: 33 35 67 45 48 67 84 56 58 57 45 48 68 56
Town B: 45 48 67 78 79 84 65 58 43 59 69 89 78 69
Compare the results and suggest which town got more storms. (Hint: How does humidity relate to storms?)
3 The number of words in each sentence of some prose was counted as part of a readability test. The data was recorded as follows by a researcher:
20 36 14 11 20 19 27 21 14 25 18 17 45 22 17 14 15 20 21 18 17 16 13 23 27 10 18 23 31 40
A page of prose taken from another source had the following sentence lengths:
12 16 18 25 34 19 13 15 18 20 23 25 24 14 16
Compare the prose from the two sources.
4 Theo and Martina both sell insurance policies for the same company. The number of policies sold by each per month over an 11-month period is shown in the following stemplot. Comment on their relative sales records.
Insurance policies sold in a month
Theo Martina
1 0 7 9 7 4 3 1 8 8 9 7 6 5 1 1 2 1 1 1 2 3 5
3 0 4 4
5 Jessica submitted her last 15 golf score cards to the club handicapper so that her handicap could be reviewed. The scores were:
85 90 86 80 95 92 81 86 87 88 89 90 91 92 82
Jessica’s friend Kathy had the following results:
86 102 81 90 105 78 89 86 80 91 84 82 79 81 88
Compare their records.
6 Travelling to work, I have two alternative routes. Both routes pass through sets of traffic lights. Over 8 journeys by each route, I record the time (minutes) spent waiting at lights. The results are shown below:
Route A: 15 15 16 10 17 20 14 13
Route B: 13 16 15 15 13.5 14 16.5 17
Use statistical methods to recommend a route and give reasons.
7 Petrona got 15/40 for Film and TV and 32/50 for Speech and Drama. She said the Film and TV test was harder because only half the class got better than 10/40 but most of the Speech and Drama class passed. In fact, for the Film and TV test the mean was 9.8 and the standard deviation was 4.6, while for Speech and Drama the mean was 27.2 and the standard deviation was 7.4. Which subject did she do better in, compared with the rest of the class?
8 David is 198 cm tall and weighs 104 kg. Use the following information to determine whether David is thin or fat.
        Population mean   Standard deviation
Height  174 cm            18 cm
Mass    84 kg             13 kg
Chapter review
Communication and justification
1 Describe how the median is calculated for grouped continuous data by interpolation. [Ex 4.1]
2 How does the interquartile range measure the spread of data? [Ex 4.1]
3 Describe how to use an ogive to find the first quartile of a set of data. [Ex 4.1]
4 What is the difference between a parameter and a statistic? [Ex 4.2]
5 How is a biased sample different from a fair sample? [Ex 4.2]
6 A local council is considering changing regulations to allow higher density housing in certain suburbs. What method of sampling should it use to conduct a survey? Justify your answer. [Ex 4.2]
7 What is an outlier? [Ex 4.3]
8 How can two sets of data be compared by summary statistics? [Ex 4.4]
Knowledge and procedures
9 Find the mean, mode, median, range, interquartile range and standard deviation of the following data. [Ex 4.1]
14 18 17 14 16 19 14 11 12 15 17 20 10 17 14 12
10 Find the mean, modal class, median, range, interquartile range and standard deviation of the following data. [Ex 4.1]
Score      5–9  10–14  15–19  20–24  25–29  30–34
Frequency  3    8      24     20     18     10
11 Use a graphics calculator to find the summary statistics of the following data. [Ex 4.1]
a 58, 60, 62, 56, 59, 61, 75, 43, 49, 50, 53, 57, 59, 64, 65, 65, 70, 51, 65, 53, 49
b
x  21–30  31–40  41–50  51–60  61–70  71–80  81–90  91–100  101–110  111–120  121–130
f  4      7      12     16     14     13     10     7       8        6        2
12 Use an ogive to find the median and interquartile range of the following data. [Ex 4.1]
Score      10–14  15–19  20–24  25–29  30–34  35–39
Frequency  3      7      12     16     8      4
13 Identify the population, parameters and statistics when some people walking down Albert Street were interviewed about how often they bought takeaway food. Two said they didn’t buy takeaway food, 15 said once a week, 10 said twice a week and 8 said more than twice a week, the maximum being 6 times a week.
14 Use the two-digit random number table on page 112 to successively find the following. (Start at row 29, column 12.) [Ex 4.2]
a Five 2-digit numbers between 10 and 80
b Seven 3-digit numbers between 200 and 400
c Nine 4-digit numbers between 3000 and 7000
Modelling and problem solving
15 A firm was interested in testing the advertising effectiveness of its new television commercial. As part of the test, the commercial was shown on a 6:30 pm local news programme in Mackay. Two days later, a market research firm conducted a telephone survey to obtain information on the percentage of viewers who recalled seeing the commercial (the recall rate) and their general impressions of the commercial. [Ex 4.2]
a What was the population of the study?
b What was the sample of the study?
16 A survey about commercial fishing is conducted by phoning people chosen at random from a page selected at random from the telephone book. Discuss reasons why this survey is biased. [Ex 4.2]
17 Which statistic is easier to collect? [Ex 4.2]
A: The proportion of obese people who die from heart disease
B: The proportion of people who die from heart disease who are obese
18 The following data shows the weekly sales claimed by some door-to-door salespeople selling sets of reference books:
5 8 10 7 6 3 8 12 48 3 12 15 6 8 7 5 3 18 19 15 13 14 18 24
Determine whether any claims should be investigated as suspicious. [Ex 4.3]
19 Andrew scored 68 for both English and Maths. The mean for English was 52 and the mean for Maths was 55. The standard deviations were 10 and 8 respectively. Determine the subject in which Andrew did better. [Ex 4.4]
20 The following table shows the results of men and women on a test of language skills. Statistically compare the results.
Score      20–29  30–39  40–49  50–59  60–69  70–79  80–89  90–99
f (women)  2      5      8      24     53     43     30     5
f (men)    7      12     15     42     47     31     14     2