
4 Samples and estimates


Academic year: 2020


Contents

4.1 Review of summary statistics
4.2 Populations and samples
4.3 Outliers and their effects
4.4 Comparing sets of data
Chapter review

Syllabus subject matter

Exploring and understanding data

■ What a sample represents, and whether it is appropriate
■ Summary statistics as sample statistics and estimates of parameters, including the interpretation and use of sample averages and medians as estimates of underlying population values or of values in a model

4.1 Review of summary statistics

In Year 11, you studied measures of central tendency, measures of spread (dispersion) and quantiles. These are more generally called summary statistics because they summarise the statistical information. They can be calculated from ungrouped data or grouped data, although slightly different methods are needed for grouped data.

Summary statistics

Measures of central tendency

The mode of a set of data is the most common score.

The mean is the arithmetic average of the scores. It is calculated by either:
• adding the individual scores and dividing by the number of scores, or
• adding the products of the scores and frequencies and dividing by the total frequency.

These are written as $\bar{x} = \frac{\sum x}{n}$ or $\bar{x} = \frac{\sum fx}{\sum f}$, where the symbol Σ means sum.

The median is the middle score (or average of the middle two scores), when the scores are written in order from smallest to largest. It is the $\frac{n+1}{2}$th score, provided the scores are in order.

The mode is the most probable, the median the most central and the mean the most commonly used measure of central tendency. The mean may not be appropriate for some discrete data, when the median would be used instead.

Measures of dispersion

Range = highest score − lowest score

Interquartile range = third quartile − first quartile

The standard deviation measures how far the data items are spread from the mean. It is abbreviated as SD, has the symbol σ and is calculated using the formula

$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$ for individual scores

or $\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$ for tables, where Σ means the sum.

Quantiles

A quantile is a score that divides the data in a frequency distribution into particular quantities.

Percentiles divide the data into percentage groups. The 35th percentile (shown as P35) is the score below which 35% of all scores lie.

Deciles divide the data into tenths. The 7th decile (shown as D7) is the score below which $\frac{7}{10}$ of the data lies.

Quartiles divide the data into quarters. The 3rd quartile (shown as Q3) is the score below which $\frac{3}{4}$ of the data lies.

• If n is odd: First quartile (Q1) = $\frac{n+1}{4}$th data item
• If n is even: First quartile (Q1) = $\frac{n+2}{4}$th data item

In both cases, the third quartile (Q3) is the corresponding item, counting back from the last.

Quantiles may be calculated using interpolation or graphs.

Grouped data

For calculation of the mean and standard deviation, the class midpoints are used. For calculation of the median, range and interquartile range, the true class limits are used.

Interpolation is used to find the median, Q1 and Q3.

The formula for the value of the mth term in a class is $L + \frac{m}{f} \times w$, where L is the lower class limit, f is the frequency of the class and w is the class width.

A cumulative frequency polygon or ogive can also be used in these calculations, in which case the median and quartiles are found using 50%, 25% and 75% of the total frequency respectively.

For grouped data we cannot find a mode, but only a modal class.

Example 1

A test resulted in these scores:

5 8 9 5 7 3 6 8 6 5 4 2 9 5 6 7 4

Find the:
a mean   b mode   c median
d range   e interquartile range   f standard deviation.

Solution

a Write the formula for the mean. $\bar{x} = \frac{\sum x}{n}$
Find the sum of the scores. Σx = 5 + 8 + 9 + … + 7 + 4 = 99
There are 17 scores. n = 17
Substitute in the formula. $\bar{x} = \frac{99}{17}$
Calculate the mean. Mean = 5.8235… ≈ 5.8

b Choose the most common score. Mode = 5

c Arrange scores in order. 2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9
Choose the middle score. Median = 6

d Find highest minus lowest. Range = 9 − 2 = 7

e Find the middle of the first half. Q1 = $\frac{4+5}{2}$ = 4.5
Find the middle of the second half. Q3 = $\frac{7+8}{2}$ = 7.5
Subtract. Interquartile range = Q3 − Q1 = 7.5 − 4.5 = 3

f Find the sum of the squares. Σx² = 25 + 64 + 81 + … + 49 + 16 = 641
Write the formula for the SD. $\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$
Substitute. $\sigma = \sqrt{\frac{641}{17} - (5.8235\ldots)^2}$
Calculate the SD. Standard deviation ≈ 1.95

When the third quartile is found by counting back from the last item, you can find which item it is by subtracting 1 less than the required number from the total frequency. If there are 50 items, then the 50th is the last, the 49th the 2nd-last, the 48th the 3rd-last, and so on. Thus the 10th-last is the (50 − 9) = 41st item.

Graphics calculator instructions

The summary statistics can also be worked out by entering the data in List 1 of a graphics calculator.

Casio CFX-9850GB PLUS

To enter the data, choose the STAT menu.
If there is already data in List 1, delete it by pressing F6 F4 F1.
Enter the scores in List 1, pressing EXE after each item.
Choose the CALC submenu by pressing F2.
Choose the SET submenu by pressing F6. Set the 1Var XList to List 1. Set the 1Var Freq to 1.
Press EXIT and choose 1-Var by pressing F1.
Use the cursor keys to move down the list.

The mean is about 5.8, median 6, range (maxX − minX) 7, interquartile range (Q3 − Q1) 3, standard deviation (xσn) about 1.95, and the mode is 5.

Texas Instruments TI-83

The TI-83 works in a similar way to the Casio.
Press the STAT key and choose the Edit menu.
If there is already data in a list, clear it by moving up to the heading and pressing CLEAR ENTER. Enter the scores in L1.
Press the STAT key and choose the CALC menu.
Choose 1-Var Stats, press ENTER and put in L1 by pressing 2nd 1. Use the cursor keys to move down the list.

The mean is about 5.8, median 6, range (maxX − minX) 7, interquartile range (Q3 − Q1) 3, standard deviation (σx) about 1.95, and the mode is not given.

Sharp EL-9650

The Sharp instructions are given on the CD-ROM.

All methods

Write the answers. Mean ≈ 5.8, mode = 5, median = 6, range = 7, interquartile range = 3, standard deviation ≈ 1.95
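The hand calculation for the 17 test scores can also be checked on a computer. The following is a minimal Python sketch, not part of the original text. It assumes Python 3.8 or later; the helper names `item_at` and `quartiles` are made up for this illustration, and the quartiles follow the rule given in this chapter (statistical packages often use slightly different quartile rules, so their answers may differ).

```python
from statistics import mean, median, multimode, pstdev

scores = [5, 8, 9, 5, 7, 3, 6, 8, 6, 5, 4, 2, 9, 5, 6, 7, 4]


def item_at(ordered, position):
    """Return the item at a 1-based position; a position ending in .5
    is taken as the average of the two neighbouring items."""
    lower = int(position)
    if position == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2


def quartiles(data):
    """Q1 and Q3 by the chapter's rule: the (n + 1)/4 th item when n is odd,
    the (n + 2)/4 th item when n is even, with Q3 counted back from the last."""
    ordered = sorted(data)
    n = len(ordered)
    position = (n + 1) / 4 if n % 2 == 1 else (n + 2) / 4
    q1 = item_at(ordered, position)
    q3 = item_at(ordered[::-1], position)  # same position, counted from the end
    return q1, q3


q1, q3 = quartiles(scores)
summary = {
    "mean": mean(scores),
    "mode": multimode(scores),
    "median": median(scores),
    "range": max(scores) - min(scores),
    "interquartile range": q3 - q1,
    "standard deviation": pstdev(scores),  # population SD, divides by n
}
for name, value in summary.items():
    print(name, value)
```

`pstdev` is used rather than `stdev` because the formula in this section divides by n, which is the population form of the standard deviation.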

Example 2

For the following masses, find the:
a mean   b mode   c median
d range   e interquartile range   f standard deviation.

Mass (kg)    45–49   50–54   55–59   60–64   65–69   70–74   75–79
Frequency      3       5       8      12      10       7       5

Solution

Redraw the table with true class limits, class midpoint (x), frequency (f), cumulative frequency, fx (frequency × score), x² and fx² (frequency × score²) columns. Use the class midpoint in calculations for the score.

True class limits (kg)    x     f    Cumulative frequency    fx     x²      fx²
44.5–49.5                 47    3     3                      141    2209    6627
49.5–54.5                 52    5     8                      260    2704    13 520
54.5–59.5                 57    8    16                      456    3249    25 992
59.5–64.5                 62   12    28                      744    3844    46 128
64.5–69.5                 67   10    38                      670    4489    44 890
69.5–74.5                 72    7    45                      504    5184    36 288
74.5–79.5                 77    5    50                      385    5929    29 645
Totals: Σf = 50, Σfx = 3160, Σfx² = 203 090

a Write the formula for the mean. $\bar{x} = \frac{\sum fx}{\sum f}$
Substitute. $\bar{x} = \frac{3160}{50}$
Calculate the mean. Mean = 63.2 kg

b The highest frequency is 12. The modal class is 60–64 kg.

c The median is the $\frac{n+1}{2}$ = 25.5th score. The 25.5th score is in the 59.5–64.5 class.
Use interpolation to find the median. It is the (25.5 − 16) = 9.5th score out of 12.
Calculate the median using the class width of 5. Median = 59.5 + $\frac{9.5}{12}$ × 5 ≈ 63.5 kg

d The lowest possible score is 44.5 and the highest is 79.5, so use these for the range. Range = 79.5 − 44.5 = 35 kg

e The first quartile is the $\frac{n+2}{4}$ = 13th score, as there are an even number of scores.
The 13th score is in the 54.5–59.5 class. It is the (13 − 8) = 5th score out of 8.
Use interpolation to find Q1. Q1 = 54.5 + $\frac{5}{8}$ × 5 = 57.625 kg
The third quartile is the 13th-last or (50 − 12) = 38th data item.
The 38th score is in the 64.5–69.5 class. It is at the end of the class.
Write Q3. Q3 = 69.5 kg
Subtract. Interquartile range = Q3 − Q1 = 69.5 − 57.625 ≈ 11.9 kg

f Write the formula for the SD. $\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$
Substitute. $\sigma = \sqrt{\frac{203\,090}{50} - (63.2)^2}$
Calculate the SD. Standard deviation ≈ 8.22 kg

Graphics calculator instructions

The summary statistics can also be worked out by entering the data in List 1 and List 2 of a graphics calculator, but only the mean and standard deviation are given correctly.

Casio CFX-9850GB PLUS

To enter the data, choose the STAT menu. If there is already data in List 1, delete it by pressing F6 F4 F1.
Enter the class centres into List 1, and the frequencies into List 2, pressing EXE after each item and using the cursor arrows to move between the lists.
Choose the CALC submenu by pressing F2.
Choose the SET submenu by pressing F6.
Set the 1Var XList to List 1. Set the 1Var Freq to List 2.
Press EXIT and choose 1-Var by pressing F1.
Use the cursor keys to move down the list.

The mean is 63.2, median is given as 62, range (maxX − minX) is given as 30, interquartile range (Q3 − Q1) is given as 10, standard deviation (xσn) is about 8.22 and mode is given as 62. The use of class midpoints means that only the mean and standard deviation are reliable.

Texas Instruments TI-83

Press the STAT key and choose the Edit menu. If there is already data in a list, clear it by moving up to the heading and pressing CLEAR ENTER.
Enter the class centres into L1 and the frequencies into L2, pressing ENTER after each item and using the cursor arrows to move between the lists.
Press the STAT button and choose the CALC menu.
Choose 1-Var Stats, press ENTER and put in L1, L2 by pressing 2nd 1 , 2nd 2.
Use the cursor keys to move down the list.

The mean is 63.2, median is given as 62, range (maxX − minX) is given as 30, interquartile range (Q3 − Q1) is given as 10, standard deviation (σx) is about 8.22 and mode is not given. The use of class midpoints means that only the mean and standard deviation are reliable.

Sharp EL-9650

The Sharp instructions are given on the CD-ROM.

All methods

Write the answers. Mean = 63.2 kg, modal class = 60–64 kg, median ≈ 63.5 kg, range = 35 kg, interquartile range ≈ 11.9 kg, standard deviation ≈ 8.22 kg

A cumulative frequency polygon or ogive can be used to find the median, quartiles and other percentiles. A polygon has straight lines connecting points but an ogive has a smooth curve. You should remember from Year 11 that cumulative frequency polygons and ogives have points placed at the upper ends of class intervals.
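The grouped-data calculations for the table of masses can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the original text: the function names `grouped_mean_sd` and `interpolate` are invented here, the mean and standard deviation use class midpoints, and the median and quartiles use the interpolation formula L + (m/f) × w with true class limits, as described in this section.

```python
from math import sqrt

# (lower true limit, upper true limit, frequency) for each class of masses in kg
classes = [(44.5, 49.5, 3), (49.5, 54.5, 5), (54.5, 59.5, 8), (59.5, 64.5, 12),
           (64.5, 69.5, 10), (69.5, 74.5, 7), (74.5, 79.5, 5)]


def grouped_mean_sd(classes):
    """Mean and standard deviation from class midpoints and frequencies."""
    total_f = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)
    mean = sum_fx / total_f
    return mean, sqrt(sum_fx2 / total_f - mean ** 2)


def interpolate(classes, position):
    """Value of the item at a 1-based position, using L + (m / f) * w."""
    below = 0  # cumulative frequency of all earlier classes
    for lo, hi, f in classes:
        if position <= below + f:
            m = position - below
            return lo + m / f * (hi - lo)
        below += f
    raise ValueError("position is beyond the total frequency")


n = sum(f for _, _, f in classes)
mean_mass, sd_mass = grouped_mean_sd(classes)
median_mass = interpolate(classes, (n + 1) / 2)   # 25.5th score
q1_mass = interpolate(classes, (n + 2) / 4)       # 13th score, n is even
q3_mass = interpolate(classes, n - 12)            # 13th-last = 38th score
print(mean_mass, sd_mass, median_mass, q1_mass, q3_mass)
```

The position of Q3 is written as `n - 12` only because this data set has 50 items; a general version would derive it from the Q1 position.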

Example 3

Use an ogive to find the median and interquartile range of the following heights of Year 12 students.

Height (cm)    Number
145–149         1
150–154         1
155–159         3
160–164         6
165–169        11
170–174        13
175–179         4
180–184         1

Solution

Redraw the table with true class limits and cumulative frequencies.

Height (cm)    Number    Cumulative frequency
144.5–149.5     1         1
149.5–154.5     1         2
154.5–159.5     3         5
159.5–164.5     6        11
164.5–169.5    11        22
169.5–174.5    13        35
174.5–179.5     4        39
179.5–184.5     1        40

Draw the ogive, starting from 0 at 144.5.

[Figure: ogive titled "Year 12 heights". Horizontal axis: Height (cm), 140 to 190. Vertical axis: Cumulative frequency, 0 to 40.]

Find the median and quartiles.
For Q1: 25% of 40 = 10
For Q2: 50% of 40 = 20
For Q3: 75% of 40 = 30

From the graph: Q1 ≈ 163.5, Median ≈ 168.5, Q3 ≈ 172.5
So Q3 − Q1 = 9

Write the answers. The median is about 168.5 cm and the interquartile range is about 9 cm.

You should be able to complete the work in the following exercise as revision of Year 11 work.
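Reading a graph by eye can be imitated with straight-line interpolation along a cumulative frequency polygon. The sketch below is an illustration only; the function name `read_polygon` is made up for it. Because it follows a polygon rather than a smooth ogive, its readings differ slightly from the values read from the graph in the worked solution.

```python
# Points of the cumulative frequency polygon: upper class limits and
# cumulative frequencies, starting from 0 at 144.5
upper_limits = [144.5, 149.5, 154.5, 159.5, 164.5, 169.5, 174.5, 179.5, 184.5]
cumulative = [0, 1, 2, 5, 11, 22, 35, 39, 40]


def read_polygon(target):
    """Height at which the polygon reaches the target cumulative frequency."""
    for i in range(1, len(cumulative)):
        if target <= cumulative[i]:
            x0, x1 = upper_limits[i - 1], upper_limits[i]
            c0, c1 = cumulative[i - 1], cumulative[i]
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target is beyond the total frequency")


total = cumulative[-1]
q1_height = read_polygon(0.25 * total)
median_height = read_polygon(0.50 * total)
q3_height = read_polygon(0.75 * total)
print(round(q1_height, 1), round(median_height, 1), round(q3_height, 1))
print(round(q3_height - q1_height, 1))
```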

Exercise 4.1 Review of summary statistics

1 Find the mean, mode, median, range, interquartile range and standard deviation of each of the following sets of scores.
a 19, 19, 12, 19, 19, 14, 16, 20, 13, 16, 13, 18, 16, 19
b 46, 49, 53, 54, 48, 47, 44, 47, 59, 62, 61, 47, 56
c 6, 7, 9, 2, 6, 6, 9, 8, 5, 4, 2, 3, 7, 8, 6, 4, 7

2 Find the mean, mode, median, range, interquartile range and standard deviation of each of the following sets of scores.

a Score        7   8   9  10  11  12  13  14  15  16
  Frequency    1   4   5   6   8   6   3   3   2   2

b Score        0   1   2   3   4   5   6   7   8   9  10
  Frequency    3   7   4  11  12  12   7   5   2   2   1

c Score       26  27  28  29  30  31  32  33  34
  Frequency    6   7  11  17  22  24  19   9   5

3 Use a graphics calculator to find the mean, mode, median, range, interquartile range and standard deviation of each of the following sets of scores.
a 23, 19, 27, 25, 23, 24, 20, 21, 22, 23, 24, 20, 25, 23, 24, 20
b 47, 43, 44, 42, 46, 45, 44, 48, 44, 43, 42, 48, 44, 40, 41

c Score       22  23  24  25  26  27  28  29  30  31  32
  Frequency    2   4   6  10  12   8   6   5   2   1   3

d Score        4   5   6   7   8   9  10  11  12
  Frequency    2   0   5   8  12   9   6   4   1

4 Find the mean, modal class, median, range, interquartile range and standard deviation for each table below.

a Class       10–14   15–19   20–24   25–29   30–34
  Frequency     2       6       8      11       1

b Class       20–29   30–39   40–49   50–59   60–69   70–79
  Frequency     3       8      12      19       9       5

c Class       105–109   110–114   115–119   120–124   125–129
  Frequency     16        28        10        10         9

d Class       70–89   90–109   110–129   130–149   150–169   170–189
  Frequency     4       11       11        18        37        18

5 Use an ogive to find the median and interquartile range for each of the following data sets.

a Heights of some Year 11 students

Height (cm)    145–149   150–154   155–159   160–164   165–169   170–174   175–179   180–184
Number            1         4         8        15        20        40         8         4

b Resting heart rate for a group of people

Pulse (beats/min)    45–49   50–54   55–59   60–64   65–69   70–74   75–79   80–84
Number                 3       4       8      15      20      35      10       3

Modelling and problem solving

6 The masses of eggs collected one day at a free-range farm were as follows.

Mass (g)    36–39   40–43   44–47   48–51   52–55   56–59   60–63   64–67
Number        3      24      48     115     220      55      40       8

a What is the average egg mass?
b What range of masses would you expect to include the middle 50% of eggs?

7 The numbers of days for which factory workers were absent over a year were as follows.

0 2 3 4 5 2 2 8 0 1 1 0 2 4 3 2 12 3 2 2 1 4 3 0 1 1 1 6 10 4 7 3 9 2

a For a worker chosen at random, what is the most probable number of days absent?
b What number of days absent should the manager use to estimate worker-hours lost in a year?

8 The reliability of popular cars up to 4 years old was surveyed in 2001 by a consumer organisation. Results for makes where over 400 cars were surveyed are given in the table below. What percentage of Australians with one of these cars could expect their car to break down in the next 12 months?

Percentage by make of newer cars breaking down over 12 months

Make               Honda   Mazda   Toyota   Subaru   Mitsubishi   Nissan   Ford   Holden
Number surveyed     433     440     1675     911        907         546    1027    1333

Source: Choice, September 2001

4.2 Populations and samples

Many variables involve the collection of data from very large groups. In this case it may not be practical to use the whole group. It may be more practical to collect information from only part of the group.

When we use a sample to find a statistic, we want the statistic to be as close as possible to the population parameter we are estimating. The sample needs to be chosen so that it is as representative of the population as we can make it. Very small samples will not usually be representative, as shown by the extreme case of a sample consisting of only one. The method of selection of the sample will also influence how representative the sample is.

Populations

For any variable or group of variables, the population is the whole group from which data could be collected.

In a census, data is collected from the whole population. A sample is a part of the population.

In a survey, data is collected from a sample.

A parameter is a clearly defined value about a particular population, such as the mean. A statistic is an estimate of a parameter obtained using a sample.

Example 4

Twenty people waiting in a queue at 7:30 am at an ATM were asked how much they intended to withdraw. The smallest amount was $20, the average amount was $78 and the greatest amount was $500.

Identify the population, parameters and statistics for this sample.

Solution

The population is the whole group that could be asked about the amount they withdraw from an ATM. The population is all the people who use ATMs.

Parameters are clearly defined values from the whole population. We don’t need to know the value to define it clearly. There are 3 parameters:
• the minimum withdrawal
• the average amount withdrawn
• the maximum withdrawal.

The statistics are the values obtained from the sample. We are more likely to know these values. The number of people (20) is not a statistic because it is not an estimate of a parameter. There are 3 statistics:
• The minimum withdrawal is $20.
• The average withdrawal is $78.
• The maximum withdrawal is $500.

If 4000 survey forms are sent out and only 1500 are returned, the sample is likely to be biased. This is an example of non-response bias, where the bias arises through a significant number of people not responding to a survey. There are three main methods of sampling used to minimise sampling bias.

Investigation: Sample size

Janita has a swimming pool but the filter has been broken for months. The pool is now green with algae, but it has attracted many frogs. She has been spotlighting in the pool area on rainy nights and has found and drawn many different frogs. You can work in groups to work out the average size of the frogs.

Work in groups of about four people. Your teacher will give you some paper models of the frogs to measure.

1 Work out the average length of a sample of 3 frogs.
2 Then use a sample of 6 frogs to recalculate the average length and work out the standard deviation.
3 Now use a sample of 10 frogs.
4 Compare your results with the results of other groups.
5 Does the average length change as the sample size is increased?
6 What does change as the sample size is increased?

Sampling methods

A systematic sample has items selected at regular intervals from a list. It is best to use a randomly prepared list and begin your selection from a randomly chosen starting point.

A random sample has items selected so that every item has an equal chance of being selected. Items may be selected by drawing lots ‘out of a hat’ or by using random numbers. Tables of random numbers are published for this purpose.

A stratified random sample is selected so that all identifiable groups within the population are represented in the sample in the ratio in which they appear in the population.

Example 5

A sample of 8 students must be selected from a class list of 26 to meet the local business community. The list is alphabetical and includes enrolment numbers from 100 to 125. Use systematic sampling to select the sample.

Solution

There are 26 students and we want 8. 26 ÷ 8 ≈ 3
We calculate the proportion we need. Choose every third student.
Start at a random place. Start at number 113.
The selections are then as shown. Notice that if we run out, we go back to the start.
113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 100, 101, 102, 103, 104, 105, 106, 107, 108
Write the selection. Choose students 102, 105, 108, 113, 116, 119, 122 and 125.
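The systematic selection from the class list can be sketched in a few lines of Python. This is an illustration only: the function name `systematic_sample` is made up here, and the step is taken as the list length divided by the sample size, rounded to the nearest whole number, as in the worked solution.

```python
def systematic_sample(items, sample_size, start_index):
    """Pick every k-th item from a list, wrapping back to the start if needed."""
    step = round(len(items) / sample_size)  # 26 / 8 is about 3
    return [items[(start_index + k * step) % len(items)]
            for k in range(sample_size)]


enrolment_numbers = list(range(100, 126))  # 100 to 125, 26 students
chosen = systematic_sample(enrolment_numbers, 8, enrolment_numbers.index(113))
print(sorted(chosen))
```

For other list lengths the wrap-around can land on a student who has already been chosen, so a fuller version would need to check for repeats.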

In Example 5, it is unlikely that two students from the same family would both be selected, because family members are usually next to each other in an alphabetical list and only every third student is chosen. This is a subtle bias that means that the sample is not completely random. Systematic sampling commonly uses lists that are prepared alphabetically, so twins will not be selected together, although other pairs can be. Random sampling avoids this problem.

Random samples can also be selected using spinners, dice or other aids, including computers and calculators. However, computers and calculators do not give truly random numbers. Their numbers are properly called pseudo-random, because they are produced by a fixed rule and the same sequence of numbers will be given again whenever the generator is restarted from the same point. In most cases (provided you don’t keep resetting the calculator), the pseudo-random numbers generated are good enough.
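The repeatable nature of pseudo-random numbers is easy to see on a computer. In this small Python sketch (an illustration, with the seed value 2020 chosen arbitrarily), two generators started from the same seed produce exactly the same numbers.

```python
import random

# Two generators started from the same seed (the seed value is arbitrary)
first_run = random.Random(2020)
second_run = random.Random(2020)

first_numbers = [first_run.randint(1, 50) for _ in range(5)]
second_numbers = [second_run.randint(1, 50) for _ in range(5)]

print(first_numbers)
print(second_numbers)
print(first_numbers == second_numbers)  # True: same seed, same sequence
```

This is the computer equivalent of resetting a calculator before each use, which is why resetting should be avoided when a fresh sample is wanted.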

Example 6

A group of 50 Mathematics A students at a school are to be surveyed regarding their career preferences. Use the two-digit random number table on page 112 to select a sample of 10.

Solution

First, assign each student a number from (say) 1 to 50.

Next, randomly select a starting position on a random number table. We have reproduced part of the table below and, using a pin, chosen 95 (row 9, column 25) as our starting point. Move along the table and ignore numbers greater than 50. Ignore 00 and repeated numbers, such as the second 11. Record the other numbers until you have selected 10 numbers.

37 49 95 38 08 68 70 32 88 65 89 70 13 93 24 40 05 41 34 72 49 10 50 78 95
04 92 03 87 51 08 13 11 48 36 98 73 32 94 11 01 78 95 19* 70 13 84 91 57 67
05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88
30 64 49 26 22 93 66 84 39 90 57 91 05 63 53 86 05 39 32 61 67 10 68 26 73

Write the selection. The sample is students 4, 3, 8, 13, 11, 48, 36, 32, 1 and 19.
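The rule for reading the table (skip numbers outside the range, skip repeats, stop at 10) can be written as a short Python sketch. It is an illustration only; the function name `select_from_table` is made up, and the string holds the table row that follows the starting point 95.

```python
# The row of the table that follows the starting point (95 at row 9, column 25)
row_after_start = ("04 92 03 87 51 08 13 11 48 36 98 73 32 "
                   "94 11 01 78 95 19 70 13 84 91 57 67")


def select_from_table(numbers, low, high, how_many):
    """Keep numbers within low..high, skipping repeats, until enough are found."""
    chosen = []
    for value in numbers:
        if low <= value <= high and value not in chosen:
            chosen.append(value)
        if len(chosen) == how_many:
            break
    return chosen


numbers = [int(pair) for pair in row_after_start.split()]
sample = select_from_table(numbers, 1, 50, 10)
print(sample)
```

Setting the lower bound to 1 also takes care of ignoring 00.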

Example 7

Use the two-digit random number table on page 112 to randomly select 6 numbers between 450 and 700.

Solution

It is usual to continue from the last place used in the random number table. This was marked with an asterisk in Example 6. Because we want three-digit numbers, take adjacent pairs of numbers and discard the last digit. Then 70 13 becomes 701, 84 91 becomes 849, etc.

04 92 03 87 51 08 13 11 48 36 98 73 32 94 11 01 78 95 19* 70 13 84 91 57 67
05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88
30 64 49* 26 22 93 66 84 39 90 57 91 05 63 53 86 05 39 32 61 67 10 68 26 73

Write the selection. The numbers are 576, 689, 631, 566, 682 and 644.
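The pairing rule for three-digit numbers can be checked with the following Python sketch. It is an illustration under one stated assumption: "between 450 and 700" is treated as excluding 450 and 700 themselves, which agrees with the selection written in the solution (the value 700 formed from 70 05 is passed over). The variable names are made up for this example.

```python
# Table entries that follow the asterisk (the last place used in Example 6)
entries_after_asterisk = (
    "70 13 84 91 57 67 "
    "05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88 "
    "30 64 49 26 22 93 66 84 39 90 57 91 05 63 53 86 05 39 32 61 67 10 68 26 73"
)

pairs = entries_after_asterisk.split()
# Join adjacent two-digit entries and discard the last digit: 70 13 -> 701
three_digit = [int(pairs[i] + pairs[i + 1][0])
               for i in range(0, len(pairs) - 1, 2)]

selection = []
for value in three_digit:
    # Assumption: the end values 450 and 700 are excluded
    if 450 < value < 700 and value not in selection:
        selection.append(value)
    if len(selection) == 6:
        break
print(selection)
```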

Random number generation program

You can write a simple program on your graphics calculator to generate a number of random whole numbers automatically. A random number generator program that can be typed in or loaded directly into your calculator is included on the CD-ROM.

[Table: 2-digit random number table]

Example 8

Use the random number generator on your graphics calculator to obtain a random number between 130 and 150.

Solution

Graphics (and scientific) calculators generate a pseudo-random number from 0 to 1. For numbers between 130 and 150, use the first three digits after the decimal point.

Casio CFX-9850GB PLUS

In the RUN mode, press OPTN F6 F3 F4 to obtain Ran#.
Press EXE to repeat until a number in the correct range is obtained.

Texas Instruments TI-83

In the MATH menu, select PRB 1 to choose rand.
Press ENTER to repeat until a number in the correct range is obtained.

Sharp EL-9650

The Sharp instructions are given on the CD-ROM.
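The same "repeat until the number is in range" method can be imitated in Python. This is a sketch, not the calculator's own algorithm: `random.random()` plays the part of Ran# or rand, the function name `three_digit_random` is made up, and "between 130 and 150" is assumed here to include both end values.

```python
import random


def three_digit_random(rng, low, high):
    """Repeat until the first three decimal digits form a number in range."""
    while True:
        # For 0.137..., this gives 137 (the first three digits after the point)
        value = int(rng.random() * 1000)
        if low <= value <= high:  # assumption: end values are allowed
            return value


rng = random.Random()
number = three_digit_random(rng, 130, 150)
print(number)
```

Only about 2% of the numbers generated fall in the range, so many repeats are usually needed, just as on the calculator.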

Example 9

A manufacturing firm employs 42 assembly workers, 10 office staff and 3 supervisors. It is decided that a stratified sample of 15 staff should be surveyed regarding the formulation of a policy on smoking. How many people from each group of workers should be included?

Solution

Get the total number. Number of employees = 42 + 10 + 3 = 55

Work out the fraction in each group.
Fraction of assembly workers = $\frac{42}{55}$
Fraction of office staff = $\frac{10}{55}$
Fraction of supervisors = $\frac{3}{55}$

Work out the number of each in the sample.
Assembly = $\frac{42}{55}$ × 15 ≈ 11.45
Office = $\frac{10}{55}$ × 15 ≈ 2.73
Supervisors = $\frac{3}{55}$ × 15 ≈ 0.82

You cannot have part people. The sample should include 11 assembly workers, 3 office staff and 1 supervisor.
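The stratified allocation for the manufacturing firm can be reproduced with a short Python sketch. It is an illustration only, with variable names chosen for this example; each group's share is simply rounded to the nearest whole person.

```python
groups = {"assembly workers": 42, "office staff": 10, "supervisors": 3}
sample_size = 15

total = sum(groups.values())  # 55 employees
# Exact share of the sample for each group, before rounding
exact = {name: count / total * sample_size for name, count in groups.items()}
# You cannot have part people, so round each share
allocation = {name: round(share) for name, share in exact.items()}

for name in groups:
    print(name, round(exact[name], 2), allocation[name])
print(sum(allocation.values()))
```

Rounding each group separately happens to give a total of 15 here. With other numbers the rounded values may add up to one more or one less than the sample size, and an adjustment would then be needed.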

It is not always easy to collect data to directly calculate a desired statistic to estimate a population parameter. Sometimes it is better to use a sample to find a related statistic that can be used in combination with other statistics to obtain the information wanted.

Example 10

Which statistic is easier to collect?

A: The proportion of full-time workers who are women
B: The proportion of women who are full-time workers

Solution

To find an unbiased sample of full-time workers might prove difficult, as people work in so many different industries and have such different work hours. However, it is relatively easy to obtain an unbiased sample of women by reference to the electoral rolls, so B is the easier (and thus cheaper) statistic to collect.

Exercise 4.2 Populations and samples

1 Identify the population, parameters and statistics for each of the following situations.
a Thirty people leaving a supermarket were asked how much they had spent. The smallest amount was $4.30, the average was $87.64 and the largest amount was $198.75.
b People with telephone numbers starting with (07) 3275 were rung at 8 pm one night to ask what TV channel they were watching. Twenty said Channel 9, 15 said Channel 7, 8 said Channel 10 and 7 said Channel 2.
c The cars crossing a bridge were noted by a traffic surveyor. In one 15-minute period there were 18 white cars, 9 yellow cars, 5 blue cars, 7 red cars and 8 other cars of various colours.
d A fast-food shop sold 75 hamburgers, 30 fish’n’chips, 15 salad rolls, 42 mini pizzas, 52 sausage rolls and 27 pies between 5 pm and 6 pm.
e The bicycles in a rack inside the entrance of a school were checked. From 30 bicycles, 20 had chromoly frames. There were more mountain bikes than other types.

2 Use systematic sampling to select a sample of 4 pies from a batch of 50, starting at number 17.

3 Start at row 3, column 21 of the two-digit random number table on page 112 and successively select 10 different numbers that are:
a 1- or 2-digit numbers between 1 and 50
b 4-digit numbers between 1000 and 2000
c 3-digit numbers between 200 and 700
d 5-digit numbers between 50 000 and 99 999.

4 Use stratified random sampling to state how many of each group should be chosen in each case to obtain the following samples.
a Sample of 10 from 120 Year 8s, 150 Year 9s, 130 Year 10s, 80 Year 11s and 90 Year 12s
b Sample of 15 from 28 sales reps, 8 managers, 35 clerical staff and 15 stores staff
c Sample of 8 from 40 surfers, 55 boogie-boarders and 30 sailboarders

5 Explain some of the difficulties that may occur when an attempt is made to take a census of a population.

6 Which statistic is easier to collect?
A: The proportion of smokers who die from lung cancer
B: The proportion of people dying from lung cancer who were smokers

7 Which statistic is easier to collect?
A: The proportion of butchers with lost fingers
B: The proportion of people with lost fingers who are butchers

8 Some female koalas are infected with chlamydia. Which statistic is easier to collect?
A: The proportion of female koalas without babies that have chlamydia
B: The proportion of female koalas with chlamydia that do not have babies

Modelling and problem solving

9 The figures in the table below show the Australian population in 2000. Use stratified random sampling to state how many should be chosen from each state and territory to make a sample of 500, if:
a males and females are selected in proportion
b persons are selected regardless of sex.

Australian population, March 2000

State or territory              Males        Females      Persons
New South Wales                 3 206 357    3 242 435    6 448 792
Victoria                        2 352 283    2 401 581    4 753 864
Queensland                      1 775 649    1 773 607    3 549 256
South Australia                   739 797      756 292    1 496 089
Western Australia                 945 331      932 203    1 877 534
Tasmania                          231 545      238 745      470 290
Northern Territory                103 003       91 456      194 459
Australian Capital Territory      154 924      156 162      311 086

Source: ABS

10 A TV channel recently ran a ‘telephone poll’ on the issue of capital punishment. The poll was conducted because of a particularly callous and brutal murder. Do you think that this poll could have been biased? How/why?

11 Each year, the students of Griffith University are surveyed to estimate how many have part-time jobs.
a What is the population?
b What is the parameter?

12 A manager wishes to select 10 items from a group of 150 that appear on an inventory. Use the table of random numbers on page 112 to select the 10 items for the manager. (Start at row 2, column 5.)

13 Many research companies use the telephone book to select samples to be surveyed.
a Discuss various methods of selecting a random sample of:
i 10 pages from the telephone book
ii 10 names from a particular page of the telephone book
iii 10 names from the telephone book
iv 10 Wilsons from the telephone book.
b Obtain a telephone book and select the four samples.

14 A medical research team wish to randomly select 6 people to test the possible side effects of an experimental drug. They have 40 volunteers seated in 5 rows with 8 chairs in each row.
a In order to select their sample, they number 5 cards from 1 to 5 and place the letters A to H on another 8 cards. The sets of cards are then shuffled separately and one card from each ‘deck’ is selected. In this way a selection of cards 2 C would indicate the person in the 2nd row sitting on the 3rd chair. The cards are then returned to the separate decks, which are reshuffled, and selections continue in this way until the required sample of 6 people has been chosen. Comment on the suitability of this selection method.
b If there were 28 people seated in the 40 chairs, comment on the suitability or otherwise of this method.

4.3 Outliers and their effects

Sometimes a set of data may include some scores that are so different from the group as to cause suspicion. Scores that are markedly different from the majority are called outliers. An outlier should be investigated to decide whether it should be left in the data set or discarded as erroneous.

Outliers

Items outside the following bounds are generally considered to be outliers:
• For discrete data: twice the interquartile range from the median.

In general, the removal of an outlier has more effect on the mean than on the median.

Example 11

The following data shows the yields reported by different home gardeners. Each person was asked how many tomatoes they picked from each tomato bush. Should any of the claims be checked?

30 25 35 32 33 28 31 24 27 54 27 35 27 24 22 31 29 32 30

Solution

The data is discrete, so the interquartile range should be used.
Arrange in order. 22, 24, 24, 25, 27, 27, 27, 28, 29, 30, 30, 31, 31, 32, 32, 33, 35, 35, 54
Median = 30, Interquartile range = 32 − 27 = 5
Reasonable bounds for the data are from 30 − 2 × 5 = 20 to 30 + 2 × 5 = 40.
The value 54 is an outlier and should be investigated.
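The outlier check for the tomato yields, and the effect of removing the outlier on the mean and the median, can be sketched in Python as follows. This is an illustration only: the helper names `item_at` and `quartiles` are made up, and the quartiles follow the rule given in section 4.1 rather than the rule built into any particular software package.

```python
from statistics import mean, median

yields = [30, 25, 35, 32, 33, 28, 31, 24, 27, 54,
          27, 35, 27, 24, 22, 31, 29, 32, 30]


def item_at(ordered, position):
    """Item at a 1-based position; a .5 position averages two neighbours."""
    lower = int(position)
    if position == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2


def quartiles(data):
    """Q1 and Q3 by the rule in section 4.1, with Q3 counted back from the last."""
    ordered = sorted(data)
    n = len(ordered)
    position = (n + 1) / 4 if n % 2 == 1 else (n + 2) / 4
    return item_at(ordered, position), item_at(ordered[::-1], position)


q1, q3 = quartiles(yields)
iqr = q3 - q1
centre = median(yields)
# Bounds for discrete data: twice the interquartile range from the median
lower, upper = centre - 2 * iqr, centre + 2 * iqr
outliers = [y for y in yields if y < lower or y > upper]
kept = [y for y in yields if lower <= y <= upper]

print(outliers)
print(round(mean(yields), 2), median(yields))  # with the outlier
print(round(mean(kept), 2), median(kept))      # outlier removed
```

For this data the mean falls by more than 1 when the outlier is removed, while the median falls by only 0.5.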

1 Find any outliers in the following sets of discrete data.
a 15, 18, 16, 17, 19, 20, 22, 14, 14, 15, 16, 18, 13, 25, 14, 15, 17
b 24, 25, 26, 23, 24, 26, 27, 29, 23, 25, 26, 27, 28, 24, 24, 25, 26, 25, 20, 26
c 7, 5, 6, 8, 4, 3, 5, 2, 7, 9, 2, 4, 5, 6, 3, 2, 7, 8, 4, 6, 3, 1, 9, 8, 4
d 45, 46, 45, 47, 51, 44, 46, 47, 44, 45, 46, 44, 45, 37, 44, 46, 43, 46, 49, 44, 46

2 Find any outliers in the following sets of continuous data.
a 3.8, 3.9, 3.7, 3.2, 3.6, 4.9, 4.1, 3.8, 3.6, 3.8, 3.7, 3.6, 3.8, 3.7, 3.7
b 171, 175, 165, 169, 176, 178, 166, 169, 172, 174, 197, 159, 167, 171, 177, 172
c 32.4, 35.4, 38.7, 33.5, 32.5, 30.1, 29.7, 22.1, 34.5, 37.5, 38.5, 34.7, 39.6, 37.4
d 7, 9, 11, 12.2, 14, 10, 8, 15, 17, 4.8, 11.6, 8.5, 9.3, 12.7, 13.5, 7.4

3 The following data was recorded for the numbers of siblings of some Year 12 students:

2 3 0 1 2 2 3 4 5 2 0 1 2 0 1 2 9 1 1 2 3 0 0 1 1 2 2 1

a Find any outliers.

b Calculate the mean and median of the whole set of data.

c Calculate the mean and median of the set of data with any outliers removed.

d Compare the results of parts b and c.

4 The following data was recorded for masses (in kg) of some Year 12 students:

55 60 72 80 75 81 73 58 44 61 55 68 70 124 75 83 78 67 59 69 75 88

a Find any outliers.

b Calculate the mean and median of the whole set of data.

c Calculate the mean and median of the set of data with any outliers removed.

d Compare the results of parts b and c.

4.4 Comparing sets of data

In Year 11, you saw how to construct back-to-back stemplots, boxplots and histograms. These can all be used to compare related sets of data. A boxplot shows the median, quartiles and extremes of the data and is sometimes called a five-number-summary of the data. Instead of using the median and interquartile range, we can use the mean and standard deviation to compare sets of data. This is actually the most common method, since they can both be calculated easily by computer and used to work out probabilities. The mean and standard deviation can also be used to compare individual scores from related sets of data.

Comparison of sets of data

For two related sets of data:

• The set with the higher mean (or median) is considered to be the higher group.

• The set with the higher standard deviation (or interquartile range) is considered to have a greater spread.

Individual scores from related sets of data may be compared in terms of standard deviations from the mean.
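
As a hedged sketch, this comparison can be written as a formula. The symbols are assumptions chosen for illustration: $x$ is the individual score, $\bar{x}$ is the mean of its set, $\sigma$ is the standard deviation of its set, and $z$ is the number of standard deviations the score lies from the mean.

```latex
z = \frac{x - \bar{x}}{\sigma}
```

A positive value means the score is above the mean and a negative value means it is below. When higher scores are better, the score with the larger value is the better result relative to its own set.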


The following scores were obtained by students from the same class in Maths and English:

English: 13 14 16 12 8 6 15 18 12 14 13 11 10 9 7 9 12 8 9 7 10 10 9 11 13

Maths: 5 2 9 7 9 12 8 9 7 10 10 9 11 18 11 14 16 17 8 6 20 18 12 4 6

Compare the results.

Solution

Find the mean and standard deviation of both sets of scores.

English: Mean = 11.04, σ ≈ 2.95

Maths: Mean = 10.32, σ ≈ 4.58

Compare the means. The class did better in English than in Maths.

Compare the standard deviations. Maths had a greater spread of scores.
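
The calculation in this solution can also be checked by computer. The following is a minimal sketch using Python's standard `statistics` module rather than a graphics calculator; the helper name `summarise` is hypothetical. It assumes the population standard deviation (dividing by the number of scores), which is the version that reproduces the values quoted above. The sample version would give slightly larger values.

```python
from statistics import mean, pstdev

# Scores from the worked example above
english = [13, 14, 16, 12, 8, 6, 15, 18, 12, 14, 13, 11, 10,
           9, 7, 9, 12, 8, 9, 7, 10, 10, 9, 11, 13]
maths = [5, 2, 9, 7, 9, 12, 8, 9, 7, 10, 10, 9, 11,
         18, 11, 14, 16, 17, 8, 6, 20, 18, 12, 4, 6]


def summarise(scores):
    """Return (mean, population standard deviation), each rounded to 2 d.p."""
    return round(mean(scores), 2), round(pstdev(scores), 2)


english_summary = summarise(english)
maths_summary = summarise(maths)

print("English: mean, SD =", english_summary)
print("Maths:   mean, SD =", maths_summary)

# The higher mean indicates the higher group; the higher SD, the greater spread.
higher_group = "English" if english_summary[0] > maths_summary[0] else "Maths"
greater_spread = "English" if english_summary[1] > maths_summary[1] else "Maths"
print("Higher mean:", higher_group)
print("Greater spread:", greater_spread)
```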

Example 12

In English, Aya got 28 out of 40 and in Maths she got 17 out of 25. The mean and standard deviation for English were 24 and 6 respectively, while for Maths they were 15 and 5. Compare Aya’s results in terms of the standard deviations.

Solution

For English, find the difference from the mean. For English, Aya was 28 − 24 = 4 above the mean.

Write as a fraction of the standard deviation. This is 4/6 ≈ 0.67 standard deviations.

For Maths, find the difference from the mean. For Maths, Aya was 17 − 15 = 2 above the mean.

Write as a fraction of the standard deviation. This is 2/5 = 0.4 standard deviations.

Write the conclusion. Aya did better in English than in Maths.

Write a reason. Her English score is more standard deviations above the mean.
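
The same steps can be sketched in Python. The function name `standard_deviations_from_mean` is hypothetical, and the sketch assumes that a higher score is the better result.

```python
def standard_deviations_from_mean(score, mean, sd):
    """Number of standard deviations that a score lies above the mean."""
    return (score - mean) / sd


# Aya's results: (score, class mean, class standard deviation)
english_z = standard_deviations_from_mean(28, 24, 6)
maths_z = standard_deviations_from_mean(17, 15, 5)

print("English:", round(english_z, 2), "standard deviations above the mean")
print("Maths:  ", round(maths_z, 2), "standard deviations above the mean")

# The subject with the larger value is the relatively better result.
better = "English" if english_z > maths_z else "Maths"
print("Relatively better subject:", better)
```

Dividing by the standard deviation puts both subjects on a common scale, so it does not matter that the tests were marked out of 40 and 25.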


Did you know?

When OP scores are calculated, statistical methods are used to compare the scores of all people eligible for an OP. This is done across Queensland for all subjects for all schools. Although the methods are more complicated than those shown here, the same principles are used. More complicated methods are necessary to take account of the fact that in the QCS test some students will ‘have a bad day’ or ‘have a good day’, among other considerations to make sure that the system is fair to all students.

Modelling and problem solving

1 The scores of some students doing both Physics and Chemistry were as follows:

Physics: 43 57 59 64 78 56 43 49 34 28 42 55 67 69 62 54

Chemistry: 51 56 62 68 84 59 44 29 26 48 67 78 75 84 58 65

Use the mean and standard deviation to compare the results in Physics and Chemistry for these students.

2 The relative humidities at 3 pm in two towns over a period of 2 weeks were:

Town A: 33 35 67 45 48 67 84 56 58 57 45 48 68 56

Town B: 45 48 67 78 79 84 65 58 43 59 69 89 78 69

Compare the results and suggest which town got more storms. (Hint: How does humidity relate to storms?)

3 The number of words in each sentence of some prose was counted as part of a readability test. The data was recorded as follows by a researcher:

20 36 14 11 20 19 27 21 14 25 18 17 45 22 17 14 15 20 21 18 17 16 13 23 27 10 18 23 31 40

A page of prose taken from another source had the following sentence lengths:

12 16 18 25 34 19 13 15 18 20 23 25 24 14 16

Compare the prose from the two sources.


4 Theo and Martina both sell insurance policies for the same company. The number of policies sold by each per month over an 11-month period is shown in the following stemplot. Comment on their relative sales records.

Insurance policies sold in a month

Theo Martina

1 0 7 9 7 4 3 1 8 8 9 7 6 5 1 1 2 1 1 1 2 3 5

3 0 4 4

5 Jessica submitted her last 15 golf score cards to the club handicapper so that her handicap could be reviewed. The scores were:

85 90 86 80 95 92 81 86 87 88 89 90 91 92 82

Jessica’s friend Kathy had the following results:

86 102 81 90 105 78 89 86 80 91 84 82 79 81 88

Compare their records.

6 Travelling to work, I have two alternative routes. Both routes pass through sets of traffic lights. Over 8 journeys by each route, I record the time (minutes) spent waiting at lights. The results are shown below:

Route A: 15 15 16 10 17 20 14 13

Route B: 13 16 15 15 13.5 14 16.5 17

Use statistical methods to recommend a route and give reasons.

7 Petrona got 15/40 for Film and TV and 32/50 for Speech and Drama. She said the Film and TV test was harder because only half the class got better than 10/40 but most of the Speech and Drama class passed. In fact, for the Film and TV test the mean was 9.8 and the standard deviation was 4.6, while for Speech and Drama the mean was 27.2 and the standard deviation was 7.4. Which subject did she do better in, compared with the rest of the class?

8 David is 198 cm tall and weighs 104 kg. Use the following information to determine whether David is thin or fat.

          Population mean    Standard deviation
Height    174 cm             18 cm
Mass      84 kg              13 kg

Chapter review

Communication and justification

1 Describe how the median is calculated for grouped continuous data by interpolation.

2 How does the interquartile range measure the spread of data?

3 Describe how to use an ogive to find the first quartile of a set of data.

4 What is the difference between a parameter and a statistic?

5 How is a biased sample different from a fair sample?

6 A local council is considering changing regulations to allow higher density housing in certain suburbs. What method of sampling should it use to conduct a survey? Justify your answer.

7 What is an outlier?

8 How can two sets of data be compared by summary statistics?

Knowledge and procedures

9 Find the mean, mode, median, range, interquartile range and standard deviation of the following data.

14 18 17 14 16 19 14 11 12 15 17 20 10 17 14 12

10 Find the mean, modal class, median, range, interquartile range and standard deviation of the following data.

Score        5–9    10–14    15–19    20–24    25–29    30–34
Frequency    3      8        24       20       18       10

11 Use a graphics calculator to find the summary statistics of the following data.

a 58, 60, 62, 56, 59, 61, 75, 43, 49, 50, 53, 57, 59, 64, 65, 65, 70, 51, 65, 53, 49

b

x    21–30    31–40    41–50    51–60    61–70    71–80
f    4        7        12       16       14       13

x    81–90    91–100    101–110    111–120    121–130
f    10       7         8          6          2

12 Use an ogive to find the median and interquartile range of the following data.

Score        10–14    15–19    20–24    25–29    30–34    35–39
Frequency    3        7        12       16       8        4

13 Identify the population, parameters and statistics when some people walking down Albert Street were interviewed about how often they bought takeaway food. Two said they didn’t buy takeaway food, 15 said once a week, 10 said twice a week and 8 said more than twice a week, the maximum being 6 times a week.


14 Use the two-digit random number table on page 112 to successively find the following. (Start at row 29, column 12.)

a Five 2-digit numbers between 10 and 80

b Seven 3-digit numbers between 200 and 400

c Nine 4-digit numbers between 3000 and 7000

Modelling and problem solving

15 A firm was interested in testing the advertising effectiveness of its new television commercial. As part of the test, the commercial was shown on a 6:30 pm local news programme in Mackay. Two days later, a market research firm conducted a telephone survey to obtain information on the percentage of viewers who recalled seeing the commercial (the recall rate) and their general impressions of the commercial.

a What was the population of the study?

b What was the sample of the study?

16 A survey about commercial fishing is conducted by phoning people chosen at random from a page selected at random from the telephone book. Discuss reasons why this survey is biased.

17 Which statistic is easier to collect?

A: The proportion of obese people who die from heart disease

B: The proportion of people who die from heart disease who are obese

18 The following data shows the weekly sales claimed by some door-to-door salespeople selling sets of reference books:

5 8 10 7 6 3 8 12 48 3 12 15 6 8 7 5 3 18 19 15 13 14 18 24

Determine whether any claims should be investigated as suspicious.

19 Andrew scored 68 for both English and Maths. The mean for English was 52 and the mean for Maths was 55. The standard deviations were 10 and 8 respectively. Determine the subject in which Andrew did better.

20 The following table shows the results of men and women on a test of language skills. Statistically compare the results.

Score        20–29    30–39    40–49    50–59    60–69    70–79    80–89    90–99
f (women)    2        5        8        24       53       43       30       5
f (men)      7        12       15       42       47       31       14       2

