Contents
2.1 Collecting statistical data
2.2 Organising and displaying information
2.3 Finding data positions and centres
2.4 Measuring spread
2.5 Analysing data
Chapter summary
Chapter review
Syllabus subject matter
Applied statistical analysis
■ Identification of variables and types of variables and data (continuous and discrete); practical aspects of collection and entry of data
■ Select and use in context appropriate graphical and tabular displays for different types of data including pie charts, bar charts, tables, histograms, stem-and-leaf plots and boxplots
■ Use of summary statistics including mean, median, standard deviation and interquartile distance as appropriate descriptors of features of data in context
■ Use of graphical displays and summary statistics in describing key features of data, particularly in comparing datasets and exploring possible relationships
Quantitative concepts and skills
■ Calculation and estimation with and without instruments
■ Plotting points using Cartesian coordinates
■ The summation notation
Chapter 2
Working with data
2.1
Collecting statistical data
The mathematical name for statistical information is data. Nowadays, data is used for both the
singular and plural. The data we could collect about people include their names, heights, weights, blood pressure, mother’s age, skin type, etc. These characteristics may vary from person
to person, so they are called variables. The basic types of variables are listed below.
Continuous variables are often recorded in discrete units. Although height is usually written to
the nearest centimetre, it is still a continuous variable. The nature of a variable affects the statistical analysis that may be performed.
For example, if people number the twelve ice-cream flavours in a shop from their favourite (1) to the one they like the least (12), it does not make sense to work out the average rating for chocolate because it is ordinal data.
Newspapers, radio and television reports, and advertisements often quote statistics to support claims. A common type of claim is ‘9 out of 10 nutritionists recommend …’. What does this really mean, and how can we check whether the claim is valid? A knowledge of basic statistics allows us to check claims such as this and to interpret data in a meaningful way.
Types of variables and data
Nominal, categorical or qualitative variables have values that are names rather than numbers.
Ordinal variables have an order, but no numeric measure.
Discrete variables are numbers that can have only specific values, such as whole numbers.
Continuous variables can take any values within the limits.
Discrete and continuous variables are sometimes called numeric or quantitative variables.
Example 1
Amanda has decided to buy a second-hand bicycle. She is interested in the make of bike, type of bike, condition, age, colour, price, length of frame, number of gears and lowest height of the seat. Classify each of the variables.
Solution
Names are nominal variables.
Nominal variables: make, type, condition, colour
Measured quantities must be continuous. Age is sometimes recorded discretely.
Continuous variables: frame length, age, lowest seat height
The price cannot include part of a cent.
Discrete variables: number of gears, price
People need access to statistics for all sorts of reasons. Government bodies need information to plan for future development. Businesses rely on statistics to analyse market trends and examine employee performance. Doctors use statistical data to help with diagnoses. The list goes on!
When an interest group wants information regarding an issue, it often contacts the Australian Bureau of Statistics (ABS) or a commercial statistical agency such as the Roy Morgan Research Centre (the Gallup Poll).
Work in small groups to discuss and then list the data that could be collected to help provide answers for the following problems. Specify the variables to be investigated.
■ How could a consumers’ organisation (say CHOICE) compare vacuum cleaner brands?
■ How can the performances of two basketball teams be compared?
■ Which political party is likely to win the next Federal election?
■ What is the relative popularity of the evening news presentations of television networks?
■ How many students will attend your school next year?
Investigation
Example 2
Martina and her partner, Andy, would like to establish a childcare centre in a suburban area. They want to choose an area that has:
■ a growing population
■ a reasonable number of families with young children
■ a reasonably high level of two-income families.
Martina and Andy decide on a ‘short list’ of areas and approach the ABS to seek some information. For what variables should they seek data?
Solution
To compare areas, they want to know about the number of people, the number of families with young children, the income distribution and how many other childcare centres are already established.
For each area, the variables they should get information for are:
■ population
■ percentage growth in population over the last 5 years
■ average income in the area
■ percentage of households with children under 5 years old
■ number of existing childcare centres in the area.
Many variables involve the collection of data from very large groups. In these cases it may not be practical to use the whole group. It may be more practical to collect information from only part of the group.
The ABS conducts a census of the Australian population every 5 years. Many facts collected are concerned with the same variable, but give values about different aspects of the variable.
Each of these values is called a parameter of the Australian population.
When we use a sample to find a particular statistic, we would like the statistic to be as close as possible to the population parameter we are estimating. For this to be the case, the sample needs to be chosen so that it is as representative of the population as we can make it. It is obvious that very small samples will not usually be representative, as is shown by the extreme case of a sample consisting of only one person or item. In addition, the method of selection of the sample will also influence its representativeness.
In Example 3, the time at which the sample was taken probably ensures it is not representative,
so we would say that it is biased.
For any variable or group of variables, the population is the whole group from which data could be collected.
In a census, data is collected from the whole population. A sample is a part of the population.
In a survey, data is collected from a sample.
A parameter is a clearly defined value of a particular population. A statistic is an estimate of a parameter obtained using a sample.
Example 3
Twenty people waiting in a queue at 7:30 am at an ATM were asked how much they intended to withdraw. The smallest amount was $20, the average amount $78 and the greatest amount $500. Identify the population, parameters and statistics for this sample.
Solution
The population is the whole group who could be asked about the amount they intend to withdraw from an ATM.
The population is all the people who use ATMs.
Parameters are clearly defined values from the whole population. We don’t need to know actual values to define parameters clearly.
There are three parameters:
– the minimum withdrawal
– the average amount withdrawn
– the maximum withdrawal.
Statistics are the actual values obtained from the sample. The number of people (20) is not a statistic because it is not an estimate of a parameter.
There are three statistics:
– The minimum withdrawal is $20.
– The average withdrawal is $78.
– The maximum withdrawal is $500.
If a sample is not selected fairly it will not be representative. A biased sample favours one portion of the population. Sample selection bias may be very subtle. If 4000 survey forms were sent out and only 1500 were returned, do you think that the sample would be representative? This would
be an unfair sample because of what is called non-response bias.
There are several methods of sampling that can be employed to overcome selection bias. Three of the most common are:
■ systematic sampling
■ random sampling
■ stratified random sampling.
In systematic sampling, a list is compiled and every nth item is selected. The value of n would
depend on the size of the sample required. It is best to use a randomly prepared list and better if you begin your selection from a randomly chosen starting point.
Investigation
Cabbage moths
Colin has a patch of cabbages, but the patch has attracted cabbage moths. He has used a large net to catch all the cabbage moths, so it is possible to work out the average size of the cabbage moths. Work in groups of about four people. Your teacher will give you some paper models of the cabbage moths to measure.
[Figure: wing span]
1 Work out the average wing span of a sample of 3 cabbage moths.
2 Use a sample of 6 cabbage moths to recalculate the average wing span.
3 Now use samples of 10 cabbage moths.
4 Compare your results with those of other groups.
■ Does the average wing span change as the sample size is increased?
■ What does change as the sample size is increased?
Example 4
A sample of 8 students must be selected from a class list of 26 to meet with the local business community. The list is alphabetical with enrolment numbers from 100 to 125.
Solution
There are 26 students and we want 8. Divide to find the repeat for selection.
26 ÷ 8 ≈ 3
Choose every third student.
Start at a random place.
Start at number 107.
The selections are shown in brackets. Notice that if we run out, we go back to the start.
[107], 108, 109, [110], 111, 112, [113], 114,
115, [116], 117, 118, [119], 120, 121, [122],
123, 124, [125], 100, 101, [102], 103, 104, …
Write the selection.
Choose students 102, 107, 110, 113, 116, 119, 122 and 125.
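The systematic selection in Example 4 can also be checked with a short program. The sketch below is only an illustration: the function name systematic_sample is made up for this purpose, and the starting number 107 is simply copied from the example rather than chosen at random.

```python
def systematic_sample(items, sample_size, start_index):
    """Pick every k-th item, going back to the start of the list if we run out."""
    step = round(len(items) / sample_size)  # 26 / 8 = 3.25, which rounds to 3
    picks = []
    for i in range(sample_size):
        position = (start_index + i * step) % len(items)  # wrap around the list
        picks.append(items[position])
    return picks


enrolment_numbers = list(range(100, 126))  # 100 to 125 inclusive: 26 students
start = enrolment_numbers.index(107)       # starting place used in Example 4
sample = systematic_sample(enrolment_numbers, 8, start)
print(sorted(sample))
```

The sorted output should match the eight students listed in the solution.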
In Example 4, it is unlikely that students from the same family will be selected because the list used is alphabetical. This is a subtle bias which means that the sample is not completely random. Systematic sampling commonly uses lists that are prepared alphabetically, so twins will not be selected, but other pairs can be selected. Random sampling avoids this problem.
In random sampling each element of the population has an equal chance of being selected. If we take Example 4 again, the 8 students required could be selected by writing the names of each of the 26 students on equally sized pieces of paper. The pieces could then be
placed in a container and 8 pieces selected without replacement. This is called drawing lots.
This method may be difficult with large samples.
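Drawing lots can be imitated on a computer. The sketch below assumes Python's standard random module and uses the enrolment numbers 100 to 125 from Example 4 to stand in for the pieces of paper; random.sample selects without replacement, so no student can be drawn twice.

```python
import random

students = list(range(100, 126))     # the 26 enrolment numbers from Example 4
chosen = random.sample(students, 8)  # draw 8 'pieces of paper' without replacement
print(sorted(chosen))
```

Each run will usually print a different group of 8 students.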
An alternative method to drawing items ‘out of a hat’ is to use random numbers. Tables of
random numbers have been produced especially for this purpose. One is given on the next page of this text. It is a 2-digit random number table and contains numbers from 00 to 99.
Example 5
A group of 50 Maths B students at a school are to be surveyed regarding their career preferences. Use random numbers to select a sample of 10 to be interviewed.
Solution
First, assign each student a number from (say) 1 to 50.
Next, randomly select a starting position on a random number table. Part of the 2-digit random number table on page 43 is reproduced below and, using a pin, 95 (at row 9, column 25) has been chosen as our starting point.
37 49 95 38 08 68 70 32 88 65 89 70 13 93 24 40 05 41 34 72 49 10 50 78 95
04 92 03 87 51 08 13 11 48 36 98 73 32 94 11 01 78 95 19* 70 13 84 91 57 67
05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88
Move along the table and ignore numbers greater than 50. Ignore 00 and repeated numbers, such as the second 11. Record the other numbers until you have selected 10 numbers.
The sample will be students 4, 3, 8, 13, 11, 48, 36, 32, 1 and 19.
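The rule used in Example 5 (skip numbers outside 1 to 50, skip repeats, stop after 10) can be written as a small program. This is a sketch only: the function name pick_from_table is invented for the illustration, and the numbers are the table entries that follow the starting point 95.

```python
def pick_from_table(table_numbers, lowest, highest, how_many):
    """Read along the table, keeping numbers in range that have not been used."""
    picks = []
    for n in table_numbers:
        if lowest <= n <= highest and n not in picks:
            picks.append(n)
        if len(picks) == how_many:
            break
    return picks


# Entries after the starting point (row 10 of the table, up to the asterisk)
row = "04 92 03 87 51 08 13 11 48 36 98 73 32 94 11 01 78 95 19"
numbers = [int(entry) for entry in row.split()]
sample = pick_from_table(numbers, 1, 50, 10)
print(sample)
```

Because 00 is below the lowest student number, the same range test removes it without a separate rule.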
Example 6
Use the 2-digit random number table to randomly select 6 numbers between 450 and 700.
Solution
It is usual to continue from the last place used in the random number table. This was marked with an asterisk in Example 5.
04 92 03 87 51 08 13 11 48 36 98 73 32 94 11 01 78 95 19* 70 13 84 91 57 67
05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88
30 64 49* 26 22 93 66 84 39 90 57 91 05 63 53 86 05 39 32 61 67 10 68 26 73
Because we want 3-digit numbers this time, take pairs of numbers and discard the last digit. Then 70 13 becomes 701, 84 91 becomes 849, etc.
The numbers would be 576, 689, 631, 566, 682 and 644.
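The pairing method in Example 6 can be sketched in the same way. The code below assumes that 'between 450 and 700' excludes 450 and 700 themselves, which agrees with the worked answer (the pair 70 05 gives 700 and is not used). The function name three_digit_numbers is made up for the illustration.

```python
# Table entries from just after the first asterisk to the second asterisk
entries = (
    "70 13 84 91 57 67 "
    "05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88 "
    "30 64 49"
).split()


def three_digit_numbers(two_digit_entries):
    """Join the entries in pairs and discard the last digit of each pair."""
    result = []
    for i in range(0, len(two_digit_entries) - 1, 2):
        four_digits = two_digit_entries[i] + two_digit_entries[i + 1]
        result.append(int(four_digits[:3]))
    return result


candidates = three_digit_numbers(entries)
# Assumption: both end values are excluded, as in the worked answer
selected = [n for n in candidates if 450 < n < 700][:6]
print(selected)
```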
Random samples can also be selected using spinners, dice or other mechanical aids, including computers and calculators. However, computers and calculators do not give truly random
numbers. They are properly called pseudo-random, because they will give the same group of
numbers over and over again. In most cases (provided you don’t keep resetting the calculator), the pseudo-random numbers generated are good enough.
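The repeating behaviour of pseudo-random numbers is easy to see on a computer. The sketch below uses Python's random module; the seed value 2024 is an arbitrary choice for the illustration. Resetting the generator to the same seed plays the same numbers again, which is why resetting a calculator repeatedly is a bad idea.

```python
import random

random.seed(2024)  # 'reset' the generator (2024 is an arbitrary seed)
first_run = [random.randint(0, 99) for _ in range(5)]

random.seed(2024)  # reset again with the same seed
second_run = [random.randint(0, 99) for _ in range(5)]

print(first_run)
print(second_run)
print(first_run == second_run)  # the two runs are identical
```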
2-digit random number table
49 43 74 58 18 76 85 23 53 04 22 74 46 11 68 63 23 75 15 22 69 96 01 44 83
34 73 96 52 84 77 44 53 71 72 17 80 68 70 89 97 42 96 85 11 27 85 12 76 16
87 22 21 83 13 92 96 27 94 30 56 45 75 13 57 23 25 38 84 60 78 09 05 14 95
48 43 67 58 18 82 26 78 91 90 41 37 26 75 50 53 76 54 49 68 22 65 88 61 35
83 94 87 49 80 09 46 79 96 70 65 47 43 95 15 45 32 10 52 12 22 92 78 47 00
36 34 38 79 59 07 24 30 09 77 73 25 20 78 94 22 78 84 66 62 14 27 99 97 32
36 16 40 44 98 60 45 49 50 26 04 24 92 22 77 46 25 48 43 36 18 08 75 30 93
39 43 74 42 69 40 45 67 03 34 97 41 30 44 27 96 50 42 76 44 43 79 47 10 40
37 49 95 38 08 68 70 32 88 65 89 70 13 93 24 40 05 41 34 72 49 10 50 78 95
04 92 03 87 51 08 13 11 48 36 98 73 32 94 11 01 78 95 19 70 13 84 91 57 67
05 04 13 40 88 75 68 99 63 19 56 69 99 33 68 24 70 05 25 64 42 41 85 04 88
30 64 49 26 22 93 66 84 39 90 57 91 05 63 53 86 05 39 32 61 67 10 68 26 73
16 02 93 88 42 32 97 19 48 39 27 00 17 29 98 95 33 02 15 35 84 54 88 77 88
90 72 79 41 71 30 19 99 89 25 18 77 55 49 03 75 26 66 89 31 45 75 85 95 16
99 31 34 95 97 50 56 14 09 36 63 23 12 58 28 64 16 96 92 62 73 96 99 48 21
55 38 06 44 27 29 38 61 58 15 66 43 42 97 45 51 03 81 16 99 06 55 69 43 88
90 44 98 14 71 48 71 64 42 02 32 35 88 10 99 82 31 57 53 66 65 23 77 45 97
38 20 79 34 43 86 60 75 91 50 73 46 65 92 81 46 24 67 26 25 73 91 68 85 58
55 39 11 90 64 29 14 82 02 91 17 28 23 76 50 95 58 89 77 40 64 72 53 62 78
14 93 66 16 82 81 62 77 76 89 84 90 19 36 37 74 60 59 65 30 04 17 45 51 77
16 14 64 03 65 07 93 16 46 30 51 89 20 57 47 93 53 48 33 82 65 95 64 99 12
93 87 08 79 34 56 52 18 28 71 40 87 50 40 47 37 07 74 56 07 36 94 06 55 22
73 84 05 46 00 26 28 53 72 54 00 41 50 02 31 34 22 62 47 21 81 48 23 18 37
65 47 45 88 45 30 48 46 86 88 59 80 11 55 63 87 63 07 05 85 16 29 18 37 16
91 70 92 43 59 43 60 00 29 36 04 30 80 87 61 51 63 29 89 23 02 67 35 37 34
74 30 27 61 24 12 11 37 38 71 54 72 66 17 58 67 84 74 75 73 23 08 46 72 20
63 66 83 95 36 74 00 52 74 63 44 03 90 26 51 27 11 12 75 38 48 18 27 91 90
09 24 10 27 49 82 34 92 67 13 00 33 50 70 40 09 52 92 14 11 37 98 41 02 39
86 13 95 56 74 62 42 60 49 01 59 38 58 74 67 55 09 35 65 45 10 44 26 23 22
99 15 24 44 83 34 69 20 45 98 52 74 99 21 57 63 49 45 92 65 44 79 58 85 52
73 37 86 65 14 02 49 20 42 64 55 56 99 27 93 87 97 83 92 05 17 26 82 07 24
88 71 46 18 04 58 06 99 02 64 07 65 19 89 29 34 36 96 63 86 42 19 91 33 28
95 84 21 98 21 29 39 05 39 83 75 20 42 49 81 21 08 73 69 02 57 77 31 17 33
07 86 42 16 11 48 65 48 38 23 99 97 67 29 20 36 60 41 36 54 34 62 70 85 66
17 78 69 19 74 86 32 92 90 56 97 00 80 72 16 70 14 57 24 75 35 13 28 78 06
79 10 14 82 29 61 89 99 76 40 06 03 21 58 20 89 10 80 10 72 26 65 85 87 20
19 06 97 60 80 42 61 78 24 44 80 74 00 46 07 24 87 93 60 01 76 95 73 35 79
96 63 07 61 84 26 60 97 13 57 41 57 54 97 60 12 74 99 84 69 38 29 50 25 33
83 02 54 19 32 57 11 10 94 92 69 71 28 61 73 48 55 33 25 17 03 53 91 60 78
38 33 71 59 83 69 78 15 04 55 96 59 04 90 06 07 59 55 50 82 32 64 66 32 14
To generate integer values, there is a slight difficulty because only some calculators can return exactly 1, although all of them can return 0. However, taking the integer part of a number generated using the method shown in Example 7 will suffice for most purposes.
You can write a simple program on your graphics calculator to generate a number of random numbers. A random number generator program RANINT is included on the CD-ROM. It can be typed in or loaded directly into your calculator if you have the link cable and software.
In order to select a sample that has fair representation from various groups making up the population, stratified random sampling can be used. With this method, all identifiable groups
(strata) within the population are represented in the sample by the correct fraction.
Example 7
Use the random number generator on your graphics calculator to obtain a random number between 130 and 150.
Solution
Graphics (and scientific) calculators generate a pseudo-random number from 0 to 1.
To change a number r from the range 0 ≤ r ≤ 1 to the domain 130 ≤ x ≤ 150, we need to make the spread 20 instead of 1, and to start from 130 instead of 0. The formula x = 20 × r + 130 will accomplish this task.
Casio fx-9860G AU
In the RUN mode, press OPTN F6 F3 F4 to obtain Ran#.
Then press × 20 + 130 EXE.
Texas Instruments TI-84
In the MATH menu, select PRB and 1 to choose rand.
Then press × 20 + 130 ENTER.
Sharp EL-9900
See the instructions given on the CD-ROM.
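The formula in Example 7 works the same way on a computer. The sketch below assumes Python's random.random(), which returns a number r with 0 ≤ r < 1 (it can return 0 but not 1); the function name random_between is made up for the illustration.

```python
import random


def random_between(low, high):
    """Stretch a random number from 0 to 1 so that it lies from low to high."""
    r = random.random()            # 0 <= r < 1
    return (high - low) * r + low  # for 130 to 150 this is 20 * r + 130


x = random_between(130, 150)
whole = int(x)  # integer part, as suggested for generating whole numbers
print(x, whole)
```

Taking the integer part here gives whole numbers from 130 to 149, so a spread of 21 would be needed if 150 should also be possible.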
Surveys are the most common method of collection of data. Most surveys are conducted by
means of a questionnaire. The questionnaire can be given by an interviewer, by telephone, by
post or by delivery. It can also be completed by an observer making notes. However the data is collected, it is important to design the questionnaire carefully to collect the desired information.
Example 8
A manufacturing firm employs 42 assembly workers, 10 office staff and 3 supervisors. It is decided that a stratified sample of 15 staff should be surveyed regarding the formulation of a policy on smoking. How many people from each group of workers should be included?
Solution
Get the total number.
Number of employees = 42 + 10 + 3 = 55
Work out the fraction in each group of the population.
Fraction of assembly workers = 42/55
Fraction of office staff = 10/55
Fraction of supervisors = 3/55
Work out the number of each group in the sample.
Assembly = 42/55 × 15 ≈ 11.45
Office = 10/55 × 15 ≈ 2.73
Supervisors = 3/55 × 15 ≈ 0.82
You cannot have part people.
The sample should include 11 assembly workers, 3 office staff and 1 supervisor.
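The calculation in Example 8 can be repeated for any set of strata. The sketch below uses the staff numbers from the example and rounds each share to the nearest whole person; the variable names are chosen only for this illustration. With other data, the rounded numbers may not add up to the required sample size, so the total should always be checked.

```python
groups = {"assembly workers": 42, "office staff": 10, "supervisors": 3}
sample_size = 15

total = sum(groups.values())  # 55 employees altogether

# Exact share of the sample for each group, e.g. 42/55 * 15 for assembly workers
exact = {name: count / total * sample_size for name, count in groups.items()}

# You cannot have part people, so round each share to the nearest whole number
allocation = {name: round(share) for name, share in exact.items()}

for name in groups:
    print(name, round(exact[name], 2), allocation[name])
print("Total selected:", sum(allocation.values()))
```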
Questionnaire design
■ The questionnaire should start with a statement regarding the purpose of the survey.
■ All questions should be numbered.
■ Questions should be worded in a clear, unbiased and unambiguous fashion.
■ Questions should be able to be answered concisely.
■ Questions can be open-ended or have answers provided that can be ticked, circled or numbered.
■ Very few open-ended questions should be incorporated.
■ Categories should be used for sensitive information such as income and age.
■ Don’t make the survey too long.
■ A small pilot survey may be used to refine the questionnaire.
■ You should be clear how statistics will be derived from the responses.
Example 9
Design a short questionnaire to find whether some students would prefer to start and finish school earlier.
Solution
Start with purpose. Give clear instructions.
This survey asks about school start and finish times. Please circle the response that you most agree with.
Check respondent.
1 What year are you in?
8 9 10 11 12
2 Are you male or female?
M F
Key question.
3 Would you like to start and finish school earlier?
Y N
Clear instruction.
If you answered No to question 3, stop now.
Follow-up question.
4 How much earlier would you prefer?
½ 1 1½ 2 hours
Example 10
The following questionnaire was designed for an interviewer to ask householders about their insurance needs. Critically examine the questionnaire.
1 Greet householder. If a child comes to the door, ask for Daddy.
2 Say ‘I’m conducting a market research survey about insurance.’
3 Ask ‘Have you got house and contents insurance?’
4 Ask ‘What risks are you covered for?’
5 Ask ‘What insurance do you have on your car?’
6 Ask ‘Do you have suitable life insurance?’
7 Say ‘Thank-you for your time.’
Response sheet: Please complete with householder’s responses.
1 Street and house number
2 House and contents insurance Y N
3 Risks (list)
4 Car insurance
5 Life insurance
Solution
■ This questionnaire has the advantage that the interviewer is given specific direction about questions, so the way the questions are asked should be the same for all the respondents.
■ The instruction to ‘ask for Daddy’ indicates bias.
■ The questionnaire guide numbers do not correspond to response sheet numbers.
■ The question about risks is vague and difficult to write responses for. It is also difficult to see how this could be compiled to a result.
■ The questions about car insurance and life insurance are also vague. The car insurance question does not allow for cases where the person has no car or more than one car.
■ There is no provision to show the day and time on the response sheet. Responses obtained at 11 am on Monday could be very different from those obtained at 4 pm on Saturday.
Exercise 2.1
Collecting statistical data
1 In each case below, identify the variable and classify it as nominal, discrete or continuous.
a Carol has 43 DVDs.
b Juanita is 172 cm tall.
c Le is Vietnamese-Australian.
d Stacey lives in a house with 9 rooms.
e The Murray River is 2530 km long.
f Sven has blue eyes.
g Tom has a diastolic blood pressure of 138 mm of mercury.
h The National Gallery has a collection of 17 480 paintings.
i K is the symbol for potassium.
j Sahan got 93.3% on his test to get his Learner’s driving licence.
2 Identify the population, parameters and statistics for each of the following situations.
a 35 people entering a cinema were asked how much they had spent at the snack and sweets counter. The smallest amount was $6.50, the largest $17.95, with an average of $12.43.
b People with names starting with ‘D’ on the electoral roll in Gympie were sent a survey about takeaway food consumption. Of the 123 replies, 47 said they had used McDonalds, 35 Red Rooster, 38 KFC, 19 Brodies Chicken, 48 Dominoes Pizza and 73 other outlets.
c Students at a busway station were asked what kind of calculator they used: 36 said Sharp, 42 said TI, 39 said Casio, 14 said HP and 17 gave other brand names.
d The kinds of vehicles going along the Ipswich Motorway were observed. In a 30-minute period, 120 cars, 40 semi-trailers, 60 smaller trucks, 18 motorbikes, 29 taxis, 32 buses, 17 utes and 32 vans were observed.
e The price of unleaded petrol was checked each day over a month at the service stations in and around Gladstone. Prices varied from 116.9 cents/L to 131.4 cents/L, with a median price of 124.8 cents/L.
3 Why is it important to try to eliminate sources of bias when selecting a sample to be surveyed?
4 Use systematic sampling to select 10 students from the enrolments at a school. Start from
enrolment number 5130 of the current enrolments, which run from 4928 to 5672 inclusive.
5 Start at row 8, column 17 of the 2-digit random number table on page 43 and select
8 different numbers that are:
a 2-digit numbers between 20 and 99
b 6-digit numbers between 100 000 and 400 000
c 1-digit numbers
d 2-digit numbers between 30 and 84.
6 Use stratified sampling to state how many of each group should be chosen from the
following to obtain the specified sample.
a A sample of 10 swimsuit-wearers from 15 men in board shorts, 10 men in briefs,
25 women in bikinis and 7 women in one-pieces
b A sample of 20 chocolates from 180 soft-centred, 140 hard-centred, 85 liquid-centred and
108 nutty-centred chocolates
c A sample of 16 from 5 fifteen-year-olds, 35 sixteen-year-olds, 10 seventeen-year-olds and
3 eighteen-year-olds
Modelling and problem solving
7 a Why do you think Martina and Andy want to set up a childcare centre in the type of area
described in Example 2 (page 39)?
b What significance would the level of two-income families have for Martina and Andy’s
considerations?
c What other information may help Andy and Martina to help make their decision?
d Other than using the services of the ABS, how else could Andy and Martina collect the
information they require?
8 Tam and Melody plan to open a fast-food outlet. They realise that there are lots of other fast-food outlets around, so they will need to choose the location for their shop carefully. The area they choose will need to:
■ have very few fast-food centres nearby
■ have reasonable activity in surrounding shops
■ be well serviced by public transport.
a Why are the characteristics outlined by Tam and Melody important to their choice of
location?
b If Tam and Melody approach the ABS, what information should they request?
c How else could Tam and Melody collect the information that they require?
d What other variables could help Melody and Tam reach a decision?
9 Lyn is the operations manager for an insurance firm that specialises in insurance for new
buildings. Their small sales team moves from one location to another in Queensland to promote and ‘sell’ the insurance cover offered by the firm. What variables should she check with the ABS to help maximise the effectiveness of the efforts of the sales team?
10 Katja intends to set up a ‘Communications’ shop selling mobile phones, internet phones,
11 Your school resource centre wants to install a security system to prevent the loss of material from the library. This will be expensive, so certain improvements to the school grounds will need to be cancelled if the security system goes ahead. You decide to conduct a poll at school to decide whether or not most people want the security system installed.
a State the population involved.
b State the parameter(s) involved.
c Here are some proposed methods for collecting a sample. Select the one that you think is
fairest and state why you rejected the others.
i Ask everyone in your class.
ii Interview the first 80 students who walk into the resource centre.
iii Ask all students in your year level.
iv Announce that there will be a poll on parade and ask the first 70 students who volunteer
for the survey.
v Obtain an alphabetical list of all the students and ask every 20th student on the list.
vi Leave a ‘nomination sheet’ in the resource centre and interview only the people who
write their names on it.
vii Ask 5 students from every form class in the school.
viii Ask 1 in every 15 students from each year level in the school.
ix Call a meeting of interested students and interview those who attend.
x Wait at the school entrance and ask the first 100 students who arrive after 7:30 am.
12 Biotech Industries Pty Ltd wishes to form a staff social committee consisting of 18 members. The firm decides to use the method of stratified random sampling for selecting the committee members. The employment details of the firm are given in the table.
           Administrative staff   Factory workers
Males      11                     73
a How many from each group should be selected to represent all groups fairly?
b How many from each group should be selected if no distinction is made between males and females?
13 The table below shows the Australian population in 2007. Use stratified random sampling to determine how many should be chosen from each state to make a sample of 500:
a if persons are selected regardless of sex
b if males and females are selected in proportion.
Australian population, March 2007
State                          Males         Females       Persons
New South Wales                3 407 098     3 468 627     6 875 725
Victoria                       2 568 063     2 620 073     5 188 136
Queensland                     2 078 376     2 083 642     4 162 018
South Australia                  781 103       800 304     1 581 407
Western Australia              1 058 490     1 036 059     2 094 549
Tasmania                         242 998       249 743       492 741
Northern Territory               111 000       102 824       213 824
Australian Capital Territory     167 558       170 602       338 160
Australia                     10 415 994    10 532 954    20 948 948
Source: ABS
14 A TV travel show ran an SMS poll to choose whether one of the presenters would visit Bali, Singapore or Nepal in an upcoming episode. The poll was conducted the week after an episode that featured Phuket. Do you think this poll could be biased? Justify your answer.
15 Many schools survey students who finished Year 12 at the end of the previous year to find
what they did on leaving school.
a What is the population? b What is the parameter?
16 A quality control department decides to check a random sample of 10 items from the
210 items produced in one day. Use the table of random numbers on page 43 to select 10 items (start at row 3, column 4).
17 Many research companies use the electoral rolls (available at/from offices of the Australian
Electoral Commission: see aec.gov.au) to choose samples of people to be surveyed. The electoral roll for a Federal electorate normally has about 85 000 people on the roll and includes their addresses. There are many pages in each electoral roll. Suggest three different ways in which the electoral rolls could be used to select samples, and give at least one advantage and disadvantage of each method.
18 Design suitable survey questions to determine:
a the most popular TV station in your school
b what people in your suburb/town are planning to do in their holidays
c the popularity of different restaurants in your suburb/town
d the occupations of parents of students at your school
e how much part-time work students at your school do
f how often people in your street eat fast-food
g the most popular current affairs show on TV
h the favourite disco operator in your area
i the ages of people at an amusement park
j why people choose a particular washing detergent.
19 For each of the situations described in question 18:
i state the population concerned
ii explain how you would select a survey sample
iii explain how you would administer the survey.
20 State the advantages and disadvantages of the following methods of conducting interviews.
a Interviewing people in their homes in selected streets of a suburb
b Interviewing people chosen at random in the street in the CBD
c Ringing people up by choosing names at random from the telephone book
d Asking for people who are interested to ring in for an interview, set up as a competition
with prizes for some of the people who ring in
e Choosing a focus group of people from the electoral roll and paying them to come for
an interview
21 People with telephone numbers on a randomly selected page from the Gold Coast White
Pages are phoned between 6 pm and 8 pm on a weeknight. They are asked whether they
22 The editor of a health and fitness magazine thinks that the weight-and-height tables for Australians are out of date. The editor asks staff to put a survey form in the next issue asking subscribers to send in their current weight, height and age so that new tables can be compiled. Ten people who send in replies will win a year's free subscription to the magazine. Give reasons why this would be a biased survey.
23 The following questionnaire was designed to obtain information about leisure activities. Critically examine the questionnaire, then rewrite it.
1 What is your age?
2 What is your sex?
3 What do you do on the weekends?
4 What do you do after work?
5 Do you watch a lot of TV?
6 Do you like a lot of sport?
24 The survey form below was intended to find whether doing more homework improves students’ results. Critically examine the survey and rewrite it, explaining your reasons for changing questions.
1 What class are you in?
2 What is your sex?
3 How much time do you spend doing homework?
4 How well do you do at school?
5 How much more homework would you need to do to do better?
2.2
Organising and displaying information
Statistical data is usually organised into a frequency distribution table to make it easier to make sense of the data.
Scores are the values of statistical variables. x is often used as the symbol.
The frequency of a score is the number of times it occurs. f is often used as the symbol.
A frequency table has the data arranged in ascending (or descending) order of (numerical) scores. It is also called a frequency distribution table.
Tally marks may be used to simplify table compilation. They are usually in groups of 5.
Tables should have a title.
For continuous variables, or discrete variables with many values, it is possible to simplify the frequency table using classes or class intervals. It is usually best to have between 5 and 15 classes. Too many classes will result in many classes with very low or no frequency. Too few classes will result in very little information being retained about the original data.
Example 11
When 30 people at an employment office were interviewed regarding the number of complete months for which they had been unemployed, their responses were as follows:
1 3 2 5 1 4 2 6 1 3 4 2 1 2 1 4 0 3 5 4 1 6 0 1 2 3 5 0 3 2
a Construct a frequency distribution table for the above data.
b What was the most common period for which a person was unemployed?
Solution
a Write the scores (months) in order. Use tally marks to increase accuracy. Write the frequency for each score.
Time of unemployment
Months, x Tally Frequency, f
0 ⎜⎜⎜ 3
1 ⎜⎜⎜⎜ ⎜⎜ 7
2 ⎜⎜⎜⎜ ⎜ 6
3 ⎜⎜⎜⎜ 5
4 ⎜⎜⎜⎜ 4
5 ⎜⎜⎜ 3
6 ⎜⎜ 2
b The most common period of unemployment was 1 month.
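The counting in the unemployment example can be checked with a short script. This is only a sketch: the names `months`, `frequency_table` and `most_common_score` are illustrative and not part of the text.

```python
from collections import Counter

# Months of unemployment reported by the 30 people in the example.
months = [1, 3, 2, 5, 1, 4, 2, 6, 1, 3, 4, 2, 1, 2, 1,
          4, 0, 3, 5, 4, 1, 6, 0, 1, 2, 3, 5, 0, 3, 2]


def frequency_table(scores):
    """Return (score, frequency) pairs in ascending order of score."""
    counts = Counter(scores)
    return sorted(counts.items())


table = frequency_table(months)
print("Months, x  Frequency, f")
for x, f in table:
    print(f"{x:>9}  {f:>12}")

# The most common score is the one with the highest frequency.
most_common_score = max(table, key=lambda row: row[1])[0]
print("Most common period:", most_common_score, "month(s)")
```

The frequencies should add to the number of people interviewed, which is a useful check on any tally.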
Non-compliant data
If some results of a survey or an experiment show recording or measurement error, they should be classified as non-compliant. These results should be investigated if possible, and perhaps discarded from the data.
The upper and lower boundaries of a class are called the upper class limit and lower
class limit respectively.
The stated class limits may not be the same as the true (real) class limits, particularly for continuous variables.
For discrete variables, the class limits are generally inclusive.
For continuous variables, the true class limits generally include the lower limit and exclude the upper limit.
The class width (sometimes called the class length or class size) is the difference between consecutive upper (or lower) class limits.
The class midpoint (class centre, class score or class mark) is the average of the upper and lower class limits.
Class midpoint = $\dfrac{\text{upper class limit} + \text{lower class limit}}{2}$
All class intervals should be the same length and have explicitly stated class limits. For the purposes of analysis, every item in a class is assumed to have the class midpoint value. It is best to avoid the use of open-ended classes such as ‘35 and over’.
Example 12
The tensile strengths (kg/mm²) of 60 pieces of sheet metal were measured correct to the nearest whole number and recorded as follows:
12 75 45 52 61 49 53 45 64 22 54 15 23 45 668
51 40 39 60 61 55 58 41 69 66 52 38 79 18 54
22 31 47 68 33 25 28 54 69 42 37 51 60 32 18
23 61 58 42 18 56 47 27 58 66 19 28 65 28 39
Construct a grouped frequency distribution table using appropriate class intervals.
Solution
The measurement of 668 is non-compliant. It cannot be used in the table.
The range of values is 79 − 12 = 67.
Ten intervals would have a width of 6.7 kg/mm².
For convenience, a class width of 10 will be chosen. This means that the stated class limits will be 10–19, 20–29, 30–39, etc. and the true class limits will be 9.5–19.5, 19.5–29.5, 29.5–39.5, etc.
While the true upper class limits are the same as the true lower class limits for the next class, there is no ambiguity because we know that a measurement such as 19.5 would be rounded up to 20, and be in the next class (stated as 20–29).
Tensile strength (kg/mm²)
Class Tally Frequency, f
10–19 ⎜⎜⎜⎜ ⎜ 6
20–29 ⎜⎜⎜⎜ ⎜⎜⎜⎜ 9
30–39 ⎜⎜⎜⎜ ⎜⎜ 7
40–49 ⎜⎜⎜⎜ ⎜⎜⎜⎜ 10
50–59 ⎜⎜⎜⎜ ⎜⎜⎜⎜ ⎜⎜⎜ 13
60–69 ⎜⎜⎜⎜ ⎜⎜⎜⎜ ⎜⎜ 12
70–79 ⎜⎜ 2
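Grouping the tensile strength data into classes of width 10 can be sketched in code as follows. The function name `grouped_frequency` and the cut-off `MAX_PLAUSIBLE` are assumptions made for illustration. The text only says that 668 is non-compliant; it does not give a rule for detecting such values.

```python
# Tensile strengths (kg/mm^2) as recorded, including the non-compliant 668.
strengths = [
    12, 75, 45, 52, 61, 49, 53, 45, 64, 22, 54, 15, 23, 45, 668,
    51, 40, 39, 60, 61, 55, 58, 41, 69, 66, 52, 38, 79, 18, 54,
    22, 31, 47, 68, 33, 25, 28, 54, 69, 42, 37, 51, 60, 32, 18,
    23, 61, 58, 42, 18, 56, 47, 27, 58, 66, 19, 28, 65, 28, 39,
]

# Assumption for this sketch: readings above 100 are treated as recording errors.
MAX_PLAUSIBLE = 100
valid = [s for s in strengths if s <= MAX_PLAUSIBLE]


def grouped_frequency(scores, start, width, n_classes):
    """Count whole-number scores in classes with inclusive stated limits."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1  # stated upper limit, e.g. 10-19
        f = sum(1 for s in scores if lower <= s <= upper)
        table.append((lower, upper, f))
    return table


table = grouped_frequency(valid, start=10, width=10, n_classes=7)
for lower, upper, f in table:
    print(f"{lower}-{upper}: {f}")
```

Because the measurements are whole numbers, testing against the stated limits 10–19 gives the same counts as testing against the true limits 9.5–19.5.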
The weight (in kilograms) lost by each member of a group of grossly obese people under a medically supervised program of weight loss is given here:
12.2 22.1 33.9 10.9 4.2 8.7 1.8 28.4 7.6 19.4 3.8 32.7
23.5 9.4 8.2 21.3 7.1 11.6 2.1 6.9 4.6 7.3 5.6 12.5
4.8 34.4 1.9 17.8 5.8 4.2 9.3 18.3 14.1 23.6 2.3 12.6
10.9 3.8 10.3 12.5 7.5 4.6 8.9 2.1 12.5 9.8 31.1 7.1
Express this data in the form of a frequency distribution table.
Solution
The range is 34.4 − 1.8 = 32.6.
A class interval of 5 should be suitable. Class limits will be 0–5, 5–10, 10–15, etc. To preserve as much accuracy as
possible, the true class limits will be 0–4.95, 4.95–9.95, 9.95–14.95, etc. This can be written as 0–, 5–, 10–, etc. or as 0 to 5, 5 to 10, 10 to 15, etc.
Weight lost after slimming program (kg)
Class Tally Frequency, f
0 to 5 ⎜⎜⎜⎜ ⎜⎜⎜⎜ ⎜⎜ 12
5 to 10 ⎜⎜⎜⎜ ⎜⎜⎜⎜ ⎜⎜⎜⎜ 14
10 to 15 ⎜⎜⎜⎜ ⎜⎜⎜⎜ 10
15 to 20 ⎜⎜⎜ 3
20 to 25 ⎜⎜⎜⎜ 4
25 to 30 ⎜ 1
30 to 35 ⎜⎜⎜⎜ 4
A graph visually displays data, making it easier to identify the features of a frequency
distribution. A stem-and-leaf plot (or stemplot) is a simple way to display a table that gives a visual impression of the data. It is particularly appropriate when scores have 2 significant figures. For 2-digit scores, the first digits are placed in order from lowest to highest in a column to make the stem. The second digits are written in a row across from the first digit to make the leaves. The lengths of the rows form a graph showing the spread of the data.
Data with more than two digits is arranged with the extra digits in the most sensible place. A pie chart (sector graph or circle graph) displays information by showing the proportion of scores as the same fraction of a circle. It is used extensively for the display of nominal data or for cases where a proportion is the most natural measure, such as parts of a whole.
Example 14
Below are the ages of mothers of Year 11 students. Make a stem-and-leaf plot of the data.
38 43 35 52 55 57 47 49 39 44 46 43 48 44 40 51 53 36 42 49 52 39 44
Solution
The first digits are 3, 4 and 5. Put these in the stem on the left. Then put the second digits of the first few scores.
Stem | Leaf
3 | 8 5
4 | 3
5 | 2
Continue with the rest of the scores, leaving spaces between the leaves.
Stem | Leaf
3 | 8 5 9 6 9
4 | 3 7 9 4 6 3 8 4 0 2 9 4
5 | 2 5 7 1 3 2
When you have put in all the data, arrange the leaves in ascending order. Add a title and key.
Ages of mothers of Year 11 students
Stem | Leaf
3 | 5 6 8 9 9
4 | 0 2 3 3 4 4 4 6 7 8 9 9
5 | 1 2 2 3 5 7
Key: 4 | 2 = 42
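For 2-digit scores, the stem-and-leaf steps above can be sketched as a small function. The name `stem_and_leaf` is illustrative only, and the sketch assumes whole-number scores from 10 to 99.

```python
from collections import defaultdict


def stem_and_leaf(scores):
    """Map each stem (tens digit) to its leaves (units digits) in ascending order."""
    plot = defaultdict(list)
    for s in sorted(scores):  # sorting first puts the leaves in ascending order
        plot[s // 10].append(s % 10)
    return dict(plot)


ages = [38, 43, 35, 52, 55, 57, 47, 49, 39, 44, 46, 43,
        48, 44, 40, 51, 53, 36, 42, 49, 52, 39, 44]

plot = stem_and_leaf(ages)
print("Ages of mothers of Year 11 students")
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
print("Key: 4 | 2 = 42")
```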
Example 15
When some students were asked to choose their favourite flavour from a selection of ice-creams: 8 said raspberry, 6 said coffee, 9 said chocolate chip, 4 said lime and 5 said plum. Show this as a pie chart.
Solution
Find the proportion of each flavour. Multiply each fraction by 360° to find the angle: 8/32 × 360° = 90°, etc.
Alternatively, you can work out 360° ÷ 32 = 11.25°, then multiply each number by this value, so 8 × 11.25° = 90°.
The total may be 359° or 361° because of rounding errors.
Flavour | Number | Fraction | Angle
Raspberry | 8 | 8/32 = 1/4 | 1/4 × 360° = 90°
Coffee | 6 | 6/32 = 3/16 | 3/16 × 360° ≈ 68°
Chocolate chip | 9 | 9/32 | 9/32 × 360° ≈ 101°
Lime | 4 | 4/32 = 1/8 | 1/8 × 360° = 45°
Plum | 5 | 5/32 | 5/32 × 360° ≈ 56°
Total | 32 | 32/32 = 1 | 360°
Draw the graph, labelling each sector. Give the total for the graph.
[Figure: pie chart ‘Favourite flavours (32 students)’ with sectors for raspberry, coffee, chocolate chip, lime and plum.]
Information can be shown as a histogram. Rectangles are used to represent the classes of data. A histogram is like a column graph, but there are no spaces between the columns.
For histograms:
■ The horizontal axis is continuous and shows the variable.
■ Frequency is always shown on the vertical axis.
■ The columns are shown with no spaces, since the true upper class limit of any one class is the true lower class limit of the following class.
■ The class limits are considered to be at the ‘halves’ for discrete and ungrouped data, as shown in the diagram.
■ The area of each rectangle is proportional to the frequency of the class it represents. If the class widths are equal, the height of the rectangle also is proportional to the frequency.
[Diagram: horizontal axis marked with the scores 1, 2, 3.]
A frequency polygon is drawn by plotting the class midpoints against the corresponding frequencies and then joining these points with straight lines. It is like a line graph.
Example 16
Show the data below for the diameters of Telopea speciosissima (waratah) flowers as a histogram.
Waratah flower diameters (mm)
Class Frequency, f
30–39 2
40–49 5
50–59 8
60–69 6
70–79 3
80–89 0
90–99 1
Solution
The true class limits are 29.5–39.5, 39.5–49.5, etc. The histogram is shown.
[Figure: histogram ‘Waratah flower diameters’; horizontal axis: flower diameter (mm), 30 to 100; vertical axis: frequency, 0 to 8. An inset shows the columns bounded by the true class limits 29.5, 39.5 and 49.5.]
For frequency polygons:
■ The horizontal axis is continuous and shows the variable.
■ Frequency is always shown on the vertical axis.
■ The true class midpoints are used to plot points, which are joined by straight lines.
■ The polygon is closed at either end, so extreme class midpoints need to be added.
They have zero frequency.
■ A frequency polygon may be drawn by joining the midpoints of the tops of the
columns of a histogram.
■ If the class widths are equal, the area under a frequency polygon is the same as the
area of a histogram for the same data.
It is often useful to know the number of items that lie below (or above) a particular value. The cumulative frequency is used for this purpose. Cumulative frequency is the sum of the frequencies up to and including an item. Graphs of cumulative frequency are also useful.
Example 17
Draw a frequency polygon for the waratah flower diameters from Example 16.
Solution
The true class midpoints are 34.5, 44.5, etc.
The extreme midpoints to be added are at 24.5 and 104.5.
The graph is shown.
[Figure: frequency polygon ‘Waratah flower diameters’; horizontal axis: flower diameter (mm), 30 to 100; vertical axis: frequency, 0 to 8. An inset marks the class midpoints 24.5, 34.5 and 44.5.]
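The points of the waratah frequency polygon, including the two zero-frequency midpoints that close it, can be generated as in the sketch below. The function name `polygon_points` is an assumption for illustration, and the sketch assumes equal class widths.

```python
def polygon_points(first_true_lower, width, frequencies):
    """Return (class midpoint, frequency) points for a frequency polygon.

    A zero-frequency class is added at each end so the polygon is closed.
    """
    points = [(first_true_lower - width / 2, 0)]  # extra class below the first
    for k, f in enumerate(frequencies):
        midpoint = first_true_lower + k * width + width / 2
        points.append((midpoint, f))
    last_midpoint = first_true_lower + len(frequencies) * width + width / 2
    points.append((last_midpoint, 0))  # extra class above the last
    return points


# Waratah data: true class limits 29.5-39.5, 39.5-49.5, ..., 89.5-99.5.
waratah = polygon_points(29.5, 10, [2, 5, 8, 6, 3, 0, 1])
for midpoint, f in waratah:
    print(midpoint, f)
```

Joining these points in order with straight lines gives the polygon. The class 80–89 has zero frequency, so the line touches the horizontal axis at 84.5.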
For cumulative frequency graphs:
■ A cumulative frequency histogram is drawn using cumulative frequencies.
■ A cumulative frequency polygon is drawn using cumulative frequencies. In a
cumulative frequency polygon, points are plotted at the true upper class limits to show the cumulative total. The graph begins at the true lower class limit of the first class.
■ An ogive results when the points of a cumulative frequency polygon are connected by
a curve. A percentage ogive has the cumulative frequencies expressed as percentages.
Example 18
Police in a 60 km/h zone registered the following readings on a radar gun:
58 60 52 62 59 70 74 68 61 54 55 58 68 62 63 67 71 74 48 56 63 61 58 65 59 78 49 51 68 53 79 62 55 57 60 72 54 63 56 80 51 64 73 61 60 59 75 69 82 57
a Draw up a cumulative frequency distribution table for this data.
b Using the table, is it possible to say exactly how many motorists exceeded the speed
limit?
c If the police gave a ticket to every motorist who was travelling at more than 64 km/h,
how many tickets would have been issued?
Solution
a The values range from 48 to 82. A class width of 5 gives a suitable number of classes. Convenient limits are 45–49, 50–54, 55–59, etc.
The true limits are 44.5–49.5, 49.5–54.5, 54.5–59.5, etc.
b We see that 20 motorists were definitely travelling at less than the limit, but we can’t tell how many of the next 13 were doing exactly 60 km/h.
We cannot say exactly how many were speeding from the table alone.
c From the table, 17 motorists were definitely travelling at more than 64 km/h.
17 tickets would have been issued.
Speeds of motorists (km/h)
Class Tally Frequency, f Cumulative frequency
44.5–49.5 ⎜⎜ 2 2
49.5–54.5 ⎜⎜⎜⎜ ⎜ 6 8
54.5–59.5 ⎜⎜⎜⎜ ⎜⎜⎜⎜ ⎜⎜ 12 20
59.5–64.5 ⎜⎜⎜⎜ ⎜⎜⎜⎜ ⎜⎜⎜ 13 33
64.5–69.5 ⎜⎜⎜⎜ ⎜ 6 39
69.5–74.5 ⎜⎜⎜⎜ ⎜ 6 45
74.5–79.5 ⎜⎜⎜ 3 48
79.5–84.5 ⎜⎜ 2 50
Example 19
Draw a cumulative frequency polygon for the data from Example 18.
Solution
The polygon will start from 44.5, and points are placed at the true upper class limit.
[Figure: cumulative frequency polygon ‘Speeds of motorists’; horizontal axis: speed (km/h), 45 to 85; vertical axis: cumulative frequency, 0 to 50.]
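The cumulative frequencies for the motorists’ speeds, and the points plotted for the cumulative frequency polygon, can be computed as in this sketch. The variable names are illustrative only. The class frequencies are those from the table for the radar readings.

```python
from itertools import accumulate

# True upper class limits and class frequencies for the motorists' speeds.
true_upper_limits = [49.5, 54.5, 59.5, 64.5, 69.5, 74.5, 79.5, 84.5]
frequencies = [2, 6, 12, 13, 6, 6, 3, 2]

# Running totals: each entry is the number of readings up to that upper limit.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Cumulative percentage, for a percentage ogive.
cumulative_percent = [100 * c / total for c in cumulative]

# The polygon starts at the true lower limit of the first class with a count of 0.
points = [(44.5, 0)] + list(zip(true_upper_limits, cumulative))

# Readings above 64 km/h are those beyond the class ending at 64.5.
tickets = total - cumulative[true_upper_limits.index(64.5)]
print(cumulative, tickets)
```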
Instead of plotting cumulative frequency on the vertical axis, some people plot cumulative
percentage. This is calculated as a percentage of the total frequency.
Technology
A graphics calculator can be used to draw statistical graphs. The following shows how to enter the data for the motorists’ speeds from Example 18 and draw the graphs. We use the class midpoint for histograms and polygons. The data to be entered is thus:
Speed, x 47 52 57 62 67 72 77 82
f 2 6 12 13 6 6 3 2
Casio fx-9860G AU
To enter the data, choose the STAT menu. If there is already data in List 1, delete it by pressing F6 F4 F1.
Enter the scores in List 1, and enter the frequencies in List 2, pressing EXE after each item and using the cursor arrows to move between the lists.
When all the data is entered, enter the GRPH submenu. From the screen above, press F6 F1.
Then enter the SET submenu, pressing F6.
Set the Graph Type to Hist, the XList to List 1 and the Frequency to List 2.
EXIT from the SET submenu and use GPH1, F1. Set the width to 5.
Press F6 to draw the histogram. To draw a polygon, change the Graph Type to Broken by pressing F5 from the SET submenu and redraw the graph, again setting the width to 5.
Texas Instruments TI-84
To enter the data, press the STAT key and choose the Edit menu.
If there is already data in a list, clear it by moving up to the heading and pressing CLEAR followed by ENTER.
Enter the scores in L1, and enter the frequencies in L2.
To draw a histogram, choose STATPLOT using 2nd Y=.
Press ENTER to set Plot1.
Set it to On, and choose the histogram from the icons.
Set the Xlist to L1 using 2nd 1.
Set the Freq to L2 using 2nd 2.
Press the WINDOW key to set the parameters for the graph, setting Xmin to 40, Xmax to 90, Xscl to 5, Ymin to 0 and Ymax to 15.
Press the GRAPH key to see the histogram. To change to a polygon, select the line graph icon in the STATPLOT menu and redraw the graph.
Sharp EL-9900
See the instructions given on the CD-ROM.
Exercise 2.2
Organising and displaying information
1 The pizza toppings at a takeaway food outlet are: Hawaiian (H), vegetarian (V), all
meat (AM), supreme (S) and Mexican (M). The toppings of the first 20 pizzas ordered one day were as follows:
S AM S H S V M AM S M
M S AM S M S H H M S
a Draw up a frequency table for this data.
b Which topping was the second most popular?
2 A box of free-range eggs was examined, and the eggs weighed. The masses (grams) were:
51 55 60 52 50 61 52 54 55 65 50 52 54 64 52 50 51 62 64 51 50 52 58 57 66 61 58 59 57 54 52 50 53 54 55 58
a Construct a frequency distribution table for this data.
b The eggs are to be sorted into the following gradings: medium (46–52 g), large
(53–59 g) and jumbo (59 g). How many of each size would be in the box?
3 The numbers of hits on a website in 20 consecutive days were:
50 63 85 51 49 86 67 65 74 46 69 78 71 45 52 43 63 67 50 58 Show this data as a stemplot.
4 The handspans of a sample of Year 11 students were measured (in cm) as:
21.0 18.5 23.7 19.2 20.1 19.8 17.2 19.6
22.3 20.8 17.9 21.8 19.8 18.1 23.2 21.7
Construct a stem-and-leaf plot to show these measurements.
5 Some classes are shown below. For each set, state the true class interval and class midpoint
for the class shown in parentheses.
a 0–5, (5–10), 10–15, 15–20, …
b 20–29, 30–39, (40–49), 50–59, …
c 12–15, 16–19, 20–23, (24–27), 28–31, …
d 110–, 120–, (130–), 140–, … (data correct to 1 decimal place)
6 The breaking strains of 200 samples of cotton yarn were measured in newtons. The values ranged from 77 to 129. Suggest class intervals and class midpoints for grouping this data.
7 The diameters of 150 bolts ranged from 3.42 to 4.67 cm. Suggest suitable class intervals and
class midpoints for grouping.
8 There were 14.4 million motor vehicles, including motorcycles, registered in Australia at 31 March 2006. The numbers of each type are given in the table below. Draw a pie chart to show this data.
Vehicles Number registered
Passenger vehicles 11 188 880
Light commercial vehicles 2 114 333
Trucks 475 519
Other 116 895
Motorcycles 463 057
Modelling and problem solving
9 Bungee jumpers must be weighed before they ‘take the plunge’. The weights (kilograms) of 40 jumpers were recorded as follows:
41 58 63 37 49 58 71 33 85 58 60 73 81 46 55 38 80 48 50 62 61 59 63 44 77 62 58 73 62 75 52 60 69 61 55 47 76 42 66 70
a Use the class limits 30–39, 40–49, …, 80–89 to construct a frequency distribution table.
b Calculate the class width.
c List the class midpoints.
10 A researcher is investigating the lifetime (in hours) of safety light bulbs. The results are:
1008 980 1143 1324 1246 1012 962 1440 1159 1362
1441 1147 1326 1092 1387 1298 1120 1332 1493 1211
1205 1453 1139 1348 1399 1040 1417 1530 1321 1129
1317 1297 1473 1010 1244 1159 1331 1227 1277 1352
1242 1374 1176 1386 1142 1402 1282 1393 1510 1285
a Between what two values does the data range?
b The data needs to be grouped. Suggest a suitable class length and class limits for grouping.
c Draw up a frequency distribution table using the class limits you have chosen.
d Make a stem-and-leaf plot of the lifetimes.
11 The energy consumption (in megajoules) of 26 households in a suburban Brisbane street was recorded as:
176.4 179.5 182.3 180.7 172.6
181.4 179.6 172.1 179.3 175.8
182.7 179.3 177.1 1794 178.7
179.6 182.5 177.5 178.2 176.9
185.6 179.2 176.2 180.4 178.8
181.3
Choose suitable classes and prepare a frequency distribution table using this data.
12 Fifty schools in south-east Queensland were surveyed regarding the percentage of non-English-speaking background (NESB) students in attendance.
% NESB f
0–9 5
10–19 9
20–29 18
30–39 12
40–49 5
50–59 1
a Draw a histogram for this data.
b Construct a frequency polygon for the data.
13 The numbers of occupational injuries recorded in a year at 48 Queensland factories were:
29 24 27 30 24 31 28 24 28 26 24 23 28 30 29 27 40 33 29 23 29 24 26 27 25 24 26 31 23 35 25 37 24 25 33 34 35 38 31 30 27 32 33 36 37 32 37 27
a Use a class width of 3 to construct a frequency distribution table for the data starting with 23–25.
b Construct a histogram for this data.
c Draw a frequency polygon for the data.
14 To help estimate the value of a pine plantation, the girths (centimetres) of a sample of 300 trees were measured. They were recorded and compiled into the table below.
Girth (cm) Number of trees
40 to 60 15
60 to 80 28
80 to 100 68
100 to 120 95
120 to 140 49
140 to 160 33
160 to 180 12
a Represent this data as a histogram.
b If this sample is representative of the whole plantation of 25 000 pines, how many would you expect to have a girth of more than 140 cm? Give reasons for your answer.
15 The percentage marks of 81 Maths B students in a school maths competition were recorded as shown.
Mark (%) f
10–19 2
20–29 6
30–39 10
40–49 26
50–59 21
60–69 8
70–79 4
80–89 2
90–99 2
a Construct a cumulative frequency table for this information.
b Draw an ogive.
c If a score of 65% or more was enough for a student to gain a merit certificate, how many would not receive a merit certificate?
16 A city courier travels a set route each day. The time (in minutes) taken to complete this route is recorded over 60 consecutive working days as follows:
92 86 76 65 63 93 78 83 88 74 82 94 70 88 81
68 88 97 61 66 90 76 93 76 73 82 77 88 74 83
88 91 96 97 86 93 74 76 84 95 80 80 82 79 74
94 77 85 98 96 90 80 76 92 80 73 77 82 89 77
a Using the classes 60–64, 65–69, 70–74, etc., present this data in a cumulative frequency table.
b Draw a percentage ogive for this data.
c In what percentage of days could the courier be expected to complete the route in under 80 minutes?
d If a supervisor decides to give a bonus for ‘early’ completion of the route, what time would you suggest as the longest to still give a bonus? Give complete justification for your reasons.
2.3
Finding data positions and centres
Ogives provide a convenient way of determining approximate values for statistical measures called fractiles or quantiles.
A fractile is the value of a statistical variable below which a particular fraction of the scores lie. Fractiles are also called quantiles.
Quartiles divide a set of data into quarters:
■ 25% of all the scores lie below the first quartile (Q1= lower quartile).
■ 50% of all the scores lie below the second quartile (Q2= middle quartile).
■ 75% of all the scores lie below the third quartile (Q3= upper quartile).
The second quartile is also called the median.
Similarly, deciles divide a data set into tenths and percentiles into hundredths.
40% of all scores lie below D4 and 73% below P73.
The median is an important summary statistic for all numeric data. For discrete data it should be calculated in a different way than for continuous or grouped data. It is the score that lies half-way through the data, when all the scores are arranged in order. The mode is particularly useful for
nominal variables.
Example 20
The girths of trees in a pine plantation were measured (to the nearest centimetre) as follows.
Girth (cm) Number of trees
40 to 60 15
60 to 80 28
80 to 100 68
100 to 120 95
120 to 140 49
140 to 160 33
160 to 180 12
a Draw an ogive for this data.
b Find the 2nd decile.
c Find the 35th percentile.
d Find the 3rd quartile.
e Find the median.
Solution
a The true class intervals are 39.5–59.5, 59.5–79.5, etc. Redraw the table to include the true classes and cumulative frequency. It will be simplest to use cumulative percentage.
Girth (cm) f c. f. % c. f.
39.5–59.5 15 15 5
59.5–79.5 28 43 14.3
79.5–99.5 68 111 37
99.5–119.5 95 206 68.7
119.5–139.5 49 255 85
139.5–159.5 33 288 96
159.5–179.5 12 300 100
Then draw the percentage ogive.
[Figure: percentage ogive ‘Pine tree girths’; horizontal axis: girth (cm), 19.5 to 179.5; vertical axis: cumulative percentage, 0 to 100.]
b Using the graph to find the 20th percentile, the 2nd decile is 84.
c Using the graph, the 35th percentile is 98.
d Using the graph to find the 75th percentile, the 3rd quartile is 125.
e Using the graph to find the 50th percentile, the median is 107.
For large sets of data, the median should be calculated from the cumulative frequency. It can be estimated for grouped data by interpolation instead of by using an ogive. There are methods to estimate the mode for grouped data, but they are very crude.
For discrete or ungrouped data, the median is the middle score, when all scores are arranged in order.
In general, Median = $\left(\dfrac{n+1}{2}\right)$th item.
For an even number of scores, it is taken as the average of the middle two scores.
The mode is the most common score. It has the highest frequency. For grouped data, the modal class is the class with the highest frequency.
A set of data with two scores with equal highest frequency is called bimodal, while a distribution with only one mode is called unimodal.
Example 21
Find the median in each of the following.
a 1, 12, 14, 11, 4, 7, 15
b 1, 12, 14, 12, 4, 7
Solution
a Place the data in order. 1 4 7 11 12 14 15
Since there are 7 items, the middle is the 4th. Median = 11
b Rearrange the data. 1 4 7 12 12 14
There are 6 pieces of data. Median = $\dfrac{7+12}{2}$ = 9.5
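The rule for the median of ungrouped data (the middle score for an odd number of scores, the average of the middle two for an even number) can be sketched as follows. The function name `median` is chosen here for illustration.

```python
def median(scores):
    """Return the median of a list of ungrouped scores."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of scores: the ((n + 1) / 2)th item is the middle one.
        return ordered[mid]
    # Even number of scores: average the two middle items.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 12, 14, 11, 4, 7, 15]))
print(median([1, 12, 14, 12, 4, 7]))
```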
Example 22
The number of cars sold by a used-car yard each day was as follows.
Cars sold 0 1 2 3 4 5 6
Number of days 3 12 11 8 6 3 1
Find:
a the mode
b the median number of cars sold.
Solution
a The mode has the highest frequency.
Mode = 1 car.
b Redraw the table to include the cumulative frequency.
Cars sold f c. f.
0 3 3
1 12 15
2 11 26
3 8 34
4 6 40
5 3 43
6 1 44
The median is the $\dfrac{44+1}{2}$ = 22.5th score.
There are 15 scores up to and including 1. The next 11 scores are 2, so the 22.5th is a 2.
Median = 2 cars.
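One way to check the mode and median of a frequency table is to expand it back into a list of individual scores and use Python’s standard `statistics` module. This is a sketch: the names `cars_sold` and `expand` are illustrative, and expanding the table is only practical for small data sets.

```python
import statistics

# (cars sold, number of days) pairs from the used-car yard table.
cars_sold = [(0, 3), (1, 12), (2, 11), (3, 8), (4, 6), (5, 3), (6, 1)]


def expand(table):
    """Turn (score, frequency) pairs into a flat list of scores."""
    return [x for x, f in table for _ in range(f)]


scores = expand(cars_sold)
modes = statistics.multimode(scores)   # list of all scores with the highest frequency
middle = statistics.median(scores)     # averages the 22nd and 23rd of the 44 scores
print(len(scores), modes, middle)
```

`statistics.multimode` returns every score with the highest frequency, so a bimodal data set gives a list of two values.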
In Example 23, the data is continuous, so the median can be any value. If the data were discrete, we should take the median to the nearest 0.5 to show this.
When discrete data is grouped, it is often treated as if it were continuous for the purpose of finding summary statistics. This will give a slightly different answer for the median because the 50th percentile may not be the same as the $\left(\dfrac{n+1}{2}\right)$th score.
Many summary statistics involve the addition of many expressions. Mathematicians have a shorthand for such additions.
Example 23
The heights (cm) of 70 Year 12 students were measured and the results were recorded. Find the median height and the modal class.
Heights of Year 12 students (cm)
Class f
160–164 2
165–169 5
170–174 13
175–179 23
180–184 14
185–189 10
190–194 2
195–199 1
Solution
Write the true class limits. Add a cumulative frequency column.
Class f c. f.
159.5–164.5 2 2
164.5–169.5 5 7
169.5–174.5 13 20
174.5–179.5 23 43
179.5–184.5 14 57
184.5–189.5 10 67
189.5–194.5 2 69
194.5–199.5 1 70
Since the variable is continuous, we want P50. 50% of 70 = 35.
There are 20 scores up to 174.5, so the median lies 15 scores above 174.5. There are 23 scores from 174.5 to 179.5.
The median is $\dfrac{15}{23}$ of the way between 174.5 and 179.5.
Use interpolation to find the median.
Median = 174.5 + $\dfrac{15}{23}$ × (179.5 − 174.5)
≈ 174.5 + 3.3
= 177.8 cm
The modal class has the highest frequency. The modal class is 175–179 cm.
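The interpolation used for the median height can be written as a general percentile estimate for grouped continuous data. This is a sketch under the same assumption the worked solution makes, namely that scores are spread evenly within each class. The function name `grouped_percentile` is illustrative.

```python
def grouped_percentile(true_limits, frequencies, percent):
    """Estimate a percentile of grouped data by linear interpolation.

    true_limits lists the class boundaries, so it has one more entry
    than frequencies.
    """
    total = sum(frequencies)
    target = percent / 100 * total   # number of scores lying below the percentile
    below = 0                        # cumulative frequency before the current class
    for k, f in enumerate(frequencies):
        if f > 0 and below + f >= target:
            lower, upper = true_limits[k], true_limits[k + 1]
            # Move the fraction (target - below) / f of the way through the class.
            return lower + (target - below) / f * (upper - lower)
        below += f
    return true_limits[-1]


limits = [159.5, 164.5, 169.5, 174.5, 179.5, 184.5, 189.5, 194.5, 199.5]
freqs = [2, 5, 13, 23, 14, 10, 2, 1]

median_height = grouped_percentile(limits, freqs, 50)
print(round(median_height, 1))
```

Values from this kind of straight-line interpolation can differ slightly from values read off a hand-drawn ogive, because an ogive joins the same points with a smooth curve.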
The mode and median are both measures of central tendency. The mean is the most commonly used measure of central tendency.
When using grouped data, you should use the true class limits to find the class midpoints for calculations, because the stated class limits can sometimes give an incorrect midpoint.
Summation notation
The Greek letter Σ means ‘the sum of’, such as Σx = ‘Sum of the values of x’.
If particular values are to be added, they are indicated by subscripts such as
$\sum_{i=2}^{5} x_i = x_2 + x_3 + x_4 + x_5$
The mean of the scores of the variable x is written as $\bar{x}$. It is the arithmetic average of the values of x.
For ungrouped data, $\bar{x} = \dfrac{\Sigma x}{n} = \dfrac{\text{total of all scores}}{\text{number of scores}}$
For grouped data, the class midpoint is used for x and we can write: $\bar{x} = \dfrac{\Sigma fx}{\Sigma f}$
Example 24
The eggs in a carton marked ‘52 grams’ were weighed. The weights in grams were:
53 52 54 50 51 51
53 54 53 52 54 52
What is the mean weight of an egg from this carton?
Solution
The mean is the average.
Mean = $\dfrac{\Sigma x}{n} = \dfrac{\text{sum of weights}}{\text{number of eggs}}$
There are 12 eggs, so n = 12.
= $\dfrac{53+52+54+50+51+51+53+54+53+52+54+52}{12}$
Evaluate. ≈ 52.4 g
Write the answer. The mean weight is about 52.4 g.
A graphics calculator can be used to find the mean, median and mode of a set of data. However, for many cases it is quicker to calculate by hand than to use a calculator.
Example 25
The heights (metres) of 40 Year 11 students were recorded. Calculate the mean height of students in this group.
Heights of Year 11 students (m)
Class f
1.50–1.54 2
1.55–1.59 4
1.60–1.64 8
1.65–1.69 14
1.70–1.74 7
1.75–1.79 2
1.80–1.84 3
Solution
Redraw the table with true class limits. Use class midpoints (x) for the score values. Put in a frequency × score (fx) column. Add the frequencies.
Class f x fx
1.495–1.545 2 1.52 3.04
1.545–1.595 4 1.57 6.28
1.595–1.645 8 1.62 12.96
1.645–1.695 14 1.67 23.38
1.695–1.745 7 1.72 12.04
1.745–1.795 2 1.77 3.54
1.795–1.845 3 1.82 5.46
Totals 40 66.70
Use the grouped data formula.
$\bar{x} = \dfrac{\Sigma fx}{\Sigma f} = \dfrac{66.70}{40} \approx 1.67$ m
The mean height is about 167 cm.
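Both mean formulas, Σx ÷ n for ungrouped data and Σfx ÷ Σf for grouped data, can be sketched in a few lines. The function names `mean` and `grouped_mean` are illustrative. The data are the egg weights and the Year 11 height classes from the two worked examples.

```python
def mean(scores):
    """Mean of ungrouped data: sum of scores divided by number of scores."""
    return sum(scores) / len(scores)


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum of f * x divided by sum of f."""
    total_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    return total_fx / sum(frequencies)


# Egg weights (g) from the carton marked '52 grams'.
eggs = [53, 52, 54, 50, 51, 51, 53, 54, 53, 52, 54, 52]

# Class midpoints (m) and frequencies for the Year 11 heights.
midpoints = [1.52, 1.57, 1.62, 1.67, 1.72, 1.77, 1.82]
freqs = [2, 4, 8, 14, 7, 2, 3]

egg_mean = mean(eggs)
height_mean = grouped_mean(midpoints, freqs)
print(round(egg_mean, 1), round(height_mean, 2))
```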
Example 26
Use a graphics calculator to find the mean, median and mode (if possible) of this data.
a 3, 6, 8, 10, 5, 4, 6, 8, 8, 3, 4, 6
b
x 3 4 5 6 7 8 9
f 5 7 6 3 2 0 1
Solution
a Enter the data in List 1 as shown on pages 59–60.
Casio fx-9860G AU
Choose the CALC submenu by pressing F2.
Choose the SET submenu by pressing F6. Set the 1Var XList to List 1.
Set the 1Var Freq to 1.
EXIT and choose 1-Var by pressing F1. Use the cursor keys to move down the list.
The mean is about 5.92, the median is 6, and the two modes are 6 and 8 with frequencies of 3.
Texas Instruments TI-84
After the data is entered in L1, press the STAT key and choose the CALC menu. Choose 1-Var Stats, press ENTER and put in L1 by pressing 2nd 1.
Use the cursor keys to move down the list.
The mean is about 5.92 and the median is 6, but the mode is not given.
Sharp EL-9900
See the instructions given on the CD-ROM.
b The method for the table is almost the same for all the calculators as in part a, with the exception that the data is entered in List 1 for the scores and List 2 for the frequencies, as for a graph. The scores are set to List 1 and the frequencies to List 2. For the TI-84 and Sharp calculators, the 1-variable statistics are called using L1, L2 instead of just L1.
In each case the mean is found to be 4.75 and the median 4.5. The mode is 4 with a frequency of 7 (Casio).
Exercise 2.3
Finding data positions and centres
1 Use the percentage ogive shown below to find:
a the 1st quartile
b the 3rd quartile
c the 9th decile
d the 15th percentile
e the 36th percentile
f the 65th percentile.
[Figure: percentage ogive; horizontal axis: x, 0 to 45; vertical axis: cumulative percentage, 0 to 100.]
2 Use the ogive shown below to find:
a the median b the 1st quartile c the 1st decile
d the 4th pentile e the 66th percentile f the 90th percentile.
3 Find the mean, median and mode of each of the following sets of data.
a 3, 4, 7, 8, 10, 12, 14, 14, 15, 16, 17
b 1, 2, 2, 2, 3, 4, 7, 9, 11, 12
c 13, 15, 8, 7, 4, 9, 10, 3, 9, 8, 9, 10
d 2, 7, 3, 7, 9, 11, 14, 18, 19, 19, 8, 9, 11, 17
4 Find the mean, median and modal class of each of the following sets of discrete data.
a Class f
5–9 3
10–14 8
15–19 12
20–24 4
25–29 1
b Class f
20–29 1
30–39 4
40–49 8
50–59 15
60–69 12
c Class f
1–3 3
4–6 7
7–9 15
10–12 8
13–15 4
5 Does it make any difference to the results of question 4 if the data is treated as continuous? Explain your answer carefully.
6 Use a graphics calculator to find the mean, median and mode (if possible) of each set of data.
a 3, 7, 9, 10, 4, 5, 7, 3, 5, 7, 8, 12
b 6, 8, 12, 15, 12, 9, 7, 3, 9, 15, 16, 15, 8, 7, 4, 6
c 0, 3, 0, 4, 1, 8, 7, 6, 2, 2, 9, 3, 12, 14, 6, 8, 10
d 20, 22, 19, 29, 18, 30, 27, 16, 25, 27, 28, 24, 25, 28, 20, 15