10 The normal distribution and correlation

DATA ANALYSIS

Do people with large feet swim faster? Do tall people weigh more? Do you achieve better examination results if you watch less TV? Is there any relationship between your IQ and the last three digits of your telephone number? Each of these questions asks about a possible relationship between two variables. Correlation is the strength of any such relationship. We would assume that there is no correlation between IQ and telephone numbers, but could there be a correlation between a city’s temperature and its rainfall? We will discuss correlation later in this chapter.

In this chapter you will learn how to:

• use a normal curve to model continuous data
• identify the properties of data that are normally distributed
• investigate the percentages of scores within 1, 2 and 3 standard deviations of the mean
• describe and calculate z-scores
• use calculated z-scores to compare scores from different data sets
• plot ordered pairs of data onto a scatterplot
• recognise any linear pattern or correlation from the scatterplot
• understand the meaning of the value of a given correlation coefficient
• recognise which relationships with a high correlation are causal relationships
• construct a median regression line to give a line of fit on a scatterplot

THE NORMAL DISTRIBUTION

We have already looked at the shape and features of statistical distributions and have drawn frequency histograms and polygons to display data sets.

In this chapter we will look at the special case of using a curve rather than a histogram or polygon to model a large amount of data.

The heights (in centimetres) of a random sample of adults were measured and a frequency histogram was drawn. The distribution of heights is fairly symmetrical and has a mean of 170 cm with data balanced about this mean. The histogram is approximately a bell-shaped curve called a normal curve.

Many natural characteristics, such as height, blood pressure and skull width, are normally distributed and can be represented by a normal curve.

Standard deviation and the normal curve

Whereas the mean ($\bar{x}$) is a measure of location, the standard deviation ($s$) measures the spread of a distribution of scores. For a normal distribution, the larger the standard deviation, the flatter its curve.

For a sample, the standard deviation is $s = \sigma_{n-1}$. For a population, the standard deviation is $s = \sigma_n$.

[Figure: two panels, a frequency histogram and a normal curve, each plotted against height (cm) and centred on 170]

[Figure: normal curves with the same mean ($\bar{x}$) but different standard deviations]

[Figure: normal curves with different means ($\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$) but the same standard deviation]

Areas under the normal curve

For any normal distribution:

• 100% of the scores lie under the normal curve
• 68% of the scores lie within 1 standard deviation of the mean (between $\bar{x} - s$ and $\bar{x} + s$)
• 95% of the scores lie within 2 standard deviations of the mean (between $\bar{x} - 2s$ and $\bar{x} + 2s$)
• 99.7% of the scores lie within 3 standard deviations of the mean (between $\bar{x} - 3s$ and $\bar{x} + 3s$)
• 0.3% of the scores are ‘in the tails’.

Since the normal curve is symmetrical, the percentages for each area, defined by the number of standard deviations from the mean, can be found using the values above.
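As a rough illustration of how symmetry splits the curve into smaller areas, the sketch below starts from the 68%, 95% and 99.7% figures and halves the differences between them. The function name `band_percentages` and the band labels are hypothetical, chosen only for this example.

```python
# Percentages of scores within 1, 2 and 3 standard deviations of the mean,
# as stated in the text.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

def band_percentages(within=WITHIN):
    """Split the curve into bands on ONE side of the mean, using symmetry."""
    bands = {
        "mean to 1s": within[1] / 2,
        "1s to 2s": (within[2] - within[1]) / 2,
        "2s to 3s": (within[3] - within[2]) / 2,
        "beyond 3s": (100.0 - within[3]) / 2,
    }
    # Round to 2 decimal places to hide floating-point noise.
    return {name: round(value, 2) for name, value in bands.items()}

bands = band_percentages()
for name, value in bands.items():
    print(f"{name}: {value}%")
```

Each band appears twice, once on each side of the mean, so the eight areas together account for 100% of the scores.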

Area and probability

Area under the normal curve is related to probability, since each can be represented by a percentage. For example, the probability that a score selected at random will be within 1 standard deviation of the mean is 68% or 0.68, and the probability that a score will be within 2 standard deviations of the mean is 95% or 0.95.

We say:

• most scores (68%) will lie within 1 standard deviation of the mean
• a score will most probably (95% chance) lie within 2 standard deviations of the mean
• a score will almost certainly (99.7% chance) lie within 3 standard deviations of the mean.
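The 68%, 95% and 99.7% figures are rounded values. A minimal sketch of where they come from, assuming the standard relationship between the normal curve and the error function (`math.erf` in Python's standard library), is shown below. The function name `percent_within` is hypothetical.

```python
import math

def percent_within(k):
    """Percentage of a normal distribution within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {percent_within(k):.2f}%")
```

The unrounded values are about 68.27%, 95.45% and 99.73%, which is why the chapter's percentages are approximations rather than exact areas.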

Example 1

The heights of a large sample of adults are normally distributed with mean 170 cm and standard deviation 8 cm. Within what limits do:

(a) 68% of heights lie? (b) 95% of heights lie? (c) 99.7% of heights lie?

Solution

Mean $\bar{x} = 170$, standard deviation $s = 8$

(a) 68% of heights lie between $\bar{x} - s$ and $\bar{x} + s$: that is, between 162 cm and 178 cm.
(b) 95% of heights lie between $\bar{x} - 2s$ and $\bar{x} + 2s$: that is, between 154 cm and 186 cm.
(c) 99.7% of heights lie between $\bar{x} - 3s$ and $\bar{x} + 3s$: that is, between 146 cm and 194 cm.

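[Figure: four normal curves, each with the relevant region shaded]

100% of the scores lie under the normal curve.
68% of the scores lie within 1 standard deviation of the mean ($\bar{x} - s$ to $\bar{x} + s$).
95% of the scores lie within 2 standard deviations of the mean ($\bar{x} - 2s$ to $\bar{x} + 2s$).
99.7% of the scores lie within 3 standard deviations of the mean ($\bar{x} - 3s$ to $\bar{x} + 3s$).

[Figure: normal curve marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$ and $\bar{x} + 3s$. Moving outward from the mean on each side, the bands contain 34%, 13.5% and 2.35% of the scores, with 0.15% in each tail.]

Determine how these percentages for each area were calculated.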

Example 2

A machine fills bottles with 300 mL of soft drink. The amounts in the bottles are normally distributed with a mean of 302 mL and a standard deviation of 3 mL.

(a) Draw a normal curve showing the mean and the values that are 1, 2 and 3 standard deviations from the mean.
(b) If 10 000 bottles are filled, how many bottles contain amounts that are:
(i) within 1 standard deviation of the mean?
(ii) within 2 standard deviations of the mean?
(iii) further than 3 standard deviations from the mean?
(c) If the manufacturer rejects bottles that contain amounts more than 3 standard deviations from the mean, what is the largest amount of soft drink in a bottle that you would find for sale in a supermarket?
(d) What is the probability that a bottle selected at random will contain between 299 mL and 305 mL?

Solution

Mean $\bar{x} = 302$, standard deviation $s = 3$

(a) $\bar{x} - s = 302 - 3 = 299$
$\bar{x} + s = 302 + 3 = 305$
$\bar{x} - 2s = 302 - 2 \times 3 = 296$
$\bar{x} + 2s = 302 + 2 \times 3 = 308$
$\bar{x} - 3s = 302 - 3 \times 3 = 293$
$\bar{x} + 3s = 302 + 3 \times 3 = 311$

[Figure: normal curve for the amounts (mL), marked at 293, 296, 299, 302, 305, 308 and 311, showing the 68%, 95% and 99.7% regions and 0.15% in each tail]

(b) (i) 68% × 10 000 = 6800 bottles
(ii) 95% × 10 000 = 9500 bottles
(iii) 0.3% × 10 000 = 30 bottles
(c) The largest amount would be 311 mL (3 standard deviations above the mean).
(d) 299 mL and 305 mL are 1 standard deviation either side of the mean, so the probability is 68% or 0.68.
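The arithmetic in Example 2 can be checked with a short script. This is only a sketch: the variable names are hypothetical, and the percentages are written in tenths of a percent so that the counts stay whole numbers.

```python
mean, sd, bottles = 302, 3, 10_000   # values from Example 2

# Amounts (mL) that are 1, 2 and 3 standard deviations either side of the mean.
limits = {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Expected counts from the 68%, 95% and 0.3% figures (680, 950 and 3 per thousand).
within_1 = 680 * bottles // 1000
within_2 = 950 * bottles // 1000
beyond_3 = 3 * bottles // 1000

print(limits)
print(within_1, within_2, beyond_3)
```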

Example 3

A lathe makes rods with a mean diameter of 15 mm and a standard deviation of 0.1 mm.
(a) If the diameters are normally distributed, find the percentage of rods with a diameter:
(i) greater than 15 mm
(ii) between 14.9 mm and 15.1 mm
(iii) between 14.7 mm and 15.2 mm
(iv) less than 14.8 mm
(b) What is the probability of selecting a rod at random with a diameter greater than 15.1 mm?

Solution

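Mean $\bar{x} = 15$, standard deviation $s = 0.1$

First draw a sketch showing the percentages, the mean and the values 1, 2 and 3 standard deviations from the mean. Solutions are then found by adding or subtracting percentages.

[Figure: normal curve for the rod diameters (mm), marked at 14.7, 14.8, 14.9, 15, 15.1, 15.2 and 15.3, with band percentages 34%, 13.5%, 2.35% and 0.15% on each side of the mean]

(a) (i) 50%
(ii) 68%
(iii) 2.35% + 95% = 97.35%
(iv) 0.15% + 2.35% = 2.5%
(b) Probability (diameter > 15.1 mm) = 13.5% + 2.35% + 0.15% (or 50% − 34%) = 16% or 0.16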
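The "add the bands" method used in Example 3 can be sketched as a small lookup. The names `BANDS`, `TAIL` and `percent_between` are hypothetical, and the sketch only handles limits that fall a whole number of standard deviations (from −3 to 3) from the mean, as in the example.

```python
# Percentage of scores in each one-standard-deviation band, keyed by the
# z-value at the band's left edge (so -3 means the band from -3s to -2s).
BANDS = {-3: 2.35, -2: 13.5, -1: 34.0, 0: 34.0, 1: 13.5, 2: 2.35}
TAIL = 0.15  # percentage beyond 3 standard deviations, on each side

def percent_between(z_low, z_high):
    """Percentage between two whole-number z-values; None means 'no limit'."""
    total = 0.0
    if z_low is None:        # everything below z_high, including the left tail
        total += TAIL
        z_low = -3
    if z_high is None:       # everything above z_low, including the right tail
        total += TAIL
        z_high = 3
    total += sum(BANDS[z] for z in range(z_low, z_high))
    return round(total, 2)

# Example 3: mean 15 mm, standard deviation 0.1 mm.
print(percent_between(0, None))    # greater than 15 mm
print(percent_between(-1, 1))      # between 14.9 mm and 15.1 mm
print(percent_between(-3, 2))      # between 14.7 mm and 15.2 mm
print(percent_between(None, -2))   # less than 14.8 mm
print(percent_between(1, None))    # greater than 15.1 mm
```

Converting each diameter to a whole number of standard deviations first (for instance, 15.2 mm is 2 standard deviations above 15 mm) is what makes the lookup possible.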

Hint: Draw a diagram for each question.

1. The nicotine content of a brand of cigarette was found to be normally distributed with a mean of 14.5 mg and a standard deviation of 1.2 mg.
(a) Calculate the values that are 1, 2 and 3 standard deviations from the mean.
(b) Draw a normal curve showing the percentages, mean and standard deviation values.
(c) If 20 000 cigarettes were tested, how many had a nicotine level:
(i) within 1 standard deviation of the mean?
(ii) within 2 standard deviations of the mean?
(iii) outside the 3s limit?


2. The lengths of roofing nails are normally distributed with a mean of 20 mm and a standard deviation of 3 mm.
(a) What percentage of scores lie between:
(i) 17 mm and 23 mm?
(ii) 14 mm and 26 mm?
(iii) 11 mm and 29 mm?
(b) If 10 000 nails were measured, how many would have a length:
(i) greater than 23 mm?
(ii) between 17 mm and 26 mm?
(iii) less than 14 mm?
(c) What is the probability that a roofing nail selected at random is more than 23 mm long?

3. A machine produces steel rods of mean length 148.4 mm with a standard deviation of 0.03 mm. The lengths are distributed normally.
(a) If rods more than 3 standard deviations from the mean are rejected:
(i) what percentage of rods are rejected?
(ii) what is the acceptable range of rod lengths?
(b) One customer wants to purchase only rods with a length between 148.34 mm and 148.43 mm. What percentage of rods will she accept?
(c) What is the probability that a rod selected at random will be acceptable to this customer?

4. The skull widths of a population of adults are normally distributed with a mean of 1395 mm and a standard deviation of 49 mm.
(a) Draw a normal curve showing the percentages and marking the mean and standard deviation values.
(b) If there are 8000 adults in the population, how many have a skull width:
(i) between 1346 mm and 1444 mm?
(ii) greater than 1493 mm?
(iii) between 1297 mm and 1542 mm?
(iv) outside the 3s limit?

5. The times taken to complete a crossword were normally distributed with a mean of 10 minutes and a standard deviation of 2.1 minutes. To finish the crossword, what percentage of people took:
(a) between 10 and 12.1 minutes?
(b) between 7.9 and 12.1 minutes?
(c) more than 16.3 minutes?
(d) less than 7.9 minutes?

6. The period from conception to delivery of a baby is called the gestation period. The gestation period for humans is approximately normally distributed with a mean of 266 days and a standard deviation of 16 days.
(a) What percentage of pregnancies will last more than 282 days?
(b) Between what two values will 95% of gestation periods lie?
(c) Is it possible to have a gestation period of 320 days? Justify your answer.
(d) What is the probability that a pregnancy will last less than 250 days?
(e) Within what range will the length of a pregnancy most probably lie?

7. Light globes were tested and their lifetimes found to be normally distributed with a mean of 620 hours and a standard deviation of 12 hours.
(a) If 12 000 light globes were produced:
(i) how many lifetimes fell within 1 standard deviation of the mean?
(ii) how many globes lasted between 620 hours and 644 hours?
(iii) how many globes lasted less than 608 hours?
(iv) how many globes lasted more than 644 hours?

8. Gum leaves from a variety of trees were measured and their lengths found to be normally distributed about a mean of 65 mm with a standard deviation of 8 mm.
(a) What percentage of leaves in the population were:
(i) between 49 mm and 81 mm?
(ii) more than 73 mm?
(iii) less than 41 mm?
(b) If 5400 leaves were measured, how many had a length:
(i) between 57 mm and 81 mm?
(ii) less than 57 mm?
(iii) between 65 mm and 89 mm?
(c) Between what values will the length of a leaf almost certainly (99.7% chance) lie?
(d) What is the probability of selecting a gum leaf that is between 57 mm and 73 mm?

9. Wedges of cheddar cheese are labelled as weighing 250 g. A survey found the weights to be normally distributed about a mean of 255 g with a standard deviation of 2.5 g.
(a) What percentage of wedges contain more than the labelled weight?
(b) If 6000 wedges were included in the survey, how many contained less than the labelled weight?

10. Packets of muesli have an advertised weight of 500 g. The actual weights were found to be normally distributed with a mean of 490 g and a standard deviation of 10 g.
(a) What percentage of packets would you expect to be below the advertised weight?
(b) Do you think this product is labelled fairly? Justify your answer.

Z-SCORES

A z-score or standardised score is a number that represents the position of a score relative to the mean. For example:

• a z-score of 1 represents a score that is 1 standard deviation above the mean
• a z-score of −2 represents a score that is 2 standard deviations below the mean.

The distribution of z-scores is called a standard normal distribution and has a mean of 0 and a standard deviation of 1.

[Figure: standard normal curve with z-scores −3, −2, −1, 0, 1, 2 and 3 marked on the axis]

The z-score is the number of standard deviations between a score and the mean.

$$z\text{-score} = \frac{\text{score} - \text{mean}}{\text{standard deviation}}$$

$$z = \frac{x - \bar{x}}{s}$$

[Figure: normal curve with two aligned scales: the x-scale ($\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$) and the z-scale (−3, −2, −1, 0, 1, 2, 3)]
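The z-score formula translates directly into code. The sketch below uses hypothetical function names (`z_score`, `raw_score`) and, purely for illustration, the heights from earlier in the chapter (mean 170 cm, standard deviation 8 cm).

```python
def z_score(score, mean, sd):
    """Number of standard deviations that `score` lies above (+) or below (-) the mean."""
    return (score - mean) / sd

def raw_score(z, mean, sd):
    """Inverse of z_score: the score that sits z standard deviations from the mean."""
    return mean + z * sd

# Heights with mean 170 cm and standard deviation 8 cm.
for height in (154, 170, 178, 186):
    print(height, "cm -> z =", z_score(height, 170, 8))
```

Rearranging the formula, as `raw_score` does, is useful whenever a z-score is known and the original score (or the mean) has to be recovered.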

Example 4

Milk cartons are designed to hold 600 mL of milk. Their capacities are normally distributed about a mean of 600 mL with a standard deviation of 10 mL.

(a) Calculate the z-scores for capacities:
(i) 610 mL
(ii) 595 mL
(iii) 625 mL
(iv) 600 mL
(b) Mark these z-scores on a diagram of a standard normal curve.
(c) What can you conclude about a carton of milk containing 625 mL?

Solution

(a) $z = \frac{x - \bar{x}}{s}$ where $\bar{x} = 600$ and $s = 10$.

(i) $z = \frac{610 - 600}{10} = 1$ (z = 1 means 1 standard deviation above the mean.)
(ii) $z = \frac{595 - 600}{10} = -0.5$ (z = −0.5 means 0.5 standard deviations below the mean.)
(iii) $z = \frac{625 - 600}{10} = 2.5$ (z = 2.5 means 2.5 standard deviations above the mean.)
(iv) $z = \frac{600 - 600}{10} = 0$ (The mean has a z-score of 0.)

(b) [Figure: standard normal curve with z-scores from −3 to 3 on the axis and the z-scores −0.5, 0, 1 and 2.5 marked]

(c) The z-score is 2.5, so a milk carton with a capacity of 625 mL would be most unlikely, since this capacity is over the 2s (95%) limit.

Example 5

A brand of blueberry muffin mix is labelled ‘net weight 430 g’. The net weights of packets approximate a normal distribution with a mean of 432 g and a standard deviation of 2 g.
(a) Complete the table.

Net weight (g)   428   430   432   434   436
z-score                      0

(b) What percentage of packets have a net weight:
(i) greater than 432 g?
(ii) between 430 g and 434 g?
(iii) between 428 g and 434 g?
(iv) less than the labelled net weight?
(c) If you selected a packet of this muffin mix, what is the probability that the packet would contain more than the stated weight?

Solution

(a)
Net weight (g)   428   430   432   434   436
z-score          −2    −1    0     1     2

(b) (i) 50%
(ii) 34% + 34% = 68%
(iii) 13.5% + 34% + 34% = 81.5%
(iv) 0.15% + 2.35% + 13.5% = 16%
(c) Since 16% of packets contain less than the stated weight, 84% of packets contain more than the stated weight, so the probability is 84% or 0.84.

[Figure: normal curve for the net weights, with the x-scale (428, 430, 432, 434, 436 g) aligned with the z-scale (−2, −1, 0, 1, 2) and band percentages 34%, 13.5%, 2.35% and 0.15% on each side of the mean]

Example 6

(a) Elizabeth receives a z-score of 1.8 in a class test. What does this mean?
(b) If Elizabeth’s raw score is 80 and the standard deviation of scores in the test is 12, what is the mean?
(c) If Wallace scores 50 on the same test, what is his z-score?

Solution

(a) Elizabeth’s score is 1.8 standard deviations above the class mean.

(b) $z = \frac{x - \bar{x}}{s}$

$1.8 = \frac{80 - \bar{x}}{12}$

$1.8 \times 12 = 80 - \bar{x}$

$\bar{x} = 80 - 21.6 = 58.4$

The mean is 58.4.

(c) $z = \frac{50 - 58.4}{12} = -0.7$

Wallace’s z-score is −0.7.

[Figure: standard normal curve showing Wallace’s raw score (50) at z = −0.7, the mean (58.4) at z = 0 and Elizabeth’s raw score (80) at z = 1.8]

Exercise 10-02: z-scores

1. Explain the difference between a raw score and a z-score.

2. What are the mean and standard deviation of the standard normal distribution (distribution of z-scores)?

3. What proportion of scores is shaded in each of the following diagrams?

[Diagrams (a) to (e): shaded regions under the standard normal curve, with boundaries labelled z = −1, z = 1, z = 0, z = 1, z = −2 and z = 0]

4. IQs are normally distributed with a mean of 100 and a standard deviation of 16.
(a) What are the z-scores for IQs of:
(i) 100?
(ii) 60?
(iii) 132?
(b) To join the Four Sigma Club you need an IQ with a z-score of 4 or above.
(i) What is meant by a z-score of 4?
(ii) What IQ is this?
(c) If there are 20 million people in Australia, how many have an IQ of 132 or more?

5. In a class test, the average mark was 70 and the standard deviation was 15. Kell received a z-score of 1.8 and Terry received a z-score of −0.9.
(a) Explain the meaning of each z-score in terms of the average mark and standard deviation.
(b) What were Kell and Terry’s marks in the test? Answer to the nearest mark.

6. Lollies are put into 100 g packets. The machine that performs this task is set to a mean mass of 100 g with a standard deviation of 2 g.
(a) Complete the table.

Mass (g)   94   96   98   100   102   104   106
z-score                   0

(b) What percentage of packets will have a mass:
(i) more than 100 g?
(ii) less than 96 g?
(iii) between 96 g and 104 g?
(iv) more than 106 g?
(v) between 98 g and 106 g?
(c) A quality control officer selected several packets of lollies at random and found them to have a mass of at least 110 g. What conclusions could be drawn about these packets and the machine?

7. The average age of a sample of Bingo players is 60 with standard deviation 8.
(a) If the scores were standardised, what would be the z-scores for these players?
(i) Bill: 70
(ii) Mary: 52
(iii) George: 60
(iv) Elsie: 92
(b) What percentage of people in the sample were more than 84 years old?
(c) Esma has a z-score of 2.5. How old is she?
(d) Comment on the likelihood of a 20-year-old being in this sample.

8. In an Economics test, Rodney was given a z-score of −1.2 and Natalie was given a z-score of 3.5.
(a) Who did better in the test? Justify your answer.
(b) If the test mean was 40 and the standard deviation was 15, calculate Rodney and Natalie’s test scores (to the nearest whole percentage).

9. The average weight of a flathead caught in Currumbene Creek is 1.2 kg with a standard deviation of 0.4 kg. The weights are approximately normally distributed.
(a) Complete the table.

Weight (kg)   0   0.4   0.8   1.2   1.6   2.0   2.4
z-score

(b) What percentage of flathead in the creek weigh:
(i) more than 1.6 kg?
(ii) between 0.4 kg and 1.6 kg?
(iii) less than 2 kg?
(iv) between 0.8 kg and 2.4 kg?
(c) If 8000 flathead were caught one season, how many weighed over 2 kg?
(d) If flathead weighing less than 0.8 kg are returned to the creek, how many of the 8000 flathead caught would be returned?

10. The heights of a group of preschool children are normally distributed with a mean of 90 cm and a standard deviation of 10 cm.
(a) Find the z-score (correct to 1 decimal place if necessary) for the following children:
(i) Josh: 86 cm
(ii) Kylie: 1.25 m
(iii) Simon: 950 mm
(iv) Jack: 1.01 m
(b) Sketch a standard normal curve and mark the position of each z-score.
(c) If Ann has a z-score of 2.2, how tall is she?
(d) Drew has a z-score of −1.8. How tall is he?
(e) What is the probability of a child being less than 70 cm tall?

Investigation: Using a table of areas under the normal curve

To find the proportion of scores under a normal curve between the mean and any z-score, a table of ‘Areas under the normal curve’ can be used.

[Figure: standard normal curve with the area A shaded between 0 and z]

[Table: areas under the normal curve, with columns z and A]

When z = 1, A = 0.341. This means that 34.1% of scores lie between the mean and z = 1.

When z = 0.5, A = 0.192 and because the normal curve is symmetrical, when z = −0.5, A = 0.192 also. This means that 19.2% of scores lie between the mean and z = −0.5 (and hence 30.8% of scores are less than z = −0.5).

Use the table of areas above and z-scores from Example 4 to find the percentage of milk cartons containing between 595 mL and 625 mL.
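Entries in a table of areas under the normal curve can be approximated numerically. The sketch below is one way to do this, assuming the usual link between the normal curve and the error function; `area_mean_to_z` is a hypothetical name, not a function from the text.

```python
import math

def area_mean_to_z(z):
    """Area A under the standard normal curve between the mean (z = 0) and z."""
    # abs() reflects the symmetry of the curve: z and -z give the same area.
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

for z in (0.5, 1, 1.5, 2, 2.5, 3):
    print(f"z = {z}: A = {area_mean_to_z(z):.3f}")
```

The computed areas agree with the quoted values A = 0.341 (z = 1) and A = 0.192 (z = 0.5) to within about 0.001; small differences in the third decimal place come from how printed tables are rounded.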

COMPARING NORMAL DISTRIBUTIONS

Comparing using z-scores

Suppose Anna scored 80% for Maths, 60% for Science and 70% for History in her trial examinations. Do we assume that Anna performed best in Maths? If the scores in each subject are normally distributed, we can convert the raw scores in each distribution to z-scores and compare them. This is known as standardising scores. The standardised score or z-score tells ‘how many standard deviations a score is above or below the mean’, so scores from distributions with different means and standard deviations can be compared directly.

Example 7

Anna’s marks for Maths, Science and History as well as the mean and standard deviation of marks for each subject are given in the table.

Subject   Anna’s mark   Mean mark   Standard deviation
Maths     80            75          16
Science   60            54          12
History   70            78          10

(a) Find Anna’s z-score for each subject (correct to 1 decimal place).
(b) In which subject did she perform best? Justify your answer.
(c) Represent Anna’s results using a standard normal curve.

Solution

(a) $z = \frac{x - \bar{x}}{s}$

Maths: $z = \frac{80 - 75}{16} \approx 0.3$
Science: $z = \frac{60 - 54}{12} = 0.5$
History: $z = \frac{70 - 78}{10} = -0.8$

Substituting the table values into the formula gives z-scores of Maths 0.3, Science 0.5, History −0.8.

(b) Anna’s performance is relative to how the rest of the group scored. Anna performed best in Science as she had the highest z-score for Science.

(c) [Figure: standard normal curve with Anna’s z-scores marked: History at −0.8, Maths at 0.3 and Science at 0.5]

Study tips

BE PREPARED TO COMPROMISE

When planning your study, remember that you only have a limited amount of time. It is not possible to do everything, so make the most of the time you do have and be prepared to compromise.

• Set realistic goals.
• Spend time on what’s important.
• Don’t overcommit yourself.
• Be reasonable and practical in your expectations.

If you’re making up for lost time or past failures remember this: you cannot change the past, so don’t dwell on it. What’s done is done. You only have control of your present and future. Do the best you can now. Be motivated and confident. Start studying now.

Think: Scaling of marks

Examiners often use z-scores to scale marks, given a new mean and standard deviation. In Example 7, what would be Anna’s scaled marks if each subject were normally distributed about a mean of 65 and a standard deviation of 12?

In each case $z = \frac{x - \bar{x}}{s}$ with $\bar{x} = 65$ and $s = 12$.

Maths: $0.3 = \frac{x - 65}{12}$, so $x = 68.6 \approx 69$
Science: $0.5 = \frac{x - 65}{12}$, so $x = 71$
History: $-0.8 = \frac{x - 65}{12}$, so $x = 55.4 \approx 55$

Her scaled marks would be Maths 69, Science 71, History 55.
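The standardising and scaling steps for Anna’s marks can be sketched in a few lines. The dictionary layout and the names `results`, `z_scores` and `scaled` are hypothetical; the numbers come from Example 7 and the scaling question (new mean 65, new standard deviation 12).

```python
# (mark, mean, standard deviation) for each subject, from the table in Example 7.
results = {
    "Maths": (80, 75, 16),
    "Science": (60, 54, 12),
    "History": (70, 78, 10),
}

NEW_MEAN, NEW_SD = 65, 12   # scaling parameters from the scaling question

z_scores = {}
scaled = {}
for subject, (mark, mean, sd) in results.items():
    z = round((mark - mean) / sd, 1)          # z-score, correct to 1 decimal place
    z_scores[subject] = z
    scaled[subject] = NEW_MEAN + z * NEW_SD   # rearranged from z = (x - 65) / 12

best = max(z_scores, key=z_scores.get)        # subject with the highest z-score
print(z_scores)
print({subject: round(mark) for subject, mark in scaled.items()})
print("Best subject:", best)
```

Note that the scaled marks here are computed from the rounded z-scores, which matches the working in the text; using unrounded z-scores could shift a scaled mark slightly.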

Investigating a distribution

To decide whether a distribution is approximately normal, you could:

• draw a histogram of the data and see if it approximates a ‘bell-shaped’ curve
• find the mean, median and mode and see if they are equal
• check whether about 68% of scores lie within 1 standard deviation of the mean.

[Figure: bell-shaped curve on which the mean, median and mode coincide, with 34% of scores between $\bar{x} - s$ and the mean and 34% between the mean and $\bar{x} + s$]

Example 8

Here are the times taken to complete a job application.

Time (minutes)   No. of people
20–<25           12
25–<30           41
30–<35           82
35–<40           39
40–<45           12

(a) Draw a histogram of the data.
(b) Is the data approximately normally distributed? Justify your answer.

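Solution

(a) [Figure: histogram of number of people (vertical axis, 0 to 100) against time in minutes (horizontal axis, 20 to 50)]

(b) The data is approximately normally distributed as the histogram is approximately bell-shaped.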
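Besides looking at the histogram, the ‘mean, median and mode’ check can be tried on the grouped data from Example 8. This is a rough sketch: the mean is estimated from class centres, which is an approximation for grouped data, and the names used are hypothetical.

```python
# Class intervals (minutes) and frequencies from the table in Example 8.
classes = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45)]
freqs = [12, 41, 82, 39, 12]

n = sum(freqs)
# Estimate the mean from class centres (an approximation for grouped data).
mean = sum((lo + hi) / 2 * f for (lo, hi), f in zip(classes, freqs)) / n

# Modal class: the class with the highest frequency.
modal_class = classes[freqs.index(max(freqs))]

def median_class(classes, freqs):
    """Class containing the middle score, found from cumulative frequencies."""
    running = 0
    for cls, f in zip(classes, freqs):
        running += f
        if running >= sum(freqs) / 2:
            return cls

print(f"n = {n}, estimated mean = {mean:.1f} minutes")
print("modal class:", modal_class, "median class:", median_class(classes, freqs))
```

If the estimated mean, the median class and the modal class all fall in the same central class, that supports the conclusion drawn from the bell-shaped histogram.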

1. Here are Judy’s marks in her half-yearly exams, along with the class mean and standard deviations for each subject.

Subject   Mark   Mean   Standard deviation
English   65     75     14
Maths     72     65     10
Science   58     62     16
Art       78     85     8

By standardising her scores for each subject, rank her subjects in order of her performance compared with the other students.

2. Ball bearings with a stated diameter of 10 mm are produced by two machines. The results of quality control checks on the machines are tabulated.

Diameter (mm)   Machine A   Machine B
9.4–<9.6        0           3
9.6–<9.8        1           10
9.8–<10.0       43          26
10.0–<10.2      47          30
10.2–<10.4      6           15
10.4–<10.6      2           12
10.6–<10.8      1           4

(a) Find the mean of each data set.
(b) Represent each data set in a histogram.
(c) Are the data sets normally distributed? Justify your answer.
(d) Which machine gives the better product? Justify your answer.

3. The 2-year-old model of a particular car has an average value of $12 000 with a standard deviation of $1500. The 3-year-old model of the same car has an average value of $10 500 with a standard deviation of $1200. Which is the better car: a 2-year-old model with a value of $11 000 or a 3-year-old model with a value of $9000?

4. The raw scores of three students are given in the table along with the mean and standard deviation for each subject.

Subject         Peter   Mary   Paul   Mean   Standard deviation
Visual Arts     48      60     70     66     16
Legal Studies   67      82     74     68     12
General Maths   70      79     55     70     9

The results for each subject are approximately normally distributed.
(a) Calculate the z-scores for each student.
(b) What is Paul’s best subject?
(c) Which is better: Peter’s mark in General Maths or Paul’s mark in Visual Arts?
(d) Who achieved the best result? What subject was this in?

5. The heights of 100 males and 100 females were tabulated.

Height (cm)   Male   Female
145–<150      0      2
150–<155      0      7
155–<160      1      14
160–<165      2      26
165–<170      8      27
170–<175      14     15
175–<180      28     7
180–<185      36     2
185–<190      9      0

By finding the mean and drawing a histogram of each data set, decide whether either data set is normally distributed. Give reasons for your answer.

6. Without doing any calculations, decide which distributions could be normally distributed and which are not. Sketch the shape of each distribution that is not normally distributed.
(a) weights of Sumo wrestlers
(b) ages of teachers
(c) number of people visiting the zoo per day over a month
(d) time spent waiting at the supermarket checkout taken over a 12-hour period
(e) prices of bread at different stores
(f) lifetime of light globes
(g) blood pressure of adult females
(h) pulse rate of people taken halfway through an aerobics class
(i) results of a university entrance examination
(j) birth weights of babies

7. Decide whether the following data set is normally distributed by finding the mean, median and mode and drawing a histogram of the data. Justify your answer.

Score             11–20   21–30   31–40   41–50   51–60   61–70   71–80   81–90   91–100
No. of students   14      18      15      26      20      31      44      39      43

8. Tom and George joined a Get Slim program. Tom is part of a group of men for whom the mean weight is 93 kg with a standard deviation of 7.5 kg. George belongs to another group for whom the mean weight is 83 kg with a standard deviation of 6 kg. If Tom weighs 97.2 kg and George weighs 86.4 kg, who is more overweight for their group?

9. The annual rainfall when taken over a number of years was found to be normally distributed. In Downtown the annual rainfall distribution has a mean of 950 mm and a standard deviation of 420 mm. In Uptown the mean annual rainfall is 1325 mm with a standard deviation of 238 mm. Which town is more likely to have an annual rainfall of 1100 mm? Justify your answer.

10. The stem-and-leaf plot gives the lifetime (in hours) of two brands of light globes. Decide whether the data sets are normally distributed.

              Oso Bright | Stem | Brighta Longa
                 5 5 4 2 |  10  | 3 4 5
     8 7 7 7 7 7 7 4 4 3 |  11  | 2 3 3 4 4 5 6
 9 9 8 8 7 6 6 6 5 4 4 0 |  12  | 1 2 2 3 3 4 5 5 7 9 9 9
   8 8 8 7 7 7 6 5 4 3 1 |  13  | 0 2 3 3 4 4 4 5 6 6 8 8 9
         9 8 8 8 5 5 2 2 |  14  | 1 2 2 3 5 5 6 7 8 8
                 7 7 5 1 |  15  | 0 3 3 4 6

The data shows the waiting times for 100 customers at a mechanical checkout in a supermarket. Investigate the data to see whether this sample could have come from a normal distribution.

Mechanical checkout: time (s) for each customer, listed in customer order

Customers 1–25: 62, 133, 190, 25, 80, 57, 55, 135, 150, 100, 110, 152, 27, 96, 145, 102, 35, 147, 202, 38, 53, 33, 77, 90, 38
Customers 26–50: 275, 142, 146, 77, 32, 137, 132, 47, 112, 145, 327, 130, 45, 72, 93, 5, 80, 215, 186, 85, 70, 77, 100, 72, 194
Customers 51–75: 112, 132, 93, 70, 270, 130, 85, 70, 50, 145, 35, 75, 55, 85, 60, 65, 95, 125, 160, 85, 70, 25, 80, 72, 140
Customers 76–100: 35, 157, 65, 124, 170, 160, 75, 145, 100, 97, 120, 25, 133, 52, 50, 80, 67, 120, 127, 42, 180, 70, 80, 245, 98

Investigation: What’s normal?

Collect data from a population and decide whether the data is normally distributed. Possible distributions are:

• heights of people
• lengths of a particular nail or screw
• length of TV commercials
• HSC results.

To help you decide you could:

• draw a histogram of the data to see if it approximates a bell-shaped curve
• calculate the mean, median and mode and see if they are the same
• find the standard deviation and investigate percentages.

SCATTERPLOTS

A scatterplot (or scattergram) is a graph consisting of points plotted on an x–y plane. Each point represents the values of two different variables, such as height and weight for an individual.

After plotting the points we look for a pattern, in particular a linear pattern. Here are some possible patterns.

[Figure: three scatterplots showing no pattern, a linear pattern and a non-linear pattern]

Example 9

The table gives the heights and weights of 12 students.

Student   Height (cm)   Weight (kg)
Paul      170           68
Kerrie    158           55
Zoe       150           53
Hans      164           58
Vila      177           67
Pada      186           81
Dan       173           71
Rosa      190           78
Kia       156           59
Guy       180           80
Kate      160           67
Habib     187           73

(a) Draw a scatterplot and mark the points representing Paul and Kerrie.
(b) Is there a pattern to the plotted points?
(c) Describe the relationship between the heights and weights of the students.

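Solution

(a) [Figure: scatterplot of weight (kg, vertical axis) against height (cm, horizontal axis) for the 12 students, with Kerrie (158, 55) and Paul (170, 68) labelled]

Paul is represented by the point (170, 68) as he is 170 cm tall and weighs 68 kg. Kerrie is represented by the point (158, 55) as she is 158 cm tall and weighs 55 kg.

(b) The points suggest a linear pattern.

(c) The scatterplot indicates that taller students tend to weigh more and shorter students tend to weigh less.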

Exercise 10-04: Scatterplots

Note: Graph paper is required. Keep your scatterplots for use later in this chapter.

1. The numbers of enquiries to rent or buy a house were recorded over a 10-week period at Fastasales real estate agency.

Week   Enquiries to rent   Enquiries to buy
1      16                  28
2      34                  69
3      70                  86
4      58                  80
5      56                  90
6      25                  28
7      38                  49
8      46                  62
9      62                  81

(a) Plot the points comparing enquiries to rent with enquiries to buy on a scatterplot.
(b) Does there appear to be a pattern to the plotted points?
(c) Describe the relationship between the number of enquiries to rent or buy.

2. The number of umbrellas sold at a certain shopping centre and the amount of rain per week in the area were recorded over 8 weeks.

No. of umbrellas sold   Rainfall (mm)
10                      60
35                      120
12                      56
31                      58
4                       145
5                       75
42                      48
18                      90

(a) Plot the points on a scatterplot.
(b) Is there a linear pattern to the plotted points?
(c) Does this agree with your expectations?
(d) If not, what sort of relationship did you expect between the number of umbrellas sold and the rainfall?

3. Data was collected on the number of beers consumed per week and the number of cigarettes smoked per day by the population of Wheelabarraback. Here are the results for a random sample of 10 people.

Beers   Cigarettes
1       0
4       16
8       9
5       10
6       5
10      14
12      18
0       0
9       14
3       2

(a) Plot the points on a scatterplot.
(b) Is there a pattern to the plotted points?
(c) Using your scatterplot, what can you say about the drinking and smoking habits of the people in Wheelabarraback?

4. In Bubsville hospital, the weights of 10 babies were taken at birth and again 1 month later. The weight gain in the first month and the birth weight were analysed to determine if there was any relationship between these two variables.

Birth weight (to nearest 10 g)   First month weight gain (to nearest 10 g)
3010                             1740
2850                             1470
4220                             300
3980                             560
3640                             420
3230                             1210
4370                             560
4220                             780
2760                             1600

(a) Plot each pair of values on a scatterplot.
(b) Is there a linear pattern to the plotted points?
(c) Describe the relationship between a baby’s birth weight and the weight gain in the first month.

5. It was decided to test the theory that ‘growing old

affects a woman’s haemoglobin level’. The ages of 20 women and their haemoglobin levels (in grams per decilitre) were recorded.

(a) Plot the pairs of values on a scatterplot. (b) Do the points form a linear pattern? (c) Does the scatterplot support the theory that

‘growing old affects a woman’s haemoglobin level’? Justify your answer.

CORRELATION

Correlation is a measure of the strength of a relationship between two variables, say x and

y. We say x and y are correlated if a scatterplot shows a linear pattern to the plotted points

(x, y). If no pattern exists, the variables are not correlated.

Describing correlation

The variables x and y have a positive correlation if y increases as x increases.

The variables x and y have a negative correlation if y decreases as x increases.

The correlation is high if the points are close together and low if the points are more spread out. It is perfect if all points lie on a straight line.

Data for question 5:

Age (years)    Haemoglobin level (g/dL)
20             11.1
22             10.7
25             12.4
28             14.0
28             13.1
31             10.5
32             9.6
35             12.5
38             13.5
40             13.9
45             15.1
49             13.9
54             16.2
55             16.3
57             16.8
60             17.1
62             16.6
63             16.9
65             15.7
67             16.5

[Figures: two scatterplots of y against x, one showing positive correlation and one showing negative correlation]


Example 10

Describe the relationship shown in each scatterplot by one of the following: perfect positive correlation, perfect negative correlation, high positive correlation, high negative correlation, low positive correlation, low negative correlation, no correlation.

Solution

(a) perfect positive correlation (b) high negative correlation

(c) no correlation (d) low positive correlation

(e) perfect negative correlation (f) low negative correlation (g) high positive correlation

Example 11

For each pair of variables, state whether they have a positive, negative or no correlation. (a) height and weight of an individual

(b) salary level and pulse rate

(c) price of a car and number of cars sold (d) exam scores and time spent revising

Solution

(a) positive correlation (as height increases, weight usually increases) (b) no correlation

(c) negative correlation (as prices increase, sales usually decrease)

(d) positive correlation (as time spent revising increases, exam scores usually increase)

n rainfall and car accidents

n education level and annual salary

n age and last two digits of phone number

n price of goods and number of goods sold

n home loan interest rates and number of houses sold

n age of employee and number of sick days

n number of police and number of crimes committed

n number of churches and number of hotels in a country town

n number of supermarket items and amount of bill

[Scatterplots (a) to (g) for Example 10]


Correlation and causality

If two variables have a high correlation, this does not necessarily mean that one variable

causes the other. For example, these variables have a high correlation, but the relationship is

not causal:

n length of right arm and length of left arm

n consumption of beer and consumption of soft drink

n price of new cars and annual sales of new cars

n number of babies born and number of storks hatched.

In each case, the first variable does not cause the second variable to happen. There is usually some external factor to explain the high correlation, such as age, time, economic situation, physical or mental attributes, or weather conditions.

What external factors might cause the variables above to have a high correlation?

Correlation coefficient

The correlation coefficient (r) is a measure of the strength of the relationship between two variables. It varies from −1 to +1.

n A correlation of r = 0 means there is no correlation.

n A correlation of r = +1 or r = −1 means there is a perfect positive or negative correlation.
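Although you are not expected to calculate r by hand, a short program shows how the value behaves. The chapter gives no formula, so the sketch below assumes that r is Pearson's product-moment correlation coefficient, the usual meaning of the term. The function name `correlation_coefficient` is made up for this illustration, and the data is the beers and cigarettes sample from the scatterplot exercises.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Pearson's r for paired data (an assumed definition; the chapter gives no formula)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return sxy / sqrt(sxx * syy)


# Beers per week and cigarettes per day for the sample of 10 people
beers = [1, 4, 8, 5, 6, 10, 12, 0, 9, 3]
cigarettes = [0, 16, 9, 10, 5, 14, 18, 0, 14, 2]

r = correlation_coefficient(beers, cigarettes)
print(f"r = {r:.2f}")
```

Whatever data is used, the result always lies between −1 and +1, and points lying exactly on a rising or falling straight line give +1 or −1.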

Example 12

Draw a neat sketch to represent the relationship between two variables with a correlation coefficient (r) of:

(a) 0 (b) 0.9 (c) −1 (d) 0.2

Solution

Note: You are not expected to calculate the actual value of r.

1. Use your own words to write down the meaning of correlation.

2. Describe the relationship between two variables that have a correlation of zero.

3. Write down two variables that would have a correlation of zero.

4. Explain the difference between positive and negative correlation.

5. Do the following variables have a positive or negative correlation?

(a) number of cigarettes smoked and incidence of lung cancer (b) calories consumed and weight gained

(c) cost of a shirt and change from $100

(d) number of police on patrol and number of petty thefts (e) alcohol consumption and safe driving

[Sketches for the solution to Example 12]

(a) r = 0: no correlation

(b) r = 0.9: high positive correlation

(c) r = −1: perfect negative correlation

(d) r = 0.2: low positive correlation


6. Which of these pairs of variables have no correlation?

(a) car number plate and age of driver (b) length of arm and length of leg

(c) age when married and swimming ability

(d) number of CDs sold and number of teenagers in area (e) number of breakdowns and age of a vending machine

7. Each of these situations shows a high correlation. Which ones do not involve causality?

(a) heights of primary school children and their spelling ability (b) driving speed of a vehicle and the amount of petrol used

(c) number of balls served and the number of balls received in tennis (d) number of parking fines and the number of road accidents (e) closeness to Easter and the amount of chocolate sold

8. Describe the relationship shown in each scatterplot below by one of the following:

perfect positive correlation, perfect negative correlation, high positive correlation, high negative correlation, low positive correlation, low negative correlation, no correlation

9. Draw a scatterplot to represent the relationship between two variables with a correlation

coefficient (r) of:

(a) 1 (b) 0.7 (c) −1 (d) 0 (e) −0.2

10. A real estate agent wondered if there was any correlation between the time a salesman

had been with the company and the amount of real estate sold per week. He found the correlation to be −0.78. What conclusion could he draw?

11. The correlation between the daily temperature and the amount of ice-cream sold had a

correlation of 0.9. Describe the relationship between temperature and ice-cream sales.

12. The annual beer consumption and the amount of petrol purchased per annum had a

correlation of 0.85. Explain how this could be so.

[Scatterplots (a) to (d) for question 8]


REGRESSION LINES

For scatterplots displaying variables with a high positive or negative correlation, a line of

best fit can be drawn to approximate the plotted points.

The technique of applying a line of fit to a scatterplot is called linear regression, and a line of fit is called a regression line.

The word regress means to ‘go back’, so we can think of the points ‘going back to be on the regression line’ so that the equation of the regression line represents all points in the distribution.

Line of best fit

We have already constructed lines of best fit in Chapter 1 and in the Preliminary Course. A line of best fit drawn ‘by eye’:

n represents most of the points as closely as possible

n has roughly half of the outlying points above it and half below it

n is drawn so that the distance between each outlying point and the line is kept at a minimum.

Just for the record

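HEIGHTS OF FATHERS AND SONS

In 1889 Galton reported that the heights of fathers and their sons are positively correlated, but that sons’ heights tend to ‘go back’ (regress) towards the average. Very tall fathers tended to have sons who were tall but shorter than themselves, and very short fathers tended to have sons who were short but taller than themselves. If this were not the case, the tallest families would just keep getting taller and taller!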

[Figures: two scatterplots, one showing high positive correlation and one showing high negative correlation, each with a regression line drawn]


Example 13

In Example 9, a scatterplot was drawn for the following heights and weights of 12 students.

(a) Draw a line of best fit.

(b) Determine the gradient of the line.

(c) Find the equation of the line.

Solution

(b) Gradient $= \dfrac{\text{rise}}{\text{run}} = \dfrac{80 - 53}{190 - 150} = \dfrac{27}{40} = 0.675$

(c) Equation of line is w = mh + b.

To find b, substitute m = 0.675, h = 150, w = 53.

53 = 0.675 × 150 + b

b = −48.25

Equation of the line of best fit is w = 0.675h − 48.25.
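The working in Example 13 can be checked with a few lines of code. This is only a sketch: the function name `line_through` is made up for the illustration, and it assumes the line of best fit passes exactly through the two points read from the graph, (150, 53) and (190, 80).

```python
def line_through(p1, p2):
    """Return (gradient, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no gradient")
    m = (y2 - y1) / (x2 - x1)   # gradient = rise / run
    b = y1 - m * x1             # substitute one point into y = mx + b
    return m, b


m, b = line_through((150, 53), (190, 80))
print(f"w = {m:.3f}h {b:+.2f}")   # prints: w = 0.675h -48.25
```

A different pair of points read from a hand-drawn line would give a slightly different equation, which is why lines drawn ‘by eye’ vary from person to person.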

Median regression line

A line of best fit drawn ‘by eye’ varies depending on the person drawing it. A more precise line of best fit is the median regression line or median–median line. Constructing this line requires a special technique. To examine this technique we must look at how to find a

median point.

Height, h (cm) 170 158 150 164 177 186 173 190 156 180 160 187

Weight, w (kg) 68 55 53 58 67 81 71 78 59 80 67 73

(a) [Scatterplot: Weight, w (kg) against Height, h (cm), with the line of best fit drawn through (150, 53) and (190, 80); rise = 27, run = 40]


Finding the median of a group of points

To find the median point of a group of points, find the median of the x-coordinates and the median of the y-coordinates. The median point has these coordinates.

Example 14

Find the median of the points (2, 2), (4, 1), (6, 4) and (3, 5).

Solution

Arrange x-coordinates in order:

2 3 4 6

Median $= \dfrac{3 + 4}{2} = 3.5$

Arrange y-coordinates in order:

1 2 4 5

Median $= \dfrac{2 + 4}{2} = 3$

Median point is (3.5, 3).
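The same calculation can be sketched in code. The function name `median_point` is made up for this illustration; `median` comes from Python's standard `statistics` module, which averages the two middle values when the number of values is even, just as in the working above.

```python
from statistics import median


def median_point(points):
    """Median of the x-coordinates paired with the median of the y-coordinates."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (median(xs), median(ys))


print(median_point([(2, 2), (4, 1), (6, 4), (3, 5)]))   # prints: (3.5, 3.0)
```

Notice that the median point does not have to be one of the original points.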

When the points from Example 14 are viewed in the x-direction, the middle points are (3, 5) and (4, 1).

Draw a vertical line halfway between them.

When the points are viewed in the y-direction, the middle points are (2, 2) and (6, 4).

Draw a horizontal line halfway between them.

The point of intersection of the two lines is the median point (3.5, 3).

Constructing a median regression line

To construct a median regression line:

Step 1 Divide the points on a scatterplot into three equal groups (or at least make the first

and last groups equal).

Step 2 Find the median point of each group and mark with a cross (+).

Step 3 Place a transparent ruler joining the medians of the first and last groups.

Step 4 By estimating, slide your ruler one-third of the way towards the median of the middle group, keeping the same gradient, and draw a line. This is the median regression line.

[Diagram for Example 14: the points (2, 2), (4, 1), (6, 4) and (3, 5) plotted on x- and y-axes, with the median point (3.5, 3) marked]

[Diagram, ‘Finding the median from a diagram’: the same points with a vertical line halfway between (3, 5) and (4, 1) in the x-direction and a horizontal line halfway between (2, 2) and (6, 4) in the y-direction, meeting at the median point (3.5, 3)]


Example 15

(a) Construct a median regression line given the heights and weights of the 12 students from Example 13.

(b) Find the equation of the line.

(c) Compare its equation with the line of fit we found in Example 13.

Solution

(a) First divide the scatterplot into three groups of 4 points and find the median point of each group. These are marked on the diagram as M1, M2 and M3.

Now place a transparent ruler joining M1 and M3 and slide it one-third of the way towards M2 and draw the line. In this case, M2 is very close to the line joining M1 and M3.

Height, h (cm) 150 156 158 160 164 170 173 177 180 186 187 190

Weight, w (kg) 53 59 55 67 58 68 71 67 80 81 73 78

[Scatterplot, ‘Divide the points into three groups’: Weight, w (kg) against Height, h (cm), with the median points marked: M1 = median of 1st group (157, 57), M2 = median of 2nd group (171.5, 67.5), M3 = median of 3rd group (186.5, 79)]

[Scatterplot: the same axes showing M1, M2 and M3 with the median regression line drawn]


(b) Since the median regression line is parallel to the line through M1 and M3, it has the same gradient.

Gradient $m = \dfrac{79 - 57}{186.5 - 157} \approx 0.75$, using M1(157, 57) and M3(186.5, 79).

Equation of line is w = mh + b.

To find b, substitute m = 0.75, h = 182, w = 76, since (182, 76) is a point on the line.

76 = 0.75 × 182 + b

b = −60.5

Equation of the median regression line is w = 0.75h − 60.5.

(c) The equation of the line of fit found in Example 13 was w = 0.675h − 48.25. The gradients of both lines are similar, but the vertical intercepts are different.
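The four construction steps can also be sketched as a short program. Sliding the ruler one-third of the way towards M2 without changing its gradient is the same as moving the vertical intercept one-third of the way from the intercept of the line through M1 and M3 towards the intercept of the parallel line through M2. The function names are made up for this illustration, and the rule used to choose group sizes when the number of points is not a multiple of 3 is an assumption: the text asks only that the first and last groups be equal.

```python
from statistics import median


def median_point(points):
    """Median of the x-coordinates paired with the median of the y-coordinates."""
    return (median([x for x, _ in points]), median([y for _, y in points]))


def median_regression_line(points):
    """Return (gradient, intercept) of the median regression line."""
    pts = sorted(points)                      # Step 1: order the points by x
    n = len(pts)
    if n < 3:
        raise ValueError("need at least 3 points")
    k, extra = divmod(n, 3)
    outer = k + 1 if extra == 2 else k        # assumed rule: outer groups equal in size
    groups = (pts[:outer], pts[outer:n - outer], pts[n - outer:])
    (x1, y1), (x2, y2), (x3, y3) = (median_point(g) for g in groups)   # Step 2
    if x1 == x3:
        raise ValueError("outer median points give a vertical line")
    m = (y3 - y1) / (x3 - x1)                 # Step 3: gradient of the line through M1 and M3
    b_outer = y1 - m * x1                     # intercept of the line through M1 and M3
    b_middle = y2 - m * x2                    # intercept of the parallel line through M2
    b = b_outer + (b_middle - b_outer) / 3    # Step 4: slide one-third of the way towards M2
    return m, b


heights = [150, 156, 158, 160, 164, 170, 173, 177, 180, 186, 187, 190]
weights = [53, 59, 55, 67, 58, 68, 71, 67, 80, 81, 73, 78]

m, b = median_regression_line(list(zip(heights, weights)))
print(f"w = {m:.3f}h {b:+.2f}")   # prints: w = 0.746h -60.19
```

The result is close to, but not the same as, the worked answer w = 0.75h − 60.5, because the worked solution rounds the gradient to 0.75 and reads a point on the line from the graph.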

Example 16

The number of cars inspected by 6 employees at a motor registry in a 2-hour period is shown in the table.

(a) Draw a scatterplot and construct the median regression line.

(b) Predict the number of cars that an employee who has worked for 8 weeks at the motor registry would inspect in the given 2-hour period.

(c) Is it possible to predict the number of cars that would be inspected by an employee who has worked at the motor registry for 200 weeks? Justify your answer.

Solution

Data for Example 16:

No. of weeks employed    No. of cars inspected
2                        13
9                        20
1                        24
6                        14
12                       20

(a) [Scatterplot: No. of cars inspected (C) against No. of weeks (W), showing the points (1, 24), (2, 13), (6, 14), (9, 20) and (12, 20), the median points M1 = (1.5, 18.5), M2 = (6, 14) and M3 = (10.5, 20), and the median regression line]

Gradient $m = \dfrac{20 - 18.5}{10.5 - 1.5} \approx 0.17$, using M1(1.5, 18.5) and M3(10.5, 20).

Equation is C = mW + b.

From graph, vertical intercept b = 16.5.

The equation of the median regression line is C = 0.17W + 16.5.

(b) When W = 8, C = 0.17 × 8 + 16.5 = 17.86

The employee would inspect about 18 cars in the 2-hour period.

(c) When W = 200, C = 0.17 × 200 + 16.5 = 50.5

It is not feasible that someone could inspect 51 cars in a 2-hour period. The linear regression equation should only be used to predict a value of C for a value of W between 1 and 12: that is, within the given range of W values.
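The point made in Example 16, that a regression equation should only be trusted within the range of the original data, can be built into a prediction. The sketch below uses the equation C = 0.17W + 16.5 and the range W = 1 to 12 from the example. The function name `predict_cars` and its parameters are made up for this illustration.

```python
def predict_cars(weeks, gradient=0.17, intercept=16.5, w_min=1, w_max=12):
    """Return (predicted cars, True if weeks lies within the range of the data)."""
    cars = gradient * weeks + intercept
    in_range = w_min <= weeks <= w_max
    return cars, in_range


for w in (8, 200):
    cars, in_range = predict_cars(w)
    note = "within the data range" if in_range else "outside the data range, so unreliable"
    print(f"W = {w}: C = {cars:.2f} ({note})")
```

The equation still produces a number for W = 200, but the range check marks it as a prediction that should not be relied on.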

1. The number of bushfires and the annual rainfall were recorded over 8 years.

(a) Plot the data on a scatterplot and construct a median regression line. (b) Find the equation of this regression line.

(c) Use your equation to predict the number of bushfires in a year with a rainfall of:

(i) 8 cm (ii) 25 cm

2. Your Choice and Good Buy magazines ranked 8 brands of mobile phone out of 10.

(a) Plot the values on a scatterplot and construct a median regression line.

(b) Find the equation of the line.

(c) Use your equation to predict the ranking given by Good Buy magazine if Your

Choice gave a ranking of 5.

3. Mike the mechanic kept a record of the yearly service bill of a popular model of car of

various ages. A random sample of 10 cars is shown.

(a) Plot the points on a scattergram and draw a line of best fit. (b) What is the gradient of the line and what does it represent? (c) What is the equation of the line?

(d) What is the predicted service bill for an 8-year-old car?

(e) Can we use the equation to predict the service bill for a 1-year-old car?

4. The betting office at Bing Bong racetrack wanted to see if there was any relationship

between the number of people in attendance at the race meetings and the amount of money invested with the bookmakers. Eight race meetings were selected at random.

(a) Plot the data on a scatterplot and draw a median regression line. (b) What is the equation of the line?

(c) What is the predicted investment for a race crowd of 30 000?

(d) If 20 000 people attended a race meeting, what amount could the bookmakers expect to take in investments?

Table for question 1:

Rainfall (cm)       19  23  24  7   22  27  20  16
No. of bushfires    24  9   12  32  18  10  21  20

Table for question 2:

Brand          Talkie  Phonome  Tella  Betta  Yackon  u-ring  i-ring  Chatta
Your Choice    1       4        6      3      3       2       7       8
Good Buy       2       3        5      1      4       2       3       5

Table for question 3:

Age (years)         3    5    8    7    10    12    7     2    9     8
Service bill ($)    180  640  760  980  1440  1600  1250  120  1300  1120

Table for question 4:

Attendance (× 1000)          12   14   15   20   18   22   15   21
Amount invested (× $1000)    150  180  210  260  260  290  180  300


5. A home loan company wanted to see if there was a

relationship between the home loan interest rate and the number of home loan applications. A random sample of applications was taken from the last 5 years, as shown.

(a) Draw a scatterplot of the data and draw a median regression line.

(b) Find the equation of this line.

(c) How many applications could the bank expect if the interest rate were 9.4%?

6. Ten war veterans were tested for hearing loss, and the results recorded.

(a) Plot the points on a scatterplot and draw a line of best fit. (b) Determine the equation of this line.

(c) Use your equation to predict the hearing loss of a war veteran aged:

(i) 75 (ii) 80 (iii) 64

7. The amounts of energy, E kJ (kilojoules), used per day by 8 men of various weights,

W kg, are shown in the table.

(a) Graph this data on a scatterplot and draw a line of best fit. (b) What is the equation of the line?

(c) Calculate the energy used per day by a man of weight:

(i) 100 kg (ii) 85 kg (iii) 120 kg

8. A restaurant owner compared the electricity used for heating with the daily maximum

temperature over 10 days, selected at random.

(a) Graph this data on a scatterplot and construct a median regression line. (b) Find the equation of the line.

(c) Use your equation to predict the amount of electricity used for heating on a day when the maximum temperature is:

(i) 30°C (ii) 20°C (iii) 15°C

(d) What conclusion can you draw about the relationship between temperature and electricity?

Table for question 6:

Age                  75  69  57  84  68  58  60  77  82  74
% of hearing lost    59  70  28  71  72  76  52  60  80  70

Table for question 7:

Weight, W (kg)    78    89    92    95    101   115   118   125
Energy, E (kJ)    1610  1820  1650  2030  1860  1970  2100  2340

Table for question 8:

Maximum temperature (°C)           16  32  20  15  25  30  29  22  24  14
Electricity used (× 1000 units)    38  19  40  42  28  16  21  24  25  40

Table for question 5:

Interest rate (%)    No. of applications
8.3                  50
9.7                  42
10.4                 25
9.5                  32
8.0                  33
9.3                  41
10.8                 20


9. Use the scatterplot from Exercise 10-04, question 4 comparing the birth weights of

10 babies at Bubsville hospital with their weight gains in the first month. (a) Draw a line of best fit.

(b) Find the equation of this line.

(c) If a newborn weighed 3 kg at birth, how many grams in weight would you expect it to gain in the first month?

(d) What is the predicted weight at 1 month of age of a baby weighing 3285 g at birth?

Technology: Median regression line on a graphics calculator

The data points from question 1 in Exercise 10-06 can be entered in a graphics calculator. Enter the rainfall values into List 1 and the bushfire values into List 2 and draw a scatterplot. There are two choices of regression line:

n The linear regression graph (x) has equation y = −1.148x + 40.923.

n The median regression line (med-med) has equation y = −1.75x + 53.42.

Also investigate whether your spreadsheet program constructs regression lines.
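The calculator's linear regression equation for the rainfall and bushfire data can be reproduced in a few lines of code. This sketch assumes that the calculator's linear regression option fits the ordinary least-squares line, which is the usual meaning but is not stated in the text. The function name `least_squares_line` is made up for the illustration.

```python
def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x      # the line passes through the point of means
    return m, b


rainfall = [19, 23, 24, 7, 22, 27, 20, 16]     # rainfall (cm), List 1
bushfires = [24, 9, 12, 32, 18, 10, 21, 20]    # number of bushfires, List 2

m, b = least_squares_line(rainfall, bushfires)
print(f"y = {m:.3f}x + {b:.3f}")   # prints: y = -1.148x + 40.923
```

The least-squares line uses every point in its calculation, while the median regression line depends only on the three median points, so the two equations differ.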

Study tips

READ THE KEYWORDS IN AN EXAM QUESTION

Find: Use a mathematical method (usually algebraic) to find a number, expression or

formula.

Prove: Prove that a given result is true using a rule or formula and give reasons. You are

usually told the answer, but you cannot use it in your proof.

Hence, prove that: Prove the result using the information or answer given in the previous

part of the question. The parts of this question are related, and the examiner is guiding you through a proof, step-by-step, giving hints along the way.

Hence, or otherwise: You can use the information in the previous part of the question or

use some other method to find the answer.

Calculate: The answer is a number.

Evaluate: Find the value of an algebraic expression. The answer is a number.

Describe/explain: Write the answer in a sentence or two in your own words.

Write down/state: Write the answer or quote the formula. This is a memory-recall

question—no working or explanation required.

Simplify: Reduce to a smaller, neater form.

Expand: Rewrite without brackets, for example, $4x^2(2x - 3) = 8x^3 - 12x^2$.


Use a scatterplot to ‘find a friend’

1. Get a list of the top 10 movies or CDs.

2. Rank these in order of how you like them.

3. Work in pairs and draw a scatterplot with your set of rankings plotted on the x-axis and

the other person’s rankings on the y-axis.

4. Is there a pattern to the points? Is it a linear pattern? Is the correlation positive or

negative? Is it high or low?

A high positive correlation, for example, means that you and your partner are likely to like the same movies or music and hence would probably make good friends.

Predicting Olympic performances

The modern Olympic Games started in 1896 and have been held every 4 years except during the world wars. The data shows the winning distances (in inches) for long jump and high jump from 1896 to 1992.

1. Draw separate scatterplots for high jump and long jump.

2. Is there a linear pattern?

3. Can you predict the Olympic performances for the years 2004 and 2012?

Year    Long jump (in)    High jump (in)
1896    249.75            71.25
1900    282.875           74.8
1904    289               71
1908    294.5             75
1912    299.25            76
1920    281.5             76.25
1924    293.125           78
1928    304.75            76.375
1932    300.75            77.625
1936    317.3125          79.9375
1948    308               78
1952    298               80.32
1956    308.25            83.25
1960    319.75            85
1964    317.75            85.75
1968    350.5             88.25
1972    324.5             87.75
1976    328.5             88.5
1980    336.25            92.75
1984    336.25            92.5
1988    343.25            93.5
1992    342.5             92


Chapter review

The normal distribution and correlation

1. The normal distribution 2. Z-scores

3. Comparing normal distributions 4. Scatterplots

5. Correlation 6. Regression lines

This chapter, The normal distribution and correlation, extended earlier work on the shape and features of a display and drawing lines of best fit. You should now be able to recognise a normal distribution and its properties and compare two normal distributions using

z-scores. Correlation and regression were new concepts, as was the idea of using the

equation of a median regression line to make predictions. Make a note of the new terms used in this chapter in your summary.

Make a summary of this topic. Use the chapter outline above as a guide. An incomplete mind map has also been started below. Use your own words, symbols, diagrams, boxes and reminders. Use the questions in Your say below to think about your understanding of the topic. Gain a ‘whole picture’ view of the topic and identify any weak areas.

Topic summary

[Mind map: ‘Normal distribution and correlation’ in the centre, with branches for z-scores, Scatterplots, Correlation and Median regression line]


n Have you satisfied the outcomes listed at the front of this chapter?

n What was the most important thing that you learned?

n How did you feel about the topic? Did you enjoy it?

n What was new?

n What are your weaknesses? What will you need to study more?

n How will you revise and summarise this topic?

1. A machine produces bolts of mean length 75 mm with standard deviation 2 mm. If the lengths are normally distributed, and bolts more than 3 standard deviations from the mean are rejected, would the following bolts be accepted or rejected?

(a) 75 mm (b) 90 mm (c) 64 mm (d) 81 mm (e) 69 mm

2. Cylindrical rods are produced by a machine and the diameters are normally distributed with a mean of 4.00 cm and a standard deviation of 0.02 cm. Complete the following.

(a) 68% of rods will have a diameter between ____ and ____.

(b) Rods with a diameter greater than ____ will be rejected.

(c) 95% of rods will have a diameter between ____ and ____.

(d) Rods with a diameter less than ____ will be rejected.

3. The efficiency of a machine is a measure of how accurately it performs. A machine is

tested over a period of time for efficiency and found to have an efficiency of 80% with a standard deviation of 3%. Two further tests later show the efficiency to be 84% and 74%. Are these efficiency values acceptable? Justify your answer.

4. The adult population of Dimboola has a mean height of 160 cm with a standard

deviation of 5 cm. Assuming the distribution of heights to be normal, what conclusion could you draw about an adult of height:

(a) 163 cm? (b) 130 cm? (c) 180 cm?

5. The amounts of washing powder in 1 kg boxes were found to follow a normal

distribution with a mean of 1010 g and standard deviation of 10 g. What percentage of boxes are likely to be:

(a) underweight? (b) overweight?

6. Cholesterol levels of males in the hospitality industry follow a normal distribution

with a mean of 5.8 mmol/L (millimoles per litre) and a standard deviation of 0.3 mmol/L. If the recommended level is less than 5.5 mmol/L, what percentage of males have a healthy cholesterol level?

7. What type of correlation, positive, negative or no correlation, would you expect from

each pair of variables?

(a) foot size and having good teeth

(b) hours spent watching TV and Maths test results (c) height and weight of individuals

(d) age and amount of pocket money

Your say: Reflecting about the topic


8. Match the scatterplots to the correlations.

A. low positive correlation B. high negative correlation C. perfect positive correlation D. no correlation

E. perfect negative correlation

9. The time taken to repair computer components was monitored and found to be

approximately normally distributed with a mean of 8 hours and standard deviation of 2 hours.

(a) Complete the table.

(b) What percentage of repairs took:

(i) between 10 and 14 hours? (ii) less than 6 hours? (iii) more than 12 hours? (iv) between 4 and 10 hours?

10. (a) Leo received a z-score of 1.5 in an assessment task. What does this mean?

(b) If the task mean was 68 and Leo’s raw score was 82, what was the task standard deviation (to 1 decimal place)?

(c) What was Acacia’s z-score (to 1 decimal place) if she scored 55 for the task? (d) What was Max’s score (to the nearest whole number) if his z-score was 0.8?

11. Boxes of 100 apples were stored in a controlled atmosphere for various periods.

Twelve cases were selected at random and the numbers of good saleable apples recorded.

(a) Plot the values on a scattergram and draw a line of best fit. (b) Find the equation of the line.

(c) How many good apples would you find in a box that had been stored for:

(i) 4 weeks? (ii) 10 weeks?

(d) After how long in storage will a box contain only 50% good apples? (e) Is it possible to predict how many good apples would be in a box stored for

20 weeks? Justify your answer.

(f) A box was opened and found to contain 15 rotten apples. How long would it have been in storage?

Table for question 9:

Time (h)    2  4  6  8  10  12  14
z-score

Table for question 11:

Weeks in storage      13  2   3   5   4   7   15  10  9   11  3   12
No. of good apples    49  98  96  85  89  68  60  72  69  54  87  58

[Scatterplots (a), (b) and (c) for question 8]


12. The systolic and diastolic blood pressures of 10 patients selected at random were

recorded.

(a) Plot the values on a scatterplot and draw a median regression line. (b) Find the equation of the line.

(c) What diastolic blood pressure should a person have if their systolic blood pressure is:

(i) 106? (ii) 124? (iii) 133?

13. Ten children at a childcare centre were selected

at random, and their age and length of their afternoon nap were recorded.

(a) Plot the values on a scattergram and draw a median regression line.

(b) Is the correlation between age and sleeping time:

(i) positive or negative? (ii) high or low?

(c) What is the equation of the median regression line?

(d) Are you able to use this equation to predict the time spent sleeping by a:

(i) 4-year-old child? (ii) 10-year-old child? (iii) 18-month-old child?

Justify your answers.

14. The weights of a group of people and their times to run 100 m were recorded and

analysed. The correlation coefficient, r, was found to be −0.8. Explain what this means in your own words.

15. The table below shows enquiries to rent or buy a house over a 10-week period, at

Fastasales real estate agency.

(a) Plot the data on a scatterplot and construct a median regression line. (b) Find the equation of this line.

(c) Is it possible to predict the number of enquiries to buy a house if you are given the number of enquiries to rent a house? Justify your answer.

Table for question 12:

Systolic (mm Hg)     120  190  110  130  115  172  100  148  132  112
Diastolic (mm Hg)    80   100  72   78   70   96   66   90   81   74

Table for question 15:

Enquiries to rent    16  34  70  58  56  25  38  46  62  49
Enquiries to buy     28  69  86  80  90  28  49  62  81  53

Age of child (years)

Time spent sleeping (minutes)

3 34

4 25

3 30

5 10

4 19

3 32

4 20

5 15

3 26

1 2

--1 2