Stats IB 2013.doc


Lesson 1: The Binomial Distribution

Investigation

A single die is rolled 6 times. Determine the probability of correctly guessing the first outcome, then incorrectly guessing the next 5 outcomes. (Example: correct, incorrect, incorrect, incorrect, incorrect, incorrect)

Determine the probability of correctly guessing any 1 outcome and incorrectly guessing 5 outcomes.

A Binomial Experiment is one in which each trial has only 2 possible outcomes.

• Only success (getting what you want) or failure (not getting what you want) is possible.

• The probabilities of success and failure are complementary: they add to 1.

• The probabilities of success and failure are constant from trial to trial.

If a binomial distribution has a probability of success, p, and a probability of failure, (1 - p), on any given trial, and n trials are performed, then the probability of exactly k successes is

$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$

where n is the number of trials and k is the number of successes in the n trials.

This is done on the calculator by using binompdf(n, p, k).
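As a cross-check on the calculator, the binomial formula can be sketched in Python. The function name `binom_pmf` is an illustrative choice, not something from this handout, and its argument order simply mirrors the calculator command.

```python
from math import comb


def binom_pmf(n: int, p: float, k: int) -> float:
    """P(X = k) for n trials with success probability p on each trial."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Investigation: guess exactly 1 of the 6 die rolls correctly (p = 1/6).
one_correct = binom_pmf(6, 1 / 6, 1)

# Vince Carter problem: exactly 13 of 15 foul shots made at p = 0.85.
thirteen_made = binom_pmf(15, 0.85, 13)

print(round(one_correct, 4), round(thirteen_made, 4))
```

Summing `binom_pmf(n, p, k)` over k = 0, 1, ..., n should always give 1, which is a quick sanity check on any hand calculation.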

Problem: Vince Carter has a foul shooting percentage of 85%. In a particular game, he shoots 15 foul shots. Determine the probability that he successfully shoots

i) 13 foul shots.


Try 1: In the production of computer chips, the probability of any particular chip being defective is 10%. If a company orders 25 chips,

a) Determine the probability that this company receives exactly 2 defective chips

b) Determine the probability that this company receives at least 2 defective chips
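"At least" questions are usually handled with the complement: P(X ≥ k) = 1 - P(X ≤ k - 1). The sketch below applies this to Try 1. The helper names are illustrative, and `binom_cdf` plays the role of a cumulative binomial command on the calculator.

```python
from math import comb


def binom_pmf(n, p, k):
    """P(X = k) for a binomial distribution."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(n, p, k):
    """P(X <= k): add up the probabilities of 0, 1, ..., k successes."""
    return sum(binom_pmf(n, p, i) for i in range(k + 1))


def prob_at_least(n, p, k):
    """P(X >= k) = 1 - P(X <= k - 1)."""
    return 1.0 if k <= 0 else 1 - binom_cdf(n, p, k - 1)


# Try 1: 25 chips, each defective with probability 0.10.
exactly_two = binom_pmf(25, 0.10, 2)
at_least_two = prob_at_least(25, 0.10, 2)

print(round(exactly_two, 4), round(at_least_two, 4))
```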

Try 2: A multiple-choice exam consists of 6 questions, each with 4 possible answers. If a student guesses at each question, determine

a) the probability that they receive a mark of 50%.


Lesson 2: Measures of Central Tendency & Dispersion and the Normal Curve

There are 3 measures of the “middle” or central value of a set of data:

• Mean: the arithmetic average

• Median: the middle value when all data values have been arranged in order

• Mode: the value which occurs most often

There are 2 measures of how “spread out” a set of data is:

• Range: the difference between the highest and the lowest values

• Standard Deviation: a measure of how far, on average, the data values lie from the mean

Problem 1: Use a calculator to determine the mean, median and standard deviation of the following set of golf scores:

72 74 73 76 74 71 75 76 72 77
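The calculator results for Problem 1 can be checked with Python's standard `statistics` module. This is only a sketch. It computes both the sample and the population standard deviation, since the problem does not say which one is wanted.

```python
import statistics

scores = [72, 74, 73, 76, 74, 71, 75, 76, 72, 77]

mean = statistics.mean(scores)
median = statistics.median(scores)
sample_sd = statistics.stdev(scores)       # divides by n - 1
population_sd = statistics.pstdev(scores)  # divides by n

print(mean, median, round(sample_sd, 3), round(population_sd, 3))
```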

Problem 2: The following table shows data collected by a candy company on the number of candies in a box of caramel chews. The company inspected 100 boxes. Determine the mean, median, mode, range and the standard deviation for the number of candies in each box.

Number of Candies Frequency

48 10

49 23

50 37

51 17


z-score: a value which represents the number of standard deviations ($\sigma$) a particular data value (x) is from the mean ($\mu$): $z = \frac{x - \mu}{\sigma}$

Problem 3: The marks on a Math test have a mean of 74 and a standard deviation of 4. On a Biology test, the mean is 72 and the standard deviation is 9. David scored 80 on each test.

a) Determine the number of standard deviations away from the mean that David’s mark is for each test.

b) On which test did he do better, relative to the rest of the class?
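The comparison in Problem 3 comes down to computing one z-score per test. The sketch below uses the standard definition z = (x - mean) / standard deviation. The function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


math_z = z_score(80, mean=74, sd=4)
biology_z = z_score(80, mean=72, sd=9)

# The larger z-score is the better result relative to the rest of the class.
better = "Math" if math_z > biology_z else "Biology"
print(math_z, round(biology_z, 2), better)
```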


Define: Data is said to be normally distributed about the mean, $\mu$, if the histogram results in a bell curve, as shown below.

Properties:

• The data is symmetric about the mean, $\mu$; i.e. 50% of the area under the curve is to the left of the mean, and 50% is to the right.

• The total area under the curve is 100% or 1.00.

• The area under a region of the curve represents the probability that a data value will occur in that region.

• 68.3% of the data lies between $\mu - \sigma$ and $\mu + \sigma$.

• 95.4% of the data lies between $\mu - 2\sigma$ and $\mu + 2\sigma$.

• 99.7% of the data lies between $\mu - 3\sigma$ and $\mu + 3\sigma$.

• Nearly 100% of the data lies between $\mu - 4\sigma$ and $\mu + 4\sigma$.

Problem 4: Label the properties of the normal distribution on the curve above.

Problem 5: A value has a z-score of z = 1.50. Determine the area under the Standard Normal Curve to the left of this z-score.

Using your graphing calculator
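The same areas can be checked off the calculator with `statistics.NormalDist` from the Python standard library. This sketch finds the area to the left of z = 1.50 and the area within 1, 2, 3 and 4 standard deviations of the mean. The variable names are illustrative.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)  # the Standard Normal Curve

# Problem 5: area to the left of z = 1.50.
area_left = standard.cdf(1.50)

# Area within k standard deviations of the mean, for k = 1, 2, 3, 4.
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3, 4)}

print(round(area_left, 4))
for k, area in within.items():
    print(k, round(100 * area, 2))
```

The k = 4 area comes out very close to, but not exactly, 100%.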

Problem 6: Determine the following


Lesson 3: Problems Involving the Normal Distribution

Problem 1: The lifetime of light bulbs is normally distributed with a mean lifetime of 98 hours and a standard deviation of 13 hours.

a) What % of the light bulbs will last between 72 hours and 124 hours?

b) What is the probability that a light bulb will last more than 111 hours?

c) In a shipment of 1200 light bulbs, how many would we expect to last from 85-124 hours?
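Each part of Problem 1 is an area under the normal curve between two bounds. The sketch below uses `statistics.NormalDist`. The helper `prob_between` is an illustrative name.

```python
from statistics import NormalDist

bulbs = NormalDist(mu=98, sigma=13)  # lifetime in hours


def prob_between(dist, low, high):
    """Area under the normal curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)


pct_72_to_124 = 100 * prob_between(bulbs, 72, 124)         # part a), as a percent
p_over_111 = 1 - bulbs.cdf(111)                            # part b)
expected_85_to_124 = 1200 * prob_between(bulbs, 85, 124)   # part c), expected count

print(round(pct_72_to_124, 2), round(p_over_111, 4), round(expected_85_to_124))
```

An expected count is the probability multiplied by the number of items, rounded to a whole number at the end.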

Problem 2: Michael Jordan’s points per game are normally distributed with a mean of 28 and a standard deviation of 6.

a) In what percent of his games will he score less than 38 points?

b) What is the probability that he scores between 19 and 31 points?


Problem 3: The speeds of vehicles in a 60 km/h zone are normally distributed with a mean of 63 km/h and a standard deviation of 3 km/h. The police allow a margin of 10% above the posted speed limit, in case of error, before they issue a ticket. If 120 cars pass through this zone, how many can we expect will receive a ticket?

Problem 4: On a particular exam, the mean was 83 and the standard deviation was 4. If 12 people scored above a mark of 90%, then how many people wrote this exam?

invNorm(area to the left, mean, standard deviation)

When given the area to the left of the data value, the command “invNorm” can be used to calculate the data value or the z-score. The mean and standard deviation must be given to obtain a data value; to obtain a z-score, use 0 and 1, respectively.
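A rough Python counterpart of this command is `NormalDist.inv_cdf` from the standard library. The wrapper below is named `inv_norm` only to echo the calculator command and is not part of any library. The area 0.9332 and the exam parameters (mean 83, standard deviation 4, taken from Problem 4) are used purely as illustration.

```python
from statistics import NormalDist


def inv_norm(area_left, mean=0.0, sd=1.0):
    """Value with the given area to its left; the defaults return a z-score."""
    return NormalDist(mean, sd).inv_cdf(area_left)


z = inv_norm(0.9332)                   # z-score with 93.32% of the area to its left
mark = inv_norm(0.9332, mean=83, sd=4) # the matching data value for mean 83, sd 4

print(round(z, 2), round(mark, 1))
```

The two results are linked by data value = mean + z × standard deviation.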


Problem 6: The marks of a large number of students have been represented on a standard normal distribution curve. The values given represent the number of students in each area. Determine the value of to the nearest hundredth.


Lesson 4: Box & Whisker and Cumulative Frequency

Discrete Data: Data is considered discrete if it consists of exact data points.

A column graph can be used to represent the distribution of the data. Column graphs have the following features:

• The horizontal axis represents the range of data values.

• The vertical axis represents the frequency of occurrence of the data values.

• Column widths are equal and heights vary according to frequency.

• Spaces exist between the bars.

Continuous Data:

The following figures are the heights, to the nearest centimeter, of a group of students:

156 172 168 153 170 160 170 156 160 160 172 174

150 160 163 152 157 158 162 154 159 163 157 160

153 154 152 155 150 150 152 152 154 151 151 154

The heights of people in a group are measured to the nearest centimeter so that the data is a set of whole numbers; thus they may appear to represent discrete data. However, the heights of the people are nevertheless approximations of a value which comes from a continuous number scale.

Heights Frequency (fi)

148 - 150 3

151 - 153 8

154 - 156 6

157 - 159 6

160 - 162 5

163 - 165 2

166 - 168 1

169 - 171 2


[Figure: histogram “Heights of Students”; horizontal axis: Height (cm); vertical axis: Number of Students]

A FREQUENCY HISTOGRAM can be used to represent the distribution of data. A histogram is a bar graph with the frequency along the vertical axis and the class intervals along the horizontal axis. Histograms are used whenever the data is continuous, and they have no gaps between the columns (unless there is zero frequency for that class).

The boundaries of the frequency columns of the histogram represent the true limits of the data in each class interval. For example, the data point 152 is the class mark for the range 151 to 153 and represents a height somewhere between 150.5 and 153.5, so the boundaries of the rectangles of the frequency histogram are 147.5 – 150.5, 150.5 – 153.5, and so on.

Note that the horizontal scale does not start at zero and to emphasize this fact a short jagged line could be inserted in the horizontal axis.

The frequency histogram can be created on your GDC.

1. Enter the data into L1 and L2.

2. Set the stat plot to histogram with L2 as the frequency.


A FREQUENCY POLYGON is a line graph with the frequency along the vertical axis and the class marks along the horizontal axis.

The frequency polygon can also be done on the GDC just by choosing a different graph in the stat plot.

[Figure: frequency polygon “Height of Students”]


Example 2:

A golfer hits 30 balls in succession with his driver. The distance each ball travelled, in meters, is recorded below.

244.6 245.1 248.0 248.8 250.0

251.1 251.2 253.9 254.5 254.6

255.9 257.0 260.6 262.8 262.9

263.1 263.2 264.3 264.4 265.0

265.5 265.6 266.5 267.4 269.7

270.5 270.7 272.9 275.6 277.5

The data is considered continuous and should be grouped, as shown in the table below. We will group in intervals of 5 meters (an arbitrary number). The interval of 240 – 245 represents the data values greater than or equal to 240 but less than 245, i.e. [240, 245). The groups should all be the same width.

Estimates for the measures of central tendency and dispersion can be found by using your graphing calculator. Enter the class mark in list 1 and the frequency in list 2 and then calculate the statistics.

Mean: 261 Standard Deviation: 9.23 Modal Class: 260 – 265
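The grouped estimates above can be reproduced from the raw distances. This sketch places each distance in its 5-meter class, uses the class mark (midpoint) in place of the individual values, and computes the population standard deviation (dividing by n). The variable names are illustrative.

```python
from collections import Counter
from math import sqrt

distances = [
    244.6, 245.1, 248.0, 248.8, 250.0, 251.1, 251.2, 253.9, 254.5, 254.6,
    255.9, 257.0, 260.6, 262.8, 262.9, 263.1, 263.2, 264.3, 264.4, 265.0,
    265.5, 265.6, 266.5, 267.4, 269.7, 270.5, 270.7, 272.9, 275.6, 277.5,
]
width = 5

# Frequency of each class, keyed by its lower limit (240 means [240, 245)).
counts = Counter(int(d // width) * width for d in distances)
classes = sorted(counts)
marks = {low: low + width / 2 for low in classes}  # class marks

n = sum(counts.values())
mean = sum(marks[c] * counts[c] for c in classes) / n
variance = sum(counts[c] * (marks[c] - mean) ** 2 for c in classes) / n
sd = sqrt(variance)
modal_low = max(classes, key=lambda c: counts[c])

print(mean, round(sd, 2), f"{modal_low} - {modal_low + width}")
```

Because every value in a class is replaced by the class mark, these are estimates rather than the exact statistics of the 30 distances.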

Note: Again the calculator should not be used to calculate the median for continuous grouped data. Methods to do this will be explained later.

The frequency histogram and frequency polygon (below) can also be drawn on the graphing calculator, as shown earlier in this lesson.


CUMULATIVE DATA

The cumulative frequency gives a “running total” of the data values.

For example: The cumulative frequency distribution table for the weights of 120 male football players is shown below:

Weight (w in kg) Frequency Cumulative Frequency

2 2
3 2 + 3 = 5
12 5 + 12 = 17
14 17 + 14 = 31
19 31 + 19 = 50
37 50 + 37 = 87
22 87 + 22 = 109
8 109 + 8 = 117
2 117 + 2 = 119
1 119 + 1 = 120

A cumulative frequency curve can be drawn for the grouped data. It is obtained by plotting the cumulative frequency against the corresponding upper class boundary and joining the points with a smooth curve. A cumulative frequency curve is sometimes called an ogive.
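The running total is easy to generate programmatically. This sketch uses `itertools.accumulate` on the football player frequencies and also computes the positions used later for reading the quartiles and median off the curve. The variable names are illustrative.

```python
from itertools import accumulate

frequencies = [2, 3, 12, 14, 19, 37, 22, 8, 2, 1]

# Each entry is the sum of all frequencies up to and including that class.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Positions along the vertical axis of the cumulative frequency curve.
positions = {
    "lower quartile": 0.25 * total,
    "median": 0.50 * total,
    "upper quartile": 0.75 * total,
}

print(cumulative)
print(positions)
```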

[Figure: cumulative frequency curve; vertical axis: Number of Players]

PERCENTILES AND QUARTILES

A percentile is the score below which a certain percentage of the data lies. For example, the 85th percentile is that position which has 85% of the data less than or equal to it.

The lower quartile is the 25th percentile.

The median is the 50th percentile.

The upper quartile is the 75th percentile.

The interquartile range represents the middle 50% of the population.

Using the football player example above, the cumulative frequency curve has effectively placed the data in order. This now enables us to read off estimates of the median, quartiles and other percentiles.

To find the median:

The median is halfway along the list of data. Since there are 120 data values, the median point is at 60. (Technically, the median is at position $\frac{120 + 1}{2} = 60.5$, the average of the 60th and 61st weights. In large populations of grouped data the discrepancy is small enough to be deemed insignificant. We will consider the median to be the 60th term.)

A horizontal line is drawn from 60 on the vertical axis to the curve, and then drawn at right angles to the horizontal axis. Reading across from 60 and down to the “weight” gives a figure of about 81 kg as the median.

To find the upper and lower quartiles, a similar procedure is used:

Upper quartile: 0.75 × 120 = 90. The corresponding weight is approximately 86 kg.

Lower quartile: 0.25 × 120 = 30. The corresponding weight is approximately 75 kg.

The values for the median and quartiles can be read from the graph below.

[Figure: cumulative frequency curve; horizontal axis: Weights (kg); vertical axis: Number of Players]

The difference between the two quartiles is known as the inter-quartile range (Q3 – Q1). In this case, the inter-quartile range is 85 – 75 = 10. The middle 50% of the population has a weight between 75 kg and 85 kg.

The Box and Whisker Plot is a diagrammatic way of showing the median, the quartiles and the range of a set of ungrouped data. An example of a box-and-whisker plot is shown below.

The data for the football player weights can be represented in the following box and whisker plot:

The box-and-whisker plot is considered a five-number summary of the data set:

• The minimum value

• The lower quartile

• The median

• The upper quartile

• The maximum value

Using a Graphing Calculator to Draw Box-and-Whisker Plots

Example #1: Given 0, 1, 1, 3, 4, 5, 7, 7, 7, 8, 9, 10

1. Enter the data into L1.

2. Set the statistical plot.

3. Set your window.

4. Graph (for y, use any scale).


To obtain a Five Number summary: Press TRACE and the median shows a flashing box. Press the arrows and you will have the other four values.
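The five-number summary can also be computed directly. This sketch assumes the "median of each half" convention for quartiles, leaving the middle value out of both halves when the number of data values is odd. Other quartile conventions exist and can give slightly different values. The function name is illustrative.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    # Skip the middle value when the number of data values is odd.
    upper = values[half + 1:] if len(values) % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


# Example #1 data.
summary = five_number_summary([0, 1, 1, 3, 4, 5, 7, 7, 7, 8, 9, 10])
print(summary)
```

With this convention, the data in Assignment question 6 gives quartiles of 12.5 and 16.5, in agreement with the answer key.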

If your data is in a frequency table: enter your data into two lists, L1 for the values and L2 for the frequencies.

If your data is grouped in intervals: enter the midpoint of each interval in L1 and the frequencies in L2.

Example #2

Mark        0 - 10   10 - 20   20 - 30   30 - 40   40 - 50   50 - 60   60 - 70
Midpoint    5        15        25        35        45        55        65
Frequency   5        10        15        22        29        13        8

1. Enter the data into L1 and L2.

2. Set the statistical plot.

3. Set your window.

4. GRAPH


Assignment

1. The following table gives the age groups of car drivers involved in an accident in a city for a given year.

Draw a cumulative frequency graph of the data and use it to find:

(a) the median age of the drivers involved in accidents

(b) the percentage of drivers, with ages of 23 or less, involved in accidents

(c) Use your calculator to draw a cumulative frequency graph and find the answers to parts (a) and (b). Explain any discrepancies in your answers.

2. The following boxplots compare the time students in years 9 and 12 spend on homework over a one week period.

(a) Copy and complete the following table:

(b) Determine the range and the inter-quartile range for each group

3. The following table shows the number of earthquakes experienced in a particular location over a 50-year period. The intensity of the earthquakes is given by the Richter Scale.

(a) Construct a cumulative frequency table.

(b) Draw a cumulative frequency curve.


4. Use your graphing calculator to make a box-and-whisker plot of the following sets of data. In each case, use the plot to obtain the median and inter-quartile range of the data.

(a) 384, 371, 399, 383, 377, 402, 390, 394, 385

(b)

x 1 2 3 4 5 6 7

f 2 8 25 47 41 18 8

5. The histogram below shows the weights (in kg) of a group of year 10 students at a country high school.

a. How many students were involved in the survey?

b. Calculate the mean weight of the students.

c. Determine the number of students who weigh less than 56 kg.

d. Determine the percentage of students who weigh between 50 and 60 kg.

6. Draw a box-and-whisker plot for the following data and state the 5 number summary.

11, 12, 12, 13, 14, 14, 15, 15, 15, 16, 17, 17, 18

7. A botanist has measured the heights of 60 seedlings and has presented her findings on the ogive below.

(a) Determine the number of seedlings that have heights of 5 cm or less.

(b) Calculate the percentage of seedlings that are taller than 8 cm.

(c) Determine the median height.

(d) Determine the inter-quartile range for the heights.

(e) Determine the 90th percentile for the data and interpret this value.


8. The given parallel boxplots represent the 100-meter sprint times for the members of two athletics squads.

(a) Determine the 5 number summaries for both A and B.

(b) Determine the range and inter-quartile range for each group.

9. The boxplot shown summarizes the results of a test (out of 100 marks).

Complete the following statements about the test results:

(a) The highest mark scored for the test was ________

(b) The lowest mark scored for the test was ________

(c) Half of the class scored a mark greater than or equal to _______

(d) The top 25% of the class scored at least _____ marks.

(e) The middle half of the class had scores between _____ and _____ for this test.

(f) The range of the data set is _______.

(g) The inter-quartile range is _________.

10. The table below gives the length, x mm (to the nearest mm), of newborn babies born at a local hospital over a one week period.

Length (mm) Frequency

[400, 424] 3

[425, 449] 8

[450, 474] 16

[475, 499] 32

[500, 524] 28

[525, 549] 13

[550, 574] 5

[575, 599] 1

(a) Calculate the sample mean for the data shown.

(b) Draw a histogram to represent the data.


Assignment Answer Key

1. (a) 26 years (b) 69.3 %

(c) calculator assumes all values within a given interval are equal to midvalue – inaccurate

OR cum freq curve on calculator uses straight lines between points rather than smooth curve.

2. (a) (b) Range: year 9: 11 year 12: 11.5

IQR: year 9: 5 year 12: 6

3. (a) (b)

(c) (i) 4.6 (ii) 2.7, 3.7 (iii) 3.1

4. (a) median: 385; inter-quartile range: 16.5

(b) median: 4; inter-quartile range: 1

5. (a) 95 (b) 59.6 (c) 25 (d) 36.8 %

6. min: 11; Q1: 12.5; median: 15; Q3: 16.5; max: 18

7. (a) 10 (b) 28.3 % (c) 7 (d) 2.6 (e) 10, 90 % of seedlings have a height of 10 cm or less

8.

9. (a) 98 (b) 25 (c) 70 (d) 85 (e) 55 and 85 (f) 73 (g) 30

10. (a) 494.5 (c)


Lesson 5: Scatter Plots, Correlation Coefficients & Lines of Best Fit

Scatter plots are x-y plots of bivariate data – data where there are two variables for each observation – for example, the height and weight of an adult female.

When we suspect that two sets of data are related a scatter plot gives an indication of whether the variables have any correlation between them.

The following graphs indicate the correlation descriptions for linear associations.

Correlation Coefficient (r)

The correlation coefficient, r, is a number which describes the strength of the linear relationship between two variables. The table below summarizes a guideline for interpreting r.

Range of r values    Correlation
±(1 – 0.75)          Strong
±(0.75 – 0.5)        Moderate
±(0.5 – 0.25)        Weak
±(0.25 – 0)          No linear

“The correlation can have values between -1.0 and 1.0. If the correlation is 1.0, the two variables are perfectly correlated with one another. They are in effect interchangeable, for when you know the value of one; you also know the value of the other. If the correlation is -1.0, they are perfectly correlated, but the relationship is negative: when one is high, the other is low. If the correlation is 0.0, there is no relationship at all between the two variables; knowing the value of one tells you nothing about the value of the other.”

http://www.sfu.ca/~richards/Zen/Pages/Chap13.htm

[Figure: three scatter plots showing positive correlation, negative correlation and no correlation]

Calculating r for raw data

To calculate the correlation coefficient using raw data, a calculator can be used as follows.

1. Turn on diagnostics

• In the catalog under D.

2. Enter the data

• x-list in L1 and y-list in L2.

3. Perform a linear regression (y=ax+b)

• STAT button, then go to the CALCulate menu

• LinReg is number 4

4. Record the coefficient r to the appropriate number of decimals.
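For comparison with the calculator's r, the coefficient can be computed from raw data with the usual sums of squared deviations. This is a sketch rather than the calculator's internal method. It uses the study-time data from the exercise later in this lesson, and the function name `correlation` is illustrative.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


study_minutes = list(range(1, 13))  # 1, 2, ..., 12 minutes
errors_made = [19, 17, 17, 15, 12, 13, 11, 9, 6, 5, 5, 3]

r = correlation(study_minutes, errors_made)
print(round(r, 3))
```

A negative r here reflects that more study time goes with fewer errors.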

Example 1: The following data shows the amount of stretch of a vertical spring when various masses are hung from it.

Mass (g) 1 5 11 16 20 25 35 49 60 75 90

Stretch (cm)

a) Graph the data on a scatterplot.

b) Determine the correlation coefficient for the above data.

c) Discuss the strength of the correlation between mass and amount of stretch.

Solution:

a) [Figure: scatterplot of stretch against mass]

b) r = 0.997

c) Very strong positive correlation.


Calculating r using Pearson’s Product-Moment Formula

The correlation coefficient can also be calculated using the formula $r = \frac{s_{xy}}{s_x s_y}$. To use this formula you need the standard deviation for the x-list, the standard deviation for the y-list and the covariance of the two variables. The standard deviations can be determined on the calculator using the 2 variable statistics feature; the covariance will be given to you.

Example 2: Calculate the correlation coefficient for the previous data using the Pearson formula. The covariance is .

Solution: Find the standard deviation for the x and y values using “2 variable stats” on the calculator.

• Even though the formula says $s_x$ and $s_y$, we want to use $\sigma_x$ and $\sigma_y$, which are the population standard deviations, instead of Sx and Sy, which are the sample standard deviations.

(Compared to 0.997 from the regression.)

Exercise: The average daily temperature in Malaga (Spain) in July was 35º with a standard deviation of 3º. An ice cream salesman calculated that, on average, he sold 800 ice creams a day with a standard deviation of 35. If the covariance for the temperature and the number of ice creams sold is 96, calculate the correlation coefficient, r, and comment on the result.

Exercise: Twelve students were given 20 words in a different language to learn. They were allowed different lengths of time to study the words and the number of errors they made is shown in the table.

Study time, min 1 2 3 4 5 6 7 8 9 10 11 12

Errors Made 19 17 17 15 12 13 11 9 6 5 5 3
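The Pearson formula itself is a single division, sketched below with the summary statistics from the Malaga exercise. The second half checks the formula against raw data by computing a population covariance (dividing by n) for the study-time table, to match the population standard deviations. All function names are illustrative.

```python
from statistics import fmean, pstdev


def pearson_r(covariance, sd_x, sd_y):
    """r = covariance / (standard deviation of x * standard deviation of y)."""
    return covariance / (sd_x * sd_y)


def population_covariance(xs, ys):
    """Average product of deviations from the means, dividing by n."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


# Malaga exercise: covariance 96, standard deviations 3 and 35.
malaga_r = pearson_r(96, 3, 35)

# Study-time exercise, starting from the raw data.
study_minutes = list(range(1, 13))
errors_made = [19, 17, 17, 15, 12, 13, 11, 9, 6, 5, 5, 3]
study_r = pearson_r(
    population_covariance(study_minutes, errors_made),
    pstdev(study_minutes),
    pstdev(errors_made),
)

print(round(malaga_r, 3), round(study_r, 3))
```

Mixing a population covariance with sample standard deviations (or the reverse) would give a slightly different value, which is why the choice of standard deviation matters.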


Line of Best Fit

Drawing a Line of Best Fit Using the Mean Point

A line of best fit can help us make a judgment on the strength of the correlation. To draw a line of best fit by eye we do the following:

• Calculate the mean of both sets of data.

• Plot the means as a point $(\bar{x}, \bar{y})$.

• Draw a line through this point with an equal number of points above and below the line, following the trend if it is apparent.

• It does not need to go through the origin.

Example: The following data shows a correlation between the temperature of the water and the number of observed jellyfish.

Temperature (ºC) 20 21 22 19 24 26 31 27 24 21 21

Number of Jellyfish 135 138 150 135 162 201 263 221 168 155 149

a) Draw a scatter plot of the data.

b) Draw a line of best fit.

c) Discuss the correlation.

Solution:

a) b) [Figure: scatter plot of number of jellyfish against temperature, with the line of best fit]

c) Strong positive correlation.


Calculating the Line of Best Fit: Linear Regression

To determine the equation of the line of best fit you use the linear regression feature on the calculator as shown under the correlation section.

1. Turn on diagnostics

• In the catalog under D.

2. Enter the data

• x-list in L1 and y-list in L2.

3. Perform a linear regression (y=ax+b)

• STAT button, then go to the CALCulate menu

• LinReg is number 4

4. Record the equation of the line in the form y = ax + b.

Example: Determine the equation of the line of best fit for the Jellyfish data in the previous example and graph the result along with the original data.
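A least-squares line of the form y = ax + b can be sketched in a few lines of Python to compare against the calculator's regression output. The formulas used are the standard least-squares ones (slope = sum of products of deviations divided by sum of squared x deviations). The function name is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x  # forces the line through the mean point
    return a, b


temperature = [20, 21, 22, 19, 24, 26, 31, 27, 24, 21, 21]
jellyfish = [135, 138, 150, 135, 162, 201, 263, 221, 168, 155, 149]

a, b = least_squares_line(temperature, jellyfish)
print(f"y = {a:.2f}x + {b:.2f}")
```

The least-squares line always passes through the mean point, which ties this method to the by-eye method described earlier.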

Solution: (Note that the r value shows a strong positive correlation.)

Temperature (ºC) 20 21 22 19 24 26 31 27 24 21 21

Number of Jellyfish