COLLECTING DATA - THE CENTRAL LIMIT THEOREM … CONTINUED


STEP 2: COLLECTING DATA

Use a method that your team believes will give the best approximate answer to the question. Describe your method and approach.

STEP 3: CONCLUSION

Write down your answer. Approximately how many books in the library are red?

LATER …

Select a sampling method that seems appropriate for garnering an unbiased sample of 200 or more library books. Count the number of books in that sample that fit your definition of being red, and express this count as a proportion $p$ of the sample size $n$. Using

$$\mu = p, \qquad \sigma = \sqrt{\frac{p(1-p)}{n}},$$

write a 95% confidence interval for the true proportion of library books that are red.
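The interval can be computed in a few lines of Python. This is a minimal sketch that assumes the usual normal approximation with $z = 1.96$ for 95% (the text elsewhere rounds this to two standard deviations). The function name is hypothetical, and the counts in the usage line are made up for illustration rather than taken from a real survey.

```python
import math

def proportion_ci(count: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a population proportion.

    count -- number of sampled books judged "red"
    n     -- sample size (the activity asks for n >= 200)
    z     -- critical value; 1.96 gives roughly 95% coverage
    """
    p = count / n                         # sample proportion
    sigma = math.sqrt(p * (1 - p) / n)    # standard deviation of the sample proportion
    return p - z * sigma, p + z * sigma

# Hypothetical example: 50 red books in a sample of 200.
low, high = proportion_ci(50, 200)
print(f"95% CI: {low:.3f} to {high:.3f}")
```

Multiplying both ends by the number of books in the library turns this into an interval for the count asked about in Step 3.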

TWO-VARIABLE ANALYSIS: CHI-SQUARED TESTS

It’s easiest just to begin with an example:

EXAMPLE: Is there any correlation between hair colour and eye colour?

A team goes out and examines a random sample of people. The data collected is displayed in the following contingency table:

Does there seem to be some connection?

Answer: Here’s one way to think about this.

First, compute each row sum, column sum and grand total.

We see that 38 out of the 110 people examined had blue eyes. That is, 38/110 = 34.5% of the sample had blue eyes.

We also see that 37 of the people examined were blonde, 46 brown haired and 27 red haired.

If there is absolutely no influence of hair colour on eye colour, then we’d expect 34.5% of the blonde population to have blue eyes, 34.5% of the brown haired population to have blue eyes, and 34.5% of the red haired population to have blue eyes. That is,

THE EXPECTED FREQUENCY (aka count) OF BLONDE PEOPLE WITH BLUE EYES IS:

$$\frac{38}{110} \times 37 \approx 12.8$$

THE EXPECTED FREQUENCY OF BROWN HAIRED PEOPLE WITH BLUE EYES IS:

$$\frac{38}{110} \times 46 \approx 15.9$$

THE EXPECTED FREQUENCY OF RED HAIRED PEOPLE WITH BLUE EYES IS:

$$\frac{38}{110} \times 27 \approx 9.3$$

How many green-eyed people with red hair would we expect?

$$\frac{27}{110} \times 25 \approx 6.1$$

(The proportion of our sample with red hair is 27/110, so we would expect this proportion of the 25 green-eyed people to have red hair.)

In general:

The expected frequency of the entry in the $i$-th row and $j$-th column is given by:

$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$

We can go and fill in all the expected frequencies under the assumption that there is no relationship between the two qualities.

Looking at this, our observed frequencies seem vastly different from the expected frequencies (for no relationship). Something seems to be going on.
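The expected-frequency rule can be sketched in Python. The function names are hypothetical. Because the original table is not reproduced here, `expected_table` assumes only that the observed counts are supplied as a list of rows. The `expected_frequency` calls reuse the totals quoted in the text.

```python
def expected_frequency(row_total: float, col_total: float, grand_total: float) -> float:
    """Expected count in one cell if the two variables are unrelated."""
    return row_total * col_total / grand_total

def expected_table(observed: list[list[float]]) -> list[list[float]]:
    """Expected counts for every cell of a contingency table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[expected_frequency(r, c, grand_total) for c in col_totals]
            for r in row_totals]

# The four hand calculations above (totals taken from the text):
print(round(expected_frequency(38, 37, 110), 1))  # blonde, blue eyes: 12.8
print(round(expected_frequency(38, 46, 110), 1))  # brown hair, blue eyes: 15.9
print(round(expected_frequency(38, 27, 110), 1))  # red hair, blue eyes: 9.3
print(round(expected_frequency(27, 25, 110), 1))  # red hair, green eyes: 6.1
```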

To make this more precise …

The CHI-SQUARED STATISTIC for a table of observed frequencies ($o$) and expected frequencies ($e$) is:

$$\chi^2 = \sum_{\text{all cells}} \frac{(o-e)^2}{e}$$

In our example …

$$\chi^2 = \frac{(23-12.8)^2}{12.8} + \frac{(8-15.9)^2}{15.9} + \cdots + \frac{(5-6.1)^2}{6.1} = 17.88$$
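The statistic is a single sum over the cells, as in this Python sketch. The function name is hypothetical. The hair/eye table is not reproduced here, so the usage line uses a small made-up table instead. The code uses unrounded expected values, so its results can differ slightly in the last digits from a hand calculation with rounded ones.

```python
def chi_squared(observed: list[list[float]]) -> float:
    """Chi-squared statistic: sum over all cells of (o - e)^2 / e."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count if unrelated
            stat += (o - e) ** 2 / e
    return stat

# Hypothetical 2x2 table, for illustration only:
print(chi_squared([[10, 20], [30, 40]]))
```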

Now … if there truly were no relationship twixt the two variables, then you’d expect the observed values to be very close to the expected values, that is:

A $\chi^2$ value close to zero suggests no connection.

A large $\chi^2$ value suggests something is going on.

In our example, we seem to have a large $\chi^2$ value.

Statisticians have done the mathematics on the $\chi^2$ statistic and have computed the distribution one would expect it to follow. [NASTY MATH!] Each table of a different size has its own chi-squared distribution.

A table with $r$ rows and $c$ columns is said to have

$$\nu = (r-1)(c-1)$$

degrees of freedom.

E.g., in our example we have $\nu = 2 \times 2 = 4$ degrees of freedom.

(COMMENT: Why $r-1$ and $c-1$? We know that all $r$ rows add to 100% of the data. Thus if you know the sums of the first $r-1$ rows, then you do not need to be given the sum of the $r$-th row. Its value is not free. Ditto for the columns.)

See the internet or almost any statistics book for a table of $\chi^2$ values for each value of $\nu$.

In our example: $\chi^2 = 17.88$ for $\nu = 4$.

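According to the table of values, this value lies between $\chi^2_{0.995}$ and $\chi^2_{0.999}$. In other words, if there were no connection between the two variables, the chance of a $\chi^2$ value this large occurring would be less than 0.5% (but more than 0.1%). Thus, with 99.5% confidence, we can say that there is some kind of correlation between eye colour and hair colour.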

NOTE: There is no claim as to what that connection is. Further independent analysis is required.
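Instead of reading a printed table, you can compute the tail probability directly. This is a minimal sketch using only Python's standard library and the standard power series for the regularized incomplete gamma function. The function names are hypothetical, and in practice a statistics library such as SciPy (`scipy.stats.chi2.sf`) does the same job.

```python
import math

def degrees_of_freedom(rows: int, cols: int) -> int:
    """nu = (r - 1)(c - 1) for an r-by-c contingency table."""
    return (rows - 1) * (cols - 1)

def chi2_tail_probability(x: float, nu: int) -> float:
    """P(chi-squared with nu degrees of freedom >= x).

    Computes 1 - P(nu/2, x/2), where P is the regularized lower incomplete
    gamma function, summed as a power series (fine for the small tables here).
    """
    if x <= 0:
        return 1.0
    s, z = nu / 2, x / 2
    term = 1.0 / s          # n = 0 term of the series
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (s + n)
        total += term
        n += 1
    lower = total * math.exp(s * math.log(z) - z - math.lgamma(s))
    return 1.0 - lower

nu = degrees_of_freedom(3, 3)              # 4, as in the hair/eye example
p = chi2_tail_probability(17.88, nu)
print(f"nu = {nu}, tail probability = {p:.4f}")
```

For the hair/eye example this gives a tail probability of about 0.0013, between 0.1% and 0.5%. That agrees with the table lookup.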

EXERCISE: Analyse this (fictional) data from interviews with 5406 fifteen-year-olds.

CAVEAT: LOW EXPECTED FREQUENCIES (values lower than 5) TEND TO SKEW CHI-SQUARED TESTS. Statisticians have the rule of thumb that if more than 20% of the entries in a table have expected values less than 5, then the test is unreliable.
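The rule of thumb is easy to check mechanically, as in this sketch. The function names are hypothetical, and it assumes the expected counts have already been computed and are stored row by row.

```python
def low_expected_fraction(expected: list[list[float]], threshold: float = 5.0) -> float:
    """Fraction of cells whose expected count is below the threshold."""
    cells = [e for row in expected for e in row]
    return sum(e < threshold for e in cells) / len(cells)

def chi_squared_is_reliable(expected: list[list[float]]) -> bool:
    """Rule of thumb: unreliable if more than 20% of cells have expected count < 5."""
    return low_expected_fraction(expected) <= 0.20

print(chi_squared_is_reliable([[12, 18], [28, 42]]))  # True
print(chi_squared_is_reliable([[3, 18], [28, 42]]))   # False: 1 of 4 cells (25%) is below 5
```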

QUALITY CONTROL

EXAMPLE: A pipe manufacturer makes pipes of diameter 3 inches. Consumers will tolerate a spread of values with standard deviation σ = 0.06 inches.

To test the quality of their manufacturing techniques, each day a random sample of 10 pipes is selected and their mean diameter is computed. Here are the results of twelve days of data:

DAY    1     2     3     4     5     6     7     8     9     10    11    12
Mean   2.98  3.01  3.04  2.97  2.99  3.01  3.05  3.04  3.07  3.08  3.06  3.09

Is the wear and tear of the production equipment having an effect?

Answer: Let’s plot the data. Also, the mean is meant to be 3.00 so let’s plot that line as well:

It looks like the data is drifting away from the target mean. To make this more precise…

Samples of size ten should have mean 3.00 and standard deviation

$$\frac{0.06}{\sqrt{10}} \approx 0.02.$$

99.7% of the results should lie within three standard deviations of this mean, that is, between $3.00 - 3(0.02) = 2.94$ and $3.00 + 3(0.02) = 3.06$. That the data is drifting above the critical line of +3 standard deviations suggests that quality is “out of control.” (Day 9 is the first day of concern.) □
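The three-standard-deviation check can be written as a short script. This is a sketch that uses the daily means from the table above. The variable names are illustrative, and the code keeps σ/√10 unrounded.

```python
import math

target = 3.00          # target diameter (inches)
sigma = 0.06           # tolerated standard deviation of individual pipes
n = 10                 # pipes sampled per day
daily_means = [2.98, 3.01, 3.04, 2.97, 2.99, 3.01, 3.05, 3.04, 3.07, 3.08, 3.06, 3.09]

sigma_mean = sigma / math.sqrt(n)          # about 0.019, rounded to 0.02 in the text
upper = target + 3 * sigma_mean
lower = target - 3 * sigma_mean

# Days (numbered from 1) whose mean falls outside the control limits.
out_of_control = [day for day, m in enumerate(daily_means, start=1)
                  if m > upper or m < lower]
print(f"limits: {lower:.3f} to {upper:.3f}; days outside: {out_of_control}")
```

With the unrounded value the upper limit is about 3.057, so day 11 (3.06) also falls just outside. With the rounded 0.02 it sits exactly on the line. Either way, day 9 is the first day flagged.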

EXAMPLE: Two machines each produce 1 000 bolts per day. The following table shows the number of defective bolts each machine manufactured over a ten-day period.

DAY         1   2   3   4   5   6   7   8   9   10
MACHINE 1   42  37  18  37  17  26  35  21  18  17
MACHINE 2   44  36  23  41  24  25  31  35  23  21

Using only basic techniques, is one machine significantly less reliable than the other?

Answer: There isn’t much to work with on this problem.

One approach is to count totals:

Over 10 days, machine 1 produced 268 defective bolts out of 10 000:

Percentage: 2.68%

Over 10 days, machine 2 produced 303 defective bolts out of 10 000:

Percentage: 3.03%

These seem on par.

ANOTHER APPROACH: Perform a COUNTS TEST.

Here’s the list showing which machine produced the greater number of defective bolts each day:

2 1 2 2 2 1 1 2 2 2

Machine 2 is listed seven times out of the ten days.

If there is no difference in the quality of the machines, that is, if each is equally likely to be listed on a day as having produced the most defective bolts for that day, then this sequence is akin to the sequence of Hs and Ts in flipping a coin.

Is it unusual to get seven Hs in a run of ten flips? That is, is the fact that machine 2 is listed seven times at all significant?

Question: What are the chances of getting exactly seven heads when flipping a coin ten times?

Answer: $\dfrac{10!}{3!\,7!} \cdot \dfrac{1}{2^{10}} \approx 11.7\%$

It can happen. (It occurs about 12% of the time.)

This is not considered “rare” enough to be significant. So we would say that there is no significant evidence to suggest that machine 2 is behaving differently to machine 1. □

COMMENT: We usually look for “rare” events, say ones with only a 5% chance of occurring, before saying, with 95% confidence, that something unusual is happening.

For example, suppose machine 2 was listed NINE times out of the ten in the string.

The chance of this “naturally” occurring is $\dfrac{10!}{1!\,9!} \cdot \dfrac{1}{2^{10}} \approx 1\%$, so we would conclude, with 99% confidence, that machine 2 is indeed less reliable than machine 1.
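The counts test above can be sketched in Python using the defect counts from the table. The helper name `prob_exactly` is hypothetical. The comparison assumes no day is tied, which holds for this data.

```python
from math import comb

machine1 = [42, 37, 18, 37, 17, 26, 35, 21, 18, 17]
machine2 = [44, 36, 23, 41, 24, 25, 31, 35, 23, 21]

# Which machine produced more defective bolts each day (no ties occur in this data).
worse = [1 if a > b else 2 for a, b in zip(machine1, machine2)]
k = worse.count(2)

def prob_exactly(k: int, n: int) -> float:
    """Chance of exactly k heads in n fair coin flips: C(n, k) / 2^n."""
    return comb(n, k) / 2 ** n

print(worse)                               # [2, 1, 2, 2, 2, 1, 1, 2, 2, 2]
print(k, round(prob_exactly(k, 10), 4))    # 7 0.1172
print(round(prob_exactly(9, 10), 4))       # 0.0098
```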

RUN TESTS FOR RANDOMNESS

Suppose some activity has two possible outcomes: A or B.

e.g.
Toss a coin: H or T
Roll a die: Even or Odd
Height of a person: Above the mean or Below the mean

Suppose we perform the activity a number of times and record the sequence of As and Bs that result:

e.g.

A A | B B B | A | B | A A A A A | B B B | A A | B | A A A

Definition: A run is a maximal string of repeated letters in the sequence (a single letter on its own also counts as a run).

One usually separates runs with a “|” to make them easier to see.

In the example above there are nine runs.
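Counting runs amounts to counting the places where the letter changes, plus one. A minimal Python sketch follows. The function name is hypothetical, and it ignores spaces and “|” so that strings can be copied straight from the text.

```python
def count_runs(seq: str) -> int:
    """Number of runs (maximal blocks of repeated symbols) in a sequence.

    Spaces and '|' separators are ignored.
    """
    symbols = [ch for ch in seq if ch not in " |"]
    if not symbols:
        return 0
    # A new run starts wherever a symbol differs from the one before it.
    return 1 + sum(1 for prev, cur in zip(symbols, symbols[1:]) if prev != cur)

print(count_runs("A A | B B B | A | B | A A A A A | B B B |A A | B | A A A"))  # 9
```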

TWO COMMENTS:

a) A sequence with a large number of runs suggests that the sequence of As and Bs generated is not truly random. For instance, the following sequence has the maximal possible number of runs. You would be unlikely to believe it to be a random sequence:

A | B | A | B | A | B | A | B | A | B | A | B | A | B | A

b) A sequence with very few runs doesn’t seem that random either.

A A A A A A A A A | B B B B B B B B | A A A A A A A A A A A

There seems to be “too much clustering.”

So … the count of runs in a sequence should, in some way, give an indication of just how random that sequence is.

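Some mathematical facts …

Suppose a sequence contains $a$ As and $b$ Bs, with $N = a + b$ letters in all. List all the possible arrangements and count the number of runs in each one.

Mathematicians have proven that the count of runs has mean and standard deviation given by these formulae:

$$\mu = \frac{2ab}{N} + 1, \qquad \sigma = \sqrt{\frac{2ab(2ab - N)}{N^2(N-1)}}$$

Moreover, provided $a$ and $b$ are not too small, about 95% of run counts lie within two standard deviations of this mean.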

COMMENT: This standard deviation is computed with the number of arrangements in the denominator, not one less than that number (the “N” version rather than the “N-1” version).

EXERCISE:

a) Write down all the possible ways to list three As and two Bs.

b) Count the runs in each

c) Find the mean and the standard deviation of the count of runs. Verify that the above formulae give the same values.
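For part (c), you can check your answer with a brute-force script. This Python sketch lists every arrangement, as the exercise asks. It then compares the mean and the standard deviation (with the number of arrangements in the denominator) against the formulae $\mu = \frac{2ab}{N} + 1$ and $\sigma = \sqrt{\frac{2ab(2ab-N)}{N^2(N-1)}}$. The function names are illustrative only.

```python
import math
from itertools import combinations

def count_runs(seq):
    """Number of runs in a non-empty sequence of symbols."""
    return 1 + sum(1 for x, y in zip(seq, seq[1:]) if x != y)

def runs_distribution(a: int, b: int) -> tuple[float, float]:
    """Mean and standard deviation (number of arrangements in the denominator)
    of the run count, found by listing every arrangement of a As and b Bs."""
    n = a + b
    counts = []
    for a_positions in combinations(range(n), a):   # choose where the As go
        seq = ["B"] * n
        for i in a_positions:
            seq[i] = "A"
        counts.append(count_runs(seq))
    mean = sum(counts) / len(counts)
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))
    return mean, sd

def runs_formula(a: int, b: int) -> tuple[float, float]:
    """The closed-form mean and standard deviation of the run count."""
    n = a + b
    mean = 2 * a * b / n + 1
    sd = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    return mean, sd

print(runs_distribution(3, 2))
print(runs_formula(3, 2))
```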

EXAMPLE: Consider the following string:

H H H T T H H H T T T H T T T T

How likely is it that this came from flipping a coin?

Answer: We have a = 7 heads and b = 9 tails. Here N = 16.

There are 6 runs.

Now, according to the previous result, the runs should follow a distribution with:

$$\mu = 8.875, \qquad \sigma \approx 1.9$$

The count of six runs is within two standard deviations of the mean (roughly 5.1 to 12.7). We cannot conclude that this example is unusual. □

EXAMPLE: Consider the following string:

H T H H T H T H T T H H H T H T

How likely is it that this came from flipping a coin?

Answer: We have a = 9 heads, b = 7 tails, and N = 16. There are 12 runs. Again, $\mu = 8.875$ and $\sigma \approx 1.9$.

The number 12 is within 2 standard deviations of the mean. We cannot conclude that this sequence is not random. □

EXAMPLE: Consider the following string:

H H H H H T T T T T T H H T T T

How likely is it that this came from flipping a coin?

Answer: a = 7 heads; b = 9 tails; N = 16.

There are 4 runs.

Again:

$$\mu = 8.875, \qquad \sigma \approx 1.9$$

The count of 4 runs is more than two standard deviations below the mean. With 95% confidence we can say that this sequence was not produced by a random phenomenon. □
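The runs test used in these three examples can be sketched as one Python function. The function name is hypothetical. It applies the mean and standard-deviation formulae for the run count and reports whether the observed count lies within two standard deviations.

```python
import math

def runs_test(seq: str) -> tuple[int, float, float, bool]:
    """Runs test for a two-symbol sequence.

    Returns (runs, mean, sd, within_two_sd), using
    mu = 2ab/N + 1 and sigma = sqrt(2ab(2ab - N) / (N^2 (N - 1))).
    """
    symbols = [ch for ch in seq if not ch.isspace() and ch != "|"]
    kinds = sorted(set(symbols))
    if len(kinds) != 2:
        raise ValueError("need exactly two kinds of symbol")
    a = symbols.count(kinds[0])
    b = symbols.count(kinds[1])
    n = a + b
    runs = 1 + sum(1 for x, y in zip(symbols, symbols[1:]) if x != y)
    mean = 2 * a * b / n + 1
    sd = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    return runs, mean, sd, abs(runs - mean) <= 2 * sd

for s in ["H H H T T H H H T T T H T T T T",
          "H T H H T H T H T T H H H T H T",
          "H H H H H T T T T T T H H T T T"]:
    runs, mean, sd, ok = runs_test(s)
    print(f"{runs} runs, mean {mean:.3f}, sd {sd:.2f}, within 2 sd: {ok}")
```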

TWO APPLICATIONS

ABOVE- and BELOW- the MEDIAN TEST

To determine whether or not a set of numerical data is “random”:

a) Write the data in the order it was collected.

b) Compute the median of the data.

c) Write “A” or “B” next to each data point to indicate whether that point is above or below the median. (If an entry has the same value as the median, then omit it.)

d) Do a runs test on the sequence of As and Bs.

If the data really was generated by a random phenomenon, then the sequence of As and Bs produced should be random.

EXAMPLE: Here’s some data. Does it seem random? Use the above/below median test.

16 12 23 18 37 21 13 14 30 79 11

Answer: We need to find the median. (Unfortunately, this means ordering the data!):

11 12 13 14 16 18 21 23 30 37 79     median = 18.

Now here’s the sequence in terms of aboves and belows:

B B A * A A B B A A B (The star indicates the omitted value.)

We have: a = 5, b = 5 with N = 10. There are 5 runs. (I know that these a- and b- values are a bit low, but let’s follow the test anyway just for the fun of it!) This gives:

$$\mu = 6, \qquad \sigma \approx 1.49$$

The value of 5 runs is within two standard deviations of the mean. The data seems to be following a random phenomenon. □
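The above/below-median test can be sketched with Python's `statistics.median`. The function names are hypothetical. Values equal to the median are dropped, as step (c) says.

```python
import math
import statistics

def above_below_median(data: list[float]) -> str:
    """Label each value A (above median) or B (below), omitting values equal to the median."""
    med = statistics.median(data)
    return "".join("A" if x > med else "B" for x in data if x != med)

def runs_summary(seq: str) -> tuple[int, float, float]:
    """Run count plus the mean and standard-deviation formulae from the runs test."""
    a, b = seq.count("A"), seq.count("B")
    n = a + b
    runs = 1 + sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    mean = 2 * a * b / n + 1
    sd = math.sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    return runs, mean, sd

data = [16, 12, 23, 18, 37, 21, 13, 14, 30, 79, 11]
seq = above_below_median(data)
runs, mean, sd = runs_summary(seq)
print(seq, runs, mean, round(sd, 2))
```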

To decide whether or not two samples, of sizes $m$ and $n$, came from the same type of population, arrange all $m + n$ values in increasing order. (If some values are repeated, choose an order among them at random.) Record a sequence of As and Bs to show which sample each data point came from.

If the resulting sequence of As and Bs is random, then we can conclude that the samples are not really different and come from the same source.

If the sequence is not random, then no such conclusion can be made.

EXAMPLE: Twelve people from a mall were interviewed for their ages. Call these the M values: 13 18 34 17 16 30 13 47 37 35 15 35

Twelve people at an art museum were interviewed for their ages. Call these the A values: 45 52 17 28 41 63 48 23 38 60 40 40

Are these ages from the same type of population?

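Answer: Arrange the data in numerical order and keep track of which are Ms and which are As. (The value 17 appears in both samples; either order gives the same number of runs.)

M M M M M | A | M | A A | M M M M M | A A A A A | M | A A A A

Here a = 12, b = 12 and N = 24, so $\mu = 13$ and $\sigma \approx 2.40$. There are 8 runs.

The value of 8 runs is more than two standard deviations away from the mean.

With 95% confidence we can say that these two sets of data are not coming from the same type of population! □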
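This two-sample procedure is usually known as the Wald-Wolfowitz runs test. Here is a minimal Python sketch of it. The function name and its `seed` parameter are hypothetical, and ties are broken at random, as the text suggests.

```python
import math
import random

def two_sample_runs(sample_a, sample_b, label_a="A", label_b="B", seed=None):
    """Runs test for two samples (Wald-Wolfowitz).

    Pools both samples, sorts them (breaking ties at random), and returns
    the label sequence, run count, mean and standard deviation.
    """
    rng = random.Random(seed)
    pooled = [(x, rng.random(), label_a) for x in sample_a] + \
             [(x, rng.random(), label_b) for x in sample_b]
    pooled.sort()                      # sort by value; the random key breaks ties
    labels = [lab for _, _, lab in pooled]
    runs = 1 + sum(1 for x, y in zip(labels, labels[1:]) if x != y)
    m, n = len(sample_a), len(sample_b)
    total = m + n
    mean = 2 * m * n / total + 1
    sd = math.sqrt(2 * m * n * (2 * m * n - total) / (total ** 2 * (total - 1)))
    return "".join(labels), runs, mean, sd

mall = [13, 18, 34, 17, 16, 30, 13, 47, 37, 35, 15, 35]
museum = [45, 52, 17, 28, 41, 63, 48, 23, 38, 60, 40, 40]
seq, runs, mean, sd = two_sample_runs(mall, museum, "M", "A", seed=1)
print(seq, runs, mean, round(sd, 2))
```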

Here’s a fun example:

EXAMPLE: Here are the first twenty digits of $\pi$:

3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4

Do they seem random?

Answer: Do the median test:

One checks that the median is 4.5. The sequence of Aboves and Belows is:

B B B B | A A | B | A A | B | A A A A A | B B B | A | B

Here:

a = 10, b = 10, N = 20. There are 9 runs.

We have:

$$\mu = 11, \qquad \sigma \approx 2.18$$

The value of 9 runs is within two standard deviations of the mean. This sequence looks random! □

EXERCISE:

a) Write a sequence of Hs and Ts twenty symbols long that looks random to you. (The number of Hs need not be the same as the number of Ts.) Perform the runs test. Is your sequence “random”?

b) Flip a coin 20 times and record results. Perform a runs test for randomness on your sequence!

RANK CORRELATION

Here’s an opportunity to offer students a challenging exercise that illustrates the way tools and ideas in statistics are created.

THE PROBLEM:

Five men – Albert, Bilbert, Cuthbert, Dilbert and Egbert – take part in a singing contest and are ranked by two judges 1 – 5 (with “1” as best and “5” as least favored). For example, a possible outcome of the contest might be:

            Albert  Bilbert  Cuthbert  Dilbert  Egbert
Judge 1       1       4        3         5        2
Judge 2       3       5        2         4        1

If the judges followed purely objective assessment criteria and were completely free of personal preferences, then we would expect the two rankings to be identical. If, on the other hand, the judges followed no set procedures for their ranking schemes and assigned rankings in a random fashion, then we would expect very little or no correlation between the two lists. In the example presented above we seem to be somewhere between these two extremes.

THE CHALLENGE:

Develop an “index” that takes the two judges’ lists of rankings, applies some formula or algorithm to them, and computes a numerical value, which we shall call $R$.

We would like $R$ to have the following properties:

i) $0 \le R \le 1$

ii) $R$ has value 1 if the two lists are identical.

iii) $R$ has value 0 if the two lists are in complete disagreement. (e.g. The first judge lists the candidates in the order 1, 2, 3, 4, 5 and the second judge in the order 5, 4, 3, 2, 1.)

Compute the value of your “Rank Correlation Coefficient” for the example above and interpret the results.

Here are some possible approaches:

APPROACH 1: Given two lists, compute the difference of scores of each contestant, square, and sum. This gives a number $D$.

In our example, we have:

$$D = (1-3)^2 + (4-5)^2 + (3-2)^2 + (5-4)^2 + (2-1)^2 = 8$$

The largest value $D$ can possess (for a list of five numbers) is 40, and this occurs when the rankings are in reverse order. (Why?) The smallest value $D$ can possess is 0, and this occurs when the orders are in complete agreement. So set:

$$R = 1 - \frac{D}{40}$$

Comment: This is the approach Charles Spearman took in 1904. He defined his index to be $1 - \frac{2D}{40}$ (for five contestants), which ranges from −1 to 1 rather than from 0 to 1. The harder part is showing that $D$ is maximal when the two lists are in reverse order of one another. [To see this, show what happens to the value $D$ if two numbers in one list are swapped. Show that the value of $D$ increases if we swap two elements that aren’t already in reverse order.]
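Approach 1 can be sketched in Python, with a brute-force search over all orderings confirming that 40 is the largest possible $D$ for five contestants. The function name is hypothetical. The last line uses Spearman's usual form $1 - \frac{6D}{n(n^2-1)}$, which for $n = 5$ equals $1 - \frac{2D}{40}$.

```python
from itertools import permutations

def sum_squared_differences(r1: list[int], r2: list[int]) -> int:
    """D = sum over contestants of (rank from judge 1 - rank from judge 2)^2."""
    return sum((x - y) ** 2 for x, y in zip(r1, r2))

judge1 = [1, 4, 3, 5, 2]   # Albert, Bilbert, Cuthbert, Dilbert, Egbert
judge2 = [3, 5, 2, 4, 1]

d = sum_squared_differences(judge1, judge2)
n = len(judge1)

# Brute-force the largest possible D for five contestants.
ranks = list(range(1, n + 1))
d_max = max(sum_squared_differences(ranks, p) for p in permutations(ranks))

r_approach1 = 1 - d / d_max                 # the index proposed in Approach 1
spearman = 1 - 6 * d / (n * (n ** 2 - 1))   # Spearman's rho; equals 1 - 2D/d_max
print(d, d_max, r_approach1, spearman)
```

For the example this gives $D = 8$, so Approach 1 yields $R = 0.8$ and Spearman's index is 0.6.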

APPROACH 2: Use absolute values instead of squaring in the previous approach.

What is the maximal value $D$ can obtain in this case, and when does it occur?

APPROACH 3: We can reorder the names of the contestants so that the list of ranks for the first judge is 1, 2, 3, 4, 5. The list of ranks for the second judge changes accordingly.

            Albert  Egbert  Cuthbert  Bilbert  Dilbert
Judge 1       1       2        3         4        5
Judge 2       3       1        2         5        4

Now look at each contestant in turn along the second row. For each one, count the number of contestants to his right with a lower score.

In our example, according to Judge 2, Albert has TWO lower scores to his right.

Egbert has ZERO lower scores to his right, Cuthbert ZERO, Bilbert ONE, Dilbert ZERO.

Summing these counts gives $S = 2 + 0 + 0 + 1 + 0 = 3$.

If the rankings were in perfect agreement, then $S$ would have value 0. If they were in perfect disagreement (in reverse order), then $S$ would have value 10, and this is maximal. Set

$$R = 1 - \frac{S}{10}$$

In our example, $R = 0.7$, indicating some disagreement.

Comment: In 1938 M. G. Kendall took an approach similar to this one.
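Approach 3 counts the out-of-order pairs, as in this Python sketch. The function name is hypothetical. The last line shows Kendall's tau for untied rankings, $1 - \frac{2S}{\binom{n}{2}}$, which runs from −1 to 1 instead of 0 to 1.

```python
from math import comb

def discordant_count(order2: list[int]) -> int:
    """S = number of pairs (i < j) with order2[i] > order2[j].

    order2 holds judge 2's ranks after the contestants are sorted by judge 1's rank.
    """
    n = len(order2)
    return sum(1 for i in range(n) for j in range(i + 1, n) if order2[i] > order2[j])

judge2_sorted = [3, 1, 2, 5, 4]   # Albert, Egbert, Cuthbert, Bilbert, Dilbert
s = discordant_count(judge2_sorted)
s_max = comb(len(judge2_sorted), 2)       # 10 when the lists are in reverse order
r = 1 - s / s_max                          # the index proposed in Approach 3
kendall_tau = 1 - 2 * s / s_max            # Kendall's tau (no ties), from -1 to 1
print(s, s_max, r, kendall_tau)
```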

****

Many approaches, of course, are possible.

The difficulty in this work is determining when and how a maximal value for a count occurs (and generalizing this to a list of $n$ contestants and not just five). [Approach 2 is problematic in this regard.]
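A brute-force search shows why Approach 2 is harder to pin down. Unlike the squared version, the largest total of absolute differences is reached by more than one ordering, so “maximal” no longer simply means “reverse order”. This is a sketch for five contestants, and the function names are hypothetical.

```python
from itertools import permutations

def sum_abs_differences(r1, r2):
    """Approach 2: D = sum of |rank difference| over contestants."""
    return sum(abs(x - y) for x, y in zip(r1, r2))

def maximizers(n: int):
    """Largest possible D for n contestants and every ordering that attains it."""
    ranks = list(range(1, n + 1))
    scores = {p: sum_abs_differences(ranks, p) for p in permutations(ranks)}
    best = max(scores.values())
    return best, [p for p, v in scores.items() if v == best]

best, winners = maximizers(5)
print(best, len(winners))
```

For five contestants the maximum is 12. It is attained by the reverse order (5, 4, 3, 2, 1) but also by orderings such as (4, 5, 3, 1, 2).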

In document THINKING MATHEMATICS (Page 174-195)