Objectives. 6.1, 7.1 Estimating with confidence (CIS: Chapter 10)

(1)

Objectives

6.1, 7.1 Estimating with confidence (CIS: Chapter 10)

p  Statistical confidence (CIS gives a good explanation of a 95% CI)

p  Confidence intervals. Further reading

http://onlinestatbook.com/2/estimation/confidence.html

p  Choosing the sample size

p  t distributions. Further reading

http://onlinestatbook.com/2/estimation/t_distribution.html

p  One-sample t confidence interval for a population mean

(2)

Overview of Inference

p  Sample ≠ population, and sample mean ≠ population mean µ.

But we do not know the value of µ, and if we want to draw any conclusions about µ then we have to use the sample mean x̄ to do so.

p  Methods for drawing conclusions about a population from sample

data are called statistical inference.

p  There are two main types of inference:

§  Confidence Intervals - estimating the value of a population

parameter, and

§  Tests of Significance - assessing evidence for a claim (hypothesis)

about a population.

p  Inference is appropriate when data are produced by either

§  a random sample or

§  a randomized experiment.


(3)

Introducing confidence intervals

p  It is very unlikely that the sample mean based on a sample will ever

equal the true mean. Our aim is to construct an interval around the sample mean which is `likely’ to contain the mean. This is called a confidence interval.

p  In the first lecture we considered a Gallop poll for the proportion of the

electorate that would vote for Obama.

p  Gallup predicted that the Obama vote would be in the interval

[45%,51%] with 95% confidence.

p  The Obama vote turned out to be 50.5%, so the interval did capture the

true proportion.

p  You may be asking yourself how do we understand 95%, since 50.5%

lies in this interval, there does not appear to be any uncertainty in it.

q  In the next few slides, our objective is to understand how a

(4)

Review: properties of the sample mean

The sample mean is a unique number for any particular sample. If you had obtained a different sample (by chance) you almost certainly would have had a different value for your sample mean.

In fact, you could get many different values for the sample mean, and virtually none of them would actually equal the true population mean, µ.

(5)

In Chapter 4, we learnt that if a random variable is normally distributed with mean µ and standard deviation σ, then with 95% probability it lies in the interval

\[ \left[\mu - 1.96 \times \sigma,\ \mu + 1.96 \times \sigma\right] \]

Now our focus is on the sample mean. It has mean µ and standard error σ/√n (Chapter 5), thus there is a 95% probability that it lies in the interval

\[ \left[\mu - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \mu + 1.96 \times \frac{\sigma}{\sqrt{n}}\right] \]

But the mean is unknown; our objective is to locate the true mean based on the sample mean.

p  To do this we turn the story around. Suppose the sample mean lies in the interval

\[ \left[\mu - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \mu + 1.96 \times \frac{\sigma}{\sqrt{n}}\right] \]

p  This is the same as saying the mean µ lies in the interval [sample mean – 1.96×σ/√n, sample mean + 1.96×σ/√n].

q  Thus 95% of the time, the true mean (that we want to estimate) will be in the interval (this is called a confidence interval):

\[ \left[\text{sample mean} - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \text{sample mean} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right] \]
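The step of "turning the story around" can be written as a chain of equivalent events. This is a sketch that writes \bar{X} for the sample mean (the notation used on later slides); each line only rearranges the inequalities of the line before.

```latex
\begin{align*}
0.95 &= P\left(\mu - 1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\,\frac{\sigma}{\sqrt{n}}\right) \\
     &= P\left(-1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le 1.96\,\frac{\sigma}{\sqrt{n}}\right) \\
     &= P\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right)
\end{align*}
```

The random quantity in the last line is the interval (through \bar{X}), not µ.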

(6)

Case 1: Normal data – sample size one

p  Human heights are approximately a normal distribution. The standard

deviation of a human height is 3.8 inches.

p  Our objective is to construct a confidence interval for the mean height. p  We start with the less than ideal situation that we only have a sample size

one (just observation!). In this case the standard error is 3.8/√1 = 3.8 (the regular standard deviation).

p  We know that the observation is normally distributed, so it is straightforward

to construct the 95% confidence interval for the mean height using just one randomly selected height is:

[height – 1.96×3.8, height + 1.96×3.8] = [height – 7.44, height + 7.44]. Construct an interval using your height.

A large amount of data on heights has been collected and it is known that the mean height of a person is about 67 inches. Does your interval contain the mean? Most of you will contain the mean, 67 inches. Those of you whose height is in the extremes (very tall or small – more than 1.96 standard

(7)

Because the sampling distribution of x̄ is narrower than the population distribution, by a factor of √n, the estimates x̄ tend to be closer to the population parameter µ than individual observations are.

If the population is normally distributed N(µ,σ), the sampling distribution of x̄ is N(µ,σ/√n).

[Figure: population distribution of individual subjects x (standard deviation σ) and sampling distribution of sample means x̄ of n subjects (standard deviation σ/√n), both centered at µ.]

(8)

Case 1: Normal data – sample size three

p  Again we estimate the mean human height, but this time taken from a

random sample of three people. Recall, the standard deviation of a human height is 3.8 inches.

p  If the sample size is 3, the standard error of an average based on three is

3.8/√3 = 2.19.

p  As each randomly selected height is normally distributed, so is the

average based on three (recall Chapter 5):

p  The 95% confidence interval is

Given any random sample of size three we take its average and plug it in.

¯ X _⇠ N( µ |{z} ?? , _p3.8 3)  ¯ X 1.96 _⇥ _p3.8 3, ¯ X + 1.96 _⇥ 3.8_p 3

(9)

Here we illustrate the height example.

q  In the shot on the right we draw a

sample of size three from the population of all heights. The

average (sample mean) is evaluated.

q  This average corresponds to one of

the green dots on the lower right plot. The green lines is the confidence

interval centered about the average.

q  We did this 100 times and 96 of the

intervals contain the true mean 67. If the sample mean is

normally distributed and the 100 samples were

calculated and for each sample a 95% CI was

evaluated, about 95 would contain the true mean of 67. In reality only have one CI; we are 95% confident it contains mean.
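The repeated-sampling experiment can be imitated in a few lines. This is a minimal sketch, not the tool used for the plots: it assumes heights are exactly normal with mean 67 and standard deviation 3.8, and the function name, seed and number of trials are arbitrary choices made for illustration.

```python
import random
import statistics

def coverage_known_sigma(mu, sigma, n, trials, z=1.96, seed=0):
    """Fraction of intervals xbar ± z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / n ** 0.5      # same for every sample: sigma is known
    hits = 0
    for _ in range(trials):
        sample_mean = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / trials

coverage = coverage_known_sigma(mu=67, sigma=3.8, n=3, trials=10_000)
print(f"share of intervals containing the mean: {coverage:.3f}")
```

With many more than 100 repetitions the observed share settles close to 0.95, which is what the 95% refers to.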

(10)

Observations

p  We see that the length of confidence interval when using just one

person in the sample is 2×1.96×3.8 = 14.88, this is quite long, and does not really allow us to pinpoint the mean.

p  Whereas the length of the confidence interval using three people is

only 2×1.96×3.8/√3 = 14.88/√3

p  If ten people were used to calculate the sample mean the

corresponding interval length would be 14.88/√10 = 4.7.

p  We see that for any given interval either the mean is in this interval

or not. The 95% comes into play when we look at the proportion of intervals that contain the mean.

p  In reality:

p  We do not know the true mean µ, so will never know whether the interval

contained the mean or not.

p  We only observe one sample of size n, and thus have one CI.

(11)

Case 2: Skewed data – sample size 3

p  In the previous example we looked at height data which tends to be

normal. In this example we consider Right skewed data, which is NOT normal – examples include, House prices, Salaries etc.

p  We randomly draw a sample of size 3 from a right skewed

distribution with mean 14 and standard deviation 10.7.

p  The sample/mean average has a mean which is 14 and standard

deviation which is 10.7/√3 = 6.17.

p  We construct a 95% confidence interval to locate the mean,

 ¯ X 1.96 _⇥ 10_p.7 3 , ¯ X + 1.96 _⇥ 10_p.7 3

The confidence interval is constructed under the assumption that the sample mean is normal. In the next slide we investigate how this

(12)

q  We draw three samples from this

skewed distribution and take the average.

q  The average corresponds to one of

the green dots on the plot below. We construct a 95% interval.

q  We see that only 93 of the intervals

contain the mean. The reason for the

difference between the 95% and 93 (though not much) can be found in the green plot of the sample mean. It is slightly

skewed and clearly not normal. The sample size is not large enough for the CLT to work. We do not have 95% confidence in this 95% confidence interval.

(13)

Case 2: Skewed data – sample size 50

p  In the previous example, it was clear that we did not have the full

95% confidence in the 95% confidence interval we had constructed.

p  This was because the sample mean was not normal.

p  We need to be careful when constructing confidence intervals

using small sample sizes because the normality assumption may not hold – this means our interval is not as reliable as we think it is.

p  If the sample size is sufficiently large then we recall from Chapter 5

that the corresponding sample size will be close to normal. This means that a 95% confidence interval will actually be a 95% confidence interval.

p  In this next slide we look at the reliability of the 95% CI (where the

data is sampled from a skewed distribution):

 ¯ X 1.96 _⇥ _p10.7 50, ¯ X + 1.96 _⇥ _p10.7 50

(14)

We observe that the sample mean based on a sample of 50 appears close to normal (though this needs to be checked with a QQplot).

The `coverage’ of the confidence interval (at least over these 100 realizations) is `about’ 95%.

We can `safely’ say that we have 95% confidence in the 95% confidence interval.

To summarize, a 95% confidence interval is an interval where we are 95% confident it contains the mean (note that for any given interval the mean is either there or not – so there is no probability).

(15)

Implications

We do not need to (and cannot, anyway) take a lot of random samples to “rebuild” the sampling distribution and find µ at its center.

All we need is one SRS of size n and we can rely on the properties of the sampling distribution to infer reasonable values for the population mean µ.

[Figure: one sample of size n drawn from a population with mean µ.]

(16)

Multiple samples revisited

With 95% confidence, we can say that µ should be within 1.96 standard deviations (1.96×σ/√n) from our sample mean x̄.

p  In 95% of all possible samples of this size n, µ will indeed fall in our confidence interval.

p  In only 5% of samples will x̄ be farther from µ.

p  “Confidence” = the proportion of possible samples that give us a correct conclusion.

[Figure: sampling distribution of x̄, with standard deviation σ/√n.]

(17)

Calculation practice 1

p  You want to rent an unfurnished one-bedroom apartment in Dallas.

The mean monthly rent for 10 randomly sampled apartments is 980 dollars. Assume that monthly rents follow a normal distribution with standard deviation 280 dollars.

p  Question: Construct a 95% confidence interval for the mean

monthly rent of a one-bedroom apartment.

p  Answer: The standard error for the sample mean is 280/√10 =

88.54. The 95% CI is [980 ±1.96×88.54] = [806,1153]. With 95% confidence we believe the mean price of one-bedroom apartments in Dallas lies in this interval.
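The same calculation can be sketched in code. The function name and its parameters are illustrative, not part of the course material; it assumes the population standard deviation is known and takes the critical value from the standard normal distribution in Python's standard library.

```python
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """CI for a population mean when the population sd `sigma` is known."""
    # z* leaves (1 - confidence)/2 probability in each tail of the standard normal
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = sigma / n ** 0.5
    margin = z_star * standard_error
    return sample_mean - margin, sample_mean + margin

# Dallas rents: sample mean 980, sigma = 280, n = 10
low, high = z_confidence_interval(sample_mean=980, sigma=280, n=10)
print(f"[{low:.1f}, {high:.1f}]")
```

Changing `confidence` or `n` shows directly how the interval widens or narrows.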

(18)

p  Question Does the above confidence interval mean that 95% of all

rents should lie in this interval?

p  Answer: No, this is confidence interval for the mean not the apartment

price. An interval where 95% of apartment prices will lie is [980 ±1.96(88.54+280)] = [257,1720].

You do not have to understand this calculation, but you will notice this interval is much wider. The reason is that it must capture 95% of all rents, which are

extremely varied. This interval will not get narrower as the sample size grows. The CI for the mean is suppose to capture the mean rent, this interval is far narrower and will get narrower as the sample size grows.

q  Question A relator wants to know if the mean price of one bedroom

apartments in Dallas is more than 1100 dollars a month. Based on the confidence interval for the mean, what can you say?

q  Answer We showed that the 95% confidence interval for the mean is

[806,1153] dollars. As this interval contains both values above and below 1100 dollars, we do not know. We do not have enough data to answer her question.

(19)

Calculation practice 2

p  Hypokalemia is diagnosed when the blood potassium level is below

3.5mEq/dl. The potassium in a blood sample varies from sample to sample and follows a normal distribution with unknown mean but standard deviation 0.2. A patient ‘s potassium is measured taken over 4 days. The sample over 4 days is 3, 3.5, 3.9, 4.4, its sample mean is 3.7.

p  Question: Construct a 95% confidence interval for the mean potassium

and discuss whether the patient is likely to be diagnosed with Hypokalemia.

p  Answer: The standard error for the sample mean is 0.2/√4 = 0.1. Thus

the 95% confidence interval for the mean potassium level is

[3.7±1.96×0.1] = [3.504,3.894]. This means with 95% confidence we

believe the mean lies in this interval.

q  Since 3.5 or less does not lie in this interval, it suggests that the patient

does not have lower potassium. There is a precise way of answer this specific problem which we discuss in Chapter 7 (called statistical

(20)

Confidence interval misunderstandings

p  Suppose 400 alumni were asked to rate the University of Okoboji counseling services on a scale from 1 to 10. The sample mean was found to be 8.6 and it is known that the standard deviation is σ=2. Ima Bitlost has done the analysis, but has made some mistakes.

p  Ima computes the 95% CI for the mean satisfaction score as [8.6±1.96×2]. What is her mistake?

p  Ima has not taken into account that the sample mean has a much smaller standard deviation (standard error) than the population. The standard error is 2/√400 = 0.1. Thus the true CI is [8.6±1.96×0.1] = [8.404,8.796].

p  After correcting her mistake, she states that “I am 95% confident that the sample mean lies in the interval [8.404,8.796]”. What is wrong with her statement?

p  This is a meaningless statement, for sure the sample mean lies in this

(21)

p  She quickly realizes her mistake and instead states “the probability

that the mean lies in the interval [8.4,8.796] is 95%”, what misinterpretation is she making now?

p  By 95%, we mean that if we repeated the experiment many times over

about 95% of the time the intervals will contain the mean. For any given interval the mean is either in there or not. There is no probability

attached to it. To overcome, this issue we say that with we have 95% confidence in the mean lies in this interval.

p  Finally, in her defense for using the normal distribution to determine

the confidence coefficient (1.96) she says “Because the sample size is quite large, the population of alumni ratings will be close to normal”. Explain to Ima her misunderstanding.

p  The distribution of the population always stays the same, regardless of

the sample size (in this case, it is clear that variables that take integer values between 1 to 10 cannot be normal). However, the sample mean does get closer to normal as the sample size grow. With a sample size of 400, the distribution of the sample mean will be very close to normal.

(22)

Different levels of confidence

p  There is no need to restrict ourselves to 95% confidence intervals.

p  The level of confidence we use really depends on how much confidence we want. For example, you would expect a 99% confidence interval to be more likely to contain the mean than a 95% confidence interval.

p  To construct a 99% confidence interval we use exactly the same prescription as used to construct a 95% confidence interval; the only thing that changes is that 1.96 goes to 2.57 (if you look up -2.57 in the z-tables you will see this corresponds to 0.5%, so 99% of the time the sample mean will lie within 2.57 standard errors of the mean).

p  A 99% CI for the mean one-bedroom apartment price is [980±2.57×88.54]. The length of the interval is 2×2.57×88.54 = 455.

q  A 90% CI for the mean one-bedroom apartment price is [980±1.64×88.54]. The length of the interval is 2×1.64×88.54 = 290.

What does a 100% confidence interval look like? In a 100% CI we are sure to find the mean, but this interval is so wide it is not informative.
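Instead of reading the multiplier off the z-tables, it can be computed. The sketch below uses `NormalDist` from Python's standard library; the helper name `z_star` is an illustrative choice, and the standard error is the one from the Dallas rent example.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value with (1 - confidence)/2 probability in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

standard_error = 280 / 10 ** 0.5   # Dallas rents: sigma = 280, n = 10
lengths = {}
for confidence in (0.90, 0.95, 0.99):
    lengths[confidence] = 2 * z_star(confidence) * standard_error
    print(f"{confidence:.0%}: z* = {z_star(confidence):.3f}, "
          f"interval length = {lengths[confidence]:.0f}")
```

The computed multipliers (about 1.645 and 2.576) are slightly more precise than the table values 1.64 and 2.57, so the lengths differ from hand calculations by a dollar or so.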

(23)

Sample size and length of the CI

p  Let us return to the apartment example. We recall that the 95%

confidence interval for the mean price is [980 ±1.96×88.54] = [806,1153]. The length of this interval is 2×1.96×88.54 = 347.

p  Question: Suppose I take a SRS of 100 apartments in Dallas, the

sample mean based on this sample is 1000, what will the CI be?

p  Answer: The standard error is 280/√100 = 28 (much smaller than when

the sample size is 10), and the CI is [1000 ±1.96×28]. The length of this

interval is 2×1.96×28 =109.

p  What we observe is:

p  The length of the interval does not depend on the sample mean, this is

just the centralizing factor. It only depends on (i)1.96, (ii) the standard

deviation and (iii) the sample size.

p  The length of the interval gets smaller as the sample size increases.

p  If we want the interval to have a certain length, we can choose the

(24)

How large an interval

p  You read in a newspaper that

The proportion of the public that supports gay marriage is now 55%±15%.

q  This means a survey was done, the proportion in the survey who

supported gay marriage was 55% and that confidence interval for the population proportion is [55-15,55+15]% = [40%,70%].

q  This is an extremely large interval, it is so wide, that it is really not

that informative about the opinion of the public.

q  As we will see on the next slide, the reason it is too wide is that the

sample size is too small. This experiment was not designed well.

q  Typically, before data is collected, we need to decide how large a

sample to collect. This is usually done by deciding how much

`above and below’ the estimator seems reasonable. For example, [55-3,55+3]% = [52,58]% is more information. The 3% is known as a margin of error. Given a certain margin of error we can then

(25)

Margin of Error

p  Margin of error is the lingo used for the plus and minus part in the

p  That is the confidence interval is

[sample mean±1.96×σ/√n], the margin of error is 1.96×σ/√n.

q  For example, in the previous example the margin of error for the CI

based on 10 apartments is 1.96×88.54.

q  The margin of error for the CI based on 100 apartments is 1.96×28.

q  The margin of error in some sense, is a measure of reliability. For a

given confidence level, the smaller the margin error the more precisely we can pinpoint the true mean.

q  Suppose we want the margin or error to be equal to some value,

then we can find the sample size such that we obtain that margin of error. Solve for n the equation MoE = 1.96×σ/√n (the Margin of Error and the standard deviation σ are given): n = (1.96×σ/MoE)2
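The formula n = (1.96×σ/MoE)² rarely gives a whole number, so in practice it is rounded up. A small sketch follows; the function name is illustrative, and the target of 50 dollars for the rent example is a made-up value used only to show the calculation.

```python
import math

def required_sample_size(sigma, margin_of_error, z_star=1.96):
    """Smallest n with z_star * sigma / sqrt(n) <= margin_of_error."""
    n_exact = (z_star * sigma / margin_of_error) ** 2
    return math.ceil(n_exact)   # round up: rounding down would exceed the MoE

# Hypothetical target: rents with sigma = 280 and a margin of error of 50 dollars
n = required_sample_size(sigma=280, margin_of_error=50)
achieved = 1.96 * 280 / math.sqrt(n)
print(n, round(achieved, 2))
```

Rounding up guarantees that the achieved margin of error is no larger than the target.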

(26)

Calculation practice

p  In a study of bone turn over in young women with a medical

condition, serum TRAP was measured in 31 subjects. The sample mean was 13.2 units per liter. Assume the standard deviation is

known to be 6.5U/l.

p  Question: Find the 80% CI for the mean serum level.

p  Answer: 10% in the z-tables, this gives -1.28. The standard error for the

sample mean is 6.5/√31 = 1.16. Altogether this gives the CI

[13.2±1.16×1.28] =[11.7,14.6].

This means with we believe with 80% confidence the mean level

of serum for women with this medical condition should lie in this interval. By choosing such a low level of confidence our interval is quite narrow, but our confidence in this interval is relatively low.

q  Question: How large a sample size should we choose such that the

80% CI for the mean has the margin of error 1U/l.
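One way to answer is to apply the sample-size formula n = (z*×σ/MoE)² with the 80% multiplier z* = 1.28, σ = 6.5 and MoE = 1. This is a worked sketch, with the result rounded up so that the margin of error does not exceed 1U/l.

```latex
\[
n = \left(\frac{1.28 \times 6.5}{1}\right)^{2} = 8.32^{2} \approx 69.2
\quad\Longrightarrow\quad n = 70 \text{ (rounding up)}
\]
```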

(27)

When the standard deviation is unknown?

p  In the previous example we assumed the standard deviation was

unknown. In general before we collect the data, we will not have much information about the standard deviation. However, we will have some idea on bounds for it. Ie. The standard deviation for human heights is probably between 2-5 inches. Based on this information we can can find the sample size whose Margin of Error is maximum a certain length.

p  Question How large a sample size do we require such that the margin

of error for a 95% confidence interval for the mean of human heights is maximum 0.25 inch, given that σ lies somewhere between 2-5 inches.

p  Answer We know that the formula is n = (1.96×σ/0.25)2.. We need to

choose the standard deviation to place in the formula.

p  If we use σ=2, then the sample size is n=(1.96×2/0.25)2 = 246.

p  If we use σ=5, then the sample size is n=(1.96×5/0.25)2 = 1537.

p  For standard deviations between 2 and 5, the sample size will be between

246 – 1537. In the next slide we see what the MoE for these different sample

(28)

p  Using the smaller standard deviation gives a smaller sample size, which

is easier to collect. However, if the standard deviation is greater than 2, then it means that the MoE will be larger than the desired minimum:

p  If σ=5, and we use the minimum sample size n=246, then putting these

numbers into the formula we see that the

MoE =1.96×5/√246 = 0.62. Which is larger than the required

of 0.5. This is not what we want, as we want to ensure that the MoE

is less than 0.25.

p  If σ=2, and we use the maximum sample size n=1537, then putting these

numbers into the formula we see that the

MoE =1.96×2/√1537 = 0.1. Which is less than the require of 0.25. This is

exactly what we want, as we want the MoE which is at most 0.25.

q  To be sure that the MoE is maximum 0.25, we need to use a sample

size of n=1537. This means always using what we believe is the

maximum standard deviation in the calculation of margin of error. Ie.

n =

✓

1.96 _⇥ _{M AX} MoE
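A quick check of the worst case, as a sketch: for each candidate sample size we compute the margin of error over a few values of σ in the plausible range 2 to 5 (the grid of σ values is an arbitrary illustrative choice) and compare the largest one with the target 0.25.

```python
import math

def margin_of_error(sigma, n, z_star=1.96):
    return z_star * sigma / math.sqrt(n)

target = 0.25
for n in (246, 1537):
    # the MoE grows with sigma, so the worst case in the range is sigma = 5
    worst = max(margin_of_error(sigma, n) for sigma in (2, 3, 4, 5))
    print(f"n={n}: worst-case MoE = {worst:.3f}, within target: {worst <= target}")
```

Only the sample size computed from the largest plausible σ keeps the margin of error within the target for every σ in the range.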

(29)

Calculation practice (tricky)

p  Question: A confidence interval for the length of parrots beaks is

[4,10] inches. It is based on a sample of size n. By what factor should the sample size increase such that the margin of error is 1?

q  Answer: This looks like an impossible question because we don’t have

any obvious information. But we can break the problem into steps:

q  Confidence intervals are centered about the sample mean, so the

average of the observed data is 7. The margin of error is half the length

of the CI interval which is [10-4]/2 = 3 = 1.96×σ/√n.

q  We want to decrease the MoE, such that MoE = 1, so it decreases by a

third. Now some basic maths, suppose we increase the sample size by factor 9 (9 times the original data):

1.96 _{⇥ p} 9n = 1.96 ⇥ 3n = 1 3 1.96 ⇥ pn | {z } =3 = 3 3 = 1

Thus increasing the sample size by factor 9 results in the Margin of Error reducing to 1. Observe we need a huge increase in sample size to get a moderate decrease in the MoE!

(30)

Calculation (continued)

p  Example If a sample size of 20 gave a confidence interval [4,10],

how large a sample size is required to reduce the margin of error to 1?

p  Solution If the confidence interval is [4,10], from the previous slide

we know that the MoE is 3. This means that

If increase the sample size by factor 36, ie. from n=20 to n=20×36=720. Then I see that the margin of error is

1.96 _{⇥ p}

20 = 3

We see that to decrease the margin of error from 3 to ½ (by a sixth) we need to increase the sample size by factor 36!

1.96 _{⇥ p} 36 _⇥ 20 = 1.96 ⇥ 6 _⇥ p20 = 1 6 ⇥ 1.96 ⇥ p₂₀ = 3 6 = 1 2

(31)

Analysis with unknown standard deviation

p  So far we have assumed that the standard deviation is known, even

though the mean is unknown.

p  In some situations, this is realistic. For example, in the potassium level

example, it seems reasonable to suppose that the amount of variation for everyone is about the same, but everyone has their own personal mean level, which is unknown.

p  In most situations, the mean level is unknown.

p  Given the data: 68, 68.5, 68.9 and 64.4 the sample mean is 68.7, how

to `get’ the standard deviation to construct a confidence interval?

p  We do not know the standard deviation, but we know that we can

estimate it using the formula

p  For our example it is

s = v u u t 1 n 1 n X i=1 (Xi X¯)2 s = r 1 3 ([ 0.7] 2 _{+ [ 0}_._2]2 _{+ [0}_._2]2 _{+ [0}_._7]2_{) = 0}_.₅₉
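The estimate can be checked numerically. In this sketch the last observation is taken to be 69.4, which is the value implied by the stated sample mean 68.7 and the deviations ±0.2 and ±0.7; the hand formula is compared with `statistics.stdev`, which also divides by n − 1.

```python
import math
import statistics

heights = [68, 68.5, 68.9, 69.4]   # last value assumed to be 69.4 (see above)

mean = statistics.fmean(heights)
# sample standard deviation: divide the sum of squares by n - 1, not n
s_by_hand = math.sqrt(sum((x - mean) ** 2 for x in heights) / (len(heights) - 1))
s_library = statistics.stdev(heights)
print(round(mean, 2), round(s_by_hand, 2), round(s_library, 2))
```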

(32)

Using the z-transform with estimated standard deviation

p  Once we have estimated the standard deviation we replace the unknown true standard deviation in the z-transform with the estimated standard deviation:

\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \;\Rightarrow\; \frac{\bar{X} - \mu}{s/\sqrt{n}}, \qquad \bar{X} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \;\rightarrow\; \bar{X} \pm 1.96\,\frac{s}{\sqrt{n}} \]

q  After this we could conduct the analysis just as before. However, we will show in the next few slides (with the aid of Statcrunch) that this strategy leads to unreliable confidence intervals (when the sample size is small). We consider two examples:

q  The data is normal (we `draw’ samples from a distribution with mean 3.8 and standard deviation 3.8; however, the confidence interval used does not know these specifications) and the sample size is n = 3.

q  The data is normal (as above), but the sample size is n = 50.

(33)

Case 1: Normal data – sample size 3.

In this example we draw samples of size 3:

q  The 95% CI using the above data and the normal distribution is

\[ \left[69.9 - 1.96 \times \frac{1.73}{\sqrt{3}},\ 69.6 + 1.96 \times \frac{1.73}{\sqrt{3}}\right] \]

We see from this example that the estimated standard deviation (1.73) underestimates the true standard deviation (3.8). This in general tends to be true for small sample sizes. This means the 95% CI is too narrow.

We see from the plot on the left that only 84% of the `95% CI’ contain the mean. This means it is not a 95% CI. Something has gone wrong.
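The loss of coverage can be reproduced with a short simulation. This is a sketch rather than the Statcrunch experiment itself: it assumes normal data with standard deviation 3.8 and, as an arbitrary choice that does not affect coverage, a true mean of 67; the seed and number of trials are also arbitrary.

```python
import random
import statistics

def coverage_estimated_sd(n, trials=4_000, mu=67.0, sigma=3.8, seed=1):
    """Share of intervals xbar ± 1.96*s/sqrt(n) that contain mu (normal data)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)        # estimated from the sample, not the true sigma
        half_width = 1.96 * s / n ** 0.5
        hits += xbar - half_width <= mu <= xbar + half_width
    return hits / trials

small, large = coverage_estimated_sd(3), coverage_estimated_sd(50)
print(f"n=3: {small:.3f}   n=50: {large:.3f}")
```

For n = 3 the share of intervals containing the mean falls well short of 0.95, while for n = 50 it is close to it.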

(34)

Case 1: Normal data – sample size 50

In the previous example the sample size was 3; now we consider the case that the sample size is 50. For the example given on the right the 95% CI is

\[ \left[68.0 - 1.96 \times \frac{4.07}{\sqrt{50}},\ 68.0 + 1.96 \times \frac{4.07}{\sqrt{50}}\right] \]

For this example, the estimated standard deviation 4.07 is far closer to the true 3.8. This in general is true for large sample sizes.

Looking at the number of times the mean is contained within the 95% confidence interval (on the right) we see that it is close to the prescribed level of 95%.

(35)

Observations from the experiments

p  Simply replacing the true standard deviation with the estimated standard

deviation seems to have severe consequences on the confidence interval.

p  When the sample size was small there tends to be an underestimation in

the standard error, resulting in the 95% CI not really being a 95% CI.

p  To see why consider the z-transforms of the sample mean with known

and estimated standard deviations:

p  (sample mean - µ)/(σ/√n)

p  (sample mean - µ)/(s/√n)

p  In the first case, z-transform will be a standard normal. In the second case

the estimated standard deviation adds extra variability into the `system’. In particular, because s can be smaller than σ, this means the z-transform can be larger and take higher values then we would expect for a standard normal.

p  In the next few slides we show that when we estimate the standard

deviation the z-transform is no longer a standard normal, but the so called t-distribution.

(36)

Review: σ is unknown

In this case we can estimate the standard deviation from the data. The sample standard deviation s provides an estimate of the population standard deviation σ.

p  When the sample size is large, the sample is likely to contain elements representative of the whole population. Then s is a good estimate of σ.

p  But when the sample size is small, the sample contains only a few individuals. Then s is a mediocre estimate of σ.

p  The data is unlikely to contain values in the tails, and s is likely to underestimate σ.

[Figure: population distribution, with a small sample and a large sample drawn from it.]

(37)

Sample means and standard deviations

p  Just like the sample mean is random with a distribution, so is the

sample standard deviation.

p  Here we take a sample of size 10 from a normal distribution can

(38)

Estimating the standard deviation

p  The sampling distribution of the

sample standard deviation (n=5)

q  The sample distribution of the

sample standard deviation (n=25)

Observe that as the sample size increases the estimator of the sample standard deviation becomes less variable (1.70 reduces to 0.65). Large amount of variability in the sample standard deviation influences the confidence interval.

(39)

That nice Mr. Gosset

p  Just over 100 years ago, W.S. Gosset

was a biometrician who worked for Guiness Brewery in Dublin, Ireland.

p  His hobby was statistics.

p  Gosset realized that his inferences

with small sample data seemed to be incorrect too often – his true

confidence level was less than it was stated to be. We just observed this in the simulations previously.

p  He worked out the proper method that

took into account substituting s for σ.

p  But he had to publish under a

pseudonym: Student (probably because Gosset was a sweet and modest person).

p  Gosset’s theory is based on

the distribution of the quantity

p  This looks like the z-score for

, except that s replaces σ in the denominator.

.

x

t

s

n

−

µ

=

x

(40)

Formal: Student’s t distributions

Suppose that an SRS of size n is drawn from a Normal(µ,σ) population.

p  When σ is known, the sampling distribution for

\[ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \]

is Normal(0,1).

p  When σ is estimated from the sample standard deviation s, the sampling distribution for

\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \]

will be very close to normal if the sample size n is large. This is because for large n, s will be a very reliable estimator of σ.

q  However, in the case that n is not so large, the variability in s will have an impact on the distribution.

q  It is clear that the impact it has depends on the sample size.

(41)

Student’s t distributions

p  When σ is estimated from the sample standard deviation s, the sampling distribution for

\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \]

will depend on the sample size. The sampling distribution of t is a t distribution with n − 1 degrees of freedom.

p  The degrees of freedom (df) is a measure of how well s estimates σ. The larger the degrees of freedom, the better σ is estimated.

q  This means we need a new set of tables!

q  Further reading: http://onlinestatbook.com/2/estimation/t_distribution.html

(42)

When n is very large, s is a very good estimate of σ, and the corresponding t distributions are very close to the normal distribution. The t distributions become wider (thicker tailed) for smaller sample sizes, reflecting that s can be smaller than σ, so the corresponding t-transform is more likely to take extreme values than the z-transform.

(43)

Impact on confidence intervals

Suppose we want to construct the C% confidence interval for the mean. The standard deviation is unknown, so as well as estimating the mean we also estimate the standard deviation from the sample. The C% confidence interval is:

\[ \left[\bar{X} - t_{n-1}\left(\frac{100 - C}{2}\right) \times \frac{s}{\sqrt{n}},\ \bar{X} + t_{n-1}\left(\frac{100 - C}{2}\right) \times \frac{s}{\sqrt{n}}\right] \]

Example: For a 95% confidence level C, 95% of the area under Student’s t curve is contained in the interval [−t*, t*].

Examples:

95%, sample size n=3

\[ \left[\bar{X} - 4.3 \times \frac{s}{\sqrt{3}},\ \bar{X} + 4.3 \times \frac{s}{\sqrt{3}}\right] \]

95%, sample size n=10

\[ \left[\bar{X} - 2.26 \times \frac{s}{\sqrt{10}},\ \bar{X} + 2.26 \times \frac{s}{\sqrt{10}}\right] \]

(44)

Confidence level and the margin of error

The confidence level C determines the value of t* (in table D). The margin of error also depends on t*.

\[ m = t^{*} \times \frac{s}{\sqrt{n}} \]

[Figure: t curve with central area C between −t* and t*.]

§  Higher confidence C implies a larger

margin of error m (thus less precision

in our estimates).

§  A lower confidence level C produces

a smaller margin of error m (thus

better precision in our estimates).

§  We find t* in the line of Table D for df

(45)

Table D

When the sample is very large, we use the normal distribution and the standardized z-value.

When σ is unknown, we use a t distribution with “n−1” degrees of freedom (df).

Table D shows the z-values and t-values corresponding to landmark P-values/ confidence levels.

\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \]

(46)

p  Focus first on 2.5%. For each n, the 2.5% corresponds to the area on

the left and right tails of the t-distribution with n degrees of freedom.

Remember a distribution gives the chance/likelihood of certain outcomes.

p  Recall that for a normal distribution, the point where we get 2.5% on the

left and the right of the tails of the distribution is 1.96 (which is the very last row of the table).

p  If we go down the table. we see that as the sample size, n, increases the

value corresponding to 2.5, goes from 12.71 (for n=1) to a number that is very close to 1.96 for extremely large n.

p  This means for small n the variability on the standard deviation s means

that the chance of the t-transform being extreme is relatively large.

p  However, as n grows, the estimator of the standard deviation improves,

and the t-transform gets closer to a normal distribution.

p  You observe the same is true for other percentages.

p  90% means looking up 5%

p  99% means looking up 0.5%

(47)

Case 1: Normal data – sample size 3, using t-dist

In this example we draw samples of size 3:

q  The 95% CI using the above data and the t-distribution is

This is the same example as

considered previously, but now the t-distribution has been used.

1.96 has been replaced with 4.3. From the plot of the right we see that using the t-distribution to

construct the CI about 95% of the 95% confidence intervals really do contain the population mean.

By using the t-distribution we have corrected for under the

underestimation of the sample sd.  69.9 4.3_⇥ 1_p.73 3 ,69.6 + 4.3 ⇥ 1.73 p 3
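A small simulation makes the same point as the plot. This is a sketch under stated assumptions: the population mean 70 and standard deviation 2 are hypothetical values chosen for illustration (they are not taken from the slides), and 4.303 is the unrounded version of the 4.3 used above.

```python
import math
import random

def coverage(multiplier, n=3, mu=70.0, sigma=2.0, reps=20000, seed=1):
    """Fraction of intervals  xbar +/- multiplier * s / sqrt(n)  that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        m = multiplier * s / math.sqrt(n)
        if xbar - m <= mu <= xbar + m:
            hits += 1
    return hits / reps

cover_z = coverage(1.96)    # normal multiplier, s estimated from only 3 points
cover_t = coverage(4.303)   # t multiplier for df = 2
print(f"using 1.96 : {cover_z:.3f}")
print(f"using 4.303: {cover_t:.3f}")
```

With normal data and n = 3, the intervals built with 1.96 should capture the mean noticeably less often than 95% of the time, while those built with the t multiplier should be close to 95%.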

(48)

Non-normal data: A misconception

Using a t-distribution rather than a normal distribution when constructing a confidence interval does not correct for a lack of normality in the data. In the example on the left, we use the t-distribution to construct the CI, but we observe that only 88 of the 100 95% confidence intervals contain the mean. Fundamentally, if the data are not normal and the sample size is small, neither the normal nor the t will give the correct 95% confidence interval.

REMEMBER we only use the t-distribution because we have estimated the standard deviation from the data.
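The misconception can be illustrated by simulation. The sketch below assumes a skewed population (exponential with mean 1) and samples of size 5; both choices are hypothetical, since the slide does not say which distribution or sample size produced the plot. The multiplier 2.776 is the 2.5% t value for 4 degrees of freedom.

```python
import math
import random

def t_coverage(draw, true_mean, n, t_star, reps=20000, seed=2):
    """Fraction of t intervals that contain true_mean when data come from draw(rng)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        m = t_star * s / math.sqrt(n)
        if xbar - m <= true_mean <= xbar + m:
            hits += 1
    return hits / reps

T_STAR_DF4 = 2.776  # 2.5% column, df = 4

normal_cov = t_coverage(lambda rng: rng.gauss(1.0, 1.0), 1.0, 5, T_STAR_DF4)
skewed_cov = t_coverage(lambda rng: rng.expovariate(1.0), 1.0, 5, T_STAR_DF4)
print(f"normal data     : {normal_cov:.3f}")
print(f"exponential data: {skewed_cov:.3f}")
```

The same t multiplier is used in both runs; only the shape of the population changes, and the coverage for the skewed data should fall short of the nominal 95%.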

(49)

Calculation practice (red wine 1)

It has been suggested that drinking red wine in moderation may protect against heart attacks. This is because red wine contains polyphenols which act on blood cholesterol.

To see if moderate red wine consumption increases the average blood level of polyphenols, a group of nine randomly selected healthy men were assigned to drink half a bottle of red wine daily for two weeks. The percent changes in their blood polyphenol levels are presented here:

0.7 3.5 4.0 4.9 5.5 7.0 7.4 8.1 8.4

Sample average x̄ = 5.50

Sample standard deviation s = 2.517

Degrees of freedom df = n − 1 = 8

We will encounter two problems when doing the analysis. The first is that the sample size is not huge, so we have to hope that the distribution of the sample mean is close to normal. The second is that the standard deviation is unknown and has to be estimated from the data.

(50)

q  What is the 95% confidence interval for the average percent change?

p  First, we determine what t* is. The degrees of freedom are df =

n − 1 = 8 and C = 95%.

p  The margin of error m is: m = t* × s/√n = 2.306 × 2.517/√9 ≈ 1.93. So

the 95% confidence interval is 5.50 ± 1.93, or 3.57 to 7.43.

p  We can say “With 95% confidence, the mean of percent increase

is between 3.57% and 7.43%.”

p  What if we want a 99% confidence interval instead?

p  For C = 99% and df = 8, we find t* = 3.355. Thus m = 3.355 × 2.517/

√9 ≈ 2.81.

p  Now, with 99% confidence, we only can conclude the mean is

between 2.69 and 8.31. (A big price to pay for the extra confidence.)
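The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch using the summary statistics and the two t* values quoted above; the helper name margin is made up for the example.

```python
import math

xbar, s, n = 5.50, 2.517, 9          # summary statistics from the slide
t_table = {95: 2.306, 99: 3.355}     # t* for df = 8, as read from Table D

def margin(t_star, s, n):
    """Margin of error m = t* * s / sqrt(n)."""
    return t_star * s / math.sqrt(n)

for level, t_star in t_table.items():
    m = margin(t_star, s, n)
    print(f"{level}% CI: {xbar - m:.2f} to {xbar + m:.2f}  (m = {m:.2f})")
```

Raising the confidence level changes only t*, so the interval widens while the centre stays at the sample mean.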

(…)

(51)

Calculation practice (red wine 2)

Let us return to the same study, but this time we increase the sample size to 15 men. The data is now:

0.7,3.5,4,4.9,5.5,7,7.4,8.1,8.4, 3.2,0.8,4.3,-0.2,-0.6,7.5

The sample mean in this case is 4.3 and the sample standard deviation is 3.06.

Since the sample size has increased, it is likely that the sample standard deviation is a more reliable estimator of the true standard deviation.

The number of degrees of freedom is 14.

Just as in the previous example we can construct a 95% confidence interval, but now we use 14 df instead of 8 df.

Solution: Using the t-tables (14 df, 2.5%) the 95% CI is

[4.3 ± 2.145 × 3.06/√15] = [2.6, 6]

(52)

Confidence intervals using Software

p  Usually software will construct the confidence interval for you.

Therefore it is important to connect the calculations with the statistical output.

The box on the right is the output (it is superimposed on the window used to generate the output). Observe that L. Limit and U. Limit give the confidence interval [2.6, 6] calculated on the previous slide.

DF = 14 matches the degrees of freedom.
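To connect the hand calculation with the software columns, the sketch below recomputes the same quantities from the 15 raw observations. It assumes t* = 2.145 is taken from the t-tables as on the previous slide; the function name and dictionary keys are made up to mirror the output headings.

```python
import math

wine15 = [0.7, 3.5, 4, 4.9, 5.5, 7, 7.4, 8.1, 8.4,
          3.2, 0.8, 4.3, -0.2, -0.6, 7.5]

def summary_row(data, t_star):
    """Return the quantities shown in the software output for a one-sample t interval."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    std_err = s / math.sqrt(n)
    return {
        "mean": mean,
        "std_err": std_err,
        "df": n - 1,
        "lower": mean - t_star * std_err,
        "upper": mean + t_star * std_err,
    }

row = summary_row(wine15, t_star=2.145)   # 14 df, 2.5% in each tail
print({k: round(v, 2) for k, v in row.items()})
```

The lower and upper limits agree with [2.6, 6] to the precision shown on the slide.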

(53)

Calculation practice 3

p  Let us return to the example of prices of apartments in Dallas. 10

apartments are randomly sampled. The sample mean and the

sample standard deviation based on this sample is 980 dollars and 250 dollars (both are estimators based on a sample of size ten). Construct a 95% confidence interval for the mean:

p  The standard error is 250/√10 = 79.

p  Looking up the t-tables at 2.5% and 9 degrees of freedom gives 2.262.

p  The 95% confidence interval for the mean is [980 ±

2.262×79]=[801,1159].

q  Suppose we want to know whether the price of apartments have

increased since last year, where the mean price was 850 dollars.

q  Based on this interval we see that 850 dollars and greater is contained in

this interval. This means the mean could be 850 dollars or higher. There given the sample it is unclear whether the mean price of apartments has increased since last year or not.
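The reasoning above can be sketched as a short calculation followed by a check of whether last year's mean lies inside the interval. All numbers are taken from the slide; the variable names are illustrative.

```python
import math

n, xbar, s = 10, 980.0, 250.0   # sample size, sample mean, sample sd (dollars)
t_star = 2.262                  # t-tables: 2.5%, 9 degrees of freedom
last_year = 850.0               # last year's mean price

std_err = s / math.sqrt(n)
lower = xbar - t_star * std_err
upper = xbar + t_star * std_err
inside = lower <= last_year <= upper

print(f"standard error = {std_err:.0f}")
print(f"95% CI = [{lower:.0f}, {upper:.0f}]")
print("850 is inside the interval" if inside else "850 is outside the interval")
```

Because 850 is a plausible value for this year's mean, the sample alone does not show that prices have risen.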

(54)

Calculation practice 4

p  Let us return to the M&M data. Suppose we want to calculate a 99%

confidence interval for the mean number of M&Ms in plain, peanut butter and peanut M&Ms. These can be calculated using the

summary statistics output:

Summary statistics for Total:
Group by: Type

Type  n   Mean       Variance  Std. Dev.  Std. Err.   Median  Range  Min  Max  Q1  Q3
M     84  17.297619  8.259753  2.8739786  0.3135768   18      14     7    21   17  19
P     40  8.675      9.814744  3.1328492  0.49534693  8       15     6    21   7   8
PB    46  10.913043  3.325604  1.8236238  0.26887867  11      10     8    18   10  11

Using this output we can calculate the confidence intervals for the mean number of M&Ms in each type.

(55)

Using Software to obtain confidence intervals

p  Go to Stats -> t-statistics -> one-sample -> with data -> select the

column you want to analyse (choose the Group by if you want it grouped), on the next page select confidence interval and the level you want it at.

Sample mean   Std. err   DF   L Limit   U limit
17.2          0.31       83   16.4      18.12
8.6           0.49       39   7.33      10.01
10.9          0.268      45   10.18     11.63

Looking at the intervals, do you think that the mean number of M&Ms in a plain bag and in a peanut bag could be the same?

What about the mean number in peanut and peanut butter? Later on we shall make a formal test on these questions.
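The 99% intervals can also be computed by hand from the summary statistics. The sketch below is hedged in two ways: the t* values for 83, 39 and 45 degrees of freedom are approximate values (0.5% in each tail) that do not appear on the slides, and the reading of the type codes (M as plain, P as peanut, PB as peanut butter) is an assumption. It also checks whether pairs of intervals overlap, which is an informal look at the questions above, not the formal test promised later.

```python
import math

# (n, mean, std. dev.) copied from the summary statistics output
groups = {
    "M":  (84, 17.297619, 2.8739786),
    "P":  (40, 8.675, 3.1328492),
    "PB": (46, 10.913043, 1.8236238),
}

# Approximate 99% critical values by degrees of freedom (assumed, not from the slides)
t_star_99 = {83: 2.636, 39: 2.708, 45: 2.690}

def interval_99(n, mean, sd):
    """99% t interval from summary statistics."""
    m = t_star_99[n - 1] * sd / math.sqrt(n)
    return mean - m, mean + m

def overlap(a, b):
    """True if intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

intervals = {name: interval_99(*stats) for name, stats in groups.items()}
for name, (lo, hi) in intervals.items():
    print(f"{name:2s}: [{lo:.2f}, {hi:.2f}]")
print("M and P overlap: ", overlap(intervals["M"], intervals["P"]))
print("P and PB overlap:", overlap(intervals["P"], intervals["PB"]))
```

The limits agree with the software output up to rounding in the last digit.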

(56)

Calculation practice: coffee shop sales

p  The degrees of freedom is 45−1 = 44.

p  For 90% confidence, we find t* = 1.680.

p  The margin of error is 1.680×1.03/√45 = 0.258

p  So the interval for the true mean is 2.67 ± 0.26.

p  “We conclude that the mean annual sales of all coffee shops is between

$2.41 million and $2.93 million, with 90% confidence.” A marketing firm randomly samples 45 coffee shops and determines their annual sales. The sample has an average of $2.67 million and a standard deviation of $1.03 million. What can we say with 90% confidence about the mean

annual sales for the population of all coffee shops?

n

s

t

x

±

*
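With 44 degrees of freedom the t value is already close to the normal value. As an illustrative aside (the comparison with z* is not part of the slide), the sketch below computes the margin of error with the slide's t* = 1.680 and with the corresponding normal value from Python's standard library.

```python
import math
from statistics import NormalDist

n, xbar, s = 45, 2.67, 1.03               # sales in millions of dollars
t_star = 1.680                            # 90% confidence, df = 44 (from the slide)
z_star = NormalDist().inv_cdf(0.95)       # normal value leaving 5% in the upper tail

std_err = s / math.sqrt(n)
m_t = t_star * std_err
m_z = z_star * std_err

print(f"t margin = {m_t:.3f}  ->  [{xbar - m_t:.2f}, {xbar + m_t:.2f}]")
print(f"z margin = {m_z:.3f}  ->  [{xbar - m_z:.2f}, {xbar + m_z:.2f}]")
```

The t interval is slightly wider, which reflects the extra uncertainty from estimating the standard deviation.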

(57)

Summary of confidence interval for µ

p  The confidence interval for a population mean µ is x̄ ± t* × s/√n.

p  t* is obtained from Student’s t distribution using n−1 degrees of freedom. (Table D in the textbook.)

p  t* is the value such that the confidence level C is the area between –t* and t*.

p  Confidence is the proportion of samples that lead to a correct conclusion (for a specific method of inference).

p  The investigator chooses the confidence level C.

p  Tradeoff: more confidence means bigger margin of error, wider intervals.

p  The degrees of freedom is associated with s, the estimate for σ.

p  The margin of error t* × s/√n also depends on the sample size: larger samples are better.

(58)

Interpretation of confidence, again

p  The confidence level C is the proportion of all possible random

samples (of size n) that will give results leading to a correct conclusion, for a specific method.

p  In other words, if many random samples were obtained and

confidence intervals were constructed from their data with C = 95% then 95% of the intervals would contain the true parameter value.

p  In the same way, if an investigator always uses C = 95% then 95%

of the confidence intervals he constructs will contain the parameter value being estimated.

p  But he never knows which ones do!

p  Changing the method (such as changing the value of t*) will change

the confidence level.

p  Once computed, any individual confidence interval either will or will

not contain the true population parameter value. It is not random.

p  It is not correct to say C is the probability that the true value falls in

(59)

Cautions about using x̄ ± t* × s/√n

p  This formula is only for inference about µ, the population mean.

Different formulas are used for inference about other parameters.

p  The data must be a simple random sample from the population. p  The formula is not quite correct for other sampling designs. (But see

a statistician to get the right inference method.)

p  Confidence intervals based on t* are not resistant to outliers.

p  If n is small and the population is not normal, the true confidence

level could be smaller than C. (Usually n ≥ 30 suffices unless the data are highly skewed.)

p  This inference cannot rescue sampling bias, badly produced data or

computational errors.
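The caution about outliers can be illustrated with the nine red wine observations used earlier. In this sketch one value is replaced by a hypothetical data-entry error (8.4 typed as 84); the corrupted value is invented for the example and is not part of the study.

```python
import math

def t_interval(data, t_star):
    """One-sample t interval  xbar +/- t* * s / sqrt(n)  from raw data."""
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    m = t_star * s / math.sqrt(n)
    return xbar - m, xbar + m

wine = [0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4]
with_outlier = wine[:-1] + [84.0]   # hypothetical typo: 8.4 entered as 84

T_STAR = 2.306                      # 95% confidence, df = 8 in both cases

clean = t_interval(wine, T_STAR)
contaminated = t_interval(with_outlier, T_STAR)
print(f"original data: [{clean[0]:.2f}, {clean[1]:.2f}]")
print(f"with outlier : [{contaminated[0]:.2f}, {contaminated[1]:.2f}]")
```

A single extreme value shifts the sample mean and inflates s, so the interval moves and becomes many times wider.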


(60)

Accompanying problems associated with this Chapter

p  Quiz 7

p  Quiz 8