
Techniques of Statistical Analysis I

Lect_3: Confidence Intervals for the mean

Bruno Arpino

Outline

- From point to confidence interval estimation
- The general structure of a confidence interval
- How to calculate a confidence interval for the population mean

Where are we and where are we going?

- We saw that the “best” point estimate for a population mean, in general, is the sample mean.
- The sample mean estimator is unbiased: $E(\bar{X}) = \mu$
- It also has lower variance than other estimators. In particular, its standard error is $\sigma_{\bar{X}} = \sigma / \sqrt{n}$
- If the distribution of our variable of interest is normal or the sample is large (Central Limit Theorem), we know that the sampling distribution of the mean estimator is normally (or approximately normally) distributed.
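
The two properties above can be illustrated with a small simulation. This is only a sketch: the population values (mean 10, standard deviation 2), the sample size and the helper name `simulate_sample_means` are hypothetical choices made for illustration, and the population is assumed to be normal.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, repetitions, seed=0):
    """Draw `repetitions` samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repetitions)
    ]

# Hypothetical population values, chosen only for illustration.
mu, sigma, n = 10.0, 2.0, 25
means = simulate_sample_means(mu, sigma, n, repetitions=5000)

# Unbiasedness: the average of the sample means should be close to mu.
print("average of sample means:", round(statistics.fmean(means), 3))
# Variability: their standard deviation should be close to sigma / sqrt(n).
print("std. dev. of sample means:", round(statistics.stdev(means), 3))
print("theoretical standard error:", sigma / n ** 0.5)
```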

Where are we and where are we going? (cont’d)

- Are we satisfied with a point estimate? No! Because it does not take into account the sampling error!
- How can we improve things? By providing a range of “plausible” values for the population mean.
- Idea: build a confidence interval around the point estimate using the standard error as a measure of variability.
- We need to decide how confident we want to be.

How can we make a “confidence” statement like this?

General structure of a confidence interval (CI)

[Figure: Lower Confidence Limit (a), Point Estimate, Upper Confidence Limit (b); width of confidence interval W = b - a]

- Most CIs have the form: Point Estimate ± Margin of Error
- The margin of error (ME = W/2) depends on two factors: the level of confidence that we require (reliability) and the spread of the sampling distribution of the point estimator. So a CI can be written: Point Estimate ± (Reliability Factor)(Standard Error)

Building a CI for the population mean (population standard deviation known)

- Set the level of confidence such that, if we draw many random samples, in a large share of cases the true mean will fall within the CIs.
- The level of confidence is $1-\alpha$.
- The true value of the mean would be contained in $100(1-\alpha)\%$ of the CIs.
- E.g. if $1-\alpha = 0.95$, then 95% of the CIs would contain the true mean.

Building a CI for the population mean (cont’d)

- Example. The General Social Survey (GSS) collects data on loneliness (number of days in the past week you felt lonely). Assume that the population mean is 1.8 with a standard deviation of 2.2 and that the sample size was fixed at 900.
- BEFORE drawing any sample, we can calculate an interval around the population mean within which the sample mean will fall with a high probability.
- The sampling distribution of the mean is $\bar{X} \sim N(1.8,\ 0.073^2)$, where $0.073 \approx 2.2/\sqrt{900}$ is the standard error.
- E.g., with 95% probability the sample mean will fall in the interval (1.66, 1.94).
- These limits were computed by adding and subtracting 1.96 standard deviations (of the sampling distribution) to/from the mean of 1.8 as follows:
  1.8 - (1.96)(0.073) = 1.66
  1.8 + (1.96)(0.073) = 1.94
- The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean (http://onlinestatbook.com/java/normalshade.html).
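
The limits of the loneliness example can be reproduced in a few lines. This is a sketch using Python's `statistics.NormalDist`; the variable names are illustrative and not part of the slides.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1.8, 2.2, 900          # values assumed in the GSS example
standard_error = sigma / sqrt(n)      # 2.2 / 30, about 0.073
z = NormalDist().inv_cdf(0.975)       # about 1.96

lower = mu - z * standard_error
upper = mu + z * standard_error
print(round(standard_error, 3), round(lower, 2), round(upper, 2))
```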

Building a CI for the population mean (cont’d)

- That 95% of the area of a normal distribution is within 1.96 standard deviations of the mean can be checked by using the table of the standard normal distribution, $Z \sim N(0, 1)$.
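
Instead of a printed table, the same check can be sketched with the standard normal distribution from Python's standard library (the variable names are illustrative):

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)            # standard normal distribution

# Area between -1.96 and +1.96: P(-1.96 < Z < 1.96)
area_within = Z.cdf(1.96) - Z.cdf(-1.96)

# The other way round: the value leaving 2.5% in the upper tail
z_975 = Z.inv_cdf(0.975)

print(round(area_within, 4), round(z_975, 3))
```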

Building a CI for the population mean (cont’d)

[Figure: sampling distribution of the mean, $\bar{X} \sim N(1.8,\ 0.073^2)$]

- The figure shows that 95% of the sample means are no more than 0.14 units (1.96 standard deviations) from the mean of 1.8.
- Now consider the probability that a sample mean computed in a random sample is within 0.14 units of the population mean of 1.8. Since 95% of the distribution is within 0.14 of 1.8, the probability that the mean from any given sample will be within 0.14 of 1.8 is 95%.
- This means that if we repeatedly compute the mean $\bar{x}$ from a sample and create an interval ranging from $\bar{x} - 0.14$ to $\bar{x} + 0.14$, this interval will contain the population mean 95% of the time.
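
The "95% of the time" statement can be checked with a simulation. This sketch rests on an assumption: sample means are drawn directly from the sampling distribution $N(1.8,\ 0.073^2)$ instead of from raw survey answers. The number of repetitions and the seed are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1.8, 2.2, 900
standard_error = sigma / sqrt(n)
half_width = NormalDist().inv_cdf(0.975) * standard_error   # about 0.14

# Assumption: each simulated sample mean comes straight from the
# sampling distribution N(mu, standard_error^2).
sample_means = NormalDist(mu, standard_error).samples(10_000, seed=1)

# Count how many intervals (xbar - half_width, xbar + half_width) contain mu.
covered = sum(
    1 for xbar in sample_means
    if xbar - half_width <= mu <= xbar + half_width
)
coverage = covered / len(sample_means)
print(f"share of intervals containing mu: {coverage:.3f}")
```

The share should be close to 0.95, although any single interval either contains the population mean or it does not.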

Building a CI for the population mean (cont’d)

- So, in general, a $100(1-\alpha)\%$ CI for the population mean is:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right)$$

CI for the population mean: an example

- The Ayuntamiento de Barcelona decided to monitor the use of the Bicing system. A survey was carried out on a sample of 300 residents asking: “How many times did you use the Bicing system in the past week?”.
- It has been decided that if the average number of Bicing uses is found to be below 2, the system will be discontinued.
- Calculate a 95% confidence interval for the mean number of Bicing uses, assuming that it follows a normal distribution with standard deviation equal to 0.35. You also know that in the drawn sample the average number of Bicing uses was 2.2.

CI for the population mean: an example (cont’d)

- n = 300.
- The point estimate is $\bar{x} = 2.2$, which is above the threshold of 2 set by the Ayuntamiento. But it’s only a sample average! It could be that we were lucky and drew a sample of “good” citizens!
- Let's build a CI to take into account the sampling variability:

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 2.2 - 1.96 \cdot \frac{0.35}{\sqrt{300}} = 2.16$$

$$\bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 2.2 + 1.96 \cdot \frac{0.35}{\sqrt{300}} = 2.24$$

- So: $CI_{0.95}(\mu) = (2.16,\ 2.24)$
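
The Bicing calculation can be sketched as a small function. The name `z_confidence_interval` and its parameters are hypothetical, introduced here only for illustration; the function assumes that σ is known.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """CI for the mean when the population standard deviation sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # reliability factor z_{alpha/2}
    margin = z * sigma / sqrt(n)              # margin of error
    return sample_mean - margin, sample_mean + margin

# Bicing example: n = 300, sample mean 2.2, sigma = 0.35
lower, upper = z_confidence_interval(sample_mean=2.2, sigma=0.35, n=300)
print(round(lower, 2), round(upper, 2))   # 2.16 2.24
print(lower > 2)                          # is the whole interval above the threshold of 2?
```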

CI for the population mean: an example (cont’d)

$$CI_{0.95}(\mu) = (2.16,\ 2.24)$$

Interpretation:

- We are 95% confident that the true average use is between 2.16 and 2.24 times per week.
- (Although the true mean may or may not be in this interval, 95% of intervals formed in this manner will contain the true mean.)
- Note: it is not correct to say that the true average is between 2.16 and 2.24 times per week with probability equal to 95%. In fact, after we draw a sample, the true mean either falls in our CI or it does not.

Conclusion:

- The Bicing system should not be discontinued because (even accounting for sampling variability) the whole interval lies above the threshold of 2.

A closer look at the CI formula

- A $100(1-\alpha)\%$ CI for the population mean is:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right)$$

- The margin of error is:

$$ME = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

- The width is:

$$W = 2 \cdot ME = 2 \cdot z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

Assumptions

In calculating the CI we were assuming:

- The population distribution of X is normal (or we have a big sample).
- The population standard deviation of X ($\sigma$) is known.

CI for the population mean when σ is unknown

- This is the most common situation in practice.
- We need an estimate of σ.
- The “best” estimator of σ is the sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$

- Note that in the formula we divide by n-1 and not n! This is because it can be proved that only in this way is $s^2$ an unbiased estimator of $\sigma^2$. (Believe me ☺ or check it in any introductory statistics textbook.)
- We can build a CI for the mean using s instead of σ, but in this case the standardized sample mean $(\bar{X} - \mu)/(s/\sqrt{n})$ will not follow a normal distribution but a Student's t distribution.
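
The n-1 divisor can be seen in a short sketch. The data values are made up for illustration; `statistics.stdev` divides by n-1, while `statistics.pstdev` divides by n.

```python
import math
import statistics

data = [2, 4, 4, 5, 7]   # hypothetical observations, for illustration only
n = len(data)
x_bar = sum(data) / n

# Sample standard deviation computed from the formula (divide by n - 1)
s_manual = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

print(round(s_manual, 4))                  # formula with n - 1
print(round(statistics.stdev(data), 4))    # same divisor, n - 1
print(round(statistics.pstdev(data), 4))   # divides by n, so it is smaller
```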

t versus z

- The t is a family of distributions that depends on the degrees of freedom (d.f.) = n-1.
- d.f. = number of observations that are free to vary after the sample mean has been calculated.

CI for the population mean when σ is unknown (cont’d)

- The general formula of a CI for the population mean with unknown variance is:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\alpha/2}\frac{s}{\sqrt{n}} \right)$$

- For any given confidence level, t-values are bigger than z-values, resulting in wider CIs. This reflects the greater uncertainty due to the fact that s is only an estimate of the population standard deviation.
- However, for n > 30 t- and z-values are almost indistinguishable and the previous formula for the CI (with s in place of σ) could be used.
- If n is not large, we need the assumption that the distribution of X is normal!
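
To see how t-values approach z-values as n grows, here is a rough numerical sketch. Python's standard library has no t distribution, so the critical value is approximated by integrating the t density with Simpson's rule and bisecting. The function names `t_pdf`, `t_cdf` and `t_critical` are hypothetical helpers, not a library API.

```python
from math import exp, lgamma, log, log1p, pi
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log1p(x * x / df))

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] plus symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, alpha=0.05):
    """Value t such that P(T > t) = alpha / 2, found by bisection."""
    target = 1 - alpha / 2
    low, high = 0.0, 100.0
    for _ in range(50):
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

z = NormalDist().inv_cdf(0.975)
for n in (5, 10, 25, 31, 900):
    print(f"n = {n:3d}  t = {t_critical(n - 1):.3f}  z = {z:.3f}")
```

In practice a statistical package or a printed t table would be used; the numerical integration is only a way to keep the sketch self-contained.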



CI for the population mean when σ is unknown: an example

- The GSS also asks “On average, how many hours do you personally watch TV per day?”
- Knowing that n = 900, $\bar{x} = 2.865$ and s = 2.617, estimate the population mean with a 95% CI. Also consider that $t_{899;0.025} = 1.963$.

CI for the population mean when σ is unknown: an example (cont’d)

Solution:

- We have to use the formula for the case with unknown standard deviation:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\alpha/2}\frac{s}{\sqrt{n}} \right)$$

  where $t_{899;0.025} = 1.963$, $\bar{x} = 2.865$ and s = 2.617.
- Substituting these values in the formula we have:

$$CI_{0.95}(\mu) = \left( 2.865 - 1.963 \cdot \frac{2.617}{\sqrt{900}},\ 2.865 + 1.963 \cdot \frac{2.617}{\sqrt{900}} \right) = (2.69,\ 3.04)$$

- Note that the same result is obtained if we use the Z distribution instead of the t. In fact: $t_{899;0.025} = 1.963 \approx z_{0.025} = 1.96$ (the sample size here is very large!).
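
The substitution can be sketched in code. The function name `t_confidence_interval` is hypothetical, and the t-value 1.963 is taken from the slide rather than computed.

```python
from math import sqrt

def t_confidence_interval(sample_mean, s, n, t_value):
    """CI for the mean when sigma is unknown and estimated by s."""
    margin = t_value * s / sqrt(n)   # t_{n-1; alpha/2} * s / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# TV example: n = 900, sample mean 2.865, s = 2.617, t_{899; 0.025} = 1.963
lower, upper = t_confidence_interval(2.865, 2.617, 900, t_value=1.963)
print(round(lower, 2), round(upper, 2))   # 2.69 3.04

# Using z = 1.96 instead of t gives the same interval after rounding.
lower_z, upper_z = t_confidence_interval(2.865, 2.617, 900, t_value=1.96)
print(round(lower_z, 2), round(upper_z, 2))
```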

CI for the population mean: summary

Point Estimate ± (Reliability Factor)(Standard Error)

- When σ is known:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right)$$

- When σ is unknown:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\alpha/2}\frac{s}{\sqrt{n}} \right)$$

- In both cases a crucial assumption is that X is normally distributed (or that the sample size is large enough to apply the Central Limit Theorem).

CI for the population mean: a note on the ME

- When σ is known: $ME = z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
- When σ is unknown: $ME = t_{n-1;\alpha/2}\,\dfrac{s}{\sqrt{n}}$

In both cases the ME decreases as:

- the sample size increases
- the sampling variability decreases (standard error or its estimate)
- the reliability factor decreases (lower confidence level)
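
The three effects listed above can be illustrated for the case in which σ is known. This is a sketch: the function name `margin_of_error` is hypothetical, and σ = 2.2 is borrowed from the loneliness example.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """ME = z_{alpha/2} * sigma / sqrt(n), for known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)

sigma = 2.2   # standard deviation from the loneliness example

# Larger samples: quadrupling n halves the margin of error.
for n in (225, 900, 3600):
    print("n =", n, " ME =", round(margin_of_error(sigma, n), 3))

# Lower confidence level: smaller reliability factor, smaller ME.
for confidence in (0.99, 0.95, 0.90):
    print("confidence =", confidence, " ME =", round(margin_of_error(sigma, 900, confidence), 3))

# Less variability: halving sigma halves the ME.
print("sigma = 1.1  ME =", round(margin_of_error(1.1, 900), 3))
```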

Are you now confident about confidence intervals?

Exercise to solve at home:

- On a random sample of n = 25 students it was calculated that the Grade Point Average (GPA) is 50 with a standard deviation of 8. Form a 95% confidence interval for the GPA in the population of all students.

Note that: $t_{n-1;\alpha/2} = t_{24;0.025} = 2.0639$

Solution:

$$CI_{0.95}(\mu) = (46.698,\ 53.302)$$

- How do you interpret this result?

a. 95% of the students have a score between 46.698 and 53.302
b. We can be 95% confident the sample mean is between 46.698 and 53.302
c. We can be 95% confident the population mean is between 46.698 and 53.302
d. If random samples of size 25 were repeatedly selected, in the long run 95% of them would contain the value 50
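
The limits in the solution can be verified with a short sketch. The variable names are illustrative, and the t-value is the one given in the exercise.

```python
from math import sqrt

n = 25
x_bar = 50.0        # sample GPA
s = 8.0             # sample standard deviation
t_value = 2.0639    # t_{24; 0.025}, as given in the exercise

margin = t_value * s / sqrt(n)   # 2.0639 * 8 / 5
lower = x_bar - margin
upper = x_bar + margin
print(round(lower, 3), round(upper, 3))   # 46.698 53.302
```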

A historical note

- The t distribution, also called Student's t, was developed by the English statistician William Gosset of Guinness Breweries.
- Because of company policy forbidding the publication of company work in one’s own name, Gosset used the pseudonym Student in his articles.
- His discoveries were stimulated by the fact that he was given only small samples of brew to test (guess why?), and realized he could not use normal z-scores after substituting s in the standard error formula.
- So, drinking beer should make you

If something is not clear (or you find mistakes in the slides)