
Lecture_3 [Modo de compatibilidad]


Academic year: 2020


Techniques of Statistical Analysis I

Lect_3: Confidence Intervals for the mean

Bruno Arpino

Outline

From point to confidence interval estimation

The general structure of a confidence interval

How to calculate a confidence interval for the population mean

Where are we and where are we going?

We saw that the “best” point estimate for a population mean, in general, is the sample mean.

The sample mean estimator is unbiased:

$$E(\bar{X}) = \mu$$

And has lower variance than other estimators. In particular:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

If the distribution of our variable of interest is normal or the sample is large (Central Limit Theorem), we know that the sampling distribution of the mean estimator is normally (or approximately normally) distributed.
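The two properties above (the sample mean is unbiased and its standard error is $\sigma/\sqrt{n}$) can be checked with a small simulation. This is only a sketch: the population values `MU`, `SIGMA`, the sample size `N` and the number of repetitions `REPS` are hypothetical choices, not values from the lecture, and the population is assumed to be normal.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

MU, SIGMA = 10.0, 2.0   # hypothetical population mean and standard deviation
N = 25                  # size of each sample
REPS = 10_000           # number of repeated samples

# Draw REPS samples of size N and store the mean of each one.
sample_means = [
    statistics.fmean([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(REPS)
]

mean_of_means = statistics.fmean(sample_means)   # should be close to MU
empirical_se = statistics.stdev(sample_means)    # should be close to SIGMA / sqrt(N)
theoretical_se = SIGMA / N ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (mu = {MU})")
print(f"empirical SE: {empirical_se:.3f}, sigma/sqrt(n): {theoretical_se:.3f}")
```

The spread of the simulated sample means should be close to $\sigma/\sqrt{n} = 0.4$, well below the population standard deviation of 2.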

Where are we and where are we going? (cont’d)

Are we satisfied with a point estimate? No! Because it does not take into account the sampling error!

How can we improve things? By providing a range of “plausible” values for the population mean.

Idea: build a confidence interval around the point estimate, using the standard error as a measure of variability.

We need to decide how confident we want to be.

How can we make a “confidence” statement like this?

General structure of a confidence interval (CI)

[Figure: a point estimate lying between the lower confidence limit (a) and the upper confidence limit (b); the width of the confidence interval is W = b - a]

Most CIs have the form:

Point Estimate ± Margin of Error

The margin of error (ME = W/2) depends on two factors: the level of confidence that we require (reliability) and the spread of the sampling distribution of the point estimator. So a CI can be written:

Point Estimate ± (Reliability Factor)(Standard Error)

Building a CI for the population mean (population standard deviation known)

Set the level of confidence such that, if we draw many random samples, the CIs contain the true mean in a large share of cases.

The level of confidence is $1-\alpha$.

The true value of the mean would be contained in $100(1-\alpha)\%$ of the CIs.

E.g. if $1-\alpha = 0.95$, then 95% of the CIs would contain the true mean.

Building a CI for the population mean (cont’d)

Example. The General Social Survey (GSS) collects data on loneliness (number of days in the past week on which you felt lonely). Assume that the population mean is 1.8 with a standard deviation of 2.2, and that the sample size is fixed at 900.

BEFORE drawing any sample, we can calculate an interval around the population mean within which the sample mean will fall with a high probability. The sampling distribution of the sample mean is

$$\bar{X} \sim N(1.8,\ 0.073^2)$$

where $0.073 = 2.2/\sqrt{900}$ is the standard error.

E.g., with 95% probability the sample mean will fall in the interval (1.66, 1.94).

These limits were computed by subtracting and adding 1.96 standard deviations from/to the mean of 1.8 as follows:

1.8 - (1.96)(0.073) = 1.66
1.8 + (1.96)(0.073) = 1.94

The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean (see http://onlinestatbook.com/java/normalshade.html).
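The arithmetic of the GSS loneliness example can be reproduced in a few lines. This is a minimal sketch that uses only the values stated in the example (mean 1.8, standard deviation 2.2, n = 900, reliability factor 1.96); the variable names are illustrative.

```python
from math import sqrt

mu, sigma, n = 1.8, 2.2, 900   # population mean, population sd, sample size
z = 1.96                       # reliability factor for 95%

se = sigma / sqrt(n)           # standard error of the sample mean
lower = mu - z * se            # lower limit of the interval around mu
upper = mu + z * se            # upper limit of the interval around mu

print(f"SE = {se:.3f}; interval = ({lower:.2f}, {upper:.2f})")
```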

Building a CI for the population mean (cont’d)

The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean.

This can be checked by using the table of the standard normal distribution Z:

$$Z \sim N(0,\ 1)$$
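As an alternative to the printed Z table, the same check can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The names `area` and `z_crit` are just illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard normal distribution

# Area under the standard normal curve between -1.96 and +1.96.
area = Z.cdf(1.96) - Z.cdf(-1.96)

# Going the other way: the value that leaves alpha/2 = 0.025 in the upper tail.
z_crit = Z.inv_cdf(1 - 0.05 / 2)

print(f"P(-1.96 < Z < 1.96) = {area:.4f}")
print(f"z_(0.025) = {z_crit:.3f}")
```

The first number should be very close to 0.95 and the second very close to 1.96.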

Building a CI for the population mean (cont’d)

The figure shows that 95% of the sample means are no more than 0.14 units (1.96 standard deviations) from the mean of 1.8, since $\bar{X} \sim N(1.8,\ 0.073^2)$.

Now consider the probability that a sample mean computed in a random sample is within 0.14 units of the population mean of 1.8. Since 95% of the distribution is within 0.14 of 1.8, the probability that the mean from any given sample will be within 0.14 of 1.8 is 95%.

This means that if we repeatedly compute the mean $\bar{x}$ from a sample and create an interval ranging from $\bar{x} - 0.14$ to $\bar{x} + 0.14$, this interval will contain the population mean 95% of the time.
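The repeated-sampling argument can be illustrated with a simulation. The sketch below takes a shortcut that is an assumption of the sketch rather than part of the lecture: instead of drawing 900 individual answers each time, it draws each sample mean directly from its sampling distribution $N(1.8,\ 0.073^2)$. The seed and the number of repetitions `REPS` are arbitrary.

```python
import random
from math import sqrt

random.seed(1)  # fixed seed so the run is reproducible

MU, SIGMA, N = 1.8, 2.2, 900   # values from the GSS loneliness example
SE = SIGMA / sqrt(N)           # standard error, about 0.073
ME = 1.96 * SE                 # margin of error, about 0.14
REPS = 10_000                  # number of simulated samples (arbitrary)

covered = 0
for _ in range(REPS):
    x_bar = random.gauss(MU, SE)            # one simulated sample mean
    if x_bar - ME <= MU <= x_bar + ME:      # does this interval contain mu?
        covered += 1

coverage = covered / REPS
print(f"share of intervals containing the population mean: {coverage:.3f}")
```

The printed share should be close to 0.95, although any single interval either contains 1.8 or does not.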

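Building a CI for the population mean (cont’d)

So, in general a $100(1-\alpha)\%$ CI for the population mean is:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right)$$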

CI for the population mean: an example

The Ayuntamiento de Barcelona decided to monitor the use of the Bicing system. A survey was carried out on a sample of 300 residents, asking: “How many times did you use the Bicing system in the past week?”.

It has been decided that if the average number of Bicing uses is found to be below 2, the system will be discontinued.

Calculate a confidence interval for the mean number of Bicing uses, assuming that it follows a normal distribution with standard deviation equal to 0.35. You also know that in the sample drawn the average number of Bicing uses was 2.2.

CI for the population mean: an example (cont’d)

n = 300.

The point estimate is $\bar{x} = 2.2$, which is above the threshold of 2 set by the Ayuntamiento. But it’s only a sample average! It could be that we were lucky and drew a sample of “good” citizens!

Let’s build a CI to take into account the sampling variability:

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 2.2 - 1.96 \cdot \frac{0.35}{\sqrt{300}} = 2.16$$

$$\bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 2.2 + 1.96 \cdot \frac{0.35}{\sqrt{300}} = 2.24$$

So:

$$CI_{0.95}(\mu) = (2.16,\ 2.24)$$
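The Bicing calculation can be wrapped in a small reusable function. This is a sketch under the assumptions of the example (normal population, known σ); the function name `z_interval` and its parameters are illustrative, and the reliability factor is taken from `statistics.NormalDist` instead of the printed table.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence=0.95):
    """CI for the mean when the population standard deviation is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # reliability factor z_(alpha/2)
    me = z * sigma / sqrt(n)                  # margin of error
    return x_bar - me, x_bar + me


# Values from the Bicing example: sample mean 2.2, sigma 0.35, n = 300.
lower, upper = z_interval(x_bar=2.2, sigma=0.35, n=300)

print(f"95% CI: ({lower:.2f}, {upper:.2f})")
print("whole interval above the threshold of 2:", lower > 2)
```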

CI for the population mean: an example (cont’d)

$$CI_{0.95}(\mu) = (2.16,\ 2.24)$$

Interpretation:

We are 95% confident that the true average use is between 2.16 and 2.24 times per week.

(Although the true mean may or may not be in this interval, 95% of intervals formed in this manner will contain the true mean.)

Note: it is not correct to say that the true average is between 2.16 and 2.24 times per week with probability equal to 95%. In fact, once we have drawn a sample, the true mean either falls in our CI or it does not.

Conclusion:

The Bicing system should not be discontinued because, even accounting for sampling variability, the whole 95% CI lies above the threshold of 2.

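A closer look at the CI formula

A $100(1-\alpha)\%$ CI for the population mean is:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right)$$

The margin of error is:

$$ME = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

The width is:

$$W = 2 \cdot ME = 2 \cdot z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$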

Assumptions

In calculating the CI we were assuming:

The population distribution of X is normal (or we have a large sample)

The population standard deviation of X ($\sigma$) is known.

CI for the population mean when σ is unknown

This is the most common situation in practice.

We need an estimate of σ.

The “best” estimator of σ is the sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Note that in the formula we divide by n-1 and not n! This is because it can be proved that only in this way is $s^2$ an unbiased estimator of the population variance $\sigma^2$.

(Believe me ☺ or check it in any introductory statistics textbook)

We can build a CI for the mean using s instead of σ, but in this case the standardized sample mean $(\bar{X} - \mu)/(s/\sqrt{n})$ does not follow a normal distribution but a Student's t distribution.
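To make the n-1 divisor concrete, the sketch below computes the sample standard deviation by hand and compares it with the standard library. The data values are hypothetical and chosen only so that the arithmetic is easy to follow; `statistics.stdev` divides by n-1, while `statistics.pstdev` divides by n.

```python
import statistics
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations

n = len(data)
x_bar = sum(data) / n                        # sample mean
ss = sum((x - x_bar) ** 2 for x in data)     # sum of squared deviations

s = sqrt(ss / (n - 1))        # sample standard deviation (divide by n-1)
sd_div_n = sqrt(ss / n)       # what we would get dividing by n instead

print(f"s (n-1 divisor): {s:.4f}  vs statistics.stdev:  {statistics.stdev(data):.4f}")
print(f"n divisor:       {sd_div_n:.4f}  vs statistics.pstdev: {statistics.pstdev(data):.4f}")
```

Dividing by n always gives the smaller of the two values, and the gap shrinks as n grows.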

t versus z

The t is a family of distributions that depends on the degrees of freedom (d.f.) = n-1

d.f. = number of observations that are free to vary after the sample mean has been calculated

CI for the population mean when σ is unknown (cont’d)

The general formula of a CI for the population mean with unknown variance is:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\,\alpha/2}\frac{s}{\sqrt{n}} \right)$$

For any given confidence level, t-values are larger than z-values, resulting in a wider CI. This reflects the additional uncertainty due to the fact that s is only an estimate of the population standard deviation.

However, for n > 30 the t- and z-values are close, and the previous formula for the CI (with s in place of σ) could be used.

If n is not large, we need the assumption that the distribution of X is normal!

CI for the population mean when σ is unknown: an example

The GSS also asks “On average, how many hours do you personally watch TV per day?”

Knowing that n = 900, $\bar{x} = 2.865$ and s = 2.617, estimate the population mean with a 95% CI. Also consider that $t_{899;\,0.025} = 1.963$.

CI for the population mean when σ is unknown: an example (cont’d)

Solution:

We have to use the formula for the case with unknown standard deviation.

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\,\alpha/2}\frac{s}{\sqrt{n}} \right)$$

where $t_{899;\,0.025} = 1.963$; $\bar{x} = 2.865$ and s = 2.617.

Substituting these values in the formula we have:

$$CI_{0.95}(\mu) = \left( 2.865 - 1.963 \cdot \frac{2.617}{\sqrt{900}},\ 2.865 + 1.963 \cdot \frac{2.617}{\sqrt{900}} \right) = (2.69,\ 3.04)$$

Note that the same result is obtained if we use the Z distribution instead of the t. In fact: $t_{899;\,0.025} = 1.963 \approx z_{0.025} = 1.96$ (the sample size here is very large!).
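The substitution can be checked numerically, together with the remark that z gives the same answer here. In this sketch the critical values 1.963 and 1.96 are the ones quoted in the example (the Python standard library has no t distribution, so they are entered by hand); the helper name `mean_interval` is illustrative.

```python
from math import sqrt


def mean_interval(x_bar, sd, n, crit):
    """Return (lower, upper) = x_bar -/+ crit * sd / sqrt(n)."""
    me = crit * sd / sqrt(n)   # margin of error
    return x_bar - me, x_bar + me


n, x_bar, s = 900, 2.865, 2.617   # values from the GSS TV example

t_low, t_high = mean_interval(x_bar, s, n, crit=1.963)  # t with 899 d.f.
z_low, z_high = mean_interval(x_bar, s, n, crit=1.96)   # standard normal

print(f"using t: ({t_low:.2f}, {t_high:.2f})")
print(f"using z: ({z_low:.2f}, {z_high:.2f})")
```

To two decimals both intervals come out as (2.69, 3.04), which is what the note about the large sample size says.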

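CI for the population mean: summary

Point Estimate ± (Reliability Factor)(Standard Error)

When σ is known:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right)$$

When σ is unknown:

$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\,\alpha/2}\frac{s}{\sqrt{n}} \right)$$

In both cases a crucial assumption is that X is normally distributed (or that the sample size is large enough to apply the Central Limit Theorem).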

CI for the population mean: a note on the ME

When σ is known:

$$ME = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

When σ is unknown:

$$ME = t_{n-1;\,\alpha/2}\frac{s}{\sqrt{n}}$$

In both cases the ME decreases as:

the sample size increases

the variability of X decreases (the standard deviation σ or its estimate s)

the reliability factor decreases (lower confidence level)
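The three effects on the margin of error can be seen by changing one input at a time. The sketch below covers only the known-σ case and starts from the Bicing values (σ = 0.35, n = 300, 95%); the alternative values (n = 1200, σ = 0.20, 90%) are arbitrary choices for illustration, and the function name `margin_of_error` is not from the lecture.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sigma, n, confidence=0.95):
    """ME = z_(alpha/2) * sigma / sqrt(n), for known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)


base = margin_of_error(0.35, 300)                         # starting point
larger_n = margin_of_error(0.35, 1200)                    # four times the sample
smaller_sigma = margin_of_error(0.20, 300)                # less variable population
lower_conf = margin_of_error(0.35, 300, confidence=0.90)  # lower confidence level

print(f"base:            {base:.4f}")
print(f"n = 1200:        {larger_n:.4f}")
print(f"sigma = 0.20:    {smaller_sigma:.4f}")
print(f"90% confidence:  {lower_conf:.4f}")
```

Because n enters through $\sqrt{n}$, quadrupling the sample size only halves the margin of error.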

Are you now confident about confidence intervals?

Exercise to solve at home:

On a random sample of n = 25 students it was calculated that the Grade Point Average (GPA) is 50 with standard deviation of 8. Form a 95% confidence interval for the GPA in the population of all students.

Note that:

$$t_{n-1;\,\alpha/2} = t_{24;\,0.025} = 2.0639$$

Solution:

$$CI_{0.95}(\mu) = (46.698,\ 53.302)$$

How do you interpret this result?

a. 95% of the students have a score between 46.698 and 53.302

b. We can be 95% confident the sample mean is between 46.698 and 53.302

c. We can be 95% confident the population mean is between 46.698 and 53.302

d. If random samples of size 25 were repeatedly selected, in the long run 95% of them would contain the value 50

A historical note

The t distribution, also called Student's t, was developed by the English statistician William Sealy Gosset while working at the Guinness brewery in Dublin.

Because company policy forbade employees from publishing company work under their own names, Gosset used the pseudonym "Student" in his articles.

His discoveries were stimulated by the fact that he was given only small samples of brew to test (guess why?), and he realized he could not use normal z-scores after substituting s in the standard error formula.

So, drinking beer should make you

If something is not clear (or you find mistakes in the slides), do not hesitate to come to office hours or e-mail me.
