
(1)

Techniques of Statistical Analysis I

Lecture 4: Confidence Intervals for the Proportion

Bruno Arpino

(2)

Outline

Sample proportion

Confidence interval estimation for a population proportion

(3)

Example

We want to estimate the percentage of people approving of the way Obama is handling the economy.

Here you can find results from some polls:

http://www.pollingreport.com/obama_ad.htm

According to the ABC News/Washington Post Poll conducted in the period Sept. 29-Oct. 2, 2011 on a sample of 1,002 adults nationwide, this percentage was 35%.

(4)

Point estimation

Similarly to what we have seen for the mean, the best point estimator for the population proportion, π, is the sample proportion, p:

$$p = \frac{\text{number of units in the sample having the characteristic of interest}}{\text{sample size}}$$

This is an unbiased estimator: E(p) = π

In the previous example, p = 35% out of n = 1002 persons means that in the sample 0.35 × 1002 ≈ 351 interviewed persons said “approve” (351/1002 ≈ 0.35).

(5)

What is the variable we are modelling?

Consider again the previous example.

For each unit the variable of interest is “X = approves of Obama or not”, i.e., a binary variable (a variable that takes only two values: “yes”, “other”).

Binary variables are usually coded in this way: X = 0 for units without the characteristic of interest (“disapprove” or “unsure”) and X = 1 for units with the characteristic of interest (“approve”).

A proportion can be seen as the mean of the binary variable X:

$$\pi = \frac{\text{number of units in the population having the characteristic of interest}}{\text{population size}} = \frac{\sum_{i=1}^{N} X_i}{N}$$
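The identity between a proportion and the mean of a 0/1 variable can be checked on a small example. This is only a sketch: the list `x` below is hypothetical data invented for illustration, not taken from the poll.

```python
from statistics import mean

# Hypothetical coded responses: 1 = "approve", 0 = "disapprove" or "unsure".
x = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]

# Proportion as "units with the characteristic" divided by "number of units".
p_from_count = sum(x) / len(x)

# Proportion as the mean of the binary variable.
p_from_mean = mean(x)

print(p_from_count, p_from_mean)
```

Both quantities equal 0.4 here, because summing a 0/1 variable simply counts the units coded 1.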

(6)

Assumption: n is “large enough”

X only takes the values 0 and 1, so the normal distribution is not appropriate for X itself.

The sampling distribution of the sample proportion is binomial (rescaled by 1/n) but can be approximated by a normal distribution if the sample is “large enough”.

In particular, the CI formula we consider is valid if n is such that: n*p*(1 – p) > 9

An alternative rule of thumb: the formula is OK if at least 15 observations are in the category of interest (“approve”) and at least 15 observations are not in it.
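The two rules of thumb above can be checked mechanically. The helper below is a hypothetical illustration (the name `large_enough` is not from the slides); it returns one boolean per rule.

```python
def large_enough(n, p):
    """Return (variance rule, count rule) for sample size n and sample proportion p."""
    in_category = n * p            # observations in the category of interest
    not_in_category = n * (1 - p)  # observations outside it
    rule_variance = n * p * (1 - p) > 9
    rule_counts = in_category >= 15 and not_in_category >= 15
    return rule_variance, rule_counts

# The poll from the example (n = 1002, p = 0.35) and a small hypothetical sample.
print(large_enough(1002, 0.35))
print(large_enough(30, 0.10))
```

The poll satisfies both rules, while a sample of 30 with p = 0.10 fails both, so the normal approximation should not be trusted there.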

(7)

Confidence interval for a proportion

Again similarly to what we have seen for the mean, we prefer a confidence interval estimate to a point estimate in order to take sampling variability into account.

Using the CI formula for the mean we can derive a CI for a proportion (we skip the details):

$$CI_{1-\alpha}(\pi) = \left( p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}},\; p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right)$$

where $z_{\alpha/2}$ is the standard normal value for the level of confidence desired, p is the sample proportion and n is the sample size.
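The interval above translates directly into a few lines of code. This is a minimal sketch, assuming Python 3.8+ so that `statistics.NormalDist` is available for the standard normal quantile; the function name `proportion_ci` and the demo values (p = 0.40, n = 200) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p, n, confidence=0.95):
    """Normal-approximation CI for a proportion, valid only if n is 'large enough'."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for 95%
    standard_error = sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error

# Hypothetical sample: 40% of 200 units have the characteristic of interest.
lower, upper = proportion_ci(p=0.40, n=200)
print(round(lower, 4), round(upper, 4))
```

Computing z from the confidence level, rather than hard-coding 1.96, makes it easy to switch to other levels such as 90% or 99%.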

(8)

A closer look at the CI

Remember the general structure of a CI:

Point Estimate ± (Reliability Factor)(Standard Error)

$$CI_{1-\alpha}(\pi) = \left( p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}},\; p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right)$$

The width is:

$$W = 2 \, z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

And the margin of error is:

$$ME = W/2 = z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

(9)

Exercise

A random sample of 100 people shows that 25 are left-handed.

Form a 95% confidence interval for the true proportion of left-handers.

(10)

Exercise (cont’d)

n = 100; p = 25/100 = 0.25; 1 − α = 0.95; $z_{\alpha/2}$ = 1.96.

$$CI_{0.95}(\pi) = \left( p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}},\; p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right) = (0.1651,\ 0.3349)$$
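As a check on the arithmetic, the interval can be worked out step by step (intermediate values rounded to four decimals):

$$
\begin{aligned}
\sqrt{\frac{p(1-p)}{n}} &= \sqrt{\frac{0.25 \times 0.75}{100}} = \sqrt{0.001875} \approx 0.0433 \\
z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} &\approx 1.96 \times 0.0433 \approx 0.0849 \\
CI_{0.95}(\pi) &\approx (0.25 - 0.0849,\ 0.25 + 0.0849) = (0.1651,\ 0.3349)
\end{aligned}
$$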

(11)

Exercise (cont’d): interpretation

$$CI_{0.95}(\pi) = (0.1651,\ 0.3349)$$

We are 95% confident that the true percentage of left-handers in the population is between 16.51% and 33.49%.

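(Although the interval from 0.1651 to 0.3349 may or may not contain the true proportion, 95% of the intervals obtained in this way from repeated random samples would contain the true population proportion.)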

(12)

Exercise 2

The ABC News/Washington Post Poll we considered before reports a margin of error of ± 4 for all the polls. What is wrong with this statement? Can you calculate a 95% CI with the available data? Can you safely conclude that a minority of US citizens approve of Obama?

The ME is half the width of a CI (ME = W/2), so it depends on the value of p:

$$CI_{1-\alpha}(\pi) = \left( p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}},\; p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right)$$

While phrases such as “The poll has a margin of error of plus or minus 4” (percentage points!) are commonly heard, an additional qualification such as “at a 95 percent confidence level” is also needed in order to indicate precisely what the error refers to.

Moreover, the ME is always positive (the “±” does not make sense).
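To see how the margin of error moves with p for a fixed sample size, it can be tabulated for a few values. This is a sketch only: the helper name `margin_of_error` and the grid of p values are chosen for illustration, with n = 1002 taken from the poll and z = 1.96 assuming a 95% confidence level.

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """ME = z * sqrt(p(1-p)/n), i.e. half the width of the CI."""
    return z * sqrt(p * (1 - p) / n)

n = 1002  # sample size of the poll in the example
for p in (0.10, 0.35, 0.50):
    print(p, round(margin_of_error(p, n), 4))
```

Because p(1 − p) is largest at p = 0.5, the margin of error for a given n and confidence level is largest there and shrinks as p moves towards 0 or 1.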

(13)

Exercise 2 (cont’d)

Assume 1 − α = 0.95; $z_{\alpha/2}$ = 1.96

We know: n = 1002; p = 0.35

We can calculate a 95% CI:

$$CI_{0.95}(\pi) = \left( 0.35 - 1.96\sqrt{\frac{0.35 \times 0.65}{1002}},\; 0.35 + 1.96\sqrt{\frac{0.35 \times 0.65}{1002}} \right) = (0.32,\ 0.38)$$

So actually ME = 0.03! (They probably report the maximum ME over all the polls, or use a confidence level of 99%.)

In fact, if we set 1 − α = 0.99; $z_{\alpha/2}$ = 2.58, then ME = 0.039 (CI = (0.31, 0.39)). So in this case too all the “plausible” values are below 50%. We are 99% confident that the proportion of US citizens approving of Obama is between 31% and 39%, i.e. a minority.

(14)

Homework (w/o grade!)

Check out this applet on CI estimation for a proportion:

http://www.prenhall.com/agresti/applet_files/propci.html

You can use it to check to what extent a statement like “95% of the CIs obtained by drawing several random samples from the population contain the true population proportion” is correct.

Compare these two cases: n = 1000, p = 0.5 and n = 1000, p = 0.1
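If the applet is not available, the same experiment can be approximated with a short simulation. This is a sketch under stated assumptions: the p in the two cases is read as the true population proportion, and the function name `coverage`, the number of repeated samples and the seed are arbitrary choices, not part of the homework.

```python
import random
from math import sqrt

def coverage(true_pi, n, n_samples=1000, z=1.96, seed=0):
    """Share of simulated 95% CIs that contain the true proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Draw one random sample of size n from a 0/1 population.
        count = sum(rng.random() < true_pi for _ in range(n))
        p = count / n
        me = z * sqrt(p * (1 - p) / n)
        if p - me <= true_pi <= p + me:
            hits += 1
    return hits / n_samples

for true_pi in (0.5, 0.1):
    print(true_pi, coverage(true_pi, n=1000))
```

With a large n the share should be close to 0.95 in both cases; rerunning with a much smaller n, or a p near 0 or 1, shows where the normal approximation starts to break down.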

(15)

The bootstrap method

The bootstrap is a (more) modern method (due to Brad Efron) for generating CIs without using mathematical methods to derive a sampling distribution that assumes a particular population distribution. It is based on repeatedly taking samples of size n (with replacement) from the sample data distribution.
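The resampling idea can be sketched in a few lines. The slide does not say which bootstrap variant is meant, so the percentile interval used below is an assumption; the function name `bootstrap_ci`, the number of resamples and the seed are also illustrative choices. The data reproduce the left-handers exercise (25 ones out of 100).

```python
import random

def bootstrap_ci(data, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the proportion of ones in a 0/1 sample."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n units with replacement and record each resample's proportion.
    props = sorted(
        sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lower_idx = round(alpha / 2 * n_resamples)
    upper_idx = round((1 - alpha / 2) * n_resamples) - 1
    return props[lower_idx], props[upper_idx]

# Left-handers exercise: 25 of 100 sampled people are left-handed.
data = [1] * 25 + [0] * 75
lower, upper = bootstrap_ci(data)
print(lower, upper)
```

The resulting interval should be roughly comparable to the normal-approximation interval from the exercise, without relying on the formula for the standard error.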

(16)

If something is not clear (or you find mistakes in the slides) do not hesitate to come to office hours or e-mail me