
4. (from last time, we will try again) A special case of means: The proportion

Suppose there are n independent trials that constitute a sample. Each trial results in either "success" or "failure", and the chance of a success each time is π.

The proportion π could be thought of as the mean of a special kind of population. The population only has values of 1 or 0. If a population has that feature, the population mean E(X) or µ is

π, the proportion of 1’s in your population, and the population standard deviation is

$$\sigma = \sqrt{\pi(1-\pi)}$$

The binomial can also be used for PROPORTIONS or percentages. For a binomial distribution, if n is large (>20), then the distribution of the sample proportion P is approximately normal with µ = π, and the standard error of the distribution of sample proportions is

$$\sigma_P = \sqrt{\frac{\pi(1-\pi)}{n}}$$

For example, this is the population of college students in Los Angeles in 2000:

   SEX |      Freq.     Percent        Cum.
-------+-----------------------------------
  Male |    205,534       52.85       52.85
Female |    183,393       47.15      100.00
-------+-----------------------------------
 Total |    388,927      100.00

Let’s arbitrarily assign the value 1 to males and the value 0 to females. Look what happens when I compute some summary statistics for sex. Note the proportion of males is the same as the mean and it’s just the percentage or proportion of 1’s in the data.

                            SEX
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs              388927   <- a population of 388,927
25%            0              0       Sum of Wgt.      388927

50%            1                      Mean           .5284642   <- this is π
                        Largest       Std. Dev.      .4991898
75%            1              1
90%            1              1       Variance       .2491904
95%            1              1       Skewness      -.1140418
99%            1              1       Kurtosis       1.013006
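As a quick check on the numbers above, here is a minimal Python sketch (not part of the original output) that recomputes π and σ directly from the two frequency counts. The variable names are my own, and the formula used is the one for a population of 1's and 0's.

```python
import math

# Frequency counts from the table of Los Angeles college students (2000)
males = 205_534      # coded as 1
females = 183_393    # coded as 0
N = males + females  # population size

# The mean of a 0/1 population is just the proportion of 1's
pi = males / N

# Population standard deviation of a 0/1 variable: sqrt(pi * (1 - pi))
sigma = math.sqrt(pi * (1 - pi))

print(f"N = {N}, pi = {pi:.7f}, sigma = {sigma:.7f}")
```

The result should agree with the Mean and Std. Dev. lines of the summary to about six decimal places. A tiny difference in the last digit of the standard deviation is expected if the software divides by N - 1 rather than N.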


We can take samples from a population of 1’s and 0’s. For example, one single sample of size 25 could look like this

0 1 1 1 0 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 1 0

This sample has a mean of 14/25 = .56 and a standard deviation of $\sqrt{.56 \times (1 - .56)} = .4964$. It is one of all the possible samples of size 25 that could be drawn from this population, and its mean is one value in the sampling distribution of the sample proportion.
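The arithmetic for this one sample can be sketched in a few lines of Python. This is only an illustration; the list is typed in from the 25 values shown above and the variable names are my own.

```python
import math

# The single sample of size 25 shown above (1 = male, 0 = female)
sample = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
          1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0]

n = len(sample)            # sample size
ones = sum(sample)         # number of 1's in the sample
P = ones / n               # sample proportion, which is also the sample mean

# Standard deviation of a 0/1 sample, using the same formula as for the population
sd = math.sqrt(P * (1 - P))

print(f"n = {n}, ones = {ones}, P = {P:.2f}, sd = {sd:.4f}")
```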

We can call the sample proportion for a single sample P. A collection of sample statistics (or the population of all possible sample proportions from samples of the same size) will have a mean of π and a standard error of

$$\sigma_P = \sqrt{\frac{\pi(1-\pi)}{n}}$$

(notice that the standard error is NOT the same as the sample standard deviation)

And if I were to run a simulation by gathering 10,000 samples of size 25 and calculating the proportions of 1’s (P) for each sample and then graphing all of them:

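[Figure: density plot of SEX in the population; x-axis SEX (0 to 1), y-axis Density (0 to 30)]

[Figure: density plot of the 10,000 sample proportions; x-axis r(mean) (.2 to 1), y-axis Density (0 to 4)]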

Information on the sampling distribution of 10,000 samples
-------------------------------------------------------------
      Percentiles      Smallest
 1%          .28             .2
 5%          .36             .2
10%           .4             .2       Obs               10000   <- for 10,000 samples, n=25
25%          .44             .2       Sum of Wgt.       10000

50%          .52                      Mean            .521288   <- compare with .5284642
                        Largest       Std. Dev.      .0999903   <- note the standard deviation of the
75%           .6            .84                                    sampling distribution is AKA the standard error
90%          .64            .84       Variance       .0099981
95%          .68            .84       Skewness       .0211931
99%          .76            .88       Kurtosis       2.890723
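A simulation like the one summarized above can be imitated in Python. This is a rough sketch, not the original code: it assumes each of the 25 observations is an independent draw with success chance π = .5284642 (a reasonable stand-in for sampling from a population of 388,927), and the seed and function name are arbitrary choices of mine. The exact numbers will differ a little from the output above because the random draws differ.

```python
import math
import random
import statistics

PI = 0.5284642     # population proportion of 1's
N_SAMPLE = 25      # size of each sample
N_REPS = 10_000    # number of samples to draw

rng = random.Random(2000)  # arbitrary seed, only so the run is repeatable


def sample_proportion(n, pi, rng):
    """Draw n independent 0/1 trials with success chance pi; return the proportion of 1's."""
    return sum(1 for _ in range(n) if rng.random() < pi) / n


# One sample proportion P per simulated sample
proportions = [sample_proportion(N_SAMPLE, PI, rng) for _ in range(N_REPS)]

sim_mean = statistics.mean(proportions)    # should be close to pi
sim_sd = statistics.pstdev(proportions)    # should be close to the standard error
theory_se = math.sqrt(PI * (1 - PI) / N_SAMPLE)

print(f"mean of the P's        : {sim_mean:.4f}")
print(f"std. dev. of the P's   : {sim_sd:.4f}")
print(f"sqrt(pi * (1 - pi) / n): {theory_se:.4f}")
```

The point of the comparison is that the standard deviation of the 10,000 simulated proportions lands very near the standard error given by the formula.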

5. Applications: Probability of a single sample proportion P

The question we typically ask is “what is the probability or what is the chance that a sample of size 25 could have a P=.56 or larger when we know that π is supposed to be about .53 (.5285 to be exact).”

We can answer that question using the Z score.

$$Z = \frac{P - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} = \frac{.56 - .53}{\sqrt{\frac{.53 \times (1 - .53)}{25}}} = \frac{.03}{.0998} \approx .30$$

The area to the right of .30 is .382, so we would state that our chance of getting a sample this “far” above the true value (or farther) is .382 or 38.2%. Put another way, 38.2% of all possible samples in this particular sampling distribution will have a P of .56 or greater.
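The Z score calculation can be checked with a short Python sketch. It uses `statistics.NormalDist` from the standard library in place of a printed normal table; the variable names are my own.

```python
import math
from statistics import NormalDist

pi = 0.53   # population proportion (rounded from .5285)
n = 25      # sample size
P = 0.56    # observed sample proportion

# Standard error of the sample proportion
se = math.sqrt(pi * (1 - pi) / n)

# Z score: how many standard errors P lies above pi
z = (P - pi) / se

# Area to the right of z under the standard normal curve
upper_tail = 1 - NormalDist().cdf(z)

print(f"SE = {se:.4f}, Z = {z:.2f}, area to the right = {upper_tail:.3f}")
```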


6. Estimating With Confidence: the sample proportion (8-1 & 8-5A)

In Chapter 6, the parameters are given, and we use the Z score to estimate the chance of various sample outcomes. In Chapter 8 the parameters are unknown (as in reality), and we draw conclusions from sample outcomes to make educated guesses about the parameters.

Rather than calculating exact probabilities, we use statements of confidence to express the strength of our conclusions.

A. Statistical Confidence

A CONFIDENCE INTERVAL is a range of values (i.e. values derived from sample information) that we think covers the true parameter and we state our confidence in the sample outcome with a percentage. An example might be (handout) "Hypothetical Kerry McCain Ticket"

The article states that 1,113 adults nationwide were surveyed and 53% of registered voters would vote for Kerry, 14 percentage points more than the 39% who would vote for Bush. A "margin of error" of +/- 3.0% is given, larger for subgroups.

You have probably heard polling information stated in this way before: 53% with a margin of error of 3%, or 53% plus or minus 3%. The margin of error is often stated to give you a sense of how accurate the statistic is and how certain you are about what it is telling you about its proximity to the true parameter.

This is a confidence interval for the population percentage (that is, the percentage we would get if we could ask EVERYONE the question "who would you vote for"), or you could think of it as a statement of confidence that this sample "covers" the true parameter. The confidence interval is calculated from the sample percentage and sample standard deviation. Up until now, we have been in a situation where we know exactly what the population parameters are. Now we do not, but we have samples and can make statements of confidence about our samples.

Take a long look at Figure 8.2 in your text (p. 256). (see handout) Things to keep in mind:

(1) In about 68% of all samples, the sample percentage will be within one standard error (Z=1.0) of the true population percentage.

(2) In about 95% of all samples, the sample percentage will be within about two standard errors (Z=1.96) of the population percentage.

From the poll, we would say that we were 95% confident that the true percentage of adults nationwide that would vote for Kerry is in the range 50% to 56%. Conventionally, the margin of error stated in the popular press reflects about two standard errors.

NOTES: You can never be 100% confident. There is always the chance that you could have generated a sample that just by chance (not anything you did wrong) is nowhere near the parameter. Sampling theory (see Chapter 6 again) tells us that some percentage of samples will always be far away from the parameter, even though the procedure used to select them was random. Also remember a natural property of the normal curve: it never crosses or touches the x-axis, so even at 10 S.D. there is a non-zero chance that your confidence interval will not cover the parameter, but the chance of that happening is very small.

And remember -- if you generate a bad sample, e.g. biased, non-random, your statistics will be bad and while you can generate a confidence interval, it's meaningless.

B. Constructing Confidence Intervals

Constructing a confidence interval for a population parameter involves four steps:

A. Find the sample statistic of interest. This is our ESTIMATE of the population parameter. Look at Kerry & McCain again. The article gives a P = .53 as the proportion who would vote for Kerry.


B. Compute the standard error for the sample distribution; for simple random samples involving percentages

$$\sqrt{\frac{(.53)(.47)}{1113}} = .0150$$

C. Then choose the level of confidence you are interested in from a normal table using the area percentages. Use the associated Z as a multiplier. Let's just use 2.0 to get a 95% confidence interval (this is a standard approximation; using Z=1.96 would be more correct),

so .0150 * 2 = .03, and then multiply by 100 to make it a percentage: .03 * 100 = 3%

D. Add and subtract the result in C from the result in A so, 53% +/- 3%. This is your "margin of error" that is, how accurate you believe your statistic is, based on the variability of the estimate.
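Steps A through D can be sketched in Python using the numbers from the poll (P = .53, n = 1,113). This is just an illustration of the arithmetic; the variable names are my own, and Z = 2.0 is the same rough multiplier used in step C.

```python
import math

P = 0.53    # step A: sample proportion who would vote for Kerry
n = 1113    # number of adults surveyed
z = 2.0     # step C: rough multiplier for 95% confidence (1.96 is more exact)

# Step B: standard error of the sample proportion
se = math.sqrt(P * (1 - P) / n)

# Step C: margin of error = Z times the standard error
margin = z * se

# Step D: add and subtract the margin from the estimate
lower = P - margin
upper = P + margin

print(f"SE = {se:.4f}")
print(f"margin of error = {margin * 100:.1f}%")
print(f"95% confidence interval: {lower * 100:.0f}% to {upper * 100:.0f}%")
```

Swapping in z = 1.96 gives a margin of about .0293, which still rounds to 3%.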

C. Notes on Confidence

a. A typical confidence interval has the form "estimated value, plus or minus Z times the SE of the sample distribution". In other words, take the statistic calculated from a sample and add and subtract some margin of error (the Standard Error multiplied by some value of Z). That is:

$$P \pm Z \cdot \sqrt{\frac{P(1-P)}{n}}$$

b. If the original population is normally distributed with a known standard deviation, or if the sample size is "large", then the distribution of the sample statistic is (approximately) normal, and using Z from the normal table is appropriate. (If the original distribution is normal with an unknown standard deviation, or if it is not normal and the sample is small, you will not use Z, more on this later.)

c. Your margin of error will depend on the choice of a confidence level. A lower confidence will give you a smaller margin of error. A higher confidence will give you a larger margin of error.

d. If your standard deviation is small, it is easier to get a more precise fix on the parameter. Your margin of error is smaller for populations with smaller standard deviations.

e. If your n increases in size, it will reduce your margin of error. If your n gets smaller, it will increase your margin of error. Therefore, you can adjust your sample size to accommodate a desired margin of error (see p. 261).

f. If the sample is known to be biased, the confidence interval can be calculated, but it is worthless.
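Notes c and e can be made concrete with a small Python sketch. The two helper functions below are hypothetical names of my own. The sample-size function simply solves the margin-of-error formula for n; that is an assumption on my part about the kind of calculation the text has in mind, not a quote from it.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence):
    """Z multiplier that leaves (1 - confidence) split evenly between the two tails."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def margin_of_error(p, n, confidence=0.95):
    """Z times the standard error of a sample proportion."""
    return z_for_confidence(confidence) * math.sqrt(p * (1 - p) / n)


def sample_size_for_margin(p, margin, confidence=0.95):
    """Smallest n whose margin of error is at most `margin` (formula solved for n)."""
    z = z_for_confidence(confidence)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Note c: a higher confidence level gives a larger margin of error
for conf in (0.90, 0.95, 0.99):
    print(f"confidence {conf:.0%}, n = 1113: margin = {margin_of_error(0.53, 1113, conf):.4f}")

# Note e: a larger n gives a smaller margin of error (4 times the n halves the margin)
for n in (100, 400, 1600):
    print(f"n = {n}: margin = {margin_of_error(0.53, n):.4f}")

# Sample size needed for a 3% margin at 95% confidence when P is about .53
n_needed = sample_size_for_margin(0.53, 0.03)
print(f"n needed for a 3% margin: {n_needed}")
```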

D. Interpreting Confidence

1. The best interpretation for a confidence interval is as follows (where X is the confidence level): "We did a procedure of drawing a sample, computing a statistic, standard error, etc. This procedure will give us a correct interval X% of the time and an incorrect interval 100-X% of the time. We hope this is one of the correct times."

Thus, for about X% of all samples, the interval "sample statistic + or - [Z*(standard error)]" covers the true population percentage.
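The "procedure" interpretation can be illustrated with a simulation. The sketch below is my own illustration under stated assumptions: it pretends the true proportion is known (π = .53), draws many samples of a hypothetical size n = 400, builds a 95% interval from each sample, and counts how often the interval covers π. The seed, sample size, and number of repetitions are arbitrary.

```python
import math
import random

PI = 0.53      # true population proportion (known only because this is a simulation)
N = 400        # hypothetical sample size
Z = 1.96       # multiplier for 95% confidence
REPS = 2000    # number of samples, and therefore intervals, to generate

rng = random.Random(8)  # arbitrary seed, only so the run is repeatable


def interval_covers(pi, n, z, rng):
    """Draw one sample, build P +/- Z*SE from it, and report whether it covers pi."""
    successes = sum(1 for _ in range(n) if rng.random() < pi)
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin <= pi <= p_hat + margin


covered = sum(interval_covers(PI, N, Z, rng) for _ in range(REPS))
coverage = covered / REPS

print(f"{coverage:.1%} of {REPS} intervals covered the true proportion")
```

Each individual interval either covers π or it does not; the figure of roughly 95% describes the procedure across many samples.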

2. It is not correct to talk about the chance a particular confidence interval contains the parameter. For example, you should not say "there is an X% chance that the parameter is in the confidence interval" because these confidence intervals vary with each sample and the parameter never varies (the parameter is fixed and unchanging).

Any single confidence interval either covers the true parameter or it does not. Examine the article from Forbes Magazine.


3. Another way you might think about this. When you KNOW the TRUE POPULATION PARAMETER, you can make a statement like: there is a 95% CHANCE that the SAMPLE STATISTIC will be in the range of the parameter plus or minus two standard deviations. (this is Chapter 6)

Example: if you know the parameter is µ =40 and the standard deviation of the sampling distribution is 2.5, then there is a 95% chance that the sample average will be in the range of 40 plus or minus 5.

But when you DO NOT KNOW THE TRUE POPULATION PARAMETER, you are forced to make statements like this: I am 95% confident that the POPULATION PARAMETER is in the range of the statistic plus or minus two standard deviations.

Example: if you don't know the parameter and the sample statistic is 40 and the SD is 2.5, then you are 95% confident that the parameter is covered by the range of 40 plus or minus 5.

E. Sample means (8.1)

What holds for sample proportions holds for the sample mean as well. The confidence interval for the population mean is based on the sample average for some arbitrary level of confidence

$$\bar{x} \pm Z \cdot \frac{\sigma}{\sqrt{n}}$$

where sigma is the standard deviation for the population. If sigma is unknown and n is large, the sample standard deviation is usually substituted.
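The same arithmetic for a sample mean can be sketched in Python. The numbers here are hypothetical, chosen only so that the standard error works out to 2.5 as in the earlier example: a sample average of 40, sigma = 25, and n = 100. The function name is my own.

```python
import math


def mean_confidence_interval(xbar, sigma, n, z=2.0):
    """Return (lower, upper) for xbar +/- Z * sigma / sqrt(n)."""
    se = sigma / math.sqrt(n)   # standard error of the sample mean
    margin = z * se             # margin of error
    return xbar - margin, xbar + margin


# Hypothetical values: sample average 40, sigma 25, n = 100, so SE = 25 / 10 = 2.5
lower, upper = mean_confidence_interval(xbar=40, sigma=25, n=100)

print(f"95% confidence interval for the mean: {lower} to {upper}")
```

With Z = 2.0 this reproduces the "40 plus or minus 5" range; using Z = 1.96 would narrow it slightly.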

See the handout from the Orange County Register on California’s Academic Performance Index and the standard error for a good explanation about the problems involved in measuring averages.
