Techniques of Statistical
Analysis I
Lect_3: Confidence Intervals for the mean
Bruno Arpino
Outline
From point to confidence interval estimation
The general structure of a confidence interval
How to calculate a confidence interval for the population mean
Where are we and where are we going?
We saw that the “best” point estimate for a population mean, in general, is the sample mean.
The sample mean estimator is unbiased:
$$E(\bar{X}) = \mu$$
And has lower variance than other estimators. In particular:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
If the distribution of our variable of interest is normal or the sample is large (Central Limit Theorem), we know that the sampling distribution of the mean estimator is normally (or approximately normally) distributed.
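The two properties above (unbiasedness and a standard error of σ/√n) can be checked with a small simulation. This is only a sketch: the population values `mu`, `sigma`, the sample size `n`, the number of repetitions and the seed are hypothetical choices made for illustration, and the population is assumed to be normal.

```python
import math
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size `n` from N(mu, sigma^2) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
            for _ in range(reps)]

mu, sigma, n = 10.0, 4.0, 50   # hypothetical population values, for illustration only
means = simulate_sample_means(mu, sigma, n, reps=4000)

# The average of the sample means should be close to mu (unbiasedness),
# and their standard deviation close to sigma / sqrt(n) (the standard error).
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):     ", round(sigma / math.sqrt(n), 3))
```

The simulated values will not match the theoretical ones exactly, because the number of repetitions is finite.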
Where are we and where are we going? (cont’d)
Are we satisfied with a point estimate? No! Because it does not take into account the sampling error!
How can we improve things? By providing a range of “plausible” values for the population mean.
Idea: build a confidence interval around the point estimate using the standard error as a measure of variability.
We need to decide how confident we want to be.
How can we make a “confidence” statement like this?
General structure of a confidence interval (CI)
[Figure: a confidence interval drawn around the Point Estimate, from the Lower Confidence Limit (a) to the Upper Confidence Limit (b); width of confidence interval W = b-a]
Most CIs have the form:
Point Estimate ± Margin of Error
The margin of error (ME = W/2) depends on two factors: the level of confidence that we require (reliability) and the spread of the sampling distribution of the point estimator. So a CI can be written:
Point Estimate ± (Reliability Factor)(Standard Error)
Building a CI for the population mean (population standard deviation known)
Set the level of confidence such that, if we draw many random samples, in a large share of cases the true mean will fall within the CIs.
The level of confidence is 1-α.
The true value of the mean would be contained in 100(1-α)% of the CIs.
E.g. if 1-α = 0.95, then 95% of the CIs would contain the true mean.
Building a CI for the population mean (cont’d)
… days in past week you felt lonely). Assume that the population mean is 1.8 with a standard deviation of 2.2 and that the sample size was fixed at 900.
BEFORE drawing any sample, we can calculate an interval around the population mean within which the sample mean will fall with a high probability.
E.g., with 95% probability the sample mean will fall in the interval (1.66, 1.94).
$$\bar{X} \sim N(1.8,\ 0.073^2)$$
Building a CI for the population mean (cont’d)
These limits were computed by adding and subtracting 1.96 standard deviations of the sample mean to/from the mean of 1.8 as follows:
1.8 - (1.96)(0.073) = 1.66
1.8 + (1.96)(0.073) = 1.94
where 0.073 = 2.2/√900 is the standard deviation of the sample mean:
$$\bar{X} \sim N(1.8,\ 0.073^2)$$
The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean.
http://onlinestatbook.com/java/normalshade.html
This can be checked by using the table of the standard normal distribution Z:
$$Z \sim N(0, 1)$$
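Instead of the printed Z table, the same check can be done numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and the numbers of the example above (mean 1.8, standard deviation 2.2, n = 900); the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()                      # standard normal, Z ~ N(0, 1)

# Share of the distribution within 1.96 standard deviations of the mean
area = z.cdf(1.96) - z.cdf(-1.96)

# Going the other way: the z-value that leaves 2.5% in the upper tail
z_crit = z.inv_cdf(0.975)

# Numbers from the example: mu = 1.8, sigma = 2.2, n = 900
mu, sigma, n = 1.8, 2.2, 900
se = sigma / sqrt(n)                  # standard deviation of the sample mean
lower, upper = mu - z_crit * se, mu + z_crit * se

print(f"area within +/- 1.96: {area:.4f}")
print(f"z for 95%: {z_crit:.3f}")
print(f"standard error: {se:.3f}")
print(f"interval: ({lower:.2f}, {upper:.2f})")
```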
Building a CI for the population mean (cont’d)
… standard deviations) from the mean of 1.8.
Now consider the probability that a sample mean computed in a random sample is within 0.14 units of the population mean of 1.8. Since 95% of the sampling distribution is within 0.14 of 1.8, the probability that the mean from any given sample will be within 0.14 of 1.8 is 95%.
This means that if we repeatedly compute the mean from a sample, and create an interval ranging from $\bar{x} - 0.14$ to $\bar{x} + 0.14$, this interval will contain the population mean 95% of the time.
$$\bar{X} \sim N(1.8,\ 0.073^2)$$
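The repeated-sampling statement can be illustrated with a simulation. This is a sketch under stated assumptions: the population is taken to be normal with the mean and standard deviation of the example (1.8 and 2.2), and the number of repetitions and the seed are arbitrary choices. The helper name `coverage` is hypothetical.

```python
import random
from math import sqrt
from statistics import fmean

def coverage(mu, sigma, n, half_width, reps, seed=1):
    """Share of simulated intervals (xbar - hw, xbar + hw) that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw one random sample of size n and compute its mean
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

mu, sigma, n = 1.8, 2.2, 900
half_width = 1.96 * sigma / sqrt(n)   # about 0.14
share = coverage(mu, sigma, n, half_width, reps=1000)
print(f"half width: {half_width:.3f}")
print(f"share of intervals containing mu: {share:.3f}")
```

With a finite number of repetitions the share will be close to, but generally not exactly, 0.95.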
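Building a CI for the population mean (cont’d)
So, in general a 100(1-α)% CI for the population mean is:
$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$$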
CI for the population mean: an example
… system. A survey was implemented on a sample of 300 residents asking: “How many times did you use the Bicing system in the past week?”.
It has been decided that if the average number of Bicing uses is found to be below 2, the system will be suppressed.
Calculate a confidence interval for the mean number of Bicing uses, assuming that it follows a normal distribution with standard deviation equal to 0.35. You also know that in the drawn sample the average number of Bicing uses was 2.2.
CI for the population mean: an example (cont’d)
… the Ayuntamiento. But it’s only a sample average! It could be that we were lucky and drew a sample of “good” citizens!
Let’s build a CI to take into account the sampling variability:
$$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 2.2 - 1.96 \cdot \frac{0.35}{\sqrt{300}} = 2.16$$
$$\bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 2.2 + 1.96 \cdot \frac{0.35}{\sqrt{300}} = 2.24$$
So:
$$CI_{0.95}(\mu) = (2.16,\ 2.24)$$
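The calculation for the Bicing example can be reproduced in a few lines. This is a minimal sketch: the function name `ci_mean_known_sigma` is hypothetical, and the reliability factor is taken from `statistics.NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist

def ci_mean_known_sigma(xbar, sigma, n, confidence=0.95):
    """CI for the population mean when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # reliability factor z_{alpha/2}
    me = z * sigma / sqrt(n)                  # margin of error
    return xbar - me, xbar + me

# Bicing example: xbar = 2.2, sigma = 0.35, n = 300
lower, upper = ci_mean_known_sigma(2.2, 0.35, 300)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
print("whole interval above the threshold of 2:", lower > 2)
```

Passing a different `confidence` (for example 0.99) widens the interval, since the reliability factor grows with the confidence level.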
CI for the population mean: an example (cont’d)
$$CI_{0.95}(\mu) = (2.16,\ 2.24)$$
Interpretation:
We are 95% confident that (the true) average use is between 2.16 and 2.24 times per week.
(Although the true mean may or may not be in this interval, 95% of intervals formed in this manner will contain the true mean)
Note: it is not correct to say that the true average is between 2.16 and 2.24 times per week with probability equal to 95%. In fact, after we draw a sample, the true mean either falls in our CI or it does not.
Conclusion:
The Bicing system should not be suppressed because (even accounting for the sampling variability) the plausible values for the average number of uses are all above 2.
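A closer look at the CI formula
So, in general a 100(1-α)% CI for the population mean is:
$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$$
The margin of error is:
$$ME = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
The width is:
$$W = 2 \cdot ME = 2 \cdot z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$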
Assumptions
In calculating the CI we were assuming:
The population distribution of X is normal (or we have a big sample)
The population standard deviation of X (σ) is known.
CI for the population mean when σ is unknown
This is the most common situation in practice.
We need an estimate of σ.
The “best” estimator of σ is the sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Note that in the formula we divide by n-1 and not n! This is because it can be proved that only in this way is s² an unbiased estimator of σ²!!!
(Believe me ☺ or check it in any introductory statistics textbook)
t versus z
We can build a CI for the mean using s instead of σ, but in this case the standardized sample mean will not follow a normal distribution but a Student’s t distribution.
The t is a family of distributions that depends on the degrees of freedom (d.f.) = n-1
d.f. = Number of observations that are free to vary after the sample mean has been calculated
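The sample standard deviation and the idea of degrees of freedom can be illustrated numerically. The sketch below uses a small set of hypothetical observations (not data from the course) and compares the hand computation with `statistics.stdev` (divides by n-1) and `statistics.pstdev` (divides by n) from the Python standard library.

```python
import math
import statistics

data = [2, 0, 5, 1, 3, 4, 0, 1]     # hypothetical observations, for illustration only
n = len(data)
xbar = sum(data) / n

# Sum of squared deviations from the sample mean
ss = sum((x - xbar) ** 2 for x in data)

s = math.sqrt(ss / (n - 1))          # sample standard deviation (divide by n - 1)
s_n = math.sqrt(ss / n)              # dividing by n gives a smaller value

# The deviations sum to zero, so only n - 1 of them are free to vary
print("sum of deviations:", round(sum(x - xbar for x in data), 10))
print("s (n - 1):", round(s, 4), "| statistics.stdev: ", round(statistics.stdev(data), 4))
print("s (n):    ", round(s_n, 4), "| statistics.pstdev:", round(statistics.pstdev(data), 4))
```

Once the sample mean is fixed, knowing n-1 of the deviations determines the last one, which is one way to read "d.f. = n-1".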
CI for the population mean when σ is unknown (cont’d)
The general formula of a CI for the population mean with unknown variance is:
$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\alpha/2} \frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\alpha/2} \frac{s}{\sqrt{n}} \right)$$
For any given confidence level, t-values are bigger than z-values
resulting in wider CI. This reflects the bigger uncertainty due to the
fact that s is an estimate of the population standard deviation.
However, for n > 30 the t- and z-values are close, so the formula with z (using s in place of σ) could be used as an approximation.
If n is not large, we need the assumption that the distribution of X
is normal!
CI for the population mean when σ is unknown: an example
The GSS also asks “On average, how many hours do you personally watch TV per day?”
Knowing that n = 900, $\bar{x}$ = 2.865 and s = 2.617, estimate the population mean with a 95% CI. Also consider that: $t_{899;0.025} = 1.963$
CI for the population mean when σ is unknown: an example (cont’d)
Solution:
We have to use the formula for the case with unknown standard deviation.
$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\alpha/2} \frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\alpha/2} \frac{s}{\sqrt{n}} \right)$$
where $t_{899;0.025} = 1.963$, $\bar{x} = 2.865$ and s = 2.617.
Substituting these values in the formula we have:
$$CI_{0.95}(\mu) = \left( 2.865 - 1.963 \cdot \frac{2.617}{\sqrt{900}},\ 2.865 + 1.963 \cdot \frac{2.617}{\sqrt{900}} \right) = (2.69,\ 3.04)$$
Note that the same result is obtained if we use the Z distribution instead of the t. In fact: $t_{899;0.025} = 1.963 \approx z_{0.025} = 1.96$ (the sample size here is very large!).
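The TV example can be reproduced with a short sketch. The function name `ci_mean_unknown_sigma` is hypothetical; the t-value is passed in as a tabulated number (1.963, as given in the example) because the Python standard library does not provide the t distribution.

```python
from math import sqrt
from statistics import NormalDist

def ci_mean_unknown_sigma(xbar, s, n, t_crit):
    """CI for the mean with sigma unknown: xbar +/- t_crit * s / sqrt(n).

    t_crit is the tabulated value t_{n-1; alpha/2}; it is passed in because
    the Python standard library has no t distribution.
    """
    me = t_crit * s / sqrt(n)
    return xbar - me, xbar + me

xbar, s, n = 2.865, 2.617, 900

# t-based interval, using t_{899; 0.025} = 1.963 from the example
t_lower, t_upper = ci_mean_unknown_sigma(xbar, s, n, t_crit=1.963)

# z-based interval for comparison (z_{0.025} is about 1.96)
z_crit = NormalDist().inv_cdf(0.975)
z_lower, z_upper = ci_mean_unknown_sigma(xbar, s, n, t_crit=z_crit)

print(f"t-based 95% CI: ({t_lower:.2f}, {t_upper:.2f})")
print(f"z-based 95% CI: ({z_lower:.2f}, {z_upper:.2f})")
```

With n = 900 the two intervals agree to two decimals; for small samples the t-based interval would be noticeably wider.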
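CI for the population mean: summary
Point Estimate ± (Reliability Factor)(Standard Error)
When σ is known:
$$CI_{1-\alpha}(\mu) = \left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$$
When σ is unknown:
$$CI_{1-\alpha}(\mu) = \left( \bar{x} - t_{n-1;\alpha/2} \frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\alpha/2} \frac{s}{\sqrt{n}} \right)$$
In both cases a crucial assumption is that X is normally distributed (or that the sample size is large enough to apply the Central Limit Theorem).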
CI for the population mean: a note on the ME
When σ is known:
$$ME = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
When σ is unknown:
$$ME = t_{n-1;\alpha/2} \frac{s}{\sqrt{n}}$$
In both cases the ME decreases as:
the sample size increases
the sampling variability decreases (standard error or its estimate)
the reliability factor decreases (lower confidence level)
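How the ME reacts to the sample size and to the confidence level can be seen by tabulating it for a few values. This sketch covers the known-σ case only; the function name `margin_of_error` is hypothetical and σ = 2.2 is simply borrowed from the earlier example.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence):
    """ME = z_{alpha/2} * sigma / sqrt(n) for the known-sigma case."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)

sigma = 2.2   # value borrowed from the earlier example; any positive value works

# Larger samples give a smaller ME (at a fixed 95% confidence level)
for n in (100, 400, 900, 3600):
    print(f"n = {n:5d}  ME = {margin_of_error(sigma, n, 0.95):.3f}")

# Lower confidence levels give a smaller ME (at a fixed n = 900)
for confidence in (0.99, 0.95, 0.90):
    print(f"confidence = {confidence:.2f}  ME = {margin_of_error(sigma, 900, confidence):.3f}")
```

Because n enters through its square root, the sample size has to be multiplied by four to halve the ME.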
Are you now confident about confidence
intervals?
Exercise to solve at home:
On a random sample of n = 25 students it was calculated that the Grade Point Average (GPA) is 50 with a standard deviation of 8. Form a 95% confidence interval for the GPA in the population of all students.
Note that:
$$t_{n-1;\alpha/2} = t_{24;0.025} = 2.0639$$
Solution:
$$CI_{0.95}(\mu) = (46.698,\ 53.302)$$
How do you interpret this result?
a. 95% of the students have a score between 46.698 and 53.302
b. We can be 95% confident the sample mean is between 46.698 and 53.302
c. We can be 95% confident the population mean is between 46.698 and 53.302
d. If random samples of size 25 were repeatedly selected, in the long run 95% of them would contain the value 50
A historical note
The t distribution, also called Student’s t, was developed by the English statistician William Sealy Gosset, who worked for the Guinness brewery in Dublin.
Because of a company policy forbidding the publication of company work in one’s own name, Gosset used the pseudonym “Student” in his articles.
His discoveries were stimulated by the fact that he was given only small samples of brew to test (guess why?), and he realized he could not use normal z-scores after substituting s in the standard error formula.
So, drinking beer should make you
If something is not clear
(or you find mistakes in the slides)