Objectives
6.1, 7.1 Estimating with confidence (CIS: Chapter 10)
p Statistical confidence (CIS gives a good explanation of a 95% CI)
p Confidence intervals. Further reading
http://onlinestatbook.com/2/estimation/confidence.html
p Choosing the sample size
p t distributions. Further reading
http://onlinestatbook.com/2/estimation/t_distribution.html
p One-sample t confidence interval for a population mean
Overview of Inference
p Sample ≠ population, and sample mean ≠ population mean µ.
But we do not know the value of µ, so if we want to draw any conclusions about µ we have to use the sample mean x̄ to do so.
p Methods for drawing conclusions about a population from sample
data are called statistical inference.
p There are two main types of inference:
§ Confidence Intervals - estimating the value of a population
parameter, and
§ Tests of Significance - assessing evidence for a claim (hypothesis)
about a population.
p Inference is appropriate when data are produced by either
§ a random sample or
§ a randomized experiment.
Introducing confidence intervals
p It is very unlikely that the sample mean based on a sample will ever equal the true mean. Our aim is to construct an interval around the sample mean which is `likely’ to contain the true mean. This is called a confidence interval.
p In the first lecture we considered a Gallup poll for the proportion of the electorate that would vote for Obama.
p Gallup predicted that the Obama vote would be in the interval
[45%,51%] with 95% confidence.
p The Obama vote turned out to be 50.5%, so the interval did capture the
true proportion.
p You may be asking yourself how to understand the 95%: since 50.5% lies in this interval, there does not appear to be any uncertainty in it.
q In the next few slides, our objective is to understand how a
Review: properties of the sample mean
The sample mean is a unique number for any particular sample. If you had obtained a different sample (by chance) you almost certainly would have had a different value for your sample mean.
In fact, you could get many different values for the sample mean, and virtually none of them would actually equal the true population mean, µ.
In Chapter 4, we learnt that if a random variable is normally distributed with mean µ and standard deviation σ, then with 95% probability it will lie in the interval
$$\left[\mu - 1.96 \times \sigma,\ \mu + 1.96 \times \sigma\right]$$
Now our focus is on the sample mean. It has mean µ and standard error σ/√n (Chapter 5), thus there is a 95% probability that it lies in the interval
$$\left[\mu - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \mu + 1.96 \times \frac{\sigma}{\sqrt{n}}\right]$$
But the mean is unknown; our objective is to locate the true mean based on the sample mean.
p To do this we turn the story around. Suppose the sample mean lies in the interval
$$\left[\mu - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \mu + 1.96 \times \frac{\sigma}{\sqrt{n}}\right]$$
p This is the same as saying the mean µ lies in the interval [sample mean –1.96×σ/√n, sample mean +1.96×σ/√n].
q Thus 95% of the time, the true mean (that we want to estimate) will be in the interval (this is called a confidence interval):
$$\left[\text{sample mean (average)} - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \text{sample mean (average)} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right]$$
Case 1: Normal data – sample size one
p Human heights are approximately a normal distribution. The standard
deviation of a human height is 3.8 inches.
p Our objective is to construct a confidence interval for the mean height.
p We start with the less than ideal situation that we only have a sample of size one (just one observation!). In this case the standard error is 3.8/√1 = 3.8 (the regular standard deviation).
p We know that the observation is normally distributed, so it is straightforward to construct the 95% confidence interval for the mean height using just one randomly selected height:
[height – 1.96×3.8, height + 1.96×3.8] = [height – 7.45, height + 7.45]. Construct an interval using your height.
A large amount of data on heights has been collected and it is known that the mean height of a person is about 67 inches. Does your interval contain the mean? Most of your intervals will contain the mean, 67 inches. Those of you whose height is in the extremes (very tall or small – more than 1.96 standard
Because the sampling distribution of x̄ is narrower than the population distribution, by a factor of √n, the estimates x̄ tend to be closer to the population parameter µ than individual observations are.
[Figure: the population distribution of individual subjects x (spread σ) and the sampling distribution of sample means x̄ of n subjects (spread σ/√n), both centered at µ.]
If the population is normally distributed N(µ,σ), the sampling distribution of x̄ is N(µ,σ/√n).
Case 1: Normal data – sample size three
p Again we estimate the mean human height, but this time taken from a
random sample of three people. Recall, the standard deviation of a human height is 3.8 inches.
p If the sample size is 3, the standard error of an average based on three is
3.8/√3 = 2.19.
p As each randomly selected height is normally distributed, so is the average based on three (recall Chapter 5), where the mean µ is unknown:
$$\bar{X} \sim N\left(\mu,\ \frac{3.8}{\sqrt{3}}\right)$$
p The 95% confidence interval is
$$\left[\bar{X} - 1.96 \times \frac{3.8}{\sqrt{3}},\ \bar{X} + 1.96 \times \frac{3.8}{\sqrt{3}}\right]$$
Given any random sample of size three we take its average and plug it in.
Here we illustrate the height example.
q In the shot on the right we draw a
sample of size three from the population of all heights. The
average (sample mean) is evaluated.
q This average corresponds to one of the green dots on the lower right plot. The green line is the confidence interval centered about the average.
q We did this 100 times and 96 of the intervals contain the true mean 67.
If the sample mean is normally distributed, and 100 samples are drawn with a 95% CI evaluated for each, about 95 of the intervals would contain the true mean of 67. In reality we only have one CI; we are 95% confident it contains the mean.
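The repeated-sampling experiment described above can be sketched in a few lines of Python. This is only an illustration, not the software used for the plots: the function name `coverage`, the seed and the number of repetitions are assumptions, while µ = 67 and σ = 3.8 are the values from the height example.

```python
import random
from math import sqrt

def coverage(n, reps, mu=67.0, sigma=3.8, z=1.96, seed=0):
    """Fraction of z-intervals (known sigma) that contain the true mean mu."""
    rng = random.Random(seed)
    half_width = z * sigma / sqrt(n)  # 1.96 standard errors
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and compute its average.
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Does [sample mean - half_width, sample mean + half_width] contain mu?
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / reps

print(coverage(n=3, reps=100))    # one run of 100 intervals, as on the slide
print(coverage(n=3, reps=20000))  # long-run proportion
```

With only 100 intervals the count fluctuates from run to run (96 on the slide); with many more repetitions the proportion settles close to 0.95.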
Observations
p We see that the length of the confidence interval when using just one person in the sample is 2×1.96×3.8 = 14.90. This is quite long, and does not really allow us to pinpoint the mean.
p Whereas the length of the confidence interval using three people is only 2×1.96×3.8/√3 = 14.90/√3 = 8.6.
p If ten people were used to calculate the sample mean the corresponding interval length would be 14.90/√10 = 4.7.
p We see that for any given interval either the mean is in this interval
or not. The 95% comes into play when we look at the proportion of intervals that contain the mean.
p In reality:
p We do not know the true mean µ, so will never know whether the interval
contained the mean or not.
p We only observe one sample of size n, and thus have one CI.
Case 2: Skewed data – sample size 3
p In the previous example we looked at height data, which tends to be normal. In this example we consider right-skewed data, which is NOT normal – examples include house prices, salaries, etc.
p We randomly draw a sample of size 3 from a right skewed
distribution with mean 14 and standard deviation 10.7.
p The sample mean (average) has a mean which is 14 and a standard deviation (standard error) which is 10.7/√3 = 6.18.
p We construct a 95% confidence interval to locate the mean,
$$\left[\bar{X} - 1.96 \times \frac{10.7}{\sqrt{3}},\ \bar{X} + 1.96 \times \frac{10.7}{\sqrt{3}}\right]$$
The confidence interval is constructed under the assumption that the sample mean is normal. In the next slide we investigate how this
q We draw a sample of size three from this skewed distribution and take the average.
q The average corresponds to one of the green dots on the plot below. We construct a 95% interval.
q We see that only 93 of the 100 intervals contain the mean. The reason for the difference between 95 and 93 (though not much) can be found in the green plot of the sample mean. It is slightly skewed and clearly not normal. The sample size is not large enough for the CLT to work. We do not have 95% confidence in this 95% confidence interval.
Case 2: Skewed data – sample size 50
p In the previous example, it was clear that we did not have the full
95% confidence in the 95% confidence interval we had constructed.
p This was because the sample mean was not normal.
p We need to be careful when constructing confidence intervals
using small sample sizes because the normality assumption may not hold – this means our interval is not as reliable as we think it is.
p If the sample size is sufficiently large then we recall from Chapter 5 that the corresponding sample mean will be close to normal. This means that a 95% confidence interval will actually be a 95% confidence interval.
p In this next slide we look at the reliability of the 95% CI (where the data is sampled from a skewed distribution):
$$\left[\bar{X} - 1.96 \times \frac{10.7}{\sqrt{50}},\ \bar{X} + 1.96 \times \frac{10.7}{\sqrt{50}}\right]$$
We observe that the sample mean based on a sample of 50 appears close to normal (though this needs to be checked with a QQ-plot).
The `coverage’ of the confidence interval (at least over these 100 realizations) is
`about’ 95%.
We can `safely’ say that we have 95% confidence in the 95%
confidence interval.
To summarize, a 95% confidence interval is an interval where we are 95% confident it contains the mean (note that for any given interval the mean is either there or not – so there is no probability attached).
Implications
We do not need to (and
cannot, anyway) take a lot of random samples to “rebuild” the sampling distribution and find µ at its center.
[Figure: one sample of size n drawn from a population with mean µ.]
All we need is one SRS of size n and we can rely on the properties of the
sampling distribution to infer reasonable values for the population mean µ.
Multiple samples revisited
With 95% confidence, we can say that µ should be within 1.96 standard errors (1.96×σ/√n) of our sample mean x̄.
p In 95% of all possible samples of this size n, µ will indeed fall in our confidence interval.
p In only 5% of samples will x̄ be farther from µ.
p “Confidence” = the proportion of possible samples that give us a correct conclusion.
Calculation practice 1
p You want to rent an unfurnished one-bedroom apartment in Dallas.
The mean monthly rent for 10 randomly sampled apartments is 980 dollars. Assume that monthly rents follow a normal distribution with standard deviation 280 dollars.
p Question: Construct a 95% confidence interval for the mean
monthly rent of a one-bedroom apartment.
p Answer: The standard error for the sample mean is 280/√10 =
88.54. The 95% CI is [980 ±1.96×88.54] = [806,1153]. With 95% confidence we believe the mean price of one-bedroom apartments in Dallas lies in this interval.
p Question: Does the above confidence interval mean that 95% of all rents should lie in this interval?
p Answer: No, this is a confidence interval for the mean, not for the price of an individual apartment. An interval where 95% of apartment prices will lie is [980 ±1.96×√(88.54²+280²)] = [404,1556].
You do not have to understand this calculation, but you will notice this interval is much wider. The reason is that it must capture 95% of all rents, which are extremely varied. This interval will not shrink to zero as the sample size grows. The CI for the mean is supposed to capture the mean rent; it is far narrower and will get narrower as the sample size grows.
q Question: A realtor wants to know if the mean price of one-bedroom apartments in Dallas is more than 1100 dollars a month. Based on the confidence interval for the mean, what can you say?
q Answer: We showed that the 95% confidence interval for the mean is [806,1153] dollars. As this interval contains values both above and below 1100 dollars, we cannot say. We do not have enough data to answer her question.
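The calculation in this practice problem can be checked with a short Python sketch. The function name `z_interval` is an illustrative assumption; the inputs (mean 980, σ = 280, n = 10, multiplier 1.96) are the ones given above.

```python
from math import sqrt

def z_interval(sample_mean, sigma, n, z=1.96):
    """Confidence interval for the mean when sigma is known."""
    standard_error = sigma / sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = z_interval(sample_mean=980, sigma=280, n=10)
print(f"95% CI for the mean rent: [{low:.2f}, {high:.2f}]")

# The realtor's question: 1100 lies inside the interval,
# so the data cannot settle whether the mean exceeds 1100.
print(low <= 1100 <= high)
```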
Calculation practice 2
p Hypokalemia is diagnosed when the blood potassium level is below 3.5mEq/dl. The potassium in a blood sample varies from sample to sample and follows a normal distribution with unknown mean but standard deviation 0.2. A patient’s potassium is measured over 4 days. The four measurements are 3, 3.5, 3.9 and 4.4; their sample mean is 3.7.
p Question: Construct a 95% confidence interval for the mean potassium
and discuss whether the patient is likely to be diagnosed with Hypokalemia.
p Answer: The standard error for the sample mean is 0.2/√4 = 0.1. Thus the 95% confidence interval for the mean potassium level is [3.7±1.96×0.1] = [3.504,3.896]. This means with 95% confidence we believe the mean lies in this interval.
q Since no value of 3.5 or less lies in this interval, it suggests that the patient does not have low potassium. There is a precise way of answering this specific problem which we discuss in Chapter 7 (called statistical
Confidence interval misunderstandings
p Suppose 400 alumni were asked to rate the University of Okoboji counseling services on a scale from 1 to 10. The sample mean was found to be 8.6 and it is known that the standard deviation is σ=2. Ima Bitlost has done the analysis, but has made some mistakes.
p Ima computes the 95% CI interval for the mean satisfaction score
as [8.6±1.96×2]. What is her mistake?
p Ima has not taken into account that the sample mean has a much
smaller standard deviation (standard error) than the population. The
standard error is 2/√400 = 0.1. Thus the true CI is
[8.6±1.96×0.1] = [8.4,8.796].
p After correcting her mistake, she states that “I am 95% confident that the sample mean lies in the interval [8.4,8.796]”. What is wrong with her statement?
p This is a meaningless statement: for sure the sample mean lies in this interval, since the interval is centered on it.
p She quickly realizes her mistake and instead states “the probability that the mean lies in the interval [8.4,8.796] is 95%”. What misinterpretation is she making now?
p By 95%, we mean that if we repeated the experiment many times over, about 95% of the time the intervals will contain the mean. For any given interval the mean is either in there or not. There is no probability attached to it. To overcome this issue, we say that we have 95% confidence that the mean lies in this interval.
p Finally, in her defense for using the normal distribution to determine
the confidence coefficient (1.96) she says “Because the sample size is quite large, the population of alumni ratings will be close to normal”. Explain to Ima her misunderstanding.
p The distribution of the population always stays the same, regardless of the sample size (in this case, it is clear that variables that take integer values between 1 and 10 cannot be normal). However, the sample mean does get closer to normal as the sample size grows. With a sample size of 400, the distribution of the sample mean will be very close to normal.
Different levels of confidence
p There is no need to restrict ourselves to 95% confidence intervals.
p The level of confidence we use really depends on how much confidence we want. For example, you would expect a 99% confidence interval to be more likely to contain the mean than a 95% confidence interval.
p To construct a 99% confidence interval we use exactly the same
prescription as used to construct a 95% confidence interval, the only thing that changes is 1.96 goes to 2.57 (if you look up -2.57 in the z-tables you will see this corresponds to 0.5%, so 99% of the time the sample mean will lie within 2.57 standard errors from the mean).
p A 99% CI for the mean one-bedroom apartment price is [980±2.57×88.54]. The length of the interval is 2×2.57×88.54 = 455.
q A 90% CI for the mean one-bedroom apartment price is [980±1.64×88.54]. The length of the interval is 2×1.64×88.54 = 290.
What does a 100% confidence interval look like? In a 100% CI we are sure to find the mean, but this interval is so wide it is not informative.
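The multipliers 1.64, 1.96 and 2.57 come from the standard normal distribution. As a sketch, they can be reproduced with Python's standard library instead of the z-tables; the helper name `z_star` is an assumption made for this illustration.

```python
from statistics import NormalDist

def z_star(confidence):
    """Multiplier leaving (1 - confidence)/2 in each tail of N(0, 1)."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 3))
```

The values agree with the table look-ups up to rounding (the tables give 1.64 and 2.57, where three decimals give 1.645 and 2.576).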
Sample size and length of the CI
p Let us return to the apartment example. We recall that the 95%
confidence interval for the mean price is [980 ±1.96×88.54] = [806,1153]. The length of this interval is 2×1.96×88.54 = 347.
p Question: Suppose I take a SRS of 100 apartments in Dallas, the
sample mean based on this sample is 1000, what will the CI be?
p Answer: The standard error is 280/√100 = 28 (much smaller than when the sample size is 10), and the CI is [1000 ±1.96×28] = [945,1055]. The length of this interval is 2×1.96×28 = 110.
p What we observe is:
p The length of the interval does not depend on the sample mean, which only determines the center of the interval. It only depends on (i) 1.96, (ii) the standard deviation and (iii) the sample size.
p The length of the interval gets smaller as the sample size increases.
p If we want the interval to have a certain length, we can choose the
How large an interval
p You read in a newspaper that
The proportion of the public that supports gay marriage is now 55%±15%.
q This means a survey was done, the proportion in the survey who supported gay marriage was 55%, and the confidence interval for the population proportion is [55-15,55+15]% = [40%,70%].
q This is an extremely large interval; it is so wide that it is really not that informative about the opinion of the public.
q As we will see on the next slide, the reason it is too wide is that the sample size is too small. This experiment was not designed well.
q Typically, before data is collected, we need to decide how large a sample to collect. This is usually done by deciding how much `above and below’ the estimator seems reasonable. For example, [55-3,55+3]% = [52,58]% is more informative. The 3% is known as a margin of error. Given a certain margin of error we can then
Margin of Error
p Margin of error is the lingo used for the plus and minus part in the
confidence interval.
p That is the confidence interval is
[sample mean±1.96×σ/√n], the margin of error is 1.96×σ/√n.
q For example, in the apartment example the margin of error for the CI based on 10 apartments is 1.96×88.54 = 173.5.
q The margin of error for the CI based on 100 apartments is 1.96×28 = 54.9.
q The margin of error, in some sense, is a measure of reliability. For a given confidence level, the smaller the margin of error the more precisely we can pinpoint the true mean.
q Suppose we want the margin of error to be equal to some value; then we can find the sample size such that we obtain that margin of error. Solve for n the equation MoE = 1.96×σ/√n (the Margin of Error and the standard deviation σ are given): n = (1.96×σ/MoE)²
Calculation practice
p In a study of bone turn over in young women with a medical
condition, serum TRAP was measured in 31 subjects. The sample mean was 13.2 units per liter. Assume the standard deviation is
known to be 6.5U/l.
p Question: Find the 80% CI for the mean serum level.
p Answer: An 80% CI leaves 10% in each tail; looking up 10% in the z-tables gives -1.28. The standard error for the sample mean is 6.5/√31 = 1.17. Altogether this gives the CI [13.2±1.28×1.17] = [11.7,14.7].
This means we believe with 80% confidence that the mean level of serum TRAP for women with this medical condition lies in this interval. By choosing such a low level of confidence our interval is quite narrow, but our confidence in this interval is relatively low.
q Question: How large a sample size should we choose such that the 80% CI for the mean has a margin of error of 1U/l?
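A worked sketch of this last question, assuming the sample size formula n = (z×σ/MoE)² from the Margin of Error slide with the 80% multiplier 1.28 in place of 1.96, σ = 6.5 and MoE = 1:

```latex
n = \left(\frac{1.28 \times \sigma}{\text{MoE}}\right)^2
  = \left(\frac{1.28 \times 6.5}{1}\right)^2
  = 8.32^2 \approx 69.2
```

Since the sample size must be a whole number and rounding down would give a margin of error slightly above 1, this suggests a sample of about 70 subjects.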
When the standard deviation is unknown?
p In the previous example we assumed the standard deviation was known. In general, before we collect the data, we will not have much information about the standard deviation. However, we will have some idea of bounds for it, e.g. the standard deviation for human heights is probably between 2 and 5 inches. Based on this information we can find the sample size whose Margin of Error is at most a certain length.
p Question: How large a sample size do we require such that the margin of error for a 95% confidence interval for the mean of human heights is at most 0.25 inch, given that σ lies somewhere between 2 and 5 inches?
p Answer: We know that the formula is n = (1.96×σ/0.25)². We need to choose the standard deviation to place in the formula.
p If we use σ=2, then the sample size is n=(1.96×2/0.25)² = 246.
p If we use σ=5, then the sample size is n=(1.96×5/0.25)² = 1537.
p For standard deviations between 2 and 5, the sample size will be between
246 – 1537. In the next slide we see what the MoE for these different sample
p Using the smaller standard deviation gives a smaller sample size, which is easier to collect. However, if the standard deviation is greater than 2, then the MoE will be larger than the desired maximum:
p If σ=5, and we use the minimum sample size n=246, then putting these numbers into the formula we see that MoE = 1.96×5/√246 = 0.62, which is larger than the required 0.25. This is not what we want, as we want to ensure that the MoE is at most 0.25.
p If σ=2, and we use the maximum sample size n=1537, then putting these numbers into the formula we see that MoE = 1.96×2/√1537 = 0.1, which is less than the required 0.25. This is exactly what we want, as we want the MoE to be at most 0.25.
q To be sure that the MoE is at most 0.25, we need to use a sample size of n=1537. This means always using what we believe is the maximum standard deviation in the calculation of the sample size, i.e.
$$n = \left(\frac{1.96 \times \sigma_{\text{MAX}}}{\text{MoE}}\right)^2$$
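The numbers in this example can be reproduced with a small Python sketch. The function names `sample_size` and `margin_of_error` are assumptions for illustration; rounding up to the next whole number is what keeps the margin of error at or below the target.

```python
from math import ceil, sqrt

def sample_size(sigma, moe, z=1.96):
    """Smallest whole n with z * sigma / sqrt(n) <= moe."""
    return ceil((z * sigma / moe) ** 2)

def margin_of_error(sigma, n, z=1.96):
    return z * sigma / sqrt(n)

n_small = sample_size(sigma=2, moe=0.25)  # optimistic: smallest plausible sigma
n_large = sample_size(sigma=5, moe=0.25)  # conservative: largest plausible sigma
print(n_small, n_large)

# If sigma is really 5 but we only collected n_small observations,
# the margin of error overshoots the 0.25 target.
print(round(margin_of_error(5, n_small), 2))
# With n_large the target is met even in the worst case sigma = 5.
print(margin_of_error(5, n_large) <= 0.25)
```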
Calculation practice (tricky)
p Question: A confidence interval for the length of parrots’ beaks is [4,10] inches. It is based on a sample of size n. By what factor should the sample size increase such that the margin of error is 1?
q Answer: This looks like an impossible question because we don’t have any obvious information. But we can break the problem into steps:
q Confidence intervals are centered about the sample mean, so the average of the observed data is 7. The margin of error is half the length of the CI, which is [10-4]/2 = 3 = 1.96×σ/√n.
q We want to decrease the MoE to 1, that is, to a third of its current value. Now some basic maths: suppose we increase the sample size by a factor of 9 (9 times the original data):
$$1.96 \times \frac{\sigma}{\sqrt{9n}} = 1.96 \times \frac{\sigma}{3\sqrt{n}} = \frac{1}{3} \times \underbrace{1.96 \times \frac{\sigma}{\sqrt{n}}}_{=3} = \frac{3}{3} = 1$$
Thus increasing the sample size by a factor of 9 results in the Margin of Error reducing to 1. Observe we need a huge increase in sample size to get a moderate decrease in the MoE!
Calculation (continued)
p Example: If a sample size of 20 gave a confidence interval [4,10], how large a sample size is required to reduce the margin of error to ½?
p Solution: If the confidence interval is [4,10], from the previous slide we know that the MoE is 3. This means that
$$1.96 \times \frac{\sigma}{\sqrt{20}} = 3$$
If I increase the sample size by a factor of 36, i.e. from n=20 to n=20×36=720, then I see that the margin of error is
$$1.96 \times \frac{\sigma}{\sqrt{36 \times 20}} = 1.96 \times \frac{\sigma}{6 \times \sqrt{20}} = \frac{1}{6} \times 1.96 \times \frac{\sigma}{\sqrt{20}} = \frac{3}{6} = \frac{1}{2}$$
We see that to decrease the margin of error from 3 to ½ (a sixth of its value) we need to increase the sample size by a factor of 36!
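Both parrot-beak calculations use the same rule: because the margin of error is proportional to 1/√n, shrinking it by a factor k requires k² times as many observations. A minimal Python sketch of that rule (the function name `sample_size_factor` is an assumption for illustration):

```python
def sample_size_factor(current_moe, target_moe):
    """Factor by which n must grow so the MoE drops from current to target."""
    return (current_moe / target_moe) ** 2

print(sample_size_factor(3, 1))         # MoE 3 -> 1
print(sample_size_factor(3, 0.5))       # MoE 3 -> 1/2
print(20 * sample_size_factor(3, 0.5))  # new sample size when starting from n = 20
```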
Analysis with unknown standard deviation
p So far we have assumed that the standard deviation is known, even
though the mean is unknown.
p In some situations, this is realistic. For example, in the potassium level
example, it seems reasonable to suppose that the amount of variation for everyone is about the same, but everyone has their own personal mean level, which is unknown.
p In most situations, however, the standard deviation is also unknown.
p Given the data: 68, 68.5, 68.9 and 69.4 the sample mean is 68.7; how do we `get’ the standard deviation to construct a confidence interval?
p We do not know the standard deviation, but we know that we can estimate it using the formula
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$
p For our example it is
$$s = \sqrt{\frac{1}{3}\left([-0.7]^2 + [-0.2]^2 + [0.2]^2 + [0.7]^2\right)} = 0.59$$
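As a quick check of this calculation, the sample standard deviation can be computed with Python's standard library. The sketch assumes the four observations are 68, 68.5, 68.9 and 69.4, which are the values consistent with the stated mean of 68.7 and the deviations ±0.7 and ±0.2 in the formula.

```python
from statistics import mean, stdev

heights = [68, 68.5, 68.9, 69.4]

x_bar = mean(heights)
s = stdev(heights)  # divides by n - 1, matching the formula for s

print(round(x_bar, 1), round(s, 2))
```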
Using the z-transform with estimated standard deviation
p Once we have estimated the standard deviation we replace the unknown true standard deviation in the z-transform with the estimated standard deviation:
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \ \Rightarrow\ \frac{\bar{X} - \mu}{s/\sqrt{n}}, \qquad \bar{X} \pm 1.96 \times \frac{\sigma}{\sqrt{n}} \ \rightarrow\ \bar{X} \pm 1.96 \times \frac{s}{\sqrt{n}}$$
q After this we could conduct the analysis just as before. However, we will show in the next few slides (with the aid of Statcrunch) that this strategy leads to unreliable confidence intervals (when the sample size is small). We consider two examples:
q The data is normal (we `draw’ samples from a distribution with mean 67 and standard deviation 3.8; however, whoever constructs the confidence interval does not know these specifications) and the sample size is n = 3.
q The data is normal (as above), but the sample size is n = 50.
Case 1: Normal data – sample size 3.
In this example we draw samples of size 3:
q The 95% CI using the above data and the normal distribution is
$$\left[69.9 - 1.96 \times \frac{1.73}{\sqrt{3}},\ 69.6 + 1.96 \times \frac{1.73}{\sqrt{3}}\right]$$
We see from this example that the estimated standard deviation (1.73) underestimates the true standard deviation (3.8). This in general tends to be true for small sample sizes. This means the 95% CI is too narrow.
We see from the plot on the left that only 84% of the `95% CI’ contain the mean. This means it is not a 95% CI. Something has gone wrong.
Case 1: Normal data – sample size 50
In the previous example the sample size was 3; now we consider the case that the sample size is 50. For the example given on the right the 95% CI is
$$\left[68.0 - 1.96 \times \frac{4.07}{\sqrt{50}},\ 68.0 + 1.96 \times \frac{4.07}{\sqrt{50}}\right]$$
For this example, the estimated standard deviation 4.07 is far closer to the true 3.8. This in general is true for large sample sizes.
Looking at the number of times the mean is contained within the 95% confidence interval (on the right) we see that it is close to the prescribed level of 95%.
Observations from the experiments
p Simply replacing the true standard deviation with the estimated standard
deviation seems to have severe consequences on the confidence interval.
p When the sample size was small there tends to be an underestimation in
the standard error, resulting in the 95% CI not really being a 95% CI.
p To see why consider the z-transforms of the sample mean with known
and estimated standard deviations:
p (sample mean - µ)/(σ/√n)
p (sample mean - µ)/(s/√n)
p In the first case, the z-transform will be a standard normal. In the second case the estimated standard deviation adds extra variability into the `system’. In particular, because s can be smaller than σ, the z-transform can take larger, more extreme values than we would expect for a standard normal.
p In the next few slides we show that when we estimate the standard deviation the z-transform is no longer a standard normal, but follows the so-called t-distribution.
Review: σ is unknown
p When the sample size is large,
the sample is likely to contain elements representative of the
whole population. Then s is a
good estimate of σ.
[Figure: the population distribution, with a small sample and a large sample drawn from it.]
p But when the sample size is
small, the sample contains only
a few individuals. Then s is a
mediocre estimate of σ.
p The data is unlikely to contain values in the tails, and s is likely to underestimate σ.
When σ is unknown, we can estimate the standard deviation from the data. The sample standard deviation s provides an estimate of the population standard deviation σ.
Sample means and standard deviations
p Just like the sample mean is random with a distribution, so is the
sample standard deviation.
p Here we take a sample of size 10 from a normal distribution can
Estimating the standard deviation
p The sampling distribution of the sample standard deviation (n=5)
q The sampling distribution of the sample standard deviation (n=25)
Observe that as the sample size increases the sample standard deviation becomes less variable (its spread reduces from 1.70 to 0.65). A large amount of variability in the sample standard deviation influences the confidence interval.
That nice Mr. Gosset
p Just over 100 years ago, W.S. Gosset was a biometrician who worked for the Guinness Brewery in Dublin, Ireland.
p His hobby was statistics.
p Gosset realized that his inferences
with small sample data seemed to be incorrect too often – his true
confidence level was less than it was stated to be. We just observed this in the simulations previously.
p He worked out the proper method that
took into account substituting s for σ.
p But he had to publish under a
pseudonym: Student (probably because Gosset was a sweet and modest person).
p Gosset’s theory is based on the distribution of the quantity
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
p This looks like the z-score for x̄, except that s replaces σ in the denominator.
Formal: Student’s t distributions
Suppose that an SRS of size n is drawn from a Normal(µ, σ) population.
p When σ is known, the sampling distribution for
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
is Normal(0,1).
p When σ is estimated from the sample standard deviation s, the sampling distribution for
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
will be very close to normal if the sample size n is large. This is because for large n, s will be a very reliable estimator of σ.
q However, in the case that n is not so large, the variability in s will have an impact on the distribution.
q It is clear that the impact it has depends on the sample size.
Student’s t distributions
p When σ is estimated from the sample standard deviation s, the sampling distribution for t will depend on the sample size.
The sampling distribution of
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
is a t distribution with n − 1 degrees of freedom.
p The degrees of freedom (df) is a measure of how well s estimates σ. The larger the degrees of freedom, the better σ is estimated.
q This means we need a new set of tables!
q Further reading: http://onlinestatbook.com/2/estimation/t_distribution.html
When n is very large, s is a very good estimate of σ, and the corresponding t distributions are very close to the normal distribution. The t distributions become wider (thicker tailed) for smaller sample sizes, reflecting that s can be smaller than σ, so the corresponding t-transform is more likely to take extreme values than the z-transform.
Impact on confidence intervals
Suppose we want to construct the C% confidence interval for the mean. The standard deviation is unknown, so as well as estimating the mean we also estimate the standard deviation from the sample. The C% Confidence Interval is:
$$\left[\bar{X} - t_{n-1}\left(\frac{100 - C}{2}\right) \times \frac{s}{\sqrt{n}},\ \bar{X} + t_{n-1}\left(\frac{100 - C}{2}\right) \times \frac{s}{\sqrt{n}}\right]$$
Example: For a 95% confidence level C, 95% of the area under Student’s t curve is contained in the interval [−t*, t*].
Examples:
95%, sample size n=3:
$$\left[\bar{X} - 4.3 \times \frac{s}{\sqrt{3}},\ \bar{X} + 4.3 \times \frac{s}{\sqrt{3}}\right]$$
95%, sample size n=10:
$$\left[\bar{X} - 2.26 \times \frac{s}{\sqrt{10}},\ \bar{X} + 2.26 \times \frac{s}{\sqrt{10}}\right]$$
Confidence level and the margin of error
The confidence level C determines the value of t* (in Table D). The margin of error also depends on t*:
$$m = t^* \times \frac{s}{\sqrt{n}}$$
§ Higher confidence C implies a larger
margin of error m (thus less precision
in our estimates).
§ A lower confidence level C produces
a smaller margin of error m (thus
better precision in our estimates).
§ We find t* in the line of Table D for df
Table D
When the sample is very large, we use the normal distribution and the standardized z-value.
When σ is unknown, we use a t distribution with “n−1” degrees of freedom (df).
Table D shows the z-values and t-values corresponding to landmark P-values/ confidence levels.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
p Focus first on 2.5%. For each row of the table, 2.5% is the area in each of the left and right tails of the t-distribution with the degrees of freedom given in that row.
Remember a distribution gives the chance/likelihood of certain outcomes.
p Recall that for a normal distribution, the point where we get 2.5% in each of the left and right tails of the distribution is 1.96 (which is in the very last row of the table).
p If we go down the table, we see that as the degrees of freedom increase, the value corresponding to 2.5% goes from 12.71 (for 1 degree of freedom) to a number that is very close to 1.96 for extremely large samples.
p This means for small n the variability on the standard deviation s means
that the chance of the t-transform being extreme is relatively large.
p However, as n grows, the estimator of the standard deviation improves,
and the t-transform gets closer to a normal distribution.
p You observe the same is true for other percentages.
p 90% means looking up 5%
p 99% means looking up 0.5%
Case 1: Normal data – sample size 3, using t-dist
In this example we draw samples of size 3:
q The 95% CI using the above data and the t-distribution is
$$\left[69.9 - 4.3 \times \frac{1.73}{\sqrt{3}},\ 69.6 + 4.3 \times \frac{1.73}{\sqrt{3}}\right]$$
This is the same example as considered previously, but now the t-distribution has been used. 1.96 has been replaced with 4.3. From the plot on the right we see that, using the t-distribution to construct the CI, about 95% of the 95% confidence intervals really do contain the population mean.
By using the t-distribution we have corrected for the underestimation of the standard deviation by the sample sd.
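The comparison between the two multipliers can be sketched as a simulation in Python. This is an illustration rather than the Statcrunch experiment from the slides: the function name, seed and number of repetitions are assumptions, µ = 67 and σ = 3.8 are taken from the height example, and 4.303 is the t-table value for 2 degrees of freedom (rounded to 4.3 above).

```python
import random
from math import sqrt
from statistics import mean, stdev

def coverage_with_estimated_sd(multiplier, n=3, reps=20000,
                               mu=67.0, sigma=3.8, seed=1):
    """Fraction of intervals x_bar +/- multiplier * s / sqrt(n) containing mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = mean(sample)
        # s is estimated from the same small sample, so it varies a lot.
        margin = multiplier * stdev(sample) / sqrt(n)
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / reps

z_cov = coverage_with_estimated_sd(1.96)   # normal multiplier
t_cov = coverage_with_estimated_sd(4.303)  # t multiplier, df = 2
print(z_cov, t_cov)
```

In runs of this kind the normal multiplier gives coverage well below 95% for n = 3, while the t multiplier brings it back to about 95%.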
Non-normal data: A misconception
Using a t-distribution rather than a normal distribution when constructing a confidence interval does not correct for the lack of normality in the data. In the example on the left, we use the t-distribution to construct the CI. But we observe that only 88 of the 100 95% confidence intervals contain the mean. Fundamentally, if the data is not normal and the sample size is small, neither the normal nor the t will give the correct 95% confidence interval.
REMEMBER we only use the t-distribution because we have
estimated the standard deviation from the data.
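The same point can be illustrated with skewed data. The sketch below assumes an exponential population with mean 1 and samples of size 3; these are hypothetical choices for illustration and are not the data used for the slide's example.

```python
import math
import random
import statistics

random.seed(2)                 # fixed seed so the run is repeatable
N, TRIALS = 3, 20000
T_STAR = 4.303                 # 2.5% value for 2 degrees of freedom
TRUE_MEAN = 1.0                # mean of the assumed exponential(1) population

hits = 0
for _ in range(TRIALS):
    sample = [random.expovariate(1.0) for _ in range(N)]   # skewed data
    xbar = statistics.mean(sample)
    margin = T_STAR * statistics.stdev(sample) / math.sqrt(N)
    hits += (xbar - margin <= TRUE_MEAN <= xbar + margin)

coverage_skewed = hits / TRIALS
print("coverage of nominal 95% t-intervals:", coverage_skewed)
```

Even though the t multiplier is used, the proportion of intervals containing the true mean falls clearly short of 95% for this skewed population.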
Calculation practice (red wine 1)
It has been suggested that drinking red wine in moderation may protect against heart attacks. This is because red wine contains polyphenols which act on blood cholesterol.
To see if moderate red wine consumption increases the average blood level of polyphenols, a group of nine randomly selected healthy men were assigned to drink half a bottle of red wine daily for two weeks. The percent changes in their blood polyphenol levels are presented here:
0.7 3.5 4.0 4.9 5.5 7.0 7.4 8.1 8.4
Sample average x̄ = 5.50
Sample standard deviation s = 2.517
Degrees of freedom df = n − 1 = 8
We will encounter two problems when doing the analysis. The first is that the sample size is not huge, so we have to hope that the sample mean is close to normally distributed. The second is that the standard deviation is unknown and has to be estimated from the data.
q What is the 95% confidence interval for the average percent change?
p First, we determine what t* is. The degrees of freedom are df = n − 1 = 8 and C = 95%, so Table D gives t* = 2.306.
p The margin of error m is: m = t* × s/√n = 2.306 × 2.517/√9 ≈ 1.93. So the 95% confidence interval is 5.50 ± 1.93, or 3.57 to 7.43.
p We can say “With 95% confidence, the mean percent increase is between 3.57% and 7.43%.”
p What if we want a 99% confidence interval instead?
p For C = 99% and df = 8, we find t* = 3.355. Thus m = 3.355 × 2.517/√9 ≈ 2.81.
p Now, with 99% confidence, we can only conclude that the mean is between 2.69 and 8.31. (A big price to pay for the extra confidence.)
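The arithmetic above can be reproduced in a few lines. This is a small sketch using the data and the t* values quoted in the text; the variable names are illustrative.

```python
import math
import statistics

changes = [0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4]

n = len(changes)
xbar = statistics.mean(changes)
s = statistics.stdev(changes)      # sample sd, divides by n - 1
se = s / math.sqrt(n)              # standard error of the mean

# t* values for df = 8, as quoted in the text
t_values = {"95%": 2.306, "99%": 3.355}

intervals = {}
for level, t_star in t_values.items():
    m = t_star * se                # margin of error
    intervals[level] = (xbar - m, xbar + m)
    print(f"{level}: {xbar:.2f} ± {m:.2f} = "
          f"[{xbar - m:.2f}, {xbar + m:.2f}]")
```

The 99% interval is wider than the 95% interval because only t* changes; the data, s and n stay the same.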
(…)
Calculation practice (red wine 2)
Let us return to the same study, but this time we increase the sample size to 15 men. The data are now:
0.7, 3.5, 4, 4.9, 5.5, 7, 7.4, 8.1, 8.4, 3.2, 0.8, 4.3, -0.2, -0.6, 7.5
The sample mean in this case is 4.3 and the sample standard deviation is 3.06.
Since the sample size has increased, it is likely that the sample standard deviation is a more reliable estimator of the true standard deviation.
The number of degrees of freedom is 14.
Just as in the previous example we can construct a 95% confidence interval, but now we use 14 df instead of 8 df.
Solution: Using the t-tables the 95% CI is
\[ \left[ 4.3 \pm \underbrace{2.145}_{\text{t-tables, 14 df, 2.5\%}} \times \frac{3.06}{\sqrt{15}} \right] = [2.6, 6] \]
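To see how the extra six observations change the interval, the two samples can be compared side by side. This is a sketch with an illustrative helper, `mean_and_margin`, which is not part of any package; the t* values are those used in the text.

```python
import math
import statistics

first_nine = [0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4]
extra_six = [3.2, 0.8, 4.3, -0.2, -0.6, 7.5]

def mean_and_margin(data, t_star):
    """Return (sample mean, margin of error t_star * s / sqrt(n))."""
    s = statistics.stdev(data)     # sample sd, divides by n - 1
    return statistics.mean(data), t_star * s / math.sqrt(len(data))

mean9, margin9 = mean_and_margin(first_nine, 2.306)                # 8 df
mean15, margin15 = mean_and_margin(first_nine + extra_six, 2.145)  # 14 df

print(f"n = 9 : {mean9:.2f} ± {margin9:.2f}")
print(f"n = 15: {mean15:.2f} ± {margin15:.2f} = "
      f"[{mean15 - margin15:.1f}, {mean15 + margin15:.1f}]")
```

In this data set the sample standard deviation is larger for 15 men than for 9, yet the margin of error is smaller, because both t* and 1/√n decrease as n grows.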
Confidence intervals using Software
p Usually software will construct the confidence interval for you. Therefore it is important to connect the calculations with the statistical output.
The box on the right is the output (it is superimposed on the window used to generate the output). Observe that L. Limit and U. Limit give the confidence interval [2.6, 6] calculated on the previous slide.
DF = 14 matches the degrees of freedom.
Calculation practice 3
p Let us return to the example of prices of apartments in Dallas. 10 apartments are randomly sampled. The sample mean and the sample standard deviation based on this sample are 980 dollars and 250 dollars (both are estimates based on a sample of size ten). Construct a 95% confidence interval for the mean:
p The standard error is 250/√10 ≈ 79.
p Looking up the t-tables at 2.5% and 9 degrees of freedom gives 2.262.
p The 95% confidence interval for the mean is [980 ± 2.262×79] = [801, 1159].
q Suppose we want to know whether the price of apartments has increased since last year, when the mean price was 850 dollars.
q The interval contains 850 dollars as well as larger values, so the mean could be 850 dollars or higher. Therefore, given the sample, it is unclear whether the mean price of apartments has increased since last year or not.
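When only summary statistics are available, the interval can be built directly from them. The sketch below uses the numbers from the apartment example; the variable names are illustrative.

```python
import math

n, xbar, s = 10, 980.0, 250.0     # sample size, mean and sd in dollars
t_star = 2.262                    # t-tables: 2.5%, 9 degrees of freedom
last_year = 850.0                 # last year's mean price

se = s / math.sqrt(n)             # standard error, about 79
margin = t_star * se
lower, upper = xbar - margin, xbar + margin

print(f"95% CI: [{lower:.0f}, {upper:.0f}]")
print("850 inside the interval:", lower <= last_year <= upper)
```

Because 850 lies inside the interval, the sample does not rule out that the mean price is unchanged.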
Calculation practice 4
p Let us return to the M&M data. Suppose we want to calculate a 99%
confidence interval for the mean number of M&Ms in plain, peanut butter and peanut M&Ms. These can be calculated using the
summary statistics output:
Summary statistics for Total: Group by: Type
Type  n   Mean       Variance  Std. Dev.  Std. Err.   Median  Range  Min  Max  Q1  Q3
M     84  17.297619  8.259753  2.8739786  0.3135768   18      14     7    21   17  19
P     40  8.675      9.814744  3.1328492  0.49534693  8       15     6    21   7   8
PB    46  10.913043  3.325604  1.8236238  0.26887867  11      10     8    18   10  11
Using this output we can calculate the confidence intervals for the mean number of M&Ms in each type.
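Before building the intervals it helps to see where the Std. Err. column comes from: it is the standard deviation divided by √n. This short sketch checks that relationship using the values in the summary output; the dictionary layout is an illustrative choice.

```python
import math

# (n, standard deviation, reported standard error) per type, from the output
summary = {
    "M":  (84, 2.8739786, 0.3135768),
    "P":  (40, 3.1328492, 0.49534693),
    "PB": (46, 1.8236238, 0.26887867),
}

std_err = {}
for mm_type, (n, sd, reported) in summary.items():
    std_err[mm_type] = sd / math.sqrt(n)      # s / sqrt(n)
    print(f"{mm_type}: computed {std_err[mm_type]:.5f}, reported {reported:.5f}")
```

Each confidence interval is then the sample mean plus or minus t* times this standard error, with t* taken at n − 1 degrees of freedom.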
Using Software to obtain confidence intervals
p Go to Stats -> t-statistics -> one-sample -> with data -> select the column you want to analyse (choose the Group by if you want it grouped). On the next page select confidence interval and the level you want it at.
Type  Sample mean  Std. err  DF  L Limit  U limit
M     17.2         0.31      83  16.4     18.12
P     8.6          0.49      39  7.33     10.01
PB    10.9         0.268     45  10.18    11.63
Looking at the intervals, do you think that the mean number of M&Ms in a plain and a peanut bag could be the same?
What about the mean number in peanut and peanut butter? Later on we shall make a formal test on these questions.
Calculation practice: coffee shop sales
A marketing firm randomly samples 45 coffee shops and determines their annual sales. The sample has an average of $2.67 million and a standard deviation of $1.03 million. What can we say with 90% confidence about the mean annual sales for the population of all coffee shops?
p The degrees of freedom are 45 − 1 = 44.
p For 90% confidence, we find t* = 1.680.
p The margin of error is 1.680 × 1.03/√45 = 0.258.
p So the interval for the true mean is 2.67 ± 0.26.
p “We conclude that the mean annual sales of all coffee shops is between $2.41 million and $2.93 million, with 90% confidence.”
Summary of confidence interval for µ
p The confidence interval for a population mean µ is
\[ \bar{x} \pm t^* \frac{s}{\sqrt{n}} \]
p t* is obtained from Student’s t distribution using n−1 degrees of
freedom. (Table D in the textbook.)
p t* is the value such that the confidence level C is the area between
–t* and t*.
p Confidence is the proportion of samples that lead to a correct
conclusion (for a specific method of inference).
p The investigator chooses the confidence level C.
p Tradeoff: more confidence means bigger margin of error, wider
intervals.
p The degrees of freedom are associated with s, the estimate for σ.
p The margin of error, t* × s/√n, also depends on the sample size: larger samples are better.
Interpretation of confidence, again
p The confidence level C is the proportion of all possible random
samples (of size n) that will give results leading to a correct conclusion, for a specific method.
p In other words, if many random samples were obtained and
confidence intervals were constructed from their data with C = 95% then 95% of the intervals would contain the true parameter value.
p In the same way, if an investigator always uses C = 95% then 95%
of the confidence intervals he constructs will contain the parameter value being estimated.
p But he never knows which ones do!
p Changing the method (such as changing the value of t*) will change
the confidence level.
p Once computed, any individual confidence interval either will or will
not contain the true population parameter value. It is not random.
p It is not correct to say C is the probability that the true value falls in the computed interval.
Cautions about using x̄ ± t* × s/√n
p This formula is only for inference about µ, the population mean.
Different formulas are used for inference about other parameters.
p The data must be a simple random sample from the population.
p The formula is not quite correct for other sampling designs. (But see a statistician to get the right inference method.)
p Confidence intervals based on t* are not resistant to outliers.
p If n is small and the population is not normal, the true confidence
level could be smaller than C. (Usually n ≥ 30 suffices unless the data are highly skewed.)
p This inference cannot rescue sampling bias, badly produced data or
computational errors.
Accompanying problems associated with this Chapter
p Quiz 7
p Quiz 8