Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Chapter 27
An Example: Body Fat and Waist Size
Our chapter example revolves around the relationship
Slide 27- 4
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Remembering Regression
In regression, we want to model the relationship
between two quantitative variables, one the predictor and the other the response.
To do that, we imagine an idealized regression
line, which assumes that the means of the
distributions of the response variable fall along the line even though individual values are
Remembering Regression (cont.)
Now we’d like to know what the regression model
can tell us beyond the individuals in the study.
We want to make confidence intervals and test
Slide 27- 6
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
The Population and the Sample
When we found a confidence interval for a mean,
we could imagine a single, true underlying value for the mean.
When we tested whether two means or two
proportions were equal, we imagined a true underlying difference.
What does it mean to do inference for
The Population and the Sample (cont.)
We know better than to think that even if we know every
population value, the data would line up perfectly on a straight line.
In our sample, there’s a whole distribution of %body fat for
Slide 27- 8
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
The Population and the Sample (cont.)
This is true at each waist size.
We could depict the distribution of %body fat at different
The Population and the Sample (cont.)
The model assumes that the means of the
distributions of %body fat for each waist size fall along the line even though the individuals are
scattered around it.
The model is not a perfect description of how the
variables are associated, but it may be useful.
If we had all the values in the population, we
could find the slope and intercept of the idealized
Slide 27- 10
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
The Population and the Sample (cont.)
We write the idealized line with Greek letters and
consider the coefficients to be parameters: 0 is the intercept and 1 is the slope.
Corresponding to our fitted line of , we
write
Now, not all the individual y’s are at these means
—some lie above the line and some below. Like all models, there are errors.
0 1
ˆ
y b b x
0 1
y
x
Assumptions and Conditions
In Chapter 8 when we fit lines to data, we needed
to check only the Straight Enough Condition.
Now, when we want to make inferences about the
coefficients of the line, we’ll have to make more assumptions (and thus check more conditions).
We need to be careful about the order in which
Slide 27- 12
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Assumptions and Conditions (cont.)
1. Linearity Assumption:
Straight Enough Condition: Check the
Assumptions and Conditions (cont.)
1. Linearity Assumption:
If the scatterplot is straight enough, we can
Slide 27- 14
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Assumptions and Conditions (cont.)
2. Independence Assumption:
Randomization Condition: the individuals are
a representative sample from the population.
Check the residual plot (part 1)—the
Assumptions and Conditions (cont.)
3. Equal Variance Assumption:
Does The Plot Thicken? Condition: Check the residual plot (part 2)—the spread of the
Slide 27- 16
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Assumptions and Conditions (cont.)
4. Normal Population Assumption:
Nearly Normal Condition: Check a histogram of the
Assumptions and Conditions (cont.)
If all four assumptions are true, the idealized regression model
would look like this:
At each value of x there is a distribution of y-values that follows a
Slide 27- 18
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Which Come First:
the Conditions or the Residuals?
There’s a catch in regression—the best way to check
many of the conditions is with the residuals, but we get
the residuals only after we compute the regression model.
To compute the regression model, however, we should
check the conditions.
So we work in this order:
Make a scatterplot of the data to check the Straight
Which Come First:
the Conditions or the Residuals? (cont.)
If the data are straight enough, fit a regression model
and find the residuals, e, and predicted values, .
Make a scatterplot of the residuals against x or the
predicted values.
This plot should have no pattern. Check in particular for any
bend, any thickening, or any outliers.
If the data are measured over time, plot the residuals
against time to check for evidence of patterns that might suggest they are not independent.
ˆ
Slide 27- 20
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Which Come First:
the Conditions or the Residuals? (cont.)
If the scatterplots look OK, then make a
histogram and/or Normal probability plot of the residuals to check the Nearly Normal
Condition.
If all the conditions seem to be satisfied, go
Example 1: Oranges
Slide 27- 22
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Example 1: Oranges continued…
The usual null hypothesis about the slope is that it’s equal to 0. (In other words, that there is no linear association between the two variables.)
H0: β1=0 There is no difference in the weight of the oranges as the number per tree change.
Example 1 continued…
Regression Model:
The weight of an orange decreases by .00026 pound for every one orange growing on the tree.
598672
.
00026
.
^
Slide 27- 24
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Example 1 continued…
Model
1. Straight Enough Condition: There is no obvious bend in the
original scatterplot of the data.
2. The oranges growing on one tree does not influence the oranges
on another tree and we think that the orange trees are
representative of all oranges. The residual plot does not seem to have any patterns, trends, or clumping.
3. Equal Variance Condition: The residual plot looks spread evenly
over the line.
4. Nearly Normal Condition: The histogram of the residuals is
unimodal and approximately symmetric, so the Normal model is reasonable for the errors.
Example 1 continued…
Conclusion:
The estimated regression equation is
The R2 for the regression is 98.8%. The number of oranges on the tree
seems to account for about 98.8% of the variation in the weight of the oranges. The slope of the regression says that as 1 more orange
grows on the tree, there is a corresponding decrease of .00026 pounds in the weight of an orange.
The standard error is 0.000 for the slope which is very small, so it looks like the slope estimate is reasonably precise. The P value for the slope is also very small at <.0001, less than our α of 0.05. It appears that the null hypothesis can be rejected. There is strong evidence that the more oranges growing on a tree the weight of the fruit decreases.
Slide 27- 26
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Confidence Intervals and Regression
Continuing our exploration of the “orange problem”,
Interpret the meaning of a 95% confidence level
in the context of this problem.
Find a 95% confidence interval for __________.
Remember:
Confidence level vs. confidence interval?????? If you find an interval, you must also interpret it,
Example: Wildebeests in Tanzania
Long term records from the Serengeti national Park in Tanzania show interesting ecological
relationships. When wildebeest are more
abundant, they graze the grass more heavily, so there are fewer fires and more trees grow. Lions feed more successfully when there are more
Slide 27- 28
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Slide 27- 30
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Intuition About Regression Inference
We expect any sample to produce a b
1 whose
expected value is the true slope, 1.
What about its standard deviation?
What aspects of the data affect how much the
Slide 27- 32
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Intuition About Regression Inference (cont.)
Spread around the line:
Less scatter around the line means the
slope will be more consistent from sample to sample.
The spread around the line is measured with
the residual standard deviation se.
You can always find s
e in the regression
Intuition About Regression Inference (cont.)
Slide 27- 34
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Intuition About Regression Inference (cont.)
Spread of the x’s: A large standard deviation of
Intuition About Regression Inference (cont.)
Sample size: Having a larger sample size, n,
Slide 27- 36
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Standard Error for the Slope
Three aspects of the scatterplot affect the
standard error of the regression slope:
spread around the line, s
e
spread of x values, s
x
sample size, n.
The formula for the standard error (which you will
probably never have to calculate by hand) is:
1Sampling Distribution for Regression Slopes
When the conditions are met, the standardized
estimated regression slope
follows a Student’s t-model with n – 2 degrees of freedom.
1 1
1
b
t
SE b
Slide 27- 38
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Sampling Distribution for Regression Slopes
(cont.)
We estimate the standard error with
where:
n is the number of data values
s
x is the ordinary standard deviation of the x-values.
11 e x s SE b n s
2
What About the Intercept?
The same reasoning applies for the intercept.
We can write
but we rarely use this fact for anything.
The intercept usually isn’t interesting. Most
hypothesis tests and confidence intervals for regression are about the slope.
0 0
2 0
( ) n
b t
Slide 27- 40
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Regression Inference
A null hypothesis of a zero slope questions the entire
claim of a linear relationship between the two variables—often just what we want to know.
To test H
0: 1 = 0, we find
and continue as we would with any other t-test.
The formula for a confidence interval for
1 is
1 10
b
t
SE b
1 n 2 1
Example 2: Marriage Age
Look at the computer output on pg. 659 problem #5. Is
there a decrease in the difference in ages at first marriage for men and women since 1975? Assuming we have
already done the regression analysis, find a 95%
Slide 27- 42
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Example 2 continued…
The formula for the confidence interval is:
Check the computer output for the values that you need. Look up the critical t value in the chart at the back of the book.
1 n 2 1
b t
SE b
Example 2 concluded…
Interpret the interval:
Based on this data, we are 95% confident that the
Slide 27- 44
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Example 3: Handout
Let’s take a look at another problem. We will perform a complete regression
inference involving the association between average monthly
Example 3 continued…
Hypothesis:
H0: β1=0 There is no association between average monthly temperature and electrical usage.
Slide 27- 46
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Example 3 continued…
Model:
1. Straight Enough Condition: The scatterplot appears
linear.
2. Independence/Randomization: The residuals show
random scatter. The data is a representative sample of electrical usage.
3. Does the Plot Thicken? The residual plot is spread fairly
uniformly.
4. Nearly Normal Condition: The histogram of the
residuals is unimodal and approx. symmetric.
Example 3 continued…
Mechanics:
df = 10 t = -10.2
P-Value = 2 P(t<-10.2) = .001 (two sided test)
2
.
3898
9027
.
50
ˆ
avtemp
Slide 27- 48
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Example 3 continued…
Conclusion
The P-value of .001 is much lower than our alpha level of .005. Using the regression slope t-test, we reject the null hypothesis. We have sufficient evidence to say that there is an association
between temperature and electrical usage. As temperature increased, the electrical usage