swm02_27.ppt

(1)

(2)

Chapter 27

(3)

An Example: Body Fat and Waist Size

 Our chapter example revolves around the relationship

(4)

Slide 27- 4

Remembering Regression

 In regression, we want to model the relationship

between two quantitative variables, one the predictor and the other the response.

 To do that, we imagine an idealized regression

line, which assumes that the means of the

distributions of the response variable fall along the line even though individual values are

(5)

Remembering Regression (cont.)

 Now we’d like to know what the regression model

can tell us beyond the individuals in the study.

 We want to make confidence intervals and test

(6)

Slide 27- 6

The Population and the Sample

 When we found a confidence interval for a mean,

we could imagine a single, true underlying value for the mean.

 When we tested whether two means or two

proportions were equal, we imagined a true underlying difference.

 What does it mean to do inference for

(7)

The Population and the Sample (cont.)

 We know better than to think that even if we know every

population value, the data would line up perfectly on a straight line.

 In our sample, there’s a whole distribution of %body fat for

(8)

Slide 27- 8

The Population and the Sample (cont.)

 This is true at each waist size.

 We could depict the distribution of %body fat at different

(9)

The Population and the Sample (cont.)

 The model assumes that the means of the

distributions of %body fat for each waist size fall along the line even though the individuals are

scattered around it.

 The model is not a perfect description of how the

variables are associated, but it may be useful.

 If we had all the values in the population, we

could find the slope and intercept of the idealized

(10)

Slide 27- 10

The Population and the Sample (cont.)

 We write the idealized line with Greek letters and

consider the coefficients to be parameters: ₀ is the intercept and ₁ is the slope.

 Corresponding to our fitted line of , we

write

 Now, not all the individual y’s are at these means

—some lie above the line and some below. Like all models, there are errors.

0 1

ˆ

y b b x 

0 1

y

x

(11)

Assumptions and Conditions

 In Chapter 8 when we fit lines to data, we needed

to check only the Straight Enough Condition.

 Now, when we want to make inferences about the

coefficients of the line, we’ll have to make more assumptions (and thus check more conditions).

 We need to be careful about the order in which

(12)

Slide 27- 12

Assumptions and Conditions (cont.)

1. Linearity Assumption:

 Straight Enough Condition: Check the

(13)

Assumptions and Conditions (cont.)

1. Linearity Assumption:

 If the scatterplot is straight enough, we can

(14)

Slide 27- 14

Assumptions and Conditions (cont.)

2. Independence Assumption:

 Randomization Condition: the individuals are

a representative sample from the population.

 Check the residual plot (part 1)—the

(15)

Assumptions and Conditions (cont.)

3. Equal Variance Assumption:

 Does The Plot Thicken? Condition: Check the residual plot (part 2)—the spread of the

(16)

Slide 27- 16

Assumptions and Conditions (cont.)

4. Normal Population Assumption:

 Nearly Normal Condition: Check a histogram of the

(17)

Assumptions and Conditions (cont.)

 If all four assumptions are true, the idealized regression model

would look like this:

 At each value of x there is a distribution of y-values that follows a

(18)

Slide 27- 18

Which Come First:

the Conditions or the Residuals?

 There’s a catch in regression—the best way to check

many of the conditions is with the residuals, but we get

the residuals only after we compute the regression model.

 To compute the regression model, however, we should

check the conditions.

 So we work in this order:

 Make a scatterplot of the data to check the Straight

(19)

Which Come First:

the Conditions or the Residuals? (cont.)

 If the data are straight enough, fit a regression model

and find the residuals, e, and predicted values, .

 Make a scatterplot of the residuals against x or the

predicted values.

 This plot should have no pattern. Check in particular for any

bend, any thickening, or any outliers.

 If the data are measured over time, plot the residuals

against time to check for evidence of patterns that might suggest they are not independent.

ˆ

(20)

Slide 27- 20

Which Come First:

the Conditions or the Residuals? (cont.)

 If the scatterplots look OK, then make a

histogram and/or Normal probability plot of the residuals to check the Nearly Normal

Condition.

 If all the conditions seem to be satisfied, go

(21)

Example 1: Oranges

(22)

Slide 27- 22

Example 1: Oranges continued…

The usual null hypothesis about the slope is that it’s equal to 0. (In other words, that there is no linear association between the two variables.)

H₀: β₁=0 There is no difference in the weight of the oranges as the number per tree change.

(23)

Example 1 continued…

Regression Model:

The weight of an orange decreases by .00026 pound for every one orange growing on the tree.

598672

.

00026

.

^





(24)

Slide 27- 24

Example 1 continued…

Model

1. Straight Enough Condition: There is no obvious bend in the

original scatterplot of the data.

2. The oranges growing on one tree does not influence the oranges

on another tree and we think that the orange trees are

representative of all oranges. The residual plot does not seem to have any patterns, trends, or clumping.

3. Equal Variance Condition: The residual plot looks spread evenly

over the line.

4. Nearly Normal Condition: The histogram of the residuals is

unimodal and approximately symmetric, so the Normal model is reasonable for the errors.

(25)

Example 1 continued…

Conclusion:

The estimated regression equation is

The R2_{for the regression is 98.8%. The number of oranges on the tree}

seems to account for about 98.8% of the variation in the weight of the oranges. The slope of the regression says that as 1 more orange

grows on the tree, there is a corresponding decrease of .00026 pounds in the weight of an orange.

The standard error is 0.000 for the slope which is very small, so it looks like the slope estimate is reasonably precise. The P value for the slope is also very small at <.0001, less than our α of 0.05. It appears that the null hypothesis can be rejected. There is strong evidence that the more oranges growing on a tree the weight of the fruit decreases.

(26)

Slide 27- 26

Confidence Intervals and Regression

Continuing our exploration of the “orange problem”,

 Interpret the meaning of a 95% confidence level

in the context of this problem.

 Find a 95% confidence interval for __________.

Remember:

Confidence level vs. confidence interval?????? If you find an interval, you must also interpret it,

(27)

Example: Wildebeests in Tanzania

Long term records from the Serengeti national Park in Tanzania show interesting ecological

relationships. When wildebeest are more

abundant, they graze the grass more heavily, so there are fewer fires and more trees grow. Lions feed more successfully when there are more

(28)

Slide 27- 28

(29)

(30)

Slide 27- 30

(31)

Intuition About Regression Inference

 We expect any sample to produce a b

1 whose

expected value is the true slope, ₁.

 What about its standard deviation?

 What aspects of the data affect how much the

(32)

Slide 27- 32

Intuition About Regression Inference (cont.)



Spread around the line:

 Less scatter around the line means the

slope will be more consistent from sample to sample.

 The spread around the line is measured with

the residual standard deviation s_e.

 You can always find s

e in the regression

(33)

Intuition About Regression Inference (cont.)

(34)

Slide 27- 34

Intuition About Regression Inference (cont.)

 Spread of the x’s: A large standard deviation of

(35)

Intuition About Regression Inference (cont.)

 Sample size: Having a larger sample size, n,

(36)

Slide 27- 36

Standard Error for the Slope

 Three aspects of the scatterplot affect the

standard error of the regression slope:

 spread around the line, s

e

 spread of x values, s

x

 sample size, n.

 The formula for the standard error (which you will

probably never have to calculate by hand) is:

 

1

(37)

Sampling Distribution for Regression Slopes

 When the conditions are met, the standardized

estimated regression slope

follows a Student’s t-model with n – 2 degrees of freedom.

 

1 1

1

b

t

SE b



(38)

Slide 27- 38

Sampling Distribution for Regression Slopes

(cont.)

 We estimate the standard error with

where:



 n is the number of data values

 s

x is the ordinary standard deviation of the x-values.

 

1

1 e x s SE b n s  

 2

(39)

What About the Intercept?

 The same reasoning applies for the intercept.

 We can write

but we rarely use this fact for anything.

 The intercept usually isn’t interesting. Most

hypothesis tests and confidence intervals for regression are about the slope.

0 0

2 0

( ) n

b _t

(40)

Slide 27- 40

Regression Inference

 A null hypothesis of a zero slope questions the entire

claim of a linear relationship between the two variables—often just what we want to know.

 To test H

0: 1 = 0, we find

and continue as we would with any other t-test.

 The formula for a confidence interval for 

1 is

 

1 1

0 b

t

SE b





 

1 n 2 1

(41)

Example 2: Marriage Age

Look at the computer output on pg. 659 problem #5. Is

there a decrease in the difference in ages at first marriage for men and women since 1975? Assuming we have

already done the regression analysis, find a 95%

(42)

Slide 27- 42

Example 2 continued…

The formula for the confidence interval is:

Check the computer output for the values that you need. Look up the critical t value in the chart at the back of the book.

 

1 n 2 1

b t



_



SE b

(43)

Example 2 concluded…

Interpret the interval:

Based on this data, we are 95% confident that the

(44)

Slide 27- 44

Example 3: Handout

Let’s take a look at another problem. We will perform a complete regression

inference involving the association between average monthly

(45)

Example 3 continued…

Hypothesis:

H₀: β₁=0 There is no association between average monthly temperature and electrical usage.

(46)

Slide 27- 46

Example 3 continued…

Model:

1. Straight Enough Condition: The scatterplot appears

linear.

2. Independence/Randomization: The residuals show

random scatter. The data is a representative sample of electrical usage.

3. Does the Plot Thicken? The residual plot is spread fairly

uniformly.

4. Nearly Normal Condition: The histogram of the

residuals is unimodal and approx. symmetric.

(47)

Example 3 continued…

Mechanics:

df = 10 t = -10.2

P-Value = 2 P(t<-10.2) = .001 (two sided test)

2 .

3898

9027

.

50 ˆ





avtemp



(48)

Slide 27- 48

Example 3 continued…

Conclusion

The P-value of .001 is much lower than our alpha level of .005. Using the regression slope t-test, we reject the null hypothesis. We have sufficient evidence to say that there is an association

between temperature and electrical usage. As temperature increased, the electrical usage