Chapter 5
Suppose we found the age and weight for
each person in a sample of 10 adults. Is
there any relationship between the age
and weight of these adults?
Create a scatterplot of the data below.
Do you think there is a relationship? If so, what
kind? If not, why not?
Age 24 30 41 28 50 46 49 35 20 39
Wt 256 124 320 185 158 129 103 196 110 130
[Scatterplot: Weight vs. Age]
There does not appear to be a relationship between age and weight in these adults.
Suppose we found the height and weight
for each person in a sample of 10 adults.
Is there any relationship between the
height and weight of these adults?
Create a scatterplot of the data below.
Ht 74 65 77 72 68 60 62 73 61 64
Wt 256 124 320 185 158 129 103 196 110 130
Do you think there is a relationship? If so, what
kind? If not, why not?
Is it positive or negative? Weak or strong?
[Scatterplot: Weight vs. Height]
Correlation
• The relationship between bivariate numerical variables
– May be positive or negative
– May be weak or strong
What does it mean if the relationship is positive? Negative?
What feature(s) of the graph would indicate a weak or strong relationship?
Identify the strength and direction
of the following data sets.
[Scatterplots: Set A, Set B, Set C]
Set A shows a strong, positive linear relationship.
Set B shows little or no relationship.
Set C shows a weaker (moderate), negative linear relationship.
Set D
Set D shows a
strong,
Identify as having a positive relationship, a negative relationship, or no relationship.
1. Heights of mothers and heights of their adult daughters: +
2. Age of a car in years and its current value: -
3. Weight of a person and calories consumed: +
4. Height of a person and the person’s birth month: no
5. Number of hours spent in safety training and the number of accidents that occur: -
Correlation Coefficient (r)
• A quantitative assessment of the strength and direction of the linear relationship in bivariate, quantitative data
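• Pearson’s sample correlation coefficient is the one used most often
• Population correlation coefficient: ρ (rho)
• Sample (statistic) correlation coefficient: r
• Equation:
$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
What are these? (z-scores)
Example 5.1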
For the six primarily undergraduate universities
in California with enrollments between 10,000
and 20,000, six-year graduation rates (y) and
student-related expenditures per full-time
student (x) for 2003 were reported as follows:
Create a scatterplot and calculate r.
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
Example 5.1 Continued
[Scatterplot: Graduation Rates vs. Expenditures]
r = 0.05
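The value r = 0.05 can be checked directly from the z-score form of the correlation formula. The following is a minimal sketch in Python using only the standard library; the function name `correlation` and the variable names are illustrative choices, not part of the original example.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson's r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Data from Example 5.1
expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r = correlation(expenditures, grad_rates)
print(round(r, 2))
```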
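In order to interpret what this number tells us, let’s investigate the properties of r.
Properties of r (correlation coefficient)
1) legitimate values are -1 ≤ r ≤ 1
[Number line from -1 to 1 marked at -1, -.8, -.5, 0, .5, .8 and 1: no correlation at 0, moderate correlation between .5 and .8 in absolute value, strong correlation beyond .8 in absolute value]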
2) value of r is not changed by any linear transformation (when a variable is multiplied by a negative constant, only the sign of r changes)
Suppose that the graduation rates were changed from percents to decimals (divide by 100).
Transform the graduation rates and calculate r.
Do the following transformations and calculate r
1) x’ = 5(x + 14)
2) y’ = (y + 30) ÷ 4
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
r = 0.05
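As a quick check of property 2, the sketch below recomputes r after each transformation listed on the slide. The helper `pearson_r` is a hypothetical name introduced here; the last line adds one extra case (multiplying x by -1) that is not on the slide, to show that a negative multiplier only flips the sign.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from z-scores (sample standard deviations)."""
    x_bar, y_bar, s_x, s_y = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r_original = pearson_r(expenditures, grad_rates)

# Percents to decimals: y / 100
r_decimal = pearson_r(expenditures, [y / 100 for y in grad_rates])

# x' = 5(x + 14) and y' = (y + 30) / 4
r_transformed = pearson_r([5 * (x + 14) for x in expenditures],
                          [(y + 30) / 4 for y in grad_rates])

# Extra case (not on the slide): a negative multiplier reverses the sign
r_negated = pearson_r([-x for x in expenditures], grad_rates)

print(round(r_original, 4), round(r_decimal, 4),
      round(r_transformed, 4), round(r_negated, 4))
```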
3) value of r does not depend on which of the two variables is labeled x
Suppose we wanted to estimate the expenditures per student for given graduation rates.
Switch x and y, then calculate r.
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
r = 0.05
4) value of r is affected by extreme values.
Expenditures     8011 7323 8735 7548 7071 8248
Graduation rates 64.6 53.0 46.3 42.5 38.5 33.9
Suppose the 33.9 was REALLY 63.9. What do you think would happen to the value of the correlation coefficient?
Plot a revised scatterplot and find r.
[Scatterplots: Graduation Rates vs. Expenditures, original data and revised data with 63.9]
Extreme values affect the correlation coefficient
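To see how much a single changed value matters, the sketch below recomputes r after replacing 33.9 with 63.9. The helper name `pearson_r` is an illustrative assumption; the revised value of r is computed here and does not appear on the slide.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from z-scores (sample standard deviations)."""
    x_bar, y_bar, s_x, s_y = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

# Same data, but the last graduation rate is 63.9 instead of 33.9
revised_rates = grad_rates[:-1] + [63.9]

r_original = pearson_r(expenditures, grad_rates)
r_revised = pearson_r(expenditures, revised_rates)
print(round(r_original, 2), round(r_revised, 2))
```

Changing one observation moves r from about 0.05 to roughly 0.4, which is why r is described as sensitive to extreme values.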
5) value of r is a measure of the extent to which x and y are linearly related
Find the correlation for these points:
x -3 -1 1 3 5 7 9
y 40 20 8 4 8 20 40
Sketch the scatterplot and compute the correlation coefficient.
[Scatterplot: y vs. x]
r = 0
Does this mean that there is NO relationship between these variables?
r = 0, but the data set has a definite relationship!
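The sketch below confirms that r is 0 for these seven points even though y follows a clear curved pattern in x. The helper `pearson_r` is a hypothetical name used only for this illustration.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from z-scores (sample standard deviations)."""
    x_bar, y_bar, s_x, s_y = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = [-3, -1, 1, 3, 5, 7, 9]
ys = [40, 20, 8, 4, 8, 20, 40]

r = pearson_r(xs, ys)
print(abs(round(r, 6)))  # abs() avoids printing -0.0
```

The y-values are symmetric about x = 3, so the positive and negative z-score products cancel. r measures only the linear part of a relationship.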
Recap the Properties of r:
1. legitimate values of r are -1 ≤ r ≤ 1
2. value of r is not changed by any linear transformation
3. value of r does not depend on which of the two variables is labeled x
4. value of r is affected by extreme values
Example 5.1 Continued
[Scatterplot: Graduation Rates vs. Expenditures]
Interpret r = 0.05
In order to interpret r, recall the definition of the correlation coefficient:
A quantitative assessment of the strength and direction of the linear relationship between bivariate, quantitative data
There is a weak, positive, linear relationship between expenditures and graduation rates.
Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable?
Consider the following examples:
• The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive.
  (These variables are both strongly related to the age of the child.)
• Consumption of hot chocolate is negatively correlated with crime rate.
  (Both are responses to cold weather.)
So does this mean I should feed children more candy to increase their vocabulary?
Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.
Correlation does not imply causation
What is the objective of regression analysis?
Suppose that we have two variables:
x = the amount spent on advertising
y = the amount of sales for the product during a given period
What question might I want to answer using this data?
The objective of regression analysis is to use information about one variable, x, to draw some sort of a conclusion about a second variable, y.
• x-variable: the independent or explanatory variable
• y-variable: the dependent or response variable
• We will use values of x to predict values of y.
The LSRL is $\hat{y} = a + bx$
$\hat{y}$ (y-hat) means the predicted y. Be sure to put the hat on the y.
b is the slope
– it is the approximate amount by which y increases when x increases by 1 unit
a is the y-intercept
– it is the approximate height of the line when x = 0
– in some situations, the y-intercept has no meaning
Scatterplots frequently exhibit a linear pattern. When this is the case, it makes sense to summarize the relationship between the variables by finding a line that is as close as possible to the points in the plot.
This is done by calculating the line of best fit or Least Squares Regression Line (LSRL).
The LSRL is the line that minimizes the sum of the squares of the deviations from the line
The slope of the LSRL is
$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
The intercept of the LSRL is
$$a = \bar{y} - b\bar{x}$$
Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2).
Let’s just fit a line to the data by drawing a line through what appears to be the middle of the points:
$\hat{y} = .5x + 4$
Now find the vertical distance from each point to the line.
(0,0): $\hat{y}$ = .5(0) + 4 = 4 and 0 - 4 = -4
(3,10): $\hat{y}$ = .5(3) + 4 = 5.5 and 10 - 5.5 = 4.5
(6,2): $\hat{y}$ = .5(6) + 4 = 7 and 2 - 7 = -5
Find the sum of the squares of these deviations.
Sum of the squares = 61.25
Use a calculator to find the line of best fit:
$\hat{y} = \frac{1}{3}x + 3$
Find the vertical deviations from the line: -3, 6, -3
What is the sum of the deviations from the line? Will it always be zero?
Find the sum of the squares of the deviations from the line.
Sum of the squares = 54
The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
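The comparison between the hand-drawn line and the LSRL can be reproduced with the slope and intercept formulas. This is a minimal sketch; the function names `lsrl` and `sum_squared_deviations` are illustrative assumptions.

```python
def lsrl(points):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_deviations(points, a, b):
    """Sum of squared vertical deviations from the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)

points = [(0, 0), (3, 10), (6, 2)]

# Hand-drawn line: y-hat = .5x + 4
sse_guess = sum_squared_deviations(points, 4, 0.5)

# Least-squares line from the formulas
a, b = lsrl(points)
sse_lsrl = sum_squared_deviations(points, a, b)
deviations = [y - (a + b * x) for x, y in points]

print(sse_guess, round(a, 3), round(b, 3), round(sse_lsrl, 2))
```

The deviations from the LSRL sum to zero (up to rounding), and no other line gives a sum of squares smaller than 54 for these points.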
Researchers are studying pomegranate's antioxidant properties to see if it might be helpful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with .1% pomegranate fruit extract (PFE), and water supplemented with .2% PFE. The average tumor volume for mice in each group was recorded for several points in time. (x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume in mm³)
x 11 15 19 23 27
y 150 270 450 580 740
Pomegranate study continued
Calculate the LSRL and the correlation coefficient.
Interpret the slope and the correlation coefficient in context.
$\hat{y} = -269.75 + 37.25x$, r = 0.998
The average volume of the tumor increases by approximately 37.25 mm³ for each one-day increase in the number of days after injection.
Remember that an interpretation is stating the definition in context.
There is a strong, positive, linear relationship between the average tumor volume and the number of days since injection.
Pomegranate study continued
Predict the average volume of the tumor for 20 days after injection.
Predict the average volume of the tumor for 5 days after injection.
$\hat{y} = -269.75 + 37.25x$
$\hat{y} = -269.75 + 37.25(20) = 475.25 \text{ mm}^3$
$\hat{y} = -269.75 + 37.25(5) = -83.5 \text{ mm}^3$
Can volume be negative?
This is the danger of extrapolation.
The least-squares line should not be used to make predictions for y using x-values outside the range in the data set.
Why?
It is unknown whether the pattern observed in the scatterplot continues outside this range.
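The two predictions can be reproduced, and the extrapolation flagged, with a short sketch. The function names `lsrl` and `predict` and the idea of returning an extrapolation flag are illustrative assumptions, not part of the original study.

```python
def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]
a, b = lsrl(days, volume)

def predict(x):
    """Return (predicted volume, True if x lies outside the observed days)."""
    is_extrapolation = not (min(days) <= x <= max(days))
    return a + b * x, is_extrapolation

pred_20, extrap_20 = predict(20)
pred_5, extrap_5 = predict(5)
print(a, b)
print(pred_20, extrap_20)
print(pred_5, extrap_5)
```

Day 5 falls outside the observed range of 11 to 27 days, so the negative prediction is a warning sign rather than a meaningful estimate.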
Pomegranate study continued
Suppose we want to know how many days after injection of cancer cells would the average tumor size be 500 mm³?
$\hat{y} = -269.75 + 37.25x$
Is this the appropriate regression line to answer this question?
No, the slope of the line for predicting x is $r\frac{s_x}{s_y}$, not $r\frac{s_y}{s_x}$, and the intercepts are almost always different.
Here is the appropriate regression line:
$\hat{x} = 7.277 + .027y$
The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.
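The line for predicting x from y is obtained by swapping the roles of the two variables in the least-squares formulas. The sketch below does this for the pomegranate data; the function name `lsrl` and the variable names are illustrative.

```python
def lsrl(xs, ys):
    """Return (a, b) for the least-squares line that predicts ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

# Regress days on volume: volume is now the predictor
a_x, b_x = lsrl(volume, days)
days_for_500 = a_x + b_x * 500

print(round(a_x, 3), round(b_x, 3), round(days_for_500, 1))
```

Because r is so close to 1 here, solving the y-on-x line for x gives nearly the same answer. With a weaker correlation the two lines would differ noticeably.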
Pomegranate study continued
Find the mean of the x-values ($\bar{x}$) and the mean of the y-values ($\bar{y}$).
$\bar{x} = 19$ and $\bar{y} = 438$
Plot the point of averages ($\bar{x}$, $\bar{y}$) on the scatterplot.
Will the point of averages always be on the regression line?
Let’s investigate how the LSRL and
correlation coefficient change when
different points are added to the data set
Suppose we have the following data set.
x 4 5 6 7 8
y 2 5 4 6 9
Sketch a scatterplot. Calculate the LSRL and the correlation coefficient.
$\hat{y} = -3.8 + 1.5x$, r = 0.916
Suppose we add the point (5,8) to the data set. What happens to the regression line and the correlation coefficient?
Original data: $\hat{y} = -3.8 + 1.5x$, r = 0.916
With (5,8) added: $\hat{y} = -1.15 + 1.17x$
Suppose we add the point (12,12) to the data set. What happens to the regression line and the correlation coefficient?
Original data: $\hat{y} = -3.8 + 1.5x$, r = 0.916
With (12,12) added: $\hat{y} = -2.24 + 1.225x$
Suppose we add the point (12,0) to the data set. What happens to the regression line and the correlation coefficient?
Original data: $\hat{y} = -3.8 + 1.5x$, r = 0.916
With (12,0) added: $\hat{y} = 6.26 - 0.275x$
The correlation coefficient and the LSRL are both measures that are affected by extreme values.
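The effect of each added point can be explored with a short sketch that refits the line and recomputes r. The function name `fit` is an illustrative assumption; the r values for the augmented data sets are computed here and are not shown on the slides.

```python
from math import sqrt

def fit(points):
    """Return (a, b, r) for the LSRL y-hat = a + b*x and the correlation."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b, s_xy / sqrt(s_xx * s_yy)

base = [(4, 2), (5, 5), (6, 4), (7, 6), (8, 9)]

results = {"original": fit(base)}
for extra in [(5, 8), (12, 12), (12, 0)]:
    results[extra] = fit(base + [extra])

for label, (a, b, r) in results.items():
    print(label, round(a, 2), round(b, 3), round(r, 3))
```

A point far from the others in the x direction, such as (12,0), can change both the slope and the sign of r.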
Pomegranate study revisited
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x 11 15 19 23 27
y 150 270 450 580 740
Minitab, a statistical software package, was used to fit the least-squares regression line. Part of the resulting output is shown below.
The regression equation is
Predicted volume = -269.75 + 37.25 days
Predictor  Coef     SE Coef    T          P
Constant   -269.75  23.421412  -11.51724  0.0014
(In this output, the Constant coefficient is the intercept and the days coefficient is the slope.)
Assessing the fit of the LSRL
Important questions are:
1. Is the line an appropriate way to summarize
the relationship between x and y?
2. Are there any unusual aspects of the data
set that we need to consider before
proceeding to use the line to make
predictions?
3. If we decide to use the line as a basis for
prediction, how accurate can we expect
predictions based on the line to be?
Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y.
We will look at graphical and numerical methods to assess the fit.
In a study, researchers were interested in how the
distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.
x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36
y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65
Minitab was used to fit the least-squares regression line. From the partial output, identify the regression line.
Predictor           Coef   SE Coef  T      P
Constant            -7.69  13.33    -0.58  0.582
Distance to debris  3.234  1.782    1.82   0.112
S = 8.67071  R-Sq = 32.0%  R-Sq(adj) = 22.3%
$\hat{y} = -7.69 + 3.234x$
[Scatterplot: Distance traveled vs. Distance to debris]
The vertical deviation between the point and the LSRL is called the residual.
If the point is above the line, the residual will be positive.
If the point is below the line, the residual will be negative.
Residuals are calculated by subtracting the predicted y from the observed y:
residual = $y - \hat{y}$
Use the LSRL to calculate the predicted distance traveled. Subtract to find the residuals.
Distance from debris (x)  Distance traveled (y)  Predicted distance traveled ($\hat{y}$)  Residual ($y - \hat{y}$)
6.94                      0.00                   14.76                                    -14.76
5.23                      6.13                   9.23                                     -3.10
5.21                      11.29                  9.16                                     2.13
7.10                      14.35                  15.28                                    -0.93
8.16                      12.03                  18.70                                    -6.67
5.50                      22.72                  10.10                                    12.62
9.19                      20.11                  22.04                                    -1.93
9.05                      26.16                  21.58                                    4.58
9.36                      30.65                  22.59                                    8.06
What does the sum of the residuals equal? Will the sum of the
residuals always equal zero?
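The residual table can be reproduced by fitting the LSRL to the deer mouse data and subtracting. This is a minimal sketch with illustrative names; because it uses the unrounded slope and intercept, individual values may differ slightly in the last digit from a table built with the rounded coefficients.

```python
def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

to_debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
traveled = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

a, b = lsrl(to_debris, traveled)
predicted = [a + b * x for x in to_debris]
residuals = [y - y_hat for y, y_hat in zip(traveled, predicted)]

for x, y, y_hat, res in zip(to_debris, traveled, predicted, residuals):
    print(x, y, round(y_hat, 2), round(res, 2))
print("sum of residuals:", round(sum(residuals), 6))
```

For a least-squares line with an intercept, the residuals always sum to zero apart from rounding error.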
Residual plots
• A residual plot is a scatterplot of the (x, residual) pairs.
• Residuals can also be graphed against the predicted y-values
• The purpose is to determine if a linear model is the best way to describe the relationship between the x & y variables
• If no pattern exists between the points in the residual plot, then the linear model is appropriate.
[Residual plot: Residuals vs. x, no pattern]
This residual plot shows no pattern, so it indicates that the linear model is appropriate.
[Residual plot: Residuals vs. x, curved pattern]
This residual plot shows a curved pattern, so it indicates that the linear model is not appropriate.
Use the values in the table of predicted distances and residuals to create a residual plot for this data set. Is a linear model appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food?
[Residual plot: Residuals vs. Distance from debris]
Since the residual plot displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food.
Now plot the residuals against the predicted distance traveled.
[Residual plot: Residuals vs. Predicted distance traveled]
What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values?
Residual plots can be plotted against either the x-values or the predicted y-values.
Let’s examine the following data set:
The following data is for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)
x 10.5 6.5 28.5 10.5 6.5 7.5 6.5 5.5 7.5 11.5 9.5 5.5
y 54 40 62 51 55 56 62 42 40 59 51 50
Sketch a scatterplot with the fitted regression line.
Do you notice anything unusual about this data set?
[Scatterplots: Weight vs. Age]
Influential observation
This point is considered an influential point because it affects the placement of the least-squares regression line.
What would happen to the regression line if this point is removed?
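One way to explore the question is to fit the line with and without the suspect point. The sketch below assumes the unusual observation is the bear aged 28.5 years (weight 62 kg), whose age is far from all the others; the slides mark the point only on the plot, so that choice is an assumption. Function and variable names are illustrative.

```python
def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

age = [10.5, 6.5, 28.5, 10.5, 6.5, 7.5, 6.5, 5.5, 7.5, 11.5, 9.5, 5.5]
weight = [54, 40, 62, 51, 55, 56, 62, 42, 40, 59, 51, 50]

# Assumed influential observation: age 28.5, weight 62 (index 2)
age_without = age[:2] + age[3:]
weight_without = weight[:2] + weight[3:]

a_all, slope_all = lsrl(age, weight)
a_without, slope_without = lsrl(age_without, weight_without)
print(round(slope_all, 2), round(slope_without, 2))
```

Under this assumption, removing the single point roughly doubles the slope, which is what it means for an observation to be influential.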
[Scatterplot: Weight vs. Age]
Notice that this observation has a large residual.
An observation is an outlier if it has a large residual.
Coefficient of determination
• Denoted by r²
• gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y
Let’s explore the meaning of r² by revisiting the deer mouse data set.
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36
y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65
Suppose you didn’t know any x-values. What distance would you expect deer mice to travel?
$\bar{y} = 15.938$
What is the total amount of variation in the distance traveled (y-values)? Hint: Find the sum of the squared deviations.
$$SSTo = \sum (y - \bar{y})^2$$
SS stands for “sum of squares”
Total amount of variation in the distance traveled is 773.95 m².
Why do we square the deviations?
[Scatterplot: Distance traveled vs. Distance to Debris]
Now suppose you DO know the x-values. Your best guess would be the predicted distance traveled (the point on the LSRL).
By how much do the observed points vary from the LSRL? Hint: Find the sum of the residuals squared.
$$SSResid = \sum (y - \hat{y})^2$$
The points vary from the LSRL by 526.27 m².
[Scatterplot: Distance traveled vs. Distance to debris]
Total amount of variation in the distance traveled is 773.95 m².
The points vary from the LSRL by 526.27 m².
Approximately what percent of the variation in distance traveled can be explained by the regression line?
$$r^2 = 1 - \frac{SSResid}{SSTo} = 1 - \frac{526.27}{773.95} = 0.320$$
or about 32%.
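The two sums of squares and r² can be computed directly from the deer mouse data. This is a minimal sketch with illustrative variable names.

```python
to_debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
traveled = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(to_debris)
x_bar = sum(to_debris) / n
y_bar = sum(traveled) / n

# Least-squares slope and intercept
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(to_debris, traveled))
s_xx = sum((x - x_bar) ** 2 for x in to_debris)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Total variation in y, and variation left over around the LSRL
ss_to = sum((y - y_bar) ** 2 for y in traveled)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(to_debris, traveled))

r_squared = 1 - ss_resid / ss_to
print(round(ss_to, 2), round(ss_resid, 2), round(r_squared, 3))
```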
Partial output from the regression analysis of deer mouse data:
Predictor           Coef   SE Coef  T      P
Constant            -7.69  13.33    -0.58  0.582
Distance to debris  3.234  1.782    1.82   0.112
S = 8.67071  R-sq = 32.0%  R-sq(adj) = 22.3%
Let’s review the values from this output and their meanings.
The y-intercept (a):
This value has no meaning in context since it doesn't make sense to have a negative distance.
The slope (b):
The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter in the distance to the nearest debris pile.
The coefficient of determination (r²):
Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.
The standard deviation (s):
What does this number represent?
This is the typical amount by which an observation deviates from the least squares regression line. It’s found by:
$$s_e = \sqrt{\frac{SSResid}{n - 2}}$$
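As a quick arithmetic check, the value S = 8.67071 in the output follows from SSResid = 526.27 and n = 9 observations. The variable names below are illustrative.

```python
import math

ss_resid = 526.27  # sum of squared residuals for the deer mouse data
n = 9              # number of observations

# Standard deviation about the least-squares line
s_e = math.sqrt(ss_resid / (n - 2))
print(round(s_e, 2))
```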
Let’s examine this data set:
x = representative age
y = average marathon finish time
Age  15     25     35     45     55     65
Time 302.38 193.63 185.46 198.49 224.30 288.71
Create a scatterplot for this data set.
[Scatterplot: Average Finish Time vs. Representative Age]
Because of the curved pattern, a straight line would not accurately describe the relationship between average finish time and age.
Since this curve resembles a parabola, a quadratic function can be used to describe this relationship.
$\hat{y} = a + b_1 x + b_2 x^2$
Using Minitab:
The least-squares quadratic regression is
$\hat{y} = 462 - 14.2x + 0.179x^2$
This curve minimizes the sum of the squares of the residuals (similar to least-squares linear regression).
Notice the residuals from the quadratic regression.
[Scatterplot: Average Finish Time vs. Representative Age, with the quadratic regression curve]
Here is the residual plot.
[Residual plot: Residuals vs. Age]
Since there is no pattern in the residual plot, the quadratic regression is an appropriate model.
The measure R² is useful for assessing the fit of the quadratic regression.
$$R^2 = 1 - \frac{SSResid}{SSTo}$$
R² = .921
92.1% of the variation in average marathon finish times can be explained by the approximate quadratic relationship between average finish time and age.
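The quadratic fit and its R² can be reproduced without statistical software by solving the least-squares normal equations. The sketch below is one possible approach using plain Python; the helper `solve_linear_system` and all variable names are illustrative, and Minitab's internal method may differ.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution

ages = [15, 25, 35, 45, 55, 65]
times = [302.38, 193.63, 185.46, 198.49, 224.30, 288.71]

# Normal equations for y-hat = a + b1*x + b2*x^2
powers = [sum(x ** p for x in ages) for p in range(5)]
matrix = [[powers[i + j] for j in range(3)] for i in range(3)]
rhs = [sum(y * x ** i for x, y in zip(ages, times)) for i in range(3)]
a, b1, b2 = solve_linear_system(matrix, rhs)

predicted = [a + b1 * x + b2 * x * x for x in ages]
y_bar = sum(times) / len(times)
ss_resid = sum((y - p) ** 2 for y, p in zip(times, predicted))
ss_to = sum((y - y_bar) ** 2 for y in times)
r_squared = 1 - ss_resid / ss_to

print(round(a, 1), round(b1, 2), round(b2, 3), round(r_squared, 3))
```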
Depending on the data set, other regression
models, such as cubic regression, may be used.
Statistical software (like Minitab) is commonly
used to calculate these regression models.
Another method for fitting regression models to non-linear data sets is to transform the data, making it linear.
Commonly Used Transformations
Transformation                              Equation
No transformation                           $\hat{y} = a + bx$
Square root of x                            $\hat{y} = a + b\sqrt{x}$
Log of x *                                  $\hat{y} = a + b\log_{10}(x)$
Reciprocal of x                             $\hat{y} = a + b\left(\frac{1}{x}\right)$
Log of y * (Exponential growth or decay)    $\log_{10}(\hat{y}) = a + bx$
Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume
x 11 15 19 23 27 31 35 39
y 40 75 90 210 230 330 450 600
Sketch a scatterplot for this data set.
[Scatterplot: Average tumor volume vs. Number of days]
There appears to be a curve in the data points.
Let’s use a transformation to linearize the data.
Since the data appears to be exponential growth, let’s use the log of y transformation.
Sketch a scatterplot of log(y) and x.
x      11   15   19   23   27   31   35   39
Log(y) 1.60 1.88 1.95 2.32 2.36 2.52 2.65 2.78
[Scatterplot: Log of Average tumor volume vs. Number of days]
Notice that the relationship now appears linear. Let’s fit an LSRL to the transformed data.
The LSRL is
$\log(\hat{y}) = 1.226 + 0.041x$
What would the predicted average tumor size be 30 days after injection of cancer cells?
$\log(\hat{y}) = 1.226 + 0.041(30)$
$\log(\hat{y}) = 2.456$
$\hat{y} = 10^{2.456} = 285.76 \text{ mm}^3$
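The transform-fit-back-transform steps can be sketched as follows. The code takes base-10 logs of the original volumes rather than the rounded Log(y) values on the slide, so its coefficients and prediction differ slightly from the rounded hand calculation. Function and variable names are illustrative.

```python
import math

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

days = [11, 15, 19, 23, 27, 31, 35, 39]
volume = [40, 75, 90, 210, 230, 330, 450, 600]

# Step 1: transform y with a base-10 logarithm
log_volume = [math.log10(v) for v in volume]

# Step 2: fit the LSRL to (x, log y)
a, b = lsrl(days, log_volume)

# Step 3: predict on the log scale, then undo the transformation
log_pred_30 = a + b * 30
pred_30 = 10 ** log_pred_30

print(round(a, 3), round(b, 3), round(pred_30))
```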
Another useful transformation is the power transformation. The power transformation ladder and the scatterplot (both below) can be used to help determine what type of transformation is appropriate.
Power Transformation Ladder
Power  Transformed Value     Name
3      (Original value)³     Cube
2      (Original value)²     Square
1      (Original value)      No transformation
½      √(Original value)     Square root
1/3    ∛(Original value)     Cube root
0      Log(Original value)   Logarithm
-1     1/(Original value)    Reciprocal
Suppose that the scatterplot looks like
the curve labeled 1. Then we would use a power that is up the ladder from the no transformation row for
both the x and y
variables.
Suppose that the scatterplot looks like
the curve labeled 2. Then we would use a power that is up the ladder from the no transformation row for
Logistic Regression (Optional)
• Can be used if the dependent variable is categorical with just two possible values
• Used to describe how the probability of “success” changes as a numerical predictor variable, x, changes
• With p denoting the probability of success, the logistic regression equation is
$$p = \frac{e^{a + bx}}{1 + e^{a + bx}}$$
For any value of x, the value of p is always between 0 and 1.
In a study on wolf spiders, researchers were interested in what variables might be related to a female wolf spider’s decision to kill and consume her partner during courtship or mating. Data was collected for 53 pairs of courting wolf spiders. (Data listed on page 287)
x = the difference in body width (female – male)
y = cannibalism; coded 0 for no cannibalism and 1 for cannibalism
Minitab was used to construct a scatterplot and to fit a logistic regression to the data.
$$p = \frac{e^{-3.08904 + 3.06928x}}{1 + e^{-3.08904 + 3.06928x}}$$
Note that the plot was constructed so that if two points fell in the exact same location they would be offset a little bit so that all points would be visible (called jittering).
This equation can be used to predict the probability of the male spider being cannibalized based on the difference in size.
What is the probability of cannibalism if the male & female spiders are the same width (difference of 0)?
$$p = \frac{e^{-3.08904 + 3.06928(0)}}{1 + e^{-3.08904 + 3.06928(0)}} \approx 0.044$$
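The fitted logistic equation can be evaluated for any width difference with a few lines of code. This sketch uses the coefficients -3.08904 and 3.06928 from the spider example; the function name `cannibalism_probability` and the extra x-values tried are illustrative.

```python
import math

def cannibalism_probability(width_difference, a=-3.08904, b=3.06928):
    """Logistic regression: p = e^(a + b*x) / (1 + e^(a + b*x))."""
    linear_part = a + b * width_difference
    return math.exp(linear_part) / (1 + math.exp(linear_part))

# Same width (difference of 0)
p_same = cannibalism_probability(0)
print(round(p_same, 3))

# A few other differences, to see how p changes with x
for diff in [-0.5, 0.5, 1.0]:
    print(diff, round(cannibalism_probability(diff), 3))
```

Whatever value of x is supplied, the result stays strictly between 0 and 1, and with a positive slope it increases as the female becomes larger relative to the male.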