Unit 5 Notes


Academic year: 2019


(1)

Chapter 5

(2)

Suppose we found the age and weight for each person in a sample of 10 adults. Is there any relationship between the age and weight of these adults?

Create a scatterplot of the data below.

Age  24   30   41   28   50   46   49   35   20   39
Wt   256  124  320  185  158  129  103  196  110  130

Do you think there is a relationship? If so, what kind? If not, why not?

[Scatterplot: Weight vs. Age]

There does not appear to be a relationship between age and weight in

(3)

Suppose we found the height and weight for each person in a sample of 10 adults. Is there any relationship between the height and weight of these adults?

Create a scatterplot of the data below.

Ht   74   65   77   72   68   60   62   73   61   64
Wt   256  124  320  185  158  129  103  196  110  130

Do you think there is a relationship? If so, what kind? If not, why not?

Is it positive or negative? Weak or strong?

[Scatterplot: Weight vs. Height]

(4)

Correlation

• The relationship between bivariate numerical variables
– May be positive or negative
– May be weak or strong

What does it mean if the relationship is positive? Negative?

What feature(s) of the graph would indicate a weak or

(5)

Identify the strength and direction of the following data sets.

[Scatterplots: Set A, Set B, Set C, Set D]

Set A shows a strong, positive linear relationship.

Set B shows little or no relationship.

Set C shows a weaker (moderate), negative linear relationship.

Set D shows a strong,

(6)

Identify as having a positive relationship, a negative relationship, or no relationship.

1. Heights of mothers and heights of their adult daughters: +
2. Age of a car in years and its current value: -
3. Weight of a person and calories consumed: +
4. Height of a person and the person's birth month: no
5. Number of hours spent in safety training and the number of accidents that occur: -

(7)

Correlation Coefficient (r)

• A quantitative assessment of the strength and direction of the linear relationship in bivariate, quantitative data
• Pearson's sample correlation is used the most
• Population correlation coefficient: $\rho$ (rho)
• Statistic correlation coefficient: $r$
• Equation:

$$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

What are these

(8)

Example 5.1

For the six primarily undergraduate universities in California with enrollments between 10,000 and 20,000, six-year graduation rates (y) and student-related expenditures per full-time student (x) for 2003 were reported as follows:

Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9

Create a scatterplot and calculate r.

(9)

Example 5.1 Continued

Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9

[Scatterplot: Graduation Rates vs. Expenditures]

r = 0.05

In order to interpret what this number tells us, let's investigate the properties of
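As a check on the value r = 0.05, here is a minimal Python sketch that applies the z-score form of the correlation formula to the Example 5.1 data. The function name `correlation` is an illustrative choice, not something from the notes.

```python
from statistics import mean, stdev

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

def correlation(xs, ys):
    """Pearson's r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

r = correlation(expenditures, grad_rates)
print(round(r, 2))
```

Each term in the sum is the product of the z-score of x and the z-score of y for one university.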

(10)

Properties of r (correlation coefficient)

1) legitimate values are $-1 \le r \le 1$

[Number line from -1 to 1 with marks at -1, -.8, -.5, 0, .5, .8, and 1; regions labeled No Correlation, Moderate Correlation, and Strong correlation]

(11)

2) value of r is not changed by any linear transformation

Suppose that the graduation rates were changed from percents to decimals (divide by 100). Transform the graduation rates and calculate r.

Do the following transformations and calculate r:
1) x' = 5(x + 14)
2) y' = (y + 30) ÷ 4

Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9

r = 0.05
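The transformations above can be tried numerically. This is a small sketch, assuming a hand-written helper named `pearson_r` (not part of the notes). Note that both transformations use positive multipliers; multiplying one variable by a negative number would flip the sign of r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r_original = pearson_r(expenditures, grad_rates)
# Percents changed to decimals
r_decimals = pearson_r(expenditures, [y / 100 for y in grad_rates])
# x' = 5(x + 14) and y' = (y + 30) / 4
r_both = pearson_r([5 * (x + 14) for x in expenditures],
                   [(y + 30) / 4 for y in grad_rates])
print(round(r_original, 4), round(r_decimals, 4), round(r_both, 4))
```

All three printed values should agree up to floating-point rounding.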

(12)

3) value of r does not depend on which of the two variables is labeled x

Suppose we wanted to estimate the expenditures per student for given graduation rates.

Switch x and y, then calculate r.

Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9

r = 0.05

(13)

4) value of r is affected by extreme values.

Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9

Suppose the 33.9 was REALLY 63.9. What do you think would happen to the value of the correlation coefficient?

Plot a revised scatterplot and find r.

[Scatterplots: Graduation Rates vs. Expenditures, original data and data with 33.9 replaced by 63.9]

Extreme values affect the correlation coefficient
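To see how much one changed value moves r, the sketch below recomputes the correlation after replacing 33.9 with 63.9. The helper name `pearson_r` is an assumption made for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]
revised_rates = grad_rates[:-1] + [63.9]  # the last value was "really" 63.9

r_original = pearson_r(expenditures, grad_rates)
r_revised = pearson_r(expenditures, revised_rates)
print(round(r_original, 3), round(r_revised, 3))
```

With only six observations, changing a single graduation rate raises r noticeably.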

(14)

Find the correlation for these points:

x  -3  -1  1  3  5  7   9
y  40  20  8  4  8  20  40

Compute the correlation coefficient. Sketch the scatterplot.

r = 0

[Scatterplot: y vs. x]

Does this mean that there is NO relationship between these

r = 0, but the data set has a definite relationship!

5) value of r is a measure of the extent to which x and y are linearly related
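The r = 0 result for these seven points can be confirmed with a short sketch. The helper `pearson_r` is an illustrative name, and the quadratic check at the end is an observation about this particular data set rather than something stated in the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [-3, -1, 1, 3, 5, 7, 9]
ys = [40, 20, 8, 4, 8, 20, 40]

r = pearson_r(xs, ys)
# The points lie exactly on the parabola y = (x - 3)^2 + 4
on_parabola = all(y == (x - 3) ** 2 + 4 for x, y in zip(xs, ys))
print(r, on_parabola)
```

The points follow a curve perfectly, yet the linear measure r reports no linear association.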

(15)

Recap the Properties of r:

1. legitimate values of r are $-1 \le r \le 1$
2. value of r is not changed by any linear transformation
3. value of r does not depend on which of the two variables is labeled x
4. value of r is affected by extreme values
5. value of r is a measure of the extent to which x and y are linearly related

(16)

Example 5.1 Continued

Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9

[Scatterplot: Graduation Rates vs. Expenditures]

Interpret r = 0.05

In order to interpret r, recall the definition of the correlation coefficient: a quantitative assessment of the strength and direction of the linear relationship between bivariate, quantitative data.

There is a weak, positive, linear relationship between expenditures and

(17)

Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable?

Consider the following examples:

• The relationship between the number of cavities in a child's teeth and the size of his or her vocabulary is strong and positive. (These variables are both strongly related to the age of the child.)
• Consumption of hot chocolate is negatively correlated with crime rate. (Both are responses to cold weather.)

So does this mean I should feed children more candy to increase their vocabulary?

Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.

(18)

Correlation does not imply

causation


(19)

What is the objective of regression analysis?

Suppose that we have two variables:
x = the amount spent on advertising
y = the amount of sales for the product during a given period

What question might I want to answer using this data?

x-variable: is the independent or explanatory variable
y-variable: is the dependent or response variable

• We will use values of x to predict values of y.

The objective of regression analysis is to use information about one variable, x, to draw some sort of a conclusion

(20)

Scatterplots frequently exhibit a linear pattern. When this is the case, it makes sense to summarize the relationship between the variables by finding a line that is as close as possible to the points in the plot.

This is done by calculating the line of best fit or Least Square Regression Line (LSRL).

The LSRL is the line that minimizes the sum of the squares of the deviations from the line.

The LSRL is

$$\hat{y} = a + bx$$

$\hat{y}$ (y-hat) means the predicted y. Be sure to put the hat on the y.

b – is the slope
– it is the approximate amount by which y increases when x increases by 1 unit

a – is the y-intercept
– it is the approximate height of the line when x = 0
– in some situations, the y-intercept has no meaning

The slope of the LSRL is

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

The intercept of the LSRL is

$$a = \bar{y} - b\bar{x}$$

(21)

Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2).

Let's just fit a line to the data by drawing a line through what appears to be the middle of the points.

$\hat{y} = 0.5x + 4$

Now find the vertical distance from each point to the line.

(0,0): $\hat{y} = 0.5(0) + 4 = 4$, so 0 – 4 = -4
(3,10): $\hat{y} = 0.5(3) + 4 = 5.5$, so 10 – 5.5 = 4.5
(6,2): $\hat{y} = 0.5(6) + 4 = 7$, so 2 – 7 = -5

Find the sum of the squares of the vertical distances.

Sum of the squares = 61.25

(22)

Use a calculator to find the line of best fit.

$\hat{y} = 3 + \frac{1}{3}x$

Find the vertical deviations from the line.

(0,0): -3
(3,10): 6
(6,2): -3

What is the sum of the deviations from the line? Will it always be zero?

Find the sum of the squares of the deviations from the line.

Sum of the squares = 54

The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
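The comparison between the eyeballed line and the least-squares line can be reproduced in a few lines of Python. This is a sketch only; the function names are illustrative and the slope and intercept use the usual least-squares formulas.

```python
points = [(0, 0), (3, 10), (6, 2)]

def sum_squared_deviations(a, b, pts):
    """Sum of squared vertical deviations from the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in pts)

def least_squares_line(pts):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(pts)
    x_bar = sum(x for x, _ in pts) / n
    y_bar = sum(y for _, y in pts) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in pts)
         / sum((x - x_bar) ** 2 for x, _ in pts))
    return y_bar - b * x_bar, b

a, b = least_squares_line(points)
eyeballed = sum_squared_deviations(4, 0.5, points)   # line drawn by eye
best = sum_squared_deviations(a, b, points)          # least-squares line
deviation_sum = sum(y - (a + b * x) for x, y in points)
print(a, b, eyeballed, best, deviation_sum)
```

Up to floating-point rounding, the eyeballed line gives 61.25, the least-squares line gives the smaller value 54, and the deviations from the least-squares line add to zero.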

(23)

Researchers are studying pomegranate's antioxidant properties to see if it might be helpful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded for several points in time. (x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume in mm³)

x 11 15 19 23 27

y 150 270 450 580 740

(24)

Pomegranate study continued

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Calculate the LSRL and the correlation

coefficient.

Interpret the slope and the correlation

coefficient in context.

$\hat{y} = -269.75 + 37.25x$

$r = 0.998$

Remember that an interpretation is stating the definition in context.

The average volume of the tumor increases by approximately 37.25 mm³ for each day increase in the number of days after injection.

There is a strong, positive, linear relationship between the average tumor volume and the number of days since injection.
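The slope, intercept, and correlation for the plain-water group can be recomputed from the five data points. The sketch below uses the slope and intercept formulas from earlier in these notes; variable names are illustrative.

```python
from math import sqrt

days = [11, 15, 19, 23, 27]
volumes = [150, 270, 450, 580, 740]

n = len(days)
x_bar = sum(days) / n
y_bar = sum(volumes) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, volumes))
s_xx = sum((x - x_bar) ** 2 for x in days)
s_yy = sum((y - y_bar) ** 2 for y in volumes)

slope = s_xy / s_xx                  # b
intercept = y_bar - slope * x_bar    # a
r = s_xy / sqrt(s_xx * s_yy)
print(intercept, slope, round(r, 3))
```

The computed line should match $\hat{y} = -269.75 + 37.25x$ and the correlation should round to 0.998.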

(25)

Pomegranate study continued

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Predict the average volume of the tumor for 20

days after injection.

Predict the average volume of the tumor for 5

days after injection.

$\hat{y} = -269.75 + 37.25x$

20 days: $\hat{y} = -269.75 + 37.25(20) = 475.25$ mm³

5 days: $\hat{y} = -269.75 + 37.25(5) = -83.5$ mm³

Can volume be negative?

This is the danger of extrapolation. The least-squares line should not be used to make predictions for y using x-values outside the range in the data set.

Why? It is unknown whether the pattern observed in the
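One way to guard against extrapolation is to report, alongside each prediction, whether the x-value lies inside the observed range. The function `predict_volume` below is a hypothetical helper written for this illustration, using the fitted line from the notes.

```python
days = [11, 15, 19, 23, 27]

# Coefficients of the fitted line from the notes: y-hat = -269.75 + 37.25x
INTERCEPT, SLOPE = -269.75, 37.25

def predict_volume(x):
    """Return (predicted volume in mm^3, True if x is within the data range)."""
    within_range = min(days) <= x <= max(days)
    return INTERCEPT + SLOPE * x, within_range

for x in (20, 5):
    volume, ok = predict_volume(x)
    note = "within data range" if ok else "extrapolation, do not trust"
    print(x, volume, note)
```

Day 20 lies between 11 and 27, so the prediction is an interpolation. Day 5 lies outside the data, and the negative volume shows why such predictions should not be used.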

(26)

Pomegranate study continued

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Suppose we want to know how many days after injection of cancer cells the average tumor size would be 500 mm³.

$\hat{y} = -269.75 + 37.25x$

Is this the appropriate regression line to answer this question?

No, the slope of the line for predicting x is $r\frac{s_x}{s_y}$, not $r\frac{s_y}{s_x}$, and the intercepts are almost always different.

Here is the appropriate regression line:

$\hat{x} = 7.277 + 0.027y$

The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.
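The line for predicting x from y can be fitted by swapping the roles of the two variables in the least-squares formulas. This is a minimal sketch with illustrative variable names.

```python
days = [11, 15, 19, 23, 27]
volumes = [150, 270, 450, 580, 740]

n = len(days)
x_bar = sum(days) / n
y_bar = sum(volumes) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, volumes))
s_yy = sum((y - y_bar) ** 2 for y in volumes)

# Regress x on y: volume is now the predictor, days is the response
slope_x_on_y = s_xy / s_yy
intercept_x_on_y = x_bar - slope_x_on_y * y_bar

predicted_days = intercept_x_on_y + slope_x_on_y * 500
print(round(intercept_x_on_y, 3), round(slope_x_on_y, 3), round(predicted_days, 1))
```

Because r is very close to 1 for this data set, the answer differs only slightly from what solving the y-on-x line for x would give. For weaker correlations the two lines can disagree substantially.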

(27)

Pomegranate study continued

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Find the mean of the x-values ($\bar{x}$) and the mean of the y-values ($\bar{y}$).

Plot the point of averages ($\bar{x}$, $\bar{y}$) on the scatterplot.

$\bar{x} = 19$ and $\bar{y} = 438$

Will the point of

(28)

Let’s investigate how the LSRL and

correlation coefficient change when

different points are added to the data set

Suppose we have the following data set.

x  4  5  6  7  8
y  2  5  4  6  9

Sketch a scatterplot. Calculate the LSRL and the correlation coefficient.

$\hat{y} = -3.8 + 1.5x$

$r = 0.916$

(29)

Let’s investigate how the LSRL and

correlation coefficient change when

different points are added to the data set

Suppose we have the following data set.

x  4  5  6  7  8
y  2  5  4  6  9

Original data: $\hat{y} = -3.8 + 1.5x$, $r = 0.916$

Suppose we add the point (5,8) to the data set. What happens to the regression line and the correlation coefficient?

x  4  5  6  7  8  5
y  2  5  4  6  9  8

$\hat{y} = -1.15 + 1.17x$

(30)

Let’s investigate how the LSRL and

correlation coefficient change when

different points are added to the data set

Suppose we have the following data set.

x  4  5  6  7  8
y  2  5  4  6  9

Original data: $\hat{y} = -3.8 + 1.5x$, $r = 0.916$

Suppose we add the point (12,12) to the data set. What happens to the regression line and the correlation coefficient?

x  4  5  6  7  8  12
y  2  5  4  6  9  12

$\hat{y} = -2.24 + 1.225x$

(31)

Let’s investigate how the LSRL and

correlation coefficient change when

different points are added to the data set

Suppose we have the following data set.

x  4  5  6  7  8
y  2  5  4  6  9

Original data: $\hat{y} = -3.8 + 1.5x$, $r = 0.916$

Suppose we add the point (12,0) to the data set. What happens to the regression line and the correlation coefficient?

x  4  5  6  7  8  12
y  2  5  4  6  9  0

$\hat{y} = 6.26 - 0.275x$
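The effect of each added point can be explored by refitting the line for every case. The sketch below assumes a helper named `fit_line_and_r` (an illustrative name) that returns the intercept, slope, and correlation.

```python
from math import sqrt

def fit_line_and_r(points):
    """Return (intercept, slope, r) for the least-squares line through points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope, s_xy / sqrt(s_xx * s_yy)

base = [(4, 2), (5, 5), (6, 4), (7, 6), (8, 9)]
results = {"original": fit_line_and_r(base)}
for extra in [(5, 8), (12, 12), (12, 0)]:
    results[extra] = fit_line_and_r(base + [extra])

for label, (a, b, r) in results.items():
    print(label, round(a, 2), round(b, 3), round(r, 3))
```

Points far from the others in the x direction, such as (12,0), pull the line the most and can even reverse the sign of the slope.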

(32)

The correlation coefficient and the LSRL are both measures that are

(33)

Pomegranate study revisited

x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x 11 15 19 23 27

y 150 270 450 580 740

Minitab, a statistical software package, was used to fit the least-squares regression line. Part of the resulting output is shown below.

The regression equation is

Predicted volume = -269.75 + 37.25 days

Predictor  Coef     SE Coef    T          P
Constant   -269.75  23.421412  -11.51724  0.0014

In the regression equation, -269.75 is the intercept and 37.25 is the slope.

(34)

Assessing the fit of the LSRL

Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y.

Important questions are:

1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
3. If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?

We will look at graphical and numerical methods to

(35)

In a study, researchers were interested in how the

distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36

y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65

Minitab was used to fit the least-squares regression line. From the partial output, identify the regression line.

Predictor           Coef   SE Coef  T      P
Constant            -7.69  13.33    -0.58  0.582
Distance to debris  3.234  1.782    1.82   0.112

S = 8.67071   R-Sq = 32.0%   R-Sq(adj) = 22.3%

$\hat{y} = -7.69 + 3.234x$

(36)

In a study, researchers were interested in how the

distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36

y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65

[Scatterplot: Distance traveled vs. Distance to debris, with the LSRL]

The vertical deviation between the point and the LSRL is called the residual.

If the point is above the line, the residual will be positive. If the point is below the line, the residual will be negative.

Residuals are calculated by subtracting the predicted y from the observed y:

residual $= y - \hat{y}$

(37)

In a study, researchers were interested in how the

distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Use the LSRL to calculate the predicted distance traveled. Subtract to find the residuals.

Distance from debris (x)  Distance traveled (y)  Predicted distance traveled ($\hat{y}$)  Residual ($y - \hat{y}$)
6.94                      0.00                   14.76                                    -14.76
5.23                      6.13                   9.23                                     -3.10
5.21                      11.29                  9.16                                     2.13
7.10                      14.35                  15.28                                    -0.93
8.16                      12.03                  18.70                                    -6.67
5.50                      22.72                  10.10                                    12.62
9.19                      20.11                  22.04                                    -1.93
9.05                      26.16                  21.58                                    4.58
9.36                      30.65                  22.59                                    8.06

What does the sum of the residuals equal? Will the sum of the residuals always equal zero?
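The residual table can be regenerated, and the sum of the residuals checked, with the sketch below. It refits the line from the raw data rather than using the rounded Minitab coefficients, so individual values may differ from the table in the last digit. Variable names are illustrative.

```python
debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
traveled = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(debris)
x_bar = sum(debris) / n
y_bar = sum(traveled) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(debris, traveled))
         / sum((x - x_bar) ** 2 for x in debris))
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * x for x in debris]
residuals = [y - y_hat for y, y_hat in zip(traveled, predicted)]

for x, y, y_hat, e in zip(debris, traveled, predicted, residuals):
    print(x, y, round(y_hat, 2), round(e, 2))
print("sum of residuals:", round(sum(residuals), 10))
```

For a least-squares line with an intercept, the residuals sum to zero apart from floating-point rounding.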

(38)

Residual plots

• Is a scatterplot of the (x, residual) pairs.
• Residuals can also be graphed against the predicted y-values
• The purpose is to determine if a linear model is the best way to describe the relationship between the x & y variables
• If no pattern exists between the points

(39)

[Two residual plots: Residuals vs. x]

This residual plot shows no pattern so it indicates that the linear model is appropriate.

This residual plot shows a curved pattern so it indicates that the linear model is not

(40)

In a study, researchers were interested in how the

distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Distance from debris (x)  Distance traveled (y)  Predicted distance traveled ($\hat{y}$)  Residual ($y - \hat{y}$)
6.94                      0.00                   14.76                                    -14.76
5.23                      6.13                   9.23                                     -3.10
5.21                      11.29                  9.16                                     2.13
7.10                      14.35                  15.28                                    -0.93
8.16                      12.03                  18.70                                    -6.67
5.50                      22.72                  10.10                                    12.62
9.19                      20.11                  22.04                                    -1.93
9.05                      26.16                  21.58                                    4.58
9.36                      30.65                  22.59                                    8.06

Use the values in this table to create a residual plot for this data set. Is a linear model appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food?

(41)

[Residual plot: Residuals vs. Distance from debris]

Since the residual plot displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food.

Now plot the residuals against the predicted

(42)

[Residual plot: Residuals vs. Predicted Distance traveled]

[Residual plot: Residuals vs. Distance from debris]

What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values?

Residual plots can be plotted against either the x-values or the

(43)

Let’s examine the following data set:

The following data is for 12 black bears from

the Boreal Forest.

x = age (in years) and y = weight (in kg)

Sketch a scatterplot with the fitted regression line.

x  10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y  54    40   62    51    55   56   62   42   40   59    51   50

Do you notice anything unusual about this data set?

[Two scatterplots: Weight vs. Age]

Influential observation

What would happen to the regression line if this point is removed?

This point is considered an influential point because it affects the placement of the least-squares regression line.
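To see the influence numerically, the sketch below fits the line with and without one observation. It assumes the influential point is the bear aged 28.5 years, which is the only x-value far from the rest; the notes identify the point only on the plot. The helper `fit_line` is an illustrative name.

```python
def fit_line(points):
    """Return (intercept, slope) of the least-squares line through points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x, _ in points))
    return y_bar - slope * x_bar, slope

ages = [10.5, 6.5, 28.5, 10.5, 6.5, 7.5, 6.5, 5.5, 7.5, 11.5, 9.5, 5.5]
weights = [54, 40, 62, 51, 55, 56, 62, 42, 40, 59, 51, 50]
bears = list(zip(ages, weights))

# Assumption: the influential observation is the bear aged 28.5 years
without_old_bear = [(x, y) for x, y in bears if x != 28.5]

a_all, b_all = fit_line(bears)
a_without, b_without = fit_line(without_old_bear)
print(round(a_all, 2), round(b_all, 3))
print(round(a_without, 2), round(b_without, 3))
```

Removing that single bear roughly doubles the slope, which is what makes the observation influential.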

(44)

Let’s examine the following data set:

The following data is for 12 black bears from

the Boreal Forest.

x = age (in years) and y = weight (in kg)

x 10.5 6.5 28.5 10.5 6.5 7.5 6.5 5.5 7.5 11.5 9.5 5.5

Y 54 40 62 51 55 56 62 42 40 59 51 50

[Scatterplot: Weight vs. Age]

Notice that this observation has a large residual.

An observation is an outlier if it has

(45)

Coefficient of determination

• Denoted by $r^2$
• gives the proportion of variation in y that can be attributed to an

(46)

Let's explore the meaning of r² by revisiting the deer mouse data set.

x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

Suppose you didn't know any x-values. What distance would you expect deer mice to travel?

$\bar{y} = 15.938$

What is the total amount of variation in the distance traveled (y-values)? Hint: Find the sum of the squared deviations.

$SSTo = \sum (y - \bar{y})^2$

SS stands for "sum of squares".

Total amount of variation in the distance traveled is 773.95 m².

Why do we square the deviations?

[Scatterplot: Distance traveled vs. Distance to Debris]

(47)

Now suppose you DO know the x-values. Your best guess would be the predicted distance traveled (the point on the LSRL).

x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

By how much do the observed points vary from the LSRL? Hint: Find the sum of the residuals squared.

$SSResid = \sum (y - \hat{y})^2$

The points vary from the LSRL by 526.27 m².

[Scatterplot: Distance traveled vs. Distance to debris, with the LSRL]

(48)

x = the distance from the food to the nearest pile of fine woody debris

y = distance a deer mouse will travel for food

x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36

y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65

The points vary from the LSRL by 526.27 m².

Total amount of variation in the distance traveled is 773.95 m².

Approximately what percent of the variation in distance traveled can be explained by the regression line?

$$r^2 = 1 - \frac{SSResid}{SSTo} = 1 - \frac{526.27}{773.95} = 0.320$$

Or 32.0%
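The two sums of squares and the resulting r² can be computed directly from the deer mouse data. This is a sketch with illustrative variable names; it refits the line from the raw data, so the results should agree with the values above up to rounding.

```python
debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
traveled = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(debris)
x_bar = sum(debris) / n
y_bar = sum(traveled) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(debris, traveled))
         / sum((x - x_bar) ** 2 for x in debris))
intercept = y_bar - slope * x_bar

# Total variation: squared deviations from the mean of y
ss_total = sum((y - y_bar) ** 2 for y in traveled)
# Unexplained variation: squared residuals from the LSRL
ss_resid = sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(debris, traveled))
r_squared = 1 - ss_resid / ss_total
print(round(ss_total, 2), round(ss_resid, 2), round(r_squared, 3))
```

SSTo measures the error when guessing the mean for every mouse; SSResid measures the error left over after using the line, so their ratio is the unexplained share.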

(49)

Partial output from the regression analysis of deer mouse data:

Predictor           Coef   SE Coef  T      P
Constant            -7.69  13.33    -0.58  0.582
Distance to debris  3.234  1.782    1.82   0.112

S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

Let's review the values from this output and their meanings.

The y-intercept (a): This value has no meaning in context since it doesn't make sense to have a negative distance.

The slope (b): The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter in the distance to the nearest debris pile.

The coefficient of determination (r²): Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.

The standard deviation (s): What does this number represent? This is the typical amount by which an observation deviates from the least squares regression line. It's found by:

$$s_e = \sqrt{\frac{SSResid}{n - 2}}$$

(50)

Let’s examine this data set:

x = representative age

y = average marathon finish time

Create a scatterplot for this data set.

Age 15 25 35 45 55 65

Time 302.38 193.63 185.46 198.49 224.30 288.71

[Scatterplot: Average Finish Time vs. Representative Age]

Because of the curved pattern, a straight line would not accurately describe the relationship between average finish time and age.

Since this curve resembles a parabola, a quadratic function can be used to describe this relationship.

$$\hat{y} = a + b_1 x + b_2 x^2$$

Using Minitab, the least-squares quadratic regression is

$$\hat{y} = 462 - 14.2x + 0.179x^2$$

This curve minimizes the sum of the squares of the residuals (similar to least-squares linear regression).
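The quadratic fit can be reproduced without statistical software by solving the three normal equations of least squares. The sketch below is one possible approach in plain Python, not the method Minitab uses internally; `solve_3x3` is a hypothetical helper written for this example.

```python
ages = [15, 25, 35, 45, 55, 65]
times = [302.38, 193.63, 185.46, 198.49, 224.30, 288.71]

def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    rows = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, 3):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [rv - factor * cv for rv, cv in zip(rows[r], rows[col])]
    solution = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        known = sum(rows[i][j] * solution[j] for j in range(i + 1, 3))
        solution[i] = (rows[i][3] - known) / rows[i][i]
    return solution

# Normal equations for y-hat = a + b1*x + b2*x^2
power_sums = [sum(x ** k for x in ages) for k in range(5)]
matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
rhs = [sum((x ** i) * y for x, y in zip(ages, times)) for i in range(3)]
a, b1, b2 = solve_3x3(matrix, rhs)

y_bar = sum(times) / len(times)
ss_total = sum((y - y_bar) ** 2 for y in times)
ss_resid = sum((y - (a + b1 * x + b2 * x * x)) ** 2 for x, y in zip(ages, times))
big_r_squared = 1 - ss_resid / ss_total
print(round(a, 1), round(b1, 2), round(b2, 3), round(big_r_squared, 3))
```

The coefficients should agree with the Minitab equation after rounding. In practice a statistics package or numerical library would normally be used instead of hand-coded elimination.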

(51)

Let’s examine this data set:

x = representative age

y = average marathon finish time

Age 15 25 35 45 55 65

Time 302.38 193.63 185.46 198.49 224.30 288.71

[Scatterplot: Average Finish Time vs. Representative Age, with the quadratic regression curve]

Notice the residuals from the quadratic regression.

Here is the residual plot:

[Residual plot: Residuals vs. Age]

Since there is no pattern in the residual plot, the quadratic

(52)

Let’s examine this data set:

x = representative age

y = average marathon finish time

Age 15 25 35 45 55 65

Time 302.38 193.63 185.46 198.49 224.30 288.71

[Scatterplot: Average Finish Time vs. Representative Age, with the quadratic regression curve]

The measure R² is useful for assessing the fit of the quadratic regression.

$$R^2 = 1 - \frac{SSResid}{SSTo}$$

R² = .921

92.1% of the variation in average marathon finish times can be explained by the approximate quadratic relationship between

(53)

Depending on the data set, other regression

models, such as cubic regression, may be used.

Statistical software (like Minitab) is commonly

used to calculate these regression models.

Another method for fitting regression models to non-linear data sets is to transform the data, making it linear.

(54)

Commonly Used Transformations

Transformation                              Equation
No transformation                           $\hat{y} = a + bx$
Square root of x                            $\hat{y} = a + b\sqrt{x}$
Log of x *                                  $\hat{y} = a + b\log_{10}(x)$
Reciprocal of x                             $\hat{y} = a + b\left(\frac{1}{x}\right)$
Log of y * (exponential growth or decay)    $\log_{10}(\hat{y}) = a + bx$

(55)

Pomegranate study revisited:

x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

Sketch a scatterplot for this data set.

x 11 15 19 23 27 31 35 39

y 40 75 90 210 230 330 450 600

[Scatterplot: Average tumor volume vs. Number of days]

There appears to be a curve in the data points. Let's use a transformation to linearize the data.

Since the data appears to be exponential growth,

(56)

Pomegranate study revisited:

x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

Sketch a scatterplot of log(y) and x.

x       11    15    19    23    27    31    35    39
Log(y)  1.60  1.88  1.95  2.32  2.36  2.52  2.65  2.78

[Scatterplot: Log of Average tumor volume vs. Number of days]

Notice that the relationship now appears linear. Let's fit an LSRL to the transformed data.

The LSRL is

(57)

Pomegranate study revisited:

x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

Sketch a scatterplot of log(y) and x.

x       11    15    19    23    27    31    35    39
Log(y)  1.60  1.88  1.95  2.32  2.36  2.52  2.65  2.78

[Scatterplot: Log of Average tumor volume vs. Number of days, with the LSRL]

The LSRL is

$\log \hat{y} = 1.226 + 0.041x$

What would the predicted average tumor size be 30 days after injection of cancer cells?

$\log \hat{y} = 1.226 + 0.041(30)$

$\log \hat{y} = 2.456$

$\hat{y} = 10^{2.456} = 285.76$ mm³
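The log transformation, the fit, and the back-transformed prediction can be sketched in Python as follows. The code takes base-10 logs of the original volumes instead of the two-decimal values in the table, so its coefficients and prediction differ slightly from the rounded figures above. Variable names are illustrative.

```python
from math import log10

days = [11, 15, 19, 23, 27, 31, 35, 39]
volumes = [40, 75, 90, 210, 230, 330, 450, 600]

# Transform y, then fit an ordinary least-squares line to (x, log10(y))
log_volumes = [log10(v) for v in volumes]
n = len(days)
x_bar = sum(days) / n
log_y_bar = sum(log_volumes) / n
slope = (sum((x - x_bar) * (ly - log_y_bar) for x, ly in zip(days, log_volumes))
         / sum((x - x_bar) ** 2 for x in days))
intercept = log_y_bar - slope * x_bar

# Predict on the log scale, then undo the transformation with 10 ** value
predicted_log = intercept + slope * 30
predicted_volume = 10 ** predicted_log
print(round(intercept, 3), round(slope, 3), round(predicted_volume, 1))
```

The back-transformation step matters: the line predicts log volume, so the prediction must be raised as a power of 10 to return to mm³.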

(58)

Another useful transformation is the power

transformation. The power transformation ladder and the scatterplot (both below) can be used to help

determine what type of transformation is appropriate.

Power Transformation Ladder

Power  Transformed Value          Name
3      (Original value)³          Cube
2      (Original value)²          Square
1      (Original value)           No transformation
½      $\sqrt{\text{Original value}}$     Square root
1/3    $\sqrt[3]{\text{Original value}}$  Cube root
0      Log(Original value)        Logarithm
-1     $\frac{1}{\text{Original value}}$  Reciprocal

Suppose that the scatterplot looks like

the curve labeled 1. Then we would use a power that is up the ladder from the no transformation row for

both the x and y

variables.

Suppose that the scatterplot looks like

the curve labeled 2. Then we would use a power that is up the ladder from the no transformation row for

(59)

Logistic Regression (Optional)

• Can be used if the dependent variable is categorical with just two possible values
• Used to describe how the probability of "success" changes as a numerical predictor variable, x, changes
• With p denoting the probability of success, the logistic regression equation is

$$p = \frac{e^{a + bx}}{1 + e^{a + bx}}$$

For any value of x, the value of p is always between 0 and 1.

(60)

In a study on wolf spiders, researchers were interested in what variables might be related to a female wolf

spider’s decision to kill and consume her partner during courtship or mating. Data was collected for 53 pairs of courting wolf spiders. (Data listed on page 287)

x = the difference in body width (female – male)

y = cannibalism; coded 0 for no cannibalism and 1 for cannibalism

Minitab was used to construct a scatterplot and to fit a logistic regression to the data.

$$p = \frac{e^{-3.08904 + 3.06928x}}{1 + e^{-3.08904 + 3.06928x}}$$

Note that the plot was constructed so that if two points fell in the exact same location they would be offset a little bit so that all points would be visible (called jittering).

This equation can be used to predict the probability of the male spider being cannibalized based on the difference in size.

What is the probability of cannibalism if the male & female spiders are the same width (difference of 0)?

$$p = \frac{e^{-3.08904 + 3.06928(0)}}{1 + e^{-3.08904 + 3.06928(0)}} = 0.044$$
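The fitted logistic equation can be evaluated for any width difference with a short function. This sketch assumes the intercept is -3.08904 and the slope is 3.06928, which is the sign pattern consistent with the stated probability of 0.044 at a difference of 0. The function name is illustrative.

```python
from math import exp

# Assumed coefficients of the fitted logistic regression: a (intercept), b (slope)
A, B = -3.08904, 3.06928

def cannibalism_probability(width_difference):
    """Estimated probability of cannibalism for a given female - male width difference."""
    z = A + B * width_difference
    return exp(z) / (1 + exp(z))

p_same_width = cannibalism_probability(0)
print(round(p_same_width, 3))
for diff in (-0.5, 0.5, 1.0):  # illustrative differences, not values from the study
    print(diff, round(cannibalism_probability(diff), 3))
```

Whatever difference is supplied, the result stays between 0 and 1, and it rises as the female becomes larger relative to the male.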
