Predictive Analytics Using Regression

(1)

USING REGRESSION

Sumeet Gupta

Associate Professor

(2)

•  Basic Concepts

•  Applications of Predictive Modeling

•  Linear Regression in One Variable using OLS

•  Multiple Linear Regression

•  Assumptions in Regression

•  Explanatory vs. Predictive Modeling

•  Performance Evaluation of Predictive Models

•  Practical Exercises

•  Case: Nils Baker

(3)

(4)

Predictive Modeling: Applications

•  Predicting customer activity on credit cards from their demographic and historical activity patterns

•  Predicting the time to failure of equipment based on utilization and environmental conditions

•  Predicting expenditures on vacation travel based on historical frequent flyer data

•  Predicting staffing requirements at help desks based on historical data and product and sales information

•  Predicting sales from cross-selling of products from historical information

(5)

Basic Concept: Relationships

Examples of relationships:

•  Sales and earnings

•  Cost and number produced

•  Microsoft and the stock market

•  Effort and results

•  Scatterplot

•  A picture to explore the relationship in bivariate data

•  Correlation r

•  Measures the strength and direction of the linear relationship (from –1 to 1)

•  Regression

(6)

Basic Concept: Correlation

•  r = 1

•  A perfect straight line tilting up to the right

•  r = 0

•  No overall tilt

•  No relationship?

•  r = –1

•  A perfect straight line tilting down to the right

[Figure: example scatterplots of Y against X]
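The three reference cases for r can be checked numerically. The sketch below is illustrative only: the data points are made up, and `pearson_r` is a hypothetical helper name rather than anything from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]                           # made-up X values
r_up = pearson_r(x, [2 * v + 1 for v in x])     # perfect line tilting up
r_down = pearson_r(x, [-3 * v + 4 for v in x])  # perfect line tilting down
r_curve = pearson_r(x, [v ** 2 for v in x])     # exact curve with no overall tilt

print(r_up, r_down, r_curve)
```

The last case is why "No relationship?" carries a question mark: Y is determined exactly by X, yet r is zero because the relationship has no linear tilt.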

(7)

Basic Concepts: Simple Linear Model

•  Linear Model for the Population

•  The foundation for statistical inference in regression

•  Observed Y is a straight line, plus randomness

Y = α + βX + ε

•  α + βX: population relationship, on average

•  ε: randomness of individuals

[Figure: plot of Y against X]

(8)

Basic Concepts: Simple Linear Model

•  Time Spent vs. Internet Pages Viewed

•  Two measures of the abilities of 25 Internet sites

•  At the top right are eBay, Yahoo!, and MSN

•  Correlation is r = 0.964

•  Very strong positive association (since r is close to 1)

•  Linear relationship

•  Straight line with scatter

•  Increasing relationship

•  Tilts up and to the right

[Figure: scatterplot of minutes per person (0 to 90) against pages per person (0 to 200), with eBay, Yahoo!, and MSN labeled]

(9)

Basic Concepts: Simple Linear Model

•  Dollars vs. Deals

•  For mergers and acquisitions by investment bankers

•  244 deals worth $756 billion by Goldman Sachs

•  Correlation is r = 0.419

•  Positive association

•  Linear relationship

•  Straight line with scatter

•  Increasing relationship

•  Tilts up and to the right

[Figure: scatterplot of dollars (billions, $0 to $1,000) against deals (0 to 400)]

(10)

Basic Concepts: Simple Linear Model

•  Interest Rate vs. Loan Fee

•  For mortgages

•  If the interest rate is lower, does the bank make it up with a higher loan fee?

•  Correlation is r = –0.890

•  Strong negative association

•  Linear relationship

•  Straight line with scatter

•  Decreasing relationship

•  Tilts down and to the right

[Figure: scatterplot of interest rate (5.0% to 6.0%) against loan fee (0% to 4%)]

(11)

Basic Concepts: Simple Linear Model

•  Today’s vs. Yesterday’s Percent Change

•  Is there momentum?

•  If the market was up yesterday, is it more likely to be up today? Or is each day’s performance independent?

•  Correlation is r = 0.11

•  A weak relationship?

•  No relationship?

•  Tilt is neither up nor down

[Figure: scatterplot of today's change against yesterday's change, both axes from -3% to 3%]

(12)

•  Call Price vs. Strike Price

•  For stock options

•  “Call Price” is the price of the option contract to buy stock at the “Strike Price”

•  The right to buy at a lower strike price has more value

•  A nonlinear relationship

•  Not a straight line: a curved relationship

•  Correlation r = –0.895

•  A negative relationship: higher strike price goes with lower call price

[Figure: scatterplot of call price ($0 to $100) against strike price ($450 to $650)]

(13)

Basic Concepts: Simple Linear Model

•  Output Yield vs. Temperature

•  For an industrial process

•  With a “best” optimal temperature setting

•  A nonlinear relationship

•  Not a straight line: a curved relationship

•  Correlation r = –0.0155

•  r suggests no relationship

•  But relationship is strong

•  It tilts neither up nor down

[Figure: scatterplot of yield of process (120 to 160) against temperature (500 to 900)]

(14)

Basic Concepts: Simple Linear Model

•  Circuit Miles vs. Investment (lower left)

•  For telecommunications firms

•  A relationship with unequal variability

•  More vertical variation at the right than at the left

•  Variability is stabilized by taking logarithms (lower right)

•  Correlation r = 0.820

[Figure: two scatterplots. Lower left: circuit miles (millions) against investment ($millions), both axes 0 to 2,000. Lower right: log of miles against log of investment, both axes 15 to 20, r = 0.957]

(15)

Basic Concepts: Simple Linear Model

•  Price vs. Coupon Payment

•  For trading in the bond market

•  Bonds paying a higher coupon generally cost more

•  Two clusters are visible

•  Ordinary bonds (value is from coupon)

•  Inflation-indexed bonds (payout rises with inflation)

•  Correlation r = 0.950 for all bonds

•  Correlation r = 0.994 for ordinary bonds only

[Figure: scatterplots of bid price ($100 to $150) against coupon rate (0% to 10%)]

(16)

Basic Concepts: Simple Linear Model

•  Cost vs. Number Produced

•  For a production facility

•  It usually costs more to produce more

•  An outlier is visible

•  A disaster (a fire at the factory)

•  High cost, but few produced

[Figure: two scatterplots of cost against number produced. With the outlier included, r = –0.623. Outlier removed (more details visible), r = 0.869]

(17)

Basic Concepts: OLS Modeling

•  Salary vs. Years Experience

•  For n = 6 employees

•  Linear (straight line) relationship

•  Increasing relationship

•  higher salary generally goes with higher experience

•  Correlation r = 0.8667

Experience (years):    15  10  20   5  15   5

Salary ($ thousand):   30  35  55  22  40  27

[Figure: scatterplot of salary ($ thousand, 20 to 60) against experience (0 to 20)]

(18)

Basic Concepts: OLS Modeling

•  Summarizes bivariate data: Predicts Y from X

•  with smallest errors (in vertical direction, for Y axis)

•  Intercept is 15.32 salary (at 0 years of experience)

•  Slope is 1.673 salary (for each additional year of experience, on average)

[Figure: plot of salary (Y) against experience (X)]

(19)

Basic Concepts: OLS Modeling

•  Predicted Value comes from Least-Squares Line

•  For example, Mary (with 20 years of experience) has predicted salary 15.32 + 1.673(20) = 48.8

•  So does anyone with 20 years of experience

•  Residual is actual Y minus predicted Y

•  Mary’s residual is 55 – 48.8 = 6.2

•  She earns about $6,200 more than the predicted salary for a person with 20 years of experience
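The intercept, slope, and Mary's residual can be reproduced from the six (experience, salary) pairs in the salary example. This is a minimal sketch using the textbook least-squares formulas in plain Python; the variable names are illustrative and not from the slides.

```python
experience = [15, 10, 20, 5, 15, 5]   # years, from the salary example
salary = [30, 35, 55, 22, 40, 27]     # $ thousand

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n

# Sums of squares and cross-products about the means
sxx = sum((x - mean_x) ** 2 for x in experience)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))

slope = sxy / sxx                      # b
intercept = mean_y - slope * mean_x    # a

def predict(years):
    """Predicted salary ($ thousand) from the least-squares line."""
    return intercept + slope * years

mary_predicted = predict(20)
mary_residual = 55 - mary_predicted    # actual minus predicted

print(round(intercept, 2), round(slope, 3))               # 15.32 1.673
print(round(mary_predicted, 1), round(mary_residual, 1))  # 48.8 6.2
```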

(20)

Basic Concepts: OLS Modeling

[Figure: plot of salary (10 to 60) against experience (0 to 20), with Mary's point and predicted value marked]

Mary earns 55 thousand

Mary’s predicted value is 48.8

(21)

Basic Concepts: OLS Modeling

•  Standard Error of Estimate

•  Approximate size of prediction errors (residuals)

•  Actual Y minus predicted Y: Y – [a + bX]

$$S_e = S_Y \sqrt{(1 - r^2)\,\frac{n - 1}{n - 2}}$$

•  Example (Salary vs. Experience)

$$S_e = 11.686 \sqrt{(1 - 0.8667^2)\,\frac{6 - 1}{6 - 2}} = 6.52$$

•  Predicted salaries are about 6.52 (i.e., $6,520) away from actual salaries
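As a check on the standard error of estimate, the sketch below computes S_e for the salary data in two ways: from S_Y and r, and directly from the residuals with n – 2 degrees of freedom. The two routes are algebraically identical. Variable names are illustrative.

```python
import math

experience = [15, 10, 20, 5, 15, 5]   # years
salary = [30, 35, 55, 22, 40, 27]     # $ thousand
n = len(salary)

mean_x = sum(experience) / n
mean_y = sum(salary) / n
sxx = sum((x - mean_x) ** 2 for x in experience)
syy = sum((y - mean_y) ** 2 for y in salary)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))

r = sxy / math.sqrt(sxx * syy)        # correlation, about 0.8667
s_y = math.sqrt(syy / (n - 1))        # standard deviation of Y, about 11.686

# Route 1: shortcut formula based on S_Y and r
s_e_formula = s_y * math.sqrt((1 - r ** 2) * (n - 1) / (n - 2))

# Route 2: root of the residual sum of squares divided by n - 2
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(experience, salary))
s_e_direct = math.sqrt(sse / (n - 2))

print(round(s_e_formula, 2), round(s_e_direct, 2))  # 6.52 6.52
```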

(22)

Basic Concepts: OLS Modeling

•  Interpretation: similar to standard deviation

•  Can move Least-Squares Line up and down by S_e

•  About 68% of the data are within one “standard error of estimate” of the least-squares line

•  (For a bivariate normal distribution)

[Figure: plot of salary (20 to 60) against experience (0 to 20)]

(23)

Multiple Linear Regression

•  Linear Model for the Population

Y = (α + β₁X₁ + β₂X₂ + … + β_kX_k) + ε

= (Population relationship) + Randomness

•  Where ε has a normal distribution with mean 0 and constant standard deviation σ, and this randomness is independent from one case to another

(24)

Multiple Linear Regression: Results

•  Intercept: a

•  Predicted value for Y when every X is 0

•  Regression Coefficients: b₁, b₂, …, b_k

•  The effect of each X on Y, holding all other X variables constant

•  Prediction Equation or Regression Equation

(Predicted Y) = a + b₁X₁ + b₂X₂ + … + b_kX_k

•  The predicted Y, given the values for all X variables

•  Prediction Errors or Residuals

(25)

Multiple Linear Regression: Results

•  t Tests for Individual Regression Coefficients

•  Significant or not significant, for each X variable

•  Tests whether a particular X variable has an effect on Y, holding the other X variables constant

•  Should be performed only if the F test is significant

•  Standard Errors of the Regression Coefficients $S_{b_1}, S_{b_2}, \ldots, S_{b_k}$ (with n–k–1 degrees of freedom)

•  Indicates the estimated sampling standard deviation of each regression coefficient

•  Used in the usual way to find confidence intervals and hypothesis tests for individual regression coefficients

(26)

Multiple Linear Regression: Results

•  Predicted Page Costs for Audubon

= a + b₁X₁ + b₂X₂ + b₃X₃

= $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income)

= $4,043 + 3.79(1,645) – 124(51.1) + 0.903(38,787)

= $38,966

•  Actual Page Costs are $25,315

•  Residual is $25,315 – $38,966 = –$13,651

•  Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics (Audience, Percent Male, and Median Income)
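The Audubon calculation is a direct application of the prediction equation. A short sketch, using only the coefficients and values quoted on this slide (the dictionary layout is an illustrative choice):

```python
# Coefficients of the fitted equation, as quoted on the slide
intercept = 4043.0
coefficients = {"Audience": 3.79, "Percent Male": -124.0, "Median Income": 0.903}

# Audubon's values for the three X variables
audubon = {"Audience": 1645, "Percent Male": 51.1, "Median Income": 38787}

predicted = intercept + sum(coefficients[name] * audubon[name] for name in coefficients)
actual = 25315
residual = actual - predicted          # actual minus predicted

print(round(predicted), round(residual))  # 38966 -13651
```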

(27)

Standard Error

•  Standard Error of Estimate S_e

•  Indicates the approximate size of the prediction errors

•  About how far are the Y values from their predictions?

•  For the magazine data

•  S_e = S = $21,578

•  Actual Page Costs are about $21,578 from their predictions for this

group of magazines (using regression)

•  Compare to S_Y = $45,446: Actual Page Costs are about $45,446 from

their average (not using regression)

•  Using the regression equation to predict Page Costs (instead of simply

(28)

Coeff. of Determination

The strength of association is measured by the square of the multiple correlation coefficient, R², which is also called the coefficient of multiple determination.

$$R^2 = \frac{SS_{reg}}{SS_y}$$

R² is adjusted for the number of independent variables and the sample size by using the following formula:

$$\text{Adjusted } R^2 = R^2 - \frac{k(1 - R^2)}{n - k - 1}$$
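To see the size of the adjustment, the sketch below applies the formula to the magazine figures quoted elsewhere in these slides (R² = 0.787, n = 55, k = 3). It also evaluates the algebraically equivalent form 1 – (1 – R²)(n – 1)/(n – k – 1), which is a standard identity rather than something stated on the slide. Function names are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 in the form R^2 - k(1 - R^2)/(n - k - 1)."""
    return r2 - k * (1 - r2) / (n - k - 1)

def adjusted_r2_alt(r2, n, k):
    """Equivalent form: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.787, n=55, k=3)
adj_alt = adjusted_r2_alt(0.787, n=55, k=3)

print(round(adj, 3), round(adj_alt, 3))  # 0.774 0.774
```

With 55 cases and only 3 predictors the penalty is small; it grows as k approaches n.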

(29)

Coeff. of Determination

•  Coefficient of Determination R²

•  Indicates the percentage of the variation in Y that is explained by (or attributed to) all of the X variables

•  How well do the X variables explain Y?

•  For the magazine data

•  R² = 0.787 = 78.7%

•  The X variables (Audience, Percent Male, and Median Income) taken together explain 78.7% of the variance of Page Costs

•  This leaves 100% – 78.7% = 21.3% of the variation in Page Costs

(30)

The F test

•  Is the regression significant?

•  Do the X variables, taken together, explain a significant amount of the variation in Y?

•  The null hypothesis claims that, in the population, the X variables do not help explain Y; all coefficients are 0

H₀: β₁ = β₂ = … = β_k = 0

•  The research hypothesis claims that, in the population, at least one of the X variables does help explain Y

(31)

The F test

H₀: R²_pop = 0

This is equivalent to the following null hypothesis:

H₀: β₁ = β₂ = β₃ = . . . = β_k = 0

The overall test can be conducted by using an F statistic:

$$F = \frac{SS_{reg}/k}{SS_{res}/(n - k - 1)} = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}$$
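The two forms of the F statistic can be compared numerically. The sketch below uses the rounded magazine figures from these slides (R² = 0.787, n = 55, k = 3); the sums of squares are hypothetical values chosen only so that SS_reg / SS_y equals that R².

```python
def f_from_r2(r2, n, k):
    """F statistic from R^2, sample size n, and number of X variables k."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

def f_from_sums_of_squares(ss_reg, ss_res, n, k):
    """F statistic from the regression and residual sums of squares."""
    return (ss_reg / k) / (ss_res / (n - k - 1))

n, k = 55, 3
f_r2 = f_from_r2(0.787, n, k)

# Hypothetical sums of squares with SS_y = 1000, so R^2 = 787 / 1000
f_ss = f_from_sums_of_squares(ss_reg=787.0, ss_res=213.0, n=n, k=k)

print(round(f_r2, 2), round(f_ss, 2))
```

The result is about 62.8. The slides report 62.84, and the small gap is presumably because R² is rounded to three decimals here.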

(32)

Performing the F test

•  Three equivalent methods for performing the F test; they always give the same result

•  Use the p-value

•  If p < 0.05, then the test is significant

•  Same interpretation as p-values in Chapter 10

•  Use the R2 value

•  If R2 is larger than the value in the R2 table, then the result is significant

•  Do the X variables explain more than just randomness?

•  Use the F statistic

•  If the F statistic is larger than the value in the F table, then the result is significant

(33)

Example: F test

•  For the magazine data, the X variables (Audience, Percent Male, and Median Income) explain a very highly significant percentage of the variation in Page Costs

•  The p-value, listed as 0.000, is less than 0.0005, and is therefore very highly significant (since it is less than 0.001)

•  The R² value, 78.7%, is greater than 27.1% (from the R² table at level 0.1% with n = 55 and k = 3), and is therefore very highly significant

•  The F statistic, 62.84, is greater than the value (between 7.054 and 6.171) from the F table at level 0.1%, and is therefore very highly significant

(34)

t Tests

•  A t test for each regression coefficient

•  To be used only if the F test is significant

•  If F is not significant, you should not look at the t tests

•  Does the jth X variable have a significant effect on Y, holding the other X variables constant?

•  Hypotheses are

H₀: β_j = 0, H₁: β_j ≠ 0

•  Test using the confidence interval

$$b_j \pm t\,S_{b_j}$$

•  use the t table with n – k – 1 degrees of freedom

•  Or use the t statistic

$$t_{\text{statistic}} = b_j / S_{b_j}$$

•  compare to the t table value with n – k – 1 degrees of freedom

(35)

Example: t Tests

•  Testing b₁, the coefficient for Audience

b₁ = 3.79, t = 13.5, p = 0.000

•  Audience has a very highly significant effect on Page Costs, after adjusting for Percent Male and Median Income

•  Testing b₂, the coefficient for Percent Male

b₂ = –124, t = –0.90, p = 0.374

•  Percent Male does not have a significant effect on Page Costs, after adjusting for Audience and Median Income

•  Testing b₃, the coefficient for Median Income

b₃ = 0.903, t = 2.44, p = 0.018

•  Median Income has a significant effect on Page Costs, after adjusting for Audience and Percent Male
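The t statistic and the confidence interval are two views of the same test. The sketch below recovers approximate standard errors as b / t from the rounded values above and builds 95% intervals. The critical value 2.008 is an assumed approximation for n – k – 1 = 51 degrees of freedom (n = 55, k = 3), and the recovered standard errors are only as accurate as the rounded inputs.

```python
T_CRIT_95 = 2.008  # assumed approximate two-sided 5% t value for 51 degrees of freedom

# (coefficient b_j, t statistic) as reported for the magazine data
results = {
    "Audience": (3.79, 13.5),
    "Percent Male": (-124.0, -0.90),
    "Median Income": (0.903, 2.44),
}

intervals = {}
significant = {}
for name, (b, t) in results.items():
    std_err = b / t                      # since t = b_j / S_bj
    half_width = T_CRIT_95 * std_err
    intervals[name] = (b - half_width, b + half_width)
    significant[name] = abs(t) > T_CRIT_95

for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.3f}, {high:.3f}] significant={significant[name]}")
```

A coefficient is significant at the 5% level exactly when its interval excludes 0, which matches the pattern of p-values on the slide.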

(36)

•  Assumptions underlying the statistical techniques should be tested twice

•  First for the separate variables

•  Second for the multivariate model variate, which acts collectively for the variables in the analysis and thus must meet the same assumptions as the individual variables. The assumptions differ across multivariate techniques

(37)

•  Linearity

•  The independent variable has a linear relationship with the dependent variable

•  Normality

•  The residuals or the dependent variable follow a normal distribution

•  Multicollinearity

•  When some X variables are too similar to one another

•  Homoskedasticity

•  The variability in Y values for a given set of predictors is the same regardless of the values of the predictors

•  Independence among cases (Absence of correlated errors)

(38)

Normality

•  The residuals or the dependent variable follow a normal distribution

•  If the variation from normality is significant then all statistical tests are invalid

•  Graphical Analysis

•  Histogram and Normal probability plot

•  Peaked and skewed distributions result in non-normality

•  Statistical Analysis

•  If Z value exceeds critical value, then the distribution is non-normal
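The slide does not say which Z value is meant. One common rule of thumb, assumed here, divides sample skewness by sqrt(6/N) and excess kurtosis by sqrt(24/N) and compares each with a critical value such as ±1.96. The sketch below follows that assumption; the function name and both samples are made up for illustration.

```python
import math

def normality_z_scores(values):
    """Return (z_skewness, z_kurtosis) using the sqrt(6/N), sqrt(24/N) rule of thumb."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness / math.sqrt(6 / n), excess_kurtosis / math.sqrt(24 / n)

symmetric = [-3, -2, -1, 0, 1, 2, 3]       # made-up symmetric sample
right_skewed = [1, 1, 1, 2, 2, 3, 10]      # made-up sample with a long right tail

z_skew_sym, z_kurt_sym = normality_z_scores(symmetric)
z_skew_right, z_kurt_right = normality_z_scores(right_skewed)
print(z_skew_sym, z_skew_right)
```

In practice the same function would be applied to regression residuals, and very small samples like these give the test little power.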

(39)

(40)

Homoskedasticity

•  Assumption related primarily to dependence relationships between variables

•  Assumption that the dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s).

•  The variance of the dependent variable should not be concentrated in only a limited range of the independent values

•  Source

•  Type of variable

(41)

Homoskedasticity

•  Graphical Analysis

•  Analysis of residuals in case of Regression

•  Statistical Analysis

•  Variances within groups formed by non-metric variables

•  Levene Test

•  Box’s M Test

•  Remedy

(42)

Homoskedasticity

(43)

Linearity

•  Assumption for all multivariate techniques based on correlational measures such as

•  multiple regression,

•  logistic regression,

•  factor analysis, and

•  structural equation modeling

•  Correlation represents only the linear association between variables

•  Identification

•  Scatterplots or examination of residuals using regression

•  Remedy

(44)

(45)

Absence of Correlated Errors

•  Prediction errors should not be correlated with each other

•  Identification

•  Most likely cause is the data collection process, such as two separate groups in the data collection process

•  Remedy

(46)

Multicollinearity

•  Multicollinearity arises when intercorrelations among the predictors are very high.

•  Multicollinearity can result in several problems, including:

•  The partial regression coefficients may not be estimated precisely. The standard errors are likely to be high.

•  The magnitudes as well as the signs of the partial regression coefficients may change from sample to sample.

•  It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

•  Predictor variables may be incorrectly included or removed in

(47)

Multicollinearity

•  The ability of an independent variable to improve the prediction of the dependent variable is related not only to its correlation to the dependent variable, but also to the correlation(s) of the additional independent variable to the independent variable(s) already in the regression equation

•  Collinearity is the association, measured as the correlation, between two independent variables

•  Multicollinearity refers to the correlation among three or more independent variables

•  Impact

•  Reduces any single IV’s predictive power by the extent to which it is

(48)

Multicollinearity

•  Measuring Multicollinearity

•  Tolerance

•  Amount of variability of the selected independent variable not explained by the other independent variables

•  Tolerance values should be high

•  Cut-off is 0.1 but greater than 0.5 gives better results

•  VIF (variance inflation factor)

•  Inverse of Tolerance
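Tolerance and VIF can be computed by regressing each independent variable on the others: tolerance is 1 – R² of that auxiliary regression, and VIF is its inverse. The sketch below assumes NumPy is available and uses synthetic data in which the third variable is nearly a copy of the first; `tolerance_and_vif` is a hypothetical helper name.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return (tolerance, VIF) arrays with one entry per column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tolerances = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        tolerances[j] = ss_res / ss_tot                  # equals 1 - R^2
    return tolerances, 1.0 / tolerances

# Synthetic predictors: x3 is x1 plus a little noise, x2 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

tol, vif = tolerance_and_vif(np.column_stack([x1, x2, x3]))
print(np.round(tol, 3), np.round(vif, 1))
```

Here x1 and x3 fall well below the 0.1 tolerance cut-off, while x2 stays close to 1.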

(49)

Multicollinearity

•  Remedy for Multicollinearity

•  A simple procedure for adjusting for multicollinearity consists of using only one of the variables in a highly correlated set of variables.

•  Omit highly correlated independent variables and identify other independent variables to help the prediction

•  Alternatively, the set of independent variables can be transformed into a new set of predictors that are mutually independent by using techniques such as principal components analysis.

•  More specialized techniques, such as ridge regression and latent root

(50)

Data Transformations

•  To correct violations of the statistical assumptions underlying the multivariate techniques

•  To improve the relationship between variables

•  Transformation to achieve Normality and Homoscedasticity

•  Flat Distribution – Inverse transformation

•  Negatively Skewed Distribution – Square Root Transformation

•  Positively Skewed Distribution – Logarithmic Transformation

•  If the residuals in regression are cone shaped then

•  Cone opens to right – Inverse transformation

(51)

Data Transformations

•  Transformation to achieve Linearity

(52)

(53)

Assumptions in Regression

•  General guidelines for transformation

•  For a noticeable effect of transformation, the ratio of a variable’s mean to its standard deviation should be less than 4.0

•  When the transformation can be performed on either of the two variables, select the one with the smallest ratio of mean/sd.

•  Transformation should be applied to independent variables except in case of heteroscedasticity

•  Heteroscedasticity can only be remedied by transformation of the dependent variable in a dependence relationship

•  If the heteroscedastic relationship is also non-linear, the dependent variable and perhaps the independent variables must be transformed

(54)

Issues in Regression

•  Variable Selection

•  How to choose from a long list of X variables?

•  Too many: waste the information in the data

•  Too few: risk ignoring useful predictive information

•  Model Misspecification

•  Perhaps the multiple regression linear model is wrong

(55)

(56)

•  A good explanatory model fits the data closely, whereas a good predictive model predicts new cases accurately

•  Explanatory models use the entire dataset for estimating the best-fit model and to maximize explained variance (R²). Predictive models estimate the model on a training set and assess it on new, unobserved data

•  Performance measures for explanatory models measure how closely the data fit the model, whereas in predictive models performance is measured by predictive accuracy
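The predictive workflow can be sketched as: split the data, fit on the training part only, then measure error on the held-out part. Everything below is an illustrative assumption, including the synthetic data, the 70/30 split, and the helper names.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(xs, ys, intercept, slope):
    """Root mean squared prediction error of the fitted line on (xs, ys)."""
    errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Synthetic data: a linear trend plus random noise
random.seed(1)
xs = [random.uniform(0, 20) for _ in range(100)]
ys = [15 + 1.7 * x + random.gauss(0, 6) for x in xs]

# First 70 cases for training, last 30 held out for validation
train_x, valid_x = xs[:70], xs[70:]
train_y, valid_y = ys[:70], ys[70:]

intercept, slope = fit_line(train_x, train_y)         # estimated on training data only
train_rmse = rmse(train_x, train_y, intercept, slope)
valid_rmse = rmse(valid_x, valid_y, intercept, slope)  # accuracy on unseen cases

print(round(train_rmse, 2), round(valid_rmse, 2))
```

The validation figure is the one that speaks to predictive accuracy; the training figure only describes fit.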

(57)

•  Prediction Error for observation i = Actual y value – predicted y value

•  Popular numerical measures of predictive accuracy

•  MAE or MAD (Mean absolute error / deviation)

•  Average Error

(58)

•  RMSE (Root mean squared error)
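The three measures differ only in how the prediction errors are combined. A small sketch with hypothetical validation values (the numbers are made up so the arithmetic is easy to follow):

```python
import math

def prediction_errors(actual, predicted):
    """Error for each observation: actual y minus predicted y."""
    return [a - p for a, p in zip(actual, predicted)]

def average_error(errors):
    """Signed mean: shows whether predictions are biased high or low."""
    return sum(errors) / len(errors)

def mean_absolute_error(errors):
    """MAE / MAD: typical size of an error, ignoring sign."""
    return sum(abs(e) for e in errors) / len(errors)

def root_mean_squared_error(errors):
    """RMSE: like MAE but penalizes large errors more heavily."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

actual = [10, 20, 30, 40]        # hypothetical validation outcomes
predicted = [12, 18, 33, 37]     # hypothetical model predictions

errors = prediction_errors(actual, predicted)   # [-2, 2, -3, 3]
avg = average_error(errors)
mae = mean_absolute_error(errors)
rmse = root_mean_squared_error(errors)

print(avg, mae, round(rmse, 3))  # 0.0 2.5 2.55
```

An average error of zero means only that over- and under-predictions cancel; MAE and RMSE show the predictions are still off by about 2.5 units.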

(59)

(60)

•  Describe a situation in which a useless regression has a high R2.

•  Check the validity of the linear regression model assumptions.

•  Estimate the excess returns of Bob’s and Putney’s funds. Between them, who is expected to obtain higher returns at their current funds and by how much?

•  If hired by the firm, who is expected to obtain higher returns and by how much?

•  Can you prove at the 5% level of significance that Bob would get higher expected returns if he had attended Princeton instead of Ohio State?

•  Can you prove at the 10% level of significance that Bob would get at least 1% higher expected returns by managing a growth fund?

•  Is there strong evidence that fund managers with an MBA perform worse than fund managers without an MBA? What is held constant in this comparison?

•  Based on your analysis of the case, which candidate do you support for

(61)

•  Is the presence of a physical Bank Branch creating demand for checking accounts?

(62)