USING REGRESSION
Sumeet Gupta
Associate Professor
• Basic Concepts
• Applications of Predictive Modeling
• Linear Regression in One Variable using OLS
• Multiple Linear Regression
• Assumptions in Regression
• Explanatory vs. Predictive Modeling
• Performance Evaluation of Predictive Models
• Practical Exercises
• Case: Nils Baker
Predictive Modeling: Applications
• Predicting customer activity on credit cards from their demographic and historical activity patterns
• Predicting the time to failure of equipment based on utilization and environmental conditions
• Predicting expenditures on vacation travel based on historical frequent flyer data
• Predicting staffing requirements at help desks based on historical data and product and sales information
• Predicting sales from cross-selling of products from historical information
Basic Concept: Relationships
Examples of relationships:
• Sales and earnings
• Cost and number produced
• Microsoft and the stock market
• Effort and results
• Scatterplot
  • A picture to explore the relationship in bivariate data
• Correlation r
  • Measures strength of the relationship (from –1 to 1)
• Regression
Basic Concept: Correlation
• r = 1
  • A perfect straight line tilting up to the right
• r = 0
  • No overall tilt
  • No relationship?
• r = –1
  • A perfect straight line tilting down to the right
Basic Concepts: Simple Linear Model
• Linear Model for the Population
  • The foundation for statistical inference in regression
  • Observed Y is a straight line, plus randomness
    Y = α + βX + ε
  • α + βX: population relationship, on average
  • ε: randomness of individuals
Basic Concepts: Simple Linear Model
• Time Spent vs. Internet Pages Viewed
  • Two measures of the abilities of 25 Internet sites
  • At the top right are eBay, Yahoo!, and MSN
  • Correlation is r = 0.964
  • Very strong positive association (since r is close to 1)
  • Linear relationship
  • Straight line with scatter
  • Increasing relationship
  • Tilts up and to the right
[Figure: scatterplot of minutes per person vs. pages per person, with eBay, Yahoo!, and MSN labeled]
Basic Concepts: Simple Linear Model
• Dollars vs. Deals
  • For mergers and acquisitions by investment bankers
  • 244 deals worth $756 billion by Goldman Sachs
  • Correlation is r = 0.419
  • Positive association
  • Linear relationship
  • Straight line with scatter
  • Increasing relationship
  • Tilts up and to the right
[Figure: scatterplot of dollars (billions) vs. number of deals]
Basic Concepts: Simple Linear Model
• Interest Rate vs. Loan Fee
  • For mortgages
  • If the interest rate is lower, does the bank make it up with a higher loan fee?
  • Correlation is r = –0.890
  • Strong negative association
  • Linear relationship
  • Straight line with scatter
  • Decreasing relationship
  • Tilts down and to the right
[Figure: scatterplot of interest rate vs. loan fee]
Basic Concepts: Simple Linear Model
• Today’s vs. Yesterday’s Percent Change
  • Is there momentum?
  • If the market was up yesterday, is it more likely to be up today? Or is each day’s performance independent?
  • Correlation is r = 0.11
  • A weak relationship?
  • No relationship?
  • Tilt is neither up nor down
[Figure: scatterplot of today's change vs. yesterday's change]
• Call Price vs. Strike Price
  • For stock options
  • “Call Price” is the price of the option contract to buy stock at the “Strike Price”
  • The right to buy at a lower strike price has more value
  • A nonlinear relationship
  • Not a straight line: a curved relationship
  • Correlation r = –0.895
  • A negative relationship: higher strike price goes with lower call price
[Figure: scatterplot of call price vs. strike price]
Basic Concepts: Simple Linear Model
• Output Yield vs. Temperature
  • For an industrial process
  • With a “best” optimal temperature setting
  • A nonlinear relationship
  • Not a straight line: a curved relationship
  • Correlation r = –0.0155
  • r suggests no relationship
  • But relationship is strong
  • It tilts neither up nor down
[Figure: scatterplot of yield of process vs. temperature]
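The point that r can be near zero while the relationship is strong can be checked numerically. The sketch below uses hypothetical data (not the data behind the slide): a yield that peaks at one temperature and falls off symmetrically on either side. The function name `pearson_r` and all data values are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical process data: yield is an exact quadratic in temperature,
# peaking at 700, so yield is fully determined by temperature.
temperatures = [500, 550, 600, 650, 700, 750, 800, 850, 900]
yields = [160 - 2.5 * ((t - 700) / 50) ** 2 for t in temperatures]

r_curved = pearson_r(temperatures, yields)
print(f"r for the curved relationship: {r_curved:.4f}")
```

Because the curve is symmetric about its peak, the upward and downward halves cancel and r comes out essentially zero, even though temperature determines yield completely here. A scatterplot reveals what r alone hides.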
Basic Concepts: Simple Linear Model
• Circuit Miles vs. Investment (lower left)
  • For telecommunications firms
  • A relationship with unequal variability
  • More vertical variation at the right than at the left
  • Variability is stabilized by taking logarithms (lower right)
  • Correlation r = 0.820
[Figure: two scatterplots. Lower left: circuit miles (millions) vs. investment ($millions), r = 0.820. Lower right: log of miles vs. log of investment, r = 0.957]
Basic Concepts: Simple Linear Model
• Price vs. Coupon Payment
  • For trading in the bond market
  • Bonds paying a higher coupon generally cost more
  • Two clusters are visible
  • Ordinary bonds (value is from coupon)
  • Inflation-indexed bonds (payout rises with inflation)
  • Correlation r = 0.950 for all bonds
  • Correlation r = 0.994 for ordinary bonds only
[Figure: two scatterplots of bid price vs. coupon rate, for all bonds and for ordinary bonds only]
Basic Concepts: Simple Linear Model
• Cost vs. Number Produced
  • For a production facility
  • It usually costs more to produce more
  • An outlier is visible
  • A disaster (a fire at the factory)
  • High cost, but few produced
[Figure: two scatterplots of cost vs. number produced. With the outlier: r = –0.623. Outlier removed (more details visible): r = 0.869]
Basic Concepts: OLS Modeling
• Salary vs. Years Experience
  • For n = 6 employees
  • Linear (straight line) relationship
  • Increasing relationship
  • Higher salary generally goes with higher experience
  • Correlation r = 0.8667

Experience (years)    Salary ($ thousand)
15                    30
10                    35
20                    55
5                     22
15                    40
5                     27
[Figure: scatterplot of salary ($ thousand) vs. experience]
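The reported correlation can be reproduced from the six (experience, salary) pairs on this slide. This is a minimal sketch; the helper name `pearson_r` is an assumption, not something from the slides.

```python
import math

# Data from the Salary vs. Years Experience slide (salary in $ thousand)
experience = [15, 10, 20, 5, 15, 5]
salary = [30, 35, 55, 22, 40, 27]

def pearson_r(xs, ys):
    """r = Sxy / sqrt(Sxx * Syy), using sums of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(experience, salary)
print(f"r = {r:.4f}")
```

The result agrees with the r = 0.8667 quoted on the slide to four decimal places.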
Basic Concepts: OLS Modeling
• The least-squares line summarizes bivariate data: predicts Y from X
  • with smallest errors (in vertical direction, for Y axis)
  • Intercept is 15.32 (predicted salary in $ thousand at 0 years of experience)
  • Slope is 1.673 ($ thousand of salary for each additional year of experience, on average)
[Figure: scatterplot of salary (Y) vs. experience (X) with the least-squares line]
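The intercept and slope above follow from the usual least-squares formulas, b = Sxy / Sxx and a = mean(Y) – b·mean(X). A minimal sketch using the six salary observations from the slides (the function name `least_squares_line` is an assumption):

```python
# Data from the Salary vs. Years Experience slide (salary in $ thousand)
experience = [15, 10, 20, 5, 15, 5]
salary = [30, 35, 55, 22, 40, 27]

def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

a, b = least_squares_line(experience, salary)
print(f"intercept a = {a:.2f}, slope b = {b:.3f}")
```

Rounded, this gives the slide's values of 15.32 and 1.673.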
Basic Concepts: OLS Modeling
• Predicted Value comes from Least-Squares Line
  • For example, Mary (with 20 years of experience) has predicted salary 15.32 + 1.673(20) = 48.8
  • So does anyone with 20 years of experience
• Residual is actual Y minus predicted Y
  • Mary’s residual is 55 – 48.8 = 6.2
  • She earns about $6,200 more than the predicted salary for a person with 20 years of experience
Basic Concepts: OLS Modeling
[Figure: salary vs. experience with the least-squares line. Mary earns 55 thousand; Mary’s predicted value is 48.8]
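Mary's predicted value and residual can be worked through in a few lines. The sketch uses the rounded coefficients quoted in the slides (15.32 and 1.673); the function names are assumptions for illustration.

```python
# Rounded least-squares coefficients from the slides (salary in $ thousand)
INTERCEPT = 15.32
SLOPE = 1.673

def predict_salary(years_experience):
    """Predicted Y from the least-squares line a + bX."""
    return INTERCEPT + SLOPE * years_experience

def residual(actual_salary, years_experience):
    """Residual = actual Y minus predicted Y."""
    return actual_salary - predict_salary(years_experience)

mary_predicted = predict_salary(20)
mary_residual = residual(55, 20)
print(f"predicted = {mary_predicted:.1f}, residual = {mary_residual:.1f}")
```

A positive residual means the person earns more than the line predicts for that level of experience; a negative one means less.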
Basic Concepts: OLS Modeling
• Standard Error of Estimate
  \[ S_e = S_Y \sqrt{(1 - r^2)\,\frac{n - 1}{n - 2}} \]
  • Approximate size of prediction errors (residuals)
  • Actual Y minus predicted Y: Y – [a + bX]
• Example (Salary vs. Experience)
  \[ S_e = 11.686 \sqrt{(1 - 0.8667^2)\,\frac{6 - 1}{6 - 2}} = 6.52 \]
  • Predicted salaries are about 6.52 (i.e., $6,520) away from actual salaries
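The standard error of estimate can be computed two ways: from the shortcut formula S_e = S_Y·sqrt((1 – r²)(n – 1)/(n – 2)), or directly as the square root of the sum of squared residuals divided by n – 2. The sketch below does both on the six salary observations from the slides; the variable names are assumptions.

```python
import math

# Data from the Salary vs. Years Experience slide (salary in $ thousand)
experience = [15, 10, 20, 5, 15, 5]
salary = [30, 35, 55, 22, 40, 27]

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n
sxx = sum((x - mean_x) ** 2 for x in experience)
syy = sum((y - mean_y) ** 2 for y in salary)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))

# Route 1: shortcut formula based on S_Y and r
s_y = math.sqrt(syy / (n - 1))
r = sxy / math.sqrt(sxx * syy)
se_formula = s_y * math.sqrt((1 - r ** 2) * (n - 1) / (n - 2))

# Route 2: directly from the residuals of the least-squares line
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(experience, salary)]
se_direct = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"S_Y = {s_y:.3f}, S_e (formula) = {se_formula:.2f}, S_e (residuals) = {se_direct:.2f}")
```

Both routes agree and round to the 6.52 shown in the example, which confirms that the shortcut is just an algebraic rearrangement of the residual-based definition.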
Basic Concepts: OLS Modeling
• Interpretation: similar to standard deviation
• Can move Least-Squares Line up and down by Se
  • About 68% of the data are within one “standard error of estimate” of the least-squares line
  • (For a bivariate normal distribution)
[Figure: salary vs. experience with the least-squares line shifted up and down by Se]
Multiple Linear Regression
• Linear Model for the Population
  Y = (α + β1X1 + β2X2 + … + βkXk) + ε
    = (Population relationship) + Randomness
  • Where ε has a normal distribution with mean 0 and constant standard deviation σ, and this randomness is independent from one case to another
Multiple Linear Regression: Results
• Intercept: a
  • Predicted value for Y when every X is 0
• Regression Coefficients: b1, b2, …, bk
  • The effect of each X on Y, holding all other X variables constant
• Prediction Equation or Regression Equation
  (Predicted Y) = a + b1X1 + b2X2 + … + bkXk
  • The predicted Y, given the values for all X variables
• Prediction Errors or Residuals
Multiple Linear Regression: Results
• t Tests for Individual Regression Coefficients
  • Significant or not significant, for each X variable
  • Tests whether a particular X variable has an effect on Y, holding the other X variables constant
  • Should be performed only if the F test is significant
• Standard Errors of the Regression Coefficients: \( S_{b_1}, S_{b_2}, \ldots, S_{b_k} \) (with n–k–1 degrees of freedom)
  • Indicate the estimated sampling standard deviation of each regression coefficient
  • Used in the usual way to find confidence intervals and hypothesis tests for individual regression coefficients
Multiple Linear Regression: Results
• Predicted Page Costs for Audubon
  = a + b1X1 + b2X2 + b3X3
  = $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income)
  = $4,043 + 3.79(1,645) – 124(51.1) + 0.903(38,787)
  = $38,966
• Actual Page Costs are $25,315
• Residual is $25,315 – $38,966 = –$13,651
  • Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics (Audience, Percent Male, and Median Income)
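The Audubon arithmetic is a direct application of the prediction equation. A minimal sketch, using only the coefficients and X values quoted on the slide (the dictionary layout and variable names are assumptions):

```python
# Regression coefficients and intercept as quoted on the slide
intercept = 4043.0
coefficients = {
    "Audience": 3.79,
    "Percent Male": -124.0,
    "Median Income": 0.903,
}

# Audubon's values for each X variable, as quoted on the slide
audubon = {
    "Audience": 1645,
    "Percent Male": 51.1,
    "Median Income": 38787,
}

# Predicted Y = a + b1*X1 + b2*X2 + b3*X3
predicted = intercept + sum(coefficients[name] * audubon[name] for name in coefficients)

actual = 25315
residual = actual - predicted  # actual minus predicted

print(f"predicted = ${predicted:,.0f}, residual = ${residual:,.0f}")
```

The negative residual is what the slide describes: actual Page Costs sit below what the regression predicts for a magazine with these characteristics.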
Standard Error
• Standard Error of Estimate Se
  • Indicates the approximate size of the prediction errors
• About how far are the Y values from their predictions?
• For the magazine data
• Se = S = $21,578
• Actual Page Costs are about $21,578 from their predictions for this
group of magazines (using regression)
• Compare to SY = $45,446: Actual Page Costs are about $45,446 from
their average (not using regression)
• Using the regression equation to predict Page Costs (instead of simply
Coeff. of Determination
The strength of association is measured by the square of the multiple correlation coefficient, R2, which is also called the coefficient of multiple determination.
\[ R^2 = \frac{SS_{reg}}{SS_y} \]
R2 is adjusted for the number of independent variables and the sample size by using the following formula:
\[ \text{Adjusted } R^2 = R^2 - \frac{k(1 - R^2)}{n - k - 1} \]
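As a worked example of the adjustment, the sketch below assumes the standard form Adjusted R² = R² – k(1 – R²)/(n – k – 1), where n is the sample size and k the number of independent variables. It is applied to the magazine figures quoted elsewhere in these slides (R² = 0.787, n = 55, k = 3). The function name is an assumption, and the adjusted value itself is not reported in the slides.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n cases and k independent variables (assumed standard form)."""
    return r2 - k * (1 - r2) / (n - k - 1)

# Magazine data figures quoted in the slides
r2 = 0.787
n = 55
k = 3

adj_r2 = adjusted_r_squared(r2, n, k)

# Equivalent textbook form, shown as a cross-check
adj_r2_alt = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"adjusted R2 = {adj_r2:.3f}")
```

With 55 cases and only 3 predictors the penalty is small, so the adjusted value stays close to R². The gap widens as k grows relative to n.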
Coeff. of Determination
• Coefficient of Determination R2
  • Indicates the percentage of the variation in Y that is explained by (or attributed to) all of the X variables
• How well do the X variables explain Y?
• For the magazine data
• R2 = 0.787 = 78.7%
• The X variables (Audience, Percent Male, and Median Income) taken
together explain 78.7% of the variance of Page Costs
• This leaves 100% – 78.7% = 21.3% of the variation in Page Costs
The F test
• Is the regression significant?
  • Do the X variables, taken together, explain a significant amount of the variation in Y?
  • The null hypothesis claims that, in the population, the X variables do not help explain Y; all coefficients are 0
    H0: β1 = β2 = … = βk = 0
  • The research hypothesis claims that, in the population, at least one of the X variables does help explain Y
The F test
\[ H_0: R^2_{pop} = 0 \]
This is equivalent to the following null hypothesis:
H0: β1 = β2 = β3 = … = βk = 0
The overall test can be conducted by using an F statistic:
\[ F = \frac{SS_{reg}/k}{SS_{res}/(n - k - 1)} = \frac{R^2/k}{(1 - R^2)/(n - k - 1)} \]
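The R² form of the F statistic is easy to evaluate by hand or in code. The sketch below applies it to the magazine figures quoted in these slides (R² = 0.787, n = 55, k = 3); the function name is an assumption.

```python
def f_statistic(r2, n, k):
    """Overall F statistic with k and n - k - 1 degrees of freedom, from R^2."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Magazine data figures quoted in the slides
f_value = f_statistic(0.787, 55, 3)
print(f"F = {f_value:.2f} with 3 and {55 - 3 - 1} degrees of freedom")
```

The result is about 62.8, close to the 62.84 reported for the magazine data; the small gap is consistent with R² having been rounded to three decimals before being plugged in.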
Performing the F test
• Three equivalent methods for performing the F test; they always give the same result
  • Use the p-value
    • If p < 0.05, then the test is significant
    • Same interpretation as p-values in Chapter 10
  • Use the R2 value
    • If R2 is larger than the value in the R2 table, then the result is significant
    • Do the X variables explain more than just randomness?
  • Use the F statistic
    • If the F statistic is larger than the value in the F table, then the result is significant
Example: F test
• For the magazine data, the X variables (Audience, Percent Male, and Median Income) explain a very highly significant percentage of the variation in Page Costs
  • The p-value, listed as 0.000, is less than 0.0005, and is therefore very highly significant (since it is less than 0.001)
  • The R2 value, 78.7%, is greater than 27.1% (from the R2 table at level 0.1% with n = 55 and k = 3), and is therefore very highly significant
  • The F statistic, 62.84, is greater than the value (between 7.054 and 6.171) from the F table at level 0.1%, and is therefore very highly significant
t Tests
• A t test for each regression coefficient
  • To be used only if the F test is significant
  • If F is not significant, you should not look at the t tests
  • Does the jth X variable have a significant effect on Y, holding the other X variables constant?
  • Hypotheses are
    H0: βj = 0, H1: βj ≠ 0
  • Test using the confidence interval \( b_j \pm t\,S_{b_j} \)
    • use the t table with n – k – 1 degrees of freedom
  • Or use the t statistic \( t_{statistic} = b_j / S_{b_j} \)
    • compare to the t table value with n – k – 1 degrees of freedom
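The two routes to the t test (confidence interval and t statistic) can be sketched as follows. Several inputs are assumptions and are labeled as such: the standard errors are back-calculated from coefficient and t values quoted in the slides' magazine example (they are not reported directly), and the critical value of about 2.01 is an approximate two-sided 5% t table value for 51 degrees of freedom.

```python
def t_statistic(b, se_b):
    """t statistic for H0: beta_j = 0."""
    return b / se_b

def confidence_interval(b, se_b, t_critical):
    """Interval b +/- t * S_b using a t table value with n - k - 1 df."""
    margin = t_critical * se_b
    return b - margin, b + margin

def is_significant(b, se_b, t_critical):
    """Significant if |t| exceeds the table value (the interval then excludes 0)."""
    return abs(t_statistic(b, se_b)) > t_critical

# ASSUMPTION: approximate two-sided 5% critical value for 55 - 3 - 1 = 51 df
T_CRITICAL = 2.01

# ASSUMPTION: standard errors back-calculated as b / t from the quoted values
median_income = {"b": 0.903, "se": 0.370}    # quoted t = 2.44
percent_male = {"b": -124.0, "se": 137.8}    # quoted t = -0.90

t_income = t_statistic(median_income["b"], median_income["se"])
ci_income = confidence_interval(median_income["b"], median_income["se"], T_CRITICAL)
sig_income = is_significant(median_income["b"], median_income["se"], T_CRITICAL)
sig_male = is_significant(percent_male["b"], percent_male["se"], T_CRITICAL)

print(f"Median Income: t = {t_income:.2f}, 95% CI = ({ci_income[0]:.3f}, {ci_income[1]:.3f})")
print(f"Median Income significant: {sig_income}; Percent Male significant: {sig_male}")
```

The two routes always agree: a coefficient whose confidence interval excludes 0 has a t statistic larger in magnitude than the table value, and vice versa.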
Example: t Tests
• Testing b1, the coefficient for Audience
  b1 = 3.79, t = 13.5, p = 0.000
  • Audience has a very highly significant effect on Page Costs, after adjusting for Percent Male and Median Income
• Testing b2, the coefficient for Percent Male
  b2 = –124, t = –0.90, p = 0.374
  • Percent Male does not have a significant effect on Page Costs, after adjusting for Audience and Median Income
• Testing b3, the coefficient for Median Income
  b3 = 0.903, t = 2.44, p = 0.018
  • Median Income has a significant effect on Page Costs, after adjusting for Audience and Percent Male
• Assumptions underlying the statistical techniques should be tested twice
  • First for the separate variables
  • Second for the multivariate model variate, which acts collectively for the variables in the analysis and thus must meet the same assumptions as the individual variables. The assumptions differ for different multivariate techniques
• Linearity
  • The independent variable has a linear relationship with the dependent variable
• Normality
  • The residuals or the dependent variable follow a normal distribution
• Multicollinearity
  • When some X variables are too similar to one another
• Homoskedasticity
  • The variability in Y values for a given set of predictors is the same regardless of the values of the predictors
• Independence among cases (Absence of correlated errors)
Normality
• The residuals or the dependent variable follow a normal distribution
• If the variation from normality is significant then all statistical tests are invalid
• Graphical Analysis
  • Histogram and Normal probability plot
  • Peaked and Skewed distributions result in non-normality
• Statistical Analysis
  • If the Z value exceeds the critical value, then the distribution is non-normal
Homoskedasticity
• Assumption related primarily to dependence relationships between variables
• Assumption that the dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s).
• The variance of the dependent variable should not be concentrated in only a limited range of the independent values
• Source
  • Type of variable
Homoskedasticity
• Graphical Analysis
  • Analysis of residuals in case of Regression
• Statistical Analysis
  • Variances within groups formed by non-metric variables
  • Levene Test
  • Box’s M Test
• Remedy
Homoskedasticity
Linearity
• Assumption for all multivariate techniques based on correlational measures such as
  • multiple regression,
  • logistic regression,
  • factor analysis, and
  • structural equation modeling
• Correlation represents only the linear association between variables
• Identification
  • Scatterplots or examination of residuals using regression
• Remedy
Absence of Correlated Errors
• Prediction errors should not be correlated with each other
• Identification
  • The most likely cause is the data collection process, such as two separate groups in the data collection process
• Remedy
Multicollinearity
• Multicollinearity arises when intercorrelations among the predictors
are very high.
• Multicollinearity can result in several problems, including:
• The partial regression coefficients may not be estimated precisely.
The standard errors are likely to be high.
• The magnitudes as well as the signs of the partial regression
coefficients may change from sample to sample.
• It becomes difficult to assess the relative importance of the
independent variables in explaining the variation in the dependent variable.
• Predictor variables may be incorrectly included or removed in
Multicollinearity
• The ability of an independent variable to improve the prediction of the dependent variable is related not only to its correlation to the
dependent variable, but also to the correlation(s) of the additional independent variable to the independent variable(s) already in the regression equation
• Collinearity is the association, measured as the correlation, between two independent variables
• Multicollinearity refers to the correlation among three or more independent variables
• Impact
• Reduces any single IV’s predictive power by the extent to which it is
Multicollinearity
• Measuring Multicollinearity
  • Tolerance
    • Amount of variability of the selected independent variable not explained by the other independent variables
    • Tolerance values should be high
    • Cut-off is 0.1, but a tolerance greater than 0.5 gives better results
  • VIF (variance inflation factor)
    • Inverse of Tolerance
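Tolerance and VIF can be sketched from a single quantity: the R² obtained when one independent variable is regressed on all the others. Tolerance is 1 minus that R², and VIF is its inverse. The R² values and variable names below are hypothetical, chosen only to illustrate the 0.1 cut-off mentioned above.

```python
TOLERANCE_CUTOFF = 0.1  # cut-off quoted in the slides

def tolerance(r2_on_other_predictors):
    """Share of a predictor's variability NOT explained by the other predictors."""
    return 1 - r2_on_other_predictors

def vif(r2_on_other_predictors):
    """Variance inflation factor: the inverse of tolerance."""
    return 1 / tolerance(r2_on_other_predictors)

# HYPOTHETICAL R^2 values from regressing each predictor on the remaining ones
r2_by_predictor = {"X1": 0.20, "X2": 0.60, "X3": 0.95}

report = {}
for name, r2 in r2_by_predictor.items():
    tol = tolerance(r2)
    report[name] = {
        "tolerance": tol,
        "vif": vif(r2),
        "problem": tol < TOLERANCE_CUTOFF,
    }
    print(f"{name}: tolerance = {tol:.2f}, VIF = {vif(r2):.1f}, flagged = {tol < TOLERANCE_CUTOFF}")
```

Since VIF is the inverse of tolerance, a tolerance cut-off of 0.1 corresponds to a VIF of 10: predictors with VIF above 10 are the ones this rule flags.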
Multicollinearity
• Remedy for Multicollinearity
• A simple procedure for adjusting for multicollinearity consists of
using only one of the variables in a highly correlated set of variables.
• Omit highly correlated independent variables and identify other
independent variables to help the prediction
• Alternatively, the set of independent variables can be transformed into
a new set of predictors that are mutually independent by using techniques such as principal components analysis.
• More specialized techniques, such as ridge regression and latent root
Data Transformations
• To correct violations of the statistical assumptions underlying the multivariate techniques
• To improve the relationship between variables
• Transformation to achieve Normality and Homoscedasticity
• Flat Distribution – Inverse transformation
• Negatively Skewed Distribution – Squared or Cubed Transformation
• Positively Skewed Distribution – Logarithmic Transformation
• If the residuals in regression are cone shaped then
• Cone opens to right – Inverse transformation
Data Transformations
• Transformation to achieve Linearity
Assumptions in Regression
• General guidelines for transformation
• For a noticeable effect of transformation the ratio of a variable’s
mean to the standard deviation should be less than 4.0
• When the transformation can be performed on either of the two
variables, select the one with smallest ratio of mean/sd.
• Transformation should be applied to independent variables
except in case of heteroscedasticity
• Heteroscedasticity can only be remedied by transformation of
the dependent variable in a dependent relationship
• If the heteroscedastic relationship is also non-linear the
dependent variable and perhaps the independent variables must be transformed
Issues in Regression
• Variable Selection
  • How to choose from a long list of X variables?
  • Too many: waste the information in the data
  • Too few: risk ignoring useful predictive information
• Model Misspecification
  • Perhaps the multiple regression linear model is wrong
• A good explanatory model fits the data closely, whereas a good predictive model predicts new cases accurately
• Explanatory models use the entire dataset to estimate the best-fit model and to maximize explained variance (R2). Predictive models are estimated on a training set and assessed on new, unobserved data
• Performance measures for explanatory models measure how closely the data fit the model, whereas in predictive models performance is measured by predictive accuracy
• Prediction error for observation i = actual y value – predicted y value
• Popular numerical measures of predictive accuracy
  • MAE or MAD (Mean absolute error / deviation)
  • Average Error
  • RMSE (Root mean squared error)
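The three accuracy measures all start from the prediction errors (actual minus predicted) on data that was held out from model fitting. A minimal sketch with hypothetical holdout values; the function names and the numbers are assumptions for illustration.

```python
import math

def prediction_errors(actual, predicted):
    """Error for each observation: actual y minus predicted y."""
    return [a - p for a, p in zip(actual, predicted)]

def average_error(errors):
    """Mean signed error; shows whether predictions run high or low on average."""
    return sum(errors) / len(errors)

def mean_absolute_error(errors):
    """MAE / MAD: typical size of an error, ignoring its sign."""
    return sum(abs(e) for e in errors) / len(errors)

def root_mean_squared_error(errors):
    """RMSE: like MAE but penalizes large errors more heavily."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# HYPOTHETICAL holdout (validation) data: actual values and model predictions
actual = [30, 35, 55, 22]
predicted = [32, 33, 50, 25]

errors = prediction_errors(actual, predicted)
avg_err = average_error(errors)
mae = mean_absolute_error(errors)
rmse = root_mean_squared_error(errors)

print(f"errors = {errors}")
print(f"average error = {avg_err:.2f}, MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

Average error can be near zero even when individual errors are large, because positive and negative errors cancel. That is why it is read alongside MAE or RMSE rather than on its own.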
• Describe a situation in which a useless regression has a high R2.
• Check the validity of the linear regression model assumptions.
• Estimate the excess returns of Bob’s and Putney’s funds. Between them, who
is expected to obtain higher returns at their current funds and by how much?
• If hired by the firm, who is expected to obtain higher returns and by how
much?
• Can you prove at the 5% level of significance that Bob would get higher
expected returns if he had attended Princeton instead of Ohio State?
• Can you prove at the 10% level of significance that Bob would get at least 1%
higher expected returns by managing a growth fund?
• Is there strong evidence that fund managers with MBA perform worse than
fund managers without MBA? What is held constant in this comparison?
• Based on your analysis of the case, which candidate do you support for