The Multiple Linear Regression Model
Multiple linear regression is an extension of the simple model that
incorporates two or more independent variables. Multiple regression
analysis produces an equation with several coefficients: one for each
independent variable X introduced into the model, plus a constant term.
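$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
where:
$Y_i$: Dependent (Response) Variable
$X_{1i}, \ldots, X_{ki}$: Independent (Explanatory) Variables
$\beta_0$: Y-intercept (constant term)
$\beta_1, \ldots, \beta_k$: Population slopes for each explanatory variable
$\varepsilon_i$: The model's error term (residual)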
Examples where multiple linear regression may be used include:
1. Prediction of Student Performance in Academic and Military Learning
Environment: Use of Multiple Linear Regression Predictive Model and
Hypothesis Testing.
https://eric.ed.gov/?id=EJ1151836
This paper determines whether there is a relationship between students'
performance and influencing factors such as academic aptitude test score, time spent
in physical training, and time spent on training need analysis (TNA)
modules.
2. Relationships between students' math achievement and socioemotional
development, math self-efficacy, school environment, and home environment.
3. What is the relationship between children's fifth-grade math achievement and
children's math self-concept, teachers' ratings of children's math
proficiency, and self-control?
Why is this important?
An outcome is rarely a function of just one variable; it is usually
influenced by many variables. The idea is that we should be able to obtain a
more accurate predicted score by using multiple variables to predict our
outcome.
Coefficient of determination
The coefficient of determination (R²) is a statistical metric that
measures how much of the variation in the outcome can be explained by the
variation in the independent variables. R² never decreases as more predictors
are added to the MLR model, even when the added predictors are not related to
the outcome variable.
R² by itself therefore cannot be used to identify which predictors should be included in
a model and which should be excluded.
R² can only be between 0 and 1, where 0 indicates that the outcome cannot be
predicted by any of the independent variables and 1 indicates that the outcome
can be predicted without error from the independent variables.
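Because R² cannot go down when a predictor is added, SPSS also reports an adjusted R² that penalizes the number of predictors. The sketch below is illustrative only: the function name `adjusted_r2` is hypothetical, and the inputs are the R², sample size, and number of predictors reported for the 8-nation example later in these notes.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than k + 1")
    # Shrink the explained share by the ratio of total to residual degrees of freedom.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# R^2 = 0.765 with n = 8 nations and k = 2 predictors (values from the notes).
adj = adjusted_r2(0.765, n=8, k=2)
print(round(adj, 3))
```

The result should agree with the "Adjusted R Square" column of the Model Summary table for that example.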
Multiple Regression Example
The following table presents information on three variables for a small
sample of eight nations. We will take the abortion rate as the dependent
variable and examine the relationship with two variables: one measures
the status and power of women and the other measures religiosity.
Nation    Abortion Rate (Y)   Women's Status (x1)   Religiosity (x2)
Canada    165                 0.5                   74
Chile     100                 0.45                  93
Denmark   400                 0.8                   48
Germany   208                 0.54                  67
Italy     389                 0.7                   70
Japan     379                 0.52                  55
UK        207                 0.58                  67
US        428                 0.84                  35
The research question might be:
“How much do the independent variables contribute to explaining the
dependent variable?”
Step 1: Scatter Diagram
The regression line is the best straight-line description of the plotted points, and you can use it to describe the association between the variables.
If all the points fall exactly on the line, then the error is 0 and you have a perfect relationship.
In the first chart, we can see a linear, direct, and strong relationship: nations where women have higher status tend to have higher abortion rates.
Step 2: Output from SPSS (simple correlation)
Correlations
                                           Abortion Rate (Y)   Women's Status (x1)   Religiosity (x2)
Abortion Rate (Y)     Pearson Correlation  1                   .817*                 -.842**
                      Sig. (2-tailed)                          0.013                 0.009
                      N                    8                   8                     8
Women's Status (x1)   Pearson Correlation  .817*               1                     -.801*
                      Sig. (2-tailed)      0.013                                     0.017
                      N                    8                   8                     8
Religiosity (x2)      Pearson Correlation  -.842**             -.801*                1
                      Sig. (2-tailed)      0.009               0.017
                      N                    8                   8                     8
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
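The Pearson correlations in the SPSS output can be cross-checked directly from the eight-nation table. This is a minimal sketch using only the Python standard library; the helper name `pearson_r` and the variable names are illustrative choices, not part of the SPSS workflow.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Data from the eight-nation table (Canada, Chile, ..., US).
abortion = [165, 100, 400, 208, 389, 379, 207, 428]
status = [0.5, 0.45, 0.8, 0.54, 0.7, 0.52, 0.58, 0.84]
religiosity = [74, 93, 48, 67, 70, 55, 67, 35]

r_y_x1 = pearson_r(abortion, status)
r_y_x2 = pearson_r(abortion, religiosity)
r_x1_x2 = pearson_r(status, religiosity)
print(round(r_y_x1, 3), round(r_y_x2, 3), round(r_x1_x2, 3))
```

The three values should agree with the off-diagonal entries of the Correlations table to three decimals.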
Step 3: Steps for Multiple Regressions in SPSS
1. Click ANALYZE
2. Select REGRESSION
3. Click LINEAR
4. Move “Abortion Rate” to the DEPENDENT box
5. Move “Women's Status” and “Religiosity” to the INDEPENDENT(S) box
6. Click OK
7. Continue the
Output: Multiple Correlation and Coefficient of Determination
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .875 (a)   0.765      0.671               73.19844                     1.569
a. Predictors: (Constant), Religiosity (x2), Women's Status (x1)
b. Dependent Variable: Abortion Rate (Y)
Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination (R²). How much of the variance in abortion rate is explained by
the two independent variables?
Multiple correlation R = .875: taken together, religiosity and women's
status are strongly correlated with abortion rate, slightly more strongly
than either variable alone (r = .817 and r = -.842).
Coefficient of determination R² = .765: 76.5% of the variation in
abortion rate can be explained by variation in religiosity and women's
status.
Assumption of autocorrelation:
ANOVA Regression: Test for overall Significance
Hypothesis of Slope
Approach of the hypothesis:
(Consider that all the coefficients are simultaneously equal to zero)
ANOVA (a)
Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   87171.94         2    43585.971     8.135   .027 (b)
  Residual     26790.06         5    5358.012
  Total        113962           7
a. Dependent Variable: Abortion Rate (Y)
b. Predictors: (Constant), Religiosity (x2), Women's Status (x1)
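The F value in the ANOVA table can be recomputed from the sums of squares. The sketch below assumes the usual definitions MSR = SSR / k and MSE = SSE / (n - k - 1); the function name `f_statistic` is illustrative, and the numbers are copied from the table.

```python
def f_statistic(ssr: float, sse: float, n: int, k: int) -> float:
    """Overall F statistic for a regression with k predictors and n observations."""
    msr = ssr / k              # mean square for regression
    mse = sse / (n - k - 1)    # mean square for residuals
    return msr / mse


ssr, sse = 87171.94, 26790.06  # Regression and Residual sums of squares
n, k = 8, 2                    # 8 nations, 2 predictors

f_value = f_statistic(ssr, sse, n, k)
r_squared = ssr / (ssr + sse)  # share of total variation explained
print(round(f_value, 3), round(r_squared, 3))
```

The same sums of squares also give R², which ties the ANOVA table back to the Model Summary table.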
$H_0: \beta_1 = \beta_2 = 0$
$H_a: \text{at least one } \beta_j \neq 0$ (At least one independent variable affects Y.)
Test Statistic:
$F = \dfrac{MSR}{MSE} = \dfrac{SSR(\text{all})/k}{MSE(\text{all})}$
where F has k numerator and (n - k - 1) denominator degrees of freedom.
Interpretation: Since Sig. = .027 < 0.05, we reject the null hypothesis, indicating that at least one of the explanatory variables is related to abortion rate. We conclude that the model is useful for prediction.
Test for Significance: Individual Variables
Outliers that have been overlooked, will show up ... as, often, very big residuals.
Shows whether there is a linear relationship between the variable Xi and Y.
Use the t test statistic. Hypotheses:
$H_0: \beta_i = 0$ (No linear relationship.)
$H_1: \beta_i \neq 0$ (Linear relationship between Xi and Y.)
Coefficients (a)
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
Model                 B        Std. Error   Beta     t       Sig.
(Constant)            310.89   345.19                0.901   0.409
Women's Status (x1)   348.41   317.47       0.398    1.097   0.322
Religiosity (x2)      -3.789   2.624        -0.523   -1.44   0.208
a. Dependent Variable: Abortion Rate (Y)
T statistic for X1 (women’s status)
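Each t value in the coefficients table is the unstandardized coefficient divided by its standard error. The short sketch below recomputes them from the table; the dictionary layout is only an illustrative choice.

```python
# (B, Std. Error) pairs copied from the SPSS coefficients table.
coefficients = {
    "(Constant)": (310.89, 345.19),
    "Women's Status (x1)": (348.41, 317.47),
    "Religiosity (x2)": (-3.789, 2.624),
}

# t = B / SE(B), the test statistic for H0: beta_i = 0.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```

With n = 8 and k = 2, each t value is compared against a t distribution with n - k - 1 = 5 degrees of freedom, which is where the Sig. column comes from.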
Multiple Regression Equation
Find the multiple regression equation with Women's Status (x1) and religiosity (x2).
The model has the following equation:
$\text{Abortion rate} = 310.885 + 348.413 \times \text{status} - 3.789 \times \text{religiosity}$
therefore
$\text{Abortion rate} = 310.885 + 348.413 \times 0.75 - 3.789 \times 90 = 231.18$
Religiosity is negatively related to abortion rate, and women's status is positively related to abortion rate.
Prediction: What abortion rate would be expected for a Women's Status of 0.75 and a Religiosity of 90?
The predicted abortion rate is 231.18.
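The fitted equation and the prediction can be reproduced without SPSS. The sketch below solves the least-squares normal equations for the special case of two predictors, using only the standard library; `fit_two_predictors` is a hypothetical helper written for this illustration, not the procedure SPSS uses internally.

```python
def fit_two_predictors(y, x1, x2):
    """Ordinary least squares for y = b0 + b1*x1 + b2*x2 via centered sums."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2          # zero only if x1 and x2 are collinear
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Data from the eight-nation table.
abortion = [165, 100, 400, 208, 389, 379, 207, 428]
status = [0.5, 0.45, 0.8, 0.54, 0.7, 0.52, 0.58, 0.84]
religiosity = [74, 93, 48, 67, 70, 55, 67, 35]

b0, b1, b2 = fit_two_predictors(abortion, status, religiosity)
predicted = b0 + b1 * 0.75 + b2 * 90   # Women's Status 0.75, Religiosity 90
print(round(b0, 3), round(b1, 3), round(b2, 3), round(predicted, 2))
```

The coefficients should match the B column of the SPSS output up to rounding. Note that a religiosity of 90 combined with a status of 0.75 lies outside the combinations observed in the sample, so the prediction is an extrapolation.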
$\hat{Y} = 310.885 + 348.413\,x_1 - 3.789\,x_2$
For each one-unit increase in Women's Status, the estimated average abortion rate increases by 348.413, holding Religiosity constant. Because Women's Status only ranges from 0.45 to 0.84 in this sample, it is easier to read the slope per 0.1 units: an estimated increase of about 34.8.
Assumptions for parametric analysis: Regression
In the regression analysis we examined the residuals of the model.
Residual
Linear model: $Y = \beta_0 + \beta_1 X + e$
Residual = Observed – Predicted
When you perform simple linear regression (or any other type of regression analysis), you get a line of best fit. The data points usually don't fall exactly on this regression line; they are scattered around it. A residual is the vertical distance between a data point and the regression line. Each data point has one residual. Residuals are positive for points above the regression line and negative for points below it. If the regression line passes through a point, the residual at that point is zero.
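To make "Residual = Observed – Predicted" concrete, the sketch below computes the residual for each of the eight nations in the abortion-rate example, using the rounded coefficients from the reported equation. The function name `predict` is illustrative.

```python
# (nation, abortion rate, women's status, religiosity) from the example table.
nations = [
    ("Canada", 165, 0.5, 74), ("Chile", 100, 0.45, 93),
    ("Denmark", 400, 0.8, 48), ("Germany", 208, 0.54, 67),
    ("Italy", 389, 0.7, 70), ("Japan", 379, 0.52, 55),
    ("UK", 207, 0.58, 67), ("US", 428, 0.84, 35),
]


def predict(status: float, religiosity: float) -> float:
    """Fitted equation reported in the notes (coefficients rounded to 3 decimals)."""
    return 310.885 + 348.413 * status - 3.789 * religiosity


# Residual = observed - predicted; positive means the point lies above the fit.
residuals = {name: y - predict(x1, x2) for name, y, x1, x2 in nations}
sse = sum(e ** 2 for e in residuals.values())

for name, e in residuals.items():
    print(f"{name}: {e:.1f}")
print("sum of residuals:", round(sum(residuals.values()), 2))
print("sum of squared residuals:", round(sse, 1))
```

The residuals sum to approximately zero (not exactly, because the coefficients are rounded), and their sum of squares should be close to the Residual row of the ANOVA table.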
When using a parametric analysis such as regression and correlation analysis, the residuals must meet certain assumptions:
Linearity: The relationship between X and the mean of Y is linear. Linearity means that the predictor variables in the regression have a straight-line relationship with the outcome variable. (Big deal if violated.)
Normality: Residuals should be normally distributed with a mean of 0 and variance σ². (Not as big a deal if violated when the sample size is large.)
Homoscedasticity: The variance of the residuals is homogeneous across levels of the predicted values; that is, the errors are distributed with equal variance. Non-constant variance of the residuals is called heteroscedasticity. (Not as big a deal if violated.)
Autocorrelation: We assume that the errors of the models used in the analysis are independent of one another (i.e., the errors are not correlated). When this assumption is not met in the context of time-series research designs, the errors are said to be autocorrelated or dependent.
Independence: Observations are independent of each other. (Huge deal if violated!)
Multicollinearity: Refers to predictors that are correlated with other predictors. Multicollinearity occurs when your model includes multiple factors that are correlated not just with your response variable, but also with each other. In other words, it results when you have factors that are somewhat redundant.
Example to check whether the basic assumptions are fulfilled
For a hypothetical sample of 20 patients, the following data were collected: cholesterol level in blood plasma (in mg/100 ml), age (in years), saturated fat intake (in g/week), and exercise level (coded as 0: no exercise, 1: moderate exercise, 2: intense exercise). Fit a linear model of cholesterol level on the other variables.
Carry out the analysis in statistical software and interpret the output. Note: answer the same questions as in the previous exercise (items 'a' to 'h').
h. What cholesterol level would be expected for a 60-year-old patient who consumes 40 grams of fat and does not do any type of exercise?
Patient Cholesterol Age Fat Exercise
1 350 80 35 0
2 190 30 40 2
3 263 42 15 1
4 320 50 20 0
5 280 45 35 0
6 198 35 50 1
7 232 18 70 1
8 320 32 40 0
9 303 49 45 0
10 220 35 35 0
11 405 50 50 0
12 190 20 15 2
13 230 40 20 1
14 227 30 35 0
15 440 30 80 1
16 318 23 40 2
17 212 35 40 1
18 340 18 80 0
19 195 22 15 0
Assumptions for parametric analysis: Simple Regression
Linearity (Big deal if violated)
The linearity assumption can best be tested with scatter plots, the
Normality
How to test Normality with SPSS
2. Next step: Save<Standardized
Ho: The residuals follow the Normal Distribution
Ha: The residuals do not follow the Normal Distribution
Here the Sig. value from the Shapiro-Wilk test (0.7218) is
greater than .05, so we cannot reject the null
hypothesis; the residuals are consistent with
a Normal distribution.
The resulting Q-Q plot is shown
alongside. If the distribution is normal,
we should expect the points to cluster
around the straight reference line.
Tests of Normality
                        Kolmogorov-Smirnov (a)      Shapiro-Wilk
                        Statistic   df   Sig.       Statistic   df   Sig.
Standardized Residual   0.0886      20   .200*      0.9684      20   0.7218
Homoscedasticity
The figure below shows a random displacement of scores that take on a
rectangular shape with no clustering or systematic pattern. The figure shows
the assumption of homoscedasticity is met.
A residual scatter plot is a figure with one axis for predicted scores and one axis for errors of prediction. Initial visual examination can isolate any outliers, otherwise known as extreme scores, in the data set. Tabachnick and Fidell (2007) explain that the variance of the residuals (the differences between the obtained DV and the predicted DV scores) should be the same for all predicted scores (homoscedasticity).
Autocorrelation
The Durbin-Watson test statistic tests the null hypothesis that the residuals from
an ordinary least-squares regression are not autocorrelated.
The Durbin-Watson statistic ranges in value from 0 to 4. A value near 2
indicates non-autocorrelation; a value toward 0 indicates positive
autocorrelation; a value toward 4 indicates negative autocorrelation.
In the example, Durbin-Watson = 1.877. This value is close to 2,
indicating that the residuals are not autocorrelated; we can therefore assume that the
no-autocorrelation assumption is met.
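The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that definition; the function name `durbin_watson` and the two short residual series are made up purely to illustrate the two ends of the 0 to 4 range, and are not taken from the cholesterol example.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that flip sign every step: negative autocorrelation.
alternating = [1, -1, 1, -1]
# Hypothetical residuals that stay on one side in runs: positive autocorrelation.
runs = [1, 1, 1, -1, -1, -1]

dw_alternating = durbin_watson(alternating)
dw_runs = durbin_watson(runs)
print(round(dw_alternating, 3), round(dw_runs, 3))
```

The alternating series gives a value above 2 and the series with runs gives a value below 2, matching the interpretation rule stated above.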
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .690 (a)   .477       .378                58.153                       1.877
Multicollinearity exists when independent variables in a regression equation are highly correlated among themselves. In other words, multicollinearity in regression analysis refers to how strongly interrelated the independent variables in a model are. When multicollinearity is too high, the individual parameter estimates become difficult to interpret. Most regression programs can compute a variance inflation factor (VIF) for each variable. As a rule of thumb:
A VIF above 5 to 10 suggests problems with multicollinearity (the variable is highly correlated with the other explanatory variables).
Tolerance should be > 0.1 (no evidence of collinearity).
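VIF and tolerance are two views of the same quantity: if R_j² is the R² obtained by regressing predictor j on the other predictors, then tolerance = 1 - R_j² and VIF = 1 / tolerance. The sketch below uses these standard definitions with hypothetical function names. With only two predictors, R_j² is simply the squared correlation between them, so the abortion-rate example (r = -.801 between women's status and religiosity) can be checked by hand.

```python
def tolerance_from_r2(r2_j: float) -> float:
    """Tolerance of predictor j, given R^2 from regressing it on the other predictors."""
    return 1.0 - r2_j


def vif_from_r2(r2_j: float) -> float:
    """Variance inflation factor of predictor j (reciprocal of tolerance)."""
    return 1.0 / tolerance_from_r2(r2_j)


# Two-predictor case: R_j^2 is the squared correlation between x1 and x2.
r_x1_x2 = -0.801
tolerance = tolerance_from_r2(r_x1_x2 ** 2)
vif = vif_from_r2(r_x1_x2 ** 2)
print(round(tolerance, 3), round(vif, 2))

# Tolerance and VIF reported by SPSS are reciprocals of each other,
# e.g. a reported tolerance of .133 corresponds to a VIF of about 7.5.
print(round(1 / 0.133, 2))
```

By the rule of thumb above, a VIF of about 2.8 does not by itself signal a multicollinearity problem, even though the two predictors are strongly correlated.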
Assignment 6
1. We want to predict the student's expected grade in a course as a function of students' ratings of that course. The data in the table below are real and represent the mean rating of each of 8 courses on four different variables. These variables were: (X1) teaching skills of the instructor (Teach), (X2) quality of the tests and exams (Exam), (X3) the instructor's perceived knowledge of the subject matter (Knowledge), and (Y) the student's expected grade in the course (Grade) on a scale of 1 – 5.
Teach (X1)   Exam (X2)   Knowledge (X3)   Grade (Y)
3.8          3.8         4.5              4.6
2.8          3.2         3.8              2.5
2.2          1.9         3.9              2.9
3.5          3.5         4.1              3.6
3.2          2.8         3.5              3
3.7          3.8         4.2              4.2
4.1          4.8         4.5              4.6
4.2          4.1         4.7              4.9
a. Interpret the descriptive statistics for at least one variable.
b. Draw and interpret the scatter plot of expected grade (Grade) against Knowledge.
c. Interpret the simple correlation between expected grade (Grade) and knowledge of the course (Knowledge), including the hypothesis test.
d. Interpret the ANOVA table for the multiple regression.
e. Find the multiple regression equation with Teach, Exam and Knowledge as independent variables.
f. What would be the expected grade of the student in a course where 'Teach' was 4, 'Exam' 4.5 and 'Knowledge' 5?
g. Compute and interpret the multiple coefficient of determination within the context of this problem.
h. Would you consider removing any of these predictor variables from the model? Why or why not?
SPSS Output for question 1
Descriptive Statistics
            Mean     Std. Deviation   N
Teach       3.4375   .67810           8
Exam        3.4875   .87576           8
Knowledge   4.1500   .40708           8
Grade       3.7875   .91251           8
Correlations
                                 Teach    Exam     Knowledge   Grade
Teach       Pearson Correlation  1        .925**   .763*       .897**
            Sig. (2-tailed)               .001     .028        .003
            N                    8        8        8           8
Exam        Pearson Correlation  .925**   1        .747*       .795*
            Sig. (2-tailed)      .001              .033        .018
            N                    8        8        8           8
Knowledge   Pearson Correlation  .763*    .747*    1           .921**
            Sig. (2-tailed)      .028     .033                 .001
            N                    8        8        8           8
Grade       Pearson Correlation  .897**   .795*    .921**      1
            Sig. (2-tailed)      .003     .018     .001
            N                    8        8        8           8
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .981 (a)   .961       .933                .23695                       2.438
a. Predictors: (Constant), Knowledge, Exam, Teach
b. Dependent Variable: Grade
ANOVA (a)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   5.604            3    1.868         33.273   .003 (b)
  Residual     .225             4    .056
  Total        5.829            7
a. Dependent Variable: Grade
b. Predictors: (Constant), Knowledge, Exam, Teach
Coefficients (a)
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.)
Model          B        Std. Error   Beta    t        Sig.   Tolerance   VIF
1 (Constant)   -4.128   1.041                -3.966   .017
  Teach        1.089    .362         .809    3.009    .040   .133        7.508
  Exam         -.424    .272         -.407   -1.557   .195   .141        7.097
  Knowledge    1.362    .346         .607    3.941    .017   .405        2.467
a. Dependent Variable: Grade
Tests of Normality
                        Kolmogorov-Smirnov (a)      Shapiro-Wilk
                        Statistic   df   Sig.       Statistic   df   Sig.
Standardized Residual   .266        8    .100       .840        8    .075
2. Open the SPSS data file “survey_sample.sav.” This data file contains survey data, including demographic data and various attitude measures. It is based on a subset of variables from the 1998 NORC General Social Survey.
With these data, calculate and interpret:
a. Compute and interpret the simple correlation coefficient (hours per day watching TV and highest year of school completed).
b. Draw and interpret a scatter diagram of the variables from the previous item.
c. Compute and interpret the multiple correlation coefficient. (From here on, use the variables indicated below.)
Data:
Dependent variable: Total family income
Independent variables: Age of respondent; Highest year of school completed; Highest year school completed, father; Highest year school completed, mother; Highest year school completed, spouse.
d. Compute and interpret the multiple coefficient of determination within the context of this problem.
e. Compute and interpret the multiple regression equation. Is the model significant? (Test the hypotheses for multiple regression analysis.)
f. From the analysis performed, would you recommend removing any variable(s) that do not contribute significantly to the model?
g. What total family income would be expected for a 50-year-old participant who has 15 years of study and whose spouse completed 14 years of study?
h. Check whether the no-autocorrelation assumption is met (Durbin-Watson).
Correlations
                                                         Highest year of school completed   Hours per day watching TV
Highest year of school completed   Pearson Correlation   1                                  -.261**
                                   Sig. (2-tailed)                                          .000
                                   N                     2820                               2327
Hours per day watching TV          Pearson Correlation   -.261**                            1
                                   Sig. (2-tailed)       .000
                                   N                     2327                               2337
**. Correlation is significant at the 0.01 level (2-tailed).
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .318 (a)   .101       .096                1.162                        1.982
a. Predictors: (Constant), Highest year school completed, spouse; Age of respondent; Highest year school completed, mother; Highest year of school completed; Highest year school completed, father
ANOVAa
Model Sum of Squares df Mean Square F Sig.
1 Regression 131.397 5 26.279 19.479 .000b
Residual 1164.306 863 1.349
Total 1295.703 868
a. Dependent Variable: Total family income
b. Predictors: (Constant), Highest year school completed, spouse, Age of respondent, Highest year school completed, mother, Highest year of school completed, Highest year school completed, father
Coefficients (a)
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.)
Model                                     B       Std. Error   Beta    t        Sig.   Tolerance   VIF
1 (Constant)                              9.919   .289                 34.322   .000
  Age of respondent                       -.007   .003         -.078   -2.291   .022   .893        1.120
  Highest year of school completed        .087    .018         .199    4.728    .000   .586        1.706
  Highest year school completed, father   .008    .013         .025    .564     .573   .527        1.896
  Highest year school completed, mother   -.006   .016         -.016   -.371    .711   .556        1.799
  Highest year school completed, spouse   .058    .018         .130    3.247    .001   .646        1.548