Chapter 6. Multiple Regression and Correlation with answer. 2020.ppt

(1)

(2)

The Multiple Linear Regression Model



Multiple linear regression is an extension of the simple linear model that incorporates two or more independent variables. Multiple regression analysis produces an equation with several coefficients, depending on the number of independent variables X introduced into the model.

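$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$$

where:

$Y_i$: dependent (response) variable

$X_{1i}, \dots, X_{ki}$: independent (explanatory) variables

$\beta_0$: Y-intercept (constant term)

$\beta_1, \dots, \beta_k$: population slopes for each explanatory variable

$\varepsilon_i$: the model's error term (residual)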

(3)

The Multiple Linear Regression Model



Examples where multiple linear regression may be used include:

1. Prediction of Student Performance in Academic and Military Learning Environment: Use of Multiple Linear Regression Predictive Model and Hypothesis Testing.

https://eric.ed.gov/?id=EJ1151836

This paper determines whether there is a relationship between students' performance and influencing factors such as academic aptitude test score, time spent in physical training, and time spent on training need analysis (TNA) modules.

2. The relationship between students' math achievement and socioemotional development, math self-efficacy, school environment, and home environment.

3. What is the relationship between children's fifth-grade math achievement and children's math self-concept, and teachers' ratings of children's math proficiency and self-control?

Why is this important?

An outcome is rarely a function of just one variable; it is usually influenced by many variables. The idea is that we should obtain a more accurate predicted score by using multiple variables to predict the outcome.

(4)

Coefficient of determination



The coefficient of determination (R²) is a statistical metric that measures how much of the variation in the outcome can be explained by the variation in the independent variables. R² never decreases as more predictors are added to the MLR model, even when the added predictors are unrelated to the outcome variable.

R² by itself therefore cannot be used to identify which predictors should be included in a model and which should be excluded.

R² can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by any of the independent variables and 1 indicates that the outcome can be predicted without error from the independent variables.
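The claim that R² does not go down when a predictor is added, even an irrelevant one, can be checked numerically. The sketch below is only an illustration: it uses the eight-nation data from the example later in this chapter, a small hand-written least-squares solver (the names `solve` and `r_squared` are hypothetical helpers, not SPSS functions), and an artificial noise column generated with a fixed seed.

```python
import random

# Eight-nation example data from this chapter.
abortion_rate = [165, 100, 400, 208, 389, 379, 207, 428]       # Y
womens_status = [0.5, 0.45, 0.8, 0.54, 0.7, 0.52, 0.58, 0.84]  # x1
religiosity = [74, 93, 48, 67, 70, 55, 67, 35]                 # x2


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


rng = random.Random(0)  # fixed seed so the run is repeatable
noise = [rng.gauss(0, 1) for _ in abortion_rate]  # unrelated to Y by construction

r2_two = r_squared(abortion_rate, [womens_status, religiosity])
r2_three = r_squared(abortion_rate, [womens_status, religiosity, noise])
print(f"R^2 with two predictors:     {r2_two:.3f}")
print(f"R^2 after adding pure noise: {r2_three:.3f}")
```

The second value is never smaller than the first, which is why R² alone is not a safe guide for choosing predictors.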

(5)

Multiple Regression Example



The following table presents information on three variables for a small

sample of eight nations. We will take the abortion rate as the dependent

variable and examine the relationship with two variables: one measures

the status and power of women and the other measures religiosity.

Nation     Abortion Rate (Y)   Women's Status (x₁)   Religiosity (x₂)
Canada     165                 0.5                   74
Chile      100                 0.45                  93
Denmark    400                 0.8                   48
Germany    208                 0.54                  67
Italy      389                 0.7                   70
Japan      379                 0.52                  55
UK         207                 0.58                  67
US         428                 0.84                  35

The research question might be: "How much does each independent variable contribute to explaining the dependent variable?"

(6)

Step 1: Scatter Diagram



The regression line is the best straight-line description of the plotted points, and you can use it to describe the association between the variables.

If all the points fall exactly on the line, then every residual is 0 and you have a perfect relationship.

In the first chart, we can see a linear, direct, and strong relationship: nations where women have higher status tend to have higher abortion rates.
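As a rough illustration of "best straight line", the sketch below computes the least-squares line of abortion rate on women's status for the eight nations in the example. The helper name `least_squares_line` is hypothetical; the formulas are the usual slope = Sxy / Sxx and intercept = mean(Y) - slope * mean(X).

```python
# Eight-nation example data from this chapter.
womens_status = [0.5, 0.45, 0.8, 0.54, 0.7, 0.52, 0.58, 0.84]  # x1
abortion_rate = [165, 100, 400, 208, 389, 379, 207, 428]       # Y


def least_squares_line(x, y):
    """Return (intercept, slope) of the simple least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = least_squares_line(womens_status, abortion_rate)
print(f"fitted line: Y = {intercept:.2f} + {slope:.2f} * x1")
```

The positive slope matches the direct relationship seen in the scatter diagram.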

(7)

Step 2: Output from SPSS (simple correlation)



Correlations

                                            Abortion Rate (Y)   Women's Status (x₁)   Religiosity (x₂)
Abortion Rate (Y)     Pearson Correlation   1                   .817*                 -.842**
                      Sig. (2-tailed)                           0.013                 0.009
                      N                     8                   8                     8
Women's Status (x₁)   Pearson Correlation   .817*               1                     -.801*
                      Sig. (2-tailed)       0.013                                     0.017
                      N                     8                   8                     8
Religiosity (x₂)      Pearson Correlation   -.842**             -.801*                1
                      Sig. (2-tailed)       0.009               0.017
                      N                     8                   8                     8

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
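The Pearson correlations in this output can be reproduced from the data table. The following is a minimal sketch using only the Python standard library; `pearson_r` is a hypothetical helper name, and the significance values are not recomputed here.

```python
import math

# Eight-nation example data from this chapter.
abortion_rate = [165, 100, 400, 208, 389, 379, 207, 428]       # Y
womens_status = [0.5, 0.45, 0.8, 0.54, 0.7, 0.52, 0.58, 0.84]  # x1
religiosity = [74, 93, 48, 67, 70, 55, 67, 35]                 # x2


def pearson_r(x, y):
    """Pearson correlation: covariation of x and y over the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r_y_x1 = pearson_r(abortion_rate, womens_status)
r_y_x2 = pearson_r(abortion_rate, religiosity)
r_x1_x2 = pearson_r(womens_status, religiosity)
print(f"r(Y, x1)  = {r_y_x1:.3f}")
print(f"r(Y, x2)  = {r_y_x2:.3f}")
print(f"r(x1, x2) = {r_x1_x2:.3f}")
```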

(8)

Step 3: Steps for Multiple Regressions in SPSS



1. Click ANALYZE
2. Select REGRESSION
3. Click LINEAR
4. Move "Abortion Rate" to the DEPENDENT box
5. Move "Women's status" and "Religiosity" to the INDEPENDENT(S) box
6. Click OK
7. Continue the
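For readers without SPSS, the same two-predictor fit can be sketched by hand. This is not what SPSS does internally; it is the textbook deviation-score solution of the normal equations for exactly two predictors, with hypothetical helper names. The result should agree with the coefficients reported in the SPSS output later in this chapter.

```python
# Eight-nation example data from this chapter.
y = [165, 100, 400, 208, 389, 379, 207, 428]        # abortion rate
x1 = [0.5, 0.45, 0.8, 0.54, 0.7, 0.52, 0.58, 0.84]  # women's status
x2 = [74, 93, 48, 67, 70, 55, 67, 35]               # religiosity


def mean(values):
    return sum(values) / len(values)


def cross_deviation(u, v):
    """Sum of products of deviations from the means (Suv)."""
    mean_u, mean_v = mean(u), mean(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


s11 = cross_deviation(x1, x1)
s22 = cross_deviation(x2, x2)
s12 = cross_deviation(x1, x2)
s1y = cross_deviation(x1, y)
s2y = cross_deviation(x2, y)

# Closed-form least-squares solution, valid for exactly two predictors.
denominator = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / denominator
b2 = (s2y * s11 - s1y * s12) / denominator
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

print(f"Y-hat = {b0:.3f} + {b1:.3f} * x1 + ({b2:.3f}) * x2")
```

With more than two predictors the closed form becomes unwieldy, and a matrix solver or a statistics package is the more practical choice.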

(9)

Output: Multiple Correlation and Coefficient of Determination



Model Summaryᵇ

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .875ᵃ   0.765      0.671               73.19844                     1.569

a. Predictors: (Constant), Religiosity (x₂), Women's Status (x₁)
b. Dependent Variable: Abortion Rate (Y)

Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination (R²). How much of the variance in abortion rate is explained by the two independent variables?

Multiple correlation: R = .875. Combining the two independent variables improves on either simple correlation (.817 and -.842); in other words, there is a strong correlation between abortion rate and religiosity and women's status taken together.

Coefficient of determination: R² = .765. 76.5% of the variation in abortion rate can be explained by variation in religiosity and women's status.
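The Model Summary figures follow directly from the sums of squares in the ANOVA table for this example. The sketch below assumes the standard definitions R² = SSR / SST and adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1); the variable names are illustrative.

```python
import math

# Sums of squares from the ANOVA table of the eight-nation example.
ss_regression = 87171.94
ss_total = 113962.0
n = 8  # number of nations
k = 2  # number of independent variables

r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)
# Adjusted R^2 penalizes R^2 for the number of predictors used.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R            = {multiple_r:.3f}")
print(f"R^2          = {r_squared:.3f}")
print(f"Adjusted R^2 = {adjusted_r_squared:.3f}")
```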

Assumption of autocorrelation:

(10)

ANOVA Regression: Test for overall Significance

Hypotheses for the slopes

Statement of the hypotheses (the null hypothesis is that all slope coefficients are simultaneously equal to zero):

ANOVAᵃ

Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   87171.94         2    43585.971     8.135   .027ᵇ
  Residual     26790.06         5    5358.012
  Total        113962           7

a. Dependent Variable: Abortion Rate (Y)
b. Predictors: (Constant), Religiosity (x₂), Women's Status (x₁)

$H_0: \beta_1 = \beta_2 = 0$

$H_a: \text{at least one } \beta_j \neq 0$

Interpretation: Since Sig. (.027) < 0.05, we reject the null hypothesis, indicating that at least one of the explanatory variables is related to the abortion rate. We conclude that the model is useful for prediction.

(At least one independent variable affects Y.)

Test statistic:

$$F = \frac{MSR}{MSE} = \frac{SSR(\text{all})/k}{MSE(\text{all})}$$

where F has k numerator and (n - k - 1) denominator degrees of freedom.
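The F statistic in the ANOVA table can be recomputed from its sums of squares. The sketch below also recovers the Sig. value, but only through a closed form that holds when the numerator degrees of freedom equal 2 (as they do here with k = 2); for other values of k a statistics library or an F table is needed. The function name is hypothetical.

```python
# Values from the ANOVA table of the eight-nation example.
ss_regression = 87171.94
ss_residual = 26790.06
n = 8  # observations
k = 2  # independent variables

msr = ss_regression / k            # mean square regression
mse = ss_residual / (n - k - 1)    # mean square error
f_stat = msr / mse


def f_pvalue_numerator_df_2(f, df2):
    """Upper-tail probability of an F(2, df2) variable.

    Closed form valid ONLY when the numerator degrees of freedom are 2.
    """
    return (1 + 2 * f / df2) ** (-df2 / 2)


p_value = f_pvalue_numerator_df_2(f_stat, n - k - 1)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```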

(11)

Test for Significance: Individual Variables



Outliers that have been overlooked will show up, often as very large residuals.

The test for an individual variable shows whether there is a linear relationship between the variable $X_i$ and Y.

Use the t test statistic. Hypotheses:

$H_0: \beta_i = 0$ (No linear relationship.)

$H_1: \beta_i \neq 0$ (Linear relationship between $X_i$ and Y.)

Model                 Unstandardized Coefficients   Standardized Coefficients   t       Sig.
                      B         Std. Error          Beta
(Constant)            310.89    345.19                                          0.901   0.409
Women's Status (x₁)   348.41    317.47              0.398                       1.097   0.322
Religiosity (x₂)      -3.789    2.624               -0.523                      -1.44   0.208

a. Dependent Variable: Abortion Rate (Y)

t statistic for x₁ (women's status)
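Each t value in the coefficients table is the unstandardized coefficient divided by its standard error. The sketch below recomputes them from the table; the critical value 2.571 (two-tailed, 5% level, n - k - 1 = 5 degrees of freedom) is taken from a standard t table and is an addition for illustration, not part of the SPSS output.

```python
# (B, Std. Error) pairs from the coefficients table of the example.
coefficients = {
    "(Constant)": (310.89, 345.19),
    "Women's Status (x1)": (348.41, 317.47),
    "Religiosity (x2)": (-3.789, 2.624),
}

# Two-tailed 5% critical value of t with 5 degrees of freedom (standard t table).
T_CRITICAL_5_DF = 2.571

t_stats = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_stats.items():
    verdict = "significant" if abs(t) > T_CRITICAL_5_DF else "not significant"
    print(f"{name:22s} t = {t:6.3f}  ({verdict} at the 0.05 level)")
```

Neither slope is individually significant here, which agrees with the Sig. values .322 and .208 in the table, even though the overall F test is significant.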

(12)

Multiple Regression Equation



Find the multiple regression equation with Women's Status (x₁) and religiosity (x₂).

The model has the following equation:

$$\hat{Y} = 310.885 + 348.413\,x_1 - 3.789\,x_2$$

Abortion rate = 310.885 + 348.413 * status - 3.789 * religiosity

Religiosity is negatively related to abortion rate, and women's status is positively related to abortion rate.

Prediction: What abortion rate would be expected for a women's status of 0.75 and a religiosity of 90?

Abortion rate = 310.885 + 348.413 * 0.75 - 3.789 * 90 = 231.18

Therefore, the predicted abortion rate is 231.18.

For each one-unit increase in Women's Status, the estimated average abortion rate increases by 348.413, holding Religiosity constant.
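The prediction step is a direct substitution into the fitted equation. A minimal sketch, using the rounded coefficients from the equation above and a hypothetical function name:

```python
def predict_abortion_rate(womens_status, religiosity):
    """Fitted equation of the example, with coefficients rounded as in the text."""
    return 310.885 + 348.413 * womens_status - 3.789 * religiosity


predicted = predict_abortion_rate(womens_status=0.75, religiosity=90)
print(f"predicted abortion rate: {predicted:.2f}")
```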

(13)

Assumptions for parametric analysis: Regression



In regression analysis we examine the residuals of the model.

Residual

Linear model: $Y = \beta_0 + \beta_1 X + e$

Residual = Observed - Predicted

When you perform simple linear regression (or any other type of regression analysis), you get a line of best fit. The data points usually don't fall exactly on this regression line; they are scattered around it. A residual is the vertical distance between a data point and the regression line. Each data point has one residual. A residual is positive if the point is above the regression line and negative if it is below. If the regression line actually passes through the point, the residual at that point is zero.
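To make "Residual = Observed - Predicted" concrete, the sketch below computes the residual of each nation in the multiple regression example, using the fitted equation with its rounded coefficients. The sum of squared residuals should land very close to the Residual sum of squares in the ANOVA table (26790.06); small differences come from coefficient rounding.

```python
# (nation, abortion rate Y, women's status x1, religiosity x2) from the example.
nations = [
    ("Canada", 165, 0.5, 74),
    ("Chile", 100, 0.45, 93),
    ("Denmark", 400, 0.8, 48),
    ("Germany", 208, 0.54, 67),
    ("Italy", 389, 0.7, 70),
    ("Japan", 379, 0.52, 55),
    ("UK", 207, 0.58, 67),
    ("US", 428, 0.84, 35),
]


def predict(x1, x2):
    """Fitted equation of the example (rounded coefficients)."""
    return 310.885 + 348.413 * x1 - 3.789 * x2


# Residual = observed - predicted, one per data point.
residuals = {name: y - predict(x1, x2) for name, y, x1, x2 in nations}
sse = sum(e ** 2 for e in residuals.values())

for name, e in residuals.items():
    position = "above" if e > 0 else "below"
    print(f"{name:8s} residual = {e:8.2f}  (point lies {position} the fitted surface)")
print(f"sum of squared residuals = {sse:.2f}")
```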

(14)

Assumptions for parametric analysis: Regression



When using a parametric analysis such as regression and correlation analysis, the residuals must meet certain assumptions:

Linearity: The relationship between X and the mean of Y is linear; that is, the predictor variables have a straight-line relationship with the outcome variable. (Big deal if violated.)

Normality: Residuals should be normally distributed with a mean of 0 and variance σ². (Not as big a deal if violated when the sample size is large.)

Homoscedasticity: The variance of the residuals is homogeneous across levels of the predicted values; the errors are distributed with equal variance. Non-constant variance of the residuals is called heteroscedasticity. (Not as big a deal if violated.)

Autocorrelation: We assume that the errors of the model are independent of one another (i.e., the errors are not correlated). When this assumption is not met, typically in time-series research designs, the errors are said to be autocorrelated or dependent.

Independence: Observations are independent of each other. (Huge deal if violated!)

Multicollinearity: Refers to predictors that are correlated with other predictors. It occurs when the model includes multiple factors that are correlated not just with the response variable but also with each other; in other words, when some factors are redundant.

Note:

(15)



Example to observe whether the basic assumptions are fulfilled

Given a hypothetical sample of 20 patients with the following data: cholesterol level in blood plasma (in mg/100 ml), age (in years), saturated fat intake (in g/week), and exercise level (coded 0: no exercise, 1: moderate exercise, 2: intense exercise), fit a linear model relating cholesterol level to the other variables.

Run the analysis in statistical software and interpret the output. Note: answer the same questions as in the previous exercise (items a to h).

h. What cholesterol level would be expected for a 60-year-old patient who consumes 40 grams of fat per week and does not exercise?

Patient Cholesterol Age Fat Exercise

1 350 80 35 0

2 190 30 40 2

3 263 42 15 1

4 320 50 20 0

5 280 45 35 0

6 198 35 50 1

7 232 18 70 1

8 320 32 40 0

9 303 49 45 0

10 220 35 35 0

11 405 50 50 0

12 190 20 15 2

13 230 40 20 1

14 227 30 35 0

15 440 30 80 1

16 318 23 40 2

17 212 35 40 1

18 340 18 80 0

19 195 22 15 0

(16)

Assumptions for parametric analysis: Simple Regression



Linearity (big deal if violated)

The linearity assumption can best be tested with scatter plots, the

(17)

Assumptions for parametric analysis: Simple Regression



Normality

(18)



How to test Normality with SPSS

(19)

Assumptions for parametric analysis: Simple Regression



2. Next step: Save<Standardized

(20)



(21)



H₀: The residuals follow a normal distribution.

Hₐ: The residuals do not follow a normal distribution.

Because the Shapiro-Wilk Sig. (.722) is greater than .05, we cannot reject the null hypothesis; we therefore treat the residuals as normally distributed.

The resulting Q-Q plot is shown alongside. If the distribution is normal, we expect the points to cluster around the reference line.

Tests of Normality

                        Kolmogorov-Smirnovᵃ        Shapiro-Wilk
                        Statistic   df   Sig.      Statistic   df   Sig.
Standardized Residual   0.0886      20   .200*     0.9684      20   0.7218

(22)

Assumptions for parametric analysis: Multiple Regression



(23)

Assumptions for parametric analysis: Regression



Homoscedasticity

(24)

Assumptions for parametric analysis: Regression



Homoscedasticity

The figure below shows a random scatter of scores that takes on a rectangular shape with no clustering or systematic pattern, so the assumption of homoscedasticity is met.

A residual scatter plot shows predicted scores on one axis and errors of prediction on the other. An initial visual examination can isolate any outliers, otherwise known as extreme scores, in the data set. Tabachnick and Fidell (2007) explain that the variance of the residuals (the differences between the obtained and the predicted DV scores) should be the same for all predicted scores (homoscedasticity).

(25)

Assumptions for parametric analysis: Regression



Autocorrelation

The Durbin-Watson statistic tests the null hypothesis that the residuals from an ordinary least-squares regression are not autocorrelated.

The Durbin-Watson statistic ranges in value from 0 to 4. A value near 2 indicates no autocorrelation; a value toward 0 indicates positive autocorrelation; a value toward 4 indicates negative autocorrelation.

In the example, Durbin-Watson = 1.877. This value is close to 2, indicating that the residuals are not autocorrelated; we can therefore assume that the no-autocorrelation assumption is met.

Model Summaryᵇ

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .690ᵃ   .477       .378                58.153                       1.877
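The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Because the full cholesterol data set is not reproduced here, the sketch below applies the formula to the eight-nation example instead, taking the rows in their listed order and using the rounded fitted equation; the result can be compared with the 1.569 reported in that example's Model Summary. The function name is hypothetical.

```python
# (abortion rate Y, women's status x1, religiosity x2), in the order of the data table.
rows = [
    (165, 0.5, 74), (100, 0.45, 93), (400, 0.8, 48), (208, 0.54, 67),
    (389, 0.7, 70), (379, 0.52, 55), (207, 0.58, 67), (428, 0.84, 35),
]

# Residuals from the fitted equation of the example (rounded coefficients).
residuals = [y - (310.885 + 348.413 * x1 - 3.789 * x2) for y, x1, x2 in rows]


def durbin_watson(e):
    """DW = sum of squared successive differences / sum of squared residuals."""
    numerator = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    denominator = sum(r ** 2 for r in e)
    return numerator / denominator


dw = durbin_watson(residuals)
print(f"Durbin-Watson = {dw:.3f}")
```

The statistic depends on the order of the observations, so it is most meaningful when the rows have a natural sequence such as time.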

(26)

Assumptions for parametric analysis: Regression



Multicollinearity exists when independent variables in a regression equation are highly correlated among themselves. In other words, multicollinearity in regression analysis refers to how strongly interrelated the independent variables in a model are. When multicollinearity is too high, the individual parameter estimates become difficult to interpret. Most regression programs can compute variance inflation factors (VIF) for each variable. As a rule of thumb:

A VIF above 5 to 10 suggests problems with multicollinearity: a variable with VIF > 5 is highly correlated with the other explanatory variables.

Tolerance (1/VIF) should be > 0.1 (no evidence of collinearity).
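With exactly two predictors, tolerance and VIF follow from the single correlation between them: tolerance = 1 - r², VIF = 1 / tolerance. The sketch below applies this to the eight-nation example, using the correlation of -.801 between women's status and religiosity from its correlation table. With more predictors, r² is replaced by the R² from regressing each predictor on all the others; the function name is hypothetical.

```python
def tolerance_and_vif_two_predictors(r12):
    """Tolerance and VIF when the model has exactly two predictors.

    r12 is the Pearson correlation between the two predictors.
    """
    tolerance = 1 - r12 ** 2
    vif = 1 / tolerance
    return tolerance, vif


# Correlation between women's status and religiosity in the example.
tolerance, vif = tolerance_and_vif_two_predictors(-0.801)
print(f"tolerance = {tolerance:.3f}, VIF = {vif:.2f}")
print("possible multicollinearity problem" if vif > 5 else "VIF below the rule-of-thumb threshold of 5")
```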

(27)

Assignment 6



1. We want to predict the student's expected grade in a course as a function of three course-rating variables. The data in the table below are real and represent the mean rating of each of 8 courses on four different variables. These variables were: (X1) teaching skills of the instructor (Teach), (X2) quality of the tests and exams (Exam), (X3) the instructor's perceived knowledge of the subject matter (Knowledge), and (Y) the student's expected grade in the course on a scale of 1 to 5 (Grade).

Teach (X1) Exam (X2) Knowledge (X3) Grade (Y)

3.8 3.8 4.5 4.6

2.8 3.2 3.8 2.5

2.2 1.9 3.9 2.9

3.5 3.5 4.1 3.6

3.2 2.8 3.5 3

3.7 3.8 4.2 4.2

4.1 4.8 4.5 4.6

4.2 4.1 4.7 4.9

a. Interpret the descriptive statistics for at least one variable.

b. Draw and interpret the scatter plot of expected grade (Grade) against Knowledge.

c. Interpret the simple correlation between expected grade (Grade) and knowledge of the course (Knowledge), and state the hypotheses.

d. Interpret the ANOVA for the multiple regression.

e. Find the multiple regression equation with Teach, Exam and Knowledge as independent variables.

f. What would be the expected grade of a student in a course where 'Teach' was 4, 'Exam' 4.5 and 'Knowledge' 5?

g. Compute and interpret the multiple coefficient of determination within the context of this problem.

h. Would you consider removing any of these predictor variables from the model? Why or why not?

(28)

Assignment 6



SPSS Output for question 1

Descriptive Statistics

            Mean     Std. Deviation   N
Teach       3.4375   .67810           8
Exam        3.4875   .87576           8
Knowledge   4.1500   .40708           8
Grade       3.7875   .91251           8

Correlations

                                  Teach    Exam     Knowledge   Grade
Teach       Pearson Correlation   1        .925**   .763*       .897**
            Sig. (2-tailed)                .001     .028        .003
            N                     8        8        8           8
Exam        Pearson Correlation   .925**   1        .747*       .795*
            Sig. (2-tailed)       .001              .033        .018
            N                     8        8        8           8
Knowledge   Pearson Correlation   .763*    .747*    1           .921**
            Sig. (2-tailed)       .028     .033                 .001
            N                     8        8        8           8
Grade       Pearson Correlation   .897**   .795*    .921**      1
            Sig. (2-tailed)       .003     .018     .001
            N                     8        8        8           8

(29)

Assignment 6



Model Summaryᵇ

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .981ᵃ   .961       .933                .23695                       2.438

a. Predictors: (Constant), Knowledge, Exam, Teach
b. Dependent Variable: Grade

ANOVAᵃ

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   5.604            3    1.868         33.273   .003ᵇ
  Residual     .225             4    .056
  Total        5.829            7

a. Dependent Variable: Grade
b. Predictors: (Constant), Knowledge, Exam, Teach

Coefficientsᵃ

Model          Unstandardized Coefficients   Standardized Coefficients   t        Sig.   Collinearity Statistics
               B        Std. Error           Beta                                        Tolerance   VIF
1 (Constant)   -4.128   1.041                                            -3.966   .017
  Teach        1.089    .362                 .809                        3.009    .040   .133        7.508
  Exam         -.424    .272                 -.407                       -1.557   .195   .141        7.097
  Knowledge    1.362    .346                 .607                        3.941    .017   .405        2.467

a. Dependent Variable: Grade

Tests of Normality

                        Kolmogorov-Smirnovᵃ        Shapiro-Wilk
                        Statistic   df   Sig.      Statistic   df   Sig.
Standardized Residual   .266        8    .100      .840        8    .075

(30)

Assignment 6



2. Open the SPSS data file "survey_sample.sav". This data file contains survey data, including demographic data and various attitude measures. It is based on a subset of variables from the 1998 NORC General Social Survey.

With these data, calculate and interpret:

a. Compute and interpret the simple correlation coefficient (hours per day watching TV and highest year of school completed).

b. Draw and interpret a scatter diagram of the variables from the previous item.

c. Compute and interpret the multiple correlation coefficient. (From here on, use the variables indicated below.)

Data:

Dependent variable: Total family income

Independent variables: Age of respondent; Highest year of school completed; Highest year school completed, father; Highest year school completed, mother; Highest year school completed, spouse.

d. Compute and interpret the multiple coefficient of determination within the context of this problem.

e. Compute and interpret the multiple regression equation. Is the model significant? (State and test the hypotheses for multiple regression analysis.)

f. From the analysis performed, would you recommend removing any variable(s) that do not contribute significantly to the model?

g. What total family income would be expected for a 50-year-old respondent who has 15 years of schooling and whose spouse completed 14 years of schooling?

h. Check whether the no-autocorrelation assumption is met (Durbin-Watson).

(31)

Assignment 6



Correlations

                                                         Highest year of school completed   Hours per day watching TV
Highest year of school completed   Pearson Correlation   1                                  -.261**
                                   Sig. (2-tailed)                                          .000
                                   N                     2820                               2327
Hours per day watching TV          Pearson Correlation   -.261**                            1
                                   Sig. (2-tailed)       .000
                                   N                     2327                               2337

**. Correlation is significant at the 0.01 level (2-tailed).

Model Summaryᵇ

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .318ᵃ   .101       .096                1.162                        1.982

a. Predictors: (Constant), Highest year school completed, spouse, Age of respondent, Highest year school completed, mother, Highest year of school completed, Highest year school completed, father

(32)

Assignment 6



ANOVAᵃ

Model          Sum of Squares   df    Mean Square   F        Sig.
1 Regression   131.397          5     26.279        19.479   .000ᵇ
  Residual     1164.306         863   1.349
  Total        1295.703         868

a. Dependent Variable: Total family income
b. Predictors: (Constant), Highest year school completed, spouse, Age of respondent, Highest year school completed, mother, Highest year of school completed, Highest year school completed, father

Coefficientsᵃ

Model                                     Unstandardized Coefficients   Standardized Coefficients   t        Sig.   Collinearity Statistics
                                          B       Std. Error            Beta                                        Tolerance   VIF
1 (Constant)                              9.919   .289                                              34.322   .000
  Age of respondent                       -.007   .003                  -.078                       -2.291   .022   .893        1.120
  Highest year of school completed        .087    .018                  .199                        4.728    .000   .586        1.706
  Highest year school completed, father   .008    .013                  .025                        .564     .573   .527        1.896
  Highest year school completed, mother   -.006   .016                  -.016                       -.371    .711   .556        1.799
  Highest year school completed, spouse   .058    .018                  .130                        3.247    .001   .646        1.548
