Chapter VII
Multiple Regression and Correlation Analysis
Purpose of Chapter
To establish the relationship between three or more variables: Correlation Analysis.
To establish a mathematical model to estimate the value of a variable based on the values of the other variables: Regression Analysis.
Multiple Regression Analysis Technique
Multiple regression analysis applies under the same conditions as simple regression analysis, except that more than two variables are involved and one variable is assumed to depend on the others.
• Illustrative research question(s) this technique can answer:
Are sales significantly affected by advertising expenditures and price (where all three variables are measured in RWF)? What proportion of the variation in sales is accounted for by advertising and price? How sensitive are sales to changes in advertising and price?
Logistic Regression Analysis
Logistic regression analysis is used to examine relationships between variables when the dependent variable is nominal, whether the independent variables are nominal, ordinal, interval, or some mixture thereof. Suppose that one wanted to determine which program interventions were associated with a JOBS Program client's ability to get a job within six months of exiting the program. The outcome variable would be "job" or "no job", clearly a nominal variable. One could then use several independent variables, such as job training, post-secondary education and the like, to predict the odds of getting a job.
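As a minimal sketch of how such a model turns predictors into odds, the snippet below evaluates a logistic model for the "job" / "no job" outcome. The coefficient values and the names `job_training` and `post_secondary` are hypothetical illustrations; they are not estimates from any JOBS Program data.

```python
import math

# Hypothetical coefficients on the log-odds scale (NOT estimated from real data).
INTERCEPT = -1.0
B_JOB_TRAINING = 0.8     # assumed effect of job training (0 = no, 1 = yes)
B_POST_SECONDARY = 0.5   # assumed effect of post-secondary education (0 = no, 1 = yes)


def log_odds(job_training, post_secondary):
    """Linear predictor of the logistic model."""
    return (INTERCEPT
            + B_JOB_TRAINING * job_training
            + B_POST_SECONDARY * post_secondary)


def odds(job_training, post_secondary):
    """Odds of 'job' versus 'no job'."""
    return math.exp(log_odds(job_training, post_secondary))


def probability(job_training, post_secondary):
    """Probability of 'job', obtained from the odds."""
    o = odds(job_training, post_secondary)
    return o / (1.0 + o)


# Odds ratio for job training: how the odds change when training goes from 0 to 1.
odds_ratio_training = odds(1, 0) / odds(0, 0)
print(round(odds_ratio_training, 3), round(probability(1, 1), 3))
```

Under these assumed coefficients, the exponential of each coefficient is the factor by which the odds of getting a job are multiplied when that predictor changes from 0 to 1.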
ANALYSIS FOR MULTIPLE REGRESSION:
Multiple linear regression is an extension of the simple model that incorporates two or more independent variables. Multiple regression analysis produces an equation with one coefficient for each independent variable X introduced into the model, so the fitted surface is a hyperplane rather than a line:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$$
where:
$Y_i$ is the value of the dependent variable for some unit i (dropping the error term gives the predicted value $\hat{Y}_i$)
$X_{1i}, X_{2i}, \dots, X_{ki}$ are the values of the independent variables for unit i
$\beta_1, \beta_2, \dots, \beta_k$ are the regression coefficients
$\beta_0$ is the Y-intercept, representing the prediction for Y when all independent variables are set to zero
$\beta_1$ = the slope of Y with respect to the variable $X_1$, holding $X_2, X_3, \dots, X_k$ constant
$\beta_2$ = the slope of Y with respect to the variable $X_2$, holding $X_1, X_3, \dots, X_k$ constant
….
$\beta_k$ = the slope of Y with respect to the variable $X_k$, holding the other variables constant
$\varepsilon_i$ = random error in Y for observation i
Example
Some film distributors offer discount cinema tickets via email. A random sample of 9 moviegoers is taken, and for each person the number of movies seen in the last year (Y), their age (X1) and income (X2), and the number of discount cinema tickets they have received via email in the last year (X3) are recorded.
The data are used to develop a multiple regression model to predict the number of times an individual goes to the movies per year from their age, income and number of discount cinema tickets received. The following results are obtained.
Table 4.4. Association between number of movies seen at a cinema last year (y), age (X1), income (X2), and number of emailed discount cinema tickets received last year (X3)
Movies seen (y)   Age (X1)   Income (X2)   Tickets received (X3)
7                 45         22            2
10                42         30            3
8                 50         30            2
12                27         55            4
11                29         50            3
14                30         60            4
15                25         58            5
7                 50         20            2
11                27         47            3
a. Determine the overall significance of the regression equation by performing an appropriate F test at α = .05.
b. Perform appropriate tests to determine the significance of the predictors. Use α = .05 and show all working.
c. Interpret the following SPSS output.
Solution
We first carry out correlation analysis and simple regression.
1º. Draw the scatter plots to see the trend and the relationship between the variables under study (in SPSS: Graphs > Legacy Dialogs > Scatter/Dot > Simple Scatter).
Interpretation: The number of movies seen at a cinema last year has a strong direct association with income and with the number of emailed discount tickets, while age has a strong inverse association with it.
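The direction and strength of these associations can be checked numerically. The sketch below computes Pearson correlations between y and each predictor from the nine observations of Table 4.4, using only the Python standard library. The last two rows are taken as (7, 50, 20, 2) and (11, 27, 47, 3), which is an assumption about how the final rows of the table should be read.

```python
import math

# Data from Table 4.4 (nine moviegoers); last two rows are an assumed reading.
movies = [7, 10, 8, 12, 11, 14, 15, 7, 11]       # y
age = [45, 42, 50, 27, 29, 30, 25, 50, 27]       # X1
income = [22, 30, 30, 55, 50, 60, 58, 20, 47]    # X2
tickets = [2, 3, 2, 4, 3, 4, 5, 2, 3]            # X3


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


correlations = {
    "Age (X1)": pearson_r(age, movies),
    "Income (X2)": pearson_r(income, movies),
    "Tickets (X3)": pearson_r(tickets, movies),
}

for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

A negative coefficient for age and positive coefficients for income and tickets agree with the interpretation of the scatter plots.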
Steps for Multiple Regressions in SPSS
Interpret the following SPSS output.
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .987 (a)   .974       .959                .58251                       2.023
a. Predictors: (Constant), Number of emailed discount cinema tickets received last year (X3), Age (X1), Income (X2)
b. Dependent Variable: Number of movies seen at a cinema last year (y)
R = .987: the fit improves when the independent variables are used jointly; there is a strong multiple correlation between the number of movies seen at a cinema last year and the number of emailed discount cinema tickets received last year, age and income.
R2 = .974: 97.4% of the total variation in the number of movies seen at a cinema last year can be explained by variation in the number of emailed discount cinema tickets received last year, age and income.
Durbin-Watson = 2.023: the residuals are not autocorrelated, because the value is close to 2.
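The Model Summary figures can be recovered from the sums of squares that SPSS reports in the ANOVA table of the same output. The sketch below does that arithmetic and also shows how a Durbin-Watson statistic is computed from a list of residuals; the helper name `durbin_watson` and the toy residuals in the last line are illustrative assumptions, since the residuals of the cinema model are not listed in the output.

```python
import math

# Values reported in the SPSS ANOVA table for the cinema example.
ss_regression = 64.526
ss_residual = 1.697
ss_total = 66.222
n = 9  # number of observations
k = 3  # number of predictors

r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
std_error_estimate = math.sqrt(ss_residual / (n - k - 1))


def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


print(f"R = {multiple_r:.3f}, R2 = {r_squared:.3f}, "
      f"adjusted R2 = {adjusted_r_squared:.3f}, SE = {std_error_estimate:.4f}")

# Toy residuals that alternate in sign give a value well above 2
# (negative autocorrelation); values near 2 suggest no autocorrelation.
print(durbin_watson([1, -1, 1, -1]))
```

The computed values agree with the SPSS table up to rounding of the reported sums of squares.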
SPSS output
ANOVA (b)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   64.526           3    21.509        63.387   .000 (a)
  Residual     1.697            5    .339
  Total        66.222           8
a. Predictors: (Constant), Number of emailed discount cinema tickets received last year (X3), Age (X1), Income (X2)
b. Dependent Variable: Number of movies seen at a cinema last year (y)
Rejecting the null hypothesis indicates that at least one of the independent variables X1, X2, ..., Xk contributes significantly to the model and, as such, could be useful for estimating the mean of Y.
Hypothesis of Slope
Statement of the hypotheses:
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$ (no linear relationship: all the coefficients are simultaneously equal to zero)
$H_1:$ at least one $\beta_j \neq 0$ (at least one independent variable affects Y)
Test statistic: F = 63.387, Sig. = .000
Making a decision and interpreting the result of the test:
Since Sig. < 0.05, we reject the null hypothesis: at least one of the explanatory variables affects the number of movies seen at a cinema last year.
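For question (a), the F statistic can be worked out by hand from the ANOVA table. The sketch below repeats that arithmetic and compares the result with a critical value. The value 5.41 is the tabulated upper 5% point of the F distribution with 3 and 5 degrees of freedom; it comes from a standard F table, not from the SPSS output.

```python
# Sums of squares and degrees of freedom from the SPSS ANOVA table.
ss_regression, df_regression = 64.526, 3
ss_residual, df_residual = 1.697, 5

ms_regression = ss_regression / df_regression  # mean square for regression
ms_residual = ss_residual / df_residual        # mean square error

f_statistic = ms_regression / ms_residual

# Tabulated critical value F(0.05; 3, 5), taken from a standard F table.
F_CRITICAL_05 = 5.41

reject_h0 = f_statistic > F_CRITICAL_05
print(f"F = {f_statistic:.2f}, reject H0 at alpha = .05: {reject_h0}")
```

The small difference from the reported F = 63.387 comes from rounding in the printed sums of squares; the decision is the same as the one based on Sig. < 0.05.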
Method: Enter
As the coefficients table shows, the number of emailed discount cinema tickets received last year and, more weakly, income are related to the number of movies seen at a cinema last year.
The model has the following equation:
$$\hat{Y} = 0.430 + 0.031\,X_1 + 0.093\,X_2 + 1.665\,X_3$$
However, only the variable X3 (number of emailed discount cinema tickets received last year) is significant at the .05 level (Sig. = .010).
Coefficients (a)
                                                                    Unstandardized Coefficients   Standardized Coefficients
Model                                                               B        Std. Error           Beta                        t       Sig.
1 (Constant)                                                        .430     3.345                                            .129    .903
  Age (X1)                                                          .031     .053                 .112                        .581    .586
  Income (X2)                                                       .093     .040                 .511                        2.311   .069
  Number of emailed discount cinema tickets received last year (X3) 1.665    .417                 .610                        3.996   .010
Does the number of emailed discount cinema tickets received last year have a significant effect on the number of movies seen at a cinema last year? Test at α = .05.
$H_0: \beta_3 = 0$ versus $H_1: \beta_3 \neq 0$
Test statistic: t = 3.996, Sig. = .010
Making a decision and interpreting the result of the test:
For the number of emailed discount cinema tickets received last year, Sig. = .010 < .05, so there is evidence of a significant effect: this variable is an important predictor.
For income, Sig. = .069, which is above .05 but below .10, so income has some influence as a predictor, although its effect is weaker than that of the previous variable. In the same way, age shows no evidence of influence as a predictor (Sig. = .586).
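For question (b), each t statistic is the unstandardized coefficient divided by its standard error, with n - k - 1 = 5 degrees of freedom. The sketch below repeats this for the three predictors. The critical values 2.571 (two-sided, α = .05) and 2.015 (two-sided, α = .10) are standard t-table values for 5 degrees of freedom and are not part of the SPSS output.

```python
# (B, Std. Error) pairs copied from the SPSS Coefficients table.
coefficients = {
    "Age (X1)": (0.031, 0.053),
    "Income (X2)": (0.093, 0.040),
    "Tickets (X3)": (1.665, 0.417),
}

# Two-sided critical values of t with 5 degrees of freedom (standard table values).
T_CRITICAL_05 = 2.571
T_CRITICAL_10 = 2.015

# t statistic for H0: beta_j = 0 is the coefficient over its standard error.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

significant_05 = [name for name, t in t_values.items() if abs(t) > T_CRITICAL_05]
significant_10 = [name for name, t in t_values.items() if abs(t) > T_CRITICAL_10]

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
print("Significant at .05:", significant_05)
print("Significant at .10:", significant_10)
```

The hand-computed t values differ slightly from the SPSS column because the printed coefficients and standard errors are rounded to three decimals.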
Using the Regression Equation to Make Predictions
Predict the number of movies seen at a cinema next year for a person who receives 6 emailed discount cinema tickets, has an income of 62 and is 20 years old.
$$\hat{Y} = 0.430 + 0.031(20) + 0.093(62) + 1.665(6) = 16.806 \approx 17$$
The predicted average number of movies seen is about 17 per year.
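The coefficients and this prediction can be reproduced from the raw data of Table 4.4. The sketch below fits the model by solving the normal equations with plain Gaussian elimination, using only the Python standard library; it is one simple way to obtain least-squares estimates and is not a description of how SPSS computes them. The helper names `solve` and `fit_ols` are illustrative, and the last two data rows are an assumed reading of the final rows of the table.

```python
# Rows of Table 4.4 as (y, X1 age, X2 income, X3 tickets);
# the last two rows are an assumed reading of the table.
data = [
    (7, 45, 22, 2), (10, 42, 30, 3), (8, 50, 30, 2),
    (12, 27, 55, 4), (11, 29, 50, 3), (14, 30, 60, 4),
    (15, 25, 58, 5), (7, 50, 20, 2), (11, 27, 47, 3),
]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(predictor_rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] from the normal equations."""
    design = [[1.0] + list(row) for row in predictor_rows]  # add intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


y = [row[0] for row in data]
predictors = [row[1:] for row in data]
b0, b1, b2, b3 = fit_ols(predictors, y)

# Prediction for age 20, income 62 and 6 emailed discount tickets.
prediction = b0 + b1 * 20 + b2 * 62 + b3 * 6

print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}, b3={b3:.3f}")
print(f"Predicted movies per year: {prediction:.2f}")
```

Because the prediction point (age 20, 6 tickets) lies outside the range of the sample, the result is an extrapolation and should be read with caution.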
Partial Correlation
The techniques of partial correlation can be used when a researcher wishes to observe how a specific bivariate relationship behaves in the presence of a third variable. By observing the partial correlation coefficients, we can identify direct, spurious or intervening relationships.
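A first-order partial correlation can be computed directly from the three pairwise (zero-order) correlations with the standard formula r_ab.c = (r_ab - r_ac r_bc) / sqrt((1 - r_ac^2)(1 - r_bc^2)). The text does not state this formula, so it is supplied here as the usual textbook definition. As an illustration, the sketch uses the correlations reported in the SPSS Correlations table of review problem 2 (abortion rate, women's status, religiosity); the function name `partial_correlation` is illustrative.

```python
import math


def partial_correlation(r_ab, r_ac, r_bc):
    """First-order partial correlation between a and b, controlling for c."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# Zero-order correlations from the Correlations table of review problem 2.
r_abortion_status = 0.817        # abortion rate (Y) with women's status (x1)
r_abortion_religiosity = -0.842  # abortion rate (Y) with religiosity (x2)
r_status_religiosity = -0.801    # women's status (x1) with religiosity (x2)

r_partial = partial_correlation(
    r_abortion_status, r_abortion_religiosity, r_status_religiosity
)
print(f"Zero-order r = {r_abortion_status:.3f}, partial r = {r_partial:.3f}")
```

When the partial correlation is clearly smaller than the zero-order correlation, as here, part of the bivariate association is shared with the control variable; a partial correlation close to the zero-order one would instead point to a direct relationship.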
Chapter review problems
1. A market study for the self-service retailer "Nakomatt" analyzes the annual amount that families of four or more members spend on food. It is thought that three independent variables are related to the cost of food. These variables are: total household income, family size and whether the family has children in college.
Family   Expenditure on food   Household income ($1000)   Family size   Children in college
1        3900                  37.6                       4             0
2        5300                  51.5                       5             1
3        4300                  41.6                       4             0
4        4900                  46.8                       5             0
5        6400                  53.8                       6             1
6        7300                  62.6                       7             1
7        4900                  54.3                       5             0
8        5300                  52.7                       4             0
9        6100                  60.8                       5             1
10       6400                  63.5                       6             1
11       7400                  64.2                       8             1
12       5800                  56.3                       5             0
Interpret each table of the SPSS output.
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .969 (a)   .939       .916                320.393                      2.559
a. Predictors: (Constant), Children in college, Household income ($1000), Family size
b. Dependent Variable: Expenditure on food
a. Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination (R2). How much of the variance in expenditure on food is explained by the three independent variables?
b. Assess the assumption of uncorrelated residuals using the Durbin-Watson statistic:____________________________________
Coefficients (a)
                             Unstandardized Coefficients   Standardized Coefficients
Model                        B         Std. Error          Beta                        t       Sig.
(Constant)                   35.405    767.913                                         .046    .964
Household income ($1000)     63.753    18.391              .493                        3.467   .008
Family size                  386.805   131.609             .432                        2.939   .019
Children in college          275.684   275.936             .131                        .999    .347
a. Dependent Variable: Expenditure on food
c. Find the multiple regression equation with household income (x1), family size (x2) and children in college (x3).
d. What expenditure on food would be expected for a family of 5 members with no children in college and an income of $60,000?
2. The table below presents information on three variables for a small sample of eight nations. We will take abortion rate as the dependent variable and examine its relationship with two variables: one measures women's status and power and the other measures religiosity.
Nation    Abortion Rate (Y)   Women's Status (x1)   Religiosity (x2)
Canada    165                 0.5                   74
Chile     100                 0.45                  93
Denmark   400                 0.8                   48
Germany   208                 0.54                  67
Italy     389                 0.7                   70
Japan     379                 0.52                  55
UK        207                 0.58                  67
US        428                 0.84                  35
The SPSS output is given below; verify and interpret all results:
Correlations
                                            Abortion Rate (Y)   Women's Status (x1)   Religiosity (x2)
Abortion Rate (Y)     Pearson Correlation   1                   .817*                 -.842**
                      Sig. (2-tailed)                           .013                  .009
                      N                     8                   8                     8
Women's Status (x1)   Pearson Correlation   .817*               1                     -.801*
                      Sig. (2-tailed)       .013                                      .017
                      N                     8                   8                     8
Religiosity (x2)      Pearson Correlation   -.842**             -.801*                1
                      Sig. (2-tailed)       .009                .017
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
a. Interpret the coefficient of correlation for each pair:_________________________________
Scatter Plot:
Interpret:___________________________ Interpret:___________________________
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .875 (a)   .765       .671                73.19844                     1.569
a. Predictors: (Constant), Religiosity (x2), Women's Status (x1)
b. Dependent Variable: Abortion Rate (Y)
b. Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination (R2). How much of the variance in abortion rate is explained by the two independent variables?
c. Assess the assumption of uncorrelated residuals using the Durbin-Watson statistic:__________________________________
ANOVA (a)
Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   87171.942        2    43585.971     8.135   .027 (b)
  Residual     26790.058        5    5358.012
  Total        113962.000       7
a. Dependent Variable: Abortion Rate (Y)
b. Predictors: (Constant), Religiosity (x2), Women's Status (x1)
d. Interpret ANOVA table:_______________________________________________________
Coefficients (a)
                        Unstandardized Coefficients   Standardized Coefficients
Model                   B         Std. Error          Beta                        t        Sig.
1 (Constant)            310.885   345.190                                         .901     .409
  Women's Status (x1)   348.413   317.472             .398                        1.097    .322
  Religiosity (x2)      -3.789    2.624               -.523                       -1.444   .208
a. Dependent Variable: Abortion Rate (Y)
e. Find the multiple regression equation with Women's Status (x1) and religiosity (x2).
g. Assess the assumption that the errors have a normal distribution, using the graph and the tests of normality.
Tests of Normality
                            Kolmogorov-Smirnov (a)        Shapiro-Wilk
                            Statistic   df   Sig.         Statistic   df   Sig.
Unstandardized Residual     .300        8    .033         .756        8    .009
a. Lilliefors Significance Correction
h. Interpret these residual graphs with respect to independence of the error term and homoscedasticity:______
3. For a hypothetical sample of 20 patients, the following data have been collected: cholesterol level in blood plasma (in mg/100 ml), age (in years), saturated fat intake (in g/week) and level of exercise (quantified as 0: no exercise, 1: moderate exercise and 2: intense exercise). A linear model is fitted between cholesterol level and the other variables.
Patient Cholesterol Age Fat Exercise
1 350 80 35 0
2 190 30 40 2
3 263 42 15 1
4 320 50 20 0
5 280 45 35 0
6 198 35 50 1
7 232 18 70 1
9 303 49 45 0
10 220 35 35 0
11 405 50 50 0
12 190 20 15 2
13 230 40 20 1
14 227 30 35 0
15 440 30 80 1
16 318 23 40 2
17 212 35 40 1
18 340 18 80 0
19 195 22 15 0
20 223 41 34 0
The SPSS output is given below; verify and interpret all results:
a. Interpret the coefficient of correlation for each pair:____________________________________
_________________________________________________________________________________
Correlations
                                    Cholesterol   Age     Fat     Exercise
Cholesterol   Pearson Correlation   1             .342    .494*   -.286
              Sig. (2-tailed)                     .139    .027    .221
              N                     20            20      20      20
Age           Pearson Correlation   .342          1       -.246   -.426
              Sig. (2-tailed)       .139                  .295    .061
              N                     20            20      20      20
Fat           Pearson Correlation   .494*         -.246   1       -.041
              Sig. (2-tailed)       .027          .295            .864
              N                     20            20      20      20
Exercise      Pearson Correlation   -.286         -.426   -.041   1
              Sig. (2-tailed)       .221          .061    .864
              N                     20            20      20      20
*. Correlation is significant at the 0.05 level (2-tailed).
b. Interpret the multiple correlation coefficient (R):
_________________________________________________________________________________________________
c. Interpret the coefficient of multiple determination (R2). How much of the variance in cholesterol is explained by the three independent variables?
_________________________________________________________________________________________________
d. Assess the assumption of uncorrelated residuals using the Durbin-Watson statistic:
_________________________________________________________________________________________________
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .690 (a)   0.477      0.378               58.15349                     1.877
ANOVA (a)
Model          Sum of Squares   df   Mean Square   F      Sig.
1 Regression   49275.942        3    16425.314     4.86   .014 (b)
  Residual     54109.258        16   3381.829
  Total        103385.2         19
a. Dependent Variable: Cholesterol
b. Predictors: (Constant), Exercise, Fat, Age
Interpret ANOVA table:_________________________________________________________________________________
Coefficients (a)
               Unstandardized Coefficients   Standardized Coefficients
Model          B        Std. Error           Beta                        t       Sig.
1 (Constant)   99.937   61.275                                           1.631   .122
  Age          2.346    1.056                .464                        2.223   .041
  Fat          2.306    .720                 .606                        3.201   .006
  Exercise     -6.248   19.831               -.064                       -.315   .757
a. Dependent Variable: Cholesterol
Write the model from the table of coefficients:______________________________________
Assumption: Errors have a normal distribution
Interpret:_______________________________________________________