Chapter VII. Multiple Regression and Correlation. 2015.doc


Chapter VII

Multiple Regression and Correlation Analysis ...

Purpose of Chapter

To establish the relationship between three or more variables: Correlation Analysis.

Establish a mathematical model to estimate the value of a variable based on the values of the other variables: Regression Analysis.

Multiple Regression Analysis Technique


Used under the same conditions as simple regression analysis, except that more than two variables are involved and one variable is assumed to depend on the others.

• Illustrative research question(s) this technique can answer:

 Are sales significantly affected by advertising expenditures and price (where all three variables are measured in RWF)? What proportion of the variation in sales is accounted for by advertising and price? How sensitive are sales to changes in advertising and price?

Logistic Regression Analysis

Logistic regression analysis is used to examine relationships between variables when the dependent variable is nominal, whether the independent variables are nominal, ordinal, interval, or some mixture thereof. Suppose that one wanted to determine which program interventions were associated with a JOBS Program client's ability to get a job within six months of exiting the program. The outcome variable would be "job" or "no job", clearly a nominal variable. One could then use several independent variables, such as job training, post-secondary education and the like, to predict the odds of getting a job.
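
As a rough illustration of what "predicting the odds" means, the sketch below turns a set of logistic regression coefficients into odds and a probability of getting a job. The coefficient values and the function names `job_odds` and `job_probability` are hypothetical, chosen only for illustration; they are not estimates from any JOBS Program data.

```python
from math import exp

# Hypothetical coefficients on the log-odds scale (illustration only).
INTERCEPT = -1.0
B_JOB_TRAINING = 0.8      # effect of receiving job training (0/1)
B_POST_SECONDARY = 0.5    # effect of post-secondary education (0/1)


def job_odds(job_training, post_secondary):
    """Odds of 'job' versus 'no job' for one client."""
    log_odds = (INTERCEPT
                + B_JOB_TRAINING * job_training
                + B_POST_SECONDARY * post_secondary)
    return exp(log_odds)


def job_probability(job_training, post_secondary):
    """Probability of 'job', obtained from the odds."""
    odds = job_odds(job_training, post_secondary)
    return odds / (1.0 + odds)


# Odds ratio for job training, holding post-secondary education fixed.
odds_ratio_training = job_odds(1, 0) / job_odds(0, 0)
print(round(odds_ratio_training, 3), round(job_probability(1, 1), 3))
```

With these made-up values, a client with job training has odds exp(0.8), about 2.2 times those of an otherwise identical client without it.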

ANALYSIS FOR MULTIPLE REGRESSIONS:

Multiple linear regression is an extension of the simple model that incorporates two or more independent variables. Multiple regression analysis produces an equation with several coefficients, one for each independent variable X introduced into the model, so the fitted model is a hyperplane rather than a straight line.

The model is:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$$

where:

$Y_i$ is the value of the dependent variable for unit $i$

$X_{1i}, X_{2i}, \dots, X_{ki}$ are the values of the independent variables for unit $i$

$\beta_1, \beta_2, \dots, \beta_k$ are the regression coefficients

$\beta_0$ is the Y-intercept, representing the prediction for $Y$ when all independent variables are set to zero

$\beta_1$ = the slope of $Y$ with respect to the variable $X_1$, holding $X_2, X_3, \dots, X_k$ constant

$\beta_2$ = the slope of $Y$ with respect to the variable $X_2$, holding $X_1, X_3, \dots, X_k$ constant

….

$\beta_k$ = the slope of $Y$ with respect to the variable $X_k$, holding the other variables constant

$\varepsilon_i$ = the random error in $Y$ for observation $i$

Example

Some film distributors offer discount cinema tickets via email. A random sample of 9 moviegoers is taken, and for each person the number of movies seen in the last year (Y), their age (X1) and income (X2), and the number of discount cinema tickets received via email in the last year (X3) are recorded.

The data is used to develop a multiple regression model to predict the number of times an individual goes to the movies per year from their age, income and number of discount cinema tickets received. The following results are obtained.

Table 4.4. Association between number of movies seen at a cinema last year, age, income, and number of emailed discount cinema tickets received last year

Number of movies seen at a cinema last year (y)   Age (X1)   Income (X2)   Number of emailed discount cinema tickets received last year (X3)
7                                                 45         22            2
10                                                42         30            3
8                                                 50         30            2
12                                                27         55            4
11                                                29         50            3
14                                                30         60            4
15                                                25         58            5
7                                                 50         20            2
11                                                27         47            3

a. Determine the overall significance of the regression equation by performing an appropriate F test at $\alpha = .05$.

b. Perform appropriate tests to determine the significance of the predictors. Use $\alpha = .05$ and show all working.

c. Interpret the following SPSS output.

Solution

We first carry out correlation analysis and simple regression.

1º. Draw the scatter plots to see the trend and the relationship between the variables under study (in SPSS: Graphs > Legacy Dialogs > Scatter/Dot > Simple Scatter).


Interpretation: The number of movies seen at a cinema last year has a strong direct association with income and with the number of emailed discount cinema tickets, while age has a strong inverse association.
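
The direction and strength of these associations can also be checked numerically. The sketch below computes the Pearson correlation of each predictor with the number of movies seen, using the nine observations of Table 4.4; the helper name `pearson_r` is illustrative and not part of the SPSS procedure.

```python
from math import sqrt

# Data from Table 4.4 (nine moviegoers).
movies = [7, 10, 8, 12, 11, 14, 15, 7, 11]       # y
age = [45, 42, 50, 27, 29, 30, 25, 50, 27]       # X1
income = [22, 30, 30, 55, 50, 60, 58, 20, 47]    # X2
tickets = [2, 3, 2, 4, 3, 4, 5, 2, 3]            # X3


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_age = pearson_r(age, movies)
r_income = pearson_r(income, movies)
r_tickets = pearson_r(tickets, movies)
print(round(r_age, 3), round(r_income, 3), round(r_tickets, 3))
```

A negative coefficient for age and positive coefficients for income and tickets correspond to the inverse and direct associations read off the scatter plots.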

Steps for Multiple Regressions in SPSS

Interpret the following SPSS output.

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .987 (a)   .974       .959                .58251                       2.023

a. Predictors: (Constant), Number of emailed discount cinema tickets received last year (X3), Age (X1), Income (X2)

b. Dependent Variable: Number of movies seen at a cinema last year (y)

R = .987: combining the independent variables improves the model; in other words, there is a strong correlation between the number of movies seen at a cinema last year and the number of emailed discount cinema tickets received last year, age and income taken together.

$R^2$ = .974: 97.4% of the total variation in the number of movies seen at a cinema last year can be explained by variation in the number of emailed discount cinema tickets received last year, age and income.

Durbin-Watson = 2.023: the residuals are not autocorrelated, because the value is close to 2.
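
The Model Summary figures can be reproduced outside SPSS. The following is a minimal sketch, assuming NumPy is available, that fits the model by ordinary least squares to the nine observations of Table 4.4 and computes R, R Square, Adjusted R Square, the standard error of the estimate and the Durbin-Watson statistic. Variable names are illustrative; the results should agree with the SPSS output up to rounding.

```python
import numpy as np

# Data from Table 4.4: y, then columns Age (X1), Income (X2), Tickets (X3).
y = np.array([7, 10, 8, 12, 11, 14, 15, 7, 11], dtype=float)
X = np.array([
    [45, 22, 2], [42, 30, 3], [50, 30, 2],
    [27, 55, 4], [29, 50, 3], [30, 60, 4],
    [25, 58, 5], [50, 20, 2], [27, 47, 3],
], dtype=float)

n, k = X.shape                              # 9 observations, 3 predictors
design = np.column_stack([np.ones(n), X])   # add the intercept column

# Ordinary least squares estimates: b0, b1, b2, b3.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ beta

sse = float(residuals @ residuals)               # residual sum of squares
sst = float(((y - y.mean()) ** 2).sum())         # total sum of squares
r_square = 1.0 - sse / sst
multiple_r = float(np.sqrt(r_square))
adj_r_square = 1.0 - (1.0 - r_square) * (n - 1) / (n - k - 1)
std_error = float(np.sqrt(sse / (n - k - 1)))

# Durbin-Watson: squared successive residual differences over SSE.
durbin_watson = float((np.diff(residuals) ** 2).sum() / sse)

print(np.round(beta, 3))
print(round(multiple_r, 3), round(r_square, 3), round(adj_r_square, 3),
      round(std_error, 5), round(durbin_watson, 3))
```

Note that the Durbin-Watson value depends on the order of the observations, so the rows must be kept in the order of Table 4.4.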

Output for SPSS

ANOVA (b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   64.526           3    21.509        63.387   .000 (a)
  Residual     1.697            5    .339
  Total        66.222           8

a. Predictors: (Constant), Number of emailed discount cinema tickets received last year (X3), Age (X1), Income (X2)

b. Dependent Variable: Number of movies seen at a cinema last year (y)

The null hypothesis is rejected, indicating that at least one of the independent variables X1, X2, ..., Xk contributes significantly to the model and as such could be useful for estimating the mean of Y.

Hypothesis of Slope

Statement of the hypotheses:

$H_0: \beta_1 = \beta_2 = \beta_3 = 0$ (no linear relationship; all the coefficients are simultaneously equal to zero)

$H_1$: at least one $\beta_j \neq 0$ (at least one independent variable affects Y)

Test statistic: F= 63.387, sig. = .000

Making a decision and interpreting the result of the test:

As Sig. < .05, we reject the null hypothesis: at least one of the explanatory variables affects the number of movies seen at a cinema last year.
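
The F statistic follows directly from the sums of squares and degrees of freedom in the ANOVA table. The sketch below redoes that arithmetic with the rounded values printed in the table, so the result differs slightly from the 63.387 that SPSS computes from unrounded figures. The critical value 5.41 is the tabulated upper 5% point of the F distribution with 3 and 5 degrees of freedom; comparing against it is an alternative to reading the Sig. column.

```python
# Values copied from the ANOVA table (rounded as printed by SPSS).
ss_regression, df_regression = 64.526, 3
ss_residual, df_residual = 1.697, 5

ms_regression = ss_regression / df_regression   # mean square, regression
ms_residual = ss_residual / df_residual         # mean square, residual
f_statistic = ms_regression / ms_residual

# Upper 5% critical value of F(3, 5), taken from a standard F table.
F_CRITICAL_05 = 5.41
reject_h0 = f_statistic > F_CRITICAL_05

print(round(ms_regression, 3), round(ms_residual, 3), round(f_statistic, 2))
print("Reject H0" if reject_h0 else "Do not reject H0")
```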

Method: Enter

As the Coefficients table shows, the number of emailed discount cinema tickets received last year and income are the predictors that contribute most to the number of movies seen at a cinema last year.

The model has the following equation:

$$\hat{y} = 0.430 + 0.031 X_1 + 0.093 X_2 + 1.665 X_3$$

But only the variable X3 (number of emailed discount cinema tickets received last year) is significant at the .05 level (Sig. = .010).

Coefficients (a)

                                                                    Unstandardized Coefficients   Standardized Coefficients
Model                                                               B        Std. Error           Beta                        t       Sig.
1 (Constant)                                                        .430     3.345                                            .129    .903
  Age (X1)                                                          .031     .053                 .112                        .581    .586
  Income (X2)                                                       .093     .040                 .511                        2.311   .069
  Number of emailed discount cinema tickets received last year (X3) 1.665    .417                 .610                        3.996   .010


Does the number of emailed discount cinema tickets received last year have a significant effect on the number of movies seen at a cinema last year? Test $H_0: \beta_3 = 0$ against $H_1: \beta_3 \neq 0$ at $\alpha = .05$.

Test statistic: t = 3.996, Sig.= .010

Making a decision and interpreting the result of the test:

For the number of emailed discount cinema tickets received last year, Sig. = .010 < .05, so there is evidence of a significant effect: this variable is an important predictor.

For income, Sig. = .069 < .10, so income has some influence as a predictor, but its effect is not as strong as that of the previous variable. Similarly, age has no influence as a predictor (Sig. = .586).
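
Each t value in the Coefficients table is the unstandardized coefficient divided by its standard error. The sketch below recomputes the three ratios from the rounded table entries (so they differ slightly from SPSS's values) and compares them with 2.571, the tabulated two-tailed 5% critical value of t with n - k - 1 = 5 degrees of freedom. The dictionary keys are shortened labels used only here.

```python
# (B, Std. Error) pairs copied from the Coefficients table.
coefficients = {
    "Age (X1)": (0.031, 0.053),
    "Income (X2)": (0.093, 0.040),
    "Tickets (X3)": (1.665, 0.417),
}

# Two-tailed 5% critical value of t with 5 degrees of freedom (t table).
T_CRITICAL_05 = 2.571

# t ratio for each predictor: coefficient divided by its standard error.
t_ratios = {name: b / se for name, (b, se) in coefficients.items()}

# Predictors whose |t| exceeds the critical value are significant at .05.
significant = [name for name, t in t_ratios.items()
               if abs(t) > T_CRITICAL_05]

for name, t in t_ratios.items():
    print(f"{name}: t = {t:.3f}")
print("Significant at .05:", significant)
```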

Using the Regression Equation to Make Predictions

Predict the number of movies seen at a cinema next year if the number of emailed discount cinema tickets increases to 6, the average income is 62 and age decreases to 20 years.

The predicted average number of movies seen is 17 per year.
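
The prediction comes from substituting the values into the fitted equation with the coefficients from the Coefficients table. A small sketch of that substitution, where the function name `predict_movies` is illustrative:

```python
def predict_movies(age, income, tickets):
    """Predicted number of movies per year from the fitted equation."""
    return 0.430 + 0.031 * age + 0.093 * income + 1.665 * tickets


predicted = predict_movies(age=20, income=62, tickets=6)
print(round(predicted, 3), round(predicted))
```

The three input values lie outside the ranges observed in the sample (ages 25 to 50, incomes 20 to 60, 2 to 5 tickets), so this prediction is an extrapolation.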

Partial Correlation

The techniques of partial correlation can be used when a researcher wishes to observe how a specific bivariate relationship behaves in the presence of a third variable. By observing the partial correlation coefficients, we can distinguish direct, spurious and intervening relationships.
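
For three variables, the first-order partial correlation between x and y controlling for z is computed from the three pairwise correlations as (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below applies this standard formula to the correlations reported in review problem 2 of this chapter, taking x as women's status, y as abortion rate and z as religiosity; the function name is illustrative.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Pairwise Pearson correlations from the table in review problem 2.
r_status_abortion = 0.817      # women's status with abortion rate
r_status_religion = -0.801     # women's status with religiosity
r_abortion_religion = -0.842   # abortion rate with religiosity

r_partial = partial_correlation(r_status_abortion,
                                r_status_religion,
                                r_abortion_religion)
print(round(r_partial, 3))
```

Here the association between women's status and abortion rate drops from .817 to about .44 once religiosity is controlled for, which is the kind of change the partial coefficient is meant to reveal.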

Chapter review problems

1. A market study for the self-service retailer "Nakomatt" analyzes the annual amount that families of four or more members spend on food. It is thought that three independent variables are related to the cost of food. These variables are: total household income, family size and whether the family has children in college.

Family   Expenditure on food   Household income ($ 1000)   Family size   Children in college
1        3900                  37.6                        4             0
2        5300                  51.5                        5             1
3        4300                  41.6                        4             0
4        4900                  46.8                        5             0
5        6400                  53.8                        6             1
6        7300                  62.6                        7             1
7        4900                  54.3                        5             0
8        5300                  52.7                        4             0
9        6100                  60.8                        5             1
10       6400                  63.5                        6             1
11       7400                  64.2                        8             1
12       5800                  56.3                        5             0

Interpret each table of the SPSS output.

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .969 (a)   .939       .916                320.393                      2.559

a. Predictors: (Constant), Children in college, Household income ($ 1000), Family size

b. Dependent Variable: Expenditure on food

a. Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination ($R^2$). How much of the variance in expenditure on food is explained by the three independent variables?

b. Assess the assumption of uncorrelated residuals using the Durbin-Watson statistic:____________________________________

Coefficients (a)

                              Unstandardized Coefficients   Standardized Coefficients
Model                         B         Std. Error          Beta                        t       Sig.
1 (Constant)                  35.405    767.913                                         .046    .964
  Household income ($ 1000)   63.753    18.391              .493                        3.467   .008
  Family size                 386.805   131.609             .432                        2.939   .019
  Children in college         275.684   275.936             .131                        .999    .347

a. Dependent Variable: Expenditure on food

c. Find the multiple regression equation with household income (x1), family size (x2) and children in college (x3).

d. What expenditure on food would be expected for a family of 5 with no children in college and an income of $60,000?

2. The table below presents information on three variables for a small sample of eight nations. We will take abortion rate as the dependent variable and examine its relationship with two variables: one measures women's status and power and the other measures religiosity.

Nation    Abortion Rate (Y)   Women's Status (x1)   Religiosity (x2)
Canada    165                 0.5                   74
Chile     100                 0.45                  93
Denmark   400                 0.8                   48
Germany   208                 0.54                  67
Italy     389                 0.7                   70
Japan     379                 0.52                  55
UK        207                 0.58                  67
US        428                 0.84                  35

The SPSS output is given below. Verify and interpret all results:

Correlations

                                            Abortion Rate (Y)   Women's Status (x1)   Religiosity (x2)
Abortion Rate (Y)     Pearson Correlation   1                   .817*                 -.842**
                      Sig. (2-tailed)                           .013                  .009
                      N                     8                   8                     8
Women's Status (x1)   Pearson Correlation   .817*               1                     -.801*
                      Sig. (2-tailed)       .013                                      .017
                      N                     8                   8                     8
Religiosity (x2)      Pearson Correlation   -.842**             -.801*                1
                      Sig. (2-tailed)       .009                .017
                      N                     8                   8                     8

*. Correlation is significant at the 0.05 level (2-tailed).

**. Correlation is significant at the 0.01 level (2-tailed).

a. Interpret the coefficient of correlation for each pair:_________________________________

Scatter Plot:

Interpret:___________________________ Interpret:___________________________

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .875 (a)   .765       .671                73.19844                     1.569

a. Predictors: (Constant), Religiosity (x2), Women's Status (x1)

b. Dependent Variable: Abortion Rate (Y)

b. Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination ($R^2$). How much of the variance in abortion rate is explained by the two independent variables?

c. Assess the assumption of uncorrelated residuals using the Durbin-Watson statistic:__________________________________

ANOVA (a)

Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   87171.942        2    43585.971     8.135   .027 (b)
  Residual     26790.058        5    5358.012
  Total        113962.000       7

a. Dependent Variable: Abortion Rate (Y)

b. Predictors: (Constant), Religiosity (x2), Women's Status (x1)

d. Interpret ANOVA table:_______________________________________________________

Coefficients (a)

                        Unstandardized Coefficients   Standardized Coefficients
Model                   B         Std. Error          Beta                        t        Sig.
1 (Constant)            310.885   345.190                                         .901     .409
  Women's Status (x1)   348.413   317.472             .398                        1.097    .322
  Religiosity (x2)      -3.789    2.624               -.523                       -1.444   .208

a. Dependent Variable: Abortion Rate (Y)

e. Find the multiple regression equation with Women's Status (x1) and religiosity (x2).


g. Assess the assumption that the errors are normally distributed, using the graph and the tests of normality.

Tests of Normality

                          Kolmogorov-Smirnov (a)      Shapiro-Wilk
                          Statistic   df   Sig.       Statistic   df   Sig.
Unstandardized Residual   .300        8    .033       .756        8    .009

a. Lilliefors Significance Correction

h. Interpret these residual graphs with respect to independence of the error term and homoscedasticity:______

3. For a hypothetical sample of 20 patients, the following data were collected: cholesterol level in blood plasma (in mg/100 ml), age (in years), saturated fat intake (in g/week) and level of exercise (quantified as 0: no exercise, 1: moderate exercise and 2: intense exercise). Fit a linear model relating cholesterol level to the other variables.

Patient   Cholesterol   Age   Fat   Exercise
1         350           80    35    0
2         190           30    40    2
3         263           42    15    1
4         320           50    20    0
5         280           45    35    0
6         198           35    50    1
7         232           18    70    1
9         303           49    45    0
10        220           35    35    0
11        405           50    50    0
12        190           20    15    2
13        230           40    20    1
14        227           30    35    0
15        440           30    80    1
16        318           23    40    2
17        212           35    40    1
18        340           18    80    0
19        195           22    15    0
20        223           41    34    0

The SPSS output is given below. Verify and interpret all results:

a. Interpret the coefficient of correlation for each pair:____________________________________

_________________________________________________________________________________

Correlations

                                    Cholesterol   Age     Fat     Exercise
Cholesterol   Pearson Correlation   1             .342    .494*   -.286
              Sig. (2-tailed)                     .139    .027    .221
              N                     20            20      20      20
Age           Pearson Correlation   .342          1       -.246   -.426
              Sig. (2-tailed)       .139                  .295    .061
              N                     20            20      20      20
Fat           Pearson Correlation   .494*         -.246   1       -.041
              Sig. (2-tailed)       .027          .295            .864
              N                     20            20      20      20
Exercise      Pearson Correlation   -.286         -.426   -.041   1
              Sig. (2-tailed)       .221          .061    .864
              N                     20            20      20      20

*. Correlation is significant at the 0.05 level (2-tailed).

b. Interpret the multiple correlation coefficient (R).

____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ________________________________________________________________________________________________

c. Interpret the coefficient of multiple determination ($R^2$). How much of the variance in cholesterol level is explained by the three independent variables?

_________________________________________________________________________________________________

d. Assess the assumption of uncorrelated residuals using the Durbin-Watson statistic:

__________________________________________________________________________________________________

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .690 (a)   0.477      0.378               58.15349                     1.877


ANOVA (a)

Model          Sum of Squares   df   Mean Square   F      Sig.
1 Regression   49275.942        3    16425.314     4.86   .014 (b)
  Residual     54109.258        16   3381.829
  Total        103385.2         19

a. Dependent Variable: Cholesterol

b. Predictors: (Constant), Exercise, Fat, Age

Interpret ANOVA table:_________________________________________________________________________________

Coefficients (a)

               Unstandardized Coefficients   Standardized Coefficients
Model          B        Std. Error           Beta                        t       Sig.
1 (Constant)   99.937   61.275                                           1.631   .122
  Age          2.346    1.056                .464                        2.223   .041
  Fat          2.306    .720                 .606                        3.201   .006
  Exercise     -6.248   19.831               -.064                       -.315   .757

a. Dependent Variable: Cholesterol

Write the model from the table of coefficients:______________________________________

Assumption: the errors have a normal distribution.

Interpret:_______________________________________________________
