Chapter V. Multivariate Analysis, 2020.doc


Chapter V

Multivariate Analysis

Purpose of the Chapter

This chapter deals with situations in which more than two variables are analyzed simultaneously. Multivariate techniques such as multiple regression, partial correlation, and logistic regression are used for this purpose.

Essentially, multivariate analysis is a tool for finding patterns and relationships among several variables simultaneously. It lets us predict the effect that a change in one variable will have on other variables.

5.1 Introduction

Multivariate analysis consists of a collection of methods that can be used when several measurements are made on each individual or object in one or more samples. We refer to measures as variables and to individuals or objects as units.


models and multivariate statistics. With the wide availability of high-speed computers and multivariate software, many users can now address these questions through multivariate techniques.

Data Types

Multivariate data may take any of the following forms:

• A single sample with several variables measured on each sampling unit (subject or object).

• A single sample with two sets of variables measured on each unit.

• Two samples with several variables measured on each unit.

• Three or more samples with several variables measured on each unit.

Types of Measurements

• Nominal: Categorical variables with no meaningful order (Examples: Gender, Hair color).

• Ordinal: Categorical variables where a meaningful order exists (Examples: Social class, Rating of instructor).

• Interval: Numerical variables where taking differences is meaningful, but there is no fixed zero position (Examples: Temperature in Celsius or Fahrenheit).

• Ratio: Numerical variables where taking ratios is meaningful since there is a fixed zero (Examples: Age, Height, and Weight).

Interval and ratio data are metric data, so they can be manipulated mathematically.

Types of Multivariate Techniques:

1. Dependence techniques: a variable or set of variables is identified as the dependent variable to be predicted or explained by other variables known as independent variables.

2. Interdependence techniques: involve the simultaneous analysis of all variables in the set, without distinction between dependent variables and independent variables.

Classification of Multivariate Methods

5.2 Multiple Regression Analysis Technique

• This technique is appropriate:


• Illustrative Research Question(s) this technique can answer:

• Are sales significantly affected by advertising expenditures and price (where all three variables are measured in RWF)?

• What proportion of the variation in sales is accounted for by advertising and price?

• How sensitive are sales to changes in advertising and price?

5.3 Partial correlation

The technique of partial correlation can be used when a researcher wishes to observe how a specific bivariate relationship behaves in the presence of a third variable. By observing the partial correlation coefficients, we can identify whether the relationship is direct, spurious, or intervening.

5.4 Logistic Regression Analysis

Logistic regression analysis is used to examine relationships between variables when the dependent variable is nominal, whether the independent variables are nominal, ordinal, interval, or some mixture thereof. Suppose that one wanted to determine which program interventions were associated with a JOBS Program client's ability to get a job within six months of exiting the program. The outcome variable would be "job" or "no job", clearly a nominal variable. One could then use several independent variables, such as job training, post-secondary education, and the like, to predict the odds of getting a job.

Sometimes we need to predict a variable that has only two possible values (a binary dependent variable). For example, will a Kigali bank customer choose online banking or not?

The equation predicts the probability:
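For a binary outcome, the standard logistic regression model can be written as follows. This is a generic sketch: the coefficients $b_0, b_1, \dots, b_k$ and the predictors $X_1, \dots, X_k$ are placeholder symbols assumed for illustration, not values taken from this chapter.

```latex
P(Y = 1) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k)}},
\qquad
\ln\!\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k
```

The second form shows that the model is linear in the log-odds (the logit), which is why each coefficient is usually read through its exponential, the odds ratio.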

• Illustrative Research Question(s) this technique can answer:

• How can a market researcher make reliable predictions about whether a new product launch will succeed?

• How does the probability of getting lung cancer change for every additional pound of overweight and for every X cigarettes smoked per day?

• Do body weight, calorie intake, fat intake, and age have an influence on heart attacks (yes vs. no)?

5.2 Analysis for Multiple Regression

Multiple linear regression is an extension of the simple linear model that incorporates several independent variables (called predictors). It produces an equation with one coefficient for each independent variable X introduced into the model, so the fitted surface is a hyperplane rather than a line.

Figure 5.1 illustrates the idea of a multiple regression model. Some of the proposed predictors may be useful, while others may not.

Model:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} + \varepsilon_i$$

In which:

$Y_i$ = the value of the dependent variable for unit $i$

$\beta_0$ = the intercept (intersection with the Y axis)

$\beta_1$ = the slope of Y with respect to the variable $X_1$, holding $X_2, X_3, \dots, X_p$ constant

$\beta_2$ = the slope of Y with respect to the variable $X_2$, holding $X_1, X_3, \dots, X_p$ constant

….

$\beta_p$ = the slope of Y with respect to the variable $X_p$, holding the other variables constant

$X_{1i}, X_{2i}, \dots, X_{pi}$ = the values of the independent variables for unit $i$

$\varepsilon_i$ = the random error in Y for observation $i$

Assumptions

As a rule of thumb, linear regression needs a sample size of at least 20 cases per independent variable in the analysis.

The regression has five key assumptions:

• Linear relationship. Linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots.

• Multivariate normality. Multiple regression assumes that the residuals are normally distributed. This assumption can best be checked with a histogram or a P-P plot, or with a goodness-of-fit test such as the Kolmogorov-Smirnov test or, for small samples, the Shapiro-Wilk test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.

• No or little multicollinearity. Multicollinearity refers to how strongly interrelated the independent variables in a model are; a simple correlation of 0.90 or above between two independent variables is generally taken as a warning sign. When multicollinearity is too high, the individual parameter estimates become difficult to interpret. Most regression programs can compute a variance inflation factor (VIF) for each variable. Hair et al. (2006) indicate that a large VIF value (10 or above) or a very small tolerance value (0.10 or below) signals high collinearity or multicollinearity.

If multicollinearity is found in the data, centering the data (that is, subtracting the mean of the variable from each score) might help to solve the problem. However, the simplest way to address the problem is to remove independent variables with high VIF values.

• No autocorrelation. Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other; this typically occurs, for instance, in stock prices, where each price is not independent of the previous price. Autocorrelation is checked with the Durbin-Watson statistic d, which ranges from 0 to 4. A value around 2 means that the errors are not correlated, a value less than 2 that the errors are positively correlated, and a value greater than 2 that they are negatively correlated. As a rule of thumb, when d lies between 1.5 and 2.5 (d ≈ 2) we can assume that there is no first-order linear autocorrelation in the multiple linear regression data.

• Homoscedasticity. This assumption means that the variance around the regression line is the same for all values of the predictor variable (X). The plot shows a violation of this assumption.
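The two numerical checks described above (the Durbin-Watson statistic and the VIF/tolerance pair) can be sketched in a few lines of plain Python. This is a minimal illustration, not SPSS's implementation: the function names `durbin_watson` and `tolerance_and_vif` are hypothetical, and the residuals and the R² value used below are made-up numbers.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences of the
    residuals divided by the sum of squared residuals (range 0 to 4)."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def tolerance_and_vif(r_squared_j):
    """Tolerance and VIF for predictor j, given the R^2 obtained when
    predictor j is regressed on all the other predictors."""
    tolerance = 1.0 - r_squared_j
    return tolerance, 1.0 / tolerance


# Hypothetical residuals, listed in the order the observations were collected.
residuals = [0.5, -0.3, -0.4, 0.2, 0.4, -0.4]
d = durbin_watson(residuals)
print(f"Durbin-Watson d = {d:.3f}; inside (1.5, 2.5): {1.5 < d < 2.5}")

# Hypothetical case: 90% of one predictor's variance is explained by the others.
tolerance, vif = tolerance_and_vif(0.90)
print(f"tolerance = {tolerance:.2f}, VIF = {vif:.1f}")
```

With 90% of a predictor's variance explained by the other predictors, the tolerance sits at the 0.10 threshold and the VIF at the threshold of 10 quoted above.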

Example

Some film distributors offer discount cinema tickets via email. A random sample of 9 moviegoers is taken, and for each person the number of movies seen in the last year (Y), age (X1), income (X2), and the number of discount cinema tickets received via email in the last year (X3) are recorded.

The data are used to develop a multiple regression model to predict the number of times an individual goes to the movies per year from age, income, and number of discount cinema tickets received. The following results are obtained.

Table 5.1. Association between number of movies seen at a cinema last year, age, income, and number of emailed discount cinema tickets

Movies seen at a cinema last year (Y)   Age (X1)   Income (X2)   Emailed discount cinema tickets received last year (X3)
10                                      42         30            3
8                                       50         30            2
12                                      27         55            4
11                                      29         50            3
14                                      30         60            4
15                                      25         58            5
7                                       50         20            2
11                                      27         47            3

a. Draw the scatter diagram to see the trend and the relationship between the variables under study.

b. Compute and interpret the coefficient of simple correlation between the variables (dependent variable: number of movies the person has seen in the last year).

c. Compute and interpret the multiple correlation coefficient and the multiple coefficient of determination, and check the assumption about autocorrelation using the Durbin-Watson statistic.

d. Determine the multiple regression equation by performing an appropriate F test at α = .05.

e. Perform appropriate tests to determine the significance of the predictors. Use α = .05 and show all working.

f. Check the assumption about multicollinearity.

g. Test the assumption of multivariate normality.

Solution

a. Draw the scatter diagram to see the trend (linear or nonlinear) and the relationship between the variables under study (in SPSS: Graphs > Legacy Dialogs > Scatter/Dot > Simple Scatter).

Figure 5.2. Scatter plot for each predictor

b. Compute and interpret the coefficient of simple correlation between the variables (dependent variable: number of movies the person has seen in the last year)


Interpretation: The number of movies seen at a cinema last year has a strong direct association with income and with the number of emailed discount tickets, and a strong inverse association with age.

c. Compute and interpret multiple correlation coefficient, multiple coefficient of determination and assumption about Autocorrelation by Durbin Watson

Figure 5.4. Steps for Multiple Regressions in SPSS

Interpret the following Outcomes of SPSS software.

Figure 5.5. Model Summaryᵇ

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .987ᵃ   .974       .959                .58251                       2.023

a. Predictors: (Constant), Number of emailed discount cinema tickets received last year (X3), Age (X1), Income (X2)

b. Dependent Variable: Number of movies seen at a cinema last year (y)


Multiple coefficient of correlation: R = .987. There is a strong correlation between the number of movies seen and the combination of number of emailed discount tickets, age, and income; taking the independent variables together improves the model.

Multiple coefficient of determination: R² = .974. So 97.4% of the variation in the number of movies seen at a cinema last year can be explained by the variation in the number of discount cinema tickets received by email last year, age, and income.

Assumption about autocorrelation: Durbin-Watson d = 2.023. Because this value is close to 2 (1.5 < d < 2.5), the residuals are not autocorrelated, and the assumption of no autocorrelation is met.

d. Determine the multiple regression equation by performing an appropriate F test at α = .05.

Figure 5.6. ANOVAᵇ

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   64.526           3    21.509        63.387   .000ᵃ
  Residual     1.697            5    .339
  Total        66.222           8

a. Predictors: (Constant), Number of emailed discount cinema tickets received last year (X3), Age (X1), Income (X2)

b. Dependent Variable: Number of movies seen at a cinema last year (y)

Interpretation: Since Sig. < 0.05, we reject the null hypothesis, which indicates that at least one of the explanatory variables is related to the number of movies seen at a cinema last year.
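The summary statistics reported by SPSS can be cross-checked directly from the sums of squares in the ANOVA table. The short Python sketch below uses only the rounded values printed in Figure 5.6, so small differences in the last digits are expected; the variable names are chosen here for illustration.

```python
import math

# Values copied from the ANOVA table (Figure 5.6).
ss_regression, df_regression = 64.526, 3
ss_residual, df_residual = 1.697, 5
ss_total, df_total = 66.222, 8

ms_regression = ss_regression / df_regression   # mean square, regression
ms_residual = ss_residual / df_residual         # mean square, residual

f_ratio = ms_regression / ms_residual           # model F statistic
r_squared = ss_regression / ss_total            # multiple coefficient of determination
adj_r_squared = 1 - ms_residual / (ss_total / df_total)
std_error_estimate = math.sqrt(ms_residual)     # standard error of the estimate

print(f"F = {f_ratio:.2f}")
print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}")
print(f"Std. error of the estimate = {std_error_estimate:.4f}")
```

These values should agree, up to rounding, with the F statistic in the ANOVA table and with the R Square, Adjusted R Square, and Std. Error of the Estimate columns of the Model Summary.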

Hypothesis of Slope

Approach of the hypothesis:

$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ (all the coefficients are simultaneously equal to zero)

$H_1: \beta_j \neq 0$ for at least one $j$ (at least one regression coefficient is not equal to zero)

Rejecting the null hypothesis indicates that at least one of the independent variables X1, X2, ..., Xk contributes significantly to the model and as such could be useful for estimating the mean of Y.

Method: Enter

As we can see, the number of discount cinema tickets received by email last year is a significant predictor of the number of movies seen at a cinema last year, and income is marginally significant.

The model has the following equation:

$\hat{Y} = 0.430 + 0.031 X_1 + 0.093 X_2 + 1.665 X_3$

Figure 5.7. Coefficientsᵃ

Model 1                                                             Unstandardized B   Std. Error   Standardized Beta   t       Sig.
(Constant)                                                          .430               3.345                            .129    .903
Age (X1)                                                            .031               .053         .112                .581    .586
Income (X2)                                                         .093               .040         .511                2.311   .069
Number of emailed discount cinema tickets received last year (X3)   1.665              .417         .610                3.996   .010

a. Dependent Variable: Number of movies seen at a cinema last year (y)


e. Significance of the predictors: Number of emailed discount cinema tickets received last year has Sig. = .010 < .05, indicating that this variable is an important predictor. Income has Sig. = .069 < .10, so it has some influence as a predictor, but its effect is not as strong as that of the previous variable. Age has no significant influence as a predictor (Sig. = .586).
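To see how the coefficient table is used, the sketch below rebuilds the t ratios (B divided by its standard error) and applies the fitted equation to one moviegoer. The coefficients are the rounded values from Figure 5.7; the function name `predict_movies` and the example person (age 30, income 50, 4 emailed tickets) are hypothetical and chosen only for illustration.

```python
# (B, Std. Error) pairs copied from the Coefficients table (Figure 5.7).
coefficients = {
    "constant": (0.430, 3.345),
    "age": (0.031, 0.053),
    "income": (0.093, 0.040),
    "tickets": (1.665, 0.417),
}

# Each t statistic is the coefficient divided by its standard error.
t_ratios = {name: b / se for name, (b, se) in coefficients.items()}


def predict_movies(age, income, tickets):
    """Predicted number of movies seen last year from the fitted equation."""
    b = {name: pair[0] for name, pair in coefficients.items()}
    return (
        b["constant"]
        + b["age"] * age
        + b["income"] * income
        + b["tickets"] * tickets
    )


for name, t in t_ratios.items():
    print(f"{name:>8}: t = {t:.3f}")

# Hypothetical moviegoer, in the same units as the columns of Table 5.1.
print(f"Predicted movies: {predict_movies(age=30, income=50, tickets=4):.2f}")
```

Because the printed coefficients are rounded to three decimals, the recomputed t ratios match the SPSS column only approximately.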

f. Assumption about multicollinearity

Interpretation

The variables do not show multicollinearity, because all tolerances are greater than .10 and VIF are less than 10.

g. Testing the assumption of multivariate normality

Ho: the residuals follow a normal distribution

Ha: the residuals do not follow a normal distribution

Steps to perform assumption of normality

Output

Tests of Normality

                          Kolmogorov-Smirnovᵃ           Shapiro-Wilk
                          Statistic   df   Sig.         Statistic   df   Sig.
Unstandardized Residual   .233        9    .171         .871        9    .128

a. Lilliefors Significance Correction

We use the Shapiro-Wilk test because the sample size is small (less than 30).

Decision: We cannot reject the null hypothesis because Sig. = .128 > .05. Therefore the assumption is met: the residuals follow a normal distribution.

Note:

1. Examine the model F-test. If the test result is not significant, the model should be dismissed and there is no need to proceed to further steps.

2. Examine the individual statistical tests for each parameter estimate. An independent variable with significant results can be considered a significant explanatory variable. If an independent variable is not significant, the model should be run again with the nonsignificant predictors deleted. Often, it is best to eliminate predictor variables one at a time, and then rerun the reduced model.

3. Examine the model R². No cutoff values exist that can distinguish an acceptable amount of explained variation across all regression


In other words, the regression is run for pure forecasting purposes. When the model is more oriented toward explanation of which variables are most important in explaining the dependent variable, cutoff values for the model R² are not really appropriate.

5.3 Partial Correlation

The technique of partial correlation can be used when a researcher wishes to observe how a specific bivariate relationship behaves in the presence of a third variable.

Figure 5.8 Types of Correlation

Generally, a large number of factors simultaneously influence all social and natural phenomena. For example, when we study the correlation between price (as the dependent variable) and demand (as the independent variable), we completely ignore the effect of other factors such as money supply, imports, and exports, which definitely have a bearing on the price.

Figure 5.9 Relationship between the variables

The correlation coefficient between two variables X1 and X2, studied after eliminating the influence of a third variable X3 from both of them, is the partial correlation coefficient.
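In symbols, the definition above corresponds to the standard first-order partial correlation formula. The notation $r_{12.3}$ (the correlation of variables 1 and 2 with variable 3 held constant) is assumed here for illustration; $r_{12}$, $r_{13}$, and $r_{23}$ are the ordinary zero-order correlations.

```latex
r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{\left(1 - r_{13}^{2}\right)\left(1 - r_{23}^{2}\right)}}
```

If $r_{12.3}$ is close to $r_{12}$, the third variable has little effect on the bivariate relationship; if it drops toward zero, the original relationship may be spurious or intervening.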

The partial correlation analysis assumes great significance in cases where the phenomena under consideration have multiple factors influencing them, especially in physical and experimental sciences, where it is possible to control the variables and the effect of each variable can be studied separately.

Application example:

Why does voter turnout vary from election to election? For municipal elections in seven different cities of Africa, information has been gathered on the percent of registered voters who actually voted, the unemployment rate, and the percentage of all political advertisements that used “negative campaigning” (personal attacks, negative portrayals of the opponent’s record, etc.). For the sake of convenience, the data for the three variables are presented here along with descriptive statistics and zero-order correlations.

City                 Turnout   Unemployment Rate   % negative Ads.
Nairobi              70        10                  48
Luanda               68        9                   53
Lagos                71        10                  40
Sierra Leone         55        5                   60
Madagascar           60        8                   63
Morocco              60        8                   63
Mauritius            55        5                   60
Mean                 62.7143   7.8571              55.2857
Standard Deviation   6.87300   2.11570             8.71233

Zero-order correlations

                    Unemployment Rate   % negative Ads.
Turnout             .948**              -.853*
Unemployment Rate                       -.685

[Figure 5.8. Types of Correlation: Direct and Inverse; Simple, Partial and Multiple]


a. Compute the partial correlation coefficient for the relationship between turnout (Y) and unemployment (X) while controlling for the effect of negative advertising (Z). What effect does this control variable have on the bivariate relationship? Is the relationship between turnout and unemployment direct?

b. Compute the partial correlation coefficient for the relationship between turnout (Y) and negative advertising (X) while controlling for the effect of unemployment (Z). What effect does this have on the bivariate relationship? Is the relationship between turnout and negative advertising direct?

c. Find the unstandardized multiple regression equation with unemployment (X1) and negative ads (X2) as the independent variables. What turnout would be expected in a city in which the unemployment rate was 10% and 75% of the campaign ads were negative?

d. Compute beta weights for each independent variable. Which has the stronger impact on turnout?

e. Compute the multiple correlation coefficient (R) and the coefficient of multiple determination (R²). How much of the variance in voter turnout is explained by the two independent variables?

Solution

a. Compute the partial correlation coefficient for the relationship between turnout (Y) and unemployment (x) while controlling for the effect of negative advertising (Z). What effect does this control variable have on the bivariate relationship? Is the relationship between turnout and unemployment direct?

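Correlations

Control variable: % Negative Advertising

                                    Turnout   Unemployment_rate
Turnout   Correlation               1.000     .957
          Significance (2-tailed)             .003
          df                        0         4

For turnout (Y) and unemployment rate (X) while controlling for % negative advertising (Z), r = .957. The zero-order correlation (r = .948) is essentially unchanged by the control variable, so the relationship between turnout and unemployment is direct.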


b. Compute the partial correlation coefficient for the relationship between turnout (Y) and negative advertising (x) while controlling for the effect of unemployment (Z). What effect does this have on the bivariate relationship? Is the relationship between turnout and negative advertising direct?

Correlations

Control variable: Unemployment_rate

                                    Turnout   % Negative Advertising
Turnout   Correlation               1.000     -.879
          Significance (2-tailed)             .021
          df                        0         4

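For turnout (Y) and % negative advertising (X) while controlling for unemployment rate (Z), r = -.879. The zero-order correlation (r = -.853) is essentially unchanged by the control variable, so the relationship is direct rather than spurious or intervening. Its sign is negative, so it is an inverse relationship: turnout falls as the share of negative advertising rises.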
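The two partial correlations can be reproduced from the raw data for the seven cities without SPSS. The following is a minimal sketch in plain Python; the helper names `pearson` and `partial_corr` are chosen here for illustration, and the formula used is the standard first-order partial correlation.

```python
import math

# Data for the seven cities, copied from the table in the text.
turnout = [70, 68, 71, 55, 60, 60, 55]
unemployment = [10, 9, 10, 5, 8, 8, 5]
negative_ads = [48, 53, 40, 60, 63, 63, 60]


def pearson(x, y):
    """Zero-order (Pearson) correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_a = partial_corr(turnout, unemployment, negative_ads)
r_b = partial_corr(turnout, negative_ads, unemployment)
print(f"Turnout and unemployment, controlling for negative ads: {r_a:.3f}")
print(f"Turnout and negative ads, controlling for unemployment: {r_b:.3f}")
```

Comparing each partial coefficient with its zero-order counterpart (`pearson(turnout, unemployment)` and `pearson(turnout, negative_ads)`) shows how little the control variable changes either relationship.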

c. Find the unstandardized multiple regression equation with unemployment (X1) and negative ads. (X2) as the independent variables.

What turnout would be expected in a city in which the unemployment rate was 10% and 75% of the campaign ads were negative?

Coefficientsᵃ

Model 1                  Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)               61.955             6.659                            9.304    .001
Unemployment_rate        2.226              .338         .685                6.590    .003
% Negative Advertising   -.303              .082         -.384               -3.689   .021

a. Dependent Variable: Turnout

Turnout (Y) = 61.955 + (2.226) unemployment (X1) + (-.303) negative advertising (X2).

For unemployment (x1) = 10 and negative advertising (x2) = 75, then

The expected value for Turnout = 61.52
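The regression equation and the prediction can be checked from the raw data with ordinary least squares. The sketch below solves the normal equations in closed form for the two-predictor case; the function name `fit_two_predictors` is hypothetical, and this is an illustration of the technique rather than the procedure SPSS uses internally.

```python
# Data for the seven cities, copied from the table in the text.
turnout = [70, 68, 71, 55, 60, 60, 55]
unemployment = [10, 9, 10, 5, 8, 8, 5]
negative_ads = [48, 53, 40, 60, 63, 63, 60]


def fit_two_predictors(y, x1, x2):
    """Least-squares intercept and slopes for y = b0 + b1*x1 + b2*x2,
    obtained by solving the normal equations in deviation form."""
    n = len(y)
    mean_y, mean_1, mean_2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - mean_1) ** 2 for a in x1)
    s22 = sum((b - mean_2) ** 2 for b in x2)
    s12 = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    s1y = sum((a - mean_1) * (c - mean_y) for a, c in zip(x1, y))
    s2y = sum((b - mean_2) * (c - mean_y) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = mean_y - b1 * mean_1 - b2 * mean_2
    return b0, b1, b2


b0, b1, b2 = fit_two_predictors(turnout, unemployment, negative_ads)
predicted = b0 + b1 * 10 + b2 * 75  # 10% unemployment, 75% negative ads

print(f"Turnout = {b0:.3f} + ({b1:.3f}) unemployment + ({b2:.3f}) negative ads")
print(f"Expected turnout: {predicted:.2f}")
```

Plugging the rounded coefficients into the equation by hand gives 61.955 + 22.26 - 22.725 = 61.49; the unrounded coefficients give about 61.52, which is the value reported above.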

d. Compute beta-weights for each independent variable. Which has the stronger impact on turnout?

For unemployment (X1), β1 = .685; for negative advertising (X2), β2 = -.384. Unemployment therefore has a stronger effect on turnout than negative advertising. Note that the two independent variables affect turnout in opposite directions.
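A beta weight can also be recovered by hand from the unstandardized coefficient and the standard deviations listed with the data. The worked equations below are a sketch that uses the rounded values printed in the tables, so the last digit may differ slightly; the starred symbol $\beta_j^{*}$ for a standardized coefficient is notation assumed here.

```latex
\begin{aligned}
\beta_j^{*} &= B_j \,\frac{s_{X_j}}{s_Y} \\
\beta_1^{*} &= 2.226 \times \frac{2.11570}{6.87300} \approx .685 \\
\beta_2^{*} &= -.303 \times \frac{8.71233}{6.87300} \approx -.384
\end{aligned}
```

Standardizing removes the units of measurement, which is what makes the two weights directly comparable.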

e. Compute the multiple correlation coefficient (R) and the coefficient of multiple determination (R²). How much of the variance in voter turnout is explained by the two independent variables?

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .988ᵃ   .977       .966                1.27629

a. Predictors: (Constant), % Negative Advertising, Unemployment_rate

Multiple correlation coefficient: R = .988. There is a strong association between turnout and the combination of the percentage of negative advertising and the unemployment rate.

Coefficient of multiple determination: R² = .977.

Thus 97.7% of the variation in turnout is explained by the two independent variables, negative advertising and the unemployment rate.
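The same R² can be verified by hand from quantities already reported: with standardized variables, R² equals the sum of each beta weight multiplied by the zero-order correlation of its predictor with the dependent variable. The worked equation below is a sketch using the rounded table values; the symbols $\beta_j^{*}$ and $r_{Yj}$ are notation assumed here.

```latex
\begin{aligned}
R^{2} &= \beta_1^{*}\, r_{Y1} + \beta_2^{*}\, r_{Y2} \\
      &= (.685)(.948) + (-.384)(-.853) \\
      &\approx .649 + .328 = .977
\end{aligned}
```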

5.4 Logistic Regression

A logistic regression data set is essentially the same as an ordinary multiple linear regression data set, except that the dependent variable Y is binary or otherwise categorical.

Types of Logistic regression:

A binomial logistic regression (often referred to simply as logistic regression), predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable based on one or more independent variables that can be either continuous or categorical. Alternatively, if you have more than two categories of the dependent variable, see multinomial logistic regression.


Multinomial logistic regression: In this type of logistic regression, there are more than two unordered outcomes. Banking business example: a bank wants to know the probability that a particular customer will invest in a bank deposit, a mutual fund, or a bond.

Ordinal logistic regression: In this kind of logistic regression, there are more than two outcomes and they have a natural order. Banking business example: a bank runs a customer satisfaction survey on its customer service and tries to predict whether a particular customer will rate the service as excellent, good, satisfactory, or bad.

Assumptions for Using Logistic Regression

Assumption 1:

The dependent variable should be measured on a dichotomous scale. Examples of dichotomous variables include gender (two groups: "males" and "females"), body composition (two groups: "obese" or "not obese"), and so forth. If your dependent variable was measured on a continuous scale instead, you will need to carry out multiple regression; if it was measured on an ordinal scale, ordinal regression would be a more appropriate starting point.

Assumption 2: You have one or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. Examples of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot"). Examples of nominal variables include gender (e.g., 2 groups: male and female), profession (e.g., 5 groups: accountant, doctor, nurse, dentist, therapist), and so forth.

Assumption 3: You should have independence of observations and the dependent variable should have mutually exclusive and exhaustive categories.

Assumption 4: There needs to be a linear relationship between any continuous independent variables and the logit transformation of the dependent variable.

Application example

A magazine reseller is trying to decide what magazines to market to customers. In the “old days,” this might have involved trying to decide which customers to send advertisements to via regular mail. In the context of today and the “web,” this might involve deciding what recommendations to make to a customer viewing a web page about other items that the customer might be interested in and therefore want to buy.

In this example, the reseller wants to decide what magazines to include in e-mails to customers as a part of an e-mail marketing campaign. All of the e-mails will go to customers who have previously bought a magazine subscription at a specific web site and who have not opted out of receiving e-mails.

The magazines advertised in each e-mail will be automatically selected specifically for each customer when the e-mail is generated in order to maximize the probability that the customer will buy. The web site will only include ads for three magazines in each e-mail in a row at the top of the message because management believes that including more ads is ineffective.

This specific web site has a lot of information about each customer, such as income, number of people in the household, and so on. Here are the variables it has on each customer from third-party sources:

• Household Income (Income; rounded to the nearest $1,000.00)

• Gender (Is Female = 1 if the person is female, 0 otherwise)

• Marital Status (Is Married = 1 if married, 0 otherwise)

• College Educated (Has College = 1 if has one or more years of college education, 0 otherwise)

• Employed in a Profession (Is Professional = 1 if employed in a profession, 0 otherwise)

• Retired (Is Retired = 1 if retired, 0 otherwise)

• Not employed (Unemployed = 1 if not employed, 0 otherwise)

• Length of Residency in Current City (ResLength; in years)

• Dual Income if Married (Dual = 1 if dual income, 0 otherwise)

• Children (Minors = 1 if children under 18 are in the household, 0 otherwise)

• Home ownership (Own = 1 if own residence, 0 otherwise)

• Resident type (House = 1 if residence is a single family house, 0 otherwise)

• Race (White = 1 if race is white, 0 otherwise)

• Language (English = 1 if the primary language in the household is English, 0 otherwise)

So how might this specific web site decide what magazines to market to each person; that is, what ads to put in each e-mail? One way would be to develop a logistic regression equation that predicts the probability that a customer will buy a particular magazine based on the data that the company has about the customer.

For this example, we selected 25 emails sent to customers who bought the magazine (Christian Kid), and recorded the purchase behavior. In addition to the variables for each customer listed above, the reseller has the following variables from its own databases:


The dependent variable Y comes from the “experiment”; that is, from the 25 e-mails to customers containing the ad for “Christian Kid” and whether or not the customer purchased the magazine. That is, the dependent variable is: Purchased “Christian Kid” (Buy = 1 if purchased “Christian Kid”, 0 otherwise).

Solution:

In the “Christian Kid” example, we are trying to predict the probability that a customer will respond to an e-mail ad and buy a children’s magazine called “Christian Kid.” We have run an experiment and collected 673 observations where a customer was shown the Christian Kid ad. For each of these observations, we have recorded whether or not the customer buys together with a set of explanatory X-variables. Since the dependent Y variable is binary, logistic regression is appropriate.

The coefficient table from the logistic regression output is shown below:

Variables in the Equation

Step 1ᵃ

Variable          B (Estimated)   S.E. (Std. error)   Wald     df   Sig.   Exp(B) (Odds ratio)
Income            .0002           .000                73.019   1    .000   1.0002
IsFemale          1.646           .465                12.525   1    .000   5.186
IsMarried         .566            .586                .932     1    .334   1.762
HasCollege        -.279           .444                .396     1    .529   .756
IsProfessional    .225            .465                .235     1    .628   1.253
IsRetired         -1.159          .932                1.544    1    .214   .314
Unemployed        .989            4.690               .044     1    .833   2.688
ResidenceLength   .025            .014                3.196    1    .074   1.025
DualIncome        .452            .522                .751     1    .386   1.571
Minors            1.133           .464                5.974    1    .015   3.105
Own               1.056           .559                3.566    1    .059   2.876
House             -.927           .622                2.220    1    .136   .396
White             1.864           .545                11.678   1    .001   6.448
English           1.530           .841                3.314    1    .069   4.620
PrevChildMag      1.557           .712                4.785    1    .029   4.746
PrevParentMag     .478            .624                .586     1    .444   1.612
Constant          -17.911         2.223               64.934   1    .000   .000

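Logistic regression equation (with $p$ the probability that the customer buys, Buy = 1, and the coefficients taken from column B of the table):

$$\ln\!\left(\frac{p}{1-p}\right) = -17.911 + 0.0002\,\text{Income} + 1.646\,\text{IsFemale} + 0.566\,\text{IsMarried} - 0.279\,\text{HasCollege} + 0.225\,\text{IsProfessional} - 1.159\,\text{IsRetired} + 0.989\,\text{Unemployed} + 0.025\,\text{ResidenceLength} + 0.452\,\text{DualIncome} + 1.133\,\text{Minors} + 1.056\,\text{Own} - 0.927\,\text{House} + 1.864\,\text{White} + 1.530\,\text{English} + 1.557\,\text{PrevChildMag} + 0.478\,\text{PrevParentMag}$$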

So how are the odds ratios (Exp(B)) in the last column calculated? The odds ratio is simply the exponential of the corresponding regression coefficient. That is: $\text{Exp}(B) = e^{B}$.
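The relationship between B and Exp(B), and the way the fitted coefficients turn into a purchase probability, can be sketched in plain Python. The coefficients below are the rounded values from the table; the function name `purchase_probability` and the example customer are hypothetical, and because the Income coefficient is rounded to .0002 the resulting probability is only illustrative.

```python
import math

# Coefficients (column B) copied from the "Variables in the Equation" table.
coefficients = {
    "Income": 0.0002, "IsFemale": 1.646, "IsMarried": 0.566,
    "HasCollege": -0.279, "IsProfessional": 0.225, "IsRetired": -1.159,
    "Unemployed": 0.989, "ResidenceLength": 0.025, "DualIncome": 0.452,
    "Minors": 1.133, "Own": 1.056, "House": -0.927, "White": 1.864,
    "English": 1.530, "PrevChildMag": 1.557, "PrevParentMag": 0.478,
}
constant = -17.911

# Odds ratio for each predictor: Exp(B) = e^B.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}


def purchase_probability(customer):
    """Predicted probability of buying; variables not listed count as 0."""
    logit = constant + sum(coefficients[name] * value
                           for name, value in customer.items())
    return 1.0 / (1.0 + math.exp(-logit))


for name in ("IsFemale", "Minors", "White"):
    print(f"Exp(B) for {name}: {odds_ratios[name]:.3f}")

# Hypothetical customer: household income of $100,000, female, minors at home.
example_customer = {"Income": 100000, "IsFemale": 1, "Minors": 1}
print(f"Predicted probability: {purchase_probability(example_customer):.4f}")
```

An odds ratio above 1 (for example IsFemale) means the odds of buying rise when the variable increases by one unit, while a ratio below 1 (for example IsRetired) means they fall.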


Chapter review problems

1. The table below presents the scores of 10 cities on each of 6 variables: three measures of criminal activity and three measures of population structure. Crime rates are the number of incidents per 100,000 population as of 2000.

City Homicide Robbery Car theft Growth Density Urban

A 1 19 104 3.8 41.7 52.6

B 5 214 286 5.5 402.7 92.1

C 4 138 344 4.7 277.8 81.2

D 2 37 184 5.4 52.3 45.3

E 6 89 252 14.4 181.5 78.1

F 5 81 230 9.6 102.3 48.8

G 6 145 447 22.8 81.5 84.8

H 7 146 842 40 46.7 88.2

I 3 99 594 21.1 90 83.1

J 6 178 537 13.6 221.2 96.7

Take the three crime variables as the dependent variables (one at a time) and

a. Find the multiple regression equations (unstandardized) with growth and urbanization as independent variable, and interpret the ANOVA regression

ANOVAᵃ

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   26053.749        2    13026.875     12.170   .005ᵇ
  Residual     7492.651         7    1070.379
  Total        33546.400        9

a. Dependent Variable: Robbery

b. Predictors: (Constant), Urban, Growth

Coefficientsᵃ

Model 1      Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)   -102.201           45.148                           -2.264   .058
Growth       -.816              1.089        -.151               -.749    .478
Urban        3.041              .651         .941                4.670    .002

a. Dependent Variable: Robbery

b. Make a prediction for each crime variable for a city with a 5% growth rate and a population that is 90% urbanized.

c. Compute beta weights for each independent variable in each equation and compare their relative effect on robbery.

d. Compute multiple coefficient of correlation and multiple coefficient of determination for Robbery variable, using the population variables as independent variables.

e. Interpret the assumption about autocorrelation.

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .938ᵃ   .879       .845                88.337                       1.946

a. Predictors: (Constant), Growth, Urban

b. Dependent Variable: Robbery

f. Write a paragraph summarizing your findings. (the descriptive data can help you)

Descriptive Statistics

Mean Std. Deviation N

Car theft 382.00 224.090 10

Urban 75.10 18.888 10


2. The AFL-CIO has undertaken a study of 30 secretaries’ yearly salaries (in thousands of dollars). The organization wants to predict salaries from several other variables. The variables considered to be potential predictors of salary are:

X1 = months of service

X2 = years of education

X3 = score on standardized test

X4 = words per minute (wpm) typing speed

X5 = ability to take dictation in words per minute

A multiple regression model with all five variables was run on a computer package, resulting in the following output:

Unstandardized coefficients

Model      B       Std. Error   t-value   Sig.
Constant   9.788   0.377        25.96     0.000
X1         0.11    0.019        5.178     0.02
X2         0.053   0.038        1.369     0.06
X3         0.071   0.064        1.119     0.07
X4         0.004   0.307        0.013     0.09
X5         0.065   0.038        1.734     0.054

R = .929, R² = .863

Assume that the assumptions for using a linear regression model are met.

a. What is the regression equation?

b. Would you consider removing any of these predictor variables from the model? Why or why not?

c. From this model, what is the predicted salary (in thousands of dollars) of a secretary with 10 years (120 months) of experience, 9th grade education (9 years of education), a 50 on the standardized test, 60 wpm typing speed, and the ability to take 30 wpm dictation?

d. Which variables have a significant influence on the salary?
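As a rough check on items c and d, the fitted equation can be evaluated directly from the coefficient table. The sketch below assumes a 0.05 significance level (the exercise does not state one) and uses hypothetical names such as `predict_salary`.

```python
# Coefficients and significance values copied from the output table above
COEF = {"const": 9.788, "X1": 0.11, "X2": 0.053, "X3": 0.071, "X4": 0.004, "X5": 0.065}
SIG = {"X1": 0.02, "X2": 0.06, "X3": 0.07, "X4": 0.09, "X5": 0.054}
ALPHA = 0.05  # assumed significance level

def predict_salary(values):
    """Return the predicted salary (thousands of dollars) for a dict of X values."""
    return COEF["const"] + sum(COEF[name] * x for name, x in values.items())

# 120 months of service, 9 years of education, test score 50, 60 wpm typing, 30 wpm dictation
secretary = {"X1": 120, "X2": 9, "X3": 50, "X4": 60, "X5": 30}
salary = predict_salary(secretary)

# Predictors whose Sig. value falls below the assumed alpha
significant = [name for name, p in SIG.items() if p < ALPHA]

print(f"Predicted salary: {salary:.3f} thousand dollars")
print("Significant at 0.05:", significant)
```

With a different significance level (for example 0.10), the list of significant predictors would change, so the choice of alpha should be stated in the answer.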

3. “Home prices”. Many variables have an impact on determining the price of a house. A few of these are the size of the house (square feet), lot size, and number of bathrooms. Information for a random sample of homes for sale in the Statesboro, GA, area was obtained from the Internet. Regression output modeling the asking price with square footage and number of bathrooms gave the following result:

Model      B         Std. Error   t-value   Sig.
Constant   -152037   85619        -1.78     0.110
Baths      9530      40826        0.23      0.821
Sq ft      139.87    46.67        3         0.015

Dependent variable: Price, R = .756, R² = .571

ANOVA

Model        Sum of Squares   df   Mean Square   F       Sig.
Regression   99303550067      2    49651775033   11.08   0.004
Residual     40416679100      9    4490742122
Total        1.40E+11         11

a. Write the regression equation.

b. Interpret the multiple correlation coefficient.

c. Would you consider removing any of these predictor variables from the model? Why or why not?

d. How much of the variation in home asking prices is accounted for by the model?

e. From this model, what is the expected price (in thousands of dollars) of a house with 1,000 square feet and 5 bathrooms?
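When deciding whether a predictor could be removed (item c), it helps to recall that each t-value in the coefficient table is the coefficient divided by its standard error. The following sketch recomputes those ratios from the table; the dictionary name `table` and the variable `t_ratios` are illustrative only.

```python
# (B, Std. Error) pairs copied from the home price coefficient table
table = {
    "Constant": (-152037, 85619),
    "Baths": (9530, 40826),
    "Sq ft": (139.87, 46.67),
}

# t = B / Std. Error for each term in the model
t_ratios = {name: b / se for name, (b, se) in table.items()}

for name, t in t_ratios.items():
    print(f"{name:<9} t = {t:.2f}")
```

A t-ratio close to zero, as for Baths, indicates that the coefficient is small relative to its sampling variability, which is the usual argument for considering removal of that predictor.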


Dependent variable: Calories. R-squared = 38.9%, R-squared (adjusted) = 36.4%.

ANOVA

Model        Sum of Squares   df   Mean Square   F      Sig.
Regression   11211.1          3    3737.033333   15.1   0.0001
Residual     17583.5          71   247.6549296
Total        2.88E+04

Model       B          Std. Error   t-value   Sig.
Constant    81.9436    5.456        15        0.0001
Sodium      0.05922    0.0218       2.72      0.0082
Potassium   -0.01684   0.026        -0.648    0.5193
Sugars      2.4475     0.4164       5.88      0.0001

Assuming that the conditions for multiple regression are met:

a. What is the regression equation?

b. Do you think this model would do a reasonably good job at predicting calories? Explain.

c. Would you consider removing any of these predictor variables from the model? Why or why not?
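For item b, the summary figures reported with the Calories model can be reproduced from the ANOVA table alone. The sketch below applies the standard definitions R² = SSR / SST, adjusted R² = 1 - (SSE / df residual) / (SST / df total), and F = MSR / MSE; the total degrees of freedom are taken as the sum of the regression and residual degrees of freedom, since the Total row does not list them.

```python
# Sums of squares and degrees of freedom from the Calories ANOVA table
ss_regression, df_regression = 11211.1, 3
ss_residual, df_residual = 17583.5, 71

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual  # assumed: not printed in the Total row

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (ss_residual / df_residual) / (ss_total / df_total)
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)

print(f"R-squared:          {r_squared:.1%}")
print(f"Adjusted R-squared: {adj_r_squared:.1%}")
print(f"F statistic:        {f_statistic:.1f}")
```

These values can be compared with the R-squared, adjusted R-squared, and F reported in the output when judging how well the model predicts calories.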

5. The table below presents information on three variables for a sample of 25 yearly observations. These data will be used to develop a linear model that predicts annual profit margin as a function of revenue per deposit dollar and number of offices.

The SPSS report is given below; verify and interpret all results.

Savings and Loan Association Operating Data

Year   Revenue per Dollar   Number of Offices   Profit Margin   |   Year   Revenue per Dollar   Number of Offices   Profit Margin
1      3.92                 7298                0.75            |   14     3.78                 6672                0.84
2      3.61                 6855                0.71            |   15     3.82                 6890                0.79
3      3.32                 6636                0.66            |   16     3.97                 7115                0.7
4      3.07                 6506                0.61            |   17     4.07                 7327                0.68
5      3.06                 6450                0.7             |   18     4.25                 7546                0.72
6      3.11                 6402                0.72            |   19     4.41                 7931                0.55
7      3.21                 6368                0.77            |   20     4.49                 8097                0.63
8      3.26                 6340                0.74            |   21     4.7                  8468                0.56
10     3.42                 6352                0.82            |   23     4.69                 8991                0.51
11     3.45                 6361                0.75            |   24     4.71                 9179                0.47
12     3.58                 6369                0.77            |   25     4.78                 9318                0.32
13     3.66                 6546                0.78            |

a. Interpret the coefficient of correlation for each pair:

Correlations

                                          Profit Margin   Revenue   Number of Offices
Profit Margin       Pearson Correlation   1               -.704**   -.868**
                    Sig. (2-tailed)                       .000      .000
                    N                     25              25        25
Revenue             Pearson Correlation   -.704**         1         .941**
                    Sig. (2-tailed)       .000                      .000
                    N                     25              25        25
Number of Offices   Pearson Correlation   -.868**         .941**    1
                    Sig. (2-tailed)       .000            .000
                    N                     25              25        25

b. Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination (R²). How much of the variance in Profit Margin is explained by the two independent variables?

Model Summary (b)

Model   R          R Square   Adjusted R Square   Durbin-Watson
1       .930 (a)   .865       .853                1.948

a. Predictors: (Constant), Number of Offices, Revenue
b. Dependent Variable: Profit Margin

Multiple correlation coefficient (R): ________________________________________

Coefficient of multiple determination (R²): ________________________________________

c. Assess the autocorrelation assumption using the Durbin-Watson statistic: ________________________________________

e. Find the multiple regression equation with Revenue (x1) and Number of Offices (x2):

_____________________________________________________________________________________________


g. What Profit Margin would be expected for a Revenue of 4.8 and 9,500 offices?

_____________________________________________________________________________________________

h. Interpret the assumption that the errors follow a normal distribution, using the graph and the tests of normality.

Interpret:____________________ Interpret:________________________________________

Tests of Normality

                          Kolmogorov-Smirnov (a)       Shapiro-Wilk
                          Statistic   df   Sig.        Statistic   df   Sig.
Unstandardized Residual   .082        25   .200*       .970        25   .645

Hypothesis test for the normality of the residuals

Ho: The residuals follow a normal distribution.

Ha: ________________________________________

Sig: _____________

Decision and interpretation of the result: ________________________________________
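The decision step of the normality test can be written as a simple rule: reject Ho when the reported significance is at or below the chosen level, otherwise fail to reject it. The sketch below assumes a 0.05 level (not stated in the exercise) and uses the Sig. values from the Tests of Normality table; the function name `decide` is hypothetical.

```python
ALPHA = 0.05  # assumed significance level

# Sig. values for the unstandardized residuals, from the Tests of Normality table
sig_values = {"Kolmogorov-Smirnov": 0.200, "Shapiro-Wilk": 0.645}

def decide(p_value, alpha=ALPHA):
    """Return the decision about Ho (residuals are normally distributed)."""
    return "reject Ho" if p_value <= alpha else "fail to reject Ho"

decisions = {test: decide(p) for test, p in sig_values.items()}

for test, decision in decisions.items():
    print(f"{test}: Sig. = {sig_values[test]:.3f} -> {decision}")
```

Failing to reject Ho does not prove normality; it only means the sample gives no evidence against it, so the graph should be examined as well.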

6. The table below presents information on three variables for a small sample of eight nations. We will take abortion rate as the dependent variable and examine its relationship with two variables: one measures women’s status and power and the other measures religiosity.

Nation    Abortion Rate (Y)   Women's Status (x1)   Religiosity (x2)
Canada    165                 0.5                   74
Chile     100                 0.45                  93
Denmark   400                 0.8                   48
Germany   208                 0.54                  67
Italy     389                 0.7                   70
Japan     379                 0.52                  55
UK        207                 0.58                  67


The SPSS report is given below; verify and interpret all output:

a. Interpret the coefficient of correlation for at least one pair:_________________________________

Scatter Plot:

b. Interpret:___________________________ Interpret:___________________________

Model Summary (b)

Model   R   R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1

a. Predictors: (Constant), Religiosity (x2), Women's Status (x1)
b. Dependent Variable: Abortion Rate (Y)

c. Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination (R²). How much of the variance in abortion rate is explained by the two independent variables?

d. Assess the autocorrelation assumption using the Durbin-Watson statistic: __________________________________

ANOVA (a)

Model 1       Sum of Squares   df   Mean Square   F       Sig.
Regression    87171.942        2    43585.971     8.135   .027 (b)
Residual      26790.058        5    5358.012
Total         113962.000       7

a. Dependent Variable: Abortion Rate (Y)
b. Predictors: (Constant), Religiosity (x2), Women's Status (x1)

e. Interpret ANOVA table:_______________________________________________________

Coefficients (a)

                      Unstandardized Coefficients   Standardized Coefficients
Model 1               B         Std. Error          Beta                        t        Sig.
(Constant)            310.885   345.19                                          0.901    0.409
Women's Status (x1)   348.413   317.472             0.398                       1.097    0.322
Religiosity (x2)      -3.789    2.624               -0.523                      -1.444   0.208

a. Dependent Variable: Abortion Rate (Y)

f. Find the multiple regression equation with Women's Status (x1) and religiosity (x2).

g. What abortion rate would be expected for a Women's Status of 0.90 and a religiosity of 90?

h. Compute beta weights for each independent variable and compare their relative effect on Abortion Rate.
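For item g, it is worth checking whether the requested values lie inside the range of the data before trusting the prediction. The sketch below evaluates the fitted equation and compares the new values with the ranges of the nations listed in the table above; only the rows shown there are used, which is an assumption of this illustration, and the names `predict` and `in_range` are hypothetical.

```python
# (Women's Status x1, Religiosity x2) for the nations listed in the table
observed = [(0.5, 74), (0.45, 93), (0.8, 48), (0.54, 67), (0.7, 70), (0.52, 55), (0.58, 67)]

# Unstandardized coefficients from the Coefficients table
B0, B1, B2 = 310.885, 348.413, -3.789

def predict(x1, x2):
    """Predicted abortion rate from the fitted equation."""
    return B0 + B1 * x1 + B2 * x2

def in_range(value, values):
    """True when value lies between the smallest and largest observed value."""
    return min(values) <= value <= max(values)

x1_new, x2_new = 0.90, 90
prediction = predict(x1_new, x2_new)
x1_inside = in_range(x1_new, [row[0] for row in observed])
x2_inside = in_range(x2_new, [row[1] for row in observed])

print(f"Predicted abortion rate: {prediction:.2f}")
print(f"Women's Status within observed range: {x1_inside}")
print(f"Religiosity within observed range:    {x2_inside}")
```

When a new value falls outside the observed range, the prediction is an extrapolation and should be interpreted with extra caution.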

7. A market study for the self-service retailer "Simba Supermarket" analyzes the annual amount that families of four or more members spend on food. It is thought that three independent variables are related to the cost of food. These variables are: total household income, family size, and whether the family has children in college.

Family   Expenditure on food   Household income ($1000)   Family size   Children in college
1        3900                  37.6                       4             0
2        5300                  51.5                       5             1
3        4300                  41.6                       4             0
4        4900                  46.8                       5             0
5        6400                  53.8                       6             1
6        7300                  62.6                       7             1
7        4900                  54.3                       5             0
8        5300                  52.7                       4             0
9        6100                  60.8                       5             1
10       6400                  63.5                       6             1
11       7400                  64.2                       8             1
12       5800                  56.3                       5             0


Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .969 (a)   0.939      0.916               320.393                      2.559

a. Predictors: (Constant), Children in college, Household income ($1000), Family size
b. Dependent Variable: Expenditure on food

a. Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination (R²). How much of the variance in Expenditure on food is explained by the three independent variables?

b. Assess the autocorrelation assumption using the Durbin-Watson statistic: ____________________________________

Coefficients (a)

                           Unstandardized Coefficients   Standardized Coefficients
Model                      B         Std. Error          Beta                        t       Sig.
(Constant)                 35.405    767.913                                         .046    .964
Household income ($1000)   63.753    18.391              .493                        3.467   .008
Family size                386.805   131.609             .432                        2.939   .019
Children in college        275.684   275.936             .131                        .999    .347

a. Dependent Variable: Expenditure on food

c. Compute beta weights for each independent variable and compare their relative effect on expenditure on food.

d. Find the multiple regression equation with Household income (x1), family size (x2), and children in college (x3).

e. Would you consider removing any of these predictor variables from the model? Why or why not?

f. What expenditure on food would be expected for a family of 5 with no children in college and earnings of $60,000?
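Because the full data set for exercise 7 is listed, the SPSS output can be verified by refitting the model. The sketch below is one possible way to do so in plain Python: it solves the least-squares normal equations (X'X)b = X'y by Gaussian elimination, then computes R², the Durbin-Watson statistic, and the prediction for item f. It is not the procedure SPSS uses internally, and the helper names `solve` and `fit_ols` are hypothetical.

```python
# (expenditure on food, household income in $1000, family size, children in college)
rows = [
    (3900, 37.6, 4, 0), (5300, 51.5, 5, 1), (4300, 41.6, 4, 0),
    (4900, 46.8, 5, 0), (6400, 53.8, 6, 1), (7300, 62.6, 7, 1),
    (4900, 54.3, 5, 0), (5300, 52.7, 4, 0), (6100, 60.8, 5, 1),
    (6400, 63.5, 6, 1), (7400, 64.2, 8, 1), (5800, 56.3, 5, 0),
]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x

def fit_ols(data):
    """Return (coefficients, residuals) for y regressed on a constant and the predictors."""
    y = [r[0] for r in data]
    X = [[1.0, *r[1:]] for r in data]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    coef = solve(xtx, xty)
    resid = [yi - sum(c * v for c, v in zip(coef, x)) for x, yi in zip(X, y)]
    return coef, resid

coef, resid = fit_ols(rows)

y = [r[0] for r in rows]
mean_y = sum(y) / len(y)
sse = sum(e * e for e in resid)
sst = sum((v - mean_y) ** 2 for v in y)
r_squared = 1 - sse / sst

# Durbin-Watson: squared successive residual differences over the residual sum of squares
durbin_watson = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid))) / sse

# Item f: income of $60,000 (60 in $1000 units), family size 5, no children in college
predicted = coef[0] + coef[1] * 60 + coef[2] * 5 + coef[3] * 0

print("Coefficients:", [round(c, 3) for c in coef])
print(f"R-squared: {r_squared:.3f}  Durbin-Watson: {durbin_watson:.3f}")
print(f"Predicted expenditure: {predicted:.2f}")
```

The printed coefficients, R², and Durbin-Watson value can be compared line by line with the Coefficients and Model Summary tables.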

8. Open the SPSS data file “demo.sav”. This data file contains survey data, including demographic data and various attitude measures. With these data, calculate and interpret:

a. Compute and interpret the simple correlation coefficient between Household income in thousands and Age in years.

b. Draw a scatter diagram of the variables from the previous item and interpret it.

c. Compute and interpret the multiple correlation coefficient. From here on, use the variables indicated below.

Data:

Dependent variable: Household income in thousands

Independent variables: Age in years; Level of education; Years with current employer; Number of people in household

d. Compute and interpret the multiple coefficient of determination within the context of this problem.

e. Compute and interpret the multiple regression equation. Is the model significant? (Perform the hypothesis test for multiple regression analysis.)

f. From the analysis performed, would you recommend removing any variable(s) that do not contribute significantly to the model?

g. Check the autocorrelation assumption (Durbin-Watson).