Nihar Ranjan Roy
Textbook: Business Statistics, by Ken Black, Wiley India Edition
Regression analysis with two or more independent variables or with at least one nonlinear predictor is called multiple regression analysis.
Examples of simple regression applications include models to predict retail sales by population density.
However, in many cases, other independent variables, taken in conjunction with these variables, can make the regression model a better fit in predicting the dependent variable. For example, sales could be predicted by the size of the store and the number of competitors in addition to population density.
Multiple Regression
Recall that the probabilistic simple regression model is
y = β0 + β1x + ε
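Multiple Regression Model
y = β0 + β1x1 + β2x2 + ... + βkxk + ε
where y is the value of the dependent variable, β0 is the regression constant, β1, ..., βk are the partial regression coefficients of the k independent variables, and ε is the error of prediction.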
In multiple regression analysis, the dependent variable, y, is sometimes referred to as the response variable.
The partial regression coefficient of an independent variable, βi, represents the change that will occur in the value of y from a one-unit increase in that independent variable if all other variables are held constant.
The “full” (versus partial) regression coefficient of an independent variable is a coefficient obtained from the bivariate model (simple regression) in which the independent variable is the sole predictor of y.
The partial regression coefficients occur because more than one predictor is included in the model.
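The difference between a "full" and a partial coefficient can be seen on simulated data. The sketch below is only an illustration: the data, the seed, and the names full_b1 and partial_b1 are assumptions, not part of the textbook example.

```r
# Simulated data (an assumption for illustration; not from the text).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

# "Full" coefficient: x1 is the sole predictor of y (simple regression).
full_b1 <- unname(coef(lm(y ~ x1))["x1"])

# Partial coefficient: x1 is fitted alongside x2, so x2 is held constant.
partial_b1 <- unname(coef(lm(y ~ x1 + x2))["x1"])

print(c(full = full_b1, partial = partial_b1))
```

Because x2 moves with x1 in this simulation, the full coefficient absorbs part of the effect of x2, while the partial coefficient stays close to the value 2 used to generate the data.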
Multiple Regression Model
The simplest multiple regression model is one constructed with two independent variables, where the highest power of either variable is 1 (first-order regression model).
The regression model is
y = β0 + β1x1 + β2x2 + ε
Multiple Regression Model with Two Independent Variables (First-Order)
The procedure for determining formulas to solve for multiple regression coefficients is similar to the one used in simple regression.
The formulas are established to meet an objective of minimizing the sum of squares of error for the model.
Hence, the regression analysis shown here is referred to as least squares analysis. Methods of calculus are applied, resulting in k + 1 equations with k + 1 unknowns (b0 and k values of bi) for multiple regression analyses with k independent variables.
Thus, a regression model with six independent variables will generate seven simultaneous equations with seven unknowns (b0, b1, b2, b3, b4, b5, b6).
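The k + 1 simultaneous equations can be written in matrix form as (X'X)b = X'y, where X is the data matrix with a leading column of ones. The following is a minimal sketch on simulated data with k = 2 (the data and variable names are assumptions for illustration); it solves the three equations directly and compares the result with lm().

```r
# Simulated data (an assumption for illustration; not from the text).
set.seed(2)
n  <- 30
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 4 + 1.5 * x1 - 2 * x2 + rnorm(n)

# Design matrix: a column of ones for b0, then one column per predictor.
X <- cbind(1, x1, x2)

# Normal equations (X'X) b = X'y: k + 1 = 3 equations in 3 unknowns.
XtX <- crossprod(X)        # X'X, a 3 x 3 matrix
Xty <- crossprod(X, y)     # X'y, a 3 x 1 matrix
b_normal <- drop(solve(XtX, Xty))

# The same least squares estimates from lm().
b_lm <- unname(coef(lm(y ~ x1 + x2)))

print(rbind(normal_equations = b_normal, lm = b_lm))
```

lm() does not solve the normal equations this way internally (it uses a QR decomposition), but both approaches minimize the same sum of squares of error, so the estimates agree up to rounding.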
A real estate study was conducted in a small Louisiana city to determine what variables, if any, are related to the market price of a home. Several variables were explored, including the number of bedrooms, the number of bathrooms, the age of the house, the number of square feet of living space, the total number of square feet of space, and the number of garages. Suppose the researcher wants to develop a regression model to predict the market price of a home by two variables, “total number of square feet in the house” and “the age of the house.”
A Multiple Regression Model
ŷ = 57.4 + .0177x1 − .666x2
> RealState <- read.csv("C:/R workspace/RealState.csv", stringsAsFactors=FALSE)
> View(RealState)
> rs=RealState
> relation<-lm(rs$y ~ rs$x1+rs$x2)
> print(relation)

Call:
lm(formula = rs$y ~ rs$x1 + rs$x2)

Coefficients:
(Intercept)        rs$x1        rs$x2
   57.35075      0.01772     -0.66635

> print(summary(relation))

Call:
lm(formula = rs$y ~ rs$x1 + rs$x2)

Residuals:
     Min       1Q   Median       3Q      Max
-27.7018  -6.8938  -0.1728   7.1340  23.9361

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.350746  10.007152   5.731 1.31e-05 ***
rs$x1        0.017718   0.003146   5.633 1.64e-05 ***
rs$x2       -0.666348   0.227997  -2.923  0.00842 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.96 on 20 degrees of freedom
Multiple R-squared:  0.7411,  Adjusted R-squared:  0.7152
F-statistic: 28.63 on 2 and 20 DF,  p-value: 1.353e-06
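Once the coefficients are estimated, the fitted equation can be used for prediction. The sketch below plugs the coefficients printed above into the model for a hypothetical house; the 2,500 square feet and 12 years are made-up inputs, and it is assumed that x1 is total square feet and x2 is the age of the house. The result is in whatever units the y column of the data set uses.

```r
# Coefficients copied from the summary() output above.
b0 <- 57.350746    # intercept
b1 <- 0.017718     # x1, assumed to be total square feet
b2 <- -0.666348    # x2, assumed to be age of the house

# Fitted model: y-hat = b0 + b1 * x1 + b2 * x2
predict_price <- function(sqft, age) b0 + b1 * sqft + b2 * age

# Hypothetical house: 2,500 square feet, 12 years old.
y_hat <- predict_price(sqft = 2500, age = 12)
print(y_hat)
```

With access to the original data, the same value would normally come from predict() on a model fitted with the data argument, for example lm(y ~ x1 + x2, data = rs).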
Multiple regression models can be developed to fit almost any data set if the level of measurement is adequate and enough data points are available.
Once a model has been constructed, it is important to test the model to determine whether it fits the data well and whether the assumptions underlying regression analysis are met.
Assessing the adequacy of the regression model can be done in several ways, including
− testing the overall significance of the model,
− studying the significance tests of the regression coefficients,
− computing the residuals,
− examining the standard error of the estimate, and
− observing the coefficient of determination.
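Several of these checks can be reproduced by hand from the summary() output of the real estate model. The sketch below is a rough cross-check, not part of the textbook example: it takes R-squared = 0.7411, k = 2 predictors, and 20 error degrees of freedom from that output, and recomputes the adjusted R-squared, the overall F statistic with its p-value, and the t value for x2.

```r
# Values read from the summary() output of the real estate model.
r2       <- 0.7411    # multiple R-squared
k        <- 2         # number of independent variables
df_error <- 20        # residual degrees of freedom, n - k - 1
n        <- df_error + k + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F test of the model, written in terms of R-squared.
f_stat  <- (r2 / k) / ((1 - r2) / df_error)
p_value <- pf(f_stat, k, df_error, lower.tail = FALSE)

# t test of a single coefficient: estimate divided by its standard error.
t_x2 <- -0.666348 / 0.227997

print(c(adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value, t_x2 = t_x2))
```

Small differences from the printed output are expected because R-squared is used here rounded to four decimals.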
SIGNIFICANCE TESTS OF THE REGRESSION MODEL AND ITS COEFFICIENTS
Regression models in which the highest power of any predictor variable is 1 and in which there are no interaction terms, that is, cross products (xi xj), are referred to as first-order models.
Polynomial Regression: Non-Linear Regression
Consider a regression model with one independent variable where the model includes a second predictor, which is the independent variable squared. Such a model is referred to as a second-order model with one independent variable because the highest power among the predictors is 2, but there is still only one independent variable. This model takes the following form.
y = β0 + β1x1 + β2x1² + ε
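In R, a second-order model of this kind can be fitted with lm() by adding the squared term through I(). The sketch below uses simulated data (an assumption for illustration); the model is still linear in its coefficients, which is why ordinary least squares applies.

```r
# Simulated data with a quadratic trend (an assumption for illustration).
set.seed(3)
x <- seq(1, 10, length.out = 50)
y <- 2 + 0.5 * x + 1.5 * x^2 + rnorm(50)

# I(x^2) makes the formula treat x^2 as an arithmetic square,
# so the model has two predictors: x and x squared.
fit2 <- lm(y ~ x + I(x^2))

print(coef(fit2))
```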
Often when two different independent variables are used in a regression analysis, an interaction occurs between the two variables.
In regression analysis, interaction can be examined as a separate independent variable.
An interaction predictor variable can be designed by multiplying the data values of one variable by the values of another variable, thereby creating a new variable.
A model that includes an interaction variable is
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
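The construction of an interaction variable can be sketched in R as follows. The data are simulated and the name x1x2 is an assumption for illustration. The product column is built by hand, as described above, and then compared with R's formula shorthand x1 * x2, which expands to x1 + x2 + x1:x2.

```r
# Simulated data with an interaction effect (an assumption for illustration).
set.seed(4)
n  <- 60
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 5 + 1.0 * x1 + 2.0 * x2 + 0.5 * x1 * x2 + rnorm(n)

# Interaction variable: multiply the values of one variable by the other.
x1x2 <- x1 * x2
fit_manual <- lm(y ~ x1 + x2 + x1x2)

# Equivalent model using the formula shorthand.
fit_formula <- lm(y ~ x1 * x2)

print(coef(fit_manual))
print(coef(fit_formula))
```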
Regression Models with Interaction
One problem that can arise in multiple regression analysis is multicollinearity.
Multicollinearity occurs when two or more of the independent variables of a multiple regression model are highly correlated.
Technically, if two of the independent variables are correlated, we have collinearity; when three or more independent variables are correlated, we have multicollinearity.
However, the two terms are frequently used interchangeably.
The reality of business research is that most of the time some correlation between predictors (independent variables) will be present. The problem of multicollinearity arises when the intercorrelation between predictor variables is high.
This relationship causes several other problems, particularly in the interpretation of the analysis.
1. It is difficult, if not impossible, to interpret the estimates of the regression coefficients.
2. Inordinately small t values for the regression coefficients may result.
3. The standard deviations of regression coefficients are overestimated.
4. The algebraic sign of estimated regression coefficients may be the opposite of what would be expected for a particular predictor variable.
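The effect on standard errors and t values can be illustrated with a small simulation. This is a sketch under assumed data, not the textbook's example: the same coefficient of x1 is estimated once with an unrelated second predictor and once with a second predictor that is almost a copy of x1.

```r
# Simulated data (an assumption for illustration; not from the text).
set.seed(5)
n        <- 100
x1       <- rnorm(n)
x2_indep <- rnorm(n)                   # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.05)   # nearly a copy of x1

y_indep <- 1 + 2 * x1 + 3 * x2_indep + rnorm(n)
y_coll  <- 1 + 2 * x1 + 3 * x2_coll  + rnorm(n)

# Standard error of the x1 coefficient in a fitted model.
se_b1 <- function(fit) summary(fit)$coefficients["x1", "Std. Error"]

se_indep <- se_b1(lm(y_indep ~ x1 + x2_indep))
se_coll  <- se_b1(lm(y_coll  ~ x1 + x2_coll))

print(cor(x1, x2_coll))   # correlation between the two predictors
print(c(independent = se_indep, collinear = se_coll))
```

A larger standard error means a smaller t value for the same estimate, so a predictor that truly matters can appear nonsignificant when it is highly correlated with another predictor. Inspecting cor() on the predictors before fitting is a simple first check.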
Multicollinearity
The problem of multicollinearity can arise in regression analysis in a variety of business research situations.
For example, suppose a model is being developed to predict salaries in a given industry. Independent variables such as years of education, age, years in management, experience on the job, and years of tenure with the firm might be considered as predictors. It is obvious that several of these variables are correlated (virtually all of these variables have something to do with number of years, or time) and yield redundant information.