CHAPTER 3: RESEARCH METHODOLOGY
3.9 Regression analysis
Two types of linear regression analysis are considered here: simple linear regression and
multiple linear regression. Multiple linear regression is a direct generalization of simple
linear regression. The primary difference between the two methods is that, in multiple
regression analysis, the dependent variable depends on more than one independent
variable. In road traffic safety research, the relationship between the number and/or rate
of casualties and the explanatory variables is rarely straightforward. In most cases, more
than one independent variable is needed to predict the dependent variable accurately. For
this reason, the multiple regression model is suitable for determining the relationship
between the dependent and independent variables in this study. The relationship between
the dependent variable (Y) and the independent variables (X1, X2, …, Xk) of a multiple
regression model is given by:
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k + \varepsilon \tag{3.12}$$
The multiple regression model includes the error component (ε), where ε ~ N(0, σ²). The
errors represent the deviations of the response from the true relationship. They are
unobservable random variables that account for the effects of other factors on the
response, and they are estimated by the residuals (e). In linear regression, the
regression coefficients (b0, b1, …, bk) can be estimated using the least squares method.
The first coefficient, b0, represents the Y-intercept. In the least squares method, the values
of the regression coefficients are chosen such that the sum of squared errors (distances
between the actual Y and estimated Ŷ) is minimized. The sum of squared errors (SSE) is
given by:
$$SSE = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{i1} - b_2 X_{i2} - \cdots - b_k X_{ik})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \tag{3.13}$$
where:
SSE = Sum of squared errors
Y = Actual values
Ŷ = Estimated values
i = 1, 2, 3, …, n
k = Number of independent variables in the regression analysis.
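The least squares fit and Equation 3.13 can be illustrated with a minimal sketch in Python. This is not the procedure used in this study (which relies on Statgraphics); the function names and the data values are hypothetical, and the coefficients are obtained by solving the normal equations directly, which is adequate only for small, well-conditioned problems.

```python
def fit_least_squares(X, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 gives the intercept b0
    p = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y)
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def predict(b, X):
    """Estimated values: b0 + b1*X1 + ... + bk*Xk for each observation."""
    return [b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)) for row in X]


def sum_squared_errors(y, y_hat):
    """SSE as in Equation 3.13."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))


# Hypothetical data: two independent variables (X1, X2) and a response Y
X = [(2.0, 1.0), (3.0, 5.0), (5.0, 2.0), (7.0, 8.0), (8.0, 4.0), (10.0, 9.0)]
y = [8.1, 19.8, 15.2, 33.9, 27.1, 42.0]

b = fit_least_squares(X, y)
sse = sum_squared_errors(y, predict(b, X))
print("coefficients (b0, b1, b2):", b)
print("SSE:", sse)
```

Any other choice of coefficients gives a larger SSE on the same data, which is the defining property of the least squares estimates.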
3.9.1 Inference for the regression models
In regression analysis, there is a need to measure the extent to which the sample data
points are spread around the fitted regression function. This is called the standard error
of the estimate and its objective is to measure the amount by which the actual values
differ from the estimated values. The standard error of the estimate is similar to the
standard deviation of the residuals and it is determined using the following equation:
$$S_{y,x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - k - 1}} \tag{3.14}$$
where:
Sy,x = Standard error of the estimate
Y = Actual values
Ŷ = Estimated values
n = Number of observations
k = Number of independent variables in the regression.
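As a brief illustration of Equation 3.14, the sketch below computes the standard error of the estimate from a set of actual and estimated values. The function name and the numbers are hypothetical; the estimated values are placeholders rather than the output of an actual fit.

```python
import math


def standard_error_of_estimate(y, y_hat, k):
    """S_{y,x} = sqrt(SSE / (n - k - 1)), following Equation 3.14."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(sse / (n - k - 1))


# Hypothetical actual and estimated values for a model with k = 1 predictor
y = [3.0, 5.0, 7.0, 9.0, 12.0]
y_hat = [3.5, 4.5, 7.5, 9.0, 11.5]

s_yx = standard_error_of_estimate(y, y_hat, k=1)
print("standard error of the estimate:", s_yx)
```

The divisor n − k − 1 is the number of residual degrees of freedom, so the function requires more observations than estimated coefficients.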
Another aspect of inference for the regression model is the strength of the fitted
relationship. This can be assessed using the multiple correlation coefficient (R) of the
model, which is the correlation between the responses (Y) and the fitted values (Ŷ). R is
the square root of the coefficient of determination (R²), which is given by:
$$R^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} \tag{3.15}$$
where Ȳ is the mean of the actual values. Thus, R falls within the range 0 ≤ R ≤ 1.
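Equation 3.15 can be evaluated with a few lines of Python, sketched here under the assumption that the actual and estimated values are already available. The function name and the data are hypothetical, and the estimated values are placeholders rather than the output of an actual fit.

```python
import math


def coefficient_of_determination(y, y_hat):
    """R^2 = 1 - SSE / SST, following Equation 3.15."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation
    return 1.0 - sse / sst


# Hypothetical actual and estimated values
y = [3.0, 5.0, 7.0, 9.0, 12.0]
y_hat = [3.5, 4.5, 7.5, 9.0, 11.5]

r2 = coefficient_of_determination(y, y_hat)
r = math.sqrt(r2)  # multiple correlation coefficient
print("R^2:", r2)
print("R:", r)
```

For a least squares fit that includes an intercept, R² lies between 0 and 1, so the square root is well defined.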
3.9.2 Multi-collinearity
In most multiple regression analyses, the data are routinely recorded rather than
generated from pre-selected settings of the independent variables. As a result, the
independent variables are frequently close to being linearly dependent. In other words,
some of the independent variables tend to move together, whether they are increasing or
decreasing. In this case, even though the regression model coefficients can be obtained,
the estimates tend to be unstable: their values can change dramatically with slight changes
in the data, and their magnitudes can be larger than expected. The t statistics used to judge
the significance of individual terms may be insignificant even though the F test indicates
that the regression as a whole is significant. In addition, the calculation of the least
squares estimates is sensitive to rounding errors.
The linear relationship between two or more independent variables is termed
multi-collinearity. The multi-collinearity between variables is measured by the variance
inflation factor (VIF), which is given by:
$$VIF_j = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, 3, \ldots, k \tag{3.16}$$
where:
$R_j^2$ = Coefficient of determination from the regression of the jth independent variable
on the remaining k − 1 independent variables
k = Number of independent variables.
A VIF value close to 1 indicates that multi-collinearity is not a problem for that
independent variable. Its estimated coefficient and associated t value will not change
significantly as the other independent variables are added or deleted from the regression
equation. A VIF much greater than 1 indicates that the estimated coefficient associated
with that independent variable is unstable. If the VIF is relatively high, the t statistic may
change considerably when the other independent variables are added or deleted from the
regression equation. Statisticians recommend that the value of VIF should be less than
10 in order to prevent collinearity problems.
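The VIF calculation in Equation 3.16 can be sketched as follows: each independent variable is regressed on the remaining ones and the resulting coefficient of determination is converted to a VIF. The function names and the data are hypothetical; the second variable is constructed to be roughly twice the first, so that the two are strongly collinear, while the third is only weakly related to them.

```python
def fit_least_squares(X, y):
    """Return [b0, b1, ..., bk] by solving the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    k = len(X[0])
    vifs = []
    for j in range(k):
        target = [row[j] for row in X]
        others = [[v for i, v in enumerate(row) if i != j] for row in X]
        b = fit_least_squares(others, target)
        fitted = [b[0] + sum(bi * xi for bi, xi in zip(b[1:], row))
                  for row in others]
        mean_t = sum(target) / len(target)
        sse = sum((t - f) ** 2 for t, f in zip(target, fitted))
        sst = sum((t - mean_t) ** 2 for t in target)
        r2_j = 1.0 - sse / sst
        vifs.append(1.0 / (1.0 - r2_j))
    return vifs


# Hypothetical independent variables: x2 is close to 2 * x1
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x3 = [5.0, 1.0, 4.0, 2.0, 8.0, 3.0, 7.0, 6.0]
X = list(zip(x1, x2, x3))

vifs = variance_inflation_factors(X)
for name, v in zip(("x1", "x2", "x3"), vifs):
    flag = "collinearity problem" if v >= 10 else "acceptable"
    print(name, round(v, 2), flag)
```

In this constructed example the first two variables exceed the threshold of 10 and the third does not, which mirrors the rule of thumb stated above.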
3.9.3 Stepwise regression
Various explanatory variables can be included in a multiple regression model.
Nevertheless, a model with a higher number of variables is not necessarily the best one.
In fact, multi-collinearity issues are often present in a complex model. Consequently, it
is crucial to select the best regression model. The following steps are used to select
the best regression model:
Step 1. Include all of the potential predictor variables.
Step 2. Remove the independent variables that seem inappropriate for the regression
model. Several criteria should be taken into account, but priority should be
given to removing independent variables that duplicate other independent
variables (multi-collinearity).
There are two approaches used to select the best regression model: (1) all possible
regressions and (2) stepwise regression. The latter approach is adopted in this study.
Stepwise regression is useful to simplify the regression model. In this approach, the
variables are added or removed from the regression model one at a time in order to attain
a model that contains only significant predictors but at the same time, it does not exclude
any useful variables. The basic steps involved in stepwise regression are listed below:
Step 1. Obtain all of the possible simple regressions. The predictor variable that has the
largest correlation with the dependent variable is the first variable which will
be included in the regression equation.
Step 2. Include the variable that gives the largest significant contribution to the
regression sum of squares as the next variable in the regression equation.
Determine the significance of the contribution using the F test. The value of the F
statistic that needs to be exceeded before the contribution of a variable is
deemed significant is known as the F to enter.
Step 3. Check the significance of each variable in the new equation using the F test. If
the F statistic of a variable is less than the F to remove, delete that variable from
the regression equation.
Step 4. Repeat Steps 2 and 3 until all possible additions are non-significant and all
possible deletions are significant.
Stepwise regression is carried out using Statgraphics. The software provides two
stepwise options: forward and backward selection. Forward selection begins with the
model that contains only the constant term and brings the variables in one at a time if
they improve the fit significantly. In contrast, backward selection begins with the model
that contains all of the variables and removes them one at a time until all of the remaining
variables are statistically significant. Backward selection is adopted in this study.
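The logic of backward selection can be sketched in Python as shown below. This is an illustrative outline only and does not reproduce the Statgraphics implementation. The function names, the F-to-remove value of 4.0, and the data are all assumptions made for this example. Each variable is judged by its partial F statistic, i.e. the increase in SSE when the variable is dropped, divided by the mean squared error of the current model. The data are constructed so that the third variable contributes almost nothing.

```python
def fit_least_squares(X, y):
    """Return [b0, b1, ..., bk] by solving the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def sse_of_fit(X, y, cols):
    """SSE of the least squares model that uses only the columns in cols."""
    sub = [[row[c] for c in cols] for row in X]
    b = fit_least_squares(sub, y)
    return sum((yi - (b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)))) ** 2
               for r, yi in zip(sub, y))


def backward_elimination(X, y, f_to_remove=4.0):
    """Drop the weakest variable until every partial F is >= f_to_remove."""
    cols = list(range(len(X[0])))  # start with all variables in the model
    n = len(y)
    while cols:
        sse_full = sse_of_fit(X, y, cols)
        mse = sse_full / (n - len(cols) - 1)
        partial_f = {c: (sse_of_fit(X, y, [d for d in cols if d != c])
                         - sse_full) / mse
                     for c in cols}
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_to_remove:
            break  # all remaining variables are significant
        cols.remove(weakest)
    return cols


# Hypothetical data: Y depends on the first two variables, not on the third
X = [(10, 1, 5), (10, 1, 7), (10, 3, 5), (10, 3, 7),
     (20, 1, 5), (20, 1, 7), (20, 3, 5), (20, 3, 7)]
y = [32.95, 34.05, 36.95, 36.05, 63.95, 63.05, 65.95, 67.05]

selected = backward_elimination(X, y)
print("indices of retained variables:", selected)
```

A fixed F-to-remove threshold keeps the sketch free of statistical tables; in practice the threshold would be taken from the F distribution at the chosen significance level or from the software settings.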