CHAPTER 3: RESEARCH METHODOLOGY

3.9 Regression analysis

Two types of linear regression analysis are considered here: simple linear and multiple linear regression

analysis. Multiple linear regression is a direct generalization of simple linear regression.

The primary difference between these statistical methods is that the value of the

dependent variable is dependent on more than one independent variable in multiple

regression analysis. In road traffic safety research, the relationship between the number

and/or rate of casualties and the explanatory variables is not straightforward. In most

cases, it is necessary to use more than one independent variable to predict a dependent

variable accurately. For this reason, the multiple regression model is suitable to determine

the relationship between the dependent and independent variables in this study. The

relationship between the dependent variable (Y) and the independent variables (X1, X2, …, Xk) of a

multiple regression model is given by:

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k + \varepsilon \qquad (3.12)$$

The multiple regression model includes an error component (ɛ), where ɛ ~ N(0, σ²). The errors represent the deviations of the

response from the true relationship. They are unobservable random

variables which account for the effects of other factors on the response; the residuals (e) are their observed estimates. In linear regression, the

regression coefficients (b0, b1, …, bk) can be estimated using the least squares method.

The first coefficient, b0, represents the Y-intercept. In the least squares method, the values

of the regression coefficients are chosen such that the sum of squared errors (distances

between the actual Y and estimated Ŷ) is minimized. The sum of squared errors (SSE) is

given by:

$$SSE = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{i1} - b_2 X_{i2} - \cdots - b_k X_{ik})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad (3.13)$$

where:

SSE = Sum of squared errors

Y = Actual values

Ŷ = Estimated values

i = 1, 2, 3, …, n

n = Number of observations

k = Number of independent variables in the regression analysis.
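The least squares fit described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_least_squares` and the small data set are hypothetical and are not part of the study, and NumPy's `lstsq` routine stands in for whatever solver the statistical software uses internally.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (b, sse) for the model Y = b0 + b1*X1 + ... + bk*Xk + e."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept b0.
    design = np.column_stack([np.ones(len(y)), X])
    # lstsq chooses b so that the sum of squared errors (Equation 3.13) is minimized.
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ b
    sse = float(residuals @ residuals)
    return b, sse

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2 (no noise),
# so the fit should recover these coefficients with an SSE of practically zero.
X_demo = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]])
y_demo = 1 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]
b_demo, sse_demo = fit_least_squares(X_demo, y_demo)
print(b_demo, sse_demo)
```

With real, noisy data the SSE would be greater than zero and the coefficients would only approximate the underlying relationship.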

3.9.1 Inference for the regression models

In regression analysis, there is a need to measure the extent to which the sample data

points are spread around the fitted regression function. This measure is called the standard error

of the estimate and its objective is to measure the amount by which the actual values

differ from the estimated values. The standard error of the estimate is similar to the

standard deviation of the residuals and it is determined using the following equation:

$$S_{y,x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - k - 1}} \qquad (3.14)$$

where:

Sy,x = Standard error of the estimate

Y = Actual values

Ŷ = Estimated values

n = Number of observations

k = Number of independent variables in the regression.
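As a worked illustration of Equation (3.14), the following sketch computes the standard error of the estimate from actual and fitted values. The function name and the five data points are hypothetical; the fitted values correspond to a simple regression (k = 1) on made-up data.

```python
import numpy as np

def standard_error_of_estimate(y, y_hat, k):
    """Standard error of the estimate with k independent variables."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    if n - k - 1 <= 0:
        raise ValueError("need more than k + 1 observations")
    sse = np.sum((y - y_hat) ** 2)          # sum of squared errors
    return float(np.sqrt(sse / (n - k - 1)))

# Hypothetical actual values and the fitted values of a one-variable regression.
y_obs = [2.0, 4.0, 5.0, 4.0, 5.0]
y_fit = [2.8, 3.4, 4.0, 4.6, 5.2]
s_yx = standard_error_of_estimate(y_obs, y_fit, k=1)
print(s_yx)
```

Here the SSE is 2.4 and n − k − 1 = 3, so the standard error of the estimate is the square root of 0.8.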

Another inference for the regression model is to assess how well the

regression fits the data. This can be done by measuring the multiple correlation coefficient (R)

of the model, which shows the correlation between the responses (Y) and fitted values

(Ŷ). R is the square root of the coefficient of determination (R²), which is given by:

$$R^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} \qquad (3.15)$$

Here Ȳ denotes the mean of the actual values. Thus, R falls within the range 0 ≤ R ≤ 1.
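The coefficient of determination and the multiple correlation coefficient can be computed as in the sketch below. The function name and data are hypothetical. The sketch assumes that the fitted values come from a least squares fit with an intercept, which is the case in which R² is guaranteed to lie between 0 and 1.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE / total sum of squares."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)           # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)        # total variation around the mean
    return float(1.0 - sse / sst)

# Hypothetical actual values and least squares fitted values.
y_obs = [2.0, 4.0, 5.0, 4.0, 5.0]
y_fit = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(y_obs, y_fit)
r = float(np.sqrt(r2))                       # multiple correlation coefficient
print(r2, r)
```

For these values the SSE is 2.4 and the total sum of squares is 6, giving R² = 0.6.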

3.9.2 Multi-collinearity

In most multiple regression analyses, the data are routinely recorded rather than

generated from pre-selected settings of the independent variables. The independent

variables are frequently linearly dependent. In other words, some of the independent

variables tend to move together, whether they are increasing or decreasing. In this case,

even though the regression model coefficients can be obtained, these estimates tend to

be unstable: their values can change dramatically with slight changes in the data

and can be larger than expected. The t statistics used to judge the significance of

individual terms may be insignificant even though the F test indicates that the

regression as a whole is significant. In addition, the calculation of the least squares estimates is sensitive

to rounding errors.

The linear relationship between two or more independent variables is termed multi-collinearity. The multi-collinearity between variables is measured by the variance inflation factor (VIF):

$$VIF_j = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, 3, \ldots, k \qquad (3.16)$$

where:

R_j² = Coefficient of determination from the regression of the jth independent variable

on the remaining k − 1 independent variables

k = Number of independent variables.
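The VIF of each independent variable can be obtained by regressing it on the remaining variables, as the definition above states. The sketch below is one possible implementation, not the routine used in the study; the function name and the simulated variables `x1`, `x2` and `x3` are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress the jth variable on the remaining k - 1 variables plus a constant.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2_j = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

# Simulated example: x2 nearly duplicates x1, while x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
x3 = rng.normal(size=200)
vif_demo = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vif_demo, 2))
```

In this simulated case the first two variables should show very large VIF values, whereas the third should stay close to 1.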

A VIF value close to 1 indicates that multi-collinearity is not a problem for that

independent variable. Its estimated coefficient and associated t value will not change

significantly as the other independent variables are added or deleted from the regression

equation. A VIF much greater than 1 indicates that the estimated coefficient associated

with that independent variable is unstable. If the VIF is relatively high, the t statistic may

change considerably when the other independent variables are added or deleted from the

regression equation. As a common rule of thumb, the value of VIF should be less than

10 in order to avoid collinearity problems.

3.9.3 Stepwise regression

Various explanatory variables can be included in a multiple regression model.

Nevertheless, a model with a higher number of variables is not necessarily the best one.

In fact, multi-collinearity issues are often present in a complex model. Consequently, it

is crucial to select the best regression model. The following steps are used to select it:

Step 1. Include all of the potential predictor variables.

Step 2. Remove the independent variables that seem inappropriate for the regression

model. Several criteria should be taken into account, but priority should be

given to remove independent variables that duplicate other independent

variables (multi-collinearity).

There are two approaches used to select the best regression model: (1) all possible

regressions and (2) stepwise regression. The latter approach is adopted in this study.

Stepwise regression is useful to simplify the regression model. In this approach, the

variables are added or removed from the regression model one at a time in order to attain

a model that contains only significant predictors but at the same time, it does not exclude

any useful variables. The basic steps involved in stepwise regression are listed below:

Step 1. Obtain all of the possible simple regressions. The predictor variable that has the

largest correlation with the dependent variable is the first variable which will

be included in the regression equation.

Step 2. Include the variable that gives the largest significant contribution to the

regression sum of squares as the next variable in the regression equation.

Determine the significance of the contribution using the F test. The value of the F

statistic that needs to be exceeded before the contribution of a variable is

deemed significant is known as F to enter.

Step 3. Check the significance of each variable in the new equation using the F test. If the F statistic of a variable is less

than the F to remove, delete that variable from the regression equation.

Step 4. Repeat Steps 2 and 3 until all possible additions are non-significant and all

possible deletions are significant.

Stepwise regression is carried out using Statgraphics. The software provides two

stepwise options: forward and backward selection. Forward selection begins with the

model that contains only the constant term and brings the variables in one at a time if they

improve the fit significantly. In contrast, backward selection begins with the model that

contains all of the variables and removes them one at a time until all of the remaining

variables are statistically significant. Backward selection is adopted in this study.
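Backward selection can be sketched as follows. This is a simplified illustration and does not reproduce the internal procedure of Statgraphics: the function names, the simulated data and the F-to-remove threshold of 4.0 are all assumptions made for the example. At each pass, the partial F statistic of every remaining variable is computed, and the weakest variable is dropped if its F statistic is below the threshold.

```python
import numpy as np

def sse_of_fit(X, y, columns):
    """SSE of the least squares fit of y on a constant plus the given columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def backward_selection(X, y, f_to_remove=4.0):
    """Start with all variables and drop the weakest one until none has F < f_to_remove."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    kept = list(range(X.shape[1]))
    while kept:
        sse_full = sse_of_fit(X, y, kept)
        mse_full = sse_full / (n - len(kept) - 1)   # assumes n > k + 1 and a non-perfect fit
        # Partial F statistic: increase in SSE when one variable is left out,
        # relative to the mean squared error of the current model.
        partial_f = {}
        for j in kept:
            reduced = [c for c in kept if c != j]
            partial_f[j] = (sse_of_fit(X, y, reduced) - sse_full) / mse_full
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_to_remove:
            break                                   # every remaining variable is significant
        kept.remove(weakest)
    return kept

# Simulated data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(100, 4))
y_sim = 3 + 2 * X_sim[:, 0] - 1.5 * X_sim[:, 2] + rng.normal(scale=0.5, size=100)
selected = backward_selection(X_sim, y_sim)
print(selected)
```

The two influential columns should always survive the elimination. An irrelevant column may occasionally be retained by chance, which is one reason the threshold should be chosen with care.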

In document A time series analysis of road traffic fatalities in Malaysia / Yusria Darma (Page 155-160)