Chapter 3 Data Description and Methodology
3.5 Model Estimation
3.5.3 Model Evaluation
To check how the independent variables are related to each other, collinearity tests are carried out before building the model. Collinearity occurs when there is a strong correlation between two independent variables in a model (O’Brien, 2007). When collinearity occurs, it increases the variance of the estimated coefficients of the independent variables. In the presence of collinearity or multicollinearity, the estimated coefficients are usually unstable because of high standard errors, and a reliable and significant regression coefficient is difficult to obtain (Washington et al., 2011). For instance, collinearity can make an independent variable appear statistically insignificant because of its high standard error when it should be significant. For this reason, variance inflation factors (VIF) were used to establish the degree of collinearity between independent variables. The VIF indicates the effect of $R^2$ on the variance of the estimated coefficient of an independent variable in a regression model (O’Brien, 2007). The tolerance of an independent variable is (O’Brien, 2007):
$\text{Tolerance} = 1 - R^2$ (3-10)
where $R^2$, defined previously in Section 3.4, is in this case the proportion of the variance of one independent variable that is explained by the other independent variable. Tolerance, $1 - R^2$, is therefore the proportion of variance that remains unexplained. The tolerance varies between 0 and 1, and the lower the tolerance, the stronger the collinearity or multicollinearity.
$\text{VIF} = 1 / \text{Tolerance}$ (3-11)
Collinearity is generally indicated by a tolerance below 0.20 or 0.10, which corresponds to a VIF above 5 or 10, respectively (O’Brien, 2007).
Variance inflation factors were identified using SPSS software. First, the correlation between all study parameters (geometric and traffic variables) was estimated to obtain $R^2$; equations (3-10) and (3-11) were then used to identify collinearity between the independent variables.
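The two-step calculation in equations (3-10) and (3-11) can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure used in the study: the function names, the correlation values, and the default VIF threshold of 5 are assumptions chosen for the example, and $R^2$ is taken as the squared pairwise correlation between two independent variables.

```python
def tolerance(r_squared):
    """Equation (3-10): proportion of variance left unexplained."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 - r_squared


def vif(r_squared):
    """Equation (3-11): variance inflation factor, 1 / tolerance."""
    return 1.0 / tolerance(r_squared)


def is_collinear(r_squared, vif_threshold=5.0):
    """Flag a pair of variables whose VIF exceeds the chosen threshold.

    The default of 5 matches a tolerance of 0.20; pass 10.0 for the
    looser tolerance cut-off of 0.10.
    """
    return vif(r_squared) > vif_threshold


# Hypothetical pairwise correlations between independent variables.
for r in (0.5, 0.9):
    r2 = r ** 2
    print(f"r={r}: tolerance={tolerance(r2):.3f}, "
          f"VIF={vif(r2):.2f}, collinear={is_collinear(r2)}")
```

With these hypothetical values, a correlation of 0.9 gives a tolerance just under 0.20 and is flagged, while a correlation of 0.5 is not.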
After the variables have been selected and the models developed, the models are assessed using statistical approaches as part of the process of selecting the most appropriate and best-fitting models. Each model is first evaluated according to the significance of the variables it includes: the estimated coefficient $\beta$ for each independent variable in the model should be statistically significant.
The likelihood ratio test was used to compare the fixed- and random-parameters models using their log-likelihoods at convergence. The test statistic is chi-square ($\chi^2$) distributed:
$\chi^2 = -2[LL(\beta_F) - LL(\beta_R)]$ (3-12)
where $LL(\beta_F)$ is the log-likelihood at convergence of the fixed-parameters NB model, and $LL(\beta_R)$ is the log-likelihood at convergence of the random-parameters NB model (Anastasopoulos & Mannering, 2009; Washington et al., 2011). The calculated $\chi^2$, with degrees of freedom equal to the number of randomly distributed parameters in the random-parameters model, was used to identify the significance of the random-parameters model relative to the fixed-parameters model, using the Stattrek website (Stattrek, 2016).
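As a minimal sketch of the likelihood ratio test in equation (3-12), the following Python computes the test statistic and its upper-tail chi-square probability using only the standard library. The log-likelihood values and the number of random parameters are hypothetical, the function names are assumptions made for this example, and the p-value routine stands in for the online lookup rather than reproducing it.

```python
import math


def lr_statistic(ll_fixed, ll_random):
    """Equation (3-12): chi-square statistic from two log-likelihoods."""
    return -2.0 * (ll_fixed - ll_random)


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Valid for positive integer df. Starts from the closed form for
    df = 1 or df = 2 and applies the incomplete-gamma recurrence
    Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), with y = x / 2.
    """
    if not isinstance(df, int) or df < 1 or x < 0:
        raise ValueError("df must be a positive integer and x >= 0")
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical log-likelihoods at convergence (not results from this study).
ll_fixed, ll_random = -1250.0, -1243.5
n_random_params = 2  # degrees of freedom, also hypothetical

stat = lr_statistic(ll_fixed, ll_random)
p_value = chi2_sf(stat, n_random_params)
print(f"chi-square = {stat:.2f}, df = {n_random_params}, p = {p_value:.4f}")
```

A small p-value would indicate that the random-parameters model is a statistically significant improvement over the fixed-parameters model at the chosen confidence level.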
The McFadden $\rho^2$ statistic is used in addition to $\chi^2$ to test the overall fit of the model and for comparison with other models. The McFadden $\rho^2$ statistic is computed as (Washington et al., 2011):
$\text{McFadden } \rho^2 = 1 - \frac{LL(\boldsymbol{\beta})}{LL(\boldsymbol{c})}$ (3-13)
where $LL(\boldsymbol{\beta})$ is the log-likelihood at convergence with estimated parameters $\boldsymbol{\beta}$, and $LL(\boldsymbol{c})$ is the log-likelihood with the constant only.
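Equation (3-13) can be illustrated with a short Python sketch. The function name and the log-likelihood values below are hypothetical and serve only to show how the statistic behaves.

```python
def mcfadden_rho2(ll_model, ll_constant):
    """Equation (3-13): 1 - LL(beta) / LL(c)."""
    if ll_constant == 0:
        raise ValueError("ll_constant must be non-zero")
    return 1.0 - ll_model / ll_constant


# Hypothetical log-likelihoods (not results from this study).
ll_model = -1243.5      # at convergence, with all estimated parameters
ll_constant = -1400.0   # constant-only model

print(f"McFadden rho^2 = {mcfadden_rho2(ll_model, ll_constant):.3f}")
```

Because both log-likelihoods are negative and the full model fits at least as well as the constant-only model, the ratio lies between 0 and 1, and values closer to 1 indicate a larger improvement over the constant-only model.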
In addition, the two models were compared using the relationship between the actual mean values and the predicted values of the response variables for both the random- and fixed-parameters models.
Moreover, for the models that included HBI along with traffic and geometric variables, the Akaike information criterion (AIC) was used to compare the results with those of the models that did not include HBI as an input variable. AIC is calculated as (Washington et al., 2011):
$AIC = 2Q - 2LL(\boldsymbol{\beta})$ (3-14)
where $Q$ is the number of estimated parameters, including the constant, and $LL(\boldsymbol{\beta})$ is the log-likelihood at convergence. The model with the lower $AIC$ is preferred, because a lower value of $-2LL(\boldsymbol{\beta})$ represents a better fit of the model. Note that $LL(\boldsymbol{\beta})$ is a negative value.
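The comparison based on equation (3-14) can be sketched as follows. The parameter counts and log-likelihoods are hypothetical, and the function and variable names are assumptions introduced for the example.

```python
def aic(n_params, ll_model):
    """Equation (3-14): AIC = 2Q - 2LL(beta)."""
    return 2.0 * n_params - 2.0 * ll_model


# Hypothetical models (not results from this study):
# (number of estimated parameters including the constant, log-likelihood)
models = {
    "without HBI": (6, -1250.0),
    "with HBI": (7, -1243.5),
}

scores = {name: aic(q, ll) for name, (q, ll) in models.items()}
for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")

# The model with the lowest AIC is preferred.
best = min(scores, key=scores.get)
print(f"Preferred model: {best}")
```

In this hypothetical case the extra parameter adds 2 to the penalty term, but the gain in log-likelihood outweighs it, so the model with HBI has the lower AIC.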
3.6 Summary
This chapter describes the data and the methodology used to explore the feasibility of using truck sensor position data for analysing roundabout accident risk. It provides detailed information about the data sources and the procedures used to analyse the data. The definition of HBI by Microlise Ltd. is presented and compared with previous studies and with tests undertaken using smartphone accelerometers. The counting, filtration, and allocation of HBIs are illustrated. The filtration of truck accidents from total accidents for the purpose of analysis is presented. In addition, the reason for choosing two years of HBIs relative to 11 years of accidents is given. The geometric information computed from Google Earth is then compared with the distance equation and with on-site tape measurements. The estimation of traffic data for the correct approaches is discussed and the detailed procedure is described.
For the purpose of analysing accident and HBI numbers, summary statistics of the traffic and geometric characteristics are presented. In addition, the summary statistics for the dependent variables show the reason for choosing NB models to predict accidents and HBIs. Moreover, a detailed summary of the procedure for generating the econometric models used to predict total accidents, truck accidents, and HBIs is presented. The following chapters present the results obtained from the procedures described in this chapter.