CHAPTER 5. FINAL ANALYSIS
5.3 Preliminary regression analysis
5.3.3 Vehicle-driver sections 1 and 2 sampling analysis
This analysis allowed a more detailed comparison of the variables against the driver 1 and driver 2 injury severities. Eleven of the 21 variables have a strong or moderate association with the driver 1 injury severity, but a weak or very weak association with the driver 2 injury severity. Many of these variables (i.e. vehicle (driver) at-fault, speeding, ejected, driver age group) were found to have a significant effect on the injury severity in the exploratory analysis (see Section 4.2). This suggests a possible difference between the driver injury mechanisms of the two groups. This is expected, as section 1 includes all the single vehicle crash involvements, whereas section 2 does not. A second important concern, shown in Table 5-2 above, was that the proportions of severely injured drivers (especially in section 2) were statistically dependent on the year of statewide data selected. Minimizing temporal variation is very important to the validity of any further analysis.
Table 5-31 Driver crash involvements in multilane high-speed roads in the complete sample (all involvements) and the stratified sample
Vehicle-driver section    Stratified sample         All involvements
                          Count      Percent        Count      Percent
Driver 1                  55569      46.78          197197     50
Driver 2                  63221      53.22          197197     50
An alternative analysis was performed by comparing the total crash involvements of driver sections 1 and 2 with a stratified random sample of the driver 1 and 2 multiple vehicle crash involvements. The counts of the driver involvements in the datasets used are described in Table 5-31 above. The stratified sample consisted of 50% of the driver 1 and 2 involvements with no missing data for the variables that have a significant association with the driver injury severity response, already discussed in Section 5.2. It was deemed important to keep a representative sample of the different driver involvements related to a single crash, while avoiding repeating crash data in the sampling process.
Using the SAS procedure PROC SURVEYSELECT in a three-step process, a stratified random sample of the multiple vehicle crash reports was selected. First, half of the involvements from section 1 were selected. Then, records with the report numbers in the selected driver 1 sample were deleted from the driver 2 dataset to avoid repeating crash data in the final analysis.
Finally, a sample equal to 50% of the original number of driver 2 involvements was selected. After selecting the sample of multiple vehicle crash involvements, the single vehicle crash involvements were added to the sample. The final proportions of single and multiple vehicle crashes shown in Table 5-31 above were similar to those shown in Table 5-1.
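The three-step selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the SAS code used in the study: simple random sampling from the standard library stands in for PROC SURVEYSELECT, the strata are not reproduced because the text does not specify them, and the function name, the record layout and the `report_number` key are hypothetical.

```python
import random


def three_step_sample(driver1, driver2, single_vehicle, fraction=0.5, seed=0):
    """Draw one involvement per multiple vehicle crash, then add single vehicle crashes.

    driver1, driver2: lists of dicts for multiple vehicle crash involvements,
    each with a 'report_number' key (hypothetical field name).
    single_vehicle: list of dicts for single vehicle crash involvements.
    """
    rng = random.Random(seed)

    # Step 1: select half of the driver 1 involvements.
    n1 = round(fraction * len(driver1))
    sample1 = rng.sample(driver1, n1)

    # Step 2: delete from the driver 2 dataset every crash already selected
    # in step 1, so that no crash report is repeated.
    used_reports = {rec["report_number"] for rec in sample1}
    remaining2 = [rec for rec in driver2 if rec["report_number"] not in used_reports]

    # Step 3: select driver 2 involvements, targeting 50% of the ORIGINAL
    # number of driver 2 records (capped by what remains after step 2).
    n2 = min(round(fraction * len(driver2)), len(remaining2))
    sample2 = rng.sample(remaining2, n2)

    # Single vehicle crash involvements are added after the sampling.
    return sample1 + sample2 + list(single_vehicle)
```

Because step 2 removes the crashes chosen in step 1 before step 3 runs, each multiple vehicle crash contributes at most one involvement to the result.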
Table 5-32 Driver crash involvements in multilane high-speed roads by year and injury severity
Test of independence p-value= 0.0884
The year-to-year variation of the driver injury severity counts was also analyzed. The Chi-square test of independence shown in Table 5-32 above resulted in a p-value greater than 0.05; thus the null hypothesis of statistical independence is not rejected for the entire sample. The p-value is higher than that of the test shown in Table 5-2 above, which suggests that the stratified sample is more resistant to yearly variation. This is a very important advantage of the stratified sample. Rather than magnifying the yearly differences, it becomes more heterogeneous in terms of the driver, vehicle, roadway and environment-related characteristics that might be contributing factors to the driver injury severity.
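As an illustration of the test of independence used here, the sketch below computes the Pearson Chi-square statistic, its degrees of freedom and the upper-tail p-value for a contingency table of year by injury severity. It is a minimal standard-library sketch, not the SAS output of the study. The function names are hypothetical and the counts in the example are invented for illustration; they are not the data behind Table 5-32.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_sf(x, df):
    """Upper-tail probability of the Chi-square distribution (integer df >= 1).

    Uses the closed-form finite sums that exist for integer degrees of freedom.
    """
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail term plus a finite sum weighted by the normal density.
    s = math.sqrt(x)
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = s, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(s / math.sqrt(2)) + 2 * density * total


# Hypothetical counts: rows are years, columns are (non-severe, severe).
example = [[9400, 600], [9350, 650], [9420, 580]]
stat, df = chi_square_independence(example)
p_value = chi_square_sf(stat, df)
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.4f}")
```

A p-value above 0.05 means the null hypothesis of independence between year and injury severity is not rejected, which is how the 0.0884 reported for Table 5-32 is read.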
Comparing the categorical analysis of the initial set of crashes and the final sample served to test the effects of drawing a sample of multiple crash involvements for the regression modeling. Additional analysis into the relationships between the driver 1 and 2 sections was performed by separating the single and multiple vehicle crashes. By comparing the counts of some of the most important variables, we can test whether the driver 1 and driver 2 sections are statistically independent (assigned randomly) or if there is any systematic relationship.
Table 5-33 Test of independence between driver section number and the variables listed (Sample n = 118,790; Complete N = 394,394)
Variable                        P-value using random sample    P-value using complete sample
Off_Roadway                     <0.0001                        1
Point_Impact_x                  <0.0001                        <0.0001
Vehicle_Maneuver_x              <0.0001                        <0.0001
Type_of_Vehicle_x               <0.0001                        <0.0001
Private_vehicle_use_x           <0.0001                        <0.0001
CRASH_LANE5                     0.003356631                    1
nRural_Urban                    <0.0001                        1
Location_Type                   0.751214189                    1
nVehicle_Special_Functions_x    0.00312239                     <0.0001
Partial results of the tests for each driver-vehicle and crash-related variable versus driver section number are shown in Table 5-33 above. Complete results are shown in Appendix B. This analysis used multiple vehicle crash involvements only. When using the random sample, only the driver age, physical defects, location type and alcohol use variables became significant for statistical dependence of the driver section. The complete results shown in Appendix B suggest that numerical problems (quasi-complete or complete separation) are possible if all involvements are used, because of the repeated values of road characteristics. Some of these problems were apparent in the development of the exploratory regression models (see Section 4.2). The stratified sample had a positive effect on these variables by alleviating the separation problems due to repeated values.
Table 5-34 Driver crash involvements in multilane high-speed roads in the stratified sample
Driver-vehicle section        Frequency    Percent
Driver 1 single vehicle       10587        8.18%
Driver 1 multiple vehicle     55591        42.96%
Driver 2 multiple vehicle     63235        48.86%
Total                         129413       100.00%
Table 5-35 Goodness of fit for the models using the complete records driver 1 section dataset
GOF Parameter                     OVERALL     INTERS      SIGNAL      SEGMENT     PURE SEG    UNSIG
Number of Variables               28          24          16          25          19          20
Degrees of freedom                52          41          31          46          35          38
Sample size                       120442      70167       41779       71671       43283       28388
Response severe injury ratio      6.41%       6.23%       5.53%       7.21%       7.18%       7.25%
AIC                               48211.51    28534.75    15861.77    30457.82    17747.43    12652.1
Hosmer-Lemeshow p-value           0.2124      0.2355      0.241       0.7759      0.2078      0.1405
c value (area under ROC curve)    0.789       0.764       0.759       0.803       0.824       0.771
Percent Concordant                78.5        75.9        75.3        79.9        82.1        76.7
Adjusted R-squared                0.1953      0.2355      0.1384      0.2229      0.2544      0.1835
Based on this analysis, it was decided to compare the road entity regression models using the two datasets to determine the best course of action for the final stage of this investigation.
The composition of the database based on a sample of driver 1 and driver 2 sections is shown in Table 5-34 above. Table 5-35 above shows the goodness of fit of the six road entity models developed using the complete records dataset for driver section 1. Although the goodness of fit measures for these first models were deemed acceptable, the models were compared with models using the random sample of multiple vehicle driver involvements to assess not only the statistical model performance but also the coefficient interpretations, as suggested by Saccommano et al. (1994).
To make this comparison, models for the road entities were developed using the stratified sample of driver involvements from sections 1 and 2. The goodness of fit for these models was also acceptable, as shown in Table 5-36 below. This database was slightly reduced to 129,193 records for the regression analysis because of invalid or missing data. The reduction included discarding crashes on road sections with very large medians (>150 ft), which did not change the results significantly (no odds ratio changed by more than 3%) but improved the median size coefficient.
These few cases with very large median sizes might have been one-way roads or special cases that were not the main interest of this investigation. The Hosmer-Lemeshow p-values improved; a higher p-value is better in this calibration test. The coefficient values did not change drastically, but these models now take into account the diversity of driver involvements in multiple vehicle crashes, which guards the models against certain biases, such as the at-fault driver bias shown in previous sections.
Table 5-36 Goodness of fit for the models using the stratified driver 1 and driver 2 records dataset
GOF Parameter                     OVERALL     INTERS      SIGNAL      SEGMENT     PURE SEG    UNSIG
Number of Variables               33          26          20          27          24          17
Degrees of freedom                67          56          43          60          48          48
Sample size                       129193      73547       43944       77623       48020       29603
Response severe injury ratio      6.10%       6.01%       5.38%       6.75%       6.64%       6.94%
AIC                               50600.81    29451.99    16614.84    31836.67    18897.62    12868.79
Hosmer-Lemeshow p-value           0.2493      0.5760      0.2507      0.8468      0.8790      0.1886
c value (area under ROC curve)    0.768       0.745       0.736       0.786       0.804       0.757
Percent Concordant                76.2        73.9        72.8        78.1        79.9        75.2
Adjusted R-squared                0.1801      0.1482      0.1236      0.2109      0.2388      0.1774
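The c value and percent concordant reported in the goodness of fit tables can be illustrated with the following minimal sketch. It assumes the usual pairwise definition: every (severe, non-severe) pair of involvements is compared by predicted probability, and c counts tied pairs as one half. The function name and inputs are hypothetical, and SAS groups predicted probabilities before counting pairs, so its reported values may differ slightly from this exact pairwise count.

```python
def concordance(outcomes, probabilities):
    """Pairwise concordance for a binary response.

    outcomes: list of 0/1 values (1 = severe injury).
    probabilities: predicted probabilities of a severe injury, same order.
    """
    events = [p for y, p in zip(outcomes, probabilities) if y == 1]
    non_events = [p for y, p in zip(outcomes, probabilities) if y == 0]

    concordant = discordant = tied = 0
    # Quadratic in the number of records; acceptable for a small illustration.
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1
            elif p_event < p_non_event:
                discordant += 1
            else:
                tied += 1

    pairs = len(events) * len(non_events)
    return {
        "percent_concordant": 100.0 * concordant / pairs,
        "percent_discordant": 100.0 * discordant / pairs,
        "c": (concordant + 0.5 * tied) / pairs,
    }
```

Under this definition c is slightly larger than the concordant share whenever ties exist, which is consistent with, for example, the c value of 0.768 next to 76.2 percent concordant for the OVERALL model in Table 5-36.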
Evaluating and comparing these two sets of models revealed major differences.
The results from these preliminary models were encouraging, and a decision was made to proceed with final model development using the stratified sample. Some of the key advantages found are summarized next. First, the year-to-year statistical dependence of the driver injury severity was significantly reduced, which improves the validity of the analysis. Second, the ratio of severe injuries in the stratified sample data more closely resembles that of the total involvements on multilane high-speed arterials, as shown in Table 5-37 below. The higher severe injury response ratios in the models are expected because incomplete records, which are likely minor crashes with lesser or no injuries, were removed.
Table 5-37 Severe injury to driver involvement ratios for complete driver 1 and driver 2 records
Road entity group           Non-severe    Severe    Total     Severe ratio
All involvements            275143        15847     290990    5.45%
All intersections           155727        9205      164932    5.58%
Signalized                  95274         5065      100339    5.05%
Segment + non-signalized    161781        10141     171922    5.90%
Pure Segment                101328        6001      107329    5.59%
Non-signalized              60453         4140      64593     6.41%
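The severe ratios in Table 5-37 follow directly from the counts, as severe involvements divided by total involvements. The short sketch below recomputes them from the table values; the function and variable names are only illustrative.

```python
# (non-severe, severe) counts copied from Table 5-37.
table_5_37 = {
    "All involvements": (275143, 15847),
    "All intersections": (155727, 9205),
    "Signalized": (95274, 5065),
    "Segment + non-signalized": (161781, 10141),
    "Pure Segment": (101328, 6001),
    "Non-signalized": (60453, 4140),
}


def severe_ratio(non_severe, severe):
    """Share of severe involvements among all involvements."""
    return severe / (non_severe + severe)


for group, (non_severe, severe) in table_5_37.items():
    print(f"{group}: {100 * severe_ratio(non_severe, severe):.2f}%")
```

These population ratios (about 5% to 6.4%) are the reference against which the response severe injury ratios of the two model datasets are compared.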
A third key advantage was the increased heterogeneity of the data, even though it generally gave the random sample models higher AIC values. Heterogeneity is usually a desired property in systematic crash analysis and in statistical analysis in general. A minor loss of explanatory power (as measured by the adjusted R-squared value) was necessary to achieve more accuracy. The models using the sampled multiple vehicle involvement data are better calibrated.
Because the more homogeneous data do not completely reflect the variations in driver injury severity in multiple vehicle crashes, statistical results based on them might be misleading. Even though the models suffer from a reduction in explanatory power, the additional precision outweighs this loss.
A fourth advantage was that the numerical stability of the random sample models was vastly improved compared to the earlier models. A set of covariates that more accurately represented the changes in driver injury severity was obtained. The coefficient significance showed a small improvement in the random sample models. Although a more heterogeneous dataset was used, the standard errors remained in the same order of magnitude.
Another advantage, noticed during model building, was the positive impact of interactions in the random sample models. Some important interactions in the driver section 1 and 2 models did not significantly improve the model (AIC reduction < 10) and were eliminated. In the random sample models, important interactions were significant without adversely affecting significant main effects. During model building with the driver section data, by contrast, interactions would cause important main effects to be dropped.
Finally, the variables found significant in the random sample models were more useful than those in the earlier models. One of the reasons was the superior numerical stability of the random sample dataset. Important variables such as shoulder width and lane width were tested in both model sets; however, they were significant only in the models using a sample of multiple vehicle involvements. Another important contribution was the addition of some roadway-related variables in the final stage of analysis, which is discussed in Section 5.4.1.
The evidence presented above and the variables found significant show that a random sample of multiple vehicle crashes from sections 1 and 2 plus the single vehicle crashes is more representative of the total driver involvements than involvements from one section only.
Therefore, a sample of multiple and single vehicle crashes, with one involvement per crash, was selected for the final analysis. A model built on this kind of sample is expected to have better reliability and scientific validity.