Using advanced data analysis to learn from infrastructure databases: The case of the US National Bridge Inventory
6. Regression analysis
Regression analysis is a statistical process for modelling and analyzing the relationship between a dependent variable and one or more independent variables. It is commonly used for prediction and forecasting, but it is also used to assess the relative importance of each independent variable in modelling the dependent variable.
In the negative binomial regression analysis performed herein, the initial set of variables was reduced based on the results of the correlation analysis and ANOVA. The independent variables included in the analysis are shown in Table I. For these variables, an additive generalized linear model was fitted using the MASS library in R [14]. An initial regression was performed to fit a model using all independent variables. Then, a stepwise regression process was applied, in which independent variables were added (forward regression) or removed (backward regression) and the Akaike Information Criterion (AIC) was used to indicate whether a better model was achieved. By comparing the results of successive regression models, unnecessary variables were identified and omitted.
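The stepwise search described above can be sketched as follows. This is a minimal illustration, not the authors' R workflow: the function names `aic`, `stepwise_aic` and `toy_fit` are hypothetical, and the log-likelihood gains in `TOY_GAIN` are invented numbers that stand in for a real negative binomial fit. Only the AIC definition (twice the number of parameters minus twice the log-likelihood, lower being better) is standard.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

# A "fit" returns (log-likelihood, number of estimated parameters) for a variable subset.
Fit = Callable[[FrozenSet[str]], Tuple[float, int]]


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion; lower values indicate a better model."""
    return 2.0 * n_params - 2.0 * log_likelihood


def stepwise_aic(candidates: Iterable[str], fit: Fit,
                 start: Optional[Iterable[str]] = None) -> Tuple[FrozenSet[str], float]:
    """Greedy search: add or remove one variable at a time while AIC decreases."""
    pool = frozenset(candidates)
    # By default, start from the model that contains all candidate variables.
    current = pool if start is None else frozenset(start)
    current_aic = aic(*fit(current))
    while True:
        # One-variable moves: drop a variable in the model or add one outside it.
        moves = [current - {v} for v in current] + [current | {v} for v in pool - current]
        if not moves:
            return current, current_aic
        scored = [(aic(*fit(m)), sorted(m), m) for m in moves]
        best_aic, _, best = min(scored, key=lambda t: (t[0], t[1]))
        if best_aic >= current_aic:
            return current, current_aic  # no move improves the criterion
        current, current_aic = best, best_aic


# Hypothetical log-likelihood gains per variable, invented purely for illustration.
TOY_GAIN = {"year": 120.0, "precipitation": 15.0, "max_span": 0.3, "truck_pct": 0.5}


def toy_fit(subset: FrozenSet[str]) -> Tuple[float, int]:
    """Stand-in for a real model fit (assumption: gains are additive)."""
    log_likelihood = -1000.0 + sum(TOY_GAIN[v] for v in subset)
    n_params = len(subset) + 1  # one coefficient per variable plus the intercept
    return log_likelihood, n_params


selected, score = stepwise_aic(TOY_GAIN, toy_fit)
print(sorted(selected), round(score, 1))
```

In this toy setting a variable is kept only if it raises the log-likelihood by more than 1, which is the penalty AIC charges for one extra parameter; the two weak variables are therefore dropped, mirroring how variables end up marked as omitted in a stepwise fit.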
Table I: Regression analysis results for superstructure condition rating.
| Variable | Group | Coefficient | Estimate |
|---|---|---|---|
| - | | Intercept | -5.3E+00 |
| Length (m) | | C1 | -4.8E-05 |
| Maximum span (m) | | omitted | |
| Deck width (m) | | C2 | 7.1E-04 |
| Year constructed (date/year) | | C3 | 3.6E-03 |
| Detour length (km) | | C4 | 1.4E-04 |
| Average Daily Traffic - ADT (vehicles) | | C5 | 2.7E-08 |
| Truck traffic (% ADT) | | omitted | |
| Earthquake hazard - PGA (g) | | C6 | 4.9E-02 |
| Precipitation (inches) | | C7 | -5.6E-04 |
| Snow depth above 1 inch (days) | | C8 | 1.6E-04 |
| Deicing | Not allowed | - | |
| | Allowed | C9 | -2.5E-02 |
| Material | Concrete continuous | - | |
| | Concrete simple | C10 | -9.6E-03 |
| | Prestressed concrete continuous | C10 | 2.1E-02 |
| | Prestressed concrete simple | C10 | 1.8E-02 |
| | Steel continuous | C10 | 9.8E-04 |
| | Steel simple | C10 | -4.6E-02 |
| Water underneath | No | - | |
| | Yes | C11 | -7.3E-03 |
In the regression function obtained, the natural logarithm of the superstructure condition rating equals an intercept term plus the sum of the independent variables, each multiplied by its coefficient:
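$$
\begin{aligned}
\ln(\text{Superstructure condition}) = {} & \text{Intercept} + C_1 \cdot \text{Length} + C_2 \cdot \text{Deck width} + C_3 \cdot \text{Year of construction} \\
& + C_4 \cdot \text{Detour length} + C_5 \cdot \text{ADT} + C_6 \cdot \text{PGA} + C_7 \cdot \text{Precipitation} \\
& + C_8 \cdot \text{Snow depth} + C_9 \cdot \text{Deicing} + C_{10} \cdot \text{Material} + C_{11} \cdot \text{Water underneath}
\end{aligned}
\tag{1}
$$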
The intercept term is the value of the right-hand side of Eq. (1) when every continuous variable is zero and every categorical variable is at its reference group, while the effect of each independent variable is indicated by the estimated coefficients provided in Table I. The categorical independent variables included in this table are handled through binary dummy variables (one dummy variable is introduced for each non-reference group of a categorical variable). These dummy variables can take only the values 1 (to activate the coefficient of the specific group) or 0 (to deactivate it). For example, as regards superstructure material, there is no coefficient for ‘concrete continuous’, so C10 is deactivated for this material (the dummy variables of all other groups take the value 0), i.e. Eq. (1) is by default calibrated for this material. If another material is used, Eq. (1) needs a ‘correction’ to shift its result, which is achieved by activating the C10 value of the corresponding group in Table I (with a dummy variable equal to 1). Clearly, at most one group of a categorical variable can be active at any time, with a dummy variable equal to 1; the dummy variables of all remaining groups are equal to 0, which deactivates their regression coefficients.
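As a minimal sketch of how Eq. (1) and the dummy-variable mechanism work together, the snippet below evaluates the regression function for one bridge. The coefficients are taken from Table I; the dictionary keys, the function name `predict_rating` and the example bridge are hypothetical and chosen only for illustration. Reference groups carry a coefficient of 0, which is equivalent to all of their dummy variables being 0.

```python
import math

INTERCEPT = -5.3

# Coefficients of the continuous variables in Table I (C1 to C8).
CONTINUOUS = {
    "length_m": -4.8e-05,
    "deck_width_m": 7.1e-04,
    "year_constructed": 3.6e-03,
    "detour_km": 1.4e-04,
    "adt": 2.7e-08,
    "pga_g": 4.9e-02,
    "precipitation_in": -5.6e-04,
    "snow_days": 1.6e-04,
}

# Categorical variables (C9 to C11); the reference group of each has coefficient 0.
CATEGORICAL = {
    "deicing": {"not allowed": 0.0, "allowed": -2.5e-02},
    "material": {
        "concrete continuous": 0.0,
        "concrete simple": -9.6e-03,
        "prestressed concrete continuous": 2.1e-02,
        "prestressed concrete simple": 1.8e-02,
        "steel continuous": 9.8e-04,
        "steel simple": -4.6e-02,
    },
    "water_underneath": {"no": 0.0, "yes": -7.3e-03},
}


def predict_rating(bridge: dict) -> float:
    """Evaluate Eq. (1) and invert the logarithm to obtain the condition rating."""
    log_rating = INTERCEPT
    for name, coefficient in CONTINUOUS.items():
        log_rating += coefficient * bridge[name]
    for name, groups in CATEGORICAL.items():
        if bridge[name] not in groups:
            raise ValueError(f"unknown group {bridge[name]!r} for {name}")
        for group, coefficient in groups.items():
            dummy = 1 if bridge[name] == group else 0  # only one group is active
            log_rating += coefficient * dummy
    return math.exp(log_rating)


# Hypothetical bridge, not taken from the NBI database.
example_bridge = {
    "length_m": 50.0, "deck_width_m": 12.0, "year_constructed": 1980,
    "detour_km": 10.0, "adt": 10_000, "pga_g": 0.1,
    "precipitation_in": 40.0, "snow_days": 20.0,
    "deicing": "allowed", "material": "steel simple", "water_underneath": "yes",
}

rating = predict_rating(example_bridge)
print(f"Predicted superstructure condition rating: {rating:.2f}")
```

Switching `material` to `"concrete continuous"` leaves every C10 dummy at 0, so the prediction falls back to the default calibration of Eq. (1) described above.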
An increase in the value of an independent variable causes either an increase or a decrease in superstructure condition, depending on the sign of the respective regression coefficient. The results identify the year of bridge construction as the most influential factor and the bridges’ geometric characteristics as the least influential ones. The traffic characteristics appear to have a small effect on the superstructure condition. Indeed, the results for ADT are in agreement with the corresponding results obtained with ANOVA in the previous section. The same applies to earthquake hazard, with an increase in PGA having a positive effect on superstructure condition. Increased annual precipitation and days of snowfall above 1 inch cause a decrease in superstructure condition. The same applies to bridges that are located in deicing regions or have water underneath. Furthermore, the structural materials used in superstructures appear to have effects similar to the ones observed with ANOVA.
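Because Eq. (1) is written for the logarithm of the rating, a change Δx in one variable, with all others held fixed, multiplies the predicted rating by exp(C·Δx). The sketch below uses this property to compare the size of a few effects. The coefficients come from Table I, whereas the function name `relative_change` and the chosen increments (50 years, 100 m, 20 inches) are hypothetical values picked only to illustrate the calculation.

```python
import math


def relative_change(coefficient: float, delta: float) -> float:
    """Fractional change in the predicted rating when one variable changes by delta."""
    return math.exp(coefficient * delta) - 1.0


# (variable, coefficient from Table I, hypothetical increment)
cases = [
    ("Year constructed, +50 years", 3.6e-03, 50.0),
    ("Length, +100 m", -4.8e-05, 100.0),
    ("Precipitation, +20 inches", -5.6e-04, 20.0),
]

for label, coefficient, delta in cases:
    print(f"{label}: {100.0 * relative_change(coefficient, delta):+.2f} %")
```

The comparison shows why the size of an effect depends on both the coefficient and the range over which the variable realistically varies: the year of construction spans many decades, so its effect dominates that of a geometric variable such as length.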
7. Conclusions
In this paper, factors affecting the superstructure condition of bridges were studied using data from existing bridges located in the US. To perform this task, recorded inspection data for more than 600,000 bridges included in the NBI database were utilized. Since the US territory covers a large variety of environmental exposures, the databases of NOAA and USGS were used to introduce additional (non-NBI) variables regarding climate and earthquake hazard. To estimate data values for each bridge location, spatial interpolation methods were implemented. The combined dataset including NBI and non-NBI data was then analysed using data analysis procedures to determine which variables affect the structural condition of bridges. The exploratory data analysis showed low correlations among the selected NBI variables, in contrast to the climate variables, which were moderately to highly intercorrelated. ANOVA and multiple comparisons revealed useful patterns, which indicated the effect of each variable on the structural condition rating. Moreover, the analysis showed the existence of certain thresholds beyond which variables have a different effect on the condition ratings, as in the case of the deicing policy implemented and the days of snow depth above 1 inch: although the deicing region coincides with the region of more than 0.5 days of snowfall above 1 inch, a further increase in days of snowfall does not affect the superstructure condition rating.