Although initial analysis examining the relationships between ANC and the model predictors gave us an idea of which predictors should be include, it was unknown which combination of predictors, in conjunction, would yield the most accuracy. In order to find the best set of model predictors for predicting ANC, linear regression was initially used
Table 5: List of Final Model Predictors (Final – ANC)
% Pasture Elevation % Quarry % Probable Row Crops
% Woody Wetlands % Emergent Wetlands % Urban – High Density
Carbonate Bedrock Felsic Bedrock
Efroymson’s forward stepwise model selection procedure (Seber, 1977, p.376-377) was performed with all 20 predictors, using the Splus function ‘stepwise’. The ‘forward’ process begins with a model consisting of an intercept only. Each forward step consists of adding to the current model that predictor variable giving the largest reduction of the residual sum of squares provided a certain level of significance or F-to-enter value. The stepwise process of the model selection procedure refers to the possible addition or deletion of predictors in the current model. When an independent variable is added to the model, partial correlations between the variables in the model and the response are considered to determine whether any of the predictors can be eliminated (based upon a specified F-to-exit value). Therefore, the number of predictors, p, increases and/or decreases at each step based upon the specified selection parameters, typically resulting in a “final” predictive model for the response. Using the default settings in Splus of an F- to-enter and F-to-exit value of 2, the final model produced by this selection procedure produced the subset of predictors of the natural log transform of ANC given in Table 5. This set of predictors will be referred to as Final – ANC.
The Leaps and Bounds model selection procedure (Furnival, 1974) was used in addition to Efroymson’s forward stepwise model selection. The Leaps and Bounds
5 10 15 20 # of predictors in the model
20 40 60 80 100 12 0 Mi ni mum Ma llo w' s Cp
Figure 14: Results from Leaps and Bounds Model Selection Procedure
selection procedure examined all possible subsets of predictors, returning the best models of each subset size based upon a designated criteria. The criterion considered was
minimum Mallow’s Cp. Mallow's Cp estimates the total model mean square error divided by the true error variance (NKNW, 1996, p.341-345).
The leaps and bounds procedure using Mallow’s Cp returned a minimum value of 5.48. This model corresponded to the model selected by Efroymson’s stepwise procedure. Using the Cp criterion for model selection, Cp values for a model greater than the number of predictors, p, indicates model bias, or an inflation of the error variance. When Cp is less than the number of model predictors, the model is “interpreted as showing no bias.” This result is attributed to sampling error. Figure 14 plots the minimum Cp values for the best model of each size, with the estimated intercept included as one of the predictors. The line, p = Cp, is included for perspective. Figure 14 displays that the set Final – ANC
Table 6: Liner Regression Parameter Estimates for Final – ANC Model
Predictor Coefficient Standard Error P-value
(Intercept) 7.506 0.157 <0.001
Probable 0.012 0.004 <0.001
Pasture 0.051 0.011 <0.001
Urban High Density 0.063 0.019 <0.001
Emergent Wetlands 0.046 0.021 0.033 Woody Wetlands -0.073 0.019 <0.001 Quarry 0.020 0.012 0.084 Carbon 0.542 0.101 <0.001 Felsic -0.273 0.102 0.008 Elevation -0.001 <0.001 <0.001 R2 = 0.5781 Overall p-value < 0.0001
The form of Final – ANC is:
e β
Y= X⋅ + (1)
where:
• Y= a vector of responses, ln(ANC + 500) • X = a (238 x 10) design matrix of predictors • β = a vector of model parameters
• e ~ N(0,σ2⋅I)
Using the designated model predictors, the estimated model parameters, standard errors and p-values are given in Table 6.
C. Multicollinearity
As discussed above, a concern about this model is the possible interrelationships between the predictors. One manifestation of this problem is multicollinearity. The term
multicollinearity describes a situation where predictor variables are correlated among themselves. When multicollinearity is present among predictors in a given model, the estimated regression coefficients tend to have large sampling variability (NKNW, 1996, p.290). This means that the estimation of the true regression coefficients will vary drastically from sample to sample, indicating little information regarding the true regression parameters.
The possibility of multicollinearity within the predictors of Final – ANC was examined by calculating the Variance Inflation Factor (VIF) of each predictor in Final
ANC. The variance inflation factor for predictor, k, is equal to
(
2k)
k R 1 1 VIF − =where is the coefficient of multiple determination when a predictor is regressed on the p-2 other predictors in the model (NKNW, 1996, p.385-388). Typically, a variance inflation factor of 10 or more indicates the presence of multicollinearity between two or more of the model predictors. The variance inflation factors of the final nine predictors, plus one, were examined. It was previously noted that the two % urban TM variables appeared to be related. Therefore, % urban – low density was included in order to see if this previously identified association significantly altered the model Final – ANC. The results of the VIF analysis are listed in Table 7. None of the calculated variance inflation factors were greater than ten. Therefore, multicollinearity did not appear to be present within these ten predictors, at least as measured by the VIF.
k 2
Table 7: Variance Inflation Factors of Final – ANC plus % Urban – Low Density
Predictor VIF
% Probable Row Crops 1.73
% Pasture 1.90
% Emergent Wetlands 2.37 % Woody Wetlands 2.12 % Urban – Low Density 3.10 % Urban – High Density 2.20
% Quarry 1.27
Elevation 1.17 Carbonate Bedrock 1.34
Felsic Bedrock 1.13
Although none of the VIF values were greater than 10, it was noted that the highest VIF values belonged to the two percent urban area predictors, as well as the two percent wetlands predictors. The relationships between these variables and the effect of including and/or excluding certain combinations of these variables were examined further. When all of the above ten predictors were included in a regression model, each predictor was significant at the α = 0.10 level of significance except for percent urban – low density and percent quarry. When percent urban – low density (which had a p-value of 0.41) was eliminated the model, the resulting predictors were the same ones identified in the Final – ANC model.
When both of the percent urban predictors were included in the model, only percent urban – high density was significant. If one of the two percent urban predictors were included in a model, then the included percent urban predictor was significant at the α = 0.01 level of significance. The implications of this will be further described below.
The relation between percent emergent wetlands and percent woody wetlands was different from the relationship between the percent urban predictors. When both percent
wetland predictors were included in a model, both predictors were significant, but their regression coefficients indicated opposite effects on acid neutralizing capacity. Percent woody wetlands had a negative coefficient and percent emergent wetlands had a positive coefficient.
Percent woody wetlands was always significant at the 0.01 level of significance in the model without percent emergent wetlands. But when excluding percent woody
wetlands from the possible regression models, percent emergent wetlands quickly
became insignificant with p-values greater than 0.50. Further, the coefficient was always negative for each percent wetlands predictor when only one of the two was included in a model.
Therefore, the following conclusions regarding the relationships between the percent wetlands and percent urban predictor variables can be made:
The two percent urban variables provide nearly equivalent information about ANC; only one of the two needs to be included in the final model. Percent emergent wetlands was not a significant predictor of ANC, but its
negative relationship with percent woody wetlands did significantly affect the predictive ability of a regression model. Due to this relationship, either both or neither of them should be included in the final model. The
scientific reasoning for this is unknown.
Therefore, given the available information regarding the location and landscape surrounding each of the sampled stream sites, the simplest and most informative least squares regression model for predicting acid neutralizing capacity included the predictors
0 1000 2000 3000 4000 5000 0 50 100 15 0 20 0 Ord er ed R esp on se - A N C
C.I. for E(ANC)
Figure 15: Individual 95% CI for E(ANC) (Sorted by Original ANC Response)