Multivariate partial least squares (PLS) regression was used to determine statistically significant associations between Enterococcus density levels and the variables listed in Table 3.4. A total of six different models were run in order to
determine the best-fit model. The following paragraphs walk through the model selection process and results of the PLS procedures.
The variable “month” was removed from the model because some months had observations in 2010 but in none of the other years. When the PLS procedure was run with “month” included, no significant factors were extracted, so the analysis could not produce any results. The PLS procedure was re-run, excluding “month” from the model. In this model (Model 1), two statistically significant factors were extracted (factor 1: p-value <0.0001, factor 2: p-value 0.008). Appendix A displays the percent variation accounted for by the partial least squares factors for each variable in all the models tested leading up to the final selected model. The following six variables in Model 1 fell below Wold’s criterion of 0.8 in the variable importance plot: station SC2, the three precipitation variables from the Charleston International Airport, water temperature, and the number of septic tanks located within a half mile of each station.
The PLS procedure was re-run, excluding all precipitation values from the Charleston International Airport, water temperature, and the number of septic tanks located within a half mile of each station (Model 2). Station SC2 was not excluded from the model even though it fell below Wold’s criterion because it is a categorical variable; removing SC2 would exclude 119 observations, almost a third of the total data set used in this study. Model 2 included all stations, all precipitation data for the downtown Charleston U.S. Customs House, conductivity, water height, and septic tanks located within a one-mile buffer of each station. Two statistically significant factors were extracted (factor 1: p-value <0.0001, factor 2: p-value 0.014). Station SC2 still fell below Wold’s criterion of 0.8 on the variable importance plot.
Because station RT-10116 contained observations from 2010 only, it was removed from the model to check that this station was not skewing the results. The PLS procedure was re-run excluding station RT-10116 (Model 3). Model 3 included stations SC1, SC2, and SC3; all precipitation data from the downtown Charleston U.S. Customs House; conductivity; water height; and septic tanks located within a one-mile buffer of each station. Model 3 extracted two statistically significant factors (factor 1: p-value <0.0001, factor 2: p-value 0.044). Because the percent variation accounted for by the partial least squares factors did not change substantially for the predictor variables (totals: 36.93% for Model 2 and 41.38% for Model 3), it was considered appropriate to keep station RT-10116 in the model so that the observations from 2010 were retained.
Before station RT-10116 was added back into the model, station SC2 was also excluded in order to see how the omission affected the results. This was done because station SC2 fell below Wold’s criterion of 0.8 in the variable importance plot for all the previous models. The PLS procedure was re-run excluding both station SC2 and station RT-10116 (Model 4). Model 4 included stations SC1 and SC3, all precipitation data from the downtown Charleston U.S. Customs House, conductivity, water height, and septic tanks located within a one-mile buffer of each station. Model 4 extracted two factors (factor 1: p-value <0.0001, factor 2: p-value 0.064). Precipitation values two and three days before sample collection and conductivity fell below Wold’s criterion of 0.8 but only by about a tenth of a decimal. Because removing station SC2 excluded so many observations from the data set, both station SC2 and station RT-10116 were added back into the model.
After considering the data further, it was realized that because the number of septic tanks located in a one-mile buffer around each station was a constant value for each station, it was essentially a weighted numerical value assigned to the variable “station.” Therefore, the number of septic tanks was removed from the model completely. The PLS procedure was re-run (Model 5) and extracted two statistically significant factors (factor 1: p-value <0.0001, factor 2: p-value 0.001). Model 5 included stations RT-10116, SC1, SC2, and SC3; all precipitation data from the downtown Charleston U.S. Customs House; conductivity; and water height. All variables except station SC2 remained above Wold’s criterion of 0.8 on the variable importance plot, although conductivity was only slightly above it. In addition, the partial least squares factors accounted for only 15.48% of the variation in water specific conductance.
In an effort to obtain a model in which the predictor variables explain the highest percent of variation, conductivity was excluded from the model and the PLS procedure was re-run (Model 6). In Model 6, the variation summary shows that the two factors explain 46.71% of the total predictor variation and 43.16% of the response variation. The percent of predictor variation accounted for increased in Model 6 compared to Model 5 (42.44%). Therefore, Model 6 appeared to be the best-fit model. Because there were several missing observations in the precipitation data set, PROC PLS excluded these from the analysis, and no predictions were computed for those observations. The final analysis used 367 observations. In Model 6, the absolute minimum PRESS was achieved with two extracted factors, each with a statistically significant p-value below 0.05 (factor 1: p-value <0.0001, factor 2: p-value 0.002). The complete factor selection process is shown in Tables 3.5 and 3.6. The percent variation accounted for by the partial least squares factors in the final model is shown in Table 3.7.
The correlation loading plot summarizes the two factors and the features in the PLS model, displaying the primary results (Figure 3.12). This plot is composed of overlaid scatter plots, which include the variation explained by both factors for each variable and the weighted effects of the model (SAS Institute Inc., 2009b). The amount of variation explained by the model for each variable corresponds to its distance from the origin of the plot. The transformed Enterococcus levels, represented on this plot by their observation number in the data set, are randomly clustered towards the origin, indicating that the data contribute appropriate information about the two factors. Drawing perpendicular lines from the predictor variables on the plot to a line that connects the origin and the response variable shows the relative positive and negative correlations between the predictor and response variables. Figure 3.13 displays the drawn lines that were used to interpret the plot. The correlation loading plot indicates that station SC3 is highly positively correlated with the transformed Enterococcus density levels (labeled “log_MPN_adj” on the plot), more so than any other predictor variable in the model. Water height, followed closely by precipitation, is also positively correlated with the transformed Enterococcus density levels. Station SC2, which is located towards the origin of the plot, shows no correlation with the transformed Enterococcus density levels. Station RT-10116 is slightly negatively correlated with the transformed Enterococcus density levels, and station SC1 is also negatively correlated with them.
All variables in the final model, except for station SC2, remained above Wold’s criterion of 0.8 on the variable importance plot (Figure 3.14). As stated previously, station SC2 was kept in the model to avoid eliminating almost a third of the data set. The regression coefficients profile in Figure 3.15 shows the importance of each predictor variable in predicting the response variable alone. In the regression coefficients profile plot, stations RT-10116 and SC1 have negative coefficients. Looking back at the correlation loading plot, these are the variables that tend to be negatively correlated with the dependent variable. The plot shown in Figure 3.16 gives the distance from each point to the PLS model with regard to the predictors first and then the responses, which allows potential outliers to be identified. Points that are dramatically farther from the model than the rest of the points could be considered outliers; the points scattered far to the right on the X-axis of this plot are potential outliers. However, because the methods for reading Enterococcus density levels are reliable and because many factors can drastically affect Enterococcus density levels in water, these points were not excluded from the analysis. The parameter estimates that are used to create the prediction equation are displayed in Table 3.8.
In order to confirm that the results of the PLS procedure were accurate, the GLM procedure was run using the same data as the final model. The F-value in the analysis of variance (Table 3.9) was statistically significant (p-value <0.0001), indicating that the model does explain the variance of the response variable. The R², which is the proportion of total variance explained by the model, was 0.462199 (46.22%). This is very close to the variation summary from the PLS procedure for Model 6, in which 46.71% of the predictor variation was explained by the model.
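The R² and F value quoted above follow directly from the sums of squares and degrees of freedom in Table 3.9. The short Python sketch below reproduces that arithmetic; it is an illustration only (the analysis itself was run with the GLM procedure), and the function name `anova_summary` is hypothetical.

```python
# Values copied from the Model and Error rows of Table 3.9.
SS_MODEL, DF_MODEL = 629.061249, 8
SS_ERROR, DF_ERROR = 731.958159, 355


def anova_summary(ss_model, df_model, ss_error, df_error):
    """Return (R-squared, F value) from the model and error rows of an ANOVA table."""
    ss_total = ss_model + ss_error      # corrected total sum of squares
    r_squared = ss_model / ss_total     # share of response variance explained
    ms_model = ss_model / df_model      # mean square for the model
    ms_error = ss_error / df_error      # mean square for error
    return r_squared, ms_model / ms_error


r_squared, f_value = anova_summary(SS_MODEL, DF_MODEL, SS_ERROR, DF_ERROR)
print(f"R-squared = {r_squared:.6f}")
print(f"F value   = {f_value:.2f}")
```

Both printed values can be compared with the R² reported in the text and the F value in Table 3.9.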
Figure 3.2: Map of zoning categories in 2017 for the Shem Creek watershed
Table 3.1: Percent land use by zoning category for the Shem Creek watershed in 2010 and 2017
Zoning Category    2010 (% cover)    2017 (% cover)
Commercial         3.5               2.3
Residential        82.2              83.9
Vacant             10.3              8.0
Recreational       1.5               1.6
Agricultural       0.0               0.0
Undevelopable      0.9               1.3
Other              1.6               2.9
Figure 3.3: Zoning categories that changed from agricultural, recreational, vacant, or undevelopable in 2010 to commercial or residential in 2017 in the Shem Creek watershed
Figure 3.4: Natural log of Enterococcus density levels (ln(MPN)) included in the analysis graphed over time. This figure excludes Enterococcus density levels that fell below the detection limit (<10MPN/100ml).
Figure 3.5: Number of septic tanks within a half-mile and a one-mile buffer or radius of each water quality monitoring station
[Bar chart: Number of Septic Tanks Located within a .5 and 1 mile buffer of Water Quality Monitoring Stations. X-axis: Water Quality Monitoring Station (RT-10116, SC1, SC2, SC3). Y-axis: Number of Septic Tanks. Series: septic tanks in .5 mile, septic tanks in 1 mile. Bar labels in extracted order: 1, 26, 21, 20, 31, 34, 48, 109.]
Figure 3.6: Number of Enterococcus density levels that exceeded the Class SB saltwater recreational limit for a single sample (501MPN/100ml) by year
[Bar chart: Number of Exceedances >501MPN. X-axis: Year (2010, 2013, 2014, 2015, 2016, 2017). Y-axis: Number of Exceedances. Bar labels in extracted order: 1, 6, 14, 14, 16, 22.]
Figure 3.7: Number of Enterococcus density levels that exceed the Class SB saltwater recreational limit of 501MPN/100ml by month
[Bar chart: Number of Exceedances >501MPN. X-axis: Month. Y-axis: Number of Exceedances. Bar labels in extracted order: 8, 11, 10, 16, 17, 9.]
Figure 3.8: Distribution of the computed normal scores from the ranks (norm_rank) for natural log transformed Enterococcus data (log_MPN_adj). Note that SAS’s terminology for the natural log (ln) is “log”.
Figure 3.9: Probability plot for the computed normal scores from the ranks (norm_rank) of the natural log transformed Enterococcus data (log_MPN_adj) against normal
percentile values. A perfect normal curve would be on the “normal line” indicated by the figure. The ties can be seen where there are multiple points on the same Y-axis value.
Figure 3.10: Fit plot for the computed normal scores from the ranks (Rank for Variable log_MPN_adj) and the natural log transformed Enterococcus data (log_MPN_adj)
Table 3.2: Analysis of Variance for testing that the Rank Variables Reliably Predict the Transformed Enterococcus (ln(MPN)) Values in the Helsel’s Robust Method
Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 996.07825 996.07825 16231.2 <0.0001
Error 347 21.29479 0.06137
Corrected Total 348 1017.37304
Table 3.3: Parameter Estimates for the Helsel’s Robust Method for Predicting Values of Enterococcus that Fell Below the Detection Limit (<10MPN/100ml)
Variable     Label                            DF    Parameter Estimate    Standard Error    t value    Pr > |t|
Intercept    Intercept                        1     4.72752               0.01340           352.74     <0.0001
norm_rank    Rank for Variable log_MPN_adj
Figure 3.11: Data points with uncensored (<10MPN/100ml) fitted values computed by Helsel’s Robust Method
Table 3.4: All Variable Names Included in the Analysis and Their Variable Description
Variable Name Variable Description
RT-10116 Water quality monitoring station
SC1 Water quality monitoring station
SC2 Water quality monitoring station
SC3 Water quality monitoring station
Month Month
Rain_1d_airport Total precipitation on the day prior to water sample collection at the Charleston International Airport
Rain_2d_airport Total sum of precipitation on the 2 days prior to water sample collection at the Charleston International Airport
Rain_3d_airport Total sum of precipitation on the 3 days prior to water sample collection at the Charleston International Airport
Rain_1d_dt Total precipitation on the day prior to water sample collection at the Charleston Customs House, located Downtown
Rain_2d_dt Total sum of precipitation on the 2 days prior to water sample collection at the Charleston Customs House, located Downtown
Rain_3d_dt Total sum of precipitation on the 3 days prior to water sample collection at the Charleston Customs House, located Downtown
Cond_bottom Specific conductance of the water
Temp Water temperature
Height Water height
Sep_pt5 Number of septic tanks located within a half-mile radius of each water quality monitoring station
Sep_1 Number of septic tanks located within a one-mile radius of each water quality monitoring station
Table 3.5: Cross Validation for the Number of Extracted Factors
Number of Extracted Factors    Root Mean PRESS    T2          Prob > T2
1                              1.002732           50.70773    <0.0001
2                              0.796101           8.496466    0.0020
3                              0.764914           0.42245     0.5420
4                              0.763042           0           1.0
5                              0.763379           0.196669    0.6380
6                              0.763596           0.46835     0.4800
7                              0.763807           0.74074     0.3740
8                              0.763798           0.717576    0.3720
9                              0.763798           0.717576    0.3720
Table 3.6: Descriptive Results of the Cross Validation for the Number of Extracted Factors Process
Minimum root mean PRESS                    0.7630
Minimizing number of factors               3
Smallest number of factors with p > 0.1    2
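The selection rule summarized in Tables 3.5 and 3.6 can be sketched in a few lines of Python (used here only for illustration; the analysis itself was run in PROC PLS). The sketch assumes that the nine rows of Table 3.5 correspond to 0 through 8 extracted factors, which is how PROC PLS usually lists its cross-validation results; that indexing is an assumption made here, chosen because it reproduces the values in Table 3.6. The function name `select_factors` is hypothetical, and the reported `<0.0001` is stored as its upper bound, 0.0001.

```python
# Root mean PRESS and Prob > T2 copied from Table 3.5, in row order.
# ASSUMPTION: the first row is the zero-factor model, so row i has i factors.
ROOT_MEAN_PRESS = [1.002732, 0.796101, 0.764914, 0.763042, 0.763379,
                   0.763596, 0.763807, 0.763798, 0.763798]
# "<0.0001" in the table is stored as 0.0001 (its upper bound).
PROB_T2 = [0.0001, 0.0020, 0.5420, 1.0, 0.6380, 0.4800, 0.3740, 0.3720, 0.3720]


def select_factors(press, prob, alpha=0.1):
    """Return (minimizing number of factors, smallest number with p > alpha)."""
    # Number of factors at which the root mean PRESS is lowest.
    minimizing = min(range(len(press)), key=lambda i: press[i])
    # Fewest factors whose PRESS is not significantly worse than that minimum.
    smallest = next(i for i, p in enumerate(prob) if p > alpha)
    return minimizing, smallest


minimizing, smallest = select_factors(ROOT_MEAN_PRESS, PROB_T2)
print("Minimum root mean PRESS:", round(ROOT_MEAN_PRESS[minimizing], 4))
print("Minimizing number of factors:", minimizing)
print("Smallest number of factors with p > 0.1:", smallest)
```

Under that indexing assumption, the model with the fewest factors that is not significantly worse than the minimum-PRESS model is the one retained.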
Table 3.7: Percent Variation Accounted for by Partial Least Squares Factors
Variable               Percent Variation Accounted for by the 2 PLS factors
Model Effects
  Station RT-10116     17.5537
  Station SC1          28.9036
  Station SC2          0.3066
  Station SC3          57.5639
  Rain_1d_dt           76.6546
  Rain_2d_dt           91.6853
  Rain_3d_dt           77.9720
  Height               23.0367
  Current              20.8880
  Total                46.7095
Dependent Variables
  log_MPN_adj          43.1631
  Current              4.9529
Figure 3.13: Correlation loading plot from Model 6 with lines drawn in for reading and analyzing the plot. The closer the purple dots are towards log_MPN_adj, the more correlated the predictor variable at the end of the purple lines is with the transformed Enterococcus density levels.
Figure 3.14: Variable importance plot based on the Variable Importance for Projection (VIP) statistics of Wold for Model 6, the final model
Figure 3.16: The “distance to response and predictor models” plot gives the distance from each point to the PLS model with regard to the predictors and responses, respectively.
Table 3.8: Parameter Estimates
                     log_MPN_adj
Intercept            3.668300
Station RT-10116     -2.280029
Station SC1          -0.086197
Station SC2          -0.064866
Station SC3          1.283954
Rain_1d_dt           0.484188
Rain_2d_dt           0.271817
Rain_3d_dt           0.190731
Height               0.250721
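As an illustration of how the parameter estimates in Table 3.8 form a prediction equation, the Python sketch below adds the intercept, the effect of one station, and each slope multiplied by its predictor value. It rests on assumptions that are not stated in the table: that each station enters as a 0/1 indicator, that the continuous predictors are in the same units as the analysis data set, and that the response is the natural log of MPN, so `exp` back-transforms it. The function name `predict_log_mpn` and the sample inputs are hypothetical.

```python
import math

# Parameter estimates copied from Table 3.8 (response: log_MPN_adj).
INTERCEPT = 3.668300
STATION_EFFECT = {"RT-10116": -2.280029, "SC1": -0.086197,
                  "SC2": -0.064866, "SC3": 1.283954}
SLOPES = {"Rain_1d_dt": 0.484188, "Rain_2d_dt": 0.271817,
          "Rain_3d_dt": 0.190731, "Height": 0.250721}


def predict_log_mpn(station, predictors):
    """Predicted log_MPN_adj for one sample.

    ASSUMPTION: each station is a 0/1 indicator and the continuous predictors
    are in the same units as the analysis data set. Missing predictors are
    treated as zero.
    """
    value = INTERCEPT + STATION_EFFECT[station]
    for name, slope in SLOPES.items():
        value += slope * predictors.get(name, 0.0)
    return value


# Hypothetical sample: station SC3 with every continuous predictor set to zero.
log_mpn = predict_log_mpn("SC3", {})
print(f"predicted log_MPN_adj = {log_mpn:.6f}")
print(f"back-transformed MPN  = {math.exp(log_mpn):.1f}")
```

With all continuous predictors at zero, the prediction reduces to the intercept plus the station effect, which makes the relative ordering of the stations easy to read off.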
Table 3.9: Analysis of Variance Table, Testing if the Final Model Explains the Variance of the Response Variables
Source DF Sum of Squares Mean Square F Value Pr > F
Model 8 629.061249 78.632656 38.14 <0.0001
Error 355 731.958159 2.061854