
Multivariate partial least squares (PLS) regression was used to determine statistically significant associations between Enterococcus density levels and the variables listed in Table 3.4. A total of six models were run to determine the best-fit model. The following paragraphs walk through the model selection process and the results of the PLS procedures.

The variable “month” was removed from the model because some months sampled in 2010 had no observations in any of the other years. When the PLS procedure was run with “month” included, no significant factors were extracted, so the analysis could not produce any results. The PLS procedure was re-run without “month.” In this model (Model 1), two statistically significant factors were extracted (factor 1: p-value <0.0001, factor 2: p-value 0.008). Appendix A displays the percent variation accounted for by the partial least squares factors for each variable in all the models tested leading up to the final selected model. The following six variables in Model 1 fell below Wold’s criterion of 0.8 in the variable importance plot: station SC2, the three precipitation variables from the Charleston International Airport, water temperature, and the number of septic tanks located within a half mile of each station.
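For reference, the variable importance for projection (VIP) statistic to which Wold’s criterion is applied is commonly defined as shown below. This is the standard textbook form, not a formula taken from this study. The notation is introduced here for illustration: p is the number of predictors, A the number of extracted factors, w_{ja} the weight of predictor j on factor a, and SSY_a the response sum of squares explained by factor a.

```latex
\mathrm{VIP}_j \;=\; \sqrt{\, p \,\cdot\,
  \frac{\sum_{a=1}^{A} \mathrm{SSY}_a \left( \dfrac{w_{ja}}{\lVert \mathbf{w}_a \rVert} \right)^{2}}
       {\sum_{a=1}^{A} \mathrm{SSY}_a} \,}
```

Under this definition the squared VIP values average to 1 across the p predictors, so a value below 0.8 marks a predictor that contributes noticeably less than average to the fitted factors.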

The PLS procedure was re-run excluding all precipitation values from the Charleston International Airport, water temperature, and the number of septic tanks located within a half mile of each station (Model 2). Station SC2 was not excluded even though it fell below Wold’s criterion, because station is a categorical variable: removing SC2 would exclude 119 observations, almost a third of the total data set used in this study. Model 2 included all stations, all precipitation data from the downtown Charleston U.S. Customs House, conductivity, water height, and septic tanks located within a one-mile buffer of each station. It extracted two statistically significant factors (factor 1: p-value <0.0001, factor 2: p-value 0.014). Station SC2 still fell below Wold’s criterion of 0.8 on the variable importance plot.

Because station RT-10116 contained observations only from 2010, it was removed from the model to check that this station was not skewing the results. The PLS procedure was re-run excluding station RT-10116 (Model 3). Model 3 included stations SC1, SC2, and SC3; all precipitation data from the downtown Charleston U.S. Customs House; conductivity; water height; and septic tanks located within a one-mile buffer of each station. Model 3 extracted two statistically significant factors (factor 1: p-value <0.0001, factor 2: p-value 0.044). Because the percent variation accounted for by the partial least squares factors did not change substantially for the predictor variables (totals: 36.93 for Model 2 and 41.38 for Model 3), keeping station RT-10116 in the model was judged appropriate in order to retain the observations from 2010.

Before station RT-10116 was added back into the model, station SC2 was also excluded to see how its omission affected the results, because station SC2 fell below Wold’s criterion of 0.8 in the variable importance plot for all the previous models. The PLS procedure was re-run excluding both station SC2 and station RT-10116 (Model 4). Model 4 included stations SC1 and SC3, all precipitation data from the downtown Charleston U.S. Customs House, conductivity, water height, and septic tanks located within a one-mile buffer of each station. Model 4 extracted two statistically significant factors (factor 1: p-value <0.0001, factor 2: p-value 0.064). Precipitation values two and three days before sample collection and conductivity fell below Wold’s criterion of 0.8, but only by about a tenth of a decimal. Because removing station SC2 excluded so many observations from the data set, both station SC2 and station RT-10116 were added back into the model.

On further consideration of the data, the number of septic tanks located within a one-mile buffer of each station is a constant value for each station, so it is essentially a weighted numerical value assigned to the variable “station.” Therefore, the number of septic tanks was removed from the model completely. The PLS procedure was re-run (Model 5) and extracted two statistically significant factors (factor 1: p-value <0.0001, factor 2: p-value 0.001). Model 5 included stations RT-10116, SC1, SC2, and SC3; all precipitation data from the downtown Charleston U.S. Customs House; conductivity; and water height. All variables except station SC2 remained above Wold’s criterion of 0.8 on the variable importance plot, although conductivity was only slightly above it. In addition, the partial least squares factors accounted for only 15.48% of the variation in water specific conductance.
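The redundancy between a per-station septic tank count and the station variable can be illustrated with a minimal sketch. The counts and observations below are hypothetical placeholders, not values from this study, and the helper names are introduced only for illustration. The point is that a count that is constant within each station can always be rebuilt exactly as a weighted sum of the station indicator variables, so it carries no information beyond “station.”

```python
# Hypothetical septic tank counts, one constant value per station
# (placeholder numbers, NOT the values used in the study).
SEPTIC_BY_STATION = {"RT-10116": 10, "SC1": 20, "SC2": 30, "SC3": 40}
STATIONS = list(SEPTIC_BY_STATION)


def station_dummies(station):
    """Return the 0/1 indicator vector that encodes one station."""
    return [1 if station == s else 0 for s in STATIONS]


def septic_from_dummies(dummies):
    """Rebuild the septic count as a weighted sum of the station indicators."""
    weights = [SEPTIC_BY_STATION[s] for s in STATIONS]
    return sum(w * d for w, d in zip(weights, dummies))


# Hypothetical observations: (station, septic tank count recorded on that row).
observations = [("SC1", 20), ("SC3", 40), ("RT-10116", 10), ("SC2", 30), ("SC3", 40)]

# True when every row's septic count is reproduced from the station dummies
# alone, i.e. the septic variable is a linear combination of the indicators.
redundant = all(
    septic_from_dummies(station_dummies(station)) == septic
    for station, septic in observations
)
print(redundant)
```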

In an effort to obtain a model whose factors explain the highest percent of predictor variation, conductivity was excluded from the model and the PLS procedure was re-run (Model 6). In Model 6, the variation summary shows that the two factors explain 46.71% of the total predictor variation and 43.16% of the response variation. The percent of predictor variation accounted for increased from 42.44% in Model 5 to 46.71% in Model 6. Therefore, Model 6 appeared to be the best-fit model. Because there were several missing observations in the precipitation data set, PROC PLS excluded these from the analysis, and no predictions were computed for them. The final analysis used 367 observations. In Model 6, the absolute minimum PRESS was achieved with two extracted factors, each with a statistically significant p-value less than 0.05 (factor 1: p-value <0.0001, factor 2: p-value 0.002). The complete factor selection process is shown in Tables 3.5 and 3.6. The percent variation accounted for by the partial least squares factors in the final model is shown in Table 3.7.

The correlation loading plot summarizes the two factors and the features of the PLS model, displaying the primary results (Figure 3.12). This plot is composed of overlaid scatter plots, which include the variation explained by both factors for each variable and the weighted effects of the model (SAS Institute Inc., 2009b). The amount of variation explained by the model for each variable corresponds to its distance from the origin of the plot. The transformed Enterococcus levels, represented on this plot by their observation numbers in the data set, are randomly clustered towards the origin, indicating that the data contribute appropriate information about the two factors. Drawing perpendicular lines from the predictor variables to a line that connects the origin and the response variable shows the relative positive and negative correlations between the predictor and response variables. Figure 3.13 displays the lines that were drawn to interpret the plot. The correlation loading plot indicates that station SC3 is highly positively correlated with the transformed Enterococcus density levels (labeled “log_MPN_adj” on the plot), more so than any other predictor variable in the model. Water height, followed closely by precipitation, is also positively correlated with the transformed Enterococcus density levels. Station SC2, which lies near the origin of the plot, shows essentially no correlation with the transformed Enterococcus density levels. Station RT-10116 is slightly negatively correlated with the transformed Enterococcus density levels, and station SC1 is also negatively correlated with them.
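The perpendicular-line reading described above amounts to a scalar projection of each predictor point onto the line from the origin to the response variable. The sketch below shows that geometry only. The coordinates and predictor names are hypothetical, not read from Figure 3.12 or 3.13, and the function name is introduced for illustration.

```python
import math


def scalar_projection(predictor, response):
    """Signed length of the predictor's projection onto the origin-to-response line.

    Positive: the perpendicular foot falls on the response side of the origin
    (positive association). Negative: it falls on the opposite side.
    """
    px, py = predictor
    rx, ry = response
    return (px * rx + py * ry) / math.hypot(rx, ry)


# Hypothetical (factor 1, factor 2) correlation loadings, for illustration only.
response = (0.8, 0.6)
predictors = {
    "near_response": (0.7, 0.5),    # points the same way as the response
    "opposite_side": (-0.5, -0.2),  # points away from the response
    "near_origin": (0.02, -0.01),   # close to the origin, little association
}

projections = {name: scalar_projection(xy, response) for name, xy in predictors.items()}
for name, value in projections.items():
    print(f"{name}: {value:+.3f}")
```

A larger positive projection corresponds to a point whose perpendicular foot lands closer to the response variable, which is how station SC3 is read as the most positively correlated predictor in the text.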

All variables in the final model, except for station SC2, remained above Wold’s criterion of 0.8 on the variable importance plot (Figure 3.14). As stated previously, station SC2 was kept in the model to avoid eliminating almost a third of the data set. The regression coefficients profile in Figure 3.15 shows the importance of each predictor variable in predicting the response variable alone. In the regression coefficients profile plot, stations RT-10116 and SC1 have negative coefficients. In the correlation loading plot, these are the variables that tend to be negatively correlated with the dependent variable. The plot shown in Figure 3.16 gives the distance from each point to the PLS model with regard to the predictors first and then the responses, which allows potential outliers to be identified. Points that are dramatically farther from the model than the rest could be considered outliers. The points scattered far to the right on the X-axis of this plot are potential outliers. However, because the methods for reading Enterococcus density levels are reliable and because many factors can drastically affect Enterococcus density levels in water, these points were not excluded from the analysis. The parameter estimates used to create the prediction equation are displayed in Table 3.8.

To confirm the results of the PLS analysis, the GLM procedure was run on the same data used in the final model. The F-value in the analysis of variance (Table 3.9) was statistically significant (p-value <0.0001), indicating that the model explains a significant share of the variance of the response variable. The R², the proportion of total variance explained by the model, was 0.462199 (46.22%). This is very close to the variation summary from the PLS procedure for Model 6, in which 46.71% of the predictor variation was explained by the model.
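The reported R² and F-value can be reproduced from the sums of squares and degrees of freedom in Table 3.9. The sketch below is simple arithmetic on those published values. The variable names are introduced here, and the total sum of squares is taken to be the model plus error sums of squares.

```python
# Values copied from Table 3.9 (GLM analysis of variance for the final model).
model_df, model_ss = 8, 629.061249
error_df, error_ss = 355, 731.958159

# Total sum of squares, assumed to equal model SS + error SS.
total_ss = model_ss + error_ss

# R-squared: share of the total sum of squares explained by the model.
r_squared = model_ss / total_ss

# F value: model mean square divided by error mean square.
f_value = (model_ss / model_df) / (error_ss / error_df)

print(f"R-squared = {r_squared:.6f}")
print(f"F value   = {f_value:.2f}")
```

Rounded as printed, these agree with the R² of 0.462199 quoted in the text and the F value of 38.14 in Table 3.9.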

Figure 3.2: Map of zoning categories in 2017 for the Shem Creek watershed

Table 3.1: Percent land use by zoning category for the Shem Creek watershed in 2010 and 2017

Zoning Category    2010 (% cover)    2017 (% cover)
Commercial         3.5               2.3
Residential        82.2              83.9
Vacant             10.3              8.0
Recreational       1.5               1.6
Agricultural       0.0               0.0
Undevelopable      0.9               1.3
Other              1.6               2.9

Figure 3.3: Zoning categories that changed from agricultural, recreational, vacant, or undevelopable in 2010 to commercial or residential in 2017 in the Shem Creek watershed

Figure 3.4: Natural log of Enterococcus density levels (ln(MPN)) included in the analysis graphed over time. This figure excludes Enterococcus density levels that fell below the detection limit (<10MPN/100ml).

[Plot: Enterococcus (MPN); y-axis: ln(MPN), 0 to 12; x-axis: Date]

Figure 3.5: Number of septic tanks within a half-mile and a one-mile buffer (radius) of each water quality monitoring station

[Bar chart: Number of Septic Tanks Located within a .5 and 1 mile buffer of Water Quality Monitoring Stations; y-axis: Number of Septic Tanks, 0 to 120; x-axis: Water Quality Monitoring Station (RT-10116, SC1, SC2, SC3); series: septic tanks in .5 mile, septic tanks in 1 mile; data labels as extracted: 1, 26, 21, 20, 31, 34, 48, 109]

Figure 3.6: Number of Enterococcus density levels that exceeded the Class SB saltwater recreational limit for a single sample (501MPN/100ml) by year

[Bar chart: Number of Exceedances >501MPN; y-axis: Number of Exceedances, 0 to 25; x-axis: Year (2010, 2013, 2014, 2015, 2016, 2017); data labels as extracted: 1, 6, 14, 14, 16, 22]

Figure 3.7: Number of Enterococcus density levels that exceeded the Class SB saltwater recreational limit of 501MPN/100ml by month

[Bar chart: Number of Exceedances >501MPN; y-axis: Number of Exceedances, 0 to 18; x-axis: Month; data labels as extracted: 8, 11, 10, 16, 17, 9]

Figure 3.8: Distribution of the computed normal scores from the ranks (norm_rank) for natural log transformed Enterococcus data (log_MPN_adj). Note that SAS’s terminology for the natural log (ln) is “log”.

Figure 3.9: Probability plot for the computed normal scores from the ranks (norm_rank) of the natural log transformed Enterococcus data (log_MPN_adj) against normal percentile values. A perfect normal curve would be on the “normal line” indicated by the figure. The ties can be seen where there are multiple points on the same Y-axis value.

Figure 3.10: Fit plot for the computed normal scores from the ranks (Rank for Variable log_MPN_adj) and the natural log transformed Enterococcus data (log_MPN_adj)

Table 3.2: Analysis of Variance for testing that the Rank Variables Reliably Predict the Transformed Enterococcus (ln(MPN)) Values in the Helsel’s Robust Method

Source             DF     Sum of Squares    Mean Square    F Value    Pr > F
Model              1      996.07825         996.07825      16231.2    <0.0001
Error              347    21.29479          0.06137
Corrected Total    348    1017.37304

Table 3.3: Parameter Estimates for the Helsel’s Robust Method for Predicting Values of Enterococcus that Fell Below the Detection Limit (<10MPN/100ml)

Variable     Label                            DF    Parameter Estimate    Standard Error    t value    Pr > |t|
Intercept    Intercept                        1     4.72752               0.01340           352.74     <0.0001
norm_rank    Rank for Variable log_MPN_adj

Figure 3.11: Data points with uncensored (<10MPN/100ml) fitted values computed by Helsel’s Robust Method


Table 3.4: All Variable Names Included in the Analysis and Their Variable Description

Variable Name      Variable Description
RT-10116           Water quality monitoring station
SC1                Water quality monitoring station
SC2                Water quality monitoring station
SC3                Water quality monitoring station
Month              Month
Rain_1d_airport    Total precipitation on the day prior to water sample collection at the Charleston International Airport
Rain_2d_airport    Total sum of precipitation on the 2 days prior to water sample collection at the Charleston International Airport
Rain_3d_airport    Total sum of precipitation on the 3 days prior to water sample collection at the Charleston International Airport
Rain_1d_dt         Total precipitation on the day prior to water sample collection at the Charleston Customs House, located Downtown
Rain_2d_dt         Total sum of precipitation on the 2 days prior to water sample collection at the Charleston Customs House, located Downtown
Rain_3d_dt         Total sum of precipitation on the 3 days prior to water sample collection at the Charleston Customs House, located Downtown
Cond_bottom        Specific conductance of the water
Temp               Water temperature
Height             Water height
Sep_pt5            Number of septic tanks located within a half-mile radius of each water quality monitoring station
Sep_1              Number of septic tanks located within a one-mile radius of each water quality monitoring station

Table 3.5: Cross Validation for the Number of Extracted Factors

Number of Extracted Factors    Root Mean PRESS    T²          Prob > T²
1                              1.002732           50.70773    <0.0001
2                              0.796101           8.496466    0.0020
3                              0.764914           0.42245     0.5420
4                              0.763042           0           1.0
5                              0.763379           0.196669    0.6380
6                              0.763596           0.46835     0.4800
7                              0.763807           0.74074     0.3740
8                              0.763798           0.717576    0.3720
9                              0.763798           0.717576    0.3720

Table 3.6: Descriptive Results of the Cross Validation for the Number of Extracted Factors Process

Minimum root mean PRESS                   0.7630
Minimizing number of factors              3
Smallest number of factors with p > 0.1   2
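The selection rule summarized in Table 3.6 can be sketched directly from the cross-validation values in Table 3.5: find the number of factors that minimizes root mean PRESS, then take the smallest number of factors whose p-value exceeds 0.1. The sketch below rests on one loudly stated assumption: the first row of Table 3.5 is treated as the zero-factor model, which is how PROC PLS lists its cross-validation output. The function name and the `first_row_factors` parameter are introduced here for illustration.

```python
# (root mean PRESS, Prob > T2) for each row of Table 3.5, in table order.
# ASSUMPTION: the first row is the zero-factor model (see first_row_factors).
cv_rows = [
    (1.002732, 0.0001),  # reported as <0.0001
    (0.796101, 0.0020),
    (0.764914, 0.5420),
    (0.763042, 1.0),
    (0.763379, 0.6380),
    (0.763596, 0.4800),
    (0.763807, 0.3740),
    (0.763798, 0.3720),
    (0.763798, 0.3720),
]

ALPHA = 0.1  # cutoff named in Table 3.6


def select_factors(rows, alpha=ALPHA, first_row_factors=0):
    """Return (minimizing number of factors, smallest number with p > alpha)."""
    press = [row[0] for row in rows]
    minimizing = press.index(min(press)) + first_row_factors
    smallest_ok = next(i for i, (_, p) in enumerate(rows) if p > alpha) + first_row_factors
    return minimizing, smallest_ok


minimizing, selected = select_factors(cv_rows)
print(minimizing, selected)
```

Under that assumption the rule gives a minimizing number of 3 and a selected number of 2, in agreement with Table 3.6. Passing `first_row_factors=1` applies the row labels as printed in Table 3.5 and shifts both results up by one.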

Table 3.7: Percent Variation Accounted for by Partial Least Squares Factors

Variable                Percent Variation Accounted for by the 2 PLS factors

Model Effects
  Station RT-10116      17.5537
  Station SC1           28.9036
  Station SC2           0.3066
  Station SC3           57.5639
  Rain_1d_dt            76.6546
  Rain_2d_dt            91.6853
  Rain_3d_dt            77.9720
  Height                23.0367
  Current               20.8880
  Total                 46.7095

Dependent Variables
  log_MPN_adj           43.1631
  Current               4.9529

Figure 3.13: Correlation loading plot from Model 6 with lines drawn in for reading and analyzing the plot. The closer the purple dots are towards log_MPN_adj, the more correlated the predictor variable at the end of the purple lines is with the transformed Enterococcus density levels.

Figure 3.14: Variable importance plot based on the Variable Importance for Projection (VIP) statistics of Wold for Model 6, the final model

Figure 3.16: The “distance to response and predictor models” plot gives the distance from each point to the PLS model with regard to the predictors and responses respectively.

Table 3.8: Parameter Estimates

                    log_MPN_adj
Intercept           3.668300
Station RT-10116    -2.280029
Station SC1         -0.086197
Station SC2         -0.064866
Station SC3         1.283954
Rain_1d_dt          0.484188
Rain_2d_dt          0.271817
Rain_3d_dt          0.190731
Height              0.250721
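The parameter estimates in Table 3.8 can be read as a linear prediction equation for log_MPN_adj. The sketch below is one such reading under two assumptions that the source does not state: the estimates apply to the predictors in their original (unscaled) units, and each station enters as a 0/1 indicator. The function name and the input values in the example call are hypothetical.

```python
# Parameter estimates copied from Table 3.8 (response: log_MPN_adj).
INTERCEPT = 3.668300
STATION_COEF = {
    "RT-10116": -2.280029,
    "SC1": -0.086197,
    "SC2": -0.064866,
    "SC3": 1.283954,
}
RAIN_1D_COEF = 0.484188
RAIN_2D_COEF = 0.271817
RAIN_3D_COEF = 0.190731
HEIGHT_COEF = 0.250721


def predict_log_mpn(station, rain_1d, rain_2d, rain_3d, height):
    """Predicted log_MPN_adj for one sample.

    ASSUMPTIONS: coefficients apply to unscaled predictors, and the station
    term is the coefficient of the single station indicator that equals 1.
    """
    return (
        INTERCEPT
        + STATION_COEF[station]
        + RAIN_1D_COEF * rain_1d
        + RAIN_2D_COEF * rain_2d
        + RAIN_3D_COEF * rain_3d
        + HEIGHT_COEF * height
    )


# Hypothetical inputs, chosen only to show the call.
example = predict_log_mpn("SC3", rain_1d=0.0, rain_2d=0.0, rain_3d=0.0, height=0.0)
print(f"{example:.6f}")
```

With all other inputs held equal, the station term alone separates the predictions, which mirrors the positive coefficient for station SC3 and the negative coefficients for stations RT-10116 and SC1 discussed in the text.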

Table 3.9: Analysis of Variance Table, Testing if the Final Model Explains the Variance of the Response Variables

Source    DF     Sum of Squares    Mean Square    F Value    Pr > F
Model     8      629.061249        78.632656      38.14      <0.0001
Error     355    731.958159        2.061854

Chapter 4
