Treatment of Missing Data - Results and Discussion of Predictive Models

6. Results and Discussion of Predictive Models

6.1 Treatment of Missing Data

Prior to performing the multivariate analyses, it was important to evaluate and address the issue of missing survey data. Missing data can be a concern if the respondents who failed to provide data differ in some meaningful way from the rest of the sample. If the respondents missing data are different, then excluding them can bias statistical results. In this study, two of the demographic variables had missing data requiring investigation: income and farm income. Of the 415 survey respondents, 44 respondents (10.6 percent) did not provide information about their total annual incomes. Twenty-two respondents (5.3 percent) declined to state how much of their total annual income came from farming.

To investigate the potential impacts of these missing data on the study results, t-tests were conducted comparing the respondents who reported data and those who did not on all of the other study variables. For income, the t-tests revealed only one significant difference: those who did not report their income scored slightly lower, on average, on perceived control (3.2 versus 3.5) than those who did (significant at the .05 level). No other differences were significant at the .05 level for a two-tailed test.
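The group comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which t-test variant was used, so Welch's unequal-variance form is assumed here, and the scores are invented so that the group means match the reported 3.5 and 3.2. The function and variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Assumption: Welch's unequal-variance t-test; the source does not
    state which variant was used.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group mean.
    v_a = variance(group_a) / n_a
    v_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / sqrt(v_a + v_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical perceived-control scores, invented for illustration only.
reported_income = [3.0, 3.5, 4.0, 3.5, 3.5, 4.0, 3.0]  # mean 3.5
missing_income = [3.0, 3.5, 3.0, 3.5, 3.0]             # mean 3.2

t_stat, dof = welch_t(reported_income, missing_income)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The resulting statistic would then be compared against the two-tailed critical value at the .05 level for the computed degrees of freedom, once for each study variable.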

In the t-tests comparing those who reported farm income and those who did not, only two variables were significant: cost share for cover crops and fear of penalties. The t-test results for cost share are invalid because no respondents who received cost share for cover crops failed to report their farm income. For fear of penalties, 78.9 percent of respondents who reported their farm income believed they might be penalized whereas only 54.6 percent of respondents who did not report farm income believed this. Even though respondents were assured that their participation in the survey was anonymous, the significant relationship between beliefs about penalties and willingness to report farm income may reflect a concern that not answering the question could make them more susceptible to regulatory scrutiny. However, this relationship did not hold for reporting of total income, which limits this concern.

In summary, the t-test results indicate very few significant differences between the producers who reported income and farm income data and those who did not. This suggests that the missing data are unlikely to bias the study results. However, to be sure that dropping the non-responsive producers from the study sample would not bias the results, multiple imputation was conducted.

Multiple imputation is a missing-data replacement procedure composed of two distinct steps. First, an imputation model is selected and missing data are generated using this model. Second, the desired statistical tests are performed using each imputed data set and the results are combined. This procedure generally is favored over other methods of addressing missing data because it is found to be relatively insensitive to whether the data are missing at random or not and it can estimate the amount of missing information (McKnight, McKnight, Sidani, & Figueredo, 2007). The amount of missing information indicates the influence that missing data have on statistical inferences and can help determine whether it is reasonable to ignore them in analyses. In the statistical software Stata, the influence of missing information is reported as the relative increase in variance (“RVI”) for each model tested. RVI measures how much the variance of the parameter estimates increases due to missing data. Greater variance tends to make parameter estimates less reliable and standard errors less accurate (McKnight et al., 2007).

Despite the potential benefits of using multiple imputation to address missing data, a decision to use this technique must weigh the benefits against the procedure’s key drawback: the inability to conduct many types of post-estimation analyses. For example, likelihood-ratio tests are not currently applicable to multiple imputation results (StataCorp, 2009). RVI values can help indicate whether the amount of missing information in each model is significant enough to tip the scale in favor of using this approach.

In this dissertation, the imputation model was based on a multivariate normal distribution and included all of the study variables. Twenty imputations were performed, resulting in 20 distinct complete data sets. In this case, each of the study models was tested using each of the 20 imputed data sets and Stata was used to pool the results into one final set of parameter estimates and standard errors for each model. To determine whether the missing income and farm income data were likely to bias results in this study, the RVI values for each model were checked.
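To make the pooling step and the RVI concrete, the following is a minimal sketch of Rubin's combining rules for a single parameter. It is not the Stata routine used in the study: the function name is hypothetical, the inputs are invented, and only five imputations are shown instead of the 20 used here. The RVI is computed as (1 + 1/M) B / W, where B is the between-imputation variance of the estimates and W is the average within-imputation variance.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one parameter across M imputed data sets using Rubin's rules.

    Returns (pooled estimate, pooled standard error, RVI).
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the M estimates.
    between = variance(estimates)
    # Total variance adds the between component, inflated for finite M.
    total = within + (1 + 1 / m) * between
    # Relative increase in variance attributable to the missing data.
    rvi = (1 + 1 / m) * between / within
    return pooled_estimate, sqrt(total), rvi


# Hypothetical coefficient estimates and standard errors from five
# imputed data sets, invented for illustration only.
estimates = [0.50, 0.51, 0.49, 0.50, 0.50]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

coef, se, rvi = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {coef:.3f}, pooled SE = {se:.4f}, RVI = {rvi:.2%}")
```

When the estimates barely vary across imputed data sets, B is small relative to W and the RVI approaches zero, which is the pattern that indicates the missing data have little influence on the results.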

Based on these values, using the imputed data did not add a significant amount of information to any of the study models. The average RVI was less than 2 percent in each model, which is considered trivial (McKnight et al., 2007). Due to the low RVI values and the t-test results, the choice was made to preserve the ability to conduct model post-tests by not using the imputed data. As a result, the respondents with missing data for income, farm income, and/or education level were excluded from the multivariate model testing in the dissertation, resulting in a final study sample of 369 producers.

In document Agricultural nutrient management in the Neuse River Basin : exploring the links between mandates, motivations, and behavior (Page 130-133)