Integration of Automatic Model Selection Procedures

Chapter 3 Research Design and Methodology

3.3 Implementation of Contingent Valuation Models

3.3.3 Integration of Automatic Model Selection Procedures

This is an exhaustive assessment procedure that can take hours to compute, depending on the speed of the computer used. In this study, the dedicated R package “glmulti” was used to conduct the best subset regression procedure (Calcagno and de Mazancourt 2010; Calcagno 2013).
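The cost of the exhaustive search can be made concrete with a minimal Python sketch. It is not the glmulti implementation; it only enumerates the candidate subsets of a full model, whose number is 2^n for n variables (the null model and the full model included). The variable names are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def enumerate_subsets(variables):
    """Yield every subset of the candidate variables.

    The first subset is the empty tuple (the null model, constant only)
    and the last one contains all variables (the full model).
    """
    for k in range(len(variables) + 1):
        yield from combinations(variables, k)


# Hypothetical variable names, used only for illustration.
candidates = ["bid", "income", "age", "education"]
subsets = list(enumerate_subsets(candidates))

# 2**4 = 16 candidate models for four variables.
print(len(subsets))

# The number of candidate models doubles with every additional variable.
for n in (10, 20, 30):
    print(n, 2 ** n)
```

Because every candidate model has to be estimated and scored, the running time grows in proportion to this count, which is consistent with the hours of computation mentioned above.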

The two automatic variable (model) selection techniques are sufficiently useful for determining the best model for ideal survey data with no missing values, but problems arose when they were applied to the real survey data. In the real survey, some respondents did not answer the questions regarding their private information, such as income, or their opinions about environmental issues, which caused a number of missing values in the survey data [30]. Since incomplete observations [31] with missing values were omitted in regression, the more variables (each corresponding to a question in the survey) a model included, the more missing values there could be, and thus the fewer observations the model could contain. Obviously, a model with more observations could make better use of the survey data. However, some variables such as income were too important to be excluded from the model even if their inclusion could cause a substantial reduction of observations.

The stepwise regression and best subset regression techniques can select the model with the smallest AIC from numerous possible models based on the initial full model, but they cannot take account of whether a variable should be retained despite the related reduction of observations. Moreover, as shown in Equation 3.48, the AIC depends on -2 × log-likelihood and the number of variables in the model. Since more observations could lead to a larger value of -2 × log-likelihood [32], the two automatic selection procedures are likely to select a model with relatively few observations. That is to say, for datasets with missing values, the final model selected by the two automatic procedures may neither make good use of the survey data nor retain some important variables. To remedy the drawbacks of the automatic model selection techniques in handling data with missing values, manual adjustment was integrated with the automatic selection techniques in this study for model selection and improvement (illustrated in Figure 3.7).

The first stage of this integrated procedure was to construct the initial full model to obtain a general view of all explanatory variables’ influence on respondents’ answers to the WTP question. The second stage was to apply the best subset regression and stepwise regression techniques to the initial full model in order to find which variables were removed by the two automatic selection procedures (due to insignificance) and caused a substantial reduction of the observations (due to missing values). The third stage introduced the manual adjustment, which removed the variables identified in the second stage and constructed a new (intermediate) full model that contained fewer variables and more observations than the initial full model. Variables like income, however, were retained in this stage in spite of the resultant reduction of observations. Then the automatic stepwise regression technique was applied again to the new full model to select the new (intermediate) stepwise regression model with the smallest AIC.

variable but a constant) and adds one variable at a time until the model cannot be further improved.

The total number of subset models of a full model with n variables equals the sum of the numbers of all possible combinations of n elements, i.e. $\sum_{0 \le k \le n} \binom{n}{k} = 2^n$. This total number includes two extreme cases, i.e. the null model (with no variable but a constant) and the full model itself.

[30] The option of “No opinion” or “I don’t want to answer” was provided for the respondents in this survey, which helped to ensure the reliability of the survey data.

[31] Each observation is a data record that contains a respondent’s answers to all the questions.

[32] For example, each additional observation (respondent) with a probability of 0.5 of answering yes to the WTP question would add -2 × ln(0.5) ≈ 1.39 to the AIC, so two such observations would add about 2.77.
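The trade-off between the variables kept in a model and the observations that remain usable can be illustrated with a minimal Python sketch. The records and variable names are hypothetical, and `None` is assumed to mark an unanswered question; the sketch only counts complete observations and does not estimate any model.

```python
def complete_cases(records, variables):
    """Count the records that have no missing value (None) in any listed variable."""
    return sum(all(r[v] is not None for v in variables) for r in records)


# Hypothetical survey records; None marks an unanswered question.
records = [
    {"wtp": 1, "bid": 10, "income": 3000, "opinion": 4},
    {"wtp": 0, "bid": 20, "income": None, "opinion": 2},
    {"wtp": 1, "bid": 10, "income": 4500, "opinion": None},
    {"wtp": 0, "bid": 50, "income": 2000, "opinion": None},
    {"wtp": 1, "bid": 20, "income": None, "opinion": 5},
]

full_model = ["wtp", "bid", "income", "opinion"]
without_opinion = ["wtp", "bid", "income"]   # income retained despite its gaps
without_income = ["wtp", "bid"]              # what a purely automatic choice might keep

print(complete_cases(records, full_model))       # 1
print(complete_cases(records, without_opinion))  # 3
print(complete_cases(records, without_income))   # 5
```

Dropping a variable with many missing answers recovers observations, while retaining an important variable such as income costs some of them; this is the balance that the manual adjustment stage has to strike.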

[Figure 3.7: flowchart with boxes labelled Initial Full Model; Best Subset Regression Model; Intermediate Stepwise Regression Model; Manual Adjustment and Evaluation; Final Stepwise Regression Model]

Figure 3.7 Integrated Model Selection and Improvement Procedure. Models highlighted in red are reported in this study (Chapter 4). The models in the figure are explained in the following text.

It should be noted that the variables identified in the second stage (insignificant and with missing values) were not all removed from the initial full model at the same time to construct the new full model. This is because the removal of even one variable from the initial full model could lead to substantial differences between the new stepwise model and the initial one (i.e. the stepwise model based on the initial full model). Variables not included in the initial stepwise model might become significant and thus be retained in the new stepwise model, and vice versa. As a result, a trial-and-error method was used to remove redundant variables, which made the third stage an exhaustive and iterative selection process until the final stepwise regression model was determined to achieve a balance between the best model fit (precision), the most observations (full use of the survey data) and the fewest explanatory variables (simplicity) [33].

In addition to the AIC, two other criteria were also used in this study to evaluate numerous models in the integrated selection procedure, i.e. the overall model significance and the prediction error rate.

The overall model significance indicates whether the tested model is significantly better than the null model, i.e. the model with no explanatory variables but the intercept/constant. The Likelihood Ratio Test is generally used to evaluate a model’s overall significance (Hosmer, Lemeshow and Sturdivant 2013).

$$G = -2\ln\left(\frac{L_0}{L}\right) \qquad (3.49)$$

where L is the likelihood of the tested model, L0 is the likelihood of the null model, and G is the log-likelihood ratio statistic, which approximately follows the chi-square distribution with k degrees of freedom, where k is the number of explanatory variables in the tested model. The P-value calculated from G indicates the overall significance of the tested model.
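As a minimal sketch of how Equation 3.49 can be evaluated, the Python code below computes G from the two log-likelihoods and derives a P-value from the chi-square distribution with k degrees of freedom. The log-likelihood values are hypothetical, and the helper `chi2_sf` is an illustrative stdlib-only routine (a series expansion of the incomplete gamma function), not a function of any statistical package; its accuracy degrades for extremely small P-values.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability Pr(X > x) for a chi-square variable with df degrees of freedom.

    Uses the series gamma(a, z) = exp(-z) * z**a * sum_n z**n / (a * (a+1) * ... * (a+n))
    for the lower incomplete gamma function, with a = df / 2 and z = x / 2.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    log_cdf = a * math.log(z) - z - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_cdf))


def likelihood_ratio_test(loglik_model, loglik_null, k):
    """Return (G, P-value), where G = -2 * ln(L0 / L) = -2 * (ln L0 - ln L)."""
    g = -2.0 * (loglik_null - loglik_model)
    return g, chi2_sf(g, k)


# Hypothetical log-likelihoods of a tested model with three explanatory variables.
g, p_value = likelihood_ratio_test(loglik_model=-120.0, loglik_null=-130.0, k=3)
print(g)        # 20.0
print(p_value)  # a small value means the tested model beats the null model
```

Working with log-likelihoods rather than likelihoods avoids numerical underflow, since the likelihood of a sample of several hundred respondents is typically a very small number.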

The prediction error rate (the same test presented in a different form is called a classification table) is an intuitive criterion that indicates a model’s ability to explain/predict respondents’ answers to the WTP question (Hosmer, Lemeshow and Sturdivant 2013). After the coefficients of the tested model are estimated, each respondent’s probability of answering yes to the WTP question can be calculated with the probability function of the logit model (Equation 3.37). Setting 0.5 as the cut-off point, the respondent is predicted [34] to answer yes if the calculated probability is larger than 0.5; otherwise he/she is predicted to answer no. The prediction error rate of the tested model is then

$$\text{Prediction Error Rate} = \frac{N_{\Pr(Yes) > 0.5 \mid No} + N_{\Pr(Yes) \le 0.5 \mid Yes}}{N_{total}} \qquad (3.50)$$

where $N_{\Pr(Yes) > 0.5 \mid No}$ is the number of respondents who are predicted to answer yes (as the calculated probability is larger than 0.5) but actually answered no in the survey, $N_{\Pr(Yes) \le 0.5 \mid Yes}$ is the number of respondents who are predicted to answer no but actually answered yes, and $N_{total}$ is the total number of respondents.

[33] It should be noted that the stepwise regression itself was also an iterative selection procedure. Moreover, the best subset regression technique was not applied in the third stage because it was far too time-consuming for the exhaustive and iterative examination.

[34] Strictly speaking, this is a posterior estimation, as the “prediction” is actually made after the survey.
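Equation 3.50 can be sketched in a few lines of Python. The fitted probabilities and the answers below are hypothetical; in practice the probabilities would come from the estimated logit model (Equation 3.37). Answers are assumed to be coded 1 for yes and 0 for no.

```python
def prediction_error_rate(probabilities, answers, cutoff=0.5):
    """Share of respondents whose predicted answer differs from the actual answer.

    probabilities: fitted Pr(Yes) for each respondent.
    answers: actual answers, 1 for yes and 0 for no.
    """
    if len(probabilities) != len(answers) or not answers:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(probabilities, answers))
    # Predicted yes (probability above the cut-off) but actually answered no.
    predicted_yes_actual_no = sum(1 for p, y in pairs if p > cutoff and y == 0)
    # Predicted no (probability at or below the cut-off) but actually answered yes.
    predicted_no_actual_yes = sum(1 for p, y in pairs if p <= cutoff and y == 1)
    return (predicted_yes_actual_no + predicted_no_actual_yes) / len(answers)


# Hypothetical fitted probabilities and actual answers of six respondents.
probs = [0.9, 0.7, 0.4, 0.2, 0.6, 0.5]
actual = [1, 0, 1, 0, 1, 1]

print(prediction_error_rate(probs, actual))  # 0.5
```

A respondent whose probability equals the cut-off exactly is counted as a predicted no, in line with the "larger than 0.5" rule stated above.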

The integrated model selection procedure explained above focuses on the single bound dichotomous choice model because automatic model selection techniques for the complicated probability function (Equation 3.47) of the double bound dichotomous choice model are not yet available. As a result, this study applied the integrated model selection procedure (Figure 3.7) to determine which variables should be included in the final stepwise regression model (single bound model). The double bound model was then constructed with the same variables as the single bound model but with a more complicated probability function. The integrated model selection procedure of this study has not been reported in the literature; it could be a methodological contribution to improving model construction and selection in Contingent Valuation studies.

In document Payments for ecosystem services of the middle route of the South-to-North Water Transfer Project in China (Page 80-84)