Set-up Empirical Analysis - Selection Bias: A Machine Learning Approach

In this section the set-up of the analysis of the MEPS (2018) dataset is described. As in the simulation study, the first step is analysed first, for both TOTEXP and TOTSLF. Subsequently, two-step models are used to estimate the health expenditures TOTEXP and TOTSLF.

Similar to the simulation study in Chapter 4, the performance of the two Cosslett models (COSS and COSSNN) depends to a large extent on the ability to predict participation, or in this case, the ability to discriminate between positive and zero health expenditures. A binary prediction task is therefore performed by means of a probit and a neural network.

For the first-step estimation the following settings are used. For the probit, all variables except quadratic and cross terms are included, a total of 104 variables. For the neural network, squared and cross terms of three numeric variables are added: BMI, AGE and FAMINC. The first layer of the neural network consists of 110 nodes, the second layer of 13 nodes and the third layer of 5 nodes, all with tanh activation; the output layer has 1 node with sigmoid activation. The network is trained for 15 epochs. Every layer has an L2 kernel regularizer of 0.001. The loss function is binary cross-entropy, the optimizer is Adam (Kingma and Ba, 2014) and the metric is accuracy. All numeric variables are scaled to have mean 0 and standard deviation 1.
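The architecture described above can be sketched as a plain numpy forward pass. This is only an illustration of the layer sizes, activations, scaling and penalised loss, not the code used for the thesis: no training is performed, and the toy data, the input width of 8 and the weight initialisation are assumptions.

```python
import numpy as np

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def init_layers(n_inputs, sizes=(110, 13, 5, 1), seed=0):
    """Random weights for a 110-13-5-1 network (initialisation is an assumption)."""
    rng = np.random.default_rng(seed)
    layers, fan_in = [], n_inputs
    for fan_out in sizes:
        W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
        layers.append((W, np.zeros(fan_out)))
        fan_in = fan_out
    return layers

def forward(X, layers):
    """tanh on the three hidden layers, sigmoid on the single output node."""
    a = X
    for W, b in layers[:-1]:
        a = np.tanh(a @ W + b)
    W, b = layers[-1]
    return 1.0 / (1.0 + np.exp(-(a @ W + b)))

def loss(y, p, layers, l2=0.001, eps=1e-12):
    """Binary cross-entropy plus an L2 penalty of 0.001 on every weight matrix."""
    bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    penalty = l2 * sum(np.sum(W ** 2) for W, _ in layers)
    return bce + penalty

rng = np.random.default_rng(1)
X = standardize(rng.normal(5.0, 2.0, size=(200, 8)))  # 8 toy inputs (hypothetical)
y = (rng.random(200) < 0.8).astype(float)             # toy imbalanced labels
layers = init_layers(X.shape[1])
p = forward(X, layers).ravel()                        # predicted probabilities
print(p.shape, loss(y, p, layers))
```

In practice the weights would be fitted with the Adam optimizer over 15 epochs, as stated in the text; the sketch only fixes the shape of the model.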

For the second-step estimation we compare OLS, OLSH, COSS, COSSNN and 2NN on AIC. See the previous chapter for explanations of these methods. For COSS and COSSNN 25 dummies are used: for COSS the dummies are based on the probit of the first-step estimation, for COSSNN on the neural network of the first-step estimation. The neural network of 2NN has the same number of layers and nodes per layer as the neural network of COSSNN, with tanh activation for all layers, 15 epochs and an L2 regularizer of 0.001. Its loss function is mean squared error, the optimizer is stochastic gradient descent and the metric is mean absolute error. To obtain asymptotically correct standard errors, a bootstrap with R = 100 replications is performed. Although R = 500 or R = 1000 is more commonly used when bootstrapping, this would be computationally infeasible for this dataset. For both neural networks a validation split of 0.2 and a batch size of 130 are used.
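As a minimal sketch of how the dummies for COSS and COSSNN could be built, the code below splits a first-step index (probit index or neural network output) into M = 25 bins and regresses the outcome on the regressors plus the bin dummies. The use of equally populated quantile bins, the function names and the toy data are assumptions for illustration; the binning actually used is described in the previous chapter.

```python
import numpy as np

def cosslett_dummies(index, M=25):
    """Split the first-step index into M quantile bins; return an (n, M) 0/1 matrix."""
    index = np.asarray(index, dtype=float)
    # M - 1 interior cut points, so that bins are (assumed) equally populated.
    cuts = np.quantile(index, np.linspace(0, 1, M + 1)[1:-1])
    bin_id = np.searchsorted(cuts, index, side="right")  # values in 0..M-1
    D = np.zeros((index.size, M))
    D[np.arange(index.size), bin_id] = 1.0
    return D

def second_step_ols(y, X, D):
    """OLS of y on X and the dummies; the full set of dummies absorbs the intercept."""
    Z = np.hstack([X, D])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef[: X.shape[1]], coef[X.shape[1]:]

# Toy data: the selection effect is a nonlinear function of the index.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
index = rng.normal(size=n)
y = X @ np.array([1.0, -0.5]) + np.sin(index) + rng.normal(scale=0.1, size=n)

D = cosslett_dummies(index)
beta, bin_effects = second_step_ols(y, X, D)
print(beta)
```

The dummies act as a step-function approximation of the unknown selection correction, which is why no distributional assumption on the error terms is needed in the second step.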

Results of model performance are presented in Table 5.2. Estimated parameter values are in Appendix B. Caution should be taken when interpreting the estimated model parameters. Normally, taking the log would change the interpretation of the estimated parameters to percentage changes, but this is not valid when truncation is present. For the Tobit-2 model:

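$$
\frac{\partial E(y_2 \mid x, y_1^* > 0)}{\partial x} = \gamma_2 - \sigma_{12}\,\lambda(x'\gamma_1)\bigl(x'\gamma_1 + \lambda(x'\gamma_1)\bigr), \tag{4.1}
$$

for $x$ the union of $x_1$ and $x_2$, where $x_1'\beta_1$ is rewritten as $x'\gamma_1$ and $x_2'\beta_2$ as $x_2'\gamma_2$. See Cameron and Trivedi (2005).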
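The factor λ(z)(z + λ(z)) in (4.1) is minus the derivative of the inverse Mills ratio λ(z) = φ(z)/Φ(z), and it always lies between 0 and 1. The short sketch below checks this numerically with the standard library only; the function names are illustrative.

```python
import math

def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills(z):
    """lambda(z) = phi(z) / Phi(z)."""
    return norm_pdf(z) / norm_cdf(z)

def truncation_factor(z):
    """lambda(z) * (z + lambda(z)), the factor multiplying sigma_12 in (4.1)."""
    lam = inverse_mills(z)
    return lam * (z + lam)

def numeric_derivative(f, z, h=1e-5):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-1.0, 0.0, 0.5, 2.0):
    # The two printed numbers agree up to finite-difference error.
    print(z, truncation_factor(z), -numeric_derivative(inverse_mills, z))
```

Because the factor depends on x through x'γ1, the marginal effect differs across observations, which is why the estimated parameters cannot be read directly as percentage changes.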

Results

Health expenditures are predicted from the MEPS (2018) dataset, using the models from Chapter 3. In the first section we present the results of the probit and the neural network for the binary prediction of positive total or out-of-pocket health expenditures (TOTEXP and TOTSLF). In the second section we analyse the second step and compare model performance. Estimated parameters are presented in Appendix B.

5.1 First-Step Estimation

The probit and the neural network are estimated for the binary prediction task of the decision equation. The ROC curve is shown in Figure 5.1 for total health expenditure (TOTEXP) and in Figure 5.2 for out-of-pocket health expenditure (TOTSLF). The AUC of the probit is 0.843 for TOTEXP and 0.845 for TOTSLF. The AUC of the neural network is 0.853 for TOTEXP and 0.856 for TOTSLF. So the neural network predicts positive health expenditures more accurately, for both TOTEXP and TOTSLF.
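For reference, the AUC equals the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen zero observation. A minimal sketch of this rank-based computation, with ties counted as one half, is given below; the function name and the toy inputs are illustrative.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank within the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    # Mann-Whitney U statistic of the positives, scaled to [0, 1].
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of the 4 pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported values of about 0.85 indicate good but imperfect discrimination.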

Since the dataset is imbalanced, it is interesting to analyse whether the neural network better predicts the majority class, which is positive health expenditures, or the minority class, zero health expenditures. Table 5.1 presents a confusion matrix, which cross-tabulates predicted against observed classes and so shows the correct and false predictions for each class. This gives more information on whether the probit and the neural network better predict the majority and/or the minority class. For the confusion matrix, the class probability predictions of the probit and the neural network need to be converted to class predictions. One way is to set a threshold of, for instance, 0.5, so that all predicted class probabilities above 0.5 result in predicted class 1 and those below 0.5 in predicted class 0. This, however, does not take into account model expectations and can lead to 'naive' predictions. An optimal threshold can be found as follows (Muchlinski et al., 2016): denote S0 as sensitivity and S1 as specificity, then maximize S0(t) + S1(t) over thresholds t. Subsequently, the class probability predictions were converted to class predictions; Table 5.1 presents the results for TOTEXP and TOTSLF. For both TOTEXP and TOTSLF the difference in percentage correct between probit and NN was largest for class 0, which is the minority class. So the probit was worse than the NN in predicting the minority class, resulting in a lower AUC.
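The threshold search can be sketched as follows. The convention that class 1 is predicted when the probability is at least t, the function name and the toy data are assumptions; the quadratic-time loop is kept for readability rather than speed.

```python
def best_threshold(labels, probs):
    """Return (t, sensitivity, specificity) maximizing sensitivity + specificity."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(probs)):
        # Predict class 1 when the predicted probability is at least t.
        tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < t)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (t, sens, spec)
    return best

labels = [0, 0, 0, 1, 1, 1, 1, 1]
probs = [0.2, 0.6, 0.7, 0.65, 0.8, 0.9, 0.75, 0.55]
t, sens, spec = best_threshold(labels, probs)
print(t, sens, spec)  # threshold 0.75: 3 of 5 positives and all 3 zeros are classified correctly
```

With imbalanced classes the maximizing threshold will generally differ from 0.5, because sensitivity and specificity are each computed within their own class and therefore weight both classes equally.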

For the probits, McFadden's R2 (Heij et al., 2004) can be calculated to get an approximation of the SNR. It was 27.2% for TOTEXP and 29.0% for TOTSLF, which, as noted in Section 3.2.1, is similar to the R2 reported in previous health expenditure literature. This corresponds to an SNR of approximately 0.4, for which we found in Chapter 3 that the neural network should perform more accurately than the probit if nonlinearities are present.
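McFadden's R2 compares the log-likelihood of the fitted probit with that of an intercept-only model. The sketch below also maps R2 to an SNR using the ratio R2 / (1 − R2); that mapping is an assumption made here for illustration (the definition used in this thesis is given in an earlier chapter), but it reproduces the value of approximately 0.4 quoted above.

```python
def mcfadden_r2(loglik_model, loglik_null):
    """McFadden's pseudo R^2 = 1 - lnL(model) / lnL(intercept-only model)."""
    return 1.0 - loglik_model / loglik_null

def snr_from_r2(r2):
    """SNR implied by R^2 under the assumed mapping R^2 / (1 - R^2)."""
    return r2 / (1.0 - r2)

# Hypothetical log-likelihoods, chosen only so that the result equals 0.272.
print(mcfadden_r2(-72.8, -100.0))

for name, r2 in (("TOTEXP", 0.272), ("TOTSLF", 0.290)):
    print(name, round(snr_from_r2(r2), 3))  # 0.374 and 0.408
```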

A subset of the variables is taken to examine the sensitivity of the model and to check for nonlinearity. We consider using only the numeric variables as input. First, nonlinear terms are included, so squared and cross terms of the numeric variables are used as input. This leads to an AUC of 0.687 for the probit and 0.690 for the neural network when estimating TOTEXP. For TOTSLF the AUC is 0.750 for the probit and 0.754 for the neural network. Afterwards, only linear terms of the numeric variables are considered as input for the probit. The AUC changes considerably: 0.582 for TOTEXP and 0.738 for TOTSLF. Since including nonlinear terms changes the AUC of TOTEXP more than the AUC of TOTSLF, we conclude that the relationship between TOTEXP and the numeric regressors is more nonlinear than that between TOTSLF and the numeric regressors.

In machine learning, it is customary to use a training and a test set to prevent overfitting (Section 2.2; Bishop, 2006). So far all observations in the training set were also in the test set, and we assumed that the validation split and the L2 regularizer sufficiently prevent overfitting. To test the validity of that assumption, the analyses are repeated with a random split of the whole dataset into a training set (80%) and a test set (20%). For TOTEXP, the AUC is 0.495 for the probit and 0.690 for the neural network. For TOTSLF, the AUC is 0.503 for the probit and 0.754 for the neural network. Since the AUC of the neural network does not change when the dataset is split into a training and a test set, we conclude that the validation split and the L2 regularizer are sufficient to prevent overfitting. The AUC did change for the probit, however, so the probit might be overfitting on this dataset.

                                 Reference
                  Prediction        0         1    Percentage correct
TOTEXP   Probit   0             29505     36619    44.62
                  1              8587    106160    92.52
         NN       0             29754     35100    45.88
                  1              8338    107679    92.81
TOTSLF   Probit   0             31346     58698    34.81
                  1              6746     84081    92.57
         NN       0             31191     53600    36.79
                  1              6901     89179    92.82

Table 5.1: Confusion matrix of the first-step estimation with probit and neural network for TOTEXP and TOTSLF
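The percentage-correct column of Table 5.1 can be reproduced from the counts: for each predicted class it is the share of those predictions that match the reference class. The sketch below recomputes it; the dictionary layout and function name are illustrative, the counts are copied from the table.

```python
# Counts from Table 5.1, ordered as
# (predicted 0 & reference 0, predicted 0 & reference 1,
#  predicted 1 & reference 0, predicted 1 & reference 1).
counts = {
    ("TOTEXP", "Probit"): (29505, 36619, 8587, 106160),
    ("TOTEXP", "NN"): (29754, 35100, 8338, 107679),
    ("TOTSLF", "Probit"): (31346, 58698, 6746, 84081),
    ("TOTSLF", "NN"): (31191, 53600, 6901, 89179),
}

def percentage_correct(c00, c01, c10, c11):
    """Share of correct predictions within predicted class 0 and predicted class 1."""
    return 100.0 * c00 / (c00 + c01), 100.0 * c11 / (c10 + c11)

for key, c in counts.items():
    pc0, pc1 = percentage_correct(*c)
    print(key, round(pc0, 2), round(pc1, 2))
```

For TOTEXP, for example, the gap between NN and probit is 45.88 − 44.62 = 1.26 percentage points for class 0 against 92.81 − 92.52 = 0.29 for class 1, which is the comparison made in the text.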

Figure 5.1: ROC-curve prediction positive TOTEXP by probit and neural network


5.2 Second-Step Estimation

Descriptive results of the models are in Table 5.2. Estimated parameter values are in Appendix B. As noted in the previous chapter, caution should be taken in interpreting the estimated model parameters, since truncation influences the marginal effects (Cameron and Trivedi, 2005). The estimated correlation ρ̂ between the decision and outcome equation is 0.9. Multiplying ρ̂ by the standard deviation of the residuals of the outcome equation of Heckman's model gives an approximation of σ12 of 0.715, which is not close to zero. Combined, this indicates that selection bias was present in this dataset. The Jarque-Bera test, which tests whether the residuals of OLS are independently and identically normally distributed, rejects the null hypothesis (p < .001). This might be due to the kurtosis, which is 0.78. The VIF of the regressors was around 1.5 for TOTEXP and 1.4 for TOTSLF with the OLS method, indicating no problems with multicollinearity.

Surprisingly, the AIC is lowest for 2NN (Table 5.2). All analysed methods report a lower AIC than OLS, indicating an improvement over the OLS model. This was confirmed with an LR test, a joint significance test of the extra parameters. For both TOTEXP and TOTSLF all analysed methods had p < 0.001 for the LR test against OLS, rejecting the hypothesis that the extra parameters are jointly nonsignificant.

Changing the hyperparameters of the NN in both models does not change the AIC: using tanh for all activation functions, or sigmoid for some in combination with tanh, does not change the results for either COSSNN or 2NN. Using ReLU as activation function in one or several hidden layers leads to a lower AIC for TOTEXP. Changing the number of dummies to M = 15 changes the AIC of COSS to 503766 and of COSSNN to 503095. With M = 40 dummies the AIC of COSS changes to 503757 and that of COSSNN to 503090. These are relatively small changes, so the results are relatively insensitive to the number of dummies used.

Heckman's two-step estimator can perform poorly if the inverse Mills ratio term is highly correlated with the other regressors (Leung and Yu, 1996). The condition number is an indication of the correlation between the inverse Mills ratio and the regressors (Cameron and Trivedi, 2005). In this case the condition number of 0.00022 does not change when including or excluding the inverse Mills ratio, meaning that the correlation between the regressors and the inverse Mills ratio is high. This might indicate bad performance of Heckman's two-step estimator.

Some analyses on the exclusion of variables are conducted. Since we have many regressors, a simpler model with fewer regressors might perform better. Due to computation time, the analyses are only performed on TOTEXP. Since there are so many variables, automated variable selection is chosen over manually selecting variables. Two automated variable selection methods are studied. First, a method based on stepwise VIF selection is performed (Lin et al., 2011). The VIF values of all explanatory variables are calculated, the variable with the highest value is removed, and this is repeated until all VIF values are below the threshold of 10. There are 82 remaining variables. The resulting AICs are presented in Table 5.2. The AIC is higher with the restricted set of explanatory variables, meaning that fewer explanatory variables do not lead to better models.
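The stepwise VIF procedure described above can be sketched as follows. This is a minimal illustration of the loop as described in the text (compute all VIFs, drop the largest, repeat until every VIF is below 10), not the implementation used for the analyses; the function names and the toy data are assumptions.

```python
import numpy as np

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def stepwise_vif(X, names, threshold=10.0):
    """Drop the column with the highest VIF until all VIFs are below the threshold."""
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names

# Toy data: c is almost exactly a + b, so it is nearly collinear with a and b.
rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 500))
c = a + b + 0.05 * rng.normal(size=500)
X = np.column_stack([a, b, c, d])
print(stepwise_vif(X, ["a", "b", "c", "d"]))
```

In the toy example the nearly redundant column c has the highest VIF and is removed first, after which the remaining columns are close to uncorrelated and the loop stops.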

1 variable (Zar, 1996). Although this approach is not without controversy, since there is a risk of data dredging, the more reliable method of best subset regression (Zhang, 2016) would be computationally infeasible given the number of regressors. A 5% significance level is chosen as the threshold for including variables in the model. Results are presented in Table 5.2. The AICs of all methods are slightly lower than the original ones, although the difference is only a small percentage of the original AIC. This nevertheless indicates that the models can be slightly improved by decreasing the number of parameters.

                                   OLS        OLSH        COSS      COSSNN         2NN
TOTEXP
  AIC                           504771      504723      503765      503093      501864
  R2                             0.356       0.356       0.361       0.364       0.369
  F-test statistic (df)       758(104)    751(105)    628(128)    637(128)    794(105)
  LR-test statistic (df)             -       50(1)    1054(24)    1726(24)     2909(1)
  VIF                             1.55        1.55        1.56        1.57        1.58
  Step VIF                      505067      505001      504039      503349      502135
  Step AIC                      504737      504689      503733      503051      501829
TOTSLF
  AIC                           410745      410601      410322      409527      407952
  R2                             0.279       0.280       0.282       0.287       0.297
  F-test statistic (df)       416(104)    413(105)    343(128)    351(128)    449(105)
  LR-test statistic (df)             -      146(1)     471(24)    1266(24)     2795(1)
  VIF                             1.39        1.39        1.39        1.40        1.42

Table 5.2: Results of the second-step estimation of TOTEXP and TOTSLF with different (semiparametric) methods.

Note: AIC is the AIC of the model and R2 its explained variance. The F-test statistic tests whether all parameters jointly differ from 0. The LR test tests whether the model differs from OLS. All F-test and LR-test statistics are significant at the 0.001 level. VIF is a measure of multicollinearity in the models. Step VIF is stepwise deletion of variables until all VIF values are lower than 10, and Step AIC is stepwise deletion of variables to lower the AIC. Analyses are performed for both TOTEXP and TOTSLF, except for Step VIF and Step AIC, which are performed for TOTEXP only due to computation time.
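The AIC and LR columns of Table 5.2 are linked. With AIC = 2k − 2 ln L and LR = 2(ln L1 − ln L0), the AIC of an extended model equals the AIC of OLS minus the LR statistic plus twice the number of extra parameters. The sketch below applies this identity to the table, assuming that the degrees of freedom reported in the LR row equal the number of extra parameters; the variable and function names are illustrative.

```python
# AIC of OLS per outcome, and (outcome, method, AIC, LR statistic, LR df)
# for the other methods, copied from Table 5.2.
aic_ols = {"TOTEXP": 504771, "TOTSLF": 410745}
rows = [
    ("TOTEXP", "OLSH", 504723, 50, 1),
    ("TOTEXP", "COSS", 503765, 1054, 24),
    ("TOTEXP", "COSSNN", 503093, 1726, 24),
    ("TOTEXP", "2NN", 501864, 2909, 1),
    ("TOTSLF", "OLSH", 410601, 146, 1),
    ("TOTSLF", "COSS", 410322, 471, 24),
    ("TOTSLF", "COSSNN", 409527, 1266, 24),
    ("TOTSLF", "2NN", 407952, 2795, 1),
]

def implied_aic(aic_restricted, lr, extra_params):
    """AIC of the extended model implied by the LR statistic against the restricted model."""
    return aic_restricted - lr + 2 * extra_params

for outcome, method, aic, lr, df in rows:
    print(outcome, method, aic, implied_aic(aic_ols[outcome], lr, df))
```

Under that assumption the implied and reported AIC values coincide for every row, for example 504771 − 50 + 2 = 504723 for OLSH on TOTEXP.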

Chapter 6

Conclusion

This thesis investigates new ways to incorporate machine learning models into existing econometric models. We studied sample selection models and introduced two new techniques based on machine learning. We found that these techniques contribute significantly to the two-step method of Heckman, by increasing accuracy in the first step and giving less biased estimates in the second step compared to OLS and OLSH. However, the simulation and the empirical analyses give contradicting results.

The first method (COSSNN) was based on a neural network, which might predict participation in the Tobit Type II model better than a probit. Adjusting the semiparametric estimation method of Cosslett (COSS) accordingly might then give better results. The second method (2NN) was based on the residuals of performing OLS on the outcome equation. These residuals bias the conditional expectation of the outcome equation; fitting a neural network on the residuals and adding its output to the decision equation could lower the bias.

In this thesis, we first conducted a simulation study. In the first step of Heckman's two-step estimator, we found that the neural network predicts participation better when the SNR is larger than 0.5 and the probit model is misspecified because squared terms are added to the data generating process. Changing the distribution of the error terms did not have a big impact on the performance of either model, probably due to the relatively high SNR of 1.

Considering both steps, a few conditions might be important. Besides the distributional assumptions, SNR and nonlinearity seem to affect the results. Bias is indeed present in the estimates of OLS and OLSH when changing the distributional assumptions, while the semiparametric estimation techniques COSS and COSSNN seem to be relatively robust to relaxing these assumptions. SNR and nonlinearity do not seem to give biased estimates when applying COSS or COSSNN. 2NN seems biased for all analysed conditions, including SNR and nonlinearity. However, 2NN becomes unbiased when the regressors in the decision and outcome equation are unique. Furthermore, differences in AUC in the first step between COSS and COSSNN do not result in different estimates in the second step. This might be because AUC is a bad predictor of second-step performance.

Subsequently, analyses were performed on an empirical dataset. The MEPS (2018) dataset is a panel survey conducted in the United States of America, which contains many health variables and health expenditures. We predicted positive health expenditures in total and those paid out-of-pocket (TOTEXP and TOTSLF). Although the share of zero health expenditures differs between TOTEXP and TOTSLF, the relative performance of the models was the same when estimating TOTEXP and TOTSLF. Selection bias is present and the model with the lowest AIC is 2NN, followed by COSSNN. Although this might indicate that 2NN performs well on the MEPS dataset, caution should be taken, since 2NN seems biased in the simulation study. Taking subsets of regressors does not change the relative performance of the models. Furthermore, additional analyses showed a high correlation between the regressors and the inverse Mills ratio, which indicates bad performance of OLSH.

One limitation of this study is that we did not address endogeneity. For example, people who have positive health expenditures might insure themselves, but at the same time, people who are insured might have positive health expenditures. This results in endogeneity through simultaneity. A typical way to model this kind of endogeneity in sample selection problems is given by Shen (2013). Further research should indicate whether endogeneity affects model performance differently for the models analysed in this thesis.

In addition, a large number of dummies is present in the MEPS (2018) dataset. This results in sparse matrices, so principal component analysis or gradient boosting might be interesting additions to the methods described in this thesis. Also, this thesis is explorative in nature: hypotheses about new models were formed and then studied in simulation and empirically. It would be interesting to analyse the theoretical consequences of adding neural network output to a regression. Although not within the scope of this thesis, asymptotic distributions of the new methods can be calculated. The results of this thesis indicate that the new methods might be theoretically unbiased, or even consistent.

Since the validity of the models and tests is explored in this study, but not deductively established, it is unclear which of the contradicting results is valid. Although COSS and COSSNN are related and COSSNN performs better in predicting the first step, that does not result in second-step improvement in the simulation. For the empirical dataset COSSNN does seem to perform better than COSS, so it might be that COSSNN is an improvement over COSS. But, as in the 2NN case, the discrepancy between performance in the simulation and in the empirical analysis should be analysed in further research.

In conclusion, the results show that econometric models might be improved with machine learning techniques. Although the methods introduced in this thesis might not straightforwardly outperform older methods, they demonstrate new ways of combining techniques from both fields. Hopefully, this might inspire econometricians and computer scientists to explore new combinations of techniques and methods from econometrics and machine learning, so that existing methods can become more accurate and efficient.

Bibliography

Abraham, A. (2005). Artificial neural networks. Handbook of Measuring System Design.

Amemiya, T. (1985). Advanced Econometrics. Harvard University Press.

Basu, A. and Manning, W. G. (2009). Issues for the next generation of health care cost analyses. Medical Care, pages 109–114.

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag, Berlin.

Cameron, A. C. and Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press, Cambridge.

Cosslett, S. R. (1991). Semiparametric estimation of a regression model with sample selectivity. Nonparametric and Semiparametric Methods in Econometrics and Statistics, pages 175–97.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Dai, W., Jin, O., Xue, G.-R., Yang, Q., and Yu, Y. (2009). Eigentransfer: a unified framework for transfer learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 193–200. ACM.

Einav, L., Jenkins, M., and Levin, J. (2013). The impact of credit scoring on consumer lending. The RAND Journal of Economics, 44(2):249–274.

Elkan, C. (2001). The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973–978.

Gallant, A. R. and Nychka, D. W. (1987). Semi-nonparametric maximum likelihood estimation. Econometrica: Journal of the Econometric Society, 55(2):363–390.

Hall, B. (2002). Notes on sample selection models. Technical report, Mimeo.

Hanley, J. A. and McNeil, B. J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148(3):839–843.

Hansen, B. E. (1994). Autoregressive conditional density estimation. International Economic Review, pages 705–730.

Heckman, J. J. (1977). Sample selection bias as a specification error (with an application to the estimation of labor supply functions). Econometrica: Journal of the Econometric Society, 47:153–161.

Heij, C., de Boer, P., Franses, P. H., Kloek, T., van Dijk, H. K., et al. (2004). Econometric methods with applications in business and economics. Oxford University Press.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

Hussinger, K. (2008). R&D and subsidies at the firm level: An application of parametric and semiparametric two-step selection models. Journal of Applied Econometrics, 23(6):729–747.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Leung, S. F. and Yu, S. (1996). On the choice between sample selection and two-part models. Journal of Econometrics, 72(1-2):197–229.

Lin, D., Foster, D., and Ungar, L. (2011). VIF regression: a fast regression algorithm for large data. Journal of the American Statistical Association, 106(493):232–247.
