This section presents a simulation study of MICE in DEA using a real data set. Although this data set has incomplete cases, it is beneficial to work with a complete data set in order to investigate the proposed methodology and its accuracy for DEA results. The data are taken with permission from the Trauma Audit Research Network (TARN) database, which is maintained by The University of Manchester. The data set provided for analysis contains information on sixty-six hospitals with ten characteristics, comprising four inputs and six outputs. Table 4.2 below lists these input and output variables. A data set of this kind, which contains no missing values, offers the possibility of obtaining true efficiency scores for the sample.
To replace some observed cases with simulated missing data, the following procedure was used. Individual observations comprising 1%, 5%, 10% and 20% of the complete data set were chosen at random and removed. These four versions of the data enable the robustness and sensitivity of the MICE approach to be examined. In addition, Aksezer and Benneyan (2010) stated that “experience showed that when the rate of missing data is more than 10%, it is almost impossible to carry out DEA”, a claim that is assessed in this investigation.
For consistency with the MAR assumption, all inputs and outputs are placed in a single pool for selection. Consequently, no preference is given to any specific input or output, and neither inputs nor outputs take precedence over the other. After the different levels of missing data have been applied, MICE is conducted for each problem with different numbers of imputations, in order to investigate sensitivity to this factor. For generality, the scenarios considered in the current research are 5, 10 and 20 repeated imputations.
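The removal step described above can be sketched in a few lines. This is a minimal illustration, not the procedure actually used in the study: it assumes that "individual observations" means single cells of the 66 x 10 data matrix, that the cell count is rounded to the nearest integer, and that cells are drawn uniformly from one pool covering all inputs and outputs. The function name `mask_cells` and the synthetic placeholder data are hypothetical.

```python
import random


def mask_cells(data, fraction, seed=0):
    """Return a copy of `data` with about `fraction` of its cells set to None.

    Cells are drawn uniformly from one pool covering every input and output
    column, so no variable is favoured over another.
    """
    n_rows, n_cols = len(data), len(data[0])
    n_cells = n_rows * n_cols
    n_missing = round(fraction * n_cells)  # rounding rule is an assumption
    rng = random.Random(seed)
    chosen = rng.sample(range(n_cells), n_missing)
    masked = [row[:] for row in data]  # leave the complete data untouched
    for cell in chosen:
        masked[cell // n_cols][cell % n_cols] = None
    return masked


# Placeholder data with the same shape as the study: 66 hospitals x 10 variables.
# These are synthetic values, not the TARN data.
data_rng = random.Random(42)
complete = [[data_rng.uniform(1.0, 100.0) for _ in range(10)] for _ in range(66)]

for fraction in (0.01, 0.05, 0.10, 0.20):
    masked = mask_cells(complete, fraction, seed=1)
    n_missing = sum(value is None for row in masked for value in row)
    print(f"{fraction:.0%} missing -> {n_missing} of 660 cells removed")
```

Under these assumptions the four levels correspond to 7, 33, 66 and 132 removed cells. Each masked copy would then be passed to the imputation routine, while `complete` is kept aside to compute the true efficiencies.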
| Inputs | Outputs |
| --- | --- |
| Average number of doctors seen per patient per year (X1) | Percentage of patients with minor injuries who recovered satisfactorily per year (Y1) |
| Average number of consultants seen per patient per year (X2) | Percentage of patients with moderate injuries who recovered satisfactorily per year (Y2) |
| Average number of nurses seen per patient per year (X3) | Percentage of patients with severe injuries who recovered satisfactorily per year (Y3) |
| Total cost (£) per patient per year (X4) | Average of the total period (days) of stay per patient per year (Y4) |
| | Average number of surgical operations per patient per year (Y5) |
| | Average number of treatments provided by emergency services per patient per year (Y6) |

Table 4.7: List of inputs and outputs
In order to evaluate the effectiveness of the methodology, input-oriented VRS-DEA is first solved with the complete data, and the analysis is then repeated for the different missing data sets under all the scenarios described above. Subsequently, the efficiency scores (estimated efficiencies) are gathered for all cases and compared with those obtained from the complete set (true efficiencies). Different methods have been used in the literature to make such comparisons. Aksezer and Benneyan (2010) used linear regression to compare the true values with efficiencies estimated from multiple imputation under the MVN assumption. We argue, however, that regression is not well suited to comparisons of this nature, because both the complete and partial approaches contain errors, which violates the usual assumption for linear regression that the independent variable should be error free. That assumption is needed in order to generate unbiased estimates using this regression approach.
In general, it is common in the DEA literature to use correlation and rank correlation as comparison measures for different purposes. We also argue that this method is not well suited to such comparisons. It is tempting to conclude that, when the results of two techniques are highly correlated, the results are in consistent agreement. However, high correlation values do not imply that agreement exists between the two methods: the correlation coefficient measures the strength of the relationship, and it would be erroneous to conclude that high correlation corresponds to a high level of agreement. The explanation for this perhaps surprising result is that two methods are in agreement when their scatter lies along the line of equality, whereas high correlation is achieved when the scatter lies along any straight line, which need not pass through the origin. An offset (intercept bias) does not alter the value of the correlation coefficient in any way.
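The point about offsets can be checked with a small numerical example. The sketch below uses hypothetical efficiency scores (not values from the study) in which every estimate lies exactly 10 percentage points below the true score. The helper `pearson` is written out by hand so that the example needs only the standard library.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical efficiencies in percent; these are not results from the study.
true_eff = [60.0, 70.0, 80.0, 90.0, 100.0]
shifted_eff = [e - 10.0 for e in true_eff]  # constant offset of 10 points

r = pearson(true_eff, shifted_eff)
mean_gap = sum(abs(a - b) for a, b in zip(true_eff, shifted_eff)) / len(true_eff)
print(f"correlation = {r:.3f}, mean absolute gap = {mean_gap:.1f}")
# prints: correlation = 1.000, mean absolute gap = 10.0
```

The two sets of scores are perfectly correlated even though they disagree by 10 points for every unit, which is why an error-based measure is more informative about agreement than a correlation coefficient.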
Therefore, we use mean absolute error (MAE) and root mean square error (RMSE) to compare the estimated efficiency with the true efficiency in all cases. Both measures are specified below in their usual form, with efficiencies measured as percentages rather than proportions.
The MAE specification is:

$$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \left| \hat{e}_n - e_n \right|$$

In this equation, $\hat{e}_n$ is the estimated efficiency of hospital $n$, $e_n$ is the true efficiency of hospital $n$ and $N$ is the number of hospitals. Calculating MAE is relatively straightforward: the absolute values of the errors are summed to give the ‘total error’, which is then divided by the number of DMUs.

The RMSE specification is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( \hat{e}_n - e_n \right)^2} \qquad (4.4)$$

Similarly to MAE, this measure is straightforward to calculate. Firstly, the differences between the estimated and true efficiencies are evaluated and squared. Secondly, these squared errors are summed and the total is divided by the number of DMUs. Finally, the square root is taken.
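The two calculations just described translate directly into code. The following is a minimal sketch using hypothetical efficiency scores for four DMUs; the function names `mae` and `rmse` and the example values are illustrative and are not taken from the study.

```python
import math


def mae(estimated, true):
    """Mean absolute error: sum the absolute errors, divide by the number of DMUs."""
    if len(estimated) != len(true):
        raise ValueError("both sequences must have one score per DMU")
    return sum(abs(e_hat - e) for e_hat, e in zip(estimated, true)) / len(true)


def rmse(estimated, true):
    """Root mean square error: square the errors, average them, take the root."""
    if len(estimated) != len(true):
        raise ValueError("both sequences must have one score per DMU")
    mean_sq = sum((e_hat - e) ** 2 for e_hat, e in zip(estimated, true)) / len(true)
    return math.sqrt(mean_sq)


# Hypothetical efficiencies in percent for four DMUs.
true_eff = [100.0, 85.0, 92.0, 70.0]
estimated_eff = [100.0, 88.0, 90.0, 71.0]

print(f"MAE  = {mae(estimated_eff, true_eff):.2f}")   # errors 0, 3, 2, 1 -> 1.50
print(f"RMSE = {rmse(estimated_eff, true_eff):.2f}")  # sqrt(14 / 4) -> 1.87
```

RMSE is never smaller than MAE for the same errors, and it gives more weight to large individual errors because the errors are squared before averaging.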
| Scenario | Description | MAE (%) |
| --- | --- | --- |
| 5 Imp of 1% | 5 imputations of 1% missing | 0.097 |
| 10 Imp of 1% | 10 imputations of 1% missing | 0.129 |
| 20 Imp of 1% | 20 imputations of 1% missing | 0.194 |
| 5 Imp of 5% | 5 imputations of 5% missing | 0.794 |
| 10 Imp of 5% | 10 imputations of 5% missing | 0.782 |
| 20 Imp of 5% | 20 imputations of 5% missing | 0.745 |
| 5 Imp of 10% | 5 imputations of 10% missing | 1.305 |
| 10 Imp of 10% | 10 imputations of 10% missing | 1.257 |
| 20 Imp of 10% | 20 imputations of 10% missing | 1.325 |
| 5 Imp of 20% | 5 imputations of 20% missing | 2.013 |
| 10 Imp of 20% | 10 imputations of 20% missing | 2.005 |
| 20 Imp of 20% | 20 imputations of 20% missing | 2.013 |

Table 4.8: MICE scenarios and MAE
Table 4.3 shows the different scenarios and the resulting MAEs. The same percentage of missing data produces relatively similar MAEs regardless of the number of imputations. For example, for 5% missing data, there is little difference among the results for 5, 10 and 20 imputations (0.794, 0.782 and 0.745 respectively). However, differing percentages of missing data do lead to different MAEs, although the values are still very small, given that MAE is expressed as a percentage on the scale 0 to 100. Figure 4.2 shows how MAE changes with the number of imputations and the percentage of missing data. MAE increases monotonically with the percentage of missing data: the higher the percentage of missing data, the higher the MAE.
Figure 4.2: MICE scenarios and MAE
Similarly, Table 4.4 shows the different scenarios and the resulting RMSE values. Again, the same percentage of missing data leads to relatively similar RMSE values; there are differences among them, but they are not large. For instance, for 5% missing data, the RMSE for 5 imputed datasets is 3.72, whereas for 10 and 20 imputed datasets the RMSEs are 3.83 and 3.74 respectively.
| Scenario | Description | RMSE (%) |
| --- | --- | --- |
| 5 Imp of 1% | 5 imputations of 1% missing | 0.553173 |
| 10 Imp of 1% | 10 imputations of 1% missing | 0.657267 |
| 20 Imp of 1% | 20 imputations of 1% missing | 1.111306 |
| 5 Imp of 5% | 5 imputations of 5% missing | 3.724245 |
| 10 Imp of 5% | 10 imputations of 5% missing | 3.825572 |
| 20 Imp of 5% | 20 imputations of 5% missing | 3.744997 |
| 5 Imp of 10% | 5 imputations of 10% missing | 5.473299 |
| 10 Imp of 10% | 10 imputations of 10% missing | 5.332542 |
| 20 Imp of 10% | 20 imputations of 10% missing | 5.323721 |
| 5 Imp of 20% | 5 imputations of 20% missing | 6.012238 |
| 10 Imp of 20% | 10 imputations of 20% missing | 6.010408 |
| 20 Imp of 20% | 20 imputations of 20% missing | 6.03233 |

Table 4.9: MICE scenarios and RMSE
By contrast, different percentages of missing data lead to clearly different RMSE values. Figure 4.3 shows how RMSE changes with the number of imputations and the percentage of missing data. As with the results for MAE in Figure 4.2, RMSE increases monotonically as the amount of missing data increases.
Figure 4.3: MICE scenarios and RMSE
For further comparison, the Maximum Absolute Error (MAX-AE) is calculated, but only for 5 imputations at each level of missing data. The same 1%, 5%, 10% and 20% missing levels are generated and nested from the complete data set, and MICE with 5 imputations is subsequently applied. The estimated efficiency scores are then compared with the true values by calculating MAX-AE, as shown in Table 4.5. This table shows an approximately five-fold increase in MAX-AE between the 1% and the 20% missing scenarios (from 4.21 to 20.61). Figure 4.4 similarly shows that MAX-AE increases monotonically as the level of missing data increases.
| Scenario | MAX-AE |
| --- | --- |
| 1% missing | 4.21 |
| 5% missing | 7.68 |
| 10% missing | 13.18 |
| 20% missing | 20.61 |

Table 4.10: MICE scenarios and MAX-AE
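As a quick check on the figures in the table above, the sketch below defines MAX-AE as the largest absolute difference between estimated and true efficiency over all DMUs (an assumed reading of "Maximum Absolute Error", since no formula is given in the text), and then computes the ratio between the 20% and 1% scenarios from the reported values. The function name `max_abs_error` and the four-DMU example are hypothetical.

```python
def max_abs_error(estimated, true):
    """Largest absolute gap between estimated and true efficiency over all DMUs."""
    return max(abs(e_hat - e) for e_hat, e in zip(estimated, true))


# Hypothetical four-DMU example: absolute errors are 0, 3, 2 and 1.
example = max_abs_error([100.0, 88.0, 90.0, 71.0], [100.0, 85.0, 92.0, 70.0])
print(f"example MAX-AE = {example:.1f}")

# MAX-AE values as reported in the table above (5 imputations per level).
reported = {"1%": 4.21, "5%": 7.68, "10%": 13.18, "20%": 20.61}
levels = list(reported.values())
is_monotonic = all(a < b for a, b in zip(levels, levels[1:]))
fold = reported["20%"] / reported["1%"]
print(f"monotonic increase: {is_monotonic}, 20% vs 1%: {fold:.2f}x")
```

The ratio comes out at about 4.9, consistent with the roughly five-fold increase reported for these scenarios.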
The MAX-AE results are consistent with those for both MAE and RMSE. Therefore, this simulation study suggests that MICE is an effective approach for estimating the true efficiency when missing inputs or outputs are encountered. However, as the rate of missing data increases, the precision of the estimated DEA results tends to decrease.
Figure 4.4: MICE scenarios and MAX-AE