
5. Robust Estimation of Visit Potential under Missing Data

5.6. Robustness Test under CDMAR

In the previous section we tested the performance of the selected missing data methods under a missing completely at random (MCAR) mechanism. In this section we introduce a dependency between sociodemographic variables and the response variable, leading to covariate-dependent missing at random (CDMAR). We test how robust the methods are when the missing data mechanism is biased by a variable with a weak or a strong dependency on the movement behavior.

5.6.1. Test Scenario

In this section we test the performance of multiple imputation via general location model (MI-GLM), single imputation via support vector regression (SI-SVR), multiple imputation from a conditional Poisson distribution (MI-Poisson) and Kaplan-Meier (KM), as described in Sections 5.2.3 to 5.2.6, under a CDMAR mechanism. To this end we systematically induce missing values for entities with certain sociodemographic attributes. We start with a variable that shows only a weak connection to travel group and then proceed to a variable with a strong dependency on travel group. The aim of the experiments is to determine how strongly CDMAR can influence the performance of the algorithms in our application setting, and to what degree the methods can compensate for the effect if the corresponding variable is available for conditioning. All methods are parameterized according to the parameter tuning described in Section 5.4. Details on the parameterization and the results of each experiment can be found in Appendices B.4, C.2 and C.3.

We induce missing data targeted at entities with certain sociodemographic attributes as described in Section 5.3.3. Recall that the rate corresponds to the proportion of entities of the respective sociodemographic group that have at least one missing measurement day. We increase this rate from 0.1 to 1.0 in steps of 0.1. For the remaining entities we apply a constant rate of missing data of 0.5. In consequence, entities of the selected sociodemographic group are overrepresented at the beginning of an experiment and underrepresented at the end of an experiment.

Note that, due to the different censoring schema, the total number of missing measurement days for a given rate differs between the experiments under CDMAR. The reason is that the selected attributes have different shares in the data sample, so the rate is applied to different proportions of entities. For the same reason, the number of missing measurement days also deviates from the experiments under MCAR. We again tested all algorithms for location sets of sizes 25, 50, 100, 250 and 500 and ran each parameterization with 30 different poster campaigns.
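The group-dependent selection described above can be sketched as follows. This is only a minimal illustration, not the procedure of Section 5.3.3: the function name `select_incomplete_entities`, the `groups` mapping and the rounding of group sizes are assumptions made for the example, and the step that decides which measurement days of a selected entity are removed is left out because it is not described in this section.

```python
import random


def select_incomplete_entities(groups, target, rate, base_rate=0.5, seed=0):
    """Return the ids of entities that receive at least one missing day.

    groups:    dict mapping entity id -> attribute value (hypothetical layout)
    target:    attribute value whose entities get the varying rate
    rate:      proportion of target-group entities made incomplete
    base_rate: constant proportion applied to all remaining entities
    """
    rng = random.Random(seed)
    in_group = sorted(e for e, g in groups.items() if g == target)
    others = sorted(e for e, g in groups.items() if g != target)
    # Draw without replacement so the realized proportions match the rates.
    chosen = rng.sample(in_group, round(rate * len(in_group)))
    chosen += rng.sample(others, round(base_rate * len(others)))
    return set(chosen)


# Toy sample: 40 entities in the target group, 60 in the remainder.
groups = {i: ("female" if i < 40 else "male") for i in range(100)}
for step in range(1, 11):
    rate = step / 10
    incomplete = select_incomplete_entities(groups, "female", rate)
    print(rate, len(incomplete))
```

Because the varying rate is applied only to the target group, the total number of incomplete entities at a given rate depends on the share of that group in the sample, which is why the totals differ between attributes.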

5.6.2. CDMAR for Variable with Weak Dependency

As analyzed in Section 5.1.4, the variable gender shows only a weak dependency on travel group. We therefore selected this variable for the first analysis of the CDMAR mechanism and chose the attribute gender="female" to define the group of entities with a varying rate of missing data. Table 5.23 shows the aggregated results of the experiment. Details on the mean error and root mean squared error for individual location set sizes and rates of missing data can be found in Appendix C.2 in Tables C.30 - C.32 and in Appendix C.3 in Tables C.69 - C.71, respectively. If we compare the results with Experiment 1 (see Table 5.16), we see that, with the exception of SI-SVR, only small differences occur, which may be attributed to random effects. The behavior of SI-SVR results from the lower total proportion of missing data. As about 25 percent of the entities always keep five measurement days, extreme errors at high rates of missingness are avoided. KM and MI-Poisson still obtain very small errors, from which we may conclude that missingness related to independent variables with only a loose connection to the dependent variable has little influence on the results.
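The figure of about 25 percent can be traced with a short calculation. The sketch below assumes that the group gender="female" makes up roughly half of the entities; this share is an assumption for illustration, as the exact value is not restated in this section, and the function name `incomplete_share` is hypothetical.

```python
def incomplete_share(group_share, rate, base_rate=0.5):
    """Overall proportion of entities with at least one missing day.

    group_share: share of the targeted group in the sample (assumed value)
    rate:        varying rate applied to the targeted group
    base_rate:   constant rate applied to the remaining entities
    """
    return group_share * rate + (1.0 - group_share) * base_rate


# Highest rate: every entity of the targeted group is incomplete,
# but half of the remaining entities still keep all measurement days.
worst_case = incomplete_share(group_share=0.5, rate=1.0)
print(worst_case)        # 0.75
print(1.0 - worst_case)  # 0.25
```

Under this assumption a quarter of the entities stay complete even at a rate of 1.0, whereas under MCAR a rate of 1.0 leaves no complete entity.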

Table 5.23.: Experiment 7, CDMAR mechanism on gender (female), without sociodemographic variables

            aace(me, ·)             aace(rme, ·)            aace(rmse, ·)
Method      GVE     AVE     CE      GVE     AVE     CE      GVE     AVE     CE
MI-GLM      39.1    0.8     0.067   0.022   0.098   0.124   95.7    0.8     0.069
SI-SVR      102.0   0.7     0.028   0.047   0.067   0.037   546.2   1.9     0.034
MI-Poisson  13.9    1.4     0.086   0.005   0.168   0.140   99.6    1.5     0.087
KM          –       –       0.003   –       –       0.005   –       –       0.015

5.6.3. CDMAR for Variable with Strong Dependency

In this section we apply CDMAR to the sociodemographic variable with the strongest dependency on travel group: occupation. More specifically, we chose the attribute occupation="employed" to define the group of entities with a varying rate of missing data. In Experiment 8 we do not provide any sociodemographic variables to the algorithms, while in Experiment 9 we provide the variable occupation for conditioning. The results are given in Tables 5.24 and 5.25, respectively. Further details on the mean error and root mean squared error for individual location set sizes and rates of missing data can be found in Appendix C.2 in Tables C.33 - C.38 and in Appendix C.3 in Tables C.72 - C.77, respectively.

Table 5.24.: Experiment 8, CDMAR mechanism on occupation (employed), without sociodemographic variables

            aace(me, ·)             aace(rme, ·)            aace(rmse, ·)
Method      GVE     AVE     CE      GVE     AVE     CE      GVE     AVE     CE
MI-GLM      60.4    0.8     0.066   0.025   0.099   0.122   106.5   0.9     0.068
SI-SVR      127.8   0.7     0.027   0.055   0.074   0.036   496.7   1.7     0.033
MI-Poisson  16.7    1.4     0.085   0.006   0.169   0.139   102.0   1.4     0.086
KM          –       –       0.005   –       –       0.008   –       –       0.015

Table 5.25.: Experiment 9, CDMAR mechanism on occupation (employed), with sociodemographic variable occupation

            aace(me, ·)             aace(rme, ·)            aace(rmse, ·)
Method      GVE     AVE     CE      GVE     AVE     CE      GVE     AVE     CE
MI-GLM      37.6    0.8     0.068   0.022   0.098   0.124   95.2    0.8     0.069
SI-SVR      65.6    0.6     0.022   0.027   0.053   0.032   280.3   1.1     0.027
MI-Poisson  18.4    1.4     0.085   0.006   0.168   0.140   101.5   1.5     0.086
KM          –       –       0.003   –       –       0.005   –       –       0.016

Clearly, the errors increase under CDMAR based on occupation. KM now also shows a slight increase in error for entity coverage. However, providing occupation for conditioning reverses the effect completely for KM. MI-GLM and SI-SVR also improve in Experiment 9 and are able to compensate for the CDMAR mechanism. The behavior of MI-Poisson for gross visits is a random effect, as MI-Poisson does not rely on sociodemographic variables and is evaluated under the same conditions in both experiments. In comparison with MI-GLM and SI-SVR, its error for gross visits is still small. This is plausible because MI-Poisson imputes visits separately for each entity based on the available visits of that entity. If the assumption of correlated visits over days and the assumed model are correct, MI-Poisson only has to cope with statistical variation when calculating gross visits.
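The per-entity idea behind this argument can be illustrated with a deliberately simplified sketch. It is not the conditional Poisson model of Section 5.2.5, which is not reproduced here: a plain Poisson distribution with the entity's observed mean as rate is swapped in, and the names `draw_poisson` and `impute_gross_visits`, the period length of five days and the number of imputations are assumptions for the example.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def impute_gross_visits(observed, n_days=5, n_imputations=20, seed=0):
    """Average total visits of one entity over several imputed data sets.

    observed: visit counts on the measured days of this entity (non-empty)
    n_days:   assumed length of the full measurement period
    """
    rng = random.Random(seed)
    lam = sum(observed) / len(observed)  # rate estimated from this entity only
    n_missing = n_days - len(observed)
    totals = []
    for _ in range(n_imputations):
        imputed = [draw_poisson(lam, rng) for _ in range(n_missing)]
        totals.append(sum(observed) + sum(imputed))
    return sum(totals) / len(totals)


# Entity with three measured days and two missing days.
print(impute_gross_visits([2, 0, 3]))
```

Since the rate is estimated from the entity's own observed days, the sociodemographic group that triggered the missingness does not enter the imputation; only the sampling variation of the draws remains.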
