5. Robust Estimation of Visit Potential under Missing Data
5.1.4. Analysing Missing Data Mechanisms in Mobility Data
While the pattern of missingness is easily determined, the mechanism of missing data is hard to identify. For MCAR several tests have been proposed (Little, 1988; Park and Davis, 1993). The distinction between MAR and MNAR, however, is not straightforward. In general, it is not possible unless additional knowledge about the data or the surveying process is available (Little and Rubin, 2002; Schafer and Graham, 2002). One major indication of MNAR is that the distribution of the observed values differs from the known shape of the distribution. If, for example, a variable is known to follow a normal distribution but shows an asymmetric shape in the data sample, it is likely that the mechanism of missingness is not MAR (Little and Rubin, 2002). In addition, knowledge about the surveying process helps to identify the mechanism of missingness. For example, Murray and Findlay (1988) describe a longitudinal study on drugs against hypertension. Patients whose blood pressure exceeded a certain threshold were naturally withdrawn from the study and received a different treatment. In this case the mechanism of missingness is MAR because the blood pressure was recorded before drop-out. A different situation arises if measurements are rejected because they exceed or fall below a certain threshold, e.g. for plausibility reasons. In this case the missingness depends on the rejected value itself, which results in an MNAR mechanism. If no information about the censoring mechanism or the distribution of the data is available, it is often assumed that the mechanism of missingness is MAR. This assumption is reasonable for many real-world scenarios. However, the robustness of the applied algorithm should be verified, as the degree of bias in the results may depend on several factors, including the amount of missing data, the implementation of the missing data mechanism, the provided independent variables and the estimated statistical quantity (e.g. mean, variance, standard error), as shown by Collins et al. (2001).
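The effect of threshold-based rejection on the shape of the observed distribution can be illustrated with a small simulation. The following is a minimal sketch on synthetic data and is not part of the study: the sample size, the seed, the cut-off of 1.0 and the helper name sample_skewness are assumptions chosen for illustration only.

```python
import random
import statistics


def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (close to 0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the run is reproducible

# Synthetic complete data: the variable is known to be normally distributed.
complete = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# MNAR by construction: a value is rejected because of the value itself
# (here an assumed plausibility cut-off at 1.0).
observed = [v for v in complete if v <= 1.0]

skew_complete = sample_skewness(complete)
skew_observed = sample_skewness(observed)

print(f"skewness of complete data: {skew_complete:+.3f}")
print(f"skewness of observed data: {skew_observed:+.3f}")
```

The complete sample is approximately symmetric, whereas the observed sample is clearly left-skewed. A deviation of this kind from the known shape is the indication of MNAR described above.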
Further, as already stated in Section 5.1.2, the analysis goal determines in how much detail the dependencies between the mobility characteristic and the missingness have to be tested. For statements on the whole data set it suffices to ensure that the two variables are independent over the data set as a whole. However, if certain characteristics are to be evaluated, for example, for sociodemographic subgroups, the independence between mobility characteristic and missingness has to be ensured within each subgroup as well. The level of detail of the evaluation therefore also influences the analysis of the missing data mechanism.
In the German mobility study, the mobility information of interest is the daily number of visits of a test person to poster sites. However, the number of visits depends on the selected poster campaign: it varies with the number of locations chosen and with where they are situated. We therefore need a substitute that is proportional to the number of poster passages for an average poster campaign. As a substitute we chose the daily distance that a person travels. Clearly, the more a person travels outside, the higher the probability that he or she will see a poster. We determine the average daily travel distance for each person from their available measurement days, i.e.
\[
D = (d_i) = \frac{\sum_{j=1}^{7} d_{ij}}{r_i} \qquad \forall\, i = 1, \dots, n
\]
with d_ij the traveled distance of person i on day j in kilometers and r_i ∈ {1, ..., 7} the number of valid measurement days for person i. We know the number of valid measurement days, i.e. the response of each test person, from a follow-up survey. As the average travel distance covers only the available measurement days, it is only an approximation of the true average travel distance per person. However, we are in the comfortable position of being able to provide a reference value for each test person independently of the number of valid measurement days.
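The computation of the average daily travel distance can be sketched as follows. This is an illustrative sketch and not the implementation used in the study: the data layout (one list of seven daily distances per person, with None marking an invalid measurement day), the person identifiers and the function name average_daily_distance are assumptions.

```python
def average_daily_distance(daily_km):
    """Average distance over the valid measurement days of one person.

    daily_km holds one entry per day of the week; None marks a day without
    a valid measurement (assumed encoding). Returns None if no day is valid,
    because the average is undefined for r_i = 0.
    """
    valid = [km for km in daily_km if km is not None]
    if not valid:
        return None
    return sum(valid) / len(valid)  # sum of d_ij divided by r_i


# Hypothetical measurements for three test persons.
measurements = {
    "person_a": [12.0, 8.0, None, 10.0, None, None, None],  # r_i = 3
    "person_b": [None] * 7,                                  # no valid day
    "person_c": [5.0] * 7,                                   # r_i = 7
}

averages = {pid: average_daily_distance(days) for pid, days in measurements.items()}
print(averages)
```

Dividing by the number of valid days rather than by seven is what makes a reference value available for every person with at least one valid day.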
For the dependency analysis we discretize the travel distance and the measurement response into three groups each. For travel distance, we form the groups according to the quantiles that separate the lowest, middle and highest third of the travel distances, i.e.
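\[
\text{travel group}(d_i) =
\begin{cases}
\text{low} & \text{if } d_i < Q_{0.33}(D), \\
\text{medium} & \text{if } Q_{0.33}(D) \le d_i < Q_{0.66}(D), \\
\text{high} & \text{if } Q_{0.66}(D) \le d_i.
\end{cases}
\]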
For measurement response, we form the following groups:
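\[
\text{response group}(r_i) =
\begin{cases}
\text{low} & \text{if } r_i \in \{0, 1, 2\}, \\
\text{medium} & \text{if } r_i \in \{3, 4, 5\}, \\
\text{high} & \text{if } r_i \in \{6, 7\}.
\end{cases}
\]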
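The two discretization rules can be sketched in a few lines. The text does not state which quantile definition is used, so the linear-interpolation quantile below is an assumption, as are the function names and the example distances.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation (assumed definition)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def travel_group(d_i, q33, q66):
    """Map an average daily distance to low / medium / high."""
    if d_i < q33:
        return "low"
    if d_i < q66:
        return "medium"
    return "high"


def response_group(r_i):
    """Map the number of valid measurement days to low / medium / high."""
    if r_i in (0, 1, 2):
        return "low"
    if r_i in (3, 4, 5):
        return "medium"
    if r_i in (6, 7):
        return "high"
    raise ValueError(f"number of valid days out of range: {r_i}")


# Hypothetical average daily distances in kilometers.
distances = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
q33, q66 = quantile(distances, 0.33), quantile(distances, 0.66)
travel_groups = [travel_group(d, q33, q66) for d in distances]
print(travel_groups)
```

Because the travel thresholds are quantiles of D itself, the three travel groups are of roughly equal size, whereas the sizes of the response groups depend on the observed response behavior.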
We perform chi-square tests in order to detect dependencies between variables. Primarily, we are interested in the relationship between the independent variables and either travel group or response group. However, the relationships among the independent variables are also of interest, as they allow us to reduce complexity later on. In addition, we are interested in the relationship between travel group and response group when conditioning on the independent variables. This information is important for two reasons. First, conditioning is necessary in order to evaluate the data separately for sociodemographic groups. This, however, may induce a dependency between travel and response group and thus bias the results. Second, if we detect a dependency between travel and response group in the first place, conditioning offers the possibility of reducing the dependency and may thus improve the results.
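A chi-square test of independence on a contingency table can be sketched as follows. This is a minimal illustration using only the standard library, not the software used in the study. The p-value uses the closed form of the chi-square survival function that holds for an even number of degrees of freedom only; this covers every table in which one variable is the three-level travel or response group, but not arbitrary tables. The function names and the example counts are assumptions.

```python
import math


def chi2_sf_even_dof(x, dof):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive, even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i  # (x/2)**i / i!
        total += term
    return math.exp(-half) * total


def chi_square_test(table):
    """Pearson chi-square test for an r x c table of counts.

    Returns (statistic, degrees of freedom, p-value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("empty row or column: expected counts would be zero")
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, chi2_sf_even_dof(statistic, dof)


# Hypothetical counts: rows = travel group, columns = response group.
dependent = [[30, 10, 5], [10, 30, 10], [5, 10, 30]]
stat, dof, p_value = chi_square_test(dependent)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.2g}")
```

A p-value below the chosen significance level leads to rejecting independence. The approximation behind the test becomes unreliable when expected cell counts are small, which is the situation flagged by the asterisks in the tables of this section.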
In the first analysis we evaluate the pairwise dependencies between all sociodemographic variables, travel group and response group. The results are shown in Table 5.1. Unfortunately, our data set shows a dependency between travel group and response group, i.e. our data are at least CDMAR. If we assume a significance level of α = 0.05, all variables with the exception of travel group are independent of the response group. Further, all variables with the exception of response group and occupation are independent of the travel group. The dependencies among our independent variables vary; however, it should be noted that householder, occupation and education show very strong dependencies on the other independent variables.
Table 5.1.: P-Values of chi-square tests between all sociodemographic variables, travel group and response group for test persons in Hamburg, Germany
              gender    age group  education  occupation  householder  travel group  resp. group
gender        0         0.709      0.079      0.004       ≤ 0.001      0.159         0.212
age group     0.709     0          ≤ 0.001    ≤ 0.001     ≤ 0.001      0.264         0.895
education     0.079     ≤ 0.001    0          *≤ 0.001    ≤ 0.001      0.593         *0.407
occupation    0.004     ≤ 0.001    *≤ 0.001   0           ≤ 0.001      0.001         *0.944
householder   ≤ 0.001   ≤ 0.001    ≤ 0.001    ≤ 0.001     0            0.118         0.073
travel group  0.159     0.264      0.593      0.001       0.118        0             ≤ 0.001
resp. group   0.212     0.895      *0.407     *0.944      0.073        ≤ 0.001       0
* approximation may be incorrect due to small cell counts
In the second analysis we test the dependency between travel group and response group while conditioning on each of the independent variables. That is, we perform a chi-square test of independence between travel and response group for all test persons that share the same value of a variable. As mentioned above, this step is necessary because the application requires information about sociodemographic groups. In addition, the conditioning may reduce the dependency between travel and response group. The results are shown in Table 5.2. Again, for each variable we obtain a dependency for at least one of its values. This means that we also need to test combinations of independent variables.
Table 5.2.: P-Values of chi-square tests between travel group and response group under conditioning on the sociodemographic variables gender, age group, education, occupation and householder for test persons in Hamburg, Germany
gender        male          female
              0.030         0.001

age group     14 - 29       30 - 49       ≥ 50
              *0.064        0.042         *0.007

education     in school     secondary general school   intermediate secondary school   high school / university
              ≤ 0.001       *0.188                     *0.146                          *0.903

occupation    in training   employed      retired       unemployed
              ≤ 0.001       *0.300        *0.073        *0.117

householder   yes           no
              ≤ 0.001       *0.016

* approximation may be incorrect due to small cell counts
In the third analysis we therefore perform chi-square tests between travel and response group for every combination of values of two variables. As the conditioning strongly reduces the number of data points available for each test, not all analyses could be conducted. We performed chi-square tests only for groups with at least 30 test persons. However, even for these groups the correctness of the approximation cannot be fully guaranteed due to small cell counts. The complete results are given in Appendix C.1. Here we restrict the tables to the most important results. Tables 5.3 and 5.4 show the variable pairs for which travel group and response group are independent in all tested value groups at a significance level of α = 0.03. Both pairs contain age group as a conditioning variable, which is plausible as age is a strong differentiator for travel behavior. Under the assumption that our results may be safely interpreted, we can use either of the depicted variable pairs for conditioning. The preferred choice is the pair (age group, householder). First, it has the highest minimum p-value over all value groups. Second, it consists of a comparatively small number of groups, which increases the number of cases per group and thus the stability of the results.
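The conditioning step amounts to building one travel-by-response contingency table per value group and discarding groups that are too small. The following sketch shows this stratification only; the record layout, the function name stratified_tables and the example records are assumptions, while the minimum group size of 30 is taken from the text.

```python
from collections import Counter, defaultdict

LEVELS = ("low", "medium", "high")


def stratified_tables(records, min_size=30):
    """Build one 3 x 3 table (rows: travel group, columns: response group) per stratum.

    records is an iterable of (stratum, travel_group, response_group), where
    stratum is any hashable key, e.g. a pair (age group, householder).
    Strata with fewer than min_size test persons are dropped.
    """
    counts = defaultdict(Counter)
    for stratum, travel, response in records:
        counts[stratum][(travel, response)] += 1

    tables = {}
    for stratum, cells in counts.items():
        if sum(cells.values()) < min_size:
            continue  # too few test persons for a chi-square test
        tables[stratum] = [[cells[(t, r)] for r in LEVELS] for t in LEVELS]
    return tables


# Hypothetical records: 36 persons in one stratum, 9 persons in another.
records = [(("14 - 29", "yes"), t, r) for t in LEVELS for r in LEVELS for _ in range(4)]
records += [(("30 - 49", "no"), t, r) for t in LEVELS for r in LEVELS]

tables = stratified_tables(records)
print(tables)
```

Each remaining table can then be passed to a chi-square test of independence. Conditioning on a single variable is the special case in which the stratum key is a single value rather than a pair.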
Table 5.3.: P-Values of chi-square tests between travel group and response group under conditioning on the sociodemographic variables age group and occupation for test persons in Hamburg, Germany
              occupation
age group     in training   employed   retired   unemployed
14 - 29       *0.298        *0.264     NA        NA
30 - 49       NA            *0.033     NA        NA
≥ 50          NA            *0.087     *0.078    NA
* approximation may be incorrect due to small cell counts
Table 5.4.: P-Values of chi-square tests between travel group and response group under conditioning on the sociodemographic variables age group and householder for test persons in Hamburg, Germany
              householder
age group     yes       no
14 - 29       *0.058    *0.152
30 - 49       *0.096    *0.109
≥ 50          *0.117    *0.043
* approximation may be incorrect due to small cell counts
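The first selection criterion, the highest minimum p-value over all value groups, can be checked directly on the p-values of Tables 5.3 and 5.4. The sketch below uses exactly those values, leaving out the NA entries; the variable and function names are illustrative assumptions, and the second criterion (number of groups) is not modeled.

```python
ALPHA = 0.03  # significance level used for Tables 5.3 and 5.4

# P-values of the tested value groups, copied from Tables 5.3 and 5.4.
p_values_by_pair = {
    ("age group", "occupation"): [0.298, 0.264, 0.033, 0.087, 0.078],
    ("age group", "householder"): [0.058, 0.152, 0.096, 0.109, 0.117, 0.043],
}


def preferred_pair(p_values):
    """Return the variable pair whose smallest p-value is largest."""
    return max(p_values, key=lambda pair: min(p_values[pair]))


# Independence is not rejected in any tested value group of either pair.
all_independent = all(min(values) > ALPHA for values in p_values_by_pair.values())

preferred = preferred_pair(p_values_by_pair)
print(preferred, min(p_values_by_pair[preferred]))
```

The minimum p-value is 0.033 for (age group, occupation) and 0.043 for (age group, householder), so the criterion selects the latter pair, in line with the choice stated above.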