
5. Robust Estimation of Visit Potential under Missing Data

5.1.2. Mechanisms of Missing Data

The first major works on missing data appeared in the 1970s. Rubin (1976) introduced a typology of missing data mechanisms and discussed their effect on the inference process. The term mechanism here refers to the relationship between the missingness and the variables, or values of variables, in the considered data set, not to the actual real-world process behind the missingness. Three variants of missing data are distinguished in the literature: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) (Little and Rubin, 2002; Schafer and Graham, 2002). The differentiation is important because the properties of methods that treat missing data depend strongly on these dependencies.

Assuming complete knowledge, Y denotes the matrix of complete observations. Y can be partitioned into subsets Yobs and Ymis that contain the values for the observed and unobserved parts of the data, respectively. The different mechanisms of missing data are then defined as follows, given that all objects are sampled independently from the population.

Definition 5.1.1 (Missing completely at random (MCAR)) Missing values are missing completely at random if the missingness is independent of the data, i.e.

P (R | X, Y ) = P (R).

We can relax MCAR to allow a dependency between the missingness and the observed data, which results in MAR.

Definition 5.1.2 (Missing at random (MAR)) Missing values are missing at random if the missingness depends at most on the observed data, i.e.

P (R | X, Y ) = P (R | X, Yobs).

Finally, we obtain MNAR if the missingness depends on the missing values themselves and cannot be removed by conditioning on the observed values.

Definition 5.1.3 (Missing not at random (MNAR)) Missing values are missing not at random if the missingness depends on the unobserved data, i.e. Definition 5.1.2 is violated:

P (R | X, Y ) ≠ P (R | X, Yobs).

For longitudinal data, Little (1995) distinguishes one further mechanism of missingness: covariate-dependent missingness. The term covariate refers to the independent variables; under this mechanism the missingness may depend only on these completely observed variables. It is thus a stricter version of MAR, which we call CDMAR in this thesis.

Definition 5.1.4 (Covariate-dependent missing at random (CDMAR)) Missing values are covariate-dependent missing at random if the missingness depends at most on completely observed independent variables of the data, i.e.

P (R | X, Y ) = P (R | X).

Note that in the case of univariate missing data MAR and CDMAR coincide.

In order to give an intuitive explanation of the definitions, consider the univariate setting in which the values of Y are either completely observed or completely missing for each object. In this case Definition 5.1.2 simplifies to P (R | X, Y ) = P (R | X).

MCAR occurs if the probability of missing values depends neither on X nor on Y. This relationship is depicted in Figure 5.2a. Variable Z hereby denotes influences on R which, however, are independent of X and Y. If a relationship between R and X exists but R is still independent of Y, the data are defined to be MAR. MAR denotes a conditional independence of missingness given a fixed value of X (see Figure 5.2b). However, under MAR a relationship between R and Y may exist due to their mutual dependency on X. This relationship disappears once the value of X is taken into account. Finally, if the distribution of missing values depends on Y, the data are said to be MNAR (see Figure 5.2c).

Figure 5.2: Graphical models of types of missing data in the univariate setting: (a) MCAR, (b) MAR, (c) MNAR; source of figures: Schafer and Graham (2002)
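The statement that the relationship between R and Y disappears once X is taken into account can be made explicit with Bayes' rule. The following short derivation is a sketch for the univariate setting; it assumes, purely for notational convenience, that X is discrete, so that the marginalization is written as a sum.

```latex
% Univariate MAR: P(R | X, Y) = P(R | X).
% Conditional on X, the distribution of Y does not depend on R:
\begin{align*}
P(Y \mid X, R)
  &= \frac{P(R \mid X, Y)\, P(Y \mid X)}{P(R \mid X)}
   = \frac{P(R \mid X)\, P(Y \mid X)}{P(R \mid X)}
   = P(Y \mid X).
\end{align*}
% Without conditioning on X, R and Y may still be related through X:
\begin{align*}
P(R \mid Y) &= \sum_{x} P(R \mid X = x)\, P(X = x \mid Y),
\end{align*}
% which in general varies with Y whenever X and Y are dependent.
```

Within each fixed value of X, respondents and non-respondents therefore share the same distribution of Y, although the pooled distributions may differ.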

For example, assume that we want to observe the daily travel distance of the German population. We recruit a representative sample of size n from the population and ask each person for the sociodemographic variables gender and age, as well as the distance traveled on the previous day. For gender and age we obtain a complete observation, i.e. X = (Xg, Xa).

However, not all persons remember their travel distance of the previous day. Travel distance Y and response R are two vectors of length n. Let us assume that travel distance and age are related, which is a well-known result from travel surveys (see Section 2.2.3). MCAR exists if missingness depends neither on sociodemographic characteristics nor on travel distance. If missingness depends on age (for example, older persons may be less likely to recall their travel distance) but not on travel distance once age has been taken into account, the mechanism is MAR. Finally, MNAR results if missingness depends on the missing travel distance itself, i.e. a relationship between R and Y remains even if X has been taken into account. For example, all distances above or below a certain threshold could have been deleted for plausibility reasons.
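The travel survey example can be reproduced in a small simulation. The following sketch is not based on real survey data: the age range, the linear relationship between age and distance, and all response probabilities are hypothetical values chosen only to make the three mechanisms visible. The function names `simulate_survey`, `response_probability` and `observed_mean` are likewise illustrative.

```python
import random
from statistics import mean


def simulate_survey(n=20000, seed=0):
    """Draw (age, distance) pairs; distance decreases with age (assumed)."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        age = rng.randint(18, 90)
        # Hypothetical relationship between age and daily travel distance.
        distance = max(0.0, rng.gauss(60.0 - 0.5 * age, 10.0))
        people.append((age, distance))
    return people


def response_probability(mechanism, age, distance):
    """Probability that a person reports the distance; rates are illustrative."""
    if mechanism == "MCAR":
        return 0.7                                # independent of X and Y
    if mechanism == "MAR":
        return 0.9 if age < 60 else 0.5           # depends on observed age only
    if mechanism == "MNAR":
        return 0.9 if distance < 40.0 else 0.5    # depends on the distance itself
    raise ValueError(f"unknown mechanism: {mechanism}")


def observed_mean(people, mechanism, seed=1):
    """Mean distance among the persons who responded under the mechanism."""
    rng = random.Random(seed)
    observed = [
        distance
        for age, distance in people
        if rng.random() < response_probability(mechanism, age, distance)
    ]
    return mean(observed)


people = simulate_survey()
true_mean = mean(distance for _, distance in people)
results = {m: observed_mean(people, m) for m in ("MCAR", "MAR", "MNAR")}

print(f"complete data: {true_mean:.2f}")
for mechanism, value in results.items():
    print(f"{mechanism}: {value:.2f}")
```

Under these assumptions the mean of the observed distances stays close to the complete-data mean for MCAR, is shifted upwards for MAR (younger persons, who travel further, respond more often) and is shifted downwards for MNAR (long distances are reported less often).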

We can extend the example to a longitudinal setting with a monotone or arbitrary pattern of missingness by observing travel distance for q days, i.e. Y = (Y1, ..., Yq) and R = (R1, ..., Rq).

In this setting, completely observed variables as well as partially observed variables carry information. If we require CDMAR, missingness may still depend only on socio-demography. Under MAR, however, missingness may additionally depend on any recorded value of travel distance of a given entity.
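Whether a longitudinal response matrix follows a monotone or an arbitrary pattern can be checked mechanically. The sketch below assumes that each row holds the response indicators (R1, ..., Rq) of one person in temporal order, coded as 1 for an observed and 0 for a missing day, and it takes monotone to mean that no observed day follows a missing one (pure dropout). Both the coding and the helper name `is_monotone` are assumptions made for illustration.

```python
def is_monotone(response_rows):
    """Return True if no row contains an observed day after a missing day."""
    for row in response_rows:
        seen_missing = False
        for r in row:
            if r == 0:
                seen_missing = True
            elif seen_missing:
                # An observed value after a gap: the pattern is arbitrary.
                return False
    return True


# Hypothetical response matrices for q = 4 days.
dropout = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
intermittent = [
    [1, 0, 1, 1],
    [1, 1, 0, 0],
]

print(is_monotone(dropout))
print(is_monotone(intermittent))
```

The first matrix contains only persons who stop reporting at some day and never return, whereas the second contains a person with a gap on the second day, which makes the pattern arbitrary.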

Depending on the goal of a study, missing data mechanisms have different implications (Little and Rubin, 2002). If the interest lies in the conditional distribution of the partly observed variables Y given the completely observed variables X, an analysis is only unbiased if the data are CDMAR. Returning to our above example, this means that we can estimate the travel distance for sociodemographic groups directly from the data sample. However, if the interest lies in the marginal distribution of Y (i.e. we are interested in the average travel distance of the whole population) CDMAR is not sufficient. In this situation only MCAR assures unbiased results. However, the observed data may be used to reduce the bias if data are not MCAR. If CDMAR or MAR dependencies are known, missing data algorithms can be applied that mitigate the induced bias by conditioning on the respective variables. It is therefore important to know which mechanisms of missingness exist in partially observed data.
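The difference between the conditional and the marginal target can be illustrated numerically. The following sketch assumes a single binary covariate (age group) and response probabilities that depend on this covariate only, i.e. a CDMAR mechanism; the group shares, mean distances and response rates are hypothetical. Weighting the group means by the group shares is used here as one simple way of conditioning on X and is not meant to represent a specific algorithm from the literature.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population: mean distance and response rate per age group.
MEAN_DISTANCE = {"young": 40.0, "old": 20.0}
RESPONSE_RATE = {"young": 0.9, "old": 0.5}   # depends on X only (CDMAR)

sample = []
for _ in range(20000):
    group = "young" if rng.random() < 0.6 else "old"
    distance = rng.gauss(MEAN_DISTANCE[group], 5.0)
    responded = rng.random() < RESPONSE_RATE[group]
    sample.append((group, distance, responded))

# Marginal mean: complete data versus respondents only.
true_mean = mean(d for _, d, _ in sample)
naive_mean = mean(d for _, d, r in sample if r)

# Conditional means per group, estimated from respondents only.
group_means = {
    g: mean(d for grp, d, r in sample if r and grp == g)
    for g in MEAN_DISTANCE
}
full_group_means = {
    g: mean(d for grp, d, _ in sample if grp == g)
    for g in MEAN_DISTANCE
}

# X is completely observed, so the group shares are known for the full sample.
shares = {
    g: sum(1 for grp, _, _ in sample if grp == g) / len(sample)
    for g in MEAN_DISTANCE
}
adjusted_mean = sum(shares[g] * group_means[g] for g in MEAN_DISTANCE)

print(f"complete data: {true_mean:.2f}")
print(f"respondents only: {naive_mean:.2f}")
print(f"weighted by group shares: {adjusted_mean:.2f}")
```

In this setting the group means estimated from respondents agree closely with the complete-data group means, the unweighted respondent mean overestimates the population mean, and the weighted estimate removes most of that bias.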