Chapter 3 Methods for estimating causal effects in longitudinal data
6.2 Microsimulation models (MSMs)
MSMs simulate an artificial population of heterogeneous individuals, typically over a long time horizon that extends into the future.23 Each individual in the model possesses a set of attributes or ‘states’ (e.g. physical, socio-demographic, geographic), which may be updated throughout the simulation; in particular, individuals are often defined as belonging to one of a finite number of mutually exclusive and collectively exhaustive states, and events of interest are modelled as transitions from one state to another that occur according to a set of deterministic and/or stochastic rules (i.e. ‘transition probabilities’) (136, 141, 181, 182). The parameters governing transitions between states often depend on an individual’s characteristics and possibly on previous history, and these parameters are typically estimated from a wide range of data sources, such as cohort studies, population-based epidemiological studies, and RCTs (136, 183).
MSMs may be either case-based or time-based (184). In a case-based model, individuals are simulated one at a time through all time points; in a time-based model, all simulated individuals are transitioned simultaneously through the model. Both methods produce equivalent results where there are no interactions amongst individuals, but time-based models tend to be more computationally efficient since they can easily be vectorised (185).
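The distinction between the two approaches can be illustrated with a minimal sketch. The two-state model, the transition probability `P_ONSET`, and the function names below are hypothetical and chosen only for illustration. Both functions consume the same pre-drawn uniform random numbers, so that their outputs can be compared exactly.

```python
import numpy as np

P_ONSET = 0.1  # hypothetical per-step probability of moving from state 0 to state 1


def simulate_case_based(draws):
    """Simulate one individual at a time through all time points."""
    n_individuals, n_steps = draws.shape
    final_state = np.zeros(n_individuals, dtype=int)
    for i in range(n_individuals):
        state = 0
        for t in range(n_steps):
            if state == 0 and draws[i, t] < P_ONSET:
                state = 1  # state 1 is absorbing in this toy model
        final_state[i] = state
    return final_state


def simulate_time_based(draws):
    """Move all individuals through each time step together (vectorised)."""
    n_individuals, n_steps = draws.shape
    state = np.zeros(n_individuals, dtype=int)
    for t in range(n_steps):
        onset = (state == 0) & (draws[:, t] < P_ONSET)
        state[onset] = 1
    return state


rng = np.random.default_rng(seed=1)
draws = rng.random((1000, 10))  # one uniform draw per individual per time step
case_result = simulate_case_based(draws)
time_result = simulate_time_based(draws)

# True: no individual's transition depends on any other individual's state.
print(np.array_equal(case_result, time_result))
```

If transitions depended on other individuals (e.g. on the current prevalence of state 1), the case-based loop could not be run one individual at a time without further bookkeeping, whereas the time-based loop has the whole population available at each step.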
Additionally, MSMs may be modelled in either discrete or continuous time (184). In a discrete-time model, transitions between states occur at discrete time steps; in a continuous-time model, the duration between state transitions is modelled in continuous time. In this chapter, we focus only on discrete-time MSMs, as they share natural parallels with the causal data structures considered throughout this thesis.24
23 Note that the term ‘microsimulation’ may also refer to the process by which a cross-sectional snapshot of a population is created by generating a synthetic set of individuals whose characteristics match aggregate, area-level statistics; this type of microsimulation is referred to as ‘spatial microsimulation’ (95). However, this chapter focuses primarily on microsimulation that is explicitly longitudinal.
24 The depiction of continuous time and competing events using DAGs is complicated by their
An MSM may be used simply to model the ‘natural history’ of the population, which describes the progression of the population under no exogenous intervention (182); such a model might be used for population projection, for instance (184). An MSM may also be used to model ‘counterfactual histories’, which describe the progression of the population under various hypothetical interventions (98). This capacity has historically made MSMs important tools for policy evaluation (3).
6.2.1 Representing an MSM as a DAG
A key aspect of microsimulation is evolution, a concept closely related to data-generating processes. For instance, Ryder (188) notes that the focus is on ‘events rather than things, processes rather than states’. This is echoed by van Imhoff and Post (189), who argue that an MSM should specify not only what the population will look like at some future point in time, but also how it gets there.
There exist clear parallels between the data-generating processes modelled in MSMs and those represented by DAGs. At every time point in an MSM, each individual’s characteristics may be updated according to some specified probabilities, which may themselves be conditional on any number of current and/or past characteristics; each characteristic may thus be thought of as having a conditional probability (or distribution) associated with it. Similarly, each variable in a DAG is hypothesised to have a probability (or distribution), conditional on the variables which directly cause it.
These similarities make representation of an MSM as a DAG useful and informative, as it helps to draw explicit parallels between the two processes and enables us to understand the conditions under which MSMs may provide valid causal effect estimates. In the following subsection, we illustrate how this might be done in the context of a specific example scenario.
6.2.1.1 Example scenario
We consider an example scenario involving eleven time periods (i.e. 𝑇 = 11) and three variables: (1) sex (female or male); (2) obesity status (non-obese or obese); and (3) diabetes status (non-diabetic or diabetic). At baseline (i.e. 𝑡 = 0), individuals possess a value for each of the three attributes. At each time 𝑡, for 1 ≤ 𝑡 ≤ 10, each individual’s obesity and diabetes states may be updated according to some conditional probability. Specifically, obesity status at time 𝑡 is conditional on sex, obesity status at time 𝑡 − 1, and diabetes status at time 𝑡 − 1; diabetes status at time 𝑡 is conditional on sex, diabetes status at time 𝑡 − 1, and obesity status at time 𝑡.
A DAG offers a useful way to visually summarise the aforementioned process, as in Figure 6.1. Panel (1) depicts the full data-generating process (i.e. for 0 ≤ 𝑡 ≤ 10). While correct, this representation may nevertheless be difficult to interpret due to the number of time points.
research being done in the area of causal inference in the presence of competing events (186, 187), but this is beyond the scope of the present research.
Therefore, in panel (2) we exploit the repeated nature of the data-generating process to produce a simplified representation for time 𝑡; variables depicted in grey are those which affect variables at time 𝑡 but whose causes are not themselves represented in the graph. Panel (2) allows for easier visualisation of the data-generating process and identification of the conditional probabilities which govern it.
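The conditional probabilities set out in the example scenario imply a factorisation of the joint distribution, which may be sketched as follows. Here S denotes sex, and O_t and D_t denote obesity and diabetes status at time t. The baseline distribution P(S, O_0, D_0) is deliberately left unfactorised, since the scenario description does not state how the baseline attributes depend on one another.

```latex
\begin{equation*}
P(S, O_0, D_0, \dots, O_{10}, D_{10})
  = P(S, O_0, D_0)
    \prod_{t=1}^{10}
      P(O_t \mid S, O_{t-1}, D_{t-1}) \,
      P(D_t \mid S, D_{t-1}, O_t)
\end{equation*}
```

Each factor inside the product corresponds to one variable at time t, conditional on the variables that directly affect it.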
The data-generating process described in Figure 6.1 may be thought of as the ‘natural history’ of the population, as it represents the population under no exogenous intervention.
Figure 6.1 DAG representing the data-generating process for the variables sex (𝑺), obesity (𝑶), and diabetes (𝑫), for 𝟎 ≤ 𝒕 ≤ 𝟏𝟎
In (1), the full DAG (i.e. for 0 ≤ 𝑡 ≤ 10) is depicted. In (2), only the data-generating process for time 𝑡 is depicted; variables in grey are those which affect variables at time 𝑡 but whose causes are not themselves represented in the graph.
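As a minimal sketch, the natural history described above could be simulated as a time-based, discrete-time MSM along the following lines. All numerical probabilities, the function names, and the independent draws at baseline are assumptions made purely for illustration; they are not estimates from any data source, and only the conditioning sets are taken from the example scenario.

```python
import numpy as np

# Hypothetical placeholder probabilities (illustration only).
P_FEMALE = 0.5
P_OBESE_BASELINE = 0.25
P_DIABETIC_BASELINE = 0.05


def p_obese(sex, obese_prev, diabetic_prev):
    """P(O_t = 1 | S, O_{t-1}, D_{t-1}), with hypothetical coefficients."""
    return 0.03 + 0.01 * sex + 0.90 * obese_prev + 0.02 * diabetic_prev


def p_diabetic(sex, diabetic_prev, obese_now):
    """P(D_t = 1 | S, D_{t-1}, O_t), with hypothetical coefficients."""
    return 0.01 + 0.005 * sex + 0.95 * diabetic_prev + 0.03 * obese_now


def simulate_natural_history(n_individuals, n_steps=10, seed=0):
    """Simulate sex, obesity and diabetes for t = 0, ..., n_steps."""
    rng = np.random.default_rng(seed)
    sex = (rng.random(n_individuals) < P_FEMALE).astype(int)  # 1 = female, 0 = male
    obese = np.zeros((n_steps + 1, n_individuals), dtype=int)
    diabetic = np.zeros((n_steps + 1, n_individuals), dtype=int)

    # Baseline (t = 0): drawn independently here, a simplifying assumption.
    obese[0] = rng.random(n_individuals) < P_OBESE_BASELINE
    diabetic[0] = rng.random(n_individuals) < P_DIABETIC_BASELINE

    for t in range(1, n_steps + 1):
        # Obesity at t depends on sex and on obesity and diabetes at t - 1.
        obese[t] = rng.random(n_individuals) < p_obese(
            sex, obese[t - 1], diabetic[t - 1]
        )
        # Diabetes at t depends on sex, diabetes at t - 1, and obesity at t.
        diabetic[t] = rng.random(n_individuals) < p_diabetic(
            sex, diabetic[t - 1], obese[t]
        )
    return sex, obese, diabetic


sex, obese, diabetic = simulate_natural_history(10_000)
print(diabetic.mean(axis=1))  # simulated diabetes prevalence at each time point
```

The order of the two updates within the loop matters: obesity is updated first because diabetes status at time t is conditional on obesity status at time t, not at time t − 1.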
6.2.2 Key differences between the g-formula and microsimulation
Parallels between the g-formula and microsimulation have been recognised by Murray et al. (99), who describe the use of a ‘similar mathematical approach: construction of a sequential model that is the basis for a Monte Carlo simulation of a (counterfactual) population under each treatment strategy of interest.’ The g-formula involves modelling the observed joint distribution of the data, and then estimating the (counterfactual) distributions under various interventions to calculate causal effects. This can be related to an MSM, which models the ‘natural history’ of the population and then estimates the ‘counterfactual histories’ under various interventions. However, although the two are methodologically similar, there exist key differences which arise from their distinct historical evolutions (as outlined in Chapter 3).
The joint distribution of the data is generally unknown and cannot be directly estimated in a microsimulation model (189). This is because microsimulation models are often used to make general inferences about a population, often at some point in the future. This is in contrast to the g-formula, which makes inferences about a specific (often highly selected) population from a single retrospective dataset (99). Thus, using the g-formula, the conditional probability of every variable at every time point in Figure 6.1 can be estimated from a single dataset. MSMs, however, do not have direct access to these probabilities. In this way, the g-formula may be thought of as a special case of microsimulation, in which we have access to the entire joint distribution of all variables and in which all parameters come from a single dataset.
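To make the contrast concrete, the following is a minimal sketch of a g-formula-style analysis in which every conditional probability is estimated from a single dataset and then used in a Monte Carlo simulation under each strategy. Several simplifying assumptions are made: the ‘observed’ dataset is synthetic and generated with hypothetical probabilities; the conditional probabilities are estimated as stratified proportions pooled over time, whereas applications typically use parametric models; and the two strategies compared (‘never obese’ and ‘always obese’ for 𝑡 ≥ 1) are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N, T = 20_000, 10

# Step 0: a synthetic 'observed' dataset (hypothetical probabilities).
sex = rng.integers(0, 2, N)
obese = np.zeros((T + 1, N), dtype=int)
diabetic = np.zeros((T + 1, N), dtype=int)
obese[0] = rng.random(N) < 0.25
diabetic[0] = rng.random(N) < 0.05
for t in range(1, T + 1):
    obese[t] = rng.random(N) < (
        0.03 + 0.01 * sex + 0.90 * obese[t - 1] + 0.02 * diabetic[t - 1]
    )
    diabetic[t] = rng.random(N) < (
        0.01 + 0.005 * sex + 0.95 * diabetic[t - 1] + 0.03 * obese[t]
    )


def stratified_means(outcome, a, b, c):
    """Estimate P(outcome = 1 | a, b, c) for binary a, b, c as a 2x2x2 table."""
    table = np.full((2, 2, 2), np.nan)
    for i in range(2):
        for j in range(2):
            for k in range(2):
                mask = (a == i) & (b == j) & (c == k)
                if mask.any():
                    table[i, j, k] = outcome[mask].mean()
    return table


# Step 1: estimate every conditional probability from the single dataset,
# pooling person-time over t = 1, ..., T.
s_rep = np.tile(sex, T)
p_o = stratified_means(  # indexed as [S, O_{t-1}, D_{t-1}]
    obese[1:].ravel(), s_rep, obese[:-1].ravel(), diabetic[:-1].ravel()
)
p_d = stratified_means(  # indexed as [S, D_{t-1}, O_t]
    diabetic[1:].ravel(), s_rep, diabetic[:-1].ravel(), obese[1:].ravel()
)


def g_formula_risk(set_obesity=None, n_sim=20_000):
    """Monte Carlo estimate of diabetes prevalence at t = T under a strategy."""
    idx = rng.integers(0, N, n_sim)  # resample the observed baseline values
    s, o, d = sex[idx], obese[0, idx], diabetic[0, idx]
    for t in range(1, T + 1):
        if set_obesity is None:  # natural course: draw obesity from its model
            o_new = (rng.random(n_sim) < p_o[s, o, d]).astype(int)
        else:  # intervention: obesity is set, not drawn
            o_new = np.full(n_sim, set_obesity)
        d = (rng.random(n_sim) < p_d[s, d, o_new]).astype(int)
        o = o_new
    return d.mean()


risk_natural = g_formula_risk()
risk_never = g_formula_risk(set_obesity=0)
risk_always = g_formula_risk(set_obesity=1)
print(risk_natural, diabetic[T].mean())  # simulated vs observed natural course
print(risk_always - risk_never)          # contrast between the two strategies
```

The simulation step has the same structure as a time-based MSM. The difference lies in Step 1: here every entry of `p_o` and `p_d` comes from the same dataset, whereas in an MSM the analogous parameters would typically be assembled from several external sources.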