Compositional Data - Specific health geography simulation considerations

3.10 Specific health geography simulation considerations

3.10.3 Compositional Data

Compositional data differ from composite variables in that the individual components are measured directly and on the same scale as the whole they make up, and can themselves be subdivided into smaller parts. An example is the total number of calories consumed, divided into calories from fat, protein and carbohydrates. Compositional data pose no particular issue unless interest lies in the role of one or more components in relation to the whole.
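The calorie example can be sketched in a few lines. The intake figures below are hypothetical and chosen only for illustration; the point is that the total is not measured separately but is fully determined by its components.

```python
# Hypothetical daily intake, in kilocalories, split by macronutrient.
components = {"fat": 700, "protein": 400, "carbohydrate": 900}

# The whole is fully determined by its parts; it carries no extra information.
total = sum(components.values())

# Each component expressed as a proportion of the whole.
proportions = {name: kcal / total for name, kcal in components.items()}

print(total)
print(proportions)
```

Because the proportions must sum to one, holding the total fixed means that any increase in one component has to be offset by a decrease in at least one other.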

As in the case of composite variables, there is a tautological relationship, this time between the component variables and the total variable. For example, a researcher interested in the causal effect of the economically active population on gross domestic product (GDP) might analyse this either with or without conditioning on the total population (Figure 3.10.6). The utility of these two approaches depends on context.

[Figure 3.10.6: causal diagram with nodes Economically Active Population, Economically Inactive Population, Total Population and GDP]

Figure 3.10.6: Causal Diagram indicating the relationships between the economically active and inactive populations, the total population and gross domestic product (GDP). Total population is a deterministic variable calculated by adding the economically active and inactive population together; this is indicated by the double circle around total population.

The effect of the economically active population without conditioning on the total population represents the average change in GDP that results from adding economically active individuals to the area, thereby increasing both the number of economically active individuals and the total number of individuals, whilst doing nothing to the population of economically inactive individuals. An estimate of this effect may be desirable if, for example, the government were considering a policy aimed at increasing immigration. In contrast, the effect of the economically active population whilst simultaneously conditioning on the total population represents the average change in GDP achieved by swapping economically inactive individuals for economically active individuals, either by adding economically active individuals and removing an equal number of economically inactive (different) individuals, or by effectively converting economically inactive individuals to economically active (same) individuals, or some combination of both.

This effect is therefore a combination of the effects of both subgroups on GDP: the positive effects of simultaneously increasing the economically active population and decreasing the economically inactive population by equal numbers, thereby retaining the same overall total population. An estimate of this effect may be desirable if, for example, the government were considering implementing a job-training programme for currently unemployed individuals.

In this scenario, both effects reflect the population-level average effects of changing the relative numbers (i.e. the proportions) of economically active individuals to alter GDP, but by different mechanisms; they therefore reflect distinct causal quantities, the utility of which must be determined by context.
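The distinction between the two quantities can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration rather than part of the original analysis: the per-person contributions `BETA_ACTIVE` and `BETA_INACTIVE`, the population distributions, the linear outcome model, and the independence of the two subpopulations across areas (i.e. no confounding). Under these assumptions, substituting inactive = total - active into the outcome model shows that the coefficient for the active population is `BETA_ACTIVE` without conditioning on the total, and `BETA_ACTIVE - BETA_INACTIVE` when conditioning on it.

```python
import random

random.seed(0)

# Hypothetical per-person contributions to GDP (illustrative values only).
BETA_ACTIVE = 3.0
BETA_INACTIVE = -1.0

n = 5000
# Simulate the smallest components first, independently of one another ...
active = [random.gauss(500, 50) for _ in range(n)]
inactive = [random.gauss(300, 40) for _ in range(n)]
# ... then construct the deterministic total from them.
total = [a + i for a, i in zip(active, inactive)]
gdp = [BETA_ACTIVE * a + BETA_INACTIVE * i + random.gauss(0, 20)
       for a, i in zip(active, inactive)]


def centred_cross_product(x, y):
    """Sum of products of deviations from the means of x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def slope_unadjusted(x, y):
    """OLS slope of y on x alone."""
    return centred_cross_product(x, y) / centred_cross_product(x, x)


def slope_adjusted(x, z, y):
    """OLS coefficient of x in a regression of y on both x and z."""
    sxx = centred_cross_product(x, x)
    szz = centred_cross_product(z, z)
    sxz = centred_cross_product(x, z)
    sxy = centred_cross_product(x, y)
    szy = centred_cross_product(z, y)
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)


# Adding active individuals (total population free to grow).
effect_adding = slope_unadjusted(active, gdp)
# Swapping inactive for active individuals (total population held fixed).
effect_swapping = slope_adjusted(active, total, gdp)

print(round(effect_adding, 2), round(effect_swapping, 2))
```

With these illustrative values the first estimate should be close to 3 and the second close to 4. Neither is a biased version of the other; they answer different questions.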

Ordinarily, conditioning on a collider may be considered as introducing ‘collider bias’ (Section 2.1.11) into an analysis; in this context, however, conditioning on a collider provides an interpretable causal quantity which has real utility in certain situations.[81]

When simulating composite variables and compositional data, it is necessary to simulate the smallest components of interest and use these to construct the variables of interest. In the case of composite variables, this is because the composite is not directly measurable in nature and its unit of measurement is a combination of the units of its components. In the case of compositional data, this is because the whole is a constraint on its components.

When causal inference is undertaken, it has been recommended that the (often hypothetical) intervention is well-defined.[19] This is linked to the key condition of consistency introduced in Section 2.1.2. Returning to the example of BMI and CVD risk, there is a question over whether BMI is a useful proxy for a more clearly defined concept (e.g. adiposity), or whether it is simply a measure of weight standardised by height (i.e. to account for the fact that taller people are generally heavier). If BMI is considered to be a useful proxy for adiposity, then analysing it as a variable in its own right may be acceptable. However, if BMI is considered to be a measure of weight standardised by height, then it must be considered whether $\text{weight}/\text{height}^{2}$ truly represents the most effective parameterisation of this concept. The total causal effect of BMI on CVD risk will likely differ from the total causal effect of weight on CVD risk, conditional on height; although both can theoretically be estimated without statistical bias, there is a risk of inferential bias if the effect estimate obtained does not accurately reflect the causal mechanism that is sought and that may eventually be a target for intervention.
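One way to see why the parameterisation matters is to write both models on the log scale. The following is a sketch under stated assumptions, not a model taken from the thesis: $Y$ denotes CVD risk, $g$ is an unspecified link function, and the coefficients $\alpha$, $\beta$, $\beta_{w}$ and $\beta_{h}$ are hypothetical.

```latex
\begin{align*}
\mathrm{BMI} &= \frac{\text{weight}}{\text{height}^{2}}
  \quad\Longrightarrow\quad
  \log \mathrm{BMI} = \log(\text{weight}) - 2\,\log(\text{height}), \\[4pt]
\text{BMI model:}\quad
  g\{E[Y]\} &= \alpha + \beta \log \mathrm{BMI}
             = \alpha + \beta \log(\text{weight}) - 2\beta \log(\text{height}), \\[4pt]
\text{weight adjusted for height:}\quad
  g\{E[Y]\} &= \alpha + \beta_{w} \log(\text{weight}) + \beta_{h} \log(\text{height}).
\end{align*}
```

Under these assumptions the BMI model is the special case $\beta_{h} = -2\beta_{w}$ of the weight-adjusted-for-height model: using BMI fixes in advance how height enters, whereas adjusting weight for height leaves that relationship to be estimated.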

Whether ‘obesity’ can be interpreted as a definable exposure with an identifiable causal effect has previously been challenged; in particular, there are concerns that obesity fails to satisfy the consistency assumption required for causal inference (Section 2.1.5) because it can represent multiple states, including high adiposity and high muscle mass.[19, 82] The same concern is relevant for BMI (and all composite variables), since any value of the composite may represent various combinations of its determining component parents. Hypothesising that BMI ‘causes’ an increased risk of CVD implies that intervening to lower BMI would result in a decreased risk of CVD. Theoretically, BMI could be lowered by either decreasing weight or increasing height. Realistically, however, weight is the more likely target for intervention. Regardless of the philosophical perspective on the utility and validity of BMI, this suggests that it might actually be more useful to estimate the causal effect of weight adjusted for height.
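The ambiguity of a composite value can be made concrete with a short sketch. The individuals and the two "interventions" below are hypothetical and serve only to show that one BMI value, or one change in BMI, can correspond to very different states of the underlying components.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Two hypothetical individuals with different components but the same composite.
person_a = bmi(72.0, 1.80)
person_b = bmi(50.0, 1.50)

# Two hypothetical routes to the same reduction in BMI for a person
# who starts at 81 kg and 1.80 m.
baseline = bmi(81.0, 1.80)
after_weight_loss = bmi(72.0, 1.80)                # route 1: lose 9 kg
height_needed = (81.0 / after_weight_loss) ** 0.5  # route 2: height giving the same BMI at 81 kg
after_height_gain = bmi(81.0, height_needed)

print(round(person_a, 2), round(person_b, 2))
print(round(baseline, 2), round(after_weight_loss, 2), round(after_height_gain, 2))
print(round(height_needed, 3))
```

Both routes arrive at the same BMI, yet only the first corresponds to a plausible intervention, which is the practical argument for targeting weight (adjusted for height) rather than the composite.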

These issues are considered for each simulation based on observed data later in the thesis, with a focus on health geography data. Further complexities that may arise from composite variables are explored in the final chapter.

In the next section, an illustration of the ‘modifiable areal unit problem’ (MAUP) is given, as this is important for the upcoming chapters, particularly Chapter 6.

In document Causal inference methods and simulation approaches in observational health research within a geographical framework (Page 78-82)