2.3 Problems with causal modelling

2.3.1 Simpson’s paradox

Suppose you are a doctor reading a paper on a promising new treatment that seems to be very effective. Excited by the discovery, you look up the trial data to learn more. You look at the data on male patients and find that there is actually no correlation between treatment and recovery. “Well”, you will probably think, “this drug must work very well for female patients”. Then you look at the data on women and discover that, in this case too, the treatment does not seem to be effective. How is this possible? How can a drug be ineffective for both male and female patients, yet good for people overall? It is clearly problematic if two different analyses of the same data lead you to take two opposite actions (giving or withholding the treatment).

This problem is known as ‘Simpson’s paradox’ and takes its name from the statistician Edward Simpson (1951), who discovered it. It is based on a statistical phenomenon according to which, in some datasets, subgroups with a common trend (for instance a negative trend) show the reverse trend (a positive trend) when they are aggregated. To illustrate this paradox, let us consider the example proposed by Meek and Glymour (1994, p. 1012). Suppose the data were collected from a study involving 990 patients who divided themselves into a control group (610 patients) and a treatment group (380 patients). Table 1 shows the number of survivors and deaths in the two groups:

          Control Group    Treatment Group
Alive     260              240
Dead      350              140

Table 1. Survival rates in the control and treatment groups.

Comparing the control group and the treatment group in Table 1, one would conclude that there is a positive association between treatment and survival. Indeed, while in the control group only 260/610 patients survived (43%), in the treatment group 240/380 patients survived (63%).

Suppose that we now look at the same data by distinguishing between male and female patients, as shown in Tables 2 and 3.

Male

          Control Group    Treatment Group
Alive     160              40
Dead      320              80

Table 2. Survival rates in the male population.

Female

          Control Group    Treatment Group
Alive     100              200
Dead      30               60

Table 3. Survival rates in the female population.

Table 2 shows that there is no correlation between treatment and recovery among males: 40/120 (33.3%) of the patients in the treatment group survived, as did 160/480 (33.3%) of the patients in the control group. The survival rate was therefore the same in both groups. Table 3 shows, furthermore, that 200/260 (77%) of the female patients who were treated survived, the same percentage as in the control group, where 100/130 (77%) of the untreated women survived. Overall, then, there was no association between treatment and survival among either male or female patients.
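The percentages in Tables 1 to 3 can be recomputed from the raw counts. The following Python sketch is only an illustration: the dictionary layout and the helper names (`survival_rate`, `aggregate`, `female_share`) are assumptions made here and are not part of Meek and Glymour's presentation. It also reports what share of each group is female, since the two groups differ markedly in their sex composition.

```python
# (alive, dead) counts per sex and group, as given in Tables 2 and 3
# and in the accompanying text.
counts = {
    "male": {"control": (160, 320), "treatment": (40, 80)},
    "female": {"control": (100, 30), "treatment": (200, 60)},
}


def survival_rate(alive, dead):
    """Share of patients who survived."""
    return alive / (alive + dead)


def aggregate(group):
    """Sum the alive/dead counts over both sexes for one group."""
    alive = sum(counts[sex][group][0] for sex in counts)
    dead = sum(counts[sex][group][1] for sex in counts)
    return alive, dead


def female_share(group):
    """Fraction of the patients in a group who are women."""
    return sum(counts["female"][group]) / sum(aggregate(group))


# Survival rates within each sex.
for sex in counts:
    for group in ("control", "treatment"):
        print(sex, group, round(survival_rate(*counts[sex][group]), 3))

# Survival rates after aggregation, with the sex composition of each group.
for group in ("control", "treatment"):
    print("all", group, round(survival_rate(*aggregate(group)), 3),
          "female share:", round(female_share(group), 3))
```

Run as a script, this reproduces the aggregate counts of Table 1 and shows identical survival rates within each sex (about 0.333 for men and 0.769 for women), a higher aggregate rate in the treatment group (about 0.632 against 0.426), and that women make up about 21% of the control group but about 68% of the treatment group.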

There is an explanation for this paradox: when the data on women and men are aggregated, the overall survival rate in each group is a weighted average of the sex-specific rates, with weights given by how many men and women are in that group. Women were more likely to survive than men whether or not they were treated, and a much larger proportion of women than of men received the treatment. The treatment group therefore consisted mostly of women and the control group mostly of men, so the survival rate in the treatment group was higher than in the control group.

Simpson’s paradox is not a rare problem found only in fictional situations. In 1973, the Associate Dean of the graduate school of the University of California, Berkeley, looking at the admissions data, hypothesised that there was a sex bias in the admission process. The data were presented in a table like Table 4, and it appeared that the denial rate among women was markedly higher than the denial rate among men.

Applicants    Admitted    Denied
Female        1494        2827
Male          3738        4704

Table 4. Acceptance rates to the University of California Berkeley in 1973.
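The gap that prompted the hypothesis of a sex bias can be made explicit by turning the counts of Table 4 into rates. The short Python sketch below does only this arithmetic; the variable and function names are illustrative choices made here.

```python
# (admitted, denied) counts as reported in Table 4.
table4 = {"female": (1494, 2827), "male": (3738, 4704)}


def rates(admitted, denied):
    """Return the admission rate and the denial rate for one group."""
    total = admitted + denied
    return admitted / total, denied / total


for sex, (admitted, denied) in table4.items():
    admit_rate, deny_rate = rates(admitted, denied)
    print(f"{sex}: {admitted + denied} applicants, "
          f"admitted {admit_rate:.1%}, denied {deny_rate:.1%}")
```

On these figures about 65% of female applicants were denied admission, against about 56% of male applicants.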

Some years later, in a famous paper, Bickel, Hammel and O’Connell (1975) explored the dataset, examining in detail the admission percentages for each department. Thanks to this department-by-department analysis, they were able to claim that there was no bias against women in the admission process. Indeed, by considering the admission rate per department they generated a table similar to the following:

Department    Male                           Female
              Applications    Admissions     Applications    Admissions
A             825             62%            108             82%
B             560             63%            25              68%
C             325             37%            593             34%
D             417             33%            375             35%
E             191             28%            393             24%
F             373             6%             341             7%

Table 5. Acceptance rates per department to the University of California Berkeley in 1973.

As shown in Table 5, the aggregate admission rates differed because female applicants tended to apply to departments that were more difficult to get into, as the data for department C illustrate. When the data are properly partitioned, hence, it appears not only that there was no bias against women, but that there was a small (but statistically significant) bias in favour of women.
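The role played by the choice of department can be illustrated by pooling the departmental rates with the number of applications as weights. The Python sketch below uses rows A to E of Table 5 only, which is an illustrative subset chosen here, and treats the rounded percentages as exact; the function names `pooled_rate` and `share_applying_to` are likewise assumptions made for the example.

```python
# (applications, admission rate) for each sex, rows A to E of Table 5.
departments = {
    "A": {"male": (825, 0.62), "female": (108, 0.82)},
    "B": {"male": (560, 0.63), "female": (25, 0.68)},
    "C": {"male": (325, 0.37), "female": (593, 0.34)},
    "D": {"male": (417, 0.33), "female": (375, 0.35)},
    "E": {"male": (191, 0.28), "female": (393, 0.24)},
}


def pooled_rate(sex):
    """Application-weighted average of the departmental admission rates."""
    total = sum(d[sex][0] for d in departments.values())
    admitted = sum(d[sex][0] * d[sex][1] for d in departments.values())
    return admitted / total


def share_applying_to(sex, names):
    """Fraction of a sex's applications sent to the named departments."""
    total = sum(d[sex][0] for d in departments.values())
    return sum(departments[name][sex][0] for name in names) / total


for sex in ("male", "female"):
    print(sex,
          "pooled admission rate:", round(pooled_rate(sex), 3),
          "share applying to A or B:", round(share_applying_to(sex, ["A", "B"]), 3))
```

On these five rows the pooled admission rate is about 51% for men and about 36% for women, even though women have the higher rate in departments A, B and D. About 60% of the male applications, but only about 9% of the female applications, went to the two departments with the highest admission rates, A and B.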

While the two examples described above are among the most frequently quoted examples of Simpson’s paradox, several gender bias studies and analyses of sports data have also provided clear examples of this phenomenon. For instance, it has been shown that evidence in support of gender bias in science funding is likely to suffer from this paradox (Albers, 2015). In basketball, in addition, a team can have the worse shooting percentage in each category and yet the better overall field goal percentage (Ma & Ma, 2011). In baseball, furthermore, a player A can have a higher batting average than another player B in each of three consecutive seasons, and yet, when the three seasons are combined and how often each player played is taken into account, player B has the higher average (Pearl & Mackenzie, 2018). More examples of this paradox, in both the natural and the social sciences, can be found in Selvitella’s paper ‘The ubiquity of the Simpson’s Paradox’ (2017).

It is now possible to examine how Simpson’s paradox can cause bias in the specific case of BNs. Let us consider again the treatment example discussed above: the treatment apparently worked for the whole population but was ineffective when analysed in the male and female subpopulations. By analysing the overall population divided into a treatment and a control group, researchers would consider just two variables, T (treatment) and R (recovery), and would obtain a causal DAG showing an arrow between T and R, as illustrated in Figure 3(a). This would lead them to give a causal interpretation to the arrow and to conclude that the treatment causes recovery. The probabilistic dependence between T and R, however, is actually due to the common cause G (gender), which influences both who receives the treatment and who recovers, as illustrated in Figure 3(b). Depending on how the control and treatment groups are partitioned, the variable G might not be included in the dataset, with the consequence that the subset of variables analysed might lead scientists to infer a causal relationship that does not exist in reality.

Figure 3. The possible DAGs between treatment, recovery and gender. 3(a) illustrates the probabilistic dependence between T (treatment) and R (recovery), if G (gender) is not considered; 3(b) illustrates the probabilistic dependencies between G (gender), T (treatment) and R (recovery).
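The structure in Figure 3(b) can also be read as a recipe for generating data. As a rough illustration, the Python sketch below simulates patients from a model in which G influences both T and R while T has no effect on R at all. The probabilities are derived from the counts in Tables 1 to 3; the function names, the sample size and the seed are arbitrary choices made here, and the simulation is not part of the original analysis.

```python
import random


def simulate(n=200_000, seed=0):
    """Draw (gender, treated, recovered) triples from G -> T, G -> R, with no T -> R arrow."""
    rng = random.Random(seed)
    p_female = 390 / 990                                   # 390 of the 990 patients are women
    p_treated = {"female": 260 / 390, "male": 120 / 600}   # treatment depends on G
    p_recover = {"female": 300 / 390, "male": 200 / 600}   # recovery depends on G only
    rows = []
    for _ in range(n):
        g = "female" if rng.random() < p_female else "male"
        t = rng.random() < p_treated[g]
        r = rng.random() < p_recover[g]                    # T is deliberately not used here
        rows.append((g, t, r))
    return rows


def recovery_rate(rows, treated, gender=None):
    """Recovery rate among treated or untreated patients, optionally within one gender."""
    selected = [r for g, t, r in rows
                if t == treated and (gender is None or g == gender)]
    return sum(selected) / len(selected)


def effect(rows, gender=None):
    """Difference in recovery rate between treated and untreated patients."""
    return recovery_rate(rows, True, gender) - recovery_rate(rows, False, gender)


rows = simulate()
print("ignoring G:", round(effect(rows), 3))
for g in ("male", "female"):
    print("within", g + ":", round(effect(rows, g), 3))
```

With a sample of this size, the difference in recovery rates should come out at roughly 0.2 when G is ignored and close to zero within each gender, although the treatment does nothing in the simulated model. This is the pattern that a two-variable analysis of T and R would misread as a causal arrow.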

Someone might argue that Simpson’s paradox causes problems especially when researchers analyse small datasets, given that there is a greater risk of not considering relevant variables. In big data studies, by contrast, scientists are able to analyse numerous variables simultaneously, so the risk of ignoring important subpopulations could be drastically reduced. This consideration is particularly relevant in specific cases such as medical studies, where it is known that patients’ characteristics such as age and gender can influence physical conditions and recovery, and where such variables are generally available.

In other cases, however, it might be difficult to imagine how to partition the data or, in other words, which subpopulations could be relevant to the research question. Studies of human behaviour, for instance, generally rely on heterogeneous observational data generated by subpopulations of different sizes. Such behavioural data can now be collected quite easily (for instance, data about consumer behaviour are collected through the Internet and by platforms such as Amazon, while travel behaviour can be monitored through geolocated data), but at the individual level they often appear very sparse and noisy. It is for this reason that researchers might prefer to aggregate data and to analyse behaviour at the population level. Some examples are the use of population-level data for online activities (such as shopping online) and for diurnal and seasonal mood rhythms (the changes in mood through the day or in different seasons) (Lerman, 2018).

The problem that emerges, however, is that when hundreds or thousands of behavioural data are aggregated, it is not easy to identify the relevant subpopulations. In an online social network, for instance, a large number of variables such as sex, age, occupation, income, education, nationality, average online activity, the number of words per post, and the number of followers might need to be considered in order to rule out possible confounders. Some of these factors could be difficult to measure. Furthermore, such variables might be collected in different datasets, and getting access to them could be difficult or very expensive. In such a situation, what to include would ultimately depend on the researchers’ decisions (with the risk of using inappropriate measures or ignoring important factors). Finally, even when all the possibly relevant variables are available and easily measurable, scientists are not immune to inferential errors: too many variables might give rise to the curse of dimensionality, which is what I explore in the next section.

In document Inferring causation from big data in the social sciences (Page 31-37)