2.3 Problems with causal modelling
2.3.3 Data and meta-data quality
Imagine you have enormous amounts of data about what people search for on the Internet. You assume that a relevant proportion of people with flu-like symptoms go to the web to look for information on what to do. Thanks to your data, you can select the search terms that these people are most likely to use, and then build a sophisticated algorithm
to track where and with what frequency such search terms are used in a specific geographical area.
This is exactly what happened when the Google Flu Trends project was developed in 2008. Google engineers used Google data to identify the search terms most strongly correlated with flu by considering both search query trends from 2003 to 2007 and the actual local numbers of people with influenza-like illnesses over the same period. In this way, they selected a list of 45 terms that people searched for on the Internet during the peak flu weeks. The Google Flu Trends algorithm was then developed to identify when and where such terms were searched for through Google. As many studies have already reported (Butler, 2013; Lazer et al., 2014), the project was far from a success. In 2009, the algorithm badly underestimated the Influenza Virus A (H1N1) Pandemic (Cook et al., 2011), while in 2010 it was demonstrated that the algorithm’s estimates were no more accurate than the results obtained with a fairly simple forward projection based on traditional data (Lazer et al., 2014, p. 1203).
As anyone can guess, the main problems of Google Flu Trends had to do with the initial assumption and the data used. It is indeed questionable whether so many people actually look up their symptoms and information about the influenza virus on the Internet, and it is even more questionable whether the quality of the data used was sufficiently high. These two problems are closely linked: if only a small proportion of people looked for information about flu symptoms and remedies on Google, the data could be unrepresentative of the population under study. For instance, it might happen that only specific subpopulations (such as parents of new-born babies) search for flu symptoms on the Internet, while other members of the population (such as those aged 65 and over) do not rely on the Internet but prefer to contact their doctors.
After the failure of the program, furthermore, it turned out that the data selected and analysed by the algorithm had several problems. First, some scientists found that Internet search behaviour changed during the Influenza Virus A (H1N1) Pandemic: the complications of that flu differed from the usual influenza complications, so people searched for terms that were not included in the original Google Flu Trends list (Cook et al., 2011) and that Google Flu Trends therefore did not take into account. Second, the data-generating process used by Google was not stable: both this process and
the way in which Google recommended some searches were changed over time to improve Google’s service. The direct consequences were that data collected through different algorithms were not comparable, and that their reliability was affected by changes in search recommendations (Lazer et al., 2014).
The description of this case helps me to highlight another problem for which, I argue, there are no magical solutions: even with the best algorithms, data need to be properly cleaned and curators need to organise them by attaching additional information known as meta-data. Such meta-data, in general, consist of detailed information concerning the origin of the data, the methods used to collect them, the research aims and the protocols used during the collection (Leonelli, 2014a; Taylor et al., 2008). If something goes wrong, if data are not sufficiently clean or if meta-data are not sufficiently detailed, even the best algorithms will not produce reliable results.
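The kinds of meta-data listed above can be represented as a small record attached to each dataset. The following is only a minimal sketch: the class name `MetaData`, its four field names and the example values are my own illustrative assumptions, not a standard used by any archive.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MetaData:
    """Hypothetical meta-data record; the field names are illustrative only."""
    origin: Optional[str] = None             # where the data come from
    collection_method: Optional[str] = None  # how the data were collected
    research_aim: Optional[str] = None       # why the data were collected
    protocol: Optional[str] = None           # protocol followed during collection


def missing_fields(meta: MetaData) -> List[str]:
    """Return the names of the fields that are empty or not filled in."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


def is_sufficiently_documented(meta: MetaData) -> bool:
    """Treat a record as usable only when every field has been filled in."""
    return not missing_fields(meta)


# Invented example records.
complete = MetaData(
    origin="search logs",
    collection_method="weekly query counts",
    research_aim="flu surveillance",
    protocol="aggregation by region",
)
partial = MetaData(origin="search logs", research_aim="flu surveillance")
```

A check of this kind does not say whether the meta-data are accurate; it only makes visible which pieces of information are absent before an analysis starts.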
In the Google Flu Trends example, for instance, two problems could have caused biased results. On the one hand, all or some of the data could have been biased. The quality of the data used by Google Flu Trends, consequently, would have been insufficient to ensure the reliability of the results. For example, the first algorithm used by Google might have generated low-quality data before being improved. To avoid this problem, researchers should have used only the data collected through the second (improved) algorithm. On the other hand, the problem could have been caused by a lack of information about the way in which the data were collected. In such a case, the insufficient information ‘attached to’ the data would have been responsible for the failure of the program. Scientists could have avoided this problem by adding a new variable to the dataset to distinguish between the data generated by the first algorithm and the data collected through the second algorithm, or curators might have used meta-data to put the dataset in order.
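The remedy of adding a variable that records which algorithm produced each observation can be sketched as follows. All the records, the counts and the `collector` labels `v1` and `v2` are invented for illustration; nothing here reproduces the actual Google data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical weekly records. The 'collector' variable is the added
# provenance variable: it says which algorithm generated the observation.
records = [
    {"week": 1, "query_count": 120, "collector": "v1"},
    {"week": 2, "query_count": 130, "collector": "v1"},
    {"week": 3, "query_count": 210, "collector": "v2"},
    {"week": 4, "query_count": 220, "collector": "v2"},
]


def split_by_collector(rows):
    """Group the records by the algorithm that generated them."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["collector"]].append(row)
    return dict(groups)


def mean_count(rows):
    """Average query count over a list of records."""
    return mean(row["query_count"] for row in rows)


groups = split_by_collector(records)

# Pooling mixes two data-generating processes in a single figure.
pooled_mean = mean_count(records)

# With the provenance variable, each process can be summarised separately,
# or the analysis can be restricted to the second (improved) algorithm.
per_collector_mean = {name: mean_count(rows) for name, rows in groups.items()}
v2_only = groups["v2"]
```

In this toy dataset the jump between weeks 2 and 3 coincides with the change of collector, so without the extra variable it could be mistaken for a change in search behaviour.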
In the literature, one of the earliest discussions of the core innovations brought about by the emergence of big data focused on the idea of messiness. According to some authors (Leonelli, 2014b; Mayer-Schönberger & Cukier, 2013), we should not underestimate the fact that big data are very messy, vary in quality and can be stored in many different datasets. This means that a huge amount of manual labour is generally required to assemble big data and make it possible to integrate them. Leonelli (2012b, 2014b) pointed out the importance of these human decisions by
describing the different problems that database curators can encounter when new biological databases are created. In other scientific fields, similar discussions emerged when the researchers responsible for new databases started to clarify how data are repaired, curated and documented. For instance, on the website of the UK Data Archive (http://www.data-archive.ac.uk) it is possible to read how curators clean and prepare the new data that will be added to the dataset.
Curators spend a considerable amount of time adding meta-data that are made available to dataset users. The assumption behind this kind of activity is that researchers need some information about their data before analysing them through machine learning algorithms. Meta-data can help researchers to avoid mixing data obtained in incompatible ways, as in the Google Flu Trends case, and can also allow researchers to assess the quality and the suitability of the data in relation to their specific research questions. For instance, researchers might decide to conduct new studies because they think that the method used to collect the data is not sufficiently reliable, or might choose particular data because the research goal for which they were collected is very similar to the research question they are trying to answer.
Adding meta-data, however, is not a trivial task. Due to the lack of standard terminology to describe how data are collected, capturing the relevant aspects of data production can be remarkably challenging. For instance, some experimental practices or algorithms’ characteristics might not be intelligible to researchers coming from different fields, or some pieces of information regarding the data collection process might be omitted even if, for certain purposes, they might be vital parameters (Leonelli, 2014a). Consequently, it is not rare to have datasets whose meta-data are partly incomplete, from which it follows that it is not rare to have data studies based on unsuitable data. This is one of the main reasons why, even with big data and machine learning algorithms, data quality can still threaten the possibility of good statistical results.
Some examples can help to clarify this problem. In a famous work, Petricoin et al. (2002) reported having found a method to distinguish between serum samples from women with ovarian cancer, serum samples from healthy women and samples from women with benign ovarian disease. The algorithm used for the diagnosis was trained with molecular data obtained through mass spectrometry, called spectra. It was trained on 50 normal spectra and 50 cancer spectra and then classified 116 further spectra, correctly identifying 47 out of 50 normal spectra and all 50 cancer spectra, and labelling all 16 benign spectra as ‘other’. The result was so exciting that the U.S. Congress was urged to increase the funding to develop the diagnostic test (Check, 2004). However, when Baggerly, Morris and Coombes (2004) tried to replicate the study, they did not obtain the same results. It was finally discovered that the three types of spectra used by Petricoin et al. had been pre-processed differently, and that these differences, which were not associated with the biology of cancer, were what the algorithm had recognised.
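How a pre-processing difference alone can produce an apparently excellent classifier can be illustrated with a toy simulation. This is not a reconstruction of the Petricoin et al. analysis: I assume one-feature 'spectra' drawn from the same distribution for both groups (so there is no biological signal at all), a constant offset of 5.0 standing in for the pre-processing difference, and a simple threshold classifier. Every name and number is hypothetical.

```python
import random


def make_samples(n, offset, rng):
    """Draw n one-feature 'spectra' from the same distribution, then add a
    constant offset standing in for a pre-processing difference."""
    return [rng.gauss(0.0, 1.0) + offset for _ in range(n)]


def fit_threshold(normal, cancer):
    """Place the decision threshold midway between the two training means."""
    mean_normal = sum(normal) / len(normal)
    mean_cancer = sum(cancer) / len(cancer)
    return (mean_normal + mean_cancer) / 2, mean_cancer > mean_normal


def accuracy(threshold, cancer_is_higher, normal, cancer):
    """Share of test samples assigned to their true group."""
    def predicts_cancer(x):
        return (x > threshold) == cancer_is_higher

    correct = sum(not predicts_cancer(x) for x in normal)
    correct += sum(predicts_cancer(x) for x in cancer)
    return correct / (len(normal) + len(cancer))


def run_experiment(cancer_offset, seed=0):
    """Train and test the classifier with a given pre-processing offset."""
    rng = random.Random(seed)
    train_normal = make_samples(100, 0.0, rng)
    train_cancer = make_samples(100, cancer_offset, rng)
    test_normal = make_samples(100, 0.0, rng)
    test_cancer = make_samples(100, cancer_offset, rng)
    threshold, cancer_is_higher = fit_threshold(train_normal, train_cancer)
    return accuracy(threshold, cancer_is_higher, test_normal, test_cancer)


# Same underlying 'biology' in both runs; only the pre-processing differs.
acc_with_artifact = run_experiment(cancer_offset=5.0)
acc_without_artifact = run_experiment(cancer_offset=0.0)
```

Under these assumptions the classifier separates the groups almost perfectly when the offset is present and performs close to chance when it is removed, which is the pattern one would expect if an algorithm had learned the processing history rather than the disease.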
Another case concerns some studies that used racial data from the National Health Index to explore possible causal links to mortality. It was discovered that, on many occasions, the officials who completed the death certificate determined racial status based on their own judgement rather than by asking family members, and that substantial percentages of American Indians, Asian Pacific Islanders and Hispanics were assigned to another racial category (Williams, 1999). The direct consequence of this problem was that any study based on such data could be biased.
Problems in data quality can have dramatic and expensive consequences: in 1999 NASA lost a Mars orbiter costing $125 million because engineers combined data measured in the metric system of millimetres and metres with data measured in the Imperial system of inches, feet and pounds (De Veaux & Hand, 2005, p. 235).
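The role that unit meta-data play in avoiding this kind of mistake can be shown with a short sketch. It covers lengths only and is not a description of the NASA software; the measurements and the function names are hypothetical, while the conversion factors are the standard definitions of the inch and the foot.

```python
# Conversion factors to metres (1 in = 0.0254 m and 1 ft = 0.3048 m exactly).
METRES_PER_UNIT = {"mm": 0.001, "m": 1.0, "in": 0.0254, "ft": 0.3048}


def to_metres(value, unit):
    """Convert a length to metres, refusing values whose unit is unknown."""
    if unit not in METRES_PER_UNIT:
        raise ValueError(f"unknown or missing unit: {unit!r}")
    return value * METRES_PER_UNIT[unit]


def total_length_metres(measurements):
    """Sum (value, unit) pairs after converting each one to metres."""
    return sum(to_metres(value, unit) for value, unit in measurements)


# Invented measurements recorded in three different units.
measurements = [(1500.0, "mm"), (2.0, "m"), (10.0, "ft")]

# Adding the raw numbers ignores the units and gives a meaningless figure.
naive_total = sum(value for value, _ in measurements)

# Using the unit attached to each value gives a comparable total in metres.
correct_total = total_length_metres(measurements)
```

The design choice here is to fail loudly: a measurement with no unit attached raises an error instead of being silently treated as metric.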
After the examples described above, it is easy to imagine a situation in which BNs, due to the poor quality of data or meta-data, could produce causal graphs with distorted relationships. As in the other cases, the solution is not straightforward: the processes of data production and curation require careful examination in order to avoid substantial mistakes like those described in this section.