
2.3 Problems with causal modelling

2.3.2 High dimensional datasets

Most datasets created in recent years are high dimensional, that is, they contain a large number of features. When the number of features exceeds the number of observations, a dataset is said to suffer from the ‘curse of dimensionality’ (Bühlmann & Geer, 2011, p. 1). Suppose we have a dataset containing 100 high-resolution images (observations). Each image is composed of thousands or even millions of pixels (on Flickr, for instance, an image can contain 2048x2048 pixels, i.e. more than four million), and each pixel can be understood as a feature of the image. We therefore have a huge number of pixels (features) for each image (observation), while the total number of images is very small compared to the number of pixels. The dataset is hence characterised by high dimensionality. This situation arises in several contexts. Financial datasets, for instance, can contain observations measured daily, hourly, or even every minute or second, with hundreds of features for each time slice. Similarly, social networks allow hundreds of features to be collected for each user. Finally, scientific advancements in medicine mean that thousands of features are now collected for each individual, whereas in the past researchers typically worked with large samples of low dimension (Zeng et al., 2016).

It would be reasonable to expect that analysing high dimensional datasets with machine learning algorithms yields more accurate results. In practice, however, the opposite is very common. As the number of variables increases, the number of possible combinations of their values grows exponentially, so a fixed number of data points becomes increasingly ‘sparse’. Figure 4, for instance, represents two cases, one with two dimensions and one with three: when the dimensionality increases, the space in which the points lie grows as well. Consequently, while the data points in two dimensions are sufficient to find a pattern, the same data points in three dimensions are too sparse to enable researchers to select one pattern among all the other possible patterns.

Let us consider the case of BNs: as the number of nodes increases, the search space of possible causal relationships between the nodes grows super-exponentially. Furthermore, if the number of data points is very small compared to the number of nodes, the confidence in the probabilistic dependences represented in the DAG becomes very low. In other words, we cannot be sure that the complex DAG we are observing represents the correct causal relationships between the nodes under study.
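The growth of this search space can be made concrete by counting the DAGs that can be drawn on a given set of labelled nodes. The following is a minimal sketch that uses Robinson's recurrence, a standard combinatorial result that is not discussed in the text above; the function name `count_dags` is purely illustrative.

```python
from math import comb


def count_dags(n_nodes):
    """Number of distinct DAGs on n_nodes labelled nodes (Robinson's recurrence)."""
    counts = [1]  # there is exactly one (empty) DAG on zero nodes
    for n in range(1, n_nodes + 1):
        total = 0
        for k in range(1, n + 1):
            # Inclusion-exclusion over the k nodes that have no incoming arrow.
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(n, k) * 2 ** (k * (n - k)) * counts[n - k]
        counts.append(total)
    return counts[n_nodes]


dag_counts = [count_dags(n) for n in range(1, 6)]
print(dag_counts)  # [1, 3, 25, 543, 29281]
```

With only five nodes there are already 29,281 candidate structures, which suggests why a small number of observations is rarely enough to single out one DAG with confidence.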

Figure 4. Data points represented in two-dimensional space and in three-dimensional space.
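The sparsity described above can also be illustrated numerically. The sketch below is only an illustration under stated assumptions: it draws 100 uniformly distributed points (a hypothetical sample size), divides each axis of the unit hypercube into 10 bins, and reports the share of cells that contain at least one point. The function name, the grid resolution, and the random seed are arbitrary choices.

```python
import random


def occupied_fraction(n_points, n_dims, bins_per_axis=10, seed=0):
    """Share of grid cells in the unit hypercube that contain at least one point."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Map each uniformly drawn coordinate to the index of its bin.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / bins_per_axis ** n_dims


fractions = {d: occupied_fraction(100, d) for d in (1, 2, 3, 6)}
for d, share in fractions.items():
    print(f"{d} dimension(s): {share:.6f} of cells contain data")
```

Because 100 points can occupy at most 100 cells, the occupied share cannot exceed 10% in three dimensions and 0.01% in six, however the points happen to fall.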

A phenomenon closely related to the curse of dimensionality is ‘overfitting’. When a new algorithm is developed, scientists use training data to train it. Overfitting occurs when the model produced by the algorithm is very accurate on the training data but much less accurate on new, unseen data. In Figure 5, for example, the curved line follows the training data more closely than the dashed line, but it might be too dependent on those data. The algorithm producing the curved line, therefore, might have a higher error rate when used with new data. This happens because the set of training data points is too small compared to the number of variables analysed, and the algorithm runs the risk of modelling not only the general patterns in the data, but also the idiosyncrasies of that specific dataset, which are unlikely to recur in further data (Hitchcock & Sober, 2004). As a result, the model performs poorly when new data are analysed. In BNs, for instance, overfitting typically takes the form of a fully linked DAG, where the number of arrows is so high that all the nodes of the DAG are linked to each other.

Figure 5. An example of overfitting (Silipo, 2007, p. 287).
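The contrast between the two lines can be reproduced with a small numerical sketch. Everything in it is an assumption made for illustration: the training data are eight hand-picked points lying on the trend y = 2x plus an alternating disturbance of +1 or -1, the ‘simple’ model is a least-squares straight line, and the ‘flexible’ model is the polynomial that passes exactly through every training point. It is not a reconstruction of the example in Silipo (2007).

```python
def fit_line(xs, ys):
    """Least-squares straight line (the 'simple' model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Polynomial through every training point (the 'flexible' model)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0  # Lagrange basis polynomial for point i, evaluated at x
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy training data: the trend y = 2x plus a hand-picked alternating disturbance.
train_x = [float(i) for i in range(8)]
noise = [1.0 if i % 2 == 0 else -1.0 for i in range(8)]
train_y = [2.0 * x + e for x, e in zip(train_x, noise)]

# New data from the same trend (disturbance omitted for simplicity).
test_x = [i + 0.5 for i in range(7)]
test_y = [2.0 * x for x in test_x]

line = fit_line(train_x, train_y)
curve = fit_interpolant(train_x, train_y)

train_mse_line, train_mse_curve = mse(line, train_x, train_y), mse(curve, train_x, train_y)
test_mse_line, test_mse_curve = mse(line, test_x, test_y), mse(curve, test_x, test_y)

print(f"training error: line {train_mse_line:.3f}, curve {train_mse_curve:.3f}")
print(f"test error:     line {test_mse_line:.3f}, curve {test_mse_curve:.3f}")
```

The flexible curve reproduces the training points exactly, yet on the new points its error is far larger than that of the straight line, because it has also modelled the disturbances specific to the training set.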

Such considerations bring to the fore what Floridi has called the ‘small pattern’ problem of big data (Floridi, 2012): the vast amount of information available means that it can be more difficult than in the past to spot where the new relevant patterns lie. Given the growing number of dimensions available for each research question, how can researchers select the dimensions that really help both to uncover important patterns and to avoid the curse of dimensionality? To give an example, thousands of genes (features or parameters) are monitored for each person, but for a specific disease only a fraction of them are biologically relevant. It is crucial, hence, to identify those parameters (i.e. dimensions) that can be used for prediction and diagnosis.

In the statistical and computer science literature several methods have been proposed to avoid the curse of dimensionality. One of the most common strategies is to reduce the dimensions of the datasets. Feature transformation techniques, for instance, are data pre-processing methods that transform the original dimensions of a dataset into a more compact set of features or parameters, while retaining as much information as possible (a typical example is Principal Component Analysis). Feature selection algorithms, on the other hand, take an alternative approach to dimension reduction and select the most relevant subset of the available dimensions. Finally, wrapper algorithms use the classification accuracy of some classifier as a criterion to reduce dimensionality (for more details see Cunningham, 2008; Guyon & Elisseeff, 2003).
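As an illustration of the feature selection approach, the sketch below ranks features by the absolute value of their Pearson correlation with the outcome and keeps the top k. This is only one simple selection criterion among many, chosen here as an assumption for the example; the feature names and the toy values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, k):
    """Keep the k features most strongly correlated (in absolute value) with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical toy dataset: six observations, four candidate features.
target = [1, 2, 3, 4, 5, 6]
columns = {
    "feature_1": [2, 4, 6, 8, 10, 12],  # rises with the target
    "feature_2": [5, 1, 4, 2, 6, 3],    # almost unrelated to the target
    "feature_3": [6, 5, 4, 3, 2, 1],    # falls as the target rises
    "feature_4": [3, 3, 1, 6, 2, 5],    # weakly related to the target
}

selected = select_features(columns, target, k=2)
print(sorted(selected))  # ['feature_1', 'feature_3']
```

A wrapper algorithm would replace the correlation score with the accuracy that a classifier achieves on each candidate subset of features.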

Furthermore, to avoid some of the risks associated with the curse of dimensionality and to select the right ‘small patterns’, researchers generally work with supervised machine learning algorithms, which are trained on labelled datasets containing input objects together with the desired output value for each object. To give a simple example, an algorithm can learn to classify animals (such as dogs and cats) after being trained on a dataset of photos properly labelled with the corresponding species and some identifying features.
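A minimal sketch of supervised learning, under simplifying assumptions, is given below. Instead of photos it uses two hypothetical numerical features per animal (weight in kilograms and height in centimetres), and the classifier is a nearest-centroid rule chosen only for brevity; the measurements are invented for the example.

```python
def train_centroids(samples):
    """Compute the average feature vector (centroid) of each label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(centroids, features):
    """Assign the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Labelled training data: (weight in kg, height in cm) with the desired output.
training_data = [
    ((4.0, 25.0), "cat"), ((5.0, 23.0), "cat"), ((3.5, 24.0), "cat"),
    ((20.0, 55.0), "dog"), ((30.0, 60.0), "dog"), ((25.0, 58.0), "dog"),
]

centroids = train_centroids(training_data)
print(classify(centroids, (4.5, 26.0)))   # cat
print(classify(centroids, (22.0, 50.0)))  # dog
```

The labels supplied by the researcher are what tell the algorithm which pattern to look for, which is the sense in which the learning is supervised.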

There are cases where researchers use unsupervised algorithms, which are thought to be able to recognise processes and patterns without any human guidance. However, in order to improve the reliability of data-driven results, in most cases researchers prefer to train the algorithm. In some situations, moreover, scientists’ knowledge is also used to guide algorithms’ decisions after the training. As described above, for instance, in genetic studies each observation (such as each mRNA sample) can have hundreds or thousands of features (genes). Let us suppose researchers know that the expression of some specific genes depends on environmental conditions, but they are not sure which genes’ expression varies under which conditions. To identify the most useful features, researchers are working on statistical techniques that can incorporate external information (for instance, a candidate list of relevant genes may already exist, and researchers could decide to give those candidates a higher priority) (Liu et al., 2016).

Despite the use of both supervised algorithms and techniques to reduce the dataset’s dimensions, however, it is still difficult to rule out all the problems associated with the curse of dimensionality and overfitting. Racial algorithmic bias offers some illustrative examples of the consequences of such problems.

A case in point is that facial recognition algorithms are very likely to discriminate against minorities, and that this problem is caused by the high dimensionality of the training datasets. This phenomenon is due to the fact that in Western societies the data used to train algorithms often over-represent one population (for instance white people) and under-represent minorities (such as African American people). Due to this under-representation, the number of variables examined can be larger than the number of samples in the data. As a consequence, algorithms might start to overfit the data associated with the under-represented population, modelling idiosyncrasies as well, as illustrated above in Figure 5. This can cause critical situations: in some cases, for example, facial recognition algorithms are not able to categorise members of the under-represented population as ‘persons’ because they do not have all the idiosyncrasies the algorithms require for that classification. This has led to controversial situations where recognition algorithms categorised members of the under-represented population as animals.

In document Inferring causation from big data in the social sciences (Page 37-40)