The inductivist view of science rests on the error that a generalised prediction is tantamount to an explanation. The pattern of scientific reasoning behind inductivism has been heavily criticised, as the repetition of an observation cannot by itself justify any subsequent theory. Such criticism is memorably illustrated by Bertrand Russell [Rus01], with a chicken that observes the farmer feeding it each and every day and thus extrapolates that the farmer will continue this generous behaviour the following day. The farmer ultimately wringing the chicken’s neck shows that induction cannot justify any conclusion unless it is placed within an adequate framework. Given some insight into the farmer’s behaviour, the chicken might have been able to understand why it was being fed, and to predict its future death.

This sums up one of the major complications of data mining: its relation with causal inference. What the chicken had in mind was a ‘false’ explanation of the farmer’s behaviour, which led it to expect to be fed every day. Had it guessed a different explanation (that the farmer’s behaviour was driven by a more selfish motive), it would have extrapolated that behaviour differently. Through this illustration, Russell highlights the importance of explanation in the inductive framework: knowing the causes driving a specific dynamic allows a better ‘extrapolation’ of it.

Due to the above considerations, one may expect explanations in data science to be supported by a causal model. That way, induction is supported by deductive thinking and the data mining model can be explained through causal logic. Nevertheless, this condition is seldom fulfilled.

One of the best-known instruments for finding causal relationships is the Granger Causality test [Gra80], a powerful tool for assessing information exchange between different elements of a system, and for understanding whether the dynamics of one of them are led by the other(s).

2.2.1 Granger Causality

The basic idea behind Granger Causality can be traced back to Wiener [Wie56] who conceived the notion that, if the prediction of one time series is improved by incorporating the knowledge of a second time series, then the latter is said to have a causal influence on the first.

The GC test is based on two very simple ideas, which take the form of two axioms:

1. Causes must precede their effects in time.

2. Information relating to a cause’s past must improve the prediction of the effect above and beyond information contained in the collective past of all other measured variables (including the effect).

Granger [Gra88a, Gra88b] later formalised Wiener’s idea in the context of linear regression models. Specifically, two auto-regressive models are fitted to the first time series, with and without including the second time series, and the improvement of the prediction is measured by the ratio of the variances of the error terms (the model without the second series over the model with it). A ratio larger than one implies an improvement, hence a causal connection. At worst, this ratio is 1 and signifies causal independence between the two time series.
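The procedure just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the function names (`lagged_matrix`, `residual_variance`, `granger_variance_ratio`), the model order of 2, the ordinary least-squares fit with an intercept, and the synthetic series are all hypothetical choices made for the example.

```python
import numpy as np

def lagged_matrix(series, order):
    """Columns are series[t-1], ..., series[t-order] for t = order .. len-1."""
    n = len(series)
    return np.column_stack([series[order - k : n - k] for k in range(1, order + 1)])

def residual_variance(design, target):
    """Least-squares fit with an intercept; return the variance of the residuals."""
    design = np.column_stack([np.ones(len(target)), design])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ coeffs)

def granger_variance_ratio(x, y, order=2):
    """Ratio of error variances for predicting y: without past x over with past x.

    Values well above 1 suggest that x Granger-causes y; values close to 1
    suggest that past x adds nothing to the prediction of y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[order:]
    past_y = lagged_matrix(y, order)
    past_x = lagged_matrix(x, order)
    restricted = residual_variance(past_y, target)
    unrestricted = residual_variance(np.column_stack([past_y, past_x]), target)
    return restricted / unrestricted

# Synthetic example (assumed data): y follows x with a delay of one step.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()

print("x -> y ratio:", granger_variance_ratio(x, y))
print("y -> x ratio:", granger_variance_ratio(y, x))
```

In practice the ratio is usually converted into a test statistic (for instance an F-test on the two residual sums of squares) so that its statistical significance can be assessed; the raw ratio is shown here only because it is the quantity discussed in the text.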

Therefore, a time series X is considered to ‘Granger-cause’ another time series Y if the inclusion of past values of X can improve the process of forecasting the values of Y. In mathematical terms, suppose X and Y to be two stochastic processes. Let us denote by $\sigma^2(Y_t \mid U_t^-)$ the variance of the residual of predicting time series Y using the accumulated information of the entire universe U from infinite past to present (the latter denoted by $U_t^-$); additionally, let $\sigma^2(Y_t \mid U_t^- \setminus X^-)$ be the corresponding forecast error when X is excluded from the universe. Assuming the stationarity of the time series, the following definition of Granger Causality can be given:

Definition 2.1 (Granger causality). If $\sigma^2(Y_t \mid U_t^-) < \sigma^2(Y_t \mid U_t^- \setminus X^-)$, then X Granger-causes Y.

Since its inception, Granger’s and other derived causality metrics, such as co-integration or transfer entropy to name a few [Sch00, SL08, Ver05], have been applied in economics [Hoo01], biomedicine/neuroscience [BDL+04, KDTB01, RFG05] and air transport [CC09, MSF10, BY13].

Although Granger Causality has been used extensively in the analysis of real-world data, it has also been recognised that it presents several drawbacks [BS11]. From the point of view of data analysis, two have to be highlighted. First, this metric is linear, in the sense that it assesses the presence of linear couplings between the time series, while there are many real-world cases where information propagates in a non-linear fashion [FVB+99, AMS04]. Second, time series must be stationary, which requires detrending and pre-processing the data [HMAS03], and this is not always possible. In Section 5.2 we will show how to overcome these limitations and widen the possibilities for Granger’s causality metric.

2.2.2 Extreme Event Causality

Granger causality and its related metrics share a common characteristic: the temporal dimension required to define the relationship. Yet the dynamics of some physical processes might not be easily observable, e.g. gene-gene interactions, where a single measurement is performed per subject and per gene, thus precluding the use of the temporal dimension. One can go further and state that some causalities might even be present with no need for temporal dynamics. For instance, the genetic material of the offspring is caused by that of the parents, without any explicit dependence on time. Likewise, the correct location of a piece of a puzzle is defined by the global image, and not by the temporal sequence in which the pieces are arranged. It appears that some causalities must be discovered by looking at the statistics of their realisations. Let us explain this point in more detail.

This has recently been proposed in [Zan16], where the author resorts to statistics of occurrences to define causality. Suppose a group of snapshots of a system, for which two properties X and Y are observed, each one naturally emerging as a consequence of the internal dynamics. If Y is also partly caused by X, then Y should be present whenever X is observed, as the former is a consequence of the latter. Nevertheless, due to the internal dynamics, Y can also appear spontaneously, i.e. in the absence of X. Note that there is nothing in the previous definition that logically requires causes to precede their effects, and that it is solely based on the statistics of appearance of both properties X and Y.

Mathematically, let $p_1$ be the probability that the X-determined snapshots also have the property Y, and $p_2$ the probability that the Y-determined snapshots have the property X. In the case of a real causality, $p_1 \approx 1$ and $p_2 \ll 1$. In the case of a confounding effect, i.e. when both X and Y are driven by another property Z, $p_1 \approx p_2$. Thus, the test reduces to assessing the statistical significance of the hypothesis $p_1 > p_2$ through a binomial two-proportion z-test:

$$z = \frac{p_1 - p_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \qquad (2.1)$$

where $n_1$ and $n_2$ are the numbers of snapshots associated with $p_1$ and $p_2$, and $\hat{p} = (n_1 p_1 + n_2 p_2)/(n_1 + n_2)$. The corresponding p-value is then obtained through a Gaussian cumulative distribution function, and the null hypothesis is rejected or not depending on the chosen significance level.
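As an illustration, the two-proportion z-test of Equation 2.1 can be sketched as follows. This is a simplified example under explicit assumptions, not the procedure of [Zan16]: the function names are hypothetical, each snapshot is reduced to two boolean flags (property X present, property Y present), and the p-value is taken as one-sided, matching the hypothesis that the first proportion exceeds the second.

```python
import math

def proportions_from_snapshots(has_x, has_y):
    """Estimate p1 = P(Y | X) and p2 = P(X | Y) from boolean snapshot flags.

    Assumption: a snapshot is 'X-determined' when has_x is True, and
    'Y-determined' when has_y is True.
    """
    n1 = sum(has_x)  # snapshots showing X
    n2 = sum(has_y)  # snapshots showing Y
    both = sum(a and b for a, b in zip(has_x, has_y))
    return both / n1, n1, both / n2, n2

def two_proportion_z_test(p1, n1, p2, n2):
    """Return (z, one-sided p-value) for the hypothesis p1 > p2 (Equation 2.1)."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    spread = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if spread <= 0.0:
        raise ValueError("pooled proportion is 0 or 1; z is undefined")
    z = (p1 - p2) / math.sqrt(spread)
    # Upper tail of the standard Gaussian: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value

# Assumed toy data: X always brings Y, but Y also appears spontaneously.
has_x = [True] * 40 + [False] * 160
has_y = [True] * 100 + [False] * 100

z, p_value = two_proportion_z_test(*proportions_from_snapshots(has_x, has_y))
print("z =", z, "p-value =", p_value)
```

A small p-value supports the causal reading (X causing Y), whereas similar proportions, as expected under a confounding effect, give z close to zero and a p-value close to 0.5.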

Causality will play an important role in this PhD Thesis, as it not only allows a better understanding of the dynamics behind any system, but is also the ultimate bridge between data science and complex networks, since it makes it possible to summarise in a graph all the useful causal flows of information in a system. More on that in Chapter 5; but first, let us define complex networks in more detail.