Functional causal models: Beyond linear instantaneous relations

(1)

Functional causal models: Beyond linear instantaneous relations

Kun Zhang

Max-Planck Institute for Intelligent Systems

Tübingen, Germany

(2)

Causality vs. dependence: Examples

• Causality ➜ dependence

• But dependence ⇏ causality

(3)

Brief history of causality:

Western philosophical tradition

• dates back at least to Aristotle

• Causality is not based on actual reasoning: only correlation can actually be perceived (David Hume, 1711-1776)

• One has to resort to controlled experiments

• Manipulate a variable ‘ideally’ and see the response of the system

...

(4)

Brief history of causality:

Eastern cultural tradition

• Illustrated Sutra of Cause and Effect (8th century)

• “coincidence” instead of causality (Carl Jung, 1920s)

(5)

Potential applications

• Policy making in economics, climate analysis...

• Biology, brain connectivity analysis...

• Control, robust prediction / feature selection...

• For understanding learning problems, e.g., semi-supervised learning (Schölkopf et al., 2012)

• ...

(6)

Advances in the past decades:

Computational causality

• In the past decades, under certain assumptions, it has become possible to derive causation from passively observed data (Pearl, Spirtes, Glymour, Scheines, Hoover et al.)

• statistical data → causal structure

• constraint-based approach

• causal Markov assumption

• faithfulness…

[Figure: table of observed samples of X1 and X2, with the question of which causal relation holds between X1 and X2; annotation: “contradicts classical claims ???”]

(7)

Outline

• Constraint-based causal discovery

• Functional causal model (mainly from 2005)

• linear non-Gaussian causal model

• with temporal constraints: Granger causality with instantaneous effects

• with necessary nonlinearities: Post-nonlinear causal model

X1 -- X2 -- X3

X1 → X2 → X3 (even if very nonlinear)

(8)

• Constraint-based causal discovery

• Functional causal model (from 2005)

• Linear, non-Gaussian causal model

• Granger causality with instantaneous effects

• Post-nonlinear causal model

X1 -- X2 -- X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

From 1980’s...

(9)

Causal structure vs. statistical independence

(Spirtes, Pearl, et al.)

causal structure

(causal graph)

Y → X → Z

Y -- X -- Z

?

Statistical independence(s)

Y ⊥ Z | X

Causal Markov condition: each variable is independent of its non-descendants (non-effects) conditional on its parents (direct causes)

Faithfulness:

all observed (conditional) independencies are

(10)

Why does the faithfulness assumption matter?

[Graph: Y → X → Z, drawn with edge coefficients a, b, c]

• If they are linear-Gaussian and a = -bc, we have Y ⊥ Z, which cannot be seen from the graph!

• Faithfulness assumption eliminates this possibility!
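The cancellation can be checked numerically. The sketch below is only an illustration under assumptions not stated on the slide: b is taken as the coefficient of Y → X, c as the coefficient of X → Z, and a as the coefficient of a direct edge Y → Z; the numeric values and unit-variance Gaussian noises are also arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed roles of the coefficients (not given explicitly on the slide):
b, c = 0.8, 0.5      # Y -> X and X -> Z
a = -b * c           # direct Y -> Z, chosen to cancel the indirect path

y = rng.standard_normal(n)
x = b * y + rng.standard_normal(n)
z = a * y + c * x + rng.standard_normal(n)

# Y causes Z along two paths, yet the total effect a + b*c is zero.
corr_yz = np.corrcoef(y, z)[0, 1]
corr_yx = np.corrcoef(y, x)[0, 1]
print(f"corr(Y, Z) = {corr_yz:.4f}, corr(Y, X) = {corr_yx:.4f}")
```

In the linear-Gaussian case zero correlation is equivalent to independence, so a method relying on independence tests alone would wrongly drop the connection between Y and Z here.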

(11)

Constraint-based causal discovery

• Theorem: if (G,P) satisfies faithfulness, then there is an edge between X and Y iff X ⊥̸ Y (X and Y are dependent) given any set of variables

• uses (conditional) independence constraints to find the candidate causal structures

(12)

Search results

• Markov equivalence class

• pattern Y -- X -- Z

• same adjacencies

• → if all agree on orientation; -- if disagree

• might be unique: v-structure

Y ⊥ Z | X

(13)

Constraint-based method: An inverse problem

• {local causal structures} → {conditional independences}

[Figure: candidate graphs over X, Y, Z grouped into an equivalence class; labels: X ⊥ Z | Y, ∅, equivalence class, faithfulness]

• Instead, functional causal models try to directly identify local causal structures

• two-variable case?

(14)

Outline

• Constraint-based causal discovery

• Functional causal model (from 2005)

• Linear, non-Gaussian causal model

• Granger causality with instantaneous effects

• Post-nonlinear causal model

X1 -- X2 -- X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

(15)

Functional causal model (Pearl et al.)

• generative function model for continuous variables

• x_i = f_i(pa_i, e_i), i = 1,…,n

• in econometrics, social sciences...

• well-defined examples

• Granger causality: effects follow causes in a linear form

• LiNGAM: linear, non-Gaussian and acyclic causal model (Shimizu et al., 2006)

PA_i: parents (causes) of X_i; E_i mutually independent

[Figure: PA_i and E_i are the inputs of f_i]

(16)

FCM: why independence between X and E?

• If X ⊥ E:

• Otherwise, according to Reichenbach's Common Cause Principle:

• much more complicated

[Figures: a graph over X, E, Y; a graph over X, E, Y with an additional variable Z]

(17)

FCM: A general view

• Without constraints on f, for given (X, Y), both y = f₁(x, e) with E ⊥ X and x = f₂(y, e₁) with E₁ ⊥ Y are possible

• with a Gram-Schmidt-orthogonalization procedure (Darmois, 1951)

$$\tilde{x} = \mathrm{cdf}(x), \text{ so } \tilde{x} \sim U(0,1); \qquad e = \mathrm{cdf}(y \mid \tilde{x}) = \int_{-\infty}^{y} p_{\tilde{x},y}(\tilde{x}, t)\,dt.$$
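A small numerical illustration of this construction, as a sketch under assumptions: the data are taken to be standard bivariate Gaussian with correlation `rho` (an arbitrary choice), so the conditional cdf has a closed form. The same recipe yields an "independent noise" in both directions, which is why an unconstrained f cannot reveal the causal direction.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, rho = 50_000, 0.8

# Standard bivariate Gaussian pair (X, Y) with correlation rho (assumed example).
x = rng.standard_normal(n)
y = rho * x + math.sqrt(1 - rho ** 2) * rng.standard_normal(n)

std_normal_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))

def conditional_cdf(target, given):
    # For this pair: target | given ~ N(rho * given, 1 - rho^2)
    return std_normal_cdf((target - rho * given) / math.sqrt(1 - rho ** 2))

e_forward = conditional_cdf(y, x)    # "noise" for y = f1(x, e)
e_backward = conditional_cdf(x, y)   # "noise" for x = f2(y, e1)

# Correlation is only a linear check; here both constructed noises are
# in fact independent of their respective inputs.
corr_forward = np.corrcoef(x, e_forward)[0, 1]
corr_backward = np.corrcoef(y, e_backward)[0, 1]
print(corr_forward, corr_backward)
```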

(18)

Suppose we observe the data

[Figure: plot of the observed data]

(19)

A universal way to construct “trivial” FCMs

• e’ = h ∘ CCDF_{Y|X}(y|x) is always independent from X

• Functional causal model: y = CCDF_{Y|X}^{-1} ∘ h^{-1}(e’) for any x

• how to make it identifiable (break the symmetry)?

[Figure: the data (x, y), CCDF(y|x) plotted against x, and h(CCDF(y|x)) plotted against x; the resulting function f]

(20)

General FCMs: independence vs. likelihood

• relating mutual information I and likelihood l:

$$l_{X \to Y}(\beta) = \sum_{i=1}^{n} \log P_F(x_i, y_i) = \sum_{i=1}^{n} \log P(X = x_i, Y = y_i) - I(X, E; \beta)$$

• If X→Y follows the model:

$$l_{X \to Y}(\beta^{*}) - l_{Y \to X}(\beta^{*}_{Y}) = I(Y, E_Y; \beta^{*}_{Y}).$$

• also holds for more than two variables

[Figure: X and E are the inputs of f(·; β), whose output is Y]

(21)

A basic functional causal model

• Constraint-based causal discovery

• Functional causal models (from 2005)

• Linear, non-Gaussian acyclic causal model

• extended Granger causality

• Post-nonlinear causal model

X1 -- X2 -- X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

(22)

LiNGAM model

• linear, non-Gaussian, acyclic causal model (LiNGAM) (Shimizu et al., 2006):

$$x_i = \sum_{j:\ \text{parents of } i} b_{ij} x_j + e_i \quad \text{or} \quad x = Bx + e$$

• disturbances (errors) e_i are non-Gaussian (or at most one is Gaussian) and mutually independent

• example:

[Graph: x₂ → x₃ with coefficient 0.5; x₂ → x₁ with coefficient -0.2; x₃ → x₁ with coefficient 0.3; disturbances e₂ and e₃ enter x₂ and x₃]

x₂ = e₂,

x₃ = 0.5x₂ + e₃,

x₁ = −0.2x₂ + 0.3x₃ + e₁.
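The example can be written in the matrix form x = Bx + e and simulated directly. This is a minimal sketch; the uniform disturbances and the sample size are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Rows/columns are ordered (x1, x2, x3); B[i, j] is the effect of x_j on x_i.
B = np.array([[0.0, -0.2, 0.3],
              [0.0,  0.0, 0.0],
              [0.0,  0.5, 0.0]])

# Non-Gaussian, mutually independent disturbances (uniform is an assumed choice).
e = rng.uniform(-1, 1, size=(3, n))

# x = Bx + e  <=>  x = (I - B)^{-1} e
x = np.linalg.solve(np.eye(3) - B, e)

# Acyclicity: re-ordering the variables by the causal order (x2, x3, x1)
# makes B strictly lower triangular.
order = [1, 2, 0]
B_ordered = B[np.ix_(order, order)]
print(B_ordered)
```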

(23)

ICA: A well-known technique making use of non-Gaussianity

[Figure: independent sources s₁, …, s_n pass through the unknown mixing system A and give the observed signals x₁, …, x_m (mixing: x = A·s); the de-mixing estimate W gives the ICA system output y₁, …, y_n (y = W·x), made as independent as possible]

• assumptions in ICA

• at most one of s_i is Gaussian

(24)

LiNGAM analysis by ICA

• LiNGAM: x = Bx + e ⇒ e = (I-B)x

• B has a special structure: acyclic relations

• ICA: y = Wx

• B can then be seen from W by permutation and re-scaling

• e.g. W

So we have the causal relation:

[Graph over x₂, x₃, x₁; visible edge weight: 0.5]
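The permutation and re-scaling step can be sketched as follows. This is an illustration, not the slide's own procedure: it assumes W equals I − B up to an unknown row permutation and row scaling (the usual ICA indeterminacies), and it picks the permutation by brute force, which is only practical for a few variables. The function name `b_from_w` is hypothetical.

```python
import itertools
import numpy as np

def diagonal_cost(W, perm):
    # Penalize permutations that put small (or zero) entries on the diagonal.
    diag = np.array([W[perm[i], i] for i in range(W.shape[0])])
    if np.any(diag == 0):
        return np.inf
    return float(np.sum(1.0 / np.abs(diag)))

def b_from_w(W):
    """Recover B from a row-permuted, row-scaled version of I - B."""
    n = W.shape[0]
    best = min(itertools.permutations(range(n)), key=lambda p: diagonal_cost(W, p))
    W_perm = W[list(best), :]
    W_unit = W_perm / np.diag(W_perm)[:, None]   # re-scale rows to a unit diagonal
    return np.eye(n) - W_unit

# LiNGAM example with variables ordered (x1, x2, x3).
B_true = np.array([[0.0, -0.2, 0.3],
                   [0.0,  0.0, 0.0],
                   [0.0,  0.5, 0.0]])

# Simulate the ICA indeterminacies: shuffle and rescale the rows of I - B.
W_ica = np.diag([2.0, -3.0, 0.5]) @ (np.eye(3) - B_true)[[2, 0, 1], :]

B_hat = b_from_w(W_ica)
print(B_hat)
```

With an estimated W the zero entries are only approximately zero, so in practice the diagonal cost is minimized rather than required to be finite.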

(25)

Related work & applications

• ICA with sparse connections (Zhang et al., 2008); Direct LiNGAM (Shimizu et al., 2009)

• with mild nonlinear distortion allowed; application in finance (Zhang & Chan, 2006 & 2008)

• extended Granger causality analysis for time series (Hyvärinen et al., 2010; Zhang and Hyvärinen, 2009)

[Figure 2 (Hyvärinen, Zhang, Shimizu and Hoyer): Results of application of our model to daily returns of the stock indices DJI, N225, HSI, and SSEC, with k = 1 lag. Large coefficients (greater than 0.1) are shown in bold and red.]


Nonlinear ICA with Minimal Nonlinear Distortion

Hang Seng Utilities Index. x1, x9, and x11 are constituents of Hang Seng Property Index. 3. Large bank companies are the cause of many stocks, meaning that the international impact to the Hong Kong stock market is probably reflected through large banks. Here x5 and x8 are the two largest banks in Hong Kong. 4. Stocks in Hang Seng Property Index tend to depend on many other stocks, while they hardly influence others. Here x1, x9, and x11 are in Hang Seng Property Index. These findings also indicate that the independent factor model may provide a reasonable way to explain the generation of stock returns.

x1: Cheung Kong (0001.hk); x2: CLP Hldgs (0002.hk); x3: HK & China Gas (0003.hk); x4: Wharf (Hldgs) (0004.hk); x5: HSBC Hldg (0005.hk); x6: HK Electric (0006.hk); x7: Hang Lung Dev (0010.hk); x8: Hang Seng Bank (0011.hk); x9: Henderson Land (0012.hk); x10: Hutchison (0013.hk); x11: Sun Hung Kai Prop (0016.hk); x12: Swire Pacific ’A’ (0019.hk); x13: Bank of East Asia (0023.hk); x14: Cathay Pacific Air (0293.hk)

Figure 8. Causal diagram of the 14 stocks.

7. Conclusion

We have proposed the “minimal nonlinear distortion” principle for solving the nonlinear ICA problem. This principle helps to reduce the indeterminacies in solutions of nonlinear ICA and to overcome the ill-posedness of nonlinear ICA. With this principle, the solution whose nonlinear mixing system is close to linear is preferred. Experimental results with synthetic data show that when the data are generated with mild nonlinear distortion, the proposed method produces good and reliable results for separating various nonlinear mixtures. The successful application of the proposed nonlinear ICA method to causality discovery in the Hong Kong stock market illustrates the applicability of the method and the validity of the “minimal nonlinear distortion” principle for some real-life problems. The result also supports the validity of the independent factor model in finance.

Acknowledgement

This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.

References

Almeida, L. B. (2003). MISEP – linear and nonlinear ICA based on mutual information. Journal of Machine Learning Research, 4, 1297–1318.

Almeida, L. B. (2005). Separating a real-life nonlinear image mixture. Journal of Machine Learning Research, 6, 1199–1229.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. (1993). Curvature-driven smoothing: a learning algorithm for feedforward networks. IEEE Trans. on Neural Networks, 4, 882–884.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96, 1348–1360.

Harmeling, S., Ziehe, A., Kawanabe, M., & Müller, K. (2003). Kernel-based nonlinear blind source separation. Neural Computation, 15, 1089–1124.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. on Neural Networks, 10(3), 626–634.

Hyvärinen, A., & Karthikesh, R. (2000). Sparse priors on the mixing matrix in independent component analysis. Proc. 2nd Int. Workshop on ICA and BSS (ICA2000) (pp. 477–452). Helsinki, Finland.

Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12, 429–439.

Jutten, C., & Karhunen, J. (2003). Advances in nonlinear blind source separation. Proc. 4th Int. Symp. on ICA and BSS (ICA2003) (pp. 245–256). Invited paper in the special session on nonlinear ICA and BSS.

Jutten, C., & Taleb, A. (2000). Source separation: From dusk till dawn. 2nd Int. Workshop on ICA and BSS (ICA 2000) (pp. 15–26). Helsinki, Finland.

Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.

Shimizu, S., Hoyer, P., Hyvärinen, A., & Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.

Taleb, A., & Jutten, C. (1999). Source separation in post-nonlinear mixtures. IEEE Trans. on Signal Processing, 47, 2807–2820.

Tan, Y., Wang, J., & Zurada, J. M. (2001). Nonlinear blind source separation using a radial basis function network. IEEE Trans. on Neural Networks, 12, 124–134.

Tikhonov, A. N., & Arsenin, V. A. (1977). Solutions of ill-posed problems. Washington: Winston & Sons.

Valpola, H. (2000). Nonlinear independent component analysis using ensemble learning: Theory. Proc. 2nd Int. Workshop on ICA and BSS (ICA2000) (pp. 251–256). Helsinki, Finland.

25 Thursday, 18 October 2012

(26)

Linear functional causal model with temporal constraints

• Constraint-based causal discovery

• Functional causal model (from 2005)

• Linear, non-Gaussian acyclic causal model

• extended Granger causality

• Post-nonlinear causal model

X1 -- X2 -- X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

(27)

Granger causality

• X₁: {x_1t} Granger causes X₂: {x_2t} if it contains information helping predict x_{2,t+h} (h>0) contained nowhere else (Granger, 1969)

• temporal constraint: causes must precede effects + linear causal relations

• Vector autoregression (VAR) estimated by multivariate least squares (MLS)

$$x_t = \sum_{\tau=1}^{p} B_\tau x_{t-\tau} + e_t$$

[Figure: graph over the time series x_1t and x_2t]
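A minimal sketch of the least-squares VAR estimate, assuming a single lag (p = 1), two variables, and Gaussian innovations; the coefficient matrix `B1_true` is an arbitrary example in which x1 helps predict x2 but not the other way round.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5000

# Assumed example: x1 Granger-causes x2 (entry [1, 0] is non-zero), not vice versa.
B1_true = np.array([[0.5, 0.0],
                    [0.4, 0.3]])

x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = B1_true @ x[t - 1] + rng.standard_normal(2)

# Multivariate least squares: regress x_t on x_{t-1}.
past, present = x[:-1], x[1:]
coef, *_ = np.linalg.lstsq(past, present, rcond=None)
B1_hat = coef.T   # lstsq solves past @ coef = present, so B1 is the transpose

print(np.round(B1_hat, 2))
```

A near-zero estimate for entry [0, 1] together with a clearly non-zero entry [1, 0] is the pattern read as "x1 Granger-causes x2"; a formal decision would use a significance test on the coefficients.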

(28)

With instantaneous effects

• Are e_it independent? instantaneous effects between x_it (Reale, Wilson et al., 2001)

• Granger causality with instantaneous effects:

$$x_t = \sum_{\tau=1}^{p} B_\tau x_{t-\tau} + B_0 x_t + e_t, \quad \text{or} \quad x_t = \sum_{\tau=0}^{p} B_\tau x_{t-\tau} + e_t$$

[Figure: graph over x_1t and x_2t; dotted line: instantaneous causal effects]

(29)

What happens if we ignore instantaneous effects (Hyvärinen et al., 2008)

• They become confounders...

• Example

$$x_t = \sum_{\tau=1}^{p} (I - B_0)^{-1} B_\tau x_{t-\tau} + (I - B_0)^{-1} e_t$$

ESTIMATION OF SVAR MODEL USING NON-GAUSSIANITY

While this phenomenon is, in principle, well-known in econometric literature (Swanson and Granger, 1997; Demiralp and Hoover, 2003; Moneta and Spirtes, 2006), Eq. (11) is seldom applied because estimation methods for B₀ have not been well developed. To our knowledge, no estimation method for B₀ has been proposed which is consistent for the whole matrix without strong prior assumptions on B₀.

Next we present some theoretical examples of how the instantaneous and lagged effects interact based on the formula in (11).

7.1.1 EXAMPLE 1: AN INSTANTANEOUS EFFECT MAY SEEM TO BE LAGGED

Consider first the case where the instantaneous and lagged matrices are as follows:

$$B_0 = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad B_1 = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.9 \end{pmatrix}.$$

That is, there is an instantaneous effect x₂ → x₁, and no lagged effects (other than the purely autoregressive x_i(t − 1) → x_i(t)). Now, if an AR(1) model is estimated for data coming from this model, without taking the instantaneous effects into account, we get the autoregressive matrix

$$M_1 = (I - B_0)^{-1} B_1 = \begin{pmatrix} 0.9 & 0.9 \\ 0 & 0.9 \end{pmatrix}.$$

Thus, the effect x₂ → x₁ seems to be lagged although it is, actually, instantaneous.

7.1.2 EXAMPLE 2: SPURIOUS EFFECTS APPEAR

Consider three variables with the instantaneous effects x₁ → x₂ and x₂ → x₃, and no lagged effects other than x_i(t − 1) → x_i(t), as given by

$$B_0 = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad B_1 = \begin{pmatrix} 0.9 & 0 & 0 \\ 0 & 0.9 & 0 \\ 0 & 0 & 0.9 \end{pmatrix}.$$

If we estimate an AR(1) model for the data coming from this model, we obtain

$$M_1 = (I - B_0)^{-1} B_1 = \begin{pmatrix} 0.9 & 0 & 0 \\ 0.9 & 0.9 & 0 \\ 0.9 & 0.9 & 0.9 \end{pmatrix}.$$

This means that the estimation of the simple autoregressive model leads to the inference of a direct lagged effect x₁ → x₃, although no such direct effect exists in the model generating the data, for any time lag.

A more reassuring result is the following: if the data follows the same causal ordering for all time lags, that ordering is not contradicted by the neglect of instantaneous effect. A rigorous definition of this property is the following.

Theorem 1 Assume that there is an ordering i(j), j = 1 . . . n of the variables such that no effect goes backward, that is,

$$B_\tau(i(j - \delta), i(j)) = 0 \quad \text{for } \delta > 0,\ \tau \ge 0,\ 1 \le j \le n. \qquad (15)$$

[Figure: graph over x_{1,t-1}, x_{2,t-1}, x_{3,t-1} and x_{1t}, x_{2t}, x_{3t}, with edge weights 0.9 and 1, for the model $x_t = \sum_{\tau=0}^{p} B_\tau x_{t-\tau} + e_t$]
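Both examples can be reproduced with a few lines of linear algebra. The sketch below simply evaluates M₁ = (I − B₀)⁻¹B₁ for the two pairs of matrices given above; the function name `reduced_form_lag_matrix` is a hypothetical helper introduced here.

```python
import numpy as np

def reduced_form_lag_matrix(B0, B1):
    """AR(1) matrix obtained when instantaneous effects B0 are ignored."""
    n = B0.shape[0]
    return np.linalg.solve(np.eye(n) - B0, B1)   # (I - B0)^{-1} B1

# Example 1: instantaneous effect x2 -> x1 only.
B0_ex1 = np.array([[0.0, 1.0],
                   [0.0, 0.0]])
B1_ex1 = 0.9 * np.eye(2)
M1_ex1 = reduced_form_lag_matrix(B0_ex1, B1_ex1)

# Example 2: instantaneous chain x1 -> x2 -> x3.
B0_ex2 = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
B1_ex2 = 0.9 * np.eye(3)
M1_ex2 = reduced_form_lag_matrix(B0_ex2, B1_ex2)

print(M1_ex1)
print(M1_ex2)   # entry [2, 0] is the spurious lagged effect x1 -> x3
```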

(30)

Identification (Zhang & Hyvärinen, 2009)

• e_it independent for different i and t, i.e., spatially & temporally independent

• If at most one of e_it is Gaussian, it can be solved by multichannel blind deconvolution (MBD) with causal FIR filters

• MBD estimates W to make ê_it spatially and temporally independent

• B_τ can be found from W_τ, by

(31)

Experiment on financial data

• extended Granger causality analysis of daily returns of stock indices DJI, N225, HSI, and SSEC, with k = 1 lag

[Figure: estimated effects among DJI, N225, HSI, and SSEC from time t-1 to time t; coefficients shown: 0.12, 0.42, 0.02, 0.11, -0.15, 0.35, 0.21, -0.07, 0.04, 0.05, 0.04]

Figure 2: Results of application of our model to daily returns of the stock indices DJI, N225, HSI, and SSEC, with k = 1 lag. Large coefficients (greater than 0.1) are shown in bold and red.

Next, we fitted an ordinary vector autoregressive model with 10 lags on the estimated sources, finding the corresponding innovation series which we denote by y_i(t), i = 1, ..., 17. Our goal was to analyze if there are some influences between the magnitudes of these innovations. We prefer to analyze the innovations because the innovations are approximately white both temporally and spatially, and thus we can analyze the magnitudes with no contamination by linear (auto)correlations of the source signals. The autoregressive model order 10 was chosen because it was the smallest order that gave approximately white innovations.

We then fitted the SVAR model on the logarithmically transformed magnitudes x_i(t) = log(0.2 + |y_i(t)|), i = 1, ..., 17. We determined the order k of our SVAR model by minimizing the AIC criterion (Akaike, 1973), which is the negative log-likelihood of the MBD model plus a term measuring the complexity of the model. The log-likelihood involves the densities of the MBD outputs ê_i(t), which were modelled by a mixture of three Gaussians. From the candidate orders between 0 and 20, we found that k = 2 gave the minimum AIC.

After finding the estimate of the coefficients B̂_τ, τ = 0, 1, 2 with the MBD-based approach, one can easily calculate the estimates of the statistics S₀(i ← j) and S_lag(i ← j). The bootstrapping approach given in Section 6 was used to evaluate if these estimated statistics are significant. Here we need to test multiple hypotheses simultaneously; to reduce the type I error, we adopted the Bonferroni correction (Shaffer, 1995) for multiple testing correction. We used the significance level 5%. For both the instantaneous and lagged effects, one needs to perform 17 × 16 = 272 tests; therefore, the significance level for each individual test is then 0.05/272 ≈ 2 × 10⁻⁴. We used 10⁴ replications for the bootstrapping.

For illustration, we give the empirical distribution of the statistics S₀(7 ← 14) and S_lag(7 ← 14), as well as their estimated values for the original series x_i(t), in Fig. 3. Clearly Ŝ₀(7 ← 14) is significant, while Ŝ_lag(7 ← 14) is not.

Fig. 4 shows the resulting diagram of causal analysis with instantaneous effects between the magnitudes of the selected MEG sources, with the influences significant at 5% level (corrected for multiple testing). What we see is that the connections tend to be strong between sources which are close to each other. For example, the occipitoparietal sources such as #1, #2, #3, #8, and #11 have strong interconnections. Some perirolandic sources such as #5, #7, #10, and #14 are also interconnected. Sources #4 and #16 seem to mediate between these two groups.

(32)

Experiment on brain signals

• extended Granger causality analysis of the log-magnitude of MEG sources (significant at 5% level; corrected for multiple testing).

[Figure 3 panels: (a) histogram of Ŝ₀(7 ← 14) under null hypothesis; (b) histogram of Ŝ_lag(7 ← 14) under null hypothesis; each panel marks the critical value at the 2 × 10⁻⁴ level and the estimated statistic]

Figure 3: Illustration of the empirical distribution of the statistics under the null hypothesis obtained by bootstrapping. (a) For the statistic S₀(7 ← 14). (b) For S_lag(7 ← 14).

Figure 4: Results of application of our model on the log-magnitudes of the MEG sources (significant at 5% level, corrected for multiple testing). Black dashed line: instantaneous effect. Red solid line: lagged effect. The thickness of the lines indicates the strength of the influences.

(33)

Summary: Extended Granger causality analysis

• Granger causality as a special functional causal model

• Even with temporal information, it might be necessary to model instantaneous effects

• Alternative formulation (Zhang 2011b): linear non-Gaussian state-space model

[Excerpt, Zhang and Hyvärinen:]

In practice the data are usually noisy, i.e., the observed data contain observation errors, and the latent source processes exhibit some temporal structures (which may include delayed influences between them). The state-space representation then offers a powerful modeling approach. Here we are particularly interested in the linear state-space model (SSM) or linear dynamic system (Kalman, 1960; van Overschee and de Moor, 1996). Denote by x_t = (x_1t, ..., x_nt)^T, t = 1, ..., N, the vector of the observed signals, and by y_t = (y_1t, ..., y_mt)^T the vector of latent processes which are our main object of interest.[1] The observed data are assumed to be linear mixtures of the latent processes together with some noise effect, while the latent processes follow a vector autoregressive (VAR) model. Mathematically, we have

$$x_t = A y_t + e_t, \qquad (1)$$

$$y_t = \sum_{\tau=1}^{L} B_\tau y_{t-\tau} + \epsilon_t, \qquad (2)$$

where e_t = (e_1t, ..., e_nt)^T and ε_t = (ε_1t, ..., ε_mt)^T denote the observation error and process noise, respectively. Moreover, e_t and ε_t are both temporally white and independent of each other. One can see that because of the state transition matrices B_τ, y_it are generally dependent, even if ε_it are mutually independent.

In traditional SSMs, both ε_t and e_t are assumed to be Gaussian; or equivalently, one makes use of their covariance structure, and the statistical properties beyond second-order are not considered. In Kalman filtering (Kalman, 1960), A and B_τ are given, and the goal is to do inference, i.e., to estimate y_t based on {x_t}. Learning of the parameters A, B_τ, and the covariance matrices of e_t and ε_t was also studied; see, e.g., van Overschee and de Moor (1991); Ghahramani and Hinton (1996). However, it is well-known that under the above assumptions, the SSM model is generally not identifiable; see e.g., Arun and Kung (1990), and consequently, one can not use this model to recover the latent processes y_it.

Under specific structural constraints on B_τ or A, the SSM model (1∼2) may become identifiable, so that it can be used to reveal the underlying structure of the data. Many existing models which are used for source separation or prediction of time series can be considered as special cases of this model. For instance, the temporal structure based source separation (Murata et al., 2001) assume that B_τ are diagonal. The model also becomes identifiable with some other structural constraints on A, as discussed in Xu (2002). However, one should note that in practice such constraints may not hold; for instance, for the electroencephalography (EEG) or magnetoencephalography (MEG) data, some underlying processes or sources may have delayed influences on others, and letting B_τ be diagonal will destroy these types of connectivities.

On the other hand, distributional information also helps system identification. One can ignore the temporal information and perform system identification based on the non-Gaussianity of the data. For example, if the matrices B_τ are zero and e_i(t) are non-Gaussian, it is reduced to the noisy ICA problem or the independent factor analysis (IFA) model (Attias, 1999). In the noiseless case, ICA could recover the underlying linear mixing system up to trivial indeterminacies. But in the noisy case, the model is just partially

[1] We use the terms latent processes, factors, and sources interchangeably in this paper, depending on application scenarios.

(34)

Now comes...

• Constraint-based causal discovery

• Functional causal model (from 2005)

• Linear, non-Gaussian acyclic causal model

• extended Granger causality

• Post-nonlinear (PNL) causal model

X1 -- X2 -- X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

(35)

Three effects usually encountered in a causal model (Zhang & Hyvärinen, 2009)

• Without prior knowledge, the assumed model is expected to be

• general enough: adapted to approximate the true generating process

• identifiable: asymmetry in causes and effects

• represented by post-nonlinear causal model with inner additive noise

(36)

PNL causal model with inner additive noise

• acyclic data-generating process

• two-variable case

• x₁ → x₂: x₂ = f_{2,2}(f_{2,1}(x₁) + e₂)
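The two-variable model can be simulated as below. This is a sketch under assumptions: the particular functions (`f21` a cubic-plus-linear map, `f22` the invertible `sinh`) and the Laplace noise are arbitrary illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

x1 = rng.uniform(-1, 1, n)          # cause
e2 = rng.laplace(0.0, 0.3, n)       # inner additive noise, independent of x1

def f21(x):
    # Inner nonlinear effect of the cause (assumed form).
    return x ** 3 + x

f22 = np.sinh                       # invertible post-nonlinear distortion (assumed form)

x2 = f22(f21(x1) + e2)              # x2 = f_{2,2}(f_{2,1}(x1) + e2)

# Because f22 is invertible, the noise can be recovered from (x1, x2).
e2_recovered = np.arcsinh(x2) - f21(x1)
print(np.max(np.abs(e2_recovered - e2)))
```

Recovering the noise through the inverse of the outer function is what a test of the direction x₁ → x₂ relies on; in practice both functions are unknown and have to be estimated.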

(37)

Special cases of PNL causal model

• If $f_{i,1}$ and $f_{i,2}$ are both linear
  • At most one of $e_i$ is Gaussian: LiNGAM (Shimizu et al., 2006)
  • All of $e_i$ are Gaussian: linear Gaussian case (Spirtes, Pearl et al.)
• If $f_{i,2}$ is identity: nonlinear causal discovery with additive noise models (Hoyer et al., 2009; Zhang 2009b)

(38)

Identifiability in two-variable case

• Is the causal direction implied by the model unique?
• We tackle this problem by a proof by contradiction
  • Assume both $x_1 \to x_2$ and $x_1 \leftarrow x_2$ satisfy the PNL model

(39)

(40)

(41)

(42)

Corollaries easy to verify

• Corollary 1: If $p_{e_2}$ is not Gaussian, nor log-mix-lin-exp, nor a generalized mixture of two exponentials, then the PNL causal model is identifiable
• Corollary 2: If function $f_1$ is not invertible, then the PNL causal model is identifiable

(43)

Method for distinguishing cause from effect

• Examine if $x_1 \to x_2$ holds
• Examine if $x_2 \to x_1$ holds
• Draw conclusions
  • Only one of them holds ☺
  • Both hold: they could not be distinguished by PNL ☹
    • Additional information on the nonlinearities, such as the smoothness, nonlinear distortion level, etc., may be helpful
  • If neither of them holds, data do not follow PNL, or confounders have significant effects ☹
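The decision rule above can be sketched as a small function. It assumes each direction has already been examined and summarised by an independence-test p-value (a high p-value meaning independence between the estimated disturbance and the hypothetical cause is not rejected). The function name `pnl_decision` and the threshold `alpha` are hypothetical, chosen only for illustration.

```python
def pnl_decision(p_forward, p_backward, alpha=0.05):
    """Combine the two direction checks into one conclusion.

    p_forward:  p-value of the independence test under x1 -> x2
    p_backward: p-value of the independence test under x2 -> x1
    alpha:      assumed significance level (not specified in the slides)
    """
    forward_holds = p_forward > alpha     # independence not rejected
    backward_holds = p_backward > alpha
    if forward_holds and not backward_holds:
        return "x1 -> x2"
    if backward_holds and not forward_holds:
        return "x2 -> x1"
    if forward_holds and backward_holds:
        return "undecided: both directions fit the PNL model"
    return "neither fits: data may not follow PNL, or confounders matter"

print(pnl_decision(0.40, 0.001))
```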

(44)

Method to examine if $x_1 \to x_2$

• If $x_1 \to x_2$, i.e., $x_2 = f_{2,2}(f_{2,1}(x_1) + e_2)$, we have that $e_2 = f_{2,2}^{-1}(x_2) - f_{2,1}(x_1)$ is ind. from $x_1$ (taking $f_{2,2}$ to be invertible)
• Two-step procedure to examine if $x_1 \to x_2$
  • Step 1: make $y_2 = g_2(x_2) - g_1(x_1)$ and $x_1$ as ind. as possible, such that $y_2$ provides $\hat{e}_2$
  • Step 2: use independence tests (Gretton et al., 2008) to verify if $x_1$ and $\hat{e}_2$ are ind.
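Step 2 can be sketched with a kernel independence statistic. The block below is not the test of Gretton et al. (2008) as used in the slides: it computes a biased HSIC-type statistic with Gaussian kernels and swaps in a plain permutation null for the p-value. The bandwidth rule, the number of permutations, and the names `rbf_gram` and `hsic_permutation_pvalue` are all assumptions made for illustration.

```python
import numpy as np

def rbf_gram(v):
    """Gaussian-kernel Gram matrix; bandwidth from a median heuristic (assumed)."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    d2 = (v - v.T) ** 2
    bandwidth2 = 0.5 * np.median(d2[d2 > 0])
    return np.exp(-d2 / (2.0 * bandwidth2))

def hsic_permutation_pvalue(x, y, n_perm=200, seed=0):
    """p-value for H0: x and y independent, via permutations of y."""
    rng = np.random.default_rng(seed)
    n = len(x)
    K, L = rbf_gram(x), rbf_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Kc = H @ K @ H
    stat = np.sum(Kc * L) / n**2              # equals trace(K H L H) / n^2
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(n)              # break any pairing between x and y
        if np.sum(Kc * L[np.ix_(idx, idx)]) / n**2 >= stat:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(1)
x1 = rng.uniform(-2.0, 2.0, 200)
noise = rng.laplace(0.0, 0.3, 200)
p_indep = hsic_permutation_pvalue(x1, noise)                 # independent pair
p_dep = hsic_permutation_pvalue(x1, x1**2 + 0.1 * noise)     # dependent, uncorrelated
print(p_indep, p_dep)
```

In the procedure above, `y` would be the estimated disturbance $\hat{e}_2$ from Step 1; a small p-value then speaks against the hypothesised direction.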

(45)

Application to real data

• applied to “CausalEffectPairs”
  • 80 data sets for cause-effect pairs; each contains realizations of two variables
  • Causal direction is obvious to non-experts, but background information is hidden from participants
  • Goal: to distinguish cause from effect of the two variables

[Table: example realizations of the two variables x1 and x2]

(46)

Performance

• with automatic initialization

• Local optima due to MLP’s. Performance improved with specific

Bernhard Schölkopf, December

IGCI: deterministic method
LiNGAM: Shimizu et al., 2006
AN: additive noise model (nonlinear)
PNL: AN with post-nonlinearity
GPI: Mooij et al., 2010

(47)

(48)

(49)

Identification with more than two variables

• identifiability (cf. Peters et al., 2011)
• brute-force search infeasible
• We show that when fitting x₁,…,x_n to the PNL causal model, e_i are mutually ind., if and only if the causal Markov condition holds, and e_i is ind. from pa_i of the same variable
• ⇒ a more practical two-step method
  • find the equivalence class using conditional independences
  • determine the undetermined causal directions by testing if the disturbance is ind. from the parents of the same variable

(50)

Illustration: Boston housing data

• inferred by PC + kernel-based CI test (Zhang et al., 2011)

• PNL model further applied:


Figure 6: Outcome of the PC algorithm applied to the continuous variables of the Boston Housing Data Set (red lines: PC_CI_PERM, solid lines: PC_KCI-test).


using partial correlation or mutual information. The former assumes linear relationships and Gaussian distributions, and the latter does not lead to a significance test. Sun et al. (2007) and Tillman et al. (2009) propose to use CI_PERM for CI testing in PC. Based on the promising results of KCI-test, we propose to also apply it to causal inference.

4.2.1 Simulated data

We generated data from a random DAG G. In particular, we randomly chose whether an edge exists and sampled the functions that relate the variables from a Gaussian Process prior. We sampled four random variables X₁, ..., X₄ and allowed arrows from X_i to X_j only for i < j. With probability 0.5 each possible arrow is either present or absent. If arrows exist, from X₁ and X₃ to X₄, say, we sample X₄ from a Gaussian Process with mean function U₁ · X₁ + U₃ · X₃ (with U₁, U₃ iid ∼ U[−2; 2]) and a Gaussian kernel (with each dimension randomly weighted between 0.1 and 0.6) plus a noise kernel as the covariance function. For significance level 0.01 and sample sizes between 100 and 700 we simulated 100 DAGs, and checked how often the different methods infer the correct Markov equivalence class. Figure 3 shows how often PC based on KCI-test, CI_PERM, or partial correlation recovered the correct Markov equivalence class. PC based on KCI-test gives clearly the best results.

Figure 3: The chance that the correct Markov equivalence class was inferred with PC combined with different CI testing methods (x-axis: sample size, 100 to 700; y-axis: proportion of correct Markov equivalence classes; curves: KCI-test, CI_PERM, partial correlation). KCI-test outperforms CI_PERM and partial correlation.

4.2.2 Real data

We applied our method to continuous variables in the Boston Housing data set, which is available at the UCI Repository (Asuncion and Newman, 2007). Due to the large number of variables we choose the significance level to be 0.001, as a rough way to correct for multiple testing. Figure 6 shows the results for PC using CI_PERM (PC_CI_PERM) and KCI-test (PC_KCI-test). For conciseness, we report them in the same figure: the red arrows are the ones inferred by PC_CI_PERM and all solid lines show the result by PC_KCI-test. Ergo, red solid lines were found by both methods. Please refer to the data set for the explanation of the variables. Although one can argue about the ground truth for this data set, we regard it as promising that our method finds links between number of rooms (RM) and median value of houses (MED) and between non-retail business (IND) and nitric oxides concentration (NOX). The latter is also missing in the result on these data given by Margaritis (2005); instead their method gives some dubious links like crime rate (CRI) to nitric oxides (NOX), for example.

5 Conclusion

We proposed a novel method for conditional independence testing. It makes use of the characterization of conditional independence in terms of uncorrelatedness of functions in suitable reproducing kernel Hilbert spaces, and the proposed test statistic can be easily calculated from the kernel matrices. We derived its distribution under the null hypothesis of conditional independence. This distribution can either be generated by Monte Carlo simulation or approximated by

Remark:

1. CRI - per capita crime rate by town
2. IND - prop. of non-retail business acres per town
3. NOX - nitric oxides concentration
4. RM - average number of rooms per dwelling
5. AGE - prop. of owner-occupied units built prior to 1940
6. DIS - weighted distances to 5 Boston employment centres
7. TAX - full-value property-tax rate per $10,000
8. B - 1000(Bk - 0.63)^2, where Bk is the prop. of blacks by town
9. LST - % lower status of the population
10. MED - median value of owner-occupied homes in $1000's


(51)

Simpler extension for more than two variables

• Simple case: component-wise nonlinear transformations of $x_i$, $f_i(x_i)$, have linear causal relations
• making use of PNL ICA to find (linear) causal relations from W
• confirmed by simulations; real applications needed

[Figure: example linear causal graph over $f_1(x_1)$, $f_2(x_2)$, $f_3(x_3)$ with disturbances $e_1$, $e_2$, $e_3$ and edge weights 0.5, -0.2, 0.3]
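One way to read "find (linear) causal relations from W" is the row-permutation and normalisation step used in ICA-based LiNGAM (Shimizu et al., 2006); that this is the step meant on the slide is an assumption. The sketch takes a demixing matrix `W` as given (in practice it would be estimated by PNL ICA), undoes the row permutation and scaling that ICA leaves undetermined, and returns the connection matrix B with x = Bx + e. The function name and the example numbers are hypothetical.

```python
import itertools
import numpy as np

def connection_matrix_from_demixing(W):
    """Recover B (x = B x + e) from a row-permuted, row-scaled W = I - B."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    cols = np.arange(n)
    # Pick the row order whose diagonal stays furthest from zero
    # (exhaustive search, so only suitable for small n).
    with np.errstate(divide="ignore"):
        best = min(itertools.permutations(range(n)),
                   key=lambda p: np.sum(1.0 / np.abs(W[list(p), cols])))
    Wp = W[list(best), :]
    Wn = Wp / np.diag(Wp)[:, None]            # rescale rows to unit diagonal
    return np.eye(n) - Wn

# Illustrative ground truth: x2 <- 0.5*x1, x3 <- -0.2*x1 + 0.3*x2.
B_true = np.array([[0.0, 0.0, 0.0],
                   [0.5, 0.0, 0.0],
                   [-0.2, 0.3, 0.0]])
W_clean = np.eye(3) - B_true
# Mimic ICA's indeterminacies: scale the rows, then shuffle them.
W_observed = (np.diag([2.0, -1.0, 0.5]) @ W_clean)[[2, 0, 1], :]
B_hat = connection_matrix_from_demixing(W_observed)
print(B_hat)
```

With an estimated W the off-diagonal zeros are only approximate, so a pruning or significance step would still be needed before reading off a causal order.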

428 K. Zhang and L.-W. Chan

[Figure: PNL mixing system (linear mixing A, then nonlinear distortion f_i) and PNL de-mixing system (inverse nonlinear transform g_i, then linear de-mixing W); signals s_i, t_i, x_i, z_i, y_i for i = 1, ..., n]


Figure 1: PNL mixing system and PNL demixing system. The mixing system consists of a linear mixing stage (with the mixing matrix A) and a nonlinear transform stage (applying f_i to linear mixtures t_i). The demixing system does an inverse operation, which is a nonlinear transform stage followed by a linear demixing stage.

For simplicity, we have assumed there is no additive noise in this system. And in the following discussion, the number of independent sources, n, is assumed to be equal to that of observations, m. As a counterpart of the mixing model 2.1, the separation of PNL mixtures is also a two-stage procedure: a nonlinear stage followed by a linear demixing stage. Figure 1 shows the PNL mixing system and the demixing system, where g_i are elements of g used to invert the nonlinear mapping f, z has n elements z_i as an estimate of the latent linear mixtures t_i, and W is a linear demixing matrix transforming z_i to y_i, the estimate of independent sources s_i.

Taleb and Jutten (1999b) showed that when A has at least two nonzero entries per row or per column and s_i accepts a density function that vanishes at one point at least, one can affirm that the output y has mutually independent components if and only if g_i ∘ f_i is linear and W is a linear separating matrix for z. In other words, under these assumptions, the sources can be separated up to the same indeterminacies as in the linear ICA model, which are permutation and scaling indeterminacies. (In some work, for instance, Hyvärinen, Karhunen, & Oja, 2001, it is said there is one more indeterminacy, named sign indeterminacy, which can be considered as a special case of the scaling indeterminacy.) In fact, if the mean of s is unknown, one more
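The two-stage structure of Figure 1 can be sketched numerically. This is an oracle illustration only: the mixing matrix `A`, the distortion f_i(t) = t³ and its inverse g_i are assumed known, whereas an actual PNL ICA algorithm has to estimate g and W from x alone. All numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_samples = 3, 1000

# Independent non-Gaussian sources s_i (Laplace is an assumed choice).
s = rng.laplace(size=(n_sources, n_samples))

# Mixing system: linear stage A, then component-wise distortion f_i.
A = np.array([[1.0, 0.5, -0.3],
              [0.4, 1.0, 0.2],
              [-0.6, 0.3, 1.0]])
t = A @ s                 # latent linear mixtures t_i
x = t ** 3                # observations x_i = f_i(t_i), with f_i(t) = t^3

# Demixing system: inverse nonlinear stage g_i, then linear stage W.
z = np.cbrt(x)            # z_i = g_i(x_i), so g_i o f_i is the identity (linear)
W = np.linalg.inv(A)      # oracle separating matrix for z
y = W @ z                 # estimate of the sources

print(np.max(np.abs(y - s)))
```

Because g_i ∘ f_i is linear and W separates z, the outputs y coincide with the sources here; with estimated quantities they would match only up to permutation and scaling.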