
Functional causal models: Beyond linear instantaneous relations


Academic year: 2020


(1)

Functional causal models: Beyond linear instantaneous relations

Kun Zhang

Max-Planck Institute for Intelligent Systems

Tübingen, Germany

(2)

Causality vs. dependence: Examples

Causality ➜ dependence

!

dependence ➜ causality

(3)

Brief history of causality: Western philosophical tradition

dates back at least to Aristotle

Causality is not based on actual reasoning: only correlation can actually be perceived (David Hume, 1711-1776)

One has to resort to controlled experiments

Manipulate a variable ‘ideally’ and see the response of the system

...

(4)

Brief history of causality: Eastern cultural tradition

Illustrated Sutra of Cause and Effect (8th century)

“coincidence” instead of causality (Carl Jung, 1920’s)

(5)

Potential applications

Policy making in economics, climate analysis...

Biology, brain connectivity analysis...

Control, robust prediction / feature selection...

For understanding learning problems, e.g., semi-supervised learning (Schölkopf et al., 2012)

...

(6)

Advances in the past decades: Computational causality

In the past decades, under certain assumptions, it was made possible to derive causation from passively observed data (Pearl, Spirtes, Glymour, Scheines, Hoover et al.)

statistical data causal structure

constraint-based approach

causal Markov assumption

faithfulness…

[Table: passively observed samples of X1 and X2]

X1 ? X2

…dicts classical claims ???

(7)

Outline

Constraint-based causal discovery

Functional causal model (mainly from 2005)

linear non-Gaussian causal model

with temporal constraints: Granger causality with instantaneous effects

with necessary nonlinearities: Post-nonlinear causal model

X1 ! X2 ! X3

X1 → X2 → X3

X1 → X2 → X3 (even if very nonlinear)

(8)

Constraint-based causal discovery

Functional causal model (from 2005)

Linear, non-Gaussian causal model

Granger causality with instantaneous effects

Post-nonlinear causal model

X1 ! X2 ! X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

From 1980’s...

(9)

Causal structure vs. statistical independence

(Spirtes, Pearl, et al.)

causal structure

(causal graph)

Y → X → Z

Y -- X -- Z

?

Statistical independence(s)

$Y \perp\!\!\!\perp Z \mid X$

Causal Markov condition: each variable is independent of its non-descendants (non-effects) conditional on its parents (direct causes)

Faithfulness: all observed (conditional) independencies are

(10)

Why does the faithfulness assumption matter?

Y → X → Z

a

b

c

If they are linear-Gaussian and a = -bc, we have $Y \perp\!\!\!\perp Z$, which cannot be seen from the graph!

Faithfulness assumption eliminates this possibility!
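The cancellation can be checked numerically. The sketch below is only an illustration under assumptions that are not on the slide: it places a on a direct edge Y → Z, b on Y → X, and c on X → Z, uses unit-variance Gaussian noise, and the function name `simulate_unfaithful` is hypothetical.

```python
import numpy as np

def simulate_unfaithful(n=200_000, b=0.8, c=0.5, seed=0):
    """Linear-Gaussian model in which the direct effect a cancels the path b*c."""
    rng = np.random.default_rng(seed)
    a = -b * c                                   # the unfaithful parameter choice
    y = rng.standard_normal(n)
    x = b * y + rng.standard_normal(n)           # assumed edge Y -> X with weight b
    z = a * y + c * x + rng.standard_normal(n)   # assumed edges Y -> Z (a), X -> Z (c)
    return y, x, z

y, x, z = simulate_unfaithful()
corr_yx = np.corrcoef(y, x)[0, 1]   # clearly non-zero: Y and X are dependent
corr_yz = np.corrcoef(y, z)[0, 1]   # close to zero although Y causes Z
print(f"corr(Y, X) = {corr_yx:.3f}, corr(Y, Z) = {corr_yz:.3f}")
```

Under these assumptions Z reduces to c times the noise of X plus its own noise, so Y and Z are independent even though Y is a cause of Z.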

(11)

Constraint-based causal discovery

Theorem: if (G, P) satisfies faithfulness, then there is an edge between X and Y iff X and Y are dependent ($X \not\perp\!\!\!\perp Y$) given any set of variables

uses (conditional) independence constraints to find the candidate causal structures
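A minimal sketch of one such constraint, assuming a linear-Gaussian chain Y → X → Z with made-up coefficients. Partial correlation stands in for a general conditional independence test here, which is only adequate in the linear-Gaussian case; the helper `partial_corr` is hypothetical.

```python
import numpy as np

def partial_corr(u, v, cond):
    """Correlation of u and v after linearly regressing out cond (a 1-D array)."""
    design = np.column_stack([np.ones_like(cond), cond])
    res_u = u - design @ np.linalg.lstsq(design, u, rcond=None)[0]
    res_v = v - design @ np.linalg.lstsq(design, v, rcond=None)[0]
    return np.corrcoef(res_u, res_v)[0, 1]

rng = np.random.default_rng(1)
n = 100_000
y = rng.standard_normal(n)
x = 0.9 * y + rng.standard_normal(n)   # Y -> X
z = 0.7 * x + rng.standard_normal(n)   # X -> Z

marginal_yz = np.corrcoef(y, z)[0, 1]   # Y and Z are marginally dependent
conditional_yz = partial_corr(y, z, x)  # but (nearly) uncorrelated given X
print(f"corr(Y, Z) = {marginal_yz:.3f}, partial corr given X = {conditional_yz:.3f}")
```

Because Y and Z become independent given X, a constraint-based search would remove the edge between Y and Z and keep Y -- X and X -- Z.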

(12)

Search results

Markov equivalence class

pattern Y -- X -- Z

same adjacencies

→ if all agree on orientation; -- if disagree

might be unique:

v-structure

Y Z | X

(13)

Constraint-based method: An inverse problem

{local causal structures} → {conditional independences}

$X \perp\!\!\!\perp Z \mid Y$

[Figure: candidate causal graphs over X, Y, Z; labels: equivalence class, faithfulness]

Instead, functional causal models try to directly identify local causal structures

two-variable case?

(14)

Constraint-based causal discovery

Functional causal model (from 2005)

Linear, non-Gaussian causal model

Granger causality with instantaneous effects

Post-nonlinear causal model

X1 ! X2 ! X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

Outline

(15)

Functional causal model

(Pearl et al.)

generative function model for continuous variables

$x_i = f_i(pa_i, e_i), \quad i = 1, \dots, n$

$PA_i$: parents (causes) of $X_i$; $E_i$ mutually independent

[Diagram: $f_i$ with inputs $PA_i$ and $E_i$]

in econometrics, social sciences...

well-defined examples

Granger causality: effects follow causes in a linear form

LiNGAM: linear, non-Gaussian and acyclic causal model (Shimizu et al., 2006)

(16)

FCM: why independence between X and E?

If $X \perp\!\!\!\perp E$:

[Diagram over X, E, Y]

Otherwise, according to Reichenbach's Common Cause Principle: much more complicated

[Diagram over X, E, Y, Z]

(17)

FCM: A general view

Without constraints on f, for given (X, Y), both $y = f_1(x, e)$ with $E \perp\!\!\!\perp X$ and $x = f_2(y, e_1)$ with $E_1 \perp\!\!\!\perp Y$ are possible

with a Gram-Schmidt-orthogonalization procedure (Darmois, 1951)

$\tilde{x} = \mathrm{cdf}(x)$, so $\tilde{x} \sim U(0,1)$;

$e = \mathrm{cdf}(y \mid \tilde{x}) = \int_{-\infty}^{y} p_{\tilde{x},y}(\tilde{x}, t)\, dt.$
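The symmetry can be illustrated with a case where the conditional CDF is available in closed form. The sketch below assumes a standard bivariate Gaussian pair with correlation `rho` (an illustrative choice, not from the slides) and builds the "noise" term in both directions; the sample correlation is only a necessary check of independence, although in this Gaussian case the independence holds exactly.

```python
import math
import numpy as np

def std_normal_cdf(v):
    """Standard normal CDF evaluated elementwise via math.erf."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(v / math.sqrt(2.0)))

rng = np.random.default_rng(0)
n, rho = 100_000, 0.8
x = rng.standard_normal(n)
y = rho * x + math.sqrt(1.0 - rho**2) * rng.standard_normal(n)

scale = math.sqrt(1.0 - rho**2)
e_fwd = std_normal_cdf((y - rho * x) / scale)   # e  = cdf(y | x): noise for x -> y
e_bwd = std_normal_cdf((x - rho * y) / scale)   # e1 = cdf(x | y): noise for y -> x

corr_fwd = np.corrcoef(x, e_fwd)[0, 1]
corr_bwd = np.corrcoef(y, e_bwd)[0, 1]
print(f"corr(x, e) = {corr_fwd:.3f}, corr(y, e1) = {corr_bwd:.3f}")
```

Both constructed noise terms are uniform on (0, 1) and uncorrelated with their putative cause, so without constraints on f neither direction can be preferred.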

(18)

Suppose we observe the data

x

(19)

A universal way to construct

“trivial” FCMs

$e' = h \circ \mathrm{CCDF}_{Y|X}(y|x)$ is always independent from X

Functional causal model: $y = \mathrm{CCDF}_{Y|X}^{-1} \circ h^{-1}(e')$ for any x

how to make it identifiable (break the symmetry)?

[Figure panels: y against x; CCDF(y|x) against x; h(CCDF(y|x)) against x]

(20)

General FCMs: independence vs. likelihood

relating mutual information I and likelihood l:

$l_{X \to Y}(\beta) = \sum_{i=1}^{n} \log P_F(x_i, y_i) = \sum_{i=1}^{n} \log P(X = x_i, Y = y_i) - I(X, E; \beta)$

[Diagram: X and E enter f(⋄; β), which outputs Y]

If X→Y follows the model:

$l_{X \to Y}(\ldots) - l_{Y \to X}(\ldots_Y) = I(Y, E_Y; \beta_Y).$

also holds for more than two variables

(21)

A basic functional causal model

Constraint-based causal discovery

Functional causal models (from 2005)

Linear, non-Gaussian acyclic causal model

extended Granger causality

Post-nonlinear causal model

X1 ! X2 ! X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

(22)

LiNGAM model

linear, non-Gaussian, acyclic causal model (LiNGAM) (Shimizu et al., 2006):

$x_i = \sum_{j:\ \text{parents of } i} b_{ij} x_j + e_i$, or $x = Bx + e$

disturbances (errors) $e_i$ are non-Gaussian (or at most one is Gaussian) and mutually independent

example:

[Diagram: x2 → x3 (0.5), x2 → x1 (−0.2), x3 → x1 (0.3), with disturbances e2 and e3]

$x_2 = e_2$,

$x_3 = 0.5 x_2 + e_3$,

$x_1 = -0.2 x_2 + 0.3 x_3 + e_1$.
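The example can be written in the matrix form x = Bx + e and sampled directly. This is a minimal sketch: the variable ordering (x1, x2, x3) used for the rows and columns of `B`, the uniform disturbances, and the sample size are illustrative assumptions.

```python
import numpy as np

# B[i, j] is the coefficient of x_{j+1} in the equation for x_{i+1}.
B = np.array([[0.0, -0.2, 0.3],   # x1 = -0.2 x2 + 0.3 x3 + e1
              [0.0,  0.0, 0.0],   # x2 = e2
              [0.0,  0.5, 0.0]])  # x3 = 0.5 x2 + e3

rng = np.random.default_rng(0)
n = 50_000
e = rng.uniform(-1.0, 1.0, size=(3, n))   # non-Gaussian, mutually independent

# x = Bx + e  is equivalent to  x = (I - B)^{-1} e
x = np.linalg.solve(np.eye(3) - B, e)
x1, x2, x3 = x

# The structural equations hold sample by sample.
print(np.allclose(x2, e[1]),
      np.allclose(x3, 0.5 * x2 + e[2]),
      np.allclose(x1, -0.2 * x2 + 0.3 * x3 + e[0]))
```

Since the relations are acyclic, B can be permuted to strictly lower-triangular form (here with the order x2, x3, x1), which is what makes I − B invertible.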

(23)

ICA: A well-known technique making use of non-Gaussianity

mixing: $x = A \cdot s$, where $s_1, \dots, s_n$ are the independent sources, A is the unknown mixing system, and $x_1, \dots, x_m$ are the observed signals

de-mixing: $y = W \cdot x$, where W is the de-mixing estimate; ICA system output $y_1, \dots, y_n$: as independent as possible

assumptions in ICA: at most one of $s_i$ is Gaussian

(24)

LiNGAM analysis by ICA

LiNGAM:

$x = Bx + e \;\Rightarrow\; e = (I - B)x$

B has a special structure:

acyclic relations

ICA:

y = Wx

B can then be seen from W by permutation and re-scaling

e.g.

W

So we have the causal relation:

[Diagram: causal graph over x2, x3, x1 with edge weight 0.5]
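The permutation and re-scaling step can be sketched as follows. No ICA is actually run here: `W_ica` is a hand-made, row-permuted and row-scaled copy of I − B for the three-variable example, and the selection rule (maximize the product of absolute diagonal entries) is a simplified stand-in for the criterion used in practice. The function name `b_from_demixing` is hypothetical.

```python
import itertools
import numpy as np

def b_from_demixing(W):
    """Recover B from a de-mixing matrix equal to I - B up to row order and scale."""
    n = W.shape[0]
    # Choose the row order that keeps zeros off the diagonal.
    best = max(itertools.permutations(range(n)),
               key=lambda p: np.prod([abs(W[p[i], i]) for i in range(n)]))
    W_perm = W[list(best), :]
    W_perm = W_perm / np.diag(W_perm)[:, None]   # rescale rows to unit diagonal
    return np.eye(n) - W_perm                    # W_perm = I - B

B_true = np.array([[0.0, -0.2, 0.3],
                   [0.0,  0.0, 0.0],
                   [0.0,  0.5, 0.0]])
# Simulate ICA's indeterminacies: arbitrary row order and row scaling.
W_ica = np.diag([2.0, -0.5, 3.0]) @ (np.eye(3) - B_true)[[2, 0, 1], :]

B_hat = b_from_demixing(W_ica)
print(np.round(B_hat, 3))
```

Exhaustive search over permutations is only practical for a handful of variables; with estimated (noisy) W the small entries are not exactly zero, so a further pruning step would be needed.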

(25)

Related work & applications

ICA with sparse connections (Zhang et al., 2008); Direct LiNGAM (Shimizu et al., 2009)

with mild nonlinear distortion allowed; application in finance (Zhang & Chan, 2006 & 2008)

extended Granger causality analysis for time series (Hyvärinen et al., 2010; Zhang and Hyvärinen, 2009)

HYVÄRINEN, ZHANG, SHIMIZU AND HOYER

[Figure 2 diagram: graph over DJI, N225, HSI, and SSEC at times t−1 and t, with edge coefficients 0.12, 0.42, 0.02, 0.11, −0.15, 0.35, 0.21, −0.07, 0.04, 0.05, 0.04]

Figure 2: Results of application of our model to daily returns of the stock indices DJI, N225, HSI, and SSEC, with k = 1 lag. Large coefficients (greater than 0.1) are shown in bold and red.

Next, we fitted an ordinary vector autoregressive model with 10 lags on the estimated sources, finding the corresponding innovation series which we denote by $y_i(t)$, $i = 1, \dots, 17$. Our goal was to analyze if there are some influences between the magnitudes of these innovations. We prefer to analyze the innovations because the innovations are approximately white both temporally and spatially, and thus we can analyze the magnitudes with no contamination by linear (auto)correlations of the source signals. The autoregressive model order 10 was chosen because it was the smallest order that gave approximately white innovations.

We then fitted the SVAR model on the logarithmically transformed magnitudes $x_i(t) = \log(0.2 + |y_i(t)|)$, $i = 1, \dots, 17$. We determined the order k of our SVAR model by minimizing the AIC criterion (Akaike, 1973), which is the negative log-likelihood of the MBD model plus a term measuring the complexity of the model. The log-likelihood involves the densities of the MBD outputs $\hat{e}_i(t)$, which were modelled by a mixture of three Gaussians. From the candidate orders between 0 and 20, we found that k = 2 gave the minimum AIC.

After finding the estimate of the coefficients $\hat{B}_\tau$, $\tau = 0, 1, 2$ with the MBD-based approach, one can easily calculate the estimates of the statistics $S_0(i \leftarrow j)$ and $S_{lag}(i \leftarrow j)$. The bootstrapping approach given in Section 6 was used to evaluate if these estimated statistics are significant. Here we need to test multiple hypotheses simultaneously; to reduce the type I error, we adopted the Bonferroni correction (Shaffer, 1995) for multiple testing correction. We used the significance level 5%. For both the instantaneous and lagged effects, one needs to perform 17 × 16 = 272 tests; therefore, the significance level for each individual test is then $0.05/272 \approx 2 \times 10^{-4}$. We used $10^4$ replications for the bootstrapping.

For illustration, we give the empirical distribution of the statistics $S_0(7 \leftarrow 14)$ and $S_{lag}(7 \leftarrow 14)$, as well as their estimated values for the original series $x_i(t)$, in Fig. 3. Clearly $\hat{S}_0(7 \leftarrow 14)$ is significant, while $\hat{S}_{lag}(7 \leftarrow 14)$ is not.

Fig. 4 shows the resulting diagram of causal analysis with instantaneous effects between the magnitudes of the selected MEG sources, with the influences significant at 5% level (corrected for multiple testing). What we see is that the connections tend to be strong between sources which are close to each other. For example, the occipitoparietal sources such as #1, #2, #3, #8, and #11

ESTIMATION OF SVAR MODEL USING NON-GAUSSIANITY

[Figure 3 panels: (a) histogram of $\hat{S}_0(7 \leftarrow 14)$ under the null hypothesis; (b) histogram of $\hat{S}_{lag}(7 \leftarrow 14)$ under the null hypothesis; each panel marks the critical value at the $2 \times 10^{-4}$ level and the estimated statistic]

Figure 3: Illustration of the empirical distribution of the statistics under the null hypothesis obtained by bootstrapping. (a) For the statistic $S_0(7 \leftarrow 14)$. (b) For $S_{lag}(7 \leftarrow 14)$.

Figure 4: Results of application of our model on the log-magnitudes of the MEG sources (significant at 5% level, corrected for multiple testing). Black dashed line: instantaneous effect. Red solid line: lagged effect. The thickness of the lines indicates the strength of the influences.

Nonlinear ICA with Minimal Nonlinear Distortion

Hang Seng Utilities Index. x1, x9, and x11 are constituents of Hang Seng Property Index. 3. Large bank companies are the cause of many stocks, meaning that the international impact to the Hong Kong stock market is probably reflected through large banks. Here x5 and x8 are the two largest banks in Hong Kong. 4. Stocks in Hang Seng Property Index tend to depend on many other stocks, while they hardly influence others. Here x1, x9, and x11 are in Hang Seng Property Index. These findings also indicate that the independent factor model may provide a reasonable way to explain the generation of stock returns.

x1: Cheung Kong (0001.hk) x2: CLP Hldgs (0002.hk) x3: HK & China Gas (0003.hk) x4: Wharf (Hldgs) (0004.hk) x5: HSBC Hldg (0005.hk), x6: HK Electric (0006.hk) x7: Hang Lung Dev (0010.hk) x8: Hang Seng Bank (0011.hk) x9: Henderson Land (0012.hk) x10: Hutchison (0013.hk) x11: Sun Hung Kai Prop (0016.hk) x12: Swire Pacific ’A’ (0019.hk) x13: Bank of East Asia (0023.hk) x14: Cathay Pacific Air (0293.hk)

Figure 8. Causal diagram of the 14 stocks.

7. Conclusion

We have proposed the “minimal nonlinear distortion” principle for solving the nonlinear ICA problem. This principle helps to reduce the indeterminacies in solutions of nonlinear ICA and to overcome the ill-posedness of nonlinear ICA. With this principle, the solution whose nonlinear mixing system is close to linear is preferred. Experimental results with synthetic data show that when the data are generated with mild nonlinear distortion, the proposed method produces good and reliable results for separating various nonlinear mixtures. The successful application of the proposed nonlinear ICA method to causality discovery in the Hong Kong stock market illustrates the applicability of the method and the validity of the “minimal nonlinear distortion” principle for some real-life problems. The result also supports the validity of the independent factor model in finance.

Acknowledgement

This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administration Region, China.

References

Almeida, L. B. (2003). MISEP — linear and nonlinear ICA based on mutual information. Journal of Machine Learning Research, 4, 1297–1318.

Almeida, L. B. (2005). Separating a real-life nonlinear image mixture. Journal of Machine Learning Research, 6, 1199–1229.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. (1993). Curvature-driven smoothing: a learning algorithm for feedforward networks. IEEE Trans. on Neural Networks, 4, 882–884.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96, 1348–1360.

Harmeling, S., Ziehe, A., Kawanabe, M., & Müller, K. (2003). Kernel-based nonlinear blind source separation. Neural Computation, 15, 1089–1124.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. on Neural Networks, 10(3), 626–634.

Hyvärinen, A., & Karthikesh, R. (2000). Sparse priors on the mixing matrix in independent component analysis. Proc. 2nd Int. Workshop on ICA and BSS (ICA2000) (pp. 477–452). Helsinki, Finland.

Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12, 429–439.

Jutten, C., & Karhunen, J. (2003). Advances in nonlinear blind source separation. Proc. 4th Int. Symp. on ICA and BSS (ICA2003) (pp. 245–256). Invited paper in the special session on nonlinear ICA and BSS.

Jutten, C., & Taleb, A. (2000). Source separation: From dusk till dawn. 2nd Int. Workshop on ICA and BSS (ICA 2000) (pp. 15–26). Helsinki, Finland.

Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.

Shimizu, S., Hoyer, P., Hyvärinen, A., & Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.

Taleb, A., & Jutten, C. (1999). Source separation in post-nonlinear mixtures. IEEE Trans. on Signal Processing, 47, 2807–2820.

Tan, Y., Wang, J., & Zurada, J. M. (2001). Nonlinear blind source separation using a radial basis function network. IEEE Trans. on Neural Networks, 12, 124–134.

Tikhonov, A. N., & Arsenin, V. A. (1977). Solutions of ill-posed problems. Washington: Winston & Sons.

Valpola, H. (2000). Nonlinear independent component analysis using ensemble learning: Theory. Proc. 2nd Int. Workshop on ICA and BSS (ICA2000) (pp. 251–256). Helsinki, Finland.

25 Thursday, 18 October 2012

(26)

Linear functional causal model

with temporal constraints

Constraint-based causal discovery

Functional causal model (from 2005)

Linear, non-Gaussian acyclic causal model

extended Granger causality

Post-nonlinear causal model

X1 ! X2 ! X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

(27)

Granger causality

$X_1: \{x_{1t}\}$ Granger causes $X_2: \{x_{2t}\}$ if it contains information helping predict $x_{2,t+h}$ (h > 0) contained nowhere else (Granger, 1969)

temporal constraint: causes must precede effects + linear causal relations

Vector autoregression (VAR) estimated by multivariate least squares (MLS):

$x_t = \sum_{\tau=1}^{p} B_\tau x_{t-\tau} + e_t$

[Diagram: time series $x_{1t}$ and $x_{2t}$]
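A minimal sketch of VAR estimation by least squares, restricted to p = 1 for brevity. The coefficient matrix `B1_true`, the Gaussian innovations, and the helper names `simulate_var1` and `fit_var1` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def simulate_var1(B1, n, rng):
    """Simulate x_t = B1 x_{t-1} + e_t with standard normal innovations."""
    d = B1.shape[0]
    x = np.zeros((n, d))
    for t in range(1, n):
        x[t] = B1 @ x[t - 1] + rng.standard_normal(d)
    return x

def fit_var1(x):
    """Least-squares estimate of B1 from a (time, variables) array."""
    past, present = x[:-1], x[1:]
    # Solve present ≈ past @ B1.T for B1.T, then transpose.
    coef, *_ = np.linalg.lstsq(past, present, rcond=None)
    return coef.T

rng = np.random.default_rng(0)
B1_true = np.array([[0.5, 0.0],    # x1 depends only on its own past
                    [0.4, 0.3]])   # x2 depends on the past of x1 and x2
x = simulate_var1(B1_true, 20_000, rng)
B1_hat = fit_var1(x)
print(np.round(B1_hat, 2))
```

In this simulated system the non-zero entry in row 2, column 1 means that x1 Granger causes x2, while the zero in row 1, column 2 means that x2 does not Granger cause x1.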

(28)

With instantaneous effects

Are $e_{it}$ independent? Instantaneous effects between $x_{it}$ (Reale, Wilson et al., 2001)

Granger causality with instantaneous effects:

$x_t = \sum_{\tau=1}^{p} B_\tau x_{t-\tau} + B_0 x_t + e_t$, or $x_t = \sum_{\tau=0}^{p} B_\tau x_{t-\tau} + e_t$

Dotted line: instantaneous causal effects between $x_{1t}$ and $x_{2t}$

(29)

What happens if we ignore instantaneous effects (Hyvärinen et al., 2008)

They become confounders...

$x_t = \sum_{\tau=1}^{p} (I - B_0)^{-1} B_\tau x_{t-\tau} + (I - B_0)^{-1} e_t$

Example

While this phenomenon is, in principle, well-known in econometric literature (Swanson and Granger, 1997; Demiralp and Hoover, 2003; Moneta and Spirtes, 2006), Eq. (11) is seldom applied because estimation methods for $B_0$ have not been well developed. To our knowledge, no estimation method for $B_0$ has been proposed which is consistent for the whole matrix without strong prior assumptions on $B_0$.

Next we present some theoretical examples of how the instantaneous and lagged effects interact based on the formula in (11).

7.1.1 EXAMPLE 1: AN INSTANTANEOUS EFFECT MAY SEEM TO BE LAGGED

Consider first the case where the instantaneous and lagged matrices are as follows:

$B_0 = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad B_1 = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.9 \end{pmatrix}.$

That is, there is an instantaneous effect $x_2 \to x_1$, and no lagged effects (other than the purely autoregressive $x_i(t-1) \to x_i(t)$). Now, if an AR(1) model is estimated for data coming from this model, without taking the instantaneous effects into account, we get the autoregressive matrix

$M_1 = (I - B_0)^{-1} B_1 = \begin{pmatrix} 0.9 & 0.9 \\ 0 & 0.9 \end{pmatrix}.$

Thus, the effect $x_2 \to x_1$ seems to be lagged although it is, actually, instantaneous.

7.1.2 EXAMPLE 2: SPURIOUS EFFECTS APPEAR

Consider three variables with the instantaneous effects $x_1 \to x_2$ and $x_2 \to x_3$, and no lagged effects other than $x_i(t-1) \to x_i(t)$, as given by

$B_0 = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad B_1 = \begin{pmatrix} 0.9 & 0 & 0 \\ 0 & 0.9 & 0 \\ 0 & 0 & 0.9 \end{pmatrix}.$

If we estimate an AR(1) model for the data coming from this model, we obtain

$M_1 = (I - B_0)^{-1} B_1 = \begin{pmatrix} 0.9 & 0 & 0 \\ 0.9 & 0.9 & 0 \\ 0.9 & 0.9 & 0.9 \end{pmatrix}.$

This means that the estimation of the simple autoregressive model leads to the inference of a direct lagged effect $x_1 \to x_3$, although no such direct effect exists in the model generating the data, for any time lag.
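Both examples can be reproduced with a few lines of linear algebra. The helper name `implied_ar1_matrix` is hypothetical; the matrices are the ones given in Examples 1 and 2.

```python
import numpy as np

def implied_ar1_matrix(B0, B1):
    """M1 = (I - B0)^{-1} B1: the lag-1 matrix obtained when B0 is ignored."""
    identity = np.eye(B0.shape[0])
    return np.linalg.solve(identity - B0, B1)

# Example 1: instantaneous effect x2 -> x1 only.
B0_ex1 = np.array([[0.0, 1.0],
                   [0.0, 0.0]])
M1_ex1 = implied_ar1_matrix(B0_ex1, 0.9 * np.eye(2))

# Example 2: instantaneous chain x1 -> x2 -> x3.
B0_ex2 = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
M1_ex2 = implied_ar1_matrix(B0_ex2, 0.9 * np.eye(3))

print(M1_ex1)
print(M1_ex2)
```

The entry in row 3, column 1 of the second result is the spurious lagged effect x1 → x3: it is non-zero in M1 although the corresponding entries of both B0 and B1 are zero.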

A more reassuring result is the following: if the data follows the same causal ordering for all time lags, that ordering is not contradicted by the neglect of instantaneous effect. A rigorous definition of this property is the following.

Theorem 1 Assume that there is an ordering $i(j)$, $j = 1 \dots n$ of the variables such that no effect goes backward, that is,

$B_\tau(i(j - \delta), i(j)) = 0$ for $\delta > 0$, $\tau \ge 0$, $1 \le j \le n$. (15)


[Diagrams: lagged effects $x_{i,t-1} \to x_{it}$ with coefficient 0.9 and instantaneous effects with coefficient 1 among $x_{1t}$, $x_{2t}$, $x_{3t}$]

$x_t = \sum_{\tau=0}^{p} B_\tau x_{t-\tau} + e_t$

(30)

Identification

(Zhang & Hyvärinen, 2009)

$e_{it}$ independent for different i and t, i.e., spatially & temporally independent

If at most one of $e_{it}$ is Gaussian, it can be solved by multichannel blind deconvolution (MBD) with causal FIR filters

MBD estimates W to make $\hat{e}_{it}$ spatially and temporally independent

$B_\tau$ can be found from $W_\tau$, by

(31)

Experiment on financial data

extended Granger causality analysis of daily returns of stock indices DJI, N225, HSI, and SSEC, with k = 1 lag


Fig. 4 shows the resulting diagram of causal analysis with instantaneous effects between the magnitudes of the selected MEG sources, with the influences significant at 5% level (corrected for multiple testing). What we see is that the connections tend to be strong between sources which are close to each other. For example, the occipitoparietal sources such as #1, #2, #3, #8, and #11 have strong interconnections. Some perirolandic sources such as #5, #7, #10, and #14 are also interconnected. Sources #4 and #16 seems to mediate between these two groups.


(32)

Experiment on brain signals

extended Granger causality analysis of the log-magnitude of MEG sources (significant at 5% level; corrected for multiple testing).


(33)

Summary: Extended Granger causality analysis

Granger causality as a special functional causal model

Even with temporal information, it might be necessary to model instantaneous effects

Alternative formulation (Zhang 2011b): linear non-Gaussian state-space model

Zhang and Hyvärinen

In practice the data are usually noisy, i.e., the observed data contain observation errors, and the latent source processes exhibit some temporal structures (which may include delayed influences between them). The state-space representation then offers a powerful modeling approach. Here we are particularly interested in the linear state-space model (SSM) or linear dynamic system (Kalman, 1960; van Overschee and de Moor, 1996). Denote by $x_t = (x_{1t}, \dots, x_{nt})^T$, $t = 1, \dots, N$, the vector of the observed signals, and by $y_t = (y_{1t}, \dots, y_{mt})^T$ the vector of latent processes which are our main object of interest.$^1$ The observed data are assumed to be linear mixtures of the latent processes together with some noise effect, while the latent processes follow a vector autoregressive (VAR) model. Mathematically, we have

$x_t = A y_t + e_t$, (1)

$y_t = \sum_{\tau=1}^{L} B_\tau y_{t-\tau} + \epsilon_t$, (2)

where $e_t = (e_{1t}, \dots, e_{nt})^T$ and $\epsilon_t = (\epsilon_{1t}, \dots, \epsilon_{mt})^T$ denote the observation error and process noise, respectively. Moreover, $e_t$ and $\epsilon_t$ are both temporally white and independent of each other. One can see that because of the state transition matrices $B_\tau$, $y_{it}$ are generally dependent, even if $\epsilon_{it}$ are mutually independent.

In traditional SSMs, both $\epsilon_t$ and $e_t$ are assumed to be Gaussian; or equivalently, one makes use of their covariance structure, and the statistical properties beyond second-order are not considered. In Kalman filtering (Kalman, 1960), A and $B_\tau$ are given, and the goal is to do inference, i.e., to estimate $y_t$ based on $\{x_t\}$. Learning of the parameters A, $B_\tau$, and the covariance matrices of $e_t$ and $\epsilon_t$ was also studied; see, e.g., van Overschee and de Moor (1991); Ghahramani and Hinton (1996). However, it is well-known that under the above assumptions, the SSM model is generally not identifiable; see e.g., Arun and Kung (1990), and consequently, one cannot use this model to recover the latent processes $y_{it}$.

Under specific structural constraints on $B_\tau$ or A, the SSM model (1–2) may become identifiable, so that it can be used to reveal the underlying structure of the data. Many existing models which are used for source separation or prediction of time series can be considered as special cases of this model. For instance, the temporal structure based source separation (Murata et al., 2001) assume that $B_\tau$ are diagonal. The model also becomes identifiable with some other structural constraints on A, as discussed in Xu (2002). However, one should note that in practice such constraints may not hold; for instance, for the electroencephalography (EEG) or magnetoencephalography (MEG) data, some underlying processes or sources may have delayed influences on others, and letting $B_\tau$ be diagonal will destroy these types of connectivities.

On the other hand, distributional information also helps system identification. One can ignore the temporal information and perform system identification based on the non-Gaussianity of the data. For example, if the matrices $B_\tau$ are zero and $e_i(t)$ are non-Gaussian, it is reduced to the noisy ICA problem or the independent factor analysis (IFA) model (Attias, 1999). In the noiseless case, ICA could recover the underlying linear mixing system up to trivial indeterminacies. But in the noisy case, the model is just partially

1. We use the terms latent processes, factors, and sources interchangeably in this paper, depending on application scenarios.

33 Thursday, 18 October 2012

(34)

Now comes...

Constraint-based causal discovery

Functional causal model (from 2005)

Linear, non-Gaussian acyclic causal model

extended Granger causality

Post-nonlinear (PNL) causal model

X1 → X2 → X3

X1 ! X2 ! X3

X1 → X2 → X3 (if linear)

X1 → X2 → X3 (even if very nonlinear)

(35)

Three effects usually encountered in a causal model (Zhang & Hyvärinen, 2009)

Without prior knowledge, the assumed model is expected to be

general enough: adapted to approximate the true generating process

identifiable: asymmetry in causes and effects

represented by post-nonlinear causal model with inner additive noise

(36)

PNL causal model with inner additive noise

acyclic data-generating process

two-variable case x_1 → x_2:

x_2 = f_{2,2}( f_{2,1}(x_1) + e_2 )
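The two-variable PNL model can be simulated in a few lines. The sketch below is only an illustration. The cubic inner function, the tanh outer function, the Laplace noise and all constants are hypothetical choices, not taken from the slides.

```python
import numpy as np

def sample_pnl_pair(n, rng):
    """Draw n samples from x2 = f22(f21(x1) + e2) with e2 independent of x1."""
    x1 = rng.uniform(-2.0, 2.0, size=n)      # cause
    e2 = rng.laplace(scale=0.3, size=n)      # non-Gaussian noise, independent of x1
    inner = x1 ** 3 + e2                     # f21(x1) + e2, with hypothetical f21(x) = x^3
    x2 = np.tanh(inner / 4.0)                # hypothetical invertible f22(u) = tanh(u / 4)
    return x1, x2, e2

rng = np.random.default_rng(0)
x1, x2, e2 = sample_pnl_pair(1000, rng)

# Because f22 is invertible, the noise can be recovered from (x1, x2)
# when both functions are known: e2 = f22^{-1}(x2) - f21(x1).
recovered_e2 = 4.0 * np.arctanh(x2) - x1 ** 3
```

In causal discovery the two functions are unknown. The recovery step above only shows why an invertible outer function lets the disturbance be expressed in terms of the observed pair.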

(37)

Special cases of PNL causal model

If f_{i,1} and f_{i,2} are both linear

At most one of e_i is Gaussian: LiNGAM (Shimizu et al., 2006)

All of e_i are Gaussian: linear Gaussian case (Spirtes, Pearl et al.)

If f_{i,2} is identity: nonlinear causal discovery with additive noise models (Hoyer et al., 2009, Zhang 2009b)

(38)

Identifiability in two-variable case

Is the causal direction implied by the model unique?

We tackle this problem by a proof by contradiction

Assume both x_1 → x_2 and x_1 ← x_2 satisfy the PNL model

(39)
(40)
(41)
(42)

Corollaries easy to verify

Corollary 1: If p_{e_2} is not Gaussian, nor log-mix-lin-exp, nor a generalized mixture of two exponentials, then the PNL causal model is identifiable

Corollary 2: If function f_1 is not invertible, then the PNL causal model is identifiable

(43)

Method for distinguishing cause from effect

Examine if x_1 → x_2 holds

Examine if x_2 → x_1 holds

Draw conclusions

Only one of them holds: the causal direction is identified

Both hold: they could not be distinguished by PNL. Additional information on the nonlinearities, such as the smoothness, nonlinear distortion level, etc., may be helpful

If neither of them holds, data do not follow PNL, or confounders have significant effects
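The three possible conclusions can be written as a small decision rule. The sketch assumes that each direction has already been examined by an independence test that returns a p-value. The function name, the threshold `alpha` and the returned labels are hypothetical.

```python
def decide_direction(p_forward, p_backward, alpha=0.05):
    """Map independence-test p-values for both directions to a conclusion.

    p_forward:  p-value of the independence test when fitting x1 -> x2
    p_backward: p-value of the independence test when fitting x2 -> x1
    A direction "holds" when independence is not rejected (p-value > alpha).
    """
    forward_holds = p_forward > alpha
    backward_holds = p_backward > alpha
    if forward_holds and not backward_holds:
        return "x1 -> x2"
    if backward_holds and not forward_holds:
        return "x2 -> x1"
    if forward_holds and backward_holds:
        return "not distinguishable by PNL"
    return "PNL does not fit, or confounders matter"
```

For example, `decide_direction(0.40, 0.001)` selects `x1 -> x2`, because independence is rejected only in the backward direction.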

(44)

Method to examine if x_1 → x_2

If x_1 → x_2, i.e., x_2 = f_{2,2}( f_{2,1}(x_1) + e_2 ), we have

e_2 = f_{2,2}^{-1}(x_2) - f_{2,1}(x_1) is ind. from x_1

Two-step procedure to examine if x_1 → x_2

Step 1: make y_2 = g_2(x_2) - g_1(x_1) and x_1 as ind. as possible, such that y_2 provides ê_2

Step 2: use independence tests (Gretton et al., 2008) to verify if x_1 and ê_2 are ind.
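The two steps can be sketched under strong simplifications. This is not the procedure used in the talk. Step 1 here fixes g_2 to the identity (the additive-noise special case) and estimates g_1 by polynomial regression, instead of learning both functions by minimizing dependence. Step 2 uses an HSIC statistic with a permutation null as a stand-in for the kernel independence test. All function names, the kernel-width heuristic and the toy data are assumptions.

```python
import numpy as np

def rbf_gram(v):
    """Gaussian-kernel Gram matrix with the median-distance heuristic for the width."""
    d = np.abs(v[:, None] - v[None, :])
    width = np.median(d[d > 0])
    return np.exp(-d ** 2 / (2.0 * width ** 2))

def hsic_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value for independence of x and y based on the HSIC statistic."""
    rng = np.random.default_rng(seed)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ rbf_gram(x) @ H                  # centred Gram matrix of x
    L = rbf_gram(y)
    stat = np.sum(Kc * L) / n ** 2            # biased HSIC estimate
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(n)              # shuffling y breaks any dependence
        null[b] = np.sum(Kc * L[np.ix_(idx, idx)]) / n ** 2
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

def estimated_noise(cause, effect, degree=5):
    """Step 1 stand-in: residual of a polynomial regression of effect on cause."""
    coeffs = np.polyfit(cause, effect, degree)
    return effect - np.polyval(coeffs, cause)

# Toy data following x2 = x1^3 + e2 (hypothetical example).
rng = np.random.default_rng(1)
x1 = rng.uniform(-1.0, 1.0, size=300)
x2 = x1 ** 3 + rng.uniform(-0.2, 0.2, size=300)

p_forward = hsic_pvalue(x1, estimated_noise(x1, x2))    # examines x1 -> x2
p_backward = hsic_pvalue(x2, estimated_noise(x2, x1))   # examines x2 -> x1
```

A large `p_forward` together with a small `p_backward` would support x_1 → x_2. A full PNL fit would also have to learn g_2, which this sketch leaves out.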

(45)

Application on real data

applied on “CausalEffectPairs”

80 data sets for cause-effect pairs; each contains realizations of two variables

Causal direction is obvious to non-experts, but background information is hidden from participants

Goal: to distinguish cause from effect of the two variables

[Table: example realizations of x_1 and x_2]

(46)

Performance

with automatic initialization

Local optima due to MLP’s. Performance improved with specific

Bernhard Schölkopf, December

IGCI: Deterministic Method
LINGAM: Shimizu et al., 2006
AN: Additive Noise Model (nonlinear)
PNL: AN with post-nonlinearity
GPI: Mooij et al., 2010

(47)
(48)
(49)

Identification with more than two variables

identifiability (cf. Peters et al., 2011)

brute-force search infeasible

We show that when fitting x_1,…,x_n to the PNL causal model, e_i are mutually ind., if and only if the causal Markov condition holds, and e_i is ind. from pa_i of the same variable

⇒ a more practical two-step method

find the equivalence class using conditional independences

determine the undetermined causal directions by testing if the disturbance is ind. from the parents of the same variable

(50)

Illustration: Boston housing data

inferred by PC + kernel-based CI test (Zhang et al., 2011)

PNL model further applied:


using partial correlation or mutual information. The former assumes linear relationships and Gaussian distributions, and the latter does not lead to a significance test. Sun et al. (2007) and Tillman et al. (2009) propose to use CI_PERM for CI testing in PC. Based on the promising results of KCI-test, we propose to also apply it to causal inference.

4.2.1 Simulated data

We generated data from a random DAG G. In particular, we randomly chose whether an edge exists and sampled the functions that relate the variables from a Gaussian Process prior. We sampled four random variables X_1, ..., X_4 and allowed arrows from X_i to X_j only for i < j. With probability 0.5 each possible arrow is either present or absent. If arrows exist, from X_1 and X_3 to X_4, say, we sample X_4 from a Gaussian Process with mean function U_1 · X_1 + U_3 · X_3 (with U_1, U_3 i.i.d. ∼ U[−2; 2]) and a Gaussian kernel (with each dimension randomly weighted between 0.1 and 0.6) plus a noise kernel as the covariance function. For significance level 0.01 and sample sizes between 100 and 700 we simulated 100 DAGs, and checked how often the different methods infer the correct Markov equivalence class. Figure 3 shows how often PC based on KCI-test, CI_PERM, or partial correlation recovered the correct Markov equivalence class. PC based on KCI-test gives clearly the best results.
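The graph-sampling part of this setup can be sketched as follows. Only the edge structure and the linear mean function of one node are shown; drawing the node values from a Gaussian Process is omitted. Function names, the seed and the placeholder data are assumptions.

```python
import numpy as np

def sample_random_dag(n_vars, edge_prob, rng):
    """Adjacency matrix where adj[i, j] = 1 means an arrow from variable i to variable j.

    Arrows are only allowed from a lower to a higher index, so the graph is acyclic.
    """
    coin_flips = rng.random((n_vars, n_vars)) < edge_prob
    return np.triu(coin_flips, k=1).astype(int)

def parents(adj, j):
    """Indices of the parents of variable j."""
    return np.flatnonzero(adj[:, j])

rng = np.random.default_rng(0)
adj = sample_random_dag(4, 0.5, rng)

# Mean function of the last variable: sum of U_i * X_i over its parents,
# with weights U_i drawn uniformly from [-2, 2]. The parent values are placeholders.
x = rng.normal(size=(100, 4))
pa = parents(adj, 3)
weights = rng.uniform(-2.0, 2.0, size=len(pa))
mean_last = x[:, pa] @ weights
```

Restricting arrows to i < j fixes a causal ordering in advance, so no separate acyclicity check is needed.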

Figure 3: The chance that the correct Markov equivalence class was inferred with PC combined with different CI testing methods (x-axis: sample size, 100 to 700; y-axis: proportion of correct Markov equivalence classes; curves: KCI-test, CI_PERM, partial correlation). KCI-test outperforms CI_PERM and partial correlation.

4.2.2 Real data

We applied our method to continuous variables in the Boston Housing data set, which is available at the UCI Repository (Asuncion and Newman, 2007). Due to the large number of variables we choose the significance level to be 0.001, as a rough way to correct for multiple testing. Figure 6 shows the results for PC using CI_PERM (PC_{CI_PERM}) and KCI-test (PC_{KCI-test}). For conciseness, we report them in the same figure: the red arrows are the ones inferred by PC_{CI_PERM} and all solid lines show the result by PC_{KCI-test}. Ergo, red solid lines were found by both methods. Please refer to the data set for the explanation of the variables. Although one can argue about the ground truth for this data set, we regard it as promising that our method finds links between number of rooms (RM) and median value of houses (MED) and between non-retail business (IND) and nitric oxides concentration (NOX). The latter is also missing in the result on these data given by Margaritis (2005); instead their method gives some dubious links like crime rate (CRI) to nitric oxides (NOX), for example.

Figure 6: Outcome of the PC algorithm applied to the continuous variables of the Boston Housing Data Set (red lines: PC_{CI_PERM}, solid lines: PC_{KCI-test}). Nodes: RM, LST, AGE, NOX, IND, MED, CRI, DIS, TAX, B.

5 Conclusion

We proposed a novel method for conditional independence testing. It makes use of the characterization of conditional independence in terms of uncorrelatedness of functions in suitable reproducing kernel Hilbert spaces, and the proposed test statistic can be easily calculated from the kernel matrices. We derived its distribution under the null hypothesis of conditional independence. This distribution can either be generated by Monte Carlo simulation or approximated by

Remark:

1. CRI - per capita crime rate by town

2. IND - prop. of non-retail business acres per town

3. NOX - nitric oxides concentration

4. RM - average number of rooms per dwelling

5. AGE - prop. of owner-occupied units built prior to 1940

6. DIS - weighted distances to 5 Boston employment centres

7. TAX - full-value property-tax rate per $10,000

8. B - 1000(Bk - 0.63)^2 where Bk is the prop. of blacks by town

9. LST - % lower status of the population

10. MED - Median value of owner-occupied homes in $1000's


(51)

Simpler extension for more than two variables

Simple case: component-wise nonlinear transformations of x_i, f_i(x_i), have linear causal relations

making use of PNL ICA to find (linear) causal relations from W

confirmed by simulations; real applications needed

[Figure: example causal graph over f_1(x_1), f_2(x_2), f_3(x_3) with disturbances e_1, e_2, e_3 and edge weights 0.5, -0.2, 0.3]
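Reading linear causal relations from a de-mixing matrix W can be sketched along the lines of the LiNGAM post-processing. This is an assumption, since the slide does not spell out the procedure. Writing z_i = f_i(x_i) and z = B z + e gives e = W z with W = I - B, and ICA recovers W only up to row order and row scale. The brute-force search over permutations, the function names and the example graph are hypothetical and only suitable for a handful of variables.

```python
import itertools
import numpy as np

def causal_matrix_from_w(W):
    """Return B with z = B z + e, given W (e = W z) up to row order and row scale."""
    n = W.shape[0]
    # Pick the row order whose diagonal stays farthest from zero.
    rows = max(itertools.permutations(range(n)),
               key=lambda p: min(abs(W[p[i], i]) for i in range(n)))
    W_perm = W[list(rows), :]
    W_unit = W_perm / np.diag(W_perm)[:, None]    # rescale rows to a unit diagonal
    return np.eye(n) - W_unit

def causal_order(B, tol=1e-8):
    """Find a variable ordering under which B is strictly lower triangular."""
    n = B.shape[0]
    for order in itertools.permutations(range(n)):
        P = B[np.ix_(order, order)]
        if np.all(np.abs(np.triu(P)) < tol):
            return list(order)
    return None

# Hypothetical graph: z2 = 0.5 z1 + e2, z3 = -0.2 z1 + 0.3 z2 + e3.
B_true = np.array([[0.0, 0.0, 0.0],
                   [0.5, 0.0, 0.0],
                   [-0.2, 0.3, 0.0]])
W_true = np.eye(3) - B_true

# What ICA would return: rows of W_true in arbitrary order and scale.
W_observed = np.diag([2.0, -1.0, 0.5]) @ W_true[[2, 0, 1], :]

B_est = causal_matrix_from_w(W_observed)
order = causal_order(B_est)
```

With an estimated W, entries of B that are close to zero would additionally have to be pruned before searching for the ordering.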


428 K. Zhang and L.-W. Chan

[Figure: block diagram of the PNL mixing system (sources s_i, linear mixing A, mixtures t_i, nonlinear distortions f_i, observations x_i) and the PNL de-mixing system (inverse nonlinear transforms g_i, outputs z_i, linear de-mixing W, estimates y_i)]

Figure 1: PNL mixing system and PNL demixing system. The mixing system consists of a linear mixing stage (with the mixing matrix A) and a nonlinear transform stage (applying f_i to linear mixtures t_i). The demixing system does an inverse operation, which is a nonlinear transform stage followed by a linear demixing stage.

For simplicity, we have assumed there is no additive noise in this system. And in the following discussion, the number of independent sources, n, is assumed to be equal to that of observations, m. As a counterpart of the mixing model 2.1, the separation of PNL mixtures is also a two-stage procedure: a nonlinear stage followed by a linear demixing stage. Figure 1 shows the PNL mixing system and the demixing system, where g_i are elements of g used to invert the nonlinear mapping f, z has n elements z_i as an estimate of the latent linear mixtures t_i, and W is a linear demixing matrix transforming z_i to y_i, the estimate of independent sources s_i.
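The two stages of the mixing system and of the demixing system can be written out directly. The sketch below uses an oracle demixing system, with g_i set to the exact inverse of f_i and W to the inverse of A, only to show how the stages compose. In practice g and W are unknown and have to be learned from x. The mixing matrix and the two distortions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_samples = 2, 500

# Independent non-Gaussian sources s and a hypothetical mixing matrix A.
s = rng.uniform(-1.0, 1.0, size=(n_sources, n_samples))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

# Mixing system: linear stage t = A s, then component-wise distortions f_i.
t = A @ s
x = np.vstack([np.tanh(t[0]),      # f_1 = tanh
               t[1] ** 3])         # f_2 = cube

# Demixing system: component-wise g_i (here the exact inverses), then linear W.
z = np.vstack([np.arctanh(x[0]),   # g_1 = arctanh
               np.cbrt(x[1])])     # g_2 = cube root
W = np.linalg.inv(A)
y = W @ z                          # estimate of the sources
```

With learned rather than oracle g and W, the outputs y would match the sources only up to the permutation and scaling indeterminacies of the ICA model.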

Taleb and Jutten (1999b) showed that when A has at least two nonzero entries per row or per column and s_i accepts a density function that vanishes at one point at least, one can affirm that the output y has mutually independent components if and only if g_i ∘ f_i is linear and W is a linear separating matrix for z. In other words, under these assumptions, the sources can be separated up to the same indeterminacies as in the linear ICA model, which are permutation and scaling indeterminacies. (In some work, for instance, Hyvärinen, Karhunen, & Oja, 2001, it is said there is one more indeterminacy, named sign indeterminacy, which can be considered as a special case of the scaling indeterminacy.) In fact, if the mean of s is unknown, one more


(52)

Summary: Extended Granger causality analysis

PNL causal model: very general + identifiable

Both the nonlinear effect of the cause and sensor distortion usually exist

clear physical interpretations of the data

(53)

Constraint-based vs. functional causal model based causal discovery

Constraint-based approaches

+ might avoid assuming the form of f provided powerful CI tests

- info loss (underdetermination, orientation error propagation)

Functional causal model based approaches

+ could directly determine local causal structures (identifiable), & is interpretable and facilitates prediction

- how to find the form or appropriate knowledge of f ?!

+ both could be generalized to the confounder case

Usually both involve multiple testing (possible for functional causal models to avoid, using likelihood as the score)

(54)

Thanks also go to

Bernhard Schölkopf

Laiwan Chan

Lei Xu

Patrik Hoyer

Joris Mooij

Jakob Zscheischler

Aapo Hyvärinen

Dominik Janzing

Kenji Fukumizu

Jonas Peters

Shohei Shimizu

Eleni Sgouritsa

Thank you!

(55)

Re-consider the examples...

Causality ➜ dependence

!

dependence ➜ causality

...⇐

