Functional causal models: Beyond linear
instantaneous relations
Kun Zhang
Max-Planck Institute for Intelligent Systems
Tübingen, Germany
Causality vs. dependence: Examples
•
Causality ➜ dependence
dependence ⇏ causality
Brief history of causality:
Western philosophical tradition
•
dates back at least to Aristotle
•
Causality is not based on actual reasoning: only correlation
can actually be perceived (David Hume, 1711-1776)
•
One has to resort to
controlled experiments
•
Manipulate a variable ‘ideally’ and see the response of the system
...
Brief history of
causality:
Eastern cultural tradition
•
Illustrated Sutra of Cause and Effect (8th century)
•
“coincidence” instead of causality
(Carl Jung, 1920’s)
Potential applications
•
Policy making in economics, climate analysis...
•
Biology, brain connectivity analysis...
•
Control, robust prediction / feature selection...
•
For understanding learning problems, e.g., semi-supervised
learning
(Schölkopf et al., 2012)
•
...
Advances in the past decades:
Computational causality
•
In the past decades, under certain assumptions, it was
made possible to derive causation from passively
observed data
(Pearl, Spirtes, Glymour, Scheines, Hoover et al.)
•
statistical data → causal structure
•
constraint-based approach
•
causal Markov assumption
•
faithfulness…
[Table: sample of paired observations of X1 and X2]
X1 ? X2
[...]dicts classical claims ???
Outline
•
Constraint-based causal
discovery
•
Functional causal model (mainly
from 2005)
•
linear non-Gaussian causal model
•
with temporal constraints:
Granger causality with
instantaneous effects
•
with necessary nonlinearities:
Post-nonlinear causal model
X1 -- X2 -- X3
X1 → X2 → X3
X1 → X2 → X3
(even if very nonlinear)
From the 1980s...
Causal structure vs. statistical independence
(Spirtes, Pearl, et al.)
causal structure (causal graph): Y → X → Z
statistical independence(s): Y ⊥⊥ Z | X
from the independences back to the graph: Y -- X -- Z ?
Causal Markov condition: each variable is independent of its non-descendants (non-effects) conditional on its parents (direct causes)
Faithfulness: all observed (conditional) independencies are entailed by the causal Markov condition
Why does the faithfulness assumption matter?
Y → X → Z, with edge coefficients a, b, c
•
If the relations are linear-Gaussian and a = −bc, we have Y ⊥⊥ Z, which cannot be seen from the graph!
•
Faithfulness assumption eliminates this possibility!
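A minimal numerical sketch of the cancellation above. It assumes that b and c sit on the chain Y → X → Z and that a is a direct effect Y → Z; the values b = 0.8, c = 0.5 and the unit-variance Gaussian noise are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
b, c = 0.8, 0.5          # assumed chain coefficients Y -> X and X -> Z
a = -b * c               # assumed direct effect Y -> Z, chosen to cancel the chain

y = rng.normal(size=n)
x = b * y + rng.normal(size=n)
z = c * x + a * y + rng.normal(size=n)

corr_yz = np.corrcoef(y, z)[0, 1]   # close to 0 although Y causes Z
corr_yx = np.corrcoef(y, x)[0, 1]   # clearly non-zero
print(f"corr(Y, Z) = {corr_yz:.4f}, corr(Y, X) = {corr_yx:.4f}")
```

Y and Z look independent in the data even though Y influences Z along two paths; faithfulness rules out exactly this kind of parameter coincidence.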
Constraint-based causal discovery
•
Theorem: if (G,P) satisfies faithfulness, then there
is an edge between X and Y iff X and Y are dependent
given any set of variables
•
uses (conditional) independence constraints to
find the candidate causal structures
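A rough illustration of a conditional independence constraint, assuming linear-Gaussian data so that partial correlation can stand in for a conditional independence test. The chain Y → X → Z, its coefficients, and the helper name `partial_corr` are hypothetical.

```python
import numpy as np

def partial_corr(u, v, w):
    """Correlation of u and v after linearly regressing out w (with intercept)."""
    design = np.column_stack([np.ones_like(w), w])
    res_u = u - design @ np.linalg.lstsq(design, u, rcond=None)[0]
    res_v = v - design @ np.linalg.lstsq(design, v, rcond=None)[0]
    return np.corrcoef(res_u, res_v)[0, 1]

rng = np.random.default_rng(1)
n = 50_000
y = rng.normal(size=n)
x = 0.9 * y + rng.normal(size=n)   # Y -> X
z = 0.7 * x + rng.normal(size=n)   # X -> Z

marginal = np.corrcoef(y, z)[0, 1]   # Y and Z are dependent
conditional = partial_corr(y, z, x)  # but (nearly) uncorrelated given X
print(f"corr(Y, Z) = {marginal:.3f}, partial corr(Y, Z | X) = {conditional:.3f}")
```

Under faithfulness, the vanishing partial correlation removes the edge Y -- Z, while the edges Y -- X and X -- Z survive every conditioning set.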
Search results
•
Markov equivalence class
•
pattern Y -- X -- Z
•
same adjacencies
•
→ if all agree on orientation; -- if disagree
•
might be unique:
v-structure
Y ⊥⊥ Z | X
Constraint-based method: An
inverse problem
•
{local causal structures}
→ {conditional independences}
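[Figure: candidate graphs over X, Y, Z mapped to their conditional independences (X ⊥⊥ Z | Y, or ∅); several graphs fall into one equivalence class, and the inverse mapping relies on faithfulness]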
•
Instead, functional causal
models try to directly
identify local causal
structures
[Figure: graph over X, Z, Y; what about the two-variable case?]
Functional causal model
(Pearl et al.)
•
generative function model for continuous variables
•
$x_i = f_i(pa_i, e_i)$, $i = 1, \dots, n$
•
in econometrics, social sciences...
•
well-defined examples
•
Granger causality: effects follow causes in a linear form
•
LiNGAM: linear, non-Gaussian and acyclic causal model
(Shimizu et al., 2006)
$PA_i$: parents (causes) of $X_i$; $E_i$ mutually independent
[Figure: $PA_i$ and $E_i$ enter $f_i$]
FCM: why independence between X and E?
•
If X ⊥⊥ E:
[Figure: graph over X, E, Y]
•
Otherwise, according to Reichenbach's Common Cause Principle:
[Figure: graph over X, E, Y, Z]
•
much more complicated
FCM: A general view
•
Without constraints on f, for given (X, Y), both $y = f_1(x, e)$ with $E \perp\!\!\!\perp X$ and $x = f_2(y, e_1)$ with $E_1 \perp\!\!\!\perp Y$ are possible
•
with a Gram-Schmidt-orthogonalization procedure (Darmois, 1951)
$\tilde{x} = \mathrm{cdf}(x_1)$, so $\tilde{x} \sim U(0,1)$; $e = \mathrm{cdf}(y \mid \tilde{x}) = \int_{-\infty}^{x_2} p_{\tilde{x},y}(\tilde{x}, t)\, dt$.
Suppose we observe the data
x
A universal way to construct
“trivial” FCMs
•
$e' = h \circ \mathrm{CCDF}_{Y|X}(y \mid x)$ is always independent of X
•
Functional causal model: $y = \mathrm{CCDF}_{Y|X}^{-1} \circ h^{-1}(e')$ for any x
•
how to make it identifiable (break the symmetry)?
[Figure: scatter plot of (x, y), with CCDF(y | x) and h(CCDF(y | x)) plotted against x]
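A small sketch of such a "trivial" FCM in the wrong direction. It assumes a linear-Gaussian pair (Y = X + N with standard normal X and N), chosen only because the conditional CDF of X given Y is then available in closed form; the variable names are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)          # true direction: X -> Y

# For this assumed model, X | Y = y is Gaussian with mean y/2 and variance 1/2.
std_normal_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
e_back = std_normal_cdf((x - y / 2.0) / math.sqrt(0.5))  # conditional CDF of x given y

corr_e_y = np.corrcoef(e_back, y)[0, 1]
print(f"mean(e) = {e_back.mean():.3f}, corr(e, Y) = {corr_e_y:.4f}")
```

The constructed noise is uniform on (0, 1) and independent of Y, so a backward model x = f(y, e) exists as well. This is why constraints on f are needed to break the symmetry.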
General FCMs:
independence vs. likelihood
•
relating mutual information I and likelihood l:
$$l_{X \to Y}(\beta) = \sum_{i=1}^{n} \log P_F(x_i, y_i) = \sum_{i=1}^{n} \log P(X = x_i, Y = y_i) - I(X, E; \beta)$$
•
If X→Y follows the model:
$$l_{X \to Y}(\beta^*) - l_{Y \to X}(\beta^*_Y) = I(Y, E_Y; \beta^*_Y).$$
•
also holds for more than two variables
[Figure: X and E enter f(·; β), which outputs Y]
A basic functional causal model
LiNGAM model
•
linear, non-Gaussian, acyclic causal model (LiNGAM) (Shimizu et al., 2006):
$$x_i = \sum_{j:\ \text{parents of } i} b_{ij} x_j + e_i \quad \text{or} \quad x = Bx + e$$
•
disturbances (errors) $e_i$ are non-Gaussian (or at most one is Gaussian) and mutually independent
•
example:
[Figure: graph with edges $x_2 \to x_3$ (0.5), $x_2 \to x_1$ (−0.2), $x_3 \to x_1$ (0.3) and disturbances $e_1, e_2, e_3$]
$$x_2 = e_2, \quad x_3 = 0.5 x_2 + e_3, \quad x_1 = -0.2 x_2 + 0.3 x_3 + e_1.$$
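The example can be simulated directly from x = Bx + e. The sketch below assumes Laplace-distributed disturbances as one possible non-Gaussian choice; the sample size and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
# Row i holds the coefficients b_ij of x_{i+1}; variable order is (x1, x2, x3).
B = np.array([[0.0, -0.2, 0.3],
              [0.0,  0.0, 0.0],
              [0.0,  0.5, 0.0]])
e = rng.laplace(size=(3, n))            # assumed non-Gaussian disturbances
x = np.linalg.solve(np.eye(3) - B, e)   # x = (I - B)^{-1} e

e_rec = (np.eye(3) - B) @ x             # disturbances recovered from the data
order = [1, 2, 0]                       # causal order x2, x3, x1
B_perm = B[np.ix_(order, order)]        # strictly lower triangular in that order
print(B_perm)
```

Acyclicity shows up as the existence of a variable ordering in which B is strictly lower triangular.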
ICA:
A well-known technique making use of
non-Gaussianity
[Figure: an unknown mixing system A maps independent sources $s_1, \dots, s_n$ to observed signals $x_1, \dots, x_m$ (x = A·s); the ICA system W (de-mixing estimate) produces outputs $y_1, \dots, y_n$ (y = W·x) that are as independent as possible]
•
assumptions in ICA
•
at most one of $s_i$ is Gaussian
LiNGAM analysis by ICA
•
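LiNGAM:
x = Bx + e ⇒ e = (I − B)x
•
B has a special structure:
acyclic relations
•
ICA:
y = Wx
•
B can then be seen from W by permutation and re-scaling
•
e.g.
[Figure: example de-mixing matrix W]
So we have the causal relation:
[Figure: causal graph over $x_2$, $x_3$, $x_1$ with an edge weight 0.5]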
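A sketch of the permutation and re-scaling step, assuming W is known exactly up to row order and row scale. The brute-force search and the criterion (minimise the sum of 1/|w_ii| over row permutations) are one simple choice for small n; the function name `b_from_w` is hypothetical.

```python
import itertools
import numpy as np

def b_from_w(W):
    """Undo row permutation and scaling of W = I - B, then return B."""
    n = W.shape[0]
    # Pick the row permutation that keeps zeros off the diagonal.
    best = min(
        itertools.permutations(range(n)),
        key=lambda p: sum(1.0 / max(abs(W[p[i], i]), 1e-12) for i in range(n)),
    )
    Wp = W[list(best), :]
    Wp = Wp / np.diag(Wp)[:, None]      # rescale rows to unit diagonal
    return np.eye(n) - Wp

B = np.array([[0.0, -0.2, 0.3],
              [0.0,  0.0, 0.0],
              [0.0,  0.5, 0.0]])
W_true = np.eye(3) - B
scales = np.array([2.0, -0.5, 3.0])                 # arbitrary row scales
W_obs = (scales[:, None] * W_true)[[2, 0, 1], :]    # arbitrary row order
B_hat = b_from_w(W_obs)
print(B_hat)
```

With an estimated W the zeros are only approximate, so a practical method also needs a pruning or testing step that this sketch leaves out.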
Related work &
applications
•
ICA with sparse connections (Zhang et al., 2008); Direct LiNGAM (Shimizu et al., 2009)
•
with mild nonlinear distortion allowed; application in finance (Zhang & Chan, 2006 & 2008)
•
extended Granger causality analysis for time series (Hyvärinen et al., 2010; Zhang and Hyvärinen, 2009)
Nonlinear ICA with Minimal Nonlinear Distortion
Hang Seng Utilities Index. x1, x9, and x11 are constituents of Hang Seng Property Index. 3. Large bank companies are the cause of many stocks, meaning that the international impact to the Hong Kong stock market is probably reflected through large banks. Here x5 and x8 are the two largest banks in Hong Kong. 4. Stocks in Hang Seng Property Index tend to depend on many other stocks, while they hardly influence others. Here x1, x9, and x11 are in Hang Seng Property Index. These findings also indicate that the independent factor model may provide a reasonable way to explain the generation of stock returns.
x1: Cheung Kong (0001.hk); x2: CLP Hldgs (0002.hk); x3: HK & China Gas (0003.hk); x4: Wharf (Hldgs) (0004.hk); x5: HSBC Hldg (0005.hk); x6: HK Electric (0006.hk); x7: Hang Lung Dev (0010.hk); x8: Hang Seng Bank (0011.hk); x9: Henderson Land (0012.hk); x10: Hutchison (0013.hk); x11: Sun Hung Kai Prop (0016.hk); x12: Swire Pacific 'A' (0019.hk); x13: Bank of East Asia (0023.hk); x14: Cathay Pacific Air (0293.hk)
Figure 8. Causal diagram of the 14 stocks.
7. Conclusion
We have proposed the "minimal nonlinear distortion" principle for solving the nonlinear ICA problem. This principle helps to reduce the indeterminacies in solutions of nonlinear ICA and to overcome the ill-posedness of nonlinear ICA. With this principle, the solution whose nonlinear mixing system is close to linear is preferred. Experimental results with synthetic data show that when the data are generated with mild nonlinear distortion, the proposed method produces good and reliable results for separating various nonlinear mixtures. The successful application of the proposed nonlinear ICA method to causality discovery in the Hong Kong stock market illustrates the applicability of the method and the validity of the "minimal nonlinear distortion" principle for some real-life problems. The result also supports the validity of the independent factor model in finance.
Acknowledgement
This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administration Region, China.
References
Almeida, L. B. (2003). MISEP — linear and nonlinear ICA based on mutual information. Journal of Machine Learning Research, 4, 1297–1318.
Almeida, L. B. (2005). Separating a real-life nonlinear image mixture. Journal of Machine Learning Research, 6, 1199–1229.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bishop, C. (1993). Curvature-driven smoothing: a learning algorithm for feedforward networks. IEEE Trans. on Neural Networks, 4, 882–884.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96, 1348–1360.
Harmeling, S., Ziehe, A., Kawanabe, M., & Müller, K. (2003). Kernel-based nonlinear blind source separation. Neural Computation, 15, 1089–1124.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. on Neural Networks, 10(3), 626–634.
Hyvärinen, A., & Karthikesh, R. (2000). Sparse priors on the mixing matrix in independent component analysis. Proc. 2nd Int. Workshop on ICA and BSS (ICA2000) (pp. 477–452). Helsinki, Finland.
Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12, 429–439.
Jutten, C., & Karhunen, J. (2003). Advances in nonlinear blind source separation. Proc. 4th Int. Symp. on ICA and BSS (ICA2003) (pp. 245–256). Invited paper in the special session on nonlinear ICA and BSS.
Jutten, C., & Taleb, A. (2000). Source separation: From dusk till dawn. 2nd Int. Workshop on ICA and BSS (ICA 2000) (pp. 15–26). Helsinki, Finland.
Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.
Shimizu, S., Hoyer, P., Hyvärinen, A., & Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.
Taleb, A., & Jutten, C. (1999). Source separation in post-nonlinear mixtures. IEEE Trans. on Signal Processing, 47, 2807–2820.
Tan, Y., Wang, J., & Zurada, J. M. (2001). Nonlinear blind source separation using a radial basis function network. IEEE Trans. on Neural Networks, 12, 124–134.
Tikhonov, A. N., & Arsenin, V. A. (1977). Solutions of ill-posed problems. Washington: Winston & Sons.
Valpola, H. (2000). Nonlinear independent component analysis using ensemble learning: Theory. Proc. 2nd Int. Workshop on ICA and BSS (ICA2000) (pp. 251– 256). Helsinki, Finland.
25 Thursday, 18 October 2012
Linear functional causal model
with temporal constraints
Granger causality
•
$X_1$: $\{x_{1t}\}$ Granger causes $X_2$: $\{x_{2t}\}$ if it contains information helping predict $x_{2,t+h}$ ($h > 0$) contained nowhere else (Granger, 1969)
•
temporal constraint: causes must precede effects + linear causal relations
•
Vector autoregression (VAR) estimated by multivariate least squares (MLS)
$$x_t = \sum_{\tau=1}^{p} B_\tau x_{t-\tau} + e_t$$
[Figure: graph over $x_{1t}$ and $x_{2t}$]
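A minimal sketch of estimating a VAR by least squares, assuming p = 1, two variables, and Gaussian innovations. The coefficient matrix `B1` is an illustrative choice in which x1 helps predict x2 but not the reverse.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 20_000
B1 = np.array([[0.5, 0.0],     # x1 depends only on its own past
               [0.4, 0.3]])    # x2 depends on the past of x1 and of itself
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = B1 @ x[t - 1] + rng.normal(size=2)

past, present = x[:-1], x[1:]
coef, *_ = np.linalg.lstsq(past, present, rcond=None)  # present ≈ past @ coef
B1_hat = coef.T                                        # estimate of B_1
print(np.round(B1_hat, 2))
```

A non-zero entry (2, 1) with a near-zero entry (1, 2) is read as "x1 Granger causes x2"; a proper analysis would attach a significance test to each coefficient.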
With instantaneous effects
•
Are $e_{it}$ independent? instantaneous effects between $x_{it}$ (Reale, Wilson et al., 2001)
•
Granger causality with instantaneous effects:
$$x_t = \sum_{\tau=1}^{p} B_\tau x_{t-\tau} + B_0 x_t + e_t, \quad \text{or} \quad x_t = \sum_{\tau=0}^{p} B_\tau x_{t-\tau} + e_t$$
Dotted line: instantaneous causal effects between $x_{1t}$ and $x_{2t}$
What happens if we ignore instantaneous effects
(Hyvärinen et al., 2008)
•
They become confounders...
•
Example
$$x_t = \sum_{\tau=1}^{p} (I - B_0)^{-1} B_\tau x_{t-\tau} + (I - B_0)^{-1} e_t$$
Excerpt (Hyvärinen, Zhang, Shimizu and Hoyer, Estimation of SVAR model using non-Gaussianity):
While this phenomenon is, in principle, well-known in econometric literature (Swanson and Granger, 1997; Demiralp and Hoover, 2003; Moneta and Spirtes, 2006), Eq. (11) is seldom applied because estimation methods for $B_0$ have not been well developed. To our knowledge, no estimation method for $B_0$ has been proposed which is consistent for the whole matrix without strong prior assumptions on $B_0$.
Next we present some theoretical examples of how the instantaneous and lagged effects interact based on the formula in (11).
7.1.1 EXAMPLE 1: AN INSTANTANEOUS EFFECT MAY SEEM TO BE LAGGED
Consider first the case where the instantaneous and lagged matrices are as follows:
$$B_0 = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad B_1 = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.9 \end{pmatrix}.$$
That is, there is an instantaneous effect $x_2 \to x_1$, and no lagged effects (other than the purely autoregressive $x_i(t-1) \to x_i(t)$). Now, if an AR(1) model is estimated for data coming from this model, without taking the instantaneous effects into account, we get the autoregressive matrix
$$M_1 = (I - B_0)^{-1} B_1 = \begin{pmatrix} 0.9 & 0.9 \\ 0 & 0.9 \end{pmatrix}.$$
Thus, the effect $x_2 \to x_1$ seems to be lagged although it is, actually, instantaneous.
7.1.2 EXAMPLE 2: SPURIOUS EFFECTS APPEAR
Consider three variables with the instantaneous effects $x_1 \to x_2$ and $x_2 \to x_3$, and no lagged effects other than $x_i(t-1) \to x_i(t)$, as given by
$$B_0 = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad B_1 = \begin{pmatrix} 0.9 & 0 & 0 \\ 0 & 0.9 & 0 \\ 0 & 0 & 0.9 \end{pmatrix}.$$
If we estimate an AR(1) model for the data coming from this model, we obtain
$$M_1 = (I - B_0)^{-1} B_1 = \begin{pmatrix} 0.9 & 0 & 0 \\ 0.9 & 0.9 & 0 \\ 0.9 & 0.9 & 0.9 \end{pmatrix}.$$
This means that the estimation of the simple autoregressive model leads to the inference of a direct lagged effect $x_1 \to x_3$, although no such direct effect exists in the model generating the data, for any time lag.
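Both examples can be checked numerically. The sketch below evaluates $M_1 = (I - B_0)^{-1} B_1$ for the two pairs of matrices; the helper name `reduced_form` is hypothetical.

```python
import numpy as np

def reduced_form(B0, B1):
    """Lag-1 matrix seen by a plain AR(1) fit: (I - B0)^{-1} B1."""
    return np.linalg.solve(np.eye(B0.shape[0]) - B0, B1)

# Example 1: instantaneous effect x2 -> x1 only.
M1_ex1 = reduced_form(np.array([[0.0, 1.0], [0.0, 0.0]]), 0.9 * np.eye(2))

# Example 2: instantaneous effects x1 -> x2 and x2 -> x3.
B0_ex2 = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
M1_ex2 = reduced_form(B0_ex2, 0.9 * np.eye(3))

print(M1_ex1)
print(M1_ex2)
```

The entry (3, 1) of the second result is the spurious lagged effect x1 → x3: it is non-zero in the reduced form even though both $B_0$ and $B_1$ have a zero there.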
A more reassuring result is the following: if the data follows the same causal ordering for all time lags, that ordering is not contradicted by the neglect of instantaneous effect. A rigorous definition of this property is the following.
Theorem 1 Assume that there is an ordering $i(j)$, $j = 1, \dots, n$ of the variables such that no effect goes backward, that is,
$$B_\tau(i(j - k), i(j)) = 0 \quad \text{for } k > 0,\ \tau \ge 0,\ 1 \le j \le n. \quad (15)$$
[Figure: graph over $x_{1,t-1}, x_{2,t-1}, x_{3,t-1}$ and $x_{1t}, x_{2t}, x_{3t}$ with edge weights 0.9 and 1]
$$x_t = \sum_{\tau=0}^{p} B_\tau x_{t-\tau} + e_t$$
Identification
(Zhang & Hyvärinen, 2009)
•
$e_{it}$ independent for different i and t, i.e., spatially & temporally independent
•
If at most one of $e_{it}$ is Gaussian, it can be solved by multichannel blind deconvolution (MBD) with causal FIR filters
•
MBD estimates W to make $\hat{e}_{it}$ spatially and temporally independent
•
$B_\tau$ can be found from $W_\tau$, by
Experiment on financial data
•
extended Granger causality analysis of daily returns of stock
indices DJI, N225, HSI, and SSEC, with k = 1 lag
[Figure 2: Results of application of our model to daily returns of the stock indices DJI, N225, HSI, and SSEC, with k = 1 lag. Large coefficients (greater than 0.1) are shown in bold and red.]
Next, we fitted an ordinary vector autoregressive model with 10 lags on the estimated sources, finding the corresponding innovation series which we denote by $y_i(t)$, i = 1, ..., 17. Our goal was to analyze if there are some influences between the magnitudes of these innovations. We prefer to analyze the innovations because the innovations are approximately white both temporally and spatially, and thus we can analyze the magnitudes with no contamination by linear (auto)correlations of the source signals. The autoregressive model order 10 was chosen because it was the smallest order that gave approximately white innovations.
We then fitted the SVAR model on the logarithmically transformed magnitudes $x_i(t) = \log(0.2 + |y_i(t)|)$, i = 1, ..., 17. We determined the order k of our SVAR model by minimizing the AIC criterion (Akaike, 1973), which is the negative log-likelihood of the MBD model plus a term measuring the complexity of the model. The log-likelihood involves the densities of the MBD outputs $\hat{e}_i(t)$, which were modelled by a mixture of three Gaussians. From the candidate orders between 0 and 20, we found that k = 2 gave the minimum AIC.
After finding the estimate of the coefficients $\hat{B}_\tau$, τ = 0, 1, 2 with the MBD-based approach, one can easily calculate the estimates of the statistics $S_0(i \leftarrow j)$ and $S_{lag}(i \leftarrow j)$. The bootstrapping approach given in Section 6 was used to evaluate if these estimated statistics are significant. Here we need to test multiple hypotheses simultaneously; to reduce the type I error, we adopted the Bonferroni correction (Shaffer, 1995) for multiple testing correction. We used the significance level 5%. For both the instantaneous and lagged effects, one needs to perform 17 × 16 = 272 tests; therefore, the significance level for each individual test is then $0.05/272 \approx 2 \times 10^{-4}$. We used $10^4$ replications for the bootstrapping.
For illustration, we give the empirical distribution of the statistics $S_0(7 \leftarrow 14)$ and $S_{lag}(7 \leftarrow 14)$, as well as their estimated values for the original series $x_i(t)$, in Fig. 3. Clearly $\hat{S}_0(7 \leftarrow 14)$ is significant, while $\hat{S}_{lag}(7 \leftarrow 14)$ is not.
Fig. 4 shows the resulting diagram of causal analysis with instantaneous effects between the magnitudes of the selected MEG sources, with the influences significant at 5% level (corrected for multiple testing). What we see is that the connections tend to be strong between sources which are close to each other. For example, the occipitoparietal sources such as #1, #2, #3, #8, and #11 have strong interconnections. Some perirolandic sources such as #5, #7, #10, and #14 are also interconnected. Sources #4 and #16 seem to mediate between these two groups.
Experiment on brain signals
•
extended Granger causality analysis of the log-magnitude of
MEG sources (significant at 5% level; corrected for multiple
testing).
[Figure 3: Illustration of the empirical distribution of the statistics under the null hypothesis obtained by bootstrapping. (a) Histogram of $\hat{S}_0(7 \leftarrow 14)$ under the null hypothesis. (b) Histogram of $\hat{S}_{lag}(7 \leftarrow 14)$ under the null hypothesis. Each panel marks the critical value at the $2 \times 10^{-4}$ level and the estimated statistic.]
[Figure 4: Results of application of our model on the log-magnitudes of the MEG sources (significant at 5% level, corrected for multiple testing). Black dashed line: instantaneous effect. Red solid line: lagged effect. The thickness of the lines indicates the strength of the influences.]
Summary: Extended Granger
causality analysis
•
Granger causality as a special functional causal
model
•
Even with temporal information, it might be
necessary to model instantaneous effects
•
Alternative formulation
(Zhang 2011b)
: linear
non-Gaussian state-space model
Excerpt (Zhang, Hyvärinen):
In practice the data are usually noisy, i.e., the observed data contain observation errors, and the latent source processes exhibit some temporal structures (which may include delayed influences between them). The state-space representation then offers a powerful modeling approach. Here we are particularly interested in the linear state-space model (SSM) or linear dynamic system (Kalman, 1960; van Overschee and de Moor, 1996). Denote by $x_t = (x_{1t}, ..., x_{nt})^T$, t = 1, ..., N, the vector of the observed signals, and by $y_t = (y_{1t}, ..., y_{mt})^T$ the vector of latent processes which are our main object of interest.¹ The observed data are assumed to be linear mixtures of the latent processes together with some noise effect, while the latent processes follow a vector autoregressive (VAR) model. Mathematically, we have
$$x_t = A y_t + e_t, \quad (1)$$
$$y_t = \sum_{\tau=1}^{L} B_\tau y_{t-\tau} + \epsilon_t, \quad (2)$$
where $e_t = (e_{1t}, ..., e_{nt})^T$ and $\epsilon_t = (\epsilon_{1t}, ..., \epsilon_{mt})^T$ denote the observation error and process noise, respectively. Moreover, $e_t$ and $\epsilon_t$ are both temporally white and independent of each other. One can see that because of the state transition matrices $B_\tau$, $y_{it}$ are generally dependent, even if $\epsilon_{it}$ are mutually independent.
In traditional SSMs, both $\epsilon_t$ and $e_t$ are assumed to be Gaussian; or equivalently, one makes use of their covariance structure, and the statistical properties beyond second-order are not considered. In Kalman filtering (Kalman, 1960), A and $B_\tau$ are given, and the goal is to do inference, i.e., to estimate $y_t$ based on $\{x_t\}$. Learning of the parameters A, $B_\tau$, and the covariance matrices of $e_t$ and $\epsilon_t$ was also studied; see, e.g., van Overschee and de Moor (1991); Ghahramani and Hinton (1996). However, it is well-known that under the above assumptions, the SSM model is generally not identifiable; see e.g., Arun and Kung (1990), and consequently, one can not use this model to recover the latent processes $y_{it}$.
Under specific structural constraints on $B_\tau$ or A, the SSM model (1)–(2) may become identifiable, so that it can be used to reveal the underlying structure of the data. Many existing models which are used for source separation or prediction of time series can be considered as special cases of this model. For instance, the temporal structure based source separation (Murata et al., 2001) assume that $B_\tau$ are diagonal. The model also becomes identifiable with some other structural constraints on A, as discussed in Xu (2002). However, one should note that in practice such constraints may not hold; for instance, for the electroencephalography (EEG) or magnetoencephalography (MEG) data, some underlying processes or sources may have delayed influences on others, and letting $B_\tau$ be diagonal will destroy these types of connectivities.
On the other hand, distributional information also helps system identification. One can ignore the temporal information and perform system identification based on the non-Gaussianity of the data. For example, if the matrices $B_\tau$ are zero and $e_i(t)$ are non-Gaussian, it is reduced to the noisy ICA problem or the independent factor analysis (IFA) model (Attias, 1999). In the noiseless case, ICA could recover the underlying linear mixing system up to trivial indeterminacies. But in the noisy case, the model is just partially
1. We use the terms latent processes, factors, and sources interchangeably in this paper, depending on application scenarios.
Now comes...
Post-nonlinear (PNL) causal model
X1 → X2 → X3
(even if very nonlinear)
Three Effects usually encountered in a
causal model
(Zhang & Hyvärinen, 2009)
•
Without prior knowledge, the assumed model is expected to be
•
general enough
: adapted to approximate the true generating process
•
identifiable
: asymmetry in causes and effects
•
represented by post-nonlinear causal model with inner additive
noise
PNL causal model with inner additive
noise
•
acyclic data-generating process
•
two-variable case
•
$x_1 \to x_2$: $x_2 = f_{2,2}(f_{2,1}(x_1) + e_2)$
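A short sketch of data generated by a two-variable PNL model. The specific functions (tanh as the inner $f_{2,1}$, a cube as the invertible outer $f_{2,2}$) and the noise distribution are illustrative assumptions, not choices made in the slides.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
x1 = rng.uniform(-2.0, 2.0, size=n)
e2 = rng.laplace(scale=0.3, size=n)   # assumed noise, independent of x1

f21 = np.tanh                         # assumed inner nonlinearity
f22 = lambda s: s ** 3                # assumed invertible post-nonlinear distortion

x2 = f22(f21(x1) + e2)

# With the true functions known, the noise is recovered by inverting f22.
e2_rec = np.cbrt(x2) - f21(x1)
corr_noise_cause = np.corrcoef(e2_rec, x1)[0, 1]
print(f"corr(e2_rec, x1) = {corr_noise_cause:.4f}")
```

In practice neither function is known; both have to be estimated so that the recovered noise is as independent of the hypothetical cause as possible.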
Special cases of PNL causal model
•
If $f_{i,1}$ and $f_{i,2}$ are both linear
•
At most one of $e_i$ is Gaussian: LiNGAM (Shimizu et al., 2006)
•
All of $e_i$ are Gaussian: linear Gaussian case (Spirtes, Pearl et al.)
•
If $f_{i,2}$ is identity: nonlinear causal discovery with additive noise models (Hoyer et al., 2009, Zhang 2009b)
Identifiability in two-variable case
•
Is the causal direction implied by the model unique?
•
We tackle this problem with a proof by contradiction
•
Assume both x_1 → x_2 and x_1 ← x_2 satisfy the PNL model
Corollaries easy to verify
•
Corollary 1: If p_{e_2} is not Gaussian, nor log-mix-lin-exp, nor a generalized mixture of two exponentials, then the PNL causal model is identifiable
•
Corollary 2: If function f_1 is not invertible, then the PNL causal model is identifiable
Method for distinguishing cause
from effect
•
Examine if x_1 → x_2 holds
•
Examine if x_2 → x_1 holds
•
Draw conclusions
•
Only one of them holds ☺
•
Both hold: they could not be distinguished by PNL
•
Additional information on the nonlinearities, such as the smoothness, nonlinear distortion level, etc., may be helpful
•
If neither of them holds, data do not follow PNL, or confounders have significant effects ☹
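The three possible conclusions can be written as a small decision rule. This is only a sketch: it assumes each direction has already been examined and summarised by the p-value of an independence test between the hypothetical cause and the estimated disturbance, and the function name `decide_direction` and the threshold `alpha` are illustrative assumptions rather than part of the slides.

```python
def decide_direction(p_forward, p_backward, alpha=0.05):
    """Combine two independence-test p-values into one of the conclusions above.

    p_forward:  p-value for independence of x_1 and the disturbance estimated under x_1 -> x_2
    p_backward: p-value for independence of x_2 and the disturbance estimated under x_2 -> x_1
    A direction "holds" when independence is not rejected at level alpha (assumed threshold).
    """
    forward_holds = p_forward > alpha
    backward_holds = p_backward > alpha
    if forward_holds and not backward_holds:
        return "x1 -> x2"
    if backward_holds and not forward_holds:
        return "x2 -> x1"
    if forward_holds and backward_holds:
        return "both hold: not distinguishable by PNL"
    return "neither holds: data do not follow PNL, or confounders matter"
```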
Method to examine if x_1 → x_2
•
If x_1 → x_2, i.e., x_2 = f_{2,2}(f_{2,1}(x_1) + e_2), we have
e_2 = f_{2,2}^{-1}(x_2) - f_{2,1}(x_1) is ind. from x_1
•
Two-step procedure to examine if x_1 → x_2
•
Step 1: makes y_2 = g_2(x_2) - g_1(x_1) and x_1 as ind. as possible,
such that y_2 provides ê_2
•
Step 2: uses independence tests (Gretton et al., 2008) to
verify if x_1 and ê_2 are ind.
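A heavily simplified sketch of the two-step procedure follows. It is not the method from the slides: g_2 is fixed to the identity (the additive-noise special case), g_1 is fitted by polynomial least squares instead of by making y_2 and x_1 as independent as possible, and the independence test is an HSIC statistic with a permutation null. All function names, the polynomial degree, the kernel bandwidth rule, and the toy data are assumptions made for illustration.

```python
import numpy as np

def gaussian_gram(v):
    """Gaussian kernel matrix with a median-heuristic bandwidth (assumed choice)."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def hsic_permutation_pvalue(a, b, n_perm=200, rng=None):
    """Permutation p-value for the biased HSIC statistic between 1-D samples a and b."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(a)
    H = np.eye(n) - 1.0 / n                    # centering matrix I - (1/n) 1 1^T
    Kc = H @ gaussian_gram(a) @ H              # centered kernel matrix of a
    L = gaussian_gram(b)
    stat = np.sum(Kc * L) / n ** 2             # equals trace(K H L H) / n^2
    null = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(n)                 # permuting b breaks any dependence on a
        null[i] = np.sum(Kc * L[np.ix_(p, p)]) / n ** 2
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

def direction_pvalue(cause, effect, degree=5, rng=None):
    # Step 1 (simplified): g_2 = identity, g_1 = least-squares polynomial.
    coef = np.polyfit(cause, effect, degree)
    e_hat = effect - np.polyval(coef, cause)   # estimated disturbance
    # Step 2: test independence between the hypothetical cause and e_hat.
    return hsic_permutation_pvalue(cause, e_hat, rng=rng)

# Toy data generated as x_1 -> x_2 (assumed example).
rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, 300)
x2 = x1 ** 3 + rng.uniform(-1, 1, 300)

p_forward = direction_pvalue(x1, x2, rng=rng)   # examines x_1 -> x_2
p_backward = direction_pvalue(x2, x1, rng=rng)  # examines x_2 -> x_1
print(p_forward, p_backward)
```

A small p-value means independence is rejected, i.e. the examined direction does not hold. Handling a genuine post-nonlinearity would require estimating g_2 as well, which this sketch does not attempt.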
Application on real data
•
applied on “CausalEffectPairs”
•
80 data sets for cause-effect pairs; each contains
realizations of two variables
•
Causal direction is obvious to non-experts, but
background information is hidden for
participants
•
Goal: to distinguish cause from effect of the two
variables
[Figure: example data table with columns x_1 and x_2]
Performance
•
with automatic initialization
•
Local optima due to MLP’s. Performance improved with specific
IGCI: Deterministic Method
LINGAM: Shimizu et al., 2006
AN: Additive Noise Model (nonlinear)
PNL: AN with post-nonlinearity
GPI: Mooij et al., 2010
Identification with more than two variables
•
identifiability
(cf. Peters et al., 2011)
•
brute-force search infeasible
•
We show that when fitting x_1, …, x_n to the PNL causal model,
e_i are mutually ind., if and only if the causal Markov condition holds
and e_i is ind. from pa_i of the same variable
•
⇒ a more practical two-step method
•
find the equivalence class using conditional independences
•
determine the undetermined causal directions by testing if the
disturbance is ind. from the parents of the same variable
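To give a sense of why brute-force search over all structures is infeasible, the number of labeled DAGs on n variables can be computed with Robinson's recurrence. This is a standard combinatorial result added here for illustration; it is not part of the slides, and `count_dags` is a hypothetical helper name.

```python
from math import comb

def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    a = [1]                                    # a[0] = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            # inclusion-exclusion over the k nodes that have no incoming edges
            total += (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]

for n in range(1, 6):
    print(n, count_dags(n))
```

The count grows super-exponentially, which is one motivation for first restricting attention to an equivalence class and only then testing the remaining directions.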
Illustration: Boston housing data
•
inferred by PC + kernel-based CI test (Zhang et al., 2011)
•
PNL model further applied:
using partial correlation or mutual information. The former assumes linear relationships and Gaussian distributions, and the latter does not lead to a significance test. Sun et al. (2007) and Tillman et al. (2009) propose to use CI_PERM for CI testing in PC. Based on the promising results of KCI-test, we propose to also apply it to causal inference.
4.2.1 Simulated data
We generated data from a random DAG G. In particular, we randomly chose whether an edge exists and sampled the functions that relate the variables from a Gaussian Process prior. We sampled four random variables X1, ..., X4 and allowed arrows from Xi to Xj only for i < j. With probability 0.5 each possible arrow is either present or absent. If arrows exist, from X1 and X3 to X4, say, we sample X4 from a Gaussian Process with mean function U1 · X1 + U3 · X3 (with U1, U3 iid ∼ U[−2; 2]) and a Gaussian kernel (with each dimension randomly weighted between 0.1 and 0.6) plus a noise kernel as the covariance function. For significance level 0.01 and sample sizes between 100 and 700 we simulated 100 DAGs, and checked how often the different methods infer the correct Markov equivalence class. Figure 3 shows how often PC based on KCI-test, CI_PERM, or partial correlation recovered the correct Markov equivalence class. PC based on KCI-test gives clearly the best results.
[Plot: proportion of correct Markov equiv. classes (0.2 to 1) versus sample size (100 to 700) for KCI-test, CI_PERM, and part. corr.]
Figure 3: The chance that the correct Markov equivalence class was inferred with PC combined with different CI testing methods. KCI-test outperforms CI_PERM and partial correlation.
4.2.2 Real data
We applied our method to continuous variables in the Boston Housing data set, which is available at the UCI Repository (Asuncion and Newman, 2007). Due to the large number of variables we choose the significance level to be 0.001, as a rough way to correct for multiple testing. Figure 6 shows the results for PC using CI_PERM (PC_{CI_PERM}) and KCI-test (PC_{KCI-test}). For conciseness, we report them in the same figure: the red arrows are the ones inferred by PC_{CI_PERM} and all solid lines show the result by PC_{KCI-test}. Ergo, red solid lines were found by both methods. Please refer to the data set for the explanation of the variables. Although one can argue about the ground truth for this data set, we regard it as promising that our method finds links between number of rooms (RM) and median value of houses (MED) and between non-retail business (IND) and nitric oxides concentration (NOX). The latter is also missing in the result on these data given by Margaritis (2005); instead their method gives some dubious links like crime rate (CRI) to nitric oxides (NOX), for example.
[Graph over the variables RM, LST, AGE, NOX, IND, MED, CRI, DIS, TAX, B]
Figure 6: Outcome of the PC algorithm applied to the continuous variables of the Boston Housing Data Set (red lines: PC_{CI_PERM}, solid lines: PC_{KCI-test}).
5 Conclusion
We proposed a novel method for conditional independence testing. It makes use of the characterization of conditional independence in terms of uncorrelatedness of functions in suitable reproducing kernel Hilbert spaces, and the proposed test statistic can be easily calculated from the kernel matrices. We derived its distribution under the null hypothesis of conditional independence. This distribution can either be generated by Monte Carlo simulation or approximated by
Remark:
1. CRI - per capita crime rate by town
2. IND - prop. of non-retail business acres per town
3. NOX - nitric oxides concentration
4. RM - average number of rooms per dwelling
5. AGE - prop. of owner-occupied units built prior to 1940
6. DIS - weighted distances to 5 Boston employment centres
7. TAX - full-value property-tax rate per $10,000
8. B - 1000(Bk - 0.63)^2 where Bk is the prop. of blacks by town
9. LST - % lower status of the population
10. MED - Median of owner-occupied homes in $1000's
Simpler extension for more
than two variables
•
Simple case: component-wise nonlinear transformations of x_i, f_i(x_i), have linear causal relations
•
making use of PNL ICA to find (linear) causal relations from W
•
confirmed by simulations; real applications needed
[Figure: example graph over f_1(x_1), f_2(x_2), f_3(x_3) with edge weights 0.5, -0.2, 0.3 and disturbances e_1, e_2, e_3]
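The link between the de-mixing matrix W and the linear causal relations can be illustrated numerically. The sketch below is an assumption-laden toy: the edge weights 0.5, -0.2, 0.3 are the numbers appearing on this slide, but which edge each one belongs to, the causal order, and the nonlinearities are made up. It also uses the true transformations and the true W instead of estimating them with PNL ICA.

```python
import numpy as np

# z_i = f_i(x_i) follow a linear acyclic model z = B z + e (B strictly lower triangular).
# Assignment of the weights to edges is an assumption.
B = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [-0.2, 0.3, 0.0]])

rng = np.random.default_rng(0)
e = rng.laplace(size=(3, 2000))                # independent non-Gaussian disturbances
z = np.linalg.solve(np.eye(3) - B, e)          # solves z = B z + e

# Observed variables are component-wise distortions of z (assumed invertible choices).
x = np.vstack([np.sinh(z[0]), z[1] ** 3, np.exp(z[2])])

# Inverting the distortions gives f_i(x_i) = z_i again.
z_back = np.vstack([np.arcsinh(x[0]), np.cbrt(x[1]), np.log(x[2])])

# The de-mixing matrix that yields the disturbances is W = I - B.
W = np.eye(3) - B
e_back = W @ z_back

# ICA only determines W up to row scaling (and permutation); after normalising
# the diagonal to one, the causal coefficients are read off as B = I - W.
W_scaled = np.diag([2.0, -0.5, 3.0]) @ W       # arbitrary row scaling
B_from_W = np.eye(3) - W_scaled / np.diag(W_scaled)[:, None]
```

With an estimated W the row permutation also has to be resolved before the diagonal can be normalised; that step is omitted here.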
428 K. Zhang and L.-W. Chan
[Diagram: PNL mixing system (linear mixing A of sources s_1, ..., s_n into t_i, then nonlinear distortion f_i giving x_i) and PNL de-mixing system (inverse nonlinear transform g_i giving z_i, then linear de-mixing W giving y_i)]
Figure 1: PNL mixing system and PNL demixing system. The mixing system consists of a linear mixing stage (with the mixing matrix A) and a nonlinear transform stage (applying f_i to linear mixtures t_i). The demixing system does an inverse operation, which is a nonlinear transform stage followed by a linear demixing stage.
For simplicity, we have assumed there is no additive noise in this system. And in the following discussion, the number of independent sources, n, is assumed to be equal to that of observations, m. As a counterpart of the mixing model 2.1, the separation of PNL mixtures is also a two-stage procedure: a nonlinear stage followed by a linear demixing stage. Figure 1 shows the PNL mixing system and the demixing system, where g_i are elements of g used to invert the nonlinear mapping f, z has n elements z_i as an estimate of the latent linear mixtures t_i, and W is a linear demixing matrix transforming z_i to y_i, the estimate of independent sources s_i.
Taleb and Jutten (1999b) showed that when A has at least two nonzero entries per row or per column and s_i accepts a density function that vanishes at one point at least, one can affirm that the output y has mutually independent components if and only if g_i ∘ f_i is linear and W is a linear separating matrix for z. In other words, under these assumptions, the sources can be separated up to the same indeterminacies as in the linear ICA model, which are permutation and scaling indeterminacies. (In some work, for instance, Hyvärinen, Karhunen, & Oja, 2001, it is said there is one more indeterminacy, named sign indeterminacy, which can be considered as a special case of the scaling indeterminacy.) In fact, if the mean of s is unknown, one more
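The mixing and de-mixing stages described in the excerpt can be traced with a toy example. This is a sketch under stated assumptions: the mixing matrix, the choice f_i = tanh, and the uniform sources are made up, and the de-mixing uses the true inverse functions and the true inverse of A instead of estimating g and W from data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sources, n_samples = 3, 500

s = rng.uniform(-1.0, 1.0, size=(n_sources, n_samples))   # independent sources (assumed)
A = np.array([[1.0, 0.6, -0.4],
              [0.5, 1.0, 0.3],
              [-0.3, 0.7, 1.0]])                          # mixing matrix (assumed)

# PNL mixing system: linear mixing, then component-wise nonlinear distortion.
t = A @ s
x = np.tanh(t)                                            # f_i = tanh (assumed)

# PNL de-mixing system: inverse nonlinear transform, then linear de-mixing.
z = np.arctanh(x)                                         # g_i inverts f_i
W = np.linalg.inv(A)
y = W @ z                                                 # estimate of the sources

# Any row permutation and scaling of W also gives independent outputs,
# which is the indeterminacy mentioned in the text.
P = np.eye(n_sources)[[2, 0, 1]]
D = np.diag([2.0, -1.0, 0.5])
y_alt = (P @ D @ W) @ z
```

Here `y` reproduces the sources exactly, while `y_alt` reproduces them only up to order and scale.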