
lead to biased estimates of the path coefficients. It is thus important to know whether we can still identify a causal effect when some variables are unobserved.

(ii) Proxy measurement: In many real applications, the variables of interest are not directly measured. This is particularly common in the social sciences where the variable of interest may be socioeconomic status, personality, or political ideology. These variables may only be approximately measured by observable variables (proxies) like human behaviours and questionnaires.

3.22 Example. Excerpt of an educational psychology study.2

With latent variables, identifiability of path coefficients no longer follows from Proposition 3.21 because $\Sigma$ is only partially estimable. Path analysis (3.3) allows us to construct a mapping (Exercise 3.10)

\[ B \mapsto \Sigma(B) \]

between the path coefficients and the covariance matrix of the observed and unobserved variables.
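As a rough illustration of the mapping $B \mapsto \Sigma(B)$, the sketch below computes the implied covariance matrix of a small linear SEM and extracts the block that can be estimated from data. It assumes the convention that `B[i, j]` holds the path coefficient on the edge $i \to j$ and that the noise variables are independent. The function name `implied_covariance` and the toy graph are illustrative choices, not taken from the notes.

```python
import numpy as np

def implied_covariance(B, noise_var):
    """Covariance of X when X = B^T X + eps, with B[i, j] the coefficient on edge i -> j."""
    p = B.shape[0]
    T = np.linalg.inv(np.eye(p) - B.T)      # reduced form: X = T @ eps
    return T @ np.diag(noise_var) @ T.T

# Toy DAG (assumed for illustration): U -> X1 and U -> X2, with U latent (index 0).
B = np.zeros((3, 3))
B[0, 1], B[0, 2] = 0.8, 0.5
Sigma = implied_covariance(B, noise_var=[1.0, 0.36, 0.75])

# Only the block for the observed variables (X1, X2) is estimable.
Sigma_obs = Sigma[1:, 1:]
```

Here `Sigma` is the full covariance matrix of $(U, X_1, X_2)$, while `Sigma_obs` is the part an analyst could estimate when $U$ is unobserved.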

2 Marsh, H. W. (1990). Causal ordering of academic self-concept and academic achievement: A multiwave, longitudinal panel analysis. Journal of Educational Psychology, 82(4), 646–656. doi:10.1037/0022-0663.82.4.646.


Journal of Educational Psychology

Causal Ordering of Academic Self-Concept and Academic Achievement: A Multiwave, Longitudinal Panel Analysis

Herbert W. Marsh

School of Education and Language Studies

University of Western Sydney, Macarthur, New South Wales, Australia

There is surprisingly little sound research on the causal ordering of academic self-concept and academic achievement in longitudinal panel studies, despite its theoretical and practical significance. Data collected in Grades 10, 11, 12, and one year after graduation from high school that were used in this study come from the large (N = 1,456 students), nationally representative Youth in Transition study. It was found that reported grade averages in Grades 11 and 12 were significantly affected by academic self-concept measured the previous year, whereas prior reported grades had no effect on subsequent measures of academic self-concept. The results provide one of the few defensible demonstrations of prior academic self-concept influencing subsequent academic achievement, and the study appears to be methodologically stronger than previous research.

A positive self-concept is valued as a desirable outcome in many educational settings and is frequently posited as a mediating variable that facilitates the attainment of other desired outcomes such as academic achievement. A growing body of literature (e.g., Byrne, 1984; Hansford & Hattie, 1982; Marsh, 1986, 1987; Marsh, Byrne, & Shavelson, 1988; Shavelson & Bolus, 1982) indicates that academic self-concept is clearly differentiable from general self-concept and that academic self-concept is more highly correlated with academic achievement and other academic behaviors than is general self-concept. Marsh, Byrne, and Shavelson, for example, found that none of the general self-concept scales from three different instruments were significantly correlated with school grades in English, mathematics, or all school subjects, whereas academic self-concept scales were substantially correlated with achievement. This pattern of relations supports the construct validity of academic self-concept responses and the need for educational researchers to consider academic self-concept instead of relying on general self-concept scales.

Wylie (1978) suggested that students' perceptions of their academic ability are based largely on school performance, so that standardized ability test scores should add little to the prediction of self-concept beyond the contribution of school performance measures. Literature reviews (e.g., Hansford & Hattie, 1982; Wylie, 1979) have found school performance indicators to be more highly correlated with self-concept than are IQ or general academic achievement. However, I noted in Marsh (1987) (also see Davis, 1966), that school performance measures typically are normalized relative to other students within the school, whereas standardized tests are normalized in relation to a broader population. I suggested that high school students may use both frames of reference in forming their academic self-concepts. I also argued that school-based performance is more likely to be affected by effort and motivational influences than are standardized test scores, so that prior academic self-concept is more likely to affect subsequent school performance than to affect standardized test scores. For these reasons, I indicated the need to consider separately the effects of standardized tests scores and school performance in evaluating relations between academic self-concept and achievement.

I would like to thank Raymond Debus, Rhonda Craven, and Rosalie Robinson for helpful comments on earlier drafts of this article. The data used in this manuscript were made available by the Inter-University Consortium for Political and Social Research and were originally collected by Jerald Bachman.

Correspondence concerning this article should be addressed to Herbert W. Marsh, School of Education and Language Studies, University of Western Sydney, Macarthur, P.O. Box 555, Campbelltown, New South Wales 2560, Australia.

Causal Ordering of Academic Self-Concept and Academic Achievement

Perhaps the most vexing theoretical question in academic self-concept research involves determining the causal ordering of academic self-concept and academic achievement. This question is of practical importance because many self-concept enhancement programs are based on the assumption that an improvement in self-concept will lead to gains in academic achievement.

Byrne (1984) noted that much of the interest in the relation between self-concept and achievement stems from the belief that academic self-concept has motivational properties such that changes in academic self-concept will lead to changes in subsequent academic achievement. Calsyn and Kenny (1977) contrasted self-enhancement and skill development models of the relation between self-concept and achievement. According to the self-enhancement model, self-concept is a primary determinant of academic achievement. Support for this model would provide strong justification for the self-concept enhancement interventions that are explicit or im-

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Figure 1. Standardized effects of prior ability, reported grade averages, and academic self-concept on subsequent grades and academic self-concept for Model 3. (SELF = Self-concept. The 15 boxes represent the 15 measured variables shown in Table 1. The ovals represent latent constructs inferred from the measured variables. The numbers following each latent construct represent the data wave in which it was collected: T1 = 10th grade, T2 = 11th grade, T3 = 12th grade, and T4 = one year after graduation. The straight, bold lines connecting the different latent constructs represent path coefficients: the direct effects of each construct on all subsequent constructs. Nonsignificant path coefficients are excluded for purposes of clarity, but they are presented in Table 3 under Model 3. The curved lines represent correlated residuals between measured variables.)

precede grades. Because students were asked to report their grades from the previous year, I posited that school grades preceded academic self-concept. Similarly, at Time 2 reported average grades were posited to precede academic self-concept. At Time 3 and at Time 4, only one construct was considered, so there was no need to posit a causal ordering within each wave. One should note, however, that the ordering of variables within a given wave has no influence on the overall goodness of fit of the model and almost no influence on the path coefficients relating variables from different waves. In this a priori model, correlated residuals relating the uniquenesses of the same indicator of academic self-concept administered at different points in time were also posited as shown in Figure 1. Such correlated residuals are usually found in longitudinal panel studies, and their existence is likely to inflate estimates of the stability of the underlying construct. These observations were substantiated by fitting a series of alternative models.

In preliminary analyses, three models were evaluated in terms of their ability to fit the data. Each of the three models was reasonable in that the iterative procedure converged to a proper solution; each of the constructs inferred from multiple indicators was well-defined; and the overall goodness-of-fit indices, particularly given the large sample sizes, was moderate to good (for a discussion of evaluating goodness of fit, see Bentler & Bonett, 1980; Marsh, Balla, & McDonald, 1988). Model 1 (Table 2) did not include the correlated residuals that were hypothesized a priori. The fit of Model 1 is much poorer than the other two models, thus supporting the inclusion of the correlated residuals. Model 2 is the a priori model originally hypothesized and it fits the data very well. Inspection of the modification indices provided by LISREL, however, suggested that one additional correlated residual was required between two of the multiple indicators of academic ability (Model 3 in Table 2; also see Figure 1). The inclusion of this additional parameter made a small but statistically significant improvement in the goodness of fit. Model 3 provides an excellent fit and is the basis of subsequent analyses.

In SEMs latent constructs are automatically corrected for estimates of unreliability that are based on the design of the model, so long as


An entry (or a function) of $B$ is said to be identifiable if it can be expressed in terms of the distribution of the observed variables. In linear SEMs with normal errors, this is equivalent to expressing $B$ in terms of the submatrix of $\Sigma$ corresponding to the observed variables (because the multivariate normal distribution is uniquely determined by its mean and covariance matrix).

When the errors are non-normal, we may further use higher moments or the entire distribution of the observed variables to identify $B$. However, this approach is also more sensitive to the distributional assumptions. Below we will restrict our discussion to the case of normal errors.

3.23 Remark. The notion of identifiability can depend on the context of the problem. With latent variables, it is often the case that we can only identify some path coefficients up to a sign change. In other problems (such as problems with instrumental variables), the set of nonidentifiable path coefficients has measure zero (this is called generic identifiability). We will not differentiate between these concepts in the discussion below.

To identify the entire matrix $B$, a necessary condition is that $\Sigma$ has at least as many entries as $B$. Unfortunately, there is no known necessary and sufficient condition for identifiability in linear SEMs.3
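A small worked count may make the necessary condition concrete. The sketch below takes a single latent factor with unit variance and $p$ observed indicators (the model of Lemma 3.24). As an interpretive assumption, it counts the distinct entries of the observed covariance matrix against all free parameters, noise variances included.

```latex
\[
\underbrace{\frac{p(p+1)}{2}}_{\text{distinct entries of } \operatorname{Cov}(X)}
\;\ge\;
\underbrace{2p}_{\beta_1, \dots, \beta_p,\ \sigma_1^2, \dots, \sigma_p^2}
\quad\Longleftrightarrow\quad p \ge 3 .
\]
```

With $p = 2$ there are only 3 distinct covariance entries for 4 unknowns, so under this count the condition fails. With $p = 3$ the counts match (6 and 6).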

3 For a review on recent advances using algebraic geometry, see Drton, M. et al. (2018). Algebraic problems in structural equation modeling. In The 50th Anniversary of Gröbner Bases (pp. 35–86). Mathematical Society of Japan.


3.6 Factor models and measurement models

Below we give some examples in which the path coefficients are indeed identifiable.4 The basic idea is to use proxies for the latent variables.

Without loss of generality, we assume all the unmeasured variables are standardised so they all have unit variance. In the diagrams below, we will use dashed circles to indicate latent variables.

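Figure 3.5: Illustration for three-indicator rule ($p \ge 3$). [Diagram: a latent variable $U$ with directed edges to $X_1, X_2, \dots, X_p$, labelled with path coefficients $\beta_1, \beta_2, \dots, \beta_p$.]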

3.24 Lemma (Three-indicator rule). Consider any linear SEM for $(U, X_{[p]})$ corresponding to Figure 3.5. Suppose $X_{[p]}$ is observed but $U$ is not. Suppose $\operatorname{Var}(U) = 1$. Then the path coefficients are identifiable (up to a sign change) if $p \ge 3$ and at least 3 coefficients are nonzero.

4 More discussion and examples can be found in Bollen, K. A. (1989). Structural equations with latent variables. doi:10.1002/9781118619179, page 326.

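Proof. Denote the path coefficient for $U \to X_i$ as $\beta_i$ and the variance of the noise variable for $X_i$ as $\sigma_i^2$. It is straightforward to show that

\[ \operatorname{Cov}(X) = \beta\beta^T + \operatorname{diag}(\sigma_1^2, \dots, \sigma_p^2). \]

When $p = 3$, this means that

\[
\begin{pmatrix}
\operatorname{Var}(X_1) & & \\
\operatorname{Cov}(X_1, X_2) & \operatorname{Var}(X_2) & \\
\operatorname{Cov}(X_1, X_3) & \operatorname{Cov}(X_2, X_3) & \operatorname{Var}(X_3)
\end{pmatrix}
=
\begin{pmatrix}
\beta_1^2 + \sigma_1^2 & & \\
\beta_1\beta_2 & \beta_2^2 + \sigma_2^2 & \\
\beta_1\beta_3 & \beta_2\beta_3 & \beta_3^2 + \sigma_3^2
\end{pmatrix}.
\]

Therefore, we have

\[ \beta_1^2 = \frac{\operatorname{Cov}(X_1, X_2) \cdot \operatorname{Cov}(X_1, X_3)}{\operatorname{Cov}(X_2, X_3)}, \]

and similarly for $\beta_2^2$ and $\beta_3^2$. Although the sign of $\beta_1$ is not identifiable, it is easy to see that once it is fixed, the signs of $\beta_2$ and $\beta_3$ are also determined. Thus the vector $\beta$ is identifiable up to the transformation $\beta \mapsto -\beta$.

For $p > 3$, we can apply the above result to a 3-subset of $X_{[p]}$ whose corresponding path coefficients are nonzero. If $X_i$ belongs to that subset, every remaining coefficient then follows from $\operatorname{Cov}(X_i, X_j) = \beta_i\beta_j$.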
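The algebra in the proof can be sketched numerically as follows. The function name `three_indicator` and the parameter values are hypothetical, and the sign is fixed by the arbitrary convention $\beta_1 > 0$ (the flipped vector $-\beta$ fits the covariance matrix equally well).

```python
import numpy as np

def three_indicator(cov):
    """Recover (beta, noise variances) of a one-factor model from a 3x3 covariance matrix.

    Returns the solution with beta[0] > 0; the vector -beta is equally valid.
    """
    c12, c13, c23 = cov[0, 1], cov[0, 2], cov[1, 2]
    b1 = np.sqrt(c12 * c13 / c23)              # beta_1^2 = c12 * c13 / c23
    beta = np.array([b1, c12 / b1, c13 / b1])  # Cov(X1, Xj) = beta_1 * beta_j
    noise_var = np.diag(cov) - beta**2         # Var(Xi) = beta_i^2 + sigma_i^2
    return beta, noise_var

# Hypothetical true parameters, used only to build a population covariance matrix.
beta_true = np.array([0.9, -0.5, 0.7])
sigma2_true = np.array([0.3, 1.0, 0.5])
cov = np.outer(beta_true, beta_true) + np.diag(sigma2_true)

beta_hat, sigma2_hat = three_indicator(cov)
```

In practice `cov` would be replaced by a sample covariance matrix, in which case the recovered values are estimates rather than exact.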

3.25 Remark. Statistical inference for the graphical model in Figure 3.5 is often called a confirmatory factor analysis because the structure is already given. This is different from exploratory factor analysis (e.g., via principal component analysis), which tries to use observed data to discover the factor structure.


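3.26 Example. For the linear SEM corresponding to the graphical model in Figure 3.6, $\beta_{AY}$ is identifiable. To see this, assume without loss of generality that $A$ and $Y$ have unit variance. Lemma 3.24 applied to $\{X, A, Z\}$ identifies $\beta_{UA}$, and applied to $\{X, Y, Z\}$ it identifies the total effect of $U$ on $Y$, namely $\beta_{UY} + \beta_{UA}\beta_{AY}$. Both are determined up to a common sign change because the two triples share $X$ and $Z$. Together with $\operatorname{Cov}(A, Y) = \beta_{AY} + \beta_{UA}\beta_{UY}$, this gives two linear equations in $(\beta_{AY}, \beta_{UY})$ with determinant $1 - \beta_{UA}^2 > 0$, so $\beta_{UY}$ and hence $\beta_{AY} = \operatorname{Cov}(A, Y) - \beta_{UA}\beta_{UY}$ are also identified.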

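Figure 3.6: Illustration of using proxies of unmeasured confounders to remove unmeasured confounding bias. [Diagram: latent $U$ with edges $U \to A$, $U \to Y$, $U \to X$, $U \to Z$ (coefficients $\beta_{UA}$, $\beta_{UY}$, $\beta_{UX}$, $\beta_{UZ}$) and the edge $A \to Y$ (coefficient $\beta_{AY}$).]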
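One way to check the identification of $\beta_{AY}$ in Figure 3.6 numerically is sketched below. It builds the population covariance of $(X, A, Z, Y)$ from hypothetical coefficients, then recovers $\beta_{AY}$ using only that matrix, by applying the three-indicator rule to $\{X, A, Z\}$ and $\{X, Y, Z\}$. The coefficient values, the convention `B[i, j]` for the edge $i \to j$, and the sign choice $\beta_{UX} > 0$ are all assumptions of the sketch.

```python
import numpy as np

# Hypothetical path coefficients for Figure 3.6; variable order (U, X, A, Z, Y).
b_UX, b_UA, b_UZ, b_UY, b_AY = 0.6, 0.5, 0.7, 0.4, 0.3
B = np.zeros((5, 5))                      # B[i, j] = coefficient on edge i -> j
B[0, 1], B[0, 2], B[0, 3], B[0, 4] = b_UX, b_UA, b_UZ, b_UY
B[2, 4] = b_AY

# Noise variances chosen so that U, X, A, Z, Y all have unit variance.
explained_Y = b_AY**2 + b_UY**2 + 2 * b_AY * b_UA * b_UY
noise = np.array([1.0, 1 - b_UX**2, 1 - b_UA**2, 1 - b_UZ**2, 1 - explained_Y])
T = np.linalg.inv(np.eye(5) - B.T)        # reduced form: variables = T @ eps
S = (T @ np.diag(noise) @ T.T)[1:, 1:]    # covariance of the observed (X, A, Z, Y)
X, A, Z, Y = 0, 1, 2, 3

# Three-indicator rule on {X, A, Z}: beta_UX (sign fixed as positive), then beta_UA.
b_UX_hat = np.sqrt(S[X, A] * S[X, Z] / S[A, Z])
b_UA_hat = S[X, A] / b_UX_hat
# In the triple {X, Y, Z}, Y loads on U with the total effect beta_UY + beta_UA * beta_AY.
total_UY_hat = S[X, Y] / b_UX_hat
# Solve Cov(A, Y) = beta_AY + beta_UA * beta_UY together with the total effect.
b_AY_hat = (S[A, Y] - b_UA_hat * total_UY_hat) / (1 - b_UA_hat**2)
```

Flipping the sign convention changes `b_UA_hat` and `total_UY_hat` together, so their product and therefore `b_AY_hat` are unaffected.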

3.27 Exercise. Show that $\beta_{AY}$ is non-identifiable if $Z$ is unobserved in Figure 3.6.

Solution. Follow the derivations in the proof of Lemma 3.24.


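3.28 Exercise. Let $X_1, X_2, X_3$ be three random variables/vectors. The partial covariance between $X_1$ and $X_2$ given $X_3$ is defined as

\[ \operatorname{PCov}(X_1, X_2 \mid X_3) = \operatorname{Cov}(X_1, X_2) - \operatorname{Cov}(X_1, X_3) \operatorname{Var}(X_3)^{-1} \operatorname{Cov}(X_3, X_2). \]

Show that if we add a directed edge from $X$ to $A$ in Figure 3.6, $\beta_{AY}$ is still identifiable by5

\[ \beta_{AY} = \operatorname{Cov}(A, Y) - \frac{\operatorname{PCov}(X, Y \mid A)}{\operatorname{PCov}(X, Z \mid A)} \operatorname{Cov}(A, Z). \]

Solution. Assume all the random variables are standardised. In this case,

\[
\operatorname{Cov}\bigl((X, A, Y, Z)^T\bigr) =
\begin{pmatrix}
1 & & & \\
\sigma_{XA} & 1 & & \\
\sigma_{XY} & \sigma_{AY} & 1 & \\
\sigma_{XZ} & \sigma_{AZ} & \sigma_{YZ} & 1
\end{pmatrix},
\]

where, by path analysis,

\begin{align*}
\sigma_{XA} &= \beta_{XA} + \beta_{UX}\beta_{UA}, \\
\sigma_{XY} &= \beta_{XA}\beta_{AY} + \beta_{UX}\beta_{UA}\beta_{AY} + \beta_{UX}\beta_{UY}, \\
\sigma_{AY} &= \beta_{AY} + \beta_{UA}\beta_{UY} + \beta_{XA}\beta_{UX}\beta_{UY}, \\
\sigma_{XZ} &= \beta_{UX}\beta_{UZ}, \\
\sigma_{AZ} &= (\beta_{UA} + \beta_{XA}\beta_{UX})\beta_{UZ}, \\
\sigma_{YZ} &= (\beta_{UY} + \beta_{AY}\beta_{UA} + \beta_{AY}\beta_{XA}\beta_{UX})\beta_{UZ}.
\end{align*}

There are six covariances and six unknown path coefficients. More explicitly, after some algebra one can show

\[ \beta_{AY} = \sigma_{AY} - \frac{\sigma_{XY \cdot A}}{\sigma_{XZ \cdot A}} \sigma_{AZ}, \]

where $\sigma_{XY \cdot A} = \operatorname{PCov}(X, Y \mid A)$ and $\sigma_{XZ \cdot A} = \operatorname{PCov}(X, Z \mid A)$.

5 Kuroki, M., & Pearl, J. (2014). Measurement bias and effect restoration in causal inference. Biometrika, 101(2), 423–437. doi:10.1093/biomet/ast066.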
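The partial-covariance formula can be checked numerically with a short sketch. The coefficients below are hypothetical, `B[i, j]` is assumed to hold the coefficient on the edge $i \to j$, and the noise variances are chosen so that $X$ and $A$ have unit variance (in this particular check, only $\operatorname{Var}(A) = 1$ is needed for the formula as stated).

```python
import numpy as np

def pcov(S, i, j, k):
    """Partial covariance of variables i and j given the single variable k."""
    return S[i, j] - S[i, k] * S[k, j] / S[k, k]

# Hypothetical coefficients; variable order (U, X, A, Z, Y), with the extra edge X -> A.
b_UX, b_UA, b_UZ, b_UY, b_XA, b_AY = 0.6, 0.5, 0.7, 0.4, 0.2, 0.3
B = np.zeros((5, 5))                      # B[i, j] = coefficient on edge i -> j
B[0, 1], B[0, 2], B[0, 3], B[0, 4] = b_UX, b_UA, b_UZ, b_UY
B[1, 2], B[2, 4] = b_XA, b_AY

# Noise variances: X and A are scaled to unit variance, the others are arbitrary.
explained_A = b_XA**2 + b_UA**2 + 2 * b_XA * b_UA * b_UX
noise = np.array([1.0, 1 - b_UX**2, 1 - explained_A, 0.5, 0.5])
T = np.linalg.inv(np.eye(5) - B.T)        # reduced form: variables = T @ eps
S = (T @ np.diag(noise) @ T.T)[1:, 1:]    # covariance of the observed (X, A, Z, Y)
X, A, Z, Y = 0, 1, 2, 3

# beta_AY = Cov(A, Y) - PCov(X, Y | A) / PCov(X, Z | A) * Cov(A, Z)
b_AY_hat = S[A, Y] - pcov(S, X, Y, A) / pcov(S, X, Z, A) * S[A, Z]
```

The recovered `b_AY_hat` uses only entries of the observed covariance matrix `S`, which is the sense in which $\beta_{AY}$ is identified.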

The three-indicator rule is also quite useful in the so-called measurement models. In this type of problem (an instance is Example 3.22), we are interested in the causal effects between the latent variables (these are often abstract constructs like personalities and academic achievements).

Suppose the latent variables $U \in \mathbb{R}^q$ have unit variances and satisfy a linear SEM with respect to a prespecified DAG. The observed variables (or measurements) $X \in \mathbb{R}^p$ satisfy the following model (the intercept term is omitted for simplicity)

\[ X = \Gamma U + \epsilon_X, \]

where $\Gamma \in \mathbb{R}^{p \times q}$ is the factor loading matrix, $\epsilon_X$ is a vector of mutually independent mean-zero noise variables and $\epsilon_X \perp\!\!\!\perp U$.


3.29 Proposition. Suppose $(U, X)$ satisfy the measurement model described above. The path coefficients between the latent variables $U$ are identifiable (up to sign change of $U$) if the following conditions are satisfied:

(i) Each row of $\Gamma$ has only 1 nonzero entry (i.e., every measurement loads on only one factor).

(ii) Each column of $\Gamma$ has at least 3 nonzero entries (i.e., each factor has at least three measurements).

Proof. By Lemma 3.24, $\Gamma$ can be identified. The assumptions in the proposition statement also ensure that $\Gamma$ has full column rank, so $\operatorname{Cov}(U)$ can be identified from $\operatorname{Cov}(X) = \Gamma \operatorname{Cov}(U) \Gamma^T + \operatorname{Cov}(\epsilon_X)$, where the diagonal matrix $\operatorname{Cov}(\epsilon_X)$ is also identified by Lemma 3.24. The conclusion then follows from Proposition 3.21.

3.30 Example. The graphical model in Figure 3.7 satisfies the criterion in Proposition 3.29, thus $\beta_U$ is identifiable (up to its sign). To see this, the factor loadings $\gamma_{11}, \gamma_{12}, \dots, \gamma_{26}$ can be identified by confirmatory factor analysis, and by using path analysis, we have

\[ \operatorname{Cov}(X_1, X_4) = \gamma_{11} \beta_U \gamma_{24}. \]
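The two steps of this example (identify the loadings, then read off the latent path coefficient) can be sketched as follows. All numerical values are hypothetical, the name `Gamma` for the loading matrix is an assumption of the sketch, and every loading is taken to be positive so that the sign ambiguity does not arise.

```python
import numpy as np

# Hypothetical measurement model: U1 -> U2 with coefficient beta_U, unit-variance latents.
beta_U = 0.6
cov_U = np.array([[1.0, beta_U], [beta_U, 1.0]])
Gamma = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],     # X1, X2, X3 load on U1
                  [0.0, 0.6], [0.0, 0.5], [0.0, 0.4]])    # X4, X5, X6 load on U2
noise = np.diag([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
S = Gamma @ cov_U @ Gamma.T + noise                        # implied Cov(X)

def loading(S, i, j, k):
    """Three-indicator rule: loading of X_i (taken positive) from indicators i, j, k."""
    return np.sqrt(S[i, j] * S[i, k] / S[j, k])

g11 = loading(S, 0, 1, 2)             # loading of X1 on U1
g24 = loading(S, 3, 4, 5)             # loading of X4 on U2
beta_U_hat = S[0, 3] / (g11 * g24)    # Cov(X1, X4) = g11 * beta_U * g24
```

With a sample covariance matrix in place of `S`, the same arithmetic gives a simple moment-based estimate rather than the exact value.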


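Figure 3.7: Example of a measurement model. [Diagram: latent variables $U_1 \to U_2$ with path coefficient $\beta_U$; $U_1$ has measurements $X_1, X_2, X_3$ with loadings $\gamma_{11}, \gamma_{12}, \gamma_{13}$, and $U_2$ has measurements $X_4, X_5, X_6$ with loadings $\gamma_{24}, \gamma_{25}, \gamma_{26}$.]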

3.31 Exercise. Show that $\beta_U$ in the last example is still identifiable if each latent variable only has two measurements (i.e. if $X_3$ and $X_6$ are deleted from the graph).

Solution. Apply the three-indicator rule to $(X_1, X_2, X_4)$ and $(X_1, X_2, X_5)$.
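One possible way to finish the argument is sketched below. It writes $\gamma_{kj}$ for the loading on the edge $U_k \to X_j$ (a notational assumption), treats $X_4$ and $X_5$ as indirect indicators of $U_1$, and assumes $\gamma_{24}\gamma_{25} \neq 0$.

```latex
\begin{align*}
a_4 &:= \gamma_{24}\beta_U, \quad a_5 := \gamma_{25}\beta_U
  && \text{(loadings of $X_4$, $X_5$ on $U_1$, from the two triples)} \\
\operatorname{Cov}(X_4, X_5) &= \gamma_{24}\gamma_{25}
  && \text{(since $\operatorname{Var}(U_2) = 1$)} \\
\beta_U^2 &= \frac{a_4 a_5}{\operatorname{Cov}(X_4, X_5)} .
\end{align*}
```

Both triples share $X_1$ and $X_2$, so $a_4$ and $a_5$ are determined up to a common sign and the product $a_4 a_5$ is unaffected. This recovers $\beta_U$ up to its sign.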

3.32 Remark. Although the path coefficients between $U$ can only be identified up to sign changes, this is usually not a problem in practice. Usually we can confidently make assumptions about the signs of certain factor loadings (for example, the loading of a student’s maths score on academic achievements is positive).


3.7 Estimation in linear SEMs

Let $X \in \mathbb{R}^p$ be the observed variables in a linear SEM. Let $B$ denote the matrix of path coefficients between all the variables, observed or latent. Suppose $B$ is indeed identifiable.

There are two main approaches to fit a linear SEM and estimate B: maximum likelihood and generalised method of moments.

By assuming the noise variable $\epsilon_{[p]}$ in (3.1) follows a multivariate normal distribution, the maximum likelihood estimator of $B$ minimises

\[ l(B) = \frac{1}{2} \log\det\bigl(\Sigma_X(B)\bigr) + \frac{1}{2} \operatorname{tr}\bigl(S \Sigma_X^{-1}(B)\bigr), \tag{3.7} \]

where $S$ is the sample covariance matrix of $X$ and $\Sigma_X(B)$ is the covariance matrix of $X$, which depends on the path coefficients $B$ through (3.3).
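To see the criterion (3.7) at work, the sketch below evaluates it on a grid for a deliberately tiny model: $X_1 \to X_2$ with a single coefficient $b$ and unit noise variances. The model, the grid search (in place of a proper optimiser), and the use of the true covariance in place of a sample covariance are simplifying assumptions made only for illustration.

```python
import numpy as np

def ml_objective(S, Sigma):
    """0.5 * log det(Sigma) + 0.5 * tr(S Sigma^{-1}), cf. (3.7)."""
    _, logdet = np.linalg.slogdet(Sigma)
    # tr(S Sigma^{-1}) = tr(Sigma^{-1} S), computed without forming the inverse.
    return 0.5 * logdet + 0.5 * np.trace(np.linalg.solve(Sigma, S))

def sigma_of(b):
    """Implied covariance of (X1, X2) for X1 -> X2 with coefficient b, unit noise variances."""
    return np.array([[1.0, b], [b, b**2 + 1.0]])

S = sigma_of(0.5)                       # stand-in for a sample covariance matrix
grid = np.linspace(-1.0, 1.0, 201)      # candidate values of b
values = [ml_objective(S, sigma_of(b)) for b in grid]
b_hat = grid[int(np.argmin(values))]    # grid minimiser of the ML criterion
```

With real data `S` would be computed from the sample, and a numerical optimiser over all free parameters would replace the grid.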

3.33 Exercise. Derive (3.7).

Solution. Since the variables $X$ can be written as a linear transformation of the noises $\epsilon$ (Exercise 3.5), $X$ also follows a multivariate normal distribution and thus the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is independent of the sample covariance matrix

\[ S = \frac{1}{n - p} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T. \]


The likelihood function thus consists of two parts. One involves $\bar{X}$ and is used to estimate the intercept terms $\beta_0$ in (3.1). The other term involves $S$ and its negative logarithm is equal to (3.7) plus a constant.

Generalised method of moments (an extension of Z-estimation) tries to directly match the theoretical covariance matrix $\Sigma(B)$ with the sample covariance matrix $S$ by minimising

\[ l_W(B) = \frac{1}{2} \operatorname{tr}\Bigl( \bigl\{ [S - \Sigma(B)] W^{-1} \bigr\}^2 \Bigr), \tag{3.8} \]

where $W$ is a $p \times p$ positive definite weighting matrix. This is also called the generalised least squares estimator in the SEM literature.

Different choices of $W$ lead to estimators with different asymptotic efficiency. The “optimal” choice is $W = \Sigma(B)$ (or any other matrix that converges in probability to $\Sigma(B)$). This motivates the practical choice $W = S$, so we estimate $B$ by minimising

\[ l_S(B) = \frac{1}{2} \operatorname{tr}\Bigl( [I - S^{-1}\Sigma(B)]^2 \Bigr). \tag{3.9} \]
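The relation between (3.8) and (3.9) can be checked numerically: taking $W = S$ in (3.8) gives the same value as (3.9), because the trace is invariant under cyclic permutation. The sketch below does this for a toy model $X_1 \to X_2$ with coefficient $b$ and unit noise variances. The model and the grid search are assumptions made for illustration only.

```python
import numpy as np

def gls_objective(S, Sigma, W):
    """0.5 * tr({(S - Sigma) W^{-1}}^2), cf. (3.8)."""
    M = (S - Sigma) @ np.linalg.inv(W)
    return 0.5 * np.trace(M @ M)

def ls_objective(S, Sigma):
    """0.5 * tr((I - S^{-1} Sigma)^2), cf. (3.9)."""
    M = np.eye(S.shape[0]) - np.linalg.solve(S, Sigma)
    return 0.5 * np.trace(M @ M)

def sigma_of(b):
    """Implied covariance of (X1, X2) for X1 -> X2 with coefficient b, unit noise variances."""
    return np.array([[1.0, b], [b, b**2 + 1.0]])

S = sigma_of(0.5)                       # stand-in for a sample covariance matrix
grid = np.linspace(-1.0, 1.0, 201)
values = [ls_objective(S, sigma_of(b)) for b in grid]
b_hat = grid[int(np.argmin(values))]    # grid minimiser of (3.9)

# (3.8) with W = S agrees with (3.9) at every grid point.
agree = all(np.isclose(gls_objective(S, sigma_of(b), W=S), ls_objective(S, sigma_of(b)))
            for b in grid)
```

The criterion is zero exactly when $\Sigma(B) = S$, which is why the grid minimiser lands on the value of $b$ used to build `S`.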

The generalised method of moments estimator is consistent if the linear SEM is correctly specified (so $\operatorname{Var}(X) = \Sigma(B)$). Furthermore, if $l_S(B)$ is used and $\epsilon$ is normally distributed, the estimator is asymptotically equivalent to the maximum likelihood estimator and is thus asymptotically efficient.6

6 For more detail, see Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37(1), 62–83. doi:10.1111/j.2044-8317.1984.tb00789.x.


3.8 Strengths and weaknesses of linear SEMs

Despite being a century old, linear SEMs are still widely used in many applications for many good reasons:

(i) Graphs and linear SEMs provide an intuitive way to rigorously describe causality that can also be easily understood by applied researchers.

(ii) Path analysis provides a powerful tool to distinguish correlation from causation. Even though we will move away from linearity soon, path analysis provides a straightforward way to disprove some statements and gain intuitions for others.7

(iii) Linear SEMs allow us to directly put models on unobserved variables. This is especially useful when the causes and effects of interest are abstract constructs.

(iv) Fitting a linear SEM only requires the sample covariance matrix, which can be handy in modern applications with privacy constraints.

Linear SEMs also have important limitations:

(i) The linear model can be misspecified and does not handle binary or discrete variables very well. This is problematic because the causal effect is not well defined if the model is nonlinear. As a consequence, the meaning of structural equation models became obscure and led many to believe they are just the same as linear regression. This misconception led many researchers to reject linear SEMs as a tool for causal inference.8

(ii) Any model put on the unobserved variables is dangerous, because there is no realistic way to verify those assumptions.

7 Pearl, J. (2013). Linear models: A useful "microscope" for causal analysis.

8 For a historical account, see Pearl, 2009, Section 5.1.

Chapter 4 Graphical models

Linear SEMs are intuitive and easy to interpret, but become inadequate when the structural relations are non-linear. Intuitively, causality should already be entailed in the graphical diagram, and linearity should be unnecessary for causal inference.

To move away from the linearity assumption, we will introduce graphical models for the observed variables in this Chapter and for the unobserved counterfactuals in the next Chapter.

4.1 Markov properties for undirected graphs

Briefly speaking, a graphical model provides a concise representation of all the conditional independence relations (aka Markov properties) between random variables. We will start from undirected graphs.

Let $G = (V = [p], E)$ be an undirected graphical model for the random variables $X_{[p]} = (X_1, X_2, \dots, X_p)$. Edges in an undirected graph have no direction. In other words, if $(i, j) \in E$, then $(j, i) \in E$.

4.1 Definition (Separation in undirected graph). For any disjoint $I, J, K \subset V$, $K$ is said to separate $I$ and $J$ in an undirected graph $G$, denoted as $I \perp\!\!\!\perp J \mid K \; [G]$, if every path from a node in $I$ to a node in $J$ contains a node in $K$.
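Separation can be checked mechanically: delete the nodes in $K$ and test whether any node in $J$ is still reachable from $I$. The sketch below does this with a breadth-first search. The adjacency-dictionary representation and the function name `separates` are illustrative choices.

```python
from collections import deque

def separates(adj, I, J, K):
    """True if every path from I to J in the undirected graph `adj` contains a node in K."""
    blocked = set(K)
    seen = set(I) - blocked
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        if v in J:
            return False                 # reached J without passing through K
        for w in adj[v]:
            if w not in blocked and w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# Chain graph 1 - 2 - 3 - 4, stored as an adjacency dictionary.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
chain_blocked = separates(adj, {1}, {4}, {2})      # node 2 lies on the only path
chain_open = separates(adj, {1}, {4}, set())       # nothing blocks the path
```

For the chain, removing node 2 disconnects 1 from 4, whereas the empty set leaves the path intact.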

4.2 Definition (Global Markov property). A probability distribution $P$ is said to satisfy the global Markov property with respect to the graph $G$ if $I \perp\!\!\!\perp J \mid K \; [G] \Longrightarrow X_I \perp\!\!\!\perp X_J \mid X_K$ for any disjoint $I, J, K \subset V$.

The Markov properties with respect to a graph are closely related to the factorisation of a probability distribution.


4.3 Definition (Factorisation according to an undirected graph). A clique in an undirected graph $G$ is a subset of vertices such that every two distinct vertices in the clique are adjacent.

A probability distribution $P$ is said to factorise according to $G$ (or to be a Gibbs random field with respect to $G$) if $P$ has a density $f$ that can be written as

\[ f(x) = \prod_{\text{clique } C \subseteq V} \psi_C(x_C), \]

for some functions $\psi_C$, $C \subseteq V$.

4.4 Theorem (Hammersley-Clifford). Suppose the probability distribution $P$ has a positive density function. Then $P$ satisfies the global Markov property with respect to $G$ if and only if it factorises according to $G$.
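A small numerical check of the "factorisation implies Markov" direction is sketched below for the chain $1 - 2 - 3$ with binary variables. The clique potentials are random positive tables, so the density is positive, and the code verifies that $X_1 \perp\!\!\!\perp X_3 \mid X_2$, as separation of $\{1\}$ and $\{3\}$ by $\{2\}$ requires. The potentials, seed, and variable names are arbitrary assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
psi12 = rng.uniform(0.5, 2.0, size=(2, 2))    # potential on the clique {1, 2}
psi23 = rng.uniform(0.5, 2.0, size=(2, 2))    # potential on the clique {2, 3}

# Joint probability table f[x1, x2, x3] proportional to psi12[x1, x2] * psi23[x2, x3].
f = psi12[:, :, None] * psi23[None, :, :]
f /= f.sum()

gaps = []
for x2 in range(2):
    cond = f[:, x2, :] / f[:, x2, :].sum()                    # P(X1, X3 | X2 = x2)
    product = np.outer(cond.sum(axis=1), cond.sum(axis=0))    # product of the conditional marginals
    gaps.append(np.abs(cond - product).max())
max_gap = max(gaps)    # numerically zero when X1 and X3 are independent given X2
```

A check like this illustrates the theorem on one example but is not a proof.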

Proof. We will only prove the $\Longleftarrow$ direction here.1 Let $I, J, K$ be disjoint subsets of $V$ such that $I \perp\!\!\!\perp J \mid K \; [G]$. A necessary consequence is that $I$ and $J$ must be in different connected components of the

1 Proof of the other direction can be found, for example, in Lauritzen, S. L. (1996). Graphical models. Clarendon Press, page 36.
