3.4 Algorithmic Aspects of SEM Trees
3.4.1 Construction of SEM Trees
In the following, I present a greedy algorithm that, given a data set D and a template model
M , builds a tree that minimizes the negative two log-likelihood of observing the data given the
tree locally, that is, at each node. After laying the formal basis for evaluating split variables, a pseudocode algorithm that allows the construction of a SEM Tree is given.
Building a SEM Tree starts with the choice of a template SEM M and a data set D that is to be analyzed. First, a root node is created and ML parameters are estimated from the complete observed data. Growing the tree proceeds by iteratively testing whether the current set of leaf nodes, initially only the root, can be split further. Splits should only be performed if the split is useful. The utility of a split is typically formalized with statistical or information-theoretic measures indicating that the split candidate contains relevant information about the target vari- ables. If several potential splits are useful, the best candidate is chosen. The evaluation of split candidates is based on the fundamental notion of a multi-group model (Jöreskog, 1971), that is, a compound SEM of multiple independent SEMs modeling samples from multiple distinct groups. SEM Trees use the property that a SEM which is partitioned into two parts according to a dichotomous covariate, one associated with each resulting partition of the data, can be seen as a two-group model. Furthermore, the model that has not yet been split can be seen as a restricted version of this two-group model. I will show that this is an algebraic nesting, that is, the parameter vector of the original model is a constrained form of the parameter vector of the two-group model. Because of this property, the distribution of the ratio of the log-likelihoods between the two models is known under the hypothesis that the split candidate is uninformative. This allows us to formulate a statistical test that rejects the null hypothesis of an uninformative split candidate.
In the following, I will derive the maximum-χ2 criterion to determine whether a node should
be further expanded to a sub-tree or not. The criterion is based on the notion that a split should be performed with respect to one of the covariates whenever its result significantly increases the likelihood of observing the data given the tree. Based on the likelihood ratio test, the maximum-χ2 criterion allows us to ground the decision to accept or reject a split candidate on
a significance test. In order to derive this criterion, we need two notions of models, a pre-split
model representing the model before a candidate split, and a post-split model representing the
model after the split.
Definition 50. Let Υ be a SEM Tree with template model M that is described by a expec- tations vector µ and a covariance matrix Σ. Let n ∈ N be a node with data set D. The pre-split model of node n is a SEM M with ML parameter estimates ˆθpre based on minimizing −2LLD|Mθˆpre
. D is referred to as the data set of the pre-split model of n.
There is an important dual representation of the pre-split model. In the first instance, the pre- split model is a representation of the current data set, as described in the previous definition. Its parameter estimates are obtained by minimizing the negative log-likelihood function for observing the data set given the model. Considering a split of the data set into a left and a right data set, the likelihood of this model is equivalent to the likelihood of a compound model that consists of a left model that is associated with the left data set, and a right model associated with the right data set. When the parameter estimates for the left and right models are obtained by minimizing the respective log-likelihood function under the additional constraint that the parameter vectors of both models are equal, the resulting likelihood of the compound model is equal to the likelihood of the pre-split model as defined above. Even if this observation is not surprising, it is an important building block in deriving a hypothesis test for evaluating the significance of choosing a covariate for a node split:
Theorem 51(Dual Representation of Pre-Split Models). Let Υ, M, D, n be as above. Let Dlef t
and Dright be the left and right data sets of D with respect to the covariate encoded by n. Let ˆ θpre,lef tminimize−2LL Dlef t|M ˆθpre,lef t
and ˆθpre,rightminimize −2LL
Dright|M
ˆθpre,right
under the equality constraints ˆθpre,lef t= ˆθpre,right, then
−2LLD|Mθˆpre = −2LLDlef t|M ˆθpre,lef t − 2LLDright|M ˆθpre,right
Proof. Let (O, C) ∈ D and N be the number of observed data points O in D. Let Dlef t be the left data set of D and Dright be the right data. Let Nlef t and Nright be the number of observations in the respective partitions, such that Nlef t+ Nright = N . With Equation 3.1.3, we can rewrite this as
−2LLD′|µθˆpre , Σθˆpre = N · p log (2π) + N log |Σ| +X x∈O h (x − µ)T Σ−1(x − µ)i
(Nlef t+ Nright) · p log (2π) + (Nlef t+ Nright) log |Σ|
X
x∈(Olef t∪Oright) h
(x − µ)TΣ−1(x − µ)i
= Nlef t· p log (2π) + Nlef tlog |Σ|
+ X
x∈Olef t h
(x − µ)TΣ−1(x − µ)i
+Nright· p log (2π) + Nrightlog |Σ|
+ X x∈Oright h (x − µ)TΣ−1(x − µ)i = −2LLDlef t|M, ˆθpre,lef t − 2LLDright|M, ˆθpre,right
Theorem 52. Let Υ, M, D, n be as above. Let Dlef t be the left data set and Dright the
right data set of D with respect to the decision rule encoded in node n. Let ˆθlef t minimize −2LLDlef t|M
ˆθlef t
and ˆθright minimize −2LL
Dright|M
ˆθright
. The likelihood of the post-split model that independently estimates parameters for the left data set and right data set is −2LLDlef t|M ˆθlef t − 2LLDright|M ˆθright (3.4.1)
Proof. This proof is analogous to the proof of Theorem 51. Since both models are assumed to
be independent, their likelihoods multiply and consequently their log-likelihoods sum up. The only difference to the previous proof is that the equality constraint across the parameters in both models is removed, and therefore, two separate sets of parameter estimates ˆθlef t and ˆθright are obtained.
As reviewed earlier in Section 3.1.5, algebraic nestings of SEMs have the property to be approximately χ2-distributed under the null hypothesis that the parameter restriction holds
true in the population. This property allows the employment of significance testing to reject the null hypothesis and therefore rejection of the restricted model for being a worse representation than the unrestricted model.
Corollary 53 (Algebraic Nesting of Pre-Split and Post-Split Models). The pre-split model is
nested in the post-split model because the models are structurally equivalent and they differ by equality constraints equal to the number of free parameters in M .
An illustration of Corollary 53 is given in Figure 3.4.1. The general principle of nested pre- split and post-split models is illustrated by means of a schematic factor model with three freely estimated factor loadings λ1, λ2, and λ3, and the variance of the common factor σξ2. The pre- split model (upper panel) freely estimates the four parameters. Following Theorem 51, this
ξ Baseline Model y1 y2 y3 ξ ξ Pre-Split Model y1 y2 y3 y1 y2 y3 ξ ξ Post-Split Model y1 y2 y3 y1 y2 y3 σ2 ξ σξ2 σ2 ξ σ2 ξ1 σ 2 ξ2 λ1 λ2 λ3 λ1 λ2 λ3 λ1 λ2 λ3 λ1 λ2 λ3 λ4 λ5 λ6
Figure 3.4.1: Evaluating a split candidate involves a comparison of a baseline model (top) that models the data set without partitioning it and a two-group model that repre- sents two models, one model for each of the two partitions after the split (below). This figure illustrates that these models are indeed nested. The baseline model is equivalent to a two-group model in which all parameters are constrained to be equal across the group (middle). This virtual two-group model is in fact nested in the real two-group model (below). This allows the usage of a log-likelihood ratio test to reject the baseline model if the test statistic is significant.
model is equivalent to a two-group SEM that has equality constraints on all corresponding parameters, as depicted in the middle panel. When evaluating a split candidate, all parameters of the two models in the two-group model are freely estimated (post-split model, lower panel). The pre-split model is algebraically nested in the post-split model, and therefore, the asymptotic distribution of the likelihood ratio under the null hypothesis that the restrictions hold in the population, i.e., the split is uninformative, is known:
Theorem 54. Under the null hypothesis that a split candidate is uninformative, the likelihood
ratio LLR between a pre-split and a post-split model is asymptotically χ2-distributed with degrees of freedom corresponding to the number of free parameters k in the template SEM.
LLR =−2LLD|Mθˆ+ 2LLDlef t|M ˆ θlef t + 2LLDright|M ˆ θright ∼ χ2df =k
Proof. Following Wilks’s (1938) theorem, as reiterated in Section 3.1.5, LLR is asymptotically χ2-distributed under the null hypothesis that the restriction holds in the population, with degrees of freedom equal to the difference in freely estimated parameters. A pre-split model with k free parameters is compared to a post-split model with 2k parameters. The difference in parameters and therefore the degrees of freedom of the χ2-distribution is k.
Knowing the distribution of the test statistic LLR under the null hypothesis that the covariate is uninformative and therefore the true parameter values of the pre-split and post-split model are the same, we can formulate a statistical criterion that allows us to reject this null hypothesis for large values of LLR, and accept a split candidate as valid:
Definition 55 (Maximum-χ2 criterion). Let LLR
i be m log-likelihood ratios obtained from evaluating m candidate splits. Let z be the index that finds LLRz with the maximum value. For a chosen significance level α, the maximum-χ2 criterion for covariate selection selects the
covariate with index z, if
LLRz > crit, P
χ2df > crit= α
Given a set of covariates, all possible splits are evaluated and the model with the best increase in performance, i.e., the largest increase in likelihood of observing the data under the model, is compared against a previously chosen threshold, determined by the type-I error α of the test. If the split is statistically significant, splitting is continued recursively. Note that this model selection procedure is a typical example of the multiple comparison problem, and inferences on the acquired p-values must be taken with caution. Later, we will learn that LLRz does not follow a central χ2-distribution, and therefore α is not the type-I error rate for the question
whether one of the splits is significant. The proof for that and remedies in form of different selection criteria are discussed in the next section. However, they do not change the general procedure described here but merely change the way the threshold is calculated or the way the statistic is estimated. Algorithm 3.1 gives a recursive pseudocode algorithm that creates a tree for a given template SEM and data set, based on the maximum-χ2 criterion.
Since the Λ statistics for growing SEM Trees rely on asymptotic assumptions, it is convenient to define a minimum number of observations minN that is required for a node candidate to be
Algorithm 3.1 Recursive function for creating a SEM Tree. Arguments of the function are a SEM model and a data set D
1: function SEM-TREE-LEARNING(model, D)
2: (obs, cov) ← D ⊲ observations and covariates
3: best_lr ← 0
4: best_covariate ← None
5: θ ← ML estimate of obs under model 6: for covariate in cov do
7: obs1, obs2 ← bi-partition of obs according to covariate
8: cov1, cov2 ← bi-partition of cov according to covariate
9: θ1 ← ML estimate of obs1 under model
10: θ2 ← ML estimate of obs2 under model
11: lr ← -2LL(obs|model(θ) +2LL(obs1|model(θ1))+2LL(obs2|model(θ2)) 12: if lr < best_lr then
13: best_lr← lr
14: best_covariate← covariate
15: end if
16: end for
17: critical← (1 − α)-quantile of chi-square-distribution with degrees of freedoms equal to
free parameters in model
18: cur_node← create new node with parameter estimate θ
19: if best_lr > critical then
20: remove best_covariate from each cov1 and cov2
21: lef t_child← SEM-TREE-LEARNING(model, (obs1, cov1))
22: right_child ← SEM-TREE-LEARNING(model, (obs2, cov2))
23: add lef t_child as child to cur_node 24: add right_child as child to cur_node 25: end if
26: return cur_node
considered legitimate. Nodes with a smaller number of observations will be skipped in node evaluation.
The SEM Tree algorithm passes parameter estimates of a node as starting values for sub- models when evaluating splits. If the model does not converge, the original starting values can be used. If this still fails, either the potential split is disregarded or a new set of starting values has to be generated by a different estimation method, e.g., by a least-squares estimate.
In the parameter estimation process, the numerical optimizer might converge on negative values for estimated variances. This behavior is known as a Heywood case (Thurstone, 1947). There can be numerous causes for these cases, among these non-convergence, under-identification, model misspecification, outliers, and sample fluctuations. In practice, there are several condi- tions that require a stop to recursive branching. For once, the optimizer may not converge to a solution. In that case, the respective split has to be dismissed because the temporary solution cannot be trusted to be near a minimum of the error surface. Also, there is the possibility that the optimizer converges to a solution that falls into an inadmissible estimation area. In that case, too, the respective split is dismissed and a warning is passed to the user since the split could indeed have been meaningful if it could have been estimated correctly. A solution for these cases is a provision of new or additional sets of starting values that withdraw the optimizer from the critical region.