When I talk about confounding variables, I refer to any factors which affect the underlying relationship, but in whose involvement we are not directly interested. For example, we might wish to analyse a dataset where a sample’s sex plays a major role in determining its response value. If we are interested in studying the effect of sex, we could treat it just like any other predictor, in which case we could ask to what extent it is involved, whether it interacts with other predictors and so on. If we are not interested in these questions, but merely consider sex a nuisance variable, then it would be classed as confounding.
Quite often, confounding arises from variables which are difficult to observe. In association studies, one such example is relatedness. It is very unlikely that an experiment will obtain exact pedigree information for all samples. I later analyse data obtained from the plant Arabidopsis thaliana. Here, more than in humans, relatedness is a major concern as accessions are often picked from heavily inbred lines. Suppose, in an experiment, two samples are so closely related that their values are almost identical. This will produce a “pseudo-duplicate”. If one of these samples provides evidence for a particular association, then the second sample will likely magnify this evidence, whether or not this support is warranted. As these two samples are highly related, they will likely have developed in similar environments relative to the rest of the sample. We might consider this a more plausible reason for their similar responses than a genetic explanation. Therefore, if we have pedigree information for the samples, it is prudent to incorporate this in the analysis. Even when this information is not available, methods exist for its estimation (discussed in Astle and Balding, 2009).
Confounding variables are divided into two sets, Ψ and Ω, according to the way Sparse Partitioning allows them to affect the underlying relationship: variables contained in Ψ are allowed to interact with the standard predictors X, while those in Ω are assumed to contribute independently and additively. The regression equation becomes l(E(Y)) = f(X, Ψ) + Ωω, where ω contains the regression coefficients corresponding to variables in Ω.
As Sparse Partitioning is only designed to consider interactions between tertiary predictors, the method requires that all variables of Ψ are coded as such. Consequently, the method is unable to consider interactions with quantitative confounding variables. I discuss the impact of this limitation after explaining how Sparse Partitioning tries to correct for confounding.
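Sparse Partitioning automatically accounts for the effect of variables in Ω, assigning ω a normal prior distribution with mean 0 and variance $\sigma^2/r_0$ or $1/r_0$, depending on whether the response is continuous or binary. Consider the effect on the marginal likelihood when the response is continuous (Case 1):
\[
\begin{aligned}
P(Y \mid X, G) &= \int_{\Theta,\sigma^2} \int_{\omega} P(Y \mid X, G, \Theta, \sigma^2, \omega) \times P(\omega)\, d\omega \times P(\Theta, \sigma^2)\, d\Theta\, d\sigma^2 \\
&= \int_{\Theta,\sigma^2} \int_{\omega} (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}(Y - J\Theta - \Omega\omega)^T (Y - J\Theta - \Omega\omega)\right\} \\
&\qquad \times (2\pi\sigma^2/r_0)^{-D_0/2} \exp\left\{-\frac{r_0}{2\sigma^2}\,\omega^T\omega\right\} d\omega \times P(\Theta, \sigma^2)\, d\Theta\, d\sigma^2,
\end{aligned}
\]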
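where $D_0$ denotes the number of variables in Ω. Using similar steps to when we earlier integrated over Θ, we obtain
\[
P(Y \mid X, G) = \int_{\Theta,\sigma^2} r_0^{D_0/2}\, |B_0|^{-1/2}\, (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}(Y - J\Theta)^T C\, (Y - J\Theta)\right\} \times P(\Theta, \sigma^2)\, d\Theta\, d\sigma^2,
\]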
where $C = I_n - \Omega B_0^{-1} \Omega^T$, with $B_0 = \Omega^T\Omega + r_0 I_{D_0}$. When the response is binary and a probit link function is used (Case 3), Y is replaced by Z and $\sigma^2$ is fixed at 1, but otherwise the effect on the marginal likelihood is identical. With this small adjustment, Sparse Partitioning is able to continue as before.
We see that the introduction of Ω is equivalent to ignoring its presence, but altering the likelihood assumption to suppose that Y is now drawn from a multivariate normal distribution with mean JΘ and variance matrix $\sigma^2 C^{-1}$. Viewed the other way round, when confounding factors introduce correlations between response values, as is clearly the case with relatedness, the response values can equally be thought of as independent draws, but with a certain quantity added to their underlying relationship values.
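The equivalence between integrating out ω and switching to a correlated normal likelihood can be checked numerically. The following is a minimal sketch, not part of Sparse Partitioning itself: the sizes, the hyperparameter values and the variable names (`r0`, `d0`, `mean` standing in for JΘ) are assumptions chosen for illustration. It uses the Woodbury identity, which gives C⁻¹ = I + ΩΩᵀ/r0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d0, r0, sigma2 = 6, 2, 0.5, 1.3    # illustrative values (assumed)

Omega = rng.normal(size=(n, d0))      # confounding covariates
mean = rng.normal(size=n)             # stands in for J @ Theta
Y = rng.normal(size=n)

# C = I - Omega B0^{-1} Omega^T, with B0 = Omega^T Omega + r0 I
B0 = Omega.T @ Omega + r0 * np.eye(d0)
C = np.eye(n) - Omega @ np.linalg.solve(B0, Omega.T)

# Woodbury identity: C^{-1} = I + Omega Omega^T / r0
C_inv = np.eye(n) + Omega @ Omega.T / r0
woodbury_ok = np.allclose(C @ C_inv, np.eye(n))

resid = Y - mean

# Log of the integrand once omega has been integrated out
log_marginal = (0.5 * d0 * np.log(r0) - 0.5 * np.linalg.slogdet(B0)[1]
                - 0.5 * n * np.log(2 * np.pi * sigma2)
                - resid @ C @ resid / (2 * sigma2))

# Log density of N(mean, sigma2 * C^{-1}) evaluated at Y
cov = sigma2 * C_inv
log_mvn = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(cov)[1]
           - 0.5 * resid @ np.linalg.solve(cov, resid))

print(woodbury_ok, np.isclose(log_marginal, log_mvn))
```

The two log densities agree because |C⁻¹| = r0^(−D0) |B0|, so the normalising constant of the multivariate normal reproduces the factor r0^(D0/2) |B0|^(−1/2).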
Allowing for Ω slows down computation, as the introduction of C (size n × n) complicates existing calculations. However, the value of C remains constant throughout analysis and so, for example, CY only changes when missing response values are resampled. Tricks of this nature are able to offset the increase in computation time to some extent. Nonetheless, acknowledging Ω’s presence in every calculation will have a noticeable effect; for example, when the sample size is a few hundred, this will typically result in iterations taking about twice as long.
Fortunately, it is generally acceptable to adjust for the effect of Ω before analysis, by replacing the response values with the residuals $Y - \Omega\hat{\omega}$, where $\hat{\omega}$ is a suitable estimate of ω. There are two justifications for this. Consider the linear model $Y = J\Theta + \Omega\omega$ when all variables are orthogonal. If $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors, we can calculate least squares estimates for each coefficient by “dotting” both sides with the corresponding variable:
\[
\hat{\Theta}_j = \frac{\langle Y, J_j \rangle}{\langle J_j, J_j \rangle} \quad \text{and} \quad \hat{\omega}_{j'} = \frac{\langle Y, \Omega_{j'} \rangle}{\langle \Omega_{j'}, \Omega_{j'} \rangle}.
\]
Therefore, for least squares regression, the overall model fit can be calculated in a stepwise fashion, at each step regressing the current residuals on the next predictor. A similar argument can be used in the Bayesian setting to justify first regressing Y on Ω, then replacing Y with $Y - \Omega\hat{\omega}$, where $\hat{\omega}$ is the posterior mean from this first regression. In fact, we only require that each column of Ω is perpendicular to each column of J, which is the case when their (mean-centred) observed values are uncorrelated. Over reasonable sample sizes, we might expect this to be approximately so.
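The least squares version of this argument can be illustrated with a small simulation. This is a sketch under idealised assumptions: the columns of J and Ω are constructed to be exactly orthogonal (via a QR decomposition), the coefficient values are arbitrary, and ordinary least squares stands in for the Bayesian posterior mean described in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Build J and Omega with mutually orthogonal columns (idealised setting)
Q, _ = np.linalg.qr(rng.normal(size=(n, 5)))
J, Omega = Q[:, :3], Q[:, 3:]

theta = np.array([1.0, -2.0, 0.5])    # arbitrary illustrative coefficients
omega = np.array([3.0, -1.0])
Y = J @ theta + Omega @ omega + rng.normal(scale=0.1, size=n)

# Joint least squares fit of Y on [J, Omega]
joint, *_ = np.linalg.lstsq(np.hstack([J, Omega]), Y, rcond=None)

# Two-step fit: estimate omega first, then regress the residuals on J
omega_hat, *_ = np.linalg.lstsq(Omega, Y, rcond=None)
theta_hat, *_ = np.linalg.lstsq(J, Y - Omega @ omega_hat, rcond=None)

same_theta = np.allclose(joint[:3], theta_hat)
same_omega = np.allclose(joint[3:], omega_hat)
print(same_theta, same_omega)
```

When the columns of Ω are only approximately orthogonal to those of J, the two fits will differ slightly, which is why the adjustment is described as generally acceptable rather than exact.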
Should this reasoning not seem sound, a second explanation might be more readily accepted. By automatically considering the contribution to the model of variables in Ω, this, in effect, assigns a prior probability of 1 that they are associated. This is in stark contrast, in sparse problems at least, to the standard predictors which are assigned very small probabilities. This relative weighting is reasonable when the confounding variables are “major factors”, as then we would consider it far more likely that they influence the response than any single standard predictor. If X and Ω are considered together in the regression model, their corresponding regression coefficients will be competing over any variation both are able to explain. However, our prior belief heavily favours Ω winning this battle convincingly, justifying why we might account for its effect beforehand.
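When the response is binary and a link function l other than the probit is used, the marginal likelihood becomes
\[
\begin{aligned}
P(Y \mid X, G) &= \int_{\Theta,\omega} P(Y \mid X, G, \Theta, \omega) \times P(\Theta) \times P(\omega)\, d\Theta\, d\omega \\
&= \int_{\Theta,\omega} \prod_i \left[l^{-1}(J\Theta + \Omega\omega)\right]^{Y_i} \left[1 - l^{-1}(J\Theta + \Omega\omega)\right]^{(1 - Y_i)} \times P(\Theta, \omega)\, d\Theta\, d\omega \\
&= \int_{\Theta_+} \prod_i \left[l^{-1}(J_+\Theta_+)\right]^{Y_i} \left[1 - l^{-1}(J_+\Theta_+)\right]^{(1 - Y_i)} \times P(\Theta_+)\, d\Theta_+,
\end{aligned}
\]
where $J_+ = [J \;\; \Omega]$ and $\Theta_+ = (\Theta, \omega)$. A Laplace approximation can be used to integrate across this extended linear model. We would prefer to be able to integrate out ω beforehand, as this integral is common to all models. However, this is not possible using a Laplace approximation, as it is an analytic technique which requires numerical values for Θ.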
This obstacle can have a dramatic effect on computation time. If we wish to include confounding factors based on sample relatedness, this typically produces an additional n covariates in the model, one for each sample. In general, this will greatly increase the degrees of freedom of the linear model, making the Newton-Raphson method much slower. Therefore, when Ω is included in a problem with a binary response, I highly recommend using a probit link function or allowing the method to regress out the effect of Ω in advance of analysis, which can be justified by an argument similar to that discussed for the continuous case.
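To make the cost concrete, the Laplace approximation over the extended model [J Ω] can be sketched as follows. This is an illustrative implementation rather than the one used by Sparse Partitioning: it assumes a logit link, independent zero-mean normal priors with precisions supplied in `prior_prec`, and simulated data; the function name `laplace_log_marginal` and all numerical values are hypothetical.

```python
import numpy as np

def laplace_log_marginal(X, y, prior_prec, max_iter=50, tol=1e-10):
    """Laplace approximation to the log marginal likelihood of a logistic
    regression with independent N(0, 1/prior_prec[k]) coefficient priors."""
    p = X.shape[1]
    beta = np.zeros(p)

    def grad_and_neg_hessian(b):
        mu = 1.0 / (1.0 + np.exp(-(X @ b)))       # fitted probabilities
        grad = X.T @ (y - mu) - prior_prec * b    # gradient of log posterior
        neg_hess = (X * (mu * (1.0 - mu))[:, None]).T @ X + np.diag(prior_prec)
        return grad, neg_hess

    # Newton-Raphson search for the posterior mode. Each step solves a
    # p x p system, so every column contributed by Omega adds to the cost.
    for _ in range(max_iter):
        grad, neg_hess = grad_and_neg_hessian(beta)
        step = np.linalg.solve(neg_hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break

    _, neg_hess = grad_and_neg_hessian(beta)
    eta = X @ beta
    log_lik = y @ eta - np.sum(np.logaddexp(0.0, eta))
    log_prior = (0.5 * np.sum(np.log(prior_prec)) - 0.5 * p * np.log(2 * np.pi)
                 - 0.5 * beta @ (prior_prec * beta))
    # log integral ~ log integrand at mode + (p/2) log(2 pi) - 0.5 log|neg_hess|
    log_marginal = (log_lik + log_prior + 0.5 * p * np.log(2 * np.pi)
                    - 0.5 * np.linalg.slogdet(neg_hess)[1])
    return log_marginal, beta

rng = np.random.default_rng(3)
n = 100
J = rng.normal(size=(n, 2))                       # simulated standard terms
Omega = rng.normal(size=(n, 3))                   # simulated confounders
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(J[:, 0] + Omega[:, 0])))).astype(float)

J_plus = np.hstack([J, Omega])                    # extended design [J Omega]
prior_prec = np.concatenate([np.full(2, 1.0), np.full(3, 1.0)])  # assumed values
log_marginal, theta_plus = laplace_log_marginal(J_plus, y, prior_prec)
print(log_marginal)
```

Because the mode must be found afresh for every candidate model, the columns of Ω are re-estimated each time, which is the repeated cost that integrating out ω in advance would have avoided.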
The second set of confounding variables, those contained in Ψ, are easy to include, as Sparse Partitioning simply treats them as additional columns of X. The method will return posterior estimates for these variables, as it does for all standard predictors. These variables require a prior probability of association. As we generally consider confounding variables more likely to influence the response, these probabilities should typically be set higher than for the standard predictors. If we are certain of their involvement, they can be added to the list of predictors Sparse Partitioning must include.
If confounding variables are supplied, these are incorporated during the one-predictor-at-a-time tests performed by Single by altering its null and alternative hypotheses to include a contribution from Ω and Ψ.
Discussion: Assumption that X and Ω do not interact
Sparse Partitioning is only set up to consider interactions between tertiary variables. In particular, this prevents it from exploring interactions involving quantitative confounding variables. Consider the case of population confounding, which refers to trends and correlations which appear across populations as a result of migration and geographic differences.
Consider an idealised scenario, where each individual is sampled from one of two distinct populations. In this simple case, only a single, two-state vector would be required to indicate each sample’s population status and this could readily be included as an additional predictor in the model. In general the stratification will be far more complex. There will often be a number of loosely defined populations, with varying levels of overlap (“admixture”) between each pair. To describe such a complex situation requires a few vectors where, say, each vector indicates the fraction of a sample originating from a particular population. Even when very detailed migration histories are available, it will not normally be possible to reconstruct these population vectors with much accuracy.
Fortunately, methods have been developed to estimate these population vectors from the sample dataset. For a single, idealised population, Hardy-Weinberg equilibrium dictates that a SNP’s allele frequencies will remain constant throughout generations. In particular, when a SNP is biallelic, the frequencies of homozygous wildtype, heterozygous and homozygous mutant states should obey the ratio $p^2 : 2p(1 - p) : (1 - p)^2$. When populations are merged, unless a SNP’s allele frequencies happen to be the same across all populations, this equilibrium will be destroyed. This principle is exploited by the method STRUCTURE (Pritchard et al., 2000) which, loosely speaking, attempts to partition the sample so that within each group Hardy-Weinberg equilibrium is restored.
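A small worked example shows how merging populations disturbs these ratios. The sketch below is not the STRUCTURE algorithm; it only illustrates the principle STRUCTURE exploits, using two hypothetical populations with wildtype allele frequencies 0.9 and 0.3 mixed in equal proportions.

```python
import numpy as np

def hwe_proportions(p):
    """Genotype proportions (homozygous wildtype, heterozygous, homozygous
    mutant) under Hardy-Weinberg equilibrium for wildtype allele frequency p."""
    return np.array([p**2, 2 * p * (1 - p), (1 - p)**2])

# Two hypothetical populations, each in equilibrium, merged in equal parts
p1, p2 = 0.9, 0.3
mixed = 0.5 * hwe_proportions(p1) + 0.5 * hwe_proportions(p2)

# Wildtype allele frequency in the merged sample, and the genotype
# proportions that equilibrium would predict at that frequency
p_mixed = mixed[0] + 0.5 * mixed[1]
expected = hwe_proportions(p_mixed)

print("observed:", mixed)
print("expected:", expected)
```

Here the merged sample has a heterozygote proportion of 0.30, against the 0.48 expected under equilibrium at the pooled allele frequency of 0.6. Splitting the sample back into its two populations restores the expected ratios within each group.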
The extent to which a SNP obeys Hardy-Weinberg equilibrium will, in practice, depend on many factors. These include the validity of the theory’s assumptions, which in particular suppose randomly mating individuals and the absence of selective forces. As a result, some SNPs will be more informative of population differences and these will contribute most to STRUCTURE’s estimates of the population vectors. The method EIGENSTRAT (Patterson et al., 2006) takes an alternative approach, one based instead on principal component analysis. Here, the idea of detecting informative variants is more explicit, as the algorithm directly sets out to find data axes across which the differences between individuals’ genetic data are highlighted.
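The principal component idea can be sketched on simulated genotypes. This is a generic PCA illustration, not the EIGENSTRAT software or its exact normalisation: the two populations, their allele frequencies and the sample sizes are all hypothetical, and each SNP is simply centred and scaled by its sample standard deviation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_pop, n_snps = 50, 500

# Hypothetical allele frequencies for two populations that have drifted apart
freq_a = rng.uniform(0.1, 0.9, size=n_snps)
freq_b = np.clip(freq_a + rng.normal(scale=0.2, size=n_snps), 0.05, 0.95)

# Genotypes coded as 0, 1 or 2 copies of an allele
G = np.vstack([rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
               rng.binomial(2, freq_b, size=(n_per_pop, n_snps))]).astype(float)
labels = np.repeat([0, 1], n_per_pop)

# Centre each SNP, drop any monomorphic SNPs, and scale to unit variance
G -= G.mean(axis=0)
sd = G.std(axis=0)
G = G[:, sd > 0] / sd[sd > 0]

# Principal components from the singular value decomposition
U, s, Vt = np.linalg.svd(G, full_matrices=False)
pc1 = U[:, 0] * s[0]       # first "population vector", one value per sample
loadings = Vt[0]           # weight given to each SNP along this axis

corr = np.corrcoef(pc1, labels)[0, 1]
print(abs(corr))
```

The loadings show which SNPs drive the axis: those whose frequencies differ most between the populations receive the largest weights, which is the sense in which each population variable summarises a group of scattered variants.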
Therefore, when considering how much Sparse Partitioning suffers for not being able to consider interactions with population covariates, it perhaps helps to bear in mind the biological interpretation of these interactions. Essentially, each population variable is a statistic corresponding to the group of genetic variants which best distinguish that population. It is much easier to conceive reasons why single variants might interact, than explain why groups of scattered variants might. Even so, Sparse Partitioning’s inability to consider interactions with quantitative confounding variables clearly affects its generality to some level, so this should be acknowledged.