Fundamentals of QTL Analysis - Bayesian model-based methods for the analysis of DNA microarrays

The analysis of inbred strains involves the comparisons of means of populations. One fundamental idea in QTL analysis is that genotype influences the mean value of the trait

y so that

yi =µ+βxi+i (40)

where β is the QTL effect, and xi is the QTL genotype of the ith subject, and i is a random error with variance σ2. The vectors xi and β may be two or three dimensional depending on the number of different genotypes and whether or not the effect of the QTL is additive or has a dominance component. If there are three genotypes say QQ,

Qq and qq, then an additive model would have meansµQQ+a=µQq =µqq−afor some

a. For dominance models, there is no such a, but there are a and d where µQQ+a =

µQq+d= µqq −a. The location of the QTL or eQTL is generally unknown in advance so that markers must be used as proxies. One may use the markers themselves, but for sparse markers, this may have disadvantages such as underestimating the QTL effect and inaccurate estimation of the QTL location. Interval mapping was developed by Lander and Botstein (1989) in order to analyze the possible occurrence of a QTL between the markers. The likelihood for the QTL with additive effect a becomes

L(µ, a, σ2) = n Y i=1

[Gi(0)Li(0) +Gi(1)Li(1)] (41) where Gi(x) is the probability of the genotype x∈ {0,1} of the ith _subject. _L_i(_x_{) is the} likelihood function given genotypex. The probabilityGi(x) may be calculated conditionally upon the distance from the left and right flanking markers for any position between them. The above likelihood can be maximized by the Expectation Maximization (EM)

algorithm for positions uniformly distributed across the genome. This form of interval mapping is computationally intensive, and eQTL would greatly increase those demands. The results of interval mapping can be approximated by another more computationally efficient method of Haley and Knot (1992). They proposed that instead of Equation (40) one substitutes the expected value ofxconditional upon the flanking markers forx. This allows one to calculate the likelihood for positions between markers like the interval mapping of Lander and Botstein, but it avoids the burden of the iterative EM algorithm. Another significant advance in QTL involves correction for multiple QTL affecting a single trait. If there are many QTL then the model becomes yi = µ+P_gβgxgi +i where the subscript g indexes the different QTL. The existence of multiple, linked QTLs is known to bias the effect and location of methods that detect the largest single QTL (Zeng, 1993). Zeng (1993) proposed Composite Interval Mapping (CIM) to overcome bias due to the composite effects of multiple QTL. CIM models the traity as a function of the genotype anywhere in the interval (j, j + 1) between the two flanking markers

j and j + 1 and the genotypes at all other markers. The phenotype model becomes

yi = b0 +b∗x∗i + P

k6=j,j+1bkxki +i where k indexes the nonflanking markers. The b∗ parameter measures the effect of the loci of interest while the bk parameters are the effects of the background trait variability due to the kth _{marker. The number of back-} ground markers to adjust for is not given by the model, but the markers j−1 andj+ 2 should always be included because all of the other markers on the same chromosome are conditionally independent of x∗

eQTL detection in human studies must use the analytic methods of outbred analysis. The underlying model for the means is the same as for inbred populations, but the

modeling focuses on the analysis of variance components. In the notation of Almasy and Blangero (1998), we havey=µ+X0_β₊Pn

i=1γi+g+iwhereyis the vector of trait values,

γis the effect of QTLi, andis the random error. The termX0_β_{represents the covariates}

(e.g. age, sex) and the corresponding regression coefficients. The g parameter represents the effects of the polygene which is the composite of many QTLs with small effects. The variance of y (σ2

y) can be written in terms of the independent genetic components Pn

i=1σγ2i and the random errorσ

so thatσy2 = Pn

i=1σγ2i+σ

g+σ2. The covariance of any two related individuals is a function ofkjiwhich is the probability that theithQTL has j alleles that are Identical by Descent (IBD). Ignoring dominance effects, the covariance of these two relatives is Cov(y1, y2) =Pni=1(k1i+k2i)σai2. One may make an approximation that σ2

a = Pn

i=1σai2, and let φ = 12E[(k1i+k2i)] where φ is called the expected kinship

coefficient. This gives Cov(y1, y2) ≈ 2φσa2. If one is interested in QTL i only, we have Cov(y1, y2) = πiσ2ai+ 2φσg2 where πi =k1i+k2i. The background orpolygeniceffects are

reflected byσ2

g. The termπi is the probability of an allele of the iQTL being IBD and is called the coefficient of relationship as in Almasy and Blangero (1998). The phenotypic covariance for a general pedigree may be written as Ω = Pn_i₌₁Πiˆ σ2

ai+ 2Φσ2g+Iσe2 where the matrix ˆΠi has elements (πijl) that indicate the proportion of IBD alleles of QTL i

shared by individuals j and l. Φ is the matrix of kinship coefficients. The estimation of the ˆΠi matrix is not straightforward for general pedigrees, and there are methods designed for estimating the IBD probability for genetic marker locations (Amos et al., 1990; Davis et al., 1996) and locations in between markers (Fulker et al., 1995; Almasy and Blangero, 1998). Almasy and Blangero developed software for general pedigrees that estimates the Πi matrix for positions between markers with a regression based approach.

The likelihood for the phenotypes given the form of the covariance matrix is then given by log(L(µ, σ_ai2, σ2_g, β|y, X)) =−t 2log(2π)− 1 2log|Ω| − 1 2∆ 0_Ω−1_∆ ₍₄₂₎

where ∆ = y −µ−X0_β_{. The hypothesis concerning whether or not QTL} _i _{exists is}

equivalent to testing σai2 = 0. Since the test involves the boundary of the parameter space (σ2

ai ≥ 0), the distribution of the likelihood ratio statistic does not have a chi- square distribution. Self and Liang (1987) showed that the likelihood ratio statistic will follow a mixture of chi-square distributions, and this null distribution is the basis for inference.

In document Bayesian model-based methods for the analysis of DNA microarrays with survival, genetic, and sequence data (Page 64-67)