Mapping Quantitative Trait Loci for Expression Abundance

(1)

DOI: 10.1534/genetics.106.065599

Mapping Quantitative Trait Loci for Expression Abundance

Zhenyu Jia and Shizhong Xu

1

Department of Botany and Plant Sciences, University of California, Riverside, California 92521 Manuscript received September 5, 2006

Accepted for publication February 11, 2007

ABSTRACT

Mendelian loci that control the expression levels of transcripts are called expression quantitative trait loci (eQTL). When mapping eQTL, we often deal with thousands of expression traits simultaneously, which complicates the statistical model and data analysis. Two simple approaches may be taken in eQTL analysis: (1) individual transcript analysis in which a single expression trait is mapped at a time and the entire eQTL mapping involves separate analysis of thousands of traits and (2) individual marker analysis where differentially expressed transcripts are detected on the basis of their association with the segregation pattern of an individual marker and the entire analysis requires scanning markers of the entire genome. Neither approach is optimal because data are not analyzed jointly. We develop a Bayesian clustering method that analyzes all expressed transcripts and markers jointly in a single model. A transcript may be simultaneously associated with multiple markers. Additionally, a marker may simultaneously alter the expression of multiple transcripts. This is a model-based method that combines a Gaussian mixture of expression data with segregation of multiple linked marker loci. Parameter estimation for each variable is obtained via the posterior mean drawn from a Markov chain Monte Carlo sample. The method allows a regular quantitative trait to be included as an expression trait and subject to the same clustering assignment. If an expression trait links to a locus where a quantitative trait also links, the expressed transcript is considered to be associated with the quantitative trait. The method is applied to a microarray experiment with 60 F2mice measured for 25 different obesity-related quantitative traits. In the experi-ment,40,000 transcripts and 145 codominant markers are investigated for their associations. A program written in SAS/IML is available from the authors on request.

T

HE recently developed microarray technology al-lows us to measure the expression of many genes or transcripts in a single chip. The transcript abundance can be treated as a classical quantitative trait and thus mapping can be done on the transcript. Mendelian loci in the genome that control the expression levels of transcripts are called expression quantitative trait loci (eQTL). In eQTL analysis, an expression trait is mapped to genomic locations represented by cis- or trans-loci. Thecis-eQTL represent sequence variants that encode transcriptional differences (Hubner et al. 2005). The

trans-eQTL, however, represent remote genes that

regulate the expression of the gene being transcribed

(Yvertet al. 2003). The purpose of a linkage study is to

identify the cis- and trans-eQTL for each transcript. Results from the eQTL analysis may provide more detailed information about the biological processes of the gene network than the classical quantitative trait analysis. Regular quantitative traits are often gross clinical measurements and may be far remote from the

biological processes giving rise to the clinical traits (Schadtet al. 2003).

In eQTL analysis, we often deal with thousands of expression traits simultaneously. Methods developed for multiple-quantitative trait QTL mapping may not apply here because of the high dimensionality of the model. Two simple approaches may be taken in eQTL analysis: (1) individual transcript analysis in which a single expression trait is mapped at a time and the entire eQTL mapping involves separate analysis of thousands of traits and (2) individual marker analysis where dif-ferentially expressed transcripts are detected on the basis of their association with the segregation pattern of an individual marker and the entire analysis requires scanning markers of the entire genome. The first ap-proach requires only known single-trait QTL mapping procedures, e.g., interval mapping (Lander and B ot-stein1989), composite-interval mapping (Jansen1993;

Zeng 1994), or multiple-QTL mapping (Kao et al.

1999). A common practice for handling thousands of transcript traits is to select a small number of target transcripts on the basis of some criterion for preselec-tion and map QTL only for these prescreened tran-scripts (Lanet al. 2003). The second approach requires

only a method for differential expression analysis, 1_{Corresponding author:} _{Department of Botany and Plant Sciences,}

University of California, 900 University Ave., Riverside, CA 92521-0124. E-mail: [email protected]

(2)

e.g., the regularizedt-test (SAM) (Tusheret al. 2001),

the hierarchical mixture model of Newton_{et al}. (2004),

or the model-based cluster analysis (Pan 2002). In

differential expression analysis, one requires samples from at least two conditions, the control and the treat-ment. When applied to expression–marker association studies, the conditions become the marker genotypes,

e.g., individuals carrying one genotype are arbitrarily designated as the control and those carrying the other genotypes as the treatments. When applied to an F2 design where three groups (genotypes) are present, differential expressions are presented in two forms. One is for the ‘‘additive’’ effect where control and treat-ment represent the two homozygotes. The other is for the ‘‘dominance’’ effect where control and treat-ment represent the homozygote (both types) and the heterozygote.

It appears that eQTL mapping has been treated as either a QTL mapping problem for multiple traits or a microarray differential expression problem for multi-ple treatment comparisons. Other than the method described in Kendziorskiet al. (2006), no unique

me-thod has been particularly designed for eQTL analysis. Neither the first nor the second aforementioned sim-ple approach is optimal because data are not analyzed jointly. The mixture over marker (MOM) approach devel-oped by Kendziorskiet al. (2006) is the first attempt to

analyze transcripts and markers jointly. The method is called MOM because the expression level of a transcript is described by a mixture model over markers. A tran-script is either associated with a marker or not associated with any markers at all. Given that the transcript is associated with a marker, it is associated with one and only one of the markers. We believe that the assumption of a transcript associated with at most one marker is too stringent and needs to be relaxed. The MOM approach is able to detect either thecis-locus or one of thetrans -loci, but not both. This seriously limits the application of the MOM method.

In this study, we propose a novel statistical method that combines the two simple approaches into a single step of analysis so that parameters are inferred using multiple transcripts and markers simultaneously. This joint approach captures the maximum information from the microarray experiment. Like a regular quanti-tative trait, a transcript can be mapped to many different locations, includingcis- andtrans-loci. In multiple-QTL mapping, we face a variable selection problem. To avoid variable selection, we have adopted the Bayesian shrink-age analysis, in which marker loci of the entire genome are evaluated simultaneously (Wanget al. 2005).

Mark-ers with small effects are forced to shrink their effects to zero and markers with large effects are subject to no shrinkage. The shrinkage estimation is made possible through Bayesian hierarchical modeling (Gelman2005).

In this study, eQTL parameters are estimated under the framework of shrinkage analysis.

THEORY AND METHODS

Hierarchical linear model: LetGbe the number of

transcripts measured from N subjects of a mapping population. Let M be the number of markers whose genotypes are measured for all the subjects. We now define the following variables. For thekth individual, let

yikbe the expression level of transcriptiandxjkbe the

genotype indicator of marker j, for i¼1;. . .;G,

j¼1;. . .;M, andk¼1;. . .;N. The genotype indica-tor variable is defined as xjk¼{1, 1} for a backcross

(BC) individual orxjk¼{1, 0, 1} for an F2individual for k¼1;. . .;N. Note that, for simplicity, only the additive effect is considered for an F2population in the current study. The expression levelyikcan be expressed as

yik¼ai1

XM

j¼1

xjkgij1eik; ð1Þ

whereaiis the intercept for theith transcript,gijis the

effect of thejth marker on theith transcript, andeikis

the residual error with an assumedN(0,s2_{) distribution.} Model 1 can be rewritten in a matrix notation,

Yi¼

XM

j¼1

Xjgij1ei: ð2Þ

In model 2,Yi ¼ ½Yi1;. . .;YiNTis anN31 vector for the

expression levels of transcripti, andXj ¼ ½Xj1;. . .;XjNT

is anN31 vector for the genotype indicator variables of locusj, where bothYiandXjare assumed to have been

centered (deviations of the original measurements from the mean) and thusPN_k¼1Yik ¼0 and

PN

k¼1Xjk¼0 for i¼1;. . .;G andk¼1;. . .;N. Because of this, there is no intercept in the current linear model. Leteibe anN3

1 vector of the residual errors, andeiN(0,Rs2). Note

that R is a known positive definite matrix. It is not an identity matrix because covariances among the residual errors have been introduced when the intercept is re-moved Quand Xu(2006). Given all theg_ijvariables, the

probability density ofYiis pðYij

X

Xjgij;Rs2Þ ¼NormalðYi; X

Xjgij;Rs2Þ; ð3Þ

where we use a notation Normal(x;b,d) orN(x;b,d) to denote a normal density for vectorxwith meanb and variance matrixd. In subsequent sections, we adopt the same notation for other probability densities,i.e.,

pðvariablejparameter listÞ

(3)

pðg_ijjh_ij;s2_jÞ ¼ ð1h_ijÞNðg_ij;0;dÞ1h_ijNðg_ij;0;s2_jÞ; ð4Þ whered¼104_{(a small positive number) and}_s2

j is an

unknown variance assigned to the jth locus. Variable

hij¼{0, 1} is used to indicate whethergijis sampled from

aN(0,d) or aN(0,s2

j) distribution. If it comes from the

first normal distribution,gijis virtually fixed at zero. If it

comes from the second normal distribution (the distri-bution with a large variance),g_ijhas a nontrivial value and should be estimated from the data. Therefore,h_ij¼ 1 means that locusjis an eQTL for transcripti. In some of the literature (e.g., Zhanget al. 2005), the mixture

distribution is described as

pðg_ijjh_ij;s2_jÞ ¼ ð1h_ijÞdðg_ijÞ1h_ijNðg_ij;0;s2_jÞ; ð5Þ whered(gij) is the Dirac delta function (originally

intro-duced by the British theoretical physicist Paul Dirac) that has the value of‘ for gij¼ 0 and the value zero

elsewhere. The Dirac delta function is actually a prob-ability density function. Therefore, the integral of the delta function from‘to1‘is 1. The above mixture distribution says thatg_ijhas a nonzero probability mass at the value zero and a quite flat normal distribution around zero provided thats2

j is large. This distribution

is called the spike and slab distribution (Mitchelland

Beauchamp 1988). When d ¼ 0, the two forms of

mixture distribution are identical. We further describe

hij by a Bernoulli distribution with probability rj,

denoted by Bernoulli(hij;rj). Because of the

hierarchi-cal nature of the model, we can further describerjby a

Dirichlet distribution, denoted by Dirichlet(rj; 1, 1).

The parameter rj will control the proportion of the

transcripts that are associated with markerj. All tran-scripts that are assigned to the second component of the mixture distribution are associated with thejth marker. The variance of the genes in this cluster is assigned a scaled inverse chi-square distribution, denoted by Invx2_ð_s2

j;d0;v0Þ, whered0¼5 andv0¼50 are used

in this study. The residual variance is assigned a vague prior;i.e., Invx2₍_s2_{; 0, 0)}_¼_1/_s2_{. This vague prior may} cause an improper posterior (Hobert and Casella

1996). In practice, this vague prior has been used all the time and people rarely see any problems. We used a proper scaled inverse gamma distribution for one case of the simulation to test the difference between the vague prior and the proper prior and did not see any notable difference (data not shown).

Markov chain Monte Carlo:Given the likelihood and

the prior of parameters, we are ready to infer the pos-terior distribution of the parameters. Because the form of the posterior distribution is intractable, we use Markov chain Monte Carlo (MCMC) sampling to draw a posterior sample from which empirical posterior means of interested parameters can be found. The most important parameters in the eQTL mapping arehijand

rj. The posterior mean of hij represents the

prob-ability that transcript iis associated with markerj. The posterior mean of rj reflects the proportion of

tran-scripts that are associated with markerj. In addition to these parameters, other parameters may also be inter-esting. For example, the posterior mean ofgijrepresents

how strongly marker j affects the expression of tran-scripti,i.e., the size of the eQTL.

First, we choose an initial value for each variable. We then derive the distribution of one variable conditional on the data and values of all other variables. This distri-bution is called the conditional posterior distridistri-bution, denoted by p(u_k j data, u_k), where u_k is the current variable of interest and u_kis the list of the remaining variables (excludinguk). This distribution usually has a

simple form from which a realized value for ukis

sam-pled. Once each and every variable is updated, we com-plete one iteration or sweep. The sampling process continues until the Markov chain reaches its stationary distribution. Convergence diagnosis was conducted us-ing the R package ‘‘coda’’ (Rafteryand Lewis1992). We

discard a number of iterations from the beginning of the chain (burn in) and save one observation in every 10 sweeps of the remaining iterations to form a posterior sample until the posterior sample is sufficiently large to allow an accurate estimate of the posterior mean for each variable. We now describe the sampling algorithm for each variable:

1. Variablehijis simulated from Bernoulli(hij;pij), where

p_ij ¼ rjNðgij;0;s 2

jÞ

r_jNðg_ij;0;s2_jÞ1ð1r_jÞNðg_ij;0;dÞ: ð6Þ

2. Variablegijis simulated fromN(gij;mg,s2g), where

m_g¼ X_jTR1Xj1

s2 h_ijs2_j1ð1h_ijÞd

" #1

X_jTR1DYi; ð7Þ

s2_g¼ X_jTR1Xj1

s2 h_ijs2_j 1ð1h_ijÞd

" #1

s2; ð8Þ

and

DYi¼Yi

XM

j96¼j

Xj9g_ij9; ð9Þ

which is called the offset of Yi adjusted for the jth

marker effect. 3. Samples2

j from

Invx2 s2_j;X

G

i¼1

h_ij1d0;X

G

i¼1

h_ijg2_ij1v0

!

(4)

4. Samples2_from

Invx2 _s2_;_GN_;X

G

i¼1 ðYi

XM

j¼1

XjgijÞTR1ðYi

XM

j¼1

XjgijÞ !

:

5. Simulaterjfrom

Dirichlet r_j;X

G

i¼1

h_ij11;GX

G

i¼1

h_ij11

!

:

So far, every variable has been updated. Let

h_ij¼N1 p

PNp

l¼1h

ðlÞ

ij be the posterior mean of variable

hij, where Np is the posterior sample size. Transcript

i is said to be associated with marker j if the pos-terior probabilityh_ij is greater than some prespecified threshold. It has been shown that the false discovery rate (FDR) can be controlled ata100% if the appro-priate threshold is the smallest posterior probabil-ity such that the average posterior probabilprobabil-ity of all transcript–marker linkage exceeding the threshold is

.1 a (Efron 2004; Newton et al. 2004). In the

current study, 0.8 is taken as the cutoff point. The 0.80 criterion is somewhat arbitrary and another value, say 0.90, may be used. In theapplicationsection, we show

that any value between 0.6 and 1 is reasonable. This criterion does not affect the MCMC process other than the list of transcripts declared as being associated to markerj.

Missing markers: The algorithm described above

applies to the situation where the genotypes of markers are known for all individuals. In reality, the genotypes of some markers may be missing for some individuals. This means that some elements ofXjare also variables (data

not observed) and subject to the same MCMC sampling as other variables. In this case, we adopt the usual sam-pling procedure in Bayesian mapping to simulate the missing genotypes (Raoand Xu1998). First, all missing

genotypes are sampled from their Mendelian prior distribution: 50% probability of taking either genotype for a BC individual or {25%, 50%, 25%} probability of taking one of the three genotypes for an F2individual. Given the sampled missing genotypes, all marker geno-types are presumably known. We can now start the MCMC process by updating the missing values one at a time from its conditional posterior distribution given the two flanking marker genotypes and all other in-formation available at the current stage of the MCMC process. The process of marker genotype imputation is the same as that described by Sen and Churchill

(2001) and implemented in R/QTL (Broman et al.

2003). The conditional posterior distribution is briefly described below.

Consider that the genotype of markerjis missing for individualk. Letp(YjX) be the probability density of the expression levels for all transcripts measured from

individualkgiven the genotype at markerj. Of course,

Xcan take up to two different values for a BC individual or three different values for an F2individual. Letp(MLj

X) andp(MRjX) be the probabilities for the left and the right flanking markers, respectively, conditional onX. Note that the order of the three markers isML,X,

MR. Let p(X) be the Mendelian prior probability for

the genotype of the missing marker. The conditional posterior probability is defined as

pðXjY;ML;MRÞ

¼P pðXÞpðMLjXÞpðMRjXÞpðYjXÞ X9pðX9ÞpðMLjX9ÞpðMRjX9ÞpðYjX9Þ

; ð10Þ

where P_X9 means summation over all possible values

of X9. This probability is used to simulate a realized value ofX.

Heritability for a transcript: In the eQTL analysis,

each transcript has been treated as a quantitative trait. Since the variance of a quantitative trait can be parti-tioned into a genetic variance component and an envi-ronmental variance component, the variance of an expression trait can be similarly partitioned. We now give the formula for calculating the variance com-ponents for each expression trait as a function of the eQTL effects and then further derive the average variance components for all expression traits in the entire microarray experiment. These variance compo-nents are further used to calculate the ‘‘heritability’’ of a transcript.

Lets2

Yibe the variance of transcriptiover all subjects

(a total ofN subjects). Lets2

Xj be the variance of

vari-able Xj over all subjects for marker j. Let us further

definesXjXj9 as the covariance betweenXjandXj9across subjects, i.e., the covariance between markersj and j9

forj96¼j. The variance of transcriptiis expressed as

s2_Y

i ¼

XM

j¼1

g2_ijs2_X

j12 X

j9 .j

g_ijg_ij9sXjXj91s

2_: _ð₁₁_Þ

For a BC design,s2

Xj ¼1 andsXjXj9 ¼12rjj9, whereas for an F2design,s2

Xj ¼

1

2andsXjXj9 ¼ 1

2ð12rjj9Þ, where rjj9is the recombination fraction between markersjand j9. Let

s2_G

i ¼

XM

j¼1

g2_ijs2_X

j12 X

j9 .j

g_ijg_ij9sXjXj9 ð12Þ

be the genetic component ofs2

Yi. The heritability for

transcriptiis

h2_i ¼ s

2 Gi

s2_G

i1s

2: ð13Þ

The average heritability of all transcripts may simply take the averageh2

i across all transcripts. However, there

(5)

transcripts. This type of overall heritability is defined as follows. First, we need to partition the expected variance of a transcript into variance components as shown below:

E½s2

Yi ¼

XM

j¼1

E½g2

ijs2Xj12 X

j9 .j

E½g_ijg_ij₉s_X_j_X_j₉1s2_: _ð₁₄_Þ

The expectation is taken with respect to the eQTL effects. Let s2

Y ¼E½s

2

Yi be the overall variance of the

transcripts. The mixture distribution of gij leads to E½g2

ij ¼rjs2j, where rj is the proportion of transcripts

that are linked to locusj. Because gijandgij9are

inde-pendent, we haveE[gijgij9]¼0. Therefore, the above

equation becomes

s2_Y ¼X

M

j¼1

r_js2_js2_X

j1s

2_: _ð₁₅_Þ

Lets2

G¼

PM

j¼1rjs2js2Xj be the genetic variance

compo-nent. The overall heritability is

h2¼ s

2 G

s2_G1s2: ð16Þ

This overall heritability of transcripts is different from the average heritability across all transcripts.

APPLICATIONS

Simulation studies: We carried out two simulation

experiments to compare the performance of the pro-posed method (hereafter designated as BAYES) and the MOM approach (Kendziorskiet al. 2006). In the first

experiments, 10 markers were evenly placed on a 360-cM genome,i.e., 40 cM per interval. Genotypes of markers for each subject were simulated on the basis of the Haldane map function (Haldane 1919). Four eQTL

were placed at markers 1, 3, 6, and 10;i.e., the segregation of these four markers affected the expression of some transcripts. We simulated 50 subjects and 1000 transcripts among which transcripts 605–610 were affected by eQTL at marker 1, transcripts 601–604 were affected by eQTL at marker 3, transcripts 961–1000 were affected by eQTL at marker 6, and transcripts 1–50 were affected by eQTL at marker 10. Note that a transcript was either associated with one eQTL or not associated with any of them. In eQTL analysis, a marker is claimed as an eQTL if it affects at least one transcript and a linkage is defined as a transcript–marker association. For example, in the first simulation experiment, there were a total of four eQTL and 100 linkages. The eQTL effects for the 100 linked transcripts (gij) were simulated from Normal(gij; 0, 32).

The residual error was sampled from Normal(eij; 0, 0.12),

such that the expected heritability for a transcript was 0.90. We chose this small residual variance because MOM

did not work well when the residual variance was large. When the residual variance was large, MOM either declared all transcripts as differentially expressed or did not run by throwing out an error message. The experi-ment was replicated 20 times. We used two methods to analyze the 20 replicated data sets. For the BAYES method, the threshold may be chosen to bound the pos-terior expected FDR ata100% as described intheory and methods. The average threshold value of the 20

replicates was 0.55 when the expected FDR was set to 1%, indicating that any value between 0.6 and 1 can serve as a reasonable threshold value to achieve the controlled FDR. We used 0.8 as the threshold value throughout the entire experiments.

The length of the chain required for convergence was determined by the R package coda. In the current MCMC diagnosis, the quantile to be estimated was set to 0.1, the margin of error of the estimate was 0.05, the probability of obtaining an estimate in the desired interval was 0.95, and the precision for the estimate of time to convergence was 0.001. The diagnosis indicated that 1800 iterations were sufficient to achieve conver-gence. Another issue that sometimes arises is the high posterior correlations or the so-called ‘‘stickiness’’ of the MCMC algorithm. The commonly utilized approach to circumvent this problem is to use every kth simulation draw, for some value of k such as 10, 20, or 50 (see Raftery and Lewis 1992 and Gelman et al. 1995 for

more details). In our study, we deleted 1000 iterations as a ‘‘burn-in’’ period and thereafter kept one observation in every 10 iterations until the posterior sample size reached 200. The overall length of the chain was 3000, much longer than the required length of 1800 reported from the coda diagnosis.

The MOM analysis actually consists of two steps: (1) identifying differentially expressed (DE) transcripts and (2) localizing eQTL for each differentially expressed transcript. In the first step, a transcript is identified as DE if the posterior probability of equivalent expression (EE) is smaller than some threshold, where thresholds are chosen to control the FDR at 5%. In the second step, the 90% highest posterior density (HPD) region was used to specify linkages between each transcript and markers. Both methods successfully detected four eQTL and 98 linkages on the basis of the analysis of 20 repli-cates. The two missed linkages were represented by asso-ciations between transcript 605 and marker 1 and between transcript 26 and marker 10, respectively. The true effects for the two linkages were 0.035 and 0.014, which are too small to be detectable with any reasonable analysis methods. Neither method detected any false linkages. The estimated proportions (rj) of transcripts

(6)

The normalized average evidence (NAE) of linkage was defined as the average posterior probability over tran-scripts and normalized by the sum of the evidence over all markers (Kendziorski et al. 2006). The estimated

proportions (rj) in the BAYES method, which are

equivalent to the NAE used in the MOM analysis, can be used to sift hot-spot regions where the transcripts are mapped. This simulation demonstrates that both meth-ods are adequate if a transcript is affected by at most one eQTL. The estimated parameters and the effects of the 98 detected linkages used in the simulation experi-ments are provided in Table 1 and also illustrated in Figure 2. The estimated parameters and the effects of eQTL are very close to the true values used to generate

the data. Note that the sample size for the linked tran-scripts was very small (#50) for each eQTL. As a result, the estimated s2

j’s showed some deviations from the

true value of 9.

In the second simulation experiment, we used the same marker map and eQTL setting as in the first exper-iment and generated 20 replicates. In this case, however, we let the eQTL at marker 1 control transcripts 1–20 and transcripts 971–990 and let the eQTL at marker 3 control transcripts 17–20. The transcripts controlled by the eQTL at markers 6 and 10 remained the same as in the first experiment. The purpose of the second simulation experiment was to allow some transcripts to be controlled by more than one marker. For example, transcripts 1–16 were controlled by markers 1 and 10, transcripts 971–990 were controlled by markers 1 and 6, and transcripts 17–20 were controlled by markers 1, 3, and 10. Again, we sampled all the eQTL effects from Normal(g_ij; 0, 32_{) and the residual error from Normal} (eij; 0, 0.12), such that the expected heritability for a

transcript is 0.92. For the sake of comparison, we used the empirical type I and type II errors and the empirical statistical power to evaluate the performance of our method (BAYES) and the MOM method. The current simulation experiment contains 134 linkages. The BAYES analysis detected 133 linkages and no false positives were declared. Therefore, the empirical type I error was zero, the type II error was 1

134¼0:007, and the empirical power was 1 0.007 ¼0.993. When we examined the single missed linkage (transcript 2 asso-ciated with marker 10), we found that the true marker effect for this transcript was 0.055, which again might be too small to be detected by any reasonable methods. Figure 1.—Hot-spot regions along the chromosome for

the first simulation experiment. NAE, normalized average evidence.

Figure2.—Effects of 98 linked transcripts in the first

sim-ulation experiment. Solid and dashed bars represent the true and the estimated values ofgij, respectively.

TABLE 1

Posterior mean and posterior variance (in parentheses) ofrjanda2j from the Bayes analysis for

the first simulation experiment

Parameter

Marker rj s2j

1 True 0.006 9

Estimate 0.006 (5.8E-6) 2.74 (3.1E-4)

3 True 0.004 9

Estimate 0.005 (6.2E-6) 11.7 (1.6E-3)

6 True 0.040 9

Estimate 0.041 (3.3E-5) 6.83 (2.1E-4)

10 True 0.050 9

Estimate 0.050 (4.7E-5) 7.85 (2.2E-4)

(7)

The true parameters used in the simulation and their estimated vales obtained from the proposed method are given in Table 2. The true and the estimated effects for the 133 detected linkages are illustrated in Figure 3. The estimates agree well with the true values. The hot-spot regions represented by the estimated proportions (rj)

of transcripts associated with the 10 markers are shown in Figure 4 (the top plot). All 4 markers controlling the expression of transcripts have been identified. A strik-ing advantage of the new method over MOM is that an individual linkage picture can be obtained when we focus on a specific transcript. The estimated marker effects for 6 of the detected transcripts are plotted in Figure 5, showing a close agreement to the true values. In the MOM analysis, however, many true linkages have been missed (see Table 3) because each transcript was

allowed to link to at most one locus. The MOM method worked well if a transcript is linked to only one marker,

e.g., transcript 997 associated with marker 6 (see Figure 5). When a transcript is controlled by more than one marker, the linkage signal occurs only at the position where the greatest eQTL resides. For example, tran-script 10 was controlled by markers 1 and 10. The linkage occurred only at marker 1 (true effect 4.462) with marker 10 (true effect 1.510) being completely missed. Transcript 12 was also controlled by markers 1 and 10 whose true effects were 0.997 and 4.489, respectively. MOM detected marker 10 for this tran-script. If the effects of markers controlling the same transcript are similar, MOM generates a confusing result, as demonstrated by transcripts 971 and 18. In the bottom plot of Figure 4, markers 5 and 7 were falsely identified as eQTL by MOM. That marker 6 had a higher peak than marker 1 was also incorrect. The empirical type I error was₉₈₆₆4 ¼0:0004 and the power was 1 48

134¼0:642 for the MOM analysis. The power would be even lower if more multiple linkages had been simulated.

The residual variance used in the above two simula-tion experiments was too small, leading to a irrasimula-tionally high expected heritability. In the following simulation experiments, we kept everything the same as that used for the second simulation experiment except residual variance. We varied the residual variance from 0.12_{to 3}2_, withs2_¼_0.52_{, 1}2_{, 1.5}2_{, 2}2_{, 2.5}2_{, 3}2_{. Again, for each of the} six scenarios, we generated 20 replicated data sets. The corresponding expected heritabilities are 0.81, 0.56, 0.32, 0.17, 0.15, and 0.07, respectively. We used only the proposed method (BAYES) to analyze these data sets Figure 4.—Hot-spot regions along the chromosome for

the second simulation experiment. NAE, normalized average evidence.

Figure3.—Effects of 133 linked transcripts in the second

simulation experiment. Solid and dashed bars represent the true and the estimated values ofgij, respectively.

TABLE 2

Posterior mean and posterior variance (in parentheses) ofrjands2j from the BAYES analysis for

the second simulation experiment

Parameter

Marker rj s2j

1 True 0.040 9

Estimate 0.042 (2.9E-5) 5.72 (3.3E-4)

3 True 0.004 9

Estimate 0.005 (6.1E-6) 6.43 (1.1E-3)

6 True 0.040 9

Estimate 0.041 (3.7E-5) 10.9 (2.1E-4)

10 True 0.050 9

Estimate 0.049 (4.5E-5) 13.3 (2.9E-4)

(8)

because MOM did not work whens2_._0.12_{. From Figure} 6, we can see that the empirical power decreased dramatically as the residual variance increased. The empirical type I error increased accordingly, but not as much as the decrease in the empirical power due to the stringent control for the FDR.

Mice data analysis: We analyzed a mice data set

pub-lished by Lanet al. (2006). The data are publicly

avail-able at gene expression omnibus (GEO) with accession no. GSE3330. The data consist of 40,738 transcripts whose expression levels were measured from 60 F2(ob/ob) mice in an obesity-related research. The expression levels were normalized and background was corrected by the robust multiarray average method (Irizarry et al.

2003). Genotypes for 145 markers (distributed over 19 chromosomes) and phenotypes for 25 obesity-related traits were collected from the 60 mice. We noted that the expression levels of most transcripts are constant across the 60 individuals. Those transcripts may not provide

any information on the eQTL analysis and thus should be eliminated prior to the analysis. We sorted all tran-scripts by their variances across individuals and deleted the transcripts with variances,0.12, leaving 1576 most varying transcripts for further analysis. Figure 7 shows the variations of 6 transcripts across 60 individuals. The 3 transcripts on the left had large variances and thus were kept in the data for further analysis. The remaining three transcripts (on the right) were deleted because their variances were small (,0.12).

For the BAYES method, we used the same length of the Markov chain as used in the simulation studies. The chain was diagnosed for convergence using coda.

Because there were 145 markers, the model for each transcript contained 145 effects. These 145 effects were estimated simultaneously. Of the 1576 transcripts in-cluded in the analysis, 843 of them were linked to at least one marker on the mouse genome. Of the 145 markers, 129 of them were claimed to control the expression of the transcripts to some extent. Figure 8 shows the proportion of transcripts associated with each of the 145 markers. Five markers with the highest proportions of linked transcripts are indicated by triangles. The five highest peaks of the profile (hot-spot regions) are dif-ferent from the 5 strongest markers mapped by K end-ziorskiet al. (2006). Marker D4Mit237 was the largest

eQTL on the mouse genome detected with MOM. However, we found that only 2 transcripts were linked to this locus. The marker with the highest proportion of linked transcripts identified by our method was Figure 5.—Estimated marker

effects gij, for six detected

tran-scripts by BAYES (the second sim-ulation experiment). The triangle in each plot indicates the position of eQTL discovered by MOM.

TABLE 3

Number of linkages detected by BAYES and MOM for the second simulation experiment

Marker

Method 1 3 6 10

True 40 4 40 50

BAYES Estimate 40 4 40 49

(9)

D15Mit63, a locus proved to be associated with the trait of higher early life body weight (Milleret al. 2002). In

addition, the hottest marker on chromosome 4 was actually D4Mit149. Another highly ranked eQTL de-clared by MOM was marker D2Mit241, which resides on

chromosome 2. However, only 3 transcripts were claimed to be linked to this position with the BAYES method. The most influencing marker on chromosome 2 identified by the new method was D2Mit274. It is close to marker D2Mit9, which is an obesity-modifier locus Figure 6.—The changes of empirical type I

error (top) and empirical power (bottom) as the residual variance increases.

Figure7.—Plots of expression

(10)

recently discovered by Stoehr et al. (2004). We also

found that marker D2Mit9, itself an eQTL identified by the BAYES method, affected the expression of 7 tran-scripts. Markers D5Mit1 and D8Mit249 are known to affect triglyceride level (Colinayo et al. 2003) and fat

content (Naggert et al. 1995), both of which are

obesity-related traits. These two markers were success-fully identified by both the BAYES and the MOM methods.

Researchers usually start with a particular gene that has a known function and then quickly identify other genes that are associated with the known gene. Further research is then performed to discover the biological functions for the unannotated genes. The new method developed in this study provides such a tool to detect these genes. For example, a recent study has shown that stearoyl-CoA desaturase-1 (Scd1) is an important gene for lipid metabolism and insulin sensitivity (Lan et al.

2006). The linkage profile (Figure 9, top) shows that the obesity-related gene Scd1 was associated with four markers represented by the largest four eQTL effects. Among the four markers, D15Mit63 was a very impor-tant locus that also controls the expression of 520 other transcripts. Note that marker D15Mit63 was the largest eQTL on the mouse genome identified with our method and it has been proved to be an obesity-related locus (Milleret al. 2002). We plotted the estimated effects for

two of the transcripts that also link to this locus (see Figure 9). One gene was ELOVL family member 6

(Elovl6), a gene in charge of elongation of long-chain

fatty acids. The other was a gene that encodes fatty acid synthase (Fasn). These genes are likely to be involved in metabolism related to obesity.

Two steps are usually taken to infer the functions of transcripts. The first step is to map QTL for the traits of interest. The second step is to perform eQTL analysis only for the markers detected in the QTL analysis. If a QTL regulating a trait of interest is also an eQTL for some transcripts, then the functions of the transcripts can be inferred. The proposed Bayesian analysis allows us to infer functions of transcripts jointly in a single step. In the mouse experiment (Lanet al. 2006), 25

obesity-related traits were measured. We simply treated the 25 traits as 25 additional transcripts and added them to the existing list of 1576 transcripts, making a list of 25 1

1576 ¼ 1601 traits. These 25 traits were subjected to the same eQTL mapping. If a marker is identified as being associated with both a transcript and a regular quantitative trait, the transcript is claimed to be associ-ated with the quantitative trait. However, quantitative traits are measured in different scales from the tran-scripts. Therefore, the effects of QTL have different scales from the effects of eQTL. This problem can be solved by rescaling the quantitative traits. In the mouse data analysis, we sorted the 1576 transcripts by the var-iance across the 60 mice in a descending order and cal-culated the average variance of the top 5% transcripts. We then used this average variance to rescale each of the 25 traits so that each of them had a variance equal to the average variance after the rescaling. There were 5% mis-sing phenotype measurements in the mouse data. The missing phenotypes were imputed using the multiple-imputation method (Rubin 1987). The proposed

Bayesian analysis shows that each one of the 25 traits was associated with at least one marker of the mouse genome and 12 traits were linked to marker D15Mit63. A total of 521 transcripts were also linked to marker D15Mit63, implying that these transcripts may alter the 12 obesity-related traits. A complete list of the associated transcripts, markers, and phenotypes is provided in the supplemental table at http://www.genetics.org/ supplemental/.

DISCUSSION

MOM is so far the only statistical method specifically developed for analyzing expression data and marker data jointly. The advantage of the MOM approach is that its computational efficiency allows all expression traits to be accounted for in the analysis. However, the as-sumption of a single eQTL per expression trait is too strong. Although HPD can be used to identify multiple eQTL, it does not perform well in general. For example, if eQTL are adjacent to one another and their effects are Figure_{8.—Plots of the hot-spot regions for the}

(11)

of the same size, HPD is able to detect them all by placing posterior probabilities evenly on each eQTL (see the top plot in Figure 10). However, if neither of the two conditions is satisfied, the MOM method will generate incomplete or misleading results (see middle and bottom plots in Figure 10).

The BAYES method proposed in this study is a multiple-eQTL model, in which a transcript is allowed to be linked to more than one locus. The results pre-sented in this study showed that a multiple-eQTL model is more desirable than a single-eQTL model. However, an investigator needs to trim down the transcript space to a reasonable size. For the BAYES analysis, we sug-gested selecting transcripts on the basis of the variances of their expression levels; i.e., a cutoff value is sub-jectively chosen and transcripts with variances less than this value are excluded from analysis. Usually, 1000– 2000 transcripts might be a good choice because the expression levels of most transcripts do not change across experimental subjects. The preliminary screen-ing aimed at reducscreen-ing the computational burden and had no effects on the result. We noted that results obtained from the analysis of 1500 selected transcripts were the same as those obtained from 5000 selected transcripts for mice data (data not shown). A similar prescreening scheme was used in a recent microarray data analysis (Ghazalpour et al. 2006). In reality, we

may run the computationally quicker MOM approach

first to estimate the number of differentially expressed transcripts and then use this information for transcript screening before the BAYES method is applied. In addition, the results obtained from MOM may be useful when setting priors for the BAYES analysis.

The hyperparameters for the scaled inverse chi-square priors in this study are chosen in a subjective way as done in Ishwaranand Rao(2005). According to

Ishwaranand Rao(2003), these values do not need to

be tuned for each data set and can be fixed. No sub-stantial differences have been observed when several different sets of priors were tried in a simulation study (data not shown). The vague prior (1/s2_{) we chose as} the prior for the residual variance has not caused any problem, since the MCMC diagnosis indicated a satis-factory convergence of the posterior sample. However, as Hobert and Casella(1996) warned, the marginal

posterior distribution for the variance may not exist. Therefore, the hyperparameters for the scaled inverse chi-square prior should be chosen so that it is proper, which will lead to a proper posterior.

In the simulation studies, we sampled the effects of linkages from Normal(gij; 0, 32), where 32was chosen in

a subjective manner. The performance of the proposed method does not depend on the choice of the variance for linkage effects, but rather on the ratio of this variance to the residual variance that affects the ex-pected heritability for a transcript. As we demonstrated Figure 9.—Effects of markers

(12)

in theapplicationssection, our method is robust given

different residual variances. We also carried out one more simulation experiment, where the effects of link-ages were sampled from a uniform distribution with a wide range (from 10 to 10). The performance of the BAYES method was still very satisfactory (data not shown).

We introduced a Bayes method for joint analysis of transcripts and markers. If a marker is associated with at least one transcript, this marker is considered as a candidate eQTL. However, it is rare that an eQTL sits exactly at a marker position. Therefore, the marker analysis will provide biased estimates for both the locations and the sizes of the eQTL unless the marker density is sufficiently high. The method can be ex-tended so that eQTL can be mapped to arbitrary posi-tions between markers. This is equivalent to the extension of individual marker analysis to interval map-ping for quantitative trait loci. Note that QTL locations were often treated as fixed in QTL analyses (Hoeschele

and VanRaden1993a,b). In a very recent interval

map-ping technique, such as that in Wanget al. (2005), a

QTL was allowed to take a position varying within a marker interval, leading to locating QTL and estimating their effects more precisely. This extension will add extra complexity to the existing method because eQTL positions become parameters also and are subject to similar Monte Carlo sampling in the Bayesian analysis. Two approaches can be taken for such an extension. One is the fixed-interval approach where one potential eQTL is assumed in each marker interval. The prior distribution of the location of the putative eQTL is

uniform within the interval. The posterior distribution can be inferred and a realized location can be sampled using the Metropolis–Hastings method (Metropolis

et al. 1953; Hastings1970). If an interval has no eQTL,

the posterior distribution will remain uniform and the eQTL effect will be shrunken to zero. If an interval does cover an eQTL, the posterior distribution of the eQTL location will be peaked at the true location and the eQTL effect will be estimated subject to no shrinkage

(Wanget al. 2005). When the marker density is high and

markers are unevenly distributed along the genome, the variable-interval approach may be adopted to sample the eQTL location (Wanget al. 2005), where the

num-ber of eQTL included in the model can be substantially less than the number of marker intervals. The eQTL genotypes are always missing, but realized genotypes can be sampled from the distribution given in Equation 10 of theMissing markerssection.

Mapping eQTL is an important subject in the field of statistical genomics. Current methods for eQTL map-ping rely mostly on either a microarray analysis pro-cedure or a QTL mapping statistic since no well thought out statistical method has emerged. The proposed joint analysis is one of the very few studies particularly tar-geting this subject. The field of eQTL mapping is very young and needs substantial effort from scientists across multidisciplinary fields to become a mature science. The proposed BAYES method may still be very crude, but it provides a starting point from which more com-prehensive techniques can be developed.

Figure10.—Linkage maps for three simulated

(13)

This research was supported by National Institutes of Health grant R01-GM55321 and National Science Foundation grant DBI-0345205 to S.X.

LITERATURE CITED

Broman, K. W., H. Wu, S. Senand G. A. Churchill, 2003 R/qtl:

Qtl mapping in experimental crosses. Bioinformatics19:889– 890.

Colinayo_{, V. V., J. H. Q}iao_{, X. P. W}ang_{, K. L. K}rass_{, E. S}chadt_{et al}_.,

2003 Genetic loci for diet-induced atherosclerotic lesions and plasma lipids in mice. Mamm. Genome14:464–471.

Efron, B., 2004 Large-scale simultaneous hypothesis testing: the

choice of a null hypothesis. J. Am. Stat. Assoc.99:96–104. Gelman, A., 2005 Analysis of variance - why it is more important

than ever. Ann. Stat.33:1–31.

Gelman_{, A., J. C}arlin_{, H. S}tern_{and D. R}ubin_{, 1995} _{Bayesian Data}

Analysis. Chapman & Hall/CRC Press, New York.

George_{, E. I., and R. E. M}cculloch_{, 1993} _{Variable selection via}

Gibbs sampling. J. Am. Stat. Assoc.88:881–889.

Ghazalpour_{, A., S. D}oss_{, B. Z}hang_{, S. W}ang_{, C. P}laisier_{et al}_.,

2006 Integrating genetic and network analysis to characterize genes related to mouse weight. Plos Genet.2:1182–1192. Haldane, J. B. S., 1919 The combination of linkage values and

the calculation of distances between the loci of linked factors. J. Genet.8:299–309.

Hastings, W. K., 1970 Monte-Carlo sampling methods using

Mar-kov chains and their applications. Biometrika57:97–109. Hobert, J. P., and G. Casella, 1996 The effect of improper priors

on Gibbs sampling in hierarchical linear mixed models. J. Am. Stat. Assoc.91:1461–1473.

Hoeschele_{, I., and P. M. V}an_Raden_{, 1993a} _{Bayesian analysis of}

link-age between genetic markers and quantitative trait loci. I. Prior knowledge. Theor. Appl. Genet.85:953–960.

Hoeschele, I., and P. M. VanRaden, 1993b Bayesian analysis of

linkage between genetic markers and quantitative trait loci. II. Combining prior knowledge with experimental evidence. Theor. Appl. Genet.85:946–952.

Hubner_{, N., C. A. W}allace_{, H. Z}imdahl_{, E. P}etretto_{, H. S}chulz

et al., 2005 Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat. Genet.37:243–253.

Irizarry_{, R. A., B. H}obbs_{, F. C}ollin_{, Y. D. B}eazer_-Barclay_{, K. J.}

Antonelliset al., 2003 Exploration, normalization, and

sum-maries of high density oligonucleotide array probe level data. Biostatistics4:249–264.

Ishwaran, H., and J. S. Rao, 2003 Detecting differentially

ex-pressed genes in microarrays using Bayesian model selection. J. Am. Stat. Assoc.98:438–455.

Ishwaran_{, H., and J. S. R}ao_{, 2005} _{Spike and slab gene selection for}

multigroup microarray data. J. Am. Stat. Assoc.100:764–780. Jansen_{, R. C., 1993} _{Interval mapping of multiple quantitative trait}

loci. Genetics135:205–211.

Kao_{, C. H., Z-B. Z}eng_{and R. D. T}easdale_{, 1999} _{Multiple interval}

mapping for quantitative trait loci. Genetics152:1203–1216. Kendziorski, C. M., M. Chen, M. Yuan, H. Lanand A. D. Attie,

2006 Statistical methods for expression quantitative trait loci (eqtl) mapping. Biometrics62:19–27.

Lan, H., J. P. Stoehr, S. T. Nadler, K. L. Schueler, B. S. Yandell

et al., 2003 Dimension reduction for mapping mRNA abun-dance as quantitative traits. Genetics164:1607–1614.

Lan_{, H., M. C}hen_{, J. B. F}lowers_{, B. S. Y}andell_{, D. S. S}tapleton_{et al}_.,

2006 Combined expression trait correlations and expression quantitative trait locus mapping. PloS Genet.2:51–61. Lander, E. S., and D. Botstein, 1989 Mapping Mendelian factors

underlying quantitative traits using RFLP linkage maps. Genetics 121:185–199.

Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H.

Teller_{and E. T}eller_{, 1953} _{Equation of state calculations by}

fast computing machines. J. Chem. Phys.21:1087–1092. Miller_{, R. A., J. M. H}arper_{, A. G}alecki_{and D. T. B}urke_{, 2002} _Big

mice die young: early life body weight predicts longevity in genet-ically heterogeneous mice. Aging Cell1:22–29.

Mitchell, T. J., and J. J. Beauchamp, 1988 Bayesian variable

selec-tion in linear regression. J. Am. Stat. Assoc.83:1023–1036. Naggert, J. K., L. D. Fricker, O. Varlamov, P. M. Nishina,

Y. Rouille_{et al}., 1995 Hyperproinsulinaemia in obese fat/fat

mice associated with a carboxypeptidase-e mutation which re-duces enzyme-activity. Nat. Genet.10:135–142.

Newton_{, M. A., A. N}oueiry_{, D. S}arkar_{and P. A}hlquist_{, 2004}

De-tecting differential gene expression with a semiparametric hier-archical mixture method. Biostatistics5:155–176.

Pan, W., 2002 A comparative review of statistical methods for

discov-ering differentially expressed genes in replicated microarray experiments. Bioinformatics18:546–554.

Qu, Y., and S. Xu, 2006 Quantitative trait associated microarray gene

expression data analysis. Mol. Biol. Evol.23:1558–1573. Raftery, A. E., and S. M. Lewis, 1992 One long run with

diagnos-tics: implementation strategies for Markov chain Monte Carlo. Stat. Sci.7:493–497.

Rao_{, S. Q., and S. X}u_{, 1998} _{Mapping quantitative trait loci for}

cat-egorical traits in four-way crosses. Heredity81:214–224. Rubin_{, D. B., 1987} _{Multiple Imputation for Nonresponse in Surveys}_.

John Wiley & Sons, New York.

Schadt_{, E. E., S. A. M}onks_{, T. A. D}rake_{, A. J. L}usis_{, N. C}he_{et al}_.,

2003 Genetics of gene expression surveyed in maize, mouse and man. Nature422:297–302.

Sen_{, S., and G. A. C}hurchill_{, 2001} _{A statistical framework for}

quan-titative trait mapping. Genetics159:371–387.

Stoehr_{, J. P., J. E. B}yers_{, S. M. C}lee_{, H. L}an_{, I. V. B}oronenkov_{et al}_.,

2004 Identification of major quantitative trait loci controlling body weight variation in ob/ob mice. Diabetes53:245–249. Tusher, V., R. Tibshiraniand C. Chu, 2001 Significance analysis of

microarrays applied to ionizing radiation response. Proc. Natl. Acad. Sci. USA98:5116–5121.

Wang, H., Y. M. Zhang, X. M. Li, G. L. Masinde, S. Mohan_{et al}.,

2005 Bayesian shrinkage estimation of quantitative trait loci pa-rameters. Genetics170:465–480.

Yvert_{, G., R. B. B}rem_{, J. W}hittle_{, J. M. A}key_{, E. F}oss _{et al}_.,

2003 Trans-acting regulatory variation in saccharomyces cerevi-siae and the role of transcription factors. Nat. Genet.35:57–64. Zeng, Z-B., 1994 Precision mapping of quantitative trait loci.

Genet-ics136:1457–1468.

Zhang, D., M. T. Wells, C. D. Smartand W. E. Fry, 2005 Bayesian

normalization and identification for differential gene expression data. J. Comput. Biol.12:391–406.