On Prediction of Genetic Values in Marker-Assisted Selection


Christoph Lange*,† and John C. Whittaker†

*Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115 and †School of Applied Statistics, University of Reading, Reading RG6 6FN, United Kingdom

Manuscript received August 31, 2000
Accepted for publication September 5, 2001

ABSTRACT

We suggest a new approximation for the prediction of genetic values in marker-assisted selection. The new approximation is compared to the standard approach. It is shown that the new approach will often provide substantially better prediction of genetic values; furthermore the new approximation avoids some of the known statistical problems of the standard approach. The advantages of the new approach are illustrated by a simulation study in which the new approximation outperforms both the standard approach and phenotypic selection.

Corresponding author: John C. Whittaker, Department of Epidemiology and Public Health, Imperial College School of Medicine, St. Mary's Campus, Norfolk Place, London W2 1PG, United Kingdom.

MARKER-ASSISTED selection (MAS), like many quantitative trait loci (QTL)-mapping techniques (Haley and Knott 1992; Zeng 1994; Jansen 2001), exploits the linkage disequilibrium between markers and QTL produced when inbred lines are crossed. However, in MAS we do not aim to map the QTL and estimate their positions and effect sizes. The goal of marker-assisted selection is the improvement of certain traits in breeding programs for plants or animals. We therefore want to predict the genetic values $z$ for certain traits of each individual and select the best individuals according to their genetic values for further breeding.

Considerable literature on this topic now exists, reviewed in Whittaker (2001). The key conclusion is that, since marker-assisted selection employs both marker and phenotypic information for the prediction of the genetic values, it can outperform selection based solely on phenotypic information, especially when the sample size is large (Gimelfarb and Lande 1994; Whittaker et al. 1995; Hospital et al. 1997).

The standard approach to marker-assisted selection for inbred lines, due to Lande and Thompson (1990), is based on a two-stage procedure. First, phenotypes are regressed on marker information to give a prediction of the genetic value, known as the marker score, for each individual. This score is then combined with phenotypic information using a selection index to give a final prediction of genetic merit.

Here we introduce an alternative single-stage procedure, which essentially treats information at each individual marker as a separate trait. Thus all marker information can be entered, together with phenotypic information, into a single selection index, which is then used to predict the genetic merit. This was first suggested for a single marker and trait by Smith and Simpson (1986); here we describe a method suitable for multiple markers and multiple traits. The approach is computationally simpler than the standard approach by Lande and Thompson (1990) and simulation studies show that the new approach can give substantial improvements in selection response. Further, the new approach can be seen as a natural extension of the method proposed by Smith and Simpson (1986).

METHODS

We start with $n$ individuals from an F2 generation derived by crossing two inbred parental lines. For each individual we record phenotypes at $m$ traits to give the vector $y_i = (y_{i1}, \ldots, y_{im})^T$, $i = 1, \ldots, n$, and also a set of $p$ marker values $x_i = (x_{i1}, \ldots, x_{ip})$. We assume that in every individual the same genetic markers are typed. $Y = (Y_{ij}) \in \mathbb{R}^{n \times m}$ is the matrix of all phenotypic values and $Z = (Z_{ij}) \in \mathbb{R}^{n \times m}$ the corresponding matrix of the unobservable genetic values. For all loci, both markers and QTL, we denote loci homozygous for the allele from the first parental line by $-1$, loci homozygous for the allele from the second parental line by $+1$, and heterozygotes by 0.

Suppose we wish to improve all $m$ traits simultaneously in the breeding program. We can rank and select individuals for further breeding only when the genetic "values" of the individuals can be characterized by a single scalar value and not by an $m$-dimensional vector. Hence one has to combine the distinct trait values and define an overall genetic "value" for each individual.

This is typically done by computing a linear index $I = (I_1, \ldots, I_n)^T$, where $I_i$ is the value for the $i$th individual and is determined by the index vector $a = (a_1, \ldots, a_m)^T \in \mathbb{R}^m$, where $a_j$ typically describes the economic value of the corresponding trait (Falconer
and Mackay 1997). The scalar index value $I_i$ for the $i$th individual is then computed by

$$I_i = \sum_{j=1}^{m} a_j Z_{ij} \qquad (1)$$

with $Z_i = (Z_{i1}, \ldots, Z_{im})$.

The individuals with the largest index values are selected for further breeding. We therefore have to predict the index vector $I$ on the basis of the phenotypic information $y$ and the marker information $\{x_1, \ldots, x_n\}$:

$$E(I_i \mid y_i, x_i) = \sum_{j=1}^{m} a_j E(Z_{ij} \mid y_i, x_i). \qquad (2)$$

This linear structure of the index $I_i$ means that the problem of predicting $E(I_i \mid y_i, x_i)$ is reduced to the problem of predicting $E(Z_i \mid y_i, x_i)$. We assume that the unobservable genetic value $Z_{ij}$ is given by

$$Z_{ij} = \sum_{k=1}^{K_j} a_{jk} G_{ik} \qquad (3)$$

and the phenotypic value $Y_{ij}$ by

$$Y_{ij} = Z_{ij} + \varepsilon_{ij} \qquad (4)$$

with $G_{ik}$ the genetic QTL score of the $k$th QTL in the $i$th individual, $a_{jk}$ the additive effect of the $k$th QTL on the $j$th trait, $K_j$ the number of QTL acting on the $j$th trait, and $\varepsilon_{ij}$ the environmental errors. Note that the environmental errors for different traits in the same individual may be correlated, but we assume that the environmental errors for different individuals are independent. Thus $Z_{ij}$ depends on the unknown QTL scores $G_{ik}$, the unknown additive effects $a_{jk}$, and on the unknown numbers of QTL $K_j$ acting on each trait. The exact computation of $E(Z_i \mid y_i, x_i)$ is consequently difficult (Whittaker et al. 1995), and so a linear approximation of $E(Z_i \mid y_i, x_i)$ first suggested by Lande and Thompson (1990) is often used. Instead of conditioning on the phenotypic information $y_i$ and the marker information $x_i$, they approximated the conditional expectation of $Z_i$ by conditioning on the phenotypic information $y_i$ and the marker score $s_i$:

$$E(Z_i \mid y_i, x_i) \approx E(Z_i \mid y_i, s_i) \qquad (5)$$

with the multivariate marker score $s_i = (s_{i1}, \ldots, s_{im})^T$ defined by

$$s_{ij} = \beta_{0j} + \sum_{k \in \mathcal{A}_j} \beta_{kj} x_{ik}, \qquad \mathcal{A}_j \subset \{1, \ldots, p\}, \qquad (6)$$

where $\mathcal{A}_j$ is a subset of the recorded markers that is assumed to model the QTL effects of the $j$th trait. There are a number of ways to get the marker subsets $\mathcal{A}_j$, including model selection strategies such as the Akaike information criterion, Bayesian information criterion, or Mallows' $C_p$ (Whittaker 2001). The coefficients $\beta_{0j}$ and $\beta_{kj}$ are estimated by regressing the phenotype on the marker information. The best linear predictor $P^{Z_i}_{Y_i, S_i}$ of $E(Z_i \mid y_i, s_i)$ conditional on the phenotypic information $y_i$ and the marker score $s_i = (s_{i1}, \ldots, s_{im})^T$ is then given by

$$E(Z_i \mid y_i, s_i) \approx P^{Z_i}_{Y_i, S_i} = (B_0 \mid B_1) \begin{pmatrix} y_i \\ s_i \end{pmatrix} = B_0 y_i + B_1 s_i \qquad (7)$$

(Brockwell and Davis 1991). This is the standard selection index theory (Falconer and Mackay 1997), with matrices $B_0 \in \mathbb{R}^{m \times m}$ and $B_1 \in \mathbb{R}^{m \times m}$,

$$B = (B_0 \;\; B_1) = G P^{-1} \in \mathbb{R}^{m \times 2m},$$

where

$$G = \big(\mathrm{Var}(Z), \; \mathrm{Cov}(Z, S)\big), \qquad P = \begin{pmatrix} \mathrm{Var}(Y) & \mathrm{Cov}(Y, S) \\ \mathrm{Cov}(S, Y) & \mathrm{Var}(S) \end{pmatrix}$$

with $S = (S_{ij}) \in \mathbb{R}^{n \times m}$.

We now show how the two-stage Lande and Thompson procedure (5) described above, with $s_i$ calculated by regression of $y_i$ on $x_i$ and then $y_i$ and $s_i$ combined to predict the genetic value $z_i$, can be replaced by a single-step procedure. Instead of the separate steps we compute directly the best linear predictor $P^{Z_i}_{Y_i, X_i}$ of $E(Z_i \mid y_i, x_i)$ conditional on the phenotypic information $y_i$ and the complete set of marker information $x_i$ by

$$E(Z_i \mid y_i, x_i) \approx P^{Z_i}_{Y_i, X_i} = (\tilde{B}_0 \mid \tilde{B}_1) \begin{pmatrix} y_i \\ x_i \end{pmatrix} = \tilde{B}_0 y_i + \tilde{B}_1 x_i \qquad (8)$$

with

$$\tilde{B} = (\tilde{B}_0 \;\; \tilde{B}_1) = \tilde{G} \tilde{P}^{-1}, \qquad \tilde{G} = \big(\mathrm{Var}(Z), \; \mathrm{Cov}(Z, X)\big), \qquad \tilde{P} = \begin{pmatrix} \mathrm{Var}(Y) & \mathrm{Cov}(Y, X) \\ \mathrm{Cov}(X, Y) & \mathrm{Var}(X) \end{pmatrix}.$$

This essentially sets up a selection index including all markers and phenotypes, where each marker is treated as a separate trait. To compare the Lande and Thompson approximation (7) with the approximation proposed in (8), we consider the spaces in which the best linear predictors are computed.
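The two predictors (7) and (8) can be sketched numerically. The Python/NumPy code below is an illustration only, not the simulation design of this article: the toy data, the crude QTL-marker association, the fixed marker subset standing in for model selection, and the helper names `centre` and `blp` are all assumptions made for the example. Covariances involving the genetic values use the simulated true $Z$, which corresponds to the setting without estimation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data, not the simulation design of this article.
n, p, m = 200, 12, 2                        # individuals, markers, traits
codes = [-1.0, 0.0, 1.0]                    # genotype coding used in the text
freq = [0.25, 0.5, 0.25]                    # F2 genotype frequencies
X = rng.choice(codes, size=(n, p), p=freq)

# Assumption: one QTL per marker that matches the marker genotype about 80%
# of the time; a crude stand-in for linkage, not a recombination model.
fresh = rng.choice(codes, size=(n, p), p=freq)
Q = np.where(rng.random((n, p)) < 0.2, fresh, X)
Z = Q @ rng.normal(size=(p, m))             # genetic values, as in Equation 3
Y = Z + rng.normal(scale=3.0, size=(n, m))  # phenotypes, as in Equation 4


def centre(A):
    """Subtract the column means."""
    return A - A.mean(axis=0)


def blp(Z, W):
    """Empirical best linear predictor of Z from the columns of W.

    The weights are B = G P^+ with G = Cov(Z, W) and P = Var(W); the
    pseudoinverse also covers a singular P.
    """
    Zc, Wc = centre(Z), centre(W)
    G = Zc.T @ Wc / (len(W) - 1)
    P = Wc.T @ Wc / (len(W) - 1)
    B = G @ np.linalg.pinv(P)
    return Z.mean(axis=0) + Wc @ B.T


# Two-stage route: marker scores from a fixed, hypothetical marker subset
# (standing in for model selection), then the index on (Y, S) as in (7).
subset = [0, 3, 7]
Xs = np.column_stack([np.ones(n), X[:, subset]])
beta, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
S = Xs @ beta
Z_two_stage = blp(Z, np.hstack([Y, S]))

# Single-stage route: every marker enters the index as its own trait, as in (8).
Z_single = blp(Z, np.hstack([Y, X]))

mse_two_stage = np.mean((Z - Z_two_stage) ** 2)
mse_single = np.mean((Z - Z_single) ** 2)

a = np.ones(m)                              # index vector a = (1, 1)^T
index_single = Z_single @ a                 # predicted index values

print(f"two-stage MSE: {mse_two_stage:.3f}, single-stage MSE: {mse_single:.3f}")
```

In this sketch the in-sample mean square error of the single-stage predictor cannot exceed that of the two-stage predictor, because the centred marker scores are linear combinations of the centred markers. How large the gap is depends entirely on the assumed toy data.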

Because of the definition of $S_i$ (Equation 6) it is obvious that

$$S_{ij} \in \mathrm{span}(X_{i1}, \ldots, X_{ip})$$

and consequently

$$\mathrm{span}(Y_i, S_{i1}, \ldots, S_{im}) \subset \mathrm{span}(Y_i, X_{i1}, \ldots, X_{ip})$$

with $X_i = (X_{i1}, \ldots, X_{ip})$. Thus the best linear predictor of $E(Z_i \mid y_i, s_i)$ is always computed in a subspace of the space in which the best linear predictor of $E(Z_i \mid y_i, x_i)$ is computed (Equation 6). Therefore the approximation proposed by Lande and Thompson (1990) is always suboptimal relative to approximation (8). Since the number of markers $p$ can be substantially greater than the number of traits $m$, the dimension of $\mathrm{span}(Y_i, S_{i1}, \ldots, S_{im})$ may be much smaller than the dimension of $\mathrm{span}(Y_i, X_{i1}, \ldots, X_{ip})$, and so the best linear predictor $P^{Z_i}_{Y_i, S_i}$ computed in the subspace may have a much greater mean square error than the best linear predictor $P^{Z_i}_{Y_i, X_i}$ computed in the entire span of $Y_i$ and $X_i$. This suggests that the prediction of the genetic values suggested by Lande and Thompson (1990) might have a substantially bigger mean square error than prediction (8). On the other hand, Appendix III of Lande and Thompson (1990) shows that the two predictors should be equivalent in the special case of large population sizes and linkage equilibrium between all markers. Crucially, however, this result does not hold when markers are in linkage disequilibrium, as is the case for the populations considered here, and our simulation results show a substantial advantage for approximation (8).

Finally, note that a further drawback of the Lande and Thompson approach is the appearance of the marker score $S_i$ in the approximation (7). The marker score $S_i$ is unknown and has to be predicted by Equation 6. Since the marker subsets $\mathcal{A}_j$ are not known, they have to be obtained by model selection strategies (Whittaker 2001). There is often considerable noise associated with this model selection, so that the linear approximation (8), which avoids this step, might be expected to give more stable predictions. This might also suggest using the Lande and Thompson approach without the model selection step. However, this is known to perform poorly in most situations (e.g., Zhang and Smith 1992; Gimelfarb and Lande 1994; Whittaker et al. 1995), and we found by simulation that the same was true here (results not shown).

Now we can apply either the linear approximation (7) of $E(Z_i \mid y_i, s_i)$ or (8) of $E(Z_i \mid y_i, x_i)$ to the index prediction problem (2) and predict the index value $I_i$ of the $i$th individual; e.g., for approximation (7)

$$E(I_i \mid y_i, x_i) \approx a^T G P^{-1} \begin{pmatrix} y_i \\ s_i \end{pmatrix} \qquad (9)$$

and for approximation (8)

$$E(I_i \mid y_i, x_i) \approx a^T \tilde{G} \tilde{P}^{-1} \begin{pmatrix} y_i \\ x_i \end{pmatrix}. \qquad (10)$$

For either approach a number of matrices have to be estimated. To use the Lande and Thompson (1990) approach the matrices Var(Y), Cov(Y, S), Var(S), Var(Z), and Cov(Z, S) must be computed; a detailed discussion on the different ways of estimating these matrices is given in Whittaker (2001), but here we give only key details.

The estimation of the variance matrices Var(Y), Var(S), and Var(Z) is straightforward. Var(Y) is estimated directly by the empirical covariance matrix of the phenotypic information and Var(S) by

$$\mathrm{Var}(\hat{S}). \qquad (11)$$

For the computation of Var(Z) it is usual to assume that all variance explained by the marker scores $s_i$ is genetic variance. Var(Z) can therefore be estimated by Var(S). Note that this tends to overestimate the genetic variance. The main problem in the Lande and Thompson approach (9) is the estimation of Cov(Z, S). Using the same arguments used for the estimation of Var(Z) we can justify estimating Cov(Z, S) by Cov(Y, S). However, since all model selection criteria tend to reduce the residual variation in the data set to some extent, the way we choose the subsets $\mathcal{A}_j$ of predictor/marker variables will always influence Cov(Y, S). Thus estimation of Cov(Z, S) by Cov(Y, S) will tend to overestimate the true covariance matrix. Whittaker et al. (1997) developed an approach based on cross-validation to tackle this problem, but even then estimation is quite poor.

Now consider our new approach. To compute the weight matrices $\tilde{P}$ and $\tilde{G}$ of approach (10) we have to estimate Var(Y), Var(Z), Cov(Y, X), Var(X), and Cov(Z, X). Note that we estimate matrix Var(Z) by Var(S) as for the Lande and Thompson approach. Although there are nonparametric ways to estimate Var(Z) (Falconer and Mackay 1997), our experience with simulation experiments suggests that preference should be given to estimation via marker scores (Lange 2000). The matrices Var(Y), Var(X), and Cov(Y, X) are estimated directly by their empirical variances and covariances. To estimate Cov(Z, X) we assume that there is no

expect it to be substantially better. In the next section we compare the two methods by simulation experiments.

SIMULATION EXPERIMENTS

Simulation experiment when more records than markers are given: To compare the performance of marker-assisted selection based on the Lande and Thompson approximation (9) with marker-assisted selection based on our approximation (10), we simulate 20 chromosomes of an F2 generation from two inbred parental lines and distribute 11 markers uniformly over each chromosome. Each marker interval is 10 cM in length with a total chromosome length of 100 cM. We generate two Gaussian traits, each influenced by 23 QTL, so that the genome contains 46 QTL in total. QTL locations are obtained by drawing random samples from a uniform distribution. As Lande and Thompson (1990) suggested, we compute the QTL effects $a_{jk}$ using geometric series and choose the effects of the QTL so that the total heritability of each trait is 0.2.

We use a population size of 600 individuals, with index $a = (1, 1)^T$, and assume that there is no environmental correlation (i.e., the $\varepsilon_{ij}$ in Equation 4 are independent). We run the index selection for 20 generations and repeat the experiment 100 times. On the basis of their predicted index values we select the best 20% of the individuals in each generation and mate them randomly to produce the next generation.

Figure 1.—Response to marker-assisted selection for environmental correlation 0.0, heritability 0.2, and in the absence of estimation error.

We conduct the simulation experiment twice. First
we use the true empirical covariance matrices Cov(Y, S) and Var(Z) in (9) and (10). This gives the performance of the two approaches in the absence of estimation error. Second, the marker scores $s_i$ are predicted by linear regression of the phenotypic information on the marker information. For the Lande and Thompson approach the covariance matrix Cov(Y, S) is estimated by the cross-validation method proposed by Whittaker et al. (1997). All parameters of the new approach and of the Lande and Thompson approach are reestimated in each generation, using the new marker and phenotypic data. On the basis of these estimates the indices are also recalculated in each generation.

The mean responses for the first and second simulation experiments are shown in Figures 1 and 2, respectively, with the Lande and Thompson approach (9) denoted by "L&T MAS" and approach (10) by "Opt MAS" in each case. Classical selection based only on phenotypic information is denoted by "Pheno." In the absence of estimation error, the plots show a clear superiority of approach (10) over the Lande and Thompson approach (9), with both being superior to classical phenotypic selection. Similar results are obtained in the presence of estimation error (Figure 2). The performance of all three methods is reduced, but the ordering of the methods is unaffected.

Figure 3.—Response to marker-assisted selection for environmental correlation 0.0, heritability 0.05, and sample size 50 in the absence of estimation error.

Simulation experiment when more markers than records are given: We repeat the previous simulation experiment with sample size 50, 10 chromosomes of length 1 M, and 11 markers spaced uniformly over each chromosome. The total heritability is assumed to be 0.05. Since there are more markers (110) than records (50),
the empirical variance matrix of the markers is now singular and we have to compute the generalized inverse matrix of $\tilde{P}$ instead of the standard inverse. Standard theory (Searle 1971, 1982) implies that, although the estimated index weights will depend on the particular choice of the generalized inverse, the predicted index value $I_i$ will not. No other changes to the method are needed.

For the simulation study without estimation error the plots of the mean responses are shown in Figure 3. Figure 4 shows the same plots in the presence of estimation error.

In the absence of estimation error, the overall ordering of the methods is maintained, although response to selection is lower than in the previous section. However, in the presence of estimation error the response to our approach is reduced far more than the response to the Lande and Thompson approach. For the first five generations the Lande and Thompson approach even performs slightly better than our approach.

DISCUSSION

In this article we propose a new approximation for the prediction of genetic values that has theoretical advantages over the standard approach. We have shown that our new approach is also applicable when the sample size is smaller than or equal to the number of markers. The marker variance matrix Var(X) can be estimated by the empirical variance matrix even when the sample size is smaller than the number of markers. However, the empirical variance matrix will then be singular and instead of the standard inverse the generalized inverse matrix has to be computed. When the sample size is substantially smaller than the number of markers, the empirical variance matrix might be a poor estimate and alternative estimators may be considered; e.g., since we start with an F2 generation of two inbred lines, the marker variance matrix can also be computed analytically when the marker interval lengths are known. Alternatively the marker variance matrix might also be estimated by bootstrapping or Monte Carlo simulation experiments. These issues will be a topic of further research.

The theoretical advantages of the new approach are confirmed by our simulation experiments, where the new approximation clearly outperforms both the classical Lande and Thompson approach and phenotypic selection when the number of records exceeds the number of markers to be included in the index. We also performed simulation studies for smaller sample sizes, more traits, different heritabilities, and nonzero environmental correlation (Lange 2000). In all simulation studies we observed the same pattern. The superiority of the new approach over both the classical Lande and Thompson approach and phenotypic selection was always substantial, provided the number of records exceeds the number of markers. We assumed that the differences between the MAS approaches would be largest when the heritability of the traits is low or dense marker maps are given, since in these cases modeling the marker scores $s_i$ is more difficult (e.g., because of the model selection problem for the marker scores). This assumption was supported by the simulation experiments in the absence of estimation error (Figures 1 and 3). However, in the presence of estimation error the advantages of the new approach partly vanish (Figures 2 and 4). In particular, where the number of records is less than the number of markers our approach can be inferior to the standard two-stage method in early generations. This indicates the need for more sophisticated estimation methods than the ones used for the new approach here.

In practice, selection solely on individual phenotype, as described here, is seldom used. Rather, phenotypic information on relatives would be incorporated, either via selection indices or, more usually, by the use of BLUP breeding value estimates; this of course reduces the advantage of MAS. We used phenotypic selection here solely to provide a point of reference for the MAS results, since our focus was on the comparison of the MAS approaches.

We also assumed that all markers are typed in all generations. However, it would be possible to reduce the typing cost by selecting a subset of markers in the F2 generation and genotyping only these markers in all subsequent generations. Then the advantages of the new methods will be slightly reduced but still be of practical relevance.

Finally, note that our approximation exploits the whole marker map. The variance of the predicted values can therefore be calculated by

$$\mathrm{Var}(\hat{I}) = a^T \tilde{G} \tilde{P}^{-1} \tilde{G}^T a,$$

which allows the performance of different marker maps to be compared. The value of adding additional markers to the map can thus be investigated and the optimal marker spacing determined. For the Lande and Thompson approach the variance of the predicted values $\mathrm{Var}(\hat{I})$ is more difficult to compute, since the variance of the predicted values is influenced by the selection of markers used for the prediction of the marker scores and these marker subsets are unknown when the experiment is designed.

In conclusion, marker-assisted selection based on approximation (10) for the prediction of the genetic values has a number of advantages over the standard approach by Lande and Thompson.

We thank two referees for their constructive comments on an earlier draft of this article. This research was supported in part by the Biotechnology and Biological Sciences Research Council (United Kingdom) and in part by grant MH59532 of The National Institutes of Health.

LITERATURE CITED

Brockwell, P. J., and R. A. Davis, 1991 Time Series: Theory and Methods. Springer-Verlag, Berlin/Heidelberg/New York.

Falconer, D. S., and T. F. C. Mackay, 1997 Introduction to Quantitative Genetics. Longman, New York.

Gimelfarb, A., and R. Lande, 1994 Simulation of marker assisted selection for non-additive traits. Genet. Res. 64: 127–136.

Haley, C. S., and S. A. Knott, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69: 315–324.

Hospital, F., L. Moreau, F. Lacourde, A. Charcosset and A. Gallais, 1997 More on the efficiency of marker-assisted selection. Theor. Appl. Genet. 95: 1181–1189.

Jansen, R. C., 2001 Quantitative trait loci mapping in inbred lines, pp. 567–599 in Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop and C. Cannings. Wiley, New York.

Lande, R., and R. Thompson, 1990 Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124: 743–756.

Lange, C., 2000 Generalized estimating equation methods in statistical genetics. Ph.D. Thesis, Department of Applied Statistics, The University of Reading, United Kingdom.

Searle, S. R., 1982 Matrix Algebra Useful for Statistics. Wiley, New York.

Smith, C., and S. P. Simpson, 1986 The use of polymorphisms in live stock improvement. J. Anim. Breed. Genet. 103: 205–217.

Whittaker, J. C., 2001 Marker assisted selection and introgression, pp. 673–695 in Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop and C. Cannings. Wiley, New York.

Whittaker, J. C., R. N. Curnow, C. S. Haley and R. Thompson, 1995 Using marker-maps in marker-assisted selection. Genet. Res. 66: 255–265.

Whittaker, J. C., C. S. Haley and R. Thompson, 1997 Optimal weighting of information in marker-assisted selection. Genet. Res. 69: 137–144.

Zeng, Z-B., 1994 Precision mapping of quantitative trait loci. Genetics 136: 1457–1468.

Zhang, W., and C. Smith, 1992 Computer simulation of marker-assisted selection utilizing linkage disequilibrium. Theor. Appl. Genet. 83: 813–820.