On Prediction of Genetic Values in Marker-Assisted Selection
Christoph Lange*,† and John C. Whittaker†
*Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115 and †School of Applied Statistics, University of Reading, Reading RG6 6FN, United Kingdom
Manuscript received August 31, 2000 Accepted for publication September 5, 2001
ABSTRACT
We suggest a new approximation for the prediction of genetic values in marker-assisted selection. The new approximation is compared to the standard approach. It is shown that the new approach will often provide substantially better prediction of genetic values; furthermore the new approximation avoids some of the known statistical problems of the standard approach. The advantages of the new approach are illustrated by a simulation study in which the new approximation outperforms both the standard approach and phenotypic selection.
Corresponding author: John C. Whittaker, Department of Epidemiology and Public Health, Imperial College School of Medicine, St. Mary's Campus, Norfolk Place, London W2 1PG, United Kingdom.

MARKER-ASSISTED selection (MAS), like many quantitative trait loci (QTL)-mapping techniques (Haley and Knott 1992; Zeng 1994; Jansen 2001), exploits the linkage disequilibrium between markers and QTL produced when inbred lines are crossed. However, in MAS we do not aim to map the QTL and estimate their positions and effect sizes. The goal of marker-assisted selection is the improvement of certain traits in breeding programs for plants or animals. We therefore want to predict the genetic values $z$ for certain traits of each individual and select the best individuals according to their genetic values for further breeding.

Considerable literature on this topic now exists, reviewed in Whittaker (2001). The key conclusion is that, since marker-assisted selection employs both marker and phenotypic information for the prediction of the genetic values, it can outperform selection based solely on phenotypic information, especially when the sample size is large (Gimelfarb and Lande 1994; Whittaker et al. 1995; Hospital et al. 1997).

The standard approach to marker-assisted selection for inbred lines, due to Lande and Thompson (1990), is based on a two-stage procedure. First, phenotypes are regressed on marker information to give a prediction of the genetic value, known as the marker score, for each individual. This score is then combined with phenotypic information using a selection index to give a final prediction of genetic merit.

Here we introduce an alternative single-stage procedure, which essentially treats information at each individual marker as a separate trait. Thus all marker information can be entered, together with phenotypic information, into a single selection index, which is then used to predict the genetic merit. This was first suggested for a single marker and trait by Smith and Simpson (1986); here we describe a method suitable for multiple markers and multiple traits. The approach is computationally simpler than the standard approach by Lande and Thompson (1990) and simulation studies show that the new approach can give substantial improvements in selection response. Further, the new approach can be seen as a natural extension of the method proposed by Smith and Simpson (1986).

METHODS

We start with $n$ individuals from an F2 generation derived by crossing two inbred parental lines. For each individual we record phenotypes at $m$ traits to give the vector $y_i = (y_{i1}, \ldots, y_{im})^T$, $i = 1, \ldots, n$, and also a set of $p$ marker values $x_i = (x_{i1}, \ldots, x_{ip})$. We assume that in every individual the same genetic markers are typed. $Y = (Y_{ij}) \in \mathbb{R}^{n \times m}$ is the matrix of all phenotypic values and $Z = (Z_{ij}) \in \mathbb{R}^{n \times m}$ the corresponding matrix of the unobservable genetic values. For all loci, both markers and QTL, we denote loci homozygous for the allele from the first parental line by $-1$, loci homozygous for the allele from the second parental line by $+1$, and heterozygotes by $0$.

Suppose we wish to improve all $m$ traits simultaneously in the breeding program. We can rank and select individuals for further breeding only when the genetic "values" of the individuals can be characterized by a single scalar value and not by an $m$-dimensional vector. Hence one has to combine the distinct trait values and define an overall genetic "value" for each individual.

This is typically done by computing a linear index $I = (I_1, \ldots, I_n)^T$, where $I_i$ is the value for the $i$th individual and is determined by the index vector $a = (a_1, \ldots, a_m)^T \in \mathbb{R}^m$, where $a_j$ typically describes the economic value of the corresponding trait (Falconer
and Mackay 1997). The scalar index value $I_i$ for the $i$th individual is then computed by

$$I_i = a^T Z_i \tag{1}$$

with $Z_i = (Z_{i1}, \ldots, Z_{im})^T$.

The individuals with the largest index values are selected for further breeding. We therefore have to predict the index vector $I$ on the basis of the phenotypic information $y$ and the marker information $\{x_1, \ldots, x_n\}$:

$$E(I_i \mid y_i, x_i) = a^T E(Z_i \mid y_i, x_i). \tag{2}$$

This linear structure of the index $I_i$ means that the problem of predicting $E(I_i \mid y_i, x_i)$ is reduced to the problem of predicting $E(Z_i \mid y_i, x_i)$. We assume that the unobservable genetic value $Z_{ij}$ is given by

$$Z_{ij} = \sum_{k=1}^{K_j} a_{jk} G_{ik} \tag{3}$$

and the phenotypic value $Y_{ij}$ by

$$Y_{ij} = Z_{ij} + \varepsilon_{ij} \tag{4}$$

with $G_{ik}$ the genetic QTL score of the $k$th QTL in the $i$th individual, $a_{jk}$ the additive effect of the $k$th QTL on the $j$th trait, $K_j$ the number of QTL acting on the $j$th trait, and $\varepsilon_{ij}$ the environmental errors. Note that the environmental errors for different traits in the same individual may be correlated, but we assume that the environmental errors for different individuals are independent. Thus $Z_{ij}$ depends on the unknown QTL scores $G_{ik}$, the unknown additive effects $a_{jk}$, and on the unknown numbers of QTL $K_j$ acting on each trait. The exact computation of $E(Z_i \mid y_i, x_i)$ is consequently difficult (Whittaker et al. 1995), and so a linear approximation of $E(Z_i \mid y_i, x_i)$ first suggested by Lande and Thompson (1990) is often used. Instead of conditioning on the phenotypic information $y_i$ and the marker information $x_i$, they tried to approximate the conditional expectation of $Z_i$ given the phenotypic information $y_i$ and the marker score $s_i$ by

$$E(Z_i \mid y_i, x_i) \approx E(Z_i \mid y_i, s_i) \tag{5}$$

with the multivariate marker score $s_i = (s_{i1}, \ldots, s_{im})^T$ defined by

$$s_{ij} = \beta_{0j} + \sum_{k \in \mathcal{A}_j} \beta_{kj} x_{ik}, \qquad \mathcal{A}_j \subset \{1, \ldots, p\}, \tag{6}$$

where $\mathcal{A}_j$ is a subset of the recorded markers that is assumed to model the QTL effects of the $j$th trait. There are a number of ways to get the marker subsets $\mathcal{A}_j$, including model selection strategies such as the Akaike information criterion, Bayesian information criterion, or Mallows' $C_p$ (Whittaker 2001). The coefficients $\beta_{0j}$ and $\beta_{kj}$ are estimated by regressing the phenotype on the marker information.

The best linear predictor $P^{Z_i}_{Y_i, S_i}$ of $E(Z_i \mid y_i, s_i)$ conditional on the phenotypic information $y_i$ and the marker score $s_i = (s_{i1}, \ldots, s_{im})^T$ is then given by

$$E(Z_i \mid y_i, s_i) \approx P^{Z_i}_{Y_i, S_i} = (B_0 \mid B_1) \begin{pmatrix} y_i \\ s_i \end{pmatrix} = B_0 y_i + B_1 s_i \tag{7}$$

(Brockwell and Davis 1991). This is the standard selection index theory (Falconer and Mackay 1997), with matrices $B_0 \in \mathbb{R}^{m \times m}$ and $B_1 \in \mathbb{R}^{m \times m}$,

$$B = (B_0 \;\; B_1) = G P^{-1} \in \mathbb{R}^{m \times 2m},$$

where

$$P = \begin{pmatrix} \mathrm{Var}(Y) & \mathrm{Cov}(Y, S) \\ \mathrm{Cov}(S, Y) & \mathrm{Var}(S) \end{pmatrix}, \qquad G = \bigl(\mathrm{Var}(Z) \;\; \mathrm{Cov}(Z, S)\bigr),$$

with $S = (S_{ij}) \in \mathbb{R}^{n \times m}$.

We now show how the two-stage Lande and Thompson procedure (5) described above, with $s_i$ calculated by regression of $y_i$ on $x_i$ and then $y_i$ and $s_i$ combined to predict the genetic value $z_i$, can be replaced by a single-step procedure. Instead of the separate steps we compute directly the best linear predictor $P^{Z_i}_{Y_i, X_i}$ of $E(Z_i \mid y_i, x_i)$ conditional on the phenotypic information $y_i$ and the complete set of marker information $x_i$ by

$$E(Z_i \mid y_i, x_i) \approx P^{Z_i}_{Y_i, X_i} = (\tilde{B}_0 \mid \tilde{B}_1) \begin{pmatrix} y_i \\ x_i \end{pmatrix} = \tilde{B}_0 y_i + \tilde{B}_1 x_i \tag{8}$$

with

$$\tilde{B} = (\tilde{B}_0 \;\; \tilde{B}_1) = \tilde{G} \tilde{P}^{-1} \in \mathbb{R}^{m \times (m+p)}, \qquad \tilde{P} = \begin{pmatrix} \mathrm{Var}(Y) & \mathrm{Cov}(Y, X) \\ \mathrm{Cov}(X, Y) & \mathrm{Var}(X) \end{pmatrix}, \qquad \tilde{G} = \bigl(\mathrm{Var}(Z) \;\; \mathrm{Cov}(Z, X)\bigr).$$

This essentially sets up a selection index including all markers and phenotypes, where each marker is treated as a separate trait. To compare the Lande and Thompson approximation (7) with the approximation proposed in (8), we consider the spaces in which the best linear predictors are computed.
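As a numerical illustration of the comparison between (7) and (8), the following is a minimal sketch, not the authors' implementation. It computes in-sample best linear predictors from empirical covariances, with the true genetic values treated as known (that is, no estimation error). The toy data, the independent marker genotypes, the single trait, the arbitrary marker subset, and the helper name `blp_fit` are all assumptions made for illustration.

```python
import numpy as np

def blp_fit(target, predictors):
    """In-sample best linear predictor of `target` given `predictors`.

    Weights are B = G P^+, with P the empirical Var(predictors) and
    G the empirical Cov(target, predictors). Returns (B, mean squared error).
    """
    t = target - target.mean(axis=0)
    u = predictors - predictors.mean(axis=0)
    n = u.shape[0]
    P = u.T @ u / (n - 1)
    G = t.T @ u / (n - 1)
    B = G @ np.linalg.pinv(P)          # pseudo-inverse also covers a singular P
    mse = float(np.mean((t - u @ B.T) ** 2))
    return B, mse

rng = np.random.default_rng(1)
n, p = 600, 20                         # records and markers (toy sizes)

# ASSUMPTION: independent markers coded -1/0/+1 with F2 genotype frequencies.
X = rng.choice([-1.0, 0.0, 1.0], size=(n, p), p=[0.25, 0.5, 0.25])

# ASSUMPTION: one trait; genetic value = marker-linked part + unlinked part.
effects = rng.normal(size=(p, 1))
Z = X @ effects + rng.normal(size=(n, 1))
Y = Z + rng.normal(scale=2.0 * Z.std(), size=(n, 1))   # Equation 4: Y = Z + error

# Marker score as in Equation 6, using an arbitrary (hypothetical) subset.
subset = [0, 1, 2, 3, 4]
design = np.column_stack([np.ones(n), X[:, subset]])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
S = design @ beta

_, mse_two_stage = blp_fit(Z, np.hstack([Y, S]))          # predictor of type (7)
B_single, mse_single = blp_fit(Z, np.hstack([Y, X]))      # predictor of type (8)
print(mse_two_stage, mse_single)
```

Because the marker score is a linear combination of the markers, the predictor built on phenotype and all markers cannot have a larger in-sample mean square error than the one built on phenotype and marker score in this sketch.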
Because of the definition of $S_i$ (Equation 6) it is obvious that

$$S_{ij} \in \operatorname{span}(X_{i1}, \ldots, X_{ip})$$

and consequently

$$\operatorname{span}(Y_i, S_{i1}, \ldots, S_{im}) \subset \operatorname{span}(Y_i, X_{i1}, \ldots, X_{ip})$$

with $X_i = (X_{i1}, \ldots, X_{ip})$. Thus the best linear predictor of $E(Z_i \mid y_i, s_i)$ is always computed in a subspace of the space in which the best linear predictor of $E(Z_i \mid y_i, x_i)$ is computed (Equation 6). Therefore the approximation proposed by Lande and Thompson (1990) can never outperform approximation (8). Since the number of markers $p$ can be substantially greater than the number of traits $m$, the dimension of the subspace $\operatorname{span}(Y_i, S_{i1}, \ldots, S_{im})$ may be much smaller than the dimension of $\operatorname{span}(Y_i, X_{i1}, \ldots, X_{ip})$, and so the best linear predictor $P^{Z_i}_{Y_i, S_i}$ computed in the subspace may have a much greater mean square error than the best linear predictor $P^{Z_i}_{Y_i, X_i}$ computed in the entire span of $Y_i$ and $X_i$. This suggests that the prediction of the genetic values suggested by Lande and Thompson (1990) might have a substantially bigger mean square error than prediction (8). On the other hand, Appendix III of Lande and Thompson (1990) shows that the two predictors should be equivalent in the special case of large population sizes and linkage equilibrium between all markers. Crucially, however, this result does not hold when markers are in linkage disequilibrium, as is the case for the populations considered here, and our simulation results show a substantial advantage for approximation (8).

Finally, note that a further drawback of the Lande and Thompson approach is the appearance of the marker score $S_i$ in the approximation (7). The marker score $S_i$ is unknown and has to be predicted by Equation 6. Since the marker subsets $\mathcal{A}_j$ are not known, they have to be obtained by model selection strategies (Whittaker 2001). There is often considerable noise associated with this model selection, so that the linear approximation (8), which avoids this step, might be expected to give more stable predictions. This might also suggest using the Lande and Thompson approach without the model selection step. However, this is known to perform poorly in most situations (e.g., Zhang and Smith 1992; Gimelfarb and Lande 1994; Whittaker et al. 1995), and we found by simulation that the same was true here (results not shown).

Now we can apply either the linear approximation (7) of $E(Z_i \mid y_i, s_i)$ or (8) of $E(Z_i \mid y_i, x_i)$ to the index prediction problem (9) and predict the index value $I_i$ of the $i$th individual; e.g., for approximation (7)

$$E(I_i \mid y_i, x_i) \approx a^T (B_0 y_i + B_1 s_i) \tag{9}$$

and for approximation (8)

$$E(I_i \mid y_i, x_i) \approx a^T (\tilde{B}_0 y_i + \tilde{B}_1 x_i). \tag{10}$$

For either approach a number of matrices have to be estimated. To use the Lande and Thompson (1990) approach the matrices Var(Y), Cov(Y, S), Var(S), Var(Z), and Cov(Z, S) must be computed; a detailed discussion on the different ways of estimating these matrices is given in Whittaker (2001), but here we give only key details.

The estimation of the variance matrices Var(Y), Var(S), and Var(Z) is straightforward. Var(Y) is estimated directly by the empirical covariance matrix of the phenotypic information and Var(S) by

$$\mathrm{Var}(\hat{S}). \tag{11}$$

For the computation of Var(Z) it is usual to assume that all variance explained by the marker scores $s_i$ is genetic variance. Var(Z) can therefore be estimated by Var(S). Note that this tends to overestimate the genetic variance.

The main problem in the Lande and Thompson approach (9) is the estimation of Cov(Z, S). Using the same arguments used for the estimation of Var(Z) we can justify estimating Cov(Z, S) by Cov(Y, S). However, since all model selection criteria tend to reduce the residual variation in the data set to some extent, the way we choose the subsets $\mathcal{A}_j$ of predictor/marker variables will always influence Cov(Y, S). Thus estimation of Cov(Z, S) by Cov(Y, S) will tend to overestimate the true covariance matrix. Whittaker et al. (1997) developed an approach based on cross-validation to tackle this problem, but even then estimation is quite poor.

Now consider our new approach. To compute the weight matrices $\tilde{P}$ and $\tilde{G}$ of approach (10) we have to estimate Var(Y), Var(Z), Cov(Y, X), Var(X), and Cov(Z, X). Note that we estimate matrix Var(Z) by Var(S) as for the Lande and Thompson approach. Although there are nonparametric ways to estimate Var(Z) (Falconer and Mackay 1997), our experience with simulation experiments suggests that preference should be given to estimation via marker scores (Lange 2000). The matrices Var(Y), Var(X), and Cov(Y, X) are estimated directly by their empirical variances and covariances. To estimate Cov(Z, X) we assume that there is no ...
Figure 1.—Response to marker-assisted selection for environmental correlation 0.0, heritability 0.2, and in the absence of estimation error.
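The plug-in estimation described for the single-stage index can be sketched as follows. This is a hedged illustration under stated assumptions rather than the authors' code. Var(Y), Var(X), and Cov(Y, X) are empirical, and Var(Z) is replaced by the variance of fitted marker scores. Two further substitutions are assumptions made here: Cov(Z, Y) is taken equal to Var(Z), and Cov(Z, X) equal to Cov(Y, X), which would follow from Equation 4 if the environmental errors are uncorrelated with genetic values and markers. The function name `single_stage_index` and the toy data are hypothetical.

```python
import numpy as np

def single_stage_index(Y, X, S_hat, a):
    """Predicted index values a^T (B0~ y_i + B1~ x_i), as deviations from the mean.

    Y: (n, m) phenotypes, X: (n, p) marker codes, S_hat: (n, m) fitted
    marker scores, a: (m,) index weights.
    """
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    Sc = S_hat - S_hat.mean(axis=0)

    var_y = Yc.T @ Yc / (n - 1)        # empirical Var(Y)
    var_x = Xc.T @ Xc / (n - 1)        # empirical Var(X)
    cov_yx = Yc.T @ Xc / (n - 1)       # empirical Cov(Y, X)
    var_z = Sc.T @ Sc / (n - 1)        # Var(Z) estimated by Var(S^)
    cov_zx = cov_yx                    # ASSUMPTION: Cov(Z, X) = Cov(Y, X)

    P = np.block([[var_y, cov_yx], [cov_yx.T, var_x]])
    G = np.hstack([var_z, cov_zx])     # ASSUMPTION: Cov(Z, Y) = Var(Z)
    B = G @ np.linalg.pinv(P)          # equals G P^{-1} when P is invertible

    z_hat = np.hstack([Yc, Xc]) @ B.T  # predicted genetic values, (n, m)
    return z_hat @ a                   # predicted index values, (n,)

# Toy usage with hypothetical data: 60 records, 8 markers, 2 traits.
rng = np.random.default_rng(0)
n, p, m = 60, 8, 2
X = rng.choice([-1.0, 0.0, 1.0], size=(n, p), p=[0.25, 0.5, 0.25])
Y = X[:, :3] @ rng.normal(size=(3, m)) + rng.normal(size=(n, m))

# Marker scores from an ordinary regression on an arbitrary marker subset.
design = np.column_stack([np.ones(n), X[:, :3]])
S_hat = design @ np.linalg.lstsq(design, Y, rcond=None)[0]

index = single_stage_index(Y, X, S_hat, a=np.array([1.0, 1.0]))
ranking = np.argsort(-index)           # individuals ordered by predicted index
```

The predictions are centered, which does not affect the ranking of individuals. The pseudo-inverse is used only so that the sketch also runs when the empirical matrix is singular.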
... expect it to be substantially better. In the next section we compare the two methods by simulation experiments.

SIMULATION EXPERIMENTS

Simulation experiment when more records than markers are given: To compare the performance of marker-assisted selection based on the Lande and Thompson approximation (9) with marker-assisted selection based on our approximation (10), we simulate 20 chromosomes of an F2 generation from two inbred parental lines and distribute 11 markers uniformly over each chromosome. Each marker interval is 10 cM in length with a total chromosome length of 100 cM. We generate two Gaussian traits, each influenced by 23 QTL, so that the genome contains 46 QTL in total. QTL locations are obtained by drawing random samples from a uniform distribution. As Lande and Thompson (1990) suggested, we compute the QTL effects $a_{jk}$ using geometric series and choose the effects of the QTL so that the total heritability of each trait is 0.2.

We use a population size of 600 individuals, with index $a = (1, 1)^T$, and assume that there is no environmental correlation (i.e., the $\varepsilon_{ij}$ in Equation 4 are independent). We run the index selection for 20 generations and repeat the experiment 100 times. On the basis of their predicted index values we select the best 20% of the individuals in each generation and mate them randomly to produce the next generation.

We conduct the simulation experiment twice. First we use the true empirical covariance matrices Cov(Y, S) and Var(Z) in (9) and (10). This gives the performance of the two approaches in the absence of estimation error. Second, the marker scores $s_i$ are predicted by linear regression of the phenotypic information on the marker information. For the Lande and Thompson approach the covariance matrix Cov(Y, S) is estimated by the cross-validation method proposed by Whittaker et al. (1997). All parameters of the new approach and of the Lande and Thompson approach are reestimated in each generation, using the new marker and phenotypic data. On the basis of these estimates the indices are also recalculated in each generation.

The mean responses for the first and second simulation experiments are shown in Figures 1 and 2, respectively, with the Lande and Thompson approach (9) denoted by "L&T MAS" and approach (10) by "Opt MAS" in each case. Classical selection based only on phenotypic information is denoted by "Pheno." In the absence of estimation error, the plots show a clear superiority of approach (10) over the Lande and Thompson approach (9), with both being superior to classical phenotypic selection. Similar results are obtained in the presence of estimation error (Figure 2). The performance of all three methods is reduced, but the ordering of the methods is unaffected.

Figure 3.—Response to marker-assisted selection for environmental correlation 0.0, heritability 0.05, and sample size 50 in the absence of estimation error.

Simulation experiment when more markers than records are given: We repeat the previous simulation experiment with sample size 50, 10 chromosomes of length 1 M, and 11 markers spaced uniformly over each chromosome. The total heritability is assumed to be 0.05. Since there are more markers (110) than records (50),
the empirical variance matrix of the markers is now singular and we have to compute the generalized inverse matrix of $\tilde{P}$ instead of the standard inverse. Standard theory (Searle 1971, 1982) implies that, although the estimated index weights will depend on the particular choice of the generalized inverse, the predicted index value $I_i$ will not. No other changes to the method are needed.

For the simulation study without estimation error the plots of the mean responses are shown in Figure 3. Figure 4 shows the same plots in the presence of estimation error.

In the absence of estimation error, the overall ordering of the methods is maintained, although response to selection is lower than in the previous section. However, in the presence of estimation error the response to our approach is reduced far more than the response to the Lande and Thompson approach. For the first five generations the Lande and Thompson approach even performs slightly better than our approach.

DISCUSSION

In this article we propose a new approximation for the prediction of genetic values that has theoretical advantages over the standard approach. We have shown that our new approach is also applicable when the sample size is smaller than or equal to the number of markers. The marker variance matrix Var(X) can be estimated by the empirical variance matrix even when the sample size is smaller than the number of markers. However, the empirical variance matrix will then be singular and instead of the standard inverse the generalized inverse matrix has to be computed. When the sample size is substantially smaller than the number of markers, the empirical variance matrix might be a poor estimate and alternative estimators may be considered; e.g., since we start with an F2 generation of two inbred lines, the marker variance matrix can also be computed analytically when the marker interval lengths are known. Alternatively the marker variance matrix might also be estimated by bootstrapping or Monte Carlo simulation experiments. These issues will be a topic of further research.

The theoretical advantages of the new approach are confirmed by our simulation experiments where the new approximation clearly outperforms both the classical Lande and Thompson approach and phenotypic selection when the number of records exceeds the number of markers to be included in the index. We also performed simulation studies for smaller sample sizes, more traits, different heritabilities, and nonzero environmental correlation (Lange 2000). In all simulation studies we observed the same pattern. The superiority of the new approach over both the classical Lande and Thompson approach and phenotypic selection was always substantial, provided the number of records exceeded the number of markers. ... the differences between the MAS approaches will be largest when the heritability of the traits is low or dense marker maps are given, since in these cases modeling the marker scores $s_i$ is more difficult, e.g., a model selection problem for marker scores. This assumption was supported by the simulation experiments in the absence of estimation error (Figures 1 and 3). However, in the presence of estimation error the advantages of the new approach partly vanish (Figures 2 and 4). In particular, where the number of records is less than the number of markers our approach can be inferior to the standard two-stage method in early generations. This indicates the need for more sophisticated estimation methods than the ones used for the new approach here.

In practice, selection solely on individual phenotype, as described here, is seldom used. Rather, phenotypic information on relatives would be incorporated, either via selection indices or, more usually, by the use of BLUP breeding value estimates; this of course reduces the advantage of MAS. We used phenotypic selection here solely to provide a point of reference for the MAS results, since our focus was on the comparison of the MAS approaches.

We also assumed that all markers are typed in all generations. However, it would be possible to reduce the typing cost by selecting a subset of markers in the F2 generation and genotyping only these markers in all subsequent generations. Then the advantages of the new methods will be slightly reduced but still be of practical relevance.

Finally, note that our approximation exploits the whole marker map. The variance of the predicted values can therefore be calculated by ... which allows the performance of different marker maps to be compared. The value of adding additional markers to the map can thus be investigated and the optimal marker spacing determined. For the Lande and Thompson approach the variance of the predicted values $\mathrm{Var}(\hat{I})$ is more difficult to compute, since the variance of the predicted values is influenced by the selection of markers used for the prediction of the marker scores and these marker subsets are unknown when the experiment is designed.

In conclusion, marker-assisted selection based on approximation (10) for the prediction of the genetic values has a number of advantages over the standard approach by Lande and Thompson.

We thank two referees for their constructive comments on an earlier draft of this article. This research was supported in part by the Biotechnology and Biological Sciences Research Council (United Kingdom) and in part by grant MH59532 of The National Institutes of Health.

LITERATURE CITED

Brockwell, P. J., and R. A. Davis, 1991 Time Series: Theory and Methods. Springer-Verlag, Berlin/Heidelberg/New York.
Falconer, D. S., and T. F. C. Mackay, 1997 Introduction to Quantitative Genetics. Longman, New York.
Gimelfarb, A., and R. Lande, 1994 Simulation of marker assisted selection for non-additive traits. Genet. Res. 64: 127–136.
Haley, C. S., and S. A. Knott, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69: 315–324.
Hospital, F., L. Moreau, F. Lacourde, A. Charcosset and A. Gallais, 1997 More on the efficiency of marker-assisted selection. Theor. Appl. Genet. 95: 1181–1189.
Jansen, R. C., 2001 Quantitative trait loci mapping in inbred lines, pp. 567–599 in Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop and C. Cannings. Wiley, New York.
Lande, R., and R. Thompson, 1990 Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124: 743–756.
Lange, C., 2000 Generalized estimating equation methods in statistical genetics. Ph.D. Thesis, Department of Applied Statistics, The University of Reading, United Kingdom.
Searle, S. R., 1982 Matrix Algebra Useful for Statistics. Wiley, New York.
Smith, C., and S. P. Simpson, 1986 The use of polymorphisms in live stock improvement. J. Anim. Breed. Genet. 103: 205–217.
Whittaker, J. C., 2001 Marker assisted selection and introgression, pp. 673–695 in Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop and C. Cannings. Wiley, New York.
Whittaker, J. C., R. N. Curnow, C. S. Haley and R. Thompson, 1995 Using marker-maps in marker-assisted selection. Genet. Res. 66: 255–265.
Whittaker, J. C., C. S. Haley and R. Thompson, 1997 Optimal weighting of information in marker-assisted selection. Genet. Res. 69: 137–144.
Zeng, Z-B., 1994 Precision mapping of quantitative trait loci. Genetics 136: 1457–1468.
Zhang, W., and C. Smith, 1992 Computer simulation of marker-assisted selection utilizing linkage disequilibrium. Theor. Appl. Genet. 83: 813–820.