Dicussion - A Stochastic Restricted Maximum Likelihood Method for Genomic Selection and Genome-

A novel approach has been presented to enhance the accuracy of prediction in genomic selection and the accuracy of identifying QTL under collinearity in GWAS. BLUP assumes marker effects are randomly assigned from a normal distribution with homogeneous variance, while Bayesian alphabet models assume marker effect variances are heterogeneous. The proposed method SREML is between BLUP and Bayesian alphabet models. Unlike the assumption of BLUP, the novel method assumes a large group of markers have small effects and a small number of markers have extra effects than the remaining markers. This assumption allows we obtain additional information from the data to improve the model performance and obtain the markers with large genetic effects simultaneously. The simu- lation study in genomic selection shows the proposed method outperforms other methods when we used model 3.2.1 to generate phenotypic values.

As discussed above, the procedure of SREML is stochastic and it will be impressed that this algorithm is similar to Bayesian approaches. However, SREML is distinguishing from Bayesian approaches because we don’t expect it will converge to a specific value. The result from each iteration is stochastic even through it is based on the result from the previous iteration. The reason is every element of the u vector is randomly sampled from a Bernoulli distribution with the posterior probability of success ρ, which involves a random variable π. And the value of π is randomly sampled from a Beta distribution and its parameters are calculated by taking advantage of the data information. It implies that the number of nonzero-effect markers is automatically obtained based on the data itself, but not an artificial setting. The unknown parameters are then estimated by the REML

function. Compared with Bayesian approaches, the cluster label vector u of SREML is the only random variable sampled from a distribution. Variance components are obtained according to the current cluster labels. Vector u will never converge. Therefore, the values of variance components vary from iteration to iteration. In addition, all unknown parameters are sampled from distributions in Bayesian approaches. Sometimes, the rate of convergence can be slow, which requires large iteration numbers. However, combining the stochastic step and the REML step in SREML facilitates to obtain good parameter estimates within the comparatively small iteration numbers.

The other contribution of this study is to make the algorithm to be computationally tractable. For each iteration, all markers need to be scanned to update the indicator variable ui. It leads to the computational burden because the inverse and determinant of updated

phenotypic variance Vineed to be calculated for every marker. Due to it, Woodbury matrix

identities and the matrix determinant lemma are applied in the algorithm to reduce the computational efforts.

One application of GWAS is to identify QTL which are markers with large genetic effects. Nowadays, millions of SNPs can be produced as our genetic information, but only few of them are associated with quantitative traits. Therefore, a large amount of information is useless and time-consuming for GWAS. Selecting informative markers can achieve accuracy enhancement and marker interpretability, where SREML, LASSO, and BayesB make contributions on excluding redundant markers. Based on the results above, SREML outperforms LASSO and BayesB under collinearity condition in GWAS. LASSO tends to select unimportant markers than the other two methods, and LASSO and BayesB tends to select fewer of correlated markers than SREML. Since the genotypic values are highly

correlated, the posterior probabilities of ui and uj tends to be close to each other if their

current labels are the same. Consequently, for this case, SREML is superior to LASSO and BayesB.

3.A

Appendix of Chapter 3

A The R code for SREML

SREML<−function ( z , y , w, i t e r n u m ) { n<−ncol ( z ) m<−nrow( z ) x<−matrix ( 1 , n , 1 ) m1<−m∗w m2<−m∗(1−w)

d e l t a<−sample ( c ( rep ( −1 ,m1 ) , rep ( 1 ,m2 ) ) ) p r e d<−NULL d e l t<−NULL max. p r e d<− −1e 1 0 max. d e l t a<−d e l t a max. i t e r<− 0 max. PRESS<−0 f o r ( i t e r i n 1 : i t e r n u m ) {

z1<−z [ which ( d e l t a ==−1),] z2<−z [ which ( d e l t a== 1 ) , ] k1<−crossprod ( z1 ) k2<−crossprod ( z2 ) kk<−l i s t ( k1 , k2 ) p<−length ( kk )

f i t<−mixed ( x , y=y , kk ) #REML method t o e s t i m a t e unknown p a r a m e t e r s f n 0<− − f i t [ [ 1 ] ] $ f n v1<− f i t [ [ 2 ] ] $v1 v2<− f i t [ [ 2 ] ] $v2 ve<− f i t [ [ 2 ] ] $ve parm<−c ( v1 , v2 , ve ) v<−diag ( n ) ∗parm [ p+1] g<−matrix ( 0 , n , n ) vv<−0 f o r ( k i n 1 : p ) { vv<−vv+parm [ k ] g<−g+kk [ [ k ] ] ∗vv } v<−v+g v i<−solve ( v ) H<−g%∗%v i

b l u p 0<−y−x%∗%f i t [ [ 1 ] ] $beta b l u p<−H%∗%b l u p 0

PRESS<−sum ( ( ( blup0−b l u p ) /(1−diag (H) ) ) ˆ 2 ) SS<−sum( ( blup0−mean( b l u p 0 ) ) ˆ 2 ) PRED<−1−PRESS/SS i f (PRED>max. p r e d ) { max. p r e d<−PRED max. d e l t a<−d e l t a max. i t e r<−i t e r max. PRESS<−PRESS max. parm<−parm

max. beta<− f i t [ [ 1 ] ] $beta } b<−b l u p 0 bb<−t ( b )%∗%v i%∗%b xx<−t ( x )%∗%v i%∗%x dv<−u n l i s t ( d e t e r m i n a n t ( v ) ) dv<−dv [ 1 ] a l p h a<−length ( which ( d e l t a ==−1)) beta<−m−a l p h a

w<−rbeta ( 1 , a l p h a+m, beta +0.1∗m)

d e l t<−rbind ( d e l t , c ( i t e r , d e l t a ) ) f o r ( i i n 1 :m) { zk<−as . matrix ( z [ i , ] ) z z<−t ( zk )%∗%v i%∗%zk bz<−t ( b )%∗%v i%∗%zk zb<−t ( bz ) xz<−t ( x )%∗%v i%∗%zk zx<−t ( xz ) xHx<−xx−xz%∗%solve ( zz −1/ ( d e l t a [ i ] ∗v2 ) )%∗%zx bHb<−bb−bz%∗%solve ( zz −1/ ( d e l t a [ i ] ∗v2 ) )%∗%zb z v z<−abs(− d e l t a [ i ] ∗v2∗ z z +1) dx<−u n l i s t ( d e t e r m i n a n t ( xHx ) ) dx<−dx [ 1 ] dv2<−log ( z v z ) f n<− −0.5∗ ( dv+dv2+dx+bHb ) u<−rbinom ( 1 , 1 , w/ (w+(1−w) ∗exp ( d e l t a [ i ] ∗ ( fn0−f n ) ) ) ) d e l t a [ i ]<−1−2∗u } } } mixed<−function ( x , y , kk ) {

l o g l i k e<−function ( parm ) { v<−diag ( n ) ∗parm [ p+1] vv<−0 f o r ( k i n 1 : p ) { vv<−vv+parm [ k ] v<−v+kk [ [ k ] ] ∗vv } v i<−solve ( v ) xx<−t ( x )%∗%v i%∗%x xy<−t ( x )%∗%v i%∗%y b<−solve ( xx , xy ) d1<−u n l i s t ( d e t e r m i n a n t ( v ) ) d1<−d1 [ 1 ] d2<−u n l i s t ( d e t e r m i n a n t ( xx ) ) d2<−d2 [ 1 ] r<−y−x%∗%b q<−t ( r )%∗%v i%∗%r l o g l i k e<− −0.5∗ ( d1+d2+q ) return(− l o g l i k e ) } f i x e d<−function ( parm ) {

v<−diag ( n ) ∗parm [ p+1] g<−matrix ( 0 , n , n ) vv<−0 f o r ( k i n 1 : p ) { vv<−vv+parm [ k ] g<−g+kk [ [ k ] ] ∗vv } v<−v+g v i<−solve ( v ) xx<−t ( x )%∗%v i%∗%x xy<−t ( x )%∗%v i%∗%y covb<−solve ( xx ) beta<−solve ( xx , xy )

yhat<−g%∗%v i%∗%( y−x%∗%beta ) yobs<−y−x%∗%beta

r 2<−cor ( yobs , yhat ) ˆ 2

r e s u l t<−l i s t ( beta , covb , r 2 ) return ( r e s u l t ) } l o g l i k e 0<−function ( x , y ) { xx<−t ( x )%∗%x xy<−t ( x )%∗%y

b<−solve ( xx , xy ) r<−y−x%∗%b s 2<−drop ( t ( r )%∗%( r ) ) / ( n−ncol ( x ) ) v<−diag ( n ) ∗ s 2 v i<−diag ( n ) / s 2 d1<−u n l i s t ( d e t e r m i n a n t ( v ) ) d1<−d1 [ 1 ] d2<−u n l i s t ( d e t e r m i n a n t ( xx ) ) d2<−d2 [ 1 ] q<−t ( r )%∗%v i%∗%r l o g l i k e<− −0.5∗ ( d1+d2+q ) return(− l o g l i k e ) } n<−length ( y ) p<−length ( kk ) f n 0<−l o g l i k e 0 ( x , y ) parm0<−rep ( 1 , p+1)

r e s u l t<−optim ( par=parm0 , f n=l o g l i k e , h e s s i a n = TRUE, method=”L−BFGS−B” , lower=1e −5 ,upper=1e5 )

parm<−r e s u l t $par

conv<−r e s u l t $ c o n v e r g e n c e f n<−r e s u l t $ v a l u e

l r t<−2∗ ( fn0−f n ) h e s s<−r e s u l t $ h e s s i a n covp<− solve ( h e s s ) bb<−f i x e d ( parm ) beta<−bb [ [ 1 ] ] covb<−bb [ [ 2 ] ] r 2<−bb [ [ 3 ] ]

f i x e d<−data . frame ( conv , fn0 , fn , l r t , beta , covb , r 2 ) v1<−parm [ 1 ] v2<−parm [ 2 ] ve<−parm [ 3 ] parm<−data . frame ( v1 , v2 , ve ) r e s u l t<−l i s t ( f i x e d , parm ) return ( r e s u l t ) } B Supplemental tables

Table 3.10 Variance components estimated by SREML and BLUP in Scenario II. Method σ_γ2 σ_S2 σ_L2 σ2_e

SREML - 8.8391 104.9971 1.4338 BLUP 22.2774 - - 1.4377

Bibliography

Allen, D. M. (1971). Mean square error of prediction as a criterion for selecting variables. Technometrics, 13(3):469–475.

Allen, D. M. (1974). The relationship between variable selection and data agumentation and a method for prediction. Technometrics, 16(1):125–127.

Bernardo, R. and Yu, J. (2007). Prospects for genomewide selection for quantitative traits in maize. Crop Science, 47(3):1082–1090.

Celeux, G. (1985). The sem algorithm: a probabilistic teacher algorithm derived from the em algorithm for the mixture problem. Computational statistics quarterly, 2:73–82. Chen, W., Gao, Y., Xie, W., Gong, L., Lu, K., Wang, W., Li, Y., Liu, X., Zhang, H., Dong,

H., et al. (2014). Genome-wide association analyses provide genetic and biochemical insights into natural variation in rice metabolism. Nature genetics, 46(7):714.

Cheng, H., Qu, L., Garrick, D. J., and Fernando, R. L. (2015). A fast and efficient gibbs sampler for bayesb in whole-genome analyses. Genetics Selection Evolution, 47(1):80. Consortium, C. A. D. C. G. et al. (2011). A genome-wide association study in europeans

and south asians identifies five new loci for coronary artery disease. Nature genetics, 43(4):339.

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1):15–18.

Cook, R. D. (1979). Influential observations in linear regression. Journal of the American Statistical Association, 74(365):169–174.

Crossa, J., Pérez-Rodr´ıguez, P., Cuevas, J., Montesinos-López, O., Jarqu´ın, D., de los Campos, G., Burgueño, J., Camacho-González, J. M., Pérez-Elizalde, S., Beyene, Y., et al. (2017). Genomic selection in plant breeding: methods, models, and perspectives. Trends in plant science.

de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D., and Calus, M. P. (2013). Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics, 193(2):327–345.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from in- complete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.

Desta, Z. A. and Ortiz, R. (2014). Genomic selection: genome-wide prediction in plant improvement. Trends in plant science, 19(9):592–601.

Domarkien˙e, I., Pranculis, A., Germanas, ˇS., Jakaitien˙e, A., Vitkus, D., Dˇzenkeviˇci¯ut˙e, V., Kuˇcinskien˙e, Z., and Kuˇcinskas, V. (2013). Rtn4 and fbxl17 genes are associated with coronary heart disease in genome-wide association analysis of lithuanian families. Balkan Journal of Medical Genetics, 16(2):17–22.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. (2004). Least angle regression. The Annals of statistics, 32(2):407–499.

Erbe, M., Hayes, B., Matukumalli, L., Goswami, S., Bowman, P., Reich, C., Mason, B., and Goddard, M. (2012). Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. Journal of dairy science, 95(7):4114–4129.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.

Ghassibe-Sabbagh, M., Haber, M., Salloum, A. K., Al-Sarraj, Y., Akle, Y., Hirbli, K., Romanos, J., Mouzaya, F., Gauguier, D., Platt, D. E., et al. (2014). T2dm gwas in the lebanese population confirms the role of tcf7l2 and cdkal1 in disease susceptibility. Scientific reports, 4:7351.

Gianola, D. and Sch¨on, C.-C. (2016). Cross-validation without doing cross-validation in genome-enabled prediction. G3: Genes, Genomes, Genetics, 6(10):3107–3128.

Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223.

Harris, B., Johnson, D., Spelman, R., et al. (2009). Genomic selection in new zealand and the implications for national genetic evaluation. ICAR Technical Series, (13):325–330. Hayes, B. J., Bowman, P. J., Chamberlain, A., and Goddard, M. (2009). Invited review:

Genomic selection in dairy cattle: Progress and challenges. Journal of dairy science, 92(2):433–443.

Heffner, E. L., Sorrells, M. E., and Jannink, J.-L. (2009). Genomic selection for crop improvement. Crop Science, 49(1):1–12.

Henderson, C. R. (1950). Estimation of genetic parameters. In Biometrics, volume 6, pages 186–187. INTERNATIONAL BIOMETRIC SOC 1441 I ST, NW, SUITE 700, WASHINGTON, DC 20005-2210.

Henderson, C. R. (1963). Selection index and expected genetic advance. Statistical genetics and plant breeding, 982:141–163.

Henderson, C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics, pages 423–447.

Henderson, C. R., Kempthorne, O., Searle, S. R., and Von Krosigk, C. (1959). The estimation of environmental and genetic trends from records subject to culling. Biometrics, 15(2):192–218.

Hoerl, A. E. and Kennard, R. W. (1970a). Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82.

Hoerl, A. E. and Kennard, R. W. (1970b). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Huang, X., Sang, T., Zhao, Q., Feng, Q., Zhao, Y., Li, C., Zhu, C., Lu, T., Zhang, Z., Li, M., et al. (2010). Genome-wide association studies of 14 agronomic traits in rice landraces. Nature genetics, 42(11):961.

Huang, X., Yang, S., Gong, J., Zhao, Y., Feng, Q., Gong, H., Li, W., Zhan, Q., Cheng, B., Xia, J., et al. (2015). Genomic analysis of hybrid rice varieties reveals numerous superior alleles that contribute to heterosis. Nature communications, 6:6258.

Jia, G., Huang, X., Zhi, H., Zhao, Y., Zhao, Q., Li, W., Chai, Y., Yang, L., Liu, K., Lu, H., et al. (2013). A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (setaria italica). Nature genetics, 45(8):957. Kang, H. M., Zaitlen, N. A., Wade, C. M., Kirby, A., Heckerman, D., Daly, M. J., and

Eskin, E. (2008). Efficient control of population structure in model organism association mapping. Genetics, 178(3):1709–1723.

Lee, S. H., van der Werf, J. H., Hayes, B. J., Goddard, M. E., and Visscher, P. M. (2008). Predicting unobserved phenotypes for complex traits from whole-genome snp data. PLoS genetics, 4(10):e1000231.

Legarra, A., Robert-Grani´e, C., Manfredi, E., and Elsen, J.-M. (2008). Performance of genomic selection in mice. Genetics, 180(1):611–618.

Li, Y. R., Li, J., Zhao, S. D., Bradfield, J. P., Mentch, F. D., Maggadottir, S. M., Hou, C., Abrams, D. J., Chang, D., Gao, F., et al. (2015). Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases. Nature medicine, 21(9):1018. Meuwissen, T., Hayes, B., and Goddard, M. (2001). Prediction of total genetic value using

genome-wide dense marker maps. Genetics, 157(4):1819–1829.

Michailidou, K., Lindstr¨om, S., Dennis, J., Beesley, J., Hui, S., Kar, S., Lema¸con, A., Soucy, P., Glubb, D., Rostamianfar, A., et al. (2017). Association analysis identifies 65 new breast cancer risk loci. Nature, 551(7678):92.

P´erez, P. and de Los Campos, G. (2014). Genome-wide regression & prediction with the bglr statistical package. Genetics, pages genetics–114.

Picard, R. R. and Cook, R. D. (1984). Cross-validation of regression models. Journal of the American Statistical Association, 79(387):575–583.

Piepho, H.-P. (2009). Ridge regression and extensions for genomewide selection in maize. Crop Science, 49(4):1165–1176.

Risch, N. and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science, 273(5281):1516–1517.

Sim, X., Ong, R. T.-H., Suo, C., Tay, W.-T., Liu, J., Ng, D. P.-K., Boehnke, M., Chia, K.- S., Wong, T.-Y., Seielstad, M., et al. (2011). Transferability of type 2 diabetes implicated loci in multi-ethnic cohorts from southeast asia. PLoS genetics, 7(4):e1001363.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.

VanRaden, P., Van Tassell, C., Wiggans, G., Sonstegard, T., Schnabel, R., Taylor, J., and Schenkel, F. (2009). Invited review: Reliability of genomic predictions for north american holstein bulls. Journal of dairy science, 92(1):16–24.

Verbyla, K. L., Hayes, B. J., Bowman, P. J., and Goddard, M. E. (2009). Accuracy of genomic selection using stochastic search variable selection in australian holstein friesian dairy cattle. Genetics research, 91(5):307–311.

Wei, G. C. and Tanner, M. A. (1990). A monte carlo implementation of the em algorithm and the poor man’s data augmentation algorithms. Journal of the American statistical Association, 85(411):699–704.

Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721. Xu, S. (2017). Predicted residual error sum of squares of mixed models–an application to

genomic prediction. G3: Genes, Genomes, Genetics, pages g3–116.

Xu, S., Xu, Y., Gong, L., and Zhang, Q. (2016). Metabolomic prediction of yield in hybrid rice. The Plant Journal, 88(2):219–227.

Xu, S., Zhu, D., and Zhang, Q. (2014). Predicting hybrid performance in rice using genomic best linear unbiased prediction. Proceedings of the National Academy of Sciences, 111(34):12456–12461.

Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., et al. (2010). Common snps explain a large proportion of the heritability for human height. Nature genetics, 42(7):565. Yu, H., Xie, W., Wang, J., Xing, Y., Xu, C., Li, X., Xiao, J., and Zhang, Q. (2011). Gains in qtl detection using an ultra-high density snp map based on population sequencing relative to traditional rflp/ssr markers. PloS one, 6(3):e17595.

Yu, J., Pressoir, G., Briggs, W. H., Bi, I. V., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B. S., Nielsen, D. M., Holland, J. B., et al. (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature genetics, 38(2):203.

Zhao, K., Tung, C.-W., Eizenga, G. C., Wright, M. H., Ali, M. L., Price, A. H., Norton, G. J., Islam, M. R., Reynolds, A., Mezey, J., et al. (2011). Genome-wide association mapping reveals a rich genetic architecture of complex traits in oryza sativa. Nature communications, 2:467.

Zhou, X., Carbonetto, P., and Stephens, M. (2013). Polygenic modeling with bayesian sparse linear mixed models. PLoS genetics, 9(2):e1003264.

Zhou, X. and Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nature genetics, 44(7):821.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301– 320.

In document A Stochastic Restricted Maximum Likelihood Method for Genomic Selection and Genome-Wide Association Studies (Page 70-84)