
CHAPTER 3. IDENTIFYING RELEVANT COVARIATES IN RNA-SEQ ANALYSIS BY

3.3 Real Data Analysis

3.4.1 Simulation Description

The goal of our simulation study is twofold: 1) to evaluate the FSR method in terms of its ability to select the most relevant covariates and 2) to evaluate the model selected by the FSR method in terms of its ability to identify DE genes while controlling FDR. Such evaluations require simulated datasets that contain a set of truly relevant covariates and both EE and DE genes for the included variables.

Figure 3.2: Histograms of p-values of the included variable (Line) and the covariates selected by our FSR backward selection algorithm with γ0 ∈ {0.01, 0.05, 0.1, 0.2}.

First, to evaluate the FSR method’s ability to select the relevant covariates, we examine how well the method can control FSR at nominal thresholds. We consider 3 different nominal FSR thresholds γ0 ∈ {0.01, 0.05, 0.1} and 6 different sets of relevant covariates as shown in Table 3.3. These covariates are chosen based on their levels of relevance when applying the FSR backward selection procedure to the RFI RNA-seq dataset. The first three cases represent situations where there are a small number of relevant covariates (0, 1, or 2 relevant covariates) among all 13 covariates, while the last three cases represent situations where there are a large number of relevant covariates (6, 7, or 8 relevant covariates) among all 13 covariates.

The last case with 8 relevant covariates is an example of when the relevant covariate RFI is strongly correlated with the included variable Line. The covariate RFI provides a continuous measure of residual feed intake for each of the 31 pigs in the study. Because the low RFI and high RFI lines were created by selecting on residual feed intake for several generations, it is not surprising that the low RFI pigs tend to have lower RFI values than the high RFI pigs in our study. This inclusion of the strongly confounding variable RFI makes it difficult to distinguish the direct effect of Line from the direct effect of RFI on transcript abundance levels, which may result in the failure of FDR control (Nguyen et al., 2015).

Table 3.3: Six different simulation scenarios corresponding to six different sets of truly relevant covariates.

Case   Model Size   Truly Relevant Covariates
1      0            Nothing
2      1            Mono
3      2            Mono, Concb
4      6            Mono, Concb, Neut, Block, RINa, Baso
5      7            Mono, Concb, Neut, Block, RINa, Baso, Lymp
6      8            Mono, Concb, Neut, Block, RINa, Baso, Lymp, RFI

Second, to evaluate the selected model’s ability to identify DE genes while controlling FDR, we simulated datasets that contain both EE and DE genes with respect to the included variable and each of the relevant covariates. For each simulation scenario, as true parameters to simulate new data, we used the precision weights, the scaled error variances, and the partial regression coefficient estimates from the fit of the corresponding model to the RFI RNA-seq data, except that we set partial regression coefficients on each variable to zero for a subset of genes to permit simulation of EE genes. More specifically, for each variable j (either a relevant covariate or the included variable Line), the $\hat{m}_0^{(j)}$ least significant partial regression coefficients were set to zero, where $\hat{m}_0^{(j)}$ is the estimated number of the j-variable partial regression coefficients equal to zero when the method of Nettleton et al. (2006) is applied to the j-variable’s p-values from the fit of the corresponding model to the RFI RNA-seq data. This strategy yielded a parameter vector consisting of a scaled error variance, precision weights, and partial regression coefficients for each of 11280 genes. To simulate any particular dataset for a given set of truly relevant covariates (either 0, 1, 2, 6, 7, or 8 relevant covariates), we randomly sampled 2000 gene parameter vectors. The selected parameters and the explanatory variable values for the 31 pigs were used to simulate a 2000 × 31 dataset of read counts following the inverse steps of (3.1) and (3.2). Random selection of parameters and generation of data were independently repeated 100 times to obtain 100 datasets for each scenario.
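The parameter-construction step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the random stand-in values for coefficients and p-values, the value of `m0_hat`, and the choice to sample genes without replacement are all assumptions made for the example.

```python
import numpy as np

def zero_least_significant(beta, pvalues, m0_hat):
    """Return a copy of beta with the m0_hat least significant entries set to zero.

    beta    : per-gene partial regression coefficient estimates for one variable
    pvalues : per-gene p-values for the same variable
    m0_hat  : estimated number of true null coefficients (assumed given here)
    """
    beta = np.asarray(beta, dtype=float).copy()
    pvalues = np.asarray(pvalues, dtype=float)
    if m0_hat > 0:
        # The m0_hat largest p-values mark the least significant genes.
        null_idx = np.argsort(pvalues, kind="stable")[-m0_hat:]
        beta[null_idx] = 0.0
    return beta

def sample_gene_indices(n_genes_total, n_sampled, rng):
    """Pick which gene parameter vectors enter one simulated dataset.

    Sampling without replacement is an assumption of this sketch.
    """
    return rng.choice(n_genes_total, size=n_sampled, replace=False)

rng = np.random.default_rng(0)
n_genes = 11280                        # number of genes in the RFI dataset
beta_line = rng.normal(size=n_genes)   # stand-in for fitted coefficients
p_line = rng.uniform(size=n_genes)     # stand-in for fitted p-values
m0_hat = 9000                          # hypothetical estimate of the null count

beta_sim = zero_least_significant(beta_line, p_line, m0_hat)
genes = sample_gene_indices(n_genes, 2000, rng)
```

In a full simulation the same zeroing step would be applied separately to each variable in the model, and the sampled indices would select the matching rows of coefficients, precision weights, and error variances.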

In addition to the two goals above, we also investigate the sensitivity of the FSR approach to the number of pseudo-covariates kP ∈ {1, 3, 5, 7}. Furthermore, we consider 8 versions of the FSR method by combining 2 FSR formulas (γER and γRE) and 4 pseudo-covariate generating methods (WN, RX, OWN, and ORX). We call these 8 versions WN.RE, WN.ER, RX.RE, RX.ER, OWN.RE, OWN.ER, ORX.RE, and ORX.ER.

3.4.2 Simulation Results

Using the simulated datasets, we first evaluate the ability of 9 methods to control FSR:

• OldBS: the backward selection procedure with the p.05 measure of covariate relevance (Nguyen et al., 2015).

• WN.RE, WN.ER, RX.RE, RX.ER, OWN.RE, OWN.ER, ORX.RE, and ORX.ER: 8 versions of our FSR backward selection method.

Then, we analyzed these simulated datasets using the covariates obtained from the 9 methods together with 5 other strategies for handling covariates. These 5 strategies use a model that includes

• all available covariates (Full)

• only the factor of primary interest (OnlyLine)

• surrogate variables from surrogate variable analysis (sva; Leek and Storey, 2007)

• surrogate variables from direct surrogate variable analysis (dSVA; Lee et al., 2017)

• the true set of covariates used to simulate the data for each gene (Oracle).

Of course, the Oracle procedure cannot be used in practice, but its inclusion provides a useful reference for the performance that would be achieved if covariate selection were perfect. In addition, sva (Leek and Storey, 2007) and dSVA (Lee et al., 2017) are surrogate variable analysis methods in which the surrogate variables are constructed by ignoring all available covariates.

For these analysis strategies, the voom method in the limma R package was used to compute p-values for testing the significance of the partial regression coefficients corresponding to the explanatory variables. For the included variables, these p-values were converted to q-values (as described in Section 3.2.2), and genes with q-values no larger than 0.05 were declared as DE. For covariates that are subject to variable selection, these p-values were used to calculate the relevance measure r.
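To make the q-value step concrete, the sketch below converts p-values to q-values and applies the 0.05 cutoff. Section 3.2.2 is not reproduced here, so the formula is an assumption: a Storey-type q-value in which the number of true nulls is replaced by an estimate `m0_hat` (a hypothetical argument name). With `m0_hat` equal to the number of tests, the result reduces to the Benjamini-Hochberg adjusted p-values.

```python
import numpy as np

def q_values(pvalues, m0_hat):
    """Storey-type q-values using an estimate m0_hat of the number of true nulls.

    q_(i) = min over j >= i of m0_hat * p_(j) / j, capped at 1,
    where p_(1) <= ... <= p_(m) are the ordered p-values.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m0_hat / np.arange(1, m + 1)
    # Running minimum from the largest p-value downward enforces monotonicity.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q

# Toy p-values for five genes; m0_hat = 4 is a made-up estimate.
p = np.array([0.001, 0.02, 0.04, 0.3, 0.8])
q = q_values(p, m0_hat=4)
is_de = q <= 0.05   # genes declared DE at the 0.05 q-value cutoff
```

In this toy example only the first two genes fall at or below the cutoff.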

Figure 3.3 shows simulation results evaluating the ability of OldBS and the 8 versions of our proposed FSR method to select relevant covariates. OldBS aims to select a subset of covariates whose effects are accounted for in a model so as to maximize the number of DE genes with respect to the included variable Line. Because it is not designed to control FSR, the FSR of OldBS is the same for every threshold γ0. The FSR of OldBS appears to decrease as the number of relevant covariates kI increases. In the scenario kI = 8, the FSR of OldBS is almost 0 because OldBS selects Mono, Concb, Neut, Block, RINa, Baso, and Lymp for more than 90 of the 100 simulated datasets in each scenario. This happens because in scenario kI = 8 the relevant covariate RFI is strongly associated with the included variable Line due to the selection of lines, as discussed in Section 3.4.1. Because OldBS always prefers the model with the maximum number of DE genes with respect to Line, RFI is disfavored in the selection process, as shown by the number of covariates selected in this case, S = 7.

Figure 3.3: Empirical estimates of false selection rate (FSR), the average number of selected irrelevant covariates (U), the average number of selected relevant covariates (S) from 100 replications as a function of kP ∈ {1, 3, 5, 7} for OldBS, ORX.ER, ORX.RE, OWN.ER, OWN.RE, RX.ER, RX.RE,

Figure 3.4 shows the performance of 14 methods in identifying DE genes with respect to the included variable Line. As shown in Figure 3.3, our method performs best when using kP = 7 pseudo-covariates. Therefore, when analyzing simulated data, our FSR method was implemented using kP = 7 pseudo-covariates. Figure 3.4 shows that all 8 versions of our method control FDR well. The OnlyLine, sva, and dSVA methods fail to control FDR when kI = 8, which is the case in which a relevant covariate is strongly correlated with the included variable. OldBS performs as well as our method except when kI = 8. The 8 versions of our proposed method perform well in terms of PAUC. Among all scenarios in which FDR is controlled at the nominal level 0.05, the number of true positives detected by our method is very high.
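For reference, the empirical FDR and NTP summaries reported in Figure 3.4 can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, the false discovery proportion of a dataset with no discoveries is taken to be 0, and PAUC is omitted.

```python
import numpy as np

def fdp_and_ntp(declared_de, truly_de):
    """False discovery proportion and number of true positives for one dataset."""
    declared = np.asarray(declared_de, dtype=bool)
    truth = np.asarray(truly_de, dtype=bool)
    n_true_pos = int(np.sum(declared & truth))
    n_false_pos = int(np.sum(declared & ~truth))
    # Convention assumed here: FDP is 0 when nothing is declared DE.
    fdp = n_false_pos / max(n_true_pos + n_false_pos, 1)
    return fdp, n_true_pos

def summarize(replicates):
    """Average FDP (empirical FDR) and average NTP over simulated datasets.

    replicates: iterable of (declared_de, truly_de) pairs, one per dataset.
    """
    results = [fdp_and_ntp(d, t) for d, t in replicates]
    fdps, ntps = zip(*results)
    return float(np.mean(fdps)), float(np.mean(ntps))

# Two toy replicates with five genes each (made-up indicator vectors).
rep1 = ([True, True, True, False, False], [True, False, True, True, False])
rep2 = ([False, False, False, False, False], [True, False, False, False, False])
empirical_fdr, mean_ntp = summarize([rep1, rep2])
```

Averaging the per-dataset proportion, rather than pooling discoveries across datasets, matches the usual definition of FDR as an expected proportion.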

3.5 Discussion

In this paper, we proposed a new covariate selection strategy for RNA-seq data analysis. We showed that our method can accurately choose the truly relevant covariates, even when there are covariates strongly associated with the included variables. As a result, our method performs very well in the downstream differential expression analysis. In particular, our method gives a reliable list of DE genes, as shown by its ability to control FDR and its ability to distinguish EE and DE genes from one another.

We have also shown that the sva and dSVA methods suffer when there are many relevant covariates available. This suggests that the analysis strategy should be chosen carefully when many covariates are available. These covariates should be checked for relevance before conducting further analyses.

We also want to emphasize that the proposed covariate selection strategy can be applied to the analysis of other ’omics data as well, such as microarray data, because the idea of adding pseudo-covariates can be extended to any other high-dimensional data type.

3.6 Appendix: Description of Variables in the RFI Dataset

x·1= Line is the categorical factor of primary scientific interest. Line has two levels, which correspond to the HRFI and LRFI selection lines. Among the 31 pigs in this study, 15 were from the LRFI line and 16 were from the HRFI line.

Figure 3.4: Empirical estimates of false discovery rate (FDR), the average number of true positive (NTP) detections of differential expression, and the average partial area under the receiver operating characteristic curve (PAUC) from 100 replications for Oracle, ORX.ER, ORX.RE, OWN.ER, OWN.RE, RX.ER, RX.RE, WN.ER, WN.RE, OldBS, dSVA, sva, OnlyLine, and All methods, three FSR thresholds γ0 ∈ {0.01, 0.05, 0.1}, and six scenarios.

x·2= RFI is a continuous covariate that provides a measure of the residual feed intake for each of the 31 pigs from which blood samples were drawn for RNA-seq analysis. Pigs in the HRFI line tend to have high RFI values, while pigs in the LRFI line tend to have low RFI values.

x·3= Diet is a categorical factor with two levels corresponding to the two diets (high fiber, low energy vs. low fiber, high energy) that were fed to the pigs in this study. Approximately half the pigs within each line were fed each diet. Because RNA-seq analysis was performed on blood samples collected prior to the initiation of the two diets, this factor is not expected to be associated with the transcript abundance levels measured by RNA-seq.

x·4= Baso is a continuous covariate that provides a measure of the concentration of basophil cells in the blood sample drawn from each pig.

x·5= Eosi is a continuous covariate that provides a measure of the concentration of eosinophil cells in the blood sample drawn from each pig.

x·6= Lymp is a continuous covariate that provides a measure of the concentration of lymphocyte cells in the blood sample drawn from each pig.

x·7= Mono is a continuous covariate that provides a measure of the concentration of monocyte cells in the blood sample drawn from each pig.

x·8= Neut is a continuous covariate that provides a measure of the concentration of neutrophil cells in the blood sample drawn from each pig.

x·9= Concb is a continuous measure of the RNA concentration in each sample before globin depletion (a step that is necessary to focus sequencing efforts on messenger RNA molecules other than highly abundant globin messenger RNA in each blood sample).

x·10= Conca is a continuous measure of the RNA concentration in each sample after globin depletion.

x·12= RINa is a continuous measure of RNA integrity within each sample after globin depletion.

x·13= Block is a categorical factor with four levels corresponding to the four blocks used to organize sample collection and processing. Initially, each block involved eight samples, two for each combination of Line and Diet. One LRFI sample from the first block was removed from the study due to low-quality RNA.

x·14= Order is a categorical factor with eight levels indicating the random order in which samples were processed within each block.

Acknowledgments

This material is based upon work supported by Agriculture and Food Research Initiative Competitive Grant No. 2011-68004-30336 from the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA), and by National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH) and the joint National Science Foundation (NSF)/NIGMS Mathematical Biology Program under award number R01GM109458. The opinions, findings, and conclusions stated herein are those of the authors and do not necessarily reflect those of USDA, NSF, or NIH.

Bibliography

Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10):R106.

Bullard, J. H., Purdom, E., Hansen, K. D., and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics, 11(1):94.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Grenander, U. (1956). On the theory of mortality measurement. Scandinavian Actuarial Journal, 1956(2):125–153.

Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari, 4:83–91.

Law, C. W., Chen, Y., Shi, W., and Smyth, G. K. (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2):R29.

Lee, S., Sun, W., Wright, F. A., and Zou, F. (2017). An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika, 104(2):303–316.

Leek, J. and Storey, J. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161.

Liang, K. and Nettleton, D. (2012). Adaptive and dynamic adaptive procedures for false discovery rate control and estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):163–182.

Love, M. I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550.

Lun, A. T. L., Chen, Y., and Smyth, G. K. (2016). It’s DE-licious: A recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. In Mathé, E. and Davis, S., editors, Statistical Genomics: Methods and Protocols, pages 391–416. Springer New York, New York, NY.

Lund, S. P., Nettleton, D., McCarthy, D. J., and Smyth, G. K. (2012). Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology, 11(5):1544–6115.

Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9):1509–1517.

McCarthy, D. J., Chen, Y., and Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Research, 40(10):4288–4297.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods, 5(7):621–628.

Nettleton, D., Hwang, J. T. G., Caldo, R. A., and Wise, R. P. (2006). Estimating the number of true null hypotheses from a histogram of p-values. Journal of Agricultural, Biological, and Environmental Statistics, 11(3):337.

Nguyen, Y., Nettleton, D., Liu, H., and Tuggle, C. K. (2015). Detecting differentially expressed genes with RNA-seq data using backward selection to account for the effects of relevant covariates. Journal of Agricultural, Biological, and Environmental Statistics, 20(4):577–597.

Risso, D., Ngai, J., Speed, T. P., and Dudoit, S. (2014a). Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotechnology, 32(9):896–902.

Risso, D., Ngai, J., Speed, T. P., and Dudoit, S. (2014b). The role of spike-in standards in the normalization of RNA-seq. In Statistical Analysis of Next Generation Sequencing Data, pages 169–190. Springer.

Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7):e47.

Robinson, M. D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3):R25.

Smirnov, N. (1948). Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2):279–281.

Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1):1–25.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3):479–498.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.

Wu, Y. (2004). Controlling Variable Selection By the Addition of Pseudo-Variables. PhD dissertation, Department of Statistics, North Carolina State University.

Wu, Y., Boos, D. D., and Stefanski, L. A. (2007). Controlling variable selection by the addition of pseudovariables. Journal of the American Statistical Association, 102(477):235–243.

CHAPTER 4. RNA-SEQ ANALYSIS FOR REPEATED-MEASURES DATA