Top PDF Statistical Methods for Gene Differential Expression Analysis of RNA-Sequencing

Statistical Methods for Gene Differential Expression Analysis of RNA-Sequencing

RNA-Sequencing (“RNA-Seq”) is performed to measure gene expression, often to ask which genes are differentially expressed across biological conditions. Statistical methods have been used to model RNA-Seq quantifications in order to determine differential expression, and have traditionally been divided into gene-level methods and transcript-level methods. There has been little attempt to bridge this statistical divide, although transcript expression and gene expression are biologically inextricably linked. In this thesis, we provide a case study of a comparative differential expression analysis, demonstrating that many differential expression events happen at the isoform level, and that an analysis using only summarized gene quantifications would fail to capture these events. Furthermore, we develop statistical methods that unify the transcript-level and gene-level analyses. In bulk RNA-Seq, by using p-value aggregation methods, we are able to translate transcript-level results into gene-level results under a unified framework. For single cell RNA-Seq, we propose using multiple logistic regression, leveraging the high dimensionality of the data to determine whether the transcript quantifications pertaining to a gene can constitute a linear discriminant for cell type.
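
The snippet above does not say which p-value aggregation method the thesis uses. As a minimal sketch of the general idea, the following assumes Fisher's method, which combines k independent p-values through a chi-square statistic with 2k degrees of freedom. The transcript names and p-values are hypothetical, and the independence assumption may not hold for transcripts of the same gene.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values into one p-value using Fisher's method."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p); we work with X / 2.
    half_stat = -sum(math.log(p) for p in p_values)
    # A chi-square variable with 2k degrees of freedom has the closed-form
    # survival function exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical transcript-level p-values for a single gene.
transcript_p = {"tx1": 0.04, "tx2": 0.20, "tx3": 0.01}
gene_p = fisher_combine(list(transcript_p.values()))
print(f"gene-level p-value: {gene_p:.4g}")
```

With a single transcript the combined p-value equals the input p-value, which is a quick sanity check on the aggregation.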

Statistical Methods for Gene Differential Expression Analysis of RNA-Sequencing

depicted in Supplementary Figure 2a,b was used to benchmark different parameters. In (a), three normalization methods (transcript counts, size factor normalization from DESeq2, and transcripts-per-million (TPM) normalization) were compared on this simulation. In (b), we compared tximport’s three methods of summing transcript quantifications to gene quantifications prior to differential gene expression analysis with DESeq2.
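
To make the TPM normalization and the transcript-to-gene summation concrete, here is a minimal sketch. It is not tximport's or DESeq2's API: the function names, the transcript-to-gene map, and the numbers are hypothetical, and the lengths are assumed to be effective transcript lengths in bases.

```python
def tpm(counts, lengths):
    """Convert transcript read counts to transcripts-per-million."""
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same size")
    # Length-normalized rates: reads per base of transcript.
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(counts)
    # Rescale so that the values sum to one million within the sample.
    return [r / total * 1e6 for r in rates]


def sum_to_genes(tx_ids, tx_values, tx_to_gene):
    """Sum transcript-level values into gene-level values."""
    gene_values = {}
    for tx, value in zip(tx_ids, tx_values):
        gene = tx_to_gene[tx]
        gene_values[gene] = gene_values.get(gene, 0.0) + value
    return gene_values


# Hypothetical sample with three transcripts from two genes.
tx_ids = ["tx1", "tx2", "tx3"]
tx_tpm = tpm(counts=[100, 200, 300], lengths=[1000, 2000, 1000])
gene_tpm = sum_to_genes(tx_ids, tx_tpm, {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"})
print(gene_tpm)
```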

Statistical methods for the analysis of RNA sequencing data

MAR if missingness on Y is only related to the observed values of Y, so that the missing data may depend on the observed variables but not on the missing variable itself. MCAR and MAR are sometimes referred to as ‘ignorable’ missingness, and valid statistical methods have been developed to analyze data with such missing mechanisms. The case of MNAR is the most difficult to deal with because the probability that a value is missing depends on the missing value itself. Missingness with MNAR is often termed ‘non-ignorable’, and statistical analysis of this type of missing data is complicated because it relies on the ability to model the process that generates the missing data. When working with microarray data, one may expect the missingness to arise from a random distribution due to hybridization failures, dust on the slides, etc., at random spots on the array. However, when the signal intensity at a spot is too low to be distinguished from the background, it may be declared a missing value. This represents the situation where the missing pattern actually depends on the spot intensity itself, thus leading to a mixture of MAR and MNAR in the dataset (Brás and Menezes, 2007).

Differential expression analysis of RNA sequencing data by incorporating non-exonic mapped reads

different experimental conditions. While enormously successful, DE analysis also suffers from systematic noise and sequencing biases, such as sequence quality bias, wrong base calls, variability in sequence depth across the transcriptome, and differences in coverage depth between replicate samples [1]. There are already many statistical testing methods for RNA-seq differential expression analysis. One is to normalize the read counts of target transcripts, converting them into reads per kilobase per million mapped reads (RPKM), and then apply the linear modeling methods used in microarray experiments [2]. However, methods designed for microarray measurements may not fit the characteristics of sequencing data properly. In past years, algorithms have been developed specifically for RNA-seq data analysis. Among them, two popular software packages implement a negative binomial (NB) model that accounts for genome-wide read counts, and they moderate dispersion estimates with different statistical methods [3,4]. edgeR [4] uses a trended-by-mean estimate to moderate dispersion estimates, whereas DESeq [3] takes the maximum of a fitted dispersion mean or the feature-wise dispersion estimate, as reviewed in [5]. However, neither of these methods considers the misaligned reads present in the sequencing data, which may play a significant role in detecting the significance of target transcripts.
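
The RPKM conversion mentioned above can be written out in a few lines. This is a sketch of the standard definition only; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    # Divide by length in kilobases and by library size in millions;
    # the two scale factors (1e3 and 1e6) combine into 1e9.
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical transcript: 500 reads, 2,000 bp long, in a library of
# 10 million mapped reads.
value = rpkm(read_count=500, length_bp=2000, total_mapped_reads=10_000_000)
print(value)
```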

Statistical and Computational Methods for Differential Expression Analysis in High-throughput Gene Expression Data

In [57], the authors separately pooled the reads for ASD and control to generate sufficient read coverage for the quantitative analysis of alternative splicing events (referred to as the “pooled dataset” below), and then used an exon-based method similar to MATS in their analysis and detected 212 significantly differentially spliced exons (belonging to 196 unique genes). As we have shown in the analysis of the ESRP1 dataset, exon-based methods provide less accurate results for complex alternative splicing events and cannot infer the abundances of the isoforms. Here we analyze this pooled dataset using rSeqDiff (the detailed method is given in Appendix A, Section A.5). rSeqDiff classifies 4,507 genes (with 6,850 transcripts) as model 0, 12,374 genes (with 19,556 transcripts) as model 1, and 1,769 genes (with 5,848 transcripts) as model 2; 7,349 genes (with 8,884 transcripts) are filtered out because they have fewer than 5 mapped reads in both conditions. We also run Cuffdiff 2 [37,59] on this dataset with its default settings. We find Cuffdiff 2 to be relatively conservative for detecting differential expression of spliced transcripts: it identifies only 43 transcripts as significant under default settings (FDR<0.05). Figure A.4 in the Appendix shows the scatter plot of the log2 fold changes of transcript abundances between ASD and control estimated by the two approaches (genes with low read counts or that failed to be tested by Cuffdiff 2 are excluded). Similar to the analysis of the ESRP1 dataset, the two methods generate concordant results overall (PCC = 0.825, SCC = 0.937). The correlation coefficients for transcripts classified in each of the three models are PCC = 0.539, SCC = 0.796 (model 0), PCC = 0.847, SCC = 0.940 (model 1) and PCC = 0.854, SCC = 0.953 (model 2), which show the same pattern as we observed in the ESRP1 dataset. We also run rSeqDiff on each individual biological replicate and get results consistent with the analysis of the pooled dataset (Table A.4 in the Appendix).

Comprehensive evaluation of differential gene expression analysis methods for RNA seq data

which rely on simulated data generated by specific statistical distributions or limited experimental datasets [23,33,34], we used the SEQC experimental dataset where a large fraction of the differentially expressed genes were validated by qRT-PCR and biological replicates from three cell lines profiled by the ENCODE project [13]. Overall, no single method emerged as favorable in all comparisons, but it is apparent that methods based on negative binomial modeling (DESeq, edgeR, and baySeq) have improved specificity and sensitivity as well as good control of false positive errors with comparable performance. However, methods based on other distributions, such as PoissonSeq and limma, compared favorably and have improved modeling of genes expressed in one condition. On the other hand, Cuffdiff has reduced sensitivity and specificity as measured by ROC analysis, as well as a significant number of false positives in the null model test. We postulate that this is related to its normalization procedure, which attempts to account for both alternative isoform expression and length of transcripts. Table 2 summarizes the comparison results in addition to a number of additional quality measures, which were not directly evaluated in this study.

Cloud scale RNA sequencing differential expression analysis with Myrna

Discussion: Myrna is a computational pipeline for RNA-Seq differential expression analysis using cloud computing. We used Myrna to analyze a large publicly available RNA-Seq dataset with over 1 billion reads. The efficiency of our pipeline allowed us to test a number of different models rapidly on even this large data set. We showed that, under random labeling, a Gaussian or permutation-based testing strategy that includes a normalization constant as a term in the model showed the least bias, and that the often used Poisson model vastly overestimates the amount of differential expression when biological variation is assessed. We have implemented both Gaussian and parallelized permutation tests for differential expression in Myrna.
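
The idea behind a permutation-based test can be sketched as follows. This is not Myrna's implementation: it is a generic exact permutation test on the difference in group means for one gene, it omits the normalization term described above, and the expression values and labels are hypothetical.

```python
from itertools import combinations


def permutation_p_value(values, labels):
    """Exact two-sided permutation p-value for a difference in group means."""
    n = len(values)
    group_a = [i for i, g in enumerate(labels) if g == labels[0]]
    k = len(group_a)
    if k == 0 or k == n:
        raise ValueError("need two non-empty groups")
    grand_total = sum(values)

    def abs_mean_diff(indices_a):
        sum_a = sum(values[i] for i in indices_a)
        return abs(sum_a / k - (grand_total - sum_a) / (n - k))

    observed = abs_mean_diff(group_a)
    hits = 0
    total = 0
    # Enumerate every way of assigning k samples to the first group.
    for indices_a in combinations(range(n), k):
        total += 1
        # Small tolerance so that ties with the observed value are counted.
        if abs_mean_diff(indices_a) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical normalized expression of one gene in six samples.
p = permutation_p_value([1, 2, 3, 10, 11, 12], ["A", "A", "A", "B", "B", "B"])
print(p)
```

With three samples per group there are only 20 relabelings, so the smallest attainable two-sided p-value is 2/20 = 0.1; this illustrates why permutation tests need enough replicates to reach small p-values.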

Compare RNA-Seq Differential Expression Analysis Methods

CHAPTER 5. Discussion. The eBayes method is based on obtaining estimates for hyperparameters, followed by MCMC to estimate gene-specific parameters. The empirical Bayes posteriors were used to estimate posterior probabilities of DE. Through a simulation study, we demonstrated that this method outperformed alternative methods in most simulation scenarios, with higher AUC values. More samples (nSamples) would improve all methods’ performance given the same proportion of DE genes (pDiff). DE analysis methods work better on count data with more replicates per condition and a higher proportion of true DE genes. This is not surprising, considering that the focus of most methods is to model the variability in gene expression measurements, and therefore increasing the number of replicates strengthens the estimate. The proportion of true DE genes (pDiff) affects how much eBayes outperforms the other methods. When pDiff is smaller, eBayes performs much better than other methods in terms of AUC values, which means the eBayes method has some advantage in scenarios with a smaller true DE proportion. We observed that the ROC curves for sSeq and DESeq were almost straight lines in some scenarios, which might be associated with the equal cost of misclassifying DE and misclassifying non-DE cases. These issues disappeared when we used the unadjusted p-value as the statistic for sSeq and DESeq. But using unadjusted p-values inherently included a large number of false positives at a p-value cutoff of 0.05. Soneson and Delorenzi (2013) also mentioned that some methods, including DESeq, exhibit an excess of large p-values, which has been attributed to the use of exact tests based on discrete probability distributions (Robles et al., 2012).

Statistical Methods for the Analysis of Contextual Gene Expression Data

Different technologies allow for generating spatially resolved expression profiles. Imaging Mass Cytometry (IMC) (Giesen et al., 2014; Chang et al., 2017) and Multiplexed Ion Beam Imaging (MIBI) (Angelo et al., 2014) rely on protein labelling with antibodies coupled to metal isotopes of specific masses followed by high-resolution tissue ablation and ionisation. IMC currently allows for the profiling of up to 37 targeted proteins with subcellular resolution. Other methods such as MxIF and CycIF use immunofluorescence for protein quantification of dozens of markers in single cells (Gerdes et al., 2013; Lin et al., 2015). Increasingly, there also exist fluorescence-based assays to measure single cell RNA levels in spatial context (Strell et al., 2018). Mer-FISH and seqFISH use a combinatorial approach of fluorescence-labeled small RNA probes to identify and localise single RNA molecules (Shah et al., 2016; Chen et al., 2015; Gerdes et al., 2013; Lin et al., 2015), which allows for a larger number of readouts (currently between 130 and 250). Even higher-dimensional expression profiles can be obtained from spatial expression profiling techniques such as Spatial Transcriptomics (Ståhl et al., 2016). However, they currently do not offer single cell resolution and are therefore not sufficient to study cell-to-cell variation.

A comparison of methods for differential expression analysis of RNA-seq data.

Conclusions: In this paper, we have evaluated and compared eleven methods for differential expression analysis of RNA-seq data. Table 2 summarizes the main findings and observations. No single method among those evaluated here is optimal under all circumstances, and hence the method of choice in a particular situation depends on the experimental conditions. Among the methods evaluated in this paper, those based on a variance-stabilizing transformation combined with limma (i.e., voom+limma and vst+limma) performed well under many conditions, were relatively unaffected by outliers and were computationally fast, but they required at least 3 samples per condition to have sufficient power to detect any differentially expressed genes. As shown in the supplementary material (Additional file 1), they also performed worse when the dispersion differed between the two conditions. The non-parametric SAMseq, which was among the top performing methods for data sets with large sample sizes, required at least 4-5 samples per condition to have sufficient power to find DE genes. For highly expressed genes, the fold change required for statistical significance by SAMseq was lower than for many other methods, which can potentially compromise the biological significance of some of the statistically significantly DE genes.

Statistical methods for gene expression studies using next-generation sequencing experiments

Part of the reason is the randomness of MCMC: if we run another MCMC using a different seed, the overlap between the two MCMC runs is about 97%. Since we generated 5000 posterior samples to calculate the estimated posterior probabilities after a burn-in of 3000 iterations, whether the Markov chains are long enough to give accurate results is also a potential problem. We checked the effective sample size for each gene. Genes that had the same declared DE status after swapping treatments had effective sample sizes of about 500 or larger, but genes whose declared DE status differed had effective sample sizes of only about 100. An effective sample size of around 400 can be regarded as large enough, so for genes with a low effective sample size, we may need to run a longer MCMC. Based on simulation checks, running the Markov chains longer does increase the percentage of overlapping genes, as expected. For example, for simulation A with n = 6, if we doubled the length of the chain, the overlap for iSBA increased to 95.28%. However, running longer chains is more time consuming, and it only benefits a small proportion of genes, while results for most genes would not change. Therefore, there is a tradeoff between efficiency and accuracy, and we will let users decide which is more important for a practical application.
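
The snippet does not state how the effective sample size was computed. A minimal sketch of one common estimator, assumed here for illustration, divides the chain length by one plus twice the sum of positive-lag autocorrelations, truncating the sum at the first non-positive autocorrelation. The chains below are synthetic.

```python
def effective_sample_size(chain):
    """Naive effective sample size of a single MCMC chain."""
    n = len(chain)
    if n < 2:
        raise ValueError("chain must contain at least two draws")
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        raise ValueError("chain is constant; effective sample size is undefined")
    rho_sum = 0.0
    for lag in range(1, n):
        autocov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = autocov / variance
        # Truncate at the first non-positive autocorrelation estimate.
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


# Synthetic chains: a strongly trending chain mixes poorly, an
# alternating chain has no positive autocorrelation at lag 1.
slow_chain = [float(i) for i in range(100)]
fast_chain = [1.0, -1.0] * 50
print(effective_sample_size(slow_chain), effective_sample_size(fast_chain))
```

Production MCMC diagnostics typically use more careful truncation rules and multiple chains; this version only shows why strongly autocorrelated draws count for fewer independent samples.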

Differential gene expression analysis tools exhibit substandard performance for long non coding RNA sequencing data

Simulation of lncRNA expression data only: Results presented up to this point came from simulating, normalizing, and analyzing lncRNAs and mRNAs together. Of note, joint analysis of the two gene biotypes may affect results. For example, estimates of gene-specific dispersion parameters for negative binomial models are often obtained by sharing information across all genes using an empirical Bayes strategy [32–34], and hence the results for lncRNAs depend on mRNA read counts and vice versa. In addition, adjusted p values aimed at controlling FDR are calculated taking into account the total number of genes included in the analysis [31]. Therefore, we also evaluated the performance of the DE tools with only lncRNA data, using the same simulation procedures. Our conclusions remain the same. The results are shown in Additional file 1: Figure S29. The FDR control is generally worse when analyzing lncRNA separately, particularly for small replicate sizes. Only a small reduction in TPR is observed.
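
To illustrate why adjusted p values depend on the total number of genes analyzed, the sketch below assumes the Benjamini-Hochberg procedure (the snippet does not name the adjustment used in [31]). The p-values are hypothetical: the same two lncRNA p-values are adjusted alone and then jointly with four mRNA p-values.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values.
lncrna_p = [0.001, 0.02]
mrna_p = [0.5, 0.6, 0.7, 0.8]

alone = benjamini_hochberg(lncrna_p)
joint = benjamini_hochberg(lncrna_p + mrna_p)[: len(lncrna_p)]
print(alone, joint)
```

In this toy case the lncRNA adjusted values are larger in the joint analysis because the multiplier m grows from 2 to 6; with real data the direction of the change also depends on how many small p-values the other biotype contributes.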

A tail-based test for differential expression analysis and pathway analysis in RNA-sequencing data

Advisory Professor: Jianhua Hu, Ph.D.

RNA sequencing data have been abundantly generated in biomedical research for biomarker discovery and pathway analysis. Such data at the exon level are usually heavy-tailed and correlated. Conventional statistical tests for differential expression based on the mean or median difference likely suffer from low power when the between-group difference occurs mostly in the upper or lower tail of the distribution of gene expression. We propose a tail-based test to make comparisons between groups in terms of a specific distribution area rather than a single location. The proposed test, which is derived from quantile regression, adjusts for covariates and accounts for within-sample dependence among the exons through a specified correlation structure.

UMI count modeling and differential expression analysis for single cell RNA sequencing

scRNA-seq differential expression analysis: A direct consequence of properly modeling scRNA-seq counts is the power to accurately conduct differential expression analyses. Based on the knowledge derived from UMI-count modeling, we proposed an NB-based algorithm for differential expression analysis of large-scale UMI-based scRNA-seq data. We extended the general NB-based models by allowing independent dispersion parameters in each biological condition, resulting in the NBID method. This approach is analogous to the t-test that allows different variances between groups when testing the equivalence of means. The rationale stems from the apparent variations in dispersion even at the same average expression level [3, 7]. Because the number of cells in each condition is generally sufficient in large-scale datasets, we derive separate dispersion estimates for each condition; these are used in the subsequent NB-based test against the null hypothesis that different conditions have the same average expression.
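
The idea of a separate dispersion per condition can be illustrated with a simple method-of-moments estimate. This is not the NBID estimator, which the snippet does not describe in detail: it assumes the NB variance form var = mu + phi * mu^2, ignores per-cell size factors, and uses hypothetical UMI counts.

```python
def moment_dispersion(counts):
    """Method-of-moments NB dispersion, assuming var = mu + phi * mu**2."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two cells")
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if mean == 0:
        return 0.0
    # Clamp at zero: under-dispersed samples fall back to the Poisson case.
    return max(0.0, (variance - mean) / mean ** 2)


# Hypothetical UMI counts of one gene in two conditions.
condition_counts = {
    "control": [0, 2, 4, 10],
    "treated": [3, 3, 3, 3],
}
dispersions = {name: moment_dispersion(c) for name, c in condition_counts.items()}
print(dispersions)
```

Estimating the two dispersions independently, rather than pooling them, is the point of contrast with NB models that share a single dispersion across conditions.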

Differential Gene Expression in Coiled versus Flow Diverter Treated Aneurysms: RNA Sequencing Analysis in a Rabbit Aneurysm Model

Another important function for aneurysm occlusion after endovascular treatment is wound healing, consisting subsequently of thrombus formation, myofibroblast invasion, and extracellular matrix deposition [5,12,62]. C-type lectin domain family 7, member A (dectin 1) is a molecule promoting wound healing by the enhanced production of collagen matrices with β-glucans [63-65]. Our results show that dectin 1 is approximately 4 times overexpressed in coiled IAs compared with flow-diverter-treated IAs. This finding suggests that wound healing is a process that is much more preponderant in coils than in flow-diverter treatment. FGFBP1 is another molecule promoting wound healing [66,67]. The present results show that FGFBP1 is decreased in flow-diverter-treated aneurysms compared with untreated aneurysms; this decrease supports the idea that aneurysm occlusion after flow-diverter therapy is not related to wound-healing mechanisms but mostly to endothelial cell proliferation originating from the parent artery, as previously demonstrated [6]. We also identified another molecule of interest, HHIP, which is abundantly expressed in vascular endothelial cells and involved in angiogenesis [68]. We observed in our study that the expression of HHIP is down-regulated in coiled aneurysms. HHIP down-regulation is involved in the promotion of angiogenesis and could be involved in the neovascularization of the wound during the healing of coiled aneurysms [5,69].

Identification of differential gene expression profile from peripheral blood cells of military pilots with hypertension by RNA sequencing analysis

signaling pathway and cell surface receptor signaling pathway were in the core position (Additional file 2: Figure S1). Then, pathway analysis was carried out to detect significant and important pathways of these differentially expressed genes. Influenza A and osteoclast differentiation were the most significant (Fig. 4a). The top 20 enriched pathways are displayed in Fig. 4b, where P-value and gene number are indicated by circle size and color. Next, a pathway interaction network was built for deeper analysis. The results showed that the most important pathways were apoptosis, the Jak-STAT signaling pathway, toll-like receptor signaling and cytokine-cytokine receptor interaction (Additional file 3: Figure S2). Because these four pathways are located at the center of all the significant pathways and have the most arrowheads around them, they are likely to be the most important in the elevated blood pressure of military pilots. This result suggested that differentially expressed genes related to apoptosis, the Jak-STAT signaling pathway, toll-like receptor signaling and cytokine-cytokine receptor interaction may have important

Statistical Design and Analysis of RNA Sequencing Data

For both replicated and unreplicated scenarios the proposed balanced block designs benefit from both the NGS platform design and multiplexing. These designs are as good as, if not better than, their unblocked counterparts in terms of power and type I error and are considerably better when batch and/or lane effects are present. Realizing of course that it is not possible to determine whether batch and/or lane effects are present a priori, we recommend the use of block designs to protect against observed differences that are attributable to these potential sources of variation. Since we understand both the expense associated with block designs and the concern of multiplexing, we offer some alternatives. Certainly, it is possible to avoid multiplexing if there are enough biological replicates and sequencing lanes that allow for designs that block on lane and/or flow cell (see Figure 6). However, if resources are limited (i.e., one flow cell), multiplexing offers an alternative that at the very least eliminates the potential for confounding of effects. Multiplexing and blocking aside, the bottom line remains the same: results from unreplicated data cannot be generalized beyond the sample tested (here, differential expression).

High-Resolution Analysis of Coronavirus Gene Expression by RNA Sequencing and Ribosome Profiling

Viruses present particular challenges in profiling experiments. One of these is library contamination, which in this study may have been derived from two sources. The first was low-level contamination of one sample by another, a problem that is compounded by the high levels of virus RNA synthesised in infected cells at late time points. We took precautions to avoid this source of contamination, including the use of designated work spaces, buffers and equipment, and avoiding parallel processing of early and late time points where possible. Potentially, contamination may also have been introduced through the multiplex adaptor sequences. In the relatively small number of published studies on virus ribosome profiling, data from mock-infected samples and tests for contamination are often not reported, so the level of contamination suffered by others is uncertain. A second potential source of contamination could derive from RNPs comprising virus or host mRNA complexed with virus or stress-induced host RNA binding proteins. Such RNPs might co-sediment with ribosomes during the sucrose cushion centrifugation step and contaminate RiboSeq libraries. Although we were mindful of the possibility of such contamination, we found little evidence for it occurring as a result of MHV infection. An increased 3′ UTR RiboSeq (CHX) density was not apparent until 8 h p.i. (when the plateau of virus production has been reached and virtually all cells are involved in extensive syncytium formation) and, even then, the read length distributions were similar to those of mock-infected cells, suggesting that the increased 3′ UTR occupancy was as much due to bona fide RPFs as contaminating RNPs. The former could be due to depletion of ribosome recycling factors resulting in increased amounts of unrecycled post-termination ribosomes accessing the 3′ UTR [31].
The high level of phasing in our RiboSeq data (S4 Fig) allowed us to carefully assess contamination issues, and our observations reinforce the essentiality of basic data quality checks (e.g. S4–S10 Figs) in profiling studies. Despite these challenges, the profiling and RNA-Seq analysis of MHV infection still showed itself to be a powerful tool to investigate specific aspects of MHV replication at high resolution.

aFold – using polynomial uncertainty modelling for differential gene expression estimation from RNA sequencing data

Conclusions: Here, we introduce a new approach for normalization and DE analysis of RNA-Seq data. The new normalization procedure included in the package, qtotal, adjusts for the influence of the number of DE genes on the overall read count distribution and accurately approximates the true sequence depth. Qtotal can also be combined with different RNA-Seq analysis methods outside of aFold. It results in DE identification that is at least as good as and often better than that produced with alternative normalization procedures, especially in the case of asymmetrical distributions of up- and down-regulated genes. The new fold change inference and analysis method, the aFold DE analysis algorithm, is distinct from other current methods because it uses polynomial uncertainty modelling to infer fold changes and considers variance in read count data across genes and treatment groups. It thus permits reliable fold-change comparisons across genes, which will enhance correct ranking of genes for selection of candidates for subsequent analysis and gene set enrichment analysis [41]. Using real and simulated data sets, we demonstrate that aFold is at least as efficient as and often better in discriminating DE and non-DE, especially in the presence of outliers or biased DE distributions. Our statistical framework shows high power to control FDR and type I error rate across expression levels. Based on our analyses, we conclude that the aFold package represents a highly efficient novel tool for RNA-Seq data normalization, fold change estimation, and identification of significant DE across a wide range of conditions. It may help the experimentalist to avoid an arbitrary choice of cut-off thresholds and may enhance subsequent downstream functional analyses.

Deciphering Poxvirus Gene Expression by RNA Sequencing and Ribosome Profiling

ABSTRACT: The more than 200 closely spaced annotated open reading frames, extensive transcriptional read-through, and numerous unpredicted RNA start sites have made the analysis of vaccinia virus gene expression challenging. Genome-wide ribosome profiling provided an unprecedented assessment of poxvirus gene expression. By 4 h after infection, approximately 80% of the ribosome-associated mRNA was viral. Ribosome-associated mRNAs were detected for most annotated early genes at 2 h and for most intermediate and late genes at 4 and 8 h. Cluster analysis identified a subset of early mRNAs that continued to be translated at the later times. At 2 h, there was excellent correlation between the abundance of individual mRNAs and the numbers of associated ribosomes, indicating that expression was primarily transcriptionally regulated. However, extensive transcriptional read-through invalidated similar correlations at later times. The mRNAs with the highest density of ribosomes had host response, DNA replication, and transcription roles at early times and were virion components at late times. Translation inhibitors were used to map initiation sites at single-nucleotide resolution at the start of most annotated open reading frames, although in some cases a downstream methionine was used instead. Additional putative translational initiation sites with AUG or alternative codons occurred mostly within open reading frames, and fewer occurred in untranslated leader sequences, antisense strands, and intergenic regions. However, most open reading frames associated with these additional translation initiation sites were short, raising questions regarding their biological roles. The data were used to construct a high-resolution genome-wide map of the vaccinia virus translatome.