differential analysis rely on converting gene abundance estimates to gene counts (Soneson et al. 2016, Pimentel et al. 2017). Such methods have two major drawbacks. First, even though the resulting gene counts can be used to accurately estimate fold changes, the associated variance estimates can be distorted (see Figure 1 and Additional file 1: Section 1). Second, the assignment of a single numerical value to a gene can mask dynamic effects among its multiple constituent transcripts (Figure 2). In the case of “cancellation” (Figure 2a), the abundance of transcripts changing in opposite directions cancels out upon conversion to gene abundance. In “domination” (Figure 2b), an abundant transcript that is not changing can mask substantial change in abundance of a minor transcript. Finally, in the case of “collapsing” (Figure 2c), due to overdispersion in variance, multiple isoforms of a gene with small effect sizes in the same direction do not lead to a significant change when observed in aggregate, but their independent changes constitute substantial evidence for differential expression. As shown in Figure 2, these scenarios are not merely hypothetical cases from a thought experiment, but events that occur in biological data.
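The cancellation and domination scenarios can be illustrated with a small numerical sketch. The abundances below are hypothetical values chosen for illustration, not data from the figures, and the helper name gene_level_change is an assumption introduced here.

```python
def gene_level_change(cond_a, cond_b):
    """Return (gene-level difference, per-transcript differences).

    cond_a and cond_b hold the abundances of the same transcripts
    of one gene in two conditions (hypothetical toy values).
    """
    per_transcript = [b - a for a, b in zip(cond_a, cond_b)]
    return sum(cond_b) - sum(cond_a), per_transcript

# Cancellation: two transcripts change in opposite directions,
# so the summed gene abundance does not change at all.
cancel_gene, cancel_tx = gene_level_change([100, 20], [20, 100])

# Domination: an abundant, unchanged transcript hides a fourfold
# change in a minor transcript.
dom_gene, dom_tx = gene_level_change([1000, 10], [1000, 40])
dom_gene_fold = (1000 + 40) / (1000 + 10)   # gene-level fold change
dom_minor_fold = 40 / 10                    # minor-transcript fold change

print(cancel_gene, cancel_tx)
print(round(dom_gene_fold, 3), dom_minor_fold)
```

In the first case the gene-level difference is zero although both transcripts change by 80 units; in the second the gene-level fold change stays close to 1 while the minor transcript changes fourfold.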
One approach to overcome this issue is to use data transformations. Li et al. (2010) log-transformed gene expression values from an RNA-seq experiment and identified differentially expressed genes using the K-means clustering algorithm. Jäger et al. (2011) standardized the RNA-seq count values from their experiment and performed hierarchical clustering on the normalized data to obtain gene groups with similar expression. In another time-course experiment focusing on early zebrafish development, the RNA-seq data were also normalized and clustered using K-means (Pauli et al., 2012). These heuristic approaches have the advantage of easy implementation; however, they have not been evaluated for RNA-seq data analysis, and the clustering methods employed ignore the time dependence in time-series data. There are also complications when analyzing transformed count data. Transformed count data cannot be well approximated by continuous distributions, which is particularly problematic for data with small sample sizes and lower count ranges (Oshlack et al., 2010). Data with very small counts are far from normally distributed after transformation, and count data usually contain a mean-variance relationship that is not addressed by normal-based analyses (McCarthy et al., 2012).
Following the advent of RNA-sequencing (RNA-seq) technologies, several statistical tools for differential gene expression (DGE) analysis have been introduced. However, low and noisy read counts, such as those coming from lncRNAs, are potentially challenging for the tools [10, 11]. For example, it is commonly observed that low count genes show large variability of the fold-change estimates and thus exhibit inherently noisier inferential behavior. The majority of the methods suggest removal of lowly expressed genes before the start of data analysis, but this procedure essentially blocks researchers from studying lncRNAs. In our study, no such severe filtering was applied, leaving almost all lncRNAs in the dataset. To our knowledge, no statistical method has been specifically developed for the analysis of lncRNA-seq data and therefore transcriptome studies make use of statistical methods that assume sufficient expression levels. In this paper, we evaluated and compared the performance of many popular statistical methods (Table 1) developed for testing DGE of RNA-seq data (hereafter referred to as “DE tools”), with special emphasis on lncRNAs and low-abundance mRNAs. All tools considered in this study are popular (in terms of number of citations), available as R software packages, and use gene or transcript level read counts as input. Our conclusions are based on six RNA-seq datasets and many realistic simulations, representing various typical gene expression experiments.
Microarrays enable the study of many genes simultaneously, in a semi-quantitative way, and have been widely used to address several biological, genetic and biochemical issues [37,58]. Normally this analysis is performed with previously known gene sequences inserted into an array. Nevertheless, this array can be supplied with sequences derived from RNA libraries constructed specifically for the situation under study. In this case the selected sequences should be informative and amenable to physiological or interaction study. However, this type of analysis depends on prior knowledge of the genetic content of a species, although this information may often be available in existing databases. Furthermore, the sequences available in databases are usually cultivar non-specific and are not PCR amplified, decreasing the sensitivity of this method.
Through RNAseq, we identified 359 genes with differential expression between groups, 170 up-regulated and 189 down-regulated, for the treatment with higher lipid content when compared to the control diet. The treatment effect can be observed through the activity of these genes in important metabolic pathways, some of which affect meat quality. Among these pathways, those associated with lipid metabolism were not affected, coinciding with the similarity of IMF and fat content. We observed an effect on carbohydrate metabolism, with a lower expression of the genes involved in these pathways in the treatment with higher lipid content. Other important pathways affected were muscle tissue development and ECM-receptor interaction. As for muscle tissue development, we found a lower expression in the treatment with higher lipid content, while the opposite occurred for the ECM pathway. The results of the differential expression of the genes involved in these pathways, which was affected by the lipid content of the diet, provide information that may be useful for future research related to meat quality and that may also be valuable in the study of human diseases, mainly those related to energy metabolism and muscle development.
expression. A fundamental property of gene expression is transcriptional bursting, in which transcription from DNA to RNA occurs in bursts, depending on whether the gene’s promoter is activated (Figure 3.1A) (109, 110). Transcriptional bursting is a widespread phenomenon that has been observed across many species including bacteria (111), yeast (112), Drosophila embryos (113), and mammalian cells (114, 115), and is one of the primary sources of expression variability in single cells. Figure 3.1B illustrates the expression across time of the two alleles of a gene. Under the assumption of ergodicity, each cell in a scRNA-seq sample pool is at a different time in this process, implying that for each allele, some cells might be in the transcriptional “ON” state, whereas other cells are in the “OFF” state. While in the “ON” state, the magnitude and length of the burst can also vary across cells, further complicating analysis. For each expressed heterozygous site, a scRNA-seq experiment gives us the bivariate distribution of the expression of its two alleles across cells, allowing us to compare the alleles not only in their mean, but also in their distribution. In this paper, we will use scRNA-seq data to characterize transcriptional bursting in an allele-specific manner and detect genes with allelic differences in the parameters of this process.
Technical noise in scRNA-seq and other complicating factors
Additional file 1: Figure S1 outlines the major steps of the scRNA-seq protocols and the sources of bias that are introduced during library preparation and sequencing. After the cells are captured and lysed, exogenous spike-ins are added as internal controls, which have fixed and known concentrations and can thus be used to convert the number of sequenced transcripts into actual abundances. During the reverse transcription, pre-amplification, and library preparation steps, lowly expressed transcripts might be lost, in which case they will not be detected during sequencing. This leads to so-called “dropout” events. Since spike-ins undergo the same experimental procedure as endogenous RNAs in a cell, amplification and sequencing bias can be captured and estimated through the spike-in molecules. Here we adopt the statistical model in TASC (Toolkit for Analysis of Single Cell data, unpublished), which explicitly models the technical noise through spike-ins. TASC’s model is based on the key observation that the probability of a gene being a dropout depends on its true expression in the cell, with lowly expressed genes more likely to drop out. Specifically, let Q_cg and Y_cg be, respectively, the observed
conducted in accordance with the Declaration of Helsinki and approved by the Air Force Medical Center of PLA Ethics committee. Briefly, the RNA quality was examined and sequencing libraries were generated using the VAHTS mRNA-seq v2 Library Prep Kit for Illumina® (Vazyme, NR601). After measuring the library concentration, the clustering of the index-coded samples was performed on a cBot Cluster Generation System (Illumina, USA) and the library preparations were sequenced on an Illumina Hiseq X Ten platform with the 150-bp paired-end module. Differential gene expression was analyzed with Cuffdiff (v2.2.1). Genes with corrected p-value ≤ 0.05 and an absolute value of log2 (fold change) ≥ 1 were considered significantly differentially expressed. For real-time PCR, 1 µg mRNA was reverse transcribed and amplified, and amplification was detected with the Bio-Rad CFX Connect System (Bio-Rad). The primers were as follows: SCAMP3 from
The previous section introduced count-based DE in general terms: Each row of a count matrix is submitted to a statistical model (often by first estimating moderated variance parameters over the whole data set) and hypothesis tests of interest are conducted, with an adjustment for multiple testing. In this section, we discuss additional approaches to interrogate RNA-seq data in terms of DE. Although DE is of obvious interest, this can manifest or be defined in multiple ways (see Figure 9). One may want to cast inferences to the gene level, but measurements are made at the fragment level. We use the term “differential gene expression” (DGE) to refer to hypothesis testing related to the total outcome of an annotated gene, by comparing either accumulated TPM estimates or raw counts while including an adjustment for average transcript length via offsets (74). If the expression of transcripts is the feature of interest (independent of other transcripts), differential transcript expression (DTE) analyses can be conducted. Alternatively, one could be interested in whether at least one transcript from a gene is DE. This requires statistical testing at the transcript level and then aggregation to the gene level. Yet another strategy is to consider whether the relative abundance (i.e., proportions) of transcripts for a specific genomic locus changes between conditions, which is commonly termed differential transcript usage (DTU) or, more generally, differential splicing (DS). A surrogate for DTU, differential exon usage, is conducted on exon-level quantifications; in this case, the goal is to identify exons that deviate from proportional expression to separate differential usage from DE. Yet another alternative is to quantify and test differences at the event level, where reads supporting (or not supporting) an event (e.g., inclusion of a cassette exon) are summarized and compared (167).
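To make the distinction between DGE and DTU concrete, the following minimal sketch computes transcript usage proportions for one gene. All abundances are hypothetical toy values and the function name usage_proportions is introduced here for illustration only; real analyses test these quantities with replicate-aware statistical models rather than comparing point estimates.

```python
def usage_proportions(tx_abundance):
    """Relative abundance of each transcript within one gene."""
    total = sum(tx_abundance)
    if total == 0:
        # No expression at all: usage is undefined, return zeros.
        return [0.0 for _ in tx_abundance]
    return [x / total for x in tx_abundance]

# Hypothetical abundances of two transcripts of the same gene.
cond_a = [60.0, 40.0]    # reference condition
cond_b = [120.0, 80.0]   # total doubles, proportions unchanged: DGE without DTU
cond_c = [30.0, 70.0]    # total unchanged, proportions shift: DTU without DGE

for name, cond in (("a", cond_a), ("b", cond_b), ("c", cond_c)):
    print(name, sum(cond), usage_proportions(cond))
```

Comparing a with b changes the gene total but not the proportions, whereas comparing a with c leaves the total unchanged while the proportions move from 0.6/0.4 to 0.3/0.7.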
One major challenge in characterizing the effect of cell cycle on observed gene expression is that of measuring the cell cycle stage itself. Experimental approaches to accomplish this are varied. One method is to induce cell-cycle arrest, either by depleting factors driving progression between stages, by chemical treatments, or by inhibiting key pathways [Meijer, 1996]. Other techniques include using centrifugation to stratify cells by size (and by proxy, their stage) [Ly et al., 2014], or using flow cytometry to measure DNA content based on retention of a dye [Nunez, 2001]. Major drawbacks to these approaches include the labor intensiveness of the procedures, as well as potential side effects that could disrupt the biological system in undesirable ways. For these reasons, most single-cell gene expression datasets that do not directly aim to study the cell cycle are generally not accompanied by cell cycle stage annotations; that is, it is still uncommon for experimenters to annotate cell cycle stages as a matter of course. Consequently, though it is generally recognized that cell cycle effects exist and may be substantial, the magnitude of cell-cycle distortions to gene expression has not been precisely characterized, nor is it well-understood which genes are affected. The drawbacks of experimental approaches to cell cycle characterization motivate the application of statistical tools that can achieve the same goals.
limma integrates a number of statistical principles in a way that is effective for large-scale expression studies. It operates on a matrix of expression values, where each row represents a gene, or some other genomic feature relevant to the current study, and each column corresponds to an RNA sample. On the one hand, it fits a linear model to each row of data and takes advantage of the flexibility of such models in various ways, for example to handle complex experimental designs and to test very flexible hypotheses. On the other hand, it leverages the highly parallel nature of genomic data to borrow strength between the gene-wise models, allowing for different levels of variability between genes and between samples, and making statistical conclusions more reliable when the number of samples is small. All the features of the statistical models can be accessed not just for gene-wise expression analyses but also for higher level analyses of gene expression signatures. Figure 1 depicts the linear model and highlights the statistical principles employed in a typical limma analysis.
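The idea of fitting a linear model to each row can be sketched in a few lines. This is an illustration in plain Python rather than limma itself (which is an R package), it covers only a two-group design, and it deliberately omits the step that borrows strength between genes. The function name fit_row and all expression values are assumptions made for this example.

```python
def fit_row(y, x):
    """Ordinary least squares fit of y = intercept + slope * x for one gene.

    With x coded 0/1 for two groups, the slope equals the difference
    between the group means. Returns (intercept, slope, residual variance).
    """
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_var = sum(r * r for r in residuals) / (n - 2)
    return intercept, slope, residual_var

# Columns are samples: three controls (0) followed by three treated (1).
design = [0, 0, 0, 1, 1, 1]

# Rows are genes; values are hypothetical log-expression measurements.
expression = {
    "geneA": [5.0, 5.2, 4.8, 7.0, 7.1, 6.9],
    "geneB": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],
}

fits = {gene: fit_row(values, design) for gene, values in expression.items()}
for gene, (intercept, slope, residual_var) in fits.items():
    print(gene, round(intercept, 3), round(slope, 3), round(residual_var, 4))
```

Each gene receives its own coefficients and its own residual variance. With only three samples per group these gene-wise variances are noisy, which is the situation in which sharing information across the gene-wise models becomes valuable.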
In this study, it is shown that the gene dispersion value as estimated in the negative binomial modelling of read counts [13, 14] is the key determinant of the read count bias. We found that the read count bias in DE analysis of RNA-seq data was mostly confined to data with small gene dispersions such as technical replicate or some of the genetically identical (GI) replicate data (generated from cell lines or inbred model organisms). In contrast, the replicate data from unrelated individuals, denoted by unrelated replicates, had overall tens to hundreds of times greater gene dispersion values than those of technical replicate data, and DE analysis with such unrelated replicate data did not exhibit the read count bias except for genes with some small read counts (< tens). Such a pattern was observed for different levels of DE fold changes and sequencing depths. Although DE analysis of technical replicates is not meaningful, it is included to contrast the patterns and pinpoint the cause of read count bias. Lastly, it is shown that the sample-permuting gene-set enrichment analysis (GSEA) is highly affected by the read count bias and hence generates a considerable number of false positives, while the preranked GSEA does not generate false positives by the read count bias. See also the paper by Zheng and colleagues for other types of biases in quantifying RNA-seq gene expression rather than in DE analysis. We also note a recent study reporting that small dispersions result in high statistical power in DE analysis of RNA-seq data.
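A short calculation suggests why dispersion could govern the read count bias. The sketch assumes the common negative binomial parameterization, variance = mean + dispersion × mean², which may differ in detail from the models cited above; the dispersion values 0.001 and 0.5 are hypothetical stand-ins for technical and unrelated replicates.

```python
def nb_squared_cv(mean, dispersion):
    """Squared coefficient of variation under a negative binomial model.

    Assumes variance = mean + dispersion * mean**2, so that
    CV^2 = 1 / mean + dispersion.
    """
    variance = mean + dispersion * mean ** 2
    return variance / mean ** 2

low_count, high_count = 10, 1000

# Hypothetical dispersions: small (technical replicates) and large
# (replicates from unrelated individuals).
ratio_small_disp = nb_squared_cv(low_count, 0.001) / nb_squared_cv(high_count, 0.001)
ratio_large_disp = nb_squared_cv(low_count, 0.5) / nb_squared_cv(high_count, 0.5)

print(round(ratio_small_disp, 1), round(ratio_large_disp, 2))
```

Under this assumption, when the dispersion is small the relative noise is dominated by the 1/mean term, so high-count genes are measured far more precisely than low-count genes and are easier to call DE. When the dispersion is large, it dominates for all but the smallest counts and relative noise depends only weakly on read count.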
Since the microarray “truth sets” were based on HGNC symbols, we needed to convert the aligned and quantified transcript-level and gene-level counts to HGNC-level counts; for this, we used code adapted from the Williams et al. methods in conjunction with a conversion table provided by the authors. Using microarray “truth sets” also required an additional filter step to remove HGNC symbols detected by the microarray platform but not RNA-Seq (and vice versa). For this, we referenced the hgu133plus2.db, illuminaHumanv4.db, and illuminaHumanv2.db databases (from Bioconductor) to build an HGNC-level “gene universe” for each microarray platform. We then performed a simple set intersection between the microarray “gene universe” and the RNA-Seq HGNC-level “gene universe” prior to calculating precision and recall. Note that our “gene universes” likely differ from those used by Williams et al. (which are not available from their Additional files). As such, any gene uniquely present (or absent) in our universe could marginally change the measured performance. Specifically, the numerator or denominator of the precision and recall estimates could change by an offset up to the number of genes uniquely present (or absent) in our universe.
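The filtering and scoring step can be sketched as follows. The gene symbols, truth set, and calls are hypothetical toy inputs, and precision_recall is a helper introduced here for illustration; it is not the code adapted from Williams et al.

```python
def precision_recall(called, truth, universe_a, universe_b):
    """Precision and recall restricted to genes present in both universes."""
    shared = universe_a & universe_b      # simple set intersection
    called_in = called & shared           # drop calls outside the shared universe
    truth_in = truth & shared             # drop truth genes outside it as well
    true_pos = len(called_in & truth_in)
    precision = true_pos / len(called_in) if called_in else 0.0
    recall = true_pos / len(truth_in) if truth_in else 0.0
    return precision, recall

# Hypothetical universes: GAPDH is only on the array, ACTB only in RNA-Seq.
microarray_universe = {"TP53", "BRCA1", "MYC", "EGFR", "GAPDH"}
rnaseq_universe = {"TP53", "BRCA1", "MYC", "EGFR", "ACTB"}

truth_set = {"TP53", "MYC", "GAPDH"}      # DE genes according to the microarray
rnaseq_calls = {"TP53", "EGFR", "ACTB"}   # DE genes called from RNA-Seq

precision, recall = precision_recall(
    rnaseq_calls, truth_set, microarray_universe, rnaseq_universe
)
print(precision, recall)
```

Here GAPDH and ACTB are removed before scoring because each is visible to only one platform. Had either remained in the universe, the denominators of precision or recall would change, which is the sensitivity to universe definition described above.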
Here, we seek a better normalization procedure, focusing on two main questions: (1) Does the normalization improve DE detection (sensitivity) while reducing the false discovery rate? (2) Does the normalization result in low technical variability across replicates (specificity)? The standard procedure is to compute the proportion of each gene’s reads relative to the total number of reads, and compare that across all libraries, either by transforming the original data or by introducing a constant into a statistical model. Robinson et al. proposed a scale normalization method, the trimmed mean of M-values (TMM), which applies a two-sided symmetric trim to the log-fold-changes. Compared to the previous normalization, the method shows improved results for inferring differential expression in simulated and real data. However, the TMM method cannot normalize the data reasonably when the data are asymmetric, especially when the proportion of DE genes is large (Additional file 1: Figure S1). Excluding most of the genes may lose too much information, and a TMM normalization scale estimated with a symmetric trim will give biased results when the data are asymmetric. We develop a new method based on an iteration median of M-values (IMM) to normalize samples of different sequencing depths. The IMM method normalizes the libraries without a symmetric trim. The aim of the iteration process in IMM is to find an invariant set of non-DE genes and to use that invariant set to normalize the samples.
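The role of M-values in scale normalization can be illustrated with a single, non-iterative median. This sketch is neither TMM nor the IMM procedure, whose iteration details are not given here; it only shows, on hypothetical counts, how a median of M-values recovers a scaling factor when one highly expressed DE gene distorts the read proportions. The function name median_m_scale is an assumption.

```python
import math
import statistics

def median_m_scale(counts_a, counts_b):
    """Scaling factor from the median per-gene M-value.

    M is the log2 ratio of a gene's read proportion in library a to its
    proportion in library b; genes with a zero count are skipped.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    m_values = [
        math.log2((a / total_a) / (b / total_b))
        for a, b in zip(counts_a, counts_b)
        if a > 0 and b > 0  # M is undefined when either count is zero
    ]
    return 2 ** statistics.median(m_values)

# Hypothetical counts: the first four genes are not DE (library b is simply
# sequenced twice as deeply); the last gene is strongly up in b.
counts_a = [10, 20, 30, 40, 100]
counts_b = [20, 40, 60, 80, 800]

scale = median_m_scale(counts_a, counts_b)

# Read proportions alone make the non-DE genes look different ...
prop_a = [c / sum(counts_a) for c in counts_a]
prop_b = [c / sum(counts_b) for c in counts_b]
# ... while dividing by the median-based scale aligns them again.
adjusted_a = [p / scale for p in prop_a]

print(round(scale, 3), prop_a[0], prop_b[0], round(adjusted_a[0], 3))
```

In this toy case the DE gene takes up most of library b, so the proportions of the four non-DE genes differ by a factor of 2.5 between libraries; the median M-value identifies that factor because the majority of genes are unchanged.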
Due to the technical limitations of the PAX technique (low yield after additional methodological manipulation), this method was disregarded for PCR validation which was carried out on LL and TEM GC samples, as they represent better methods for GEP profiling. Six genes were chosen with a minimum of 1.5 fold difference between disease and controls for each method and real-time PCR resulted in the validation of 4 out of 6 genes for each condition, indicating that both methods present similar consistency and reliability. These similarities in consistency between the 2 methods may not represent similarities in differential regulation of gene pathways, and therefore GO functional annotation clustering analysis was carried out on the respective gene lists to ascertain putative differences/similarities. Considering the a) relatively few differentially regulated genes in common between the 2 methods, and b) the difference in starting material (whole blood versus leukocyte enriched), GO analysis revealed a striking similarity in the number of common cluster terms, and this was supported by similar observations when carrying out KEGG pathway analysis. It is noteworthy that the top cluster terms, ribosome-associated genes, in TEM GC were not present in LL, and this raises the question as to whether these are associated with pathological changes and/or mirror underlying differences in initial cell populations used as starting material for each extraction method. As there are only small differences between globin levels (depleted and undepleted isoforms) in TEM GC samples when comparing patients and controls (data not shown) we can discount any potential sampling/extraction bias leading to the observed differences in ribosome-associated gene expression.
Taking into account the relatively small number of ALS patients and controls in this study, conclusions regarding pathway analysis should be drawn cautiously; however, examples of pathways involved in ALS include defects in RNA processing, and thus changes in expression of ribosome-associated genes during disease may arise from the aggregation of proteins such as TDP-43 and FUS, and down-stream effects e.g. on stress granule formation arising from
Table 4. PCR primer sequences designed for genes validated
In this paper, we show that both the targeted alignment and spliced alignment approaches can be used in complementary ways to study TICs and gene fusions in individual cancer and normal samples assayed by deep transcriptional sequencing. We applied both methods to sets of single-end reads that we obtained by sequencing the transcriptomes of three primary human prostate adenocarcinomas (denoted by T1, T2, and T3) and their matched normal samples (N1, N2, and N3), as well as the human brain reference (HBR) and universal human reference (UHR) samples used in the Microarray Quality Control (MAQC) project [29,30]. One of the adenocarcinomas, T1, is ETS-negative, and the remaining adenocarcinomas are ETS-positive, with the positive expression of an ETS family member conferred by a TMPRSS2-ERG gene fusion. We also implemented filtering methods that are necessary to remove possible false positive alignments due to gene families or other homologous genes. After filtering, we were still left with sequence-based support for a large number of TIC events among our samples, which afforded us an opportunity to further characterize the phenomenon.
We tested the performance of Corset against other clustering and counting methods using three RNA-seq datasets: chicken male and female embryonic tissue, human primary lung fibroblasts, with and without a small interfering RNA (siRNA) knock down of HOXA1, and yeast grown under batch and chemostat conditions. We selected three model organisms in order to compare our de novo differential gene expression (DGE) results against a genome-based analysis (referred to herein as the truth dataset). In the chicken dataset we tested for DGE between males and females. The homology between chicken genes, which is around 90% on the sex chromosomes, offered a challenging test for clustering algorithms. The human dataset was selected because human is one of the best annotated species and the yeast was used to assess whether clustering is beneficial for organisms with minimal splicing. Each dataset was assembled using Trinity and Oases, which have different underlying assembly strategies, to ensure that the results were consistent. Overall, six different assemblies were used as a starting point for the evaluation of Corset.
Results: The anatomical structure and disease symptoms of grapevine leaves were analyzed for two grapevine species, and the critical period of resistance of grapevine to pathogenic bacteria was determined to be 12 h post inoculation (hpi). Differentially expressed genes (DEGs) were identified from transcriptome analysis of leaf samples obtained at 12 and 36 hpi, and the transcripts in four pathways (cell wall genes, LRR receptor-like genes, WRKY genes, and pathogenesis-related (PR) genes) were classified into four co-expression groups by using weighted correlation network analysis (WGCNA). The gene VdWRKY53, showing the highest transcript level, was introduced into Arabidopsis plants by using a vector containing the CaMV35S promoter. These procedures allowed identifying the key genes contributing to differences in disease resistance between a strongly resistant accession of a wild grapevine species Vitis davidii (VID) and a susceptible cultivar of V. vinifera, ‘Manicure Finger’ (VIV). Vitis davidii, but not VIV, showed a typical hypersensitive response after infection with a fungal pathogen (Coniella diplodiella) causing white rot disease. Further, 20 defense-related genes were identified, and their differential expression between the two grapevine species was confirmed using quantitative real-time PCR analysis. VdWRKY53, showing the highest transcript level, was selected for functional analysis and therefore over-expressed in Arabidopsis under the control of the CaMV35S promoter. The transgenic plants showed enhanced resistance to C. diplodiella and to two other pathogens, Pseudomonas syringae pv. tomato DC3000 and Golovinomyces cichoracearum.
Data generated by the nCounter system have to be normalized, to account for sample preparation variation, sample content variation, and background noise, etc., before they can be used to quantify gene expression and conduct any downstream statistical analysis. Here, the availability of data from FFPE samples would allow us to explore major characteristics of such data and examine key assumptions/hypotheses about the mean structure of the data, when developing a new normalization method that aims to improve existing ones. Meanwhile, the availability of data from paired FF samples would enable us to quantitatively assess and compare the performance of any normalization methods developed for the nCounter system. Due to the lack of ground truth, it is generally difficult to compare the performance of different normalization methods on real data. Nevertheless, the data from FF samples, once available, can be used to provide a surrogate of the truth. This is because FF tissues are known to maintain RNA very well (much lower degradation of RNA and no methylene crosslink between RNA and proteins) and thus are considered as a gold standard for most molecular assays (Solassol et al., 2011).
Like all other existing imputation methods [24–27], we have been primarily focused on modeling and imputing log-transformed normalized gene expression data that are converted from the original count data. Modeling log-transformed normalized expression data assumes application of a normalization method as a pre-processing step. Though the performance of normalization methods varies in different settings, VIPER does not depend on the choice of a particular normalization method. While in the present study we have only examined a relatively simple normalization method based on RPM, we note that using advanced normalization offsets or including cellular detection rate may further improve the performance of VIPER. Importantly, modeling log-transformed normalized gene expression data using Gaussian models is computationally more tractable than modeling count data using over-dispersed Poisson models (e.g., negative binomial or Poisson mixed models) [42–44]. Because of this computational tractability, modeling log-transformed normalized gene expression data is commonly applied in scRNAseq studies for clustering analysis, differential expression analysis, and various other analytic tasks [20, 22, 45]. However, scRNAseq data are of count nature. Because of the relatively low sequencing depth of scRNAseq, accounting for the mean and variance relationship by modeling the original count data directly often has added benefits [20, 22, 23]. Therefore, extending our method to model and impute the count data from scRNAseq directly while properly accounting for the over-dispersion or dropout events will likely improve imputation accuracy further, especially for data with lower per-cell read depth such as those collected from the 10x Genomics platform.
In addition, for study designs where the bulk tissue sequencing and scRNAseq are applied to the same cell content, incorporating bulk data as prior information for imputation of scRNAseq data will likely further improve accuracy. Exploring and benchmarking this strategy will also be a promising direction.
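The pre-processing step referred to above, converting counts to log-transformed RPM values, can be sketched as follows. The pseudocount of 1, the base-2 logarithm, and the function name log_rpm are assumptions made for illustration and are not necessarily the choices used with VIPER; the counts are hypothetical.

```python
import math

def log_rpm(counts, pseudocount=1.0):
    """Log2-transformed reads per million for one cell.

    The pseudocount keeps zero counts finite after the log transform.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("cell has no reads")
    return [math.log2(c / total * 1e6 + pseudocount) for c in counts]

# Hypothetical counts for three genes in two cells of different depth.
shallow_cell = [0, 5, 995]
deep_cell = [0, 50, 9950]

print([round(v, 3) for v in log_rpm(shallow_cell)])
print([round(v, 3) for v in log_rpm(deep_cell)])
```

The two cells yield identical transformed values although one has ten times as many reads, which is the purpose of RPM scaling. The zero count maps to exactly 0 in both cells, so the transformed value alone cannot distinguish a dropout from true absence of expression.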