
2.2 Connecting mitochondrial and transcriptional variability

2.2.1 RNA-seq and data processing

Cells stained with MitoTracker green were sorted into two subpopulations with high and low mitochondrial content respectively. The difference in mitochondrial mass across subpopulations was about 5-fold. From each population, RNA was extracted, purified and sequenced. This procedure was repeated for three cell lines: HeLa (3 biological replicates), Jurkat (3 biological replicates) and MRC-5 (2 biological replicates). More information on experimental methods such as cell culture, sorting, RNA extraction, etc. can be found in appendix A.

HeLa is the most commonly used human cell line in biological research. It is derived from cervical cancer.

Jurkat is a strain of human T lymphocytes originally obtained from the blood of a patient with T cell leukemia.

MRC-5 are fibroblasts derived from lung tissue of a human fetus.

Sequenced reads were first passed to FastQC v0.11.8 [153] for quality check, finding no significant presence of adapter sequences in any sample. From sequenced reads, transcript abundance was quantified using the quasi-mapping mode of Salmon v0.12.0 [154] with default settings, and mappings were validated using the alignment-based mode. The Salmon index was built using the cDNA file of the Homo sapiens genome, version GRCh38 from Ensembl [155]. Downstream analyses were performed using R v3.5.2 [156]. Additional data was retrieved from the Ensembl BioMart tool with the aid of the biomaRt v2.38.0 package [157, 158]. Transcript counts were aggregated at the gene level using the tximport v1.10.1 package [159]. Differential expression analyses were done with the DESeq2 v1.22.2 package [160].
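As an illustration of the gene-level aggregation step, the following minimal Python sketch sums transcript-level abundances into gene-level abundances. It is not the tximport implementation (which runs in R and also handles counts and lengths); it only shows the additive summary that applies to abundances such as TPM. The function name, identifiers and values are hypothetical.

```python
from collections import defaultdict


def aggregate_to_gene(tx_abundance, tx2gene):
    """Sum transcript-level abundances (e.g. TPM) into gene-level abundances.

    tx_abundance: {transcript_id: abundance}
    tx2gene:      {transcript_id: gene_id}
    """
    gene_abundance = defaultdict(float)
    for tx_id, value in tx_abundance.items():
        gene_abundance[tx2gene[tx_id]] += value
    return dict(gene_abundance)


# Hypothetical identifiers and values, for illustration only.
tx_abundance = {"tx1": 10.0, "tx2": 5.0, "tx3": 2.5}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_abundance = aggregate_to_gene(tx_abundance, tx2gene)
print(gene_abundance)
```

In practice the transcript-to-gene mapping would come from an annotation source such as the one retrieved through BioMart.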

Expression levels were obtained in units of TPM (transcripts per million). TPM values in the “low” mitochondria condition were corrected by a strain-specific factor to account for the global differences in per-cell RNA abundance across subpopulations with high and low mitochondrial mass. To obtain this factor, cells were stained with MitoTracker green and the mRNA content per cell was checked by poly(T) mRNA FISH. Fluorescence images were analyzed with the ImageJ software [161]. Cells with high and low mitochondrial content were selected and the average poly(T) intensity of both subpopulations was quantified. The ratio (“low”/“high”) between said intensities (equal to 0.37 in HeLa) was used to scale TPM values in the “low” condition.
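The correction described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this work: the function names are hypothetical, and the intensities and TPM values are invented so that the ratio matches the 0.37 reported for HeLa.

```python
from statistics import mean


def low_to_high_ratio(low_intensities, high_intensities):
    """Ratio of the average poly(T) intensities ("low" / "high")."""
    return mean(low_intensities) / mean(high_intensities)


def scale_low_condition(tpm_low, factor):
    """Multiply every TPM value of the "low" condition by the global factor."""
    if factor <= 0:
        raise ValueError("scaling factor must be positive")
    return {feature: value * factor for feature, value in tpm_low.items()}


# Hypothetical per-cell poly(T) intensities (arbitrary units).
low_intensities = [30.0, 40.0, 41.0]
high_intensities = [90.0, 100.0, 110.0]
ratio = low_to_high_ratio(low_intensities, high_intensities)

# Hypothetical TPM values in the "low" condition.
tpm_low = {"geneA": 100.0, "geneB": 20.0}
tpm_low_scaled = scale_low_condition(tpm_low, ratio)
print(ratio, tpm_low_scaled)
```

After scaling, the values of the “low” condition no longer sum to one million; they are meant to be compared with the unscaled “high” condition as a proxy for relative per-cell abundance.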

Transcripts and genes expressed under a threshold were discarded. This threshold (detection limit, DL) was estimated as follows [162]: all features (transcripts and genes) with at least one zero and one non-zero expression value in any condition were selected. All non-zero values of this subset were listed, and the DL was taken as the median of their distribution (0.5 TPM).
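A possible reading of this procedure is sketched below in Python. Two points are assumptions rather than details taken from the text: a feature is selected when the replicates of at least one condition contain both a zero and a non-zero value, and all non-zero values of a selected feature (across every condition) enter the median. The function name and the toy data are hypothetical.

```python
from statistics import median


def detection_limit(expression):
    """Estimate the detection limit from features with mixed zero/non-zero values.

    expression: {feature: {condition: [replicate values in TPM]}}
    """
    nonzero_values = []
    for conditions in expression.values():
        # Assumption: "mixed" is evaluated within the replicates of one condition.
        mixed = any(
            any(v == 0 for v in reps) and any(v > 0 for v in reps)
            for reps in conditions.values()
        )
        if mixed:
            nonzero_values.extend(
                v for reps in conditions.values() for v in reps if v > 0
            )
    if not nonzero_values:
        raise ValueError("no feature mixes zero and non-zero values")
    return median(nonzero_values)


# Hypothetical expression values, for illustration only.
expression = {
    "geneA": {"high": [0.0, 0.4, 0.6], "low": [0.5, 0.7, 0.3]},    # selected
    "geneB": {"high": [12.0, 10.0, 11.0], "low": [9.0, 8.0, 7.5]},  # never zero
    "geneC": {"high": [0.0, 0.0, 0.0], "low": [0.0, 0.0, 0.0]},    # always zero
}

dl = detection_limit(expression)
print(dl)
```

The intuition is that features fluctuating between detected and undetected sit close to the sensitivity limit of the assay, so their non-zero values give an empirical estimate of that limit.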

A note on units of expression

Understanding the units of transcript expression is key to performing consistent downstream analyses. In general, RNA-seq experiments do not provide a quantification of the RNA copy numbers per cell. The sequenced RNA is typically obtained from populations of cells (although modern single-cell sequencing techniques also exist), with a fixed library size (i.e. number of reads to be sequenced). Thus, when quantifying the expression of a given transcript from RNA-seq data, we are usually studying what fraction of the sequenced reads comes from each type of transcript in the sample, but obtaining absolute copy numbers requires additional steps. One example is the inclusion of spike-ins: transcripts of known length and well-characterized quantity used to calibrate measurements in RNA-seq and other similar assays.

The most straightforward way to report the expression of a transcript is to simply give the raw number of reads that were aligned to it, that is, that came from the processing of a transcript of that type. Some common bioinformatic tools use this as input. However, there are important problems when attempting to relate these raw counts to the true level of expression of a transcript (namely the per-cell copy number):

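• The number of reads assigned to a transcript scales with the library size, so raw counts cannot be compared directly across samples sequenced to different depths.

• RNA-seq entails the breakdown of transcripts into fragments for sequencing, so longer transcripts produce increased raw fragment counts. This bias is particularly important when comparing the level of expression of two transcripts within the same sample.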

To enable comparisons across different transcripts in a sample and across samples with different library sizes, these effects need to be taken into account. If we denote the number of counts of the transcript $i$ as $x_i$, a straightforward way to remove the effect of the library size is to use counts per million (CPM):

$$\mathrm{CPM}_i = \frac{x_i}{X} \cdot 10^6 \tag{2.6}$$

where $X$ is the total number of reads. CPMs are still biased by transcript length. In fact, the magnitude of the bias for the $i$-th transcript is determined by the so-called effective length $\tilde{l}_i$, computed as:

$$\tilde{l}_i = l_i + 1 - \mu \tag{2.7}$$

where $l_i$ is the true length of the transcript and $\mu$ is the mean of the fragment length distribution of the library. The effective length is interpreted as the number of possible start sites at which a transcript could have generated a fragment of length $\mu$. Normalizing the raw counts by this effective length as well as by the total number of reads gives fragments per kilobase of exon per million reads (FPKM):

$$\mathrm{FPKM}_i = \frac{x_i}{X \cdot \tilde{l}_i} \cdot 10^9 \tag{2.8}$$

Finally, if the normalization by the library size is done using counts scaled by their effective length (instead of raw counts), transcripts per million (TPM) are obtained:

$$\mathrm{TPM}_i = \frac{x_i}{\tilde{l}_i} \cdot \left( \frac{1}{\sum_j x_j / \tilde{l}_j} \right) \cdot 10^6 = \frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j} \cdot 10^6 \tag{2.9}$$
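The equivalence between the two forms of the TPM can be checked directly from the definition of the FPKM. The short derivation below uses only the symbols already introduced ($x_i$, $X$, $\tilde{l}_i$): the library size and the factor $10^9$ appear in both numerator and denominator and cancel, and summing over all transcripts shows that TPM values always add up to $10^6$.

```latex
\begin{align*}
\frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j} \cdot 10^6
  &= \frac{\dfrac{x_i}{X\,\tilde{l}_i} \cdot 10^9}
          {\sum_j \dfrac{x_j}{X\,\tilde{l}_j} \cdot 10^9} \cdot 10^6
   = \frac{x_i/\tilde{l}_i}{\sum_j x_j/\tilde{l}_j} \cdot 10^6
   = \mathrm{TPM}_i , \\
\sum_i \mathrm{TPM}_i
  &= \frac{\sum_i x_i/\tilde{l}_i}{\sum_j x_j/\tilde{l}_j} \cdot 10^6
   = 10^6 .
\end{align*}
```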

TPMs are, in principle, unbiased with respect to transcript length and library size [163]. They simply represent the fraction of transcripts of each type within a pool of RNAs (times a factor $10^6$). TPMs are still not equivalent to transcript copy numbers, but independent experiments quantifying total RNA per cell can be used to make the conversion [150, 164].
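The three units can be compared on a toy example. The Python sketch below implements the effective length, CPM, FPKM and TPM as defined in this section; the function names, counts, transcript lengths and mean fragment length are hypothetical and chosen only to keep the arithmetic simple.

```python
def effective_lengths(lengths, mean_fragment_length):
    """Effective length: true length + 1 - mean fragment length."""
    eff = [length + 1 - mean_fragment_length for length in lengths]
    if any(value <= 0 for value in eff):
        raise ValueError("transcript shorter than the mean fragment length")
    return eff


def cpm(counts):
    """Counts per million: corrects for library size only."""
    total = sum(counts)
    return [x / total * 1e6 for x in counts]


def fpkm(counts, eff_lengths):
    """Fragments per kilobase per million: corrects for length and library size."""
    total = sum(counts)
    return [x / (total * l) * 1e9 for x, l in zip(counts, eff_lengths)]


def tpm(counts, eff_lengths):
    """Transcripts per million: length-scaled counts normalized to 1e6."""
    rates = [x / l for x, l in zip(counts, eff_lengths)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]


# Hypothetical toy data: three transcripts, mean fragment length of 200.
counts = [100, 200, 700]
lengths = [1199, 2199, 899]
eff = effective_lengths(lengths, 200)  # [1000, 2000, 700]

cpm_values = cpm(counts)
fpkm_values = fpkm(counts, eff)
tpm_values = tpm(counts, eff)

# TPM recomputed from FPKM, following the second form of the definition.
tpm_from_fpkm = [f / sum(fpkm_values) * 1e6 for f in fpkm_values]
print(cpm_values, fpkm_values, tpm_values)
```

In this example the second transcript has twice the counts of the first, and therefore twice the CPM, but it is also twice as long, so both end up with the same TPM.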

A note on global scaling

When performing bulk RNA-seq, RNA is extracted from cell populations. This inevitably implies a loss of information on the states of single individuals. In our case, we know that each cell’s mitochondrial content determines the abundances of RNA (as well as protein and other components), but these differences are not captured by bulk RNA-seq data. To minimize this effect, we scaled TPM values in the “low” condition by a factor extracted from independent experiments quantifying total RNA per cell. By doing so, we are neglecting the cell-to-cell differences within our subpopulations (with “high” or “low” mitochondrial content). We assume that the introduction of this global scaling factor will yield a good estimation of the per-cell content of individual transcripts, but this is not necessarily the case: potential sources of variability that are independent of mitochondria could induce significant cell-to-cell differences in the copy numbers of specific RNAs. For them, averaging over the whole subpopulation would provide an inaccurate description of individual cells.

In summary, even though RNA-seq is a powerful tool to investigate the effect of a variable on the gene expression landscape, we need to be aware of the bias introduced by assuming that all cells within an ensemble are identical. On the other hand, single-cell transcriptomics (which in principle bypasses this issue) has a high degree of experimental variability because of the need to sequence very small amounts of genetic material, which then has to be amplified using inherently noisy protocols.

In document Mitochondrial control of gene expression and extrinsic apoptosis (Page 48-51)