2.6 Measuring gene expression and quality control

2.6.2 RNA-seq processing and quality control

We retained RNA-seq reads passing the Illumina chastity filter and mapped reads to a reference sequence composed of ERCC control fragments and all chromosomes and contigs from GRCh37/hg19, excluding alternate haplotypes, replacing chromosome M with the Cambridge Reference Sequence and masking the pseudoautosomal region on chromosome Y. We aligned reads using STAR v2.3.1y [85] with default parameters and a splice junction catalogue based on GENCODE v19 [146]. Duplicate read pairs were retained. Non-uniquely mapping reads and read pairs with unpaired alignments were discarded.

We performed RNA-seq QC at the level of read groups, defined as a library on a lane, using QoRTs v1.1.18 [147]. Seeing little variation from lane to lane, we summarised the QoRTs measurements by taking the mean for each sample. We looked for outliers using a variety of measures including GC content, transcriptional diversity, and gene body coverage. There were no outliers for GC content. For transcriptional diversity, we calculated the distribution of the fraction of total transcription in 500 roughly equal count bins, according to the median counts for each gene, then compared each sample to the median of all samples using the Kolmogorov-Smirnov test (ks.test function in R), dropping 7 outlier samples (p-value < 0.01; Figure 2.3). Many of the samples removed as outliers in the transcriptional diversity analysis also had a decreased fraction of skeletal muscle when included in tissue heterogeneity estimates. Most notably, all 4 samples with an estimated skeletal muscle fraction < 90% were dropped based on transcriptional diversity measures. For gene body coverage, we compared samples based on the fraction of reads in 40 bins along the normalised length of all genes; we dropped 4 samples as outliers based on their coverage at the 3' end, in the (0.9,0.925] bin, possibly indicating RNA degradation.
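The transcriptional diversity comparison can be illustrated with a minimal Python sketch. It is not the analysis code: it assumes that bins hold roughly equal numbers of genes ordered by median count, uses a toy data set with 2 bins instead of 500, and computes only the two-sample Kolmogorov-Smirnov statistic (the maximum gap between the two empirical CDFs) rather than the p-value reported by ks.test in R. All function and variable names are hypothetical.

```python
from bisect import bisect_right
from statistics import median


def bin_fractions(sample_counts, gene_order, n_bins):
    """Fraction of one sample's total counts falling in each bin of genes."""
    total = sum(sample_counts[g] for g in gene_order)
    fractions = []
    for b in range(n_bins):
        start = b * len(gene_order) // n_bins
        stop = (b + 1) * len(gene_order) // n_bins
        in_bin = sum(sample_counts[g] for g in gene_order[start:stop])
        fractions.append(in_bin / total)
    return fractions


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


# Toy counts (sample -> gene -> count); values are illustrative only.
counts = {
    "s1": {"g1": 10, "g2": 20, "g3": 30, "g4": 40},
    "s2": {"g1": 12, "g2": 18, "g3": 33, "g4": 37},
    "s3": {"g1": 1, "g2": 1, "g3": 1, "g4": 97},
}
genes = ["g1", "g2", "g3", "g4"]
n_bins = 2  # the text uses 500 bins; 2 keeps the toy example readable

# Order genes by their median count across samples (assumed binning rule).
gene_order = sorted(genes, key=lambda g: median(c[g] for c in counts.values()))

profiles = {s: bin_fractions(c, gene_order, n_bins) for s, c in counts.items()}

# Reference profile: per-bin median across all samples.
reference = [median(p[b] for p in profiles.values()) for b in range(n_bins)]

ks_by_sample = {s: ks_statistic(p, reference) for s, p in profiles.items()}
print(ks_by_sample)
```

In this toy data, the sample dominated by a single gene (s3) is the one whose profile departs from the reference; a real analysis would convert the statistic to a p-value before applying the 0.01 threshold.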

We used verifyBamID v1.1.1 [175] with the parameters “--ignoreRG --precise --best --maxDepth 100” to remove RNA-seq samples composed of reads derived from more than one individual and to identify sample swaps by comparing transcribed SNPs to SNP chip genotype data. We identified two pairs of sample swaps and removed one sample that showed high levels of contamination (~8%). In addition, we removed six intentional replicates, one unintentional replicate, one participant of non-Finnish ancestry, and one of 2 pairs of first-degree relatives. After all exclusions, there were 301 muscle RNA-seq samples available for analysis (Table 2.2).

[Figure 2.3 plot. x-axis: Gene Counts; y-axis: Cumulative Fraction; legend: _drop (FALSE, TRUE).]

Figure 2.3 Transcriptional diversity of each sample. Red samples exhibit increased transcriptional diversity and were dropped. The blue line shows the median across all samples.

We previously [341] quantified gene expression as fragments per kilobase per million reads (FPKM) [266, 386]. This unit of measure accounts for bias in transcript length, where even if expressed at the same level, longer transcripts have more reads because they produce more molecules in the fragmentation step of Illumina sequencing. In addition, the FPKM unit of measure controls for variable sequencing depth (total number of reads obtained in one sequencing run), which if not accounted for would make genes equally expressed in two samples appear more expressed in the sample with greater sequencing depth. Within a sample, for a gene, g, the FPKM is calculated as:

$$\mathrm{FPKM}_g = \frac{c_g \times 10^{9}}{l_g N} \tag{2.1}$$

where c_g is the number of fragments mapping to the exons of gene g, l_g is the length of the gene (sum of exons; the number of possible start positions for a fragment), and N is the sequencing depth of the sample (number of mapped reads).
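As a worked illustration of Equation 2.1, the following Python sketch computes FPKM for three hypothetical genes. The function name and all input values are assumptions made for the example, and the 10^9 scaling factor follows the standard FPKM definition. The inputs are chosen so that a longer gene and a deeper sequencing run each yield proportionally more fragments, yet all three cases produce the same FPKM.

```python
def fpkm(fragment_count, gene_length, mapped_reads):
    """FPKM for one gene: c_g * 1e9 / (l_g * N)."""
    if gene_length <= 0 or mapped_reads <= 0:
        raise ValueError("gene_length and mapped_reads must be positive")
    return fragment_count * 1e9 / (gene_length * mapped_reads)


# Hypothetical values, for illustration only.
# Baseline: 500 fragments on a 1 kb gene at 20 million mapped reads.
baseline = fpkm(fragment_count=500, gene_length=1_000, mapped_reads=20_000_000)

# Same expression, gene twice as long: twice the fragments, same FPKM.
longer_gene = fpkm(fragment_count=1_000, gene_length=2_000, mapped_reads=20_000_000)

# Same expression, twice the depth: twice the fragments, same FPKM.
deeper_run = fpkm(fragment_count=1_000, gene_length=1_000, mapped_reads=40_000_000)

print(baseline, longer_gene, deeper_run)
```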

Transcripts per million (TPM) is another expression measurement unit, slightly different from FPKM, and is thought to be a more accurate measurement of relative molar RNA concentration [405, 211]. Within a sample, for gene g among n total genes, the TPM is calculated as:

$$\mathrm{TPM}_g = \frac{c_g}{s_g} \times \frac{1}{\sum_{j=1}^{n} \frac{c_j}{s_j}} \times 10^{6} \tag{2.2}$$

where c is the number of fragments mapping to a gene’s exons and s is the effective gene length, defined as the length of the unioned exons of a gene minus the median insert length of the sample. Given these reports on TPM [405, 211], we changed the unit of measure for gene expression from FPKM to TPM. However, for exon expression, it was unclear how to calculate the effective length because, unlike for genes, the exon fragment length is often smaller than the insert size, in which case the effective length, s, would be negative. Therefore, we used FPKM as the unit of measure for exon fragments.
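The TPM calculation, and the reason it breaks down for short exon fragments, can be sketched in Python as follows. This is a minimal illustration under the definitions given in the text, not the pipeline used for the analysis; the gene names, counts, lengths, and insert length are hypothetical.

```python
def effective_length(exon_union_length, median_insert_length):
    """Effective length s: unioned exon length minus the median insert length."""
    return exon_union_length - median_insert_length


def tpm(counts, eff_lengths):
    """TPM per gene: (c_g / s_g) / sum_j(c_j / s_j) * 1e6."""
    rates = {}
    for gene, c in counts.items():
        s = eff_lengths[gene]
        if s <= 0:
            raise ValueError(f"non-positive effective length for {gene}: {s}")
        rates[gene] = c / s
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical inputs, for illustration only.
median_insert = 200
exon_union = {"geneA": 1200, "geneB": 1700, "geneC": 3200}
counts = {"geneA": 100, "geneB": 300, "geneC": 600}

eff = {g: effective_length(l, median_insert) for g, l in exon_union.items()}
tpm_values = tpm(counts, eff)
print(tpm_values)

# A 150 bp exon fragment with a 200 bp median insert gives a negative
# effective length, so the TPM formula cannot be applied to it.
short_exon_eff = effective_length(150, median_insert)
print(short_exon_eff)
```

By construction, the TPM values within a sample sum to one million, which is what makes them comparable as relative proportions across samples.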

Definitions for all transcriptome features were based on GENCODE v19 [146]. We counted fragments mapping to genes using htseq-count v0.5.4 [10], and used QoRTs to parse GENCODE v19 exon annotations into non-overlapping fragments and count exon reads. To reduce the number of transcripts per gene, to avoid identifiability issues, and to restrict analysis to high-confidence transcripts, we estimated transcript expression values for the subset of GENCODE transcripts with the tag “basic” in the GTF file.

After the above exclusions and swaps, we adjusted the gene expression TPMs for age, sex, batch, and RIN, and performed PCA on the residuals to look for additional outliers. We selected the first 2 PCs, which together accounted for > 40% of the variance explained, and transformed them to z-scores. We found no striking outliers that warranted removal, defined as |z-score| > 5 (Figure 2.4). For subsequent linear models, we filtered for genes with ≥ 5 counts in > 25% of samples and inverse normalised the TPMs. Repeating the PCA on the filtered expression data did not change the outlier decisions.
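The gene filter and inverse normalisation can be sketched in Python as below. The text does not specify which rank-based transform was used, so this sketch assumes a common variant that maps average ranks to normal quantiles with an offset of 0.5; the function names and example values are hypothetical.

```python
from statistics import NormalDist


def passes_count_filter(gene_counts, min_count=5, min_fraction=0.25):
    """True if the gene has >= min_count counts in > min_fraction of samples."""
    n_pass = sum(1 for c in gene_counts if c >= min_count)
    return n_pass / len(gene_counts) > min_fraction


def inverse_normalise(values):
    """Rank-based inverse normal transform; tied values share an average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    standard_normal = NormalDist()
    # Assumed offset of 0.5 keeps every probability strictly inside (0, 1).
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical per-sample counts for two genes.
kept = passes_count_filter([0, 6, 7, 1])     # 2 of 4 samples pass
dropped = passes_count_filter([0, 0, 0, 9])  # 1 of 4 samples: not > 25%

# Hypothetical TPMs for one gene across four samples.
tpms = [3.2, 0.5, 10.1, 7.7]
transformed = inverse_normalise(tpms)
print(kept, dropped, transformed)
```

The transform preserves the ordering of samples while forcing each gene's values onto a standard normal scale, which limits the influence of extreme TPMs in the linear models.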

[Figure 2.4 plots. (a) Cumulative variance explained; x-axis: Principal components; y-axis: Cumulative variance explained. (b) PCs; x-axis: PC1; y-axis: PC2.]

Figure 2.4 Expression PCA. (a) Cumulative variance explained by each PC. (b) No outliers were identified in the first 2 PCs.

In document A genetic analysis of molecular traits in skeletal muscle (Page 71-75)