RNA-seq analysis - Chapter 2: Materials and Methods

2. Chapter 2: Materials and Methods

2.3. RNA-seq analysis

2.3.1. Read cleaning and filtering

Paired-end RNA-seq reads, which had already been adapter-trimmed by the DNA sequencing facility at the Centre for Haematology, Division of Experimental Medicine, Faculty of Medicine, Imperial College London, were processed with FastQC version 0.11.3 to confirm read quality. No further quality processing was necessary.

2.3.2. RNA-seq quantification

The paired-end reads for each sample were used with the pipeline_rnaseqdiffexpression pipeline from CGATPipelines (CGAT_Developers, 2018) and Salmon version 0.11.4 (Patro et al., 2017) to estimate how many reads mapped to each quantified gene (including all its associated transcripts). Salmon was run with fragment GC bias correction, 100 bootstrap samples and an auxiliary k-mer hash over k-mers of length 31. A reference geneset was produced from the reference genome for this task. The quantification took into account the nature of the reads (uniquely mapping or multimapping to each transcript) and the relative abundance estimate for each transcript. For each sample, the number of reads for each gene was rounded to the nearest integer, and a table was created with the integer number of reads for each gene in the reference geneset for each sample.
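The rounding and table-building step can be sketched as follows. This is a minimal illustration, not the CGATPipelines code: the function names, the nested-dictionary input layout and the tie-breaking rule (halves rounded up) are assumptions, as is treating a gene absent from a sample as having zero reads.

```python
import math


def round_half_up(value: float) -> int:
    """Round a non-negative estimate to the nearest integer (ties go up; assumed rule)."""
    return math.floor(value + 0.5)


def build_count_table(estimates):
    """Turn {sample: {gene: estimated reads}} into integer counts per gene.

    Returns the ordered sample names and a {gene: [count per sample]} table.
    Genes missing from a sample are given a count of 0 (assumption).
    """
    samples = sorted(estimates)
    genes = sorted({gene for per_gene in estimates.values() for gene in per_gene})
    table = {
        gene: [round_half_up(estimates[sample].get(gene, 0.0)) for sample in samples]
        for gene in genes
    }
    return samples, table


if __name__ == "__main__":
    # Hypothetical gene-level estimates for two samples.
    example = {"S1": {"GENE_A": 10.4, "GENE_B": 0.5}, "S2": {"GENE_A": 3.6}}
    names, counts = build_count_table(example)
    print(names)
    for gene, row in counts.items():
        print(gene, row)
```

Integer counts are needed because DESeq2 models raw counts, so fractional abundance estimates have to be rounded before testing.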

2.3.3. RNA-seq mapping

The pipeline_mapping pipeline from CGATPipelines was used to map the RNA-seq reads. Briefly, a reference geneset was created starting from the geneset for the human genome by filtering out mitochondrial and non-standard chromosomes and removing long (>2Mb) introns, very short (<5bp) introns and ribosomal RNA. From this geneset, only the protein-coding transcripts were retained and the splice junctions were curated. These known splice junctions and the paired-end reads from each sample were supplied to Hisat version 0.1.6 (D. Kim et al., 2015), which maps the reads using an index for the reference genome. The resulting mapped files were used for the RNA-seq signals shown throughout this thesis.
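The intron length criteria can be illustrated with a short sketch. It is not the pipeline_mapping implementation: the function names are hypothetical, exons are assumed to be 0-based half-open (start, end) pairs on one transcript, 2Mb is taken as 2,000,000bp, and the sketch only reports whether a transcript's introns fall within the limits, without deciding what the pipeline does with a transcript that fails.

```python
MIN_INTRON_BP = 5            # introns shorter than this are "very short"
MAX_INTRON_BP = 2_000_000    # introns longer than this are "long" (2Mb, assumed 2,000,000bp)


def intron_lengths(exons):
    """Return the gaps between consecutive exons given as (start, end) pairs."""
    ordered = sorted(exons)
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def has_acceptable_introns(exons):
    """True when every intron is between 5bp and 2Mb inclusive."""
    return all(MIN_INTRON_BP <= length <= MAX_INTRON_BP for length in intron_lengths(exons))


if __name__ == "__main__":
    print(has_acceptable_introns([(100, 200), (300, 400)]))  # 100bp intron
    print(has_acceptable_introns([(100, 200), (203, 400)]))  # 3bp intron
```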

2.3.4. RNA-seq quality control statistics

The following statistics were generated as part of the analysis:

• RNA Starting read pairs: Read pairs obtained.

• RNA Mapping rate (Salmon): Percentage of the read pairs quantified in a gene transcript (RNA-seq quantification, section 2.3.2).

• RNA Mapped read pairs: Calculated by multiplying starting read pairs by mapping rate.
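As a worked example of the last statistic, the sketch below multiplies the starting read pairs by the mapping rate. The function name is hypothetical, the rate is assumed to be expressed as a percentage, and rounding the product to a whole number of read pairs is an assumption.

```python
def mapped_read_pairs(starting_read_pairs: int, mapping_rate_percent: float) -> int:
    """Mapped read pairs = starting read pairs x mapping rate (given as a percentage)."""
    if not 0.0 <= mapping_rate_percent <= 100.0:
        raise ValueError("mapping rate must be a percentage between 0 and 100")
    # Rounding to a whole read pair is an assumption of this sketch.
    return round(starting_read_pairs * mapping_rate_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical sample: one million read pairs with an 85% mapping rate.
    print(mapped_read_pairs(1_000_000, 85.0))
```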

2.3.5. Obtaining annotated and unannotated Transcription Start Sites

Unannotated TSS present in the PC and MM samples (primary samples and cell lines) were obtained using the pipeline for the detection of alternative polyadenylation (Sudbery, 2019a) developed by Dr. Ian Sudbery. Briefly, Stringtie version 1.2.3 (Pertea et al., 2015) was used with each sample’s RNA-seq mappings (RNA-seq mapping, section 2.3.3) in conjunction with the geneset from the human genome to generate novel transcripts with alternate exon usage.

Novel transcripts consisting of a single exon were removed (on the assumption that they were eRNA transcripts). The remaining novel transcripts were transformed using GNU Awk version 3.1.7 to obtain the 1bp region representing their strand-aware five prime end. To obtain promoter regions, this strand-aware five prime end was extended 2kb upstream of the TSS, so that the TATA box and the proximal and distal promoter were likely to be captured, and 100bp downstream to cover the TSS; these regions are referred to as unannotated promoter sites. The extension was done using Bedtools version 2.22.1 (Quinlan and Hall, 2010).
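The strand-aware transformation described above can be sketched in Python. The thesis used GNU Awk and Bedtools for this step, so the code below is only an illustration of the coordinate arithmetic: the function name is hypothetical, coordinates are assumed to be BED-style (0-based, half-open), and clamping the start at 0 is an assumption about how chromosome edges are handled.

```python
def promoter_site(chrom, start, end, strand, upstream=2000, downstream=100):
    """Return the promoter interval for a transcript given in BED coordinates.

    The 1bp strand-aware five prime end is taken first, then extended
    `upstream` bp against the direction of transcription and `downstream`
    bp along it.
    """
    if strand == "+":
        tss_start, tss_end = start, start + 1      # five prime end is the leftmost base
        new_start, new_end = tss_start - upstream, tss_end + downstream
    elif strand == "-":
        tss_start, tss_end = end - 1, end          # five prime end is the rightmost base
        new_start, new_end = tss_start - downstream, tss_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return chrom, max(0, new_start), new_end, strand


if __name__ == "__main__":
    # Hypothetical transcript spanning chr1:5000-9000 on each strand.
    print(promoter_site("chr1", 5000, 9000, "+"))
    print(promoter_site("chr1", 5000, 9000, "-"))
```

Note that the same transcript yields different promoter intervals on the two strands, which is why the five prime end has to be taken in a strand-aware way before extension.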

To obtain the annotated TSS, the annotations for the human genome were used. The TSS for the coding and non-coding genes were obtained and transformed in the same way as the unannotated TSS: the strand-aware five prime end was extended 2kb upstream and 100bp downstream of the TSS. These are referred to as annotated promoter sites. The unannotated and annotated promoter sites were merged into one file using Bedtools merge.
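The effect of merging the two sets of promoter sites can be sketched as follows. This is not Bedtools itself: the function name is hypothetical, and the sketch assumes the default Bedtools merge behaviour of combining overlapping and book-ended intervals on the same chromosome without regard to strand.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals per chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start <= last[2]:
            # Overlaps or touches the previous interval: extend it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


if __name__ == "__main__":
    # Hypothetical annotated and unannotated promoter sites.
    annotated = [("chr1", 100, 200), ("chr1", 1000, 1100)]
    unannotated = [("chr1", 150, 300), ("chr1", 300, 400), ("chr2", 50, 60)]
    for interval in merge_intervals(annotated + unannotated):
        print(interval)
```

Sorting first is what makes a single pass sufficient; Bedtools merge likewise expects position-sorted input.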

2.3.6. DE and OE genes

2.3.6.1. Obtaining DEMM genes

The table with the quantified reads in each gene for each sample (Table 2-1, “MM-PC” category) was produced as explained in RNA-seq quantification, section 2.3.2. DEMM genes were obtained by analysing the gene quantification table with the DESeq2 settings and test schemes outlined in General considerations for RNA-seq and ATAC-seq analysis, section 2.2, for the RNA-seq MM vs. PC analysis (primary samples).

2.3.6.2. Obtaining DESMM genes

The table with the quantified reads in each gene for each sample (Table 2-1, “Subgroup” category) was produced as explained in RNA-seq quantification, section 2.3.2. DESMM genes were obtained by analysing the gene quantification table with the DESeq2 settings and test schemes outlined in General considerations for RNA-seq and ATAC-seq analysis, section 2.2, for the RNA-seq subgroup MM vs. PC analysis (primary samples).

2.3.6.3. Obtaining MM vs. PC DE genes between MM cell lines vs. PC primary samples

The table with the quantified reads in each gene for each PC and MM CL sample (Table 2-1) was produced as explained in RNA-seq quantification, section 2.3.2. Sample RSJJN3.1 is formed of two technical replicates, RS_3B.10 and RS_4.3, which were collapsed. The gene quantification table was analysed using the DESeq2 settings and test schemes outlined in General considerations for RNA-seq and ATAC-seq analysis, section 2.2, for the analysis of DE genes between MM cell lines and PC primary samples.
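Collapsing technical replicates can be sketched as below. The text does not state how the replicates were combined; this sketch assumes that their per-gene counts were summed, which is what the collapseReplicates function in DESeq2 does. The function name, the data layout and the counts in the example are hypothetical.

```python
def collapse_replicates(counts, groups):
    """Sum per-gene counts of technical replicates into one sample.

    counts: {sample: {gene: integer count}}
    groups: {collapsed sample name: [replicate names]}
    Samples not listed in any group are passed through unchanged.
    """
    replicate_to_group = {
        replicate: name for name, replicates in groups.items() for replicate in replicates
    }
    collapsed = {}
    for sample, per_gene in counts.items():
        target = replicate_to_group.get(sample, sample)
        totals = collapsed.setdefault(target, {})
        for gene, count in per_gene.items():
            totals[gene] = totals.get(gene, 0) + count
    return collapsed


if __name__ == "__main__":
    # Hypothetical counts; only the sample names come from the text.
    counts = {
        "RS_3B.10": {"GENE_A": 10, "GENE_B": 0},
        "RS_4.3": {"GENE_A": 7, "GENE_B": 2},
        "PC_1": {"GENE_A": 4, "GENE_B": 1},
    }
    print(collapse_replicates(counts, {"RSJJN3.1": ["RS_3B.10", "RS_4.3"]}))
```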

2.3.6.4. Obtaining subgroup MM vs. PC DE genes between MM cell lines vs. PC primary samples

The table with the quantified reads in each gene for each PC and MM CL sample (Table 2-1) was produced as explained in Obtaining MM vs. PC DE genes between MM cell lines vs. PC primary samples, section 2.3.6.3. The gene quantification table was analysed using the DESeq2 settings and test schemes outlined in General considerations for RNA-seq and ATAC-seq analysis, section 2.2, for the analysis of DE genes between subgroup MM cell lines and PC primary samples. There are no HD cell lines, so tables were created only for CCND1 vs. PC, MAF vs. PC and MMSET vs. PC.

2.3.6.5. Obtaining OEMM genes

OEMM genes were produced from DEMM genes as specified in General considerations for RNA-seq and ATAC-seq analysis, section 2.2.

2.3.6.6. OESMM genes

OESMM genes were produced from DESMM genes as specified in General considerations for RNA-seq and ATAC-seq analysis, section 2.2 (genes significantly OE in any MM subgroup vs. PC are filtered).
