3.6 Differential expression analyses
3.6.1 RNA-seq experiments
Human Body Map
SRA files were converted to FASTQ with the fastq-dump function of the SRA Toolkit (Leinonen, Sugawara, & Shumway, 2011). The -O flag sets the output directory, -v enables verbose output and --gzip compresses the output file.
fastq-dump -O <path> -v --gzip <path/file>
Then, the quality of the samples was checked using FastQC (Andrew, 2010).
The -o flag sets the output directory.
fastqc <path/file> -o=<path>
All FastQC results were processed with MultiQC to produce a single quality report (RNAseq_Quality_Analysis1,2). All samples were used and none had any adaptor
1 This is a link to the online appendices. For quality analyses like those described in this thesis, I recommend downloading all files in the directory. For other results, which are in HTML files, I recommend the method described in https://www.michaelcrump.net/how-to-run-html-files-in-your-browser-from-github/ .
Material and methods
trimmed. In the Illumina human tissue atlas, three samples (a paired-end brain library, ERR030882_1; a single-end brain library, ERR030890; and a single-end heart library, ERR030894) had lower Phred scores (25 to 30) in the first 10 bases (Ewels, Magnusson, Lundin, & Käller, 2016).
Kallisto (Bray, Pimentel, Melsted, & Pachter, 2016) was used to count the reads that pseudo-align to each transcript. First, an index was built for the human transcriptome from the Ensembl database (Homo_sapiens.GRCh38.cdna.all.fa.gz).
kallisto index -i H.sapiens_GRCh38.idx <path/file>
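Then, the kallisto quant function was used to estimate the counts. The parameters were the library type (--single for single-end libraries; paired-end is the default), the estimated average fragment length (-l), the estimated standard deviation of the fragment length (-s) and the number of bootstrap samples (-b). The first command was used for single-end libraries and the second for paired-end libraries.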
kallisto quant -i $Index -o <path> --single -l 75 -s 1 -b 100 "$file" &> <outpath.temp>
kallisto quant -i $Index -o <path> -b 100 "${Ones[$i]}" "${Twos[$i]}" &> <outpath.temp>
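The commands above rely on a loop variable `$file` and on two arrays, `Ones` and `Twos`, whose construction is not shown. The following is a minimal sketch of how such a loop might be organised. It assumes that paired-end mates are named `<run>_1.fastq.gz` and `<run>_2.fastq.gz` and that every other `.fastq.gz` file is a single-end library. The function name `build_quant_commands`, the naming convention and the directory layout are hypothetical, and the function only prints the commands instead of running kallisto.

```shell
#!/usr/bin/env bash
# Sketch: print one kallisto quant command per library (dry run).
# Assumption: paired-end mates are named <run>_1.fastq.gz / <run>_2.fastq.gz
# and any other *.fastq.gz file is a single-end library.

build_quant_commands() {
    local fastq_dir=$1 index=$2 out_dir=$3
    local file run

    for file in "$fastq_dir"/*.fastq.gz; do
        [ -e "$file" ] || continue            # the glob matched nothing
        run=$(basename "$file" .fastq.gz)
        case "$run" in
            *_2)
                continue                      # mate 2 is handled with mate 1
                ;;
            *_1)
                run=${run%_1}
                echo "kallisto quant -i $index -o $out_dir/$run -b 100" \
                     "$fastq_dir/${run}_1.fastq.gz $fastq_dir/${run}_2.fastq.gz"
                ;;
            *)
                echo "kallisto quant -i $index -o $out_dir/$run --single -l 75 -s 1 -b 100 $file"
                ;;
        esac
    done
}
```

Calling `build_quant_commands <fastq_dir> <index> <out_dir>` lists the commands so that they can be inspected before being executed.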
The kallisto quant output used for the estimated counts is the abundance.tsv file. The single-end and paired-end outputs of each tissue were summed, as they are technical replicates. The differential expression analysis was done with the edgeR package in R (Robinson, McCarthy, & Smyth, 2010).
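The summing of technical replicates can be sketched as follows. This is a minimal sketch and not necessarily how the tables were combined in this work. It assumes the kallisto abundance table layout, in which `est_counts` is the fourth tab-separated column, and the function name `sum_est_counts` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: add up est_counts (column 4 of a kallisto abundance table) across
# the technical replicates of one tissue.
# Output: target_id <TAB> summed est_counts, in first-seen order.

sum_est_counts() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { next }                     # skip the header of every file
        !($1 in total) { order[++n] = $1 }    # remember first-seen order
        { total[$1] += $4 }
        END {
            print "target_id", "est_counts"
            for (i = 1; i <= n; i++)
                printf "%s\t%.4f\n", order[i], total[order[i]]
        }
    ' "$@"
}
```

The function takes any number of abundance tables as arguments, for example the single-end and paired-end outputs of the same tissue, and writes the combined table to standard output.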
For this analysis, the dispersion value was set manually because there were no biological replicates. The procedure was as follows. First, genes with zero counts and lowly expressed genes were removed from the raw counts (keeping genes with >1 counts-per-million in at least 2 libraries). Second, normalization factors were calculated to scale the raw library sizes using the calcNormFactors function with the default normalization method (TMM, weighted trimmed
2 https://github.com/gabee-chan/MSc_Thesis/tree/Thesis_Results/RNAseq_Quality_Analysis
mean of M-values). The dispersion value was set to 0.4, as suggested in the edgeR manual for human data from well-controlled experiments when no replicates are available. Then, a negative binomial model was fitted with the glmFit function to the count table, using the normalized library sizes and the dispersion value. The output of this function was passed to the glmLRT function to perform likelihood ratio tests for a contrast, which in our case was brain vs. the average of the other tissues. Significantly differentially expressed genes were those with a false discovery rate (FDR) <= 0.05 and log fold change (logFC) >= log2(1.5). The result of the analysis is a gene list ordered from low to high values according to the sign of the logFC and the log p-value. This list was later used for the seed enrichment analysis with Sylamer (van Dongen, Abreu-Goodger, & Enright, 2008).
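The expression filter applied before normalization (more than 1 count-per-million in at least 2 libraries) can be illustrated outside R. The sketch below assumes a tab-separated count table with a header row, gene IDs in the first column and one raw-count column per library. It uses the plain column total as the library size, which is an assumption about how the counts-per-million were computed, and the function name `filter_by_cpm` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep genes with more than MIN_CPM counts-per-million in at least
# MIN_LIBS libraries (defaults: 1 and 2).
# Assumed input: tab-separated table, header row, gene ID in column 1,
# one raw-count column per library.

filter_by_cpm() {
    local counts=$1 min_cpm=${2:-1} min_libs=${3:-2}
    awk -F'\t' -v min_cpm="$min_cpm" -v min_libs="$min_libs" '
        NR == FNR {                           # pass 1: library sizes
            if (FNR > 1) for (i = 2; i <= NF; i++) libsize[i] += $i
            next
        }
        FNR == 1 { print; next }              # pass 2: copy the header
        {
            passing = 0
            for (i = 2; i <= NF; i++)
                # count / libsize * 1e6 > min_cpm, written without division
                if ($i * 1e6 > min_cpm * libsize[i]) passing++
            if (passing >= min_libs) print
        }
    ' "$counts" "$counts"
}
```

The file is read twice, once to obtain the library sizes and once to filter. Genes with zero counts in every library never pass the threshold, so they are removed in the same step.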
Xenopus tropicalis tissue atlas
A pipeline similar to the one used for the Human Body Map was applied to the X. tropicalis tissue atlas samples:
• FastQC was used to check the quality of the samples, and a single report was produced with MultiQC (quality report at RNAseq_Quality_Analysis3) (Andrew, 2010; Ewels et al., 2016).
• RNA-seq quantification was done with kallisto. The index was built from the Ensembl transcriptome (Xenopus_tropicalis.JGI_4.2.cdna.all.fa.gz). The quantification parameters were --single, -l 88, -s 1 and -b 100.
• The differential expression analysis was done with edgeR as for the Human Body Map, except that the dispersion value was estimated with the estimateDisp function.
• Significantly differentially expressed genes were those with FDR <= 0.05 and abs(logFC) >= log2(1.5).
3 https://github.com/gabee-chan/MSc_Thesis/tree/Thesis_Results/RNAseq_Quality_Analysis
• The final gene list was ordered by decreasing value of the sign of the logFC multiplied by the log(p-value). This ordered gene list was also used for the Sylamer analysis.
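The ordering of the gene list can be sketched as follows. The text does not state the base or the sign of the logarithm, so this sketch assumes the score sign(logFC) × −log10(p-value), which places the most significantly up-regulated genes first and the most significantly down-regulated genes last. The input layout (tab-separated, header row, columns gene, logFC and PValue) and the function name `rank_genes` are also assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: rank genes by sign(logFC) * -log10(p-value), largest score first.
# Assumed input: tab-separated, header row, columns gene / logFC / PValue.
# Assumption: -log10 is used; the text only says "log(p-value)".

rank_genes() {
    awk -F'\t' '
        NR == 1 { next }                      # skip the header
        {
            p = ($3 > 0) ? $3 : 1e-300        # avoid log(0)
            sign = ($2 > 0) - ($2 < 0)
            score = sign * (-log(p) / log(10))
            printf "%s\t%.6f\n", $1, score
        }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr
}
```

The output has two columns, the gene ID and its score. The first column alone gives an ordered gene list of the kind that a ranked-list tool such as Sylamer takes as input.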
3.6.2 Microarray experiments
Quality assessment of the samples of each experiment was done with the arrayQualityMetrics library (quality reports at Microarray_Quality_Analysis4) (Kauffman, Gentleman, & Huber, 2009). The differential expression analysis was done with the R libraries affy and limma (Gautier, Cope, Bolstad, & Irizarry, 2004; Ritchie et al., 2015). Probe annotation was obtained from the libraries hgu133plus2.db for human data, mouse4302.db for mouse data and drosophila2.db for fly data. A common pipeline was used for all the experiments. First, the CEL files of all samples were read with the ReadAffy function, and a quality analysis of the raw data was performed with the arrayQualityMetrics function. Afterwards, the data were normalized with the vsnrma function, which performs probe-wise background correction and between-array normalization. The quality analysis was then repeated on the normalized data. Before the differential expression analysis, the Ensembl gene ID of every microarray probe was retrieved, filtering them according to the longest gene, and the normalized expression matrix was renamed according to the Ensembl gene IDs. The differential expression analysis was then done as advised in the limma User's Guide: linear models were fitted to the expression data with the lmFit function, the appropriate contrast was applied (Table 5), and the gene variances were shrunk towards a global value with the eBayes function.
4 https://github.com/gabee-chan/MSc_Thesis/tree/Thesis_Results/Microarray_Quality_Analysis
For the differential expression analysis of the miR-124 transfection time course experiment, not all samples were used. The PCA from the quality analysis showed that the transfected samples of time points 4 and 8 clustered with the control samples, so the analysis was performed only with time points 16, 24, 32 and 72 (Microarray_Quality_Analysis/Human/Time_Course5). The analysis with the filtered samples showed more differentially expressed genes.
5 https://github.com/gabee-chan/MSc_Thesis/tree/Thesis_Results/Microarray_Quality_Analysis/Human/Time_Course
Organism | GSE ID | Experiment | Contrast
Human | Set of different GSEs | Constructed tissue atlas | Brain vs. average other tissues
Human | GSE32876 | Inferring transcriptional and microRNA-mediated regulatory programs in glioblastoma | Transfected vs. control
Human | GSE6207 | miR-124 transfection time course | Transfected vs. control
Mouse | GSE9954 | Large-scale analysis of the mouse transcriptome | Brain vs. average other tissues
Mouse | GSE8498 | The MicroRNA miR-124 promotes neuronal |
Fly | GSE7763 | Using FlyAtlas to identify better Drosophila models of human disease | Brain vs. average other tissues
Table 5. Data of human, mouse and fly microarray datasets used.