Computational Analysis

Chapter 2: Materials & Methods

2.6. Computational Analysis

2.6.1. Single cell gene expression data analysis (Fluidigm BioMark™ HD)

All analysis of qPCR data from the Fluidigm BioMark™ was performed using R (www.r-project.org). Most of the scripts were written by Fiona Hamey or by Victoria Moignard and Fernando Calero-Nieto; Fiona Hamey and Wajid Jawaid also helped with some coding aspects. All analyses were carried out by Sonia Shaw unless otherwise stated. Fiona Hamey performed the pseudotime inference, network construction and stable state analyses. Specific details are provided in Chapters 5 and 6.

2.6.1.1. Data processing and filtering

Single-cell gene expression data were collected using Fluidigm Data Collection software, and analysis was performed as previously described (Moignard et al. 2013; N. K. Wilson et al. 2015). ΔCt values were calculated by normalising mean expression levels to housekeeping genes Ubc and Polr2a (G. Guo et al. 2010). Where a gene could not be detected in a cell, its value was set to the maximum ΔCt value for that gene/assay plus 3.5.
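The normalisation and non-detect handling described above can be sketched as follows. This is a minimal Python illustration, not the R scripts used in this work. It assumes that ΔCt is the Ct of a gene minus the mean Ct of Ubc and Polr2a in the same cell (so a larger ΔCt means lower expression), that both housekeeping genes are detected in every cell, and that non-detects are stored as None. The function names and the Ct values are made up for the example.

```python
from statistics import mean

HOUSEKEEPERS = ("Ubc", "Polr2a")
NON_DETECT_OFFSET = 3.5  # value added to the per-gene maximum, as stated in the text


def delta_ct(cell, housekeepers=HOUSEKEEPERS):
    """Return ΔCt per gene for one cell; None marks a non-detected gene.

    Assumption: ΔCt = Ct(gene) - mean Ct(housekeepers) within the same cell.
    """
    reference = mean(cell[hk] for hk in housekeepers)
    return {
        gene: (None if ct is None else ct - reference)
        for gene, ct in cell.items()
        if gene not in housekeepers
    }


def fill_non_detects(cells, offset=NON_DETECT_OFFSET):
    """Replace None with (maximum detected ΔCt for that gene) + offset."""
    filled = [dict(cell) for cell in cells]
    for gene in cells[0]:
        detected = [cell[gene] for cell in cells if cell[gene] is not None]
        if not detected:
            continue  # gene never detected: nothing to anchor the fill value on
        fill_value = max(detected) + offset
        for cell in filled:
            if cell[gene] is None:
                cell[gene] = fill_value
    return filled


# Toy Ct values for two cells (illustrative numbers only).
raw = [
    {"Ubc": 12.0, "Polr2a": 14.0, "Gata1": 20.0, "Kit": 18.0},
    {"Ubc": 11.0, "Polr2a": 13.0, "Gata1": None, "Kit": 16.0},
]
dct = fill_non_detects([delta_ct(cell) for cell in raw])
print(dct)
```

In the toy data, Gata1 is not detected in the second cell, so it receives the only detected Gata1 ΔCt (7.0) plus 3.5.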

2.6.1.2. Downstream analyses

All housekeeping genes (Ubc, Polr2a, Eif2b1), Cdkn2a, Egfl7, Gfi1, and Spi1 were removed from the dataset for downstream analysis. Cdkn2a was not expressed in any of the cell types, and Egfl7, Gfi1 and Spi1 were removed due to technical issues.

The data collected for FSR-HSC2, MPP and PreMegE cells in this investigation were projected onto a principal component analysis (PCA) plot together with data collected by Wilson et al. for the following populations: LMPPs, CMPs, GMPs, MEPs, FSR-HSCs, and HSCs. The data were also re-analysed with data from Wilson et al. (N. K. Wilson et al. 2015). Since the projected PCA plot and the re-analysed PCA plot showed similar correlations between the cell populations, analysis was continued using the re-analysed data set containing all 12 populations.

Hierarchical clustering was performed using the hclust function and heatmap.2 from the gplots package. Spearman rank correlations and Ward linkage were used. PCA was performed using the default settings for the prcomp() function. T-distributed Stochastic Neighbour Embedding (t-SNE) analysis was performed using the tsne package. Diffusion map dimensionality reductions were calculated with the destiny package using centred cosine distance and σ = 0.3 (Angerer et al. 2016). Cells were retrospectively coloured based on clusters or the population to which they belonged. Subsequent analyses in which MolO cells were projected onto the data were performed using the roots package.

2.6.1.3. Additional information

Single cell gene expression data were also collected for HoxB8-FL cells. These data were processed by Fiona Hamey. All the single cell gene expression data from the Fluidigm BioMark™ platform can be downloaded from:

http://blood.stemcells.cam.ac.uk/single_cell_qpcr.html

2.6.2. Single cell gene expression data analysis (scRNA-seq)

All analysis of the scRNA-seq data was performed using R (www.r-project.org) unless otherwise stated. The script used for the analysis was written by Fiona Hamey. All analyses were carried out by Sonia Shaw unless otherwise stated.

2.6.2.1. Aligning reads and quality control

Reads were aligned using G-SNAP (T. D. Wu and Nacu 2010) and the mapped reads were assigned to Ensembl genes (release 81) (Zerbino et al. 2018) by HTSeq (Anders, Pyl, and Huber 2015). This was done by Evangelia Diamanti.

Quality control and data normalisation were performed by Fiona Hamey. To pass quality control, cells had to meet the following requirements:

• Cells need to have at least 200,000 reads mapping to nuclear genes

• Cells need to have at least 4,000 genes detected

• Less than 10% of mapped reads should map to mitochondrial genes

• Less than 50% of mapped reads should map to ERCC spike-ins.
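The four requirements above can be expressed as a single filter. The following is a minimal Python sketch rather than the R code used for this dataset. The CellStats container and its field names are hypothetical, and the sketch assumes that "mapped reads" means the sum of nuclear, mitochondrial and ERCC reads, which is not stated in the text.

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    """Per-cell read summary (hypothetical container for this example)."""
    nuclear_reads: int
    genes_detected: int
    mito_reads: int
    ercc_reads: int

    @property
    def mapped_reads(self):
        # Assumption: total mapped reads = nuclear + mitochondrial + ERCC reads.
        return self.nuclear_reads + self.mito_reads + self.ercc_reads


def passes_qc(cell):
    """Apply the four quality control thresholds listed in the text."""
    total = cell.mapped_reads
    return (
        cell.nuclear_reads >= 200_000
        and cell.genes_detected >= 4_000
        and cell.mito_reads < 0.10 * total
        and cell.ercc_reads < 0.50 * total
    )


# Toy cells (illustrative numbers only).
good = CellStats(nuclear_reads=500_000, genes_detected=6_000,
                 mito_reads=20_000, ercc_reads=100_000)
low_reads = CellStats(nuclear_reads=150_000, genes_detected=5_000,
                      mito_reads=5_000, ercc_reads=20_000)
high_mito = CellStats(nuclear_reads=300_000, genes_detected=5_000,
                      mito_reads=60_000, ercc_reads=40_000)

kept = [cell for cell in (good, low_reads, high_mito) if passes_qc(cell)]
print(len(kept))
```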

The reads were normalised using the method of Lun et al. (Lun, Bach, and Marioni 2016). Technical variance was estimated using ERCC spike-ins, as described by Brennecke et al. (Brennecke et al. 2013). The data were normalised in R using flowCore (Hahne et al. 2009) and ComBat (Johnson, Li, and Rabinovic 2007).

2.6.2.2. Assigning population thresholds

Population thresholds were assigned retrospectively by comparing normalised index data with published literature (A. Wilson et al. 2008; Pronk et al. 2007; Pietras et al. 2015; Cabezas-Wallscheid et al. 2014). The index data were plotted in FlowJo (Treestar) and gated to define HSPC, MPP and progenitor populations. CD45 was not available in the index data; therefore, E-SLAM cells were gated using the following strategy: EPCR+ CD48- CD150+. The gates either covered all cells (broad gating) or left unclassified cells in between populations to prevent overlap between gates (narrow gating).
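The difference between broad and narrow gating can be illustrated with a simple threshold model. The gates in this work were drawn in FlowJo, so the Python sketch below is only an approximation under stated assumptions: each marker is called positive or negative against a single threshold, and narrow gating is modelled as a margin around that threshold within which cells stay unclassified. The threshold values, the margin and the function names are hypothetical.

```python
ESLAM_PHENOTYPE = {"EPCR": "+", "CD48": "-", "CD150": "+"}


def call_marker(value, threshold, margin=0.0):
    """Return '+', '-' or None (unclassified) for one normalised marker value.

    margin = 0 mimics broad gating (every cell is called);
    margin > 0 mimics narrow gating (cells near the threshold stay unclassified).
    """
    if value >= threshold + margin:
        return "+"
    if value < threshold - margin:
        return "-"
    return None


def is_eslam(cell, thresholds, margin=0.0):
    """True if the cell is EPCR+ CD48- CD150+ under the given thresholds."""
    calls = {
        marker: call_marker(cell[marker], thresholds[marker], margin)
        for marker in ESLAM_PHENOTYPE
    }
    return calls == ESLAM_PHENOTYPE


# Hypothetical thresholds and one toy cell (illustrative numbers only).
thresholds = {"EPCR": 1.0, "CD48": 1.0, "CD150": 1.0}
cell = {"EPCR": 2.0, "CD48": 0.2, "CD150": 1.1}

broad_call = is_eslam(cell, thresholds)                # every cell is classified
narrow_call = is_eslam(cell, thresholds, margin=0.25)  # CD150 falls in the margin
print(broad_call, narrow_call)
```

The toy cell sits just above the CD150 threshold, so it is called E-SLAM under broad gating but left unclassified under narrow gating.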

2.6.2.3. Downstream analyses

Hierarchical clustering was performed using the hclust function with average linkage and (1 – Spearman’s correlation)/2 distance. Clusters were identified using the cutreeDynamic function from the dynamicTreeCut package, with a minimum cluster size of 10 and the deepSplit parameter set to 1. Wilcoxon rank sum tests with Benjamini-Hochberg correction were used to test for differential expression among genes expressed in at least half of the cells in a cluster. Diffusion map dimensionality reductions were calculated using the destiny package with cosine distance and Gaussian kernel width = 0.3 (Angerer et al. 2016). Pseudotime analysis was performed by Fiona Hamey.
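Two of the quantities named above are simple enough to sketch directly: the (1 – Spearman’s correlation)/2 distance between two expression profiles and the Benjamini-Hochberg adjustment of a set of p-values. The Python code below is an illustration only; the analysis itself was run in R, and the clustering (hclust, cutreeDynamic) and the Wilcoxon tests are not reimplemented here. The function names are hypothetical, and the distance assumes that neither profile is constant.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_distance(x, y):
    """(1 - Spearman's rho) / 2, which lies between 0 and 1."""
    return (1 - pearson(average_ranks(x), average_ranks(y))) / 2


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy inputs (illustrative numbers only).
d_same = spearman_distance([1, 2, 3, 4], [10, 20, 30, 40])
d_opposite = spearman_distance([1, 2, 3, 4], [40, 30, 20, 10])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(d_same, d_opposite, adjusted)
```

Dividing by 2 maps perfectly correlated profiles to a distance of 0 and perfectly anti-correlated profiles to a distance of 1.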

2.6.2.4. Haematopoietic differentiation landscape – Online resource

An interactive website was designed by Blanca Pijuan-Sala, which allows other researchers to view the expression of their genes of interest on the HSPC differentiation landscape. The website also contains the surface marker and cell phenotype visualisations. The raw data were also made available so that others can use the HSPC differentiation atlas in their own research. The website can be found at:

http://blood.stemcells.cam.ac.uk/single_cell_atlas.html

2.6.2.5. STREAM analysis

STREAM analysis (Single-cell Trajectories Reconstruction, Exploration and Mapping) was performed on the scRNA-seq dataset by Huidong Chen from the Pinello lab at Harvard University (H. Chen et al. 2018). The data were made available at:

http://stream.pinellolab.org/

Within this online resource, Sonia Shaw visualised the expression of various genes and made observations about genes involved in branching and transition points in the pseudotime ordering.

2.6.2.6. SPRING analysis

SPRING analysis was performed on the scRNA-seq dataset by Caleb Weinreb from the Klein lab at Harvard University (Weinreb, Wolock, and Klein 2018). The data were made available at:

https://kleintools.hms.harvard.edu/tools/springViewer.html?cgi-bin/client_datasets/gottgens_prenorm

Within this online resource, Sonia Shaw visualised the expression of various genes and proteins and made observations about the discernible cell populations.

2.6.3. Analysis of genotyping data

Genotyping results were analysed in RStudio. The genotyping protocol is described in Section 2.4. Evangelia Diamanti mapped the reads to the mouse genome and generated .fastq files for each library sequenced using MiSeq Nano. The first 10,000 reads of each .fastq file were aligned to the appropriate reference sequence using custom functions created by Iwo Kucinski. The output of the script included the fraction of indels and frameshift mutations for each sample, which were compared to the empty vector controls to determine whether the CRISPR gRNAs successfully targeted the genes of interest.
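The shape of this calculation can be sketched as follows. The custom alignment functions are not reproduced here, so this Python example makes a strong simplifying assumption: each read spans the whole reference amplicon, so the net indel length can be approximated by the difference in length between the read and the reference. A read is counted as a frameshift when its net indel length is not a multiple of 3. The function names, the reference sequence and the reads are hypothetical.

```python
from itertools import islice


def first_reads(lines, n=10_000):
    """Yield the sequences of the first n FASTQ records (4 lines per record)."""
    for index, line in enumerate(islice(lines, 4 * n)):
        if index % 4 == 1:  # the second line of each record holds the sequence
            yield line.strip()


def indel_summary(net_indels):
    """Return the fraction of reads with an indel and with a frameshift."""
    total = len(net_indels)
    if total == 0:
        raise ValueError("no reads to summarise")
    indels = [d for d in net_indels if d != 0]
    frameshifts = [d for d in indels if d % 3 != 0]
    return {
        "indel_fraction": len(indels) / total,
        "frameshift_fraction": len(frameshifts) / total,
    }


# Toy reference and FASTQ records (illustrative sequences only).
reference = "ACGTACGTAC"
fastq_lines = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",      # unedited
    "@read2", "ACGTCGTAC", "+", "IIIIIIIII",        # 1 bp deletion (frameshift)
    "@read3", "ACGTTAC", "+", "IIIIIII",            # 3 bp deletion (in frame)
    "@read4", "ACGTACTTGTAC", "+", "IIIIIIIIIIII",  # 2 bp insertion (frameshift)
]

# Stand-in for alignment: net indel length = read length - reference length.
net_indels = [len(read) - len(reference) for read in first_reads(fastq_lines)]
summary = indel_summary(net_indels)
print(summary)
```

With a real file, an open file handle can be passed to first_reads in place of the list of lines, and the same summary would then be computed for the empty vector control for comparison.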

In document Mapping the transcriptional landscape of haematopoietic stem and progenitor cells (Page 71-75)
