Chapter 5: Resolving heterogeneity in HSPC populations
5.5. Resolving populations using multidimensionality analysis
Dimensionality reduction was required to investigate how the cell populations related to each other based on their gene expression. Dimensionality reduction methods are useful for visualising large datasets in a lower-dimensional space. In this investigation, they were used to evaluate the heterogeneity and structure of the haematopoietic bone marrow compartment in an unsupervised fashion.
Principal component analysis (PCA) was used to visualise relationships between cell populations (Fig. 5.2). PCA is a linear dimensionality-reduction method in which principal component (PC) 1 captures the largest share of the variance in the data, followed by PC2. The data was therefore plotted on the first two components to show the variance between the cell populations. The new data collected for this investigation was integrated into the Wilson et al. dataset and analysed together (Fig. 5.2A). PreMegEs, MPPs and FSR-HSC2 cells are intermediate populations in the haematopoietic hierarchy, which is recapitulated by their location on the PCA plot.
The four HSC populations clustered together at the top of the graph (Fig. 5.2B). HSC1 showed the most dispersed expression; however, the four strategies enriched for cells with an overall similar expression profile. The FSR-HSC populations were located between the HSCs, MPPs and LMPPs, consistent with the classical view of the haematopoietic hierarchy. For greater visual clarity, the four HSC populations and the two FSR-HSC populations were coloured together (Fig. 5.2C). Although there were no clear projections in the PCA visualisation, the HSCs, MEPs and LMPPs were found at distinct edges of the structure, indicating that these populations were the most different from one another. PreMegEs clustered close to the MEPs, whereas the GMPs lay between the MEP and LMPP populations, albeit more concentrated near the LMPPs. The CMPs were dispersed among the progenitor populations, consistent with previous observations about their heterogeneity (Paul et al. 2015).
The PCA loadings show which genes contributed to the separation of the data (Fig. 5.2D). At the top of the PCA plot, Mpl, Mecom and Procr contributed to the variance that separated HSCs from the other populations, consistent with these genes being important to HSC characteristics (Table 5.1). Gata1 and Gfi1b contributed to the separation on the left side of the PCA plot, and Notch and Csf1r contributed to the variance on the right, consistent with the left of the PCA plot being made up of MEPs and the right of LMPPs and GMPs.
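The relationship between PCA scores (the coordinates plotted for each cell) and loadings (the contribution of each gene to a component) can be illustrated with a minimal sketch. It is not the analysis code used for Fig. 5.2: the expression matrix is synthetic, and the function name `pca`, the gene labels and the group sizes are assumptions made purely for illustration.

```python
import numpy as np


def pca(expression, n_components=2):
    """PCA via SVD of a mean-centred cells x genes matrix."""
    X = np.asarray(expression, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # cell coordinates
    loadings = Vt[:n_components].T                     # gene contributions
    explained = (S ** 2) / np.sum(S ** 2)              # variance fraction per PC
    return scores, loadings, explained[:n_components]


# Synthetic data (hypothetical): 100 cells x 5 genes, where the second
# group of 50 cells expresses gene_0 at a much higher level.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 5))
X[50:, 0] += 10.0
genes = [f"gene_{i}" for i in range(5)]

scores, loadings, explained = pca(X)

# Rank genes by the magnitude of their PC1 loading.
order = np.argsort(-np.abs(loadings[:, 0]))
for i in order:
    print(f"{genes[i]}: PC1 loading = {loadings[i, 0]:+.3f}")
print("Variance explained by PC1, PC2:", np.round(explained, 3))
```

In this toy example the gene that differs between the two groups dominates the PC1 loading, which is the same logic by which the loading plot identifies the genes that drive the separation of populations.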
Figure 5.2. Visualisation of single-cell qRT-PCR data using principal component analysis. (A) PCA plot showing the integration of the new data with the data from Wilson et al. (2015). New data – black; Wilson et al. data – grey. (B) PCA plot of all populations, calculated on the expression of 41 genes measured by qRT-PCR. The plot is coloured by sorting gate. HSC1 – purple; HSC2 – dark purple; HSC3 – pink; HSC4 – cyan; FSR-HSC1 – forest green; FSR-HSC2 – olive green; MPP – yellow green; PreMegE – dark brown; LMPP – blue; CMP – orange; MEP – red; GMP – yellow. (C) PCA plot of all populations, coloured by cell type. The four HSC populations are grouped together (purple) and the two FSR-HSC populations are grouped together (olive green). MPP – light blue; PreMegE – dark brown; LMPP – blue; CMP – yellow green; MEP – red; GMP – orange. (D) PCA loading plots, showing genes that contribute to the variance in PC1 and PC2. PC: Principal Component.
Although PCA is an informative dimensionality-reduction method, it can only capture linear structures in the data. More recently, non-linear dimensionality reduction methods such as t-distributed stochastic neighbour embedding (t-SNE) and diffusion maps have been applied to single-cell data (Maaten and Hinton 2008; Haghverdi, Buettner, and Theis 2015). These methods are able to capture more complex structures in the data. t-SNE aims to conserve the local distances of the high-dimensionality data in a low-dimensionality structure, so that cells with similar gene expression are nearby on the plot.
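As a hedged illustration of how such an embedding might be computed, the sketch below runs t-SNE from scikit-learn on synthetic data. It is not the code used in this chapter: the data, the helper `run_tsne`, the perplexity and the seed values are assumptions chosen for the example. Because t-SNE starts from a random initialisation, the `random_state` argument is fixed so that the same embedding is regenerated on each run.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data (hypothetical): three groups of 20 cells x 10 genes.
rng = np.random.default_rng(0)
centres = rng.normal(0.0, 5.0, size=(3, 10))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(20, 10)) for c in centres])


def run_tsne(data, seed):
    """Embed cells in 2D; the seed fixes the stochastic initialisation."""
    model = TSNE(
        n_components=2,
        perplexity=10,        # must be smaller than the number of cells
        init="random",
        method="exact",       # exact gradients; adequate for small datasets
        random_state=seed,
    )
    return model.fit_transform(data)


emb_a = run_tsne(X, seed=1)   # reference embedding
emb_b = run_tsne(X, seed=1)   # same seed: same coordinates
emb_c = run_tsne(X, seed=2)   # different seed: different layout

print("Same seed reproduces the embedding:", np.allclose(emb_a, emb_b))
print("Different seed changes the embedding:", not np.allclose(emb_a, emb_c))
```

Comparing embeddings generated with several seeds is one way to check that the overall structure is stable before fixing a single seed for the final figure.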
The qRT-PCR data was visualised using t-SNE (Fig. 5.3), which recapitulated the structure seen using PCA. The HSCs were located at the top of the landscape. The HSC1 cells showed the most heterogeneity, and the HSC4 population appeared more molecularly different from the other HSC sorting strategies than it did in the PCA plot (Fig. 5.3A). Because the sorting strategies differed, it was assumed that the functional HSCs would be similar and that each strategy would differ in the phenotype of the contaminating cells it captured. The t-SNE plot separated the data into two distinct branches, separating MEPs and LMPPs, which is clearly shown when the four HSC populations and two FSR-HSC populations are coloured together (Fig. 5.3B). The CMPs and GMPs were both dispersed among the progenitor cells; GMPs sat at the tip of the LMPP branch, but between the MEPs and HSCs on the MEP branch.
Figure 5.3. Visualisation of single-cell qRT-PCR data using t-distributed stochastic neighbour embedding. (A) t-SNE plot of all populations, calculated on the expression of 41 genes measured by qRT-PCR. The plot is coloured by sorting gate. HSC1 – purple; HSC2 – dark purple; HSC3 – pink; HSC4 – cyan; FSR-HSC1 – forest green; FSR-HSC2 – olive green; MPP – yellow green; PreMegE – dark brown; LMPP – blue; CMP – orange; MEP – red; GMP – yellow. (B) t-SNE plot of all populations, coloured by cell type. The four HSC populations are grouped together (purple) and the two FSR-HSC populations are grouped together (olive green). MPP – light blue; PreMegE – dark brown; LMPP – blue; CMP – yellow green; MEP – red; GMP – orange.
A disadvantage of t-SNE analysis is that it is a stochastic method, which means that while the overall conclusions from the analysis do not change, the t-SNE visualisation will be altered every time it is generated. It is therefore necessary to generate t-SNE plots multiple times to confirm that the structure of the dataset is reproducible, and then to set the seed parameter so that the same figure can be reproduced every time. Furthermore, both PCA and t-SNE dimensionality reduction methods are designed to detect differences in the data rather than continuous relationships (Haghverdi, Buettner, and Theis 2015). As haematopoiesis involves the differentiation of an HSC towards a mature cell fate while passing through intermediate progenitor phenotypes, it would be beneficial to visualise the data using a dimensionality-reduction method that is better able to determine more complex structures in the data. Diffusion maps use the length of diffusion-like random walks through the data in high-dimensional space to determine a projection of the cells, and have been adapted to successfully display single-cell data (Coifman et al. 2005).
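The random-walk idea behind diffusion maps can be sketched in a few lines. This is a basic construction with a fixed-width Gaussian kernel, offered only as an illustration under stated assumptions: it is not the single-cell adaptation used for the analysis in this chapter, and the function name `diffusion_map`, the kernel width `sigma` and the synthetic trajectory are all hypothetical.

```python
import numpy as np


def diffusion_map(expression, n_components=2, sigma=1.0):
    """Basic diffusion map with a fixed-width Gaussian kernel."""
    X = np.asarray(expression, dtype=float)
    # Pairwise squared Euclidean distances between cells.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * sigma ** 2))          # cell-cell affinities
    d = K.sum(axis=1)
    # Symmetric form A = D^-1/2 K D^-1/2 shares its eigenvalues with the
    # random-walk transition matrix P = D^-1 K.
    A = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]                  # largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    psi = eigvecs / np.sqrt(d)[:, None]                # right eigenvectors of P
    # Skip the first (constant) eigenvector, which has eigenvalue 1.
    dcs = psi[:, 1:n_components + 1] * eigvals[1:n_components + 1]
    return dcs, eigvals


# Synthetic data (hypothetical): 60 cells along one continuous trajectory
# in a 3-gene space, with a small amount of measurement noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 60)
X = 5.0 * np.column_stack([t, 1.0 - t, t ** 2])
X += rng.normal(0.0, 0.05, size=X.shape)

dcs, eigvals = diffusion_map(X, n_components=2, sigma=1.0)
corr = np.corrcoef(dcs[:, 0], t)[0, 1]
print("Leading eigenvalue:", round(float(eigvals[0]), 6))
print("|correlation| of DC1 with trajectory position:", round(abs(float(corr)), 3))
```

In this toy case the first diffusion component orders the cells along the simulated trajectory, which is the property that makes diffusion maps suited to continuous differentiation processes. The sign of each component is arbitrary, so only the absolute correlation is reported.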
The qRT-PCR data was visualised on a diffusion map (Fig. 5.4). As in the PCA and t-SNE plots, the HSCs sat at the top of the structure and HSC1 showed the most dispersed expression pattern of the four HSC sorting strategies. Furthermore, the diffusion map recapitulated the pattern seen in the t-SNE plot, in which HSC4 was the most distinct of the four HSC sorting strategies (Fig. 5.4A). When the HSC and FSR-HSC populations are coloured together, it is easier to see that the structure roughly segregated the cells into two projections, separating MEPs and LMPPs (Fig. 5.4B). The LMPPs were located closer to the FSR-HSCs than the MEPs were, suggesting that their gene expression was closer to that of the early progenitors. The distinct gene expression of PreMegEs and MEPs was more clearly visualised in the diffusion map than with the other methods.
Figure 5.4. Visualisation of single-cell qRT-PCR data using diffusion maps. (A) Diffusion map of all populations, calculated on the expression of 41 genes measured by qRT-PCR. The plot is coloured by sorting gate. HSC1 – purple; HSC2 – dark purple; HSC3 – pink; HSC4 – cyan; FSR-HSC1 – forest green; FSR-HSC2 – olive green; MPP – yellow green; PreMegE – dark brown; LMPP – blue; CMP – orange; MEP – red; GMP – yellow. (B) Diffusion map of all populations, coloured by cell type. The four HSC populations are grouped together (purple) and the two FSR-HSC populations are grouped together (olive green). MPP – light blue; PreMegE – dark brown; LMPP – blue; CMP – yellow green; MEP – red; GMP – orange. DC: Diffusion Component.