One fundamental question is how the cell lineages of complex tissues or organisms are formed. Answering this question is essential not only to monitor normal tissue development and homeostasis but also to improve our understanding of pathological situations such as cancers or, in our case, the pathological remodelling of the airway epithelium in chronic lung diseases.
Until now, the different techniques used for cell lineage tracing have relied on the introduction of a heritable mark into a cell, which allows researchers to follow the progeny of this cell. All the cells in which the mark is later detected come from the same founder cell. By looking at the heterogeneity of the cell types found in the progeny, it is possible to determine the potency of the founder cell.
Nowadays, the development of single-cell transcriptomics makes it possible to reconstruct “cell trajectories” independently of any genetic tracing. Lineages can be inferred from the analysis of the transcriptomes of individual cells, without any prior hypothesis about the cell populations involved. In developing or differentiating tissues, some cells are transitioning from one cell state to another. If a sufficient number of cells in these transitional states is captured in the single-cell experiment, there may be enough information to place them on differentiation trajectories. These computational methods are based on the hypothesis that cells with very similar gene expression profiles arise from the same lineage, which can be assessed from the similarity of their transcriptomes. These analyses result in a clustering of the cells into different groups.
In recent years, many computational methods have been developed to analyse single-cell RNA-seq experiments and perform lineage inference.
VI.1 Dimensionality reduction-based algorithms
a) Monocle (Trapnell et al., 2014) was one of the first algorithms applied to the construction of differentiation trajectories. It uses independent-component analysis (ICA) to project all the cells into a low-dimensional space (usually 2 dimensions). It then builds a minimum spanning tree (MST), that is, a set of edges that connects all the points of a graph with the smallest possible total length and without forming cycles, and defines a “track” connecting the two most different populations of cells. All other cells are projected onto this “track”, resulting in a 1-dimensional ordering of the cells called “pseudotime”. The pseudotime predicts the lineage trajectory of the cells. Monocle does not define the direction of the lineage, so prior knowledge is required to provide the correct starting and ending points. The original Monocle was not very efficient at tracking branches in a lineage, and the software was therefore mostly restricted to the analysis of linear trajectories. This issue was solved with the release of Monocle 2 (Qiu et al., 2017), which performs the analysis in a higher-dimensional space and can therefore reconstruct more intricate trajectories (Fig. 29).
b) SLICE (Guo et al., 2017) (Single Cell Lineage Inference Using Cell Expression Similarity and Entropy). This algorithm uses predefined clusters of cells instead of the single cells considered by Monocle, which greatly simplifies the MST. SLICE uses transcriptomic entropy as a measure of differentiation. This allows it to detect the least differentiated cell population and use it as the starting point of the pseudotime. The algorithm can build complex branching trees that define different coexisting differentiation pathways (Fig. 29).
c) SCUBA (Marco et al., 2014) (Single-cell Clustering Using Bifurcation Analysis). This algorithm also works on predefined clusters of cells. It reduces data dimensionality with t-distributed Stochastic Neighbor Embedding (tSNE) and then fits a smooth curve through the cells. With this method, the authors analysed two different data sets and successfully reconstructed the cellular hierarchy during the development of mouse embryos. They described the dynamic changes in gene expression patterns and were able to predict how perturbing key transcriptional regulators would alter the trajectory (Fig. 29).
d) TSCAN (Ji and Ji, 2016) (Tools for Single Cell Analysis). This method also groups the cells into clusters, then uses an MST that connects the centers of the clusters to order the cells. The pseudotime is obtained by projecting every cell onto the MST. The authors analysed a single-cell data set of hematopoietic cells and revealed the importance of a specific regulator (HOPX) in the formation of blood cells (Fig. 29).
e) Slingshot (Street et al., 2018) uses dimensionality reduction, constructs an MST to identify the key elements of the global lineage structure, and then uses simultaneous principal curves to fit smooth branching curves to these lineages. The cells are then projected onto the resulting tree to give an ordered lineage trajectory that includes bifurcations.
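The MST-based ordering shared by several of the methods above can be illustrated with a minimal sketch. It assumes that the cells have already been reduced to two dimensions (the ICA or tSNE step is not reproduced), uses invented toy coordinates, and takes the root cell from the user, as Monocle requires. The function names are hypothetical and do not correspond to the API of any of the cited packages.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns an adjacency dict."""
    n = len(points)
    in_tree = {0}
    adjacency = {i: [] for i in range(n)}
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        adjacency[i].append((j, d))
        adjacency[j].append((i, d))
        in_tree.add(j)
    return adjacency

def tree_distances(adjacency, source):
    """Path length along the tree from `source` to every other node."""
    dist = {source: 0.0}
    stack = [source]
    while stack:
        node = stack.pop()
        for nxt, w in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + w
                stack.append(nxt)
    return dist

def backbone_ends(adjacency):
    """The two nodes furthest apart along the tree (double sweep)."""
    first = tree_distances(adjacency, 0)
    a = max(first, key=first.get)
    second = tree_distances(adjacency, a)
    b = max(second, key=second.get)
    return a, b

# Toy cells in an already reduced 2-D space (invented values, deliberately unordered).
cells = [(2.0, -0.1), (0.0, 0.0), (4.0, 0.2), (1.0, 0.1), (3.0, 0.0)]
root = 1  # user-supplied starting cell: the method itself gives no direction

mst = minimum_spanning_tree(cells)
ends = backbone_ends(mst)
pseudotime = tree_distances(mst, root)
ordering = sorted(pseudotime, key=pseudotime.get)
```

In this toy example the backbone runs between the cells at x = 0 and x = 4, and choosing one of them as the root turns the tree distances into a pseudotime ordering. Branch handling and the exclusion of side branches are left out.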
Fig. 29. Overview of Lineage Reconstruction Algorithms. Lineage reconstruction algorithms based on dimensional reduction. Monocle uses independent-component analysis, followed by the construction of a minimum spanning tree (MST) connecting all cells. Connecting the two cells furthest away from each other identifies a backbone. Directionality can be provided by the user through the identification of a root cell. Large side branches are excluded, and remaining cells are projected onto the pseudotime backbone. SLICE constructs an MST of cluster centers, and directionality is inferred from transcriptome entropy. Single cells are projected on the edges connecting the cluster centers. TSCAN and Waterfall also construct an MST based on the cluster centers, followed by projection of the single cells onto the edges to align cells in pseudotime. SCUBA uses tSNE for dimensionality reduction followed by the fitting of a smooth curve. Single cells are projected on the smooth curve to order them in pseudotime. Monocle, TSCAN, Waterfall, and SCUBA all require user input to infer directionality. From (Kester and van Oudenaarden, 2018)
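The projection of single cells onto the edges connecting cluster centers, as described for SLICE and TSCAN, can be sketched as follows. This is a simplified illustration that assumes the cluster centers are already known and ordered along a single, unbranched path. The coordinates are invented and the function names are hypothetical.

```python
import math

def project_on_segment(p, a, b):
    """Closest point to p on segment a-b: returns (fraction along a-b, distance to p)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    t = sum(x * y for x, y in zip(ap, ab)) / denom
    t = max(0.0, min(1.0, t))  # clamp so the projection stays on the edge
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return t, math.dist(p, closest)

def pseudotime_along_path(cell, centers):
    """Project a cell on its nearest edge and return the path length up to that point."""
    best = None
    offset = 0.0
    for a, b in zip(centers, centers[1:]):
        length = math.dist(a, b)
        t, d = project_on_segment(cell, a, b)
        if best is None or d < best[0]:
            best = (d, offset + t * length)
        offset += length
    return best[1]

# Assumed cluster centers, already ordered along one path of the tree.
centers = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
cells = {"c1": (0.5, 0.2), "c2": (1.9, 1.0), "c3": (1.5, -0.1), "c4": (2.2, 1.8)}

pseudotime = {name: pseudotime_along_path(xy, centers) for name, xy in cells.items()}
ordering = sorted(pseudotime, key=pseudotime.get)
```

Working on a handful of cluster centers rather than on every cell is what keeps the tree small. The price is that the ordering depends on the quality of the initial clustering.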
VI.2 Nearest neighbour graph-based algorithms
In these algorithms each cell is connected to its nearest neighbors, linking together similar cells.
a) Wanderlust (Bendall et al., 2014) orders the cells by determining the position of each cell through successive steps between neighbouring cells. It generates a collection of the shortest “walks” from one cell to another through the neighbour graph. Several trajectories are obtained in this way, and the algorithm calculates the most probable one (Fig. 30).
b) Wishbone (Setty et al., 2016) Uses a cell ordering technique similar to that of Wanderlust, but is able to identify bifurcations/branches (Fig. 30).
Fig. 30. Overview of Lineage Reconstruction Algorithms. Lineage reconstruction algorithms based on NNGs. Both Wanderlust and Wishbone start with the construction of a NNG. A collection of shortest walks, from a user-defined root cell to all other cells in the graph, is then used to construct the lineage trajectory. Wishbone has the added benefit that it can identify bifurcations in the lineage trajectory. From (Kester and van Oudenaarden, 2018)
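The idea of shortest walks through a nearest-neighbour graph can be illustrated with a minimal sketch: cells are linked to their k nearest neighbours, and the length of the shortest walk from a user-defined root gives an ordering. The published algorithms add further refinement steps that are omitted here. The toy cells lie on an arc, and the function names and the value of k are assumptions made for the illustration.

```python
import heapq
import math

def knn_graph(points, k):
    """Undirected graph linking every cell to its k nearest neighbours."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # make the edge symmetric
    return graph

def shortest_walks(graph, root):
    """Dijkstra's algorithm: length of the shortest walk from root to each cell."""
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist

# Toy cells placed along a half circle, mimicking a curved trajectory.
cells = [(math.cos(i * math.pi / 6), math.sin(i * math.pi / 6)) for i in range(7)]
root = 0  # user-defined root cell

graph = knn_graph(cells, k=2)
walk_length = shortest_walks(graph, root)
ordering = sorted(walk_length, key=walk_length.get)
```

Because the walks follow the graph, the last cell on the arc is further from the root in walk length than in straight-line distance, which is why graph-based ordering can follow curved trajectories.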
VI.3 Connection of cluster centers in high dimensional space
a) StemID (Grün et al., 2016). This algorithm clusters the cells using k-medoids. The clusters are connected by their centers in a high-dimensional space, and the cells are then projected onto the edges between the clusters. StemID then identifies a stem cell population from the entropy of each cluster, which determines the starting point of the network, as in the SLICE method (Fig. 31).
b) Mpath (Chen et al., 2016). This algorithm clusters the cells using hierarchical clustering. As in the StemID algorithm, it connects the centers of the clusters to create a network and then projects all the remaining cells onto this network. Mpath can identify linear and branching lineage pathways, does not need a large number of cells to determine the trajectories, and can define the starting point of the network using only the gene signature of the most differentiated cells (Fig. 31).
c) Cluster triplet construction. This approach was first used for bulk RNA-seq analysis (Heinäniemi et al., 2013) and was later applied to scRNA-seq data (Furchtgott et al., 2017). It consists in generating cell clusters, grouping these clusters into triplets, and performing differential gene expression analysis to determine the order of the clusters within each triplet along a trajectory. Once the relations within all the triplets of clusters have been established, they are assembled into a lineage tree (Fig. 31).
Fig. 31. Overview of Lineage Reconstruction Algorithms. Lineage reconstruction algorithms based on cluster networks. Both StemID and Mpath start by connecting all cluster centers in a high dimensional space. Single cells are then projected on the edges between the clusters, and underrepresented edges are removed from the graph. StemID identifies a potential stem cell population based on transcriptome entropy. The Furchtgott method infers the intermediate cluster (if possible) from each triplet of clusters in the data, followed by tree construction based on the triplet relations. From (Kester and van Oudenaarden, 2018)
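The use of transcriptome entropy to pick a starting population, as in SLICE and StemID, can be illustrated with a deliberately simplified sketch. It computes the Shannon entropy of a normalised average expression profile for each cluster and takes the cluster with the highest entropy as the candidate root. The published methods define their entropy scores in more detail. The expression values are invented and the function name is hypothetical.

```python
import math

def shannon_entropy(profile):
    """Shannon entropy (in bits) of an expression profile normalised to sum to 1."""
    total = sum(profile)
    probs = [x / total for x in profile if x > 0]
    return -sum(p * math.log2(p) for p in probs)

# Invented average expression of four genes in three clusters.
cluster_profiles = {
    "progenitor": [10, 9, 11, 10],     # broad, uniform expression
    "intermediate": [20, 10, 6, 4],
    "differentiated": [37, 1, 1, 1],   # dominated by a single gene
}

entropies = {name: shannon_entropy(p) for name, p in cluster_profiles.items()}
# Assumption of the sketch: the least differentiated cluster has the highest entropy.
root_cluster = max(entropies, key=entropies.get)
```

A cluster whose transcriptome is spread evenly over many genes scores close to the maximum (2 bits for four genes), whereas a cluster dominated by a few genes scores low.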
d) Dimensional reduction of current state. RNA velocity (La Manno et al., 2018). “RNA velocity” is defined as the time derivative of the gene expression state. It can be inferred by quantifying unspliced and spliced mRNAs, under the assumption that the abundance of unspliced transcripts predicts the spliced mRNA content of a cell within the next hours. For each cell, the balance between unspliced and spliced transcripts across all genes is represented as a vector that points from the current spliced state towards the future state of the cell. Thus, RNA velocity is a high-dimensional vector that predicts the future state of individual cells on a timescale of hours (Fig. 31).
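The definition of RNA velocity as a time derivative can be made concrete for a single gene. In the underlying model, the spliced abundance s changes as ds/dt = βu − γs, where u is the unspliced abundance, β the splicing rate and γ the degradation rate. With β set to 1, the velocity is v = u − γs. The sketch below is a simplification: it fits γ by least squares through the origin on toy cells assumed to be at steady state, and all values and function names are invented for the illustration.

```python
def fit_gamma(unspliced, spliced):
    """Least-squares slope through the origin of u = gamma * s (steady-state cells)."""
    num = sum(u * s for u, s in zip(unspliced, spliced))
    den = sum(s * s for s in spliced)
    return num / den

def velocity(u, s, gamma):
    """ds/dt for one gene in one cell, with the splicing rate set to 1."""
    return u - gamma * s

def extrapolate(s, v, dt):
    """First-order prediction of the spliced abundance after a time step dt."""
    return s + v * dt

# Toy steady-state cells for one gene (invented values): u = 0.5 * s.
spliced = [1.0, 2.0, 4.0, 8.0]
unspliced = [0.5, 1.0, 2.0, 4.0]
gamma = fit_gamma(unspliced, spliced)

# A cell with excess unspliced mRNA: the gene is being induced (positive velocity).
v_induced = velocity(u=3.0, s=2.0, gamma=gamma)
# A cell with a deficit of unspliced mRNA: the gene is being repressed (negative velocity).
v_repressed = velocity(u=0.5, s=4.0, gamma=gamma)

future_s = extrapolate(s=2.0, v=v_induced, dt=0.5)
```

Repeating this calculation for every gene gives the high-dimensional velocity vector of a cell, and the sign of each component indicates whether the corresponding gene is being switched on or off.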