
CLUSTER ENSEMBLE METHODS FOR SINGLE CELL RNA-SEQ DATA AND DECONVOLUTION OF BULK HI-C DATA

Ruth Huh

A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biostatistics in the Gillings School of Global Public Health.

Chapel Hill

2019

Approved by:

Yun Li

Yuchao Jiang

Michael Love

Jin Szatkiewicz

Kirk Wilhelmsen


© 2019 Ruth Huh


ABSTRACT

Ruth Huh: Cluster Ensemble Methods for Single Cell RNA-Seq Data and Deconvolution of Bulk Hi-C Data

(Under the direction of Yun Li)

Clustering single-cell RNA-seq (scRNA-seq) data is a critically important task that sheds light on tissue complexity, including the number of cell types present and the transcriptomic signatures of each cell type. Because of its importance, several novel methods have recently been developed for clustering scRNA-seq data. However, different approaches generate varying estimates of the number of clusters and of cluster assignments, making it hard to gauge which method to use.

In the first paper, we present SAFE-clustering, Single-cell Aggregated (From Ensemble) clustering, a flexible, accurate and robust method for clustering scRNA-seq data. SAFE-clustering takes results from multiple clustering methods to build one consensus clustering. In our current implementation, individual solutions are ensembled using three hypergraph-based partitioning algorithms, namely the hypergraph partitioning algorithm (HGPA), the meta-clustering algorithm (MCLA) and the cluster-based similarity partitioning algorithm (CSPA). In our evaluations, SAFE-clustering generates high-quality clusterings, in terms of both cluster number and cluster assignment, across various datasets.

In the second paper, we present SAME-clustering, Single-cell Aggregated Clustering via Mixture Model Ensemble, which follows a pipeline similar to that of SAFE-clustering but replaces the ensemble clustering method with a probabilistic framework. Specifically, SAME-clustering uses a finite mixture model of multinomial distributions. Results show that the SAME-clustering ensemble method, using a mixture model, yields enhanced clustering in terms of both cluster assignments and number of clusters.

In the third paper, we shift gears from analyzing scRNA-seq data to Hi-C data, which measures spatial interactions that provide information on gene regulation and the 3D structure of the genome. Standard Hi-C data are generated from millions of cells, thus providing a population average measure of heterogeneous cells. Therefore, observed differences in contact information are confounded by the relative proportions of cell types among samples. It is important to adjust for these proportions in downstream bulk Hi-C analysis. To date, no deconvolution methods have been applied to Hi-C data to estimate these proportions. We propose using

nonnegative matrix factorization (NMF) for a matrix decomposition-based framework to estimate


ACKNOWLEDGEMENTS

I would like to thank my advisor, Professor Yun Li, for taking me as her student later in my PhD career. I am grateful for her guidance on my dissertation and her support of my well-being throughout my 3 years working with her. I would also like to thank all of my committee members, Professor Yuchao Jiang, Professor Michael Love, Professor Jin Szatkiewicz, Professor Kirk Wilhelmsen, and Professor Di Wu, for their helpful input and questions on my dissertation topic.

I would like to thank all Li lab members. You have helped me with programming on many platforms, widened my biological and statistical knowledge, and improved my presentation skills. I would especially like to thank Yuchen Yang for collaborating closely with me throughout my dissertation.

I am thankful for all the support and encouragement I received from my friends and family. I am grateful to my mom and in-laws, who took care of my daughter during my studies; to my daughter Yena, for giving me joy in the midst of my studies; and to my husband Hojoon, for persevering with me through various life stages. Last but not least, I would like to thank my Father


LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: LITERATURE REVIEW
1.1 Motivation behind first and second papers
1.2 Some Recent Clustering Methods for single-cell RNA-seq
1.3 Ensemble Methods and their advantages
1.4 Motivation for third paper
1.5 Deconvolution methods for RNA-seq
CHAPTER 2: SAFE-clustering: Single-cell Aggregated (From Ensemble) Clustering for Single-cell RNA-seq Data
2.1 Overview of SAFE-clustering
2.2 Expression matrix normalization
2.3 Benchmarking Datasets
2.4 Improving and running four state-of-the-art methods
2.5 Hypergraph Partitioning Cluster Ensemble Algorithms
2.6 Performance evaluation using ANMI
2.7 Benchmarking of three hypergraph partitioning algorithms in SAFE
2.8 Results
2.9 Discussion
CHAPTER 3: SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble
3.2 Expression matrix normalization
3.3 Benchmark Datasets
3.4 Implementation and Evaluation of Individual clustering methods
3.5 SAME-clustering method using the multinomial mixture model
3.6 Diversity of individual cluster results to improve SAME-clustering
3.7 Results
3.8 Discussion
CHAPTER 4: Reference free Hi-C Deconvolution
4.1 Overview of Hi-C Deconvolution using NMF
4.2 Simulation of Bulk Samples
4.3 NMF method
4.4 Preliminary Results
4.5 Selecting Significant Features
4.6 Results
4.7 Discussion


LIST OF TABLES

Table 1: Characteristics of the 12 benchmarking datasets
Table 2: Additional characteristics of the 12 benchmarking datasets
Table 3: Major characteristics of 15 benchmarking datasets
Table 4: Evaluating effect of gene filter on SC3
Table 5: Evaluating robustness of Seurat and SAME across number of PCs used
Table 6: Evaluating ADPclust centroid detection
Table 7: Assessing the effect of gene filtering on individual methods
Table 8: Difference in ARIs with and without gene filtering
Table 9: Evaluation of NMF and CIBERSORT deconvolution
Table 10: Evaluation of different feature selection methods
Table 11: Results implementing the 2-step NMF


LIST OF FIGURES

Figure 1: Overview of SAFE-clustering
Figure 2: PCA plot to determine number of clusters
Figure 3: Seurat performance evaluation
Figure 4: t-SNE+kmeans performance evaluation
Figure 5: Comparing HGPA, MCLA, and CSPA
Figure 6: Diversity among individual clustering methods
Figure 7: Benchmarking of SAFE-clustering in 12 published datasets
Figure 8: Accuracy evaluation of the inferred number of clusters
Figure 9: Evaluating robustness of t-SNE+kmeans
Figure 10: Evaluating robustness of SAFE-clustering
Figure 11: Overview of SAME-clustering
Figure 12: Visualization to determine number of clusters
Figure 13: Elbow plot to determine number of PCs to be used
Figure 14: t-SNE+kmeans clusters from automatically determined centers
Figure 15: t-SNE+kmeans clusters from manually selected centers
Figure 16: Performance of AIC and BIC for model selection
Figure 17: Determining iterations needed for EM algorithm
Figure 18: Pairwise similarities of individual methods for Zeisel dataset
Figure 19: Evaluating SAME-clustering with all or a diverse set of individual results
Figure 20: Evaluation of SAME-clustering on 15 benchmark datasets
Figure 21: Similarity between estimated and true clusters
Figure 22: Similarity between estimated and true clusters (adjusted)
Figure 23: SAME-clustering determines limitations in true annotations
Figure 24: NK marker gene expression in the novel cluster
Figure 25: SAME-clustering discovers a novel cluster
Figure 27: Difference of Top Fano Factor bins between HAP1 and HeLa cells
Figure 28: Plot of True vs Estimated cell type proportions for HAP1 and HeLa
Figure 29: Plot of Estimated cell type contacts against Simulated cell type contacts
Figure 30: Observed contacts vs the fitted contacts
Figure 31: Relationship between feature score and fold change
Figure 32: Relationship between feature score and difference
Figure 33: Signature bins for HAP1 and HeLa cells
Figure 34: PCA and PC2 loadings plot separating HAP1 and HeLa cells
Figure 35: Estimated proportions from 2-step NMF
Figure 36: Correlation of estimated and simulated interchromosomal expression
Figure 37: Correlation of Observed vs Fitted contacts
Figure 38: PCA and PC2 loadings plot separating Patski and MEF cells
Figure 39: Difference of Hi-C contacts between Patski and MEF cells


CHAPTER 1: LITERATURE REVIEW

1.1 Motivation behind first and second papers

The development of RNA-sequencing (RNA-seq) has allowed us to obtain a global view of the transcriptome in different species and cell types (Wang et al., 2009). Cells of different cell types have unique transcriptomes, informative of their roles in and contributions to normal cellular function, cell fate determination, early development, and disease development (Biase et al., 2014; Deng et al., 2014; Goolam et al., 2016; Yan et al., 2013; Huang, 2009; Tang et al., 2011; Liu and Trapnell, 2016). Moreover, even among cells of the same cell type, gene expression levels are found to be highly variable (Buganim et al., 2012; Guo et al., 2010; Hashimshony et al., 2012; Shalek et al., 2013). Because of such heterogeneity among cells, the commonly adopted bulk RNA-seq measurements may obscure the true behaviors of distinct cell types (Trapnell et al., 2014). Therefore, transcriptomic analysis should be carried out at the single-cell level (Tang et al., 2011).

Single-cell RNA-seq (scRNA-seq) analysis has led to, among other things, the identification of existing and novel cell types, characterization of cells, prediction of cell fate, classification of tumor subpopulations and investigation of cellular heterogeneity (Treutlein et al., 2014; Li et al., 2016; Xin et al., 2016; Arsenio et al., 2014; Grün et al., 2015; Patel et al., 2014; Min et al., 2015; Wills and Mead, 2015; Baron et al., 2016; Biase et al., 2014; Darmanis et al., 2015; Deng et al., 2014; Goolam et al., 2016; Li et al., 2017; Ting et al., 2014; Yan et al., 2013; Zeisel et al., 2015; Zheng et al., 2017). For all these applications, single-cell clustering is a crucial preprocessing step. After clustering the single cells, all of the following analyses can be conveniently carried out: identification and examination of cell type specific gene expression signatures (Rozenblatt-Rosen et al., 2017; Yan et al., 2013), cell type covariate adjustment for differential expression analysis (Li et al., 2017; Grün et al., 2015), and deconvolution of bulk RNA expression data (Baron et al.,


Given its importance, it is not surprising that many single-cell RNA-seq clustering methods exist. The high dimensionality of scRNA-seq data poses a grand challenge for unsupervised cell clustering (Xu and Su, 2015; Lin et al., 2017; Wang et al., 2017). Principal component analysis (PCA) and the t-Distributed Stochastic Neighbor Embedding algorithm (t-SNE) (Maaten and Hinton, 2008) are both commonly used for data visualization and dimensional reduction prior to single-cell clustering. Dimension reduction is often followed by k-means to perform the actual clustering (Shin et al., 2015; Kiselev et al., 2017). Other recent methods include RaceID (Grün et al., 2015), RCA (Li et al., 2017), SC3 (Kiselev et al., 2017), Seurat (Satija et al., 2015), CIDR (Lin et al., 2017), DIMM-SC (Sun et al., 2017), and SIMLR (Wang et al., 2017). Different clustering methods manipulate the input scRNA-seq data differently, for example, by using different distance metrics and dimensionality reduction methods. Therefore, clustering results may be inconsistent because methods take different approaches and/or make different underlying assumptions. As a matter of fact, cluster results from different methods are found to be rather dissimilar, with discrepancies occurring both in the estimated number of clusters and in the actual single-cell-level cluster assignment. Therefore, the use of two or more clustering methods is recommended for a more accurate and comprehensive overview of cell clustering (Freytag et al., 2017). However, when true cluster labels are unknown, it is difficult to select the best method, prior to or even after clustering analysis (even with multiple methods implemented separately).

Since it is hard to select the optimal method when true cell types are unknown, ensembling information from multiple individual methods becomes an appealing alternative. Cluster ensemble solutions are known to provide robust and improved quality solutions (Strehl and Ghosh, 2002; Ghosh and Acharya, 2011) in many other contexts, including analysis of cell signalling dynamics and protein folding (Hubner et al., 2005; Kuepfer et al., 2007). Ensemble solutions range from probabilistic approaches to graph partitioning methods (Ghosh and Acharya, 2011). In addition to providing robust and improved cluster results, cluster ensemble methods provide a way to estimate an optimal number of clusters, which is important when there are dramatically


1.2 Some Recent Clustering Methods for single-cell RNA-seq

Many clustering methods have been developed specifically for single-cell RNA-seq data. Some, like RaceID (Grün et al., 2015) and GiniClust (Jiang et al., 2016), are tailored for rare cell type identification. Most other methods focus on identification of common cell types. Desirable features of scRNA-seq clustering methods include dimension reduction, gene filtering, normalization and detection of the number of clusters. The following methods have some or all of these features in their pipelines.

SC3

SC3 (Kiselev et al., 2017), consensus clustering of scRNA-seq data, includes several steps in its clustering pipeline. It can take different types of input: Reads Per Kilobase per Million mapped reads (RPKM), Fragments Per Kilobase per Million mapped reads (FPKM), Transcripts Per Million mapped reads (TPM), Counts Per Million mapped reads (CPM), and counts, where counts are converted to CPM to account for sequencing depth. The normalized expression matrix is log-transformed after adding one to avoid taking the log of zero. SC3 then filters out genes/transcripts that are expressed in fewer than 10% of cells or in more than 90% of cells to remove rare and ubiquitous genes/transcripts. The 10% threshold is a default value that can be adjusted by the user, and the gene filtering step as a whole is optional. After the filtering step, three distance metrics, Euclidean distance and Pearson and Spearman correlations, are calculated between single cells. These three distance matrices are then transformed using both PCA and eigenvector decomposition of the graph Laplacian, separately. Next, k-means clustering is performed on these transformed distance matrices. A consensus matrix of the different k-means clustering solutions is constructed using the cluster-based similarity partitioning algorithm (CSPA) (Strehl and Ghosh, 2002). Lastly, this consensus matrix is clustered assuming $k$ clusters specified a priori. SC3 has a method to estimate $k$ using the Tracy-Widom distribution (Tracy and Widom, 1994).

SC3 has a moderate computational cost, and it therefore employs a support vector machine (SVM) (Ben-Hur, 2001) model to speed up computation for large datasets. For datasets with more than 5,000 single cells, a subset of 5,000 single cells is randomly selected and


cells. Altogether, SC3 includes all important features of scRNA-seq clustering: dimension reduction using PCA and the graph Laplacian, gene filtering of rare and ubiquitous genes, normalization of the input expression matrix, and estimation of the number of clusters through the Tracy-Widom distribution (Tracy and Widom, 1994).
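The preprocessing steps described for SC3 (CPM conversion, log transformation after adding one, and filtering of rare and ubiquitous genes) can be sketched as follows. This is a minimal illustration in plain Python and not SC3's own code: the function names, the base-2 logarithm, and the rule that a gene counts as "expressed" in a cell when its value is above zero are all assumptions made for the example.

```python
import math

def cpm(counts):
    """Convert a genes x cells count matrix to counts per million per cell."""
    n_cells = len(counts[0])
    totals = [sum(row[c] for row in counts) for c in range(n_cells)]
    return [[1e6 * row[c] / totals[c] if totals[c] else 0.0
             for c in range(n_cells)] for row in counts]

def log_transform(matrix):
    """Log-transform after adding one (base 2 is an assumption here)."""
    return [[math.log2(x + 1.0) for x in row] for row in matrix]

def gene_filter(matrix, low=0.10, high=0.90):
    """Return indices of genes expressed in at least `low` and at most `high`
    of the cells. 'Expressed' means value > 0 (an assumed threshold)."""
    kept = []
    for g, row in enumerate(matrix):
        frac = sum(1 for x in row if x > 0) / len(row)
        if low <= frac <= high:
            kept.append(g)
    return kept

# Toy matrix: 3 genes (rows) x 4 cells (columns).
counts = [
    [10, 20, 30, 40],  # ubiquitous gene: expressed in every cell
    [5, 0, 0, 15],     # expressed in half of the cells
    [0, 0, 0, 0],      # never expressed
]
normalized = log_transform(cpm(counts))
kept_genes = gene_filter(normalized)
```

In this toy example only the middle gene survives the filter, since the first is ubiquitous and the last is never detected.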

CIDR

The main advantage of CIDR (Lin et al., 2017) is that it imputes the expression values of dropout genes to improve clustering results. Dropouts are common in scRNA-seq data because there is less starting mRNA material than in bulk RNA-seq experiments, which can cause amplification failures in the RNA-seq experiment. CIDR takes logarithmically transformed CPM values as input. After dropout determination and imputation, a dissimilarity matrix is calculated using the squared Euclidean distance between pairs of cells. CIDR employs principal coordinate analysis (PCoA) for dimensional reduction. For ease of use, the CIDR R package automatically determines the number of principal coordinates to use for dimensional reduction in its nPC function. However, users may choose to alter the number of principal coordinates by searching for the elbow point in the plot showing the proportion of variation explained by each principal coordinate. With the dimension-reduced dissimilarity matrix, hierarchical clustering is applied to assign cluster labels to single cells. The number of clusters, $k$, is estimated with the Calinski-Harabasz index (Caliński and Harabasz, 1974). One key feature missing in CIDR is gene filtering, most likely because it attempts to solve the problem of rarely expressed genes through imputation.
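To make the role of the Calinski-Harabasz index concrete, the sketch below computes it for a given partition as the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom. It is a generic plain-Python illustration rather than CIDR's implementation; the function name and the toy points are assumptions, and the function expects at least two clusters with nonzero within-cluster spread.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: [B / (k - 1)] / [W / (n - k)], where B is the
    between-cluster and W the within-cluster sum of squares."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between, within = 0.0, 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2
                                      for d in range(dim))
        within += sum((p[d] - centroid[d]) ** 2
                      for p in members for d in range(dim))
    return (between / (k - 1)) / (within / (n - k))

# Two well-separated pairs of points in a reduced 2-D space.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # groups nearby points
bad = calinski_harabasz(points, [0, 1, 0, 1])   # splits nearby points
```

A larger index indicates tighter, better-separated clusters, so candidate values of $k$ can be compared by computing the index for each candidate partition and keeping the maximum.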

Seurat

Seurat (Butler et al., 2018) combines dimensional reduction with graph partitioning methods and takes as input raw Unique Molecular Identifier (UMI) counts or read counts. The expression matrix is first filtered to exclude single cells with fewer than 200 expressed genes and genes expressed in fewer than three single cells. Thus, single cells with < 200 expressed genes are not given a final cluster label, resulting in missing cluster labels. For counts input, Seurat normalizes each single cell by its total gene expression and multiplies by a scale factor of 10,000. A natural log transformation is then applied. Seurat then scales the data by removing unwanted sources of variation, by regressing out number


non-UMI data. Furthermore, Seurat can regress out batch effect, cell alignment rate, percentage of mitochondrial genes, and cell cycle. Next, PCA is performed on the scaled data to reduce dimensionality. To cluster the cells, Seurat first constructs a K-nearest neighbor (KNN) graph with Euclidean distance in PCA space, where edges are drawn between similar cells. The edge weights are then refined based on the Jaccard distance, which measures the dissimilarity between local neighborhoods. The cells are clustered by applying modularity optimization techniques (Blondel et al., 2008), and the number of clusters, $k$, is determined during the clustering process.
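Two of the steps described above for Seurat lend themselves to a short illustration: the per-cell normalization (divide by the cell total, multiply by 10,000, take the natural log) and the Jaccard distance between two cells' neighborhoods. The sketch below is plain Python, not Seurat's R code. The function names are hypothetical, and adding one before the log (via `log1p`) is an assumption made so that zero counts stay at zero.

```python
import math

def log_normalize(cell_counts, scale_factor=10_000):
    """Normalize one cell's gene counts by its total, scale, and log-transform.
    Uses log(1 + x) so that zero counts map to zero (an assumption)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]
    return [math.log1p(scale_factor * c / total) for c in cell_counts]

def jaccard_distance(neighbors_a, neighbors_b):
    """One minus the shared fraction of two cells' nearest-neighbor sets."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# One cell with three genes, and two cells' nearest-neighbor index sets.
normalized_cell = log_normalize([0, 5, 15])
distance = jaccard_distance([1, 2, 3], [2, 3, 4])
```

Here the two neighbor sets share two of four distinct cells, so the Jaccard distance is 0.5; cells whose neighborhoods overlap heavily get a small distance and hence a strong edge.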

A couple of parameters need to be specified by the user for better results. The resolution parameter is influential on cluster results as it sets the 'granularity' of clustering. Larger values lead to a greater number of clusters, and Seurat finds that setting this resolution parameter between 0.6 and 1.2 typically returns good clustering results. Another parameter that needs to be specified is the number of PCs to use in the clustering. Users can examine the plot of the standard deviation of the principal components and determine a cutoff at the elbow of the plot. This decision can be rather arbitrary when there is no clear elbow. Another option is to use the JackStrawPlot function in the Seurat R package to discover significant PCs. Overall, Seurat includes all important features of scRNA-seq clustering; however, the ambiguous choices required for the resolution parameter and the number of PCs may pose a challenge for potential users. Aside from clustering, Seurat provides many convenient downstream analysis features, including finding differentially expressed genes and t-SNE visualization.

t-SNE+kmeans

t-distributed Stochastic Neighbor Embedding (t-SNE) is a popular approach for visualizing high-dimensional data in two- or three-dimensional space and has been used widely to visualize single-cell clusters (Grün et al., 2015; Zheng et al., 2017). It improves on an earlier technique, Stochastic Neighbor Embedding (SNE) (Hinton and Roweis, 2003), as it is easier to optimize, reduces the crowding of points towards the center, and helps with the appearance of clusters (global structure) in a map. Briefly, high-dimensional Euclidean distances between data points are converted into joint probabilities that represent similarities. These similarities are also computed for the low-dimensional counterparts. The basic concept is that if the low-dimensional points correctly model the similarity between the high-dimensional data points,


converted into probabilities using a Gaussian distribution. In the low-dimensional map, distances are converted into probabilities using a Student t-distribution with 1 degree of freedom, which has much heavier tails than a Gaussian. Heavy tails allow moderately dissimilar points in high dimensions to be modeled by a much larger distance in low dimensions, thus resolving the crowding problem and giving global structure in cluster visualization. The solution is found by minimizing a single Kullback-Leibler divergence between a joint probability distribution, $P$, in the high-dimensional space and a joint probability distribution, $Q$, in the low-dimensional space using gradient descent. One parameter that needs to be specified by the user is the perplexity. Perplexity can be seen as a measure of the effective number of neighbors; t-SNE is reportedly fairly robust to a range of values, and typical values are between 5 and 50. After reduction to two- or three-dimensional space, the data can be clustered via k-means.
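For reference, the quantities described above can be written out as in the t-SNE paper (Maaten and Hinton, 2008). The notation is not used elsewhere in this document and is introduced here only for illustration: $x_i$ are the high-dimensional points, $y_i$ their low-dimensional counterparts, $n$ the number of points, and $\sigma_i$ a per-point Gaussian bandwidth chosen so that the conditional distribution $P_i$ has the user-specified perplexity.

```latex
% High-dimensional similarities (Gaussian kernel), symmetrized into joint probabilities
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarities (Student t kernel with 1 degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized by gradient descent
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

% Perplexity of the conditional distribution P_i
\mathrm{Perp}(P_i) = 2^{H(P_i)},
\qquad
H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}
```

The heavier tail of the kernel in $q_{ij}$ is what lets moderately dissimilar points sit far apart in the map without incurring a large penalty in $C$.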

The R package Rtsne performs an initial dimension reduction using PCA and implements the Barnes-Hut algorithm (Van Der Maaten, 2014) to mitigate the heavy computational burden of minimizing the Kullback-Leibler divergence by gradient descent for large datasets. t-SNE + kmeans lacks key features of scRNA-seq clustering, namely gene filtering, normalization, and determination of the number of clusters. Gene filtering and normalization can easily be added when finalizing the gene expression matrix. However, a method needs to be developed to estimate the number of clusters.

ADPclust (Wang and Xu, 2017) may be used as an intermediate step between t-SNE and k-means to determine an optimal number of clusters. ADPclust is an adaptive density peak detection method using nonparametric multivariate kernel density estimation. To determine the number of clusters, $k$, cluster assignments are evaluated through the silhouette index (Rousseeuw, 1987) using a grid of values for both bandwidth and number of clusters. ADPclust chooses a plug-in bandwidth estimator that minimizes the asymptotic mean integrated squared error (AMISE) to estimate the parameters of the multivariate kernel distribution. The cluster centroids and number of clusters that produce the maximum average silhouette index can be carried


SIMLR

Single-cell interpretation via multikernel learning (SIMLR) (Wang et al., 2017) takes a raw gene expression matrix as input and applies a log10 transformation prior to analysis. SIMLR learns a similarity metric by learning proper weights for several Gaussian kernels. The advantage of using multiple kernels lies in their flexibility in comparison to a single kernel; they can therefore capture diverse statistical characteristics of single-cell data. A cell-to-cell similarity matrix is then constructed, and SIMLR assumes that this matrix should be approximately block-diagonal when clusters exist. This similarity matrix is used as input, instead of the regular gene expression matrix, to reduce the dimension of the data using t-SNE, and k-means is used for clustering.

For the large-scale version of SIMLR, k-nearest-neighbor (KNN) is used to approximate the pairwise similarity matrix. After the similarity matrix is obtained, spectral clustering is used instead of applying k-means after t-SNE, because t-SNE is more computationally expensive. SIMLR supplies two ways to estimate the optimal number of clusters: eigengap and separation cost. Both utilize eigenvalues and eigenvectors, which can easily become computationally expensive with larger datasets. Therefore, it is not feasible to estimate the number of clusters using these methods for large datasets. An important feature of scRNA-seq clustering that SIMLR lacks is gene filtering. However, SIMLR claims to address the problem of high dropout rates by implementing a rank constraint on the learned similarity matrix and graph diffusion, which improves weak similarity measures.

1.3 Ensemble Methods and their advantages

Cluster ensembles aim to combine the information from individual clustering methods to provide an improved overall clustering of the given data. The advantage of cluster ensembles lies in their robustness, providing better average clusterings across datasets, and in their novelty, finding a combined solution unattainable by individual clustering methods (Topchy et al., 2005). Many papers have found that the diversity and quality of partitions influence the performance of ensemble solutions (Fern and Lin, 2008; Fern and Brodley, 2003; Kuncheva and Hadjitodorov, 2004;


to improve ensemble results. We will now briefly lay out some graph partitioning approaches and one probabilistic approach to cluster ensembles.

Graph partitioning methods use a construct called a hypergraph. Briefly, for the $j$th clustering method, we use $v_{ik}$ (the subscript $j$ is omitted for brevity) to denote the entries of the $i$th row of the hypergraph $H_j$, which is the row vector of cluster labels (coded as binary dummies or indicator functions) for the $i$th single cell, where

$$
v_{ik} =
\begin{cases}
1, & \text{the } i\text{th cell} \in \text{the } k_j\text{th cluster} \\
0, & \text{the } i\text{th cell} \notin \text{the } k_j\text{th cluster}
\end{cases}
$$

and $k_j = 1, 2, \ldots, K_j$, with $K_j$ being the total number of clusters from the $j$th clustering method. Here, each column is a hyperedge, representing one particular cluster identified by that method. An overall hypergraph $H$ is constructed by combining the individual hypergraphs (from the individual methods). The Cluster-based Similarity Partitioning Algorithm (CSPA), HyperGraph-Partitioning Algorithm (HGPA) and Meta-Clustering Algorithm (MCLA) all use the concept of hyperedges and hypergraphs, but they differ inherently in how they use them. Specifically, CSPA performs partitioning on the similarity matrix constructed from the hypergraph, in which two objects are fully similar if they are always in the same cluster. HGPA partitions the hypergraph by cutting a minimal number of hyperedges so as to create $k$ clusters of approximately equal size. This algorithm is not optimal when cluster sizes vary widely. MCLA collapses related hyperedges (clusters) and assigns each cell to the most related collapsed hyperedge. This collapsed hyperedge is referred to as a meta-cluster, hence the name Meta-Clustering Algorithm.
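The hypergraph construction and the similarity matrix used by CSPA can be sketched in a few lines. The example below is a minimal plain-Python illustration, with hypothetical function names and toy labels; it stacks the binary indicator columns of each method side by side and then scores each pair of cells by the fraction of methods that place them in the same cluster. The partitioning step that CSPA applies to this similarity matrix is not shown.

```python
def hypergraph(labelings):
    """Build the combined binary indicator matrix H (cells x hyperedges).
    `labelings` holds one list of cluster labels per clustering method."""
    n_cells = len(labelings[0])
    columns = []
    for labels in labelings:
        for cluster in sorted(set(labels)):
            # One hyperedge (column) per cluster found by this method.
            columns.append([1 if label == cluster else 0 for label in labels])
    return [[column[i] for column in columns] for i in range(n_cells)]

def cspa_similarity(H, n_methods):
    """Fraction of methods that co-cluster each pair of cells: H H^T / r."""
    n_cells = len(H)
    return [[sum(a * b for a, b in zip(H[i], H[j])) / n_methods
             for j in range(n_cells)] for i in range(n_cells)]

# Three methods cluster four cells; label names need not match across methods.
labelings = [
    ["a", "a", "b", "b"],
    [1, 1, 1, 2],
    ["x", "x", "y", "y"],
]
H = hypergraph(labelings)
S = cspa_similarity(H, n_methods=len(labelings))
```

Cells 0 and 1 are co-clustered by all three methods, so their similarity is 1, while cells 0 and 3 are never co-clustered and have similarity 0. Note that the labels of different methods are never compared directly, only co-membership within a method.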

A probabilistic approach to the cluster ensemble problem is to use a multinomial mixture model (Topchy et al., 2004). Assume that the number of consensus clusters is known to be $\hat{k}$, with each cluster indexed by $\hat{l}$. For each consensus cluster $\hat{l}$ and each individual method $q$, we have a multinomial distribution $\beta_{\hat{l}}^{(q)}$ of dimension $k^{(q)}$, where $k^{(q)}$ is the number of clusters determined by the $q$th clustering method. Each draw from this multinomial distribution corresponds to the cluster label from the $q$th individual clustering. With these probabilistic


Combining multiple clusterings comes with new challenges. One is that there is no explicit correspondence between labels from different clustering methods. Another added complexity is that different clustering methods may produce different numbers of clusters, which adds to the correspondence problem. Both ensemble approaches discussed avoid the label correspondence problem completely, through the use of hyperedges and multinomial distributions respectively, and are able to deal with varying numbers of clusters determined by the individual methods.
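The multinomial mixture model can be written out explicitly, following the general form in Topchy et al. (2004). The symbols $y_i^{(q)}$ (the label given to cell $i$ by method $q$), $M$ (the number of individual methods), $N$ (the number of cells) and $\alpha_{\hat{l}}$ (the mixing weight of consensus cluster $\hat{l}$) are notation assumed here for illustration; $\beta_{\hat{l}}^{(q)}(k)$ denotes the probability that method $q$ assigns label $k$ to a cell belonging to consensus cluster $\hat{l}$.

```latex
% Likelihood of the vector of individual labels for cell i
P\left(y_i \mid \Theta\right)
  = \sum_{\hat{l}=1}^{\hat{k}} \alpha_{\hat{l}}
    \prod_{q=1}^{M} \prod_{k=1}^{k^{(q)}}
    \left[\beta_{\hat{l}}^{(q)}(k)\right]^{\mathbf{1}\left(y_i^{(q)} = k\right)},
\qquad
\sum_{\hat{l}=1}^{\hat{k}} \alpha_{\hat{l}} = 1,
\quad
\sum_{k=1}^{k^{(q)}} \beta_{\hat{l}}^{(q)}(k) = 1

% Log-likelihood over all N cells, typically maximized with the EM algorithm
\log L(\Theta) = \sum_{i=1}^{N} \log P\left(y_i \mid \Theta\right)
```

Because each method has its own multinomial with its own dimension $k^{(q)}$, labels are never matched across methods, which is how this formulation sidesteps the correspondence problem.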

1.4 Motivation for third paper

Most tissue samples are heterogeneous, consisting of several cell types, and cell type proportions are highly variable between samples. When transcriptional profiles of bulk RNA-seq samples are compared among different phenotypic states, cell type composition is a strong confounder of observed differences (Repsilber et al., 2010; Palmer et al., 2006; Baron et al., 2016). Transcript abundance may vary due to the physiological condition of the samples, individual variation, and relative proportions of cell types (Shen-Orr and Gaujoux, 2013). Since gene expression varies across cell types in a tissue, these variations in each sample are better captured by reporting differences in cell type proportions among the samples. Without the relative proportions of cell subsets, it is hard to distinguish whether increased gene expression is due to overexpression of a gene or merely to having more cells that express that gene. Also, without accounting for varying cell type proportions, it is hard to identify which cell type an observed difference came from. Failing to adjust for relative cell type proportions between samples increases false positives among differentially expressed genes when the cell type proportion difference is correlated with the phenotype of interest, makes it difficult to attribute an observed difference to a specific cell type, and restricts the interpretability of results (Shen-Orr and Gaujoux, 2013).
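A small worked example shows how composition alone can produce an apparent expression difference. The numbers below are invented for illustration: a gene has the same cell type specific expression in both samples, and only the mixing proportions differ.

```python
def bulk_expression(proportions, cell_type_expression):
    """Bulk signal for one gene as a proportion-weighted average
    of its cell type specific expression levels."""
    return sum(w * h for w, h in zip(proportions, cell_type_expression))

# Hypothetical gene: expression 10 in cell type A and 2 in cell type B,
# identical in both samples (no true regulation difference).
cell_type_expression = [10.0, 2.0]

sample_1 = bulk_expression([0.8, 0.2], cell_type_expression)  # mostly type A
sample_2 = bulk_expression([0.2, 0.8], cell_type_expression)  # mostly type B
fold_change = sample_1 / sample_2
```

The bulk values are 8.4 and 3.6, a fold change above 2, even though the gene is not differentially expressed within either cell type.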

Cellular compositions of samples have been deconvolved using Fluorescence Activated Cell Sorting (FACS), Laser Capture Micro-dissection (LCM), and Translating Ribosome Affinity Purification (TRAP) to separate defined cell types (Okaty et al., 2011). However, these methods encounter technical difficulties, with limited availability of surface markers, increased stress on


Shen-Orr et al., 2010; Shen-Orr and Gaujoux, 2013). Therefore, deconvolution of bulk RNA-seq gene expression is more efficient, unbiased and economical (Qiao et al., 2012).

Hi-C is a genome-wide ("all-against-all") variant of the chromosome conformation capture technique, which measures spatial interactions and provides information on gene regulation and the 3D structure of the genome. It gives another level of information compared to RNA-seq data. Just like bulk RNA-seq data, Hi-C data are generated from millions of cells, providing a population average measure of heterogeneous cells. Therefore, they share the problem of bulk RNA-seq data that observed differences in contact information are confounded by the relative proportions of cell types among samples. There is a vast amount of research on deconvolution methods for RNA-seq data, but none have been applied to Hi-C data. Therefore, I study several RNA-seq deconvolution methods that were developed to solve for the cell type composition of biological samples, to see which methods can be effectively applied to bulk Hi-C deconvolution.

1.5 Deconvolution methods for RNA-seq

Most deconvolution methods take as input a gene expression matrix of samples and a cell-type-specific expression profile. The goal is to solve for the cell type proportions of the samples. With $n$ samples, $p$ genes and $k$ cell types, the gene expression matrix is an $n \times p$ matrix denoted $X$, the cell-type-specific expression matrix is a $k \times p$ matrix $H$, and the cell type proportions of the samples form an $n \times k$ matrix $W$. The different deconvolution methods are variations on solving for $W$ in $X = WH$. Below I go over each method together with its advantages and disadvantages. I aim to draw on the advantages of each method when applying it to Hi-C data.
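As a minimal sketch of the shared problem $X = WH$, the Python code below recovers one row of $W$ by projected gradient descent under a non-negativity constraint and then rescales it to sum to 1. It is not the algorithm of any method reviewed in this section; the function names (`matmul`, `solve_proportions`), the step-size rule and the iteration count are assumptions made only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve_proportions(x_row, h, n_iter=2000):
    """Estimate one sample's proportions w (length k) from x ~ w H.

    Projected gradient descent on 0.5 * ||w H - x||^2 subject to w >= 0,
    followed by rescaling so that the proportions sum to 1.
    """
    k, p = len(h), len(h[0])
    # 1 / trace(H H^T) is a conservative step size: trace(H H^T) bounds the
    # largest eigenvalue of H H^T, so every step decreases the objective.
    step = 1.0 / sum(v * v for row in h for v in row)
    w = [1.0 / k] * k
    for _ in range(n_iter):
        fitted = [sum(w[c] * h[c][g] for c in range(k)) for g in range(p)]
        resid = [fitted[g] - x_row[g] for g in range(p)]
        grad = [sum(resid[g] * h[c][g] for g in range(p)) for c in range(k)]
        # Gradient step, then projection onto the non-negative orthant.
        w = [max(0.0, w[c] - step * grad[c]) for c in range(k)]
    total = sum(w)
    return [v / total for v in w] if total > 0 else w


# Toy example (hypothetical numbers): k = 2 cell types, p = 3 genes, n = 2 samples.
H = [[10.0, 0.0, 5.0], [0.0, 8.0, 5.0]]
W_true = [[0.3, 0.7], [0.8, 0.2]]
X = matmul(W_true, H)
W_est = [solve_proportions(x_row, H) for x_row in X]
```

With noiseless data and a full-rank $H$, the estimate should agree closely with the proportions used to build $X$; real methods differ mainly in how they handle noise, unknown $H$, and gene selection.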

CIBERSORT

CIBERSORT (Newman et al., 2015) takes as input the gene expression of a complex tissue and a cell-type gene signature matrix, and finds the cell type proportions using $\nu$-support vector regression. The gene signature matrix is made more robust by minimizing an inherent matrix property called the condition number. CIBERSORT then adaptively selects genes from this signature matrix and solves the deconvolution problem. Support vector regression (SVR) fits a hyperplane to as many data points as possible within a distance $\epsilon$. Data points


signature matrix. These points are evaluated according to a linear $\epsilon$-insensitive loss function and provide a sparse solution to the regression in which overfitting is minimized. SVR seeks to minimize both the linear $\epsilon$-insensitive loss function and the L2-norm penalty function. $\nu$-SVR is employed since it sets both an upper bound on training errors and a lower bound on support vectors. Higher values of $\nu$ yield lower values of $\epsilon$. The current implementation of CIBERSORT runs $\nu$-SVR with a linear kernel for three values of $\nu$ (0.25, 0.5, 0.75) to solve for $W$ and saves the best result, determined by the lowest root mean squared error between $X$ and $W \times H$. Negative regression coefficients are set to 0 and the other coefficients are normalized to sum to 1. $X$ and $H$ are normalized to zero mean and unit variance for better performance and run time. The major challenge is obtaining an accurate cell-type gene signature matrix, since a pure cell type gene signature matrix may not be available. A key advantage of the method is that it performs feature selection, in which signature genes are selected to deconvolve the mixture samples.
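The post-processing described above (zeroing negative coefficients, normalizing to sum to 1, and keeping the fit with the lowest root mean squared error) can be sketched in Python as follows. This is an illustration rather than CIBERSORT's code: the $\nu$-SVR fit itself is not implemented, the raw coefficients passed in `candidates` are hypothetical inputs, and evaluating the error after normalization is an assumption, since the text does not say at which point the error is computed.

```python
import math


def coefficients_to_proportions(coefs):
    """Set negative regression coefficients to 0 and rescale the rest to sum to 1."""
    clipped = [max(0.0, c) for c in coefs]
    total = sum(clipped)
    if total == 0:
        raise ValueError("all coefficients are non-positive")
    return [c / total for c in clipped]


def rmse(x, w, h):
    """Root mean squared error between a mixture x and its reconstruction w H."""
    fitted = [sum(w[c] * h[c][g] for c in range(len(h))) for g in range(len(x))]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, fitted)) / len(x))


def pick_best(x, h, candidates):
    """Return (nu, proportions) with the lowest RMSE.

    candidates maps each nu value to the raw coefficients of that fit
    (hypothetical inputs; the regression itself is not shown here).
    """
    scored = {nu: coefficients_to_proportions(c) for nu, c in candidates.items()}
    best_nu = min(scored, key=lambda nu: rmse(x, scored[nu], h))
    return best_nu, scored[best_nu]
```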

PERT

PERT (Qiao et al., 2012) aims to correct two major limitations of deconvolution methods. The authors develop a flexible deconvolution method that accounts for the possible presence of new cell types and for possible differences in gene expression between the reference profiles and the constituent profiles. Four models are compared. The non-negative least squares model, NNLS, uses a linear regression framework to estimate the proportion of cell types in each sample; it assumes that the reference and constituent profiles are similar and requires cell-type-specific signature genes. The non-negative maximum likelihood model, NNML, makes the same assumption as NNLS but uses a Latent Dirichlet Allocation (LDA) framework. The LDA framework uses a multinomial noise model, which better fits the noise in gene expression data. The non-negative maximum likelihood new population model, NNML$_{np}$, is a version of ISOLATE (Quon and Morris, 2009); it is based on an LDA framework and assumes that the heterogeneous sample contains an additional constituent population that is not represented in the reference profiles. This is the authors' attempt to address the possibility of a new cell type. PERT, a perturbation model, is based on the NNML framework and considers transcriptional variation between reference and constituent profiles. NNML is less sensitive to the selection of cell type signature genes and can be


PERT takes as input reference and constituent profiles and outputs the cell type proportions in each sample. PERT relaxes the assumption that the provided reference distributions are a good representation of the constituent cell populations. It introduces a multiplicative factor $\rho_g$ to account for systematic changes in gene expression, which are assumed to be equal across cell types.

DSA

The Digital Sorting Algorithm, DSA (Zhong et al., 2013), takes a gene expression matrix of heterogeneous samples and marker genes of cell types, and solves for the cell type frequencies and the cell-type-specific gene expression profiles. DSA first estimates cell type frequencies from the marker genes by solving a system of linear equations. This step can be skipped if the cell type frequencies are already known. It then deconvolves the gene expression profiles of the mixed tissue samples into cell-type-specific expression profiles, using the estimated or provided cell type frequencies and solving the resulting equation by quadratic programming.

Quadratic programming is a non-negative matrix factorization (NMF) based method, and a major drawback is the non-uniqueness of the factorization (Donoho and Stodden, 2004). Therefore, DSA uses a set of marker genes, where a marker gene is a gene expressed in only one cell type; this is equivalent to the separability assumption for the uniqueness of NMF (Donoho and Stodden, 2004). In practice, marker genes are unknown, which is a challenge for the method.

Unified statistical framework for single cell and bulk RNA sequencing data

This framework (Zhu et al., 2018) uses both scRNA-seq and bulk RNA-seq data to solve the deconvolution problem. Single-cell RNA-seq data give a high-resolution view of cell types that cannot be seen in bulk data. However, they carry a lot of technical noise, leading to many dropout genes not seen in bulk data. For scRNA-seq data, it is important to distinguish between genes that are truly unexpressed and genes that are missed because of technical noise. This model draws on the strengths of both data types, obtains estimates of cell-type-specific gene expression profiles, and infers dropout gene expression and cell type proportions in bulk samples. Its inputs are scRNA-seq measurements, bulk RNA-seq measurements, and the number of clusters, so it is a reference-free method. Single-cell RNA-seq and bulk RNA-seq data are linked through the cell-type-specific gene expression profile matrix. The authors use single-cell RNA-seq and bulk RNA-seq data simultaneously to impute dropouts in single-cell RNA-seq data and to infer the cell type mixing proportion


specific profile matrix, and Gibbs sampling is used to infer the probability of dropout and mixing


CHAPTER 2: SAFE-clustering: Single-cell Aggregated (From Ensemble) Clustering for

Single-cell RNA-seq Data

To date, there is no published cluster ensemble approach across multiple types of clustering methods specifically designed for scRNA-Seq data. To bridge this gap, we have developed SAFE-clustering, Single-cell Aggregated (From Ensemble) clustering, to provide more stable, robust and accurate clustering for scRNA-Seq data. In the current implementation, SAFE-clustering first performs independent clustering using four state-of-the-art methods, SC3, CIDR, Seurat and t-SNE + k-means, and then combines the four individual solutions into one consolidated solution using one of three hypergraph partitioning algorithms: hypergraph partitioning algorithm (HGPA), meta-clustering algorithm (MCLA) and cluster-based similarity partitioning algorithm (CSPA) (Strehl and Ghosh, 2002).

2.1 Overview of SAFE-clustering

Our SAFE-clustering leverages hypergraph partitioning methods to ensemble results from multiple individual clustering methods. The current SAFE-clustering implementation embeds four clustering methods: SC3, Seurat, t-SNE + k-means, and CIDR. Figure 1 shows the


Figure 1: Overview of SAFE-clustering. The log-transformed expression matrix of scRNA-Seq data is first clustered using four state-of-the-art methods, SC3, CIDR, Seurat and t-SNE + k-means; the individual solutions are then combined using one of the three hypergraph-based partitioning algorithms, hypergraph partitioning algorithm (HGPA), meta-clustering algorithm (MCLA) and cluster-based similarity partitioning algorithm (CSPA), to produce a consensus clustering.

2.2 Expression matrix normalization

SAFE-clustering takes an expression matrix as input, where each column represents

one single cell and each row corresponds to one gene or transcript. To make the data well-suited

for all four individual clustering methods, UMI counts are converted into Counts Per Million

mapped reads (CPM). For CIDR input, Fragments/Reads Per Kilobase per Million mapped

reads (FPKM/RPKM) data are converted into Transcripts Per Million (TPM). Lastly, for SC3, CIDR, and t-SNE + k-means, the input expression matrix is log-transformed after adding one to each entry (to avoid taking the log of zero).
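The conversions described in this section can be sketched as below for a genes-by-cells matrix stored as a list of rows. This is an illustrative sketch, not the package's code: the function names are hypothetical, the FPKM-to-TPM conversion uses the usual rescaling of each cell to a total of one million, and the base-2 logarithm is an assumption because the text does not state the base.

```python
import math


def scale_columns_to_million(matrix):
    """Rescale each cell (column) of a genes x cells matrix to sum to 1e6."""
    totals = [sum(col) for col in zip(*matrix)]
    return [[1e6 * v / totals[j] if totals[j] > 0 else 0.0
             for j, v in enumerate(row)] for row in matrix]


def counts_to_cpm(counts):
    """UMI counts -> counts per million (CPM), computed per cell."""
    return scale_columns_to_million(counts)


def fpkm_to_tpm(fpkm):
    """FPKM/RPKM -> TPM: divide by the per-cell FPKM total, multiply by 1e6."""
    return scale_columns_to_million(fpkm)


def log_transform(matrix):
    """Element-wise log2(x + 1); the choice of base 2 is an assumption."""
    return [[math.log2(v + 1.0) for v in row] for row in matrix]


# Toy matrix (hypothetical numbers): 2 genes (rows) x 2 cells (columns).
counts = [[1, 0], [3, 5]]
log_cpm = log_transform(counts_to_cpm(counts))
```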

2.3 Benchmarking Datasets

For performance evaluation, we carried out clustering analysis on 12 benchmark

scRNA-Seq datasets (Table 1) (Baron et al., 2016; Biase et al., 2014; Darmanis et al., 2015; Ting et al.,

2014; Yan et al., 2013; Zeisel et al., 2015; Zheng et al., 2017), using our SAFE-clustering and the

four individual clustering methods. All these datasets have pre-defined gold/silver-standard (we


                                                                          # of estimated clusters
Dataset                        Organism  # single cells  # true clusters  SC3  CIDR  Seurat  t-SNE + k-means  SAFE-clustering
Baron human1                   Human              1,937               14   23     3      12                9               13
Baron human2                   Human              1,724               14   23     9      10                6                6
Baron human3                   Human              3,605               14   37     5      12               10               20
Baron human4                   Human              1,303               14   19     3       9                3                4
Baron mouse1                   Mouse                822               13   18    13       9                4                8
Biase                          Mouse                 49                3    3     5       3                3                4
Darmanis                       Human                420                8   11     7       5                4                7
Ting                           Mouse                187                7   13    10       5               10               10
Yan                            Human                 90                7    5     5       3                3                4
Zeisel                         Mouse              3,005                9   32     5      13                4               14
Simple case PBMC mixture       Human             28,733                3    3     3      17                3                3
Challenging case PBMC mixture  Human             32,695                3    2    10      13                3                3

Table 1: Major characteristics of the 12 benchmarking datasets, including organism of origin, number of single cells, and the numbers of true clusters and of clusters estimated by SAFE-clustering and the four individual methods.

Performance is measured by the similarity between the estimated cluster labels $L_E$ and the true cluster labels $L_T$ using the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985):

$$\mathrm{ARI}(L_E, L_T) = \frac{\sum_{e,t} \binom{n_{et}}{2} - \left[ \sum_{e} \binom{n_e}{2} \sum_{t} \binom{n_t}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_{e} \binom{n_e}{2} + \sum_{t} \binom{n_t}{2} \right] - \left[ \sum_{e} \binom{n_e}{2} \sum_{t} \binom{n_t}{2} \right] / \binom{n}{2}}$$

where $n$ is the total number of single cells; $n_e$ and $n_t$ are the numbers of single cells in estimated cluster $e$ and in true cluster $t$, respectively; and $n_{et}$ is the number of single cells shared by estimated cluster $e$ and true cluster $t$. ARI takes its maximum value of 1 when the two clusterings fully agree, and its expected value is 0 when the clusterings are random.
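The ARI formula can be evaluated directly from two label vectors. The sketch below is a plain Python transcription using only the standard library; the function name and the handling of the degenerate case (a zero denominator, for instance when both labelings place every cell in one cluster) are assumptions of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(estimated, truth):
    """Adjusted Rand Index between two label vectors of equal length."""
    if len(estimated) != len(truth):
        raise ValueError("label vectors must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("at least two cells are required")
    # n_et: cells shared by estimated cluster e and true cluster t.
    sum_et = sum(comb(v, 2) for v in Counter(zip(estimated, truth)).values())
    sum_e = sum(comb(v, 2) for v in Counter(estimated).values())
    sum_t = sum(comb(v, 2) for v in Counter(truth).values())
    expected = sum_e * sum_t / comb(n, 2)
    max_index = 0.5 * (sum_e + sum_t)
    if max_index == expected:
        # Degenerate case; returning 1.0 is a convention assumed here.
        return 1.0
    return (sum_et - expected) / (max_index - expected)


# Cluster names do not matter, only the grouping of cells.
same_grouping = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```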

2.4 Improving and running four state-of-the-art methods

We took care in choosing the four individual clustering methods, SC3 (Kiselev et al., 2017), CIDR (Lin et al., 2017), Seurat (Butler et al., 2018), and t-SNE + k-means (Maaten and Hinton, 2008), as diverse clustering methods are known to optimize ensemble solutions. These methods also estimate the number of clusters and do not rely on a priori knowledge of the number of clusters. Furthermore, we evaluated the effect of parameters in individual clustering methods, such as Seurat and t-SNE, and set default parameters to optimize the performance of the clustering methods and to minimize the number of parameters that users need to input. Altogether, we created a user-friendly package, where minimum input is needed from users to create robust


SC3

Quality control (QC) metrics are calculated on the input expression matrix to detect potentially problematic genes and/or single cells. In order to speed up computation, we first use the Tracy-Widom method (Tracy and Widom, 1994) to estimate the number of clusters, denoted by $\hat{k}_{opt-SC3}$. With the estimated $\hat{k}_{opt-SC3}$, matrices of Euclidean, Pearson and Spearman (dis)similarity metrics are calculated among single cells, followed by k-means clustering. Based on k-means results across the three different (dis)similarity matrices and two different dimension reduction methods, a consensus matrix is computed using CSPA, followed by hierarchical clustering to assign the single cells into $\hat{k}_{opt-SC3}$ clusters.

For the two PBMC mixture datasets (both with > 5,000 single cells), the default SC3 implementation employs a support vector machine (SVM) to further speed up computation. SC3 estimated 588 and 586 clusters for the simple and challenging case, respectively, deviating dramatically from the truth ($k = 3$ for both datasets). The $k$ estimation method in SC3 has not been benchmarked and validated for large, shallowly sequenced datasets, and it is likely that the distribution of eigenvalues of the covariance matrix does not adhere to the assumed Tracy-Widom distribution (Tracy and Widom, 1994). However, the clustering results of SC3 are not affected by this, since $k$ estimation in SC3 is completely independent of the clustering algorithm. Therefore, we produced a PCA plot (using the plotPCA function of the scater R package) to narrow down a reasonable range of $k$. The PCA plot suggested 3 distinct clusters for the simple case and 2 clusters for the challenging case (Figure 2). We therefore decided, for SC3, on $k = 3$ for the simple case and $k = 2$ for the challenging case. The SC3 ARI is 0.995 for the simple case at our selected $k = 3$ and 0.595 for the challenging case at $k = 2$.

CIDR

Given the normalized expression matrix, dropout candidates are identified and implicitly imputed to mitigate the impact of lowly expressed genes. Then, a dissimilarity matrix (Euclidean distance) is calculated between single cells using the imputed data (Lin et al., 2017). As CIDR performs principal coordinate analysis (PCoA) to reduce dimensionality, the number of principal coordinates (PCo's) identified, representing the estimated data dimensionality, heavily influences the final clustering results. Here, the number of PCo's is determined by the internal nPC


Figure 2: PCA plot for the simple case (a) and challenging case (b) to estimate the number of

clusters to input for SC3.

into $\hat{k}_{opt-CIDR}$ clusters, with $\hat{k}_{opt-CIDR}$ estimated using the Calinski-Harabasz Index (Caliński and Harabasz, 1974).
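For reference, the Calinski-Harabasz Index compares between-cluster to within-cluster dispersion, each divided by its degrees of freedom; candidate numbers of clusters can be compared by this score, with larger values indicating better separated clusters. The sketch below follows the standard textbook definition and is not taken from the CIDR source; the function name and the handling of a zero within-cluster dispersion are assumptions.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: [B / (k - 1)] / [W / (n - k)].

    B is the between-cluster and W the within-cluster sum of squared
    Euclidean distances; points is a list of coordinate tuples.
    """
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= k < n clusters")
    dim = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dim))
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    if within == 0:
        return float("inf")  # perfectly tight clusters (assumed convention)
    return (between / (k - 1)) / (within / (n - k))
```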

Seurat

Seurat embeds an unsupervised clustering algorithm, combining dimension reduction with graph-based partitioning methods. After gene and cell filtering, for counts input, Seurat normalizes the expression of each single cell by its total expression and multiplies by a scale factor of 10,000. A natural log transformation is subsequently applied. We skip the normalization step and only apply the log transformation if the input data are already normalized. After that, undesired sources of variation are regressed out. Single cells with < 200 expressed genes are labeled "NA" in the final Seurat clustering results. Data dimensionality is reduced via principal component analysis (PCA), with the number of principal components (PCs) selected by the nPC function in the CIDR package. Graph-based clustering is carried out using the smart local moving algorithm (SLM) (Waltman and Van Eck, 2013).

Seurat provides a "resolution" parameter to alter the granularity of the clustering results. However, the default "resolution" (= 0.8) tends to result in no clustering for small datasets, as shown in the SC3 paper (Kiselev et al., 2017). To further evaluate the performance of Seurat on small datasets, we generated 100 subsets of samples from the Darmanis dataset, using stratified random sampling without replacement where each cell type was one stratum and single cells

Figure 3: a) Performance of Seurat with varying resolution parameters (0.6, 0.9 and 1.2) depending on sample size, shown as ARI against cell number (61 to 239). b) Seurat clustering performance (ARI) in 3 different small datasets, Biase (49 cells), Yan (90 cells) and Ting (187 cells), with varying resolution parameters (0.6, 0.9 and 1.2).

Our sampling strategy resulted in 61 to 239 single cells from the eight cell types across the 100 generated datasets. The resolution was set to 0.6, 0.9 and 1.2, respectively, following the Seurat instructions. Because random sampling is non-deterministic, the sampling process and the downstream clustering were repeated 100 times for each resolution. The performance of each resolution is quantified by ARI against the published clustering. When sample size ranges from 61 to 150, Seurat clustering with resolution = 1.2 performs significantly better than with 0.6 or 0.9 (p < 0.05, Figure 3a), except for the comparison between resolutions 0.9 and 1.2 in the subset of 120 cells (p = 0.124). By comparison, only one cluster is identified in the subset of 61 cells when resolution = 0.6. When sample size increases to 210, resolution makes no difference.

When applying Seurat to the three small datasets, Biase (n = 49 single cells), Yan (n = 90) and Ting (n = 187), we used all three resolutions. Overall, Seurat performed better with resolution = 1.2 (Figure 3b), with the exception of the Yan dataset, where the clusterings at all three resolutions are the same. For the Biase dataset, Seurat cannot distinguish the different cell types with resolution = 0.6, but ARI reaches 1 when the resolution increases to 1.2.

t-SNE+kmeans

t-SNE is a popular method to visualize single cell clustering. Here, we use the Rtsne


default. However, when the number of input single cells is small, users may run into the problem that the default perplexity of 30 is too big. Since t-SNE has been shown to be fairly robust across perplexity values ranging from 5 to 50 (Maaten and Hinton, 2008), we set the perplexity to 10 when the input data contain < 200 single cells.

We apply k-means after t-SNE dimension reduction to obtain clustering results. Solutions from k-means clustering can vary dramatically across runs, even with the same input data and the same parameters, because of the random initial cluster centers. To mitigate this potentially highly stochastic behavior, we use the ADPclust R package (Wang and Xu, 2017) to first estimate the centroids. ADPclust can also estimate the number of clusters. Therefore, in our SAFE-clustering implementation, we perform k-means clustering using the centroids and the number of clusters estimated through ADPclust.
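The effect of supplying initial centroids can be illustrated with a small k-means (Lloyd's algorithm) written in Python. Once the starting centroids are fixed, the procedure contains no random step, so repeated runs on the same input return the same labels. This is a sketch only: ADPclust is not implemented here, and the centroids passed in are hypothetical stand-ins for its estimates.

```python
def kmeans_from_centroids(points, centroids, max_iter=100):
    """Lloyd's algorithm started from the given centroids.

    Returns (labels, centroids). No random initialization is involved,
    so the result is fully determined by the inputs.
    """
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy 2-D embedding (hypothetical numbers) with two provided centroids.
embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, centers = kmeans_from_centroids(embedding, [(0.0, 0.0), (5.0, 5.0)])
```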

Results from t-SNE + k-means are stochastic rather than deterministic. We compared the performance with and without this ADPclust centroid estimation step before k-means on four datasets: Yan, Ting, Darmanis and Baron human2. The expression matrix was log-transformed and dimensionality was reduced using t-SNE. For each clustering strategy, t-SNE was carried out 100 times. The number of clusters ranged from 2 to ($k_M + 2$), where $k_M$ is the maximum number of clusters detected by the individual methods. As expected, ARIs from the 100 runs without ADPclust centroid estimation varied dramatically at most values of $k$ attempted, where $k$ is the number of clusters fed to k-means (Figure 4). In contrast, k-means with ADPclust centroid the

[Figure 4: four panels (a to d) plotting ARI (0 to 1) against the number of clusters k, comparing k-means with ADPclust + k-means.]

2.5 Hypergraph Partitioning Cluster Ensemble Algorithms

After obtaining clustering results from the different individual methods, we perform cluster ensemble to provide a consensus clustering using one of three hypergraph-based partitioning algorithms: HGPA, MCLA and CSPA, as described in Strehl and Ghosh (2002). Certain single cells may be excluded from clustering by some individual clustering methods because of the quality control filters of those methods. The ensemble approach can still provide a consolidated assignment for these single cells by borrowing information from the solutions of the other methods.

We start by transforming the output labels of each clustering method into a hypergraph. Briefly, for the $j$th clustering method, we use $v_{ik}$ (note that the subscript $j$ is omitted for brevity) to denote the $i$th row of the hypergraph $H_j$, which is the row vector of cluster labels (coded as binary dummies or indicator functions) of the $i$th single cell, where

$$v_{ik} = \begin{cases} 1, & \text{the } i\text{th cell} \in \text{the } k_j\text{th cluster} \\ 0, & \text{the } i\text{th cell} \notin \text{the } k_j\text{th cluster} \end{cases} \quad \text{and } k_j = 1, 2, \ldots, K_j,$$

with $K_j$ being the total number of clusters from the $j$th clustering method. Here, each column is a hyperedge, representing one particular cluster identified by that method. An overall hypergraph $H$ is constructed by combining the individual hypergraphs (from the individual methods).
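The construction of the overall hypergraph can be sketched as follows: each method contributes one binary column (hyperedge) per cluster, and the columns of all methods are placed side by side. The sketch is illustrative; the function name is hypothetical, cluster labels are assumed to be integers, and `None` is an assumed encoding for a cell excluded by a method's quality control filter, which then gets zeros in all of that method's columns.

```python
def build_hypergraph(label_sets):
    """Combine per-method cluster labels into one binary hypergraph.

    label_sets: one list of labels per method, each of length n_cells;
    None marks a cell that the method excluded.
    Returns (h, edges): h[i][e] is 1 if cell i belongs to hyperedge e,
    and edges[e] = (method index, cluster label) identifies each column.
    """
    n_cells = len(label_sets[0])
    edges = []
    for j, labels in enumerate(label_sets):
        if len(labels) != n_cells:
            raise ValueError("all methods must label the same cells")
        for cluster in sorted({lab for lab in labels if lab is not None}):
            edges.append((j, cluster))
    h = [[1 if label_sets[j][i] == cluster else 0 for j, cluster in edges]
         for i in range(n_cells)]
    return h, edges


# Two methods, three cells; the second method excluded the third cell.
h, edges = build_hypergraph([[1, 1, 2], [1, 2, None]])
```

In this toy case the hypergraph has four hyperedges (two clusters from each method), and the excluded cell still carries membership information from the first method.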

HGPA

HGPA directly partitions hypergraphs by cutting a minimal number of hyperedges. We adopt the approach described in Karypis et al. (1999), where the authors developed a fast and efficient multilevel hypergraph partitioning algorithm through recursive bisection. Specifically, we perform a k-way hypergraph partitioning using the shmetis program in the hMETIS package v.1.5 (Karypis et al., 1999) for a range of $k$ from 2 to $\max(K_j)$, $j = 1, 2, 3, 4$, for the four different individual clustering methods, with $K_j$ again the total number of clusters from the $j$th method. The parameter UBfactor is set at 5, so that in any bisection, each of the two partitions


MCLA

Unlike HGPA, MCLA starts by computing pairwise Jaccard similarities ($S_J$) among all the hyperedges. Specifically, for any two hyperedges $h_p$ and $h_q$:

$$S_J = \frac{h_p h_q^T}{\lVert h_p \rVert^2 + \lVert h_q \rVert^2 - h_p h_q^T}$$

where $p, q = 1, \ldots, h$, and $h$ is the total number of hyperedges, which equals the sum of the estimated cluster numbers from the individual solutions. With the calculated similarity matrix, all the hyperedges are partitioned into $k$ meta-clusters using the gpmetis program in the graph partitioning package METIS v. 5.1.0 (Karypis and Kumar, 1998).
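For binary hyperedges, the Jaccard similarity reduces to the number of shared cells divided by the number of cells in either cluster. A short Python sketch of the pairwise computation is given below; the function names are hypothetical, returning 0 for two empty hyperedges is an assumed convention, and the subsequent partitioning with gpmetis is not reproduced.

```python
def jaccard(h_p, h_q):
    """Jaccard similarity h_p.h_q / (|h_p|^2 + |h_q|^2 - h_p.h_q)."""
    dot = sum(a * b for a, b in zip(h_p, h_q))
    denom = sum(a * a for a in h_p) + sum(b * b for b in h_q) - dot
    return dot / denom if denom else 0.0  # two empty hyperedges: assumed 0


def hyperedge_similarities(h):
    """Pairwise Jaccard matrix for the columns of a cells x hyperedges matrix."""
    columns = list(zip(*h))
    return [[jaccard(p, q) for q in columns] for p in columns]


# Three cells and four hyperedges (hypothetical toy hypergraph).
H = [[1, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 0]]
S = hyperedge_similarities(H)
```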

An association index $AI(MC_{ci})$ is computed to represent the association between meta-cluster $c$ and the $i$th single cell, by averaging the vertices $v_{ch}$ of the corresponding hyperedges:

$$AI(MC_{ci}) = \frac{1}{|H_c|} \sum_{h \in H_c} v_{ch}$$

where $H_c$ is the set of hyperedges assigned to meta-cluster $c$. Each single cell is assigned to the meta-cluster with the highest association index. However, some of the $k$ clusters may be empty because no single cell has its highest association index with them (Strehl and Ghosh, 2002). In that scenario, we re-label the single cells into $k'$ clusters, where $k'$ is the number of non-empty clusters.
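The assignment step can be sketched in Python as below: for each cell, the association index with a meta-cluster is the mean of that cell's entries over the hyperedges in the meta-cluster; the cell goes to the meta-cluster with the highest index, and the non-empty meta-clusters are then renumbered consecutively. The function name and the input encoding are hypothetical, and ties are broken here in favor of the lowest meta-cluster id, which is an assumption of this sketch.

```python
def assign_cells(h, meta_of_edge):
    """Assign each cell to the meta-cluster with the highest association index.

    h: cells x hyperedges binary matrix.
    meta_of_edge[e]: meta-cluster id of hyperedge e (e.g. from partitioning).
    Returns labels 1..k', where k' is the number of non-empty meta-clusters.
    """
    metas = sorted(set(meta_of_edge))
    raw = []
    for row in h:
        index = {}
        for c in metas:
            members = [row[e] for e, m in enumerate(meta_of_edge) if m == c]
            index[c] = sum(members) / len(members)  # association index
        # max() returns the first maximum, so ties go to the lowest id.
        raw.append(max(metas, key=lambda c: index[c]))
    # Re-label so that only non-empty meta-clusters receive a number.
    relabel = {c: new for new, c in enumerate(sorted(set(raw)), start=1)}
    return [relabel[c] for c in raw]


# Three meta-clusters are requested, but only two end up non-empty.
labels = assign_cells([[1, 1, 0], [0, 0, 1]], [0, 1, 2])
```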

CSPA


CSPA also starts with computing pairwise similarities. In contrast to MCLA, CSPA
