Inference of tumor subclonal composition
and evolution by the use of single-cell and
bulk DNA sequencing data
by
Salem Malikić
M.Sc., Simon Fraser University, 2014
B.Sc., University of Sarajevo, 2011
Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in the
School of Computing Science
Faculty of Applied Sciences
© Salem Malikić 2019
SIMON FRASER UNIVERSITY
Summer 2019
All rights reserved.
However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, education, satire, parody, criticism, review and news reporting is likely
Approval
Name: Salem Malikić
Degree: Doctor of Philosophy (Computing Science)
Title: Inference of tumor subclonal composition and evolution by the use of single-cell and bulk DNA sequencing data
Examining Committee:
Chair: Jian Pei, Professor
Leonid Chindelevitch, Senior Supervisor, Assistant Professor
S. Cenk Sahinalp, Co-Supervisor, Professor, School of Informatics, Computing and Engineering, Indiana University
Cedric Chauve, Supervisor, Professor
Colin Collins, Supervisor, Professor, Department of Urologic Sciences, University of British Columbia
Maxwell W Libbrecht, Internal Examiner, Assistant Professor
Russell Schwartz, External Examiner, Professor, Department of Biological Sciences, Carnegie Mellon University
Abstract
Cancer is a genetic disease characterized by the emergence of genetically distinct populations of cells (subclones) through the random acquisition of mutations at the level of single cells and shifting prevalences at the subclone level through selective advantages conferred by driver mutations. This interplay creates complex mixtures of tumor cell populations which exhibit different susceptibility to targeted cancer therapies and are suspected to be the cause of treatment failure. Therefore, it is of great interest to obtain a better understanding of the evolutionary histories of individual tumors and their subclonal composition. In this thesis we present three methods for the inference of tumor subclonal composition and evolution by the use of bulk and/or single-cell DNA sequencing data.
First, we present CTPsingle, a method which aims to infer tumor subclonal composition from single-sample bulk sequencing data. CTPsingle consists of two steps: (i) robust clustering of mutations using beta-binomial mixture modelling and (ii) inference of tumor phylogenies by the use of integer linear programming. On simulated data, we show that CTPsingle is able to infer the purity and the clonality of single-sample tumors with high accuracy even when restricted to a coverage depth as low as ∼30×. CTPsingle is currently used to infer clonality as a part of the Evolution and Heterogeneity Working Group of Pan Cancer Analysis of Whole Genomes project where sequencing data of over 2700 tumors are analyzed.
Next, we present B-SCITE, the first available computational approach that infers tumor phylogenies from combined single-cell and bulk sequencing data. B-SCITE is a probabilistic method which searches for the tumor phylogenetic tree maximizing the joint likelihood of the two data types. Tree search in B-SCITE is performed by the use of a customized MCMC search over the space of labeled rooted trees. Using a comprehensive set of simulated data, we show that B-SCITE systematically outperforms existing methods with respect to tree reconstruction accuracy and subclone identification. On real tumor data, mutation histories generated by B-SCITE show high concordance with expert-generated trees.
In the third part, we introduce PhISCS, the first method which integrates single-cell and bulk sequencing data while accounting for the possible existence of mutations affected by undetected copy number aberrations, as well as mutations for which the commonly used and recently debated Infinite Sites Assumption is violated. PhISCS is a combinatorial method and, in contrast to the available alternatives, which are mostly based on probabilistic search schemes, it can provide a guarantee of optimality of the reported solutions. We provide two different implementations of PhISCS: (i) an implementation based on the use of integer linear programming and (ii) an implementation based on the use of constraint satisfaction programming. We show that the latter has lower running time on most of the instances that we used to assess the performance of the two implementations. These results suggest that in some applications constraint satisfaction programming might be a viable alternative to commonly used integer linear programming. We also demonstrate the utility of PhISCS in analyzing real sequencing data, where it reports more plausible and parsimonious tumor phylogenies than the available alternatives.
Keywords: Intra-tumor heterogeneity; Tumor evolution; Single-cell DNA sequencing; Bulk DNA sequencing; Infinite sites assumption; Markov chain Monte Carlo; Joint probabilistic model; Integer linear programming; Constraint satisfaction programming
Acknowledgements
First and foremost, I would like to thank my supervisor Dr. S. Cenk Sahinalp for his extensive guidance, support and patience during my studies. I especially thank him for the endless effort he put into training me in the scientific field. I am also very thankful to the other supervisors, Dr. Leonid Chindelevitch, Dr. Cedric Chauve and Dr. Colin Collins, for following my work in the past years and providing many suggestions that helped improve it. In addition, I thank Dr. Maxwell Libbrecht for his insightful questions and helpful suggestions during the depth exam and thesis defence, where he served as an Examiner. This thesis was considerably improved by the input from Dr. Russell Schwartz, to whom I am very grateful for serving as an External Examiner, for his very detailed proofreading of the thesis and for providing numerous suggestions. Also, I thank Dr. Jian Pei for chairing the defence.
This work would be impossible without the many collaborators with whom I worked during my master's and doctoral studies. I thank Dr. Nilgun Donmez and Dr. Andrew McPherson for introducing me to the study of tumor heterogeneity, for providing extensive guidance in the first years of my research and for their vast contribution to the development of CITUP, which was the basis of my master's thesis, and CTPsingle, which is the first method presented in this thesis. B-SCITE, the second method presented in the thesis, is a result of a collaboration with the Computational Biology Group at ETH Zurich led by Dr. Niko Beerenwinkel. I spent five months in Switzerland working together with Dr. Beerenwinkel and two of the members of his group, Dr. Katharina Jahn and Dr. Jack Kuipers. I thank them all for their hospitality and collaboration on B-SCITE. Work on PhISCS, which is the third presented method, is a joint effort of the labs led by Dr. Sahinalp and Dr. Iman Hajirasouliha from Cornell University. In addition to Dr. Sahinalp and Dr. Hajirasouliha, here I would like to thank Simone Cicolella, Ehsan Haghshenas, Md. Khaledur Rahman, Camir Ricketts, Daniel Seidman and Dr. Faraz Hach for their contributions to this project. My special thanks go to Farid Rashidi Mehrabadi, who put endless effort into PhISCS, contributing to the data analysis, code preparation and methods design.
Some of the research that I was involved in is not included in the thesis. However, it helped me gain valuable experience in method development and collaborative research, and gave me an opportunity to attend several scientific conferences. For these collaborations, I would first like to thank Dr. Ibrahim Numanagic and Michael Ford, with whom I worked on the development of methods for genotyping highly polymorphic genes. I also thank Nikolai Karpov and Md. Khaledur Rahman for our joint work on the development of a similarity measure for comparing trees of tumor evolution. With Dr. Sahand Khakabimamaghani I worked on the development of a method for collaborative intra-tumor heterogeneity detection. I thank Dr. Khakabimamaghani for leading this work and for many insightful discussions about tumor heterogeneity and potential new ways of solving several important problems in the field. I also thank all members of the Evolution and Heterogeneity Working Group of the Pan Cancer Analysis of Whole Genomes (PCAWG) project, of which I have been a member since 2014.
I would also like to acknowledge insightful feedback received from other colleagues from Dr. Sahinalp’s lab and Laboratory for Advanced Genome Analysis at Vancouver Prostate Center: Dr. Yen Yi Lin, Ermin Hodzic, Can Kockan, Dr. Alex Gawronski, Iman Sarrafi, Hossein Asghari and Dr. Raunak Shrestha.
I am indebted to many teachers and professors who helped me in developing passion and enthusiasm towards Mathematics and the other scientific fields. Here, I especially thank Nermin Suljic, Ali Lafcioglu, Dr. Hasan Jamak and Dr. Dino Oglic. Furthermore, I thank all of the people from Bosna Sema Educational Institutions for providing an excellent environment and support during my high school and undergraduate studies, as well as Canadian granting agencies, in particular NSERC, for supporting my research.
I extend special thanks to two of my colleagues and friends, Dr. Ibrahim Numanagic and Ermin Hodzic. Our friendship dates back to the time of our undergraduate studies at the University of Sarajevo in Bosnia and Herzegovina. Had Ibrahim not come to SFU in 2011, it is very unlikely that I would have ever ended up studying here. He introduced me to Dr. Sahinalp and provided extensive help with everything. I have lived with Ermin throughout my PhD studies, and he has been a great roommate, colleague and friend.
I am very grateful to my dear aunt Faiza and uncle Dzevad together with their family for providing moral support and for all the help that they provided during my internship in Switzerland (where they are currently living). I also thank people from Contextual Genomics, a company where I have been working over the past eight months, for providing an excellent work environment and for the great understanding that they showed while I was preparing the thesis.
Last, but not least, I would like to express deep gratitude to my parents, Sadeta and Faiz, and sister Faiza, for their unconditional love and support. I devote this thesis to them and to my sweet little niece Amina.
Table of Contents
Approval
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Genetic basis of cancer and evidence for the existence of genetic intra-tumor heterogeneity
1.2 Cancer onset and evolution of cancerous cells
1.2.1 Clonal theory and branching model of tumor evolution
1.2.2 Other theories of tumor evolution
1.3 Clinical relevance of intra-tumor heterogeneity
1.4 Motivation, Contributions and Thesis Organization
2 Background
2.1 Next Generation Sequencing
2.1.1 Preparing the input of NGS experiment
2.1.2 Output of NGS experiment
2.1.3 The uses of NGS data in studies of intra-tumor heterogeneity and tumor evolution
2.2 Inference of tumor subclonal composition and evolution from bulk sequencing data
2.2.1 Variant and reference read counts as a proxy for the fraction of cells harboring mutation
2.2.2 Clustering of mutations based on the read counts
2.2.4 Theoretical limitations
2.2.5 Potential benefits of the use of multiple samples
2.2.6 Methods for the inference of clonal trees based on the use of SNVs
2.2.7 Methods based on the use of CNAs and other types of mutations
2.3 Inference of tumor subclonal composition and evolution from single-cell sequencing data
2.3.1 The main characteristics of single-cell sequencing data
2.3.2 Strengths and weaknesses of single-cell sequencing data in reconstructing trees of tumor evolution
2.3.3 The existing methods for studying ITH and evolution by the use of SNVs from single-cell sequencing data
2.3.4 Analysis of CNAs from single-cell sequencing data
2.4 Inference of tumor evolution and subclonal composition by integrative use of single-cell and bulk sequencing data
3 Clonality inference from single tumor samples using low coverage sequencing data
3.1 Introduction
3.1.1 Related work
3.2 Methods
3.2.1 Input processing
3.2.2 Robust clustering using beta-binomial mixture modelling
3.2.3 Estimation of tumor purity
3.2.4 Inference of tree of tumor evolution
3.3 Results
3.3.1 Simulations
3.3.2 Applications in real data analysis
3.4 Discussion
4 Integrative inference of subclonal tumor evolution from single-cell and bulk sequencing data
4.1 Introduction
4.1.1 Background
4.1.2 Contributions
4.2 Methods
4.2.1 Tree models of tumor evolution
4.2.2 Input data
4.2.3 Tree scoring based on bulk sequencing data
4.2.4 Tree scoring based on single-cell data
4.2.6 Compression of mutation trees into clonal trees
4.3 Results
4.3.1 Performance assessment on simulated data
4.3.2 Application to real data
4.4 Discussion
5 A combinatorial approach for sub-perfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data
5.1 Introduction
5.2 Methods
5.2.1 Input data
5.2.2 PhISCS-I for tumor phylogeny inference via single-cell sequencing (SCS) data with no mutation elimination allowed
5.2.3 Allowing mutations elimination in PhISCS-I
5.2.4 Additional ILP constraints to integrate VAFs derived from bulk sequencing data into PhISCS-I
5.2.5 PhISCS-B for tumor phylogeny inference via SCS data
5.2.6 Additional Boolean constraints to integrate VAFs derived from bulk sequencing data into PhISCS-B
5.3 Results on simulated data
5.3.1 Comparative running time analysis of PhISCS-I and PhISCS-B
5.3.2 Measuring accuracy in tree inference
5.3.3 Comparing the accuracy of PhISCS and alternative methods
5.4 Results on real sequencing data
5.5 Discussion
6 Conclusion
6.1 Future Directions
Bibliography
Appendix A Supplementary Material for CTPsingle: Clonality inference from single tumor sample using low coverage sequencing data
A.1 Simulation set up
A.2 Calculation of evaluation measures and run-time settings for AncesTree, LICHeE and PyClone
Appendix B Supplementary Material for B-SCITE: Integrative inference of subclonal tumor evolution from single-cell and bulk sequencing data
B.1 Details of generating simulated data
B.3 Derivation of the Binomial distribution approximation formula
B.4 Details of running ddClone, OncoNEM, SCITE, PhyloWGS and B-SCITE
B.5 Details of input data pre-processing for ALL, TNBC and CRC patients
B.6 Supplementary figures
Appendix C Supplementary Material for PhISCS: A combinatorial approach for sub-perfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data
C.1 Generalizing the triple-VAF constraints to arbitrary number of mutations
C.2 Simulation models used for benchmarking tumor phylogeny inference methods
C.2.1 Generating simulated data without ISA violations
C.2.2 Simulation of mutations violating ISA
C.2.3 Simulations involving mutations from regions affected by Copy Number Gains
C.3 TPTED measure for comparing tumor phylogenies
C.4 Benchmarking SCITE, SiFit, B-SCITE and PhISCS
C.5 Details of obtaining and pre-processing real data
C.6 Source codes of Max-SAT solvers used for the implementation of CSP formulation of PhISCS
List of Tables
List of Figures
Figure 1.1 Branching clonal model and clonal tree of tumor evolution
Figure 1.2 Alternative illustration of the branching clonal model and a clonal tree of tumor evolution in case where no losses of mutations are allowed
Figure 1.3 Linear and neutral models of tumor evolution
Figure 2.1 An example of copy number event affecting region harboring SNV
Figure 2.2 Clonal tree of tumor evolution and plot of distribution of cellular prevalences (estimated based on the read counts with sequencing depth of 200×)
Figure 2.3 Clonal tree of tumor evolution and plot of distribution of cellular prevalences (estimated based on the read counts with sequencing depth of 50×)
Figure 2.4 Desirable clustering of mutations from hypothetical examples with 200× and 50× coverage datasets
Figure 2.5 Limitation of bulk sequencing data in separating mutations of the same prevalence
Figure 2.6 Multiple clonal trees consistent with the mutation frequencies observed in bulk data
Figure 2.7 An overview of single-cell sequencing experiment and output data
Figure 2.8 Strengths and weaknesses of single-cell sequencing data in inferring pairwise order of mutations in tree of tumor evolution
Figure 2.9 Bulk data can improve phylogenetic inference by reducing the effects of noise in single-cell sequencing data
Figure 3.1 Comparison of purity inference accuracy of CTPsingle, PyClone, LICHeE and AncesTree
Figure 3.2 Comparison of CTPsingle, PyClone, LICHeE and AncesTree based on the absolute difference between the true and predicted number of subclones
Figure 3.3 Comparison of CTPsingle, PyClone, LICHeE and AncesTree based on the quadratic mean of difference of true and predicted lineage
Figure 3.4 Effect of false positive SNVs and copy number aberrations on the performance of CTPsingle
Figure 3.5 Performance of CTPsingle on simulated datasets containing increased number of subclones
Figure 4.1 Comparison of the inference of tumor evolution based on single-cell and bulk sequencing data
Figure 4.2 Schematic overview of B-SCITE
Figure 4.3 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure 4.4 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure 4.5 Comparison of ancestor-descendant accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure 4.6 Comparison of different-lineages accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure 4.7 The effect of CNAs on the co-clustering accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×
Figure 4.8 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4 bulk samples, varying bulk coverage and 25 sampled single cells
Figure 4.9 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4
Figure 4.10 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4 bulk samples, varying bulk coverage and 100 sampled single cells
Figure 4.11 Mutation histories inferred by CTPsingle, SCITE and B-SCITE for Patient 1 from childhood leukemia study (Gawad et al. 2014)
Figure 4.12 Mutation histories inferred by CTPsingle, SCITE and B-SCITE for Patient 2 from childhood leukemia study (Gawad et al. 2014)
Figure 4.13 Mutation histories inferred by the original study, SCITE and B-SCITE for triple-negative breast cancer patient (Wang et al. 2014)
Figure 4.14 Mutation histories inferred by B-SCITE for two colorectal patients with liver metastasis (Leung et al. 2017)
Figure 5.1 Comparisons of PhISCS and SiFit based on the normalized Robinson-Foulds distance
Figure 5.2 Comparison of PhISCS with SCITE based on the normalized MLTSM similarity measure.
Figure 5.3 Comparison of PhISCS with SCITE based on TPTED dissimilarity measure.
Figure 5.4 Comparison of PhISCS with SCITE on larger number of subclones and larger number of mutations.
Figure 5.5 Comparison of PhISCS and B-SCITE according to both MLTSM and its dual MLTD measures.
Figure 5.6 Mutation histories inferred by PhISCS for patient with primary colorectal cancer and liver metastasis (patient CRC2 from Leung et al. 2017)
Figure 5.7 Mutation histories inferred by SCITE, B-SCITE and PhISCS for Patient 2 from childhood leukemia study (Gawad et al. 2014)
Figure A.1 The distribution of lineage frequencies and fraction of mutations per subclone across all simulation datasets generated in CTPsingle
Figure A.2 Comparison of CTPsingle and CITUP on the simulated data
Figure B.1 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 6 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure B.2 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.3 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.4 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.5 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.6 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.7 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.8 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE as a function of the false negative rate. False positive rate was set to 0.00001.
Figure B.9 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE as a function of the false negative rate, but with highly elevated false positive rate of 0.01.
Figure B.10 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 6 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure B.11 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.12 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.13 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.14 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE as a function of the false negative rate. False positive rate was set to 0.00001.
Figure B.15 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE as a function of the false negative rate, but with highly elevated false positive rate of 0.01.
Figure B.16 The effect of CNAs on the co-clustering accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×
Figure B.17 The effect of CNAs on the ancestor-descendant accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×
Figure B.18 The effect of CNAs on the ancestor-descendant accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×
Figure B.19 The effect of CNAs on the different-lineage accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×
Figure B.20 The effect of CNAs on the different-lineage accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×
Figure B.21 Multiple clonal trees compatible with clustering of mutations inferred by CTPsingle for ALL patient
Figure B.22 Clonal trees for ALL and TNBC patients derived from B-SCITE
Chapter 1
Introduction
1.1 Genetic basis of cancer and evidence for the existence of genetic intra-tumor heterogeneity
Cancer is the common name for a group of over 200 diseases characterized by uncontrolled cell division. In the case of leukemia (cancer of the blood or bone marrow), cancer manifests itself through the overproduction of abnormal white blood cells, whereas in other cancer types uncontrolled cell division results in the formation of abnormal masses of cells, also known as malignant tumors. In addition, cancerous cells typically have the potential to leave the primary site of cancer origin and invade distant tissues, forming metastases [91, 127].
Cancer is nowadays widely recognized as a disease of the genome [91, 105, 13]. It is the most common genetic disease and is estimated to have been responsible for nearly 10 million deaths worldwide in 2018 alone [10]. Genetic mutations are one of the key causes of cancer onset, growth, spread and treatment resistance. At the time of clinical diagnosis, genomes of cancerous cells typically harbor a large number of mutations detectable from data generated by currently available DNA sequencing technologies. According to some estimates, most tumors harbor between 1,000 and 20,000 single-nucleotide variants (SNVs), and from a few to hundreds of copy number aberrations (CNAs) and other structural rearrangements [100].
While the role and importance of genetic mutations in cancer onset and progression have been studied (at a limited resolution) for a long time, completion of the first draft of the human reference genome in 2001 [22] and technological advancements in DNA sequencing, in particular the introduction of next-generation sequencing (NGS) technologies in 2004 [109], enabled researchers to study genomic profiles of individual tumors at unprecedented scale and resolution. These developments enabled sequencing of large parts or even whole genomes of individual tumors, as well as sequencing of large cohorts of tumor samples [173]. They were followed by the development of computational methods for the detection of various types of somatic mutations, such as single-nucleotide variants (SNVs), small insertions and deletions (indels), large-scale insertions, inversions, translocations, copy number aberrations (CNAs) and others [178, 78].
Sequencing data generated in the past years have revealed a striking degree of genetic intra-tumor diversity in cancer. In [50], mutation profiling of four patients with metastatic renal-cell carcinoma was performed. It revealed the existence of mutations present in some, but not in all, tumor sites, implying the existence of spatial genetic intra-tumor heterogeneity. This diversity was observed not only between physically separated primary and metastatic tumor sites, but also among distinct regions of the primary tumor that were sequenced independently. In another study, the authors tracked tumor progression in three chronic lymphocytic leukemia patients [147]. For each patient, five blood samples were obtained at different timepoints of disease progression and a subset of mutations was selected for targeted deep amplicon sequencing. The average depth of sequencing coverage achieved by amplicon sequencing was 100,000×, yielding highly reliable variant allele frequencies for the selected sets of mutations. Large differences in the values of the obtained variant allele frequencies for many pairs of mutations are a clear indicator of the existence of genetically distinct cells and temporal genetic intra-tumor heterogeneity. Similar findings were reported in [99, 83, 13, 116, 64, 179, 49, 172, 88] and many other studies.
In addition to genetic intra-tumor heterogeneity, there are several other types of heterogeneity in tumors of a single patient (e.g., epigenetic intra-tumor heterogeneity). However, since our main focus is on genetic intra-tumor heterogeneity, we adopt the convention that, in the rest of the thesis, the term intra-tumor heterogeneity, abbreviated as ITH, refers to this type of heterogeneity, unless stated otherwise.
1.2 Cancer onset and evolution of cancerous cells
In this section we will discuss several existing theories of cancer onset and evolution. Our main goal is to attempt to answer one of the fundamental questions about ITH: what are the mechanisms by which ITH emerges during tumor growth, and does it play any role in tumor progression?
Most of the available tumor sequencing data support the hypothesis of a single-cell origin of cancer. According to this hypothesis, cancer originates from a single cell, also known as the cancer founding cell, which acquires a set of mutations giving it some proliferative advantage over the neighboring healthy cells. The evidence supporting this can be found in studies where multiple regions of the same tumor were sequenced and it was observed that all regions share a common set of mutations [50, 166]. Additional evidence is provided by studies where mutational profiling of individual tumor cells was performed and sets of mutations present in all cancerous cells were identified [179, 49, 172]. Some studies also suggest that a small fraction of tumors might have multiple cells of origin [183, 152]. Such tumors are known in the literature as multicentric [70, 24] (the terms multifocal and polycentric are also used as synonyms, although in some publications the term multifocal has a different meaning [136]).
Mutagenesis (the production of genetic mutations) in cancer is a dynamic process. Existing mutations typically cause defects in the mechanisms that ensure the accuracy of DNA replication during cell division. As a consequence, cancerous cells are usually characterized by an elevated mutation rate in comparison to other cells. During cell division, as well as due to exogenous factors (e.g., tobacco smoke), cancerous cells acquire new mutations distinguishing them from the adjacent cells. There exist several theories of tumor evolution, which have different implications for the impact of newly acquired mutations and the role of intra-tumor heterogeneity in tumor progression. Below we summarize the most important of these theories.
1.2.1 Clonal theory and branching model of tumor evolution
In 1976, Peter Nowell proposed the clonal theory of cancer evolution, which posits that cancer is an evolutionary process driven by the acquisition of somatic mutations [121]. During tumor growth, descendants of the cancer founding cell acquire new mutations that are later passed on to their descendants, and the process continues over time. Consequently, at some timepoint, one of the descendants of the cancer founding cell can accumulate a critical set of mutations giving it some selective advantage over the other cells. The emergence of such a cell is then followed by the expansion of the population of its descendants, which leads to the formation of a genetically highly similar set of cells, better known as a subclone¹. This process continues over time and, at the time of clinical diagnosis, tumors usually consist of multiple subclones characterized by distinct sets of somatic mutations. The model of tumor evolution that follows the clonal theory and allows the co-existence of multiple subclones over time is known as branching clonal evolution and is shown in Figure 1.1.
Tree of tumor evolution
The process of tumor evolution can be depicted by a clonal tree (of tumor evolution), here also referred to as a tumor phylogeny, shown in Figure 1.1. In a clonal tree, individual nodes represent subclones, with the root node representing either the population of healthy cells or the first population of cancerous cells². Mutations are placed at the subclone (node) of their first occurrence. The first population of cancerous cells is also known as the cancer founding clone, and the mutations that it harbors (which are shared among all cancerous cells) as clonal or trunk mutations. In addition to a mutational label, a frequency label is also commonly assigned to each node. Frequency labels usually represent the prevalence of the corresponding subclone in the tumor sample or the average variant allele frequency of the mutations present in the node’s mutational label. When multiple tumor samples are sequenced, frequency labels can be represented as vectors of real numbers.

¹ In this thesis we define a subclone as a set of genetically highly similar cells (a similar definition can be found in [148]). Consequently, and for the sake of simplicity, the population of healthy cells will be treated as one of the subclones.
² Note that in the case of multicentric tumors the root node must represent the population of healthy cells.
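To make the labeling concrete, the following minimal Python sketch represents a clonal tree as a set of nodes, each carrying a mutational label and a frequency label. The class and function names, the tree topology, the mutation names and the prevalence values are all hypothetical and chosen only for illustration; they are not read from Figure 1.1, and the formal definition of a clonal tree is the one given later in Section 4.2.1.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Subclone:
    """One node of a clonal tree (hypothetical representation)."""
    name: str
    new_mutations: List[str]      # mutational label: mutations first occurring here
    prevalence: List[float]       # frequency label: one value per sequenced sample
    parent: Optional[str] = None  # None marks the root node


def build_clonal_tree(nodes: List[Subclone]) -> Dict[str, Subclone]:
    """Index nodes by name and check that they form a single rooted tree."""
    tree = {node.name: node for node in nodes}
    if len(tree) != len(nodes):
        raise ValueError("subclone names must be unique")
    roots = [node for node in nodes if node.parent is None]
    if len(roots) != 1:
        raise ValueError("a clonal tree needs exactly one root")
    for node in nodes:
        if node.parent is not None and node.parent not in tree:
            raise ValueError(f"unknown parent {node.parent!r}")
    # Walking up from every node must reach the root without revisiting a node.
    for node in nodes:
        seen = set()
        current = node
        while current.parent is not None:
            if current.name in seen:
                raise ValueError("cycle detected")
            seen.add(current.name)
            current = tree[current.parent]
    return tree


def first_occurrence(tree: Dict[str, Subclone], mutation: str) -> str:
    """Return the name of the subclone at which a mutation is placed."""
    for node in tree.values():
        if mutation in node.new_mutations:
            return node.name
    raise KeyError(mutation)


# Toy single-sample example (assumed values). The healthy population is
# treated as a subclone, and the founder clone has been outcompeted (0%).
toy_tree = build_clonal_tree([
    Subclone("healthy", [], [0.20]),
    Subclone("founder", ["m1", "m2"], [0.00], parent="healthy"),
    Subclone("A", ["m3"], [0.20], parent="founder"),
    Subclone("B", ["m4"], [0.35], parent="founder"),
    Subclone("C", ["m5"], [0.25], parent="B"),
])
total_prevalence = sum(node.prevalence[0] for node in toy_tree.values())
```

In this toy example the prevalences of all subclones in the single sample sum to one, and the trunk mutations m1 and m2 are placed at the founder node even though that subclone itself has zero prevalence.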
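[Figure 1.1 (image not reproduced). Left panel: tumor growth, with labels "time" and "tumor size". Right panel: "Clonal tree of tumor evolution", with subclone prevalences of 0%, 20%, 20%, 35% and 25%. Legend: healthy cell; first cancer cell; clonal (trunk) mutations; set of mutations; subclone.]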
Figure 1.1: Tumor growth according to the branching clonal model of tumor evolution (left) and clonal tree of tumor evolution (right). On the left, healthy cells are shown at the top as purple circles. The first cancerous cell and the set of mutations that it harbors are depicted as a blue circle and a red star, respectively. In the branching clonal model of tumor evolution, multiple subclones, depicted on the left as triangles of the same color, emerge over time and co-exist in the tumor (with the possibility of being outcompeted and eliminated). The emergence of subclonal populations is driven by the acquisition of somatic mutations. Sets of mutations distinguishing a subclone from its most recent ancestor (parent) are depicted as stars of different colors. A clonal tree of tumor evolution, shown on the right, is a convenient way of depicting tumor clonal evolution. Individual nodes of a clonal tree represent subclones, and mutations are placed at the node (subclone) of their first occurrence or, equivalently, on the edge connecting the node with its parent. Frequency labels are also commonly assigned to the nodes of the tree. Here, each frequency represents the prevalence of the corresponding subclone in this hypothetical tumor at the time the biopsy tissue was obtained (the latest timepoint in the left part of the figure). Note that some of the subclones that existed in the course of tumor evolution might have been outcompeted before the biopsy was obtained. Such subclones are assigned zero prevalence and, although they are absent from the sequenced sample, their existence in the tumor evolutionary history can, in some cases, be inferred from the sequencing data.
A mutation tree can be defined as a clonal tree of the highest granularity, where only a single mutation is placed at each node. Other tree representations of clonal evolution are discussed in [67] (e.g., it can be depicted by a binary genealogical tree), but here we restrict ourselves to the representation by clonal/mutation trees. Later, in Section 4.2.1, we also provide formal mathematical definitions of clonal and mutation trees.
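[Figure 1.2 (image not reproduced). Left panel: tumor growth along an axis labeled "time". Right panel: clonal tree with subclone prevalences of 20%, 20%, 35%, 25% and 0%.]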
Figure 1.2: Alternative illustration of the branching clonal model (left) and a clonal tree of tumor evolution (right) under the assumption that mutations present in a subclone are not lost in any of its descendants. As in Figure 1.1, circles in the left part represent cells and different sets of mutations are depicted as stars of different colors. In the clonal tree, each edge is labeled with the set of mutations distinguishing the child subclone from its parent, whereas the mutational label of each node consists of the set of all mutations harbored by the corresponding subclone. Note that, under the assumption of no losses of mutations, the tree on the right is equivalent to the clonal tree from Figure 1.1.
Under the assumption that none of the mutations present in some subclone are lost in any of its descendants, we can also depict the process of tumor growth and tree of tumor evolution as shown in Figure 1.2.
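Under this no-loss assumption, the node labels used in Figure 1.2 can be recovered from the edge labels by collecting mutations along the path from a node to the root. A minimal sketch follows; the dictionary-based tree encoding, the function name and the toy mutation names are hypothetical and serve only as an illustration.

```python
from typing import Dict, Optional, Set


def full_mutation_set(
    parent: Dict[str, Optional[str]],
    edge_mutations: Dict[str, Set[str]],
    node: str,
) -> Set[str]:
    """Union of the edge labels on the path from `node` up to the root.

    Valid only if no mutation is ever lost in a descendant subclone.
    """
    mutations: Set[str] = set()
    current: Optional[str] = node
    while current is not None:
        # edge_mutations[x] labels the edge between x and its parent.
        mutations |= edge_mutations[current]
        current = parent[current]
    return mutations


# Toy tree (assumed): healthy -> founder -> {A, B}, and B -> C.
parent = {"healthy": None, "founder": "healthy",
          "A": "founder", "B": "founder", "C": "B"}
edge_mutations = {"healthy": set(), "founder": {"m1", "m2"},
                  "A": {"m3"}, "B": {"m4"}, "C": {"m5"}}

genotype_of_C = full_mutation_set(parent, edge_mutations, "C")
```

Here subclone C inherits the trunk mutations of the founder and the mutation acquired by B, in addition to its own mutation.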
1.2.2 Other theories of tumor evolution
Although the clonal theory of tumor evolution is well established, with many real datasets supporting this model, alternative models of tumor evolution can also be found in the literature.
According to the multistep tumorigenesis model proposed by Fearon and Vogelstein in 1990, tumor progression follows a linear evolution [44]. In this model, analogously to the clonal theory of cancer evolution, acquisition of a set of somatic mutations can provide a selective advantage to the host cell and lead to the emergence of a new subclone. However, the model proposes that the acquired mutations provide such a strong selective advantage to the newly formed subclone that it soon outcompetes the existing one(s). Consequently, most of the time, tumors are expected to be largely homogeneous, with the bulk of the tumor mass consisting of a single, dominant subclone (see Figure 1.3). Next generation sequencing data increasingly disputes this simple model of tumor evolution. There is now overwhelming evidence that tumor evolution is a more complex and, in many cases, branching process in which multiple subclones co-exist in the same tumor [50, 49, 172].
[Figure 1.3 (image not reproduced). Panels: "Linear evolution" and "Neutral evolution".]
Figure 1.3: Linear and neutral models of tumor evolution. Coloring is analogous to that in Figure 1.1. In the linear model of tumor evolution, the set of mutations that drives the emergence of a new subclone provides it with such a strong selective advantage that it soon outcompetes the other subclone(s). The neutral model of tumor evolution posits that intra-tumor heterogeneity is a byproduct of tumor growth, but that mutations acquired by a cell do not confer a significant selective advantage on it. Consequently, according to this model, a large number of genetically distinct cells and a wide spectrum of mutational variant allele frequencies are expected to be observed in the tumor biopsy sample.
The neutral model of tumor evolution is another model that has gained attention in the past years [175]. It posits that most of the driver mutations are acquired at the early stages of tumor growth and that mutations occurring later typically do not provide a selective advantage to the host cells. In contrast to the linear theory, characterized by selective sweeps and a largely homogeneous tumor, a tumor following the neutral model of evolution is expected to have a large number of genetically distinct populations of cells and a wide spectrum of mutational variant allele frequencies (see Figure 1.3). The methodology used in [175] to demonstrate widespread neutral evolution among tumors was recently disputed in [103] and [161]. However, it is likely that some tumors evolve according to the neutral model, and further research and evidence supporting this model will be required in the future. A similar argument applies to the punctuated tumor evolution model, also known as the ’big bang’ model [156], which posits that most of the ITH arises at the early stages of tumor growth and is followed by stable expansion of one or several subclonal populations [160, 30, 156].
Cancer stem cell theory, which is based on the hypothesis that tumor growth is driven by a rare subpopulation of cells, dubbed cancer stem cells, is beyond the scope of this thesis, and we refer readers to [20] for more details about this theory.
Here, we use the clonal branching theory as the gold standard for simulating tumor evolution, although two of the three methods presented in the following chapters do not require a tumor to strictly follow this model of evolution, nor do they require it to be of single-cell origin.
1.3 Clinical relevance of intra-tumor heterogeneity
Before discussing the clinical relevance of ITH, we quote the very first sentence of one of the latest reviews on the topic: "Intratumor heterogeneity, which fosters tumor evolution, is a key challenge in cancer medicine." [104].
Numerous studies in the past years suggest that ITH has several potential clinical implications. For instance, in [93] and [108] a correlation between subclonal diversity and progression to esophageal adenocarcinoma in Barrett’s esophagus was reported. In chronic lymphocytic leukemia, the presence of a subclonal driver was found to be an independent risk factor for rapid disease progression [85]. The extent of ITH has also been linked to tumor metastatic potential and disease-free survival. Findings from [180] suggest that patients developing metastasis in triple-negative breast cancer had a significantly higher measure of ITH in the primary tumor, whereas a study of colorectal cancer [71] reported that a high degree of ITH in the primary tumor was correlated with an increased rate of liver metastasis and shorter disease-free survival.
The presence of extensive ITH and the ability of a tumor to acquire new mutations are considered to be among the key causes of treatment failure. In most cases, even if a drug works at first, it will not work over the long term [72]. Radiation and chemotherapy can promote the emergence of new subclones resistant to treatment, but treatment resistance can also be driven by a minor or dormant subclone already existing in the tumor prior to treatment initiation [23, 42, 147, 111, 41, 117].
While there is increasing evidence that ITH can be exploited in the clinic as a prognostic indicator [126], research on the design of effective treatments that will cure cancer, or at least prevent its uncontrolled growth and turn it into a chronic disease with low impact on the quality of life [4], is still in its infancy. A tumor’s ability to adapt to treatment and the pervasive intra- and inter-tumor heterogeneity, even among cancers of the same type, greatly complicate the design of clinical trials. We expect that technological advancements in sequencing, imaging, information sharing and many other fields will facilitate the design of larger clinical trials and inspire the discovery of novel therapeutic targets and treatment regimens. Adaptive therapy was proposed as a potential treatment strategy to prevent uncontrolled tumor growth [48]. Its main idea is to continuously modulate treatment in order to achieve a fixed tumor population while avoiding complete elimination of the subclones sensitive to treatment, since elimination of these subclones is typically followed by uncontrolled growth of chemoresistant populations. By contrast, allowing a fraction of chemosensitive subclones to survive can provide a means to suppress the proliferation of the less fit but chemoresistant subclones through competition for limited resources. Although adaptive therapy is an interesting approach for controlling tumor growth, its successful implementation in clinical practice will require a good understanding of the tumor subclonal composition, the fitness of individual subclones and the selective advantage of the chemosensitive over the chemoresistant subclones.
In addition to genetic ITH, which is our main focus, there are also other types of ITH. For example, in [131] methylation profiling of localized lung adenocarcinomas revealed a correlation between the extent of DNA methylation ITH and tumor size. In the same study, it was also found that, on average, most of the somatic DNA mutations were shared among all of the sequenced tumor regions, suggesting that they occurred at the early stages of tumor progression, whereas only a quarter of the differentially methylated probes were shared among all regions. These findings indicate that tumor-specific DNA methylation might be associated with the later branched evolution observed in the set of patients analyzed in this study [131]. It is also known that gene expression in individual tumor cells belonging to the same subclonal population can be influenced by their position in the tumor (e.g., the center of a tumor vs. its boundary) [62, 157]. Integrating genetic ITH with other types of ITH and other important factors (e.g., the interaction of tumor cells with the microenvironment) will be of great importance in future studies of tumor growth and progression.
1.4 Motivation, Contributions and Thesis Organization
We expect that, in the foreseeable future, analysis of tumor subclonal composition and evolution will become more common in clinical practice and aid clinicians in diagnostics and prognostics, as well as in making treatment decisions and designing the best therapies. Due to the extensive intra- and inter-tumor heterogeneity, such therapies will most likely be tailored to the genetic makeup of individual tumors and consist of a combination of several drugs targeting different subclonal populations. In this context, knowledge of the clonal tree of tumor evolution can also be highly valuable, as it reveals divergent subclonal populations (i.e., subclones evolving on different branches of the tree) that might need to be targeted separately, particularly in cases where drugs targeting clonal mutations fail to halt tumor growth.
Future studies of shared patterns of tumor evolution among large cohorts of cancer patients will also benefit from methods for accurate inference of clonal trees for individual patients. These studies will further improve our understanding of cancer onset and progression. Shared patterns can provide novel insights about the most significant mutations that promote tumor growth (i.e., driver mutations) and treatment resistance, but also about the advantages and disadvantages that the simultaneous presence of sets of mutations confers on the host cells. Furthermore, we expect that patterns of tumor evolution will be used for predicting the next steps in tumor evolution. Successful prediction of tumor evolutionary behavior requires deterministic patterns of tumor evolution and is expected to be an important subject of future cancer research. One recent study, involving over 100 patients diagnosed with clear-cell renal cell carcinoma, found evidence for the deterministic nature of clonal evolution in this cancer type [166]. However, these findings need to be validated on larger cohorts and similar studies need to be conducted for other cancer types.
Metastasis is estimated to be responsible for ∼90% of cancer-related deaths [107, 158]. Tracking metastatic seeding patterns is one of the important problems in gaining a better understanding of the biological background of metastasis. This task is very challenging, as metastatic seeding is a complex process. In addition to metastatic seeding from the primary site, existing metastases can give rise to new ones. Re-seeding between two existing metastatic sites adds a further level of complexity to the problem [16]. However, all cancerous cells of a given patient are related through a common (shared) tree of tumor evolution, which can provide answers to many questions related to the metastatic process. In the past years, we have witnessed increasing interest in research on various aspects of metastasis [55, 106, 63, 139], and specialized methods for studying metastatic seeding patterns were developed recently [40, 135]. We recommend [165] for a thorough review of the subject.
All of the above illustrates the importance of a better understanding of ITH and tumor evolution at the level of the individual patient. The inference of tumor subclonal composition and evolution can be performed by the use of various signals originating from detected mutations and will be discussed in more detail in Chapter 2. The pioneering work studying ITH and evolution typically focused on a small number of selected genomic alterations from several tumor samples and involved manually reconstructing the subclonal composition and/or phylogeny of these tumors [147, 50]. Developments in DNA sequencing technologies enabled large-scale cancer sequencing efforts where whole exomes/genomes of thousands of tumor samples were sequenced [173]. Manual analysis of each individual tumor from such large data cohorts clearly became impractical and required the development of automated computational methods.
The vast majority of all currently available tumor DNA sequencing data was obtained via bulk sequencing, where the DNA of millions of cells is pooled and sequenced together, giving only an average signal over a large number of cells. Studying tumor evolution and subclonal composition from such data requires the development of computational methods specialized for the deconvolution of mixed signals while handling intrinsic properties of sequencing data. We devote a separate section in Chapter 2 to the advantages and limitations of the use of bulk sequencing data in this context. In Chapter 3 we introduce CTPsingle, a method for the inference of tumor subclonal composition and evolution from bulk sequencing data. CTPsingle assumes that cancer originates from a single cell and follows the clonal theory of evolution. It is currently used as one of the methods to infer clonality in the Evolution and Heterogeneity Working Group of the Pan-Cancer Analysis of Whole Genomes project [173]. Like other similar tools, CTPsingle faces some theoretical limitations due to the limited resolution of bulk sequencing data.
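As a hedged illustration of why bulk data yields only an averaged signal, the sketch below computes the variant allele frequency expected for a mutation in a mixture of subclones. It assumes, purely for illustration, heterozygous mutations in diploid regions unaffected by copy number aberrations and no mutation losses; the function name and toy values are hypothetical, and this is not a description of the model used by CTPsingle.

```python
from typing import Dict, Set


def expected_vaf(
    prevalence: Dict[str, float],
    genotypes: Dict[str, Set[str]],
    mutation: str,
) -> float:
    """Expected variant allele frequency of `mutation` in a bulk sample.

    Assumption: every carrier cell is diploid at the locus and carries the
    mutation on exactly one of its two copies.
    """
    carrier_fraction = sum(
        prevalence[clone]
        for clone, mutations in genotypes.items()
        if mutation in mutations
    )
    return 0.5 * carrier_fraction


# Toy mixture (assumed): prevalences include the healthy cell population.
prevalence = {"healthy": 0.20, "A": 0.20, "B": 0.35, "C": 0.25}
genotypes = {
    "healthy": set(),
    "A": {"m1", "m3"},
    "B": {"m1", "m4"},
    "C": {"m1", "m4", "m5"},
}

vaf_trunk = expected_vaf(prevalence, genotypes, "m1")   # carried by A, B and C
vaf_branch = expected_vaf(prevalence, genotypes, "m4")  # carried by B and C
```

Each expected frequency is a sum over all subclones carrying the mutation, so a method working from bulk data has to recover the individual prevalences and the tree from such sums, which is the deconvolution problem mentioned above.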
Some of the limitations of bulk data can be resolved by the use of data obtained by the recently introduced single-cell sequencing. Several methods for single-cell sequencing developed in the past years generate data of the ultimate resolution for studying ITH and evolution. However, obtaining a large number of single cells that are a good representation of the subclonal populations present in a tumor is still very challenging due to the cost of single-cell sequencing and the non-uniform sampling of individual tumor cells. Furthermore, single-cell data is contaminated with various types of noise, which is a major obstacle to the analysis of tumor evolution by direct application of standard phylogenetic techniques. Successful use of this type of data in tumor phylogenetics therefore requires the development of specialized computational methods. We devote a separate section in Chapter 2 to a more detailed background on the main characteristics of this data type and to a summary of developments in the design of related computational methods.
Importantly, bulk data is not exploited in any of the previously developed methods for studying tumor evolution from single-cell data. As the strengths and weaknesses of the two data types are to a large extent complementary with respect to phylogeny inference, performing both bulk and single-cell sequencing may be a competitive strategy for tumor phylogeny reconstruction. In Chapters 4 and 5 we introduce B-SCITE and PhISCS, the first two methods for tumor phylogeny inference that leverage the complementary strengths of single-cell and bulk sequencing data in a joint inference framework. In addition to demonstrating superior performance over the existing alternatives on a comprehensive set of simulated data, we also show that these tools generate more realistic mutation histories on several real datasets. For PhISCS, we provide implementations based on both integer linear programming (ILP) and constraint satisfaction programming (CSP) and show that, at least in the context of tumor phylogeny inference, CSP might be a time-efficient alternative to the ubiquitously used ILP. In contrast to the existing methods for inferring trees of tumor evolution from single-cell data, which are based on probabilistic search schemes, PhISCS provides a guarantee of optimality for the reported solutions or a bound on the best achievable objective. PhISCS is also the first method that integrates single-cell and bulk sequencing data while accounting for possible violations of the commonly used and recently debated [81] infinite sites assumption.
Since all three methods are based on the use of single nucleotide variants, most of Chapter 2 is devoted to the description of related methods based on this type of mutation.
Although they are not our main contributions, it is worth mentioning that in this thesis we introduce potentially interesting strategies for generating simulated data, in particular mutations violating the infinite sites assumption and mutations affected by copy number aberrations. We also propose several distinct and novel measures for comparing trees of tumor evolution and use them in comparisons of performance of different methods for tree reconstruction. Methods for comparing trees of tumor evolution are currently lacking and we believe that the proposed measures will inspire future research on the topic. Some other interesting and important directions for future research are presented in Chapter 6.
Chapter 2
Background
In this chapter we provide background on the existing computational approaches for deciphering ITH and inferring trees of tumor evolution. Based on the input data used, we classify methods into three main groups: (i) methods designed only for bulk sequencing data, (ii) methods designed only for single-cell sequencing data, and (iii) methods combining both single-cell and bulk sequencing data. We devote a separate section to each group of methods and also describe the main advantages and limitations of bulk and single-cell sequencing data in studying ITH and evolution.
The rapid developments in the design of the algorithms and methods discussed in this thesis would have been largely impossible without the completion of the first draft of the human reference genome, announced in 2001 [22], and the invention of novel approaches for DNA sequencing that followed soon afterwards. In 2004 several technologies for massively parallel DNA sequencing, better known as Next Generation Sequencing (NGS), were introduced [109]. Since all methods discussed in this work rely on data generated by some of the available NGS platforms, we devote the first section of this chapter to NGS. In the following sections we discuss the main concepts and the existing methods for inferring tumor subclonal composition and/or trees of tumor evolution.
2.1 Next Generation Sequencing
Next generation sequencing, also called second-generation sequencing, is a common name for several sequencing technologies that first appeared in 2004 and gradually replaced traditional Sanger sequencing. Due to the high cost of sequencing at the early stages of technology development, its use was largely limited to academic research laboratories. The first NGS sequencers were used for sequencing selected genomic regions of interest and, rarely, for sequencing whole genomes [174, 170].
However, the broad potential of NGS technologies has been recognized since their introduction, and we have witnessed large investments and rapid technological advancements in the field over the past 15 years. Some of the main drivers of innovation include market competition among companies providing sequencing infrastructure, but also large financial support through public research funding agencies. For example, the National Human Genome Research Institute awarded more than 100 million USD for developments in NGS between 2004 and 2008 [109, 146]. As a result, the cost of sequencing has been falling steadily and whole genome sequencing (WGS) can nowadays be routinely performed, while the use of genetic tests is becoming common clinical practice [80, 73].
Developments in sequencing technologies also led to the development of different NGS platforms. A description of the particular details of the equipment and laboratory steps required to perform an NGS experiment falls largely outside the scope of this thesis and, for more thorough reading, we refer to some of the numerous reviews published on these topics [109, 153, 53]. Here, we focus only on summarizing the details of the NGS sequencing process and the generated output data that are most relevant to the development of the computational methods discussed later in the thesis.
2.1.1 Preparing the input of NGS experiment
One of the first steps that needs to be performed in an NGS experiment is input DNA preparation. Depending on the intended use of the data and on the financial, technical and other resources available, there are different strategies for the extraction of cellular DNA and the preparation of the final DNA that is later provided as input to the sequencing machine. We distinguish between different DNA preparation strategies in terms of the number of tumor cells represented in the final DNA (bulk vs. single-cell sequencing) and between different approaches to sequencing in terms of the size of the sequenced region (targeted sequencing, whole exome and whole genome sequencing). We now briefly describe these two classifications, without getting into application details that are discussed later.
Bulk vs. single-cell sequencing
Each NGS experiment requires a minimum amount of DNA to be provided as input to the sequencing machine. According to some estimates, this amount is equal to the amount of DNA found in approximately 80,000 single cells [169]. For this reason, the extraction of DNA from hundreds of thousands or millions of single cells is usually one of the first steps of input DNA preparation. DNA extraction can be followed by additional DNA amplification steps (e.g., in the targeted sequencing approach discussed below). Nevertheless, in this case the sequenced DNA is a mixture of DNA originating from hundreds of thousands or millions of single cells. Sequencing where the initial DNA material is obtained using this approach is also known as bulk sequencing.
On the other hand, since the amount of DNA present in a single cell is insufficient to perform sequencing, sequencing DNA from a single cell first requires precise extraction of the cell’s DNA, followed by several rounds of DNA amplification in order to reach the amount of DNA required for sequencing. Sequencing where DNA is prepared in this way is better known as single-cell sequencing (SCS).
Targeted, whole exome and whole genome sequencing
The process of input DNA preparation and sequencing also depends on the intended use of the data. The aim of a sequencing experiment might be to obtain data about a particular region (target) of interest or to interrogate genomic variation at the whole exome or whole genome level. Based on the size of the sequenced region, we divide sequencing experiments into the following three groups:
1. Targeted sequencing, which is used in cases where we are interested in examining genetic variants in a pre-defined set of genomic regions. Examples of the use of targeted sequencing include validating a mutation of interest and sequencing a selected set of genes known to harbor mutations causing a particular disease. In the clinical use of sequencing in cancer treatment, targeted sequencing of genes for which known treatment options exist is becoming common practice.
2. Whole Exome Sequencing (WES), which is used to search for mutations in protein-coding regions (exons) of genes. These regions are expected to harbor the majority of deleterious mutations.
3. Whole Genome Sequencing (WGS), where the goal is to obtain sequencing data of the whole genome.
2.1.2 Output of NGS experiment
The typical output of an NGS experiment consists of millions or, more recently, billions of sequencing reads (short fragments of the sequenced DNA). Nowadays, most NGS data is generated by the use of Illumina’s short read sequencing technology, which provides data of high throughput and accuracy. Reads generated by this technology are usually 100 to 150 base pairs long and have between 0.1% and 1% erroneously called bases.
Sequencing depth, also called sequencing coverage, is another important parameter of an NGS dataset. Coverage at a given position can be defined as the number of reads that overlap this position after read mapping (during read mapping, the locations in the Human Reference Genome that are most similar to individual reads or their parts are determined). The average coverage of a given set of genomic regions is defined as the mean coverage of the positions falling into these regions.
Sequencing breadth is usually defined as the percentage of positions that have sequencing depth greater than or equal to some given constant c. When computing sequencing breadth, only positions that were intended to be sequenced (e.g., the set of all exons in WES) are considered.
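The definitions of per-position coverage, average coverage and breadth can be made concrete with a small sketch. It assumes that mapped reads are given as half-open intervals [start, end) on a single target region; the function names and the toy reads are hypothetical, and real pipelines would compute these quantities from alignment files with dedicated tools.

```python
from typing import List, Tuple


def depth_per_position(
    reads: List[Tuple[int, int]], region_start: int, region_end: int
) -> List[int]:
    """Number of reads overlapping each position of [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for read_start, read_end in reads:
        # Only the part of the read that falls inside the target region counts.
        for pos in range(max(read_start, region_start), min(read_end, region_end)):
            depth[pos - region_start] += 1
    return depth


def average_coverage(depth: List[int]) -> float:
    """Mean depth over all targeted positions."""
    return sum(depth) / len(depth)


def breadth(depth: List[int], c: int) -> float:
    """Percentage of targeted positions with depth greater than or equal to c."""
    return 100.0 * sum(1 for d in depth if d >= c) / len(depth)


# Toy target region of 10 positions covered by three mapped reads.
toy_reads = [(0, 5), (3, 8), (4, 10)]
toy_depth = depth_per_position(toy_reads, 0, 10)
```

For this toy region the per-position depths are 1, 1, 1, 2, 3, 2, 2, 2, 1, 1, so the average coverage is 1.6 and the breadth at c = 2 is 50%.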
2.1.3 The uses of NGS data in studies of intra-tumor heterogeneity and tumor evolution
NGS nowadays enables cost-effective interrogation of genomic regions of interest or even whole genomes. Due to the importance of genetic aberrations in cancer onset and progression, sequencing of tumor samples has been one of the main applications of NGS since its very beginnings. Detection of various types of mutations by the use of whole exome or whole genome sequencing facilitates the identification of key cancer driver mutations and of the mutational burden among distinct cancer types. Targeted sequencing in search of genomic aberrations for which known treatment options are available is offered at affordable prices (on the order of hundreds of US dollars) and is starting to become clinical routine (e.g., in a study involving 1281 oncologists in the United States, 75.6% reported using NGS tests to guide treatment decisions for their patients [46]).
In addition to enabling large-scale studies of inter-tumor heterogeneity [162, 173], which is characterized by distinct sets of mutations present in distinct patients, developments in sequencing technologies have also facilitated the exploration and better understanding of the extent of genetic ITH and tumor evolution (see Section 1.1 for a summary of the first studies using NGS for analyzing ITH and evolution).
Due to cost and technological constraints, bulk sequencing is still the dominant approach in tumor sequencing. Most analyses of ITH and evolution from bulk sequencing data start with whole exome or whole genome sequencing [50, 55, 119, 147, 173, 172, 88]. While data produced by WES or WGS is of high sequencing breadth, its typical depth ranges between 30× and 100× [172, 49]. Due to this low sequencing depth, variant allele frequencies, defined as the fraction of reads supporting the variant allele of a reported mutation, are usually characterized by high variance. An additional limitation of such data lies in detecting rare mutations, as their signals are usually not discernible from sequencing noise. Therefore, it is also very common to use WES or WGS data to identify putative variants and then to perform targeted sequencing of a selected subset of these variants [147, 55, 172]. Targeted sequencing is performed in order to identify true variants and to obtain highly reliable variant allele frequencies, which are later used in the analysis or for validating findings obtained from WES or WGS data. Custom sequencing panels targeting hundreds of genes were recently used in [88] and [166]. These panels can provide data of higher coverage at lower cost than standard WES, but many important mutations in genes not covered by the panels can be missed.
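The effect of sequencing depth on the variance of variant allele frequencies can be made concrete with a small sketch. It assumes a simple binomial read-sampling model, in which each of the reads covering a position independently supports the variant allele with probability equal to the true frequency. This model and the function names are illustrative assumptions and are not taken from the cited studies.

```python
import math


def observed_vaf(variant_reads: int, total_reads: int) -> float:
    """Variant allele frequency: fraction of reads supporting the variant allele."""
    return variant_reads / total_reads


def vaf_standard_deviation(true_vaf: float, depth: int) -> float:
    """Standard deviation of the observed VAF under the assumed binomial model.

    If variant_reads ~ Binomial(depth, true_vaf), then
    Var(variant_reads / depth) = true_vaf * (1 - true_vaf) / depth.
    """
    return math.sqrt(true_vaf * (1.0 - true_vaf) / depth)


# Compare depths typical of WES/WGS with a (hypothetical) deep targeted panel.
for depth in (30, 100, 1000):
    sd = vaf_standard_deviation(0.25, depth)
    print(f"depth {depth:>4}x: VAF standard deviation = {sd:.3f}")
```

Under this model the standard deviation shrinks in proportion to the inverse square root of the depth, which is one way to see why targeted resequencing of putative variants yields more reliable frequencies.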
The first method for single-cell sequencing was introduced in 2011 [116]. Although there have been many developments in single-cell sequencing since then [81, 186], isolating, amplifying and sequencing the DNA of a large number of individual cells (which is necessary in order to obtain an appropriate input for inferring tumor subclonal frequencies and the tree of tumor evolution) is still challenging and expensive in comparison to bulk sequencing. Single-cell data is also characterized by elevated noise rates, with many false negative and some false positive mutation calls. Occasionally, DNA from two or more single cells may be extracted together, resulting in doublet noise, that is, output data that reflects the DNA of multiple cells. Non-uniform extraction of single cells can also result in sampling biases, where the numbers of cells sampled from subclones are not proportional to their cellular prevalences. The effect of this type of noise is significantly lower in bulk sequencing. Nevertheless, despite these various types of noise, single-cell sequencing yields data of the highest possible resolution and has great potential to revolutionize studies of tumor evolution. This type of data can also help in detecting some of the rare mutations that can be missed by standard bulk approaches. Some evidence for this was provided in [172] and [88]. However, detection of rare mutations by SCS depends on the quality of the generated data and on the sampling of single cells (due to noise in SCS data, usually at least two cells harboring a mutation need to be sampled in order to obtain reliable mutation calls).
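The requirement that at least two cells harboring a mutation be sampled can be turned into a rough back-of-the-envelope calculation. The sketch below assumes an idealized setting that the text above explicitly warns may not hold: cells are sampled uniformly and independently, each mutated cell independently loses its call with a fixed false negative rate, and false positives and doublets are ignored. The function name and all parameter values are hypothetical.

```python
def prob_at_least_two_calls(prevalence: float, n_cells: int,
                            false_negative_rate: float = 0.0) -> float:
    """Probability that at least two sequenced cells show a call for the mutation.

    Assumes the number of cells with a call is Binomial(n_cells, q), where
    q = prevalence * (1 - false_negative_rate).
    """
    if n_cells < 2:
        return 0.0
    q = prevalence * (1.0 - false_negative_rate)
    p_zero = (1.0 - q) ** n_cells                     # no cell shows the mutation
    p_one = n_cells * q * (1.0 - q) ** (n_cells - 1)  # exactly one cell shows it
    return max(0.0, 1.0 - p_zero - p_one)


# A mutation present in 5% of cells, with an assumed 20% false negative rate.
for n in (20, 50, 200):
    p = prob_at_least_two_calls(0.05, n, false_negative_rate=0.2)
    print(f"{n:>3} cells: P(at least two calls) = {p:.2f}")
```

Even under these favorable assumptions, the number of cells needed grows quickly as the prevalence of the mutation decreases, and sampling biases can only make the situation worse for under-sampled subclones.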
2.2 Inference of tumor subclonal composition and evolution from bulk sequencing data
Due to the input DNA preparation strategy used, a bulk sequencing dataset consists of a set of reads originating from a mixture of a large number of different cells and therefore yields only an aggregate signal about their DNA. Consequently, in the case of bulk sequencing of heterogeneous tumor tissue, none of the subclonal populations is observed directly and, in order to infer tumor subclonal composition, we need computational methods for deconvolution of the observed aggregate signals. Each such method faces the very challenging problem of inferring an unknown number of tumor subclones, their unknown prevalences and the unknown sets of somatic mutations harbored by the individual subclones. In addition, many of the methods also infer a clonal tree of tumor evolution (defined in Section 1.2.1) and exploit data obtained by sequencing multiple samples from the same patient [81].
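To illustrate what an aggregate signal means here, the following sketch computes the expected variant allele frequencies produced by a known mixture, i.e., the forward direction of the deconvolution problem. It relies on simplifying assumptions that are not part of the discussion above: every mutation is a heterozygous SNV in a diploid region unaffected by copy number changes, and read sampling noise is ignored. The population names, prevalences and mutation labels are made up for the example.

```python
from typing import Dict, Set


def expected_vafs(prevalence: Dict[str, float],
                  mutations: Dict[str, Set[str]]) -> Dict[str, float]:
    """Expected bulk VAF of each mutation for a given mixture of populations.

    prevalence: fraction of cells in the sample belonging to each population
                (including normal cells); the fractions should sum to 1.
    mutations:  set of mutations harbored by each population.
    Assumption: each mutated cell carries the variant on one of its two
    copies, so a population contributes 0.5 * prevalence to the VAF.
    """
    vafs: Dict[str, float] = {}
    for population, fraction in prevalence.items():
        for mutation in mutations.get(population, set()):
            vafs[mutation] = vafs.get(mutation, 0.0) + 0.5 * fraction
    return vafs


# Hypothetical sample: 20% normal cells and two tumor subclones.
toy_prevalence = {"normal": 0.2, "subclone_A": 0.5, "subclone_B": 0.3}
toy_mutations = {"normal": set(), "subclone_A": {"m1"}, "subclone_B": {"m1", "m2"}}
print(expected_vafs(toy_prevalence, toy_mutations))  # m1 close to 0.40, m2 close to 0.15
```

A deconvolution method has to go in the opposite direction: given only noisy values such as those for m1 and m2, it must recover the number of populations, their prevalences and their mutation sets.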
Due to their high prevalence in many tumors [19], the availability of well-developed methods for their identification from bulk data [178] and the simplicity of their use in modeling tumor subclonal composition and evolution, single-nucleotide variants (SNVs) are the most widely used type of mutation among the existing methods for studying ITH and evolution [97, 60, 38, 39, 128, 142, 66, 130, 33, 150, 125, 189, 159, 145, 110, 69, 120, 164, 114, 133]. In addition, several methods based on the use of copy number aberrations (CNAs) [58, 124, 123, 25, 185] and a few based on the use of other types of variants (e.g., large insertions) [37, 21] were also