
Inference of tumor evolution and subclonal composition by integrative use of

Because bulk sequencing samples the underlying subclonal populations relatively uniformly, methods based on bulk sequencing data usually report the prevalence of each subclone. In contrast, SCS data are more prone to sampling bias, in which the fraction of sequenced cells originating from a given subclone differs substantially from that subclone's cellular prevalence. Even when single cells are sampled more uniformly, the sampling variance in relatively small SCS datasets is very likely to leave some subclones under- or over-represented among the sampled cells.
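To make the effect of sampling variance concrete, the following minimal sketch assumes that cells are drawn independently and uniformly from the tumor, so that the number of sampled cells belonging to a subclone is binomially distributed. The function names and the example values (a subclone of 10% prevalence, 20 sequenced cells) are illustrative assumptions and are not taken from any dataset discussed in this thesis.

```python
import math


def sampling_std_error(prevalence: float, n_cells: int) -> float:
    """Standard error of the observed subclone fraction among n_cells cells,
    assuming independent uniform sampling (binomial model)."""
    return math.sqrt(prevalence * (1.0 - prevalence) / n_cells)


def prob_subclone_missed(prevalence: float, n_cells: int) -> float:
    """Probability that none of the n_cells sampled cells comes from the subclone."""
    return (1.0 - prevalence) ** n_cells


if __name__ == "__main__":
    p, n = 0.10, 20  # hypothetical subclone prevalence and number of cells
    print(f"standard error of observed fraction: {sampling_std_error(p, n):.3f}")
    print(f"probability subclone is unobserved:  {prob_subclone_missed(p, n):.3f}")
```

Under these assumptions, a subclone with 10% prevalence is absent from roughly 12% of experiments that sequence 20 cells, and its observed fraction has a standard error of about 0.07, which is of the same order as the prevalence itself.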

For these reasons, as the above presentation of SCS-based methods showed, these methods usually do not infer subclonal frequencies.

Although methods operating on bulk data report subclonal frequencies, we have seen that they have theoretical limitations due to the limited resolution of bulk data. One of the most notable arises when clusters of mutations of similar prevalence co-exist in the sequenced tumor. It can easily be observed that, due to the lineage precedence rule, mutations from these clusters must occur on different branches of the clonal tree. If matching SCS data are available and several single cells harboring one of the mutations are sampled from each of the branches, these cells are expected to provide strong evidence for separating the mutations (above we discussed the strength of single-cell data in accurately identifying pairs of mutations from different branches of the clonal tree).

Hence, the information about mutual co-occurrence of mutations in individual cells, which is available in single-cell data, can be used as an informative prior in distinguishing clusters of mutations of the same prevalence. This was recently demonstrated in ddClone [144], the first method which infers tumor subclonal composition by the joint use of single-cell and bulk sequencing data.
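The kind of co-occurrence evidence described above can be sketched as follows. This is a simplified illustration and not the model used by ddClone: it assumes a noise-free binary cell-by-mutation matrix (no false positives, false negatives or missing entries), and the function names and the `min_cells` threshold are hypothetical.

```python
from typing import List, Tuple


def pair_counts(matrix: List[List[int]], i: int, j: int) -> Tuple[int, int, int]:
    """Count cells carrying both mutations, only mutation i, and only mutation j."""
    both = sum(1 for row in matrix if row[i] == 1 and row[j] == 1)
    only_i = sum(1 for row in matrix if row[i] == 1 and row[j] == 0)
    only_j = sum(1 for row in matrix if row[i] == 0 and row[j] == 1)
    return both, only_i, only_j


def supports_different_branches(
    matrix: List[List[int]], i: int, j: int, min_cells: int = 2
) -> bool:
    """True if mutations i and j never co-occur and each is seen alone in at
    least min_cells cells, which (without noise) places them on different branches."""
    both, only_i, only_j = pair_counts(matrix, i, j)
    return both == 0 and only_i >= min_cells and only_j >= min_cells


if __name__ == "__main__":
    # Rows are cells, columns are two mutations of similar bulk prevalence.
    cells = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
    print(pair_counts(cells, 0, 1))
    print(supports_different_branches(cells, 0, 1))
```

With real SCS data the counts would have to be interpreted probabilistically, since allelic dropout and false positive calls can create or hide co-occurrences.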

Ambiguities in reconstructing clonal trees are the main challenge in studying tumor evolution from bulk data. However, SCS data can help resolve some of the ambiguities in tree inference that are frequently encountered by methods using bulk data alone. For example, single-cell data can provide a strong signal for placing the green and blue clusters in Figure 2.6 on different branches, thus significantly reducing the probability that the chain topology is incorrectly reported among the optimal solutions (recall that we assume that the leftmost tree in Figure 2.6 is the ground truth tree).

The negative effect of sampling biases and false negative mutation calls, which can perturb the order of mutations along linear chains in trees inferred from SCS data (see Figure 2.8), can be reduced or eliminated by matching bulk data, as illustrated in Figure 2.9.

For example, in the case shown in Figure 2.8, matching bulk data can provide a strong signal for placing the clonal mutation M3 as an ancestor of mutation M4 (the same applies to M1 and M2), thus fixing the incorrect ordering inferred when SCS data are used as the only input.
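The reasoning behind this correction can be sketched in a few lines. The sketch assumes the infinite sites model and noise-free cellular prevalence (CP) estimates, and uses the values CP(M3) = 0.85 and CP(M4) = 0.75 shown in Figure 2.9. It combines the lineage precedence rule (an ancestor cannot have lower CP than its descendant) with the observation that two mutations whose CPs sum to more than 1 must share at least one cell and hence lie on the same lineage. The function names and the tolerance parameter are hypothetical.

```python
from typing import Optional, Tuple


def can_precede(cp_ancestor: float, cp_descendant: float, tol: float = 0.0) -> bool:
    """Lineage precedence rule: an ancestor's CP is at least its descendant's CP."""
    return cp_ancestor + tol >= cp_descendant


def must_share_lineage(cp_a: float, cp_b: float, tol: float = 0.0) -> bool:
    """If two CPs sum to more than 1, some cell carries both mutations,
    so the mutations cannot sit on different branches."""
    return cp_a + cp_b > 1.0 + tol


def forced_order(
    name_a: str, cp_a: float, name_b: str, cp_b: float, tol: float = 0.0
) -> Optional[Tuple[str, str]]:
    """Return (ancestor, descendant) if bulk CPs force a unique order, else None."""
    if not must_share_lineage(cp_a, cp_b, tol):
        return None  # a branching placement is still possible
    a_first = can_precede(cp_a, cp_b, tol)
    b_first = can_precede(cp_b, cp_a, tol)
    if a_first and not b_first:
        return (name_a, name_b)
    if b_first and not a_first:
        return (name_b, name_a)
    return None  # equal CPs: order is not determined by bulk data alone


if __name__ == "__main__":
    print(forced_order("M3", 0.85, "M4", 0.75))
```

For the values in Figure 2.9 the only consistent placement is M3 as an ancestor of M4, regardless of the order suggested by the sampled single cells. In practice CP estimates are noisy, so a nonzero tolerance or a probabilistic treatment would be needed.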

[Figure 2.9 panels: "True tree" with mutation clusters M1, M2, M3; M4, M5; M6, M7; M8, M9, M10 and prevalence labels 15%, 10%, 35%, 20%, 20%; bulk-derived values CP(M3) = 0.85 and CP(M4) = 0.75; "Lineage precedence rule"; "Inferred order of M3 and M4".]

Figure 2.9: Bulk data can improve phylogenetic inference by reducing the effects of sampling biases and false negative mutation calls in SCS data (see also Figure 2.8).

While ddClone leverages information from both single-cell and bulk data to infer tumor subclonal composition, it does not infer the tree of tumor evolution. A further limitation is that it requires an estimate of tumor purity, the copy number status of mutated loci and (highly recommended) pre-processing of the single-cell mutation matrix. Tools for joint inference of tumor subclonal composition and evolution that exploit the complementary strengths of single-cell and bulk data can be a more accurate alternative to ddClone and to methods based on either type of data alone. This was also recently demonstrated in [87] for copy number aberrations. In Chapters 4 and 5 we introduce two SNV-based methods, BSCITE and PhISCS, which combine the two data types in a joint inference framework. Similar to [87], our results also suggest that combining single-cell and bulk data might be a competitive strategy for studying intra-tumor heterogeneity and evolution.

Chapter 3

Clonality inference from single tumor samples using low coverage sequencing data

Abstract

Inference of intra-tumor heterogeneity can provide valuable insight into cancer evolution. Somatic mutations detected by sequencing can help estimate the purity of a tumor sample and reconstruct its subclonal composition. While several methods have been developed to infer intra-tumor heterogeneity, the majority of these tools rely on variant allele frequencies estimated via ultra-deep bulk sequencing of multiple tumor samples extracted from the primary and/or metastatic sites of the same patient. In practice, obtaining sequencing data from a large number of such samples per patient is only feasible in a few cancer types such as liquid tumors, or in rare cases involving solid tumors selected for research. We introduce CTPsingle, which aims to infer the subclonal composition using low-coverage bulk sequencing data from a single tumor sample. We show that CTPsingle is able to infer the purity and the clonality of single-sample tumors with high accuracy, even when restricted to a sequencing depth of ∼30×.

3.1 Introduction

In the past decade, cancer genomics and sequencing have revealed a striking degree of intra-tumor diversity in cancer. Molecular evidence increasingly suggests that this diversity has clinical implications. The pioneering work studying intra-tumor heterogeneity typically focused on a small number of selected genomic alterations from several tumor samples and involved manually reconstructing the phylogeny of these tumors [147, 50]. Nevertheless, large-scale cancer sequencing efforts such as the PanCancer Analysis of Whole Genomes (PCAWG) require fully-automated methods [173].

This chapter is largely based on the work presented in RECOMB 2016 and published in [36].

Previously, we developed a tool named CITUP to tackle this problem when multiple samples from the same tumor are available [97]. Using simulations and real data, we showed that CITUP is able to reconstruct the tumor phylogeny when supplied with deep sequencing data on multiple samples from a single patient [97]. While targeted deep sequencing or high coverage exome sequencing are feasible alternatives to whole genome sequencing, obtaining multiple samples from solid tumors is a challenge in most clinical settings.

In fact, the majority of the tumors in the cohort currently analyzed by PCAWG [173] have single-sample, low to medium coverage bulk sequencing data. Unfortunately, for CITUP and similar tools that exploit multiple samples to infer clonality, the ability to robustly determine the subclonal architecture of tumors deteriorates as the number of samples per tumor decreases [97]. To overcome this challenge and improve the estimation of purity and subclonal composition in single-sample tumors, we introduce a new tool named CTPsingle that is specifically designed to work with low coverage bulk sequencing data from a single sample.

CTPsingle features a robust clustering framework based on a beta-binomial mixture model and infers possible phylogenies using a fast mixed integer linear programming (mILP) formulation. Currently, CTPsingle is also used to infer clonality as a part of the Tumor Evolution and Heterogeneity working group of PCAWG [173]. CTPsingle is freely available from https://github.com/nlgndnmz/CTPsingle and its core functionality has been implemented in R using the open source packages DPpackage [68] and lpSolve [12].

3.1.1 Related work

CTPsingle is partially based on CITUP, which uses a mixed Quadratic Integer Programming (mQIP) framework. Like CITUP, CTPsingle works on somatic single nucleotide variants (SNVs) in copy neutral regions of the genome. However, unlike CITUP, which takes variant allele frequencies (VAFs) as input, CTPsingle takes reference and variant read counts as input and clusters SNVs using a beta-binomial mixture model. This allows CTPsingle to infer the number of subclones before the phylogeny search and to account for the higher noise in VAFs associated with low coverage. In addition, CTPsingle employs a simplified, iterative mILP formulation implemented using the freely available lpSolve library [12] and does not rely on any commercial libraries such as IBM CPLEX.
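To illustrate why read counts carry more information than VAFs at low coverage, the sketch below evaluates the beta-binomial log-probability of observing a given number of variant reads. It is not CTPsingle's implementation (which is written in R on top of DPpackage); the mean/precision parameterization, the function name and the example values are assumptions made for illustration only.

```python
import math


def beta_binomial_logpmf(variant_reads: int, depth: int, mean_vaf: float, precision: float) -> float:
    """Log-probability of variant_reads out of depth reads under a beta-binomial
    with alpha = mean_vaf * precision and beta = (1 - mean_vaf) * precision."""
    if not (0.0 < mean_vaf < 1.0) or precision <= 0.0:
        raise ValueError("mean_vaf must lie in (0, 1) and precision must be positive")
    if not (0 <= variant_reads <= depth):
        raise ValueError("variant_reads must lie between 0 and depth")
    alpha = mean_vaf * precision
    beta = (1.0 - mean_vaf) * precision
    # log of the binomial coefficient C(depth, variant_reads)
    log_choose = (
        math.lgamma(depth + 1)
        - math.lgamma(variant_reads + 1)
        - math.lgamma(depth - variant_reads + 1)
    )
    # log B(k + alpha, n - k + beta) - log B(alpha, beta)
    log_beta_num = (
        math.lgamma(variant_reads + alpha)
        + math.lgamma(depth - variant_reads + beta)
        - math.lgamma(depth + alpha + beta)
    )
    log_beta_den = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_choose + log_beta_num - log_beta_den


if __name__ == "__main__":
    # The same VAF of 0.2 observed at two depths, scored against a
    # hypothetical cluster with mean VAF 0.25 and precision 50.
    for k, n in [(6, 30), (60, 300)]:
        print(n, beta_binomial_logpmf(k, n, mean_vaf=0.25, precision=50.0))
```

Because the likelihood depends on both the variant read count and the depth, an SNV observed at 6 of 30 reads constrains its cluster assignment far less than one observed at 60 of 300 reads, even though both have the same VAF. A mixture model over such components can therefore weight each SNV by the amount of evidence behind it.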

CTPsingle is also related to TrAp [159], PhyloSub [70], rec-BTP [60], Clomial [189], BayClone [151], PyClone [142], PhyloWGS [33], LICHeE [128] and AncesTree [38]. The majority of these methods are designed to work with SNVs in copy neutral regions and are developed specifically for multiple samples, while a few of them work only with single-sample datasets. Other relevant tools such as THetA [123], THetA2 [124], TITAN [58] and CLONET [129] are designed to work on copy number data, although some of them allow the use of additional types of mutation calls.

While rec-BTP is also designed exclusively for single-sample tumors, we previously showed that this method performs worse than CITUP even on single-sample datasets [97]. Moreover, this tool does not report which mutations are assigned to which subclones, which prevents us from calculating some of the evaluation measures used in this work.

Instead, we compare CTPsingle to AncesTree, LICHeE and PyClone, which can also take single-sample data as input. AncesTree uses an integer linear programming framework in which clonality inference is formulated as a variant allele frequency factorization problem [38]. LICHeE works by constructing an evolutionary constraint network and finding the best scoring spanning trees [128]. While PyClone does not attempt to infer tree topologies, its clustering framework, which is based on a Dirichlet process, is similar to that of CTPsingle [142]. We show that CTPsingle outperforms these methods even when they are supplied with more than one sample per tumor. In addition, we compare CTPsingle to CITUP and demonstrate that CTPsingle performs better than CITUP on low-coverage datasets.