
3.2.1 Input processing

As input, CTPsingle takes reference and variant read counts for somatic single nucleotide variant (SNV) calls. These calls can be obtained from whole-genome, whole-exome or targeted sequencing data. CTPsingle expects all input mutations to reside in regions not affected by copy number aberrations. In practice, it is usually not possible to perfectly filter out mutations that fail this criterion. However, below we show that the inference in CTPsingle is not expected to be substantially affected if a minor fraction of the input mutations belong to regions affected by copy number aberrations. In addition, for mutations from autosomal chromosomes (i.e., all chromosomes other than X and Y), the region harboring the mutation is expected to be diploid in normal cells. In cases where the gender of the patient is known, mutations from allosomal (i.e., X and Y) chromosomes can also be included. However, inclusion of such mutations requires adjustment of the total read counts as described below. We also assume that tri-allelic mutations are discarded from the input.

Adjusting total read counts for mutations from allosomal chromosomes

Since clustering in CTPsingle is performed in read count space using the intuition discussed in Section 2.2, read counts for mutations originating from allosomal chromosomes in males need to be adjusted. More precisely, in order to simulate diploidy of the X and Y chromosomes, for each of X and Y we need to simulate a matching ‘phantom’ chromosome that emits reads supporting the reference allele. In CTPsingle, this is achieved by adjusting the total read count $t_i$ at the position of mutation $M_i$ as follows. Let $c_i$ and $d_i$ denote the total number of copies of the region containing $M_i$ in tumor and normal cells, respectively, and let $v_i$ and $r_i$ denote the number of variant and reference reads for mutation $M_i$, respectively. Then we set:

$$t_i = \frac{2}{c_i}\, v_i + \frac{2}{d_i}\, r_i.$$

Note that here we assume that the normal sample does not contain any chromosomal abnormalities. Such regions, if they exist, should be removed from automated analysis and manually investigated.
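As a minimal sketch of the read count adjustment described in this subsection, the snippet below assumes the adjustment has the form $t_i = (2/c_i)\, v_i + (2/d_i)\, r_i$. The function name `adjusted_total` and the example read counts are illustrative assumptions and are not taken from the CTPsingle software.

```python
def adjusted_total(v, r, c, d):
    """Adjusted total read count, assuming t = (2/c) * v + (2/d) * r.

    v, r: variant and reference read counts of the mutation
    c, d: copy number of the region in tumor and normal cells
    (hypothetical helper, not part of CTPsingle itself)
    """
    if c <= 0 or d <= 0:
        raise ValueError("copy numbers must be positive")
    return 2.0 / c * v + 2.0 / d * r


# Autosomal mutation (two copies in tumor and normal): counts are unchanged.
print(adjusted_total(12, 28, 2, 2))  # 40.0

# Mutation on the X chromosome of a male patient (one copy each):
# the total is doubled, mimicking a 'phantom' chromosome.
print(adjusted_total(12, 28, 1, 1))  # 80.0
```

For a diploid region the adjusted total equals the observed depth, so under this assumption autosomal mutations pass through unchanged.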

3.2.2 Robust clustering using beta-binomial mixture modelling

Let $n$ denote the total number of mutations left after the above filtering steps and used by CTPsingle. After computing the values of $t_i$, we proceed with clustering the mutations based on their read counts. The clustering in CTPsingle is performed via a beta-binomial model. We assume that $v_i$ is binomially distributed with an unknown (i.e., variable) probability of success $p_i$:

$$v_i \mid (t_i, p_i) \sim \mathrm{Binom}(t_i, p_i); \quad i = 1, 2, \ldots, n. \tag{3.1}$$

We further assume that the probability parameter $p_i$ is generated from a Dirichlet Process (DP) as given below:

pi|G ∼ G (3.2)

G|(α, G0) ∼ Dir(α, G0). (3.3)

Above, the concentration parameter α can either be given as a user-defined input or further sampled from a Gamma distribution. The baseline distribution G0 is taken to be the Beta distribution with parameters a1 and b1:

G0 = Beta(a1, b1) (3.4)

Since the Beta prior is conjugate to the Binomial likelihood, which yields a beta-binomial marginal distribution for the variant read counts, inference can be performed using a standard Markov Chain Monte Carlo (MCMC) method [92].

In our implementation, the model parameters $\alpha$, $a_1$ and $b_1$ are set to 0.001, 5.0 and 5.0, respectively. These values were selected empirically based on our observations on real data; however, they can be modified by the user if desired. In addition to the estimated values $p_i$, the algorithm provides the inferred number of clusters and an assignment of the mutations to these clusters.
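To illustrate the conjugacy used here: integrating $p_i$ out of the binomial likelihood under a Beta($a_1$, $b_1$) base distribution gives a beta-binomial probability for $v_i$. The sketch below evaluates that probability in log space. It is only an illustration of the kind of quantity an MCMC sampler for such a model might evaluate, not a description of how CTPsingle performs inference; the function names are assumptions.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_betabinom_pmf(v, t, a=5.0, b=5.0):
    """log P(v | t) when p ~ Beta(a, b) and v | p ~ Binom(t, p).

    The defaults a = b = 5.0 mirror the values of a1 and b1 quoted in the text.
    """
    log_choose = lgamma(t + 1) - lgamma(v + 1) - lgamma(t - v + 1)
    return log_choose + log_beta(v + a, t - v + b) - log_beta(a, b)


# Sanity check: the probabilities over v = 0..t sum to one.
t = 40
total = sum(exp(log_betabinom_pmf(v, t)) for v in range(t + 1))
print(round(total, 9))
```

Working in log space with `lgamma` avoids overflow in the binomial coefficient at high sequencing depth.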

3.2.3 Estimation of tumor purity

From the clustering stage, we obtain the number $k$ of clusters and the mean variant allelic frequency $s_j$ for each cluster $j = 1, 2, \ldots, k$. Since we modify all mutations to be heterozygous, the cellular frequency $x_j$ of mutational cluster $j$ is simply calculated as $x_j = 2 s_j$. Given the cellular frequencies of the clusters, we estimate the tumor purity $p$ as the highest frequency of any cluster: $p = \max_j x_j$, $j = 1, 2, \ldots, k$. Note that this formulation assumes that the cancer is unicentric. Like most other tools, CTPsingle cannot explicitly handle multicentric tumors.
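The purity estimate above amounts to doubling each cluster's mean variant allelic frequency and taking the maximum. A minimal sketch follows; the function names and the example frequencies are made up for illustration.

```python
def cellular_frequencies(cluster_vafs):
    """Cellular frequency x_j = 2 * s_j for each cluster mean VAF s_j."""
    return [2.0 * s for s in cluster_vafs]


def tumor_purity(cluster_vafs):
    """Purity estimate p = max_j x_j (assumes a unicentric tumor)."""
    return max(cellular_frequencies(cluster_vafs))


# Hypothetical mean VAFs of k = 3 clusters.
vafs = [0.40, 0.25, 0.10]
print(cellular_frequencies(vafs))  # [0.8, 0.5, 0.2]
print(tumor_purity(vafs))          # 0.8
```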

3.2.4 Inference of tree of tumor evolution

In the tree inference step, our goal is to find the clonal tree of tumor evolution, together with the assignment of subclonal frequencies, that best describes the clustering inferred in the previous step. Ideally, each cluster corresponds to the set of mutations that led to the same subclonal expansion and is placed at one of the nodes of the tree. For an arbitrary clonal tree whose root node represents the first population of cancerous cells, each node is expected to be assigned at least one cluster in order to avoid nodes (i.e., subclones) of the same genotype. As different clusters have different frequencies, they are expected to be assigned to different subclones (i.e., different nodes of the tree). This motivates us to search the space of labeled rooted trees of size $k$, where the tree labeling is such that there exists a one-to-one mapping between the clusters (used as labels) and the nodes of the tree.

Since CTPsingle is designed to work with single-sample bulk sequencing data, the number of inferred clusters $k$ usually ranges from 2 to 6. However, even when $k = 10$, the number of non-isomorphic rooted trees of size 10 equals only 719 (see footnote 1). Considering that we propose two very time-efficient implementations of the optimization problem introduced below, this permits an exhaustive search over all tree topologies, solving the optimization problem independently for each of them. For this reason, in the rest of this section we assume that we are given a tree $T = (V, E)$ of fixed topology with $k = |V|$ nodes. We assume that the labeling of the tree nodes is not fixed and is determined as part of the optimization problem.
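The count of 719 topologies for 10 nodes can be checked with the classical recurrence for the number of unlabeled rooted trees, $a(n+1) = \frac{1}{n} \sum_{k=1}^{n} \big( \sum_{d \mid k} d\, a(d) \big)\, a(n-k+1)$ with $a(1) = 1$. The sketch below is an independent sanity check with an assumed function name; it is not part of CTPsingle.

```python
def rooted_tree_counts(n_max):
    """Number of non-isomorphic rooted trees with 1..n_max nodes."""
    a = [0, 1]  # a[n] = count for n nodes; index 0 is unused
    for n in range(1, n_max):
        total = 0
        for k in range(1, n + 1):
            # Sum of d * a[d] over all divisors d of k.
            s_k = sum(d * a[d] for d in range(1, k + 1) if k % d == 0)
            total += s_k * a[n - k + 1]
        a.append(total // n)  # the sum is always divisible by n
    return a[1:n_max + 1]


print(rooted_tree_counts(10))
# [1, 1, 2, 4, 9, 20, 48, 115, 286, 719]
```

With so few topologies for realistic values of $k$, enumerating them all is inexpensive.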

Footnote 1: If $T_n$ denotes the number of non-isomorphic rooted trees of size $n$, then

$$T_n \sim C\, r^{-n}\, n^{-3/2} \quad \text{as } n \to \infty, \tag{3.5}$$

where $C = 0.4399237\ldots$ and $r = 0.3383219\ldots$ [122].

To enforce a one-to-one mapping between clusters and nodes of the tree, we first define the indicator variables $\delta_{jv}$, where $\delta_{jv} = 1$ if cluster $j$ is assigned to node $v$, and $\delta_{jv} = 0$ otherwise. Then we add the following constraint for each cluster $j = 1, 2, \ldots, k$:

$$\sum_{v \in V} \delta_{jv} = 1. \tag{3.6}$$

For each vertex $v \in V$ we add the following constraint:

$$\sum_{j=1}^{k} \delta_{jv} = 1. \tag{3.7}$$

Next, for each $v \in V$, we define a non-negative real variable $\phi_v$ representing the cellular prevalence of the subclone corresponding to node $v$. If $D(v)$ denotes the set of descendants of $v$ in $T$, then the tree-based frequency of the cluster assigned to $v$ equals:

$$\forall v \in V: \quad y_v = \phi_v + \sum_{u \in D(v)} \phi_u. \tag{3.8}$$

Additionally, we require that $y_r \leq 1.0$, where $r$ denotes the root of $T$.
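The tree-based frequencies of equation 3.8 can be computed with one bottom-up pass over the tree, since the frequency at a node is its own prevalence plus the frequencies of its children. The sketch below assumes the tree is given as a dictionary mapping each node to its list of children; this representation, the function name and the example values are illustrative choices only.

```python
def tree_frequencies(children, phi, root):
    """Return y[v] = phi[v] + sum of phi[u] over all descendants u of v.

    children: dict mapping a node to the list of its child nodes
    phi:      dict mapping a node to its cellular prevalence
    """
    y = {}

    def visit(v):
        # The frequency of a child already includes its own descendants.
        y[v] = phi[v] + sum(visit(u) for u in children.get(v, []))
        return y[v]

    visit(root)
    return y


# Hypothetical tree: 0 is the root with children 1 and 2; node 3 is a child of 1.
children = {0: [1, 2], 1: [3]}
phi = {0: 0.2, 1: 0.3, 2: 0.1, 3: 0.2}
y = tree_frequencies(children, phi, root=0)
print({v: round(f, 3) for v, f in sorted(y.items())})
```

In this example the root frequency is about 0.8, the sum of all four prevalences.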

Subject to the constraints defined above, the objective of CTPsingle is to minimise the difference between frequencies of mutational clusters inferred during the clustering step and their (unknown) tree-based frequencies. This difference can be expressed as the following sum:

$$\sum_{j} \sum_{v} \delta_{jv}\, |x_j - y_v| \tag{3.9}$$

The mixed ILP formulation as given above is expected to have few variables in single sample datasets, and can be solved using a simple iterative approach similar to the heuristic version of CITUP as described in [97].

Briefly, this is accomplished in two steps: (1) given fixed values for the variables $\phi_v$, find the optimal assignment of clusters to nodes, as given by $\delta_{jv}$, that minimises equation 3.9; (2) given the assignment $\delta_{jv}$ of clusters to nodes, calculate values for $\phi_v$ that minimise equation 3.9. The initial values of $\phi_v$ for Step 1 are chosen randomly, and these two steps are repeated for a given maximum number of iterations or until, for some positive integer $i$, the decrease in the objective score (given by equation 3.9) between iterations $i$ and $i + j$ is less than $\epsilon$, where $j$ and $\epsilon$ are user-defined constants.

While Step 1 still employs integer variables, Step 2 is a standard case of linear programming. As neither step can increase the objective score, convergence of the algorithm to at least a local optimum is guaranteed. Although convergence to a global optimum is not guaranteed for large trees, in most cases this problem can be alleviated by performing multiple re-starts.
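For the small values of $k$ considered here, Step 1 can be illustrated by brute force: with the tree-based frequencies $y_v$ fixed, try every one-to-one assignment of clusters to nodes and keep the one with the smallest value of equation 3.9. The sketch below uses plain enumeration of permutations, which is a stand-in chosen for clarity and not necessarily the method used in CTPsingle or CITUP. The function names and example values are assumptions.

```python
from itertools import permutations


def objective(x, y, assignment):
    """Equation 3.9 for a one-to-one assignment.

    assignment[j] is the index of the node that cluster j is placed on.
    """
    return sum(abs(x[j] - y[assignment[j]]) for j in range(len(x)))


def best_assignment(x, y):
    """Exhaustively search all k! assignments (practical only for small k)."""
    nodes = range(len(y))
    return min(permutations(nodes), key=lambda a: objective(x, y, a))


# Hypothetical cluster frequencies x_j and tree-based node frequencies y_v.
x = [0.8, 0.5, 0.2]
y = [0.2, 0.8, 0.5]
a = best_assignment(x, y)
print(a, objective(x, y, a))  # (1, 2, 0) 0.0
```

Step 2 would then hold this assignment fixed and re-optimise the prevalences with a linear programming solver.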

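Alternatively, it is possible to introduce a set of new variables $\tau_{jv}$ with the following additional constraints:

$$\forall j, v: \quad \tau_{jv} \geq \delta_{jv} - 1 + x_j - y_v \tag{3.10}$$

$$\forall j, v: \quad \tau_{jv} \geq \delta_{jv} - 1 - x_j + y_v \tag{3.11}$$

$$\forall j, v: \quad \tau_{jv} \geq 0 \tag{3.12}$$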

In this case, the objective is modified to minimise

$$\sum_{j=1}^{k} \sum_{v \in V} \tau_{jv}.$$
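To see why this linearisation reproduces equation 3.9, assume the constraints read $\tau_{jv} \geq \delta_{jv} - 1 + x_j - y_v$, $\tau_{jv} \geq \delta_{jv} - 1 - x_j + y_v$ and $\tau_{jv} \geq 0$, with $x_j$ and $y_v$ in $[0, 1]$. When minimising, each $\tau_{jv}$ settles at the largest of its lower bounds, which equals $|x_j - y_v|$ if $\delta_{jv} = 1$ and $0$ otherwise. The sketch below checks this numerically on a grid; the function name `min_tau` is hypothetical.

```python
def min_tau(delta, x, y):
    """Smallest tau allowed by the three linear constraints (assumed form)."""
    return max(0.0, delta - 1 + x - y, delta - 1 - x + y)


# Compare against delta * |x - y| for frequencies on a grid in [0, 1].
grid = [i / 20 for i in range(21)]
worst = max(
    abs(min_tau(delta, x, y) - delta * abs(x - y))
    for delta in (0, 1)
    for x in grid
    for y in grid
)
print(worst)  # 0.0
```

The check depends on the frequencies not exceeding 1; otherwise the bounds could become active even when $\delta_{jv} = 0$.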

In the practical applications of CTPsingle, we encounter small values of k (usually less than 10). In such cases the above ILP consists of a small number of variables (usually less than 200 variables in total) and can be solved very efficiently even using freely available ILP solvers.

In practice, inferring a phylogeny from single-sample tumors is typically an under-determined problem resulting in multiple optimal solutions, except for special cases where the estimated frequencies admit only one optimal solution. As a result, rather than reporting a single unique solution, CTPsingle reports all feasible solutions with some topologies eliminated. Further elimination of tree topologies may be possible by examination of nearby germline mutations or additional information about the tumors and is left to the user of the accompanying software.