The dChip algorithm - Estimation of LOH in literature

Part I Bayesian integrative genomics

2.3 Estimation of LOH in literature

2.3.1 The dChip algorithm

Beroukhim et al. [8] derived an HMM algorithm for inferring the LOH of unpaired samples, which is usually referred to as dChip. As we saw in Subsection 2.2.2, an HMM requires the specification of the unobserved states and observed variables, the emission probabilities, the transition probabilities and the initial probabilities.

The observed variables are the SNP calls, classified as homozygous (Hom), heterozygous (Het) and “NoCall”. The unobserved states are the LOH statuses, defined as loss (LOSS) if there is LOH and retention (RET) otherwise. The aim is to estimate the unobserved states from the observations (see Figure 2.3).

Fig. 2.3 Scheme of the HMM used in dChip [Adapted from PLOS Computational Biology [8], copyright (2006), available under Creative Commons Attribution License].

The emission probabilities are the probabilities of the observed calls, given an unobserved state. To define them, the authors treated the SNPs with observed calls Hom or Het as random variables with a different distribution from that of the NoCall SNPs. In fact, they assumed that a NoCall SNP can actually be Hom or Het, independently of the corresponding LOH status. Thus, $P(\mathrm{NoCall} \mid \mathrm{LOSS}) = 1$ and $P(\mathrm{NoCall} \mid \mathrm{RET}) = 1$. For the other SNPs, it is sufficient to set $P(\mathrm{Het} \mid \mathrm{RET})$ and $P(\mathrm{Het} \mid \mathrm{LOSS})$, and then $P(\mathrm{Hom} \mid \mathrm{LOSS}) = 1 - P(\mathrm{Het} \mid \mathrm{LOSS})$ and $P(\mathrm{Hom} \mid \mathrm{RET}) = 1 - P(\mathrm{Het} \mid \mathrm{RET})$. For any $\mathrm{SNP}_i$ ($\mathrm{SNP}_i$ denotes the $i$th SNP interrogated), the probability of being heterozygous under the RET state is estimated with the average heterozygosity rate in a normal population ($P(\mathrm{Het} \mid \mathrm{RET}) = p_{\mathrm{het}}$). Instead, the probability of being heterozygous under the LOSS state is related to the genotyping error (which is 0.01 [37]). Thus, $P(\mathrm{Het} \mid \mathrm{LOSS}) = 0.01$.
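The emission probabilities described above can be collected in a small table. The following is a minimal sketch, not the dChip implementation: the function name `emission_matrix`, the state and call orderings, and the value `p_het=0.3` are assumptions introduced for illustration only; the 0.01 genotyping error is the value quoted in the text.

```python
import numpy as np

STATES = ("LOSS", "RET")          # row order (assumed for this sketch)
CALLS = ("Hom", "Het", "NoCall")  # column order (assumed for this sketch)
GENOTYPING_ERROR = 0.01           # P(Het | LOSS), value quoted in the text


def emission_matrix(p_het):
    """Return a 2x3 array E with E[s, c] = P(call c | state s).

    p_het is the average heterozygosity rate in a normal population.
    NoCall is given probability 1 under both states, so it carries no
    information about the LOH status (rows therefore do not sum to 1).
    """
    if not 0.0 < p_het < 1.0:
        raise ValueError("p_het must lie strictly between 0 and 1")
    return np.array([
        # Hom,                   Het,              NoCall
        [1.0 - GENOTYPING_ERROR, GENOTYPING_ERROR, 1.0],  # LOSS
        [1.0 - p_het,            p_het,            1.0],  # RET
    ])


E = emission_matrix(p_het=0.3)  # 0.3 is an illustrative value only
```

Setting the NoCall column to 1 in both rows means that a NoCall SNP multiplies the likelihood of either state by the same factor, which is one way to encode the independence assumption stated above.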

The initial probabilities are the prior distribution of the LOH status at any $\mathrm{SNP}_i$: $P_0(\mathrm{LOSS})$ and $P_0(\mathrm{RET}) = 1 - P_0(\mathrm{LOSS})$. Using basic probability rules, we can see that

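$$
\begin{aligned}
P(\mathrm{Het}) &= P(\mathrm{Het} \mid \mathrm{LOSS})\,P_0(\mathrm{LOSS}) + P(\mathrm{Het} \mid \mathrm{RET})\,P_0(\mathrm{RET}) \\
&= 0.01\,P_0(\mathrm{LOSS}) + p_{\mathrm{het}}\,P_0(\mathrm{RET}) \\
&\approx p_{\mathrm{het}}\,P_0(\mathrm{RET}) \\
\Rightarrow \quad P_0(\mathrm{RET}) &\approx \frac{P(\mathrm{Het})}{p_{\mathrm{het}}},
\end{aligned}
$$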

by considering the SNP genotyping error negligible. As a consequence, $P_0(\mathrm{RET})$ is estimated as the ratio between the proportion of heterozygous SNPs in the sample and the heterozygosity rate in the normal population.

The transition probabilities are the conditional distribution of the LOH state of two consecutive SNPs. Since nearby SNPs tend to have the same LOH status, while distant markers do not, the authors first defined the probability $\theta$ that $\mathrm{SNP}_{i-1}$ does not influence $\mathrm{SNP}_i$. They used a function which increases with the distance $d$ (in megabases, Mb) between the markers: $\theta = 1 - e^{-2d}$. Therefore, the probability of a LOSS at $\mathrm{SNP}_i$, given the LOSS at $\mathrm{SNP}_{i-1}$, is decomposed into the probability that $\mathrm{SNP}_i$ is not influenced by $\mathrm{SNP}_{i-1}$ and $\mathrm{SNP}_i$ is LOSS, and the probability that $\mathrm{SNP}_i$ is influenced by $\mathrm{SNP}_{i-1}$,

$$
P(\mathrm{LOSS\ at\ SNP}_i \mid \mathrm{LOSS\ at\ SNP}_{i-1}) = \theta\,P_0(\mathrm{LOSS}) + (1 - \theta).
$$

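Similarly, the probability of a LOSS at $\mathrm{SNP}_i$, given the RET status at $\mathrm{SNP}_{i-1}$, is equal to the probability that $\mathrm{SNP}_i$ is not influenced by $\mathrm{SNP}_{i-1}$ and $\mathrm{SNP}_i$ is LOSS,

$$
P(\mathrm{LOSS\ at\ SNP}_i \mid \mathrm{RET\ at\ SNP}_{i-1}) = \theta\,P_0(\mathrm{LOSS}).
$$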

Obviously,

$$
P(\mathrm{RET\ at\ SNP}_i \mid \mathrm{LOSS\ at\ SNP}_{i-1}) = 1 - P(\mathrm{LOSS\ at\ SNP}_i \mid \mathrm{LOSS\ at\ SNP}_{i-1}),
$$

$$
P(\mathrm{RET\ at\ SNP}_i \mid \mathrm{RET\ at\ SNP}_{i-1}) = 1 - P(\mathrm{LOSS\ at\ SNP}_i \mid \mathrm{RET\ at\ SNP}_{i-1}).
$$
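To make the initial and transition probabilities concrete, the sketch below computes them from a sequence of calls and SNP positions and then decodes a state path. It is an illustration under stated assumptions, not the dChip code: all function names are hypothetical, NoCall SNPs are excluded when computing the sample proportion of heterozygous SNPs, the estimate of $P_0(\mathrm{RET})$ is capped at 1, and Viterbi decoding is used only as one common way to estimate the unobserved states (this section does not say which decoding procedure dChip applies). The value 0.3 for the heterozygosity rate is illustrative.

```python
import numpy as np

LOSS, RET = 0, 1            # state indices (assumed ordering)
HOM, HET, NOCALL = 0, 1, 2  # call indices (assumed ordering)


def initial_probs(calls, p_het):
    """Estimate [P0(LOSS), P0(RET)] using P0(RET) ~ P(Het) / p_het."""
    calls = np.asarray(calls)
    called = calls[calls != NOCALL]  # assumption: NoCalls are ignored here
    if called.size == 0:
        raise ValueError("at least one Hom or Het call is required")
    p_ret = min(1.0, float(np.mean(called == HET)) / p_het)  # capped at 1
    return np.array([1.0 - p_ret, p_ret])


def transition_matrix(d_mb, p0_loss):
    """T[a, b] = P(state b at SNP i | state a at SNP i-1); d_mb in Mb."""
    theta = 1.0 - np.exp(-2.0 * d_mb)  # prob. that SNP i-1 has no influence
    loss_given_loss = theta * p0_loss + (1.0 - theta)
    loss_given_ret = theta * p0_loss
    return np.array([
        [loss_given_loss, 1.0 - loss_given_loss],
        [loss_given_ret, 1.0 - loss_given_ret],
    ])


def viterbi(calls, positions_mb, emission, p0):
    """Most probable LOSS/RET path, computed in log space."""
    n = len(calls)
    back = np.zeros((n, 2), dtype=int)
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_e = np.log(emission)
        score = np.log(p0) + log_e[:, calls[0]]
        for i in range(1, n):
            d = positions_mb[i] - positions_mb[i - 1]
            t = transition_matrix(d, p0[LOSS])
            cand = score[:, None] + np.log(t)  # cand[a, b]: from a to b
            back[i] = np.argmax(cand, axis=0)
            score = cand.max(axis=0) + log_e[:, calls[i]]
    path = [int(np.argmax(score))]
    for i in range(n - 1, 0, -1):  # follow back-pointers to the first SNP
        path.append(int(back[i][path[-1]]))
    return path[::-1]


# Illustrative usage with made-up calls and positions (in Mb).
p_het = 0.3
emission = np.array([[0.99, 0.01, 1.0],             # LOSS: Hom, Het, NoCall
                     [1.0 - p_het, p_het, 1.0]])    # RET:  Hom, Het, NoCall
calls = [HET, HET, HOM, HOM, HOM, HOM, NOCALL, HOM]
positions = [0.0, 0.5, 1.0, 1.01, 1.02, 1.03, 1.04, 1.05]
p0 = initial_probs(calls, p_het)
states = viterbi(calls, positions, emission, p0)
```

With this parametrisation, two SNPs at distance zero always share the same state (the transition matrix is the identity), while for very distant SNPs each row tends to the initial distribution, which matches the interpretation of $\theta$ given above.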

Chapter 3 New statistical methods for copy number estimation

Abstract. As we saw in Chapter 2, the copy number profile can be estimated with either a piecewise constant function or a continuous curve. In [32, 33], Hutter proposed two Bayesian regression methods that can be applied to the inference of the copy number profile: the Bayesian Piecewise Constant Regression (BPCR) and the Bayesian Regression Curve (BRC).

BPCR is a Bayesian regression method for data that are noisy observations of a piecewise constant function. The method estimates the unknown number of segments, the endpoints of the segments and the segment levels of the underlying piecewise constant function. BRC estimates the same data with a smoothing curve. However, in the original formulation, some estimators failed to properly determine the corresponding parameters. For example, the boundary estimator did not take into account the dependency among the boundaries and could estimate more than one breakpoint at the same position, thus losing segments.

Therefore, in Section 3.1, we present an improved version of the BPCR (called mBPCR), changing the segment number estimator and the boundary estimator to enhance the fitting procedure. We also propose an alternative estimator of the variance of the segment levels, which is useful in case of data with high noise. In Section 3.2 we deduce two improved versions of BRC: mBRC and BRCAk.

In the literature, some methods estimate the copy numbers as a piecewise constant function, while other algorithms estimate them as a continuous curve (Chapter 2). Hence, we compare the original and the modified version of BPCR to the former group of methods (Subsection 3.1.7), and the original and the modified versions of BRC to the latter (Subsection 3.2.3). On artificial data, we show that mBPCR and the improved versions of BRC generally outperformed all the others. We observe that similar results were also obtained on real data. The choice of using Bayesian statistics, although it has higher computational complexity, appears appropriate especially for the estimation of regions containing only a few data points. In Section 3.3, we describe a dynamic programming algorithm for the computation of the quantities involved in the estimation, since it is not possible to find them analytically. In Section 3.4, we show a further change to mBPCR, in order to reduce the false discovery rate of the breakpoint estimator in the presence of only one segment.

Our method (already published in [65]) was implemented in R and the corresponding R package (called mBPCR) can be downloaded from the Bioconductor website (http://www.bioconductor.org/).

Regarding notation, we will not explicitly indicate the random variable to which a distribution refers, if it is clear from the context. For example, $p_K(k) \equiv p(k)$ or $p_{Y,M}(y,\mu) \equiv p(y,\mu)$.

In document Stochastic methods in cancer research : applications to genomics and angiogenesis (Page 53-58)