2.2 Materials and Methods
2.2.4 CNV detection method
This triplet of data for each individual genome is input into an integrative hidden Markov model (HMM), which classifies each window to a copy-number state based on maximum a posteriori probability, while simultaneously accounting for sources of bias. The state changes mark the predicted breakpoints of CNVs. Below we present our method and the elements needed in HMM characterization.
Hidden states and transition probability. While microarray analysis suffers from oversaturation at high copy numbers, HTS allows RD-based methods to determine high copy numbers with improved accuracy (Campbell et al., 2008). The state represents the underlying copy number (CN). The state variable $q_t = CN_t$ is hidden and discrete with $N$ possible values, $(0, 1, \ldots, N-1)$. The total number of hidden states $N$ is implemented as an input parameter of GENSENG and can be freely specified by users. In theory, the more high-copy-number states are specified, the more accurate the model becomes. However, a number of practical issues must be considered. For example, specifying more states means longer computing time and, for some datasets, there may not be enough regions from which to estimate the parameters. In practice, $N$ can be derived from the data by k-means clustering of the logarithm of the read depth. For the HTS datasets used in this study, we assume seven hidden states representing copy numbers of 0, 1, 2, 3, 4, 5, and 6 or more. For homozygous populations such as inbred mice, we assume four hidden states representing copy numbers of 0, 2, 4, and 6 or more. We collapse duplications with 6 or more copies into one state because they are difficult to distinguish, for both experimental reasons (reduced signal-to-noise ratio) and computational reasons (few regions with very high read-depth signal). State transitions proceed from one window to the next according to a first-order time-homogeneous Markov process. The transition probability describes the probability of a copy-number state change between two adjacent windows. Let $\pi_j$ be the initial state probability, i.e., the probability that the state of the first window is state $j$. The underlying hidden Markov chain is defined by the state transitions $P(q_t \mid q_{t-1})$ and is represented by a time-independent stochastic transition matrix $A = \{a_{jz}\}$, where $a_{jz} = P(q_t = z \mid q_{t-1} = j)$. Intuitively, the copy-number state is unlikely to change for nearby windows but is more likely to change for windows that are far apart.
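To make the structure of the hidden Markov chain concrete, the following is a minimal Python sketch of a row-stochastic transition matrix in which each state most likely persists into the next window. The function name `sticky_transition_matrix`, the value `stay_prob = 0.999`, and the initial probabilities are illustrative assumptions, not the values used by GENSENG.

```python
def sticky_transition_matrix(n_states, stay_prob=0.999):
    """Return a row-stochastic matrix A with A[j][z] = P(q_t = z | q_{t-1} = j).

    stay_prob is a hypothetical tuning value: the probability that the
    copy-number state is unchanged between two adjacent windows.
    """
    if n_states < 2:
        raise ValueError("need at least two states")
    # The remaining probability mass is split evenly over the other states.
    switch_prob = (1.0 - stay_prob) / (n_states - 1)
    return [
        [stay_prob if z == j else switch_prob for z in range(n_states)]
        for j in range(n_states)
    ]


# Seven states for copy numbers 0, 1, 2, 3, 4, 5, and 6 or more.
A = sticky_transition_matrix(7)

# Initial state probabilities pi_j; placing most mass on CN = 2 is an assumption.
pi = [0.01] * 7
pi[2] = 1.0 - 0.01 * 6
```

Each row of `A` sums to one, so the matrix is a valid stochastic matrix for any choice of `stay_prob` between 0 and 1.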
Emission probability. The hidden copy-number states emit probabilistic outputs at each window, i.e., the observed RD signal, which is integer-valued count data. In the absence of sources of bias, sequencing coverage is uniform across the genome, so the emission probability of RD could be modeled by a Poisson distribution with equal mean and variance. In the presence of sources of bias, sequencing coverage is not uniform and the Poisson assumption fails. To account for biases, the emission probability of RD is modeled as a mixture of a uniform distribution and a negative binomial (NB) distribution, expressed as follows:
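$$e(t,j) = P(O_t = o_t \mid q_t = j) = \frac{c}{R_m} + (1-c)\, e_{NB}(t,j) = \frac{c}{R_m} + (1-c)\, \frac{\Gamma(o_t + 1/\phi_j)}{o_t!\,\Gamma(1/\phi_j)} \left(\frac{1}{1+\phi_j\mu_{tj}}\right)^{1/\phi_j} \left(\frac{\phi_j\mu_{tj}}{1+\phi_j\mu_{tj}}\right)^{o_t},$$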
where $e(t,j)$ is the emission probability of a particular observation at a particular time $t$ for state $j$, $c$ is the mixing probability, $O_t$ is the discrete count variable of the observation, $q_t$ is the state variable, $o_t$ is the RD signal for window $t$, $\mu_{tj}$ is the mean RD for window $t$ given state $j$, and $\phi_j$ is the overdispersion parameter given state $j$. To describe the negative binomially distributed component, $e_{NB}(t,j)$, we first explain the relationship between the Poisson and the negative binomial distributions. The Poisson distribution requires that the variance equal the mean. The negative binomial distribution allows overdispersion. Specifically, if $O$ follows a Poisson distribution with mean $\mu$, and $\mu$ follows a gamma distribution, the resulting distribution for $O$ is a negative binomial distribution. The variance of the negative binomial distribution is $\mu_t + \phi\mu_t^2$, where $\phi\mu_t^2$ is the overdispersion part of the variance. As $\phi \to 0$, $f_{NB}(o_t; \mu_t, \phi)$ reduces to a Poisson distribution with mean $\mu_t$ and variance $\mu_t$: $f_P(o_t; \mu_t) = \exp(-\mu_t)\,\mu_t^{o_t} / o_t!$. Next, the mean value of the negative binomially distributed component is expressed as a function of a set of covariates to account for confounders.
$$\mu_{tj} = \alpha_0 \cdot (CN_t)^{\beta_1} \cdot (l_t)^{\beta_2} \cdot (g_t)^{\beta_3} \qquad (2.1)$$
where $t$ denotes the $t$th window, $j$ is the index of the copy-number state (the subscript $j$ emphasizes the dependency of the mean $\mu_t$ on the copy number $CN_t$), $l_t$ is the mappability score, and $g_t$ is the GC content. For computational convenience, we set $CN_t = 0.5$ when $j = 0$, and set $CN_t = j$ when $j > 0$.
We then employ a log link function to acknowledge the fact that $\mu_{tj} > 0$ and obtain:
$$\log(\mu_{tj}) = \beta_0 + \beta_1 \log(CN_t) + \beta_2 \log(l_t) + \beta_3 \log(g_t) \qquad (2.2)$$
$\beta_0, \beta_1, \beta_2, \beta_3$ are the regression coefficients. Specifically, $\beta_0 = \log(\alpha_0)$ is the intercept and is interpreted as the average log read-depth signal when all log-transformed covariates are equal to zero. $\beta_1$ is the increase in log read-depth for every unit increase in the log copy number, $\log(CN)$. $\beta_2$ is the increase in log read-depth for every unit increase in the log mappability score, $\log(l)$. $\beta_3$ is the increase in log read-depth for every unit increase in the log GC content, $\log(g)$.
The uniform distribution has density $1/R_m$ and models any random fluctuation of read depth, where $R_m$ is treated as a known constant set to the largest RD among all windows of the chromosome. When non-overlapping windows are used, the mean RD for each window, $\mu_{tj}$, is modeled by a negative binomial regression model, where the predictors include copy-number state, GC content, and mappability score. A standard HMM assumes the Markov property, $P(q_t \mid q_{t-1}, q_{t-2}, q_{t-3}, \ldots, q_1) = P(q_t \mid q_{t-1})$. An additional assumption that is often employed is that the observations are independent given the states, $O_t \perp O_i \ (i \neq t) \mid q_t$, which is valid when the windows are non-overlapping. When the windows overlap, this assumption is invalid; instead, the observations are drawn from an autoregressive process (Juang and Rabiner, 1985). We have implemented an autoregressive HMM to model this feature of the data. Specifically, a residual term is included as an additional predictor in the negative binomial regression model, assuming first-order autoregression. Thus, in each round of inference, we first fit the model to obtain the expected read count of state $j$ at window $t-1$, $\mu_{t-1,j}$, and calculate the residual $r_{t-1,j} = \log(o_{t-1}) - \log(\mu_{t-1,j})$ for $t > 1$, letting $r_{1,j} = 0$. With this extra predictor, we fit the GLM again to obtain the updated expected read count of state $j$ at window $t$, $\mu_{t,j}$. Additional noise in the data that cannot be explained by variability in GC content and mappability is accommodated by $\phi_j$, the overdispersion parameter of the NB distribution (allowing the variance to be larger than the mean), and by the uniform distribution in the mixture model.
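The mixture emission and the log-linear mean (Eq. 2.2) can be sketched in a few lines of Python using only the standard library. The function names and the numerical values in the usage lines are hypothetical; the autoregressive residual term is omitted, so this sketch corresponds to the non-overlapping-window case.

```python
import math


def mean_read_depth(beta, copy_number_state, mappability, gc_content):
    """Mean RD mu_tj from the log-linear model of Eq. 2.2.

    beta = (beta0, beta1, beta2, beta3). State 0 uses CN = 0.5, as in the text.
    """
    cn = 0.5 if copy_number_state == 0 else float(copy_number_state)
    b0, b1, b2, b3 = beta
    log_mu = (b0 + b1 * math.log(cn)
              + b2 * math.log(mappability) + b3 * math.log(gc_content))
    return math.exp(log_mu)


def nb_pmf(o, mu, phi):
    """Negative binomial pmf with mean mu and variance mu + phi * mu**2."""
    inv_phi = 1.0 / phi
    # Work on the log scale so large counts do not overflow the gamma function.
    log_p = (
        math.lgamma(o + inv_phi) - math.lgamma(o + 1) - math.lgamma(inv_phi)
        + inv_phi * math.log(1.0 / (1.0 + phi * mu))
        + o * math.log(phi * mu / (1.0 + phi * mu))
    )
    return math.exp(log_p)


def emission(o, mu, phi, c, r_max):
    """Mixture emission: uniform density 1/R_m with weight c, NB with weight 1 - c."""
    return c / r_max + (1.0 - c) * nb_pmf(o, mu, phi)


# Hypothetical coefficients and covariates, for illustration only.
beta = (math.log(100.0), 1.0, 1.0, 0.5)
mu = mean_read_depth(beta, copy_number_state=2, mappability=0.9, gc_content=0.4)
e = emission(o=120, mu=mu, phi=0.1, c=0.01, r_max=500)
```

Because the uniform component contributes at least $c/R_m$, the emission probability stays strictly positive whenever $c > 0$, even for read depths that are very unlikely under the NB component.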
Tuning parameters. Given the HMM topology, the challenge lies in optimizing model
parameters given the observed data, a.k.a. HMM training. There are many parameters to be optimized. To reduce computational difficulty, we choose to specify a subset of HMM parameters based on prior knowledge and user preference, including the initial state probability, state transition probability, and the mixing probability in emission probability. These tuning parameters can be influential and should be chosen carefully. The remaining emission parameters, including the coefficients and overdispersion parameters in the NB regression model, are estimated for each dataset.
Parameter estimation. The optimization problem is solved by the Baum-Welch algorithm (Baum et al., 1970), which maximizes the data likelihood for an individual chromosome in iterative steps including initialization, expectation, and maximization. Following Bilmes (1998), we define the complete-data likelihood and solve the $Q$ function to find the maximum likelihood estimates (MLEs) of the HMM parameters. In the initialization step, we rely on intuitive guesses as well as empirical values. The initial emission parameters were estimated from the 1000GP and the Mouse Genomes Project datasets, where known CNVs are available. These initial emission parameters are saved for the human and mouse genomes, respectively, and are used for any new sample without prior knowledge of its CNVs. In the maximization step, we obtain maximum-likelihood estimates of the emission parameters. Because we fix $c$ and $R_m$ as constants, parameter estimation concerns only the negative binomially distributed component. We apply a weighted negative binomial regression model, where the weights are the posterior probabilities of each window belonging to a particular copy-number state, given the observed data of an entire chromosome. These weights represent current knowledge of the probabilistic classification of a window to a copy-number state and are updated in the expectation step. Although it is included as a predictor in the regression model, the copy number is the hidden variable to be inferred from the observed data. Intuitively, by using posterior probabilities as regression weights, we are able to partition the observed RD across all hidden states, proportional to the likelihood. The weighted NB regression model is fitted by alternating between estimating the regression coefficients using iteratively reweighted least squares and estimating the overdispersion parameter using a Newton-Raphson method. In the expectation step, we update the forward, backward, and posterior probabilities given the current estimates from the maximization step. The expectation and maximization steps iterate until the convergence criterion (a change in the log-likelihood smaller than $10^{-6}$) is reached.
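The expectation step can be illustrated with a scaled forward-backward pass, sketched below in plain Python. This is a generic textbook formulation under the assumption of strictly positive emission probabilities, not GENSENG's code; the names `posterior_state_probs` and `has_converged` are hypothetical, and the maximization step (the weighted NB regression) is not shown.

```python
import math


def posterior_state_probs(pi, A, emis):
    """Scaled forward-backward pass.

    pi   : initial state probabilities, length N
    A    : transition matrix, A[j][z] = P(q_t = z | q_{t-1} = j)
    emis : emis[t][j] = e(t, j), emission probability of window t under state j
    Returns (gamma, log_likelihood), gamma[t][j] = P(q_t = j | all observations).
    """
    T, N = len(emis), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    scale = [0.0] * T

    # Forward pass, rescaling each window so the alphas sum to one.
    for t in range(T):
        for z in range(N):
            if t == 0:
                prior = pi[z]
            else:
                prior = sum(alpha[t - 1][j] * A[j][z] for j in range(N))
            alpha[t][z] = prior * emis[t][z]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the forward scaling factors.
    for t in range(T - 2, -1, -1):
        for j in range(N):
            beta[t][j] = sum(
                A[j][z] * emis[t + 1][z] * beta[t + 1][z] for z in range(N)
            ) / scale[t + 1]

    # Posterior weights used by the weighted regression in the next M-step.
    gamma = [[alpha[t][j] * beta[t][j] for j in range(N)] for t in range(T)]
    log_likelihood = sum(math.log(s) for s in scale)
    return gamma, log_likelihood


def has_converged(prev_ll, new_ll, tol=1e-6):
    """Convergence criterion from the text: log-likelihood change below tol."""
    return abs(new_ll - prev_ll) < tol
```

Rescaling at every window keeps the forward and backward quantities within floating-point range on chromosome-length sequences, and the log-likelihood is recovered as the sum of the log scaling factors.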
CNV calling. Using the parameters at convergence, first we obtain the final estimates of the
posterior probability for each window belonging to a particular state, given the observed data from the entire chromosome. Second, we assign the final estimate of copy number for each window using the state with the largest posterior probability. The state changes mark the predicted
breakpoints of CNVs. The confidence score of a CNV region is computed as the sum of the posterior probabilities of all windows enclosed within the breakpoints. Next, a two-step merging algorithm is carried out to refine the boundaries of the CNVs.
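The first two calling steps and the confidence score can be sketched as follows. The sketch assumes that the confidence score sums the posterior probability of the called state over the windows of a segment, and that state index 2 is the copy-number-neutral state; the function name `call_cnvs` is hypothetical, and the two-step merging algorithm is not shown.

```python
def call_cnvs(gamma, neutral_state=2):
    """Segment windows by maximum a posteriori state and return CNV calls.

    gamma[t][j] is the posterior probability that window t is in state j.
    Each call is (start_window, end_window_inclusive, state, confidence),
    where confidence sums the called state's posterior over the segment
    (an assumption about how the score is accumulated).
    """
    map_states = [max(range(len(row)), key=row.__getitem__) for row in gamma]
    calls = []
    start = 0
    for t in range(1, len(map_states) + 1):
        # A state change, or the end of the chromosome, closes the segment.
        if t == len(map_states) or map_states[t] != map_states[start]:
            state = map_states[start]
            if state != neutral_state:
                confidence = sum(gamma[i][state] for i in range(start, t))
                calls.append((start, t - 1, state, confidence))
            start = t
    return calls


# Toy posterior matrix with three states (0, 1, 2) over four windows.
example_gamma = [
    [0.10, 0.10, 0.80],
    [0.10, 0.70, 0.20],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
]
example_calls = call_cnvs(example_gamma)
```

In the toy input, windows 1 and 2 have state 1 as their most probable state, so they form a single call whose breakpoints are the state changes on either side.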
Prioritization of CNV calls. A CNV quality control step can be applied to remove the CNVs predicted with the lowest confidence. We recommend removing predicted CNVs shorter than 800 bp (i.e., those that appear in only one window, as shown in Table 2.3) and predicted CNVs with an average mappability lower than 0.3 (i.e., those that cannot be confidently predicted, as shown in Figure 2.6 (b)). An additional prioritization approach was implemented via the read-depth-accessibility (RDA) statistic, which reflects the signal-to-noise ratio of a predicted CNV region after accounting for known confounders in read depth. The term read-depth-accessibility was coined by Abyzov et al. (2011), but for a different purpose. The RDA statistic is computed in three steps: (1) after CNV calling, identify all compatible copy-number-neutral windows whose GC content and mappability scores are the same as those of the region of interest; (2) calculate the average window count from (1) as the expected read depth for the region of interest; (3) obtain the RDA by dividing the observed read depth by the expected read depth for the region of interest. Using a copy number of two as normalization for copy-number-neutral autosomal regions, the theoretical signal-to-noise ratios are 0, 0.5, 1.5, and 2 for copy numbers of 0, 1, 3, and 4, respectively. Therefore, a region is considered to be read-depth-accessible if its RDA value is lower than 0.5 for homozygous deletions, lower than 0.75 for heterozygous deletions, and greater than 1.25 for duplications. In general, we recommend removing CNVs predicted from regions that are not read-depth-accessible (e.g., those with RDA values between 0.75 and 1.25). In addition, we recommend ranking the predicted regions by their RDAs, where a higher signal-to-noise ratio reflects higher confidence that the predicted CNVs are correct; this is analogous to ranking by fold-change in gene-expression analysis.
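The three RDA steps and the accessibility thresholds can be written as a short Python sketch. It assumes the matched copy-number-neutral windows have already been selected and have a positive average read depth; the function names and the call-type labels (`"hom_del"`, `"het_del"`, `"dup"`) are hypothetical.

```python
def rda(observed_rd, neutral_window_rds):
    """Read-depth-accessibility: observed RD divided by expected RD.

    neutral_window_rds holds the read depths of copy-number-neutral windows
    matched to the region on GC content and mappability (steps 1 and 2).
    """
    if not neutral_window_rds:
        raise ValueError("no matched copy-number-neutral windows")
    expected_rd = sum(neutral_window_rds) / len(neutral_window_rds)
    return observed_rd / expected_rd


def is_rd_accessible(rda_value, call_type):
    """Apply the thresholds stated in the text to one predicted CNV."""
    if call_type == "hom_del":
        return rda_value < 0.5
    if call_type == "het_del":
        return rda_value < 0.75
    if call_type == "dup":
        return rda_value > 1.25
    raise ValueError(f"unknown call type: {call_type}")


# A predicted heterozygous deletion with half the expected read depth.
value = rda(observed_rd=50.0, neutral_window_rds=[100.0, 100.0, 100.0])
keep = is_rd_accessible(value, "het_del")
```

A predicted duplication with an RDA of 1.0 would fail the check, because it falls inside the 0.75 to 1.25 band that the text treats as not read-depth-accessible.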