1.2 Negative binomial distribution and change-point analysis
1.3.3 Gene annotation
In Chapter 3, our ultimate goal is to compare the locations of the transcribed regions in the genome of a species grown in different environments. In the segmentation context, this problem is equivalent to comparing the change-point locations of independent profiles.
An algorithm for the computation of change-point location uncertainty. Here the computational issue (iii) is typically less crucial than before: for instance, the approaches presented in the previous sections, namely the exact Bayesian segmentation and the constrained HMM, can be applied. Indeed, n is of the order of $10^3$ as we consider genes, and K is of the order of at most ten, since we want to separate the coding regions (i.e. exons) from the non-coding regions (i.e. introns) within a single gene. However, comparing change-points requires the ability to compute quantities such as the uncertainty of their location. To this end, we have proved that the negative binomial distribution satisfies the requirements of the approach of Rigaill et al. (2012) and have implemented it, together with several other distributions, in an R package called EBS (for Exact Bayesian Segmentation). We describe in Chapter
Assessing the quality of the models. Because the goal is the comparison of independent segmentations, it seemed natural to check the relevance of our modeling contributions (i), beginning with a state of the art of segmentation methods that can take into account both the discrete nature of our data and the absence of a reference profile. We show (in Chapter 3.1), in collaboration with Sandrine Dudoit and Stéphane Robin, that algorithms are more effective when K is known, an assumption that is not unreasonable in contexts where we already have an approximate genome annotation that we seek to refine. The PDPA and EBS algorithms then achieve the best performance, while the constrained HMM approach is faster than EBS but gives slightly worse results and therefore offers no gain in this context of 'small data'. The other algorithms considered, which were implemented only for the Poisson distribution, failed to match any of our three approaches.
Methods for the comparison of change-point location. We have subsequently retained the exact Bayesian segmentation model to perform our location comparison (associated with the inference issue (ii)), and proposed, in collaboration with Stéphane Robin, two approaches which are presented in Chapter 3.2: the first is dedicated to the comparison of two profiles, while the second applies regardless of the number of profiles considered.
In the case of two profiles, the posterior distribution of the shift between locations can be computed by a simple convolution as
$$\delta_{k_1,k_2}(d; K^1, K^2) = \sum_t p_{k_1}(t; Y^1; K^1)\, p_{k_2}(t - d; Y^2; K^2),$$
where $p_{k_\ell}(t; Y^\ell; K^\ell)$ is the posterior distribution of change-point $\tau^\ell_{k_\ell}$ from the segmentation in $K^\ell$ segments of profile $Y^\ell$.
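As an illustration of the convolution, here is a minimal Python sketch. It assumes the two posterior distributions of the change-point location are already available as probability vectors indexed by position (for instance as produced by an exact Bayesian segmentation); the function name `shift_posterior` and the toy vectors are hypothetical and are not part of the EBS package.

```python
import numpy as np

def shift_posterior(p1, p2):
    """Posterior distribution of the shift d between two change-points.

    p1[t] (resp. p2[t]) is the posterior probability that the change-point
    of profile 1 (resp. 2) is located at position t.
    Returns (shifts, delta) with delta[i] = sum_t p1[t] * p2[t - shifts[i]].
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # In "full" mode, np.correlate returns sum_t p1[t] * p2[t - d]
    # for every lag d from -(len(p2) - 1) to len(p1) - 1.
    delta = np.correlate(p1, p2, mode="full")
    shifts = np.arange(-(len(p2) - 1), len(p1))
    return shifts, delta

# Toy posteriors over four positions (illustrative values only).
shifts, delta = shift_posterior([0.0, 0.2, 0.8, 0.0], [0.0, 0.5, 0.5, 0.0])
for d, prob in zip(shifts, delta):
    if prob > 0:
        print(f"shift {d:+d}: posterior mass {prob:.3f}")
```

The mass that this distribution puts at and around d = 0 is the kind of quantity on which a decision about the equality of the two change-points can be based.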
This approach does not extend to more than two series. It is then natural to compute the posterior probability of the event $E_0 = \{\tau^1_{k_1} = \dots = \tau^I_{k_I}\}$ to decide on the equality of the change-points. This led us to introduce an additional layer in the graphical model, as illustrated in Figure 1.13.
Figure 1.13: Original and modified graphical models for the comparison of change-point location. We add a layer to the hierarchical model of Rigaill et al. (2012) for the comparison of change-point locations in independent profiles. E is the random variable which corresponds to the event 'change-points are identical'.
We can then compute the posterior probability of this event exactly. Both frameworks provide natural decision rules for the equality of change-points in the profiles.
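To give a feel for the quantity involved, the following Python sketch computes the probability that the change-points of I independent profiles coincide under the original model only, that is, without the additional layer of Figure 1.13. Since the profiles are independent, this is simply the sum over positions of the product of the individual posteriors. The reweighting by a prior on E used in the modified model is not reproduced here, and the function name `coincidence_probability` is hypothetical.

```python
import numpy as np

def coincidence_probability(posteriors):
    """Probability that all change-points share the same location.

    posteriors is an (I, n) array; row l holds the posterior distribution
    of the change-point location in profile l. Profiles are assumed
    independent, so the joint posterior is the product of the rows.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    # sum over positions t of prod over profiles l of p_l(t)
    return float(np.prod(posteriors, axis=0).sum())

# Three toy profiles over four positions (illustrative values only).
toy = [
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.1, 0.9, 0.0],
]
print(coincidence_probability(toy))
```

With two profiles this quantity equals the posterior mass of the shift distribution at d = 0.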
We return to our benchmark dataset in Chapter 3.4, where we apply these rules to a subset of yeast genes with two exons. We illustrate the expected result: the boundaries of introns do not depend on the growth condition, while the start and end of transcription vary with the environment.
Conclusion. In this thesis we have developed several segmentation methods for the general framework of genome annotation, which we have illustrated on the same dataset throughout the manuscript. This has allowed us to highlight their value for studying biological phenomena such as differential splicing. Table 1.3 summarizes most of our contributions. Depending on the depth of the analysis performed (from whole genome to single gene), each of our three methods, namely the pruned Dynamic Programming algorithm with the non-asymptotic penalty, the constrained HMM with the ICL penalty, and the exact Bayesian approach, can be applied to locate the change-points and assess their credibility. Moreover, they all meet the following three requirements:
• they are suitable for modeling count data, especially with the negative binomial distribution, but can be extended to many other distributions,
• they optimize their target criterion exactly, and
• they are implemented in R packages that are freely available to the public.
In the next two chapters, the sections correspond to papers submitted during the PhD, complemented by some discussion. Depending on the paper, these discussions are either brief remarks or illustrations (as in Section 3.1) or broader extensions of the work (as in Section 2.2).
Scale and examples            Size and complexity     Method and package            Contribution
----------------------------  ----------------------  ----------------------------  ----------------------------------
Whole genome                  n: 10^8     O(Kn)       pruned Dynamic Programming    optimal segmentation   qualitative
  e.g. expressed genes        K: 10^3                 Segmentor3IsBack              oracle inequality
  e.g. new transcripts
                              n: 10^5     O(K^2 n)    constrained HMM               ICL                    conditional
                              K: 10^2                 postCP
Genes                         n: 10^4     O(Kn^2)     Exact Bayesian Segmentation   optimal segmentation   exact
  e.g. confident annotation   K: 10^1                 EBS                           ICL
  e.g. profile comparison                                                           decision rules
Table 1.3: Overview of the thesis contributions. Our contributions are organized by scale of the profiles (from whole genomes to single genes), for which we give potential biological applications and the tools developed for their analysis. For each of them, we recall the complexity, the maximum values of the parameters n (length of the data) and K (number of segments), and some examples of the information provided by the methods.
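The pairing of each method with a scale can be checked with a rough order-of-magnitude computation. The Python sketch below plugs the maximum values of n and K from Table 1.3 into the stated complexities; constants and lower-order terms are ignored, so the figures are indicative only and are not measured running times.

```python
# (n, K, complexity) for each method, taken from Table 1.3.
settings = {
    "pruned Dynamic Programming": (10**8, 10**3, lambda n, K: K * n),       # O(Kn)
    "constrained HMM":            (10**5, 10**2, lambda n, K: K**2 * n),    # O(K^2 n)
    "Exact Bayesian Segmentation": (10**4, 10**1, lambda n, K: K * n**2),   # O(Kn^2)
}

# Leading-order operation counts at the maximum problem sizes.
costs = {name: cost(n, K) for name, (n, K, cost) in settings.items()}

for name, ops in costs.items():
    print(f"{name}: about {ops:.0e} elementary operations")
```

Under these assumptions the three methods land within a couple of orders of magnitude of each other, which is consistent with each being assigned to the largest scale it can handle.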