
2.3 Exploratory data analysis of RUNX1/ETO

2.3.3 Characteristics of the data

Correlation

The read counts in the data are weakly correlated. Figure 2.5 shows the autocorrelation function (acf) for the first part of chromosome 1 from the test sample (the control sample behaves similarly). The figure shows only a weak correlation between the read counts. It is generally known that genetic data are serially correlated, but the amount of correlation varies. Here, the read counts show weak correlation when the whole test sample (the whole genome) is considered, whereas within small regions of the genome they show little or no correlation.

Although the read counts show weak correlation globally, they show no correlation locally within small regions. That is, the read counts within windows of size 200 bp show no correlation. For instance, the read counts in the window shown in the lower panel of Figure 2.5 appear uncorrelated in both the test and the control samples. One could argue that this absence of correlation within small regions arises because such regions contain too few read counts, i.e., the sample size is small, so the correlation between the counts is not significant. If the interest is only in small regions, then it is not necessary to consider the whole genome with all its features.
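The kind of acf shown in Figure 2.5 can be illustrated with a minimal Python sketch. This is not the code used to produce the figure: the function name `sample_acf`, the simulated `toy_counts`, and the rate 0.01 are assumptions chosen only for illustration, and the band is the usual large-sample approximation for an uncorrelated sequence.

```python
import numpy as np

def sample_acf(counts, max_lag):
    """Sample autocorrelation of a 1-D sequence for lags 0..max_lag."""
    x = np.asarray(counts, dtype=float)
    x = x - x.mean()                      # centre the sequence
    denom = np.dot(x, x)                  # lag-0 sum of squares
    if denom == 0:
        raise ValueError("acf is undefined for a constant sequence")
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# Hypothetical per-base-pair counts (window size 1 bp), mostly zeros.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(0.01, size=10_000)

acf_values = sample_acf(toy_counts, max_lag=20)

# Approximate 95% band for an uncorrelated sequence of this length.
band = 1.96 / np.sqrt(len(toy_counts))
```

Lags whose acf values fall inside `[-band, band]` are consistent with no correlation. The band widens as the sequence gets shorter, which is one way to see why a small window offers little evidence of correlation.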

Zero counts

As mentioned earlier, more than 99% of the read counts in the data are zeros. By zero counts we mean base pairs at which no reads are observed. Because they make up almost all of the data, the zero counts are worth further investigation. As there are few non-zero counts, there are many runs of consecutive zero counts. Table 2.5 shows summary statistics of the observed lengths of consecutive zero counts, and Figure 2.6 shows the observed distribution of these lengths. Both the table and the figure are produced from the first part of chromosome 1, where in total there are 616,638 observed lengths of consecutive zeros.

Chapter 2. Optimal window size and exploratory data analysis of RUNX1/ETO

Figure 2.5: The top panel shows the acf of the read counts for the first part of chromosome 1 from the test sample. No windowing is used for the reads (window size 1 bp). The lower panel shows the acf of the read counts in a window from chromosome 1, shown in Figure 2.4, for the test (left) and control (right) samples. The vertical axis in all plots ranges from 0 to 1; only the lower part, where the correlation exists, is shown.

Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
0      17        49       97.56   118       184900

Table 2.5: Summary statistics of observed lengths of consecutive zero counts in the first part of chromosome 1.

From Table 2.5, it can be seen that there are quite long gaps of zeros between non-zero counts. The mean exceeds the median by about 50, which is a large difference and indicates a right-skewed distribution. On the other hand, the mean still lies below the third quartile, so the skew comes from a relatively small number of very long runs. In general, the lengths vary, and 75% of them are shorter than 120 zeros. Figure 2.6 shows that there are some very large lengths, but not many. Panels (c) and (d) of the figure show that the lengths have a clear pattern, most clearly in (d), where more than 75% of the lengths are represented.
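Lengths of consecutive zeros of the kind summarised in Table 2.5 could be extracted as in the following Python sketch. The function name `zero_run_lengths` and the toy vector are hypothetical, and the sketch records only runs of length at least 1. The minimum of 0 in Table 2.5 suggests that the source may also count empty gaps between adjacent non-zero positions, but that convention is not stated, so it is not reproduced here.

```python
import numpy as np

def zero_run_lengths(counts):
    """Lengths of maximal runs of consecutive zeros in a 1-D count sequence."""
    is_zero = np.asarray(counts) == 0
    # Pad with False on both sides so every run has a start and an end.
    padded = np.concatenate(([False], is_zero, [False])).astype(np.int8)
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)    # positions where a zero run begins
    ends = np.flatnonzero(change == -1)     # positions just past where it ends
    return ends - starts

# Hypothetical per-base-pair counts; the zero runs have lengths 2, 1, 3, 1.
toy_counts = [0, 0, 3, 0, 1, 1, 0, 0, 0, 2, 0]
lengths = zero_run_lengths(toy_counts)

# Six-number summary in the same layout as Table 2.5.
summary = {
    "min": lengths.min(),
    "q1": np.percentile(lengths, 25),
    "median": np.median(lengths),
    "mean": lengths.mean(),
    "q3": np.percentile(lengths, 75),
    "max": lengths.max(),
}
```

A mean well above the median in `summary`, as in Table 2.5, points to a long right tail of very long runs.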

In Figure 2.6 (d), three clear features can be seen: first, an exponential decay from 0 to around 20 zeros; second, a bump from 30 to around 70 zeros; and third, a linear component beyond a length of 70 zeros. These components suggest that the lengths of the zero runs might not follow a single distribution and, as a result, that the read counts might not follow a single distribution either. If the read counts were, for instance, independent and followed a single Poisson distribution, then the lengths of consecutive zeros would follow a geometric distribution, and the decay in Figure 2.6 (d) would be linear throughout. In Section 3.3, we will see how these features can be useful in finding a statistical model for the read counts.
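The geometric argument can be checked numerically. The sketch below assumes independent counts with a common Poisson rate `lam`, so that each position is zero with probability exp(-lam) and a run of exactly k zeros before the next non-zero count has probability exp(-lam * k) * (1 - exp(-lam)). The function name and the value `lam = 0.01` are illustrative assumptions, not estimates from the data.

```python
import math

def zero_run_log_pmf(k, lam):
    """log P(run of exactly k zeros) for i.i.d. Poisson(lam) counts, k = 0, 1, 2, ..."""
    # log[(e^{-lam})^k * (1 - e^{-lam})] = -lam * k + log(1 - e^{-lam})
    return -lam * k + math.log1p(-math.exp(-lam))

lam = 0.01  # hypothetical average read count per base pair

log_p = [zero_run_log_pmf(k, lam) for k in range(201)]

# Successive differences of the log-probabilities: constant and equal to -lam,
# i.e. a straight line when frequencies are plotted on a log scale.
slopes = [b - a for a, b in zip(log_p, log_p[1:])]
```

Under this single-rate assumption the log-frequencies fall on one straight line with slope `-lam`. A curve with a bump, or with segments of different slopes, is therefore not what one Poisson rate would produce.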


Figure 2.6: Distribution of lengths of consecutive zero counts in the first part of chromosome 1. (a) shows the raw distribution of the lengths of consecutive zeros. (b) shows the frequencies on a log scale. (c) shows lengths from 0 to 1000 consecutive zeros with frequencies on a log scale, and (d) shows lengths from 0 to 200 consecutive zeros. By zero counts we mean base pairs at which no reads are observed.

Chapter 3 Modelling the distribution of ChIP-Seq data

3.1 Introduction

In this section, we look for a statistical model that can be fitted to the read counts and used to achieve the objective of the study, which is to detect regions in the genome that are significantly different.

A model that can be used to describe the read counts is the Poisson distribution. The Poisson model describes the number of events at a particular location, where the number of events is a non-negative, discrete variable. The read counts can be regarded as the number of events, and the positions with which they are associated as the locations. We noticed in RUNX1/ETO, for example, that most regions across the genome have low read counts (or no reads at all). Although we observed a few regions with quite large read counts, these regions represent a very small fraction of the genome (see Table 2.4). This means that the read counts are quite similar across the genome and that the average read count per base pair is low. Hence, the mean and the variance of the data can be said to be approximately equal, as a Poisson distribution requires. Thus, the Poisson distribution can be considered as a model for the read counts.
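One simple way to examine the claim about the mean and the variance is the index of dispersion (sample variance divided by sample mean), which is close to 1 for Poisson data. The Python sketch below uses simulated counts only. The rates, the fraction of enriched positions, and all names are assumptions made for illustration and are not taken from the RUNX1/ETO data.

```python
import numpy as np

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    x = np.asarray(counts, dtype=float)
    mean = x.mean()
    if mean == 0:
        raise ValueError("dispersion index is undefined when all counts are zero")
    return x.var(ddof=1) / mean

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical background: one low Poisson rate at every position.
background = rng.poisson(0.02, size=n)

# Hypothetical mixture: a small fraction of positions with a much higher rate.
enriched = rng.random(n) < 0.001
mixed = np.where(enriched, rng.poisson(20.0, size=n), background)

d_background = dispersion_index(background)  # near 1
d_mixed = dispersion_index(mixed)            # well above 1
```

In this simulation even a very small fraction of high-count positions pushes the index well above 1. For that reason it may be worth computing the index on the observed counts before relying on a single Poisson rate.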

Although filtering is applied in the read-mapping step, some mis-mapped reads might remain. A random model for the read counts, such as the Poisson, can handle them: the mis-mapped reads are treated as part of the randomness of the model. Furthermore, the mis-mapped reads and the non-zero reads make up a very small fraction of ChIP-Seq data compared to the zeros (see Table 2.4).

We observed in the previous chapter that RUNX1/ETO data, like ChIP-Seq data in general, contain many zeros. In many ChIP-Seq studies most of these zeros are discarded. However, this common feature is an important part of the nature of ChIP-Seq data, so taking this large part of the data into account might lead to a common model that describes ChIP-Seq data in general. Thus, the aim is to find a model that can handle this common and natural feature of ChIP-Seq data.

It was also seen that the read counts are weakly correlated. Hence, the read counts are not independent, which violates the independence assumption of a simple Poisson model. Correlation can exist between variables that are drawn from mixture models [15]. On the other hand, this correlation can be an issue when a single model is assumed for the whole genome. One way to satisfy conditional independence is to use Hidden Markov Models (HMMs), which are introduced in the following section, Section 3.2.

In document Statistical analysis of genomic binding sites using high-throughput ChIP-seq data (Page 44-49)