
4.3 Simulation study

4.3.1 Simulated data

In the simulation study, we want to simulate data that behave like real ChIP-Seq data. In ChIP-Seq data, it is known that most regions of the genome have low read counts and a few regions have quite high read counts. For instance, in the RUNX1/ETO data most windows have a rate of approximately 1 read per window, and a few windows have rates larger than 10 reads per window. It was also mentioned that the read counts can be modelled with a Poisson model. Moreover, the optimal window size for RUNX1/ETO is 200 bp; since we seek simulated data with the same structure, the same window size is used in the simulation.

To determine whether or not the methods control the false-positive rate at a given nominal level, e.g., 5%, we constructed a simulation study as follows:

1. Simulate two samples, each of 10,000 windows of size 200 bp, from Poi(λ), where λ ranges from 0.01 to 0.5 in steps of 0.02. The simulation is run for each value of λ. Because both samples are drawn from the same distribution, we simulate under the null hypothesis, which assumes no difference between the two samples.

2. Perform the testing methods, and calculate the proportion of significant windows out of 10,000 for each testing method, where any window with p-value < 0.05 is considered significant.

3. Repeat the above two steps 50 times, and take the average as the final false-positive rate for each of the methods used.
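The three steps above can be sketched in code. This is a minimal illustration, not the procedure used to produce the figures: the study applies MACS, MAnorm, and diffReps, whereas the sketch substitutes an exact conditional binomial test as a stand-in testing method. The function names, the seed, and the Knuth-style Poisson sampler are all assumptions introduced for the example.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def binomial_p_value(x, y):
    """Two-sided exact p-value for H0: both counts share one Poisson rate.

    Stand-in test (an assumption, not one of the evaluated tools):
    conditional on n = x + y, x follows Binomial(n, 0.5) under H0.
    """
    n = x + y
    if n == 0:
        return 1.0
    lower_tail = sum(math.comb(n, i) for i in range(min(x, y) + 1)) / 2 ** n
    return min(1.0, 2.0 * lower_tail)


def false_positive_rate(lam, n_windows=10_000, n_reps=50, alpha=0.05, seed=0):
    """Average proportion of windows called significant under the null."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_reps):
        significant = 0
        for _ in range(n_windows):
            # Step 1: both samples come from the same Poi(lam).
            x, y = poisson(lam, rng), poisson(lam, rng)
            # Step 2: count windows with p-value below the nominal level.
            if binomial_p_value(x, y) < alpha:
                significant += 1
        rates.append(significant / n_windows)
    # Step 3: average over the repetitions.
    return sum(rates) / n_reps
```

Calling `false_positive_rate(lam)` for each `lam` in `[round(0.01 + 0.02 * i, 2) for i in range(25)]` would trace out a false-positive curve over the grid of rates; a method controls the false-positive rate if the curve stays at or below the nominal 0.05.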

To evaluate the power of the methods, we constructed a second simulation as follows:

1. Simulate two samples, each of 10,000 windows of size 200 bp, from Poi(λ1 = 0.01). Here λ1 represents the low read-count rate found in most regions of the genome.

2. Simulate 2000 of the 10,000 windows, at the same locations in both samples, from Poi(λ2 = 0.2).

3. Simulate another 1000 windows, at the same locations in both samples, from Poisson distributions with a fixed rate λ2 = 0.2 and a rate λ3 that ranges from 0.22 to 1 in steps of 0.02. To keep the difference in place after the normalising process of the methods, in the first sample we simulate the first 500 windows using λ2 and the second 500 windows using λ3, and in the second sample the first 500 windows using λ3 and the second 500 using λ2. Note that the choice of λ2 = 0.2 will be discussed in the comparison chapter, Chapter 6.

4. Perform the testing methods, and calculate the proportion of detected significant windows out of 1000 windows for each of the testing methods.

5. Repeat the above steps 50 times and consider the average of them as the final results of the power for each of the used methods.
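The layout of the two samples in steps 1 to 3 can be sketched as follows. The placement of the enriched windows (the shared block first, followed directly by the differential block) is an assumption made for illustration, since the text only requires that the locations match across samples; the function names and the Poisson sampler are likewise hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_power_samples(lam3, lam1=0.01, lam2=0.2, n_windows=10_000,
                           n_shared=2000, n_diff=1000, seed=0):
    """Return (sample_a, sample_b, differential_indices) for one value of lam3."""
    rng = random.Random(seed)
    half = n_diff // 2
    diff_start, diff_end = n_shared, n_shared + n_diff  # assumed positions

    # Step 1: background rate everywhere.
    rates_a = [lam1] * n_windows
    rates_b = [lam1] * n_windows
    # Step 2: windows enriched at the same rate in both samples.
    for i in range(n_shared):
        rates_a[i] = rates_b[i] = lam2
    # Step 3: differential windows, with the rates swapped between halves
    # so that each sample carries the same total enrichment.
    for i in range(diff_start, diff_end):
        first_half = i < diff_start + half
        rates_a[i] = lam2 if first_half else lam3
        rates_b[i] = lam3 if first_half else lam2

    sample_a = [poisson(r, rng) for r in rates_a]
    sample_b = [poisson(r, rng) for r in rates_b]
    return sample_a, sample_b, range(diff_start, diff_end)
```

For a given method, the power at a difference λ3 − λ2 would then be the fraction of the windows in `differential_indices` that the method calls significant, averaged over 50 repetitions.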

MACS and MAnorm

As MAnorm is a peak-based method, we used MACS to call peaks and evaluated the performance of MACS and MAnorm jointly. MAnorm can also be evaluated by itself if we give it the exact regions that we simulate as peaks, and we did so to evaluate its individual performance. That is, for the false-positive determination we used MACS and MAnorm jointly, as well as MAnorm only, giving it each window's exact start and end points so that MAnorm performs the significance test for each window. Similarly, for the power evaluation we used the methods jointly in one run, and MAnorm only by giving it the 1000 windows with different rates as the peak regions. The results are shown in Figure 4.1. Note that MACS was used in its default setting.

Chapter 4. Current methods for identifying differential binding sites

From Figure 4.1 it can be seen that MACS and MAnorm jointly do not show good control of the false-positive rate, especially if λ is very low. By investigating the results we found that MACS cannot detect the peaks when λ is quite low, and then it does not report any peak to MAnorm. From the figure, we can see the false-positive rate of MACS and MAnorm starts to recover around λ = 0.1. On the other hand, by considering MAnorm only it can be seen that the behaviour of the false-positive rate is opposite to the behaviour of MACS and MAnorm jointly when λ is low. Moreover, we can see that MACS and MAnorm jointly and MAnorm only show close behaviour when λ > 0.2.

In the power plot of Figure 4.1, it can be seen that MACS and MAnorm jointly start detecting significant results at differences of around 0.1, whereas MAnorm only starts much earlier, at differences of around 0.02. Given the false-positive results, MAnorm only is expected to show higher power than MACS and MAnorm jointly when the rate is low. But in the power simulation we set all the rates to be larger than or equal to 0.2, and at that level the methods can be considered comparable.

diffReps

diffReps is also evaluated in terms of controlling the false-positive rate and power. In the diffReps experiment, we used a window size of 200 bp, and the rest of the options as in the default setting. Note that in diffReps the shift process has to be made, and the shift size has to be a fraction of the window size; by default the shift size is 100 bp. Hence, we applied diffReps with its G test and the result is shown in Figure 4.2.
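To make the window and shift sizes concrete, the sketch below counts reads in overlapping 200 bp windows that advance by 100 bp. It reads the "shift" as the step between consecutive windows, which is an interpretation of the text rather than a statement about diffReps internals, and it treats "a fraction of the window size" as meaning the window length is a whole multiple of the step. The function name and arguments are hypothetical.

```python
from bisect import bisect_left


def sliding_window_counts(read_positions, region_length, window=200, step=100):
    """Count reads in half-open windows [start, start + window) moved by `step`."""
    if window % step != 0:
        # Assumed reading of "the shift size has to be a fraction of the window size".
        raise ValueError("window must be a whole multiple of step")
    positions = sorted(read_positions)
    counts = []
    for start in range(0, region_length - window + 1, step):
        lo = bisect_left(positions, start)            # first read at or after start
        hi = bisect_left(positions, start + window)   # first read at or after the window end
        counts.append((start, hi - lo))
    return counts
```

With a 100 bp step, every read falls in up to two windows, so neighbouring window counts are not independent.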

Figure 4.1: Top panel shows the determination of the false-positive rate by using MACS and MAnorm jointly, and MAnorm only, based on simulated data. In that panel, the horizontal axis represents the λ used in the simulation, and the vertical axis represents the average of 50 false-positive rates for each of the λs. The lower panel shows the power of MACS and MAnorm jointly, and MAnorm only, based on simulated data. In that panel, the horizontal axis represents the difference in rates between the two simulated datasets, where we enrich the windows with different rates, and the vertical axis represents the average power of 50 simulations for each difference point λ3 − λ2.

Figure 4.2: Top panel shows the determination of the false-positive rate of diffReps based on the simulated data. In that panel, the horizontal axis represents the λ used in the simulation and the vertical axis represents the average of 50 false-positive rates for each of the λs. The lower panel shows the power of diffReps based on simulated data. In that panel, the horizontal axis represents the difference in rates between the two simulated datasets, where we enrich the windows with different rates, and the vertical axis represents the average power of 50 simulations for each difference point λ3 − λ2.

In the false-positive rate plot of Figure 4.2, it can be seen that diffReps shows a quite high false-positive rate for all values of λ used, with the highest rates at low λ. diffReps is assumed to control the false-positive rate at the 5% level, but the actual level seen in the plot is 0.15 for λ ≥ 0.2. As we deal with simulated data and know the underlying assumption, we can adjust the p-values of diffReps so that they meet the assumed false-positive rate. By running diffReps on the simulated data with λ = 0.2 at many significance levels lower than 0.05, we found that setting the significance level to 0.012 gives an actual false-positive rate of 0.05. Hence, the p-values of diffReps from the simulated data can be adjusted as follows. Let p and p∗ be the observed and adjusted p-values, respectively. The adjusted p-values can be calculated as:

\[ p^{*} = a \times p^{b}, \tag{4.4} \]

where a and b are parameters to be calculated by solving the following two equations:

\[ 0.15 = a \times (0.05)^{b}, \]

\[ 0.05 = a \times (0.012)^{b}. \]

After a few steps of algebra, we found a = 1.51 and b = 0.78. Hence, these two parameters are used to adjust the resulting p-values from diffReps in the simulation study by using Equation (4.4). By performing this adjustment, the false-positive rate is controlled at the assumed level, as shown in Figure 4.2.
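The algebra can be checked numerically. Dividing the first equation by the second removes a and leaves 3 = (0.05/0.012)^b, so b = log 3 / log(0.05/0.012), and a follows by substitution. The sketch below does exactly this; the function names are hypothetical, and capping the adjusted value at 1 is an added assumption that the text does not state. Solved exactly, the two equations give a ≈ 1.51 and b ≈ 0.77, close to the values quoted above, so small differences are presumably due to rounding.

```python
import math


def solve_power_law(p1, target1, p2, target2):
    """Solve target = a * p**b through the two points (p1, target1), (p2, target2)."""
    # Taking the ratio of the two equations eliminates a.
    b = math.log(target1 / target2) / math.log(p1 / p2)
    a = target1 / p1 ** b
    return a, b


def adjust_p_value(p, a, b):
    """Apply Equation (4.4); the cap at 1 is an assumption, not from the text."""
    return min(1.0, a * p ** b)


# Observed level 0.05 behaves like 0.15; observed level 0.012 behaves like 0.05.
a, b = solve_power_law(0.05, 0.15, 0.012, 0.05)
```

With these parameters, `adjust_p_value(0.012, a, b)` returns 0.05 up to floating-point error, so calling a window significant when the adjusted p-value is below 0.05 is equivalent to thresholding the raw p-value at 0.012.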

From the power plot in Figure 4.2, we can see that diffReps starts to detect the differences quite early when using its raw p-values. However, we saw that the raw setting returns a high false-positive rate, so an adjustment procedure is needed here as well. As we fixed λ2 at 0.2 in the simulation, we can use the same adjustment values a and b to adjust the p-values in the power simulation. After the adjustment, the power curve shifts to the right: diffReps starts to detect significant results around a difference of 0.2, as shown in Figure 4.2.

Note that the adjustment parameters a and b would be different from one dataset to another. In the simulation, we used simulated data with the same rate λ to adjust the p-values. In the case where we have real data, the adjustment would be made if biological replicates were provided (see Section 4.3.2). That is, in real data, we expect replicates to have the same rate, and hence the adjustment can be made.


In document Statistical analysis of genomic binding sites using high-throughput ChIP-seq data (Page 91-97)