Chapter 2 Sequence bias in ChIP-seq experiments

2.4 Supplementary results

2.4.1 Selecting only a subset of the sequences reduces ‘noise’ from low sequence

Introduction

There is significant variation in the number of occurrences of each of the 65536 different 8-mer sequences in a typical genome. Using data from 8-mers for model fitting is potentially problematic when there are only a few instances in the genome, because there may be only one or two fragments associated with these 8-mers in a dataset. This mirrors the problem with using such data for bias correction (Section 2.3.10). To reduce the noise contribution from such points, and to reduce the computational load during model fitting, only sequence data for 8-mers whose number of instances in the genome exceeded some threshold were used. This section examines the impact of this approach.

Analysis

Panels: a) SL117, b) SL523 (x-y plots), c) table below.

Sequence count   Number of 8-mers with    Pearson coefficient   Pearson coefficient
threshold N      more than N instances    SL117                 SL523
0                65536                    0.9440                0.7935
5000             61924                    0.9504                0.8181
50000            37134                    0.9724                0.9522

Figure 2-22 Variation of bias correlation with threshold. a) and b) x-y plots showing the correlation of 8-mer bias using data from the two halves of the genome for SL117 and SL523. Each point compares the bias of the same sequence from the two half genomes. Data for sequences that occur fewer times than the three values of threshold shown are excluded from the graphs and calculations. Lines associated with quantisation of break counts are very visible in the SL523 data at low thresholds. Red oval indicates 8-mers where the ratio of the number of breaks in the two half genomes is 2:1. This artefact is not present when a threshold of 50000 is used.

The first 25 nucleotides of each fragment were sequenced in experiment SL117 in order to align the fragments to the genome. A 25 nucleotide sequence is sufficient to identify approximately 4.48 billion unique sequence tag positions in the human genome. In the SL117 dataset there were 19.3 million tags that could be uniquely mapped to the genome. If the DNA sequence were essentially random then any given 8-mer would occur approximately 68400 times in the genome, and in the 19.3 million reads there would be expected to be an average of 19,300,000/65536 = 294 fragments associated with each sequence. The non-random nature of the DNA sequence means that some sequences are significantly underrepresented (for example, CGCGTACG only occurs 503 times in the mappable regions of the human genome) and the number of breaks associated with such a sequence is consequently very low (there were only 11 instances where the data show a break between the C and G at the start of the CGCGTACG sequence). Any data derived from sequences which occur so infrequently will be very noisy.
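The expected counts quoted above follow from simple division. A minimal sketch of that arithmetic, assuming a uniformly random genome and using only the figures given in the text (the variable names are illustrative and do not come from the original analysis):

```python
# Figures quoted in the text
mappable_positions = 4.48e9   # unique 25-nt tag positions in the human genome
mapped_tags = 19.3e6          # uniquely mapped tags in the SL117 dataset
n_8mers = 4 ** 8              # number of distinct 8-mers

# Under the random-sequence assumption every 8-mer is equally likely
expected_instances = mappable_positions / n_8mers
expected_fragments = mapped_tags / n_8mers

print(n_8mers)                    # 65536
print(round(expected_instances))  # 68359, quoted as approximately 68400
print(round(expected_fragments))  # 294
```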

Figure 2-22c) shows how many of the 65536 different possible 8-mers have more than N instances in the genome for various values of N.
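The counts in Figure 2-22c) can in principle be obtained by tallying every overlapping 8-mer and then counting how many distinct 8-mers exceed N. The following is only a sketch of that idea on a toy string: the function names are hypothetical, and whether the original analysis counted one or both strands, or how it restricted the count to mappable regions, is not stated here and is not modelled.

```python
from collections import Counter

def count_kmers(sequence, k=8):
    """Count every overlapping k-mer made up only of A, C, G and T."""
    counts = Counter()
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):   # skip windows containing N or other symbols
            counts[kmer] += 1
    return counts

def n_kmers_above(counts, threshold):
    """Number of distinct k-mers with more than `threshold` instances."""
    return sum(1 for c in counts.values() if c > threshold)

# Toy example with k=4 so that the counts are easy to check by hand
toy = "ACGTACGTACGT" + "T" * 6
counts = count_kmers(toy, k=4)
for n in (0, 1, 2, 3):
    print(n, n_kmers_above(counts, n))
```

For a real genome the same tally would be run with k=8 over each chromosome and the counters summed before applying the threshold.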

In order to assess the impact of using a subset of the 8-mers, the genome was split into two, assigning each chromosome to one or other subset such that the two subsets were approximately equal in size. The sequence bias for each of the 8-mers was calculated for both of the two half genomes and plotted against each other. This was done for the two datasets and for various threshold values (Figure 2-22a and b).
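The text does not say how chromosomes were assigned to the two halves. One simple way to obtain two approximately equal subsets is a greedy partition, sketched below as an assumption rather than as the method actually used; the chromosome names and lengths are invented for illustration.

```python
def split_into_halves(chrom_lengths):
    """Greedy partition: take chromosomes largest first and add each one
    to whichever half currently has the smaller total length."""
    halves = ([], [])
    totals = [0, 0]
    ordered = sorted(chrom_lengths.items(), key=lambda kv: kv[1], reverse=True)
    for name, length in ordered:
        idx = 0 if totals[0] <= totals[1] else 1
        halves[idx].append(name)
        totals[idx] += length
    return halves, totals

# Hypothetical chromosome lengths in arbitrary units, not real genome data
toy_lengths = {"chrA": 250, "chrB": 240, "chrC": 200,
               "chrD": 190, "chrE": 180, "chrF": 170}
halves, totals = split_into_halves(toy_lengths)
print(halves)
print(totals)   # [620, 610]
```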

Larger values of N remove the sequences with fewer instances across the genome, and in both datasets this removes the outliers around a central core distribution, reducing the noise associated with the distribution and improving the Pearson correlation coefficient. Horizontal lines in the SL523 data result from 8-mers for which there is a combination of only a few instances of the 8-mer in the genome and also a low sequence bias, resulting in just one or two fragments being associated with the 8-mer.
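The effect of the threshold on the correlation can be illustrated with a small sketch. The bias values below are invented, and the per-8-mer bias is treated as a given input because its definition lies outside this section; only the filtering step (keep 8-mers with more than N instances) and the Pearson calculation are shown.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def thresholded_correlation(genome_counts, bias_half1, bias_half2, threshold):
    """Pearson r over 8-mers with more than `threshold` genome instances."""
    kept = [k for k, c in genome_counts.items() if c > threshold]
    return pearson([bias_half1[k] for k in kept],
                   [bias_half2[k] for k in kept])

# Hypothetical inputs: three common 8-mers and one rare, noisy one
genome_counts = {"AAAAAAAA": 90000, "ACGTACGT": 60000,
                 "GGGGCCCC": 55000, "CGCGTACG": 503}
bias_half1 = {"AAAAAAAA": 1.0, "ACGTACGT": 2.0, "GGGGCCCC": 3.0, "CGCGTACG": 0.5}
bias_half2 = {"AAAAAAAA": 1.1, "ACGTACGT": 2.0, "GGGGCCCC": 2.9, "CGCGTACG": 4.0}

r_all = thresholded_correlation(genome_counts, bias_half1, bias_half2, 0)
r_high = thresholded_correlation(genome_counts, bias_half1, bias_half2, 50000)
print(r_all, r_high)   # the rare 8-mer drags r_all well below r_high
```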

The quantisation of the results to integral numbers of fragments produces the diagonal lines, which correspond to bias ratios formed from two small integers. This indicates the kind of artefact that can occur for 8-mers with such low fragment counts, and raises the concern that it could cause other, subtler effects during modelling. Using a threshold of 50,000 removes the points where this artefact was most obvious, without appearing to distort the general distribution of data.
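Why low counts give a small set of discrete ratios can be seen by enumerating them. This sketch assumes, for illustration only, that an 8-mer has the same number of instances in each half genome, so that the ratio of biases equals the ratio of break counts; the upper limit of three breaks per half is arbitrary.

```python
from fractions import Fraction

max_breaks = 3  # illustrative upper limit on breaks per half genome

# Every ratio that two small integer break counts can produce
ratios = sorted({Fraction(a, b)
                 for a in range(1, max_breaks + 1)
                 for b in range(1, max_breaks + 1)})

print([str(r) for r in ratios])
# ['1/3', '1/2', '2/3', '1', '3/2', '2', '3']
```

Only seven distinct ratios are possible, including the 2:1 case highlighted in Figure 2-22, so points with such low counts fall on a few straight lines through the origin rather than spreading continuously.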

a) Model fitting using 5000 threshold b) Model fitting using 50000 threshold

Figure 2-23 Comparison of SL523 PCMs generated by thresholds set to 5000 and 50000. This shows that broadly similar characteristics are obtained with the two different thresholds, although there are some subtle differences. Model fitting with a 5000 threshold results in a PCM with a C at position two, implying that there are a number of over-represented 8-mers with a C at this position but that they tend to be associated with 8-mers with fewer than 50000 instances in the genome.

In order to test for possible effects of using different thresholds, the model fitting results obtained with two different thresholds were compared. Model fitting was used to generate PCMs for the two datasets with the threshold set to 5000 (which includes 94.8% of the 8-mers) and 50000 (which includes 56.7% of the 8-mers). The two PCMs for the SL117 dataset were essentially identical (Figure 2-23a). The two sets of PCMs for the SL523 dataset were very similar, but showed a very slight difference (Figure 2-23b). Pearson coefficients were used to test the degree of model fit for the SL523 data, and these also showed that the fit was largely independent of the choice of threshold between 5000 and 50000 (Table 2-4). The results show that the Pearson coefficient is determined predominantly by the threshold used in the evaluation of the PCMs rather than the threshold used to generate the PCMs. Both sets of results indicate that no systematic errors are introduced as a result of using a threshold of 50000 instances of an 8-mer in the genome when working with data from H. sapiens.

                     Coefficients optimised   Coefficients optimised
                     with 5000                with 50000
Tested with 5000     0.8725                   0.8521
Tested with 50000    0.9263                   0.9383

Table 2-4 Pearson coefficients indicate equivalence of PCMs generated with different threshold values. PCMs were generated with thresholds set to 5000 and 50000, and the Pearson correlation was then calculated for the fit between model and observed data for both sets with both thresholds. Values are largely determined by the test conditions and not the threshold used to generate the coefficients.
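The statement that the values depend mainly on the test conditions can be checked directly from the four coefficients in Table 2-4. The sketch below only rearranges the published numbers; the dictionary layout and variable names are illustrative.

```python
# Pearson coefficients from Table 2-4, keyed by (test threshold, fit threshold)
r = {
    (5000, 5000): 0.8725, (5000, 50000): 0.8521,
    (50000, 5000): 0.9263, (50000, 50000): 0.9383,
}

# Largest change caused by switching the threshold used to fit the PCM
fit_effect = max(abs(r[(t, 5000)] - r[(t, 50000)]) for t in (5000, 50000))

# Smallest change caused by switching the threshold used to test the PCM
test_effect = min(abs(r[(5000, f)] - r[(50000, f)]) for f in (5000, 50000))

print(round(fit_effect, 4))   # 0.0204
print(round(test_effect, 4))  # 0.0538
```

Even the smallest change due to the test threshold is more than twice the largest change due to the fitting threshold.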

In document Informative sequence based models for fragment distributions in ChIP seq, RNA seq and ChIP chip data (Page 90-94)