

2.2 A Brief Introduction to ChIP-Seq technology

2.2.2 ChIP-Seq Analysis Steps

ChIP-Seq is a powerful technique for investigating the physical interactions between DNA and proteins such as transcription factors. It also helps to discover and understand patterns of epigenetic chromatin modification. Once the ChIP-Seq data have been generated, the sequences are analysed to determine the binding locations of the protein under investigation. Figure 2.6 shows the workflow of the steps involved in ChIP-Seq data analysis, and a brief overview of some of those steps follows.

Figure 2.6: Schematic of the analysis steps for ChIP-Seq data. The sequences are produced and their quality is checked; they are then mapped to the whole genome, and a peak-calling algorithm is applied to the aligned data to find the regions that are enriched for the protein. Further downstream analysis can be performed on the enriched regions.


The raw data for chromatin immunoprecipitation followed by sequencing are generated on next-generation sequencing platforms such as Illumina (http://www.illumina.com/) and ABI SOLiD [Shah 2009]. These platforms yield short reads, typically around 25 to 30 bp in length. More recent platforms can produce longer reads (50 to 100 bp), and their very high throughput can yield 700 MB to 1 GB of data per lane. Each step in the ChIP-Seq data analysis workflow is described below.

Quality Control of ChIP-Seq Experiments

After sequencing, and before the reads are mapped and analysed to find protein-bound locations, a number of quality controls can be applied to determine whether the data are worth further investigation and validation. Packages such as FastQC [Andrews 2010] allow raw sequence quality to be assessed. Several alignment-independent features are used for this assessment. Most sequencing platforms provide a quality score for each base call in a read, which reports the confidence with which a specific nucleotide was assigned to that position.

Figure 2.7: Per-base sequence quality assessed by FastQC. (Left) Unacceptable sequence quality: a large proportion of the sequences score very low in the quality check. (Right) Good-quality sequence data: most of the sequences score high in the quality check. In both plots, the X axis shows the position of the base in the read (1 – 99) and the Y axis shows the quality score (0 – 40).


Quality control software such as FastQC uses these scores to create plots and statistical reports about the overall quality of the data. Another feature is the number of bases that could not be called, i.e. the number of ‘N’s in the data, which also provides some insight into the quality of the data.
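The per-base scores and the count of uncalled bases can be illustrated with a minimal Python sketch. It assumes the common Phred+33 FASTQ encoding, which the text does not specify; the function names and the example read are hypothetical, and this is not how FastQC itself is implemented.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into per-base scores.

    Assumption: Phred+33 encoding (ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def fraction_uncalled(sequence):
    """Fraction of bases reported as 'N' (no call)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Hypothetical read: five high-quality calls followed by five poor ones.
read = "ACGTNACGTN"
quals = "IIIII#####"

scores = phred_scores(quals)
print("per-base scores:", scores)
print("mean score:", sum(scores) / len(scores))
print("fraction of N:", fraction_uncalled(read))
```

A tool such as FastQC aggregates scores of this kind over all reads at each position to produce plots like those in Figure 2.7.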

Figure 2.7 shows example FastQC output: the per-base sequence quality assessments of two ChIP-Seq datasets. Read count enrichment can also be calculated between ChIP and input samples, which helps to control for biases in the experimental methods. Visual inspection of the data is a simple but effective additional check.
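One simple way to express read count enrichment between a ChIP and an input sample is a ratio scaled by library size. The sketch below is an illustration under that assumption only; the scaling scheme, the pseudocount and the function name are hypothetical choices and are not taken from any particular tool.

```python
def fold_enrichment(chip_count, input_count, chip_total, input_total,
                    pseudocount=1.0):
    """Ratio of ChIP to input reads in one region.

    The input count is rescaled to the ChIP library size, and a
    pseudocount (an assumption) avoids division by zero.
    """
    scale = chip_total / input_total
    return (chip_count + pseudocount) / (input_count * scale + pseudocount)


# Hypothetical region: 50 ChIP reads vs 10 input reads,
# with the ChIP library twice as deep as the input library.
print(fold_enrichment(50, 10, chip_total=2_000_000, input_total=1_000_000))
```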

Genome Alignment

ChIP-Seq analysis starts with mapping all the raw reads from the ChIP experiment to the reference genome and identifying the uniquely mapped reads. In a typical ChIP-Seq experiment on a mammalian biological sample or biopsy, tens or even hundreds of millions of sequences must be aligned to a reference genome that is gigabytes in size; for that reason, alignment is one of the most computationally challenging tasks in the ChIP-Seq data analysis process [Trapnell et al. 2009]. For alignment, Bowtie [Langmead et al. 2009] and ELAND [Bentley et al. 2008] are among the most popular choices for ChIP-Seq experiments.

Several issues need to be considered when choosing a mapping algorithm and its parameters. For example, one needs to decide whether to keep only the reads that map to a unique position in the reference genome or to include reads that map to multiple locations. If only unique reads are accepted, some true binding sites may be missed because they lie in repeats or duplicated regions. Multireads, on the other hand, may improve the signal but may also increase the false positive rate. A balance therefore needs to be struck between sensitivity and specificity when choosing the mapping algorithm [Pepke et al. 2009]. Sequencing errors can also occur, so the alignment of reads should allow for a small number of mismatches.
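The distinction between unique reads, multireads and mismatch tolerance can be made concrete with a toy example. The brute-force scan below is only a sketch for illustration; real aligners such as Bowtie use indexed data structures rather than this approach, and the function names and the short reference string are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return every 0-based position where the read aligns to the
    reference with at most max_mismatches substitutions (no gaps)."""
    n = len(read)
    return [i for i in range(len(reference) - n + 1)
            if hamming(read, reference[i:i + n]) <= max_mismatches]


def classify(read, reference, max_mismatches=1):
    """Label a read as 'unmapped', 'unique' or 'multiread'."""
    hits = map_read(read, reference, max_mismatches)
    if not hits:
        return "unmapped", hits
    return ("unique" if len(hits) == 1 else "multiread"), hits


reference = "ACGTACGTTTGACCA"  # hypothetical toy genome

print(classify("TTGAC", reference, max_mismatches=0))
print(classify("ACGT", reference, max_mismatches=0))
print(classify("GGGGG", reference, max_mismatches=1))
```

Raising `max_mismatches` tolerates more sequencing errors but also turns more reads into multireads, which is the sensitivity and specificity trade-off described above.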


Identification of enriched regions

After the sequenced reads have been aligned to the genome, the next steps of the analysis are to convert the mapped reads into a representative count at each position in the genome and to identify the regions that are significantly enriched with reads (tags). Significance is estimated from the distribution of the data along the genome, or along the part of the genome that has been investigated. The step in which enriched regions, or peaks, are identified is known as ‘peak calling’. Several issues arise at this step. The user needs to choose a peak-calling algorithm carefully, as different peak callers address different issues and each may suit a particular type of ChIP-Seq data.
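As an illustration of turning mapped reads into counts and flagging candidate regions, the sketch below bins read start positions into fixed, non-overlapping windows and merges adjacent windows that reach a count threshold. The fixed threshold stands in for a proper significance test and, like the function names and example positions, is an assumption made for illustration.

```python
from collections import Counter


def window_counts(read_starts, window_size):
    """Count reads per non-overlapping window, keyed by window index."""
    return Counter(pos // window_size for pos in read_starts)


def enriched_windows(counts, threshold):
    """Window indices whose count reaches the threshold, in genomic order."""
    return sorted(w for w, c in counts.items() if c >= threshold)


def merge_adjacent(windows, window_size):
    """Merge consecutive window indices into (start, end) regions,
    with the end coordinate exclusive."""
    regions = []
    for w in windows:
        start, end = w * window_size, (w + 1) * window_size
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


# Hypothetical read start positions on one chromosome.
starts = [5, 12, 101, 105, 110, 150, 210, 220, 230, 260, 900]
counts = window_counts(starts, window_size=100)
candidates = enriched_windows(counts, threshold=3)
print(merge_adjacent(candidates, window_size=100))
```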

A major challenge in detecting enriched regions is that they come in three types. Sharp peaks are usually found for protein-DNA binding or for histone modifications at regulatory elements. Histone modifications that mark domains, for example transcribed or repressed regions, usually produce broad regions. Regions can also be a mixture of the two. Figure 2.8 presents the different types of peaks found in different data. Most of the available algorithms are designed for sharp peaks and handle broad regions by merging adjacent peaks [Park et al 2009]. An effective method should take both types of region into account and apply the technique appropriate to a given dataset. The peak detection algorithm is therefore key to a meaningful interpretation of ChIP-Seq data.

Peak calling can be subdivided into several tasks: generating a signal profile for each chromosome, defining the background noise and the true signal, identifying peaks, assessing significance and, finally, removing artefacts [Pepke et al. 2009]. Different tools adopt different methods for these tasks.


Figure 2.8: Different types of enriched regions depending on target proteins [Kotwaliwale 2013].

Building a signal profile is crucial for identifying enriched regions with confidence. Some tools slide a fixed-length bin, or window, along the genome and assign to the centre of each window the sum of the counts within it. CisGenome [Ji et al. 2008] and SiSSRs [Jothi et al. 2008] both follow this method and also set criteria for merging consecutive windows. Other peak-calling algorithms take advantage of the direction of the reads. Because fragments are sequenced from the 5′ end, the positions of the mapped reads form two separate distributions, one on the positive strand and one on the negative strand, with a consistent distance between the peaks of the two distributions. However, neither the positive-strand peak nor the negative-strand peak represents the actual location of the enriched site.

To address this issue, some algorithms first construct a smoothed profile on each strand and then calculate the combined profile, as shown in Figure 2.9. To achieve this, each distribution can be shifted towards the centre, or each mapped read can be extended towards the other end of its fragment and the extended fragments summed.


Figure 2.9: Forward and reverse read density profiles (blue and red, respectively) are used to make a combined density profile (orange) [Valouev et al. 2008].

MACS (Model-based Analysis of ChIP-Seq) [Zhang et al. 2008] shifts each read by d/2, where d is the fragment length. Other methods, such as FindPeaks [Fejes et al. 2008] and PeakSeq [Rozowsky et al. 2009], extend the reads to a size of d, where d is estimated from the actual data. This methodology should create a better profile; however, it has some limitations. It requires a prior estimate of the fragment size and assumes that the fragment size is uniform.
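The two strategies, shifting reads by d/2 and extending reads to length d, can be sketched as follows. This is a simplified illustration and not the implementation used by MACS, FindPeaks or PeakSeq. It assumes that each read is represented by the coordinate of its 5′ end and its strand, and that d is already known; the function names and example reads are hypothetical.

```python
def shifted_positions(reads, d):
    """Shift each 5' read position by d/2 towards the fragment centre.

    reads: iterable of (five_prime_position, strand), strand '+' or '-'.
    """
    half = d // 2
    return [pos + half if strand == "+" else pos - half
            for pos, strand in reads]


def extended_coverage(reads, d, genome_length):
    """Extend each read to length d in its 3' direction and sum the
    per-base coverage of the extended fragments."""
    coverage = [0] * genome_length
    for pos, strand in reads:
        if strand == "+":
            lo, hi = pos, pos + d          # covers pos .. pos + d - 1
        else:
            lo, hi = pos - d + 1, pos + 1  # covers pos - d + 1 .. pos
        for i in range(max(lo, 0), min(hi, genome_length)):
            coverage[i] += 1
    return coverage


# Hypothetical reads flanking a binding site near position 200.
reads = [(100, "+"), (104, "+"), (299, "-"), (303, "-")]
d = 200

print(shifted_positions(reads, d))
cov = extended_coverage(reads, d, genome_length=400)
print("maximum coverage:", max(cov), "first reached at", cov.index(max(cov)))
```

In both cases the forward-strand and reverse-strand signals, which sit on either side of the binding site, are brought together into a single profile centred on it.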

From the combined profile, peaks can be estimated. The random distribution of reads in a window of size w is modelled using a theoretical distribution. A Poisson model for the tag distribution is a good approach, as it takes into consideration both the fold ratio and the absolute tag numbers. The Poisson distribution has just one parameter, λ. If

λ = expected number of reads in a window
k = number of reads occurring in the window

then

\[ P(X = k) = \frac{e^{-\lambda}\lambda^{k}}{k!} \tag{2.1} \]
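To illustrate how the Poisson model can be used to score a window, the sketch below computes the probability mass function and the upper-tail probability P(X ≥ k), i.e. the chance of seeing at least k reads in a window when λ reads are expected. The value of λ and the function names are hypothetical, and the plain summation is intended for small k only.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_upper_tail(k, lam):
    """P(X >= k) = 1 - sum of P(X = i) for i < k."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))


lam = 2.0  # hypothetical expected number of reads per window
for k in (2, 5, 10):
    print(k, poisson_upper_tail(k, lam))
```

The tail probability falls quickly as k grows, so windows with many more reads than expected receive small p-values.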

The binomial distribution is another good approach; it has two parameters:

p = probability that a read starts at a particular position
n = window size

so that np is the expected number of reads in a window. The probability function then takes the form

\[ P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n-k} \tag{2.2} \]
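The binomial and Poisson models give very similar probabilities when the window is large and the per-position probability is small, with λ = np. The short sketch below compares the two; the values of n and p are hypothetical.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for n independent positions, each starting a read with
    probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


n, p = 1000, 0.002  # hypothetical window size and per-position probability
for k in range(5):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```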

Figure 2.10: Poisson and Negative Binomial distribution.

The Poisson distribution, however, has a single parameter, which is uniquely determined by its mean; its variance and all other properties follow from it, and in particular the variance is equal to the mean. It has been noted [Robinson et al. 2007] that the assumption of a Poisson distribution is too restrictive, as it predicts smaller variation than is normally observed in the data under investigation. As a consequence, the resulting statistical test does not control the type-I error (the probability of false discoveries) as required. To address this so-called over-dispersion problem, it has been proposed to model count data with negative binomial (NB) distributions [Whitaker, 1914].

The negative binomial distribution has two parameters:

p = probability that a read starts at a particular position
r = number of successes

The NB distribution can have a large variance:

\[ \mathrm{Var}(X_{NB}) = \frac{\bar{X}}{1 - p} \tag{2.3} \]
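Over-dispersion can be checked directly on window counts by comparing the sample variance with the sample mean: under a Poisson model the ratio should be close to 1, whereas a much larger ratio suggests that a negative binomial model may be more appropriate. The sketch below is a simple diagnostic under that reasoning; the function name and the example counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Ratio of sample variance to sample mean of a list of counts.

    Values near 1 are consistent with a Poisson model; values well
    above 1 indicate over-dispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; the ratio is undefined")
    return variance(counts) / m


# Hypothetical read counts in ten windows, one of which is strongly enriched.
counts = [0, 1, 0, 2, 1, 0, 15, 1, 0, 0]
print(dispersion_index(counts))
```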

Depending on the underlying statistical model, a significance metric (e.g. p-value or q-value) is assigned to each putative peak.
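One common way to obtain q-values from a set of peak p-values is the Benjamini-Hochberg adjustment. The text does not state which procedure the individual tools use, so the sketch below should be read as one possible choice rather than the method of any particular peak caller; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values),
    in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


peak_p_values = [0.01, 0.04, 0.03, 0.5]  # hypothetical putative peaks
print(benjamini_hochberg(peak_p_values))
```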

In some experiments, enriched regions are compared with a control sample, for example one in which a non-specific antibody is used; in other cases, differential binding of a protein between two or more biological conditions is investigated.

Several packages are available for identifying and analysing enriched regions, each addressing different issues related to ChIP-Seq data analysis. PeakSeq [Rozowsky et al. 2009], Mosaics [Chung et al. 2014], MACS [Zhang et al. 2008], CisGenome [Ji et al. 2008] and enRich [Bao et al. 2015] are a few of these tools. The user needs to decide which one to use depending on the type of data in hand. Several reviews summarise the methods used by different tools and their strengths and weaknesses [Ma et al. 2011; Shin et al. 2013; Steinhauser et al. 2016]. Table 2.2 summarises the profiling techniques of some of these tools, along with their strengths and weaknesses.


Peak caller | Profiling of the count data | Peak selection | Joint modelling of two datasets together | Consideration of spatial dependency in adjacent windows
----------- | --------------------------- | -------------- | ---------------------------------------- | -------------------------------------------------------
CisGenome | Strand-specific window scan | Number of reads in window | No | No
MACS | Tag shifted then window scan | Number of reads in window | No | No
FindPeaks | Summation of overlapped tags | Height cut-off | No | No
PeakSeq | Extended tag aggregation | Local region binomial p-value | No | No
SICER | Sliding through windows and aggregating counts | Enrichment in relation to control | No | Yes
Mosaics | Window scan | Number of reads per window | No | Yes
enRich | Window scan | Number of reads per window | Yes | Yes

Table 2.2: Summary of some of the popular peak calling tools.

Downstream analysis

Once peaks have been detected, there are two common downstream analysis tasks: annotating the locations of the enriched regions with respect to genes, and discovering binding sequence motifs. Sequence motifs, the short recurring patterns in DNA, play an important role in the regulation of gene expression. Different proteins, and also RNA molecules, bind to these motifs to initiate gene expression. Several programs are available for motif discovery from ChIP-Seq data, for example MEME [Timothy et al. 2009], Weeder [Pavesi et al. 2004] and TAMO [Gordon et al. 2005]. These algorithms return the details of potential motifs along with their statistical significance. Several motif discovery tools designed specifically for ChIP-Seq data have been reviewed by Lihu et al. [2015].
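To make the notion of a sequence motif concrete, the sketch below builds a position count matrix from a set of already aligned binding sites, derives a consensus sequence and scans a peak sequence for exact matches. It is a toy illustration only: tools such as MEME, Weeder and TAMO discover motifs from unaligned sequences and assess them statistically, which is not attempted here. The function names and example sequences are hypothetical.

```python
from collections import Counter


def position_count_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]


def consensus(matrix):
    """Most frequent base at each position (ties resolved in ACGT order)."""
    return "".join(max("ACGT", key=lambda b: col[b]) for col in matrix)


def find_motif(sequence, motif):
    """0-based start positions of exact motif occurrences."""
    k = len(motif)
    return [i for i in range(len(sequence) - k + 1)
            if sequence[i:i + k] == motif]


# Hypothetical aligned binding sites taken from peak regions.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
matrix = position_count_matrix(sites)
motif = consensus(matrix)
print(motif)
print(find_motif("AATGACTCAGG", motif))
```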

The University of California Santa Cruz (UCSC) genome browser (genome.ucsc.edu) [Kent 2002] is a popular web-based application in which alignment data can be visualised as signal coverage. It also provides genomic annotations including genes (e.g. RefSeq, Ensembl), SNPs, evolutionary conservation, sequence properties and patterns (e.g. CpG islands, repeats), as well as tracks for regulatory elements (e.g. transcription factor binding sites, methylation) from the ENCODE consortium [Encode], an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). An analyst can thus interpret the peaks in the context of functionally relevant genomic regions. Other tools annotate peaks in relation to known genomic features, for example the transcriptional start site (TSS), exon/intron boundaries and the 3′ ends of genes. ChIP-Seq peak data can also be tested for enrichment in biological pathways, Gene Ontology terms and other types of gene sets.
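Annotating a peak relative to the nearest transcriptional start site can be sketched with a binary search over sorted TSS coordinates. The example below is a minimal illustration for a single chromosome; it ignores gene strand, and the function name and coordinates are hypothetical rather than taken from any annotation tool.

```python
import bisect


def nearest_tss(peak_summit, tss_positions):
    """Return (tss, distance) for the TSS closest to the peak summit.

    tss_positions must be sorted in ascending order. The distance is
    summit minus TSS, so it is negative when the summit lies before
    the TSS in genome coordinates.
    """
    if not tss_positions:
        raise ValueError("no TSS positions supplied")
    i = bisect.bisect_left(tss_positions, peak_summit)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = tss_positions[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda t: abs(peak_summit - t))
    return best, peak_summit - best


tss = [1000, 5000, 9000]  # hypothetical TSS coordinates on one chromosome
for summit in (5200, 100, 20000):
    print(summit, nearest_tss(summit, tss))
```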