
Chapter 2 Sequence bias in ChIP-seq experiments

2.2 Method

2.2.1 Summary of data sources †‡

For most of this investigation, data from the ChIP-seq input fragments were used in order to avoid possible sequence bias that might arise from using immunoprecipitated fragments where sequences are preferentially drawn from regions where the target protein binds (see Section 1.4.5).

ChIP-seq data will normally contain a mixture of fragments that originate in the nuclear DNA and the mitochondrial DNA. The mitochondrial genome is significantly smaller than the nuclear genome, but is present in a much higher copy number within the cell. The average fragment density $p$ is therefore different for the nuclear and mitochondrial DNA. If both sorts of DNA are used in the calculations, then in the definition of $Y_s$ in (2.4) the $p_s$ factor for such data is no longer simply a function of the number of each N-mer in the genome, but becomes a more complex function in which the mitochondrial and nuclear DNA are treated separately, with each component weighted by the relative concentrations of the two types of DNA. Rather than introducing this additional complexity into the definitions, the mitochondrial DNA has simply been excluded from the analysis.
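The exclusion of mitochondrial fragments can be illustrated with a short sketch. It is not the code used in this work. It assumes the alignments are held in a whitespace-delimited tagAlign/BED-style file whose first column is the chromosome name, and that mitochondrial reads carry one of the labels in the hypothetical set `MITO_NAMES` (for example `chrM` in UCSC assemblies). The function names are likewise illustrative.

```python
import gzip
from typing import Iterable, Iterator

# Assumed labels for the mitochondrial genome; adjust to the assembly in use.
MITO_NAMES = frozenset({"chrM", "chrMT", "M", "MT"})


def drop_mitochondrial(lines: Iterable[str],
                       mito_names: frozenset = MITO_NAMES) -> Iterator[str]:
    """Yield only alignment lines whose chromosome is not mitochondrial."""
    for line in lines:
        # Skip blank lines and any browser/track header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split(None, 1)[0]  # first whitespace-delimited field
        if chrom not in mito_names:
            yield line


def filter_tagalign(in_path: str, out_path: str) -> int:
    """Copy a gzipped tagAlign file without mitochondrial reads.

    Returns the number of alignment lines written.
    """
    kept = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in drop_mitochondrial(src):
            dst.write(line)
            kept += 1
    return kept
```

Filtering at the read level keeps the later definitions unchanged, since only nuclear N-mer counts then enter the calculation.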

The analysis was performed using 12 sets of input DNA from Homo sapiens ChIP-seq experiments conducted by the Myers/HudsonAlpha lab [49, 96], 11 sets of input DNA from the Homo sapiens ChIP-seq experiments conducted by the Yale/UCD/Harvard labs, 2 Homo sapiens datasets published as part of an investigation into the mapping of HATs and HDACs [99], a set of 4 input datasets from Caenorhabditis elegans ChIP-seq experiments [20] and a set of data from Arabidopsis thaliana produced at the University of Warwick.

The following provides more details of these data sources.

Data from the Myers/HudsonAlpha lab

Input fragment data from ChIP-seq experiments on various Homo sapiens cell lines and types which had been produced as part of the Encyclopaedia of DNA Elements (ENCODE) project. These were obtained from:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeHudsonalphaChipSeq/

| Lab version | File | Cell line | Protocol (a) | Treat | Antibody | Replicate | GC-rich (b) |
|---|---|---|---|---|---|---|---|
| SL116 | wgEncodeHudsonalphaChipSeqAlignmentsRep1Panc1Nrsf.tagAlign.gz | PANC1 | PCR2x | None | NRSF | 1 | Y |
| SL117 | wgEncodeHudsonalphaChipSeqAlignmentsRep1Panc1Control.tagAlign.gz | PANC1 | PCR2x | None | input | 1 | Y |
| SL522 | wgEncodeHudsonalphaChipSeqAlignmentsRep2Panc1Nrsf.tagAlign.gz | PANC1 | PCR2x | None | NRSF | 2 | N |
| SL523 | wgEncodeHudsonalphaChipSeqAlignmentsRep2Panc1Control.tagAlign.gz | PANC1 | PCR2x | None | input | 2 | N |
| SL102 | wgEncodeHudsonalphaChipSeqAlignmentsSknmcControl.tagAlign.gz | SK-N-MC | PCR1x | None | Input | 2 | Y |
| SL103 | wgEncodeHudsonalphaChipSeqAlignmentsRep1U87Control.tagAlign.gz | U87 | PCR2x | None | Input | 1 | Y |
| SL217 | wgEncodeHudsonalphaChipSeqAlignmentsRep1Gm12878ControlPcr2x.tagAlign.gz | GM12878 | PCR2x | None | input | 1 | Y |
| SL218 | wgEncodeHudsonalphaChipSeqAlignmentsRep2Gm12878ControlPcr2x.tagAlign.gz | GM12878 | PCR2x | None | input | 2 | Y |
| SL516 | wgEncodeHudsonalphaChipSeqAlignmentsRep1Gm12878ControlV2.tagAlign.gz | GM12878 | PCR1x | None | input | 1 | N |
| SL517 | wgEncodeHudsonalphaChipSeqAlignmentsRep2Gm12878ControlV2.tagAlign.gz | GM12878 | PCR1x | None | input | 1 | N |
| SL518 | wgEncodeHudsonalphaChipSeqAlignmentsRep1K562ControlV2.tagAlign.gz | K562 | PCR1x | None | input | 1 | N |

(a) The data were produced using two different amplification methods, as designated in the table:

PCR2x: Two rounds of amplification, 25 and 15 cycles

PCR1x: One round of amplification, 15 cycles

(b) “GC-rich” is an indication as to whether or not the bias at the fragment end conformed to the GC-rich pattern shown by SL117 or the more varied pattern similar to that shown by SL523 (Section 2.3.6).

The Myers lab used different sonication methods during the period covered by these experiments. The following excerpt from the protocol description used by the Myers/HudsonAlpha lab provides more details [78].

“Note: The Myers lab has used two different methods for sonicating chromatin. All of our experiments until Fall 2009 used a Sonics VibraCell sonicator, a relatively inexpensive approach that we fine-tuned to fragment the chromatin to a specific size range. After that time, we began using a Bioruptor sonicator, which is much easier (multiple samples can be sonicated at the same time) and cleaner (the samples are closed during the sonication treatment). The reagents used are the same, but the methods differ.”

Data from the Snyder/Yale lab

Input fragment data from ChIP-seq experiments on various H. sapiens cell lines and types produced as part of the ENCODE project obtained from:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeYaleChIPseq/

| Name | File | Cell | Treat | Antibody | Replicate |
|---|---|---|---|---|---|
| Y633-1 | wgEncodeYaleChIPseqAlignmentsRep1K562InputV3.tagAlign.gz | K562 | None | input | 1 |
| Y633-2 | wgEncodeYaleChIPseqAlignmentsRep2K562InputV3.tagAlign.gz | K562 | None | input | 2 |
| Y787-1 | wgEncodeYaleChIPseqAlignmentsRep1Helas3MouseiggV2.tagAlign.gz | HeLa-S3 | None | input | 1 |
| Y787-2 | wgEncodeYaleChIPseqAlignmentsRep2Helas3MouseiggV2.tagAlign.gz | HeLa-S3 | None | input | 2 |
| Y864-1 | wgEncodeYaleChIPseqAlignmentsRep1K562MusiggMusigg.tagAlign.gz | K562 | None | input | 1 |
| Y864-2 | wgEncodeYaleChIPseqAlignmentsRep2K562MusiggMusigg.tagAlign.gz | K562 | None | input | 2 |
| Y956-1 | wgEncodeYaleChIPseqAlignmentsRep1Gm12878MusiggMusigg.tagAlign.gz (GM12878_IgG_Control_tagAlign_rep1_FC30P42HM_20081212_s_6.) | GM12878 | None | input | 1 |
| Y1066-1 | wgEncodeYaleChIPseqAlignmentsRep1Hepg2ControlForskln.tagAlign.gz | HepG2 | forskolin | input | 1 |
| Y1066-2 | wgEncodeYaleChIPseqAlignmentsRep2Hepg2ControlForskln.tagAlign.gz | HepG2 | forskolin | input | 2 |
| Y1109-1 | wgEncodeYaleChIPseqAlignmentsRep1Gm12878InputIggrab.tagAlign.gz (GM12878_Rabbit_IgG_tagAlign_rep1_100106_ROCKFORD_FC600AF_s_4) | GM12878 | None | input | 1 |
| Y1109-2 | wgEncodeYaleChIPseqAlignmentsRep2Gm12878InputIggrab.tagAlign.gz | | | | |

Data previously analysed by Wang et al. [99]

Data on various H. sapiens cell lines and types obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database [6].

| GSM | File | Cell | Treat | Antibody | Replicate |
|---|---|---|---|---|---|
| GSM393947 | GSM393947_CD4-PCAF.bed.gz | CD4+ T cell | None | PCAF | 1 |
| GSM418301 | GSM418301_HeLa-siControl-H3K9ac-HDACi-0h.bed.gz | HeLa | None | None | - |

Data previously analysed by Cheung et al. [20]

Data from various C. elegans ChIP-seq experiments obtained from the GEO database [6]. Raw sequence data were extracted from each sra file and aligned to UCSC version 6 of the C. elegans genome, which is based on WormBase WS190, using the -m 1 option so that sequences that map to multiple locations are excluded.

| GSM | File | Strain | Stage |
|---|---|---|---|
| GSM706161 | SRR190662.sra | N2 | L3 |
| GSM706164 | SRR192330.sra | N2 | L3 |
| GSM727910 | SRR210889.sra | N2 | L3 |
| GSM727911 | SRR210890.sra | N2 | L3 |
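The effect of excluding sequences that map to multiple locations can be sketched as a post-hoc filter. The source does not name the aligner, and this is not how an aligner implements the option internally. The sketch only assumes a hypothetical list of candidate alignments, each a tuple of read identifier, chromosome, start and strand, and keeps the reads that have exactly one candidate location.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record layout: (read_id, chromosome, start, strand).
Alignment = Tuple[str, str, int, str]


def keep_unique_reads(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Return alignments for reads that map to exactly one location.

    Reads with two or more candidate locations are dropped entirely,
    rather than being assigned to one of their locations.
    """
    alignments = list(alignments)
    # Count how many candidate locations each read has.
    hits = Counter(read_id for read_id, *_ in alignments)
    return [aln for aln in alignments if hits[aln[0]] == 1]


candidates = [
    ("r1", "chrI", 100, "+"),
    ("r2", "chrI", 200, "-"),   # r2 also maps to chrV, so it is excluded
    ("r2", "chrV", 900, "+"),
    ("r3", "chrX", 50, "+"),
]
unique = keep_unique_reads(candidates)
```

Dropping ambiguous reads outright means that repetitive regions contribute no fragments, which should be borne in mind when N-mer counts for the genome are compared with counts from the aligned data.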

Arabidopsis thaliana input DNA

2.2.2 Definition of ChIP-seq sequence bias ‡