
Chapter 2 Sequence bias in ChIP-seq experiments

2.2 Method

2.2.1 Summary of data sources †‡

For most of this investigation, data from the ChIP-seq input fragments were used in order to avoid possible sequence bias that might arise from using immunoprecipitated fragments where sequences are preferentially drawn from regions where the target protein binds (see Section 1.4.5).

ChIP-seq data will normally contain a mixture of fragments that originate in the nuclear DNA and the mitochondrial DNA. The mitochondrial genome is significantly smaller than the nuclear genome, but is present in a much higher copy number within the cell. The average fragment density $p$ is therefore different for the nuclear and mitochondrial DNA. If both sorts of DNA are used in the calculations then, in the definition of $Y_s$ in (2.4), the $p_s$ factor for such data is no longer simply a function of the number of each N-mer in the genome, but is now a more complex function where the mitochondrial and nuclear DNA are treated separately, weighting each component by the relative concentrations of the two types of DNA. Rather than introducing this additional complexity into the definitions, the mitochondrial DNA has simply been excluded from the analysis.
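The exclusion of mitochondrial fragments can be illustrated with a minimal sketch. It assumes the alignments are stored as whitespace-separated, BED-like records with the chromosome name in the first column, and that the mitochondrial sequence is named `chrM`. The function names and the example records are hypothetical, and this is not necessarily how the filtering was carried out in the original analysis.

```python
import gzip
from typing import Iterable, Iterator

# Assumption: the mitochondrial chromosome is labelled "chrM".
# Other assemblies may use a different name, so this set is configurable.
MITO_NAMES = frozenset({"chrM"})


def drop_mitochondrial(lines: Iterable[str],
                       mito_names: frozenset = MITO_NAMES) -> Iterator[str]:
    """Yield only alignment records that do not lie on mitochondrial DNA."""
    for line in lines:
        # Skip blank lines and common header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split(None, 1)[0]  # first whitespace-separated field
        if chrom not in mito_names:
            yield line


def read_nuclear_fragments(path: str) -> Iterator[str]:
    """Stream nuclear-only records from a gzip-compressed alignment file."""
    with gzip.open(path, "rt") as handle:
        yield from drop_mitochondrial(handle)


# Small in-memory example (made-up records, not real data).
example = [
    "chr1\t100\t136\tACGT\t1000\t+\n",
    "chrM\t5\t41\tACGT\t1000\t-\n",
    "chr2\t200\t236\tACGT\t1000\t+\n",
]
kept = list(drop_mitochondrial(example))
```

Filtering on the chromosome name keeps the definitions unchanged: every fragment that remains comes from the nuclear genome, so a single fragment density applies.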

The analysis was performed using 12 sets of input DNA from Homo sapiens ChIP-seq experiments conducted by the Myers/HudsonAlpha lab [49, 96], 11 sets of input DNA from the Homo sapiens ChIP-seq experiments conducted by the Yale/UCD/Harvard labs, 2 Homo sapiens datasets published as part of an investigation into the mapping of HATs and HDACs [99], a set of 4 input datasets from Caenorhabditis elegans ChIP-seq experiments [20] and a set of data from Arabidopsis thaliana produced at the University of Warwick.

The following provides more details of these data sources.

Data from the Myers/HudsonAlpha lab

Input fragment data from ChIP-seq experiments on various Homo sapiens cell lines and types which had been produced as part of the Encyclopaedia of DNA Elements (ENCODE) project. These were obtained from:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeHudsonalphaChipSeq/

| Lab version | File | Cell line | Protocol (a) | Treat | Antibody | Replicate | GC-rich (b) |
|---|---|---|---|---|---|---|---|
| SL116 | wgEncodeHudsonalphaChipSeqAlignmentsRep1Panc1Nrsf.tagAlign.gz | PANC1 | PCR2x | None | NRSF | 1 | Y |
| SL117 | wgEncodeHudsonalphaChipSeqAlignmentsRep1Panc1Control.tagAlign.gz | PANC1 | PCR2x | None | input | 1 | Y |
| SL522 | wgEncodeHudsonalphaChipSeqAlignmentsRep2Panc1Nrsf.tagAlign.gz | PANC1 | PCR2x | None | NRSF | 2 | N |
| SL523 | wgEncodeHudsonalphaChipSeqAlignmentsRep2Panc1Control.tagAlign.gz | PANC1 | PCR2x | None | input | 2 | N |
| SL102 | wgEncodeHudsonalphaChipSeqAlignmentsSknmcControl.tagAlign.gz | SK-N-MC | PCR1x | None | Input | 2 | Y |
| SL103 | wgEncodeHudsonalphaChipSeqAlignmentsRep1U87Control.tagAlign.gz | U87 | PCR2x | None | Input | 1 | Y |
| SL217 | wgEncodeHudsonalphaChipSeqAlignmentsRep1Gm12878ControlPcr2x.tagAlign.gz | GM12878 | PCR2x | None | input | 1 | Y |
| SL218 | wgEncodeHudsonalphaChipSeqAlignmentsRep2Gm12878ControlPcr2x.tagAlign.gz | GM12878 | PCR2x | None | input | 2 | Y |
| SL516 | wgEncodeHudsonalphaChipSeqAlignmentsRep1Gm12878ControlV2.tagAlign.gz | GM12878 | PCR1x | None | input | 1 | N |
| SL517 | wgEncodeHudsonalphaChipSeqAlignmentsRep2Gm12878ControlV2.tagAlign.gz | GM12878 | PCR1x | None | input | 1 | N |
| SL518 | wgEncodeHudsonalphaChipSeqAlignmentsRep1K562ControlV2.tagAlign.gz | K562 | PCR1x | None | input | 1 | N |

(a) The data were produced using two different amplification methods, as designated in the table:

PCR2x: Two rounds of amplification, 25 and 15 cycles

PCR1x: One round of amplification, 15 cycles

(b) “GC-rich” is an indication as to whether or not the bias at the fragment end conformed to the GC-rich pattern shown by SL117 or the more varied pattern similar to that shown by SL523 (Section 2.3.6).

The Myers lab used different sonication methods during the period covered by these experiments. The following excerpt from the protocol description used by the Myers/HudsonAlpha lab provides more details [78].

“Note: The Myers lab has used two different methods for sonicating chromatin. All of our experiments until Fall 2009 used a Sonics VibraCell sonicator, a relatively inexpensive approach that we fine-tuned to fragment the chromatin to a specific size range. After that time, we began using a Bioruptor sonicator, which is much easier (multiple samples can be sonicated at the same time) and cleaner (the samples are closed during the sonication treatment). The reagents used are the same, but the methods differ.”

Data from the Snyder/Yale lab

Input fragment data from ChIP-seq experiments on various H. sapiens cell lines and types produced as part of the ENCODE project obtained from:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeYaleChIPseq/

| Name | File | Cell | Treat | Antibody | Replicate |
|---|---|---|---|---|---|
| Y633-1 | wgEncodeYaleChIPseqAlignmentsRep1K562InputV3.tagAlign.gz | K562 | None | input | 1 |
| Y633-2 | wgEncodeYaleChIPseqAlignmentsRep2K562InputV3.tagAlign.gz | K562 | None | input | 2 |
| Y787-1 | wgEncodeYaleChIPseqAlignmentsRep1Helas3MouseiggV2.tagAlign.gz | HeLa-S3 | None | input | 1 |
| Y787-2 | wgEncodeYaleChIPseqAlignmentsRep2Helas3MouseiggV2.tagAlign.gz | HeLa-S3 | None | input | 2 |
| Y864-1 | wgEncodeYaleChIPseqAlignmentsRep1K562MusiggMusigg.tagAlign.gz | K562 | None | input | 1 |
| Y864-2 | wgEncodeYaleChIPseqAlignmentsRep2K562MusiggMusigg.tagAlign.gz | K562 | None | input | 2 |
| Y956-1 | wgEncodeYaleChIPseqAlignmentsRep1Gm12878MusiggMusigg.tagAlign.gz (GM12878_IgG_Control_tagAlign_rep1_FC30P42HM_20081212_s_6.) | GM12878 | None | input | 1 |
| Y1066-1 | wgEncodeYaleChIPseqAlignmentsRep1Hepg2ControlForskln.tagAlign.gz | HepG2 | forskolin | input | 1 |
| Y1066-2 | wgEncodeYaleChIPseqAlignmentsRep2Hepg2ControlForskln.tagAlign.gz | HepG2 | forskolin | input | 2 |
| Y1109-1 | wgEncodeYaleChIPseqAlignmentsRep1Gm12878InputIggrab.tagAlign.gz (GM12878_Rabbit_IgG_tagAlign_rep1_100106_ROCKFORD_FC600AF_s_4) | GM12878 | None | input | 1 |
| Y1109-2 | wgEncodeYaleChIPseqAlignmentsRep2Gm12878InputIggrab.tagAlign.gz | | | | |

Data previously analysed by Wang et al [99]

Data on various H. sapiens cell lines and types obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database [6].

| GSM | File | Cell | Treat | Antibody | Replicate |
|---|---|---|---|---|---|
| GSM393947 | GSM393947_CD4-PCAF.bed.gz | CD4+ T cell | None | PCAF | 1 |
| GSM418301 | GSM418301_HeLa-siControl-H3K9ac-HDACi-0h.bed.gz | HeLa | None | None | - |

Data previously analysed by Cheung et al [20]

Data from various C. elegans ChIP-seq experiments obtained from the GEO database [6]. Raw sequence data were extracted from the sra files and aligned to UCSC version 6 of the C. elegans genome (based on WormBase WS190), using the -m 1 option so that sequences that map to multiple locations are excluded.

| GSM | File | Strain | Stage |
|---|---|---|---|
| GSM706161 | SRR190662.sra | N2 | L3 |
| GSM706164 | SRR192330.sra | N2 | L3 |
| GSM727910 | SRR210889.sra | N2 | L3 |
| GSM727911 | SRR210890.sra | N2 | L3 |
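The effect of excluding sequences that map to multiple locations can be sketched conceptually as follows. This is an illustration of the filtering rule only, not the aligner's own implementation (the text does not name the aligner). The record layout `(read_id, chrom, start, strand)`, the function name and the example alignments are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record layout: (read_id, chrom, start, strand).
Alignment = Tuple[str, str, int, str]


def keep_unique_reads(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep only alignments of reads that map to exactly one location."""
    alignments = list(alignments)
    # Count how many candidate locations were reported for each read.
    hits = Counter(read_id for read_id, *_ in alignments)
    # A read with more than one candidate location is dropped entirely.
    return [aln for aln in alignments if hits[aln[0]] == 1]


# Made-up candidate alignments: read2 maps to two locations.
candidates = [
    ("read1", "chrI", 1500, "+"),
    ("read2", "chrII", 320, "-"),
    ("read2", "chrV", 9100, "-"),
    ("read3", "chrX", 42, "+"),
]
unique = keep_unique_reads(candidates)
```

Dropping ambiguous reads, rather than assigning them to one location arbitrarily, means that each retained fragment end has a single well-defined genomic position.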

Arabidopsis thaliana input DNA

2.2.2 Definition of ChIP-seq sequence bias ‡

In document Informative sequence based models for fragment distributions in ChIP seq, RNA seq and ChIP chip data (Page 45-50)