The Segway annotation of ENCODE data
Michael M. Hoffman
Department of Genome Sciences
University of Washington
Overview
1. ENCODE Project
2. Semi-automated genomic annotation
3. Chromatin
Functional genomics
Chromatin immunoprecipitation
Park PJ 2009. Nat Rev Genet 10:669.
sequence
signal: Wiggler
• Extends tags in strand direction
• Extension length determined by cross-correlation peak
• Signal only in mappable regions
• 1-bp resolution
Anshul Kundaje. http://align2rawsignal.googlecode.com/
Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
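The bullets above can be sketched in a few lines of Python. This is a minimal illustration, not Wiggler's actual implementation: the function names, the `(position, strand)` read representation, and the use of an unnormalized cross-correlation are all assumptions made for the example, and mappability filtering is left out.

```python
from itertools import accumulate

def extended_read_signal(reads, read_len, ext_len, chrom_len):
    """Count extended reads per base (1-bp resolution).

    reads: iterable of (leftmost aligned position, strand) pairs, 0-based.
    Hypothetical representation chosen for this sketch.
    """
    diff = [0] * (chrom_len + 1)  # difference array: +1 at start, -1 past end
    for start, strand in reads:
        if strand == "+":
            lo, hi = start, start + ext_len   # extend rightward from the 5' end
        else:
            hi = start + read_len             # 5' end of a minus-strand read
            lo = hi - ext_len                 # extend leftward
        lo, hi = max(lo, 0), min(hi, chrom_len)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    return list(accumulate(diff[:-1]))

def cross_correlation_peak(plus_5p, minus_5p, max_shift):
    """Shift (in bp) that best aligns plus-strand and minus-strand 5' end counts.

    Uses a plain dot product as the cross-correlation score (an assumption).
    """
    n = len(plus_5p)

    def score(shift):
        return sum(plus_5p[i] * minus_5p[i + shift] for i in range(n - shift))

    return max(range(max_shift + 1), key=score)
```

In this sketch the shift returned by `cross_correlation_peak` would be passed as `ext_len` to `extended_read_signal`.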
Fine-scale data
[Figure: signal tracks (extended reads per base); scale bar 300 bp. Histone modifications: H3K4me2, H3K27me3. Transcription factors: Pol2b, Egr-1, GABP, Pol2 (Myers), Sin3Ak-20, TAF1.]
2685 data sets
Maher B 2012. Nature 489:46.
Overview
1. ENCODE Project
2. Semi-automated genomic annotation
3. Chromatin
4. RNA-seq
Semi-automated annotation
signal tracks
interpretation
visualization
annotation
pattern discovery
[Figure: example segment labels: 0 1 0 1 2 1 0 1 0 1 1]
Maximize similarity in labels
Bayesian network for ChIP-seq
• X_t: observed random variable (continuous): signal at position t
• Q_t: hidden random variable (discrete): transcription factor present at position t?
  0: transcription factor is not present
  1: transcription factor is present
• Conditional relationship: Q_t → X_t
• Emission probability parameters: µ0, σ0, µ1, σ1
$P(X_t \mid Q_t = 0) \sim N(\mu_0, \sigma_0)$
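A small sketch of what the emission model implies at a single position: given Gaussian parameters for each state, Bayes' rule turns an observed signal value into a probability that the transcription factor is present. The parameter values, the prior, and the function names are illustrative assumptions, and σ is treated as a standard deviation.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma a standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def posterior_present(x, params, prior_present=0.5):
    """P(Q_t = 1 | X_t = x) by Bayes' rule for a single position.

    params maps state (0 or 1) to its (mu, sigma) emission parameters.
    """
    like0 = normal_pdf(x, *params[0])
    like1 = normal_pdf(x, *params[1])
    numerator = like1 * prior_present
    return numerator / (numerator + like0 * (1 - prior_present))

# Hypothetical emission parameters, chosen only for illustration.
params = {0: (0.0, 1.0), 1: (4.0, 1.0)}
print(posterior_present(4.0, params))
```

With these made-up parameters, a signal halfway between the two means is equally compatible with both states, and a signal near µ1 strongly favors state 1.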
Bayesian network: 2 positions
• Hidden random variables (discrete): Q_t, Q_{t+1}
• Observed random variables (continuous): X_t, X_{t+1}
• Conditional relationships: Q_t → X_t, Q_{t+1} → X_{t+1}, Q_t → Q_{t+1}
• Emission probability parameters: µ0, σ0, µ1, σ1
• Transition probability parameters: 00, 01, 10, 11
$P(Q_{t+1} = 0 \mid Q_t = 0) = 0.99$
$P(Q_{t+1} = 1 \mid Q_t = 0) = 0.01$
$P(Q_{t+1} = 0 \mid Q_t = 1) = 0.01$
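The transition parameters can be written as a small table and checked numerically. In the sketch below, the entry for staying in state 1 is not stated on the slide; it is filled in as 0.99 only because each row must sum to 1. The helper names and the uniform initial distribution are assumptions for illustration.

```python
# Transition table: TRANSITION[previous_state][next_state].
# The 1 -> 1 entry is derived from normalization, not quoted from the slide.
TRANSITION = {
    0: {0: 0.99, 1: 0.01},
    1: {0: 0.01, 1: 1 - 0.01},
}

def expected_run_length(p_stay):
    """Mean number of consecutive positions spent in a state.

    With a fixed self-transition probability, run lengths are geometric.
    """
    return 1.0 / (1.0 - p_stay)

def path_probability(path, initial, transition=TRANSITION):
    """Probability of a sequence of hidden states, ignoring emissions."""
    prob = initial[path[0]]
    for prev, nxt in zip(path, path[1:]):
        prob *= transition[prev][nxt]
    return prob

print(expected_run_length(TRANSITION[0][0]))
print(path_probability([0, 0, 1, 1], initial={0: 0.5, 1: 0.5}))
```

A self-transition probability of 0.99 corresponds to runs of about 100 positions on average, which is why the hidden state changes only rarely along the genome.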
Dynamic Bayesian network (DBN)
• Hidden random variables (discrete): Q_t, Q_{t+1}, Q_{t+2}
• Observed random variables (continuous): X_t, X_{t+1}, X_{t+2}
• Emission probability parameters: µ0, σ0, µ1, σ1
• Transition probability parameters: 00, 01, 10, 11
[Figure: Q and X with parameters µ0, µ1 and 00, 01, 10, 11]
Dynamic BN for segmentation
• Hidden random variable: segment label
• Observed random variables: CTCF, H3K36me3, DNaseI
• Legend: transition probability parameter, emission probability parameter, hidden random variable, observed random variable, conditional relationship
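To make the chain structure concrete, here is a sketch of Viterbi decoding for the two-state, single-track case: it finds the most probable sequence of hidden states given a signal. Viterbi is one standard inference method for models of this shape; the slides do not say which algorithm is used, and the parameter values and names below are assumptions for illustration.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def viterbi(signal, emission, transition, initial):
    """Most probable hidden state path for a Gaussian-emission chain."""
    states = sorted(emission)
    score = {q: math.log(initial[q]) + log_normal_pdf(signal[0], *emission[q])
             for q in states}
    backpointers = []
    for x in signal[1:]:
        new_score, pointers = {}, {}
        for q in states:
            # Best previous state for arriving in state q at this position.
            prev = max(states, key=lambda p: score[p] + math.log(transition[p][q]))
            new_score[q] = (score[prev] + math.log(transition[prev][q])
                            + log_normal_pdf(x, *emission[q]))
            pointers[q] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda q: score[q])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical parameters for illustration only.
emission = {0: (0.0, 1.0), 1: (4.0, 1.0)}
transition = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.01, 1: 0.99}}
initial = {0: 0.5, 1: 0.5}
print(viterbi([0.1, -0.2, 0.3, 4.2, 3.9, 4.1, 0.0, 0.2], emission, transition, initial))
```

The transition probabilities act as a smoothing term: a sustained run of high signal changes the label, while the cost of switching twice discourages very short segments.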
Heterogeneous missing data
Handling missing data
• Hidden random variable: segment label
• Observed random variables: CTCF, H3K36me3, DNaseI
• Presence indicators: present(CTCF), present(H3K36me3), present(DNaseI)
• Parameters: transition probability parameters (00, 01, 10, 11); emission probability parameters (µ0, σ0, µ1, σ1)
• Legend: discrete, continuous; conditional, switching
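One way to read the presence indicators is that a track contributes to the likelihood only where it has data. The sketch below assumes that present(track) = 0 simply drops that track's Gaussian term, which is equivalent to marginalizing over the unobserved value; the function names, the use of `None` for missing values, and the parameter values are assumptions, not Segway's implementation.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def emission_log_likelihood(observations, label_params):
    """Log-likelihood of one position under one segment label.

    observations: track name -> signal value, or None where data are missing.
    label_params: track name -> (mu, sigma) for this segment label.
    """
    total = 0.0
    for track, value in observations.items():
        present = value is not None   # plays the role of present(track)
        if present:
            mu, sigma = label_params[track]
            total += log_normal_pdf(value, mu, sigma)
        # A missing track adds nothing: its term integrates to 1.
    return total

# Hypothetical parameters for a single segment label.
label_params = {"CTCF": (0.5, 1.0), "H3K36me3": (2.0, 1.5), "DNaseI": (1.0, 0.5)}
position = {"CTCF": 0.7, "H3K36me3": 1.8, "DNaseI": None}
print(emission_log_likelihood(position, label_params))
```

Because each position is scored only on the tracks it has, tracks with different coverage can be combined without imputing values.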
Length distribution
[Figure: DBN with variables segment label, CTCF, H3K36me3, DNaseI, present(CTCF), present(H3K36me3), present(DNaseI), segment countdown, segment transition, ruler, frame index]
• Minimum segment length
• Maximum segment length
• Trained geometric length distribution
• Dirichlet prior on segment length
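The effect of minimum and maximum segment lengths on a geometric length distribution can be illustrated with a truncated geometric. This is only a sketch of the resulting distribution under assumed parameters; it does not reproduce how the model enforces the limits, and the Dirichlet prior is not modeled here.

```python
def truncated_geometric_pmf(p_stay, min_len, max_len):
    """Segment length distribution with hard lower and upper bounds.

    p_stay: probability of extending the segment by one more position
    once the minimum length has been reached (hypothetical parameter).
    """
    weights = {n: p_stay ** (n - min_len) * (1 - p_stay)
               for n in range(min_len, max_len + 1)}
    total = sum(weights.values())   # renormalize after truncating at max_len
    return {n: w / total for n, w in weights.items()}

def mean_length(pmf):
    """Expected segment length under the given distribution."""
    return sum(n * p for n, p in pmf.items())

pmf = truncated_geometric_pmf(p_stay=0.99, min_len=10, max_len=1000)
print(mean_length(pmf))
```

Lengths below the minimum or above the maximum get zero probability, and within the allowed range shorter segments are always more probable than longer ones.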
Segway
A way to segment the genome
http://noble.gs.washington.edu/proj/segway/
Hoffman MM et al. 2012. Nat Methods 9:473.
Overview
1. ENCODE Project
2. Semi-automated genomic annotation
3. Chromatin
[Figure: cell lineage tree. Lineage nodes: embryoblast, mesendoderm, endoderm, mesoderm, intermediate mesoderm, lateral mesoderm, hemangioblast, blood vessel endothelium, hemocytoblast, myeloid progenitor, lymphoid progenitor, lymphoblast, liver, cervix. Cell types: H1 hESC (embryonic stem cell), HeLa-S3 (cervical carcinoma cell), HepG2 (hepatocellular carcinoma cell), HUVEC (umbilical vein endothelial cell), K562 (chronic myeloid leukemia cell), GM12878 (lymphoblastoid cell).]
49 tracks
• ENCODE K562: ChIP-seq, DNase-seq, FAIRE-seq
• 8 different labs
Input tracks
25 labels
[Figure: segment labels numbered 0 to 24]
Emission parameters
Each cell represents a Gaussian.
Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue.
Standard deviation is proportional to the length of the black bar.
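The row normalization described above can be sketched as min-max scaling of each track's means across labels. That interpretation is an assumption (the slide only says which values map to red and blue), as are the function name and the example numbers.

```python
def row_normalize(means):
    """Rescale each track's per-label means to the range [0, 1].

    means: track name -> list of Gaussian means, one per segment label.
    In the heat map, 0 would be drawn blue and 1 red.
    """
    normalized = {}
    for track, row in means.items():
        lo, hi = min(row), max(row)
        span = hi - lo
        # A constant row has no contrast; map it to the midpoint.
        normalized[track] = [(v - lo) / span if span else 0.5 for v in row]
    return normalized

# Hypothetical means for three labels on two tracks.
means = {"CTCF": [0.0, 2.0, 4.0], "H3K36me3": [1.0, 1.0, 1.0]}
print(row_normalize(means))
```

Normalizing per track makes colors comparable across labels within a row, but not between rows, since each track has its own scale.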
TSS: transcription start site; GS: gene start; GM: gene middle; GE: gene end; E: enhancer; I: insulator; R: repression; D: dead
Transcription start site (TSS)
Zooming out 10×
• TSS segments occur near 5' ends of genes
• TSS/G* segments missing in gene deserts
• R*/D* segments occur more in gene deserts
• 3' gene ends
Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
Jason Ernst
Lots of genes but very few TSS/GS segments. Why?
Because these genes are not expressed in K562.
A puzzling region
Experimental validation
http://switchgeargenomics.com/products/promoter-reporter-collection/
Testing <1000 bp sequences for promoter activity
• predicted + in K562
• predicted – in K562
• predicted + in GM12878
• predicted – in GM12878
Luciferase assay results
Comparison with GWAS catalog
Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
Bob Harris, Ross Hardison
Summary of results
Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables:
• A simple annotation with a single label for each part of the genome.
• Visualization reducing multivariate data to a comprehensible representation.
• Interpretation of the context and potential
Software availability
• Segway: data tracks → segmentation
  Hoffman MM et al. 2012. Nat Methods 9:473.
  http://noble.gs.washington.edu/proj/segway/
• Segtools: segmentation → plots and summary statistics
  Buske OJ et al. 2011. BMC Bioinformatics 12:415.
  http://noble.gs.washington.edu/proj/segtools/
• Genomedata: efficient access to numeric data anchored to genome
  Hoffman MM et al. 2010. Bioinformatics 26:1458.
Acknowledgments
University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group.
University of Massachusetts Medical School: Zhiping Weng.
SwitchGear Genomics: Patrick Collins.
Stanford University: Anshul Kundaje.
Pennsylvania State University: Ross Hardison, Bob Harris.
European Bioinformatics Institute: Ewan Birney, Ian Dunham.
University of California, Santa Cruz: Kate Rosenbloom, Brian Raney.
Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis.
CRG: Sarah Djebali.
RIKEN: Timo Lassmann.
ENCODE Project Consortium.
NIH/NHGRI: K99HG006259, U54HG004695.