The Segway annotation of ENCODE data

(1)

Michael M. Hoffman

Department of Genome Sciences

University of Washington

(2)

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

(3)

Functional genomics

(4)

Chromatin immunoprecipitation

Park PJ 2009. Nat Rev Genet 10:669.

(5)

(6)

sequence



signal: Wiggler

• Extends tags in strand direction

• Extension length determined by cross-correlation peak

• Signal only in mappable regions

• 1-bp resolution

Anshul Kundaje

http://align2rawsignal.googlecode.com/

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
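The tag-extension idea on this slide can be illustrated with a short sketch. This is not Wiggler's implementation: the function names, the `(position, strand)` tag format, and the unnormalized cross-correlation are assumptions made for illustration only.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Tag = Tuple[int, str]  # hypothetical format: (5' position, strand "+" or "-")


def cross_correlation_peak(tags: Sequence[Tag], max_shift: int) -> int:
    """Return the shift that best aligns + strand 5' ends with - strand 5' ends."""
    plus = Counter(pos for pos, strand in tags if strand == "+")
    minus = Counter(pos for pos, strand in tags if strand == "-")
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Unnormalized cross-correlation of the two strand profiles.
        score = sum(n * minus.get(pos + shift, 0) for pos, n in plus.items())
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


def extended_read_signal(
    tags: Sequence[Tag],
    region_length: int,
    extension: int,
    mappable: Optional[Sequence[bool]] = None,
) -> List[Optional[int]]:
    """Count extended reads per base over [0, region_length).

    A "+" tag covers [pos, pos + extension); a "-" tag covers
    (pos - extension, pos]. Unmappable positions get None (no signal).
    """
    # Difference array: +1 where a fragment starts, -1 just past its end.
    diff = [0] * (region_length + 1)
    for pos, strand in tags:
        if strand == "+":
            start, end = pos, pos + extension
        else:
            start, end = pos - extension + 1, pos + 1
        start, end = max(start, 0), min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    signal: List[Optional[int]] = []
    depth = 0
    for i in range(region_length):
        depth += diff[i]
        is_mappable = mappable is None or mappable[i]
        signal.append(depth if is_mappable else None)
    return signal
```

The difference array keeps the computation linear in the region length plus the number of tags, while still yielding one value per base.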

(7)

Fine-scale data

Figure: signal tracks (extended reads per base) over a 300 bp window. Histone modifications: H3K4me2, H3K27me3. Transcription factors: Pol2b, Egr-1, GABP, Pol2 (Myers), Sin3Ak-20, TAF1.

(8)

Maher B 2012. Nature 489:46.

2685

(9)

Maher B 2012. Nature 489:46.

2685 data sets

(10)

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

(11)

Semi-automated annotation

signal tracks

interpretation

visualization

annotation

pattern discovery

(12)

(13)

(14)

(15)

Figure: labels 0, 1, 0, 1, 2, 1.

(16)

Maximize similarity in labels

Figure: labels 0, 1, 0, 1.

(17)

Bayesian network for ChIP-seq

$X_t$: observed random variable, the signal at position $t$

(18)

$Q_t$: hidden random variable. Is the transcription factor present at position $t$?

0: transcription factor is not present

1: transcription factor is present

$X_t$: observed random variable, the signal at position $t$

(19)

$Q_t$: hidden random variable (discrete). Is the TF present at position $t$?

$X_t$: observed random variable (continuous), the signal at position $t$

$\mu_0, \sigma_0, \mu_1, \sigma_1$: emission probability parameters

Conditional relationship: $X_t$ depends on $Q_t$

$$P(X_t \mid Q_t = 0) \sim \mathcal{N}(\mu_0, \sigma_0)$$
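As a minimal sketch of how the Gaussian emission model could be used at a single position, the code below applies Bayes' rule to turn a signal value into a probability that the factor is present. The parameter values, the prior, and the function names are hypothetical; each σ is treated as a standard deviation.

```python
import math
from typing import Tuple

GaussianParams = Tuple[float, float]  # (mean, standard deviation)


def gaussian_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_present(
    x: float,
    params: Tuple[GaussianParams, GaussianParams],
    prior_present: float,
) -> float:
    """P(Q_t = 1 | X_t = x) for one position, ignoring neighbors."""
    (mu0, sd0), (mu1, sd1) = params
    weight0 = gaussian_pdf(x, mu0, sd0) * (1.0 - prior_present)
    weight1 = gaussian_pdf(x, mu1, sd1) * prior_present
    return weight1 / (weight0 + weight1)


# Hypothetical parameters: background near 0, bound signal near 4.
EXAMPLE_PARAMS = ((0.0, 1.0), (4.0, 1.0))
```

With these illustrative values, a signal halfway between the two means is equally likely under either state, and values closer to µ₁ favor "present".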

(20)

Bayesian network: 2 positions

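Figure: the network at two adjacent positions. The discrete variables $Q_t$ and $Q_{t+1}$ each have a continuous child, $X_t$ and $X_{t+1}$, with the same emission parameters $\mu_0, \sigma_0, \mu_1, \sigma_1$.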

(21)

Bayesian network: 2 positions

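Figure: the network at two adjacent positions. The discrete variables $Q_t$ and $Q_{t+1}$ each have a continuous child, $X_t$ and $X_{t+1}$, with emission parameters $\mu_0, \sigma_0, \mu_1, \sigma_1$. The edge from $Q_t$ to $Q_{t+1}$ carries the transition probability parameters 00, 01, 10, 11.

$$P(Q_{t+1} = 0 \mid Q_t = 0) = 0.99$$

$$P(Q_{t+1} = 1 \mid Q_t = 0) = 0.01$$

$$P(Q_{t+1} = 0 \mid Q_t = 1) = 0.01$$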
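The transition probabilities above can be collected into a small table, sketched below. The fourth entry, P(Q_{t+1} = 1 | Q_t = 1) = 0.99, does not appear in the text here; it is filled in only because each row must sum to 1. The names are hypothetical.

```python
from typing import List

# Rows: current label Q_t; columns: next label Q_{t+1}.
TRANSITION: List[List[float]] = [
    [0.99, 0.01],
    [0.01, 0.99],  # 0.99 follows from normalization, not from the slide text
]


def expected_segment_length(stay_probability: float) -> float:
    """Mean run length when a label repeats with the given probability.

    Run lengths are geometric, so the mean is 1 / (1 - stay_probability).
    """
    return 1.0 / (1.0 - stay_probability)
```

A self-transition probability of 0.99 therefore implies segments about 100 positions long on average, which is how such a model favors contiguous runs of one label.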

(22)

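Dynamic Bayesian network (DBN)

Figure: the network unrolled over positions $t$, $t+1$, $t+2$. Each discrete $Q$ has a continuous child $X$ with emission parameters $\mu_0, \sigma_0, \mu_1, \sigma_1$, and consecutive $Q$ variables are linked by the transition probability parameters 00, 01, 10, 11. A compact template shows a single $Q$ and $X$ with $\mu_0, \mu_1$ and 00, 01, 10, 11.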
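For a chain of this kind, the most probable label sequence can be found with the Viterbi algorithm. The sketch below is a generic two-state version with Gaussian emissions, written as an illustration under assumed parameters; it is not the inference code Segway uses, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_gaussian(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def viterbi(
    signal: Sequence[float],
    emissions: Sequence[Tuple[float, float]],
    transition: Sequence[Sequence[float]],
    initial: Sequence[float],
) -> List[int]:
    """Return the most probable label for each position of the signal."""
    if not signal:
        return []
    states = range(len(emissions))
    score = [
        math.log(initial[q]) + log_gaussian(signal[0], *emissions[q])
        for q in states
    ]
    backpointers: List[List[int]] = []
    for x in signal[1:]:
        new_score, pointers = [], []
        for q in states:
            # Best previous label for arriving in label q at this position.
            prev = max(states, key=lambda p: score[p] + math.log(transition[p][q]))
            pointers.append(prev)
            new_score.append(
                score[prev]
                + math.log(transition[prev][q])
                + log_gaussian(x, *emissions[q])
            )
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final label.
    state = max(states, key=lambda q: score[q])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Because each label change costs a factor of 0.01 under the transition probabilities shown earlier, an isolated high value tends to be absorbed into the surrounding label, while a sustained run of high values is assigned its own segment.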

(23)

Dynamic BN for segmentation

Figure: hidden segment label with observed tracks CTCF, H3K36me3 and DNaseI.

Legend: transition probability parameter, emission probability parameter, hidden random variable.

(24)

Heterogeneous missing data

(25)

Handling missing data

Figure: segment label with the DNaseI track, emission parameters µ₀, σ₀, µ₁, σ₁ and transition parameters 00, 01, 10, 11.

Legend: conditional observed random variable; discrete, continuous, switching.

(26)

Figure: segment label with observed tracks CTCF, H3K36me3 and DNaseI, each paired with an indicator: present(CTCF), present(H3K36me3), present(DNaseI).

Legend: conditional observed random variable; discrete, continuous, switching.
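One way to read the present(track) indicators is that a track contributes to the likelihood only where it has data. The sketch below illustrates that idea with `None` standing in for a missing value; the data layout, parameter values, and function names are assumptions for illustration, not Segway's internals.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def log_gaussian(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def position_log_likelihood(
    values: Dict[str, Optional[float]],
    label: int,
    emissions: Dict[str, Sequence[Tuple[float, float]]],
) -> float:
    """Log-likelihood of one position's tracks given a segment label.

    emissions[track][label] is (mean, sd). A value of None means the
    track is not present at this position, so it is skipped.
    """
    total = 0.0
    for track, x in values.items():
        if x is None:  # present(track) = 0: the observation is ignored
            continue
        mean, sd = emissions[track][label]
        total += log_gaussian(x, mean, sd)
    return total


# Hypothetical emission parameters for two labels.
EXAMPLE_EMISSIONS = {
    "CTCF": [(0.0, 1.0), (3.0, 1.0)],
    "H3K36me3": [(0.0, 1.0), (1.0, 2.0)],
    "DNaseI": [(0.0, 1.0), (5.0, 1.5)],
}
```

Skipping a missing track leaves the likelihood of the remaining tracks unchanged, so positions with different subsets of available data can be scored by the same model.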

(27)

Figure: segment label with observed tracks CTCF, H3K36me3 and DNaseI and indicators present(CTCF), present(H3K36me3), present(DNaseI).

Length distribution

(28)

Figure: segment label with observed tracks CTCF, H3K36me3 and DNaseI and indicators present(CTCF), present(H3K36me3), present(DNaseI), plus segment countdown, segment transition, ruler and frame index variables.

Length distribution

• Minimum segment length

• Maximum segment length

• Trained geometric length distribution

• Dirichlet prior on segment length
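The effect of minimum and maximum segment lengths combined with a geometric length distribution can be illustrated by sampling label sequences. The sketch below is a simplified stand-in that tracks the current run length directly; it does not reproduce the countdown and ruler variables, it omits the Dirichlet prior, and its names and parameter values are hypothetical.

```python
import random
from typing import List


def sample_labels(
    n_positions: int,
    n_labels: int,
    stay_probability: float,
    min_length: int,
    max_length: int,
    seed: int = 0,
) -> List[int]:
    """Sample labels whose segments obey hard length bounds.

    Assumes n_labels >= 2 and 1 <= min_length <= max_length.
    """
    rng = random.Random(seed)
    label = rng.randrange(n_labels)
    run_length = 0
    labels: List[int] = []
    for _ in range(n_positions):
        if run_length > 0:
            if run_length < min_length:
                switch = False  # too short to end the segment yet
            elif run_length >= max_length:
                switch = True  # hard upper bound reached
            else:
                switch = rng.random() >= stay_probability  # geometric tail
            if switch:
                label = rng.choice([q for q in range(n_labels) if q != label])
                run_length = 0
        labels.append(label)
        run_length += 1
    return labels


def run_lengths(labels: List[int]) -> List[int]:
    """Lengths of consecutive runs of identical labels."""
    lengths: List[int] = []
    previous = None
    for label in labels:
        if lengths and label == previous:
            lengths[-1] += 1
        else:
            lengths.append(1)
        previous = label
    return lengths
```

Every sampled segment except possibly the last, which may be cut off by the end of the sequence, falls within the stated bounds.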

(29)

Segway

A way to segment the genome

http://noble.gs.washington.edu/proj/segway/

Hoffman MM et al. 2012. Nat Methods 9:473.

(30)

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

(31)

Figure: cell lineage tree. Nodes: embryoblast, endoderm, mesoderm, intermediate mesoderm, lateral mesoderm, hemangioblast, blood vessel endothelium, hemocytoblast, mesendoderm, myeloid progenitor, lymphoid progenitor, liver, cervix, lymphoblast.

• H1 hESC: embryonic stem cell

• HeLa-S3: cervical carcinoma cell

• HepG2: hepatocellular carcinoma cell

• HUVEC: umbilical vein endothelial cell

• K562: chronic myeloid leukemia cell

• GM12878: lymphoblastoid cell

(32)

Input tracks

49 tracks

• ENCODE K562

  • ChIP-seq

  • DNase-seq

  • FAIRE-seq

• 8 different labs

(33)

25 labels

Figure: labels 0 to 24.

(34)

Emission parameters

Figure: emission parameters for labels 0 to 24.

Each cell represents a Gaussian.

Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue.

Standard deviation is proportional to the length of the black bar.
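The row normalization described above can be sketched as a per-track min-max rescaling, so that each track's lowest mean maps to 0 (blue) and its highest to 1 (red). The function name and the handling of a constant row are assumptions for illustration.

```python
from typing import List, Sequence


def row_normalize(means: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale each row (one track across all labels) to the range [0, 1]."""
    normalized: List[List[float]] = []
    for row in means:
        low, high = min(row), max(row)
        span = high - low
        if span > 0:
            normalized.append([(value - low) / span for value in row])
        else:
            # Constant row: no contrast to show, so use the midpoint.
            normalized.append([0.5 for _ in row])
    return normalized
```

Normalizing within each track makes the color scale comparable across tracks whose raw signal ranges differ widely.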

(35)

• TSS: transcription start site

• GS: gene start

• GM: gene middle

• GE: gene end

• E: enhancer

• I: insulator

• R: repression

• D: dead

(36)

Transcription start site (TSS)

(37)

(38)

Zooming out 10×

• TSS segments occur near 5’ ends of genes

• TSS/G* segments missing in gene deserts

• R*/D* segments occur more in gene deserts

(39)

3' gene ends

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Jason Ernst

(40)

A puzzling region

Lots of genes but very few TSS/GS segments. Why?

Because these genes are not expressed in K562.

(41)

Experimental validation

http://switchgeargenomics.com/products/promoter-reporter-collection/

Testing <1000 bp sequences for promoter activity

• predicted + in K562

• predicted – in K562

• predicted + in GM12878

• predicted – in GM12878

(42)

Luciferase assay results

(43)

Comparison with GWAS catalog

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Bob Harris, Ross Hardison

(44)

Summary of results

Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables:

• A simple annotation with a single label for each part of the genome.

• Visualization reducing multivariate data to a comprehensible representation.

• Interpretation of the context and potential

(45)

Software availability

• Segway

  • data tracks

  • segmentation

  • Hoffman MM et al. 2012. Nat Methods 9:473.

  • http://noble.gs.washington.edu/proj/segway/

• Segtools

  • segmentation

  • plots and summary statistics

  • Buske OJ et al. 2011. BMC Bioinformatics 12:415.

  • http://noble.gs.washington.edu/proj/segtools/

• Genomedata

  • efficient access to numeric data anchored to genome

  • Hoffman MM et al. 2010. Bioinformatics 26:1458.

(46)

Acknowledgments

University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group.

University of Massachusetts Medical School: Zhiping Weng.

SwitchGear Genomics: Patrick Collins.

Stanford University: Anshul Kundaje.

Pennsylvania State University: Ross Hardison, Bob Harris.

European Bioinformatics Institute: Ewan Birney, Ian Dunham.

University of California, Santa Cruz: Kate Rosenbloom, Brian Raney.

Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis.

CRG: Sarah Djebali.

RIKEN: Timo Lassmann.

ENCODE Project Consortium.

NIH/NHGRI: K99HG006259, U54HG004695.

http://noble.gs.washington.edu/proj/segway/

http://noble.gs.washington.edu/proj/segtools/