
(1)

The Segway annotation of ENCODE data

Michael M. Hoffman

Department of Genome Sciences

University of Washington

(2)

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

(3)

Functional genomics

(4)

Chromatin immunoprecipitation

Park PJ 2009. Nat Rev Genet 10:669.

(5)
(6)

sequence

signal: Wiggler

• Extends tags in strand direction

• Extension length determined by cross-correlation peak

• Signal only in mappable regions

• 1-bp resolution
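
The extension step in the bullets above can be sketched roughly as follows. This is a minimal illustration, not Wiggler's actual implementation: the function name `extend_tags`, the `(position, strand)` tag representation, and the fixed `ext_len` argument are assumptions made for the example. Wiggler estimates the extension length from the cross-correlation peak, which is not shown here.

```python
def extend_tags(tags, chrom_len, ext_len, mappable=None):
    """Count extended reads per base at 1-bp resolution.

    tags: iterable of (five_prime_pos, strand), 0-based positions,
          strand "+" or "-" (hypothetical representation).
    ext_len: extension length in bp (assumed to be given).
    mappable: optional list of booleans; unmappable bases become None.
    """
    signal = [0] * chrom_len
    for pos, strand in tags:
        if strand == "+":
            start, end = pos, pos + ext_len          # extend toward higher coordinates
        else:
            start, end = pos - ext_len + 1, pos + 1  # extend toward lower coordinates
        for i in range(max(start, 0), min(end, chrom_len)):
            signal[i] += 1
    if mappable is not None:
        # Report signal only in mappable regions.
        signal = [s if m else None for s, m in zip(signal, mappable)]
    return signal


signal = extend_tags([(2, "+"), (6, "-")], chrom_len=10, ext_len=3)
print(signal)
```

Marking unmappable bases as `None` rather than 0 keeps "no data" distinct from "no signal".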

Anshul Kundaje http://align2rawsignal.googlecode.com/

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

(7)

Fine-scale data

[Figure: signal tracks (extended reads per base) over a 300 bp window. Histone modifications: H3K4me2, H3K27me3. Transcription factors: Pol2b, Egr-1, GABP, Pol2 (Myers), Sin3Ak-20, TAF1.]

(8)

Maher B 2012. Nature 489:46.

2685

(9)

Maher B 2012. Nature 489:46.

2685 data sets

(10)

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

(11)

Semi-automated annotation

signal tracks

interpretation

visualization

annotation

pattern discovery

(12)
(13)
(14)
(15)


(16)


Maximize similarity in labels

(17)

Bayesian network for ChIP-seq

X_t: observed random variable, the signal at position t

(18)

Bayesian network for ChIP-seq

Q_t: hidden random variable. Is the transcription factor present at position t?

• 0: transcription factor is not present

• 1: transcription factor is present

X_t: observed random variable, the signal at position t

(19)

Bayesian network for ChIP-seq

Q_t: hidden random variable (discrete). Is the TF present at position t?

X_t: observed random variable (continuous), the signal at position t

Q_t → X_t: conditional relationship

Emission probability parameters: µ0, σ0, µ1, σ1

$P(X_t \mid Q_t = 0) \sim \mathcal{N}(\mu_0, \sigma_0)$
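
As a rough sketch, the Gaussian emission model on this slide could be evaluated as below. The parameter values are illustrative placeholders, not values from the talk, and treating σ as a standard deviation is an assumption of this example.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


# Hypothetical emission parameters (mu, sigma) for each value of Q_t.
EMISSION = {
    0: (0.0, 1.0),  # Q_t = 0: TF not present -> (mu0, sigma0)
    1: (5.0, 2.0),  # Q_t = 1: TF present     -> (mu1, sigma1)
}


def emission_logprob(x_t, q_t):
    """log P(X_t = x_t | Q_t = q_t) under the Gaussian emission model."""
    mu, sigma = EMISSION[q_t]
    return gaussian_logpdf(x_t, mu, sigma)


# A strong signal is better explained by the "TF present" state.
print(emission_logprob(5.0, 1) > emission_logprob(5.0, 0))
```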

(20)

Bayesian network: 2 positions

[Diagram: hidden random variables Q_t and Q_t+1 (discrete), each linked by a conditional relationship to an observed random variable, X_t and X_t+1 (continuous). Each observed variable has emission probability parameters µ0, σ0, µ1, σ1.]
(21)

Bayesian network: 2 positions

[Diagram: hidden random variables Q_t and Q_t+1 (discrete), each linked by a conditional relationship to an observed random variable, X_t and X_t+1 (continuous), with emission probability parameters µ0, σ0, µ1, σ1. The edge Q_t → Q_t+1 carries the transition probability parameters 00, 01, 10, 11.]

$P(Q_{t+1} = 0 \mid Q_t = 0) = 0.99$

$P(Q_{t+1} = 1 \mid Q_t = 0) = 0.01$

$P(Q_{t+1} = 0 \mid Q_t = 1) = 0.01$
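
To illustrate how these transition probabilities combine with Gaussian emissions, here is a small Viterbi decoder for the two-state model. It is a sketch under stated assumptions, not Segway's inference code: the emission parameters and the uniform initial distribution are hypothetical, and the fourth transition entry, P(Q_t+1 = 1 | Q_t = 1) = 0.99, is not on the slide but follows from each row summing to 1.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


# TRANS[p][q] = P(Q_{t+1} = q | Q_t = p); three entries are from the slide,
# the fourth (1 -> 1) is implied by normalization.
TRANS = [[0.99, 0.01],
         [0.01, 0.99]]
EMISSION = [(0.0, 1.0), (5.0, 1.0)]  # hypothetical (mu, sigma) per state
INITIAL = [0.5, 0.5]                 # hypothetical initial distribution


def viterbi(signal):
    """Return the most probable hidden state sequence for the observed signal."""
    n_states = len(TRANS)
    score = [math.log(INITIAL[q]) + gaussian_logpdf(signal[0], *EMISSION[q])
             for q in range(n_states)]
    back = []
    for x in signal[1:]:
        new_score, pointers = [], []
        for q in range(n_states):
            # Best predecessor state for landing in state q at this position.
            prev = max(range(n_states),
                       key=lambda p: score[p] + math.log(TRANS[p][q]))
            pointers.append(prev)
            new_score.append(score[prev] + math.log(TRANS[prev][q])
                             + gaussian_logpdf(x, *EMISSION[q]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    q = max(range(n_states), key=lambda s: score[s])
    path = [q]
    for pointers in reversed(back):
        q = pointers[q]
        path.append(q)
    return path[::-1]


print(viterbi([0.1, -0.2, 0.3, 5.1, 4.8, 5.2, 0.0, 0.1]))
```

Because switching states is penalized by the 0.01 transition probability, a single mildly elevated position is not enough to open a new segment; a sustained run of high signal is.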

(22)

Dynamic Bayesian network (DBN)

[Diagram: the network unrolled over positions t, t+1, and t+2, with hidden random variables Q_t, Q_t+1, Q_t+2 (discrete) and observed random variables X_t, X_t+1, X_t+2 (continuous). Emission probability parameters µ0, σ0, µ1, σ1 and transition probability parameters 00, 01, 10, 11 appear at each position, alongside a compact form with a single Q and X.]
(23)

Dynamic BN for segmentation

[Diagram: a hidden segment label at each position with observed tracks CTCF, H3K36me3, and DNaseI. Legend: transition probability parameter, emission probability parameter, hidden random variable, observed random variable, conditional relationship.]

(24)

Heterogeneous missing data

(25)

Handling missing data

[Diagram: hidden segment label and observed DNaseI track, with emission probability parameters µ0, σ0, µ1, σ1 and transition probability parameters 00, 01, 10, 11. Legend: hidden random variable, observed random variable, conditional relationship; discrete, continuous, switching.]

(26)

[Diagram: hidden segment label with observed tracks CTCF, H3K36me3, and DNaseI, plus the indicator variables present(CTCF), present(H3K36me3), and present(DNaseI). Legend: transition probability parameter, emission probability parameter, hidden random variable, observed random variable, conditional relationship; discrete, continuous, switching.]
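
One way to read the present(track) indicators is that they switch a track's emission term on or off at each position. The sketch below illustrates that idea for a single position. The parameter values, the two-label setup, and the use of `None` for missing data are assumptions of this example, and treating tracks as independent Gaussians given the segment label is a simplification made here for illustration.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


# Hypothetical (mu, sigma) per track for two segment labels.
PARAMS = {
    0: {"CTCF": (0.0, 1.0), "H3K36me3": (0.0, 1.0), "DNaseI": (0.0, 1.0)},
    1: {"CTCF": (4.0, 1.0), "H3K36me3": (0.5, 1.0), "DNaseI": (3.0, 1.0)},
}


def position_loglik(observations, label):
    """Log-likelihood of one position's tracks given a segment label.

    observations maps track name -> value, or None where data are missing.
    A None value plays the role of present(track) = 0: the emission term
    for that track is switched off and contributes nothing.
    """
    total = 0.0
    for track, value in observations.items():
        if value is None:
            continue  # present(track) = 0
        mu, sigma = PARAMS[label][track]
        total += gaussian_logpdf(value, mu, sigma)
    return total


obs = {"CTCF": 4.0, "H3K36me3": None, "DNaseI": None}
print(position_loglik(obs, 1) > position_loglik(obs, 0))
```

Skipping missing values this way avoids imputing a number, such as 0, that the model would otherwise treat as a real observation.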

(27)

Length distribution

[Diagram: segment label with observed tracks CTCF, H3K36me3, and DNaseI and the indicator variables present(CTCF), present(H3K36me3), and present(DNaseI).]

(28)

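Length distribution

[Diagram: segment label with observed tracks CTCF, H3K36me3, and DNaseI and the indicator variables present(CTCF), present(H3K36me3), and present(DNaseI), extended with segment countdown, segment transition, ruler, and frame index variables.]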

• Minimum segment length

• Maximum segment length

• Trained geometric length distribution

• Dirichlet prior on segment length
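
The geometric length distribution in the list above can be made concrete with a short calculation. If a label persists from one position to the next with probability p, the segment length L satisfies P(L = k) = p^(k-1) (1 - p), with mean 1 / (1 - p). The sketch below computes this, plus a version restricted to a minimum and maximum length by renormalization. The renormalization is only an illustrative choice; Segway's own mechanism for enforcing length limits is not reproduced here.

```python
def geometric_length_pmf(k, p_stay):
    """P(segment length = k) when each step stays in the segment with probability p_stay."""
    if k < 1:
        return 0.0
    return p_stay ** (k - 1) * (1 - p_stay)


def expected_length(p_stay):
    """Mean of the geometric length distribution."""
    return 1.0 / (1.0 - p_stay)


def bounded_length_pmf(k, p_stay, min_len, max_len):
    """Geometric pmf restricted to min_len..max_len and renormalized (illustrative)."""
    if not min_len <= k <= max_len:
        return 0.0
    norm = sum(geometric_length_pmf(j, p_stay) for j in range(min_len, max_len + 1))
    return geometric_length_pmf(k, p_stay) / norm


# With a self-transition probability of 0.99, segments average about 100 positions.
print(expected_length(0.99))
```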

(29)

Segway

A way to segment the genome

http://noble.gs.washington.edu/proj/segway/ Hoffman MM et al. 2012. Nat Methods 9:473.

(30)

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

(31)

[Figure: cell lineage tree with nodes embryoblast, endoderm, mesoderm, intermediate mesoderm, lateral mesoderm, mesendoderm, hemangioblast, blood vessel endothelium, hemocytoblast, myeloid progenitor, lymphoid progenitor, lymphoblast, liver, and cervix.]

• H1 hESC: embryonic stem cell

• HeLa-S3: cervical carcinoma cell

• HepG2: hepatocellular carcinoma cell

• HUVEC: umbilical vein endothelial cell

• K562: chronic myeloid leukemia cell

• GM12878: lymphoblastoid cell

(32)

Input tracks

49 tracks

• ENCODE K562: ChIP-seq, DNase-seq, FAIRE-seq

• 8 different labs

(33)

25 labels


(34)

Emission parameters

Each cell represents a Gaussian.

Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue.

Standard deviation is proportional to the length of the black bar.
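
The row normalization described above can be sketched as a min-max rescaling of each track's means across labels. The linear scaling and the function name `row_normalize` are assumptions of this example; the slide states only that the highest mean for a track maps to red and the lowest to blue.

```python
def row_normalize(means):
    """Rescale each row (one track's means across labels) to the range [0, 1].

    0 corresponds to the lowest mean for that track (blue) and
    1 to the highest (red). A constant row maps to 0.5.
    """
    normalized = []
    for row in means:
        lo, hi = min(row), max(row)
        span = hi - lo
        normalized.append([(m - lo) / span if span else 0.5 for m in row])
    return normalized


# Two hypothetical tracks with very different scales become comparable.
print(row_normalize([[1.0, 3.0, 2.0], [100.0, 300.0, 200.0]]))
```

Normalizing within each track keeps a track with large absolute values from dominating the color scale.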

(35)

• TSS: transcription start site

• GS: gene start

• GM: gene middle

• GE: gene end

• E: enhancer

• I: insulator

• R: repression

• D: dead

(36)

Transcription start site (TSS)

(37)
(38)

Zooming out 10×

• TSS segments occur near 5’ ends of genes

• TSS/G* segments missing in gene deserts

• R*/D* segments occur more in gene deserts

(39)

3' gene ends

Hoffman MM et al. 2013. Nucleic Acids Res 41:827. Jason Ernst

(40)

A puzzling region

Lots of genes but very few TSS/GS segments. Why?

Because these genes are not expressed in K562.

(41)

Experimental validation

http://switchgeargenomics.com/products/promoter-reporter-collection/

Testing <1000 bp sequences for promoter activity

• predicted + in K562

• predicted – in K562

• predicted + in GM12878

• predicted – in GM12878

(42)

Luciferase assay results

(43)

Comparison with GWAS catalog

Hoffman MM et al. 2013. Nucleic Acids Res 41:827. Bob Harris, Ross Hardison

(44)

Summary of results

Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables:

• A simple annotation with a single label for each part of the genome.

• Visualization reducing multivariate data to a comprehensible representation.

• Interpretation of the context and potential

(45)

Software availability

• Segway: data tracks → segmentation

Hoffman MM et al. 2012. Nat Methods 9:473.

http://noble.gs.washington.edu/proj/segway/

• Segtools: segmentation → plots and summary statistics

Buske OJ et al. 2011. BMC Bioinformatics 12:415.

http://noble.gs.washington.edu/proj/segtools/

• Genomedata: efficient access to numeric data anchored to genome

Hoffman MM et al. 2010. Bioinformatics 26:1458.

(46)

Acknowledgments

University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group.

University of Massachusetts Medical School: Zhiping Weng.

SwitchGear Genomics: Patrick Collins.

Stanford University: Anshul Kundaje.

Pennsylvania State University: Ross Hardison, Bob Harris.

European Bioinformatics Institute: Ewan Birney, Ian Dunham.

University of California, Santa Cruz: Kate Rosenbloom, Brian Raney.

Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis.

CRG: Sarah Djebali.

RIKEN: Timo Lassmann.

ENCODE Project Consortium.

NIH/NHGRI: K99HG006259, U54HG004695.

http://noble.gs.washington.edu/proj/segway/

http://noble.gs.washington.edu/proj/segtools/
