8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

Copyright © 2011 Partek Incorporated. All rights reserved.

Experimental Design & Intro to NGS Data Analysis

Ryan Peters

Field Application Specialist

Partek, Incorporated

Agenda

Experimental Design Examples

ANOVA

What assays are possible?

NGS Analytical Process

Alignment of NGS Data

Challenges of NGS Analysis

Partek Flow Demonstration

Examples

Shoe Example

Breast Cancer Example

Rat Example (Experimental Design)

The Role of Experimental Design

The goal of statistics is to find signals in a sea of noise.

The goal of experimental design is to reduce that noise so true biological signals can be found with as small a sample size as possible.

Partek Shoe Example

Question: Do shoes affect height?

Hypothesis: Yes, shoes affect height.

Assay: Measure the height of 10 people with & without shoes. (Change only one variable.)

Sample Size: 10 people @ Partek (5 male, 5 female)

Analysis: Use a “two sample” t-test to see if there is a difference between the means of two groups: with shoes and without shoes.

A simple t-test does not have the power to correctly identify this pattern, because it assumes multiple samples from the same individual are independent when they are not.

Conclusion: No “statistically significant difference” in height due to shoes.

p = 0.51

Fold-change = 1.02

Paired t-Test

The paired t-test provides substantially more statistical power by removing person-to-person differences from the noise.

p(Shoes) = 1e-5

p(Person) = 2e-9
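The difference between the two tests can be sketched in a few lines of standard-library Python. The heights below are hypothetical (they are not the measurements behind the slides), and only the t statistics are computed, not p-values. Under those assumptions, the two-sample statistic is swamped by person-to-person variation while the paired statistic is not.

```python
import math
import statistics

# Hypothetical heights in cm for 10 people (NOT the data from the slides).
without_shoes = [158.0, 162.5, 165.0, 168.0, 170.5, 174.0, 177.5, 180.0, 183.0, 188.5]
shoe_gain = [2.0, 3.5, 2.5, 4.0, 3.0, 2.0, 2.5, 1.5, 2.0, 2.5]
with_shoes = [h + g for h, g in zip(without_shoes, shoe_gain)]


def two_sample_t(a, b):
    """Equal-variance two-sample t statistic (treats all values as independent)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def paired_t(a, b):
    """Paired t statistic: a one-sample t-test on the per-person differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


t_unpaired = two_sample_t(with_shoes, without_shoes)
t_paired = paired_t(with_shoes, without_shoes)
print(f"two-sample t = {t_unpaired:.2f} (18 degrees of freedom)")
print(f"paired t     = {t_paired:.2f} (9 degrees of freedom)")
```

The paired version divides the mean gain by the spread of the gains only, which is why differences between people no longer count as noise.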

Introducing Gender

Once person is known, gender is already known; thus the p-value for Shoe remains unchanged.

We get the estimate of gender effect for free!

p(Shoes) = 1e-5

p(Gender) = .04

p(Person) = 2e-9

It appears (p = .04) that men (at Partek) are significantly taller than women.

Explore Gender/Shoe Interaction

Do shoes have the same effect on men & women?

p(Shoes) = 1e-8

p(Gender) = .04

p(Person) = 2e-12

p(Shoe*Gender) = 7e-5

Wow! Shoes affect women’s height more than men’s!

Also note that the p-values for the shoe effect are even smaller because we explained more “noise”.

Breast Cancer Example

Example of Large Batch Effect

Example Data, GEO Experiment GSE848

° Control (E2) Plus Drug Treatment of Breast Cancer Cells

° 5 Treatments x 3 Time Points x 2 replicates

° “Biological replicates” were processed in 2 batches

Treatments: Control, Estrogen (E2), E2 + ICI, E2 + Raloxifene, E2 + Tamoxifen

Replicates per populated cell of the design table:

0 hr: 2

8 hr: 2, 2, 2, 2

48 hr: 2, 2, 2, 2

As Seen Using PCA

As Seen Using Hierarchical Clustering

What is Analysis of Variance?

Analysis (Source: m-w.com)

° Etymology: New Latin, from Greek, from analyein, to break up

° separation of a whole into its component parts

Analysis of Variance (“ANOVA”)

° a technique that partitions the variance in data into separate components or “factors”

[Figure: pie chart of variance components with slices 58.36%, 1.64%, 17.40%, 1.15%, 17.49%; legible labels: Treatment, Time]
Good News! – Balanced Experimental Design

• The treatments were perfectly balanced with the batches, so batch can be included as a blocking factor in ANOVA, and the batch effect (noise) can be removed from the data.

• In terms of p-values for this gene, the difference is dramatic. With a simple 2-way ANOVA, this gene was #228 on the gene list and would not pass multiple test correction for significance. With a 3-way ANOVA including batch, it was #2 on the gene list.

p-values by factor:

Factor            2-way ANOVA    3-way ANOVA
Treatment         0.00391497     3.43275E-07
Time              0.396031       0.00964938
Treatment*Time    0.100862       3.56752E-05

#2 Most Significant Gene

[Figure: expression of the gene by processing day (Monday vs. Tuesday). Median A = 8.5, Median B = 9.7. Tue vs. Mon: more than 2-fold difference.]

ANOVA Partitions Variability

Total variance is partitioned into variability due to influencing factors and the rest is assumed to be due to random error (noise).

R² = 81% for 2-way ANOVA
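As a rough illustration of the partitioning idea, the sketch below splits the total sum of squares for a single factor into an explained part and an error part and reports R². The expression values and group names are hypothetical, and a one-way layout is used for brevity rather than the 2-way or 3-way models discussed above.

```python
import statistics

# Hypothetical log2 expression values for one gene in three groups (illustrative only).
groups = {
    "control": [8.1, 8.4, 8.2, 8.5],
    "drug_a": [9.0, 9.3, 9.1, 9.4],
    "drug_b": [8.6, 8.9, 8.7, 9.0],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = statistics.mean(all_values)

# Total variability: squared distance of every observation from the grand mean.
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

# Variability explained by the factor: group means versus the grand mean.
ss_factor = sum(len(vals) * (statistics.mean(vals) - grand_mean) ** 2
                for vals in groups.values())

# Unexplained variability (noise): observations versus their own group mean.
ss_error = sum((v - statistics.mean(vals)) ** 2
               for vals in groups.values() for v in vals)

r_squared = ss_factor / ss_total
print(f"SS total = {ss_total:.3f}, SS factor = {ss_factor:.3f}, SS error = {ss_error:.3f}")
print(f"R^2 = {r_squared:.1%}")
```

Adding a further factor such as batch moves part of the error sum of squares into an explained component, which is what shrinks the noise term in the F tests.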

Batch Effect Remover

Before Batch Removal

After Batch Removal

• For visualization purposes only!

• Factors you would normally add for ANOVA

• How do we account for batch without Partek Batch Remover?

Building Blocks of Experimental Design

No Randomization

Completely Randomized

° Subjects randomly assigned to treatment groups

Randomized Block

° Subjects randomly assigned to treatment groups within similar “blocks” (e.g. gender, litters)

° Requires a priori knowledge of differences between the blocks
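A randomized block assignment can be sketched as follows. The function name, the litter labels, and the rat identifiers are all hypothetical; the only idea taken from the slides is that subjects are first grouped into blocks and treatments are then assigned at random within each block.

```python
import random


def randomized_block_assignment(blocks, treatments, seed=None):
    """Randomly assign treatments within each block, keeping every block balanced.

    blocks: dict mapping a block label (e.g. litter) to a list of subject ids.
    Each block size must be a multiple of the number of treatments.
    """
    rng = random.Random(seed)
    assignment = {}
    for block, subjects in blocks.items():
        if len(subjects) % len(treatments) != 0:
            raise ValueError(f"block {block!r} cannot be balanced")
        # Equal number of each treatment label, shuffled within the block.
        labels = list(treatments) * (len(subjects) // len(treatments))
        rng.shuffle(labels)
        for subject, label in zip(subjects, labels):
            assignment[subject] = label
    return assignment


# Hypothetical example: 8 rats from 2 litters, 4 control and 4 treated overall.
litters = {
    "litter_1": ["rat1", "rat2", "rat3", "rat4"],
    "litter_2": ["rat5", "rat6", "rat7", "rat8"],
}
plan = randomized_block_assignment(litters, ["control", "treated"], seed=1)
for rat, treatment in plan.items():
    print(rat, treatment)
```

Because every litter receives both treatments in equal numbers, a litter effect cannot masquerade as a treatment effect.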

Simplest Design: Not Randomized

8 Male Rats

4 Control

4 Treated

Stripe coated rats are “faster” or “more alert”.

Completely Randomized

8 Male Rats

4 Control

4 Treated

A Better Approach…

“Randomized Block Design”

First divide into blocks, then randomly assign to

Randomized Block Design

8 Male Rats

4 Control

4 Treated

Technical Blocks in Microarray Experiments

Litter is an example of a “biological” block

Examples of Technical/Processing Blocks:

° RNA Isolation Batch

° Hybridization Batch

° Operator

° As well as (although less so):

° Wash and Stain Batch

° Reagent, Cocktail Batches

° Chip Lot

In Summary

“Block what you can and randomize what you cannot.” – Box, Hunter, & Hunter (1978)

Blocking ensures that the differences in treatment cannot possibly be due to the blocking factor

Blocking completely eliminates noise due to blocks

Randomization gives approximate balance across

Analysis of Variance

Also Known As:

° ANOVA

° ANCOVA

° Linear Model

° Mixed Linear Model

Invented in 1900, 1908, 1923 – Still remains the most commonly used statistical method to analyze clinical trials!

Simple ANOVA: Student’s t-test

t and F Statistics

Fun fact

° An equal-variance t-test is mathematically equivalent to a 1-way ANOVA with two groups (F = t²).
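The equivalence can be checked numerically: for two groups, the square of the pooled-variance t statistic equals the 1-way ANOVA F statistic. The sketch below uses hypothetical measurements and only the Python standard library.

```python
import math
import statistics

# Hypothetical measurements for two groups (illustrative only).
group_a = [5.1, 4.8, 5.6, 5.3, 4.9]
group_b = [6.0, 5.7, 6.4, 5.9, 6.2]

n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)

# Equal-variance (pooled) two-sample t statistic.
pooled_var = ((n_a - 1) * statistics.variance(group_a)
              + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
t_stat = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# 1-way ANOVA F statistic for the same two groups.
grand_mean = statistics.mean(group_a + group_b)
ss_between = n_a * (mean_a - grand_mean) ** 2 + n_b * (mean_b - grand_mean) ** 2
ss_within = (sum((x - mean_a) ** 2 for x in group_a)
             + sum((x - mean_b) ** 2 for x in group_b))
df_between = 1                  # number of groups - 1
df_within = n_a + n_b - 2       # observations - number of groups
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"t^2 = {t_stat ** 2:.6f}")
print(f"F   = {f_stat:.6f}")
```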

Assumptions of ANOVA

Data is Normally distributed (bell shaped) within different treatment groups

Ensure data is log transformed

Variance is equal within different treatment groups

Design balanced experiments

Sample groups are independent

Don’t make the shoe mistake

*Replicates are required to get a p-value

Random vs Fixed Effects

If the experiment were to be performed again, would the same levels of the factor be used?

° Yes - Fixed effect (e.g. gender, dose, time, dye)

° No - Random effect (e.g. hyb batch, wash batch, litter, subject)

Why do I have to worry about this?

° In general, treating a random effect as a fixed effect will produce an over-optimistic p-value, leading to a false discovery.

What Factors Belong in the Model?

Obviously, the factors of interest to the researcher

° e.g. strain, time, strain*time

Any factor needed to account for dependence of samples (don’t violate assumption of independence!)

° e.g. donor

Any additional blocking factors for noise reduction

Partek Expression Philosophy

Use PCA to aid in quality control & sample grouping

Use ANOVA to detect significantly expressed genes. Fold change is interesting for ranking, but not a great primary filtering metric

Incorporate as much phenotypic and experimental design information into the ANOVA model as possible. Measure the experimental technical components.*

Make sense of gene lists through functional groups

How NOT to Run/Ruin Your Next Experiment!

Samples are frequently “organized” by treatment groups.

Samples are then processed in batches corresponding to treatment groups.

But please do NOT process your control samples on Monday, and then process your treated samples on Tuesday.

You will confound these two variables. ANOVA is powerful but not magical.
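A sample sheet can be checked for this kind of complete confounding before any samples are processed. The helper below is a minimal sketch with a hypothetical name and hypothetical sample sheets; it only detects the extreme case where no batch contains more than one treatment, and it does not measure partial imbalance.

```python
from collections import defaultdict


def batch_confounds_treatment(samples):
    """Return True if no batch contains more than one treatment.

    samples: iterable of (batch, treatment) pairs, one per sample.
    In that situation no within-batch comparison is possible, so batch
    and treatment effects cannot be separated by any model.
    """
    treatments_by_batch = defaultdict(set)
    for batch, treatment in samples:
        treatments_by_batch[batch].add(treatment)
    return all(len(found) == 1 for found in treatments_by_batch.values())


# Hypothetical sample sheets.
confounded = [("Monday", "control")] * 4 + [("Tuesday", "treated")] * 4
balanced = [("Monday", "control"), ("Monday", "control"),
            ("Monday", "treated"), ("Monday", "treated"),
            ("Tuesday", "control"), ("Tuesday", "control"),
            ("Tuesday", "treated"), ("Tuesday", "treated")]

print(batch_confounds_treatment(confounded))
print(batch_confounds_treatment(balanced))
```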

Summary – Experimental Design & Analysis

Understand how separating variables in your analysis is critical to your success

Design balanced experiments.

Let p-values rank your data, but don’t be a slave to FDR.

What kinds of assays are possible?

DNA-Seq

Copy Number

SNP

Structural variants

Whole genome sequencing

Metagenomics

Targeted/Amplicon Sequencing

RNA-Seq Transcriptome

Differential Gene Expression

Alternative Splicing

SNP detection

Indel detection

Novel exons/genes

miRNA-Seq

identify regulatory (non-coding) RNAs

ChIP-Seq

Transcription Factor binding sites

Methylation sites

Histone modifications

RIP-Seq (RNA-binding proteins)

NGS Analysis Phases

[Figure, modified from Strand Life Sciences: the three analysis phases. Primary Analysis: control software produces a data file (reads + quality), e.g. FASTQ. Secondary Analysis: an aligner (Bowtie/BIOSCOPE/BWA, etc.) takes the data file (reads + quality) and produces reads aligned to the genome. Tertiary Analysis: starts from reads aligned to the genome (FASTQ, BAM, …).]

Platforms: Illumina HiSeq, SOLiD, Roche 454, Ion Torrent, PacBio

Intuitive Visualizations

Publication

QC & Exploratory Analysis

Powerful Statistics

Genome Alignment

Sequencing

Biological Interpretation

Integrated Genomics

NGS Analytical Process

Read               Chromosome   Position     Strand
GAGGTTGCAGTTTG     chr1         243919543    R
ACTGCTCCGCCTCA     chr16        49094914     F
GAATAAAAAATCCA     chr13        55882620     F
CGTCCTTCACCCTCT    chr13        110085165    R
CCTTAAGGAAAGGA     chr18        72273046     F
CAGCTAGGGTTGCC     chr2         120786940    R
CTGCTGGTGCTGCG     chr10        73237323     F
Comprehensive Analysis of NGS Data

RNA-Seq

Methylation Seq

ChIP-Seq

SmallRNA-Seq

DNA-Seq

Read Types for NGS

Single End Reads

Paired-end Reads

Junction Reads

Multiple Aligned reads

Strand-specific reads

Paired End & Single End Reads

Single End

Paired End

DNA Space

DNA Space

chr5

chr2

Multiple aligned

Junction Reads

Derived from transcripts, some RNA-Seq reads will read through splice junctions (single end or paired-end)

They will not align well to genomic reference since the two ends are many nucleotides apart (separated by the intronic region)

DNA Space

Next Gen File Formats

Unaligned: FASTA, FASTQ, SCARF, QSEQ, SRA, RAW, TXT, others

Aligned: SAM, BAM, Vendor Specific Formats/Color Space

Variant Call File (VCF, BCF) – SNPs, indels


Alignment Tools

ELAND

BFAST

Bowtie

TMAP

BWA

TopHat

SOAP

Etc.

What to expect?

• File size depends on read length, read type

• 4 GB single lane (~100 million reads) – Bowtie w/ 8 cores = 20 to 25 minutes; reference genome; read length = 33 bp (older)

• TopHat – same file – 1 day

• Read length x number of reads x 8 = file size (FASTA; double for FASTQ)

• BAM file ~ 3-4x smaller than unaligned file
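The rule of thumb above can be turned into a small calculator. The sketch assumes that the product "read length x number of reads x 8" is a size in bits (one byte per base), which is an interpretation rather than something the slide states; the function name is hypothetical.

```python
def estimate_unaligned_size_gb(read_length, n_reads, fastq=False):
    """Rule-of-thumb file size in gigabytes.

    Assumption: read_length * n_reads * 8 is a size in bits,
    i.e. one byte per sequenced base; headers are ignored.
    """
    size_bits = read_length * n_reads * 8
    if fastq:
        size_bits *= 2  # FASTQ also stores one quality character per base
    return size_bits / 8 / 1e9


# The lane described above: ~100 million reads of 33 bp.
fasta_gb = estimate_unaligned_size_gb(33, 100_000_000)
fastq_gb = estimate_unaligned_size_gb(33, 100_000_000, fastq=True)
print(f"FASTA: ~{fasta_gb:.1f} GB, FASTQ: ~{fastq_gb:.1f} GB")
```

Under this reading the estimate lands in the same range as the 4 GB lane quoted in the list, which is the main reason for choosing bits over bytes here.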

Laptop

Cloud

Cluster

FASTQ Format (Unaligned Reads)

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line 1) begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).

Line 2) is the raw sequence letters (ACGT).

Line 3) begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.

Line 4) encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126.
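The four-line layout and the Sanger quality encoding can be illustrated with a short parser. This is a minimal sketch that handles one record at a time and assumes Sanger (Phred+33) encoding; the function name is hypothetical and the record is the example shown above.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record and decode Sanger (Phred+33) qualities."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    # Sanger encoding: Phred score = ASCII code of the character minus 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
name, seq, scores = parse_fastq_record(record)
print(name, len(seq), "bases; lowest score", min(scores), "highest score", max(scores))
```

A drop in the decoded scores toward one end of the reads is the usual signal for trimming.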

What is Alignment?

Read comes off a sequencing machine

Goal: Determine where on the genome that read belongs

Method: Match sequence of read to sequence from a

reference genome

Result: Genomic location of read
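The matching step can be sketched with a naive exact-match search. Real aligners such as Bowtie or BWA use indexed data structures and tolerate mismatches and gaps, so this is only an illustration of the goal; the function name and both sequences are hypothetical.

```python
def find_exact_matches(reference, read):
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base later so overlapping matches are also reported.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical reference and read (illustrative only).
reference = "GGCATGGTCATTCATGGTCAGG"
read = "ATGGTCA"
print(find_exact_matches(reference, read))
```

A read that matches at more than one position, as in this example, is what the read-type slides call a multiple aligned read.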

[Figures: a short read matched base by base against a reference genome sequence]

DNA Space

RNA Space

Gene/Transcript

Exon junction

Align junction reads

1) Align to genome – gapped alignment is time-expensive

- breaks up the read into pieces (25-mers)

SAM Format (Aligned reads)

Sequence Alignment/MAP (SAM) format is TAB-delimited

BAM is binary SAM

[Figure: annotated SAM file. Labels shown: header line (reference sequence); read id; bitwise flag; reference genome; position; quality score; CIGAR (M: match, I: insertion, D: deletion); reference name of mate; position of mate; length; sequence; quality; optional]
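The labelled fields correspond to the eleven mandatory TAB-delimited columns of a SAM alignment line, followed by optional tags. The sketch below splits one line into those columns using the column names from the SAM specification; the class name, function name, and the example alignment line are hypothetical.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str      # read id
    flag: int       # bitwise flag
    rname: str      # reference sequence name
    pos: int        # 1-based leftmost position on the reference
    mapq: int       # mapping quality score
    cigar: str      # CIGAR string (M match, I insertion, D deletion, ...)
    rnext: str      # reference name of the mate ('=' means same as rname)
    pnext: int      # position of the mate
    tlen: int       # observed template length
    seq: str        # read sequence
    qual: str       # per-base quality string
    optional: list  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    """Split one SAM alignment line (not a '@' header line) into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], fields[11:])


# Hypothetical alignment line (illustrative only).
example = "read1\t99\tchr1\t100\t60\t14M\t=\t250\t164\tGAGGTTGCAGTTTG\tIIIIIIIIIIIIII\tNM:i:0"
sam_record = parse_sam_line(example)
print(sam_record.rname, sam_record.pos, sam_record.cigar, sam_record.optional)
```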

Explain flags

http://picard.sourceforge.net/explain-flags.html
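What such a flag explainer does can be sketched in a few lines: the flag is a sum of powers of two, and each set bit has a meaning. The bit values below follow the SAM specification; the descriptions are paraphrased, and the function name is hypothetical.

```python
# Bit values follow the SAM specification; descriptions are paraphrased.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag):
    """List the meaning of every bit that is set in a SAM bitwise flag."""
    return [meaning for bit, meaning in SAM_FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
for meaning in explain_flag(99):
    print(meaning)
```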

VCF Format (Variant Call Format)

The Variant Call Format (VCF) is a TAB-delimited format in which each data line consists of the following fields:

Chromosome, position, variant id, reference/alternative alleles, quality, information (read depth), event, sample id (optional), format (optional)
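A VCF data line can be split into named fields in the same way. The column names and their order below (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then an optional FORMAT column and per-sample columns) follow the VCF specification rather than the wording of the slide; the function name and the example line are hypothetical.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line (not a '#' header line) into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_FIXED_COLUMNS):
        raise ValueError("a VCF data line needs at least 8 fields")
    record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternative alleles are allowed
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    record["INFO"] = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in record["INFO"].split(";")
    )
    record["FORMAT"] = fields[8].split(":") if len(fields) > 8 else []
    record["SAMPLES"] = fields[9:]
    return record


# Hypothetical variant line: an A>G SNP with read depth 20 (illustrative only).
example_vcf = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20\tGT\t0/1"
variant = parse_vcf_line(example_vcf)
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"], variant["INFO"])
```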

Partek Flow

Web based Application

Cloud, Desktop, Server

Chrome, Firefox, Safari

Access from any terminal, smartphone

Project centric

Protocols

Collaborate with others

Current release 1.0 / 2.1 beta

Alignment, QA/QC, GSA

Export results to PGS

Coming soon – SGE,

Challenge: Data volume is a bottleneck

Help, I’m drowning in data!

How do I handle all this data?

Solution: Schedule Tasks

• Schedule & queue tasks

• Emails you when tasks are complete

• Keep your hardware running 24/7/365

Challenge: The quality of the data will affect the alignment

How do I determine data quality?

Do I have outliers?

Can I move forward with my analysis?

Do I need to trim/filter my reads?

Solution: Pre & Post Alignment QA/QC

• Group and individual QA/QC for excluding outliers

• Quality score per read/position

• Look for drop in quality scores

• Make intelligent decisions for trimming/filtering adaptors, barcodes, low quality reads

Challenge: Alignment

Different people, different parameters will result in different alignment.

Which aligner to use? Some aligners have more than 50 different options. How do I know what to set?

What options do I choose for RNA-Seq, ChIP-Seq, DNA-Seq, miRNA-Seq, MeDip-Seq?

What options do I choose for the different read types? Junction reads? Paired-End reads? Multiple Aligned reads?

Solution: Multiple aligners with recommended defaults

Vendor-specific default options

Automatic download of reference genomes

Assay-specific default options (RNA-Seq, ChIP-Seq, DNA-Seq)

Advanced options also available through GUI (no command line)

Challenge: How do I keep track of my samples?

Which samples are Tumor? Control? Age? Sex/Gender?

How am I ever going to keep track of this clinical information?

Solution: Advanced Sample Management

• Manage files associated with sample throughout life of project

• Keep track of reference genome

• Controlled vocabulary

• SNOMED List

• In-place editing of sample info

Plug-in for Torrent Suite

• Performs QA/QC within Torrent Suite

• Uploads data to Partek Flow

Perform QA/QC within Torrent Suite and seamlessly upload data to Partek® Flow for Comprehensive Data Analysis and Visualization

Comprehensive Solution for RNA-Seq

[Figure: workflow stages: Alignment, Mapping QC, Statistics, Visualization, Integrated Genomics, Biological Interpretation]

Acknowledgements

References
