Copyright © 2011 Partek Incorporated. All rights reserved.
Experimental Design &
Intro to NGS Data Analysis
Ryan Peters
Field Application Specialist
Partek, Incorporated
Agenda
•
Experimental Design Examples
•
ANOVA
•
What assays are possible?
•
NGS Analytical Process
•
Alignment of NGS Data
•
Challenges of NGS Analysis
•
Partek Flow Demonstration
Examples
•
Shoe Example
•
Breast Cancer Example
•
Rat Example (Experimental Design)
The Role of Experimental Design
•
The goal of statistics is to find signals in a sea of noise
•
The goal of experimental design is to reduce that noise so true
biological signals can be found with as small a sample size as possible
Partek Shoe Example
•
Question: Do shoes affect height?
•
Hypothesis: Yes, shoes affect height.
•
Assay: Measure the height of 10 people
with & without shoes. (Change only one
variable.)
•
Sample Size: 10 people @ Partek (5
male, 5 female)
•
Analysis: Use a “two sample” t-test to
see if there is a difference between the
mean of two groups: with shoes and
without shoes.
A simple t-test does not have the power to correctly identify this pattern,
because it assumes multiple samples from the same individual are
independent when they are not.
Conclusion - No “statistically significant difference” in height due to shoes.
p = 0.51, fold-change = 1.02
The paired t-test provides substantially more statistical power by removing
person-to-person differences from the noise.
p(Shoes) = 1e-5
p(Person) = 2e-9
Paired t-Test
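A minimal Python sketch of why pairing helps, using only the standard library. The heights are made up for illustration (they are not the data behind these slides), and the function names are hypothetical. The two-sample statistic treats all 20 measurements as independent, so person-to-person variation swamps the shoe effect; the paired statistic works on each person's own difference.

```python
import math
import statistics

# Hypothetical heights in cm for 10 people (not the data from the slides).
without_shoes = [158.0, 162.5, 165.0, 168.0, 170.5, 173.0, 176.0, 179.5, 182.0, 186.0]
shoe_gain = [2.0, 2.5, 3.0, 2.5, 3.5, 1.5, 2.0, 1.5, 2.5, 2.0]
with_shoes = [h + g for h, g in zip(without_shoes, shoe_gain)]

def two_sample_t(a, b):
    """Equal-variance two-sample t statistic; treats all values as independent."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def paired_t(a, b):
    """Paired t statistic; computed from within-person differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

t_unpaired = two_sample_t(with_shoes, without_shoes)
t_paired = paired_t(with_shoes, without_shoes)
print(f"two-sample t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

With these made-up numbers the paired statistic is many times larger than the two-sample one, even though the mean difference is identical in both calculations.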
Once person is known, gender is already known; thus the p-value for Shoe
remains unchanged.
We get the estimate of gender effect for free!
Introducing Gender
p(Shoes) = 1e-5
p(Gender) = .04
p(Person) = 2e-9
It appears (p=.04) that men
(at Partek) are significantly
taller than women
Do shoes have the same effect on men & women?
p(Shoes) = 1e-8
p(Gender) = .04
p(Person) = 2e-12
p(Shoe*Gender) = 7e-5
Wow! Shoes affect women’s height
more than men’s!
Also note that p-values for shoe effect
are even smaller because we explained
more “noise”.
Explore Gender/Shoe Interaction
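One way to read the Shoe*Gender interaction is as a "difference of differences": the shoe effect among women minus the shoe effect among men. The sketch below uses hypothetical per-person height gains (not the slide data) to show the arithmetic.

```python
import statistics

# Hypothetical height gain from shoes (cm) per person; not the slide data.
gain_women = [4.0, 5.5, 3.5, 6.0, 5.0]
gain_men = [2.0, 2.5, 1.5, 2.0, 2.5]

shoe_effect_women = statistics.mean(gain_women)
shoe_effect_men = statistics.mean(gain_men)

# The interaction estimate is the difference between the two shoe effects.
interaction = shoe_effect_women - shoe_effect_men
print(f"women: {shoe_effect_women:.1f} cm, men: {shoe_effect_men:.1f} cm, "
      f"interaction: {interaction:.1f} cm")
```

An interaction of zero would mean shoes add the same height for both groups.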
Breast Cancer Example
Example of Large Batch Effect
•
Example Data, GEO Experiment GSE848
°
Control (E2) Plus Drug Treatment of Breast Cancer Cells
°
5 Treatments x 3 Time Points x 2 replicates
°
“Biological replicates” were processed in 2 batches
Replicates per treatment and time point:
Time     Control   Estrogen (E2)   E2 + ICI   E2 + Raloxifene   E2 + Tamoxifen
0 hr     2
8 hr               2               2          2                 2
48 hr              2               2          2                 2
As Seen Using PCA
As Seen Using Hierarchical Clustering
What is Analysis of Variance?
Analysis (Source: m-w.com)
°
Etymology: New Latin, from Greek, from analyein to break up
°
separation of a whole into its component parts
Analysis of Variance “ANOVA”
°
a technique that partitions the variance in data into separate
components or “factors”
[Figure: variance components of 58.36%, 1.64%, 17.40%, 1.15%, and 17.49%; labels include Treatment and Time]
Good News! – Balanced Experimental Design
• The treatments were perfectly balanced with the batches, so batch
can be included as a blocking factor in ANOVA, and the batch effect
(noise) can be removed from the data.
• In terms of p-values for this gene, the difference is dramatic. With a
simple 2-way ANOVA, this gene was #228 on the gene list and would
not pass multiple test correction for significance. With a 3-way ANOVA
including batch, it was #2 on the gene list.
Factor            p (2-way ANOVA)   p (3-way ANOVA)
Treatment         0.00391497        3.43275E-07
Time              0.396031          0.00964938
Treatment*Time    0.100862          3.56752E-05
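The effect of adding batch as a blocking factor can be sketched with a small balanced example. The values below are hypothetical log2 expression levels (not from GSE848), the treatment labels are placeholders, and the model is a plain additive one computed by hand rather than Partek's implementation. The treatment sum of squares is the same in both models; only the error term changes.

```python
import statistics

# Hypothetical log2 expression: one sample per treatment in each of two batches.
# data[treatment] = [batch 1 value, batch 2 value]
data = {
    "control": [8.0, 9.2],
    "E2": [8.6, 9.7],
    "E2+drug": [8.3, 9.5],
}
treatments = list(data)
n_batches = 2
values = [v for t in treatments for v in data[t]]
grand = statistics.mean(values)

treat_means = [statistics.mean(data[t]) for t in treatments]
batch_means = [statistics.mean([data[t][b] for t in treatments]) for b in range(n_batches)]

ss_total = sum((v - grand) ** 2 for v in values)
ss_treat = n_batches * sum((m - grand) ** 2 for m in treat_means)
ss_batch = len(treatments) * sum((m - grand) ** 2 for m in batch_means)

df_treat = len(treatments) - 1
df_batch = n_batches - 1
df_total = len(values) - 1

# Model without batch: batch-to-batch variation stays in the error term.
f_without = (ss_treat / df_treat) / ((ss_total - ss_treat) / (df_total - df_treat))

# Model with batch as a blocking factor: batch variation is removed from the error.
ss_error = ss_total - ss_treat - ss_batch
f_with = (ss_treat / df_treat) / (ss_error / (df_total - df_treat - df_batch))

print(f"F(treatment) without batch = {f_without:.2f}, with batch = {f_with:.2f}")
```

This only works because every treatment appears in every batch; with an unbalanced or confounded layout the batch and treatment sums of squares could not be separated this simply.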
#2 Most Significant Gene
[Figure: samples grouped by processing day (Monday, Tuesday)]
Median A = 8.5, Median B = 9.7
Tue vs. Mon more than 2-fold difference
ANOVA Partitions Variability
•
Total variance is partitioned into variability due to influencing
factors and the rest is assumed to be due to random error
(noise).
•
R² = 81% for 2-way ANOVA
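The partition can be written out for the simplest case, a single factor. In this sketch the expression values and group names are hypothetical; the total sum of squares splits exactly into a between-group part (explained by the factor) and a within-group part (treated as noise), and R² is the explained share.

```python
import statistics

# Hypothetical log2 expression values for one gene in three treatment groups.
groups = {
    "control": [8.1, 8.4, 8.0, 8.3],
    "drug_a": [9.0, 9.3, 9.1, 9.4],
    "drug_b": [8.6, 8.5, 8.9, 8.8],
}

all_values = [v for vals in groups.values() for v in vals]
grand_mean = statistics.mean(all_values)

ss_total = sum((v - grand_mean) ** 2 for v in all_values)
ss_between = sum(len(vals) * (statistics.mean(vals) - grand_mean) ** 2
                 for vals in groups.values())
ss_within = sum((v - statistics.mean(vals)) ** 2
                for vals in groups.values() for v in vals)

# Share of the total variability explained by the factor.
r_squared = ss_between / ss_total
print(f"SS total = {ss_total:.3f} = {ss_between:.3f} (factor) + {ss_within:.3f} (error)")
print(f"R^2 = {r_squared:.1%}")
```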
Batch Effect Remover
Before Batch Removal
After Batch Removal
Batch Effect Remover
• For visualization purposes only!
• Factors you would normally add for ANOVA
• How do we account for batch without Partek Batch Remover?
Building Blocks of Experimental Design
•
No Randomization
•
Completely Randomized
°
Subjects randomly assigned to treatment groups
•
Randomized Block
°
Subjects randomly assigned to treatment groups within
similar “blocks” (e.g. gender, litters)
°
Requires a priori knowledge of differences between the blocks
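A randomized block assignment can be sketched in a few lines of Python. The function name, rat labels, and litter grouping below are hypothetical; the idea is simply to shuffle subjects within each block and then deal the treatments out evenly, so every block contributes equally to every treatment group.

```python
import random

def randomized_block_assignment(subjects_by_block, treatments, seed=0):
    """Shuffle subjects within each block, then deal treatments out evenly."""
    rng = random.Random(seed)  # fixed seed so the plan is reproducible
    assignment = {}
    for block, subjects in subjects_by_block.items():
        shuffled = subjects[:]
        rng.shuffle(shuffled)
        for i, subject in enumerate(shuffled):
            assignment[subject] = treatments[i % len(treatments)]
    return assignment

# Hypothetical example: 8 rats from 2 litters (the blocks).
litters = {
    "litter_1": ["rat1", "rat2", "rat3", "rat4"],
    "litter_2": ["rat5", "rat6", "rat7", "rat8"],
}
plan = randomized_block_assignment(litters, ["control", "treated"])
for rat, treatment in sorted(plan.items()):
    print(rat, treatment)
```

A completely randomized design would shuffle all 8 rats together instead, which can by chance put most of one litter into the same treatment group.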
Simplest Design: Not Randomized
8 Male Rats
4 Control
4 Treated
Stripe coated rats are “faster” or “more alert”.
Completely Randomized
8 Male Rats
4 Control
4 Treated
A Better Approach…
•
“Randomized Block Design”
•
First divide into blocks, then randomly assign to
Randomized Block Design
8 Male Rats
4 Control
4 Treated
Technical Blocks in Microarray Experiments
•
Litter is an example of a “biological” block
•
Examples of Technical/Processing Blocks:
°
RNA Isolation Batch
°
Hybridization Batch
°
Operator
°
As well as – (although less so)
°
Wash and Stain Batch
°
Reagent, Cocktail Batches
°
Chip Lot
In Summary
“Block what you can and randomize what you
cannot.” – Box, Hunter, & Hunter (1978)
•
Blocking ensures that the differences in treatment
cannot possibly be due to the blocking factor
•
Blocking completely eliminates noise due to blocks
•
Randomization gives approximate balance across
Analysis of Variance
•
Also Known As:
°
ANOVA
°
ANCOVA
°
Linear Model
°
Mixed Linear Model
•
Invented in 1900, 1908, 1923 – Still remains the
most commonly used statistical method to
analyze clinical trials!
Simple ANOVA: Student’s t-test
t and F Statistics
•
Fun fact
°
An equal-variance two-sample t-test is mathematically equivalent to a
1-way ANOVA with two groups (F = t²).
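The equivalence is easy to check numerically. This sketch computes the pooled-variance t statistic and the one-way ANOVA F statistic by hand for two hypothetical groups; the squared t statistic matches F up to floating-point rounding.

```python
import math
import statistics

# Hypothetical log2 expression values for two groups.
group_a = [8.1, 8.4, 8.0, 8.3, 8.2]
group_b = [9.0, 9.3, 9.1, 9.4, 8.8]

def pooled_t(a, b):
    """Equal-variance (pooled) two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    values = [v for g in groups for v in g]
    grand = statistics.mean(values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

t_stat = pooled_t(group_a, group_b)
f_stat = one_way_f([group_a, group_b])
print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```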
Assumptions of ANOVA
•
Data is Normally distributed (bell shaped) within different
treatment groups
•
Ensure data is log transformed
•
Variance is equal within different treatment groups
•
Design balanced experiments
•
Sample groups are independent.
•
Don’t make the shoe mistake
•
*Replicates Required to get p-value
Random vs Fixed Effects
•
If the experiment were to be performed
again, would the same levels of the
factor be used?
°
Yes - Fixed effect (e.g. gender, dose,
time, dye)
°
No - Random effect (e.g. hyb batch,
wash batch, litter, subject)
•
Why do I have to worry about this?
°
In general, treating a random effect as
a fixed effect will produce an
over-optimistic p-value, leading to a false
discovery.
What Factors Belong in the Model?
•
Obviously, the factors of
interest to the researcher
°
e.g. strain, time, strain*time
•
Any factor needed to account
for dependence of samples
(don’t violate assumption of
independence!)
°
e.g. donor
•
Any additional blocking
factors for noise reduction
Partek Expression Philosophy
•
Use PCA to aid in quality control & sample grouping
•
Use ANOVA to detect differentially expressed genes. Fold change
is interesting for ranking, but not a great primary filtering metric
•
Incorporate as much phenotypic and experimental design
information into the ANOVA model as possible. Measure the
experimental technical components.*
•
Make sense of gene lists through functional groups
How NOT to Run/Ruin Your Next Experiment!
•
Samples are frequently “organized” by treatment groups.
•
Samples are then processed in batches corresponding to
treatment groups.
•
But please do NOT process your control samples on Monday,
and then process your treated samples on Tuesday.
•
You will confound these two variables. ANOVA is powerful but
not magical.
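Before processing samples, a sample sheet can be checked for this kind of complete confounding. The sketch below is a simple hypothetical helper (the function name and column names are assumptions); it flags two factors whose levels map one-to-one, which is exactly the control-on-Monday, treated-on-Tuesday layout. It does not detect partial imbalance.

```python
def is_completely_confounded(samples, factor_a, factor_b):
    """True when the levels of two factors map one-to-one,
    so their effects cannot be separated by any model."""
    pairs = {(s[factor_a], s[factor_b]) for s in samples}
    levels_a = {a for a, _ in pairs}
    levels_b = {b for _, b in pairs}
    return len(levels_a) > 1 and len(pairs) == len(levels_a) == len(levels_b)

bad_design = [
    {"treatment": "control", "day": "Monday"},
    {"treatment": "control", "day": "Monday"},
    {"treatment": "treated", "day": "Tuesday"},
    {"treatment": "treated", "day": "Tuesday"},
]
balanced_design = [
    {"treatment": "control", "day": "Monday"},
    {"treatment": "control", "day": "Tuesday"},
    {"treatment": "treated", "day": "Monday"},
    {"treatment": "treated", "day": "Tuesday"},
]

print(is_completely_confounded(bad_design, "treatment", "day"))       # True
print(is_completely_confounded(balanced_design, "treatment", "day"))  # False
```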
Summary – Experimental Design & Analysis
•
Understand how separating variables in your analysis is critical to
your success
•
Design balanced experiments.
•
Let p-values rank your data, but don’t be a slave to FDR.
What kinds of assays are possible?
DNA-Seq
Copy Number
SNP
Structural variants
Whole genome sequencing
Metagenomics
Targeted/Amplicon Sequencing
RNA-Seq Transcriptome
Differential Gene Expression
Alternative Splicing
SNP detection
Indel detection
Novel exons/genes
miRNA-Seq
identify regulatory (non-coding) RNAs
ChIP-Seq
Transcription Factor binding sites
Methylation sites
Histone modifications
RIP-Seq (RNA-binding proteins)
NGS Analysis Phases
Primary Analysis: Control Software produces Data File (Reads + Quality): FASTQ, …
Secondary Analysis: Bowtie/BIOSCOPE/BWA, etc. produce Reads aligned to genome: BAM, …
Tertiary Analysis: starts from Reads aligned to genome
Modified from Strand Life Sciences
Illumina HiSeq, SOLiD, Roche 454, Ion Torrent, PacBio
Intuitive Visualizations
Publication
QC & Exploratory Analysis
Powerful Statistics
Genome Alignment
Sequencing
Biological Interpretation
Integrated Genomics
NGS Analytical Process
GAGGTTGCAGTTTG   chr1   243919543  R
ACTGCTCCGCCTCA   chr16  49094914   F
GAATAAAAAATCCA   chr13  55882620   F
CGTCCTTCACCCTCT  chr13  110085165  R
CCTTAAGGAAAGGA   chr18  72273046   F
CAGCTAGGGTTGCC   chr2   120786940  R
CTGCTGGTGCTGCG   chr10  73237323   F
Comprehensive Analysis of NGS Data
RNA-Seq
Methylation Seq
ChIP-Seq
SmallRNA-Seq
DNA-Seq
Read Types for NGS
•
Single End Reads
•
Paired-end Reads
•
Junction Reads
•
Multiple Aligned reads
•
Strand-specific reads
Paired End & Single End Reads
Single End
Paired End
DNA Space
DNA Space
chr5
chr2
Multiple aligned
Junction Reads
•
Derived from transcripts, some RNA-Seq reads will read through splice junctions (single end or
paired-end)
•
They will not align well to genomic reference since the two ends are many nucleotides apart (separated
by the intronic region)
DNA Space
Next Gen File Formats
•
Unaligned (FASTA, FASTQ, SCARF, QSEQ, SRA, RAW, TXT,
others)
•
Aligned: SAM, BAM, Vendor Specific Formats/Color Space
•
Variant Call File (VCF, BCF) – SNPs, indels
GAGGTTGCAGTTTG   chr1   243919543  R
ACTGCTCCGCCTCA   chr16  49094914   F
GAATAAAAAATCCA   chr13  55882620   F
CGTCCTTCACCCTCT  chr13  110085165  R
CCTTAAGGAAAGGA   chr18  72273046   F
CAGCTAGGGTTGCC   chr2   120786940  R
CTGCTGGTGCTGCG   chr10  73237323   F
Alignment Tools
ELAND
BFAST
Bowtie
TMAP
BWA
TopHat
SOAP
Etc.
What to expect?
• File size depends on read length, read type
• 4GB single lane (~100 million reads) – Bowtie w/ 8 cores = 20/25 minutes;
reference genome - read length = 33bp (older)
• TopHat – same file – 1 day
• Read length x number of reads x 8 = file size (fasta, double for fastq)
• BAM file ~ 3-4x smaller than unaligned file
Laptop
Cloud
Cluster
FASTQ Format (Unaligned Reads)
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Line 1) begins with a '@' character and is followed by a sequence identifier
and an optional description (like a FASTA title line).
Line 2) is the raw sequence letters (ACGT).
Line 3) begins with a '+' character and is optionally followed by the same
sequence identifier (and any description) again.
Line 4) encodes the quality values for the sequence in Line 2, and must
contain the same number of symbols as letters in the sequence.
Sanger format can encode a Phred quality score
from 0 to 93 using ASCII 33 to 126
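The four-line layout and the Sanger quality encoding can be illustrated with a short parser applied to the record shown above. This is a sketch, not a production reader: the function name is hypothetical, it handles one record at a time, and it assumes the Sanger offset of 33.

```python
def parse_fastq_record(lines, offset=33):
    """Parse one 4-line FASTQ record and decode its Phred quality scores."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    # Sanger encoding: Phred score = ASCII code of the symbol minus 33.
    scores = [ord(symbol) - offset for symbol in quality]
    return header[1:], sequence, scores

record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
seq_id, seq, phred = parse_fastq_record(record)
print(seq_id, len(seq), phred[:5])
```

The first quality symbol, '!', is ASCII 33 and therefore decodes to a Phred score of 0, the lowest possible value.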
What is Alignment?
•
Read comes off a sequencing machine
•
Goal: Determine where on the genome that read belongs
•
Method: Match sequence of read to sequence from a
reference genome
Result:
Genomic Location of read
ATGGTCAGGCATGGTCATTC (reference genome)
ATGGTCA (read)
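The matching step can be sketched as a naive exact search. The sequences are the letters from this slide, under the assumption that the 20-base string is the reference and the 7-base string is the read; the function name is hypothetical. Real aligners such as Bowtie and BWA use an index of the reference and tolerate mismatches, so this is only an illustration of the goal, not of how they work.

```python
def find_exact_matches(reference, read):
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

reference = "ATGGTCAGGCATGGTCATTC"  # assumed to be the reference on the slide
read = "ATGGTCA"                    # assumed to be the read on the slide

hits = find_exact_matches(reference, read)
print(hits)  # [0, 10]
```

The read matches at two positions, which is a small-scale version of the multiple-aligned-read situation listed under read types.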
A T G G T C A
G A T G C A C G G A T
T G T C A T
(Reference Genome)
(Read)
DNA Space
RNA Space
Gene/Transcript
Exon junction
Align junction reads
1) Align to genome: gapped alignment is time-expensive
- breaks up the read into pieces (25-mers)
SAM Format (Aligned reads)
•
Sequence Alignment/Map (SAM) format is TAB-delimited
•
BAM is binary SAM
•
Header line
Fields of an alignment line, in order:
Read id
Bitwise flag
Reference sequence name
Position on the reference genome
Mapping quality score
CIGAR (M = match, I = insertion, D = deletion)
Reference name of mate
Position of mate
Length
Sequence
Quality
Optional fields
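Because SAM is TAB-delimited with eleven mandatory columns, an alignment line can be split into named fields directly. In this sketch the alignment line is hypothetical and the short field names follow the column names used in the SAM specification; the helper names are illustrative.

```python
import re

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = ("flag", "pos", "mapq", "pnext", "tlen")

def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["optional"] = parts[11:]
    return record

def parse_cigar(cigar):
    """Turn a CIGAR string such as '8M2I4M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

# Hypothetical alignment line, built with explicit TABs for readability.
example = "\t".join([
    "read001", "99", "chr1", "243919543", "60", "8M2I4M1D3M",
    "=", "243919700", "174", "GAGGTTGCAGTTTGACT", "I" * 17, "NM:i:3",
])
record = parse_sam_line(example)
ops = parse_cigar(record["cigar"])

# M, I, S, = and X consume read bases; D consumes only reference bases.
read_bases = sum(n for n, op in ops if op in "MIS=X")
print(record["rname"], record["pos"], ops, read_bases)
```

Header lines start with '@' and would need to be skipped before calling the parser.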
Explain flags
•
http://picard.sourceforge.net/explain-flags.html
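The flag is a sum of bits, each with a fixed meaning in the SAM specification. A small decoder (the function name and wording of the descriptions are illustrative) does the same job as the web page:

```python
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def explain_flag(flag):
    """List the meaning of every bit that is set in a SAM flag value."""
    return [meaning for bit, meaning in SAM_FLAGS.items() if flag & bit]

for meaning in explain_flag(99):
    print(meaning)
```

A flag of 99 is 1 + 2 + 32 + 64, so it describes the first read of a properly paired alignment whose mate maps to the reverse strand.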
VCF Format (Variant Call Format)
The Variant Call Format (VCF) is a TAB-delimited format in which each
data line consists of the following fields:
Chromosome, Position, variant id, reference/alternative alleles,
quality, information (read depth), event, sample Id (optional), format
(optional)
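A VCF data line can be split in the same way as a SAM line. This sketch uses the column order defined in the VCF specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then optional FORMAT and sample columns), which differs slightly from the order listed on the slide; the example line follows the style of the specification's own example, and the function name is hypothetical.

```python
VCF_COLUMNS = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]

def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns, INFO keys, and samples."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, parts[:8]))
    record["pos"] = int(record["pos"])
    record["alt"] = record["alt"].split(",")
    info = {}
    for item in record["info"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # keys without a value are flags
    record["info"] = info
    record["format"] = parts[8].split(":") if len(parts) > 8 else []
    record["samples"] = [sample.split(":") for sample in parts[9:]]
    return record

example = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ", "0|0:48:1:51,51",
])
variant = parse_vcf_line(example)
print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"], variant["info"]["DP"])
```

Lines beginning with '#' are header lines and would be skipped before parsing.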
Partek Flow
•
Web based Application
•
Cloud, Desktop, Server
•
Chrome, Firefox, Safari
•
Access from any terminal,
smartphone
•
Project centric
•
Protocols
•
Collaborate with others
•
Current release 1.0 / 2.1 beta
•
Alignment, QA/QC, GSA
•
Export results to PGS
•
Coming soon – SGE,
Challenge: Data volume is a bottleneck
Help, I’m drowning in data!
How do I handle all this data?
Solution: Schedule Tasks
• Schedule &
Queue tasks
• Emails you
when tasks are
complete
• Keep your
hardware
running 24/7/365
Challenge: The quality of the data will affect the
alignment
•
How do I determine data quality?
•
Do I have outliers?
•
Can I move forward with my analysis?
•
Do I need to trim/filter my reads?
Solution: Pre & Post Alignment QA/QC
• Group and
individual QA/QC
for excluding
outliers
• Quality score per
read/position
• Look for drop in
quality scores
• Make intelligent
decisions for
trimming/filtering
adaptors, barcodes,
low quality reads
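Trimming on a drop in quality can be illustrated with a very simple rule: remove bases from the 3' end while their Phred score is below a threshold. This is a sketch with a hypothetical function name, read, and threshold; it assumes Sanger (offset 33) quality encoding and is not the algorithm used by Partek Flow or any particular trimming tool.

```python
def trim_low_quality_tail(sequence, quality, min_phred=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]

# Hypothetical read: 'I' encodes Phred 40, '#' encodes 2, '!' encodes 0.
seq, qual = trim_low_quality_tail("ACGTACGTAC", "IIIIIIII#!")
print(seq, qual)  # ACGTACGT IIIIIIII
```

Reads that end up shorter than some minimum length after trimming are usually filtered out entirely, which is a separate decision from where to trim.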
Challenge: Alignment
•
Different people, different parameters will result in
different alignment.
•
Which aligner to use? Some aligners have more than 50
different options. How do I know what to set?
•
What options do I choose for RNA-Seq, ChIP-Seq,
DNA-Seq, miRNA-Seq, MeDIP-Seq?
•
What options do I choose for the different read types?
Junction reads? Paired-End reads? Multiple Aligned
reads?
Solution: Multiple aligners with recommended
defaults
•
Vendor Specific
default options
•
Automatic Download
of reference genomes
•
Assay specific
default options
(RNA-Seq, ChIP-Seq,
DNA-Seq)
•
Advanced options
also available through
GUI Interface (no
command line)
Challenge: How do I keep track of my samples?
Which samples are Tumor? Control? Age? Sex/Gender?
How am I ever going to keep track of this clinical information?
Solution: Advanced Sample Management
• Manage files
associated with sample
throughout life of project
• Keep track of reference
genome
• Controlled vocabulary
• SNOMED List
• In-place editing of
sample info
Plug-in for Torrent Suite
• Performs QA/QC within Torrent Suite
• Uploads data to Partek Flow
Perform QA/QC within Torrent Suite
and seamlessly upload data to
Partek® Flow™ for Comprehensive
Data Analysis and Visualization
Comprehensive Solution for RNA-Seq