NGS/Array expression analysis course, Feb 2011, Henrik Hornshøj
Basic processing of next-generation
sequencing (NGS) data
Getting from raw sequence data to expression analysis!
Reminder: we are measuring expression
of protein coding genes by transcript
abundance
Chromosome → Gene → Transcript → Protein
Measured quantity: mRNA abundance (transcript sequence copies)
A typical experimental setup
Condition 1
Condition 2
Tissue sample 1
Tissue sample 2
.
Tissue sample n
Tissue sample 1
Tissue sample 2
.
Tissue sample n
mRNA sequencing
(RNA-seq)
Data processing
Expression analysis
Raw sequence generation → Sequence filtration → Mapping → Mapping filtration → Computing gene expression values → Gene expression analysis
Sequence file formats
› Output files from sequencing equipment
› Can be MANY and HUGE data files!
A common file format is Fastq (36 bp reads example):
Sequence number 1:
@OBAN:8:1:2:902#0/1
AAAGCTTGTTTTTTCCCTACANCTGTATCCTTTCTT
+OBAN:8:1:2:902#0/1
aaaVa]aY]aaba_RYZ``X[DN_[X_]PU_aaZ_a
Sequence number 2:
@OBAN:8:1:2:1718#0/1
TAAATATAACATTCTTTCCACNACACTTTCTAGGAC
+OBAN:8:1:2:1718#0/1
aaaaaaaaa`aaa]aaa[`_XD\\`\`a^[^]PFXW
.
.
.
Sequence number n:
@OBAN:8:1:1114:370#0/1
GGAAGGCAGCGAACATCTGTTCAATCTCCTCCTTGG
+OBAN:8:1:1114:370#0/1
a^aba\]_[_]Wa[[Z]__M^YNX``]]^a`_XTXB
In each record, line 2 is the DNA sequence and line 4 is the quality sequence.
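The four-line record layout can be read with a few lines of Python. This is only a minimal sketch: the function name `read_fastq` is made up for illustration, and it assumes every record occupies exactly four lines (no wrapped sequences).

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], sequence, quality

# First record from the example above, held in memory instead of a file.
example = StringIO(
    "@OBAN:8:1:2:902#0/1\n"
    "AAAGCTTGTTTTTTCCCTACANCTGTATCCTTTCTT\n"
    "+OBAN:8:1:2:902#0/1\n"
    "aaaVa]aY]aaba_RYZ``X[DN_[X_]PU_aaZ_a\n"
)
records = list(read_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, len(sequence), len(quality))
```

For a real file, pass an open file handle instead of the `StringIO` object.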
Sequence format conversions
If a specific sequence file format is required for downstream analysis or processing, the files can be converted with programs such as maq or fq_all2std.pl
Program: maq (Mapping and Assembly with Qualities)
Version: 0.7.1
Contact: Heng Li <[email protected]>
Usage: maq <command> [options]
Format converting:
sol2sanger    convert Solexa FASTQ to standard/Sanger FASTQ
mapass2maq    convert mapass2's map format to maq's map format
bfq2fastq     convert BFQ to FASTQ format
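To show what a Solexa-to-Sanger conversion involves, here is a rough Python sketch of the quality recoding. It is not maq's code. It assumes the input really uses Solexa-scaled scores with an ASCII offset of 64, and the function names are hypothetical. Files from later Illumina pipelines store Phred scores with offset 64 and would only need the offset shifted.

```python
import math

def solexa_char_to_sanger(char):
    """Recode one Solexa quality character (offset 64) as a Sanger one (offset 33)."""
    solexa_q = ord(char) - 64
    # Convert Solexa log-odds score to a Phred score.
    phred_q = 10 * math.log10(10 ** (solexa_q / 10) + 1)
    return chr(int(round(phred_q)) + 33)

def solexa_to_sanger(quality):
    """Recode a whole quality string."""
    return "".join(solexa_char_to_sanger(c) for c in quality)

# 'h' is Solexa score 40, '@' is Solexa score 0.
converted = solexa_to_sanger("h@")
print(converted)
```

High scores are almost unchanged by the conversion (Solexa 40 becomes Phred 40), while low scores differ noticeably (Solexa 0 becomes Phred 3).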
Filtering on raw sequences
› Quality
› Read count per tissue/sample
› Uniqueness
› Trimming
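Two of these filters, trimming and quality, can be sketched in Python as follows. The thresholds, the function names and the offset-64 quality encoding are assumptions chosen for illustration, not recommendations from the course.

```python
def phred_scores(quality, offset=64):
    """Decode a quality string into integer scores (offset 64 is assumed)."""
    return [ord(c) - offset for c in quality]

def trim_3prime(sequence, quality, min_q=20, offset=64):
    """Cut bases off the 3' end until the last base has quality >= min_q."""
    scores = phred_scores(quality, offset)
    end = len(sequence)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality[:end]

def passes_filter(sequence, quality, min_len=20, min_mean_q=20, offset=64):
    """Keep reads that are long enough, free of N calls and of good mean quality."""
    if not sequence or len(sequence) < min_len or "N" in sequence:
        return False
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) >= min_mean_q

# Toy read: eight good bases (score 40) followed by two poor ones (score 2).
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "hhhhhhhhBB")
print(trimmed_seq, passes_filter(trimmed_seq, trimmed_qual, min_len=8))
```

Trimming before filtering lets a read with a poor 3' end survive if the remaining part is still long enough.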
Mapping sequences
Choice of suitable program
Depends on the purpose of your analysis. There are many alignment programs; here are some commonly used examples:
› Bowtie: Basic mapping to reference sequence
› Maq/BWA: Mapping and SNP/indel detection
Mapping sequences
Choice of reference sequence database – what to map against?
Genome reference: common if you want to do transcript assembly, study alternative splicing and detect SNPs/indels
Transcript reference: common if you want to do ‘simple’ read mapping, count transcript copies and analyze expression levels
Mapping sequences
Common issues
Reference database not available.
A read that matches more than one reference sequence equally well cannot be mapped uniquely; it will typically be assigned to one of the matches at random
Gene A: ATCATCGGGCCATCGATTAGCTGATCGGACGCTA
Gene B: TTTTCCTCTTTATCGATTAGCTGGGGGT
Read:   ATCGATTAGCTG (matches both Gene A and Gene B)
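The ambiguity can be reproduced with plain string searching. The sketch below uses exact matching only, which is far simpler than what a real mapper does. The helper name `find_exact_hits` is hypothetical.

```python
def find_exact_hits(read, references):
    """Return (reference name, 0-based start) for every exact occurrence of the read."""
    hits = []
    for name, sequence in references.items():
        start = sequence.find(read)
        while start != -1:
            hits.append((name, start))
            start = sequence.find(read, start + 1)
    return hits

references = {
    "Gene A": "ATCATCGGGCCATCGATTAGCTGATCGGACGCTA",
    "Gene B": "TTTTCCTCTTTATCGATTAGCTGGGGGT",
}
hits = find_exact_hits("ATCGATTAGCTG", references)
is_unique = len(hits) == 1
print(hits, is_unique)
```

A read with more than one hit has no unique origin, so counting it for either gene is a guess.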
Example: map reads to reference
database with Bowtie
Build reference database index:
bowtie-build NC_002127.fna e_coli_O157_H7
Test build:
bowtie -c e_coli_O157_H7 GCGTGAGCTATGAGAAAGCGCCACGCTTCC
Map/align reads to references:
Sequence Alignment MAP format
(SAM: Bowtie output example)
Standard alignment file format:
OBAN:8:1:3:1366#0/1 16 ENSSSCG00000004803|ENSSSCT00000005302|ACTC1 1250 255 36M * 0 0 GTCTACTTTACGTTCAGGATGACAGGTTAATGCTTC VXG^Z^`_ZVUYZ]T`ZQ\_U\W[X_^aaa`[R\\R XA:i:0 MD:Z:36 NM:i:0
OBAN:8:1:3:285#0/1 4 * 0 0 * * 0 0 AGGTATTGGGTTTGGGGGCCTTACACACCAGGTGGA `VOW^b`RVRS`aUQMT[Z^_a_`_Y_]Y]\RNTVW XM:i:0
OBAN:8:1:3:672#0/1 4 * 0 0 * * 0 0 TGGGTATACAGTTCATCCAGTACCCGCTCCGGCTTC a`^\Y_^``]``aQ]a^^_VP`[[\^[SY[YQJWUB XM:i:0
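Each SAM line has eleven mandatory fields followed by optional tags. The second field (FLAG) is a bit field: bit 0x4 means the read is unmapped and bit 0x10 means it aligned to the reverse strand. A minimal parsing sketch in Python, with a hypothetical `parse_sam_line` helper, could look like this. Real SAM files are tab-delimited; whitespace splitting is used here only because the example lines above are space-separated.

```python
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]

def parse_sam_line(line):
    """Split one alignment line into named fields and decode two FLAG bits."""
    fields = line.split()
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = fields[11:]
    record["unmapped"] = bool(record["flag"] & 0x4)
    record["reverse_strand"] = bool(record["flag"] & 0x10)
    return record

mapped = parse_sam_line(
    r"OBAN:8:1:3:1366#0/1 16 ENSSSCG00000004803|ENSSSCT00000005302|ACTC1 1250 255 36M * 0 0 "
    r"GTCTACTTTACGTTCAGGATGACAGGTTAATGCTTC VXG^Z^`_ZVUYZ]T`ZQ\_U\W[X_^aaa`[R\\R XA:i:0 MD:Z:36 NM:i:0"
)
unmapped = parse_sam_line(
    r"OBAN:8:1:3:285#0/1 4 * 0 0 * * 0 0 "
    r"AGGTATTGGGTTTGGGGGCCTTACACACCAGGTGGA `VOW^b`RVRS`aUQMT[Z^_a_`_Y_]Y]\RNTVW XM:i:0"
)
print(mapped["rname"], mapped["pos"], mapped["reverse_strand"], unmapped["unmapped"])
```

In the example output, the first read (FLAG 16) mapped to the reverse strand of the ACTC1 transcript at position 1250, and the other two reads (FLAG 4) did not map.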
Filtering sequence mapping
› Low percent reads mapped
› Low number of genes covered
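Both numbers can be computed directly from the SAM output of a sample. The following Python sketch uses toy alignment lines and a hypothetical helper `mapping_summary`. Any cut-off for discarding a sample is left to the analyst.

```python
def mapping_summary(sam_lines):
    """Return total reads, percent mapped and number of references hit."""
    total = mapped = 0
    genes = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.split()
        total += 1
        if not int(fields[1]) & 0x4:  # FLAG bit 0x4 set means unmapped
            mapped += 1
            genes.add(fields[2])
    percent = 100.0 * mapped / total if total else 0.0
    return {"total_reads": total, "percent_mapped": percent, "genes_covered": len(genes)}

# Toy data: three mapped reads (two genes) and one unmapped read.
toy_sam = [
    "@HD VN:1.0",
    "read1 0 GeneA 10 255 4M * 0 0 ACGT IIII",
    "read2 16 GeneA 55 255 4M * 0 0 TTGA IIII",
    "read3 0 GeneB 7 255 4M * 0 0 GGCA IIII",
    "read4 4 * 0 0 * * 0 0 CCCC IIII",
]
summary = mapping_summary(toy_sam)
print(summary)
```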
Computing absolute expression
Counting transcript copies
Mapping output, Sample 1:
Gene    Read
GeneA   ATCGATTAGAC
GeneA   ATGGGCTGCAG
GeneB   ATTTCGGCTGC
GeneC   ATCCCTCCCTA
GeneC   GGGCTGGCTGC
GeneC   GCCGGCGGCAA
Count copies to get the counts for Sample 1:
Gene    Count
GeneA   2
GeneB   1
GeneC   3
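Counting is a simple tally of reads per gene. A minimal Python sketch using the Sample 1 mapping output from this slide:

```python
from collections import Counter

# (gene, read) pairs as they appear in the mapping output for Sample 1.
mapping_output = [
    ("GeneA", "ATCGATTAGAC"),
    ("GeneA", "ATGGGCTGCAG"),
    ("GeneB", "ATTTCGGCTGC"),
    ("GeneC", "ATCCCTCCCTA"),
    ("GeneC", "GGGCTGGCTGC"),
    ("GeneC", "GCCGGCGGCAA"),
]

# Tally the number of reads assigned to each gene.
counts = Counter(gene for gene, _read in mapping_output)
for gene in sorted(counts):
    print(gene, counts[gene])
```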
Creating a gene expression matrix
› Concatenate the per-sample count tables and pivot them into one matrix
› Column IDs defined by tissue samples
› Row IDs defined by gene/transcript IDs
Gene    Sample1  Sample2  .  SampleN
GeneA   2        4
GeneB   1        0
GeneC   3        14
.       .        .
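The concatenate-and-pivot step can be sketched in plain Python. The function name and the dictionary layout are assumptions made for this illustration. Genes with no reads in a sample get a count of 0.

```python
def build_expression_matrix(per_sample_counts):
    """Pivot {sample: {gene: count}} into (gene IDs, sample IDs, rows of counts)."""
    samples = list(per_sample_counts)
    genes = sorted({gene for counts in per_sample_counts.values() for gene in counts})
    rows = [[per_sample_counts[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows

per_sample_counts = {
    "Sample1": {"GeneA": 2, "GeneB": 1, "GeneC": 3},
    "Sample2": {"GeneA": 4, "GeneC": 14},  # GeneB had no reads in this sample
}
genes, samples, rows = build_expression_matrix(per_sample_counts)

print("Gene", *samples, sep="\t")
for gene, row in zip(genes, rows):
    print(gene, *row, sep="\t")
```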
Filtering genes by absolute expression
› Number of reads per gene per sample (alternatively, per gene summed across all samples)
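One possible reading of this filter, sketched in Python: keep a gene only if it reaches a minimum count in every sample, or alternatively a minimum total across samples. The function name, the thresholds and the "every sample" rule are assumptions for illustration.

```python
def keep_gene(counts, min_per_sample=None, min_total=None):
    """Return True if the gene's counts satisfy the chosen thresholds."""
    if min_per_sample is not None and any(c < min_per_sample for c in counts):
        return False
    if min_total is not None and sum(counts) < min_total:
        return False
    return True

# Counts per gene across Sample1 and Sample2.
matrix = {"GeneA": [2, 4], "GeneB": [1, 0], "GeneC": [3, 14]}

by_sample = {g: c for g, c in matrix.items() if keep_gene(c, min_per_sample=2)}
by_total = {g: c for g, c in matrix.items() if keep_gene(c, min_total=1)}
print(sorted(by_sample), sorted(by_total))
```

The per-sample rule is stricter: GeneB is dropped because it falls below 2 reads in both samples, while the total-based rule keeps it.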
Transformation of expression values
Adjust for technical differences like total read count per sample
› Depends on the downstream analysis tool
› Some tools read absolute counts and do the transformation/normalization themselves
› Relative abundance (RA)
› Log transformation of RA
› Reads per Kilobase per Million Reads (RPKM; Mortazavi et al, 2008)
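The three transformations can be written out in a few lines of Python. This is a sketch with hypothetical function names. RPKM is computed as the read count divided by the gene length in kilobases and by the total number of mapped reads in millions. The gene length and read total in the example are invented numbers.

```python
import math

def relative_abundance(counts):
    """Divide each gene's count by the total read count of the sample."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}

def log2_relative_abundance(ra):
    """Log2-transform RA; genes with RA of 0 are dropped because log(0) is undefined."""
    return {gene: math.log2(value) for gene, value in ra.items() if value > 0}

def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

sample1 = {"GeneA": 2, "GeneB": 1, "GeneC": 3}
ra = relative_abundance(sample1)
log_ra = log2_relative_abundance(ra)
example_rpkm = rpkm(count=100, gene_length_bp=2000, total_mapped_reads=1_000_000)
print(ra, log_ra, example_rpkm)
```

RA corrects only for sequencing depth, while RPKM also corrects for gene length, which matters when comparing different genes within a sample.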
Carry on with gene expression analysis!
› Differential expression
› Clustering
Some general considerations
› Linux is the best environment for handling and analyzing huge data files
› Learning some of the Linux commands can be helpful (grep, sed, cut, awk)
› Learning Perl/R programming can also help with processing text data files
› Use batch files to build data processing pipelines (for documentation and re-use)
› Get used to switching between various tools for processing, analysis and visualization
› Check input/output files: you are responsible, not the software/script authors!
Resources for NGS data
› Forum for discussing NGS data analysis: http://www.seqanswers.com
› Galaxy online tools: http://main.g2.bx.psu.edu
› NCBI Short Read Archive (SRA): http://trace.ncbi.nlm.nih.gov/Traces/sra
› Bioconductor packages for NGS (ShortRead, Biostrings, edgeR, Rsamtools, biomaRt): http://www.bioconductor.org/help/workflows/high-throughput-sequencing