(1)

NGS/Array expression analysis course, Feb 2011, Henrik Hornshøj

Basic processing of next-generation sequencing (NGS) data

Getting from raw sequence data to expression analysis!

(2)

Reminder: we are measuring expression of protein-coding genes by transcript abundance

[Figure: Chromosome → Gene → Transcript → Protein. Expression is measured as mRNA abundance (transcript sequence copies).]

(3)

A typical experimental setup

[Figure: two conditions (Condition 1 and Condition 2), each with tissue samples 1, 2, ..., n. Workflow: mRNA sequencing (RNA-seq) → Data processing → Expression analysis]

(4)

Raw sequence generation → Sequence filtration → Mapping → Mapping filtration → Computing gene expression values → Gene expression analysis

(5)

Sequence file formats

› Output files from sequencing equipment

› Can be MANY and HUGE data files!

A common file format is FASTQ (example with 36 bp reads):

    @OBAN:8:1:2:902#0/1
    AAAGCTTGTTTTTTCCCTACANCTGTATCCTTTCTT
    +OBAN:8:1:2:902#0/1
    aaaVa]aY]aaba_RYZ``X[DN_[X_]PU_aaZ_a
    @OBAN:8:1:2:1718#0/1
    TAAATATAACATTCTTTCCACNACACTTTCTAGGAC
    +OBAN:8:1:2:1718#0/1
    aaaaaaaaa`aaa]aaa[`_XD\\`\`a^[^]PFXW
    . . .
    @OBAN:8:1:1114:370#0/1
    GGAAGGCAGCGAACATCTGTTCAATCTCCTCCTTGG
    +OBAN:8:1:1114:370#0/1
    a^aba\]_[_]Wa[[Z]__M^YNX``]]^a`_XTXB

Each sequence (number 1, 2, ..., n) is a four-line record: an @ identifier line, the DNA sequence, a + separator line, and the quality sequence with one character per base.
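The four-line record layout can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `parse_fastq` are illustrative, not part of any tool mentioned in the course.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str      # identifier without the leading '@'
    sequence: str  # DNA sequence
    quality: str   # one quality character per base


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes no wrapped lines)."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must have four lines per record")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


# First record from the slide.
example = [
    "@OBAN:8:1:2:902#0/1",
    "AAAGCTTGTTTTTTCCCTACANCTGTATCCTTTCTT",
    "+OBAN:8:1:2:902#0/1",
    "aaaVa]aY]aaba_RYZ``X[DN_[X_]PU_aaZ_a",
]
records = list(parse_fastq(example))
print(records[0].name, len(records[0].sequence))
```

For real data files, reading line by line from an open file handle instead of a list avoids loading a huge file into memory.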

(6)

Sequence format conversions

If a specific sequence file format is required for downstream analysis or processing, conversion can be done with programs such as maq or fq_all2std.pl

    Program: maq (Mapping and Assembly with Qualities)
    Version: 0.7.1
    Contact: Heng Li <[email protected]>
    Usage:   maq <command> [options]

    Format converting:
        sol2sanger   convert Solexa FASTQ to standard/Sanger FASTQ
        mapass2maq   convert mapass2's map format to maq's map format
        bfq2fastq    convert BFQ to FASTQ format
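To illustrate what a Solexa-to-Sanger conversion involves, here is a minimal sketch of the quality-string arithmetic. It assumes the input uses Solexa-scaled scores with an ASCII offset of 64 and that the target is Phred scores with an offset of 33; it is not maq's own code, and the function name `solexa_to_sanger` is hypothetical. Files from later Illumina pipelines that already store Phred scores with offset 64 would only need the offset changed.

```python
import math


def solexa_to_sanger(quality: str) -> str:
    """Convert a Solexa+64 quality string to a Sanger (Phred+33) string."""
    converted = []
    for char in quality:
        solexa_q = ord(char) - 64
        # Solexa score -> Phred score: Q_phred = 10 * log10(10^(Q_solexa/10) + 1)
        phred_q = 10 * math.log10(10 ** (solexa_q / 10) + 1)
        converted.append(chr(int(phred_q + 0.5) + 33))
    return "".join(converted)


# Solexa scores 40, 0 and -5 become Phred scores 40, 3 and 1.
print(solexa_to_sanger("h@;"))
```

The two scales agree closely for high scores and differ mainly for low-quality bases.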

(7)

Filtering on raw sequences

› Quality

› Read count per tissue/sample

› Uniqueness

› Trimming
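The filtering criteria above could be combined as in the following sketch. It is one possible interpretation, not the course's own procedure: it trims low-quality bases from the 3' end, drops reads that become too short or contain N, and keeps one copy of each distinct sequence. The thresholds, the offset of 64 and the function names are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def trim_3prime(seq: str, qual: str, min_q: int, offset: int) -> Tuple[str, str]:
    """Cut bases from the 3' end until the last base reaches min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(reads: Iterable[Read], min_q: int = 20, min_len: int = 20,
                 offset: int = 64) -> List[Read]:
    """Trim, drop short or N-containing reads, keep one copy per sequence."""
    seen = set()
    kept = []
    for name, seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, min_q, offset)
        if len(seq) < min_len or "N" in seq:
            continue
        if seq in seen:  # uniqueness: an identical sequence was already kept
            continue
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept


reads = [
    ("r1", "ACGTACGTAC", "hhhhhhhhBB"),  # last two bases are low quality
    ("r2", "ACGTACGT", "hhhhhhhh"),      # duplicate of trimmed r1
    ("r3", "ACNTACGTAC", "hhhhhhhhhh"),  # contains an uncalled base
    ("r4", "TTTT", "hhhh"),              # too short
]
kept = filter_reads(reads, min_q=20, min_len=6)
print(kept)
```

Whether duplicate sequences should be collapsed depends on the analysis; for counting transcript copies, collapsing would change the counts, so that step may need to be skipped. The number of reads left per sample is simply `len(kept)`.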

(8)

Mapping sequences

Choice of suitable program

The choice depends on the purpose of your analysis. There are many alignment programs; some commonly used examples are:

› Bowtie: Basic mapping to reference sequence

› Maq/BWA: Mapping and SNP/indel detection

(9)

Mapping sequences

Choice of reference sequence database – what to map against?

Genome reference: common if you want to do transcript assembly, study alternative splicing and detect SNPs/indels

Transcript reference: common if you want to do ‘simple’ read mapping, count transcript copies and analyze expression levels

(10)

Mapping sequences

Common issues

Reference database not available.

(11)

Mapping sequences

Common issues

A read that does not map uniquely will typically be assigned at random to one of its matching locations

Gene A: ATCATCGGGCCATCGATTAGCTGATCGGACGCTA

Gene B: TTTTCCTCTTTATCGATTAGCTGGGGGT

Read: ATCGATTAGCTG (matches both Gene A and Gene B)
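The ambiguity can be reproduced with the two sequences from the slide. This is a minimal sketch using exact substring matching, which is far simpler than what a real aligner does; the helper `exact_hits` and the random assignment with a fixed seed are illustrative assumptions.

```python
import random
from typing import Dict, List

genes: Dict[str, str] = {
    "GeneA": "ATCATCGGGCCATCGATTAGCTGATCGGACGCTA",
    "GeneB": "TTTTCCTCTTTATCGATTAGCTGGGGGT",
}


def exact_hits(read: str, references: Dict[str, str]) -> List[str]:
    """Return the names of all references that contain the read exactly."""
    return [name for name, seq in references.items() if read in seq]


hits = exact_hits("ATCGATTAGCTG", genes)
is_unique = len(hits) == 1

# With no unique hit, a mapper that reports one location has to pick one.
rng = random.Random(0)
assigned = rng.choice(hits)
print(hits, is_unique, assigned)
```

Checking how many reads fall into the non-unique group before counting helps to judge how much of a gene's count may come from random assignment.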

(12)

Mapping sequences

(13)

Example: map reads to reference database with Bowtie

Build reference database index:

bowtie-build NC_002127.fna e_coli_O157_H7

Test build:

bowtie -c e_coli_O157_H7 GCGTGAGCTATGAGAAAGCGCCACGCTTCC

Map/align reads to references:
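When the mapping step is part of a scripted pipeline, the command line can be assembled in Python. The sketch below is an assumption about how such a call might look for Bowtie 1 with SAM output (`-S`); it is not the command from the original slide, and the file names `reads.fq` and `reads.sam` are hypothetical. Depending on the quality encoding of the FASTQ file, additional quality-related options may be needed.

```python
from typing import List


def bowtie_sam_command(index: str, fastq: str, sam_out: str) -> List[str]:
    """Build a Bowtie 1 command line that writes alignments in SAM format."""
    # Positional arguments: index basename, reads file, output file.
    return ["bowtie", "-S", index, fastq, sam_out]


command = bowtie_sam_command("e_coli_O157_H7", "reads.fq", "reads.sam")
print(" ".join(command))
# To execute it (requires bowtie on the PATH):
#   import subprocess; subprocess.run(command, check=True)
```

Building the command as a list keeps one such call per sample easy to generate and to log for documentation.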

(14)

Sequence Alignment/Map format (SAM: Bowtie output example)

Standard alignment file format:

OBAN:8:1:3:1366#0/1 16 ENSSSCG00000004803|ENSSSCT00000005302|ACTC1 1250 255 36M * 0 0 GTCTACTTTACGTTCAGGATGACAGGTTAATGCTTC VXG^Z^`_ZVUYZ]T`ZQ\_U\W[X_^aaa`[R\\R XA:i:0 MD:Z:36 NM:i:0

OBAN:8:1:3:285#0/1 4 * 0 0 * * 0 0 AGGTATTGGGTTTGGGGGCCTTACACACCAGGTGGA `VOW^b`RVRS`aUQMT[Z^_a_`_Y_]Y]\RNTVW XM:i:0

OBAN:8:1:3:672#0/1 4 * 0 0 * * 0 0 TGGGTATACAGTTCATCCAGTACCCGCTCCGGCTTC a`^\Y_^``]``aQ]a^^_VP`[[\^[SY[YQJWUB XM:i:0
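The first fields of each SAM record are the read name, a bitwise flag, the reference name, the mapping position, the mapping quality and the CIGAR string. The sketch below reads those six fields and decodes the two flag bits seen in the example (4 = unmapped, 16 = mapped to the reverse strand). The names `SamRecord` and `parse_sam_line` are illustrative; real SAM files are tab-delimited and start with header lines beginning with `@`, which this sketch does not handle.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flag
    rname: str  # reference name, '*' if unmapped
    pos: int    # 1-based leftmost mapping position, 0 if unmapped
    mapq: int   # mapping quality
    cigar: str  # alignment description, e.g. 36M

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamRecord:
    """Parse the first six fields of one alignment line."""
    fields = line.split()  # whitespace split is enough for these six fields
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


mapped_line = (
    r"OBAN:8:1:3:1366#0/1 16 ENSSSCG00000004803|ENSSSCT00000005302|ACTC1 "
    r"1250 255 36M * 0 0 GTCTACTTTACGTTCAGGATGACAGGTTAATGCTTC "
    r"VXG^Z^`_ZVUYZ]T`ZQ\_U\W[X_^aaa`[R\\R XA:i:0 MD:Z:36 NM:i:0"
)
unmapped_line = (
    r"OBAN:8:1:3:285#0/1 4 * 0 0 * * 0 0 "
    r"AGGTATTGGGTTTGGGGGCCTTACACACCAGGTGGA "
    r"`VOW^b`RVRS`aUQMT[Z^_a_`_Y_]Y]\RNTVW XM:i:0"
)
mapped = parse_sam_line(mapped_line)
unmapped = parse_sam_line(unmapped_line)
print(mapped.rname, mapped.pos, mapped.is_reverse, unmapped.is_unmapped)
```

In the example, the first read maps to the reverse strand of the ACTC1 transcript at position 1250 with a full-length match (36M), while the other two reads are unmapped.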

(15)

Filtering sequence mapping

› Low percent reads mapped

› Low number of genes covered
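Both criteria can be computed from the flag and reference-name fields of the mapping output. This is a minimal sketch that takes (flag, reference name) pairs as input; the function name `mapping_summary` and the sample values are hypothetical, and the cut-offs for what counts as "low" are left to the analyst.

```python
from typing import Iterable, Tuple


def mapping_summary(alignments: Iterable[Tuple[int, str]]) -> Tuple[float, int]:
    """Return (percent of reads mapped, number of distinct references hit)."""
    total = 0
    mapped = 0
    references = set()
    for flag, rname in alignments:
        total += 1
        if not flag & 0x4:  # flag bit 4 marks an unmapped read
            mapped += 1
            references.add(rname)
    percent = 100.0 * mapped / total if total else 0.0
    return percent, len(references)


sample = [(16, "ACTC1"), (4, "*"), (4, "*"), (0, "ACTC1")]
percent_mapped, genes_covered = mapping_summary(sample)
print(percent_mapped, genes_covered)
```

Samples whose values fall far below those of the other samples are candidates for exclusion or re-sequencing.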

(16)

Computing absolute expression

Counting transcript copies

Mapping output (Sample 1):

GeneA ATCGATTAGAC
GeneA ATGGGCTGCAG
GeneB ATTTCGGCTGC
GeneC ATCCCTCCCTA
GeneC GGGCTGGCTGC
GeneC GCCGGCGGCAA

Count copies

Count (Sample 1):

GeneA 2
GeneB 1
GeneC 3
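The counting step amounts to tallying how many mapped reads name each gene. A minimal sketch with the reads from the slide, using Python's `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# (gene, read sequence) pairs from the mapping output of Sample 1.
mapping_output = [
    ("GeneA", "ATCGATTAGAC"),
    ("GeneA", "ATGGGCTGCAG"),
    ("GeneB", "ATTTCGGCTGC"),
    ("GeneC", "ATCCCTCCCTA"),
    ("GeneC", "GGGCTGGCTGC"),
    ("GeneC", "GCCGGCGGCAA"),
]

counts = Counter(gene for gene, _read in mapping_output)
for gene in sorted(counts):
    print(gene, counts[gene])
```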

(17)

Creating a gene expression matrix

› Concatenate the per-sample counts and pivot

› Column IDs defined by tissue samples

› Row IDs defined by gene/transcript IDs

Gene    Sample1  Sample2  .  SampleN
GeneA   2        4
GeneB   1        0
GeneC   3        14
.       .        .
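The matrix can be built by collecting the per-sample counts and filling in zero where a gene has no reads in a sample. A minimal sketch in plain Python with the values from the table; the function name `build_matrix` is hypothetical.

```python
from typing import Dict, List


def build_matrix(per_sample: Dict[str, Dict[str, int]]) -> List[List[object]]:
    """Pivot per-sample gene counts into rows of [gene, count, count, ...]."""
    samples = list(per_sample)
    genes = sorted({g for counts in per_sample.values() for g in counts})
    rows: List[List[object]] = [["Gene"] + samples]
    for gene in genes:
        # A gene missing from a sample gets a count of 0.
        rows.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return rows


per_sample = {
    "Sample1": {"GeneA": 2, "GeneB": 1, "GeneC": 3},
    "Sample2": {"GeneA": 4, "GeneC": 14},
}
matrix = build_matrix(per_sample)
for row in matrix:
    print("\t".join(str(value) for value in row))
```

The tab-separated output can be read directly by most downstream analysis tools.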

(18)

Filtering genes by absolute expression

› Number of reads per gene per sample (alternatively, summed over all samples)
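Both variants of the filter can be sketched as follows: either every sample must reach the minimum read count, or the total over all samples must. The threshold of 3 reads and the function name `filter_genes` are assumptions for illustration.

```python
from typing import Dict, List


def filter_genes(matrix: Dict[str, List[int]], min_reads: int,
                 per_sample: bool = True) -> Dict[str, List[int]]:
    """Keep genes whose counts reach min_reads per sample or in total."""
    kept = {}
    for gene, counts in matrix.items():
        if per_sample:
            passes = all(count >= min_reads for count in counts)
        else:
            passes = sum(counts) >= min_reads
        if passes:
            kept[gene] = counts
    return kept


matrix = {"GeneA": [2, 4], "GeneB": [1, 0], "GeneC": [3, 14]}
strict = filter_genes(matrix, min_reads=3)
pooled = filter_genes(matrix, min_reads=3, per_sample=False)
print(sorted(strict), sorted(pooled))
```

The per-sample rule is stricter: GeneA passes only when its counts are pooled across samples.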

(19)

Transformation of expression values

Adjust for technical differences like total read count per sample

› Depends on the downstream analysis tool

› Some tools read absolute counts and do the transformation/normalization themselves

› Relative abundance (RA)

› Log transformation of RA

› Reads per kilobase per million mapped reads (RPKM; Mortazavi et al., 2008)
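The three transformations can be sketched for a single sample as follows. RPKM is computed as 10^9 × C / (N × L), where C is the read count of the gene, N the total number of mapped reads in the sample and L the transcript length in bp. The gene lengths below are hypothetical, and N is taken here as the sum of the counts shown, which is an assumption; in practice it is the total number of mapped reads of the sample.

```python
import math
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the sample's reads assigned to each gene."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}


def log2_relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """log2 of relative abundance; genes with zero reads are skipped."""
    ra = relative_abundance(counts)
    return {gene: math.log2(value) for gene, value in ra.items() if value > 0}


def rpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {gene: count * 1e9 / (lengths_bp[gene] * total)
            for gene, count in counts.items()}


counts = {"GeneA": 2, "GeneB": 1, "GeneC": 3}
lengths_bp = {"GeneA": 1000, "GeneB": 500, "GeneC": 2000}  # hypothetical lengths

ra = relative_abundance(counts)
log_ra = log2_relative_abundance(counts)
values = rpkm(counts, lengths_bp)
print(ra, log_ra, values)
```

With these lengths, GeneA has twice the reads of GeneB but is also twice as long, so both get the same RPKM; this is the length correction that relative abundance alone does not provide.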

(20)

Carry on with gene expression analysis!

› Differential expression

› Clustering

(21)

Some general considerations

› Linux is the best environment for handling and analyzing huge data files

› Learning some of the Linux commands can be helpful (grep, sed, cut, awk)

› Learning Perl or R programming can also help with processing text data files

› Use batch files to build data processing pipelines (documentation and re-use)

› Get used to switching between various tools for processing, analysis and visualization

› Check input/output files, you are responsible, not the software/script authors!

(22)

Resources for NGS data

› Forum for discussing NGS data analysis: http://www.seqanswers.com

› Galaxy online tools: http://main.g2.bx.psu.edu

› NCBI Short Read Archive (SRA): http://trace.ncbi.nlm.nih.gov/Traces/sra

› Bioconductor packages for NGS (ShortRead, Biostrings, edgeR, Rsamtools, biomaRt): http://www.bioconductor.org/help/workflows/high-throughput-sequencing
