Building Excellence in Genomics and Computational Bioscience
Data formats and file conversions
Richard Leggett (TGAC) John Walshaw (IFR)
Common file formats
• Raw sequence: FASTQ, FASTA
• Databases: EMBL, GenBank, UniProt
• Alignments: SAM, BAM, MSF
• Annotation: BED, VCF, WIG, GFF
FASTQ files
• e.g. Illumina read files
• 4 lines per read
• Stores sequence and quality information
@HWI-ST790:234:D0W8BACXX:1:1101:1792:2000 1:N:0:GCCAA
ACNATTAACAACCTTGGTGTTCAGCATGAGAACTTATCTGCAGCTGAGTCTCGTATCCGTGACG
+
CC#4ADDFHHHHHIIIIEGHIIIIIIIIIIGIIIIIIIIIIIIIIIIIDGHHIDHHIII6@FGI
@HWI-ST790:234:D0W8BACXX:1:1101:2592:1999 1:N:0:GCCAA
CTNGAATGCAGGTAGAATACATCTCCCGGATAAGCCTCGCGGCCCCCGGGGCGGGGGGGGAGAG
+
:=#44AA?:<DFFE>FED?3A<EHH>FIF?ADGCGBA?D#########################
@HWI-ST790:234:D0W8BACXX:1:1101:4221:1999 1:N:0:GCCAA
(Each record: read ID line, sequence line, '+' separator line, quality line)
• Sanger format quality scores 0-93
• Encoded with ASCII characters 33-126
• Older versions of Illumina software use a slightly different encoding
• Q score relates to probability, p, that base is incorrect: Q = -10 × log10(p)
• What this means…
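The relationship between quality characters, Q scores and error probabilities can be illustrated with a short Python sketch. It assumes Sanger-style encoding (ASCII offset 33) and the standard Phred definition Q = -10 × log10(p); the function names are illustrative and do not come from any of the toolkits mentioned in these slides.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of integer Q scores.

    offset=33 is the Sanger convention; this is an assumption about the input.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base with Phred score q is wrong: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # First ten quality characters of the first example read above
    quality = "CC#4ADDFHH"
    for ch, q in zip(quality, phred_scores(quality)):
        print(f"{ch!r}: Q{q}, p = {error_probability(q):.4f}")
```

Under this definition Q10 corresponds to a 1 in 10 chance of error, Q20 to 1 in 100 and Q30 to 1 in 1000.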
FASTA files
• e.g. assembler contigs
• Stores ID and sequence data only
• Sequence data can cover multiple lines
>contig1
ACNATTAACAACCTTGGTGTTCAGCATGAGAACTTATCTGCAGCTGAGTCTCGTATCCGTGACG
CTGAGTCTCGTATCCGTGACGGTTAGGGCGATTAGCATAGA
>contig2
TGACTAGCGGATTCGATTCGGAGGCTTATGGGCATTCCAGATGCAGCTAGCAGATGACATAGAT
GGGCATT
>contig3
CCCCCCTGACTAGCGGATTCGGTTCAGCATGAGTACGAATTCGGAGGCTTATGGGCATTCCAGA
AGCGTGCAGCTAGCAGATGAAGCGCATAGATGGGCTATTGTTCAGCATGAGCTGATCAACTACG
TACGGGACTGAGATGCCATGCAGTTGG
>contig4
TGACTAGCTAGTGGATTGACGAC
(Each record: a '>' line with the sequence ID, followed by the sequence)
• Numerous options for manipulating FASTA/FASTQ files:
– FASTX toolkit – conversion, quality statistics, clipping, renaming, trimming, reverse complement, formatting & more.
– NGSUtils – suite of utils for working with NGS datasets.
– EMBOSS sequence analysis package – mature package which can do a lot.
– Many other programs/scripts or collections of scripts are available for common tasks – Google can help find them!
– Simple manipulations possible even with one-line commands in UNIX/Linux shells – see “Introduction to Linux” session!
FASTQ to FASTA conversion
• Using FASTX Toolkit
$ fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-r] = Rename sequence identifiers to numbers.
[-n] = keep sequences with unknown (N) nucleotides. Default is to discard such sequences.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT. If [-o] is not specified (and output goes to STDOUT), report will be printed to STDERR.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA output file. default is STDOUT.
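To show what such a conversion involves, here is a minimal Python sketch of the same idea. It is not the FASTX Toolkit implementation: it assumes strict four-line FASTQ records, stops at the end of input (or at the first blank line), and keeps reads containing N, which corresponds to running fastq_to_fasta with -n. The function name is hypothetical.

```python
import io
import sys


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy the ID and sequence of each four-line FASTQ record to FASTA.

    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of input (or a blank line)
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, not stored in FASTA
        if not header.startswith("@"):
            raise ValueError(f"Expected '@' at start of record, got: {header!r}")
        fasta_handle.write(f">{header[1:]}\n{sequence}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory example; real use would pass open file handles instead
    example = io.StringIO("@read1\nACNATT\n+\nCC#4AD\n")
    n = fastq_to_fasta(example, sys.stdout)
    print(f"{n} record(s) converted", file=sys.stderr)
```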
• No one killer app for interleaving paired-end FASTQ files:
– shuffleSequences_fastq.pl – comes with Velvet in the ‘contrib’ directory.
– Interleave_fastq.py
• Example with shuffleSequences:
shuffleSequences_fastq.pl file_R1.fastq file_R2.fastq file_R1R2.fastq
• It is rarely necessary to go back (de-interleave), but popgentools has a script for this called split-interleaved-fastq.pl.
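Interleaving simply alternates records from the two files: read 1 of pair 1, read 2 of pair 1, read 1 of pair 2, and so on. The Python sketch below illustrates this under stated assumptions: four-line records, newline-terminated input, and both files listing the pairs in the same order. It is not the code of shuffleSequences_fastq.pl or Interleave_fastq.py, and the function names are hypothetical.

```python
import io
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as lists of four lines (assumes no multi-line records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("Truncated FASTQ record")
        yield record


def interleave(r1_handle, r2_handle, out_handle):
    """Write R1 and R2 records alternately; return the number of pairs written."""
    pairs = 0
    r2_records = read_records(r2_handle)
    for rec1 in read_records(r1_handle):
        rec2 = next(r2_records, None)
        if rec2 is None:
            raise ValueError("R2 file has fewer reads than R1 file")
        out_handle.writelines(rec1 + rec2)
        pairs += 1
    if next(r2_records, None) is not None:
        raise ValueError("R1 file has fewer reads than R2 file")
    return pairs


if __name__ == "__main__":
    r1 = io.StringIO("@pair1/1\nACGT\n+\nIIII\n")
    r2 = io.StringIO("@pair1/2\nTTGA\n+\nHHHH\n")
    out = io.StringIO()
    interleave(r1, r2, out)
    print(out.getvalue(), end="")
```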
Splitting FASTA/Q files into chunks
• For example, to spread alignment load.
• For FASTA files:
– Using fastasplit (Exonerate)
fastasplit -f in.fasta -o outdir -c 100
• For FASTQ files:
– As long as not multi-line FASTQ, can use the Linux split command (the line count must be a multiple of 4 so that no read is split between chunks):
split -l 1000 in.fastq outprefix_
– Using NGSUtils:
fastqutils split in.fastq outprefix_ 100
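When using split on a FASTQ file, the value given to -l has to be worked out from the number of chunks wanted. The following Python sketch shows one way to do that arithmetic, assuming four-line FASTQ; the function name is illustrative only.

```python
import math


def lines_per_chunk(total_lines, n_chunks):
    """Return a per-chunk line count, a multiple of 4, giving about n_chunks chunks."""
    if total_lines % 4 != 0:
        raise ValueError("Line count is not a multiple of 4; is this four-line FASTQ?")
    reads = total_lines // 4
    reads_per_chunk = math.ceil(reads / n_chunks)  # round up so no reads are left over
    return reads_per_chunk * 4


if __name__ == "__main__":
    # e.g. a file with 4000 lines (1000 reads) split into about 3 chunks
    print(lines_per_chunk(4000, 3))
```

The total line count can be obtained with wc -l in.fastq, and the result passed to split, e.g. split -l 1336 in.fastq outprefix_ for the example above.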
1. Convert the file example.fastq in the Documents directory into a FASTA file.
2. Interleave the two LIB6574 files inside Documents/reads to make a single FASTQ file.
3. Split the file exreads.fastq in the Documents directory into approximately 5 chunks.
4. Split the file example.fastq in the Documents directory into approximately 3 chunks.
• Primary nucleotide DBs have their own native formats
– ENA db: EMBL format
– NCBI Nucleotide db (“GenBank”): GenBank format
– DDBJ: DDBJ format – very similar to GenBank
• Primary protein DBs likewise:
– UniProt Knowledgebase: Swiss-Prot format
• Essentially the same as EMBL format
– NCBI Protein db: GenBank format (“GenPept”)
• Most sequence DBs will also provide the data in FASTA format
• Other DBs (e.g. for a particular genome-sequencing project) might use their own or standard formats
• We will query ENA for some entries representing (partial) gene sequences of Purple Osier Willow
– Obtain an entry in native ENA (“EMBL”) format
– And FASTA format
– And repeat the query in the NCBI Nucleotide DB to obtain the equivalent record in Genbank format
• In a different search, we will query the Sequence Read Archive (SRA) to obtain FASTA- and FASTQ-format data from the genome-sequencing project of the same Willow
– We will use the NCBI implementation of SRA
– (the ENA or DRA versions could be used for the same search)
– This sequencing project used 454 sequencing – keeps the data sets
• http://www.ebi.ac.uk/ena/
• Search ENA for: Salix purpurea
• Examine the hit-list of coding sequences
• Choose an entry representing a whole (not partial) gene
• Obtain native (EMBL) format and FASTA-format files of this entry
• Make a note of the Accession number of the record
Exercise: Sequence databases (2)
• Examine the EMBL-format record:
• Can you see cross-references to other databases?
• Any to the UniProt KnowledgeBase?
• Make a note of any cross-reference to UniProtKB which you see.
Extra exercise if you have time:
Find, examine and download in Swiss-Prot format this
Exercise: Sequence databases (3)
• http://www.ncbi.nlm.nih.gov/
• Change ‘All Databases’ to ‘Nucleotide’ and search for Salix purpurea
• To narrow down the hit list, click Advanced (under the search box)
• Restrict the search to:
– Organism = Salix purpurea
– Entries which do NOT have ‘partial cds’ in any field
• How many of the hits appear to be protein-coding sequences?
• The entry equivalent to the one found in the ENA search should be in the list. What is its
Exercise: Sequence databases (4)
• Obtaining read data sets (FASTA and/or FASTQ) from SRA
• http://www.ncbi.nlm.nih.gov/ - change DB to search to SRA; search for Salix purpurea
• The hit list is a list of sequencing ‘experiments’
– Accession of an SRA experiment begins with SRX…
• Among the hit list look for those annotated as ‘random whole genome shotgun library’
• Note that these are 454 (GS FLX) sequence reads – each set is much smaller than the other (Illumina GA II) sets
• Pick the smallest experiment (read set) (should take you here: http://www.ncbi.nlm.nih.gov/sra/SRX029333)
Exercise: Sequence databases (5)
• Each experiment is associated with one or more sequencing runs. This experiment has only one run. Click on the link (SRR070318)
• Click the Reads tab.
• Individual reads can be examined. But here we will download the set in bulk. Click on the Filtered Download button
• Select ‘clipped’ and ‘FASTA’; click Download
• This will deliver the whole set of reads (auto quality-clipped) in a single compressed (gzipped) file
Alignments
• SAM format – Sequence Alignment/Map
• BAM format – binary version of SAM (compressed, more efficient)
• Use SAMtools to process.
[Figure: reads CTTAGTCC, CTTGGTCT and CTAAGCTA aligned to the reference CTTAGTCCTTAGTCTACTAGTGT; one read contains an insertion]
The SAM file
Read1 0 TheRef 3 178 8M * 0 0 CTTAGTCC EEDDEEDE AS:i:8 XS:i:0
Read2 16 TheRef 10 150 8M * 0 0 CTTGGTCT FFEEDDEE AS:i:7 XS:i:0
Read3 0 TheRef 16 120 3M2I3M * 0 0 CTAAGCTA GGGHHHHH AS:i:5 XS:i:0
Columns: Read ID, Flags, Ref ID, Pos, MAPQ, CIGAR, Mate (reference, position, template length), Read, Qualities, Optional fields
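The columns of a SAM record can be picked apart with a few lines of Python. This is a sketch only, not a replacement for SAMtools: it assumes tab-separated fields as in real SAM files (the slide shows them separated by spaces), handles alignment lines only (not '@' header lines), and uses hypothetical function and key names.

```python
import re

# One CIGAR operation: a length followed by an operation letter
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

FIELD_NAMES = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields and optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIELD_NAMES, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Flag bit 16 (0x10) is set when the read aligns to the reverse strand
    record["reverse_strand"] = bool(record["flag"] & 0x10)
    # e.g. "3M2I3M" -> [(3, "M"), (2, "I"), (3, "M")]
    record["cigar_ops"] = [(int(n), op) for n, op in CIGAR_OP.findall(record["cigar"])]
    record["tags"] = fields[11:]
    return record


if __name__ == "__main__":
    line = "Read3\t0\tTheRef\t16\t120\t3M2I3M\t*\t0\t0\tCTAAGCTA\tGGGHHHHH\tAS:i:5\tXS:i:0"
    rec = parse_sam_line(line)
    print(rec["qname"], rec["pos"], rec["cigar_ops"], rec["reverse_strand"])
```

In the third example record, the CIGAR string 3M2I3M describes three aligned bases, a two-base insertion relative to the reference, then three more aligned bases, which accounts for all eight bases of the read.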
• SAMtools tools:
– view – filter SAM or BAM
– sort – sort according to position on reference
– index – create an index for fast look-up in a sorted BAM file
– tview – text viewer for alignments
– mpileup – generate pileup (BCF) file, e.g. for SNP calling
– merge – merge sorted alignments
– rmdup – remove potential PCR duplicates
– and more…
• For more info:
http://samtools.sourceforge.net/samtools.shtml
Multiple Sequence Alignments
• Sequence read alignment (“assembly”)
• Each nucleotide position (column) represents multiple copies of the same base of an original
• Multiple protein or nucleotide sequence alignment
• Each position (column) represents a homologous nucleotide (or amino acid).
• Sequences are evolutionarily related (homologous) sequences, typically from different organisms, and/or
• Various file formats for MSA
• A multiple alignment can be represented in FASTA format
• MSA-dedicated formats are more richly annotated and more flexible for some purposes
– MSF
– Stockholm
– Selex
– …and others
• Each nucleotide or amino acid, and indel, is represented explicitly
– C.f. SAM/BAM
Multiple Sequence Alignments
• Many (but not all) sequence formats are flatfiles – they consist of plain-text characters
• It may be convenient to:
– Examine a file’s contents, e.g.
• UNIX/Linux ‘less’
• Text editor, e.g. gedit
• Can be useful as a quick ‘sanity check’
– perform a single operation on a single sequence manually
• But if even a simple manual operation is to be repeated many times, errors are likely
• Manual operations likely to be infeasible for large sequence sets
– Or possible, but very timewasting
– If you find yourself doing something repetitive using interactive tools, ask yourself if there might be an easier way
– Often the answer is, there must be an easier way
• Repetitive chains of operations:
– Data set A, in file A1
• Reformat file A1 → file A2
• Input file A2 into tool X → (output) file A3
• Reformat file A3 → file A4
• Input file A4 into tool Y → (output) file A5
– Next week, repeat on Data set B…
• Use automated pipelines
– Re-useability of analysis steps/tools
– In different combinations for different purposes
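The idea of re-usable steps combined into a pipeline can be sketched in a few lines of Python. Everything here is a placeholder: the two step functions merely stand in for real reformatting and analysis tools, and the data sets are invented for illustration.

```python
def run_pipeline(steps, data):
    """Apply each step function in order, feeding each output to the next step."""
    for step in steps:
        data = step(data)
    return data


def to_upper(seqs):
    """Placeholder 'reformat' step: normalise sequences to upper case."""
    return [s.upper() for s in seqs]


def drop_short(seqs, min_length=5):
    """Placeholder 'tool' step: discard sequences shorter than min_length."""
    return [s for s in seqs if len(s) >= min_length]


# The same chain of steps is defined once...
pipeline = [to_upper, drop_short]

# ...and re-used unchanged on each new (invented) data set
dataset_a = ["acgtac", "acg"]
dataset_b = ["ttgacca", "gg", "cccaaat"]

if __name__ == "__main__":
    print(run_pipeline(pipeline, dataset_a))
    print(run_pipeline(pipeline, dataset_b))
```

Because the chain is written down once, repeating it on data set B next week involves no manual steps, and the same step functions can be recombined in a different order for a different purpose.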
• A real-world example (but not with this actual sequence)
• A plant scientist working on a particular gene/protein asked a bioinformatician colleague to do some analyses on the protein sequence, along with those from the same family in related plants.
• The sequences were emailed to the bioinformatician.
• Unsurprisingly, the family of proteins exhibited numerous amino acid substitutions, and insertions/deletions
• It was noticed that one sequence alone had two instances of an inserted dipeptide, Phenylalanine-Threonine. These were 59 amino acids apart, and appeared to be absent from all related proteins in the databases.
>WillowMatK
FSDSAIIDRFVRICRNLSHYYSGSSRKKSLYRIKYILRLSCVKTLFTARKHKSTVRIFLK
RLGSELLDEFFTEEEQILFLTFPRVSSISQKLYRGRVWYLDIICINFTELSNHE
The (t)errors of cut-and-paste
ID   AJ849584; SV 1; linear; genomic DNA; STD; PLN; 622 BP.
…
…
DE   Salix purpurea chloroplast partial tRNA-Lys gene intron and partial matK
DE   gene for maturase K, clone A
XX
KW   matK gene; maturase K; tRNA-Lys.
…
…
XX
FT                   /gene="matK"
FT                   /product="maturase K"
FT                   /db_xref="GOA:A0ZVW3"
FT                   /db_xref="InterPro:IPR024937"
FT                   /db_xref="UniProtKB/TrEMBL:A0ZVW3"
FT                   /protein_id="CAH74183.1"
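One plausible reading of this example, not stated explicitly in the text above, is that the spurious Phe-Thr dipeptides were the "FT" line codes of the EMBL feature table, copied along with a multi-line /translation qualifier. The Python sketch below shows how the translation could be extracted programmatically instead, dropping the line codes. It assumes the usual EMBL layout (each feature-table line starts with "FT", and the qualifier value is enclosed in double quotes); the function name and the short test sequence are invented for illustration.

```python
def extract_translation(embl_lines):
    """Return the protein sequence from the first /translation qualifier in 'FT' lines."""
    prefix = '/translation="'
    residues = []
    inside = False
    for line in embl_lines:
        if not line.startswith("FT"):
            if inside:
                break  # feature table ended before the closing quote
            continue
        text = line[2:].strip()  # drop the 'FT' line code and the padding
        if not inside:
            if not text.startswith(prefix):
                continue
            text = text[len(prefix):]
            inside = True
        if text.endswith('"'):
            residues.append(text[:-1])  # closing quote marks the end of the value
            break
        residues.append(text)
    return "".join(residues)


if __name__ == "__main__":
    # Invented record fragment, not taken from AJ849584
    lines = [
        'FT                   /product="example protein"',
        'FT                   /translation="MKLV',
        'FT                   AAGT"',
        "XX",
    ]
    print(extract_translation(lines))
```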
• FASTX Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
• NGSUtils: http://ngsutils.org/
• EMBOSS: http://emboss.sourceforge.net
• Exonerate: http://www.ebi.ac.uk/~guy/exonerate/
• Velvet: https://www.ebi.ac.uk/~zerbino/velvet/
• Interleave_fastq.py: https://gist.github.com/ngcrawford/2232505
• popgentools: http://code.google.com/p/popgentools/
• SAMtools: http://samtools.sourceforge.net