Bioinformatics using Python for Biologists
10.1 The SeqIO module
Many file formats are employed by the most popular databases to store information in
ways that “should” be easily interpreted by a computer program. In this case,
interpreting means extracting information (i.e. parsing) and converting it into formats
appropriate for further processing and analysis. Parsing such files is a very common
task, and one that the bioinformatician must perform accurately. However, the task can
be frustrated by the fact that the formats change quite regularly, and that they may
contain small subtleties which can break even the most well designed parsers. The
Biopython SeqIO module provides parsers for many common file formats, which generally
extract information from the input file and convert it into SeqRecord objects. There are
two methods for sequence file parsing, SeqIO.parse() and SeqIO.read(); both of them
require two mandatory arguments and accept an optional one:
− a “handle” that specifies where the data must be read from (it could be a file name, a
  file opened for reading, data downloaded from a database using a script, or the
  output of another piece of code);
− a flag indicating the format of the data (a full list of supported formats is
  available at http://biopython.org/wiki/SeqIO);
− an optional argument that specifies the alphabet of the sequence data.
The difference between SeqIO.parse() and SeqIO.read() is that SeqIO.parse() returns an
iterator that goes through all the records in the input handle, to be used in for or
while loops. SeqIO.read(), on the other hand, must be used on files containing a single
record, and returns that record directly. The arguments are the same, and in both cases
the records are SeqRecord objects.
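To make the distinction concrete, here is a minimal sketch in plain Python 3 of the two
behaviours: a generator that hands out one record at a time, and a wrapper that insists
on exactly one record. It is not Biopython's own code. The names parse_fasta and
read_fasta, and the (id, sequence) tuples standing in for SeqRecord objects, are
assumptions made for illustration only.

```python
import io


def parse_fasta(handle):
    """Yield (record_id, sequence) tuples, one per FASTA record in handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if record_id is not None:
                yield record_id, "".join(chunks)
            fields = line[1:].split()
            record_id, chunks = (fields[0] if fields else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def read_fasta(handle):
    """Return the only record in handle; raise ValueError otherwise."""
    records = parse_fasta(handle)
    try:
        record = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    if next(records, None) is not None:
        raise ValueError("More than one record found in handle")
    return record


two_records = ">seq1 first\nACGT\nAC\n>seq2 second\nGGCC\n"

handle = io.StringIO(two_records)
parsed = list(parse_fasta(handle))      # goes through every record
leftover = list(parse_fasta(handle))    # the handle is now exhausted

try:
    read_fasta(io.StringIO(two_records))
    read_error = None
except ValueError as err:
    read_error = str(err)
```

The second pass over the same handle finds nothing, which is why a handle has to be
reopened before its records can be scanned again.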
10.2 Reading local files
Let's read the file “D.rerio_calcineurin.fa”, containing fasta format records of all
entries matching the keyword “calcineurin” in the zebrafish (Danio rerio) genome
obtained from the NCBI (http://www.ncbi.nlm.nih.gov/nuccore). The SeqIO.parse() method
will generate an iterator over SeqRecord objects; features can then be extracted from
each SeqRecord object as described in Module 9:

>>> import Bio
>>> from Bio import SeqIO
>>> handle = open("D.rerio_calcineurin.fa","r")
>>> type(handle)
<type 'file'>
>>> for seq_record in SeqIO.parse(handle,"fasta"):
...     print seq_record.id
...     print repr(seq_record.seq)
...     print len(seq_record)
...
gi|326679292|ref|XM_003201225.1|
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', SingleLetterAlphabet())
2808
gi|326677866|ref|XM_003200885.1|
Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGC...TAA', SingleLetterAlphabet())
1035
...
>>> handle.close()

Since the handle is a file, it is a good habit to close it when the processing is done.
Remember that the iterator “empties” the file, meaning that to scan the records
another time, the file must be closed, then opened again, and then used again as the
handle argument to SeqIO.parse().

In a similar way, we can parse an equivalent file, this time in genbank format; here we
also omit the explicit creation of the handle and pass the file name (or its complete
path) directly to SeqIO.parse():

>>> for seq_record in \
... SeqIO.parse("D.rerio_calcineurin.gb","genbank"):
...     print seq_record.id
...     print repr(seq_record.seq)
...     print len(seq_record)
...
XM_003201225.1
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
2808
XM_003200885.1
Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGC...TAA', IUPACAmbiguousDNA())
1035
...

A few things must be noted. First, the genbank parser is able to assign the correct
alphabet to the sequence records in the input file, while the fasta parser assigns a
generic SingleLetterAlphabet(). Second, the genbank SeqRecord objects store a more
compact id attribute for the sequence records.

As mentioned before, SeqIO.parse() can process any number of records in the input
handle. SeqIO.read() instead checks whether there is only one record in the handle,
raising a ValueError if this condition is not met.
>>> # SeqIO.read() refuses a file that holds more than one record
>>> handle = open("D.rerio_calcineurin.gb","r")
>>> SeqIO.read(handle,"genbank")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SeqIO/__init__.py", line 614, in read
ValueError: More than one record found in handle

Using an iterator is a way to parse large files without consuming large amounts of
memory. On the other hand, as mentioned above, each record can be accessed only once
in the for loop. The iterator also provides a method, .next(), to access the records
step by step:

>>> handle = open("D.rerio_calcineurin.gb")
>>> iterator = SeqIO.parse(handle,"genbank")
>>> first_record = iterator.next()
>>> type(first_record)
<class 'Bio.SeqRecord.SeqRecord'>
>>> first_record.id
'XM_003201225.1'
>>> first_record.seq
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
>>> first_record.description
'PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.'
>>> second_record = iterator.next()
>>> second_record.id
'XM_003200885.1'

When the records in the file are over, the .next() method will either return the
special Python object None or raise a StopIteration exception (depending on which
Biopython release you have installed on your system).

Using this approach you could in principle assign each record to a different variable,
if you need to keep these records at hand. This is impractical if the number of records
is high, or if it is unknown beforehand. It is however possible to store all the
SeqRecord objects returned by SeqIO.parse() into a data structure such as a list:

>>> records = list\
... (SeqIO.parse("D.rerio_calcineurin.gb", "genbank"))
>>> len(records)
61
>>> records[0] # the first record
SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA()), id='XM_003201225.1',
name='XM_003201225', description='PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.', dbxrefs=[])
>>> records[0].id
'XM_003201225.1'
>>> records[0].seq
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
>>> for key,value in records[0].annotations.items():
...     print key,value
...
comment MODEL REFSEQ: This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NW_003336048) annotated using gene prediction method: GNOMON, supported by EST evidence.
Also see:
Documentation of NCBI's Annotation Process
sequence_version 1
source Danio rerio (zebrafish)
taxonomy ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
keywords ['']
accessions ['XM_003201225']
data_file_division VRT
date 23-MAR-2011
organism Danio rerio
gi 326679292
>>> records[-1] # the last record
SeqRecord(seq=Seq('GCAGCAATTTGAGGAAGAAGCGCAAACAGACAGGTCAGGTGTGGCGATGGCAGC...AAA', IUPACAmbiguousDNA()), id='BC139891.1',
name='BC139891', description='Danio rerio zgc:162913, mRNA (cDNA clone MGC:162913 IMAGE:7401269), complete cds.', dbxrefs=[])

SeqIO also provides a method, SeqIO.to_dict(), that converts the SeqRecord objects
produced by the iterator into the values of a dictionary, whose keys are the
SeqRecord.id attributes.
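The dictionary conversion can be pictured with a few lines of plain Python 3. This is
only a sketch of the principle, not Biopython's implementation: the helper name
records_to_dict and the small Record tuple standing in for SeqRecord are hypothetical,
and the descriptions are shortened versions of those in the example file.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord: only the two fields used here
Record = namedtuple("Record", ["id", "description"])


def records_to_dict(records):
    """Build {record.id: record}, refusing duplicate identifiers."""
    by_id = {}
    for record in records:
        if record.id in by_id:
            raise ValueError("Duplicate key '%s'" % record.id)
        by_id[record.id] = record
    return by_id


# Any iterable of records works, including a one-pass iterator
record_iterator = iter([
    Record("BC093219.1", "Danio rerio zgc:112142, mRNA"),
    Record("BC076024.1", "Danio rerio zgc:92347, mRNA"),
])
by_id = records_to_dict(record_iterator)

try:
    records_to_dict([Record("X.1", "first"), Record("X.1", "second")])
    duplicate_error = None
except ValueError as err:
    duplicate_error = str(err)
```

Every record is held in memory at once, which is the cost of being able to look records
up by identifier in any order.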
Note that SeqIO.to_dict() raises an exception if duplicate keys are found.
For very large numbers of records there is another method, Bio.SeqIO.index(), which
creates a dictionary-like object without keeping all the data in memory. Instead,
each key is associated with the position of the corresponding record in the file, and
when a particular record is accessed its content is parsed on the fly. This method
allows the handling of a huge number of records, at a small cost in flexibility and
speed. Moreover, these dictionary-like objects are read-only, meaning that once
created, data cannot be inserted or removed. Note that in this case the first argument
cannot be an open file handle: it must be a file name.
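The idea behind such a position-based index can be sketched in plain Python 3: scan the
data once, remember the offset at which each record starts, and jump to that offset
only when a record is requested. The class name OffsetIndex is hypothetical, the sketch
handles fasta text only and works on an in-memory handle for brevity, and it makes no
claim about how Bio.SeqIO.index() is implemented internally.

```python
import io


class OffsetIndex:
    """Read-only, dictionary-like access to FASTA sequences by record id."""

    def __init__(self, handle):
        self._handle = handle
        self._offsets = {}
        # Single pass: store where each header line starts, not the record
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(">"):
                fields = line[1:].split()
                record_id = fields[0] if fields else ""
                if record_id in self._offsets:
                    raise ValueError("Duplicate key '%s'" % record_id)
                self._offsets[record_id] = position

    def keys(self):
        return list(self._offsets)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, record_id):
        # Parse on demand: seek to the record, then read up to the next header
        self._handle.seek(self._offsets[record_id])
        self._handle.readline()  # skip the header line itself
        chunks = []
        for line in iter(self._handle.readline, ""):
            if line.startswith(">"):
                break
            chunks.append(line.strip())
        return "".join(chunks)


data = io.StringIO(">a first\nACGT\nAC\n>b second\nGGCC\n>c third\nTT\n")
index = OffsetIndex(data)
middle = index["b"]   # only this record is read and parsed
first = index["a"]    # records can be requested in any order
```

Because the class defines no item assignment, trying to add or replace an entry fails,
which mirrors the read-only behaviour described above.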
>>> # All the records in a dictionary, keyed by record id
>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
>>> for key,value in records.items():
...     print key,value.id,value.description
...
BC093219.1 BC093219.1 Danio rerio zgc:112142, mRNA (cDNA clone MGC:112142 IMAGE:7428541), complete cds.
XM_685181.5 XM_685181.5 PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 (nfatc3), mRNA.
BC076024.1 BC076024.1 Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347 IMAGE:7055812), complete cds.
...
>>> # The same file indexed by position, without loading it into memory
>>> records = SeqIO.index("D.rerio_calcineurin.gb","genbank")
>>> records.keys()
['BC093219.1', 'XM_685181.5', 'BC076024.1', 'BC091833.1', 'BC152175.1',
'NM_200899.1', 'NM_001005392.1', 'BC062840.1', 'NM_199836.1', 'BC154648.1',
'NM_001099250.1', 'BC076019.1', 'NM_001002452.1', 'BC064307.1', 'BC153488.1',
'BC122248.1', 'NM_001007413.1', 'NM_001044758.1', 'BC065451.1', 'BC093272.1',
'XM_001922343.4', 'BC065972.1', 'BC090735.1', 'NM_001017701.1', 'XM_002664259.1',
'XM_001923726.2', 'XM_678815.5', 'XM_694965.5', 'BC163337.1', 'XM_001923264.3',
'BC139891.1', 'NM_205678.1', 'NM_200854.1', 'XM_687678.4', 'BC076439.1',
'XM_001339606.4', 'BC058868.1', 'NM_214773.1', 'NM_199653.1', 'NM_001017735.1',
'NM_200042.1', 'BC071331.1', 'BC129492.1', 'BC055256.1', 'GU733827.1',
'XM_003200885.1', 'NM_200037.1', 'NM_199895.1', 'BC076514.1', 'AY639016.1',
'BC049341.1', 'BC150441.1', 'NM_001002447.1', 'BC163350.1', 'NM_001014338.1',
'NM_001045159.1', 'BC155186.1', 'BC045981.1', 'XM_003201225.1', 'BC142750.1',
'BC053153.1']
>>> print records["BC093219.1"].description
Danio rerio zgc:112142, mRNA (cDNA clone MGC:112142 IMAGE:7428541), complete cds.

10.3 Reading files from the web
As we stated before, a handle can also be used to fetch data from web databases.
Since parsing the file with an iterator “consumes” the handle itself, it is good
practice to store the downloaded file locally. Nevertheless, sometimes it can be easier
to perform the parsing on the fly using web handles. To download files from the NCBI,
we will use the Entrez.efetch interface, which takes as arguments the database where
the file should be found, the file format, and the database identifier:

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide",\
... rettype="fasta",id="XM_003201225.1")
>>> record = SeqIO.read(handle,"fasta")
>>> record
SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', SingleLetterAlphabet()), id='gi|326679292|ref|XM_003201225.1|', name='gi|326679292|ref|XM_003201225.1|',
description='gi|326679292|ref|XM_003201225.1| PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA', dbxrefs=[])
>>> handle = Entrez.efetch(db="nucleotide",\
... rettype="gb",id="XM_003201225.1")
>>> record = SeqIO.read(handle,"genbank")
>>> print record
ID: XM_003201225.1
Name: XM_003201225
Description: PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.
Number of features: 4
/comment=MODEL REFSEQ: This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NW_003336048) annotated using gene prediction method: GNOMON, supported by EST evidence.
Also see:
Documentation of NCBI's Annotation Process
/sequence_version=1
/source=Danio rerio (zebrafish)
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
/keywords=['']
/accessions=['XM_003201225']
/data_file_division=VRT
/date=23-MAR-2011
/organism=Danio rerio
/gi=326679292
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())

It is also possible to download multiple files in a single call, by passing a string
containing all their identifiers separated by commas (for example
id="XM_003201225.1,BC076024.1,BC091833.1").
>>> # Several identifiers, separated by commas, fetched in one call
>>> handle = Entrez.efetch(db="nucleotide",\
... rettype="gb",id="XM_003201225.1,BC076024.1,\
... BC091833.1")
>>> record = SeqIO.parse(handle,"genbank")
>>> for seq_record in record:
...     print seq_record.id, seq_record.description[:50]
...     print "Sequence length %i," % len(seq_record),
...     print "%i features," % len(seq_record.features),
...     print "from: %s" % seq_record.annotations["source"]
...
XM_003201225.1 PREDICTED: Danio rerio nuclear factor of activated ...
Sequence length 2808, 4 features, from: Danio rerio (zebrafish)
BC076024.1 Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347 ...
Sequence length 1188, 3 features, from: Danio rerio (zebrafish)
BC091833.1 Danio rerio zgc:113352, mRNA (cDNA clone MGC:11335...
Sequence length 1660, 3 features, from: Danio rerio (zebrafish)

10.4 Writing sequence files
The SeqIO.write() method can write SeqRecord objects into a file, in the format
specified by the user and chosen from a list of popular sequence file formats. The
method requires three arguments:
− one or more SeqRecord objects;
− a handle or a filename to write to;
− a sequence format.
In the following example, we manually create three SeqRecord objects for three
(very short) proteins. Then, the three objects are put into a list, which is used as the
first argument for the SeqIO.write() method, to specify which objects to write
into a file. Next, we create a handle, which is a file opened for writing, and pass it to
the method as the second argument. Finally, we specify that we want the output file to
be written in fasta format.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import generic_protein
>>> Rec1 = SeqRecord(Seq("ACCA",generic_protein), \
... id="1", description="")
>>> Rec2 = SeqRecord(Seq("CDFAA",generic_protein), \
... id="2", description="")
>>> Rec3 = SeqRecord(Seq("GRKLM",generic_protein), \
... id="3", description="")
>>> My_records = [Rec1, Rec2, Rec3]
>>> from Bio import SeqIO
>>> handle_w = open("MySeqs.fa","w")
>>> SeqIO.write(My_records, handle_w, "fasta")
3
>>> handle_w.close()

The Bio.SeqIO.write() function returns the number of SeqRecord objects written to the
file. The input SeqRecord objects can be in the form of a list, as in the above
example, or an iterator, or an individual SeqRecord.
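As a rough picture of what writing in fasta format involves, the Python 3 sketch below
writes records to any writable handle and returns the number of records written,
mirroring the return value of SeqIO.write() described above. The function name
write_fasta, the (id, description, sequence) tuple layout and the default line width of
60 characters are assumptions made for illustration; this is not Biopython's writer.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (id, description, sequence) tuples as FASTA; return the count."""
    # Accept a single record as well as a list or an iterator of records
    if isinstance(records, tuple):
        records = [records]
    count = 0
    for record_id, description, sequence in records:
        if description:
            header = ">%s %s" % (record_id, description)
        else:
            header = ">%s" % record_id
        handle.write(header + "\n")
        # Wrap the sequence over lines of at most `width` characters
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        count += 1
    return count


my_records = [("1", "", "ACCA"), ("2", "", "CDFAA"), ("3", "", "GRKLM")]

out = io.StringIO()                      # stands in for a file opened with "w"
written = write_fasta(my_records, out)   # a list of records

single_out = io.StringIO()
single_written = write_fasta(("x", "long one", "A" * 70), single_out)  # one record
```

The same function accepts an iterator, for example write_fasta(iter(my_records), out),
since it only needs to loop over its first argument once.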
>>> # An iterator of records as the first argument of SeqIO.write()
>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.parse(handle,"genbank")
>>> handle_w = open("all_records_in_fasta.fa","w")
>>> SeqIO.write(records, handle_w, "fasta")
60
>>> handle.close()
>>> handle_w.close()
>>> # A single record as the first argument of SeqIO.write()
>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.parse(handle,"genbank")
>>> first_record = records.next()
>>> handle_w = open("only_the_first_record.fa","w")
>>> SeqIO.write(first_record, handle_w, "fasta")
1
>>> handle.close()
>>> handle_w.close()

10.5 Parsing Multiple Alignments
Biopython provides a data structure to store multiple alignments (the
MultipleSeqAlignment class), and the Bio.AlignIO module for reading and writing them
in various file formats.
Let's open the seed multiple sequence alignment of the calcineurin-like
phosphoesterases from the Pfam family Metallophos (PF00149), containing 330
protein sequences. The file is in the Stockholm format, which is one of the most
popular formats for multiple alignment handling. The Bio.AlignIO module provides two
methods to parse multiple alignments, .parse() and .read(), which parse files
containing many alignments or just one alignment respectively, following the usual
Biopython convention. Both methods require the same arguments:
− a handle to the multiple alignment, either an open file or a filename;
− the format of the multiple alignment (a full list of available formats can be
  found at http://biopython.org/wiki/AlignIO);
− the alphabet used by the alignment (optional).
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF00149.sth", "stockholm")
>>> dir(alignment)
['__add__', '__doc__', '__format__', '__getitem__', '__init__', '__iter__', '__len__', '__module__', '__repr__', '__str__', '_alphabet', '_annotations', '_append', '_records', '_str_line', 'add_sequence', 'append', 'extend', 'format', 'get_alignment_length', 'get_all_seqs', 'get_column', 'get_seq_by_num', 'sort']
>>> print alignment
SingleLetterAlphabet() alignment with 330 rows and 477 columns
FKIVQFSDAHLSDYFTLE---...HGG YKUE_BACSU/58-225
LRVLHISDLHMLPNQHR---...HGG O69651_MYCTU/51-235
LRVLQVSDIHMVGGQRK---...HGG Q9X935_STRCO/47-241
LNILHLSDLHLENISVS---...HGG YKOQ_BACSU/46-211
LPYGVISDPHYHRWDAFATTNA---DGLN-SRLE--...HNH Q9R2P6_YERPE/3-205
LRFVQLSDIHLGTVRSAG---...HGG O27247_METTH/130-285
LRIVQISDLHLNHSTPDA---...HGP Y461_CHLTR/52-261
LRIAQISDLHFHKRVPEK---...HGP Y578_CHLPN/45-254

AlignIO.parse() returns an iterator over the alignments in the file, while
AlignIO.read() returns a single alignment object; iterating over an alignment in turn
provides one SeqRecord object for each sequence it contains:

>>> for record in alignment:
...     print record.id,record.annotations
...
YKUE_BACSU/58-225 {'start': 58, 'end': 225, 'accession': 'O34870.2'}
O69651_MYCTU/51-235 {'start': 51, 'end': 235, 'accession': 'O69651.1'}
Q9X935_STRCO/47-241 {'start': 47, 'end': 241, 'accession': 'Q9X935.1'}
YKOQ_BACSU/46-211 {'start': 46, 'end': 211, 'accession': 'O35040.1'}
Q9R2P6_YERPE/3-205 {'start': 3, 'end': 205, 'accession': 'Q9R2P6.1'}
O27247_METTH/130-285 {'start': 130, 'end': 285, 'accession': 'O27247.1'}
Y461_CHLTR/52-261 {'start': 52, 'end': 261, 'accession': 'O84467.1'}
Y578_CHLPN/45-254 {'start': 45, 'end': 254, 'accession': 'Q9Z7X6.1'}
O03968_9CAUD/269-543 {'start': 269, 'end': 543, 'accession': 'O03968.1'}
ASM3A_MOUSE/35-294 {'start': 35, 'end': 294, 'accession': 'P70158.1'}
ASM3B_HUMAN/21-281 {'start': 21, 'end': 281, 'accession': 'Q92485.2'}
>>> print "Number of rows: %i" % len(alignment)
Number of rows: 330

Similarly to other modules, the AlignIO module provides functions to write alignments
to file in several formats, to convert between formats, and so on. You can also perform
slicing operations, which can be thought of as accessing the alignment as a matrix. The
standard slicing operator [i:j] returns the alignment rows between row i and row
j-1. To select an alignment column, you can use the operator [:,k], which will select
the k-th column (counting from zero); a single residue is obtained with [i,k].

>>> print alignment[3:7]
SingleLetterAlphabet() alignment with 4 rows and 477 columns
LNILHLSDLHLENISVS---...HGG YKOQ_BACSU/46-211
LPYGVISDPHYHRWDAFATTNA---DGLN-SRLE--...HNH Q9R2P6_YERPE/3-205
LRFVQLSDIHLGTVRSAG---...HGG O27247_METTH/130-285
LRIVQISDLHLNHSTPDA---...HGP Y461_CHLTR/52-261
>>> print alignment[:,6]
SSSSSSSSSTATTSTSAAATSSSTSASSTAPATTTTTTTSASAAAAASSGSSSASAAASGGGGGG
GNNGGGGSGGGGGGGGSGCGGGGGGSNNNNNNNNNNNNNNNNNNNSSTTTTTTNNGGGGGGTTTG
GGGGSSSSASSTSSSSASSSSGGGGGSASSGSASAASAAAAATSTTSSSSSSASSSSSSSAAAGG
GGGGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGSGGGGGGGGPGGGGSSASSGSTSGASSSSSTTSSSSSSSSSSSSSAAAAA
GGGST
>>> print alignment[2,6]
S
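
The matrix view of an alignment can be illustrated without Biopython by treating the
alignment as a list of equal-length strings. In the Python 3 sketch below the toy rows
are invented for the example and the helper column() is hypothetical; with a real
alignment object the slicing syntax does this work for you.

```python
# Toy alignment: each string is one row, and all rows have the same length
rows = [
    "MKV-LS",
    "MRVALS",
    "MKI-LT",
    "MRIALT",
]


def column(alignment_rows, k):
    """Return column k (zero-based) as a string, one character per row."""
    return "".join(row[k] for row in alignment_rows)


sub_rows = rows[1:3]             # rows 1 and 2, comparable to alignment[1:3]
third_column = column(rows, 2)   # comparable to alignment[:,2]
residue = rows[2][5]             # comparable to alignment[2,5]
```

Row slices give back a smaller alignment (here a shorter list), a column is a string
with one character per row, and a row plus a column pick out a single residue.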