Bioinformatics using Python for Biologists


10.1 The SeqIO module

Many file formats are employed by the most popular databases to store information in
ways that “should” be easily interpreted by a computer program. In this case,
interpreting means extracting information (i.e. parsing) and converting it into formats
appropriate for further processing and analysis. Parsing such files is a very common
and important task that the bioinformatician must carry out accurately. However, the
task can be frustrated by the fact that the formats change quite regularly and may
contain small subtleties that break even the most well-designed parsers. The Biopython
SeqIO module provides parsers for many common file formats, which generally extract
information from the input file and convert it into SeqRecord objects. There are two
methods for sequence file parsing, SeqIO.parse() and SeqIO.read(); both of them require
two mandatory arguments and accept an optional one:

− a “handle” that specifies where the data must be read from (it could be a file name, a
file opened for reading, data downloaded from a database using a script, or the
output of another piece of code);

− a flag indicating the format of the data (a full list of supported formats is
available at http://biopython.org/wiki/SeqIO);

− an optional argument that specifies the alphabet of the sequence data (supported only
by Biopython releases before 1.78, which removed the Bio.Alphabet module).

The difference between SeqIO.parse() and SeqIO.read() is that SeqIO.parse() returns an
iterator that goes through all the records in the input handle, to be used in for or
while loops. SeqIO.read(), on the other hand, must be used on files containing a single
record. The arguments are the same, and in both cases the data are delivered as
SeqRecord objects: SeqIO.parse() yields them one at a time, while SeqIO.read() returns
a single one.
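The contrast between the two calls can be sketched in plain Python. This is only an illustration, written for Python 3 with the standard library alone: parse_fasta and read_fasta are hypothetical stand-ins, not Biopython functions; they return simple (identifier, sequence) tuples rather than SeqRecord objects, and the two toy records are made up.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs lazily, one FASTA record at a time."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first word after ">".
            identifier, chunks = line[1:].split()[0], []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def read_fasta(handle):
    """Return the only record in the handle, or raise ValueError."""
    records = parse_fasta(handle)
    first = next(records, None)
    if first is None:
        raise ValueError("No records found in handle")
    if next(records, None) is not None:
        raise ValueError("More than one record found in handle")
    return first


# Toy data (assumption): two short, invented records.
two_records = ">seq1 first toy record\nACGT\nTTGA\n>seq2 second toy record\nGGCC\n"
one_record = ">seq1 first toy record\nACGT\nTTGA\n"

# An in-memory text buffer plays the role of the handle.
parsed = list(parse_fasta(io.StringIO(two_records)))
single = read_fasta(io.StringIO(one_record))

try:
    read_fasta(io.StringIO(two_records))
    read_error = None
except ValueError as error:
    read_error = str(error)

print(parsed)
print(single)
print(read_error)
```

Because parse_fasta is a generator, each record is built only when the loop asks for it, which is also why such an iterator can be walked through only once per handle.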

10.2 Reading local files

Let's read the file “D.rerio_calcineurin.fa”, containing FASTA-format records of all the
entries matching the keyword “calcineurin” in the zebrafish (Danio rerio) genome,
obtained from the NCBI (http://www.ncbi.nlm.nih.gov/nuccore). The SeqIO.parse() method
will generate an iterator over SeqRecord objects; features can then be extracted from
each SeqRecord object as described in Module 9:

>>> import Bio
>>> from Bio import SeqIO
>>> handle = open("D.rerio_calcineurin.fa","r")
>>> type(handle)
<type 'file'>
>>> for seq_record in SeqIO.parse(handle,"fasta"):
...     print seq_record.id
...     print repr(seq_record.seq)
...     print len(seq_record)
...
gi|326679292|ref|XM_003201225.1|
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', SingleLetterAlphabet())
2808
gi|326677866|ref|XM_003200885.1|
Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGC...TAA', SingleLetterAlphabet())
1035
...
>>> handle.close()

Since the handle is a file, it is good habit to close it when the processing is done.
Remember that the iterator “empties” the file, meaning that to scan the records
another time, the file must be closed, then opened again, and then passed again as the
handle argument to SeqIO.parse().

In a similar way, we can parse an equivalent file, this time in GenBank format; here
we also omit the explicit creation of the handle and pass the file name or complete
path directly to SeqIO.parse():

>>> for seq_record in \
...     SeqIO.parse("D.rerio_calcineurin.gb","genbank"):
...     print seq_record.id
...     print repr(seq_record.seq)
...     print len(seq_record)
...
XM_003201225.1
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
2808
XM_003200885.1
Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGC...TAA', IUPACAmbiguousDNA())
1035
...

A few things must be noted. First, the GenBank-specific SeqIO.parse() is able to assign
the correct alphabet to the sequence records in the input file, while the FASTA parser
assigns a generic SingleLetterAlphabet(). Second, the GenBank SeqRecord objects store a
more compact id attribute for the sequence records.

As mentioned before, SeqIO.parse() can process any number of records in the input
handle. SeqIO.read() instead checks whether there is only one record in the handle,
raising an exception if this condition is not met:

>>> handle = open("D.rerio_calcineurin.gb","r")
>>> SeqIO.read(handle,"genbank")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SeqIO/__init__.py", line 614, in read
ValueError: More than one record found in handle

Using an iterator is a way to parse large files without consuming large amounts of
memory. On the other hand, as mentioned above, each record can be accessed only once
in the for loop. The iterator also provides a method to access the records step by
step:

>>> handle = open("D.rerio_calcineurin.gb")
>>> iterator = SeqIO.parse(handle,"genbank")
>>> first_record = iterator.next()
>>> type(first_record)
<class 'Bio.SeqRecord.SeqRecord'>
>>> first_record.id
'XM_003201225.1'
>>> first_record.seq
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
>>> first_record.description
'PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.'
>>> second_record = iterator.next()
>>> second_record.id
'XM_003200885.1'

When the records in the file are over, the .next() method will either return the
special Python object None or raise a StopIteration exception (depending on which
Biopython release you have installed on your system).

Using this approach you could in principle assign each record to a different variable,
if you need to keep these records at hand. This is impractical if the number of records
is high or unknown beforehand. It is, however, possible to store all the SeqRecord
objects returned by SeqIO in a data structure such as a list:

>>> records = list\
... (SeqIO.parse("D.rerio_calcineurin.gb", "genbank"))
>>> len(records)
61
>>> records[0] # the first record
SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA()), id='XM_003201225.1', name='XM_003201225', description='PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.', dbxrefs=[])
>>> records[0].id
'XM_003201225.1'
>>> records[0].seq
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
>>> for key,value in records[0].annotations.items():
...     print key,value
...
comment MODEL REFSEQ: This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NW_003336048) annotated using gene prediction method: GNOMON, supported by EST evidence.
Also see:
Documentation of NCBI's Annotation Process
sequence_version 1
source Danio rerio (zebrafish)
taxonomy ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
keywords ['']
accessions ['XM_003201225']
data_file_division VRT
date 23-MAR-2011
organism Danio rerio
gi 326679292
>>> records[-1] # the last record
SeqRecord(seq=Seq('GCAGCAATTTGAGGAAGAAGCGCAAACAGACAGGTCAGGTGTGGCGATGGCAGC...AAA', IUPACAmbiguousDNA()), id='BC139891.1', name='BC139891', description='Danio rerio zgc:162913, mRNA (cDNA clone MGC:162913 IMAGE:7401269), complete cds.', dbxrefs=[])

SeqIO also provides a method to convert the SeqRecord objects produced by the iterator
into the values of a dictionary, whose keys are the SeqRecord.id attributes:

>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
>>> for key,value in records.items():
...     print key,value.id,value.description
...
BC093219.1 BC093219.1 Danio rerio zgc:112142, mRNA (cDNA clone MGC:112142 IMAGE:7428541), complete cds.
XM_685181.5 XM_685181.5 PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 (nfatc3), mRNA.
BC076024.1 BC076024.1 Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347 IMAGE:7055812), complete cds.
...

Note that if duplicate keys are found, an exception will be raised.

For a very large number of records there is a method, Bio.SeqIO.index(), which creates
a dictionary-like object without keeping all the data in memory. Instead, the object
only stores the position of each record in the file; when a particular record is
accessed, its content is parsed on the fly. This method allows the handling of a huge
number of records, at a small cost in flexibility and speed. Moreover, these
dictionary-like objects are read-only, meaning that once created, data cannot be
inserted or removed. Note that in this case the first argument cannot be an open file
handle: it must be a file name.

>>> records = SeqIO.index("D.rerio_calcineurin.gb","genbank")
>>> records.keys()
['BC093219.1', 'XM_685181.5', 'BC076024.1', 'BC091833.1', 'BC152175.1', 'NM_200899.1', 'NM_001005392.1', 'BC062840.1', 'NM_199836.1', 'BC154648.1', 'NM_001099250.1', 'BC076019.1', 'NM_001002452.1', 'BC064307.1', 'BC153488.1', 'BC122248.1', 'NM_001007413.1', 'NM_001044758.1', 'BC065451.1', 'BC093272.1', 'XM_001922343.4', 'BC065972.1', 'BC090735.1', 'NM_001017701.1', 'XM_002664259.1', 'XM_001923726.2', 'XM_678815.5', 'XM_694965.5', 'BC163337.1', 'XM_001923264.3', 'BC139891.1', 'NM_205678.1', 'NM_200854.1', 'XM_687678.4', 'BC076439.1', 'XM_001339606.4', 'BC058868.1', 'NM_214773.1', 'NM_199653.1', 'NM_001017735.1', 'NM_200042.1', 'BC071331.1', 'BC129492.1', 'BC055256.1', 'GU733827.1', 'XM_003200885.1', 'NM_200037.1', 'NM_199895.1', 'BC076514.1', 'AY639016.1', 'BC049341.1', 'BC150441.1', 'NM_001002447.1', 'BC163350.1', 'NM_001014338.1', 'NM_001045159.1', 'BC155186.1', 'BC045981.1', 'XM_003201225.1', 'BC142750.1', 'BC053153.1']
>>> print records["BC093219.1"].description
Danio rerio zgc:112142, mRNA (cDNA clone MGC:112142 IMAGE:7428541), complete cds.

10.3 Reading files from the web

As we stated before, a handle can also be used to fetch data from web databases.
Since parsing the file with an iterator “consumes” the handle itself, it is good
practice to store the downloaded file locally. Nevertheless, it can sometimes be easier
to perform the parsing on the fly using web handles. To download files from the NCBI,
we will use the Entrez.efetch interface, which takes as arguments the database where
the file should be found, the file format, and the database identifier:

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide",\
... rettype="fasta",id="XM_003201225.1")
>>> record = SeqIO.read(handle,"fasta")
>>> record
SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', SingleLetterAlphabet()), id='gi|326679292|ref|XM_003201225.1|', name='gi|326679292|ref|XM_003201225.1|', description='gi|326679292|ref|XM_003201225.1| PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA', dbxrefs=[])
>>> handle = Entrez.efetch(db="nucleotide",\
... rettype="gb",id="XM_003201225.1")
>>> record = SeqIO.read(handle,"genbank")
>>> print record
ID: XM_003201225.1
Name: XM_003201225
Description: PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.
Number of features: 4
/comment=MODEL REFSEQ: This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NW_003336048) annotated using gene prediction method: GNOMON, supported by EST evidence.
Also see:
Documentation of NCBI's Annotation Process
/sequence_version=1
/source=Danio rerio (zebrafish)
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
/keywords=['']
/accessions=['XM_003201225']
/data_file_division=VRT
/date=23-MAR-2011
/organism=Danio rerio
/gi=326679292
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())

It is possible to download multiple files by writing a string containing all their
identifiers separated by commas:

>>> handle = Entrez.efetch(db="nucleotide",\
... rettype="gb",id="XM_003201225.1,BC076024.1,\
... BC091833.1")
>>> record = SeqIO.parse(handle,"genbank")
>>> for seq_record in record:
...     print seq_record.id, seq_record.description[:50], "..."
...     print "Sequence length %i," % len(seq_record),
...     print "%i features," % len(seq_record.features),
...     print "from: %s" % seq_record.annotations["source"]
...
XM_003201225.1 PREDICTED: Danio rerio nuclear factor of activated ...
Sequence length 2808, 4 features, from: Danio rerio (zebrafish)
BC076024.1 Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347 ...
Sequence length 1188, 3 features, from: Danio rerio (zebrafish)
BC091833.1 Danio rerio zgc:113352, mRNA (cDNA clone MGC:11335 ...
Sequence length 1660, 3 features, from: Danio rerio (zebrafish)

10.4 Writing sequence files

The SeqIO.write() method can write SeqRecord objects into a file in the format
specified by the user, chosen from a list of popular sequence file formats. The method
requires three arguments:

− one or more SeqRecord objects;

− a handle or a filename to write to;

− a sequence format.

In the following example, we manually create three SeqRecord objects for three
(very short) proteins. Then, the three objects are put into a list, which is used as the
first argument for the SeqIO.write() method, to specify which objects to write
into a file. Next, we create a handle, which is a file opened for writing, and pass it to
the method as the second argument. Finally, we specify that we want the output file to
be written in FASTA format.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import generic_protein
>>> Rec1 = SeqRecord(Seq("ACCA",generic_protein), \
... id="1", description="")
>>> Rec2 = SeqRecord(Seq("CDFAA",generic_protein), \
... id="2", description="")
>>> Rec3 = SeqRecord(Seq("GRKLM",generic_protein), \
... id="3", description="")
>>> My_records = [Rec1, Rec2, Rec3]
>>> from Bio import SeqIO
>>> handle_w = open("MySeqs.fa","w")
>>> SeqIO.write(My_records, handle_w, "fasta")
3
>>> handle_w.close()

The Bio.SeqIO.write() function returns the number of SeqRecord objects written to the
file.

The input SeqRecord objects can be in the form of a list, as in the example above, or
an iterator, or an individual SeqRecord:

>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.parse(handle,"genbank")
>>> handle_w = open("all_records_in_fasta.fa","w")
>>> SeqIO.write(records, handle_w, "fasta")
60
>>> handle.close()
>>> handle_w.close()
>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.parse(handle,"genbank")
>>> first_record = records.next()
>>> handle_w = open("only_the_first_record.fa","w")
>>> SeqIO.write(first_record, handle_w, "fasta")
1
>>> handle.close()
>>> handle_w.close()

10.5 Parsing Multiple Alignments

Biopython provides a data structure to store multiple alignments (the
MultipleSeqAlignment class), and the Bio.AlignIO module for reading and writing them
in various file formats.

Let's open the seed multiple sequence alignment of the calcineurin-like
phosphoesterases from the Pfam family Metallophos (PF00149), containing 330
protein sequences. The file is in the Stockholm format, which is one of the most
popular formats for handling multiple alignments. The Bio.AlignIO module provides two
methods to parse multiple alignments, .parse() and .read(), which parse files
containing many alignments or just one, following the usual Biopython convention. Both
methods require the same arguments:

− a handle to the multiple alignment, either an open file or a filename;

− the format of the multiple alignment (a full list of available formats can be
found at http://biopython.org/wiki/AlignIO);

− the alphabet used by the alignment (optional).
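To give an idea of what a Stockholm file looks like, here is a minimal plain-Python sketch, written for Python 3 with no external libraries. It is an illustration under simplifying assumptions, not the parser used by Bio.AlignIO: parse_stockholm is a hypothetical helper, the toy alignment is invented, and all markup lines (those starting with "#") are simply skipped.

```python
def parse_stockholm(text):
    """Return {name: aligned_sequence} for a single Stockholm alignment."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("Missing '# STOCKHOLM' header line")
    rows = {}
    for line in lines[1:]:
        line = line.strip()
        if line == "//":
            break  # end-of-alignment marker
        if not line or line.startswith("#"):
            continue  # blank line or markup such as #=GF, #=GS, #=GC
        name, fragment = line.split()
        # Interleaved files repeat each name once per block, so append.
        rows[name] = rows.get(name, "") + fragment
    if len({len(sequence) for sequence in rows.values()}) > 1:
        raise ValueError("Aligned rows have different lengths")
    return rows


# Toy alignment (assumption): names and residues are made up.
toy_alignment = """# STOCKHOLM 1.0
#=GF ID toy_family
seqA/1-5   ACD-EF
seqB/1-4   AC--EF
seqC/1-6   ACDWEF
//
"""

rows = parse_stockholm(toy_alignment)
for name, sequence in rows.items():
    print(name, sequence)
```

Every row has the same length because gaps are written explicitly, which is what allows an alignment to be treated as a matrix of rows and columns.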

>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF00149.sth", "stockholm")
>>> dir(alignment)
['__add__', '__doc__', '__format__', '__getitem__', '__init__', '__iter__', '__len__', '__module__', '__repr__', '__str__', '_alphabet', '_annotations', '_append', '_records', '_str_line', 'add_sequence', 'append', 'extend', 'format', 'get_alignment_length', 'get_all_seqs', 'get_column', 'get_seq_by_num', 'sort']
>>> print alignment
SingleLetterAlphabet() alignment with 330 rows and 477 columns
FKIVQFSDAHLSDYFTLE---...HGG YKUE_BACSU/58-225
LRVLHISDLHMLPNQHR---...HGG O69651_MYCTU/51-235
LRVLQVSDIHMVGGQRK---...HGG Q9X935_STRCO/47-241
LNILHLSDLHLENISVS---...HGG YKOQ_BACSU/46-211
LPYGVISDPHYHRWDAFATTNA---DGLN-SRLE--...HNH Q9R2P6_YERPE/3-205
LRFVQLSDIHLGTVRSAG---...HGG O27247_METTH/130-285
LRIVQISDLHLNHSTPDA---...HGP Y461_CHLTR/52-261
LRIAQISDLHFHKRVPEK---...HGP Y578_CHLPN/45-254

AlignIO.read() returns a MultipleSeqAlignment object, while AlignIO.parse() returns an
iterator over such objects, one for each alignment in the file. Looping over an
alignment goes through it row by row, providing a SeqRecord object for each sequence
in the alignment.

>>> for record in alignment:
...     print record.id,record.annotations
...
YKUE_BACSU/58-225 {'start': 58, 'end': 225, 'accession': 'O34870.2'}
O69651_MYCTU/51-235 {'start': 51, 'end': 235, 'accession': 'O69651.1'}
Q9X935_STRCO/47-241 {'start': 47, 'end': 241, 'accession': 'Q9X935.1'}
YKOQ_BACSU/46-211 {'start': 46, 'end': 211, 'accession': 'O35040.1'}
Q9R2P6_YERPE/3-205 {'start': 3, 'end': 205, 'accession': 'Q9R2P6.1'}
O27247_METTH/130-285 {'start': 130, 'end': 285, 'accession': 'O27247.1'}
Y461_CHLTR/52-261 {'start': 52, 'end': 261, 'accession': 'O84467.1'}
Y578_CHLPN/45-254 {'start': 45, 'end': 254, 'accession': 'Q9Z7X6.1'}
O03968_9CAUD/269-543 {'start': 269, 'end': 543, 'accession': 'O03968.1'}
ASM3A_MOUSE/35-294 {'start': 35, 'end': 294, 'accession': 'P70158.1'}
ASM3B_HUMAN/21-281 {'start': 21, 'end': 281, 'accession': 'Q92485.2'}

Similarly to other modules, the AlignIO module provides methods to write alignments to
file in several formats, to convert between formats, and so on. You can also perform
slicing operations, which can be thought of as accessing the alignment as a matrix. The
standard slicing operator [i:j] returns the alignment rows from row i to row j-1. To
select an alignment column, you can use the operator [:,k], which will select the
column with index k (counting from zero).

>>> print "Number of rows: %i" % len(alignment)
Number of rows: 330
>>> print alignment[3:7]
SingleLetterAlphabet() alignment with 4 rows and 477 columns
LNILHLSDLHLENISVS---...HGG YKOQ_BACSU/46-211
LPYGVISDPHYHRWDAFATTNA---DGLN-SRLE--...HNH Q9R2P6_YERPE/3-205
LRFVQLSDIHLGTVRSAG---...HGG O27247_METTH/130-285
LRIVQISDLHLNHSTPDA---...HGP Y461_CHLTR/52-261
>>> print alignment[:,6]
SSSSSSSSSTATTSTSAAATSSSTSASSTAPATTTTTTTSASAAAAASSGSSSASAAASGGGGGG
GNNGGGGSGGGGGGGGSGCGGGGGGSNNNNNNNNNNNNNNNNNNNSSTTTTTTNNGGGGGGTTTG
GGGGSSSSASSTSSSSASSSSGGGGGSASSGSASAASAAAAATSTTSSSSSSASSSSSSSAAAGG
GGGGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGSGGGGGGGGPGGGGSSASSGSTSGASSSSSTTSSSSSSSSSSSSSAAAAA
GGGST
>>> print alignment[2,6]
S
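The matrix view of an alignment can be mimicked in plain Python by treating it as a list of equal-length strings. The sketch below is only an analogy, written for Python 3: the toy rows are invented, and the list and string operations stand in for the [i:j], [:,k] and [i,k] operators of the alignment object rather than reproducing its implementation.

```python
# Toy alignment (assumption): four invented rows of equal length.
rows = ["ACD-EF", "AC--EF", "ACDWEF", "GCD-EF"]

# Analogue of alignment[1:3]: rows 1 and 2 (row 3 is excluded).
subset = rows[1:3]

# Analogue of alignment[:,2]: column 2 read from top to bottom.
column = "".join(row[2] for row in rows)

# Analogue of alignment[2,3]: a single cell (row 2, column 3).
cell = rows[2][3]

# A rectangular block: rows 0-1 restricted to columns 1-3.
block = [row[1:4] for row in rows[0:2]]

print(subset)
print(column)
print(cell)
print(block)
```

As in the session above, indices count from zero and the upper bound of a slice is excluded.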