10.1 What Is Biopython?
Biopython1 is a package of modules for developing bioinformatics applications. Although each bioinformatics analysis is unique, some tasks are repeated, some constants are shared between programs, and some file formats are standard. This calls for a package that deals with common biological problems.
Biopython started as an idea in August 1999, as an initiative of Jeff Chang and Andrew Dalke. Although they came up with the idea, collaborators soon joined the project. Among the most active developers, Brad Chapman, Peter Cock, Michiel de Hoon and Iddo Friedberg stand out. The project began to take code form in February 2000, and the first release was made in July of the same year. The original idea was to build a package equivalent to BioPerl, which back then was the principal bioinformatics package.
Although BioPerl may have been Biopython’s inspiration, the conceptual differences between Perl and Python have given Biopython a particular way of doing things. Biopython is part of the family of open-bio projects (also known as Bio*), and institutionally it is a member of the Open Bioinformatics Foundation.2
10.1.1 Project Organization
Biopython is an open source community project. Although the Open Bioinformatics Foundation takes care of administrative, economic and legal aspects, the content of the project is managed by its programmers and users.
Anyone can participate in the project. The code is publicly available through a CVS repository on the Web.3 The procedure that you have to follow to collaborate on Biopython is similar to that of other open source projects.
You use the software and then determine whether it needs additional features or whether you want to modify any existing ones. Before writing any code, my recommendation is to discuss your ideas on the development mailing list first.4 There you will find out whether the feature has already been discussed and rejected, or whether it was not included because no one had needed it until then. In the case of a bug fix, you don’t need to ask; just report it in the bug tracking software,5 and if possible, add a proposed solution. If you have a proposal for improvement and there are no objections from the list, you can send the code through the bug tracking system, although it has to be marked as “enhancement” in the “Severity” drop-down menu.
1 Available from http://www.biopython.org.
2 http://www.open-bio.org
3 The following address is likely to change as Biopython moves from CVS to Git or another version control system: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=biopython. Please see http://biopython.org/wiki/GitMigration for more information on the migration to Git distributed version control.
Due to the open nature of the project, tens of people from diverse fields within bioinformatics, from information theory to population genetics, have contributed code.
I have been involved in Biopython as a user since 2002 and submitted my first contribution in 2003 with lcc.py, a function to calculate the local compositional complexity of a sequence. In 2004 I submitted code for melting point calculation of oligonucleotides. My last submission was some functions for the CheckSum module in 2007.6 In every case I found a supportive community, especially with the first submission, when my coding skills were at a beginner level.
For more information concerning how to participate in the Biopython project, see the specific instructions at http://biopython.org/wiki/Contributing.
The Biopython code is developed under the “Biopython License.7” It is very liberal and there are virtually no restrictions to its use.8
10.2 Biopython Components
Biopython has various modules. Some facilitate tasks that are undertaken on a daily basis in a molecular biology laboratory while others have very specific objectives. What is “commonly used” will depend on the work environment of the reader, but after having worked giving IT support to molecular biologists at a biotech research center, reading the Biopython mailing list for a few years and doing consulting work, I think I can identify key modules.
As with all enumerations, this one is arbitrary and may not reflect the interests of all readers. It is sorted in didactic fashion, so that the first items help you understand the rest.
4 http://lists.open-bio.org/mailman/listinfo/biopython-dev
5 http://bugzilla.open-bio.org
6 Bassi, Sebastian and Gonzalez, Virginia. New checksum functions for Biopython. Available from Nature Precedings <http://dx.doi.org/10.1038/npre.2007.278.1> (2007).
7 The license is included in the Biopython package and available online at http://www.biopython.org/DIST/LICENSE.
8 The only conditions imposed for using Biopython are publishing the copyright notice and not using the names of the contributors in advertising.
10.2.1 Alphabet
In bioinformatics we constantly deal with alphabets. DNA has a 4-letter alphabet (A, C, T, G) while proteins have their 20 amino acids, each one represented by a letter of the alphabet. There are also special “alphabets” such as those that allow for ambiguous positions, that is, positions where more than one nucleotide may be present. For example, the letter S may represent the nucleotides C or G, and the letter H represents A, C, or T. In Biopython this ambiguous alphabet is called ambiguous_dna. For proteins, there is also an extended alphabet, that is, an alphabet that contains amino acids that are not normally found in proteins9 (ExtendedIUPACProtein). Similarly, there is an extended alphabet for nucleotides (ExtendedIUPACDNA) that allows letters for modified bases. Going back to proteins, there is also a reduced alphabet that, taking into account common physicochemical properties, lumps together several amino acids into one letter.
There is even one alphabet that is not DNA or amino-acid based: SecondaryStructure. This alphabet represents secondary structure elements such as Helix, Turn, Strand and Coil.
Alphabets defined by IUPAC are stored in Biopython as classes of the IUPAC module. The parent module (Bio.Alphabet) includes more general cases. Here are some attributes of the alphabets:
>>> import Bio.Alphabet
>>> Bio.Alphabet.ThreeLetterProtein.letters
['Ala', 'Asx', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His', <=
'Ile', 'Lys', 'Leu', 'Met', 'Asn', 'Pro', 'Gln', 'Arg', <=
'Ser', 'Thr', 'Sec', 'Val', 'Trp', 'Xaa', 'Tyr', 'Glx']
>>> from Bio.Alphabet import IUPAC
>>> IUPAC.IUPACProtein.letters
'ACDEFGHIKLMNPQRSTVWY'
>>> IUPAC.ExtendedIUPACDNA.letters
'GATCBDSW'
9 Selenocysteine and pyrrolysine are typical examples.
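The ambiguity letters described in this section can be sketched in plain Python, without Biopython. The following is a minimal illustration that assumes the standard IUPAC nucleotide codes; the names AMBIGUOUS_DNA and expand are hypothetical and are not part of the Biopython API.

```python
# Standard IUPAC nucleotide ambiguity codes (plain-Python sketch,
# not Biopython's own data structures).
AMBIGUOUS_DNA = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT',
    'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG',
    'N': 'ACGT',
}

def expand(seq):
    """Return, for each position, the nucleotides that letter may stand for."""
    return [AMBIGUOUS_DNA[letter] for letter in seq.upper()]

print(expand('ASH'))  # ['A', 'CG', 'ACT']
```

Here S expands to C or G and H to A, C or T, matching the two examples given in the text.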
Alphabets are used to define the content of a sequence. How do you know whether a sequence made of “CCGGGTT” is a small peptide with several cysteine, glycine and threonine residues or a DNA fragment of cytosine, guanine and thymine? If sequences were stored as plain strings, there would be no way to know what kind of sequence it is. This is why Biopython introduces Seq objects.
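To make this ambiguity concrete, the sketch below checks the string against two letter sets in plain Python. The letter sets mirror the unambiguous DNA alphabet and the 20-letter protein alphabet; the helper fits is a hypothetical name used only for this illustration.

```python
DNA_LETTERS = set('GATC')
PROTEIN_LETTERS = set('ACDEFGHIKLMNPQRSTVWY')

def fits(seq, letters):
    """True if every character of seq belongs to the given letter set."""
    return set(seq) <= letters

seq = 'CCGGGTT'
print(fits(seq, DNA_LETTERS))      # True
print(fits(seq, PROTEIN_LETTERS))  # True
```

Both checks succeed, so the characters alone cannot settle what kind of sequence this is.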
10.2.2 Seq
This object is composed of the sequence itself and an alphabet that defines the nature of the sequence.
Let’s create a sequence object as a DNA fragment:
>>> from Bio.Seq import Seq
>>> import Bio.Alphabet
>>> seq = Seq('CCGGGTT',Bio.Alphabet.IUPAC.unambiguous_dna)
Since this sequence (seq) is defined as DNA, you can apply operations that are permitted on DNA sequences. Seq objects have the transcribe and translate methods:
>>> seq.transcribe()
Seq('CCGGGUU', IUPACUnambiguousRNA())
>>> seq.translate()
Seq('PG', IUPACProtein())
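What these two methods compute can be approximated with plain string operations. The following sketch assumes the standard genetic code and a coding-strand DNA input; transcribe and translate here are stand-alone hypothetical functions, not the Seq methods, and an incomplete trailing codon is simply ignored.

```python
from itertools import product

BASES = 'TCAG'
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ('FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG')
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(dna):
    """Coding-strand DNA to RNA: replace T with U."""
    return dna.upper().replace('T', 'U')

def translate(nucleotides):
    """Translate complete codons; a trailing partial codon is dropped."""
    dna = nucleotides.upper().replace('U', 'T')
    usable = len(dna) - len(dna) % 3
    return ''.join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(transcribe('CCGGGTT'))  # CCGGGUU
print(translate('CCGGGTT'))   # PG
```

The codons CCG and GGT give proline and glycine; the final T does not complete a codon and so contributes nothing.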
An RNA sequence can’t be transcribed, but it can be translated:
>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna)
>>> rna_seq.transcribe()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sb/Seq.py", line 520, in transcribe
    raise ValueError("RNA cannot be transcribed!")
ValueError: RNA cannot be transcribed!
>>> rna_seq.translate()
Seq('PG', IUPACProtein())
You can go back from RNA to DNA using the back_transcribe method:
>>> rna_seq.back_transcribe()
Seq('CCGGGTT', IUPACUnambiguousDNA())
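The behaviour shown in this section, where the same kind of string is handled differently depending on its declared type, can be sketched with a small class that pairs a string with a type tag. This is only an illustration of the idea; TypedSeq and its kind attribute are hypothetical and far simpler than Biopython's Seq and alphabet classes.

```python
class TypedSeq:
    """A string plus a tag saying what kind of sequence it is."""

    def __init__(self, data, kind):
        if kind not in ('dna', 'rna'):
            raise ValueError('kind must be "dna" or "rna"')
        self.data = data
        self.kind = kind

    def __repr__(self):
        return 'TypedSeq(%r, %r)' % (self.data, self.kind)

    def transcribe(self):
        # Only DNA can be transcribed; anything else is refused.
        if self.kind != 'dna':
            raise ValueError('RNA cannot be transcribed!')
        return TypedSeq(self.data.replace('T', 'U'), 'rna')

    def back_transcribe(self):
        # Only RNA can be converted back to DNA.
        if self.kind != 'rna':
            raise ValueError('DNA cannot be back transcribed!')
        return TypedSeq(self.data.replace('U', 'T'), 'dna')

rna = TypedSeq('CCGGGTT', 'dna').transcribe()
print(rna)                    # TypedSeq('CCGGGUU', 'rna')
print(rna.back_transcribe())  # TypedSeq('CCGGGTT', 'dna')
```

Because the tag travels with the data, each operation can decide whether it makes sense for that kind of sequence, which is the role the alphabet plays in a Seq object.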