Lecture 2, Introduction to Python
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
BINF 3360, Introduction to Computational Biology
Python Programming Language
Script Language: general-purpose scripting language
Broad applications
(web, bioinformatics, network programming, graphics, software engineering)
Features
Object-oriented
Extension with modules
Database integration
Embeddable
Getting Started
Download & Installation: http://www.python.org/download/ (the most recent version: Python 3.3)
Edit & Run
Create a file named test.py
Edit the code
Run the code
    # This is a test.
    dna = 'ATCGATGA'
    print(dna, '\n')
> python test.py
Primitives
Primitive Data Types: numbers or strings
    num = 1234
    st = '1234'
    num_1 = num + int(st)      # 2468 (integer addition)
    st_1 = str(num) + st       # '12341234' (string concatenation)
Substring
    dna1 = 'ACGTGAACT'
    dna2 = dna1[0:4]           # 'ACGT'
    length = len(dna2)         # 4
Reversing
    dna1 = 'ACGTGAACT'
    dna2 = dna1[::-1]          # 'TCAAGTGCA'
Lists
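List Variables: a list of comma-separated values
    lst1 = ['A', 'C', 'G']
    lst2 = ['T']
    lst1 = lst1 + lst2                      # ['A', 'C', 'G', 'T']
Variable-length list: Insert, Delete, Append, Reverse, and Sort
    lst = ['A', 'T', 'G']
    lst.insert(1, 'C')                      # ['A', 'C', 'T', 'G']
    del lst[2]                              # ['A', 'C', 'G']
    lst.append('T')                         # ['A', 'C', 'G', 'T']
    lst.extend(['A', 'C'])                  # ['A', 'C', 'G', 'T', 'A', 'C']
    lst.reverse()                           # ['C', 'A', 'T', 'G', 'C', 'A']
    lst.sort()                              # ['A', 'A', 'C', 'C', 'G', 'T']
Using slices
    lst = ['A', 'T', 'G']
    lst[1:2] = 'C'                          # replace:  ['A', 'C', 'G']
    lst[1:1] = 'T'                          # insert:   ['A', 'T', 'C', 'G']
    lst[2:3] = ''                           # delete:   ['A', 'T', 'G']
    lst[len(lst):len(lst)] = 'T'            # append:   ['A', 'T', 'G', 'T']
    lst[len(lst):len(lst)] = ['A', 'C']     # extend:   ['A', 'T', 'G', 'T', 'A', 'C']
    lst[::-1]                               # reversed copy; lst itself is unchanged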
Sets
Set Variables
    DNAbases = {'A', 'C', 'G', 'T'}
    RNAbases = {'A', 'C', 'G', 'U'}
    DNAbases | RNAbases        # union: A, C, G, T, U
    DNAbases & RNAbases        # intersection: A, C, G
    DNAbases - RNAbases        # difference: T
Add and Remove
    bases = {'A', 'D', 'G'}
    bases.add('T')
    bases.remove('D')
Dictionaries
Initialization
    d = dict()
    d['key1'] = 'value1'
    k2, v2 = 'key2', 'value2'
    d[k2] = v2
    d = {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}
Mapping
    d['key1']
    d.get('key1')
    d.keys()
    d.values()
Delete
    del d['key1']
Input / Output
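Standard Input
    import sys
    data = sys.stdin.readline().replace('\n', ' ')
Reading Files
    # file name given in the code
    name = 'myfilename.txt'
    with open(name) as file:
        data = file.read()

    # file name read from standard input
    import sys
    name = sys.stdin.readline().strip()    # strip the trailing newline from the name
    with open(name) as file:
        data = file.read()

    # file name given as a command-line argument
    import sys
    name = sys.argv[1]
    with open(name) as file:
        data = file.read()
Writing Files
    name = 'output.txt'
    with open(name, 'w') as file:
        file.write(data)    # assumed: the slide is cut off after the "with" line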
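As a quick check of the reading and writing patterns above, the following sketch writes a sequence to a file and reads it back. The temporary directory and the variable names are assumptions made for illustration; they are not part of the lecture's code.

```python
import os
import tempfile

dna = 'ATCGATGA'

# Work inside a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as folder:
    name = os.path.join(folder, 'output.txt')

    # Writing: mode 'w' creates the file (or overwrites an existing one).
    with open(name, 'w') as file:
        file.write(dna + '\n')

    # Reading: read() returns the whole file, including the newline.
    with open(name) as file:
        data = file.read()

sequence = data.strip()   # drop the trailing newline
print(sequence)
```

Using "with" closes the file automatically at the end of the block, even if an error occurs inside it.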
Functions
Types
- Built-in system functions
- User-defined functions
Defining Function
    def function_name(parameter_list):
        statement
        statement
        return value
Function Call
    result = function_name(argument_list)
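To make the template concrete, here is a minimal sketch that defines one function and calls both a built-in and a user-defined function. The name gc_count is hypothetical and chosen only for this example.

```python
def gc_count(dna):
    """Return the number of G and C bases in a DNA string."""
    seq = dna.upper()
    return seq.count('G') + seq.count('C')

# Calling a built-in function
length = len('ATCGATGA')

# Calling the user-defined function
n_gc = gc_count('ATCGATGA')

print(length, n_gc)
```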
Iteration
Iterative Process
    def find_max(lst):
        max_so_far = lst[0]
        for item in lst[1:]:
            if item > max_so_far:
                max_so_far = item
        return max_so_far

    lst1 = [3, 5, 10, 4, 6]
    maximum = find_max(lst1)       # 10
Recursion
Recursive Call
    def print_tree(tree, level):
        # indent each node by four spaces per level
        print(' ' * 4 * level, tree[0])
        for subtree in tree[1:]:
            print_tree(subtree, level + 1)

    t1 = ['A', ['T', ['A'], ['T']], ['G', ['G'], ['C']]]
    print_tree(t1, 0)
Modules
Module: a collection of functions
Module files: Python (.py) files in a library directory
Module Call
    import random
    seq = 'ATCGATAGCTA'
    random_base = seq[random.randint(0, len(seq) - 1)]

    from random import *
    seq = 'ATCGATAGCTA'
    random_base = seq[randint(0, len(seq) - 1)]
Regular Expressions
Special Languages
- Metacharacters
- Quantifiers
- Alternatives
- Character Set
Usage: much the same as the regular expressions in Perl
    import re
    if re.match('TATA.*AA', seq):
        print('It matched!')

    import re
    matches = re.findall('TATA.*AA', seq)
    print(matches)
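The snippets above assume that seq is already defined. As a self-contained sketch, the example below uses a made-up sequence (an assumption for illustration) to show how re.match, re.search, and re.findall differ: re.match only tests the beginning of the string, while re.search and re.findall look through all of it.

```python
import re

seq = 'GGCTATAGCGCAATT'   # hypothetical example sequence

# re.match succeeds only if the pattern matches at the start of seq.
at_start = re.match('TATA.*AA', seq)

# re.search scans the whole string and returns the first match.
anywhere = re.search('TATA.*AA', seq)

# re.findall returns every non-overlapping match as a list of strings.
matches = re.findall('TATA.*AA', seq)

print(at_start)
print(anywhere.group(), anywhere.start())
print(matches)
```

Because the sequence starts with GGC rather than TATA, re.match finds nothing here, whereas re.search locates the motif further along the string.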
Biological Applications
Parsing Sequences
Base Frequency Counting
Motif (Substring) Search
Sequence Transformation
DNA Replication
Transcription from DNA to RNA
Translating RNA into Protein
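For the DNA Replication item in the list above, a minimal sketch of building the complementary strand might look like the following. The function names complement and reverse_complement are assumptions for illustration, and the sketch only handles the four bases A, C, G, and T.

```python
def complement(dna):
    """Return the complementary strand, base by base (A<->T, C<->G)."""
    pairs = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(pairs[base] for base in dna.upper())

def reverse_complement(dna):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return complement(dna)[::-1]

template = 'ATCGATGA'
print(complement(template))
print(reverse_complement(template))
```

Any character outside A, C, G, and T raises a KeyError in this sketch, so the input should be validated first.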
Parsing Sequences (1)
Single Sequence in FASTA Format
    >gi|5524211|gb|AAD44166.1| cytochrome b
    LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIP
    YIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDK
    IPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
    VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYP
    YTIIGQMASILYFSIILAFLPIAGXIENY
Parsing
Make a function to return the sequence from the FASTA format
    def read_FASTA_seq(filename):
        with open(filename) as f:
            return f.read().partition('\n')[2].replace('\n', '')
Parsing Sequences (2)
Multiple Sequences in FASTA Format
    >SEQUENCE_1
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
    LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIP
    QFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFY
    VMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
    >SEQUENCE_2
    SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGE
    NLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Parsing ?
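One possible answer to the parsing question is sketched below. It is not necessarily the solution intended in the lecture: the function names parse_FASTA and read_FASTA are assumptions, and the sketch assumes that the character '>' appears only at the start of a header line.

```python
def parse_FASTA(text):
    """Split FASTA-formatted text into (header, sequence) pairs."""
    entries = []
    # Every record starts with '>', so split on it and skip the empty
    # piece that comes before the first record.
    for record in text.split('>')[1:]:
        header, _, body = record.partition('\n')
        entries.append((header.strip(), body.replace('\n', '')))
    return entries

def read_FASTA(filename):
    """Read a FASTA file and return its (header, sequence) pairs."""
    with open(filename) as f:
        return parse_FASTA(f.read())

# Shortened stand-in for the two records shown above.
example = '>SEQUENCE_1\nMTEIT\nAAMVK\n>SEQUENCE_2\nSATVS\nEINSE\n'
for header, sequence in parse_FASTA(example):
    print(header, sequence)
```

Keeping the text parsing separate from the file reading makes the parser easy to try on a short string before pointing it at a real file.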
Frequency Counting
DNA Sequence Validation
    def validate_dna(base_sequence):
        seq = base_sequence.upper()
        return len(seq) == (seq.count('T') + seq.count('C') +
                            seq.count('A') + seq.count('G'))

    # alternative: check each base in turn
    def validate_dna(base_sequence):
        seq = base_sequence.upper()
        for base in seq:
            if base not in 'ACGT':
                return False
        return True
Counting Base Frequency
Make a function to calculate the percent of 'C' and 'G' in a DNA sequence
    def percent_of_GC(base_sequence):
        seq = base_sequence.upper()
        # multiply by 100.0 so the result is a percentage and uses float division
        return 100.0 * (seq.count('G') + seq.count('C')) / len(seq)
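The GC percentage covers two of the four bases. To count every base at once, a dictionary can be used as in the sketch below. The function name base_frequency is an assumption for illustration.

```python
def base_frequency(dna):
    """Count how often each base occurs in a DNA string."""
    counts = {}
    for base in dna.upper():
        # get() returns 0 the first time a base is seen
        counts[base] = counts.get(base, 0) + 1
    return counts

freq = base_frequency('ATCGATGA')
print(freq)
```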
Motif Search
Searching Substring
Make a function to take a sequence and a motif and return the position(s) of the matches in the sequence
    def motif_search(seq, motif):
        # position of the first match, or -1 if the motif does not occur
        return seq.find(motif)

    def all_motif_search(seq, motif):
        pos = []
        idx = seq.find(motif)
        while idx >= 0:
            pos.append(idx)
            # continue after the end of this match (non-overlapping matches)
            idx = seq.find(motif, idx + len(motif))
        return pos
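The all_motif_search function resumes searching after the end of each match, so it reports non-overlapping matches only. Motifs in real sequences can overlap; a variant that advances one base at a time is sketched below. The function name overlapping_motif_search is an assumption for illustration.

```python
def overlapping_motif_search(seq, motif):
    """Return the start position of every match, overlapping ones included."""
    pos = []
    idx = seq.find(motif)
    while idx >= 0:
        pos.append(idx)
        # Step forward by one base only, so overlapping matches are kept.
        idx = seq.find(motif, idx + 1)
    return pos

positions = overlapping_motif_search('ATATATA', 'ATA')
print(positions)
```

In the sequence ATATATA the motif ATA starts at positions 0, 2, and 4; a non-overlapping search would report only 0 and 4.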
Transcription
Simulating Transcription
Make a function to transcribe a DNA sequence into an RNA sequence
    def transcription(dna):
        return dna.replace('T', 'U')
Translation (1)
Making Genetic Code
Make a function to translate a codon to an amino acid
    def codon2aa(codon):
        genetic_code = {
            'UUU': 'F', 'UUC': 'F', 'UUA': 'L',
            # …… (remaining codons)
        }
        if codon in genetic_code:
            return genetic_code[codon]
        else:
            return ''    # assumed: the slide's else-branch is cut off; '' marks an unknown codon
Translation (2)
Simulating Translation
Make a function to translate an RNA into a protein sequence
    def translation(rna):
        protein = ''
        for n in range(0, len(rna), 3):
            protein += codon2aa(rna[n:n+3])
        return protein
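To see transcription and translation work together, the sketch below runs both on a short sequence. The dictionary small_code is an assumption for illustration: it holds only five entries of the standard genetic code, and unknown codons are translated to an empty string.

```python
# A deliberately tiny subset of the standard genetic code, for illustration only.
small_code = {'AUG': 'M', 'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'GGC': 'G'}

def codon2aa(codon):
    # Unknown or incomplete codons translate to '' in this sketch.
    return small_code.get(codon, '')

def transcription(dna):
    return dna.replace('T', 'U')

def translation(rna):
    protein = ''
    # Read the RNA three bases (one codon) at a time.
    for n in range(0, len(rna), 3):
        protein += codon2aa(rna[n:n+3])
    return protein

dna = 'ATGTTTTTAGGC'
rna = transcription(dna)
protein = translation(rna)
print(rna, protein)
```

The four codons AUG, UUU, UUA, and GGC give the amino acids M, F, L, and G. A complete table would also need the stop codons, which this sketch leaves out.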
Translation (3)
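Simulating Translation (cont'd)
Make a generator
- an object that returns values from a series it computes
    def aa_generator(rna):
        return (codon2aa(rna[n:n+3]) for n in range(0, len(rna), 3))

    def translation(rna):
        gen = aa_generator(rna)
        protein = ''
        aa = next(gen, None)      # None once the generator is exhausted
        while aa:
            protein += aa         # assumed loop body; the slide is cut off after "while aa:"
            aa = next(gen, None)
        return protein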
Mutation
Simulating Mutation
Make a function to simulate single point mutations in a DNA sequence
    import random

    def mutation(dna):
        position = random.randint(0, len(dna) - 1)
        # exclude the current base so the mutation always changes the sequence
        bases = 'ACGT'.replace(dna[position], '')
        new_base = bases[random.randint(0, 2)]
        # strings are immutable, so build a new string around the mutated position
        return dna[:position] + new_base + dna[position+1:]
Questions?
Lecture Slides are found on the Course Website, web.ecs.baylor.edu/faculty/cho/3360