Multiple exons from genomic DNA - python for biologists

The file genomic_dna.txt contains a section of genomic DNA, and the file exons.txt contains a list of start/stop positions of exons. Each exon is on a separate line and the start and stop positions are separated by a comma. Write a program that will extract the exon segments, concatenate them, and write them to a new file.

Chapter 4: Lists and loops

Solutions

Processing DNA in a file

This seems a bit more complicated than previous exercises (we are being asked to write a program that does two things at once!), so let's tackle it one step at a time.

First, we'll write a program that simply reads each sequence from the file and prints it to the screen:

file = open("input.txt")
for dna in file:
    print(dna)

We can see from the output that we've forgotten to remove the newlines from the ends of the DNA sequences, so there is a blank line between each:

ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC

ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT

ATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC

ATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA

ATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA

b&t e ignore that for no. $he net step is to remove the first 1B bases of ea!h se<&en!e. e -no that e ant to ta-e a s&bstring from ea!h se<&en!e, starting at the fifteenth !hara!ter, an% !ontin&ing to the en%. *nfort&natey, the se<&en!es are a %ifferent engths, so the stop position is going to be %ifferent for a of them. e have to !a!&ate the position of the ast !hara!ter for ea!h se<&en!e, by &sing the len f&n!tion to !a!&ate the ength.


7eres hat the !o%e oo-s i-e ith the s&bstring part a%%e%:

file = open("input.txt")
for dna in file:
    last_character_position = len(dna)
    trimmed_dna = dna[14:last_character_position]
    print(trimmed_dna)

4s before, e are simpy printing the trimme% D;4 se<&en!e to the s!reen, an% from the o&tp&t e !an !onfirm that the first 1B bases have been remove% from ea!h se<&en!e: )&%A)&%A)&%A)&%A)&%A)&%A)&%A)&%A)&%A)&%A)& A&)%A)&%A)&%A)&%A)&%A)&%A)%&)A)&%)&%) A)&%A)&A&%A)&)A)&%)A&%)A)%&A)A)&%A)A)&%A)&%)A%)& A&)A)&%A)%A)&)A%&)A&%A)&%)A%&)%)A A&)A%&)A%)&)&%A)%&A)%A)&A%&))A%&)%A)%A)%&)A)%&A

;o that e -no o&r !o%e is or-ing, e sit!h from printing to the s!reen to riting to a fie. e have to open the fie before the oop, then rite the trimme%

se<&en!es to the fie inside the oop:

file = open("input.txt")
output = open("trimmed.txt", "w")
for dna in file:
    last_character_position = len(dna)
    trimmed_dna = dna[14:last_character_position]
    output.write(trimmed_dna)


Opening up the trimmed.txt file, we can see that the result looks good. It didn't matter that we never removed the newlines, because they appear in the correct place in the output file anyway:

TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT
ATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC
ACTATCGATGATCTAGCTACGATCGTAGCTGTA
ACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA
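To see why the newlines end up in the right place, it may help to run the same loop on a tiny input. The sketch below is an illustration only: the two sequences are made up, and the files are created in a temporary directory (the names workdir, input_path and output_path are hypothetical) so that nothing on disk is overwritten.

```python
import os
import tempfile

# Made-up data standing in for input.txt: a 14-base prefix plus a short tail
adapter = "ATTCGATTATAAGC"
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.txt")
output_path = os.path.join(workdir, "trimmed.txt")
with open(input_path, "w") as handle:
    handle.write(adapter + "AAAA\n" + adapter + "GGGGGG\n")

with open(input_path) as file, open(output_path, "w") as output:
    for dna in file:
        # dna still ends in "\n", so the slice carries the newline along
        output.write(dna[14:])

with open(output_path) as handle:
    trimmed_lines = handle.read().splitlines()
print(trimmed_lines)
```

Because write does not add a newline of its own, the newline that survives the slice is exactly the one that separates the sequences in the output file.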

;o the fina step 5 printing the engths to the s!reen 5 re<&ires 9&st one more ine of !o%e. 7eres the fina program in f&, ith !omments:

# open the input file
file = open("input.txt")

# open the output file
output = open("trimmed.txt", "w")

# go through the input file one line at a time
for dna in file:

    # calculate the position of the last character
    last_character_position = len(dna)

    # get the substring from the 15th character to the end
    trimmed_dna = dna[14:last_character_position]

    # print out the trimmed sequence
    output.write(trimmed_dna)

    # print out the length to the screen
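The listing stops at the comment for the last step, and the line that follows it is not shown here. One possible way to complete it is sketched below, under two assumptions that are not taken from the text: that the length of interest is that of the trimmed sequence, and that the trailing newline should not be counted. The input line and the message wording are made up for illustration.

```python
# A made-up line, as it would be read from the input file
dna = "ATTCGATTATAAGCTCGATC\n"

trimmed_dna = dna[14:]

# len counts the trailing newline too, so strip it before measuring
length = len(trimmed_dna.rstrip("\n"))

# str turns the number into text so it can be joined to the message
print("trimmed sequence length: " + str(length))
```

If the newline is not stripped, every reported length is one larger than the number of bases.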


Multiple exons from genomic DNA

This is very similar to the exercises from the previous two chapters, and so our solution to it is going to look very similar. Let's concentrate on the new bit of the problem first: reading the file of exon locations. As before, we can start by opening up the file and printing each line to the screen:

exon_locations = open("exons.txt")
for line in exon_locations:
    print(line)

$his gives &s a oop in hi!h e are %eaing ith a %ifferent eon ea!h time ro&n%. 'f e oo- at the o&tp&t, e !an see that e sti have a neine at the en% of ea!h ine, b&t e not orry abo&t that for no:

5,58

72,133

190,276

340,398

;o e have to spit &p ea!h ine into a start an% stop position. $he split metho% is probaby a goo% !hoi!e for this 9ob 5 ets see hat happens hen e spit ea!h ine &sing a !omma as the %eimiter:

exon_locations = open("exons.txt")
for line in exon_locations:
    positions = line.split(',')
    print(positions)


$, Gn $72, 133n $1J0, 27"n $3'0, 3JGn

$he se!on% eement of ea!h ist has a neine on the en%, be!a&se e havent remove% them. +ets try assigning the start an% stop position to sensibe variabe names, an% printing them o&t in%ivi%&ay:

exon_locations = open("exons.txt")
for line in exon_locations:
    positions = line.split(',')
    start = positions[0]
    stop = positions[1]
    print("start is " + start + ", stop is " + stop)

$he o&tp&t shos that this approa!h or-s 5 the start an% stop variabes ta-e %ifferent va&es ea!h time ro&n% the oop:

start is , stop is G start is 72, stop is 133 start is 1J0, stop is 27" start is 3'0, stop is 3JG

;o ets try p&tting these variabes to &se. e rea% the genomi! se<&en!e from the fie a in one go &sing read 5 theres no nee% to pro!ess ea!h ine separatey, as e 9&st ant the entire !ontents. $hen e &se the eon !oor%inates to etra!t one eon ea!h time ro&n% the oop, an% print it to the s!reen:


1  genomic_dna = open("genomic_dna.txt").read()
2  exon_locations = open("exons.txt")
3  for line in exon_locations:
4      positions = line.split(',')
5      start = positions[0]
6      stop = positions[1]
7      exon = genomic_dna[start:stop]
8      print("exon is: " + exon)

Unfortunately, when we run this code we get an error at line 7:

  File "multiple_exons_from_genomic_dna.py", line 7, in <module>
    exon = genomic_dna[start:stop]
TypeError: slice indices must be integers or None or have an __index__ method

hat has gone rongE e!a that the res&t of &sing split on a string is a ist of strings 5 this means that the start an% stop aria$les in o&r program are aso strings be!a&se theyre 9&st in%ivi%&a eements of the positions ist=. $he probem !omes hen e try to &se them as n&mbers in ine H. Fort&natey, its easiy fie% 5 e 9&st have to &se the int f&n!tion to t&rn o&r strings into n&mbers:

start B int(positions$0# stop B int(positions$1#

an% the program or-s as inten%e%.
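The failure and the fix can both be reproduced on a small scale. In the sketch below the genomic sequence and the coordinates are made up for illustration, and the error is caught so that the script runs to completion.

```python
genomic_dna = "ACGTACGTACGTACGTACGT"  # made-up sequence for illustration
positions = "4,12\n".split(",")        # a list of two strings

# Slicing with strings raises a TypeError
try:
    genomic_dna[positions[0]:positions[1]]
    error_message = None
except TypeError as error:
    error_message = str(error)
print(error_message)

# int ignores surrounding whitespace, including the trailing newline
start = int(positions[0])
stop = int(positions[1])
exon = genomic_dna[start:stop]
print(exon)
```

Note that int accepts the string with its trailing newline, which is why the newline never had to be removed for this step.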

;et step: %oing something &sef& ith the eons, rather than 9&st printing them to the s!reen. $he eer!ise %es!ription says that e have to !on!atenate the eon

se<&en!es to ma-e a ong !o%ing se<&en!e. 'f e ha% a the eons in separate variabes, then this o&% be easy

coding/se B exon1 C exon2 C exon3 C exon'



b&t instea% e have a singe exon variabe that stores one eon at a time. 7eres one ay to get the !ompete !o%ing se<&en!e: before the oop starts e !reate a ne variabe !ae% coding_seAuence an% assign it to an empty string. $hen, ea!h time ro&n% the oop, e a%% the !&rrent eon on to the en%, an% store the res&t ba!- in the same variabe. hen the oop has finishe%, the variabe i !ontain a the eons. $his is hat the !o%e oo-s i-e ith ine n&mbers as the program is getting <&ite ong=:

1   genomic_dna = open("genomic_dna.txt").read()
2   exon_locations = open("exons.txt")
3   coding_sequence = ""
4   for line in exon_locations:
5       positions = line.split(',')
6       start = int(positions[0])
7       stop = int(positions[1])
8       exon = genomic_dna[start:stop]
9       coding_sequence = coding_sequence + exon
10      print("coding sequence is : " + coding_sequence)

?n ine 3 e !reate the coding_seAuence variabe, an% on ine @, insi%e the oop, e a%% the !&rrent exon on to the en%. $his is an &n&s&a type of variabe

assignment, be!a&se the coding_seAuence variabe is on both the eft an% right si%e of the e<&as sign. $he tri!- to &n%erstan%ing ine @ is to rea% the righthan% si%e of the statement first i.e. >concatenate the current coding_seAuence an" the current exon 0 then store the result of that concatenation in coding_seAuence>. ?n ine 10, instea% of printing the eon, ere printing the !o%ing se<&en!e, an% e !an see from the o&tp&t ho the !o%ing se<&en!e is gra%&ay b&it &p as e go ro&n% the oop:



coding seuence is ! &%)A&&%)&%A&%A)%&)A&%A)&%)&%A)&%)A%)&%A)&A)&%A)&%A)&% coding seuence is ! &%)A&&%)&%A&%A)%&)A&%A)&%)&%A)&%)A%)&%A)&A)&%A)&%A)&%&%A)&%A)&%A)A)&%A)&%A )A)&A)&%A)%&A)&%A)&A)&%A)&%A)&%A)&%A)&%A coding seuence is ! &%)A&&%)&%A&%A)%&)A&%A)&%)&%A)&%)A%)&%A)&A)&%A)&%A)&%&%A)&%A)&%A)A)&%A)&%A )A)&A)&%A)%&A)&%A)&A)&%A)&%A)&%A)&%A)&%A&%A)&%A)&%A)&%)A%&)A%&)A%&)A%A)&%A )&A)&A)&%)A%&)A%&)&%A&)A%&)A&%)A&%A)&%A)%&A)&%A)&%)A coding seuence is ! &%)A&&%)&%A&%A)%&)A&%A)&%)&%A)&%)A%)&%A)&A)&%A)&%A)&%&%A)&%A)&%A)A)&%A)&%A )A)&A)&%A)%&A)&%A)&A)&%A)&%A)&%A)&%A)&%A&%A)&%A)&%A)&%)A%&)A%&)A%&)A%A)&%A )&A)&A)&%)A%&)A%&)&%A&)A%&)A&%)A&%A)&%A)%&A)&%A)&%)A&%A)&%A)&%A)&%A)&%A)&% A)&%A)&%A)&%A)&%A)&%)A%&)A%&)A&%A)&%

$he fina step is to save the !o%ing se<&en!e to a fie. e !an %o this at the en% of the program ith three ines of !o%e. 7eres the fina !o%e ith !omments:


# open the genomic dna file and read the contents
genomic_dna = open("genomic_dna.txt").read()

# open the exons locations file
exon_locations = open("exons.txt")

# create a variable to hold the coding sequence
coding_sequence = ""

# go through each line in the exon locations file
for line in exon_locations:

    # split the line using a comma
    positions = line.split(',')

    # get the start and stop positions
    start = int(positions[0])
    stop = int(positions[1])

    # extract the exon from the genomic dna
    exon = genomic_dna[start:stop]

    # append the exon to the end of the current coding sequence
    coding_sequence = coding_sequence + exon

# write the coding sequence to an output file
output = open("coding_sequence.txt", "w")
output.write(coding_sequence)
output.close()


5: Writing our own functions

In document python for biologists (Page 98-108)