The BisI restriction enzyme cuts at an even wider range of motifs: the pattern is GCNGC, where N represents any base. We can use the same alternation technique to search for this pattern:
import re
dna = "ATCGCGAATTCAC"
if re.search(r"GC(A|T|G|C)GC", dna):
    print("restriction site found!")
Chapter 7: Regular expressions
However, there's another regular expression feature that lets us write the pattern more concisely. A pair of square brackets with a list of characters inside them can represent any one of these characters. So the pattern GC[ATGC]GC will match GCAGC, GCTGC, GCGGC and GCCGC. Here's the same program using character groups:
import re
dna = "ATCGCGAATTCAC"
if re.search(r"GC[ATGC]GC", dna):
    print("restriction site found!")
If we want a character in a pattern to match any character in the input (other than a newline, by default), we can use a period: the pattern GC.GC would match all four possibilities. However, the period would also match any character which is not a DNA base, or even a letter. Therefore, the whole pattern would also match GCFGC, GC&GC and GC9GC, which may not be what we want.
Sometimes it's easier, rather than listing all the acceptable characters, to specify the characters that we don't want to match. Putting a caret ^ at the start of a character group, like this: [^XYZ], will negate it, so that it matches any character that isn't in the group.
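To see the difference between a character group, a period and a negated group, here is a minimal sketch. The list of test strings is made up for illustration: four valid sites, plus three strings with a non-base character in the middle.

```python
import re

# Hypothetical test strings: four valid sites, then three with a non-base character
candidates = ["GCAGC", "GCTGC", "GCGGC", "GCCGC", "GCFGC", "GC&GC", "GC9GC"]

# A character group only accepts one of the listed bases in the middle position
group_hits = [s for s in candidates if re.search(r"GC[ATGC]GC", s)]

# A period accepts any character in the middle position
period_hits = [s for s in candidates if re.search(r"GC.GC", s)]

# A negated group accepts any character that is NOT one of the listed bases
negated_hits = [s for s in candidates if re.search(r"GC[^ATGC]GC", s)]

print(group_hits)    # only the four valid sites
print(period_hits)   # all seven strings
print(negated_hits)  # only the three strings with a non-base character
```

The negated group is handy for the opposite job: spotting sequences that contain something other than A, T, G or C.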
Quantifiers
The regular expression features discussed above let us describe variation in the individual characters of patterns. Another group of features, quantifiers, let us describe variation in the number of times a section of a pattern is repeated. A question mark immediately following a character means that that character is optional: it can match either zero or one times. So in the pattern GAT?C the T is optional, and the pattern will match either GATC or GAC. If we want to apply a question mark to more than one character, we can group the characters in parentheses. For example, in the pattern GGG(AAA)?TTT the group of three As is optional, so the pattern will match either GGGAAATTT or GGGTTT.
A plus sign immediately following a character or group means that the character or group must be present but can be repeated any number of times; in other words, it will match one or more times. For example, the pattern GGGA+TTT will match three Gs, followed by one or more As, followed by three Ts. So it will match GGGATTT, GGGAATTT, GGGAAATTT, etc. but not GGGTTT.
An asterisk immediately following a character or group means that the character or group is optional, but can also be repeated. In other words, it will match zero or more times. For example, the pattern GGGA*TTT will match three Gs, followed by zero or more As, followed by three Ts. So it will match GGGTTT, GGGATTT, GGGAATTT, etc.
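The question mark, plus sign and asterisk can be checked side by side with a short sketch like the one below. The matching strings come from the examples in the text; the non-matching strings GATTC, GGGATTT (for the optional group) and GGGCTTT are extra ones invented for illustration.

```python
import re

# Each entry: pattern, strings that should match, strings that should not
cases = [
    (r"GAT?C", ["GATC", "GAC"], ["GATTC"]),                   # T is optional
    (r"GGG(AAA)?TTT", ["GGGAAATTT", "GGGTTT"], ["GGGATTT"]),  # whole group is optional
    (r"GGGA+TTT", ["GGGATTT", "GGGAATTT"], ["GGGTTT"]),       # one or more As
    (r"GGGA*TTT", ["GGGTTT", "GGGATTT"], ["GGGCTTT"]),        # zero or more As
]

results = {}
for pattern, should_match, should_not_match in cases:
    for seq in should_match + should_not_match:
        # re.search returns a match object or None; turn that into True/False
        results[(pattern, seq)] = re.search(pattern, seq) is not None
        print(pattern, seq, results[(pattern, seq)])
```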
If we want to specify a specific number of repeats, we can use curly brackets. Following a character or group with a single number inside curly brackets will match exactly that number of repeats. For example, the pattern GA{5}T will match GAAAAAT but not GAAAAT or GAAAAAAT. Following a character or group with a pair of numbers inside curly brackets separated with a comma allows us to specify an acceptable range of number of repeats. For example, the pattern GA{2,4}T will match GAAT, GAAAT and GAAAAT but not GAT or GAAAAAT.
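As a quick check of the curly bracket examples, the sketch below builds test sequences with between one and six As and reports which ones each pattern accepts. The variable names are just for illustration.

```python
import re

exact = r"GA{5}T"     # exactly five As between G and T
ranged = r"GA{2,4}T"  # between two and four As between G and T

# Build G + n As + T for n = 1 to 6
sequences = ["G" + "A" * n + "T" for n in range(1, 7)]

exact_hits = [s for s in sequences if re.search(exact, s)]
ranged_hits = [s for s in sequences if re.search(ranged, s)]

print(exact_hits)   # ['GAAAAAT']
print(ranged_hits)  # ['GAAT', 'GAAAT', 'GAAAAT']
```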
Positions
The final set of regular expression tools we're going to look at don't represent characters at all, but rather positions in the input string. The caret symbol ^ matches the start of a string, and the dollar symbol $ matches the end of a string. The pattern ^AAA will match AAATTT but not GGGAAATTT. The pattern GGG$ will match AAAGGG but not AAAGGGCCC.
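The four position examples can be tried out directly. This is only a small sketch; the variable names are invented for illustration.

```python
import re

# ^ anchors the pattern to the start of the string
start_ok = re.search(r"^AAA", "AAATTT") is not None
start_bad = re.search(r"^AAA", "GGGAAATTT") is not None

# $ anchors the pattern to the end of the string
end_ok = re.search(r"GGG$", "AAAGGG") is not None
end_bad = re.search(r"GGG$", "AAAGGGCCC") is not None

print(start_ok, start_bad)  # True False
print(end_ok, end_bad)      # True False
```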
Combining
The real power of regular expressions comes from combining these tools. We can use quantifiers together with alternations and character groups to specify very flexible patterns. For example, here's a complex pattern to identify full-length eukaryotic messenger RNA sequences:
^ATG[ATGC]{30,1000}A{5,10}$
Reading the pattern from left to right, it will match:
• an ATG start codon at the beginning of the sequence
• followed by between 30 and 1000 bases which can be A, T, G or C
• followed by a poly-A tail of between 5 and 10 bases at the end of the sequence
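Here is a rough sketch of the combined pattern in action, taking it to be ^ATG[ATGC]{30,1000}A{5,10}$. The three test sequences are invented for illustration and are not real transcripts: one fits the description, one lacks the start codon, and one has a tail that is too short.

```python
import re

mrna_pattern = r"^ATG[ATGC]{30,1000}A{5,10}$"

# Hypothetical sequences, built only to exercise each part of the pattern
good = "ATG" + "GCTC" * 10 + "A" * 7        # start codon, 40 bases, 7-base poly-A tail
no_start = "CTG" + "GCTC" * 10 + "A" * 7    # does not begin with ATG
short_tail = "ATG" + "GCTC" * 10 + "A" * 3  # only 3 As at the end

for name, seq in [("good", good), ("no_start", no_start), ("short_tail", short_tail)]:
    if re.search(mrna_pattern, seq):
        print(name + ": looks like a full-length mRNA")
    else:
        print(name + ": does not match")
```

One thing to keep in mind is that the middle section [ATGC] can itself match A, so the pattern checks that the sequence ends with at least five As rather than measuring the exact length of the tail.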
As you can see, regular expressions can be quite tricky to read until you're familiar with them! However, it's well worth investing a bit of time learning to use them, as the same notation is used across multiple different tools. The regular expression skills that you learn in Python are transferable to other programming languages, command line tools, and text editors.
The features we've discussed above are the ones most useful in biology, and are sufficient to tackle all the exercises at the end of the chapter. However, there are many more regular expression features available in Python. If you want to become a regular expression master, it's worth reading up on greedy vs. minimal quantifiers, back-references, lookahead and lookbehind assertions, and built-in character classes.
Before we move on to look at some more sophisticated uses of regular expressions, it's worth noting that there's a function similar to re.search called re.match. The difference is that re.search will identify a pattern occurring anywhere in the string, whereas re.match will only identify a pattern if it matches at the very start of the string. Most of the time we want the former behaviour.
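In Python's re module, re.match only looks for the pattern at the start of the string, while a third function, re.fullmatch, requires the pattern to account for the entire string. A minimal sketch of the three side by side, using a made-up sequence:

```python
import re

dna = "CCCATGAAA"  # hypothetical sequence for illustration

anywhere = re.search(r"ATG", dna)   # match object: ATG occurs in the middle
at_start = re.match(r"ATG", dna)    # None: the string does not start with ATG
prefix = re.match(r"CCC", dna)      # match object: CCC is at the start, the rest is ignored
whole = re.fullmatch(r"CCC", dna)   # None: CCC does not cover the whole string

print(anywhere is not None, at_start is not None)
print(prefix is not None, whole is not None)
```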
Extracting the part of the string that matched
In the section above we used re.search as the condition in an if statement to decide whether or not a string contained a pattern. Often in our programs, we want to find out not only if a pattern matched, but what part of the string was matched. To do this, we need to store the result of using re.search, then use the group method on the resulting object.
When introducing the re.search function above I wrote that it was a true/false function. That's not exactly correct though: if it finds a match, it doesn't return True, but rather an object that is evaluated as true in a conditional context [1] (if the distinction doesn't seem important to you, then you can safely ignore it). The value that's actually returned is a match object, a new data type that we've not encountered before. Like a file object (see chapter 3), a match object doesn't represent a simple thing, like a number or string. Instead, it represents the results of a regular expression search. And again, just like a file object, a match object has a number of useful methods for getting data out of it.
One such method is the group method. If we call this method on the result of a regular expression search, we get the portion of the input string that matched the pattern:
import re
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA[ATGC]{3}AC", dna)
print(m.group())
In the above code, we're searching inside a DNA sequence for GA, followed by three bases, followed by AC. By calling the group method on the resulting match object, we can see the part of the DNA sequence that matched, and figure out what the middle three bases were:
GACGTAC
[1] If a match isn't found, then the same thing applies: the function doesn't return False, but a different built-in value, None, that evaluates as false. If this doesn't make sense, don't worry.
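Because re.search returns None when nothing matches, calling group on the result of a failed search causes an error. One way to guard against that is sketched below; the function name find_site is hypothetical and not something from the re module.

```python
import re

def find_site(dna):
    """Return the part of dna matching GA + three bases + AC, or None if there is no match."""
    m = re.search(r"GA[ATGC]{3}AC", dna)
    if m is None:
        # no match object to call group() on
        return None
    return m.group()

print(find_site("ATGACGTACGTACGACTG"))  # GACGTAC
print(find_site("TTTTTTTT"))            # None
```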
What if we want to extract more than one bit of the pattern? Say we want to match this pattern:
GA[ATGC]{3}AC[ATGC]{2}AC
That's GA, followed by three bases, followed by AC, followed by two bases, followed by AC again. We can surround the bits of the pattern that we want to extract with parentheses; this is called capturing:
GA([ATGC]{3})AC([ATGC]{2})AC
We can now refer to the captured bits of the pattern by supplying an argument to the group method. group(1) will return the bit of the string matched by the section of the pattern in the first set of parentheses, group(2) will return the bit matched by the second, etc.:
import re
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))
The output shows that the three bases in the first variable section were CGT, and the two bases in the second variable section were GT:
entire match: GACGTACGTAC
first bit: CGT
second bit: GT
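If we want all of the captured bits at once, match objects also have a groups method, which returns them as a tuple. This goes slightly beyond what the text covers, so treat it as an optional sketch; the variable names are invented for illustration.

```python
import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)

# groups() returns every captured section, in order, as a tuple
captured = m.groups()
first_bit, second_bit = captured

print(captured)    # ('CGT', 'GT')
print(first_bit)   # CGT
print(second_bit)  # GT
```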