Regular Expressions and Pattern Matching
Regular Expression (regex):
a separate language for constructing patterns, available in most programming languages.
particularly powerful in Perl.
Pattern Match:
using regex to search data and look for a match.
Overview:
how to create regular expressions
how to use them to match and extract data
biological context
So Why Regex?
Parse files of data and information:
fasta
embl / genbank format
html (web-pages)
user input to programs
Check format
Find illegal characters (validation)
Search for sequence motifs
Simple Patterns
place the regex between a pair of forward slashes (/ /).
try:
#!/usr/bin/perl
while (<STDIN>) {
    if (/abc/) {
        print "1 >> $_";
    }
}
Run the script.
Type in something that contains abc:
abcfoobar
Type in something that doesn't:
fgh
cba
foobar
ab c foobar
Simple Patterns (2)
Can also match strings from files.
genomes_desc.txt contains a few text lines containing information about three genomes.
try:
#!/usr/bin/perl
open IN, "<genomes_desc.txt";
while (<IN>) {
    if (/elegans/) {    #match lines with this regex
        print;          #print lines with match
    }
}
Parses each line in turn.
Flexible matching
There are many characters with special meanings: metacharacters.
star (*) matches zero or more instances of the preceding character
/ab*c/ => 'a' followed by zero or more 'b' followed by 'c'
=> abc or abbbbbbbc or ac
plus (+) matches at least one instance
/ab+c/ => 'a' followed by one or more 'b' followed by 'c'
=> abc or abbc or abbbbbbbbbbbbbbc NOT ac
question mark (?) matches zero or one instance
/ab?c/ => 'a' followed by 0 or 1 'b' followed by 'c'
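The three quantifiers above can be compared side by side. This is only a small sketch: the test strings are made up for illustration, and it uses Perl's built-in grep to keep the strings that match each pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative test strings (not taken from the course files).
my @tests = ('ac', 'abc', 'abbbc', 'adc');

# grep keeps the strings for which the pattern matches.
my @star     = grep { /ab*c/ } @tests;   # zero or more 'b'
my @plus     = grep { /ab+c/ } @tests;   # one or more 'b'
my @question = grep { /ab?c/ } @tests;   # zero or one 'b'

print "ab*c matches: @star\n";       # ac abc abbbc
print "ab+c matches: @plus\n";       # abc abbbc
print "ab?c matches: @question\n";   # ac abc
```

Note that 'adc' never matches: the quantifier only applies to the 'b' directly before it.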
More General Quantifiers
Match a character a specific number of times, or within a range of instances
{x} will match exactly x instances.
/ab{3}c/ => abbbc
{x,y} will match between x and y instances.
/a{2,4}bc/ => aabc or aaabc or aaaabc
{x,} will match x or more instances.
/abc{3,}/ => abccc or abccccccccc or abcccccccccccccccccccccc...
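A quick way to check the brace quantifiers is to run them over strings with a known number of repeats. The sketch below uses made-up strings and the built-in grep; it is meant as an illustration rather than part of the course scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Strings with one to four 'b' characters (illustrative only).
my @repeats = ('abc', 'abbc', 'abbbc', 'abbbbc');

my @exactly_three = grep { /ab{3}c/ }   @repeats;   # exactly 3 'b'
my @two_to_three  = grep { /ab{2,3}c/ } @repeats;   # 2 or 3 'b'
my @two_or_more   = grep { /ab{2,}c/ }  @repeats;   # 2 or more 'b'

print "{3}   : @exactly_three\n";   # abbbc
print "{2,3} : @two_to_three\n";    # abbc abbbc
print "{2,}  : @two_or_more\n";     # abbc abbbc abbbbc
```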
More metacharacters
dot (.) matches any single character, including tab (\t) and space, but not newline (\n).
Escaping
But I want to use these symbols in my regex!?!
to use a * , + , ? or . in the pattern as a literal character rather than a metacharacter, 'escape' it with a backslash.
/C\. elegans/ => C. elegans only
/C. elegans/ => Ca elegans , Cb elegans , C3 elegans , C> elegans , C. elegans , etc...
The 'delimiter' of the regex, forward slash “/”, and the 'escape' character, backslash “\”, are also metacharacters. These need to be escaped if required in the regex.
Important when trying to match URLs and email addresses.
/joe\.bloggs\@darwin\.co\.uk/
(the @ is escaped as well, because Perl would otherwise read @darwin as an array)
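The difference between an escaped and an unescaped dot is easy to see on two test lines. The sketch below also shows \Q...\E, a standard Perl feature (not covered on these slides) that escapes every metacharacter in an interpolated variable; the variable name $name and the test lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative lines: the second one is a deliberate typo.
my @lines = ('C. elegans genome', 'Cx elegans typo');

my @unescaped = grep { /C. elegans/ }  @lines;   # dot matches any character
my @escaped   = grep { /C\. elegans/ } @lines;   # dot must be a literal '.'

# \Q...\E escapes all metacharacters in the interpolated string.
my $name   = 'C. elegans';
my @quoted = grep { /\Q$name\E/ } @lines;

print scalar(@unescaped), " line(s) match the unescaped pattern\n";   # 2
print scalar(@escaped),   " line(s) match the escaped pattern\n";     # 1
print scalar(@quoted),    " line(s) match the quoted variable\n";     # 1
```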
Using metacharacters.
The file nemaglobins.embl contains 21 EMBL database entries that contain a globin protein within their sequence.
try:
#!/usr/bin/perl
$count = 0;
open IN, "<nemaglobins.embl" or die;
while (<IN>) {
    if (/AC   .*/) {    #that's three spaces
        print;
        $count++;
    }
}
Grouping Patterns
Can group patterns in parentheses “()”.
Useful when coupled with quantifiers
/elegans+/ => eleganssssssssssssss
/(elegans)+/ => eleganselegans...elegans
/eleg(ans){4}/ => elegansansansans
Alternatives
Want either this pattern or that pattern. Two ways:
1.) the vertical bar '|': either the left side matches or the right side matches
/(human|mouse|rat)/ => any string with human or mouse or rat.
Combine with previous examples:
/Fugu( |\t)+rubripes/ matches if Fugu and rubripes are separated by any mixture of spaces and tabs
2.) a character class is a list of characters within '[]'. It will match any single character within the class.
/[wxyz1234\t]/ => any of the nine.
a range can be specified with '-'
/[w-z1-4\t]/ => as above
to match a literal hyphen, put it first (or last) in the class, or escape it
/[-a-zA-Z]/ => any letter or a hyphen
negate a class by putting '^' first
/[^z]/ => any character except z
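Both ways of expressing alternatives can be tried on a few strings. This is a sketch with invented test data; it shows that the alternation ( |\t) and the character class [ \t] accept the same separators, and that a negated class finds characters outside a permitted set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative names; the second contains a space, a tab and a space.
my @names = ("Fugu rubripes", "Fugu \t rubripes", "Fugurubripes", "Fugu-rubripes");

my @by_alternation = grep { /Fugu( |\t)+rubripes/ } @names;
my @by_class       = grep { /Fugu[ \t]+rubripes/ }  @names;

# Alternation between whole words.
my @rodent_or_human = grep { /(human|mouse|rat)/ }
                      ('human liver', 'mouse brain', 'zebrafish fin');

# Negated class: any character that is not A, C, G or T.
my @invalid = grep { /[^ACGT]/ } ('ACGT', 'ACGNT');

print scalar(@by_alternation), " names match with alternation\n";       # 2
print scalar(@by_class),       " names match with a character class\n"; # 2
print "@rodent_or_human\n";                                             # human liver mouse brain
print "@invalid\n";                                                     # ACGNT
```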
Other Shortcuts
\d => any digit [0-9]
\w => any “word” character [A-Za-z0-9_]
\s => any white space [\t\n\r\f ]
\D => any character except a digit [^\d]
\W => any character except a “word” character [^\w]
\S => any character except a white space [^\s]
Can use any of these in conjunction with quantifiers:
/\s*/ => any amount of white space
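As a rough illustration of the shortcuts combined with quantifiers, the sketch below looks for a word, some white space and then a number. The record strings and ids are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented records: a name, white space, then (sometimes) a number.
my @records = ("gene1\t250", "gene2   1020", "gene3 long");

# word characters, then white space, then digits
my @with_number = grep { /\w+\s+\d+/ } @records;

# \D finds any character that is not a digit
my @ids         = ('12345', '12a45');
my @non_numeric = grep { /\D/ } @ids;

print scalar(@with_number), " records contain a name followed by a number\n";   # 2
print "ids with a non-digit: @non_numeric\n";                                   # 12a45
```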
Using alternatives to find a hydrophobic region...
try:
$count = 0;
$match = 0;
open IN, "< nippo_sigpept.fsa" or die;
while (<IN>) {
    if (/>/) {                      #a header line
        $count++;                   #keep running total of sequence number
    }
    else {                          #not a header
        if (/[VILMFWCA]{8,}/) {
            $match++;
        }
    }
}
print "Hydrophobic region found in $match sequences from $count\n";
Binding Operator
Revisited?
So far we have been matching against $_
The binding operator “=~” matches the pattern on the right against the string on the left.
The m operator is usually written in front of the pattern (optional when the delimiters are forward slashes).
$sumthing = 'Ascaris suum is a nematode';
if ($sumthing =~ m/suum.*nematode/) {
    print "this organism infects pigs!\n";
}
Anchors
/pattern/ will match anywhere in the string.
Use anchors to hold the pattern to a point in the string.
caret “^” (shift 6) marks the beginning of the string while dollar “$” marks the end of the string.
/^elegans/ => elegans only at start of string. Not C. elegans.
/Canis$/ => Canis only at end of string. Not Canis lupus.
/^\s*$/ => a blank line.
“$” ignores a new line character “\n” at the end of the string: it also matches just before it.
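The anchors can be checked against lines that still carry their trailing newline, as they would when read from a file. This is a sketch with invented lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented lines, each ending in "\n" as if read with <IN>.
my @lines = ("elegans\n", "C. elegans\n", "Canis\n", "Canis lupus\n", "   \n");

my @starts_with_elegans = grep { /^elegans/ } @lines;
my @ends_with_canis     = grep { /Canis$/ }   @lines;   # $ matches before the final \n
my @blank               = grep { /^\s*$/ }    @lines;

print scalar(@starts_with_elegans), " line starts with elegans\n";   # 1
print scalar(@ends_with_canis),     " line ends with Canis\n";       # 1
print scalar(@blank),               " blank line\n";                 # 1
```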
Anchors (2)
Word Boundary
\b matches the start or end of a word.
/\bmus\b/ would match mus but not musculus
/la\b/ => Drosophila but not Plasmodium
/\btes/ => Comamonas testosteroni but not Pan troglodytes
\b ignores the newline character: a word at the end of a line still ends at a word boundary.
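The word boundary examples can be run directly. A small sketch, using the organism names from the examples above as test strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @names = ('Mus musculus', 'mus', 'Drosophila', 'Plasmodium',
             'Comamonas testosteroni', 'Pan troglodytes');

my @whole_word_mus = grep { /\bmus\b/ } @names;   # 'mus' as a complete word
my @ends_in_la     = grep { /la\b/ }    @names;   # 'la' at the end of a word
my @starts_tes     = grep { /\btes/ }   @names;   # 'tes' at the start of a word

print "@whole_word_mus\n";   # mus
print "@ends_in_la\n";       # Drosophila
print "@starts_tes\n";       # Comamonas testosteroni
```

'Mus musculus' is not matched by /\bmus\b/ because the match is case sensitive and the 'mus' in 'musculus' is followed by another word character.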
Memory Variables
Able to extract sections of the pattern match and store them in variables.
Anything matched inside parentheses “()” is written into a special variable.
The first pair of parentheses fills $1, the second $2, the fourth $4 and so on.
Extract from file:
Organism: Homo sapiens ...
Extract from Perl script:
while ($line=<IN>) {
    if ($line=~m/Organism:\s(\w+)\s(\w+)/) {    #quantifier goes inside the parentheses
        $genus=$1;      #stores Homo
        $species=$2;    #stores sapiens
    }
}
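The position of the quantifier relative to the parentheses matters when capturing. The sketch below runs the capture on a single in-memory line (instead of reading from IN) and then shows what happens when the + sits outside the parentheses; the variable $last_char is only there for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Organism: Homo sapiens\n";

# Quantifier inside the parentheses: each group captures a whole word.
my ($genus, $species) = ('', '');
if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
    ($genus, $species) = ($1, $2);
}
print "genus=$genus species=$species\n";   # genus=Homo species=sapiens

# Quantifier outside the parentheses: the group is repeated and $1
# keeps only the last repetition, which is a single character.
my $last_char = '';
if ('Homo' =~ /(\w)+/) {
    $last_char = $1;
}
print "last_char=$last_char\n";            # last_char=o
```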
Substitutions
Able to replace a pattern within a string with another string.
Use the “s” operator
s/abc/xyz/ => find abc and replace with xyz
By default only the first instance of a match is replaced.
Using the 'g' modifier (global) will find and replace all instances.
$line = 'abccdcbabc';
$line =~ s/abc/xyz/g;
print $line; #produces xyzcdcbxyz
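To see the effect of the 'g' modifier, the same string can be substituted with and without it. The sketch also uses the fact that s/// returns the number of replacements it made; the variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $original = 'abccdcbabc';

# Without 'g': only the first abc is replaced.
my $once = $original;
$once =~ s/abc/xyz/;

# With 'g': every abc is replaced; s/// returns how many were replaced.
my $global   = $original;
my $replaced = ($global =~ s/abc/xyz/g);

print "first only: $once\n";                     # xyzcdcbabc
print "global:     $global ($replaced made)\n";  # xyzcdcbxyz (2 made)
```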
Run dna2rna.pl
Now look at dna2rna.pl
dna2rna.pl
#!/usr/bin/perl
print "Enter DNA sequence\n";
while ($line = <STDIN>) {
    chomp $line;                    #remove trailing \n
    if ($line=~m/[^AGCT]/i) {       #case insensitive inferred by 'i' modifier
        print "your sequence contained an invalid nucleotide: $&\nPlease try again\n";
        #'$&' is a special variable which stores what the
        #regular expression matched. Don't worry about it for now.
    }
    else {
        $line=~s/t/u/g;             #replace all lower case 't'
        $line=~s/T/U/g;             #replace all upper case 'T'
        print "The RNA sequence is:\n$line\n";
        print "Try again or ctrl C to quit\n";
    }
}
EMBL file revisited
using shortcuts and anchors to help make the match more robust:
if (/AC   .*/) { #that's three spaces
can be rewritten as:
if (/^AC\s{3}(.*)\n$/) {    #more certain to return what you want
    $accession=$1;          #now have info stored to use later.
}
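The benefit of the anchored version shows up when another line happens to contain "AC" followed by three spaces. The sketch below uses three invented lines in EMBL style (the accession AB000001 and the other contents are hypothetical, not from nemaglobins.embl).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented lines in EMBL style; only the second is a real AC line.
my @lines = (
    "ID   HYPOTHETICAL1\n",
    "AC   AB000001;\n",
    "DE   description that mentions AC   in passing\n",
);

my (@loose, @accessions);
foreach my $line (@lines) {
    # Unanchored: matches wherever 'AC' plus three spaces appears.
    push @loose, $line if $line =~ /AC   .*/;

    # Anchored: 'AC' must start the line; the rest is captured in $1.
    if ($line =~ /^AC\s{3}(.*)\n$/) {
        push @accessions, $1;
    }
}

print scalar(@loose), " lines match the loose pattern\n";   # 2
print "accessions: @accessions\n";                          # AB000001;
```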
Now It's Your Turn :o)
nemaglobins.embl contains entries for complete cds of nematode sequences. For each entry print the ACcession, OrganiSm name and AGCT content of the SeQuence.
Output should read:
Accession: AC00000 <tab> Species: Toxocara canis <newline> A: 34 G: 65 C: 24 T: 75 <newline><newline>
Hints:
The lines of interest are AC, OS, and SQ.
Three regular expressions - one for each query.
Use a series of if and elsif statements to test each regular expression.
Print when matched.
Bonus point - remove the semi-colon from the accession id.
Shout if you need help.