Regular Expressions and Pattern Matching

(1)

Regular Expression (regex):

a separate language for constructing patterns, used in most programming languages.

very powerful in Perl.

Pattern Match:

using regex to search data and look for a match.

Overview:

how to create regular expressions

how to use them to match and extract data

biological context

(2)

So Why Regex?

Parse files of data and information:

fasta

embl / genbank format

html (web-pages)

user input to programs

Check format

Find illegal characters (validation)

Search for sequence motifs

(3)

Simple Patterns

place the regex between a pair of forward slashes (/ /).

try:

#!/usr/bin/perl
while (<STDIN>) {
    if (/abc/) {
        print "1 >> $_";
    }
}

Run the script.

Type in something that contains abc:

abcfoobar

Type in something that doesn't:

fgh

cba

foobar

ab c foobar

(4)

Simple Patterns (2)

Can also match strings from files.

genomes_desc.txt contains a few text lines containing information about three genomes.

try:

#!/usr/bin/perl

open IN, "<", "genomes_desc.txt" or die "Cannot open genomes_desc.txt: $!";
while (<IN>) {
    if (/elegans/) {    #match lines with this regex
        print;          #print lines with match
    }
}

Parses each line in turn.

(5)

Flexible matching

There are many characters with special meanings, called metacharacters.

star (*) matches any number of instances (including none)

/ab*c/ => 'a' followed by zero or more 'b' followed by 'c'

=> abc or abbbbbbbc or ac

plus (+) matches at least one instance

/ab+c/ => 'a' followed by one or more 'b' followed by 'c'

=> abc or abbc or abbbbbbbbbbbbbbc NOT ac

question mark (?) matches zero or one instance

/ab?c/ => 'a' followed by 0 or 1 'b' followed by 'c'

=> abc or ac NOT abbc
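The three quantifiers can be compared side by side. The following is a minimal sketch, not part of the original exercises: the test strings and the %result hash are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test strings, chosen only for illustration.
my @strings = ('ac', 'abc', 'abbbc');
my %result;

foreach my $s (@strings) {
    $result{$s} = join ' ',
        ($s =~ /ab*c/ ? 'star'     : '-'),   # zero or more 'b'
        ($s =~ /ab+c/ ? 'plus'     : '-'),   # one or more 'b'
        ($s =~ /ab?c/ ? 'question' : '-');   # zero or one 'b'
    print "$s: $result{$s}\n";
}
```

Each output line lists which of the three patterns matched that string; a '-' marks a pattern that did not match.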

(6)

More General Quantifiers

Match a character a specific number of times or within a range.

{x} will match exactly x instances.

/ab{3}c/ => abbbc

{x,y} will match between x and y instances.

/a{2,4}bc/ => aabc or aaabc or aaaabc

{x,} will match x or more instances.

/abc{3,}/ => abccc or abccccccccc or abc followed by any longer run of c
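The brace quantifiers can be tried on a small set of strings. This sketch assumes four made-up test strings containing one to four 'b' characters and uses grep to keep the strings that match each pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test strings with one to four 'b' characters.
my @strings = ('abc', 'abbc', 'abbbc', 'abbbbc');

my @exactly_three = grep(/ab{3}c/,   @strings);  # exactly three 'b'
my @two_to_three  = grep(/ab{2,3}c/, @strings);  # two or three 'b'
my @two_or_more   = grep(/ab{2,}c/,  @strings);  # at least two 'b'

print "exactly three: @exactly_three\n";
print "two to three:  @two_to_three\n";
print "two or more:   @two_or_more\n";
```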

(7)

More metacharacters

dot (.) matches any single character, including tab (\t) and space, but not newline (\n).

(8)

Escaping

But I want to use these symbols in my regex!?!

to use *, +, ? or . in the pattern as a literal character rather than a metacharacter, 'escape' it with a backslash.

/C\. elegans/ => C. elegans only

/C. elegans/ => Ca elegans, Cb elegans, C3 elegans, C> elegans, C. elegans, etc...

The 'delimiter' of the regex, forward slash “/”, and the 'escape' character, backslash “\”, are also metacharacters. These need to be escaped if required in the regex.

Important when trying to match URLs and email addresses.

/joe\.bloggs\@darwin\.co\.uk/
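The effect of escaping can be seen by running the escaped and unescaped patterns over the same strings. In this sketch the name 'Cx elegans' and the example.org address are invented purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical strings: the second is not a real species name.
my @names = ('C. elegans', 'Cx elegans');

my @escaped   = grep(/C\. elegans/, @names);  # dot is a literal full stop
my @unescaped = grep(/C. elegans/,  @names);  # dot matches any character

print "escaped:   @escaped\n";
print "unescaped: @unescaped\n";

# Forward slashes must be escaped while '/' is the delimiter.
my $url = 'http://www.example.org/index.html';   # made-up address
my $url_matched = ($url =~ /example\.org\/index\.html/) ? 1 : 0;
print "URL matched: $url_matched\n";
```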

(9)

Using metacharacters.

The file nemaglobins.embl contains 21 EMBL database entries that contain a globin protein within their sequence.

try:

#!/usr/bin/perl
$count = 0;
open IN, "<nemaglobins.embl" or die;
while (<IN>) {
    if (/AC   .*/) {    #that's three spaces
        print;
        $count++;
    }
}

(10)

Grouping Patterns

Can group patterns in parentheses “()”.

Useful when coupled with quantifiers:

/elegans+/ => eleganssssssssssssss

/(elegans)+/ => eleganselegans...elegans

/eleg(ans){4}/ => elegansansansans
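The difference between quantifying one character and quantifying a group can be checked directly. A small sketch, assuming two made-up strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical strings: one doubles the final 's', one doubles 'ans'.
my @strings = ('eleganss', 'elegansans');

my @last_char_twice = grep(/ans{2}/,   @strings);  # {2} applies to 's' only
my @group_twice     = grep(/(ans){2}/, @strings);  # {2} applies to 'ans'

print "ans{2}:   @last_char_twice\n";
print "(ans){2}: @group_twice\n";
```

Without parentheses the quantifier binds only to the character immediately before it.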

(11)

Alternatives

Want either this pattern or that pattern. Two ways:

1.) the vertical bar '|': either the left side matches or the right side matches

/(human|mouse|rat)/ => any string with human or mouse or rat.

Combine with previous examples:

/Fugu( |\t)+rubripes/ matches if Fugu and rubripes are separated by any mixture of spaces and tabs

(12)

2.) character class: a list of characters within '[]'. It will match any single character within the class.

/[wxyz1234\t]/ => any of the nine.

a range can be specified with '-'

/[w-z1-4\t]/ => as above

to match a literal hyphen, put it first or last in the class (or escape it)

/[-a-zA-Z]/ => any letter or a hyphen

negate a class with '^' as its first character

/[^z]/ => any character except z
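Both ways of writing alternatives can be run against the same data. The sketch below uses made-up strings; it shows that a vertical-bar group and a character class select the same lines, and that a negated class finds a character outside an allowed set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical strings: single space, space-tab-space, and no separator.
my @names = ("Fugu rubripes", "Fugu \t rubripes", "Fugurubripes");

my @with_bar   = grep(/Fugu( |\t)+rubripes/, @names);  # vertical bar
my @with_class = grep(/Fugu[ \t]+rubripes/,  @names);  # character class

print scalar(@with_bar), " matched with '|', ",
      scalar(@with_class), " matched with '[]'\n";

# A negated class matches any character NOT listed in it.
my @non_dna = grep(/[^ACGT]/, 'ACGT', 'ACGN');
print "contains a non-ACGT character: @non_dna\n";
```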

(13)

Other Shortcuts

\d => any digit [0-9]

\w => any “word” character [A-Za-z0-9_]

\s => any white space [\t\n\r\f ]

\D => any character except a digit [^\d]

\W => any character except a “word” character [^\w]

\S => any character except a white space [^\s]

Can use any of these in conjunction with quantifiers:

/\s*/ => any amount of white space
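The shortcuts combine naturally with quantifiers when checking the layout of a record. In this sketch the record "gene42 <tab> 1500 bp" and all variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record: an identifier, a tab, then a length and unit.
my $record = "gene42\t1500 bp";

# word characters, white space, digits, one white space, word characters
my $well_formed = ($record =~ /\w+\s+\d+\s\w+/) ? 1 : 0;

# \D matches if the string holds any character that is not a digit.
my $all_digits   = ('1500' =~ /\D/) ? 0 : 1;
my $mixed_digits = ('15o0' =~ /\D/) ? 0 : 1;

print "record well formed: $well_formed\n";
print "'1500' all digits:  $all_digits\n";
print "'15o0' all digits:  $mixed_digits\n";
```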

(14)

Using alternatives to find a hydrophobic region...

try:

$count = 0;
$match = 0;
open IN, "< nippo_sigpept.fsa" or die;
while (<IN>) {
    if (/>/) {          #a header line
        $count++;       #keep running total of sequence number
    }
    else {              #not a header
        if (/[VILMFWCA]{8,}/) {
            $match++;
        }
    }
}
print "Hydrophobic region found in $match sequences from $count\n";

(15)

Binding Operator

Revisited?

So far matching against $_

The binding operator “=~” matches the pattern on the right against the string on the left.

The m (match) operator in front of the pattern is optional with / / delimiters, but is usually added.

$sumthing = 'Ascaris suum is a nematode';
if ($sumthing =~ m/suum.*nematode/) {
    print "this organism infects pigs!\n";
}

(16)

Anchors

/pattern/ will match anywhere in the string. Use anchors to hold the pattern to a point in the string.

caret “^” (shift 6) marks the beginning of a string while dollar “$” marks the end of a string.

/^elegans/ => elegans only at start of string. Not C. elegans.

/Canis$/ => Canis only at end of string. Not Canis lupus.

/^\s*$/ => a blank line.

“$” ignores a trailing newline character “\n”, so it also matches just before the final “\n”.
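The anchors can be tested on lines that still carry their trailing newline, as lines read from a file do. A minimal sketch with made-up lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lines, each ending in "\n" as if read from a file.
my @lines = ("elegans is a species\n", "C. elegans\n", "\n", "   \n");

my @at_start = grep(/^elegans/, @lines);  # elegans at the start only
my @at_end   = grep(/elegans$/, @lines);  # elegans just before the "\n"
my @blank    = grep(/^\s*$/,    @lines);  # empty or white space only

print "start: ", scalar(@at_start), "\n";
print "end:   ", scalar(@at_end),   "\n";
print "blank: ", scalar(@blank),    "\n";
```

The second line still matches /elegans$/ although a newline follows, because “$” is allowed to match before a final “\n”.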

(17)

Anchors (2)

Word Boundary

\b matches the start or end of a word.

/\bmus\b/ would match mus but not musculus

/la\b/ => Drosophila but not Plasmodium

/\btes/ => Comamonas testosteroni but not Pan troglodytes

\b ignores the newline character: “\n” is not a word character, so a word at the end of a line still ends at a boundary.
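Word boundaries can be checked in the same way. This sketch uses lower-case made-up strings (the patterns here are case sensitive), including one with a trailing newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical strings; the last one keeps its trailing newline.
my @names = ('mus', 'musculus', 'Drosophila', 'Plasmodium', "mus\n");

my @whole_word = grep(/\bmus\b/, @names);  # 'mus' as a complete word
my @ends_in_la = grep(/la\b/,    @names);  # 'la' at the end of a word

print "whole word 'mus': ", scalar(@whole_word), " strings\n";
print "ends in 'la':     @ends_in_la\n";
```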

(18)

Memory Variables

Able to extract sections of the pattern match and store them in variables. Anything matched inside parentheses “()” is written into a special variable. The first instance is $1, the second $2, the fourth $4 and so on.

Extract from file:

Organism: Homo sapiens ...

Extract from Perl script:

while ($line=<IN>) {
    if ($line=~m/Organism:\s(\w+)\s(\w+)/) {    #quantifier inside the parentheses
        $genus=$1;      #stores Homo
        $species=$2;    #stores sapiens
    }
}
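The capture can be tried without a file. In this sketch the input line is hard-coded and the variable names are illustrative. Note that the quantifier sits inside the parentheses, (\w+), so the whole word is stored; with (\w)+ the group is repeated and only the last character it matched is kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hard-coded stand-in for a line read from a file.
my $line = "Organism: Homo sapiens\n";

my ($genus, $species) = ('', '');
if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
    $genus   = $1;   # first pair of parentheses
    $species = $2;   # second pair of parentheses
}
print "genus: $genus, species: $species\n";

# With the quantifier outside, only the last repetition is stored.
my $last_only = ('Homo' =~ /(\w)+/) ? $1 : '';
print "(\\w)+ stored: $last_only\n";
```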

(19)

Substitutions

Able to replace a pattern within a string with another string.

Use the “s” operator:

s/abc/xyz/ => find abc and replace with xyz

By default only the first instance of a match is replaced.

Using the 'g' modifier (global) will find and replace all instances.

$line = 'abccdcbabc';
$line =~ s/abc/xyz/g;
print $line;    #produces xyzcdcbxyz
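The default and global behaviours can be compared on the same string. A short sketch; the variable names are illustrative. It also relies on the fact that s/// returns the number of substitutions it made.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $first_only = 'abccdcbabc';
$first_only =~ s/abc/xyz/;               # no 'g': first match only

my $all = 'abccdcbabc';
my $changes = ($all =~ s/abc/xyz/g);     # 'g': every match, returns count

print "first only: $first_only\n";
print "global:     $all ($changes substitutions)\n";
```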

Run dna2rna.pl

Now look at dna2rna.pl

(20)

dna2rna.pl

#!/usr/bin/perl
print "Enter DNA sequence\n";
while ($line = <STDIN>) {
    chomp $line;                    #remove trailing \n
    if ($line =~ m/[^AGCT]/i) {     #case insensitive, set by the 'i' modifier
        print "your sequence contained an invalid nucleotide: $&\nPlease try again\n";
        #'$&' is a special variable which stores what the
        #regular expression matched. Don't worry about it for now.
    }
    else {
        $line =~ s/t/u/g;           #replace all lower case 't'
        $line =~ s/T/U/g;           #replace all upper case 'T'
        print "The RNA sequence is:\n$line\n";
        print "Try again or ctrl C to quit\n";
    }
}

(21)

EMBL file revisited

using shortcuts and anchors to help make the match more robust:

if (/AC   .*/) {    #that's three spaces

can be rewritten as:

if (/^AC\s{3}(.*)\n$/) {    #more certain to return what you want
    $accession=$1;          #now have info stored to use later.
}
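The gain in robustness can be shown on two lines. Both lines below, including the accession X12345, are invented for this sketch and are not taken from nemaglobins.embl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lines: a real-looking AC line and a decoy that
# merely contains "AC" followed by three spaces.
my @lines = ("AC   X12345;\n", "DE   TRANSAC   protein\n");

# The loose pattern accepts both lines.
my @loose = grep(/AC   .*/, @lines);

# The anchored pattern accepts only the true AC line and captures its content.
my @accessions;
foreach my $line (@lines) {
    if ($line =~ /^AC\s{3}(.*)\n$/) {
        push @accessions, $1;
    }
}

print "loose pattern matched ", scalar(@loose), " lines\n";
print "anchored pattern captured: @accessions\n";
```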

(22)

Now It's Your Turn :o)

nemaglobins.embl contains entries for complete cds of nematode sequences. For each entry print the ACcession, OrganiSm name and AGCT content of the SeQuence.

Output should read:

Accession: AC00000 <tab> Species: Toxocara canis <newline> A: 34 G: 65 C: 24 T: 75 <newline><newline>

Hints:

The lines of interest are AC, OS, and SQ.

Three regular expressions - one for each query.

Use a series of if and elsif statements to test each regular expression.

Print when matched.

Bonus point - remove the semi-colon from the accession id.

Shout if you need help.