5.6 The Language of Regular Expressions

The pattern-matching language known as regular expressions allows you to search for and extract matches and to replace patterns of characters in files (given the right program). Regular expressions are used in the vi and Emacs text-editing programs. Since much of the data that biologists work with contains patterns, one of the first skills you need to learn is how to match patterns and extract them from files.

Regular expressions also are understood by the Perl language interpreter. Knowing how to use regular expressions along with the basic commands of Perl gives you a powerful set of data-processing tools. We'll cover the basics of regular expressions here, and return to them again in Chapter 12.

If you've ever used a wildcard character in a search, you've used a regular expression. Regular expressions are patterns of text to be matched. There are also special characters that can be used in regular expressions to stand for variable patterns, which means you can search for partial or inexact matches. Regular expressions can consist of any combination of explicit text and special characters. The special characters recognized in basic regular expressions are:

\

The backslash acts as an escape character for a special character that follows it. If part of the pattern you are searching for is a dot, you give the regular expression chars\.txt to find the pattern chars.txt.

.

The dot matches any single character.

*

The behavior of the asterisk in regular expressions is different from its behavior as a shell wildcard. If preceded by a character, it matches zero or more occurrences of that character. If preceded by a character class description, it matches zero or more characters from that set. If preceded by a dot, it matches zero or more arbitrary characters, which is equivalent to its behavior in the shell.

^

The caret at the beginning of a regular expression matches the beginning of a line. Otherwise, it matches itself.

$

The dollar sign at the end of a regular expression matches the end of a line. Otherwise, it matches itself.

[charset]

A group of characters enclosed in square brackets matches any single character within the brackets. [badger] matches any of (a, b, d, e, g, r). Within the set, only -, caret, ], and [ are special. All other characters, including the general special characters, match themselves. A range of characters in the form [c1-c2] can also be given; e.g., [0-9] or [A-Z].
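The special characters above can be tried out on a small throwaway file. The following is a minimal sketch, assuming a POSIX shell and a standard grep; the file name names.txt and its contents are invented for illustration.

```shell
# Build a small scratch file to search (contents are hypothetical).
tmpdir=$(mktemp -d)
cat > "$tmpdir/names.txt" <<'EOF'
chars.txt
charsXtxt
badger
bdgr
1abc
abc1
EOF

# Escaped dot: only the literal name chars.txt matches.
escaped=$(grep -c 'chars\.txt' "$tmpdir/names.txt")

# Unescaped dot: any single character fits, so charsXtxt matches too.
unescaped=$(grep -c 'chars.txt' "$tmpdir/names.txt")

# Asterisk: b, then zero or more a's, then d.
star=$(grep 'ba*d' "$tmpdir/names.txt")

# Caret plus a character class: lines that begin with a digit.
starts_with_digit=$(grep '^[0-9]' "$tmpdir/names.txt")

# Dollar sign plus a character class: lines that end with a digit.
ends_with_digit=$(grep '[0-9]$' "$tmpdir/names.txt")

echo "escaped dot:   $escaped line(s)"
echo "unescaped dot: $unescaped line(s)"
echo "ba*d matches:"
echo "$star"
echo "begins with a digit: $starts_with_digit"
echo "ends with a digit:   $ends_with_digit"

rm -rf "$tmpdir"
```

Note that every pattern is enclosed in single quotes, so the shell passes the backslash, asterisk, caret, and dollar sign to grep unchanged.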

5.6.1 Searching for Patterns with grep

Usage: grep -[options] 'pattern' filenames

grep allows you to search for patterns (in the form of regular expressions) in a file or a group of files. GNU grep (the standard on Linux) searches for one of three kinds of patterns, depending on which of the following options is selected:

-G

Standard grep: searches for a regular expression (this is the default)

-E

Extended grep: searches for an extended regular expression

-F

Fast grep : rapidly searches for a fixed string (a pattern made of normal characters, as opposed to regular expressions)
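The difference between the three pattern types is easiest to see by running the same search in each mode. This is a minimal sketch, assuming GNU grep or another grep that accepts -E and -F; the file ids.txt and its contents are invented for illustration.

```shell
# Scratch file with three hypothetical identifiers.
tmpdir=$(mktemp -d)
cat > "$tmpdir/ids.txt" <<'EOF'
caspase-3
caspase-9
casp.se
EOF

# Default (-G) mode: the dot is a special character and matches any
# single character, so all three lines match.
basic=$(grep -c 'casp.se' "$tmpdir/ids.txt")

# -F mode: the pattern is a fixed string, so the dot must appear
# literally and only the last line matches.
fixed=$(grep -F -c 'casp.se' "$tmpdir/ids.txt")

# -E mode: extended regular expressions add, among other things,
# grouping with ( ) and alternation with |.
extended=$(grep -E -c 'caspase-(3|9)' "$tmpdir/ids.txt")

echo "basic: $basic  fixed: $fixed  extended: $extended"

rm -rf "$tmpdir"
```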

Note that the -E and -F options can be explicitly selected by calling egrep or fgrep on some systems. If no files are specified to be searched, grep searches the standard input for the pattern, so the output of another program can be piped to grep if you are looking for a pattern in that output. As a simple example, consider the following commands:

% grep -c '>' SP-caspases-A.fasta SP-caspases-B.fasta
% grep '>' SP-caspases-A.fasta SP-caspases-B.fasta

These both search through a file of FASTA-formatted sequences (whose header lines, you will remember, begin with the > symbol). The first command returns the number of sequences in each file, while the second returns a list of the sequence headers. Be sure to enclose the > in quotes, though. Otherwise, as one of us once found out the hard way, the shell interprets the unquoted > as output redirection: the first file listed is opened to receive grep's output, overwriting whatever was already there, while grep takes the next filename as its pattern and waits to read the standard input.
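The counting command, and the way grep reads standard input when no file is named, can be tried safely on scratch copies. This is a minimal sketch; the files A.fasta and B.fasta and the sequences in them are invented stand-ins for the caspase files.

```shell
# Two small hypothetical FASTA files in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/A.fasta" <<'EOF'
>seq1 hypothetical caspase A1
MENTENSVDS
>seq2 hypothetical caspase A2
MADDQGCIEE
EOF
cat > "$tmpdir/B.fasta" <<'EOF'
>seq3 hypothetical caspase B1
MSSASGLRRG
EOF

# With several files, grep -c prints one count per file,
# each prefixed by the filename.
counts=$(cd "$tmpdir" && grep -c '>' A.fasta B.fasta)

# With no filename, grep reads standard input, so the output of
# another command can be piped to it.
total=$(cat "$tmpdir/A.fasta" "$tmpdir/B.fasta" | grep -c '>')

echo "$counts"
echo "total headers: $total"

rm -rf "$tmpdir"
```

Working on copies in a scratch directory means that a forgotten pair of quotes around the > costs nothing but the scratch files.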

grep takes dozens of options. Here are some of the more useful ones:

-c

Prints only a count of matching lines, rather than printing the matching lines themselves

-i

Ignores uppercase/lowercase distinctions in both file and pattern

-n

Prints lines and line numbers for each occurrence of a pattern match

-l

Prints filenames containing matches to pattern, but not matching lines

-h

Prints matching lines but not filenames (the opposite of -l)

-v

Prints only those lines that don't contain a match with pattern

-q

(quiet mode) Prints nothing; grep stops at the first match and reports only through its exit status whether a match was found
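The options listed above can be compared side by side on a pair of small files. This is a minimal sketch; the files notes.txt and other.txt and their contents are invented for illustration.

```shell
# Two hypothetical text files in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/notes.txt" <<'EOF'
Caspase-3 cleaves PARP
caspase-9 is an initiator
Trypsin is a serine protease
EOF
cat > "$tmpdir/other.txt" <<'EOF'
Nothing relevant here
EOF

# -i with -c: count matching lines, ignoring case.
ignore_case=$(cd "$tmpdir" && grep -i -c 'caspase' notes.txt)

# -n: prefix each matching line with its line number.
numbered=$(cd "$tmpdir" && grep -n 'caspase' notes.txt)

# -v: print the lines that do NOT match.
inverted=$(cd "$tmpdir" && grep -v -i 'caspase' notes.txt)

# -l: print only the names of files that contain a match.
files=$(cd "$tmpdir" && grep -l -i 'caspase' notes.txt other.txt)

# -h: print matching lines without the filename prefix that grep
# normally adds when it searches more than one file.
no_names=$(cd "$tmpdir" && grep -h 'Trypsin' notes.txt other.txt)

# -q: print nothing; the exit status alone says whether a match exists.
if grep -q 'Trypsin' "$tmpdir/notes.txt"; then
    found=yes
else
    found=no
fi

echo "-i -c: $ignore_case"
echo "-n:    $numbered"
echo "-v:    $inverted"
echo "-l:    $files"
echo "-h:    $no_names"
echo "-q:    $found"

rm -rf "$tmpdir"
```

Because -q produces no output, it is mainly useful inside shell conditionals such as the if statement shown here.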

In protein structure files, protein sequence information is stored as a sequence of three-letter codes, rather than in the more compact single-letter code format. It's sometimes necessary to extract sequence information from protein structure files. In real life, you can do this with a simple Perl program and then go on to translate the sequence into single-letter code. But you can also extract the sequence with two simple Unix filter commands.

The first step is to find the SEQRES records in the PDB file. This is done using the grep command:

% grep SEQRES pdbfile > seqres

This gives you a file called seqres containing records that look like this:

SEQRES 1 357 GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN 2MNR 106
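One possible refinement, offered as a sketch rather than as part of the original recipe, is to anchor the pattern with a caret so that only lines that begin with SEQRES are kept. The scratch file below is an invented, heavily simplified stand-in for a PDB file (its REMARK and ATOM lines are hypothetical), used only to show the difference.

```shell
# Simplified, hypothetical stand-in for a PDB file.
tmpdir=$(mktemp -d)
cat > "$tmpdir/pdbfile" <<'EOF'
REMARK   invented remark that mentions SEQRES in passing
SEQRES   1    357  GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN  2MNR 106
ATOM      1  N   GLU     1
EOF

# Unanchored: any line containing SEQRES matches, including the remark.
unanchored=$(grep -c 'SEQRES' "$tmpdir/pdbfile")

# Anchored: only lines that start with SEQRES match.
anchored=$(grep -c '^SEQRES' "$tmpdir/pdbfile")

echo "unanchored: $unanchored  anchored: $anchored"

rm -rf "$tmpdir"
```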