Chapter 8. Multiple Sequence Alignments, Trees, and Profiles

8.4 Profiles and Motifs

8.4.2 Constructing and Using Your Own Profiles

Motif databases are useful if you're looking for protein families that are already well documented. However, if you think you've found a new motif you want to use to search GenBank, or you want to get creative and look for patterns in unusual places, you need to build your own profiles. Several software packages and servers are available for motif discovery, the process of finding and constructing your own motifs from a set of sequences. The simplest way to construct a motif is to find a well-conserved section out of a multiple sequence alignment. As usual, though, we encourage you to use automated approaches instead of doing things by hand: automation makes your work faster, more reproducible, and less error-prone. In addition to Block Maker, a number of other programs are commonly used to search for and discover motifs. In this section, we discuss the use of the MEME and HMMer programs, two packages commonly used for motif analysis.
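To make the "well-conserved section" idea concrete, here is a minimal Python sketch that scores each column of an alignment and picks the most conserved window. The scoring rule (fraction of sequences sharing the most common residue), the function names, and the toy sequences are all our own assumptions for illustration; they are not taken from Block Maker or any other package mentioned here.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column.
    Gap characters ('-') never count as the conserved residue."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n_seqs)
    return scores

def best_window(alignment, width):
    """Return (start, mean_score) of the most conserved window of `width` columns.
    Ties are resolved in favor of the leftmost window."""
    scores = column_conservation(alignment)
    best_start, best_mean = 0, -1.0
    for start in range(len(scores) - width + 1):
        mean = sum(scores[start:start + width]) / width
        if mean > best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean

# Toy alignment: made-up sequences, already aligned and of equal length.
toy_alignment = [
    "MKT-GHGDSAL",
    "MRSAGHGDTAV",
    "LKT-GHGDQPL",
    "MQSVGHGDEAI",
]
start, mean = best_window(toy_alignment, 4)
print(start, toy_alignment[0][start:start + 4], mean)  # prints: 4 GHGD 1.0
```

A real analysis would use a substitution-aware score rather than simple identity, which is one more reason to prefer the automated tools discussed in this section.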

Before we begin, though, here are two observations about motif discovery. First, as InterPro and Blocks grow, it is becoming increasingly difficult to find completely novel sequence motifs undocumented by one of their member databases. Be sure to check your motif against the set of known motifs, either by searching your sequences against the databases or by using a motif-comparison tool, such as the Blocks server's LAMA program. Second, in order to find patterns reliably and search with them, you need a lot of sequences. We have used these programs in projects where very few (5-10) sequences were available, but, as a rule of thumb, more than 20 sequences are needed for reasonable motif predictions. The more sequences you have, the more reliable the resulting motifs will be.
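Before submitting anything, it can be worth counting what you actually have. The following Python sketch parses FASTA-formatted text and checks the count against the rule of thumb above. The parser is deliberately minimal, and the names `read_fasta`, `enough_sequences`, and `MIN_RECOMMENDED` are hypothetical helpers of our own, not part of any motif-discovery package.

```python
MIN_RECOMMENDED = 20  # rule of thumb from the text, not a hard limit

def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}.
    The identifier is the first word after '>' on each header line."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # Fall back to a generated name if the header is empty.
            name = words[0] if words else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

def enough_sequences(records, minimum=MIN_RECOMMENDED):
    """True if there are more than `minimum` sequences."""
    return len(records) > minimum

# Toy input: two made-up records, one wrapped over two lines.
example = """>seq1 toy record
MKTGHGD
SAL
>seq2
MRSAGHGDTAV
"""
records = read_fasta(example)
print(len(records), enough_sequences(records))  # prints: 2 False
```

A count below the threshold does not mean you cannot proceed; it only means the resulting motifs deserve extra skepticism.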

8.4.2.1 Finding new motifs with MEME

The MEME programs are a set of tools for motif analysis developed by Charles Elkan, Tim Bailey, and William Grundy of the University of California, San Diego. MEME is short for Multiple EM for Motif Elicitation (EM, in turn, is short for Expectation Maximization, a statistical procedure for estimating "missing," or unobserved, values). The programs can be used over the Web (http://meme.sdsc.edu), or their C source code can be downloaded, compiled, and run on a local computer; here, we look at the web version. There are three programs in the MEME suite:

MEME

Discovers shared motifs in a set of unaligned sequences

MAST

Takes a motif discovered by MEME and uses it to search a sequence database

MetaMEME

Constructs a model from multiple MEME motifs and uses it to search a sequence database

When you submit a set of sequences to MEME, you are testing the hypothesis that, although you don't know the overall alignment of the sequences, they share short regions of similarity. You begin using MEME by filling in a web form with your email address and the set of sequences in which you wish to search for a motif. Sequences can be in one of several formats, although FASTA is preferred. At the bottom of the submission page are some parameters you need to set: the number of times per sequence you expect a motif to occur, the number of motifs you expect to find, and the approximate width of each motif.
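To give a feel for what Expectation Maximization does in this setting, here is a toy Python sketch. It is not MEME's actual algorithm: we assume exactly one motif occurrence per sequence, a fixed motif width, a uniform background, and a single starting word supplied by the caller, whereas the real program handles all of these far more carefully. The "missing" values are the motif start positions; the E-step estimates them, and the M-step re-estimates the motif from those estimates. All function names are our own.

```python
ALPHABET = "ACGT"

def init_pwm(seed, weight=0.7):
    """Start from a seed word: each seed letter gets `weight`,
    and the other letters share the remainder equally."""
    other = (1.0 - weight) / (len(ALPHABET) - 1)
    return [{a: (weight if a == s else other) for a in ALPHABET} for s in seed]

def e_step(seqs, pwm, background=0.25):
    """For each sequence, the posterior probability that the motif starts
    at each position (sequences must be at least as long as the motif)."""
    width = len(pwm)
    posteriors = []
    for seq in seqs:
        ratios = []
        for start in range(len(seq) - width + 1):
            ratio = 1.0
            for offset in range(width):
                ratio *= pwm[offset][seq[start + offset]] / background
            ratios.append(ratio)
        total = sum(ratios)
        posteriors.append([r / total for r in ratios])
    return posteriors

def m_step(seqs, posteriors, width, pseudocount=0.1):
    """Re-estimate letter frequencies from posterior-weighted counts."""
    counts = [{a: pseudocount for a in ALPHABET} for _ in range(width)]
    for seq, post in zip(seqs, posteriors):
        for start, p in enumerate(post):
            for offset in range(width):
                counts[offset][seq[start + offset]] += p
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({a: c / total for a, c in col.items()})
    return pwm

def run_em(seqs, seed, iterations=20):
    """Alternate E- and M-steps, then report the most probable start per sequence."""
    pwm = init_pwm(seed)
    for _ in range(iterations):
        posteriors = e_step(seqs, pwm)
        pwm = m_step(seqs, posteriors, len(seed))
    posteriors = e_step(seqs, pwm)
    starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return pwm, starts

# Made-up sequences, each containing one copy of the word TGACTC.
toy_seqs = ["AAGCTGACTCGG", "TGACTCAAGGCC", "CCGGATTGACTC", "GCATGACTCAGG"]
pwm, starts = run_em(toy_seqs, "TGACTC")
print(starts)  # prints: [4, 0, 6, 3]
```

The three parameters on the submission form map loosely onto the simplifications here: occurrences per sequence (fixed at one in this sketch), number of motifs (one), and motif width (the length of the seed).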

The results will be sent back to you in three emails. The first is just a confirmation message, letting you know that the job is being processed. The second (with the subject line "MEME Job xxxxx results:", where xxxxx is the job number assigned by the MEME server) contains MEME's prediction for the motifs in both human- and machine-readable form. This message is the one you need to search the database; be sure to save the contents of this message to a text file, so you can later submit it to MAST or MetaMEME. The third message (with the subject line "MEME job... MAST analysis:") is an HTML document (making it suitable for viewing in a web browser) that shows the location of each motif in the sequences you submitted. Each message is well documented and contains detailed explanations of the contents.

8.4.2.2 Searching for motifs with MAST and MetaMEME

The next step of a motif analysis is to see whether there are new occurrences of your motif in other sequences. The MEME server provides two distinct programs, MAST and MetaMEME, that allow you to search a sequence database using your new MEME motifs. MAST simply searches for occurrences of each motif and reports matching sequences, while MetaMEME combines multiple MEME motifs into a single model (an HMM) and searches the database with it. Both programs take the MEME motif prediction from the second email[4] as input; MetaMEME also uses the original sequence file that was used to generate the MEME motifs in creating its HMM. Both programs return results showing the position of each match, its score, and its statistical significance.
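The basic idea of scanning sequences with a motif can be sketched in a few lines of Python. This is an illustration only, under our own assumptions: a log-odds matrix built from a handful of made-up motif occurrences, a uniform background, and an arbitrary score threshold. It does not reproduce MAST's scoring scheme and makes no attempt at statistical significance.

```python
import math

def log_odds_matrix(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log2-odds matrix from equal-length motif occurrences,
    assuming a uniform background over `alphabet`."""
    background = 1.0 / len(alphabet)
    matrix = []
    for column in zip(*sites):
        total = len(column) + pseudocount * len(alphabet)
        matrix.append({
            a: math.log2((column.count(a) + pseudocount) / total / background)
            for a in alphabet
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][res] for i, res in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

# Made-up motif occurrences and a tiny made-up "database".
sites = ["TGACTC", "TGACTC", "TGAGTC", "TGACTA"]
matrix = log_odds_matrix(sites)
database = {"with_motif": "CCTGACTCAA", "without_motif": "CCCCCCCCCC"}
for name, seq in database.items():
    print(name, scan(seq, matrix, threshold=5.0))
```

The threshold of 5.0 is arbitrary. Choosing a cutoff by eye is exactly the problem that the significance values reported by the real programs are meant to solve.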

[4] You did save the second email to a text file as we suggested, didn't you?

8.4.2.3 Motif discovery with other programs

As we mentioned previously, there are a number of programs that discover motifs in groups of unaligned sequences. Besides the ones we mentioned, you may want to try these: the SAM HMM programs developed by David Haussler and coworkers at the University of California, Santa Cruz; the Emotif and Ematrix servers in the Brutlag group at Stanford University; and the ASSET, gibbs, and Probe tools available for download from NCBI. Again, a good thing to do early on is to use the LAMA program to compare your motif against the motifs in the Blocks database. If it looks like you really do have a novel motif, it can be useful to compare the results of one or more of these other motif discovery tools. If all the programs predict the same motif from the same sequences, you can be more confident in your results.

8.4.2.4 HMMer

HMMer is a software package for building profile HMMs. HMMer's central functionality is located in the hmmbuild program, which creates profile HMMs from a sequence alignment, and the hmmcalibrate program, which calibrates search statistics for the HMM. The HMMer package also contains tools for generating new sequences probabilistically based on an HMM, searching sequence databases with a profile as the query, and searching profile databases with a query sequence, as well as the handy utility programs we list here:

getseq

Extracts a sequence from a large flat-file database by name. Handy to have around if you're selecting specific records out of a database from the command line.

hmmalign

Reads both a sequence file and a profile HMM and creates a multiple sequence alignment.

hmmbuild

Builds a profile HMM from a multiple sequence alignment. It can produce global results for the entire alignment or results for multiple local alignments.

hmmcalibrate

Reads an HMM and calibrates its search statistics.

hmmconvert

Converts a profile HMM file into other formats.

hmmemit

Generates sequences probabilistically based on a profile HMM. It can also generate a consensus sequence.

hmmfetch

Retrieves a profile HMM from a database if the name of the desired record is known.

hmmindex

Indexes a profile HMM database.

hmmpfam

Searches a profile HMM database (e.g., Pfam) with a query sequence. Use this if you're trying to annotate an unknown sequence.

hmmsearch

Searches a sequence database with a profile HMM. Use this if you're looking for more instances of a pattern in a sequence database.

sreformat

Converts a sequence or alignment file from one format to another. Handy to have around.

HMMer reads multiple sequence alignment files from several different sequence alignment programs, including ClustalW. The HMMer authors recommend ClustalW as a tool to generate multiple alignments for input into hmmbuild.
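The programs described above are usually chained together: build a profile from an alignment, calibrate it, then search with it. The Python sketch below assembles those three command lines. We are assuming the HMMer 2.x argument order (the HMM file first, then the alignment or sequence database); later releases may differ, so check each program's -h output before relying on it. The filenames are hypothetical placeholders, and the helper names are our own.

```python
import shutil
import subprocess

def hmmer_commands(alignment, hmm_file, sequence_db):
    """Argument lists for a build / calibrate / search run.
    Assumes HMMer 2.x-style argument order."""
    return [
        ["hmmbuild", hmm_file, alignment],     # alignment -> profile HMM
        ["hmmcalibrate", hmm_file],            # calibrate search statistics
        ["hmmsearch", hmm_file, sequence_db],  # profile vs. sequence database
    ]

def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)

# Hypothetical filenames; substitute your own alignment and database.
commands = hmmer_commands("myfamily.aln", "myfamily.hmm", "database.fasta")
for cmd in commands:
    print(" ".join(cmd))
# To actually run the programs (requires HMMer to be installed):
# run_pipeline(commands)
```

Building the argument lists separately from running them makes it easy to inspect or log the exact commands first, which helps keep an analysis reproducible.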

HMMer is available for download from Dr. Sean Eddy at Washington University (http://hmmer.wustl.edu). HMMer is a very well-behaved program, which installs without difficulty from source on Linux systems: just follow the directions in the INSTALL file. It even installs its own Unix manpages, so you can access online help for each of the HMMer programs using the man command. Specific information about each HMMer program's command-line options can also be viewed by running the program with the -h option.

In document C Gibas, P Jambeck Developing Bioinformatics Computer Skil pdf (Page 184-187)