
7.5 SSU alignment methods

Due to the prevalence and scale of SSU-based environmental surveys, several databases dedicated to SSU sequence analysis have been developed. Nearly all of these databases use their own tool for aligning the SSU sequences they contain. These tools differ mainly in alignment strategy. This section briefly discusses the different strategies, and the next section provides more detail about the most widely used databases and their respective alignment tools.

The gold standard: manual alignment

Despite all of the advances in computational methods for sequence alignment in the past 30 years, the most reliable alignments are still manually created by expert curators. While a computer program is often very good at getting an alignment almost right, if accuracy is of paramount importance then it is still necessary for an expert to check over the output alignment. One of the reasons human experts outperform computers is that a human can easily take into account extra information unavailable to most computer programs. A good example is the conserved secondary structure of SSU, of which most alignment programs are ignorant. Higher-order structural contacts (tertiary interactions) are another example: no existing alignment programs take these into account. Further, a human has access to databases of existing alignments with sequences covering the entire tree of life and, more importantly, the capacity to intelligently mine that data as needed. The curator can extract similar sequences as needed when computing the alignment, essentially performing simultaneous phylogenetic inference and alignment, a task with which current automated methods have difficulty.

The problem with manual alignment is that it is time consuming. When only a few dozen SSU sequences existed in the 1980s, they could all be aligned manually in a reasonable amount of time. This was still true in the early 1990s when several hundred sequences were available [145]. Today, with several hundred thousand sequences being generated per year (Figure 7.2), manual alignment of every sequence is clearly impractical. For this reason, nearly all SSU surveys use an automated alignment program to create alignments.

Several different computer programs are used. The main similarity between them is that they all take advantage of a manually created reference, or seed, alignment when aligning new target sequences. The main difference between the programs is the specific manner in which the seed alignment is used. There are two classes of programs.

Profile-based programs align new sequences to a statistical model (a profile) that represents the diversity in the entire seed alignment. Nearest-neighbor programs use a small, carefully selected subset of the seed sequences when aligning new sequences (Figure 7.7).

[Figure panels. Nearest-neighbor strategy: trusted seed alignment; new target sequences; 1. Find template(s) (nearest neighbor(s)); 2. Align to template(s); 3. Add to new alignment. Profile strategy: trusted seed alignment; construction procedure; profile; new target sequences; alignment.]

Figure 7.7: Schematic of nearest-neighbor and profile based alignment strategies.

Nearest-neighbor strategy

Given a new target sequence to align, the nearest-neighbor strategy proceeds through two steps. First, one or more template sequences, or nearest neighbors, are selected from the seed alignment. These are the seed sequences most similar to the target by some criterion (for example, sharing the most 7-mers, i.e. oligonucleotides of length 7, with the target). The second step is the calculation of the alignment of the target sequence to the template(s). The non-template seed sequences are ignored in the alignment step.
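The template selection step can be sketched as follows. This is a minimal illustration, not the procedure of any particular SSU tool: the function names (`kmers`, `select_templates`), the use of distinct shared k-mers as the similarity criterion, and the tie-breaking by name are all assumptions made for the example.

```python
def kmers(seq, k=7):
    """Return the set of distinct k-mers in seq, ignoring gap characters."""
    ungapped = seq.replace("-", "").replace(".", "").upper()
    return {ungapped[i:i + k] for i in range(len(ungapped) - k + 1)}


def select_templates(target, seed_alignment, k=7, n_templates=1):
    """Pick the seed sequences that share the most distinct k-mers with target.

    seed_alignment maps a sequence name to its aligned (gapped) sequence.
    Ties are broken alphabetically by name (an arbitrary choice).
    """
    target_kmers = kmers(target, k)
    scored = [
        (len(target_kmers & kmers(aligned, k)), name)
        for name, aligned in seed_alignment.items()
    ]
    # Highest shared k-mer count first, then name for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:n_templates]]


# Toy seed alignment (hypothetical sequences, k=3 to keep the example small).
seed = {
    "seqA": "ACGU-ACGUAC",
    "seqB": "GGGCCCAAAUU",
    "seqC": "ACGUUACG-AC",
}
templates = select_templates("ACGUACGUAC", seed, k=3, n_templates=2)
```

Note that every seed sequence is examined for every target, which is why the cost of this step grows with the size of the seed alignment.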

Importantly, if only one template sequence is chosen then the alignment is a simple pairwise alignment to that template. In this case, the program has no information regarding the varying levels of expected conservation, or the likelihood of insertions or deletions, at different positions of the alignment. (Profile-based alignment programs, however, as described next, do have access to this information.) This is information that an expert curator would almost certainly take into account when computing the alignment.

Of course, if the template is 100% identical to the target, the alignment is trivial and the method used is irrelevant. However, as the identity between template and target decreases, so does the reliability of a pairwise alignment strategy.
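To make the single-template case concrete, here is a minimal sketch of a global pairwise alignment (Needleman-Wunsch dynamic programming) of a target to one template. The scoring values and the linear gap penalty are assumptions chosen for illustration, not those of any SSU alignment tool. The point to notice is that `match`, `mismatch`, and `gap` are the same at every position: the method has no way to know that some positions are more conserved, or more prone to insertions and deletions, than others.

```python
def pairwise_align(target, template, match=1, mismatch=-1, gap=-2):
    """Globally align target to template with position-independent scores.

    Returns (aligned_target, aligned_template, score).
    """
    n, m = len(target), len(template)
    # score[i][j] is the best score for aligning target[:i] with template[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if target[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # target residue against a gap
                score[i][j - 1] + gap,      # template residue against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_target, aligned_template = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if target[i - 1] == template[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_target.append(target[i - 1])
                aligned_template.append(template[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_target.append(target[i - 1])
            aligned_template.append("-")
            i -= 1
        else:
            aligned_target.append("-")
            aligned_template.append(template[j - 1])
            j -= 1
    return "".join(reversed(aligned_target)), "".join(reversed(aligned_template)), score[n][m]


# A target missing one residue relative to its template (toy sequences).
result = pairwise_align("ACU", "ACGU")
```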

The existing SSU nearest-neighbor tools differ in the number of template sequences they use and in the specific scoring system used to calculate an alignment, as discussed in more detail below. All of these tools use only primary sequence information to calculate an alignment, i.e. they do not explicitly score how well the proposed alignment agrees with a model of SSU secondary structure. However, there is nothing inherent to the nearest-neighbor strategy that prevents it from modeling structure.

Profile strategy

An alternative to nearest-neighbor based approaches is to align a sequence to a statistical model called a profile that is built from a multiple sequence alignment of a representative set of sequences (the seed alignment). Profiles are routinely used for homology search as discussed in Chapter 1, for which they are generally considered among the most powerful tools available. Homology search with profiles requires scoring sequences by aligning them to the profile. Because of this, it is trivially simple to modify profile homology search programs to create multiple alignments, and the widely used hmmer, sam and infernal packages are able to create alignments as well as perform searches.

Profile-based alignment is used by some large and popular non-SSU sequence databases including pfam [76], rfam [89], and smart [151]. These databases use the “seed-full” strategy for building and maintaining multiple alignments using profiles. A small set of representative sequences, typically 50 or so, is chosen and aligned with manual curation to create a seed alignment. A profile is built from the seed alignment and used to align all other examples of the sequence family (potentially found in a database search using the profile) to create a full alignment. If the seed sequences are indeed representative of the sequence diversity in the family, the model is typically able to accurately align the target sequences. A single SSU database, the rdp database, uses profile-based alignment to a seed, as discussed below.

Though presented as two distinct classes, profile and nearest-neighbor based methods can also be viewed as two extremes on a continuum. As the number of template sequences used by a nearest-neighbor based approach increases, the method becomes increasingly similar to a profile strategy. By using more than one template, the program gains position-specific information. For example, an A in the target sequence should align with a higher score to a position that is 100% A in the template sequences than to a position that is 25% each of A, C, G, and U in the template sequences.
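The position-specific scoring idea can be illustrated with a small sketch. It assumes a log-odds score against a uniform background over the four RNA residues, which is one common way to turn column frequencies into scores; the function name `column_log_odds` and the optional `pseudocount` parameter are introduced here for illustration and are not taken from any of the tools discussed.

```python
import math


def column_log_odds(column, residue, alphabet="ACGU", pseudocount=0.0):
    """Score (in bits) for aligning residue to one column of template residues.

    column is a string holding the residue each template has at this position;
    characters outside the alphabet (such as gaps) are ignored.
    """
    counts = {r: pseudocount for r in alphabet}
    for r in column:
        if r in counts:
            counts[r] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no residues from the alphabet")
    frequency = counts[residue] / total
    background = 1.0 / len(alphabet)  # assumed uniform background
    if frequency == 0:
        return float("-inf")
    return math.log2(frequency / background)


# An A scored against a column that is 100% A, and against a column that is
# 25% each of A, C, G, and U.
conserved_score = column_log_odds("AAAA", "A")  # log2(1.0 / 0.25) = 2.0 bits
variable_score = column_log_odds("ACGU", "A")   # log2(0.25 / 0.25) = 0.0 bits
```

With a single template, every column looks fully conserved, so this distinction between conserved and variable positions cannot be made.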

Important considerations regarding alignment strategies

1. Error propagation. Errors in the seed alignment are likely to propagate during the alignment of target sequences. In the nearest-neighbor approach, an error in the alignment of template sequence x is likely to propagate on to the alignment of any new sequence y that uses x as a template.

2. Running time. For nearest-neighbor methods, the time required for template selection scales with the size of the seed alignment. Finding the appropriate template sequence requires some type of comparison between the new sequence and each candidate template sequence, so as the number of candidate templates increases, so does the number of computations required to pick the templates. For a profile, no such step is required: once the profile is built, each sequence is independently aligned directly to it.

3. Aligning novel sequences. With the nearest-neighbor strategy, as the similarity between the target and template sequence(s) decreases, the probability of alignment errors in the target sequence alignment increases. Similarly for profiles, as the target sequence becomes increasingly different from all of the seed sequences, it becomes more difficult for the profile to accurately align the sequence. However, a profile is more general than any nearest-neighbor approach because it encapsulates the diversity of the entire seed alignment, and so may be better able to accurately align novel sequences. Reliable alignment of novel sequences is crucial because they are the interesting sequences in SSU surveys (the more divergent, the more interesting) and are continually being discovered. During manual alignment, an expert curator would expend disproportionate effort when aligning novel sequences, taking structure into account where necessary.

These considerations have implications for the desired number of sequences in the seed alignment, especially for the nearest-neighbor based methods. Considerations #1 and #2 suggest that the number should be low, to make the construction of a manually curated seed alignment with minimal errors feasible and to limit the time required to select the templates. However, consideration #3 argues that the seed alignment should contain a large number of sequences (or at least a sufficiently dense representative set) to minimize the probability that any target sequence is significantly different from all the seed sequences.

In practice, the sizes of the seed alignments used by nearest-neighbor based and profile-based SSU alignment tools differ dramatically (Table 7.2). The nearest-neighbor based approaches for SSU alignment all use seeds with thousands of sequences, while the only existing profile-based SSU database (rdp, described below) uses two profiles built from seeds of about 500 and 80 sequences respectively. The profile-based pfam and rfam databases include some alignments of hundreds of thousands of sequences created with profiles that were built from seeds of a few hundred sequences or fewer.

If the alignment accuracy of the two strategies is comparable, this presents a clear and important advantage of profile-based methods. Manually constructing a highly refined, accurate seed of a hundred or so sequences is easier than constructing one with thousands of sequences.

Structural SSU alignment using profile SCFGs

Unlike for nearest-neighbor based methods, there are existing profile-based SSU alignment tools that can explicitly take into account conserved secondary structure during alignment.

Stochastic context-free grammars (SCFGs) are probabilistic models well-suited to modeling the well-nested structure and sequence conservation of RNAs. The infernal package implements profile SCFGs called covariance models (CMs, introduced in Chapter 1) that model the consensus structure and sequence of a particular RNA family [190]. The rnacad package is another implementation of profile SCFGs by Michael Brown and David Haussler [24].

Because profile SCFGs are probabilistic models, they can directly calculate, for each aligned residue in an output alignment, a confidence estimate that reflects how ambiguous its alignment is given the model. These confidence estimates could be used to automatically determine alignment-specific masks for removing ambiguously aligned regions prior to phylogenetic inference.
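One way such a mask might be derived is sketched below. The input format (a per-sequence list of per-column confidence values between 0 and 1, with `None` at gap positions), the use of the column mean, and the 0.9 threshold are all assumptions made for illustration; they do not describe the output format or behavior of infernal or rnacad.

```python
def confidence_mask(confidences, threshold=0.9):
    """Return one boolean per alignment column: True to keep, False to mask.

    confidences is a list with one entry per sequence; each entry is a list of
    per-column confidence values in [0, 1], or None where the sequence has a gap.
    A column is kept when the mean confidence of its aligned residues reaches
    the threshold. Columns with no aligned residues are masked.
    """
    n_columns = len(confidences[0])
    mask = []
    for col in range(n_columns):
        values = [row[col] for row in confidences if row[col] is not None]
        keep = bool(values) and sum(values) / len(values) >= threshold
        mask.append(keep)
    return mask


def apply_mask(alignment, mask):
    """Remove masked columns from every aligned sequence."""
    return {
        name: "".join(ch for ch, keep in zip(seq, mask) if keep)
        for name, seq in alignment.items()
    }


# Toy alignment with hypothetical confidence values.
alignment = {"s1": "ACG", "s2": "AU-"}
confidences = [
    [0.99, 0.50, 0.95],
    [0.98, 0.40, None],
]
mask = confidence_mask(confidences)
masked = apply_mask(alignment, mask)
```

Here the middle column, where both residues are aligned with low confidence, is removed before the alignment would be passed on to phylogenetic inference.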

However, SCFG-based alignment is much more computationally expensive than primary sequence-based alignment. This is especially true for infernal. In 2001, prior to development of version 0.55 of infernal, alignment of a single SSU sequence required more than 22 Gb of RAM, making it infeasible on modern computers. Sean Eddy solved the memory problem with version 0.55 by extending the Myers-Miller linear memory dynamic programming trick [186] to CMs, reducing the required RAM to 67 Mb. However, version 0.55 still requires about 15 minutes to align a single SSU sequence. rnacad uses constraints from a first-pass, sequence-only alignment to accelerate alignment, and requires about 30 seconds to align a single SSU sequence [24]. In Chapter 8, I describe the application of an acceleration technique like Brown’s to infernal, which reduces SSU alignment time to about 1 second per sequence (timings are for a single Intel Xeon 3.0 GHz processor).
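To put these per-sequence timings in perspective, the short calculation below converts them into total single-processor time for a batch of sequences. The per-sequence times are the approximate figures quoted above; the batch size of 300,000 sequences is an assumption, chosen only to represent the "several hundred thousand sequences per year" scale mentioned earlier, and the labels in the dictionary are informal.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Approximate time to align one SSU sequence, in seconds (from the text).
seconds_per_sequence = {
    "infernal 0.55": 15 * 60,
    "rnacad": 30,
    "accelerated infernal": 1,
}

# Assumed batch size, for illustration only.
n_sequences = 300_000


def cpu_days(n, seconds_each):
    """Total single-processor time, in days, to align n sequences."""
    return n * seconds_each / SECONDS_PER_DAY


totals = {name: cpu_days(n_sequences, secs) for name, secs in seconds_per_sequence.items()}

# Speedup of the accelerated approach relative to version 0.55.
speedup = seconds_per_sequence["infernal 0.55"] / seconds_per_sequence["accelerated infernal"]
```

Under these assumptions, the batch would take 3125 processor-days at 15 minutes per sequence, about 104 processor-days at 30 seconds per sequence, and about 3.5 processor-days at 1 second per sequence, a 900-fold difference between the first and last figures.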