Sequence alignment is a fundamental concept in bioinformatics, as it is the basis of many further analyses. Nucleic acid fragments and protein primary structures are represented as sequences that are related to each other by the process of evolution, so aligning such sequences is necessary to compare genotypes.
The guiding principle of sequence alignment algorithms is to retrace the evolutionary process in order to highlight the conservations and substitutions that occurred along it. In addition, sequence alignment can highlight similarities due to structural or functional reasons. Two sequences are aligned by writing them in two rows:
QVQLQESG-AEVMKPGASVKISCKATG---YTFSTYWIEWVKQRPGHGLEWIGEILPGSGSTYYNEKFKG-K
VLMTQTPLSLPVSLGDQASISCKSSQSIVHSSGNTYFEWYLQKPG----QSPKLL----IYKVSNRFSGVPD
3.2 Sequence Alignment
Characters (i.e. nucleotides or amino acid residues) placed in the same column represent conservations and substitutions. Gaps (character ‘-’) represent either insertions or deletions.
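As an illustration of how the columns of an alignment are read, the following minimal sketch counts conserved, substituted and gapped columns in the two rows shown above. The function name `classify_columns` is hypothetical and introduced only for this example.

```python
def classify_columns(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count conserved, substituted and gapped columns of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"conserved": 0, "substituted": 0, "gapped": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gapped"] += 1       # insertion or deletion
        elif a == b:
            counts["conserved"] += 1    # same residue in both rows
        else:
            counts["substituted"] += 1  # different residues in the same column
    return counts


row_a = "QVQLQESG-AEVMKPGASVKISCKATG---YTFSTYWIEWVKQRPGHGLEWIGEILPGSGSTYYNEKFKG-K"
row_b = "VLMTQTPLSLPVSLGDQASISCKSSQSIVHSSGNTYFEWYLQKPG----QSPKLL----IYKVSNRFSGVPD"
counts = classify_columns(row_a, row_b)
print(counts)
```

A column is counted as gapped when either row carries the gap character; the alignment itself cannot tell whether the underlying event was an insertion in one sequence or a deletion in the other.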
An alignment algorithm is defined by two components: (i) the score function, i.e. a scoring scheme used to evaluate an alignment; (ii) the alignment strategy, i.e. a strategy which gives the succession of substitutions, insertions and deletions that maximizes the score.
The aim of a scoring scheme is to favour those substitutions, insertions and deletions that have been selected by evolution. Since many factors may participate in evolutionary selection, finding the correct scoring scheme is still an open problem, which greatly affects the reliability of sequence alignments, especially for distantly related proteins. The most commonly used scoring schemes rely on substitution matrices, which are based on the statistical observation of the substitutions that occurred in sets of homologous proteins. In addition to substitution matrices, a scoring system usually includes two different costs for gap creation and gap extension (affine gap penalties), reflecting the fact that opening a new insertion or deletion is less probable than extending an existing one.
Alignment strategies may have a global or local scope, and different strategies apply when a pair of sequences (pairwise alignments) or sets of related sequences (multiple alignments) are considered. Global strategies attempt to align every residue, and are suitable when the sequences are similar and of roughly equal size. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity within their larger sequence context.
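A scoring scheme with affine gap penalties can be sketched as follows. This is only an illustration: the function `score_alignment`, the toy match/mismatch scores and the penalty values are assumptions, and the convention adopted here (the first gap position costs `gap_open`, each further position costs `gap_extend`) is one of several in use.

```python
def score_alignment(row_a, row_b, substitution, gap_open=-10, gap_extend=-1, gap="-"):
    """Score an existing pairwise alignment with affine gap penalties."""
    score = 0
    in_gap_a = in_gap_b = False  # are we inside a run of gaps in row_a / row_b?
    for a, b in zip(row_a, row_b):
        if a == gap:
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif b == gap:
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += substitution(a, b)
            in_gap_a = in_gap_b = False
    return score


def toy_substitution(a, b):
    """Toy scores standing in for a real substitution matrix (assumed values)."""
    return 2 if a == b else -1


one_long_gap = score_alignment("A--CA", "ATTCA", toy_substitution)
two_short_gaps = score_alignment("A-C-A", "ATCTA", toy_substitution)
print(one_long_gap, two_short_gaps)
```

Both alignments contain three matches and two gap positions, but the single two-residue gap is penalised less than two separate one-residue gaps, which is the behaviour affine penalties are meant to produce.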
3.2.1 Substitution Matrices
The PAM (Point Accepted Mutation) matrices, or Dayhoff Matrices (23), are calculated by observing the substitutions in closely related proteins. The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed. The PAM1 matrix is used as the basis for calculating other PAM matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM30 and the PAM70 are used.
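The extrapolation from PAM1 to higher PAM distances amounts to repeated multiplication of the mutation probability matrix by itself. The sketch below illustrates this on a toy two-letter alphabet; the matrix values are invented for the example, whereas a real PAM1 matrix is a 20 x 20 matrix estimated from data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each residue has a 1% chance of changing in one PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250[0])
```

Because several substitutions can hit the same site, and a site can mutate back, the probability of still observing the original residue after 250 PAM units stays well above zero: in this toy case it converges towards 0.5, the equilibrium of a symmetric two-letter model.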
BLOSUM matrices (24) are based on the analysis of substitutions observed within conserved blocks of aligned protein sequences belonging to different protein families. Like PAM matrices, BLOSUM matrices can be derived at different evolutionary distances. The measure of evolutionary distance is the percentage identity of the aligned blocks in which the substitutions are counted. For instance, the widely used BLOSUM62 matrix is derived from aligned blocks in which sequences sharing 62% or more identical residues are clustered together and counted as a single sequence. This is approximately equivalent to PAM150.
In order to simplify their use in scoring systems, substitution matrices are normally given in log-odds format: odds are used instead of probabilities to take into account the different frequencies of occurrence of the amino acids, and logarithms are a convenience that allows the score of an alignment to be obtained as the sum of the scores of the single substitutions.
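The log-odds transformation can be illustrated with a short sketch. The frequencies below are invented, and the scale factor of 2 (scores in half-bit units, rounded to integers) is an assumption made for the example rather than a property of any particular matrix discussed here.

```python
import math


def log_odds(q_ab, p_a, p_b, scale=2.0):
    """Log-odds score of observing residues a and b aligned.

    q_ab     : observed frequency of the aligned pair (a, b)
    p_a, p_b : background frequencies of a and b
    The ratio q_ab / (p_a * p_b) compares the pair with chance expectation.
    """
    return round(scale * math.log2(q_ab / (p_a * p_b)))


favoured = log_odds(q_ab=0.02, p_a=0.1, p_b=0.1)     # pair seen twice as often as by chance
disfavoured = log_odds(q_ab=0.005, p_a=0.1, p_b=0.1)  # pair seen half as often as by chance
print(favoured, disfavoured)
```

Pairs observed more often than expected by chance receive positive scores and pairs observed less often receive negative ones; summing the scores over the columns of an alignment corresponds to multiplying the underlying odds.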
Many matrices alternative to PAM and BLOSUM have been proposed; for a review and assessment of substitution matrices, see (25).
3.2.2 Pairwise Sequence Alignment
A pairwise alignment can be represented by a path in a matrix in which one of the sequences indexes the rows and the other the columns. A diagonal movement represents an aligned pair of residues (a conservation or a substitution), while horizontal and vertical movements represent gaps. The dot-matrix sequence comparison is a technique that makes it possible to detect visually the regions of similarity between two sequences.
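A dot matrix in its simplest form marks every cell in which the row residue equals the column residue, so that regions of similarity appear as diagonal runs. The sketch below is a minimal version without the windowing and thresholding that practical tools usually add; the function names are hypothetical.

```python
def dot_matrix(seq_rows, seq_cols):
    """Boolean matrix: True where the row residue equals the column residue."""
    return [[a == b for b in seq_cols] for a in seq_rows]


def render(matrix, seq_rows, seq_cols):
    """Render the dot matrix as text, with '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq_cols)]
    for residue, row in zip(seq_rows, matrix):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


seq = "GATTACA"
matrix = dot_matrix(seq, seq)
print(render(matrix, seq, seq))
```

Comparing a sequence with itself produces a full main diagonal; off-diagonal dots correspond to residues repeated within the sequence.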
The Needleman and Wunsch algorithm (26) is the standard technique for global pairwise sequence alignment. It is based on dynamic programming, an optimization technique which relies on the fact that the solution of the whole problem can be obtained from the solutions of its subproblems. The Needleman and Wunsch algorithm finds the optimal path in the matrix by incrementally storing the optimal score for each cell.
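The dynamic programming recurrence can be sketched as follows. For brevity the example uses a fixed match/mismatch score and a linear gap penalty, both of which are assumptions of this sketch; a substitution matrix and affine gap costs would require a lookup in place of the match test and additional bookkeeping for gap runs.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, row_a, row_b)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # prefix of seq_a aligned to gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # prefix of seq_b aligned to gaps only

    def sub(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill: each cell stores the best score of aligning the two prefixes.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # diagonal
                              score[i - 1][j] + gap,            # gap in seq_b
                              score[i][j - 1] + gap)            # gap in seq_a

    # Traceback from the bottom-right corner to the origin.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best_score, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
print(best_score)
print(aligned_a)
print(aligned_b)
```

When several paths are equally good, the traceback above prefers the diagonal move; other tie-breaking rules give different but equally scoring alignments.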
The Smith and Waterman algorithm (27) is the reference local alignment method, and it is also based on dynamic programming. The main difference from the Needleman and Wunsch algorithm is that, while building the matrix, the score of a cell is never allowed to fall below zero: when it would become negative it is reset to zero, so that a local alignment can start afresh at any position, and the best local alignment ends at the highest-scoring cell.
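A minimal sketch of local alignment by dynamic programming is given below. As an assumption of this example, it uses fixed match/mismatch scores and a linear gap penalty instead of a substitution matrix and affine gap costs. The score of a cell is floored at zero, and the traceback starts from the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty; returns (score, row_a, row_b)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(0,                                # restart the alignment
                              score[i - 1][j - 1] + sub(i, j),  # diagonal
                              score[i - 1][j] + gap,            # gap in seq_b
                              score[i][j - 1] + gap)            # gap in seq_a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback from the best cell until a zero-scoring cell is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


local_score, local_a, local_b = smith_waterman("TTTACGTAAA", "GGGACGTCCC")
print(local_score, local_a, local_b)
```

Only the shared core of the two sequences is reported; the dissimilar flanking regions, which a global alignment would be forced to include, are left out.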
Both the Needleman and Wunsch and the Smith and Waterman algorithms are suited to be used with substitution matrices and affine gap costs.
3.2.3 Multiple Sequence Alignment
Multiple alignments can be used to highlight conserved regions in a set of proteins. They are often used to perform phylogenetic analyses on protein families.
Dynamic programming techniques can be extended to multiple sequences, but in their pure form they are rarely adopted for alignments of more than a handful of sequences, because memory and time costs grow exponentially with the number of sequences.
Progressive techniques are an effective compromise, and can be scaled to deal with large numbers of sequences. The main idea behind progressive techniques is to start from the most closely related sequences and then add the other sequences one at a time. The best-known progressive methods are CLUSTALW (28) and MUSCLE (29). T-Coffee (30) is another progressive method, which also makes it possible to combine results obtained with several alignment methods.
Multiple alignments are often expressed as Hidden Markov Models or Position-Specific Scoring Matrices to describe protein families or motifs.
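The exponential growth can be made concrete with a rough count. Assuming all sequences have the same length, a straightforward extension of dynamic programming to n sequences fills an n-dimensional matrix, and each cell can be reached from up to 2^n - 1 neighbouring cells. The function below is only a back-of-the-envelope estimate under these assumptions.

```python
def dp_cost(num_sequences, length):
    """Rough cost of exact dynamic programming over num_sequences sequences.

    Returns (cells, cell_updates): the number of matrix cells and the number of
    predecessor evaluations, assuming 2**n - 1 possible moves into each cell.
    """
    cells = (length + 1) ** num_sequences
    moves_per_cell = 2 ** num_sequences - 1
    return cells, cells * moves_per_cell


for n in (2, 3, 5, 10):
    cells, updates = dp_cost(n, 100)
    print(n, cells, updates)
```

For sequences of 100 residues the matrix grows from about ten thousand cells for two sequences to about a million for three, and to roughly 10^20 for ten, which explains why heuristic strategies are preferred.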
Figure 3.4: A multiple sequence alignment.
3.2.4 Position-Specific Scoring Matrices
Analysis of multiple sequence alignments for conserved blocks of sequence leads to the production of the position-specific scoring matrix, or PSSM. PSSMs are used to search for a motif in a sequence, to extend similarity searches, or as a sequence encoding for prediction algorithms. A PSSM can be constructed by a direct logarithmic transformation of a matrix giving the frequency of each amino acid at each position of the alignment. If the number of aligned residues in a particular position is large and reasonably diverse, the sequences represent a good statistical sampling of all sequences that are ever likely to be found. For example, if a given column in 20 sequences has only isoleucine, it is not very likely that a different amino acid will be found in other sequences with that motif, because the residue is probably important for function. In contrast, another column from the same 20 sequences may contain several amino acids, each represented by only a few residues. Hence, if the data set is small, unless the motif has almost identical amino acids in each column, the column frequencies in the PSSM may not be highly representative of all other occurrences of the motif.
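The direct logarithmic transformation can be sketched as follows. The block of sequences is invented, the uniform background frequency of 1/20 is an assumption made to keep the example short, and the function names are hypothetical. An amino acid never observed in a column has zero frequency, and its score is reported here as negative infinity.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(block):
    """Frequency of each amino acid in each column of an ungapped block."""
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("all sequences in the block must have the same length")
    freqs = []
    for c in range(length):
        counts = Counter(seq[c] for seq in block)
        freqs.append({a: counts[a] / len(block) for a in AMINO_ACIDS})
    return freqs


def pssm_from_frequencies(freqs, background=None):
    """Log-odds PSSM: log2 of column frequency over background frequency."""
    if background is None:
        background = {a: 1 / len(AMINO_ACIDS) for a in AMINO_ACIDS}  # assumed uniform
    return [{a: math.log2(col[a] / background[a]) if col[a] > 0 else float("-inf")
             for a in AMINO_ACIDS}
            for col in freqs]


block = ["ILK", "IVK", "IMR", "ILR"]  # toy block, invented for the example
pssm = pssm_from_frequencies(column_frequencies(block))
print(round(pssm[0]["I"], 2), pssm[0]["L"])
```

With only four sequences, every unobserved amino acid is treated as impossible, which illustrates why small data sets give column frequencies that are poorly representative of the motif.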
The estimates of the amino acid frequencies can be improved by adding extra amino acid counts, called pseudocounts, to obtain a more reasonable distribution of amino acid frequencies in the column (31):
$$p_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c} \qquad (3.1)$$
where $n_{ca}$ and $b_{ca}$ are the real counts and pseudo-counts, respectively, of amino acid $a$ in column $c$, and $N_c$ and $B_c$ are the total numbers of real counts and pseudo-counts, respectively, in the column.
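Equation 3.1, read as the ratio of real counts plus pseudo-counts to their respective totals, can be illustrated on the column of 20 isoleucines mentioned above. The choice of 5 pseudo-counts spread uniformly over the 20 amino acids is an assumption made only for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def smoothed_frequencies(counts, pseudocounts):
    """Apply p_ca = (n_ca + b_ca) / (N_c + B_c) to a single column."""
    N_c = sum(counts.values())        # total real counts in the column
    B_c = sum(pseudocounts.values())  # total pseudo-counts in the column
    return {a: (counts.get(a, 0) + pseudocounts[a]) / (N_c + B_c)
            for a in pseudocounts}


counts = {"I": 20}                                   # column with only isoleucine
pseudocounts = {a: 0.25 for a in AMINO_ACIDS}        # B_c = 5, spread uniformly (assumed)
p = smoothed_frequencies(counts, pseudocounts)
print(round(p["I"], 2), round(p["L"], 2))
```

Isoleucine still dominates the column, but every other amino acid now has a small non-zero frequency, so the logarithmic transformation is defined for all of them.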
Knowing how many counts to add is not trivial; relatively few pseudo-counts should be added when there is a good sampling of sequences, and more should be added when the data are sparser. Substitution matrices (e.g. BLOSUM or PAM) provide information on amino acid variation. Then, $b_{ca}$ may be estimated from the total number of pseudo-counts in the column by:
$$b_{ca} = B_c \cdot Q_i \qquad (3.2)$$
where $Q_i = \sum_i q_{ia}$ and $q_{ia}$ is the frequency of substitution of amino acid $i$ for amino acid $a$ in the substitution matrix.
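Equation 3.2 can be illustrated with a toy three-letter alphabet. The sketch assumes that the substitution frequencies form a normalised table (all entries sum to one) and that the sum over i yields one value per target amino acid a; the frequencies themselves are invented, and the function name is hypothetical.

```python
def pseudocounts_for_column(B_c, q):
    """Distribute B_c pseudo-counts using summed substitution frequencies.

    q[i][a] is the frequency of substitution of residue i for residue a.
    """
    residues = list(q)
    Q = {a: sum(q[i][a] for i in residues) for a in residues}  # sum over i of q_ia
    return {a: B_c * Q[a] for a in residues}


# Toy symmetric substitution frequencies (invented; entries sum to 1).
q = {
    "I": {"I": 0.20, "L": 0.08, "V": 0.07},
    "L": {"I": 0.08, "L": 0.25, "V": 0.05},
    "V": {"I": 0.07, "L": 0.05, "V": 0.15},
}
b = pseudocounts_for_column(5, q)
print({a: round(value, 2) for a, value in b.items()})
```

Under the normalisation assumed here, the pseudo-counts of the individual amino acids add up to the column total of 5.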