Alignments of DNA and protein sequences containing frameshift errors


Xiaojun Guan and Edward C.Uberbacher

Abstract

Molecular sequences, like all experimental data, are subject to error. Many current DNA sequencing protocols have very significant error rates and often generate artefactual insertions and deletions of bases (indels) which corrupt the translation of sequences and compromise the detection of protein homologies. The impact of these errors on the utility of molecular sequence data is dependent on the analytic technique used to interpret the data. In the presence of frameshift errors, standard algorithms using six-frame translation can miss important homologies because only subfragments of the correct translation are available in any given frame. We present a new algorithm which can detect and correct frameshift errors in DNA sequences during comparison of translated sequences with protein sequences in the databases. This algorithm can recognize homologous proteins sharing 30% identity even in the presence of a 7% frameshift error rate. Our algorithm uses dynamic programming, producing a guaranteed optimal alignment in the presence of frameshifts, and has a sensitivity equivalent to Smith-Waterman. The computational complexity of the algorithm is O(nm), where n and m are the sizes of the two sequences being compared. The algorithm does not rely on prior knowledge or heuristic rules and performs significantly better than any previously reported method.

Introduction

Knowledge of protein and nucleic acid sequences is central to many aspects of modern biology. Experimentally determined DNA sequences are generally used as the basis for determining the sequence of amino acids in encoded proteins using multiple-frame translation or pattern recognition methods. At present, comparison among the sequences is most commonly done using protein sequence (i.e. deduced amino acid sequence) to understand functional relationships among proteins in the same and in different species (States et al., 1991).

A number of sequence comparison algorithms have been introduced. For rigorous analysis, the dynamic programming algorithm proposed by Smith and Waterman (1981), and modified by Gotoh (1982), can be used to find the optimal alignment of two sequences. For fast identification of homologous sequences, heuristic methods such as BLAST (Altschul et al., 1990), based on a probabilistic model, and FastA (Pearson and Lipman, 1988), based on initial k-tuple matches, can be used. These heuristic methods are not as sensitive or accurate as the full dynamic programming method.

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6364, USA

E-mail: [email protected]

As experimental data, sequences are subject to error. Error can arise from a variety of sources, such as migration artifacts in sequencing gels, improper gel reading by computer software, methods for removing inconsistencies during sequence assembly, and data entry. Although some level of error in sequences seems inevitable, different sequencing strategies have different intrinsic error rates and types. Some approaches, such as single-pass cDNA sequencing or cosmid skimming, have virtually no redundancy checks on the sequence, and can have error rates in excess of 10%, while high-redundancy (shotgun) genomic sequencing can have error rates of ~1% or less.

In considering the impact of errors in nucleotide sequences, it is useful to consider not only the frequency of errors but also their type (substitution, deletion or insertion of one or more bases). In particular, derived amino acid sequences are affected very differently by errors of different types. The most common type of error involves an incorrect read for a base which results in a substitution of one base for another. This weakens the recognition of homologies in a linear and fairly gentle way. The requirement of a consistent reading frame for translation, free of insertions or deletions (i.e. without 'frameshift' errors), is a much more stringent constraint on the quality of sequence data used to recognize homologies with standard sequence comparison algorithms. In the presence of insertion-deletion errors, resulting translations for the coding regions may have only certain parts correct in any one given frame. Therefore, when comparing the protein translation to sequences in a protein database, useful homologies can easily be lost.

The trend toward efficient sequencing methods which inherently have less redundancy creates a need for systems which can detect coding regions and detect sequence homologies in the presence of significant error rates.


Several methods have been developed to deal with sequence errors when comparing sequences. Posfai and Roberts (1992) used an approach based on FastA to compare the six-frame translations of a DNA sequence to sequences in a protein database. In this method, if two strongly matched segments are adjacent and are in different reading frames, a possible frameshift site has been found. States and Botstein (1991) proposed an algorithm based on a probabilistic model incorporating codon usage information and a non-uniform distribution of error probabilities. In their algorithm, a DNA sequence which may have errors is compared with a protein sequence, and uncertainties are assigned for each base in the DNA sequence as well as for an insertion or a deletion at that base using the prior information. In this method, and for a specific organism, proteins with 33% sequence identity can be recognized in the presence of 1% frameshifting errors (deletion or insertion) and 5% base substitution. Both existing methods rely either on a probability model or on heuristic rules, and have reduced sensitivity and accuracy compared to more robust dynamic programming methods.

While these algorithms have some specialized utility, we describe here a more general and more sensitive approach to dealing with errors in the course of sequence comparison. We have previously described a coding recognition system which functions in the presence of indels (Xu et al., 1995), and in this paper address methods for recognition of protein sequence homologies when frameshifts are present. The algorithm we describe can recognize homology between DNA sequence translations and related protein sequences in the presence of significant (several percent) indel rates in DNA sequences. Furthermore, for each identified homology, the frameshift sites in the query sequence are located as well. The method is very general because it does not use heuristic rules, specific codon usage information, or any knowledge other than the DNA sequences themselves. Our algorithm uses a rigorous dynamic programming approach similar to the Smith-Waterman algorithm, and is therefore more sensitive and accurate than other current methods.

Methods

The Smith-Waterman algorithm for local alignment, as modified by Gotoh (1982), is as follows. Given two sequences $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$, a cost function, $w$, for $k$ insertions or deletions is defined as

$$w(k) = -(u k + v),$$

where $u \ge 0$ and $v \ge 0$. The alignment of A and B is performed by creating a score matrix D whose values are computed systematically from the upper left corner of the matrix to the lower right. Two other matrices P and Q are used to calculate matrix D:

$$D(i,j) = \max\{D(i-1,j-1) + c(b_i,a_j),\; P(i,j),\; Q(i,j),\; 0\}, \qquad 0 < i \le m,\; 0 < j \le n,$$

where

$$P(i,j) = \max\{D(i-1,j) + w(1),\; P(i-1,j) - u\}$$

and

$$Q(i,j) = \max\{D(i,j-1) + w(1),\; Q(i,j-1) - u\},$$

$$P(0,k) = Q(k,0) = D(k,0) = D(0,k) = 0, \qquad \forall k \ge 0.$$

Here $c(b_i,a_j)$ represents the similarity between $a_j$ and $b_i$, and the term $-u$ is the cost of extending an existing gap by one position, since $w(k+1) - w(k) = -u$.
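The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `gotoh_local_score` and `identity_sim` are hypothetical, and a simple identity score (+5 for a match, -4 for a mismatch) is assumed in place of BLOSUM 62 to keep the example self-contained.

```python
def identity_sim(x, y):
    """Toy similarity c(x, y): an assumption standing in for BLOSUM 62."""
    return 5 if x == y else -4


def gotoh_local_score(a, b, sim=identity_sim, u=1, v=10):
    """Best local alignment score of a (columns) and b (rows).

    Gap cost is w(k) = -(u*k + v); extending an open gap costs u.
    """
    n, m = len(a), len(b)
    w1 = -(u + v)  # w(1): cost of opening a gap of length one
    # Row 0 and column 0 of D, P and Q are zero, as in the boundary condition.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    P = [[0] * (n + 1) for _ in range(m + 1)]
    Q = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # P: best score ending with a gap that skips b[i-1]
            P[i][j] = max(D[i - 1][j] + w1, P[i - 1][j] - u)
            # Q: best score ending with a gap that skips a[j-1]
            Q[i][j] = max(D[i][j - 1] + w1, Q[i][j - 1] - u)
            D[i][j] = max(D[i - 1][j - 1] + sim(b[i - 1], a[j - 1]),
                          P[i][j], Q[i][j], 0)
            best = max(best, D[i][j])
    return best
```

With the toy scoring, aligning `ACDEFGHIKL` against `ACDEFHIKL` bridges the missing `G` with one gap: nine matches minus the gap cost of 11.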

To find the best local alignment, the maximum score in the score matrix D is found first; it corresponds to the right end of an optimal local alignment. A traceback procedure is then performed to find the left end of the optimal local alignment. An example is given in Figure 1. Here the gap penalty function is w(k) = -(1*k + 10), and similarities between amino acids are defined by the protein similarity matrix BLOSUM 62 (Henikoff and Henikoff, 1993). The same gap penalty function and the matrix BLOSUM 62 are used in all the experiments in this paper. When a DNA sequence with frameshift errors is translated into a protein sequence and then aligned with sequences in a protein database, correctly translated segments fall into different frames. In the comparison

[Figure 1: (a) score matrix D for the comparison of S1 and S2; (b) the corresponding optimal local alignment.]

S1: CGGTSLINDQWVV
S2: CTG-SLVSTRAVL

Fig. 1. Two protein sequences, S1 = HFCGGTSLINDQWVVCPT and S2 = MCTGSLVSTRAVLTAGH, are compared using the Smith-Waterman algorithm. Here w(k) = -(1*k + 10), and similarities between amino acids are defined by the protein similarity matrix BLOSUM 62 (Henikoff and Henikoff, 1993). The maximum score in the score matrix D (a) is 16 at (15,13) and the corresponding alignment is given in (b).

a.

Identities: 63/234 (27%)

Query:  36 PYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGDEQFI NAA 91
Sbjct:  41 PFIAFLTTERTMCTGSLVSTRAVLTAGHCVCSPLPV-IRVSFLTLRNGDQQGIHHQPSGV 99

Query:  92 KIIK--HPNY S-SWTLNN-DIMLIKLSSPVKLNARVAPVAL--PSACAPAGTQ 138
Sbjct: 100 KVAPGYMPSCMSARQRRPIAQTLSGFDIAIVMLAQMVNLQSGIRVISLPQPSDIPPPGTG 159

Query: 139 CLISGWG---NTLSNGVNNPDLLQCVDAPVLSQADCEAAYPGEITSSMICVGFLEG-GKD 194
Sbjct: 160 VFIVGYGRDDNDRDPSRKNGGILKKGRATIM ECRHATNG NPICVKAGQNFGQL 212

Query: 195 SCQGDSGGPW--CNGQLQGIVSWGYGCALPDNPG--V-YTKVCNFVGWIQDTI 243
Sbjct: 213 PAPGDSGGPLLPSLQGPVLGWS--HGVTLPNLPDIIVEYASVARMLDFVRSNI 264

b.

frame1:
Query:  36 PYQVSLNSGYHFCGGSLINDQWVVSAAH 63
Sbjct:  41 PFIAFLTTERTMCTGSLVSTRAVLTAGH 68

frame2:
Query: 191 GKDSCQGDSGGP 202
Sbjct: 210 GQLPAPGDSGGP 221

frame3:
Query: 106 DIMLIKLSSPVKLNARVAPVALP 128
Sbjct: 126 DIAIVMLAQMVNLQSGIRVISLP 148

Fig. 2. The alignments of the translation of the rat mRNA and cercarial elastase using the Smith-Waterman algorithm before and after three deletion errors were introduced, where the rat mRNA is the query sequence and cercarial elastase is the subject sequence. (a) The local alignment of the translation of the rat mRNA sequence with cercarial elastase. (b) The three-frame local alignment of the rat mRNA sequence containing three deletion errors with cercarial elastase. Only small segments from the error-free alignment were picked up in each of the three frame alignments.

for any given frame, correct segments may be small and some segments will be incorrect, resulting in significant mismatch, even if the two sequences are truly homologous. An example is shown in Figure 2. The translation of the rat neutrophil elastase mRNA and human cercarial elastase share 27% sequence identity. The translation of rat mRNA is first aligned with human neutrophil elastase using the standard Smith-Waterman algorithm to show the homology (Figure 2(a)). When three deletion errors are introduced in the rat mRNA sequence, and a three-frame comparison is made using Smith-Waterman, the best local homology found in any frame is very poor (Figure 2(b)). In a comprehensive database search, the original alignment was at position 219 in the result list, but with the indels the best local alignment moves down to position 1410, effectively becoming lost in the noise.

As in the above example, consider what happens when the translation in all three frames is aligned independently. Suppose the coding region starts in frame 1, and a deletion occurs in the sequence. The frame 1 translation is aligned with the protein sequence until the deletion site, then the frame 3 translation is aligned with the protein sequence. If a second deletion is encountered, the frame 2 translation picks up the alignment at this site. If instead an insertion were to occur after the first deletion, the alignment would return to the frame 1 translation. If we align the three frame translations independently, segments of alignments will be scattered throughout the three score matrices.
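The frame-hopping behaviour described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names `translate` and `three_frames` and the 15-base example sequence are hypothetical, and the standard genetic code is assumed.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, offset=0):
    """Translate complete codons of dna starting at the given 0-based offset."""
    return "".join(CODON_TABLE[dna[p:p + 3]]
                   for p in range(offset, len(dna) - 2, 3))


def three_frames(dna):
    """Translations for frames 1, 2 and 3 (offsets 0, 1 and 2)."""
    return [translate(dna, offset) for offset in range(3)]


dna = "ATGGCTAAAGGTCCA"          # encodes M A K G P in frame 1
broken = dna[:4] + dna[5:]       # delete one base inside the second codon

print(three_frames(dna)[0])      # intact frame 1 translation
print(three_frames(broken))      # after the deletion the tail moves to frame 3
```

After the single deletion, frame 1 still begins with the correct residue, but the downstream residues (`KGP`) are only readable in frame 3, which is the situation the text describes.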

To find potential frameshift sites, one can examine the segments of alignments in all three frames and look for cases where two segments are adjacent and are in different frames. One might consider heuristic rules to decide which segments to bring into an alignment, especially if the homology is quite strong. However, if the homology is weak, or if there are many such segments, it is difficult to find the overall best alignment because of the complexity and combinatorics involved.

We describe a new algorithm that finds the optimal alignment in such a situation and also locates the indel sites. Since any combination of deletions and insertions places the correct reading frame into one of three frames, no combinatorial calculation is necessary and a computationally efficient dynamic programming approach can be used.

In the standard Smith-Waterman algorithm, the value of a score matrix cell $(i,j)$ can depend on three other matrix cells, $(i-1,j-1)$, $(i,j-1)$ and $(i-1,j)$. In the new alignment algorithm we consider not only the three cells in the same matrix, but also the same three cells in the other two frames' matrices. That is, when computing the alignment for a given frame of translation, we also consider whether there is a better partial alignment in either of the other frames prior to this point that can be continued by shifting the frame to the one under consideration (see Figure 3). To prevent an alignment from being constructed from random matches from three frames and very frequent frameshifts, a penalty is imposed for shifting the frame. Let $\delta$ denote the frameshift penalty, and let $D_f$, $P_f$ and $Q_f$, $1 \le f \le 3$, denote the matrices for the three frames. These matrices are initialized in the same way as in the Smith-Waterman algorithm. The steps to calculate $D_1(i,j)$ are listed below.

$$P_1(i,j) = \max\{D_1(i-1,j) + w(1),\; P_1(i-1,j) - u\}$$

$$Q_1(i,j) = \max\{D_1(i,j-1) + w(1),\; Q_1(i,j-1) - u\}$$

$$D_1(i,j) = \max\{D_1(i-1,j-1) + c(b_i,a_j),\; P_1(i,j),\; Q_1(i,j),\; 0\}$$

$$t_1 = \max\{D_2(i-1,j) + w(1),\; P_2(i-1,j) - u\}$$

$$t_2 = \max\{D_2(i,j-1) + w(1),\; Q_2(i,j-1) - u\}$$

$$t_3 = \max\{D_2(i-1,j-1) + c(b_i,a_j),\; t_1,\; t_2\}$$

if $t_3 - \delta > D_1(i,j)$ then

$$D_1(i,j) = t_3 - \delta, \qquad P_1(i,j) = t_1 - \delta, \qquad Q_1(i,j) = t_2 - \delta$$

$$t_1 = \max\{D_3(i-1,j) + w(1),\; P_3(i-1,j) - u\}$$

$$t_2 = \max\{D_3(i,j-1) + w(1),\; Q_3(i,j-1) - u\}$$

$$t_3 = \max\{D_3(i-1,j-1) + c(b_i,a_j),\; t_1,\; t_2\}$$

if $t_3 - \delta > D_1(i,j)$ then

$$D_1(i,j) = t_3 - \delta, \qquad P_1(i,j) = t_1 - \delta, \qquad Q_1(i,j) = t_2 - \delta$$

Fig. 3. This picture shows the basic idea of our algorithm. When calculating score matrix cell $(i,j)$ for frame 1, we not only consider the three matrix cells $(i-1,j-1)$, $(i-1,j)$ and $(i,j-1)$ in frame 1, but also consider the same three matrix cells in frames 2 and 3, thereby connecting segments of matching sequences in all three frames. This is done for all three score matrices.
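The three-frame recurrence can be sketched in Python as follows. This is a hedged illustration rather than the authors' code. Several things are assumptions made for the sketch: the names `frameshift_local_score`, `translate` and `identity_sim`; the toy identity score used in place of BLOSUM 62; the convention that residue j of frame f is translated from the codon starting at nucleotide 3(j-1)+f (0-based f); and padding the shorter translations with `X` so that all three frames share one coordinate system.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
         for i, b1 in enumerate(BASES)
         for j, b2 in enumerate(BASES)
         for k, b3 in enumerate(BASES)}


def translate(dna, offset):
    return "".join(CODON[dna[p:p + 3]] for p in range(offset, len(dna) - 2, 3))


def identity_sim(x, y):
    """Toy similarity (assumption): +5 match, -4 mismatch."""
    return 5 if x == y else -4


def frameshift_local_score(protein, dna, sim=identity_sim, u=1, v=10, delta=10):
    """Best local score of protein against the three forward frames of dna."""
    frames = [translate(dna, f) for f in range(3)]
    n = max(len(t) for t in frames)
    frames = [t.ljust(n, "X") for t in frames]  # pad; X is assumed absent from protein
    m = len(protein)
    w1 = -(u + v)

    def zeros():
        return [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(3)]

    D, P, Q = zeros(), zeros(), zeros()
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for f in range(3):
                s = sim(protein[i - 1], frames[f][j - 1])
                # Same-frame Gotoh step.
                p = max(D[f][i - 1][j] + w1, P[f][i - 1][j] - u)
                q = max(D[f][i][j - 1] + w1, Q[f][i][j - 1] - u)
                d = max(D[f][i - 1][j - 1] + s, p, q, 0)
                # Cross-frame step: continue a path from another frame,
                # paying the frameshift penalty delta.
                for g in range(3):
                    if g == f:
                        continue
                    t1 = max(D[g][i - 1][j] + w1, P[g][i - 1][j] - u)
                    t2 = max(D[g][i][j - 1] + w1, Q[g][i][j - 1] - u)
                    t3 = max(D[g][i - 1][j - 1] + s, t1, t2)
                    if t3 - delta > d:
                        d, p, q = t3 - delta, t1 - delta, t2 - delta
                D[f][i][j], P[f][i][j], Q[f][i][j] = d, p, q
                best = max(best, d)
    return best
```

Setting `delta` to a very large value disables frame switching, so the same function also yields the best independent single-frame Smith-Waterman score for comparison.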

Identities: 61/234 (26%)

Query:  36 PYQVSLNSGYHFCGGSLINDQWVVSAAHAT-NPIQVRLGEHNINVLEGDEQFI NAA 90
Sbjct:  41 PFIAFLTTERTMCTGSLVSTRAVLTAGHCVCSPLPV-IRVSFLTLRNGDQQGIHHQPSGV 99
Frame:  41 111111111111111111111111111111111333333333333333333333333333 99

Query:  91 KIIK--HPNY S-SWTLNN-DIMLIKLSSPVKLNARVAPVAL--PSACAPAGLS 137
Sbjct: 100 KVAPGYMPSCMSARQRRPIAQTLSGFDIAIVMLAQMVNLQSGIRVISLPQPSDIPPPGTG 159
Frame: 100 333333333333333333333333333333333333333333333333333333333333 159

Query: 138 ALISGWG NTLSNGVNNPDLLQCVDAPVLSQADCEAAYPGEITSSMICVGFLEG-GKD 193
Sbjct: 160 VFIVGYGRDDNDRDPSRKNGGILKKGRATIM ECRHATNG NPICVKAGQNFGQL 212
Frame: 160 322222222222222222222222222222222222222222222222222222222222 212

Query: 194 SCQGDSGGPW--CNGQLQGIVSWGYGCALPDNPG--V-YTKVCNFVGWIQDTI 242
Sbjct: 213 PAPGDSGGPLLPSLQGPVLGWS--HGVTLPNLPDIIVEYASVARMLDFVRSNI 264
Frame: 213 222222221111111111111111111111111111111111111111111111 264

Fig. 4. The alignment of the rat mRNA containing three deletion errors with cercarial elastase using our algorithm.

$D_2$ and $D_3$ are calculated similarly. When the calculation is completed, segments of matching sequences in the three score matrices are threaded together into sequence alignments. Along with each alignment path we record the frame of the current match. In the traceback procedure, we start from the matrix cell of maximum value and trace the alignment back through the three frames' matrices. If there is true homology between the two sequences aligned, the three score matrices will show basically the same alignment with slight differences at the ends. Figure 4 shows the alignment of the same sequences as in Figure 2 using the new algorithm.
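The traceback through the three frames' matrices, with the frame recorded for each aligned column, can be sketched as below. To keep the example short this sketch makes simplifying assumptions that are not in the paper: a linear gap cost (`gap` per skipped position) replaces the affine function w(k), a toy identity score replaces BLOSUM 62, and the name `align_with_frames` and the three example translations are hypothetical.

```python
def align_with_frames(protein, frames, match=5, mismatch=-4, gap=-11, delta=10):
    """Local alignment over three equal-length frame translations.

    Returns (score, protein_row, translation_row, frame_row), where
    frame_row gives the frame (1-3) used in each alignment column.
    """
    m, n = len(protein), len(frames[0])
    D = [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(3)]
    back = {}                      # (f, i, j) -> predecessor cell or None
    best, end = 0, None
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for f in range(3):
                s = match if protein[i - 1] == frames[f][j - 1] else mismatch
                cands = [(0, None)]
                for g in range(3):
                    pen = 0 if g == f else delta   # frameshift penalty
                    cands.append((D[g][i - 1][j - 1] + s - pen, (g, i - 1, j - 1)))
                    cands.append((D[g][i - 1][j] + gap - pen, (g, i - 1, j)))
                    cands.append((D[g][i][j - 1] + gap - pen, (g, i, j - 1)))
                score, prev = max(cands, key=lambda c: c[0])
                D[f][i][j] = score
                back[(f, i, j)] = prev
                if score > best:
                    best, end = score, (f, i, j)
    # Trace back from the maximum cell until a zero-score cell is reached.
    columns = []
    cell = end
    while cell is not None and D[cell[0]][cell[1]][cell[2]] > 0:
        f, i, j = cell
        g, pi, pj = back[cell]
        top = protein[i - 1] if pi == i - 1 else "-"
        bottom = frames[f][j - 1] if pj == j - 1 else "-"
        columns.append((top, bottom, str(f + 1)))
        cell = (g, pi, pj)
    columns.reverse()
    rows = ["".join(col[k] for col in columns) for k in range(3)]
    return best, rows[0], rows[1], rows[2]


frames = ["MKTAYIAKSVRLAL*", "*KPRILRKASD*LCE", "ENRVYCEKRQISFVK"]
score, top, bottom, track = align_with_frames("MKTAYIAKQRQISFVK", frames)
print(score)
print(top)
print(bottom)
print(track)
```

In this toy case the frame row switches from frame 1 to frame 3 near the residue whose codon was damaged, in the same spirit as the Frame rows of Figure 4.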

Results and discussion

We evaluated the sensitivity and accuracy of our algorithm in two different ways. Firstly, using several different examples, a DNA sequence coding region was subjected to different error rates (errors were introduced randomly) and compared with a homologous protein sequence. As a control, the results were compared with those for randomly permuted sequences that share the same base composition as the DNA sequence, in a manner similar to a previous study by States and Botstein (1991). The goal was to evaluate the impact of different error rates on a known sequence alignment, and to determine how much error was tolerable before the homology would become insignificant relative to the noise. Secondly, we searched a DNA sequence with different rates of indel error against a protein database to see whether the sequences from the same protein family were recognized by the algorithm, and how both strong and weak homologies within the result list were affected by different rates of sequence error.

Binary comparison and alignment

This evaluation is similar to one performed by States and Botstein (1991). We selected the same two sequences, rat trypsin and human neutrophil, which share 33% sequence identity. We first tested the impact of substitution errors on the sequence alignment. For each error rate, 100 runs of the alignments were performed, each with a new set of errors randomly introduced in the coding region of the rat trypsin sequence. The distribution of the scores is plotted in Figure 5. As a control for the comparison, the DNA sequence was randomly permuted, translated and compared with the target protein sequence. The score distribution is also plotted on the same graph in Figure 5. The purpose was to examine the discrimination between the true alignment and the resulting random alignments.
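The two randomization steps of this test (introducing substitution errors at a given rate and building a permuted control with the same base composition) might look like the following Python sketch. The function names are hypothetical, and the error model (each base independently replaced by a different base with probability `rate`) is an assumption, since the text only states that errors were introduced randomly.

```python
import random


def add_substitutions(dna, rate, rng):
    """Replace each base, with probability rate, by a different random base."""
    out = []
    for base in dna:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def shuffled_control(dna, rng):
    """Random permutation of dna: same base composition, no homology."""
    bases = list(dna)
    rng.shuffle(bases)
    return "".join(bases)


rng = random.Random(0)                 # fixed seed so a run can be repeated
coding = "ATGAAAACCGCGTATATT"          # hypothetical coding fragment
runs = [add_substitutions(coding, 0.10, rng) for _ in range(100)]
controls = [shuffled_control(coding, rng) for _ in range(100)]
```

Each sequence in `runs` and `controls` would then be translated and aligned against the target protein to build the two score distributions.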

As can be seen in Figure 5, as more errors are introduced, the alignment and discrimination become weaker (as expected). However, the discrimination remains significant up to very high substitution rates. The performance is basically the same as standard Smith-Waterman in this case (base substitution but no indels). With 33% sequence identity between the two sequences, the algorithm can recognize the homology in the presence of up to 15% substitution errors, as opposed to ~5% for the States-Botstein algorithm.

[Figure 5: histograms of alignment scores (x-axis: score), one panel per substitution error rate.]

Fig. 5. Two sequences, rat trypsin and human neutrophil, which share 33% sequence identity, were compared with a range of substitution error rates (1-25%). For each error rate, 100 runs of the alignments, each with a new set of errors introduced in the sequences, were performed and the distribution of the alignment scores was plotted as the open bars. The solid bars represent the distribution of the alignment scores of the same DNA sequences, randomly permuted, translated and compared with the target protein sequence.

A similar evaluation was done to measure the effects of indel errors. Indel errors cause more damage to sequence alignment than substitution errors because indels produce frameshifts, while substitutions only reduce sequence identity or similarity. For the same test sequences, the algorithm adequately separated the actual homologies from noise in the presence of a 7% frameshift error rate (see Figure 6), compared to ~1% for the States-Botstein algorithm. As we calculate the three score matrices, the frameshift information is recorded and can be used during traceback to get the positions of the frameshifts which are included in the final alignment (see Figure 4).

[Figure 6: histograms of alignment scores (x-axis: score), one panel per indel error rate.]

Fig. 6. Discrimination for sequences with indel errors. Indel error rates (1-15%) were introduced and the distributions of alignment scores were plotted as in Figure 5.
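A possible way to introduce indel errors for this kind of test is sketched below in Python. The error model is an assumption made for illustration (single-base events, with insertions and deletions equally likely), and the names `add_indels` and `net_frame_offset` are hypothetical.

```python
import random


def add_indels(dna, rate, rng):
    """At each base, with probability rate, delete it or insert a base before it."""
    out = []
    for base in dna:
        if rng.random() < rate:
            if rng.random() < 0.5:
                continue                       # deletion: drop this base
            out.append(rng.choice("ACGT"))     # insertion: extra base first
        out.append(base)
    return "".join(out)


def net_frame_offset(original, corrupted):
    """Net shift of the reading frame at the 3' end, as a value in {0, 1, 2}."""
    return (len(corrupted) - len(original)) % 3


rng = random.Random(0)
coding = "ATGAAAACCGCGTATATTGCGAAACAGCGT"      # hypothetical coding fragment
corrupted = add_indels(coding, 0.07, rng)      # 7% indel rate, as in the text
print(len(coding), len(corrupted), net_frame_offset(coding, corrupted))
```

Unlike a substitution, every such event changes the reading frame of everything downstream of it, which is why a rate of a few percent is already severe for single-frame comparison.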

To simulate real applications where the background is very large, such as a sequence database, we repeated the above test with the entire SwissProt database as the control set. In Figure 7(a), the score of a perfect alignment between rat trypsin and human neutrophil was plotted along with the score distribution of the alignments of rat trypsin with all sequences in the SwissProt database where the sequences homologous to rat trypsin were removed. The purpose was to show how the perfect alignment between the two sequences is separated from the background noise. In Figure 7(b), 1% indel errors were added to rat trypsin and the resulting alignment score distribution was plotted as in Figure 6. The control set is the same set of sequences used in Figure 7(a). Because of the large number of the sequences in the SwissProt database, the distribution of the control set was scaled down when plotted (note that the tail of the distribution does not extend past 65 along the score axis). Figure 7 shows that although the 1% indels weakened the real alignments and increased the background noise somewhat, the real alignments remained clearly separated from the background.

[Figure 7: two histograms of alignment scores (x-axis: score), panels (a) and (b).]

Fig. 7. (a) The score of a perfect alignment between rat trypsin and human neutrophil was plotted (the open bar) along with the score distribution of the alignments of rat trypsin with all sequences in the SwissProt database where the sequences homologous to rat trypsin were removed. (b) Indel errors (1%) were added to rat trypsin and the resulting alignment score distribution was plotted as in Figure 6. The control set is the same set of sequences used in (a). Because of the large number of the sequences in the SwissProt database, the distribution of the control set was scaled down when plotted.

Comparison to protein family and the database

One way of evaluating a protein sequence comparison algorithm is to examine how well a member of a protein family hits the other members of the family, including those distantly related. As an example of this we use the Desulfovibrio vulgaris (Hildenborough) gene for cytochrome c3 (GenBank accession no. X04304) as the query sequence. We first used the standard Smith-Waterman algorithm to find the homologous sequences in the SwissProt protein database (version 31; Bairoch and Boeckmann, 1994). These matching sequences share sequence identity ranging from 88% to 30% with the query. The result is shown in Figure 8(a). The new algorithm was tested using several rates of indel in the query sequence. As is shown in Figure 8(c), the algorithm functioned well with a 5% frameshift rate, and still recognized the homologous sequences in the top 10 of the returned sequence list. In a comparison, we searched the sequence with a 5% frameshift rate in three frames independently using the standard Smith-Waterman algorithm. Only three of the top 10 homologous sequences remained in the top 10 list, and the rest of the top 10 homologous sequences were not even present in the top 100 sequences returned in all three cases (see Figure 9; only the top 15 of the returned sequences for each frame are listed due to space limitations).

a.
Rank  Score  Identity        Accession  Entry       Organism
1     712    129/129(100%)   P00131     CYC3_DESVH  DESULFOVIBRIO VULGARI
2     543    93/106(88%)     P00132     CYC3_DESVM  DESULFOVIBRIO VULGARI
3     266    54/109(50%)     P00133     CYC3_DESGI  DESULFOVIBRIO GIGAS.
4     160    37/101(37%)     P00135     CYC3_DESSA  DESULFOVIBRIO SALEXIG
5     151    40/113(35%)     P00134     CYC3_DESDE  DESULFOVIBRIO DESULFU
6     140    31/94(33%)      P38554     CY32_DESDN  DESULFOVIBRIO DESULFU
7     123    40/126(32%)     P00136     CY31_DESDN  DESULFOVIBRIO DESULFU
8     75     32/107(30%)     P24092     HMWC_DESVH  DESULFOVIBRIO VULGARI
9     59     19/59(32%)      P32707     NRFB_ECOLI  ESCHERICHIA COLI.
10    58     17/43(40%)      P24735     AMPC_PSEAE  PSEUDOMONAS AERUGINOS
11    58     22/80(28%)      P00137     CYC3_DESAC  DESULFUROMONAS ACETOX
12    55     14/37(38%)      P32707     NRFB_ECOLI  ESCHERICHIA COLI.
13    55     12/20(60%)      P11006     MAGA_XENLA  XENOPUS LAEVIS (AFRIC
14    53     12/24(50%)      P00120     C553_DESVM  DESULFOVIBRIO VULGARI
15    52     16/70(23%)      P23857     PSPE_ECOLI  ESCHERICHIA COLI.

b.
Rank  Score  Identity        Accession  Entry       Organism
1     659    127/129(98%)    P00131     CYC3_DESVH  DESULFOVIBRIO VULGARI
2     523    92/107(86%)     P00132     CYC3_DESVM  DESULFOVIBRIO VULGARI
3     251    54/108(50%)     P00133     CYC3_DESGI  DESULFOVIBRIO GIGAS.
4     153    25/66(38%)      P00134     CYC3_DESDE  DESULFOVIBRIO DESULFU
5     152    36/81(44%)      P00135     CYC3_DESSA  DESULFOVIBRIO SALEXIG
6     143    30/80(38%)      P38554     CY32_DESDN  DESULFOVIBRIO DESULFU
7     128    40/119(34%)     P00136     CY31_DESDN  DESULFOVIBRIO DESULFU
8     97     34/109(31%)     Q02817     MUC2_HUMAN  HOMO SAPIENS (HUMAN).
9     89     26/88(30%)      P13728     SGS3_DROYA  DROSOPHILA YAKUBA (FR
10    87     28/88(32%)      Q06885     GP10_DICDI  DICTYOSTELIUM DISCOID
11    86     32/99(32%)      P13730     SGS3_DROER  DROSOPHILA ERECTA (FR
12    85     27/89(30%)      P28968     VGLX_HSVEB  EQUINE HERPESVIRUS TY
13    84     23/79(29%)      P24092     HMWC_DESVH  DESULFOVIBRIO VULGARI
14    84     23/76(30%)      Q05049     MUC1_XENLA  XENOPUS LAEVIS (AFRIC
15    82     30/90(33%)      P28623     GUND_CLOCL  CLOSTRIDIUM CELLULOVO

c.
Rank  Score  Identity        Accession  Entry       Organism
1     414    111/125(89%)    P00131     CYC3_DESVH  DESULFOVIBRIO VULGARI
2     332    86/106(81%)     P00132     CYC3_DESVM  DESULFOVIBRIO VULGARI
3     154    49/106(46%)     P00133     CYC3_DESGI  DESULFOVIBRIO GIGAS.
4     99     34/88(39%)      P00134     CYC3_DESDE  DESULFOVIBRIO DESULFU
5     96     26/65(40%)      P38554     CY32_DESDN  DESULFOVIBRIO DESULFU
6     90     43/104(41%)     P00135     CYC3_DESSA  DESULFOVIBRIO SALEXIG
7     75     31/95(33%)      P00136     CY31_DESDN  DESULFOVIBRIO DESULFU
8     74     15/37(41%)      Q03391     NME4_MOUSE  MUS MUSCULUS (MOUSE).
9     69     27/75(36%)      P13730     SGS3_DROER  DROSOPHILA ERECTA (FR
10    68     21/69(30%)      P13728     SGS3_DROYA  DROSOPHILA YAKUBA (FR
11    67     25/106(24%)     P35820     PSC_DROME   DROSOPHILA MELANOGASTE
12    67     21/56(38%)      P28968     VGLX_HSVEB  EQUINE HERPESVIRUS TY
13    65     17/60(28%)      P37202     DIS3_SCHPO  SCHIZOSACCHAROMYCES P
14    65     10/29(34%)      P38387     SECD_MYCLE  MYCOBACTERIUM LEPRAE.
15    64     30/97(31%)      P07978     HSP2_MOUSE  MUS MUSCULUS (MOUSE).

Fig. 8. Searching the Desulfovibrio vulgaris (Hildenborough) gene for cytochrome c3 (GenBank accession no. X04304) in the SwissProt database with different indel error rates. Although the scores and the relative positions changed with different error rates, the same family members were recognized consistently in the top 10 of the returned sequence list. (a) Top 15 sequences returned by the standard Smith-Waterman algorithm. (b) Top 15 sequences returned by our algorithm when 1% indels were added. (c) Top 15 sequences returned by our algorithm when 5% indels were added.

The frameshift penalty value used in these tests was 10. With respect to the gap penalty function w(k) = -(k + 10), we have experimented with different frameshift penalty values, and 10 produced the best overall results.

a.
Rank  Score  Identity       Accession  Entry       Organism
1     182    37/55(67%)     P00131     CYC3_DESVH  DESULFOVIBRIO VULGARI
2     83     16/33(48%)     P00132     CYC3_DESVM  DESULFOVIBRIO VULGARI
3     72     15/33(45%)     P00133     CYC3_DESGI  DESULFOVIBRIO GIGAS.
4     55     12/23(52%)     Q02221     COXD_HUMAN  HOMO SAPIENS (HUMAN).
5     55     26/101(26%)    P11214     UROT_MOUSE  MUS MUSCULUS (MOUSE).
6     54     26/87(30%)     P30289     RNS3_STRAU  STREPTOMYCES AUREOFAC
7     52     17/41(41%)     P27350     AMY_STRTL   STREPTOMYCES THERMOVIO
8     52     25/101(25%)    P19637     UROT_RAT    RATTUS NORVEGICUS (RAT)
9     52     16/54(30%)     P21697     PLAS_SYNY3  SYNECHOCYSTIS SP. (ST
10    51     13/45(29%)     P05113     IL5_HUMAN   HOMO SAPIENS (HUMAN).
11    50     24/84(29%)     P14222     PERF_HUMAN  HOMO SAPIENS (HUMAN).
12    49     19/57(33%)     P36649     YACK_ECOLI  ESCHERICHIA COLI.
13    48     11/31(35%)     P32494     ADA3_YEAST  SACCHAROMYCES CEREVIS
14    48     22/71(31%)     P09252     DPOL_VZVD   VARICELLA-ZOSTER VIRUS
15    48     15/38(39%)     Q04591     LIMA_PSEAE  PSEUDOMONAS AERUGINOS

b.
Rank  Score  Identity       Accession  Entry       Organism
1     153    52/126(41%)    P00131     CYC3_DESVH  DESULFOVIBRIO VULGARI
2     135    38/101(38%)    P00132     CYC3_DESVM  DESULFOVIBRIO VULGARI
3     56     12/32(38%)     P20627     NIFB_ANASP  ANABAENA SP. (STRAIN
4     51     11/35(31%)     P27982     TAT_SIVAG   SIMIAN IMMUNODEFICIENC
5     51     20/64(31%)     P15306     TRBM_MOUSE  MUS MUSCULUS (MOUSE).
6     49     19/66(29%)     P29474     NOS3_HUMAN  HOMO SAPIENS (HUMAN).
7     48     22/75(29%)     P09019     HXB5_XENLA  XENOPUS LAEVIS (AFRIC
8     48     11/31(35%)     P15719     MDHC_MAIZE  ZEA MAYS (MAIZE).
9     48     25/82(30%)     P15407     FRA1_HUMAN  HOMO SAPIENS (HUMAN).
10    47     17/60(28%)     P35463     ETBR_PIG    SUS SCROFA (PIG).
11    47     10/29(34%)     P10186     UNG_HSV11   HERPES SIMPLEX VIRUS (
12    46     9/33(27%)      P23571     3MG_RAT     RATTUS NORVEGICUS (RAT).
13    46     30/122(25%)    P19438     TNR1_HUMAN  HOMO SAPIENS (HUMAN).
14    46     23/81(28%)     P10158     FRA1_RAT    RATTUS NORVEGICUS (RAT)
15    45     9/23(39%)      P13399     TA4_EIMTE   EIMERIA TENELLA.

c.
Rank  Score  Identity       Accession  Entry       Organism
1     69     16/32(50%)     P00131     CYC3_DESVH  DESULFOVIBRIO VULGARI
2     66     28/95(29%)     P00132     CYC3_DESVM  DESULFOVIBRIO VULGARI
3     65     21/71(30%)     P14309     PHI0_HOLTU  HOLOTHURIA TUBULOSA (
4     58     15/46(33%)     P10169     CON8_NEUCR  NEUROSPORA CRASSA.
5     57     26/121(21%)    Q05049     MUC1_XENLA  XENOPUS LAEVIS (AFRIC
6     52     25/120(21%)    P13730     SGS3_DROER  DROSOPHILA ERECTA (FR
7     50     7/18(39%)      P25054     APC_HUMAN   HOMO SAPIENS (HUMAN).
8     49     18/75(24%)     Q01238     HCC2_CRYCO  CRYPTHECODINIUM COHNI
9     49     13/47(28%)     P08287     H11L_CHICK  GALLUS GALLUS (CHICKE
10    49     19/70(27%)     P35463     ETBR_PIG    SUS SCROFA (PIG).
11    49     22/88(25%)     P35710     SOX5_MOUSE  MUS MUSCULUS (MOUSE).
12    48     29/117(25%)    P38020     KARP_CHLTR  CHLAMYDIA TRACHOMATIS
13    48     19/71(27%)     P24152     EXTN_SORVU  SORGHUM VULGARE (SORG
14    48     27/101(27%)    P18183     HRDB_STRCO  STREPTOMYCES COELICOL
15    47     13/42(31%)     P13595     NCA1_MOUSE  MUS MUSCULUS (MOUSE).

Fig. 9. Searching the Desulfovibrio vulgaris (Hildenborough) gene for cytochrome c3 (GenBank accession no. X04304) in three frames independently using the standard Smith-Waterman algorithm. Top 15 sequences returned for frame 1 (a), frame 2 (b) and frame 3 (c).

An issue that often accompanies methods based on the Smith-Waterman algorithm is their efficiency. The basic form of our algorithm is about seven times slower than Smith-Waterman, with the same computational complexity, O(nm). To speed up the algorithm, we made some small modifications without affecting its performance significantly. When calculating a score matrix for a given frame, instead of considering all three matrix cells $(i-1,j-1)$, $(i-1,j)$ and $(i,j-1)$ in the other frames' matrices, we only consider the matrix cell $(i-1,j-1)$ (see Figure 10). The reason is that if there is a good partial match around the cell $(i-1,j-1)$, it will be reflected in all three cells $(i-1,j-1)$, $(i-1,j)$ and $(i,j-1)$, with slight differences (an insertion, a deletion or a substitution). So the result of considering just cell $(i-1,j-1)$ should not differ significantly from considering all three cells (assuming the errors are distributed randomly). This reduces the running time by as much as 30% (to about one-fourth the speed of the Smith-Waterman algorithm). In Figure 11, the result of the modified algorithm is compared with that of the original algorithm.
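The modified cross-frame step can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the name `modified_frameshift_score` is hypothetical, a toy identity score replaces BLOSUM 62, the three translations are passed in directly and assumed to have equal length, and only $D$ (not $P$ or $Q$) is updated on a frame switch, which the text does not specify.

```python
def modified_frameshift_score(protein, frames, u=1, v=10, delta=10,
                              match=5, mismatch=-4):
    """Three-frame local score using only cell (i-1, j-1) of the other frames."""
    m, n = len(protein), len(frames[0])
    w1 = -(u + v)

    def zeros():
        return [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(3)]

    D, P, Q = zeros(), zeros(), zeros()
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for f in range(3):
                s = match if protein[i - 1] == frames[f][j - 1] else mismatch
                # Same-frame Gotoh step, unchanged.
                P[f][i][j] = max(D[f][i - 1][j] + w1, P[f][i - 1][j] - u)
                Q[f][i][j] = max(D[f][i][j - 1] + w1, Q[f][i][j - 1] - u)
                d = max(D[f][i - 1][j - 1] + s, P[f][i][j], Q[f][i][j], 0)
                # Cross-frame step: one diagonal cell per other frame
                # instead of three cells per other frame.
                for g in range(3):
                    if g != f:
                        d = max(d, D[g][i - 1][j - 1] + s - delta)
                D[f][i][j] = d
                best = max(best, d)
    return best


frames = ["MKTAYIAKSVRLAL*", "*KPRILRKASD*LCE", "ENRVYCEKRQISFVK"]
print(modified_frameshift_score("MKTAYIAKQRQISFVK", frames))
print(modified_frameshift_score("MKTAYIAKQRQISFVK", frames, delta=10**9))
```

The second call uses a prohibitively large frameshift penalty, so it reports the best score attainable in any single frame; the first call links the matching segments in two frames despite reading only the diagonal cell of the other frames.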

Our algorithm will be most useful in database searches when errors may be present in a query DNA sequence. If there are no errors, the new algorithm behaves like the standard Smith-Waterman algorithm. If frameshift errors are present in the query sequence, the new algorithm corrects them and retrieves the homologous sequences. Furthermore, as it corrects the frameshift errors, it lists the positions in the final alignment where the corrections were made, which provides clues as to where the real errors are (most of the time, the precise positions of the errors are reported).


[Figure 10: score matrices for frame 1 and frame 2]

Fig. 10. Schematic for the modified algorithm in which only one matrix cell (i - 1, j - 1) in other frames' matrices is considered in calculating matrix cell (i, j) for the present frame.

As an experiment, we used the algorithm to locate a known frameshift error in GenBank. We chose a sequence (P.adspersus Tpa2 gene, partial, accession no. X74339) in GenBank (release 89.0) that contains a frameshift error. Its protein translation was not in the SwissProt database (version 31; Bairoch and Boeckmann, 1994), so it served as a good test case of an unknown query sequence for a SwissProt search. The top two sequences returned in the search using our algorithm are shown in Figure 12. Both alignments show a frameshift at position 36 in the protein translation (or position 111 in the DNA sequence), the actual frameshift site.
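As an illustration of how a frameshift site can be read off such an alignment, the following Python sketch scans the Frame track of the first hit in Figure 12 and counts the query residues that precede each change of frame. The function names are hypothetical. The mapping to a DNA coordinate assumes that frame f begins at base f of the query and that residue 1 of the alignment is the first codon of its frame; this convention is an assumption of the sketch, not something stated in the text.

```python
def frameshift_sites(query, frame_track, query_start=1):
    """Return (residues_before, old_frame, new_frame) for each frame change.

    query       : aligned query row, '-' marks a gap
    frame_track : one frame digit per alignment column
    """
    sites = []
    consumed = query_start - 1  # query residues read so far
    for k, (residue, frame) in enumerate(zip(query, frame_track)):
        if k > 0 and frame != frame_track[k - 1]:
            sites.append((consumed, int(frame_track[k - 1]), int(frame)))
        if residue != "-":
            consumed += 1
    return sites


def first_codon_start(residues_before, new_frame):
    """1-based DNA position of the first codon read in the new frame.

    Assumes frame f starts at base f and codons are numbered from the
    start of the query (an illustrative convention).
    """
    return 3 * residues_before + new_frame


# First block of Hit 1 in Figure 12 (query row and frame track).
query = "ELTEEIYMEQPPRFKDEQRPDLVCKRHRSIYAKKQA-RAWNIKINEVFTQQDFQRSKADP"
frame = "2" * 36 + "3" * 24

for residues_before, old, new in frameshift_sites(query, frame):
    print(residues_before, old, new, first_codon_start(residues_before, new))
# prints: 36 2 3 111
```

Under these assumptions the sketch reports a change from frame 2 to frame 3 after residue 36, with the first frame-3 codon starting at base 111, in agreement with the positions quoted above.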

Another use of the new algorithm is when a heuristic search method is used for fast identification of homology, and a query sequence hits several segments of a sequence in different frames. The new algorithm can be used to show the overall alignment of the two sequences and report potential frameshift errors.

Conclusion

This paper presented an algorithm for detecting and correcting frameshift errors that occur in protein-encoding regions and can corrupt the recognition of important homologies. The algorithm appears to be robust, and can recognize weakly related sequences in the presence of ~7% frameshift error and provide an optimal alignment. The algorithm provides a capability which can make single-pass or low-redundancy sequence data more informative, thereby reducing the necessity for high-redundancy sequencing for gene and protein characterization purposes. This work complements earlier work designed to recognize frameshifts in coding regions using pattern recognition (Xu et al., 1995), and both methods combined provide a powerful technology which can improve the efficiency and reduce the costs of genomic and cDNA sequencing.

Fig. 11. This figure shows that the simplification of the algorithm had no significant effect on the performance of the algorithm. We repeated the test described in Figure 5. (a) We added 3% substitution errors to the rat trypsin sequence and then compared it to the human neutrophil sequence using the original and modified algorithms. The result of the original algorithm is shown in (a1) and the result of the modified algorithm in (a2). (b) The same test was repeated with 3% indel errors added to the rat trypsin sequence. The result of the original algorithm is shown in (b1) and the result of the modified algorithm in (b2).
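The kind of error injection used in the Figure 11 test (3% substitution errors or 3% indel errors added to a DNA sequence) can be sketched in Python as follows. The exact error model is not reproduced here: independent per-base errors, equal probability of insertion and deletion, and the function names are assumptions of this sketch.

```python
import random


def add_substitutions(dna, rate, rng):
    """Replace each base by a different base with probability `rate`."""
    out = []
    for base in dna:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def add_indels(dna, rate, rng):
    """With probability `rate` per base, delete it or insert a base before it.

    Insertions and deletions are assumed to be equally likely.
    """
    out = []
    for base in dna:
        if rng.random() < rate:
            if rng.random() < 0.5:
                continue                    # deletion: drop this base
            out.append(rng.choice("ACGT"))  # insertion: extra base first
        out.append(base)
    return "".join(out)


rng = random.Random(0)   # fixed seed so a run can be repeated
seq = "ATGGCCAAGTTC" * 20
with_subs = add_substitutions(seq, 0.03, rng)    # same length as seq
with_indels = add_indels(seq, 0.03, rng)         # length may differ from seq
```

Substitutions leave the reading frame intact, whereas each indel shifts the frame downstream of it, which is why the two error types are tested separately.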


Hit 1 >P10978 POLX_TOBAC NICOTIANA TABACUM (CO
score: 122   Identities: 27/68 (40%)
Query:   1 ELTEEIYMEQPPRFKDEQRPDLVCKRHRSIYAKKQA-RAWNIKINEVFTQQDFQRSKADP 59
Sbjct: 931 DLEEEIYMEQPEGFEVAGKKHMVCKLNKSLYGLKQAPRQWYMKFDSFMKSQTYLKTYSDP 990
Frame: 931 222222222222222222222222222222222222333333333333333333333333 990
Query:  60 CLYTKKLA 67
Sbjct: 991 CVYFKRFS 998
Frame: 991 33333333 998

Hit 2 >P25600 YCH4_YEAST SACCHAROMYCES CEREVIS
score: 85   Identities: 25/73 (34%)
Query:  2 LTEEIYMEQPPRFKDEQRPDLVCKRHRSIYAKKQAE-AWNIKINEVFTQQDFQRSKADPC 60
Sbjct: 12 MDEPIYVKQPPGFVNERNPDYWELYGGMYGLKQAPLLWNEHINNTLKKIGFCRHEGEHG 71
Frame: 12 222222222222222222222222222222222222333333333333333333333333 71
Query: 61 LYTKKLARRWIYM 73
Sbjct: 72 LYFRSTSDGPIYI 84
Frame: 72 3333333333333 84

Fig. 12. We selected a sequence (P.adspersus Tpa2 gene, partial, accession no. X74339) from GenBank (release 89) that contains a frameshift error and searched it against the SwissProt protein database. The top two homologous sequences returned from the search both suggested a frameshift in the sequence, around the position where the actual frameshift occurs.

Acknowledgements

We would like to thank the manuscript reviewers for their important suggestions. This research was supported by the Office of Health and Environmental Research, US Department of Energy, under contract no. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.

References

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.

Bairoch,A. and Boeckmann,B. (1994) The SWISSPROT protein sequence data bank: current status. Nucleic Acids Res., 22, 3578-3580.

Gotoh,O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705-708.

Henikoff,S. and Henikoff,J. (1993) Performance evaluation of amino acid substitution matrices. Proteins, 17, 49-61.

Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. of Sci. USA, 85, 2444-2448.

Posfai,J. and Roberts,R.J. (1992) Finding errors in DNA sequences. Proc. Natl. Acad. of Sci. USA, 89, 4698-4702.

Smith,T.F. and Waterman,M. (1981) Comparison of biosequences. Adv. Appl. Math., 2, 482-489.

States,D.J. and Botstein,D. (1991) Molecular sequence accuracy and the analysis of protein coding regions. Proc. Natl. Acad. of Sci. USA, 88, 5518-5522.

States,D.J., Gish,W. and Altschul,S.F. (1991) Improved sensitivity of nucleic acid database searches using application-specific matrices. Methods, 3, 66-70.

Xu,Y., Mural,R.J. and Uberbacher,E.C. (1995) Correcting sequencing errors in DNA coding regions using a dynamic programming approach. Comput. Applic. Biosci., 11, 117-124.

Received on June 30, 1995; accepted on October 16, 1995