
region of the search text.

At each step, the search advances by the number of characters in the search string (in this case, three) unless a character in the stretch of search text spanned by the search string also occurs in the search string. For example, in line 2 of Figure 4-11, there is no space in the search string, and so the search advances three characters. In line 3, however, the "n" character in "nth" is also in the search string, but the n's don't line up. Skipping continues in this way until line 6, where the "rot" in "protein" contains the "r" character in the search string. The first characters are compared and match, so the second characters are compared. This comparison fails, and the search is again advanced. In line 7, there is again a possible match, and the search string is advanced one character to verify a match, as in line 8. However, the space at the end of "protein" doesn't match the "A" in "RNA", and the search position is shifted three places to the right, as in line 10.

This process continues, stopping to check for matches whenever the next three characters contain an "r", "n", or "a". That is, if the algorithm is examining a character that doesn't occur in the search string at all, the algorithm moves ahead by the length of the entire search string. If a character does appear in the search string, the algorithm advances the search marker by the distance between that character and the right end of the string. By step 18, the search is complete, having covered a total of 44 characters (including spaces). The number of comparisons per character is approximately 0.4, which makes this an efficient search algorithm.
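The shift rule described above can be sketched in Python. This is a minimal, Horspool-style illustration rather than the exact procedure traced in Figure 4-11. The function name `skip_search`, the sample text, and the choice to take the shift from the text character aligned with the right end of the search string are all assumptions made for the example.

```python
def skip_search(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0

    # For every pattern character except the last, record the distance from
    # its rightmost occurrence to the right end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    comparisons = 0
    pos = 0
    while pos <= n - m:
        # Compare the pattern with the current window, left to right.
        j = 0
        while j < m:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return pos, comparisons

        # Characters absent from the table shift the window by the full
        # pattern length; others shift by their distance to the right end.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons


# Hypothetical sample text, lowercased so that matching ignores case.
index, comparisons = skip_search("a protein is not rna", "rna")
print(index, comparisons)
```

Because most windows are rejected after a single comparison and then skipped, the number of comparisons stays well below the number of characters in the text, which is the source of the efficiency noted above.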

A second search heuristic makes use of repeating patterns in the search string, such as "hand to hand, door to door", and attempts to match the first repeating word, according to the algorithm described for skipping characters. When a match is made, the search string is advanced to the point that the first occurrence of the repeated word is aligned with the first occurrence of the matching term in the main text. A comparison is then made for the second occurrence of the search term in the text. The major gain in computational efficiency and performance comes from the first heuristic; the advantage of the second heuristic depends on the appearance of repeating words in the search pattern.

In addition to running time, another major metric for characterizing search algorithms is the need for backtracking. Some search algorithms are linear, working efficiently from the beginning of the sequence to be searched to the end, whereas others move back and forth in the text to be searched during processing. For example, the skipping search algorithm moves the index ahead by the number of characters in the search string but then backs up to compare characters if a possible match exists.

Approximate Searches

Algorithms that efficiently locate exact matches have many applications in bioinformatics, including searching for data in PubMed or some other bibliographic reference database by a specific disease or author, searching a clinical database by a specific disease or patient identification number, or searching any database where data are indexed by a known, controlled vocabulary. However, search algorithms that look for approximate matches are more useful in one of the most computationally challenging tasks in bioinformatics: searching sequence databases for homologies of particular sequences. Approximate match algorithms range from the use of templates, to the use of a distance function, to the use of how words sound when spoken.

String search algorithms based on templates use metacharacters to specify the range of permissible strings that must be matched exactly. For example, the UNIX utility "grep" (global regular expression print) uses metacharacters such as "*", "\", "$", "+", and "^" to perform a brute-force search. As such, applications such as grep don't do true approximate searches. Similarly, the Find function in the Windows operating system allows a search string such as "*research.doc" to locate Microsoft Word documents such as "MyResearch.doc", "DNAResearch.doc", and "ProteinResearch.doc". However, the search wouldn't locate documents such as "MyReserch.doc", "DNAResaerch.doc", or "ProteinRsch.doc", because of missing or transposed characters in the file names compared to the search string.

True approximate search algorithms permit the transposition of adjacent characters and the substitution of characters, and they assign different weights to different types of errors. An approximate match algorithm for nucleotide sequences should be able to locate nucleotide sequences despite the presence of a single nucleotide polymorphism, for example. A search with an approximate match algorithm for a nucleotide sequence that contains the string "AAGGTTAA" should be able to locate the sequence "ATGGTTAA", where the second "A" in the first string is replaced by a "T" in the second string.
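One common way to realize such a distance function is a weighted edit distance that also counts the swap of two adjacent characters as a single error. The sketch below is an illustration under stated assumptions, not a method prescribed by the text: the function name `weighted_distance`, the default weights, and the inclusion of insertions and deletions are all choices made for the example.

```python
def weighted_distance(a, b, sub=1.0, indel=1.0, swap=1.0):
    """Edit distance with separate weights for substitution, insertion or
    deletion, and transposition of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel
    for j in range(1, cols):
        d[0][j] = j * indel

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(
                d[i - 1][j] + indel,      # delete a character from a
                d[i][j - 1] + indel,      # insert a character into a
                d[i - 1][j - 1] + cost,   # match or substitute
            )
            # Two adjacent characters appear in swapped order.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap)
    return d[-1][-1]


print(weighted_distance("AAGGTTAA", "ATGGTTAA"))        # one substitution
print(weighted_distance("research", "resaerch"))        # one transposition
print(weighted_distance("research", "reserch"))         # one deletion
```

A search built on this function would accept any candidate whose distance from the query falls below a chosen threshold. Lowering `swap` relative to `sub`, for instance, treats transposed characters as a milder error than substituted ones.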

Phonetic comparison algorithms, typified by Soundex and Metaphone, are examples of true approximate search algorithms that have application in bioinformatics. For example, they can be used to search bibliographic databases by author name when the exact spelling of the author's name may be unknown, or search a taxonomy database by phonetically spelling a species name. The Soundex approximate search algorithm addresses the problem of uncertain spelling by indexing and searching databases by an encoded string. These encoded strings are created by dropping vowels and silent consonants and assigning one of six values to the remaining consonants (see Table 4-3).
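The encoding step can be sketched as follows. This example assumes the widely used American Soundex variant (first letter kept, six consonant classes, code padded or truncated to four characters); the codes in Table 4-3 may differ in detail, and the function name `soundex` and the sample names are illustrative only.

```python
# Six consonant classes of the common American Soundex variant (assumed here).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Encode a name as its first letter followed by three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""

    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")          # vowels, H, W, and Y have no code
        if code and code != previous:
            encoded += code
        if ch not in "HW":                # H and W do not separate equal codes
            previous = code
    return (encoded + "000")[:4]


for author in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
    print(author, soundex(author))
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, so a database indexed by these codes can be searched without knowing the exact spelling.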

Table 4-3. Soundex Codes. Vowels and silent consonants are dropped from