A New Algorithm for Locating Tandem Repeats in a DNA Sequence

3. Proposed Algorithm for Locating Tandem Repeats

In this paper, an algorithm is developed for locating repetitive regions within DNA sequences. Identifying and locating repeats will help biologists learn more about repetitive regions. In addition, computational analysis of DNA sequences becomes increasingly complex when repetitive DNA occurs in the sequences under analysis. Thus, identification of repetitive DNA is a first step towards enabling biologists to understand DNA sequences and computational biologists to solve more complex analysis problems involving DNA sequences [7].

Tandem repeats are one type of repetitive DNA. A tandem repeat is a string of characters that recurs consecutively within a larger string. In biological terms, it is a concatenation of basic units within a DNA sequence, where both the DNA sequence and the basic unit are composed of the bases A, C, G and T.

A tandem repeat can occur at any position in the DNA sequence. It can start with any base of the basic unit, but it must then continue with the next base of the basic unit. For example, the tandem repeat CAGGCAGGCAGGCAG has a basic unit of GGCA. The region begins with C and continues with A, the next base in the basic unit. (Note: the base after A is G, the first base of the basic unit. This is what is meant by a concatenation of basic units.)
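The definition above can be illustrated with a minimal sketch that tests whether a region is a concatenation of a given basic unit, allowing the region to start at any base of the unit. The function name `is_tandem_repeat` is hypothetical and is not taken from the paper; the sketch also does not enforce a minimum number of copies.

```python
def is_tandem_repeat(region, unit):
    """Return True if `region` reads as consecutive copies of `unit`,
    starting at any base of the unit (hypothetical helper)."""
    m = len(unit)
    if m == 0 or not region:
        return False
    # Try every possible starting base (rotation) of the basic unit.
    for offset in range(m):
        if all(region[i] == unit[(offset + i) % m] for i in range(len(region))):
            return True
    return False


# The example from the text: the region starts at C, the third base of GGCA.
print(is_tandem_repeat("CAGGCAGGCAGGCAG", "GGCA"))
```

The modulo index is what expresses "the next base in the basic unit": after the last base of the unit, the comparison wraps around to its first base.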

3.1 Description of the Proposed Algorithm

The proposed algorithm requires the following steps to be implemented:

1. Select the tandem P, with length m, to be found in a DNA sequence.
2. Select the master DNA sequence F, with length n.
3. Find the locations of P(1) in F.
4. Find the locations of P(m) in F.
5. For each pair of locations of P(1) and P(m), calculate the distance between the two locations.
6. Ignore all distances that are less than m or greater than m.
7. Compare each subsequence of F (with distance = m) with the selected sequence P; if they are equal, save the position of P(1) as the starting index of the matched sequence.
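The steps above can be sketched in Python as follows. This is an illustrative reading of the steps, not the authors' implementation: the name `locate_pattern` is hypothetical, positions are reported 1-based, and the distance between P(1) and P(m) is assumed to be counted inclusively, so that the two ends of a single occurrence lie at distance m. Rather than enumerating every pair of locations, the sketch looks up directly whether a P(m) location exists at distance m from each P(1) location, which has the same effect as discarding all other distances.

```python
def locate_pattern(P, F):
    """Return the 1-based starting positions of P in F (hypothetical sketch)."""
    m = len(P)
    if m == 0:
        return []
    # Steps 3 and 4: locations of the first and last characters of P in F.
    first_locations = [i for i, base in enumerate(F) if base == P[0]]
    last_locations = {i for i, base in enumerate(F) if base == P[-1]}
    matches = []
    for start in first_locations:
        end = start + m - 1
        # Steps 5 and 6: keep only the pair whose inclusive distance equals m.
        if end not in last_locations:
            continue
        # Step 7: compare the candidate subsequence with P.
        if F[start:end + 1] == P:
            matches.append(start + 1)  # convert to a 1-based position
    return matches


# First 50 residues of the paper's "sample sequence one".
prefix = "ttaaggaccccatgccctcgaataggcttgagcttgccaattaacgcgca"
print(locate_pattern("ttaa", prefix))
```

Filtering on both end characters before the full comparison means that most candidate positions are rejected after a single lookup.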

After running this algorithm, we obtain the starting position of every occurrence of P in F. This information can be used to find similarities and to perform further analysis of the DNA sequence.

3.2 Algorithm Implementation

The proposed algorithm uses dynamic programming to locate all tandem repeats whose basic unit is specified in the input file. The output can be used in conjunction with a graphing program to identify plateaus, which represent tandem repeat regions. The dynamic programming used here is a two-pass process: it combines traditional dynamic programming with a second pass that wraps scores within a row. It is this second pass that is critical for identifying tandem repeats.
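The paper does not give the recurrence or the scoring scheme, so the following is only a generic sketch of two-pass (wraparound) dynamic programming under assumed values. The function name `wraparound_row_scores` and the scores `match=1`, `mismatch=-1` and `gap=-2` are assumptions made for illustration. Each row corresponds to one base of the sequence and each column to one base of the basic unit; the per-row maximum returned here is one possible quantity to plot, and the source does not specify which scores it graphs.

```python
def wraparound_row_scores(text, unit, match=1, mismatch=-1, gap=-2):
    """Return the best local score in each row of a wraparound DP table.
    All scoring parameters are assumed values, not taken from the paper."""
    m = len(unit)
    if m == 0:
        raise ValueError("basic unit must not be empty")
    prev = [0] * m
    row_max = []
    for base in text:
        cur = [0] * m
        # First pass: ordinary local-alignment recurrence. For column 0 the
        # diagonal predecessor wraps to the last column of the previous row
        # (prev[-1]), which lets one copy of the unit follow another.
        for j in range(m):
            diagonal = prev[j - 1] + (match if base == unit[j] else mismatch)
            score = max(0, diagonal, prev[j] + gap)
            if j > 0:
                score = max(score, cur[j - 1] + gap)
            cur[j] = score
        # Second pass: wrap the score of the last column around to the first
        # column of the same row, then propagate it along the row.
        cur[0] = max(cur[0], cur[m - 1] + gap)
        for j in range(1, m):
            cur[j] = max(cur[j], cur[j - 1] + gap)
        row_max.append(max(cur))
        prev = cur
    return row_max


print(wraparound_row_scores("CAGGCAGGCAGGCAG", "GGCA"))
```

With these assumed scores, an exact repeat of the unit gains one point per base, while a sequence sharing no base with the unit stays at zero.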

The following listings show some results of detecting a tandem in DNA sequences. Note that the algorithm requires no modification to change the direction of matching (a direct match from beginning to end, or a reverse match from end to beginning).

sample sequence one

ttaaggaccccatgccctcgaataggcttgagcttgccaattaacgcgcacgggctggccgggcgtataagccaaggtgtagtgaggttgcattatacatgccggcttgtgattaacgcatgccataggacggttaggctcagaacccgcaaccaatacacgtgattttctcgtcccctg

Results for 180 residue sequence "sample sequence one" starting "ttaaggaccc"

>match number 1 to "ttaa" ends at position 4 on the direct strand
ttaa

>match number 2 to "ttaa" ends at position 44 on the direct strand
ttaa

>match number 3 to "ttaa" ends at position 116 on the direct strand
ttaa

>match number 4 to "ttaa" ends at position 68 on the reverse strand
ttaa

>match number 5 to "ttaa" ends at position 140 on the reverse strand
ttaa

>match number 6 to "ttaa" ends at position 180 on the reverse strand
ttaa

sample sequence two

aggcgtatgcgatcctgaccatgcaaaactccagcgtaaatacctagccatggcgacacaaggcgcaagacaggagatgacggcgtttagatcggcgaaatattaaagcaaacgacgatgacttcttcgggaaattagttccctactcgtgtactccaattagccataacactgttcgtcaagatatagggggtcacccatgaatgtcctctaaccagaccatttcgttacacgaacgtatct

Results for 243 residue sequence "sample sequence two" starting "aggcgtatgc"

>match number 1 to "gat" ends at position 13 on the direct strand
gat

>match number 2 to "gat" ends at position 78 on the direct strand
gat

>match number 3 to "gat" ends at position 92 on the direct strand
gat

>match number 4 to "gat" ends at position 119 on the direct strand
gat

>match number 5 to "gat" ends at position 185 on the direct strand
gat

>match number 6 to "gat" ends at position 4 on the reverse strand
gat

>match number 7 to "gat" ends at position 153 on the reverse strand
gat

>match number 8 to "gat" ends at position 232 on the reverse strand
gat

sample sequence three

tactcagggctccagaggtacaagttggtaatcggttaggtgtatcgccgccaggggtgcgtcgtcatgactcggttaga

Results for 80 residue sequence "sample sequence three" starting "tactcagggc"

>match number 1 to "tcat" ends at position 68 on the direct strand
tcat

>match number 2 to "tcat" ends at position 14 on the reverse strand
tcat
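The listings above report matches on both strands. The sketch below reproduces the positions reported for sample sequence three under two assumptions that the paper does not state explicitly: that "reverse strand" means the reverse complement of the input, and that end positions are 1-based and counted along the strand being searched. The function names are hypothetical, and Python's built-in `str.find` stands in for the matching step, so this is not the proposed algorithm itself.

```python
# Translation table for complementing DNA bases (lower and upper case).
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def end_positions(pattern, seq):
    """Return the 1-based end position of every occurrence of pattern in seq."""
    ends = []
    start = seq.find(pattern)
    while start != -1:
        ends.append(start + len(pattern))
        start = seq.find(pattern, start + 1)  # allow overlapping matches
    return ends


def search_both_strands(pattern, seq):
    """List (end position, strand) pairs: direct strand first, then reverse."""
    hits = [(end, "direct") for end in end_positions(pattern, seq)]
    reverse = reverse_complement(seq)
    hits += [(end, "reverse") for end in end_positions(pattern, reverse)]
    return hits


sample_three = ("tactcagggctccagaggtacaagttggtaatcggttaggtgtatcgccg"
                "ccaggggtgcgtcgtcatgactcggttaga")
for end, strand in search_both_strands("tcat", sample_three):
    print(f'"tcat" ends at position {end} on the {strand} strand')
```

Searching the reverse complement with the unchanged matching routine is one way to read the remark that the algorithm needs no modification to match in either direction.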

3.3 Experimental Results

The proposed algorithm was run on DNA sequences of different sizes and compared with the TBM algorithm, which was executed on the same files.

Tables (1-6) show the results obtained: the number of comparisons and the CPB (comparisons per byte), which equals the number of comparisons divided by the DNA sequence size.

Table 1.1: DNA sequence size=103337