
A New Algorithm for Locating Tandem Repeats in a DNA Sequence

3. Proposed Algorithm for Locating Tandem Repeats

In this paper, an algorithm is developed for locating repetitive regions within DNA sequences. Identifying and locating repeats will help biologists learn more about repetitive regions. In addition, computational analysis of DNA sequences becomes increasingly complex when repetitive DNA occurs in the sequences under analysis. Thus, identification of repetitive DNA is a first step towards enabling biologists to understand DNA sequences and computational biologists to solve more complex analysis problems involving DNA sequences [7].

Tandem repeats are one type of repetitive DNA. A tandem repeat is a string of characters that recurs consecutively within a larger string. In biological terms, it is a concatenation of basic units within a DNA sequence, where both the DNA sequence and the basic unit are composed of the bases A, C, G and T.

A tandem repeat can occur at any position in the DNA sequence. It can start with any base of the basic unit, but it must then continue with the next base of the basic unit. For example, the tandem repeat CAGGCAGGCAGGCAG has a basic unit of GGCA. The region begins with C and continues with A, the next base in the basic unit. (Note: the subsequent base, G, is the first base of the basic unit. This is what is meant by a concatenation of basic units.)
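As a minimal sketch of this definition, the following Python checks whether a region is a concatenation of a basic unit that may start at any base of the unit. The function name `unit_offset` is hypothetical and is not taken from the paper.

```python
def unit_offset(region, unit):
    """Return the index in `unit` at which `region` starts, or None.

    The region qualifies when every base equals the next base of the
    basic unit, wrapping from the last base of the unit back to the first.
    """
    m = len(unit)
    for offset in range(m):
        if all(base == unit[(offset + i) % m] for i, base in enumerate(region)):
            return offset
    return None


# The example from the text: the region starts at the third base (index 2) of GGCA.
print(unit_offset("CAGGCAGGCAGGCAG", "GGCA"))  # 2
# A single substituted base breaks the repeat.
print(unit_offset("CAGGCATG", "GGCA"))         # None
```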

3.1 Description of the Proposed Algorithm

The proposed algorithm consists of the following steps:

1. Select the tandem P, of length m, to be found in a DNA sequence.

2. Select the master DNA sequence F, of length n.

3. Find the locations of P(1), the first base of P, in F.

4. Find the locations of P(m), the last base of P, in F.

5. For each pair of locations of P(1) and P(m), calculate the distance between the two locations.

6. Discard every distance that is less than m or greater than m.

7. Compare each remaining subsequence of F (with distance = m) with the selected sequence P. If they are equal, save the position of P(1) as the starting index of the matched sequence.
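The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the name `locate_tandem` is hypothetical, indices are 0-based, and the "distance" of steps 5 and 6 is read as the inclusive span from P(1) to P(m), so that a span of exactly m bases is kept.

```python
def locate_tandem(F, P):
    """Return the 0-based start index of every occurrence of P in F."""
    m = len(P)
    if m == 0 or m > len(F):
        return []
    # Steps 3 and 4: locations of P(1) and P(m) in F.
    first_locs = [i for i, base in enumerate(F) if base == P[0]]
    last_locs = {j for j, base in enumerate(F) if base == P[-1]}
    starts = []
    for i in first_locs:
        # Steps 5 and 6: keep only the pair whose inclusive span is exactly m
        # (assumed reading of "distance = m").
        j = i + m - 1
        if j not in last_locs:
            continue
        # Step 7: compare the candidate subsequence with P.
        if F[i:j + 1] == P:
            starts.append(i)
    return starts


# First 44 bases of "sample sequence one" from this paper.
prefix = "ttaaggaccccatgccctcgaataggcttgagcttgccaattaa"
starts = locate_tandem(prefix, "ttaa")
print(starts)                              # [0, 40]
print([s + len("ttaa") for s in starts])   # 1-based end positions: [4, 44]
```

Pairing each location of P(1) with the single position m - 1 bases later avoids enumerating every pair of locations; only candidates whose last base already matches reach the full comparison in step 7.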

After running this algorithm, we obtain the starting index of every match, which can be used to assess similarity and to perform further analysis of the DNA sequence.

3.2 Algorithm Implementation

The proposed algorithm uses dynamic programming to locate all tandem repeats that have the basic unit specified in the input file. The output can be used in conjunction with a graphing program to identify plateaus, which represent tandem repeat regions. The dynamic programming used here is a two-pass process: it combines traditional dynamic programming with a second pass that wraps scores within a row. It is the second pass that is critical for identifying tandem repeats.
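One common reading of such a two-pass scheme is wraparound dynamic programming, sketched below. The paper does not give the recurrence, so everything here is an assumption made for illustration: the function name `wraparound_row_scores`, the local-alignment form of the recurrence, and the scoring values (`match`, `mismatch`, `gap`).

```python
def wraparound_row_scores(text, unit, match=1, mismatch=-1, gap=-2):
    """Return the best wraparound-DP score in each row (one row per base of `text`)."""
    m = len(unit)
    if m == 0:
        raise ValueError("unit must not be empty")
    prev = [0] * m
    best_per_row = []
    for base in text:
        cur = [0] * m
        # First pass: ordinary local-alignment recurrence. For column 0 the
        # diagonal predecessor is the last column of the previous row
        # (prev[-1]), which is what lets the alignment wrap around the unit.
        for j in range(m):
            sub = match if base == unit[j] else mismatch
            candidates = [0, prev[j - 1] + sub, prev[j] + gap]
            if j > 0:
                candidates.append(cur[j - 1] + gap)
            cur[j] = max(candidates)
        # Second pass: wrap the score at the end of the row into column 0,
        # then propagate any improvement along the row.
        cur[0] = max(cur[0], cur[m - 1] + gap)
        for j in range(1, m):
            cur[j] = max(cur[j], cur[j - 1] + gap)
        best_per_row.append(max(cur))
        prev = cur
    return best_per_row


# Along an exact repeat the best score grows by `match` with every base.
print(wraparound_row_scores("gatgatgat", "gat"))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With these assumed scores, a region that does not contain the unit keeps a score of 0, so plotting the per-row scores separates repeat regions from the rest of the sequence.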

The following shows some results of detecting tandems in DNA sequences. Note that the algorithm needs no modification to change the matching direction (a direct match from beginning to end, or a reverse match from end to beginning).

sample sequence one

ttaaggaccccatgccctcgaataggcttgagcttgccaattaacgcgcacgggctggccgggcgtataagccaaggtgtagtgaggttgcattatacatgccggcttgtgattaacgcatgccataggacggttaggctcagaacccgcaaccaatacacgtgattttctcgtcccctg

Results for 180 residue sequence "sample sequence one" starting "ttaaggaccc"

>match number 1 to "ttaa" ends at position 4 on the direct strand
ttaa

>match number 2 to "ttaa" ends at position 44 on the direct strand
ttaa

>match number 3 to "ttaa" ends at position 116 on the direct strand
ttaa

>match number 4 to "ttaa" ends at position 68 on the reverse strand
ttaa

>match number 5 to "ttaa" ends at position 140 on the reverse strand
ttaa

>match number 6 to "ttaa" ends at position 180 on the reverse strand
ttaa

sample sequence two

aggcgtatgcgatcctgaccatgcaaaactccagcgtaaatacctagccatggcgacacaaggcgcaagacaggagatgacggcgtttagatcggcgaaatattaaagcaaacgacgatgacttcttcgggaaattagttccctactcgtgtactccaattagccataacactgttcgtcaagatatagggggtcacccatgaatgtcctctaaccagaccatttcgttacacgaacgtatct

Results for 243 residue sequence "sample sequence two" starting "aggcgtatgc"

>match number 1 to "gat" ends at position 13 on the direct strand
gat

>match number 2 to "gat" ends at position 78 on the direct strand
gat

>match number 3 to "gat" ends at position 92 on the direct strand
gat

>match number 4 to "gat" ends at position 119 on the direct strand
gat

>match number 5 to "gat" ends at position 185 on the direct strand
gat

>match number 6 to "gat" ends at position 4 on the reverse strand
gat

>match number 7 to "gat" ends at position 153 on the reverse strand
gat

>match number 8 to "gat" ends at position 232 on the reverse strand
gat

sample sequence three

tactcagggctccagaggtacaagttggtaatcggttaggtgtatcgccgccaggggtgcgtcgtcatgactcggttaga

Results for 80 residue sequence "sample sequence three" starting "tactcagggc"

>match number 1 to "tcat" ends at position 68 on the direct strand
tcat

>match number 2 to "tcat" ends at position 14 on the reverse strand
tcat
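The positions listed for sample sequence three are consistent with 1-based end positions, with reverse-strand matches found by searching the reverse complement of the sequence. Both readings are assumptions inferred from the sample output and are not stated explicitly in the paper. The helper names are hypothetical, and plain `str.find` stands in for the matching step.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def match_end_positions(seq, pattern):
    """Return the 1-based end position of every occurrence of `pattern`."""
    ends = []
    start = seq.find(pattern)
    while start != -1:
        ends.append(start + len(pattern))
        start = seq.find(pattern, start + 1)  # allow overlapping matches
    return ends


seq3 = ("tactcagggctccagaggtacaagttggtaatcggttagg"
        "tgtatcgccgccaggggtgcgtcgtcatgactcggttaga")

print(match_end_positions(seq3, "tcat"))                      # [68]
print(match_end_positions(reverse_complement(seq3), "tcat"))  # [14]
```

Under this reading the same search routine serves both strands; only the input sequence changes.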

3.3 Experimental Results

The proposed algorithm was run on DNA sequences of different sizes and compared with the TBM algorithm, which was executed on the same files.

Tables 1.1 to 1.6 show the results obtained: the number of comparisons and the CPB (comparisons per byte), which equals the number of comparisons divided by the DNA sequence size.
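As a small worked example of the CPB definition, the following Python reproduces two CPB values from the comparison counts reported for the sequence of size 103337 with m = 1. The function name `comparisons_per_byte` is hypothetical.

```python
def comparisons_per_byte(comparisons, sequence_size):
    """CPB: number of comparisons divided by the DNA sequence size."""
    return comparisons / sequence_size


# Comparison counts reported for sequence size 103337, m = 1.
print(round(comparisons_per_byte(206674, 103337), 2))  # TBM: 2.0
print(round(comparisons_per_byte(99004, 103337), 2))   # developed algorithm: 0.96
```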

Table 1.1: DNA sequence size=103337

P length (m) (Tandem)   TBM comparisons   TBM CPB   Developed comparisons   Developed CPB
1                       206674            2         99004                   0.96
2                       162084            1.57      86892                   0.84
3                       119658            1.16      89232                   0.86
4                       107856            1.04      89132                   0.86

Table 1.2: DNA sequence size=70182

P length (m) (Tandem)   TBM comparisons   TBM CPB   Developed comparisons   Developed CPB
1                       140364            2         67468                   0.96
2                       110088            1.57      59034                   0.84
3                       81276             1.16      60444                   0.86
4                       73163             1.04      60438                   0.86

Table 1.3: DNA sequence size=36578

P length (m) (Tandem)   TBM comparisons   TBM CPB   Developed comparisons   Developed CPB
1                       73156             2         35248                   0.96
2                       57373             1.57      30814                   0.84
3                       42398             1.16      31470                   0.86
4                       38112             1.04      31588                   0.86

Table 1.4: DNA sequence size=19900

P length (m) (Tandem)   TBM comparisons   TBM CPB   Developed comparisons   Developed CPB
1                       39800             2         19516                   0.98
2                       31227             1.57      16913                   0.85
3                       23074             1.16      16996                   0.85
4                       20696             1.04      17161                   0.86

Table 1.5: DNA sequence size=5578

P length (m) (Tandem)   TBM comparisons   TBM CPB   Developed comparisons   Developed CPB
1                       11156             2         5436                    0.97
2                       8754              1.57      4791                    0.86
3                       6444              1.16      4756                    0.85
4                       5803              1.04      4990                    0.89

Table 1.6: DNA sequence size=2480

P length (m) (Tandem)   TBM comparisons   TBM CPB   Developed comparisons   Developed CPB
1                       4960              2         2436                    0.97
2                       3870              1.57      2121                    0.86
3                       2877              1.16      2137                    0.85

A string-matching algorithm was proposed for locating tandem repeats in DNA sequences. It was implemented and compared with other algorithms such as TBM. Our algorithm showed better performance, and it produces outputs that can be used later for analysis purposes.

The time complexity of the algorithm (worst case) is always less than O(n): in every experiment the number of comparisons stayed below the sequence size (CPB below 1). The proposed algorithm reduces the computation time by a factor of 1.7 to 2.06 compared with the TBM algorithm.

The results also show that the number of comparisons per byte does not depend on the DNA sequence size, but on the tandem size.

References

[1] Sirinawin, J. 1995. The Structure and Function of Gene and Chromosome. Essential Medical Genetics. 3.

[2] National Center for Biotechnology Information (NCBI). 2000. Basic Local Alignment Search Tool (BLAST), http://www.ncbi.nim.nih.gov

[3] Crochemore, M., 1997. Off-line Serial Exact String Searching in Pattern Matching Algorithms, ed. A. Apostolico and Z. Galil, Chapter 1, pp 1-53, Oxford University Press

[4] Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W., 1994. Speeding up Two String Matching Algorithms. Algorithmica 12(4/5):247-267

[5] Sirinawin, J. 1995. The Structure and Function of Gene and Chromosome. Essential Medical Genetics. 3.

[6] Allen R, Balding D, Donnelly P, Friedman R, Kaye D, LaRue L, Park R, Robertson B, Stein A. 1995. Probability and Proof in State v Skipper: an Internet Exchange. Jurimetrics J 35: 277-310.

[7] Fischetti, V.A., et al. "Identifying Periodic Occurrences of a Template with Applications to Protein Structure". Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching. Apr-May 1992. Arizona. pp 111-120 (Wendt Library) or pp 109-118 (Debby).

[8] Giegerich, R. and Wheeler, D. "Pairwise sequence alignment", Biocomputing Hypertext Course Book. www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/prwali.html, May 21, 1996.

[9] Yang, R. "Multiple Protein/DNA Sequence Alignment with Constraint Programming". Proceedings of the Fourth International Conference on the Practical Application of Constraint Technology, 4:159-166, 1998.

European Journal of Scientific Research

ISSN 1450-216X Vol.15 No.3 (2006), pp. 373-380 © EuroJournals Publishing, Inc. 2006

http://www.eurojournals.com/ejsr.htm