Algorithm 6 uses the general idea of bit parallelism from Baeza-Yates and Gonnet [3] as well as the ideas presented by Saada and Zhang [37] (later discovered in the literature).
Next, Theorem 3 gives the big-O runtime of the inexact matching steps.
Theorem 3. The inexact matching steps result in a worst-case runtime of O(N).
Proof of Theorem 3. Each assignment step has cost O(1), which is also assumed for the mod operations. Each for loop has a cost of O(N), where N is the number of characters in the string S. Since no for loop is nested, the loop costs add rather than multiply, and the worst-case cost is O(N).
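The single-pass structure behind this bound can be sketched as follows. This is only an illustration, not Algorithm 6 itself: the bit assignment in BIT and the function name inexact_candidates are assumptions made for the example. Each character is mapped to one bit, any other letter maps to 0, and a position is reported when the current character and the two before it together set all three bits.

```python
# Hypothetical bit assignment; the thesis's actual encoding may differ.
BIT = {"A": 0b001, "T": 0b010, "G": 0b100}

def inexact_candidates(s):
    """Return positions k where s[k-2:k+1] is some ordering of A, T and G."""
    hits = []
    prev2 = prev1 = 0               # codes of the two preceding characters
    for k, ch in enumerate(s):      # one non-nested loop: O(N) steps
        cur = BIT.get(ch, 0)        # any other letter contributes 0
        # All three bits are set only if A, T and G each occur exactly once.
        if prev2 | prev1 | cur == 0b111:
            hits.append(k)          # k is the last letter of the candidate
        prev2, prev1 = prev1, cur
    return hits

print(inexact_candidates("CATGA"))
```

Every step inside the loop is a constant-time dictionary lookup, bitwise or, or comparison, so the total work grows linearly with len(s).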
Now, I’ll analyze the time complexity with the exact matching steps included.
Theorem 4. With exact matching, the time complexity remains O(N).
Proof of Theorem 4. The exact matching step passes over the data of size N one additional time, at cost O(N). Adding this to the O(N) cost from Theorem 3 gives O(N) + O(N) = O(N), so the time remains O(N).
Next, I’ll analyze the space requirements for the inexact matching steps.
Theorem 5. The inexact matching steps for Algorithm 7 result in a worst-case space requirement of O(N), where N is the number of characters coded in ASCII.
Proof of Theorem 5. The string S is referred to throughout the program, so it must be stored until it is no longer needed. N = len(S) is the maximum size of the string.
There are also lists storing the indices of start and stop codons, whose combined size can be no longer than N for a valid DNA sequence. Thus, 2N bounds the spatial cost of storing the string together with these index lists. Also, there are at most three bits stored in the list for each codon, which remains negligible compared to the size of the string in ASCII. Therefore, the maximum required space for the inexact matching steps is O(N).
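As a rough tally of this argument, the terms can be added up using the proof's own accounting. The bit-list term is an assumption made for illustration: it takes one 3-bit entry per character position as an upper bound and expresses it in bytes.

```latex
\underbrace{N}_{\text{string } S}
\;+\; \underbrace{N}_{\text{index lists}}
\;+\; \underbrace{\tfrac{3N}{8}}_{\text{bit list (assumed bound)}}
\;\le\; 3N \;=\; O(N)
```

The constant factor changes if index entries are wider than one character, but the total stays linear in N.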
Next, soundness and completeness of the bit-based algorithms are proven.
Theorem 6. The bit-based algorithm is sound.
Proof of Theorem 6. Let N = len(S). Suppose the inexact matching procedure yields an output in which position k ∈ [0, N) is the last letter of a potential codon, so that k − 2 is the start of the potential codon. This holds because all bits are independent of each other, and if any letter in the DNA alphabet other than ‘T’, ‘G’ or ‘A’ is in the string, all bits reset to 0 at that position. Thus, the bits in the inexact matching steps produce only the numbers associated with the start of the codons “AGT”, “ATG”, “GAT”, “GTA”, “TGA” and “TAG” that are potentially in the string. Next, I’ll prove that the exact matching steps do not produce false positives.
If exact matching is used, the second for loop traverses the string again. The length of the list of bits matches the number of characters in the string. Thus, any given index in [0, N) corresponds to the respective character in S. Also, if the letters “ATG” are identified by the inexact matching step, the exact matching step traverses the bit list and confirms the respective characters of that potential codon in the original string. Thus, soundness is proven if exact matching is used.
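The confirmation step described in this proof can be sketched as a filter over candidate positions. This is a minimal illustration under assumptions: the function name confirm_exact and its parameters are hypothetical, and the candidates are taken to be positions k of the last letter, as in the proof.

```python
def confirm_exact(s, end_positions, codon="ATG"):
    """Keep only candidates whose three characters equal `codon` exactly.

    `end_positions` holds positions k of the last letter of each candidate,
    so the candidate occupies s[k-2 : k+1].
    """
    confirmed = []
    for k in end_positions:              # at most N candidates: O(N) overall
        start = k - 2
        if start >= 0 and s[start:k + 1] == codon:
            confirmed.append(start)      # report where the codon starts
    return confirmed

# Candidates ending at 3 ("ATG") and 4 ("TGA") in "CATGA".
print(confirm_exact("CATGA", [3, 4]))
```

Because a position is reported only after a direct comparison with the original string, a reordering such as "TGA" cannot be reported as "ATG".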
Theorem 7. The bit-based algorithm is complete.
Proof of Theorem 7. Suppose, for contradiction, that the algorithm is not complete. Then at least one of the following cases holds.
Case I: A check for “ATG” in the input string does not correspond to the location of this codon being found in the output.
Case II: A check for “TGA” in the input string does not correspond to the location of this codon being found in the output.
Case III: A check for “TAG” in the input string does not correspond to the location of this codon being found in the output.
Case IV: A check for “TAA” in the input string does not correspond to the location of this codon being found in the output.
First, case I will be checked. The inexact matching step results in the following possible codons in the string for the characters in “ATG”: “ATG”, “TGA” and “TAG”. Since the string is directly checked for “ATG”, the output for this codon is shown to be complete.
Similar logic contradicts cases II and III.
For case IV, the possible codons containing both ‘T’ and ‘A’ and no other letters are: “TAA”, “TAT”, “ATT”, “ATA”, “AAT” and “TTA”. Similar to cases I-III, case IV is contradicted because the string is directly checked. Thus, completeness is shown by contradiction of all four cases.
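The candidate set for case IV can be checked by brute force. The sketch below enumerates every three-letter string over ‘T’ and ‘A’ and keeps those that contain both letters. The helper name codons_over is hypothetical and is used only for this illustration.

```python
from itertools import product

def codons_over(letters, must_contain):
    """List all 3-letter strings over `letters` containing every required letter."""
    out = []
    for triple in product(letters, repeat=3):
        word = "".join(triple)
        if all(ch in word for ch in must_contain):
            out.append(word)
    return out

ta_only = codons_over("TA", "TA")
print(sorted(ta_only))
```

Only one of these strings equals "TAA", which is why the direct check against the original string is needed to single it out.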
Theorem 8. Algorithm 9 is sound when exact matching is used.
Proof of Theorem 8. Let N = len(S). Suppose the inexact matching procedure yields an output in which position k ∈ [0, N) is the first letter of the potential codon. If ‘T’, ‘G’, or ‘A’ is present, then the bitwise or operation unique to the respective character is run on binCtr. If any other letter in the DNA alphabet is present, the effect is the same as a bitwise or with 0. After those steps, binCompress() is called.
binCompress() acts on the bits at the boundary conditions of 4 or 28 bits. The latter is the integer boundary in Python. When the 28th bit in an integer is encountered, the int is appended to binList and the variables for keeping track of the number of bits are reset.
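The packing behaviour described here can be sketched as follows. This is not the thesis's binCompress(): the function name pack_codes, the 4-bit code width, and the handling of a partially filled final word are all assumptions for illustration. Only the 28-bit boundary and the idea of appending a full integer to a list and resetting the counters come from the text.

```python
WORD_BITS = 28   # boundary named in the text
CODE_BITS = 4    # assumed width of one per-character code

def pack_codes(codes):
    """Pack 4-bit codes into 28-bit integers, seven codes per integer."""
    bin_list = []
    bin_ctr = 0      # integer currently being filled
    used = 0         # number of bits already used in bin_ctr
    for code in codes:
        bin_ctr = (bin_ctr << CODE_BITS) | (code & 0b1111)
        used += CODE_BITS
        if used == WORD_BITS:            # 28th bit reached
            bin_list.append(bin_ctr)
            bin_ctr, used = 0, 0         # reset the tracking variables
    if used:
        bin_list.append(bin_ctr)         # assumed: keep the partial last word
    return bin_list

print([hex(w) for w in pack_codes([1] * 8)])
```

Storing seven codes per integer keeps the list length at roughly N/7 entries for N codes, which is still linear in N.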
Theorem 9. Algorithm 9 is complete when exact matching steps are used.
Proof of Theorem 9. The logic that proves Theorem 7 also holds for Algorithm 9.
Chapter 5
Conclusion & Future Work
In conclusion, two specialized algorithms for string searching have been created that require both linear time and space in the worst case. With the approximate matching steps common in bioinformatics contexts, namely hashing and Hamming distances, these run in a short period of time (seconds–minutes range) on the H. pylori genome data. What is particularly notable is that Algorithm 8 runs on an ordinary computer for the O. sativa genome when hashing is used in the second step.
Additional data could be collected for cloud and supercomputing environments, to increase the understanding of what can be run using the bit-based algorithm with compression, with a focus on larger genomes, such as the human genome. Another extension of the work would be encoding a larger pattern in a similar manner to save space in memory.
List of References
[1] What is STR Analysis? | National Institute of Justice. Retrieved from https://nij.ojp.gov/topics/articles/what-str-analysis.
[2] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.
[3] Ricardo Baeza-Yates and Gaston Gonnet. A new approach to text searching. Communications of the ACM, 35(10):74–82, 1992.
[4] Ricardo A. Baeza-Yates. Efficient text searching. Ph.D. Dissertation, University of Waterloo, Waterloo, Ontario, 1989.
[5] Ardeshir Bayat. Bioinformatics (science, medicine and the future.). British Medical Journal, 324(7344), 2002. Retrieved from Gale OneFile: Nursing and Allied Health.
[6] Shakeela Bibi, Javed Iqbal, Adnan Iftekhar, and Mir Hassan. Analysis of compression techniques for DNA sequence data. European Academic Research, 6(1), 2018.
[7] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762–772, 1977.
[8] M. Burrows and D.J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report, Systems Research Center, Palo Alto, CA, May 1994.
[9] M. Burset and R. Guigó. Evaluation of gene structure prediction programs. Genomics, 34:353–367, 1996.
[10] Lianhua Chi and Xingquan Zhu. Hashing techniques: A survey and taxonomy. ACM Computing Surveys, 50(1), April 2017.
[11] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to algorithms. The MIT Press, Cambridge, MA, 3rd edition, 2009.
[12] Anhai Doan, Alon Halevy, and Zachary Ives. String matching. In Principles of data integration, pages 95–119. Elsevier, Boston, 2012.
[13] Newton Faller. An adaptive system for data compression. In Proceedings of COPPE, Rio de Janeiro, Brazil.
[14] Robert G. Gallager. Variations on a theme by Huffman. Submitted to IEEE Transactions on Information Theory, December 1977.
[15] Michael R. Garey and David S. Johnson. The theory of NP-completeness. In Victor Klee, editor, Computers and intractability: A guide to the theory of NP-completeness, A series of books in the mathematical sciences, pages 17–44. W.H. Freeman, Murray Hill, NJ, 1979.
[16] Scott Gigante. fasta - Uppercase vs lowercase letters in reference genome, 2017. Retrieved from https://bioinformatics.stackexchange.com/questions/225/uppercase-vs-lowercase-letters-in-reference-genome.
[17] Dan Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, 1997.
[18] R.W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 29(2):147–160, 1950.
[19] John Hopcroft. An n log n algorithm for minimizing states in a finite automaton. In Zvi Kohavi and Azaria Paz, editors, Proceedings of an International Symposium on the Theory of Machines and Computations, pages 189–196. Academic Press, 1971.
[20] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.
[21] Donald E. Knuth. Introduction to combinatorial algorithms and boolean functions, volume 4, fasc. 0 of The Art of Computer Programming. Addison-Wesley, Upper Saddle River, NJ, 2008.
[22] Donald E. Knuth. Dynamic Huffman coding. In Donald E. Knuth, editor, Selected Papers on Design of Algorithms, number 191 in CSLI Lecture Notes, pages 51–70. CSLI Publications, Stanford, CA, 2010. Corrected reprint of the 1985 paper originally published in the Journal of Algorithms, 6, 163–180.
[23] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. In Donald E. Knuth, editor, Selected Papers on Design of Algorithms, number 191 in CSLI Lecture Notes, pages 99–135. CSLI Publications, Stanford, CA, 2010. Corrected reprint of the 1977 paper originally published in SIAM Journal on Computing, 6, 323–350.
[24] Leon G. Kraft. A device for quantizing, grouping, and coding amplitude modulated pulses. Master of Science Thesis, MIT, Cambridge, MA, May 1949.
[25] Hans E. Krokan, Finn Drabløs, and Geir Slupphaug. Uracil in DNA – occurrence, consequences and repair. Oncogene, 21:8935–8948, 2002.
[26] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics-Doklady, 10(8):707–710, 1966.
[27] Yishay Mansour, Noam Nisan, and Prasoon Tiwari. The computational complexity of universal hashing. In Proceedings of the twenty-second annual ACM symposium on Theory of Computing, volume 22, pages 235–243, Baltimore, MD, April 1990. ACM.
[28] B. McMillan. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory, 2(4):115–116, 1956.
[29] Rohith K. Menon, Goutham P. Bhat, and Michael C. Schatz. Rapid parallel genome indexing with MapReduce. In MapReduce’11, San Jose, CA, 2011. ACM.
[30] Alistair Moffat. Huffman coding. ACM Computing Surveys, 52(4), 2019.
[31] Paul J. Muhlrad. Alternative splicing. In Genetics, volume 1, pages 17–21. Gale, 2nd edition, 2018. eBook.
[32] Marshall Nirenberg. The genetic code, December 1968. Nobel Lecture.
[33] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.
[34] P. Pandiselvam, T. Marimuthu, and R. Lawrance. A comparative study on string matching algorithms of biological sequences. arXiv:1401.7416 [cs.DS], 2014.
[35] Qura-Tul-Ein, Yousaf Saeed, Shahid Naseem, Fahad Ahmad, Tahir Alyas, and Nadia Tabassum. DNA pattern analysis using finite automata. International Research Journal of Computer Science, 1(2), 2014.
[36] Michael O. Rabin. Mathematical theory of automata. In J.T. Schwartz, editor, Proceedings of Symposia in Applied Mathematics, volume 19, pages 153–175, Providence, RI, 1967. American Mathematical Society.
[37] Bacem Saada and Jing Zhang. DNA sequences compression algorithms based on the two bits codation method. In 2015 IEEE Conference on Bioinformatics and Biomedicine (BIBM). IEEE Computer Society Press, 2015.
[38] Sanjeewa B. Senanayaka. Sub-linear algorithms for shortest unique substring and maximal unique matches. Master of Science Thesis, University of Wisconsin-Whitewater, Whitewater, Wisconsin, December 2019.
[39] Michael Sipser. Introduction to the theory of computation. Thomson Learning, Boston, 2nd edition, 2006.
[40] Steven S. Skiena. The algorithm design manual. Springer-Verlag, London, 2nd edition, 2010.
[41] Jesmin Jahan Tithi. Engineering high-performance parallel algorithms with applications to bioinformatics. Ph.D. Dissertation, Stony Brook, Stony Brook, NY, December 2015.
[42] Jeffrey Scott Vitter. Design and analysis of dynamic Huffman codes. Journal of the ACM, 34(4):825–845, 1987.
[43] J.D. Watson and F.H.C. Crick. Molecular structure of nucleic acids: A structure for Deoxyribose Nucleic Acid. Nature, 171(4356):737–738, 1953.
Appendices
List of Algorithms
1 Boyer-Moore
2 Shift-Or Preprocessing
3 Shift-Or Searching
4 searchCodons Outline
5 searchCodons
6 Bit-Based Inexact Codon Matching
7 Bit-based Codon Matching With Hashing
8 Bit-based Codon Matching With Compression – Searching for A,T,G
9 Bit-based Codon Matching With Compression – Matching
List of Symbols
len(), length function (algorithms)
mod n, used to check divisibility by n
N∗, the set of non-negative integers
P, the pattern to be searched for
S, a string of characters
<< x, shift a number represented in binary left by x bits
>> x, shift a number represented in binary right by x bits
&, bitwise and
|, bitwise or
Σ, the alphabet of a language
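For reference, the operator symbols listed above behave as follows in Python. The variable names and the sample string are illustrative only.

```python
x = 0b0101               # 5 in binary

left = x << 2            # shift left by 2 bits:  0b10100
right = x >> 1           # shift right by 1 bit:  0b10
both = x & 0b0110        # bitwise and:           0b0100
either = x | 0b0110      # bitwise or:            0b0111

S = "ATGTAA"             # a sample string of characters
whole_codons = len(S) % 3 == 0   # mod 3 checks divisibility by 3

print(left, right, both, either, whole_codons)
```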