Future Work - Reconstruction Algorithms for DNA-based Storage Systems

Even though our algorithms improve on previous results, several challenges must still be addressed in order to fully solve the DNA reconstruction problem. We list the following directions for future work.

1. Design error-correcting codes and coding schemes for DNA-storage systems.

2. Design DNA-storage experiments to evaluate other aspects of our algorithms.

3. The presented algorithms were designed to work with different cluster sizes. However, as presented in Section 5.3.3, when a cluster is large, some of the traces can be filtered out to reduce the complexity and the computation time of the reconstruction process. Hence, we think that future work should focus on defining and evaluating filtering criteria for large clusters.
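As one illustration of what a filtering criterion for large clusters might look like, the sketch below keeps only the traces whose lengths are closest to the design length. This is a hypothetical criterion chosen for simplicity, not the method of Section 5.3.3; the function name `filter_traces_by_length` and the parameters `design_length` and `max_traces` are assumptions made for this example.

```python
from typing import List


def filter_traces_by_length(
    traces: List[str], design_length: int, max_traces: int
) -> List[str]:
    """Keep at most `max_traces` traces whose lengths are closest to `design_length`.

    Hypothetical criterion, for illustration only: a trace whose length differs
    from the design length by d has suffered at least d insertions or deletions.
    """
    if max_traces <= 0:
        return []
    # sorted() is stable, so traces with equal deviation keep their cluster order.
    ranked = sorted(traces, key=lambda t: abs(len(t) - design_length))
    return ranked[:max_traces]


# Toy cluster with design length 6; the sequences are made up for the example.
cluster = ["ACGTAC", "ACGT", "ACGTACG", "ACGTA", "ACGTACGTAC"]
kept = filter_traces_by_length(cluster, design_length=6, max_traces=3)
print(kept)  # ['ACGTAC', 'ACGTACG', 'ACGTA']
```

Length alone cannot detect substitutions or balanced insertion and deletion pairs, so a criterion of this kind would need to be evaluated against reconstruction accuracy before it is adopted.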

Acknowledgments

The authors thank Alexander Yucovich for his ideas and algorithms for Chapters 4 and 5 and his help with the simulations. They also thank Guy Shapira for his great contribution.

They also thank Prof. Zohar Yakhini for his kind and helpful guidance in Chapter 3, and Yoav Orlev, Roy Shafir, and Leon Anavy for co-writing the SOLQC tool.

The authors thank Matika Lidgi for her help with the Divider BMA algorithm.

They also thank her, along with Danit Goldberg, Amir Biran, Batel Carmona, Rotem Samuel, Guy Shapira, Idan Raz, Ron Yizhak, and Dafna Regev, for their contribution to this work.

The authors thank Prof. Gala Yadgar for sharing her servers for the simulations in this research.

The authors thank Lee Organick, Hossein Yazdi, Karin Strauss, Yaniv Erlich, Roee Amit, Sarah Goldberg, and Cyrus Rashtchian for valuable discussions.


