eMerGent And future trends - Data Mining Patterns New Methods And Applications Pascal Poncelet

This is the final part of our narration, and we can discuss some of the approaches recently emerging in this fascinating field. In the works presented in the third section, the concept of interest for patterns was usually related to the frequency of occurrences, that is, patterns were interesting

if they were over-represented in a given set of sequences. In many biological contexts though, patterns occurring unexpectedly, often or rarely, often called surprising words, can be associated with important biological meanings, for example, they can be representative of elements having patterns repeated surprisingly differently from the rest of population due to some disease or genetic malformation. Distance measures based not only on the frequency of occurrences of a pattern, but also on its expectation, are assuming a fundamental role in these emergent studies. Thus,

interesting patterns can be patterns such that the difference between observed and expected counts, usually normalized to some suitable moment, are beyond some preset threshold. The increasing volumes of available biological data makes exhaustive statistical tables become excessively large to guarantee practical accessibility and usefulness. In Apostolico, Bock, and Lonardi (2003) the authors study probabilistic models and scores for which the population of potentially surprising words in a sequence can be described by a table of size at worst linear in the length of that sequence, supporting linear time and space algorithms for their construction. The authors consider as candidate surprising words, only the members of an a priori (i.e., before any score is computed) identified set of representative strings, where the cardinality of that set is linear in the input sequence length. The construction is based on the constraint that the score is monotonic in each class of related strings described by such a score. In the direction of extracting over-represented motifs, the authors of Apostolico, Comin, and Parida (2005) introduce and study a charac-

terization of extensible motifs in a sequence which tightly combines the structure of the pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. They show that a prudent combination of saturation conditions (expressed in terms of minimum number of don’t care compatible with a given list of occurrences) and monotonicity of probabilistic scores over regions of constant frequency afford significant parsimony in the generation and test- ing of candidate over-represented motifs. The approach is validated by tests on protein sequence families reported in the PROSITE database. In both the approaches illustrated in Apostolico et al. (2003) and Apostolico et al. (2005), the concept of interestingness is related to the concept of surprise. The main difference between them is that the former aims at showing how the number of over- and under-represented words in a sequence can be bound and computed in efficient time and space, if the scores under consideration grow monotonically, a condition that is met by many scores; the latter presents a novel way to embody both statistical and structural features of patterns in one measure of surprise.

If the emergent trends underline the impor- tance of making the search for motifs occurring unexpectedly often or rarely in a given set of sequences in input efficient, further studies are driven towards the characterization of patterns representing outliers for a given population, look- ing at the structure of the pattern as the key to detect unexpected properties possessed by a single individual, being object of examination, belonging to a large reference collection of categorical data. Moreover, the complexity of biological processes, such as gene regulation and transcription, in eu- karyotes organisms, leaves many challenges still open and requires the proposal of new and more efficient techniques to solve particular problems, such as the computation of distances on specific classes of patterns, or of operations on boxes more complex than swaps or skips.

conclusIon

The ever-increasing growth of the amount of biological data that are available and stored in biological databases, significant especially after the completion of human genome sequencing, has stimulated the development of new and efficient techniques to extract useful knowledge from biological data. Indeed, the availability of enormous volumes of data does not provide “per se” any increase in available information. Rather, suitable methods are needed to interpret them.

A particular class of biological data is biosequences, denoting DNA or protein sequences, and characterized by the presence of regular expressions associated with the presence of similar biological functions in the different mac- romolecules. Such regularities can be represented by substrings, that are, patterns, common to a number of sequences, meaning that sequences sharing common patterns are associated with a common biological function in the corresponding biological components.

The main aim of this chapter is that of ad- dressing the problem of mining patterns that can be considered interesting, according to some predefined criteria, from a set of biosequences. First, some basic notions and definitions have been provided concerning concepts both derived from string matching and from pattern discovery for sets of sequences. Then, since in many biological problems such as, for example, gene regulation, the structure of the repeated patterns in the sequences can be fixed for the generic case, and maybe also complex and vary with the different analyzed cases, an intuitive formalization of the problem has been given considering structural properties in which parameters related to the structure of the patterns to search for can be specified according to the specific problem. This formalization aimed at guiding the reader in the discussion put forward about a number of approaches to pattern discovery that have been presented in the literature in recent years. The

approaches regarded fundamental areas of application of pattern discovery in biosequences, such as protein family characterization, tandem repeats and gene regulation. A further section has been devoted to pointing out emergent approaches, dealing with the extraction of patterns occurring unexpectedly often or rarely and to pointing out possible future developments.

As a final remark, it is worth pointing out that mining patterns is central as well w.r.t. many biological applications. It has been applied to solve different problems involving sets of sequences, as an alternative to other existing methods (e.g., sequence alignment) but, especially, as a powerful technique to discover important properties from large data bunches. Biology is an evolving science, and emerging biological data are, in some cases, sets of aggregate objects rather than just biosequences. Therefore, it is anticipated that a very interesting evolution of the problem of pattern discovery presented in this chapter consists in its extension to more structured cases, involving objects that cannot be encoded as simple sequences of symbols.

references

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Anang, Z., Miller, W., & Lipman, D. J. (1997). Gapped blast and psi-blast: A new generation of protein database search programs. Nucleic Acids Research,25, 3389-3402.

Apostolico, A. (1985). The myriad virtues of subword trees. In A. Apostolico & Z. Galil (Eds.),

Combinatorial algorithms on words (12, pp. 85- 96). Springer-Verlag Publishing.

Apostolico, A., Bock, M. L., & Lonardi, S. (2003). Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology

10(3/4), 283-311.

Apostolico, A., & Crochemore, M. (2000). String pattern matching for a deluge survival kit. In P. M. Pardalos, J. Abello, & M. G. C. Resende (Eds.), Handbook of massive data sets. Kluwer Academic Publishers.

Apostolico, A., Comin, M., & Parida, L. (2005). Conservative extraction of over-represented extensible motifs. ISMB (Supplement of Bioin- formatics), 9-18.

Apostolico, A., Iliopoulos, C., Landau, G., Schie- ber, B., & Vishkin, U. (1988). Parallel construction of a suffix tree with applications. Algorithmica, 3, 347-365.

Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L. L., Studholme, D. J., Yeats, C., & Eddy, S. R. (2004). The pfam protein families database. Nucleic Acids Research, 32, 138-141.

Benson, D. A., Mizrachi, I. K., Lipman, D. J., Ostell, J., & Wheeler, D. L. (2005). Genbank.

Nucleic Acids Research,33.

Benson, G. (1997). Sequence alignment with tandem duplication. Journal of Computational

Biology, 4(3), 351-67.

Benson, G. (1999). Tandem repeats finder: A pro- gram to analyze DNA sequences. Nucleic Acids Research, 27(2), 573–580.

Berg, J., & Lassig, M. (2004). Local graph alignment and motif search in biological networks. In Proceedings of the National Academy of

Sciences of the United States of America, (pp. 14689-14694).

Berman, H. M., Bourne, P. E., & Westbrook, J. (2004). The pdb: A case study in management of community data. Current Proteomics,1, 49-57. Brazma, A., Jonassen, I., Eidhammer, I., & Gilbert, D. (1998). Approaches to the automatic discovery of patterns in biosequences. Journal of

Computational Biology,5(2), 277-304.

Brazma, A., Jonassen, I., Vilo, J., & Ukkonen, E. (1998). Predicting gene regulatory elements in silico on a genomic scale. Genome Research, 8(11), 1998.

Carvalho, A. M., Freitas, A. T., Oliveira, A. L., & Sagot, M. F. (2005). A highly scalable algorithm for the extraction of cis-regulatory regions. In

Proceedings of the 3rd Asia-Pacific Bioinformat- ics Conference (pp. 273-282).

Elloumi, M., & Maddouri, M. (2005). New vot- ing strategies designed for the classification of nucleic sequences. Knowledge and Information

Systems, 1, 1-15.

Eskin, E., & Pevzner, P. A. (2002). Finding com- posite regulatory patterns in DNA sequences.

Bioinformatics, 18(Suppl. 1),S354-63.

Fassetti, F., Greco, G., & Terracina, G. (2006). Efficient discovery of loosely structured motifs in biological data. In Proceedings of the 21st An- nual ACM Symposium on Applied Computing,

(pp. 151-155).

Fredkin, E. (1960). Trie memory. Communications of the ACM, 3(9), 490-499.

Gusfield, D. (Ed.). (1997). Algorithms on strings, trees and sequences: Computer science and com- putational biology. Cambrige University Press. Gusfield, D., & Stoye, J. (2004). Linear time algorithms for finding and representing all the tandem repeats in a string. Journal of Computer

and System Sciences, 69, 525-546.

Hauth, A. M., & Joseph, D. A. (2002). Beyond tandem repeats: Complex pattern structures and

distant regions of similarity. Bioinformatics,

18(1), 31-37.

Hu, H., Yan, X., Huang, Y., Han, J., & Zhou, X., J. (2005). Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 18(1), 31-37.

Hulo, N., Sigrist, C. J. A., Le Saux, V., Lan- gendijk-Genevaux, P. S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., & Bairoch, A. (2004). Recent improvements to the PROSITE database.

Nucleic Acids Research,32, 134-137.

Kolpakov, R. M., Bana, G. & Kucherov, G. (2003). Mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research, 31(13), 3672-3678.

Koyuturk, M., Kim, Y., Subramaniam, S., Sz- pankowski, W. & Grama, A. (2006). Journal of

Computational Biology, 13(7), 1299-1322. Jeffreys, A. J., Wilson, V., & Thein, S. L. (1985). Hypervariable “minisatellite” regions in human DNA. Nature, 314(6006), 67-73.

Jeffreys, A. J., Wilson, V., & Thein, S. L. (1985). Individual-specific “fingerprints” of human DNA.

Nature, 316(6023), 76-9.

Jensen, L. J., & Knudsen, S. (2000). Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics, 16(4), 326-333.

Jonassen, I., Collins, J. F., & Higgins, D.G. (1995). Finding flexible patterns in unaligned protein sequences. Protein Science, 4, 1587-1595. Lesk, A. (Ed.). (2004). Introduction to protein

science architecture, function, and genomics. Oxford University Press.

Liu, H., & Wong, L. (2003). Data mining tools for biological sequences. Journal of Bioinformatics

Marsan, L., & Sagot, M. F. (2000). Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. Journal of Compu- tational Biology, 7, 345-360.

McCreight, E. M. (1976). A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2), 262-272.

Morrison, D. R. (1968). Patricia: Practical algorithm to retrieve information coded in alphanu- meric. Journal of the ACM, 15(4), 514-534. Neuwald, A. F., & Green, P. (1994). Detecting patterns in protein sequences. Journal of Molecular.

Biology, 239 (5), 698-712.

Pal, K. P., Mitra, P., & Pal, S. K. (2004). Pattern recognition algorithms for data mining. CRC Press.

Palopoli, L., Rombo, S., & Terracina, G. (2005). In M.-S. Hacid, N. V. Murray, Z. W. Ras, & S. Tsumoto (Ed.) Flexible pattern discovery with (extended) disjunctive logic programming. (pp. 504-513). Lecture Notes in Computer Science.

Rigoutsos, I., Floratos, A., Parida, L., Gao, Y., & Platt, D. (2000). The emergence of pattern discovery techniques in computational biology. Journal

of Metabolic Engineering, 2(3), 159-177. Terracina, G. (2005). A fast technique for deriving frequent structured patterns from biological data sets. New Mathematics and Natural Computation, 1(2), 305-327.

Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14(3), 249-260.

Vilo, J. (2002). Pattern discovery from biose-

quences. Academic Dissertation, University of Helsinki, Finland. Retrieved March 15, 2007, from http://ethesis.helsinki.fi/julkaisut/mat/tieto/ vk/vilo/

Wang, J., Shapiro, B., & Shasha, D. (Eds.). (1999).

Pattern discovery in biomolecular data: Tools, techniques and applications. New York: Oxford University Press.

Chapter V

In document Data Mining Patterns New Methods And Applications Pascal Poncelet (2008) pdf (Page 118-123)