Theorem 5.9 (Pattern matching in O(|P| log|Σ|) time). For a text T of length n over an alphabet Σ there is a data structure with space-occupancy of ≈ 2.54311n + o(n) bits that, together with the suffix array and the LCP-array for T, allows the retrieval of all occ occurrences of a pattern P in T in O(|P| log|Σ| + occ) time, for any alphabet size |Σ|. This data structure can be constructed in O(n) time, and the additional space at construction time is o(n) bits.
Proof. All that remains to be shown is the statement on the additional space consumption at construction time. It is clearly o(n) if |Σ| = o(log^{2+ε} n / log log n). On the other hand, if |Σ| is larger, one simply has to precompute the long queries before the short queries. Then the n bits needed for the bit-vectors D′_j can be re-used for the 2.54 bits needed for the type-table T′.
The key to this result was a pseudo-median algorithm for RMQ, which led us to a natural generalization of Cartesian Trees, involving generalizations of the Catalan and Ballot Numbers.
We wish to emphasize the fact that our ideas are also compatible with Sadakane’s compressed suffix trees (2007a). In this case, our type-table T′ from Sect. 5.5.2.2 is not necessary, as the balanced parentheses sequence of the suffix tree already respects the layout of the minima inside the blocks. However, the ideas from Sect. 5.5.2.2 can be transferred one-to-one, and thus close the aforementioned gap in the compressed suffix tree.
A further advantage of our O(m log|Σ|)-search is that it is perfectly compatible with compressed representations of suffix arrays (Sect. 2.7). For example, combining the Compressed Suffix Array due to Grossi and Vitter with our search strategy, locating all occ occurrences takes O((m log|Σ| + occ) log^α_{|Σ|} n) time (0 < α ≤ 1), for any alphabet size |Σ|, while needing only α^{-1} H_0 n + O(n) bits in total (H_0 being the empirical order-0 entropy of the input text). If occ is not too small, this is a significant improvement over the currently fastest locating time in compressed indexes (Ferragina et al., 2007), which takes O(m + (m + occ log^{1+β} n) log|Σ| / log log n) time to locate occ occurrences (0 < β < 1), and even this only for |Σ| = o(n / log log n).
6 String Mining Problems
6.1 Chapter Introduction
Mining in databases of graphs, trees, and sequences has attracted a lot of interest in recent years, starting with the famous Apriori algorithm for mining frequent itemsets (Agrawal et al., 1993; Agrawal and Srikant, 1994). The typical characteristic of data mining problems is that the patterns to be searched are a priori unknown. This means that the user can impose certain conditions on patterns that must be fulfilled to make a pattern be part of the solution. In a certain sense, this is the complete opposite of usual search algorithms such as exact string matching algorithms (cf. Sect. 2.2), where the input is usually the pattern to be found, and the output is the number (and possibly positions) of all matches. In the setting of data mining, the input would be the number of matches, and the output could be all patterns that occur at least that often in the data.
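The contrast between the two directions can be made concrete with a small sketch. It is a naive illustration only, not the algorithm developed in this chapter: it enumerates all substrings explicitly and therefore takes far more than linear time. The function names `count_matches` and `mine_frequent` are hypothetical, and the sketch assumes that the frequency of a pattern is the number of database strings that contain it at least once.

```python
def count_matches(pattern, database):
    """Search direction: a pattern goes in, the number of database
    strings containing it comes out."""
    return sum(pattern in s for s in database)


def mine_frequent(database, min_count):
    """Mining direction: a threshold goes in, all substrings contained
    in at least min_count database strings come out (naive enumeration)."""
    support = {}
    for s in database:
        # Distinct substrings of s, so each string counts at most once.
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        for p in subs:
            support[p] = support.get(p, 0) + 1
    return sorted(p for p, c in support.items() if c >= min_count)


database = ["abab", "bab", "ba"]
print(count_matches("ab", database))   # search: pattern is the input
print(mine_frequent(database, 3))      # mining: threshold is the input
```

In the first call the pattern "ab" is given and a count is returned; in the second only the threshold 3 is given and the patterns themselves are the result.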
In this chapter, we focus on string mining under frequency constraints, i.e., predicates over patterns depending solely on the frequency of their occurrence in the data. This category encompasses combined minimum/maximum support constraints (De Raedt et al., 2002), constraints concerning emerging substrings (Chan et al., 2003), and possibly other constraints concerning statistically significant substrings. We briefly describe the two most common problems in more detail:
• Frequent String Mining: the usual setting here is that we are given two databases, one containing positive, the other negative patterns. Then one might be interested in extracting all patterns that pass a certain minimum frequency threshold in the positive database, but do not occur too often in the negative database. The biological relevance of this task can be seen in the following example: suppose a genetic disease is conjectured to be caused by a defect on the X-chromosome, but it is unknown where and how this failure occurs. One can then collect the genetic sequences of the X-chromosome of 1000 ill patients in the positive database, and likewise the genetic sequences of 1000 healthy persons in the negative database. Then all patterns that occur frequently (or always) in the positive database and not too often (or never) in the negative database are potential indicators of the genetic defect under consideration.
• Emerging Substring Mining: this is an extension of the frequent string mining problem and considers patterns as relevant if they have a certain growth-rate, defined as the ratio of the relative frequency in the positive database to the relative frequency in the negative database. In this setting, constraints are often easier to formulate, as only one quantity needs to be specified. However, because one is mostly interested in patterns that have a certain statistical significance, one usually specifies an additional constraint which guarantees that the solution patterns have a certain minimal frequency in the positive database.
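The two problem statements above can be written down as simple predicates over per-database frequencies. The following is a brute-force sketch for small inputs, not the optimal algorithm of this chapter. All function and parameter names are hypothetical, and two conventions are assumptions: the frequency of a pattern is the number of database strings containing it, and the growth-rate is taken to be infinite when the pattern does not occur in the negative database.

```python
from collections import Counter


def substring_support(database):
    """Map each substring to the number of database strings containing it."""
    counts = Counter()
    for s in database:
        counts.update({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)})
    return counts


def frequent_strings(pos, neg, min_pos, max_neg):
    """Patterns with support >= min_pos in pos and <= max_neg in neg."""
    sp, sn = substring_support(pos), substring_support(neg)
    return sorted(p for p, c in sp.items() if c >= min_pos and sn[p] <= max_neg)


def emerging_substrings(pos, neg, min_rel_freq, min_growth):
    """Patterns with relative frequency >= min_rel_freq in pos and
    growth-rate (rel. freq. in pos / rel. freq. in neg) >= min_growth."""
    sp, sn = substring_support(pos), substring_support(neg)
    result = []
    for p, c in sp.items():
        rel_pos = c / len(pos)
        rel_neg = sn[p] / len(neg)  # Counter yields 0 for absent patterns
        if rel_pos < min_rel_freq:
            continue
        # Assumed convention: absent from neg means infinite growth-rate.
        growth = float("inf") if rel_neg == 0 else rel_pos / rel_neg
        if growth >= min_growth:
            result.append(p)
    return sorted(result)


pos = ["acgt", "acga", "ccgt"]
neg = ["aatt", "gcta"]
print(frequent_strings(pos, neg, min_pos=3, max_neg=0))
print(emerging_substrings(pos, neg, min_rel_freq=1.0, min_growth=2.0))
```

Once the supports are known, each predicate is a constant number of comparisons per pattern; the expensive part of this sketch is the explicit enumeration of substrings.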
Further potential application areas of both methods are, among others, finding discriminative features for sequence classification (Birzele and Kramer, 2006), discovering new binding motifs of transcription factors, identifying gene-coding regions in DNA, and microarray design. In the latter example the goal is to find probes (short stretches of sequence spotted on a microarray) differentiating well between groups of sequences. Additionally, the probes have to possess certain physico-chemical properties to qualify them for inclusion on the microarray. Outside the field of computational biology, we mention automatic language classification of texts, spam-recognition of e-mails, and distinction between melodic and non-melodic patterns in MIDI data.
In this chapter, we present an algorithm that is able to answer frequency-based queries optimally, that is, in time linear in the size of the input databases, plus the time to output the solution patterns. The only two assumptions we make are that the number of given databases is constant, and that the frequency-based predicates can be evaluated in constant time. Both assumptions are highly realistic; all of the above-mentioned applications can be modeled with just a handful of databases, and in most cases there are only two sets (positive and negative). It is interesting to note that no optimal algorithms are known for other pattern domains such as itemsets or graphs, or there are even hardness results (Wang et al., 2005).