Formal Problem Definition - Fischer, Johannes (2007): Data Structures for Efficient Stri

We consider patterns from the domain of strings. Extending the notation from Sect. 3.3.2, we write lce(φ, ψ) for the longest common extension of φ and ψ, for φ, ψ ∈ Σ⋆. For example, lce(aab, abab) = a. Given a set (or database) D ⊆ Σ⋆ of strings over Σ, we write |D| to denote the number of strings in D, and ‖D‖ to denote their total length, i.e., ‖D‖ = Σ_{φ∈D} |φ|. We define the frequency and the support of a pattern φ ∈ Σ⋆ in D as follows, where φ ⪯ d denotes that φ is a substring of d:

$$\mathrm{freq}(\varphi, D) := |\{d \in D : \varphi \preceq d\}|, \qquad \mathrm{supp}(\varphi, D) := \frac{\mathrm{freq}(\varphi, D)}{|D|}$$
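The definitions above can be made concrete with a minimal Python sketch. It is a naive illustration only, not the linear-time method developed in this chapter; the function names and the representation of a database as a list of strings are assumptions made for this example.

```python
def lce(phi: str, psi: str) -> str:
    """Longest common extension of phi and psi, read from their first characters."""
    k = 0
    while k < min(len(phi), len(psi)) and phi[k] == psi[k]:
        k += 1
    return phi[:k]


def total_length(db: list[str]) -> int:
    """||D||: the summed length of all strings in the database."""
    return sum(len(d) for d in db)


def freq(phi: str, db: list[str]) -> int:
    """Number of database strings that contain phi at least once."""
    return sum(1 for d in db if phi in d)


def supp(phi: str, db: list[str]) -> float:
    """Relative frequency freq(phi, D) / |D| (assumes a non-empty database)."""
    return freq(phi, db) / len(db)
```

For the hypothetical database ["abab", "bb"], freq("ab", ...) is 1 although "ab" occurs twice in the first string, because each database string is counted at most once.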

Note that this is not the same as counting all occurrences of φ in D, because one string in the database could contain multiple occurrences of φ.¹ The main contribution of this chapter is to show how one can compute the frequencies (or support) of all strings occurring at least once in one of the databases in optimal time, i.e., in time linear in the size of the input databases (assuming the number of databases is constant). This allows us to solve frequency-related mining queries in optimal time, i.e., in time linear in the sum of the input and output sizes. Naturally, the query must be computable from the frequency (or support) in constant time.

We now introduce three example problems that can be solved optimally with our approach. The first one is as follows.

Problem 6.1. Given m databases D₁, . . . , Dₘ of strings over Σ (constant m) and m pairs of frequency thresholds (min₁, max₁), . . . , (minₘ, maxₘ), the Frequent Pattern Mining Problem is to return all strings φ ∈ Σ⋆ that satisfy minᵢ ≤ freq(φ, Dᵢ) ≤ maxᵢ for all 1 ≤ i ≤ m. In accordance with the data mining literature (e.g., see Mannila and Toivonen, 1997), this set of solutions is often denoted by Th (for “theory”).

This problem has been addressed by many authors using different solution strategies and data structures (De Raedt et al., 2002; Fischer and De Raedt, 2004; Lee and De Raedt, 2005; Fischer et al., 2005), but none of these are optimal.

Example 6.1. Let Σ = {a, b, c}, D₁ = {bbabab, abacac, bbaaa}, D₂ = {aba, babbc, cba}, min₁ = 2, max₁ = ∞, min₂ = −∞, and max₂ = 2. Then Th = {ab, aba, bb, bba}. Note in particular that because ba is a substring of all 3 strings in D₁, it satisfies the minimum frequency constraint, but it is not part of Th, because its frequency in D₂ is also 3, which is too high.
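Example 6.1 can be checked with a brute-force sketch that enumerates every non-empty substring of the databases and tests the thresholds directly. This is far from the optimal method the chapter aims for. The names `substrings` and `frequent_patterns` are hypothetical, and restricting the candidates to strings that occur at least once in some database is an assumption of this sketch.

```python
import math


def substrings(s: str) -> set[str]:
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def freq(phi: str, db: list[str]) -> int:
    """Number of database strings that contain phi at least once."""
    return sum(1 for d in db if phi in d)


def frequent_patterns(dbs, thresholds) -> set[str]:
    """Brute-force Th: candidates are substrings occurring in some database."""
    candidates = set().union(*(substrings(d) for db in dbs for d in db))
    return {
        phi
        for phi in candidates
        if all(lo <= freq(phi, db) <= hi for db, (lo, hi) in zip(dbs, thresholds))
    }


D1 = ["bbabab", "abacac", "bbaaa"]
D2 = ["aba", "babbc", "cba"]
theory = frequent_patterns([D1, D2], [(2, math.inf), (-math.inf, 2)])
```

With the thresholds of Example 6.1, `theory` contains exactly ab, aba, bb and bba; ba is rejected because its frequency in D₂ is 3.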

The size of the solution space Th can be quite big; as a worst-case example, assume that we are given only one database D₁, and the thresholds are min₁ = 1, max₁ = ∞. If D₁ consists of a single string s which is composed of n different letters, then all Θ(n²) substrings of s are in the solution space, so ‖Th‖ = Θ(n³). This space can be reduced if, instead of enumerating all patterns in Th, one considers a different representation of Th, similar to the idea of Gusfield and Stoye (2004) for computing all tandem repeats in a string by returning a suffix tree where all such repeats are marked. In our case, we will see that it is possible to return a “labeled” suffix array from which all solution patterns can be extracted, thereby bounding the size of the output by O(n).

¹ Our algorithm can also be used to solve the simpler problem of counting all occurrences of a pattern in the database; for this one only has to calculate the S-counters defined by (6.5) and (6.6), and not the C-counters from Sect. 6.3.2.
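The Θ(n²) and Θ(n³) bounds of the worst-case example can be checked numerically. A string of n different letters has n − ℓ + 1 distinct substrings of each length ℓ, which gives n(n+1)/2 substrings with total length n(n+1)(n+2)/6. The sketch below, with the hypothetical helper `solution_space_size`, counts both quantities by explicit enumeration.

```python
def solution_space_size(s: str) -> tuple[int, int]:
    """Return (|Th|, ||Th||) when Th is the set of all non-empty substrings of s."""
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return len(subs), sum(len(x) for x in subs)


s = "abcdefghij"  # n = 10 pairwise different letters
n = len(s)
count, total = solution_space_size(s)
```

For n = 10 the enumeration agrees with the closed forms n(n+1)/2 = 55 and n(n+1)(n+2)/6 = 220.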

Next, we consider a 2-class problem for a (usually positive) database D₁ and a (usually negative) database D₂. We define the growth-rate from D₂ to D₁ of a string φ as

$$\mathrm{growth}_{D_2 \to D_1}(\varphi) := \frac{\mathrm{supp}(\varphi, D_1)}{\mathrm{supp}(\varphi, D_2)}, \quad \text{if } \mathrm{supp}(\varphi, D_2) \neq 0,$$

and growth_{D₂→D₁}(φ) = ∞ otherwise. The following definition is motivated by the problem of mining Emerging Patterns (Dong and Li, 1999):

Problem 6.2. Given two databases D₁ and D₂ of strings over Σ, a support threshold ρ_s (1/|D₁| ≤ ρ_s ≤ 1), and a minimum growth rate ρ_g > 1, the Emerging Substrings Mining Problem is to find all strings φ ∈ Σ⋆ such that supp(φ, D₁) ≥ ρ_s and growth_{D₂→D₁}(φ) ≥ ρ_g.

The patterns satisfying both the support- and the growth-rate condition are called Emerging Substrings (ESs). ESs with an infinite growth-rate are called Jumping Emerging Substrings (JESs), because they are highly discriminative for the two databases. The only known solution for finding ESs (Chan et al., 2003) is quadratic in the input size. The following example will be continued throughout this chapter.

Example 6.2. Let D₁ = {aaba, abaaab}, D₂ = {bbabb, abba}, ρ_s = 1, and ρ_g = 2. Then the emerging substrings from D₂ to D₁ are aa, aab and aba. In this case, these are also the JESs.
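Example 6.2 can be reproduced with a brute-force sketch of the growth-rate definition and Problem 6.2. It is meant only as an illustration and does not reflect the efficient method of this chapter. The function names are hypothetical, and candidates are limited to substrings of D₁, since any pattern with supp(φ, D₁) ≥ ρ_s > 0 must occur there.

```python
import math


def freq(phi: str, db: list[str]) -> int:
    """Number of database strings that contain phi at least once."""
    return sum(1 for d in db if phi in d)


def supp(phi: str, db: list[str]) -> float:
    """Relative frequency freq(phi, D) / |D|."""
    return freq(phi, db) / len(db)


def growth(phi: str, d2: list[str], d1: list[str]) -> float:
    """Growth-rate from d2 to d1; infinite when phi has no support in d2."""
    s2 = supp(phi, d2)
    return supp(phi, d1) / s2 if s2 != 0 else math.inf


def emerging_substrings(d1, d2, rho_s: float, rho_g: float) -> dict[str, float]:
    """Map each emerging substring to its growth-rate (brute force over D1)."""
    candidates = {
        s[i:j] for s in d1 for i in range(len(s)) for j in range(i + 1, len(s) + 1)
    }
    return {
        phi: growth(phi, d2, d1)
        for phi in candidates
        if supp(phi, d1) >= rho_s and growth(phi, d2, d1) >= rho_g
    }


D1 = ["aaba", "abaaab"]
D2 = ["bbabb", "abba"]
es = emerging_substrings(D1, D2, rho_s=1, rho_g=2)
jes = {phi for phi, g in es.items() if math.isinf(g)}
```

Here `es` has the keys aa, aab and aba, and all three have infinite growth-rate, so `jes` contains the same strings.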

As a last example problem that can be solved optimally with our method we mention the χ²-test.

Problem 6.3. Given m databases D₁, . . . , Dₘ of strings over Σ and a threshold ρ, let n = Σ_{j=1}^{m} |Dⱼ| be the total number of strings, f = Σ_{i=1}^{m} freq(φ, Dᵢ) the total frequency of φ, and Eⱼ = f · |Dⱼ| / n be the expected value of φ’s frequency in Dⱼ. Then φ is significant if it passes the χ²-test, i.e., if

$$\chi^2 = \sum_{j=1}^{m} \frac{(\mathrm{freq}(\varphi, D_j) - E_j)^2}{E_j} \geq \rho.$$
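The test statistic of Problem 6.3 can be computed directly from the per-database frequencies, as the following sketch shows. The names `chi_square` and `is_significant` are hypothetical. The sketch assumes that every database is non-empty and that φ occurs at least once in some database, since otherwise some Eⱼ would be zero.

```python
def freq(phi: str, db: list[str]) -> int:
    """Number of database strings that contain phi at least once."""
    return sum(1 for d in db if phi in d)


def chi_square(phi: str, dbs: list[list[str]]) -> float:
    """Chi-square statistic of phi over the databases, following Problem 6.3."""
    n = sum(len(db) for db in dbs)              # total number of strings
    observed = [freq(phi, db) for db in dbs]    # freq(phi, D_j) for each j
    f = sum(observed)                           # total frequency of phi
    if f == 0:
        # Assumption of this sketch: phi must occur somewhere, else E_j = 0.
        raise ValueError("phi does not occur in any database")
    total = 0.0
    for obs, db in zip(observed, dbs):
        expected = f * len(db) / n              # E_j = f * |D_j| / n
        total += (obs - expected) ** 2 / expected
    return total


def is_significant(phi: str, dbs: list[list[str]], rho: float) -> bool:
    """phi passes the test if its statistic reaches the threshold rho."""
    return chi_square(phi, dbs) >= rho


D1 = ["aaba", "abaaab"]
D2 = ["bbabb", "abba"]
```

With the databases of Example 6.2, the pattern aa has frequencies 2 and 0 and expected values 1 and 1, so its statistic is 2, whereas a occurs in all four strings and has a statistic of 0.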

In document Fischer, Johannes (2007): Data Structures for Efficient String Algorithms. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 96-98)