Chapter Introduction - Fischer, Johannes (2007): Data Structures for Efficient String Al

As we have already seen in Chapter 2, one of the most important tasks in classical string matching is to construct an index on the input text in order to answer future queries faster. Well-known examples of such indexes include suffix trees, word graphs, and suffix arrays (see, e.g., Gusfield, 1997). Despite the extensive research that has been done in the last three or four decades, this topic has recently regained popularity with the rise of compressed indexes (see Navarro and Mäkinen, 2007) and new applications such as data compression, text mining, and computational linguistics.

However, all of the indexes mentioned above are full-text indexes, in the sense that they index every position in the text and thus allow searching for occurrences of patterns starting at arbitrary text positions. In many situations, deploying the full-text feature might be like using a “cannon to shoot a fly,” with undesired negative impacts on both query time and space usage. For example, in European languages, words are separated by special symbols such as spaces or punctuation signs; in a dictionary of URLs, “words” are separated by dots and slashes. In both cases, the results found by a word-based search with a full-text index would have to be filtered by discarding those that do not occur at word boundaries, possibly a time-costly step. Additionally, indexing every text position would affect the overall space occupancy of the index, increasing it in practice by an estimated factor of 5–6, given the average word length of linguistic texts. Of course, the use of word-based indexes is not limited to pattern searches, as they have been successfully used in many other contexts, like data compression (Isal and Moffat, 2001) and computational linguistics (Yamamoto and Church, 2001), just to cite a few.
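The filtering step described above can be illustrated with a minimal Python sketch. It is not taken from the dissertation: the function names, the separator set, and the use of `str.find` as a stand-in for a real full-text index query are all assumptions made for illustration.

```python
def word_starts(text, separators=" .,;:!?/\n\t"):
    """Return the set of positions at which a word begins.

    The separator set is a hypothetical choice for this sketch.
    """
    starts = set()
    prev_is_sep = True  # position 0 counts as following a separator
    for i, ch in enumerate(text):
        is_sep = ch in separators
        if prev_is_sep and not is_sep:
            starts.add(i)
        prev_is_sep = is_sep
    return starts


def full_text_occurrences(text, pattern):
    """All start positions of pattern in text.

    Stand-in for a query against a full-text index; a real index
    would not rescan the text.
    """
    hits = []
    pos = text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


def word_aligned_occurrences(text, pattern):
    """Keep only the occurrences that start at a word beginning."""
    starts = word_starts(text)
    return [p for p in full_text_occurrences(text, pattern) if p in starts]


text = "the other theme bathe the"
print(full_text_occurrences(text, "the"))     # [0, 5, 10, 18, 22]
print(word_aligned_occurrences(text, "the"))  # [0, 10, 22]
```

In this example two of the five full-text hits are discarded after the query, which is the extra work a word-based index avoids.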

Surprisingly enough, word-based indexes have been introduced only recently in the string-matching literature (Andersson et al., 1999), although they had been well known in Information Retrieval for many years (cf. Witten et al., 1999). The basic idea underlying their design consists of storing just a subset of the text positions, namely the ones that correspond to word beginnings. It is actually easy to construct such indexes if O(n) additional space is allowed at construction time (n being the text size): simply build the normal index for every position in the text and then discard those positions which do not correspond to word beginnings. Unfortunately, such a simple (and common, among practitioners!) approach is not space optimal. In fact, O(n) construction time cannot be improved, because this is the time needed to scan the input text. But O(n) additional working space (other than the indexed text) seems too much because the final index will need only O(k) space, where k is the number of words in the indexed text. This is an interesting issue, not only theoretically, because “... we have seen many papers in which the index simply ‘is,’ without discussion of how it was created. But for an indexing scheme to be useful it must be possible for the index to be constructed in a reasonable amount of time” (Zobel et al., 1996). And in fact, the working-space occupancy of construction algorithms for full-text indexes is still a primary concern and an active field of research (Hon et al., 2003).
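The simple build-then-discard approach can be sketched in Python as follows. This is an illustrative sketch only, not the construction developed in this chapter. The function name and the single-space separator are assumptions, the full suffix array is built with a plain comparison sort rather than a linear-time algorithm, and suffixes are ordered character by character.

```python
def naive_word_suffix_array(text, separators=" "):
    """Build a full suffix array, then keep only the word beginnings.

    The intermediate array has n entries, although the result needs
    only k entries (k = number of words); this is the O(n) working
    space criticized above.
    """
    n = len(text)
    # Full suffix array over all n positions (comparison sort; the slice
    # keys cost extra time and space, acceptable only for illustration).
    full_sa = sorted(range(n), key=lambda i: text[i:])
    # A position is a word beginning if it holds a non-separator that
    # follows a separator or is the first character of the text.
    is_word_start = [
        text[i] not in separators and (i == 0 or text[i - 1] in separators)
        for i in range(n)
    ]
    # Discard every position that is not a word beginning.
    return [i for i in full_sa if is_word_start[i]]


print(naive_word_suffix_array("ab a abc"))  # [3, 0, 5]
```

For the text "ab a abc" the full array has 8 entries, of which only the 3 word beginnings survive the final filtering pass.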

The first result addressing this issue in the word-based indexing realm is due to Andersson et al. (1999), who showed that the so-called word suffix tree can be constructed in O(n) expected time and deterministic O(k) working space. Inenaga and Takeda (2006a) improved this result by providing an on-line algorithm which runs in O(n) time in the worst case and O(k) space in addition to the indexed text. They also gave two alternative indexing structures (Inenaga and Takeda, 2006b,c) which are generalizations of Directed Acyclic Word Graphs (DAWGs) and compact DAWGs (CDAWGs), respectively. The compact version has the same worst-case guarantees as the suffix tree, while being smaller in practice. All of Inenaga and Takeda’s construction methods are variations of the construction algorithms for (usual) suffix trees (Ukkonen, 1995), DAWGs (Blumer et al., 1985) and CDAWGs (Inenaga et al., 2005), respectively.

The only missing item in this quartet is a word-based analog of the suffix array, a gap which we close in this chapter. We emphasize the fact that, as is the case with full-text suffix arrays (see, e.g., Kärkkäinen et al., 2006), we get a class-note solution which is simple and practically effective, thus surpassing the previous ones in every respect.

A comment is in order here. A more general problem than word-based string matching is that of sparse string matching, where the set of points to be indexed is given as an arbitrary set, not necessarily coinciding with the word boundaries. Although Inenaga and Takeda (2006a,b,c) claim that their indexes can solve this task as well, they did not take into account that search time becomes exponential in the pattern length in this case. To the best of our knowledge, this problem is still open. The only step in this direction has been made by Kärkkäinen and Ukkonen (1996), who considered the special case where the indexed positions are evenly spaced.

In document Fischer, Johannes (2007): Data Structures for Efficient String Algorithms. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 111-113)