Improved Single and Multiple Approximate String Matching

(1)

Kimmo Fredriksson

Department of Computer Science, University of Joensuu, Finland

Gonzalo Navarro

Department of Computer Science, University of Chile

(2)

The Problem Setting & Complexity

Given a text $T[1 \ldots n]$ and a pattern $P[1 \ldots m]$ over some finite alphabet $\Sigma$ of size $\sigma$, find the approximate occurrences of $P$ in $T$, allowing at most $k$ differences (edit operations).

Exact matching (single pattern) lower bound: $\Omega(n \log_\sigma(m) / m)$ character comparisons (Yao, 79).

Approximate matching lower bound: $\Omega(n (k + \log_\sigma m) / m)$ (Chang & Marr, 94).

We will search simultaneously for a set $\mathcal{P} = \{P_1, P_2, \ldots, P_r\}$ of $r$ patterns.

$\Omega(n (k + \log_\sigma(rm)) / m)$ lower bound for $r$ patterns (Fredriksson & Navarro, 2003)
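To make the notion of "differences" concrete, here is a minimal Python sketch of the edit distance, where each insertion, deletion or substitution costs one. The function name `edit_distance` is an illustrative choice and does not come from the slides. A substring of the text counts as an approximate occurrence when its distance to the pattern is at most $k$.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # a[:i] against the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def is_approximate_occurrence(pattern: str, substring: str, k: int) -> bool:
    """True if substring matches pattern with at most k differences."""
    return edit_distance(pattern, substring) <= k
```

For example, `is_approximate_occurrence("ACGT", "AGT", 1)` holds because deleting one character is enough.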

(3)

Previous work

Only a few algorithms exist for multipattern approximate searching under the $k$ differences model.

Naïve approach: search the $r$ patterns separately, using any of the single pattern search algorithms.

(Muth & Manber, 1996): an algorithm based on hashing whose average time and space bounds depend on $r$; it works only for $k = 1$.

(Baeza-Yates & Navarro, 1997): partitioning into exact search, with an average-case bound (plus preprocessing that depends on $r$) that can be improved further; it works only under a condition that depends on $r$.

Other less interesting ones.

CPM’04 – p.3/26

(4)

Previous work

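(Fredriksson & Navarro, 2003): The first average-optimal algorithm.

Average-optimal $O(n (k + \log_\sigma(rm)) / m)$ up to error level $k/m < 1/3 - O(1/\sqrt{\sigma})$.

Linear on average up to error level $k/m < 1/2 - O(1/\sqrt{\sigma})$.

(Hyyrö, Fredriksson & Navarro, 2004): a worst-case bound that depends on $r$, for short patterns.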

(5)

This work

We have improved the (optimal) algorithm of (Fredriksson & Navarro, 2003):

Faster in practice, and...

...allows error levels up to $k/m < 1/2 - O(1/\sqrt{\sigma})$.

Our algorithm runs in $O(n (k + \log_\sigma(rm)) / m)$ average time, which is optimal.

The preprocessing time and the space requirement both depend on the number of patterns $r$.

The fastest algorithm in practice for intermediate difference ratios and small $r$.

(6)

The method in brief:

The algorithm is based on the preprocessing/filtering/verification paradigm.

The preprocessing phase generates all $\sigma^\ell$ strings of length $\ell$, and computes their minimum distance over the set of patterns.

The filtering phase searches (approximately) for text $\ell$-grams in the patterns, using the precomputed distance table, accumulating the differences.

The verification phase uses a dynamic programming algorithm, and is applied to each pattern

(7)

Preprocessing

Build a table $D$ as follows:

1. Choose a number $\ell$ in the range $1 \le \ell \le m - k$.

2. For every string $S$ of length $\ell$ ($\ell$-gram), search for $S$ in the patterns.

3. Store in $D[S]$ the smallest number of differences needed to match $S$ inside any of the patterns (a number between 0 and $\ell$).

$D$ requires space for $\sigma^\ell$ entries, and the time needed to compute it grows with $r$.
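The three preprocessing steps can be sketched in Python as follows. This is a straightforward version that runs one dynamic programming pass per $\ell$-gram and pattern; it makes no attempt to reach the preprocessing bound of the slides, and the names `min_diff_inside` and `build_table` are illustrative assumptions.

```python
from itertools import product


def min_diff_inside(gram: str, pattern: str) -> int:
    """Fewest differences needed to match gram against any substring of pattern."""
    # col[i]: cost of matching gram[:i] against a substring ending at the
    # current pattern position (initially the empty substring).
    col = list(range(len(gram) + 1))
    best = col[-1]                  # matching against nothing costs len(gram)
    for pc in pattern:
        new = [0]                   # a match may start anywhere in the pattern
        for i, gc in enumerate(gram, start=1):
            new.append(min(col[i] + 1,                # skip pattern character pc
                           new[i - 1] + 1,            # skip gram character gc
                           col[i - 1] + (gc != pc)))  # match or substitute
        col = new
        best = min(best, col[-1])
    return best


def build_table(patterns, alphabet: str, ell: int) -> dict:
    """D[S] = minimum over all patterns of min_diff_inside(S, pattern)."""
    table = {}
    for chars in product(alphabet, repeat=ell):       # all sigma**ell grams
        gram = "".join(chars)
        table[gram] = min(min_diff_inside(gram, p) for p in patterns)
    return table
```

With `build_table(["ACGT"], "ACGT", 2)` the table has 16 entries, and every entry lies between 0 and $\ell = 2$.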

(8)

Filtering

Any occurrence is at least $m - k$ characters long, so use a sliding window of $m - k$ characters over $T$.

Invariant: all occurrences starting before the window are already reported.

Read $\ell$-grams $S_1, S_2, \ldots$ from the text window, from right to left:

[Figure: the text window of $m - k$ characters in $T$, with the $\ell$-grams $S_3$, $S_2$, $S_1$ at its right end.]

Any occurrence starting at the beginning of the window must contain all the $\ell$-grams.

(9)

Filtering

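Accumulate a sum of necessary differences: $M = \sum_{t=1}^{i} D[S_t]$.

If $M$ becomes $> k$ for some (i.e. the smallest) $i$, then no occurrence can contain the $\ell$-grams $S_i, \ldots, S_1$, so slide the window past the first character of $S_i$.

[Figure: example of a shift, showing the window in $T$ with $S_3$, $S_2$, $S_1$ and the new window position.]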
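The filtering loop described above can be sketched as follows. This is a minimal Python illustration, not the authors' C implementation: it omits the bit-parallel counters, returns the surviving window positions instead of verifying them, and takes the table `D` as a plain dictionary. The toy table at the end relies on an assumption that holds for this example only: the pattern uses every letter of the alphabet, so a 2-gram needs 0 differences if it occurs in the pattern and exactly 1 otherwise.

```python
from itertools import product


def filter_windows(text, D, m, k, ell):
    """Return the start positions of the windows that the filter cannot discard."""
    w = m - k                        # window length: no occurrence is shorter
    candidates = []
    pos = 0
    while pos + w <= len(text):
        total = 0                    # accumulated lower bound on the differences
        start = pos + w              # left edge of the last l-gram read so far
        while start - ell >= pos and total <= k:
            start -= ell             # next l-gram, moving right to left
            total += D[text[start:start + ell]]
        if total > k:
            pos = start + 1          # no unreported occurrence starts at or before `start`
        else:
            candidates.append(pos)   # left for the verification phase
            pos += 1
    return candidates


# Toy input: one pattern, k = 1, ell = 2 (see the assumption stated above).
P = "ACGTACGT"
D = {a + b: 0 if a + b in P else 1 for a, b in product("ACGT", repeat=2)}
text = "TTTTTTTT" + P + "TTTTTTTT"
candidates = filter_windows(text, D, m=len(P), k=1, ell=2)
```

On this input the windows lying entirely inside the runs of `T` are skipped, while every window that starts an occurrence with at most one difference (positions 7, 8 and 9) survives.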

(10)

Verification

If $M \le k$, then the window might contain an occurrence. The occurrence can be $m + k$ characters long, so verify the area $T[j \ldots j + m + k - 1]$, where $j$ is the starting position of the window.

[Figure: the window in $T$ with $S_3$, $S_2$, $S_1$, and the verification area of $m + k$ characters.]

The verification is done for each of the $r$ patterns, using a standard dynamic programming algorithm.
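A minimal sketch of the verification step, assuming the plain dynamic programming formulation (the slides also mention hierarchical and bit-parallel verification, which is not shown here). The functions `occurs_at` and `verify` are hypothetical names; they check whether an occurrence with at most $k$ differences starts exactly at the window start, which is what the invariant requires.

```python
def occurs_at(text: str, start: int, pattern: str, k: int) -> bool:
    """True if some text[start:start+L] matches pattern with at most k differences."""
    m = len(pattern)
    area = text[start:start + m + k]     # an occurrence is at most m + k characters long
    col = list(range(m + 1))             # pattern prefixes against the empty text prefix
    for tc in area:
        new = [col[0] + 1]               # the occurrence must begin exactly at `start`
        for i, pc in enumerate(pattern, start=1):
            new.append(min(col[i] + 1,                # extra text character tc
                           new[i - 1] + 1,            # missing pattern character pc
                           col[i - 1] + (pc != tc)))  # match or substitute
        col = new
        if col[m] <= k:                  # whole pattern matched within k differences
            return True
    return False


def verify(text: str, start: int, patterns, k: int):
    """Indices of the patterns that have an occurrence starting at `start`."""
    return [idx for idx, p in enumerate(patterns) if occurs_at(text, start, p, k)]
```

Each call costs time proportional to $m(m + k)$ per pattern, which is why the filter tries to discard as many windows as possible before this step.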

(11)

Stricter matching condition

Our basic algorithm: text $\ell$-grams can match anywhere inside the patterns.

If $M > k$, then we know that no occurrence can contain the $\ell$-grams $S_i, \ldots, S_1$ in any position.

The matching area can be made smaller without losing this property.

(12)

Stricter matching condition

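Consider an approximate occurrence of $S_2 S_1$ inside the pattern.

$S_2$ cannot be closer than $\ell$ positions from the end of the pattern.

For $S_2$ precompute a table $D_2$, which considers its best match in the area $P[1 \ldots m - \ell]$ rather than $P[1 \ldots m]$.

In general, for $S_t$ preprocess a table $D_t$, using the area $P[1 \ldots m - (t-1)\ell]$.

Compute $M$ as $M = \sum_{t=1}^{i} D_t[S_t]$.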

(13)

Stricter matching condition

[Figure: the text window in $T$ with $S_3$, $S_2$, $S_1$ aligned against $P$, showing the areas of $P$ used for $D_1[S_1]$, $D_2[S_2]$ and $D_3[S_3]$.]

(14)

Stricter matching condition

$D_t[S] \ge D[S]$ for any $t$ and $S$, so the accumulated sum is never smaller than for the basic method and the window can be shifted at least as early.

Hence this variant never examines more $\ell$-grams, verifies more windows, nor shifts less.

Drawback: needs more space and preprocessing effort, so it can be slower in practice.

The matching condition can be made even stricter, to work less per window...
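The position-dependent tables can be sketched as below. This is a sketch under a loudly stated assumption: the area for table $t$ is taken to be the first $m - (t-1)\ell$ characters of the pattern, and only a single pattern is handled. The helper names are hypothetical. One dynamic programming pass per $\ell$-gram yields the values for all prefix areas at once, by keeping a running minimum.

```python
from itertools import product


def prefix_minima(gram: str, pattern: str):
    """best[j] = fewest differences needed to match gram inside pattern[:j]."""
    col = list(range(len(gram) + 1))
    best = [col[-1]]                       # empty area: the whole gram is unmatched
    for pc in pattern:
        new = [0]                          # a match may start anywhere in the area
        for i, gc in enumerate(gram, start=1):
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + (gc != pc)))
        col = new
        best.append(min(best[-1], col[-1]))   # running minimum over prefix areas
    return best


def build_strict_tables(pattern: str, alphabet: str, ell: int, count: int):
    """tables[t-1][S] uses only the first m - (t-1)*ell pattern characters (assumed area)."""
    m = len(pattern)
    tables = [dict() for _ in range(count)]
    for chars in product(alphabet, repeat=ell):
        gram = "".join(chars)
        best = prefix_minima(gram, pattern)
        for t in range(1, count + 1):
            tables[t - 1][gram] = best[max(m - (t - 1) * ell, 0)]
    return tables
```

Because each area is a prefix of the previous one, the entries can only grow with $t$, which matches the claim that no table entry is smaller than in the basic method. The price is `count` tables instead of one.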

(15)

Analysis

It can be shown that the basic algorithm has optimal average case complexity $O(n (k + \log_\sigma(rm)) / m)$. This holds for $k/m < 1/2 - O(1/\sqrt{\sigma})$.

The worst case complexity (filtering + verification) can also be bounded. That bound, the preprocessing cost and the space requirement all depend on $r$.

Since the algorithm with the stricter matching

condition is never worse than the basic version, it is also optimal.

(16)

Analysis

For a single pattern our complexity is the same as the algorithm of Chang & Marr, i.e. $O(n (k + \log_\sigma m) / m)$...

...but our filter works up to $k/m < 1/2 - O(1/\sqrt{\sigma})$, whereas the filter of Chang & Marr works only up to $k/m < 1/3 - O(1/\sqrt{\sigma})$.

(17)

Experimental results

Implementation in C, compiled using icc 7.1 with full optimizations, run on a 2 GHz Pentium 4 with 512 MB RAM, running Linux 2.4.18.

Experiments for alphabet sizes $\sigma = 4$ (DNA) and $\sigma = 20$ (proteins), both random and real texts.

Text lengths were 64 MB, and patterns 64 characters.

In the implementation we used several practical improvements described in (Fredriksson & Navarro, 2003):

Bit-parallel counters

Hierarchical / bit-parallel verification

(18)

Experimental results

For DNA and for proteins we used the maximum values of $\ell$ that are usable in practice; otherwise the preprocessing cost becomes too high.

Analytical results give a range of $\ell$ values for DNA and another for proteins (depending on $r$).

Although our algorithms are fast, in practice they cannot cope with as high difference ratios as

(19)

Experimental results

Comparison against:

CM: Our previous optimal filtering algorithm

LT: Our previous linear time filter

EXP: Partitioning into exact search

MM: Muth & Manber algorithm, works only for $k = 1$

ABNDM: Approximate BNDM algorithm, a single pattern approximate search algorithm extending classical BDM.

BPM: Bit-parallel Myers, currently the best non-filtering algorithm for single patterns.

(20)

Experimental results

Comparison against Muth and Manber ($k = 1$):

Alg.    DNA
MM      1.30    3.97    12.86    42.52
Ours    0.08    0.12     0.21     0.54

Alg.    proteins
MM      1.17    1.19     1.26     2.33

(21)

Experimental results

[Figure: search time (s) as a function of $k$ on random DNA, for Ours ($\ell = 6$), Ours ($\ell = 8$), Ours strict, Ours strictest, CM, LT, EXP, BPM and ABNDM.]

(22)

Experimental results

[Figure: search time (s) as a function of $k$ on random DNA.]

(23)

Experimental results

[Figure: search time (s) as a function of $k$ on random proteins, for Ours, Ours stricter, Ours strictest, CM, LT, EXP, BPM and ABNDM.]

(24)

Experimental results

[Figure: search time (s) as a function of $k$ on random proteins, for Ours, CM and BPM.]

(25)

Experimental results

Areas where each algorithm performs best. From left to right, DNA and proteins. Top row: random data. Bottom row: real data.

[Figure: four panels with axes $k$ (1 to 16) and $r$ (1, 16, 64, 256), marking the regions where BPM, Ours and EXP are the fastest.]

(26)

Conclusions

Our new algorithm becomes the fastest for low $k/m$. The larger $r$, the smaller $k/m$ values are tolerated.

When applied to just one pattern, our algorithm becomes the fastest for low difference ratios.

Our basic algorithm usually beats the extensions. True only if we use the same parameter $\ell$ for both algorithms.

For limited memory we can use the stricter matching condition with smaller $\ell$, and beat the basic algorithm