
Approximate Regular Expression Matching

1995; Wu, Manber, Myers

GONZALO NAVARRO

Department of Computer Science, University of Chile, Santiago, Chile

Keywords and Synonyms

Regular expression matching allowing errors or differences

Problem Definition

Given a text string $T = t_1 t_2 \ldots t_n$ and a regular expression $R$ of length $m$ denoting language $L(R)$, over an alphabet $\Sigma$ of size $\sigma$, and given a distance function among strings $d$ and a threshold $k$, the approximate regular expression matching (AREM) problem is to find all the text positions that finish a so-called approximate occurrence of $R$ in $T$, that is, compute the set $\{ j,\ \exists i,\ 1 \le i \le j,\ \exists P \in L(R),\ d(P, t_i \ldots t_j) \le k \}$. $T$, $R$, and $k$ are given together, whereas the algorithm can be tailored for a specific $d$.

This entry focuses on the so-called weighted edit distance, which is the minimum sum of weights of a sequence of operations converting one string into the other. The operations are insertions, deletions, and substitutions of characters. The weights are positive real values associated to each operation and characters involved. The weight of deleting a character $c$ is written $w(c \to \varepsilon)$, that of inserting $c$ is written $w(\varepsilon \to c)$, and that of substituting $c$ by $c' \neq c$ is written $w(c \to c')$. It is assumed that $w(c \to c) = 0$ for all $c \in \Sigma \cup \{\varepsilon\}$ and that the triangle inequality holds, that is, $w(x \to y) + w(y \to z) \ge w(x \to z)$ for any $x, y, z \in \Sigma \cup \{\varepsilon\}$. As the distance may be asymmetric, it is also fixed that $d(A, B)$ is the cost of converting $A$ into $B$. For simplicity and practicality, $m = o(n)$ is assumed in this entry.
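As an illustration of the distance just defined, the following is a minimal sketch of the classical dynamic program for the weighted edit distance between two plain strings. The function name and the three weight callbacks (`w_del`, `w_ins`, `w_sub`) are hypothetical names chosen for this example; the weights are assumed to satisfy the conditions stated above.

```python
def weighted_edit_distance(a, b, w_del, w_ins, w_sub):
    """Cost d(a, b) of converting string a into string b.

    w_del(c)    : weight of deleting c          (c -> empty)
    w_ins(c)    : weight of inserting c         (empty -> c)
    w_sub(c, e) : weight of substituting c by e (0 when c == e)
    """
    # prev[j] = cost of converting the current prefix of a into b[:j]
    prev = [0]
    for cb in b:
        prev.append(prev[-1] + w_ins(cb))
    for ca in a:
        cur = [prev[0] + w_del(ca)]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + w_del(ca),          # delete ca
                cur[j - 1] + w_ins(cb),       # insert cb
                prev[j - 1] + w_sub(ca, cb),  # substitute (or match)
            ))
        prev = cur
    return prev[-1]


def unit(_c):
    return 1


def unit_sub(c, e):
    return 0 if c == e else 1


print(weighted_edit_distance("kitten", "sitting", unit, unit, unit_sub))
```

With asymmetric weights (for instance deletions costing 2 and insertions costing 1), `weighted_edit_distance("ab", "a", ...)` and `weighted_edit_distance("a", "ab", ...)` differ, which is why the direction of the conversion has to be fixed.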

Key Results

The most versatile solution to the problem [3] is based on a graph model of the distance computation process. Assume the regular expression $R$ is converted into a nondeterministic finite automaton (NFA) with $O(m)$ states and transitions using Thompson's method [8]. Take this automaton as a directed graph $G(V, E)$ where edges are labeled by elements in $\Sigma \cup \{\varepsilon\}$. A directed and weighted graph $\mathcal{G}$ is built to solve the AREM problem. $\mathcal{G}$ is formed by putting $n + 1$ copies of $G$, namely $G_0, G_1, \ldots, G_n$, and connecting them with weights so that the distance computation reduces to finding shortest paths in $\mathcal{G}$.

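More formally, the nodes of $\mathcal{G}$ are $\{ v_i,\ v \in V,\ 0 \le i \le n \}$, so that $v_i$ is the copy of node $v \in V$ in graph $G_i$. For each edge $u \xrightarrow{c} v$ in $E$, $c \in \Sigma \cup \{\varepsilon\}$, the following edges are added to graph $\mathcal{G}$:

$$
\begin{aligned}
u_i &\to v_i, && \text{with weight } w(c \to \varepsilon), && 0 \le i \le n, \\
u_i &\to u_{i+1}, && \text{with weight } w(\varepsilon \to t_{i+1}), && 0 \le i < n, \\
u_i &\to v_{i+1}, && \text{with weight } w(c \to t_{i+1}), && 0 \le i < n.
\end{aligned}
$$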

Assume for simplicity that $G$ has initial state $s$ and a unique final state $f$ (this can always be arranged). As defined, the shortest path in $\mathcal{G}$ from $s_0$ to $f_n$ gives the smallest distance between $T$ and a string in $L(R)$. In order to adapt the graph to the AREM problem, the weights of the edges between $s_i$ and $s_{i+1}$ are modified to be zero.

Then, the AREM problem is reduced to computing shortest paths. It is not hard to see that $\mathcal{G}$ can be topologically sorted so that all the paths to nodes in $G_i$ are computed before all those to $G_{i+1}$. This way, it is not hard to solve this shortest path problem in $O(mn \log m)$ time and $O(m)$ space. Actually, if one restricts the problem to the particular case of network expressions, which are regular expressions without Kleene closure, then $G$ has no loops and the shortest path computation can be done in $O(mn)$ time, and even better on average [2].
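The layer-by-layer shortest path computation can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm of [3]: the NFA is assumed to be given as a list of hypothetical `(u, v, c)` edge triples (with `c = None` for an ε-edge), weights default to unit costs, and each layer is settled with Dijkstra's algorithm, which corresponds to the simple bound with the logarithmic factor. Only two layers are kept in memory at a time. All function and parameter names are invented for this example.

```python
import heapq

INF = float("inf")


def arem_end_positions(text, edges, num_states, start, final, k,
                       w_del=lambda c: 1,
                       w_ins=lambda c: 1,
                       w_sub=lambda c, t: 0 if c == t else 1):
    """Return the 1-based text positions j that end an approximate occurrence."""
    out = [[] for _ in range(num_states)]
    for u, v, c in edges:
        out[u].append((v, c))

    def settle(dist):
        # Shortest paths inside one layer: an edge labeled c costs w(c -> eps),
        # an eps-edge costs 0. Dijkstra handles the loops of Kleene closures.
        heap = [(d, u) for u, d in enumerate(dist) if d < INF]
        heapq.heapify(heap)
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for v, c in out[u]:
                nd = d + (0 if c is None else w_del(c))
                if nd < dist[v]:
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return dist

    dist = [INF] * num_states
    dist[start] = 0
    settle(dist)                      # layer 0
    ends = []
    for j, t in enumerate(text, start=1):
        new = [INF] * num_states
        new[start] = 0                # zero-weight edge between start copies
        for u in range(num_states):
            new[u] = min(new[u], dist[u] + w_ins(t))       # insert t
            for v, c in out[u]:
                if c is not None:                          # substitute c by t
                    new[v] = min(new[v], dist[u] + w_sub(c, t))
        dist = settle(new)            # layer j
        if dist[final] <= k:
            ends.append(j)
    return ends


# Hand-built NFA for the network expression "abc": 0 -a-> 1 -b-> 2 -c-> 3
abc_edges = [(0, 1, "a"), (1, 2, "b"), (2, 3, "c")]
print(arem_end_positions("xxabcxx", abc_edges, 4, 0, 3, k=1))
```

In this sketch the insertion edge is applied to every state, and the substitution edge is skipped for ε-edges because an insertion followed by the zero-cost ε-edge in the next layer gives the same cost.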

The most delicate part in achieving $O(mn)$ time for general regular expressions [3] is to prove that, given the types of loops that arise in the NFAs of regular expressions, it is possible to compute the distances correctly within each $G_i$ by (a) computing them in a topological order of $G_i$ without considering the back edges introduced by Kleene closures; (b) updating path costs by using the back edges once; (c) updating path costs once more in topological order ignoring back edges again.
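Steps (a) to (c) can be sketched for a single layer as below. The sketch assumes, as hypotheses of the example, that the NFA comes from Thompson's construction, that its states are numbered in a topological order of the graph without back edges, and that `forward_edges` is listed in that order by source state. The names `settle_layer`, `forward_edges`, `back_edges`, and `w_del` are invented here. Correctness of stopping after two sweeps relies on the result of [3]; the code does not check that the input has the required structure.

```python
INF = float("inf")


def settle_layer(dist, forward_edges, back_edges, w_del=lambda c: 1):
    """Propagate costs inside one layer with two forward sweeps.

    Edges are (u, v, c) triples; c is None for an eps-edge (cost 0),
    otherwise traversing the edge costs the deletion weight of c.
    """
    def relax(edges):
        for u, v, c in edges:
            cost = dist[u] + (0 if c is None else w_del(c))
            if cost < dist[v]:
                dist[v] = cost

    relax(forward_edges)   # (a) topological sweep, back edges ignored
    relax(back_edges)      # (b) use each back edge once
    relax(forward_edges)   # (c) second topological sweep
    return dist


# Hand-built NFA for a(bc)*d, states already in topological order:
# 0 -a-> 1, 1 -eps-> 2, 1 -eps-> 5, 2 -b-> 3, 3 -c-> 4, 4 -eps-> 5, 5 -d-> 6
forward = [(0, 1, "a"), (1, 2, None), (1, 5, None),
           (2, 3, "b"), (3, 4, "c"), (4, 5, None), (5, 6, "d")]
back = [(4, 2, None)]      # the back edge of the Kleene closure

costs = settle_layer([INF, INF, INF, 5, 0, INF, INF], forward, back)
print(costs)
```

In this example the cost of state 3 only improves during sweep (c), after the back edge has carried the cost of state 4 to the start of the loop.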

Theorem 1 (Myers and Miller 1989 [3]) There exists an $O(mn)$ worst-case time solution to the AREM problem under weighted edit distance.

It is possible to do better when the weights are integer-valued, by exploiting the unit-cost RAM model through a four-Russians technique [10]. The idea is as follows. Take a small subexpression of $R$, which produces an NFA that will translate into a small subgraph of each $G_i$. At the time of propagating path costs within this automaton, there will be a counter associated to each node (telling the current shortest path from $s_0$). This counter can be reduced to a number in $[0, k+1]$, where $k+1$ means "more than $k$". If the small NFA has $r$ states, $r \lceil \log_2(k+2) \rceil$ bits are needed to fully describe the counters of the corresponding subgraph of $G_i$. Moreover, given an initial set of values for the counters, it is possible to precompute all the propagation that will occur within the same subgraph of $G_i$, in a table having $2^{r \lceil \log_2(k+2) \rceil}$ entries, one per possible configuration of counters. It is sufficient that $r < \alpha \log_{k+2} n$ for some $\alpha < 1$ to make the construction and storage cost of those tables $o(n)$. With the help of those tables, all the propagation within the subgraph can be carried out in constant time. Similarly, the propagation of costs to the same subgraph at $G_{i+1}$ can also be precomputed in tables, as it depends only on the current counters in $G_i$ and on text character $t_{i+1}$, for which there are only $\sigma$ alternatives.
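The encoding of counters described above can be illustrated with a small sketch. It only shows how the counters of a small NFA with `r` states could be clamped and packed into one integer that indexes a precomputed table; it does not build the propagation tables of [10]. The helper names are hypothetical.

```python
def bits_per_counter(k):
    # A counter takes values 0..k+1, i.e. k+2 distinct values, so it needs
    # ceil(log2(k+2)) bits; (k+1).bit_length() computes this exactly.
    return (k + 1).bit_length()


def pack_counters(counters, k):
    """Clamp each counter to k+1 ("more than k") and pack them into one int."""
    b = bits_per_counter(k)
    code = 0
    for pos, value in enumerate(counters):
        code |= min(value, k + 1) << (pos * b)
    return code


def unpack_counters(code, r, k):
    """Inverse of pack_counters for a small NFA with r states."""
    b = bits_per_counter(k)
    mask = (1 << b) - 1
    return [(code >> (pos * b)) & mask for pos in range(r)]


def table_entries(r, k):
    # One table entry per possible bit configuration of the r counters.
    return 1 << (r * bits_per_counter(k))


code = pack_counters([0, 5, 2], k=3)      # 5 is clamped to k + 1 = 4
print(code, unpack_counters(code, r=3, k=3), table_entries(r=3, k=3))
```

A table indexed by such codes would map each configuration to the configuration reached after propagation, so that one lookup replaces the traversal of the small subgraph.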

Now, take all the subtrees of $R$ of maximum size not exceeding $r$ and preprocess them with the technique above. Convert each such subtree into a leaf in $R$ labeled by a special character $a_A$, associated to the corresponding small NFA $A$. Unless there are consecutive Kleene closures in $R$, which can be simplified as $R^{**} = R^{*}$, the size of $R$ after this transformation is $O(m/r)$. Call $R'$ the transformed regular expression. One essentially applies the technique of Theorem 1 to $R'$, taking care of how to deal with the special leaves that correspond to small NFAs. Those leaves are converted by Thompson's construction into two nodes linked by an edge labeled $a_A$. When the path cost propagation process reaches the source node of an edge labeled $a_A$ with cost $c$, one must update the counter of the initial state of NFA $A$ to $c$ (or $k+1$ if $c > k$). One then uses the four-Russians table to do all the cost propagation within $A$ in constant time, and finally obtain, at the counter of the final state of $A$, the new value for the target node of the edge labeled $a_A$ in the top-level NFA. Therefore, all the edges (normal and special) of the top-level NFA can be traversed in constant time, so the costs at $G_i$ can be obtained in $O(mn/r)$ time using Theorem 1. Now one propagates the costs to $G_{i+1}$, using the four-Russians tables to obtain the current counter values of each subgraph $A$ in $G_{i+1}$.

Theorem 2 (Wu et al. 1995 [10]) There exists an $O(n + mn / \log_{k+2} n)$ worst-case time solution to the AREM problem under weighted edit distance if the weights are integer numbers.

Applications

The problem has applications in computational biology, to find certain types of motifs in DNA and protein sequences. See [1] for a more detailed discussion. In particular, PROSITE patterns are limited regular expressions rather popular to search protein sequences. PROSITE patterns can be searched for with faster algorithms in practice [7]. The same occurs with other classes of complex patterns [6] and network expressions [2].

Open Problems

The worst-case complexity of the AREM problem is not fully understood. It is of course $\Omega(n)$, which has been achieved for $m \log(k+2) = O(\log n)$, but it is not known how much this can be improved.

Experimental Results

Some recent experiments are reported in [5]. For small $m$ and $k$, and assuming all the weights are 1 (except $w(c \to c) = 0$), bit-parallel algorithms of worst-case complexity $O(kn(m/\log n)^2)$ [4,9] are the fastest (the second is able to skip some text characters, depending on $R$). For arbitrary integer weights, the best choice is a more complex bit-parallel algorithm [5]; or the four-Russians based one [10] for larger $m$ and $k$. The original algorithm [3] is slower but it is the only one supporting arbitrary weights.

URL to Code

Well-known packages offering efficient AREM (for simplified weight choices) are agrep [9] (http://webglimpse.net/download.html, top-level subdirectory agrep/) and nrgrep [4] (http://www.dcc.uchile.cl/~gnavarro/software). For biological applications, anrep [2] (http://www.cs.arizona.edu/people/gene/CODE/anrep.tar.Z) matches sequences of approximate network expressions with arbitrary weights and a specified gap length between each network expression and the next.

Cross References

Regular Expression Matching is the simplified case where exact matching with strings in $L(R)$ is sought.

Sequential Approximate String Matching is a simplification of this problem, and the relation between graph $\mathcal{G}$ here and matrix $C$ there should be apparent.

Recommended Reading

1. Gusfield, D.: Algorithms on strings, trees and sequences. Cambridge University Press, Cambridge (1997)

2. Myers, E.W.: Approximate matching of network expressions with spacers. J. Comput. Biol. 3(1), 33–51 (1996)

3. Myers, E.W., Miller, W.: Approximate matching of regular expressions. Bullet. Math. Biol. 51, 7–37 (1989)

4. Navarro, G.: Nr-grep: a fast and flexible pattern matching tool. Softw. Pr. Exp. 31, 1265–1312 (2001)

5. Navarro, G.: Approximate regular expression searching with arbitrary integer weights. Nord. J. Comput. 11(4), 356–373 (2004)

6. Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge (2002)

7. Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. 10(6), 903–923 (2003)

8. Thompson, K.: Regular expression search algorithm. Commun. ACM 11(6), 419–422 (1968)

9. Wu, S., Manber, U.: Fast text searching allowing errors. Commun. ACM 35(10), 83–91 (1992)

10. Wu, S., Manber, U., Myers, E.W.: A subquadratic algorithm for approximate regular expression matching. J. Algorithms 19(3), 346–360 (1995)
