FAST PATTERN MATCHING IN STRINGS*


DONALD E. KNUTH†, JAMES H. MORRIS, JR.‡ AND VAUGHAN R. PRATT§

Abstract. An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings. The constant of proportionality is low enough to make this algorithm of practical use, and the procedure can also be extended to deal with some more general pattern-matching problems. A theoretical application of the algorithm shows that the set of concatenations of even palindromes, i.e., the language $\{\alpha\alpha^R\}^*$, can be recognized in linear time. Other algorithms which run even faster on the average are also considered.

Key words. pattern, string, text-editing, pattern-matching, trie memory, searching, period of a string, palindrome, optimum algorithm, Fibonacci string, regular expression

Text-editing programs are often required to search through a string of characters looking for instances of a given "pattern" string; we wish to find all positions, or perhaps only the leftmost position, in which the pattern occurs as a contiguous substring of the text. For example, catenary contains the pattern ten, but we do not regard canary as a substring.

The obvious way to search for a matching pattern is to try searching at every starting position of the text, abandoning the search as soon as an incorrect character is found. But this approach can be very inefficient, for example when we are looking for an occurrence of aaaaaaab in aaaaaaaaaaaaaab. When the pattern is $a^n b$ and the text is $a^{2n} b$, we will find ourselves making $(n+1)^2$ comparisons of characters. Furthermore, the traditional approach involves "backing up" the input text as we go through it, and this can add annoying complications when we consider the buffering operations that are frequently involved.
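The comparison count for the obvious method can be checked with a short Python sketch. This is an illustration only; the function name `naive_search` and the 0-based result convention are assumptions made here, not part of the paper.

```python
def naive_search(pattern, text):
    """Try every starting position; return (0-based match start or -1, comparisons made)."""
    m, n = len(pattern), len(text)
    comparisons = 0
    for start in range(n - m + 1):
        for i in range(m):
            comparisons += 1
            if text[start + i] != pattern[i]:
                break              # abandon this alignment at the first mismatch
        else:
            return start, comparisons   # every pattern character matched
    return -1, comparisons


n = 7
print(naive_search("a" * n + "b", "a" * (2 * n) + "b"))   # (7, 64), i.e. (n, (n + 1) ** 2)
```

Each of the n + 1 alignments costs n + 1 comparisons on this input, which is where the quadratic count comes from.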

In this paper we describe a pattern-matching algorithm which finds all occurrences of a pattern of length m within a text of length n in O(m + n) units of time, without "backing up" the input text. The algorithm needs only O(m) locations of internal memory if the text is read from an external file, and only O(log m) units of time elapse between consecutive single-character inputs. All of the constants of proportionality implied by these "O" formulas are independent of the alphabet size.

* Received by the editors August 29, 1974, and in revised form April 7, 1976.

† Computer Science Department, Stanford University, Stanford, California 94305. The work of this author was supported in part by the National Science Foundation under Grant GJ 36473X and by the Office of Naval Research under Contract NR 044-402.

‡ Xerox Palo Alto Research Center, Palo Alto, California 94304. The work of this author was supported in part by the National Science Foundation under Grant GP 7635 at the University of California, Berkeley.

§ Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139. The work of this author was supported in part by the National Science Foundation under Grant GP-6945 at University of California, Berkeley, and under Grant GJ-992 at Stanford University.


We shall first consider the algorithm in a conceptually simple but somewhat inefficient form. Sections 3 and 4 of this paper discuss some ways to improve the efficiency and to adapt the algorithm to other problems. Section 5 develops the underlying theory, and §6 uses the algorithm to disprove the conjecture that a certain context-free language cannot be recognized in linear time. Section 7 discusses the origin of the algorithm and its relation to other recent work. Finally, §8 discusses still more recent work on pattern matching.

1. Informal development. The idea behind this approach to pattern matching is perhaps easiest to grasp if we imagine placing the pattern over the text and sliding it to the right in a certain way. Consider for example a search for the pattern abcabcacab in the text babcbabcabcaabcabcabcacabc; initially we place the pattern at the extreme left and prepare to scan the leftmost character of the input text:

    abcabcacab
    babcbabcabcaabcabcabcacabc
    ↑

The arrow here indicates the current text character; since it points to b, which doesn't match that a, we shift the pattern one space right and move to the next input character:

     abcabcacab
    babcbabcabcaabcabcabcacabc
     ↑

Now we have a match, so the pattern stays put while the next several characters are scanned. Soon we come to another mismatch:

     abcabcacab
    babcbabcabcaabcabcabcacabc
        ↑

At this point we have matched the first three pattern characters but not the fourth, so we know that the last four characters of the input have been abcx where x ≠ a; we don't have to remember the previously scanned characters, since our position in the pattern yields enough information to recreate them. In this case, no matter what x is (as long as it's not a), we deduce that the pattern can immediately be shifted four more places to the right; one, two, or three shifts couldn't possibly lead to a match.

Soon we get to another partial match, this time with a failure on the eighth pattern character:

         abcabcacab
    babcbabcabcaabcabcabcacabc
                ↑

Now we know that the last eight characters were abcabcax, where x ≠ c. The pattern should therefore be shifted three places to the right:

            abcabcacab
    babcbabcabcaabcabcabcacabc
                ↑

We try to match the new pattern character, but this fails too, so we shift the pattern four (not three or five) more places. That produces a match, and we continue scanning until reaching another mismatch on the eighth pattern character:

                abcabcacab
    babcbabcabcaabcabcabcacabc
                       ↑

Again we shift the pattern three places to the right; this time a match is produced, and we eventually discover the full pattern:

                   abcabcacab
    babcbabcabcaabcabcabcacabc
                            ↑

The play-by-play description for this example indicates that the pattern-matching process will run efficiently if we have an auxiliary table that tells us exactly how far to slide the pattern, when we detect a mismatch at its jth character pattern[j]. Let next[j] be the character position in the pattern which should be checked next after such a mismatch, so that we are sliding the pattern j − next[j] places relative to the text. The following table lists the appropriate values:

    j          =  1  2  3  4  5  6  7  8  9  10
    pattern[j] =  a  b  c  a  b  c  a  c  a  b
    next[j]    =  0  1  1  0  1  1  0  5  0  1

(Note that next[j] = 0 means that we are to slide the pattern all the way past the current text character.) We shall discuss how to precompute this table later; fortunately, the calculations are quite simple, and we will see that they require only O(m) steps.

At each step of the scanning process, we move either the text pointer or the pattern, and each of these can move at most n times; so at most 2n steps need to be performed, after the next table has been set up. Of course the pattern itself doesn't really move; we can do the necessary operations simply by maintaining the pointer variable j.


2. Programming the algorithm. The pattern-match process has the general form

    place pattern at left;
    while pattern not fully matched and text not exhausted do
      begin
        while pattern character differs from current text character
          do shift pattern appropriately;
        advance to next character of text;
      end;

For convenience, let us assume that the input text is present in an array text[1:n], and that the pattern appears in pattern[1:m]. We shall also assume that m > 0, i.e., that the pattern is nonempty. Let k and j be integer variables such that text[k] denotes the current text character and pattern[j] denotes the corresponding pattern character; thus, the pattern is essentially aligned with positions p + 1 through p + m of the text, where k = p + j. Then the above program takes the following simple form:

    j := k := 1;
    while j ≤ m and k ≤ n do
      begin
        while j > 0 and text[k] ≠ pattern[j]
          do j := next[j];
        k := k + 1; j := j + 1;
      end;

If j > m at the conclusion of the program, the leftmost match has been found in positions k − m through k − 1; but if j ≤ m, the text has been exhausted. (The and operation here is the "conditional and" which does not evaluate the relation text[k] ≠ pattern[j] unless j > 0.) The program has a curious feature, namely that the inner loop operation "j := next[j]" is performed no more often than the outer loop operation "k := k + 1"; in fact, the inner loop is usually performed somewhat less often, since the pattern generally moves right less frequently than the text pointer does.
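The matching program translates almost line for line into Python. The sketch below is an illustration under stated assumptions: the name `kmp_leftmost`, the padding trick that makes Python strings 1-indexed, and the two counters are introduced here; the next table is the one listed in §1 and is passed in rather than computed.

```python
# next table for the pattern abcabcacab, copied from the table in Section 1.
# Index 0 is a placeholder so that NEXT[j] agrees with the paper's next[j].
NEXT = [None, 0, 1, 1, 0, 1, 1, 0, 5, 0, 1]


def kmp_leftmost(pattern, text, nxt):
    """Return (1-based start of leftmost match or None, inner-loop count, outer-loop count)."""
    m, n = len(pattern), len(text)
    p, s = " " + pattern, " " + text      # pad so that indices start at 1
    j = k = 1
    inner = outer = 0
    while j <= m and k <= n:
        # Python's `and` short-circuits, so it behaves as the "conditional and".
        while j > 0 and s[k] != p[j]:
            j = nxt[j]
            inner += 1
        k += 1
        j += 1
        outer += 1
    position = k - m if j > m else None
    return position, inner, outer


print(kmp_leftmost("abcabcacab", "babcbabcabcaabcabcabcacabc", NEXT))   # (16, 5, 25)
```

On this input the inner operation runs 5 times against 25 advances of k, in line with the observation that the inner loop is performed no more often than the outer one.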

To prove rigorously that the above program is correct, we may use the following invariant relation: "Let p = k − j (i.e., the position in the text just preceding the first character of the pattern, in our assumed alignment). Then we have text[p + i] = pattern[i] for 1 ≤ i < j (i.e., we have matched the first j − 1 characters of the pattern, if j > 0); but for 0 ≤ t < p we have text[t + i] ≠ pattern[i] for some i, where 1 ≤ i ≤ m (i.e., there is no possible match of the entire pattern to the left of p)."

The program will of course be correct only if we can compute the next table so that the above relation remains invariant when we perform the operation j := next[j]. Let us look at that computation now. When the program sets j := next[j], we know that j > 0, and that the last j characters of the input up to and including text[k] were

    pattern[1] ... pattern[j − 1] x

where x ≠ pattern[j]. What we want is to find the least amount of shift for which these characters can possibly match the shifted pattern; in other words, we want next[j] to be the largest i less than j such that the last i characters of the input were

    pattern[1] ... pattern[i − 1] x

and pattern[i] ≠ pattern[j]. (If no such i exists, we let next[j] = 0.) With this definition of next[j] it is easy to verify that text[t + 1] ... text[k] ≠ pattern[1] ... pattern[k − t] for k − j ≤ t < k − next[j]; hence the stated relation is indeed invariant, and our program is correct.
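The definition of next[j] can be evaluated directly, without any cleverness, which is useful for checking a faster method. The following Python sketch does this by brute force; the name `next_by_definition` is an assumption made here, and the quadratic-or-worse running time is deliberate.

```python
def next_by_definition(pattern):
    """next[j] = largest i < j with pattern[1..i-1] = pattern[j-i+1..j-1]
    and pattern[i] != pattern[j]; 0 if there is no such i.
    Returns the values for j = 1, ..., m as a plain list."""
    m = len(pattern)
    p = " " + pattern                      # pad so that indices start at 1
    table = []
    for j in range(1, m + 1):
        value = 0
        for i in range(j - 1, 0, -1):      # try the largest i first
            if p[1:i] == p[j - i + 1:j] and p[i] != p[j]:
                value = i
                break
        table.append(value)
    return table


print(next_by_definition("abcabcacab"))    # [0, 1, 1, 0, 1, 1, 0, 5, 0, 1]
```

The output agrees with the table given for this pattern in §1.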

Now we must face up to the problem we have been postponing, the task of calculating next[j] in the first place. This problem would be easier if we didn't require pattern[i] ≠ pattern[j] in the definition of next[j], so we shall consider the easier problem first. Let f[j] be the largest i less than j such that pattern[1] ... pattern[i − 1] = pattern[j − i + 1] ... pattern[j − 1]; since this condition holds vacuously for i = 1, we always have f[j] ≥ 1 when j > 1. By convention we let f[1] = 0. The pattern used in the example of §1 has the following f table:

    j          =  1  2  3  4  5  6  7  8  9  10
    pattern[j] =  a  b  c  a  b  c  a  c  a  b
    f[j]       =  0  1  1  1  2  3  4  5  1  2

If pattern[j] = pattern[f[j]] then f[j + 1] = f[j] + 1; but if not, we can use essentially the same pattern-matching algorithm as above to compute f[j + 1], with text = pattern! (Note the similarity of the f[j] problem to the invariant condition of the matching algorithm. Our program calculates the largest j less than or equal to k such that pattern[1] ... pattern[j − 1] = text[k − j + 1] ... text[k − 1], so we can transfer the previous technology to the present problem.) The following program will compute f[j + 1], assuming that f[j] and next[1], ..., next[j − 1] have already been calculated:

    t := f[j];
    while t > 0 and pattern[j] ≠ pattern[t]
      do t := next[t];
    f[j + 1] := t + 1;

The correctness of this program is demonstrated as before; we can imagine two copies of the pattern, one sliding to the right with respect to the other. For example, suppose we have established that f[8] = 5 in the above case; let us consider the computation of f[9]. The appropriate picture is

       abcabcacab
    abcabcacab
           ↑

Since pattern[8] ≠ b, we shift the upper copy right, knowing that the most recently scanned characters of the lower copy were abcax for x ≠ b. The next table tells us to shift right four places, obtaining

           abcabcacab
    abcabcacab
           ↑

and again there is no match. The next shift makes t = 0, so f[9] = 1.

Once we understand how to compute f, it is only a short step to the computation of next[j]. A comparison of the definitions shows that, for j > 1,

    next[j] = f[j],          if pattern[j] ≠ pattern[f[j]];
    next[j] = next[f[j]],    if pattern[j] = pattern[f[j]].

Therefore we can compute the next table as follows, without ever storing the values of f[j] in memory.

    j := 1; t := 0; next[1] := 0;
    while j < m do
      begin comment t = f[j];
        while t > 0 and pattern[j] ≠ pattern[t]
          do t := next[t];
        t := t + 1; j := j + 1;
        if pattern[j] = pattern[t]
          then next[j] := next[t]
          else next[j] := t;
      end.

This program takes O(m) units of time, for the same reason as the matching program takes O(n): the operation t := next[t] in the innermost loop always shifts the upper copy of the pattern to the right, so it is performed a total of m times at most. (A slightly different way to prove that the running time is bounded by a constant times m is to observe that the variable t starts at 0 and it is increased, m − 1 times, by 1; furthermore its value remains nonnegative. Therefore the operation t := next[t], which always decreases t, can be performed at most m − 1 times.)
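A Python rendering of the table-building program is sketched below. It is illustrative only: the name `next_table` is an assumption, and unlike the program above it also records f[j] so that both tables can be compared with the values given in the text.

```python
def next_table(pattern):
    """Compute f[1..m] and next[1..m] in O(m) time; returns two plain lists for j = 1..m."""
    m = len(pattern)
    p = " " + pattern                  # pad so that indices start at 1
    f = [0] * (m + 1)                  # f[1] = 0 by convention
    nxt = [0] * (m + 1)                # next[1] = 0
    j, t = 1, 0
    while j < m:
        # Here t == f[j].
        while t > 0 and p[j] != p[t]:
            t = nxt[t]                 # slide the upper copy of the pattern right
        t += 1
        j += 1
        f[j] = t
        nxt[j] = nxt[t] if p[j] == p[t] else t
    return f[1:], nxt[1:]


f, nxt = next_table("abcabcacab")
print(f)      # [0, 1, 1, 1, 2, 3, 4, 5, 1, 2]
print(nxt)    # [0, 1, 1, 0, 1, 1, 0, 5, 0, 1]
```

Dropping the list `f` recovers the memory behaviour described above, since t alone carries the value f[j] from one iteration to the next.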

To summarize what we have said so far: Strings of text can be scanned efficiently by making use of two ideas. We can precompute "shifts", specifying how to move the given pattern when a mismatch occurs at its jth character; and this precomputation of "shifts" can be performed efficiently by using the same principle, shifting the pattern against itself.

3. Gaining efficiency. We have presented the pattern-matching algorithm in a form that is rather easily proved correct; but as so often happens, this form is not very efficient. In fact, the algorithm as presented above would probably not be competitive with the naive algorithm on realistic data, even though the naive algorithm has a worst-case time of order m times n instead of m plus n, because the chance of this worst case is rather slim. On the other hand, a well-implemented form of the new algorithm should go noticeably faster because there is no backing up after a partial match.

It is not difficult to see the source of inefficiency in the new algorithm as presented above: When the alphabet of characters is large, we will rarely have a partial match, and the program will waste a lot of time discovering rather awkwardly that text[k] ≠ pattern[1] for k = 1, 2, 3, .... When j = 1 and text[k] ≠ pattern[1], the algorithm sets j := next[1] = 0, then discovers that j = 0, then increases k by 1, then sets j to 1 again, then tests whether or not 1 is ≤ m, and later it tests whether or not 1 is greater than 0. Clearly we would be much better off making j = 1 into a special case.

The algorithm also spends unnecessary time testing whether j > m or k > n. A fully-matched pattern can be accounted for by setting pattern[m + 1] = "@" for some impossible character @ that will never be matched, and by letting next[m + 1] = −1; then a test for j < 0 can be inserted into a less-frequently executed part of the code. Similarly we can for example set text[n + 1] = "⊥" (another impossible character) and text[n + 2] = pattern[1], so that the test for k > n needn't be made very often. (See [17] for a discussion of such more or less mechanical transformations on programs.)

The following form of the algorithm incorporates these refinements.

    a := pattern[1];
    pattern[m + 1] := '@'; next[m + 1] := −1;
    text[n + 1] := '⊥'; text[n + 2] := a;
    j := k := 1;
    get started: comment j = 1;
      while text[k] ≠ a do k := k + 1;
      if k > n then go to input exhausted;
    char matched: j := j + 1; k := k + 1;
    loop: comment j > 0;
      if text[k] = pattern[j] then go to char matched;
      j := next[j];
      if j = 1 then go to get started;
      if j = 0 then
        begin
          j := 1; k := k + 1;
          go to get started;
        end;
      if j > 0 then go to loop;
    comment text[k − m] through text[k − 1] matched;

This program will usually run faster than the naive algorithm; the worst case occurs when trying to find the pattern ab in a long string of a's. Similar ideas can be used to speed up the program which prepares the next table.

In a text-editor the patterns are usually short, so that it is most efficient to translate the pattern directly into machine-language code which implicitly contains the next table (cf. [3, Hack 179] and [24]). For example, the pattern in §1 could be compiled into the machine-language equivalent of

    L0:  k := k + 1;
    L1:  if text[k] ≠ a then go to L0;
         k := k + 1;
         if k > n then go to input exhausted;
    L2:  if text[k] ≠ b then go to L1;
         k := k + 1;
    L3:  if text[k] ≠ c then go to L1;
         k := k + 1;
    L4:  if text[k] ≠ a then go to L0;
         k := k + 1;
    L5:  if text[k] ≠ b then go to L1;
         k := k + 1;
    L6:  if text[k] ≠ c then go to L1;
         k := k + 1;
    L7:  if text[k] ≠ a then go to L0;
         k := k + 1;
    L8:  if text[k] ≠ c then go to L5;
         k := k + 1;
    L9:  if text[k] ≠ a then go to L0;
         k := k + 1;
    L10: if text[k] ≠ b then go to L1;
         k := k + 1;

This will be slightly faster, since it essentially makes a special case for all values of j.
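The compiled form is a mechanical function of the pattern and its next table: step Lj compares text[k] with pattern[j] and, on a mismatch, branches to L followed by next[j]. The Python sketch below generates such a listing as text. The function name `compile_pattern` and the ASCII spelling `!=` are assumptions made here, and the output is pseudo-code lines, not real machine language.

```python
# next table for abcabcacab from Section 1; index 0 is an unused placeholder.
NEXT = [None, 0, 1, 1, 0, 1, 1, 0, 5, 0, 1]


def compile_pattern(pattern, nxt):
    """Emit one labelled compare-and-branch step per pattern character."""
    lines = ["L0: k := k+1;"]
    for j, ch in enumerate(pattern, start=1):
        lines.append(f"L{j}: if text[k] != {ch} then go to L{nxt[j]};")
        lines.append("    k := k+1;")
        if j == 1:
            # The end-of-text test is only needed once, after the first character.
            lines.append("    if k > n then go to input exhausted;")
    return lines


for line in compile_pattern("abcabcacab", NEXT):
    print(line)
```

Falling through the last step corresponds to a complete match.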

It is a curious fact that people often think the new algorithm will be slower than the naive one, even though it does less work. Since the new algorithm is conceptually hard to understand at first, by comparison with other algorithms of the same length, we feel somehow that a computer will have conceptual difficulties too: we expect the machine to run more slowly when it gets to such subtle instructions!

4. Extensions. So far our programs have only been concerned with finding the leftmost match. However, it is easy to see how to modify the routine so that all matches are found in turn: We can calculate the next table for the extended pattern of length m + 1 using pattern[m + 1] = "@", and then we set resume := next[m + 1] before setting next[m + 1] to −1. After finding a match and doing whatever action is desired to process that match, the sequence

    j := resume; go to loop;

will restart things properly. (We assume that text has not changed in the meantime. Note that resume cannot be zero.)
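The "resume" idea can be sketched in Python on top of the simple matching loop of §2 rather than the goto-based program of §3; that substitution, the name `find_all`, and the use of a fresh `object()` as the impossible character are all assumptions of this illustration.

```python
def find_all(pattern, text):
    """Return the 1-based starting positions of all occurrences of a nonempty pattern."""
    m, n = len(pattern), len(text)
    sentinel = object()                        # plays the role of '@': equals no character
    p = [None] + list(pattern) + [sentinel]    # 1-indexed extended pattern of length m + 1
    nxt = [0] * (m + 2)
    j, t = 1, 0
    while j < m + 1:                           # next table for the extended pattern
        while t > 0 and p[j] != p[t]:
            t = nxt[t]
        t += 1
        j += 1
        nxt[j] = nxt[t] if p[j] == p[t] else t
    resume = nxt[m + 1]                        # never zero, since the sentinel matches nothing

    s = [None] + list(text)
    hits = []
    j = k = 1
    while k <= n:
        while j > 0 and s[k] != p[j]:
            j = nxt[j]
        k += 1
        j += 1
        if j == m + 1:                         # pattern[1..m] matched text[k-m..k-1]
            hits.append(k - m)
            j = resume                         # continue scanning without backing up
    return hits


print(find_all("aa", "aaaa"))                                   # [1, 2, 3]
print(find_all("abcabcacab", "babcbabcabcaabcabcabcacabc"))     # [16]
```

Overlapping occurrences are reported, because resuming at next[m + 1] keeps the longest part of the match that can still begin a new occurrence.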

Another approach would be to leave next[m + 1] untouched, never changing it to −1, and to define integer arrays head[1:m] and link[1:n] initially zero, and to insert the code

    link[k] := head[j]; head[j] := k;

at label "char matched". The test "if j > 0 then" is also removed from the program. This forms linked lists for 1 ≤ j ≤ m of all places where the first j characters of the pattern (but no more than j) are matched in the input.

Still another straightforward modification will find the longest initial match of the pattern, i.e., the maximum j such that pattern[1] ... pattern[j] occurs in text.

In practice, the text characters are often packed into words, with say b characters per word, and the machine architecture often makes it inconvenient to access individual characters. When efficiency for large n is important on such machines, one alternative is to carry out b independent searches, one for each possible alignment of the pattern's first character in the word. These searches can treat entire words as "supercharacters", with appropriate masking, instead of working with individual characters and unpacking them. Since the algorithm we have described does not depend on the size of the alphabet, it is well suited to this and similar alternatives.

Sometimes we want to match two or more patterns in sequence, finding an occurrence of the first followed by the second, etc.; this is easily handled by consecutive searches, and the total running time will be of order n plus the sum of the individual pattern lengths.

We might also want to match two or more patterns in parallel, stopping as soon as any one of them is fully matched. A search of this kind could be done with multiple next and pattern tables, with one j pointer for each; but this would make the running time kn plus the sum of the pattern lengths, when there are k patterns. Hopcroft and Karp have observed (unpublished) that our pattern-matching algorithm can be extended so that the running time for simultaneous searches is proportional simply to n, plus the alphabet size times the sum of the pattern lengths. The patterns are combined into a "trie" whose nodes represent all of the initial substrings of one or more patterns, and whose branches specify the appropriate successor node as a function of the next character in the input text. For example, if there are four patterns {abcab, ababc, bcac, bbc}, the trie is shown in Fig. 1.

    node       0   1   2    3    4      5    6      7    8    9     10
    substring  ε   a   ab   abc  abca   aba  abab   b    bc   bca   bb
    if a       1   1   5    4    1      1    5      1    9    1     1
    if b       7   2   10   7    abcab  6    10     10   7    2     10
    if c       0   0   3    0    bcac   0    ababc  8    0    bcac  bbc

    FIG. 1

Such a trie can be constructed efficiently by generalizing the idea we used to calculate next[j]; details and further refinements have been discussed by Aho and Corasick [2], who discovered the algorithm independently. (Note that this algorithm depends on the alphabet size; such dependence is inherent, if we wish to keep the coefficient of n independent of k, since for example the k patterns might each consist of a single unique character.) It is interesting to compare this approach to what happens when the LR(0) parsing algorithm is applied to the regular grammar S → aS | bS | cS | abcab | ababc | bcac | bbc.
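A parallel search of this kind can be sketched in Python with a trie plus failure links, in the textbook goto/failure style associated with Aho and Corasick. This is an assumption about form, not the tabular automaton of the paper: the names `build_trie` and `first_match` are hypothetical, and the failure links are followed at search time instead of being expanded into a full transition table.

```python
from collections import deque


def build_trie(patterns):
    """Return (goto, fail, out): child maps, failure links, and patterns recognised per node."""
    goto, out = [{}], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())            # depth-1 nodes fail to the root
    while queue:                               # breadth-first, so fail[node] is already final
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]     # inherit matches that end at the same place
            queue.append(child)
    return goto, fail, out


def first_match(patterns, text):
    """Stop as soon as any pattern is fully matched; return (1-based end position, patterns)."""
    goto, fail, out = build_trie(patterns)
    node = 0
    for k, ch in enumerate(text, start=1):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if out[node]:
            return k, sorted(out[node])
    return None


patterns = ["abcab", "ababc", "bcac", "bbc"]
print(first_match(patterns, "babcac"))     # (6, ['bcac'])
```

Storing successors in per-node dictionaries trades the alphabet-size factor in table space for extra failure-link steps during the scan.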

5. Theoretical considerations. If the input file is being read in "real time", we might object to long delays between consecutive inputs. In this section we shall prove that the number of times j := next[j] is performed, before k is advanced, is bounded by a function of the approximate form $\log_\phi m$, where $\phi = (1 + \sqrt{5})/2 \approx 1.618$ is the golden ratio, and that this bound is best possible. We shall use lower case Latin letters to represent characters, and lower case Greek letters $\alpha, \beta, \ldots$ to represent strings, with $\epsilon$ the empty string and $|\alpha|$ the length of $\alpha$. Thus $|a| = 1$ for all characters $a$; $|\alpha\beta| = |\alpha| + |\beta|$; and $|\epsilon| = 0$. We also write $\alpha[k]$ for the $k$th character of $\alpha$, when $1 \le k \le |\alpha|$.

As a warmup for our theoretical discussion, let us consider the Fibonacci strings [14, exercise 1.2.8-36], which turn out to be especially pathological patterns for the above algorithm. The definition of Fibonacci strings is

(1)    $\phi_1 = b,\quad \phi_2 = a;\quad \phi_n = \phi_{n-1}\phi_{n-2}$ for $n \ge 3$.

For example, $\phi_3 = ab$, $\phi_4 = aba$, $\phi_5 = abaab$. It follows that the length $|\phi_n|$ is the $n$th Fibonacci number $F_n$, and that $\phi_n$ consists of the first $F_n$ characters of an infinite string $\phi_\infty$ when $n \ge 2$.

Consider the pattern $\phi_8$, which has the functions f[j] and next[j] shown in Table 1.

TABLE 1

    j          =  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
    pattern[j] =  a  b  a  a  b  a  b  a  a  b  a  a  b  a  b  a  a  b  a  b  a
    f[j]       =  0  1  1  2  2  3  4  3  4  5  6  7  5  6  7  8  9 10 11 12  8
    next[j]    =  0  1  0  2  1  0  4  0  2  1  0  7  1  0  4  0  2  1  0 12  0

If we extend this pattern to $\phi_\infty$, we obtain infinite sequences f[j] and next[j] having the same general character. It is possible to prove by induction that

(2)    $f[j] = j - F_{k-1}$ for $F_k \le j < F_{k+1}$,

because of the following remarkable near-commutative property of Fibonacci strings:

(3)    $\phi_{n-2}\phi_{n-1} = c(\phi_{n-1}\phi_{n-2})$, for $n \ge 3$,

where $c(\alpha)$ denotes changing the two rightmost characters of $\alpha$. For example, $\phi_6 = abaab \cdot aba$ and $c(\phi_6) = aba \cdot abaab$. Equation (3) is obvious when $n = 3$; and for $n > 3$ we have $c(\phi_{n-2}\phi_{n-1}) = \phi_{n-2}\,c(\phi_{n-1}) = \phi_{n-2}\phi_{n-3}\phi_{n-2} = \phi_{n-1}\phi_{n-2}$ by induction; hence $c(\phi_{n-1}\phi_{n-2}) = c(c(\phi_{n-2}\phi_{n-1})) = \phi_{n-2}\phi_{n-1}$.

Equation (3) implies that

(4)    $\mathit{next}[F_k - 1] = F_{k-1} - 1$ for $k \ge 3$.

Therefore if we have a mismatch when $j = F_8 - 1 = 20$, our algorithm might set j := next[j] for the successive values 20, 12, 7, 4, 2, 1, 0 of j. Since $F_k$ is $\phi^k/\sqrt{5}$ rounded to the nearest integer, it is possible to have up to about $\log_\phi m$ consecutive iterations of the j := next[j] loop.
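The chain of values for the Fibonacci pattern can be reproduced with a few lines of Python. The sketch is illustrative: the names `fibonacci_string`, `next_table`, and `shift_chain` are assumptions, and the table builder is a direct transcription of the program in §2.

```python
def fibonacci_string(n):
    """phi_1 = b, phi_2 = a, phi_n = phi_{n-1} phi_{n-2}."""
    prev, cur = "b", "a"
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur if n >= 2 else prev


def next_table(pattern):
    """next[1..m], stored 1-indexed (entry 0 is an unused placeholder)."""
    m = len(pattern)
    p = " " + pattern
    nxt = [0] * (m + 1)
    j, t = 1, 0
    while j < m:
        while t > 0 and p[j] != p[t]:
            t = nxt[t]
        t += 1
        j += 1
        nxt[j] = nxt[t] if p[j] == p[t] else t
    return nxt


def shift_chain(nxt, j):
    """Successive values of j under repeated j := next[j], down to 0."""
    chain = [j]
    while j > 0:
        j = nxt[j]
        chain.append(j)
    return chain


phi8 = fibonacci_string(8)
print(phi8)                                  # abaababaabaababaababa
print(shift_chain(next_table(phi8), 20))     # [20, 12, 7, 4, 2, 1, 0]
```

The whole chain is traversed only when the current text character differs from every pattern character examined along the way.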

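We shall now show that Fibonacci strings actually are the worst case, i.e., that $\log_\phi m$ is also an upper bound. First let us consider the concept of periodicity in strings. We say that $p$ is a period of $\alpha$ if

(5)    $\alpha[i] = \alpha[i + p]$ for $1 \le i \le |\alpha| - p$.

It is easy to see that $p$ is a period of $\alpha$ if and only if

(6)    $\alpha = (\alpha_1\alpha_2)^k \alpha_1$

for some $k \ge 0$, where $|\alpha_1\alpha_2| = p$ and $\alpha_2 \ne \epsilon$. Equivalently, $p$ is a period of $\alpha$ if and only if

(7)    $\alpha\theta_1 = \theta_2\alpha$

for some $\theta_1$ and $\theta_2$ with $|\theta_1| = |\theta_2| = p$. Condition (6) implies (7) with $\theta_1 = \alpha_2\alpha_1$ and $\theta_2 = \alpha_1\alpha_2$. Condition (7) implies (6), for we define $k = \lfloor |\alpha|/p \rfloor$ and observe that if $k > 0$, then $\alpha = \theta_2\beta$ implies $\beta\theta_1 = \theta_2\beta$ and $\lfloor |\beta|/p \rfloor = k - 1$; hence, reasoning inductively, $\alpha = \theta_2^k \alpha_1$ for some $\alpha_1$ with $|\alpha_1| < p$, and $\alpha_1\theta_1 = \theta_2\alpha_1$. Writing $\theta_2 = \alpha_1\alpha_2$ yields (6).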
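The three characterizations of a period are easy to compare on small strings. In the Python sketch below the function names are hypothetical, indexing is 0-based, and the period is restricted to $1 \le p \le |\alpha|$ so that the decomposition of (6) is determined.

```python
def is_period(alpha, p):
    """Condition (5): alpha[i] == alpha[i + p] wherever both positions exist."""
    return all(alpha[i] == alpha[i + p] for i in range(len(alpha) - p))


def decompose(alpha, p):
    """For a period p, return (k, alpha1, alpha2) with alpha == (alpha1 + alpha2) * k + alpha1."""
    k, r = divmod(len(alpha), p)
    return k, alpha[:r], alpha[r:p]


def satisfies_7(alpha, p):
    """Condition (7): alpha + theta1 == theta2 + alpha for some theta1, theta2 of length p."""
    return alpha[p:] == alpha[:len(alpha) - p]


alpha = "abcabca"
print([p for p in range(1, len(alpha) + 1) if is_period(alpha, p)])   # [3, 6, 7]
print(decompose(alpha, 3))                                            # (2, 'a', 'bc')
```

Because $|\alpha_1| = |\alpha| \bmod p < p$, the piece $\alpha_2$ returned by `decompose` is never empty, as (6) requires.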

The relevance of periodicity to our algorithm is clear once we consider what it means to shift a pattern. If pattern[1] ... pattern[j − 1] = $\alpha$ ends with pattern[1] ... pattern[i − 1] = $\beta$, we have

(8)    $\alpha = \beta\theta_1 = \theta_2\beta$

where $|\theta_1| = |\theta_2| = j - i$, so the amount of shift $j - i$ is a period of $\alpha$.

The construction of next[j] in our algorithm implies further that pattern[i], which is the first character of $\theta_1$, is unequal to pattern[j]. Let us assume that $\beta$ itself is subsequently shifted leaving a residue $\gamma$, so that

(9)    $\beta = \gamma\psi_1 = \psi_2\gamma$

where the first character of $\psi_1$ differs from that of $\theta_1$. We shall now prove that

(10)    $|\alpha| > |\beta| + |\gamma|$.

If $|\beta| + |\gamma| \ge |\alpha|$, there is an overlap of $d = |\beta| + |\gamma| - |\alpha|$ characters between the occurrences of $\beta$ and $\gamma$ in $\beta\theta_1 = \alpha = \theta_2\psi_2\gamma$; hence the first character of $\theta_1$ is $\gamma[d + 1]$. Similarly there is an overlap of $d$ characters between the occurrences of $\beta$ and $\gamma$ in $\theta_2\beta = \alpha = \gamma\psi_1\theta_1$; hence the first character of $\psi_1$ is $\beta[d + 1]$. Since these characters are distinct, we obtain $\gamma[d + 1] \ne \beta[d + 1]$, contradicting (9). This establishes (10), and leads directly to the announced result:

THEOREM. The number of consecutive times that j := next[j] is performed, while one text character is being scanned, is at most $1 + \log_\phi m$.

Proof. Let $L_r$ be the length of the shortest string $\alpha$ as in the above discussion such that a sequence of $r$ consecutive shifts is possible. Then $L_1 = 0$, $L_2 = 1$, and we have $|\beta| \ge L_{r-1}$, $|\gamma| \ge L_{r-2}$ in (10); hence $L_r \ge F_{r+1} - 1$ by induction on $r$. Now if $r$ shifts occur we have $m \ge F_{r+1} \ge \phi^{r-1}$. □

The algorithm of §2 would run correctly in linear time even if f[j] were used instead of next[j], but the analogue of the above theorem would then be false. For example, the pattern $a^m$ leads to f[j] = j − 1 for 1 ≤ j ≤ m. Therefore if we matched $a^m$ to the text $a^{m-1}b\alpha$, using f[j] instead of next[j], the mismatch text[m] ≠ pattern[m] would be followed by m occurrences of j := f[j] and m − 1 redundant comparisons of text[m] with pattern[j], before k is advanced to m + 1.

The subject of periods in strings has several interesting algebraic properties, but a reader who is not mathematically inclined may skip to §6 since the following material is primarily an elaboration of some additional structure related to the above theorem.

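LEMMA 1. If $p$ and $q$ are periods of $\alpha$, and $p + q \le |\alpha| + \gcd(p, q)$, then $\gcd(p, q)$ is a period of $\alpha$.

Proof. Let $d = \gcd(p, q)$, and assume without loss of generality that $d < p < q = p + r$. We have $\alpha[i] = \alpha[i + p]$ for $1 \le i \le |\alpha| - p$ and $\alpha[i] = \alpha[i + q]$ for $1 \le i \le |\alpha| - q$; hence $\alpha[i + r] = \alpha[i + q] = \alpha[i]$ for $1 + r \le i + r \le |\alpha| - p$, i.e.,

    $\alpha[i] = \alpha[i + r]$ for $1 \le i \le |\alpha| - q$.

Furthermore $\alpha = \beta\theta_1 = \theta_2\beta$ where $|\theta_1| = p$, and it follows that $p$ and $r$ are periods of $\beta$, where $p + r \le |\beta| + d = |\beta| + \gcd(p, r)$. By induction, $d$ is a period of $\beta$. Since $|\beta| = |\alpha| - p \ge q - d \ge q - r = p = |\theta_1|$, the strings $\theta_1$ and $\theta_2$ (which have the respective forms $\beta_2\beta_1$ and $\beta_1\beta_2$ by (6) and (7)) are substrings of $\beta$; so they also have $d$ as a period. The string $\alpha = (\beta_1\beta_2)^{k+1}\beta_1$ must now have $d$ as a period, since any characters $d$ positions apart are contained within $\beta_1\beta_2$ or $\beta_2\beta_1$. □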

The result of Lemma 1 but with the stronger hypothesis $p + q \le |\alpha|$ was proved by Lyndon and Schützenberger in connection with a problem about free groups [19, Lem. 4]. The weaker hypothesis in Lemma 1 turns out to give the best possible bound: If $\gcd(p, q) < p < q$ we can find a string of length $p + q - \gcd(p, q) - 1$ for which $\gcd(p, q)$ is not a period. In order to see why this is so, consider first the example in Fig. 2 showing the most general strings of lengths 15 through 25 having both 11 and 15 as periods. (The strings are "most general" in the sense that any two character positions that can be different are different.)

    abcdefghijkabcd
    abcdafghijkabcda
    abcdabghijkabcdab
    abcdabchijkabcdabc
    abcdabcdijkabcdabcd
    abcdabcdajkabcdabcda
    abcdabcdabkabcdabcdab
    abcdabcdabcabcdabcdabc
    abcaabcaabcabcaabcaabca
    aacaaacaaacaacaaacaaacaa
    aaaaaaaaaaaaaaaaaaaaaaaaa

    FIG. 2

Note that the number of degrees of freedom, i.e., the number of distinct symbols, decreases by 1 at each step. It is not difficult to prove that the number cannot decrease by more than 1 as we go from $|\alpha| = n - 1$ to $|\alpha| = n$, since the only new relations are $\alpha[n] = \alpha[n - q] = \alpha[n - p]$; we decrease the number of distinct symbols by one if and only if positions $n - q$ and $n - p$ contain distinct symbols in the most general string of length $n - 1$. The lemma tells us that we are left with at most $\gcd(p, q)$ symbols when the length reaches $p + q - \gcd(p, q)$; on the other hand we always have exactly $p$ symbols when the length is $q$. Therefore each of the $p - \gcd(p, q)$ steps must decrease the number of symbols by 1, and the most general string of length $p + q - \gcd(p, q) - 1$ must have exactly $\gcd(p, q) + 1$ distinct symbols. In other words, the lemma gives the best possible bound.
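The "most general" strings can be generated mechanically by merging positions that the two periods force to be equal. The Python sketch below uses a small union-find for this; the name `most_general` and the convention of labelling each class by the letter of its leftmost position (which limits the sketch to classes starting within the first 26 positions) are assumptions of the illustration.

```python
def most_general(n, p, q):
    """Most general string of length n having both p and q as periods."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for period in (p, q):
        for i in range(n - period):         # positions i and i + period must agree
            a, b = find(i), find(i + period)
            if a != b:
                parent[max(a, b)] = min(a, b)   # keep the leftmost position as root
    return "".join(chr(ord("a") + find(i)) for i in range(n))


for n in range(15, 26):
    s = most_general(n, 11, 15)
    print(n, len(set(s)), s)
```

For p = 11 and q = 15 the count of distinct symbols falls from 11 at length 15 to 1 at length 25, and the string of length 24 still has two symbols, so 1 = gcd(11, 15) is not yet a period there.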

When $p$ and $q$ are relatively prime, the strings of length $p + q - 2$ on two symbols, having both $p$ and $q$ as periods, satisfy a number of remarkable properties, generalizing what we have observed earlier about Fibonacci strings. Since the properties of these pathological patterns may prove useful in other investigations, we shall summarize them in the following lemma.

LEMMA 2. Let the strings $\sigma(m, n)$ of length $n$ be defined for all relatively prime pairs of integers $n \ge m \ge 0$ as follows:

    $\sigma(0, 1) = a,\quad \sigma(1, 1) = b,\quad \sigma(1, 2) = ab;$

(11)    $\sigma(m, m + n) = \sigma(n \bmod m, m)\,\sigma(m, n)$
        $\sigma(n, m + n) = \sigma(m, n)\,\sigma(n \bmod m, m)$    if $0 < m < n$.

These strings satisfy the following properties:

(i) $\sigma(m, qm + r)\,\sigma(m - r, m) = \sigma(r, m)\,\sigma(m, qm + r)$, for $m > 2$;

(ii) $\sigma(m, n)$ has period $m$, for $m > 1$;

(iii) $c(\sigma(m, n)) = \sigma(n - m, n)$, for $n > 2$.

(The function $c(\alpha)$ was defined in connection with (3) above.)

Proof. We have, for $0 < m < n$ and $q \ge 2$,

    $\sigma(m + n, q(m + n) + m) = \sigma(m, m + n)\,\sigma(m + n, (q - 1)(m + n) + m)$,
    $\sigma(m + n, q(m + n) + n) = \sigma(n, m + n)\,\sigma(m + n, (q - 1)(m + n) + n)$,
    $\sigma(m + n, 2m + n) = \sigma(m, m + n)\,\sigma(n \bmod m, m)$,
    $\sigma(m + n, m + 2n) = \sigma(n, m + n)\,\sigma(m, n)$;

hence, if $\theta_1 = \sigma(n \bmod m, m)$ and $\theta_2 = \sigma(m, n)$ and $q \ge 1$,

(12)    $\sigma(m + n, q(m + n) + m) = (\theta_1\theta_2)^q\theta_1,\quad \sigma(m + n, q(m + n) + n) = (\theta_2\theta_1)^q\theta_2$.

It follows that

    $\sigma(m + n, q(m + n) + m)\,\sigma(n, m + n) = \sigma(m, m + n)\,\sigma(m + n, q(m + n) + m)$,
    $\sigma(m + n, q(m + n) + n)\,\sigma(m, m + n) = \sigma(n, m + n)\,\sigma(m + n, q(m + n) + n)$,

which combine to prove (i). Property (ii) also follows immediately from (12), except for the case $m = 2$, $n = 2q + 1$, $\sigma(2, 2q + 1) = (ab)^q a$, which may be verified directly. Finally, it suffices to verify property (iii) for $0 < m < \frac{1}{2}n$, since $c(c(\alpha)) = \alpha$; we must show that

    $c(\sigma(m, m + n)) = \sigma(m, n)\,\sigma(n \bmod m, m)$ for $0 < m < n$.

When $m \le 2$ this property is easily checked, and when $m > 2$ it is equivalent by induction to

    $\sigma(m, m + n) = \sigma(m, n)\,\sigma(m - (n \bmod m), m)$ for $0 < m < n$, $m > 2$.

Set $n \bmod m = r$, $\lfloor n/m \rfloor = q$, and apply property (i). □

By properties (ii) and (iii) of this lemma, $\sigma(p, p + q)$ minus its last two characters is the string of length $p + q - 2$ having periods $p$ and $q$. Note that Fibonacci strings are just a very special case, since $\phi_n = \sigma(F_{n-1}, F_n)$. Another property of the $\sigma$ strings appears in [15]. A completely different proof of Lemma 1 and its optimality, and a completely different definition of $\sigma(m, n)$, were given by Fine and Wilf in 1965 [7]. These strings have a long history going back at least to the astronomer Johann Bernoulli in 1772; see [25, §2.13] and [21].

If $\alpha$ is any string, let $P(\alpha)$ be its shortest period. Lemma 1 implies that all periods $q$ which are not multiples of $P(\alpha)$ must be greater than $|\alpha| - P(\alpha) + \gcd(q, P(\alpha))$. This is a rather strong condition in terms of the pattern matching algorithm, because of the following result.

LEMMA 3. Let $\alpha$ = pattern[1] ... pattern[j − 1] and let $a$ = pattern[j]. In the pattern matching algorithm, $f[j] = j - P(\alpha)$, and $\mathit{next}[j] = j - q$, where $q$ is the smallest period of $\alpha$ which is not a period of $\alpha a$. (If no such period exists, next[j] = 0.) If $P(\alpha)$ divides $P(\alpha a)$ and $P(\alpha a) < j$, then $P(\alpha) = P(\alpha a)$. If $P(\alpha)$ does not divide $P(\alpha a)$ or if $P(\alpha a) = j$, then $q = P(\alpha)$.

Proof. The characterizations of f[j] and next[j] follow immediately from the definitions. Since every period of $\alpha a$ is a period of $\alpha$, the only nonobvious statement is that $P(\alpha) = P(\alpha a)$ whenever $P(\alpha)$ divides $P(\alpha a)$ and $P(\alpha a) \ne j$. Let $P(\alpha) = p$ and $P(\alpha a) = mp$; then the $(mp)$th character from the right of $\alpha$ is $a$, as is the $(m - 1)p$th, ..., as is the $p$th; hence $p$ is a period of $\alpha a$. □

Lemma 3 shows that the j := next[j] loop will almost always terminate quickly. If $P(\alpha) = P(\alpha a)$, then $q$ must not be a multiple of $P(\alpha)$; hence by Lemma 1, $P(\alpha) + q \ge j + 1$. On the other hand $q > P(\alpha)$; hence $q > \frac{1}{2}j$ and next[j] $< \frac{1}{2}j$. In the other case $q = P(\alpha)$, we had better not have $q$ too small, since $q$ will be a period in the residual pattern after shifting, and next[next[j]] will be $< q$. To keep the loop running it is necessary for new small periods to keep popping up, relatively prime to the previous periods.
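The period characterization in Lemma 3 can be checked mechanically on small patterns. The following Python sketch is an illustration only, not the paper's program: the helper names (`failure_tables`, `has_period`, `shortest_period`, `lemma3_holds`) are invented here, the tables are built with a plain failure-function loop rather than the optimized `next`-based loop, and a period $p$ of a string is taken to mean that characters $p$ positions apart agree wherever both exist (so every $p$ at least as large as the length is trivially a period).

```python
def failure_tables(pattern):
    """Return (f, nxt) where f[j-1] and nxt[j-1] hold f[j] and next[j]
    in the paper's 1-based indexing."""
    m = len(pattern)
    p = " " + pattern              # pad so that p[1..m] is the pattern
    f = [0] * (m + 1)
    nxt = [0] * (m + 1)
    t = 0
    for j in range(1, m):
        # Invariant: t == f[j] at this point.
        while t > 0 and p[j] != p[t]:
            t = f[t]
        t += 1
        f[j + 1] = t
        nxt[j + 1] = nxt[t] if p[j + 1] == p[t] else t
    return f[1:], nxt[1:]


def has_period(s, p):
    """True if s[i] == s[i+p] wherever both positions exist."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def shortest_period(s):
    p = 1
    while not has_period(s, p):
        p += 1
    return p


def lemma3_holds(pattern):
    """Brute-force check of f[j] = j - P(alpha) and next[j] = j - q."""
    f, nxt = failure_tables(pattern)
    for j in range(1, len(pattern) + 1):
        alpha, alpha_a = pattern[:j - 1], pattern[:j]
        if f[j - 1] != j - shortest_period(alpha):
            return False
        # Smallest period of alpha that is not a period of alpha+a, if any.
        q = next((p for p in range(1, j)
                  if has_period(alpha, p) and not has_period(alpha_a, p)), None)
        if nxt[j - 1] != (0 if q is None else j - q):
            return False
    return True


f, nxt = failure_tables("abcabcacab")
print(f)    # f[1..10]
print(nxt)  # next[1..10]
print(lemma3_holds("abcabcacab"))
```

For example, with the pattern `abcabcacab` and $j = 8$, the prefix $\alpha$ = `abcabca` has shortest period 3, which is not a period of `abcabcac`, so both $f[8]$ and next[8] equal $8 - 3 = 5$.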

6. Palindromes. One of the most outstanding unsolved questions in the theory of computational complexity is the problem of how long it takes to determine whether or not a given string of length $n$ belongs to a given context-free language. For many years the best upper bound for this problem was $O(n^3)$ in a general context-free language as $n \to \infty$; L. G. Valiant has recently lowered this to $O(n^{\log_2 7})$. On the other hand, the problem isn't known to require more than order $n$ units of time for any particular language. This big gap between $O(n)$ and $O(n^{2.81})$ deserves to be closed, and hardly anyone believes that the final answer will be $O(n)$.

Let $\Sigma$ be a finite alphabet, let $\Sigma^*$ denote the strings over $\Sigma$, and let

$$P = \{\alpha\alpha^R \mid \alpha \in \Sigma^*\}.$$

Here $\alpha^R$ denotes the reversal of $\alpha$, i.e., $(a_1 a_2 \ldots a_n)^R = a_n \ldots a_2 a_1$. Each string $\pi$ in $P$ is a palindrome of even length, and conversely every even palindrome over $\Sigma$ is in $P$. At one time it was popularly believed that the language $P^*$ of "even palindromes starred", namely the set of palstars $\pi_1 \ldots \pi_n$ where each $\pi_i$ is in $P$, would be impossible to recognize in $O(n)$ steps on a random-access computer.

It isn't especially easy to spot members of this language. For example, a a b b a b b a is a palstar, but its decomposition into even palindromes might not be immediately apparent; and the reader might need several minutes to decide whether or not

baabbabbaababbaabbabbabaa

bbabbabbabbaabababbabbaab

is in $P^*$. We shall prove, however, that palstars can be recognized in $O(n)$ units of time, by using their algebraic properties.

Let us say that a nonempty palstar is prime if it cannot be written as the product of two nonempty palstars. A prime palstar must be an even palindrome $\alpha\alpha^R$ but the converse does not hold. By repeated decomposition, it is easy to see that every palstar $\beta$ is expressible as a product $\beta_1 \ldots \beta_t$ of prime palstars, for some $t \ge 0$; what is less obvious is that such a decomposition into prime factors is unique. This "fundamental theorem of palstars" is an immediate consequence of the following basic property.

LEMMA 1. A prime palstar cannot begin with another prime palstar.

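Proof. Let $\alpha\alpha^R$ be a prime palstar such that $\alpha\alpha^R = \beta\beta^R\gamma$ for some nonempty even palindrome $\beta\beta^R$ and some $\gamma \ne \epsilon$; furthermore, let $\beta\beta^R$ have minimum length among all such counterexamples. If $|\beta\beta^R| > |\alpha|$ then $\alpha\alpha^R = \beta\beta^R\gamma = \alpha\delta\gamma$ for some $\delta \ne \epsilon$; hence $\alpha^R = \delta\gamma$, and $\beta\beta^R = (\beta\beta^R)^R = (\alpha\delta)^R = \delta^R\alpha^R = \delta^R\delta\gamma$, contradicting the minimality of $|\beta\beta^R|$. Therefore $|\beta\beta^R| \le |\alpha|$; hence $\alpha = \beta\beta^R\delta$ for some $\delta$, and $\beta\beta^R\gamma = \alpha\alpha^R = \beta\beta^R\delta\delta^R\beta\beta^R$. But this implies that $\gamma$ is the palstar $\delta\delta^R\beta\beta^R$, contradicting the primality of $\alpha\alpha^R$.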

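COROLLARY (Left cancellation property.) If $\alpha\beta$ and $\alpha$ are palstars, so is $\beta$.

Proof. Let $\alpha = \alpha_1 \ldots \alpha_r$ and $\alpha\beta = \beta_1 \ldots \beta_s$ be prime factorizations of $\alpha$ and $\alpha\beta$. If $\alpha_1 \ldots \alpha_r = \beta_1 \ldots \beta_r$, then $\beta = \beta_{r+1} \ldots \beta_s$ is a palstar. Otherwise let $j$ be minimal with $\alpha_j \ne \beta_j$; then $\alpha_j$ begins with $\beta_j$ or vice versa, contradicting Lemma 1.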

LEMMA 2. If $\alpha$ is a string of length $n$, we can determine the length of the longest even palindrome $\beta \in P$ such that $\alpha = \beta\gamma$, in $O(n)$ steps.

Proof. Apply the pattern-matching algorithm with pattern $= \alpha$ and text $= \alpha^R$. When $k = n + 1$ the algorithm will stop with $j$ maximal such that pattern[1] ... pattern[j-1] = text[n+2-j] ... text[n]. Now perform the following iteration:

    while j ≥ 3 and j even do j := f(j).

By the theory developed in §3, this iteration terminates with $j \ge 3$ if and only if $\alpha$ begins with a nonempty even palindrome, and $j - 1$ will be the length of the largest such palindrome. (Note that f[j] must be used here instead of next[j]; e.g. consider the case $\alpha$ = a a b a a b. But the pattern matching process takes $O(n)$ time even when f[j] is used.)
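The construction in this proof can be sketched in Python. This is an illustrative adaptation rather than the paper's program: it uses the common 0-based "prefix function" in place of the 1-based table f, so the variable `k` below plays the role of $j - 1$, and the function names are invented for this example. Matching the pattern $\alpha$ against the text $\alpha^R$ leaves `k` equal to the length of the longest palindromic prefix of $\alpha$; following failure links then visits the shorter palindromic prefixes in decreasing order of length until one of even length is reached.

```python
def prefix_function(p):
    """fail[i] = length of the longest proper prefix of p[:i+1]
    that is also a suffix of p[:i+1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def longest_even_palindromic_prefix(s):
    """Length of the longest nonempty even palindrome that begins s, or 0."""
    fail = prefix_function(s)
    k = 0
    # Scan the reversed string as the text. Since the text has the same
    # length as the pattern, k can reach len(s) only on the last character,
    # so s[k] is always a valid index inside the loop.
    for c in reversed(s):
        while k > 0 and c != s[k]:
            k = fail[k - 1]
        if c == s[k]:
            k += 1
    # k is now the length of the longest palindromic prefix of s.
    # Follow failure links (the analogue of j := f(j)) until k is even.
    while k > 0 and k % 2 == 1:
        k = fail[k - 1]
    return k


print(longest_even_palindromic_prefix("aabaab"))
```

On the paper's example `aabaab`, the scan ends with `k = 5` (the palindromic prefix `aabaa`), and one failure link leads to `k = 2` (the even palindrome `aa`).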

THEOREM. Let $L$ be any language such that $L^*$ has the left cancellation property and such that, given any string $\alpha$ of length $n$, we can find a nonempty $\beta \in L$ such that $\alpha$ begins with $\beta$ or we can prove that no such $\beta$ exists, in $O(n)$ steps. Then we can determine in $O(n)$ time whether or not a given string is in $L^*$.

Proof. Let $\alpha$ be any string, and suppose that the time required to test for nonempty prefixes in $L$ is $\le Kn$ for all large $n$. We begin by testing $\alpha$'s initial subsequences of lengths $1, 2, 4, \ldots, 2^k, \ldots$, and finally $\alpha$ itself, until finding a prefix in $L$ or until establishing that $\alpha$ has no such prefix. In the latter case, $\alpha$ is not in $L^*$, and we have consumed at most $(K + K_1) + (2K + K_1) + (4K + K_1) + \cdots + (|\alpha|K + K_1) < 2Kn + K_1 \log_2 n$ units of time for some constant $K_1$. But if we find a nonempty prefix $\beta \in L$ where $\alpha = \beta\gamma$, we have used at most $4|\beta|K + K_1(\log_2 |\beta|)$ units of time so far. By the left cancellation property, $\alpha \in L^*$ if and only if $\gamma \in L^*$, and since $|\gamma| = n - |\beta|$ we can prove by induction that at most $(4K + K_1)n$ units of time are needed to decide membership in $L^*$, when $n > 0$. □

COROLLARY. $P^*$ can be recognized in $O(n)$ time.
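The doubling strategy of the theorem, specialized to $L = P$, might look as follows in Python. This is a sketch under stated assumptions: the function names are invented here, the prefix test is a 0-based failure-function adaptation of Lemma 2 rather than the paper's 1-based program, and the string slicing is done for readability. The recognizer tests prefixes of lengths 1, 2, 4, ... of the unread part, strips the first nonempty even palindrome it finds, and relies on the left cancellation property to justify continuing with the remainder.

```python
def prefix_function(p):
    """fail[i] = length of the longest proper border of p[:i+1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def longest_even_palindromic_prefix(s):
    """Length of the longest nonempty even palindrome that begins s, or 0."""
    fail = prefix_function(s)
    k = 0
    for c in reversed(s):          # match pattern s against text reversed(s)
        while k > 0 and c != s[k]:
            k = fail[k - 1]
        if c == s[k]:
            k += 1
    while k > 0 and k % 2 == 1:    # drop to the longest even-length border
        k = fail[k - 1]
    return k


def is_palstar(s):
    """Decide membership in P* by repeatedly stripping an even-palindrome prefix."""
    pos, n = 0, len(s)
    while pos < n:
        size = 1
        while True:
            size = min(size, n - pos)
            found = longest_even_palindromic_prefix(s[pos:pos + size])
            if found or size == n - pos:
                break
            size *= 2              # windows of length 1, 2, 4, ..., then the rest
        if not found:
            return False           # the remainder begins with no even palindrome
        pos += found               # left cancellation: continue with the rest
    return True


print(is_palstar("aabbabba"))
```

For `aabbabba` the sketch strips `aa`, then `bb`, then `abba`, which exhibits the decomposition mentioned earlier in this section.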

Note that the related language

$$P_1^* = \{\pi \in \Sigma^* \mid \pi = \pi^R \text{ and } |\pi| \ge 2\}^*$$

cannot be handled by the above techniques, since it contains both a a a b b b and a a a b b b b a; the fundamental theorem of palstars fails with a vengeance. It is an open problem whether or not $P_1^*$ can be recognized in $O(n)$ time, although we suspect that it can be done. Once the reader has disposed of this problem, he or she is urged to tackle another language which has recently been introduced by S. A. Greibach [11], since the latter language is known to be as hard as possible; no context-free language can be harder to recognize except by a constant factor.

7. Historical remarks. The pattern-matching algorithm of this paper was discovered in a rather interesting way. One of the authors (J. H. Morris) was implementing a text-editor for the CDC 6400 computer during the summer of 1969, and since the necessary buffering was rather complicated he sought a method that would avoid backing up the text file. Using concepts of finite automata theory as a model, he devised an algorithm equivalent to the method presented above, although his original form of presentation made it unclear that the running time was $O(m + n)$. Indeed, it turned out that Morris's routine was too complicated for other implementors of the system to understand, and he discovered several months later that gratuitous "fixes" had turned his routine into a shambles.

In a totally independent development, another author (D. E. Knuth) learned early in 1970 of S. A. Cook's surprising theorem about two-way deterministic pushdown automata [5]. According to Cook's theorem, any language recognizable by a two-way deterministic pushdown automaton, in any amount of time, can be recognized on a random access machine in $O(n)$ units of time. Since D. Chester had recently shown that the set of strings beginning with an even palindrome could be recognized by such an automaton, and since Knuth couldn't imagine how to recognize such a language in less than about $n^2$ steps on a conventional computer, Knuth laboriously went through all the steps of Cook's construction as applied to Chester's automaton. His plan was to "distill off" what was happening, in order to discover why the algorithm worked so efficiently. After pondering the mass of details for several hours, he finally succeeded in abstracting the mechanism which seemed to be underlying the construction, in the special case of palindromes, and he generalized it slightly to a program capable of finding the longest prefix of one given string that occurs in another.

(Note added April, 1976.) Zvi Galil and Joel Seiferas have recently resolved this conjecture (that $P_1^*$ can be recognized in $O(n)$ time; see the end of §6) affirmatively.

This was the first time in Knuth's experience that automata theory had taught him how to solve a real programming problem better than he could solve it before. He showed his results to the third author (V. R. Pratt), and Pratt modified Knuth's data structure so that the running-time was independent of the alphabet size. When Pratt described the resulting algorithm to Morris, the latter recognized it as his own, and was pleasantly surprised to learn of the $O(m + n)$ time bound, which he and Pratt described in a memorandum [22]. Knuth was chagrined to learn that Morris had already discovered the algorithm, without knowing Cook's theorem; but the theory of finite-state machines had been of use to Morris too, in his initial conceptualization of the algorithm, so it was still legitimate to conclude that automata theory had actually been helpful in this practical problem.

The idea of scanning a string without backing up while looking for a pattern, in the case of a two-letter alphabet, is implicit in the early work of Gilbert [10] dealing with comma-free codes. It also is essentially a special case of Knuth's LR(0) parsing algorithm [16] when applied to the grammar

$$S \to aS, \quad \text{for each } a \text{ in the alphabet},$$
$$S \to \alpha,$$

where $\alpha$ is the pattern. Diethelm and Roizen [6] independently discovered the idea in 1971. Gilbert and Knuth did not discuss the preprocessing to build the next table, since they were mainly concerned with other problems, and the preprocessing algorithm given by Diethelm and Roizen was of order $m^2$. In the case of a binary (two-letter) alphabet, Diethelm and Roizen observed that the algorithm of §3 can be improved further: we can go immediately to "char matched" after j := next[j] in this case if next[j] > 0.

A conjecture by R. L. Rivest led Pratt to discover the $\log_\phi m$ upper bound on pattern movements between successive input characters, and Knuth showed that this was best possible by observing that Fibonacci strings have the curious properties proved in §5. Zvi Galil has observed that a real-time algorithm can be obtained by letting the text pointer move ahead in an appropriate manner while the $j$ pointer is moving down [9].

In his lectures at Berkeley, S. A. Cook had proved that $P^*$ was recognizable in $O(n \log n)$ steps on a random-access machine, and Pratt improved this to $O(n)$ using a preliminary form of the ideas in §6. The slightly more refined theory in the present version of §6 is joint work of Knuth and Pratt. Manacher [20] found another way to recognize palindromes in linear time, and Galil [9] showed how to improve this to real time. See also Slisenko [23].

It seemed at first that there might be a way to find the longest common substring of two given strings, in time $O(m + n)$; but the algorithm of this paper does not readily support any such extension, and Knuth conjectured in 1970 that such efficiency would be impossible to achieve. An algorithm due to Karp, Miller, and Rosenberg [13] solved the problem in $O((m + n) \log (m + n))$ steps, and this tended to support the conjecture (at least in the mind of its originator). However, Peter Weiner has recently developed a technique for solving the longest common substring problem in $O(m + n)$ units of time with a fixed alphabet, using tree structures in a remarkable new way [26]. Furthermore, Weiner's algorithm has the following interesting consequence, pointed out by E. McCreight: a text file can be processed (in linear time) so that it is possible to determine exactly how much of a pattern is necessary to identify a position in the text uniquely; as the pattern is being typed in, the system can interrupt as soon as it "knows" what the rest of the pattern must be! Unfortunately the time and space requirements for Weiner's algorithm grow with increasing alphabet size.

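If we consider the problem of scanning finite-state languages in general, it is known [1, §9.2] that the language defined by any regular expression of length $m$ is recognizable in $O(mn)$ units of time. When the regular expression has the form

$$\Sigma^*(\alpha_{1,1} + \cdots + \alpha_{1,s(1)})\,\Sigma^*(\alpha_{2,1} + \cdots + \alpha_{2,s(2)})\,\Sigma^* \cdots \Sigma^*(\alpha_{r,1} + \cdots + \alpha_{r,s(r)})\,\Sigma^*$$

the algorithm we have discussed shows that only $O(m + n)$ units of time are needed (considering $\Sigma^*$ as a character of length 1 in the expression).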

Recent work by M. J. Fischer and M. S. Paterson [8] shows that regular expressions of the form

$$\Sigma^* \alpha_1 \Sigma\, \alpha_2 \Sigma \cdots \Sigma\, \alpha_r \Sigma^*,$$

i.e., patterns with "don't care" symbols, can be identified in $O(n \log m \log\log m \log q)$ units of time, where $q$ is the alphabet size and $m = |\alpha_1 \alpha_2 \ldots \alpha_r| + r$. The constant of proportionality in their algorithm is extremely large, but the existence of their construction indicates that efficient new algorithms for general pattern matching problems probably remain to be discovered.

A completely different approach to pattern matching, based on hashing, has been proposed by Malcolm C. Harrison [12]. In certain applications, especially with very large text files and short patterns, Harrison's method may be significantly faster than the character-comparing method of the present paper, on the average, although the redundancy of English makes the performance of his method unclear.

8. Postscript: Faster pattern matching in strings.² In the spring of 1974, Robert S. Boyer and J. Strother Moore and (independently) R. W. Gosper noticed that there is an even faster way to match pattern strings, by skipping more rapidly over portions of the text that cannot possibly lead to a match. Their idea was to look first at text[m], instead of text[1]. If we find that the character text[m] does not appear in the pattern at all, we can immediately shift the pattern right $m$ places. Thus, when the alphabet size $q$ is large, we need to inspect only about $n/m$ characters of the text, on the average! Furthermore if text[m] does occur in the pattern, we can shift the pattern by the minimum amount consistent with a match.

² This postscript was added by D. E. Knuth in March, 1976, because of developments which occurred after preprints of this paper were distributed.


Several interesting variations on this strategy are possible. For example, if text[m] does occur in the pattern, we might continue the search by looking at text[m-1], text[m-2], etc.; in a random file we will usually find a small value of $r$ such that the substring text[m-r] ... text[m] does not appear in the pattern, so we can shift the pattern $m - r$ places. If $r = \lfloor 2 \log_q m \rfloor$, there are more than $m^2$ possible values of text[m-r] ... text[m], but only $m - r$ substrings of length $r + 1$ in the pattern, hence the probability is $O(1/m)$ that text[m-r] ... text[m] occurs in the pattern. If it doesn't, we can shift the pattern right $m - r$ places; but if it does, we can determine all matches in positions $< m - r$ in $O(m)$ steps, shifting the pattern $m - r$ places by the method of this paper. Hence the expected number of characters examined among the first $m - \lfloor 2 \log_q m \rfloor$ is $O(\log_q m)$; this proves the existence of a linear worst-case algorithm which inspects $O(n (\log_q m)/m)$ characters in a random text. This upper bound on the average running time applies to all patterns, and there are some patterns (e.g., $a^m$ or $(ab)^{m/2}$) for which the expected number of characters examined by the algorithm is $O(n/m)$.
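The $O(1/m)$ probability estimate in this argument can be spelled out as follows. The derivation assumes that "random file" means the text characters are independent and uniformly distributed over the $q$ letters, which is an interpretation made for this worked step. With $r = \lfloor 2 \log_q m \rfloor$ we have $r + 1 > 2 \log_q m$, so

$$q^{r+1} > q^{2 \log_q m} = m^2,$$

and since the pattern contains at most $m - r$ distinct substrings of length $r + 1$,

$$\Pr\bigl[\,\text{text}[m-r] \ldots \text{text}[m] \text{ occurs in the pattern}\,\bigr] \le \frac{m - r}{q^{r+1}} < \frac{m - r}{m^2} < \frac{1}{m}.$$

The costly case (an $O(m)$ rescan) therefore contributes only $O(1)$ expected work per window, which is why the cheap case, costing about $r + 1 = O(\log_q m)$ inspections, dominates.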

Boyer and Moore have refined the skipping-by-m idea in another way. Their original algorithm may be expressed as follows using our conventions:

    k := m;
    while k ≤ n do
      begin j := m;
        while j > 0 and text[k] = pattern[j] do
          begin j := j - 1; k := k - 1;
          end;
        if j = 0 then
          begin match found at (k);
            k := k + m + 1
          end else
        k := k + max (d[text[k]], dd[j]);
      end;

This program calls match found at (k) for all $0 \le k \le n - m$ such that pattern[1] ... pattern[m] = text[k+1] ... text[k+m]. There are two precomputed tables, namely

$$d[a] = \min\{s \mid s = m \text{ or } (0 \le s < m \text{ and } \mathrm{pattern}[m-s] = a)\}$$

for each of the $q$ possible characters $a$, and

$$dd[j] = \min\{s + m - j \mid s \ge 1 \text{ and } ((s \ge i \text{ or } \mathrm{pattern}[i-s] = \mathrm{pattern}[i]) \text{ for } j < i \le m)\},$$

for $1 \le j \le m$.
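The program and the two table definitions above can be tried out with a short Python sketch. It is meant as an illustration under stated assumptions, not as an efficient implementation: the function names are invented here, both tables are computed by brute force directly from their definitions (so the preprocessing is far from the $O(q + m)$ and $O(m)$ bounds discussed in the text), and a leading blank is prepended to each string so that the 1-based indices of the paper can be used unchanged.

```python
def d_table(pattern):
    """d[a] = min{ s | 0 <= s < m and pattern[m-s] = a } (1-based pattern);
    characters that do not occur in the pattern are treated as d[a] = m."""
    m = len(pattern)
    d = {}
    for s in range(m - 1, -1, -1):       # larger s first, smaller s overwrite
        d[pattern[m - s - 1]] = s        # pattern[m-s] in 1-based indexing
    return d


def dd_table(pattern):
    """dd[j] = min{ s + m - j | s >= 1 and
                    (s >= i or pattern[i-s] = pattern[i]) for j < i <= m }."""
    m = len(pattern)
    p = " " + pattern
    dd = [0] * (m + 1)                   # dd[0] is unused
    for j in range(1, m + 1):
        s = 1
        while not all(s >= i or p[i - s] == p[i] for i in range(j + 1, m + 1)):
            s += 1                       # s = m always satisfies the condition
        dd[j] = s + m - j
    return dd


def boyer_moore(pattern, text):
    """Return every k with pattern[1..m] = text[k+1..k+m]."""
    m, n = len(pattern), len(text)
    p, t = " " + pattern, " " + text
    d, dd = d_table(pattern), dd_table(pattern)
    matches = []
    k = m
    while k <= n:
        j = m
        while j > 0 and t[k] == p[j]:    # compare right to left
            j -= 1
            k -= 1
        if j == 0:
            matches.append(k)            # "match found at (k)"
            k += m + 1
        else:
            k += max(d.get(t[k], m), dd[j])
    return matches


print(boyer_moore("abc", "xxabcxabc"))
```

In the sample call, the mismatch of `c` against the text character `x` at position 6 lets the pattern jump by $d[\texttt{x}] = m = 3$ places, which is the skipping behavior described at the start of this section.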


The d table can clearly be set up in $O(q + m)$ steps, and the dd table can be precomputed in $O(m)$ steps using a technique analogous to the method in §2 above, as we shall see. The Boyer-Moore paper [4] contains further exposition of the algorithm, including suggestions for highly efficient implementation, and gives both theoretical and empirical analyses. In the remainder of this section we shall show how the above methods can be used to resolve some of the problems left open in [4].

First let us improve the original Boyer-Moore algorithm slightly by replacing dd[j] by

$$dd'[j] = \min\{s + m - j \mid s \ge 1 \text{ and } (s \ge j \text{ or } \mathrm{pattern}[j-s] \ne \mathrm{pattern}[j]) \text{ and } ((s \ge i \text{ or } \mathrm{pattern}[i-s] = \mathrm{pattern}[i]) \text{ for } j < i \le m)\}.$$

(This is analogous to using next[j] instead of f[j]; Boyer and Moore [4] credit the improvement in this case to Ben Kuipers, but they do not discuss how to determine dd' efficiently.) The following program shows how the dd' table can be precomputed in $O(m)$ steps; for purposes of comparison, the program also shows how to compute dd, which actually turns out to require slightly more operations than dd':

    for k := 1 step 1 until m do dd[k] := dd'[k] := 2 × m - k;
    j := m; t := m + 1;
    while j > 0 do
      begin f[j] := t;
        while t ≤ m and pattern[j] ≠ pattern[t] do
          begin dd'[t] := min (dd'[t], m - j);
            t := f[t];
          end;
        t := t - 1; j := j - 1;
        dd[t] := min (dd[t], m - j);
      end;
    for k := 1 step 1 until t do
      begin dd[k] := min (dd[k], m + t - k);
        dd'[k] := min (dd'[k], m + t - k);
      end;

In practice one would, of course, compute only dd', suppressing all references to dd. The example in Table 2 illustrates most of the subtleties of this algorithm.

TABLE 2

    j          =  1  2  3  4  5  6  7  8  9 10 11
    pattern[j] =  b  a  d  b  a  c  b  a  c  b  a
    f[j]       = 10 11  6  7  8  9 10 11 11 11 12
    dd[j]      = 19 18 17 16 15  8  7  6  5  4  1
    dd'[j]     = 19 18 17 16 15  8 13 12  8 12  1
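The preprocessing program can be transcribed into Python and run on the pattern of Table 2. The sketch below follows the printed program step by step, using a padded string for 1-based indexing; the function name `precompute_tables` and the variable `ddp` (standing for dd') are choices made for this example. It is checked here only against the Table 2 pattern, so it should be read as an illustration of the printed program rather than as a verified general-purpose routine.

```python
def precompute_tables(pattern):
    """Transcription of the printed program; returns (f, dd, ddp) for j = 1..m,
    where ddp stands for the table dd'."""
    m = len(pattern)
    p = " " + pattern                     # p[1..m] is the pattern
    f = [0] * (m + 2)
    dd = [0] + [2 * m - k for k in range(1, m + 1)]
    ddp = list(dd)
    j, t = m, m + 1
    while j > 0:
        f[j] = t
        while t <= m and p[j] != p[t]:
            ddp[t] = min(ddp[t], m - j)
            t = f[t]
        t -= 1
        j -= 1
        dd[t] = min(dd[t], m - j)         # uses the already decremented j
    for k in range(1, t + 1):
        dd[k] = min(dd[k], m + t - k)
        ddp[k] = min(ddp[k], m + t - k)
    return f[1:m + 1], dd[1:], ddp[1:]


f, dd, ddp = precompute_tables("badbacbacba")
print("f   =", f)
print("dd  =", dd)
print("dd' =", ddp)
```

On this pattern the loop ends with $t = 9$, so the final pass caps entries 1 through 9 at $m + t - k = 20 - k$; this is where the values 19, 18, 17, 16, 15 in the first five columns come from.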


To prove correctness, one may show first that f[j] is analogous to the f[j] in §2, but with right and left of the pattern reversed; namely $f[m] = m + 1$, and for $j < m$ we have

$$f[j] = \min\{i \mid j < i \le m \text{ and } \mathrm{pattern}[i+1] \ldots \mathrm{pattern}[m] = \mathrm{pattern}[j+1] \ldots \mathrm{pattern}[m+j-i]\}.$$

Furthermore the final value of $t$ corresponds to $f[0]$ in this definition; $m - t$ is the maximum overlap of the pattern on itself. The correctness of dd[j] and dd'[j] for all $j$ now follows without much difficulty, by showing that the minimum value of $s$ in the definition of $dd[j_0]$ or $dd'[j_0]$ is discovered by the algorithm when $(t, j) = (j_0, j_0 - s)$.

The Boyer-Moore algorithm and its variants can have curiously anomalous behavior in unusual circumstances. For example, the method discovers more quickly that the pattern a a a a a a a c b does not appear in the text $(ab)^n$ if it suppresses the d heuristic entirely, i.e., if d[t] is set to $-\infty$ for all t. Likewise, dd actually turns out to be better than dd' when matching $a^{15}bcbabab$ in $(baabab)^n$, for large $n$.

Boyer and Moore showed that their algorithm has quadratic behavior in the worst case; the running time can be essentially proportional to pattern length times text length, for example when the pattern $ca(ba)^m$ occurs together with the text $(x^{2m}aa(ba)^m)^n$. They observed that this particular example was handled in linear time when Kuipers' improvement (dd' for dd) was made; but they left open the question of the true worst case behavior of the improved algorithm.

There are trivial cases in which the Boyer-Moore algorithm has quadratic behavior, when matching all occurrences of the pattern, for example when matching the pattern $a^m$ in the text $a^n$. But we are probably willing to accept such behavior when there are so many matches; the crucial issue is how long the algorithm takes in the worst case to scan over a text that does not contain the pattern at all. By extending the techniques of §5, it is possible to show that the modified Boyer-Moore algorithm is linear in such a situation:

THEOREM. If the above algorithm is used with dd' replacing dd, and if the text does not contain any occurrences of the pattern, the total number of characters matched is at most $6n$.

Proof. An execution of the algorithm consists of a series of stages, in which $m_k$ characters are matched and then the pattern is shifted $s_k$ places, for $k = 1, 2, \ldots$. We want to show that $\sum m_k \le 6n$; the proof is based on breaking this cost into three parts, two of which are trivially $O(n)$ and the third of which is less obviously so.

Let $m_k' = m_k - 2s_k$ if $m_k > 2s_k$; otherwise let $m_k' = 0$. When $m_k' > 0$, we will say that the leftmost $m_k'$ text characters matched during the $k$th stage have been "tapped". It suffices to prove that the algorithm taps characters at most $4n$ times, since $\sum m_k \le \sum m_k' + 2\sum s_k$ and $\sum s_k \le n$. Unfortunately it is possible for some characters of the text to be tapped roughly $\log m$ times, so we need to argue carefully that $\sum m_k' \le 4n$.
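The way the three parts combine can be written out explicitly. Since $m_k' = \max(m_k - 2s_k, 0)$, each stage satisfies $m_k \le m_k' + 2s_k$, and therefore, assuming the tap bound $\sum m_k' \le 4n$ that the remainder of the proof sets out to establish,

$$\sum_k m_k \;\le\; \sum_k m_k' + 2\sum_k s_k \;\le\; 4n + 2n \;=\; 6n.$$

The two trivially linear parts are the two copies of $\sum s_k \le n$; the tap count $\sum m_k'$ is the part that needs the careful argument.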

Suppose the rightmost $m_k''$ of the $m_k'$ text characters tapped during the $k$th stage are matched again during some later stage, but the leftmost $m_k' - m_k''$ are
