FAST PATTERN MATCHING IN STRINGS*
DONALD E. KNUTH†, JAMES H. MORRIS, JR.‡ AND VAUGHAN R. PRATT¶
Abstract. An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings. The constant of proportionality is low enough to make this algorithm of practical use, and the procedure can also be extended to deal with some more general pattern-matching problems. A theoretical application of the algorithm shows that the set of concatenations of even palindromes, i.e., the language $\{\alpha\alpha^R\}^*$, can be recognized in linear time. Other algorithms which run even faster on the average are also considered.
Key words. pattern, string, text-editing, pattern-matching, trie memory, searching, period of a string, palindrome, optimum algorithm, Fibonacci string, regular expression
Text-editing programs are often required to search through a string of characters looking for instances of a given "pattern" string; we wish to find all positions, or perhaps only the leftmost position, in which the pattern occurs as a contiguous substring of the text. For example, catenary contains the pattern ten, but we do not regard canary as a substring.
The obvious way to search for a matching pattern is to try searching at every starting position of the text, abandoning the search as soon as an incorrect character is found. But this approach can be very inefficient, for example when we are looking for an occurrence of aaaaaaab in aaaaaaaaaaaaaab. When the pattern is $a^n b$ and the text is $a^{2n} b$, we will find ourselves making $(n + 1)^2$ comparisons of characters. Furthermore, the traditional approach involves "backing up" the input text as we go through it, and this can add annoying complications when we consider the buffering operations that are frequently involved.
In this paper we describe a pattern-matching algorithm which finds all occurrences of a pattern of length m within a text of length n in O(m + n) units of time, without "backing up" the input text. The algorithm needs only O(m) locations of internal memory if the text is read from an external file, and only O(log m) units of time elapse between consecutive single-character inputs. All of the constants of proportionality implied by these "O" formulas are independent of the alphabet size.

* Received by the editors August 29, 1974, and in revised form April 7, 1976.
† Computer Science Department, Stanford University, Stanford, California 94305. The work of this author was supported in part by the National Science Foundation under Grant GJ 36473X and by the Office of Naval Research under Contract NR 044-402.
‡ Xerox Palo Alto Research Center, Palo Alto, California 94304. The work of this author was supported in part by the National Science Foundation under Grant GP 7635 at the University of California, Berkeley.
¶ Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139. The work of this author was supported in part by the National Science Foundation under Grant GP-6945 at University of California, Berkeley, and under Grant GJ-992 at Stanford University.
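The quadratic behaviour of the obvious method on the $a^n b$ example can be checked directly. The following is a minimal sketch in Python (the function name `naive_search` and the 0-based indexing are choices made for this illustration, not part of the paper); it counts character comparisons while trying every starting position.

```python
def naive_search(pattern: str, text: str) -> tuple[int, int]:
    """Try every starting position; return (0-based match index or -1, comparisons)."""
    m, n = len(pattern), len(text)
    comparisons = 0
    for start in range(n - m + 1):
        matched = 0
        while matched < m:
            comparisons += 1
            if text[start + matched] != pattern[matched]:
                break  # abandon this starting position at the first wrong character
            matched += 1
        if matched == m:
            return start, comparisons
    return -1, comparisons


n = 7
position, comparisons = naive_search("a" * n + "b", "a" * (2 * n) + "b")
print(position, comparisons)  # 7 64, i.e. (n + 1)^2 comparisons
```

Each of the first n starting positions costs n + 1 comparisons before failing on the final b, and the last position costs n + 1 more, which gives the $(n + 1)^2$ total.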
We shall first consider the algorithm in a conceptually simple but somewhat inefficient form. Sections 3 and 4 of this paper discuss some ways to improve the efficiency and to adapt the algorithm to other problems. Section 5 develops the underlying theory, and §6 uses the algorithm to disprove the conjecture that a certain context-free language cannot be recognized in linear time. Section 7 discusses the origin of the algorithm and its relation to other recent work. Finally, §8 discusses still more recent work on pattern matching.
1. Informal development. The idea behind this approach to pattern matching is perhaps easiest to grasp if we imagine placing the pattern over the text and sliding it to the right in a certain way. Consider for example a search for the pattern abcabcacab in the text babcbabcabcaabcabcabcacabc; initially we place the pattern at the extreme left and prepare to scan the leftmost character of the input text:

    a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
    ↑

The arrow here indicates the current text character; since it points to b, which doesn’t match that a, we shift the pattern one space right and move to the next input character:

      a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
      ↑

Now we have a match, so the pattern stays put while the next several characters are scanned. Soon we come to another mismatch:

      a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
            ↑

At this point we have matched the first three pattern characters but not the fourth, so we know that the last four characters of the input have been a b c x where x ≠ a; we don’t have to remember the previously scanned characters, since our position in the pattern yields enough information to recreate them. In this case, no matter what x is (as long as it’s not a), we deduce that the pattern can immediately be shifted four more places to the right; one, two, or three shifts couldn’t possibly lead to a match.
Soon we get to another partial match, this time with a failure on the eighth pattern character:

              a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
                            ↑

Now we know that the last eight characters were a b c a b c a x, where x ≠ c. The pattern should therefore be shifted three places to the right:

                    a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
                            ↑

We try to match the new pattern character, but this fails too, so we shift the pattern four (not three or five) more places. That produces a match, and we continue scanning until reaching another mismatch on the eighth pattern character:

                            a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
                                          ↑

Again we shift the pattern three places to the right; this time a match is produced, and we eventually discover the full pattern:

                                  a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c

The play-by-play description for this example indicates that the pattern-matching process will run efficiently if we have an auxiliary table that tells us exactly how far to slide the pattern, when we detect a mismatch at its jth character pattern[j]. Let next[j] be the character position in the pattern which should be checked next after such a mismatch, so that we are sliding the pattern j − next[j] places relative to the text. The following table lists the appropriate values:

             j = 1  2  3  4  5  6  7  8  9  10
    pattern[j] = a  b  c  a  b  c  a  c  a  b
       next[j] = 0  1  1  0  1  1  0  5  0  1

(Note that next[j] = 0 means that we are to slide the pattern all the way past the current text character.) We shall discuss how to precompute this table later; fortunately, the calculations are quite simple, and we will see that they require only O(m) steps.
At each step of the scanning process, we move either the text pointer or the pattern, and each of these can move at most n times; so at most 2n steps need to be performed, after the next table has been set up. Of course the pattern itself doesn’t really move; we can do the necessary operations simply by maintaining the pointer variable j.
2. Programming the algorithm. The pattern-match process has the general form

    place pattern at left;
    while pattern not fully matched and text not exhausted do
      begin
        while pattern character differs from current text character
          do shift pattern appropriately;
        advance to next character of text;
      end;

For convenience, let us assume that the input text is present in an array text[1:n], and that the pattern appears in pattern[1:m]. We shall also assume that m > 0, i.e., that the pattern is nonempty. Let k and j be integer variables such that text[k] denotes the current text character and pattern[j] denotes the corresponding pattern character; thus, the pattern is essentially aligned with positions p + 1 through p + m of the text, where k = p + j. Then the above program takes the following simple form:

    j := k := 1;
    while j ≤ m and k ≤ n do
      begin
        while j > 0 and text[k] ≠ pattern[j]
          do j := next[j];
        k := k + 1; j := j + 1;
      end;

If j > m at the conclusion of the program, the leftmost match has been found in positions k − m through k − 1; but if j ≤ m, the text has been exhausted. (The and operation here is the "conditional and" which does not evaluate the relation text[k] ≠ pattern[j] unless j > 0.)
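The program above translates almost line for line into Python. The sketch below is an illustration under a few assumptions: the name `kmp_leftmost` is invented here, both strings are padded with one leading blank so that the 1-based indices of the text carry over unchanged, and the next table is supplied by the caller (index 0 is unused padding).

```python
def kmp_leftmost(pattern: str, text: str, nxt: list[int]) -> int:
    """Return the 1-based position of the leftmost match, or 0 if the text is exhausted."""
    m, n = len(pattern), len(text)
    p = " " + pattern  # pad so that p[1] is the first pattern character
    t = " " + text     # pad so that t[1] is the first text character
    j = k = 1
    while j <= m and k <= n:
        # Python's "and" short-circuits, so p[j] is not examined when j == 0.
        while j > 0 and t[k] != p[j]:
            j = nxt[j]
        k += 1
        j += 1
    return k - m if j > m else 0


# next table for the pattern of Section 1; entry 0 is padding.
NEXT = [0, 0, 1, 1, 0, 1, 1, 0, 5, 0, 1]
print(kmp_leftmost("abcabcacab", "babcbabcabcaabcabcabcacabc", NEXT))  # 16
```

The result 16 agrees with the final alignment of the worked example in §1, where the pattern covers text positions 16 through 25.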
The program has a curious feature, namely that the inner loop operation "j := next[j]" is performed no more often than the outer loop operation "k := k + 1"; in fact, the inner loop is usually performed somewhat less often, since the pattern generally moves right less frequently than the text pointer does.
To prove rigorously that the above program is correct, we may use the following invariant relation: "Let p = k − j (i.e., the position in the text just preceding the first character of the pattern, in our assumed alignment). Then we have text[p + i] = pattern[i] for 1 ≤ i < j (i.e., we have matched the first j − 1 characters of the pattern, if j > 0); but for 0 ≤ t < p we have text[t + i] ≠ pattern[i] for some i, where 1 ≤ i ≤ m (i.e., there is no possible match of the entire pattern to the left of p)."
The program will of course be correct only if we can compute the next table so that the above relation remains invariant when we perform the operation j := next[j]. Let us look at that computation now. When the program sets j := next[j], we know that j > 0, and that the last j characters of the input up to and including text[k] were

    pattern[1] ... pattern[j − 1] x

where x ≠ pattern[j]. What we want is to find the least amount of shift for which these characters can possibly match the shifted pattern; in other words, we want next[j] to be the largest i less than j such that the last i characters of the input were

    pattern[1] ... pattern[i − 1] x

and pattern[i] ≠ pattern[j]. (If no such i exists, we let next[j] = 0.) With this definition of next[j] it is easy to verify that text[t + 1] ... text[k] ≠ pattern[1] ... pattern[k − t] for k − j ≤ t < k − next[j]; hence the stated relation is indeed invariant, and our program is correct.
Now we must face up to the problem we have been postponing, the task of calculating next[j] in the first place. This problem would be easier if we didn’t require pattern[i] ≠ pattern[j] in the definition of next[j], so we shall consider the easier problem first. Let f[j] be the largest i less than j such that pattern[1] ... pattern[i − 1] = pattern[j − i + 1] ... pattern[j − 1]; since this condition holds vacuously for i = 1, we always have f[j] ≥ 1 when j > 1. By convention we let f[1] = 0. The pattern used in the example of §1 has the following f table:

             j = 1  2  3  4  5  6  7  8  9  10
    pattern[j] = a  b  c  a  b  c  a  c  a  b
          f[j] = 0  1  1  1  2  3  4  5  1  2

If pattern[j] = pattern[f[j]] then f[j + 1] = f[j] + 1; but if not, we can use essentially the same pattern-matching algorithm as above to compute f[j + 1], with text = pattern! (Note the similarity of the f[j] problem to the invariant condition of the matching algorithm. Our program calculates the largest j less than or equal to k such that pattern[1] ... pattern[j − 1] = text[k − j + 1] ... text[k − 1], so we can transfer the previous technology to the present problem.) The following program will compute f[j + 1], assuming that f[j] and next[1], ..., next[j − 1] have already been calculated:

    t := f[j];
    while t > 0 and pattern[j] ≠ pattern[t]
      do t := next[t];
    f[j + 1] := t + 1;

The correctness of this program is demonstrated as before; we can imagine two copies of the pattern, one sliding to the right with respect to the other.
For example, suppose we have established that f[8] = 5 in the above case; let us consider the computation of f[9]. The appropriate picture is

          a b c a b c a c a b
    a b c a b c a c a b
                  ↑

Since pattern[8] ≠ b, we shift the upper copy right, knowing that the most recently scanned characters of the lower copy were a b c a x for x ≠ b. The next table tells us to shift right four places, obtaining

                  a b c a b c a c a b
    a b c a b c a c a b
                  ↑

and again there is no match. The next shift makes t = 0, so f[9] = 1.
Once we understand how to compute f, it is only a short step to the computation of next[j]. A comparison of the definitions shows that, for j > 1,

    next[j] = f[j],        if pattern[j] ≠ pattern[f[j]];
    next[j] = next[f[j]],  if pattern[j] = pattern[f[j]].

Therefore we can compute the next table as follows, without ever storing the values of f[j] in memory.

    j := 1; t := 0; next[1] := 0;
    while j < m do
      begin comment t = f[j];
        while t > 0 and pattern[j] ≠ pattern[t]
          do t := next[t];
        t := t + 1; j := j + 1;
        if pattern[j] = pattern[t]
          then next[j] := next[t]
          else next[j] := t;
      end.

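The preprocessing program can be sketched in Python as follows. Unlike the program above, this illustration also stores f[j], purely so that both tables can be compared with the ones printed earlier; the name `failure_tables` and the leading padding entry at index 0 are assumptions of the sketch.

```python
def failure_tables(pattern: str) -> tuple[list[int], list[int]]:
    """Return (f, nxt) as 1-based tables; index 0 of each list is unused padding."""
    m = len(pattern)
    p = " " + pattern  # pad so that p[1] is the first pattern character
    f = [0] * (m + 1)
    nxt = [0] * (m + 1)  # nxt[1] = 0 and f[1] = 0 by convention
    j, t = 1, 0
    while j < m:
        # Invariant at this point: t == f[j].
        while t > 0 and p[j] != p[t]:
            t = nxt[t]  # slide the upper copy of the pattern to the right
        t += 1
        j += 1
        f[j] = t
        nxt[j] = nxt[t] if p[j] == p[t] else t
    return f, nxt


f, nxt = failure_tables("abcabcacab")
print(f[1:])    # [0, 1, 1, 1, 2, 3, 4, 5, 1, 2]
print(nxt[1:])  # [0, 1, 1, 0, 1, 1, 0, 5, 0, 1]
```

Both printed rows agree with the f and next tables given for the pattern abcabcacab.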
This program takes O(m) units of time, for the same reason as the matching program takes O(n): the operation t := next[t] in the innermost loop always shifts the upper copy of the pattern to the right, so it is performed a total of m times at most. (A slightly different way to prove that the running time is bounded by a constant times m is to observe that the variable t starts at 0 and it is increased, m − 1 times, by 1; furthermore its value remains nonnegative. Therefore the operation t := next[t], which always decreases t, can be performed at most m − 1 times.)
To summarize what we have said so far: Strings of text can be scanned efficiently by making use of two ideas. We can precompute "shifts", specifying how to move the given pattern when a mismatch occurs at its jth character; and this precomputation of "shifts" can be performed efficiently by using the same principle, shifting the pattern against itself.
3. Gaining efficiency. We have presented the pattern-matching algorithm in a form that is rather easily proved correct; but as so often happens, this form is not very efficient. In fact, the algorithm as presented above would probably not be competitive with the naive algorithm on realistic data, even though the naive algorithm has a worst-case time of order m times n instead of m plus n, because the chance of this worst case is rather slim. On the other hand, a well-implemented form of the new algorithm should go noticeably faster because there is no backing up after a partial match.
It is not difficult to see the source of inefficiency in the new algorithm as presented above: When the alphabet of characters is large, we will rarely have a partial match, and the program will waste a lot of time discovering rather awkwardly that text[k] ≠ pattern[1] for k = 1, 2, 3, .... When j = 1 and text[k] ≠ pattern[1], the algorithm sets j := next[1] = 0, then discovers that j = 0, then increases k by 1, then sets j to 1 again, then tests whether or not 1 is ≤ m, and later it tests whether or not 1 is greater than 0. Clearly we would be much better off making j = 1 into a special case.
The algorithm also spends unnecessary time testing whether j > m or k > n. A fully-matched pattern can be accounted for by setting pattern[m + 1] = "@" for some impossible character @ that will never be matched, and by letting next[m + 1] = −1; then a test for j < 0 can be inserted into a less-frequently executed part of the code. Similarly we can for example set text[n + 1] = "⊥" (another impossible character) and text[n + 2] = pattern[1], so that the test for k > n needn’t be made very often. (See [17] for a discussion of such more or less mechanical transformations on programs.)
The following form of the algorithm incorporates these refinements.

    a := pattern[1];
    pattern[m + 1] := ’@’; next[m + 1] := −1;
    text[n + 1] := ’⊥’; text[n + 2] := a;
    j := k := 1;
    get started: comment j = 1;
      while text[k] ≠ a do k := k + 1;
      if k > n then go to input exhausted;
    char matched: j := j + 1; k := k + 1;
    loop: comment j > 0;
      if text[k] = pattern[j] then go to char matched;
      j := next[j];
      if j = 1 then go to get started;
      if j = 0 then
        begin
          j := 1; k := k + 1;
          go to get started;
        end;
      if j > 0 then go to loop;
    comment text[k − m] through text[k − 1] matched;

This program will usually run faster than the naive algorithm; the worst case occurs when trying to find the pattern ab in a long string of a’s.
Similar ideas can be used to speed up the program which prepares the next table.
In a text-editor the patterns are usually short, so that it is most efficient to translate the pattern directly into machine-language code which implicitly contains the next table (cf. [3, Hack 179] and [24]). For example, the pattern in §1 could be compiled into the machine-language equivalent of

    L0:  k := k + 1;
    L1:  if text[k] ≠ a then go to L0;
         k := k + 1;
         if k > n then go to input exhausted;
    L2:  if text[k] ≠ b then go to L1;
         k := k + 1;
    L3:  if text[k] ≠ c then go to L1;
         k := k + 1;
    L4:  if text[k] ≠ a then go to L0;
         k := k + 1;
    L5:  if text[k] ≠ b then go to L1;
         k := k + 1;
    L6:  if text[k] ≠ c then go to L1;
         k := k + 1;
    L7:  if text[k] ≠ a then go to L0;
         k := k + 1;
    L8:  if text[k] ≠ c then go to L5;
         k := k + 1;
    L9:  if text[k] ≠ a then go to L0;
         k := k + 1;
    L10: if text[k] ≠ b then go to L1;
         k := k + 1;

This will be slightly faster, since it essentially makes a special case for all values of j.
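The correspondence between the compiled code and the next table is mechanical: label Lj tests pattern[j] and, on a mismatch, jumps to label L(next[j]). The following Python sketch only prints the labelled pseudocode (it does not emit real machine language); the name `compile_pattern` and the exact output format are assumptions made for illustration.

```python
def compile_pattern(pattern: str, nxt: list[int]) -> list[str]:
    """Emit one labelled test per pattern character; nxt is 1-based (index 0 unused)."""
    lines = ["L0: k := k+1;"]
    for j in range(1, len(pattern) + 1):
        char = pattern[j - 1]
        # A mismatch at position j transfers control to the label given by next[j].
        lines.append(f"L{j}: if text[k] ≠ {char} then go to L{nxt[j]};")
        lines.append("    k := k+1;")
        if j == 1:
            # The end-of-text test is only needed after the first character matches.
            lines.append("    if k > n then go to input exhausted;")
    return lines


for line in compile_pattern("abcabcacab", [0, 0, 1, 1, 0, 1, 1, 0, 5, 0, 1]):
    print(line)
```

With the next table of §1 this reproduces the jump targets of the listing, for instance the jump from L8 back to L5.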
It is a curious fact that people often think the new algorithm will be slower than the naive one, even though it does less work. Since the new algorithm is conceptually hard to understand at first, by comparison with other algorithms of the same length, we feel somehow that a computer will have conceptual difficulties too; we expect the machine to run more slowly when it gets to such subtle instructions!
4. Extensions. So far our programs have only been concerned with finding the leftmost match. However, it is easy to see how to modify the routine so that all matches are found in turn: We can calculate the next table for the extended pattern of length m + 1 using pattern[m + 1] = "@", and then we set resume := next[m + 1] before setting next[m + 1] to −1. After finding a match and doing whatever action is desired to process that match, the sequence

    j := resume; go to loop;

will restart things properly. (We assume that text has not changed in the meantime. Note that resume cannot be zero.)
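The resume idea can be tried out on the simple program of §2 rather than the optimized one. In the sketch below (an illustration only; `next_table`, `find_all` and the use of the NUL character as the "impossible" character @ are assumptions, and the text is assumed not to contain NUL), the next table is computed for the extended pattern of length m + 1 and its last entry is used as resume.

```python
def next_table(pattern: str) -> list[int]:
    """1-based next table (index 0 unused), computed as in Section 2."""
    m = len(pattern)
    p = " " + pattern
    nxt = [0] * (m + 1)
    j, t = 1, 0
    while j < m:
        while t > 0 and p[j] != p[t]:
            t = nxt[t]
        t += 1
        j += 1
        nxt[j] = nxt[t] if p[j] == p[t] else t
    return nxt


def find_all(pattern: str, text: str) -> list[int]:
    """Return the 1-based starting positions of all matches, overlapping ones included."""
    m, n = len(pattern), len(text)
    sentinel = "\0"  # stands in for the impossible character '@'
    nxt = next_table(pattern + sentinel)
    resume = nxt[m + 1]  # where to continue in the pattern after a full match
    p = " " + pattern
    t = " " + text
    matches = []
    j = 1
    for k in range(1, n + 1):
        while j > 0 and t[k] != p[j]:
            j = nxt[j]
        j += 1
        if j > m:  # pattern[1..m] matched text[k-m+1..k]
            matches.append(k - m + 1)
            j = resume
    return matches


print(find_all("aba", "ababa"))  # [1, 3]
```

Because the sentinel never equals a pattern character, next[m + 1] coincides with f[m + 1], which is at least 1; this is the sense in which resume cannot be zero.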
Another approach would be to leave next[m + 1] untouched, never changing it to −1, and to define integer arrays head[1:m] and link[1:n] initially zero, and to insert the code

    link[k] := head[j]; head[j] := k;

at label "char matched". The test "if j > 0 then" is also removed from the program. This forms linked lists for 1 ≤ j ≤ m of all places where the first j characters of the pattern (but no more than j) are matched in the input.
Still another straightforward modification will find the longest initial match of the pattern, i.e., the maximum j such that pattern[1] ... pattern[j] occurs in text.
In practice, the text characters are often packed into words, with say b characters per word, and the machine architecture often makes it inconvenient to access individual characters. When efficiency for large n is important on such machines, one alternative is to carry out b independent searches, one for each possible alignment of the pattern’s first character in the word. These searches can treat entire words as "supercharacters", with appropriate masking, instead of working with individual characters and unpacking them. Since the algorithm we have described does not depend on the size of the alphabet, it is well suited to this and similar alternatives.
Sometimes we want to match two or more patterns in sequence, finding an occurrence of the first followed by the second, etc.; this is easily handled by consecutive searches, and the total running time will be of order n plus the sum of the individual pattern lengths.
We might also want to match two or more patterns in parallel, stopping as soon as any one of them is fully matched. A search of this kind could be done with multiple next and pattern tables, with one j pointer for each; but this would make the running time kn plus the sum of the pattern lengths, when there are k patterns. Hopcroft and Karp have observed (unpublished) that our pattern-matching algorithm can be extended so that the running time for simultaneous searches is proportional simply to n, plus the alphabet size times the sum of the pattern lengths. The patterns are combined into a "trie" whose nodes represent all of the initial substrings of one or more patterns, and whose branches specify the appropriate successor node as a function of the next character in the input text. For example, if there are four patterns {abcab, ababc, bcac, bbc}, the trie is shown in Fig. 1.

    node       0   1   2    3    4      5    6      7    8    9     10
    substring  ε   a   ab   abc  abca   aba  abab   b    bc   bca   bb
    if a       1   1   5    4    1      1    5      1    9    1     1
    if b       7   2   10   7    abcab  6    10     10   7    2     10
    if c       0   0   3    0    bcac   0    ababc  8    0    bcac  bbc

    FIG. 1

Such a trie can be constructed efficiently by generalizing the idea we used to calculate next[j]; details and further refinements have been discussed by Aho and Corasick [2], who discovered the algorithm independently. (Note that this algorithm depends on the alphabet size; such dependence is inherent, if we wish to keep the coefficient of n independent of k, since for example the k patterns might each consist of a single unique character.) It is interesting to compare this approach to what happens when the LR(0) parsing algorithm is applied to the regular grammar S → aS | bS | cS | abcab | ababc | bcac | bbc.
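A trie with a complete successor table, in the spirit of Fig. 1, can be built with a breadth-first pass over the nodes. The sketch below follows the textbook Aho–Corasick construction rather than any code from the paper; the names `build_automaton` and `search` are invented here, states are numbered in insertion order (so the numbering differs from the figure), and full patterns are kept as ordinary states with an output set instead of being drawn as terminal entries.

```python
from collections import deque


def build_automaton(patterns, alphabet):
    """Return (goto, out): goto[s][c] is the successor state, out[s] the patterns ending at s."""
    goto, out = [{}], [set()]
    for pat in patterns:  # phase 1: insert every pattern into the trie
        s = 0
        for c in pat:
            if c not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(pat)
    fail = [0] * len(goto)
    queue = deque()
    for c in alphabet:  # phase 2: complete the root, then proceed level by level
        if c in goto[0]:
            queue.append(goto[0][c])
        else:
            goto[0][c] = 0
    while queue:
        s = queue.popleft()
        for c in alphabet:
            if c in goto[s]:
                child = goto[s][c]
                fail[child] = goto[fail[s]][c]  # longest proper suffix that is a trie node
                out[child] |= out[fail[child]]
                queue.append(child)
            else:
                goto[s][c] = goto[fail[s]][c]   # fill in the missing branch
    return goto, out


def search(text, goto, out):
    """Return (1-based start, pattern) for every occurrence of every pattern."""
    hits, s = [], 0
    for i, c in enumerate(text, start=1):
        s = goto[s].get(c, 0)  # characters outside the alphabet reset to the root
        hits.extend((i - len(pat) + 1, pat) for pat in out[s])
    return hits


goto, out = build_automaton(["abcab", "ababc", "bcac", "bbc"], "abc")
print(sorted(search("abcabcacbbc", goto, out)))
```

The construction spends time proportional to the alphabet size times the number of trie nodes, and the scan then uses one table lookup per text character, matching the running time stated above.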
5. Theoretical considerations. If the input file is being read in "real time", we might object to long delays between consecutive inputs. In this section we shall prove that the number of times j := next[j] is performed, before k is advanced, is bounded by a function of the approximate form $\log_\phi m$, where $\phi = (1 + \sqrt{5})/2 \approx 1.618$ is the golden ratio, and that this bound is best possible. We shall use lower case Latin letters to represent characters, and lower case Greek letters $\alpha, \beta, \ldots$ to represent strings, with $\epsilon$ the empty string and $|\alpha|$ the length of $\alpha$. Thus $|a| = 1$ for all characters $a$; $|\alpha\beta| = |\alpha| + |\beta|$; and $|\epsilon| = 0$. We also write $\alpha[k]$ for the kth character of $\alpha$, when $1 \le k \le |\alpha|$.
As a warmup for our theoretical discussion, let us consider the Fibonacci strings [14, exercise 1.2.8-36], which turn out to be especially pathological patterns for the above algorithm. The definition of Fibonacci strings is

(1)    $\phi_1 = b$,  $\phi_2 = a$;  $\phi_n = \phi_{n-1}\phi_{n-2}$  for $n \ge 3$.

For example, $\phi_3 = ab$, $\phi_4 = aba$, $\phi_5 = abaab$. It follows that the length $|\phi_n|$ is the nth Fibonacci number $F_n$, and that $\phi_n$ consists of the first $F_n$ characters of an infinite string $\phi_\infty$ when $n \ge 2$.
Consider the pattern $\phi_8$, which has the functions f[j] and next[j] shown in Table 1.
TABLE 1

             j =  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
    pattern[j] =  a  b  a  a  b  a  b  a  a  b  a  a  b  a  b  a  a  b  a  b  a
          f[j] =  0  1  1  2  2  3  4  3  4  5  6  7  5  6  7  8  9 10 11 12  8
       next[j] =  0  1  0  2  1  0  4  0  2  1  0  7  1  0  4  0  2  1  0 12  0

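The table for $\phi_8$ can be regenerated mechanically, which is a convenient check on the definitions. The sketch below is illustrative (the names `fibonacci_string`, `next_table` and `next_chain` are assumptions of this example); it builds the Fibonacci string, computes its next table with the program of §2, and follows the chain of j := next[j] steps that starts at j = 20.

```python
def fibonacci_string(n: int) -> str:
    """phi_1 = 'b', phi_2 = 'a', phi_n = phi_{n-1} phi_{n-2} for n >= 3."""
    prev, cur = "b", "a"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def next_table(pattern: str) -> list[int]:
    """1-based next table (index 0 unused), computed as in Section 2."""
    m = len(pattern)
    p = " " + pattern
    nxt = [0] * (m + 1)
    j, t = 1, 0
    while j < m:
        while t > 0 and p[j] != p[t]:
            t = nxt[t]
        t += 1
        j += 1
        nxt[j] = nxt[t] if p[j] == p[t] else t
    return nxt


def next_chain(nxt: list[int], j: int) -> list[int]:
    """Successive values taken by j under repeated j := next[j], down to 0."""
    chain = [j]
    while j > 0:
        j = nxt[j]
        chain.append(j)
    return chain


phi8 = fibonacci_string(8)
print(phi8, len(phi8))                     # abaababaabaababaababa 21
print(next_chain(next_table(phi8), 20))    # [20, 12, 7, 4, 2, 1, 0]
```

Each value in the chain is one less than a Fibonacci number, which is the pattern that the discussion following the table makes precise.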
If we extend this pattern to $\phi_\infty$, we obtain infinite sequences f[j] and next[j] having the same general character. It is possible to prove by induction that

(2)    $f[j] = j - F_{k-1}$  for $F_k \le j < F_{k+1}$,

because of the following remarkable near-commutative property of Fibonacci strings:

(3)    $\phi_{n-2}\phi_{n-1} = c(\phi_{n-1}\phi_{n-2})$  for $n \ge 3$,

where $c(\alpha)$ denotes changing the two rightmost characters of $\alpha$. For example, $\phi_6 = abaab \cdot aba$ and $c(\phi_6) = aba \cdot abaab$. Equation (3) is obvious when n = 3; and for n > 3 we have $c(\phi_{n-2}\phi_{n-1}) = \phi_{n-2}c(\phi_{n-1}) = \phi_{n-2}\phi_{n-3}\phi_{n-2} = \phi_{n-1}\phi_{n-2}$ by induction; hence $c(\phi_{n-1}\phi_{n-2}) = c(c(\phi_{n-2}\phi_{n-1})) = \phi_{n-2}\phi_{n-1}$.
Equation (3) implies that

(4)    $next[F_k - 1] = F_{k-1} - 1$  for $k \ge 3$.

Therefore if we have a mismatch when $j = F_8 - 1 = 20$, our algorithm might set j := next[j] for the successive values 20, 12, 7, 4, 2, 1, 0 of j. Since $F_k$ is $\phi^k/\sqrt{5}$ rounded to the nearest integer, it is possible to have up to $\sim\log_\phi m$ consecutive iterations of the j := next[j] loop.
We shall now show that Fibonacci strings actually are the worst case, i.e., that $\log_\phi m$ is also an upper bound. First let us consider the concept of periodicity in strings. We say that p is a period of $\alpha$ if

(5)    $\alpha[i] = \alpha[i + p]$  for $1 \le i \le |\alpha| - p$.

It is easy to see that p is a period of $\alpha$ if and only if

(6)    $\alpha = (\alpha_1\alpha_2)^k\alpha_1$

for some $k \ge 0$, where $|\alpha_1\alpha_2| = p$ and $\alpha_2 \ne \epsilon$. Equivalently, p is a period of $\alpha$ if and only if

(7)    $\alpha\theta_1 = \theta_2\alpha$

for some $\theta_1$ and $\theta_2$ with $|\theta_1| = |\theta_2| = p$. Condition (6) implies (7) with $\theta_1 = \alpha_2\alpha_1$ and $\theta_2 = \alpha_1\alpha_2$. Condition (7) implies (6), for we define $k = \lfloor |\alpha|/p \rfloor$ and observe that if k > 0, then $\alpha = \theta_2\beta$ implies $\beta\theta_1 = \theta_2\beta$ and $\lfloor |\beta|/p \rfloor = k - 1$; hence, reasoning inductively, $\alpha = \theta_2^k\alpha_1$ for some $\alpha_1$ with $|\alpha_1| < p$, and $\alpha_1\theta_1 = \theta_2\alpha_1$. Writing $\theta_2 = \alpha_1\alpha_2$ yields (6).
The relevance of periodicity to our algorithm is clear once we consider what it means to shift a pattern. If pattern[1] ... pattern[j − 1] = $\alpha$ ends with pattern[1] ... pattern[i − 1] = $\beta$, we have

(8)    $\alpha = \beta\theta_1 = \theta_2\beta$

where $|\theta_1| = |\theta_2| = j - i$, so the amount of shift j − i is a period of $\alpha$.
The construction of i = next[j] in our algorithm implies further that pattern[i], which is the first character of $\theta_1$, is unequal to pattern[j]. Let us assume that $\beta$ itself is subsequently shifted leaving a residue $\gamma$, so that

(9)    $\beta = \gamma\psi_1 = \psi_2\gamma$

where the first character of $\psi_1$ differs from that of $\theta_1$. We shall now prove that

(10)    $|\alpha| > |\beta| + |\gamma|$.

If $|\beta| + |\gamma| \ge |\alpha|$, there is an overlap of $d = |\beta| + |\gamma| - |\alpha|$ characters between the occurrences of $\beta$ and $\gamma$ in $\beta\theta_1 = \alpha = \theta_2\psi_2\gamma$; hence the first character of $\theta_1$ is $\gamma[d + 1]$. Similarly there is an overlap of d characters between the occurrences of $\beta$ and $\gamma$ in $\theta_2\beta = \alpha = \gamma\psi_1\theta_1$; hence the first character of $\psi_1$ is $\beta[d + 1]$. Since these characters are distinct, we obtain $\gamma[d + 1] \ne \beta[d + 1]$, contradicting (9). This establishes (10), and leads directly to the announced result:
THEOREM. The number of consecutive times that j := next[j] is performed, while one text character is being scanned, is at most $1 + \log_\phi m$.
Proof. Let $L_r$ be the length of the shortest string $\alpha$ as in the above discussion such that a sequence of r consecutive shifts is possible. Then $L_1 = 0$, $L_2 = 1$, and we have $|\beta| \ge L_{r-1}$, $|\gamma| \ge L_{r-2}$ in (10); hence $L_r \ge F_{r+1} - 1$ by induction on r. Now if r shifts occur we have $m \ge F_{r+1} \ge \phi^{r-1}$. □
The algorithm of §2 would run correctly in linear time even if f[j] were used instead of next[j], but the analogue of the above theorem would then be false. For example, the pattern $a^m$ leads to f[j] = j − 1 for 1 ≤ j ≤ m. Therefore if we matched $a^m$ to the text $a^{m-1}b\alpha$, using f[j] instead of next[j], the mismatch text[m] ≠ pattern[m] would be followed by m occurrences of j := f[j] and m − 1 redundant comparisons of text[m] with pattern[j], before k is advanced to m + 1.
The subject of periods in strings has several interesting algebraic properties, but a reader who is not mathematically inclined may skip to §6 since the following material is primarily an elaboration of some additional structure related to the above theorem.
LEMMA 1. If p and q are periods of $\alpha$, and $p + q \le |\alpha| + \gcd(p, q)$, then $\gcd(p, q)$ is a period of $\alpha$.
Proof. Let $d = \gcd(p, q)$, and assume without loss of generality that $d < p < q = p + r$. We have $\alpha[i] = \alpha[i + p]$ for $1 \le i \le |\alpha| - p$ and $\alpha[i] = \alpha[i + q]$ for $1 \le i \le |\alpha| - q$; hence $\alpha[i + r] = \alpha[i + q] = \alpha[i]$ for $1 + r \le i + r \le |\alpha| - p$, i.e.,

    $\alpha[i] = \alpha[i + r]$  for $1 \le i \le |\alpha| - q$.

Furthermore $\alpha = \beta\theta_1 = \theta_2\beta$ where $|\theta_1| = p$, and it follows that p and r are periods of $\beta$, where $p + r \le |\beta| + d = |\beta| + \gcd(p, r)$. By induction, d is a period of $\beta$. Since $|\beta| = |\alpha| - p \ge q - d \ge q - r = p = |\theta_1|$, the strings $\theta_1$ and $\theta_2$ (which have the respective forms $\beta_2\beta_1$ and $\beta_1\beta_2$ by (6) and (7)) are substrings of $\beta$; so they also have d as a period. The string $\alpha = (\beta_1\beta_2)^{k+1}\beta_1$ must now have d as a period, since any characters d positions apart are contained within $\beta_1\beta_2$ or $\beta_1\beta_1$. □
The result of Lemma 1 but with the stronger hypothesis $p + q \le |\alpha|$ was proved by Lyndon and Schützenberger in connection with a problem about free groups [19, Lem. 4]. The weaker hypothesis in Lemma 1 turns out to give the best possible bound: If $\gcd(p, q) < p < q$ we can find a string of length $p + q - \gcd(p, q) - 1$ for which $\gcd(p, q)$ is not a period. In order to see why this is so, consider first the example in Fig. 2 showing the most general strings of lengths 15 through 25 having both 11 and 15 as periods. (The strings are "most general" in the sense that any two character positions that can be different are different.)

    a b c d e f g h i j k a b c d
    a b c d a f g h i j k a b c d a
    a b c d a b g h i j k a b c d a b
    a b c d a b c h i j k a b c d a b c
    a b c d a b c d i j k a b c d a b c d
    a b c d a b c d a j k a b c d a b c d a
    a b c d a b c d a b k a b c d a b c d a b
    a b c d a b c d a b c a b c d a b c d a b c
    a b c a a b c a a b c a b c a a b c a a b c a
    a a c a a a c a a a c a a c a a a c a a a c a a
    a a a a a a a a a a a a a a a a a a a a a a a a a

    FIG. 2
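The rows of Fig. 2 can be recomputed by merging positions that a period forces to be equal. The sketch below uses a small union-find structure for this; it is an illustration rather than anything taken from the paper, the name `most_general_string` is invented here, and each equivalence class is labelled by the letter of its smallest position (which is how the figure names its symbols, and which limits the sketch to classes whose smallest index is below 26).

```python
def most_general_string(length: int, periods: tuple[int, ...]) -> str:
    """Most general string of the given length having every listed period."""
    parent = list(range(length))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p in periods:
        for i in range(length - p):  # period p forces position i to equal position i + p
            a, b = find(i), find(i + p)
            if a != b:
                parent[max(a, b)] = min(a, b)  # keep the smallest index as the root
    return "".join(chr(ord("a") + find(i)) for i in range(length))


for n in range(15, 26):
    s = most_general_string(n, (11, 15))
    print(n, len(set(s)), s)
```

For p = 11 and q = 15 the number of distinct symbols printed falls from 11 at length 15 to 1 at length 25 = p + q − gcd(p, q), one symbol per step.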
Note that the number of degrees of freedom, i.e., the number of distinct symbols, decreases by 1 at each step. It is not difficult to prove that the number cannot decrease by more than 1 as we go from $|\alpha| = n - 1$ to $|\alpha| = n$, since the only new relations are $\alpha[n] = \alpha[n - q] = \alpha[n - p]$; we decrease the number of distinct symbols by one if and only if positions n − q and n − p contain distinct symbols in the most general string of length n − 1. The lemma tells us that we are left with at most $\gcd(p, q)$ symbols when the length reaches $p + q - \gcd(p, q)$; on the other hand we always have exactly p symbols when the length is q. Therefore each of the $p - \gcd(p, q)$ steps must decrease the number of symbols by 1, and the most general string of length $p + q - \gcd(p, q) - 1$ must have exactly $\gcd(p, q) + 1$ distinct symbols. In other words, the lemma gives the best possible bound.
When p and q are relatively prime, the strings of length p + q − 2 on two symbols, having both p and q as periods, satisfy a number of remarkable properties, generalizing what we have observed earlier about Fibonacci strings. Since the properties of these pathological patterns may prove useful in other investigations, we shall summarize them in the following lemma.
LEMMA 2. Let the strings $\sigma(m, n)$ of length n be defined for all relatively prime pairs of integers $n \ge m \ge 0$ as follows:

(11)    $\sigma(0, 1) = a$,  $\sigma(1, 1) = b$,  $\sigma(1, 2) = ab$;
        $\sigma(m, m + n) = \sigma(n \bmod m, m)\,\sigma(m, n)$  if $0 < m < n$;
        $\sigma(n, m + n) = \sigma(m, n)\,\sigma(n \bmod m, m)$  if $0 < m < n$.

These strings satisfy the following properties:
(i) $\sigma(m, qm + r)\,\sigma(m - r, m) = \sigma(r, m)\,\sigma(m, qm + r)$, for m > 2;
(ii) $\sigma(m, n)$ has period m, for m > 1;
(iii) $c(\sigma(m, n)) = \sigma(n - m, n)$, for n > 2.
(The function $c(\alpha)$ was defined in connection with (3) above.)
Proof. We have, for 0 < m < n and q ≥ 2,

    $\sigma(m + n, q(m + n) + m) = \sigma(m, m + n)\,\sigma(m + n, (q - 1)(m + n) + m)$,
    $\sigma(m + n, q(m + n) + n) = \sigma(n, m + n)\,\sigma(m + n, (q - 1)(m + n) + n)$,
    $\sigma(m + n, 2m + n) = \sigma(m, m + n)\,\sigma(n \bmod m, m)$,
    $\sigma(m + n, m + 2n) = \sigma(n, m + n)\,\sigma(m, n)$;

hence, if $\theta_1 = \sigma(n \bmod m, m)$ and $\theta_2 = \sigma(m, n)$ and q ≥ 1,

(12)    $\sigma(m + n, q(m + n) + m) = (\theta_1\theta_2)^q\theta_1$,  $\sigma(m + n, q(m + n) + n) = (\theta_2\theta_1)^q\theta_2$.

It follows that

    $\sigma(m + n, q(m + n) + m)\,\sigma(n, m + n) = \sigma(m, m + n)\,\sigma(m + n, q(m + n) + m)$,
    $\sigma(m + n, q(m + n) + n)\,\sigma(m, m + n) = \sigma(n, m + n)\,\sigma(m + n, q(m + n) + n)$,

which combine to prove (i). Property (ii) also follows immediately from (12), except for the case m = 2, n = 2q + 1, $\sigma(2, 2q + 1) = (ab)^q a$, which may be verified directly. Finally, it suffices to verify property (iii) for $0 < m < \frac{1}{2}n$, since $c(c(\alpha)) = \alpha$; we must show that

    $c(\sigma(m, m + n)) = \sigma(m, n)\,\sigma(n \bmod m, m)$  for 0 < m < n.

When m ≤ 2 this property is easily checked, and when m > 2 it is equivalent by induction to

    $\sigma(m, m + n) = \sigma(m, n)\,\sigma(m - (n \bmod m), m)$  for 0 < m < n, m > 2.

Set $n \bmod m = r$, $\lfloor n/m \rfloor = q$, and apply property (i). □
By properties (ii) and (iii) of this lemma, $\sigma(p, p + q)$ minus its last two characters is the string of length p + q − 2 having periods p and q. Note that Fibonacci strings are just a very special case, since $\phi_n = \sigma(F_{n-1}, F_n)$. Another property of the $\sigma$ strings appears in [15]. A completely different proof of Lemma 1 and its optimality, and a completely different definition of $\sigma(m, n)$, were given by Fine and Wilf in 1965 [7]. These strings have a long history going back at least to the astronomer Johann Bernoulli in 1772; see [25, §2.13] and [21].
If $\alpha$ is any string, let $P(\alpha)$ be its shortest period. Lemma 1 implies that all periods q which are not multiples of $P(\alpha)$ must be greater than $|\alpha| - P(\alpha) + \gcd(q, P(\alpha))$. This is a rather strong condition in terms of the pattern matching algorithm, because of the following result.
LEMMA 3. Let $\alpha$ = pattern[1] ... pattern[j − 1] and let a = pattern[j]. In the pattern matching algorithm, $f[j] = j - P(\alpha)$, and $next[j] = j - q$, where q is the smallest period of $\alpha$ which is not a period of $\alpha a$. (If no such period exists, next[j] = 0.) If $P(\alpha)$ divides $P(\alpha a)$ and $P(\alpha a) < j$, then $P(\alpha) = P(\alpha a)$. If $P(\alpha)$ does not divide $P(\alpha a)$ or if $P(\alpha a) = j$, then $q = P(\alpha)$.
Proof. The characterizations of f[j] and next[j] follow immediately from the definitions. Since every period of $\alpha a$ is a period of $\alpha$, the only nonobvious statement is that $P(\alpha) = P(\alpha a)$ whenever $P(\alpha)$ divides $P(\alpha a)$ and $P(\alpha a) \ne j$. Let $P(\alpha) = p$ and $P(\alpha a) = mp$; then the (mp)th character from the right of $\alpha$ is a, as is the (m − 1)pth, ..., as is the pth; hence p is a period of $\alpha a$. □
Lemma 3 shows that the j := next[j] loop will almost always terminate quickly. If $P(\alpha) = P(\alpha a)$, then q must not be a multiple of $P(\alpha)$; hence by Lemma 1, $P(\alpha) + q \ge j + 1$. On the other hand $q > P(\alpha)$; hence $q > \frac{1}{2}j$ and $next[j] < \frac{1}{2}j$. In the other case $q = P(\alpha)$, we had better not have q too small, since q will be a period in the residual pattern after shifting, and next[next[j]] will be < q. To keep the loop running it is necessary for new small periods to keep popping up, relatively prime to the previous periods.
6. Palindromes.
One of the most outstanding unsolved questions in the theory of computational complexity is the problem of how long it takes to determine whether or not a given string of length n belongs to a given context-free language. For many years the best upper bound for this problem was $O(n^3)$ in a general context-free language as $n \to \infty$; L. G. Valiant has recently lowered this to $O(n^{\log_2 7})$. On the other hand, the problem isn’t known to require more than order n units of time for any particular language. This big gap between $O(n)$ and $O(n^{2.81})$ deserves to be closed, and hardly anyone believes that the final answer will be $O(n)$.
Let $\Sigma$ be a finite alphabet, let $\Sigma^*$ denote the strings over $\Sigma$, and let

    $P = \{\alpha\alpha^R \mid \alpha \in \Sigma^*\}$.

Here $\alpha^R$ denotes the reversal of $\alpha$, i.e., $(a_1 a_2 \ldots a_n)^R = a_n \ldots a_2 a_1$. Each string $\pi$ in P is a palindrome of even length, and conversely every even palindrome over $\Sigma$ is in P. At one time it was popularly believed that the language $P^*$ of "even palindromes starred", namely the set of palstars $\pi_1 \ldots \pi_n$ where each $\pi_i$ is in P, would be impossible to recognize in O(n) steps on a random-access computer.
It isn’t especially easy to spot members of this language. For example, aabbabba is a palstar, but its decomposition into even palindromes might not be immediately apparent; and the reader might need several minutes to decide whether or not

    baabbabbaababbaabbabbabaa
    bbabbabbabbaabababbabbaab

is in $P^*$. We shall prove, however, that palstars can be recognized in O(n) units of time, by using their algebraic properties.
Let us say that a nonempty palstar is prime if it cannot be written as the product of two nonempty palstars. A prime palstar must be an even palindrome $\alpha\alpha^R$ but the converse does not hold. By repeated decomposition, it is easy to see that every palstar $\beta$ is expressible as a product $\beta_1 \ldots \beta_t$ of prime palstars, for some $t \ge 0$; what is less obvious is that such a decomposition into prime factors is unique. This "fundamental theorem of palstars" is an immediate consequence of the following basic property.
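LEMMA 1. A prime palstar cannot begin with another prime palstar.

Proof. Let $\alpha\alpha^R$ be a prime palstar such that $\alpha\alpha^R = \beta\beta^R\gamma$ for some nonempty even palindrome $\beta\beta^R$ and some $\gamma \ne \epsilon$; furthermore, let $\beta\beta^R$ have minimum length among all such counterexamples. If $|\beta\beta^R| > |\alpha|$ then $\alpha\alpha^R = \beta\beta^R\gamma = \alpha\delta\gamma$ for some $\delta \ne \epsilon$; hence $\alpha^R = \delta\gamma$, and $\beta\beta^R = (\beta\beta^R)^R = (\alpha\delta)^R = \delta^R\alpha^R = \delta^R\delta\gamma$, contradicting the minimality of $|\beta\beta^R|$. Therefore $|\beta\beta^R| \le |\alpha|$; hence $\alpha = \beta\beta^R\delta$ for some $\delta$, and $\beta\beta^R\gamma = \alpha\alpha^R = \beta\beta^R\delta\delta^R\beta\beta^R$. But this implies that $\gamma$ is the palstar $\delta\delta^R\beta\beta^R$, contradicting the primality of $\alpha\alpha^R$.

COROLLARY (Left cancellation property.) If $\alpha\beta$ and $\alpha$ are palstars, so is $\beta$.

Proof. Let $\alpha = \alpha_1 \ldots \alpha_r$ and $\alpha\beta = \beta_1 \ldots \beta_s$ be prime factorizations of $\alpha$ and $\alpha\beta$. If $\alpha_1 \ldots \alpha_r = \beta_1 \ldots \beta_r$, then $\beta = \beta_{r+1} \ldots \beta_s$ is a palstar. Otherwise let j be minimal with $\alpha_j \ne \beta_j$; then $\alpha_j$ begins with $\beta_j$ or vice versa, contradicting Lemma 1.

LEMMA 2. If $\alpha$ is a string of length n, we can determine the length of the longest even palindrome $\beta \in P$ such that $\alpha = \beta\gamma$, in O(n) steps.

Proof. Apply the pattern-matching algorithm with pattern = $\alpha$ and text = $\alpha^R$. When k = n + 1 the algorithm will stop with j maximal such that pattern[1] ... pattern[j - 1] = text[n + 2 - j] ... text[n]. Now perform the following iteration:

    while j ≥ 3 and j even do j := f[j].

By the theory developed in §3, this iteration terminates with j ≥ 3 if and only if $\alpha$ begins with a nonempty even palindrome, and j - 1 will be the length of the largest such palindrome. (Note that f[j] must be used here instead of next[j]; e.g. consider the case $\alpha$ = aabaab. But the pattern matching process takes O(n) time even when f[j] is used.)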
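The iteration in this proof can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses 0-based indexing and the usual "prefix function" (the table pi below plays the role of f, with pi[i] the length of the longest proper border of the first i + 1 pattern characters), and the names prefix_function and longest_even_palindrome_prefix are hypothetical, not taken from the text.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def longest_even_palindrome_prefix(alpha):
    """Length of the longest even palindrome that is a prefix of alpha (0 if none)."""
    n = len(alpha)
    if n == 0:
        return 0
    pi = prefix_function(alpha)
    # Match pattern = alpha against text = reversed alpha. At the end, k is the
    # length of the longest prefix of alpha that is a suffix of the reversed
    # string, i.e. the longest palindromic prefix of alpha.
    k = 0
    for c in reversed(alpha):
        # k < n holds here, because at most n - 1 characters have been read.
        while k > 0 and alpha[k] != c:
            k = pi[k - 1]
        if alpha[k] == c:
            k += 1
    # Step down through successively shorter palindromic prefixes until the
    # length is even; this corresponds to "while j >= 3 and j even do j := f[j]".
    while k % 2 == 1:
        k = pi[k - 1]
    return k


print(longest_even_palindrome_prefix("aabaab"))  # 2
```

For aabaab the matching phase ends with the odd palindrome aabaa, and one step down the border chain reaches aa; a table that skips borders (as next does) could miss it, which is the point of the parenthetical note above.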
THEOREM. Let L be any language such that $L^*$ has the left cancellation property and such that, given any string $\alpha$ of length n, we can find a nonempty $\beta \in L$ such that $\alpha$ begins with $\beta$ or we can prove that no such $\beta$ exists, in O(n) steps. Then we can determine in O(n) time whether or not a given string is in $L^*$.

Proof. Let $\alpha$ be any string, and suppose that the time required to test for nonempty prefixes in L is $\le Kn$ for all large n. We begin by testing $\alpha$’s initial subsequences of lengths 1, 2, 4, ..., $2^k$, ..., and finally $\alpha$ itself, until finding a prefix in L or until establishing that $\alpha$ has no such prefix. In the latter case, $\alpha$ is not in $L^*$, and we have consumed at most

    $(K + K_1) + (2K + K_1) + (4K + K_1) + \cdots + (|\alpha|K + K_1) < 2Kn + K_1 \log n$

units of time for some constant $K_1$. But if we find a nonempty prefix $\beta \in L$ where $\alpha = \beta\gamma$, we have used at most $4|\beta|K + K_1(\log_2 |\beta|)$ units of time so far. By the left cancellation property, $\alpha \in L^*$ if and only if $\gamma \in L^*$, and since $|\gamma| = n - |\beta|$ we can prove by induction that at most $(4K + K_1)n$ units of time are needed to decide membership in $L^*$, when n > 0. □
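The strategy of this proof, specialized to L = the nonempty even palindromes, might be sketched in Python as below. The names even_palindrome_prefix and is_palstar are hypothetical. As a loudly labeled simplification, the prefix test here is a brute-force stand-in and not the linear-time method of Lemma 2, so this sketch shows the stripping and doubling logic but does not itself achieve the O(n) bound.

```python
def even_palindrome_prefix(s):
    """Length of some nonempty even palindrome prefix of s, or 0 if there is none.

    Brute-force stand-in (an assumption of this sketch): the theorem requires
    a linear-time test such as the one in Lemma 2.
    """
    for length in range(2, len(s) + 1, 2):
        if s[:length] == s[:length][::-1]:
            return length
    return 0


def is_palstar(alpha, find_prefix=even_palindrome_prefix):
    """Decide membership in L* by repeatedly stripping a nonempty prefix in L.

    Stripping any such prefix is safe because of the left cancellation property.
    """
    pos, n = 0, len(alpha)
    while pos < n:
        size = 1
        while True:
            # Test initial segments of lengths 1, 2, 4, ..., then the whole rest.
            window = min(size, n - pos)
            found = find_prefix(alpha[pos:pos + window])
            if found or window == n - pos:
                break
            size *= 2
        if not found:
            return False  # the remainder has no nonempty prefix in L
        pos += found
    return True


print(is_palstar("aabbabba"))  # True: aa, bb, abba
print(is_palstar("aabbab"))    # False
```

The doubling of the window is what keeps the total cost proportional to the length of the prefix actually found, provided find_prefix itself runs in linear time.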
COROLLARY. $P^*$ can be recognized in O(n) time.

Note that the related language

    $P_1^* = \{\pi \in \Sigma^* \mid \pi = \pi^R \text{ and } |\pi| \ge 2\}^*$

cannot be handled by the above techniques, since it contains both aaabbb and aaabbbba; the fundamental theorem of palstars fails with a vengeance. It is an open problem whether or not $P_1^*$ can be recognized in O(n) time, although we suspect that it can be done.¹ Once the reader has disposed of this problem, he or she is urged to tackle another language which has recently been introduced by S. A. Greibach [11], since the latter language is known to be as hard as possible; no context-free language can be harder to recognize except by a constant factor.

¹ (Note added April, 1976.) Zvi Galil and Joel Seiferas have recently resolved this conjecture affirmatively.

7. Historical remarks. The pattern-matching algorithm of this paper was discovered in a rather interesting way. One of the authors (J. H. Morris) was implementing a text-editor for the CDC 6400 computer during the summer of 1969, and since the necessary buffering was rather complicated he sought a method that would avoid backing up the text file. Using concepts of finite automata theory as a model, he devised an algorithm equivalent to the method presented above, although his original form of presentation made it unclear that the running time was O(m + n). Indeed, it turned out that Morris’s routine was too complicated for other implementors of the system to understand, and he discovered several months later that gratuitous "fixes" had turned his routine into a shambles.

In a totally independent development, another author (D. E. Knuth) learned early in 1970 of S. A. Cook’s surprising theorem about two-way deterministic pushdown automata [5]. According to Cook’s theorem, any language recognizable by a two-way deterministic pushdown automaton, in any amount of time, can be recognized on a random access machine in O(n) units of time. Since D. Chester had recently shown that the set of strings beginning with an even palindrome could be recognized by such an automaton, and since Knuth couldn’t imagine how to recognize such a language in less than about $n^2$ steps on a conventional computer, Knuth laboriously went through all the steps of Cook’s construction as applied to Chester’s automaton. His plan was to "distill off" what was happening, in order to discover why the algorithm worked so efficiently. After pondering the mass of details for several hours, he finally succeeded in abstracting the mechanism which seemed to be underlying the construction, in the special case of palindromes, and he generalized it slightly to a program capable of finding the longest prefix of one given string that occurs in another.
This was the first time in Knuth’s experience that automata theory had taught him how to solve a real programming problem better than he could solve it before. He showed his results to the third author (V. R. Pratt), and Pratt modified Knuth’s data structure so that the running time was independent of the alphabet size. When Pratt described the resulting algorithm to Morris, the latter recognized it as his own, and was pleasantly surprised to learn of the O(m + n) time bound, which he and Pratt described in a memorandum [22]. Knuth was chagrined to learn that Morris had already discovered the algorithm, without knowing Cook’s theorem; but the theory of finite-state machines had been of use to Morris too, in his initial conceptualization of the algorithm, so it was still legitimate to conclude that automata theory had actually been helpful in this practical problem.
The idea of scanning a string without backing up while looking for a pattern, in the case of a two-letter alphabet, is implicit in the early work of Gilbert [10] dealing with comma-free codes.
It also is essentially a special case of Knuth’s LR(0) parsing algorithm [16] when applied to the grammar

    $S \to aS$, for each a in the alphabet,
    $S \to \alpha$,

where $\alpha$ is the pattern. Diethelm and Roizen [6] independently discovered the idea in 1971. Gilbert and Knuth did not discuss the preprocessing to build the next table, since they were mainly concerned with other problems, and the preprocessing algorithm given by Diethelm and Roizen was of order $m^2$. In the case of a binary (two-letter) alphabet, Diethelm and Roizen observed that the algorithm of §3 can be improved further: we can go immediately to "char matched" after j := next[j] in this case if next[j] > 0.
A conjecture by R. L. Rivest led Pratt to discover the $\log_\phi m$ upper bound on pattern movements between successive input characters, and Knuth showed that this was best possible by observing that Fibonacci strings have the curious properties proved in §5. Zvi Galil has observed that a real-time algorithm can be obtained by letting the text pointer move ahead in an appropriate manner while the j pointer is moving down [9].
In his lectures at Berkeley, S. A. Cook had proved that $P^*$ was recognizable in O(n log n) steps on a random-access machine, and Pratt improved this to O(n) using a preliminary form of the ideas in §6. The slightly more refined theory in the present version of §6 is joint work of Knuth and Pratt. Manacher [20] found another way to recognize palindromes in linear time, and Galil [9] showed how to improve this to real time. See also Slisenko [23].
It seemed at first that there might be a way to find the longest common substring of two given strings, in time O(m + n); but the algorithm of this paper does not readily support any such extension, and Knuth conjectured in 1970 that such efficiency would be impossible to achieve. An algorithm due to Karp, Miller, and Rosenberg [13] solved the problem in O((m + n) log(m + n)) steps, and this tended to support the conjecture (at least in the mind of its originator). However, Peter Weiner has recently developed a technique for solving the longest common substring problem in O(m + n) units of time with a fixed alphabet, using tree structures in a remarkable new way [26].
Furthermore, Weiner’s algorithm has the following interesting consequence, pointed out by E. McCreight: a text file can be processed (in linear time) so that it is possible to determine exactly how much of a pattern is necessary to identify a position in the text uniquely; as the pattern is being typed in, the system can interrupt as soon as it "knows" what the rest of the pattern must be! Unfortunately the time and space requirements for Weiner’s algorithm grow with increasing alphabet size.

If we consider the problem of scanning finite-state languages in general, it is known [1, §9.2] that the language defined by any regular expression of length m is recognizable in O(mn) units of time. When the regular expression has the form $\Sigma^*\alpha\Sigma^*$ the algorithm we have discussed shows that only O(m + n) units of time are needed (considering $\Sigma^*$ as a character of length 1 in the expression). Recent work by M. J. Fischer and M. S. Paterson [8] shows that regular expressions of the form $\Sigma^*\alpha_1\Sigma\alpha_2\Sigma \ldots \Sigma\alpha_r\Sigma^*$, i.e., patterns with "don’t care" symbols, can be identified in O(n log m log log m log q) units of time, where q is the alphabet size and $m = |\alpha_1\alpha_2 \ldots \alpha_r| + r$. The constant of proportionality in their algorithm is extremely large, but the existence of their construction indicates that efficient new algorithms for general pattern matching problems probably remain to be discovered.

A completely different approach to pattern matching, based on hashing, has been proposed by Malcolm C. Harrison [12]. In certain applications, especially with very large text files and short patterns, Harrison’s method may be significantly faster than the character-comparing method of the present paper, on the average, although the redundancy of English makes the performance of his method unclear.

8. Postscript: Faster pattern matching in strings.² In the spring of 1974, Robert S. Boyer and J. Strother Moore and (independently) R. W. Gosper noticed that there is an even faster way to match pattern strings, by skipping more rapidly over portions of the text that cannot possibly lead to a match. Their idea was to look first at text[m], instead of text[1]. If we find that the character text[m] does not appear in the pattern at all, we can immediately shift the pattern right m places. Thus, when the alphabet size q is large, we need to inspect only about n/m characters of the text, on the average! Furthermore if text[m] does occur in the pattern, we can shift the pattern by the minimum amount consistent with a match.

² This postscript was added by D. E. Knuth in March, 1976, because of developments which occurred after preprints of this paper were distributed.
Several interesting variations on this strategy are possible. For example, if text[m] does occur in the pattern, we might continue the search by looking at text[m - 1], text[m - 2], etc.; in a random file we will usually find a small value of r such that the substring text[m - r] ... text[m] does not appear in the pattern, so we can shift the pattern m - r places. If $r = \lfloor 2 \log_q m \rfloor$, there are more than $m^2$ possible values of text[m - r] ... text[m], but only m - r substrings of length r + 1 in the pattern, hence the probability is O(1/m) that text[m - r] ... text[m] occurs in the pattern. If it doesn’t, we can shift the pattern right m - r places; but if it does, we can determine all matches in positions < m - r in O(m) steps, shifting the pattern m - r places by the method of this paper. Hence the expected number of characters examined among the first $m - \lfloor 2 \log_q m \rfloor$ is $O(\log_q m)$; this proves the existence of a linear worst-case algorithm which inspects $O(n (\log_q m)/m)$ characters in a random text. This upper bound on the average running time applies to all patterns, and there are some patterns (e.g., $a^m$ or $(ab)^{m/2}$) for which the expected number of characters examined by the algorithm is O(n/m).
Boyer and Moore have refined the skipping-by-m idea in another way. Their original algorithm may be expressed as follows using our conventions:

    k := m;
    while k ≤ n do
      begin j := m;
        while j > 0 and text[k] = pattern[j] do
          begin j := j - 1; k := k - 1;
          end;
        if j = 0 then
          begin match found at (k);
            k := k + m + 1
          end else
          k := k + max(d[text[k]], dd[j]);
      end;
This program calls match found at (k) for all 0 ≤ k ≤ n - m such that pattern[1] ... pattern[m] = text[k + 1] ... text[k + m]. There are two precomputed tables, namely

    $d[a] = \min\{\, s \mid s = m \ \text{or}\ (0 \le s < m \ \text{and}\ \text{pattern}[m - s] = a) \,\}$

for each of the q possible characters a, and

    $dd[j] = \min\{\, s + m - j \mid s \ge 1 \ \text{and}\ ((s \ge i \ \text{or}\ \text{pattern}[i - s] = \text{pattern}[i]) \ \text{for}\ j < i \le m) \,\}$,

for 1 ≤ j ≤ m.

The d table can clearly be set up in O(q + m) steps, and the dd table can be precomputed in O(m) steps using a technique analogous to the method in §2 above, as we shall see. The Boyer-Moore paper [4] contains further exposition of the algorithm, including suggestions for highly efficient implementation, and gives both theoretical and empirical analyses. In the remainder of this section we shall show how the above methods can be used to resolve some of the problems left open in [4].
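To make the roles of d and dd concrete, the search loop and the two table definitions can be sketched in Python. The names bm_tables and bm_search are hypothetical. As an explicit assumption of this sketch, the tables are built by brute force straight from their definitions (not in O(q + m) and O(m) steps), strings are padded so that indices are 1-based as in the text, and the pattern is assumed nonempty.

```python
def bm_tables(pattern):
    """Build d and dd directly from their definitions (brute force, 1-based)."""
    m = len(pattern)
    p = " " + pattern  # p[1..m] is the pattern
    # d[a] = least s with pattern[m - s] = a; characters absent from the
    # pattern are handled by the default value m in bm_search.
    d = {}
    for s in range(m - 1, -1, -1):
        d[p[m - s]] = s  # smaller s overwrites larger s
    dd = [0] * (m + 1)
    for j in range(1, m + 1):
        dd[j] = min(
            s + m - j
            for s in range(1, m + 1)
            if all(s >= i or p[i - s] == p[i] for i in range(j + 1, m + 1))
        )
    return d, dd


def bm_search(pattern, text):
    """Return all k with pattern == text[k:k+m] (0-based k), scanning right to left."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be nonempty")
    d, dd = bm_tables(pattern)
    p, t = " " + pattern, " " + text
    matches = []
    k = m
    while k <= n:
        j = m
        while j > 0 and t[k] == p[j]:
            j -= 1
            k -= 1
        if j == 0:
            matches.append(k)  # pattern[1..m] = text[k+1..k+m]
            k += m + 1
        else:
            k += max(d.get(t[k], m), dd[j])
    return matches


print(bm_search("aba", "ababa"))  # [0, 2]
```

Since dd[j] is always at least m - j + 1, the max guarantees that the right end of the pattern moves forward after every mismatch, even when d[text[k]] alone would move it backward.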
First let us improve the original Boyer-Moore algorithm slightly by replacing dd[j] by

    $dd'[j] = \min\{\, s + m - j \mid s \ge 1 \ \text{and}\ (s \ge j \ \text{or}\ \text{pattern}[j - s] \ne \text{pattern}[j]) \ \text{and}\ ((s \ge i \ \text{or}\ \text{pattern}[i - s] = \text{pattern}[i]) \ \text{for}\ j < i \le m) \,\}$.

(This is analogous to using next[j] instead of f[j]; Boyer and Moore [4] credit the improvement in this case to Ben Kuipers, but they do not discuss how to determine dd' efficiently.) The following program shows how the dd' table can be precomputed in O(m) steps; for purposes of comparison, the program also shows how to compute dd, which actually turns out to require slightly more operations than dd':

    for k := 1 step 1 until m do dd[k] := dd'[k] := 2 × m - k;
    j := m; t := m + 1;
    while j > 0 do
      begin f[j] := t;
        while t ≤ m and pattern[j] ≠ pattern[t] do
          begin dd'[t] := min(dd'[t], m - j);
            t := f[t];
          end;
        t := t - 1; j := j - 1;
        dd[t] := min(dd[t], m - j);
      end;
    for k := 1 step 1 until t do
      begin dd[k] := min(dd[k], m + t - k);
        dd'[k] := min(dd'[k], m + t - k);
      end;
In practice one would, of course, compute only dd', suppressing all references to dd. The example in Table 2 illustrates most of the subtleties of this algorithm.

TABLE 2

    j          =  1  2  3  4  5  6  7  8  9 10 11
    pattern[j] =  b  a  d  b  a  c  b  a  c  b  a
    f[j]       = 10 11  6  7  8  9 10 11 11 11 12
    dd[j]      = 19 18 17 16 15  8  7  6  5  4  1
    dd'[j]     = 19 18 17 16 15  8 13 12  8 12  1
To prove correctness, one may show first that f[j] is analogous to the f[j] in §2, but with right and left of the pattern reversed; namely f[m] = m + 1, and for j < m we have

    $f[j] = \min\{\, i \mid j < i \le m \ \text{and}\ \text{pattern}[i + 1] \ldots \text{pattern}[m] = \text{pattern}[j + 1] \ldots \text{pattern}[m + j - i] \,\}$.

Furthermore the final value of t corresponds to f[0] in this definition; m - t is the maximum overlap of the pattern on itself. The correctness of dd[j] and dd'[j] for all j now follows without much difficulty, by showing that the minimum value of s in the definition of $dd[j_0]$ or $dd'[j_0]$ is discovered by the algorithm when $(t, j) = (j_0, j_0 - s)$.

The
Boyer-Moore algorithm and its variants can have curiously anomalous behavior in unusual circumstances. For example, the method discovers more quickly that the pattern aaaaaaacb does not appear in the text $(ab)^n$ if it suppresses the d heuristic entirely, i.e., if d[t] is set to $-\infty$ for all t. Likewise, dd actually turns out to be better than dd' when matching $a^{15}bcbabab$ in $(baabab)^n$, for large n.

Boyer and Moore showed that their algorithm has quadratic behavior in the worst case; the running time can be essentially proportional to pattern length times text length, for example when the pattern $ca(ba)^m$ occurs together with the text $(x^{2m}aa(ba)^m)^n$. They observed that this particular example was handled in linear time when Kuipers’ improvement (dd' for dd) was made; but they left open the question of the true worst case behavior of the improved algorithm.

There are trivial cases in which the Boyer-Moore algorithm has quadratic behavior, when matching all occurrences of the pattern, for example when matching the pattern $a^m$ in the text $a^n$. But we are probably willing to accept such behavior when there are so many matches; the crucial issue is how long the algorithm takes in the worst case to scan over a text that does not contain the pattern at all. By extending the techniques of §5, it is possible to show that the modified Boyer-Moore algorithm is linear in such a situation:

THEOREM.
If the above algorithm is used with dd' replacing dd, and if the text does not contain any occurrences of the pattern, the total number of characters matched is at most 6n.

Proof. An execution of the algorithm consists of a series of stages, in which $m_k$ characters are matched and then the pattern is shifted $s_k$ places, for k = 1, 2, .... We want to show that $\sum m_k \le 6n$; the proof is based on breaking this cost into three parts, two of which are trivially O(n) and the third of which is less obviously so.

Let $m'_k = m_k - 2s_k$ if $m_k > 2s_k$; otherwise let $m'_k = 0$. When $m'_k > 0$, we will say that the leftmost $m'_k$ text characters matched during the kth stage have been "tapped". It suffices to prove that the algorithm taps characters at most 4n times, since $\sum m_k \le \sum m'_k + 2\sum s_k$ and $\sum s_k \le n$. Unfortunately it is possible for some characters of the text to be tapped roughly log m times, so we need to argue carefully that $\sum m'_k \le 4n$.

Suppose the rightmost $m''_k$ of the $m'_k$ text characters tapped during the kth stage are matched again during some later stage, but the leftmost m