
From Regular Expression to Scanner

In document Engineering A Compiler pdf (Page 45-52)

2.7 From Regular Expression to Scanner

The goal of our work with FAs is to automate the process of building an executable scanner from a collection of regular expressions. We will show how to accomplish this in two steps: using Thompson's construction to build an NFA from an RE [50] (Section 2.7.1) and using the subset construction to convert the NFA into a DFA (Section 2.7.2). The resulting DFA can be transformed into a minimal DFA, that is, one with the minimal number of states. That transformation is presented later, in Section 2.8.1.

In truth, we can construct a DFA directly from an RE. Since the direct construction combines the two separate steps, it may be more efficient. However, understanding the direct method requires a thorough knowledge of the individual steps. Therefore, we first present the two-step construction; the direct method is presented later, along with other useful transformations on automata.

2.7.1 Regular Expression to NFA

The first step in moving from a regular expression to an implemented scanner is deriving an NFA from the RE. The construction follows a straightforward idea. It has a template for building the NFA that corresponds to a single-letter RE, and transformations on the NFAs to represent the impact of the RE operators: concatenation, alternation, and closure. Figure 2.5 shows the trivial NFAs for the REs a and b, as well as the transformations to form the REs ab, a|b, and a*.

The construction proceeds by building a trivial NFA for each symbol in the RE and then applying the transformations to the collection of trivial NFAs in the order of the relative precedence of the operators. For the regular expression a(b|c)*, the construction would proceed by building NFAs for a, b, and c. Next, it would build the NFA for b|c, then (b|c)*, and, finally, for a(b|c)*. Figure 2.6 shows this sequence of transformations.

Thompson's construction relies on several properties of REs. It relies on the obvious and direct correspondence between the RE operators and the transformations on the NFAs. It combines this with the closure properties on REs for assurance that the transformations produce valid NFAs. Finally, it uses ε-moves to connect the subexpressions; this permits the transformations to be simple templates. For example, the template for a* looks somewhat contrived; it adds extra states to avoid introducing a cycle of ε-moves.

The NFAs derived from Thompson's construction have a number of useful properties.

1. Each has a single start state and a single final state. This simplifies the application of the transformations.

2. Any state that is the source or sink of an ε-move was, at some point in the process, the start state or final state of one of the NFAs representing a partial solution.

3. A state has at most two entering and two exiting ε-moves, and at most one entering and one exiting move on a symbol in the alphabet.

[Figure 2.6, panels in order: NFAs for a, b, and c; NFA for b|c; NFA for (b|c)*; NFA for a(b|c)*]

1. S ← ε-closure(q0_N)
2. while (S is still changing)
3.     for each s_i ∈ S
4.         for each character α ∈ Σ
5.             if ε-closure(move(s_i, α)) ∉ S
6.                 add it to S as s_j
7.             T[s_i, α] ← s_j

Figure 2.7: The Subset Construction

These properties simplify an implementation of the construction. For example, instead of iterating over all the final states in the NFA for some arbitrary subexpression, the construction only needs to deal with a single final state.
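As a rough illustration of how the single-start, single-final property keeps the templates simple, the following Python sketch builds an NFA for a(b|c)*. The `Builder` class, its method names, and the use of `None` as the ε label are assumptions made for this example, and the closure template is the commonly taught one, which may differ in detail from the template in Figure 2.5.

```python
import itertools

EPS = None  # label for an ε-move (an assumption of this sketch)

class Builder:
    """Hypothetical helper: every fragment is a (start, final) pair of state ids."""

    def __init__(self):
        self.ids = itertools.count()
        self.edges = []  # list of (source, label, target)

    def new_state(self):
        return next(self.ids)

    def symbol(self, ch):
        # Trivial NFA for a single-letter RE.
        s, f = self.new_state(), self.new_state()
        self.edges.append((s, ch, f))
        return s, f

    def concat(self, a, b):
        # Link the final state of a to the start state of b with an ε-move.
        self.edges.append((a[1], EPS, b[0]))
        return a[0], b[1]

    def alt(self, a, b):
        # New start and final states, joined to both fragments by ε-moves.
        s, f = self.new_state(), self.new_state()
        for frag in (a, b):
            self.edges.append((s, EPS, frag[0]))
            self.edges.append((frag[1], EPS, f))
        return s, f

    def star(self, a):
        # New start and final states; ε-moves allow zero or more repetitions.
        s, f = self.new_state(), self.new_state()
        self.edges += [(s, EPS, a[0]), (a[1], EPS, f),
                       (a[1], EPS, a[0]), (s, EPS, f)]
        return s, f

b = Builder()
start, final = b.concat(b.symbol("a"),
                        b.star(b.alt(b.symbol("b"), b.symbol("c"))))
```

Because each fragment exposes exactly one start and one final state, every template only ever touches two states per operand.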

Notice the large number of states in the NFA that Thompson's construction built for a(b|c)*. A human would likely produce a much simpler NFA, like the following:

[Diagram: s0 moves to s1 on a; s1 has a transition to itself on b,c]

We can automatically remove many of the ε-moves present in the NFA built by Thompson's construction. This can radically shrink the size of the NFA. Since the subset construction can handle ε-moves in a natural way, we will defer an algorithm for eliminating ε-moves until Section 2.9.1.

2.7.2 The Subset Construction

To construct a DFA from an NFA, we must build the DFA that simulates the behavior of the NFA on an arbitrary input stream. The process takes as input an NFA N = (Q_N, Σ, δ_N, q0_N, F_N) and produces a DFA D = (Q_D, Σ, δ_D, q0_D, F_D). The key step in the process is deriving Q_D and δ_D from Q_N and δ_N. (q0_D and F_D will fall out of the process in a natural way.) Figure 2.7 shows an algorithm that does this; it is often called the subset construction.

The algorithm builds a set S whose elements are themselves sets of states in Q_N. Thus, each s_i ∈ S is itself a subset of Q_N. (We will denote the set of all subsets of Q_N as 2^{Q_N}, called the powerset of Q_N.) Each s_i ∈ S represents a state in Q_D, so each state in Q_D represents a collection of states in Q_N (and, thus, is an element of 2^{Q_N}). To construct the initial state, s_0 ∈ S, it puts q0_N into s_0 and then augments s_0 with every state in Q_N that can be reached from q0_N by following one or more ε-transitions.

The algorithm abstracts this notion of following ε-transitions into a function called ε-closure. For a state q_i, ε-closure(q_i) is the set containing q_i and any other states reachable from q_i by taking only ε-moves. Thus, the first step is to construct s_0 as ε-closure(q0_N).
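One plausible way to compute ε-closure on demand is a simple graph search over the ε-moves. The sketch below assumes the ε-moves are stored in a dictionary `eps_edges` that maps a state to the set of its ε-successors; that representation, the function name, and the tiny example graph are all illustrative choices.

```python
def eps_closure(states, eps_edges):
    """Return the given states plus every state reachable using only ε-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in eps_edges.get(q, ()):
            if r not in closure:   # visit each state once, so ε-cycles are safe
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

# Hypothetical ε-moves: 0 -> 1, 1 -> 2, 2 -> 1 (a cycle); state 3 has none.
eps_edges = {0: {1}, 1: {2}, 2: {1}}
closure_of_0 = eps_closure({0}, eps_edges)
closure_of_3 = eps_closure({3}, eps_edges)
```

Returning a `frozenset` makes the result hashable, so it can later serve as a member of S or as a key in T.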

Once S has been initialized with s_0, the algorithm repeatedly iterates over the elements of S, extending the partially constructed DFA (represented by S) by following transitions out of each s_i ∈ S. The while loop iterates until it completes a full iteration over S without adding a new set. To extend the partial DFA represented by S, it considers each s_i. For each symbol α ∈ Σ, it collects together all the NFA states q_k that can be reached by a transition on α from a state q_j ∈ s_i.

In the algorithm, the computation of a new state is abstracted into a function call to move. Move(s_i, α) returns the set of NFA states (an element of 2^{Q_N}) that are reachable from some q_i ∈ s_i by taking a transition on the symbol α. These NFA states form the core of a state in the DFA; we can call it s_j. To complete s_j, the algorithm takes its ε-closure. Having computed s_j, the algorithm checks if s_j ∈ S. If s_j ∉ S, the algorithm adds it to S. It then records a transition from s_i to s_j on α.
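The step from s_i to s_j can be sketched in a few lines of Python. The transition table below is a hand-numbered, Thompson-style NFA for a(b|c)*; its state numbers, the dictionary layout keyed by (state, label), and the use of `None` for ε are assumptions of this example rather than anything fixed by the text.

```python
EPS = None  # label for ε-moves (assumed encoding)

# Assumed NFA for a(b|c)*: start state 0, final state 9.
delta = {
    (0, "a"): {1}, (1, EPS): {8}, (8, EPS): {6, 9}, (6, EPS): {2, 4},
    (2, "b"): {3}, (4, "c"): {5}, (3, EPS): {7}, (5, EPS): {7},
    (7, EPS): {6, 9},
}

def eps_closure(states, delta):
    """States reachable from `states` using only ε-moves (including themselves)."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def move(s_i, alpha, delta):
    """NFA states reachable from some state in s_i by one transition on alpha."""
    return frozenset(r for q in s_i for r in delta.get((q, alpha), ()))

s0 = eps_closure({0}, delta)                  # initial DFA state
s1 = eps_closure(move(s0, "a", delta), delta) # core {1}, then completed by ε-closure
```

Here `move(s0, "a", delta)` yields only the core `{1}`; taking the ε-closure adds the states that become reachable without consuming input.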

The while loop repeats this exhaustive attempt to extend the partial DFA until an iteration adds no new states to S. The test in line 5 ensures that S contains no duplicate elements. Because each s_i ∈ S is also an element of 2^{Q_N}, we know that this process must halt.

Sketch of Proof

1. 2^{Q_N} is finite. (It can be large, but is finite.)

2. S contains no duplicates.

3. The while loop adds elements to S; it cannot remove them.

4. S grows monotonically.

∴ The loop halts.

When it halts, the algorithm has constructed a model of the DFA that simulates N. All that remains is to use S to construct Q_D and T to construct δ_D. Q_D gets a state q_i to represent each set s_i ∈ S; for any s_i that contains a final state of N, the corresponding q_i is added to F_D, the set of final states for the DFA. Finally, the state constructed from s_0 becomes the initial state of the DFA.

Fixed Point Computations The subset construction is an example of a style of computation that arises regularly in Computer Science, and, in particular, in compiler construction. These problems are characterized by iterated application of a monotone function to some collection of sets drawn from a domain whose structure is known. We call these techniques fixed point computations, because they terminate when they reach a point where further iteration produces the same answer, a "fixed point" in the space of successive iterates produced by the algorithm.

Termination arguments on fixed point algorithms usually depend on the known properties of the domain. In the case of the subset construction, we know that each s_i ∈ S is also a member of 2^{Q_N}, the powerset of Q_N. Since Q_N is finite, 2^{Q_N} is also finite. The body of the while loop is monotone; it can only add elements to S. These facts, taken together, show that the while loop can execute only a finite number of iterations. In other words, it must halt because it can add at most |2^{Q_N}| elements to S; after that, it must halt. (It may, of course, halt much earlier.) Many fixed point computations have considerably tighter bounds, as we shall see.

S ← ε-closure(q0_N)
while (∃ unmarked s_i ∈ S)
    mark s_i
    for each character α ∈ Σ
        t ← ε-closure(move(s_i, α))
        if t ∉ S then
            add t to S as an unmarked state
        T[s_i, α] ← t

Figure 2.8: A faster version of the Subset Construction

Efficiency The algorithm shown in Figure 2.7 is particularly inefficient. It recomputes the transitions for each state in S on each iteration of the while loop. These transitions cannot change; they are wholly determined by the structure of the input NFA. We can reformulate the algorithm to capitalize on this fact; Figure 2.8 shows one way to accomplish this.

The algorithm in Figure 2.8 adds a "mark" to each element of S. When sets are added to S, they are unmarked. When the body of the while loop processes a set s_i, it marks s_i. This lets the algorithm avoid processing each s_i multiple times. It reduces the number of invocations of ε-closure(move(s_i, α)) from O(|S|^2 · |Σ|) to O(|S| · |Σ|). Recall that S can contain no more than |2^{Q_N}| elements.
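A minimal Python sketch of the marked version follows, under several assumptions: the NFA is the hand-numbered one for a(b|c)* shown in `delta`, ε is encoded as `None`, and a queue of unmarked sets stands in for the marks. Unlike Figure 2.8, the sketch skips the empty set rather than adding it to S as an explicit error state; that is a simplification chosen here, not something the text prescribes.

```python
from collections import deque

EPS = None  # assumed encoding of ε

# Assumed NFA for a(b|c)*: start state 0, final state 9.
delta = {
    (0, "a"): {1}, (1, EPS): {8}, (8, EPS): {6, 9}, (6, EPS): {2, 4},
    (2, "b"): {3}, (4, "c"): {5}, (3, EPS): {7}, (5, EPS): {7},
    (7, EPS): {6, 9},
}

def eps_closure(states, delta):
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def move(s_i, alpha, delta):
    return frozenset(r for q in s_i for r in delta.get((q, alpha), ()))

def subset_construction(q0, delta, alphabet):
    s0 = eps_closure({q0}, delta)
    S = {s0}
    unmarked = deque([s0])         # sets in S that have not been processed yet
    T = {}
    while unmarked:
        s_i = unmarked.popleft()   # removing s_i from the queue "marks" it
        for alpha in alphabet:
            t = eps_closure(move(s_i, alpha, delta), delta)
            if not t:
                continue           # simplification: no explicit error state
            if t not in S:
                S.add(t)
                unmarked.append(t)
            T[(s_i, alpha)] = t
    return s0, S, T

s0, S, T = subset_construction(0, delta, "abc")
finals = {s for s in S if 9 in s}  # DFA states containing the NFA final state
```

Each set is taken from the queue exactly once, so ε-closure(move(s_i, α)) is evaluated once per (s_i, α) pair.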

Unfortunately, S can become rather large. The principal determinant of how much state expansion occurs is the degree of nondeterminism found in the input NFA. Recall, however, that the DFA makes exactly one transition per input character, independent of the size of Q_D. Thus, the use of nondeterminism in specifying and building the NFA increases the space required to represent the corresponding DFA, but not the amount of time required for recognizing an input string.

Computing ε-closure as a Fixed Point To compute ε-closure(), we use one of two approaches: a straightforward, online algorithm that follows paths in the NFA's transition graph, or an offline algorithm that computes the ε-closure for each state in the NFA in a single fixed point computation.

for each state n ∈ N
    E(n) ← ∅
while (some E(n) has changed)
    for each state n ∈ N
        E(n) ← {n} ∪ ( ⋃_{⟨n,s,ε⟩} E(s) )

Here, we have used the notation ⟨n, s, ε⟩ to name a transition from n to s on ε. Each E(n) contains some subset of N (an element of 2^N). E(n) grows monotonically since line five uses ∪ (not ∩). The algorithm halts when no E(n) changes in an iteration of the outer loop. When it halts, E(n) contains the names of all states in ε-closure(n).
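The round-robin computation can be sketched as follows. The dictionary `eps_succ`, which maps each state to its ε-successors, and the function name are assumptions of this example; the update rule takes n together with the union of the E sets of its ε-successors.

```python
def all_eps_closures(nodes, eps_succ):
    """Round-robin fixed point: E[n] converges to ε-closure(n) for every n."""
    E = {n: set() for n in nodes}
    changed = True
    while changed:                  # stop after a full pass with no change
        changed = False
        for n in nodes:
            # n itself, plus everything known so far about its ε-successors
            new = {n}.union(*(E[s] for s in eps_succ.get(n, ())))
            if new != E[n]:         # sets only grow, so "different" means "larger"
                E[n] = new
                changed = True
    return E

# Hypothetical ε-path 0 -> 1 -> 2 -> 3.
E = all_eps_closures([0, 1, 2, 3], {0: {1}, 1: {2}, 2: {3}})
```

With the nodes visited in the order 0, 1, 2, 3, the name of state 3 moves only one edge toward state 0 per pass, which is the worst-case ordering for this path.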

We can obtain a tighter time bound by observing that |E(n)| can be no larger than the number of states involved in a path leaving n that is labeled entirely with ε's. Thus, the time required for a computation must be related to the number of nodes in that path. The largest E(n) set can have N nodes. Consider that longest path. The algorithm cannot halt until the name of the last node on the path reaches the first node on the path. In each iteration of the outer loop, the name of the last node must move one or more steps closer to the head of the path. Even with the worst ordering for that path, it must move along one edge in the path.

At the start of the iteration, n_last ∈ E(n_i) for some n_i. If it has not yet reached the head of the path, then there must be an edge ⟨n_j, n_i, ε⟩ in the path. That node will be visited in the loop at line six, so n_last will move from E(n_i) to E(n_j). Fortuitous ordering can move it along more than one ε-transition in a single iteration of the loop at line six, but it must always move along at least one ε-transition, unless it is in the last iteration of the outer loop.

Thus, the algorithm requires at most one while loop iteration for each edge in the longest ε-path in the graph, plus an extra iteration to recognize that the E sets have stabilized. Each iteration visits N nodes and does E unions. Thus, its complexity is O(N(N + E)) or O(max(N^2, NE)). This is much better than O(2^N).

We can reformulate the algorithm to improve its specific behavior by using a worklist technique rather than a round-robin technique.

for each state n ∈ N
    E(n) ← ∅
WorkList ← N
while (WorkList ≠ ∅)
    remove n_i from WorkList
    E(n_i) ← {n_i} ∪ ( ⋃_{⟨n_i,n_j,ε⟩} E(n_j) )
    if E(n_i) changed then
        WorkList ← WorkList ∪ {n_k | ⟨n_k, n_i, ε⟩ ∈ δ_NFA}

This version only visits a node when the E set at one of its ε-successors has changed. Thus, it may perform fewer union operations than the round-robin version. However, its asymptotic behavior is the same. The only way to improve its asymptotic behavior is to change the order in which nodes are removed from the worklist. This issue will be explored in some depth when we encounter data-flow analysis in Chapter 13.
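A worklist formulation might look like the sketch below. It assumes the same `eps_succ` dictionary of ε-successors as input, derives a predecessor map from it, and uses a first-in, first-out queue; the queue discipline and all names are choices made for this example.

```python
from collections import deque

def all_eps_closures_worklist(nodes, eps_succ):
    """Worklist fixed point: revisit a node only when an ε-successor's set grew."""
    preds = {n: set() for n in nodes}        # ε-predecessors of each node
    for n, succs in eps_succ.items():
        for s in succs:
            preds[s].add(n)

    E = {n: set() for n in nodes}
    worklist = deque(nodes)
    queued = set(nodes)                      # avoids duplicate worklist entries
    while worklist:
        n = worklist.popleft()
        queued.discard(n)
        new = {n}.union(*(E[s] for s in eps_succ.get(n, ())))
        if new != E[n]:
            E[n] = new
            for p in preds[n]:               # E[p] depends on E[n]; recheck it
                if p not in queued:
                    queued.add(p)
                    worklist.append(p)
    return E

# Hypothetical ε-moves with a cycle between states 1 and 2.
E = all_eps_closures_worklist([0, 1, 2, 3], {0: {1}, 1: {2}, 2: {1, 3}})
```

Swapping the `deque` for a different ordering policy changes how many unions are performed, which is the knob the text refers to.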


2.7.3 Some final points

Thus far, we have developed the mechanisms to construct a DFA implementation from a single regular expression. To be useful, a compiler's scanner must recognize all the syntactic categories that appear in the grammar for the source language. What we need, then, is a recognizer that can handle all the REs for the language's micro-syntax. Given the REs for the various syntactic categories, r1, r2, r3, ..., rk, we can construct a single RE for the entire collection by forming (r1 | r2 | r3 | ... | rk).

If we run this RE through the entire process, building NFAs for the subexpressions, joining them with ε-transitions, coalescing states, constructing the DFA that simulates the NFA, and turning the DFA into executable code, we get a scanner that recognizes precisely one word. That is, when we invoke it on some input, it will run through the characters one at a time and accept the string if it is in a final state when it exhausts the input. Unfortunately, most real programs contain more than one word. We need to transform either the language or the recognizer.

At the language level, we can insist that each word end with some easily recognizable delimiter, like a blank or a tab. This is deceptively attractive. Taken literally, it would require delimiters surrounding commas, operators such as + and -, and parentheses.

At the recognizer level, we can transform the DFA slightly and change the notion of accepting a string. For each final state, q_i, we (1) create a new state q_j, (2) remove q_i from F and add q_j to F, and (3) make the error transition from q_i go to q_j. When the scanner reaches q_i and cannot legally extend the current word, it will take the transition to q_j, a final state. As a final issue, we must make the scanner stop, backspace the input by one character, and accept in each new final state. With these modifications, the recognizer will discover the longest legal keyword that is a prefix of the input string.
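The effect of this transformation, finding the longest legal word that is a prefix of the input, can be approximated in a table-driven scanner. The sketch below does not build the extra states q_j; instead it uses a common variant that stops at the first missing transition and returns the text up to the last final state seen. The table `T` is a hand-written DFA for a(b|c)* with assumed state numbers, and the start state is assumed not to be final.

```python
def longest_match(T, start, finals, text, pos=0):
    """Return the longest prefix of text[pos:] that the DFA accepts, or None."""
    state = start
    last_accept = None              # input position just past the last final state
    i = pos
    while i < len(text):
        state = T.get((state, text[i]))
        if state is None:           # error transition: stop scanning
            break
        i += 1
        if state in finals:
            last_accept = i
    if last_accept is None:
        return None
    return text[pos:last_accept]    # discards any characters read past the word

# Assumed DFA for a(b|c)*: state 0 is the start state, state 1 is final.
T = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 1}
first = longest_match(T, 0, {1}, "abcba")
second = longest_match(T, 0, {1}, "abcba", pos=len(first))
```

Calling the function repeatedly, each time starting where the previous word ended, yields the sequence of words in the input.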

What about words that match more than one pattern? Because the methods described in this chapter build from a base of nondeterminism, we can union together these arbitrary REs without worrying about conflicting rules. For example, the specification for an Algol identifier admits all of the reserved keywords of the language. The compiler writer has a choice on handling this situation. The scanner can recognize those keywords as identifiers and look up each identifier in a pre-computed table to discover keywords, or it can include an RE for each keyword. This latter case introduces nondeterminism; the transformations will handle it correctly. It also introduces a more subtle problem: the final NFA reaches two distinct final states, one recognizing the keyword and the other recognizing the identifier, and is expected to consistently choose the former. To achieve the desired behavior, scanner generators usually offer a mechanism for prioritizing REs to resolve such conflicts.
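One way such a priority mechanism could be realized is to tag each NFA final state with a priority number and let every DFA state report the best-ranked pattern among the NFA final states it contains. The state numbers, category names, and helper function below are hypothetical and serve only to illustrate the idea.

```python
def category_of(dfa_state, nfa_final_priority):
    """Pick the highest-priority (lowest-numbered) pattern accepted by a DFA state."""
    hits = [nfa_final_priority[q] for q in dfa_state if q in nfa_final_priority]
    if not hits:
        return None                 # not a final state of the DFA
    return min(hits)[1]             # tuples compare by priority number first

# Hypothetical NFA final states: 7 accepts a keyword, 12 accepts identifiers.
nfa_final_priority = {7: (0, "WHILE"), 12: (1, "IDENT")}

both = category_of(frozenset({3, 7, 12}), nfa_final_priority)
ident_only = category_of(frozenset({4, 12}), nfa_final_priority)
```

A DFA state that contains both final states is classified as the keyword because the keyword pattern carries the smaller priority number.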

Lex and its descendants prioritize patterns by the order in which they appear in the input file. Thus, placing keyword patterns before the identifier pattern would ensure the desired behavior. The implementation can ensure that the final states for patterns are numbered in an order that corresponds to this priority

P ← {F, (Q − F)}
while (P is still changing)
    T ← ∅
    for each set s ∈ P
        for each α ∈ Σ
            partition s by α
