From Regular Expression to Scanner - Engineering A Compiler pdf

2.7 From Regular Expression to Scanner

The goal of our work with FAs is to automate the process of building an executable scanner from a collection of regular expressions. We will show how to accomplish this in two steps: using Thompson’s construction to build an NFA from an RE [50] (Section 2.7.1) and using the subset construction to convert the NFA into a DFA (Section 2.7.2). The resulting DFA can be transformed into a minimal DFA, that is, one with the minimal number of states. That transformation is presented later, in Section 2.8.1.

In truth, we can construct a DFA directly from an RE. Since the direct construction combines the two separate steps, it may be more efficient. However, understanding the direct method requires a thorough knowledge of the individual steps. Therefore, we first present the two-step construction; the direct method is presented later, along with other useful transformations on automata.

2.7.1 Regular Expression to NFA

The first step in moving from a regular expression to an implemented scanner is deriving an NFA from the RE. The construction follows a straightforward idea. It has a template for building the NFA that corresponds to a single-letter RE, and transformations on the NFAs to represent the impact of the RE operators: concatenation, alternation, and closure. Figure 2.5 shows the trivial NFAs for the REs a and b, as well as the transformations to form the REs ab, a|b, and a∗.

The construction proceeds by building a trivial NFA for each letter and applying the transformations to the collection of trivial NFAs in the order of the relative precedence of the operators. For the regular expression a(b|c)∗, the construction would proceed by building NFAs for a, b, and c. Next, it would build the NFA for b|c, then (b|c)∗, and, finally, for a(b|c)∗. Figure 2.6 shows this sequence of transformations.

Thompson’s construction relies on several properties of REs. It relies on the obvious and direct correspondence between the RE operators and the transformations on the NFAs. It combines this with the closure properties of REs for assurance that the transformations produce valid NFAs. Finally, it uses ε-moves to connect the subexpressions; this permits the transformations to be simple templates. For example, the template for a∗ looks somewhat contrived; it adds extra states to avoid introducing a cycle of ε-moves.
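The templates described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the `NFA` class, the helper names (`letter`, `concat`, `alternate`, `closure`), the use of `None` to label ε-moves, and the state numbering are all assumptions made for the example.

```python
from itertools import count

EPS = None       # label used for ε-moves (an assumption of this sketch)
_ids = count()   # source of fresh state numbers


class NFA:
    """An NFA fragment with a single start state and a single final state."""

    def __init__(self, start, final, edges):
        self.start = start
        self.final = final
        self.edges = edges  # list of (source, label, target) triples


def letter(ch):
    """Template for a single-letter RE: start --ch--> final."""
    s, f = next(_ids), next(_ids)
    return NFA(s, f, [(s, ch, f)])


def concat(x, y):
    """xy: connect x's final state to y's start state with an ε-move."""
    return NFA(x.start, y.final, x.edges + y.edges + [(x.final, EPS, y.start)])


def alternate(x, y):
    """x|y: new start and final states joined to x and y by ε-moves."""
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (s, EPS, y.start),
             (x.final, EPS, f), (y.final, EPS, f)]
    return NFA(s, f, x.edges + y.edges + extra)


def closure(x):
    """x*: new start and final states; ε-moves allow zero or more passes."""
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (x.final, EPS, f),
             (x.final, EPS, x.start), (s, EPS, f)]
    return NFA(s, f, x.edges + extra)


# a(b|c)*: build a, b, and c, then b|c, then (b|c)*, then a(b|c)*
a, b, c = letter("a"), letter("b"), letter("c")
nfa = concat(a, closure(alternate(b, c)))
states = {q for (src, _, dst) in nfa.edges for q in (src, dst)}
print(len(states), "states,", len(nfa.edges), "edges")
```

Because every fragment exposes exactly one start and one final state, each template only has to wire up a fixed number of ε-moves, which is what keeps the transformations this small.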

The NFAs derived from Thompson’s construction have a number of useful properties.

1. Each has a single start state and a single final state. This simplifies the application of the transformations.

2. Any state that is the source or sink of an ε-move was, at some point in the process, the start state or final state of one of the NFAs representing a partial solution.

3. A state has at most two entering and two exiting ε-moves, and at most one entering and one exiting move on a symbol in the alphabet.

[Figure 2.6, four panels: NFAs for a, b, and c; NFA for b|c; NFA for (b|c)∗; NFA for a(b|c)∗. The states are numbered s₀ through s₉.]

1. S ← ε-closure(q₀_N)
2. while (S is still changing)
3.     for each s_i ∈ S
4.         for each character α ∈ Σ
5.             if ε-closure(move(s_i, α)) ∉ S
6.                 add it to S as s_j
7.             T[s_i, α] ← s_j

Figure 2.7: The Subset Construction

These properties simplify an implementation of the construction. For example, instead of iterating over all the final states in the NFA for some arbitrary subexpression, the construction only needs to deal with a single final state.

Notice the large number of states in the NFA that Thompson’s construction built for a(b|c)∗. A human would likely produce a much simpler NFA: a start state s₀ with a transition on a to a final state s₁, and a self-loop on s₁ taken on b or c.

We can automatically remove many of the ε-moves present in the NFA built by Thompson’s construction. This can radically shrink the size of the NFA. Since the subset construction can handle ε-moves in a natural way, we will defer an algorithm for eliminating ε-moves until Section 2.9.1.

2.7.2 The Subset Construction

To construct a DFA from an NFA, we must build the DFA that simulates the behavior of the NFA on an arbitrary input stream. The process takes as input an NFA N = (Q_N, Σ, δ_N, q₀_N, F_N) and produces a DFA D = (Q_D, Σ, δ_D, q₀_D, F_D). The key step in the process is deriving Q_D and δ_D from Q_N and δ_N (q₀_D and F_D will fall out of the process in a natural way). Figure 2.7 shows an algorithm that does this; it is often called the subset construction.

The algorithm builds a set S whose elements are themselves sets of states in Q_N. Thus, each s_i ∈ S is itself a subset of Q_N. (We will denote the set of all subsets of Q_N as 2^{Q_N}, called the powerset of Q_N.) Each s_i ∈ S represents a state in Q_D, so each state in Q_D represents a collection of states in Q_N (and, thus, is an element of 2^{Q_N}). To construct the initial state, s₀ ∈ S, it puts q₀_N into s₀ and then augments s₀ with every state in Q_N that can be reached from q₀_N by following one or more ε-transitions.

The algorithm abstracts this notion of following ε-transitions into a function, called ε-closure. For a state q_i, ε-closure(q_i) is the set containing q_i and any other states reachable from q_i by taking only ε-moves. Thus, the first step is to construct s₀ as ε-closure(q₀_N).
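A minimal sketch of ε-closure as a graph traversal follows. The dictionary encoding of the transition function, the `None` label for ε, and the particular state numbers (a plausible rendering of a Thompson NFA for a(b|c)∗) are assumptions of the example, not details taken from the text.

```python
EPS = None  # label for ε-moves (assumed encoding)

# Assumed transition function: (state, label) -> set of successor states
DELTA = {
    (0, "a"): {1}, (2, "b"): {3}, (4, "c"): {5},
    (1, EPS): {8}, (8, EPS): {6, 9}, (6, EPS): {2, 4},
    (3, EPS): {7}, (5, EPS): {7}, (7, EPS): {6, 9},
}


def eps_closure(states, delta):
    """Return `states` plus every state reachable from them by ε-moves alone."""
    seen = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in seen:      # visit each state once, so ε-cycles are safe
                seen.add(r)
                stack.append(r)
    return frozenset(seen)


print(sorted(eps_closure({0}, DELTA)))
print(sorted(eps_closure({1}, DELTA)))
```

Returning a `frozenset` lets the result serve directly as a DFA state, since it can be stored in other sets and used as a dictionary key.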

Once S has been initialized with s₀, the algorithm repeatedly iterates over the elements of S, extending the partially constructed DFA (represented by S) by following transitions out of each s_i ∈ S. The while loop iterates until it completes a full iteration over S without adding a new set. To extend the partial DFA represented by S, it considers each s_i. For each symbol α ∈ Σ, it collects together all the NFA states q_k that can be reached by a transition on α from a state q_j ∈ s_i.

In the algorithm, the computation of a new state is abstracted into a function call to move. Move(s_i, α) returns the set of states (an element of 2^{Q_N}) that are reachable from some q_j ∈ s_i by taking a transition on the symbol α. These NFA states form the core of a state in the DFA; we can call it s_j. To complete s_j, the algorithm takes its ε-closure. Having computed s_j, the algorithm checks if s_j ∈ S. If s_j ∉ S, the algorithm adds it to S. Either way, it records a transition from s_i to s_j on α.
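The steps of Figure 2.7 can be sketched as follows. The encoding of the NFA (`DELTA`, with `None` standing for ε), the state numbers, and the decision to skip the empty set instead of materializing an explicit error state are assumptions of this sketch.

```python
EPS = None  # label for ε-moves (assumed encoding)

# Assumed Thompson-style NFA for a(b|c)*: (state, label) -> successors
DELTA = {
    (0, "a"): {1}, (2, "b"): {3}, (4, "c"): {5},
    (1, EPS): {8}, (8, EPS): {6, 9}, (6, EPS): {2, 4},
    (3, EPS): {7}, (5, EPS): {7}, (7, EPS): {6, 9},
}


def eps_closure(states, delta):
    """States reachable from `states` by ε-moves alone (including themselves)."""
    seen, stack = set(states), list(states)
    while stack:
        for r in delta.get((stack.pop(), EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)


def move(s_i, alpha, delta):
    """NFA states reachable from some state in s_i by one transition on alpha."""
    return {r for q in s_i for r in delta.get((q, alpha), ())}


def subset_construction(q0, sigma, delta):
    S = [eps_closure({q0}, delta)]       # line 1; a list keeps a stable order
    T = {}
    changed = True
    while changed:                       # line 2: S is still changing
        changed = False
        for s_i in list(S):              # line 3
            for alpha in sigma:          # line 4
                s_j = eps_closure(move(s_i, alpha, delta), delta)
                if not s_j:
                    continue             # sketch's choice: omit the empty set
                if s_j not in S:         # line 5
                    S.append(s_j)        # line 6
                    changed = True
                T[(s_i, alpha)] = s_j    # line 7
    return S, T


S, T = subset_construction(0, "abc", DELTA)
for i, s in enumerate(S):
    print(i, sorted(s))
```

Note that every pass of the outer loop revisits all of S, including sets whose transitions were already recorded, which is the redundancy the faster formulation removes.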

The while loop repeats this exhaustive attempt to extend the partial DFA until an iteration adds no new states to S. The test in line 5 ensures that S contains no duplicate elements. Because each s_i ∈ S is also an element of 2^{Q_N}, we know that this process must halt.

Sketch of Proof

1. 2^{Q_N} is finite. (It can be large, but is finite.)

2. S contains no duplicates.

3. The while loop adds elements to S; it cannot remove them.

4. S grows monotonically.

⇒ The loop halts.

When it halts, the algorithm has constructed a model of the DFA that simulates N. All that remains is to use S to construct Q_D and T to construct δ_D. Q_D gets a state q_i to represent each set s_i ∈ S; for any s_i that contains a final state of Q_N, the corresponding q_i is added to F_D, the set of final states for the DFA. Finally, the state constructed from s₀ becomes the initial state of the DFA.

Fixed Point Computations

The subset construction is an example of a style of computation that arises regularly in Computer Science, and, in particular, in compiler construction. These problems are characterized by iterated application of a monotone function to some collection of sets drawn from a domain whose structure is known. We call these techniques fixed point computations, because they terminate when they reach a point where further iteration produces the same answer: a “fixed point” in the space of successive iterates produced by the algorithm.

Termination arguments on fixed point algorithms usually depend on the known properties of the domain. In the case of the subset construction, we know that each s_i ∈ S is also a member of 2^{Q_N}, the powerset of Q_N. Since Q_N is finite, 2^{Q_N} is also finite. The body of the while loop is monotone; it can only add elements to S. These facts, taken together, show that the while loop can execute only a finite number of iterations. In other words, it can add at most |2^{Q_N}| elements to S; after that, it must halt. (It may, of course, halt much earlier.) Many fixed point computations have considerably tighter bounds, as we shall see.

S ← ε-closure(q₀_N)
while (∃ unmarked s_i ∈ S)
    mark s_i
    for each character α ∈ Σ
        t ← ε-closure(move(s_i, α))
        if t ∉ S then
            add t to S as an unmarked state
        T[s_i, α] ← t

Figure 2.8: A faster version of the Subset Construction

Efficiency

The algorithm shown in Figure 2.7 is particularly inefficient. It recomputes the transitions for each state in S on each iteration of the while loop. These transitions cannot change; they are wholly determined by the structure of the input NFA. We can reformulate the algorithm to capitalize on this fact; Figure 2.8 shows one way to accomplish this.

The algorithm in Figure 2.8 adds a “mark” to each element of S. When sets are added to S, they are unmarked. When the body of the while loop processes a set s_i, it marks s_i. This lets the algorithm avoid processing each s_i multiple times. It reduces the number of invocations of ε-closure(move(s_i, α)) from O(|S|² · |Σ|) to O(|S| · |Σ|). Recall that |S| can be no larger than |2^{Q_N}|.
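The marked formulation of Figure 2.8 might look like this in Python. As assumptions of the sketch, a queue of unmarked sets stands in for the marks, the NFA encoding and state numbers are invented for a(b|c)∗, the empty set is skipped, and a `calls` counter is added purely to make the |S| · |Σ| bound visible. The last step derives F_D from the NFA's final states.

```python
from collections import deque

EPS = None  # label for ε-moves (assumed encoding)

# Assumed Thompson-style NFA for a(b|c)*: (state, label) -> successors
DELTA = {
    (0, "a"): {1}, (2, "b"): {3}, (4, "c"): {5},
    (1, EPS): {8}, (8, EPS): {6, 9}, (6, EPS): {2, 4},
    (3, EPS): {7}, (5, EPS): {7}, (7, EPS): {6, 9},
}


def eps_closure(states, delta):
    """States reachable from `states` by ε-moves alone (including themselves)."""
    seen, stack = set(states), list(states)
    while stack:
        for r in delta.get((stack.pop(), EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)


def move(s_i, alpha, delta):
    """NFA states reachable from some state in s_i by one transition on alpha."""
    return {r for q in s_i for r in delta.get((q, alpha), ())}


def subset_construction_marked(q0, nfa_finals, sigma, delta):
    s0 = eps_closure({q0}, delta)
    S = {s0}
    unmarked = deque([s0])
    T, calls = {}, 0
    while unmarked:                       # some unmarked s_i remains in S
        s_i = unmarked.popleft()          # leaving the queue "marks" s_i
        for alpha in sigma:
            t = eps_closure(move(s_i, alpha, delta), delta)
            calls += 1
            if not t:
                continue                  # sketch's choice: omit the empty set
            if t not in S:
                S.add(t)
                unmarked.append(t)        # new sets start out unmarked
            T[(s_i, alpha)] = t
    F_D = {s for s in S if s & nfa_finals}  # sets holding an NFA final state
    return S, T, s0, F_D, calls


S, T, s0, F_D, calls = subset_construction_marked(0, {9}, "abc", DELTA)
print(len(S), "DFA states;", calls, "closure computations")
```

Each set is dequeued exactly once, so the number of closure computations is the number of sets times the alphabet size.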

Unfortunately, S can become rather large. The principal determinant of how much state expansion occurs is the degree of nondeterminism found in the input NFA. Recall, however, that the DFA makes exactly one transition per input character, independent of the size of Q_D. Thus, the use of nondeterminism in specifying and building the NFA increases the space required to represent the corresponding DFA, but not the amount of time required for recognizing an input string.

Computing ε-closure as a Fixed Point

To compute ε-closure(), we use one of two approaches: a straightforward, online algorithm that follows paths in the NFA’s transition graph, or an offline algorithm that computes the ε-closure for each state in the NFA in a single fixed point computation.

for each state n ∈ N
    E(n) ← ∅

while (some E(n) has changed)
    for each state n ∈ N
        E(n) ← {n} ∪ ⋃_{⟨n,s,ε⟩} E(s)

Here, we have used the notation ⟨n, s, ε⟩ to name a transition from n to s on ε. Each E(n) contains some subset of N (an element of 2^N). E(n) grows monotonically since line five uses ∪ (not ∩). The algorithm halts when no E(n) changes in an iteration of the outer loop. When it halts, E(n) contains the names of all states in ε-closure(n).
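A minimal round-robin version of this fixed point computation is sketched below. It assumes that each E(n) is seeded with n itself inside the union, so that the result agrees with the definition of ε-closure; the edge-set encoding and the example graph are also assumptions.

```python
def eps_closure_all(states, eps_edges):
    """Compute E(n) for every state by round-robin iteration.

    `eps_edges` is a set of (n, s) pairs, each an ε-transition from n to s.
    """
    succ = {n: set() for n in states}
    for n, s in eps_edges:
        succ[n].add(s)

    E = {n: set() for n in states}          # E(n) <- empty set
    changed = True
    while changed:                          # some E(n) has changed
        changed = False
        for n in states:
            # union over the ε-successors of n, plus n itself (assumption)
            new = {n}.union(*(E[s] for s in succ[n]))
            if new != E[n]:
                E[n] = new
                changed = True
    return E


# ε-transitions of an assumed Thompson-style NFA for a(b|c)*
EDGES = {(1, 8), (8, 6), (8, 9), (6, 2), (6, 4), (3, 7), (5, 7), (7, 6), (7, 9)}
E = eps_closure_all(range(10), EDGES)
print(sorted(E[1]))
```

How many passes the outer loop needs depends on the order in which states are visited, which is the sensitivity the path-length argument is about.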

We can obtain a tighter time bound by observing that |E(n)| can be no larger than the number of states involved in a path leaving n that is labeled entirely with ε’s. Thus, the time required for a computation must be related to the number of nodes in that path. The largest E(n) set can have N nodes. Consider that longest path. The algorithm cannot halt until the name of the last node on the path reaches the first node on the path. In each iteration of the outer loop, the name of the last node must move one or more steps closer to the head of the path. Even with the worst ordering for that path, it must move along one edge in the path.

At the start of the iteration, n_last ∈ E(n_i) for some n_i. If it has not yet reached the head of the path, then there must be an edge ⟨n_j, n_i, ε⟩ in the path. The node n_j will be visited in the inner loop, so n_last will move from E(n_i) to E(n_j). Fortuitous ordering can move it along more than one ε-transition in a single iteration of the inner loop, but it must always move along at least one ε-transition, unless it is in the last iteration of the outer loop.

Thus, the algorithm requires at most one while loop iteration for each edge in the longest ε-path in the graph, plus an extra iteration to recognize that the E sets have stabilized. Each iteration visits N nodes and does E unions, where N and E count the states and the ε-transitions. Thus, its complexity is O(N(N + E)) or O(max(N², NE)). This is much better than O(2^N).

We can reformulate the algorithm to improve its specific behavior by using a worklist technique rather than a round-robin technique.

for each state n ∈ N
    E(n) ← ∅

WorkList ← N

while (WorkList ≠ ∅)
    remove n_i from WorkList
    E(n_i) ← {n_i} ∪ ⋃_{⟨n_i,n_j,ε⟩} E(n_j)
    if E(n_i) changed then
        WorkList ← WorkList ∪ { n_k | ⟨n_k, n_i, ε⟩ ∈ δ_NFA }

This version only visits a node when the E set at one of its ε-successors has changed. Thus, it may perform fewer union operations than the round-robin version. However, its asymptotic behavior is the same. The only way to improve its asymptotic behavior is to change the order in which nodes are removed from the worklist. This issue will be explored in some depth when we encounter data-flow analysis in Chapter 13.
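The worklist formulation can be sketched as follows, under the same assumptions as before: E(n) includes n itself, ε-transitions are given as (source, target) pairs, and the example graph is invented. The worklist is an unordered Python set, so the removal order is arbitrary.

```python
def eps_closure_worklist(states, eps_edges):
    """Compute E(n) for every state, revisiting a node only when needed.

    `eps_edges` is a set of (n, s) pairs, each an ε-transition from n to s.
    """
    succ = {n: set() for n in states}
    pred = {n: set() for n in states}
    for n, s in eps_edges:
        succ[n].add(s)
        pred[s].add(n)

    E = {n: set() for n in states}
    worklist = set(states)                  # WorkList <- N
    while worklist:
        n_i = worklist.pop()                # arbitrary removal order
        new = {n_i}.union(*(E[n_j] for n_j in succ[n_i]))
        if new != E[n_i]:
            E[n_i] = new
            worklist |= pred[n_i]           # ε-predecessors must be revisited
    return E


# ε-transitions of an assumed Thompson-style NFA for a(b|c)*
EDGES = {(1, 8), (8, 6), (8, 9), (6, 2), (6, 4), (3, 7), (5, 7), (7, 6), (7, 9)}
E = eps_closure_worklist(range(10), EDGES)
print(sorted(E[1]))
```

Replacing the set with an ordered structure is where a different removal order could be plugged in; the result is the same either way, only the number of visits changes.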


2.7.3 Some final points

Thus far, we have developed the mechanisms to construct a DFA implementation from a single regular expression. To be useful, a compiler’s scanner must recognize all the syntactic categories that appear in the grammar for the source language. What we need, then, is a recognizer that can handle all the REs for the language’s micro-syntax. Given the REs for the various syntactic categories, r₁, r₂, r₃, . . . , r_k, we can construct a single RE for the entire collection by forming (r₁ | r₂ | r₃ | . . . | r_k).
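As a small illustration of forming the combined RE, the sketch below joins pattern strings with alternation. The three category patterns and the use of Python's `re` syntax are assumptions; `re` is a backtracking matcher, not the NFA-to-DFA pipeline described here, and is used only to check membership.

```python
import re


def combine(res):
    """Form (r1|r2|...|rk), parenthesizing each r_i to protect its operators."""
    return "(" + "|".join(f"({r})" for r in res) + ")"


# Hypothetical syntactic categories: keywords, identifiers, integers
categories = ["if|then|else", "[a-z][a-z0-9]*", "[0-9]+"]
combined = combine(categories)
print(combined)
print(bool(re.fullmatch(combined, "x27")), bool(re.fullmatch(combined, "2x")))
```

The inner parentheses matter because alternation has the lowest precedence; without them, an r_i that itself contains | would still combine correctly, but one followed by concatenation in a larger expression would not.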

If we run this RE through the entire process, building NFAs for the subexpressions, joining them with ε-transitions, coalescing states, constructing the DFA that simulates the NFA, and turning the DFA into executable code, we get a scanner that recognizes precisely one word. That is, when we invoke it on some input, it will run through the characters one at a time and accept the string if it is in a final state when it exhausts the input. Unfortunately, most real programs contain more than one word. We need to transform either the language or the recognizer.

At the language level, we can insist that each word end with some easily recognizable delimiter, like a blank or a tab. This is deceptively attractive. Taken literally, it would require delimiters surrounding commas, operators such as + and -, and parentheses.

At the recognizer level, we can transform the DFA slightly and change the notion of accepting a string. For each final state, q_i, we (1) create a new state q_j, (2) remove q_i from F and add q_j to F, and (3) make the error transition from q_i go to q_j. When the scanner reaches q_i and cannot legally extend the current word, it will take the transition to q_j, a final state. As a final issue, we must make the scanner stop, backspace the input by one character, and accept in each new final state. With these modifications, the recognizer will discover the longest legal word that is a prefix of the input string.
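The transformation can be sketched on a tiny table-driven recognizer. Everything concrete here is an assumption made for illustration: the two-state DFA for a(b|c)∗, the tuple `("accept", q_i)` standing for the new state q_j, and `None` standing for end of input. The sketch also assumes the start state is not final, so every word is nonempty.

```python
# Assumed DFA for a(b|c)*: state 0 is the start state, state 1 is final
T = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 1}


def transform(finals):
    """Steps (1)-(3): give each final q_i a new final state q_j that is
    reached from q_i on the error transition."""
    error = {q_i: ("accept", q_i) for q_i in finals}
    return error, set(error.values())


def next_word(text, pos, table, start, error, new_finals):
    """Recognize one word starting at `pos`; return (word, next position)."""
    state, i = start, pos
    while state not in new_finals:
        ch = text[i] if i < len(text) else None   # None marks end of input
        i += 1                                    # consume one character
        if (state, ch) in table:
            state = table[(state, ch)]
        elif state in error:
            state = error[state]                  # error transition to q_j
        else:
            raise ValueError(f"no legal word at position {pos}")
    return text[pos:i - 1], i - 1                 # backspace one character


def scan(text, table, start, finals):
    error, new_finals = transform(finals)
    words, pos = [], 0
    while pos < len(text):
        word, pos = next_word(text, pos, table, start, error, new_finals)
        words.append(word)
    return words


print(scan("abcabbac", T, 0, {1}))
```

The character that triggers the error transition is consumed and then given back, so it becomes the first character of the next word.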

What about words that match more than one pattern? Because the methods described in this chapter build from a base of nondeterminism, we can union together these arbitrary REs without worrying about conflicting rules. For example, the specification for an Algol identifier admits all of the reserved keywords of the language. The compiler writer has a choice on handling this situation. The scanner can recognize those keywords as identifiers and look up each identifier in a pre-computed table to discover keywords, or it can include an RE for each keyword. This latter case introduces nondeterminism; the transformations will handle it correctly. It also introduces a more subtle problem: the final NFA reaches two distinct final states, one recognizing the keyword and the other recognizing the identifier, and is expected to consistently choose the former. To achieve the desired behavior, scanner generators usually offer a mechanism for prioritizing REs to resolve such conflicts.
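Both options can be sketched briefly. The keyword subset, the state numbers, and the `(rank, name)` encoding of priorities are hypothetical; the second function shows one way a prioritizing mechanism might pick a category when a DFA state contains several NFA final states.

```python
KEYWORDS = {"begin", "end", "if", "then", "else"}  # illustrative subset only


def classify(lexeme):
    """Option 1: scan every word as an identifier, then consult a table."""
    return "keyword" if lexeme in KEYWORDS else "identifier"


def category(dfa_state, pattern_of):
    """Option 2: choose the highest-priority pattern among the NFA final
    states inside a DFA state. `pattern_of` maps an NFA final state to
    (rank, name); a lower rank means a higher priority."""
    hits = [pattern_of[q] for q in dfa_state if q in pattern_of]
    return min(hits)[1] if hits else None


# Hypothetical NFA final states: 12 ends a keyword RE, 20 ends the identifier RE
pattern_of = {12: (0, "keyword"), 20: (1, "identifier")}
print(classify("begin"), classify("beginning"))
print(category(frozenset({12, 20}), pattern_of),
      category(frozenset({20}), pattern_of))
```

The table lookup keeps the automaton small at the cost of a lookup per identifier, while the priority scheme resolves the conflict once, when the DFA is built.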

Lex and its descendants prioritize patterns by the order in which they appear in the input file. Thus, placing keyword patterns before the identifier pattern would ensure the desired behavior. The implementation can ensure that the final states for patterns are numbered in an order that corresponds to this priority

P ← { F, (Q − F) }
while (P is still changing)
    T ← ∅
    for each set s ∈ P
        for each α ∈ Σ
            partition s by α
