Better Implementations - Engineering A Compiler pdf

A straightforward scanner generator would take as input a set of regular expressions, construct the NFA for each RE, combine them using ε-transitions (using the pattern for a|b in Thompson's construction), and perform the subset construction to create the corresponding DFA. To convert the DFA into an executable program, it would encode the transition function into a table indexed by current state and input character, and plug the table into a fairly standard skeleton scanner, like the one shown in Figure 2.3.
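The subset construction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the NFA below is a hand-written rendering of Thompson's pattern for a|b, and the names `NFA`, `EPS`, `eps_closure`, and `subset_construction` are hypothetical, not taken from the text.

```python
from collections import deque

EPS = ""  # hypothetical marker for an epsilon-transition

# NFA for a|b in the shape of Thompson's alternation pattern.
# State numbers are illustrative only.
NFA = {
    0: {EPS: {1, 3}},
    1: {"a": {2}},
    2: {EPS: {5}},
    3: {"b": {4}},
    4: {EPS: {5}},
    5: {},
}
NFA_START, NFA_FINAL = 0, {5}


def eps_closure(states):
    """All NFA states reachable from `states` using only epsilon-transitions."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in NFA[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return frozenset(closure)


def subset_construction(alphabet):
    """Return (start, table, finals); each DFA state is a frozenset of NFA states."""
    start = eps_closure({NFA_START})
    table, finals = {}, set()
    seen, work = {start}, deque([start])
    while work:
        d = work.popleft()
        if d & NFA_FINAL:
            finals.add(d)
        for c in alphabet:
            moved = set()
            for s in d:
                moved |= NFA[s].get(c, set())
            if not moved:
                continue  # a missing table entry stands for the error state
            target = eps_closure(moved)
            table[(d, c)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    return start, table, finals


start, table, finals = subset_construction("ab")
```

The resulting `table` is keyed by (DFA state, input character), which is the form a table-driven skeleton scanner would consume.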

While this path from a collection of regular expressions to a working scanner is a little long, each of the steps is well understood. This is a good example of the kind of tedious process that is well suited to automation by a computer. However, a number of refinements to the automatic construction process can improve the quality of the resulting scanner or speed up the construction.

2.8.1 DFA Minimization

The NFA to DFA conversion can create a DFA with a large set of states. While this does not increase the number of instructions required to scan a given string, it does increase the memory requirements of the recognizer. On modern computers, the speed of memory accesses often governs the speed of computation. Smaller tables use less space on disk, in RAM, and in the processor's cache. Each of those can be an advantage.

To minimize the size of the DFA, D = (Q, Σ, δ, q₀, F), we need a technique for recognizing when two states are equivalent, that is, when they produce the same behavior on any input string. Figure 2.9 shows an algorithm that partitions the states of a DFA into equivalence classes based on their behavior relative to an input string.

Because the algorithm must also preserve halting behavior, the algorithm cannot place a final state in the same class as a non-final state. Thus, the initial partitioning step divides Q into two equivalence classes, F and Q − F.

Each iteration of the while loop refines the current partition, P, by splitting apart sets in P based on their outbound transitions. Consider a set p = {q_i, q_j, q_k} in the current partition. Assume that q_i, q_j, and q_k all have transitions on some symbol α ∈ Σ, with q_x = δ(q_i, α), q_y = δ(q_j, α), and q_z = δ(q_k, α). If all of q_x, q_y, and q_z are in the same set in the current partition, then q_i, q_j, and q_k should remain in the same set in the new partition. If, on the other hand, q_z is in a different set than q_x and q_y, then the algorithm splits p into p₁ = {q_i, q_j} and p₂ = {q_k}, and puts both p₁ and p₂ into the new partition. This is the critical step in the algorithm.
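The refinement loop can be sketched as follows. This is a minimal sketch, not a transcription of Figure 2.9: it groups the states of each set by their targets on all symbols at once, rather than splitting on one symbol at a time, and it treats a missing transition as a move to an implicit error state. The function names and the example DFA (for fee|fie) are assumptions made for illustration.

```python
def index_of(partition, state):
    """Index of the set containing `state`; -1 stands for a missing transition."""
    for i, p in enumerate(partition):
        if state in p:
            return i
    return -1


def refine(states, alphabet, delta, finals):
    """Start from {F, Q - F} and split sets until no set can be split further."""
    partition = [p for p in (set(finals), set(states) - set(finals)) if p]
    while True:
        new_partition = []
        for p in partition:
            # States stay together only if every symbol leads them
            # into the same set of the current partition.
            groups = {}
            for q in sorted(p):
                key = tuple(index_of(partition, delta.get((q, a))) for a in alphabet)
                groups.setdefault(key, set()).add(q)
            new_partition.extend(groups.values())
        if len(new_partition) == len(partition):
            return new_partition  # fixed point: nothing was split
        partition = new_partition


# Hypothetical DFA for fee|fie: states 0..5, final states 3 and 5.
DELTA = {(0, "f"): 1, (1, "e"): 2, (1, "i"): 4, (2, "e"): 3, (4, "e"): 5}
classes = refine(range(6), "efi", DELTA, {3, 5})
print(sorted(sorted(p) for p in classes))  # [[0], [1], [2, 4], [3, 5]]
```

In this example, states 2 and 4 end up in one set because each moves to a final state on e and has no other transitions.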

When the algorithm halts, the final partition cannot be refined. Thus, for a set s ∈ P, the states in s cannot be distinguished by their behavior on an input string. From the partition, we can construct a new DFA by using a single state to represent each set of states in P, and adding the appropriate transitions between these new representative states. For each state s ∈ P, the transition out of s on some α ∈ Σ must go to a single set t in P; if this were not the case, the algorithm would have split s into two or more smaller sets.

To construct the new DFA, we simply create a state to represent each p ∈ P, and add the appropriate transitions. After that, we need to remove any states not reachable from the entry state, along with any non-final state that has transitions back to itself on every α ∈ Σ. (Unless, of course, we want an explicit representation of the error state.) The resulting DFA is minimal; we leave the proof to the interested reader.
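Given a final partition, building the new DFA is mostly bookkeeping. The sketch below assumes the partition has already been computed and is passed in as a list of sets; the function name, the choice of the smallest member as each set's representative, and the example DFA for fee|fie are all hypothetical. Missing transitions stand for the error state, so there is no explicit error state to remove.

```python
def build_minimal(partition, alphabet, delta, start, finals):
    """Collapse each set in `partition` to one state and carry transitions over."""
    # Name each new state after the smallest member of its set (arbitrary choice).
    rep = {q: min(p) for p in partition for q in p}
    new_delta = {(rep[q], a): rep[t] for (q, a), t in delta.items()}
    new_start = rep[start]
    new_finals = {rep[q] for q in finals}

    # Keep only the states reachable from the entry state.
    reachable, work = {new_start}, [new_start]
    while work:
        s = work.pop()
        for a in alphabet:
            t = new_delta.get((s, a))
            if t is not None and t not in reachable:
                reachable.add(t)
                work.append(t)
    new_delta = {(s, a): t for (s, a), t in new_delta.items() if s in reachable}
    return reachable, new_delta, new_start, new_finals & reachable


# Hypothetical DFA for fee|fie and its final partition.
DELTA = {(0, "f"): 1, (1, "e"): 2, (1, "i"): 4, (2, "e"): 3, (4, "e"): 5}
PARTITION = [{0}, {1}, {2, 4}, {3, 5}]
states, table, entry, accepting = build_minimal(PARTITION, "efi", DELTA, 0, {3, 5})
```

Here the six original states collapse to four, and both e and i lead from state 1 to the single state that represents {2, 4}.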

This algorithm is another example of a fixed point computation. P is finite; at most, it can contain |Q| elements. The body of the while loop can only increase the size of P; it splits sets in P but never combines them. The worst case behavior occurs when each state in Q has different behavior; in that case, the while loop halts when P has a unique set for each q ∈ Q. (This would occur if the algorithm were invoked on a minimal DFA.)

2.8.2 Programming Tricks

Explicit State Manipulation Versus Table Lookup. The example code in Figure 2.3 uses an explicit variable, state, to hold the current state of the DFA. The while loop tests char against eof, computes a new state, calls action to interpret it, advances the input stream, and branches back to the top of the loop. The implementation spends much of its time manipulating or testing the state, and we have not yet explicitly discussed the expense incurred in the array lookup to implement the transition table or the logic required to support the switch statement (see Chapter 8).
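For comparison, a table-driven recognizer for r digit digit* might look like the sketch below. It is not a transcription of Figure 2.3; the state names, the `char_class` helper that compresses the alphabet into three columns, and the table layout are assumptions made for illustration. Note that every character costs a classification, a table lookup, and an update of the explicit `state` variable.

```python
S0, S1, S2, SE = 0, 1, 2, 3  # SE is the error state

# Transition table indexed by current state and character class.
TABLE = {
    S0: {"r": S1, "digit": SE, "other": SE},
    S1: {"r": SE, "digit": S2, "other": SE},
    S2: {"r": SE, "digit": S2, "other": SE},
    SE: {"r": SE, "digit": SE, "other": SE},
}


def char_class(ch):
    """Map a character to the table column that describes it."""
    if ch == "r":
        return "r"
    if "0" <= ch <= "9":
        return "digit"
    return "other"


def recognize_table(text):
    """Return True if `text` matches r digit digit*."""
    state = S0
    for ch in text:
        state = TABLE[state][char_class(ch)]
    return state == S2
```

For example, `recognize_table("r17")` accepts, while `recognize_table("r")` and `recognize_table("r1x")` do not.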

We can avoid much of this overhead by encoding the state information implicitly in the program counter. In this model, each state checks the next character against its transitions, and branches directly to the next state. This creates a program with complex control flow; it resembles nothing as much as a jumbled heap of spaghetti. Figure 2.10 shows a version of the skeleton recognizer written in this style. It is both shorter and simpler than the table-driven version. It should be faster, because the overhead per state is lower than in the table-lookup version.

    s₀:  word ← "";
         char ← next character;
         if (char = 'r') then goto s₁;
         else goto s_e;

    s₁:  word ← word + char;
         char ← next character;
         if ('0' ≤ char ≤ '9') then goto s₂;
         else goto s_e;

    s₂:  word ← word + char;
         char ← next character;
         if ('0' ≤ char ≤ '9') then goto s₂;
         else if (char = eof) then report acceptance;
         else goto s_e;

    s_e: print error message;
         return failure;

Figure 2.10: A direct-coded recognizer for "r digit digit*"
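Python has no goto, so the sketch below only approximates the direct-coded style: the current state is implied by the position in the code rather than held in a variable, and there is no transition table. The function name and the use of `None` to stand for eof are assumptions made for illustration.

```python
def recognize_direct(text):
    """Direct-coded recognizer for r digit digit*; returns the word or None."""
    chars = iter(text)

    # s0: the first character must be 'r'.
    ch = next(chars, None)  # None plays the role of eof
    if ch != "r":
        return None  # error state
    word = ch

    # s1: at least one digit must follow.
    ch = next(chars, None)
    if ch is None or not "0" <= ch <= "9":
        return None  # error state

    # s2: consume digits; accept only if the input ends here.
    while ch is not None and "0" <= ch <= "9":
        word += ch
        ch = next(chars, None)
    return word if ch is None else None
```

For example, `recognize_direct("r17")` returns the word `"r17"`, while `recognize_direct("r")` and `recognize_direct("r1x")` return `None`.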

Of course, this implementation paradigm violates many of the precepts of structured programming. In a small program, like the example, this style may be comprehensible. As the RE specification becomes more complex and generates both more states and more transitions, the added complexity can make the code quite difficult to follow. If the code is generated directly from a collection of REs, using automatic tools, there is little reason for a human to directly read or debug the scanner code. The additional speed obtained from lower overhead and better memory locality makes direct coding an attractive option.

Hashing Keywords versus Directly Encoding Them. The scanner writer must choose how to specify reserved keywords in the source programming language: words like for, while, if, then, and else. These words can be written as regular expressions in the scanner specification, or they can be folded into the set of identifiers and recognized using a table lookup in the actions associated with an identifier.

With a reasonably implemented hash table, the expected case behavior of the two schemes should differ by a constant amount. The DFA requires time proportional to the length of the keyword, and the hash mechanism adds a constant time overhead after recognition.
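The table-lookup scheme can be sketched as follows: the scanner recognizes every word as an identifier and then consults a hash-based set of reserved words in the associated action. The names `KEYWORDS` and `scan_word`, the token type strings, and the identifier syntax (a letter followed by letters or digits) are assumptions made for illustration.

```python
# Reserved words from the example in the text, stored in a hash-based set.
KEYWORDS = frozenset({"for", "while", "if", "then", "else"})


def scan_word(text, pos=0):
    """Scan letter (letter | digit)* starting at `pos`.

    Returns (token_type, lexeme, next_pos), or None if no word starts at `pos`.
    """
    if pos >= len(text) or not text[pos].isalpha():
        return None
    end = pos + 1
    while end < len(text) and text[end].isalnum():
        end += 1
    lexeme = text[pos:end]
    # One expected constant-time lookup after the word is recognized.
    kind = "keyword" if lexeme in KEYWORDS else "identifier"
    return kind, lexeme, end
```

For example, `scan_word("while x")` classifies `while` as a keyword, whereas `scan_word("whilex")` yields an identifier. The cost of the lookup is paid on every word, whether or not it turns out to be reserved.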

From an implementation perspective, however, direct coding is simpler. It avoids the need for a separate hash table of reserved words, along with the cost of a hash lookup on every identifier. Direct coding increases the size of the DFA from which the scanner is built. This can make the scanner's memory requirements larger and might require more code to select the transitions out

