
Better Implementations

In document Engineering A Compiler pdf (Page 52-55)

A straightforward scanner generator would take as input a set of regular expressions, construct the NFA for each RE, combine them using ε-transitions (using the pattern for a|b in Thompson’s construction), and perform the subset construction to create the corresponding DFA. To convert the DFA into an executable program, it would encode the transition function into a table indexed by current state and input character, and plug the table into a fairly standard skeleton scanner, like the one shown in Figure 2.3.

While this path from a collection of regular expressions to a working scanner is a little long, each of the steps is well understood. This is a good example of the kind of tedious process that is well suited to automation by a computer. However, a number of refinements to the automatic construction process can improve the quality of the resulting scanner or speed up the construction.

2.8.1 DFA minimization

The NFA to DFA conversion can create a DFA with a large set of states. While this does not increase the number of instructions required to scan a given string, it does increase the memory requirements of the recognizer. On modern computers, the speed of memory accesses often governs the speed of computation. Smaller tables use less space on disk, in RAM, and in the processor’s cache. Each of those can be an advantage.

To minimize the size of the DFA, D = (Q, Σ, δ, q0, F), we need a technique for recognizing when two states are equivalent; that is, they produce the same behavior on any input string. Figure 2.9 shows an algorithm that partitions the states of a DFA into equivalence classes based on their behavior relative to an input string.

Because the algorithm must also preserve halting behavior, it cannot place a final state in the same class as a non-final state. Thus, the initial partitioning step divides Q into two equivalence classes, F and Q − F.

Each iteration of the while loop refines the current partition, P, by splitting apart sets in P based on their outbound transitions. Consider a set p = {qi, qj, qk} in the current partition. Assume that qi, qj, and qk all have transitions on some symbol α ∈ Σ, with qx = δ(qi, α), qy = δ(qj, α), and qz = δ(qk, α). If all of qx, qy, and qz are in the same set in the current partition, then qi, qj, and qk should remain in the same set in the new partition. If, on the other hand, qz is in a different set than qx and qy, then the algorithm splits p into p1 = {qi, qj} and p2 = {qk}, and puts both p1 and p2 into the new partition. This is the critical step in the algorithm.
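The splitting step can be sketched in Python. This is a minimal illustration, not the algorithm of Figure 2.9 itself: the function name `split`, the dictionary encoding of δ, and the use of `None` for a missing transition are all assumptions made for the example.

```python
def split(group, partition, delta, alphabet):
    """Split `group` on the first symbol whose targets land in different sets.

    `delta` maps (state, symbol) -> state; a missing entry is treated as a
    transition to an implicit error state (an assumption of this sketch).
    """
    def block_index(state):
        # Position in `partition` of the set holding `state`, or None.
        for i, block in enumerate(partition):
            if state in block:
                return i
        return None

    for symbol in alphabet:
        buckets = {}
        for q in sorted(group):
            target = delta.get((q, symbol))
            buckets.setdefault(block_index(target), set()).add(q)
        if len(buckets) > 1:
            # Members disagree on `symbol`, so the group must be split.
            return [frozenset(b) for b in buckets.values()]
    return [frozenset(group)]


# The situation from the text: qx and qy share a set, qz is in another.
partition = [
    frozenset({"qi", "qj", "qk"}),
    frozenset({"qx", "qy"}),
    frozenset({"qz"}),
]
delta = {("qi", "a"): "qx", ("qj", "a"): "qy", ("qk", "a"): "qz"}
pieces = split(partition[0], partition, delta, ["a"])
```

Here `pieces` holds the two sets p1 = {qi, qj} and p2 = {qk}.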

When the algorithm halts, the final partition cannot be refined. Thus, for a set s ∈ P, the states in s cannot be distinguished by their behavior on an input string. From the partition, we can construct a new DFA by using a single state to represent each set of states in P, and adding the appropriate transitions between these new representative states. For each state s ∈ P, the transition out of s on some α ∈ Σ must go to a single set t in P; if this were not the case, the algorithm would have split s into two or more smaller sets.

To construct the new DFA, we simply create a state to represent each p ∈ P, and add the appropriate transitions. After that, we need to remove any states not reachable from the entry state, along with any state that has transitions back to itself on every α ∈ Σ. (Unless, of course, we want an explicit representation of the error state.) The resulting DFA is minimal; we leave the proof to the interested reader.

This algorithm is another example of a fixed point computation. P is finite; at most, it can contain |Q| elements. The body of the while loop can only increase the size of P; it splits sets in P but never combines them. The worst case behavior occurs when each state in Q has different behavior; in that case, the while loop halts when P has a unique set for each q ∈ Q. (This would occur if the algorithm was invoked on a minimal DFA.)
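The whole fixed point computation, followed by construction of the new DFA, might look like the Python sketch below. It is one possible rendering under stated assumptions, not a transcription of Figure 2.9. It splits each set on all symbols at once rather than one symbol at a time, which reaches the same final partition. It treats a missing entry in `delta` as a transition to an implicit error state. It also assumes every input state is reachable, so it skips the clean-up step. The names `minimize` and `block_of`, and the single symbol "d" standing for any digit, are hypothetical.

```python
def minimize(states, alphabet, delta, start, finals):
    """Return (states, delta, start, finals) of the minimized DFA."""
    finals = frozenset(finals)
    # Initial partition: F and Q - F (an empty set is dropped).
    partition = [b for b in (finals, frozenset(states) - finals) if b]

    def block_of(state, blocks):
        for block in blocks:
            if state in block:
                return block
        return None  # missing transition: the implicit error state

    changed = True
    while changed:  # iterate until no set can be split further
        changed = False
        refined = []
        for group in partition:
            buckets = {}
            for q in group:
                # The sets reached from q, one entry per symbol.
                signature = tuple(
                    block_of(delta.get((q, a)), partition) for a in alphabet
                )
                buckets.setdefault(signature, set()).add(q)
            if len(buckets) > 1:
                changed = True
            refined.extend(frozenset(b) for b in buckets.values())
        partition = refined

    # One new state per set; any member can supply the transitions.
    new_delta = {}
    for block in partition:
        rep = next(iter(block))
        for a in alphabet:
            target = delta.get((rep, a))
            if target is not None:
                new_delta[(block, a)] = block_of(target, partition)
    new_finals = {b for b in partition if b & finals}
    return set(partition), new_delta, block_of(start, partition), new_finals


# A four-state DFA for r digit digit*, where s2 and s3 are equivalent.
states = {"s0", "s1", "s2", "s3"}
delta = {("s0", "r"): "s1", ("s1", "d"): "s2",
         ("s2", "d"): "s3", ("s3", "d"): "s2"}
new_states, new_delta, new_start, new_finals = minimize(
    states, ["r", "d"], delta, "s0", {"s2", "s3"}
)
```

In this example the two final states collapse into one, leaving three states. The loop terminates because each pass either splits some set or stops, and P can never hold more than |Q| sets.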

2.8.2 Programming Tricks

Explicit State Manipulation Versus Table Lookup. The example code in Figure 2.3 uses an explicit variable, state, to hold the current state of the DFA. The while loop tests char against eof, computes a new state, calls action to interpret it, advances the input stream, and branches back to the top of the loop. The implementation spends much of its time manipulating or testing the state (and we have not yet explicitly discussed the expense incurred in the array lookup to implement the transition table or the logic required to support the switch statement; see Chapter 8).
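For comparison, a table-driven recognizer with an explicit state variable can be sketched in Python. Figure 2.3 is not reproduced in this section, so the code below is only an assumption about the general shape of such a skeleton. The names `TABLE`, `ACCEPTING`, `classify`, and `recognize_table` are hypothetical, and the action routine is omitted.

```python
# Transition table for r digit digit*, indexed by state and character class.
TABLE = {
    "s0": {"r": "s1"},
    "s1": {"digit": "s2"},
    "s2": {"digit": "s2"},
}
ACCEPTING = {"s2"}


def classify(ch):
    """Map an input character to a column of the table."""
    if ch == "r":
        return "r"
    if "0" <= ch <= "9":
        return "digit"
    return "other"


def recognize_table(text):
    state = "s0"  # explicit state variable, updated on every character
    for ch in text:
        # One table lookup per character; a missing entry means error.
        state = TABLE[state].get(classify(ch), "se")
        if state == "se":
            return False
    return state in ACCEPTING
```

Each character costs one classification, one table lookup, and one test of the state variable, which is the per-state overhead the text refers to.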

We can avoid much of this overhead by encoding the state information implicitly in the program counter. In this model, each state checks the next character against its transitions, and branches directly to the next state. This creates a program with complex control flow; it resembles nothing as much as a jumbled heap of spaghetti. Figure 2.10 shows a version of the skeleton recognizer written in this style. It is both shorter and simpler than the table-driven version. It should be faster, because the overhead per state is lower than in the table-lookup version.

        char ← next character;
    s0: word ← empty string;
        if (char = ’r’)
            then goto s1;
            else goto se;
    s1: word ← word + char;
        char ← next character;
        if (’0’ ≤ char ≤ ’9’)
            then goto s2;
            else goto se;
    s2: word ← word + char;
        char ← next character;
        if (’0’ ≤ char ≤ ’9’)
            then goto s2;
        else if (char = eof)
            then report acceptance;
            else goto se;
    se: print error message;
        return failure;

Figure 2.10: A direct-coded recognizer for “r digit digit∗”
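Python has no goto, so the direct-coded style can only be approximated there. The sketch below is an illustration under that limitation, not the code of Figure 2.10. It keeps no state variable; the current state is implied by the position in the code, and the self-loop on the last state becomes an ordinary while loop. The function name `recognize_direct` is hypothetical.

```python
def recognize_direct(text):
    """Recognize r digit digit* with the state implicit in the code position."""
    i = 0
    # s0: the first character must be 'r'.
    if i >= len(text) or text[i] != "r":
        return False  # se
    i += 1
    # s1: at least one digit must follow.
    if i >= len(text) or not ("0" <= text[i] <= "9"):
        return False  # se
    i += 1
    # s2: consume further digits; the loop is the transition from s2 to s2.
    while i < len(text) and "0" <= text[i] <= "9":
        i += 1
    # Accept only if the digits ran all the way to the end of the input.
    return i == len(text)
```

There is no table and no state variable to load or test; each character is compared directly against the transitions of the state the code is currently in.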

Of course, this implementation paradigm violates many of the precepts of structured programming. In a small program, like the example, this style may be comprehensible. As the RE specification becomes more complex and generates both more states and more transitions, the added complexity can make it quite difficult to follow. If the code is generated directly from a collection of REs, using automatic tools, there is little reason for a human to directly read or debug the scanner code. The additional speed obtained from lower overhead and better memory locality makes direct-coding an attractive option.

Hashing Keywords versus Directly Encoding Them. The scanner writer must choose how to specify reserved keywords in the source programming language, words like for, while, if, then, and else. These words can be written as regular expressions in the scanner specification, or they can be folded into the set of identifiers and recognized using a table lookup in the actions associated with an identifier.

With a reasonably implemented hash table, the expected case behavior of the two schemes should differ by a constant amount. The DFA requires time proportional to the length of the keyword, and the hash mechanism adds a constant time overhead after recognition.
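The table-lookup alternative can be sketched in a few lines of Python. This is an illustration only: the scanner is assumed to have already recognized a word with its identifier pattern, and the names `KEYWORDS` and `classify_word` are hypothetical. Python's built-in `frozenset` serves as the hash table.

```python
# Reserved words, stored in a hash-based set.
KEYWORDS = frozenset({"for", "while", "if", "then", "else"})


def classify_word(lexeme):
    """Action run after the identifier pattern matches `lexeme`."""
    # One hash lookup per recognized word decides its category.
    if lexeme in KEYWORDS:
        return ("keyword", lexeme)
    return ("identifier", lexeme)
```

For example, `classify_word("while")` yields `("keyword", "while")`, while `classify_word("whilst")` yields `("identifier", "whilst")`. The lookup runs on every identifier, whether or not it is a keyword.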

From an implementation perspective, however, direct coding is simpler. It avoids the need for a separate hash table of reserved words, along with the cost of a hash lookup on every identifier. Direct coding increases the size of the DFA from which the scanner is built. This can make the scanner’s memory requirements larger and might require more code to select the transitions out
