Digression: What about Context-Sensitive Grammars?
Given the progression of ideas from the previous chapters, it might seem natural to consider the use of context-sensitive languages (CSLs) to address these issues. After all, we used regular languages to perform lexical analysis, and context-free languages to perform syntax analysis. A natural progression might suggest the study of CSLs and their grammars. Context-sensitive grammars (CSGs) can express a larger family of languages than can CFGs.
However, CSGs are not the right answer for two distinct reasons. First, the problem of parsing a CSG is PSPACE-complete. Thus, a compiler that used CSG-based techniques could run quite slowly. Second, many of the important questions are difficult to encode in a CSG. For example, consider the issue of declaration before use. To write this rule into a CSG would require distinct productions for each combination of declared variables. With a sufficiently small name space, this might be manageable; in a modern language with a large name space, the set of names is too large to encode into a CSG.
4.5 What Questions Should the Compiler Ask?
Recall that the goal of context-sensitive analysis is to prepare the compiler for the optimization and code generation tasks that will follow. The questions that arise in context-sensitive analysis, then, divide into two broad categories: checking program legality at a deeper level than is possible with the CFG, and elaborating the compiler’s knowledge base from non-local context to prepare for optimization and code generation.
Along those lines, many context-sensitive questions arise.
• Given a variable a, is it a scalar, an array, a structure, or a function? Is it declared? In which procedure is it declared? What is its datatype? Is it actually assigned a value before it is used?
• For an array reference b[i,j,k], is b declared as an array? How many dimensions does b have? Are i, j, and k declared with a data type that is valid for an array index expression? Do i, j, and k have values that place b[i,j,k] inside the declared bounds of b?
• Where can a and b be stored? How long must their values be preserved? Can either be kept in a hardware register?
• In the reference *c, is c declared as a pointer? Is *c an object of the appropriate type? Can *c be reached via another name?
• How many arguments does the function fee take? Do all of its invocations pass the right number of arguments? Are those arguments of the correct type?
• Does the function fie() return a known constant value? Is it a pure function of its parameters, or does the result depend on some implicit state (i.e., the values of static or global variables)?
Most of these questions share a common trait. Their answers involve information that is not locally available in the syntax. For example, checking the number and type of arguments at a procedure call requires knowledge of both the procedure’s declaration and the call site in question. In many cases, these two statements will be separated by intervening context.
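As a minimal sketch of why such a check needs non-local information, the Python fragment below records each procedure's declared parameter types in a symbol table and consults that table at every call site. The names `Signature` and `check_call`, and the sample declaration of `fee` with two `int` parameters, are illustrative assumptions and not part of any particular compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Declared parameter types of a procedure (hypothetical record)."""
    param_types: tuple


def check_call(symbols, name, arg_types):
    """Return a list of error messages for one call site.

    symbols   -- dict mapping procedure names to Signature records
    name      -- the procedure named at the call site
    arg_types -- types of the actual arguments, in order
    """
    sig = symbols.get(name)
    if sig is None:
        return [f"{name}: called but not declared"]
    errors = []
    if len(arg_types) != len(sig.param_types):
        errors.append(
            f"{name}: expected {len(sig.param_types)} arguments, "
            f"got {len(arg_types)}"
        )
    # Compare the arguments that can be paired with a formal parameter.
    for pos, (actual, formal) in enumerate(zip(arg_types, sig.param_types), 1):
        if actual != formal:
            errors.append(f"{name}: argument {pos} is {actual}, expected {formal}")
    return errors


# The declaration is recorded once; call sites may appear much later.
symbols = {"fee": Signature(("int", "int"))}
ok = check_call(symbols, "fee", ["int", "int"])
bad = check_call(symbols, "fee", ["int"])
```

The check itself is trivial once the declaration is at hand; the work lies in carrying the declaration's facts across the intervening context to the call site.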
4.6 Summary and Perspective
In Chapters 2 and 3, we saw that much of the work in a compiler’s front end can be automated. Regular expressions work well for lexical analysis. Context-free grammars work well for syntax analysis. In this chapter, we examined two ways of performing context-sensitive analysis.
The first technique, using attribute grammars, offers the hope of writing high-level specifications that produce reasonably efficient executables. Attribute grammars have found successful application in several domains, ranging from theorem provers through program analysis. (See Section 9.6 for an application in compilation where attribute grammars may be a better fit.) Unfortunately, the attribute grammar approach has enough practical problems that it has not been widely accepted as the paradigm of choice for context-sensitive analysis.
The second technique, called ad hoc syntax-directed translation, integrates arbitrary snippets of code into the parser and lets the parser provide sequencing and communication mechanisms. This approach has been widely embraced, because of its flexibility and its inclusion in most parser generator systems.
Questions
1. Sometimes, the compiler writer can move an issue across the boundary between context-free and context-sensitive analysis. For example, we have discussed the classic ambiguity that arises between function invocation and array references in Fortran 77 (and other languages). These constructs might be added to the classic expression grammar using the productions:
factor   → ident ( ExprList )
ExprList → expr
         | ExprList , expr
Unfortunately, the only difference between a function invocation and an array reference lies in how the ident is declared.
In previous chapters, we have discussed using cooperation between the scanner and the parser to disambiguate these constructs. Can the problem be solved during context-sensitive analysis? Which solution is preferable?
2. Sometimes, a language specification uses context-sensitive mechanisms to check properties that can be tested in a context-free way. Consider the grammar fragment in Figure 4.7. It allows an arbitrary number of StorageClass specifiers when, in fact, the standard restricts a declaration to a single StorageClass specifier.
(a) Rewrite the grammar to enforce the restriction grammatically.
(b) Similarly, the language allows only a limited set of combinations of TypeSpecifier. long is allowed with either int or float; short is allowed only with int. Either signed or unsigned can appear with any form of int. signed may also appear on char. Can these restrictions be written into the grammar?
(c) Propose an explanation for why the authors structured the grammar as they did. (Hint: the scanner returned a single token type for any of the StorageClass values and another token type for any of the TypeSpecifiers.)
(d) How does this strategy affect the speed of the parser? How does it change the space requirements of the parser?
3. Sometimes, a language design will include syntactic constraints that are better handled outside the formalism of a context-free grammar, even though the grammar can handle them. Consider, for example, the following “check off” keyword scheme:
phrase → keyword α β γ δ ζ
α → α-keyword | ε
β → β-keyword | ε
γ → γ-keyword | ε
δ → δ-keyword | ε
ζ → ζ-keyword | ε
with the restrictions that α-keyword, β-keyword, γ-keyword, δ-keyword, and ζ-keyword appear in order, and that each of them appears at most once.
(a) Since the set of combinations is finite, it can clearly be encoded into a series of productions. Give one such grammar.
(b) Propose a mechanism using ad hoc syntax-directed translation to achieve the same result.
(c) A simpler encoding, however, can be done using a more permissive grammar and a hard-coded set of checks in the associated actions.
(d) Can you use an ε-production to further simplify your syntax-directed
Chapter 6
Intermediate Representations
6.1 Introduction
In designing algorithms, a critical distinction arises between problems that must be solved online, and those that can be solved offline. In general, compilers work offline; that is, they can make more than a single pass over the code being translated. Making multiple passes over the code should improve the quality of code generated by the compiler. The compiler can gather information in one pass and use that information to make decisions in later passes.
The notion of a multi-pass compiler (see Figure 6.1) creates the need for an intermediate representation for the code being compiled. In translation, the compiler must derive facts that have no direct representation in the source code, such as the addresses of variables and procedures. Thus, it must use some internal form, an intermediate representation or IR, to represent the code being analyzed and translated. Each pass, except the first, consumes IR. Each pass, except the last, produces IR. In this scheme, the intermediate representation becomes the definitive representation of the code. The IR must be expressive enough to record all of the useful facts that might be passed between phases of the compiler. In our terminology, the IR includes auxiliary tables, like a symbol table, a constant table, or a label table.
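The pass structure described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real compiler: the `IR` class, the three pass functions, the one-assignment-per-line source format, and the 4-byte storage slots are all hypothetical. It shows only that the first pass produces IR, the middle pass consumes and produces IR while adding a derived fact (an address) to the symbol table, and the last pass consumes IR.

```python
from dataclasses import dataclass, field


@dataclass
class IR:
    """Hypothetical IR: a list of operations plus an auxiliary symbol table."""
    ops: list = field(default_factory=list)
    symbol_table: dict = field(default_factory=dict)


def front_end(source):
    """First pass: consumes source text, produces IR."""
    ir = IR()
    for line in source.splitlines():
        target, expr = (part.strip() for part in line.split("=", 1))
        ir.ops.append(("assign", target, expr))
        ir.symbol_table.setdefault(target, {"address": None})
    return ir


def assign_addresses(ir):
    """Middle pass: consumes IR, produces IR with derived facts added."""
    for slot, name in enumerate(sorted(ir.symbol_table)):
        ir.symbol_table[name]["address"] = 4 * slot  # assumed 4-byte slots
    return ir


def back_end(ir):
    """Last pass: consumes IR, produces target text."""
    return [f"store {expr} => [{ir.symbol_table[t]['address']}]"
            for _, t, expr in ir.ops]


code = back_end(assign_addresses(front_end("b = 2\na = 1")))
```

The addresses appear nowhere in the source text; they exist only in the IR's symbol table, which is why the tables count as part of the IR.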
Selecting an appropriate IR for a compiler project requires an understanding of both the source language and the target machine, of the properties of programs that will be presented for compilation, and of the strengths and weaknesses of the language in which the compiler will be implemented.
[Figure: a box labeled "Multi-pass compiler" containing front end → IR → middle end → IR → back end; source code enters on the left and target code leaves on the right]
Figure 6.1: The role of IRs in a multi-pass compiler
Each style of IR has its own strengths and weaknesses. Designing an appropriate IR requires consideration of the compiler’s task. Thus, a source-to-source translator might keep its internal information in a form quite close to the source; a translator that produced assembly code for a micro-controller might use an internal form close to the target machine’s instruction set. It requires consideration of the specific information that must be recorded, analyzed, and manipulated. Thus, a compiler for C might have additional information about pointer values that is unneeded in a compiler for Perl. It requires consideration of the operations that must be performed on the IR and their costs, of the range of constructs that must be expressed in the IR, and of the need for humans to examine the IR program directly.
(The compiler writer should never overlook this final point. A clean, readable external format for the IR pays for itself. Sometimes, syntax can be added to improve readability. An example is the ⇒ symbol used in the ILOC examples throughout this book. It serves no real syntactic purpose; however, it gives the reader direct help in separating operands from results.)
6.2 Taxonomy
To organize our thinking about IRs, we should recognize that there are two major axes along which we can place a specific design. First, the IR has a structural organization. Broadly speaking, three different organizations have been tried.
• Graphical IRs encode the compiler’s knowledge in a graph. The algorithms are expressed in terms of nodes and edges, in terms of lists and trees. Examples include abstract syntax trees and control-flow graphs.
• Linear IRs resemble pseudo-code for some abstract machine. The algorithms iterate over simple, linear sequences of operations. Examples include bytecodes and three-address codes.
• Hybrid IRs combine elements of both graphical and linear IRs, with the goal of capturing the strengths of both. A common hybrid representation uses a low-level linear code to represent blocks of straight-line code and a graph to represent the flow of control between those blocks.¹
¹ We say very little about hybrid IRs in the remainder of this chapter. Instead, we focus on the linear IRs and graphical IRs, leaving it to the reader to envision profitable combinations of the two.
The structural organization of an IR has a strong impact on how the compiler writer thinks about analyzing, transforming, and translating the code. For example, tree-like IRs lead naturally to code generators that either perform a tree-walk or use a tree pattern matching algorithm. Similarly, linear IRs lead naturally to code generators that make a linear pass over all the instructions (the “peephole” paradigm) or that use string pattern matching techniques.
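To make the tree-walk style concrete, here is a small Python sketch that turns an expression tree into a linear, three-address sequence. The tuple encoding of tree nodes, the temporary names `t1`, `t2`, and so on, and the function names `tree_walk` and `generate` are assumptions chosen for illustration; the operation syntax only loosely imitates the notation used in this book.

```python
import itertools


def tree_walk(node, code, fresh):
    """Emit three-address code for an expression tree; return the result name.

    A node is either a leaf (a string naming a value) or a tuple
    (op, left, right). The postorder walk emits code for both children
    before the operation that combines them.
    """
    if isinstance(node, str):
        return node
    op, left, right = node
    lhs = tree_walk(left, code, fresh)
    rhs = tree_walk(right, code, fresh)
    result = f"t{next(fresh)}"
    code.append(f"{op} {lhs}, {rhs} => {result}")
    return result


def generate(tree):
    """Return the linear code for one expression tree."""
    code = []
    tree_walk(tree, code, itertools.count(1))
    return code


# Tree for x * 2 + x * 2 * y
tree = ("add", ("mul", "x", "2"), ("mul", ("mul", "x", "2"), "y"))
linear = generate(tree)
```

The walker's recursion mirrors the shape of the tree, whereas a pass over the resulting list `linear` would be a simple loop. That difference in algorithmic style is the point of the structural axis.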
The second axis of our IR taxonomy is the level of abstraction used to represent operations. This can range from a near-source representation where a procedure call is represented in a single node, to a low-level representation where multiple IR operations are assembled together to create a single instruction on the target machine.
To illustrate the possibilities, consider the difference between the way that a source-level abstract syntax tree and a low-level assembly-like notation might represent the reference A[i,j] into an array declared A[1..10,1..10].
      subscript
     /    |    \
    A     i     j
abstract syntax tree

loadI  1       ⇒ r1
sub    rj, r1  ⇒ r2
loadI  10      ⇒ r3
mul    r2, r3  ⇒ r4
sub    ri, r1  ⇒ r5
add    r4, r5  ⇒ r6
loadI  @A      ⇒ r7
loadAO r7, r6  ⇒ rAij
low-level linear code
In the source-level AST, the compiler can easily recognize that the computation is an array reference; examining the low-level code, we find that simple fact fairly well obscured. In a compiler that tries to perform data-dependence analysis on array subscripts to determine when two different references can touch the same memory location, the higher level of abstraction in the AST may prove valuable. Discovering the array reference is more difficult in the low-level code, particularly if the IR has been subjected to optimizations that move the individual operations to other parts of the procedure or eliminate them altogether. On the other hand, if the compiler is trying to optimize the code generated for the array address calculation, the low-level code exposes operations that remain implicit in the AST. In this case, the lower level of abstraction may result in more efficient code for the address calculation.
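To see exactly what the low-level sequence exposes, the sketch below replays it in Python, one assignment per operation. The function names `offset_a` and `load_a`, the parameters `low` and `extent`, and the modelling of memory as a Python list are assumptions made for illustration. As in the low-level code shown earlier, the offset is counted in elements and is not scaled by an element size.

```python
def offset_a(i, j, low=1, extent=10):
    """Offset of A[i,j] from @A, following the low-level code step by step."""
    r1 = low        # load the lower bound 1   => r1
    r2 = j - r1     # sub rj, r1               => r2
    r3 = extent     # load the extent 10       => r3
    r4 = r2 * r3    # mul r2, r3               => r4
    r5 = i - r1     # sub ri, r1               => r5
    r6 = r4 + r5    # add r4, r5               => r6
    return r6


def load_a(memory, base, i, j):
    """Model of loadAO r7, r6 => rAij: fetch the word at base + offset."""
    return memory[base + offset_a(i, j)]
```

Written out this way, the calculation is (j − 1) × 10 + (i − 1), so consecutive values of i address adjacent elements. Every intermediate value, such as `r4`, is a separate target for optimization, which is precisely what the single subscript node in the AST hides.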
The high level of abstraction is not an inherent property of tree-based IRs; it is implicit in the notion of a syntax tree. However, low-level expression trees have been used in many compilers to represent all the details of computations, such as the address calculation for A[i,j]. Similarly, linear IRs can have relatively high-level constructs. For example, many linear IRs have included an mvcl operation² to encode string-to-string copy as a single operation.
On some simple RISC machines, the best encoding of a string copy involves clearing out the entire register set and iterating through a tight loop that does a multi-word load followed by a multi-word store. Some preliminary logic is needed to deal with alignment and the special case of overlapping strings. By using a single IR instruction to represent this complex operation, the compiler writer can make it easier for the optimizer to move the copy out of a loop or to discover that the copy is redundant. In later stages of compilation, the single instruction is expanded, in place, into code that performs the copy or into a call to some system or library routine that performs the copy.
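The in-place expansion can be sketched as a small lowering pass. Everything concrete here is an assumption for illustration: the tuple encoding of operations, the `copy` operation with a byte count, the scratch register `rtmp`, the 4-byte word, and a byte count that is a multiple of the word size. A real expansion would also handle alignment and overlapping operands, and would use multi-word loads and stores where the target provides them.

```python
def expand_copies(ops, word=4):
    """Replace each high-level ('copy', dst, src, nbytes) with loads and stores.

    Other operations pass through unchanged. Alignment and overlapping
    operands, which a real expansion must handle, are ignored here, and
    nbytes is assumed to be a multiple of the word size.
    """
    lowered = []
    for op in ops:
        if op[0] != "copy":
            lowered.append(op)
            continue
        _, dst, src, nbytes = op
        for off in range(0, nbytes, word):
            lowered.append(("load", f"{src}+{off}", "rtmp"))
            lowered.append(("store", "rtmp", f"{dst}+{off}"))
    return lowered


ops = [("copy", "@s", "@t", 8), ("add", "r1", "r2", "r3")]
low = expand_copies(ops)
```

Before this pass runs, an optimizer sees one `copy` operation that it can move or delete as a unit; afterwards it would have to recognize the whole load and store sequence to do the same.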
Other properties of the IR should concern the compiler writer. The costs of generating and manipulating the IR will directly affect the compiler’s speed. The data space requirements of different IRs vary over a wide range; and, since the compiler typically touches all of the space that is allocated, data-space usually has a direct relationship to running time. Finally, the compiler writer should consider the expressiveness of the IR, that is, its ability to accommodate all of the facts that the compiler needs to record. This can include the sequence of actions that define the procedure, along with the results of static analysis, profiles of previous executions, and information needed by the debugger. All should be expressed in a way that makes clear their relationship to specific points in the IR.
6.3 Graphical IRs
Many IRs represent the code being translated as a graph. Conceptually, all the graphical IRs consist of nodes and edges. The difference between them lies in the relationship between the graph and the source language program, and in the restrictions placed on the form of the graph.
6.3.1 Syntax Trees
The syntax tree, or parse tree, is a graphical representation for the derivation, or parse, that corresponds to the input program. The following simple expression grammar defines binary operations +, −, ×, and ÷ over the domain of tokens number and id.
Goal   → Expr
Expr   → Expr + Term
       | Expr − Term
       | Term
Term   → Term × Factor
       | Term ÷ Factor
       | Factor
Factor → Number
       | Id
Simple Expression Grammar
[Figure: syntax tree for x × 2 + x × 2 × y]
The syntax tree in the figure shows the derivation that results from parsing the expression x × 2 + x × 2 × y. This tree represents the complete derivation, with a node for each grammar symbol (terminal or non-terminal) in the derivation. It provides a graphic demonstration of the extra work that the parser goes through to maintain properties like precedence. Minor transformations on the grammar can reduce the number of non-trivial reductions and eliminate some of these steps. (See Section 3.6.2.) Because the compiler must allocate memory for the nodes and edges, and must traverse the entire tree several times, the compiler writer might want to avoid generating and preserving any nodes and edges that are not directly useful. This observation leads to a simplified syntax tree.
6.3.2 Abstract Syntax Tree
The abstract syntax tree (AST) retains the essential structure of the syntax tree, but eliminates the extraneous nodes. The precedence and meaning of the expression remain intact.
          +
        /   \
       ×     ×
      / \   / \
     x   2 ×   y
          / \
         x   2
Abstract syntax tree for x × 2 + x × 2 × y
The AST is a near source-level representation. Because of its rough correspondence to the parse of the source text, it is easily built in the parser.
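As a rough sketch of how the extraneous nodes disappear, the Python fragment below collapses a parse tree for x × 2 + x × 2 × y into an AST. The encoding is an assumption made for illustration: a parse-tree node is a pair (symbol, list of children), a leaf is a string, and an AST node is a triple (operator, left, right). The function names `to_ast` and `size` are likewise hypothetical. A parser would more likely build the AST directly in its actions; the sketch builds the parse tree by hand only to show the relationship between the two trees.

```python
def to_ast(node):
    """Collapse a parse tree for the simple expression grammar into an AST."""
    if isinstance(node, str):          # terminal: keep the lexeme
        return node
    _symbol, children = node
    if len(children) == 1:             # chain such as Expr -> Term: drop the node
        return to_ast(children[0])
    left, op, right = children         # binary production: keep only the operator
    return (op, to_ast(left), to_ast(right))


def size(tree):
    """Count nodes in either encoding (parse tree or AST)."""
    if isinstance(tree, str):
        return 1
    children = tree[1] if isinstance(tree[1], list) else tree[1:]
    return 1 + sum(size(child) for child in children)


# Parse tree for x × 2 + x × 2 × y, built by hand from the grammar.
x_term = ("Term", [("Factor", ["x"])])
x_times_2 = ("Term", [x_term, "×", ("Factor", ["2"])])
parse = ("Goal", [("Expr", [
    ("Expr", [x_times_2]),
    "+",
    ("Term", [x_times_2, "×", ("Factor", ["y"])]),
])])

ast = to_ast(parse)
```

Comparing `size(parse)` with `size(ast)` gives a direct measure of how many nodes the parse tree carries only to record the derivation.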
ASTs have been used in many practical compiler systems. Source-to-source systems, including programming environments and automatic parallelization tools, generally rely on an AST from which the source code can be easily regenerated. (This process is often called “pretty-printing;” it produces a clean source text by performing an inorder treewalk on the AST and printing each node as it is visited.) The S-expressions found in Lisp and Scheme implementations are, essentially, ASTs.
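A minimal pretty-printer along these lines might look like the Python sketch below. The tuple encoding of AST nodes, the `PRECEDENCE` table, and the function name `pretty` are assumptions for illustration. The walk is inorder (left subtree, operator, right subtree); the only addition is that parentheses are re-inserted where the tree's shape would otherwise be lost in the flat text.

```python
# Assumed precedence levels for the operators of the simple expression grammar.
PRECEDENCE = {"+": 1, "-": 1, "×": 2, "÷": 2}


def pretty(node, parent_prec=0):
    """Regenerate source text with an inorder walk: left, operator, right."""
    if isinstance(node, str):
        return node
    op, left, right = node
    prec = PRECEDENCE[op]
    # The right operand is printed at prec + 1 so that left-associative
    # operators re-parse to the same tree: a - (b - c) keeps its parentheses.
    text = f"{pretty(left, prec)} {op} {pretty(right, prec + 1)}"
    return f"({text})" if prec < parent_prec else text


ast = ("+", ("×", "x", "2"), ("×", ("×", "x", "2"), "y"))
source = pretty(ast)
```

Because the AST discards parentheses, the printer has to decide where they are needed; a tool that must reproduce the programmer's original text exactly would need to record more than this encoding keeps.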
Even when the AST is used as a near-source level representation, the specific