
Symbol Tables

In document Engineering A Compiler pdf (Page 163-178)

machines, with the notion that a single IR can bind together all of the various components.

The more ambitious of these projects have foundered on the complexity of the IR. For this idea to succeed, the separation between front end, back end, and IR must be complete. The front end must encode all language-related knowledge and must encode no machine-specific knowledge. The back end must handle all machine-specific issues and have no knowledge about the source language. The IR must encode all the facts passed between front end and back end, and must represent all the features of all the languages in an appropriate way. In practice, machine-specific issues arise in front ends; language-specific issues find their way into back ends; and “universal” IRs become too complex to manipulate and use. Some systems have been built using this model; the successful ones seem to be characterized by

• IRs that are near the abstraction level of the target machine

• target machines that are reasonably similar

• languages that have a large core of common features

Under these conditions, the problems in the front end, back end, and IR remain manageable. Several commercial compiler systems fit this description; they compile languages such as Fortran, C, and C++ to a set of similar architectures.

6.7 Symbol Tables

As part of translation, the compiler derives information about the various entities that the program manipulates. It must discover and store many distinct kinds of information. It will encounter a variety of names that must be recorded: names for variables, defined constants, procedures, functions, labels, structures, files, and computer-generated temporaries. For a given textual name, it might need a data type, the name and lexical level of its declaring procedure, its storage class, and a base address and offset in memory. If the object is an aggregate, the compiler needs to record the number of dimensions and the upper and lower bounds for each dimension. For records or structures, the compiler needs a list of the fields, along with the relevant information on each field. For functions and procedures, the compiler might need to know the number of parameters and their types, as well as any returned values; a more sophisticated translation might record information about the modification and use of parameters and externally visible variables.
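As a rough illustration of the kinds of facts listed above, the following C sketch collects them into one record type. Every name in it (SymbolRecord, TypeKind, StorageClass, Bounds, element_count) is hypothetical and chosen only for this example; a real compiler would likely split these fields by kind of entity rather than keep them all in one structure.

```c
#include <stddef.h>

/* Hypothetical enumerations; a real compiler would have richer ones. */
typedef enum { TY_INT, TY_FLOAT, TY_ARRAY, TY_RECORD, TY_PROC } TypeKind;
typedef enum { SC_GLOBAL, SC_STATIC, SC_LOCAL } StorageClass;

typedef struct { int lower, upper; } Bounds;   /* bounds of one dimension */

typedef struct SymbolRecord {
    const char          *name;            /* textual name */
    TypeKind             type;            /* data type */
    const char          *declaring_proc;  /* name of the declaring procedure */
    int                  lexical_level;   /* lexical level of the declaration */
    StorageClass         storage_class;
    long                 base_address;
    long                 offset;
    int                  num_dims;        /* aggregates: number of dimensions */
    Bounds              *bounds;          /* aggregates: one entry per dimension */
    int                  num_fields;      /* records: number of fields */
    struct SymbolRecord *fields;          /* records: per-field information */
    int                  num_params;      /* procedures: number of parameters */
    struct SymbolRecord *params;          /* procedures: per-parameter information */
} SymbolRecord;

/* Number of elements implied by the recorded bounds (1 for a scalar). */
long element_count(const SymbolRecord *r)
{
    long n = 1;
    for (int d = 0; d < r->num_dims; d++)
        n *= (long)(r->bounds[d].upper - r->bounds[d].lower + 1);
    return n;
}
```

The helper element_count shows one way a later pass might use the stored bounds, for example to size the storage for an array, without revisiting the declaration.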

Either the IR must store all this information, or the compiler must re-derive it on demand. For the sake of efficiency, most compilers record facts rather than recompute them. (The one common exception to this rule occurs when the IR is written to external storage. Such I/O activity is expensive relative to computation, and the compiler makes a complete pass over the IR when it reads the information. Thus, it can be cheaper to recompute information than to write it to external media and read it back.)

These facts can be recorded directly in the IR. For example, a compiler that builds an AST might record information about variables as annotations (or attributes) on the nodes representing each variable’s declaration. The advantage of this approach is that it uses a single representation for the code being compiled. It provides a uniform access method and a single implementation. The disadvantage of this approach is that the single access method may be inefficient: navigating the AST to find the appropriate declaration has its own costs. To eliminate this inefficiency, the compiler can thread the IR so that each reference has a link back to its declaration. This adds space to the IR and overhead to the IR-builder. (The next step is to use a hash table to hold the declaration link for each variable during IR construction, in effect creating a symbol table.)

The alternative, as we saw in Chapter 4, is to create a central repository for these facts and to provide efficient access to it. This central repository, called a symbol table, becomes an integral part of the compiler’s IR. The symbol table localizes information derived from distant parts of the source code; it simplifies the design and implementation of any code that must refer to information derived earlier in compilation. It avoids the expense of searching the IR to find the portion that represents a variable’s declaration; using a symbol table often eliminates the need to represent the declarations directly in the IR. (An exception occurs in source-to-source translation. The compiler may build a symbol table for efficiency and preserve the declaration syntax in the IR so that it can produce an output program that closely resembles the input program.) It eliminates the overhead of making each reference contain a pointer back to the declaration. It replaces both of these with a computed mapping from the textual name back to the stored information. Thus, in some sense, the symbol table is simply an efficiency hack.

Throughout this text, we refer to “the symbol table.” In fact, the compiler may include several distinct, specialized symbol tables. These include variable tables, label tables, tables of constants, and reserved keyword tables. A careful implementation might use the same access methods for all these tables. (The compiler might also use a hash table as an efficient representation for some of the sparse graphs built in code generation and optimization.)

Symbol table implementation requires attention to detail. Because nearly every aspect of translation refers back to the symbol table, efficiency of access is critical. Because the compiler cannot predict, before translation, the number of names that it will encounter, expanding the symbol table must be both graceful and efficient. This section provides a high-level treatment of the issues that arise in designing a symbol table. It presents the compiler-specific aspects of symbol table design and use. For deeper implementation details and design alternatives, the reader is referred to Section B.4.

6.7.1 Hash Tables

In implementing a symbol table, the compiler writer must choose a strategy for organizing and searching the table. Myriad schemes for organizing lookup tables exist; we will focus on tables indexed with a “hash function.” Hashing, as this technique is called, has an expected-case O(1) cost for both insertion and lookup. With careful engineering, the implementor can make the cost of expanding the table and of preserving it on external media quite reasonable. For the purposes of this chapter, we assume that the symbol table is organized as a simple hash table. Implementation techniques for hash tables have been widely studied and taught.

[Figure 6.8: Hash-table implementation (the concept). A table with slots numbered 0 through 9 that holds the names a, b, and c; an arrow labeled h(d) points at the slot for d.]

Hash tables are conceptually elegant. They use a hash function, h, to map names into small integers, and take the small integer as an index into the table. With a hashed symbol table, the compiler stores all the information that it derives about the name n in the table at h(n). Figure 6.8 shows a simple ten-slot hash table. It is a vector of records, each record holding the compiler-generated description of a single name. The names a, b, and c have already been inserted. The name d is being inserted, at h(d) = 2.

The primary reason for using hash tables is to provide a constant-time lookup, keyed by a textual name. To achieve this, h must be inexpensive to compute, and it must produce a unique small integer for each name. Given an appropriate function h, accessing the record for n requires computing h(n) and indexing into the table at h(n). If h maps two or more symbols to the same small integer, a “collision” occurs. (In Figure 6.8, this would occur if h(d) = 3.) The implementation must handle this situation gracefully, preserving both the information and the lookup time. In this section, we assume that h is a perfect hash function; that is, it never produces a collision. Furthermore, we assume that the compiler knows, in advance, how large to make the table. Section B.4 describes hash-table implementation in more detail, including hash functions, collision handling, and schemes for expanding a hash table.
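To make the mapping from names to small integers concrete, here is a minimal C sketch of one possible h. The function name hash_name, the multiplier 31, and the constant TABLE_SIZE are assumptions made for this example; the text does not prescribe a particular hash function.

```c
#define TABLE_SIZE 10   /* ten slots, matching the table in Figure 6.8 */

/* Hypothetical hash function h: folds the characters of a name into a
   small integer in the range 0 .. TABLE_SIZE-1. */
unsigned hash_name(const char *name)
{
    unsigned h = 0;
    while (*name != '\0') {
        h = h * 31u + (unsigned char)*name;   /* mix in the next character */
        name++;
    }
    return h % TABLE_SIZE;                    /* reduce to a table index */
}
```

This function is cheap to compute but it is not perfect in the sense assumed in this section: in an ASCII character set, hash_name("a") and hash_name("k") both yield 7, which is exactly the kind of collision that a full implementation must handle.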

6.7.2 Building a Symbol Table

The symbol table defines two interface routines for the rest of the compiler.

LookUp(name) returns the record stored in the table at h(name) if one exists. Otherwise, it returns a value indicating that name was not found.

Insert(name, record) stores the information in record in the table at h(name). It may expand the table to accommodate the record for name.

Digression: An Alternative to Hashing

Hashing is the most widely used method for organizing a compiler’s symbol table. Multiset discrimination is an interesting alternative that eliminates any possibility of worst-case behavior. The critical insight behind this technique is that the index can be constructed off-line in the scanner.

To use multiset discrimination for the symbol table, the compiler writer must take a different approach to scanning. Instead of processing the input incrementally, the compiler scans the entire program to find the complete set of identifiers. As it discovers each identifier, it creates a tuple ⟨name, position⟩, where name is the text of the identifier and position is its ordinal position in the list of all tokens. It enters all the tuples into a large multiset.

The next step lexicographically sorts the multiset. In effect, this creates a set of bags, one per identifier. Each bag holds the tuples for all of the occurrences of its identifier. Since each tuple refers back to a specific token, through its position value, the compiler can use the sorted multiset to rewrite the token stream. It makes a linear scan over the multiset, processing each bag in order. The compiler allocates a symbol table index for the entire bag, then rewrites the tokens to include that index. This augments the identifier tokens with their symbol table index. If the compiler needs a textual lookup function, the resulting table is ordered alphabetically for a binary search.

The price for using this technique is an extra pass over the token stream, along with the cost of the lexicographic sort. The advantages, from a complexity perspective, are that it avoids any possibility of hashing’s worst case behavior, and that it makes the initial size of the symbol table obvious, even before parsing. This same technique can be used to replace a hash table in almost any application where an off-line solution will work.
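The sort-and-scan steps of the digression can be sketched in C as follows. This is only an illustration under simplifying assumptions: the tuples are already collected in an array, the standard library's qsort stands in for the lexicographic sort, and the "rewritten token stream" is just an array of indices. The names Tuple, compare_tuples, and assign_indices are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *name;      /* text of the identifier */
    int         position;  /* ordinal position in the list of all tokens */
} Tuple;

/* Lexicographic order: by name first, then by position. */
static int compare_tuples(const void *a, const void *b)
{
    const Tuple *x = a, *y = b;
    int c = strcmp(x->name, y->name);
    if (c != 0)
        return c;
    return (x->position > y->position) - (x->position < y->position);
}

/* Sorts tuples[0..n-1], then scans them once. Each bag (run of equal
   names) gets one symbol-table index, which is stored at
   index_of_token[position] for every occurrence in the bag.
   Returns the number of distinct identifiers. */
int assign_indices(Tuple *tuples, int n, int *index_of_token)
{
    int current = -1;
    qsort(tuples, (size_t)n, sizeof tuples[0], compare_tuples);
    for (int i = 0; i < n; i++) {
        /* A change of name marks the start of a new bag. */
        if (i == 0 || strcmp(tuples[i].name, tuples[i - 1].name) != 0)
            current++;
        index_of_token[tuples[i].position] = current;
    }
    return current + 1;
}
```

The return value is the number of bags, which gives the initial size of the symbol table before parsing begins. Because the bags are visited in sorted order, the indices follow the alphabetical order of the names.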

The compiler needs separate functions for LookUp and Insert. (The alternative would have LookUp insert the name when it fails to find it in the table.) This ensures, for example, that a LookUp of an undeclared variable will fail, a property useful for detecting a violation of the declare-before-use rule in syntax-directed translation schemes, or for supporting nested lexical scopes.

This simple interface fits directly into the ad hoc syntax-directed translation scheme for building a symbol table, sketched in Section 4.4.3. In processing declaration syntax, the compiler builds up a set of attributes for the variable. When the parser reduces by a production that has a specific variable name, it can enter the name and attributes into the symbol table using Insert. If a variable name can appear in only one declaration, the parser can call LookUp first to detect a repeated use of the name. When the parser encounters a variable name outside the declaration syntax, it uses LookUp to obtain the appropriate information from the symbol table. LookUp fails on any undeclared name. The compiler writer, of course, may need to add functions to initialize the table, to store it and retrieve it using external media, and to finalize it. For a language with a single name space, this interface suffices.
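The two-routine interface and its use during declaration processing might look like the following C sketch. It departs from the text in ways that should be read as assumptions: the table is passed as an explicit argument, its size NSLOTS is fixed, and collisions are handled by chaining rather than assumed away by a perfect hash function. The Record layout and the Declare helper are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define NSLOTS 64                  /* assumed fixed size; a real table would grow */

typedef struct Record {
    const char    *name;           /* textual name, the lookup key */
    int            type;           /* placeholder for compiler-derived facts */
    struct Record *next;           /* chain of names that share a slot */
} Record;

typedef struct { Record *slot[NSLOTS]; } SymbolTable;

static unsigned h(const char *s)
{
    unsigned v = 0;
    for (; *s != '\0'; s++)
        v = v * 31u + (unsigned char)*s;
    return v % NSLOTS;
}

/* LookUp: return the record for name, or NULL if name was not found. */
Record *LookUp(SymbolTable *t, const char *name)
{
    for (Record *r = t->slot[h(name)]; r != NULL; r = r->next)
        if (strcmp(r->name, name) == 0)
            return r;
    return NULL;
}

/* Insert: store rec in the table at h(name). */
void Insert(SymbolTable *t, const char *name, Record *rec)
{
    unsigned i = h(name);
    rec->name = name;
    rec->next = t->slot[i];
    t->slot[i] = rec;
}

/* Declare: what the parser might do on reducing a declaration.
   Returns 0 on success, -1 if the name was already declared. */
int Declare(SymbolTable *t, const char *name, Record *rec)
{
    if (LookUp(t, name) != NULL)
        return -1;                 /* repeated declaration detected */
    Insert(t, name, rec);
    return 0;
}
```

Because LookUp never inserts, a reference to an undeclared name yields NULL, and a second Declare of the same name is rejected. Both checks rely on keeping the two routines separate.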


6.7.3 Handling Nested Lexical Scopes

Few, if any, programming languages provide a single name space. Typically, the programmer manages multiple name spaces. Often, some of these name spaces are nested inside one another. For example, a C programmer has four distinct kinds of name space.

1. A name can have global scope. Any global name is visible in any procedure where it is declared. All declarations of the same global name refer to a single instance of the variable in storage.

2. A name can have a file-wide scope. Such a name is declared using the static attribute outside of a procedure body. A static variable is visible to every procedure in the file containing the declaration. If the name is declared static in multiple files, those distinct declarations create distinct run-time instances.

3. A name can be declared locally within a procedure. The scope of the name is the procedure itself. It cannot be referenced by name outside the declaring procedure. (Of course, the declaring procedure can take its address and store it where other procedures can reference the address. This may produce wildly unpredictable results if the procedure has completed execution and freed its local storage.)

4. A name can be declared within a block, denoted by a pair of curly braces. While this feature is not often used by programmers, it is widely used by macros to declare a temporary location. A variable declared in this way is only visible inside the declaring block.

Each distinct name space is called a scope. Language definitions include rules that govern the creation and accessibility of scopes. Many programming languages include some form of nested lexical scopes.

Figure 6.9 shows some fragments of C code that demonstrate its various scopes. The level zero scope contains names declared as global or file-wide static. Both example and x are global, while w is a static variable with file-wide scope. Procedure example creates its own local scope, at level one. The scope contains a and b, the procedure’s two formal parameters, and its local variable c. Inside example, curly braces create two distinct level two scopes, denoted as level 2a and level 2b. Level 2a declares two variables, b and z. This new incarnation of b overrides the formal parameter b declared in the level one scope by example. Any reference to b inside the block that created 2a names the local variable rather than the parameter at level one. Level 2b declares two variables, a and x. Each overrides a variable declared in an outer scope. Inside level 2b, another block creates level three and declares c and x.

All of this context goes into creating the name space in which the assignment statement executes. Inside level three, the following names are visible: a from 2b, b from one, c from three, example from zero, w from zero, and x from three. No incarnation of the name z is active, since 2a ends before three begins. Since example at level zero is visible inside level three, a recursive call on example can be made. Adding the declaration “int example” to level 2b or level three would hide the procedure’s name from level three and prevent such a call.

    static int w;          /* level 0 */
    int x;

    void example(a, b)
    int a, b;              /* level 1 */
    {
        int c;
        {
            int b, z;      /* level 2a */
            ...
        }
        {
            int a, x;      /* level 2b */
            ...
            {
                int c, x;  /* level 3 */
                b = a + b + c + x;
            }
        }
    }

    Level   Names
    0       w, x, example
    1       a, b, c
    2a      b, z
    2b      a, x
    3       c, x

Figure 6.9: Lexical scoping example

To compile a program that contains nested scopes, the compiler must map each variable reference back to a specific declaration. In the example, it must distinguish between the multiple definitions of a, b, c, and x to select the relevant declarations for the assignment statement. To accomplish this, it needs a symbol table structure that can resolve a reference to the lexically most recent definition. At compile-time, it must perform the analysis and emit the code necessary to ensure addressability for all variables referenced in the current procedure. At run-time, the compiled code needs a scheme to find the appropriate incarnation of each variable. The run-time techniques required to establish addressability for variables are described in Chapter 8.

The remainder of this section describes the extensions necessary to let the compiler convert a name like x to a static distance coordinate, a ⟨level, offset⟩ pair, where level is the lexical level at which x’s declaration appears and offset is an integer address that uniquely identifies the storage set aside for x. These same techniques can also be useful in code optimization. For example, the DVNT algorithm for discovering and removing redundant computations relies on a scoped hash table to achieve efficiency on extended basic blocks (see Section 12.1).

    current level -> level 3:  c, x
                     level 2:  a, x
                     level 1:  a, b, c
                     level 0:  w, x, exa...

Figure 6.10: Simple “sheaf-of-tables” implementation

The Concept. To manage nested scopes, the parser must change, slightly, its approach to symbol table management. Each time the parser enters a new lexical scope, it can create a new symbol table for that scope. As it encounters declarations in the scope, it enters the information into the current table. Insert operates on the current symbol table. When it encounters a variable reference, LookUp must first search the table for the current scope. If the current table does not hold a declaration for the name, it checks the table for the surrounding scope. By working its way through the symbol tables for successively lower lexical levels, it will either find the most recent declaration for the name, or fail in the outermost scope, indicating that the variable has no declaration visible from the current scope.
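The search through successively lower lexical levels can be sketched in C as a chain of per-scope tables. To keep the example short, each table here is a small array searched linearly, standing in for the hash table the text assumes, and the offset reported is simply the name's slot within its scope rather than a real storage address. The names Scope, EnterScope, and MAX_NAMES, and the extended signatures of Insert and LookUp, are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 16   /* assumed per-scope capacity, for illustration only */

typedef struct Scope {
    struct Scope *parent;             /* table for the surrounding scope */
    int           level;              /* lexical level of this scope */
    int           count;              /* number of names declared so far */
    const char   *names[MAX_NAMES];
} Scope;

/* Create the table for a new scope nested in parent (NULL for level zero). */
void EnterScope(Scope *s, Scope *parent)
{
    s->parent = parent;
    s->level  = (parent == NULL) ? 0 : parent->level + 1;
    s->count  = 0;
}

/* Insert operates on the current scope only. Returns the slot used as the
   name's offset, or -1 if the scope is full. */
int Insert(Scope *current, const char *name)
{
    if (current->count == MAX_NAMES)
        return -1;
    current->names[current->count] = name;
    return current->count++;
}

/* LookUp searches the current scope, then each surrounding scope in turn.
   On success it stores the <level, offset> pair and returns 1; it returns 0
   if no declaration is visible from the current scope. */
int LookUp(const Scope *current, const char *name, int *level, int *offset)
{
    for (const Scope *s = current; s != NULL; s = s->parent)
        for (int i = 0; i < s->count; i++)
            if (strcmp(s->names[i], name) == 0) {
                *level  = s->level;
                *offset = i;
                return 1;
            }
    return 0;
}
```

If the scopes of Figure 6.9 are built with these routines and level 2a is left off the chain once its block closes, a LookUp of b from level three finds the parameter at level one, and a LookUp of z fails. Leaving a scope needs no work in this sketch beyond making the parent table current again.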

Figure 6.10 shows the symbol table built in this fashion for our example program, at the point where the parser has reached the assignment statement. When the compiler invokes the modified LookUp function for the name b, it will fail in level three, fail in level two, and find the name in level one. This corresponds exactly to our understanding of the program: the most recent declaration for b is as a parameter to example, in level one. Since the first block at level two, block 2a, has already closed, its symbol table is not on the search
