The Importance of Code Optimization - [Steven S. Muchnick] Advanced Compiler Design And

Generally, the result of using a one-pass compiler structured as described in Section 1.1 is object code that executes much less efficiently than it might if more effort were expended in its compilation. For example, it might generate code on an expression-by-expression basis, so that the C code fragment in Figure 1.4(a) might result in the SPARC assembly code in Figure 1.4(b), while it could be turned into the much more efficient code in Figure 1.4(c) if the compiler optimized it, including allocating the variables to registers. Even if the variables are not allocated to registers, at least the redundant load of the value of c could be eliminated. In a typical early one-scalar SPARC implementation, the code in Figure 1.4(b) requires 10 cycles to execute, while that in Figure 1.4(c) requires only two cycles.

Among the most important optimizations, in general, are those that operate on loops (such as moving loop-invariant computations out of them and simplifying or eliminating computations on induction variables), global register allocation, and instruction scheduling, all of which are discussed (along with many other optimizations) in Chapters 12 through 20.
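The two loop optimizations named above can be illustrated with a small hand-written C sketch. It is not taken from the text: the function names sum_naive and sum_optimized and their parameters are hypothetical, and the second function only shows by hand what a compiler might produce after moving a loop-invariant computation out of the loop and simplifying the computation on the induction variable.

```c
#include <assert.h>
#include <stddef.h>

/* Naive form: the product scale * offset and the index i * stride are
   recomputed on every iteration, although the product never changes. */
long sum_naive(const long *a, size_t n, size_t stride, long scale, long offset)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i * stride] + scale * offset;
    return sum;
}

/* Hand-optimized form of the same loop. */
long sum_optimized(const long *a, size_t n, size_t stride, long scale, long offset)
{
    assert(a != NULL);
    const long k = scale * offset;  /* loop-invariant computation, hoisted */
    long sum = 0;
    size_t idx = 0;                 /* stands in for i * stride */
    for (size_t i = 0; i < n; i++) {
        sum += a[idx] + k;
        idx += stride;              /* an addition replaces the multiplication */
    }
    return sum;
}
```

Both functions return the same value for the same arguments; only the amount of work done per iteration differs.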

However, many kinds of optimizations may be relevant to a particular program, and which ones are relevant depends on the structure and details of that program.

(a)

    int a, b, c, d;
    c = a + b;
    d = c + 1;

(b)

    ldw  a,r1
    ldw  b,r2
    add  r1,r2,r3
    stw  r3,c
    ldw  c,r3
    add  r3,1,r4
    stw  r4,d

(c)

    add  r1,r2,r3
    add  r3,1,r4

FIG. 1.4 A C code fragment in (a) with naive SPARC code generated for it in (b) and optimized code in (c).

A highly recursive program, for example, may benefit significantly from tail-call optimization (see Section 15.1), which turns recursions into loops, and may only then benefit from loop optimizations. On the other hand, a program with only a few loops but with very large basic blocks within them may derive significant benefit from loop distribution (which splits one loop into several, with each loop body doing part of the work of the original one) or register allocation, but only modest improvement from other loop optimizations. Similarly, procedure integration or inlining, i.e., replacing subroutine calls with copies of their bodies, not only decreases the overhead of calling them but also may enable any or all of the intraprocedural optimizations to be applied to the result, with marked improvements that would not have been possible without inlining or (the typically much more expensive) techniques of interprocedural analysis and optimization (see Chapter 19). On the other hand, inlining usually increases code size, and that may have negative effects on performance, e.g., by increasing cache misses. As a result, it is desirable to measure the effects of the provided optimization options in a compiler and to select the ones that provide the best performance for each program.
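To make the tail-call transformation concrete, the following C sketch shows a tail-recursive function and the loop that tail-call optimization would effectively produce from it. The names gcd_recursive and gcd_loop are hypothetical, and Euclid's algorithm is chosen only as a convenient illustration; it is not an example drawn from Section 15.1.

```c
#include <assert.h>

/* Tail-recursive form: the recursive call is the last action performed,
   so nothing in the current activation is needed after it returns. */
unsigned long gcd_recursive(unsigned long a, unsigned long b)
{
    assert(a != 0 || b != 0);        /* gcd(0, 0) is left undefined here */
    if (b == 0)
        return a;
    return gcd_recursive(b, a % b);  /* tail call */
}

/* Loop form: the tail call becomes an update of the parameters followed
   by a jump back to the top of the function body. */
unsigned long gcd_loop(unsigned long a, unsigned long b)
{
    assert(a != 0 || b != 0);
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```

Once the recursion has been turned into the loop in gcd_loop, it becomes a candidate for the loop optimizations mentioned earlier, which is the ordering effect the paragraph describes.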

These and other optimizations can make large differences in the performance of programs—frequently a factor of two or three and, occasionally, much more, in execution time.

An important design principle for large software projects, including compilers, is to design and construct programs consisting of small, functionally distinct modules and make each module as simple as one reasonably can, so that it can be easily designed, coded, understood, and maintained. Thus, it is entirely possible that unoptimized compilation does very local code generation, producing code similar to that in Figure 1.4(b), and that optimization is necessary to produce the much faster code in Figure 1.4(c).

1.4 Structure of Optimizing Compilers

A compiler designed to produce fast object code includes optimizer components. There are two main models for doing so, as shown in Figure 1.5(a) and (b).⁴ In Figure 1.5(a), the source code is translated to a low-level intermediate code, such as our LIR (Section 4.6.3), and all optimization is done on that form of code; we call this the low-level model of optimization. In Figure 1.5(b), the source code is translated to a medium-level intermediate code, such as our MIR (Section 4.6.1), and optimizations that are largely architecture-independent are done on it; then the code is translated to a low-level form and further optimizations that are mostly architecture-dependent are done on it; we call this the mixed model of optimization. In either model, the optimizer phase(s) analyze and transform the intermediate code to eliminate unused generality and to take advantage of faster ways to perform given tasks. For example, the optimizer might determine that a computation performed in a loop produces the same result every time it is executed, so that moving the computation out of the loop would cause the program to execute faster. In the mixed model, the so-called postpass optimizer performs low-level optimizations, such as taking advantage of machine idioms and the target machine’s addressing modes, while this would be done by the unitary optimizer in the low-level model.

4. Again, lexical analysis, parsing, semantic analysis, and either translation or intermediate-code generation might be performed in a single step.

[Figure 1.5: panels (a) and (b), each taking a string of characters as input; diagram not reproduced.]

FIG. 1.5 Two high-level structures for an optimizing compiler: (a) the low-level model, with all optimization done on a low-level intermediate code, and (b) the mixed model, with optimization divided into two phases, one operating on each of a medium-level and a low-level intermediate code.
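The two models can also be summarized as ordered lists of phases. The following C sketch is a reconstruction from the prose only, not a copy of Figure 1.5; the array names, the helper count_optimizer_phases, and the label of the last stage in the low-level model are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Low-level model: a single optimizer working on low-level code. */
const char *const low_level_model[] = {
    "lexical analysis",
    "parsing",
    "semantic analysis",
    "translator (to low-level intermediate code)",
    "optimizer (on low-level intermediate code)",
    "final code emission",  /* stage name assumed, not given in the text */
};

/* Mixed model: optimization split across two intermediate codes. */
const char *const mixed_model[] = {
    "lexical analysis",
    "parsing",
    "semantic analysis",
    "intermediate-code generator (to medium-level intermediate code)",
    "optimizer (on medium-level intermediate code)",
    "code generator (to low-level intermediate code)",
    "postpass optimizer (on low-level intermediate code)",
};

/* Counts the phases whose description mentions an optimizer. */
size_t count_optimizer_phases(const char *const *phases, size_t n)
{
    assert(phases != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (strstr(phases[i], "optimizer") != NULL)
            count++;
    return count;
}
```

Counting the optimizer entries gives one for the low-level model and two for the mixed model, which is the structural difference between the two.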

A mixed-model optimizer is likely to be more easily adapted to a new architecture and may be more efficient at compilation time, while a low-level-model optimizer is less likely to be easily ported to another architecture, unless the second architecture resembles the first very closely, for example, if it is an upward-compatible extension of the first. The choice between the mixed and low-level models is largely one of investment and development focus.

The mixed model is used in Sun Microsystems’ compilers for SPARC (see Section 21.1), Digital Equipment Corporation’s compilers for Alpha (see Section 21.3), Intel’s compilers for the 386 architecture family (see Section 21.4), and Silicon Graphics’ compilers for MIPS. The low-level model is used in IBM’s compilers for POWER and PowerPC (see Section 21.2) and Hewlett-Packard’s compilers for PA-RISC. The low-level model has the advantages of making it easier to avoid phase-ordering problems in optimization and of exposing all address computations to the entire optimizer. For these and other reasons, we recommend using the low-level model in building an optimizer from scratch, unless there are strong expectations that it will be ported to a significantly different architecture later. Nevertheless, in the text we describe optimizations that might be done on either medium- or low-level code as being done on medium-level code. They can easily be adapted to work on low-level code.
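What it means for address computations to be exposed to the optimizer can be suggested by writing the same loop at two levels of abstraction in C. This is only an analogy under stated assumptions: the function names are hypothetical, and the second function mimics by hand the explicit address arithmetic that a low-level intermediate code would contain; it is not an excerpt of MIR or LIR.

```c
#include <assert.h>
#include <stddef.h>

/* Medium-level view: subscripting is a single abstract operation, so the
   address arithmetic behind a[i] and b[i] is not visible. */
void add_arrays_medium(int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

/* Low-level view: each element address is spelled out as base + offset,
   so the shared offset is an ordinary value that can be computed once. */
void add_arrays_low(int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    char *base_a = (char *)a;
    const char *base_b = (const char *)b;
    for (size_t i = 0; i < n; i++) {
        size_t offset = i * sizeof(int);                     /* shared by both accesses */
        int *elem_a = (int *)(base_a + offset);              /* address of a[i] */
        const int *elem_b = (const int *)(base_b + offset);  /* address of b[i] */
        *elem_a = *elem_a + *elem_b;
    }
}
```

In the second form the multiplication by the element size is a visible computation on the induction variable, so it is open to the same optimizations as any other arithmetic.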

As mentioned above, Sun’s and Hewlett-Packard’s compilers, for example, represent contrasting approaches in this regard. The Sun global optimizer was originally written for the Fortran 77 compiler for the Motorola MC68010-based Sun-2 series of workstations and was then adapted to the other compilers that shared a common intermediate representation, with the certain knowledge that it would need to be ported to future architectures. It was then ported to the very similar MC68020-based Sun-3 series, and more recently to SPARC and SPARC-V9. While considerable investment has been devoted to making the optimizer very effective for SPARC in particular, by migrating some optimizer components from before code generation to after it, much of it remains comparatively easy to port to a new architecture.

The Hewlett-Packard global optimizer for PA-RISC, on the other hand, was designed as part of a major investment to unify most of the company’s computer products around a single new architecture. The benefits of having a single optimizer and the unification effort amply justified designing a global optimizer specifically tailored to PA-RISC.

Unless an architecture is intended only for very special uses, e.g., as an embedded processor, it is insufficient to support only a single programming language for it. This makes it desirable to share as many of the compiler components for an architecture as possible, both to reduce the effort of writing and maintaining them and to derive the widest benefit from one’s efforts at improving the performance of compiled code. Whether the mixed or the low-level model of optimization is used makes no difference in this instance. Thus, all the real compilers we discuss in Chapter 21 are members of compiler suites for a particular architecture that share multiple components, including code generators, optimizers, assemblers, and possibly other components, but that have distinct front ends to deal with the lexical, syntactic, and static-semantic differences among the supported languages.

In other cases, compilers for the same language are provided by a software vendor for multiple architectures. Here we can expect to see the same front end used, and usually the same optimizer components, but different code generators and possibly additional optimizer phases to deal with the particular features of each architecture. The mixed model of optimization is the more appropriate one in this case. Often the code generators are structured identically, independent of the target machine, in a way appropriate either to the source language or, more frequently, the common intermediate code, and differ only in the instructions generated for each target.

Yet another option is the use of a preprocessor to transform programs in one language into equivalent programs in another language and to compile them from there. This is how the early implementations of C++ worked, using a program named cfront to translate C++ code to C code, performing, in the process (among other things), what has come to be known as name mangling: the transformation of readable C++ identifiers into virtually unreadable, but compilable, C identifiers. Another example of this is the use of a preprocessor to transform Fortran programs to ones that can take better advantage of vector or multiprocessing systems. A third example is to translate object code for an as-yet-unbuilt processor into code for an existing machine so as to emulate the prospective one.
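The idea behind name mangling can be sketched in C as follows. The overloaded C++ function twice is hypothetical, and the suffixes used here are an assumed encoding chosen for illustration; they should not be read as the exact identifiers any particular translator emitted.

```c
#include <assert.h>

/* Hypothetical C++ source, shown as a comment because this block is C:
 *
 *     int    twice(int x);
 *     double twice(double x);
 *
 * C has no overloading, so each overload needs a distinct identifier.
 * Here the parameter type is encoded in a suffix (an assumed scheme). */

int twice__Fi(int x)        /* stands for twice(int) */
{
    return 2 * x;
}

double twice__Fd(double x)  /* stands for twice(double) */
{
    return 2.0 * x;
}

/* A C++ expression such as twice(21) + (int)twice(1.5) would be emitted
   with each call resolved to the matching mangled name. */
int translated_call_site(void)
{
    int result = twice__Fi(21) + (int)twice__Fd(1.5);
    assert(result == 45);
    return result;
}
```

Overload resolution happens in the translator, so the C compiler that receives this code sees only ordinary, distinct function names.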

FIG. 1.6 Adding data-cache optimization to an optimizing compiler. The continuation is to either the translator in the low-level model in Figure 1.5(a) or to the intermediate-code generator in the mixed model in Figure 1.5(b).

One issue we have ignored so far in our discussion of optimizer structure and its place in a compiler or compiler suite is that some optimizations, particularly the data-cache-related ones discussed in Section 20.4, are usually most effective when applied to a source-language or high-level intermediate form, such as our HIR (Section 4.6.2). This can be done as the first step in the optimization process, as shown in Figure 1.6, where the final arrow goes to the translator in the low-level model and to the intermediate-code generator in the mixed model. An alternative approach, used in the IBM compilers for POWER and PowerPC, first translates to a low-level code (called XIL) and then generates a high-level representation (called YIL) from it to do data-cache optimization. Following the data-cache optimization, the resulting YIL code is converted back to XIL.
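One reason such optimizations favor a high-level form is that loop nests and array subscripts are still explicit there. The following C sketch uses loop interchange as a representative data-cache optimization; that choice, the matrix size N, and all function names are assumptions made for illustration rather than details taken from Section 20.4.

```c
#include <assert.h>
#include <stddef.h>

#define N 128

/* Fills the matrix with m[i][j] = i + j so that the total is known. */
void fill_matrix(int m[N][N])
{
    assert(m != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            m[i][j] = (int)(i + j);
}

/* Before interchange: the inner loop walks down a column, so successive
   accesses are N elements apart in C's row-major layout. */
long sum_by_columns(int m[N][N])
{
    assert(m != NULL);
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}

/* After interchange: the inner loop walks along a row, so successive
   accesses touch adjacent memory locations. */
long sum_by_rows(int m[N][N])
{
    assert(m != NULL);
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}
```

The two traversals compute the same sum; they differ only in the order of memory accesses. Once subscripts have been lowered to explicit address arithmetic, recognizing that the two loops may be interchanged generally takes more analysis.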

1.5 Placement of Optimizations in Aggressive Optimizing Compilers
