Maximizing Multiprocessor Performance with the SUIF Compiler

Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three fourths of the NAS and SPECfp95 benchmark programs.

December 1996, pages 84-89. This paper has been modified for publication here with the addition of the section The Status and Future of SUIF.


Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Edouard Bugnion, Monica S. Lam

The affordability of shared memory multiprocessors offers the potential of supercomputer-class performance to the general public. Typically used in a multiprogramming mode, these machines increase throughput by running several independent applications in parallel. But multiple processors can also work together to speed up single applications. This requires that ordinary sequential programs be rewritten to take advantage of the extra processors. Automatic parallelization with a compiler offers a way to do this.

Parallelizing compilers face more difficult challenges from multiprocessors than from vector machines, which were their initial target. Using a vector architecture effectively involves parallelizing repeated arithmetic operations on large data streams, for example, the innermost loops in array-oriented programs. On a multiprocessor, however, this approach typically does not provide sufficient granularity of parallelism: Not enough work is performed in parallel to overcome processor synchronization and communication overhead. To use a multiprocessor effectively, the compiler must exploit coarse-grain parallelism, locating large computations that can execute independently in parallel.

Locating parallelism is just the first step in producing efficient multiprocessor code. Achieving high performance also requires effective use of the memory hierarchy, and multiprocessor systems have more complex memory hierarchies than typical vector machines: They contain not only shared memory but also multiple levels of cache memory.

These added challenges often limited the effectiveness of early parallelizing compilers for multiprocessors, so programmers developed their applications from scratch, without assistance from tools. But explicitly managing an application's parallelism and memory use requires a great deal of programming knowledge, and the work is tedious and error-prone. Moreover, the resulting programs are optimized for only a specific machine. Thus, the effort required to develop efficient parallel programs restricts the user base for multiprocessors.

This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate Format) compiler that result in good multiprocessor performance for array-based numerical programs. We provide SUIF performance measurements for the complete NAS and SPECfp95 benchmark suites. Overall, the results for these scientific programs are promising. The compiler yields speedups on three fourths of the programs and has obtained the highest ever performance on the SPECfp95 benchmark, indicating that the compiler can also achieve efficient absolute performance.

Finding Coarse-grain Parallelism

Multiprocessors work best when the individual processors have large units of independent computation, but it is not easy to find such coarse-grain parallelism. First, the compiler must find available parallelism across procedure boundaries. Furthermore, the original computations may not be parallelizable as given and may first require some transformations. For example, experience in parallelizing by hand suggests that we must often replace global arrays with private versions on different processors. In other cases, the computation may need to be restructured; for example, we may have to replace a sequential accumulation with a parallel reduction operation.

It takes a large suite of robust analysis techniques to successfully locate coarse-grain parallelism. General and uniform frameworks helped us manage the complexity involved in building such a system into SUIF.

We automated the analysis to privatize arrays and to recognize reductions to both scalar and array variables. Our compiler's analysis techniques all operate seamlessly across procedure boundaries.
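The two transformations named in this section, replacing a sequential accumulation with a parallel reduction and replacing a shared array with private versions, can be pictured with a small sketch. This is not SUIF output: the function names are hypothetical, and the "processors" are simulated by an ordinary loop so that the example stays self-contained.

```python
# Illustrative sketch only: the "processors" are simulated by an ordinary loop.

def sequential_sum(values):
    total = 0
    for x in values:
        total += x          # every iteration reads and writes `total`
    return total


def reduction_sum(values, num_procs=4):
    # Split the iteration space into contiguous chunks, one per simulated processor.
    chunk = (len(values) + num_procs - 1) // num_procs
    partial = [0] * num_procs            # one private accumulator per processor
    for p in range(num_procs):           # chunks do not interact with each other
        for x in values[p * chunk:(p + 1) * chunk]:
            partial[p] += x
    return sum(partial)                  # combine the private results once


def row_sums_shared(matrix):
    scratch = [0] * len(matrix[0])       # one scratch array reused by all iterations
    out = []
    for row in matrix:
        for j, x in enumerate(row):
            scratch[j] = 2 * x
        out.append(sum(scratch))
    return out


def row_sums_private(matrix):
    out = []
    for row in matrix:
        scratch = [0] * len(row)         # private copy: iterations no longer conflict
        for j, x in enumerate(row):
            scratch[j] = 2 * x
        out.append(sum(scratch))
    return out
```

The reduction rewrite relies on addition being associative and commutative; with floating-point data the reordered sum may differ in the last bits. The privatization rewrite is valid here because each iteration writes every element of the scratch array before reading it.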

Scalar Analyses

An initial phase analyzes scalar variables in the programs. It uses techniques such as data dependence analysis, scalar privatization analysis, and reduction recognition to detect parallelism among operations with scalar variables. It also derives symbolic information on these scalar variables that is useful in the array analysis phase. Such information includes constant propagation, induction variable recognition and elimination, recognition of loop-invariant computations, and symbolic relation propagation.
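As a rough illustration of two of these scalar techniques, the sketch below shows a loop before and after induction variable elimination, with a temporary scalar that is a candidate for privatization. The loop and the names `original` and `rewritten` are made up for this example and are not taken from SUIF.

```python
def original(n):
    a = [0] * n
    k = 0
    for i in range(n):
        k = k + 3        # induction variable: k == 3 * (i + 1) in every iteration
        t = k * k        # t is always written before it is read: privatizable
        a[i] = t + 1
    return a


def rewritten(n):
    a = [0] * n
    for i in range(n):   # iterations no longer share any scalar state
        k = 3 * (i + 1)  # closed form replaces the running update
        t = k * k        # each iteration can keep its own copy of t
        a[i] = t + 1
    return a
```

In the original loop the running update of `k` carries a value from one iteration to the next. Once `k` is expressed as a function of the loop index, each iteration depends only on `i`, so the iterations could be executed in any order.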

Array Analyses

An array analysis phase uses a unified mathematical framework based on linear algebra and integer linear programming. The analysis applies the basic data dependence test to determine if accesses to an array can refer to the same location. To support array privatization, it also finds array dataflow information that determines whether array elements used in an iteration refer to the values produced in a previous iteration.

Digital Technical Journal Vol. 10 No. 1 1998

Moreover, it recognizes commutative operations on sections of an array and transforms them into parallel reductions. The reduction analysis is powerful enough to recognize commutative updates of even indirectly accessed array locations, allowing parallelization of sparse computations.
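A commutative update through an index array, the pattern common in sparse codes, might look like the following sketch. The function names and the cyclic distribution of iterations are assumptions made for illustration; the article does not say how SUIF schedules such a reduction.

```python
def scatter_add_sequential(idx, w, size):
    acc = [0] * size
    for i in range(len(idx)):
        acc[idx[i]] += w[i]      # target location is only known at run time
    return acc


def scatter_add_reduction(idx, w, size, num_procs=3):
    # Each simulated processor accumulates into its own private copy of the array.
    private = [[0] * size for _ in range(num_procs)]
    for p in range(num_procs):
        for i in range(p, len(idx), num_procs):   # cyclic share of the iterations
            private[p][idx[i]] += w[i]
    # Element-wise combination of the private copies gives the final array.
    return [sum(column) for column in zip(*private)]
```

Because the only operation applied to the array is `+=`, the order in which the updates land does not matter, even when two iterations hit the same element. That is what allows the loop to be treated as a reduction rather than as a chain of dependences.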

All these analyses are formulated in terms of integer programming problems on systems of linear inequalities that represent the data accessed. These inequalities are derived from loop bounds and array access functions. Implementing optimizations to speed up common cases reduces the compilation time.
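To give a feel for how a dependence question becomes an integer problem, consider a write to `A[a1*i + b1]` and a read of `A[a2*j + b2]` with `0 <= i, j < n`. The two accesses can touch the same location only if the equation `a1*i + b1 == a2*j + b2` has an integer solution inside the loop bounds. The sketch below is not SUIF's algorithm: it swaps in the classic GCD test as a cheap early exit and plain enumeration in place of an integer programming solver, and all names are hypothetical.

```python
from math import gcd


def gcd_test(a1, b1, a2, b2):
    """Necessary condition: a1*i - a2*j == b2 - b1 has some integer solution."""
    g = gcd(a1, a2)
    if g == 0:                       # both subscripts are constants
        return b1 == b2
    return (b2 - b1) % g == 0


def may_depend(a1, b1, a2, b2, n):
    """True if A[a1*i + b1] and A[a2*j + b2] can coincide for 0 <= i, j < n."""
    if not gcd_test(a1, b1, a2, b2):
        return False                 # fast path: no integer solution at all
    # Exact check within the loop bounds; enumeration stands in for a solver.
    read_locations = {a2 * j + b2 for j in range(n)}
    return any(a1 * i + b1 in read_locations for i in range(n))
```

The early exit mirrors the point made above about speeding up common cases: many subscript pairs can be dismissed without solving the full system. The loop bounds matter as well, since `A[i]` and `A[j + 10]` pass the GCD test but never overlap when `n` is 5.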

Interprocedural Analysis Framework

All the analyses are implemented using a uniform interprocedural analysis framework, which helps manage the software engineering complexity. The framework uses interprocedural dataflow analysis, which is more efficient than the more common technique of inline substitution. Inline substitution replaces each procedure call with a copy of the called procedure, then analyzes the expanded code in the usual intraprocedural manner. Inline substitution is not practical for large programs, because it can make the program too large to analyze.
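The cost difference between the two approaches can be seen on a toy program. In the sketch below, which assumes a made-up program representation and a call graph without recursion, a side-effect summary is computed once per procedure and reused at every call, while `inlined_size` counts the statements an inliner would have to analyze. None of the names come from SUIF.

```python
from functools import lru_cache

# Hypothetical toy program: each procedure is a list of (kind, name) statements.
PROGRAM = {
    "main": [("call", "fft"), ("call", "fft"), ("write", "result")],
    "fft": [("call", "butterfly"), ("call", "butterfly"), ("write", "x")],
    "butterfly": [("write", "tmp"), ("write", "x")],
}


@lru_cache(maxsize=None)
def summary(proc):
    """Variables `proc` may write; the cache ensures one analysis per procedure."""
    written = set()
    for kind, name in PROGRAM[proc]:
        if kind == "write":
            written.add(name)
        else:
            written |= summary(name)     # apply the callee's summary at the call
    return frozenset(written)


def inlined_size(proc):
    """Statements to analyze if every call were replaced by a copy of the callee."""
    return sum(
        1 if kind == "write" else inlined_size(name)
        for kind, name in PROGRAM[proc]
    )
```

The toy program has 8 statements in total, yet full inlining of `main` yields 11, and the gap widens quickly as call chains deepen. The summary approach visits each procedure body once regardless of how many call sites refer to it.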

Our technique analyzes only a single copy of each procedure, capturing its side effects in a function. This function is then applied at each call site to produce precise results. When different calling contexts make it necessary, the algorithm selectively clones a procedure so that code can be analyzed and possibly parallelized under different calling contexts (as when different constant values are passed to the same formal parameter). In this way the full advantages of inlining are achieved without expanding the code indiscriminately.

In Figure 1 the boxes represent procedure bodies, and the lines connecting them represent procedure calls. The main computation is a series of four loops to compute three-dimensional fast Fourier transforms. Using interprocedural scalar and array analyses, the SUIF compiler determines that these loops are parallelizable. Each loop contains more than 500 lines of code spanning up to nine procedures with up to 42 procedure calls. If this program had been fully inlined, the loops presented to the compiler for analysis would have each contained more than 86,000 lines of code.

Memory Optimization

Figure 1

The compiler discovers parallelism through interprocedural array analysis. Each of the four parallelized loops at left consists of more than 500 lines of code spanning up to nine procedures (boxes) with up to 42 procedure calls (lines).

Numerical applications on high-performance microprocessors are often memory bound. Even with one or more levels of cache to bridge the gap between processor and memory speeds, a processor may still waste half its time stalled on memory accesses because it frequently references an item not in the cache (a cache miss). This memory bottleneck is further exacerbated on multi
