Generating and Loading Executable Code - Lightweight Modular Staging and Embedded Compilers:Abs

6.2 Contributions

6.2.6 Generating and Loading Executable Code

Code generation in LMS is an explicit operation. For the common case where generated code is to be loaded immediately into the running program, trait Compile provides a suitable interface in the form of the abstract method compile:

trait Compile extends Base {
  def compile[A,B](f: Rep[A] => Rep[B]): A=>B
}

The contract of compile is to “unstage” a function from staged to staged values into a function operating on present-stage values that can be used just like any other function object in the running program. Of course this only works for functions that do not reference externally bound Rep[T] values; otherwise the generated code will not compile due to free identifiers. The given encoding into Scala’s type system does not prevent this kind of error.

For generating Scala code, an implementation of the compilation interface is provided by trait CompileScala:

trait CompileScala extends Compile {
  def compile[A,B](f: Rep[A] => Rep[B]) = {
    val x = fresh[A]
    val y = accumulate { f(x) }
    // emit header
    emitBlock(y)
    // emit footer
    // invoke compiler
    // load generated class file
    // instantiate object of that class
  }
}

The overall compilation logic of CompileScala is relatively simple: emit a class and apply-method declaration header, emit instructions for each definition node according to the schedule, close the source file, invoke the Scala compiler, load the generated class file and return a newly instantiated object of that class.
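The emission steps described above can be illustrated with a minimal, self-contained sketch. It is not the LMS implementation: the object MiniStaging, the helper reflect and the method emitSource are hypothetical names introduced here, staged values are plain identifier strings as in Part I, and the final steps (invoking the Scala compiler, loading the class file, instantiating the object) are left out.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical miniature of a string-based staging core; not the LMS API.
object MiniStaging {
  // A staged value is represented by the name of the variable that holds it.
  final case class Rep[T](name: String)

  private var counter = 0
  private val stmts = ListBuffer.empty[String]

  // Create a fresh staged identifier x1, x2, ...
  def fresh[T]: Rep[T] = { counter += 1; Rep[T]("x" + counter) }

  // Record a definition and return the identifier bound to it.
  def reflect[T](rhs: String): Rep[T] = {
    val sym = fresh[T]
    stmts += s"val ${sym.name} = $rhs"
    sym
  }

  def plus(a: Rep[Int], b: Rep[Int]): Rep[Int] = reflect[Int](s"${a.name} + ${b.name}")
  def times(a: Rep[Int], b: Rep[Int]): Rep[Int] = reflect[Int](s"${a.name} * ${b.name}")

  // Emit header, body and footer of a class with an apply method.
  // Compiling and loading the resulting source is not shown.
  def emitSource(className: String)(f: Rep[Int] => Rep[Int]): String = {
    counter = 0
    stmts.clear()
    val x = fresh[Int]          // staged function parameter
    val y = f(x)                // running f accumulates the definitions
    val header = s"class $className extends (Int => Int) {\n  def apply(${x.name}: Int): Int = {\n"
    val body = stmts.map(s => "    " + s + "\n").mkString
    val footer = s"    ${y.name}\n  }\n}\n"
    header + body + footer
  }
}
```

Calling MiniStaging.emitSource("Generated")(x => MiniStaging.plus(MiniStaging.times(x, x), x)) yields a class whose apply method binds x2 = x1 * x1 and x3 = x2 + x1 and returns x3. A staged function that referenced a Rep value created outside of emitSource would produce source with a free identifier, which is the error the type encoding does not rule out.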

Part II

Chapter 7

Intro: Not your Grandfather’s Compiler

How do embedded compilers compile their programs?

The purely string-based representation of staged programs from Part I does not allow analysis or transformation of embedded programs. Since LMS is not inherently tied to a particular program representation, it is very easy to pick one that is better suited for optimization. As a first cut, we switch to an intermediate representation (IR) based on expression trees, adding a level of indirection between construction of object programs and code generation (Chapter 8). On this tree IR we can define traversal and transformation passes and build a straightforward embedded compiler. We can add new IR node types and new transformation passes that implement domain-specific optimizations. In particular, we can use multiple passes of staging: while traversing (effectively, interpreting) one IR we can execute staging commands to build another staged program, possibly in a different, lower-level object language.
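As a rough illustration of what a tree IR with separate passes looks like, the following sketch defines a tiny expression tree, one transformation pass and one traversal pass. All node and method names (Exp, Const, Sym, Plus, Times, simplify, emit) are assumptions made for this example and do not reproduce the definitions of Chapter 8.

```scala
// Hypothetical expression-tree IR; node names are illustrative only.
sealed trait Exp
final case class Const(value: Int) extends Exp
final case class Sym(name: String) extends Exp
final case class Plus(a: Exp, b: Exp) extends Exp
final case class Times(a: Exp, b: Exp) extends Exp

object TreePasses {
  // Transformation pass: bottom-up algebraic simplification.
  def simplify(e: Exp): Exp = e match {
    case Plus(a, b) => (simplify(a), simplify(b)) match {
      case (Const(0), y)        => y
      case (x, Const(0))        => x
      case (Const(m), Const(n)) => Const(m + n)
      case (x, y)               => Plus(x, y)
    }
    case Times(a, b) => (simplify(a), simplify(b)) match {
      case (Const(0), _) | (_, Const(0)) => Const(0)
      case (Const(1), y)                 => y
      case (x, Const(1))                 => x
      case (Const(m), Const(n))          => Const(m * n)
      case (x, y)                        => Times(x, y)
    }
    case other => other
  }

  // Traversal pass: code generation to a Scala expression string.
  def emit(e: Exp): String = e match {
    case Const(v)    => v.toString
    case Sym(n)      => n
    case Plus(a, b)  => s"(${emit(a)} + ${emit(b)})"
    case Times(a, b) => s"(${emit(a)} * ${emit(b)})"
  }
}
```

Because construction of the tree and code generation are now separate steps, a pass such as simplify can run in between; for example, simplifying Plus(Times(Const(1), Sym("x")), Times(Sym("y"), Const(0))) leaves just Sym("x").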

However, the extremely high degree of extensibility poses serious challenges. In particular, the interplay of optimizations implemented as many separate phases does not yield good results due to the phase ordering problem: it is unclear in which order and how often to execute these phases, and since each optimization pass has to make pessimistic assumptions about the outcome of all other passes, the global result is often suboptimal compared to a dedicated, combined optimization phase [149, 23]. There are also implementation challenges, as each optimization needs to be designed to treat unknown IR nodes in a sensible way.

Other challenges are due to the fact that embedded compilers are supposed to be used like libraries. Extending an embedded compiler should be easy, and as much of the work as possible should be delegated to a library of compiler components. Newly defined high-level IR nodes should profit from generic optimizations automatically.

To remedy this situation, we switch to a graph-based “sea of nodes” representation (Chapter 9). This representation links definitions and uses, and it also reflects the program block structure via nesting edges. We consider purely functional programs first. A number of nontrivial optimizations become considerably simpler. Common subexpression elimination (CSE) and dead code elimination (DCE) are particularly easy. Both are completely generic and support an open domain of IR node types. Optimizations that can be expressed as context-free rewrites are also easy to add in a modular fashion. A scheduling and code motion algorithm transforms graphs back into trees, moving computations to places where they are less often executed, e.g. out of loops or functions. Both graph-based and tree-based transformations are useful: graph-based transformations are usually simpler and more efficient whereas tree-based transformations, implemented as multiple staging passes, can be more powerful and employ arbitrary context-sensitive information.
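To see why CSE and DCE become easy once definitions and uses are linked, consider the following minimal sketch for purely functional programs. The names GraphIR, Graph, toAtom and schedule are hypothetical, chosen for this example; the sketch has no nested blocks and performs no code motion, so it only covers the two generic optimizations named above.

```scala
import scala.collection.mutable

// Hypothetical miniature of a graph IR; names are illustrative only.
object GraphIR {
  sealed trait Exp
  final case class Const(value: Int) extends Exp
  final case class Sym(id: Int) extends Exp

  sealed trait Def
  final case class Plus(a: Exp, b: Exp) extends Def
  final case class Times(a: Exp, b: Exp) extends Def

  final class Graph {
    // Insertion order equals creation order, which is a valid schedule
    // because operands always exist before the definitions that use them.
    private val defs = mutable.LinkedHashMap.empty[Def, Sym]

    // CSE: structurally equal definitions share a single symbol.
    def toAtom(d: Def): Exp = defs.getOrElseUpdate(d, Sym(defs.size))

    // Symbols that a definition uses directly.
    private def uses(d: Def): List[Sym] = {
      val operands = d match {
        case Plus(a, b)  => List(a, b)
        case Times(a, b) => List(a, b)
      }
      operands.collect { case s: Sym => s }
    }

    // DCE: keep only definitions reachable from the result.
    def schedule(result: Exp): List[(Sym, Def)] = {
      val bySym = defs.map { case (d, s) => s -> d }.toMap
      val live = mutable.Set.empty[Sym]
      def mark(e: Exp): Unit = e match {
        case s: Sym if !live(s) =>
          live += s
          uses(bySym(s)).foreach(mark)
        case _ => ()
      }
      mark(result)
      defs.toList.collect { case (d, s) if live(s) => (s, d) }
    }
  }
}
```

Neither toAtom nor the reachability marking in schedule inspects what a node means: both rely only on structural equality and on the list of operands, which is why they extend to new node types once uses knows their operands.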

To support effectful programs, we make effects explicit in the dependency graph (similar to SSA form). We can support simple effect domains (pure vs. effectful) and more fine-grained ones, such as tracking modifications per allocation site. The latter relies on alias and points-to analysis.

We turn to advanced optimizations in Chapter 10. For combining analyses and optimizations, it is crucial to maintain optimistic assumptions for all analyses. The key challenge is that one analysis has to anticipate the effects of the other transformations. The solution is speculative rewriting [85]: transform a program fragment in the presence of partial and possibly unsound analysis results and re-run the analyses on the transformed code until a fixpoint is reached. This way, different analyses can communicate through the transformed code and need not anticipate their results in other ways. Using speculative rewriting, we compose many optimizations into more powerful combined passes. Often, a single forward simplification pass that can be used to clean up after non-optimizing transformations is sufficient.

However, not all rewrites can fruitfully be combined into a single phase. For example, high-level representations of linear algebra operations may give rise to rewrite rules like I M → M where I is the identity matrix. At the same time, there may be rules that define how a matrix multiplication can be implemented in terms of arrays and while loops, or a call to an external library (BLAS). To be effective, all the high-level simplifications need to be applied exhaustively before any of the lowering transformations are applied. But lowering transformations may create new opportunities for high-level rules, too. Our solution here is delayed rewriting: programmers can specify that a certain rewrite should not be applied right now, but have it registered to be executed at the next iteration of a particular phase. Delayed rewriting thus provides a way of grouping and prioritizing modularly defined transformations.
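The ordering constraint between high-level simplification and lowering can be made concrete with a small sketch. It shows only why the order matters, using two hard-wired phases; it does not model the registration mechanism of delayed rewriting. The types MExp, Identity, Mat, Mul and BlasCall and the object Phases are hypothetical names introduced for this example.

```scala
// Hypothetical matrix-expression IR; names are illustrative only.
sealed trait MExp
case object Identity extends MExp
final case class Mat(name: String) extends MExp
final case class Mul(a: MExp, b: MExp) extends MExp
final case class BlasCall(a: MExp, b: MExp) extends MExp // lowered multiplication

object Phases {
  // High-level phase: apply I M -> M and M I -> M bottom-up.
  def simplify(e: MExp): MExp = e match {
    case Mul(a, b) => (simplify(a), simplify(b)) match {
      case (Identity, m) => m
      case (m, Identity) => m
      case (x, y)        => Mul(x, y)
    }
    case other => other
  }

  // Lowering phase: implement every remaining multiplication as a library call.
  def lower(e: MExp): MExp = e match {
    case Mul(a, b) => BlasCall(lower(a), lower(b))
    case other     => other
  }

  // Simplify exhaustively first, then lower.
  def run(e: MExp): MExp = lower(simplify(e))
}
```

Lowering Mul(Identity, Mat("M")) directly produces a library call on the identity matrix, whereas Phases.run removes the multiplication altogether. Once Mul has been replaced by BlasCall, the rule I M → M no longer matches, which is why the lowering rule has to be held back until the high-level rules are done.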

On top of this infrastructure, we build a number of advanced optimizations. A general pattern is split and merge: We split operations and data structures in order to expose their components to rewrites and dead-code elimination and then merge the remaining parts back together. This struct transformation also allows for more general data structure conversions, including array-of-struct to struct-of-array representation conversion. Furthermore we present a novel loop fusion algorithm, a powerful transformation that removes intermediate data structures.
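The array-of-struct to struct-of-array conversion mentioned above can be pictured at the level of ordinary runtime data, as in the sketch below. In an embedded compiler the conversion would be performed on the IR at staging time; here it is done by hand, and the names Complex, ComplexSoA and Layout are assumptions made for this example.

```scala
// Hypothetical record type used only for illustration.
final case class Complex(re: Double, im: Double)

// Struct-of-arrays layout: one primitive array per field.
final class ComplexSoA(val re: Array[Double], val im: Array[Double]) {
  def length: Int = re.length
  // Merge: rebuild a record from its components on demand.
  def apply(i: Int): Complex = Complex(re(i), im(i))
}

object Layout {
  // Split: array-of-structs to struct-of-arrays.
  def toSoA(xs: Array[Complex]): ComplexSoA =
    new ComplexSoA(xs.map(_.re), xs.map(_.im))

  // A computation that reads a single field touches a single array.
  def sumRe(xs: ComplexSoA): Double = xs.re.sum
}
```

After the split, a computation such as sumRe never refers to the im array. In a staged setting that array would then be a candidate for dead-code elimination, which is the point of exposing the components separately.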

Chapter 8

Intermediate Representation: Trees

With the aim of generating code, we could represent staged expressions directly as strings, as done in Part I. But for optimization purposes we would rather have a structured intermediate representation that we can analyze in various ways. Fortunately, LMS makes it very easy to use a different internal program representation.

In document Lightweight Modular Staging and Embedded Compilers:Abstraction without Regret for High-Level High-Performance Programming (Page 65-71)