Summary - Automatic SIMD vectorization of SSA-based control flow graphs

Whole-Function Vectorization (WFV) is a generic vectorization approach that focuses on control-flow to data-flow conversion and instruction vectorization. It can be used in a variety of scenarios such as loop vectorization, outer loop vectorization, and in the back ends of data-parallel languages. In contrast to most other approaches, WFV works on the control flow graph rather than on the source or syntax tree level. This allows the code to be optimized aggressively before vectorization. The state of the art in vectorization is usually limited to straight-line code or structured control flow, and it linearizes the control flow structure completely except in trivial cases. WFV, on the other hand, can handle arbitrary control flow and can retain some structure of the CFG. The analyses introduced in this thesis describe a more complete picture of the behavior of SIMD execution than previous divergence or variance analyses. The WFV-based OpenCL driver employs some of the most advanced code generation and vectorization techniques that have been developed. Together with the Intel driver, the prototype currently offers the best performance of any commonly used OpenCL CPU driver.

5 SIMD Property Analyses

In this chapter, we describe the heart of the WFV algorithm: a set of analyses that determine properties of a function for a SIMD execution model.

The analyses presented in this chapter determine a variety of properties for the instructions, basic blocks, and loops of a scalar source function. The properties related to values are listed in Table 5.1, those related to blocks and loops in Table 5.2. They describe the behavior of the function in data-parallel execution.

In general, the SIMD properties are dynamic because they describe values that may depend on input arguments. The Vectorization Analysis (Section 5.6) describes a safe, static approximation of them.

Many of the analysis results are interdependent. For example, values may be non-uniform because of divergent control flow, and control flow may diverge because of non-uniform values (see Section 5.2). Thus, the Vectorization Analysis consists of several parts that interact.
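The mutual dependence between non-uniform values and divergent control flow can be illustrated with a small fixed-point iteration. The following Python sketch is not the Vectorization Analysis of Section 5.6. It is a simplified illustration under stated assumptions: the dictionaries `operands` and `phi_joins` and the function `analyze` are hypothetical names, and the only properties tracked are uniform and varying.

```python
def analyze(operands, phi_joins, seeds):
    """Return the set of values that cannot be proven uniform.

    operands:  value name -> names of the values it is computed from
    phi_joins: phi value name -> branch condition whose paths it joins
    seeds:     values known to be varying up front (hypothetical input)
    """
    varying = set(seeds)
    changed = True
    while changed:  # iterate until no value changes its property
        changed = False
        for value, ops in operands.items():
            if value in varying:
                continue
            # Data dependence: some operand is already varying.
            data_dep = any(op in varying for op in ops)
            # Control dependence: the phi joins paths of a branch whose
            # condition is varying, i.e. control flow may have diverged.
            control_dep = phi_joins.get(value) in varying
            if data_dep or control_dep:
                varying.add(value)
                changed = True
    return varying


operands = {
    "x": ["tid"],     # x depends on the instance identifier
    "c": ["x"],       # c is a branch condition computed from x
    "y": ["a", "b"],  # y = phi(a, b), with a and b both uniform
    "z": ["y"],       # z uses the phi result
    "w": ["a"],       # w depends only on a uniform value
}
phi_joins = {"y": "c"}  # y joins the two paths of the branch on c
result = analyze(operands, phi_joins, {"tid"})
```

In this example `y` becomes varying although both of its operands are uniform, because the branch on `c` may diverge. `w` stays uniform because nothing it depends on is varying.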

The analysis results are used throughout all phases of vectorization and in many cases make it possible to generate more efficient code.

Table 5.1 Instruction properties derived by our analyses. Note that varying is defined for presentation purposes only: it subsumes consecutive and unknown.

Property         Symbol  Description

uniform          u       result is the same for all instances
varying          v       result is not provably uniform
consecutive      c       result consists of consecutive values
unknown          k       result values follow no usable pattern
aligned          a       result value is aligned to multiple of S
nonvectorizable  n       result type has no vector counterpart
sequential       s       execute W times sequentially
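To make the value properties concrete, the following Python sketch classifies the results that the W instances of one instruction produce in a single concrete run. This is a dynamic observation for illustration only, whereas the analyses in this chapter compute a static approximation. The function name `classify` and the representation of results as a list with one entry per instance are assumptions.

```python
def classify(values):
    """Classify the per-instance results of one instruction.

    values holds one result per instance, so W == len(values).
    """
    first = values[0]
    if all(v == first for v in values):
        return "uniform"      # same result in every instance
    if all(v == first + i for i, v in enumerate(values)):
        return "consecutive"  # first, first + 1, first + 2, ...
    return "unknown"          # no usable pattern


examples = {
    "uniform": classify([7, 7, 7, 7]),
    "consecutive": classify([4, 5, 6, 7]),
    "unknown": classify([3, 9, 1, 4]),
}
```

Both consecutive and unknown results fall under varying in the sense of Table 5.1, since neither is uniform.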

Table 5.2 List of block and loop properties derived by our analyses.

Property     Description

by all       block is always executed by all instances
div causing  block is a divergence-causing block
blend_v      block is join point of instances that diverged at block v
rewire_v     block is a rewire target of a div causing block v
divergent    loop that instances may leave at different points (time & space)

5.1 Program Representation

We consider the scalar function f to be given in a typed, low-level representation. A function is represented as a control flow graph of instructions. Furthermore, we require that f is in SSA form, i.e., every variable has a single static assignment and every use of a variable is dominated by its definition. A prominent example of such a program representation is LLVM bitcode [Lattner & Adve 2004] which we also use in our evaluation (Chapter 8). We will restrict ourselves to a subset of a language that contains only the relevant elements for this thesis. Figure 5.1 shows its types and instructions. Other instructions, such as arithmetic and comparison operators, are straightforward and omitted for the sake of brevity.

This program representation reflects today’s consensus of instruction set architectures well. alloca allocates local memory and returns the corresponding address as a pointer of the requested type. The gep instruction (“get element pointer”) performs address arithmetic. load (store) takes a base address and reads (writes) the elements of the vector consecutively from (to) this address. The bool type is special in that we do not allow creating pointers of it. This is because the purpose of boolean values is solely to encode control flow behavior.
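The memory instructions described above can be sketched in Python. This is a simplified model rather than the representation of Figure 5.1. It assumes that memory is a flat list, that addresses are element indices instead of byte addresses, and that the vector width is `W = 4`. The function names mirror the instructions, but their signatures are hypothetical.

```python
W = 4  # assumed SIMD width, chosen for illustration only


def gep(base, index):
    """Address arithmetic: the address of element `index` relative to `base`."""
    return base + index


def load(memory, address, width=W):
    """Read `width` consecutive elements starting at `address`."""
    return memory[address:address + width]


def store(memory, address, vector):
    """Write the elements of `vector` consecutively starting at `address`."""
    memory[address:address + len(vector)] = vector


memory = list(range(100, 116))  # 16 elements: 100, 101, ..., 115
address = gep(4, 4)             # element index 8
vector = load(memory, address)  # elements 8 to 11
store(memory, 0, vector)        # overwrite elements 0 to 3
```

After the store, the first four elements of `memory` hold the loaded vector and the rest of the list is unchanged.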

The function tid returns the instance identifier of the running instance (see above). A phi represents the usual φ-function from SSA. An lcssaphi represents the φ-function with only one entry that is found in loop exit blocks of functions in LCSSA form. The operation arg(i) accesses the i-th argument to f. We assume that all pointer arguments to f are aligned to the SIMD register size.

Function calls are represented by call instructions that receive a function identifier and a set of parameters. Branches are represented by branch instructions which take a boolean condition and return one of their target program point identifiers.

