3.5 Vectorization Algorithm
3.5.1 Vectorization of Loops
Vectorization of loops is the core of our algorithm. Currently, our algorithm is able to vectorize for-loops that are in a canonical loop form, as defined in OpenMP [117]. Roughly, a loop in this canonical form allows computing its iteration count before executing it, i.e., it is a countable loop. In addition, our algorithm is also able to vectorize simple while-loops that satisfy the restrictions imposed by the canonical loop form.
Structure of a Vectorized Loop
The vectorization of a single scalar loop may result in one or more loops that, in conjunction, perform the computation equivalent to the scalar loop [16, 84]. These resulting loops can be of the following kinds:
Prologue loop: This kind of loop peels iterations from the beginning of the main vector loop. It is normally used to peel some scalar iterations to align vector memory accesses of the main vector loop [157, 16, 93]. Its number of iterations is smaller than the VF.
Main vector loop: This is the main loop of the vector transformation. Its vector iterations aim to exploit the available computational resources as fully as possible.
Epilogue loop: This loop computes the remaining iterations of the scalar loop not computed in the main vector loop. Its number of iterations is smaller than the VF.
The prologue and the epilogue loops might be scalar or vector, depending on the SIMD features of the target architecture. Furthermore, the compiler can generate one or multiple scalar and/or vector versions of each one of these loops with different characteristics, as an optimization for the same main vector loop.
Figure 3.4 shows an example of the loops resulting from the vectorization of a simple loop. Figure 3.4a shows the original scalar loop. As depicted in Figure 3.4b, the first loop generated in this case is a prologue loop. This loop is aimed at aligning the a[i] accesses in the main vector loop (the second one). The prologue loop executes scalar iterations while the address &a[i] is not aligned to the vector length boundary (16 bytes in this example). The main vector loop computes vector iterations using a VF equal to 4. The remaining iterations of the original scalar loop that have not been computed in the main vector loop are then computed in the epilogue loop (third loop) in a scalar fashion.
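The three-loop structure described above can be sketched in plain C. This is only an illustrative sketch, not the code our algorithm emits: the kernel add_one, the constants VF and VEC_BYTES, and the inner lane loop that stands in for a single vector instruction are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: 16-byte vectors holding 4 floats each. */
enum { VF = 4, VEC_BYTES = 16 };

/* Hypothetical kernel: adds 1.0f to every element of a[0..n). */
void add_one(float *a, size_t n)
{
    assert(a != NULL || n == 0);
    size_t i = 0;

    /* Prologue loop: scalar iterations while &a[i] is not aligned to the
       16-byte boundary (the pointer-to-integer test is the usual idiom,
       although its result is implementation-defined). */
    for (; i < n && ((uintptr_t)&a[i] % VEC_BYTES) != 0; i++)
        a[i] += 1.0f;

    /* Main vector loop: each iteration covers VF consecutive elements.
       The inner lane loop stands in for one vector instruction. */
    for (; i + VF <= n; i += VF)
        for (size_t lane = 0; lane < VF; lane++)
            a[i + lane] += 1.0f;

    /* Epilogue loop: the fewer-than-VF iterations left over. */
    for (; i < n; i++)
        a[i] += 1.0f;
}
```

Neither the main vector loop nor the epilogue loop initializes i: each one continues from the value left by the loop before it.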
Depending on the code of the scalar loop, the compiler can generate several combinations of the previous loops as a result of the vectorization. For example, if the number of iterations of the scalar loop is known at compile time and it is smaller than the VF, the resulting code could contain only an epilogue loop, because the main vector loop would never be executed. If the number of iterations of the scalar loop is known at compile time and it is a multiple of the VF, the compiler could omit the epilogue loop, as it would never be executed. Currently, the generation of prologue loops is not implemented in our algorithm.
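The two compile-time cases above reduce to simple arithmetic on the trip count. The following sketch assumes a known trip count and no prologue loop; the names loop_plan and plan_loops are hypothetical and are not part of our implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of which loops need to be generated. */
struct loop_plan {
    bool main_vector; /* at least one full vector iteration exists */
    bool epilogue;    /* some iterations remain after the vector loop */
};

/* Decides which loops are needed for a trip count known at compile time.
   Prologue loops are not considered. */
struct loop_plan plan_loops(size_t trip_count, size_t vf)
{
    assert(vf > 0);
    struct loop_plan plan;
    plan.main_vector = trip_count >= vf;
    plan.epilogue = (trip_count % vf) != 0;
    return plan;
}
```

For instance, with a VF of 4, a trip count of 3 would require only the epilogue loop, a trip count of 16 only the main vector loop, and a trip count of 10 both.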
Loop Vectorization Steps
The vectorization of a scalar loop is conducted following three steps: the analysis of the loop, the vectorization of the loop header and the vectorization of the loop body. In the analysis of the loop, the algorithm determines if the future main vector loop will need prologue or epilogue loops. In addition, the algorithm analyzes the data types of the operations in the loop body to determine a suitable VF to be used for the vectorization. Currently, our algorithm computes VF as the number of scalar elements of the largest data type found that fits into the vector register of the target architecture. In other words, the VF chosen is equal to the vector length (VL) for the largest data type found in the code.
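As a minimal sketch of this VF selection rule, the function below takes the size in bytes of the vector register and of each data type found in the loop body. The name choose_vf and its interface are assumptions for illustration, and the sketch reads the rule as "the largest type among those that fit into the register".

```c
#include <assert.h>
#include <stddef.h>

/* Returns the VF for a loop body that uses the given data type sizes
   (in bytes) on a target with vector registers of vector_bytes bytes.
   Hypothetical helper: VF = vector length for the largest type found. */
size_t choose_vf(size_t vector_bytes, const size_t *type_sizes, size_t n)
{
    assert(type_sizes != NULL && n > 0);

    size_t largest = 0;
    for (size_t k = 0; k < n; k++) {
        /* Only types that fit into the vector register are considered. */
        if (type_sizes[k] <= vector_bytes && type_sizes[k] > largest)
            largest = type_sizes[k];
    }
    assert(largest > 0); /* at least one type must fit */

    return vector_bytes / largest;
}
```

Under these assumptions, a loop mixing 4-byte and 8-byte types on a 16-byte register would be vectorized with a VF of 2, so that one vector iteration of the widest type fills exactly one register.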
With the information obtained in the loop analysis step, the algorithm continues with the vectorization of the header of the loop. At this point, the lower bound, the upper bound and the step of the induction variable of the new main vector loop are computed according to the rules described in Figure 3.5. The lower bound is processed as shown in Figure 3.5a. If the loop being vectorized is the first loop of the resulting set of vector loops, the original lower bound is maintained in the vector loop. Otherwise, the lower bound of that loop is removed, and the induction variable starts from the value it had at the end of the execution of the previous loop, as happens for the main vector loop in Figure 3.4b. The upper bound and the step of the main vector loop are computed as shown in Figure 3.5b and Figure 3.5c, respectively, as a function of the vectorization factor, the upper bound and the step of the original scalar loop.

function VEC_LB
    if first loop then
        new_lb = lb
    else
        new_lb = NULL
    end if
    return new_lb
end function
(a) Lower bound

function VEC_UB
    new_ub = ub − (VF − 1) × st
    return new_ub
end function
(b) Upper bound

function VEC_ST
    new_st = VF × st
    return new_st
end function
(c) Step

Figure 3.5: Some of the rules to compute the new lower bound (a), upper bound (b) and step (c) of the induction variable of the main vector loop. lb, ub, st and VF are the lower bound, upper bound and step of the scalar loop, and the vectorization factor, respectively.
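To see why the rules of Figure 3.5b and Figure 3.5c preserve the iteration space, the sketch below applies them to a loop of the form for (i = lb; i < ub; i += st). It assumes st > 0 and an exclusive upper bound, and counts the scalar iterations covered by the main vector loop followed by a scalar epilogue loop. The function names are hypothetical.

```c
#include <assert.h>

/* Rule (b): a vector iteration starting at i touches i, i + st, ...,
   i + (vf - 1) * st, so all of them are below ub iff i < new_ub. */
long vec_ub(long ub, long st, long vf) { return ub - (vf - 1) * st; }

/* Rule (c): each vector iteration advances vf scalar steps. */
long vec_st(long st, long vf) { return vf * st; }

/* Counts the scalar iterations executed by the main vector loop plus
   the scalar epilogue loop (assumes st > 0 and the condition i < ub). */
long covered_iterations(long lb, long ub, long st, long vf)
{
    assert(st > 0 && vf > 0);
    long count = 0;
    long i = lb; /* rule (a): the first loop keeps the original bound */

    /* Main vector loop: new upper bound and new step. */
    for (; i < vec_ub(ub, st, vf); i += vec_st(st, vf))
        count += vf; /* one vector iteration = vf scalar iterations */

    /* Epilogue loop: no lower bound, i continues from the loop above. */
    for (; i < ub; i += st)
        count += 1;

    return count;
}
```

With lb = 0, ub = 20, st = 3 and a VF of 4, the new upper bound is 11 and the new step is 12. The main vector loop therefore runs once (covering i = 0, 3, 6, 9) and the epilogue loop computes i = 12, 15, 18, which together are the 7 iterations of the scalar loop.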
Finally, the code within the body of the loop is vectorized using the VF, as we describe in Section 3.5.4, Section 3.5.5, and Section 3.5.6.