4.1.2 Objectives
The main goal of this proposal is to increase the exploitation of SIMD instructions by enabling the vectorization of code that the compiler does not auto-vectorize. The key idea is to shift to the programmer part of the responsibility for the vectorization process, which is currently assumed by the compiler. Thus, the programmer guides the compiler's vectorization process by indicating which code regions are safe to vectorize and should be vectorized.
To achieve this goal in a way that is generic, efficient and easy for programmers to use, we propose a set of SIMD extensions to version 3.1 of the OpenMP programming model. We chose OpenMP because it is a high-level, architecture-independent programming model that allows programmers to efficiently describe several kinds of parallelism. This description is expressed with compiler directives that do not require changes to the original serial code of the application. The SIMD extensions that we propose bring SIMD parallelism and vectorization to this programming model. With these extensions, programmers can describe SIMD parallelism and even how it interacts with the other kinds of parallelism available in OpenMP. Moreover, we propose a set of optional clauses that provide the compiler with additional tuning information, which may lead to more optimized vector code.
4.2 Standalone SIMD Directive
The simplest construct that we propose to describe SIMD parallelism is the standalone SIMD directive. This construct delimits a code region that, according to the programmer, is safe to vectorize. These regions of code can be for-loops and functions (declarations and definitions). Figure 4.2 shows the syntax of the SIMD directive for C/C++.
This directive instructs the compiler to vectorize the annotated code, relaxing some of the strict restrictions that prevent vectorization in fully automatic approaches (see Section 4.1.1). As a result, the compiler does not have to determine whether vectorizing that code is safe, or whether it is profitable according to its cost models. It assumes that vectorization can be safely applied, disregarding potential data dependences, pointer aliasing, function call limitations and cost models.
It is important to note that this directive must be understood as a hint to the compiler. Although the programmer is now responsible for providing the compiler with the right information, each compiler is free to decide not to vectorize the annotated code. For instance, this could happen if the compiler cannot ensure that the resulting vector code will be correct, or if the target architecture does not support a SIMD version.
#pragma omp simd [clause [clause] ...] new-line
    for-loop | function-decl | function-def
Figure 4.2: C/C++ syntax of the standalone simd construct

#pragma omp simd
float max(float a, float b)
{
    return a > b ? a : b;
}

void update(int N, const float *a, float *b)  /* illustrative enclosing function */
{
    #pragma omp simd
    for (int i = 0; i < N; i++)
    {
        b[i] = max(a[i], b[i]);
    }
}
Listing 4.2: Simple example using the standalone SIMD directive

When a function is annotated with the SIMD directive, the compiler will try to generate a vector version of that function, converting all parameters and the return value to vector types by default. When the SIMD directive is applied to a for-loop, the compiler will try to vectorize the loop. If a SIMD-annotated function is called inside a SIMD-annotated loop, the compiler will emit a call to the vector version of that function.
Listing 4.2 shows an example of the SIMD directive applied to a loop and to a function definition. Without the simd annotation, the compiler would not be able to vectorize the loop automatically unless the max function were inlined, as shown in Section 4.1.1. Even if the loop were vectorized, doing so would entail scalarizing the max call, i.e., extracting the scalar elements of the vector variables used as arguments, invoking the scalar function max once per element with these scalar arguments, and building a vector from the scalar results of the calls. This approach would be highly inefficient.
However, by using the SIMD construct on the loop, we are asking the compiler to vectorize it regardless of this performance concern. Therefore, the compiler will generate a vector version of the loop using some vectorization strategy, such as the one described in Chapter 3. The resulting vector code will normally contain at least one loop with vector instructions. As described in Section 3.5, it is also very common for the vector code to contain an epilogue loop that computes the remaining iterations not covered by the main vector loop. Furthermore, a prologue loop is sometimes added before the main vector loop to compute some scalar iterations for optimization purposes, for example, to align memory accesses. Additionally, by annotating the function definition, we assure the compiler that a compatible vector version of max will be ready to use instead of the scalar version. Using the vector version of the function