4.8 Summary and Discussion
4.8.2 OpenMP 4.0
As we introduced in Section 4.6, OpenMP 4.0 incorporated SIMD extensions as a result of this work and in collaboration with Intel [83]. The most significant difference from our proposal is the definition of a for-SIMD directive instead of the SIMD-for directive. To the best of our knowledge, the reasons for choosing the for-SIMD construct in OpenMP 4.0 were:
• The scheduling of iterations does not change from OpenMP scalar constructs to SIMD constructs.
• Implementing the SIMD-for construct posed a significant challenge, according to some vendors.
Both arguments are perfectly acceptable, but we want to give our opinion on the first one. We agree that, in our proposal, we had to redefine the scheduling of iterations for SIMD constructs, as we described in Section 4.3.1. However, we do not see any inconvenience or inconsistency in doing this, quite the contrary. From our point of view, there is no advantage in being able to specify a chunk of iterations that is not a multiple of the vector length when we are asking the compiler to vectorize the code. If the chunk is not a multiple of the vector length, the remaining iterations will be executed less efficiently, possibly even with scalar instructions, depending on the architecture. In fact, keeping the schedule of scalar iterations for SIMD constructs entails the following issues:
• Programmers must specify a chunk of iterations that is a multiple of the vector length for an optimal execution.
• Default chunk values lead to inefficient scenarios.
Regarding the first issue, specifying the chunk of iterations as a function of the vector length is not portable across architectures with different vector lengths. In addition, the level of abstraction suffers, as programmers are required to be aware of such a low-level detail of the architecture.
The second issue can be illustrated with some examples. A dynamic schedule with the default chunk (one iteration) will not execute any full vector iterations. A static schedule will have a chunk that is a multiple of the vector length only if the total number of iterations is a multiple of the vector length. If this condition is not satisfied, the remaining iterations could be executed in a scalar fashion. This is particularly critical when the number of iterations of the loop is low and the number of threads is considerably high. This happened in the evaluation of the Cholesky benchmark, even when using four threads. From our point of view, a more suitable approach is for a lower number of threads to execute full SIMD iterations that make use of full vector registers, as in our proposal. Furthermore, these remaining iterations that are not a multiple of the vector length introduce an offset across chunks of iterations that leads to problems with the alignment of memory operations. Consequently, either unaligned memory accesses will have to be used, or more iterations will have to be executed less efficiently by a prologue loop that aligns the accesses.
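The arithmetic behind the static-schedule case can be sketched in C as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the static schedule is assumed to give one extra iteration to the first n % t threads (the exact split is implementation-defined), and alignment effects are ignored. The sketch counts how many iterations fall outside full vectors when chunks ignore the vector length, and when whole vector iterations are the scheduling unit.

```c
#include <assert.h>

/* Iterations that do not fill a whole vector when n iterations are split
   statically over t threads without regard to the vector length vl.
   Assumption: the first n % t threads receive one extra iteration. */
static long leftover_scalar_schedule(long n, long t, long vl)
{
    assert(n >= 0 && t > 0 && vl > 0);
    long base = n / t;
    long extra = n % t;
    long leftover = 0;
    for (long tid = 0; tid < t; tid++) {
        long chunk = base + (tid < extra ? 1 : 0);
        leftover += chunk % vl;   /* tail of this thread's chunk */
    }
    return leftover;
}

/* Iterations that do not fill a whole vector when full vector iterations
   are distributed among threads: only the global tail remains. */
static long leftover_vector_schedule(long n, long vl)
{
    assert(n >= 0 && vl > 0);
    return n % vl;
}
```

For example, with 1000 iterations, 8 threads and 16 elements per vector, every thread receives 125 iterations, of which 13 do not fill a vector, so 104 iterations in total fall outside full vectors. Scheduling whole vector iterations leaves only 8.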
In conclusion, we think that the approach adopted in OpenMP 4.0 might not be the most appropriate for programmers. The optimal execution of the application depends heavily on the intervention of programmers, and default behaviors can lead to poor performance that is difficult for inexperienced programmers to understand.
Chapter 5
Optimizing Overlapped Vector Memory Accesses
5.1 Introduction
This chapter introduces our proposal for a compiler code optimization that exploits register-to-register vector instructions to improve overlapped vector memory loads. This kind of vector load reads scalar elements from memory that have already been read by other vector loads. Overlapped vector loads arise naturally after the vectorization of algorithms based on neighboring computation, e.g., stencil computation.
In addition to this optimization, and in the context of the OpenMP SIMD extensions proposed in Chapter 4, we introduce a new clause for enabling, disabling and tuning this optimization on demand. This clause gives programmers further control over the compiler optimization process.
We implement a prototype of this vector code optimization in our vectorization infrastructure described in Chapter 3, targeting the Intel Xeon Phi coprocessor. Then, we evaluate the performance of the optimization on a set of stencil codes highly optimized to run on this massively parallel architecture.
5.1.1 Motivation
As we discussed in Chapter 1, SIMD instructions have become more relevant in the instruction set architecture (ISA) of the latest multi- and many-core processors. They make it possible to achieve high computational performance at moderate energy consumption. Examining past and present SIMD instruction sets, we observe a trend of widening vector registers and introducing more powerful and flexible instructions generation after generation. These new instructions try to overcome limitations of previous instruction sets and issues that naturally arise in the process of vectorizing scalar code.
Advanced SIMD Instructions
However, some of these powerful SIMD instructions are so sophisticated that there is no direct translation from scalar instructions. Their usage may require an aggressive transformation of the code. Therefore, their exploitation can be hard, not only for programmers but also for compilers. This can lead to the underuse of these instructions. The following are only three examples of this kind of instruction available in current SIMD instruction sets:
Inter-register shift instructions: These instructions emerged to emulate vector memory operations on unaligned memory addresses in architectures without specific hardware support for them [54]. They are also included in architectures with specific hardware support, to mitigate the high latency of these unaligned memory operations in comparison to their aligned counterparts [15]. They can also be used to implement circular single-register permutations. The valignd instruction is one of these instructions, available in several Intel SIMD instruction sets.
Instructions on multidimensional matrices: There are specific vector instructions that implement operations that interpret the vector register as a 2D matrix. For instance, the NEON instruction set contains instructions that perform a transposition on a 2D matrix or the computation of its determinant [11].
Generic permutations: These instructions are used to compose a vector register by reorganizing scalar elements from one or multiple vector registers. For example, the AVX-512 instruction set [74] features a wide variety of these instructions with different permutation patterns and latencies.
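As an illustration of the first kind of instruction, the following C sketch emulates an inter-register shift with scalar code and uses it both to build an unaligned vector from two aligned ones and to rotate a single register. It is a sketch under assumptions: VF, shift_pair and unaligned_load_emulated are hypothetical names, the array is assumed to start at an aligned address, and the operand order and encoding of a real instruction such as valignd are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VF 8  /* illustrative vector length in elements (assumption) */

/* Scalar emulation of an inter-register shift: view lo followed by hi as
   one sequence of 2*VF elements and copy the VF elements that start at
   position `shift` into dst. dst must not alias lo or hi. */
static void shift_pair(float dst[VF], const float lo[VF],
                       const float hi[VF], size_t shift)
{
    assert(shift <= VF);
    for (size_t k = 0; k < VF; k++) {
        size_t src = k + shift;
        dst[k] = (src < VF) ? lo[src] : hi[src - VF];
    }
}

/* Emulates an unaligned vector load of a[idx .. idx+VF-1] with two aligned
   loads and one shift. The caller must guarantee that the two aligned
   blocks starting at idx - idx % VF lie inside the array. */
static void unaligned_load_emulated(float dst[VF], const float *a, size_t idx)
{
    size_t base = idx - idx % VF;           /* previous aligned position */
    float lo[VF], hi[VF];
    memcpy(lo, &a[base], sizeof lo);        /* aligned load */
    memcpy(hi, &a[base + VF], sizeof hi);   /* aligned load */
    shift_pair(dst, lo, hi, idx % VF);
}
```

Passing the same register as both sources, as in shift_pair(r, v, v, 3), yields a circular rotation of v by three elements.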
Compiler Limitations
As stated in Chapter 4, production compilers still have strong limitations when dealing with vectorization in general. If these limitations appear even in the vectorization of simple scalar code, where there is a direct correspondence between scalar and vector instructions, the exploitation of advanced vector instructions that require special code transformations is even harder. Compilers might not exploit these advanced vector instructions mainly for the following three reasons:
• Lack of compiler technology such as analyses, cost models, idiom recognition and code transformations. This technology is necessary to detect the applicability of a particular advanced vector instruction and to generate the appropriate code to use it.
• Aggressiveness of optimizations and code transformations. If they are applied indiscriminately, code transformations can jeopardize the correctness of the code, degrade performance or prevent other more relevant optimizations.
#pragma omp simd aligned(a,b:64)
for(i=0; i <= N-points-1; i++)
{
    float tmp = 0.0f;

    for(j=0; j <= points-1; j++)
        tmp += a[i+j];

    b[i] = tmp / points;
}
(a) Scalar code with a standalone SIMD directive on the outer loop
for(i=0; i <= N-points-VF; i+=VF)
{
    floatVF tmp = vpromotionVF(0.0f);
    for(j=0; j <= points-1; j++)
        tmp += unaligned_vloadVF(&a[i+j]);
    tmp = tmp / vpromotionVF(points);
    aligned_vstoreVF(&b[i], tmp);
}
(b) Vector pseudo-code after the outer loop vectorization. VF is the vectorization factor
Overlapped Vector Loads (the original figure distinguishes re-loaded elements from elements loaded for the first time)
Vector load of a[0+j]:
j = 0:  a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7]
j = 1:  a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8]
j = 2:  a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9]
j = 3:  a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10]
(c) Original overlapped vector loads of the array a after the outer loop vectorization
Figure 5.1: Moving Average motivating example
• Interference with manual optimizations. Code transformations can interfere with the manual optimizations of programmers who are fine-tuning the code.
As a result, advanced programmers pursuing the best performance for their applications are compelled to apply these optimizations manually [135]. This also implies vectorizing the code by hand using low-level hardware-specific intrinsics. This task is cumbersome, error-prone and inefficient in terms of programming effort, because some of these optimizations might be implemented and then discarded if they penalize performance.
Practical Example using SIMD Extensions for OpenMP
[Figure 5.2 plot: speed-up (y-axis, 0.00 to 1.60) versus # points (x-axis: 32, 64, 128) for the versions icc for-SIMD, mcc SIMD-for and mcc SIMD-for + manual opt.]
Figure 5.2: Speed-up of the Moving Average motivating example on an Intel Xeon Phi coprocessor 7120, running with 183 threads (61 cores, 3 threads/core; the best experimental thread configuration). N=536M, points=128. Baseline: mcc SIMD-for
The SIMD extensions that we proposed in the context of OpenMP in Chapter 4 are aimed at reducing compiler auto-vectorization issues and easing SIMD exploitation. Nevertheless, these extensions still leave room for performance improvements. For instance, Figure 5.1a shows a snippet of the scalar code of our motivating example: the Moving Average benchmark. As depicted, we use a standalone SIMD directive to inform the compiler that the outer loop is safely vectorizable. Figure 5.1b shows the output vector pseudo-code that the compiler generates following the SIMD annotations. However, as we can see in Figure 5.1c, vector loads on the array a result in an inefficient memory access pattern. These vector loads are overlapped: they redundantly read some scalar elements that have already been loaded by previous vector loads.
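The redundancy of these overlapped loads can be quantified with a small counting sketch in C. It emulates the unaligned vector loads of a[i+j] for a single outer vector iteration and counts how many scalar elements were already read by an earlier load of that iteration. The function name count_reloaded is hypothetical, and overlaps between different outer iterations are deliberately ignored.

```c
#include <assert.h>
#include <stdlib.h>

/* For one outer vector iteration (lanes i .. i+vf-1, taken at i = 0),
   counts the scalar elements read by the vector loads of a[i+j],
   j = 0 .. points-1, that an earlier load of the same iteration had
   already read. Returns -1 if the bookkeeping array cannot be allocated. */
static long count_reloaded(long vf, long points)
{
    assert(vf > 0 && points > 0);
    long span = vf + points - 1;   /* a[0] .. a[span-1] are touched */
    unsigned char *seen = calloc((size_t)span, 1);
    long reloaded = 0;

    if (seen == NULL)
        return -1;

    for (long j = 0; j < points; j++) {          /* one vector load per j */
        for (long lane = 0; lane < vf; lane++) {
            if (seen[j + lane])
                reloaded++;                      /* element read again */
            seen[j + lane] = 1;
        }
    }
    free(seen);
    return reloaded;
}
```

With the values drawn in Figure 5.1c (8 lanes and j = 0..3), 21 of the 32 elements read are re-loads. In general, vf * points elements are read while only vf + points - 1 distinct elements are needed.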
The overlap optimization that we will introduce later in this chapter applies an aggressive code transformation to improve these overlapped vector memory loads. Unfortunately, at this point, the only option for programmers is to apply this optimization by hand. This process entails first vectorizing the code by hand using intrinsics or assembly, and then applying the optimization to the resulting vector code.
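To give an intuition of what such a manual transformation may look like, the following C sketch emulates vector registers with small arrays and replaces the overlapped loads of the Moving Average example with emulated register-to-register shifts. It is only a sketch of the general idea stated in the introduction of this chapter, not the transformation presented later: VF, shift_pair and moving_average_overlap are hypothetical names, memcpy stands in for a vector load, and points is assumed to be a multiple of VF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VF 8  /* illustrative vectorization factor (assumption) */

/* Scalar reference with the loop bounds of Figure 5.1a. */
static void moving_average_ref(const float *a, float *b, size_t n, size_t points)
{
    for (size_t i = 0; i + points < n; i++) {
        float tmp = 0.0f;
        for (size_t j = 0; j < points; j++)
            tmp += a[i + j];
        b[i] = tmp / (float)points;
    }
}

/* Emulated register-to-register shift: elements shift .. shift+VF-1 of
   the sequence formed by lo followed by hi. */
static void shift_pair(float dst[VF], const float lo[VF],
                       const float hi[VF], size_t shift)
{
    for (size_t k = 0; k < VF; k++)
        dst[k] = (k + shift < VF) ? lo[k + shift] : hi[k + shift - VF];
}

/* Outer-loop-vectorized version that loads each block of VF elements of a
   once per outer iteration and derives the overlapped vectors with shifts.
   Returns the number of emulated vector loads issued. */
static size_t moving_average_overlap(const float *a, float *b, size_t n, size_t points)
{
    assert(points > 0 && points % VF == 0);   /* simplifying assumption */
    size_t loads = 0;
    size_t i = 0;

    for (; i + VF + points <= n; i += VF) {   /* i <= N-points-VF */
        float lo[VF], hi[VF], v[VF], tmp[VF] = {0.0f};
        memcpy(lo, &a[i], sizeof lo);
        loads++;
        for (size_t jb = 0; jb < points; jb += VF) {
            memcpy(hi, &a[i + jb + VF], sizeof hi);
            loads++;
            for (size_t s = 0; s < VF; s++) {
                shift_pair(v, lo, hi, s);     /* a[i+jb+s .. i+jb+s+VF-1] */
                for (size_t k = 0; k < VF; k++)
                    tmp[k] += v[k];
            }
            memcpy(lo, hi, sizeof lo);        /* register move, no memory load */
        }
        for (size_t k = 0; k < VF; k++)
            b[i + k] = tmp[k] / (float)points;
    }
    for (; i + points < n; i++) {             /* scalar epilogue */
        float tmp = 0.0f;
        for (size_t j = 0; j < points; j++)
            tmp += a[i + j];
        b[i] = tmp / (float)points;
    }
    return loads;
}
```

In this sketch each outer vector iteration issues 1 + points/VF loads instead of the points unaligned loads of Figure 5.1b, at the cost of one shift per inner iteration. Whether that trade-off pays off depends on the relative cost of loads and shifts on the target architecture.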
We followed these steps but, thanks to our source-to-source vectorization infrastructure introduced in Chapter 3, we did not have to vectorize the code by hand. We used the output code with intrinsics generated by the Mercurium compiler. Then, we applied the overlap optimization by hand. Figure 5.2 shows the speed-up of three versions running on an Intel Xeon Phi coprocessor. Version icc for-SIMD is compiled with the Intel C/C++ compiler 15.0.1 using a for-SIMD directive on the outer loop. Version mcc SIMD-for is compiled with the Mercurium compiler using a SIMD-for directive on the outer loop. In version mcc SIMD-for + manual opt., we optimized the overlapped vector loads using the vector code generated by the previous version. As we can see, the Intel compiler version yields slightly lower performance than the version compiled with the Mercurium compiler. This is due to the differences between the for-SIMD and the SIMD-for constructs described in Chapter 4. The manual optimization of the code greatly outperformed both SIMD-