The Overlap Optimization in OpenMP

In the context of our proposal on SIMD constructs for OpenMP introduced in Chapter 4, we present a new clause named overlap. This clause instructs the compiler where and how to apply the overlap optimization (see Section 5.4). It has the following syntax:

where:

• var list is a list of array and/or pointer variables that the overlap optimization will target.

• min group loads is the minimum number of ovls per loop iteration necessary to constitute an overlap group for each target variable (CARD-CONSTR).

• max groups is the maximum number of overlap groups that can be generated for each target variable (GROUPS-CONSTR).

The overlap clause lets programmers benchmark systematically with different overlap groups and constraints that affect the code generated by the overlap optimization. The parameters min group loads and max groups are optional and independent of each other, and their values must be constants known at compile time. A value of 0 indicates an undefined parameter. If they are undefined or not specified, their values are implementation defined. We decided not to expose the control of vector registers to programmers because they are a low-level hardware component.

The compiler could use heuristics to determine default values for these parameters or fix them by default (see our implementation decisions in Section 5.7), but it is hard to obtain optimal values for them automatically. In addition, the overlap clause is necessary to tell the compiler where and over which arrays to apply the optimization to get better performance. Determining this automatically at compile time is even harder because of the number of factors that must be taken into account: selected overlap groups, loop unroll factor, performance/overhead ratio of the overlap transformation, register and memory unit pressure, and instruction scheduling, among others. These facts support the use of the overlap clause and make hints from programmers indispensable for applying the overlap optimization optimally.

The overlap clause must be used in conjunction with SIMD constructs defined in OpenMP, as depicted in Figure 5.10. Multiple overlap clauses can be used to specify different sets of parameters for different variables.

5.6 User Guidelines

In Section 4.5, we described several hints that may help with the application of the new SIMD directives proposed in Chapter 4. Once the vectorization strategy has been chosen using these directives, the next step is to enable the overlap optimization with the overlap clause if ovls are present in the vectorized code, as described in Section 5.3.2.

The general advice for programmers pursuing the best performance is to apply the overlap optimization on those arrays with a larger number of ovls that are as close as possible to each other. According to the vectorization scenario, the meaning of close is defined in the following way:

Innermost loop vectorization: ovls are close to each other when their indexes differ only by a small constant offset, such as $\overrightarrow{a[i]}$ and $\overrightarrow{a[i + 1]}$, where i is the induction variable of the loop. The smaller this offset is, the closer the ovls are.

Outer loop vectorization: indexes of vls that could overlap across the inner loop iterations typically contain both loops' induction variables, such as $\overrightarrow{a[i + j]}$. The inner loop induction variable with a constant step can be seen as a constant offset, and then we can proceed as in the innermost loop scenario. Likewise, loads of the same array could also overlap within the same iteration ($\overrightarrow{a[i + j]}$ and $\overrightarrow{a[i + j + 1]}$).
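The notion of closeness above can be quantified with a small sketch. It assumes that each ovl is a unit-stride vector load of vl consecutive elements; the function name shared_elements and its parameters are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Number of array elements shared by two unit-stride vector loads of
 * `vl` elements each, whose start indexes differ by the constant
 * `offset` (for example, a[i] and a[i + 1] give offset = 1).
 * Assumption: both loads read `vl` consecutive elements. */
int shared_elements(int vl, int offset)
{
    assert(vl > 0);
    int distance = abs(offset);
    /* Loads that start `vl` or more elements apart share nothing. */
    return distance < vl ? vl - distance : 0;
}
```

Under this assumption, two loads of 8 elements with offset 1 share 7 elements, whereas with offset 8 they share none, which is why a smaller offset means closer ovls.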

In the outer loop vectorization scenario of the Moving Average example in Figure 5.1, the index of the array a contains both loops' induction variables, so its vls will be ovls, as described previously.

Again, the only way to know which arrays with ovls are the best candidates for the overlap optimization in a particular code is trial and error. The parameter min group loads may also help in this process. For example, in an application with several groups of 2 and 4 ovls, programmers may run the application twice: once with the parameter unset and once with it fixed to 4. In this way, programmers would learn whether overlap groups with 2 ovls increase or decrease performance. In those outer loop vectorization scenarios where discarding groups is not necessary, programmers could use their application knowledge to set min group loads. For example, in Moving Average, we know that the number of points could be small. To exploit the overlap optimization in these small cases as well, programmers could find by benchmarking the minimum number that yields the best performance and overcomes the overhead of the code transformation.

The parameter max groups can be used in the same way for the same purpose. Thus, programmers could benchmark their application increasing this parameter to find out which value offers the best performance. As vector registers are usually a scarce resource, a large number of groups is generally not recommended.
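The effect of both parameters on a set of candidate overlap groups can be sketched as follows. This is a simplified model and not the algorithm of our implementation: the function groups_kept is hypothetical, the identifiers min_group_loads and max_groups are C spellings of the clause parameters, candidates are assumed to be considered in the order given, and a value of 0 is treated as "no constraint" although the actual default is implementation defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts how many candidate overlap groups of one
 * target variable would be formed under the two constraints.
 * group_sizes[k] is the number of ovls per loop iteration in candidate k.
 * Assumption: 0 means "parameter undefined" and is modelled here as
 * "no constraint"; a real compiler would pick its own default. */
size_t groups_kept(const int *group_sizes, size_t n,
                   int min_group_loads, int max_groups)
{
    assert(min_group_loads >= 0 && max_groups >= 0);
    size_t kept = 0;
    for (size_t k = 0; k < n; ++k) {
        if (max_groups != 0 && kept >= (size_t)max_groups)
            break;              /* maximum number of groups reached */
        if (group_sizes[k] >= min_group_loads)
            ++kept;             /* enough ovls to constitute a group */
    }
    return kept;
}
```

For candidates with 2, 4, 2 and 4 ovls, leaving both parameters at 0 keeps the four groups in this model, while setting min_group_loads to 4 keeps only the two groups of 4 ovls. Comparing the run times of both configurations isolates the contribution of the groups with 2 ovls.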

Providing as much information as possible about the runtime values of variables in the code can help the compiler make better decisions in the overlap algorithm. For example, programmers could use the suitable or aligned clauses introduced in Chapter 4 to provide information about loop bounds and the alignment of pointers. This information could help the compiler optimize the unroll factor and group sizes.

Programmers can also follow these guidelines to apply the overlap optimization manually. However, manual application requires the code to be vectorized first. This initial vectorization step could also be done by hand or with a source-to-source vectorization tool, such as the one described in Chapter 3.
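As an illustration of what a manual application might look like, the following sketch emulates an outer-loop vectorized moving average in scalar C. It is only one possible way of reusing overlapped data and is not the code generated by our implementation. The kernel shape (out[i] is the mean of a[i] to a[i + w - 1]), the vector length VL = 4, the function names, and the use of small arrays in place of vector registers are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 4 };   /* assumed vector length, in elements */

/* Scalar reference: out[i] = mean of a[i .. i + w - 1].
 * The array a must hold n + w - 1 elements. */
void moving_average_ref(const float *a, float *out, size_t n, size_t w)
{
    assert(w > 0);
    for (size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (size_t j = 0; j < w; ++j)
            sum += a[i + j];
        out[i] = sum / (float)w;
    }
}

/* Outer-loop vectorization emulated with VL-element arrays.
 * A plain vectorized version would load a[i + j .. i + j + VL - 1]
 * from memory in every j iteration, so consecutive loads overlap.
 * Here the first vector is loaded once and each later one is derived
 * from the previous one by a shift plus one new element.
 * Assumption: n is a multiple of VL. */
void moving_average_overlap(const float *a, float *out, size_t n, size_t w)
{
    assert(w > 0 && n % VL == 0);
    for (size_t i = 0; i < n; i += VL) {
        float vec[VL], acc[VL];
        for (size_t l = 0; l < VL; ++l) {      /* single full vector load */
            vec[l] = a[i + l];
            acc[l] = 0.0f;
        }
        for (size_t j = 0; j < w; ++j) {
            for (size_t l = 0; l < VL; ++l)
                acc[l] += vec[l];              /* vec holds a[i+j .. i+j+VL-1] */
            if (j + 1 < w) {
                for (size_t l = 0; l + 1 < VL; ++l)
                    vec[l] = vec[l + 1];       /* emulated register shift */
                vec[VL - 1] = a[i + j + VL];   /* only one new element is read */
            }
        }
        for (size_t l = 0; l < VL; ++l)
            out[i + l] = acc[l] / (float)w;
    }
}
```

Both functions add the elements of each window in the same order, so they are expected to produce identical results; the second one reads each element of a once per block of VL outputs instead of once per overlapped load. In real SIMD code, the shift would be expressed with the shuffle or alignment instructions of the target architecture.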