
In this section we give an overview of existing concepts and approaches that allow one to exploit the SIMD extensions of a CPU. As a general remark, there have been a number of attempts to unify diverse SIMD instruction sets behind a portable level of abstraction. Our main claim is that, in order to maximise the optimisation potential of programs and to share a large portion of the code used in auto-vectorisers, the abstraction layer has to be implemented in a compiler as a language extension, not as a library or a novel programming language.

3.5.1

New languages e.g. ISPC

ISPC demonstrates quite good performance; however, we believe that it is not the best possible way to program SIMD extensions explicitly. The level of abstraction is too high, which means that many decisions are taken by the compiler without the programmer being able to control them. Expressing code as kernels, similarly to OpenCL or CUDA, opens up the potential to reuse the same code beyond SIMD architectures, but at the same time requires significant program rewrites. For the time being, the set of supported SIMD architectures is limited to SSE2, SSE4, AVX, AVX2, and Xeon Phi.

3.5.2

Library-based solutions

The main drawback of any library approach is that dispatching from the API down to intrinsics or assembly has to be implemented at the level of the library. This means that parts of the compiler have to be reimplemented in the library and that, for example, auto-vectorisers cannot benefit directly from the library. Likewise, if a compiler cannot optimise across intrinsics, this functionality has to be implemented in the library as well. Ideally we would like to see a symbiosis between auto-vectorisers and explicit SIMD. One practical benefit of the library approach is the ability to run experiments very quickly, but in general library solutions do not seem to scale.
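The dispatching burden can be illustrated with a minimal sketch of a library entry point. The function name `simdlib_add4` is hypothetical and does not come from any existing library; the SSE path uses the standard `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` intrinsics from `<xmmintrin.h>`, and every other target falls back to scalar code.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* Hypothetical library API: out[i] = a[i] + b[i] for i in 0..3. */
void simdlib_add4(float *out, const float *a, const float *b)
{
    assert(out != NULL && a != NULL && b != NULL);
#if defined(__SSE__)
    /* x86 path: the library, not the compiler, selects the intrinsic. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
#else
    /* Fallback for targets the library has no mapping for. */
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Every new instruction set adds another preprocessor branch of this kind to every operation in the library, which is the scaling problem described above.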

3.5.3

Intrinsics to intrinsics mapping

This could have been a valid approach; however, hardware changes quite rapidly, and if we started to create mappings, we would need either to create every-to-every instruction set mappings or to choose a canonical one. Neither option is desirable. Moreover, creating such a mapping requires a lot of work: for example, the mapping described in [152] is a 1 MB file and does not include the AVX extensions.

3.5.4

C++ standard proposal

Independently of our work, A. Naumann et al. indicated the importance of having vectorisation as a language feature in [8]. Their line of reasoning is very similar to ours: vectorisation is important, and it cannot realistically be automated or fitted into existing parallel models. They demonstrate a practical micro-kernel used at CERN and use the same arguments, i.e. more tightly integrated optimisations and implementation reusability, when discussing why vectorisation should be a part of the language.

3.5.5

Automatic vectorisation

The downside of automatic vectorisation is the lack of opportunities to influence the decisions of the vectoriser. The number of supported patterns is always limited, and in cases of non-trivial data dependencies the vectoriser gives up. In order to get the best performance from an auto-vectoriser for floating-point operations, one has to specify flags that relax conformance to the IEEE floating-point standard [104]. As an example, consider the case of horizontal sums:

Scalar loop:

    float *array, result;
    for (i = 0; i < N; i++)
        result += array[i];

Vectorised loop:

    float *array, result;
    float_vec reg;
    /* Assume N % 4 == 0 */
    for (i = 0; i < N; i += 4)
        reg += *(float_vec *)&array[i];
    result = reg[0] + reg[1]
           + reg[2] + reg[3];

According to the IEEE floating-point standard, the order of operations can change the result; hence the above optimisation is illegal. In order to legalise it in GCC, one needs to specify the -ffast-math flag when compiling, and it is impossible to apply it to a given loop only. This means that, in order to make the auto-vectoriser perform the optimisation, a programmer has to switch on a flag that potentially affects the semantics of all floating-point operations in the program.
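The effect of reordering can be reproduced in plain C. The following is a minimal sketch, not the vectoriser's actual output: `sum_four_lanes` emulates the four vector lanes with an ordinary array, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sums in source order, as the scalar loop does. */
float sum_sequential(const float *array, size_t n)
{
    float result = 0.0f;
    for (size_t i = 0; i < n; i++)
        result += array[i];
    return result;
}

/* Sums in the order used by a four-lane vectorised loop:
   lane k accumulates array[k], array[k + 4], ... and the
   lanes are combined at the end. */
float sum_four_lanes(const float *array, size_t n)
{
    float reg[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        for (size_t k = 0; k < 4; k++)
            reg[k] += array[i + k];
    return reg[0] + reg[1] + reg[2] + reg[3];
}
```

For the input {1e30f, 1, 1, 1, -1e30f, 1, 1, 1} the sequential order absorbs the first three ones into 1e30f and yields 3, whereas the lane order cancels the two large values first and yields 6.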

The auto-vectoriser also cannot properly handle loops that contain control flow, e.g. conditions and gotos, or uncountable loops, e.g. while (*x != NULL).
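The two problematic loop shapes look as follows. This is an illustrative sketch only; the function names are hypothetical, and whether a particular compiler version vectorises either loop is not claimed here.

```c
#include <assert.h>
#include <stddef.h>

/* Loop with a data-dependent condition in its body. */
float sum_positive(const float *x, size_t n)
{
    float sum = 0.0f;
    assert(x != NULL);
    for (size_t i = 0; i < n; i++)
        if (x[i] > 0.0f)          /* control flow inside the loop */
            sum += x[i];
    return sum;
}

/* Uncountable loop: the trip count is unknown until the
   terminating NULL entry is reached. */
size_t count_until_null(const char *const *x)
{
    size_t n = 0;
    assert(x != NULL);
    while (*x != NULL) {
        n++;
        x++;
    }
    return n;
}
```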

3.5.6

Virtual instruction set

The LLVA approach provides a portable standard for SIMD operations. However, this approach raises several practical and theoretical questions. Practically, the architecture exists only as a prototype with translators implemented for several ISAs. This means that, in order to integrate LLVA into any existing compiler, we would have to provide a translation from the intermediate language of the compiler to LLVA. Assuming that we did that, we would then have to implement the translators for all the targets we want to support. Keep in mind that LLVA provides instructions only for vector operations, which means that all the non-vector operations would have to be integrated into the representation as well. Assuming that we did that too, we would have to instruct our auto-vectoriser, and possibly other optimisations, to generate code using LLVA. How can we estimate the cost of an operation if we do not know the target architecture?

As several architecture classes are supported within LLVA, and there are mechanisms allowing careful tuning for each processor class, how efficient would it be to run LLVA code tuned for class A on class B?

3.5.7

OpenCL

In terms of portability our approach is very similar to OpenCL, and we borrow its syntax for SIMD operations; however, there are a number of important distinctions. The intent of OpenCL is very different from ours. OpenCL addresses a large-scale problem, trying to target many diverse architectures, e.g. SPMDs and GPUs, whereas our approach solves a single issue. The OpenCL C programming language is based on the ISO/IEC 9899:1999 C language standard (a.k.a. C99) [63], but it also introduces a number of restrictions. Most importantly, OpenCL tries to cover all the undefined or ambiguous cases of the C99 standard. For example, basic types like int, char and long get a fixed size; C99 in this case fixes only the relation between the type sizes, i.e. char ≤ int ≤ long. When defining the bit-shift operation e1 << e2, where e1 is N bits wide, OpenCL states that only the lower log2(N) bits of e2 are used during the operation; C99 in this case states that if e2 is negative or not less than N, the behaviour is undefined.
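The difference in shift semantics can be sketched for a 32-bit operand, where N = 32 and log2(N) = 5. The function names are hypothetical; `shl_masked` mimics the OpenCL rule in plain C, while `shl_c99` documents the precondition that C99 leaves to the programmer.

```c
#include <assert.h>
#include <stdint.h>

/* OpenCL-style left shift for a 32-bit operand: only the
   low 5 bits of the shift count are used. */
uint32_t shl_masked(uint32_t e1, uint32_t e2)
{
    return e1 << (e2 & 31u);
}

/* Plain C99 shift: the caller must guarantee e2 < 32. */
uint32_t shl_c99(uint32_t e1, uint32_t e2)
{
    assert(e2 < 32u);   /* an out-of-range count is undefined in C99 */
    return e1 << e2;
}
```

With the masked version a shift count of 33 behaves like a count of 1, so `shl_masked(1, 33)` is well defined and equals 2.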

Arguably, the restrictions of OpenCL increase the portability of programs but at the same time they remove backward-compatibility with ISO C code. Practically it means that existing C code may not work within the OpenCL compiler.

Technically, OpenCL provides a set of libraries and header files, but the actual compilation is done by a C compiler of the user's choice. This is a key difference from the approach we are taking, as we implement SIMD operations as an integral part of the C language and hence as a part of the compiler. Decoupling a framework from the compiler gives freedom in the choice of compiler, but in terms of SIMD operations we see the following problems with this approach:

1. In order to define a SIMD vector, OpenCL provides the typen construction, where n can be 2, 3, 4, 8 or 16 and type is a basic scalar type, e.g. char, int, float, etc. As there is no way to override the selection operator [] in C, OpenCL introduces a new scheme for enumerating vector components, introducing the notion of lo and hi halves and of components x, y, z, w, etc. The vector type is mapped to the hardware-specific SIMD vector type, or to a static array if SIMD accelerators are not present in the architecture. Such a design makes it complicated to support vectors of arbitrary length, as each typen is defined as a new structure and the chosen indexing scheme leads to a combinatorial explosion. Also, each time the length of the vector register doubles, the standard has to be amended. For example, it is currently impossible to define a char32 type, although it is supported by Intel AVX.

Our approach allows one to define a vector of arbitrary length, provided that the length is a power of two. To index elements we use the standard selection operator [], and during compilation vector operations are mapped to the longest vectors supported by the architecture.

2. Basic vector operations such as arithmetic, comparison and shuffling are either aliases for intrinsic functions or external functions defined in a library. From the performance point of view both cases are harmful, as they decrease the chance for optimisations. Library function calls prohibit even simple constant propagation; intrinsic functions normally do not participate in the optimisation cycle. If vector operations were inlined, the compiler could generate better code with respect to pipelining and register pressure.

3. OpenCL SDKs are mainly closed-source products which are released for some combination of hardware architecture and operating system. This means that there is a chance that all these products perform slightly differently. Our approach does not solve this problem fully, as code generators are unique for every hardware architecture as well; however, most of the optimisations happen in the middle-end. Also, as GCC is open source, anyone can identify the reason for undesired behaviour by means of code inspection or debugging.
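The vector types discussed in item 1 can be sketched with GCC's `vector_size` attribute, which Clang also accepts; treating this attribute as the concrete syntax is an assumption made here for illustration. The type names `float4` and `char32` and both function names are hypothetical. The size passed to `vector_size` is given in bytes, and elements are selected with the ordinary [] operator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type names; vector_size is given in bytes. */
typedef float float4 __attribute__((vector_size(16)));
typedef signed char char32 __attribute__((vector_size(32)));

/* Dot product of two 4-element arrays. */
float dot4(const float *a, const float *b)
{
    float4 va, vb, p;
    assert(a != NULL && b != NULL);
    memcpy(&va, a, sizeof va);     /* load without alignment assumptions */
    memcpy(&vb, b, sizeof vb);
    p = va * vb;                   /* element-wise multiplication */
    return p[0] + p[1] + p[2] + p[3];   /* ordinary [] selection */
}

/* Doubles 32 bytes in place using a single 32-element vector. */
void double32(signed char *data)
{
    char32 v;
    assert(data != NULL);
    memcpy(&v, data, sizeof v);
    v = v + v;                     /* split by the compiler if the target
                                      has no 256-bit registers */
    memcpy(data, &v, sizeof v);
}
```

The 32-element type needs no change to the language definition; on a target without 256-bit registers the compiler is expected to lower the operation to shorter vectors or scalar code.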