
There is a lot to gain in speed by using vector code if the algorithm allows parallel calculation. The gain depends on the number of elements per vector. The simplest and cleanest solution is to rely on automatic vectorization by the compiler. The compiler will vectorize the code automatically in simple cases where the parallelism is obvious and the code contains only simple standard operations. All you have to do is to enable the appropriate instruction set and relevant compiler options.
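As an illustration of a loop where the parallelism is obvious, here is a minimal sketch. The function name add_arrays is hypothetical, and __restrict is a common compiler extension rather than standard C++. Whether the loop is actually vectorized depends on the compiler, the optimization level, and the instruction set option you select, so check the assembly output rather than assuming it.

```cpp
#include <cassert>
#include <cstddef>

// Adds two float arrays element by element.
// There is no dependence between iterations and only a simple standard
// operation, so an optimizing compiler can typically vectorize this loop
// (for example with -O3 and an option such as -mavx2 on GCC or Clang).
// __restrict (a compiler extension) promises that the arrays do not overlap.
void add_arrays(const float* __restrict a, const float* __restrict b,
                float* __restrict out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```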

However, there are many cases where the compiler is unable to vectorize the code automatically or does so in a suboptimal way. Here you have to vectorize the code explicitly. There are various ways to do this:

• Use assembly language

• Use intrinsic functions

• Use predefined vector classes

The easiest way to vectorize code explicitly is by using a vector class library. You may combine this with intrinsic functions if you need things that are not defined in the vector class library. Whether you choose to use intrinsic functions or vector classes is just a matter of convenience - there is usually no difference in performance. Intrinsic functions have long names that look clumsy and tedious. The code becomes more readable when you are using vector classes and overloaded operators.
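The difference in readability can be sketched as follows. The first function uses SSE intrinsic functions directly. The second uses a deliberately tiny wrapper class, Float4, with overloaded operators. Float4 and both function names are hypothetical and made up for this illustration; they are not the interface of any real vector class library. A scalar fallback is included only so that the sketch compiles on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE and SSE2 intrinsic functions

// Version 1: intrinsic functions. Computes out[i] = a[i] * b[i] + c[i].
void muladd_intrinsics(const float* a, const float* b, const float* c,
                       float* out, std::size_t n) {
    assert(a && b && c && out);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {              // four floats per 128-bit vector
        __m128 va = _mm_loadu_ps(a + i);      // unaligned loads
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
    for (; i < n; ++i) out[i] = a[i] * b[i] + c[i];  // remaining 0-3 elements
}

// Hypothetical minimal vector class wrapping a 128-bit float vector.
struct Float4 {
    __m128 v;
    static Float4 load(const float* p) { return Float4{_mm_loadu_ps(p)}; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
};
inline Float4 operator+(Float4 x, Float4 y) { return Float4{_mm_add_ps(x.v, y.v)}; }
inline Float4 operator*(Float4 x, Float4 y) { return Float4{_mm_mul_ps(x.v, y.v)}; }

// Version 2: the same calculation written with overloaded operators.
void muladd_class(const float* a, const float* b, const float* c,
                  float* out, std::size_t n) {
    assert(a && b && c && out);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        Float4 r = Float4::load(a + i) * Float4::load(b + i) + Float4::load(c + i);
        r.store(out + i);
    }
    for (; i < n; ++i) out[i] = a[i] * b[i] + c[i];  // remaining 0-3 elements
}
#else
// Scalar fallback for targets without SSE2, so the sketch still compiles.
void muladd_intrinsics(const float* a, const float* b, const float* c,
                       float* out, std::size_t n) {
    assert(a && b && c && out);
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i] + c[i];
}
void muladd_class(const float* a, const float* b, const float* c,
                  float* out, std::size_t n) {
    muladd_intrinsics(a, b, c, out, n);
}
#endif
```

After inlining, both versions normally compile to the same vector instructions, which is why the choice is mostly a matter of convenience.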

A good compiler is often able to optimize the code further after you have vectorized it manually. The compiler can use optimization techniques such as function inlining, common subexpression elimination, constant propagation, loop optimization, etc. These techniques are rarely used in manual assembly coding because it makes the code unwieldy, error prone, and almost impossible to maintain. The combination of manual vectorization with further optimization by the compiler can therefore give the best result in many cases.

Current compilers are not perfect at constant propagation and other optimization techniques on vector code. Therefore, it is sometimes better to rely on automatic vectorization by the compiler in cases where the compiler can do so without problems. Some experimentation may be needed to find the best solution. You may look at the assembly output or the disassembly display in a debugger to check what the compiler is doing.

Vectorized code often contains a lot of extra instructions for converting the data to the right format and getting them into the right positions in the vectors. This data conversion and permutation can sometimes take more time than the actual calculations. This should be taken into account when deciding whether it is profitable to use vectorized code or not. The VCL Vector Class Library has some very useful permutation functions that automatically find the optimal implementation of a particular permutation pattern.

I will conclude this section by summing up the factors that decide how advantageous vectorization is.

Factors that make vectorization favorable:

• Small data types: char, int16_t, float.

• Similar operations on all data in large arrays.

• Array size divisible by vector size.

• Unpredictable branches that select between two simple expressions.

• Operations that are only available with vector operands: minimum, maximum, saturated addition, fast approximate reciprocal, fast approximate reciprocal square root, RGB color difference.

• Vector instruction set available, e.g. AVX, AVX2, AVX-512.

• Mathematical vector function libraries.

• Use GNU or Clang compiler.

Factors that make vectorization less favorable:

• Larger data types: int64_t, double.

• Misaligned data.

• Extra data conversion, permutation, packing, unpacking needed.

• Predictable branches that can skip large expressions when not selected.

• Compiler has insufficient information about pointer alignment and aliasing.

• Operations that are missing in the instruction set for the appropriate type of vector, such as 32-bit integer multiplication prior to SSE4.1, and integer division.

• Older CPUs with execution units smaller than the vector register size.

Vectorized code is more difficult for the programmer to make and therefore more error prone. The vectorized code should therefore preferably be put aside in reusable and well-tested library modules and header files.
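The point about unpredictable branches that select between two simple expressions can be illustrated with a small sketch. Both expressions are calculated for every element, and a compare mask picks the result, so there is no branch to mispredict. The function names and the particular expressions (x * 2 and x + 1) are hypothetical examples. The and/andnot/or sequence is the SSE way to blend two vectors; later instruction sets have dedicated blend instructions. A scalar fallback is included only so that the sketch compiles on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>

// Scalar reference version with a branch in every iteration.
void select_scalar(const float* x, float threshold, float* out, std::size_t n) {
    assert(x != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        if (x[i] > threshold) out[i] = x[i] * 2.0f;
        else                  out[i] = x[i] + 1.0f;
    }
}

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE and SSE2 intrinsic functions

// Vector version: computes both expressions and selects with a mask.
void select_vector(const float* x, float threshold, float* out, std::size_t n) {
    const __m128 thr = _mm_set1_ps(threshold);
    const __m128 two = _mm_set1_ps(2.0f);
    const __m128 one = _mm_set1_ps(1.0f);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v        = _mm_loadu_ps(x + i);
        __m128 mask     = _mm_cmpgt_ps(v, thr);   // all ones where v > thr
        __m128 if_true  = _mm_mul_ps(v, two);
        __m128 if_false = _mm_add_ps(v, one);
        // (mask & if_true) | (~mask & if_false)
        __m128 r = _mm_or_ps(_mm_and_ps(mask, if_true),
                             _mm_andnot_ps(mask, if_false));
        _mm_storeu_ps(out + i, r);
    }
    select_scalar(x + i, threshold, out + i, n - i);  // remaining 0-3 elements
}
#else
// Scalar fallback for targets without SSE2, so the sketch still compiles.
void select_vector(const float* x, float threshold, float* out, std::size_t n) {
    select_scalar(x, threshold, out, n);
}
#endif
```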

13 Making critical code in multiple versions for different instruction sets

Microprocessor producers keep adding new instructions to the instruction set. These new instructions can make certain kinds of code execute faster. The most important addition to the instruction set is the vector operations mentioned in chapter 12.

If the code is compiled for a particular instruction set, then it will be compatible with all CPUs that support this instruction set or any higher instruction set, but not with earlier CPUs. The sequence of backwards compatible instruction sets is as follows:

Instruction set        Important features
80386                  32 bit mode
SSE                    128 bit float vectors
SSE2                   128 bit integer and double vectors
SSE3                   horizontal add, etc.
SSSE3                  a few more integer vector instructions
SSE4.1                 some more vector instructions
SSE4.2                 string search instructions
AVX                    256 bit float and double vectors
AVX2                   256 bit integer vectors
FMA3                   floating point multiply-and-add
AVX-512                512 bit integer and floating point vectors
AVX-512 BW, DQ, VL     more 512-bit vector instructions

Table 13.1. Instruction sets

A more detailed explanation of the instruction sets is provided in manual 4: "Instruction tables". There are certain restrictions on mixing code compiled for AVX or later with code compiled without AVX, as explained on page 114.
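You can check at compile time which instruction set the code is being compiled for by testing preprocessor macros. The sketch below assumes the macro names that GCC and Clang predefine when the corresponding instruction set is enabled; Microsoft compilers define only some of them (__AVX__, __AVX2__, __AVX512F__). The function name compiled_instruction_set is hypothetical. Note that this reports what the compiler was told to target, not what the CPU running the program supports.

```cpp
// Returns the name of the highest instruction set in the sequence above
// that this translation unit was compiled for, judged by predefined macros.
const char* compiled_instruction_set() {
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__) || defined(_M_X64)
    return "SSE2";
#else
    return "generic (no vector instruction set macro detected)";
#endif
}
```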

A disadvantage of using the newest instruction set is that the compatibility with older microprocessors is lost. This dilemma can be solved by making the most critical parts of the code in multiple versions for different CPUs. This is called CPU dispatching. For example, you may want to make one version that takes advantage of the AVX512 instruction set, another version for CPUs with only the AVX instruction set, and a generic version that is compatible with old microprocessors without any of these instruction sets. The program should automatically detect which instruction set is supported by the CPU and the operating system and choose the appropriate version of the subroutine for the critical innermost loops.
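A minimal sketch of CPU dispatching through a function pointer is shown below. All function names are hypothetical. The AVX and AVX-512 versions are placeholders that simply call the generic code; in a real program each version would be placed in its own source file and compiled with the matching instruction set option. The detection uses the __builtin_cpu_supports function of GCC and Clang on x86 platforms; with other compilers the sketch falls back to the generic version. You should verify separately that the detection method you use also accounts for operating system support.

```cpp
#include <cassert>
#include <cstddef>

using SumFunc = float (*)(const float*, std::size_t);

// Generic version, compatible with any CPU.
float sum_generic(const float* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Placeholders: real versions would be compiled separately for AVX and
// AVX-512. Here they only forward to the generic code.
float sum_avx(const float* p, std::size_t n)    { return sum_generic(p, n); }
float sum_avx512(const float* p, std::size_t n) { return sum_generic(p, n); }

// Chooses the best available version for the CPU we are running on.
SumFunc select_sum() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // initialize the CPU feature data
    if (__builtin_cpu_supports("avx512f")) return sum_avx512;
    if (__builtin_cpu_supports("avx"))     return sum_avx;
#endif
    return sum_generic;    // fallback for old CPUs and other compilers
}

// Entry point used by the rest of the program.
// The detection runs only once, at the first call.
float sum(const float* p, std::size_t n) {
    static const SumFunc impl = select_sum();
    return impl(p, n);
}
```

The dispatching is done once and outside the innermost loop, so that the cost of the detection and of the indirect call is small compared with the work done by the selected version.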