Motivation - Data layout types : a type-based approach to automatic data layout transformations

As a running example we consider the Move To Front (MTF) transformation, which is used in modern compression algorithms such as BZIP2 [120]. This algorithm can be vectorised; however, the vectorisation pattern is non-trivial, and none of the auto-vectorisers we tried succeeded.

3.2.1 Move To Front (MTF) Algorithm

In order to improve compression algorithms which use the Burrows-Wheeler Transformation (BWT) [19], one can use the Move To Front (MTF) transformation as an additional post-processing step. After applying BWT we expect to get a string containing groups of repeating characters, for example ‘bbbcccaaa’. In order to decrease the entropy of the message and improve the efficiency of the subsequent Huffman encoding [59], we replace each symbol of the message with its index in the list of recently used symbols. The way the MTF works is illustrated in Fig. 3.1.

Original message   Encoded message      Alphabet
bbbcccaaa          ∅                    abcdefghijklmnopqrstuvwxyz
bbbcccaaa          1                    abcdefghijklmnopqrstuvwxyz
bbbcccaaa          1,0                  bacdefghijklmnopqrstuvwxyz
bbbcccaaa          1,0,0                bacdefghijklmnopqrstuvwxyz
bbbcccaaa          1,0,0,2              bacdefghijklmnopqrstuvwxyz
bbbcccaaa          1,0,0,2,0            cbadefghijklmnopqrstuvwxyz
bbbcccaaa          1,0,0,2,0,0          cbadefghijklmnopqrstuvwxyz
bbbcccaaa          1,0,0,2,0,0,2        cbadefghijklmnopqrstuvwxyz
bbbcccaaa          1,0,0,2,0,0,2,0      acbdefghijklmnopqrstuvwxyz
bbbcccaaa          1,0,0,2,0,0,2,0,0    acbdefghijklmnopqrstuvwxyz
∅                  1,0,0,2,0,0,2,0,0    acbdefghijklmnopqrstuvwxyz

Figure 3.1: The MTF transformation steps

As can be seen, each encoding step consists of finding the index of the symbol in the current state of the alphabet, followed by an alphabet update in which the symbol is moved to the first position of the alphabet.

The MTF is used in BZIP2 compression, while decompression uses the reverse transformation (unMTF). This reverse version and its vectorisation will be used as a motivating example in this chapter. The decoding procedure is very similar to the encoding one. We need the encoded message and the original alphabet used during encoding. We traverse the encoded message from left to right and replace every number with the symbol found in the alphabet at the position equal to that number and, as during encoding, move the symbol to the front of the alphabet. Fig. 3.2 demonstrates this process.

Encoded message      Decoded message   Alphabet
1,0,0,2,0,0,2,0,0    ∅                 abcdefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    b                 abcdefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    bb                bacdefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    bbb               bacdefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    bbbc              bacdefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    bbbcc             cbadefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    bbbccc            cbadefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    bbbccca           cbadefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    bbbcccaa          acbdefghijklmnopqrstuvwxyz
1,0,0,2,0,0,2,0,0    bbbcccaaa         acbdefghijklmnopqrstuvwxyz
∅                    bbbcccaaa         acbdefghijklmnopqrstuvwxyz

Figure 3.2: The unMTF transformation steps

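The trivial implementation of a single step of unMTF is the following:

    char unMTF(char alphabet[256], int idx)
    {
        char c = alphabet[idx];

        /* shift the symbols in front of idx one position to the right */
        for (; idx > 0; idx--)
            alphabet[idx] = alphabet[idx - 1];

        /* put the decoded symbol at the front and return it */
        return alphabet[0] = c;
    }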

The variable idx is a number in the encoded message, and alphabet is the current state of the alphabet. The implementation above uses an inefficient alphabet rotation: it has O(N) worst-case complexity, where N is the length of the alphabet. This rotation happens for every symbol of the encoded message, so BZIP2 uses a more advanced implementation which reduces the worst-case complexity to O(√N).

The alphabet is divided into √N chunks. Only the chunk that contains the symbol has to be updated by shifting its elements as in the code above; all the preceding chunks can be updated by changing just their first and last symbols. The implementation of this approach looks as follows:

    #define N 4096

    char alphabet[N];

    /* starting positions of the 16 chunks: ptr[i] = N-256+16*i */
    short ptr[16] = {N-256, N-256+16, N-256+16*2, N-256+16*3, ...};

    /* shift elements 0..idx-1 of v one position to the right */
    void rotate_segment(char *v, int idx)
    {
        if (idx == 0)
            return;
        do
            v[idx] = v[idx - 1];
        while (--idx);
    }

    /* move all the chunks back to the end of the array */
    void rearrange_alphabet()
    {
        int i, j, k = N - 1;

        for (i = 15; i >= 0; i--) {
            for (j = 15; j >= 0; j--)
                alphabet[k] = alphabet[ptr[i] + j], k--;
            ptr[i] = k + 1;
        }
    }

    char unMTF(int idx)
    {
        int i, q, r;
        char c;

        /* the symbol is already at the front: nothing to move */
        if (idx == 0)
            return alphabet[ptr[0]];

        q = idx / 16;
        r = idx % 16;
        c = alphabet[ptr[q] + r];

        rotate_segment(&alphabet[ptr[q]], r);
        ptr[q]++;

        for (i = q; i > 0; i--) {
            ptr[i]--;
            alphabet[ptr[i]] = alphabet[ptr[i - 1] + 15];
        }

        alphabet[--ptr[0]] = c;

        if (ptr[0] == 0)
            rearrange_alphabet();

        return c;
    }

The chunk q contains the symbol we are after (stored in c) at position r. The alphabet is stored in the array alphabet, which is larger than the actual alphabet. Chunks are represented as constant-size parts of alphabet. Shifting all the elements of a chunk one element to the right is achieved by decreasing the starting position of the chunk by one element and updating the element at this position. The variable ptr is an array of starting positions of the chunks in alphabet.

The chunk q is updated by moving the elements from r-1 down to 0 one element to the right (this is done by rotate_segment). The element at position 0 of the chunk q is then replaced with the last element of the chunk q-1. For every chunk from q-1 down to 0 we decrease its starting index by one; for the chunks q-1 down to 1 the last element of the preceding chunk is put into position 0 of the chunk being updated, and the first symbol of the very first chunk is replaced with c.

As each unMTF step potentially moves chunks to the left, eventually the first chunk will reach the first position in alphabet, in which case the alphabet array has to be rearranged by putting all the chunks at the end of the array; this is done by the rearrange_alphabet function.

In BZIP2 the length of the alphabet is 256, which after dividing into chunks gives us 16 chunks, each 16 characters long. Conveniently, standard SIMD registers these days are 128 bits long, which is exactly one chunk. This means that rearrange_alphabet can move a chunk with two vector instructions rather than with 16 scalar ones. The fact that most SIMD architectures support permutations within a vector gives us a chance to implement a vectorised version of rotate_segment. Now, how can the desired vectorisation be expressed?

The auto-vectorisers we tried (GCC, ICC) did not consider any of the functions suitable for vectorisation. There are several reasons for that. First, the rotate_segment signature does not contain any information about the maximal value of idx, so a compiler can only deduce this information from the calling context. Second, a compiler needs to apply a cost model to show that the transformation is beneficial, which is not an easy task: a potential vectorisation may increase the number of instructions, which affects the instruction pipeline; add conditions, which affect branch prediction; change the memory access pattern; or have similar side effects that harm program performance. Without the knowledge that a particular function is a hot-spot, a compiler may decide not to vectorise a function even if it is possible in theory.

In order to express rotate_segment explicitly in a portable SIMD way we need an interface for vector permutation. In GCC this was impossible before we added such an interface in version 4.7. Alternatively, one can express a permutation using inline assembly, but even disregarding the fact that this is non-portable, one may end up creating several variants of the code for a single architecture. For example, Intel SSSE3 has the PSHUFB instruction, which performs a byte-level permutation; any lower version of SSE supports permutations of 32-bit elements only, which requires the programmer to come up with a vector shifting and masking scheme that is less efficient; and if the architecture uses AVX, yet another version of the code is needed.

Vectorisation of rearrange_alphabet can be done in a portable way starting from GCC v3.2 by declaring a variable of vector type and, for every chunk, loading the chunk into the variable and storing it back into memory. The code for the function looks as follows:

    #define vector(elcount, type) \
        __attribute__((vector_size((elcount) * sizeof(type)))) type

    /* 16-element char vector with the alignment lowered to that of char */
    typedef char __attribute__((vector_size(16), aligned(1))) xchar;
    #define unaligned(x) ((xchar *)(x))

    void rearrange_alphabet()
    {
        int i;

        for (i = 15; i >= 0; i--) {
            /* ptr[i] can be any offset, so load through the unaligned type */
            vector(16, char) vec = *unaligned(&alphabet[ptr[i]]);
            short idx = N - 256 + 16 * i;

            /* the destination offset is a multiple of the vector size */
            *(vector(16, char) *)&alphabet[idx] = vec;
            ptr[i] = idx;
        }
    }

Some architectures, for example Intel, distinguish between aligned and unaligned vector loads and provide two separate instructions for this purpose. In the code above, we have to take care of the cases when a vector assignment accesses unaligned memory. In order to inform the compiler, we mark potentially unaligned memory by converting it to the vector type with minimal alignment.

In document Data layout types : a type-based approach to automatic data layout transformations for improved SIMD vectorisation (Page 43-47)