
In our work we have identified that fixed data layouts in many cases prevent efficient vectorisation. The observation that fixed data layouts prevent some optimisations, and that relaxing data mappings can lead to more efficient code, is not new and has been made in various contexts. Let us explore the existing techniques in the context of the data layout problem.

2.6.1 Data distribution on clusters

When computation happens on a distributed system, the data used in a program is distributed across the nodes of the system. Normally, communication between the nodes is expensive, so ideally we want all the data required for a computation to be available on the local node. To achieve this, one has to decide how to distribute the data in the first place. This problem is addressed by Kennedy et al. [72], who propose an automatic tool that performs a whole-program analysis, generates layouts for program parts and uses integer programming techniques to select the optimal combination of layouts for the overall program under a given cost model. The authors of [110] solve a similar problem, but in addition to data distribution they ensure that, within one parallel loop, arrays are aligned with each other with respect to offsets, strides and general axis relations. On clusters such an alignment results in more efficient execution. Their technique makes it possible to analyse whole programs, including branching, loops and nested parallelism. As a result of such an analysis, arrays might change dimensionality or padding, or store data in a strided fashion, i.e. use every n-th element.

2.6.2 Optimisation of cache misses

Data layouts are known to be used for reducing cache misses. One of the simplest techniques, briefly discussed in [29], is to transpose an array. In two dimensions, transposition means switching from a row-major to a column-major representation. A standard matrix multiply illustrates the benefit: $C_{ij} = \sum_k A_{ik} B_{kj}$. From the indexes we can see that if A and B have the same data layout, then one of the arrays is accessed with a non-unit stride as k varies, which increases cache misses. If A is row-major and B is column-major, both accesses happen with stride one.

The authors of [113] decrease cache misses by adding padding to the arrays used in a program. Padding is added to the individual dimensions of the arrays as well as between the allocated arrays. This is useful because, as the authors identified, when arrays are allocated at addresses which are multiples of the cache size, they might be mapped to the same cache line when the data is loaded. This results in a cache miss on every access. This does not happen when padding is introduced.

The use of non-linear data layouts is investigated in [24]. Non-linear means that the mapping from array indexes to memory locations cannot be expressed as a linear function of the indexes. Two such layouts are considered: blocked layout and Morton order. The idea of cutting arrays into blocks comes from the result obtained in [78] that if an array of size tR × tC is contiguous in memory and fits in cache, then it creates no self-interference misses. Such a block is used as a building block for new data layouts. The blocked layout converts a two-dimensional m × n array into a four-dimensional array of tR × tC blocks, in which element (i, j) is addressed as (i div tR, j div tC, i mod tR, j mod tC). The other layout considered is Morton order. Morton order is one of the recursive data layouts discussed in [25] and can be described as follows: divide the original matrix into four quadrants, lay out these sub-matrices in memory in the order NW, NE, SW, SE, and then apply the same procedure to every quadrant. The recursion stops either when size 1 × 1 is reached or, as in the case of [24], when the quadrant size reaches tR × tC, at which point the quadrant is laid out in memory in row-major order. Both layouts improve spatial locality and, as the authors suggest, are better suited to hierarchical memory systems. Finally, [145] introduces a source-to-source compiler for the C language that implements matrices in Morton order.
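The two index mappings can be made concrete with a minimal sketch in C. The function names `blocked_offset` and `morton_offset` are hypothetical and are not taken from [24] or [25]. The sketch assumes that the tile sizes divide the matrix dimensions exactly, that the tiles themselves are stored in row-major order, and that the Morton recursion runs all the way down to 1 × 1 blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: memory offset of element (i, j) of an m x n
 * matrix stored as tR x tC tiles. Tiles are laid out in row-major
 * order, and the elements inside each tile are row-major as well.
 * Assumption: tR divides m and tC divides n. */
static size_t blocked_offset(size_t i, size_t j, size_t m, size_t n,
                             size_t tR, size_t tC)
{
    assert(m % tR == 0 && n % tC == 0);
    assert(i < m && j < n);
    size_t tiles_per_row = n / tC;
    size_t tile = (i / tR) * tiles_per_row + (j / tC);
    return tile * (tR * tC) + (i % tR) * tC + (j % tC);
}

/* Hypothetical helper: Morton offset of element (i, j) when the
 * recursion stops at 1 x 1 blocks. Interleaving the bits of i and j,
 * with the row bit in the more significant position, yields the
 * quadrant order NW, NE, SW, SE at every level. */
static uint64_t morton_offset(uint32_t i, uint32_t j)
{
    uint64_t z = 0;
    for (unsigned b = 0; b < 32; b++) {
        z |= (uint64_t)((j >> b) & 1u) << (2 * b);
        z |= (uint64_t)((i >> b) & 1u) << (2 * b + 1);
    }
    return z;
}
```

Neither offset is a linear function of (i, j): the blocked variant uses integer division and remainder, and the Morton variant interleaves bits, which is why both layouts count as non-linear.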

Typical questions for the layout type analysis are:

1. What to do with conflicting data layouts? For example, in the case of transposing arrays, what happens if the same array is referenced in two expressions which imply contradictory layouts? We can abandon the layout transformation for such an array, try to estimate which layout is more beneficial, or change the layout dynamically.

2. How to make sure that a transformation does not decrease performance? We can either be very conservative and reject transformations that might decrease performance, or we can try to use a cost model to decide.

3. Applicability. Are there any factors which make it impossible to transform layouts? One such factor could be language constructs that are hard to analyse. For example, in C one can obtain a pointer into an array and access data via this pointer. And what happens if a program is split into modules which are compiled separately?
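The pointer problem raised in the third question can be illustrated with a small sketch. All names (`row_sum`, `sum_row_of_row_major`, `stale_sum_after_transpose`) and the 2 × 3 matrix shape are hypothetical. The sketch assumes a routine that receives a raw pointer and implicitly relies on the elements of a row being contiguous.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 2, COLS = 3 };

/* Hypothetical routine, possibly compiled in a separate module. It only
 * sees a raw pointer and assumes that the `len` elements to be summed
 * are contiguous in memory. */
static double row_sum(const double *row, size_t len)
{
    double s = 0.0;
    for (size_t j = 0; j < len; j++)
        s += row[j];
    return s;
}

/* Row-major storage: element (i, j) lives at m[i * COLS + j], so a
 * pointer to the start of row i really does address that row. */
static double sum_row_of_row_major(const double *m, size_t i)
{
    assert(i < ROWS);
    return row_sum(m + i * COLS, COLS);
}

/* After a transposition to column-major storage, element (i, j) lives
 * at mt[j * ROWS + i]. If the pointer expression is left unchanged, the
 * callee silently reads elements from several different rows. */
static double stale_sum_after_transpose(const double *mt, size_t i)
{
    assert(i < ROWS);
    return row_sum(mt + i * COLS, COLS);
}
```

For the matrix with rows (1, 2, 3) and (4, 5, 6), the first function returns 6 for row 0, while the stale version applied to the column-major copy returns 1 + 4 + 2 = 7. A layout transformation therefore has to find and rewrite every such pointer-based access, or give up on the array.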

2.6.3 Vectorisation

When looking at manual optimisations of data layouts for better vectorisation, transposing the data becomes very important, as it improves vector loads and stores. These transformations are commonly known as “transforming an array of structures into a structure of arrays” (AoS-to-SoA). Here is an example described in [108]. Assume that we store an array of triplets in the following way:

    /* Define a structure that holds three elements. */
    struct triplet { double x, y, z; };
    /* Define an array of triplets of size N. */
    struct triplet A[N];

    for (i = 0; i < N; i++) {
        ... A[i].x ...
    }

Let us assume that the above loop can be vectorised over i, which means that we replace V subsequent loop iterations with vector operations, where V is the number of elements in a SIMD register. In that case we have to load the x components of the V array elements at positions i, i + 1, ..., i + V − 1. As every triplet occupies three doubles, consecutive x components are three doubles apart, which creates a strided access into memory. To accommodate this we need to reshuffle the elements. This is what is called the AoS-to-SoA transformation. There are a number of different ways to do this:

1. We can do it dynamically on every load. Some architectures provide instructions for such a load. If not, we can manually access the individual components in memory and put them in the corresponding positions of a vector. Usually this is not very efficient: mixing vector and scalar instructions affects the pipelines, and the strided access affects the caches. Alternatively, we can load 3V elements into SIMD vectors and reshuffle them within registers. After reshuffling we get three vectors holding the x, y and z components respectively. This improves memory access, but the reshuffling pattern gets rather complicated and might be inefficient.

2. We can change data layouts and store a transposed version of A in memory:

    struct triplet_tr {
        double x[N], y[N], z[N];
    };

    struct triplet_tr A_tr;

This can be done locally, i.e. before entering a loop that we want to vectorise, we transpose the array, either in place or by copying the data to newly allocated memory, and then update the accesses to the array, replacing them with accesses to the transposed data structure. For the considered example it looks like this:

    /* Transpose. */
    for (i = 0; i < N; i++) {
        A_tr.x[i] = A[i].x;
        A_tr.y[i] = A[i].y;
        A_tr.z[i] = A[i].z;
    }

    /* Updated references in the loop. */
    for (i = 0; i < N; i++) {
        ... A_tr.x[i] ...
    }

    /* Update array A, in case it is in use. */

Note that if the elements of the array A are in use after the loop, A has to be updated by copying the data back from A_tr. This approach is described in [106], where the authors pay attention to how to vectorise the transpose itself. Alternatively, data layouts can be changed globally. For our example this means that we replace all references to A with a modified reference to A_tr, which means that we do not need to copy memory at runtime, but the price for that is a whole-program analysis.
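The local variant, including the copy-back step, can be sketched as one self-contained function. This is only an illustration under stated assumptions: the function name `scale_local`, the fixed size N = 8 and the loop body (x[i] = s * x[i] + y[i]) are hypothetical, since the loop body is elided in the example above.

```c
#include <assert.h>

#define N 8

struct triplet    { double x, y, z; };           /* array-of-structures element */
struct triplet_tr { double x[N], y[N], z[N]; };  /* structure of arrays */

/* Hypothetical kernel: x[i] = s * x[i] + y[i] for all i, computed on a
 * local SoA copy of A. */
static void scale_local(struct triplet A[N], double s)
{
    struct triplet_tr A_tr;
    int i;

    /* Transpose: copy the AoS data into the SoA structure. */
    for (i = 0; i < N; i++) {
        A_tr.x[i] = A[i].x;
        A_tr.y[i] = A[i].y;
        A_tr.z[i] = A[i].z;
    }

    /* The loop of interest: every access now has stride one, so V
     * consecutive iterations map onto plain vector loads and stores. */
    for (i = 0; i < N; i++)
        A_tr.x[i] = s * A_tr.x[i] + A_tr.y[i];

    /* Copy back. Only x was modified in this sketch, so only x needs
     * to be written back to A. */
    for (i = 0; i < N; i++)
        A[i].x = A_tr.x[i];
}
```

The two copy loops are the runtime price of the local approach. Whether they pay off depends on how much work the vectorised loop performs per element, which is the kind of decision a cost model would have to make.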

The program analysis that is required to change data layouts automatically across the whole program is a highly non-trivial process. There have been a number of attempts to formalise such a process, for example in the work of O’Boyle et al. [100, 99]. The authors describe an algebraic transformation framework for data layouts and show how it can be combined with polyhedral-like local loop transformations. The main idea is to represent the data accesses to arrays inside a loop nest as a system of linear inequalities. Layouts of the arrays can then be transformed in a systematic way, even in the presence of loop transformations. However, the global layout transformations involving the data-layout-related questions formulated above are left out of scope.
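To give a flavour of such an algebraic formulation, here is a generic worked example. It is not taken from [100, 99]; the symbols $U$ and $T$, the loop bounds $N$ and $M$, and the access $A[j][i]$ are assumptions chosen purely for illustration. Consider a loop nest over $0 \le i < N$, $0 \le j < M$, with $j$ innermost, that reads $A[j][i]$ from a row-major array.

```latex
% Iteration space as a system of linear inequalities:
\[
\begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
\ge
\begin{pmatrix} 0 \\ -(N-1) \\ 0 \\ -(M-1) \end{pmatrix}
\]

% Access A[j][i] as a linear map U from iterations to array indexes:
\[
\begin{pmatrix} a_1 \\ a_2 \end{pmatrix}
= U \begin{pmatrix} i \\ j \end{pmatrix},
\qquad
U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\]

% A layout transformation T (here: transposition) applied to the indexes:
\[
\begin{pmatrix} a_1' \\ a_2' \end{pmatrix}
= T\,U \begin{pmatrix} i \\ j \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} i \\ j \end{pmatrix},
\qquad
T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\]
```

Before the transformation the innermost index $j$ selects the row, so consecutive iterations are a whole row apart in memory. After applying $T$ the access becomes $A'[i][j]$, and the innermost loop walks through memory with stride one.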