Cache contentions in large data structures

It is not always possible to access a multidimensional array sequentially. Some applications (e.g. in linear algebra) require other access patterns. This can cause severe delays if the distance between rows in a big matrix happens to be equal to the critical stride, as explained on page 89. This will happen if the size of a matrix line (in bytes) is a high power of 2. The following example illustrates this. My example is a function which transposes a square matrix, i.e. each element matrix[r][c] is swapped with element matrix[c][r].

// Example 9.9a

const int SIZE = 64;                     // number of rows/columns in matrix

void transpose(double a[SIZE][SIZE]) {   // function to transpose matrix
    // define a macro to swap two array elements:
    #define swapd(x,y) {temp=x; x=y; y=temp;}
    int r, c; double temp;
    for (r = 1; r < SIZE; r++) {         // loop through rows
        for (c = 0; c < r; c++) {        // loop columns below diagonal
            swapd(a[r][c], a[c][r]);     // swap elements
        }
    }
}

void test () {
    alignas(64)                          // align by cache line size
    double matrix[SIZE][SIZE];           // define matrix
    transpose(matrix);                   // call transpose function
}

Transposing a matrix is the same as reflecting it at the diagonal. Each element matrix[r][c] below the diagonal is swapped with element matrix[c][r] at its mirror position above the diagonal. The c loop in example 9.9a goes from the leftmost column to the diagonal. The elements at the diagonal remain unchanged.

The problem with this code is that if the elements matrix[r][c] below the diagonal are accessed row-wise, then the mirror elements matrix[c][r] above the diagonal are accessed column-wise.

Assume now that we are running this code with a 64×64 matrix on a Pentium 4 computer where the level-1 data cache is 8 kb = 8192 bytes, 4 ways, with a line size of 64 bytes. Each cache line can hold 8 doubles of 8 bytes each. Each matrix row takes 64 × 8 = 512 bytes, so the critical stride is 8192 / 4 = 2048 bytes = 4 rows.
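The arithmetic above can be checked at compile time. The following is a minimal sketch, not part of the original example; the function names criticalStride, cacheSet and elementOffset are hypothetical, and the code assumes that the matrix starts at address 0 and that a double is 8 bytes.

```cpp
#include <cstddef>

// Hypothetical helper names, not from the original text.

// Critical stride = cache size divided by the number of ways.
constexpr std::size_t criticalStride(std::size_t cacheBytes, std::size_t ways) {
    return cacheBytes / ways;
}

// Index of the cache set that a byte address maps to.
constexpr std::size_t cacheSet(std::size_t address, std::size_t cacheBytes,
                               std::size_t ways, std::size_t lineBytes) {
    return (address / lineBytes) % (cacheBytes / (ways * lineBytes));
}

// Byte offset of element a[row][col] in an n x n matrix of double,
// assuming the matrix starts at address 0.
constexpr std::size_t elementOffset(std::size_t n, std::size_t row, std::size_t col) {
    return (row * n + col) * sizeof(double);
}

// Level-1 cache from the text: 8192 bytes, 4 ways, 64-byte lines.
static_assert(criticalStride(8192, 4) == 2048, "critical stride is 2048 bytes");
static_assert(criticalStride(8192, 4) / (64 * sizeof(double)) == 4,
              "critical stride is 4 rows of a 64x64 matrix");

// Elements four rows apart in column 28 compete for the same set.
static_assert(cacheSet(elementOffset(64, 0, 28), 8192, 4, 64) ==
              cacheSet(elementOffset(64, 4, 28), 8192, 4, 64),
              "rows 0 and 4 map to the same set");
```

Elements in neighboring rows of the same column map to different sets; only rows that are a multiple of four apart collide.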

Let's look at what happens inside the loop, for example when r = 28. We take the elements from row 28 below the diagonal and swap these elements with column 28 above the diagonal. The first eight elements in row 28 share the same cache line. But these eight elements will go into eight different cache lines in column 28 because the cache lines follow the rows, not the columns. Every fourth of these cache lines belongs to the same set in the cache. When we reach element number 16 in column 28, the cache will evict the cache line that was used by element 0 in this column. Number 17 will evict number 1. Number 18 will evict number 2, etc. This means that all the cache lines we used above the diagonal have been lost at the time we are swapping column 29 with row 29. Each cache line has to be reloaded eight times because it is evicted before we need the next element. I have confirmed this by measuring the time it takes to transpose a matrix using example 9.9a on a Pentium 4 with different matrix sizes. The results of my experiment are given below. The time unit is clock cycles per array element.

Matrix size    Total kilobytes    Time per element
63×63                 31               11.6
64×64                 32               16.4
65×65                 33               11.8
127×127              126               12.2
128×128              128               17.4
129×129              130               14.4
511×511             2040               38.7
512×512             2048              230.7
513×513             2056               38.1

Table 9.1. Time for transposition of different size matrices, clock cycles per element.

The table shows that it takes 40% more time to transpose the matrix when the size of the matrix is a multiple of the level-1 cache size. This is because the critical stride is a multiple of the size of a matrix line. The delay is less than the time it takes to reload the level-1 cache from the level-2 cache because the out-of-order execution mechanism can prefetch the data.
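The level-1 effect can be reproduced qualitatively with a toy cache model. The sketch below is not taken from the original measurements. It assumes true LRU replacement, a matrix that starts at address 0, and the 8 kb, 4-way cache with 64-byte lines described above. It ignores prefetching and out-of-order execution, so it counts cache misses, not clock cycles. The names CacheModel and transposeMisses are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model, not from the original text: a set-associative cache
// with true LRU replacement that only counts misses.
class CacheModel {
public:
    CacheModel(std::size_t cacheBytes, std::size_t ways, std::size_t lineBytes)
        : lineBytes_(lineBytes), ways_(ways),
          sets_(cacheBytes / (ways * lineBytes)) {
        assert(!sets_.empty());
    }

    void access(std::uint64_t address) {
        const std::uint64_t line = address / lineBytes_;
        std::vector<std::uint64_t>& set = sets_[line % sets_.size()];
        for (std::size_t i = 0; i < set.size(); i++) {
            if (set[i] == line) {                 // hit: mark as most recently used
                set.erase(set.begin() + static_cast<std::ptrdiff_t>(i));
                set.push_back(line);
                return;
            }
        }
        misses_++;                                // miss: load the line
        if (set.size() == ways_) {
            set.erase(set.begin());               // evict least recently used line
        }
        set.push_back(line);
    }

    std::uint64_t misses() const { return misses_; }

private:
    std::size_t lineBytes_;
    std::size_t ways_;
    std::vector<std::vector<std::uint64_t>> sets_;  // per set, oldest line first
    std::uint64_t misses_ = 0;
};

// Replays the memory accesses of example 9.9a for an n x n matrix of double
// and returns the number of misses in the modeled level-1 cache.
std::uint64_t transposeMisses(std::size_t n) {
    CacheModel cache(8192, 4, 64);
    for (std::size_t r = 1; r < n; r++) {
        for (std::size_t c = 0; c < r; c++) {
            cache.access((r * n + c) * sizeof(double));  // a[r][c]
            cache.access((c * n + r) * sizeof(double));  // a[c][r]
        }
    }
    return cache.misses();
}
```

Under these assumptions, transposeMisses(64) should come out clearly higher than transposeMisses(63) or transposeMisses(65), which matches the direction of the measured difference in table 9.1.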

The effect is much more dramatic when contentions occur in the level-2 cache. The level-2 cache is 512 kb, 8 ways. The critical stride for the level-2 cache is 512 kb / 8 = 64 kb. Each line in a 512×512 matrix takes 512 × 8 = 4096 bytes, so the critical stride corresponds to 16 lines. My experimental results in table 9.1 show that it takes six times as long to transpose a matrix when contentions occur in the level-2 cache as when contentions do not occur. The reason why this effect is so much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.

A simple way of solving the problem is to make the rows in the matrix longer than needed, so that the critical stride is no longer a multiple of the matrix line size. I tried making the matrix 512×520 and leaving the last 8 columns unused. This removed the contentions, and the time consumption went down to 36 clock cycles per element.
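The padding approach might look as follows. This is only a sketch under the assumptions stated in the comments, not the code used for the measurement above; the names N, PAD, ROWLEN and transposePadded are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, not from the original text.
constexpr std::size_t N = 512;            // rows and used columns
constexpr std::size_t PAD = 8;            // unused columns at the end of each row
constexpr std::size_t ROWLEN = N + PAD;   // 520 doubles per row

// With padding, the level-2 critical stride of 64 kb is no longer
// a multiple of the row size in bytes.
static_assert((64 * 1024) % (ROWLEN * sizeof(double)) != 0,
              "row size must not divide the critical stride");

// Transposes the N x N part of a matrix whose rows hold ROWLEN elements.
// The PAD extra columns are never touched.
void transposePadded(double (*a)[ROWLEN]) {
    assert(a != nullptr);
    for (std::size_t r = 1; r < N; r++) {        // loop through rows
        for (std::size_t c = 0; c < r; c++) {    // loop columns below diagonal
            double temp = a[r][c];               // swap elements
            a[r][c] = a[c][r];
            a[c][r] = temp;
        }
    }
}
```

A matrix declared as double matrix[N][ROWLEN] can be passed directly to this function. The cost is 8 unused doubles per row, which is about 1.6% extra memory for this size.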

There may be cases where it is not possible to add unused columns to a matrix. For example, a library of math functions should work efficiently on all sizes of matrices. An efficient solution in this case is to divide the matrix into smaller squares and handle one square at a time. This is called square blocking or tiling. This technique is illustrated in example 9.9b.

// Example 9.9b

void transpose(double a[SIZE][SIZE]) {
    // Define macro to swap two elements:
    #define swapd(x,y) {temp=x; x=y; y=temp;}

    // Check if level-2 cache contentions will occur:
    if (SIZE > 256 && SIZE % 128 == 0) {
        // Cache contentions expected. Use square blocking:
        int r1, r2, c1, c2; double temp;
        // Define size of squares:
        const int TILESIZE = 8;          // SIZE must be divisible by TILESIZE
        // Loop r1 and c1 for all squares:
        for (r1 = 0; r1 < SIZE; r1 += TILESIZE) {
            for (c1 = 0; c1 < r1; c1 += TILESIZE) {
                // Loop r2 and c2 for elements inside square:
                for (r2 = r1; r2 < r1+TILESIZE; r2++) {
                    for (c2 = c1; c2 < c1+TILESIZE; c2++) {
                        swapd(a[r2][c2],a[c2][r2]);
                    }
                }
            }
            // At the diagonal there is only half a square.
            // This triangle is handled separately:
            for (r2 = r1+1; r2 < r1+TILESIZE; r2++) {
                for (c2 = r1; c2 < r2; c2++) {
                    swapd(a[r2][c2],a[c2][r2]);
                }
            }
        }
    }
    else {
        // No cache contentions. Use simple method.
        // This is the code from example 9.9a:
        int r, c; double temp;
        for (r = 1; r < SIZE; r++) {         // loop through rows
            for (c = 0; c < r; c++) {        // loop columns below diagonal
                swapd(a[r][c], a[c][r]);     // swap elements
            }
        }
    }
}
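Example 9.9b requires SIZE to be divisible by TILESIZE and fixes the size at compile time. The following sketch, which is not part of the original example, shows one way the same blocking scheme might be written for a run-time size n that need not be a multiple of the tile size. It assumes that the matrix is stored row by row in a std::vector<double>; the name transposeTiled and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not from the original text.
// Transposes an n x n matrix stored row by row in a, one square at a time.
void transposeTiled(std::vector<double>& a, std::size_t n, std::size_t tile = 8) {
    assert(tile > 0 && a.size() == n * n);
    for (std::size_t r1 = 0; r1 < n; r1 += tile) {
        // The last row of squares may be cut off by the matrix edge:
        const std::size_t rEnd = std::min(r1 + tile, n);

        // Squares below the diagonal. Because c1 < r1 and both are
        // multiples of tile, c1 + tile never exceeds r1, so the
        // column range is always inside the matrix.
        for (std::size_t c1 = 0; c1 < r1; c1 += tile) {
            for (std::size_t r2 = r1; r2 < rEnd; r2++) {
                for (std::size_t c2 = c1; c2 < c1 + tile; c2++) {
                    std::swap(a[r2 * n + c2], a[c2 * n + r2]);
                }
            }
        }

        // At the diagonal there is only half a square:
        for (std::size_t r2 = r1 + 1; r2 < rEnd; r2++) {
            for (std::size_t c2 = r1; c2 < r2; c2++) {
                std::swap(a[r2 * n + c2], a[c2 * n + r2]);
            }
        }
    }
}
```

Every pair of elements below and above the diagonal is swapped exactly once, as in example 9.9b. A library function could call this version only when contentions are expected and fall back to the simple loop otherwise.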

This code took 50 clock cycles per element for a 512×512 matrix in my experiments. Contentions in the level-2 cache are so expensive that it is very important to do something about them. You should therefore be aware of situations where the number of columns in a matrix is a high power of 2. Contentions in the level-1 cache are less expensive. Using complicated techniques like square blocking for the level-1 cache may not be worth the effort.

Square blocking and similar methods are further described in the book "Performance Optimization of Numerically Intensive Codes", by S. Goedecker and A. Hoisie, SIAM 2001.
