
The efficiency of a loop depends on how well the microprocessor can predict the loop control branch. See the preceding paragraph and manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for an explanation of branch prediction. A loop with a small and fixed repeat count and no branches inside can be predicted perfectly. As explained above, the maximum loop count that can be predicted depends on the processor. Nested loops are predicted well only on some processors that have a special loop predictor. On other processors, only the innermost loop is predicted well. A loop with a high repeat count is mispredicted only when it exits. For example, if a loop repeats a thousand times, then the loop control branch is mispredicted only once in a thousand iterations, so the misprediction penalty contributes only negligibly to the total execution time.

Loop unrolling

In some cases it can be an advantage to unroll a loop. Example:

// Example 7.30a
int i;
for (i = 0; i < 20; i++) {
    if (i % 2 == 0) {
        FuncA(i);
    }
    else {
        FuncB(i);
    }
    FuncC(i);
}

This loop repeats 20 times and calls alternately FuncA and FuncB, then FuncC. Unrolling the loop by two gives:

// Example 7.30b
int i;
for (i = 0; i < 20; i += 2) {
    FuncA(i);
    FuncC(i);
    FuncB(i+1);
    FuncC(i+1);
}

This has three advantages:

• The i<20 loop control branch is executed 10 times rather than 20.

• The fact that the repeat count has been reduced from 20 to 10 means that it can be predicted perfectly on most CPUs.

• The if branch is eliminated.

Loop unrolling also has disadvantages:

• The unrolled loop takes up more space in the code cache or micro-op cache.

• Many CPUs have a loopback buffer that improves performance for very small loops, as explained in my microarchitecture manual. The unrolled loop is unlikely to fit into the loopback buffer.

• If the repeat count is odd and you unroll by two then there is an extra iteration that has to be done outside the loop. In general, you have this problem when the repeat count is not certain to be divisible by the unroll factor.
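The remainder problem mentioned in the last point can be handled with a small clean-up step after the unrolled loop. The following is a minimal sketch and not part of the original examples: RunUnrolled and the trace string are hypothetical names, and FuncA, FuncB and FuncC are stand-ins that only record which function was called.

```cpp
#include <cassert>
#include <string>

std::string trace;  // hypothetical: records the order of calls, for illustration only

void FuncA(int) { trace += 'A'; }  // stand-in for the real FuncA
void FuncB(int) { trace += 'B'; }  // stand-in for the real FuncB
void FuncC(int) { trace += 'C'; }  // stand-in for the real FuncC

// Loop of example 7.30b, unrolled by two, for a repeat count n that may be odd
void RunUnrolled(int n) {
    assert(n >= 0);
    int i;
    // Main loop: two iterations of the original loop per pass
    for (i = 0; i + 1 < n; i += 2) {
        FuncA(i);
        FuncC(i);
        FuncB(i + 1);
        FuncC(i + 1);
    }
    // Remainder: at most one iteration is left, and its index is always even
    if (i < n) {
        FuncA(i);
        FuncC(i);
    }
}
```

With n = 5, the main loop runs twice and the remainder step handles i = 4. The extra branch after the loop is the price for supporting repeat counts that are not divisible by the unroll factor.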

Loop unrolling should only be used if there are specific advantages that can be obtained. If a loop contains floating point calculations or vector instructions and the loop counter is an integer, then you can generally assume that the overall computation time is determined by the floating point code rather than by the loop control branch. There is nothing to gain by unrolling the loop in this case.

Loop unrolling is useful when there is a loop-carried dependency chain, as explained on page 111.
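A typical case is a sum where each addition has to wait for the result of the previous one. The sketch below shows one common way to break such a chain by unrolling by two with two independent accumulators. It is an illustration under assumptions and not necessarily the example given on the referenced page; the name SumTwoAccumulators is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Sum an array with two independent accumulators.
// The additions into s0 and s1 do not depend on each other,
// so an out-of-order processor may overlap them.
double SumTwoAccumulators(const double* a, std::size_t n) {
    assert(a != nullptr || n == 0);
    double s0 = 0.0, s1 = 0.0;
    std::size_t i = 0;
    // Main loop: two elements per pass, one for each accumulator
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    // Remainder: one element is left if n is odd
    if (i < n) s0 += a[i];
    // Combine the two partial sums
    return s0 + s1;
}
```

Note that floating point addition is not associative, so the result may differ in the last digits from a strictly sequential sum. This is probably the reason why a compiler may not make this transformation on its own under strict floating point rules.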

Loop unrolling should preferably be avoided on processors with a micro-op cache because it is important to economize the use of the micro-op cache. This is also the case for processors with a loopback buffer.

Compilers will usually unroll a loop automatically if this appears to be profitable (see page 72). The programmer does not have to unroll a loop manually unless there is a specific advantage to obtain, such as eliminating the if-branch in example 7.30b.

The loop control condition

The most efficient loop control condition is a simple integer counter. A microprocessor with out-of-order capabilities (see page 110) will be able to evaluate the loop control statement several iterations ahead.

It is less efficient if the loop control branch depends on the calculations inside the loop. The following example converts a zero-terminated ASCII string to lower case:

// Example 7.31a

char string[100], *p = string;
while (*p != 0) *(p++) |= 0x20;

If the length of the string is already known then it is more efficient to use a loop counter:

// Example 7.31b
char string[100], *p = string; int i, StringLength;
for (i = StringLength; i > 0; i--) *(p++) |= 0x20;

A common situation where the loop control branch depends on calculations inside the loop is in mathematical iterations such as Taylor expansions and Newton-Raphson iterations. Here the iteration is repeated until the residual error is lower than a certain tolerance. The time it takes to calculate the absolute value of the residual error and compare it to the tolerance may be so high that it is more efficient to determine the worst-case maximum repeat count and always use this number of iterations. The advantage of this method is that the microprocessor can execute the loop control branch ahead of time and resolve any branch misprediction long before the floating point calculations inside the loop are finished. This method is advantageous if the typical repeat count is near the maximum repeat count and the calculation of the residual error for each iteration is a significant contribution to the total calculation time.
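The two alternatives can be compared in a small sketch. It uses the Newton-Raphson iteration for a square root as an illustration. The function names, the restriction of the input to the range 1 to 4, the starting guess and the repeat count of 6 are all assumptions made for this example. The count of 6 is an estimate for this input range and starting guess only.

```cpp
#include <cassert>
#include <cmath>

// Version 1: repeat until the residual error is below a tolerance.
// The loop control branch depends on the floating point result of each iteration.
double SqrtTolerance(double a) {
    assert(a >= 1.0 && a <= 4.0);      // assumed input range
    double x = 0.5 * (1.0 + a);        // assumed starting guess
    while (std::fabs(x * x - a) > 1E-12 * a) {
        x = 0.5 * (x + a / x);         // Newton-Raphson step for x*x = a
    }
    return x;
}

// Version 2: always use an assumed worst-case repeat count.
// The loop control branch depends only on the integer counter.
double SqrtFixedCount(double a) {
    assert(a >= 1.0 && a <= 4.0);      // assumed input range
    const int maxIterations = 6;       // assumed worst case for this range and guess
    double x = 0.5 * (1.0 + a);
    for (int i = maxIterations; i > 0; i--) {
        x = 0.5 * (x + a / x);
    }
    return x;
}
```

The second version does some unnecessary iterations for inputs that converge quickly. Whether it is faster overall depends on how close the typical repeat count is to the maximum, as discussed above.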

A loop counter should preferably be an integer. If a loop needs a floating point counter then make an additional integer counter. Example:

// Example 7.32a

double x, n, factorial = 1.0;

for (x = 2.0; x <= n; x++) factorial *= x;

This can be improved by adding an integer counter and using the integer in the loop control condition:

// Example 7.32b

double x, n, factorial = 1.0; int i;

for (i = (int)n - 2, x = 2.0; i >= 0; i--, x++) factorial *= x;

Note the difference between commas and semicolons in a loop with multiple counters, as in example 7.32b. A for-loop has three clauses: initialization, condition, and increment. The three clauses are separated by semicolons, while multiple statements within each clause are separated by commas. There should be only one statement in the condition clause.

Comparing an integer to zero is sometimes more efficient than comparing it to any other number. Therefore, it is slightly more efficient to make a loop count down to zero than to make it count up to some positive value, n. This does not apply if the loop counter is used as an array index, because the data cache is optimized for accessing arrays forwards, not backwards.
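It is possible to count down to zero and still access an array forwards by using a separate pointer for the array access, in the same way as in example 7.31b. The following is a minimal sketch with the hypothetical function name ScaleArray. Whether it is actually faster than a plain loop that counts up depends on the compiler and the processor.

```cpp
#include <cassert>

// Multiply all n elements of an array by a factor.
// The counter i counts down to zero, while the pointer p walks forwards.
void ScaleArray(float* a, int n, float factor) {
    assert(n >= 0);
    float* p = a;
    for (int i = n; i > 0; i--) {
        *(p++) *= factor;
    }
}
```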

Copying or clearing arrays

It may not be optimal to use a loop for trivial tasks such as copying an array or setting an array to all zeroes. Example:

// Example 7.33a

const int size = 1000; int i;
float a[size], b[size];
// set a to zero
for (i = 0; i < size; i++) a[i] = 0.0;
// copy a to b
for (i = 0; i < size; i++) b[i] = a[i];

It is often faster to use the functions memset and memcpy:

// Example 7.33b

const int size = 1000;
float a[size], b[size];
// set a to zero
memset(a, 0, sizeof(a));
// copy a to b
memcpy(b, a, sizeof(b));

Most compilers will automatically replace such loops by calls to memset and memcpy, at least in simple cases. The explicit use of memset and memcpy is unsafe because serious errors can happen if the size parameter is bigger than the destination array. But the same errors can happen with the loops if the loop count is too big.
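One way to reduce the risk of a wrong size parameter is to let the compiler derive the size from the array type. The following is a minimal sketch under the assumption that the arrays are built-in arrays of a trivially copyable type. ClearArray and CopyArray are hypothetical helper names, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper: set a whole built-in array to zero bytes.
// The size is taken from the array type, so it cannot exceed the destination.
template <typename T, std::size_t N>
void ClearArray(T (&a)[N]) {
    static_assert(std::is_trivially_copyable<T>::value, "memset needs a trivially copyable type");
    std::memset(a, 0, sizeof(a));
}

// Hypothetical helper: copy the first count elements of src to dest.
// Both arrays must have the same element type and the same length N.
template <typename T, std::size_t N>
void CopyArray(T (&dest)[N], const T (&src)[N], std::size_t count = N) {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy needs a trivially copyable type");
    assert(count <= N);  // checked in debug builds only
    std::memcpy(dest, src, count * sizeof(T));
}
```

These helpers do not work for arrays that are passed as pointers, because the length is then no longer part of the type. In that case the length has to be passed and checked separately.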