Metaprogramming with constexpr functions - Optimizing Software in C++

A constexpr function is a function that can do almost any calculation at compile time if its arguments are compile-time constants. Since the C++14 standard, a constexpr function can contain branches, loops, etc.

This example finds the position of the most significant 1-bit in an integer. This is the same as the bit scan reverse instruction, but calculated at compile time:

// Example 15.3. Find most significant bit, using constexpr function
#include <cstdint>                       // For uint64_t
constexpr int bit_scan_reverse (uint64_t const n) {
    if (n == 0) return -1;               // No bits set
    uint64_t a = n, b = 0, j = 64, k = 0;
    do {                                 // Binary search: j = 32, 16, ..., 1, 0
        j >>= 1;
        k = (uint64_t)1 << j;
        if (a >= k) {                    // A bit at position j or higher is set
            a >>= j;
            b += j;
        }
    } while (j > 0);
    return int(b);
}

A good optimizing compiler will do simple calculations at compile time anyway, if all the inputs are known constants, but few compilers are able to do more complicated calculations at compile time if they involve branches, loops, function calls, etc. A constexpr function can be useful to make sure that certain calculations are done at compile time.

The result of a constexpr function can be used wherever a compile time constant is required, for example for an array size or in a compile time branch.
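The following is a minimal sketch of these uses. It repeats bit_scan_reverse from example 15.3 and uses the result as a constexpr variable, as an array size, and in a compile time branch. The names msb, table and index_type_bits are hypothetical and chosen only for illustration, and the if constexpr statement assumes a compiler with C++17 support.

```cpp
#include <cstdint>

// bit_scan_reverse as in example 15.3
constexpr int bit_scan_reverse(uint64_t const n) {
    if (n == 0) return -1;
    uint64_t a = n, b = 0, j = 64, k = 0;
    do {
        j >>= 1;
        k = (uint64_t)1 << j;
        if (a >= k) { a >>= j; b += j; }
    } while (j > 0);
    return int(b);
}

// Initializing a constexpr variable forces evaluation at compile time.
// 1000 is binary 1111101000, so the most significant 1-bit is bit 9
constexpr int msb = bit_scan_reverse(1000);

// The compiler checks this; there is no runtime cost
static_assert(msb == 9, "unexpected result from bit_scan_reverse");

// The result used as an array size (hypothetical array)
int table[bit_scan_reverse(1000) + 1];

// The result used in a compile time branch (hypothetical function).
// Only the branch that is taken is compiled into the function
constexpr int index_type_bits() {
    if constexpr (bit_scan_reverse(1000) < 8) {
        return 8;                  // An 8-bit index would be enough
    } else {
        return 16;                 // At least a 16-bit index is needed
    }
}
```

If bit_scan_reverse were called with a value that is only known at runtime, the same function would simply be executed at runtime, but it could then no longer be used for the array size or in the static_assert.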

While C++14 and C++17 provide important improvements to the metaprogramming possibilities, there are still things that you cannot do with the C++ language. For example, it is not possible to make a compile time loop that repeats ten times to generate ten functions named func1, func2, ..., func10. This is possible with certain script languages and assemblers.

16 Testing speed

Testing the speed of a program is an important part of the optimization job. You have to check if your modifications actually increase the speed or not.

There are various profilers available which are useful for finding the hot spots and measuring the overall performance of a program. The profilers are not always accurate, however, and it may be difficult to measure exactly what you want when the program spends most of its time waiting for user input or reading disk files. See page 16 for a discussion of profiling.

When a hot spot has been identified, then it may be useful to isolate the hot spot and make measurements on this part of the code only. This can be done with the resolution of the CPU clock by using the so-called time stamp counter. This is a counter that measures the number of clock pulses since the CPU was started. The length of a clock cycle is the reciprocal of the clock frequency, as explained on page 15. If you read the value of the time stamp counter before and after executing a critical piece of code then you can get the exact time consumption as the difference between the two clock counts.

The value of the time stamp counter can be obtained with the function ReadTSC listed below in example 16.1. This code works only for compilers that support intrinsic functions. Alternatively, you can use the header file timingtest.h from www.agner.org/optimize/testp.zip or get ReadTSC as a library function from www.agner.org/optimize/asmlib.zip.

// Example 16.1
#include <intrin.h>          // Or #include <ia32intrin.h> etc.
long long ReadTSC() {        // Returns time stamp counter
    int dummy[4];            // For unused returns
    volatile int DontSkip;   // Volatile to prevent optimizing
    long long clock;         // Time
    __cpuid(dummy, 0);       // Serialize
    DontSkip = dummy[0];     // Prevent optimizing away cpuid
    clock = __rdtsc();       // Read time
    return clock;
}
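The header name and the form of the cpuid intrinsic differ between compilers. The following is a sketch of the same idea for more than one compiler, not a replacement for example 16.1. The function name ReadCounter and the macro HAVE_TSC are hypothetical. The sketch assumes that Gnu and Clang compilers on x86 platforms provide __get_cpuid in cpuid.h and __rdtsc in x86intrin.h. On other platforms it falls back to std::chrono::steady_clock, which counts nanoseconds rather than clock cycles.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>        // __cpuid, __rdtsc (Microsoft form)
  #define HAVE_TSC 1
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  #include <cpuid.h>         // __get_cpuid
  #include <x86intrin.h>     // __rdtsc
  #define HAVE_TSC 1
#else
  #define HAVE_TSC 0         // No time stamp counter available here
#endif

// Hypothetical name, to avoid confusion with ReadTSC in example 16.1
inline int64_t ReadCounter() {
#if HAVE_TSC && defined(_MSC_VER)
    int dummy[4];                       // For unused returns
    volatile int DontSkip;              // Volatile to prevent optimizing
    __cpuid(dummy, 0);                  // Serialize
    DontSkip = dummy[0];                // Prevent optimizing away cpuid
    return (int64_t)__rdtsc();          // Read time
#elif HAVE_TSC
    unsigned int a = 0, b = 0, c = 0, d = 0;
    volatile unsigned int DontSkip;     // Volatile to prevent optimizing
    __get_cpuid(0, &a, &b, &c, &d);     // Serialize
    DontSkip = a;                       // Prevent optimizing away cpuid
    return (int64_t)__rdtsc();          // Read time
#else
    // Fallback: the unit is nanoseconds, not clock cycles
    using namespace std::chrono;
    return (int64_t)duration_cast<nanoseconds>(
        steady_clock::now().time_since_epoch()).count();
#endif
}

// Difference between two readings taken around a piece of code
inline int64_t ElapsedCounts(int64_t before, int64_t after) {
    assert(after >= before);            // The counter is expected to increase
    return after - before;
}
```

Counts from the fallback branch cannot be compared with counts from the time stamp counter because the units are different.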

You can use this function to measure the clock count before and after executing the critical code. A test setup may look like this:

// Example 16.2
#include <stdio.h>
#include <asmlib.h>                     // Use ReadTSC() from library asmlib..
                                        // or from example 16.1
void CriticalFunction();                // This is the function we want to measure
...
const int NumberOfTests = 10;           // Number of times to test
int i; long long time1;
long long timediff[NumberOfTests];      // Time difference for each test
for (i = 0; i < NumberOfTests; i++) {   // Repeat NumberOfTests times
    time1 = ReadTSC();                  // Time before test
    CriticalFunction();                 // Critical function to test
    timediff[i] = ReadTSC() - time1;    // (time after) - (time before)
}
printf("\nResults:");                   // Print heading
for (i = 0; i < NumberOfTests; i++) {   // Loop to print out results
    printf("\n%2i %10I64i", i, timediff[i]);
}

The code in example 16.2 calls the critical function ten times and stores the time consumption of each run in an array. The values are then output after the test loop. The time that is measured in this way includes the time it takes to call the ReadTSC function. You can subtract this value from the counts. It is measured simply by removing the call to CriticalFunction in example 16.2.

The measured time is interpreted in the following way. The first count is usually higher than the subsequent counts. This is the time it takes to execute CriticalFunction when code and data are not cached. The subsequent counts give the execution time when code and data are cached as well as possible. The first count and the subsequent counts represent the "worst case" and "best case" values. Which of these two values is closest to the truth depends on whether CriticalFunction is called once or multiple times in the final program and whether there is other code that uses the cache in between the calls to CriticalFunction. If your optimization effort is concentrated on CPU efficiency then it is the "best case" counts that you should look at to see if a certain modification is profitable. On the other hand, if your optimization effort is concentrated on arranging data in order to improve cache efficiency, then you may also look at the "worst case" counts. In any event, the clock counts should be multiplied by the clock period and by the number of times CriticalFunction is called in a typical application to calculate the time delay that the end user is likely to experience.
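The arithmetic described above can be sketched as follows. The names TimingSummary, Summarize and TotalSeconds are hypothetical and not part of the examples above. Taking the smallest of the subsequent counts as the "best case" is an assumption made here; it has the side effect of ignoring occasional counts that are much higher than normal.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type for the two values discussed in the text
struct TimingSummary {
    long long worstCase;     // First run: code and data not cached
    long long bestCase;      // Smallest later run: code and data cached
};

// Subtract the overhead of the timer calls from the counts and split them
// into the first count and the minimum of the subsequent counts.
// overhead is the count measured with the call to the critical function removed
inline TimingSummary Summarize(const long long * timediff, size_t n,
                               long long overhead) {
    assert(n >= 2);                       // Need a first run and a later run
    TimingSummary s;
    s.worstCase = timediff[0] - overhead;
    long long best = timediff[1];
    for (size_t i = 2; i < n; i++) {      // Find smallest subsequent count
        if (timediff[i] < best) best = timediff[i];
    }
    s.bestCase = best - overhead;
    return s;
}

// Clock counts multiplied by the clock period and by the number of calls
// gives the total delay in seconds
inline double TotalSeconds(long long clocks, double clockFrequencyHz,
                           long long numberOfCalls) {
    double clockPeriod = 1.0 / clockFrequencyHz;   // Seconds per clock cycle
    return double(clocks) * clockPeriod * double(numberOfCalls);
}
```

As an illustration with made-up numbers: if timediff contains 5000, 1240, 1200, 1210, 9000 and the overhead is 200 clocks, then Summarize gives a worst case of 4800 and a best case of 1000 clocks. With an assumed clock frequency of 2 GHz and one million calls, TotalSeconds(1000, 2.0e9, 1000000) gives a total delay of about 0.5 seconds.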

Occasionally, the clock counts that you measure are much higher than normal. This happens when a task switch occurs during execution of CriticalFunction. You cannot avoid this in a protected operating system, but you can reduce the problem by increasing the thread priority before the test and setting the priority back to normal afterwards.
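The following sketch shows one way to raise the thread priority during a test and set it back afterwards. It assumes a Windows system and uses the API functions GetCurrentThread, GetThreadPriority and SetThreadPriority. The class name PriorityBoost is hypothetical. On other systems the sketch does nothing, because raising the priority there typically requires special privileges.

```cpp
#include <cassert>
#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical helper: raises the priority of the current thread while the
// object exists and restores the previous priority when it is destroyed
class PriorityBoost {
public:
    PriorityBoost() {
#if defined(_WIN32)
        thread = GetCurrentThread();
        oldPriority = GetThreadPriority(thread);
        if (oldPriority != THREAD_PRIORITY_ERROR_RETURN) {
            raised = SetThreadPriority(thread, THREAD_PRIORITY_HIGHEST) != 0;
        }
#endif
    }
    ~PriorityBoost() {
#if defined(_WIN32)
        if (raised) {                      // Set priority back to normal
            BOOL ok = SetThreadPriority(thread, oldPriority);
            assert(ok);
            (void)ok;
        }
#endif
    }
    bool IsRaised() const { return raised; }   // False if nothing was changed
    PriorityBoost(const PriorityBoost &) = delete;
    PriorityBoost & operator = (const PriorityBoost &) = delete;
private:
#if defined(_WIN32)
    HANDLE thread = nullptr;
    int oldPriority = 0;
#endif
    bool raised = false;
};
```

A PriorityBoost object can be declared in a block around the test loop of example 16.2, so that the priority is restored automatically when the block is left. A higher priority makes task switches less likely, but it does not prevent them.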

The clock counts often fluctuate, and it may be difficult to get reproducible results. This is because modern CPUs can change their clock frequency dynamically depending on the work load. The clock frequency is increased when the work load is high and decreased when the work load is low in order to save power. There are various ways to get more reproducible time measurements:

• warm up the CPU by giving it some heavy work to do immediately before the code to test.

• disable power-save options in the BIOS setup.

• on Intel CPUs: use the core clock cycle counter (see below)
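The first of these points can be sketched as follows. The function name WarmUpCPU is hypothetical, and so is the choice of work: any calculation that keeps the CPU busy and cannot be optimized away would do. The sketch uses std::chrono::steady_clock only to decide when to stop.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: keeps the CPU busy for at least the given number of
// milliseconds. The result depends on every iteration and is stored in a
// volatile variable so that the compiler cannot remove the loop
inline uint64_t WarmUpCPU(int milliseconds) {
    assert(milliseconds >= 0);
    using steady = std::chrono::steady_clock;
    const auto stop = steady::now() + std::chrono::milliseconds(milliseconds);
    volatile uint64_t sink = 0;
    uint64_t x = 88172645463325252ULL;       // Nonzero start value
    do {
        for (int i = 0; i < 100000; i++) {   // Xorshift steps as dummy work
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
        }
        sink = x;                            // Make the work observable
    } while (steady::now() < stop);
    return sink;
}
```

A call such as WarmUpCPU(200) can be placed immediately before the test loop in example 16.2. The value of 200 milliseconds is only a guess; the time needed for the clock frequency to go up depends on the CPU and the power settings.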
