2.4 Vegas using a GPU
2.4.2 Random number generation
We have demonstrated the overwhelming superiority of GPUs in terms of execution time. However, this comes at a cost: large memory consumption. This is troublesome in two ways. First, the available memory limits the number of samples. Second, the memory bandwidth between host and device is known to be quite low, which also affects the execution time.
For a serial execution it is easiest to hold only as much memory as necessary, most of which can even reside on the stack. In the parallelized version, however, we have to store, for example, the results of all samples for subsequent processing. This storage is inevitable, but in our naive implementation we also have to store the random numbers that are used to determine the sample points. We now aim to produce the random numbers directly on the device to overcome this problem.
Generating random numbers on a computer is a delicate task, as all computations are deterministic by construction. To obtain truly random numbers it would be necessary to observe a physical process that is subject to true randomness.
Indeed the computer has access to suitable observables, for example the fluctuations of the supply voltage. In practice, however, this is cumbersome. Additionally, real randomness is not necessary. It is sufficient to produce numbers that are deterministic but appear to be random. Functions that provide such numbers are called pseudo random number generators (PRNGs). Their randomness can be tested against expectations that would hold for true randomness. To our knowledge, the most extensive tests currently available are provided by the TestU01 library [150], whose three test batteries are named ‘Small Crush’, ‘Crush’ and ‘Big Crush’ and consist of 10, 96 and 106 tests respectively. A random number generator that passes all of these tests is called Crush-resistant. This is not necessary for most practical purposes, although it is certainly comforting. We would also like to mention that pseudo random numbers even have a benefit over truly random numbers: they are reproducible, which is extremely useful for comparisons or debugging purposes.
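To make the idea of testing against the expectation for true randomness concrete, the following Python sketch fills a histogram with generated numbers and computes the chi-square statistic against a flat distribution. It is far simpler than any test in TestU01 and is not taken from that library. The function name chi_square_uniformity and the use of Python's built-in generator are assumptions made for this illustration only. The sketch also shows the reproducibility mentioned above: two generators with the same seed yield the same statistic.

```python
import random

def chi_square_uniformity(draw, n_samples=100_000, n_bins=10):
    """Chi-square statistic of n_samples values from draw() against a flat histogram.

    draw is any callable returning a float in [0, 1).
    """
    counts = [0] * n_bins
    for _ in range(n_samples):
        u = draw()
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    expected = n_samples / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)

# Two generators with the same seed produce the same sequence,
# hence exactly the same statistic.
stat_a = chi_square_uniformity(random.Random(12345).random)
stat_b = chi_square_uniformity(random.Random(12345).random)

# A deliberately bad "generator" that always returns the same number
# puts every sample into one bin and yields a huge statistic.
stat_bad = chi_square_uniformity(lambda: 0.05, n_samples=1000)
```

For uniformly distributed numbers the statistic is expected to scatter around n_bins - 1, whereas the constant generator gives 9 times the number of samples.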
In our naive implementation of Vegas we used a very common class of PRNGs, namely linear congruential generators (LCGs). They consist of a function f that generates a sequence of integers
f (n) = (a · n + c) mod m (2.33)
where each result is used as the next input; f is therefore called the state transition function. A subsequently applied function g (the output function) then maps the integers uniformly to real values, usually in the interval (0, 1), for example
g(n) = n / m (2.34)
Here the multiplier a and the modulus m are positive integers. The constant c has to be smaller than m and is often chosen to be zero, in which case the generator is a multiplicative linear congruential generator (MLCG).⁶ The quality of an MLCG is determined by the choice of a and m. Extensive studies have been performed to find good values, see for example [153]. To improve the generators further, several of them can be combined as proposed in [154]. In our naive implementation we followed the best values given there for a combination of two distinct MLCGs. A very impressive fact about this generator is that it passes all but one test of the Small Crush test battery, while the famous Mersenne Twister algorithm [155], which is also used in the GSL Vegas version, fails two of the tests. To further improve the generator we adopted the idea proposed in [148] that an additional shuffle board breaks up the serial correlations that are still present.
⁶ Another way to generate numbers that seem to be random are so-called quasi random sequences, which sample a given volume more uniformly than an MLCG does. A particularly useful example are Sobol sequences [151, 152]. We will, however, not consider them in this chapter.
Note that by using a shuffle board one loses the ability of an MLCG to jump forward in the sequence, an ability that still holds after the combination of several MLCGs [154]. Furthermore, we did not verify whether the shuffle board really improves the randomness in terms of Crush-resistance.
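The following Python sketch illustrates the state transition function f, the output function g, the jump-ahead property and a shuffle board for a single MLCG. It is not the combined generator of the naive implementation: the constants are the well-known "minimal standard" values a = 16807 and m = 2^31 - 1 rather than the values of [153, 154], and the class name ShuffledMLCG and the table size of 32 are assumptions made for this example.

```python
M = 2**31 - 1   # modulus of the "minimal standard" MLCG (assumed for this sketch)
A = 16807       # multiplier of the "minimal standard" MLCG

def f(n, a=A, m=M):
    """State transition function of an MLCG: eq. (2.33) with c = 0."""
    return (a * n) % m

def g(n, m=M):
    """Output function, eq. (2.34): maps a state in [1, m-1] into (0, 1)."""
    return n / m

def jump(n, k, a=A, m=M):
    """Advance the state by k steps at once, using f^k(n) = a^k * n mod m."""
    return (pow(a, k, m) * n) % m

class ShuffledMLCG:
    """MLCG whose states pass through a shuffle board before being output."""

    def __init__(self, seed, table_size=32):
        self.state = seed % M or 1   # state 0 would be a fixed point of f
        self.table = []
        for _ in range(table_size):  # fill the board with the first states
            self.state = f(self.state)
            self.table.append(self.state)
        self.state = f(self.state)
        self.last = self.state

    def next_uniform(self):
        j = self.last % len(self.table)  # slot chosen by the previous output
        self.last = self.table[j]        # output the stored state
        self.state = f(self.state)
        self.table[j] = self.state       # refill the slot with a fresh state
        return g(self.last)

# Jump-ahead: 1000 single steps agree with one jump of length 1000.
n = 1
for _ in range(1000):
    n = f(n)
jump_agrees = (n == jump(1, 1000))

gen = ShuffledMLCG(seed=42)
values = [gen.next_uniform() for _ in range(5)]
```

The plain MLCG can reach the k-th state directly through jump, whereas the output order of ShuffledMLCG depends on the history of the board, which is why the jump-ahead ability is lost.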
In principle MLCGs can be used in parallelized applications, since different seeds produce distinct sequences. However, as just described, they are known to be flawed and typically do not pass all randomness tests. Although this will most likely not affect the quality of Vegas, we decided on another approach proposed in [156], which is Crush-resistant and explicitly designed for parallel use on GPUs. ‘Conventional’ MLCGs focus on the state transition function f by finding good values for a and m, while the output function g, which maps the integers to the interval (0, 1), is trivial. In contrast, the authors of [156] use a very simple state transition function, which reduces it to a kind of simple counter. Hence this generator type is called ‘counter based’. To achieve randomness they focus instead on the output function g, using advanced cryptographic techniques. As a result the PRNG Philox is presented, which has all the described properties. Additionally it is very attractive due to its very small internal state and its high speed on the device. On the host it is still reasonably fast, achieving half the speed of the Mersenne Twister.
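The counter-based idea can be sketched in a few lines of Python. The code below is loosely modeled on the round structure of the two-word, 32-bit Philox variant of [156]. The constants are quoted from memory and the code has not been checked against the reference test vectors, so it should be read as an illustration of the principle and not as Philox itself. The function names and the mapping from sample index and dimension to a counter are assumptions made for this example.

```python
MASK32 = 0xFFFFFFFF
MULT = 0xD256D193   # round multiplier (assumed, not verified against [156])
WEYL = 0x9E3779B9   # key increment per round (assumed, not verified)

def counter_based_words(counter, key, rounds=10):
    """Map a 64-bit counter and a 32-bit key to two 32-bit words.

    The 'state transition' is just counter -> counter + 1, done by the caller;
    all the scrambling happens in this output function.
    """
    key &= MASK32
    left = counter & MASK32
    right = (counter >> 32) & MASK32
    for _ in range(rounds):
        product = MULT * left                   # 32 x 32 -> 64 bit multiplication
        hi, lo = product >> 32, product & MASK32
        left, right = hi ^ key ^ right, lo      # mix in the key and swap the words
        key = (key + WEYL) & MASK32             # new round key
    return left, right

def uniform(counter, key):
    """Uniform number in the open interval (0, 1) for the given counter."""
    word, _ = counter_based_words(counter, key)
    return (word + 0.5) / 2**32

def sample_point(sample_index, stream_key, dim):
    """Coordinates of one sample, computed without any stored generator state."""
    base = sample_index * dim
    return [uniform(base + d, stream_key) for d in range(dim)]

# Each sample can be produced independently and in any order, which is what a
# GPU thread needs: it only has to know its own sample index.
p_late = sample_point(123456, stream_key=7, dim=3)
p_early = sample_point(0, stream_key=7, dim=3)
```

Because every round is invertible for a fixed key, distinct counters always give distinct pairs of output words. No random numbers have to be stored or transferred, since each thread can recompute its numbers from its own index.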
To measure the effect of the PRNG we isolate the generation of the random numbers (if this happens on the host) and the kernel call that calculates the integrand function. The integrand is reduced to a sum over all coordinates in the hypercube, so that it needs only a small amount of the execution time. Additionally we perform a synchronization after the kernel to avoid a flawed measurement caused by the asynchronous execution of the kernel with respect to the host code. The results are shown in figure 2.7 for 10^6 sample points in 20 dimensions. The large number of dimensions focuses the measurement on the production of random numbers. A larger number of dimensions was not possible for single precision without changing the code base, because the total number of increments M^d exceeds the limit of single precision for larger dimensions. For consistency we did not increase the number of dimensions for double precision. To average the process time per sample, the measurement consists of a warm-up run of one iteration followed by the measured integration, which consists of 20 iterations. The time is taken for every single iteration and averaged for the results presented in figure 2.7.
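The measurement protocol described above can be sketched for a pure host version as follows. This is not the code used for figure 2.7: the function names, the use of Python's built-in generator and the small default sizes are assumptions, and the absolute numbers it produces are not comparable to the figure.

```python
import random
import time

def integrand(point):
    """Minimal integrand of the measurement: the sum over all coordinates."""
    return sum(point)

def time_per_sample_ns(n_samples, dim, n_iterations=20, seed=1):
    """Average time per sample in ns for random number generation plus integrand."""
    rng = random.Random(seed)

    def one_iteration():
        start = time.perf_counter()
        total = 0.0
        for _ in range(n_samples):
            total += integrand([rng.random() for _ in range(dim)])
        # On a GPU a device synchronization would be required at this point,
        # before the clock is read; the host-only sketch has nothing to wait for.
        return time.perf_counter() - start

    one_iteration()                                   # warm-up run, not measured
    times = [one_iteration() for _ in range(n_iterations)]
    return 1e9 * sum(times) / (n_iterations * n_samples)

ns_per_sample = time_per_sample_ns(n_samples=1000, dim=20)
```

The warm-up iteration is discarded so that one-time costs do not enter the average, and the time is taken separately for each of the measured iterations.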
For the host system we find that the Philox PRNG is only slightly faster than the two combined MLCGs with additional shuffle board, which we call ‘L’Ecuyer’ after its inventor. For the devices we measure a large improvement of the process time using the Philox PRNG, which can be ascribed to two effects. First, as mentioned before, the memory transfer of the random numbers from the host to the device is extremely slow. Second, the random numbers themselves are generated in parallel instead of serially. Note that the GTX TITAN system is slower than the GTX 680 system when the L’Ecuyer PRNG is used. The reason is that for this measurement the random number generation is dominant and executed on the host, and the host of the GTX TITAN system is slower, which overcompensates the superiority of the GTX TITAN over the GTX 680.
[Figure 2.7, bar chart. Axis: process time per sample [ns], with ticks at 10^1, 10^2 and 10^3. Bars for the L’Ecuyer and the Philox PRNG in single and double precision, for finael host, finael device GTX 680 and finael device GTX TITAN.]
Figure 2.7: Execution time of the generation of random numbers and a minimal integration kernel of Vegas for the L’Ecuyer and the Philox PRNG. Details of the measurement are provided in the text.
[Figure 2.8, bar chart. Axis: speedup normalized to the GSL implementation in double precision, from 0 to 400. Bars in single and double precision for GSL, finael host, finael device GTX 680 and finael device GTX TITAN. PRNG: Philox.]
Figure 2.8: Same as figure 2.6, but using the Philox PRNG for the finael implementations.
In summary, we can improve Vegas for the host as well as for the device by using Philox as PRNG, not only with respect to memory as initially intended, but also in terms of speed. Furthermore, the quality of the produced random numbers improves and becomes Crush-resistant. To quantify the speed improvement for a typical use case we repeat the measurement of the last subsection (see figure 2.6), but with the Philox PRNG instead of the naive implementation using the L’Ecuyer PRNG. The results are shown in figure 2.8. As expected, the times for the host version do not change, or change only insignificantly. On the devices, however, we detect a measurable improvement. This means that the usage of a device made the originally compute bound calculation less compute bound and more memory bound. However, compared to other applications of GPGPU such as lattice QCD, which are almost completely memory bound, we are still mostly compute bound, although the memory transfers contribute to a measurable degree. The improvement depends on the GPU: for the GTX 680 we find a modest improvement of three percent for double precision, but an already sizable improvement of ten percent for single precision. For the GTX TITAN setup the improvement is even bigger, being about 13 percent for double and 22 percent for single precision. As a consequence the factors relative to the host version also improve significantly in these cases. The single precision calculation on the GTX 680 is now more than 280 times faster than the double precision calculation on the host. For the GTX TITAN the calculations are more than 170 times faster for double and more than 360 times faster for single precision.
[Figure 2.9, two panels titled ‘Classic’ and ‘Refined’, each with axes x0 and x1.]
Figure 2.9: Processing of bins in two dimensions using the classic manner (left) and refined for stratified sampling (right). The bins (thin green lines) are separated for clarity. Increment borders are drawn with thick black lines. The green arrows indicate the processing sequence of the bins.