
4.2 Magnification map generation

4.2.3 Parallel magnification map generation by both CPU and GPU

Solving the image area in parallel

We may speed up the computation for solving the image area by using the GPU. The GPU can evaluate many lens equations simultaneously, thus achieving a higher throughput than the CPU. However, the results have to be transferred from GPU memory back to CPU memory through the PCI-E bus. The coordinates of the active cells are then arranged in an array, since the GPU ray shooting kernel requires a list of active cell coordinates as input to achieve maximum performance. Overall, this approach is about 50% faster than using the CPU alone. The arithmetic throughput of the GPU is in fact much higher, but the PCI-E bus acts as a bottleneck because it is much slower than the memory bus.
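The search for active cells can be illustrated with a minimal CPU-side sketch. It is not the thesis code: it assumes a point-mass lens equation in complex notation and a circular region of interest on the source plane, and the names `lens_map`, `find_active_cells`, `centre` and `radius` are hypothetical. It evaluates the lens equation at every grid point and packs the coordinates of the cells that map into the region of interest into one flat array, the kind of input a ray shooting kernel would consume.

```python
def lens_map(z, lenses):
    """Assumed point-mass lens equation: w = z - sum_i m_i / conj(z - z_i)."""
    w = z
    for mass, pos in lenses:
        d = z - pos
        if d == 0:
            return None  # grid point sits exactly on a lens; no finite image
        w -= mass / d.conjugate()
    return w


def find_active_cells(lo, hi, n, lenses, centre, radius):
    """Return a flat list [x0, y0, x1, y1, ...] of grid points whose
    source-plane position lies within `radius` of `centre`."""
    step = (hi - lo) / (n - 1)  # plays the role of the grid scale
    active = []
    for j in range(n):
        y = lo + j * step
        for i in range(n):
            x = lo + i * step
            w = lens_map(complex(x, y), lenses)
            if w is not None and abs(w - centre) <= radius:
                active.extend((x, y))
    return active


# Single unit-mass lens at the origin: active cells cluster around the
# Einstein ring (radius 1 in these units).
cells = find_active_cells(-1.2, 1.2, 25, [(1.0, 0j)], 0j, 0.1)
print(len(cells) // 2, "active cells")
```

Every grid point is independent of the others, which is why this step maps naturally onto one GPU thread per grid point.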

One has to determine the balance between the precision of the image boundary and the number of unnecessary computations later in the ray shooting step. For example, consider a lens plane with corners (lx1, ly1) = (-1.2, -1.2) and (lx2, ly2) = (1.2, 1.2), and a grid resolution defined by the grid scale, gscale, the width of a grid cell in units of the Einstein ring radius, RE. With a reasonably dense grid resolution, gscale = 0.001 RE, there are (lx2 − lx1)/gscale + 1 = 2201 grid points on the x-axis and (ly2 − ly1)/gscale + 1 = 2201 grid points on the y-axis, giving a total of 2201 × 2201 = 4844401 grid points to be evaluated by the lens equation. Evaluating them takes around 1 second on a 2.4GHz Core 2 Duo CPU using a single core. This is insignificant when the magnification map is generated on the CPU: shooting one billion rays takes minutes, so the search for active cells accounts for less than 1% of the total computation time. However, the same ray shooting computation takes only around 2.6 seconds on an NVIDIA GTX470 graphics card. The time for solving the image area then becomes a significant part of the computation, accounting for more than 25% of the total map generation time.
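The arithmetic in this example can be checked with a short sketch. The function names are hypothetical, and the 120 second CPU ray shooting time is only an assumed stand-in for "minutes". Note that a count of 2201 points per axis corresponds to a span of 2.2 RE; with the bounds of ±1.2 quoted above, the same formula gives 2401.

```python
def grid_points_per_axis(lo, hi, gscale):
    """Number of grid points on one axis: (hi - lo) / gscale + 1."""
    # round() guards against floating-point error in the division.
    return round((hi - lo) / gscale) + 1


def search_fraction(t_search, t_shoot):
    """Fraction of the total map generation time spent searching for active cells."""
    return t_search / (t_search + t_shoot)


n = grid_points_per_axis(-1.1, 1.1, 0.001)       # span of 2.2 RE
print(n, n * n)                                   # 2201 4844401
print(grid_points_per_axis(-1.2, 1.2, 0.001))     # 2401

# GPU ray shooting (2.6 s) against a 1 s search on the CPU:
print(round(search_fraction(1.0, 2.6), 3))        # 0.278
# CPU ray shooting, assuming 120 s as an illustrative value for "minutes":
print(search_fraction(1.0, 120.0) < 0.01)         # True
```

With a 1 second search and 2.6 seconds of ray shooting, the search takes about 28% of the total, consistent with the figure of more than 25%.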

When a microlensing event is being modelled, thousands of maps may be generated in order to find the right model. To utilize the GPU efficiently when generating multiple maps, the image searching and the ray shooting can be done in parallel. Figure 4.10 shows the procedure for parallel magnification map generation. Since control returns to the host as soon as a GPU kernel is launched, we can perform tasks on the CPU and GPU simultaneously. We first divide the image plane into equal areas. The CPU solves the image positions (active grid cells) in the first area, then launches a GPU kernel to perform ray shooting from those active grid cells. While that kernel runs, the CPU solves the image positions in the second area and launches the GPU kernel again once it is done. The procedure is repeated until all the divided areas of the image plane have been searched and processed.

[Figure 4.10: timeline with a CPU column and a GPU column. Annotated steps: (1) search the first batch of active grid cells using the CPU; (2) generate random numbers and shoot rays for active grid cells batch #1 using the GPU while searching batch #2 using the CPU; (3) repeat this procedure, while there are more active grid cells, until the whole image plane has been sampled; (4) check whether the shooting grid needs to be enlarged; (5) process the remaining active grid cells. For each batch, the GPU side copies the active grid cells array to GPU memory, runs the random number generation kernel, then runs the ray shooting kernel with that batch as input.]

Figure 4.10: Parallel magnification map generation by both CPU and GPU.
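The overlap between the two stages can be sketched in a few lines. This is an illustration under stated assumptions, not the thesis implementation: a single worker thread stands in for the GPU, `submit()` stands in for an asynchronous kernel launch, and `solve_image_positions`, `shoot_rays` and `generate_map` are hypothetical placeholders (the even-number filter is arbitrary). The final boundary check and grid enlargement are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_image_positions(area):
    """CPU stage (placeholder): return the active grid cells of one area."""
    return [cell for cell in area if cell % 2 == 0]


def shoot_rays(active_cells):
    """GPU stage (placeholder): stands in for the copy to GPU memory,
    random number generation and the ray shooting kernel."""
    return len(active_cells)


def generate_map(areas):
    """Search area k+1 on the CPU while area k is being ray-shot."""
    rays_per_batch = []
    # One worker thread plays the role of the GPU; submit() returns at once,
    # like a kernel launch that hands control back to the host.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        pending = None
        for area in areas:
            active = solve_image_positions(area)         # overlaps previous batch
            if pending is not None:
                rays_per_batch.append(pending.result())  # wait for previous batch
            pending = gpu.submit(shoot_rays, active)     # launch and continue
        if pending is not None:
            rays_per_batch.append(pending.result())      # drain the last batch
    return rays_per_batch


print(generate_map([range(0, 10), range(10, 20), range(20, 25)]))  # [5, 5, 3]
```

When the two stages take similar times per batch, the total approaches the time of the slower stage plus one search of the first batch, rather than the sum of both stages.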

4.2.4 Performance comparison

[Figure 4.11: plot titled "Magnification map generation performance comparison", with axes time (seconds) and rays shot (millions), and one series each for the 2.4GHz Core 2 Duo, 2.66GHz Core i7 920, GTX480 and 9800GX2.]

Figure 4.11: Performance comparison between CPU and GPU on magnification map generation. Each experiment is performed six times and the average time is calculated using the last five runs. The difference between each experiment is the number of rays shot. The uncertainties are comparable to the symbol sizes.

Figure 4.11 compares the magnification map generation performance of the CPU and the GPU across different generations of hardware. The CPU version of the magnification map generation code is multi-threaded and uses all the CPU cores for processing. The GPU still outperforms the CPU by a large margin even though all CPU cores are being used.

In document Simulation and modelling of gravitational microlensing events using graphical processing units : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany (Auckland (Page 84-86)