4.2 Magnification map generation
4.2.3 Parallel magnification map generation by both CPU and GPU
Solving the image area in parallel
We may speed up the computation for solving the image area by using the GPU. The GPU can evaluate many lens equations simultaneously, thus achieving a higher throughput than the CPU. However, the results have to be transferred from GPU memory back to CPU memory through the PCI-E bus. The coordinates of the active cells are then arranged in an array, since the GPU ray shooting kernel requires a list of active cell coordinates as input to achieve maximum performance. Overall, this approach is about 50% faster than using the CPU. The arithmetic throughput of the GPU is actually much higher, but the PCI-E bus acts as a bottleneck since it is much slower than the memory bus.
One has to balance the precision of the image boundary against the number of unnecessary computations later in the ray shooting step. For example, consider a lens plane with corners (lx1, ly1) = (-1.2, -1.2) and (lx2, ly2) = (1.2, 1.2), and a grid resolution defined by the grid scale, gscale, the width of a grid cell in units of the Einstein ring radius, RE. With a reasonably dense grid, gscale = 0.001 RE, there are (lx2 − lx1)/gscale + 1 = 2201 grid points on the x-axis and (ly2 − ly1)/gscale + 1 = 2201 grid points on the y-axis, for a total of 2201 × 2201 = 4844401 grid points to be evaluated by the lens equation. This takes around 1 second on a 2.4GHz Core 2 Duo CPU using a single core. That is insignificant when a magnification map with one billion rays is generated on the CPU, which takes minutes: the search for active cells accounts for <1% of the total computation time. However, the same ray shooting computation takes only around 2.6 seconds on an NVIDIA GTX470 graphics card. The time for solving the image area then becomes a significant part of the computation, accounting for >25% of the total map generation time.
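The arithmetic in this example can be checked with a short sketch. The function names are hypothetical, and the time fraction assumes that the total map generation time is simply the search time plus the ray shooting time, which the text does not state explicitly. Note that the formula reproduces the quoted 2201 points per axis for a plane 2.2 RE wide; for corners at ±1.2 it gives 2401, so the extent is left as a parameter.

```python
def grid_points_per_axis(lo, hi, gscale):
    """Number of grid points along one axis, both end points included."""
    # round() guards against floating-point error in the division.
    return round((hi - lo) / gscale) + 1


def search_time_fraction(t_search, t_shoot):
    """Share of the total time spent searching for active cells.

    Assumption: total time = search time + ray shooting time.
    """
    return t_search / (t_search + t_shoot)


# A plane 2.2 RE wide at gscale = 0.001 RE gives the quoted grid size.
n = grid_points_per_axis(-1.1, 1.1, 0.001)
total = n * n

# About 1 s of CPU search against about 2.6 s of GPU ray shooting.
frac = search_time_fraction(1.0, 2.6)
print(n, total, f"{frac:.1%}")
```

With the quoted timings the search takes a little over a quarter of the total, which is consistent with the >25% figure.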
When a microlensing event is being modelled, thousands of maps may be generated in order to find the right model. In order to utilize the GPU efficiently when generating multiple maps, the image searching and the ray shooting can be done in parallel. Figure 4.10 shows the procedure for parallel magnification map generation. Since the host regains control as soon as a GPU kernel is launched, we can perform tasks on the CPU and the GPU simultaneously. We first divide the image plane into equal areas and solve the image positions (active grid cells) of the first area on the CPU, then launch a GPU kernel to perform ray shooting from those active grid cells. While that kernel runs, the host (CPU) solves the image positions of the second area and launches the GPU kernel for it once the search is done. The whole procedure is repeated until all the divided areas on the image plane have been searched and processed.

[Figure 4.10 is a timeline with a CPU column and a GPU column. CPU: initialize; solve image positions for batch #1, #2, ..., #n; check the image boundary. GPU, for each batch: copy the active grid cells array to GPU memory; run the random number generation kernel; run the ray shooting kernel with that batch of active grid cells as input; finally, shoot rays for the remaining active grid cells. Annotations: Search the first batch of active grid cells using the CPU. Generate random numbers and shoot rays for active grid cells batch #1 using the GPU while searching batch #2 using the CPU. Repeat this procedure until the whole image plane has been sampled. Check whether the shooting grid needs to be enlarged. If there are more active grid cells, process the remaining active grid cells.]
Figure 4.10: Parallel magnification map generation by both CPU and GPU.
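The overlap between image searching on the CPU and ray shooting on the GPU can be illustrated with a minimal control-flow sketch. This is not the actual implementation: a single worker thread stands in for the asynchronous GPU kernel launch, and `find_active_cells`, `shoot_rays` and `generate_map` are hypothetical placeholders whose bodies are dummies. Only the ordering of the steps is meant to match the procedure described above.

```python
from concurrent.futures import ThreadPoolExecutor


def find_active_cells(area):
    """CPU stand-in: keep the cells of this area that contain an image.

    Dummy rule for illustration only: even-numbered cells are 'active'.
    """
    return [cell for cell in area if cell % 2 == 0]


def shoot_rays(active_cells):
    """GPU stand-in: 'shoot rays' and report how many cells were processed."""
    return len(active_cells)


def generate_map(areas):
    """Search area k+1 on the CPU while area k is being processed."""
    results = []
    pending = None  # handle of the batch currently on the 'GPU'
    # One worker models one GPU that processes batches in order.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        for area in areas:
            # CPU search; runs while the previous batch is still in flight.
            cells = find_active_cells(area)
            if pending is not None:
                results.append(pending.result())  # wait for previous batch
            # Returns immediately, like an asynchronous kernel launch.
            pending = gpu.submit(shoot_rays, cells)
        if pending is not None:
            results.append(pending.result())  # drain the last batch
    return results


print(generate_map([[0, 1, 2], [3, 4], [5, 6, 7, 8]]))
```

Because the search for the next area is issued before waiting on the previous batch, the two stages overlap whenever both take a comparable amount of time. In pure Python the threads only illustrate the ordering; the real gain comes from the GPU executing independently of the host.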
4.2.4 Performance comparison
[Plot: Magnification map generation performance comparison. Axes: time (seconds), 0 to 120, against rays shot (millions), 400 to 1600. Series: 2.4GHz Core 2 Duo, 2.66GHz Core i7 920, GTX480, 9800GX2.]
Figure 4.11: Performance comparison between CPU and GPU on magnification map generation. Each experiment is performed six times and the average time is calculated from the last five runs. The experiments differ only in the number of rays shot. The uncertainties are comparable to the symbol sizes.
Figure 4.11 compares the magnification map generation performance of the CPU and the GPU across different generations of hardware. The CPU version of the magnification map generation code is multi-threaded and utilizes all the CPU cores. The GPU still outperforms the CPU by a large margin even though all CPU cores are being used.