
The main goal of using parallel processing and GPUs for calculation is better performance and scalability. All the described simulation algorithms were implemented on both the CPU (single-threaded version) and the GPU (parallel OpenCL version) for comparison. The test results are shown in Figure III-13.

Figure III-13: CPU vs. GPU simulation performance

Performance testing was carried out on a Dell Precision M6500 mobile workstation with an Intel i7-820QM CPU (4 cores, 1.73 GHz) and an NVidia Quadro FX2800M GPU (96 CUDA cores, 1.5 GHz), using a 2048x2048 height map resolution.

When comparing hardware, it is important to consider cost. Because the hardware used is available only to OEMs and its prices are not publicly available, prices were estimated from retail GPU and CPU hardware with very similar performance:

• NVidia QuadroFX 2800M ~ NVidia GTS 250 ~ $125

• Intel i7-820QM ~ Intel i5-650 ~ $180

This comparison shows that the parallel algorithm running on a low-end GPU, which costs less than the CPU, provides 8X better performance than the non-parallel algorithm on the CPU. To make the comparison more interesting, the fastest CPU available on the market at the time of the comparison, the Intel i7-975 with a price of ~$1000, was also considered. Its performance was estimated by assuming utilization of all available cores and linear performance growth. Since linear scaling is the maximum theoretically achievable result, the extrapolated figure is an upper bound: the best CPU available on the market can at most reach the performance of the low-end GPU, which is ~8X cheaper. This result shows that GPUs may provide significantly better performance than the best available multi-core CPUs for the same task if the right parallel algorithm is employed.
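The linear extrapolation can be illustrated with a short calculation. This is only a sketch: the text does not state how the estimate was computed, so the code assumes that performance scales linearly with both core count and clock frequency. The i7-975 figures (4 cores, 3.33 GHz nominal clock) are not given in the text and are taken from the processor's published specifications.

```python
# Linear extrapolation of CPU performance, relative to one i7-820QM thread.
# Assumption (not from the text): speed scales linearly with both the number
# of cores and the clock frequency.

BASE_CLOCK_GHZ = 1.73     # i7-820QM clock, from the test setup
TARGET_CLOCK_GHZ = 3.33   # i7-975 nominal clock (not stated in the text)
TARGET_CORES = 4          # i7-975 core count (not stated in the text)
GPU_SPEEDUP = 8.0         # GPU advantage over the single-threaded CPU version


def extrapolated_cpu_speedup(cores, clock_ghz, base_clock_ghz):
    """Upper bound on the speedup over one base-clock thread under linear scaling."""
    return cores * clock_ghz / base_clock_ghz


cpu_speedup = extrapolated_cpu_speedup(TARGET_CORES, TARGET_CLOCK_GHZ, BASE_CLOCK_GHZ)
price_ratio = 1000 / 125  # estimated i7-975 price over estimated GTS 250 price

print(f"extrapolated CPU speedup: {cpu_speedup:.1f}x (GPU: {GPU_SPEEDUP:.0f}x)")
print(f"CPU/GPU price ratio: {price_ratio:.0f}x")
```

Under these assumptions the extrapolated CPU comes out slightly below the measured 8X GPU advantage, which is consistent with the statement that the fastest CPU can at best match the low-end GPU.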

In contrast to the CPU, work on the GPU is always divided into groups (blocks). Each GPU contains multiple multiprocessors, each with many cores. Selecting the best block size is an important performance optimization step. Figure III-14 shows that the implemented simulation algorithm runs faster with larger group sizes. It also shows a relatively constant performance penalty due to the collision avoidance algorithm, which performs additional synchronization between threads to prevent collisions.

Another important factor is the dependency of the total simulation time on the number of path points processed per iteration (global size), as shown in Figure III-15.
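The relation between global size, group size, and the number of iterations can be sketched as follows. The function names and the path length of one million points are hypothetical and chosen only for illustration. The identity `global_id = group_id * local_size + local_id` is the standard one-dimensional OpenCL indexing, and OpenCL 1.x requires the global size to be a multiple of the group size.

```python
import math


def launch_geometry(total_points, global_size, local_size):
    """Return (iterations, groups per iteration) for a 1-D kernel launch.

    total_points: number of path points in the whole tool path
    global_size:  path points (work-items) processed per iteration
    local_size:   work-items per group
    """
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the group size")
    iterations = math.ceil(total_points / global_size)
    groups_per_iteration = global_size // local_size
    return iterations, groups_per_iteration


def split_global_id(global_id, local_size):
    """Map a global work-item index to (group_id, local_id)."""
    return divmod(global_id, local_size)


# Hypothetical tool path of one million points.
iterations, groups = launch_geometry(1_000_000, global_size=8192, local_size=64)
print(iterations, groups)
print(split_global_id(130, 64))
```

A larger global size means fewer iterations for the same path, and each iteration hands the GPU more groups to schedule.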


Figure III-14: Performance vs. Group size (global size = 8k)

[Plot axes: Group size (x); Time to process entire path, sec (y)]

Figure III-15: Performance vs. Global size

[Plot axes: Global size, 256 to 32768 (x); Time to process entire path, sec (y). Series: LG 1, LG 4, LG 16, LG 64, LG 256]


Larger global sizes result in much faster simulation due to better utilization of the GPU. Figure III-15 also shows that the GPU provides good performance only when the number of processed path points (or working threads) is high enough to load the entire GPU and to hide the memory access latency. Figure III-16 shows that a high number of working threads may completely hide the cost of the additional synchronization required by the collision avoidance algorithm.

Figure III-16: Effect of collision avoidance on performance

The most important conclusion about GPU performance is that the GPU must have enough work and enough threads running in parallel to show good results. Although the GPU may yield excellent performance if it has enough work to do, the converse also holds: performance can be very poor if there is not enough work, as shown in Figure III-17.

[Figure III-16 plot axes: Global size, 256 to 32768 (x); Time to process entire path, sec (y). Series: With NO_COLLISION, Without NO_COLLISION]


Figure III-17: Simulation performance vs. Global size with CPU

With only 256 path points per iteration, the GPU simulation performance is lower than the single-threaded CPU performance. However, GPU performance grows steadily with the global size (number of processed path points) until it saturates at 32k-64k points per iteration.

The simulation time depends quadratically on the resolution per side (a symmetric height map is used for simplicity in this research), as expected when the number of height map cells grows with the square of the side length. The simulation performance was measured for different height map resolutions with the same tool path. A global size of 1024 points per iteration and a local size of 64 were used during the measurements.
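The quadratic dependency can be made concrete with a small calculation. This is a sketch under the assumption that the cost per iteration is proportional to the number of height map cells; the function names are illustrative and the resolutions are examples, not measured data points.

```python
def cell_count(side):
    """Number of cells in a square height map with `side` cells per side."""
    return side * side


def relative_cost(side, reference_side):
    """Cost relative to a reference resolution, assuming cost ~ cell count."""
    return cell_count(side) / cell_count(reference_side)


for side in (1024, 2048, 4096):
    print(side, cell_count(side), relative_cost(side, 2048))
```

Doubling the resolution per side quadruples the number of cells, so under this assumption the time per iteration is expected to grow by about a factor of four.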

[Figure III-17 plot axes: Global size (x); Processing speed, path points per second x1000 (y). Series: GPU_NC, GPU, CPU]


Figure III-18: Rendering vs. Resolution

Figure III-18 shows the results of these measurements for roughing and finishing tool paths. The noticeable difference in performance is the result of different memory access patterns. For the finishing tool path, which has a zigzag topology, only a small area of the height map is accessed during each iteration, because all path points lie along a short line segment as the tool moves from one side to the other. The roughing path, in contrast, moves the tool in a way that looks relatively random from the memory controller's point of view: the path topology depends on the target geometry and cannot be described as a list of long linear motions with linear memory access. As a result, the memory access pattern becomes non-linear, which leads to much lower memory subsystem performance and a slower simulation.
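The difference between the two access patterns can be illustrated by looking at the linear memory offsets that consecutive path points touch. This is a sketch only: it assumes a row-major height map layout, which the text does not state, and the two point lists are synthetic stand-ins for a finishing segment and a roughing path.

```python
def linear_offsets(points, width):
    """Row-major cell offsets for (x, y) points (layout is an assumption)."""
    return [y * width + x for x, y in points]


def mean_stride(offsets):
    """Average absolute distance in cells between consecutive accesses."""
    steps = [abs(b - a) for a, b in zip(offsets, offsets[1:])]
    return sum(steps) / len(steps)


WIDTH = 2048

# Finishing-like segment: consecutive points along one row of the map.
zigzag = [(x, 100) for x in range(64)]

# Roughing-like path: deterministic but scattered points (synthetic example).
scattered = [((i * 37) % WIDTH, (i * 611) % WIDTH) for i in range(64)]

print(mean_stride(linear_offsets(zigzag, WIDTH)))
print(mean_stride(linear_offsets(scattered, WIDTH)))
```

The first pattern touches adjacent cells, while the second jumps across many rows between accesses, which is the kind of pattern that defeats coalesced or cached memory access.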

[Figure III-18 plot axes: Resolution, pixels/side, 0 to 4500 (x); Time per iteration, s (y). Series: Roughing, Finishing]


Although the simulation performance results in Figure III-18 show a strong quadratic dependency, the total simulation time contains multiple components. As mentioned previously, the implemented simulation process includes:

• simulation (actual editing of the height map),

• map generation (converting the height map into the triangular mesh),

• rendering (actual rendering of the triangular mesh by OpenGL).

The individual performance results of each step for different resolutions are shown in Figure III-19.

Figure III-19: Simulation components vs. Resolution

[Plot axes: Resolution, pixels/side, 0 to 4500 (x); Time per iteration, s (y). Series: Rendering, Map, Simulation]


For low resolutions (<2048), rendering accounts for the main part of the simulation time. At higher resolutions, however, the rendering time does not grow quadratically and becomes a minor part of the total.

The second longest part of the current simulation implementation is the generation of the triangular mesh. Mesh generation is very simple from a computational perspective; most of its time is spent on memory transfer operations. These transfers are needed because OpenCL and OpenGL do not share memory, so the data must be transferred from the GPU to the host and back.

Although the core OpenCL specification does not allow sharing buffers, an OpenCL-OpenGL interoperability extension is available, and it significantly improves data sharing performance. This extension replaces two memory transfer operations over the PCI Express bus with a single memory copy operation inside the GPU memory, which is much faster. The performance benefit of using the OpenCL-OpenGL interoperability extension is shown in Figure III-20.
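The effect on data movement can be sketched by counting bytes per frame. The buffer layout is an assumption made only for this illustration: one 4-byte sample per height map cell. The actual size and format of the shared buffer are not given in the text, and the function names are hypothetical.

```python
BYTES_PER_SAMPLE = 4  # assumption: one 32-bit value per height map cell


def buffer_bytes(side, bytes_per_sample=BYTES_PER_SAMPLE):
    """Size in bytes of a square buffer with `side` samples per side."""
    return side * side * bytes_per_sample


def pcie_bytes_per_frame(side, interop):
    """Bytes crossing the PCI Express bus per frame.

    Without interoperability the buffer goes GPU -> host -> GPU (two transfers);
    with interoperability the data never leaves GPU memory.
    """
    return 0 if interop else 2 * buffer_bytes(side)


def device_copy_bytes_per_frame(side, interop):
    """Bytes copied inside GPU memory per frame (single copy with interop)."""
    return buffer_bytes(side) if interop else 0


MIB = 1024 * 1024
for interop in (False, True):
    print(interop,
          pcie_bytes_per_frame(2048, interop) / MIB,
          device_copy_bytes_per_frame(2048, interop) / MIB)
```

Under this assumption a 2048x2048 buffer is 16 MiB, so the path without the extension moves 32 MiB over the bus per frame, while the interoperability path performs one 16 MiB copy that stays in GPU memory.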

Figure III-20: OpenCL-OpenGL interoperability improvement

[Figure III-20 series label: With OCL-OGL]
