MatMul Benchmark - GPRM Productivity - GPRM: a high performance programming framework for manyc


6.1.4 MatMul Benchmark

GPRM uses its par_cont_for (partial continuous for) worksharing construct, a task-based approach to this problem, which distributes the chunks amongst the working threads based on their indices. If the cutoff value is taken as the number of tasks in GPRM, the chunk size is N/cutoff. The implementation of parallel loops in GPRM is described in Chapter 7.

/* OpenMP */
#pragma omp for schedule(dynamic, N/cutoff)

/* Cilk Plus */
#pragma cilk grainsize = N/cutoff

/* TBB */
parallel_for(blocked_range<size_t>(0, N, N/cutoff), Body(a, b, c), simple_partitioner());

/* GPRM */
par_cont_for(0, N, ind, cutoff, this, &Foo::bar);

Listing 6.4: Defining the number of chunks (or the chunk size) in different implementations of the MatMul benchmark
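The relation between the cutoff and the chunk size used in Listing 6.4 can be illustrated with a small sketch. It is not GPRM code: the names `Chunk` and `make_chunks` are hypothetical, and the handling of a remainder when N is not a multiple of the cutoff is an assumption of this example.

```cpp
#include <cstddef>
#include <vector>

// Half-open iteration range [begin, end) handled by one task.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Split [0, n) into `cutoff` contiguous chunks of about n / cutoff iterations.
// Assumption of this sketch: when n is not a multiple of cutoff, the first
// n % cutoff chunks receive one extra iteration.
std::vector<Chunk> make_chunks(std::size_t n, std::size_t cutoff) {
    std::vector<Chunk> chunks;
    if (cutoff == 0) {
        return chunks;  // no tasks requested
    }
    chunks.reserve(cutoff);
    const std::size_t base = n / cutoff;   // nominal chunk size N/cutoff
    const std::size_t extra = n % cutoff;  // leftover iterations to spread
    std::size_t begin = 0;
    for (std::size_t i = 0; i < cutoff; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        chunks.push_back({begin, begin + len});
        begin += len;
    }
    return chunks;
}
```

For the configuration used in this section, `make_chunks(4096, 2048)` yields 2048 chunks of two rows each.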

The TILEPro64 is a 32-bit architecture without an FPU (floating-point unit). It has no vector registers or vector instructions (instead it uses a 32-bit, three-way-issue scalar VLIW engine), and its caches are smaller than those of the Xeon Phi. We should therefore expect a huge difference in performance between the two platforms for this benchmark. To achieve automatic vectorization on the Xeon Phi, the Intel TBB and OpenMP codes have to be compiled with the -ansi-alias flag.

6.1. Uniprogramming Workloads 94

The schedule clause used with the OpenMP for construct specifies how iterations of the associated loops are divided (statically or dynamically) into contiguous chunks, and how these chunks are distributed amongst the threads of the team. For the MatMul benchmark, we have included both OpenMP approaches, static and dynamic, in the comparison. It is important to note that dynamic scheduling on the Xeon Phi with cutoff 2048 improves the speedup of OpenMP from 43× for the default case (with no schedule clause) to 52×. After these considerations, we are ready to run the MatMul benchmark and compare the platforms as well as the programming models in a data-parallel scenario. It is worth noting that with both GPRM approaches we have observed a superlinear speedup on the TILEPro64.
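The two OpenMP variants compared here differ only in the schedule clause. The following is a minimal sketch, not the benchmark code: the function names, the row-major layout and the naive kernel are assumptions made for illustration, and `chunk` plays the role of N/cutoff and must be positive.

```cpp
#include <vector>

// Compute row i of c = a * b for row-major n x n matrices.
static void multiply_row(const std::vector<double>& a, const std::vector<double>& b,
                         std::vector<double>& c, int n, int i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += a[i * n + k] * b[k * n + j];
        }
        c[i * n + j] = sum;
    }
}

// Static scheduling: chunks of `chunk` consecutive rows are assigned to the
// threads of the team in round-robin order before the loop starts.
void matmul_static(const std::vector<double>& a, const std::vector<double>& b,
                   std::vector<double>& c, int n, int chunk) {
    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < n; ++i) {
        multiply_row(a, b, c, n, i);
    }
}

// Dynamic scheduling: a thread that finishes its chunk requests the next
// unassigned chunk of `chunk` rows at run time.
void matmul_dynamic(const std::vector<double>& a, const std::vector<double>& b,
                    std::vector<double>& c, int n, int chunk) {
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = 0; i < n; ++i) {
        multiply_row(a, b, c, n, i);
    }
}
```

Both functions produce the same result; only the assignment of rows to threads differs. The pragmas take effect only when OpenMP is enabled at compile time (for example with -fopenmp for GCC or Clang); otherwise the loops run serially.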

[Figure 6.7: MatMul 4096x4096 doubles, cutoff 2048. (a) Speedup versus number of threads. (b) Running time in seconds (log scale) versus number of threads. Series on the TILEPro64: OpenMP(d), OpenMP(s), GPRM, GPRM-Steal. Series on the Xeon Phi: OpenMP(d), OpenMP(s), OpenMP(d) Balanced, OpenMP(s) Balanced, Cilk Plus, TBB, GPRM, GPRM-Steal.]


Figure 6.7(a) shows that Intel OpenMP with dynamic scheduling scales best on the Xeon Phi, and that both GPRM approaches scale better than TBB and Cilk Plus. On the TILEPro64, the GPRM approaches, with their superlinear speedup, scale better than OpenMP. However, as illustrated in Fig. 6.7(b), there is an enormous difference between the running times on the TILEPro64 and the Xeon Phi (see footnote 4). The Xeon Phi is a vector processing machine and can distinguish itself from the TILEPro64 in scenarios like this.

[Figure 6.8 panels, MatMul 4096x4096 doubles. (a) TILEPro64, different cutoffs: speedup versus cutoff value (16 to 2048), 63 threads; series OpenMP(d), OpenMP(s), GPRM, GPRM-Steal. (b) Xeon Phi, different cutoffs: speedup versus cutoff value (16 to 2048), 240 threads; series OpenMP(d), OpenMP(s), Cilk Plus, TBB, GPRM, GPRM-Steal.]

Figure 6.8: Parallel MatMul benchmark on a 4096×4096 matrix of double numbers.

Footnote 4: Note the log scale on the y-axis of Fig. 6.7(b). The best result on the Xeon Phi is approximately 106×


Here, all the tasks are alike and of roughly the same size. To be precise about the results in Fig. 6.8(a), note that we have 63 threads for GPRM on the TILEPro64 (as many as the number of available cores), but 64 threads for OpenMP. To get better results for smaller cutoffs, we would have had to choose cutoff 63 for GPRM, but to stay consistent with the previous experiments we have used powers of two. The 64 OpenMP threads are time-sliced over 63 cores, which results in a good speedup for cutoff 64. In the case of GPRM, however, every thread gets one chunk, except for one thread, which gets two. That is why we see a big difference between the approaches for cutoff 64 on the TILEPro64. Instead of choosing better cutoffs for this case (63, 126, ...), we have increased the cutoff value; creating more tasks has balanced the load distribution.

The same reasoning applies to the Xeon Phi. Firstly, 4096 is not a multiple of 240 (the number of threads). Moreover, cutoff 256 (making 256 tasks) makes 16 cores busier than the others. Although we could choose cutoff 240 to improve the performance, for consistency with the other experiments we have limited ourselves to powers of two. Increasing the cutoff creates more tasks and a better distribution, and hence a better speedup.
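The effect of the cutoff on load balance can be estimated with a simple model. The sketch below assumes equal-sized chunks that are dealt out as evenly as possible, with no stealing and no time slicing, so it applies to the GPRM case described above rather than to OpenMP with 64 threads. The function names are hypothetical, and both arguments must be positive.

```cpp
#include <cstddef>

// Largest number of chunks any single thread receives when `tasks` equal-sized
// chunks are dealt out as evenly as possible over `threads` threads.
std::size_t max_chunks_per_thread(std::size_t tasks, std::size_t threads) {
    return (tasks + threads - 1) / threads;  // ceiling division
}

// Number of threads that carry one chunk more than the rest
// (zero when the chunks divide evenly).
std::size_t threads_with_extra_chunk(std::size_t tasks, std::size_t threads) {
    return tasks % threads;
}

// Upper bound on parallel efficiency under this model: the average load per
// thread divided by the load of the busiest thread.
double balance_efficiency(std::size_t tasks, std::size_t threads) {
    const double average = static_cast<double>(tasks) / static_cast<double>(threads);
    return average / static_cast<double>(max_chunks_per_thread(tasks, threads));
}
```

Under these assumptions, 64 tasks on 63 threads leave one thread with two chunks and cap the efficiency at about 0.51, whereas 2048 tasks raise the cap to about 0.985. On the Xeon Phi with 240 threads, 256 tasks give 16 threads an extra chunk and a cap of about 0.53, while 2048 tasks give about 0.95. These are bounds from the model, not measurements.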

[Figure 6.9: MatMul 4096x4096 doubles, 240 threads, cutoff 2048, on the Xeon Phi. (a) CPI Rate and (b) Total CPU Time (s) for OpenMP(d), OpenMP(s), Cilk Plus, TBB, GPRM and GPRM-Steal. (e) Cilk Plus, (f) TBB, (g) GPRM and (h) GPRM-Steal: Xeon Phi CPU balance.]


The results of this benchmark on the Xeon Phi raise the question of what causes the high CPI Rate for GPRM and Intel TBB, given that they sometimes run faster than, or at least as fast as, the other implementations. The answer is to be found in the number of executed instructions. Looking at the hardware event INSTRUCTIONS_EXECUTED sampled by VTune Amplifier shows that a higher CPI Rate does not necessarily mean degraded performance. Although the CPI Rate is higher for the TBB, GPRM, and GPRM-Steal approaches, the number of INSTRUCTIONS_EXECUTED is notably smaller than for Cilk Plus and OpenMP. For instance, this number for the Cilk Plus approach is almost 2× that of GPRM.
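The argument can be made explicit with the usual relation between instruction count, CPI and CPU time. The symbols are notation introduced here for illustration: $I$ is the number of executed instructions, $f$ the clock frequency, and $T$ the total CPU time summed over all threads; $f$ is the same for both approaches because they run on the same machine.

```latex
\begin{align}
  T &= \frac{I \cdot \mathrm{CPI}}{f}, \\
  \frac{T_{\mathrm{GPRM}}}{T_{\mathrm{Cilk}}}
    &= \frac{I_{\mathrm{GPRM}}}{I_{\mathrm{Cilk}}}
       \cdot \frac{\mathrm{CPI}_{\mathrm{GPRM}}}{\mathrm{CPI}_{\mathrm{Cilk}}}
     \approx \frac{1}{2} \cdot \frac{\mathrm{CPI}_{\mathrm{GPRM}}}{\mathrm{CPI}_{\mathrm{Cilk}}}
     \quad \text{when } I_{\mathrm{Cilk}} \approx 2\, I_{\mathrm{GPRM}}.
\end{align}
```

With roughly half the instructions, GPRM can therefore have a CPI Rate up to about twice that of Cilk Plus before its total CPU time becomes larger. The wall-clock time additionally depends on how evenly that CPU time is spread across the threads.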

The charts in Figures 6.9(c) and 6.9(d) show an evident difference in the distribution of CPU times, which illustrates how OpenMP load balancing with dynamic scheduling leads to better performance. A very detailed comparison would have to take other hardware events into account as well, but we can already reason about the performance by looking at just these few fields.
