6.2 Multiprogramming Workloads
6.2.2 Case 2: Single Instances of Multiple Programs
For this case, we extend the experiment in Section 4.3 by adding the GPRM approaches to the comparison. The three base benchmarks have the same input sizes as in the single-program cases, with the cutoff value 2048 and the default number of threads, 240. We do not start all of them at the same time. Rather, we want the parallel phases to start almost simultaneously, so that the threads of all applications compete for the resources. For that purpose, the MergeSort benchmark enters the system first. Two seconds later the MatMul benchmark enters the system, and half a second after that, the Fib benchmark starts⁷. The results are shown and discussed in Fig. 6.13. For the sum of the turnaround times, the difference between “GPRM-Steal with Global Sharing”, the best result, and TBB, the second best, is about 17%.
Although the Total CPU Time is a key performance metric, it cannot be used on its own to interpret the results. A sequential program can have the same Total CPU Time as a parallel program. Therefore, it is also important to find out how evenly the tasks are distributed across the system. As we have observed in the multiprogramming cases, the efficiency of “GPRM-Steal with Global Sharing”, compared to the other GPRM approaches, comes from its better load balancing. However, the CPU cycles wasted by the runtime libraries, as with OpenMP and Cilk Plus, can have a significant impact on the results, and they can be detected from the Total CPU Time and Instructions Executed.
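The point that equal Total CPU Time can hide very different load distributions can be illustrated with a small sketch. The imbalance measure below (busiest thread over mean busy time) is our own illustrative assumption and is not one of the metrics reported in this chapter; the function names are hypothetical, and the input is assumed to be non-empty with a positive total.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sum of the per-thread busy times (in seconds): the Total CPU Time of a run.
double total_cpu_time(const std::vector<double>& busy) {
    return std::accumulate(busy.begin(), busy.end(), 0.0);
}

// Hypothetical imbalance measure (an assumption for illustration only):
// busy time of the busiest thread divided by the mean busy time.
// 1.0 means a perfectly even distribution; n means one of n threads did all the work.
// Assumes busy is non-empty and its total is positive.
double imbalance(const std::vector<double>& busy) {
    double mean = total_cpu_time(busy) / busy.size();
    return *std::max_element(busy.begin(), busy.end()) / mean;
}
```

For example, four threads with busy times {8, 0, 0, 0} and {2, 2, 2, 2} have the same Total CPU Time of 8 seconds, but the first run is effectively sequential (imbalance 4) while the second is evenly balanced (imbalance 1).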
6.3 Summary
We have used our “Base Benchmarks” (Fibonacci, MergeSort and MatMul) again and added GPRM to the performance comparison of the parallel models discussed in Chapter 4. The same benchmarks have also been used to compare GPRM and OpenMP on the TILEPro64.
We have presented a detailed analysis of GPRM’s performance. We have also demonstrated the advantages of our task-based parallel programming model over the existing well-known parallel approaches. Traditionally, attention has focused on finding the optimal number of threads in order to achieve desirable performance. In the thread-speedup charts, we used the default number of threads for GPRM and presented its performance with only two points for no-stealing and stealing modes. Performance optimisation in GPRM is a simple matter of choosing a proper cutoff value. In other words, GPRM combines an intuitive task-based approach with excellent performance, without the need to tune the number of threads.
On the TILEPro64, GPRM outperforms OpenMP in all cases. GPRM also achieves top performance for 2 out of the 3 uniprogramming test cases on the Xeon Phi, without any tuning. Further investigation of the only case on the Xeon Phi where GPRM was not the best model, matrix multiplication (MatMul), revealed new results: for different integer, float and double matrices, GPRM significantly outperforms OpenMP on the TILEPro64, especially for small matrices. On the Xeon Phi, GPRM is again the winning approach in most of the cases. In the other cases (such as the one used in the default MatMul configuration), after changing GPRM’s thread mapping policy via a command-line switch, it was able to reach the top performance achieved by the optimal number of OpenMP threads.
⁷ The sequential phase of the MergeSort benchmark with the input size 80M is around 2 seconds, and the
For multiprogramming on a general-purpose parallel system, we propose the use of GPRM which implements a scheme called “Steal Locally, Share Globally”. The idea is to steal tasks locally (from within the same application) only if the initial task assignment is not optimal, and to share the least amount of information about the system’s load globally (between different applications). We have shown that our strategy is highly competitive against other approaches, namely OpenMP, Cilk Plus and TBB for all testbenches, and achieves the top performance on the Intel Xeon Phi with 17% to 20% difference with TBB as the second best approach.
Chapter 7
Parallel Lower-Upper Factorisation of Sparse Matrices
OpenMP enjoys wide support from its community and continues to evolve. This makes it a challenging competitor for every new programming model, including GPRM. In this chapter we highlight some of the drawbacks in the OpenMP tasking approach, and propose an alternative solution based on the GPRM programming framework.
We compare the performance of GPRM with that of OpenMP in 2 different scenarios: first, a matrix multiplication benchmark¹ which has structured parallelism, and second, a linear algebra problem which fits very well into less structured task-based parallelism.
Lower-Upper factorisation of sparse matrices is a fundamental linear algebra problem. Due to the sparseness of the matrix, conventional worksharing solutions do not result in good performance, since a lot of load imbalance exists. As a well-known testcase, we have used the SparseLU benchmark from the Barcelona OpenMP Tasks Suite (BOTS) [149].
For the purposes of this chapter, we will show how OpenMP fails to operate as expected for a large number of fine-grained tasks, while GPRM copes with such a situation naturally (more in Section 7.2). Furthermore, we will introduce a hybrid worksharing-tasking approach to avoid creating too many tasks (more in Section 7.3).
7.1 GPRM Parallel Loops
So far, we have only used GPRM parallel loops without discussing them. In this section, we will describe them in more detail.
¹ In this chapter, the matrix multiplication is used to show the effect of creating small tasks (short computa-
GPRM is a purely task-based parallel framework. As discussed in Chapter 5, one can create CUTOFF² tasks in GPRM, each with its own index. These indices can then be used by a worksharing construct to specify which elements of the loop belong to which thread. Normally, when the tasks are fairly equal, the best result is obtained by setting the cutoff value equal to the number of threads in GPRM, which is itself equal to the number of cores. Although the tasks in Section 7.3 are not equal, one can still use the GPRM parallel loops to balance the load amongst threads. This solution, as will be shown, works very well when medium-sized or large sparse matrices are used.
We have created a number of useful parallel loop constructs for use in GPRM. These worksharing constructs correspond to the for worksharing construct in OpenMP, in the sense that they are used to distribute different parts of the work among different threads. However, there is a big difference in how they perform the operation. In OpenMP, the user marks a loop as an OpenMP for with a desirable scheduling strategy, and the OpenMP runtime decides which threads should run which part of the loop; in GPRM, multiple instances of the same task (normally as many as the cutoff value) are generated, each with a different index (similar to the global_id in OpenCL). Each of these tasks calls the parallel loop, passing in its own index to specify which parts of the work should be performed by its host thread.
Figure 7.1: Partitioning a nested m(3 × 3) or a single m(9) loop amongst n(4) threads. a) Step size of 1, as in the par_for and par_nested_for, b) Continuous, as in the par_cont_for
The par_for construct is essentially a sequential loop used for parallelisation of a single loop. It distributes the work in a Round-Robin fashion to the threads. It can also be referred to as a partial for, as it is actually a sequential loop that executes only a part of the original loop. A par_nested_for treats a nested loop as a single loop and follows the same pattern to distribute the work. Alternatively, the Continuous method gives every thread a chunk of m/n iterations, and the remaining m%n iterations are distributed one by one to the foremost threads. These methods are shown in Fig. 7.1. The need to parallelise nested loops arises often, e.g. in situations where there are variable-size loops, such as in the SparseLU benchmark in Section 7.3.
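The two partitioning schemes can be sketched as plain index computations. The helper functions below are hypothetical and are not part of GPRM; for the Continuous scheme we assume that the m%n extra iterations are folded into the chunks of the first m%n threads, so that every thread still owns one contiguous range.

```cpp
#include <vector>

// Hypothetical helper (not a GPRM function): iterations of a flat loop of
// size m owned by thread `ind` out of n under step-size-1 (Round-Robin)
// partitioning.
std::vector<int> round_robin_indices(int m, int n, int ind) {
    std::vector<int> out;
    for (int i = ind; i < m; i += n) out.push_back(i);
    return out;
}

// Hypothetical helper: iterations owned by thread `ind` under Continuous
// partitioning. Every thread gets m / n iterations; the first m % n threads
// get one extra (assumption: the extra iteration extends the thread's chunk).
std::vector<int> continuous_indices(int m, int n, int ind) {
    int chunk = m / n;
    int rem = m % n;
    int begin = ind * chunk + (ind < rem ? ind : rem);
    int len = chunk + (ind < rem ? 1 : 0);
    std::vector<int> out;
    for (int i = begin; i < begin + len; ++i) out.push_back(i);
    return out;
}
```

With m = 9 and n = 4, thread 0 owns iterations 0, 4 and 8 under the Round-Robin scheme, and iterations 0, 1 and 2 under this reading of the Continuous scheme.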
The par_for and par_nested_for loops in GPRM are implemented using C++ templates and member-function pointers. The implementations of these worksharing constructs are given in Listings 7.1 and 7.2. They will be our worksharing constructs by default. The parallel loops with Continuous partitioning have similar implementations. We denote them as partial continuous loops: par_cont_for.
template <typename Tclass, typename Param1>
int par_for(int start, int size, int ind, int CUTOFF, Tclass* TC,
            int (Tclass::*work_function)(int, int, int, Param1), Param1 p1) {
    int turn = 0;
    for (int i = start; i < size;) {
        if (turn % CUTOFF == ind) {
            // Iteration i belongs to task ind: run it, then jump CUTOFF ahead.
            (TC->*work_function)(i, start, size, p1);
            i = i + CUTOFF;
        } else {
            // Advance to the first iteration owned by task ind.
            i++;
            turn++;
        }
    }
    return 0;
}
Listing 7.1: Implementation of the par_for
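As a usage illustration, the sketch below calls the par_for of Listing 7.1 once per task index and records which task ran each iteration. The Recorder class and the run_all driver are hypothetical names introduced here; in GPRM the CUTOFF task instances would be scheduled on different threads by the runtime, whereas this sketch simply runs them one after another.

```cpp
#include <vector>

// par_for following Listing 7.1.
template <typename Tclass, typename Param1>
int par_for(int start, int size, int ind, int CUTOFF, Tclass* TC,
            int (Tclass::*work_function)(int, int, int, Param1), Param1 p1) {
    int turn = 0;
    for (int i = start; i < size;) {
        if (turn % CUTOFF == ind) {
            (TC->*work_function)(i, start, size, p1);
            i = i + CUTOFF;
        } else {
            i++;
            turn++;
        }
    }
    return 0;
}

// Hypothetical task class: remembers which task index ran each iteration.
struct Recorder {
    std::vector<int> owner;  // owner[i] = index of the task that ran iteration i
    explicit Recorder(int n) : owner(n, -1) {}
    int work(int i, int /*start*/, int /*size*/, int ind) {
        owner[i] = ind;
        return 0;
    }
};

// Hypothetical driver: runs all task instances sequentially, for illustration
// only, and returns the owner of every iteration.
std::vector<int> run_all(int size, int cutoff) {
    Recorder r(size);
    for (int ind = 0; ind < cutoff; ++ind)
        par_for(0, size, ind, cutoff, &r, &Recorder::work, ind);
    return r.owner;
}
```

For a loop of 9 iterations and a cutoff of 4, every iteration is executed exactly once and iteration i is owned by task i % 4, which matches the Round-Robin pattern of Fig. 7.1(a).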
template <typename Tclass, typename Param1>
int par_nested_for(int start1, int size1, int start2, int size2, int ind, int CUTOFF, Tclass* TC,
                   int (Tclass::*work_function)(int, int, int, int, int, int, Param1), Param1 p1) {
    int turn = 0;
    for (int i = start1; i < size1; i++) {
        for (int j = start2; j < size2;) {
            if ((turn >= 0) && (turn % CUTOFF == ind)) {
                (TC->*work_function)(i, j, start1, size1, start2, size2, p1);
                j = j + CUTOFF;
                // Carry the overshoot past the end of this row into the next row.
                if (j >= size2) turn = size2 - j + ind;
            } else {
                j++;
                turn++;
            }
        }
    }
    return 0;
}
Listing 7.2: Implementation of the par_nested_for
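To check that the par_nested_for of Listing 7.2 treats a nested loop as a single flat loop, the sketch below records the owner of every (i, j) pair. The Grid class and the nested_owners driver are hypothetical, the task instances are run sequentially for illustration only, and the inner loop is assumed to start at 0 so that the flat index is i * size2 + j.

```cpp
#include <vector>

// par_nested_for following Listing 7.2.
template <typename Tclass, typename Param1>
int par_nested_for(int start1, int size1, int start2, int size2, int ind, int CUTOFF, Tclass* TC,
                   int (Tclass::*work_function)(int, int, int, int, int, int, Param1), Param1 p1) {
    int turn = 0;
    for (int i = start1; i < size1; i++) {
        for (int j = start2; j < size2;) {
            if ((turn >= 0) && (turn % CUTOFF == ind)) {
                (TC->*work_function)(i, j, start1, size1, start2, size2, p1);
                j = j + CUTOFF;
                if (j >= size2) turn = size2 - j + ind;
            } else {
                j++;
                turn++;
            }
        }
    }
    return 0;
}

// Hypothetical task class: records the owner of each (i, j) pair by flat index.
struct Grid {
    std::vector<int> owner;
    Grid(int rows, int cols) : owner(rows * cols, -1) {}
    int work(int i, int j, int, int, int, int size2, int ind) {
        owner[i * size2 + j] = ind;  // assumes the inner loop starts at 0
        return 0;
    }
};

// Hypothetical driver: runs every task index in turn and returns the owners.
std::vector<int> nested_owners(int rows, int cols, int cutoff) {
    Grid g(rows, cols);
    for (int ind = 0; ind < cutoff; ++ind)
        par_nested_for(0, rows, 0, cols, ind, cutoff, &g, &Grid::work, ind);
    return g.owner;
}
```

For the 3 × 3 loop and 4 threads of Fig. 7.1(a), the flat iteration k = 3i + j is owned by task k % 4, exactly as a par_for over a single loop of 9 iterations would assign it.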
As we will see in the next sections, since the GPRM par_nested_for is implemented with minimum overhead, it is a particularly useful worksharing construct.