Summary - Profiling a parallel domain specific language using off-the-shelf tools

This chapter covered the background on parallel computing, parallel programming languages and the profiling of parallel performance. We presented parallel system architectures classified by the underlying physical memory organisation: shared-memory architectures, distributed-memory architectures and hybrid architectures (Section 2.1). Later in this thesis we will measure performance on a hybrid architecture, a Beowulf cluster of multicores. We presented parallel programming models that are widely used in the parallel programming domain: shared-memory programming with OpenMP and distributed-memory programming with MPI. We also discussed parallel programming with high-level parallel functional languages such as HdpH, which we will use for the rest of the thesis (Section 2.2). We discussed the analysis of parallel performance, illustrating the performance profiling process that we will use in Chapter 4 to build the profiler HdpHProf. Finally, we discussed performance analysis tools for both imperative and functional parallel languages, which we will study in Chapter 3 in a comparative analysis of parallel functional profilers (Section 2.4).

A Survey of Parallel Functional Profilers

This chapter presents a survey of parallel functional profilers alongside important imperative profilers. We evaluate two parallel Haskell profilers, GHC-PPS and EdenTV, in comparison with four important profilers for imperative languages. The functional profilers are relevant as our new HdpHProf profiler exploits GHC-PPS capabilities to profile a distributed-memory DSL, and hence EdenTV is a natural comparator.

The comparison covers profilers of both shared-memory and distributed-memory parallel languages, and is performed on common parallel architectures. It uses a published benchmark, namely the Concordance application, which was set as the first SICSA Multicore Challenge [23].

GHC-PPS performs tracing profiling of shared-memory parallel Haskell, and EdenTV performs tracing profiling of Eden, a distributed-memory parallel Haskell. The imperative profilers are the tracing and graphical Score-P/Vampir for MPI and Score-P/Vampir for OpenMP, and the two summative profilers mpiP for MPI and ompP for OpenMP (Section 3.1). We compare the amount of profiling data generated by the profilers, classified by whether the parallelism is shared-memory or distributed-memory, whether the profiler is imperative or functional, and whether it is tracing or summative. The study reveals some interesting results, e.g. both functional tracing profilers generate one or two orders of magnitude less data than the imperative tracing profilers (Section 3.2).

We investigate the runtime overheads of the profilers, again classified by whether the parallelism is shared-memory or distributed-memory, whether the profiler is imperative or functional, and whether it is tracing or summative. The results of this study show, for example, that both tracing functional profilers induce overheads an order of magnitude lower than those of the imperative tracing profilers. A more complete account of our studies is available as a technical report [5] (Section 3.3).

We systematically compare the profilers for usability and data presentation, and find that the results reflect the design philosophy of the tools. Summative tools report a small set of key data with minimal intrusion into program execution. The functional tracing profilers provide more information, together with some graphical visualisation, with little more intrusion. Vampir offers the greatest range of information at the cost of significant intrusion (Section 3.4).

We discuss a number of related studies that evaluate other parallel profilers. We compare the experimental methodology used to evaluate the profilers with the methodology we use to evaluate the functional profilers (Section 3.5). After that, we outline the findings and summarise the work of this chapter (Section 3.6).

3.1 Experimental Methodology

3.1.1 Experimental Set-up

The profilers were measured on a Beowulf cluster of 32 nodes, each node comprising two Intel quad-core processors (Xeon E5504) running at 2.00GHz, sharing 4MB of L3 cache and 12GB of RAM. The machines were connected via Gigabit Ethernet and ran the CentOS Linux distribution [19], version 6.3 x86_64. Table 3.1 specifies the compilers and profiling tools used for the experiments.

Compiler/Profiling Tool                                            Version
-----------------------------------------------------------------  ----------
GNU Compiler Collection (GCC) Red Hat [40]                         4.4.6-4
The Glorious Glasgow Haskell Compilation System (GHC) [46]         7.2.1
The Parallel Haskell Compilation System (GHC-Eden) [48]            7.4.2
ompP to profile OpenMP [36]                                        0.7.1
mpiP to profile MPI [141]                                          3.3
Vampir to profile MPI & OpenMP [139]                               8.0.0 Demo
Opari2 to profile OpenMP [100]                                     1.0.6
Score-P to profile MPI [124]                                       1.1
ThreadScope to profile GHC-SMP [134]                               0.2.1
EdenTV to profile GHC-Eden [10]                                    4

3.1.2 Concordance Benchmark Versions

The profilers were compared using implementations of the same algorithm, the Concordance benchmark, which was published as Phase I of the SICSA MultiCore Challenge [23]. The Concordance benchmark takes as input a text file and an integer (N). It processes the text file to find all sequences of words in the text, up to length N, together with the number of occurrences of each sequence and a list of its start indices. As the profilers work on different languages, we obtained four parallel implementations of the Concordance benchmark application, i.e. Eden, GHC-SMP, MPI and OpenMP.
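The benchmark's input/output behaviour can be illustrated with a minimal sequential sketch in Python. This is not one of the four parallel implementations used in the study; the function name `concordance`, the whitespace tokenisation and the use of word positions as start indices are assumptions made only for illustration.

```python
from collections import defaultdict


def concordance(text, n):
    """Map every word sequence of length 1..n to the list of its start indices.

    Assumptions (not taken from the benchmark specification): words are
    separated by whitespace, and a start index is the position of the
    first word of the sequence in the tokenised text.
    """
    words = text.split()
    starts = defaultdict(list)
    for i in range(len(words)):
        # Sequences starting at i are bounded by n and by the end of the text.
        for length in range(1, min(n, len(words) - i) + 1):
            starts[tuple(words[i:i + length])].append(i)
    return dict(starts)


result = concordance("a b a b c", 2)
for seq, indices in sorted(result.items()):
    # Print the sequence, its number of occurrences and its start indices.
    print(" ".join(seq), len(indices), indices)
```

The nested loop makes the dependence on both parameters visible: the work grows with the number of words in the input and with N, which is why input size can serve as a proxy for computation size.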

3.1.3 Experiments

We used the Concordance benchmark implementations to measure the profilers’ data size and runtime overhead relative to non-profiled execution. We measured the performance of the profilers as a function of two parameters: the application computation size and the number of PEs. Firstly, we studied how increasing the computation size changed the performance of the tools. Secondly, we evaluated how increasing the number of PEs affected these profilers. The total number of experimental executions was 2400. Each experiment was repeated 5 times and the reported figures are medians.
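As a hedged illustration of how a runtime overhead figure can be derived from repeated runs, the Python sketch below takes the median of each set of timings and expresses the difference as a percentage of the non-profiled median. The percentage formula, the function name `runtime_overhead_percent` and the sample timings are assumptions for illustration; they are not measurements or definitions from this study.

```python
from statistics import median


def runtime_overhead_percent(plain_runs, profiled_runs):
    """Overhead of profiled execution relative to non-profiled execution.

    Both arguments are lists of runtimes in seconds from repeated runs.
    The median of each list is used, and the result is a percentage of
    the non-profiled median (an assumed definition of overhead).
    """
    base = median(plain_runs)
    profiled = median(profiled_runs)
    return 100.0 * (profiled - base) / base


# Hypothetical timings for 5 repetitions of one configuration.
plain = [10.0, 10.2, 9.9, 10.1, 10.0]
profiled = [11.0, 11.2, 10.9, 11.5, 11.0]
overhead = runtime_overhead_percent(plain, profiled)
print(f"overhead: {overhead:.1f}%")
```

Using the median rather than the mean keeps a single outlying run, for example one disturbed by other activity on the node, from dominating the reported figure.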

To increase the computation size, we used input files of different sizes, because the computation size of Concordance grows with the input size. The SICSA MultiCore Challenge provides two input files: the smaller file is 35 KB and the larger is 4300 KB. To carry out the experiments we needed more input files with a gradual increase in size. Therefore, we used the 4300 KB file to produce files of different sizes, starting with 100 KB and doubling up to 3200 KB. Our analysis is based on the data sets from 100 KB to 3200 KB. However, for completeness we also included the 35 KB and the 4300 KB files in the experiment, as they are the standard set of inputs in the SICSA MultiCore Challenge.
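The doubling series of input sizes can be sketched in Python as follows. How the intermediate files were actually derived from the 4300 KB file is not stated here; taking a prefix of the source data and treating 1 KB as 1024 bytes are both assumptions, and the helper names are hypothetical.

```python
def doubling_sizes_kb(start_kb=100, stop_kb=3200):
    """Return the sizes start_kb, 2*start_kb, ... up to and including stop_kb."""
    sizes = []
    size = start_kb
    while size <= stop_kb:
        sizes.append(size)
        size *= 2
    return sizes


def prefix_of_size(data, size_kb):
    """Assumed derivation: keep the first size_kb * 1024 bytes of the source.

    A real derivation would probably cut at a word boundary; this sketch
    does not.
    """
    return data[: size_kb * 1024]


sizes = doubling_sizes_kb()
print(sizes)  # [100, 200, 400, 800, 1600, 3200]

# Stand-in for the contents of the 4300 KB challenge file.
source = b"x" * (4300 * 1024)
derived = {kb: prefix_of_size(source, kb) for kb in sizes}
print({kb: len(chunk) for kb, chunk in derived.items()})
```

Doubling gives six derived inputs, so together with the two standard challenge files each implementation is run on eight input sizes.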

Similarly, we doubled the number of PEs from 1 PE to 8 PEs, as this is the maximum number of cores on our system. However, the MPI Concordance implementation requires a minimum of 2 PEs in order to work: one as master and the other as worker. As a consequence, this study reports the results based on the number of workers used in the computation. The master PE only distributes the work and waits for termination, and hence does not generate profiling data. We also included measurements of 6 PEs, as using all the available cores on a machine is known to sometimes perturb performance [88].
