Related Work - Profiling a parallel domain specific language using off-the-shelf tools

Chung et al. [21] investigated how to reduce the cost of tracing by selectively recording only certain classes of events using a set of standard HPC profiling tools. They evaluated their approach with an experimental study of the cost of five profiling tools: IBM HPCT, Paraver, KOJAK, TAU, and mpiP. In a similar approach to our work, they used two metrics to characterise the profiling tools: the runtime overheads and the size of the collected profiling data. There are a number of differences between their study and our work. Firstly, their study was restricted to one programming model, imperative programming with MPI, whereas we covered a range of different programming models (shared vs distributed memory) and paradigms (imperative vs functional). However, their study used 4 benchmark applications, whereas we were limited to one because it was difficult to find multiple and similar benchmarks for all the programming models we considered. Finally, their study investigated the cost of profiling on a larger scale than ours did. We selected small numbers of processors, so that we could compare profiling tools for both distributed and shared memory, inheriting the low processor limit of shared-memory architectures.

Malony et al. [87] investigated overhead compensation in a prototype extension of the TAU [125] profiling tool. They performed experiments to evaluate the performance of their tool, measuring the runtime overhead but not the data size of profiles. However, they did not vary the computation size or the number of PEs in their experiments. They also did not compare their results with the overheads of other profilers.

Jones Jr. et al. [67] introduced the GHC parallel profiling system and the ThreadScope visualiser to the Haskell community. To demonstrate the overheads of parallel profiling, the paper presents the runtime overheads and trace file sizes of two microbenchmarks (parallel Fibonacci and parallel quicksort). However, the authors do not investigate the impact of computation size and number of cores on the cost of profiling, nor do they compare their overheads with those of other profiling tools.

3.6 Summary

We have evaluated two functional profilers, GHC-PPS and EdenTV, alongside four important imperative profilers. The comparison is based on the SICSA Concordance benchmark [23], which covers both shared and distributed-memory parallel languages, and is performed on common parallel architectures.

Key findings are as follows. The summative profilers generated the least profiling data. More interestingly, both functional tracing profilers generated one or two orders of magnitude less data than the imperative tracing profilers. While generating as much data as the imperative tracing profilers do risks distorting the parallel execution, the benefit is that tools like Score-P/Vampir can potentially assist the programmer by providing more detailed information about program execution (Section 3.2). More work is needed to establish the cost/benefit trade-off between profiling data size and the programmer’s understanding of program behaviour.

Both functional tracing profilers induce very low runtime overheads, an order of magnitude lower than those of the imperative tracing profilers. The runtime overheads of both functional profilers, whether distributed or shared-memory, are no more than about twice that of the best summative profiler in our study: for example, 9.4% runtime overhead for EdenTV and 10.5% for GHC-PPS, compared with 5.2% for ompP (Section 3.3).
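The percentages quoted above can be read against a common definition of runtime overhead: the relative slowdown of a profiled run over an unprofiled one. The sketch below assumes that definition, since this summary does not restate the formula. The function name `overhead_pct` and the runtimes in the first call are hypothetical; only the three overhead figures are taken from the text.

```python
def overhead_pct(baseline_s: float, profiled_s: float) -> float:
    """Relative runtime overhead of profiling, in percent (assumed definition)."""
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    return 100.0 * (profiled_s - baseline_s) / baseline_s

# Hypothetical runtimes, for illustration only: 20.0 s unprofiled, 21.0 s profiled.
example = overhead_pct(20.0, 21.0)

# Overheads quoted in the text (Section 3.3), in percent.
quoted = {"EdenTV": 9.4, "GHC-PPS": 10.5, "ompP": 5.2}

# Ratio of each functional profiler's overhead to that of ompP,
# rounded to one decimal place.
ratios = {
    name: round(pct / quoted["ompP"], 1)
    for name, pct in quoted.items()
    if name != "ompP"
}
print(ratios)
```

Rounded to one decimal place, both ratios come out at roughly a factor of two.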

Comparing the profilers for usability and data presentation, we see that the functional profilers are relatively immature when compared with tools like Vampir for popular imperative technologies. The results also reflect the profiler design philosophies: summative tools provide key information with minimal intrusion. The functional profilers provide more information and some graphical visualisation; Vampir offers the greatest range of information, and the most sophisticated and usable visualisation tools (Section 3.4).

Functional profilers could be improved in a number of ways. Currently the data collection and visualisation options are relatively modest, and both could be improved to approach the standard of leading tools like Vampir. Functional profiling architectures could better exploit techniques proven by tools like Vampir. For example, instead of different visualisation tools to visualise two variants of parallel Haskell, one tool could be designed to visualise multiple variants. Similarly, instead of producing different trace formats for each Haskell variant, a standard format is needed which can capture monitoring data from a more generic abstract unit of computation resource. While GHC-PPS represents a move in this direction, it is closely entwined with GHC and has a relatively simple model of computation resources.

Interesting challenges lie ahead: functional profilers must soon address the issues of scalability and heterogeneity. The scalability challenge is to collect useful information as the number of cores grows exponentially and the bandwidth available to each core shrinks. The challenge of heterogeneity is to profile a program executing on a range of computing resources, e.g. multicores and GPUs.

Moreover, since Haskell has become the host language of several DSL implementations, another important challenge for functional profilers is how to support profiling of parallel DSLs. The DSL profiling challenge is for the profiler to monitor and analyse performance, and to present program behaviour, at the high level of abstraction of the DSL. It is unclear how much profiling technology can be shared by the various parallel Haskell DSLs like the Par Monad [89], Cloud Haskell [28], and HdpH [84].

HdpHProf – Design and Implementation

This chapter presents the design and implementation of HdpHProf, a profiler for the HdpH DSL. In keeping with the HdpH philosophy of relying on nothing but the host platform, HdpHProf builds on GHC’s existing profiling infrastructure, in particular on the event logging mechanism of the GHC Parallel Profiling System (GHC-PPS). HdpHProf is post-mortem, multi-stage, and extensible. Importantly, the implementation exploits several new GHC features, including the GHC-Events Library and ThreadScope, to build profiling tools for HdpH. HdpHProf faces some challenges unique to the high-level distributed-memory DSL setting: how to instrument and trace the behaviour of the parallel DSL, how to tweak event logging to generate a single profile of a distributed program execution, spanning multiple machines with independent clocks, and how to analyse and visualise such trace files. The design introduces two novel analysis tools for monitoring the DSL internals, i.e. Spark Pool Contention Analysis and Registry Contention Analysis. Furthermore, we present how HdpHProf uses ThreadScope [134], the standard GHC shared-memory performance analysis tool, to visualise the performance of the distributed-memory executions of HdpH.

4.1 HdpHProf Requirements

HdpHProf is required to profile HdpH using the performance analysis infrastructure already available for the host language, i.e. GHC [57]. GHC comes with a full profiling suite called the GHC Parallel Profiling System (GHC-PPS) [67] and a trace visualiser, ThreadScope [134]. We categorise the HdpHProf requirements into three types as follows.

Architecture Requirements:

• HdpHProf should not require any change to the GHC platform.

• HdpHProf should use the GHC-PPS tracing to emit HdpH trace events into the eventlogs produced by the GHC-PPS on each node.

• HdpHProf should use and extend the GHC-Events library to read HdpH trace events from the eventlog, normalise the HdpH RTS start time in each eventlog and synchronise the time in the eventlogs accordingly, and merge the multiple eventlogs from a distributed run.
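The normalise-and-merge step in the last requirement can be sketched independently of GHC. The following is a minimal illustration, not HdpHProf's implementation: it models an eventlog as a plain list of timestamped records rather than using the GHC-Events library, and the names `Event`, `normalise`, and `merge_logs`, as well as the event labels, are hypothetical. It assumes each node's log records its own RTS start time and that subtracting it places all nodes on a shared timeline starting at zero. It does not correct for clock drift or for nodes that start at different real times.

```python
import heapq
from typing import Iterable, Iterator, List, NamedTuple

class Event(NamedTuple):
    time_ns: int   # timestamp on the emitting node's own clock
    node: int      # identifier of the node that wrote the log
    label: str     # event description (hypothetical labels below)

def normalise(log: List[Event], rts_start_ns: int) -> List[Event]:
    """Shift a node's events so that its RTS start becomes time zero."""
    return [e._replace(time_ns=e.time_ns - rts_start_ns) for e in log]

def merge_logs(logs: Iterable[List[Event]]) -> Iterator[Event]:
    """Merge per-node logs, each already sorted by time, into one stream."""
    return heapq.merge(*logs, key=lambda e: e.time_ns)

# Two hypothetical nodes whose clocks disagree by a large offset.
node0 = [Event(1_000, 0, "rts start"), Event(1_500, 0, "spark created")]
node1 = [Event(9_000, 1, "rts start"), Event(9_200, 1, "spark converted")]

merged = list(merge_logs([normalise(node0, 1_000), normalise(node1, 9_000)]))
for e in merged:
    print(e.time_ns, e.node, e.label)
```

Because each per-node log is already ordered by time, a streaming merge avoids re-sorting the combined log, which may matter as the number of nodes grows.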

Functional Requirements:

• HdpHProf should provide analysis tools for HdpH performance, e.g. spark pool contention analysis and registry contention analysis.

• HdpHProf should use ThreadScope to browse the eventlogs and see how HdpH utilises the cores of a Beowulf cluster of multicores.

Performance Requirements:

• HdpHProf should scale to profile HdpH applications on clusters of multicores with a large number of cores, e.g. 192 cores of a 32-node Beowulf cluster.

• HdpHProf should impose low tracing overheads on the GHC-PPS and the profiled applications.
