Previous research in the field of overlapping communication and computation could roughly be categorized in three directions. These are:
• exploring state-of-the-art support for exploiting overlap; • exploring overlapping techniques; and
• measuring the potential for overlap that is present in applications.
First, several studies evaluated the overlapping capability of different processors, networks and programming languages. Sohn et al. [82] tested various multiprocessors and compared their overlapping efficiency. Furthermore, Brightwell et al. [19] quanti- fied in detail the potential influence of overlap, offload and independent progress. Later work studied many MPI implementations and showed that their overlapping abilities are different [18, 58]. Further research [26, 27] explored the potential of PGAS lan- guages to decouple communication and synchronization and achieve higher overlap. Furthermore, Bell at al. [10] showed overlapping advantages of light one-sided trans- fers implemented in UPC. Throughout our study, we assume that the used underlying
communication layer is fully capable of overlapping communication and computation. Namely, our major goal is to identify the potential overlap inherent in the application.
Second, many research efforts explored implementation issues of overlapping tech- niques. In an effort to hide communication delays, Leu et al. [59] identified overlap- ping as a technique that can achieve the maximum application speedup of two. Later, Danalis et al. [33] defined general code restructuring approaches that lead to better overlap in applications that exhibit limited dependencies among iterations. Hoefler et al. [50] proved overlapping potential of non-blocking collective communications in MPI. Furthermore, Das et al. [34] introduced compiler features that postpone recep- tions (sink waits) in MPI applications, while Iancu et al. [52] extended UPC runtime library to implement demand-driven synchronization, automatic message strip mining and message scheduling. However, the mentioned efforts fail to clearly determine the potential benefits of their overlapping techniques, because they fail to isolate the over- lapping effect from the implementations’ side-effects such as changed locality (cache and TLB misses) and non-deterministic events (OS daemons, preemptions, interfer- ences in a shared resource). On the other hand, our simulation can measure isolated impact of overlap, since the simulation framework introduces overlapping mechanisms without impacting other execution properties.
Third, there was little effort to identify the potential for overlap in applications. Sancho et al. [76] provided a theoretical estimation of the overlapping potential in scientific codes by modeling an application with one iterative loop and parameters that roughly describe the computation pattern. Our study continues Sancho’s work by designing a simulation framework that automatically estimates the potential overlap, without the need to understand the studied application. Moreover, our methodology allows studying overlap on diverse network configuration and provides visualization support to qualitatively inspect the simulated execution. Our framework allowed us to analyze how overlap depends on application properties, the overlapping technique and network resources.
7.3
Identifying parallelization bottlenecks
Tracing is a common technique for identifying performance bottlenecks. In tracing, the environment instruments the parallel execution and collects a vast amount of perfor-
mance data. Usually, the environment provides a visualization support that facilitates the programmer to explore the performance bottlenecks. Still, tracing rarely directly focuses the programmer’s attention to the problem. The most wide-spread environ- ments for tracing parallel executions are Jumpshot [91] from Argonne National Lab- oratory, Paraver [55] from Barcelona Supercomputing Center, Vampir [65] from TU Dresden and ScalaTrace project [67] from North Carolina State University.
Profiling is more efficient as it collects information throughout the execution and then reduces that information into a short report. The traditional profilers showed to be very useful for analyzing sequential execution. The most popular profiler, gprof [46], reports the percentage of computation time spent in each function. Identifying the most time consuming code section, gprof automatically finds the critical code section. How- ever, in parallel executions, the most time consuming code section is not necessarily the critical code section. The most well-known parallel profilers are FPMPI-2 [57] from
Argonne National Laboratory, mpiP [88] from Lawrence Livermore National Labora-
tory , HPCT [23] from IBM, TAU [79] from University of Oregon and Scalasca [44] from Julich Supercomputing Center.
Other techniques explore critical code sections by analyzing the critical path of execution. These techniques identify the critical code section as the code section that contributes the most to the overall critical path. The most popular representatives are Vtune [30] from Intel and Spartan [5] from University of Illinois. However, the analy- sis of the critical path omits the influence of the target parallel machine on which the application executes. Moreover, optimizing the code may shorten the starting critical path, and some other execution path may prevail, becoming the new critical path.
The newest approaches identify execution phases of low parallelism and then find culprit code sections. Quartz [6] and Intel Thread Profiler [17] [29] try to quantify the potential parallelism of each section of the code. These tools identify critical code sec- tions as the ones with lowest parallelism. Similarly, Tallent at. el. [86] try to explore which code sections are responsible for processor stalls. Furthermore, this study con- siders different policies of spreading the blame for lost performance, from contexts in which spin-waiting occurs (victims), to directly blaming a lock holder (perpetrators). However, the drawback of all these techniques is that the conclusions obtained for one target platform can hardly be applied to a target platform with a different amount of parallelism.
Using our environment (Section4.3) we show that the previous approaches cannot capture some influences that are very decisive in identifying critical code sections. First, that the choice of the critical section depends on the parallel target machine on which the application executes. Second, that depending on the factor of acceleration different sections may be the most beneficial to accelerate. Finally, all the previous techniques work with “measure-modify” approach – from the profile of one run the technique points to the critical section, the programmer optimizes the section, and then runs again the application “hoping” that the optimization resulted in overall speedup. Conversely, our approach provides anticipation of benefits – prior to any optimization effort, the programmer can estimate the potential benefits of the planned optimization.