4.2 Graph Construction

4.2.3 Libtdg tool

We developed the Libtdg tool on top of OMPT to overcome two limitations of the Nanos++ TDG plugin: its strong dependence on the Nanos++ runtime and its lack of support for loop chunks. Although the extended OMPT, as presented in the previous subsection, is currently supported only by the LLVM runtime, relying on a runtime-independent interface makes the TDG-based analysis portable to other runtimes.

The Libtdg tool implements the callbacks in Table 4.1. The callbacks for the beginning of a new parallel region, explicit task creation, and a loop chunk cause the tool to create new nodes in the graph. Libtdg tracks the execution time of each node by measuring when it starts and ends. For parallel regions, this happens when the ompt_callback_parallel_begin_t and ompt_callback_parallel_end_t callbacks arrive. In the case of explicit tasks, the time is tracked using the scheduling callbacks, and for loop chunks, we measure the time between consecutive chunk callbacks, or between the last chunk callback and the ext_callback_loop_t callback signaling that we reached the end of a parallel loop. We also gather PAPI counters [115] for individual chunks. For these measurements, we query the runtime for thread data (ompt_get_thread_data_t) in the chunk callback. The thread data structure is initialized in ompt_callback_thread_begin_t and contains all the information needed for tracking PAPI counters in each thread.
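The chunk-timing rule described above can be illustrated with a minimal sketch. It is not Libtdg's actual code: the types and functions (tdg_node, thread_state, on_chunk, on_loop_end) are hypothetical names, the fixed-size node array is a simplification, and timestamps are passed in explicitly so the logic can be followed without an OpenMP runtime. In a real tool, these functions would presumably be driven by the chunk and loop-end callbacks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds; the names are illustrative only. */
typedef enum { NODE_PARALLEL, NODE_TASK, NODE_CHUNK } node_kind;

typedef struct {
    node_kind kind;
    double start;   /* start time in seconds */
    double end;     /* end time in seconds, meaningful once closed != 0 */
    int closed;
} tdg_node;

#define MAX_NODES 64   /* arbitrary limit chosen for this sketch */

/* Per-thread state: recorded nodes and the chunk that is currently running. */
typedef struct {
    tdg_node nodes[MAX_NODES];
    size_t count;
    tdg_node *open_chunk;   /* NULL when no chunk is running */
} thread_state;

/* Create a new node that starts at time `now`. */
tdg_node *node_begin(thread_state *ts, node_kind kind, double now) {
    assert(ts->count < MAX_NODES);
    tdg_node *n = &ts->nodes[ts->count++];
    n->kind = kind;
    n->start = now;
    n->end = now;
    n->closed = 0;
    return n;
}

/* Record the end time of a node. */
void node_end(tdg_node *n, double now) {
    n->end = now;
    n->closed = 1;
}

/* A new chunk arrived: the previous chunk (if any) ends where this one begins. */
void on_chunk(thread_state *ts, double now) {
    if (ts->open_chunk != NULL)
        node_end(ts->open_chunk, now);
    ts->open_chunk = node_begin(ts, NODE_CHUNK, now);
}

/* The loop ended: close the last chunk, which has no successor chunk. */
void on_loop_end(thread_state *ts, double now) {
    if (ts->open_chunk != NULL) {
        node_end(ts->open_chunk, now);
        ts->open_chunk = NULL;
    }
}

/* Execution time of a node; 0 for nodes that are still open. */
double node_duration(const tdg_node *n) {
    return n->closed ? n->end - n->start : 0.0;
}
```

With chunk callbacks at 1.0 s and 3.0 s and a loop-end callback at 6.0 s, this sketch records two chunk nodes lasting 2.0 s and 3.0 s.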

The design of Libtdg allows us to easily add new post-processing steps that the tool runs at the end of the execution. We already have steps for printing general information about the TDG, critical path computation, printing the TDG as a DOT file, and printing detailed data about the chunks (e.g., iteration range and values from PAPI counters).

Figure 4.3 shows a typical usage of Libtdg. It is invoked by dynamic preloading with the LD_PRELOAD environment variable. The post-processing steps at the end are specified with the TDG_TOOL_POSTPROC environment variable as a comma-separated list. In the example, “tim” prints general timing information, “dot” writes a DOT file to the output, and “log” writes detailed chunk information to the output. The ability to turn individual post-processing steps on and off allows us to save time by choosing only the steps relevant to our analysis. The environment variable TDG_PAPI_COUNTERS specifies which PAPI counters Libtdg should measure. The more PAPI counters are used, the larger the overhead; therefore, the rule of thumb is to use only two counters.
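To make the comma-separated format concrete, the following sketch parses such a list into a bitmask of selected steps. The step names “tim”, “dot”, “log”, and “cri” come from the text; the function parse_postproc, the STEP_* flags, and the choice to silently ignore unknown names are assumptions of this sketch and need not match how Libtdg handles the variable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flags, one per post-processing step named in the text. */
enum {
    STEP_TIM = 1 << 0,   /* general timing information */
    STEP_DOT = 1 << 1,   /* DOT file output */
    STEP_LOG = 1 << 2,   /* detailed chunk information */
    STEP_CRI = 1 << 3    /* critical path computation */
};

/* Parse a comma-separated list such as "tim,dot,log" into a bitmask.
 * Unknown or empty entries are ignored (an assumption of this sketch). */
unsigned parse_postproc(const char *list) {
    static const struct { const char *name; unsigned flag; } steps[] = {
        {"tim", STEP_TIM}, {"dot", STEP_DOT},
        {"log", STEP_LOG}, {"cri", STEP_CRI},
    };
    unsigned mask = 0;
    if (list == NULL)
        return 0;
    while (*list != '\0') {
        size_t len = strcspn(list, ",");   /* length of the current entry */
        for (size_t i = 0; i < sizeof steps / sizeof steps[0]; ++i) {
            if (strlen(steps[i].name) == len &&
                strncmp(list, steps[i].name, len) == 0)
                mask |= steps[i].flag;
        }
        list += len;
        if (*list == ',')
            ++list;                        /* skip the separator */
    }
    return mask;
}
```

A tool could call parse_postproc(getenv("TDG_TOOL_POSTPROC")) once at startup (getenv is declared in stdlib.h) and run only the steps whose flags are set.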

Figure 4.4: Task dependency graph produced by the Libtdg tool that represents the execution of a simple matrix multiplication code with one parallel loop on one thread. The green-colored node represents the beginning of a parallel loop and its children are loop chunks. The numbers in each node before the asterisk (*) are the execution times in seconds. The numbers after it are either node IDs or the iteration ranges of the loop chunks.

Figure 4.5: Task dependency graph produced by the Libtdg tool that represents the execution of a simple matrix multiplication code with one parallel loop on two threads. Each green-colored node, which represents the beginning of a parallel loop, corresponds to a different thread. The children nodes of a green-colored node are loop chunks. The numbers in each node before the asterisk (*) are the execution times in seconds. The numbers after it are either node IDs or the iteration ranges of the loop chunks.

Figures 4.4 and 4.5 present two example TDGs produced by Libtdg. Both graphs represent the execution of a simple matrix multiplication code, implemented as three nested loops with the outer loop being the only parallel loop in the code. There are eight chunks in this loop, represented by eight wheat-colored nodes in the figures. These nodes are children of the green-colored nodes that represent parallel loops. Since the loop was executed with OpenMP’s dynamic scheduling (see Section 1.2.1), all the timing information was captured in the chunk nodes, making the execution times in the green nodes negligible. If a parallel loop is executed with OpenMP’s static scheduling, the graph will have no chunk nodes and the loop nodes will show the execution times. The number of loop nodes equals the number of threads, since the OpenMP runtime divides the loop computation equally among the threads. The golden-colored nodes represent implicit parts of the computation, such as the execution between the start of a parallel region and a parallel loop or between the end of the parallel loop and the end of the region. The red-colored nodes in Figure 4.5 represent barriers: one at the end of the parallel loop and one at the end of the parallel region.
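As an illustration of how such a graph could be emitted, the sketch below formats nodes and edges as DOT statements using the color scheme and the "time * ID or iteration range" labels described in the captions. The exact label layout, the type and function names (dot_node, dot_append_node, dot_append_edge), and the three-decimal time format are assumptions; only the DOT attributes style, fillcolor, and label and the color names are standard Graphviz.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical node kinds matching the colors used in the figures. */
typedef enum { N_IMPLICIT, N_LOOP, N_CHUNK, N_BARRIER } dot_kind;

typedef struct {
    int id;
    dot_kind kind;
    double seconds;               /* execution time of the node */
    long first_iter, last_iter;   /* iteration range, used only for chunks */
} dot_node;

/* Map a node kind to a Graphviz color name. */
const char *dot_color(dot_kind kind) {
    switch (kind) {
    case N_LOOP:    return "green";
    case N_CHUNK:   return "wheat";
    case N_BARRIER: return "red";
    default:        return "gold";   /* implicit parts of the computation */
    }
}

/* Append one node statement to the NUL-terminated DOT text in buf. */
void dot_append_node(char *buf, size_t size, const dot_node *n) {
    size_t used = strlen(buf);
    assert(used < size);
    if (n->kind == N_CHUNK)   /* chunks show their iteration range */
        snprintf(buf + used, size - used,
                 "  n%d [style=filled, fillcolor=%s, label=\"%.3f * %ld-%ld\"];\n",
                 n->id, dot_color(n->kind), n->seconds,
                 n->first_iter, n->last_iter);
    else                      /* all other nodes show their ID */
        snprintf(buf + used, size - used,
                 "  n%d [style=filled, fillcolor=%s, label=\"%.3f * %d\"];\n",
                 n->id, dot_color(n->kind), n->seconds, n->id);
}

/* Append one directed edge statement to the DOT text in buf. */
void dot_append_edge(char *buf, size_t size, int from, int to) {
    size_t used = strlen(buf);
    assert(used < size);
    snprintf(buf + used, size - used, "  n%d -> n%d;\n", from, to);
}
```

A caller would start the buffer with "digraph tdg {\n", append all nodes and edges, and finish with "}\n"; the resulting text can then be rendered with the Graphviz dot command.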

Besides the three post-processing steps in the example in Figure 4.3, Libtdg also offers the “cri” step, which computes the critical path of the TDG. In the next section, we take a closer look at the graph analysis approach in general and the computation of the critical path in particular.
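For orientation, the critical path of a graph whose nodes carry execution times is the longest weighted path through it. The sketch below computes its length with a simple dynamic-programming pass. It is a generic longest-path routine, not necessarily the algorithm used by Libtdg, and it assumes that nodes are numbered in topological order (every edge goes from a lower to a higher index) and that all execution times are non-negative; the names tdg_edge and critical_path are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A dependency edge between two node indices. */
typedef struct { size_t from, to; } tdg_edge;

/* Length of the longest weighted path in a DAG.
 * weight[v] is the execution time of node v; dist must provide room for
 * n_nodes entries and receives, for each node, the length of the longest
 * path that ends at that node. Assumes from < to for every edge. */
double critical_path(const double *weight, size_t n_nodes,
                     const tdg_edge *edges, size_t n_edges,
                     double *dist) {
    double longest = 0.0;
    for (size_t v = 0; v < n_nodes; ++v) {
        double best_pred = 0.0;   /* longest path ending at a predecessor */
        for (size_t e = 0; e < n_edges; ++e) {
            assert(edges[e].from < edges[e].to);
            if (edges[e].to == v && dist[edges[e].from] > best_pred)
                best_pred = dist[edges[e].from];
        }
        dist[v] = best_pred + weight[v];
        if (dist[v] > longest)
            longest = dist[v];
    }
    return longest;
}
```

The nested loops keep the sketch short at the cost of a running time proportional to the number of nodes times the number of edges; an adjacency list would reduce this to linear time.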

In document Scalability Engineering for Parallel Programs Using Empirical Performance Models (Page 93-96)