Instrumentation and visualization tools - Structure of the Document

1.4 Structure of the Document

2.1.1 Instrumentation and visualization tools

The use of Extrae and Paraver offers an enormous potential of analysis, both qualitative and quantitative, allowing to identify the actual performance bottlenecks of parallel applications. The microscopic view of the program behaviour that the tools provide is very useful to optimize the parallel program performance.

Paraver

Paraver [153], developed at Barcelona Supercomputing Center (BSC), is a flexible parallel program visualization and analysis tool based on an easy-to-use Motif Graphical User Interface (GUI). Paraver was developed responding to the basic need of having a qualitative global perception of the application behaviour by visual inspection, to be able to focus on the detailed quantitative analysis of the problem. Paraver provides a large amount of information that directly improves the decisions on whether and where to invest the programming effort to optimize an application. The result is a reduction of the development time as well as the minimization of the human resources. Some features of Paraver are its support for the following [153]

• Detailed quantitative analysis of program performance.

• Concurrent comparative analysis of multiple traces.

• Fast analysis of very large traces.

• Mixed support for message passing and shared memory.

• Easy personalization of the semantics of the visualized information.

One of the main features of Paraver is the flexibility to represent traces coming from different environments. Traces are composed of state transitions, events, and communications with an associated timestamp. These three elements can be used to build traces that capture the behaviour along time of very different kinds of systems. The Paraver distribution includes, either in its

2.1. INTEGRATED TOOLS

Figure 2.1: Paraver Internal Structure.

own distribution or as an additional package, the instrumentation tools for sequential and parallel application tracing, as well as for system activity tracing in a multiprogrammed environment.

Paraver allows users to develop their own tracing facilities according to their own interests and requirements. The possibilities offered by the visualization, semantic and quantitative analyzer modules are powerful, enabling users to analyze and understand the behaviour of the traced system. Paraver also allows to customize some of its parts as well as to plug-in new functionalities.

Expressive power, flexibility and the capability of efficiently handling large traces are key features addressed in the design of Paraver. The clear and modular structure of this tool plays a significant role towards achieving these targets.

The structure of Paraver consists of three levels of modules (see Figure 2.1). At the top, the Filter Module (FM) works onto the trace file. This module offers a partial view of the trace file to the next level. In the middle, the Semantic Module (SM) receives the trace file filtered by the previous module and interprets it. This module transforms the record traces to time-dependent values which will be passed to the Representation Module (RM). The SM is the most important level because it extracts and gives sense to the record values in the trace file. This file contains a significant amount of information that this module can selects (the semantics of the trace file). Finally, at the bottom, the RM receives the time-dependent values computed by the SM and displays it in different ways. The RM drives thus the whole process and offers a graphical display of the trace file.

Extrae

Extrae [79] is a dynamic instrumentation and measurement package developed at BSC. It is devoted to generate Paraver trace-files for a post-mortem analysis. This package traces programs

Supported programming models MPI OpenMP* CUDA* OpenCL* pthread* OmpSs* Java Python Supported platforms

Linux clusters (x86 and x86-64) BlueGene/Q

Cray

nVidia GPUs Intel Xeon Phi ARM

Android

Table 2.1: Programming models and systems supported by Extrae. * Also available in conjunction with MPI.

compiled and ran in the shared memory model (like OpenMP and pthreads), the MPI, or both programming models (different MPI processes using OpenMP or pthreads within each MPI process). It is currently available for different platforms and operating systems; see Table 2.1.

Extrae uses different interposition mechanisms to inject probes into the target application so as to gather information regarding the application performance.

Interposition mechanisms. Extrae takes advantage of multiple interposition mechanisms to add monitors into the application. Independently of which mechanism is employed, the goal is the same: collect performance metrics at known application points to finally provide the performance analyst a correlation between performance and the application execution. The interposition mechanisms used by Extrae are [79]:

• Linker preload (LD PRELOAD). Most of the current operating systems allow injecting a shared library into an application before the application is actually loaded. If the library that is being preloaded provides the same symbols as those contained in shared libraries, such symbols can be wrapped in order to inject code in these calls. In Linux systems this technique is commonly implemented by using the LD PRELOAD environment variable. Extrae contains substitution symbols for many parallel runtimes, such as OpenMP (either Intel, GNU or IBM runtimes), pthread, Compute Unified Device Architecture (CUDA) accelerated applications, and MPI applications.

• DynInst. DynInst is an instrumentation library that allows to modify the application by injecting code at specific code locations. Although originally it allowed modifying the application code when the application was run, now it supports rewriting the binary of the application so that the code injection is required only once. Extrae uses DynInst to instrument different parallel programming runtimes, such as OpenMP (either for Intel, GNU or IBM runtimes), CUDA accelerated applications, and MPI applications. DynInst also offers Extrae the possibility to instrument user functions by simply listing them in a file.

• Additional instrumentation mechanisms. Extrae also takes advantage of some parallel programming runtimes that have their own instrumentation (or profile) mechanisms available for performance tools. These include MPI, which provides the Profile-MPI (PMPI) layer, as well as the CUPTI infrastructure to get information from CUDA devices, OmpSs or even the Open Computing Language (OpenCL) profiling capabilities.

2.1. INTEGRATED TOOLS

There are some compilers that allow instrumenting application routines by using special compilation flags during the compilation and link phases.

• Extrae Application Programming Interface (API). Finally, Extrae offers the user the possibility to manually instrument the application and emit its own events if the previous mechanisms do not fulfill the user’s needs.

Sampling mechanisms. Extrae can be leveraged to instrument the application code, and it also offers sampling mechanisms to gather performance data. While adding monitors into a specific location of the application provides insights which can be easily correlated with the source code, the resolution of such data is directly related with the application control flow. Adding sampling capabilities into Extrae produces performance information for regions of code which have not been instrumented.

Currently, Extrae supports two different sampling mechanisms. The first mechanism is based on signal timers, which fire the sampling handler at a specified time interval. The second sampling mechanism uses the processor performance counters to trigger the sampling handler at a specified interval of events. While the first mechanism can provide samples that are totally uncorrelated with the application code, the second mechanism, using the appropriate performance counters, can provide insights of the application behaviour while still presenting some correlation with the application code/performance.

Performance data gathered. The monitors added by Extrae collect different types of information. Depending on the placement, each monitor can be instructed to collect specific information. The most common information is:

• Timestamp. When analyzing the behaviour of an application, it is important to have a fine timing granularity (up to nanoseconds). Extrae provides a set of clock functions that are specifically implemented for different target machines in order to provide the most accurate possible timing. On systems with daemons that inhibit the usage of these timers, or systems that do not have a specific timer implementation, Extrae still uses advanced Portable Operating System Interface (POSIX) clocks to provide nanosecond resolution timestamps with low cost.

• Performance and other counter metrics. Extrae uses the Performance API (PAPI) [141] and the Performance Monitor API (PMAPI) interfaces to collect information on the microprocessor performance. With the advent of the components in the PAPI software, Extrae is not only able to collect information regarding how the microprocessor behaves, but also allows studying multiple components of the system (disk, network, operating system, among others) and extend the study to the microprocessor (power consumption and thermal information). Extrae mainly collects these counter metrics at the parallel programming calls and at the samples. It also allows capturing such information at the entry and exit points of the user routines that were instrumented.

• References to the source code. Analyzing the performance of an application requires to study the code that is responsible for such performance. This way the analyst can locate the performance bottlenecks and suggest improvements on the application code. Extrae provides information on the source code that was executed (in terms of name of function, file name and line number) at specific location points such as programming model calls or sampling points.

State Name Description CPUs

C0 Operating State CPU fully turned on All CPUs

C1 Halt Stops CPU main internal clocks via software;

bus interface unit and APIC are kept running at full speed 486DX4 and above C1E Enhanced State Stops CPU main internal clocks via software and reduces CPU voltage;

bus interface unit and APIC are kept running at full speed All socket LGA775 CPUs

C1E – Stops all CPU internal clocks Turion 64, 65-nm Athlon X2

and Phenom CPUs C2 Stop Grant Stops CPU main internal clocks via hardware;

bus interface unit and APIC are kept running at full speed 486DX4 and above

C2 Stop Clock Stops CPU internal and external clocks via hardware Only 486DX4, Pentium,

Pentium MMX, K5, K6, K6-2, K6-III C2E Extended Stop Grant Stops CPU main internal clocks via hardware and reduces CPU voltage;

bus interface unit and APIC are kept running at full speed Core 2 Duo and above (Intel only)

C3 Sleep Stops all CPU internal clocks Pentium II, Athlon and above,

but not on Core 2 Duo E4000 and E6000

C3 Deep Sleep Stops all CPU internal and external clocks Pentium II and above, but not on

Core 2 Duo E4000 and E6000; Turion 64

C3 AltVID Stops all CPU internal clocks and reduces CPU voltage AMD Turion 64

C4 Deeper Sleep Reduces CPU voltage

Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64

C4E/C5 Enhanced Deeper Sleep Reduces CPU voltage even more and turns off the memory cache Core Solo, Core Duo and 45-nm mobile Core 2 Duo only C6 Deep Power Down Reduces the CPU internal voltage to any value, including 0 V Re 45-nm mobile Core 2 Duo only

Table 2.2: Available processor power states (extracted from [184]).

In document Performance and Energy Optimization of the Iterative Solution of Sparse Linear Systems on Multicore Processors (Page 34-38)