Profiling and Tracing Tools

2.6 Data Reduction

2.7.2 Profiling and Tracing Tools

Open|SpeedShop

Open|SpeedShop is a suite of parallel performance analysis tools built from modular open- source infrastructure components. It includes sampled flat and call-path profilers, as well as trace tools for I/O operations, MPI, and floating-point exceptions. Open|SpeedShop uses the

Open|SpeedShop uses hardware counter measurements and binary instrumentation. For binary instrumentation, it uses the DynInst library. To support binary instrumentation, it communicates with parallel processes via a debug interface, and it uses the MRNet communication infrastructure to interact with running processes scalably. These components of Open|SpeedShop are described later in this section.

SvPablo

SvPablo (De Rose and Reed, 2000) is a well known profiling and tracing tool. It provides a GUI for manual instrumentation of code, as well as a parser for inserting source-level instrumentation into codes automatically. Instrumentation in SvPablo makes calls to an external data collection library that provides bookkeeping and I/O for profiling and tracing. The instrumentation library supports measurements of time and of hardware performance-counter values.

TAU

Tuning and Analysis Utilities (TAU) (Shende and Maloney, 2006) is a widely used set of performance and analysis tools. Like SvPablo, it supports profiling and tracing, and it provides an instrumenting parser to annotating source code automatically. It also supports measurement using either timers or hardware performance counters. Unlike SvPablo, TAU’s parsers place instrumentation inside selected functions, while SvPablo instruments each of a func- tion’s call sites.

Like Open|SpeedShop, TAU optionally supports dynamic binary instrumentation using the DynInst (Miller et al., 1995) framework, and it supports lossless trace compression using OTF.

In addition to its instrumentation components, TAU includes extensive facilities for analysis of collected performance data, implemented in its PerfExplorer tool (Huck and Malony,

2005). PerfExplorer is a GUI front-end with a set of data-mining and analysis tools that allow post-mortem data gathered using TAU (and other performance tools) to be analyzed using scalable techniques. PerfExplorer includes support for dimensionality reduction, clustering, correlation, and other statistical methods. In addition, it has been integrated with the OpenUH compiler (Liao et al., 2007) to provide automated knowledge-mining features. PerfExplorer can make use of data generated with OpenUH’s automatic instrumenting features by mining it for patterns relevant to parallel code tuning and power efficiency. The knowledge discovered by PerfExplorer is then fed back to OpenUH to inform compile-time optimization decisions.

2.7.3 Tracing Tools

Vampir and VNG

Vampir (Brunst et al., 2001) is a full event-tracing framework for MPI applications. It provides a number of Graphical User Interface (GUI) tools for viewing and analyzing stored traces, and supports lossless trace compression via OTF. Summary profiles may be generated from stored traces using Vampir’s GUI tools.

Vampir’s successor, Vampir Next Generation (VNG), extends Vampir with support for parallel client-side analysis. The VNG GUI client runs on a desktop machine, but can interact with remote tool processes for visualization and analysis. This enables VNG to process very large stored traces quickly.

Etrusca

Roth (Roth, 1996) presents a prototype system,Etruscathat reduces trace overhead through clustering. Etrusca uses clustering to select a set of representative processes from a parallel system. Once selected, only representative processes are traced, significantly reducing data

Nikolayev et al. (Nikolayev et al., 1997) extend this technique to online monitoring of large applications. Clusters are generated in real time based on observed data.

The parallel systems used here are small (≤ 128 processes) compared to those in use today. We describe the trade-offs involved with designing clustering algorithms for larger machines in Chapter 5. Further, both Nikolayev et al. and Roth use one representative from each cluster to approximate its behavior, but their methods do not guarantee adequate representation. In Chapter 4, we describe heuristics for estimating the size of a representative portion of a population using sampling techniques.

ScalaTrace

ScalaTrace (Noeth et al., 2007) is a tracing framework for scalable recording and compression of MPI communication traces. LikempiP, it uses link-level instrumentation to intercept MPI calls. For extremely regular SIMD applications, it can record all MPI events in traces of constant or near-constant size, regardless of the number of application processes, and regardless of the length of an application run.

Instead of recording a trace na¨ıvely as a raw sequence of events, ScalaTrace looks for repeating sequences as MPI operations are executed. Each process uses a local compression algorithm to translate MPI events to Regular Section Descriptors (RSDs), which are then used as a the alphabet to generate a regular expression. Regular expressions in ScalaTrace are represented as Power Regular Section Descriptors (PRSDs), a recursively defined version of RSDs where one PRSD may contain another as one of its symbols.

Locally compressed traces are merged in a distributed reduction operation at the end of a ScalaTrace run. This inter-process compression algorithm exploits several optimizations to achieve high levels of compression. First, location-independent communication endpoint en- codings are used so that send and receive operations from different processes can be merged into the same trace record. Second, ScalaTrace considers the semantics of constructs like

MPI Waitsome, for which exact repetition count can be nondeterministic. Such call records are merged into a single record in the output trace.

The standard ScalaTrace library preserves event-ordering information and offers the op- tion of recording statistical event-timing information (Ratn et al., 2008). It works well for stencil codes and for applications with regular communication patterns. However, for applications with irregular behavior, its traces can degenerate and may begin to grow linearly with the number of nodes in the system as well as the length of an application trace.

In document Scalable performance measurement and analysis (Page 63-67)