Blue Gene/L: Performance Analysis Tools

Learn about Blue Gene/L performance tooling

Discover the details about PAPI and the External Performance Monitor

Understand the pros and cons of the different tools

International Technical Support Organization

This edition applies to Version 1, Release 3, Modification 1 of Blue Gene/L (product number 5733-BG1).


Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.


Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

AIX 5L™, AIX®, Blue Gene®, IBM®, LoadLeveler®, PowerPC®, POWER™, Redbooks™, Redbooks (logo)™, System i™, System p™, Tracer™

The following terms are trademarks of other companies:

Java, Sun, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Excel, Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both.


Preface

This IBM® Redbook is one in a series of IBM publications written specifically for the IBM System Blue Gene® supercomputer, Blue Gene/L, which was developed by IBM in collaboration with Lawrence Livermore National Laboratory (LLNL). This redbook provides an overview of the application development performance analysis environment for Blue Gene/L. This redbook explains some of the tools that are available to do application-level performance analysis. It devotes the majority of its content to Chapter 3, “External Performance Instrumentation Facility,” and Chapter 4, “Performance Application Programming Interface.”

The team that wrote this redbook

This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization (ITSO), Poughkeepsie Center.

Gary L. Mullen-Schultz is a Consulting IT Specialist at the ITSO, Poughkeepsie Center. He leads the team that is responsible for producing Blue Gene/L documentation, and is the primary author of this redbook. Gary also focuses on Java™ and WebSphere. He is a Sun™ Certified Java Programmer, Developer and Architect, and has three issued patents.

Thanks to the following people for their contributions to this project:

Mark Mendell, Kara Moscoe
IBM Toronto, Canada

Ed Barnard, Todd Kelsey, Gary Lakner, James Milano, Jenifer Servais, Janet Willis
ITSO, Poughkeepsie Center

Charles Archer, Peter Bergner, Lynn Boger, Mike Brutman, Jay Bryant, Tom Budnik, Kathy Cebell, Jeff Chauvin, Roxanne Clarke, Darwin Dumonceaux, David Hermsmeier, Matt Light, Dave Limpert, Chris Marroquin, Randall Massot, Curt Mathiowetz, Pat McCarthy, Mark Megerian, Marv Misgen, Jose Moreira, Mike Mundy, Mike Nelson, Jeff Parker, Kurt Pinnow, Scott Plaetzer, Ruth Poole, Joan Rabe, Joseph Ratterman, Don Reed, Harold Rodakowski, Brent Swartz, Richard Shok, Brian Smith, Karl Solie, Wayne Wellik, Nancy Whetstone, Mike Woiwood
IBM Rochester

Tamar Domany, Edi Shmueli
IBM Israel

Gary Sutherland, Ed Varella
IBM Poughkeepsie

Gheorghe Almasi, Bob Walkup


Become a published author

Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners and/or customers.

Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability.

Find out more about the residency program, browse the residency index, and apply online at:

ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us!

We want our Redbooks™ to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:

- Use the online Contact us review redbook form found at:

ibm.com/redbooks

- Send your comments in an e-mail to:

[email protected]

- Mail your comments to:

IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road

Chapter 1. Performance guidelines and tools

This chapter describes the process of using tools to analyze system and application performance.


1.1 Tooling overview

A variety of tools are available to help understand your application’s performance when running on Blue Gene/L. Some of these tools are written by IBM, others are written by independent software vendors (ISVs), and still others are open source efforts.

Some tools have been ported to Blue Gene/L, and more are moving over every month. However, almost all tools work against the IBM System p™ platform. If the particular tool that you want to use does not yet support Blue Gene/L, it can be advantageous to run your application on the System p platform and profile it there.

This section first discusses the main tool suite from IBM. Then, it examines the tools that you can use on the System p platform to help understand Blue Gene/L performance. Finally, it looks at tools that run natively with applications on Blue Gene/L.

1.1.1 IBM High Performance Computing Toolkit

The Advanced Computing Technology Center (ACTC), part of IBM Research in Yorktown Heights, New York, conducts research on the performance behavior of scientific and technical computing applications. Its role in IBM is to provide strategic technical direction for the research and development of server platforms to advance the state of the art in high performance computing offerings and solutions for IBM Clients in computationally intensive industries. Such industries include automotive, aerospace, petroleum, meteorology, and life sciences.

IBM offers the IBM High Performance Computing Toolkit, a suite of performance-related tools and libraries to assist in application tuning. This toolkit is an integrated environment for performance analysis of sequential and parallel applications using the Message Passing Interface (MPI) and OpenMP paradigms. It provides a common framework for IBM mid-range server offerings, including the IBM System p and System i™ platforms and Blue Gene/L systems, on both AIX® and Linux®.

1.2 General performance testing

IBM recommends that you test an application on a System p machine before you run it on a Blue Gene/L system if possible. You should also use a memory size per Compute Node that is compatible with the Blue Gene/L architecture. This approach makes it possible to check both memory utilization and performance issues. Both the System p platform and the Blue Gene/L supercomputer use IBM XL compilers, which aid portability between the two systems.

1.2.1 Overview of the tools that are available on System p

For the best performance, it is good practice to obtain a performance profile for your application. IBM is porting its comprehensive performance analysis tools, the High Performance Computing Toolkit, to the Blue Gene/L supercomputer. In the meantime, we recommend that you perform profiling on a similar system, such as the System p platform. Most computational performance issues are the same on Blue Gene/L as on other reduced instruction set computer (RISC) processors, so this method usually identifies the main issues. For parallel performance, several MPI profiling tools are available, including the ones listed in the following sections.


IBM High Performance Computing Toolkit

The IBM High Performance Computing Toolkit is the foundation for all performance tools for Blue Gene/L and the IBM System family. The tools provide source code traceback of the performance data to help the user quickly identify any bottlenecks in the code. The toolkit includes low-overhead measurement of time spent in MPI routines for applications written in any mixture of Fortran, C, and C++.

The tools include Xprofiler, MPI_tracer, MPI_Profiler, and PeekPerf. The toolkit provides a text summary and an optional graphical display.

Paraver

Paraver is a graphical user interface (GUI)-based performance visualization and analysis tool that you can use to analyze parallel programs. It lets you obtain detailed information from raw performance traces. To learn more about Paraver, refer to the following Web address:

http://www.cepba.upc.es/paraver/

MPE/jumpshot

MPICH2 has extensions for profiling MPI applications, and the MPE extensions have been ported to Blue Gene/L. For more information, refer to the following Web address:

http://www-unix.mcs.anl.gov/mpi/mpich/

1.2.2 Overview of tools ported to Blue Gene/L

The following tools have been ported to the Blue Gene/L platform:

- Kit for Objective Judgement and Knowledge (KOJAK)-based detection of performance bottlenecks

http://www.fz-juelich.de/zam/kojak/

- Tuning and Analysis Utilities (TAU)

http://www.cs.uoregon.edu/research/paracomp/tau/tautools/

1.3 Message passing performance

Measuring the performance of message passing (MPI) in an application can quickly help identify trouble areas. MPI Tracer™ and Profiler consist of a set of libraries that collect profiling and tracing data for MPI programs. Performance metrics, such as the time used by MPI function calls and message sizes, are reported.

These tools are available from the IBM ACTC. For more information about these and other tools that this organization provides, go to the following Web address:

http://www.research.ibm.com/actc/

1.3.1 MPI Tracer and Profiler

MPI Tracer and Profiler consist of a set of libraries that collect profiling and tracing data for MPI programs. Performance metrics, such as the time used by MPI function calls and message sizes, are reported.

PeekView gives a visual representation of the overall computation and communication pattern of the system.

MPI Profiler captures summary data for MPI calls. That is, it does not show you the specifics of an individual call to, for example, MPI_Send, but rather the combined data for all calls made to that routine during the profile period. See Figure 1-1 for an example.

Figure 1-1 MPI Profiler summary data

No changes to your source code are required to use the MPI Profiler function. However, you must compile using the debug (-g) flag.
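To make the idea of summary data concrete, the following minimal Python sketch folds individual call records into one combined row per MPI routine. The record layout (routine name, elapsed seconds, message bytes), the sample values, and the function name summarize_calls are assumptions made for illustration only; they do not reflect the actual MPI Profiler data format or output.

```python
from collections import defaultdict


def summarize_calls(records):
    """Combine per-call records into one summary row per routine.

    records: iterable of (routine, seconds, nbytes) tuples (hypothetical layout).
    Returns {routine: {"calls": n, "seconds": total, "bytes": total}}.
    """
    summary = defaultdict(lambda: {"calls": 0, "seconds": 0.0, "bytes": 0})
    for routine, seconds, nbytes in records:
        row = summary[routine]
        row["calls"] += 1
        row["seconds"] += seconds
        row["bytes"] += nbytes
    return dict(summary)


# Made-up call records, for illustration only.
calls = [
    ("MPI_Send", 0.25, 1024),
    ("MPI_Recv", 0.5, 1024),
    ("MPI_Send", 0.75, 4096),
]
summary = summarize_calls(calls)
for routine in sorted(summary):
    print(routine, summary[routine])
```

In this sketch the two MPI_Send calls collapse into a single row, so the per-call detail is no longer recoverable from the summary.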

1.4 CPU performance

The CPU performance tools are from the IBM ACTC. For more information about these and other tools that IBM ACTC provides, refer to the following Web address:

http://www.research.ibm.com/actc/

1.4.1 Hardware performance monitor

The hardware performance counter monitor module provides comprehensive reports of events that are critical to performance on IBM systems. In addition to the usual timing information, the hardware performance monitor can gather critical hardware performance metrics. These might include the number of misses on all cache levels, the number of floating point instructions executed, and the number of instruction loads that cause Translation Lookaside Buffer (TLB) misses, which help the algorithm designer or programmer identify and eliminate performance bottlenecks.

Important: It is vital that you call MPI_Finalize in your application for the profiling function

1.4.2 Xprofiler

Xprofiler is among a set of CPU profiling tools, such as gprof, pprof, prof, and tprof, that are provided on AIX. You can use them to profile both serial and parallel applications.

Xprofiler uses procedure-profiling information to construct a graphical display of the functions within an application. It provides quick access to the profiled data and helps users identify the functions that are the most CPU intensive. With the GUI, it is easy to find the application’s performance-critical areas.

1.5 I/O performance

Understanding input/output (I/O) performance is as important as understanding application and CPU performance issues.

1.5.1 Modular I/O

Modular I/O is not yet officially supported on Blue Gene/L. Modular I/O addresses the need of application-level optimization for I/O. For I/O-intensive applications, the Modular I/O libraries provide a means to analyze the I/O behavior of applications and tune I/O at the application level for optimal performance. For example, when an application exhibits the I/O pattern of sequential reading of large files, Modular I/O detects the behavior and invokes its asynchronous prefetching module to prefetch user data.

Tests with the AIX journaled file system (JFS) demonstrate a significant improvement in system throughput when using Modular I/O.

1.6 Visualization and analysis

The PeekPerf tool is from the IBM ACTC. For more information about this and other tools that IBM ACTC provides, refer to the following Web address:

http://www.research.ibm.com/actc/

1.6.1 PeekPerf

PeekPerf visualizes the performance trace information generated by the performance analysis tools. PeekPerf also maps the collected performance data back to the source code, which makes it easier for users to find bottlenecks and points for optimization. PeekPerf is available on several UNIX® derivatives (AIX, Linux) and Microsoft® Windows®.


1.7 MASS and MASSV libraries

The Mathematical Acceleration Subsystem (MASS) and MASSV libraries consist of a set of mathematical functions for C, C++, and Fortran-language applications that are tuned for specific IBM POWER™ architectures. You can learn more about these libraries at the following Web address:

http://www.ibm.com/software/awdtools/mass/support/

Both scalar (libmass.a) and vector (libmassv.a) intrinsic routines are tuned for the Blue Gene/L computer. In many situations, using these libraries has been shown to result in significant code performance improvement.

Such routines as sin, cos, exp, log, and so forth from these libraries are significantly faster than the standard routines from GNU libm.a. For example, a sqrt() call costs about 106 cycles with libm.a, about 46 cycles for libmass.a, and 8 to 10 cycles per evaluation for a vector of sqrt() calls in libmassv.a.
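As a worked example of what these cycle counts imply, the short Python sketch below computes the speedup of each tuned routine relative to libm.a. It uses only the figures quoted above; the variable and function names are illustrative, and real speedups depend on the surrounding code.

```python
# Cycle counts per sqrt() evaluation, as quoted in the text above.
LIBM_CYCLES = 106
LIBMASS_CYCLES = 46
LIBMASSV_CYCLES = (8, 10)  # lower and upper bound per evaluation


def speedup(baseline_cycles, tuned_cycles):
    """Return how many times faster the tuned routine is than the baseline."""
    return baseline_cycles / tuned_cycles


scalar_speedup = speedup(LIBM_CYCLES, LIBMASS_CYCLES)
vector_speedup_low = speedup(LIBM_CYCLES, LIBMASSV_CYCLES[1])
vector_speedup_high = speedup(LIBM_CYCLES, LIBMASSV_CYCLES[0])

print(f"libmass.a vs libm.a:  {scalar_speedup:.2f}x")
print(f"libmassv.a vs libm.a: {vector_speedup_low:.2f}x to {vector_speedup_high:.2f}x")
```

By these figures, the scalar library is roughly 2.3 times faster than libm.a for sqrt(), and the vector library is between 10.6 and 13.25 times faster per evaluation.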

To link with libmass.a, include the following option on the link line: -Wl,--allow-multiple-definition.

Chapter 2. Comparison of performance tools

Two primary tools are provided to gather application-level performance data about Blue Gene/L applications: the External Performance Instrumentation Facility (EPIF, also known as Perfmon) and Performance Application Programming Interface (PAPI). In this chapter, we discuss the functions of each of these applications to provide you with information that enables you to decide which tool is best suited to help in which situations.


2.1 External Performance Instrumentation Facility

Perfmon is an IBM performance tool that is designed specifically for Blue Gene/L. As indicated by the word “external” in the title External Performance Instrumentation Facility, Perfmon runs separately from the actual application. No changes to source code are required to run Perfmon; it operates completely externally from the application being measured. This affords Perfmon some advantages:

- Because it runs externally, it has significantly less impact on the running application.

- System administrators can measure aspects of application performance without requesting changes to (and subsequent recompilation of) source code.

- Viewing performance results is made easier by the viewer and data export utilities. In addition, data from Compute Nodes can be automatically aggregated, which takes this responsibility away from the application or the person who is analyzing the results.

These advantages come with some drawbacks:

- Because Perfmon is specific to Blue Gene/L, its functionality is not portable to other supercomputing platforms.

- Perfmon gathers performance data at the application level. It is not possible to get more granular data (for example, specific to a certain block of code).

2.2 Performance Application Programming Interface

PAPI is an industry-standard performance application programming interface (API) that is designed to allow support on a broad range of computing devices. It is supported by a broad range of organizations in both academia and industry. The main home page for PAPI is at the following Web address:

http://icl.cs.utk.edu/papi/

Some of the advantages of PAPI include:

- Because it is based on an industry standard, its functionality is portable to other supercomputing platforms.

- PAPI allows performance data to be gathered down to a detailed level. For example, a single line of code can be monitored if necessary.

PAPI comes with the following disadvantages:

- PAPI has a more significant impact on the application that is being monitored, which can potentially change the characteristics of the application itself.

- Changing what is to be measured almost always requires modifications to the source code, with subsequent recompilation.

- Data must be aggregated, parsed, viewed, and so on with outside tooling. It is up to the application itself to write the gathered data in the format that it sees fit.
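As a rough illustration of the kind of outside tooling that the last point refers to, the following Python sketch sums counter values across ranks and renders them as CSV. The counter names (fp_ops, l3_miss), the values, and the per-rank dictionary layout are hypothetical; PAPI itself prescribes no output format, so a real application would have to define its own.

```python
import csv
import io


def aggregate_by_counter(per_rank):
    """Sum each counter across ranks. per_rank: {rank: {counter: value}}."""
    totals = {}
    for counters in per_rank.values():
        for name, value in counters.items():
            totals[name] = totals.get(name, 0) + value
    return totals


def to_csv(per_rank):
    """Render per-rank values plus a final 'total' row as CSV text."""
    names = sorted({name for counters in per_rank.values() for name in counters})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["rank"] + names)
    for rank in sorted(per_rank):
        writer.writerow([rank] + [per_rank[rank].get(name, 0) for name in names])
    totals = aggregate_by_counter(per_rank)
    writer.writerow(["total"] + [totals[name] for name in names])
    return out.getvalue()


# Made-up values that an instrumented application might have written per rank.
per_rank_values = {
    0: {"fp_ops": 1200, "l3_miss": 30},
    1: {"fp_ops": 1100, "l3_miss": 45},
}
csv_text = to_csv(per_rank_values)
print(csv_text, end="")
```

With Perfmon, the equivalent aggregation and CSV conversion are handled by the supplied tooling rather than by code that the application owner writes.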


2.3 Summary comparison of Perfmon and PAPI

Table 2-1 summarizes and compares some of the similarities and differences of function between Perfmon and PAPI.

Table 2-1 Differences of function between Perfmon and PAPI

| Function | Perfmon | PAPI |
| --- | --- | --- |
| Data aggregation | Data from all Compute Nodes used by the job is aggregated automatically. | The application must aggregate data from Compute Nodes (if desired). |
| Output of results | Results are automatically written to a file system or relational database. Data is collected via the JTAG network and written to the data store with sockets (via the Midplane Management Control System (MMCS)). Tooling is provided to convert to CSV format. | The application must write output through the I/O Node to the file system. The format of the output is application specific. |
| Viewing results | A simple GUI facility is provided to display detailed or aggregate results and to extract the desired results directly to CSV files. The GUI works on both data that has already been collected and data that is currently being collected for running applications. The extract processing to CSV files allows for the normal query functions of select and project. However, no built-in subquery processing is supported through the extract tool. The extract tool can output the detailed data per node rank or aggregate data derived from the detailed data. | Since the format of the output is specific to the application that created it, no common method is available to view, parse, or format the results. |
| Hooks required | No hooks are required in the code; the only requirement is that a static library is linked in with the user program. | Explicit hooks are required in the source code. |
| Portability | This is for Blue Gene only. | It is portable across numerous supercomputing platforms. |
| Granularity | Data is available both for specific nodes (ranks) and as aggregate data for all nodes. However, this data encompasses the entire running application; it is not possible with Perfmon to scope the collection to a specific code block. | Data gathered can be specific to a given node (rank), block of code, and so on. This makes it possible to scope the performance collection data to a specific area (or rank) of the application to better pinpoint the problem. |
| Impact on watched application | Since data is gathered from an outside utility, there is little or no impact on the performance characteristics of the application being measured. The fact that the performance data gathered flows across the JTAG network also limits its impact on the application. | PAPI, since it involves modifications to the actual application source code, can more significantly change the runtime characteristics of the running application. In addition, the fact that performance data is written to the file system via the I/O Nodes can also impact the application. |
| Who uses/controls | Perfmon is started, stopped, and configured by the system administrator, as it executes on the Service Node. | PAPI is controlled by the programmer. |
| Configurability | Many configuration options can be specified upon startup of the Perfmon server. In addition, | Since PAPI calls are embedded in the source code, little external configuration is possible, |

Chapter 3. External Performance Instrumentation Facility

In this chapter, we take a detailed look at the External Performance Instrumentation Facility (EPIF), also known as “Perfmon.” We use both names interchangeably in this chapter.


3.1 Overview of EPIF and Perfmon

Traditional approaches to collecting performance information for an application require that the application is instrumented by modifying the source code. Because the source code is altered, the instrumentation of the code affects the running of the application and potentially yields performance data that is not representative of the application when it is not instrumented.

As Blue Gene/L positions itself to appeal to a broader range of applications, it is fundamental that performance data can be obtained in a manner that requires no application source code changes and does not measurably alter the runtime performance of those applications. A set of tools provides the foundation for the EPIF for Blue Gene/L.

The previous version of EPIF (the Perl version, which was called “perfmon.pl”) was not removed from the V1R3 distribution. To use the previous version, applications must specify the environment variable BGL_PERFMON=1000. This variable identifies the previous default set of counters that are collected by perfmon.pl. Most, if not all, users should use the new Python version (perfmon.py), because it is the strategic direction of the tool set. However, perfmon.pl is still distributed for cases where a customer might still want to use it. In addition, startperfmon is a script that still references the previous (Perl) version of EPIF. It can still be used to start the old perfmon.pl, but it is not intended to be used for the new (Python) version of EPIF. There are several reasons for this, but most are due to the fact that EPIF is now designed to be started and tailored for a specific partition or set of similar sized partitions, specifying different sample intervals, sample types, and so on. Multiple instances of the new version of EPIF are possible and likely, while starting only a single server instance is not likely. In addition, each potential instance of EPIF is installation dependent. The startperfmon script is still included in the V1R3 distribution, but we recommend that you do not use it.

3.1.1 Objectives

The purpose of the EPIF is to provide the necessary support for the monitoring, collecting, and recording of performance information from Blue Gene/L Compute Nodes. The key objectives for this facility are listed here:

- This facility must be administered externally and be enabled in a way that allows all running applications to participate without requiring application source code changes.

- This facility must provide its function without measurably impacting the performance of any application running on Blue Gene/L.

򐂰 This facility must be able to associate the performance data collected from a given set of Compute Nodes to a specific instance of an application running in a partition that houses those Compute Nodes.

Restriction: EPIF is dependent upon the use of the interval timer and establishes a SIGALRM handler. This can cause conflicts for some applications.

Important: The previous (Perl) version of EPIF, in addition to the startperfmon script, will most likely be removed in a future release.


3.1.2 Goals and strategies

The EPIF is not intended to be an all-inclusive, comprehensive set of performance monitoring tools for extensive analysis and presentation of performance-related data for Blue Gene/L. Rather, this facility provides:

- The means to collect performance data based on hardware counters without requiring modification to any application source code; at runtime, the application is unaware of the monitoring function

- An interactive interface to see results quickly and easily, with some fundamental distillation of data

- An interface to export or extract the performance data in different formats so that other tools not provided here can be used to do further analysis of that data

There are many aspects to performance monitoring other than hardware counters, but this set of tools deals with hardware counters exclusively. Other performance tool needs exist for Blue Gene/L and may or may not be incorporated into this set.

Currently, six commands compose the set of tools for the EPIF:

- perfmon: Allows for the collection of performance data based upon hardware counters for jobs running on Blue Gene/L.

- dsp_perfmon: Provides a simple graphical user interface (GUI) to view performance data and do some high-level distillation of that data. This tool works on data collected from jobs that were previously monitored with EPIF or on data that is actively collected for currently running jobs that are monitored with EPIF.

- exp_perfmon_data: Allows already stored EPIF data to be exported from the performance database tables of the Midplane Management Control System (MMCS).

- imp_perfmon_data: Allows EPIF data to be imported into the performance database tables of MMCS.

- ext_perfmon_data: Allows for performance data to be extracted into flat files for further analysis by tools that are not provided with this tool set.

- end_perfmon: Allows an instance of EPIF to be ended cleanly before the ending criteria that were specified on the perfmon command are met.

3.2 Basic concepts

EPIF is a Blue Gene/L performance tool that is based on the use of hardware counters to monitor particular runtime metrics. The number of floating point addition or floating point multiplication operations performed, L3 cache misses, and the number of XP packets sent from a Compute Node are examples of performance metrics that can be monitored using hardware counters.

The machine does not collect hardware counter information by default. You must first compile your application to link in this performance monitoring capability. When an application is compiled for Blue Gene/L, it can be linked with the performance counter library by adding -lbgl_perfctr.rts to the link step. This action is all that is necessary to allow EPIF to collect performance data for that application. No source code changes are required.


counter values, and any processing of that data, are done outside of the user’s Blue Gene/L job.

Counters are collected during sample intervals. Sample intervals are user-defined, fixed amounts of time. At the beginning of a sample interval, the machine records the counter values. The counters collected during a sample interval are referred to as a sample. Before the first sample is taken, the counters are initially collected and recorded. This initial collection of counters is referred to as the starting counter values. The starting counter values do not have to start at zero. All counter values for future samples are given as a delta value from these starting counter values.
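The delta relationship between samples and starting counter values can be sketched in a few lines of Python. The counter names and raw values below are hypothetical, and the sketch ignores practical concerns such as counter wraparound; it only shows that each sample is reported relative to the starting values rather than to zero or to the previous sample.

```python
def delta_from_start(starting, current):
    """Return each counter as a delta from its starting value."""
    return {name: current[name] - starting[name] for name in starting}


# Hypothetical counter names and raw values; the starting values are not zero.
starting_values = {"fp_add": 1_000, "l3_miss": 250}
raw_after_interval_1 = {"fp_add": 1_600, "l3_miss": 275}
raw_after_interval_2 = {"fp_add": 2_500, "l3_miss": 310}

sample_1 = delta_from_start(starting_values, raw_after_interval_1)
sample_2 = delta_from_start(starting_values, raw_after_interval_2)
print(sample_1)
print(sample_2)
```

Because both samples are measured from the same starting values, the activity within a single interval can be obtained by subtracting one sample from the next.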

When EPIF is started, a sample interval is specified. Sample intervals for all jobs occur at the same time. For example, if two jobs, A and B, are being monitored, sample 4 for Job A and sample 4 for Job B occur in the same window of time. Defining sample intervals in this manner allows for some level of performance analysis to occur across the various blocks within a Blue Gene/L system within a given time interval.

See Figure 3-1 for a more complete set of examples of job-start and job-termination scenarios over a series of collection intervals.

Figure 3-1 Examples of data collection for various job and interval conditions

By default, EPIF monitors all Blue Gene/L jobs. However, EPIF can be started with a filter so that only jobs with particular attributes are monitored. As Blue Gene/L jobs end, monitoring for those jobs automatically ends. As new Blue Gene/L jobs are initiated and if those jobs meet any of the specified filtering criteria, monitoring for each of those new jobs automatically starts. If those new jobs are running applications that have been compiled to update hardware counters, those counters are collected and recorded by EPIF.

Given the way that sample intervals are defined, when EPIF is first started, all of the jobs running at that time have their first sample recorded during the sample interval. The starting counter values are collected during sample interval 0 and the first sample taken is sample 1. All Blue Gene/L jobs that start at a later time will have their first sample recorded during a future sample interval, and that first sample will not be sample 1, but rather the number for the sample interval. For example, the first sample taken for Job C might be sample 35. The job started during sample interval 34 and the starting counters were recorded. The first sample is then recorded during sample interval 35.
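The sample numbering above can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes that sample intervals are numbered from 0 and aligned to the time that EPIF was started, and that the starting counters are recorded in the interval in which the job starts.

```python
def first_sample_number(job_start_offset_s, sample_interval_s):
    """Return the number of the first sample recorded for a job.

    job_start_offset_s is the number of seconds between the start of EPIF and
    the start of the job (0 for jobs already running when EPIF was started).
    """
    if job_start_offset_s < 0 or sample_interval_s <= 0:
        raise ValueError("offset must be >= 0 and interval must be > 0")
    # Interval in which the job starts and its starting counters are recorded.
    start_interval = int(job_start_offset_s // sample_interval_s)
    # The first sample is recorded during the following interval.
    return start_interval + 1


print(first_sample_number(0, 10))    # 1
print(first_sample_number(345, 10))  # 35
```

With a 10-second sample interval, a job that starts 345 seconds after EPIF falls in sample interval 34, so its first sample is sample 35, which matches the Job C example.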

When EPIF is started, a sample type is defined. The sample type can either be detailed or summary. Detailed samples record and save the counter values for each sample interval, for each monitored job. Summary samples only record and save the counters for the last sample interval for each monitored job. The processing performed at counter collection time is the same for both sample types. The only difference is whether new values are saved for a sample interval or whether the previously saved counter values for a given job are over-written with values from the latest sample interval. Summary samples save storage space, but you lose the ability to do time analysis of the performance data. Detailed samples allow you to do time analysis of the performance data by comparing the counter values as recorded and saved during each of the sample intervals for a running job. By default, summary samples are collected.
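The difference between the two sample types can be sketched as a small in-memory store. The class and its layout are hypothetical and say nothing about the file format that EPIF actually uses. The sketch only shows that detailed samples append while summary samples overwrite.

```python
class SampleStore:
    """Keep counter snapshots per job, either all of them or only the latest."""

    def __init__(self, sample_type="summary"):
        if sample_type not in ("summary", "detailed"):
            raise ValueError("sample_type must be 'summary' or 'detailed'")
        self.sample_type = sample_type
        self.saved = {}  # job_id -> list of (sample_number, counters)

    def record(self, job_id, sample_number, counters):
        entry = (sample_number, list(counters))
        if self.sample_type == "detailed":
            # Detailed: every snapshot is kept for later time-based analysis.
            self.saved.setdefault(job_id, []).append(entry)
        else:
            # Summary: the previous snapshot for the job is overwritten.
            self.saved[job_id] = [entry]


detailed = SampleStore("detailed")
summary = SampleStore()  # summary is the default, as in EPIF
for number, counters in [(1, [10, 2]), (2, [25, 4]), (3, [40, 9])]:
    detailed.record("job-A", number, counters)
    summary.record("job-A", number, counters)

print(len(detailed.saved["job-A"]))  # 3
print(summary.saved["job-A"])        # [(3, [40, 9])]
```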

EPIF can collect up to 52 performance counters for each Blue Gene/L job. Each job can map the 52 locations to different counter definitions so that different metrics can be recorded for each job. Specific counter definitions are stored in the MMCS performance database table BGLPERFDESC. Another table within that database, BGLPERFDEF, is used to logically group a set of counters together to be collected for a Blue Gene/L job. A set of performance counters is identified by a counter definition ID that is given by the DEFINITION_ID column. The COUNTER_ID column can be used to join to the same named column in the BGLPERFDESC table to determine which counter definitions are defined for a given counter definition ID.
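The join between the two tables can be sketched with an in-memory SQLite database standing in for the MMCS performance database. Only the table names and the DEFINITION_ID and COUNTER_ID columns come from the text. The NAME column, the row contents, and definition ID 2000 are hypothetical. See Appendix C for the real table specifications.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE BGLPERFDESC (COUNTER_ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE BGLPERFDEF (DEFINITION_ID INTEGER, COUNTER_ID INTEGER);
    -- Hypothetical rows for illustration only.
    INSERT INTO BGLPERFDESC VALUES (1, 'hypothetical_fpu_add'),
                                   (2, 'hypothetical_l3_miss'),
                                   (3, 'hypothetical_torus_packets');
    INSERT INTO BGLPERFDEF VALUES (1004, 1), (1004, 2), (2000, 3);
    """
)


def counters_for_definition(connection, definition_id):
    """List the counter definitions grouped under one counter definition ID."""
    query = (
        "SELECT def.COUNTER_ID, descr.NAME "
        "FROM BGLPERFDEF def "
        "JOIN BGLPERFDESC descr ON descr.COUNTER_ID = def.COUNTER_ID "
        "WHERE def.DEFINITION_ID = ? "
        "ORDER BY def.COUNTER_ID"
    )
    return connection.execute(query, (definition_id,)).fetchall()


print(counters_for_definition(conn, 1004))
```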

When a Blue Gene/L job is initiated, a specific counter definition ID is bound to that job. The ID is determined from the environment variable named BGL_PERFMON. If BGL_PERFMON is not defined when a job starts, the Blue Gene/L job uses a counter definition ID of 1004. Counter definition ID 1004 is the default set of counters for EPIF to collect. See Appendix C, “Perfmon database table specifications” on page 81, to view the list of all possible counter definitions and the list of supported counter definition IDs.
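The binding rule can be sketched as a lookup with a default. The function is hypothetical, and it assumes that BGL_PERFMON holds a plain integer string. The variable name and the default of 1004 are taken from the text.

```python
import os

DEFAULT_DEFINITION_ID = 1004  # default set of counters, per the text


def counter_definition_id(environ=None):
    """Return the counter definition ID a job would use, given its environment."""
    environ = os.environ if environ is None else environ
    value = environ.get("BGL_PERFMON")
    if value is None:
        return DEFAULT_DEFINITION_ID
    return int(value)  # assumption: the variable holds an integer ID


print(counter_definition_id({}))                       # 1004
print(counter_definition_id({"BGL_PERFMON": "1000"}))  # 1000
```

The value 1000 in the second call is only a placeholder. Use an ID listed in Appendix C.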

Storage is consumed within the external file system for the performance data collected by EPIF. The fixed storage cost required to save control information for a newly monitored job is approximately 72 KB per midplane. The storage required to save the 26 counters for counter definition ID 1004 is about 190 KB to 340 KB per midplane.

More than a single instance of EPIF can be run on the system at a time. Each instance is run with its own set of monitoring attributes. For example, one instance of EPIF can monitor jobs in “small” partitions and collect detailed samples with a sample interval of 15 seconds. At the same time, a second instance of EPIF can monitor jobs in “large” partitions and collect summary samples with a sample interval of one minute. Criteria that can be used to indicate


While it may not normally make sense to monitor the same Blue Gene/L job from more than one EPIF instance, it is possible to do so. The two collections do not interfere with one another. Each EPIF instance stores its collected data in a different set of files. The output from EPIF is always stored to the external file system to be later accessed by the other commands within the External Performance Instrumentation Facility suite. Each instance of EPIF always creates a new directory and stores all files that pertain to that collection in that directory. The location where that directory will be created can be specified when EPIF is started.

In addition to storing the performance data for an instance of EPIF to the external file system, the EPIF collection facility can also optionally store the data in the MMCS performance database. Data is written to the external file system synchronously as it is collected, but it is written to the MMCS performance database asynchronously. If the performance data is not written to the MMCS performance database as part of the EPIF processing, the data can always be imported into the database at a later time with the imp_perfmon_data command, which is also found within this tool suite.

See Figure 3-2 for an overview of the data collection and processing performed by EPIF.

Figure 3-2 Overview of EPIF data collection and processing

Note: While it is possible to monitor a given job using more than one instance of EPIF, all of those instances of EPIF use the same counter definition ID for that job. The counter definition ID that is used for a Blue Gene/L job is bound to the job, not to a running instance of EPIF.


3.3 EPIF commands

In this section, we provide more details about the EPIF commands that are used to gather and manipulate external performance data on Blue Gene/L.

3.3.1 perfmon

The perfmon command starts an instance of the performance monitor tool. Many options can be specified to control the collection of performance data based upon the hardware counters. The remainder of this section gives a brief overview of the processing performed by EPIF and how the various command line options can be used to control the running of EPIF. See 3.5.1, “Options for EPIF” on page 44, for additional information.

When EPIF starts, the first action performed is to determine the configuration for that instance of EPIF. Defaults are defined for all options, so no options are required for an invocation of EPIF. However, it is quite common to pass one or more options to customize the running of EPIF. Options are first taken from the command line, then from any honored environment variables, and finally from the defaults as established by the perfmon command.
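The precedence order (command line, then environment, then defaults) can be sketched as follows. The function and the dictionary layout are hypothetical and are not taken from perfmon.py. The two defaults shown (summary samples and a verbose level of 3) are the ones stated in this chapter.

```python
DEFAULTS = {"sample_type": "summary", "verbose": 3}


def resolve_options(cli, env, defaults=DEFAULTS):
    """Pick each option from the command line, then the environment, then defaults."""
    resolved = {}
    for name, default in defaults.items():
        if name in cli:
            resolved[name] = cli[name]
        elif name in env:
            resolved[name] = env[name]
        else:
            resolved[name] = default
    return resolved


config = resolve_options(
    cli={"sample_type": "detailed"},
    env={"sample_type": "summary", "verbose": 5},
)
print(config)  # {'sample_type': 'detailed', 'verbose': 5}
```

The command line value for sample_type wins over the environment value, and verbose falls through to the environment because it was not given on the command line.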

EPIF first validates the combination of options to be used for this instance of the command. If there are any conflicts or errors, warning or error messages are issued to the console file. Serious problems result in error messages that are sent to the console and to stderr. Error messages that are sent due to an incorrect configuration end the running of EPIF before any processing related to the collection of performance counters begins.

All pertinent information regarding the processing being performed by EPIF is sent to the console file, or simply the console. The output that is sent to the console can be redirected using standard Linux command line redirection or with the --console option. The console option has a default path of /bgl/BlueLight/logs/Blue Gene/L system-name/EPIF. Various levels of filtering are available for the information sent to the console file. The filtering of information is set using the --verbose option, with the default value being 3. A verbose level of 3 echoes back the final configuration, messages for each job when monitoring is started and ended for it, and a summary after each sample interval. If desired, higher levels of verbosity give additional detailed flows from EPIF to its administrative threads. Error messages are always sent regardless of the verbosity level.

Many aspects of the configuration control how performance counters are to be collected. One aspect is which Blue Gene/L jobs will be monitored. Five options can be specified to control which jobs will be monitored by EPIF. A Blue Gene/L job is only monitored if it meets all of the criteria established by all five options.

The --block_id and --username options can be used to specify that only jobs running in a particular block or running under a particular user are to be monitored. Individual values or lists of values can be specified for each of these options. Regular expressions are supported

Note: In the documentation that follows, only the command line options are called out. Many of the command line options have corresponding environment variables. See the EPIF help text for more information about the specific environment variables that are honored by EPIF. The online help text can be accessed by using the following command: perfmon.py -help


options are in midplanes, with 0 as the smallest min_block_size value allowed and the a as the smallest max_block_size value allowed.

In addition to those four options, the --sql option can be used to provide any additional predicates to the SQL statement that is used to query the MMCS database for active Blue Gene/L jobs. The value specified on the --sql option must start with a logical operator and is appended to the predicates that are already being used by the perfmon command processing to find Blue Gene/L jobs. Reference the attributes in the BGLJOB table to see the additional criteria that can be used to control the jobs that are monitored by an instance of EPIF. If none of the five options are specified, the default is to monitor all jobs running in all partitions.

Another aspect of the configuration is how often EPIF will collect the performance counters for the Blue Gene/L jobs being monitored. This is called the sample interval and is specified with the --sample_interval option. The default sample interval is determined by the size of the largest potential block that can have a job monitored by this instance of EPIF. Basically, three seconds are allowed for every midplane in the largest possible block, rounded up to a 10-second boundary. Therefore, the smallest possible sample interval is 10 seconds. The maximum allowed sample interval value is 1 hour.
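The default sample interval rule can be written out as a short calculation. The function name is hypothetical. The three seconds per midplane and the rounding to a 10-second boundary come from the text.

```python
def default_sample_interval(max_midplanes):
    """Default sample interval in seconds for the largest block that can be monitored."""
    if max_midplanes < 1:
        raise ValueError("a block spans at least one midplane")
    raw_seconds = 3 * max_midplanes           # three seconds per midplane
    return (raw_seconds + 9) // 10 * 10       # round up to a 10-second boundary


for midplanes in (1, 4, 10, 64):
    print(midplanes, default_sample_interval(midplanes))
# 1 10
# 4 20
# 10 30
# 64 200
```

A single midplane needs only 3 seconds, which rounds up to 10 seconds, the smallest possible sample interval.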

When a Blue Gene/L job is found by EPIF that is to be monitored, a single administrative thread is first spawned to collect additional information regarding that job. Various attributes of the job, the block in which the job is running, and the information related to the Blue Gene/L personality for the job are collected and recorded. The final piece of information that this administrative thread collects is the counter definition ID that the job is using. If the performance counter library was not linked into the application for the job, then there is no counter definition ID for the job. If the counter definition ID cannot be determined, EPIF stops monitoring for that job, without collecting any samples. Monitoring for other jobs is not affected.

If the counter definition ID can be determined after the application starts to run, EPIF spawns additional administrative threads to help in the collection of the counter data during each ensuing sample interval. The first set of counter data collected is known as the starting counter values. Counter values for all ensuing sample intervals are given as delta values from these starting counter values. After the starting set of counter values is collected for a monitored job, EPIF collects the performance counters for that job at the beginning of each sample interval until the job ends or the monitor is ended. A snapshot of the counters is collected during each sample interval regardless of whether detailed samples or a summary sample is being collected. Whether detailed samples or a summary sample is collected is another configuration option.

Detailed samples refers to the concept that all of the snapshots of counter data are saved for future analysis. A summary sample means only the last of the snapshots of counter values is saved. Detailed samples consume more disk space, but allow for more detailed, time-based analysis. Summary samples consume less disk space, but do not allow for any analysis of the intermediate samples. The sample type to be collected by an instance of EPIF is defined by the --sample_type option, and the default is to collect summary samples.

After a configuration is validated by early EPIF processing, if indicated, EPIF delays the start of the collection process. The default is to start the collection process immediately, but the --start_delay option can be used to delay the collection for a period of time or the --start_time option can be used to indicate that the collection process is to start at a specific date or time.

Similarly, three mutually exclusive options dictate how long EPIF will continue to monitor jobs. The --run_time option allows for an amount of time to be specified. After that amount of time has expired, the monitoring process ends. The --samples option allows for a specific number of samples to be collected before the monitoring process is ended. Finally, the --end_time option allows for a specific date and time to be specified, at which time monitoring ends. The time used is based on the clock of the Service Node.
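One way to model three mutually exclusive options in Python is an argparse mutually exclusive group. The sketch below is not the parser used by perfmon.py. The program name, the value types, and the help text are assumptions, and only the three option names come from the text.

```python
import argparse


def build_parser():
    """Build a parser in which at most one ending condition can be given."""
    parser = argparse.ArgumentParser(prog="perfmon-sketch")
    ending = parser.add_mutually_exclusive_group()
    ending.add_argument("--run_time", type=int, help="amount of time to monitor (unit assumed)")
    ending.add_argument("--samples", type=int, help="number of samples to collect")
    ending.add_argument("--end_time", type=str, help="date and time to end (format assumed)")
    return parser


args = build_parser().parse_args(["--samples", "10"])
print(args.samples, args.run_time, args.end_time)  # 10 None None
```

Passing two of the options together, for example --samples 10 --run_time 60, makes argparse report an error and exit, which mirrors the mutually exclusive behavior described above.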

If it is determined that an instance of EPIF should be ended prior to the time specified on the perfmon command, the end_perfmon command can be issued to end the monitor normally. Using the end_perfmon command ends the EPIF processing in a controlled fashion. The Linux kill command can also be used to end the EPIF process. Any previously-collected data for that instance of the monitor is still preserved. However, any data for the sample interval in progress is lost.

After the general monitoring process starts, existing Blue Gene/L running jobs are inspected to see if they should be included in the initial set of jobs to monitor. If one or more jobs are found, the initial information is collected from each of those jobs, a set of threads is spawned for each of those jobs, and the starting counters are then collected from each of those jobs. Each subsequent sample interval takes a snapshot of the counters until either the job ends or the monitoring process ends.

As EPIF continues to run, new jobs are also sought out that meet the monitoring criteria. Two methods are used to find new jobs. At the beginning of each sample interval, new jobs are automatically found by querying the BGLJOB table. If any new jobs are found that meet the monitoring criteria, then those jobs are added to the set of jobs that are monitored, initial information about each of those newly found jobs is collected, and a set of threads is spawned for each of those new jobs. A starting set of counter values is collected and a snapshot of the counter values is taken on every subsequent sample interval until the job ends or the monitoring process is ended.

In addition to finding new jobs at the beginning of each sample interval, optionally a user can indicate to have a daemon started by the EPIF process that does nothing but poll the BGLJOB table for new jobs. This polling is done based upon a time interval as dictated by the --poll_new_jobs option. By default, this option is set to three seconds, so that EPIF polls for new jobs every three seconds. If a new job is found by this daemon, the main EPIF process is sent a message and the previously-described processing starts for that newly-found job. If enough time exists within the current sample interval, the starting counters are immediately collected for that newly found job. Otherwise, the starting counters for the newly found job are collected at the beginning of the next sample interval. A --poll_new_jobs option value of “0” indicates that a daemon is not to be used to find new jobs.

All of the information collected by EPIF and its associated threads is saved to the file system. All of the data for a given instance of EPIF is stored in a single directory. The user has control over the location of this directory. The --monitor_path option indicates the path to use when EPIF creates the directory to store all of the data for an instance of EPIF. If the path does not exist, directories for the path are created by EPIF. The user has no control over the name of the directory used to house the data files for a given instance of EPIF, which is created under the directory specified with the --monitor_path option. The name of the directory has a fixed name portion and a timestamp portion to clearly identify the Service Node name and time that this instance of EPIF was started. The default monitor path value is /bgl/BlueLight/EPIF/. At the end of every sample interval, all of the necessary information that is collected by EPIF is always written to the file system. However, it is also possible to have EPIF asynchronously


At the beginning of every sample interval, if the previous sample interval generated data that must be imported into the database, a request is made to the daemon, and the results are asynchronously imported into the database by the daemon. This import facility can lag behind the collection of performance data that is being written to the file system. However, if there are outstanding import requests when the monitoring process is to end, the ending of EPIF is delayed until all of the import requests have completed. The values for the --import_to_database option are “True” and “False,” with the default being “False.”

If for any reason a set of counter values cannot be collected for one or more Compute Nodes for a given sample interval, no additional attempts are made during subsequent sample intervals to collect those same counter values. Monitoring continues for that job, but does not contain the counter values for those affected nodes. Appropriate messages are sent to the console for this situation. If for any reason one or more of the administrative threads ends abnormally, the remaining administrative threads for that monitored job are ended and monitoring for that job is ended. Again, appropriate messages are sent to the console. In either case, monitoring for other Blue Gene/L jobs is not affected.

Advanced options for the perfmon command

Three additional advanced options can control other timing aspects regarding how the counters are collected. It is possible that certain system environmental factors can affect the processing to be performed during a particular sample interval. Examples can be a workload spike or excessive network traffic. It might be the case that all of the necessary work that must be performed for a particular sample interval cannot be completed within the time allowed by the sample interval. These three options can be specified to affect how EPIF performs given these “timeout” situations.

The first of these options is --sample_interval_extensions, which indicates the number of additional “units of sample interval time” that will be allowed for a given sample interval without any response from the administrative threads. EPIF uses multiple threads to monitor the various Blue Gene/L jobs. For example, if the sample interval is 30 seconds and --sample_interval_extensions is set to 3, a total of 2 minutes is allowed for any of the administrative threads to respond to any of the outstanding requests. If none of the threads respond within that total amount of time, EPIF ends.
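The arithmetic in this example can be checked with a one-line calculation. The function name is hypothetical. The rule that each extension adds one more unit of sample interval time comes from the text.

```python
def total_response_window(sample_interval_s, extensions):
    """Total seconds allowed for the administrative threads to respond."""
    if sample_interval_s <= 0 or extensions < 0:
        raise ValueError("interval must be > 0 and extensions must be >= 0")
    # The original interval plus one additional interval per extension.
    return sample_interval_s * (1 + extensions)


print(total_response_window(30, 3))  # 120 seconds, that is, 2 minutes
```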

The options --thread_timeout and --thread_retries can be specified to affect the timeout and retry tolerances for these threads. It is usually not necessary to specify values for any of these three advanced options because the EPIF program chooses appropriate values given the runtime environment. But, if desired, the user can control these timing and timeout values.

Additional advanced options can be specified to help define the threads that are spawned to monitor the various Blue Gene/L jobs. Each time a Blue Gene/L job is found that is to be monitored by EPIF, EPIF spawns a “set of threads” to collect the performance counters for that particular job. Options can be specified to control how these threads are managed by EPIF. It is not necessary to specify values for any of the four advanced options described here because the EPIF program chooses appropriate values based upon the runtime environment, but these values are highlighted to show that it is possible to control certain aspects of the EPIF administrative threads.

The --max_threads_per_system option can be specified to set an upper limit to the total number of administrative threads spawned across the entire Blue Gene/L system used to collect performance counters. If not specified, the value 256 is used as the default upper limit.

Note: The data for a given instance of EPIF can always be imported into the MMCS performance database at a later time. The imp_perfmon_data command allows for such an import to be initiated by the user.


Two additional mutually exclusive options are available to control the number of threads that are spawned for a given monitored job.

The --max_threads_per_job option gives an upper limit to the number of threads that are spawned for a monitored job and the --nodes_per_thread option gives the number of Compute Nodes that are monitored by a single administrative thread. By default, the EPIF program chooses the number of threads spawned for a given Blue Gene/L job based upon the number of Compute Nodes to be monitored in the partition. In general, the larger the number of Compute Nodes is, the larger the number of spawned threads is. Each time a collection is to be performed by an administrative thread, it opens a socket, issues JTAG requests, and then closes the socket.

The --max_concurrent_threads option controls the number of threads, across the entire Blue Gene/L system, that can concurrently have an open socket issuing EPIF-related JTAG requests. In general, the default number of concurrent threads is 16.
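A concurrency cap of this kind is commonly built with a counting semaphore. The sketch below is a generic illustration and not the EPIF implementation: the function is hypothetical, and a short sleep stands in for opening a socket, issuing JTAG requests, and closing the socket.

```python
import threading
import time


def run_collections(num_threads, max_concurrent):
    """Run num_threads collections, at most max_concurrent at a time; return the peak."""
    gate = threading.BoundedSemaphore(max_concurrent)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def collect():
        with gate:  # blocks while max_concurrent collections are in progress
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.01)  # stand-in for open socket, JTAG requests, close socket
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=collect) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["peak"]


peak = run_collections(num_threads=40, max_concurrent=16)
print(peak <= 16)  # True
```

All 40 threads exist at once, but no more than 16 are ever inside the guarded section, which is the distinction between the number of threads spawned and the number allowed to collect concurrently.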

3.3.2 dsp_perfmon

The dsp_perfmon command displays the results of the data saved to the file system by EPIF. This command works for all EPIF instances, including those that are still actively collecting data. The “raw” counter data can be viewed, but many other items related to the data can also be viewed or derived. In this section, we briefly describe the data that is available from using the dsp_perfmon command.

While the performance monitor data must be collected on the Service Node, the resulting data can be replicated to any platform that supports Python and analyzed there. The application windows shown in this document are from performance monitor files that were replicated to a mobile computer running Microsoft Windows XP, where the ensuing analysis was performed. The analysis could just as easily have been done on a Front End Node running Linux. If you are viewing live data as it is being collected on the Service Node, you normally do that analysis on a Front End Node, unless you are replicating the performance monitor data in real time.

The dsp_perfmon command opens a window to view the data that is collected, or currently being collected, by EPIF. Depending on the platform, it is invoked using a command similar to the following example:

python dsp_perfmon.py

The perfmon command creates a main control file with a .mon extension. When opening a performance monitor file, this main control file should be opened. You can open this file by using the dsp_perfmon command, and selecting File → Open from the menu bar. See Figure 3-3.


Figure 3-3 Opening the main perfmon file

After you open the performance monitor file, the Display Performance Monitor Data window changes to display the files as shown in Figure 3-4.


Notice that the name of the main performance monitor file embeds the Service Node name followed by the database name at the beginning. A time stamp can also be found in the name. This time stamp is in the format of:

yyyy-mmdd-hhmmss_xx

Here xx is the fractional part of the seconds portion of the time stamp. This time stamp represents when the perfmon command was issued on the Service Node.
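A time stamp in this format can be pulled out of a file name with a regular expression. The sketch below rests on assumptions: the file name is invented, the separators around the time stamp are not specified in the text, and xx is interpreted as the digits that follow the decimal point of the seconds.

```python
import re
from datetime import datetime

# yyyy-mmdd-hhmmss_xx, where xx is taken to be the fractional seconds digits.
TIMESTAMP = re.compile(r"(\d{4})-(\d{2})(\d{2})-(\d{2})(\d{2})(\d{2})_(\d+)")


def parse_monitor_timestamp(file_name):
    """Return the time stamp embedded in a performance monitor file name."""
    match = TIMESTAMP.search(file_name)
    if match is None:
        raise ValueError("no yyyy-mmdd-hhmmss_xx time stamp in %r" % file_name)
    year, month, day, hour, minute, second = (int(part) for part in match.groups()[:6])
    fraction = match.group(7)
    microsecond = int(fraction.ljust(6, "0")[:6])  # "47" -> 470000 microseconds
    return datetime(year, month, day, hour, minute, second, microsecond)


# Hypothetical file name: Service Node name, database name, then the time stamp.
print(parse_monitor_timestamp("sn01_bgdb0_2006-0315-142233_47.mon"))
# 2006-03-15 14:22:33.470000
```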

After you open the main performance monitor file, two or more items are displayed that can be expanded. The first item, when expanded, shows attributes related to the display environment. The second item gives detailed information about the EPIF runtime environment. Each item after those first two is a one-line entry for a monitored Blue Gene/L job. Expanding any of those entries gives information particular to that monitored job.

Only one performance monitor file can be opened at a time within a given dsp_perfmon display window, but you can have more than one instance of the command running to view and compare results between two or more performance monitor files.

The main performance monitor file is essentially a file with control information. Other files, with different file extensions within the same directory as the main performance monitor file, contain detailed information for each of the monitored Blue Gene/L jobs and the counters that have been collected for each of those jobs. All of the information necessary to do analysis using the dsp_perfmon command is contained within the files in that directory. Any necessary information from the MMCS database has been extracted into these files, and the MMCS database is not required during any of the analysis. This allows the dsp_perfmon analysis to be done on a different machine or platform.

The files that contain information for each of the monitored Blue Gene/L jobs and the files that contain the counter information can be numerous and contain a lot of information. To prevent you from being overwhelmed by the amount of data, you can filter the results that are displayed to help you focus on certain data elements. The jobs to be displayed can be filtered by block ID, job ID, and user ID. An option is also provided that only shows Blue Gene/L jobs with non-zero counter values, such as those jobs that run applications that were compiled to update the hardware performance counters. All of the filtering criteria work together and can be specified in any combination. You can find all of these filtering options under the Filter menu item (Figure 3-5).
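The way the filters combine can be sketched as a conjunction of optional criteria. The function and the job records are hypothetical and do not reflect how dsp_perfmon stores its data. The sketch only shows that a job is displayed when it passes every filter that is set.

```python
def filter_jobs(jobs, block_ids=None, job_ids=None, user_ids=None, nonzero_only=False):
    """Return the jobs that satisfy every filter that has been specified."""
    selected = []
    for job in jobs:
        if block_ids and job["block_id"] not in block_ids:
            continue
        if job_ids and job["job_id"] not in job_ids:
            continue
        if user_ids and job["user_id"] not in user_ids:
            continue
        if nonzero_only and not any(job["counters"]):
            continue
        selected.append(job)
    return selected


# Hypothetical job records.
jobs = [
    {"job_id": 101, "block_id": "R000", "user_id": "alice", "counters": [12, 0, 7]},
    {"job_id": 102, "block_id": "R000", "user_id": "bob", "counters": [0, 0, 0]},
    {"job_id": 103, "block_id": "R001", "user_id": "alice", "counters": [3, 9, 1]},
]

shown = filter_jobs(jobs, block_ids={"R000"}, nonzero_only=True)
print([job["job_id"] for job in shown])  # [101]
```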


into lines of detailed data. Sample data and node data are grouped together. You can also drill down easily with a few mouse clicks to specific detail from a given sample for a given node.

Figure 3-6 Format menu items

Consistent data is always displayed. It does not matter if data is being displayed from a non-active performance monitor file or a monitor file that is actively collecting performance data. All of the data that is displayed, extracted, or drilled down into can be thought of as being taken from a snapshot of the performance monitor data at the end of a sample interval. The only time that new data is introduced is when a new monitor file is opened or the Window → Refresh option is selected from the menu (Figure 3-7).

Figure 3-7 Window menu items

Data descriptions using the dsp_perfmon command

The following sections further describe the data that can be displayed and derived using the dsp_perfmon command.

Display Performance Monitor Attributes

If you expand this item, you see the following information:

▪ Filters in Effect (see Figure 3-8)

– Allows only jobs that have block IDs matching those in the block ID filter list to be displayed

– Allows only jobs that have job IDs matching those in the job ID filter list to be displayed

– Allows only jobs that have user IDs matching those in the user ID filter list to be displayed

– Shows only those jobs with non-zero counter values; provides a method to filter out all Blue Gene/L jobs that are running applications not compiled to update hardware counters


Figure 3-8 Display performance monitor attributes: Filters in Effect

EPIF Runtime Capture Attributes

If you expand this item, you see the following information:

▪ All basic runtime data relevant to the collection, including start/stop time data

▪ Administrative thread information

These threads are spawned by EPIF to perform the collection of the counter data.

Note: When switching between different performance monitor files, the filters are reset


▪ All configuration items used for the collection

Most of these items can be specified on the perfmon command line or with environment variables that EPIF recognizes. See 3.5.1, “Options for EPIF” on page 44, for more information.

Figure 3-10 EPIF runtime capture attributes: Configuration

Job specific information

The rest of the first panel has one line for each Blue Gene/L job that was, or is currently being, monitored. The lines give information to help identify each monitored job. Figure 3-11 lists the information that is available under each of the monitored jobs:

▪ Number of Completed Samples

This line gives the number of samples that have successfully been completed for the job.

▪ Counters Collected During Original Sampling

This line gives the sample number of the starting, first, and last samples collected.

▪ Alternate Starting Sample Number

This line gives the alternate starting sample number. If desired, you can right-click this line to change the starting sample number used for all counter calculations, including any histogram data.

▪ Counter ID Filter

If desired, you can right-click this line to filter the counters that are to be displayed. By default, all counters collected for the job are displayed.

▪ Number of Histogram Bins

If desired, you can right-click this line to set the number of bins to be used when building histograms from the data for this job. By default, 20 histogram bins are used.

▪ Calculate Stats/Histograms with Normalized Values

If desired, you can right-click this line to toggle whether descriptive statistics and histograms are calculated with raw counter values or values normalized to time. By default, normalized values are used.


Figure 3-11 Job specific information

▪ Job Attributes (Figure 3-12)

This line lists the attributes for the Blue Gene/L job.


▪ Block Attributes (Figure 3-13)

This line lists the attributes for the block used by the job.

Figure 3-13 Job specific information: Block Attributes

▪ EPIF Attributes for Job (Figure 3-14)

This line lists the specific EPIF runtime attributes for the job, including:

– Number of Administrative Threads

Gives the number of administrative threads used to monitor the job

– Counter Definition ID

Gives the counter definition ID used to monitor the job, and if expanded, all of the individual counter definitions for that counter definition ID

– EPIF File Names

Gives the file names used for the performance data collected for this job

– EPIF Start/Stop Times

Gives the time stamps when monitoring was started and ended, the starting and ending time stamps of the last sample, and the elapsed monitoring time for the job


Figure 3-14 EPIF attributes for job

Explore Data Using a Samples/Nodes/Counters Hierarchy

Drilling down allows you to see counter values for a given sample across all nodes for that sample. See Figure 3-15.

▪ Starting Counters

Shows the values for the starting counters

▪ Per Sample Data

– Starting time stamp of this sample for this job

– Ending time stamp of this sample for this job

– Descriptive Statistics per Counter Index

This line gives statistics pertaining to this sample, for all nodes and counters. These statistics are always calculated using the counter delta values, which are the differences between the counter values for the sample and the starting counter values.

• Basic (Min, Max, Range, Median, Mean, Mode, Cardinality, and so on); see Figure 3-15


Figure 3-15 Per sample data

• Histogram, Range=Number of Nodes, with counter values in increasing order


• Histogram, Range=Counter values, with counter values in increasing order (Figure 3-17)

Figure 3-17 Histogram: Range equals counter values

• Under each of the histogram bins, information is given about the nodes within that bin. See Figure 3-18.


– Per node data

• Individual counter values, with delta values calculated from the starting counter values (Figure 3-19)

The data for each individual counter includes the raw hex value, the delta value as calculated from the starting counter value, and the value normalized to a rate per second. The machine updates the hardware counters approximately every six seconds. Even though counter values are collected using multiple threads, the raw counter values collected for a given sample on different nodes can come from different “six-second machine intervals.” Normalizing to a rate per second makes it possible to compare counter values from different nodes within the same sample interval accurately.
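The relationship between the three per-counter values can be sketched in Python. This is an illustration under stated assumptions, not EPIF code. The function name `counter_views` and the idea that each node's reading carries its own elapsed time in seconds are hypothetical. The text states only that the delta is taken from the starting counter value and that the normalized value is a rate per second.

```python
def counter_views(raw_hex, start_hex, elapsed_seconds):
    """Return (raw, delta, rate) for one counter on one node.

    raw_hex / start_hex: counter readings as hex strings (assumed format).
    elapsed_seconds: time covered by this node's reading (assumed input).
    """
    raw = int(raw_hex, 16)
    start = int(start_hex, 16)
    delta = raw - start               # change since the starting counter value
    rate = delta / elapsed_seconds    # normalized to a rate per second
    return raw, delta, rate


# Two nodes whose readings fall in different six-second update intervals:
# the deltas differ, but the per-second rates are directly comparable.
_, delta_a, rate_a = counter_views("0x1770", "0x0", 60.0)  # delta 6000 over 60 s
_, delta_b, rate_b = counter_views("0x19C8", "0x0", 66.0)  # delta 6600 over 66 s
```

In this made-up example the deltas are 6000 and 6600, yet both nodes run at the same rate, which is the kind of comparison the normalized value is meant to support.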


• Traverse to adjacent nodes from the current node (Figure 3-20)

Using the view in Figure 3-19, we traversed from the node with rank 1, which has a torus value of (0,0,0), in the X+ direction to the node with rank 3, which has a torus value of (1,0,0). Then we traversed in the Y+ direction to the node with rank 120, which has a torus value of (1,1,0). Finally, we traversed in the X- direction to the node with rank 4, which has a torus value of (0,1,0). The counter values for any node along the way can be displayed.
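How a step in a torus direction changes a node's coordinates can be sketched in Python. This is a minimal illustration, not part of dsp_perfmon. The `neighbor` function, the `DIRECTIONS` table, and the 8 x 8 x 8 torus dimensions are assumptions. The mapping from rank to torus coordinates depends on the job and is not modeled here.

```python
# Each direction names the axis it moves along (0=X, 1=Y, 2=Z) and the step.
DIRECTIONS = {
    "X+": (0, +1), "X-": (0, -1),
    "Y+": (1, +1), "Y-": (1, -1),
    "Z+": (2, +1), "Z-": (2, -1),
}


def neighbor(coord, direction, dims):
    """Return the torus coordinates one step from coord in direction.

    dims gives the torus size along X, Y, and Z; stepping past either
    edge wraps around to the opposite edge.
    """
    axis, step = DIRECTIONS[direction]
    moved = list(coord)
    moved[axis] = (moved[axis] + step) % dims[axis]
    return tuple(moved)


dims = (8, 8, 8)                            # hypothetical block dimensions
first = neighbor((0, 0, 0), "X+", dims)     # one step in X+
second = neighbor(first, "Y+", dims)        # then one step in Y+
third = neighbor(second, "X-", dims)        # then one step in X-
```

Each step changes exactly one coordinate by one, with wraparound at the edges, so a node on the boundary still has a neighbor in every direction.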
