
High Performance Computing Systems (CMSC714) Lecture 8: Perf. Analysis & Visualization Abhinav Bhatele, Department of Computer Science


Academic year: 2021


(1)

Lecture 8: Perf. Analysis & Visualization

Abhinav Bhatele, Department of Computer Science

High Performance Computing Systems (CMSC714)

(2)

Abhinav Bhatele (CMSC714) LIVE RECORDING

Summary of last lecture

• Task-based programming models and Charm++

• Key principles:

Over-decomposition, virtualization

Message-driven execution

• Automatic load balancing, checkpointing, fault tolerance

(3)

Tracing tools

• Record all the events in the program with timestamps

• Events: function calls, MPI events, etc.


Vampir visualization: https://hpc.llnl.gov/software/development-environment-software/vampir-vampir-server
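As a rough illustration of what a tracing tool records, the sketch below logs a timestamped "enter" and "exit" event for every call to an instrumented function. The `traced` decorator, the `trace` list, and the `compute`/`step` functions are hypothetical names chosen for this example; real tools such as Vampir rely on their own instrumentation layers and trace formats.

```python
import time
from functools import wraps

trace = []  # list of (timestamp, event, function name) records


def traced(func):
    """Record an 'enter' and an 'exit' event, each with a timestamp, around func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        trace.append((time.perf_counter(), "enter", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            # The exit event is recorded even if func raises.
            trace.append((time.perf_counter(), "exit", func.__name__))
    return wrapper


@traced
def compute(n):
    return sum(i * i for i in range(n))


@traced
def step():
    return compute(1000)


result = step()
for timestamp, event, name in trace:
    print(f"{timestamp:.6f} {event:5s} {name}")
```

Because every event carries a timestamp, the nesting and duration of calls can be reconstructed afterwards, which is what a timeline view visualizes.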

(6)

MPI trace visualization


Vampir

Jumpshot

(7)


Projections Performance Analysis Tool

• For Charm++/Adaptive MPI programs

• Instrumentation library

Records data at the granularity of chares (Charm++ objects)

• Java-based GUI

(8)

Time Profile

Figure 3. Time profile for ApoA1 on 1k processors of Blue Gene/L (with PME) in Projections

[Plot for Figure 4: ApoA1 on Blue Gene/L; x-axis: No. of processors; y-axis: Nanoseconds per day]

Figure 4. NAMD performs well on any given number of processors (plot shows ApoA1 running on Blue Gene/L, with PME, on a range of processor counts varying from 207 to 255)

this decomposition is called 2-Away-X, 2-Away-XY or 2-Away-XYZ etc. depending on which dimension uses k = 2.

The choice of which decomposition to use for a particular run is decided by the program depending on the atoms-to-processor ratio and other machine-dependent heuristics.

NAMD also gives the user flexibility to choose the decomposition for certain scenarios where the automatic choices do not give the best results.

Neither the number of cells nor the number of compute objects need to be equal to the exact number of processors.

Typically, the number of cells is smaller than the number of processors, by an order of magnitude, which still generates adequate parallelism (because of the separation of “compute” objects) to allow the load balancer to optimize communication, and distribute work evenly. As a result, NAMD is able to exploit any number of available processors. Fig. 4 shows the performance of the simulation of ApoA1 on varying numbers of Blue Gene/L (BG/L) processors in the range 207-255. In contrast, schemes that decompose particles into P boxes, where P is the total number of processors may limit the number of processors they can use for a particular simulation: they may require P to be a power of two or be a product of three numbers with a reasonable aspect ratio.

We now describe a few features of NAMD and analyze how they are helpful in scaling performance to a large number of processors.

2.1 Adaptive Overlap of Communication and Computation

NAMD uses a message-driven runtime system to ensure that multiple modules can be composed concurrently without losing efficiency. In particular, idle times in one module can be exploited by useful computations in another. Furthermore, NAMD uses asynchronous reductions, whenever possible (such as in the calculation of energies). As a result, the program is able to continue without sharp reductions in utilization around barriers. For example, Fig. 3 shows a time profile of a simulation of ApoA1 on 1024 processors of BG/L (This figure was obtained by using the performance analysis tool Projections [10] available in the CHARM++ framework). A time profile shows vertical bars for each (consecutive) time interval of 100 us, activities executed by the program added across all the processors. The red (dark) colored “peaks” at the bottom correspond to the force integration step, while the dominant blue (light) colored regions represent non-bonded computations. The pink and purple (dark at the top) shade appearing in a thin layer every 4 steps represent the PME computation. One can notice

https://charm.readthedocs.io/en/latest/projections/manual.html
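The time profile described in the excerpt (one bar per fixed time interval, with each activity's time summed across all processors) can be sketched as a simple binning step. The event tuple format, the `time_profile` function, and the sample activities below are assumptions made for illustration; this is not how Projections stores or processes its logs.

```python
from collections import defaultdict


def time_profile(events, interval_us=100):
    """Sum time per activity in each fixed-width interval, across all processors.

    events: iterable of (processor, activity, start_us, end_us) tuples
            (a hypothetical format used only in this sketch).
    Returns {interval_index: {activity: total_us}}.
    """
    profile = defaultdict(lambda: defaultdict(float))
    for _proc, activity, start, end in events:
        t = start
        while t < end:
            idx = int(t // interval_us)
            # Split the event at interval boundaries so each bin gets its share.
            chunk_end = min(end, (idx + 1) * interval_us)
            profile[idx][activity] += chunk_end - t
            t = chunk_end
    return {i: dict(acts) for i, acts in profile.items()}


events = [
    (0, "nonbonded", 0, 150),
    (1, "nonbonded", 20, 90),
    (1, "integrate", 90, 120),
]
profile = time_profile(events)
for idx in sorted(profile):
    print(idx, profile[idx])
```

Each dictionary entry corresponds to one vertical bar: the keys are the colored segments and the values are their heights.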

(9)


Usage Profile & Histogram View

(11)

Outlier Analysis

(12)

Scripting for multi-run comparisons


No. of Cores         8      16      32      64     128     256     512    1024    2048
Desmond ApoA1    256.8   126.8    64.3    33.5    18.2     9.4     5.2     3.0     2.0
NAMD ApoA1      199.25  104.96   50.69   26.49   13.39    7.12    4.21    2.53    1.94
Desmond DHFR      41.4    21.0    11.5     6.3     3.7     2.0     1.4
NAMD DHFR        27.25   14.99    8.09    4.31    2.37     1.5    1.12    1.03

Table 10. Comparison of benchmark times (ms/step) for NAMD (running on 2.6 GHz Opterons) and Desmond (running on 2.4 GHz Opterons)
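A multi-run comparison of this kind is easy to script. As a minimal sketch, the code below takes the two ApoA1 rows of Table 10 and computes speedup and parallel efficiency relative to the 8-core run; the `scaling` helper and the choice of 8 cores as the baseline are assumptions for illustration, not part of the original analysis.

```python
cores = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]

# Benchmark times in ms/step, copied from Table 10.
desmond_apoa1 = [256.8, 126.8, 64.3, 33.5, 18.2, 9.4, 5.2, 3.0, 2.0]
namd_apoa1 = [199.25, 104.96, 50.69, 26.49, 13.39, 7.12, 4.21, 2.53, 1.94]


def scaling(cores, times):
    """Return (cores, speedup, efficiency) rows relative to the first run."""
    base_cores, base_time = cores[0], times[0]
    rows = []
    for p, t in zip(cores, times):
        speedup = base_time / t
        # Ideal speedup equals the growth in core count.
        efficiency = speedup / (p / base_cores)
        rows.append((p, speedup, efficiency))
    return rows


namd_rows = scaling(cores, namd_apoa1)
desmond_rows = scaling(cores, desmond_apoa1)

for (p, s_n, e_n), (_, s_d, e_d) in zip(namd_rows, desmond_rows):
    print(f"{p:5d}  NAMD: {s_n:6.1f}x ({e_n:5.1%})  Desmond: {s_d:6.1f}x ({e_d:5.1%})")
```

Efficiency below 100% at high core counts is the quantity the strong-scaling discussion in the excerpt is concerned with.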

6 Future Work

The needs of biomolecular modeling community require us to pursue strong scaling i.e. we must strive to scale the same molecular system to an ever larger number of processors. We have demonstrated very good scalability with the techniques described in this paper, but challenges remain, especially if we have to exploit the petascale machines for the same problems. To analyze if this is feasible and what avenues are open for further optimizations, we carried out a performance study of scaling that we summarize below.

We used the summary data provided by Projections which gives detailed information about the time elapsed in the execution of each function in the program and also the time for which each processor is idle. This data, collected on the Cray XT3 machine at PSC and the BG/L machine at IBM for 1 to 4,096 processors, is shown in Fig. 8.

Figure 8. Percentage increase of different parts of NAMD with increase in number of processors (ApoA1 on BG/L and XT3, without PME)

To simplify the figures, functions involved in similar or related activities are grouped together. The first observation is that the idle time rises rapidly beyond 256 processors. This is mostly due to load imbalance, based on further analysis. One of the next challenges is then to develop load balancers that attain better performance while not spending much time or memory in the load balancing strategy itself. Also, there is a jump in the non-bonded work from 256 to 512 which can be attributed to the change in the decomposition strategy from 2-Away-X to 2-Away-XY at that point which doubles the number of cells. Since the essential computational work involved in non-bonded force evaluation does not increase, we believe that this increase can be reduced by controlling the overhead in scheduling a larger number of objects. The other important slowdown is because of the increase in communication overhead as we move across the processors. Looking back at Table 1 we see that from 512 to 16K processors, the computation time per processor should ideally decrease 32-fold but does not. This can be attributed in part to the communication volume (per processor) decreasing only by a factor of about 10. Although some of this is inevitable with finer-grained parallelization, there might be some room for improvement.

7 Summary

The need for strong scaling of fixed-size molecular systems to machines with ever-increasing number of processors has created new challenges for biomolecular simulations. Further, some of the systems being studied now include several million atoms. We presented techniques we developed or used, to overcome these challenges. These included dealing with interaction among two of our adaptive strategies: generation of spanning trees and load balancing. We also presented new techniques for reducing

(14)

Limitations of current analysis tools

• Support their own unique format(s)

• Limited support for saving or automating analysis

• Most tools only support viewing one dataset at a time

• Lack capabilities to sub-select and focus on specific parts of larger datasets


hpcviewer’s GUI

Do not enable programmatic analysis of the data by the end user

(15)


Hatchet

• A Python-based library to enable programmatic analysis

• Creates an in-memory representation of the graph

• Leverages pandas, which supports multi-dimensional tabular datasets

Uses the graph as a structured index into pandas dataframes

• A set of operators to sub-select and/or aggregate profile data

• A set of operators to compare multiple datasets

(20)

Pandas and dataframes

• Pandas is an open-source Python library for data analysis

• Dataframe: two-dimensional tabular data structure

Supports many operations borrowed from SQL databases

• MultiIndex enables working with high-dimensional data in a 2D data structure

[Figure: a dataframe with its columns, rows, and index labeled]
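The points above can be tried on a small table. The sketch below builds a dataframe with a two-level MultiIndex and applies a few SQL-like operations; the column names (`node`, `rank`, `time`) and the timing values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-rank timings (seconds) for two functions on two MPI ranks.
records = [
    {"node": "main", "rank": 0, "time": 10.0},
    {"node": "main", "rank": 1, "time": 12.0},
    {"node": "solve", "rank": 0, "time": 6.0},
    {"node": "solve", "rank": 1, "time": 9.0},
]

# Indexing by two columns builds a MultiIndex: (node, rank) identifies a row,
# so three-dimensional data (node x rank x metric) fits in a 2D table.
df = pd.DataFrame(records).set_index(["node", "rank"])

# SQL-like operations: WHERE becomes a boolean mask, GROUP BY becomes groupby.
slow = df[df["time"] > 8.0]
mean_per_node = df.groupby(level="node")["time"].mean()

# Cross-section: all rows for one node, across ranks.
solve_rows = df.xs("solve", level="node")

print(df)
print(mean_per_node)
```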

(22)

Central data structure: a GraphFrame

• Consists of a structured index graph object and a pandas dataframe

• Graph stores caller-callee relationships

• Dataframe stores all numerical and categorical data

SC ’19, November 17–22, 2019, Denver, CO Anon.


useful orderings (like pre-order, post-order, etc.), if we want to pay the cost of a graph traversal (or sort) to generate more structured keys. We default to only guaranteeing uniqueness and not order in our keys.

3.2 Graphframe

The central data structure in the Hatchet library is a Graphframe, which combines the structured index Graph with a pandas DataFrame.

Figure ?? shows the two objects in a graphframe – a graph object (the index), and a dataframe object storing the metrics associated with each node.

[Figure: call graph with nodes labeled main, physics, solvers, mpi, psm2, hypre]

Figure 3: In Hatchet, the graphframe consists of a graph and a dataframe object.

Because of the way we have architected the Hatchet structured index Graph, we can insert Node objects directly into the pandas data frame. The nodes are indexed and sorted using their basic comparison operators, which operate on their key attribute. Thus, the first column in the dataframe (the node) is the index column.

As a convenience, we may also add columns (like name) based on attributes from each node’s FrameID. For example, in the figure, we have added the name and nid columns from the FrameID subclass for HPCToolkit. This allows us to use regular pandas operations (selection, filtering) on these values directly. As we will see later, the node column itself also allows various graph-semantic functions to be used, as well.

Finally, in addition to the identifying information for each node, we also add columns for each associated performance metric (inclusive and exclusive time in the figure).

Graphs vs. Trees: Hatchet stores the structure (typically a prefix tree generated from call paths) in the input data as a directed graph (instead of a tree) for two reasons. First, subsequent operations on a tree can create edges, turning the tree into a graph. Additionally, output from tools such as callgrind is already in the form of a DAG.

Hatchet’s directed graph could be connected or have multiple disconnected components. Each entity in the graph, such as a callsite, procedure frame, or function, is stored as a node and the caller-callee relationships are stored as directed edges. Each node in the graph can have one or multiple parents and children.
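To make the graph half of this description concrete, here is a minimal sketch of nodes joined by directed caller-callee edges, where a node may have more than one parent. The `Node` class, `add_edge`, `inclusive`, and the metric values are hypothetical and are not Hatchet's classes or API; a plain dictionary stands in for the dataframe.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so nodes can be dict keys
class Node:
    """One entity in the call graph (e.g. a function)."""
    name: str
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


def add_edge(caller, callee):
    """Store a caller-callee relationship as a directed edge."""
    caller.children.append(callee)
    callee.parents.append(caller)


main, physics, solvers, mpi = (Node(n) for n in ("main", "physics", "solvers", "mpi"))
add_edge(main, physics)
add_edge(main, solvers)
add_edge(physics, mpi)
add_edge(solvers, mpi)  # second parent: the structure is a DAG, not a tree

# Exclusive time per node, keyed by the node object itself.
metrics = {main: 0.5, physics: 2.0, solvers: 3.0, mpi: 4.5}


def inclusive(node, seen=None):
    """Sum exclusive time over every node reachable from `node`, counting each once."""
    seen = set() if seen is None else seen
    if node in seen:
        return 0.0
    seen.add(node)
    return metrics[node] + sum(inclusive(child, seen) for child in node.children)


print(inclusive(main))
print([p.name for p in mpi.parents])
```

Keying the metrics by node object mirrors the idea of using graph nodes as the index into the table of measurements.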

Benefits of Dataframes: We use a pandas dataframe to store all the numerical and categorical data associated with each node. Profile data can be inherently high-dimensional when metrics are gathered per-MPI process and/or thread. In such cases, each node in the call tree or graph has metrics per-MPI process and/or thread and this data needs to be stored and indexed hierarchically. To index the rows of the data frame in such cases, a MultiIndex consisting of the structured index for the node and MPI rank or thread ID is used. In the most general case, a row in the data frame is indexed by a process and/or thread ID (and any other needed identifiers in even higher dimensional cases).

3.3 Immutable Graph Semantics

Astute readers may have noted that we are adding direct references to graph nodes into the Dataframe. The risk this poses in our API is that client code can extract a subset of a Dataframe and hand it off to other client code, which then modifies the graph index nodes directly and corrupts all dataframes that use the same nodes.

One key aspect of Hatchet is that its graph nodes use immutable semantics. The Graphframe API is responsible for ensuring that operations between any two Graphframes use immutable graph node references, and that any operations that would modify a graph node in place instead create an entirely new graph index for the new frame to work with. So, we implement immutable semantics using copy-on-write to simplify the management of the graph and dataframe together.
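The copy-on-write idea can be illustrated in a few lines. This is only a sketch under assumed names (`Node`, `rename`, `index_a`, `index_b`), using a frozen dataclass to stand in for immutable graph nodes; it does not reproduce Hatchet's implementation.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Node:
    """Graph node with immutable semantics: fields cannot be reassigned."""
    key: int
    name: str


def rename(index, key, new_name):
    """Copy-on-write update: build a new index instead of mutating nodes in place."""
    return {
        k: replace(node, name=new_name) if k == key else node
        for k, node in index.items()
    }


index_a = {0: Node(0, "main"), 1: Node(1, "solve")}
index_b = rename(index_a, 1, "solve_v2")

# In-place modification is rejected, so a frame sharing index_a cannot be corrupted.
try:
    index_a[1].name = "oops"
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

print(index_a[1].name, index_b[1].name, mutation_blocked)
```

Unchanged nodes are shared between the two indexes, while the modified node exists only in the new one, so the original index and anything built on it stay intact.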

One further consequence of our index model is that to use two dataframes together, we require that their graphs be unified. That is, that they share the same index. This should be obvious when considering that the nodes are compared by their key values, and two nodes can only be considered identical within an index if they have identical keys, which means that they must be in the same graph for comparison to make sense. We accomplish this by traversing the graphs and computing their union according to their connectivity and FrameID values (described further in the API section).

Incidentally, this type of restriction is not unusual in pandas, where comparing two data frames frequently requires reconciling their indexes, as well. We abstract the details of these graph operations in Hatchet through the GraphFrame API, which determines when and how Graphframes should be unified.

3.4 Reading a CCT Dataset

With all of these components, that GraphFrame models the edge relationships between nodes in the structured data, and a dataframe storing the numerical (performance metrics such as time, PAPI counters data, etc.) and categorical data (e.g., load module, file, line number) associated with each node. The generality of what can be stored in a pandas dataframe enables Hatchet to store almost any kind of contextual information recorded during sampling by diverse profiling tools.

Hatchet provides readers for several input formats to support data collected by popular profiling tools in the HPC community.

Hatchet can read in the database directory generated by HPCToolkit (hpcprof-mpi), and also split JSON files generated by Caliper. In addition, one can provide structured data in Graphviz’ DOT format or a simple string literal.

Most profiling tools that generate CCTs have two kinds of information in their output, often separated into different parts of a file or different files. The first information is the structure of the CCT – present in experiment.xml in HPCToolkit databases,

(24)

Abhinav Bhatele

5218 Brendan Iribe Center (IRB) / College Park, MD 20742 phone: 301.405.4507 / e-mail: [email protected]

Questions?
