Academic year: 2021


(1)

A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems

(2)

1. Context/Motivations

2. Fast presentation of the tool

3. Demonstration

4. How does it work?

5. How is it made?

(3)

MOTIVATIONS

Memory hierarchy is growing deeper and larger.

No performance without fair use of the system topology.

Batch schedulers, runtimes, and applications themselves are becoming topology-aware.

[Figure: machine hierarchy, from IO and network down to the machine, its NUMA nodes with shared memory, shared caches, private caches and cores]

(6)

MOTIVATIONS

Memory hierarchy is growing deeper and larger. Hence, data management gives opportunities for performance improvements. It is a multi-level and multi-criteria problem.

(7)

MOTIVATIONS

• Need to match use cases with the relevant performance metrics for each level.

• Need to match performance and topology.

• Requires topology modeling skills.

(8)

Yet Another Tool to Monitor Applications Performance

• Focuses on data presentation to link the results with topology information.

• Relies on two cornerstones, topology modeling (hwloc) and performance counter abstraction (PAPI), and maps the latter onto the former.

• Minimal configuration and software requirements.

• Can help find and characterize localized bottlenecks.

(9)

Hardware Locality (hwloc)

Portable abstraction of hierarchical architectures for high-performance computing

• Performs topology discovery and extracts hardware component information.

• Provides tools for memory and process binding.

• Many operating systems supported.

• ...

• lstopo utility to display the topology.

Developed at Inria Bordeaux.

(10)

Hardware Locality (hwloc)

Machine (31GB total)
  NUMANode P#0 (31GB)
    Package P#0
      L3 (20MB)
        L2 (256KB), L1d (32KB), L1i (32KB), Core P#0: PU P#0, PU P#16
        L2 (256KB), L1d (32KB), L1i (32KB), Core P#1: PU P#2, PU P#18
        L2 (256KB), L1d (32KB), L1i (32KB), Core P#2: PU P#4, PU P#20
        L2 (256KB), L1d (32KB), L1i (32KB), Core P#3: PU P#6, PU P#22
        L2 (256KB), L1d (32KB), L1i (32KB), Core P#4: PU P#8, PU P#24
        L2 (256KB), L1d (32KB), L1i (32KB), Core P#5: PU P#10, PU P#26
        L2 (256KB), L1d (32KB), L1i (32KB), Core P#6: PU P#12, PU P#28
        L2 (256KB), L1d (32KB), L1i (32KB), Core P#7: PU P#14, PU P#30

(11)

Performance Application Programming Interface (PAPI)

Consistent interface and methodology for use of the performance counter hardware.

• Real-time correlation between software performance and processor events.

• Many operating systems supported too.

• Reliable and actively supported.

• Used in a wide range of performance analysis applications.

An abstraction layer for plugging in other performance libraries is under development.

(12)

Dynamic Lstopo (example)

[Figure: lstopo rendering of a 16GB machine with one NUMA node and one package (20MB L3; eight cores, each with 256KB L2, 32KB L1d, 32KB L1i and two PUs), with a sampled counter value displayed on each monitored object]

Sample of hardware performance counters mapped on a single socket of an Intel Xeon E5-2650 CPU.

(13)

A Demonstration Worth a Thousand Words

[Figure: demonstration topology showing PU0 to PU3 with their L1, L2 and L3 caches (∗4 . . .)]


(16)

A Demonstration Worth a Thousand Words

L3_MISS{ OBJ = L3; CTR = PAPI_L3_TCM; LOGSCALE = 1; }
L2_MISS{ OBJ = L2; CTR = PAPI_L2_TCM; LOGSCALE = 1; }
L1_MISS{ OBJ = L1d; CTR = PAPI_L1_DCM; LOGSCALE = 1; }
SINGLE_L3_MISS{ OBJ = PU; CTR = PAPI_L3_TCM; LOGSCALE = 1; }

(17)

Dynamic Lstopo (Usage)

[Figure: lstopo rendering of a 16GB machine with one NUMA node and one package (4096KB L3; two cores, each with 256KB L2, 32KB L1d, 32KB L1i and two PUs), with the counter value 2,5120000000e+03 displayed]

Counters input:

SINGLE_L3_MISS{
  OBJ = L3;
  CTR = PAPI_L2_TCM / PAPI_L2_TCA;
  LOGSCALE = 1;
  MAX = 1000000; MIN = 0;
}

Command line:

(18)

Dynamic Lstopo (Theory)

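[Figure: tree with one L2, two L1 caches and four PUs; "+" marks where counters are accumulated upward and "%" marks where an expression is evaluated]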

1. Spawn one pthread per hardware thread (PU#0, . . . , PU#3).

2. For each timestamp, each thread collects a local set of performance counters.

3. Counters are accumulated in each upper level.

4. For each level, a leaf computes an arithmetic expression of the
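Steps 3 and 4 above can be sketched in C on the slide's toy tree (one L2, two L1 caches, four PUs). This is only an illustration under stated assumptions, not the tool's implementation: the per-PU values are passed in as fixed numbers instead of being read by one thread per PU (steps 1 and 2), accumulation is taken to mean summation, and the names `struct node`, `fill_toy_tree`, `accumulate` and `miss_ratio` are hypothetical. The ratio used as the arithmetic expression mirrors the `PAPI_L2_TCM / PAPI_L2_TCA` counter expression from the usage slide.

```c
#include <stddef.h>

/* Hypothetical node of a toy topology tree (not an hwloc type). */
struct node {
    int parent;          /* index of the parent node, -1 for the root */
    long long misses;    /* a miss count, e.g. from PAPI_L2_TCM */
    long long accesses;  /* an access count, e.g. from PAPI_L2_TCA */
};

/* Build the 7-node tree of the slide: node 0 is the L2, nodes 1-2 are
 * the L1 caches, nodes 3-6 are the PUs. Only the PUs receive counter
 * values; inner nodes start at zero. */
void fill_toy_tree(struct node t[7], const long long misses[4],
                   const long long accesses[4])
{
    static const int parent[7] = { -1, 0, 0, 1, 1, 2, 2 };
    for (size_t i = 0; i < 7; i++) {
        t[i].parent = parent[i];
        t[i].misses = 0;
        t[i].accesses = 0;
    }
    for (size_t pu = 0; pu < 4; pu++) {
        t[3 + pu].misses = misses[pu];
        t[3 + pu].accesses = accesses[pu];
    }
}

/* Step 3: add every node's counters into its parent. Nodes must be
 * ordered so that each child has a larger index than its parent, which
 * lets a single backward pass propagate values up to the root. */
void accumulate(struct node *t, size_t n)
{
    for (size_t i = n; i-- > 1; ) {
        int p = t[i].parent;
        if (p >= 0) {
            t[p].misses += t[i].misses;
            t[p].accesses += t[i].accesses;
        }
    }
}

/* Step 4: evaluate an arithmetic expression of the accumulated
 * counters at one node, here a miss ratio (0 when nothing was counted). */
double miss_ratio(const struct node *nd)
{
    return nd->accesses ? (double)nd->misses / (double)nd->accesses : 0.0;
}
```

Calling `fill_toy_tree`, then `accumulate(t, 7)`, leaves each L1 node with the sum of its two PUs and the L2 node with the sum of all four, so `miss_ratio` can be evaluated at any level of the tree.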

(22)

Dynamic Lstopo Software Architecture in Brief

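[Diagram: the hwloc library exposes the machine's static topology and the PAPI library exposes its dynamic performance counters; the monitors library sits on top of both, is used by the application, and produces the monitors output]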

(24)

API

monitors = load_Monitors_from_config(NULL, "my_perf_file", "my_output_file", 0);
Monitors_watch_pid(monitors, getpid());
Monitors_start(monitors);
/* ... */
Monitors_update_counters(monitors);
delete_Monitors(monitors);

(25)

Dynamic Lstopo (Output paje trace)

%EventDef val 0
% Id int
% Phase int
% Time_us date
% Value double
%EndEventDef
%EventDef container 1
% Id int
% Level string
% Sibling int
% Name string
% Logscale int
%EndEventDef
1 3 PU 0 SINGLE_L3_MISS 1
1 2 PU 1 SINGLE_L3_MISS 1
1 1 PU 2 SINGLE_L3_MISS 1
1 0 PU 3 SINGLE_L3_MISS 1
0 2 0 962832762224 67,00
0 1 0 962832762224 58,00
0 3 0 962832762225 77,00
0 0 0 962832762236 64,00
0 3 0 962832860676 94514,00
0 0 0 962832860676 121746,00
0 2 0 962832860676 205170,00
0 1 0 962832860717 200931,00
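To make the trace layout concrete, here is a minimal C sketch that reads one "val" event line. The field order (event type, Id, Phase, Time_us, Value) is taken from the %EventDef header above; the interpretation of each field, the decimal-comma handling, and the names `struct sample` and `parse_sample` are assumptions made for this illustration and are not part of the tool.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One "val" event, in the field order of the trace header.
 * Field meanings are assumed from their names. */
struct sample {
    int id;             /* Id of the container the value belongs to */
    int phase;
    long long time_us;  /* timestamp */
    double value;       /* counter value */
};

/* Parse one event line such as "0 2 0 962832762224 67,00".
 * Returns 0 on success, -1 if the line is not a well-formed "val" event
 * (container lines, which start with event type 1, are rejected). */
int parse_sample(const char *line, struct sample *out)
{
    int type;
    char text[32];

    if (sscanf(line, "%d %d %d %lld %31s",
               &type, &out->id, &out->phase, &out->time_us, text) != 5)
        return -1;
    if (type != 0)  /* 0 is the "val" event type in the header */
        return -1;

    /* The trace writes a decimal comma; strtod expects a dot in the
     * default "C" locale, so swap it before converting. */
    char *comma = strchr(text, ',');
    if (comma != NULL)
        *comma = '.';

    char *end;
    out->value = strtod(text, &end);
    return (end != text && *end == '\0') ? 0 : -1;
}
```

A reader of the trace would call `parse_sample` on each line after the header and use the Id to look up the container line (level, sibling index, monitor name) declared earlier in the file.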

(26)

Features

• Record and/or display live machine performance counters and match them with the topology.

• Several settings: counter accumulation, sampling rate, attaching to a process, etc.

• Replay any trace with a topology file (for external display).

• Sample specific parts of an application with the monitor library.

(27)

Future work

• Match code and performance information.

• Accept user-defined aggregation operators.

• Provide a performance abstraction layer.

• Be able to delimit phases during execution.

• Find and give explicit hints on bottlenecks.

(28)

Conclusion

Data locality is becoming a main criterion for high performance. We built a tool based on a topology model and a performance library to help take up the challenge.

It maps performance values to machine objects. It is a visual tool, fast and easy to use.

It is lightweight and causes less than 1% CPU overhead. It lets you build topology-aware performance models.

(29)

Thank you
