• No results found

Sequential Performance Analysis with Callgrind and KCachegrind

N/A
N/A
Protected

Academic year: 2021

Share "Sequential Performance Analysis with Callgrind and KCachegrind"

Copied!
21
0
0

Loading.... (view fulltext now)

Full text

(1)

Sequential Performance Analysis

with Callgrind and KCachegrind

2nd Parallel Tools Workshop, HLRS, Stuttgart, July 7/8, 2008

Josef Weidendorfer

Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut für Informatik, Technische Universität München

(2)

Outline

• Background

• Callgrind & KCachegrind

– Measurement – Visualization

• Practical

– Getting started

– Example: Matrix Multiplication

(3)

Background

• Sequential performance bottlenecks

(directly relates to parallel performance!)

– Logical errors (unneeded function calls) – Bad algorithm for given problem

– Bad exploitation of available resource

• How to improve performance

– Use tuned libraries when available – Check for above obstacles

(4)

Performance Analysis Tools

• Count occurrences of events

– Resource exploitation is related to events

– Software-related: function call, OS scheduling, ...

– Hardware-related: FLOP executed, memory access, cache miss, ...

• Relate events to source code

– Find code regions where most time is spent – Check for improvement after changes

– Profiling data: Sum of events happening at same code position – Inclusive vs. exclusive cost

(5)

How to measure Events

• Software-related

– Instrumentation (= insertion of measurement code)

• into OS / application, manual/automatic, on source/binary level

• Hardware-related

– Read Hardware Performance Counters

• gives exact event counts for code ranges

• needs instrumentation fixed (high) influence to measurement results

– Statistical: Sampling

• Event distribution over code approximated by checking every N-th event • Hardware notifies only about every N-th event  Influence tunable by N

– Architecture Simulation

• Events generated by a (simplified) hardware model • No influence: allows for sophisticated online processing

(6)

Architectural Performance Problem Today: Main Memory

• Access latency ~ 200 cycles

– 400 FLOP wasted for one main memory access – Solution:

• Memory controller on chip

• Exploit fast caches (Locality of accesses!) • Prefetch data (automatically)

• Bandwidth available for 1 chip ~ 3 – 10 GB/s

– All cores have to share the bandwidth – Can prevent effective prefetching

– Solution:

• Share data in caches among cores

• Keep working set in cache (Temporal locality!) • Use good data layout (Spatial locality!)

(7)

Callgrind

Cache Simulation with Call-Graph Relation

(8)

• Based on Valgrind

– Runtime instrumentation infrastructure (no recompilation needed) – Dynamic binary translation of user-level processes

– Linux/AIX on x86, x86-64, PPC32, PPC64 – Correctness checking & Profiling tools on top

– “memcheck”: accessibility/validity of memory accesses – “helgrind” / ”drd”: race detection on multithreaded code – “cachegrind”/”callgrind”: cache simulation

– “massif”: space profiling – Open source (GPL)

– www.valgrind.org

(9)

Callgrind: Basic Features

• Part of Valgrind since 3.1

– Open Source, GPL

• Measurement Method

– Profiling via Simulation

– Simple Cache Model (synthetic events, derived from Cachegrind) – Instrument memory accesses to feed cache simulator

– Hook into Call/Return instructions, thread switches / signal handlers – Instrument (conditional) jumps for CFG inside of functions

• Presentation of results:

callgrind_annotate

/ KCachegrind

(10)

• Usage of Valgrind

– Driven only by user-level instructions of one process

– Slowdown (Call-graph tracing: 15-20x, + cache simulation: 40-60x) • But: “Fast-forward mode”: 2-3x

 Allows detailed observation (arbitrary metrics such as reuse distance)  Does not need root access / can not crash machine

• Cache Model

– “Not Reality”: Synchronous 2-level inclusive Cache Hierarchy (Size/Associativity taken from real machine)

 Easy to understand/reconstruct for user

 Reproducible results independent on real machine load  Derived optimizations applicable for most architectures

(11)

Callgrind: Advanced Features

• Interactive Control (backtrace, dump command, …)

• “Fast forward”-mode to get to interesting code phase fast

• Application control via client requests (start/stop, dump)

• Options for avoidance of function call cycles

– Cycles are bad for analysis (inclusive costs not applicable) – Add context info into function names (call chain/rec. depth)

• Best case simulation of simple L2 stream prefetcher

• Usage of cache lines before eviction

– Relates to temporal/spatial locality

(12)

• valgrind –tool=callgrind [callgrind options] yourprogram args

• Cache simulator:

--simulate-cache=yes

• Start in “fast-forward”:

--instr-atstart=yes

– Switch on event collection: callgrind_control –i on

• Jump-tracing in functions (CFG):

--collect-jumps=yes

• Separate dumps per thread:

--separate-threads=yes

• Current backtrace of threads (interactive):

callgrind_control –b

• Spontaneous dump:

callgrind_control –d [dump identification]

• Prefetch simulation:

--simulate-hwpref=yes

• Cacheline usage:

--cacheuse=yes

(13)

KCachegrind

Graphical Browser for Profile Visualization

(14)

• Open Source, GPL

kcachegrind.sf.net

• Included with KDE3 & KDE4

• Visualization of

– Call relationship of functions (callers, callees, call graph) – Exclusive/Inclusive cost metrics of functions

• Grouping according to ELF object / source file / C++ class – Source/Assembly annotation: costs + CFG

– Arbitrary events counts + specification of derived events

• Callgrind support (file format, events of cache model

(15)

• kcachegrind callgrind.out.<pid> • Left: “Dockables”

– List of function groups Groups according to

– Library (ELF object) – Source

– Class (C++) – List of functions with

– Inclusive

– exclusive costs

• Right: Visualization panes

(16)

KCachegrind: Visualization panes for selected function

• List of event types

• List of callers/callees

• Treemap visualization

• Call Graph

• Source annotation

(17)

Work in Progress…

• Callgrind

– Multicore cache simulation

(detection of data sharing, not load balancing)

– Command line tool for measurement merging & results – Event relation to data structures

• KCachegrind

– Pure Qt4 version (eg. Windows) – Cross-System support

(18)

Outline

Background

Callgrind & KCachegrind

– Measurement

– Visualization

• Practical

– Getting started

– Example: Matrix Multiplication

(19)

Practical: Getting started

• Login to cluster (or use Knoppix locally):

– ssh [email protected] (PW: tws_us4er) – svn co --username tws_user

https://svn.gforge.hlrs.de/svn/tws-examples/kcachegrind • Setup your environment:

module load valgrind

tws-examples/kcachegrind/README • Test: What happens in „/bin/ls“ ?

valgrind --tool=callgrind ls /usr/bin

kcachegrind

– What function takes most instruction executions? Purpose? – Where is the main function?

(20)

Practical: Detailed analysis of matrix multiplication

• Kernel for C = A * B

– Side length N  N3 multiplications + N3 additions

– 3 nested loops (i,j,k): Best index order? – Optimization for large matrixes: Blocking

• Code: mm/mm.c

make CFLAGS=‘-O2 -g’

– Timing of orderings (size 300 – 800): ./mm 300 800

– Cache behavior for small matrix (fitting into cache):

valgrind --tool=callgrind –-simulate-cache=yes ./mm 300

– How good is L1/L2 exploitation of the MM versions?

(21)

Weidendorfer: Callgrind / KCachegrind

Q

&

A

?

References

Related documents

QicScript Plus Copyright 2014 Helix Health Limited, 3094 Lakedrive, Citywest Business Campus, Dublin 24 Tel: +353-1-4633099 Fax: +353-1-4633011.. Email:

Performance was a major concern with the initial iteration of the scanner tool. All libraries used by the monitored program must be input into the tool. At the very least this

 Instructions: the next instruction after a non-branch inst is always executed.. Cache memory.  A cache is small memory which content is an image of a subset of the

If you hypothetically ran the ENTIRE string of memory accesses with a fully associative cache (with an LRU replacement policy) of the same size as your cache, and it was a miss for

En esta fase se debe analizar y decidir cuáles serán los medios utilizados para informar a los probables candidatos sobre En esta fase se debe analizar y decidir cuáles serán los

Program Delivery costs include costs related to the administration of both financial and employment assistance, as well as the provision of employment assistance activities

To master these moments, use the IDEA cycle: identify the mobile moments and context; design the mobile interaction; engineer your platforms, processes, and people for

Event 0xA counts non-sequential data cache accesses, to cache lines, regardless of whether the access is cacheable or non-cacheable. Subtracting the event 0x9 count from the event