<Insert Picture Here>
Categories of Tracing Tools
•
Kernel Tracing
•
Userspace Applications Tracing
•
Static Tracing
•
Dynamic Tracing
•
Sampling
•
Continuous
A Brief History
• LTT (Linux Trace Toolkit) (1998) replaced in 2006 • Kprobes (2004):
http://www.ibm.com/developerworks/library/l-kprobes/index.html
• Dtrace for Solaris (2003) released on Solaris 10 in November 2005 • Systemtap for Linux (2005) at OLS
• LTTng (LTT Next Generation) (2006): https://lkml.org/lkml/2006/1/9/223 • Ftrace: (2008): https://lkml.org/lkml/2008/1/3/26
Currently Active Projects
•
Ftrace
•
Perf
•
Systemtap
Ftrace
• In kernel with portions outside
• Main Maintainer: Steven Rostedt (2008)
• CLI: /sys/kernel/debug/tracing as interface (control and output) • GUI for data visualization (KernelShark)
http://people.redhat.com/srostedt/kernelshark/HTML/
• Trace-cmd: user space tool with subcommands
http://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git
• Fundamental mechanism: gprof hooks. Adds mcount() call at entry of each
kernel function. Requires kernel to be compiled with -pg option for some functionality
• User: Root privileges • No user space tracing
Information collected: ftrace plugins
• Function: trace entry of all kernel functions
• Function-graph: traces on both entry and exit of the functions. It
then provides the ability to draw a graph of function calls
• Wakeup: max latency that it takes for the highest priority task to
get scheduled after it has been woken up
• Irqsoff: areas that disable interrupts and saves the trace with the
longest max latency
• Preemptoff: amount of time for which preemption is disabled • Preemptirqsoff: largest time for which irqs and/or preemption is
disabled
• Nop: trace nothing
Ftrace: trace-cmd
• User space tool • Requires root
• Some commands are:
• Record: start collection of events into file trace.dat
• Can use ftrace plugins (-p) or static kernel events (-e)
• Report: display content of trace.dat
• Start: start tracing but keep data in kernel ring buffer
• Stop: stop collecting data into the kernel ring buffer
• Extract: extract data from ring buffer into trace.dat
• List: list available plugins (-p) or events (-e)
• Many options, very flexible
Ftrace: Original Interface
# echo sched_switch > current_tracer # echo 1 > tracing_on # sleep 1 # echo 0 > tracing_on # cat trace # tracer: sched_switch # TASKPID CPU# TIMESTAMP FUNCTION # | | | | | bash3997 [01] 240.132281: 3997:120:R + 4055:120:R bash3997 [01] 240.132284: 3997:120:R ==> 4055:120:R sleep4055 [01] 240.132371: 4055:120:S ==> 3997:120:R bash3997 [01] 240.132454: 3997:120:R + 4055:120:S bash3997 [01] 240.132457: 3997:120:R ==> 4055:120:R sleep4055 [01] 240.132460: 4055:120:D ==> 3997:120:RFtrace: Trace-cmd
# tracecmd record e sched_switch sleep 5 # tracecmd report version = 6 cpus=2 sleep29568 [000] 2930217.458882:sched_switch:tracecmd:29568 [120] R ==> evince:28664 [120] evince28664 [000] 2930217.458977:sched_switch:evince:28664 [120] S ==> tracecmd:29568 [120] sleep29568 [000] 2930217.459349:sched_switch:sleep:29568 [120] R ==> pulseaudio:1909 [120] pulseaudio1909 [000] 2930217.459516:sched_switch:pulseaudio:1909 [120] S ==> sleep:29568 [120] sleep29568 [000] 2930217.460463:sched_switch: sleep:29568 [120] S ==> tracecmd:29566 [120] tracecmd29566 [000] 2930217.460510:sched_switch:tracecmd:29566 [120] S ==> tracecmd:29567 [120] tracecmd29567 [000] 2930217.460543:sched_switch:tracecmd:29567 [120] S ==> swapper:0 [120] <idle>0 [000] 2930217.460710:sched_switch:swapper:0 [120] R ==> Xorg:1675 [120]Ftrace: Dynamic Tracing
• Added by Masami Hiramatsu (end of 2009) • Real dynamic tracing
• Use /sys/kernel/debug/tracing/kprobe_events to control from command
line
• Flexible to an extent, but hard to use because of syntax • Read more: Documentation/trace/kprobetrace.txt
• LWN article: http://lwn.net/Articles/343766/
echo 'r:myretprobe do_sys_open $retval' > /sys/kernel/debug/tracing/kprobe_events
Perf
• In kernel (tools/perf directory) userspace tool (started in 2008) • Main contributors : Thomas Gleixner, Ingo Molnar, Frederic
Weisebecker, Arnaldo Charvalo De Melo, Peter Zijlstra and others
• Started as Hardware performance counters interface, initially called
perf counters.
• Originally on top of the hardware-based performance monitoring
interface for Linux by Stephane Eranian. Now available as libpfm version 4 http://perfmon2.sourceforge.net/docs_v4.html
• Has grown into all encompassing tracing system • Very active at the moment
• No user space tracing yet
• Documentation: tools/perf/design.txt and tools/perf/Documenation • Perf wiki: https://perf.wiki.kernel.org/index.php/Main_Page
Perf subcommands
• Perf stat: collects and display events data during a command
execution
• Perf record: run a command, store its profiling (sampling mode)
in output file (perf.data)
• Perf report: display data previously recorded in output file
(perf.data)
• Perf diff: diff between two perf.data files
• Perf top: performance counter profile in real time
• Perf probe: define dynamic tracepoints (more on this later)
• Perf kmem: show statistics on kernel mem (from perf.data), or
record kernel mem usage for a command
• Perf trace-perl (trace-python): process previously recorded trace
Perf subcommands (continued)
• Perf list: list available symbolic events types (in exadecimal
encoding)
• Perf annotate: display perf.data with assembly or source code of
the command executed
• Perf lock: lock statistics
• Perf sched: scheduler behaviour
• Perf kvm: perform same as above but on a kvm guest
environment with subcommands:
• Perf kvm top
• Perf kvm record
• Perf kvm report
Perf: Types of events
• Perf list shows the list of predefined events that are available • Use one or more of the events in perf stat commands:
perf stat -e instructions sleep 5
• Hardware: cpu cycles, instructions, cache references, cache misses, branch
instructions, branch misses, bus cycles. Uses libpfm. CPU model specific.
• Hardware cache: for instance L1-dcache-loads, L1-icache-prefetches • Hardware breakpoint
• Raw: hexadecimal encoding provided by harware vendor
• Software: in kernel events. Such as: cpu clock, task clock, page faults, context
switches, cpu migrations, page faults, alignment faults, emulation faults.
• Static tracepoints: needs the ftrace event tracer enabled. Events are listed in
/sys/kernel/debug/tracing/events/* (locations defined with TRACE_EVENT() macros)
Other Control Parameters
• Process specific: default mode, perf follows fork() and
pthread_create() calls
• Thread specific: use - - no-inherit (or -i) option • System wide: use -a option
• CPU specific: use -C with list of CPU numbers • For running process: -p <pid>
• To find what binaries were running , from a perf.data file, use Elf
unique identifier inserted by linker:
• perf buildid-list
• User, Kernel, Hypervisor modes: count events when the CPU is
in user, kernel and/or hypervisor mode only. Use with perf stat
Dynamic Tracing in Perf
• Contributed by Masami Hiramatsu
• Probes can be inserted on the fly using perf probe • Defined on the command line only
• Syntax can be simple or complex
• Can use dwarf debuginfo to show source code (--line option)
before setting probe (uses elfutils libdw like systemtap), or to use function name, line number, variable name
• Abstraction on ftrace kprobes, alleviates usage problems,
supports also same syntax to specify probes
• Some options: --add, --del, --line, --list, --dry-run, --verbose,
--force
Example
Add probe, run a command, and display the data:
# ./perf probe -P 'p:myprobe @fs/read_write.c:285 file buf count'
Adding new event: p:myprobe vfs_read+57 file=%bx buf=%si count=%ax # ./perf list
...
rNNN [raw hardware event descriptor] kprobes:myprobe [Tracepoint event]
# ./perf record -f -e kprobes:myprobe:record -F 1 -a ls #./perf trace
[...]
perf-11445 [000] 95862.048894383: myprobe: (c04bbed5) file=dae15e80 buf=b78b2000 count=400 perf-11445 [000] 95862.049066533: myprobe: (c04bbed5) file=dae15d80 buf=b78b2000 count=400 [...]
LTTng
•
External to kernel tree
•
Maintainer Mathieu Desnoyers
•
http://lttng.org/
•
Userspace tracing via markers and tracepoints (static
instrumentation). Need to link with special library.
•
kprobes/tracepoints support for kernel tracing
•
GUI to view events (LTTV)
•
Eclipse integration
•
Included in some embedded Linux distributions
Systemtap
• External to kernel tree (started in 2005) • Multiple maintainers: Red Hat, IBM, others • http://sourceware.org/systemtap/wiki
• Kernel tracing using kprobes
• Userspace tracing requires in kernel infrastructure that is not in
mainline (only Red Hat based kernels) (utrace/uprobes)
• Dynamic probes and tracepoints
• Has well defined rich scripting language • GUI available
• Safety issues around modifying memory locations (guru mode) • Can use debuginfo (seen as drawback)
Kprobes: Dynamic Tracing
•
Added to the Linux kernel in 2005
•
Similar to breakpoints: place breakpoint instruction at
desired code location and execute actions when hit.
•
Optimizations to kprobes using Jumps instead of
exceptions.
Events Markers
• Static probe points in kernel code
• Independent of users (ftrace, perf, Lttng, systemap) • Many uses in the kernel (a few hundreds and growing) • TRACE_EVENT() macro with 6 arguments
• Definitions in include/linux/tracepoint.h
• Define characteristics of tracing actions (probe) using
TRACE_EVENT() in include/trace/events/*.h
• Mark tracing locations with function call trace_my_event_name()
• e.g. trace_sched_switch() in sched.c and
TRACE_EVENT(sched_switch,...) defined in sched.h
• Read the 3-article series:
• http://lwn.net/Articles/379903/
Utrace
• Systemtap provides userspace tracing based on utrace. Code in
RHEL/Fedora.
• Utrace: goal was to be replacement for ptrace. Status: development
abandoned. Fedora still ships it, used by a few other tools.
• See mailing list: https://www.redhat.com/mailman/listinfo/utrace-devel
• See code: http://redhat.com/~roland/utrace/
• Maintained by Oleg Nesterov
• Patch kept up to date with recent kernels, most recent is for kernel 3.1
Uprobes
• Uprobes: handle userspace breakpoints in kernel • Analogous to kprobes
• Uses breakpoint instruction • Posted many revisions • No longer based on utrace • No signals
• Multiple tracers allowed
• Ptrace replacement eventually?
• Still not approved
• Latest version of the patchset https://lkml.org/lkml/2011/9/20/123
• Integration of previous version of the patch with latest kernel F15 and F16
Tracing Infrastructure Unification?
• Goal: unify the kernel infrastructure for tracing
• There are separate ring buffers used by perf and ftrace
• Need more complex interfaces
• Need to be fast
• Simpler ring buffers in the kernel (kfifo) for other users also exist • Unification effort restarted many times:
http://lkml.org/lkml/2010/7/9/368
http://lkml.org/lkml/2010/8/17/485
• Some old benchmarks:
http://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-ringbuffer.git
• Status quo with multiple buffers is likely to persist. Too many differing needs. • Possible unification of ftrace and perf? Proposal of “trace” but stalled:
Dtrace
• A Solaris tool, available since 2005
• Ported to Linux by Oracle, initial release October 2011, version 0.1, still WIP • Code posted here http://oss.oracle.com/git/
• linux-2.6-dtrace-modules-beta.git
• linux-2.6-dtrace-unbreakable-beta.git
• Dtrace provider, syscall provider
• Next: static probes, then tackle userspace tracing
• Integrated with Oracle Unbreakable Enterprise Kernel, available on UEK
2.6.32 right now (0.1 version), moving to UEK2 version 2.6.39 from now on
• x86_64 only
• Kernel changes are GPL • Kernel Module is CDDL
In Summary
• Systemtap: Moving forward, but on its own. Provides userspace tracing.
Resistance by kernel community around architectural design decisions. Integration as is with upstream kernel unlikely?
• LTTng: Developed actively, but standalone. Provides userspace tracing. Code in
its own repo. Not in mainline kernel.
• Ftrace: continuing, thanks to trace-cmd and kernelshark. Kernel focused • Perf: most active project at this time. Kernel focused
• Unification of Buffers in kernel: stalled
• Scalability: in progress. See google's comments: http://lwn.net/Articles/464268/ • Utrace: development stalled/abandoned, maintenance only
• Uprobes: hopeful
• In kernel interpreter: maybe?
• tracepoints/static events: KABI status? • Integration of ftrace and perf? trace?
Thank You
•
Ananth Mavinakayanahalli (IBM)
•
Srikar Dronamraju (IBM)
•
Masami Hiramatsu (Hitachi)
•