CHAPTER 7 ARTICLE 4 : HARDWARE TRACE RECONSTRUCTION OF RUN-
7.5 Illustrative Use-Cases
7.5.1 JIT Compiled Code
The recently improved extended Berkeley Packet Filter (eBPF) in the kernel is being increasingly adopted for traffic filtering, shaping and classification. Its versatile in-kernel virtual machine is also being used to improve trace aggregation and trace filtering. At the heart of eBPF's improvement is its JIT compiler [116]. Hardware trace reconstruction for eBPF code is therefore important for obtaining a complete execution flow. To demonstrate our FlowJIT technique on a JIT compiled case, we consider a scenario where a user executes code within a process using a userspace eBPF process VM. This is illustrative of the real-life use of eBPF to filter syscalls using seccomp-bpf in Chrome, or of userspace network packet filtering in tools like Wireshark. We use uBPF to emulate eBPF's behavior in userspace while hardware tracing is activated. As an example, our target process (P) executes a dynamically compiled eBPF code section (CSr), which is a simple loop that increments an integer 42 times. This section is distinct from the normal code of P (CSp), which comprises the executable code in its address space, such as its .text section and dynamically loaded libraries. As P executes, uBPF builds the CSr section as an identifiable function, complete with a function prologue and epilogue. The function is stored in a code cache whose access permissions are set to executable, and it is executed with a simple call instruction after the address of the code cache holding CSr has been loaded into the rax register on an x86 machine.
The way FlowJIT works in this context is illustrated in Algorithm 4. FlowJIT intercepts, in the kernel, the mprotect() call that the uBPF compiler makes to set executable permissions on the compiled bytecode, and clears the executable flag instead. When P later attempts to execute the code, an access page fault is induced, which FlowJIT handles by generating a synthetic event before restoring the executable flag so that P can continue. This FlowJIT event contains the timestamp, the virtual address of the code segment and the raw binary data of the JIT compiled code. Finally, the event is added to a memory-mapped buffer that is lazily copied to a userspace file.
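The interception flow described above can be mimicked in a few lines of userspace Python. This is only a toy model under stated assumptions, not the kernel implementation: the names `Region`, `JitEvent` and `FlowJitModel` are hypothetical, the "page fault" is simulated by a method call, and the check that EXEC was actually requested is an added assumption.

```python
import time
from dataclasses import dataclass, field

EXEC, WRITE = "EXEC", "WRITE"


@dataclass
class Region:
    """Hypothetical stand-in for the memory area holding JIT output."""
    address: int
    data: bytes
    anonymous: bool = True
    flags: set = field(default_factory=set)
    tracked: bool = False


@dataclass
class JitEvent:
    """Synthetic event: timestamp, virtual address and raw code bytes."""
    timestamp: int
    address: int
    image: bytes


class FlowJitModel:
    """Toy model of the interception logic (not kernel code)."""

    def __init__(self):
        self.events = []  # stands in for the memory-mapped event buffer

    def on_mprotect(self, region, requested):
        flags = set(requested)
        # Assumption: only anonymous regions asking for EXEC are tracked.
        if region.anonymous and EXEC in flags:
            flags.discard(EXEC)      # withhold EXEC so first execution faults
            region.tracked = True
        region.flags = flags         # continue with the (modified) mprotect

    def on_execute(self, region):
        if EXEC not in region.flags and region.tracked:
            # Simulated access fault on a tracked region: snapshot the code.
            self.events.append(
                JitEvent(time.monotonic_ns(), region.address, bytes(region.data))
            )
            region.flags.add(EXEC)   # restore EXEC, then resume execution
        elif EXEC not in region.flags:
            raise PermissionError("region is not executable")
        return True


model = FlowJitModel()
code_cache = Region(address=0x7F0000001000, data=b"\x55\x48\x89\xe5\xc3")
model.on_mprotect(code_cache, {EXEC})
model.on_execute(code_cache)  # first execution faults and records one event
model.on_execute(code_cache)  # later executions run without a new event
```

In this model a second event for the same address would only appear if the JIT compiler called mprotect() again on the region, which is how the multiple versions mentioned later in the text could arise.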
Algorithm 4
Userspace process P:
    mem ← malloc()
    CSr ← compile(bytecode, mem)
    mprotect(mem) with Flags = {EXEC, !WRITE}
In-kernel flowjit:
    On mprotect(mem, Flags):
        if (mem is anonymous) then
            Set Flags = {!EXEC}
            mem.tracked = 1
        end if
        Continue with original mprotect()
    if (P executes CSr from mem) then
        if ({EXEC} not in Flags and mem.tracked is 1) then
            page ← page_fault(mem)    ▷ Access fault on tracked VMA
            fault_handler(page)
            event ← build_event(page, timestamp, address(mem))
            list_dump(event)    ▷ Add event to list
            Set Flags = {EXEC}
            Continue P execution
        else if (Flags is {EXEC}) then
            Continue P execution
        end if
    end if

During the above execution, hardware tracing using Intel PT was enabled and the raw trace data was collected while the process P was executed. The decoded trace data contains two packets of interest: the Target Instruction Pointer (TIP) packet and the Taken-Not-Taken (TNT) packet. A TIP packet is generated when an indirect branch, exception or interrupt occurs. TNT packets are issued for conditional branches and loop instructions and contain one bit for each branch, indicating whether it was taken or not taken. Thus, in our case, when P tried to execute the runtime compiled CSr section by issuing a call %rax instruction, the processor generated a TIP packet carrying the IP of the CSr section. While decoding the trace data, the TNT packets are matched against the binaries of P and its associated libraries on disk to complete the instruction flow in the decoded trace. However, we observed that decoding failed when the address of the runtime compiled CSr section was encountered, as no memory image was associated with that address. This is illustrated by the missing association of trace packets (Tr) in Figure 7.5.
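The decoding failure can be pictured as a failed address lookup. The sketch below is a simplified model and not an Intel PT decoder: packets are reduced to hypothetical `(kind, payload)` tuples, `ImageMap` is an illustrative name, and every address is made up.

```python
from bisect import bisect_right


class ImageMap:
    """Address ranges backed by on-disk binaries (illustrative only)."""

    def __init__(self):
        self._starts = []   # sorted start addresses
        self._images = []   # (start, end, name), kept in the same order

    def add(self, start, size, name):
        idx = bisect_right(self._starts, start)
        self._starts.insert(idx, start)
        self._images.insert(idx, (start, start + size, name))

    def find(self, ip):
        """Return the image name covering ip, or None if nothing is mapped."""
        idx = bisect_right(self._starts, ip) - 1
        if idx >= 0:
            start, end, name = self._images[idx]
            if start <= ip < end:
                return name
        return None


def first_undecodable_tip(packets, images):
    """Return the first TIP target with no backing image, else None."""
    for kind, payload in packets:
        if kind == "TIP" and images.find(payload) is None:
            return payload
    return None


images = ImageMap()
images.add(0x400000, 0x2000, "P .text")             # made-up layout
images.add(0x7F0000000000, 0x200000, "libc.so")     # made-up layout

packets = [
    ("TIP", 0x401000),            # lands in P's own code: decodable
    ("TNT", "TN"),
    ("TIP", 0x7F5000001000),      # call %rax into the code cache
    ("TNT", "N" * 41 + "T"),      # loop branches that cannot be followed
]
missing_ip = first_undecodable_tip(packets, images)
```

Here `missing_ip` is the code cache address: the TNT bits that follow it cannot be interpreted until an image for that address is supplied.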
[Figure: process code (CSp) and eBPF code (CSr); hardware trace packets (Tr); FlowJIT query by IP returning the image at IP (Ir); CFG(CSr) with nodes 1 to 4; TNT bits N/T; Flow(CSr)]
Figure 7.5 FlowJIT retrieved runtime code image of eBPF (Ir) merged with failed decoding in hardware trace (Tr) to reconstruct flow of JIT compiled code

To overcome this decoding failure, we used the dumped FlowJIT events and queried the runtime compiled images with the IP retrieved from the TIP trace packet at which decoding failed. This IP corresponds to the virtual address, which serves as an index to the individual FlowJIT event in the dumped data. The timestamp is used to verify the version of the JIT copy (in case the runtime compiled code was updated multiple times). To reconstruct the flow of the retrieved image (Ir), we first generate the control flow graph representation of this image (CFG(CSr)) and then merge it with the associated trace packets Tr. In this example, nodes 1, 2 and 4 represent segments that end in an unconditional jump and are thus omitted from the trace Tr, whereas node 3 represents a conditional jump, which generates 42 TNT bits, the first 41 being N and the last being T. The TNT bits are mapped sequentially to the conditional branches in Ir, which yields the true flow of the hardware trace (Flow(CSr)). Therefore, using FlowJIT, the runtime JIT compiled eBPF code could be reconstructed successfully, without any dedicated API or invasive process, using only kernel support.
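The query and merge steps can be sketched as follows. Several details here are assumptions made for illustration: events are plain `(timestamp, address, image)` tuples, the version chosen is the latest one not newer than the trace timestamp, and the four-node CFG layout (node 2 as the loop body, node 3 as the exit test) is one plausible reading of the example rather than the exact graph of the compiled code.

```python
from bisect import bisect_right


def select_image(events, address, trace_ts):
    """Pick the JIT image recorded at `address` that was current at trace_ts.

    Assumed policy: the latest event whose timestamp is <= trace_ts.
    """
    versions = sorted(
        ((ts, img) for ts, addr, img in events if addr == address),
        key=lambda v: v[0],
    )
    idx = bisect_right([ts for ts, _ in versions], trace_ts)
    return versions[idx - 1][1] if idx else None


# Hypothetical CFG: "jump" nodes emit no trace packets, "cond" consumes one bit.
CFG = {
    1: ("jump", 2),      # prologue and counter initialisation
    2: ("jump", 3),      # loop body: increment the integer
    3: ("cond", 4, 2),   # (taken target, not-taken target)
    4: ("exit",),        # epilogue
}


def reconstruct_flow(cfg, entry, tnt_bits):
    """Walk the CFG, consuming one TNT bit at every conditional node."""
    bits = iter(tnt_bits)
    flow, node = [], entry
    while True:
        flow.append(node)
        kind, *targets = cfg[node]
        if kind == "exit":
            return flow
        if kind == "jump":
            node = targets[0]
            continue
        try:
            bit = next(bits)
        except StopIteration:
            raise ValueError("trace ended before the flow reached an exit")
        node = targets[0] if bit == "T" else targets[1]


tnt = "N" * 41 + "T"                 # 41 not-taken bits, then one taken
flow = reconstruct_flow(CFG, 1, tnt)
increments = flow.count(2)           # times the loop body was executed
```

With this layout the body node is visited once per TNT bit, so the reconstructed flow accounts for all 42 increments even though only node 3 leaves any evidence in the trace.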