JIT Compiled Code - Illustrative Use-Cases

CHAPTER 7 ARTICLE 4 : HARDWARE TRACE RECONSTRUCTION OF RUN-

7.5 Illustrative Use-Cases

7.5.1 JIT Compiled Code

Recently improved and extended, the Berkeley Packet Filter (eBPF) in the kernel is increasingly being adopted for traffic filtering, shaping and classification. Its versatile in-kernel virtual machine is also used to improve trace aggregation and trace filtering. At the heart of these improvements is eBPF’s JIT compiler [116]. Reconstructing hardware traces of eBPF code is therefore important for obtaining a complete execution flow. To demonstrate our FlowJIT technique on a JIT compiled case, we consider a scenario where a user executes code within a process using a userspace eBPF process VM. This is illustrative of real-life uses of eBPF, such as filtering syscalls with seccomp-bpf in Chrome or userspace network packet filtering in tools like Wireshark. We use uBPF to emulate eBPF’s behavior in userspace while hardware tracing is activated. As an example, our target process (P) executes a dynamically compiled eBPF code section (CSr), which is a simple loop that increments an integer 42 times.

This section is distinct from the normal code of P (CSp), which comprises the executable code in its address space, such as its .text section and dynamically loaded libraries. As P executes, uBPF builds the CSr section as an identifiable function, complete with a function prologue and epilogue. The function is stored in a code cache whose access permissions are set to executable, and it is executed with a simple call instruction after the address of the code cache holding CSr is loaded into the rax register on an x86 machine.

The way FlowJIT works in this context is outlined in Algorithm 4. The uBPF compiler calls mprotect() to set executable permissions on the memory holding the compiled bytecode. FlowJIT intercepts this call in the kernel and clears the executable flag instead, so that the first attempt to execute the region induces an access page fault. FlowJIT handles this fault, generates a synthetic event and restores the executable permission so that P can continue. The FlowJIT event contains the timestamp, the virtual address of the code segment and the raw binary data of the JIT compiled code. Finally, the event is added to a memory-mapped buffer which is lazily copied to a userspace file.
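The interception and fault-handling sequence of Algorithm 4 can be modelled in a few lines of userspace Python. This is only a sketch of the control logic, not kernel code: the names Region, FlowJITEvent, FlowJITModel, on_mprotect and on_execute are hypothetical, the check that EXEC was actually requested is an assumption drawn from the prose, and the timestamp source is a stand-in for whatever clock the tracer uses.

```python
import time
from dataclasses import dataclass, field

EXEC, WRITE = "EXEC", "WRITE"


@dataclass
class Region:
    """Toy stand-in for an anonymous memory mapping (not a kernel structure)."""
    address: int
    data: bytes
    anonymous: bool = True
    flags: set = field(default_factory=set)
    tracked: bool = False


@dataclass(frozen=True)
class FlowJITEvent:
    timestamp: int   # when the access fault was handled
    address: int     # virtual address of the code segment
    code: bytes      # raw copy of the JIT compiled code


class FlowJITModel:
    def __init__(self):
        self.events = []

    def on_mprotect(self, region, requested):
        """Intercept mprotect(): withhold EXEC on anonymous memory and track it."""
        flags = set(requested)
        if region.anonymous and EXEC in flags:   # assumption: only EXEC requests matter
            flags.discard(EXEC)
            region.tracked = True
        region.flags = flags                     # continue with the (modified) mprotect

    def on_execute(self, region):
        """Model P jumping into the region; return True if a fault was taken."""
        if EXEC not in region.flags and region.tracked:
            event = FlowJITEvent(time.monotonic_ns(), region.address, bytes(region.data))
            self.events.append(event)            # add event to the dump list
            region.flags.add(EXEC)               # restore EXEC so P can continue
            return True
        return False                             # already executable: no fault


model = FlowJITModel()
cs_r = Region(address=0x7F0000001000, data=b"\x90\xc3")  # placeholder bytes
model.on_mprotect(cs_r, {EXEC})          # JIT compiler asks for EXEC, !WRITE
first_fault = model.on_execute(cs_r)     # first call faults and is recorded
second_fault = model.on_execute(cs_r)    # later calls run without a fault
```

In this model only the first execution after each mprotect() call produces an event, which mirrors the intent that one copy of the code is captured per compiled version.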

Algorithm 4

    Userspace process P:
        mem ← malloc()
        CSr ← compile(bytecode, mem)
        mprotect(mem) with Flags = {EXEC, !WRITE}

    In-kernel flowjit:
        On mprotect(mem, Flags):
            if (mem is anonymous) then
                Set Flags = {!EXEC}
                mem.tracked = 1
            end if
            Continue with original mprotect()

        if (P executes CSr from mem) then
            if ({EXEC} not in Flags and mem.tracked is 1) then
                page ← page_fault(mem)        ▷ Access fault on tracked VMA
                fault_handler(page)
                event ← build_event(page, timestamp, address(mem))
                list_dump(event)              ▷ Add event to list
                Set Flags = {EXEC}
                Continue P execution
            else if (Flags is {EXEC}) then
                Continue P execution
            end if
        end if

During the above execution, hardware tracing using Intel PT was enabled and the raw trace data was collected while the process P executed. The decoded trace data contains two packets of interest: the Target Instruction Pointer (TIP) packet and the Taken-Not-Taken (TNT) packet. A TIP packet is generated when an indirect branch, exception or interrupt occurs. TNT packets are issued for conditional branch and loop instructions and contain one bit for each branch taken or not taken. Thus, in our case, when P tried to execute the runtime compiled CSr section by issuing a call %rax instruction, the processor generated a TIP packet carrying the IP of the CSr section. While decoding the trace data, the TNT packets are associated with the binaries of P and its associated libraries on disk to complete the instruction flow in the decoded trace. However, we observed that, when the address of the runtime compiled CSr section was encountered, the decoding failed as no memory image was associated with that address. This is illustrated by the missing association of trace packets (Tr) in Figure 7.5.
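The decoding failure can be pictured as a failed address lookup. The sketch below assumes the decoder keeps a list of address ranges backed by on-disk binaries and resolves each TIP address against it. The Image type, the find_image helper and all addresses are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Image:
    name: str
    start: int   # first mapped address
    end: int     # one past the last mapped address


def find_image(images, ip) -> Optional[Image]:
    """Return the on-disk image whose range covers ip, or None if there is none."""
    for img in images:
        if img.start <= ip < img.end:
            return img
    return None


# Ranges backed by files on disk (made-up addresses).
on_disk = [
    Image("P .text", 0x400000, 0x452000),
    Image("libc.so.6", 0x7F1200000000, 0x7F12001C0000),
]

tip_ip_static = 0x401136        # target inside P's .text: resolvable
tip_ip_jit = 0x7F1300001000     # target inside the anonymous code cache

static_hit = find_image(on_disk, tip_ip_static)
jit_hit = find_image(on_disk, tip_ip_jit)   # None: no image, so decoding stops
```

The second lookup returns None because the code cache is anonymous memory with no file behind it, which is the gap the FlowJIT events are meant to fill.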

Figure 7.5 FlowJIT retrieved runtime code image of eBPF (Ir) merged with failed decoding in hardware trace (Tr) to reconstruct flow of JIT compiled code. [Diagram labels: process code CSp; eBPF code CSr; hardware trace packets Tr; FlowJIT query by IP; image Ir at IP; CFG(CSr) with nodes 1 to 4; TNT bits N and T; Flow(CSr).]

To overcome this decoding failure, we used the dumped FlowJIT events and queried the runtime compiled images based on the IP retrieved from the TIP trace packet at the point where decoding failed. This IP corresponds to the virtual address that is used as an index to access the individual FlowJIT event in the dumped data. The timestamp is used to verify the version of the JIT copy (in case the runtime compiled code was updated multiple times). To reconstruct the flow of the retrieved image (Ir), we first generate the control flow graph representation of this image (CFG(CSr)) and then merge it with the associated trace packets Tr. For this example, nodes 1, 2 and 4 represent segments that end in an unconditional jump and are thus omitted from the trace Tr, whereas node 3 represents a conditional jump which generates 42 TNT bits, the first 41 being N and the last being T. The TNT bits are mapped sequentially to the conditional branches in Ir, which yields the true flow of the hardware trace (Flow(CSr)). Therefore, using FlowJIT, runtime JIT compiled eBPF code could be reconstructed successfully, using only kernel support, without any dedicated API or invasive process.
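The query and merge steps can be sketched as follows. This is an illustration under stated assumptions rather than the actual decoder: JitEvent, query_image and reconstruct_flow are hypothetical names, the rule "pick the latest copy recorded at or before the failure time" is one plausible reading of how the timestamp selects a version, and the edges of the four-node CFG are assumed (the text only states which nodes are conditional).

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class JitEvent:
    timestamp: int
    address: int
    code: bytes


def query_image(events, ip, fail_ts):
    """Latest event recorded at address ip with timestamp <= fail_ts, else None."""
    versions = sorted((e for e in events if e.address == ip),
                      key=lambda e: e.timestamp)
    idx = bisect.bisect_right([e.timestamp for e in versions], fail_ts)
    return versions[idx - 1] if idx else None


# Assumed CFG of the retrieved image: node -> (kind, successor(s)).
CFG = {
    1: ("jmp", 2),                   # prologue and setup
    2: ("jmp", 3),                   # loop body: increment
    3: ("jcc", {"T": 4, "N": 2}),    # loop test: taken leaves the loop
    4: ("ret", None),                # epilogue
}


def reconstruct_flow(cfg, entry, tnt_bits):
    """Walk the CFG, consuming one TNT bit at every conditional node."""
    bits = iter(tnt_bits)
    flow, node = [], entry
    while node is not None:
        flow.append(node)
        kind, target = cfg[node]
        if kind == "jcc":
            bit = next(bits, None)
            if bit is None:
                raise ValueError("trace ended before the flow completed")
            node = target[bit]
        else:
            node = target            # unconditional: no trace packet needed
    return flow


CODE_CACHE = 0x7F1300001000          # made-up address of the code cache
events = [
    JitEvent(100, CODE_CACHE, b"v1"),
    JitEvent(250, CODE_CACHE, b"v2"),
    JitEvent(180, 0x7F1300009000, b"other"),
]
image = query_image(events, CODE_CACHE, fail_ts=200)   # selects the copy "v1"
flow = reconstruct_flow(CFG, entry=1, tnt_bits="N" * 41 + "T")
```

With the assumed edges, the 41 N bits send execution back to the loop body and the final T bit leaves the loop, so the body node appears 42 times in the reconstructed flow, matching the 42 increments of the example.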

In document Low-Impact System Performance Analysis Using Hardware Assisted Tracing Techniques (Page 134-136)