Simulation Model Instrumentation Technique

As mentioned, the final goal of Di-DyMeLoR is to transparently support incremental state saving. While the aforementioned approach lets the simulation kernel transparently know where the simulation state of each LP is located, to detect which memory regions are accessed (in write mode) during the execution of an event we rely on the static software instrumentation facilities offered by Hijacker [123], which are thoroughly discussed in Appendix A.

Nevertheless, to clarify how Di-DyMeLoR updates its data structures whenever a memory-update operation takes place, we give here some insights into the instrumentation process, which specifically targets ELF (Executable and Linkable Format) [165, 105] objects generated by standard compilers for x86 and x86_64 architectures. At its very base, Hijacker works by parsing the object and by identifying every memory-write instruction inside this object, namely mov instructions with a memory location as the destination. The instrumentation process is then supported via the insertion of a call instruction to an update_tracker module (which is part of the Di-DyMeLoR subsystem). Its purpose is to identify the exact memory address and the size (number of bytes) involved in the memory-update operation.

Although this is a typical way of tracking memory update references (e.g., in the context of program debugging techniques [174]), the usage of this approach in optimistic simulation systems poses more stringent performance requirements. In particular, update_tracker should perform its job via very few machine instructions, in order to avoid a significant impact on event execution latency. To cope with such a performance target we have decided not to employ runtime disassembling of the memory reference instruction, which could be onerous (compared to the event execution latency of non-instrumented software), especially due to the complexity and variable format/length of the x86/x86_64 instruction set. Instead, we cache some of Hijacker's disassembly information into a table which is accessed by update_tracker. Therefore, most of the disassembly overhead is paid only once at compile time, leaving to the update_tracker module the task of gathering a reduced set of information which is only available at runtime.

In particular, x86/x86_64 architectures identify a memory address as the linear combination of (up to) five parameters, namely segment, base, index, scale and displacement, as depicted in Figure 4.3.

[Figure 4.3: x86/x86_64 addressing mode. segment (CS:, DS:, SS:, ES:, FS:, GS:) base (EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI) + index (EAX, EBX, ECX, EDX, EBP, ESI, EDI) * scale (1, 2, 4, 8) + [displacement]]

These parameters maintain the following information:

segment: a segment register. This is not directly specified in the instruction, yet the addressing mode will use the segment where the instruction currently being executed is located;

base: this value is stored into one of the 8/16 general-purpose registers of the processing unit^3 (which is therefore referred to as the base register) and is commonly used to represent a base address from which to compute the final memory location to access;

index: this value is stored into one of the general-purpose registers as well, hence being called the index register. It is commonly used to represent the index of an array, the base of which is stored into the base register or expressed by the displacement;

scale: this value, which can only be 1, 2, 4 or 8, is a multiplier of the index value. It is directly coded into the instruction's binary representation, and is used, e.g., to represent the width of the data items which compose the array being accessed;

displacement: this value, directly stored into the instruction's binary representation, is finally added to the outcome of the memory address evaluation.

It is clear that this complex addressing format simplifies the programmer's task when dealing with more complex data structures like structs or arrays. Yet, as hinted before, some portions of the addressing format can only be evaluated at runtime, namely the contents of the base and the index registers. All the other information can be gathered at compile time by looking at the instruction's opcode^4, which tells which of the four fields (the segment not being specified in the instruction) are relevant for the address evaluation and what the actual size of the involved memory operation is.

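Hence, cached data from the disassembly of one single instruction are organized as follows:

struct update_tracker_entry {
    unsigned int size;      /* size in bytes of the memory write */
    char flags;             /* which addressing fields are relevant */
    char base;              /* base register identifier */
    char index;             /* index register identifier */
    char scale;             /* multiplier applied to the index */
    long displacement;      /* constant added to the address */
};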

The flags field is used to identify which of the aforementioned four parameters are actually relevant and should be considered by update_tracker for computing the exact address for the memory-write operation. Also, the size field immediately indicates to update_tracker the (compile-time defined) size of the memory area to be dirtied by the current memory-write instruction.
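To make the role of the cached entry concrete, the address evaluation can be sketched in C as follows. This is only an illustration under stated assumptions, not the actual update_tracker code (which is an assembly-level module): the UT_HAS_* flag bits, the regs[] register snapshot and the effective_address name are hypothetical, since the flag encoding and the register identifiers are not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits: the actual encoding of the flags field is an assumption. */
#define UT_HAS_BASE  0x1
#define UT_HAS_INDEX 0x2
#define UT_HAS_SCALE 0x4
#define UT_HAS_DISP  0x8

struct update_tracker_entry {
    unsigned int size;
    char flags;
    char base;
    char index;
    char scale;
    long displacement;
};

/*
 * regs[] is a hypothetical snapshot of the general-purpose registers taken
 * at runtime, indexed by the identifiers stored in the base and index fields.
 * Everything else comes from the entry cached at compile time.
 */
static uintptr_t effective_address(const struct update_tracker_entry *e,
                                   const uintptr_t *regs)
{
    assert(e != NULL && regs != NULL);
    uintptr_t addr = 0;

    if (e->flags & UT_HAS_BASE)
        addr += regs[(unsigned char)e->base];

    if (e->flags & UT_HAS_INDEX) {
        uintptr_t idx = regs[(unsigned char)e->index];
        if (e->flags & UT_HAS_SCALE)
            idx *= (uintptr_t)e->scale;   /* scale is 1, 2, 4 or 8 */
        addr += idx;
    }

    if (e->flags & UT_HAS_DISP)
        addr += (uintptr_t)e->displacement; /* modular add also covers negative values */

    return addr;
}
```

With a base register holding 0x1000, an index register holding 5, a scale of 4 and a displacement of 8, the sketch yields 0x1000 + 5 * 4 + 8 = 0x101c; the size field of the entry then tells how many bytes starting from that address are dirtied.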

^4 For simplicity, we call opcode the actual opcode, the prefixes, the ModR/M and the SIB byte of the x86/x86_64 instruction set, which can be all present, or only some of them, depending on the actual instruction and addressing mode. We refer the reader to [72, 73] for a complete discussion of the instruction set.

We have two exceptions to this approach. One is for movs and stos instructions, used for moving arbitrary-size memory blocks. These instructions keep the information for identifying the destination address and the current size of the memory block being written in predefined registers, namely edi and ecx, which are directly accessible by update_tracker. The second one involves cmov instructions, which perform the memory update only if a particular condition is met. To cope with this specific case, we replace the cmov instruction with an assembly code block which mimics the execution of this instruction (i.e., it performs the equivalent check) and, if the condition is met, updates the memory by relying on a traditional mov instruction, which is in turn instrumented according to the same aforementioned policy. We note that this approach is sub-optimal with respect to the execution performance of the cmov instruction. Nevertheless, to mark the memory area as updated, we must necessarily evaluate the condition. At that point, relying again on the cmov to perform the memory update would add an additional (although minimal) cost, just to compute a value which is already available.
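The rationale behind the replacement of conditional updates can be rendered at the C level as follows. This is a loose analogy under stated assumptions, since the real transformation is applied by Hijacker to assembly code: track_write is a hypothetical stand-in for update_tracker which only counts its invocations, and conditional_store is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for update_tracker: it only counts tracked writes. */
static size_t tracked_writes = 0;

static void track_write(void *addr, size_t size)
{
    (void)addr;
    (void)size;
    tracked_writes++;
}

/*
 * The condition is evaluated exactly once. Only when it holds is the write
 * tracked and then carried out by a plain (unconditional) store, mirroring
 * the explicit check followed by an instrumented mov.
 */
static void conditional_store(int condition, long *dst, long value)
{
    assert(dst != NULL);
    if (condition) {
        track_write(dst, sizeof *dst);
        *dst = value;
    }
}
```

When the condition does not hold, neither the tracking nor the store takes place, so no memory area is spuriously marked as updated.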

Recalling that the execution of the update_tracker module is a performance-critical operation directly impacting the event-execution cost, we have adopted the following strategy for minimizing the performance overhead. In particular, for each mov instruction involving a memory update, a set of push instructions is injected before the actual call to the update_tracker module. The purpose of the push instructions is to let update_tracker find on the stack a memory area structured as struct update_tracker_entry, where the values of the fields describe the original mov instruction which caused the actual invocation of the module.

Upon its activation, update_tracker checks inside its own stack frame the information needed to compute at runtime the memory address and the size of the write operation. Given that this computation can unpredictably change the processor flags, the flags register is saved by update_tracker upon its activation together with the general-purpose ones, and is put back in place right before returning control to the memory-write instruction for which the tracking process has been activated.

In the memory model offered by Di-DyMeLoR, locations associated with automatic variables (allocated inside the stack) do not belong to the object memory map, since they do not survive across different invocations of the event handler. Hence, all those memory-write instructions that can be detected at compile time to access the stack (e.g., mov instructions addressing memory via base pointer or stack pointer displacement) are not actually instrumented, by relying on a special configuration rule. However, in some cases write accesses to the stack cannot be recognized at compile time. For this reason, after having computed the address for the memory-write operation, update_tracker compares it with the current value of the stack pointer. In case the access is an actual stack update, update_tracker simply returns. Otherwise, the information about the identified memory address and the size of the area being dirtied is passed to the memory map manager. This is done by invoking an internal routine which sets the dirty bit corresponding to the involved memory chunk(s). For this task, the software cache described in Section 4.1.1 is exploited again in order to perform a reverse query which translates a generic memory address into the chunk(s) actually containing the memory buffers and the associated malloc_area entry. This allows fast identification of the bitmap to be involved in the update operation. Additionally, dirty_area is set, and dirty_chunks is incremented by the number of chunks which were involved in the memory-update operation.
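The stack check and the dirty-bit update described above can be sketched in C as follows. This is a simplified illustration under several assumptions: only dirty_area and dirty_chunks are names taken from the text, while the other malloc_area fields, the function names, and the stack test (a downward-growing stack located above the tracked buffers, so that any address at or above the stack pointer counts as a stack access) are hypothetical. The reverse query through the software cache is omitted and the target malloc_area is passed in directly. The sketch also counts only chunks that were not already dirty, which is an assumption made to keep the counter aligned with the bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical view of a malloc_area entry. */
struct malloc_area {
    uintptr_t base;          /* address of the first chunk */
    size_t chunk_size;       /* bytes per chunk */
    size_t num_chunks;       /* chunks in this area */
    uint32_t *dirty_bitmap;  /* one bit per chunk */
    size_t dirty_chunks;     /* number of chunks currently dirty */
    int dirty_area;          /* set when at least one chunk is dirty */
};

/* Assumption: the stack grows downwards and lies above the tracked buffers. */
static int is_stack_access(uintptr_t addr, uintptr_t stack_pointer)
{
    return addr >= stack_pointer;
}

/* Set the dirty bit of every chunk overlapped by [addr, addr + size). */
static void mark_dirty(struct malloc_area *area, uintptr_t addr, size_t size)
{
    assert(area != NULL && size > 0 && addr >= area->base);
    size_t first = (addr - area->base) / area->chunk_size;
    size_t last = (addr - area->base + size - 1) / area->chunk_size;
    assert(last < area->num_chunks);

    for (size_t c = first; c <= last; c++) {
        uint32_t mask = (uint32_t)1 << (c % 32);
        if (!(area->dirty_bitmap[c / 32] & mask)) {
            area->dirty_bitmap[c / 32] |= mask;
            area->dirty_chunks++;   /* count only newly dirtied chunks */
        }
    }
    area->dirty_area = 1;
}

/* Entry point: stack writes are ignored, all other writes are recorded. */
static void track_write(struct malloc_area *area, uintptr_t addr,
                        size_t size, uintptr_t stack_pointer)
{
    if (is_stack_access(addr, stack_pointer))
        return;
    mark_dirty(area, addr, size);
}
```

For instance, with 16-byte chunks starting at 0x1000, a 4-byte write at 0x100e spans the first two chunks, so both bits are set and dirty_chunks becomes 2.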

While Hijacker is able to rebuild all the internal references between instructions and data, which are altered by the insertion of additional instructions in the original code, it cannot determine at compile time the destination of the so-called indirect branches (also referred to as register jumps), where the destination address is dynamically identified via the content of CPU registers. To cope with this issue, we have implemented a second run-time monitoring mechanism for supporting on-the-fly correction of destination addresses of indirect branches. As in the aforementioned approach, this mechanism is based on the insertion of a call instruction to a second assembly-level monitoring module, called branch_corrector, prior to each indirect branch in the original software. This monitoring module relies on a pushed data structure which is associated with a single register jump instruction, and keeps the information regarding which registers determine, through their values, the destination address for the jump operation. This process is carried out in a way similar to the one adopted for the generation of support information for update_tracker.

By exploiting the information in this data structure, branch_corrector evaluates the original destination address for the jmp instruction (by reading the CPU registers that specify the destination value). Then it corrects this address on the basis of the number of bytes by which the original destination was shifted inside the instrumented object layout. To provide a lightweight mechanism for address correction, we generate a table at compile time, which is visible to branch_corrector. Each entry inside this table identifies an interval of addresses for which the instrumentation process gave rise to the same amount of shift inside the final (instrumented) memory layout. Such an offset is also maintained in the table entry. The table is ordered by interval extremes, and branch_corrector performs a logarithmic-cost binary search to retrieve the interval containing the original destination for the register jump, and the offset to be applied for the correction.
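The table lookup can be sketched in C as follows. This is a minimal illustration, assuming a hypothetical entry layout (struct shift_interval with inclusive start and end addresses and a shift expressed in bytes) and a hypothetical function name; the actual table format generated at compile time is not detailed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: all addresses in [start, end] share the same shift. */
struct shift_interval {
    uintptr_t start;  /* first original address of the interval (inclusive) */
    uintptr_t end;    /* last original address of the interval (inclusive) */
    long shift;       /* bytes by which instrumentation moved these addresses */
};

/*
 * Binary search over a table sorted by interval extremes. On success the
 * corrected destination is stored in *corrected and 1 is returned;
 * 0 is returned if no interval contains the original address.
 */
static int correct_destination(const struct shift_interval *table, size_t n,
                               uintptr_t original, uintptr_t *corrected)
{
    assert(table != NULL && corrected != NULL);
    size_t lo = 0, hi = n;  /* search space is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (original < table[mid].start) {
            hi = mid;
        } else if (original > table[mid].end) {
            lo = mid + 1;
        } else {
            *corrected = original + (uintptr_t)table[mid].shift;
            return 1;
        }
    }
    return 0;
}
```

Since the intervals do not overlap and are kept sorted, each step halves the search space, which gives the logarithmic cost mentioned above.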

Note that the corrected destination cannot simply be written back into the CPU registers involved in the jmp instruction. This would result in an inconsistent processor state for the simulation model. We have rather adopted a different approach where the original indirect-branch instructions (whose relevant information is anyway logged inside the pushed data structures available to branch_corrector) are replaced at compile time by Hijacker with (regular) offset jumps (not relying on CPU registers), where the destination address is maintained inside one field of the instruction binary representation, and is appropriately set by the on-the-fly correction mechanism.

To support the rewrite operation of the appropriate instruction field at run-time, without impacting typical settings associated with memory protection, the indirect-branch instruction has been moved inside a run-time re-writable ELF section (specifically created by exploiting compiler facilities). Also, a jump-label instruction has been inserted in place of the offset jump inside the original (non-rewritable) section(s) of the application code, which passes control to the offset jump right after the branch_corrector module has re-written the correct destination address (the offset) inside the ad-hoc re-writable section. Of course, in case of simulation kernels which are implemented according to a multi-threaded scheme, the instrumentation process involves the generation of one new writeable section for each thread, to avoid race conditions. The whole instrumentation process is illustrated in Figure 4.4.

We note that efficient solutions for correcting register jumps (e.g., via the avoidance of run-time disassembling) have practical relevance since register jumps are typically generated by standard compilers for machine language translation of switch/case constructs [164], which are relevant in simulation applications for, e.g., flow control inside the event handler(s) on the basis of the type

[Figure 4.4: Simulation Model Instrumentation Process. The original executable contains a memory update (mov $3, x) and an indirect branch (jmp *%eax). In the final executable, the memory update is preceded by "push struct" and "call track", while the indirect branch is replaced by "push struct", "call corrector" and "jmp .Jump", where .Jump labels a regular jump (jmp 0xXXXX) placed in a new writeable section and modified by branch_corrector.]

In document Techniques for Transparent Parallelization of Discrete Event Simulation Models (Page 129-137)