6.1.4 Design and Implementation

We implemented our checkpointing solution in QEMU/KVM (§ 2.2.5). In this process, we added a new thread which periodically creates checkpoints by repeatedly performing the following sequence of operations:

0. Initialize checkpointing
1. Suspend VM
2. Save device states (e.g., CPU registers)
3. Collect dirty bits
4. Write protect dirty pages
5. Resume VM
6. Save and unprotect dirty pages
7. Sleep until next checkpoint
8. Perform pre-scan (optional)
9. Finalize checkpointing

Synchronous (downtime): steps 1 to 5
Asynchronous: steps 6 to 8
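The ordering of these operations can be illustrated with a small sketch. It only records which step runs when; the step names and the function `checkpoint_trace` are hypothetical and do not correspond to identifiers in QEMU or in our implementation.

```python
from typing import List

# Steps 1-5 run while the VM is suspended (downtime).
SYNC_STEPS = [
    "suspend_vm",
    "save_device_states",
    "collect_dirty_bits",
    "write_protect_dirty_pages",
    "resume_vm",
]
# Steps 6-7 run while the VM executes again.
ASYNC_STEPS = [
    "save_and_unprotect_dirty_pages",
    "sleep_until_next_checkpoint",
]


def checkpoint_trace(num_checkpoints: int, pre_scan: bool = False) -> List[str]:
    """Return the order of operations for the given number of checkpoints."""
    trace = ["initialize_checkpointing"]        # step 0, runs once
    for _ in range(num_checkpoints):
        trace.extend(SYNC_STEPS)                # steps 1-5
        trace.extend(ASYNC_STEPS)               # steps 6-7
        if pre_scan:
            trace.append("perform_pre_scan")    # step 8, optional
    trace.append("finalize_checkpointing")      # step 9, runs once
    return trace
```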

To control the checkpointing thread we extended the QEMU monitor with commands to start (start-cp) and stop (stop-cp) checkpointing, where the start command accepts various arguments to parametrize the process (e.g., frequency, dirty logging technique, etc.).

In the initialization step (0) the checkpointing thread establishes a connection to the storage backend (§ 6.2) and requests buffer space in memory for the first checkpoint. Afterward, the selected dirty logging technique starts. This includes the allocation of dirty bitmaps (one per memory region) in user and kernel space to track page modifications.¹² The thread then enters the checkpointing loop.

The next operation is to suspend (1) the virtual machine and wait for the vCPU threads to be kicked out of guest mode (if necessary via inter-processor interrupt) and block in user mode. Suspending the VM also completes all outstanding I/O operations so that the virtual machine is in a consistent state. This is when the downtime begins.

To save the device states (2) we use the existing migration code in QEMU. It serializes the states in a form that can easily be saved, transferred, and restored. However, we also found it to be rather slow, taking on average 3 ms to collect the 120 KiB of device data per checkpoint. Considering a median downtime of less than 10 ms, this reveals optimization potential for future versions.

Before actually being able to save modified memory pages, the checkpointing thread has to collect dirty logging information (3), i.e., the data on which pages have been modified in the last interval. For write protection, this includes the respective dirty bitmaps in user and kernel space. For scan and pre-scan, the thread additionally scans the EPT for dirty bits.¹³ In every case, gathering dirty information also includes resetting the corresponding bits so as to allow tracking page modifications in the next interval.
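The collect-and-reset pattern can be sketched as follows. This is a simplified, non-atomic illustration; the function name `collect_and_reset` and the bit layout (bit i of byte j stands for page 8·j + i) are assumptions made for the example, not the layout used by KVM or by our implementation.

```python
from typing import List, Set


def collect_and_reset(bitmaps: List[bytearray]) -> Set[int]:
    """Merge several dirty bitmaps into a set of page numbers and clear them.

    Assumed layout: bit i of byte j marks page 8 * j + i as dirty.
    """
    dirty: Set[int] = set()
    for bitmap in bitmaps:
        for byte_index, byte in enumerate(bitmap):
            if byte == 0:
                continue                      # nothing dirty in these 8 pages
            for bit in range(8):
                if byte & (1 << bit):
                    dirty.add(byte_index * 8 + bit)
            bitmap[byte_index] = 0            # reset for the next interval
    return dirty
```

For example, merging a user-space and a kernel-space bitmap yields the union of their dirty pages and leaves both bitmaps cleared.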

We use a dedicated bitmap-like data structure, the copy map, to merge dirty bits from the various sources and control the following asynchronous copy operation. The copy map reserves one byte for every guest memory page (i.e., 1 MiB for 4 GiB of VM RAM). This allows us to store three states per page: (a) clean (do not copy), (b) dirty, (c) currently copying. The last state synchronizes the asynchronous copy with a concurrent CoW fault to the same page. Every element in the copy map thus also functions as a spinlock. Since CoW faults may be caused by the vCPU threads in kernel space or QEMU’s I/O threads in user space, the copy map is accessible from both modes as shared memory.
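A minimal sketch of such a copy map is given below. It assumes 4 KiB pages, and the numeric state values as well as the names `CopyMap`, `merge`, and `copy_map_size` are illustrative choices rather than the identifiers of the actual implementation.

```python
from typing import Iterable, List

PAGE_SIZE = 4096                  # assumption: 4 KiB guest pages

# Assumed encoding of the three per-page states.
CLEAN, DIRTY, COPYING = 0, 1, 2


def copy_map_size(ram_bytes: int) -> int:
    """One byte per page: size of the copy map in bytes."""
    return ram_bytes // PAGE_SIZE


class CopyMap:
    def __init__(self, ram_bytes: int) -> None:
        # bytearray(n) is zero-filled, so every page starts out CLEAN.
        self.states = bytearray(copy_map_size(ram_bytes))

    def merge(self, dirty_pages: Iterable[int]) -> None:
        """Mark pages reported by any dirty-logging source as DIRTY."""
        for page in dirty_pages:
            self.states[page] = DIRTY

    def dirty_pages(self) -> List[int]:
        return [i for i, state in enumerate(self.states) if state == DIRTY]
```

With these assumptions, `copy_map_size(4 * 2**30)` evaluates to `2**20`, which matches the 1 MiB stated above for 4 GiB of VM RAM.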

Using the guest physical address space as the basis for the copy map is impractical because, just like in a real system, this address space is a sparse composition of various MMIO-, ROM-, and RAM-backed regions. The copy map therefore covers RAM blocks (see Figure 2.18 in § 2.2.5), which form a contiguous address space of all potentially accessible memory. This includes device memory such as VGA RAM but excludes the memory regions just mapping device registers. The content of these areas is preserved by saving the device states in the previous step. To perform incremental checkpointing, the first checkpoint has to capture the entire system state so that subsequent checkpoints only have to store deltas. For the first checkpoint, the copy map is thus initialized to all ones, indicating modification of all pages.
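The mapping from a sparse guest physical address space to a dense copy map index can be sketched as follows. The `RamBlock` fields and the example addresses are hypothetical and strongly simplified (each block is assumed to be mapped exactly once); they do not mirror QEMU's actual data structures.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGE_SIZE = 4096  # assumption: 4 KiB guest pages


@dataclass
class RamBlock:
    name: str
    guest_phys_start: int   # where the block appears in guest physical space
    size: int               # block size in bytes
    ram_offset: int         # start within the contiguous RAM-block space


def copy_map_index(blocks: List[RamBlock], guest_phys_addr: int) -> Optional[int]:
    """Translate a guest physical address into a copy map index.

    Returns None for addresses not backed by a RAM block (e.g., MMIO).
    """
    for block in blocks:
        start = block.guest_phys_start
        if start <= guest_phys_addr < start + block.size:
            offset = block.ram_offset + (guest_phys_addr - start)
            return offset // PAGE_SIZE
    return None


# Hypothetical layout: 8 pages of main RAM and 4 pages of VGA RAM that lie
# far apart in guest physical space but back to back in RAM-block space.
blocks = [
    RamBlock("ram", 0x0000_0000, 0x8000, 0x0000),
    RamBlock("vga", 0xF000_0000, 0x4000, 0x8000),
]
```

In this layout the copy map needs only 12 entries, although the two blocks are almost 4 GiB apart in the guest physical address space.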

After identifying which pages need to be copied for the current checkpoint, the checkpointing thread write protects all these pages (4) in the EPT. Since this does not prevent the user-mode QEMU process from writing to its own guest physical memory mapping, we further instrumented the respective write methods in QEMU. This allows us to vector accesses to the CoW fault handler if needed. It is now safe to resume the VM (5). This ends the downtime. All operations following this point happen asynchronously to the VM execution.

¹³ Note that KVM writes to the dirty bitmaps even when using scan-based dirty logging whenever a write triggers a page fault (e.g., CoW fault) or the instruction needs to be emulated.

The checkpointing thread continues by iterating through the copy map to find pages that need to be copied. When copying (6), the corresponding entry in the copy map is set accordingly to indicate this operation. If the concurrently executing VM attempts writing to the same page at this very time, the CoW fault spins on the copy map entry until the checkpointing thread has finished saving the page.¹⁴ The CoW fault then finds this page to be clean and permits the write attempt without any further interruption. The same applies to pages that have already been copied but whose write protection is still intact. If, however, the VM accesses a dirty page before the checkpointing thread, the CoW fault handler takes care of saving the page and marking it clean in the copy map. The checkpointing thread then simply skips the page. For fast access to the EPT, the checkpointing thread performs all copy operations in kernel space.
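The interplay between the checkpointing thread and the CoW fault handler can be sketched as a small state machine. The sketch below is a user-space model under several assumptions: a `threading.Lock` stands in for an atomic compare-and-swap on the copy map byte, appending to a list stands in for copying a page into the checkpoint buffer, and all names as well as the spin limit are hypothetical.

```python
import threading
from typing import List, Tuple

CLEAN, DIRTY, COPYING = 0, 1, 2   # assumed state encoding


class CopyMap:
    def __init__(self, num_pages: int) -> None:
        self.states = bytearray([DIRTY] * num_pages)
        self._lock = threading.Lock()   # stand-in for an atomic CAS

    def compare_and_swap(self, page: int, expected: int, new: int) -> bool:
        with self._lock:
            if self.states[page] == expected:
                self.states[page] = new
                return True
            return False

    def store(self, page: int, state: int) -> None:
        with self._lock:
            self.states[page] = state


def save_page(cmap: CopyMap, page: int, saved: List[Tuple[int, str]], who: str) -> bool:
    """Claim a dirty page, save it, and mark it clean.

    Returns True only if this caller performed the copy.
    """
    if not cmap.compare_and_swap(page, DIRTY, COPYING):
        return False                    # clean already, or someone else copies
    saved.append((page, who))           # placeholder for the actual page copy
    cmap.store(page, CLEAN)             # page would also be unprotected here
    return True


def checkpoint_thread(cmap: CopyMap, saved: List[Tuple[int, str]]) -> None:
    for page in range(len(cmap.states)):
        save_page(cmap, page, saved, "checkpointer")   # skips handled pages


def cow_fault(cmap: CopyMap, page: int, saved: List[Tuple[int, str]],
              max_spins: int = 1_000_000) -> bool:
    """Handle a write to a protected page; True means the write may proceed."""
    if save_page(cmap, page, saved, "cow"):
        return True                     # handler saved the page itself
    for _ in range(max_spins):          # spin while the page is being copied
        if cmap.states[page] == CLEAN:
            return True
    return False                        # threshold reached: checkpoint fails
```

Whichever party wins the transition from dirty to copying saves the page; the other one either skips it (checkpointing thread) or waits until it is clean (CoW fault handler).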

Our storage backend (§ 6.2) supplies the buffer space to which all checkpointed data is saved. This happens in segments of 64 MiB. When the buffer is exhausted, the checkpointing thread submits the filled segment and requests an empty one. This briefly interrupts the asynchronous copy, and the checkpointing thread returns to user mode to communicate with the storage backend. After receiving new buffer space, the checkpointing thread continues where it left off.
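The segment handling can be sketched as a simple bookkeeping class. Only the 64 MiB segment size is taken from the text; the class `SegmentWriter`, its fields, and the simplification that a record never spans two segments are assumptions for illustration.

```python
SEGMENT_SIZE = 64 * 2**20   # 64 MiB per segment, as stated in the text


class SegmentWriter:
    """Tracks buffer usage and counts how often a full segment is submitted."""

    def __init__(self, segment_size: int = SEGMENT_SIZE) -> None:
        self.segment_size = segment_size
        self.used = 0          # bytes written to the current segment
        self.submitted = 0     # number of filled segments handed back

    def write(self, nbytes: int) -> None:
        if self.used + nbytes > self.segment_size:
            # Buffer exhausted: submit the filled segment and request an
            # empty one (in the real system this involves a round trip to
            # user mode and the storage backend).
            self.submitted += 1
            self.used = 0
        self.used += nbytes
```

Ignoring any per-page metadata, one segment holds 16384 pages of 4 KiB, so the copy is interrupted once for every 64 MiB of dirty memory.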

When all dirty pages have been saved and unprotected, the checkpoint is complete and the checkpointing thread sleeps (7) until the next checkpoint, i.e., until the current interval ends. The sleep time is computed by taking the configured interval length and subtracting the total time for all asynchronous operations.¹⁵

In case checkpointing takes longer than permitted by the interval length, the sleep time becomes negative and we immediately take the next checkpoint.
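The sleep time computation amounts to a clamped subtraction. The function name and the use of seconds as the unit are assumptions for this sketch.

```python
def sleep_time(interval_s: float, async_time_s: float) -> float:
    """Time to sleep before the next checkpoint, in seconds.

    Only the asynchronous work is subtracted; the downtime is deliberately
    not included, so the guest always gets the chance to run between
    checkpoints. A negative remainder means the next checkpoint is taken
    immediately.
    """
    remaining = interval_s - async_time_s
    return max(0.0, remaining)
```

For a 1 s interval and 0.25 s of asynchronous work the thread sleeps 0.75 s; if the asynchronous work exceeds the interval, the result is 0 and the next checkpoint starts right away.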

If pre-scan (8) is active, the sleep time is shortened by the estimated pre-scan time. We use an exponential moving average for this purpose. The true interval length thus fluctuates by a few hundred microseconds around the set value. The (asynchronous) pre-scan basically works the same as the synchronous EPT scan during the downtime. However, because the VM is running concurrently and setting A/D bits in the EPT, collecting and resetting these bits needs to be done more carefully in order to avoid inconsistencies [235]. To be effective in shortening the scan in the downtime, we have to reset the access bits in higher EPT levels. At the same time, we must not reset the access bit without atomically flushing the TLB. Otherwise, writes to clean pages in the corresponding page table may not be reflected in a newly set access bit and go unnoticed. To mitigate this race, we first collect and reset all access bits in higher EPT levels. We then flush the TLB and use the collected information to direct the actual scan for dirty pages.
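Both ideas can be illustrated with a toy model. The smoothing factor `alpha`, the two-level table model, and all names are assumptions for this sketch; a real EPT has more levels, and the flush is a hardware operation rather than a callback.

```python
from typing import Callable, List, Set, Tuple


def update_ema(estimate: float, sample: float, alpha: float = 0.25) -> float:
    """Exponential moving average of the pre-scan duration (alpha is assumed)."""
    return alpha * sample + (1 - alpha) * estimate


class PageTable:
    """Toy leaf table: one upper-level access bit plus per-page dirty bits."""

    def __init__(self, num_pages: int) -> None:
        self.accessed = False
        self.dirty = [False] * num_pages


def guest_write(table: PageTable, page: int) -> None:
    table.accessed = True        # models the access bit set during the walk
    table.dirty[page] = True     # models the dirty bit set in the leaf entry


def pre_scan(tables: List[PageTable],
             flush_tlb: Callable[[], None]) -> Set[Tuple[int, int]]:
    # Step 1: collect and reset the access bits of the upper level.
    touched = []
    for index, table in enumerate(tables):
        if table.accessed:
            touched.append(index)
            table.accessed = False
    # Step 2: flush, so that later writes set the access bits again.
    flush_tlb()
    # Step 3: scan only the tables recorded in step 1 for dirty pages.
    dirty: Set[Tuple[int, int]] = set()
    for index in touched:
        table = tables[index]
        for page, is_dirty in enumerate(table.dirty):
            if is_dirty:
                dirty.add((index, page))
                table.dirty[page] = False
    return dirty
```

Because the access bits are collected before the flush and used only afterwards, a write that arrives during the scan sets a fresh access bit and is picked up by the next scan instead of being lost.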

When the user enters the stop-cp command, the checkpointing thread finalizes checkpointing (9) the next time it wakes up. This disables dirty logging, frees respective data structures, and closes the connection to the storage backend. Eventually, the checkpointing thread terminates.

¹⁴ Spinning also ends after reaching a threshold, but this causes the checkpoint to fail.

¹⁵ We do not include the downtime because, according to our performance model, the interval length should determine how long the guest may execute between checkpoints. This guarantees that the guest makes progress irrespective of the length of the downtime.

In document SimuBoost: Scalable Parallelization of Functional System Simulation (Page 130-133)