7.2 Checkpoint Data Reduction

7.2.3 Device State Compression

In the deduplication and delta compression stages, the incoming data volume is, except for the first checkpoint, determined by the workload’s page modification rate, the interval length, and the actual page contents. It is thus difficult to predict the overall amount of data before actually executing the workload. For the RAM state map alone, the prediction is much simpler. Although the map does not include the miscellaneous other device states such as the CPU registers, even for a small 256 MiB VM it consumes over 4x the space of the other device states typically stored by QEMU. This makes the state map the largest item next to the actual guest memory contents.

We therefore focus on the compression of the RAM state map and resort to generic compression for all other device states, as illustrated in the overview of our data reduction pipeline in Figure 7.3 on page 139. However, because we also had large sparse devices such as disks¹³ in mind when designing the in-memory representation of state maps and their compression, some details go beyond the requirements of a RAM map.

Let P be the number of guest physical pages and L the checkpointing interval in seconds. Storing one 64-bit offset into the checkpoint database per guest physical page per checkpoint then generates $\frac{8}{L} \cdot P$ bytes per second. For our default VM with 4 GiB of RAM¹⁴, this amounts to over 80 MiB/s with L = 100 ms. This accumulates to 47 GiB for povray (75% of its total data) and over 150 GiB for SPECjbb (17%). With L = 1 s, on the other hand, the data rate is only 8 MiB/s, which is 7% of a Gigabit Ethernet link. As very large VM configurations with tens of gigabytes of RAM are less likely, we can conclude that in practice efficient device state compression is primarily important for short checkpointing intervals.
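The estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `state_map_rate` is ours and purely illustrative, and the page count is the one given in footnote 14.

```python
MIB = 1024 ** 2
PAGES = 1_052_704          # default 4 GiB VM incl. device memory (footnote 14)
GIGABIT_BYTES_PER_S = 1_000_000_000 / 8


def state_map_rate(pages: int, interval_s: float, entry_bytes: int = 8) -> float:
    """Bytes per second when one offset per page is stored per checkpoint."""
    return entry_bytes * pages / interval_s


rate_100ms = state_map_rate(PAGES, 0.1)
rate_1s = state_map_rate(PAGES, 1.0)

print(f"L = 100 ms: {rate_100ms / MIB:.1f} MiB/s")
print(f"L = 1 s:    {rate_1s / MIB:.1f} MiB/s "
      f"({rate_1s / GIGABIT_BYTES_PER_S:.0%} of Gigabit Ethernet)")
```

The rate scales linearly with P and inversely with L, which is why the state map only becomes a significant cost factor for short intervals.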

From an encoding perspective, the state map possesses a number of interesting properties:

1. Equal Offsets: Due to data deduplication, pages with the same contents map to the same offset in the checkpoint database. In addition, there are often contiguous ranges of pages with the same contents (e.g., zero pages) that all receive the same offset.

2. Small Offset Deltas: Pages that cannot be deduplicated are compressed and appended to the checkpoint database. The delta between the offsets of two consecutive pages in the state map is thus in most cases (i.e., without deduplication) very small: less than 4 KiB (+ metadata)¹⁵.

3. Data Alignment: Entries in the checkpoint database are aligned to 16 bytes. The low 4 bits of each offset are thus always 0, except for sparse regions (e.g., in disk maps) that hold the invalid offset 0xFFFFFFFFFFFFFFFF, referred to as INV_OFF in the following.

¹³ For disks, checkpoints store only modified sectors in reference to a base image.

¹⁴ That is, 1052704 pages including device memory regions such as the video buffer.

¹⁵ This depends on the processing order in the storage backend, which in turn is determined by the page submission – i.e., the order in which the VMM sends pages – and the non-deterministic multiprocessing. Nonetheless, we always submit pages in ascending order of their guest physical address. Furthermore, the backend processes at minimum 500 pages per job, among which the order is not affected by multiprocessing.

We devised a custom compression scheme called SimuBoost Device State Compression (SDS), which leverages these properties for dense encoding of state maps.

To save space for sparsely populated state maps, the actual in-memory data structure is a hierarchical two-level table, similar to a page table in virtual memory systems. In the default configuration, the directory table holds $2^{12}$ second-level tables, each covering $2^{16}$ 64-bit offsets into the checkpoint database, for a maximum device size of 1 TiB with 4 KiB pages. Just like with conventional page tables, each entry’s address in the address space of the corresponding device (e.g., the guest physical memory) can be inferred from its location in the table hierarchy.
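The two-level layout can be illustrated with a short sketch. This is not the SimuBoost implementation; the class `StateMap`, its method names, and the choice to report INV_OFF for entries in unallocated second-level tables are assumptions made for illustration. Only the table sizes and INV_OFF are taken from the text.

```python
from typing import List, Optional

INV_OFF = 0xFFFF_FFFF_FFFF_FFFF   # invalid offset from property (3)
DIR_BITS = 12                     # 2^12 second-level tables
TABLE_BITS = 16                   # 2^16 offsets per second-level table
PAGE_SIZE = 4096
MAX_PAGES = 1 << (DIR_BITS + TABLE_BITS)


class StateMap:
    """Sparse page-number -> database-offset map (illustrative only)."""

    def __init__(self) -> None:
        # Second-level tables are allocated lazily to keep sparse maps small.
        self.directory: List[Optional[List[int]]] = [None] * (1 << DIR_BITS)

    def set(self, page: int, offset: int) -> None:
        dir_idx = page >> TABLE_BITS
        tbl_idx = page & ((1 << TABLE_BITS) - 1)
        table = self.directory[dir_idx]
        if table is None:
            table = self.directory[dir_idx] = [INV_OFF] * (1 << TABLE_BITS)
        table[tbl_idx] = offset

    def get(self, page: int) -> int:
        table = self.directory[page >> TABLE_BITS]
        if table is None:
            return INV_OFF        # assumption: untouched regions are invalid
        return table[page & ((1 << TABLE_BITS) - 1)]


m = StateMap()
m.set(5, 0x1000)
print(hex(m.get(5)), m.get(70_000) == INV_OFF)
print(MAX_PAGES * PAGE_SIZE == 1 << 40)   # 1 TiB maximum device size
```

As with a page table, the page number is never stored: it follows from the directory index and the position within the second-level table.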

To simplify the design, the (de-)compression works at the granularity of the second-level tables. That means we seek a space-efficient encoding for an array of $2^{16}$ 64-bit offsets, leveraging the domain-specific knowledge presented in properties (1) to (3). The compressed representations of the individual tables are then simply concatenated to create the final output.

For the sake of brevity, we only cover the process of compression. Interested readers may consult the source code for details on decompression.

/simutrace/storageserver/simuboost/SimuBoost1DeviceState.cpp

Compression

A simple way to benefit from properties (1) and (2) is to use delta encoding. Instead of storing an array of absolute offsets, we only save the first offset in each table in absolute form. All following $2^{16} - 1$ offsets are translated so that they are relative to their respective predecessor in the table. The resulting deltas are in most cases much smaller values, which can be represented with fewer than 64 bits. An exception to this are deduplicated pages, where the relative offset may still span several tens of gigabytes. The zero page, for example, will probably be part of the first checkpoint and, in consequence, be located at the beginning of the checkpoint database file. If it is referenced by a state map of a much later checkpoint (i.e., many more pages added to the database), the delta for this entry will be a large negative number. The resulting array of small delta values is thus typically interrupted by sporadic large values jumping back and forth between deduplicated and new pages (and INV_OFF). Conversely, a contiguous range of identical offsets leads to a sequence of deltas with value 0. To efficiently encode such areas, we incorporate run-length encoding (RLE). Figure 7.13 summarizes the central compression loop.

[Figure 7.13 (flowchart). Nodes: Put O_0; Δ_1 = O_1 − O_0; l = 0; Table left?; Offset left?; Δ_i = O_i − O_{i−1}; Δ_i ≠ Δ_{i−1}?; Increment length l; Encode(Δ_i, l); l = 0.]

Figure 7.13: Overview of the SDS compression loop. O_i denotes the absolute offset in the current table at position i. Δ_i is the corresponding relative representation. l holds the length of a consecutive run of identical deltas. Put writes a value to the final output.
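The combination of delta encoding and RLE can be sketched as follows. This is a simplified reading of the loop, not the SDS source: the function names are hypothetical, deltas are kept as signed Python integers rather than 64-bit machine words, and the run counter holds the number of additional repetitions of a delta. The inverse function is included only to check that the transformation is lossless.

```python
from typing import List, Tuple

Run = Tuple[int, int]   # (delta, additional repetitions)


def delta_rle(offsets: List[int]) -> Tuple[int, List[Run]]:
    """Return the first absolute offset and run-length encoded deltas."""
    if not offsets:
        raise ValueError("table must not be empty")
    runs: List[Run] = []
    prev_delta = None
    repeats = 0
    for prev, cur in zip(offsets, offsets[1:]):
        delta = cur - prev
        if delta == prev_delta:
            repeats += 1                      # extend the current run
        else:
            if prev_delta is not None:
                runs.append((prev_delta, repeats))
            prev_delta, repeats = delta, 0    # start a new run
    if prev_delta is not None:
        runs.append((prev_delta, repeats))
    return offsets[0], runs


def undo_delta_rle(first: int, runs: List[Run]) -> List[int]:
    """Rebuild the absolute offsets (for verification only)."""
    out = [first]
    for delta, repeats in runs:
        for _ in range(repeats + 1):
            out.append(out[-1] + delta)
    return out


# Four deduplicated pages at 0x10, three freshly appended pages,
# then a jump back to the deduplicated page.
table = [0x10, 0x10, 0x10, 0x10, 0x5000, 0x5400, 0x5800, 0x10]
first, runs = delta_rle(table)
print(first, runs)
```

In the example, the run of identical offsets collapses into a single zero delta with a repeat count, while the jump back to the deduplicated page shows up as one large negative delta.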

In the next step, SDS tries to find a compact encoding for each delta. If LZ4 cannot further compress a page, the page’s final size in the database including metadata is 4144 bytes. We thus observe that most deltas are below this value. To further compress the representation of these deltas, we employ a dictionary that stores previously seen deltas. We use the fixed-size hash table described in § 7.2.1 for this purpose. A configuration with 64 buckets, incremental probing, and a chain length of 4 proved to be effective. Hits – called matches – can then be encoded by their position in the dictionary using only 6 bits ($2^6 = 64$). Misses, on the other hand, are directly written to the final output. The same applies to overly large deltas. We refer to both as literals.

[Figure 7.14 (flowchart). Nodes: Encode(Δ, l); Δ < 4144?; Δ in Dict?; Put Match(s, l); Add Δ to Dict; Put Literal(Δ, l). s = slot in Dict.]

Figure 7.14: Overview of the SDS encode function. Small delta values are matched against a dictionary of previously seen delta values. Hits are encoded as their position in the dictionary; misses and large deltas are written as literals.
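The match/literal decision can be sketched as below. The hash table of § 7.2.1 is not reproduced here: the hash function, the eviction rule, and the class and function names are assumptions chosen for illustration. Treating negative deltas as large is also an assumption. Only the threshold of 4144, the 64 buckets, and the chain length of 4 come from the text.

```python
from typing import List, Optional, Tuple

MAX_SMALL_DELTA = 4144   # size of an incompressible page incl. metadata
BUCKETS = 64             # slot index fits into 6 bits
CHAIN = 4                # probe at most 4 consecutive slots

Token = Tuple[str, int, int]   # (kind, slot or delta, run length)


class DeltaDict:
    """Fixed-size dictionary of previously seen deltas (illustrative)."""

    def __init__(self) -> None:
        self.slots: List[Optional[int]] = [None] * BUCKETS

    def _probe(self, delta: int) -> List[int]:
        start = (delta >> 4) % BUCKETS          # hypothetical hash function
        return [(start + i) % BUCKETS for i in range(CHAIN)]

    def find(self, delta: int) -> Optional[int]:
        for slot in self._probe(delta):
            if self.slots[slot] == delta:
                return slot
        return None

    def add(self, delta: int) -> None:
        probe = self._probe(delta)
        for slot in probe:
            if self.slots[slot] is None:
                self.slots[slot] = delta
                return
        self.slots[probe[0]] = delta            # assumption: evict home slot


def encode(delta: int, run: int, dictionary: DeltaDict) -> Token:
    """Return a match for known small deltas, otherwise a literal."""
    if 0 <= delta < MAX_SMALL_DELTA:
        slot = dictionary.find(delta)
        if slot is not None:
            return ("match", slot, run)
        dictionary.add(delta)                   # remember it for later hits
    return ("literal", delta, run)


d = DeltaDict()
tokens = [encode(1024, 0, d), encode(1024, 3, d), encode(1 << 35, 0, d)]
print(tokens)
```

The first occurrence of a small delta is emitted as a literal and entered into the dictionary; later occurrences become matches. Large deltas bypass the dictionary entirely, so they cannot displace the frequently recurring small values.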

The last step is to write the matches and literals to the final output. In this process, SDS discards the low 4 bits, which are always 0 according to property (3). Since the majority of literals are usually smaller than $2^{16}$, we provide separate encodings for short and long literals, consuming 2 bytes and 4 bytes, respectively. Matches are encoded using 1 byte only. In each case, a run-length extension of 2 bytes can be added to express up to $2^{16}$ repetitions. See Figures B.3 and B.4 for more details on the encoding format.
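To get a feeling for the resulting output size, the byte counts stated above can be summed per token. The sketch below does not reproduce the bit layout from Figures B.3 and B.4; the token names, the helper functions, and the convention that a run value counts additional repetitions are assumptions for illustration.

```python
from typing import Iterable, Tuple

TOKEN_BYTES = {"match": 1, "short_literal": 2, "long_literal": 4}
RUN_EXTENSION_BYTES = 2   # optional run-length extension
RAW_ENTRY_BYTES = 8       # one uncompressed 64-bit offset

Token = Tuple[str, int]   # (kind, additional repetitions)


def encoded_size(tokens: Iterable[Token]) -> int:
    """Sum the output bytes, adding the extension only for actual runs."""
    total = 0
    for kind, run in tokens:
        total += TOKEN_BYTES[kind]
        if run > 0:
            total += RUN_EXTENSION_BYTES
    return total


def entries_covered(tokens: Iterable[Token]) -> int:
    """Number of state map entries the tokens stand for."""
    return sum(run + 1 for _, run in tokens)


stream = [("match", 0), ("match", 120), ("short_literal", 0), ("long_literal", 0)]
print(encoded_size(stream), "bytes for",
      entries_covered(stream) * RAW_ENTRY_BYTES, "raw bytes")
```

The example shows why long runs of identical deltas dominate the savings: a single 3-byte token replaces 121 raw entries.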

SDS Compression Ratio and Time

To get an impression of SDS’s performance, we contrast its compression ratio and time with those achieved by generic LZ4. Since a key element of SDS is its delta encoding, we further perform a generic LZ4 compression of the deltas (LZ4+Delta). This allows us to better discern which aspect of SDS is responsible for performance differences. This time, we include the first checkpoint in the benchmarks because checkpoints inherit the state maps from previous checkpoints anyway.

[Figure 7.15 (bar charts). Series: LZ4, LZ4+Delta, SDS. Workloads: postmark, specjbb, kernel build, sqlite, apache, encode-mp3, pybench, povray, phpbench, idle. (a) L = 1 s, y-axis: Compressed Map Size [MiB]; bar annotations: -56%, -56%, -70%, -65%, -68%, -45%, -65%, -66%, -57%, -64%. (b) L = 1 s, y-axis: Compression Time [ms]; bar annotations: -28%, -25%, -27%, -24%, -14%, -36%, -15%, +6%, -31%, -21%.]

Figure 7.15: (a) SDS achieves on average 61% higher reduction in size than LZ4 and still 39% higher than LZ4+Delta. The 8 MiB state maps are compressed down to between 256 KiB and 1 MiB, depending on the workload, which corresponds to compression ratios between 32:1 and 8:1. (b) At the same time, SDS is on average 21% faster than LZ4. Improvements are especially high for heavyweight workloads with complex state maps.

The results show that SDS delivers substantially higher compression than LZ4 in less time (see Figure 7.15). Only for pybench is the compression with SDS slightly slower on average. The comparison with LZ4+Delta reveals that around half of the improvement in size reduction can be attributed to the delta encoding. The other half stems from the dense encoding. As LZ4 also supports run-length encoding, we do not assume RLE to be a major factor. The results for the compression time demonstrate that the delta encoding does not inherently translate to a faster generic encoding. We can thus conclude that the custom encoding scheme in SDS is responsible for the advantage in compression time.

In document SimuBoost: Scalable Parallelization of Functional System Simulation (Page 159-163)