3.3 The CHERI Concentrate (CC) and 64-bit capability encoding
3.3.1 Balancing between precision and memory fragmentation
It is safe for previous CHERI-128 work to largely ignore the memory fragmentation problem, since the number of precision bits (bits in the top and base fields) is sufficiently high. With 20-bit precision, CHERI-128 can precisely represent objects of up to 1 MiB. For objects that are aligned to a page boundary (e.g., mmap regions, file system blocks), CHERI-128 can precisely describe objects of up to 4 GiB, so the memory fragmentation issue is negligible. However, this assumption no longer holds when developing a 64-bit capability format for 32-bit addresses.
Before the design of the CC-64 encoding, I would like to understand and evaluate the memory fragmentation problem to see the impact of high compression. This evaluation helps to make an informed decision in later sections.
CHAPTER 3. A 64-BIT COMPRESSED CAPABILITY SCHEME FOR EMBEDDED SYSTEMS
The fragmentation problem
Compressed capabilities cannot precisely represent all bounds. Memory allocators need to ensure that compressed bounds do not improperly allow access to adjacent objects. This manifests itself as increased memory fragmentation because memory allocators may need to overallocate space and overalign objects to enforce this.
[Figure 3.5 diagram: memory cells 0 to 15; labels: requested, malloc'ed, 2nd object, 3rd object, internal fragmentation, external fragmentation]
Figure 3.5: Internal vs. external fragmentation
Figure 3.5 illustrates memory fragmentation caused by the loss of precision, assuming a model in which capabilities are only able to represent power-of-two sizes and power-of-two aligned addresses. The figure demonstrates two typical categories:
• Internal fragmentation, caused by rounding up and padding objects so that a compressed capability can represent them.
• External fragmentation. Objects have to start from aligned addresses instead of being tightly packed.
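The two categories can be separated in a small sketch of the power-of-two model. This is illustrative only and does not reproduce the exact layout of Figure 3.5: the function names, the back-to-back placement policy, and the assumption that each object is aligned to its own rounded size are all hypothetical choices made for the example.

```python
def next_pow2(x: int) -> int:
    """Smallest power of two that is >= x (x must be >= 1)."""
    return 1 << (x - 1).bit_length()


def place_pow2(requests):
    """Place objects back to back under the power-of-two model.

    Assumption: each object is aligned to its own rounded size.
    Returns (placements, internal, external), where placements is a list
    of (base, allocated_size) pairs and the other two count wasted cells.
    """
    cursor = 0          # first free cell
    internal = 0        # padding inside allocations
    external = 0        # gaps skipped to reach an aligned base
    placements = []
    for size in requests:
        alloc = next_pow2(size)
        base = -(-cursor // alloc) * alloc   # round cursor up to alignment
        external += base - cursor
        internal += alloc - size
        placements.append((base, alloc))
        cursor = base + alloc
    return placements, internal, external


placements, internal, external = place_pow2([3, 2, 5])
print(placements, internal, external)
```

For the hypothetical requests of 3, 2 and 5 cells, the third object must skip two cells to reach an 8-aligned base (external fragmentation), while rounding 3 to 4 and 5 to 8 wastes four cells inside the allocations (internal fragmentation).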
In practice, different forms of allocation can manifest internal fragmentation, external fragmentation, or both. The two most common allocation patterns in C are of interest here, namely stack and heap allocations. Heap fragmentation is mostly internal, as the gaps can usually be filled with small allocations later. However, the C stack differs substantially due to its Last-In-First-Out (LIFO) nature. It is impossible to reuse externally fragmented space across stack frames, which exposes stack allocation to both internal and external fragmentation. These two allocation patterns best demonstrate how much memory overhead a compressed capability encoding incurs at a given precision, and will be investigated in this study.
Evaluation methodology
I evaluate fragmentation by collecting memory allocation traces and replaying them under different precisions. I thank Jonathan Woodruff for kindly providing these traces. The heap traces are from six real-world applications (Chrome 38.0.2125, Firefox 31, Apache 2.4, iTunes 12, MPlayer build #127, MySQL 5). Chrome and Firefox are traced when viewing pre-determined web pages on BBC, Facebook and Gmail. MPlayer plays Big Buck Bunny in h264@1080p. iTunes plays pre-determined trailers from its video store. Apache and MySQL work as the frontend and backend respectively for a web server, which is traced when running the Apache HTTP server benchmarking tool, ab. DTrace is used to intercept function calls from these applications and to filter out the malloc()-related ones. To take care of nested allocators, the initial trace is processed so that a stack of *malloc() calls only shows up as one entry in the final trace file. For stack allocations, we instead compile SPEC-CPU2006 benchmarks (bzip, gobmk, mcf, sjeng and a synthetic random benchmark) and intercept CSetBounds instructions on the stack to locate stack objects. We cannot reuse the real-world applications for stack tracing as they are currently too large to be ported to CHERI, and stack tracing relies on observing CSetBounds instructions. For all SPEC benchmarks, we use the reference datasets as input workloads.
I extend two commonly used allocators, dlmalloc() [40] and jemalloc() [28], with additional rounding and alignment routines to ensure that every memory chunk returned by malloc() is precisely representable by a capability and is placed at a sufficiently aligned address. The extended allocators are also parameterisable, capable of handling any precision (the number of bits in the start and end index in Low-fat terminology) from 1 to 64. For the stack traces, I build a stack allocation simulator in C++ to simulate stack object placement and stack frames. Again, the simulator is aware of capability precisions and can be tuned from 1 to 64 bits. The extended heap allocators report the average internal fragmentation, as external fragmentation is less interesting in this case, whereas the stack simulator reports the peak stack size, capturing both internal and external fragmentation.
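The rounding and alignment step, and the way a stack replay turns it into a peak size, can be sketched as follows. This is a simplified model and not the code of the extended allocators or of the C++ simulator: it assumes that an object under exponent E must have a size and base that are multiples of 2^E, where E is the smallest exponent for which the size fits in 2^(n+E) bytes. The names `round_to_precision` and `peak_stack_size`, the per-object push/pop trace format, and the sample trace are hypothetical.

```python
def round_to_precision(size: int, n: int):
    """Return (rounded_size, alignment) for an n-bit precision.

    Assumed model: pick the smallest exponent E >= 0 such that
    size <= 2**(n + E); size and base must be multiples of 2**E.
    """
    if size < 1 or n < 1:
        raise ValueError("size and n must be positive")
    e = max(0, (size - 1).bit_length() - n)
    block = 1 << e
    rounded = -(-size // block) * block   # round up to a whole block
    return rounded, block


def peak_stack_size(trace, n: int) -> int:
    """Replay a LIFO trace of ("push", size) / ("pop", 0) events.

    Gaps skipped for alignment are never reused while the object above
    them is live, so they count towards the peak.
    """
    sp = 0
    peak = 0
    saved = []                       # stack pointer before each push
    for op, size in trace:
        if op == "push":
            rounded, align = round_to_precision(size, n)
            base = -(-sp // align) * align
            saved.append(sp)
            sp = base + rounded
            peak = max(peak, sp)
        elif op == "pop":
            sp = saved.pop()
        else:
            raise ValueError(f"unknown event {op!r}")
    return peak


# Hypothetical trace: three objects, one popped, then a larger object.
trace = [("push", 24), ("push", 100), ("push", 9), ("pop", 0),
         ("push", 300), ("pop", 0), ("pop", 0), ("pop", 0)]
for n in (1, 3, 64):
    print(n, peak_stack_size(trace, n))
```

On this made-up trace the peak grows from 424 bytes at full precision to 512 bytes at n = 3 and 768 bytes at n = 1, with most of the increase coming from the padding and alignment of the single large object.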
Results
The results are presented in Figures 3.6 and 3.7. For comparison, I added the original CHERI-128 (20 bits) to the graphs.
M-Machine struggles with power-of-two bounds (equivalent to 1-bit precision) as the heap overhead approaches 30-40% in all applications. The fragmentation quickly drops with more precision, and Low-fat (6 bits) already shows tolerable overhead.
All the curves are almost flat after 8 bits, and CHERI-128 has no problem precisely representing almost every object. The results also confirm that heap allocators already round up and align objects to ease management, as even the perfect precision shows some internal fragmentation. jemalloc() apparently has stronger alignment requirements due to its need to accommodate paged memory in modern machines.
The nature of the stack exacerbates the problem for low precisions. Both the results and visualisations of the stack allocations suggest that external fragmentation contributes to the high peak stack sizes. The total size of stack allocations is usually small for each benchmark (under 64 KiB), so even a 200% overhead would not be a problem compared with heap allocations. However, it indicates that for low precisions, other allocations that share the same pattern (unable or difficult to reuse external fragmentation, e.g., a large number of separate small sandboxes within a process) could waste a large amount of memory when facing a similar problem.
Figure 3.7: Percentage increase in peak size of total stack allocations (SPEC CPU 2006 experimental builds)
Analytical model
Stack allocations from the benchmarks are mostly small objects with a few large objects dominating the overhead, thus the peak stack size is highly dependent on how the workloads allocate large objects on the stack. However, the number of heap allocations is usually significantly higher with a mixture of objects of various sizes, whose average can be summarised with an analytical model.
The upper bound of the fragmentation in Figure 3.6 can be calculated with respect to the precision. A precision of $n$ divides the largest possible object under a given exponent $E$ ($2^{n+E}$ bytes) into $2^{n}$ blocks. In the worst case, we need to pad the object with almost a whole block ($2^{E}$ bytes) to account for the lack of precision. Also, the object can be as small as slightly over half of the largest possible object under its exponent, which is $2^{n+E-1}$ bytes. Therefore, the worst-case internal fragmentation is:
\[
\text{Internal fragmentation} = \frac{2^{E}}{2^{E} + 2^{n+E-1}} = \frac{1}{1 + 2^{n-1}}
\]
If we assume the sizes of allocations are approximately uniformly distributed overall in evaluated applications, the average object size is the average of the largest and the smallest under a certain exponent:
\[
\text{Avg. obj. size} = \frac{2^{n+E-1} + 2^{n+E}}{2} = 2^{n+E-2} + 2^{n+E-1}
\]
and assuming the required padding on average is halved ($2^{E-1}$) as well, the average internal fragmentation will be:
\[
\text{Avg.} = \frac{\text{avg. padding}}{\text{avg. padding} + \text{avg. obj. size}} = \frac{2^{E-1}}{2^{E-1} + 2^{n+E-2} + 2^{n+E-1}} = \frac{1}{1 + 2^{n-1} + 2^{n}}
\]
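As a quick numerical check, the worst-case and average expressions above can be evaluated for the precisions discussed in this section. The sketch below only restates the two closed forms; the function names are hypothetical, and exact fractions are used so that the cancellation of the exponent E can be verified rather than assumed.

```python
from fractions import Fraction


def worst_case_fragmentation(n: int) -> Fraction:
    """Closed form 1 / (1 + 2^(n-1)), valid for n >= 1."""
    return Fraction(1, 1 + 2 ** (n - 1))


def average_fragmentation(n: int) -> Fraction:
    """Closed form 1 / (1 + 2^(n-1) + 2^n), valid for n >= 1."""
    return Fraction(1, 1 + 2 ** (n - 1) + 2 ** n)


def worst_case_from_sizes(n: int, e: int) -> Fraction:
    """Unsimplified form: padding 2^E over padding plus object 2^(n+E-1)."""
    padding = 2 ** e
    smallest_object = 2 ** (n + e - 1)
    return Fraction(padding, padding + smallest_object)


# The exponent cancels: any E gives the same worst case for a given n.
assert worst_case_from_sizes(6, 0) == worst_case_from_sizes(6, 12)

for n in (1, 3, 6, 8, 20):
    worst = float(worst_case_fragmentation(n))
    avg = float(average_fragmentation(n))
    print(f"n={n:2d}  worst={worst:.4%}  average={avg:.4%}")
```

Each additional precision bit roughly halves both bounds, which is consistent with the measured curves flattening once n reaches about 8.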
Of course, additional heap metadata must also be allocated for heap maintenance in practice, which often disrupts how objects can be placed or aligned, resulting in higher-than-predicted fragmentation. This is much more visible in dlmalloc, since its boundary-tagging design must embed inline metadata in every allocation [40], potentially inflating objects to the next alignment boundary or even to the next exponent.
At high precisions, the padding of dlmalloc is small compared with the inline metadata, thus the internal fragmentation mostly reflects the space consumed by metadata, which we cannot eliminate no matter how high the precision is. Since the inline metadata of each allocation is constant regardless of the object size, the ratio of internal fragmentation at high precisions depends on the allocation pattern. An application performing a large number of small allocations requires more memory for metadata than one with a small number of large allocations, explaining why some applications can reduce fragmentation to almost 0%, thus fitting the model nicely, while others cannot.
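The dependence on the allocation pattern can be illustrated with a minimal sketch that ignores padding entirely and charges a constant header to every allocation. The 16-byte header and the two workloads are assumptions chosen for illustration, not values measured from dlmalloc or taken from the traces, and `metadata_fragmentation` is a hypothetical helper.

```python
def metadata_fragmentation(sizes, header_bytes: int = 16) -> float:
    """Fraction of allocated memory spent on per-allocation headers.

    Assumption: every allocation carries a constant header of
    header_bytes and no padding (i.e. perfect precision).
    """
    requested = sum(sizes)
    metadata = header_bytes * len(sizes)
    return metadata / (metadata + requested)


many_small = [32] * 1000          # hypothetical: 1000 objects of 32 bytes
few_large = [1 << 20] * 4         # hypothetical: 4 objects of 1 MiB

print(f"many small: {metadata_fragmentation(many_small):.2%}")
print(f"few large:  {metadata_fragmentation(few_large):.4%}")
```

Under these assumptions the small-object workload spends a third of its memory on headers, while the large-object workload spends a negligible fraction, mirroring the split between applications that approach 0% and those that do not.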
Unlike dlmalloc, jemalloc is designed for paged systems and performs its own rounding internally, effectively only having a precision between 3 and 6 for medium objects and 3 for large objects [28]. As the curve shows, it quickly becomes flat after n = 3, and approximately matches the equation between n = 3 and n = 6 for all high precisions.
Summary
Internal and external fragmentation can be captured well by the C stack and heap, which represent the most common sources of memory overhead for C codebases. Parameterisable tools help visualise the tradeoffs between consumed precision bits and memory wastage. The evaluation also brings the discussed compressed capability schemes together for comparison. M-Machine represents one extreme where fragmentation is unacceptably high, whereas CHERI-128 raises almost no concern about memory overhead. Low-fat (6-bit start and end index, 6-bit block size, 18 bits total for bounds) might be a good balance for 64-bit capabilities. For some extra headroom, my initial plan is to dedicate 22 bits in total for compressed bounds with 8-bit precision.