Implementation of dlmalloc_nonreuse() - New allocator design for reduced sweeping frequency

5.5 New allocator design for reduced sweeping frequency

5.5.2 Implementation of dlmalloc_nonreuse()

Modern allocators attempt to re-allocate recently freed chunks in the hope of maximising the effect of CPU caches, TLB entries, etc. However, this strategy demands an extremely high number of sweeps to achieve a non-reuse policy. Without proper allocator support, the frequent sweeping incurs an unacceptably high cost in both run time and DRAM traffic. Therefore, I implement dlmalloc_nonreuse with a modified allocation policy, which avoids immediate reuse of recently freed chunks to reduce the number of sweeps required.

The original dlmalloc consists of two spaces: in-use chunks and free chunks in free-lists. I introduce a third space, the quarantine, which consists of memory chunks that have been freed but not yet revoked (that is, stale pointers may still point to them).

Life cycle of a memory chunk. A memory chunk enters the quarantine space when a function calls free(). Chunks in this state cannot be returned to a free-list unless they undergo a revocation pass. After the revocation, all stale pointers pointing to quarantined chunks are invalidated. The chunks can now be returned to a free-list and can be reused by another malloc() call.

CHAPTER 5. TEMPORAL SAFETY UNDER THE CHERI ARCHITECTURE

Quarantine queue and threshold. The quarantine maintains a queue of all quarantined chunks. A sweeping revocation is triggered whenever the total size of the chunks in the queue reaches a threshold. The threshold is configurable as a percentage of the heap size. The frequency of sweeping can therefore be reduced at the expense of a larger quarantine.
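The threshold rule can be sketched in a few lines of C. This is only an illustration under stated assumptions: quarantine_t, quarantine_threshold() and quarantine_add() are hypothetical names that do not come from the implementation, and the sweep itself is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical quarantine bookkeeping; all names are illustrative. */
typedef struct quarantine {
    size_t queued_bytes;  /* total size of chunks currently quarantined */
    size_t heap_bytes;    /* current heap size */
    unsigned percent;     /* threshold as a percentage of the heap size */
    unsigned sweeps;      /* number of sweeps triggered so far */
} quarantine_t;

/* Bytes of quarantined memory at which a sweep becomes due. */
static size_t quarantine_threshold(const quarantine_t *q) {
    return q->heap_bytes / 100 * q->percent;
}

/* Record a freed chunk; returns 1 if this free triggered a sweep. */
static int quarantine_add(quarantine_t *q, size_t chunk_bytes) {
    assert(q->percent > 0 && q->percent <= 100);
    q->queued_bytes += chunk_bytes;
    if (q->queued_bytes < quarantine_threshold(q))
        return 0;
    /* A real allocator would paint the shadow map, sweep memory and
       return every queued chunk to the free lists at this point. */
    q->sweeps++;
    q->queued_bytes = 0;
    return 1;
}
```

With a 1 MiB heap and a 25% threshold, three frees of 100000 bytes each trigger one sweep in this sketch; raising the percentage makes sweeps rarer but lets more freed memory sit idle.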

Revocation shadow map. During sweeping revocation, stale capabilities have to be identified and revoked. To achieve this, a separate revocation shadow map is maintained to indicate whether each heap allocation granule is currently in quarantine.

For each allocation granule, which I choose to be 16 bytes of memory to match the default in dlmalloc [40], I allocate 1 bit in a shadow map; this shadow space therefore occupies 1/128 of the heap. Before a sweep, for all allocations in the quarantine buffer, the revoker “paints” the bits of the shadow map corresponding to the allocation granules to indicate that these regions are in quarantine, and that references to them should be revoked in the sweep. The actual sweeping procedure performs a lookup in the shadow map using the base field of each capability to detect whether it is pointing into quarantined memory¹. If so, the capability is revoked.
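The painting and lookup steps can be illustrated with a minimal sketch. It makes one simplifying assumption: the shadow map is indexed by byte offset from the heap base rather than by absolute address, so that it runs on an ordinary array. The names shadow_paint() and shadow_test() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_SHIFT 4  /* 16-byte allocation granule */

/* Paint the shadow bits for the quarantined range [off, off + len),
   given as byte offsets from the heap base. */
static void shadow_paint(uint8_t *shadow, size_t off, size_t len) {
    assert(off % 16 == 0 && len % 16 == 0);
    size_t first = off >> GRANULE_SHIFT;
    size_t last = (off + len) >> GRANULE_SHIFT;
    for (size_t g = first; g < last; g++)
        shadow[g >> 3] |= (uint8_t)(1u << (g & 0x7));
}

/* Return nonzero if the granule containing heap offset `base` is painted. */
static int shadow_test(const uint8_t *shadow, size_t base) {
    size_t g = base >> GRANULE_SHIFT;
    return (shadow[g >> 3] >> (g & 0x7)) & 1;
}
```

Painting a 48-byte chunk at offset 64 sets the bits for granules 4, 5 and 6, so a capability whose base is 64, 80 or 96 would be revoked, while one whose base is 48 or 112 would survive.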

Efficient shadow map lookup. To achieve high memory sweeping speeds, the shadow map lookup must be simple and efficient. By default, FreeBSD does not map the bottom 2GiB of virtual address space on 64-bit architectures. Memory mapping to the bottom 2GiB can be forced by setting the MAP_32BIT flag when calling mmap().

This default setup conveniently leaves us 2GiB for shadow maps. Therefore, all normal mmaps and munmaps in the allocator are accompanied by a shadow space mmap call with MAP_32BIT set, which places the shadow map below 2GiB. Since the heap allocation granule is 16 bytes, a shadow map accompanying a normal mmap is only 1/128 of its size. I also set the MAP_FIXED flag so that the shadow map is located at a constant bit shift (a right shift of 7 for a 16-byte granule) of the original address. This shadow map scheme allows a fast, flat index lookup for testing each capability reference during a sweep, and is deterministic in its instruction count.
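The address arithmetic behind the constant shift can be written out as a small sketch. The helper names shadow_addr() and shadow_len() are hypothetical, and the sketch only computes where a shadow mapping would be placed; it does not call mmap().

```c
#include <assert.h>
#include <stdint.h>

/* 16-byte granule x 8 granules per shadow byte = 128 heap bytes per byte. */
#define SHADOW_SHIFT 7

/* Address of the shadow byte that covers heap_addr. */
static uint64_t shadow_addr(uint64_t heap_addr) {
    return heap_addr >> SHADOW_SHIFT;
}

/* Size of the shadow mapping that accompanies a heap mapping. */
static uint64_t shadow_len(uint64_t heap_len) {
    assert(heap_len % 128 == 0);
    return heap_len >> SHADOW_SHIFT;
}
```

For example, a 1 MiB heap mapping at 0x800000000 would be paired with an 8 KiB shadow mapping at 0x10000000. By the same arithmetic, a 2GiB shadow region can cover heap addresses below 256GiB.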

The following shows the C code for revoking stale capabilities within a region.

Note that in a CHERI system, the uintptr_t type is a capability when the tag is set.

¹ We can be sure that any heap capability will have a base within the original malloc bounds due to the monotonicity of capabilities.


    for (uintptr_t *x = MIN_ADDR; x < MAX_ADDR; x++) {
        uintptr_t capword = *x;
        if (is_capability(capword)) {  // Tag set: the word holds a capability.
            capword >>= 4;  // 16-byte alloc granule.
            // Get the bit index within the shadow byte.
            int bitIdx = capword & 0x7;
            // Get the byte from the shadow space at a constant shift.
            char shadowbyte = *(char *)(capword >> 3);
            if (shadowbyte & (1 << bitIdx)) {
                // Pointing at freed memory.
                // Invalidate the capability.
                *x = 0;
            }
        }
    }

Parallel sweeping. Unlike tracing garbage collection, which requires a tree walk from the roots to find reachable objects and therefore exposes only a limited amount of parallelism, sweeping is embarrassingly parallel. As shown in the code snippet of the shadow map lookup, there are no dependencies between any two iterations of the for loop, so in theory all iterations can be performed in parallel.

As a result, many optimisations are possible to achieve a higher DRAM bandwidth (or to saturate it completely) during sweeping. For example, a program on a multi-core machine can spawn multiple sweeper threads to sweep its address space in parallel; dedicated DMA engines can perform sweeping, shadow map lookup and capability invalidation in the background. Even in this chapter, where the benchmarks are mostly single-threaded and I assume only a single-core model, the operations of sweeping fit in vector instructions, which parallelise nicely even on a single core.
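The independence of iterations can be made concrete with a sketch that splits the sweep into disjoint slices. It runs on simulated memory: word_t, sweep_slice() and sweep_all() are hypothetical, the tag field stands in for the CHERI tag bit, and base is an offset from the heap base. The slices run in sequence here; the point is only that each one touches its own words and merely reads the shadow map, so each could be handed to a separate sweeper thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated memory word: `tag` models the CHERI tag bit and `base`
   the capability base, as a byte offset from the heap base. */
typedef struct { uint64_t base; int tag; } word_t;

/* Sweep words[begin, end); returns the number of capabilities revoked. */
static size_t sweep_slice(word_t *words, size_t begin, size_t end,
                          const uint8_t *shadow) {
    size_t revoked = 0;
    for (size_t i = begin; i < end; i++) {
        uint64_t g = words[i].base >> 4;  /* 16-byte granule index */
        if (words[i].tag && ((shadow[g >> 3] >> (g & 0x7)) & 1)) {
            words[i].tag = 0;   /* invalidate the stale capability */
            words[i].base = 0;
            revoked++;
        }
    }
    return revoked;
}

/* Split the range into nslices disjoint slices and sweep each one. */
static size_t sweep_all(word_t *words, size_t n, size_t nslices,
                        const uint8_t *shadow) {
    assert(nslices > 0);
    size_t revoked = 0;
    for (size_t s = 0; s < nslices; s++) {
        size_t begin = n * s / nslices;
        size_t end = n * (s + 1) / nslices;
        revoked += sweep_slice(words, begin, end, shadow);
    }
    return revoked;
}
```

Because no slice writes to a word owned by another slice, the result does not depend on the order or number of slices.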

Coalescing chunks in the quarantine space. Each quarantined chunk incurs an extra function call, as each free() first calls quarantine(), and the revoker later calls the actual free function on each chunk after revocation. In my implementation, calling quarantine() is significantly cheaper than the actual freeing because it only has to maintain the quarantine queue, whereas the actual free needs to find the correct bucket (which may involve a prefix tree walk) or to perform other maintenance work. Therefore, if quarantined chunks are coalesced before calling free, the number of eventual calls is lower than calling free directly, which results in higher performance.

The coalescing algorithm is similar to how dlmalloc coalesces free chunks. The 16-byte alignment of memory chunks leaves 4 bits for other purposes in the size field of chunk headers, among which 2 are already in use by dlmalloc. I use the remaining 2 bits to indicate whether this chunk and the previous chunk are in the quarantine space (cdirty and pdirty respectively). In this way, whether adjacent chunks are also in the quarantine space can be determined trivially, in O(1) time.
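One way to picture the header bits is the sketch below. The bit positions, the chunk_hdr_t type and the accessor names are assumptions made for illustration; only the idea of storing cdirty and pdirty in the spare low bits of the size field comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Low 4 bits of the size field are free because chunks are 16-byte
   aligned. The low two are assumed to be the ones dlmalloc already
   uses; the positions of the two dirty bits are illustrative. */
#define CDIRTY_BIT ((size_t)4)    /* this chunk is in quarantine */
#define PDIRTY_BIT ((size_t)8)    /* previous chunk is in quarantine */
#define FLAG_MASK  ((size_t)0xF)

typedef struct { size_t head; } chunk_hdr_t;  /* hypothetical header */

static size_t chunk_size(const chunk_hdr_t *c) {
    return c->head & ~FLAG_MASK;
}

static int chunk_cdirty(const chunk_hdr_t *c) {
    return (c->head & CDIRTY_BIT) != 0;
}

static int chunk_pdirty(const chunk_hdr_t *c) {
    return (c->head & PDIRTY_BIT) != 0;
}

/* Update the size while preserving all four flag bits. */
static void chunk_set_size(chunk_hdr_t *c, size_t size) {
    assert((size & FLAG_MASK) == 0);
    c->head = size | (c->head & FLAG_MASK);
}
```

Each check is a single mask of a word that is already being read, which is why the neighbour test costs O(1).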

    void free_nonreuse(void *mem) {
        // "struct chunk" is an assumed name for the chunk header type.
        struct chunk *ptr = mem2chunk(mem);  // Point to metadata.
        size_t size = ptr->size;
        // Find the next chunk before ptr may move back to the previous one.
        struct chunk *nptr = get_next_chunk_ptr(ptr);
        if (ptr->pdirty) {  // Try to coalesce with the previous chunk.
            struct chunk *pptr = get_prev_chunk_ptr(ptr);
            unlink_freebuf(pptr);
            size += pptr->size;
            ptr = pptr;
        }
        if (nptr->cdirty) {  // Try to coalesce with the next chunk.
            unlink_freebuf(nptr);
            size += nptr->size;
        }
        // Write metadata and insert into quarantine.
        ptr->size = size;
        ptr->cdirty = 1;
        get_next_chunk_ptr(ptr)->pdirty = 1;
        insert_freebuf(ptr);
    }

As described in the C code, adjacent quarantined chunks are removed from the queue, and a new coalesced chunk will be formed and added to the tail of the queue (also demonstrated in Figure 5.3).


Figure 5.3: c and p represent the cdirty and pdirty bits. At the end, Ck0 and Ck2 are unlinked from the queue, and the newly freed chunk (in the middle) is coalesced with Ck0 and Ck2 into Ck3 and inserted at the tail of the queue. After a revocation, Ck3 is returned to the free lists to be reused.

In document Capability Memory Protection for Embedded Systems (Page 108-112)