
2.1 Hardware Foundations

2.1.2 Processor Caches

Modern processors employ a hierarchy of fast cache memories that contain recently accessed instructions and data to alleviate high off-chip memory latencies. Processors with a memory management unit (MMU) additionally have a translation look-aside buffer (TLB).

An MMU translates virtual memory addresses into physical memory addresses and is the foundation on which address space separation is implemented in modern OSs. Performing such a translation is relatively slow. The TLB stores previously resolved virtual-to-physical address mappings, thereby ensuring that the MMU does not have to perform a translation on every memory reference. There is usually one local TLB per processor.

Caches are typically organized in layers (or levels), where the fastest (and usually smallest) caches are denoted level-1 (L1) caches, with deeper caches (L2, L3, etc.) being successively larger and slower. A cache contains either instructions or data, or both if it is unified. In multiprocessors, shared caches serve multiple processors, in contrast to private caches, which serve only one. Shared caches have become more prevalent with multicore chips. A typical design is shown in Figure 2.3, where each processor has a private L1 cache and groups of two processors each share an L2 cache.

[Figure 2.3 diagram: Cores 1 to 4 each have a private L1 cache; two L2 caches, each shared by a pair of cores, connect to main memory.]

Figure 2.3: Example two-level cache hierarchy with shared L2 caches.

Caches operate on blocks of consecutive addresses called cache lines, with common sizes ranging from 8 to 128 bytes. In direct mapped caches, each cache line may only reside in one specific location in the cache. In fully associative caches, each cache line may reside at any location in the cache. In practice, most caches are set associative, wherein each line may reside at a fixed number of locations. In an x-way associative cache, each cache line may be mapped to x distinct cache locations.
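The three placement schemes differ only in the number of ways. The following minimal sketch illustrates this under the assumption (not stated in the text) that the set index is the memory line number modulo the number of sets; real hardware may use other index functions. The function name and parameters are hypothetical.

```python
def candidate_slots(address, line_size, num_lines, ways):
    """Return the cache slots in which the line holding `address` may reside.

    Assumption: modulo set indexing.
    ways == 1          -> direct mapped (exactly one slot)
    ways == num_lines  -> fully associative (any slot)
    otherwise          -> `ways`-way set associative
    """
    line_number = address // line_size   # which memory line holds the address
    num_sets = num_lines // ways
    set_index = line_number % num_sets
    first_slot = set_index * ways
    return list(range(first_slot, first_slot + ways))
```

For a toy cache with eight 64-byte lines, address 0x1040 lies in memory line 65, which maps to one slot when direct mapped, two slots when 2-way set associative, and all eight slots when fully associative.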

In a multiprocessor, caches might become inconsistent if one processor updates a memory location that is currently cached by other processors. We restrict our focus to cache-coherent multiprocessors. Such processors employ a cache-consistency protocol to transparently evict outdated cache entries from the caches of other processors. Instead of evicting outdated cache entries, a cache-consistency protocol could also propagate the new value; however, this technique is not employed by the platform considered in this dissertation. Hennessy and Patterson (2006) provide a detailed introduction to cache-consistency protocols, and Baer (2010) discusses common implementation techniques.

Cache use. The efficacy of a cache hierarchy depends on the memory requirements and access patterns of the scheduled tasks and the underlying RTOS. Fundamentally, caches work because programs exhibit temporal and spatial locality: at a given time, a typical program will only access a small subset of its memory. A classic notion of access locality is the working set of a task. Denning (1968) first proposed it in the context of virtual memory and defined the working set as the set of pages that must be present within a given interval to assure efficient execution, i.e., to avoid page faults. Agarwal et al. (1989) applied this definition to cache lines. Under their definition, the working set is the set of cache lines that will be referenced (within the analysis time frame). For long-running processes, the working set usually changes slowly over time when the program executes in a steady state, and may change abruptly if the program transitions to a different “execution phase.” Working sets are thus generally studied with respect to some time interval (or instruction sequence).

Job/invocation:    1                   2               3
Time:              t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13
Memory reference:  c1  c2  c3  c4  c5  c1  c2  c2  c1  c3  c4  c5  c3

Figure 2.4: Memory references (cache lines c1, ..., c5) of three invocations of a task.

Recall that processes implement tasks, and that each distinct invocation of a task is modeled as a job release in the sporadic task model (discussed in detail in Section 2.2 below). Since this dissertation is concerned with sporadic workloads, we interpret “working set” with respect to a single job: the set of all cache lines accessed by a job is its working set. This is similar to Thiebaut and Stone’s notion of cache footprint (Thiebaut and Stone, 1987), which they define as the “active portion” of a task that is present in a cache. However, strictly speaking, a cache line that is part of the cache footprint is not necessarily also part of the working set (at a given point in execution). This is because a cache line that has previously been brought into the cache might never be accessed again.

Example 2.1. Figure 2.4 depicts the memory references of three invocations (or jobs) of an example task. The working set and cache footprint at each point in time during the task’s execution are listed in Table 2.1. The first job sequentially accesses c1, ..., c5 so that each cache line is accessed only once.

The working set of the job (the cache lines that it requires in the future) thus shrinks with each memory reference. In contrast, its cache footprint (the cache lines that have been brought into the cache on the job’s behalf) increases with each memory reference. Before the first memory reference at time t1, the working set encompasses all five cache lines, but the job has no cache footprint yet. After the last memory reference at time t5, the working set is empty (the first invocation is complete), but the cache footprint is maximal. The first job demonstrates a worst-case scenario from a cache-efficiency point of view: it has a large cache footprint, but no reuse of cache contents occurs since each cache line is accessed only once, i.e., the working set and cache footprint are disjoint.

Job  Time t  Working set (Agarwal et al., 1989)     Cache footprint (Thiebaut and Stone, 1987)
             (from time t until next invocation)    (prior to memory reference at time t)
1    t1      {c1, c2, c3, c4, c5}                   ∅
     t2      {c2, c3, c4, c5}                       {c1}
     t3      {c3, c4, c5}                           {c1, c2}
     t4      {c4, c5}                               {c1, c2, c3}
     t5      {c5}                                   {c1, c2, c3, c4}
2    t6      {c1, c2}                               ∅
     t7      {c1, c2}                               {c1}
     t8      {c1, c2}                               {c1, c2}
     t9      {c1}                                   {c1, c2}
3    t10     {c3, c4, c5}                           ∅
     t11     {c3, c4, c5}                           {c3}
     t12     {c3, c5}                               {c3, c4}
     t13     {c3}                                   {c3, c4, c5}

Table 2.1: The working set and cache footprint of the task from Figure 2.4. At time t, the working set is the set of cache lines that will be accessed on or after time t; the cache footprint is the set of cache lines that have been accessed prior to time t. Both definitions are applied with regard to individual jobs.

The second invocation illustrates cache-line reuse. Here, the working set includes only two cache lines (c1 and c2), which are both accessed twice. After each cache line has been accessed once at time t8, the working set equals the cache footprint, an ideal situation with regard to cache efficiency. The third invocation is a mixture of the prior two scenarios such that a subset of the cache footprint is part of the working set. Cache line c3 is reused at time t13 after being brought into the cache at time t10, but the other two cache lines of the cache footprint at time t13 are not useful. In general, the cache footprint closely corresponds to the working set (after an initial warm-up phase) if a job exhibits high spatial locality. ♦
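The per-job definitions used in Table 2.1 can be checked mechanically. The sketch below is illustrative only: it encodes the reference sequences of Figure 2.4 as Python lists (the function and variable names are hypothetical) and derives, for each memory reference, the working set (lines referenced at or after that point) and the cache footprint (lines referenced before it).

```python
def working_set_and_footprint(job_refs):
    """Return one (working_set, footprint) pair per memory reference of a job.

    working_set: cache lines referenced at or after the current reference
    footprint:   cache lines referenced before the current reference
    """
    rows = []
    for i in range(len(job_refs)):
        rows.append((set(job_refs[i:]), set(job_refs[:i])))
    return rows


# Reference sequences of the three jobs in Figure 2.4.
jobs = {
    1: ["c1", "c2", "c3", "c4", "c5"],
    2: ["c1", "c2", "c2", "c1"],
    3: ["c3", "c4", "c5", "c3"],
}

for job, refs in jobs.items():
    for offset, (ws, fp) in enumerate(working_set_and_footprint(refs)):
        print(job, offset, sorted(ws), sorted(fp))
```

The printed rows follow the order of Table 2.1, with `offset` counting references within each job rather than the global time index.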

The working set and cache footprint determine the impact of preemptions. The cache footprint of a job that is preempted is likely to be evicted while it is not scheduled. If its working set at the time of the preemption closely matched its cache footprint, then it is penalized by additional cache misses when it resumes execution. That is, immediately after a preemption, a job does not benefit from its spatial and temporal locality since the cache state was disturbed. In contrast, a job with a disjoint working set and cache footprint is not impacted by a preemption at all (e.g., the first job in Figure 2.4). However, such cases of inherently low cache efficiency are rare in well-engineered systems.

In the context of this dissertation, the distinction between working set and cache footprint is less important since we focus on worst-case per-job memory use, and because the time of preemption is unknown in general. In the worst case, preempting jobs create maximal cache footprints and all of a preempted job’s working set is evicted. We therefore make the simplifying assumption that the working set encompasses all cache lines accessed by a job over the course of its execution. A job’s maximum cache footprint is thus its working set (assuming it fits into the cache by itself). A characteristic measure of a job’s working set is its size, called the working set size (WSS).

Cache misses. If a job references a cache line that cannot be found in a level-X cache, then it suffers a level-X cache miss. There are four primary causes for cache misses. Compulsory misses are triggered the first time a cache line is referenced, i.e., if the referenced cache line was not yet part of the cache footprint. Capacity misses result if the WSS of the job exceeds the size of the cache, that is, if needed cache lines were evicted to make room for other data. Further, in direct mapped and set associative caches, conflict misses arise if cache lines were evicted to accommodate mapping constraints of other cache lines. Finally, coherency misses occur when another processor evicted required cache lines to ensure data consistency.
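To make the first three categories concrete, the following sketch classifies the misses of a single-processor reference trace. It rests on several assumptions that are not part of the text: LRU replacement, modulo set indexing, and the common textbook convention that a non-compulsory miss counts as a capacity miss if a fully associative LRU cache of the same size would also miss, and as a conflict miss otherwise. Coherency misses are not modeled since only one processor is simulated. All names are hypothetical.

```python
from collections import OrderedDict


def classify_misses(refs, num_lines, ways):
    """Count hits and compulsory/capacity/conflict misses for a trace of line numbers."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # the modeled cache
    full = OrderedDict()                             # fully associative reference cache
    seen = set()                                     # lines referenced so far
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    def touch(lru, line, capacity):
        """Access `line` in an LRU-ordered dict; return True on a hit."""
        hit = line in lru
        if hit:
            lru.move_to_end(line)        # mark as most recently used
        else:
            if len(lru) >= capacity:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
        return hit

    for line in refs:
        hit = touch(sets[line % num_sets], line, ways)
        full_hit = touch(full, line, num_lines)
        if hit:
            counts["hit"] += 1
        elif line not in seen:
            counts["compulsory"] += 1
        elif not full_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        seen.add(line)
    return counts
```

For example, in a direct mapped cache with four lines, alternating between lines 0 and 4 yields only compulsory and conflict misses, because both lines compete for the same slot although the cache is mostly empty.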

Ideally, a job should incur few cache misses besides compulsory misses. However, since caches are finite, this is not always the case. Jobs that incur frequent level-X capacity and conflict misses even when executing in isolation are said to be thrashing the level-X cache. Cache affinity describes the effect that a job’s overall miss rate tends to decrease with increasing execution time (unless it is thrashing): after an initial burst of compulsory misses, most of the working set has been brought into the cache and the rate of compulsory misses decreases. A job’s memory references are cache-warm after cache affinity has been established; conversely, cache-cold references imply a lack of cache affinity.

In a multiprocessor system, a job that does not thrash by itself may still incur frequent cache misses due to the activity of jobs on other processors. In particular, a shared cache must be larger than the combined cache footprint of all jobs accessing it; otherwise, frequent capacity and conflict misses may arise due to cache interference. Cache consistency can also become a major source of overhead if processors frequently read and write memory locations that reside in the same cache line. Cache line bouncing describes the effect when two or more processors repeatedly evict the same cache line(s) from each other’s caches. Cache line bouncing can also be the result of false sharing, wherein processors access different, but neighboring memory locations that map to the same cache line.

TLB misses occur the first time a memory page is accessed (together with a compulsory cache miss), when the TLB entry was displaced (this corresponds to a capacity miss), or when the TLB was flushed. TLB flushes are required whenever the address space of a task is modified (e.g., when it maps or unmaps physical addresses and memory-mapped devices). The TLB may also be flushed when the address space is changed on a context switch.

Cache-miss avoidance. Cache misses slow down a task’s execution since the processor stalls while it must wait for instructions and data to be fetched from main memory. On a modern, fast processor, servicing a cache miss can take more than 100 processor cycles (Baer, 2010). Avoiding cache misses as much as possible is thus crucial to achieving high efficiency. In real-time systems, cache misses should further be avoided since they are difficult to predict. We briefly discuss some relevant techniques for improving cache hit rates.

In practice, cache interference and cache line bouncing due to false sharing are a significant concern for OS developers. A common solution to reduce cache line bouncing is to include padding bytes in data structures such that their size becomes a multiple of the cache line size. False sharing is impossible for such data structures if they are consistently allocated at cache line boundaries. The only way to avoid cache line bouncing due to “true sharing,” i.e., due to data structures that are accessed by multiple processors, is to minimize the frequency of such accesses.
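The effect of padding can be illustrated with simple address arithmetic. The sketch below assumes a 64-byte cache line and hypothetical 40-byte per-processor objects placed back to back; the helper names are made up for illustration. It computes the padded size and checks whether any two objects occupy a common cache line.

```python
def padded_size(size, line_size=64):
    """Round `size` up to the next multiple of the cache-line size."""
    return -(-size // line_size) * line_size  # ceiling division


def lines_touched(offset, size, line_size=64):
    """Set of cache-line indices covered by an object starting at byte `offset`."""
    first = offset // line_size
    last = (offset + size - 1) // line_size
    return set(range(first, last + 1))


def falsely_shared(objects, line_size=64):
    """True if any two (offset, size) objects occupy a common cache line."""
    occupied = set()
    for offset, size in objects:
        lines = lines_touched(offset, size, line_size)
        if lines & occupied:
            return True
        occupied |= lines
    return False
```

Two unpadded 40-byte objects at offsets 0 and 40 share cache line 0, whereas padding each to `padded_size(40) == 64` bytes and placing them at offsets 0 and 64 gives each object a line of its own.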

Since the extent of cache interference depends on the set of jobs executing on the processors that share a cache, cache interference can also be reduced using cache-aware scheduling approaches (Calandrino, 2009; Guan et al., 2009). Such approaches are complementary to this dissertation, in the sense that our overhead-aware evaluation methodology (Chapter 4) can be used to compare scheduling algorithms that aim to reduce cache interference.

As the name implies, compulsory misses are impossible to avoid in general. However, it is possible to exploit spatial locality to avoid some compulsory misses by predicting future memory references. Processors that perform cache prefetching monitor the sequence of memory references and proactively transfer cache lines that are likely to be accessed in the near future into a cache. Cache prefetching can be effective from a throughput point of view. However, in real-time systems, prefetching can be detrimental since it might displace part of a job’s working set with unrelated, possibly useless data when the prefetching logic mispredicts future references.

TLB misses can also cause significant slowdowns. To avoid TLB flushes on context switches, some processors support tagged TLBs, wherein each TLB entry is marked with an address space number (ASN). ASNs are unique identifiers that correspond to address spaces managed by the OS. On a context switch, instead of flushing the TLB, the OS only updates a “current ASN” register to a value corresponding to the scheduled process. TLB entries belonging to the current process can then be differentiated by the MMU from stale entries based on the differing ASN tag. The benefit of a tagged TLB is that it reduces the average cost of context switches: if a process resumes execution shortly after it was preempted, then most of its TLB entries may still be present. However, tagged TLBs are still subject to capacity limitations, and ASNs must be invalidated when address spaces are modified (which is equivalent to a process-specific TLB flush). Consequently, tagged TLBs have little impact on the worst-case cost of a context switch.
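The following toy model sketches the tagging idea. It is not a description of any particular processor: the class and method names are hypothetical, and capacity limits are deliberately left out, so it only shows why entries survive a context switch and how a process-specific invalidation behaves.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (ASN, virtual page number)."""

    def __init__(self):
        self.entries = {}        # (asn, vpn) -> physical frame number
        self.current_asn = None  # models the "current ASN" register

    def context_switch(self, asn):
        # No flush: only the "current ASN" register changes.
        self.current_asn = asn

    def lookup(self, vpn):
        """Return the cached frame for `vpn`, or None on a TLB miss."""
        return self.entries.get((self.current_asn, vpn))

    def insert(self, vpn, frame):
        """Cache a resolved translation for the current address space."""
        self.entries[(self.current_asn, vpn)] = frame

    def invalidate_asn(self, asn):
        """Process-specific flush, e.g., after the address space was modified."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != asn}
```

A process that resumes under its old ASN finds its earlier translations again, while a lookup of the same virtual page under a different ASN misses because the tag differs.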

Cache partitioning. Recall that preemptions cause the preempted job to incur additional cache misses when it resumes execution if the cache footprint of the preempting job caused parts of the working set of the preempted job to be evicted. It is difficult to predict such cache misses and those that arise due to cache interference since they depend on the identity of the interfering or preempting task (of which there can be many), and on the point in time at which the preemption or interference takes place. In effect, such misses cannot be accurately anticipated by analyzing individual tasks in isolation. To limit the impact of such cache effects, several hardware- and software-based isolation techniques have been proposed.

At the hardware level, Kirk (1989) proposed cache partitioning to reserve parts of a cache for specific processes, thereby limiting cache interference to the non-exclusive parts. As a variant of full partitioning, some architectures allow cache contents to be locked such that they will not be evicted. Scratchpad memories (also called programmable caches) are an alternative to caches and function similarly in that they speed up memory references (Banakar et al., 2002). The benefit of scratchpads is that their contents are completely software-controlled and thus not subject to unexpected interference. These hardware-based isolation techniques have been employed in a number of uniprocessor designs targeted at embedded systems; however, they are typically not available in current mass-market multicore designs.

Physical pages (4 KB)    Physical cache lines
0, 128, 256, …           0 – 63
1, 129, 257, …           64 – 127
2, 130, 258, …           128 – 191
…                        …

Figure 2.5: Illustration of page coloring. Physical pages are of the same page color if their contents map to the same physical cache lines. In this example, there are 128 distinct page colors (see Example 2.2).

Page coloring. Cache partitioning can also be realized at the software level based on page coloring. The “color” of a memory page describes to which set of cache locations its addresses will be mapped (assuming that caches are direct mapped or set associative, which they are in practice).

Example 2.2. Suppose a system with 1 GB of memory has a 512 KB direct-mapped L2 cache with a cache-line size of 64 bytes. This implies there are 512 KB / 64 bytes = 8,192 physical cache lines in the L2 cache. Assuming a page size of 4 KB, there are 4 KB / 64 bytes = 64 cache lines per page and a total of 1 GB / 4 KB = 262,144 physical pages. Let p denote a physical page number, and let c denote a physical cache-line index (both zero-based, i.e., 0 ≤ p ≤ 262,143 and 0 ≤ c ≤ 8,191). As illustrated in Figure 2.5, in a direct-mapped cache, the contents of the first page (p = 0) map to the first 64 physical cache lines c ∈ {0, ..., 63}. Similarly, the contents of the second page (p = 1) are mapped onto the cache lines c ∈ {64, ..., 127}. Accesses to data stored in the first two physical pages can thus not interfere with each other. However, since the cache is much smaller than the main memory, the direct mapping wraps around after 8,192 / 64 = 128 pages. That is, the contents of page p = 128 are also mapped to c ∈ {0, ..., 63}, and thus can conflict with the contents of page p = 0. In general, the contents of a page p are mapped to the cache lines c ∈ {(p mod 128) · 64, ..., (p mod 128) · 64 + 63}. This implies that two pages p1 and p2 conflict if and only if p1 mod 128 = p2 mod 128. Hence, in this example, there are 128 disjoint sets of pages, called colors, such that pages within each set conflict with each other, but not with any pages from other sets. If the cache in this example were a 2-way set-associative cache (i.e., if each cache line could be mapped to two distinct locations in the cache), then there would be only 64 distinct colors. ♦
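The arithmetic of Example 2.2 can be restated as a short sketch. The function names are hypothetical; the formula for the number of colors (cache size divided by the product of associativity and page size) is the generalization implied by the example's direct-mapped and 2-way cases.

```python
KB = 1024


def num_colors(cache_size, page_size, ways=1):
    """Number of page colors: how many pages fit into one way of the cache."""
    return cache_size // (ways * page_size)


def page_color(page, cache_size, page_size, ways=1):
    """Color of physical page number `page` (zero-based)."""
    return page % num_colors(cache_size, page_size, ways)


def cache_lines_of_page(page, cache_size, page_size, line_size):
    """Direct-mapped case: physical cache-line indices used by `page`."""
    lines_per_page = page_size // line_size
    first = page_color(page, cache_size, page_size) * lines_per_page
    return range(first, first + lines_per_page)


print(num_colors(512 * KB, 4 * KB))                         # direct mapped
print(num_colors(512 * KB, 4 * KB, ways=2))                 # 2-way set associative
print(list(cache_lines_of_page(128, 512 * KB, 4 * KB, 64))[:3])
```

With the parameters of the example, pages 0 and 128 receive the same color and therefore map to the same 64 cache lines, while pages 0 and 1 do not overlap.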

Page coloring can be used to implement cache partitioning by dedicating a page color to each real-time task, i.e., by letting each real-time task allocate only pages of a reserved, task-specific color. This can be enforced either statically at the compiler level (Müller, 1995) or dynamically at the OS level in the virtual memory subsystem (Liedtke et al., 1997). However, either approach limits the number of real-time tasks that may run concurrently to the number of distinct page colors.
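A color-restricted allocator might look roughly as follows. This is only a sketch of the general idea under simplifying assumptions, not the mechanism of the cited works: the class, its methods, and the policy of handing out colors in ascending order are all hypothetical, and a page's color is taken to be its page number modulo the number of colors.

```python
class ColoredPagePool:
    """Hands out physical pages so that each task only receives pages of its own color."""

    def __init__(self, num_pages, num_colors):
        self.num_colors = num_colors
        # Free page numbers grouped by color (color = page number mod num_colors).
        self.free = {c: list(range(c, num_pages, num_colors)) for c in range(num_colors)}
        self.color_of = {}  # task name -> reserved color

    def reserve_color(self, task):
        """Dedicate one unused color to `task`; fails once all colors are taken."""
        if task in self.color_of:
            return self.color_of[task]
        if len(self.color_of) == self.num_colors:
            raise RuntimeError("more tasks than page colors")
        color = len(self.color_of)  # hand out colors 0, 1, 2, ... in order
        self.color_of[task] = color
        return color

    def alloc_page(self, task):
        """Return the lowest free page of the color reserved for `task`."""
        return self.free[self.color_of[task]].pop(0)
```

The `RuntimeError` in `reserve_color` reflects the limitation stated above: once every color is dedicated to a task, no further real-time task can be admitted.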

In document Scheduling and locking in multiprocessor real-time operating systems (Page 50-59)