LRU Cache: A Solved Analysis - Lossy Compression applied to the Worst Case Execution Time Probl

2.3 Caches

2.3.2 LRU Cache: A Solved Analysis

The Least Recently Used (LRU) cache policy is one of the more obvious heuristics for a cache replacement policy, as mentioned earlier. LRU simply discards the least recently used element of the cache [55]. The assumption behind this heuristic is that data which has been accessed recently is more likely to be accessed again. While this heuristic fails for streaming data, which is accessed once and never again, it works relatively well for other types of data, for example the execution of program code. An exception is a loop whose memory accesses span more locations than the LRU cache can hold; for example, a 2-way cache will not provide any caching for the sequence of accesses a, b, c, a, b, c.

Conceptually the simplest way to visualise an LRU cache is as a list, with position in the list representing the age of an element, as seen in Figure 2.7. With this representation, the evict and touch operations for the LRU cache can be defined as follows:

• EvictAndReplace(memoryLocation): replace the tail element of the cache with the new memory location.

• Touch(memoryLocation): move the indicated memory location from its current position in the list to the head of the list.
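The two operations above can be sketched in a few lines of Python. This is only an illustrative model of the conceptual list, not a description of any hardware; the class name ListLRU, the access helper and the handling of a cache that is not yet full are assumptions made for the example.

```python
class ListLRU:
    """Conceptual LRU cache: index 0 is the head (youngest element)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = []  # youngest first, oldest (tail) last

    def touch(self, location):
        # Move a cached location from its current position to the head.
        self.lines.remove(location)
        self.lines.insert(0, location)

    def evict_and_replace(self, location):
        if len(self.lines) < self.ways:
            # Assumption: while the cache is still filling, use an empty way.
            self.lines.append(location)
        else:
            # Replace the tail element with the new memory location.
            self.lines[-1] = location

    def access(self, location):
        # Hypothetical helper: a miss is an eviction followed by a touch.
        hit = location in self.lines
        if not hit:
            self.evict_and_replace(location)
        self.touch(location)
        return hit


def run(ways, sequence):
    cache = ListLRU(ways)
    return [cache.access(location) for location in sequence]
```

Running the sequence a, b, c, a, b, c through a 2-way instance of this model reports a miss on every access, which is the loop behaviour described earlier; with 3 ways the second pass over the loop hits.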

In practice there are many implementations of the LRU cache [98]. The simplest is the special case of the 2-way cache. In this case there is only a first and a last element, hence it is sufficient to use a single bit of data to indicate the current last element. While this approach could be generalised to designate the tail of a list, this is infeasible in practice: with more than 2 ways, an access may require data to be moved within the cache to reflect the updated positions in the list. Hence for higher associativities other methods are used.
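A minimal sketch of the single-bit scheme for one 2-way set is given below. The class name TwoWayLRUSet and its fields are hypothetical, and memory locations are assumed never to be None.

```python
class TwoWayLRUSet:
    """One 2-way set whose LRU state is a single bit."""

    def __init__(self):
        self.ways = [None, None]
        self.lru = 0  # the single bit: index of the current last element

    def access(self, location):
        if location in self.ways:
            way = self.ways.index(location)
            hit = True
        else:
            way = self.lru              # the victim is the last element
            self.ways[way] = location
            hit = False
        # Whichever way was just used is now first, so the other is last.
        self.lru = 1 - way
        return hit
```

No data moves between ways on an access; only the bit flips, which is why this case is so cheap.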

One class of methods is to use age bits [98]. Age bits indicate the age of elements in the cache. The age bits of an element are incremented when it is demoted due to an access, and reset to zero when the element is accessed and therefore placed at the top of the cache. This method is efficient for relatively small associativity. When associativity is large, age bits become inefficient because of the difficulty of updating all age bits on each access.
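The age-bit update can be sketched as follows, assuming one small counter per way and distinct ages 0 to ways-1 within a set. The class name AgeBitsLRUSet and the initial age assignment are assumptions for illustration; real designs differ in encoding.

```python
class AgeBitsLRUSet:
    """One cache set that tracks LRU order with a per-way age counter."""

    def __init__(self, ways):
        self.tags = [None] * ways
        # Assumption: ages start as a permutation; the largest age is the victim.
        self.ages = list(range(ways))

    def access(self, location):
        if location in self.tags:
            way = self.tags.index(location)
            hit = True
        else:
            way = self.ages.index(max(self.ages))  # oldest way
            self.tags[way] = location
            hit = False
        old_age = self.ages[way]
        for i in range(len(self.ages)):
            # Only elements younger than the accessed one are demoted.
            if self.ages[i] < old_age:
                self.ages[i] += 1
        self.ages[way] = 0  # the accessed element is now the youngest
        return hit
```

Every access compares and possibly increments the age of every way in the set, which illustrates why the cost grows with associativity.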

For LRU caches with large associativity, methods such as a linked list become more efficient [98]. A linked list, unlike a plain array, has relatively efficient operations for moving elements within the list. However, the additional complexity of the hardware required to implement a linked list means that for small associativities other methods are more efficient.
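As a software analogy only (the text above concerns hardware), Python's collections.OrderedDict keeps its keys in a doubly linked list, so moving an element to one end does not shift the others. The class name LinkedLRU is hypothetical.

```python
from collections import OrderedDict


class LinkedLRU:
    """LRU order kept in a linked structure: oldest first, youngest last."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()

    def access(self, location):
        hit = location in self.lines
        if hit:
            # Relink the element at the young end; nothing else moves.
            self.lines.move_to_end(location)
        else:
            if len(self.lines) >= self.ways:
                self.lines.popitem(last=False)  # drop the oldest element
            self.lines[location] = True
        return hit
```

The cost of an access here does not depend on the number of ways, in contrast to shifting every entry of an array.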

A full Must/May analysis for the LRU cache was provided in 2007 by Sen and Srikant [91]. The method exploits the conceptual list representation of the LRU cache. The touch operation of the LRU cache moves the specified element from its current location to the head of the list; this also has the effect of moving every element that was ahead of the specified element down the list by one. Using this information it is straightforward to maintain upper and lower bounds on the position of each element in the list. A touch operator that works on the lists of upper and lower bounds can then be constructed by applying the following rules to each memory location m in the cache when touching memory location t:

• Case 1: m = t: m is now at the head of the list; set its upper and lower bounds to 0.

• Case 2: upper(m) < lower(t): m is certainly nearer the head than t, so it will be moved down the list by t moving up; increment both upper and lower bounds by 1.

• Case 3: the intervals (lower(m), upper(m)) and (lower(t), upper(t)) intersect: m might be moved down the list by t, but the information present does not prove this. As such the lower bound remains the same, but the upper bound is incremented by 1.

• Case 4: lower(m) > upper(t): m is already lower in the list than t, so it will not be affected by t moving up; keep both bounds the same.
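The touch operator on bounds can be sketched as below. The sketch assumes that position 0 is the head of the list, that lower and upper are the smallest and largest positions an element may occupy, and that an element is certainly moved down only when it is certainly nearer the head than t. The function name abstract_touch and the dictionary representation are assumptions for illustration, and capping bounds at the cache size is left out.

```python
import math

INF = math.inf  # bound of a location that is not in the cache


def abstract_touch(bounds, t):
    """bounds maps location -> (lower, upper); returns the state after touching t."""
    lower_t, upper_t = bounds.get(t, (INF, INF))
    updated = {}
    for m, (lower_m, upper_m) in bounds.items():
        if m == t:
            continue  # handled after the loop
        if upper_m < lower_t:
            # m is certainly nearer the head than t: it certainly moves down.
            updated[m] = (lower_m + 1, upper_m + 1)
        elif lower_m > upper_t:
            # m is certainly below t: unaffected.
            updated[m] = (lower_m, upper_m)
        else:
            # Intervals intersect: m might move down, so only upper grows.
            updated[m] = (lower_m, upper_m + 1)
    updated[t] = (0, 0)  # t is now at the head
    return updated
```

For example, with exact bounds a: (0, 0), b: (1, 1), c: (2, 2), touching b moves a to (1, 1), leaves c at (2, 2) and places b at (0, 0).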

Here, the functions upper and lower denote the position of the cache line in the lists giving the upper and lower bounds respectively. They return ∞ when the cache line is not in the cache. To complete the analysis, the lists of upper and lower bounds can be used to classify cache accesses to element m as follows:

• Case 1: upper(m) < cacheSize: m is guaranteed to be in the cache, so the classification is a Must.

• Case 2: lower(m) < cacheSize ≤ upper(m): m could be in the cache, but might not be. Hence the classification is a May.

• Case 3: cacheSize ≤ lower(m): m cannot be in the cache, so the classification is a Miss.
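The classification can be written as a small function. The sketch assumes positions are counted from 0 at the head, so an element is in the cache exactly when its position is below cacheSize; the function name classify is hypothetical.

```python
import math


def classify(lower_m, upper_m, cache_size):
    """Classify an access to m from its lower and upper position bounds."""
    if upper_m < cache_size:
        return "Must"  # even the worst-case position is inside the cache
    if lower_m >= cache_size:
        return "Miss"  # even the best-case position is outside the cache
    return "May"       # the bounds straddle the cache boundary


# Usage with a 2-way cache; math.inf stands for "not in the cache".
examples = [
    classify(0, 1, 2),
    classify(1, math.inf, 2),
    classify(math.inf, math.inf, 2),
]
```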

The analysis of an LRU cache in this manner is very efficient, as all information on any number of concrete cache states can be represented by a single upper and lower bound for each element. While the abstraction in this manner means that many impossible concrete states may be considered, these do not impact the accuracy of the analysis as the abstraction stores the best and worst cases accurately.

Unfortunately, even with the varied implementations of the LRU cache detailed in this section, the cost of implementing an LRU cache is deemed to be too high [55] for associativities greater than 2. Further, 4- or 8-way caches give much higher performance in benchmarks [26]. Hence it is necessary to use a different type of cache for high associativity. Even if an LRU cache is not feasible to implement for high associativity, least recently used behaviour is a good cache heuristic in most use cases. Hence one method of keeping the advantages of LRU while decreasing the cost in silicon is to approximate the behaviour of LRU.

In document Lossy Compression applied to the Worst Case Execution Time Problem (Page 47-50)