Cache Memory - Efficient Domain Partitioning for Stencil-based Parallel Operators

The memory unit of modern computer systems consists of memories of different speeds and sizes. The most common memory hierarchy, ordered by decreasing size, increasing speed, increasing cost per byte and decreasing distance from the processor, consists of the hard disk, the main memory (RAM), the cache memories (L3, L2 and L1) and the registers [5]. Figure 2.5 shows the typical memory hierarchy of a server system along with the typical sizes and access times. We shall collectively refer to the L1, L2 and L3 caches as the cache hierarchy. The typical access latencies of the L1, L2 and L3 caches are approximately 1 - 2, 5 - 10 and 10 - 20 processor cycles, respectively [105] (though these numbers vary from system to system).

Figure 2.5: Typical memory hierarchy with size and access times in a server system (reproduced from [5])

Although the historical increase in processor clock frequencies has stalled in recent years due to power constraints, the mismatch between the rate at which the processor computes and the rate at which the main memory delivers data still necessitates the cache hierarchy. Caches exploit the principles of spatial locality, i.e. data in the vicinity of the data being used is likely to be accessed soon, and temporal locality, i.e. data which was accessed recently is likely to be accessed again [5, 105]. Generally the L3 cache contains a copy of the data contained in the L1 and L2 caches; this is termed the principle of inclusion. This principle is also followed by the main memory and the disk storage. There are generally two types of L1 cache, the instruction cache (L1i) and the data cache (L1d), whereas the L2 cache is Unified (it stores both instructions and data). The L3 cache is also Unified, Inclusive and shared among several cores in a multi-core system.

A cache-miss results when the data requested by the processor is not found in a cache, so that the data must be fetched from the lower levels of the cache hierarchy or from the main memory. As can be seen from Figure 2.5, the lower the memory level, the higher the access time, and thus the aim is to minimize the cache-miss rate (or, equivalently, to maximize the cache-hit rate). For reasons of efficiency and spatial locality, a cache-miss results in fetching multiple words, not just a single word, from the lower levels. This group of words is called a cache-block or cache-line. For example, instead of fetching a single 8-byte double on a cache-miss, 8 consecutive doubles are fetched from the main memory. Thus, a typical cache-line size is 64 bytes (i.e. 8 double elements or 16 float elements). A contiguous collection of blocks in the cache memory is called a set. The memory address of a fetched cache-line determines the set it belongs to, and the cache-line can be placed anywhere within that set. The cache memory can contain many such sets. Such a cache is said to be n-way Set Associative, where n is the number of blocks per set. In the extreme case where a single set spans the full cache memory, the cache is said to be Fully Associative, as the cache-line can be placed anywhere in the cache. On the other hand, if each set consists of only a single cache-line, i.e. n = 1, the cache is said to be Directly Mapped, as there is only one location where the incoming cache-line can be loaded. In other words, a Directly Mapped cache has a single block per set and a Fully Associative cache has only a single set. Fully Associative caches are generally used as special purpose caches such as Translation Look-aside Buffers (TLBs) [105].

Further, the data in the cache and the main memory must be kept consistent. If the data is only being read, the memory remains consistent with the cache. The problem arises when data is written, and two policies are used to maintain consistency. A Write Through policy updates the cache and also the main memory. A Write Back policy updates only the cache and delays the write to the memory until some later point in time. When writing data to the memory, the data can first be copied from the cache to a buffer and then written to the memory. Without such a buffer, a large wait time can result if an incoming cache-line has to wait for a cache-line in the cache to be written directly to the memory. A buffered scheme therefore avoids paying the full memory latency and is used with both Write Through and Write Back.
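Returning to the placement of cache-lines, the mapping of a byte address to a set can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_address`, the 32 KB cache size and the modulo-based set indexing are our own choices for the example (modulo indexing is common, but real processors may hash the index differently). The sketch shows that Directly Mapped and Fully Associative caches are the two extremes of the same n-way scheme.

```python
def split_address(addr, cache_bytes, line_bytes=64, ways=1):
    """Split a byte address into (tag, set_index, offset) for an n-way cache.

    Assumes simple modulo indexing; all sizes are hypothetical example values.
    """
    num_lines = cache_bytes // line_bytes   # cache-lines in the whole cache
    num_sets = num_lines // ways            # n lines per set
    offset = addr % line_bytes              # byte position inside the cache-line
    block = addr // line_bytes              # cache-line number in memory
    set_index = block % num_sets            # the set this line must go to
    tag = block // num_sets                 # distinguishes lines sharing a set
    return tag, set_index, offset


CACHE_BYTES = 32 * 1024  # hypothetical 32 KB cache, 64-byte lines -> 512 lines

# 8-way Set Associative: 512 / 8 = 64 sets
print(split_address(0x12345, CACHE_BYTES, ways=8))      # (18, 13, 5)
# Directly Mapped: one line per set, 512 sets
print(split_address(0x12345, CACHE_BYTES, ways=1))      # (2, 141, 5)
# Fully Associative: a single set spanning all 512 lines
print(split_address(0x12345, CACHE_BYTES, ways=512))    # (1165, 0, 5)
```

In the Directly Mapped case, two addresses that lie exactly one cache size apart (for example `0x12345` and `0x12345 + CACHE_BYTES`) obtain the same set index but different tags, so they compete for the same single location.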

As mentioned above, it is desired that cache-misses be minimized. The cache-miss rate is defined as the fraction of cache accesses which result in a cache-miss. Similarly the cache-hit rate is defined as the fraction of accesses which result in a hit. Three types of cache-misses have been identified by the 3C’s model [5]:

1. Compulsory miss: A compulsory miss results every time a cache block is requested for the first time and is not in the cache memory.

2. Capacity miss: When the working set is so large that it just cannot be contained in the cache memory, a capacity miss occurs. Thus, two things must hold true for a capacity miss. First, the cache must be full and, second, the processor must request data that is not in the cache. A cache-block or line must be evicted from the cache memory and hence the cache-miss due to the requested block is categorized as a capacity miss.

3. Conflict miss: In a cache that is not fully associative, several blocks can map to the same set, so an incoming block may have to evict a block that already resides in that set. A later request for the evicted block then results in a conflict miss. A conflict miss can occur even when the cache is not full, i.e. the incoming data could theoretically fit into the cache, but because it is constrained to a particular set it evicts another block.

We can infer from the 3C’s model above that for a fully associative cache, only Compulsory and Capacity cache-misses can occur. In a cache-miss rate study where the cache size was varied from 8 KB to 512 KB and the set associativity from 1 to 8, the Compulsory, Capacity and Conflict cache-misses were found to account for ≈ 0.1 - 1.1%, 66 - 100% and 0 - 35% of the total cache-misses, respectively [5]. Multicore processors add a fourth type of cache-miss, the Coherency cache-miss, which is caused by the eviction of cache blocks in order to maintain cache coherency across multiple cores.
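The three categories of the 3C’s model can be told apart in a small simulation. The sketch below is a minimal illustration and not the method used in [5]: it assumes LRU replacement and modulo set indexing, and the names `LRUCache` and `classify_misses` are hypothetical. A miss is counted as Compulsory if the block was never referenced before, as Capacity if a fully associative cache of the same size would also miss, and as Conflict otherwise.

```python
from collections import OrderedDict


class LRUCache:
    """Set-associative cache of block numbers with LRU replacement (toy model)."""

    def __init__(self, num_lines, ways):
        self.ways = ways
        self.num_sets = num_lines // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, block):
        """Return True on a hit; on a miss, load the block and return False."""
        s = self.sets[block % self.num_sets]
        if block in s:
            s.move_to_end(block)       # mark as most recently used
            return True
        if len(s) == self.ways:
            s.popitem(last=False)      # evict the least recently used block
        s[block] = None
        return False


def classify_misses(blocks, num_lines, ways):
    """Count hits and the three miss types for a sequence of block numbers."""
    real = LRUCache(num_lines, ways)
    fully = LRUCache(num_lines, num_lines)   # same size, but a single set
    seen = set()
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for b in blocks:
        hit_real = real.access(b)
        hit_fully = fully.access(b)
        if hit_real:
            counts["hit"] += 1
        elif b not in seen:
            counts["compulsory"] += 1        # first reference ever
        elif not hit_fully:
            counts["capacity"] += 1          # even full associativity misses
        else:
            counts["conflict"] += 1          # only the set mapping is to blame
        seen.add(b)
    return counts


# Blocks 0 and 4 share a set in a Directly Mapped cache with 4 lines.
print(classify_misses([0, 4, 0, 4, 0, 4], num_lines=4, ways=1))
# {'hit': 0, 'compulsory': 2, 'capacity': 0, 'conflict': 4}

# Streaming twice over 8 blocks through a 4-line Fully Associative cache.
print(classify_misses(list(range(8)) * 2, num_lines=4, ways=4))
# {'hit': 0, 'compulsory': 8, 'capacity': 8, 'conflict': 0}
```

The cache-miss rate of a run is then the sum of the three miss counts divided by the number of accesses. In the first example the cache is never full, yet two thirds of the accesses are Conflict misses; making the same cache Fully Associative (`ways=4`) turns them all into hits.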

When a cache-block must be evicted from a set in the cache, a replacement policy chooses the block to be evicted. Various policies such as Random, Least Recently Used (LRU), Least Frequently Used (LFU) and First In First Out (FIFO) are used [5, 105]. The LRU policy chooses the block which has not been accessed for the longest period of time. The logic behind it is that the block which has gone unused the longest is the least likely to be accessed again in the near future. In practical implementations, a pseudo LRU algorithm approximates the behaviour of the LRU algorithm by associating one bit with each block in a set. Thus, if a cache is 4-way set associative then each set has 4 bits associated with it.

Whenever a block is accessed in a set, the corresponding bit is turned on. When all the bits in a set are in the on state, they are all turned off except for the bit corresponding to the block which was most recently accessed [5]. Thus, when a block is to be evicted, the replacement algorithm can choose from any block for which the bit is in the off state (hence multiple blocks may be available for replacement out of which one can be chosen randomly).
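The bit-based pseudo LRU scheme described above can be sketched for a single set as follows. This is a minimal illustration, not a hardware description: the class name `BitPLRUSet` and the seeded random number generator are our own assumptions, and at least two ways are required so that a block with its bit off always exists.

```python
import random


class BitPLRUSet:
    """One cache set with one 'recently used' bit per block (toy model)."""

    def __init__(self, ways=4, seed=0):
        assert ways >= 2, "the scheme needs at least two blocks per set"
        self.tags = [None] * ways          # None marks an empty way
        self.bits = [False] * ways         # all bits start in the off state
        self.rng = random.Random(seed)     # seeded only for reproducibility

    def _touch(self, way):
        """Turn the bit on; if all bits are now on, keep only this one."""
        self.bits[way] = True
        if all(self.bits):
            self.bits = [False] * len(self.bits)
            self.bits[way] = True

    def access(self, tag):
        """Return True on a hit; on a miss, replace a block whose bit is off."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        candidates = [w for w, bit in enumerate(self.bits) if not bit]
        way = self.rng.choice(candidates)  # any off-bit block may be chosen
        self.tags[way] = tag
        self._touch(way)
        return False


s = BitPLRUSet(ways=4)
for tag in "ABCD":
    s.access(tag)          # four misses fill the set; the last one resets the bits
s.access("E")              # evicts A, B or C, but never the most recent block D
print("D" in s.tags)       # True
```

Note that before the first reset, the blocks whose bit is off are exactly the empty ways, so in this sketch no valid block is evicted while free space remains in the set.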

Cache optimization techniques can be broadly grouped into two categories: Hardware based and Software based. An optimization technique such as Prefetching can fall into both categories. Prefetching involves speculatively fetching data or instructions into the cache, based on observed patterns of access, in order to reduce future cache-misses. It can be implemented in hardware as well as in software. Special hardware prefetch units can detect strided accesses and keep tables for detecting such patterns. Software prefetching is generally implemented by the compiler, which inserts software prefetch instructions after analyzing the access pattern. Prefetching is not a silver bullet and can also degrade performance, for example by fetching data which may not be needed, by interfering with cache block replacement policies and by increasing capacity cache-misses [106].
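To make the idea of stride detection concrete, the following toy model predicts the next address once the same non-zero stride has been observed twice in a row. It is only a sketch under our own assumptions: the names `StridePrefetcher` and `prefetch_accuracy` are hypothetical, a single access stream is tracked, and real hardware prefetch units keep tables for many streams and are considerably more elaborate.

```python
class StridePrefetcher:
    """Toy stride detector for a single stream of byte addresses."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Record an access; return an address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride   # same stride seen twice in a row
            self.last_stride = stride
        self.last_addr = addr
        return prediction


def prefetch_accuracy(addresses):
    """Return (issued, useful): predictions made, and those hit by the next access."""
    pf = StridePrefetcher()
    issued = useful = 0
    pending = None
    for a in addresses:
        if pending is not None:
            issued += 1
            useful += int(a == pending)
        pending = pf.observe(a)
    return issued, useful


# Regular 64-byte stride: every checked prediction is useful.
print(prefetch_accuracy([0, 64, 128, 192, 256]))              # (2, 2)
# The stride is interrupted: both prefetches fetch data that is never used.
print(prefetch_accuracy([0, 64, 128, 1000, 1064, 1128, 5]))   # (2, 0)
```

The second access pattern illustrates the downside mentioned above: each wrong prediction brings a cache-line into the cache that is never used and may evict a block that would have been needed.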
