Cache Design - The Itanium 2 Microarchitecture

2.2 The Itanium 2 Microarchitecture

2.2.2 Cache Design

The Itanium 2 microarchitecture features a three-level, on-chip cache hierarchy. Fig. 2.7 depicts the caches with their respective sizes, associativities, read/write policies, read latencies in cycles (minimum values for the L2 and L3 cache), and peak read bandwidths at 1.5 GHz. While the L1 cache enforces a write-through policy (WT, all writes go directly through the cache to the next-lower cache level), the L2 and L3 caches use write-back (WB, lines are written to the next-lower cache level only on replacement) together with write-allocate (WA, a cache line is also allocated on a write miss). All floating-point memory accesses bypass the L1 cache and are served directly by the L2 cache; they take one additional cycle for format conversion.
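The difference between the two write policies can be illustrated with a toy model. The following sketch is not a model of the Itanium 2 caches: it is a minimal direct-mapped cache (the class `ToyCache`, its parameters, and the helper `count_next_level_writes` are hypothetical names chosen for this illustration) that only counts how many writes reach the next-lower level under write-through without write-allocate versus write-back with write-allocate.

```python
class ToyCache:
    """Direct-mapped toy cache that counts write traffic to the next-lower level."""

    def __init__(self, num_lines, line_size, write_back):
        self.num_lines = num_lines
        self.line_size = line_size
        self.write_back = write_back      # True: WB + WA, False: WT, no WA
        self.tags = [None] * num_lines    # one tag per line (None = invalid)
        self.dirty = [False] * num_lines
        self.writes_to_next_level = 0

    def _locate(self, addr):
        line_addr = addr // self.line_size
        return line_addr % self.num_lines, line_addr // self.num_lines

    def _fill(self, index, tag):
        # Replace the old line; under write-back a dirty line is written out first.
        if self.write_back and self.tags[index] is not None and self.dirty[index]:
            self.writes_to_next_level += 1
        self.tags[index] = tag
        self.dirty[index] = False

    def read(self, addr):
        index, tag = self._locate(addr)
        if self.tags[index] != tag:
            self._fill(index, tag)

    def write(self, addr):
        index, tag = self._locate(addr)
        if self.write_back:
            if self.tags[index] != tag:
                self._fill(index, tag)    # write-allocate on a write miss
            self.dirty[index] = True      # the write is deferred until replacement
        else:
            # Write-through: every write is forwarded; no allocation on a miss.
            self.writes_to_next_level += 1


def count_next_level_writes(write_back, addresses):
    """Replay a sequence of write addresses and return the next-level write count."""
    cache = ToyCache(num_lines=4, line_size=64, write_back=write_back)
    for addr in addresses:
        cache.write(addr)
    return cache.writes_to_next_level
```

With ten writes to the same line followed by one write to a conflicting line, the write-through variant forwards all eleven writes, whereas the write-back variant forwards only the single replaced dirty line.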

The four-ported L1 data cache on the top is extremely fast with a single-cycle read latency, which helps avoid load-use stalls of the in-order execution pipeline [BMS02]. The low access time has been achieved through a small cache size (16 KB), aggressive circuit techniques and a prevalidated tag cache design.

The latter technique speeds up the translation from virtual to physical addresses, which is necessary as the cache is physically-addressed. The key idea is that the tag array of the cache does not contain the upper bits of a physical address (as usual), but instead a 32-bit pointer to a TLB (translation lookaside buffer, [HP03]) entry that contains this address. This pointer is organized as a “one-hot” vector, i.e., with exactly one bit equal to one, to enable a fast comparison with other pointers. If the i-th bit is set, it is meant to point to the i-th TLB entry. Each L1 cache access then initiates three parallel accesses to different structures:

• The upper-order virtual address bits are used to access the 32-entry L1D TLB, delivering a one-hot vector that points to the entry containing the translated physical address (if one exists).

• The lower-order virtual address bits (which do not have to be translated since the way size (4 KB) is always less than or equal to the page size (4 KB-4 GB)) are used to access all four ways of the tag array and the data array in parallel, yielding four 32-bit one-hot vectors and four cache lines, respectively.

Figure 2.7: The Itanium 2 cache hierarchy.

Register file: 1 KB, 0 cycles
L1 instruction cache: 16 KB, 4-way set-associative, 64-byte lines, 1 cycle, 48 GB/s
L1 data cache: 16 KB, 4-way set-associative, WT, no WA, 64-byte lines, 1 cycle, 24 GB/s
L2 cache: 256 KB, 8-way set-associative, WB, WA, 128-byte lines, 5 cycles, 48 GB/s
L3 cache: up to 6 MB, 24-way set-associative, WB, WA, 128-byte lines, 14 cycles, 48 GB/s
Physical memory: up to 2^50 bytes, > 50 cycles, 6.4 GB/s
Virtual memory: up to 2^64 bytes

Figure 2.8: L1D cache die photo (labeled blocks: data array ways 0/1, data array ways 2/3, address decoders, rotating way mux, tag array, L1 TLB, address mux).

The four one-hot vectors can then be compared with the one from the TLB much faster than full 64-bit words could be compared. The matching vector, if any, determines the way and selects one of the four cache lines for output. The same design is also used in the 16 KB L1 instruction cache.
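The lookup described above can be sketched in a few lines. This is a sequential software model of what the hardware does in parallel, under stated assumptions: a 4 KB page size is assumed for simplicity, the TLB is modeled as a plain list of virtual page numbers, the comparison of one-hot vectors is modeled as a bitwise AND, and all names (`tlb_one_hot`, `l1d_lookup`, `tag_array`) are hypothetical.

```python
NUM_TLB_ENTRIES = 32
NUM_WAYS = 4
LINE_SIZE = 64                       # bytes per L1D line
WAY_SIZE = 4096                      # 16 KB / 4 ways
NUM_SETS = WAY_SIZE // LINE_SIZE     # 64 sets
PAGE_SIZE = 4096                     # assumption: smallest page size only


def tlb_one_hot(tlb_vpns, vaddr):
    """Return a one-hot vector marking the matching TLB entry (0 on a TLB miss)."""
    vpn = vaddr // PAGE_SIZE
    for i, entry_vpn in enumerate(tlb_vpns):
        if entry_vpn == vpn:
            return 1 << i            # bit i set: pointer to the i-th TLB entry
    return 0


def l1d_lookup(tlb_vpns, tag_array, vaddr):
    """Return the hitting way for vaddr, or None on a miss.

    tag_array[set][way] holds a one-hot pointer to a TLB entry (0 = invalid)
    instead of the upper bits of a physical address.
    """
    # The set index uses only untranslated bits, because WAY_SIZE <= PAGE_SIZE.
    set_index = (vaddr % WAY_SIZE) // LINE_SIZE
    tlb_vector = tlb_one_hot(tlb_vpns, vaddr)
    for way in range(NUM_WAYS):
        # Two one-hot vectors match exactly when their bitwise AND is non-zero.
        if tag_array[set_index][way] & tlb_vector:
            return way
    return None
```

A non-zero AND of two one-hot vectors means both point to the same TLB entry, which is why no wide equality comparator is needed in this model.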

The multi-banked unified L2 cache is a complex, non-blocking out-of-order design. All memory operations that access the L2 cache (L1 misses and all stores) allocate into a 32-entry queuing structure. In each cycle, up to four independent and non-bank-conflicted requests are selected from this queue and issued to the L2 array. The issue logic enforces all architectural memory ordering requirements (semaphore instructions etc.), thus the L2 queue can be regarded as the “central clearing house for all address transactions” in the memory hierarchy [RG02].
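The per-cycle selection of non-bank-conflicted requests can be pictured with the following minimal sketch. It is not the actual issue logic: the bank count, the bank width, and the oldest-first selection order are assumptions made for illustration, dependences and memory-ordering rules are ignored, and the names `bank_of` and `select_for_issue` are hypothetical.

```python
MAX_ISSUE_PER_CYCLE = 4   # from the text: up to four requests per cycle
NUM_BANKS = 16            # assumption for illustration
BANK_WIDTH = 16           # bytes per bank, also an assumption


def bank_of(addr):
    """Map an address to a bank by interleaving on BANK_WIDTH-byte chunks."""
    return (addr // BANK_WIDTH) % NUM_BANKS


def select_for_issue(queue):
    """Pick up to four requests, oldest first, that target distinct banks.

    `queue` is a list of request addresses ordered from oldest to youngest.
    Returns the queue positions selected for issue in this cycle.
    """
    used_banks = set()
    selected = []
    for pos, addr in enumerate(queue):
        bank = bank_of(addr)
        if bank in used_banks:
            continue              # bank conflict: the request waits in the queue
        used_banks.add(bank)
        selected.append(pos)
        if len(selected) == MAX_ISSUE_PER_CYCLE:
            break
    return selected
```

In this model a younger request can issue ahead of an older, bank-conflicted one, which is the out-of-order behaviour that the queue makes possible.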

The dynamic nature of the L2 cache design makes a precise specification of the access latency impossible. Each read that passes through the L2 queue takes at least 9 cycles. However, there is a feature that allows a request to bypass the queue and issue directly to the L2 data array (provided, inter alia, that there are no dependences on older operations in the queue), enabling a 5- or 7-cycle read latency. These bypasses and further details of the L2 cache design are described in [Int04].

Loads can additionally be delayed if they cause TLB misses: Loads that miss in the L1 DTLB also miss in the L1D cache; if they hit in the L2 and in the 128-entry L2 DTLB, they incur a 4-cycle penalty in addition to the L2 cache latency. An L2 DTLB miss initiates a hardware page walker (HPW) to perform page look-ups, which costs at least 25 cycles.
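The latency figures given so far can be combined into a rough lower bound for an integer load that hits in the L1D or the L2. The sketch below only restates the numbers from the text; the function name is hypothetical, the bypass case uses the 5-cycle value, and adding the page-walk cost on top of the L2 latency is an assumption, since the text only states that the walk costs at least 25 cycles.

```python
L1D_LATENCY = 1              # cycles, L1D hit
L2_MIN_LATENCY_BYPASS = 5    # cycles, request bypasses the L2 queue
L2_MIN_LATENCY_QUEUED = 9    # cycles, request passes through the L2 queue
L2_DTLB_HIT_PENALTY = 4      # cycles, L1 DTLB miss that hits in the L2 DTLB
HPW_MIN_PENALTY = 25         # cycles, minimum cost of a hardware page walk


def min_integer_load_latency(l1_dtlb_hit, l1d_hit, l2_dtlb_hit=True, bypass=True):
    """Lower bound (in cycles) for an integer load served by the L1D or the L2."""
    # A load that misses in the L1 DTLB also misses in the L1D.
    if l1_dtlb_hit and l1d_hit:
        return L1D_LATENCY
    l2_latency = L2_MIN_LATENCY_BYPASS if bypass else L2_MIN_LATENCY_QUEUED
    if l1_dtlb_hit:
        return l2_latency
    if l2_dtlb_hit:
        return l2_latency + L2_DTLB_HIT_PENALTY
    # Assumption: the page-walk cost is added to the L2 latency.
    return l2_latency + HPW_MIN_PENALTY
```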

Other considerable performance penalties can arise from interference between ambiguous memory accesses in the cache system. This can happen if several loads and stores that access overlapping memory areas are issued in the same cycle (or in consecutive cycles). In these cases, the penalties with respect to the L1D cache are as follows:

• There will be no conflicts between two loads, or between a load preceding a store in an issue group, if the load(s) hit(s) in the L1D.

• If a store precedes a memory dependent load, the store data must be forwarded to the load. This costs 17, 3-5, 3, and 1-3 cycles if the store is executed 0, 1, 2, and 3 cycles before the load, respectively. In the first two cases, only the lower 12 bits are used for the address comparison. The 17 cycle delay occurs since both requests are passed to the L2 in this case and conflict with each other there. To avoid these penalties completely, the store and load must be separated by at least four cycles.

• Two stores can conflict under circumstances described in [Int04] since the L1D is only pseudo-dual ported for write accesses. Then the younger store will wait in a store buffer; the L1D will stall if this buffer is full.

It is important that these penalties are allowed for during scheduling [CL03]. The L2 conflict conditions are detailed in [Int04].
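As an illustration of how a scheduler might take the store-to-load penalties into account, the following sketch encodes the figures quoted above as a lookup table. The function names are hypothetical, and the conflict check is a simplification: it compares single addresses and ignores access sizes, so it does not detect partially overlapping accesses.

```python
# (min, max) penalty in cycles, indexed by how many cycles the store precedes the load.
FORWARDING_PENALTY = {0: (17, 17), 1: (3, 5), 2: (3, 3), 3: (1, 3)}
LOW_12_BITS = 0xFFF


def forwarding_penalty(distance):
    """Return the (min, max) penalty for a dependent load issued `distance` cycles after the store."""
    if distance < 0:
        raise ValueError("the store must not be younger than the load")
    # A separation of four or more cycles avoids the penalty completely.
    return FORWARDING_PENALTY.get(distance, (0, 0))


def addresses_conflict(store_addr, load_addr, distance):
    """Simplified conflict check between one store address and one load address."""
    if distance in (0, 1):
        # Only the lower 12 bits are compared, so false conflicts are possible.
        return (store_addr & LOW_12_BITS) == (load_addr & LOW_12_BITS)
    return store_addr == load_addr
```

For example, a store to 0x1040 and a load from 0x5040 are treated as conflicting when they are one cycle apart, because both addresses share the same lower 12 bits, but not when they are two cycles apart.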

The large on-chip L3 cache is optimized for density. It consumes more than half of the processor area and is tiled into 140 subarrays to fit the irregular shape of the core. It is a pipelined, non-blocking design that has its own queue to support up to eight outstanding requests. The minimum read latency is 14 cycles [SR03].

The extensive cache system reflects the necessity to minimize the load latencies on this in-order processor: in total, up to 54 accesses can be active throughout the memory hierarchy without stalling the execution pipeline. The design team focused their resources on the cache hierarchy instead of the pipeline, which has a relatively simple design [MS03].

In document Optimal Global Instruction Scheduling for the Itanium® Processor Architecture (Page 52-55)