Store Hit Pipeline - MPC7450 RISC Microprocessor Family Software Optimization Guide

Before its data is written to the cache, a store passes through several queues. A store instruction must first go through E0 and E1 for address generation and translation. It is then placed in the three-entry finished store queue (FSQ). When the store is the oldest instruction, it can access the store data and update architecture-defined resources (store serialization). From this point on, the store is considered part of the architectural state.

However, before the data reaches the data cache, two write-back stages (WB0 and WB1) are needed to acquire the store data and transfer it from the FSQ to the five-entry committed store queue (CSQ). Arbitration into the data cache from the CSQ is pipelined, so a throughput of one store per cycle can be maintained. While it arbitrates for and writes the data cache, a store stays in the CSQ for at least four cycles. Table 26 shows the pipelining of four stw instructions to the data cache.

Table 25. Load Hit Pipeline Example

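| Instr. No. | Instruction | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lfdu | D | I | E0 | E1 | E2 | E3/C | | | | | | | | |
| 1 | fadd | D | I | — | — | — | — | E0 | E1 | E2 | E3 | E4 | F | C | |
| 2 | lwzu | — | D | I | E0 | E1 | E2 | — | — | — | — | — | — | C | |
| 3 | add | — | D | I | — | — | — | E | — | — | — | — | — | C | |
| 4 | subf | F2 | D | I | — | E | — | — | — | — | — | — | — | — | C |
| 5 | lvewx | F2 | — | D | I | E0 | E1 | E2 | — | — | — | — | — | — | C |
| 6 | vaddsws | F2 | — | D | I | — | — | — | E | F | — | — | — | — | C |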

Table 26. Store Hit Pipeline Example

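| Instruction | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| stw | D | I | E0 | E1 | FSQ0/C | WB0 | WB1 | CSQ0 | CSQ0 | CSQ0 | CSQ0 | | | |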

Because floating-point stores are not fully pipelined, the bottleneck is at the FSQ, where only one floating-point store can be executed every 3 cycles. See Table 27 for an example execution of four stfd instructions. Vector stores do not have this problem and are fully pipelined (similar to the integer stores as shown in Table 26).

To avoid floating-point store throughput bottlenecks, strings of back-to-back floating-point stores (like that shown in Table 27) should be avoided. Instead, floating-point stores should be mixed with other instructions wherever possible. For maximum store throughput, vector stores should be used.
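The size of this throughput gap can be estimated with simple arithmetic. The sketch below is illustrative only: `commit_cycle` and `FIRST_COMMIT_CYCLE` are hypothetical names, and the model assumes back-to-back stores of a single type, a first store that reaches FSQ0/C in cycle 4 (as in Tables 26 and 27), and no other stalls.

```python
# Assumption: the first store of the sequence reaches FSQ0/C in cycle 4,
# as it does in the first row of Tables 26 and 27.
FIRST_COMMIT_CYCLE = 4


def commit_cycle(index, fsq_interval):
    """Estimate the cycle in which store number `index` (0-based) reaches FSQ0/C.

    fsq_interval is the number of cycles between stores leaving the FSQ:
    1 for integer and vector stores, 3 for floating-point stores.
    """
    return FIRST_COMMIT_CYCLE + index * fsq_interval


# Four back-to-back integer (or vector) stores versus four floating-point stores.
int_cycles = [commit_cycle(i, 1) for i in range(4)]
fp_cycles = [commit_cycle(i, 3) for i in range(4)]

print("integer/vector:", int_cycles)
print("floating-point:", fp_cycles)
```

Under these assumptions the fourth stfd reaches FSQ0/C in cycle 13 rather than cycle 7, which agrees with the FSQ0/C positions in Table 27. Instructions placed between floating-point stores can use the cycles that would otherwise be spent waiting on the FSQ.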

9.5 Store Gathering and Merging

The MPC7450 implements two techniques to improve store performance by coalescing adjacent entries in the CSQ. Store gathering refers to coalescing adjacent cache-inhibited or write-through stores; store merging refers to coalescing adjacent cacheable write-back stores. Note that these two techniques are used only when the bottom CSQ entry is processing a cache miss or sending a store request to the memory subsystem. In such a situation, the bottom entry itself is not eligible for any coalescing operations, but all other CSQ entries are examined.

Table 26. Store Hit Pipeline Example (continued)

| Instruction | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| stw | — | — | D | I | E0 | E1 | FSQ0/C | WB0 | WB1 | CSQ2 | CSQ2 | CSQ1 | CSQ0 | |
| stw | — | — | — | D | I | E0 | E1 | FSQ0/C | WB0 | WB1 | CSQ3 | CSQ2 | CSQ1 | CSQ0 |

Table 27. Execution of Four stfd Instructions

| Instr. No. | Instruction | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | stfd | D | I | E0 | E1 | FSQ0/C | WB0 | WB1 | CSQ0 | CSQ0 | CSQ0 |
| 1 | stfd | — | D | I | E0 | E1 | FSQ0 | FSQ0 | FSQ0/C | WB0 | WB1 |
| 2 | stfd | — | — | D | I | E0 | E1 | FSQ1 | FSQ1 | FSQ0 | FSQ0 |
| 3 | stfd | — | — | — | D | I | E0 | E1 | FSQ2 | FSQ1 | FSQ1 |

| Instr. No. | Instruction | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | stfd | CSQ0 | | | | | | | | | |
| 1 | stfd | CSQ1 | CSQ0 | CSQ0 | CSQ0 | | | | | | |
| 2 | stfd | FSQ0/C | WB0 | WB1 | CSQ1 | CSQ0 | CSQ0 | CSQ0 | | | |
| 3 | stfd | FSQ1 | FSQ0 | FSQ0 | FSQ0/C | WB0 | WB1 | CSQ1 | CSQ0 | CSQ0 | CSQ0 |

The throughput of cache-inhibited or write-through stores is usually limited by the system address bus bandwidth. With store gathering enabled (HID0[SGE] = 1), cache-inhibited or write-through stores may be combined into larger transactions. If the bottom entry of the CSQ is processing a cacheable store miss or sending a store request on to the memory subsystem, the processor examines the remaining CSQ entries for store gathering. Any set of adjacent entries in the CSQ is gathered into one transaction if the entries are aligned, the same size, to the same or adjacent addresses, either cache-inhibited or write-through, and the result is aligned. When the MPC7450 is on a system bus supporting the MPX protocol, this gathering may continue up to a 32-byte store request. On a 60x bus, the MPC7450 does not gather beyond a 64-bit transaction. Under ideal conditions, a stream of write-through or cache-inhibited stores to sequential addresses reduces store transactions on the system bus by a factor of four. Note that cache-inhibited guarded stores are never gathered.
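The gathering conditions listed above can be expressed as a pairwise check. This is a minimal sketch, not a description of the hardware logic: the `Store` record, `can_gather`, and `max_bytes` are hypothetical names, access sizes are assumed to be powers of two, and "either cache-inhibited or write-through" is read as a requirement on each entry individually. Pass `max_bytes=32` to model an MPX bus or `max_bytes=8` to model a 60x bus.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """Hypothetical model of one CSQ entry (not a hardware structure)."""
    addr: int
    size: int  # bytes; assumed to be a power of two
    cache_inhibited: bool = False
    write_through: bool = False
    guarded: bool = False


def can_gather(a, b, max_bytes=32):
    """Return True if two adjacent CSQ entries pass the gathering checks."""
    for s in (a, b):
        # Only cache-inhibited or write-through stores are gathered.
        if not (s.cache_inhibited or s.write_through):
            return False
        # Cache-inhibited guarded stores are never gathered.
        if s.cache_inhibited and s.guarded:
            return False
        # Each store must be naturally aligned.
        if s.addr % s.size != 0:
            return False
    if a.size != b.size:
        return False
    if a.addr == b.addr:
        return True  # same address: the result keeps the same size and alignment
    if abs(a.addr - b.addr) != a.size:
        return False  # neither the same nor adjacent addresses
    combined = 2 * a.size
    # The combined transaction must fit the bus limit and be aligned itself.
    return combined <= max_bytes and min(a.addr, b.addr) % combined == 0


w0 = Store(0x1000, 4, write_through=True)
w1 = Store(0x1004, 4, write_through=True)
w2 = Store(0x1008, 4, write_through=True)
print(can_gather(w0, w1))  # adjacent words forming an aligned doubleword
print(can_gather(w1, w2))  # adjacent words, but the pair is not 8-byte aligned
```

The second pair fails only the final check: the two words are adjacent, but the combined 8-byte transaction would start at an address that is not 8-byte aligned.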

The throughput of cacheable stores that miss in the L1 is limited by the latency to the L2 or L3 caches and the memory latency. When store gathering is enabled (HID0[SGE] = 1), cacheable write-back stores may also be combined. If the bottom entry of the CSQ is processing a cacheable store miss or sending a store request to the memory subsystem, any other adjacent entries in the CSQ are merged into one transaction if they are both to the same 32-byte granule, are both cacheable and write-back, and are waiting to access the L1 or have already missed in the L1 cache. For store merging, the size and alignment restrictions are relaxed, because cacheable stores are always performed by writing bytes to the L1 (if the data L1 hits) or merging bytes with reload data (if the data L1 misses).
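Because merging operates on bytes within a 32-byte granule, it can be pictured as OR-ing byte masks. The sketch below is a simplified model under stated assumptions: `same_granule`, `byte_mask`, and `merge_masks` are hypothetical helpers, and each store is assumed not to cross a granule boundary.

```python
GRANULE_BYTES = 32  # merge granule size given in the text


def same_granule(addr_a, addr_b):
    """Return True if both addresses fall in the same 32-byte granule."""
    return addr_a // GRANULE_BYTES == addr_b // GRANULE_BYTES


def byte_mask(addr, size):
    """Return a bit mask with one bit per byte that the store writes in its granule."""
    offset = addr % GRANULE_BYTES
    if offset + size > GRANULE_BYTES:
        raise ValueError("store crosses a 32-byte granule boundary")
    return ((1 << size) - 1) << offset


def merge_masks(stores):
    """Combine (addr, size) stores to one granule into a single byte mask."""
    first_addr = stores[0][0]
    mask = 0
    for addr, size in stores:
        if not same_granule(first_addr, addr):
            raise ValueError("stores are not in the same 32-byte granule")
        mask |= byte_mask(addr, size)
    return mask


# A halfword store at an odd address and a word store; sizes and alignments differ.
merged = merge_masks([(0x1001, 2), (0x1004, 4)])
print(hex(merged))
```

Unlike the gathering case, the two stores here differ in size and one is misaligned, yet both contribute bytes to the same granule.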

In document MPC7450 RISC Microprocessor Family Software Optimization Guide (Page 36-38)