Implementation Issues - Scalable Load and Store Queues


3.3.2 Implementation Issues

Partial forwardings. Partial forwarding is the situation in which a wide load forwards from a narrow store. Here, the load’s value must be constructed by combining values from the narrow store with some combination of other older stores and/or the data cache. Conventional processors typically delay loads that require partial forwarding until the partially matching store completes to the data cache [37]. In a load latency tolerant design, stalling such loads is particularly undesirable. Stalling a load means that it and its dependents block the issue queue and speculative retirement. In addition to the direct performance impact (younger instructions cannot enter and execute), this may cause a resource deadlock. Specifically, older miss-dependent instructions may need to acquire issue queue entries and physical registers whose reclamation is blocked by the stalled load, and the store for which the load is stalling may be unable to complete until these instructions re-execute. Breaking this deadlock requires squashing tail instructions until forward progress can be made.

A chained store buffer (CSB) makes it easy to execute a load even if it partially forwards. As the load traverses the CSB, it constructs the appropriate value piece by piece. This construction requires a register to hold the value constructed so far, and a bit mask to indicate which bytes have been filled in. Traversal stops when either the entire value has been constructed or an SSN less than or equal to SSNstore-complete is encountered. In the latter case, the value read from the data cache is used to fill in the remaining bytes of the load’s output value.
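The traversal described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BOLT's actual logic: the `Store` record, the `load_via_csb` function, the dictionary-based store buffer and cache, and the convention that SSNs start at 1 (so a link of 0 means "no older store") are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Store:
    addr: int     # starting byte address of the store
    data: bytes   # bytes written by the store
    link: int     # SSN of the next-older store in the same bucket (0 = none)

def load_via_csb(addr, size, root_ssn, stores, ssn_complete, cache):
    """Build a load value byte by byte while walking one bucket's chain.

    stores maps SSN -> Store (assumed layout); cache maps byte address -> int.
    """
    value = bytearray(size)   # register holding the value constructed so far
    filled = 0                # bit mask: bit i set => value[i] already supplied
    full = (1 << size) - 1
    ssn = root_ssn
    # Stop when every byte is filled or a completed store is reached.
    while filled != full and ssn > ssn_complete:
        st = stores[ssn]
        for i, b in enumerate(st.data):
            off = st.addr + i - addr
            if 0 <= off < size and not (filled >> off) & 1:
                value[off] = b          # the youngest matching store wins
                filled |= 1 << off
        ssn = st.link
    # Any byte still missing comes from the data cache.
    for off in range(size):
        if not (filled >> off) & 1:
            value[off] = cache[addr + off]
    return bytes(value)
```

For instance, a 4-byte load at 0x10 that finds a 2-byte store at 0x10 and a 1-byte store at 0x12 in the chain takes three bytes from the store buffer and the fourth from the cache.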

Partial forwardings may occur both for re-executing deferred loads, which search the CSB as they re-execute, and for executing tail loads, which do not. When a tail load attempts to forward using SQIP but cannot because of a partial forwarding, it waits until the partially matching predicted store (speculatively) retires, which is not the same as waiting for that store to complete. The load then executes by searching the CSB.

Un-aligned loads and stores. Un-aligned loads and stores may span two CSB buckets. These require special handling as they do not fit cleanly into the chaining algorithm. An un-aligned load that spans two CSB buckets traverses the chained store buffer twice, once for each bucket, and the output value is constructed piece by piece as in the partial forwarding case.

Stores which span two root table buckets introduce more complexity. Not only do such stores require writing to two root table buckets, they also logically have two links. One option is to provision the store buffer with two link fields to accommodate these stores. This option is simplest, but over-provisions link field capacity and root table bandwidth for the uncommon case. In BOLT, an un-aligned store that spans two buckets reads only the root table bucket for its starting address, but writes its SSN in both buckets. The store then links properly only in the first bucket. As a load traverses the CSB, it checks whether the store’s link field corresponds to its own bucket by checking the store’s starting address. If the store’s link field does not correspond to the load’s bucket, then the load reverts to linearly scanning the store buffer until it finds a store that belongs to its bucket again. At this point, it resumes chaining. Fortunately, bucket-spanning un-aligned loads and stores are rare.
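The following Python sketch illustrates the bucket-spanning store policy described above. It is a simplified model under several assumptions that are not taken from the text: buckets are 8 bytes wide and indexed by `addr // WORD` without hashing, SSNs start at 1, completed stores are ignored, and the linear scan stops at the first older store that touches the load's bucket. The names `CSB`, `insert`, and `visit_order` are hypothetical.

```python
from dataclasses import dataclass

WORD = 8  # assumed bucket width in bytes; bucket index = addr // WORD

@dataclass
class Store:
    addr: int
    size: int
    link: int = 0   # SSN of the next-older store in the starting bucket; 0 = none

def buckets_of(addr, size):
    """Set of buckets touched by an access (one or two entries)."""
    return {addr // WORD, (addr + size - 1) // WORD}

class CSB:
    def __init__(self):
        self.stores = {}    # SSN -> Store
        self.root = {}      # bucket -> SSN of the youngest store written there
        self.next_ssn = 1

    def insert(self, addr, size):
        ssn = self.next_ssn
        self.next_ssn += 1
        # Only the root entry of the starting bucket is read for the link ...
        link = self.root.get(addr // WORD, 0)
        self.stores[ssn] = Store(addr, size, link)
        # ... but the SSN is written to every bucket the store touches.
        for b in buckets_of(addr, size):
            self.root[b] = ssn
        return ssn

    def visit_order(self, bucket):
        """SSNs that a load to `bucket` examines, youngest first."""
        visited = []
        ssn = self.root.get(bucket, 0)
        while ssn > 0:
            st = self.stores[ssn]
            visited.append(ssn)
            if st.addr // WORD == bucket:
                ssn = st.link   # link is valid for this bucket: keep chaining
            else:
                # Link belongs to the other bucket: scan linearly toward older
                # stores until one touches this bucket again.
                ssn -= 1
                while ssn > 0 and bucket not in buckets_of(
                        self.stores[ssn].addr, self.stores[ssn].size):
                    ssn -= 1
        return visited
```

In this model a load only pays for the linear scan when it actually encounters a spanning store whose starting address lies in the other bucket, which matches the observation that the slow path is rare.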

Optimization for sub-word writes. The word granularity of the CSB poses two performance problems in the presence of consecutive narrow (sub-word) writes. Both problems arise from the fact that consecutive sub-word writes all map to the same hash table bucket, effectively forming a large number of collisions. The first problem occurs when a load maps to a bucket that contains a chain of sub-word writes but does not match their address. Here, the load must traverse each of the narrow stores serially. In the worst case, eight sequential single-byte writes would add eight cycles to the search latency of the load. The second problem arises when a narrow load which matches an older store to a word must traverse the younger stores to the same word before finding its match. An example is byte stores to 0x10, 0x11, 0x12, ..., 0x17, followed by a byte load to 0x10. Here, the load must traverse the stores to 0x17 down to 0x11 before finding the store to 0x10.

BOLT solves the second problem first by adding a single “run” bit per store buffer entry. This bit is set if the preceding store buffer entries form a contiguous run of same-size sub-word stores to consecutive addresses which start at a bucket boundary. The CSB-insertion logic tracks the address and data size of the most recently inserted store to determine whether to set or clear the “run” bit in each entry. The most recent address and data size registers are set to “invalid” values when no valid run is in progress. The rules for manipulating these registers and the “run” bit are shown in Table 3.1. When the “run” bit in a store buffer entry is set, a load may skip to the correct store within a word by directly calculating the matching store’s index, rather than examining each store cycle by cycle. The “run” bit for the store which starts the contiguous run is clear because there are no younger stores whose index may be calculated from it.

| Description | Condition | Run bit | lastAddr | lastDsize |
|---|---|---|---|---|
| Continuation | st.addr == lastAddr + lastDsize && st.dsize == lastDsize && SameBucket(st.addr, lastAddr) | Set | st.addr | (same) |
| Start of run | StartOfBucket(st.addr) | Clear | st.addr | st.dsize |
| Not a run | Otherwise | Clear | Invalid | Invalid |

Table 3.1: Rules for tracking the “run” bit in the CSB. Rules are matched in order.
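The rules of Table 3.1 can be read as a small state machine. The Python sketch below is one possible rendering, assuming 8-byte buckets and using `None` to model the “invalid” register values; the class `RunTracker` and the helper `matching_ssn` are hypothetical names. The index formula in `matching_ssn` is an assumption as well: it relies on the stores of a run occupying consecutive SSNs and is not spelled out in the text.

```python
WORD = 8  # assumed bucket width in bytes

class RunTracker:
    """Tracks lastAddr / lastDsize and yields the run bit for each new store."""

    def __init__(self):
        self.last_addr = None    # None models the "invalid" value
        self.last_dsize = None

    def insert(self, addr, dsize):
        """Apply the Table 3.1 rules in order and return the run bit."""
        continuation = (
            self.last_addr is not None
            and addr == self.last_addr + self.last_dsize
            and dsize == self.last_dsize
            and addr // WORD == self.last_addr // WORD   # SameBucket
        )
        if continuation:
            self.last_addr = addr        # lastDsize stays the same
            return True
        if addr % WORD == 0:             # StartOfBucket
            self.last_addr, self.last_dsize = addr, dsize
            return False
        self.last_addr = self.last_dsize = None   # not a run
        return False

def matching_ssn(ssn, st_addr, dsize, load_addr):
    """SSN of the run member covering load_addr, given a younger run member.

    Assumes load_addr <= st_addr and that both lie in the same bucket.
    Floor division rounds toward the older store for loads inside an element.
    """
    return ssn + (load_addr - st_addr) // dsize
```

With byte stores to 0x10 through 0x17 at SSNs 1 through 8, a byte load to 0x10 that first reaches the store to 0x17 would compute SSN 1 directly under these assumptions, instead of stepping through seven intermediate entries.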

With same-word stores in contiguous runs being searchable by SSN arithmetic, their SSNlink fields no longer need to reference each other. BOLT solves the first problem by reusing the SSNlink of the oldest store in the sequence for the remaining stores in the sequence. Specifically, in the case of eight byte-size stores to 0x10, 0x11, 0x12, ..., 0x17, the store to 0x10 reads the root table as usual, but the stores to 0x11, 0x12, ..., 0x17 do not. Instead, as CSB insertion determines that they are part of a run, they receive the same SSNlink as the store to 0x10. A colliding load to a non-matching address then simply skips over all eight stores at once.
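The effect of link sharing on a colliding, non-matching load can be illustrated with a small Python model. Everything here is a sketch under assumptions: the 4-entry root table and modulo hash are chosen only to force collisions, SSNs start at 1, the root table is assumed to always point at the youngest store in a bucket, and `build` and `chain_hops` are hypothetical helper names.

```python
from dataclasses import dataclass

WORD = 8          # assumed bucket width in bytes
NUM_BUCKETS = 4   # hypothetical tiny root table so that distinct words collide

@dataclass
class Store:
    addr: int
    dsize: int
    link: int     # SSN of the next store a chained search visits; 0 = end

def bucket(addr):
    return (addr // WORD) % NUM_BUCKETS

def build(store_list, share_links=True):
    """Insert (addr, dsize) stores in program order; return (stores, root)."""
    stores, root, last = {}, {}, None   # last = (addr, dsize) of run in progress
    for ssn, (addr, dsize) in enumerate(store_list, start=1):
        in_run = (last is not None and addr == last[0] + last[1]
                  and dsize == last[1] and addr // WORD == last[0] // WORD)
        if in_run and share_links:
            link = stores[ssn - 1].link       # reuse the run start's link
        else:
            link = root.get(bucket(addr), 0)  # ordinary root table read
        last = (addr, dsize) if (in_run or addr % WORD == 0) else None
        stores[ssn] = Store(addr, dsize, link)
        root[bucket(addr)] = ssn              # assumed: root tracks the youngest
    return stores, root

def chain_hops(stores, root, load_addr):
    """Entries examined by a load that matches nothing in its bucket."""
    hops, ssn = 0, root.get(bucket(load_addr), 0)
    while ssn > 0:
        hops += 1
        ssn = stores[ssn].link
    return hops
```

Building the buffer from one older store to 0x30 followed by byte stores to 0x10 through 0x17, a non-matching load to 0x50 (which hashes to the same bucket in this model) examines two entries with shared links, compared with nine when every store links to its immediate predecessor.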

In document Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era (Page 79-82)