Advanced Synchronization Facility (ASF) - Colocating Application Data and TM Metadata


7.1.1 Advanced Synchronization Facility (ASF)

AMD’s Advanced Synchronization Facility is a proposal [2] for HTM extensions to x86-64 CPUs. It essentially provides hardware support for the speculative execution of regions of code. These speculative regions are similar to transactions in that they take effect atomically and can access memory transactionally using speculative loads and stores. ASF provides selective annotation, which means that nonspeculative memory accesses are supported within transactions (including nonspeculative atomic instructions) and speculative memory accesses have to be explicitly marked as such.

ASF is a best-effort design that would be feasible to implement in high-volume microprocessors. It is more advanced than the designs described previously. For example, unlike with Rock’s HTM, TLB misses do not abort transactions. It comes with a number of limitations, of course. The number of disjoint locations that can be accessed in a speculative region is limited, depending on ASF’s implementation variant, either by the size of speculation buffers (which are expensive and thus have been designed with small capacity) or by the size and associativity of caches (when tracking speculative state in caches). It follows that speculative accesses and concurrency control have cache line granularity. ASF transactions are not virtualized and therefore abort on events such as context switches or page faults. However, page faults triggered in speculative regions will be visible to the operating system after the abort, so a custom replay of the fault by the TM runtime is not necessary to be able to retry the hardware transaction.

The original aim behind ASF was to make concurrent nonblocking programming easier and faster by providing an atomic read–modify–write operation for more than a single memory location. To that end, ASF ensures eventual forward progress in the absence of contention and exceptions if a speculative region does not access more than four distinct cache lines.² This guarantee frees programmers from having to always provide a software fallback path that does not use ASF even if the speculative region is small. Note that in a general-purpose TM, we need to always provide the software fallback because we cannot make assumptions about the transactions.
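The role of the software fallback in a general-purpose TM can be sketched as a bounded retry loop. The following Python model is only an illustration under stated assumptions: `try_hardware`, `run_fallback`, `make_attempt`, and the retry bound `MAX_RETRIES` are hypothetical names chosen for this sketch and are not part of ASF or of any TM runtime.

```python
from typing import Callable, Iterable

MAX_RETRIES = 4  # hypothetical retry bound, chosen only for this sketch


def run_transaction(try_hardware: Callable[[], bool],
                    run_fallback: Callable[[], None],
                    max_retries: int = MAX_RETRIES) -> str:
    """Retry a best-effort hardware attempt, then fall back to software.

    try_hardware returns True if the speculative region committed and
    False if it aborted.
    """
    for _ in range(max_retries):
        if try_hardware():
            return "hardware"
    # No forward-progress guarantee applies here, so use the fallback path.
    run_fallback()
    return "software"


def make_attempt(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Build a try_hardware stub that replays a fixed list of outcomes."""
    it = iter(outcomes)
    return lambda: next(it)
```

A region that stays within the four-line guarantee could in principle omit `run_fallback`; the sketch keeps it because a general-purpose TM cannot assume this about its transactions.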

In what follows, I will summarize ASF’s properties [14]. More information can be found there, in ASF’s specification [2], and in two papers about its internals and the background of the design [31, 15].

ISA extensions. The new instructions that ASF provides allow for entering and leaving speculative regions (SPECULATE, COMMIT, and ABORT) and accessing protected memory locations (i. e., memory locations that can be read and written speculatively and which abort the speculative region if accessed concurrently by another thread: LOCK MOV, WATCHR, WATCHW, and RELEASE). All of these instructions are available in all system modes (user, kernel; virtual-machine guest, host).

² Eventual means that there may be transient conditions that lead to spurious aborts, but eventually the speculative region will succeed when retried continuously. The expectation is that spurious aborts almost never occur and speculative regions succeed the first time in the vast majority of cases.

    // DCAS operation:
    //   if ((mem1 == rax) && (mem2 == rbx)) {
    //     mem1 = rdi; mem2 = rsi;
    //     rcx = 0;
    //   } else {
    //     rax = mem1; rbx = mem2;
    //     rcx = 1;
    //   }
    DCAS:
        mov %rax, %r8             // Save expected value for mem1
        mov %rbx, %r9             // Save expected value for mem2
    retry:
        SPECULATE                 // SR begins
        jnz retry                 // Restart SR after aborts
        mov $1, %rcx              // Default result, overwritten on success
        lock mov (mem1), %r10     // Speculative load of mem1
        lock mov (mem2), %rbx     // Speculative load of mem2
        cmp %r8, %r10
        jnz cmpfail
        cmp %r9, %rbx
        jnz cmpfail
        lock mov %rdi, (mem1)     // Speculative store to mem1
        lock mov %rsi, (mem2)     // Speculative store to mem2
        xor %rcx, %rcx            // Success indication
    cmpfail:
        COMMIT
        mov %r10, %rax            // rax = value read from mem1

Figure 7.1: An example implementation [14] of a DCAS operation using ASF.

Speculative regions are started using the SPECULATE instruction. When a speculative region is aborted, execution resumes at the instruction following the SPECULATE instruction (with a matching error code in the rAX register, which allows clients to handle aborts in a custom way). COMMIT and ABORT both finish the execution of a speculative region: COMMIT makes all speculative modifications instantly visible to all other CPUs, whereas ABORT discards these modifications. Flat nesting is used for nested speculative regions.
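The commit, abort, and flat-nesting behavior just described can be illustrated with a small single-threaded model. This is a sketch, not ASF's implementation: the class `SpeculativeRegion` and its methods are hypothetical, and it assumes the usual meaning of flat nesting (inner regions are subsumed by the outermost one, so only the outermost COMMIT publishes and an abort at any depth discards everything).

```python
class SpeculativeRegion:
    """Toy model of SPECULATE/COMMIT/ABORT with flat nesting."""

    def __init__(self, memory):
        self.memory = memory   # committed state: address -> value
        self.depth = 0         # current nesting depth of SPECULATE
        self.buffer = {}       # uncommitted speculative stores

    def speculate(self):
        self.depth += 1

    def store(self, addr, value):
        assert self.depth > 0, "speculative store outside a region"
        self.buffer[addr] = value

    def load(self, addr):
        # A region observes its own speculative stores first.
        return self.buffer.get(addr, self.memory.get(addr))

    def commit(self):
        self.depth -= 1
        if self.depth == 0:
            # Only the outermost COMMIT makes the stores visible.
            self.memory.update(self.buffer)
            self.buffer.clear()

    def abort(self):
        # Flat nesting: an abort at any depth discards the whole region.
        self.buffer.clear()
        self.depth = 0
```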

In a speculative region, speculative/protected memory accesses can be expressed in the form of ASF-specific LOCK MOV CPU instructions, and can be mixed with ordinary nonspeculative/unprotected accesses (MOV). First, this selective annotation allows the TM or the programmer to use speculative accesses sparingly and thus preserve precious ASF capacity. Second, the availability of nonspeculative atomic instructions allows us to use common concurrent programming techniques during a transaction, which enables novel HyTMs (see Section 7.3) and can reduce the number of transaction aborts due to benign contention (e. g., when updating a TM-internal, shared counter). In a speculative region, nonspeculative loads are allowed to read state that is speculatively updated in the same speculative region, but nonspeculative stores must not overlap with previous speculative accesses.

ASF also provides CPU instructions for just monitoring a cache line for concurrent stores (LOCK PREFETCH) or loads and stores (LOCK PREFETCHW), and for stopping monitoring a cache line (RELEASE).

Figure 7.1 shows a simplified example of a double CAS (DCAS) operation implemented using ASF.
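As a reading aid, the intended result of the DCAS operation can be written as a sequential reference model. The Python function below is only a sketch that follows the pseudocode comment in Figure 7.1; the dictionary-based `mem` and the argument names are assumptions of this illustration, and the function is not atomic by itself (atomicity is exactly what the speculative region adds).

```python
def dcas(mem, addr1, addr2, rax, rbx, rdi, rsi):
    """Sequential model of DCAS; returns (rcx, rax, rbx).

    rcx is 0 on success and 1 on failure. On failure, rax and rbx
    carry the values that were found in memory.
    """
    if mem[addr1] == rax and mem[addr2] == rbx:
        mem[addr1], mem[addr2] = rdi, rsi
        return 0, rax, rbx
    return 1, mem[addr1], mem[addr2]
```

For example, with `mem = {"m1": 1, "m2": 2}`, calling `dcas(mem, "m1", "m2", 1, 2, 10, 20)` updates both locations, whereas a second call with the same expected values leaves memory unchanged and reports the current contents.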

Speculative region aborts. As explained by Christie et al. [14], there are several conditions that can lead to the abort of a speculative region, besides the ABORT instruction: (1) contention for protected memory, (2) system calls, exceptions, and interrupts, (3) the use of certain disallowed instructions, and (4) implementation-specific transient conditions. In case of an abort, all modifications to protected memory locations are undone, and execution flow is rolled back to the beginning of the speculative region by resetting the instruction and stack pointers to the values they had directly after the SPECULATE instruction. No other register is rolled back; software is responsible for saving and restoring any context that is needed in the abort handler (see Section 7.2). Additionally, the reason for the abort is passed in the rAX register. Because all privilege-level switches (including interrupts) abort speculative regions and no ASF state is preserved across such a context switch, all system components (user programs, OS kernel, hypervisor) can make use of ASF without interfering with one another.

CPU A                                     CPU B cache line state
Mode                Operation             Protected Shared    Protected Owned
Speculative region  LOCK MOV (load)       OK                  B aborts
Speculative region  LOCK MOV (store)      B aborts            B aborts
Speculative region  LOCK PREFETCH         OK                  B aborts
Speculative region  LOCK PREFETCHW        B aborts            B aborts
Speculative region  COMMIT                OK                  OK
Any                 Read operation        OK                  B aborts
Any                 Write operation       B aborts            B aborts
Any                 Prefetch operation    OK                  B aborts
Any                 PREFETCHW             B aborts            B aborts

Table 7.1: Conflict matrix for ASF operations (from [2], §6.2.1).
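The rollback rule (only protected memory is undone, and the abort reason is reported to software) can be sketched as follows. This is a single-threaded illustration under assumptions: `run_region`, `AbortedRegion`, and the reason string are hypothetical, and the sketch assumes that nonspeculative stores are not undone, in line with the statement that only protected locations are rolled back. It uses an undo log on shared memory for brevity, which says nothing about how hardware buffers speculative state.

```python
class AbortedRegion(Exception):
    """Signals an abort of the modeled speculative region."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


def run_region(memory, body):
    """Run body(protected_store, plain_store) as one speculative region.

    Returns None on commit, or the abort reason on abort (the model's
    stand-in for the code passed in rAX).
    """
    undo_log = {}

    def protected_store(addr, value):
        # Remember the value before the first speculative store.
        undo_log.setdefault(addr, memory[addr])
        memory[addr] = value

    def plain_store(addr, value):
        # Nonspeculative store: not tracked, hence not undone (assumption).
        memory[addr] = value

    try:
        body(protected_store, plain_store)
        return None
    except AbortedRegion as abort:
        for addr, old in undo_log.items():
            memory[addr] = old
        return abort.reason
```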

Conflict detection for speculative accesses is handled at the granularity of a cache line. Conflict resolution in ASF follows the “requester wins” policy (i. e., existing speculative regions will be aborted by incoming conflicting memory accesses) with cache line granularity. Table 7.1 summarizes how ASF handles contention when CPU A performs an operation while CPU B is in a speculative region with the cache line protected by ASF. These conflict resolution rules are important for understanding how the HyTM algorithms presented in Section 7.3 work.
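Because the HyTM algorithms rely on these rules, it can help to have Table 7.1 in executable form. The lookup below is a direct transcription of the table; the dictionary name `B_ABORTS`, the function `b_aborts`, and the state strings are naming choices of this sketch only.

```python
# Keys: CPU A's operation. Values: does CPU B's speculative region abort
# when B holds the line (Protected Shared, Protected Owned)?
B_ABORTS = {
    "LOCK MOV (load)":  (False, True),
    "LOCK MOV (store)": (True, True),
    "LOCK PREFETCH":    (False, True),
    "LOCK PREFETCHW":   (True, True),
    "COMMIT":           (False, False),
    "read":             (False, True),
    "write":            (True, True),
    "prefetch":         (False, True),
    "PREFETCHW":        (True, True),
}


def b_aborts(operation: str, b_state: str) -> bool:
    """Requester wins: True if A's operation aborts B's speculative region."""
    shared, owned = B_ABORTS[operation]
    return {"protected_shared": shared, "protected_owned": owned}[b_state]
```

Read this way, the table says that any access by A other than COMMIT aborts B if B has written the line speculatively, while only write-like accesses abort B if B has merely read it.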

Isolation and ordering guarantees. The isolation and ordering guarantees that ASF provides for mixed speculative and nonspeculative accesses are important for the correctness of the HyTM algorithms because they access shared data nonspeculatively. Also, a speculative region can trigger externally visible side effects such as page faults. It is important to know whether these effects were caused by misspeculation (i. e., were caused by a memory access that would cause an abort) or by a consistent (yet potentially incomplete) speculative region or transaction. The guarantees described next complement the rules laid out in Table 7.1. They are not yet part of the ASF specification but reflect the intended design [29].

Aborts of a speculative region are designed to be instantaneous with respect to the program order of instructions in a speculative region. For example, aborts are supposed to happen before externally visible effects such as page faults or nonspeculative stores appear. This behavior also illustrates why speculative accesses can also be referred to as “protected” accesses. A consequence of this is that speculatively accessed cache lines are monitored early for conflicting accesses (i. e., once the respective instructions are issued in the CPU, which is always before they retire). Together with the standard memory model of x86 architectures, this leads to two rules that are relevant for the HyTM algorithms. Figure 7.2 shows the order of CPU effects that is implied by certain program or execution orders. The first rule essentially states that if a speculative store to A happens before a nonspeculative store, then A’s cache line will be monitored before the nonspeculative store is visible to other threads. Similarly, the second rule states that if a speculative load happens before a nonspeculative or speculative load, then the monitoring of the former will happen before the second load actually retrieves a value from memory.

\[
\mathit{Store}_{spec}(A) \to_{hb} \mathit{Store}_{nonspec}(B) \to_{hb} \mathit{Commit}
\;\Rightarrow\;
\mathit{Monitor}(A) \to_{hb} \mathit{Retire}(A) \to_{hb} \mathit{Retire}(B) \to_{hb} \mathit{Visible}(B)
\]

\[
\mathit{Load}_{spec}(A) \to_{hb} \mathit{Load}(B)
\;\Rightarrow\;
\mathit{Monitor}(A) \to_{hb} \mathit{DataBind}(A) \to_{hb} \mathit{DataBind}(B)
\]

Figure 7.2: Ordering guarantees provided by ASF. →hb expresses happens-before relationships conceptually similar to happens-before in the C++11 memory model (see Section 4.1). A and B are memory locations.

Furthermore, atomic instructions such as CAS or an atomic fetch-and-increment retain their ordering guarantees. For example, a CAS ordered before a COMMIT in a program will become visible before the transaction’s commit, and a CAS will be a full memory barrier with respect to memory accesses and monitoring.

ASF implementation variants. ASF could be implemented in different ways in hardware. One major implementation choice that affects ASF’s clients is how uncommitted speculative reads and writes are tracked.

First, one can introduce a new CPU data structure called the locked-line buffer (LLB), which holds the addresses of protected memory locations accessed in the current speculative region and is fully associative. It also holds the prior values of speculatively modified memory lines. Finally, it monitors remote memory requests and aborts the current speculative region on probe requests that represent conflicting memory accesses by other CPUs.
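The three duties of the LLB described above (tracking protected lines, keeping prior values, and checking remote probes) can be summarized in a toy model. This is a hedged sketch: the class `LockedLineBuffer` and its method names are hypothetical, cache lines are represented by plain integers, and the conflict check simply follows the rules of Table 7.1.

```python
class LockedLineBuffer:
    """Toy model of a fully associative locked-line buffer (LLB)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = set()   # protected lines (read or written)
        self.backup = {}     # line -> value before the first speculative store

    def _track(self, line):
        if line in self.lines:
            return True
        if len(self.lines) >= self.capacity:
            return False     # capacity exceeded: the region has to abort
        self.lines.add(line)
        return True

    def load(self, line):
        """Track a speculative load; False means a capacity abort."""
        return self._track(line)

    def store(self, line, prior_value):
        """Track a speculative store and keep the prior value for rollback."""
        if not self._track(line):
            return False
        self.backup.setdefault(line, prior_value)
        return True

    def probe_conflicts(self, line, remote_is_store):
        """True if a remote request must abort the current region."""
        if line not in self.lines:
            return False
        # Remote stores always conflict; remote loads conflict only with
        # lines that were written speculatively.
        return remote_is_store or line in self.backup
```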

Second, the L1 cache of each CPU core can be extended with an additional speculative-read bit per cache line and the regular cache-coherence protocol can be used to monitor protected reads and abort the current speculative region if required. Similarly, the L1 cache could be extended with another bit for speculative stores.

One can also combine these options, using the LLB only for speculative stores and tracking speculative reads in the L1 cache. From the perspective of ASF clients, the trade-off is mostly in terms of capacity for speculative state (i. e., how many distinct memory lines can be accessed by a speculative region before it will exceed the capacity and will have to abort). The L1 cache is relatively large but its effective capacity can be limited by its associativity, and nonspeculative accesses will potentially compete with speculative accesses for cache space. The LLB does not suffer from these problems but will likely be smaller in size because fully associative structures are quite costly.

Name     State stored in       HTM capacity
LLB8     8-line LLB            8 distinct lines (loads and stores)
LLB256   256-line LLB          256 distinct lines (loads and stores)
LLB8L1   Stores: 8-line LLB    Stores: 8 distinct lines
         Loads: 1K-line L1     Loads: Minimum of 1K lines or 2-way set associativity, shared with nonspeculative accesses

Table 7.2: ASF implementation variants.

Table 7.2 shows the implementations that I will consider in the evaluation. LLB8 represents a minimal implementation that can only be used for small transactions. LLB256 has a large LLB whose capacity is unlikely to be exceeded by current TM benchmarks, but which is therefore also costly to implement. LLB8L1 is a middle ground offering a capacity that is often sufficient.
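To make the capacity limits of the three variants concrete, the following sketch checks whether a given set of accessed cache lines would fit. It rests on several assumptions that are not stated in the text: a 64-byte line (derived from a 64 KB L1 with 1K lines), a simple modulo set index, every loaded line of LLB8L1 occupying an L1 way, and no competition from nonspeculative accesses. The names `fits` and `line_of` are hypothetical.

```python
from collections import Counter

LINE_SIZE = 64                  # bytes; assumption (64 KB L1 / 1K lines)
L1_LINES = 1024
L1_WAYS = 2
L1_SETS = L1_LINES // L1_WAYS   # 512 sets


def line_of(address: int) -> int:
    """Map a byte address to its cache line number."""
    return address // LINE_SIZE


def fits(variant, load_lines, store_lines) -> bool:
    """True if the accessed lines fit the variant's speculative capacity."""
    loads, stores = set(load_lines), set(store_lines)
    if variant == "LLB8":
        return len(loads | stores) <= 8
    if variant == "LLB256":
        return len(loads | stores) <= 256
    if variant == "LLB8L1":
        if len(stores) > 8:
            return False
        # Loads live in the L1: at most L1_WAYS lines per set
        # (assumes a modulo set index, ignores nonspeculative traffic).
        per_set = Counter(line % L1_SETS for line in loads)
        return all(count <= L1_WAYS for count in per_set.values())
    raise ValueError(f"unknown variant: {variant}")
```

Under these assumptions, three loaded lines that map to the same set (for example lines 0, 512, and 1024) already exceed LLB8L1's capacity, even though the L1 as a whole is almost empty.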

ASF simulator. ASF is not yet implemented in hardware, so one has to rely on simulation to evaluate it. AMD has extended PTLsim [122] with support for ASF and a more detailed model of the interactions between multiple separated processor cores and memory hierarchies (PTLsim-ASF [30]). PTLsim can simulate a full AMD64 system, which is important to be able to evaluate a realistic TM stack with applications, libraries, and an operating system kernel that would also run on a real machine.

AMD configured [14] the simulator to match the general characteristics of a system based on AMD Opteron processors (family 10h), with a three-wide clustered core, out-of-order instruction issuing, and instruction latencies modeled after the AMD Opteron microprocessor. The cache and memory configuration is:

• L1D: 64 KB, virtually indexed, 2-way set associative, 3 cycles load-to-use latency.

• L2: 512 KB, physically indexed, 16-way set associative, 15 cycles load-to-use latency.

• L3: 2 MB, physically indexed, 16-way set associative, 50 cycles load-to-use latency.

• RAM: 210 cycles load-to-use latency.

• D-TLB: 48 L1 entries, fully associative; 512 L2 entries, 4-way set associative.

The simulated machine used for the evaluation has 16 CPU cores, each having a clock speed of 2.2 GHz. PTLsim-ASF does not yet model topology information such as placement of cores on chips or sockets, and thus also does not model limited cross-socket bandwidths. Therefore, these cores behave as if they were located on the same socket, resembling future processors with higher levels of core integration. The cache-coherence model is simplified but captures first-order effects caused by cache coherence [30]. Additional ordering constraints and fencing semantics for the ASF primitives are modeled as well. However, the version of the simulator that was available did not yet correctly model the ordering guarantees between speculative and nonspeculative loads, so the implementations of HyTMs affected by this have to use additional read–read memory barriers.

[Figure 7.3: bar chart of the performance deviation (simulated over real), on a scale from 0 % to 35 %, for Genome, KMeans-Lo, KMeans-Hi, SSCA2, Vacation-Lo, and Vacation-Hi.]

Figure 7.3: PTLsim accuracy for the execution time of the STAMP benchmarks (no TM, no ASF, one thread) for simulated execution time normalized to native execution time.

Simulator accuracy. To illustrate the accuracy of the simulator, Figure 7.3 shows the difference in execution times on a real machine³ compared to a simulated execution within PTLsim-ASF⁴. A close match between the performance of simulated and real executions is desirable because this increases the confidence in the results of the evaluation. All experiments used for the evaluation (Section 7.4), including baseline STM runs, have been conducted inside the simulator to make sure that the results are comparable.

For many of the STAMP benchmarks (see Section 3.4.3), PTLsim-ASF stays within 10–15% of native execution speed, which is in line with earlier results for smaller benchmarks [30]. However, the results for Vacation-Lo and KMeans show that not all mechanisms in the microarchitecture are simulated precisely by PTLsim-ASF.

³ AMD Opteron processor family 10h, 2.2 GHz.

⁴ These results are from an experiment on an earlier version of PTLsim-ASF, which simu-


Figure 7.4: TM-based synchronization: HTM with serial execution of transactions as fallback mode. Note that the fallback is implemented using standard atomic operations, which has been omitted to increase clarity.

In document Software Transactional Memory Building Blocks (Page 182-188)