Dealing with Asynchronous Aborts - An ASF-Based TM Runtime Library

7.2 An ASF-Based TM Runtime Library

7.2.4 Dealing with Asynchronous Aborts

ASF can abort speculative regions asynchronously at any time during their execution. This can make it difficult to execute uninstrumented code that modifies state nonspeculatively. We can roughly categorize this code into (1) modifications to a thread’s stack, (2) TM-internal code, and (3) functions declared with the transaction_pure or tm_wrapper attributes (called external code from now on). Stack modifications are straightforward to handle because the compiler generates code that can restore the stack slots potentially modified by a transaction. The nonspeculative code that ASF-TM executes after an asynchronous abort uses a new stack frame and is therefore not affected. It is not run from within a speculative region, and will eventually return to the compiler-generated code that restores the values of the stack slots potentially modified.

However, ASF-TM has to execute other TM-internal but nonspeculative code. If it is trivial code such as decrementing the nesting level when a nested transaction commits, then it is usually not much affected by asynchronous aborts (e. g., because the nesting level can be easily reset to zero on transaction restarts). However, if an update of TM metadata modifies more than one memory location, then asynchronous aborts can cause incomplete updates and violate invariants. Thus, such code has to be made robust to asynchronous aborts, which complicates the implementation. For example, compiler or memory barriers might have to be added, and large parts of the TM have to be built in a way similar to reentrant signal handlers.
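To illustrate the kind of robustness meant here, the following is a minimal sketch of a multi-word metadata update that tolerates an abort at any point. The undo log layout, the function names, and the use of setjmp/longjmp to model the abort are all assumptions made for this example; they are not taken from ASF-TM. The idea is the one used in reentrant signal handlers: fill in the new entry first and publish it last with a single store.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdatomic.h>
#include <stddef.h>

#define LOG_CAPACITY 16

/* Hypothetical undo log; the names are illustrative, not ASF-TM's. */
typedef struct { long *addr; long old_value; } undo_entry;
typedef struct { undo_entry entries[LOG_CAPACITY]; size_t count; } undo_log;

static jmp_buf restart_point;  /* models the jump back to the region start */
static int abort_at_step = -1; /* test hook: step at which an abort strikes */

static void maybe_abort(int step) {
    if (step == abort_at_step) longjmp(restart_point, 1);
}

/* Fill the slot first and publish it last. An abort at any marked step
 * leaves count unchanged, so readers never see a half-written entry. */
static void log_append(undo_log *log, long *addr, long old_value) {
    size_t slot = log->count;
    assert(slot < LOG_CAPACITY);
    log->entries[slot].addr = addr;
    maybe_abort(0);
    log->entries[slot].old_value = old_value;
    maybe_abort(1);
    atomic_signal_fence(memory_order_seq_cst); /* compiler barrier only */
    log->count = slot + 1;                     /* single-word publish */
}

/* Appends one entry while injecting an abort at the given step (-1: none)
 * and returns the number of published entries afterwards. */
static size_t append_with_abort(undo_log *log, long *target, int step) {
    abort_at_step = step;
    if (setjmp(restart_point) == 0) {
        log_append(log, target, *target);
    }
    abort_at_step = -1;
    return log->count;
}
```

Calling append_with_abort with step 0 or 1 leaves the log's count untouched, whereas step -1 publishes the entry. Had count been incremented before the entry was filled in, an abort in between would have left a published but incomplete entry.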

A good example of this issue is dynamic memory allocation from within a transaction. In an STM, this is straightforward to do because we can just call malloc. This is safe because malloc only operates on memory managed by itself, implements its own synchronization, and its callers will wait for it to have finished the operation. This is not safe anymore within a speculative region because the region can abort during the operation due to malloc calling into the operating system kernel or due to a conflicting memory access by another thread (e. g., the abort could happen right after malloc acquired a lock, which in turn could block every thread that subsequently tries to allocate memory).

12 The analysis would also have to consider variables that are accessed by functions declared

To handle memory allocation in ASF-TM’s implementation, we therefore have to resort to (1) logging the allocation request, (2) aborting the transaction, (3) performing the allocation, and (4) restarting the transaction. This works well for allocations used by ASF-TM itself (e. g., undo-log buffers). For allocations triggered by the application, ASF-TM can only hope that the same allocation will happen in the restarted transaction, so it might have to revert to a software transaction after a small number of mispredictions.
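The four steps can be sketched as follows. This is only a model under stated assumptions: the speculative region is simulated with setjmp/longjmp (so nothing is actually rolled back), and tx_ctx, tx_alloc, tx_run, and MAX_RESTARTS are hypothetical names that do not come from ASF-TM. A NULL result from tx_run stands for the point at which a real runtime would fall back to a software transaction.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_RESTARTS 4 /* hypothetical misprediction limit */

typedef struct {
    jmp_buf restart;       /* models re-executing the region start */
    size_t  requested;     /* size logged by the aborted attempt */
    void   *prepared;      /* block allocated outside the region */
    size_t  prepared_size;
    int     restarts;
} tx_ctx;

/* Called inside the (simulated) speculative region. */
static void *tx_alloc(tx_ctx *tx, size_t size) {
    if (tx->prepared != NULL && tx->prepared_size == size) {
        void *p = tx->prepared; /* prediction was right: hand out block */
        tx->prepared = NULL;
        return p;
    }
    tx->requested = size;       /* (1) log the request */
    longjmp(tx->restart, 1);    /* (2) abort the region */
}

typedef void *(*tx_body)(tx_ctx *);

static void *tx_run(tx_ctx *tx, tx_body body) {
    tx->requested = 0;
    tx->prepared = NULL;
    tx->prepared_size = 0;
    tx->restarts = 0;
    if (setjmp(tx->restart) != 0) {
        /* We are outside the region here, so malloc and free are safe. */
        free(tx->prepared);     /* stale block from a misprediction */
        tx->prepared = NULL;
        if (++tx->restarts > MAX_RESTARTS) return NULL; /* fall back */
        tx->prepared = malloc(tx->requested);           /* (3) allocate */
        if (tx->prepared == NULL) return NULL;
        tx->prepared_size = tx->requested;
    }
    void *result = body(tx);    /* (4) run or restart the region */
    free(tx->prepared);         /* drop a block that was never claimed */
    tx->prepared = NULL;
    return result;
}

/* Example region body: allocates one int and initializes it. */
static void *example_body(tx_ctx *tx) {
    int *p = tx_alloc(tx, sizeof *p);
    *p = 42;
    return p;
}
```

Running example_body through tx_run takes exactly one restart: the first attempt logs the request and aborts, and the second attempt finds the prepared block. A body that requests a different size after the restart would mispredict and eventually hit the restart limit.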

The fact that STMs do not consider asynchronous aborts is also the key point in the case of external code, the last category that we need to deal with. From an STM perspective, this makes a lot of sense because it allows for simple wrapper implementations and simple reuse of a lot of library code, as long as these functions synchronize on their own and access data that is separate from transactionally accessed data. Asynchronous aborts conflict with this assumption, and I will next discuss possible work-arounds for this problem.

Software-side solutions. The simplest option would be to not call external code from within hardware transactions. The ABI would have to be extended so that it requires the compiler to notify the TM if external code is about to be executed by a transaction.13 TMs could then decide to switch to software transactions before executing such code. However, this is not what we really want because there could be many bits of external code (e. g., built-in functions used by the compiler), making it less likely that hardware transactions can be used.

Another option would be to classify external code as asynchronous-abort-safe or unsafe. This would require a second group of transaction_pure and tm_wrapper attributes, increasing the complexity on the software side. ASF’s particular safety requirements are different from those of STMs and other HTMs. Thus, it is not clear that expecting programmers to maintain this classification is beneficial in the long term.

One could also try to suspend speculative regions around external code, similar to the split hardware transactions proposed by Lev and Maessen [73]. This suspension has to be implemented entirely in software and results in significant performance overheads for hardware transactions because it requires read logging and write buffering, wasting much of the performance benefits of HTM.

Finally, we could try to automatically instrument external code in a way that makes it robust to asynchronous aborts, for example by transforming memory accesses to ASF’s speculative accesses or by redirecting them to a simple STM that just provides this robustness but not concurrency control. However, this defeats the purpose of the transaction_pure and tm_wrapper annotations. Also, we cannot generically roll back custom synchronization code using software only.14 Furthermore, not all external code is available as source code or at compile time, so dynamic binary instrumentation would have to be used at runtime as well. Overall, relying on instrumentation seems to be too intrusive and fragile.

13 This could either happen when starting a transaction, or using a new TM callback function.

Hardware-side solutions. It might also be possible to change ASF in a way that avoids expensive hardware-based virtualization (i. e., by continuing to abort transactions on far control transfers) and still makes it easier for software to deal with asynchronous aborts.

ASF could let speculative regions run in a mode in which the consistency of the region is checked only at commit time and on demand during its execution. ASF would have to support a new CPU instruction that aborts a speculative region if it is inconsistent, similar to a validate function in an STM. ASF-TM could then use this instruction after each speculative load to check that the value to be returned to the user is indeed part of a consistent snapshot.
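The load-then-validate pattern can be sketched in software as follows. Everything here is an assumption made for illustration: the validating instruction is modeled by an ordinary function, consistency is modeled by a single global version counter that conflicting updates increment, and the abort is modeled with setjmp/longjmp. None of the names correspond to actual ASF instructions.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical software model of on-demand validation. */
typedef struct {
    jmp_buf       restart;  /* models the jump back to the region start */
    unsigned long snapshot; /* version observed when the region began */
    int           aborts;
} region;

static unsigned long global_version;     /* bumped by conflicting updates */
static long shared_x = 1, shared_y = -1; /* invariant: x + y == 0 */

/* Stands in for the proposed validating instruction. */
static void validate(region *r) {
    if (global_version != r->snapshot) {
        r->aborts++;
        longjmp(r->restart, 1);
    }
}

/* A speculative load followed by validation: the value is returned to
 * the caller only if the snapshot is still consistent. */
static long tm_load(region *r, const long *addr) {
    long value = *addr;
    validate(r);
    return value;
}

/* Models a conflicting transaction that preserves the invariant. */
static void conflicting_update(void) {
    shared_x++;
    shared_y--;
    global_version++;
}

/* Reads x and y inside one region. If interfere_once is nonzero, a
 * conflicting update is injected between the two loads of the first
 * attempt. */
static long read_sum(region *r, int interfere_once) {
    volatile int pending = interfere_once; /* must survive longjmp */
    r->aborts = 0;
    setjmp(r->restart);                    /* region start and restart */
    r->snapshot = global_version;
    long x = tm_load(r, &shared_x);
    if (pending) {
        pending = 0;
        conflicting_update();
    }
    long y = tm_load(r, &shared_y);
    return x + y;
}
```

Without the validation in tm_load, the interfered attempt would combine the old x with the new y and return a sum that violates the invariant. With it, the second load aborts the region, and the restarted region reads a consistent pair.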

However, this STM-like operation also causes a typical STM problem to appear in hardware transactions: There can be pending speculative loads and stores that get executed even when a speculative region’s snapshot is stale. Because aborts would not be instantaneous anymore in the new ASF execution mode, this would create a race condition with other privatizing transactions that could change the protection levels of the memory accessed by the speculative region with the stale snapshot. The software-only solution to this, which is to ensure privatization safety between hardware transactions, would decrease performance significantly. A hardware-based solution could be to make speculative accesses nonfaulting, and to forward some information from loads and stores to ASF’s validating instruction (i. e., so that page faults and TLB misses can be made visible if the speculative region indeed had a consistent snapshot).

This shows that asynchronous aborts are beneficial in code that is robust to them. So, what we would really want is to suspend them when running external code and resume aborting when switching back to instrumented code. We do not need to ensure privatization safety while executing external code because such code must not access shared data.15

Thus, ASF could offer new CPU instructions to suspend and resume aborts in a speculative region, which ASF-TM would call before and after the execution of external code. Let us call them SUSPEND and RESUME. ASF would roll back speculative updates instantaneously with the abort as before, but defer the jump back to SPECULATE until RESUME is executed. The abort reason could be carried forward to RESUME (requiring minimal virtualization, similar to what would be needed for the on-demand validation scheme discussed previously). Alternatively, ASF could maintain a single bit indicating whether a speculative region has aborted. This bit would be set on aborts and after context switches, and cleared when new outermost speculative regions start.
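The single-bit variant of SUSPEND and RESUME can be modeled in software as follows. This is a sketch under assumptions, not a description of hardware: the jump back to SPECULATE is modeled with setjmp/longjmp, the rollback of speculative updates is omitted, and asf_region, asf_abort_event, asf_suspend, and asf_resume are hypothetical names. The external code updates two nonspeculative counters that should stay equal, and an abort is injected between the two updates on the first attempt.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical software model of suspendable asynchronous aborts. */
typedef struct {
    jmp_buf speculate; /* models the jump back to SPECULATE */
    bool    suspended; /* set between SUSPEND and RESUME */
    bool    aborted;   /* the single abort bit */
} asf_region;

/* An abort cause arrives (conflict, interrupt, ...). Speculative
 * updates would be rolled back right here; only the jump is deferred. */
static void asf_abort_event(asf_region *r) {
    r->aborted = true;
    if (!r->suspended) longjmp(r->speculate, 1);
}

static void asf_suspend(asf_region *r) { r->suspended = true; }

static void asf_resume(asf_region *r) {
    r->suspended = false;
    if (r->aborted) longjmp(r->speculate, 1); /* deliver deferred abort */
}

/* External code with a two-location invariant: ext_a == ext_b. */
static int ext_a, ext_b;

static void external_code(asf_region *r, bool inject_abort) {
    ext_a++;
    if (inject_abort) asf_abort_event(r);
    ext_b++;
}

/* Runs the region until it completes and returns the number of
 * attempts. An abort is injected on the first attempt only. */
static int run_region(asf_region *r, bool use_suspend) {
    volatile int attempts = 0; /* must survive longjmp */
    r->suspended = false;
    setjmp(r->speculate);
    r->aborted = false;        /* cleared when an outermost region starts */
    attempts++;
    if (use_suspend) asf_suspend(r);
    external_code(r, attempts == 1);
    if (use_suspend) asf_resume(r);
    return attempts;
}
```

With suspension, the injected abort only sets the bit, the external code finishes both updates, and the region restarts at the resume point. Without suspension, the first attempt is cut off between the two updates and the counters end up different. In both cases the external code is executed again by the restarted region.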

Other speculative regions nested in a region where aborts are suspended are more difficult because they need to run with open-nested16 semantics to provide us with the composability that we aim for. Even if this is the case, it will be difficult to provide ASF’s minimal-progress guarantees. Thus, instead of depending on open nesting, such special nested speculative regions could just abort parent speculative regions or optionally use a software fallback if a parent speculative region exists. However, this only provides partial composability, in that nesting is safe but only one of the speculative regions can actually execute.

have no control over other threads potentially participating in the synchronization protocol.

15 We can support accessing shared data through calls to the TM. These special transactional data transfer functions would just resume and suspend the speculative region before and after performing the necessary speculative accesses. Obviously, the caller would have to ensure that this is robust to asynchronous aborts at the time at which it calls these functions.

16 Under open nesting, nested transactions essentially commit independently of parent trans-

7.2.5 Discussion and Related Work

I investigated how to build an ASF-based TM runtime library that can be integrated into a general-purpose TM system. The study that this was a part of used a near-cycle-accurate full-system simulator. Other previous or more recent studies about realistic first-generation HTM either focused on different hardware support and use cases or have not been evaluated publicly.

Intel’s TSX is an HTM feature that has been announced recently for an upcoming CPU. It provides an interface roughly similar to ASF but does not support nonspeculative accesses; all memory accesses in a speculative region are automatically considered as speculative accesses without the need for any special annotations. Aborted speculative regions do not make any of their side effects visible, which simplifies their use but can also require a slightly more frequent execution of software transactions. GCC’s TM runtime library [44] has recently been extended with a simple HTM execution mode that can use Intel’s TSX; this uses the same ABI as considered here and employs serially executed software transactions as fallback mode. No performance measurements have been published so far.

The recent IBM BlueGene/Q processor also contains an HTM feature [117]. Using this HTM is significantly more complex than in the case of ASF or Intel’s TSX. There are two separate execution modes aimed at short-running and long-running transactions, both of which track speculative state in different hardware resources and require different handling by software. The ABI of the TM runtime library for this HTM is not described in detail but it has to rely on the operating system kernel to handle events like exceeding the HTM’s capacity or executing disallowed code; the kernel also executes TM conflict resolution policies in some modes.

In the primary study about Rock TM [26, 12], the authors were able to use a real hardware implementation. However, in comparison to ASF, Rock TM puts more restrictions on the code that it can run as a hardware transaction, so the focus of these studies has been on using HTM support for concurrent data structures in operating system kernels or virtual machines rather than on using it as part of a general-purpose TM (e. g., there are no results for the STAMP benchmarks). There seems to be some level of compiler support for Rock TM but it is not discussed whether it is reasonably close to what would be useful for a generic STM. Because ASF’s design differs from that of Rock TM (e. g., selective annotation or handling TLB misses) and supports a wider spectrum of code in transactions, studying ASF in the context of general-purpose transactions revealed the implications of different hardware design choices in further system areas.

In TxLinux and MetaTM [97, 60], an academic HTM proposal is used as basis for evaluating hardware transactions as replacement for lock-based synchronization in the Linux kernel (using a technique similar to speculative lock elision [86]). The authors also consider userspace applications (STAMP with manual instrumentation), but in both cases the HTM is used directly and it is not investigated how the HTM would integrate with general-purpose TM support (e. g., there is no compiler support or integration with programming languages). The HTM itself does not use selective annotation and has to implement virtualization for hardware transactions. Only a simple in-order simulation of the x86 architecture is used for the evaluation.

HASTM [99] is evaluated as part of general-purpose TM for userspace applications, including compiler support. It only provides hardware support for concurrency control but does not support transactional updates. It therefore can only accelerate STM algorithms and does not face issues like asynchronous aborts or false sharing of speculative and nonspeculative memory accesses. The simulator used for the study is described as an accurate IA32 simulation.

In comparison, ASF is well aligned with general-purpose, non-ASF-specific TM building blocks. Design decisions such as the visibility of page faults triggered by speculative code make building a TM based on ASF easier than it would be on Rock TM, for example.

Selective annotation is a very valuable feature of ASF because it allows costly ASF capacity to be used only for memory accesses that actually need to be protected or speculative. Furthermore, it enables new HyTM algorithms (see Section 7.3). However, even though it makes sense to not support some kinds of false sharing between speculative and nonspeculative memory accesses, ASF should be more forgiving towards its clients and handle exceptional situations by aborting the speculative region instead of raising general protection faults. For custom-built synchronization based on ASF, the fatal errors might be useful because they reduce the error cases that need to be handled. In contrast, for generic synchronization mechanisms like TM, it is much easier to just prevent such situations most of the time instead of having to choose a conservative implementation. Unlike the faults, speculative region aborts can be handled locally in the TM. This also applies to faults raised by the execution of instructions not supported in speculative regions.

ASF’s instantaneous aborts and the guarantee of snapshot consistency that this implies are very useful for HyTMs and also make other behavior practical (e. g., page faults triggered by speculative code being visible). However, this also implies asynchronous aborts, which complicate executing nonspeculative code. Because none of the software-only solutions to work around this issue are really practical, extending ASF with support for suspending and resuming asynchronous aborts seems to be the best overall solution. It seems to require only a few changes to ASF but allows for more encapsulation of ASF’s peculiarities inside the TM runtime library.

Overall, together with these two changes (aborts instead of faults and suspendable asynchronous aborts), ASF is ready to be used inside a general-purpose TM.17 Support for ASF can be confined to a TM runtime library (ASF-TM) that provides the same ABI as STM runtime libraries. This is important for a new hardware feature like ASF as well because it eases the transition from STM to an ASF-based TM.

17 PTLsim-ASF implements the first change, aborting speculative regions instead of raising faults. Suspendable asynchronous aborts are not implemented by it but the benchmarks used for the evaluation in Section 7.4 work without this change as they do not require the execution of unsafe external code.

Figure 7.8: TM-based synchronization: HyTM. [Diagram not reproduced. Its labels include: application data (programming language abstractions, partitions, address space), TM synchronization metadata (synchronization objects such as locks and clocks), hardware synchronization (HTM on cache lines, atomic operations on machine words), high-level synchronization (requires analysis or declarations), and low-level synchronization (generally applicable).]

In document Software Transactional Memory Building Blocks (Page 194-199)