
5.2 Applications of HTM

5.2.2 Lock Elision

In addition to using transactional memory directly, transactions can be used for lock elision: lock-based critical sections, which pessimistically force serialisation, are converted into optimistic transactions that serialise only if concurrent executions conflict. There are various ways to perform such a transformation.

Manual Software Lock Elision

The programmer manually converts the usage of locks / critical sections into the right transactional primitives (atomic blocks, direct usage of the TM functions / instructions). The lock usually remains as a fall-back path, in case the critical sections do not make progress with TM.

Failing transactions will therefore eventually acquire the lock and execute the original content of the critical section, guaranteeing progress properties of the original critical section construct.

To properly serialise between lock-taking critical sections and concurrent lock-eliding transactions, the transactions have to check that the lock variable is free at their linearisation point. This is easily achieved by transactionally reading the lock and aborting if it is not free. Because the read places the lock in the transaction's read set, a lock found free stays free for as long as the transaction runs: any acquisition by another thread aborts the transaction. An interesting trade-off is when the lock should be read and checked: pushing the check to the end of the transaction reduces the contention window with lock-acquiring critical sections, but may also cause the transaction to run on inconsistent state that is still being modified by the critical section. While the transaction will eventually abort in that case, either because it conflicts with one of the “consistent-making” stores of the critical section or because it finds the lock taken, inspecting inconsistent state may cause anomalies, in particular if the transaction can exhibit side-effects, as is the case in ASF. In the literature, this late check is called “lazy subscription” and is considered unsafe [244, 311].
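The control flow described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names htm_begin, htm_end, htm_abort, spinlock_t, elide and MAX_RETRIES are hypothetical. When the file is built with RTM support (the __RTM__ macro is defined), the three htm_* macros map to Intel's _xbegin / _xend / _xabort intrinsics; otherwise a stand-in reports every transaction attempt as failed, so the code degenerates to plain locking and still runs on any machine. The lock is read at the start of the transaction (eager subscription).

```c
#include <assert.h>
#include <stdatomic.h>

#if defined(__RTM__)              /* built with -mrtm for an x86 CPU with TSX */
#include <immintrin.h>
#define htm_begin() (_xbegin() == _XBEGIN_STARTED)
#define htm_end()   _xend()
#define htm_abort() _xabort(0xff) /* control resumes at _xbegin with a failure code */
#else                             /* assumption: stand-in, every attempt fails */
#define htm_begin() 0
#define htm_end()   ((void)0)
#define htm_abort() ((void)0)
#endif

typedef struct { atomic_int taken; } spinlock_t;   /* 0 = free, 1 = taken */

static void spin_lock(spinlock_t *l) {
    /* Spin until the exchange returns 0, i.e. the lock was free before. */
    while (atomic_exchange_explicit(&l->taken, 1, memory_order_acquire))
        ;
}

static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->taken, 0, memory_order_release);
}

enum { MAX_RETRIES = 3 };         /* hypothetical retry budget */

/* Runs body(arg) as an elided critical section.
 * Returns 1 if it committed as a transaction, 0 if the lock was taken. */
static int elide(spinlock_t *l, void (*body)(void *), void *arg) {
    assert(l && body);
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        /* Wait outside the transaction until the lock looks free. */
        while (atomic_load_explicit(&l->taken, memory_order_relaxed))
            ;
        if (htm_begin()) {
            /* Eager subscription: reading the lock puts it in the read set,
             * so a later acquisition by another thread aborts this transaction. */
            if (atomic_load_explicit(&l->taken, memory_order_relaxed) != 0)
                htm_abort();
            body(arg);
            htm_end();
            return 1;
        }
    }
    /* Fall-back path: run the original critical section under the lock. */
    spin_lock(l);
    body(arg);
    spin_unlock(l);
    return 0;
}

/* Example critical section body: increments a counter. */
static void add_one(void *p) { ++*(long *)p; }
```

A caller would write elide(&lock, add_one, &counter) where it previously wrote a lock / unlock pair around the increment. Moving the lock check to just before htm_end() would turn this sketch into the lazy subscription variant discussed above.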

Integrated and Semi-Transparent Software Lock Elision

Performing lock elision “under the hood” without the programmer changing code is compelling because it unlocks potential performance gains for existing programs without code rewrite. Potential targets are eliding locks in language level constructs, such as Java’s synchronized blocks, and C++ std::mutex or derived locks; or locks present in interpreted languages, such as the global interpreter lock in Python. Similarly, at a lower level the pthread_mutex_* library interface can be adapted to perform transactions instead of acquiring the associated lock.

The general mechanism remains the same as in manual lock elision: instead of acquiring the lock, the code starts a transaction. If the HTM then treats normal memory loads / stores as transactional (ASF inverse mode), the modifications can be localised in the locking primitives without requiring generation of two code paths with instrumentation of all shared accesses in one of them. Localising the changes to the locking code can then be accomplished without changing the application code; if shared libraries are used for the implementation of the locking primitives, the application binary may remain the same. There are patches that change the popular GNU C library [318] to perform elision in the pthread_mutex_* functions using HTM. In our work on ASF, we have investigated approaches that use LD_PRELOAD to load an interposer between the application and the existing implementations of these functions. In addition, we have added lock elision for locks used inside Java HotSpot [290], and Python (driven mostly by Martin Pohlack, unpublished).

In some of these, complications arise from varying interfaces to the locking primitives. If locks are elided, the underlying lock is not physically acquired, because that would require a write which in turn would conflict with other eliding transactions. Instead, the lock is acquired logically: through a combination of transactional tracking of the data and enforcing the free state of the lock for the duration of the transaction. Locking interfaces which permit querying the lock variable while holding it (e.g. C++ std::unique_lock::owns_lock, or performing trylock operations on held locks) may need adaptation to precisely distinguish between the following cases: no lock held, not eliding; holding the queried lock; eliding the queried lock; eliding a separate lock; running a separate transaction. Additional bookkeeping may be required, but should be performed thread-locally and in a transaction-safe manner in order to avoid transaction aborts.
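The five cases listed above could be told apart with a small amount of thread-local state, as in the following sketch. All names (elock_t, tx_kind, query_state, note_tx_begin, note_tx_end, classify, owns_lock) are hypothetical, and the lock is assumed to store the id of its physical holder; real lock implementations differ in how they record ownership.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int owner; } elock_t;   /* 0 = free, else id of holding thread */

typedef enum { TX_NONE, TX_ELIDING, TX_PLAIN } tx_kind;

typedef enum {
    Q_NOT_HELD,       /* no lock held, not eliding     */
    Q_HOLDING,        /* physically holding the lock   */
    Q_ELIDING_THIS,   /* eliding the queried lock      */
    Q_ELIDING_OTHER,  /* eliding a separate lock       */
    Q_OTHER_TX        /* running a separate transaction */
} query_state;

/* Thread-local bookkeeping: plain stores to thread-private data do not
 * conflict with other threads and therefore do not cause aborts. */
static _Thread_local int            self_id = 1;  /* assumption: set per thread */
static _Thread_local tx_kind        cur_tx  = TX_NONE;
static _Thread_local const elock_t *elided  = NULL;

/* Called by the locking primitive right after a transaction has started. */
static void note_tx_begin(tx_kind kind, const elock_t *lock) {
    assert(cur_tx == TX_NONE);           /* this sketch does not nest */
    cur_tx = kind;
    elided = (kind == TX_ELIDING) ? lock : NULL;
}

/* Called right before the transaction commits. */
static void note_tx_end(void) {
    cur_tx = TX_NONE;
    elided = NULL;
}

/* Classifies the calling thread's relation to the queried lock. */
static query_state classify(const elock_t *q) {
    if (q->owner == self_id) return Q_HOLDING;
    if (cur_tx == TX_ELIDING)
        return (q == elided) ? Q_ELIDING_THIS : Q_ELIDING_OTHER;
    if (cur_tx == TX_PLAIN) return Q_OTHER_TX;
    return Q_NOT_HELD;
}

/* An owns_lock-style query that also reports logically held locks. */
static int owns_lock(const elock_t *q) {
    query_state s = classify(q);
    return s == Q_HOLDING || s == Q_ELIDING_THIS;
}
```

With this bookkeeping, an owns_lock-style query answers true for a logically held (elided) lock, even though the lock variable itself still reads as free.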

Transparently eliding locks also keeps the programmer unaware of the changed performance trade-offs. In order to get good performance out of an elided critical section, the programmer must: (1) avoid writes to shared data, in particular data that is often shared between threads, but possibly not part of the core synchronisation set of the algorithm; (2) be aware of conflict detection granularities, and avoid false sharing; (3) avoid I/O and system calls as these tend to abort the transactions, too.

All of these points also constitute good advice for making (non-elided) critical sections run fast (and in fact most parallel code, including other lock-free techniques), because they reduce cache line bouncing and the length of the critical section. With locks, however, violating them only slows down execution, whereas with transactions violations can permanently thwart progress and reduce performance to below that of the fall-back path, because of the failed retries that precede it. Conversely, it may be surprising for programmers to have expected performance gains for elided, non-conflicting critical sections suddenly evaporate due to changes / mechanisms invisible or unintuitive to the programmer.

Due to the unknown nature of the critical section and the cost of unsuccessful transactional execution that eventually has to grab the lock, semi-transparent approaches benefit greatly from prediction: a predictor is consulted with the identity of the critical section / lock and returns whether eliding the lock acquisition would be beneficial or not. If the predictor indicates that elision is not beneficial, the lock will be taken straight away. In comparison to branch predictors (which are pretty well understood, but continue to evolve), two significant differences remain for lock elision predictors: (1) what exactly is the “identity” of a critical section, and (2) how to update the predictor.

The identity used to index into the predictor should capture a high-level abstraction of the nature of a transaction. In semi-transparent lock elision, the predictor can be a software component consulted in the implementation of the elidable locking primitive. As such we found that indexing the predictor based on the address of the lock provides good results. Indexing based on the instruction address might also be useful, because it differentiates between different usages of the same lock variable, but relies on inlined locking primitives. Alternatives include: using parts of the lock acquisition function’s return stack (GCC exposes a __builtin_return_address function to such effect), or supplying an application-specified ID value.
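A software predictor indexed by the lock address might look like the following sketch. The table size, the saturating-counter update policy, the decay on skipped elisions, and all names (pred_index, pred_should_elide, pred_update, PRED_*) are assumptions made for illustration; the text above does not prescribe a particular update rule.

```c
#include <assert.h>
#include <stdint.h>

enum {
    PRED_ENTRIES   = 256,  /* hypothetical table size                    */
    PRED_MAX       = 3,    /* counters saturate here (2-bit counters)    */
    PRED_THRESHOLD = 2     /* scores at or above this mean "do not elide" */
};

/* One failure score per entry; zero-initialised, so every lock starts
 * out optimistic (elision is attempted). */
static unsigned char pred[PRED_ENTRIES];

/* Identity of the critical section: here simply the lock address.
 * The low bits are dropped on the assumption that locks are aligned. */
static unsigned pred_index(const void *lock) {
    return (unsigned)(((uintptr_t)lock >> 4) % PRED_ENTRIES);
}

/* Returns 1 if elision should be attempted for this lock.
 * A negative prediction decays the score, so elision is retried later. */
static int pred_should_elide(const void *lock) {
    unsigned char *score = &pred[pred_index(lock)];
    if (*score < PRED_THRESHOLD)
        return 1;
    --*score;
    return 0;
}

/* Feeds back the outcome of one elision attempt:
 * committed != 0 means the transaction committed,
 * committed == 0 means the fall-back lock had to be taken. */
static void pred_update(const void *lock, int committed) {
    unsigned char *score = &pred[pred_index(lock)];
    assert(*score <= PRED_MAX);
    if (committed) {
        if (*score > 0) --*score;
    } else if (*score < PRED_MAX) {
        ++*score;
    }
}
```

The locking primitive would call pred_should_elide(lock) before starting a transaction and pred_update(lock, outcome) afterwards. Indexing by call site instead would only require passing a different key, for example the value of GCC's __builtin_return_address(0) taken inside the lock function. The table is shared between threads without synchronisation here; a real implementation would use relaxed atomics or per-thread tables.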

Transparent Hardware Lock Elision

In contrast to the integrated and semi-transparent approaches, fully transparent hardware lock elision [58, 62] integrates all required elision functionality into the CPU. A full-blown, aggressive implementation could then elide all critical sections, irrespective of whether they come from an (inlined) library, the OS kernel, or are built manually. In their simplest form, locks are built out of an instruction sequence to acquire and release the lock; with the acquire path usually containing one atomic read-modify-write instruction (or load-linked / store-conditional sequence to that effect) that tries to transition the lock from a FREE to a TAKEN state, and a check of successful acquisition. The release operation then will perform the inverse by storing a FREE value back in the lock variable.

Transparent elision will then perform the following: (1) detect an atomic RMW instruction transitioning a lock; (2) instead of acquiring the lock, start a transaction; (3) check that the lock is free, and (4) acquire the lock locally. Finally, when a store operation is detected that transitions the lock from the local taken state back to the original free value, the local write operations (FREE to TAKEN, TAKEN to FREE) are discarded and the transaction is committed.
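The steps above can be mimicked by a small software model of the detection logic. This is a toy state machine for illustration only, not a description of any real hardware: the type hle_model and the functions model_rmw and model_store are hypothetical, and the model simply assumes that the value observed before the RMW is the FREE value.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int         active;       /* 1 while a critical section is being elided */
    const void *lock_addr;    /* address targeted by the elided RMW         */
    long        free_value;   /* value observed before the RMW (assumed FREE) */
    long        local_value;  /* thread-local view of the lock word         */
} hle_model;

/* Steps (1) to (4): called for an atomic RMW that would change *addr from
 * `observed` to `desired`. Returns 1 if the model starts eliding. */
static int model_rmw(hle_model *m, const void *addr, long observed, long desired) {
    assert(m && addr);
    if (m->active)
        return 0;                 /* this sketch does not nest elisions      */
    m->active      = 1;           /* (2) start a transaction, skip the write */
    m->lock_addr   = addr;
    m->free_value  = observed;    /* (3) remember the value taken to be FREE */
    m->local_value = desired;     /* (4) the lock is acquired locally only   */
    return 1;
}

/* Called for a plain store while eliding. Returns 1 if the store restores
 * the original value, which ends the elision (transaction commit). */
static int model_store(hle_model *m, const void *addr, long value) {
    if (!m->active || addr != m->lock_addr)
        return 0;                 /* ordinary transactional store            */
    if (value != m->free_value) {
        m->local_value = value;   /* lock word changed, but not released     */
        return 0;
    }
    m->local_value = m->free_value;  /* both local writes cancel out         */
    m->active      = 0;
    return 1;
}
```

For a simple test-and-set lock, model_rmw(&m, &lock, 0, 1) followed by model_store(&m, &lock, 0) starts and ends one elided region. A release that stores any value other than the one observed before the acquire never matches, so the model keeps eliding.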

Due to the local execution of the lock acquire / release operations, multiple concurrent threads will see the lock as being free, transition it locally to taken and perform the elided critical section concurrently. If there are no true data conflicts between the critical sections, they can execute concurrently.

Similar to the integrated approaches described earlier, predicting whether a detected critical section can be elided successfully is crucial for this approach; in transparent HLE, however, another (logical) layer is required, too: is a particular atomic RMW sequence actually a lock acquisition? From my experience, reliably detecting and eliding critical sections is challenging for the following reasons:

• Different instruction sequences being used to acquire the lock

• Similar instruction sequences being used for other purposes, such as lock-free data structures, reference counting, and as a fence replacement

• Various instruction sequences used to free the lock variable

• Lock algorithms transitioning the lock from a FREE1 to a TAKEN1 and then to a FREE2 state (for example ticket locks), making it hard to detect the transaction end, and impossible to remove the costly lock variable modification through elision

• Lock algorithms that have other complex behaviours, such as MCS locks, which spin on a local variable if the lock is already held

• Modifications of the lock variable (or neighbouring memory locations) during the lock elision, for example transitioning the lock from a TAKEN1 to a TAKEN2 value during the critical section, as happens in the 32-bit x86 version of Java HotSpot

• Wrongly detected lock acquisitions cause nesting of elision, and missed / non-existing release operations will then continue the elision until the transaction mechanism runs out of capacity, or aborts otherwise

For these reasons, I believe that fully transparent lock elision is not feasible, or can only have very narrow coverage. Unpublished studies that I performed at AMD analysing instruction traces of a huge set of relevant workloads for lock / unlock patterns confirmed that.

Intel’s TSX proposal offers lock elision [303, 367], but side-steps (and arguably implicitly acknowledges) the issue of detecting proper lock acquisitions / releases over other spurious memory modifications: TSX requires annotations of the memory instructions that constitute the actual lock acquisition and release with xacquire and xrelease instruction prefixes. While these prefixes require software modification, the resulting code is still backwards compatible with older, elision-unaware CPUs, because the prefixes are selected such that they are ignored / meaningless in non-TSX ISAs (they map to the repe / repne prefixes that are only meaningful with string instructions).
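From C, such annotated lock operations can be requested through GCC's x86-specific memory model flags __ATOMIC_HLE_ACQUIRE and __ATOMIC_HLE_RELEASE, which ask the compiler to emit the xacquire and xrelease prefixes on the atomic exchange and store. The sketch below is illustrative only: the helper names hle_lock, hle_unlock, HLE_ACQ, HLE_REL and cpu_relax are hypothetical, and on compilers or targets that do not predefine the HLE flags the code falls back to an ordinary, non-elided spinlock.

```c
#include <assert.h>

#ifdef __ATOMIC_HLE_ACQUIRE          /* predefined by GCC on x86 targets */
#define HLE_ACQ __ATOMIC_HLE_ACQUIRE
#define HLE_REL __ATOMIC_HLE_RELEASE
#else                                /* assumption: no HLE flags, plain atomics */
#define HLE_ACQ 0
#define HLE_REL 0
#endif

#if defined(__i386__) || defined(__x86_64__)
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void)0)
#endif

/* Acquire: an atomic exchange, annotated as the lock acquisition.
 * A CPU without elision support ignores the prefix and takes the lock. */
static void hle_lock(int *lockvar) {
    assert(lockvar);
    while (__atomic_exchange_n(lockvar, 1, __ATOMIC_ACQUIRE | HLE_ACQ))
        cpu_relax();                 /* lock was taken: pause, then retry */
}

/* Release: a store of the free value, annotated as the lock release. */
static void hle_unlock(int *lockvar) {
    __atomic_store_n(lockvar, 0, __ATOMIC_RELEASE | HLE_REL);
}
```

The critical section between hle_lock and hle_unlock needs no changes, which is the backwards compatibility argument made above: the same binary runs with or without elision support.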

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 128-131)