
2.4 Transactional Memory Use-Cases

2.4.5 Application Studies

Investigating larger applications is important because it lifts transactional memory research from the Petri dish into the realm of real code, with its unclean patterns and optimisations, interactions that challenge TM semantics, and performance characteristics that allow meaningful decisions about the design of HTM and STM solutions.

Initially, only STM support was available on native machines; HTM was only available in time-consuming and impractical simulation, and compiler support was lacking. With the availability of HTM hardware and compiler toolchains since 2014, activity in this area has picked up again. Dice, et al, experiment with an HTM simulator and investigate RB-trees, n-ary CAS, and lock elision in the C++ STL and libc [184]. They find that debugging (for correctness and performance) with HTM is hard, because several interesting interactions and observations are wiped away by the rollback at transaction abort. One such interaction involves JIT compilation: if the code is not yet JIT-compiled, the interpreter is likely to cause a transaction abort, which also undoes the update to the statistics counter that tracks how often a sequence of code is executed.

Click presents Azul’s experience with Java workloads and finds that financial modelling applications in particular scale well, as they are mostly data parallel, while web-tier application servers often require more tuning to scale to around 50 cores [199]. Click observes that good heuristics for choosing which locks to elide are hard to find, as uncontended locks can have lower overheads than uncontended transactions. He points to typical TM-unfriendly idioms, such as element counters for data structures and performance counters. These can be rewritten easily, but the rewritten versions have different semantics.
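One common way to rewrite a shared element counter is to split it into per-thread slots, which can be sketched as follows. This is an illustrative sketch only, not Azul's implementation: the class `StripedCounter` and its methods are hypothetical names, and Python merely shows the structure. The change in semantics is visible in `approximate_size`, which sums the slots without stopping writers and therefore no longer returns a value that was exact at one instant.

```python
import threading


class StripedCounter:
    """Element counter split into one slot per thread (hypothetical helper)."""

    def __init__(self):
        self._slots = {}                    # thread id -> that thread's count
        self._register = threading.Lock()   # guards creation of new slots only

    def add(self, delta=1):
        tid = threading.get_ident()
        if tid not in self._slots:
            with self._register:
                self._slots.setdefault(tid, 0)
        # Each thread writes only its own slot, so updates from different
        # threads never touch the same location.
        self._slots[tid] += delta

    def approximate_size(self):
        # Summing without stopping writers: under concurrent updates the
        # result may not equal the exact size at any single instant.
        with self._register:
            slots = list(self._slots.values())
        return sum(slots)


def demo(threads=4, per_thread=1000):
    counter = StripedCounter()

    def worker():
        for _ in range(per_thread):
            counter.add()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.approximate_size()  # exact once all writers have finished
```

In a transactional setting, the benefit is that two transactions inserting into the same data structure no longer conflict on a single counter location.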

Quake. The game Quake has been looked at in various publications. In a first approach, Zyulkyarov, et al, convert a Quake server with fine-grained locks to using transactions [206]. They observe that the hierarchical spatial locking in the BSP tree that contains the level and the entities in particular becomes much simpler. They also find, however, cases of non-block-structured lock usage that require complex code motion to be transformed into transactions. Finally, they find that several functions need annotations to exclude them from the TM mechanism, that I/O (often debug print-outs) must be delayed manually, and that the Intel C++ STM prototype compiler has issues. In their evaluation, they show that transactions do scale, but observe a 4x - 5x overhead over locks.

The second paper by the same authors, Gajinov, et al, starts with a sequential Quake game server and then uses transactions to parallelise it, again using the Intel C++ STM compiler [202]. As a result, the authors create only eight unique transactions, while the fine-grained version had 58. With ten man-months of work, they still suffer from overheads of 3x - 6x compared to locks, depending on the player count. They also find that game physics accounts for only a small fraction of the run time and that the application spends about 85% of its time inside transactions. This is a strong contrast to the overheads observed in [132], but can be explained by the fact that SPLASH-2 workloads (the application suite used for benchmarking there) spend significantly less time in critical sections (max. radiosity with 22%, geomean 2.6%). Figure 2.6 (top) shows the performance of the different Quake solutions. The authors invent a progress meter for aborted transactions, called reach points, which effectively are breadcrumbs implemented by non-transactionally incrementing a set of counters corresponding to lines passed in a transaction.
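The idea behind reach points can be illustrated with a toy model, assuming a write-buffering transaction whose buffered updates are discarded on abort. All names here (`ToyTransaction`, `run`, `move_player`, the counter labels) are hypothetical and are not taken from the paper. The counters are kept outside the transaction's write set, so they survive the abort and show how far the aborted attempt progressed.

```python
from collections import Counter


class Abort(Exception):
    """Raised to simulate a transaction abort."""


class ToyTransaction:
    """Buffers writes and applies them only on commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_set = {}

    def read(self, key):
        return self.write_set.get(key, self.memory[key])

    def write(self, key, value):
        self.write_set[key] = value


def run(memory, body, reach_points):
    tx = ToyTransaction(memory)
    try:
        body(tx, reach_points)
    except Abort:
        return False                 # write_set is discarded, counters are not
    memory.update(tx.write_set)      # commit: publish buffered writes
    return True


def move_player(tx, reach_points, conflict):
    reach_points["start"] += 1       # non-transactional increment
    tx.write("x", tx.read("x") + 1)
    reach_points["after_move"] += 1
    if conflict:
        raise Abort()                # simulated conflict
    tx.write("hits", tx.read("hits") + 1)
    reach_points["after_hit"] += 1


memory = {"x": 0, "hits": 0}
points = Counter()
aborted = run(memory, lambda tx, rp: move_player(tx, rp, conflict=True), points)
state_after_abort = dict(memory)     # unchanged by the aborted attempt
committed = run(memory, lambda tx, rp: move_player(tx, rp, conflict=False), points)
```

After the aborted attempt, `memory` is unchanged while `points` already records that the transaction passed `start` and `after_move`.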

Finally, Lupei, et al, extract only the spatial game tree from Quake and use it to run synthetic game scenarios with a high number of agents (600 to 2000), manually instrumenting the data structure with their own libTM STM library [227]. Interestingly, in their use case, game physics plays an important part in the result: with game physics enabled, the STM overheads are amortised by the better scalability compared to a fine-grained hierarchical approach at two threads; without physics, they too incur a roughly 4x overhead over the fine-grained locking scheme; see Figure 2.6 (middle, bottom).

[Figure 2.6: three line plots. Top: average frame time [ms] against threads (1, 2, 4, 8) for global_lock, lock_fine, TM_coarse, and TM_fine. Middle: processing time (s) against threads (0 to 8) for Locks and STM. Bottom: time (s) against threads (0 to 8) for Locks and STM at low, medium, and high contention.]

Figure 2.6: Performance of various implementations of Quake game logic.

Top: coarse and fine-grained locking in AtomicQuake and QuakeTM. From [202]. Middle: STM and lock performance of SynQuake with physics. Bottom: removing physics in SynQuake with various contention levels. From [227].

Testing GCC TM Support. Skyrme and Rodriguez port the Lua interpreter with the luaproc package from PThread primitives to using GCC’s TM support (with an STM backend) [298]. Luaproc differs from the programming languages discussed in more detail below, as it does not share memory between Lua co-routines and does not employ a single interpreter lock. Instead, programmers have to explicitly send messages between concurrent processes. The authors find that half of the locks convert easily to transactions along the obvious transformation lock(L); <stmt>; unlock(L); → tm_atomic { <stmt>; }. There are, however, significant challenges when converting condition variables, especially around the synchronous channels used for communication in the application. Skyrme and Rodriguez use relaxed transactions that can resort to acquiring a global lock in case of unsafe transaction content. Their resulting prototype with STM is about 2x slower than the locking version.

In contrast, Vyas, Ruan, et al, experiment with memcached and report their findings of replacing PThread-based synchronisation with GCC’s TM support [327]. Similar to Skyrme and Rodriguez, they find that some locks convert easily, while several patterns require code transformations: conditional synchronisation, and multiple and non-local unlock operations. In contrast to the Lua work, the authors here use the stricter atomic mode for their transactions, because they illustrate that the relaxed mode can easily stumble over an unsafe construct and then acquire the global lock. Particularly unfortunate examples are nested PThread lock acquisitions (even uncontended ones), which are marked unsafe and force serialisation of all transactions. With the stricter tm_atomic primitives, the compiler at least highlights the problem and subsequently forces transitive transactification for those locks as well, even though they would not have been contended. Furthermore, Vyas, Ruan, et al, deal with other unsafe operations by reimplementing some functions, by marshalling data to thread-private locations and then marking the original functions as tm_pure, and by postponing I/O to onCommit handlers. They finally remove the GCC-internal reader / writer lock for atomic blocks, which is there to guard them against relaxed blocks going non-speculative, and show performance very close to that of the original fine-grained locking memcached.
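Postponing I/O to commit handlers can be sketched with a toy retry loop. This is a simulation under stated assumptions, not GCC's TM runtime interface: `Tx`, `on_commit`, `atomically`, and `store_item` are hypothetical names. Actions registered during an attempt run only if that attempt commits, so output from aborted attempts never appears.

```python
class Abort(Exception):
    """Raised to simulate a transaction abort."""


class Tx:
    def __init__(self):
        self.deferred = []           # actions to run only if this attempt commits

    def on_commit(self, action, *args):
        self.deferred.append((action, args))


def atomically(body, max_attempts=10):
    for attempt in range(1, max_attempts + 1):
        tx = Tx()
        try:
            result = body(tx, attempt)
        except Abort:
            continue                 # deferred actions of this attempt are dropped
        for action, args in tx.deferred:
            action(*args)            # I/O happens after the commit point
        return result
    raise RuntimeError("transaction did not commit")


log = []


def store_item(tx, attempt):
    # Instead of printing inside the transaction, register the output.
    tx.on_commit(log.append, f"stored item on attempt {attempt}")
    if attempt < 3:
        raise Abort()                # pretend the first two attempts conflict
    return "ok"


result = atomically(store_item)
```

Although `store_item` runs three times, `log` ends up with a single entry, written by the attempt that committed.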

Early Hardware Applications. Dice, et al, use a Sun Rock prototype to explore data structures (double-ended queues, work-stealing queues, scalable non-zero indicators) and larger algorithms (memory allocation, simulated annealing) with HTM support [219]. They find that their HTM makes algorithm design very easy and generally provides good performance, while STMs perform poorly (8x slowdown). They also find that hybrid approaches can have cascading performance pathologies, either when deciding to switch entirely back from STM to HTM, or when requiring instrumentation on the HTM fast path. Dice, et al, find that transactions are generally short, and they argue against a lock fallback path because it would thwart composability, although this can clearly be worked around by acquiring a single global lock only for the outermost transaction. Instead, they encourage hardware vendors to provide an HTM that guarantees that small transactions eventually commit, yet admit that specifying and implementing such guarantees would be hard. They also ignore that composition may well be problematic in this case, too, as it increases the size of transactions.

Schindewolf, et al, investigate the HTM support of IBM BlueGene/Q and find that none of the existing TM benchmarks are structured similarly to other HPC applications [280]. They create a new benchmark, Clomp-TM, write a new Monte-Carlo simulation application, and convert Parsec’s fluidanimate to run on BG/Q HTM. Overall, they find that transaction overheads are high, needing at least 10 - 20 memory accesses per transaction to amortise. For the applications, they observe that the simple TM scales better than a coarse lock, while fine-grained locking still remains slightly faster than even a fully tuned HTM system.

Tuning. Diegues and Romano further investigate the tuning of HTM and develop an adaptive learning mechanism that tunes the retry policy for Intel’s RTM [312, 336]. They gain about 60% performance over the best static policy, especially in cases where the latter fails to track changes in topology (going from one thread per core to multi-threading) or workload (different transaction types in the same workload). They also observe that more complex fallback paths (NOrec rather than SGL) often perform worse, but do offer additional performance when transactions are consistently too large for the HTM.
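The parameter such a tuner adjusts can be made concrete with a minimal static retry policy, sketched below as a simulation. It is not the authors' learning algorithm and does not use real RTM intrinsics: `Outcome`, `execute`, `try_hardware`, and `run_in_software` are hypothetical names, and the hardware attempt is a callable that merely reports an outcome. An adaptive scheme would change `retry_budget` (and the reaction to each abort reason) at run time.

```python
import threading
from enum import Enum


class Outcome(Enum):
    COMMIT = "commit"
    CONFLICT = "conflict"    # transient: retrying may help
    CAPACITY = "capacity"    # working set too large: retrying is wasteful


fallback_lock = threading.Lock()   # single global lock (SGL) fallback


def execute(try_hardware, run_in_software, retry_budget):
    """Return which path completed the work and how many HTM attempts were made."""
    attempts = 0
    while attempts < retry_budget:
        attempts += 1
        outcome = try_hardware()
        if outcome is Outcome.COMMIT:
            return "htm", attempts
        if outcome is Outcome.CAPACITY:
            break                  # give up early, the transaction will not fit
    # On real hardware, speculative attempts would also have to read the lock
    # so that they abort while a thread executes the fallback path.
    with fallback_lock:
        run_in_software()
    return "fallback", attempts
```

With a budget of five, a transaction that conflicts twice and then commits stays on the hardware path; with a budget of two, the same transaction ends up under the global lock.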

Dice, et al, observe similar results, and additionally offer a software optimistic (seqlock-based) code path together with a tuning mechanism [310]. They show that for two workloads (a hashmap and an in-memory database), the adaptive policy can extract more performance than any of the static choices. An interesting case arises when the usage of the same data structure changes so that transactions become too large to fit into the HTM resources. Their approach reacts quickly and does not perform wasteful transactional attempts.
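For readers unfamiliar with seqlocks, the following sketch shows the general technique behind such a software optimistic path; it is not the code from [310], and `SeqLockPair` is a hypothetical name. Readers take no lock: they sample a sequence number, read the data, and retry if the number was odd or has changed. The sketch relies on CPython's interpreter semantics; a native implementation would additionally need memory fences.

```python
import threading


class SeqLockPair:
    """Two fields protected by a sequence lock (illustrative, CPython only)."""

    def __init__(self):
        self._seq = 0                # even: stable, odd: write in progress
        self._a = 0
        self._b = 0
        self._writers = threading.Lock()

    def write(self, value):
        with self._writers:          # writers still exclude each other
            self._seq += 1           # now odd: readers will retry
            self._a = value
            self._b = value
            self._seq += 1           # even again: snapshot is stable

    def read(self):
        while True:
            before = self._seq
            if before % 2:
                continue             # a write is in progress
            a, b = self._a, self._b  # optimistic read, no lock taken
            if self._seq == before:
                return a, b          # no writer interfered


def demo(writes=2000, reads=2000):
    pair = SeqLockPair()
    torn = []

    def writer():
        for i in range(1, writes + 1):
            pair.write(i)

    def reader():
        for _ in range(reads):
            a, b = pair.read()
            if a != b:
                torn.append((a, b))  # would indicate an inconsistent snapshot

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torn, pair.read()
```

Like a hardware transaction, the optimistic reader does no shared writes in the common case, which is why it is a natural alternative path when transactions do not fit into the HTM.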

Usui, et al, did similar work before HTM support was available to the public; instead, they collect similar statistics for locks and switch between acquiring the lock and running the critical section as a software transaction [235]. They show that their system correctly switches to the STM when its high overheads are amortised by better scalability than that of the single coarse lock.

Didona, et al, merge multiple STM, HyTM, and HTM back-ends behind the same transactional front-end (for GCC), and implement a recommendation system that learns and predicts the best algorithm to use [355]. They show that their system learns quickly and with low overhead (3%), and tracks the best performing solution per workload precisely.

Besides switching the TM or concurrency control mechanism, another tuning aspect is controlling the degree of concurrency: in many cases, increasing the number of concurrent threads attempting a transaction can lower the overall throughput by causing additional contention. While several precise techniques for scheduling hardware transactions have been proposed [167, 187], current-generation best-effort HTMs do not employ these techniques and generally give very little information about conflict reasons. Diegues, et al, show that with a probabilistic approach of announcing transactions before they are started, and recording matrices of concurrent transactions when aborting / committing, they can build up information about which transactions to avoid running together [312]. Using a pre-transaction fine-grained lock for these transactions, they show up to 60% improvement in throughput, especially in high-contention scenarios under SMT.
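The bookkeeping described above can be sketched as a pair of matrices indexed by transaction type. This is a simplified illustration, not the scheduler from [312]: the class `ConflictMatrix`, the threshold, and the minimum sample count are assumptions chosen for the example. Each finished transaction records which announced transactions were running alongside it and whether it committed; pairs with a high abort share become candidates for serialisation behind a fine-grained lock.

```python
from collections import defaultdict


class ConflictMatrix:
    def __init__(self, threshold=0.5, min_samples=10):
        self.threshold = threshold       # abort share above which a pair is serialised
        self.min_samples = min_samples   # do not decide on too little evidence
        self.aborts = defaultdict(int)   # (tx type, concurrent tx type) -> count
        self.commits = defaultdict(int)

    def record(self, tx_type, announced, committed):
        """Log the outcome of tx_type against every concurrently announced type."""
        table = self.commits if committed else self.aborts
        for other in announced:
            table[(tx_type, other)] += 1

    def should_serialise(self, tx_type, other):
        aborts = self.aborts.get((tx_type, other), 0)
        total = aborts + self.commits.get((tx_type, other), 0)
        if total == 0 or total < self.min_samples:
            return False
        return aborts / total > self.threshold


matrix = ConflictMatrix()
for _ in range(8):
    matrix.record("transfer", {"audit"}, committed=False)
for _ in range(2):
    matrix.record("transfer", {"audit"}, committed=True)
for _ in range(10):
    matrix.record("transfer", {"lookup"}, committed=True)
```

Here `transfer` aborted in 8 of 10 runs next to `audit`, so that pair would be serialised, while `transfer` and `lookup` can keep running together. The information is only probabilistic, because an abort is attributed to every announced transaction, not to the one that actually caused it.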

Brown, et al, use a similar technique of controlling concurrency for systems with multiple sockets; they find that spreading execution to the other socket can often have a significant performance impact [354]. Instead of controlling transaction pairs, they monitor execution and if necessary, switch through sockets in a round-robin fashion, and let only the designated socket execute transactions (similar to cohort locks [261]).

Fresh Applications. One new class of applications is in-memory databases. These store their data sets not on disks (spinning HDDs or SSDs), but in main memory. Interestingly, this means that concurrency control techniques such as two-phase locking (2PL) do not work as well as before, because their costs are no longer hidden behind media access costs. Therefore, multiple groups are experimenting with HTM as a concurrency control mechanism for their in-memory DB systems.

Wang, et al, use Intel TSX twice in their system: once in the underlying memory store, accelerating a B+ tree and hash table, and then also as the validation and commit engine for their higher-level transaction layer [329]. They use a technique similar to boosting [181], where the respective data structures are thread safe and the higher-level consistency is ensured via proxies. Their high-level transaction system logs reads (results of data structure query operations) and buffers writes in software, and then uses HTM to validate the results and perform the write-back of the higher-level transaction. A similar method was used for Hybrid TMs by Matveev and Shavit in [305]. The authors use higher-level sequence numbers attached to the data store as the proxies for conflict detection. The resulting in-memory database performs twice as fast as state-of-the-art fine-grained locking versions.
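The read-log, write-buffer, and validate-then-write-back structure can be sketched as follows. This is a toy model and not the system from [329]: `Store`, `Txn`, and their methods are hypothetical names, and a lock (`htm_region`) stands in for the short hardware transaction that would make validation and write-back atomic.

```python
import threading


class Store:
    def __init__(self):
        self.data = {}                      # key -> (value, sequence number)
        self.htm_region = threading.Lock()  # stand-in for a hardware transaction


class Txn:
    def __init__(self, store):
        self.store = store
        self.reads = {}                     # key -> sequence number seen
        self.writes = {}                    # key -> buffered new value

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        value, seq = self.store.data.get(key, (None, 0))
        self.reads.setdefault(key, seq)     # log the version that was read
        return value

    def put(self, key, value):
        self.writes[key] = value            # buffered until commit

    def commit(self):
        with self.store.htm_region:         # validate and write back atomically
            for key, seen in self.reads.items():
                if self.store.data.get(key, (None, 0))[1] != seen:
                    return False            # a read is stale: abort
            for key, value in self.writes.items():
                seq = self.store.data.get(key, (None, 0))[1]
                self.store.data[key] = (value, seq + 1)
            return True


store = Store()
store.data["balance"] = (100, 1)
t1, t2 = Txn(store), Txn(store)
t1.put("balance", t1.get("balance") - 10)
t2.put("balance", t2.get("balance") - 20)
first = t1.commit()                         # succeeds and bumps the sequence number
second = t2.commit()                        # fails validation: its read is stale
```

Only the sequence numbers are compared at commit time, which is what keeps the atomic region small enough to fit into a best-effort HTM.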

In parallel with Wang, et al, Leis, et al, also rework an existing in-memory database to use transactions [319]. Their approach is very clearly described and very similar to a visible-reader version of Riegel, et al’s LSA / TinySTM [103, 160]. As with Wang, et al, however, these operations are lifted from normal memory to be performed in the actual memory store. They use HTM to perform the tracking / update of the read / write timestamps per element, and also for concurrency control in the backend of the memory store. Interestingly, Leis, et al, use the HLE version of Intel’s TSX, rather than manually controlling retry of the smaller transactions. They show that their approach scales as well as a static optimal partitioning approach (and much better than 2PL and a single-lock approach), while maintaining performance when the partitioning scheme does not align well with the access pattern; see Figure 2.7.

[Figure 2.7: two plots of transactions per second (0 to 400,000) for partitioned, HTM, optimistic, serial, and 2PL. Left: against multiprogramming level (1 to 4 threads). Right: against the share of partition-crossing transactions (0% to 80%).]

Figure 2.7: In-memory database performance with scalability of different synchronisation mechanisms (left) and varying number of partition-crossing transactions (right). From [319].

Finally, Odaira and Nakaike revisit the earlier research area of thread-level speculation (TLS), and investigate whether current best-effort HTM solutions without dedicated TLS support can be useful [323]. They manually instrument promising workloads from the SPEC CPU2006 suite, and get a best-case speedup of 11%. In most cases, however, they find that simple best-effort HTM is not suitable for performing TLS. Their main source of slowdowns is not the lack of ordered transaction commits (they emulate that with a counter), but instead loop-carried dependencies that can be forwarded in TLS, but cause aborts in the simple best-effort HTM case. On top of that, they also find that conflict detection at cache-line granularity requires careful splitting of the loop iterations in order not to cause false conflicts between parallel executions of a loop body.
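The cache-line issue comes down to simple index arithmetic, sketched below for a loop over an array. The sketch assumes 64-byte cache lines, a fixed element size, and an array whose base address is line-aligned; `aligned_chunks` is a hypothetical helper, not the authors' tooling. Iterations are handed out in whole cache lines so that two speculative chunks never write to the same line.

```python
def aligned_chunks(n_elements, element_size, workers, line_size=64):
    """Split range(n_elements) so that every chunk starts on a cache-line boundary."""
    per_line = max(1, line_size // element_size)   # elements sharing one line
    lines = -(-n_elements // per_line)             # ceiling division
    lines_per_worker = -(-lines // workers)
    step = lines_per_worker * per_line             # iterations per chunk
    chunks = []
    for w in range(workers):
        start = min(w * step, n_elements)
        stop = min(start + step, n_elements)
        chunks.append(range(start, stop))
    return chunks


# 100 eight-byte elements, four workers: eight elements share a line,
# so chunk boundaries fall on multiples of eight.
bounds = [(c.start, c.stop) for c in aligned_chunks(100, 8, 4)]
```

A naive split into four chunks of 25 iterations would place elements 24 and 25 in different chunks but in the same cache line (elements 24 to 31), so neighbouring chunks would conflict even though they touch different elements.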
