Language Support - Transactional Memory Use-Cases

2.4 Transactional Memory Use-Cases

2.4.3 Language Support

Adding transactional memory support to programming languages is a significant research question. The problem and solution space spans several dimensions:

Programming Language managed languages that are either interpreted or JIT-compiled generally permit more opportunity for unsound / incomplete optimisations whose correctness is ensured at runtime; their de-emphasis or outright lack of pointers makes it possible to perform more operations “under the hood”; unmanaged languages generally cannot rely on type-safe memory, runtime instrumentation support, or garbage collection

Language and Compiler Support extending languages and compilers to understand transactions generally unlocks cleaner semantics and easier code, and can also unlock optimisation options in the compiler that take into account the specifics of transactions; library-based designs, however, allow much faster prototyping and do not require detailed knowledge of and modifications to language semantics and compiler internals

Visibility of TM in many cases, especially in interpreted high-level languages, TM can be used in the actual language implementation itself, without being visible to the programmer

Java With its detailed memory model, built-in support for concurrency, and a familiar, imperative syntax, Java has been subject to significant work with respect to transactional memory. Some of the first software TMs were built in / for Java. Harris and Fraser expose a low-level load / store TM interface directly to the programmer, while Herlihy, et al, use a TMObject indirection wrapper class that requires explicit opening of objects for read / write [66, 68]. Later, Herlihy, et al, refine their model to automatically wrap existing classes / objects into their transactional counterparts through a transactional factory; similarly, they implement transactions as continuations fed into a transactional execution core [112], significantly extending what can be achieved without changes to the compiler / runtime.
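The explicit-open style can be illustrated with a small sketch. It is written in Python rather than Java for brevity, and the names `TMObject.open_read`, `TMObject.open_write`, and `Transaction` are hypothetical stand-ins, not the API of the cited systems. The sketch models only the indirection and the private copy made on a write open; conflict detection and contention management are deliberately left out.

```python
import copy


class Transaction:
    """Hypothetical transaction descriptor holding private copies."""

    def __init__(self):
        # id(TMObject) -> (TMObject, private copy); keeping the TMObject in the
        # tuple keeps it alive, so its id() stays valid as a key.
        self.writes = {}

    def commit(self):
        # Install every private copy as the new committed version.
        for obj, private in self.writes.values():
            obj._committed = private
        self.writes.clear()

    def abort(self):
        # Dropping the private copies is enough to undo all tentative writes.
        self.writes.clear()


class TMObject:
    """Hypothetical indirection wrapper: the payload is reached only via open_*."""

    def __init__(self, payload):
        self._committed = payload

    def open_read(self, tx):
        # A transaction sees its own private copy if it has one.
        return tx.writes.get(id(self), (None, self._committed))[1]

    def open_write(self, tx):
        # The first write open clones the committed version for this transaction.
        if id(self) not in tx.writes:
            tx.writes[id(self)] = (self, copy.deepcopy(self._committed))
        return tx.writes[id(self)][1]
```

Every access has to go through an explicit open call, which is the clutter that the later factory-based and compiler-based designs aim to remove.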

Adl-Tabatabai, et al, however, show that with language and compiler support, the programming model becomes very easy to use and the awareness of the compiler and JIT can significantly reduce the overheads associated with STM [95]. They extend the Java language with an atomic { ... } keyword, add markers for the transaction boundaries, instrument loads / stores, and perform aggressive optimisations: hoisting open operations out of loops, subsuming read opens under write opens, and inlining the fast path of the read / write TM barriers. Furthermore, they drop instrumentation for immutable and transaction-local variables. As a result, they show overheads only in the 20% range even for memory-intensive data structure benchmarks. They perform these operations in their production-level Intel Java environment and the resulting tools are not available.

With further optimisations, Shpeisman, et al, later even manage to add strong isolation to that system: they add barriers also for non-transactional accesses and perform extensive hybrid analysis on which objects remain thread private, and which are never accessed from both transactions and non-transactional code [146]. For the much improved stronger behaviour of the resulting system, they manage to keep overheads below 40%.

Finally, Korland, et al, propose a hybrid that does not require modifications to the compiler or JIT, does not require programmer annotation of every object / memory access, and allows simple replacement of the underlying TM implementation [226]. Their tool, called Deuce, dynamically instruments Java byte code at class load time and adds callbacks for transaction start / commit, and for every load / store. Their flexible approach does, however, cause significantly higher overheads than those reported by Adl-Tabatabai, Shpeisman, et al. They make their tool available as part of the VELOX TM Stack5_.

More recently, Zhang, et al, revisit STMs for Java and propose a strongly isolating STM with visible readers, undo logging, and low overheads [347]. With careful tuning of the locks used (biased locks), the authors achieve low overheads (30% - 70%) despite also instrumenting non-transactional code; this is significantly lower than the overhead of Deuce, and comparable to that of the Intel Java STM. Despite supporting visible readers (and better progress guarantees), transactions are not opaque because of the lazy read set validation.

Haskell Given its strongly functional, side-effect-free nature, Haskell already requires annotations for shared memory accesses. Harris adds support for STMs to the language and proposes several higher-level TM language features, such as orElse and retry. Subsequent work by Perfumo, et al, created Haskell TM benchmarks and characterised their working set sizes and other behaviour [145, 176]; they find that the serialised commit phase is one of the contributing factors for limited scalability. The authors further propose an early release primitive for higher performance [144].
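The composition semantics of retry and orElse can be sketched with a toy, single-threaded model. It is written in Python, not Haskell, and is not the Haskell STM API: `Retry`, `Tx`, `or_else`, `atomically`, and `take` are hypothetical names chosen for illustration. The key point it tries to show is that when the first alternative retries, only its tentative writes are discarded before the second alternative runs.

```python
class Retry(Exception):
    """Raised by a transaction body that cannot make progress yet."""


class Tx:
    def __init__(self, store):
        self.store = store  # committed state (a plain dict)
        self.log = {}       # tentative writes of this attempt

    def read(self, key):
        return self.log.get(key, self.store[key])

    def write(self, key, value):
        self.log[key] = value


def or_else(first, second):
    """Run `first`; if it retries, undo its writes and run `second` instead."""
    def body(tx):
        saved = dict(tx.log)
        try:
            return first(tx)
        except Retry:
            tx.log = saved  # discard only the writes made by `first`
            return second(tx)
    return body


def atomically(store, body):
    """Run body once and commit its log on success. Returns (committed, result)."""
    tx = Tx(store)
    try:
        result = body(tx)
    except Retry:
        # Assumption of this toy: report failure instead of blocking.
        # A real STM would wait until a variable that was read changes.
        return False, None
    store.update(tx.log)
    return True, result


def take(queue_key):
    """Transaction body: remove the head of a queue, or retry if it is empty."""
    def body(tx):
        items = tx.read(queue_key)
        if not items:
            raise Retry()
        tx.write(queue_key, items[1:])
        return items[0]
    return body
```

With these definitions, `atomically(store, or_else(take("a"), take("b")))` takes from queue "b" whenever queue "a" is empty, and the choice between the two is made inside a single atomic step.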

C / C++ Dalessandro, et al, use C++’s advanced type and meta-programming mechanisms to provide a cleaner interface to their library STM [149]. Using a smart pointer pattern, they simplify the interface significantly, reduce clutter, and make the interface much safer to use. However, after porting a larger application (the Delaunay triangulation mentioned earlier) and facing the need for accessors and the lack of optimisation, they advocate full language and compiler support.

Similar to their significant efforts in Java, Intel’s Wang, et al, integrate TM into their production-level icc C / C++ compiler [132]. They observe that adding such support to unmanaged languages is much harder than their companion effort for Java, mainly because of the lack of safety, run-time inspection, and garbage collection. Their optimisations largely resemble those for Java: redundant barrier elimination, fast path inlining, and register snapshotting optimisations. For simple data structure tests (the worst case), they achieve overheads of about 60%, and reduce those in workloads that perform computation, for example to 6.4% in SPLASH2. Wang, et al, use pragmas for code blocks and functions to mark them as transactional and subsequently generate two versions that can be called inside and outside of transactions respectively.

Mirroring the situation in Java, Felber, et al, add a “transactifying” pass to the modular LLVM framework and use that to add transactional memory to C and C++ [150]. They add calls to an STM library for loads / stores inside transactions as a separate pass after optimisations have already removed redundant accesses; they can also run optimisations again to inline the fast paths of those instrumentations by performing whole program optimisation across the application and the TM library. Their annotation tools Tanger and Tarifa are available as open source as part of the DTMC suite6_.

Instead of annotating entire functions, Crowl, et al, argue for a simpler programmer interface: they suggest simply prefixing a (compound) statement with transaction should turn the entire code transactional, rather than relying on the programmer to provide function annotations which they argue is too tedious. They also explore the tricky behaviour of exceptions that occur inside of transactions, and suggest privatisation-safe weak isolation, rather than strong isolation.

5_{https://github.com/DeuceSTM/DeuceSTM}
6_{https://github.com/basicthinker/PTMC}

Yoo, et al, investigate larger applications on top of STMs and find that especially with STMs, the overheads of automatic instrumentation (as opposed to manual instrumentation) can be as high as 10x for some workloads (genome in STAMP) [174]. Careful hash table design (false conflicts are a significant problem), filtering of thread-private data and local variables, annotating non-privatising transactions, and even replacing short transactions with a global lock can improve the performance of STMs significantly. The authors also argue for compiler support for these techniques and for mechanically creating transactional copies of code.
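Why the hash table matters can be seen in a small sketch. It assumes, purely for illustration, an STM that maps each accessed address to a slot in a fixed-size table of ownership records by simple modulo hashing; the functions `orec_index` and `false_conflicts`, the word size, and the addresses below are hypothetical and not taken from the cited work. Two transactions that write disjoint addresses still collide whenever their addresses hash to the same slot.

```python
def orec_index(address, table_size, word_bytes=8):
    """Map a memory address to a slot in a hypothetical ownership-record table."""
    return (address // word_bytes) % table_size


def false_conflicts(writes_a, writes_b, table_size):
    """Count writes of B that hit a slot owned by A although the address differs."""
    slots_a = {}
    for addr in writes_a:
        slots_a.setdefault(orec_index(addr, table_size), set()).add(addr)
    conflicts = 0
    for addr in writes_b:
        owners = slots_a.get(orec_index(addr, table_size), set())
        if owners and addr not in owners:
            conflicts += 1
    return conflicts


# Two transactions writing eight words each in disjoint memory regions.
region_a = [0x1000 + 8 * i for i in range(8)]
region_b = [0x5000 + 8 * i for i in range(8)]
small_table = false_conflicts(region_a, region_b, table_size=16)
large_table = false_conflicts(region_a, region_b, table_size=4096)
```

In this constructed example every write of the second transaction collides with the first in the 16-entry table, while the 4096-entry table produces no collisions; real workloads sit somewhere in between, and the hash function matters as much as the table size.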

Invisible Usage of TM Instead of explicit language extensions, or a library interface, TM can be used in the infrastructure of a programming language system. Several publications elide the global interpreter lock (GIL) in popular interpreted programming languages such as Python and Ruby.

Riley and Zilles modify the PyPy Python interpreter and execute it on a behavioural VTM full-system model to elide the GIL that synchronises access to the internal structures when multiple threads exist in the application [110]. They also add higher-level primitives such as pause, compensation callback hooks, retry, and alert-on-update to the HTM, and use these, for example, to also elide locks held by the application. In order to get scalability, the authors need to undo several optimisations that are helpful when executing with mutual exclusion, but needlessly induce conflicts when used in transactions. For example, several fields that are logically per-thread are held in single global variables and updated on “thread switch”. Finally, the authors propose an interesting way of dealing with system calls inside transactions: they abort the transaction, but on the retry, push the system call to the commit hook. If the result of the system call is needed earlier, the transaction will abort. Due to the behavioural simulator, Riley and Zilles do not present performance data, but show transaction characteristics for several workloads.

The reference implementation for Python is not PyPy, but instead CPython. Several authors investigate GIL elision for that interpreter, as well: Blundell, et al, use it as the “poster child” workload for their Retcon approach that allows transactions to defer arithmetic operations in HW [236]. Instead of causing WaW conflicts, the authors buffer and subsequently aggregate operations on detected memory locations, and proxy the information locally. Conditional branches that depend on such proxy data are predicted and the mathematical relation is added to a log that is checked at transaction commit. In the analysis, CPython is the workload that benefits the most from this rather invasive technique, because it relies heavily on reference counting, which increments and decrements the reference count of objects also for readers, thereby destroying all parallelism for readers of said object. The authors evaluate their idea in simulation; after code restructuring similar to that of Riley and Zilles, and with their technique to permit concurrent reference counting, they improve the workload from no scaling to almost linear scaling (25x on 32 cores).

Tabba uses Rock prototype hardware and also attempts to elide the GIL in CPython [233]. In addition to the code restructurings described earlier, he tweaks the lock (elision) granularity: typically, the GIL is held for multiple consecutive Python instructions (100 by default), but that can cause transactional overflow, so Tabba conservatively only runs a single instruction per transaction. After essentially switching off reference counting, he achieves good scalability for a shared-nothing application with multiple threads.

Finally, Odaira, et al, build upon this body of work, but can rely on commercial HTM support (they evaluate Intel’s Haswell Xeon with TSX, and IBM zEC12 with HTM support). Instead of Python with reference counting, they elide the GIL in Ruby, which has a mark-sweep garbage collector that permits concurrent readers [322]. Thanks to hardware support, they are also able to run larger workloads in the improved Ruby interpreter. In addition to the now “typical” code modifications, and similar to Tabba’s earlier work, the authors perform adaptive transaction size control: when the abort rate of a transaction exceeds a specific threshold, it will execute fewer Ruby instructions before committing. Other surprising sources of aborts are calls into the garbage collector, which of course cannot scan large regions of memory while running inside a transaction, and shared caches that are used for accelerating name lookups and then get updated on a miss, causing conflicts.
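The adaptive transaction size control described above can be sketched as a small controller. This is a minimal illustration in Python under stated assumptions: the class name `AdaptiveTxLength`, the halving policy, the sampling window, and all default values are hypothetical and not taken from the cited work, which only states that fewer interpreter instructions are executed per transaction once the abort rate exceeds a threshold.

```python
class AdaptiveTxLength:
    """Hypothetical controller for the number of interpreter instructions
    executed per transaction at one transaction start point."""

    def __init__(self, initial=100, minimum=1, threshold=0.1, window=50):
        self.length = initial       # instructions per transaction
        self.minimum = minimum      # never shrink below this
        self.threshold = threshold  # tolerated abort rate per window
        self.window = window        # attempts per measurement window
        self.attempts = 0
        self.aborts = 0

    def record(self, aborted):
        """Record one transaction attempt and return the length to use next."""
        self.attempts += 1
        self.aborts += int(aborted)
        if self.attempts == self.window:
            # Assumed policy: halve the length when too many attempts aborted.
            if self.aborts / self.attempts > self.threshold:
                self.length = max(self.minimum, self.length // 2)
            self.attempts = 0
            self.aborts = 0
        return self.length
```

In this sketch the length only shrinks; whether and how to grow it again after conflicts subside is a separate design choice that is left open here.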

Simple lock elision of application locks is, of course, the other big “invisible” use case of TM. Azul originally built their own HTM to elide Java’s monitors and synchronized blocks [199]; in the meantime, HotSpot, Oracle’s default JVM, has acquired support for using Intel TSX to the same effect [315, 324]. In the C and C++ world, the Pthread mutex in the standard C library is the synchronisation “staple”, and it also got support for lock elision with Intel TSX [272, 318].

In the course of my thesis, I have added support for ASF to the same code base7_, and my colleague Martin Pohlack experimented with Python (unpublished). We have also published on semi-transparent lock elision that replaces the lock implementation at load time (through library preloading), and performs lock elision with ASF, and in cases of failure, reverts to calling the original locking functions [254]. Finally, in the VELOX project, we have evaluated the DTMC compiler framework with TinySTM and ASF for HTM support [210], and added support for transactional memory to GCC.
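The elide-then-fall-back pattern behind such lock elision can be sketched as follows. This is a Python model only: the function `elided`, the exception `TxAbort`, the `try_transactionally` callback, and the retry limit are hypothetical, and the hardware transaction is represented by a callable that may raise. Neither ASF nor TSX is actually used here.

```python
import threading


class TxAbort(Exception):
    """Stands in for an abort of a hardware transaction."""


def elided(lock, critical_section, try_transactionally, max_attempts=3):
    """Run critical_section without acquiring `lock` if a transaction succeeds;
    otherwise fall back to the original lock. Returns (result, path taken)."""
    for _ in range(max_attempts):
        if lock.locked():
            # Someone holds the lock for real, so speculation cannot succeed.
            # (Real elision would read the lock word inside the transaction.)
            break
        try:
            return try_transactionally(critical_section), "elided"
        except TxAbort:
            continue  # retry speculation up to max_attempts times
    with lock:  # fallback: the original, non-speculative locking path
        return critical_section(), "locked"


def always_abort(section):
    """Test double: a transaction that never commits."""
    raise TxAbort()


demo_lock = threading.Lock()
fast_path = elided(demo_lock, lambda: 42, lambda section: section())
slow_path = elided(demo_lock, lambda: 42, always_abort)
```

Because the fallback simply calls the original locking path, the critical section behaves the same way whether or not speculation succeeds; only the path taken differs.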

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 51-54)