Although ParAτ does not make use of speculation, speculation is an important technique for enabling automatic parallelization. Recent work by Niall et al. [11] goes as far as claiming that reasonable parallelization cannot rely solely on static dependence analysis and instead has to use speculation to fully exploit parallelism that is only dynamically exploitable. Unfortunately, existing speculation techniques and mechanisms come at a high price in terms of runtime overhead, which needs to be reflected in the decisions of an automatic parallelizer. The problem with speculation is that its overhead inherently depends on the misspeculation rate, which in turn depends on runtime features: the number of threads or tasks running in parallel as well as the structure of the input.
Developing a framework for speculative execution is not part of this thesis’ work and is therefore not explained in thorough detail. Instead, this section gives an overview of the different options implemented in Sambamba as part of a different PhD thesis [29], as far as it is useful to put decisions explained in later chapters of this thesis into context.
Two different speculation approaches have been developed and integrated into the Sambamba framework. One approach is based on Software Transactional Memory and implemented as an extension of TinySTM [99, 100].
The other approach [14] is based on a concept commonly known as thread-level speculation (TLS). It has been specifically implemented from the ground up to drive speculative execution as required by the parallelization framework described in this thesis.
4.2.1 Software Transactional Memory
Transactional memory systems [101] are motivated by the corresponding concept in database systems and typically guarantee atomicity and isolation. Atomicity refers to the property that the memory effects of instructions contained within the same transaction are, from an external point of view, either all visible (committed to main memory) at once or not visible at all. The system guarantees that the effects of a transaction are never only partially visible. We further differentiate between strong and weak atomicity: weak atomicity guarantees atomicity only between different transactions, whereas strong atomicity also guarantees atomicity between transactions and the surrounding code outside of the control of the transactional memory system.
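The all-or-nothing visibility described above can be illustrated with a minimal, single-threaded sketch. The class `ToyTransaction` and its methods are hypothetical and serve illustration only; they do not correspond to the interface of TinySTM or Sambamba. Writes are buffered privately and become visible in the shared memory only at commit. A real system would additionally have to make the commit step itself atomic with respect to concurrent threads and detect conflicts, which this sketch omits.

```python
class ToyTransaction:
    """Toy write-buffering transaction (illustration only, not thread-safe)."""

    def __init__(self, memory):
        self.memory = memory   # shared state, modeled as a dict
        self.writes = {}       # private write buffer

    def read(self, key):
        # A transaction sees its own pending writes first.
        return self.writes.get(key, self.memory[key])

    def write(self, key, value):
        # Buffered: invisible to everyone else until commit.
        self.writes[key] = value

    def commit(self):
        # All buffered effects become visible together.
        self.memory.update(self.writes)
        self.writes = {}

    def abort(self):
        # None of the buffered effects ever become visible.
        self.writes = {}
```

Until `commit` is called, an observer reading `memory` directly sees none of the transaction's writes; after `abort`, it never will.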
Transactional memory systems implemented in software only (STM) typically guarantee weak atomicity as they rely on instrumenting the code contained in transactional sections.
Guaranteeing strong atomicity would require the system to instrument the whole code, including dynamically linked parts, which not only poses technical issues but also incurs a non-negligible overhead.
Hardware transactional memory systems (HTM) in contrast typically provide strong atomicity. The usual implementations, like the Intel Transactional Synchronization Extensions (TSX) [102], are based on the cache coherence protocol and impose significantly less runtime overhead in comparison to software-only implementations.
One of the two speculation systems implemented as part of the Sambamba framework is based on TinySTM [99, 100] and comes with the typical performance overhead of an STM system. To make it usable in an automatic parallelization context, where preserving the sequential semantics of the parallelized application is essential, the implementation contained in Sambamba adds a commit order: the order between transactions resulting from automatic parallelization (for instance as done by ParAγ) is defined by the broken, i.e., speculatively ignored, dependences. This is an important criterion that heavily influences parallelization decisions. As a result of this requirement, it is illegal to form transactions in Sambamba whose speculatively ignored dependences impose a circular commit order between the transactions. Chapter 6 describes in detail how this is guaranteed by ParAγ.
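The acyclicity requirement on the commit order can be illustrated with a minimal sketch. It assumes that every speculatively ignored dependence is given as a pair (src, dst) of transaction identifiers, meaning that src has to commit before dst. This representation and the function name `commit_order` are hypothetical; the sketch uses a standard topological sort and is not the mechanism by which ParAγ guarantees the property.

```python
from collections import defaultdict, deque


def commit_order(transactions, broken_deps):
    """Return a commit order consistent with the broken dependences,
    or None if they impose a circular order (an illegal transaction set).

    broken_deps: iterable of (src, dst) pairs; src must commit before dst.
    """
    successors = defaultdict(list)
    indegree = {t: 0 for t in transactions}
    for src, dst in broken_deps:
        successors[src].append(dst)
        indegree[dst] += 1

    # Kahn's algorithm: repeatedly emit transactions without pending predecessors.
    ready = deque(t for t in transactions if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for succ in successors[t]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    # If some transaction was never emitted, it lies on a cycle.
    return order if len(order) == len(transactions) else None
```

For three transactions with the dependences (t1, t2) and (t2, t3), the function yields the order t1, t2, t3; adding (t3, t1) closes a cycle and the function returns `None`.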
Due to its typically small setup overhead per transaction (in contrast to the above-mentioned high overhead per protected memory operation and commit), STM is particularly well suited to replace locking primitives protecting comparatively small critical parts of big parallel tasks. It is not, however, a good fit for completely protecting very big speculative parallel tasks. To cover that use case, Sambamba additionally provides an alternative speculation mechanism based on process forking as described in the next section.
4.2.2 K-TLS
K-TLS is in most cases the speculation system of choice in Sambamba. It is a so-called thread-level speculation (TLS) system based on process forking, which isolates the memory effects of individual transactions and thereby guarantees atomicity and isolation. In contrast to the STM implementation described in the previous section, K-TLS and similar systems come with a high initial setup overhead per speculatively spawned task, but nearly no overhead per protected memory operation within a task. Upon completion of a speculatively parallel task and a successful conflict check, the memory effects of a transaction are made accessible to the main process by atomically moving over written memory pages.
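The trade-off between the two overhead profiles can be made concrete with a rough linear cost model. The model and all constants used with it are assumptions made for illustration; they are not measurements of TinySTM or K-TLS. Overhead is modeled as a fixed setup cost per task plus a cost per protected memory access.

```python
def speculation_overhead(setup_cost, per_access_cost, num_accesses):
    """Overhead of one speculative task under a simple linear model."""
    return setup_cost + per_access_cost * num_accesses


def break_even_accesses(stm, tls):
    """Number of protected accesses above which the TLS-style profile is cheaper.

    stm, tls: (setup_cost, per_access_cost) pairs in arbitrary cost units.
    Returns None if the STM-style profile is not more expensive per access,
    in which case the model has no break-even point of this kind.
    """
    stm_setup, stm_access = stm
    tls_setup, tls_access = tls
    if stm_access <= tls_access:
        return None
    return (tls_setup - stm_setup) / (stm_access - tls_access)


# Made-up constants: cheap setup and costly accesses versus the opposite.
STM_PROFILE = (1.0, 0.5)
TLS_PROFILE = (1000.0, 0.0)
```

With these made-up constants, the break-even point lies at 1998 protected accesses: a task performing 100 accesses is far cheaper under the STM-style profile, whereas a task performing 10000 accesses is cheaper under the TLS-style profile.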
The K in K-TLS comes from kernel and hints at the implementation as part of the operating system kernel. Only this way can it effectively use the hardware-based memory management to keep the overhead as low as possible. Another advantage is that handling system calls that occur during speculative execution becomes straightforward, even for dynamically loaded binaries that do not allow for instrumentation of the code. All system calls are handled, and can therefore be intercepted, by the OS kernel. If a system call happens speculatively, the transaction can be aborted or stalled until all transactions preceding it in the commit order have completed. K-TLS requires such a commit order just like the alternatively usable STM.
The high setup cost of transactions in K-TLS is bearable given the low overhead of memory operations, as big transactions are typically the goal of task-based parallelization of general-purpose applications. The downside of K-TLS, or more generally of systems exploiting the virtual memory system for conflict detection, is the granularity of conflict detection, which typically is at the page level. That means that a conflict is detected, and consequently the speculative execution is rolled back and repeated, if two speculative tasks write to the same memory page of typically four kilobytes. Depending on the application, this granularity might lead to a significant number of so-called false conflicts caused by two tasks writing to completely disjoint regions of the same memory page.
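A false conflict of this kind can be reproduced with a small sketch. It assumes that the write set of each task is given as a collection of byte addresses and, following the description above, considers only writes of both tasks. The function names are hypothetical and do not reflect the K-TLS implementation.

```python
PAGE_SIZE = 4096  # bytes; the typical page size of four kilobytes


def written_pages(write_addresses, page_size=PAGE_SIZE):
    """Map byte addresses to the set of page numbers they fall into."""
    return {addr // page_size for addr in write_addresses}


def page_conflict(writes_a, writes_b, page_size=PAGE_SIZE):
    """Page-level check: do both tasks write to a common page?"""
    return bool(written_pages(writes_a, page_size) & written_pages(writes_b, page_size))


def is_false_conflict(writes_a, writes_b, page_size=PAGE_SIZE):
    """A conflict is false if the pages overlap but the written addresses do not."""
    overlapping_addresses = set(writes_a) & set(writes_b)
    return page_conflict(writes_a, writes_b, page_size) and not overlapping_addresses


# Two tasks writing to disjoint halves of the page starting at 0x1000.
task_a = [0x1000, 0x1008]
task_b = [0x1800]
task_c = [0x2000]  # writes to a different page
```

Here `task_a` and `task_b` are reported as conflicting although they never touch the same address, so one of them would be rolled back unnecessarily; `task_a` and `task_c` do not conflict.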
The STM system of the previous section is instead able to detect conflicts at the word level, which nearly eliminates the risk of false conflicts. Due to the necessary instrumentation of the code, however, it comes with the limitations and drawbacks described earlier, especially with respect to performance.
K-TLS+ is a hybrid system that seeks to overcome the limitations induced by the page-level conflict detection by again relying on instrumentation of speculatively executed code to resolve potential false conflicts within a page. The granularity, and with it the overhead of the required instrumentation, is configurable in K-TLS+.
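The effect of a configurable granularity can be sketched as a two-stage check. This is an illustration under assumptions: the text does not specify how K-TLS+ resolves conflicts within a page, and the function `hybrid_conflict` and its parameters are hypothetical. The first stage is the page-level filter; only if it reports a conflict are the write sets compared again at the finer granularity that instrumentation would make available.

```python
def blocks(addresses, granularity):
    """Map byte addresses to block numbers of the given size in bytes."""
    return {addr // granularity for addr in addresses}


def hybrid_conflict(writes_a, writes_b, page_size=4096, granularity=8):
    """Two-stage write-write conflict check (illustration only).

    granularity: block size in bytes used to refine page-level conflicts;
    smaller values detect fewer false conflicts but would require more
    fine-grained (and thus more costly) instrumentation.
    """
    # Stage 1: page-level filter.
    if not (blocks(writes_a, page_size) & blocks(writes_b, page_size)):
        return False
    # Stage 2: refine within the page at the configured granularity.
    return bool(blocks(writes_a, granularity) & blocks(writes_b, granularity))
```

Two tasks writing to 0x1000 and 0x1800 share a page but no 8-byte block, so the refined check reports no conflict. Setting `granularity` to the page size degenerates to pure page-level detection, and the same two writes are again reported as conflicting.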