LSA on C++11 - Time-Based STM for C/C++ Environments

5.2 Time-Based STM for C/C++ Environments

5.2.1 LSA on C++11

The LSA-based STM implementation that I will discuss in what follows is both for C++11 and based on it: First, it provides the guarantees that the C++ TM specification requires for TM runtime libraries (see Section 4.2.4), and can be used to implement the respective ABI. Second, the implementation itself is based on C++11 in that it uses C++11’s memory model and atomic operations to implement the synchronization between transactions.

The high-level requirements on TM runtime libraries are shown in Table 4.2 on page 56. I will just give an overview of these requirements for now, and discuss them in detail after describing the TM algorithm.

The first requirement, L1, essentially states that transactions need to be totally ordered; this order is called the Transaction Synchronization Order (TSO). LSA, and time-based STMs in general, achieve such an ordering through their use of snapshot and commit times from a global time base. In particular, TSO is consistent with the ordering of commit times and snapshot times of transactions; whenever those times are not ordered, the respective transactions do not conflict, and the program cannot observe the missing order either because of the data-race freedom requirement of C++11.

Snapshot times are a tentative TSO choice. Trying to extend a snapshot to the current time from the global time base (i. e., validating that the snapshot has not changed in the meantime) checks whether the transaction can change its position in TSO without this being visible to TM-pure or unsafe code.

The second requirement, L2, states that TSO and the transactional memory accesses must be consistent with happens–before, and that data races must not be introduced. The former essentially requires the TM runtime library to preserve happens–before relationships established in nontransactional code, and I will discuss later how the algorithm ensures that. The latter forces the library to ensure privatization safety and to access only exactly the data that the transaction would access if executed by the C++ abstract machine.⁸

⁷ Note that in a different workload scenario in which transaction aborts are frequent, a write-back approach can be better due to a smaller window of interference between concurrent conflicting transactions.

⁸ For example, when rolling back writes of an aborted transaction, the TM must not undo those writes at a coarser granularity than the original memory accesses because this could overwrite adjacent memory objects that could be concurrently accessed by other threads.

Ownership records (orecs). Unlike in the LSA version that I presented previously, we do not want to maintain multiple versions of each memory object; instead, transactions modify objects in the program’s address space directly. Transactions synchronize with each other using ownership records, which are essentially custom locks carrying a timestamp. Before updating a certain memory location, transactions first acquire the associated orecs; when committing an update, orecs are released and the timestamp is updated so that reader transactions are aware of the update.

An orec is a machine-word–sized data structure whose most-significant bit serves as a lock bit: If it is set, the orec is acquired and the remaining bits identify the transaction that acquired it (e. g., a pointer to the per-thread metadata that the TM runtime library maintains). If the bit is not set, the remaining bits are split up between a timestamp from the global time base (i. e., the commit time of this update, as we will see later) and an incarnation number in a few least-significant bits.

The incarnation number serves a different purpose than the timestamp: It does not represent when the data was committed but can instead be used to decrease runtime overheads of aborted update transactions and transactions conflicting with those aborted transactions. An incarnation number is optional and only useful in write-through designs, but can increase performance in exchange for reserving a few bits.

Putting the lock bit, timestamp bits, and incarnation number bits in this order from most significant to least significant allows for small optimizations in the TM implementation. The value of an orec with a set lock bit will always be larger than any (non-overflowing) snapshot time, so a single comparison can check two conditions at once; likewise, an appropriately shifted value of a snapshot time can discard incarnation numbers in a comparison between a value of an orec and a snapshot time.
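The following is a minimal sketch of such a layout, not the layout of any particular implementation. The 64-bit word size, the three incarnation bits, and all names (orec_t, kIncBits, locked_or_newer, and so on) are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-bit orec layout, most significant bit first:
// [ lock (1 bit) | timestamp | incarnation (kIncBits bits) ]
using orec_t = std::uint64_t;

constexpr unsigned kIncBits = 3;  // assumed width of the incarnation field
constexpr orec_t kLockBit = orec_t{1} << 63;
constexpr orec_t kIncMask = (orec_t{1} << kIncBits) - 1;

// Unlocked orec: commit timestamp above the incarnation bits.
inline orec_t make_unlocked(orec_t time, orec_t inc) {
    assert(inc <= kIncMask);
    assert(time < (kLockBit >> kIncBits));  // timestamp must not reach the lock bit
    return (time << kIncBits) | inc;
}

// Locked orec: lock bit plus an identifier of the owning transaction.
inline orec_t make_locked(orec_t owner_id) {
    assert((owner_id & kLockBit) == 0);
    return kLockBit | owner_id;
}

inline bool is_locked(orec_t o) { return (o & kLockBit) != 0; }
inline orec_t time_of(orec_t o) { return o >> kIncBits; }  // only meaningful if unlocked
inline orec_t inc_of(orec_t o) { return o & kIncMask; }    // only meaningful if unlocked

// One comparison answers "locked, or committed after snapshot time st?".
// Filling the incarnation bits of the shifted snapshot time with ones makes
// the comparison ignore the orec's incarnation number.
inline bool locked_or_newer(orec_t o, orec_t st) {
    return o > ((st << kIncBits) | kIncMask);
}
```

With this encoding, an unlocked orec whose timestamp equals the snapshot time compares as "not newer" regardless of its incarnation number, whereas any locked orec compares as larger than every non-overflowing snapshot time.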

Mapping memory locations to orecs. Orecs are kept separate from the memory locations that transactions access. Thus, when a transaction accesses memory at a certain address, it must map from this address to the orec associated with this memory location. All transactions must obviously use the same mapping function because they synchronize with other transactions based on the orecs; using different mappings would prevent proper synchronization.

There are no fundamental constraints on the number of orecs and the nature of the mapping, but there are practical constraints. Calculating which orec an address maps to is on the fast path of transactional loads and stores with the current ABI, so complex functions are likely to lead to runtime overheads. Likewise, using a larger number of orecs can make false conflicts between transactions less likely (because fewer memory locations map to the same orec) but can also result in higher memory requirements and cache footprint.

These concerns motivate the use of simple hash functions, for example splitting the address space into equally-sized stripes and mapping those to entries in an array of orecs. I will discuss these trade-offs further in Section 5.2.2. For what follows, it is sufficient to assume some deterministic mapping function that is used by all transactions (denoted hash in the algorithms).
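A stripe-based mapping of this kind can be sketched as follows. The stripe size of 16 bytes, the table size of 2^16 orecs, and the name orec_index are assumptions chosen for illustration; the text above only requires that the function be deterministic and shared by all transactions.

```cpp
#include <cstddef>
#include <cstdint>

constexpr unsigned kStripeShift = 4;                     // assumed: 16-byte stripes
constexpr std::size_t kNumOrecs = std::size_t{1} << 16;  // assumed: power of two

static_assert((kNumOrecs & (kNumOrecs - 1)) == 0,
              "table size must be a power of two");

// Maps an address to an index into the orec array; plays the role of "hash".
// All addresses within one stripe share an orec, and stripes wrap around the
// table, so unrelated addresses may collide (false conflicts).
inline std::size_t orec_index(const void* addr) {
    const auto a = reinterpret_cast<std::uintptr_t>(addr);
    return static_cast<std::size_t>(a >> kStripeShift) & (kNumOrecs - 1);
}
```

A shift and a mask keep the mapping cheap on the fast path; a larger table or smaller stripes would reduce false conflicts at the cost of memory and cache footprint.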

Algorithm 3 Lazy Snapshot Algorithm (C++11-based)

 1: Global state:
 2:   clock ← 0   ▷ Global time base (shared integer)
 3:   orecs: array of word-sized ownership records, each consisting of:
 4:     locked: bit indicating if orec is locked
 5:     owner: thread owning the orec (if locked)
 6:     time: commit timestamp (if ¬locked)
 7:     inc: incarnation number (if ¬locked)

 8: State of thread p:
 9:   st: snapshot time (only upper bound)
10:   r-set: read set of tuples ⟨addr, time⟩
11:   w-set: write set of tuples ⟨orec-index, orec-value⟩
12:   undolog: undo-logging data (sequence of tuples ⟨addr, val⟩)

13: stm-start()_p:
14:   st ←_acq clock
15:   r-set ← w-set ← undolog ← ∅

16: stm-load(addr)_p:
17:   orec ←_acq orecs[hash(addr)]
18:   if orec.locked then
19:     if orec.owner ≠ p then
20:       abort()   ▷ Orec owned by other thread
21:     return *addr   ▷ We own the orec; just read through
22:   if orec.time > st then   ▷ Need to extend snapshot?
23:     extend()   ▷ Aborts if validation fails
24:   val ←_acq *addr
25:   if orecs[hash(addr)] ≠ orec then   ▷ Load again and compare with previous load
26:     abort()   ▷ Data at addr was perhaps modified concurrently
27:   r-set ← r-set ∪ {⟨addr, orec.time⟩}   ▷ Add to read set
28:   return val

29: stm-store(addr, val)_p:
30:   orec ← orecs[hash(addr)]
31:   if orec.locked then
32:     if orec.owner ≠ p then
33:       abort()   ▷ Orec owned by other thread
34:   else
35:     if orec.time > st then   ▷ We may have read from addr before, so...
36:       extend()   ▷ ...abort if validation should fail
37:     if ¬cas_acq(orecs[hash(addr)] : orec → ⟨true, p⟩) then   ▷ Try to acquire orec
38:       abort()
39:     fence_rel   ▷ Memory barrier with release memory order
40:     w-set ← w-set ∪ {⟨hash(addr), orec⟩}
41:   undolog.push(⟨addr, *addr⟩)   ▷ Log previous value of *addr
42:   *addr ← val   ▷ Write through to memory

43: stm-commit()_p:
44:   if w-set ≠ ∅ then   ▷ Nothing to do if read-only transaction
45:     ct ← atomic-inc-and-fetch_acqrel(clock)   ▷ Unique commit time (atomic increment)
46:     if st < ct − 1 then   ▷ Must validate if others committed in the meantime
47:       extend()   ▷ Aborts if validation fails
48:     for all ⟨orec, orecval⟩ ∈ w-set do
49:       orecs[orec] ←_rel ⟨false, ct, 0⟩   ▷ Release orecs

50: extend()_p:
51:   st ←_acq clock
52:   for all ⟨addr, time⟩ ∈ r-set do   ▷ Are orecs free and timestamps unchanged?
53:     orec ← orecs[hash(addr)]
54:     if (orec.locked ∧ orec.owner ≠ p) ∨ (¬orec.locked ∧ orec.time ≠ time) then
55:       abort()

56: abort()_p:
57:   undolog.rollback()   ▷ Undo previous writes in reverse order
58:   ct ← 0
59:   for all ⟨orec, orecval⟩ ∈ w-set do
60:     if incarnation-left(orecval.inc) then   ▷ No incarnation number overflow?
61:       orecs[orec] ←_rel ⟨false, orecval.time, orecval.inc + 1⟩   ▷ Release orec (new incarnation)
62:     else
63:       if ct = 0 then   ▷ Acquire new “commit” timestamp
64:         ct ← atomic-inc-and-fetch_rel(clock)
65:       orecs[orec] ←_rel ⟨false, ct, 0⟩   ▷ Release orec (new timestamp)

Description of the algorithm. Algorithm 3 shows the C++11-based version of LSA. Even though I still show it in terms of pseudo-code, it is based on the memory model of C++11. Unlike for Algorithm 1, functions are not assumed to be atomic anymore. However, all individual memory accesses to global state (including application data) are assumed to be atomic and with relaxed memory order as default. Atomic operations that require stronger memory orders are annotated with the respective order (see Table 2.1 on page 12).

I only show load, store, start, and commit functions in Algorithm 3, but the other functions that are part of the TM runtime library ABI are either straightforward to implement, or are load or store variations. Also, I will focus on the differences to Algorithm 1 in the following description, and will discuss why certain memory orders and barriers are required afterwards; for now, it is sufficient to assume that all atomic operations are sequentially consistent.

The stm-start function is similar to Algorithm 1, except that the snapshot is now characterized by just a single value (and not an interval), which is initially set to the value of the global time base when the transaction starts (line 14); we do not keep multiple versions for memory objects, so we really are only interested in the upper bound of the interval.

Transactional stores first map the target address to an orec, and then load the value of this orec (line 30). If the orec is locked by some other transaction, we abort the transaction. If the orec is not locked, we have to acquire it before we can write through to memory (line 37); using CAS makes sure that lines 30–37 are atomic with respect to other modifications of the orec.⁹ However, if the transaction has read from memory mapped to the same orec before, then we need to make sure that no other value has been committed in the meantime; given that update transactions need to extend the snapshot at commit anyway, we can also try to extend the snapshot right away (line 35). Note that unlike in Algorithm 1, unsuccessful snapshot extensions abort the transaction. After successful orec acquisition, we issue a release memory barrier¹⁰ and add the orec to the write set. Finally, we perform undo logging and write through to memory (lines 41–42).
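The orec acquisition part of a transactional store can be sketched with C++11 atomics as shown next. This is a simplified illustration under stated assumptions: orecs are modeled as 64-bit words with the lock bit in the most significant position, the owner is a small integer id, the snapshot extension of lines 35–36 is left out, and the names Acquire and try_acquire are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using orec_t = std::uint64_t;
constexpr orec_t kLockBit = orec_t{1} << 63;  // assumed position of the lock bit

enum class Acquire { Conflict, AlreadyOwned, Acquired };

// Tries to make transaction `self` the owner of `orec`. On Acquired, `prev`
// holds the unlocked value the orec had before, which would be remembered in
// the write set for commit and abort.
inline Acquire try_acquire(std::atomic<orec_t>& orec, orec_t self, orec_t& prev) {
    assert((self & kLockBit) == 0);
    orec_t seen = orec.load(std::memory_order_relaxed);   // line 30
    if (seen & kLockBit)                                   // lines 31-33
        return seen == (kLockBit | self) ? Acquire::AlreadyOwned
                                         : Acquire::Conflict;
    // Lines 35-36 (extend the snapshot if the orec is newer than it) omitted.
    orec_t expected = seen;
    if (!orec.compare_exchange_strong(expected, kLockBit | self,
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed))
        return Acquire::Conflict;                          // line 38
    // Order the acquisition before the following relaxed data stores.
    std::atomic_thread_fence(std::memory_order_release);   // line 39
    prev = seen;                                           // line 40
    return Acquire::Acquired;
}
```

A caller would abort on Conflict and otherwise proceed to undo logging and the write-through store; only the Acquired case adds an entry to the write set.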

When a transaction commits, it follows essentially the same steps as in Algorithm 1. However, we release the orecs that we have acquired (lines 48–49), instead of making the most recent memory object versions accessible (we have already written updates through to memory in stm-store). When releasing an orec, we set its timestamp to the transaction’s commit time, making it accessible only to transactions with a sufficiently large snapshot time. Also, the incarnation number is set to zero because we use a new timestamp. Note that even though the stm-commit function is not atomic, it provides the essential ordering (see Section 5.1.1) between updates being inaccessible (i. e., orecs locked), acquisition of a new commit time (line 45), snapshot time extension if necessary (line 47), and making updates accessible again by releasing the orecs.

⁹ See the stm-commit and abort functions for how those prevent ABA issues.

¹⁰ Instead of the barrier, we could also require release memory order for all the stores to the application data, but this is likely to be more expensive (e. g., if there is more than one write per orec).
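The check in line 46 can be made concrete with a small sketch. Because the commit time is obtained by an atomic increment, ct − 1 is the value of the time base immediately before this transaction's increment; if that equals the snapshot time, no other update transaction has acquired a commit time since the snapshot was taken. The names CommitTime and acquire_commit_time are hypothetical, and the time base is passed in explicitly for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct CommitTime {
    std::uint64_t ct;    // unique commit time of this transaction
    bool must_validate;  // true if other commits may have intervened
};

// Acquires a commit time from the shared time base (line 45) and evaluates
// the condition of line 46 for a transaction with snapshot time `st`.
inline CommitTime acquire_commit_time(std::atomic<std::uint64_t>& time_base,
                                      std::uint64_t st) {
    // fetch_add returns the old value, so add one to get inc-and-fetch.
    const std::uint64_t ct =
        time_base.fetch_add(1, std::memory_order_acq_rel) + 1;
    assert(st <= ct - 1);  // a snapshot time never exceeds the time base
    return CommitTime{ct, st < ct - 1};
}
```

For example, with the time base at 7 and a snapshot time of 7, the first committer obtains ct = 8 and can skip validation, whereas a second transaction with the same snapshot time obtains ct = 9 and has to extend its snapshot.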

Aborting a transaction is slightly more complex because of incarnation numbers, but essentially we just have to use the undolog to roll back previous updates to memory (line 57) and then release the orecs. For the latter, we need to notify reading transactions about potentially inconsistent reads of data by making sure that the orec’s value after being released differs from the value it had before we acquired it. The first way to achieve this is to acquire a new timestamp from the global time base (line 64) and then release the orec as we would when committing the transaction (line 65). To other transactions, this kind of rollback will look like just another write-only transaction that committed but did not change any data; this is safe because we have acquired all the orecs for the data we updated. However, this requires accessing the global time base and potentially acquiring a new commit time from it, which can increase contention on the time base and thus reduce performance. Alternatively, we can keep the timestamp of the orec unchanged but instead increment the incarnation number (line 61) as long as it does not overflow. This will also let readers detect potentially dirty reads, as I will explain next.
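The two ways of releasing an orec on abort can be sketched as follows, assuming that the undo log has already been rolled back. The 64-bit layout with three incarnation bits and the name release_on_abort are assumptions for illustration; ct is shared across all orecs of the aborting transaction so that at most one new timestamp is acquired.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using orec_t = std::uint64_t;
constexpr unsigned kIncBits = 3;  // assumed width of the incarnation field
constexpr orec_t kLockBit = orec_t{1} << 63;
constexpr orec_t kIncMask = (orec_t{1} << kIncBits) - 1;

// Releases `orec`, which the aborting transaction holds locked. `prev` is
// the unlocked value saved in the write set at acquisition time. `ct` must
// start at 0 and is reused for all orecs released by the same abort.
// Returns the value that was stored.
inline orec_t release_on_abort(std::atomic<orec_t>& orec, orec_t prev,
                               std::atomic<orec_t>& time_base, orec_t& ct) {
    assert((prev & kLockBit) == 0);
    orec_t next;
    if ((prev & kIncMask) != kIncMask) {
        next = prev + 1;  // same timestamp, next incarnation (line 61)
    } else {
        if (ct == 0)      // acquire a new "commit" timestamp once (lines 63-64)
            ct = time_base.fetch_add(1, std::memory_order_release) + 1;
        next = ct << kIncBits;  // new timestamp, incarnation 0 (line 65)
    }
    orec.store(next, std::memory_order_release);
    return next;
}
```

In both branches the released value differs from prev, which is what allows a concurrent reader to notice that it may have read uncommitted data.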

Transactional loads are somewhat different from those in Algorithm 1 because we cannot assume that stm-load executes atomically. We first map the target address to an orec, and then load the value of this orec (line 17). If the orec is locked, then we either abort if it is locked by some other transaction or we have a read–after–write situation and can thus just read the data (line 21). If the orec is not locked, then we must try to extend the snapshot if our snapshot time is not recent enough to form an atomic snapshot (line 23); we do not have multiple object versions available as with Algorithm 1, so we must read the data that has been committed most recently. Next, we can read the data from the target address (line 24). Note that this read can be pending in the sense that it is not atomic with the previous load of the orec (and thus the checks of the orec’s value); the orec can change in the meantime. Ensuring privatization safety, which I will discuss below, also ensures that such pending reads are harmless.

We can make reading the orec and the data effectively atomic by reading the orec’s value again after reading the data and aborting if the orec’s value has changed since the first read (line 25). The value could have changed if either a new update has been committed in the meantime (i. e., orec.time changed) or if we potentially read uncommitted data (i. e., orec.inc changed or the orec is now locked). When transactions change data, they always change some part of the orec’s value (lines 37, 49, 61, and 65), and in a way that avoids any ABA issues. Thus, reading the orec’s value twice is like validating the single data load, and will allow us to detect any inconsistencies; checking atomicity of the whole snapshot of the transaction is still based on time-based validation.
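The double read of the orec around the data load can be sketched as below. This is a simplification under several assumptions: the application data is modeled as a std::atomic word so that the sketch stays free of data races, a locked orec is always treated as a conflict (the read-after-write case of line 21 is left out), the snapshot extension of lines 22–23 is omitted, and the name load_validated is hypothetical.

```cpp
#include <atomic>
#include <cstdint>

using orec_t = std::uint64_t;
constexpr orec_t kLockBit = orec_t{1} << 63;  // assumed position of the lock bit

// Reads `data` and reports whether the read was consistent with respect to
// `orec`. On success, `out` holds the value and `seen` the orec value that
// guarded it (its timestamp would go into the read set).
inline bool load_validated(const std::atomic<orec_t>& orec,
                           const std::atomic<std::uint64_t>& data,
                           std::uint64_t& out, orec_t& seen) {
    const orec_t before = orec.load(std::memory_order_acquire);      // line 17
    if (before & kLockBit)
        return false;  // simplified: any lock counts as a conflict
    const std::uint64_t val = data.load(std::memory_order_acquire);  // line 24
    const orec_t after = orec.load(std::memory_order_relaxed);       // line 25
    if (after != before)
        return false;  // new commit, new incarnation, or now locked (line 26)
    out = val;
    seen = before;
    return true;
}
```

The pattern resembles a sequence lock: the data value is only used if the guarding word is observed unchanged on both sides of the data load.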

Snapshot extensions validate similarly to per-load validation. We first read the current value of the global time base (line 51), which becomes our new snapshot time if the snapshot extension succeeds. After this, we check that all orecs in the read set are either locked by us, or not locked and their timestamp has not changed (line 54). Note that changes just to the incarnation numbers are fine; we had consistent reads for each orec initially, and different incarnations (with the orec not locked) are just concurrent aborted transactions that have been already rolled back. Besides reduced contention on the global time base, this is another advantage of using incarnation numbers: They can reduce the number of extensions and aborts caused by concurrent yet aborted transactions. If validation of any orec in the snapshot fails, the transaction aborts.
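A snapshot extension along these lines can be sketched as follows. For brevity, the read set here stores pointers to orecs rather than addresses (so no hash lookup is needed), the function returns false instead of aborting, and the layout constants and the names ReadEntry and extend_snapshot are assumptions made for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

using orec_t = std::uint64_t;
constexpr unsigned kIncBits = 3;  // assumed width of the incarnation field
constexpr orec_t kLockBit = orec_t{1} << 63;

struct ReadEntry {
    const std::atomic<orec_t>* orec;  // orec guarding the location read
    orec_t time;                      // timestamp observed at the first read
};

// Tries to move the snapshot time `st` of transaction `self` forward to the
// current value of the time base. Returns false if validation fails.
inline bool extend_snapshot(const std::atomic<orec_t>& time_base,
                            const std::vector<ReadEntry>& rset,
                            orec_t self, orec_t& st) {
    const orec_t now = time_base.load(std::memory_order_acquire);   // line 51
    for (const ReadEntry& e : rset) {
        const orec_t o = e.orec->load(std::memory_order_relaxed);    // line 53
        if (o & kLockBit) {
            if (o != (kLockBit | self))
                return false;  // locked by another transaction
        } else if ((o >> kIncBits) != e.time) {
            return false;      // a newer update has been committed
        }
        // A changed incarnation number alone is deliberately ignored.
    }
    st = now;
    return true;
}
```

Shifting out the incarnation bits before the comparison is what makes rolled-back concurrent transactions invisible to validation.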

For update transactions that have also read data, a successful snapshot extension thus also checks that the snapshot is still valid after the commit time has been acquired.

Also, some smaller optimizations are not shown in Algorithm 3. For example, instead of aborting immediately, transactions can also spin for a while if an orec is already acquired, in the hope that the other transaction might commit or abort soon.

Privatization safety. Privatization refers to a transaction making some data inaccessible to other threads, and the TM runtime library has to ensure that this can be safely done (see Section 4.2.4 for details). The privatizing transaction thus must be an update transaction; it has committed and thus fixed its position in TSO, but this does not immediately make other transactions aware of this. In particular, other transactions are not aware of the privatization iff their snapshot time is less than the privatizing transaction’s commit time.

If privatization is not safe, there are a few things that can go wrong. First, transactions can use private and thus potentially inconsistent data or can write to private data. This will lead to either transactions or nontransactional code operating on inconsistent data, which in turn can lead to incorrect behavior of the program. Such behavior cannot be easily contained, especially in a typical C++ implementation. STM algorithms such as NOrec (see Section 7.3.2) that rely on a centralized commit phase implicitly prevent this first kind of incorrect behavior but at the expense of less scalability.

Second, STMs that use invisible reads will have pending loads (see the previous discussion of stm-load), which can target privatized data. This is also the case for the NOrec STM (see Sections 7.3.2 and 5.2.2), for example, which will never return the value of such a load to the transaction but will read privatized data. While the data race that the load causes is benign on typical architectures, accessing data for which the privatizing thread has changed the memory protection properties (e. g., by releasing memory or re-mapping the respective memory page as read-only) is not benign and can lead to memory protection faults.

Some architectures, such as SPARC, provide hardware instructions for non-faulting loads; using those for data loads together with an STM like NOrec can avoid the privatization problems. However, non-faulting loads are not available on many other common architectures such as x86.

Another way to make the pending loads harmless would be to try to mask protection faults caused by transactions. However, this either requires custom support in the operating system, or custom signal handlers and enforcing that the TM runtime library’s signal handlers are always the first to be called for a memory protection fault, which in turn likely requires custom standard library support.

We can also ensure privatization safety purely at the level of the TM runtime library by letting potentially privatizing transactions never return to nontransactional code until all other concurrent transactions are aware of the privatizing transaction’s commit. Informally, privatizers thus wait for quiescence of older snapshots.

To achieve that, all transactions publish their snapshot time regularly in per-thread variables. They have to do so at least when starting a transaction and after committing or aborting, marking the transaction as inactive in the latter two cases. They can also update their published snapshot time after successful snapshot extensions. Update transactions¹¹ then have to wait until all concurrent transactions either have become inactive or have a snapshot time that is equal to or larger than their own commit time. This ensures that there is global consensus on TSO (up to the update transaction’s commit time) before it is exposed to nontransactional code (e. g., code changing the privatized data). Algorithm 3 does not show this privatization safety implementation, but it is straightforward to build using atomic operations and release and acquire memory orders.
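One possible shape of such a quiescence-based scheme is sketched below. It is only an illustration under assumptions that the text does not prescribe: a fixed maximum number of threads, a reserved value marking inactive threads, busy-waiting with yield, and the names Slot, publish, retire, quiescent, and wait_for_quiescence. It also does not address how publishing a snapshot time has to be ordered relative to reading the time base when a transaction starts.

```cpp
#include <atomic>
#include <cstdint>
#include <limits>
#include <thread>

constexpr int kMaxThreads = 8;  // assumed fixed upper bound on threads
constexpr std::uint64_t kInactive =
    std::numeric_limits<std::uint64_t>::max();  // marks "no active transaction"

struct Slot {
    std::atomic<std::uint64_t> st{kInactive};  // published snapshot time
};
Slot published[kMaxThreads];

// Called at transaction start and after successful snapshot extensions.
inline void publish(int tid, std::uint64_t st) {
    published[tid].st.store(st, std::memory_order_release);
}

// Called after commit or abort.
inline void retire(int tid) { publish(tid, kInactive); }

// True if every other thread is inactive or has a snapshot time >= ct.
inline bool quiescent(int self, std::uint64_t ct) {
    for (int t = 0; t < kMaxThreads; ++t) {
        if (t == self) continue;
        if (published[t].st.load(std::memory_order_acquire) < ct)
            return false;
    }
    return true;
}

// Called by an update transaction after releasing its orecs and before
// returning to nontransactional code.
inline void wait_for_quiescence(int self, std::uint64_t ct) {
    while (!quiescent(self, ct))
        std::this_thread::yield();
}
```

Using the largest representable value for inactive threads lets a single comparison cover both cases named above, since an inactive thread then always appears to have a sufficiently recent snapshot.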

The disadvantage of this approach is that it introduces a delay before update transactions can return to nontransactional code, either due to having to wait
