
5.4 Evaluation

5.4.1 Performance Evaluation

We perform cycle-based simulation utilizing the Multi2Sim 4.2 simulation framework [57]. All transactional data is stored in local memory.

Chapter 5. Transactional Memory in GPU Local Memory

Speedup. Four TM versions of each benchmark are executed on our simulator, each one using a different conflict detection mechanism: DCD, SMDCD, DCD+pRWsig and DCD+pRWsig+sWOsig. Figure 5.6 presents the speedup achieved by those versions relative to the version that serializes the critical sections.

[Figure 5.6 plot. Two bar-chart panels: HT, IT, VA and GC (y-axis 0 to 120) and GA, KM, DB and QU (y-axis 0 to 10), each in HC and LC variants. Y-axis: speedup w.r.t. TX serialization. Series: pRWsig, sWOsig, DCD, SMDCD, FGL. Bar annotations: 31, 31, 40, 40.]

Figure 5.6: Speedup of TM and FGL benchmark versions with respect to the serialized version (pRWsig is DCD+pRWsig; sWOsig is DCD+pRWsig+sWOsig).

Our first observation is that, for all benchmarks, TM performs similarly to, or better than, the serial version, except for the highly contended versions of QU and DB. QU has a very high conflict probability and only benefits from the lightweight, early conflict detection mechanism that DCD+pRWsig provides. In DB, both versions using signatures improve performance over the serial code.

Regarding atomics, the FGL versions of HT, IT, DB and QU, which are highly optimized, outperform the TM versions. However, the transactional versions require neither the declaration of additional data structures nor the management of atomic operations by the programmer. On the other hand, our simulations show that TM outperforms FGL for VA and GA. The reason is that the use of atomics results in significant overhead for these algorithms due to lock acquisition and release and the mechanisms needed to avoid deadlocks. In addition, FGL requires much greater programming effort than TM.

TM and FGL do not scale well for KM and GA. In GA, the SIMT execution of long transactions hurts performance, since work-items that finish the transaction have to wait for the complete wavefront. In KM, the critical section is small compared to the rest of the code, so the advantages of using TM cannot amortize the associated overhead. The main source of overhead is a significant number of memory accesses that suffer from high contention, as multiple work-items try to access the same clusters. This situation results in the serialization of lock acquisition/release operations in FGL, and a high number of retries in TM, which account for more than half of the execution time.

The use of both types of signatures (pRWsig and sWOsig) benefits benchmarks such as HT and GC, where many read-only transactions avoid aborting unnecessarily.

Execution breakdown. Figure 5.7 shows the execution breakdown for all TM and FGL versions of the benchmarks. Our simulations show that most of the overhead occurs during memory operations, as many cycles are spent during conflict detection. However, the use of signatures avoids extra accesses to local memory. For this reason, lower memory overhead is observed in the pRWsig and sWOsig TM versions, as compared to those based only on DCD and SMDCD. The VA benchmark experiences a lower rate of conflicts in both scenarios, and due to its larger critical section, the overhead of transaction management is negligible.

[Figure 5.7 plot. Bar chart normalized from 0.0 to 1.0, with columns 1 to 4 for each benchmark (HT, IT, VA, GC, GA, KM, DB and QU, each in HC and LC variants). Y-axis: execution breakdown. Components: TXBegin, TXCommit, Mem. Overheads, TX Code, Non-TX Code.]

Figure 5.7: Normalized execution breakdown. TXBegin, TXCommit, and Mem. Overheads represent the overheads introduced by the TM system. TX Code represents code executed within a transaction, while Non-TX Code represents code executed outside transactions. Columns 1 to 4 stand for pRWsig, sWOsig, DCD and SMDCD TM versions, respectively.

Transactional instructions. Figure 5.8 shows the percentage of instructions executed within transactions over the total number of instructions executed for each workload. From our simulations we find that HT-LC and IT-LC execute less than 30% of their instructions transactionally, as transactions are short and the probability of conflict is small. KM also has a small fraction of its code running within a transaction. GC and HT-HC reduce their fraction of transactional instructions for sWOsig versus pRWsig. This behaviour results from the fast conflict detection provided by shared signatures and the ability to complete read-only transactions without conflicts.

[Figure 5.8 plot. Bar chart for each benchmark (HT, IT, VA, GC, GA, KM, DB and QU, each in HC and LC variants). Y-axis: normalized instructions in TX, from 0.0 to 1.0. Series: pRWsig, sWOsig, DCD, SMDCD.]

Figure 5.8: Instructions executed within transactions normalized to the total number of instructions executed.

Commit ratio. Figure 5.9 shows the ratio of transactions committed over the number of transactions started for each workload. As work-items within a wavefront execute in lockstep, the commit ratio is calculated per wavefront, as this metric is more directly related to workload throughput. In general, the TM versions based on DCD and SMDCD result in a higher commit ratio, as they are not affected by false positives introduced by the signatures. In some applications, such as DB, HT-HC and GC, the use of signatures improves the commit ratio. Comparing these results with the speedup (see Figure 5.6), we can observe some correlation. The reason is that these applications benefit from the layout of the signatures, and fewer transactions experience re-executions.
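As a minimal sketch of the metric described above, the following Python snippet computes a commit ratio from wavefront-level events. The trace format (one `(wavefront_id, kind)` pair per wavefront-level begin or commit) and the function name `commit_ratio` are assumptions made for illustration; they are not the simulator's actual output format.

```python
def commit_ratio(events):
    """Return committed / started for wavefront-level transaction events.

    `events` is a hypothetical trace: an iterable of (wavefront_id, kind)
    pairs, where kind is "begin" or "commit". Each pair is assumed to be
    logged once per wavefront, not once per work-item.
    """
    events = list(events)  # allow generators, since we scan the trace twice
    started = sum(1 for _, kind in events if kind == "begin")
    committed = sum(1 for _, kind in events if kind == "commit")
    if started == 0:
        return 0.0
    return committed / started


# Wavefront 0 commits on its first attempt; wavefront 1 needs three attempts.
trace = [
    (0, "begin"), (0, "commit"),
    (1, "begin"), (1, "begin"), (1, "begin"), (1, "commit"),
]
ratio = commit_ratio(trace)  # 2 commits over 4 started attempts
```

Every retry counts as a newly started transaction in this sketch, so a workload that re-executes often shows a low ratio even if all of its transactions eventually commit.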

False positives. Signatures may return false positives (which are considered conflicts) if the bit to be checked from the signature in the current memory access coincides with the bit set by a previous access to a different memory position (i.e., a signature alias). Figure 5.10 shows the ratio of false positives that occur in the TM versions based on signatures, with respect to the total number of positives. In many scenarios, this ratio is high due to the small size of the signatures. In most cases, false positives result from read-read conflicts that can be filtered out when using shared signatures (sWOsig). DB, however, does not benefit from the use of shared signatures because they saturate.
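The aliasing effect described above can be illustrated with a small Python sketch. The signature size, the hash (word index modulo the number of bits) and the class name `Signature` are assumptions chosen for illustration; this section does not specify the hash or layout that GPU-LocalTM actually uses.

```python
class Signature:
    """A fixed-size bit-vector signature (hypothetical layout)."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = 0
        # Exact address record, kept only so the sketch can tell true
        # positives from false ones; a real signature has no such record.
        self.addresses = set()

    def _index(self, address):
        # Assumed hash: 4-byte word index modulo the signature size.
        return (address // 4) % self.num_bits

    def insert(self, address):
        self.bits |= 1 << self._index(address)
        self.addresses.add(address)

    def test(self, address):
        """Return (positive, false_positive) for a membership check."""
        positive = bool((self.bits >> self._index(address)) & 1)
        false_positive = positive and address not in self.addresses
        return positive, false_positive


sig = Signature(num_bits=8)
sig.insert(0x10)            # word 4 sets bit 4

hit = sig.test(0x10)        # same address: a true positive
alias = sig.test(0x30)      # word 12 also maps to bit 4: a false positive
miss = sig.test(0x14)       # word 5 maps to bit 5, which is clear
```

With only 8 bits, any two addresses whose word indices differ by a multiple of 8 alias to the same bit, which is consistent with the observation that small signatures produce a high false positive ratio.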

[Figure 5.9 plot. Two bar-chart panels: HT, IT, VA and GC (y-axis 0.0 to 1.0) and GA, KM, DB and QU (y-axis 0.00 to 0.04), each in HC and LC variants. Y-axis: commit ratio. Series: pRWsig, sWOsig, DCD, SMDCD.]

Figure 5.9: Commit ratio.

[Figure 5.10 plot. Two bar-chart panels: HT, IT, VA and GC (y-axis 0.0 to 1.0) and GA, KM, DB and QU (y-axis 0.0 to 0.8), each in HC and LC variants. Y-axis: false positive rate. Series: pRWsig, sWOsig.]

Figure 5.10: False positives.

Forward progress. Figure 5.11 shows the percentage of transactions that execute in transactional, wavefront serialization and work-group serialization modes. In most cases, many transactions (up to 90% in HT-HC) resort to serialization mode (especially, WfS) to assure forward progress. The reason is that, as work-items within a wavefront execute in SIMT fashion, most of the conflicts remain after a transaction retry. We analyzed that scenario (HT-HC), and on average, 48 out of the 64 work-items belonging to the same wavefront conflict when the serialization mode is required.

[Figure 5.11 plot. Bar chart normalized from 0.0 to 1.0, with columns 1 to 4 for each benchmark (HT, IT, VA, GC, GA, KM, DB and QU, each in HC and LC variants). Y-axis: type of transaction. Components: TX-WgS, TX-WfS, TX.]

Figure 5.11: Transactional and serialization execution modes. Columns 1 to 4 stand for pRWsig, sWOsig, DCD and SMDCD TM versions, respectively.

Discussion. As GPU-LocalTM is configurable, this simulation-based evaluation can serve as a guide to programmers or a hint to the compiler to select the most suitable conflict detection algorithm, or to predict the performance when storage resources are not available. The mechanism that exhibits the highest memory overhead is DCD+pRWsig+sWOsig, as it uses vector registers, scalar registers and shadow memory. This method works well for applications with many read-only transactions, such as HT and GC, as conflicts can be detected quickly with the use of signatures and read-only transactions do not conflict. DCD+pRWsig does not use shared signatures, reducing pressure on the scalar register file. This method is more effective for applications that perform read-modify-write operations. QU, IT, VA and DB are examples that perform similarly (or better) when using only pRWsig. Since DCD and SMDCD do not use vector registers, they are well suited for applications that require a large number of those registers. The effectiveness of these methods is limited to applications exhibiting a rather random access pattern (such as GA), where the false positive rate can harm performance if signatures are used (as in HT-LC, IT-LC, and KM).

Table 5.8 summarizes the main transactional features of the benchmarks and the configuration of GPU-LocalTM to obtain best performance according to the evaluation.

Bench.  Features                                     Best performing GPU-LocalTM configuration
HT      Short transactions; read-only transactions   sWOsig (HT-HC), DCD (HT-LC)
IT      Short transactions; read-modify-write        pRWsig (IT-HC), sWOsig (IT-LC)
VA      Long transactions; few conflicts             Any
GC      Read-only transactions                       sWOsig
GA      Long transactions; read-modify-write         DCD
KM      Long transactions; multiple accesses         DCD (KM-HC), sWOsig (KM-LC)
DB      Short transactions; multiple accesses        pRWsig, sWOsig (DB-HC); sWOsig (DB-LC)
QU      Short transactions; many conflicts           pRWsig

Table 5.8: Workload features and the best performing GPU-LocalTM version.
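For a programmer or tool that wants to use Table 5.8 as a selection hint, the table can be encoded as a simple lookup. The sketch below is only an illustration: the names `BEST_CONFIG` and `best_config` are hypothetical and not part of GPU-LocalTM, and the entries merely restate the table (where pRWsig denotes DCD+pRWsig and sWOsig denotes DCD+pRWsig+sWOsig).

```python
ALL_CONFIGS = ("pRWsig", "sWOsig", "DCD", "SMDCD")

# (benchmark, contention) -> best performing configurations per Table 5.8.
# "Any" in the table is expanded to all four TM versions.
BEST_CONFIG = {
    ("HT", "HC"): ("sWOsig",),
    ("HT", "LC"): ("DCD",),
    ("IT", "HC"): ("pRWsig",),
    ("IT", "LC"): ("sWOsig",),
    ("VA", "HC"): ALL_CONFIGS,
    ("VA", "LC"): ALL_CONFIGS,
    ("GC", "HC"): ("sWOsig",),
    ("GC", "LC"): ("sWOsig",),
    ("GA", "HC"): ("DCD",),
    ("GA", "LC"): ("DCD",),
    ("KM", "HC"): ("DCD",),
    ("KM", "LC"): ("sWOsig",),
    ("DB", "HC"): ("pRWsig", "sWOsig"),
    ("DB", "LC"): ("sWOsig",),
    ("QU", "HC"): ("pRWsig",),
    ("QU", "LC"): ("pRWsig",),
}


def best_config(benchmark, contention):
    """Return the tuple of best performing configurations, or raise KeyError."""
    return BEST_CONFIG[(benchmark.upper(), contention.upper())]


ht_low = best_config("HT", "LC")
db_high = best_config("db", "hc")
```

A lookup like this only covers the evaluated benchmarks; for a new workload, the Features column (transaction length, read-only versus read-modify-write behaviour, conflict rate) is the part of the table that carries over.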