3.4 Concurrent execution of TM applications on heterogeneous CPUs
3.5.3 TM on heterogeneous CPUs: conclusions and future work
In this chapter of the thesis we explore the behavior of an existing STM library when executed on a big.LITTLE platform. In particular, we test a set of applications from the STAMP benchmark suite linked with the STM library TinySTM. These applications are analyzed on each cluster separately, on both clusters at the same time, and with a different application running concurrently on each cluster. Furthermore, we have designed ScHeTM, a scheduler for heterogeneous CPUs that makes its scheduling decisions taking into consideration a TM-related metric (aborted transactions).
The main conclusions and contributions of this chapter are the following:
• For the set of applications analyzed, the little cluster presents better scalability and lower energy consumption than the big cluster.
• Applications with higher computing demands still benefit from the power offered by the big cluster.
• For this set of applications, if an application performs better on cluster A when executed in isolation, it still performs better (although with degraded performance) on that cluster when another application is executing concurrently on cluster B.
• We propose a simple scheduling model that allows selecting which parameter to optimize: execution time, energy consumption, or aborted transactions. A proper configuration of these parameters speeds up execution by up to 40% and reduces energy consumption by up to 15%.
• We observe that optimizing a well-differentiated parameter (execution time, in our case) improves the overall behavior of the application.
• The scheduling scheme is not able to improve on a greedy scheduler if the applications always behave better on one of the clusters, or if the clusters behave similarly for the selected parameter.
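The idea of selecting which parameter to optimize, as described in the list above, can be illustrated with a minimal sketch. It is not the ScHeTM model itself: the weight names, the `Profile` layout, and the numbers are assumptions made purely for illustration. Each cluster receives a cost equal to the weighted sum of its normalized execution time, energy consumption, and aborted transactions, and the cluster with the lowest cost is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical per-cluster measurements for one application."""
    time_s: float      # execution time in seconds
    energy_j: float    # energy consumption in joules
    aborts: int        # aborted transactions


def pick_cluster(profiles, w_time=1.0, w_energy=0.0, w_aborts=0.0):
    """Return the name of the cluster with the lowest weighted cost.

    Each metric is divided by its maximum over all clusters so that the
    weights combine dimensionless quantities.
    """
    def norm(value, values):
        top = max(values)
        return value / top if top > 0 else 0.0

    times = [p.time_s for p in profiles.values()]
    energies = [p.energy_j for p in profiles.values()]
    aborts = [p.aborts for p in profiles.values()]

    def cost(p):
        return (w_time * norm(p.time_s, times)
                + w_energy * norm(p.energy_j, energies)
                + w_aborts * norm(p.aborts, aborts))

    return min(profiles, key=lambda name: cost(profiles[name]))


# Illustrative numbers only, not measurements from this thesis.
example = {
    "big": Profile(time_s=10.0, energy_j=50.0, aborts=400),
    "little": Profile(time_s=14.0, energy_j=30.0, aborts=150),
}
```

With these made-up profiles, optimizing only execution time selects the big cluster, whereas optimizing only energy or only aborted transactions selects the little cluster. When two clusters have nearly identical costs for the selected parameter, the choice carries little information, which is consistent with the last point of the list.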
We propose some directions for future work. As mentioned in Section 3.5.2.4, the mechanism our scheduler uses to avoid resource underutilization can be improved to select more appropriate applications. In addition, the design phase (training) of the scheduler could be omitted: applications can be assigned to idle clusters (as in the greedy scheduler) the first time they are executed, and information can be gathered from that execution. In subsequent executions, once enough information has been obtained, the application can be executed following the policy established by ScHeTM. Additionally, non-TM applications have to be considered in the model, as many applications do not require synchronization or use other synchronization techniques. An immediate way to include these applications is to consider that they present 0 aborted transactions (i.e., if no TM is used, then the application does not abort any transaction). Also, the model should consider implementing a thread-to-core mechanism to complement the existing thread-to-cluster proposal. This way, applications that do not require the full cluster, or even single-threaded applications, can be managed by the scheduler.
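The training-free variant outlined above could look roughly like the following sketch. The class name, its methods, and the use of the mean number of aborted transactions as the only decision criterion are hypothetical simplifications, not part of ScHeTM. An application is placed greedily on an idle cluster until every cluster has been sampled, and non-TM applications simply report 0 aborted transactions.

```python
class OnlineScheduler:
    """Hypothetical training-free policy: greedy first, informed later."""

    def __init__(self, clusters=("big", "little"), min_samples=1):
        if min_samples < 1:
            raise ValueError("min_samples must be at least 1")
        self.clusters = tuple(clusters)
        self.min_samples = min_samples
        # history[app][cluster] -> abort counts observed in past runs
        self.history = {}

    def record(self, app, cluster, aborts=0):
        """Store the outcome of a run; non-TM applications report 0 aborts."""
        runs = self.history.setdefault(app, {c: [] for c in self.clusters})
        runs[cluster].append(aborts)

    def choose(self, app, idle):
        """Pick a cluster for `app` among the currently idle ones."""
        if not idle:
            raise ValueError("no idle cluster available")
        runs = self.history.get(app, {})
        informed = all(len(runs.get(c, [])) >= self.min_samples
                       for c in self.clusters)
        if not informed:
            # Not enough information yet: act like the greedy scheduler,
            # preferring an idle cluster that still lacks samples.
            unsampled = [c for c in idle
                         if len(runs.get(c, [])) < self.min_samples]
            return (unsampled or list(idle))[0]
        # Enough information: pick the idle cluster with the fewest mean aborts.
        return min(idle, key=lambda c: sum(runs[c]) / len(runs[c]))
```

A short usage example, with made-up abort counts:

```python
sched = OnlineScheduler()
first = sched.choose("app", ["big", "little"])    # greedy: first idle cluster
sched.record("app", first, aborts=400)
second = sched.choose("app", ["big", "little"])   # samples the other cluster
sched.record("app", second, aborts=150)
later = sched.choose("app", ["big", "little"])    # informed decision
```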
4 Transactional Memory on Heterogeneous CPU+GPU processors
In multi-core CPUs, TM has emerged as a promising alternative to the use of locks. As a consequence, CPU vendors are including hardware TM as part of their commercial CPUs [72, 67, 37, 3]. At the same time, GPUs have become the default accelerators for graphics and data-parallel algorithms. For that reason, CPU vendors are also including integrated GPUs as part of their processors, creating the so-called Accelerated Processing Units (APUs). However, both worlds (APU processors and TM) are still separated.
In this chapter we prototype a Software TM (STM) library targeting APU processors. The goal is to allow for transactions on both the CPU and the GPU simultaneously. The implementation of TM on heterogeneous CPU-GPU processors is a challenging task, as the CPU and the GPU work under different programming models. Multi-core CPUs follow the MIMD model, where multiple cores may operate on different data or on shared memory, while GPUs follow the SIMD programming model, where multiple threads execute the same instruction (lockstep execution) on different memory positions. Another challenge is the memory space. CPUs access main memory via a coherent cache hierarchy. On the GPU, main memory is accessed via a cache hierarchy where, in most cases, the L1 data cache is not coherent. In addition, GPUs feature a low-latency scratch-pad memory (shared memory in CUDA, local memory in OpenCL, or tile_static memory in C++ AMP) that can be used to accelerate the management of GPU-private transactional metadata. Lastly, communication between CPU and GPU is an important problem to solve. Platform atomics ensure that values are effectively communicated between both devices, but these operations are expensive in terms of memory latency.
To better understand and overcome these challenges, we propose APUTM, a software TM designed to work on APU processors. APUTM can be configured to run separately on the CPU cores or the GPU, or simultaneously on both devices, ensuring mutual exclusion in any case. APUTM is inspired by NOrec [20], combining a fast timestamp-based conflict detection mechanism with a precise value-based validation. For the GPU, APUTM implements a mechanism to allow for parallel commits of transactions while updating the timestamp information to communicate with the CPU cores. In our evaluation, we use a synthetic workload and three microbenchmarks to analyze different configurations of APUTM. We provide a discussion on the impact that our design decisions have on the performance of APUTM on both devices and provide hints for future improvements.
4.1 Background: NOrec and GPU-STM
The design of APUTM is inspired by two different STM proposals. On the CPU side, our design borrows some ideas from NOrec, while the GPU side is inspired by GPU-STM.
NOrec [20] implements conflict detection by comparing the values read from memory during the transaction with the values in memory at commit time. Mismatching values lead to a transaction abort. To speed up this process, NOrec uses a single global sequence lock to serialize writer transactions. If this lock has not changed since the last validation, then the transaction commits its changes to memory immediately and updates the global sequence lock. This way, read-only transactions are linearized before writer transactions, allowing these to commit faster.
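The interplay between the global sequence lock and value-based validation can be made concrete with a small model. This is a simplified sketch in the spirit of NOrec, not the NOrec library or the APUTM code: shared memory is a Python dict, the compare-and-swap is emulated with a `threading.Lock`, and all class and method names are assumptions. It does not model hardware memory ordering or performance.

```python
import threading


class AbortTx(Exception):
    """Raised when value-based validation finds a changed location."""


class NOrecSketch:
    """Simplified NOrec-style STM over a dict used as shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.seqlock = 0                     # even: free, odd: a writer commits
        self._cas_guard = threading.Lock()   # emulates an atomic CAS

    def _cas(self, expected, new):
        with self._cas_guard:
            if self.seqlock == expected:
                self.seqlock = new
                return True
            return False

    def begin(self):
        while True:                          # wait until no writer is committing
            snapshot = self.seqlock
            if snapshot % 2 == 0:
                return Tx(self, snapshot)


class Tx:
    def __init__(self, stm, snapshot):
        self.stm, self.snapshot = stm, snapshot
        self.reads, self.writes = {}, {}     # read log and buffered writes

    def _validate(self):
        """Re-check every logged read; return a consistent timestamp."""
        while True:
            time = self.stm.seqlock
            if time % 2 == 1:
                continue                     # a writer is committing, retry
            for addr, seen in self.reads.items():
                if self.stm.memory[addr] != seen:
                    raise AbortTx(addr)
            if time == self.stm.seqlock:
                return time

    def read(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        value = self.stm.memory[addr]
        while self.snapshot != self.stm.seqlock:
            self.snapshot = self._validate()  # lock moved: validate, re-read
            value = self.stm.memory[addr]
        self.reads[addr] = value
        return value

    def write(self, addr, value):
        self.writes[addr] = value

    def commit(self):
        if not self.writes:
            return                           # read-only: nothing to publish
        while not self.stm._cas(self.snapshot, self.snapshot + 1):
            self.snapshot = self._validate()
        self.stm.memory.update(self.writes)  # write back while lock is odd
        self.stm.seqlock = self.snapshot + 2 # release with a new even value
```

In this model, a writer whose snapshot still matches the sequence lock commits with a single compare-and-swap and no validation. If the lock has moved, the transaction revalidates by value, so it aborts only when a location it actually read now holds a different value.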
On the GPU side, GPU-STM [70] offers many features desirable for APU architectures. Like NOrec, GPU-STM uses the concept of a global sequence lock, which makes the integration of both solutions feasible. In addition, GPU-STM allows for parallel commits of transactions that do not conflict. However, this is implemented by using a set of locks, which requires executing a high number of atomic operations. In our design, we instead synchronize transactions using the global sequence lock and use other techniques to permit parallel commits.
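To see why a lock-set scheme permits parallel commits at the price of more atomic operations, consider the following toy model. It is not GPU-STM's code: the lock-table size, the `addr % n_locks` hash, and the function names are assumptions chosen for illustration. Each committing transaction must acquire one lock per distinct table entry touched by its write set, so the number of atomic acquisitions grows with the write set, while two transactions that need disjoint entries never block each other.

```python
def lock_indices(write_set, n_locks=8):
    """Map written addresses to a sorted list of distinct lock indices.

    `addr % n_locks` is a hypothetical hash. Acquiring locks in sorted
    order is a standard way to avoid deadlock between committers.
    """
    return sorted({addr % n_locks for addr in write_set})


def can_commit_in_parallel(write_set_a, write_set_b, n_locks=8):
    """Two commits do not block each other if they need disjoint locks."""
    a = set(lock_indices(write_set_a, n_locks))
    b = set(lock_indices(write_set_b, n_locks))
    return a.isdisjoint(b)


def lock_acquisitions(write_set, n_locks=8):
    """Atomic lock acquisitions one commit needs under the lock-set scheme."""
    return len(lock_indices(write_set, n_locks))
```

In this model, a transaction writing to addresses that spread over the whole table needs `n_locks` atomic acquisitions, whereas a scheme based on a single global sequence lock needs one successful compare-and-swap per commit regardless of the write-set size. Note also that distinct addresses mapping to the same entry, such as 0 and 8 with the default table size, serialize two commits even though the transactions do not truly conflict.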