CHAPTER 10 Related Work

10.1 Task-Centric Memory Model Related Work

Compute accelerators have developed an array of memory models that emphasize compute density and parallel scalability. Accelerators have benefited from the absence of legacy software constraining their design and from the high degree of parallelism inherent in their workloads. Existing software APIs and programming models for accelerator systems are parallel by design, with very weak memory models and implied coherence guarantees. In contrast, software APIs and parallel programming models for coherent CMPs have strong consistency models and hardware coherence. This section discusses existing programming models and programming languages used by compute accelerators and parallel coherent CMPs. We illustrate how these models exploit characteristics of their workloads and the underlying architecture to achieve better performance or to enhance programmability.

Many of the prevalent models for accelerators exploit the existence of coarse-grained synchronization and the relative lack of fine-grained sharing in workloads. Accelerators have achieved success by relying on software to handle coherence actions and by allowing relaxed memory orderings, thus aiding hardware scalability and improving power and performance density. Here we survey memory models and programming models for parallel systems and compare them with the approach presented in Chapter 5. With the increased interest in accelerator platforms such as GPUs for general-purpose computation, we see an opportunity for memory models that are less reliant upon hardware to become widespread as core counts continue to rise and the distinction between CMP and accelerator begins to blur.

The Task-Centric Memory Model targets systems with a single address space and hardware-managed caches without hardware-managed coherence. Our approach contrasts with that of existing accelerators using software-managed scratchpads [4, 57] and with designs more similar to contemporary CMPs, where caches are kept coherent transparently to software [14]. Friedman [111] discusses a form of hybrid coherence in which there are strong and weak memory operations. Friedman goes on to construct a memory model that eliminates latencies that are unavoidable with a conventional memory model. The model is similar to what is supported on commercially available multicore processors today, where synchronization is performed using a set of atomic operations that adhere to stricter ordering rules than ordinary loads and stores. Moreover, the use of strong and weak operations is very similar to the local versus global distinction used in Rigel for memory operations.

Leverich et al. [112] investigate the implications of choosing between two different memory system configurations, hardware-coherent caches and software-managed scratchpads, for future CMPs and demonstrate that software coherence actions can provide benefit to cached systems. A third choice not investigated in that work, incoherent software-based architectures, is most similar to the Task-Centric Memory Model. Furthermore, prototype systems with hardware caches, but without hardware coherence, such as CEDAR [54], have been built. These same techniques are being reapplied to accelerator systems today, such as the Rigel accelerator [42] used as the basis for this work.

10.1.1 Parallel Programming Models

Many parallel models for existing CMPs, such as Intel’s Threading Building Blocks (TBB) [79] and Cilk [80], use explicit task generation, whereas models such as OpenMP [113] use implicit task generation. Explicit task generation is also used in the Rigel Task Model (RTM), but we limit RTM to bulk-synchronous parallel (BSP) semantics, while TBB also supports fork-join parallelism. TBB and Cilk allow for interactions between tasks and make use of parent-child communication through shared memory, which relies upon the existence of a coherent address space.

Underlying many of the models used by accelerators is the bulk-synchronous parallel model (BSP) [60]. BSP continues to be reflected in accelerator languages prevalent today, including CUDA [114] from NVIDIA and OpenCL [30]. CUDA and OpenCL are used to map data-parallel kernels to highly parallel systems comprising possibly hundreds of processing elements in a bulk-synchronous fashion.
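The bulk-synchronous structure described above can be sketched in a few lines. The following is only an illustrative model written in Python rather than CUDA or OpenCL: the function name `bsp_inclusive_scan`, the `num_workers` parameter, and the choice of a prefix sum as the kernel are assumptions made for this example and do not come from the surveyed systems. Each superstep consists of independent computation on private elements followed by a barrier, and data written in one superstep is read by other workers only after that barrier.

```python
import threading


def bsp_inclusive_scan(values, num_workers=4):
    """Inclusive prefix sum organized as bulk-synchronous supersteps.

    Hypothetical sketch: every worker computes its own elements from the
    previous superstep's buffer, then waits at a barrier before any worker
    starts the next superstep.
    """
    n = len(values)
    # Double buffering: a superstep reads one buffer and writes the other.
    buffers = [list(values), [0] * n]

    # One superstep per power-of-two stride smaller than n.
    strides = []
    stride = 1
    while stride < n:
        strides.append(stride)
        stride *= 2

    barrier = threading.Barrier(num_workers)

    def worker(wid):
        for step, s in enumerate(strides):
            src = buffers[step % 2]
            dst = buffers[(step + 1) % 2]
            # Computation phase: no worker writes an element owned by another.
            for i in range(wid, n, num_workers):
                dst[i] = src[i] + (src[i - s] if i >= s else 0)
            # Synchronization phase: results become visible to all workers.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buffers[len(strides) % 2]
```

Because sharing happens only across barriers, the sketch needs no fine-grained synchronization inside a superstep, which is the property that accelerator programming models exploit.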

While CMPs continue to support unrestricted sharing patterns and accelerators usually adopt shared-nothing programming models, we see a potential for an intermediate design point, such as the Task-Centric Memory Model, that exploits the structure of accelerator applications by using a software protocol and minimal hardware to provide the programmability afforded by CMPs while achieving the scalability of accelerators.

10.1.2 Parallel Memory Models

The Rigel memory model and coherence mechanisms are akin to software coherence mechanisms used to provide the illusion of a single address space for distributed shared memory (DSM) systems [102, 115]. Two DSMs, Midway [116] and Munin [103], used flexible consistency models to achieve parallel scalability. Midway allowed for a high degree of latency tolerance by associating individual data items with synchronization operations and only guaranteeing that the data was visible after acquiring the associated synchronization object. The system also supported multiple consistency models concurrently in one program. Munin was based on data types specified by the programmer that allowed for communication-based per-type optimizations to be exploited by the runtime. Munin and Midway are analogous to a software-only hybrid memory model.

The consistency guarantees we investigate for write-output data at RTM task boundaries are similar to Scope Consistency [117] in that dirty data is implicitly made coherent at the end of the task’s scope and updates can be deferred until the scope is reopened. In RTM, reopening a scope means starting a new task or a new interval following a barrier. The Cooperative Shared Memory model [71] provides a model similar to Rigel’s. Cooperative Shared Memory relies on software to properly label shared accesses for performance and achieves scalable performance using a reduced-complexity hardware coherence protocol (Dir1SW).
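The task-boundary behavior described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the Rigel or RTM implementation: the class `IncoherentCache`, its methods `begin_task` and `end_task`, and the function `demo` are hypothetical names introduced only for this example. The sketch models private caches with no hardware coherence, writes dirty data back when a task's scope closes, and discards clean cached copies when a scope is reopened.

```python
class IncoherentCache:
    """Toy private cache without hardware coherence (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory  # shared backing store: address -> value
        self.lines = {}       # locally cached copies: address -> value
        self.dirty = set()    # addresses written since the last write-back

    def load(self, addr):
        # A hit returns the local copy, which may be stale.
        if addr not in self.lines:
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def store(self, addr, value):
        # Stores stay local until the enclosing task ends.
        self.lines[addr] = value
        self.dirty.add(addr)

    def end_task(self):
        # Closing the scope: write dirty data back to shared memory.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

    def begin_task(self):
        # Reopening the scope: drop clean copies so later loads refetch.
        self.lines = {a: v for a, v in self.lines.items() if a in self.dirty}


def demo():
    memory = {0x10: 1}
    producer = IncoherentCache(memory)
    consumer = IncoherentCache(memory)

    consumer.begin_task()
    first_read = consumer.load(0x10)        # consumer caches the old value

    producer.begin_task()
    producer.store(0x10, 42)                # update is still private
    memory_before_flush = memory[0x10]
    stale_read = consumer.load(0x10)        # hit on the stale local copy

    producer.end_task()                     # write-back at the task boundary
    consumer.end_task()
    # (a barrier would separate the two intervals here)
    consumer.begin_task()                   # scope reopened
    fresh_read = consumer.load(0x10)
    return first_read, memory_before_flush, stale_read, fresh_read
```

In this model, `demo()` returns `(1, 1, 1, 42)`: the consumer observes the producer's update only after the producer's task has ended and the consumer has started a new task, which mirrors the deferred visibility that Scope Consistency permits.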

The BSP model was described by Valiant [60] and continues to be reflected in languages prevalent today, including CUDA [114] from NVIDIA and OpenCL [30]. As mentioned previously, CUDA is used to map data-parallel kernels to GPUs comprising hundreds of processing elements in a bulk-synchronous fashion, but it requires SIMD-friendly code to achieve high execution efficiency [12]. DeNovo [118] attempts to exploit race-free and deterministic software to build an architecture that reduces the strain on hardware coherence mechanisms. That work is similar to the Task-Centric Memory Model (TCMM) in that it exploits program structure, race freedom in the case of DeNovo, to relax the constraints placed on hardware.

10.1.3 Accelerator Workloads

Data- and task-parallel workloads that motivate our investigation of a task-parallel model include recognition, mining, and synthesis (RMS) [81] and physical simulation applications [15], which Intel is investigating as a means of providing more realistic virtual worlds. A later study from Intel [119] comparing the performance of GPUs and CPUs for throughput-oriented workloads uses benchmarks similar to the workloads evaluated here.

A variety of highly parallel workloads, such as the PARSEC [98] and ALPBench [120] suites, have been evaluated for conventional multicore processors. Accelerator workloads targeting current-generation GPUs have been studied [19], while studies motivating future accelerator architectures have focused on characterizing visual computing workloads [12]. While these studies investigate the scalability of visual computing workloads, we go further to point out the sharing patterns relevant to coherence management and show how these characteristics can be exploited in the design of future compute accelerators.
