Chapter 5 Related Work

5.1 Software-Hardware Co-designed Systems

Recently, there has been active research in leveraging software knowledge to address inefficiencies in hardware design.

The recent SARC coherence protocol [67] also exploits the data-race-free programming model, but its goal is to improve the conventional directory-based protocol [7]. SARC self-invalidates “tear-off, read-only” (TRO) copies of data to save power. However, SARC neither eliminates directory storage overhead nor reduces protocol complexity as the DeNovo system does. Also, the concept of the touched bit, which plays an important role in the DeNovo system, is not present in SARC.

VIPS-M [107] improves on SARC by adopting self-invalidations, similarly to DeNovo. It uses private/shared information about data to perform self-invalidation and eliminates the directory by performing write-through at synchronization points. It implements a write-through protocol for synchronization accesses and claims that delaying the completion of a write-through helps reduce spinning by other cores. The approach is similar to the delaying of QOLB attempts in [104]. However, the VIPS-M proposal does not make clear how it deals efficiently with synchronization algorithms that use multiple synchronization variables and read them frequently, such as non-blocking algorithms. If many variables cause delayed write-throughs, they may interfere with each other’s progress and degrade synchronization latency. The authors recently proposed a synchronization mechanism [108] to address the issues with synchronization support in VIPS-M.
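The combination of private/shared classification, self-invalidation, and write-through at synchronization points can be illustrated with a small software model. This is only a sketch under stated assumptions, not the VIPS-M or DeNovo hardware design: the names `Line`, `ToyCache`, and `sync_point`, and the use of a Python dict as a stand-in for the shared last-level cache, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    value: int
    shared: bool          # private/shared classification (assumed to be given)
    dirty: bool = False


@dataclass
class ToyCache:
    """Hypothetical private cache; `llc` is a dict standing in for the shared cache."""
    llc: dict
    lines: dict = field(default_factory=dict)

    def read(self, addr, shared):
        # A miss fetches from the shared level; a hit may return a stale copy.
        if addr not in self.lines:
            self.lines[addr] = Line(self.llc.get(addr, 0), shared)
        return self.lines[addr].value

    def write(self, addr, value, shared):
        # Writes stay local until the next synchronization point.
        self.lines[addr] = Line(value, shared, dirty=True)

    def sync_point(self):
        # At synchronization: write through dirty shared lines, then
        # self-invalidate every shared line. Private lines are left alone.
        for addr, line in list(self.lines.items()):
            if not line.shared:
                continue
            if line.dirty:
                self.llc[addr] = line.value
            del self.lines[addr]


llc = {}
reader, writer = ToyCache(llc), ToyCache(llc)
reader.read(16, shared=True)        # reader caches the old value
writer.write(16, 7, shared=True)
writer.sync_point()                 # write-through makes 7 visible in llc
reader.sync_point()                 # self-invalidation drops the stale copy
print(reader.read(16, shared=True))
```

No directory or sharer list appears anywhere in the model: a reader sees the new value only because it discards its own copy at a synchronization point, which is why data-race freedom is required for correctness.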

Callback [108] implements a directory dedicated to spin-waiting reads so that readers can be notified when a write occurs. The callback mechanism in [108] shares with DeNovoSync the goal of providing coherence support for synchronization accesses without re-introducing writer-initiated invalidations, but it differs in the following aspects: (1) In Callback, cores bypass local caches and register a callback at the LLC on every synchronization read to ensure that the read does not see stale data, even if there are no intervening writes. DeNovo allows synchronization reads to hit in local caches as long as the data is in the Registered state, which guarantees up-to-date data and no intervening synchronization accesses. (2) In addition to the data vs. synchronization distinction, Callback requires a more fine-grained distinction for synchronization writes (i.e., waking up one reader or all readers) depending on the synchronization mechanism in which they are used. Such a distinction is not trivially given and requires automatic or manual semantic analysis to identify which type of synchronization write should be used, followed by code modifications. In contrast, data-race-free memory models adopted by modern languages (the only software assumption made by DeNovoSync) imply a data vs. synchronization distinction and do not require any additional code annotation or modification. Interestingly, the evaluation in [108] shows that exponential backoff is as efficient as the callback mechanism in most cases as long as the right parameters are used. DeNovoSync addresses the inflexibility of exponential backoff with an adaptive mechanism that adjusts backoff parameters in response to different contention levels.
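To make the idea of adapting backoff parameters to contention concrete, the following is a minimal software sketch. It is not the DeNovoSync hardware mechanism: the class `AdaptiveBackoff`, its thresholds, and the rule of doubling or halving the delay ceiling are illustrative assumptions chosen only to show how a fixed exponential backoff can be made to track observed contention.

```python
class AdaptiveBackoff:
    """Hypothetical backoff policy; every parameter here is illustrative."""

    def __init__(self, base=1, min_cap=4, max_cap=1024):
        self.base = base
        self.min_cap = min_cap
        self.max_cap = max_cap
        self.cap = min_cap    # current delay ceiling, adapted over time
        self.delay = base     # delay to use on the next failed attempt

    def on_failure(self):
        """Return the delay to wait now, then double it up to the current cap."""
        wait = self.delay
        self.delay = min(self.delay * 2, self.cap)
        return wait

    def on_success(self, failures):
        """Adapt the cap to the contention just observed and reset the delay."""
        if failures > 2:      # heavy contention: permit longer waits next time
            self.cap = min(self.cap * 2, self.max_cap)
        elif failures == 0:   # no contention: move back toward short waits
            self.cap = max(self.cap // 2, self.min_cap)
        self.delay = self.base


policy = AdaptiveBackoff()
first = [policy.on_failure() for _ in range(4)]
policy.on_success(failures=4)          # contended acquisition raises the cap
second = [policy.on_failure() for _ in range(5)]
print(first, second)
```

With a fixed ceiling, the delay sequence is the same regardless of load; here the ceiling grows after a contended acquisition and shrinks after an uncontended one, so the parameters do not have to be tuned by hand for each workload.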

Rigel [71], with a task-centric memory model [70] for accelerator architectures, exploits the popular sharing patterns in accelerator workloads to enforce coherence with explicit software-managed instructions at barriers. DeNovoD in Chapter 2 shares with Rigel the key motivation of focusing on applications with barriers only and providing software-driven coherence for such applications. However, Rigel requires that all dirty lines be flushed to the global shared cache at the end of each phase, whereas DeNovo can keep up-to-date data in local caches with write registration and selective self-invalidations. In addition, the overall DeNovo system, including DeNovoND and DeNovoSync in Chapters 3 and 4, supports a larger scope of applications with different sharing patterns (other than Bulk-Synchronous Parallelism) and unstructured synchronization while keeping the coherence overheads minimal.

Cohesion [72], based on the Rigel architecture, proposes a hybrid memory model that switches between hardware and software coherence depending on the sharing patterns in accelerator applications. Cohesion uses the notion of a “region” of memory to track which coherence domain (HW or SW) a region belongs to and to control coherence domain transitions. Regions in Cohesion require much more hardware support than in DeNovo, as Cohesion requires coarse-grained and fine-grained region tables with mappings for all memory locations to handle transitions between coherence domains. DeNovo, a unified (not a hybrid) coherence solution, does not incur such overheads because it uses regions only for selective self-invalidations, and its storage overheads for regions are limited to private caches. Cohesion does not address existing limitations of software and directory-based hardware coherence mechanisms, while DeNovo simplifies hardware coherence protocol design with software-coherence-inspired ideas.

Min and Baer proposed a timestamp-based software-assisted coherence scheme without global coherence communication [93]. It relies on sophisticated compile-time analysis of memory dependencies, including perfect branch prediction, to maintain coherence. The overall DeNovo system takes a software-hardware co-design approach that exploits compiler/software-provided information to reduce the complexity and improve the efficiency of hardware coherence, while the hardware in [93] cannot guarantee correct execution without software assistance. DeNovo also allows dirty copies to stay in private caches for better data reuse, while [93] requires that writes go to a shared cache if there are potential conflicts.

More recently, TSO-CC [45] proposed a self-invalidation-based coherence protocol for the TSO memory model. However, while it reduces writer-initiated invalidation overheads with self-invalidation, it also introduces many hardware mechanisms, such as timestamps and epochs, and complicated logic to maintain them.

Dir1SW [58] is a simplified directory protocol that adds little complexity to message-passing hardware but efficiently supports programs written within the CICO model [78]. However, it reduces transient states by trapping to software handlers for requests that need multiple messages. While CICO achieves simplicity by offloading the complexity to software and the operating system, DeNovo fundamentally eliminates transient states in the protocol (assuming data-race freedom).
