Background
Condition 2.1: Lamport’s Requirements for Sequential Consistency
(a) Each processor issues memory requests in the order specified by its program.
(b) Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of placing the request in this queue.
Based on the above requirements, each processor needs to issue its memory operations in program order and needs to wait for each operation to get to the memory module, or its queue, before issuing the next memory operation. In architectures with general networks such as the one shown in Figure 2.7, detecting that a memory operation has reached its memory module requires an acknowledgement reply to be sent back from the memory module to the processor. Therefore, the processor is effectively delayed for the full latency of an operation before issuing its next operation. More efficient implementations are conceptually possible, especially with a more restrictive interconnection network such as a bus. The key observation made by Lamport’s conditions is that, from an ordering point of view, a memory operation is effectively complete as soon as it is queued at its memory module (as long as the queue is FIFO). Therefore, a processor can conceptually overlap the period of time from when an operation reaches the queue to when the memory actually responds with the time to issue and service a later operation. The above observation does not hold for designs with caches, however. Furthermore, Condition 2.1 itself is not sufficient for correctness in the presence of data replication.
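The observation about FIFO queues can be illustrated with a minimal Python sketch. The class `MemoryModule` and its methods `issue` and `service_all` are hypothetical names introduced here for illustration only; the sketch assumes a single memory module with no caches, where issuing a request is nothing more than an enqueue.

```python
from collections import deque

class MemoryModule:
    """Toy memory module that services requests from a single FIFO queue."""

    def __init__(self):
        self.queue = deque()   # pending requests, in arrival order
        self.cells = {}        # location -> value (unwritten locations read as 0)

    def issue(self, proc, kind, loc, value=None):
        # Issuing a request consists only of placing it in the queue.
        # From an ordering point of view, its position is fixed from here on.
        self.queue.append((proc, kind, loc, value))

    def service_all(self):
        # Service requests strictly in FIFO order and collect the read results.
        results = []
        while self.queue:
            proc, kind, loc, value = self.queue.popleft()
            if kind == "W":
                self.cells[loc] = value
            else:
                results.append((proc, loc, self.cells.get(loc, 0)))
        return results

module = MemoryModule()
module.issue("P1", "W", "A", 1)   # P1 may conceptually proceed once this is queued
module.issue("P2", "R", "A")      # queued after the write, so it observes A == 1
module.issue("P1", "W", "A", 2)
module.issue("P2", "R", "A")
reads = module.service_all()
```

Because the queue is FIFO, each read returns the value of the write queued most recently before it, regardless of when the module actually gets around to servicing the requests.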
Scheurich and Dubois [SD87] provide a more general set of requirements that explicitly deals with optimizations such as caching. One of the key issues that arises in the presence of multiple copies is that writes may no longer behave atomically with respect to other memory operations. Scheurich and Dubois’ sufficient conditions are summarized below. A read is considered complete when its return value is bound and a write is considered complete when all copies of the location are brought up-to-date (i.e., either through an invalidation or update mechanism).
Condition 2.2: Scheurich and Dubois’ Requirements for Sequential Consistency
(a) Each processor issues memory requests in the order specified by its program.
(b) After a write operation is issued, the issuing processor should wait for the write to complete before issuing its next operation.
(c) After a read operation is issued, the issuing processor should wait for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation.
(d) Write operations to the same location are serialized in the same order with respect to all processors.
There are several important differences from Lamport’s conditions that arise from dealing with multiple copies of the same location. For example, a write is considered complete only after all copies of the location are brought up-to-date, and a processor cannot issue operations following a read until the read return value is bound and the write responsible for that value is complete. Conditions 2.2(c) and (d) are important for dealing with the inherently non-atomic behavior of writes, and address the problems depicted in Figures 2.8 and 2.9, respectively. One way to satisfy Condition 2.2(c) in an invalidation-based caching scheme is to disallow a newly written value from being read by any processor until all invalidations corresponding to the write are complete. Maintaining this condition for update-based protocols is more cumbersome, however, and typically requires a two-phase interaction where cached copies are updated during the first phase, followed by a second message that is sent to all copies to signal the completion of the first phase. These and other techniques are covered in detail in Chapter 5.
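The invalidation-based approach to Condition 2.2(c) can be sketched as follows. This is only an illustration under simplifying assumptions: the class `InvalidationWrite` and the cache identifiers are hypothetical, and a real protocol would track this state per cache line in a directory or through bus snooping.

```python
class InvalidationWrite:
    """Tracks one write until every stale cached copy has been invalidated."""

    def __init__(self, new_value, sharers):
        self.new_value = new_value
        self.pending_acks = set(sharers)   # caches that still hold a stale copy

    def acknowledge(self, cache_id):
        # An invalidation acknowledgement arrives from one sharer.
        self.pending_acks.discard(cache_id)

    @property
    def complete(self):
        # The write is complete once no stale copies remain.
        return not self.pending_acks

    def try_read(self):
        # The new value may not be returned to any reader before completion;
        # None stands for "the reader must stall".
        return self.new_value if self.complete else None

write = InvalidationWrite(new_value=1, sharers={"C2", "C3"})
before = write.try_read()        # invalidations still outstanding
write.acknowledge("C2")
partial = write.try_read()       # C3 has not acknowledged yet
write.acknowledge("C3")
after = write.try_read()         # write is complete, value may be read
```

Withholding the value until `complete` holds means that any read returning the new value also implies that the write itself has completed, which is the property Condition 2.2(c) asks for.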
The conditions for satisfying sequential consistency are often confused with the conditions for keeping caches coherent. A cache coherence protocol typically ensures that the effect of every write is eventually made visible to all processors (through invalidation or update messages) and that all writes to the same location are seen in the same order by all processors. The above conditions are clearly not sufficient for satisfying sequential consistency. For example, sequential consistency ultimately requires writes to all locations to appear to be seen in the same order by all processors. Furthermore, operations from the same processor must appear in program order.
From an architecture perspective, the above conditions effectively require each processor to wait for a memory operation to complete before issuing its next memory operation in program order. Therefore, optimizations that overlap and reorder memory operations from the same processor may not be exploited. In addition, designs with caches require extra mechanisms to preserve the illusion of atomicity for writes. Similarly, from a compiler perspective, program order must be maintained among memory operations to shared data.
We next discuss how program-specific information can be used to achieve more aggressive implementations of SC.
2.3.2
Using Program-Specific Information
The sufficient requirements presented in the previous section maintain program order among all shared memory operations. However, for many programs, not all operations need to be executed in program order for the execution to be sequentially consistent. Consider the program segment in Figure 2.5(c), for example. The only sequentially consistent outcome for this program is (u,v)=(1,1). However, maintaining the program order between the two writes to A and B on P1 or the two reads of A and B on P2 is not necessary for guaranteeing a sequentially consistent result for this program. For example, consider an actual execution where the two writes on P1 are executed out of program order, with the corresponding total order on operations of (b1,a1,c1,a2,b2,c2), which is not consistent with program order. Nevertheless, the result of the execution is (u,v)=(1,1) since the writes to A and B still occur before the reads of those locations. This result is the same as if the total order were (a1,b1,c1,a2,b2,c2), which does satisfy program order. An analogous observation holds for the program segment in Figure 2.5(b). Of course, some program orders may still need to be maintained. For example, referring back to Figure 2.5(c), executing the write to Flag before the writes to A and B are complete can easily lead to the non-SC result of (u,v)=(0,0). Therefore, program-specific information about memory operations must be used to decide whether a given program order can be safely violated.
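The argument above can be checked by brute force. The Python sketch below assumes that the program segment in Figure 2.5(c) has the following form, inferred from the operation labels and outcomes discussed in the text: P1 executes a1: A=1; b1: B=1; c1: Flag=1, and P2 executes a2: wait until Flag is set; b2: u=A; c2: v=B. The helper names `interleavings`, `run`, and `outcomes` are hypothetical, and the spin loop is modeled as a single `WAIT` operation that is only feasible once Flag has been written.

```python
def interleavings(p1, p2):
    """Yield every merge of two operation lists that keeps each list's order."""
    if not p1 or not p2:
        yield list(p1) + list(p2)
        return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest

def run(order):
    """Execute one total order atomically; return (u, v), or None if infeasible."""
    mem = {"A": 0, "B": 0, "Flag": 0}
    regs = {}
    for kind, loc, arg in order:
        if kind == "W":
            mem[loc] = arg
        elif kind == "R":
            regs[arg] = mem[loc]
        elif kind == "WAIT" and mem[loc] == 0:
            return None   # the spin loop could not have exited at this point
    return (regs["u"], regs["v"])

def outcomes(p1, p2):
    """Set of (u, v) results over all feasible interleavings."""
    results = {run(order) for order in interleavings(p1, p2)}
    results.discard(None)
    return results

a1, b1, c1 = ("W", "A", 1), ("W", "B", 1), ("W", "Flag", 1)
a2, b2, c2 = ("WAIT", "Flag", None), ("R", "A", "u"), ("R", "B", "v")

in_order = outcomes([a1, b1, c1], [a2, b2, c2])        # program order on P1
swapped_writes = outcomes([b1, a1, c1], [a2, b2, c2])  # writes to A and B swapped
flag_first = outcomes([c1, a1, b1], [a2, b2, c2])      # Flag written too early
```

Under these assumptions, swapping the two data writes leaves the outcome set unchanged, while moving the write to Flag ahead of the data writes admits additional outcomes, including (u,v)=(0,0).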
Shasha and Snir [SS88] have proposed an automatic method for identifying the “critical” program orders that are sufficient for guaranteeing sequentially consistent executions. This method may be used, for example, to determine that no program orders must be maintained in the example program segment in Figure 2.5(a). The reason is that the outcomes (u,v)=(0,0), (0,1), (1,0), and (1,1) are all sequentially consistent outcomes, and allowing the operations to execute out of program order does not introduce any new outcomes that are not possible under sequential consistency. The information that is generated by such an analysis can be used by both the architecture and the compiler to exploit the reordering of operations where it is deemed safe. Similar to Lamport’s work, Shasha and Snir assumed writes are atomic; therefore, their framework does not deal with issues of non-atomicity arising from the presence of multiple copies.
Unfortunately, automatically figuring out the critical set of program orders is a difficult task for general programs. The primary source of difficulty is in detecting conflicting data operations (i.e., operations to the same location where at least one is a write) at compile time. This is virtually the same as solving the well-known aliasing problem, except it is further complicated by the fact that we are dealing with an explicitly parallel program. The above problem is undecidable in the context of general programs. Furthermore, inexact solutions are often too conservative especially if the program uses pointers and has a complex control flow. Therefore, while the above techniques may work reasonably well for limited programs and programming languages, it is not clear whether they will ever become practical in the context of general programs.
The next section describes alternative techniques for enhancing system performance without violating the semantics of SC.
2.3.3
Other Aggressive Implementations of Sequential Consistency
This section presents a brief overview of a few other techniques that have been proposed to enhance the performance of sequentially consistent implementations. These techniques are specifically targeted at hardware implementations and are not applicable to the compiler. More detailed descriptions are presented in Chapter 5.
The first technique we discuss reduces the latency of write operations when the write has to invalidate or update other cached copies. A naive implementation of sequential consistency would acknowledge the invalidation or update message from each cache after the relevant copy has actually been affected. A more efficient implementation can acknowledge the invalidation or update earlier, i.e., as soon as the message is buffered by the target processor node. Thus, the issuing processor will observe a lower latency for the write. Sequential consistency is still upheld as long as certain orders are maintained with respect to other incoming messages. Afek et al. [ABM89, ABM93] originally proposed this idea, referred to as lazy caching, in the context of a bus-based implementation using updates and write-through caches. We present several extensions and optimizations to this idea in Chapter 5, making this technique applicable to a much wider range of implementations.
The above technique can be extended to more dramatically reduce the write latency in systems that use restrictive network topologies. Examples of such networks include buses, rings, trees, and hierarchies of such topologies. Landin et al. [LHH91] propose an implementation of SC that exploits the ordering guarantees in such networks to provide an early acknowledgement for writes. The acknowledgement is generated when the write reaches the “root” node in the network, which occurs before the write actually reaches its target caches. Chapter 5 presents our generalized version of this optimization in the context of various restricted network topologies.
Adve and Hill [AH90a] have also proposed an implementation for sequential consistency that can alleviate some of the write latency by allowing certain operations that follow the write to be serviced as soon as the write is serialized (i.e., receives ownership for the line) as opposed to waiting for all invalidations to be acknowledged.
Another approach combines the speculative read and automatic write prefetching techniques that we proposed in an earlier paper [GGH91b]. The idea behind speculative reads is to actually issue reads in an overlapped manner with previous operations from the same processor, thus allowing the operations to potentially complete out of program order. A simple mechanism is used to detect whether the overlapped completion of the operations can possibly violate sequential consistency. In the unlikely case that such a violation is detected, the speculative read and any computation that is dependent on it are simply rolled back and reissued. This technique is especially suited for dynamically scheduled processors with branch prediction capability since the required roll-back mechanism is virtually identical to that used to recover from branch mispredictions. The idea behind the second technique is to automatically prefetch the cache lines for write operations that are delayed due to the program order requirements. Thus, the write is likely to hit in the cache when it is actually issued. This latter technique allows multiple writes to be (almost fully) serviced at the same time, thus reducing the write latency as seen by the processor. The above two techniques have been adopted by several next generation processors, including the MIPS R10000 and the Intel Pentium Pro (both are dynamically scheduled processors), to allow for more efficient SC implementations.
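The roll-back idea behind speculative reads can be sketched in a few lines of Python. The detection rule used here is an assumption made for illustration, since the text does not spell out the mechanism: the sketch conservatively treats any write by another processor to the speculatively read location, observed before the read retires, as a possible violation. The class `SpeculativeRead` and its methods are hypothetical names.

```python
class SpeculativeRead:
    """One read issued early, before earlier operations have completed."""

    def __init__(self, loc, memory):
        self.loc = loc
        self.value = memory[loc]   # speculatively bound return value
        self.squashed = False

    def observe_remote_write(self, loc):
        # Assumed detection rule: a remote write to the same location before
        # retirement means the early value may violate sequential consistency.
        if loc == self.loc:
            self.squashed = True

    def retire(self, memory):
        # Called once all earlier operations from this processor have completed.
        if self.squashed:
            self.value = memory[self.loc]   # roll back: reissue the read
            self.squashed = False
        return self.value

memory = {"A": 0, "B": 0}

conflicting = SpeculativeRead("A", memory)
memory["A"] = 1                         # another processor writes A ...
conflicting.observe_remote_write("A")   # ... and the write is observed
reissued_value = conflicting.retire(memory)

undisturbed = SpeculativeRead("B", memory)
undisturbed.observe_remote_write("A")   # write to a different location
kept_value = undisturbed.retire(memory)
```

In the common case no conflicting write is observed, and the early value is used without any penalty. A full design would also roll back computation that depends on the squashed read, which this sketch omits.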
Overall, except for the speculative read and the write prefetching techniques which are primarily applicable to systems with dynamically scheduled processors, the rest of the techniques described above fail to exploit many of the hardware optimizations that are required for dealing with large memory latencies. Furthermore, we are not aware of any analogous techniques that allow the compiler to safely exploit common optimizations without violating SC. Therefore, preserving the semantics of sequential consistency can still severely limit the performance of a shared-memory system. The following quotation from the two-page note by Lamport that originally proposed sequential consistency [Lam79] echoes the same concern: “The requirements needed to guarantee sequential consistency rule out some techniques which can be used to speed up individual sequential processors. For some applications, achieving sequential consistency may not be worth the price of slowing down the processors.”
To achieve higher performance, many shared-memory systems have opted to violate sequential consistency by exploiting the types of architecture and compiler optimizations described in Section 2.2. This has led to a need for alternative memory consistency models that capture the behavior of such systems.
2.4
Alternative Memory Consistency Models
This section describes the various relaxed memory consistency models that have been proposed to capture the behavior of memory operations in systems that do not strictly obey the constraints set forth by sequential consistency. We begin the section by presenting a progression of relaxed models that enable more aggressive architecture and compiler optimizations as compared to sequential consistency. We next consider the relationship among these models in terms of the possible outcomes they allow for individual programs. Finally, the latter part of the section describes some of the shortcomings of relaxed models relative to sequential consistency.
2.4.1
Overview of Relaxed Memory Consistency Models
The basic idea behind relaxed memory models is to enable the use of more optimizations by eliminating some of the constraints that sequential consistency places on the overlap and reordering of memory operations.
While sequential consistency requires the illusion of program order and atomicity to be maintained for all operations, relaxed models typically allow certain memory operations to execute out of program order or non-atomically. The degree to which the program order and atomicity constraints are relaxed varies among the different models.
The following sections provide an overview of several of the relaxed memory consistency models that have been proposed. We have broadly categorized the various models based on how they relax the program order constraint. The first category of models includes the IBM-370 [IBM83], Sun SPARC V8 total store ordering (TSO) [SFC91, SUN91], and processor consistency (PC) [GLL+90, GGH93b] models, all of which allow a write followed by a read to execute out of program order. The second category includes the Sun SPARC V8 partial store ordering (PSO) model [SFC91, SUN91], which also allows two writes to execute out of program order. Finally, the models in the third and last category extend this relaxation by allowing reads to execute out of program order with respect to their following reads and writes. These include the weak ordering (WO) [DSB86], release consistency (RC) [GLL+90, GGH93b], Digital Equipment Alpha (Alpha) [Sit92, SW95], Sun SPARC V9 relaxed memory order (RMO) [WG94], and IBM PowerPC (PowerPC) [MSSW94, CSB93] models.
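The three categories above can be summarized as a small lookup table. The Python sketch below is a coarse illustration only: the names `RELAXED` and `may_reorder` are hypothetical, and the table captures just the program order relaxations listed in this overview. It ignores atomicity, operations to the same location, and the mechanisms each model provides for enforcing order where needed.

```python
# Program orders that may be relaxed, written as (earlier, later) operation types.
W_R = {("W", "R")}                  # a write followed by a read
W_W = {("W", "W")}                  # two writes
R_RW = {("R", "R"), ("R", "W")}     # a read followed by a read or a write

RELAXED = {
    "SC": set(),
    # First category: write followed by read.
    "IBM-370": W_R, "TSO": W_R, "PC": W_R,
    # Second category: additionally, two writes.
    "PSO": W_R | W_W,
    # Third category: additionally, a read followed by reads and writes.
    "WO": W_R | W_W | R_RW, "RC": W_R | W_W | R_RW,
    "Alpha": W_R | W_W | R_RW, "RMO": W_R | W_W | R_RW,
    "PowerPC": W_R | W_W | R_RW,
}

def may_reorder(model, earlier, later):
    """True if the model's category relaxes program order from earlier to later."""
    return (earlier, later) in RELAXED[model]

tso_write_read = may_reorder("TSO", "W", "R")
tso_write_write = may_reorder("TSO", "W", "W")
pso_write_write = may_reorder("PSO", "W", "W")
pso_read_write = may_reorder("PSO", "R", "W")
rc_read_read = may_reorder("RC", "R", "R")
```

Each category is a superset of the previous one, which reflects the progression from sequential consistency toward increasingly relaxed models.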
In what follows, we present the basic representation and notation used for describing the models and proceed to describe each model using this representation. Our primary goal is to provide an intuitive notion about the behavior of memory operations under each model. A more formal and complete specification of the models is provided in Chapter 4. Since many of the above models were inspired by the desire to enable more optimizations in hardware, our preliminary discussion of the models focuses mainly on the architectural advantages. Discussion of compiler optimizations that are enabled by each model is deferred to Section 2.4.6.
2.4.2
Framework for Representing Different Models
This section describes the uniform framework we use to represent the various relaxed memory models. For each model, our representation associates the model with a conceptual system and a set of constraints that are obeyed by executions on that system. Below, we use sequential consistency as an example model to motivate the various aspects of our representation.
Figure 2.11 shows the basic representation for sequential consistency. The conceptual system for SC consists of n processors sharing a single logical memory. Note that the conceptual system is meant to capture the behavior of memory operations from a programmer’s point of view and does not directly represent the implementation of a model. For example, even though we do not show caches in our conceptual system, an SC implementation may still cache data as long as the memory system appears as a single-copy memory (e.g., writes should appear atomic).
We represent shared memory read and write operations as R and W, respectively. A read operation is assumed to complete when its return value is bound. A write operation is assumed to complete when the corresponding memory location is updated with the new value. We assume each processor issues its memory operations in program order. However, operations do not necessarily complete in this order. Furthermore, unless specified otherwise, we implicitly assume all memory operations eventually complete.
Atomic read-modify-write operations are treated as both a read and a write operation. Most models require that it appears as if no other writes to the same location (from a different processor) occur between the read and the write of the read-modify-write. The TSO, PSO, and RMO models, however, require that no