Multiprocessor Memory Hierarchy

2.3 Memory Hierarchy

2.3.4 Multiprocessor Memory Hierarchy

The memory hierarchy of a multiprocessor system can be organised in a variety of ways, each with its own advantages and disadvantages. The most common type of multiprocessor architecture is shared memory architecture. A type of shared memory architecture is Uniform Memory Access (UMA), otherwise referred to as Symmetric Multiprocessors (SMP). Figure 2.14 illustrates a UMA/SMP system. It can be seen that each processor (P) typically has its own private level 1 cache (C). However, all the processors access a shared main memory (M). Access to the shared memory is symmetric. The terminology appears ambiguous, but simply means that all processors have equal access times to the memory, so all suffer the same latencies. Whilst the private caches may delay or prevent non-shared data traffic from competing for bus access, it can be noted that the shared bus and main memory are obvious bottlenecks. As a result, UMA/SMP-based systems are considered unscalable.

An alternative architecture for a UMA/SMP system uses a single shared cache instead of multiple private caches. Whilst this architecture may reduce the bus and memory bottleneck, a bottleneck is created at the shared cache.

Figure 2.15 illustrates a Non-uniform Memory Architecture (NUMA). This could also be termed Asymmetric Multiprocessor (AMP). A NUMA system is split into a set of segments or nodes where a node consists of a block of memory (M), caches (C) and processors (P) which share a common bus. The nodes communicate by a distributed interconnect. An interconnect can be a crossbar or a mesh. Non-uniform memory access means that it will take longer to access some regions of memory than others. This is because a memory location may be on a physically different node. Communication between processors at different nodes is typically performed by message passing.

Figure 2.14: UMA/SMP Memory Hierarchy with Private Caches, adapted from Baer [13, 264]

As previously discussed, a disadvantage of UMP/SMP systems is that they are inherently unscalable. The NUMA architecture was designed to surpass the scalability limits of the SMP architecture by distributing memory. Distributing the memory among the nodes increases bandwidth and reduces the latency to local memory, Hennessy and Patterson [8]. Since the bottlenecks can be distributed throughout the node structure, scalability is improved. The main disadvantage of a NUMA system is that inter-node communication can be complex and place additional burden on the software engineering effort.

2.3.4.1 Cache Coherency

A multiprocessor system may hold multiple copies of the same data in the caches and memory and this is the case for both UMA and NUMA systems. The shared data values must all match each other so that the processors have a coherent view of memory. This is known as cache coherency. The definition of cache coherency is subtle and complex, Hennessy and Patterson[8], and includes visibility and ordering. Informally, a memory system is coherent if any read of a data item returns the most recently written value of that data item. The two common approaches to maintaining cache coherency are cache snooping and directory-based coherency.

2.3.4.2 Snooping Based Cache Coherency

A snooping-based cache coherency protocol is used in UMA/SMP-based systems. The caches connected to the shared bus listen to messages placed on the bus by all the other caches. The cache controllers then respond in the appropriate manner to implement the coherence protocol. There are a number of common cache coherency protocols, of which a common one is the Modified, Exclusive, Shared, Invalid (MESI) protocol. Figures 2.16 and 2.17 illustrate the basic finite state machine for the MESI protocol, for both the local and remote cache controllers. Each cache line is tagged with the states M, E, S, I and the states are updated by the cache controller. It should be noted that, in practical applications, the state machine would be much more complicated due to additional architectural features such as split transaction busses. Figure 2.16 illustrates the behaviour of a write back cache controller in master mode, i.e. when being driven by the local processor. Figure 2.17 illustrates the behaviour of the same cache controller when acting in slave mode, i.e. when snooping and responding to remote cache requests.

At some point during execution, the processor will attempt to read a variable. Since the cache line containing the variable is not currently in the cache it is Invalid and a read miss occurs. The read miss causes the cache controller to issue a read request on the bus; the data may either come from memory or from another cache if one holds it. When the local cache controller places a read request on the bus, a remote cache controller snooping on the bus may recognise that it holds the data Exclusively. If so, the remote cache controller writes the data on the bus and sets the status of its cache line to Shared. The local cache controller then receives the cache line, stores it in its cache and also sets the status of the copy to Shared. At this point there are two caches holding the data for the same cache line and the status of both is Shared. In the case where the remote cache does not hold the cache line, no remote response is given. In this case the data is read from

Figure 2.16: Basic MESI Cache Coherence Finite State machine - local cache controller, taken from [13, 274]

memory and the state for the local cache line is set to Exclusive.

If the processor writes to a variable which is stored in its local cache and a remote cache (hence both cache lines are set to Shared ), the write results in a write hit. The local processor performing the write will change the state of the cache line to Modified. The local cache line is now the only valid cached version of the data. The remote cache controllers snooping on the bus see the write for the cache line on the bus and set the state of its version to Invalid. This means that the remote cache line will effectively be discarded. If the local processor writes to a cache line which is in the Exclusive state, the state of the locally held cache line transitions to Modified. However, the write is not broadcast on the bus, since there is no remote copy of it.

2.3.4.3 Directory Based Cache Coherency

Snooping-based cache coherency protocols are used for UMA/SMP-based systems. However, snooping is inappropriate for NUMA systems since the act of broadcasting on all nodes causes coupling of the busses which would reduce the improved scalability of the NUMA architecture. Effectively, the separate physical busses become a single logical bus causing a bottleneck. Since a NUMA system is inherently distributed, the cache coherency protocol also needs to be distributed. Just as main memory is physically distributed throughout the machine to improve aggregate memory bandwidth, so too is the directory. This eliminates the bottleneck that would be caused by using a single monolithic directory.

Figure 2.17: Basic MESI Cache Coherence Finite State machine - remote cache controller, adapted from [13, 274]

In a simple NUMA architecture, a node may consist of a single processor, a single cache, a single main memory and a directory. The main memory for a node is divided into cache-line sized blocks. The directory then consists of state information for each main memory block including the MESI state and the home node of the memory line.

The following describes what may happen when a read miss occurs. When a remote processor attempts to read a variable not in the cache or local memory, the cache controller must send a request to the appropriate home node (the home node is the node whose memory contains the initial value of a memory line). The home node receives the request, and may respond in a number of ways.

If the home directory indicates that that the memory line is uncached or is cached but unmod- ified, the memory line will be sent to the remote node. The home directory changes the state of the memory line to Shared and notes that a copy of the memory line is held by the remote node. If the home directory indicates that the memory line has been modified by the home cache, the data is written back from the home node cache to the home node memory. The data is then sent to the remote node.

If the home directory indicates that the memory line has been modified by a second remote cache, then the home node requests the data. When the data is received at the home node, it is written to the home node memory and forwarded to the remote node.

Other state transitions are described by Baer [13] and Hennessy and Patterson [8]. However, the above simplification demonstrates the complexity inherent in a distributed cache directory.

In document A Syntax Directed Imperative Language Microprocessor for Reduced Power Consumption and Improved Performance (Page 47-52)