2.7 Caches and Memory Hierarchy
2.7.3.3 Directory-based cache coherence protocols
Snooping protocols rely on the existence of a shared broadcast medium like a bus or a switch through which all memory accesses are transferred. This is typically the case for multicore processors or small SMP systems. But for larger systems, such a shared medium often does not exist and other mechanisms have to be used.
A simple solution would be not to support cache coherence at the hardware level. With this approach, the local caches would only store memory blocks of the local main memory. There would be no hardware support for storing memory blocks from the memory of other processors in the local cache. Instead, software support could be provided, but this requires more effort from the programmer and is typically not as fast as a hardware solution.
An alternative to snooping protocols are directory-based protocols. These do not rely on a shared broadcast medium. Instead, a central directory is used to store the state of every memory block that may be held in cache. Instead of observing a shared broadcast medium, a cache controller can get the state of a memory block by a lookup in the directory. The directory can be held shared, but it could also be distributed among different processors to avoid bottlenecks when the directory is accessed by many processors. In the following, we give a short overview of directory-based protocols. For a more detailed description, we refer again to [41, 94].
As an example, we consider a parallel machine with a distributed memory. We assume that for each local memory a directory is maintained that specifies for each memory block of the local memory which caches of other processors currently store a copy of this memory block. For a parallel machine with p processors, the directory can be implemented by maintaining a bit vector with p presence bits and a number of state bits for each memory block. Each presence bit indicates whether a specific processor has a valid copy of this memory block in its local cache (value 1) or not (value 0). An additional dirty bit is used to indicate whether the local memory contains a valid copy of the memory block (value 0) or not (value 1). Each directory is maintained by a directory controller which updates the directory entries according to the requests observed on the network.
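The directory entry just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name DirectoryEntry and its methods are hypothetical, and the presence bits are packed into a single integer used as a bit vector.

```python
class DirectoryEntry:
    """Directory state for one memory block (hypothetical layout)."""

    def __init__(self, p):
        self.p = p            # number of processors
        self.presence = 0     # bit j is 1 if processor j holds a valid copy
        self.dirty = 0        # 0: home memory is valid, 1: home memory is stale

    def set_presence(self, j, value):
        """Set or clear the presence bit of processor j."""
        if value:
            self.presence |= 1 << j
        else:
            self.presence &= ~(1 << j)

    def has_copy(self, j):
        """Return True if the presence bit of processor j is 1."""
        return (self.presence >> j) & 1 == 1

    def sharers(self):
        """Return all processors whose presence bit is 1."""
        return [j for j in range(self.p) if self.has_copy(j)]


entry = DirectoryEntry(p=4)
entry.set_presence(0, 1)
entry.set_presence(2, 1)
entry.set_presence(0, 0)   # processor 0 no longer holds a copy
print(entry.sharers(), entry.dirty)
```

With p presence bits and one dirty bit per block, the directory storage grows with the number of processors, which is one reason why practical protocols use more compact encodings.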
Figure 2.34 illustrates the organization. In the local caches, the memory blocks are marked with M (modified), S (shared), or I (invalid), depending on their state, similar to the snooping protocols described above. The processors access the memory system via their local cache controllers. We assume a global address space, i.e., each memory block has a memory address which is unique in the entire parallel system.
Fig. 2.34 Directory-based cache coherency. [Figure labels: processor, cache, memory, directory, interconnection network]
When a read miss or write miss occurs at a processor i, the associated cache controller contacts the local directory controller to obtain information about the accessed memory block. If this memory block belongs to the local memory and the local memory contains a valid copy (dirty bit 0), the memory block can be loaded into the cache with a local memory access. Otherwise, a nonlocal (remote) access must be performed. A request is sent via the network to the directory controller at the processor owning the memory block (home node). For a read miss, the receiving directory controller reacts as follows:
• If the dirty bit of the requested memory block is 0, the directory controller retrieves the memory block from local memory and sends it to the requesting node via the network. The presence bit of the requesting processor i is set to 1 to indicate that i has a valid copy of the memory block.
• If the dirty bit of the requested memory block is 1, there is exactly one processor j which has a valid copy of the memory block; the presence bit of this processor is 1. The directory controller sends a corresponding request to this processor j. The cache controller of j sets the local state of the memory block from M to S and sends the memory block both to the home node of the memory block and to the processor i from which the original request came. The directory controller of the home node stores the current value in the local memory, sets the dirty bit of the memory block to 0, and sets the presence bit of processor i to 1. The presence bit of j remains 1.
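The two read-miss cases can be sketched as follows. This is a simplified model under our own assumptions, not an actual implementation: the function name handle_read_miss is hypothetical, a directory entry is a plain dictionary, each cache is a dictionary mapping a block to its state and value, and network messages are replaced by direct assignments. We also assume that the requesting cache stores the block in state S.

```python
def handle_read_miss(entry, caches, memory, block, requester):
    """Reaction of the home node to a read miss (hypothetical sketch).

    entry  -- {"presence": list of 0/1, "dirty": 0 or 1} for this block
    caches -- caches[p][block] is {"state": "M"|"S"|"I", "value": ...}
    memory -- local memory of the home node, maps block -> value
    """
    if entry["dirty"] == 1:
        # Exactly one processor j holds the only valid copy in state M.
        owner = entry["presence"].index(1)
        line = caches[owner][block]
        line["state"] = "S"               # owner changes M to S
        memory[block] = line["value"]     # home memory becomes valid again
        entry["dirty"] = 0                # presence bit of the owner stays 1
    # The home memory is valid now; the block is delivered to the requester.
    entry["presence"][requester] = 1
    caches[requester][block] = {"state": "S", "value": memory[block]}
    return memory[block]


# Processor 1 holds block "A" in state M; processor 0 has a read miss.
memory = {"A": 10}
caches = [dict() for _ in range(3)]
caches[1]["A"] = {"state": "M", "value": 42}
entry = {"presence": [0, 1, 0], "dirty": 1}
value = handle_read_miss(entry, caches, memory, "A", requester=0)
print(value, entry)
```

In the protocol itself, processor j sends the block to the home node and to the requester at the same time; the sketch routes the value through the home memory only to keep the code short.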
For a write miss, the receiving directory controller does the following:
• If the dirty bit of the requested memory block is 0, the local memory of the home node contains a valid copy. The directory controller sends an invalidation request to all processors j for which the presence bit is 1. The cache controllers of these processors set the state of the memory block to I. The directory controller waits for an acknowledgment from these cache controllers, sets the presence bit for these processors to 0, and sends the memory block to the requesting processor i. The presence bit of i is set to 1, the dirty bit is also set to 1. After having received the memory block, the cache controller of i stores the block in its cache and sets its state to M.
• If the dirty bit of the requested memory block is 1, the memory block is requested from the processor j whose presence bit is 1. Upon arrival, the memory block is forwarded to processor i, the presence bit of i is set to 1, and the presence bit of j is set to 0. The dirty bit remains at 1. The cache controller of j sets the state of the memory block to I.
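The write-miss handling can be sketched in the same style. Again, this is a minimal model under our own assumptions: handle_write_miss is a hypothetical name, directory entries and caches are plain dictionaries, and invalidation requests and acknowledgments are replaced by direct updates.

```python
def handle_write_miss(entry, caches, memory, block, requester):
    """Reaction of the home node to a write miss (hypothetical sketch).

    entry  -- {"presence": list of 0/1, "dirty": 0 or 1} for this block
    caches -- caches[p][block] is {"state": "M"|"S"|"I", "value": ...}
    memory -- local memory of the home node, maps block -> value
    """
    if entry["dirty"] == 0:
        # Home memory is valid: invalidate every processor with presence bit 1.
        for j, bit in enumerate(entry["presence"]):
            if bit == 1:
                if block in caches[j]:        # copy may already be replaced
                    caches[j][block]["state"] = "I"
                entry["presence"][j] = 0
        value = memory[block]
    else:
        # Processor j holds the only valid copy; fetch it and invalidate j.
        owner = entry["presence"].index(1)
        value = caches[owner][block]["value"]
        caches[owner][block]["state"] = "I"
        entry["presence"][owner] = 0
    # The requester becomes the exclusive holder of the block.
    entry["presence"][requester] = 1
    entry["dirty"] = 1
    caches[requester][block] = {"state": "M", "value": value}
    return value


# Processors 1 and 2 share block "A"; processor 0 has a write miss.
memory = {"A": 10}
caches = [dict() for _ in range(3)]
caches[1]["A"] = {"state": "S", "value": 10}
caches[2]["A"] = {"state": "S", "value": 10}
entry = {"presence": [0, 1, 1], "dirty": 0}
value = handle_write_miss(entry, caches, memory, "A", requester=0)
print(value, entry)
```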
When a memory block with state M should be replaced by another memory block in the cache of processor i, it must be written back into its home memory, since this is the only valid copy of this memory block. To do so, the cache controller of i sends the memory block to the directory controller of the home node. The directory controller writes the memory block back to the local memory and sets the dirty bit of the block and the presence bit of processor i to 0.
A cache block with state S can be replaced in a local cache without sending a notification to the responsible directory controller. Sending a notification, however, avoids the responsible directory controller sending an unnecessary invalidation message to the replacing processor in case of a later write miss as described above.
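The two replacement cases can be sketched as follows. The function name replace_block and the notify parameter are hypothetical, and we assume that an optional notification for a block in state S simply clears the presence bit of the replacing processor.

```python
def replace_block(entry, cache, memory, block, processor, notify=False):
    """Remove `block` from the cache of `processor` (hypothetical sketch).

    entry  -- {"presence": list of 0/1, "dirty": 0 or 1} for this block
    cache  -- cache of the replacing processor, maps block -> state and value
    memory -- local memory of the home node, maps block -> value
    """
    line = cache.pop(block)
    if line["state"] == "M":
        # Only valid copy: write it back and clear dirty and presence bit.
        memory[block] = line["value"]
        entry["dirty"] = 0
        entry["presence"][processor] = 0
    elif line["state"] == "S" and notify:
        # Optional notification keeps the directory entry exact.
        entry["presence"][processor] = 0
    return line["state"]


memory = {"A": 10, "B": 5}
entry_a = {"presence": [1, 0], "dirty": 1}
entry_b = {"presence": [1, 1], "dirty": 0}
cache0 = {"A": {"state": "M", "value": 99}, "B": {"state": "S", "value": 5}}

replace_block(entry_a, cache0, memory, "A", processor=0)  # write-back
replace_block(entry_b, cache0, memory, "B", processor=0)  # silent replacement
print(memory, entry_a, entry_b)
```

After the silent replacement, the presence bit of processor 0 for block B is still 1, so a later write miss on B would trigger an invalidation message to processor 0 although it no longer holds a copy.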
The directory protocol just described is kept quite simple. Directory protocols used in practice are typically more complex and contain additional optimizations to reduce the overhead as far as possible. Directory protocols are typically used for distributed memory machines as described, but they can also be used for shared memory machines. Examples are the Sun T1 and T2 processors, see [94] for more details.