2.5 Hierarchical Memories
2.5.4 Memory Access in Multi-processors
To take advantage of the performance potential of today’s CPUs, programmers need to write multi-threaded applications that utilize the available CPUs and cores. When multiple CPUs or cores share a single main memory, several complications and architectural properties have to be dealt with.
To ensure program correctness, access to shared data is typically guarded by synchronization mechanisms, which ensure that concurrent accesses get serialized. Such mutual exclusion synchronization mechanisms often come with high overheads, and, even worse, have to be overly conservative. Section 2.5.4 introduces transactional memory, which has the potential of alleviating this software bottleneck by means of novel hardware mechanisms.
Also at the hardware level, we have to deal with problems like how to scale shared memory accesses, or how to correctly maintain multiple copies of a single piece of data, where each CPU might be modifying it independently in a local memory or cache. These are the topics of Section 2.5.4, where we discuss the prevalent NUMA architecture.
Transactional Memory
Traditional thread synchronization mechanisms have two major disadvantages. First, synchronization limits parallelism by serializing access to critical sections in memory. Typically, the coarser-grained the critical sections, the higher the penalty to parallelism. Introducing a larger number of more fine-grained locks, on the other hand, adds code complexity and makes it harder to write correct programs. The second problem with synchronization is that it has to be applied “statically” (i.e. at development time) and therefore extremely conservatively. Static synchronization is required every time a potential conflict might arise, even if the likelihood of a conflict is low.
Transactional memory, recently introduced by Intel in its Haswell architecture under the name transactional synchronization extensions (TSX) [Int12a], tries to work around these issues by letting programmers mark transactional sections in code, and letting the hardware detect “dynamically” (i.e. at run-time) whether threads need to serialize due to a conflict. This lets the processor expose and exploit concurrency in an application that would otherwise be hidden due to overly conservative static synchronization that would turn out to be unnecessary at run-time.
The processor executes each transactional region optimistically, without any synchronization. If a transactional region finishes, or commits, successfully, all memory operations performed within that region will appear to have occurred atomically (i.e. instantaneously) from the perspective of other logical processors. If the atomic commit mechanism of the hardware detects a conflict, the optimistic execution fails and the processor rolls back execution of the transactional region, a process called abort. On abort, the CPU discards all updates performed in the transactional region, reverts to a state as if optimistic execution had never occurred, and resumes execution in a serialized way.
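The commit-or-abort flow described above can be modelled in software. The sketch below is not TSX and uses no hardware support: `VersionedMemory`, `Transaction`, and `run_region` are hypothetical names, writes are buffered until commit, and a conflict is detected by comparing per-address version counters. It makes a single optimistic attempt and, on abort, re-runs the region under a lock, which is one possible reading of resuming execution in a serialized way.

```python
import threading


class Abort(Exception):
    """Raised when the modelled commit detects a conflicting write."""


class VersionedMemory:
    """Toy shared memory in which every address carries a version counter."""

    def __init__(self):
        self.values = {}
        self.versions = {}
        self.lock = threading.Lock()  # guards commits and the serialized path

    def read(self, addr):
        return self.values.get(addr, 0), self.versions.get(addr, 0)

    def write(self, addr, value):
        self.values[addr] = value
        self.versions[addr] = self.versions.get(addr, 0) + 1


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.seen = {}     # addr -> version observed at first access
        self.pending = {}  # addr -> buffered (uncommitted) value

    def load(self, addr):
        value, version = self.mem.read(addr)
        self.seen.setdefault(addr, version)
        return self.pending.get(addr, value)

    def store(self, addr, value):
        self.seen.setdefault(addr, self.mem.read(addr)[1])
        self.pending[addr] = value

    def commit(self):
        with self.mem.lock:
            for addr, version in self.seen.items():
                if self.mem.read(addr)[1] != version:
                    raise Abort(addr)  # someone wrote to our read- or write-set
            for addr, value in self.pending.items():
                self.mem.write(addr, value)


def run_region(mem, region):
    """Run region(load, store) optimistically; fall back to a lock on abort."""
    tx = Transaction(mem)
    try:
        region(tx.load, tx.store)  # optimistic execution, no lock held
        tx.commit()
        return "committed"
    except Abort:
        pass  # the buffered updates in tx.pending are simply discarded
    with mem.lock:  # fallback: serialized execution
        region(lambda addr: mem.read(addr)[0], mem.write)
    return "serialized"
```

A region is written as a function of `load` and `store`, for example `lambda load, store: store("x", load("x") + 1)`. Because updates stay in `pending` until commit, an abort needs no explicit undo step in this model.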
Several types of conflicts may arise during a commit. The most common are conflicting memory accesses between a transactionally executing processor core and another core. Intel’s TSX maintains a read-set and a write-set for each transactional region. The read-set consists of all memory addresses read within the region, while the write-set consists of all memory addresses written. A conflict occurs when some other processor reads a location that is part of the transaction’s write-set, or writes a location that is in either the read-set or the write-set of the transaction.
The read- and write-set are maintained at the granularity of a cache line, meaning that we can end up with false positives: accesses to distinct addresses that happen to share a cache line are treated as a conflict. Also, a transaction may be aborted when reaching certain implementation-dependent capacity limits, like the number of accesses in a region. Aborts lead to wasted CPU cycles and should be kept to a minimum.
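The conflict rule and its cache line granularity can be illustrated with a few lines of set arithmetic. This is a sketch only: the 64-byte line size is an assumption (the text does not state one), and `cache_line`, `to_lines`, and `conflicts` are hypothetical helper names, not part of any TSX interface.

```python
CACHE_LINE_BYTES = 64  # assumption: line size is not given in the text


def cache_line(addr):
    """Index of the cache line that contains a byte address."""
    return addr // CACHE_LINE_BYTES


def to_lines(addrs):
    """Map a collection of byte addresses to the set of lines they touch."""
    return {cache_line(a) for a in addrs}


def conflicts(tx_reads, tx_writes, other_reads, other_writes):
    """True if another core's accesses conflict with the transaction.

    Conflict: the other core reads a line in the transaction's write-set,
    or writes a line in the transaction's read-set or write-set.
    """
    read_set, write_set = to_lines(tx_reads), to_lines(tx_writes)
    other_r, other_w = to_lines(other_reads), to_lines(other_writes)
    return bool(other_r & write_set) or bool(other_w & (read_set | write_set))
```

Under the assumed line size, a transaction that writes address `0x1000` conflicts with another core writing `0x1008`, even though the two addresses hold different data. This is the false positive mentioned above, and it is a reason to keep independently updated variables on separate cache lines.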
NUMA
Traditional symmetric multi-processing (SMP) systems are built using a single main memory, connected to and managed by a single memory controller, which, in turn, is shared by all CPUs. Each CPU is typically equipped with a private cache, which replicates the most recently and frequently used parts of main memory. To tune cache utilization, it is important to try and “bind” software threads to a certain CPU, a concept called CPU affinity, support for which is typically provided by the OS.
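As an illustration of CPU affinity, the sketch below pins the caller to a single CPU using the Linux scheduling interface that Python exposes in its `os` module. `pin_to_cpu` is a hypothetical helper name, and these calls are not available on every operating system, so the function returns `None` where they are missing.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the caller to one CPU and return the previous affinity set.

    Returns None on platforms that do not expose the affinity API.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    previous = os.sched_getaffinity(0)  # pid 0 refers to the caller
    os.sched_setaffinity(0, {cpu})
    return previous
```

Keeping the returned set allows the original affinity to be restored later with `os.sched_setaffinity(0, previous)`.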
With multiple copies and versions of data being present at the same time, the need for cache coherence protocols, which guarantee correct program semantics under concurrent reads and writes, arises. In early SMP systems, such synchronization had to go through main memory, over the single shared bus, which is already a bottleneck in an environment where multiple CPUs are competing for the scarce memory bandwidth.
To improve scalability, the non-uniform memory access (NUMA) architecture was invented, where the limitation of a single, shared main memory is abandoned, allowing each CPU (node), multi-core or not, to have its own local memory controller and attached RAM. Memory of all nodes is aggregated into a logically unified memory, of which the actual access times can vary, depending on the physical location of the data being accessed. To support memory accesses on remote nodes, CPUs need to be connected by means of a high speed communication channel, together with protocols that support transparent remote memory accesses.

[Figure: two multi-core CPUs, each with four cores (CORE-0 to CORE-3) with per-core L1 and L2 caches, a shared L3 cache, a memory controller attached to local main memory, and QPI links to the other CPU and to an I/O hub.]
Figure 2.12: NUMA Architecture using Intel Quick Path Interconnect.

Data source                                  Latency
L3 cache hit, line unshared                  40 cycles
L3 cache hit, shared line in another core    65 cycles
L3 cache hit, modified in another core       75 cycles
Remote L3 cache                              100-300 cycles
Local DRAM                                   60 ns
Remote DRAM                                  100 ns
Table 2.3: Memory access latency for Intel Core i7 Nehalem [Lev09].
An example NUMA system is depicted in Figure 2.12, where we see two multi-core CPUs, each connected to a local main memory through an on-chip memory controller. The CPUs are also connected to each other by means of a point-to-point link, in this example Intel’s QuickPath Interconnect (QPI) [Int09]. These high speed links operate independently from the main memory bus, and are managed by a separate component. Such a setup allows for better scalability than the traditional shared memory systems.
Besides varying memory access latency, a NUMA architecture as depicted in Figure 2.12 suffers from non-uniform cache access (NUCA) as well [Lev09]. The reason for this is that access times to the shared L3 cache depend on the current state of cache coherence, i.e. whether a cache line being accessed is shared or not, and if so, whether it is modified in the local cache of some other core. Table 2.3 highlights some memory access performance numbers for our two-node NUMA example. Note that these measurements were taken on one particular CPU and depend on things like clock rate and the speed of the memory used.
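To get a feel for what the DRAM numbers in Table 2.3 mean for a program, the following sketch computes the mean DRAM latency as a function of how many accesses go to the remote node. It is a deliberately simple weighted average: `remote_fraction` is a hypothetical parameter, and caches, contention, and interconnect load are ignored.

```python
LOCAL_DRAM_NS = 60.0    # local DRAM latency from Table 2.3
REMOTE_DRAM_NS = 100.0  # remote DRAM latency from Table 2.3


def mean_dram_latency_ns(remote_fraction):
    """Weighted average DRAM latency for a given share of remote accesses."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    local_fraction = 1.0 - remote_fraction
    return local_fraction * LOCAL_DRAM_NS + remote_fraction * REMOTE_DRAM_NS
```

With a quarter of the accesses going to the remote node, the model gives 0.75 * 60 ns + 0.25 * 100 ns = 70 ns. Under these assumptions, placing data on the node that uses it recovers up to 40 ns per DRAM access.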
As with main memory in SMP systems, the shared last level cache (LLC), i.e. L3 in our example, will become more and more of a bottleneck as the number of cores per chip increases. As the last level cache needs to grow along with the number of cores, even physical wire delays between cores and the large cache area become an issue [KBK02]. Several hardware and hardware-aided software techniques are being researched to alleviate problems around shared cache scaling [BS13].