
2.5 Hierarchical Memories

2.5.1 The “Memory Wall”

Main memory, or random access memory (RAM), provides storage of computer data. By “random access” we mean that data is accessible directly, by means of a unique memory address. A memory address represents an offset into a sequential address space, typically providing access at the granularity of a byte. As with CPUs, main memory is built as an integrated circuit on a “memory chip”. We can broadly categorize memory chips into two groups: dynamic and static random access memory, DRAM and SRAM respectively. Both are volatile, in the sense that they need electrical current to retain their state.
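
As an informal illustration (not from the thesis), the following C fragment shows what byte-granular random access means in practice: any byte can be read directly by adding its offset to a base address, without touching the bytes that precede it.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t buffer[16] = {0};
        buffer[13] = 42;               /* write one byte at offset 13 in the region  */

        uint8_t *base  = buffer;       /* base address of the region                 */
        uint8_t value  = *(base + 13); /* direct access: base address + byte offset  */

        printf("byte at offset 13 = %u\n", (unsigned)value);
        return 0;
    }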

Main memory is typically built from DRAM chips mounted on dual in-line memory modules (DIMMs). DRAM stores a memory bit in a single transistor-capacitor pair, which allows for high density and cost-effective mass production. The capacitor holds a high or low voltage, i.e. a 1 or 0 bit. The transistor acts as a switch to allow for reading or changing the capacitor. DRAM needs to be refreshed periodically to keep the capacitors from losing their charge. It is called “dynamic” because of this refreshing requirement.

Static RAM, in contrast, does not have such a refreshing requirement. It is built using a fast but relatively complex state-storing circuit called a flip-flop, which needs four to six transistors. It is therefore less dense and more expensive than DRAM, and is not used for high-capacity, low-cost main memory in commodity desktop and server systems. The register file on a CPU die is typically implemented using static RAM.


Table 2.2 summarizes the technical evolution of DRAM technology used for commodity main memory chips. The table refers to SDRAM, or synchronous dynamic random access memory, which has nothing to do with SRAM and is simply a synchronized, or “clocked”, DRAM variant. When looking at memory latency, which represents the time after which the first bit of a requested memory address becomes available for reading, we can clearly see that improvements are relatively slow. Once the requested data has become available, however, it can be transferred at exponentially improving data transfer rates (i.e. bandwidth). This means that, over time, larger sequential data accesses become more and more favorable, as they help amortize the access latency, a trend that is supported by a correlated growth in memory storage capacities.
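
To make the amortization argument concrete, the following back-of-the-envelope sketch (with assumed, illustrative values for latency and peak bandwidth, not figures taken from Table 2.2) computes the effective transfer rate of a single access as size / (latency + size / peak bandwidth); only large sequential transfers get close to the peak.

    #include <stdio.h>

    int main(void) {
        const double latency_ns = 50.0;      /* assumed access latency (ns)          */
        const double peak_gb_s  = 12.8;      /* assumed peak bandwidth (GB/s)        */
        const double peak_b_ns  = peak_gb_s; /* 12.8 GB/s equals 12.8 bytes per ns   */

        /* effective bandwidth for transfer sizes from 64 bytes up to 256 KB */
        for (long size = 64; size <= 1 << 20; size <<= 4) {
            double transfer_ns = latency_ns + size / peak_b_ns;
            double effective   = size / transfer_ns;   /* bytes per ns == GB/s */
            printf("%8ld bytes: %6.2f GB/s effective\n", size, effective);
        }
        return 0;
    }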

One technological advance that Table 2.2 does not show is that of multi-channel memory. It effectively multiplies the bandwidth of a memory bus by allowing multiple memory modules to be attached to that bus, each with its own 64 data lines (but shared address and control lines). This allows a single memory controller to access multiple memory modules in parallel, thereby effectively doubling, tripling, or even quadrupling the theoretical maximum bandwidth. With the trend towards memory controllers on the CPU die, multi-channel memory has become the de facto standard, with dual-channel being used in desktop systems, and triple- or quadruple-channel being available only in the memory controllers of high-end desktop and server processors.
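
The bandwidth multiplication can be checked with simple arithmetic: each channel transfers 8 bytes (64 data lines) at a time, so the theoretical peak is channels × 8 bytes × transfer rate. The sketch below uses DDR3-1600 (1600 MT/s) purely as an illustrative assumption.

    #include <stdio.h>

    int main(void) {
        const double transfers_per_sec  = 1600e6; /* e.g. DDR3-1600: 1600 MT/s      */
        const double bytes_per_transfer = 8.0;    /* 64-bit data bus per channel    */

        for (int channels = 1; channels <= 4; channels++) {
            double gb_per_sec =
                channels * bytes_per_transfer * transfers_per_sec / 1e9;
            printf("%d channel(s): %5.1f GB/s theoretical peak\n",
                   channels, gb_per_sec);
        }
        return 0;
    }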

With DDR4, which will be supported by Intel’s 2014 high-end Haswell CPUs, a point-to-point topology will be used, where each memory channel is connected to a single module only, and parallel access is regulated by the memory controller. The goal is to simplify the timing of the memory bus by moving parallelism from the memory interface to the controller, thereby allowing for faster bus timings and therefore higher transfer rates. The disadvantage, however, is that for maximum performance each memory slot needs to be occupied by a DIMM. Contrast this with, for example, a four-slot dual-channel setup, where only two DIMMs need to be inserted to benefit from the maximum (two times) bandwidth increase.

Not only are improvements in memory latency lagging behind those in bandwidth, but an even stronger discrepancy can be seen in Figure 2.9, whose left side shows the relative improvement (with 1980 as a baseline) in both CPU and memory latency (the figure actually shows normalized inverses of the latency, to signify an improving trend). We see that (the inverse of) CPU latency, i.e. the number of nanoseconds to execute an instruction, has been improving at much higher rates than memory access latency. This has resulted in a roughly hundred-fold improvement of CPU latency versus a meager six-fold improvement of memory access times over the same period. From the processor’s perspective, i.e. measured in clock cycles, a memory access is becoming more and more expensive.
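
A rough illustration of this effect, using assumed round numbers rather than the data behind Figure 2.9: expressing memory latency in clock cycles (latency in ns × clock frequency in GHz) shows how an access that once cost a handful of cycles now costs hundreds.

    #include <stdio.h>

    int main(void) {
        /* illustrative, assumed values; not figures from the thesis */
        struct { const char *era; double clock_ghz; double mem_latency_ns; } sys[] = {
            { "early 1980s", 0.005, 400.0 },  /* ~5 MHz CPU, slow DRAM */
            { "around 2013", 3.0,    70.0 },  /* ~3 GHz CPU, DDR3 DRAM */
        };

        for (int i = 0; i < 2; i++) {
            double cycles = sys[i].mem_latency_ns * sys[i].clock_ghz;
            printf("%s: a memory access costs roughly %.0f clock cycles\n",
                   sys[i].era, cycles);
        }
        return 0;
    }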

The growing discrepancy between CPU and memory speeds is not only present at the latency level. When looking at relative improvements of both CPU and memory bandwidth, as shown in the right-hand side of Figure 2.9, we see a growing gap as well. The relatively poor memory bandwidth has been termed the von Neumann bottleneck [Bac78], and is attributed to the word-sized bus between CPU and memory, which is responsible for transferring all memory reads, both data and instructions, and writes.

[Figure 2.9: Relative improvements in DRAM technology over time (logarithmic scales). Left panel: normalized inverse RAM latency and normalized inverse CPU latency, 1980–2013. Right panel: normalized RAM bandwidth (MB/sec) and normalized CPU bandwidth (MIPS), 1980–2013.]

[Figure: the memory hierarchy, ranging from small, fast, and expensive storage (CPU registers, CPU cache, shared cache) to large, slow, and cheap storage (main memory, solid state drive, hard disk drive).]

The disparity between latency and bandwidth within CPUs and memory was already detected, and predicted to grow, in [WM95], where the term “memory wall” was coined. Its existence can largely be attributed to the use of DRAM memory chips. Not only is DRAM a slower technology than SRAM, but the fact that memory is sold as separate chips, i.e. “modules”, implies that it is placed outside the CPU die. Such off-chip communication is limited both in bandwidth, as the CPU and memory modules have to interface with a memory bus, and in latency, due to the physical distance between CPU and memory that signals have to cover.
