Memory Bus and DRAM - CMP Shared Resources

2.2 CMP Shared Resources

2.2.2 Memory Bus and DRAM

In a traditional random access device, all locations have the same latency. However, modern Dynamic Random Access Memory (DRAM) devices do not fully conform to this definition. To maximize bandwidth, DRAMs are commonly organized as a 3D matrix of bits with rows, columns and banks as its dimensions. Figure 2.5 illustrates this organization.

Commonly, a DRAM read transaction consists of first sending the row address, then the column address and finally receiving the data. When a row is accessed, its contents are stored in a register known as the row buffer, and a row is often referred to as a page. The row buffer is commonly much larger than a cache line to leverage the internal bandwidth of the DRAM module. If the row has to be activated before it can be read, the access is referred to as a row miss or page miss. It is possible to carry out repeated column accesses to an open page, called row hits or page hits. This is a great advantage as the latency of a row hit is much lower than the latency of a row miss [120]. The situation where two consecutive requests access the same bank but different rows is known as a row conflict and is very expensive in terms of latency. DRAM accesses are pipelined, so there are no idle cycles on the memory bus if the next column command is sent while the data transfer is in progress. Furthermore, command accesses to one bank can be overlapped with data transfers from a different bank.
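The three access outcomes described above can be made concrete with a small sketch. It is only an illustration under simplifying assumptions: each bank is modeled as holding at most one open row, a bank with no open row is represented by a missing dictionary entry, and the names Outcome, classify and open_rows are hypothetical and do not come from the cited work.

```python
from enum import Enum


class Outcome(Enum):
    ROW_HIT = "row hit"            # requested row is already in the row buffer
    ROW_MISS = "row miss"          # bank has no open row; the row must be activated
    ROW_CONFLICT = "row conflict"  # a different row is open; precharge, then activate


def classify(open_rows, bank, row):
    """Classify one access and record the row it leaves open.

    open_rows maps a bank number to its currently open row (assumed model:
    a bank that is absent from the dictionary has no open row).
    """
    current = open_rows.get(bank)
    if current is None:
        outcome = Outcome.ROW_MISS
    elif current == row:
        outcome = Outcome.ROW_HIT
    else:
        outcome = Outcome.ROW_CONFLICT
    open_rows[bank] = row  # after the access, the requested row is open
    return outcome
```

Repeated accesses to the same (bank, row) pair are classified as row hits after the first access, while alternating between two rows of the same bank yields a row conflict on every access.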

2.2.2.1 DRAM Scheduling

The 3D structure of rows, columns and banks makes DRAM subsystem throughput depend heavily on the order of memory requests. Therefore, throughput can be improved considerably by dynamically reordering requests to improve page locality. Rixner et al. [120] provide a convincing example that illustrates the benefits of out-of-order DRAM scheduling (Figure 2.6).

Figure 2.6: Simplified DRAM Access Reordering Example [120]
(a) Accesses Scheduled in Arrival Order: 42 DRAM Cycles Total Latency
(b) Accesses Reordered with the FR-FCFS Algorithm: 18 DRAM Cycles Total Latency
Requests shown: (0,0,0) (0,1,0) (0,0,1) (0,1,1) (1,0,0) (1,1,0) (1,0,1) (1,1,1)
Legend: (x,y,z) = (Bank, Row, Column); P = Precharge (Row Deactivation); A = Row Activation; C = Column Access

Figure 2.6(a) shows a memory access order that poorly utilizes the parallelism available in the DRAM subsystem. The example assumes a 3 DRAM cycle precharge latency, a 3 cycle row activation latency and a 1 cycle column access latency. Servicing requests in arrival order results in a total latency of 42 DRAM cycles. Figure 2.6(b) illustrates the possible latency improvement from reordering. Firstly, we take advantage of the fact that command accesses to different banks can be carried out in parallel. Secondly, we carry out all pending column accesses while the row is active. These two optimizations reduce the total latency to 18 DRAM cycles and improve memory bus utilization from 14% to 33%. Note that only one command can occupy the memory bus at a time and that precharge and activate commands are sent in their first cycle.

Rixner et al. [120] showed that reordering can be implemented with three rules:

1. Prioritize requests that can be issued in this cycle (i.e. ready commands) over commands that are not ready

2. Prioritize column commands over other commands

3. Prioritize older commands over younger commands
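One way to read the three rules is as a lexicographic priority order over the pending commands. The sketch below illustrates that reading and is not the implementation of Rixner et al. [120]. The Command record, its fields and the function names are hypothetical, and the timing checks that decide whether a command is ready are assumed to have been carried out elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Command:
    kind: str      # "precharge", "activate" or "column"
    arrival: int   # arrival time of the request the command belongs to
    ready: bool    # True if the command can be issued in this cycle


def frfcfs_key(cmd):
    # Tuples compare element by element and False sorts before True, so:
    #   rule 1: ready commands come first,
    #   rule 2: column commands come before other commands,
    #   rule 3: older (smaller arrival time) commands come first.
    return (not cmd.ready, cmd.kind != "column", cmd.arrival)


def pick_next(pending):
    """Return the highest-priority pending command, or None if there is none."""
    return min(pending, key=frfcfs_key) if pending else None
```

In this sketch, a ready column command that arrived at time 5 is selected ahead of a ready activate command that arrived at time 0, because rule 2 is applied before rule 3. A real controller would stall instead of issuing the selected command if that command is not ready.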


This scheduling algorithm is commonly referred to as First Ready - First Come First Served (FR-FCFS) scheduling, and a few researchers have proposed additions to it. Shao and Davis [125] proposed burst scheduling which clusters accesses to the same row into bursts. Furthermore, they prioritize reads over writes, but writes are piggybacked on read bursts to reduce the probability of blocking due to a full write queue. Zhu and Zhang [158] observed that performance could be improved further by taking criticality into account. A criticality-aware scheduler prioritizes the requests that contain the words that the processors are currently waiting for. Finally, Shao and Davis [126] observed that bank conflicts can be minimized by cleverly choosing the address mapping of banks, rows and columns.

2.2.2.2 Process-Aware DRAM Scheduling

The FR-FCFS scheduling algorithm does not differentiate between requests from different processes. Consequently, a process with good page locality can significantly delay the requests of other processes. To avoid this, Nesbit et al. [109] adapted network fair queuing for use in DRAM scheduling. Network fair queuing was originally used to provide fairness in packet switched networks [40]. Nesbit et al. augmented the DRAM scheduler with one Virtual Time Memory System (VTMS) model per process. Each VTMS is allocated a certain fraction of the bandwidth available in the real system, and this fraction determines the bandwidth allocation of the process. The finish time of a request in the VTMS is then used instead of the shared mode arrival time in rule 3 of the FR-FCFS scheduling algorithm. Furthermore, Nesbit et al. limit the amount of reordering to prevent a process with good page locality from significantly delaying requests with lower virtual finish times from other processes.
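The role of the virtual finish time can be illustrated with the textbook fair queuing recurrence, in which a request starts when it arrives or when the previous request of the same process finishes, whichever is later, and its service time is stretched by the inverse of the bandwidth share. The sketch below is a deliberately simplified, single-resource illustration and not the VTMS model of Nesbit et al. [109]. The class name, its fields and the service_cycles parameter are assumptions made for this example.

```python
class VirtualTimeMemorySystem:
    """Simplified per-process virtual clock (illustrative assumption only)."""

    def __init__(self, share):
        if not 0 < share <= 1:
            raise ValueError("share must be a fraction in (0, 1]")
        self.share = share        # fraction of the real bandwidth allocated
        self.last_finish = 0.0    # virtual finish time of the previous request

    def finish_time(self, arrival, service_cycles):
        # A request cannot start before it arrives or before the previous
        # request of the same process has finished in virtual time.
        start = max(arrival, self.last_finish)
        # With a share of 1/4, each request takes four times as long in the
        # virtual system as it would with the whole memory system to itself.
        self.last_finish = start + service_cycles / self.share
        return self.last_finish
```

A scheduler in this style would compare the returned finish times in place of arrival times in rule 3, so a process that issues requests faster than its share allows accumulates ever later virtual finish times and yields to processes that stay within their allocation.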

Network fair queuing distributes DRAM bandwidth in a fair manner. However, fairly distributing DRAM bandwidth does not necessarily result in a fair division of DRAM latency [102, 119]. Rafique et al. [119] extended the work of Nesbit et al. [109] by using an adaptive technique to tune bandwidth shares to achieve the desired average latencies. In addition, Mutlu and Moscibroda [102] provide a scheduler that equalizes the memory-related stall time of different processes. Finally, Iyer et al. [64] showed that the reordering mechanism can be used to differentiate between priority classes by letting requests from high-priority processes bypass the requests of low-priority processes.

The schedulers discussed so far augment the FR-FCFS algorithm with additional rules. Mutlu and Moscibroda [103] approach the problem from a different angle with their batch scheduling technique. Here, they create batches of requests based on arrival time and the processor from which each request originated. A batch is a group of requests with a limited size, and the batch containing the oldest request is serviced before other batches to avoid starvation and provide fairness. Within a batch, requests are scheduled to preserve bank parallelism, which improves throughput. In addition, Ipek et al. [60] showed how machine learning could be applied to the access scheduling problem.
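To make the idea of a size-limited batch concrete, the sketch below forms a batch by taking at most a fixed number of the oldest requests from each processor. This is one possible interpretation of the description above and not the algorithm of Mutlu and Moscibroda [103]. The per-processor cap, the tuple layout and the function name form_batch are assumptions made for this example, and the scheduling of requests within a batch is not modeled.

```python
from collections import defaultdict


def form_batch(queue, cap):
    """Split a request queue into (batch, remaining).

    queue is a list of (arrival_time, processor) tuples. The batch holds at
    most `cap` of the oldest requests from each processor (assumed rule).
    """
    taken = defaultdict(int)   # number of requests already batched per processor
    batch, remaining = [], []
    for request in sorted(queue):   # oldest first (ties broken by processor)
        _, processor = request
        if taken[processor] < cap:
            taken[processor] += 1
            batch.append(request)
        else:
            remaining.append(request)
    return batch, remaining
```

Because the batch always contains the oldest outstanding request and is serviced before newer batches, a processor that floods the queue cannot indefinitely delay a single request from another processor.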

Figure 2.7: 3x3 Mesh Network on Chip (each of the nine nodes is labeled Processor, Cache and R)

In document Managing Shared Resources in Chip Multiprocessor Memory Systems (Page 39-42)