Background: Main Memory - Predictable Shared Memory Resources for Multi-Core Real-Time Systems

As Figure 4.1 illustrates, a DRAM is organized in dual in-line memory modules (DIMMs), and each DIMM consists of multiple DRAM chips. Each DRAM chip consists of memory cells arranged as banks, and the cells in each bank are organized in rows and columns. A DRAM rank is a group of banks. Accesses to different ranks or banks can be interleaved to reduce the DRAM latency. We define the memory granularity as the maximum amount of data that a DRAM can transfer when interleaving across all banks; it is equal to BL × nbanks × CLW, where BL is the burst length that can be 4B or 8B, nbanks is the number of banks the access interleaves across, and CLW is the column width in bytes. For multi-channel DRAMs, each channel has its own buses and consists of one or more ranks. Accesses to different channels, like accesses to different ranks and banks, can be interleaved to reduce their access latency. On the other hand, accesses to different rows in the same bank suffer from row conflicts and encounter larger latencies. Data is transferred to and from the memory cells via sense amplifiers. These sense amplifiers work as a row buffer that caches the most recently accessed row in each bank. A DRAM request consists of a type and an address. The type is either a read or a write. DRAM accesses are controlled by the memory controller (MC), which translates the read/write requests into one or more of the following DRAM commands: ACTIVATE (A), READ (R), WRITE (W), PRECHARGE (P), and REFRESH (REF). The A command fetches the row from the memory cells to the sense amplifiers (row buffer). The R (W) command reads (writes) the required columns in the row buffer. The P command closes the activated row and prepares the cell array for the next memory access by restoring the charge level of each DRAM cell in the row. Finally, the REF command activates and precharges DRAM rows to prevent data loss from charge leakage.

Figure 4.2: A write access followed by a write or read access targeting the same bank and rank for close-page policy. [Timing diagram: command bus with A1, W1, P1, A2, R2; data bus with D1, D2.]
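The memory granularity defined earlier in this section (BL × nbanks × CLW) can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the example values (a burst length of 8 and an 8-byte column width) are assumptions chosen for the example, not values taken from the text.

```python
def memory_granularity(burst_length: int, n_banks: int, column_width_bytes: int) -> int:
    """Bytes transferred by one access that interleaves across n_banks banks.

    Implements granularity = BL * nbanks * CLW from the text.
    """
    return burst_length * n_banks * column_width_bytes


if __name__ == "__main__":
    # Assumed example values (not from the text): BL = 8, CLW = 8 bytes.
    for n_banks in (1, 2, 4):
        print(f"nbanks={n_banks}: {memory_granularity(8, n_banks, 8)} bytes")
```

Under these assumed values, interleaving over more banks scales the amount of data moved per access linearly.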

The DRAM JEDEC standard [11] imposes strict timing constraints on these commands (Table 4.1). All MC designs must satisfy these constraints to ensure correct DRAM behaviour. We use Figure 4.2 to illustrate the meaning of these constraints. It shows a write access followed by a write or read access targeting the same bank and rank. In Figure 4.2, tRCD cycles are required between A1 and W1, and between A2 and R2. tWL cycles are required between the issuance of W1 and the start of writing data to the DRAM. Similarly, tRL cycles are required between the issuance of R2 and the start of reading data from the DRAM. Each data transfer then takes tBUS cycles. tWR cycles are necessary between the end of the data write and the P1 command, and tRP cycles are required between P1 and A2. These timing constraints are set by the physical properties of the DRAM.
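To make the sequence in Figure 4.2 concrete, the following sketch computes the earliest cycle at which each event of a write followed by a read to the same bank can occur, using the DDR3-1333 cycle counts listed in Table 4.1. It is a simplified reading of the constraints, not the author's analysis: the function name is hypothetical, A1 is assumed to issue at cycle 0, and the `max` terms for tRAS and tRC reflect my interpretation of their definitions in Table 4.1.

```python
# DDR3-1333 timing constraints in cycles, copied from Table 4.1.
DDR3_1333 = {
    "tRCD": 9, "tWL": 7, "tRL": 9, "tBUS": 4,
    "tWR": 10, "tRP": 9, "tRAS": 24, "tRC": 33,
}


def write_then_read_same_bank(t=DDR3_1333):
    """Earliest cycle of each event in a close-page write followed by a read.

    Assumes A1 is issued at cycle 0 and no other requests interfere.
    """
    a1 = 0
    w1 = a1 + t["tRCD"]                          # row must be activated first
    d1_start = w1 + t["tWL"]                     # write data appears on the bus
    d1_end = d1_start + t["tBUS"]
    p1 = max(d1_end + t["tWR"], a1 + t["tRAS"])  # write recovery and A-to-P minimum
    a2 = max(p1 + t["tRP"], a1 + t["tRC"])       # precharge time and A-to-A minimum
    r2 = a2 + t["tRCD"]
    d2_start = r2 + t["tRL"]                     # read data appears on the bus
    d2_end = d2_start + t["tBUS"]
    return {
        "A1": a1, "W1": w1, "D1_start": d1_start, "D1_end": d1_end,
        "P1": p1, "A2": a2, "R2": r2, "D2_start": d2_start, "D2_end": d2_end,
    }


if __name__ == "__main__":
    for event, cycle in write_then_read_same_bank().items():
        print(f"{event:9s} cycle {cycle}")
```

With the Table 4.1 values, P1 can issue at cycle 30 and A2 at cycle 39, so in this scenario the tWR and tRP constraints, rather than tRAS and tRC, determine when the second access can start.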

Typically, an MC implements an arbitration scheme, an address mapping, and a page policy. The arbitration scheme arbitrates amongst different requests. The address mapping translates each request address into five segments: channel (ch), rank (rnk), bank (bnk), row (rw), and column (cl). We refer to the number of bits assigned to the channel, rank, bank, row, and column indices as CNW, RKW, BKW, RWW, and CLW, respectively. This makes the physical address PW = CNW + RKW + BKW + RWW + CLW bits wide. The page policy controls how long a row stays live in the row buffer.
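The address mapping described above can be sketched as a bit-field split. The text does not fix the order of the five segments, so the layout used here (channel in the most significant bits, column in the least significant bits) is an assumption made for illustration, as are the names `DramAddress` and `decode_address`.

```python
from typing import NamedTuple


class DramAddress(NamedTuple):
    ch: int   # channel index
    rnk: int  # rank index
    bnk: int  # bank index
    rw: int   # row index
    cl: int   # column index


def decode_address(addr: int, cnw: int, rkw: int, bkw: int, rww: int, clw: int) -> DramAddress:
    """Split a PW-bit physical address into its five segments.

    Assumed layout, most to least significant: ch | rnk | bnk | rw | cl.
    """
    fields = []
    # Peel the fields off starting from the least significant bits.
    for width in (clw, rww, bkw, rkw, cnw):
        fields.append(addr & ((1 << width) - 1))
        addr >>= width
    cl, rw, bnk, rnk, ch = fields
    return DramAddress(ch, rnk, bnk, rw, cl)


if __name__ == "__main__":
    # Hypothetical widths: PW = 1 + 1 + 3 + 14 + 10 = 29 bits.
    print(decode_address(0x1548D02A, cnw=1, rkw=1, bkw=3, rww=14, clw=10))
```

A different segment order, for example placing the bank bits below the row bits, changes which consecutive addresses fall in the same row and therefore interacts with the page policy.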

Table 4.1: Important JEDEC timing constraints (DDR3-1333) [11].

| Const. | Meaning | Cyc. |
|---|---|---|
| tRC | Minimum time between A commands to the same bank. | 33 |
| tCCD | Column-to-column delay. | 4 |
| tRP | Row precharge time. | 9 |
| tBUS | request size / (data bus size × 2): time required to transfer a data burst. | 4 |
| tRAS | Minimum time between an A command and a P command. | 24 |
| tWL | Minimum time between W and the start of data transfer. | 7 |
| tRL | Minimum time between R and the start of data transfer. | 9 |
| tRCD | Minimum time between activating the row and accessing it. | 9 |
| tFAW | Four-bank activation window in the same rank. | 20 |
| tRTRS | Rank-to-rank switching delay. | 1 |
| tRTP | Read-to-precharge delay. | 5 |
| tWTR | Write-to-read switching delay. | 5 |
| tWR | Write recovery delay. | 10 |
| tREFI | Refresh period. | 7.8 µs |
| tRFC | Time required to refresh. | 160 ns |
| RKtoRK | (tBUS + tRTRS): rank switching delay. | |
| RtoW | (tRL + tBUS + tRTRS − tWL): R-to-W delay. | |
| WtoR_B | (tWL + tBUS + tWTR): W-to-R delay in the same rank. | |
| WtoR_RK | (tWL + tBUS + tRTRS − tRL): W-to-R delay in different ranks. | |
| RtoP | (tBUS + tRTP − tCCD): R-to-P delay. | |
| WtoP | (tWL + tBUS + tWR): W-to-P delay. | |
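The derived delays at the bottom of Table 4.1 are sums and differences of the basic constraints, so they can be evaluated directly from the DDR3-1333 cycle counts given in the table. The sketch below does only that arithmetic; the dictionary and function names are hypothetical, and the resulting numbers are computed here rather than quoted from the source.

```python
# Basic DDR3-1333 constraints in cycles, copied from Table 4.1.
DDR3_1333 = {
    "tBUS": 4, "tRTRS": 1, "tRL": 9, "tWL": 7,
    "tWTR": 5, "tRTP": 5, "tCCD": 4, "tWR": 10,
}


def derived_delays(t=DDR3_1333):
    """Evaluate the derived delay formulas listed in Table 4.1."""
    return {
        "RKtoRK": t["tBUS"] + t["tRTRS"],
        "RtoW": t["tRL"] + t["tBUS"] + t["tRTRS"] - t["tWL"],
        "WtoR_B": t["tWL"] + t["tBUS"] + t["tWTR"],
        "WtoR_RK": t["tWL"] + t["tBUS"] + t["tRTRS"] - t["tRL"],
        "RtoP": t["tBUS"] + t["tRTP"] - t["tCCD"],
        "WtoP": t["tWL"] + t["tBUS"] + t["tWR"],
    }


if __name__ == "__main__":
    for name, cycles in derived_delays().items():
        print(f"{name:8s} {cycles} cycles")
```

For these values, a write-to-read switch within the same rank (WtoR_B) costs 16 cycles, whereas the same switch across ranks (WtoR_RK) costs 3, which is the gap that rank-switching controllers exploit.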

4.2.1 Memory Page Policies

There are two main page policies for accessing DRAMs: close-page and open-page. These policies manage how long data remains available in the row buffer. Close-page policy writes back the data in the row buffer and flushes the row buffer after each request. Under close-page policy, each request consists of an A, a CAS (R or W), and a P command. Hence, every request takes the same amount of access time, which helps in deriving predictable latencies. Open-page policy, on the other hand, leaves the data in the row buffer so that future accesses to data within the buffer are served faster than if the data had to be read from the memory cells into the row buffer again. MCs deploying open-page policy keep the row open until a request to another row arrives or the refresh period is reached. This makes open-page policy faster than close-page policy in the average case. The primary drawback of open-page policy is that requests have a larger worst-case latency (WCL). This WCL occurs when a request targets a different row than the opened row, which requires precharging the opened row before loading the requested row into the row buffer. For these reasons, MCs in high-performance architectures often use open-page policy [81], while predictable MCs typically prefer close-page policy [23,59,72–74].
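The difference between the two policies shows up in the command sequences they generate. The sketch below is a deliberately simplified model, assuming requests are served in order, refresh is ignored, and each request is a hypothetical `(bank, row, kind)` tuple with `kind` equal to `"R"` or `"W"`; the function names are illustrative and do not correspond to any particular controller.

```python
def close_page_commands(requests):
    """Close-page: every request issues A, CAS, P, regardless of history."""
    cmds = []
    for bank, row, kind in requests:
        cmds += [("A", bank, row), (kind, bank, row), ("P", bank, row)]
    return cmds


def open_page_commands(requests):
    """Open-page: rows stay open until a request to another row arrives."""
    open_row = {}  # bank -> row currently held in that bank's row buffer
    cmds = []
    for bank, row, kind in requests:
        if open_row.get(bank) != row:
            if bank in open_row:
                # Row conflict: close the old row before opening the new one.
                cmds.append(("P", bank, open_row[bank]))
            cmds.append(("A", bank, row))
            open_row[bank] = row
        # Row hit (or freshly activated row): only the CAS is needed.
        cmds.append((kind, bank, row))
    return cmds


if __name__ == "__main__":
    reqs = [(0, 5, "R"), (0, 5, "R"), (0, 9, "W")]
    print("close-page:", close_page_commands(reqs))
    print("open-page: ", open_page_commands(reqs))
```

In this example the second request is a row hit and needs a single command under open-page, while the third request is a row conflict and needs P, A, and CAS, which is the case that produces the larger WCL discussed above.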

4.3 Related Work

There are several efforts that propose predictable MCs [23,59,60,72–76,79,82,83]. Most of these efforts [23,59,72–74] use close-page policy. Hence, available locality in the row buffer (known as row locality) is not exploited for performance benefits. The solution proposed by Goossens et al. [82] presents a configurable architecture where the MC can be reconfigured with different time division multiplexing (TDM) schedules that satisfy new run-time requirements. Gomony et al. [83] propose an optimal mapping of requestors to channels for a multi-channel MC. However, the latter two solutions also deploy a close-page policy, and do not exploit row locality.

Wu et al. [60] utilize open-page policy; however, they require each core to be assigned its own private DRAM bank. This makes their approach inapplicable when data is shared between cores or when the number of cores is greater than the number of DRAM banks. Goossens et al. [75] offer a compromise with their proposal of a conservative open-page policy. This policy exploits row locality for soft real-time (SRT) requestors while maintaining tight WCL bounds for hard real-time (HRT) requestors. The proposed MC in [75] retains the data in the row buffer for a specified time window. When a request targets the same row in the row buffer and arrives within this window, it takes advantage of the row locality. While this approach allows SRT tasks to leverage the performance benefits of open-page policy, it does not reduce the WCL compared to close-page policy. Furthermore, the proposed policy depends on the arrival time of requests. As noted by Wu et al. [60], non-trivial applications deployed on multi-core systems often require the designer to make no assumptions about the arrival times of memory requests, because multiple requests arrive from various cores. Unlike [75], PMC, proposed in Chapter 4, requires no assumption about the arrival time of memory requests. In addition, unlike [60], we allow for shared data across cores.

Li et al. [79] deploy an MC backend that dynamically schedules DRAM access commands and supports different transaction sizes. Based on the transaction size, the numbers of interleaved banks and data bursts are determined through a look-up table. The backend issues DRAM commands on a FCFS basis. The dynamic command scheduling approach is promising for mixed-criticality systems since it increases average-case performance for requests of SRT tasks. However, requests from HRT tasks incur the same WCL as with close-page controllers. PMC is a complete frontend and backend controller that promotes a mixed-page policy to decrease the WCL of memory accesses.

Ecco et al. [84] reduce the data bus switching delay by employing a CAS reordering technique. They schedule CAS commands in rounds such that all commands in the same round have the same type (read or write). Among the A and P commands, they deploy RR arbitration. In [85], Ecco et al. extend this memory controller to support multi-rank DRAMs. If the DRAM has multiple ranks, they schedule CAS commands of the same type in one rank and then switch to another rank to decrease the rank switching overhead.
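The benefit of grouping CAS commands by type can be illustrated by counting data bus direction changes. The sketch below is a toy model and is not the scheduler of Ecco et al.: it assumes a fixed list of pending CAS commands, ignores requestors, banks, and fairness, and simply compares arrival order against a single read group followed by a single write group. Both function names are hypothetical.

```python
def bus_switches(cas_sequence):
    """Count read/write direction changes in a sequence of 'R' and 'W'."""
    return sum(1 for prev, cur in zip(cas_sequence, cas_sequence[1:]) if prev != cur)


def group_by_type(cas_sequence):
    """Toy reordering: all reads first, then all writes, keeping relative order."""
    reads = [c for c in cas_sequence if c == "R"]
    writes = [c for c in cas_sequence if c == "W"]
    return reads + writes


if __name__ == "__main__":
    pending = list("RWRWRW")
    print("arrival order:", bus_switches(pending), "switches")
    print("grouped:      ", bus_switches(group_by_type(pending)), "switches")
```

Each avoided switch saves a read-to-write or write-to-read turnaround on the data bus, which is why same-type rounds reduce the switching delay.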

Krishnapillai et al. propose ROC [78], a rank-switching open-row controller that forces consecutive requests to access different ranks to avoid the read-to-write and write-to-read switching time on the data bus. It deploys RR arbitration across ranks and across banks of the same rank. ROC is able to decrease the WCL compared to [79]. However, it is complex to implement since it has three levels of arbitration in the backend alone. Unlike [23,78], which are rank-switching MCs, PMC deploys rank interleaving. While [23,78] force consecutive memory requests to access different ranks to avoid data bus switching, PMC interleaves each request across ranks to decrease its latency by leveraging parallelism. In addition, consecutive accesses are mapped to different ranks to avoid bus switching, similar to [23,78].

Three recent efforts have introduced MCs for mixed-criticality systems (MCS) [23,76,77]. Jalle et al. introduce DCmc [76], which uses open-page policy and divides banks into critical and non-critical banks. They assign critical banks to critical requestors and schedule them using round robin (RR); hence, they provide latency bound guarantees for critical requestors. On the other hand, they assign non-critical banks to non-critical requestors and schedule them using first-ready, first-come first-served (FR-FCFS) to increase average-case performance. Ecco et al. introduce MCMC [23]. MCMC uses multiple ranks and bank partitioning with close-page policy. It assigns each bank partition to a critical requestor and a number of non-critical requestors. Then, it assigns critical requestors higher priority to eliminate the interference from the non-critical requestors. MCMC requires bank partitioning, which may limit shared data across requestors, similar to [60]. Kim et al. [77] implement bank-aware address mapping and command-level scheduling to accommodate both critical and non-critical tasks. Banks are shared between both types of tasks. The command-level scheduling prioritizes commands of critical tasks. If a command from a critical request arrives while a non-critical request is being serviced, they pre-empt the non-critical request. As a result, the non-critical request has to be reissued; thus, it suffers a performance penalty. Additionally, the first command from the critical request has to wait until all timing constraints after the pre-empted non-critical command are satisfied. As observed by [86], this increases the latency of the critical request. All three MCs [23,76,77] are dual-criticality with fixed-priority scheduling. Critical tasks always have higher priority; hence, non-critical tasks have neither performance nor latency guarantees. This is acceptable for systems deploying only two types of tasks. Nonetheless, we find those MCs ill-suited for systems with various mixed-critical tasks, where less-critical tasks may still require some guarantees.

In contrast, PMC does not always prioritize higher-critical tasks. Instead, it executes an optimized schedule that allows the system designer to specify different latency and bandwidth requirements for each requestor. The schedule provides each task with the amount of service that is only sufficient to meet its specified requirements, while not starving other tasks.
