Introduction - Managing Shared Resources in Chip Multiprocessor Memory Systems

C.1 Introduction

Chip Multi-Processors (CMPs) commonly share parts of the memory system. This resource sharing is often beneficial since it can lead to improved resource utilization and low-latency interprocessor communication. Unfortunately, the presence of shared resources makes destructive interference possible [27]. Consequently, the performance of an application may be influenced significantly by the applications it is co-scheduled with. This lack of performance predictability may be an annoyance to the desktop user, but it can be a business-critical issue for data center operators. With the advent of cloud computing, where thousands of distinct users share a common computing infrastructure, on-chip resource allocation may become critical [1].

A considerable research effort has been aimed at improving the way the shared units handle independent, co-executing processes. Efforts have been directed towards the hardware-controlled memory system [3, 13, 20, 27, 37], the shared last-level cache [4, 12, 21, 29, 30, 38–41], the memory bus [11, 25, 26, 28, 33] and system software [9, 32]. Common to these approaches is that all running processes have a static and equal miss bandwidth allocation. In this work, we show that asymmetric miss bandwidth allocations can be used to partition memory system resources among processes. We provide a transparent system, which we call Miss Handling Architecture Bandwidth Control (MHABC), that adapts per-core miss bandwidth to inter-process interference.

The subsystems of MHABC and other resource allocation schemes can be divided into three categories: enforcement mechanisms, feedback mechanisms and allocation policies [27]. Our enforcement mechanism leverages the fact that modern caches are non-blocking and potentially support several concurrent cache misses [22]. This feature is provided by the Miss Handling Architecture (MHA), and the key component of the MHA is the set of Miss Status/Information Holding Registers (MSHRs). The key observation underlying the design of MHABC is that the number of MSHRs in the last-level private cache can be used to control the number of concurrent requests in the shared memory system. We refer to an MHA where the number of MSHRs can be changed at runtime as a Dynamic MHA (DMHA) [15].

MHABC uses the Dynamic Interference Estimation Framework (DIEF) [17] to provide interference feedback. DIEF provides measurements of the current memory latency and estimates the memory latency a process would have experienced with exclusive access to all shared resources. Modern memory systems have a considerable amount of parallelism available, and the ability of a process to utilize this parallelism is known as Memory Level Parallelism (MLP) [31]. In this work, we combine DIEF’s latency estimates with MLP measurements to accurately estimate the performance a process would experience with exclusive access to shared resources. These IPC estimates have an average relative error of −0.3%, −3.5% and −5.8% and a standard deviation of 12.0%, 13.0% and 13.7% for our 4-, 8- and 16-core CMPs, respectively.

MHABC’s allocation policy combines DIEF’s latency estimates with a novel miss bandwidth performance model to provide runtime estimates of the chosen performance metric. Based on these estimates, MHABC finds a good bandwidth allocation by efficiently searching through the solution space provided by the model. MHABC supports the Harmonic Mean of Speedups (HMoS) [23], Aggregate Weighted Speedup (AWS) [35], Fairness [8] and Aggregate IPC policy metrics. When MHABC is configured to optimize for the HMoS metric, it improves HMoS by up to 106% and fairness by up to 200% with a worst-case reduction in throughput of 3%. Although we focus on miss bandwidth allocation in this work, the allocation policy is general. It can be applied to any resource allocation mechanism if the latency effect of the allocation can be modeled.

C.2 Background

C.2.1 Interference and Performance Metrics

When evaluating CMP memory system fairness, it is convenient to compare to a baseline where interference does not occur. One way of creating such a baseline is to let the process run in one processing core of the CMP and leave the remaining cores idle [6, 26]. Consequently, the process has exclusive access to all shared resources, and we will refer to this configuration as the private mode. Conversely, all processing cores are active and the processes compete for shared resources in the shared mode. We define the interference $I_p$ experienced by a processor $p$ as the difference between the shared mode latency $L_p$ and the private mode latency $\bar{L}_p$ (i.e. $I_p = L_p - \bar{L}_p$). This definition is an extension of the interference definition by Mutlu and Moscibroda [25].

A shared mode estimate $\hat{X}$ of a private mode value may differ from the actual private mode value $\bar{X}$. For these estimates to be useful for allocation decisions, it is important that the difference between them is minimized. Consequently, we define the measurement error to be $E = \hat{X} - \bar{X}$. Since latency and cache miss estimates are used for shared mode allocation decisions, we define the relative error $E_S$ as the absolute error $E$ divided by the shared mode value $X$ ($E_S = E/X$). For performance measurements, we use the error relative to the private mode value, $E_P = E/\bar{X}$.
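The error definitions above can be illustrated with a short Python sketch. The function name and the example numbers are hypothetical and are not taken from the evaluation; the sketch only restates the three definitions (absolute error, error relative to the shared mode value, and error relative to the private mode value).

```python
def estimate_errors(estimate, private_value, shared_value):
    """Return (E, E_S, E_P) for a shared mode estimate of a private mode value.

    estimate      -- the estimate of the private mode value, made in shared mode
    private_value -- the actual private mode value
    shared_value  -- the corresponding value measured in shared mode
    """
    absolute = estimate - private_value       # E: estimate minus actual private value
    rel_shared = absolute / shared_value      # E_S: E relative to the shared mode value
    rel_private = absolute / private_value    # E_P: E relative to the private mode value
    return absolute, rel_shared, rel_private


# Hypothetical latencies in cycles: estimated private latency 110,
# actual private latency 100, measured shared latency 200.
e, e_s, e_p = estimate_errors(110.0, 100.0, 200.0)
print(e, e_s, e_p)
```

With these made-up numbers, the same absolute error is twice as large relative to the private mode value as it is relative to the shared mode value, which is why the choice of denominator matters when comparing error figures.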

Table C.1 shows the system performance metrics used in this work. Here, $P_p$ and $\bar{P}_p$ represent the shared and private mode performance of process $p$, respectively.

Eyerman and Eeckhout [6] showed that the Aggregate Weighted Speedup (AWS) [35] and Harmonic Mean of Speedups (HMoS) [23] metrics represent system throughput and average normalized turnaround time, respectively. AWS is a system-oriented metric, and HMoS is a user-oriented metric. In addition, we use the fairness metric [8] which measures the difference in shared to private mode slowdown between the running processes. Consequently, fairness is maximized when all processes experience the same slowdown. Finally, we also include the Aggregate IPC (AI) metric.

Table C.1: Multiprogrammed Workload Performance Metrics

| Metric | Formula | System-Level Meaning [6] | Reference |
|---|---|---|---|
| Aggregate Weighted Speedup (AWS) | $\sum_{p=0}^{n} P_p / \bar{P}_p$ | System Throughput | Snavely and Tullsen [35] |
| Harmonic Mean of Speedups (HMoS) | $n / \sum_{p=0}^{n} \bar{P}_p / P_p$ | Inverse of Average Normalized Turnaround Time | Luo et al. [23] |
| Fairness | $\min(P_i / \bar{P}_i) / \max(P_j / \bar{P}_j)$, $i, j \in \{0, n\}$ | Assumed by system software | Gabor et al. [8] |
| Aggregate IPC (AI) | $\sum_{p=0}^{n} P_p$ | None [6] | - |

The Aggregate IPC metric rewards high IPC numbers, so the best score is achieved by maximizing the IPC of the processes that already have a high IPC. For this reason, Aggregate IPC is not recommended as a performance metric [23, 35], and we only include it to illustrate the effects of not using private mode performance in allocation decisions.
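To make the four metrics concrete, the following Python sketch computes them from per-process shared and private mode IPC values. The function name and the example IPC values are hypothetical; the sums run over the processes in the workload, and the speedup of a process is taken to be its shared mode IPC divided by its private mode IPC.

```python
def workload_metrics(shared_ipc, private_ipc):
    """Compute AWS, HMoS, Fairness and Aggregate IPC for one workload.

    shared_ipc[p] and private_ipc[p] hold the shared and private mode
    IPC of process p.
    """
    if not shared_ipc or len(shared_ipc) != len(private_ipc):
        raise ValueError("need one shared and one private IPC per process")

    # Per-process speedup relative to private mode (at most 1 under interference).
    speedups = [s / p for s, p in zip(shared_ipc, private_ipc)]
    n = len(speedups)

    return {
        "AWS": sum(speedups),                            # sum of speedups
        "HMoS": n / sum(1.0 / x for x in speedups),      # harmonic mean of speedups
        "Fairness": min(speedups) / max(speedups),       # 1.0 when slowdowns are equal
        "AI": sum(shared_ipc),                           # ignores private mode entirely
    }


# Hypothetical two-process workload.
metrics = workload_metrics(shared_ipc=[0.5, 1.5], private_ipc=[1.0, 2.0])
print(metrics)
```

In this made-up example the process with the lower IPC is also the one slowed down the most, yet Aggregate IPC does not reflect that, because it never consults the private mode values.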

C.2.2 Modern Memory Bus Interfaces

Memory bus scheduling is a challenging problem due to the 3D structure of DRAM consisting of rows, columns and banks. Commonly, a DRAM read transaction consists of first sending the row address, then the column address and finally receiving the data. When a row is accessed, its contents are stored in a register known as the row buffer, and a row is often referred to as a page. If the row has to be activated before it can be read, the access is referred to as a row miss or page miss. It is possible to carry out repeated column accesses to an open page, called row hits or page hits. This is a great advantage as the latency of a row hit is much lower than the latency of a row miss. The situation where two consecutive requests access the same bank but different rows is known as a row conflict and is very expensive in terms of latency. DRAM accesses are pipelined, so there are no idle cycles on the memory bus if the next column command is sent while the data transfer is in progress. Furthermore, command accesses to one bank can be overlapped with data transfers from a different bank.
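The row buffer behavior described above can be sketched as a small classifier. This is an illustrative model only: it assumes an open-page policy (a row stays open until another row in the same bank is accessed), treats an access to a bank with no open row as a row miss, and treats an access to a bank with a different open row as a row conflict. The names `Outcome` and `classify_accesses` are hypothetical, and timing is not modeled.

```python
from enum import Enum


class Outcome(Enum):
    ROW_HIT = "row hit"
    ROW_MISS = "row miss"
    ROW_CONFLICT = "row conflict"


def classify_accesses(accesses):
    """Classify a sequence of (bank, row) accesses under an assumed open-page policy."""
    open_row = {}   # bank -> row currently held in that bank's row buffer
    outcomes = []
    for bank, row in accesses:
        if bank not in open_row:
            # No row is open in this bank: the row must be activated first.
            outcomes.append(Outcome.ROW_MISS)
        elif open_row[bank] == row:
            # The requested row is already in the row buffer.
            outcomes.append(Outcome.ROW_HIT)
        else:
            # Same bank, different row: the open row must be closed first.
            outcomes.append(Outcome.ROW_CONFLICT)
        open_row[bank] = row
    return outcomes


trace = [(0, 1), (0, 1), (0, 2), (1, 2)]
for access, outcome in zip(trace, classify_accesses(trace)):
    print(access, outcome.value)
```

The last access in the trace targets the same row number as the one before it but a different bank, so it does not conflict with the open row in bank 0.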

Rixner et al. [34] proposed the First Ready - First Come First Served (FR-FCFS) algorithm for scheduling DRAM requests. FR-FCFS reorders memory requests to achieve high page hit rates, which in turn increases memory bus utilization. This algorithm prioritizes requests according to three rules: prioritize ready commands over commands that are not ready, prioritize column commands over other commands, and prioritize the oldest request over younger requests.
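The three priority rules can be sketched as a selection function over a queue of pending commands. This is a simplified illustration and not the scheduler of Rixner et al.: the `Command` fields are hypothetical, readiness is supplied as a flag rather than derived from DRAM timing constraints, and the first rule is interpreted as "only ready commands are eligible in a given cycle".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    request_id: int
    arrival: int       # arrival time of the owning request; lower means older
    is_column: bool    # True for column (read/write) commands
    is_ready: bool     # True if the command can be issued this cycle


def fr_fcfs_pick(queue):
    """Pick the next command to issue, or None if no command is ready."""
    # Rule 1: ready commands take priority over commands that are not ready.
    ready = [c for c in queue if c.is_ready]
    if not ready:
        return None
    # Rule 2: column commands before other commands (False sorts before True).
    # Rule 3: among equals, the oldest request first.
    return min(ready, key=lambda c: (not c.is_column, c.arrival))


queue = [
    Command(request_id=0, arrival=0, is_column=False, is_ready=True),
    Command(request_id=1, arrival=5, is_column=True, is_ready=True),
    Command(request_id=2, arrival=1, is_column=True, is_ready=False),
]
print(fr_fcfs_pick(queue).request_id)
```

In this hypothetical queue the youngest request is selected: its column command is ready, while the older candidates are either a row command or not yet ready. This is the reordering that raises the page hit rate, and it is also why a process with many row hits can delay requests from other processes.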

C.2.3 Miss Handling Architectures (MHAs)

A generic MHA consists of n MSHRs which store the cache block address of the miss, target information and a valid bit (see Figure B.1 on page 109). The cache can support as many concurrent misses to different cache blocks as there are MSHRs. Each MSHR commonly has its own comparator, and the MHA can be described as a small fully associative cache. For each miss, the information required for the cache to answer the processor’s request is stored. The amount of target information that can be stored determines the number of misses to the same cache block that can be handled without blocking [7, 22]. The cache must block when all valid bits are set, and a blocked cache cannot service any requests.
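The behavior of a generic MHA can be sketched with a small functional model. The class and method names are hypothetical, and the model makes simplifying assumptions: every MSHR has the same fixed number of target slots, a dictionary stands in for the per-MSHR comparators, and an entry being present in the dictionary corresponds to the valid bit being set.

```python
class MissHandlingArchitecture:
    """Functional sketch of an MHA with a fixed number of targets per MSHR."""

    def __init__(self, num_mshrs, targets_per_mshr):
        self.num_mshrs = num_mshrs
        self.targets_per_mshr = targets_per_mshr
        self.mshrs = {}   # block address -> list of targets (entry present = valid)

    def is_blocked(self):
        # The cache must block when all valid bits are set.
        return len(self.mshrs) >= self.num_mshrs

    def handle_miss(self, block_addr, target):
        """Return True if the miss is accepted, False if it cannot be serviced."""
        if self.is_blocked():
            return False                  # a blocked cache services no requests
        targets = self.mshrs.get(block_addr)
        if targets is not None:
            # Miss to a block that already has an MSHR: needs a free target slot.
            if len(targets) >= self.targets_per_mshr:
                return False
            targets.append(target)
            return True
        # Miss to a new block: allocate an MSHR.
        self.mshrs[block_addr] = [target]
        return True

    def fill(self, block_addr):
        """The block has returned from memory: free the MSHR and return its targets."""
        return self.mshrs.pop(block_addr, [])


mha = MissHandlingArchitecture(num_mshrs=2, targets_per_mshr=2)
print(mha.handle_miss(0x100, "load A"))   # allocates the first MSHR
print(mha.handle_miss(0x200, "load B"))   # allocates the last MSHR
print(mha.is_blocked())
```

In this model, lowering `num_mshrs` reduces the number of misses a cache can have outstanding at once, which is the property a mechanism that adjusts the MSHR count at runtime would rely on.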
