Phase-driven Learning-based Dynamic Reliability Management For Multi-core Processors

Zhiyuan Yang, Caleb Serafy∗, Tiantao Lu† and Ankur Srivastava

University of Maryland, College Park, MD, USA

{zyyang, cserafy, ttlu, ankurs}@umd.edu

ABSTRACT

In this paper, we propose a phase-driven Q-learning based dynamic reliability management (DRM) technique for multi-core processors. It solves DRM problems that maximize processor performance subject to a large class of reliability constraints by turning cores ON/OFF and applying dynamic voltage and frequency scaling. Our technique utilizes existing methods to detect program phases (e.g. [17]) and learns the optimal configuration of the multi-core processor for each phase at run-time, rather than obtaining it at the off-line stage. Our technique outperforms the existing learning-based DRM methods in managing programs with highly diverse phases. We evaluate the proposed technique on a DRM problem in 3D CPUs: maximizing processor performance subject to the electromigration-induced power delivery network reliability constraint. Compared to the latest Q-learning based DRM technique [11], our method achieves more than 1.3x improvement in performance with 77% memory savings.

1. INTRODUCTION

With continued technology scaling, more cores will be integrated in future multi-core processors. This increases the on-chip power density and temperature, which makes reliability a limiting constraint in high-performance multi-core processors [1, 12]. Dynamic Reliability Management (DRM) techniques address this issue by dynamically tuning the configuration of the processor through Dynamic Voltage and Frequency Scaling (DVFS), power gating, etc. Recently, researchers have started using reinforcement learning (such as Q-learning) in their DRM techniques [4, 11]. Q-learning based DRM techniques maintain a table of quality values for each state-action pair based on past experience and use this table for future management (Figure 1). Such approaches can tune their management protocols by tracking the dynamic characteristics of the programs being executed. Despite this, these techniques may require an extremely large memory space when applied to multi-core processors, which makes them infeasible to implement in today’s processors. Moreover, the existing techniques may fail to provide efficient management when the diversity of programs is large.

∗Dr. Serafy graduated in May 2016 and is currently working at Oracle Inc., Santa Clara, CA.

†Dr. Lu graduated in May 2016 and is currently working at Cadence Design Systems Inc., San Jose, CA.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

DAC ’17, June 18-22, 2017, Austin, TX, USA

Figure 1: (a) Q-learning based DRM system and (b) structure of the Q-table

The execution of a program consists of diverse phases [17], and the detection/prediction of program phases has been extensively studied [14, 17]. The problem of high diversity in programs can be handled by using phase detection and assigning each phase its optimal processor configuration. So far, such phase-based management methods have been used to dynamically tune the power, temperature, etc. of the processor [3, 10]. However, to the best of our knowledge, the existing phase-based dynamic management methods assume that the optimal management for each phase can be obtained a priori, which may not be practical.

In this work, we propose a phase-driven Q-learning based DRM technique which integrates existing on-line program phase detection techniques (e.g. [17]) into the Q-learning procedure (Section 3). Our proposed technique maximizes the processor performance subject to a large class of reliability constraints (e.g. reliability issues induced by time-dependent dielectric breakdown (TDDB), electromigration (EM), etc.) by turning cores ON/OFF and applying DVFS. Unlike the existing Q-learning based DRM approaches [4, 11], our technique can efficiently determine the optimal configuration of the processor for each phase at run-time without a priori knowledge of the phases. Therefore, our technique can provide finer management than the existing Q-learning based DRM approaches. We also propose two enhancement algorithms to improve the memory savings and management efficiency (Sections 3.2 and 3.3).

We performed a case study to evaluate the proposed DRM technique (Section 4). In this case study, we used our proposed technique to dynamically determine processor configurations of a 3D CPU (Figure 4) in order to maximize the performance subject to the EM-induced power delivery network reliability constraint. Results (Section 5) show that, when managing programs with high diversity, our approach outperforms the latest Q-learning based DRM technique [11], with more than 1.3x improvement in performance and 77% memory savings.

2. BACKGROUND AND MOTIVATION

2.1 Q-learning Based DRM

Figure 2: Examples of active core patterns with 16 (a), 8 (b-c), 4 (d-e) and 2 (f-g) active cores (the legend distinguishes active cores from idle cores)

Figure 1(a) illustrates the Q-learning based DRM system. Programs are executed on the multi-core processor. The Q-learning agent selects the working mode for the multi-core processor at a fixed interval, which is called a “decision epoch” [4]. A working mode is a configuration of the multi-core processor, determined by the number and distribution of active cores (as illustrated in Figure 2 for a 16-core processor) as well as the performance state (e.g. the voltage, frequency, etc.) of each core. At each decision epoch, the learning agent also observes information from the multi-core processor (e.g. through the performance counters, thermal sensors, etc.). These data are used to evaluate the future reliability degradation using reliability models, and the result is fed into the Q-learning procedure for future management.

The Q-learning procedure maintains a Q-table storing the quality value (Q-value) of taking a specific “action” in a given “state” [4, 11]. The structure of the Q-table is illustrated in Figure 1(b). The rows of the table represent the “state”, which is defined differently in different works [4, 11]. The columns of the table represent the “action”, which is usually defined as the working mode in the context of DRM. At each decision epoch, the working mode with the best Q-value for the current state is selected and the observed data are used to update the corresponding Q-value. Initially, the selected working mode may not be optimal for the state. However, the decisions are refined during run-time to improve the learning performance.

2.1.2 Existing Methods

According to the definition of the “state”, the existing Q-learning based DRM techniques can be classified into two categories. The first type defines both the “state” and the “action” (Figure 1(b)) as the working mode. Hence the Q-table is a matrix with $|W|^2$ elements, $|W|$ being the number of working modes. The $(i, j)$-th entry of the Q-table represents the reward gained by changing from working mode $i$ to working mode $j$. For instance, Kim et al. [11] used this type of technique to optimize the energy efficiency of multi-core processors subject to reliability, thermal and performance constraints. They adopted performance counters and models to evaluate the performance, reliability, etc. of the processor at run-time. In their work, the working mode is selected randomly at the first decision epoch. The working mode for the following decision epoch is selected as the best transition from the current working mode according to the Q-values. Following this, the change in performance, reliability, etc. between the working modes is calculated and used to update the corresponding Q-value. If the diversity of programs is low, this technique converges quickly after exploring a small part of the learning space [11]. However, when the diversity of programs is large, this technique might take a long time to converge (or may even fail to converge), and thus it may fail to provide efficient management.

Das et al. [4] adopted another type of technique and defined the “state” as the combination of the average thermal stress and the thermal-cycling-induced degradation across the chip. The “actions” were defined as working modes. Their learning procedure aims to determine the optimal working mode for each state. Another example of this type can be found in [6]. Such a technique can be problematic when the program diversity is high, since the “state” is affected by the working mode. For instance, the same state can result from the execution of two completely different pieces of code, and the working mode selected under this state may lead to different states when executing the subsequent pieces of code. This may cause suboptimal management and reduce the efficiency of this technique.

In summary, the existing Q-learning based DRM techniques become problematic when the program diversity is large. In order to provide efficient management for such programs, a DRM technique should be able to tune its management protocols according to the change of phases (i.e. the pieces of code) among programs. This motivates our proposed phase-driven Q-learning based DRM technique.

2.2 Phase Detection Technique

Our technique utilizes methodologies that detect/cluster phases independently of the processor configuration. Such methodologies have been extensively investigated [14, 17]. These techniques cluster phases according to the properties of the code in programs (e.g. the number of branches, the frequency of use of basic blocks, etc.). One of the popular on-line phase detection techniques was proposed by Sherwood et al. [17]. Their technique tracks the program counter (PC) of each committed branch and the number of instructions ($I$) since the last branch. The branch PC is then mapped through a hash function to an entry of an accumulator which holds $N_{bucket}$ entries. The value of the mapped entry is updated by $I$. This can be performed efficiently at the clock rate of the processor. After 10 million instructions, the vector recorded by the accumulator forms the “ID” of the piece of code within this interval. This ID is then compared with a set of IDs stored in the past-history-table. If the Hamming distance between the new ID and one entry in the table is within the threshold, this interval is classified into the phase represented by that entry. Otherwise, a new phase is discovered and a new entry in the past-history-table is created to store this new ID.
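The classification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the accumulator size, the hash function, the binarization of each bucket against the mean, and the distance threshold are all hypothetical choices and are not taken from [17].

```python
# Illustrative sketch of signature-based phase classification.
# N_BUCKET, THRESHOLD, hash_pc and the 1-bit-per-bucket ID are assumptions.
N_BUCKET = 32
THRESHOLD = 4

def hash_pc(pc, n_bucket=N_BUCKET):
    """Map a branch PC to an accumulator entry (hypothetical hash)."""
    return (pc >> 2) % n_bucket

def interval_id(branches, n_bucket=N_BUCKET):
    """Build the ID of one interval from (branch_pc, instructions_since_last_branch) pairs."""
    acc = [0] * n_bucket
    for pc, instr in branches:
        acc[hash_pc(pc, n_bucket)] += instr   # entry is updated by I
    mean = sum(acc) / n_bucket
    # Assumed compression: one bit per bucket, set when the bucket is above average.
    return tuple(1 if count > mean else 0 for count in acc)

def hamming(a, b):
    """Number of positions in which two IDs differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(new_id, history, threshold=THRESHOLD):
    """Return the phase index for new_id, appending a new phase if none matches."""
    for phase, stored_id in enumerate(history):
        if hamming(new_id, stored_id) <= threshold:
            return phase
    history.append(new_id)                    # new phase discovered
    return len(history) - 1
```

Here `history` plays the role of the past-history-table: an interval whose ID is close enough to a stored ID reuses that phase, and any other interval creates a new entry.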

2.3 Phase-based Dynamic Management

Our proposed DRM technique is a special kind of phase-based dynamic management (PBDM) technique. PBDM techniques utilize phase detection and provide finer tuning of the management by determining the optimal working mode for each phase. So far, the application of PBDM in dynamic power management, dynamic thermal management, etc. has been widely investigated [3, 10]. The two key issues of PBDM are (1) how to predict the phase for the next interval and (2) how to determine the best action for each phase. Many methods have been proposed to address the first issue [10, 17]. The main difference among these methods is how far they look back into the previous data for prediction. For example, Isci et al. [10] presented a “Global Phase History Table” predictor that predicts the next phase based on the phase patterns during the previous N intervals. On the other hand, less effort has been devoted to the second issue. To the best of our knowledge, the existing PBDM techniques rely on off-line knowledge (e.g. obtained by sampling a set of benchmarks) to determine the optimal solution for each phase [3, 10]. This, however, is not practical since the exact phases cannot be completely known before the execution of programs. In contrast to the traditional PBDM, our proposed technique dynamically learns the optimal working mode for each phase at run-time using the reinforcement learning procedure.

Figure 3: Illustration of (a) the Q-table and (b) the flow of management of our DRM technique. Notes: a dark grid cell indicates that the Q-value for that cell is updated at the labeled time point; each segment in (b) represents a decision epoch; the labels above a decision epoch give the real phase and working mode of that epoch.

3. PHASE-DRIVEN Q-LEARNING BASED DRM TECHNIQUE

Our proposed DRM technique is designed to solve a wide range of DRM problems that aim at maximizing the processor performance subject to reliability constraints (e.g. TDDB, EM, etc.). Our methodology is introduced in Section 3.1. Afterwards, we introduce a run-time phase clustering algorithm which further clusters phases with the same optimal working mode to achieve memory savings (Section 3.2). An on-line population algorithm is introduced in Section 3.3 to speed up the learning process, thus improving the management efficiency of our technique.

3.1 Methodology

Definition of the Q-table. In our technique, each “state” represents a phase ($\tau$) detected by the system (e.g. using [17]) and each “action” represents a working mode (i.e. the combination of the number of active cores, the frequency of each core, etc.). Therefore, the Q-value for a phase-working-mode pair, $Q(\tau, w)$, indicates the reward gained by executing the phase $\tau$ using the working mode $w$. The calculation of rewards and the update of Q-values will be introduced in Section 3.1.2. Our Q-table is illustrated in Figure 3(a). In our technique, phases are detected at the granularity of the decision epoch: the “ID” of the piece of code within the epoch is accumulated during the epoch and the phase is detected at the end of the epoch (Section 2.2). At the end of a decision epoch, the phase for the next epoch is assumed to be the phase detected for the current epoch, and the working mode is selected accordingly. This phase prediction method is effective if a decision epoch is shorter than a phase (e.g. a decision epoch of 10 million instructions, while a typical phase lasts tens to hundreds of millions of instructions [17]). Note that this predictor can be replaced by any other predictor that may yield more accurate predictions [10].

Figure 3 illustrates our methodology. At the beginning of the DRM, we start to process the first decision epoch (time point 1 in Figure 3). The learning agent randomly selects a working mode (i.e. $w_1$) and completes two jobs during this epoch: (1) detect the phase (as described in Section 2.2) and (2) evaluate the average performance and failure rate of the processor within the epoch under the working mode allocation $w_1$. Performance can be evaluated using the performance counters. The failure rate is calculated using reliability models (e.g. [11]) according to the specific DRM problem. The derivation of the failure rate will be introduced in Section 3.1.1. At the end of the first decision epoch (time point 2), the phase for this epoch is detected (i.e. $\tau_1$). The performance and failure rate of executing $\tau_1$ with $w_1$ are also evaluated. We calculate the reward of this phase-working-mode pair $(\tau_1, w_1)$, create a new row in the Q-table indexed with the “ID” of $\tau_1$ (Section 2.2) and update the Q-value $Q(\tau_1, w_1)$ according to Section 3.1.2. At this point, the phase for the next decision epoch is predicted as $\tau_1$ and the new working mode is assigned according to the learning stage of $\tau_1$ (Section 3.1.3). After the execution of the second decision epoch (time point 3), the actual phase for this epoch is detected. We compare the ID of this phase with the row indices of the Q-table. If we can locate the row for the phase, the corresponding Q-value is updated (e.g. $Q(\tau_1, w_2)$ at time point 3). Otherwise, we create a new row in the Q-table for the new phase and update the corresponding Q-value (e.g. $Q(\tau_2, w_3)$ at time point 4). Misprediction may occur at the transition between phases (e.g. at time points 3 and 5), where the actual phase differs from the predicted one and hence the working mode assigned for this epoch is not appropriate. This adds overhead to the management. However, when a decision epoch is shorter than the phases, this overhead is insignificant compared to the performance improvement of our technique.
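The flow above can be sketched as a short loop. This is a simplified illustration, not the authors' implementation: `run_epoch` and `reward_fn` are hypothetical hooks standing in for the processor, the phase detector and the reliability model; the Q-value is simply overwritten with the reward in place of the update of Section 3.1.2; and working modes are chosen by "try every unexplored mode, then exploit" in place of the learning stages of Section 3.1.3.

```python
import random

def manage(num_epochs, modes, run_epoch, reward_fn, seed=0):
    """Phase-driven management loop (simplified sketch).

    run_epoch(mode) -> (phase_id, ips, failure_rate) is a hypothetical hook;
    the phase of an epoch is only known once the epoch has finished.
    """
    rng = random.Random(seed)
    q_table = {}                       # phase_id -> {working mode: Q-value}
    mode = rng.choice(modes)           # first epoch: random working mode
    for _ in range(num_epochs):
        phase, ips, lam = run_epoch(mode)
        row = q_table.setdefault(phase, {})   # unseen phase: create a new row
        row[mode] = reward_fn(ips, lam)       # placeholder for the Q-value update
        # Prediction: the next epoch is assumed to run the same phase,
        # so the next working mode is chosen from this phase's row.
        unexplored = [m for m in modes if m not in row]
        mode = rng.choice(unexplored) if unexplored else max(row, key=row.get)
    return q_table
```

Note how a phase change is only noticed after the fact: the first epoch of a new phase runs with the working mode chosen for the previous phase, which is the misprediction overhead discussed above.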

3.1.1 Modeling of Reliability

The reliability of the processor is usually described using the Weibull distribution [9]:

$$R(t, \theta) = \exp\left(-\left(t \cdot \lambda(\theta)\right)^{\beta}\right).$$

$R(t, \theta)$ is the probability that the processor has not failed at time $t$ under the condition $\theta$ (e.g. temperature, current density, etc.). $\beta > 0$ is the shape parameter of the Weibull distribution. $\lambda(\theta)$ is the scaling factor, which is called the “failure rate” in this paper. Given the reliability distribution with respect to time, the mean-time-to-failure (MTTF) of the processor can be calculated by $MTTF(\theta) = \int_0^{\infty} R(t, \theta)\,dt$, which gives [9]

$$MTTF(\theta) = \frac{\Gamma(1 + 1/\beta)}{\lambda(\theta)}, \tag{1}$$

where $\Gamma(x)$ is the Gamma function. Currently, most reliability models evaluate the MTTF of the processor (or of some structures in the processor) (e.g. [11]). The failure rate can thus be calculated using Equation 1. A specific reliability model will be introduced later, in the case study. Usually, a processor requires a minimum MTTF (denoted as $M^*$), which enforces the reliability constraint. In this technique, we use $\lambda^* = \Gamma(1 + 1/\beta)/M^*$ to control the reliability. Hence the MTTF constraint is translated into a failure rate constraint, $\lambda^*$.
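As a small worked example of Equation 1, the conversion between an MTTF target and the failure rate constraint can be written as follows. The function names are illustrative, and the shape parameter value used in the usage note is an assumption, since the text does not state which $\beta$ is used.

```python
import math

def mttf(failure_rate, beta):
    """Equation 1: MTTF = Gamma(1 + 1/beta) / lambda."""
    return math.gamma(1.0 + 1.0 / beta) / failure_rate

def failure_rate_limit(min_mttf, beta):
    """Failure rate constraint lambda* for a minimum MTTF M*."""
    return math.gamma(1.0 + 1.0 / beta) / min_mttf

def reliability(t, failure_rate, beta):
    """Weibull reliability R(t) = exp(-(t * lambda)^beta)."""
    return math.exp(-((t * failure_rate) ** beta))
```

For instance, `failure_rate_limit(5.0, 2.0)` gives the failure rate (per year) corresponding to a 5-year MTTF target under an assumed $\beta = 2$; any epoch whose measured failure rate exceeds this value violates the constraint. The time unit of the failure rate is simply the inverse of the unit used for the MTTF.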

3.1.2 Reward and Q-value Functions

At the end of each decision epoch, the processor performance and failure rate for this epoch are denoted as $IPS(\tau, w)$ and $\lambda(\tau, w)$, respectively. The reward for this phase-working-mode pair, $(\tau, w)$, is computed as:

$$r(\tau, w) = \begin{cases} IPS(\tau, w) - \theta\,\left(\lambda(\tau, w) - \lambda^*\right), & \lambda(\tau, w) > \lambda^* \\ IPS(\tau, w), & \lambda(\tau, w) \le \lambda^* \end{cases} \tag{2}$$

In this formulation, the reward is simply the processor performance if the reliability constraint is not violated (i.e. $\lambda(\tau, w) \le \lambda^*$). Otherwise, a penalty proportional to the amount by which $\lambda(\tau, w)$ exceeds $\lambda^*$ is subtracted from the performance. Here $\theta$ is the penalty coefficient, not the operating condition of Section 3.1.1. Given the reward function, the corresponding Q-value for $(\tau, w)$ is updated with the following equation:

$$Q_{new}(\tau, w) = \begin{cases} r(\tau, w), & r(\tau, w) < 0 \\ (1 - \eta)\,Q_{old}(\tau, w) + \eta\, r(\tau, w), & r(\tau, w) \ge 0 \end{cases} \tag{3}$$

Here, $Q_{old}(\tau, w)$ and $Q_{new}(\tau, w)$ are the old and new Q-values for the phase-working-mode pair $(\tau, w)$. According to the equation, when the reward is non-negative, the new Q-value is a weighted average of the historical Q-value and the current reward for the phase-working-mode pair. $\eta$ controls the weight of the current reward relative to the historical value when updating the Q-value. On the other hand, when the reward is negative, the new Q-value is directly set to the reward value, thus avoiding further selection of phase-working-mode pairs with a large reliability penalty.
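Equations 2 and 3 translate directly into two small functions. The sketch below uses illustrative parameter names; how the Q-value of a pair is initialized before its first update is not specified in the text and is left to the caller.

```python
def reward(ips, lam, lam_limit, theta):
    """Equation 2: performance, minus a penalty when the failure rate exceeds the limit."""
    if lam > lam_limit:
        return ips - theta * (lam - lam_limit)
    return ips

def update_q(q_old, r, eta):
    """Equation 3: overwrite with a negative reward, otherwise blend old value and reward."""
    if r < 0:
        return r
    return (1.0 - eta) * q_old + eta * r
```

With the toy values `theta = 10` and `lam_limit = 1.0`, an epoch with `ips = 8` and `lam = 0.5` earns a reward of 8, while the same performance at `lam = 2.0` earns 8 - 10 * (2.0 - 1.0) = -2, which then replaces the stored Q-value outright.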

3.1.3 Selecting Working Modes

As noted earlier, the phase for the next decision epoch is assumed to be identical to the current one. Hence we select the next working mode according to the learning stage of the current phase ($\tau$). Similar to [4], there are three learning stages for each phase, determined by the number of explored working modes for that phase (a working mode is explored if the phase has been executed with that working mode at least once): (1) When few working modes have been explored, we are in the Exploration Stage, where the working mode is selected randomly, with more weight placed on the unexplored working modes. (2) When the number of explored working modes is large enough, we enter the Exploitation Stage, where we simply select the optimal working mode (i.e. the working mode with the highest Q-value for $\tau$). (3) The transition period between the previous two stages is the Exploit-exploration Stage, where we balance the probability of using the previous two protocols for selecting working modes.
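One possible reading of the three stages is sketched below. The stage boundaries (`explore_below`, `exploit_from`), the linear ramp used in the transition stage and the weight given to unexplored modes are all assumptions made for illustration; the text does not give these values.

```python
import random

def select_mode(q_row, modes, explore_below, exploit_from,
                rng=random, unexplored_weight=4.0):
    """Pick the next working mode for one phase.

    q_row maps explored working modes to Q-values for the current phase.
    """
    explored = [m for m in modes if m in q_row]
    n = len(explored)
    if n >= exploit_from:          # Exploitation Stage
        p_exploit = 1.0
    elif n < explore_below:        # Exploration Stage
        p_exploit = 0.0
    else:                          # Exploit-exploration Stage (assumed linear ramp)
        p_exploit = (n - explore_below + 1) / (exploit_from - explore_below + 1)
    if explored and rng.random() < p_exploit:
        return max(explored, key=q_row.get)      # best known working mode
    # Random choice that favours working modes not yet tried for this phase.
    weights = [1.0 if m in q_row else unexplored_weight for m in modes]
    return rng.choices(modes, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient when comparing management policies in simulation.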

3.2 On-line Clustering Algorithm

In practice, different phases may end up with the same optimal working mode. Therefore, we propose a clustering algorithm that further clusters such phases to achieve substantial memory savings in storing the Q-table. In order to predict the real optimal working mode at an early stage, we select a set of “sampling working modes (SWM)” which are scattered in the working mode space, and record the performance and failure rate of each phase executed with the SWM during the Exploration Stage. After a number of decision epochs, phases satisfying the following two conditions are clustered: (1) they have similar performance and failure rate when executed with the SWM and (2) they have the same current optimal working mode. If phases are clustered, we average the Q-values of each working mode across these phases and use just one row in the Q-table to store the new values. In order to further cluster this phase-cluster with other phases in the future, we generate the performance and failure rate values of the SWM for this phase-cluster by averaging the values of the constituent phases. Once a phase-cluster is created, we keep monitoring its optimal working mode. If the optimal working mode changes during the Exploitation Stage, we separate the phases constituting the cluster.
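The two clustering conditions and the merge step can be sketched as follows. The data layout (a dictionary per phase), the relative tolerance used to decide "similar", and the handling of working modes explored by only one of the two phases are assumptions for illustration; the sketch also merges exactly two phases with equal weight and does not cover splitting a cluster again.

```python
def can_cluster(a, b, tol):
    """Check both conditions for two phases.

    Each phase is a dict with 'swm': list of (ips, failure_rate) per sampling
    working mode, and 'q': {working mode: Q-value}.
    """
    same_best = max(a["q"], key=a["q"].get) == max(b["q"], key=b["q"].get)
    similar = all(
        abs(pa - pb) <= tol * max(abs(pa), abs(pb))
        and abs(la - lb) <= tol * max(abs(la), abs(lb))
        for (pa, la), (pb, lb) in zip(a["swm"], b["swm"])
    )
    return same_best and similar

def merge(a, b):
    """Merge two phases into one Q-table row by averaging their values."""
    q = {}
    for m in set(a["q"]) | set(b["q"]):
        vals = [p["q"][m] for p in (a, b) if m in p["q"]]
        q[m] = sum(vals) / len(vals)      # modes seen by one phase keep that value
    swm = [((pa + pb) / 2, (la + lb) / 2)
           for (pa, la), (pb, lb) in zip(a["swm"], b["swm"])]
    return {"swm": swm, "q": q}
```

The merged entry has the same shape as a single phase, so it can be compared against further phases with `can_cluster` in later epochs.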

3.3 On-line Population Algorithm

In order to speed up the Exploration Stage, we propose an on-line population algorithm to predict the Q-values of the unexplored phase-working-mode pairs. At run-time, we build a linear model that estimates the Q-value as a function of frequency for each number of active cores and each phase. The models are built from the explored working modes of a phase. The unexplored Q-values are then predicted using the corresponding models. When building such models, only positive Q-values are used; hence all predicted Q-values are positive and can be updated during the subsequent learning procedure (Section 3.1.2). Note that if the actual Q-value for a working mode is negative, it will be corrected according to Equation 3 when the working mode is explored.
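A minimal sketch of the population step for one phase and one core count is given below, assuming a one-variable least-squares line of Q-value against frequency. The key layout `(cores, frequency)` and the decision to drop non-positive predictions are assumptions for illustration; the text only states that the model is linear and built from positive Q-values.

```python
def fit_line(points):
    """Ordinary least squares for y = a * x + b; needs at least two distinct x values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return a, mean_y - a * mean_x

def populate(q_row, cores, freqs):
    """Predict Q-values of unexplored frequencies for one phase and one core count.

    q_row maps (cores, frequency) to the Q-value of an explored working mode.
    """
    # Only positive Q-values of explored modes are used to build the model.
    known = [(f, q_row[(cores, f)]) for f in freqs if q_row.get((cores, f), -1.0) > 0]
    if len(known) < 2:
        return {}                       # not enough data to fit a line
    a, b = fit_line(known)
    predicted = {}
    for f in freqs:
        if (cores, f) not in q_row and a * f + b > 0:   # assumed: keep positive predictions only
            predicted[(cores, f)] = a * f + b
    return predicted
```

Explored entries are never overwritten, so a working mode that already holds a negative Q-value stays excluded until it is explored again.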

3.4 Advantages of Our Technique

(1) Performance improvement. Compared to the existing Q-learning based DRM methods [4, 11], our technique captures the properties of different phases and efficiently learns the optimal working mode for each phase at run-time. Thus our technique outperforms the existing methods in managing programs with high diversity. (2) Memory savings. The memory space for storing Q-values in the existing techniques is fixed and can be very large [11]. In our technique, the Q-table size is flexible according to the number of phases. Since the number of different phases is usually much smaller than the number of working modes, our technique can achieve substantial memory savings and our learning overhead is thus reduced. In the following sections, we use a case study to evaluate our proposed DRM technique.

Figure 4: Illustration of (left part) the 3D DRAM-on-Logic Architecture and (right part) the floorplan of the multiprocessor

4. CASE STUDY

4.1 Background and Test Vehicle

The 3D CPU with a stacked-DRAM-on-logic structure has attracted wide interest due to its large bus bandwidth and its ability to access memory in parallel [16]. One structure of this kind of 3D CPU is illustrated in Figure 4. In the figure, a four-layer DRAM is stacked on a multi-core processor. The layers are connected through Through-Silicon-Vias (TSVs). The power is supplied from the bottom and distributed to each layer through the 3D Power Delivery Network (3D PDN) enabled by power-ground TSVs (P/G TSVs). The heat sink is placed close to the top layer. Although this 3D CPU achieves a significant improvement in performance, the TSVs in this structure suffer from severe EM-induced reliability problems due to the high on-chip temperature [18] and the current loads in TSVs (both signal and P/G TSVs). Since the current in P/G TSVs is unidirectional and much larger than that in signal TSVs, the reliability problem in P/G TSVs is much more severe [15, 11]. The degradation of P/G TSVs degrades the power delivery in the 3D CPU and harms the performance and reliability of the whole system. The EM-induced reliability of the 3D PDN can be dynamically controlled by tuning the configuration of the multi-core processor, since this changes the power consumption of the circuit (and also the spatial distribution of power, which impacts both the temperature and the current demands in individual P/G TSVs). However, this action affects the performance of the CPU. In order to handle this trade-off, in this case study we formulate the DRM problem as maximizing the 3D CPU performance subject to the EM-induced 3D PDN reliability constraint. The test vehicle of this case study is adopted from [16] and is illustrated in Figure 4. Readers may refer to [16] for more details of this architecture.

4.2 Simulation Mechanism and Platform

Figure 5: The simulation platform

(1) Setting working modes. We select 125 different configurations of the multi-core processor (i.e. the number of active cores and the corresponding frequencies) for illustration. Among these configurations, the number of cores is selected from {2, 4, 8, 16} and the frequency of each core is selected from {1, 1.5, 2, 2.5, 3} GHz. Different distributions of active cores are also selected (as illustrated in Figure 2).

(2) Building the phase pool. In order to evaluate our technique on programs with different phase diversity, we create a “phase pool” with a large number of phases. To do so, we select three out of 15 PARSEC [2] and SPLASH-II [20] benchmarks and create 680 such combinations. In each combination, we average the performance, power, etc. of the benchmarks in the combination (which are simulated using Multi2Sim [19] and McPAT [13] with different working modes), thus giving us 680 phases. The performance, power, etc. information for these phases constitutes our golden data. In an actual system, this data is expected to be generated by on-chip sensors, performance counters, etc. However, we use simulation data to drive our DRM in this paper.

(3) Programs for simulation. Following this, “programs” are created by randomly selecting different numbers of phases (ranging from 15 to 250) from the phase pool. These programs run sequentially and represent different levels of phase diversity.
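The construction of the phase pool can be sketched as below. The text does not say whether a benchmark may appear more than once in a combination; one reading that reproduces the stated count is selection with repetition, since choosing three out of 15 with repetition gives C(17, 3) = 680 combinations, whereas selection without repetition gives C(15, 3) = 455. The sketch assumes the former, uses a single scalar metric per benchmark, and uses made-up benchmark names in the usage note.

```python
from itertools import combinations_with_replacement
from math import comb

def build_phase_pool(metric_by_benchmark, group_size=3):
    """Average one metric (e.g. performance) over every group of benchmarks.

    Assumes groups are drawn with repetition; returns {group: averaged metric}.
    """
    names = sorted(metric_by_benchmark)
    return {
        group: sum(metric_by_benchmark[b] for b in group) / group_size
        for group in combinations_with_replacement(names, group_size)
    }
```

For example, with 15 placeholder benchmarks `{f"bench{i:02d}": float(i) for i in range(15)}`, the pool contains 680 entries, matching `comb(17, 3)`. In the actual setup each phase would carry a full set of metrics per working mode rather than a single number.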

Figure 5 illustrates the simulation flow for one decision epoch. In this figure, “Phase” indicates the phase detected in this decision epoch (Section 3.1). When a new decision epoch begins, a working mode selector controlled by the Q-learning agent (Section 3.1) assigns a working mode for this phase, thus determining the performance and power consumption of this phase. The power profile is then fed into a HotSpot-like thermal model [18] to calculate the temperature profile. Note that in an actual system, the temperature profile can also be obtained through thermal sensors. The power is also fed into a power model to estimate the on-chip current distribution, which is then used to estimate the current density in P/G TSVs through a PDN model. All this information is then fed into the reliability model to compute the failure rate of the P/G TSVs. Both the PDN model and the reliability model are introduced in the following paragraphs. Finally, the Q-learning agent uses the performance and failure rate to update the Q-table for future management (Section 3).

PDN model: Our PDN model consists of an off-chip and an on-chip RLC network similar to [7]. Each on-chip power grid is connected to its neighboring power grid through a resistor and an inductor in series, while each power grid connects to a ground grid through a decoupling capacitor. Each pair of P/G grids is also attached to a current load (generated by the mechanism described in the previous paragraph). P/G TSVs are also modeled as a series resistor and inductor, with decoupling capacitors between power and ground TSVs. The parameters of the 3D PDN model are taken from [7, 15]. The diameter of a P/G TSV is 10 µm and the power grid size on each layer is 400 µm × 400 µm [8].

EM-induced TSV Failure Model: EM-induced failure mechanisms in TSVs have been widely studied [5, 11]. All these models evaluate the MTTF of TSVs as affected by temperature, current, etc. Without loss of generality, in this work we adopt Black’s equation to model the EM-induced failure in a single P/G TSV [5, 9]. Black’s equation describes the EM-induced failure rate in a TSV in terms of the current density, temperature and other material-based parameters, and is expressed as follows [5]:

$$\frac{1}{\lambda_{TSV}} = \frac{A}{j^{n}} \exp\left(\frac{E_A}{k_B T}\right).$$

Here, $E_A$ is the activation energy for the EM process, $j$ is the current density, $k_B$ is the Boltzmann constant, $T$ is the temperature, and $A$ is a material-based constant. In this work, we use $E_A = 0.82$ eV and $n = 2$ for simulation [5]. We use the average failure rate of all P/G TSVs to characterize the failure rate of the 3D PDN. The average value is used as the metric because the failure of a single P/G TSV does not constitute the failure of the entire PDN, but each failed TSV degrades the PDN by introducing more voltage drop.
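The failure model can be evaluated with a few lines of code. In this sketch the prefactor `a_const` is a hypothetical placeholder (the text does not give its value), the units of the current density are left to the caller, and the temperature is assumed to be in kelvin.

```python
import math

K_B_EV = 8.617333262e-5   # Boltzmann constant in eV/K

def tsv_failure_rate(j, temp_k, a_const, n=2.0, e_a=0.82):
    """Black's equation solved for the failure rate of one P/G TSV.

    j: current density, temp_k: temperature in kelvin,
    a_const: material-based prefactor (hypothetical value supplied by the caller).
    """
    return (j ** n) / a_const * math.exp(-e_a / (K_B_EV * temp_k))

def pdn_failure_rate(tsvs, a_const):
    """Average failure rate over all P/G TSVs, given (j, temp_k) per TSV."""
    rates = [tsv_failure_rate(j, t, a_const) for j, t in tsvs]
    return sum(rates) / len(rates)
```

With $n = 2$, doubling the current density of a TSV quadruples its failure rate, and the exponential term makes the rate grow quickly with temperature, which is why both the power level and its spatial distribution matter for the PDN.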

5. RESULTS AND DISCUSSION

In this section, our technique is evaluated against the latest learning-based DRM method [11]. For [11], the Q-table size is fixed to 125x125 (rows × columns), where 125 is the number of working modes used in the simulation (Section 4.2). The definition of the reward function and the learning procedure of the existing method are the same as in [11], except that our target is to maximize the performance (IPnS, Instructions per Nanosecond) of the program. In our proposed technique, $\theta = 3 \times 10^{-10}$ and $\eta = 0.5$ (Section 3.1.2). The failure rate constraint ($\lambda^*$) is set such that the expected MTTF ($M^*$) of the 3D PDN is 5 years (Section 3.1.1). Three variants of our technique are evaluated: (1) without either the clustering (Section 3.2) or the population (Section 3.3) algorithm (No Cluster+No Pop), (2) with the clustering algorithm but without the population algorithm (Cluster+No Pop) and (3) with both algorithms (Cluster+Pop).

The simulation results are shown in Table 1. In the table, the diversity level ($DL$) represents the number of different phases in each simulation. For instance, $DL = 15$ means that a total of 15 different phases from the “phase pool” (Section 4.2) are used in that simulation. A large $DL$ indicates high phase diversity. The reliability after each simulation is evaluated as the MTTF, which is calculated from the average failure rate during the simulation using Equation 1. In Figure 6, we also plot the change of average performance with time for $DL = 30$ and $DL = 250$ when the existing method [11] and the three variants of our technique are applied.

We first compare the existing technique [11] with our technique without either the clustering or the population algorithm (No Cluster+No Pop). As shown in Table 1, compared to [11], our method achieves up to 32% improvement in performance while its MTTF is closer to the reliability constraint (5 years). This indicates that our technique stays close to the reliability constraint while maximizing performance. When the phase diversity is low, our method needs a much smaller Q-table than [11] (e.g. 77% memory savings when DL = 30). As the phase diversity increases, the Q-table grows, which increases the learning overhead of our technique (as illustrated by comparing the slopes of the red curves in Figure 6(a) and (b)). Despite this, our technique still achieves 18% improvement in performance when DL = 250, as shown in Table 1, demonstrating that it provides more efficient management than [11] for programs with high phase diversity.
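To make the link between phase diversity and Q-table size concrete, the following is a minimal sketch of a Q-table that holds one row per detected phase and one column per working mode, so that its row count grows with the number of distinct phases seen. The class name `PhaseQTable`, the step size `alpha` and the running-average update are illustrative assumptions; they are not the reward function or learning procedure defined earlier in the paper.

```python
NUM_MODES = 125  # number of working modes (Q-table columns), as in Section 4.2


class PhaseQTable:
    """Hypothetical Q-table: one row per detected phase, one column per mode."""

    def __init__(self, num_modes=NUM_MODES, alpha=0.5):
        self.num_modes = num_modes
        self.alpha = alpha  # assumed step size, for illustration only
        self.rows = {}      # phase id -> list of quality values

    def _row(self, phase_id):
        # A row is allocated the first time a phase is observed,
        # so the table grows with the phase diversity.
        if phase_id not in self.rows:
            self.rows[phase_id] = [0.0] * self.num_modes
        return self.rows[phase_id]

    def best_mode(self, phase_id):
        # Greedy choice: the working mode with the highest quality value.
        row = self._row(phase_id)
        return max(range(self.num_modes), key=row.__getitem__)

    def update(self, phase_id, mode, reward):
        # Assumed update: move the stored value toward the observed reward.
        row = self._row(phase_id)
        row[mode] += self.alpha * (reward - row[mode])

    def size(self):
        # (rows, columns), comparable to the Size columns of Table 1.
        return (len(self.rows), self.num_modes)


table = PhaseQTable()
for phase_id in range(30):          # a run that encounters 30 distinct phases
    table.update(phase_id, mode=3, reward=10.0)
print(table.size())
```

Under these assumptions, a run with 30 distinct phases ends with a 30x125 table, whereas a table indexed by working modes on both axes stays at 125x125 regardless of the program.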

Table 1: Simulation results (DL = the diversity level; IPnS is the average performance; “No Cluster+No Pop”, “Cluster+No Pop” and “Cluster+Pop” denote the three variants of our technique as noted earlier; Size is the final Q-table size; the values in brackets give the performance improvement over [11] for the same DL; MTTF is given in years and the reliability constraint is MTTF ≥ 5 years)

| DL | [11] IPnS | [11] Size | [11] MTTF | No Cluster+No Pop IPnS | No Cluster+No Pop Size | No Cluster+No Pop MTTF | Cluster+No Pop IPnS | Cluster+No Pop Size | Cluster+No Pop MTTF | Cluster+Pop IPnS | Cluster+Pop Size | Cluster+Pop MTTF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 8.5 | 125x125 | 5.7 | 10.3 (1.21x) | 15x125 | 5.5 | 10.3 (1.21x) | 15x125 | 5.4 | 10.8 (1.27x) | 15x125 | 5.2 |
| 30 | 8.2 | 125x125 | 6.8 | 10.5 (1.28x) | 30x125 | 5.2 | 10.7 (1.31x) | 27x125 | 5.1 | 11.0 (1.35x) | 29x125 | 5.1 |
| 60 | 8.6 | 125x125 | 5.9 | 11.2 (1.31x) | 60x125 | 5.3 | 11.3 (1.32x) | 46x125 | 5.3 | 11.5 (1.34x) | 52x125 | 5.2 |
| 90 | 8.0 | 125x125 | 5.8 | 10.3 (1.29x) | 90x125 | 5.3 | 10.3 (1.29x) | 71x125 | 5.3 | 10.6 (1.32x) | 73x125 | 5.3 |
| 125 | 8.0 | 125x125 | 5.8 | 10.1 (1.28x) | 125x125 | 5.4 | 10.3 (1.29x) | 90x125 | 5.3 | 10.4 (1.31x) | 96x125 | 5.3 |
| 250 | 8.1 | 125x125 | 5.7 | 9.6 (1.18x) | 250x125 | 5.5 | 9.8 (1.21x) | 153x125 | 5.4 | 10.3 (1.27x) | 159x125 | 5.3 |

Figure 6: Average performance with time for (a) DL = 30 and (b) DL = 250

(1) Apply the Clustering Algorithm (Cluster+No Pop). Applying the clustering algorithm reduces the Q-table size. The higher the phase diversity, the larger the memory savings (e.g. 40% savings when DL = 250) compared to our technique without the clustering algorithm. As illustrated in Figure 6(a)/(b), the yellow curve rises earlier than the red curve (indicating faster learning speed) and ends up with relatively higher performance. The figure also shows that the improvement brought by the clustering algorithm is more significant when the diversity is large.

(2) Apply the Population Algorithm (Cluster+Pop). When we further apply the population algorithm, the learning speed improves substantially again (as illustrated by the purple curves in Figure 6), which leads to a further improvement in performance, as shown in Table 1.

In summary, when both enhancement modules are applied, our technique achieves up to 35% improvement in performance compared to [11], with 77% memory savings for storing the Q-table (i.e. when DL = 30).
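As a quick check of the memory figures, the sketch below computes the Q-table memory savings of the Cluster+Pop variant relative to the fixed 125x125 table of [11], using the final row counts reported in Table 1. It assumes that every Q-table entry occupies the same amount of storage, so that savings follow directly from the entry counts; the helper names are hypothetical.

```python
def q_table_entries(rows, cols=125):
    """Number of quality values stored in a rows x cols Q-table."""
    return rows * cols


def memory_savings(rows, baseline_rows=125, cols=125):
    """Fraction of entries saved relative to a baseline_rows x cols table.

    Assumes each entry takes the same storage; a negative value means the
    table is larger than the baseline.
    """
    return 1.0 - q_table_entries(rows, cols) / q_table_entries(baseline_rows, cols)


# Final Q-table row counts of Cluster+Pop, taken from Table 1 (DL -> rows).
cluster_pop_rows = {15: 15, 30: 29, 60: 52, 90: 73, 125: 96, 250: 159}

for dl, rows in cluster_pop_rows.items():
    print(f"DL = {dl}: {rows}x125 vs 125x125, savings {memory_savings(rows):.1%}")
```

For DL = 30, the 29x125 table corresponds to savings of about 77%. Under the same assumption, the savings shrink as DL grows and become negative once the table has more than 125 rows, as in the DL = 250 row of Table 1.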

6. CONCLUSION

In this work, we propose a phase-driven Q-learning based DRM technique to manage programs with large phase diversity. Our technique relies on existing methods (e.g. [17]) to detect phases and learns the optimal working mode for each phase at run-time. We also develop two modules (i.e. on-line clustering and population algorithms) to improve memory savings and management efficiency. Compared to the latest Q-learning based DRM technique [11], our method can achieve more than 30% improvement in performance with 77% memory savings.

7. ACKNOWLEDGEMENTS

The authors acknowledge that this work has been funded by NSF grant CCF1302375.

8. REFERENCES

[1] H. Amrouch, et al. Reliability-aware design to suppress aging. In Proceedings of the 53rd DAC, page 12. ACM, 2016.

[2] C. Bienia, et al. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pages 72–81. ACM, 2008.

[3] R. Cochran and S. Reda. Consistent runtime thermal prediction and control through workload phase detection. In Proceedings of the 47th DAC, pages 62–67. ACM, 2010.

[4] A. Das, et al. Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems. In 51st DAC, pages 1–6. IEEE, 2014.

[5] T. Frank, et al. Electromigration behavior of 3D-IC TSV interconnects. In ECTC, 2012 IEEE 62nd, pages 326–330. IEEE, 2012.

[6] Y. Ge and Q. Qiu. Dynamic thermal management for multimedia applications using machine learning. In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 95–100. IEEE, 2011.

[7] M. S. Gupta, et al. Understanding voltage variations in chip multiprocessors using a distributed power-delivery network. In 2007 Design, Automation & Test in Europe Conference & Exhibition, pages 1–6. IEEE, 2007.

[8] M. B. Healy and S. K. Lim. Power delivery system architecture for many-tier 3D systems. In 2010 Proceedings 60th ECTC, pages 1682–1688. IEEE, 2010.

[9] L. Huang, et al. Lifetime reliability-aware task allocation and scheduling for MPSoC platforms. In Proceedings of the DATE, pages 51–56. EDAA, 2009.

[10] C. Isci, et al. Live, runtime phase monitoring and prediction on real systems with application to dynamic power management. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 359–370. IEEE Computer Society, 2006.

[11] T. Kim, et al. Learning-based dynamic reliability management for dark silicon processor considering EM effects. In 2016 DATE, pages 463–468. IEEE, 2016.

[12] T. Kim, et al. Invited-Cross-layer modeling and optimization for electromigration induced reliability. In Proceedings of the 53rd DAC, page 30. ACM, 2016.

[13] S. Li, et al. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Microarchitecture, MICRO-42, pages 469–480. IEEE, 2009.

[14] S. Padmanabha, et al. Trace based phase prediction for tightly-coupled heterogeneous cores. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 445–456. ACM, 2013.

[15] J. S. Pak, et al. PDN impedance modeling and analysis of 3D TSV IC by using proposed P/G TSV array model based on separated P/G TSV and chip-PDN models. IEEE Transactions on Components, Packaging and Manufacturing Technology, 1(2):208–219, 2011.

[16] C. Serafy, et al. Continued frequency scaling in 3D ICs through micro-fluidic cooling. In Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2014 IEEE Intersociety Conference on, pages 79–85. IEEE, 2014.

[17] T. Sherwood, et al. Discovering and exploiting program phases. IEEE micro, 23(6):84–93, 2003.

[18] B. Shi, et al. Hybrid 3D-IC cooling system using micro-fluidic cooling and thermal TSVs. In 2012 IEEE Computer Society Annual Symposium on VLSI, pages 33–38. IEEE, 2012.

[19] R. Ubal, et al. Multi2Sim: a simulation framework for CPU-GPU computing. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pages 335–344. ACM, 2012.

[20] S. C. Woo, et al. The SPLASH-2 programs: Characterization and methodological considerations. In ACM SIGARCH computer architecture news, volume 23, pages 24–36. ACM, 1995.