3.5 Thermal-aware Computing
3.5.2 Thermal-aware Design and Layout
Researchers have also proposed thermal-aware floorplanning to mitigate the appearance of hotspots. The primary insight in many of these approaches is to place sources of heat close to “cold spots” to facilitate heat spreading and more uniform die temperatures. The HotSpot tool by Skadron et al. [154] is widely used by academics to estimate temperature from a floorplan. Powell and Vijaykumar [133] evaluate enlarging hot functional units to increase heat spreading without compromising delay along critical paths. Floorplan studies have led to key inferences about multi- and manycore architectures, as well as emerging 3D architectures. Moncheiro et al. [120] explore multicore layouts and find that laying out the cores in the center of the chip, with caches around them, leads to lower average temperature because of improved heat spreading. Puttaswamy and Loh [134] propose thermal herding in stacked-3D multicores so that execution is driven towards the die nearest to the heat spreader.
This dissertation evaluates sprinting using multiple cores and treats the entire die as a uniform heat source, based on these works, which suggest that hotspots can be mitigated using circuit layout techniques. Huang et al. [71] observe that manycore architectures with small cores tend not to suffer from hotspots because the heat generated by the cores is effectively absorbed in the space between cores.
Based on the thermal principles presented in this chapter, the next chapter investigates the feasibility of sprinting by exceeding sustainable power temporarily.
Feasibility Study of Computational Sprinting
Computational sprinting is motivated by three observations from Chapter 1 and Chapter 2: (i) sustainable performance will be limited by thermal conductivity, especially in mobile devices, (ii) interactive applications demand responsiveness: intense, brief bursts of computation punctuated by longer durations of idleness, and (iii) thermal capacitance can buffer heat to allow temporarily exceeding sustainable power. Based on the estimates of dark silicon in Chapter 2, which indicate a 10× gap between sustainable and peak power, this chapter investigates sprinting to enable 10× improvements in responsiveness for interactive applications.
As a concrete objective, this chapter considers parallel computational sprinting, in which a system that can sustain the operation of a single 1 W core is enabled to sprint for up to one second by activating 15 otherwise “dark” cores. To recap sprinting operation from Section 1.2, computational sprinting begins with activating all 16 cores in response to an input event. The heat in excess of the sustainable dissipation rate of the system (1 W) is buffered by the thermal capacitance in the system. After exhausting this thermal buffer, when the temperature nears the permissible threshold, the system stops sprinting and completes any remaining computation at sustainable power, i.e., by powering off the 15 additional cores and executing all instructions on a single core.
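The heat that the thermal buffer must absorb in this scenario follows from simple arithmetic. The sketch below is illustrative only: it assumes that every active core draws the same 1 W as the sustainable core, and the function and constant names are hypothetical rather than taken from the dissertation.

```python
# Sprint energy budget, using only the figures quoted in the text.
SUSTAINABLE_POWER_W = 1.0  # the system can dissipate 1 W indefinitely
CORE_POWER_W = 1.0         # ASSUMPTION: each active core draws 1 W
SPRINT_CORES = 16          # 1 sustainable core + 15 otherwise dark cores
SPRINT_DURATION_S = 1.0    # target sprint length


def excess_heat_joules(cores, core_power_w, sustainable_w, duration_s):
    """Heat (in joules) that must be buffered rather than dissipated."""
    sprint_power_w = cores * core_power_w
    # Power at or below the sustainable rate needs no buffering.
    excess_power_w = max(sprint_power_w - sustainable_w, 0.0)
    return excess_power_w * duration_s


excess_j = excess_heat_joules(
    SPRINT_CORES, CORE_POWER_W, SUSTAINABLE_POWER_W, SPRINT_DURATION_S
)
print(f"Heat to buffer during one sprint: {excess_j} J")
```

Under these assumptions the buffer must hold 15 W for one second, i.e., 15 J, which is the quantity that the thermal capacitance of the system has to absorb.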
Section 4.1 first illustrates a strawman proposal to buffer the 15 W of excess heat for one second to sustain one such sprint. This section evaluates the approach presented in Section 1.2 (augmenting thermal capacitance with the latent heat of phase change) by extending the familiar thermal analysis techniques from Chapter 3 to physical constants representative of a mobile phone [110]. To motivate the potential benefits of sprinting, Section 4.2 then evaluates the responsiveness benefits that such parallel sprints can enable for sample applications executed on a simulated manycore sprinting system. Section 4.3 extends the evaluation to a rudimentary model of repeated sprints separated by think times.

[Figure 4.1 panels: (a) Phone cut-away, showing die, package, PCB, and case; (b) Thermal network with P, Tjunction, Cjunction, Rpackage, Ccase, Rconvection, and Tamb; (c) PCM placed near die, showing die, TIM, PCM, package, PCB, and case; (d) Thermal network with PCM, adding Rpcm, Cpcm, and Tpcm, with callouts 1, 2, and 3]
Figure 4.1: The thermal components of a mobile system (a) and its thermal-equivalent circuit model (b). In (c) and (d), the system is augmented with a block of phase change material (PCM). The amount of computation possible during a sprint is primarily [...] the system cools after a sprint.
Following the quantitative introduction of the approach and potential benefits of sprinting, this chapter examines the feasibility of engineering a mobile system capable of sustaining such a 16-core sprint for 1 second. The feasibility analysis broadly considers the immediate thermal (Section 4.4), electrical (Section 4.5) and hardware/software challenges (Section 4.6) imposed by sprinting on existing systems and proposes approaches to address these challenges. Although not the focus of this dissertation, this feasibility study briefly discusses the implications of sprinting on reliability (Section 4.7) and cost (Section 4.8).
4.1 A Thermally-augmented System for Sprinting
To understand the basic thermal approach to sprinting, consider the heat flow in an example mobile phone. Figure 4.1a shows the physical arrangement of a package containing the processor die inside a mobile phone case. The thermal R-C network in Figure 4.1b represents the heat path between the processor and the ambient air. The parameter values (Table 4.1) are derived from a physically validated model of a mobile phone from 2008 by Luo et al. [110] (subsequent estimates by Shao et al. [149] show similar values). This study used temperature probes to show that heat flows from the processor to the ambient environment mostly along two paths: an upper path through the circuit board and top surface of the phone and a lower path through the battery and bottom surface, and that these two parallel paths should optimally have the same thermal conductivity. Lumping these parallel paths from the chip to the case yields the thermal network model in Figure 4.1b.

Parameter     Description                                   Value
Rpackage      Chip-to-case thermal resistance               12.1 K/W
Cjunction     Thermal capacitance of die                    0.011 J/K
Rconvection   Case-to-air thermal resistance (convection)   28 K/W
Ccase         Thermal capacitance of case                   8.3 J/K
Rpcm          Thermal resistance between die and PCM        0.001 K/W
Cpcm          Thermal capacitance of PCM (latent heat)      100 J/g
Tamb          Ambient temperature                           25 °C
Tjmax         Maximum junction temperature                  75 °C
Table 4.1: Thermal model parameters
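As a rough check on the parameter values in Table 4.1, the following sketch derives three back-of-the-envelope quantities from the lumped network of Figure 4.1b. It is not the dissertation's analysis: the function names are hypothetical, the heating estimate assumes that the case stays at ambient temperature during a short burst (so that the die behaves as a single first-order R-C stage), and the PCM estimate assumes that all excess heat is absorbed as latent heat.

```python
import math

# Values copied from Table 4.1.
R_PACKAGE_K_PER_W = 12.1         # chip-to-case thermal resistance
C_JUNCTION_J_PER_K = 0.011       # thermal capacitance of die
R_CONVECTION_K_PER_W = 28.0      # case-to-air thermal resistance
PCM_LATENT_HEAT_J_PER_G = 100.0  # latent heat of the PCM
T_AMB_C = 25.0                   # ambient temperature
T_JMAX_C = 75.0                  # maximum junction temperature


def sustainable_power_w():
    """Steady-state power at which the junction settles exactly at Tjmax."""
    total_resistance = R_PACKAGE_K_PER_W + R_CONVECTION_K_PER_W
    return (T_JMAX_C - T_AMB_C) / total_resistance


def time_to_tjmax_s(power_w):
    """Time for the die to heat from Tamb to Tjmax without any PCM.

    ASSUMPTION: the case stays at Tamb for the whole interval, so the
    junction follows T(t) = Tamb + P * R * (1 - exp(-t / (R * C))).
    """
    headroom_k = T_JMAX_C - T_AMB_C
    final_rise_k = power_w * R_PACKAGE_K_PER_W
    if final_rise_k <= headroom_k:
        return math.inf  # the junction never reaches Tjmax at this power
    tau_s = R_PACKAGE_K_PER_W * C_JUNCTION_J_PER_K
    return -tau_s * math.log(1.0 - headroom_k / final_rise_k)


def pcm_mass_g(excess_heat_j):
    """PCM mass whose latent heat alone absorbs the given excess heat.

    ASSUMPTION: sensible heating of the PCM and Rpcm are neglected.
    """
    return excess_heat_j / PCM_LATENT_HEAT_J_PER_G


print(f"Sustainable power: {sustainable_power_w():.2f} W")
print(f"Time to Tjmax at 16 W, no PCM: {time_to_tjmax_s(16.0) * 1e3:.0f} ms")
print(f"PCM mass for 15 J of excess heat: {pcm_mass_g(15.0):.2f} g")
```

Under these simplifications, the network sustains roughly 1.25 W, consistent with the single 1 W core assumed in this chapter. A 16 W sprint would reach the junction limit within tens of milliseconds if it relied on the die's own thermal capacitance, whereas about 0.15 g of PCM would suffice to hold 15 J as latent heat. The heating estimate is optimistic in one respect: it starts from ambient temperature rather than from the warmer steady state of sustained operation.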
4.1.1 Thermal Resistance, Thermal Design Power, and Thermal Capacitance