Hardware Thread Migration for 3D Die-stacked Heterogeneous Multi-core Processors.


ABSTRACT

FORBES, JOHN ELLIOTT. Hardware Thread Migration for 3D Die-stacked Heterogeneous Multi-core Processors. (Under the direction of Eric Rotenberg.)

Increasing the performance and efficiency of modern microprocessors has been met with the significant challenge of designing within a tight power budget. Newer fabrication technologies

have given rise to smaller, more dense transistors. This has ushered in an era of multiple

processor cores on a single chip, larger on-chip caches, and the integration of resources that have traditionally resided off-chip. But the historical trend that smaller transistors yield lower

power has ended. Most multi-core processors rely on replicating several instances of a single core

design for each of the multiple cores. However, one promising solution to continue delivering performance and efficiency improvements is to allow for single-ISA heterogeneous processor

cores within a processor.

Single-ISA heterogeneous multi-core processors are processor designs with multiple cores, each of which may have a slightly different microarchitecture. The different microarchitectures

all execute the same instruction set but each core is tailored for different program behavior. The

benefits of single-ISA heterogeneous multi-core processors have been well explored; however, most proposals leave some performance and efficiency on the table. This is by virtue of the

assumption that the cost of moving a program from one core to another, referred to as a thread

migration, is high. This assumption therefore requires a program to spend long amounts of time on a given core to amortize the penalty of migrating between cores.

In this work, I focus on eliminating as much overhead for a thread migration as possible.

With a low overhead thread migration, movement between cores can occur more frequently, and for shorter intervals of time. This allows for fine-grained program changes to be quickly mapped

to a potentially better core. This picks up the performance and efficiency improvements that

previous proposals left on the table.

A thread migration has traditionally been the purview of the operating system through a

context switch. The operating system context switch comes at a high cost. To make thread

migrations as light-weight as possible, I propose to abstract the operating system view of a

pair of heterogeneous cores. The operating system is free to assign program threads to a pair

of cores, but once assigned, the cores are free to move threads between each other as needed. This work analyses the extent to which thread migration overhead (or lack thereof) can

impact performance and efficiency over that of traditional heterogeneous multi-core processors. I

find that a hypothetical zero-cost thread migration can achieve a 2% to 5% improvement on average compared to a two-core heterogeneous processor, with some individual benchmarks


in a realistic fast migration implementation, a thread migration should take less than 100 cycles,

and expend less than 100nJ of energy.

To realize these performance and efficiency gains, in this work I also evaluate several

competing implementations for a low-latency thread migration. These alternatives span a spectrum

of hardware complexity and power costs, trading these costs for reduced migration latency. Several of these implementations meet the sub-100 cycle and sub-100 nJ targets. The lowest

latency implementation is able to perform thread migrations in about 30-35 cycles on average,

and the lowest energy implementation uses less than 25nJ of energy to migrate a thread. Several of these implementation alternatives rely on a bulk copy of register file values from

one core to the other. This implies wiring between each bitcell of the register files of the core

pair. These wires are costly in wiring congestion, as well as delay between the bitcells. To ameliorate this issue, I explore the use of 3D die-stacking as a way to minimize wire lengths by

using face-to-face vias and register files that are directly across from each other on opposite dice.

The results of this study show that both wiring congestion and delay are kept to a minimum, even with high bitcell density. Compared to planar register file placement, the 3D die-stacked

implementation always has lower congestion for the same area.

The culmination of this work is in the design and test of a fabricated heterogeneous dual-core processor in which the fast hardware thread migration was a key feature. This was a large,

multi-team effort, and progressed in two phases. The first was a 2D test design used to vet the processor cores and migration logic. The results of testing this chip demonstrate the sub-100

cycle thread migration capability. The second phase is currently being fabricated and is a 3D

die-stacked design, incorporating fixes for the bugs uncovered in the first phase.

An important facet in any thread migration scheme is the mechanism used to steer program

phases to the heterogeneous cores. A high quality mapping of phases-to-cores can realize the

full potential of heterogeneous cores, whereas a poor mapping can detrimentally impact performance. A final analysis included in this work is to explore the possibility of using only static

program characteristics to make a core mapping decision. While the results are preliminary, the


Hardware Thread Migration for 3D Die-stacked Heterogeneous Multi-core Processors

by

John Elliott Forbes

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Computer Engineering

Raleigh, North Carolina

2016

APPROVED BY:

William Davis James Tuck

Huiyang Zhou Emerson Murphy-Hill

Eric Rotenberg


DEDICATION


BIOGRAPHY

John Elliott Forbes is originally from Hastings, Michigan, graduating from Hastings High School in 2000. After high school, he attended Michigan Technological University in Houghton,

Michigan to pursue his Bachelor of Science in Computer Engineering. At various points during his undergraduate studies, he worked as an intern for Unisys Corporation in Roseville, Minnesota,

working on the physical design of the processor ASIC used on the ClearPath IX platform

mainframes. After graduating cum laude from Michigan Tech in 2005, he worked full-time for Smiths Aerospace (now GE Aviation Systems) in Grand Rapids, Michigan, working on graphics

subsystem testing for the C-130AMP project. While working at Smiths Aerospace, Elliott applied,

and was accepted, to the graduate program in the Electrical and Computer Engineering department of North Carolina State University in Raleigh, North Carolina, obtaining his Master

of Science in Computer Engineering in 2008. After his Masters, he transitioned into the PhD

program in the ECE Department at NC State, under the direction of Dr. Eric Rotenberg. During this time, Elliott taught the Introduction to Computing Systems course for three years.

In the summer of 2010, he also worked in the Binary Translation Software group at Intel in


ACKNOWLEDGEMENTS

Of course I have to start with my family. Mom, dad, Abby, Holly, Bob and all of the kids. I also can’t forget Shannon. Thank you for the constant support and for pushing me to keep going.

Thanks to my friends back home Brandon Willard, Justin Benner, Ron Coats, Nick Steele, Steve Soliz, Joe Ammeraal, Tom Strey, Nathan Milz, and Dave Subert. It’s always a good day

to get a call or to meet and catch up. Also, I want to thank the professors at Michigan Tech,

Soner Önder and Brian Davis, who pushed me to go to grad school in the first place.

I’ll never forget the fun and rewarding times I had teaching. Thanks to all of my past

students for expecting as much out of me as I expected out of you. Thanks particularly to Xander

Kansinally, Katie Walker, Will Galliher, John Williamson, Cesar Garzon, Michael Glander, Vic Ajewole, and Dusty Mabe. It’s been amazing to see what things all of you have done and I was

always happy to get a visit in my office long after you finished my course.

I’m glad to have had the opportunity to work in the BiTS team at Intel in Hillsboro. Thank you Suresh Srinivas for giving me that opportunity. I also want to thank Matt Pagano, Omar

Shaikh, Avadh Patel, and Carlo Angiuli for all the fun, for the stimulating research environment,

and of course for the shenanigans that we got away with in Oregon. I want to especially thank Paul Caprioli for the continued support and advice in the years since I finished my internship.

My work at NC State was part of a large multi-group effort. Thanks to the H3 team,

Zhenqian Zhang, Randy Widialaksono, Josh Schabel, and Thomas Belanger. Also a big thanks to Steve Lipa, not just for fun times in the lab but also for the great pointers to music, books,

and the odd movie or two. And also thanks for loaning tools and an extra hand when I get

myself into car projects.

Thanks to the many friends I’ve made both in CESR and in NC State ECE in general.

Thanks to the guys in Tom Conte’s crew: Jason Poovey and Chad Rosier, I missed having you

guys around after you left. Thanks George Patsilaras for the hilarious car buying catastrophe. Thanks to Ahmad Samih, Amro Awad the night owl, Bagus Wibowo, Jenn Gamble, Julian

Taylor, Shivam Priyadarshi, and Devesh Tiwari my partner in crime when it comes to teaching

undergrads. I’ll always have great memories of the guys in Eric’s research group: thanks Vinesh Srinivasan for great work and help with chip debugging, Rangeen Basu Roy Chowdhury for

the help in all things EDA, Muawya Al-Otoom for late night Cook Out trips and “other” late night things, Mark Dechene for your awesome research acumen, Hashem Hashemi, Salil

Wadhavkar, Sandeep Navada, Brandon Dwiel my best beer drinking buddy and hockey linemate,

and Sungkwan Ku for making sure I find the best Korean BBQ in Seoul. A big thanks to Niket Choudhary for what will for sure be a lifetime of research collaboration and friendship. I really


to come up with new ideas.

Thanks to my committee James Tuck, Huiyang Zhou, Rhett Davis, and Emerson Murphy-Hill for their helpful insights and new perspectives on my work. I also want to thank the

professors in the department that have helped either in classes, teaching, or otherwise. Thanks

Keith Townsend, my classic car connection in North Carolina. Edward Grant was great for pep talks and pushing me to just keep going. Thanks to Greg Byrd who was the best teaching

mentor and friend that I could have hoped for – just know that someday I’m going to make

you teach me how to fly fish. And, of course none of this would have been possible without my advisor, Eric Rotenberg. Thanks for the support, thanks for the sacrifice you make for your

students, thanks for being the hardest working person I think I’ve ever met.

This thesis was supported in part by Intel and NSF grant No. CCF-1218608. Any opinions, findings, and conclusions or recommendations expressed herein are those of the author and do


TABLE OF CONTENTS

List of Tables

List of Figures

Chapter 1 Introduction

1.1 Challenge in Migrating Physical Register File Values

1.2 Comparison

1.3 Contributions and Future Work

Chapter 2 Motivation

2.1 Benefit of Heterogeneity

2.2 Thread Migration

2.2.1 Architectural Study

2.2.2 3D Physical Design Study

2.3 Related Work

Chapter 3 Alternatives for Hardware Thread Migration

3.1 Overview

3.2 No Migration Hardware

3.3 Hardware EPC Migration

3.4 Hardware FTM

3.5 Asynchronous FIFO Migration

3.6 Compulsory TRF

3.7 Results

3.8 Taxonomy

Chapter 4 Modeling Heterogeneous Cores

4.1 Verilog RTL Model

4.2 Low Design Effort C++ Model

4.3 High Design Effort C++ Model

4.3.1 Balancing Pipeline Stages

4.3.2 Transistor Sizing

4.3.3 Pulse Latches

4.3.4 Layout

4.3.5 High-Effort Scaling Model

4.4 Core Palette

4.5 Metrics

4.6 Workload

4.7 Cycle-Level Simulator

5.1 Implementation of H3

5.1.1 Global Migrations

5.1.2 Local Migrations

5.2 Test Infrastructure

5.2.1 Duct Tape – A High-Level Assembler

5.2.2 FPGA Chip Signal Driver

5.2.3 Host Interface

5.3 Results

5.4 Errata

5.4.1 Load Miss

5.4.2 Clock Inputs

5.4.3 Ammeter

5.4.4 I-cache Requests

5.4.5 D-Cache Hold Violations

5.4.6 TRF Reset

5.4.7 CTIQ Full

5.4.8 CTIQ Reset

5.4.9 CCD Pulse Synchronization

Chapter 6 Static Phase-to-Core Mapping

6.1 Benefits of Static Analysis for Migration

6.2 Statistical Learning – Classification

6.3 Naive Bayes Classification Postmortem

Chapter 7 Summary


LIST OF TABLES

Table 1.1 Comparison of the H3 thread migration with other published architectures.

Table 4.1 EDA tools used in this work.

Table 4.2 Pulse latch and flip-flop characterization.

Table 4.3 The palette of 18 cores considered for evaluation.

Table 5.1 H3 Core Types

Table 5.2 Additional signals required to support FTM.


LIST OF FIGURES

Figure 1.1 Potential for copying register values between cores.

Figure 2.1 Average performance and efficiency varying the number of heterogeneous cores.

Figure 2.2 Per-phase performance and efficiency of two heterogeneous cores.

Figure 2.3 Performance and efficiency of various interval sizes.

Figure 2.4 Number of migrations at 1,000 instruction intervals.

Figure 2.5 Comparison of coarse-grain and fine-grain heterogeneity for both performance and efficiency.

Figure 2.6 Performance and efficiency relative to ideal with various migration cycle penalties.

Figure 2.7 Efficiency relative to ideal with various migration energy penalties.

Figure 2.8 Depictions of 2D and 3D layouts of fast thread migration (best viewed in color).

Figure 2.9 Routing overflows due to placement density and PRF connectivity.

Figure 2.10 PRF-to-PRF swap latency.

Figure 3.1 Spectrum of hardware migration alternatives.

Figure 3.2 Baseline cores with no hardware migration support.

Figure 3.3 Cores augmented with hardware support for migrating EPC only.

Figure 3.4 Cores augmented with hardware support for migrating all registers using TRF.

Figure 3.5 Cores augmented with hardware support for migrating all registers using asynchronous FIFO.

Figure 3.6 Cores augmented with hardware support for migrating all registers with compulsory TRF reads/writes.

Figure 3.7 Constraining pipeline stage paths with pipeline registers.

Figure 3.8 Achievable clock period of each pipeline stage.

Figure 3.9 Power overhead of migration hardware.

Figure 3.10 Area overhead of migration hardware. Note that the y-axis does not start at zero.

Figure 3.11 Migration latency both with and without (for clarity) the EPC Migration latency.

Figure 3.12 Energy required for a complete thread migration.

Figure 3.13 Average migration performance when taking into account all power, energy, and cycle penalties.

Figure 3.14 Per-phase migration performance of FTM.

Figure 3.15 Per-phase migration efficiency of FTM.

Figure 3.16 Taxonomy of migration alternatives.

Figure 4.1 C++ models for estimating low and high effort design.

Figure 4.2 Frequency and energy trend for transistor sizing.

Figure 5.1 Die photos.

Figure 5.2 Block diagram of two core stack.

Figure 5.3 Timing diagram for global migration of two threads.

Figure 5.4 Timing diagram for local migration (one thread only).

Figure 5.5 Block diagram of test infrastructure.

Figure 5.6 Program organization of a complete dt program.

Figure 5.7 Conditional array summation example dt source code.

Figure 5.8 Assembled H3 test PCB.

Figure 5.9 Test workstation.

Figure 5.10 Migration latency of the 2D and 3D prototype chips.

Figure 5.11 Oscilloscope screen capture showing power supply voltage during a chip reset.

Figure 6.1 Percent of each benchmark that is spent in an inner loop.

Figure 6.2 Prediction accuracy of various classification algorithms.

Figure 6.3 Distribution of feature values for all inner loops.

Figure 6.4 Pairwise comparison of all features.

Figure 6.5 Accuracy of Gaussian Naive Bayes for each loop, sorted by the performance ratio between the two cores.

Figure 6.6 Histogram of the performance ratio between the two cores.

Figure 6.7 Accuracy of Gaussian Naive Bayes for each loop, sorted by how heavily biased a loop is toward either core.


Chapter 1

Introduction

Historically, as steady improvements in compute capacity have been made, programmers have always filled that capacity either to solve ever larger problems, or to solve the same-sized

problems more efficiently. Currently, system capacity improvements are threatened by the end

of Dennard scaling, and the possible end of Moore’s law. No longer can we rely on a smaller transistor delivering a higher frequency at lower power, and soon we may not even be able to

rely on a transistor getting smaller.

One potential avenue for improving compute capacity is by employing multiple cores within a processor, each with a different microarchitecture – a style of processor commonly referred to

as heterogeneous chip multi-processors (HCMPs). The cores within an HCMP all implement

the same instruction set (ISA), but vary in superscalar widths, pipeline depths, and sizes of structures. This body of work, spurred by seminal work by Kumar et al. [33] [35], recognizes

that different programs have different instruction-level behavior, and even a single program may

change behavior during runtime. It is not always possible to design a single processor core that is best suited for all programs or program phases. So instead of having multiple cores of a

processor all of the same design, a mix of different core types should be employed. This has

the effect of specializing a processor to better match a program, provided that the program is executed by the core that most efficiently matches its needs.

This thesis makes the case that performance and energy efficiency of HCMPs can be further

improved if a program can be moved between cores at the lowest possible cost. The operating system has traditionally handled the management of threads. But to achieve the performance

and energy goals, I propose foregoing the heavy-weight computation required by the OS

scheduling and thread management. Instead, I assume a system in which core pairs are presented to

the OS as a single core with multiple logical thread contexts. The OS can assign a thread to


The costs of a thread migration that remain in a hardware scheme are the copying of the

architectural registers, and flushing various pipeline structures of the new core including the handling of cache state. There are also indirect costs associated with retraining speculative

structures such as branch predictors and dependence predictors. Previous work [11] has shown

that when a migration is potentially beneficial, the program memory state can be minimized to only the predicted working set which can be pre-emptively copied to the new core before the

migration occurs. In this thesis, I focus on one of the remaining challenges to a low-overhead

migration: that of copying the register state. The next section highlights why this problem is more difficult than it appears at first glance.

1.1 Challenge in Migrating Physical Register File Values

The physical register file (PRF) of a modern superscalar, out-of-order processor [62] houses both

the known-safe committed register values as well as speculative values. This is made possible by

a register renaming mechanism, and a PRF with more registers than the architectural minimum. Any instruction with a destination register operand will write the result value directly to the

PRF before knowing whether the instruction is valid. Instructions may later be found to have

executed with incorrect source values or may be on the wrong program control path. In that case, the speculative values are discarded by undoing the logical-to-physical register mapping

for that instruction destination register. If the instruction is found to have correctly executed,

then the logical-to-physical mapping is retained, and the previous logical-to-physical mapping for that logical destination register can be freed. This has the effect of spreading logical values

throughout the PRF such that logical mappings reside in non-contiguous and unordered PRF

locations.
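
The scattering effect described above can be illustrated with a minimal sketch of register renaming. This is not the thesis's RTL; the class and method names (`RenamedCore`, `rename_dest`, `commit_oldest`, `squash_youngest`) are hypothetical, and the model assumes a single rename map table with a simple free list.

```python
# Illustrative model of register renaming; all names here are hypothetical.
class RenamedCore:
    def __init__(self, num_logical, num_physical):
        # Assume logical register i initially maps to physical register i.
        self.rmt = list(range(num_logical))                 # rename map table
        self.free = list(range(num_logical, num_physical))  # free physical registers
        self.in_flight = []  # (logical, new_phys, prev_phys), oldest first

    def rename_dest(self, logical):
        """Allocate a new physical register for a destination operand."""
        new_phys = self.free.pop(0)
        prev_phys = self.rmt[logical]
        self.rmt[logical] = new_phys
        self.in_flight.append((logical, new_phys, prev_phys))
        return new_phys

    def commit_oldest(self):
        """Instruction executed correctly: keep its mapping, free the previous one."""
        _, _, prev_phys = self.in_flight.pop(0)
        self.free.append(prev_phys)

    def squash_youngest(self):
        """Instruction was invalid: undo its mapping and reclaim its register."""
        logical, new_phys, prev_phys = self.in_flight.pop()
        self.rmt[logical] = prev_phys
        self.free.append(new_phys)


core = RenamedCore(num_logical=4, num_physical=8)
core.rename_dest(2)      # $r2 is renamed to $p4
core.commit_oldest()     # $p2 is freed
core.rename_dest(0)      # $r0 is renamed to $p5
core.commit_oldest()     # $p0 is freed
core.rename_dest(1)      # $r1 is speculatively renamed to $p6
core.squash_youngest()   # $r1 reverts to $p1 and $p6 is freed
print(core.rmt)          # logical-to-physical mapping after these events
```

After only a few commits and one squash, the committed values of the four logical registers no longer sit in ascending, contiguous physical locations, which is the property that complicates a direct PRF copy.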

The size of the PRF is carefully tuned by the design team when implementing a core. A

larger PRF can support deeper speculation, but comes at the cost of per-access latency. The

PRF may need to be read by multiple instructions each cycle, so a large PRF may partly dictate the achievable core frequency. When considering an HCMP, the PRF is a key differentiating

parameter between the different core types. Some programs may derive a substantial benefit

from a processor with deep speculation, whereas other programs might better utilize a core with a higher frequency. Thus, it is likely that a hardware thread migration mechanism will

need to cope with PRFs of different sizes, as well as different frequencies.

Chapter 2 shows the impact that a low-latency and low-energy thread migration can have on overall performance and efficiency of a program. The important take-away from those studies

is that for maximum benefit, the migration should take 100 cycles or less. It is tempting to consider copying the entire PRF contents from one core to another. But Figure 1.1 shows why


renaming of four logical registers. One PRF is 8 entries, while the other is 16, and allows for

deeper speculation (we can ignore the timing differences for now). In this case, the contents of the smaller PRF can be copied directly to the larger PRF, providing that the rename map

table (RMT) is also copied. No problem exists with this example. However, if a program must

be migrated from the core with a larger PRF to the core with the smaller PRF, it might be the case that the larger PRF has mapped to physical entries that do not exist in the smaller PRF.

This situation is shown in Figure 1.1b. In this example, physical registers $p11 and $p13 do not

exist in the smaller PRF. Copying the RMT will not help in this case; the mismatch must be remedied with an intermediate step.

Figure 1.1c satisfies this intermediate step. The RMT (not shown) is used to rename logical
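
The size mismatch can be checked mechanically. The sketch below is an illustration only: the helper name `missing_physical_registers` is hypothetical, and the two mappings are example values chosen to be consistent with the 8-entry and 16-entry PRFs and the $p11/$p13 case discussed above (the order of entries within each RMT is an assumption).

```python
def missing_physical_registers(rmt, dest_prf_size):
    """Return the physical indices named by an RMT that the destination PRF lacks."""
    return [phys for phys in rmt if phys >= dest_prf_size]


# Example mappings (assumed for illustration).
small_core_rmt = [1, 5, 7, 2]     # held in the 8-entry PRF
large_core_rmt = [11, 3, 0, 13]   # held in the 16-entry PRF

# Small-to-large copy: every mapped entry exists in the 16-entry PRF.
print(missing_physical_registers(small_core_rmt, dest_prf_size=16))

# Large-to-small copy: some mapped entries have no counterpart in the 8-entry PRF.
print(missing_physical_registers(large_core_rmt, dest_prf_size=8))
```

An empty result means the PRF and RMT can be copied as they are; a non-empty result identifies exactly the registers that force the intermediate remapping step.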

registers to consolidate the architectural registers back into an architectural register file (ARF). This puts register values in contiguous ARF locations and in the correct order. With this step

complete, ARF values can be copied from one core to the other, or with clever design, an

exchange of the ARF contents can be performed.
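
A minimal sketch of the consolidation step follows, assuming the PRF is modeled as a list indexed by physical register number and the RMT as a list indexed by logical register number. The function name `consolidate` and the example values are hypothetical.

```python
def consolidate(prf, rmt):
    """Gather committed values in logical order: entry i holds the value of $r<i>."""
    return [prf[phys] for phys in rmt]


# Large core: 16-entry PRF with logical values scattered by renaming.
large_rmt = [11, 3, 0, 13]
large_prf = [None] * 16
for logical, phys in enumerate(large_rmt):
    large_prf[phys] = f"large_r{logical}"

# Small core: 8-entry PRF.
small_rmt = [1, 5, 7, 2]
small_prf = [None] * 8
for logical, phys in enumerate(small_rmt):
    small_prf[phys] = f"small_r{logical}"

# Consolidate each thread's state into contiguous, ordered ARF storage.
large_arf = consolidate(large_prf, large_rmt)
small_arf = consolidate(small_prf, small_rmt)

# Both ARFs have the same size, so the contents can simply be exchanged.
large_arf, small_arf = small_arf, large_arf
print(large_arf)
print(small_arf)
```

Because both ARFs hold exactly one entry per logical register, the exchange no longer depends on either PRF size or on how renaming scattered the values.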

Even without the PRF size mismatch problem discussed above, copying directly from one

PRF to another PRF is a tenuous prospect. Supposing a core pair in which both PRFs are

the same size, it would at least require additional PRF read and write ports which are used by the opposite core. The PRFs are already highly ported, so adding yet another read and write

port is likely to impact the clock frequency of the entire pipeline. Making matters worse, the cores may operate asynchronously even with the same sized PRFs. A Teleport Register File

(TRF) [46] [63] can be used in concert with the PRF to solve these issues. A TRF can be

used like an ARF, but additionally allows for the bulk exchange of values from one TRF to another TRF. But the design of a TRF requires implementing the registers in flip-flops instead

of SRAMs since the bulk copy requires access to each bitcell of the memory array. A possible

design that works around these issues is to have a TRF outside of the core, and to introduce a new instruction whose sole purpose is to go through the full pipeline as usual, reading values

from the renamed physical registers and then copying their values into the TRF during the

Execute Stage. Another new instruction performs the reverse action of reading the TRF after the TRF exchange, and writing to a destination register which has been properly renamed to

a physical register in the PRF.
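
The instruction-based scheme described above can be sketched as a small behavioral model. This is an assumption-laden illustration, not the H3 implementation: the names `copy_to_trf`, `copy_from_trf`, and `exchange` are hypothetical stand-ins for the two new instructions and the bulk TRF swap, and timing, pipelining, and asynchronous clocking are ignored.

```python
class TeleportRegisterFile:
    """One side of a TRF pair, modeled as a list with one entry per logical register."""
    def __init__(self, num_logical):
        self.values = [0] * num_logical


def exchange(trf_a, trf_b):
    """Bulk exchange of all entries between two TRFs (one step in this model)."""
    trf_a.values, trf_b.values = trf_b.values, trf_a.values


class Core:
    def __init__(self, prf_size, initial_values):
        n = len(initial_values)
        self.prf = [0] * prf_size
        # Assumed starting state: logical registers scattered at the top of the PRF.
        self.rmt = [prf_size - 1 - i for i in range(n)]
        for logical, phys in enumerate(self.rmt):
            self.prf[phys] = initial_values[logical]
        self.free = [p for p in range(prf_size) if p not in self.rmt]
        self.trf = TeleportRegisterFile(n)

    def copy_to_trf(self, logical):
        """Hypothetical instruction: read the renamed source, write the TRF entry."""
        self.trf.values[logical] = self.prf[self.rmt[logical]]

    def copy_from_trf(self, logical):
        """Hypothetical instruction: read the TRF entry, write a newly renamed destination."""
        new_phys = self.free.pop(0)
        self.free.append(self.rmt[logical])   # previous mapping is freed
        self.rmt[logical] = new_phys
        self.prf[new_phys] = self.trf.values[logical]

    def read(self, logical):
        return self.prf[self.rmt[logical]]


def migrate(core_a, core_b, num_logical):
    for r in range(num_logical):      # drain both threads into their TRFs
        core_a.copy_to_trf(r)
        core_b.copy_to_trf(r)
    exchange(core_a.trf, core_b.trf)  # bulk swap between the two TRFs
    for r in range(num_logical):      # refill each PRF through normal renaming
        core_a.copy_from_trf(r)
        core_b.copy_from_trf(r)


big = Core(prf_size=16, initial_values=[10, 11, 12, 13])
small = Core(prf_size=8, initial_values=[20, 21, 22, 23])
migrate(big, small, num_logical=4)
print([big.read(r) for r in range(4)])
print([small.read(r) for r in range(4)])
```

In this model the PRFs are never wired to each other and need no extra ports; all cross-core traffic passes through the TRFs, and each core reads and writes its own PRF through the existing rename path.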

An alternative superscalar, out-of-order pipeline implementation exists [57] that has separate storage for known-safe committed values and speculative values. It is feasible for this style

of pipeline to eschew the renaming intermediate step and new instructions, and exchange the

architectural register values with another core. Pipelines of this style are not considered in this thesis, however, for two reasons. First, while register values would no longer require

consolidation, the copy/exchange of values from one core to another would still require bitcell-level


[Figure 1.1: Potential for copying register values between cores. (a) Small-to-large PRF copy. (b) Large-to-small PRF copy. (c) Remapped PRF copy. Legend: free register, committed register, speculative register.]


slow the entire processor as mentioned above. Second, architectures designed in this style are

not typically used in modern processors. This is due to several complexities in hazard-checking logic and bypassing. The modern PRF approach described above simplifies these issues and

has become the dominant form of out-of-order execution in modern processors. It would be a

difficult proposal to suggest revisiting a known-inefficient pipeline implementation simply to make it easier for thread migration.

Taking these issues into account, I propose in this thesis to introduce a TRF to a

heterogeneous core pair which resides outside of the pipeline. Access to the TRF is enabled by new instructions. Once supplied with register values, the TRF can exchange values between the

core pair. The bulk copy of values comes at a high cost in the number of wires that must be

routed. To ease this cost, the two cores can be split between two tiers of a 3D die-stacked chip multi-processor. In collaboration with a large design team, this design was implemented and

fabricated as a proof of concept. This project is called “H3”, which stands for heterogeneity in

3D.

1.2 Comparison

Table 1.1 summarizes several other recent works that relate to the H3 project and how they fare compared to H3. ARM has demonstrated their commitment to heterogeneous chip multiprocessors

with their big.LITTLE [27] architecture. Their HCMP consists of several “big” Cortex-A15 cores

with a “little” Cortex-A7. The goal of big.LITTLE is to minimize power by using the little core as often as possible, relying on the big cores when performance needs justify the extra power

consumption.

Composite Cores was proposed [39] as a way to completely forgo migration in favor of a single architecture that has multiple pipeline back-ends. One back-end is lightweight by virtue of its

in-order execution model. The other back-end is a high-performance out-of-order execution

paradigm. Both of these back-ends share a unified front-end (instruction fetch, decode, etc.).

The Execution Migration Machine [38] (EM2) suggests moving threads to their required data

instead of moving data between cores. To achieve this goal, they implement a low-complexity

stack-based ISA that minimizes the amount of thread state that must be transferred to other cores. Their design was fabricated in a 110-core homogeneous CMP.

None of these competing architectures meet all of the goals set out for H3, as described in

this thesis and in [46]. The big.LITTLE implementation has no hardware migration support, and thus relies on the operating system to move threads between cores. This results in a high

latency (they cite 20 thousand cycles) migration, limiting the frequency of migration. Composite Cores provides a way to partially realize some of the benefits available in an HCMP. But two


Table 1.1: Comparison of the H3 thread migration with other published architectures.

| Architecture | Thread Migration Latency | Distinct (separate) cores | Asynchronous (GALS) | Register-based ISA | Evaluation Methodology |
|---|---|---|---|---|---|
| ARM big.LITTLE | 20,000 cycles | Yes | Yes | Yes | Real system |
| Composite Cores | <32 cycles | No (shared front-end and data cache) | No | Yes | C++ simulator |
| Execution Migration Machine | <100 cycles | Yes | No | No (stack-based ISA for partial context transfer) | RTL simulation and synthesis; chip fabricated, measurements not yet reported |
| H3 FTM, this thesis | <100 cycles | Yes | Yes | Yes | RTL simulation, fabricated chip measurements |

frequencies can be different (another knob to turn in providing architectural diversity). And

EM2 has a novel approach to keep on-chip network costs down by moving threads instead of

data. But EM2 may lose generality by virtue of the reliance on a stack-based ISA. Additionally,

it is unclear if their prototype was functional, as no measurements have been reported.

To my knowledge, the H3 chip was the first demonstration of a fabricated heterogeneous

multi-core consisting of two out-of-order superscalar cores. It was also the first hardware thread migration between two asynchronous cores. And the taped-out 3D design is poised to be the

first fabricated die-stacked pair of out-of-order superscalar cores.

1.3 Contributions and Future Work

This thesis makes the following contributions:

The study of the architectural impact of low-latency, low-energy thread migrations. This is

done in a way that does not presuppose any particular migration mechanism. The result is that limiting the overhead of thread migrations unlocks the opportunity to migrate more

frequently. This, in turn, makes it feasible to migrate fine-grained program phases for an

additional performance and efficiency benefit over that already exposed by coarse-grain heterogeneity.


have the ability to exchange values. Die-stacking can satisfy both the high-bandwidth

and low-latency wiring that a hardware thread migration approach requires.

A study of various implementations of fast register transfer. The implementations trade

migration latency for power. These implementations are modeled in Verilog RTL and

suggest that the best alternative may be one that uses new instructions that copy data

into and out of a TRF.

Various evaluations require a high number of full-processor simulations, necessitating the

use of a fast C++ simulator. Also, these simulations must account for energy and timing based on physical design data. The physical design data can be derived using existing

FabScalar-based [19] tools, but these do not account for high design-effort by virtue of

relying on FabScalar’s automated, standard cell approach. To overcome this, I estimate the timing and energy of high design-effort cores by carefully crafting scaling parameters

based on the low design-effort estimates of FabScalar cores.

The test infrastructure and measured results of a fabricated 2D prototype chip. The

prototype chip was fabricated to vet any functional bugs and demonstrate the capability of hardware thread migration. The testing proved useful, as several interesting problems

were found and their fixes incorporated into a taped-out 3D die-stacked design – the

end-goal for the H3 project.

This thesis also studies the use of static program analysis with statistical learning in an attempt to predict the best mapping of phases to cores without first running the program. The

methods studied failed to produce accurate predictions. However, in a postmortem analysis of

the technique, I find that there may never be enough information inherent in only the static program characteristics to have a highly-accurate, static phase-to-core prediction. Future work

can leverage these lessons to try to find the best balance between offline static characterization


Chapter 2

Motivation

This chapter focuses attention on the performance and efficiency benefit of heterogeneity (Section 2.1), the potential further improvement when the overhead of migration is minimized

(Section 2.2), and the specific advantages that a migration mechanism in a 3D die-stacked core

pair provides (Section 2.2.2).

Previous work on this topic [27] [39] [46] typically studies a heterogeneous system in which

the cores can always have their performance ranked – that is, there is a “big” high-performance

core, and a “little” low-power core. Fast hardware thread migrations can certainly benefit such systems. In this thesis, I will focus on migrating between cores that cannot be

performance-ranked [34] [42]. In systems of this type, both cores are “big”. For instance, one core might have

high peak instruction bandwidth, but at a modest frequency compared to another core which has lower peak instruction bandwidth but a high frequency. Some programs may not be able to

take advantage of a wide core, and thus, the higher frequency is more beneficial. Establishing

performance on these non-monotonic cores is challenging, as the choice of which core should run a given phase is not always readily apparent.

2.1 Benefit of Heterogeneity

Heterogeneous multi-core processors can realize a performance and efficiency benefit without the

need for fast thread migration. While several works have shown the advantages of heterogeneous

microarchitectures [33] [35], in this section I show the potential within the methodology used for this thesis.

Using the low design-effort model from Chapter 4, I establish the overall performance (in

BIPS) and efficiency (in BIPS^3/W) of all program phases on all core configurations, producing

roughly 3200 data points for each metric. These metrics assume that a phase is run on a


Figure 2.1: Average performance and efficiency varying the number of heterogeneous cores. (a) Performance: average percent BIPS increase versus n, the number of core types. (b) Efficiency: average percent BIPS^3/W increase versus n, the number of core types.

Figure 2.2: Per-phase performance and efficiency of two heterogeneous cores. (a) Performance: percent BIPS increase (n=2) per program phase; geometric mean = 11.2%. (b) Efficiency: percent BIPS^3/W increase (n=2) per program phase; geometric mean = 10.4%.

granularity of thread migration is the entire program phase. With the metrics for all core

configurations, I run an exhaustive design space exploration (DSE). The DSE tool finds the

highest average performance or efficiency for a combination of n cores, where n is varied. The

results of these explorations for several values of n are shown in Figure 2.1. The baseline for these

graphs is the best overall homogeneous core for a given metric, which is the core configuration

found by the DSE tool when n=1.
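
As a rough illustration of the exhaustive search described above (not the actual DSE tool), the selection can be sketched in Python. The phase names, core names, and scores below are hypothetical, and the sketch assumes that the quantity being maximized is the arithmetic mean of per-phase scores, with each phase pinned to whichever core in the chosen set suits it best.

```python
from itertools import combinations

def best_combination(metric, n):
    """metric[phase][core] -> score, higher is better (e.g. BIPS).
    Returns (best average score, tuple of core names) over all n-core sets."""
    cores = sorted(next(iter(metric.values())))
    best_avg, best_set = float("-inf"), None
    for subset in combinations(cores, n):
        # Each phase runs entirely on the core in the subset that suits it best.
        total = sum(max(scores[c] for c in subset) for scores in metric.values())
        avg = total / len(metric)
        if avg > best_avg:
            best_avg, best_set = avg, subset
    return best_avg, best_set

# Hypothetical scores for three phases on three made-up core types.
metric = {
    "phase0": {"A": 1.0, "B": 2.0, "C": 1.5},
    "phase1": {"A": 3.0, "B": 1.0, "C": 2.0},
    "phase2": {"A": 1.0, "B": 1.0, "C": 2.0},
}
base_avg, base = best_combination(metric, 1)   # best homogeneous core (n=1)
het_avg, het = best_combination(metric, 2)     # best heterogeneous pair (n=2)
gain = het_avg / base_avg - 1                  # fractional gain over the baseline
```

With 18 core configurations the number of subsets stays small enough for an exhaustive search of this kind to be practical for every value of n.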

These results show that even among long-running program phases, a heterogeneous

combination of cores can realize a performance and efficiency improvement. Program phases for this

experiment are 10 million dynamic instructions long. With even just two core configurations, performance is increased by about 11.2% and efficiency by about 10.3% on average. If a

processor employed all 18 core configurations, the average performance increase would be just over 16%

and the efficiency increase almost 19%.

Considering the design challenges associated with heterogeneous multi-core processors [19],

the average performance and efficiency gains may seem underwhelming. However, showing the

average hides the full potential of heterogeneous cores. Previous work [25] [42] has shown

that most program phases execute their best on a balanced core configuration. This is a core

configuration that is not especially wide or narrow and has average structure sizes. That same

balanced core configuration appears in all values of n. This means that when mapping phases


And when compared to the baseline homogeneous configuration (n=1), which is the same

core configuration, those phases see no additional performance or efficiency benefit. This has

the effect of pulling the overall average down. But not all phases are best on that “average”

core configuration. Figure 2.2 shows the performance and efficiency of each phase. Roughly

44% of the phases have a performance advantage on the heterogeneous processor, and about 32% of the phases have an efficiency improvement. Also, there are several phases that have a

significant improvement in performance and/or efficiency. The phases most impacted have a

50% to 130% performance improvement on the heterogeneous core pair, while several phases have an efficiency improvement over 100%. These gains are hidden when only considering the

average over all phases.

2.2 Thread Migration

The previous section pinned phases to cores for the entire duration of the phases’ execution.

In this section, I study the potential when allowing the program phases to migrate from core to core during the execution. I refer to this as “fine-grain” heterogeneity to emphasize that the

phase may execute on a given core for only a very short number of instructions before being

migrated to another core.

A thread migration is similar to an operating system (OS) context switch. The OS provides

context switching as a means to allow more running processes than there are processors in

the system. The details of a context switch are highly dependent on the OS as well as the underlying hardware. One overhead in an OS context switch that may not be necessary in a

thread migration is that of process scheduling (determining which ready process will run after

the next context switch). But, even without counting the cost of the OS scheduler there is overhead in saving the process control block, including registers, stacks, memory mappings,

and various privileges. Many of these tasks require kernel-level access, requiring the processor

to switch into and out of kernel mode – a potentially costly series of operations [36] [43]. In this thesis, I propose presenting a heterogeneous core pair to the OS as a single processor with

multiple thread contexts. The kernel is free to assign a process to a core pair, but once assigned,

threads are free to move between the two cores. After the initial assignment, this eliminates the overheads incurred by the OS if the thread migration is handled by hardware.

Therefore, for the studies in this section, n=2, and migrations can only occur between the

two cores. While this leaves some performance and efficiency improvement on the table, n=2

matches the proposed use of 3D die-stacking for ensuring low latency, high bandwidth migration


2.2.1 Architectural Study

To establish the performance and efficiency that can be expected of a hardware thread migration

scheme, I first focus on architecture-level studies. These experiments determine the effects of

the cost of thread migration without imposing any specific implementation.

Experimental Framework

To support the architectural performance analysis of thread migration, I start by simulating

all program phases on the low design-effort C++ model, outlined in Chapter 4. As simulations

progress, for every 1,000 dynamic instructions executed, the simulator saves the number of cycles needed and the energy spent to execute those 1,000 instructions. No architectural structures are

reset at these points, only the cycle and energy metrics are saved then reset. At the end of the

simulation, these statistics are numbered and saved for post-processing. Each program phase is simulated for a total of 10 million dynamic instructions, so there will be 10,000 1,000-instruction

metrics for each program phase.

By recording metrics at well-defined instruction boundaries, and keeping track of their program ordering, these 1,000-instruction segments can be aggregated by simply adding the

metrics of adjacent 1,000-instruction segments. This aggregation is also made possible by the

policy to keep pipeline structures “warmed” during execution. For example, the total cycles for a given core configuration to execute an entire program phase can be derived by simply adding

all 10,000 of the 1,000-instruction program segments for that phase. I refer to these aggregated

1,000-instruction program segments as “intervals”.
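
The aggregation of segments into intervals can be sketched as follows. This is a minimal Python illustration rather than the post-processing scripts used for the thesis; the function name and the cycle counts are hypothetical.

```python
def aggregate(segments, segs_per_interval):
    """Sum adjacent per-segment metrics (cycles or energy) into intervals."""
    if len(segments) % segs_per_interval != 0:
        raise ValueError("interval size must divide the number of segments")
    return [sum(segments[i:i + segs_per_interval])
            for i in range(0, len(segments), segs_per_interval)]

# Hypothetical cycle counts for six 1,000-instruction segments on one core.
cycles = [900, 1100, 1000, 1200, 800, 1000]
intervals_2k = aggregate(cycles, 2)            # three 2,000-instruction intervals
whole_phase = aggregate(cycles, len(cycles))   # one interval covering the phase
```

Summing is only valid because the segments were recorded in program order on a core whose structures were never reset between them.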

The strength of this approach manifests in several ways:

It provides the ability to change the interval size to mimic the ability to allow either small

or large migration regions.

It allows adding arbitrary cycle and energy penalties at interval boundaries to represent

the cost of a possible migration. Furthermore, the cycle and energy penalty could be due to the migration itself, or due to migration-induced events (such as cache misses that

would not have occurred had the migration never happened).

Since this is all done in post processing, it allows for oracle scheduling.

This means that we can see the effect of interval size, migration cycle penalty, and migration


Figure 2.3: Performance and efficiency of various interval sizes. (a) Performance: percent BIPS improvement versus interval size (instructions). (b) Efficiency: percent BIPS^3/W improvement versus interval size (instructions). Series: overall best average, all cores average, overall best max, all cores max.

Results

Figure 2.3 shows the performance and efficiency potential with different sized intervals, assuming

no cycle or energy penalty for thread migrations. Note that the x-axis indicates the number of dynamic instructions per interval, not the number of 1,000-instruction segments. There are four

trends shown per graph. The trends prefaced with “overall” refer to a core-selection policy in which migrations occur between the overall best core pair for all 179 program phases,

determined a priori by a DSE. The best two cores when using BIPS as the performance metric

are the LE-2W-S and LE-4W-M (“LE” refers to the low design-effort model), and the best

two cores when using BIPS^3/W as the efficiency metric are the LE-2W-S and LE-3W-M.

The trends in Figure 2.3 prefaced with “all” indicate that all 18 possible core configurations

are available, but migrations are only permissible between the two cores that are best for the phase being considered. That is, the two best core configurations are found on a per-phase

basis. For both of these core-selection policies, both the average improvement and the highest

improvement for any phase are shown.

The improvements shown in Figure 2.3 are the additional benefits over coarse-grain

heterogeneous cores gained by allowing thread migrations at various finer-granularity interval sizes.

The right-most data point of these graphs represents the coarse-grain heterogeneous cores, where no migrations occur once the phase is mapped to a core, and the left-most point represents the

finest granularity of thread migration. The baseline is the heterogeneous core pair assuming

no migrations, similar to the analysis in Section 2.1 where n=2. It is equally important

to notice that the biggest improvement in both performance and efficiency is realized when

the interval size is lowered from 10,000-instruction intervals to 1,000-instruction intervals. This

shows that the smaller the interval, the more potentially beneficial thread migration becomes. And thread migrations can be more frequent only if the overheads of those migrations are as

small as possible.
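
To make the effect of interval size concrete, the following Python sketch compares zero-penalty oracle migration against the coarse-grain baseline for one phase on a core pair. It is an illustration only: the per-segment values are hypothetical, and they are assumed to be execution times on a common time base (cycle counts from cores with different clock periods would first have to be converted to time).

```python
def oracle_time(time_a, time_b, segs_per_interval):
    """Total time with zero-penalty oracle migration between two cores."""
    total = 0
    for i in range(0, len(time_a), segs_per_interval):
        a = sum(time_a[i:i + segs_per_interval])
        b = sum(time_b[i:i + segs_per_interval])
        total += min(a, b)  # oracle: run each interval on the faster core
    return total

def improvement_over_coarse(time_a, time_b, segs_per_interval):
    """Fractional speedup of fine-grain migration over no migration."""
    coarse = oracle_time(time_a, time_b, len(time_a))  # whole phase on one core
    fine = oracle_time(time_a, time_b, segs_per_interval)
    return coarse / fine - 1

# Hypothetical per-segment times: core A alternates fast/slow, core B is steady.
time_a = [10, 30, 10, 30]
time_b = [20, 20, 20, 20]
gain_fine = improvement_over_coarse(time_a, time_b, 1)    # migrate every segment
gain_medium = improvement_over_coarse(time_a, time_b, 2)  # migrate every 2 segments
```

In this toy phase, the two cores tie at the larger interval size, so the benefit appears only at the finest granularity.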


Figure 2.4: Number of migrations at 1,000 instruction intervals. (Number of migrations for each program phase.)

Figure 2.5: Comparison of coarse-grain and fine-grain heterogeneity for both performance and efficiency. (a) Performance: percent BIPS improvement (fine grain) versus percent BIPS improvement (coarse grain). (b) Efficiency: percent BIPS^3/W improvement (fine grain) versus percent BIPS^3/W improvement (coarse grain).

cores used to produce this graph are the overall best two cores (as opposed to the per-phase

best two cores). This graph shows that under these conditions, thread migrations will occur

quite frequently. The vast majority of phases will switch cores 1,000 or more times during the 10 million instruction phase. And the phases at the top end of this graph will migrate more

than 6,000 times during the 10 million instruction phase.

An interesting pair of plots is shown in Figure 2.5. Each data point in these graphs represents a program phase. The position along the x-axis represents the improvement in performance

and efficiency for that phase that coarse-grain thread migrations have over a single core. The

position along the y-axis plots the additional improvement in performance and efficiency for

that phase that fine-grain thread migrations allow over the coarse-grain heterogeneity. These

graphs show that while some phases have a fine-grain migration benefit in addition to their coarse-grain benefit, there are a substantial number of phases that only get a heterogeneity

benefit when allowing fine-grain thread migrations. This is evident in the clustering of many


Figure 2.6: Performance and efficiency relative to ideal with various migration cycle penalties. (a) Performance: percent BIPS relative to finest granularity versus migration cycles penalty. (b) Efficiency: percent BIPS^3/W relative to finest granularity versus migration cycles penalty. Series: overall best average, all cores average, overall best min, all cores min.

Figure 2.7: Efficiency relative to ideal with various migration energy penalties. Percent BIPS^3/W relative to finest granularity versus migration energy penalty (nJ). Series: overall best average, all cores average, overall best min, all cores min.

The impact on performance and efficiency when penalizing a thread migration with various

per-migration cycle costs is shown in Figure 2.6. These were generated by picking the best

initial core based on the highest performance or efficiency for the first interval of a phase. For every interval thereafter, a comparison was made between the performance or efficiency of the

current core and the performance or efficiency of the opposite core plus the migration penalty.

The interval size for this graph is the 1,000-instruction (smallest) interval size and there is zero migration energy penalty. The cycle penalty size was increased until no migrations occurred

for any phase. The baseline for this data is the best-case, zero cycle ideal migration at the finest granularity of thread switching. Thus, the graph shows the performance and efficiency

retained when adding a cycle penalty for each migration. The “min” plots show the phase

whose performance or efficiency is degraded the most at the given cycle penalty. A knee in both performance and efficiency curves appears at the 100 cycle point. Thus, a good target for any

hardware thread migration scheme is near 100 cycles.
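
The penalized migration policy described above can be sketched in Python. This is a simplified illustration, not the scripts used to generate Figure 2.6: the core names and per-interval costs are hypothetical, and costs and penalties are assumed to share one unit on a common time base (lower is better).

```python
def schedule_with_penalty(cost, penalty):
    """cost[core] lists per-interval costs for the two cores of a pair.
    Returns (total cost including penalties, number of migrations)."""
    first, second = list(cost)
    # Start on whichever core is best for the first interval (no penalty).
    current = min((first, second), key=lambda c: cost[c][0])
    total, migrations = cost[current][0], 0
    for i in range(1, len(cost[first])):
        other = second if current == first else first
        # Migrate only if the other core still wins after paying the penalty.
        if cost[other][i] + penalty < cost[current][i]:
            current = other
            total += penalty
            migrations += 1
        total += cost[current][i]
    return total, migrations

# Hypothetical costs for four 1,000-instruction intervals on a core pair.
cost = {"core0": [1000, 1500, 1000, 1450],
        "core1": [1200, 1100, 1300, 1400]}
ideal, _ = schedule_with_penalty(cost, 0)
retained = {p: ideal / schedule_with_penalty(cost, p)[0] for p in (0, 100, 1000)}
```

As the penalty grows, marginal migrations are suppressed first, and at a large enough penalty the phase never leaves its initial core.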

Figure 2.7 shows the efficiency retained when imposing various energy penalties on each migration. The methodology for this graph is similar to that of Figure 2.6, except there is a zero

cycle migration penalty. Since performance is not impacted by energy consumption, it is omitted


Figure 2.8: Depictions of 2D and 3D layouts of fast thread migration (best viewed in color). (a) 2D baseline (only one of two cores shown). (b) 2D FTM. (c) 3D FTM.

2.2.2 3D Physical Design Study

The low-overhead migration mechanisms discussed in Chapter 3 require many additional wires and extra logic (muxes). It is also important for these wires to be as short as possible to

minimize their latency. In this section, I explore the pressure that these two requirements exert

on a layout, and project the extent to which a 3D die-stacked implementation can reduce the pressure. In particular, this section explores tradeoffs among routability, area and latency.

Experimental Framework

For these physical design experiments, I extracted a partial core from the FabScalar RTL [19].

The RTL includes the PRF and execution lanes (Register Read stage, function units, and Writeback stage including bypasses). This represents only the logic that influences the cycle

time of the PRF. Eliminating extraneous logic reduces the time needed for synthesis, placement

and routing (SPR), which is important as I sweep through many placement densities for three different PRF designs: no FTM, 2D FTM, 3D FTM. FTM refers to Fast Thread Migration –

the high wiring connectivity between physical register files. Moreover, focusing on just the

PRF-related stages yields more consistent results. FabScalar currently has pipeline stage imbalances that give SPR considerable leeway on the delay of some stages. This leeway masks some of the

effects that I measure, causes arbitrary variations across different SPR runs, etc.

With the RTL of this partial core as a starting point, I consider the following three designs. Refer to Figure 2.8 for simplified depictions of these designs. (I refer to the partial core simply

as “core” for the remainder of this section.)

2D baseline: This is a 2D layout of two instances of the core without FTM. Figure 2.8a

depicts one of the cores (the other core is not shown as there is no connectivity between the cores). The core is represented with a gray substrate. On the substrate is a PRF, in


in teal. Since each function unit reads from the PRF in its Register Read stage, there are

red wires drawn from each bitcell to each function unit.

2D FTM: This is a 2D layout of two instances of the core with FTM. Figure 2.8b shows

how the layouts of the two cores can be mirrored, with their PRFs placed close together at

the center of the die. The diagram also depicts the per-bitcell wiring required for swapping

the PRFs. The extra wires add congestion to the already crowded area near the bitcells.

3D FTM: This is a projected 3D layout of two instances of the core, one on each tier,

with FTM. This design is depicted in Figure 2.8c. For clarity, the top substrate is removed and the top PRF and function units are made transparent. For FTM, the PRFs are

connected by face-to-face vias, shown in white. The congestion of 3D FTM is expected

to fall somewhere between 2D baseline and 2D FTM.

The FreePDK45 technology libraries used in these studies do not include support for 3D

die-stacking. Consequently, 3D FTM is a 3D projection, based on 2D placement and routing of

the cores with routing obstructions that model the face-to-face vias connecting the two PRFs.

I model the routing blockages of face-to-face vias using the following methodology. First, I add a new D flip-flop to the LEF (geometry) file of the standard cell library. It is derived

from an existing D flip-flop. Its length is increased by two times (2x) the diameter and pitch

of a face-to-face via (coincidentally, it turns out that the standard cell height already matches the via diameter). The diameter and pitch were obtained from a Tezzaron whitepaper [28] (see

“bond points”). There are two vias per bitcell, to account for the incoming and outgoing bitcell values. The new flip-flop is about three times as long as the original flip-flop. The description

of the new flip-flop also includes metal layer obstructions (wiring blockages) on all metal layers

above the extended area of the flip-flop. Thus, when the new flip-flop is used for the PRFs, the routing algorithm steers clear of a vertical column through all metal layers down to each bitcell.

Second, the synthesized netlist is adjusted before placement and routing. All PRF flip-flops

are replaced with instances of the new flip-flop. Since we expect the connected bitcells to be placed directly above and below each other, the obstructions account for the routing that would

be generated by a 3D CAD flow or inserted by the physical designer. So one final modification

to the synthesized netlist is to remove the FTM connections between the PRFs – this keeps the muxes and bitcells intact, but eliminates the duplicate wiring that has been accounted for in

the obstructions.

3D FTM is a conservative model in two respects. First, it may not be necessary to obstruct

all metal layers. Each face-to-face bond point can be placed on the top-most metal layer, freeing

the router to complete connections to flip-flops underneath. Second, the diameter and pitch of


RTL is synthesized to the FreePDK 45nm standard cell library [54] using Synopsys

Design Compiler version E-2010.12-SP2. All three designs are placed and routed using Cadence Encounter RTL-to-GDSII System 9.11.

Results

To estimate the physical design impact, I perform an automated place-and-route of the three

designs. The only placement constraint applied is that each core must stay within a bounding box on one half of the die. Wiring congestion can be inferred from these routed designs by

counting the number of overflowed gcells. A gcell defines a region of routing within the total design

and consists of a number of routing tracks. When global routing must pass through a gcell, the number of used tracks within that gcell is incremented by one. Once global routing is completed,

a gcell with more signals routed through it than its capacity is considered an overflow.
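
The overflow definition above can be illustrated with a short Python sketch. It models only the counting rule stated in the text, not the behavior or report format of any particular router; the grid, capacities, and nets are hypothetical, and each net is assumed to consume one track in every gcell it crosses.

```python
from collections import Counter

def count_overflows(routes, capacity):
    """routes: list of nets, each a list of gcell coordinates it passes through.
    capacity: dict mapping gcell coordinate -> number of routing tracks.
    Returns the number of gcells with more routed nets than tracks."""
    used = Counter()
    for net in routes:
        for gcell in set(net):   # assumption: one track per net per gcell
            used[gcell] += 1
    return sum(1 for gcell, n in used.items() if n > capacity[gcell])

# Hypothetical 2x2 grid of gcells with two tracks each.
capacity = {(x, y): 2 for x in range(2) for y in range(2)}
routes = [
    [(0, 0), (0, 1)],
    [(0, 0), (1, 0)],
    [(0, 0), (1, 1)],
    [(0, 1), (1, 1)],
]
overflows = count_overflows(routes, capacity)   # only gcell (0, 0) is over capacity
```

Adding the PRF-to-PRF wires of 2D FTM corresponds to adding many nets that all cross the gcells between the two register files, which is why the overflow count rises so sharply for that design.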

For each design, I vary the standard cell placement density from 80% to 30% and measure

the number of overflows, area, and latency of the PRF-to-PRF value exchange (for 2D FTM

and 3D FTM).

The graph in Figure 2.9 shows overflows (y-axis) as a function of area (x-axis). Each point is labeled with the placement density used for that point. As one would expect, increasing density

decreases area but increases overflows. If confined to a 2D layout, congestion is drastically

increased when the PRFs are connected, evident in the large increase in overflows from 2D

baseline to 2D FTM for a given area. This substantial increase in congestion may lead to a

difficult-to-route and/or lower frequency design at best, or an unroutable design at worst. The

graph also confirms the hypothesis that 3D FTM should fall between 2D baseline and 2D FTM.

In fact, we see that 3D FTM is always better (fewer overflows) than 2D FTM for a given area.

The graph in Figure 2.10 factors latency into the tradeoff analysis for the two FTM designs.

The graph re-plots overflows on the primary y-axis with solid lines, and superimposes the latency of the PRF-to-PRF value exchange on the secondary y-axis with dashed lines. The

latency of 2D FTM is measured directly from the post-routed netlist. The latency of 3D FTM

is constant and is assumed to be the lowest latency of 2D FTM (at its most dense point, where

wires are shortest). We reason that the latency is not only low, but also independent of density,

because every flip-flop is directly above or below its counterpart. In contrast, the latency of 2D

FTM is very sensitive to density. Thus, the 2D layout suffers from a difficult tradeoff: either

increase density to reduce latency, and pay the price in terms of lower routability and more

physical design effort, or decrease density and pay in terms of higher latency. The 3D layout does not pose this tradeoff: density can be decreased for a more routable design, with no impact

on latency.


Figure 2.9: Routing overflows due to placement density and PRF connectivity. Overflows (thousands) versus area (sq. microns) for 2D baseline, 3D FTM, and 2D FTM; each point is labeled with its placement density (30% to 80%).

Figure 2.10: PRF-to-PRF swap latency. Overflows (thousands) and latency (ns) versus area (sq. microns) for 3D FTM and 2D FTM.

challenges in a 2D design. A 2D design requires the structures holding the state to be exchanged

or externally referenced to be near one edge of each core. This placement may not be optimal for performance and energy of the core. That is, intra-core and inter-core floorplanning

may have competing interests. Moreover, as additional structures are considered for inter-core

exchange or referencing, it may not be feasible to locate all of them at one edge. With 3D die-stacking, structures can be placed anywhere within the core as long as their counterparts

are directly above or below. This satisfies both intra-core and inter-core interests and allows


2.3 Related Work

Heterogeneous multi-core processors have been shown to be a possible way to increase the

performance and efficiency of general purpose workloads. The concept of pairing cores of different microarchitectures was first introduced by Kumar, et al. in several seminal

publications [33] [34] [35]. Their initial work established the power reduction [33] and multithreaded

performance [35] improvements made possible by considering a mix of pre-existing designs (various implementations of the Alpha ISA). Their follow-up work [34] explored the possibility that

performance of a heterogeneous multi-core can be best achieved by considering cores that may

not have previously been designed.

Spurred by these seminal works, several other academic proposals followed. Suleman et

al. [56] found that a heterogeneous mix of cores was particularly advantageous for multithreaded applications. Their key insight was that highly parallel code sections could effectively be handled

by smaller cores, but critical sections impose a serialization point in the program and should

be executed by a core designed for the highest possible performance.

Najaf-abadi et al. [41] use a heterogeneous multi-core to improve a single threaded program

by redundantly running a program on multiple cores simultaneously. The cores are

heterogeneous and have a low latency communication channel between them. As a program executes, the core that is able to make the fastest forward progress will pass result values to the lagging

cores, keeping them at nearly the same point in the program’s execution. As the program

characteristics change, one of the other cores may start to out-perform and overtake the previously best core. The new leading core will then start to pass results to the other cores. This has the

advantage that the best performing core for a particular program phase does not need to be

determined a priori, nor does a program need to be moved to react to program behavior.

The advantages of heterogeneous multi-core processors have been well-established enough

that several industry designs have taken the approach [4] [27] [58]. These designs all use a


Chapter 3

Alternatives for Hardware Thread

Migration

This chapter outlines several possible alternatives for hardware thread migration. Each

alternative was implemented in the Verilog RTL model and simulated for thousands of

migrations at various relative frequencies to ensure functional correctness. In keeping with the theme of this thesis, these alternatives explore the exchange of program register values, leaving the

memory state migration to other existing and future work.

Each alternative is implemented on two baseline, “reference cores”, described in Section 3.2.

These cores are heterogeneous and out-of-order superscalar cores, but both cores are on the

lesser end of implementation complexity (neither core has especially high peak instruction bandwidth or large structures). These reference cores are the same cores used in the H3 fabricated

prototype chip, and their architectures are fully enumerated in Table 5.1 in Chapter 5. Using

small reference cores makes the estimates of hardware overhead as conservative as possible – as the migration is added to the minimum possible backdrop. Each alternative (Section 3.3

through Section 3.6) is described with respect to the changes needed to the pipelines of these

reference cores.

The key aspect for each of these designs is that they must work with cores that operate

at independent clock frequencies. This guides how tightly coupled cores can be, and requires

careful consideration for control and data values as they cross clock domains.

While this thesis explores hardware thread migration with the goal of improving

performance and efficiency of heterogeneous cores, that is not the only possible use-case. For instance,

it may be beneficial to consider hardware migration with thermal considerations [13] [29]. A multi-core processor typically senses the temperature of the constituent cores, and when a core

is heated to a pre-determined threshold, either the core frequency is throttled down, or the


cool down). Hardware thread migration could aid in this situation – when a core reaches the

threshold temperature, the thread can be moved to another core without incurring the additional overhead of an OS-managed migration. Another use for hardware thread migration is in

support of processor sleep states. Modern processors are able to put unused cores to “sleep”

by power-gating those cores. When the supply power is switched off, state-holding structures are unable to retain their values. Migration-like hardware could be used to more quickly move

these values to newly-introduced register files (not full cores) on their own power domain. This

would allow for a low-power retention of values, while enabling the core to move into and out of sleep mode more quickly.

3.1 Overview

There are many possible hardware mechanisms and policies that could be proposed for

accelerating thread migrations. In this thesis, I narrow focus to four such migration alternatives

(Sections 3.3–3.6), relative to a baseline implementation (Section 3.2) that relies solely on the operating system to migrate a thread via a context switch. The impetus behind proposing

several hardware alternatives is to analyze several designs that span a spectrum of costs and

potential benefits. Figure 3.1 depicts this spectrum.

The benefit of hardware migration is in lower latency (in cycles) to migrate a thread from one

core to another. These alternatives were selected to progressively lower the latency as additional

hardware is added. The costs associated with migration hardware are in additional power, area, and timing. Currently, area is no longer a primary design constraint, since transistors are

abundant. The clock period of a processor is a primary concern, but any additional timing

incurred by a proposed hardware thread migration can likely be ameliorated by pipelining the circuits that do not meet timing. Power remains as a key constraint. There is potential

for an additional power draw by the migration hardware due to the addition of state-holding

structures, and mechanisms for clock-domain crossing circuits. These power costs should not outweigh the benefits of introducing hardware migration. Each of these migration alternatives

incrementally adds hardware in an attempt to find the power “sweet-spot”.

When considering hardware migration alternatives, I focus on designs that progressively eliminate bottlenecks with respect to a baseline design that uses the operating system for

migrating threads. Figure 3.1 enumerates the bottlenecks that each design point alleviates

(note that when considering designs from left-to-right, the bottlenecks that are eliminated are additive). As outlined in Chapter 2, when moving a thread via the operating system, several

overheads exist. These include several traps, moving the processor between privilege states, allocating memory for holding thread context, executing a scheduling algorithm, and so on.