ABSTRACT
FORBES, JOHN ELLIOTT. Hardware Thread Migration for 3D Die-stacked Heterogeneous Multi-core Processors. (Under the direction of Eric Rotenberg.)
Increasing the performance and efficiency of modern microprocessors has been met with the significant challenge of designing within a tight power budget. Newer fabrication technologies
have given rise to smaller, more dense transistors. This has ushered in an era of multiple
processor cores on a single chip, larger on-chip caches, and the integration of resources that have traditionally resided off-chip. But the historical trend that smaller transistors yield lower
power has ended. Most multi-core processors rely on replicating several instances of a single core
design for each of the multiple cores. However, one promising solution to continue delivering performance and efficiency improvements is to allow for single-ISA heterogeneous processor
cores within a processor.
Single-ISA heterogeneous multi-core processors are processor designs with multiple cores, each of which may have a slightly different microarchitecture. The different microarchitectures
all execute the same instruction set but each core is tailored for different program behavior. The
benefits of single-ISA heterogeneous multi-core processors have been well explored; however, most proposals leave some performance and efficiency on the table. This is by virtue of the
assumption that the cost of moving a program from one core to another, referred to as a thread
migration, is high. This assumption therefore requires a program to spend long periods of time on a given core to amortize the penalty of migrating between cores.
In this work, I focus on eliminating as much overhead for a thread migration as possible.
With a low overhead thread migration, movement between cores can occur more frequently, and for shorter intervals of time. This allows for fine-grained program changes to be quickly mapped
to a potentially better core. This picks up the performance and efficiency improvements that
previous proposals left on the table.
A thread migration has traditionally been the purview of the operating system through a
context switch. The operating system context switch comes at a high cost. To make thread
migrations as light-weight as possible, I propose to abstract the operating system view of a
pair of heterogeneous cores. The operating system is free to assign program threads to a pair
of cores, but once assigned, the cores are free to move threads between each other as needed. This work analyses the extent to which thread migration overhead (or lack thereof) can
impact performance and efficiency over that of traditional heterogeneous multi-core processors. I
find that a hypothetical zero-cost thread migration can achieve an average improvement of 2% to 5% compared to a two-core heterogeneous processor, with some individual benchmarks
in a realistic fast migration implementation, a thread migration should take less than 100 cycles,
and expend less than 100nJ of energy.
To realize these performance and efficiency gains, in this work I also evaluate several
competing implementations for a low latency thread migration. These alternatives span a spectrum
of hardware complexity and power costs, trading these costs for reduced migration latency. Several of these implementations meet the sub-100 cycle, and sub-100nJ targets. The lowest
latency implementation is able to perform thread migrations in about 30-35 cycles on average,
and the lowest energy implementation uses less than 25nJ of energy to migrate a thread. Several of these implementation alternatives rely on a bulk copy of register file values from
one core to the other. This implies wiring between each bitcell of the register files of the core
pair. These wires are costly in wiring congestion, as well as delay between the bitcells. To ameliorate this issue, I explore the use of 3D die-stacking as a way to minimize wire lengths by
using face-to-face vias and register files that are directly across from each other on opposite dice.
The results of this study show that both wiring congestion and delay are kept to a minimum, even with high bitcell density. Compared to planar register file placement, the 3D die-stacked
implementation always has lower congestion for the same area.
The culmination of this work is in the design and test of a fabricated heterogeneous dual-core processor in which the fast hardware thread migration was a key feature. This was a large,
multi-team effort, and progressed in two phases. The first was a 2D test design used to vet the processor cores and migration logic. The results of testing this chip demonstrate the sub-100
cycle thread migration capability. The second phase is currently being fabricated and is a 3D
die-stacked design, incorporating fixes for the bugs uncovered in the first phase.
An important facet in any thread migration scheme is the mechanism used to steer program
phases to the heterogeneous cores. A high quality mapping of phases-to-cores can realize the
full potential of heterogeneous cores, whereas a poor mapping can detrimentally impact perfor-mance. A final analysis included in this work is to explore the possibility of using only static
program characteristics to make a core mapping decision. While the results are preliminary, the
©Copyright 2016 by John Elliott Forbes
Hardware Thread Migration for 3D Die-stacked Heterogeneous Multi-core Processors
by
John Elliott Forbes
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
Computer Engineering
Raleigh, North Carolina
2016
APPROVED BY:
William Davis James Tuck
Huiyang Zhou Emerson Murphy-Hill
Eric Rotenberg
DEDICATION
BIOGRAPHY
John Elliott Forbes is originally from Hastings, Michigan, graduating from Hastings High School in 2000. After high school, he attended Michigan Technological University in Houghton,
Michigan to pursue his Bachelor of Science in Computer Engineering. At various points during his undergraduate studies, he worked as an intern for Unisys Corporation in Roseville, Minnesota,
working on the physical design of the processor ASIC used on the ClearPath IX platform
mainframes. After graduating cum laude from Michigan Tech in 2005, he worked full-time for Smiths Aerospace (now GE Aviation Systems) in Grand Rapids, Michigan, working on graphics
sub-system testing for the C-130AMP project. While working at Smiths Aerospace, Elliott applied,
and was accepted, to the graduate program in the Electrical and Computer Engineering department of North Carolina State University in Raleigh, North Carolina, obtaining his Master
of Science in Computer Engineering in 2008. After his Master's, he transitioned into the PhD
program in the ECE Department at NC State, under the direction of Dr. Eric Rotenberg. During this time, Elliott taught the Introduction to Computing Systems course for three years.
In the summer of 2010, he also worked in the Binary Translation Software group at Intel in
ACKNOWLEDGEMENTS
Of course I have to start with my family. Mom, dad, Abby, Holly, Bob and all of the kids. I also can’t forget Shannon. Thank you for the constant support and for pushing me to keep going.
Thanks to my friends back home Brandon Willard, Justin Benner, Ron Coats, Nick Steele, Steve Soliz, Joe Ammeraal, Tom Strey, Nathan Milz, and Dave Subert. It’s always a good day
to get a call or to meet and catch up. Also, I want to thank the professors at Michigan Tech,
Soner Önder and Brian Davis, who pushed me to go to grad school in the first place.
I’ll never forget the fun and rewarding times I had teaching. Thanks to all of my past
students for expecting as much out of me as I expected out of you. Thanks particularly to Xander
Kansinally, Katie Walker, Will Galliher, John Williamson, Cesar Garzon, Michael Glander, Vic Ajewole, and Dusty Mabe. It’s been amazing to see what things all of you have done and I was
always happy to get a visit in my office long after you finished my course.
I’m glad to have had the opportunity to work in the BiTS team at Intel in Hillsboro. Thank you Suresh Srinivas for giving me that opportunity. I also want to thank Matt Pagano, Omar
Shaikh, Avadh Patel, and Carlo Angiuli for all the fun, for the stimulating research environment,
and of course for the shenanigans that we got away with in Oregon. I want to especially thank Paul Caprioli for the continued support and advice in the years since I finished my internship.
My work at NC State was part of a large multi-group effort. Thanks to the H3 team,
Zhenqian Zhang, Randy Widialaksono, Josh Schabel, and Thomas Belanger. Also a big thanks to Steve Lipa, not just for fun times in the lab but also for the great pointers to music, books,
and the odd movie or two. And also thanks for loaning tools and an extra hand when I get
myself into car projects.
Thanks to the many friends I’ve made both in CESR and in NC State ECE in general.
Thanks to the guys in Tom Conte’s crew: Jason Poovey and Chad Rosier, I missed having you
guys around after you left. Thanks George Patsilaras for the hilarious car buying catastrophe. Thanks to Ahmad Samih, Amro Awad the night owl, Bagus Wibowo, Jenn Gamble, Julian
Taylor, Shivam Priyadarshi, and Devesh Tiwari my partner in crime when it comes to teaching
undergrads. I’ll always have great memories of the guys in Eric’s research group: thanks Vinesh Srinivasan for great work and help with chip debugging, Rangeen Basu Roy Chowdhury for
the help in all things EDA, Muawya Al-Otoom for late night Cook Out trips and “other” late night things, Mark Dechene for your awesome research acumen, Hashem Hashemi, Salil
Wadhavkar, Sandeep Navada, Brandon Dwiel my best beer drinking buddy and hockey linemate,
and Sungkwan Ku for making sure I find the best Korean BBQ in Seoul. A big thanks to Niket Choudhary for what will for sure be a lifetime of research collaboration and friendship. I really
to come up with new ideas.
Thanks to my committee James Tuck, Huiyang Zhou, Rhett Davis, and Emerson Murphy-Hill for their helpful insights and new perspectives on my work. I also want to thank the
professors in the department that have helped either in classes, teaching, or otherwise. Thanks
Keith Townsend, my classic car connection in North Carolina. Edward Grant was great for pep talks and pushing me to just keep going. Thanks to Greg Byrd who was the best teaching
mentor and friend that I could have hoped for – just know that someday I’m going to make
you teach me how to fly fish. And, of course none of this would have been possible without my advisor, Eric Rotenberg. Thanks for the support, thanks for the sacrifice you make for your
students, thanks for being the hardest working person I think I’ve ever met.
This thesis was supported in part by Intel and NSF grant No. CCF-1218608. Any opinions, findings, and conclusions or recommendations expressed herein are those of the author and do
TABLE OF CONTENTS
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Challenge in Migrating Physical Register File Values
1.2 Comparison
1.3 Contributions and Future Work
Chapter 2 Motivation
2.1 Benefit of Heterogeneity
2.2 Thread Migration
2.2.1 Architectural Study
2.2.2 3D Physical Design Study
2.3 Related Work
Chapter 3 Alternatives for Hardware Thread Migration
3.1 Overview
3.2 No Migration Hardware
3.3 Hardware EPC Migration
3.4 Hardware FTM
3.5 Asynchronous FIFO Migration
3.6 Compulsory TRF
3.7 Results
3.8 Taxonomy
3.9 Related Work
Chapter 4 Modeling Heterogeneous Cores
4.1 Verilog RTL Model
4.2 Low Design Effort C++ Model
4.3 High Design Effort C++ Model
4.3.1 Balancing Pipeline Stages
4.3.2 Transistor Sizing
4.3.3 Pulse Latches
4.3.4 Layout
4.3.5 High-Effort Scaling Model
4.4 Core Palette
4.5 Metrics
4.6 Workload
4.7 Cycle-Level Simulator
5.1 Implementation of H3
5.1.1 Global Migrations
5.1.2 Local Migrations
5.2 Test Infrastructure
5.2.1 Duct Tape – A High-Level Assembler
5.2.2 FPGA Chip Signal Driver
5.2.3 Host Interface
5.3 Results
5.4 Errata
5.4.1 Load Miss
5.4.2 Clock Inputs
5.4.3 Ammeter
5.4.4 I-cache Requests
5.4.5 D-Cache Hold Violations
5.4.6 TRF Reset
5.4.7 CTIQ Full
5.4.8 CTIQ Reset
5.4.9 CCD Pulse Synchronization
Chapter 6 Static Phase-to-Core Mapping
6.1 Benefits of Static Analysis for Migration
6.2 Statistical Learning – Classification
6.3 Naive Bayes Classification Postmortem
6.4 Related Work
Chapter 7 Summary
LIST OF TABLES
Table 1.1 Comparison of the H3 thread migration with other published architectures.
Table 4.1 EDA tools used in this work.
Table 4.2 Pulse latch and flip-flop characterization.
Table 4.3 The palette of 18 cores considered for evaluation.
Table 5.1 H3 Core Types
Table 5.2 Additional signals required to support FTM.
LIST OF FIGURES
Figure 1.1 Potential for copying register values between cores.
Figure 2.1 Average performance and efficiency varying the number of heterogeneous cores.
Figure 2.2 Per-phase performance and efficiency of two heterogeneous cores.
Figure 2.3 Performance and efficiency of various interval sizes.
Figure 2.4 Number of migrations at 1,000 instruction intervals.
Figure 2.5 Comparison of coarse-grain and fine-grain heterogeneity for both performance and efficiency.
Figure 2.6 Performance and efficiency relative to ideal with various migration cycle penalties.
Figure 2.7 Efficiency relative to ideal with various migration energy penalties.
Figure 2.8 Depictions of 2D and 3D layouts of fast thread migration (best viewed in color).
Figure 2.9 Routing overflows due to placement density and PRF connectivity.
Figure 2.10 PRF-to-PRF swap latency.
Figure 3.1 Spectrum of hardware migration alternatives.
Figure 3.2 Baseline cores with no hardware migration support.
Figure 3.3 Cores augmented with hardware support for migrating EPC only.
Figure 3.4 Cores augmented with hardware support for migrating all registers using TRF.
Figure 3.5 Cores augmented with hardware support for migrating all registers using asynchronous FIFO.
Figure 3.6 Cores augmented with hardware support for migrating all registers with compulsory TRF reads/writes.
Figure 3.7 Constraining pipeline stage paths with pipeline registers.
Figure 3.8 Achievable clock period of each pipeline stage.
Figure 3.9 Power overhead of migration hardware.
Figure 3.10 Area overhead of migration hardware. Note that the y-axis does not start at zero.
Figure 3.11 Migration latency both with and without (for clarity) the EPC Migration latency.
Figure 3.12 Energy required for a complete thread migration.
Figure 3.13 Average migration performance when taking into account all power, energy, and cycle penalties.
Figure 3.14 Per-phase migration performance of FTM.
Figure 3.15 Per-phase migration efficiency of FTM.
Figure 3.16 Taxonomy of migration alternatives.
Figure 4.1 C++ models for estimating low and high effort design.
Figure 4.2 Frequency and energy trend for transistor sizing.
Figure 5.1 Die photos.
Figure 5.2 Block diagram of two core stack.
Figure 5.3 Timing diagram for global migration of two threads.
Figure 5.4 Timing diagram for local migration (one thread only).
Figure 5.5 Block diagram of test infrastructure.
Figure 5.6 Program organization of a complete dt program.
Figure 5.7 Conditional array summation example dt source code.
Figure 5.8 Assembled H3 test PCB.
Figure 5.9 Test workstation.
Figure 5.10 Migration latency of the 2D and 3D prototype chips.
Figure 5.11 Oscilloscope screen capture showing power supply voltage during a chip reset.
Figure 6.1 Percent of each benchmark that is spent in an inner loop.
Figure 6.2 Prediction accuracy of various classification algorithms.
Figure 6.3 Distribution of feature values for all inner loops.
Figure 6.4 Pairwise comparison of all features.
Figure 6.5 Accuracy of Gaussian Naive Bayes for each loop, sorted by the performance ratio between the two cores.
Figure 6.6 Histogram of the performance ratio between the two cores.
Figure 6.7 Accuracy of Gaussian Naive Bayes for each loop, sorted by how heavily biased a loop is toward either core.
Chapter 1
Introduction
Historically, as steady improvements in compute capacity have been made, programmers have always filled that capacity either to solve ever larger problems, or to solve the same-sized
problems more efficiently. Currently, system capacity improvements are threatened by the end
of Dennard scaling, and the possible end of Moore’s law. No longer can we rely on a smaller transistor delivering a higher frequency at lower power, and soon we may not even be able to
rely on a transistor getting smaller.
One potential avenue for improving compute capacity is by employing multiple cores within a processor, each with a different microarchitecture – a style of processor commonly referred to
as heterogeneous chip multi-processors (HCMPs). The cores within an HCMP all implement
the same instruction set (ISA), but vary in superscalar widths, pipeline depths, and sizes of structures. This body of work, spurred by the seminal work of Kumar et al. [33] [35], recognizes
that different programs have different instruction-level behavior, and even a single program may
change behavior during runtime. It is not always possible to design a single processor core that is best suited for all programs or program phases. So instead of having multiple cores of a
processor all of the same design, a mix of different core types should be employed. This has
the effect of specializing a processor to better match a program providing that the program is executed by the core that most efficiently matches its needs.
This thesis makes the case that performance and energy efficiency of HCMPs can be further
improved if a program can be moved between cores at the lowest possible cost. The operating system has traditionally handled the management of threads. But to achieve the performance
and energy goals, I propose forgoing the heavy-weight computation required by OS
scheduling and thread management. Instead, I assume a system in which core pairs are presented to
the OS as a single core with multiple logical thread contexts. The OS can assign a thread to
The costs of a thread migration that remain in a hardware scheme are the copying of the
architectural registers, and flushing various pipeline structures of the new core including the handling of cache state. There are also indirect costs associated with retraining speculative
structures such as branch predictors and dependence predictors. Previous work [11] has shown
that when a migration is potentially beneficial, the program memory state can be minimized to only the predicted working set which can be pre-emptively copied to the new core before the
migration occurs. In this thesis, I focus on one of the remaining challenges to a low-overhead
migration: that of copying the register state. The next section highlights why this problem is more difficult than at first glance.
1.1 Challenge in Migrating Physical Register File Values
The physical register file (PRF) of a modern superscalar, out-of-order processor [62] houses both
the known-safe committed register values as well as speculative values. This is made possible by
a register renaming mechanism, and a PRF with more registers than the architectural minimum. Any instruction with a destination register operand will write the result value directly to the
PRF before knowing whether the instruction is valid. Instructions may later be found to have
executed with incorrect source values or may be on the wrong program control path. In that case, the speculative values are discarded by undoing the logical-to-physical register mapping
for that instruction destination register. If the instruction is found to have correctly executed,
then the logical-to-physical mapping is retained, and the previous logical-to-physical mapping for that logical destination register can be freed. This has the effect of spreading logical values
throughout the PRF such that logical mappings reside in non-contiguous and unordered PRF
locations.
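The scattering effect described above can be illustrated with a toy rename model. This is a minimal sketch under stated assumptions: the class name, the register counts, and the single rename map (which folds the speculative and committed maps into one) are illustrative simplifications, not the organization of any core in this thesis.

```python
# Toy model of register renaming. All names and sizes are illustrative
# assumptions; a real core keeps separate speculative and committed maps.
class RenamedCore:
    def __init__(self, num_logical, num_physical):
        # Initially logical register i maps to physical register i.
        self.rmt = list(range(num_logical))
        # Remaining physical registers start on the free list.
        self.free = list(range(num_logical, num_physical))

    def rename_dest(self, logical):
        """Allocate a new physical register for a destination operand."""
        new = self.free.pop(0)
        old = self.rmt[logical]
        self.rmt[logical] = new
        return new, old

    def commit(self, old):
        """Instruction executed correctly: free the previous mapping."""
        self.free.append(old)

    def squash(self, logical, new, old):
        """Instruction was wrong: undo the mapping and reclaim the register."""
        self.rmt[logical] = old
        self.free.insert(0, new)


core = RenamedCore(num_logical=4, num_physical=8)
# A short stream of destination registers, all of which commit.
for logical in [2, 0, 2, 3, 1, 0]:
    new, old = core.rename_dest(logical)
    core.commit(old)

# The logical registers now live in non-contiguous, unordered PRF entries.
print(core.rmt)
```

After only six committed writes, the logical registers no longer occupy the first four physical entries, which is why a migration cannot assume any fixed layout of committed state within the PRF.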
The size of the PRF is carefully tuned by the design team when implementing a core. A
larger PRF can support deeper speculation, but comes at the cost of per-access latency. The
PRF may need to be read by multiple instructions each cycle, so a large PRF may partly dictate the achievable core frequency. When considering an HCMP, the PRF is a key differentiating
parameter between the different core types. Some programs may derive a substantial benefit
from a processor with deep speculation, whereas other programs might better utilize a core with a higher frequency. Thus, it is likely that a hardware thread migration mechanism will
need to cope with PRFs of different sizes, as well as different frequencies.
Chapter 2 shows the impact that a low-latency and low-energy thread migration can have on overall performance and efficiency of a program. The important take-away from those studies
is that for maximum benefit, the migration should take 100 cycles or less. It is tempting to consider copying the entire PRF contents from one core to another. But Figure 1.1 shows why
renaming of four logical registers. One PRF is 8 entries, while the other is 16, and allows for
deeper speculation (we can ignore the timing differences for now). In this case, the contents of the smaller PRF can be copied directly to the larger PRF, providing that the rename map
table (RMT) is also copied. No problem exists with this example. However, if a program must
be migrated from the core with a larger PRF to the core with the smaller PRF, it might be the case that the larger PRF has mapped to physical entries that do not exist in the smaller PRF.
This situation is shown in Figure 1.1b. In this example, physical registers $p11 and $p13 do not
exist in the smaller PRF. Copying the RMT will not help in this case; the mismatch must be remedied with an intermediate step.
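The size mismatch can be expressed as a simple check over the rename map table. The sketch below is only an illustration: the function name is hypothetical, and the mapping values are assumed, chosen so that the large-to-small case reproduces the $p11 and $p13 problem described in the text.

```python
def unreachable_mappings(rmt, dest_prf_size):
    """Return {logical: physical} for RMT entries that point past the
    end of the destination PRF (hypothetical helper for illustration)."""
    return {l: p for l, p in enumerate(rmt) if p >= dest_prf_size}


# Assumed example mappings for four logical registers.
small_rmt = [1, 5, 7, 2]      # source core has an 8-entry PRF
large_rmt = [11, 3, 0, 13]    # source core has a 16-entry PRF

# Small-to-large: every mapping exists in the 16-entry destination.
print(unreachable_mappings(small_rmt, dest_prf_size=16))

# Large-to-small: mappings to $p11 and $p13 have no counterpart.
print(unreachable_mappings(large_rmt, dest_prf_size=8))
```

An empty result means a direct PRF and RMT copy would be well defined; any non-empty result means the values must first be relocated.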
Figure 1.1c illustrates this intermediate step. The RMT (not shown) is used to rename logical
registers to consolidate the architectural registers back into an architectural register file (ARF). This puts register values in contiguous ARF locations and in the correct order. With this step
complete, ARF values can be copied from one core to the other, or with clever design, an
exchange of the ARF contents can be performed.
Even without the PRF size mismatch problem discussed above, copying directly from one
PRF to another PRF is a tenuous prospect. Supposing a core pair in which both PRFs are
the same size, it would at least require additional PRF read and write ports which are used by the opposite core. The PRFs are already highly ported, so adding yet another read and write
port is likely to impact the clock frequency of the entire pipeline. Making matters worse, the cores may operate asynchronously even with the same sized PRFs. A Teleport Register File
(TRF) [46] [63] can be used in concert with the PRF to solve these issues. A TRF can be
used like an ARF, but additionally allows for the bulk exchange of values from one TRF to another TRF. But the design of a TRF requires implementing the registers in flip-flops instead
of SRAMs since the bulk copy requires access to each bitcell of the memory array. A possible
design that works around these issues is to have a TRF outside of the core, and to introduce a new instruction whose sole purpose is to go through the full pipeline as usual, reading values
from the renamed physical registers and then copying their values into the TRF during the
Execute Stage. Another new instruction performs the reverse action of reading the TRF after the TRF exchange, and writing to a destination register which has been properly renamed to
a physical register in the PRF.
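The instruction-driven TRF flow described above can be sketched in software. This is a behavioral illustration only, under several assumptions: the method names `copy_to_trf` and `copy_from_trf` stand in for the two new instructions and are hypothetical, the starting mappings are invented, and timing, asynchronous clocking, and speculation are ignored.

```python
class Core:
    def __init__(self, num_physical, rmt):
        self.rmt = list(rmt)                       # logical -> physical map
        self.free = [p for p in range(num_physical) if p not in rmt]
        self.prf = [0] * num_physical
        self.trf = [0] * len(rmt)                  # sits outside the pipeline

    def copy_to_trf(self, logical):
        # Hypothetical instruction: read the renamed source register
        # and write its value into the TRF entry for that logical register.
        self.trf[logical] = self.prf[self.rmt[logical]]

    def copy_from_trf(self, logical):
        # Hypothetical instruction: rename the destination to a free
        # physical register, write the TRF value, and free the old mapping.
        new = self.free.pop(0)
        self.free.append(self.rmt[logical])
        self.rmt[logical] = new
        self.prf[new] = self.trf[logical]

    def read(self, logical):
        return self.prf[self.rmt[logical]]


def migrate(a, b):
    """Swap the architectural register state of two cores through their TRFs."""
    n = len(a.trf)
    for r in range(n):
        a.copy_to_trf(r)
        b.copy_to_trf(r)
    a.trf, b.trf = b.trf, a.trf                    # bulk TRF-to-TRF exchange
    for r in range(n):
        a.copy_from_trf(r)
        b.copy_from_trf(r)


big = Core(16, [11, 3, 0, 13])     # assumed mappings, 16-entry PRF
small = Core(8, [1, 5, 7, 2])      # assumed mappings, 8-entry PRF
for r in range(4):
    big.prf[big.rmt[r]] = 10 + r
    small.prf[small.rmt[r]] = 20 + r

migrate(big, small)
print([small.read(r) for r in range(4)], small.rmt)
```

Because each core renames the incoming values with its own free list, the destination mappings always fall within its own PRF, regardless of where the values lived on the source core.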
An alternative superscalar, out-of-order pipeline implementation exists [57] that has separate storage for known-safe committed values and speculative values. It is feasible for this style
of pipeline to eschew the renaming intermediate step and new instructions, and exchange the
architectural register values with another core. Pipelines of this style are not considered in this thesis, however, for two reasons. First, while register values would no longer require
consolidation, the copy/exchange of values from one core to another would still require bitcell-level
[Figure 1.1: Potential for copying register values between cores. Panels: (a) Small-to-large PRF copy; (b) Large-to-small PRF copy; (c) Remapped PRF copy. Legend: free, committed, and speculative registers.]
slow the entire processor as mentioned above. Second, architectures designed in this style are
not typically used in modern processors. This is due to several complexities in hazard-checking logic and bypassing. The modern PRF approach described above simplifies these issues and
has become the dominant form of out-of-order execution in modern processors. It would be a
difficult proposal to suggest revisiting a known-inefficient pipeline implementation simply to make it easier for thread migration.
Taking these issues into account, I propose in this thesis to introduce a TRF to a
heterogeneous core pair which resides outside of the pipeline. Access to the TRF is enabled by new instructions. Once supplied with register values, the TRF can exchange values between the
core pair. The bulk copy of values comes at a high cost in the number of wires that must be
routed. To ease this cost, the two cores can be split between two tiers of a 3D die-stacked chip multi-processor. In collaboration with a large design team, this design was implemented and
fabricated as a proof of concept. This project is called “H3”, which stands for heterogeneity in
3D.
1.2 Comparison
Table 1.1 summarizes several other recent works that relate to the H3 project and how they fare compared to H3. ARM has demonstrated its commitment to heterogeneous chip multiprocessors
with its big.LITTLE [27] architecture. This HCMP consists of several “big” Cortex-A15 cores
with a “little” Cortex-A7. The goal of big.LITTLE is to minimize power by using the little core as often as possible, relying on the big cores when performance needs justify the extra power
consumption.
Composite Cores was proposed [39] as a way to forgo migration entirely in favor of a single architecture that has multiple pipeline back-ends. One back-end is lightweight by virtue of its
in-order execution model. The other back-end is a high-performance out-of-order execution
paradigm. Both of these back-ends share a unified front-end (instruction fetch, decode, etc.).
The Execution Migration Machine [38] (EM2) suggests moving threads to their required data
instead of moving data between cores. To achieve this goal, they implement a low-complexity
stack-based ISA that minimizes the amount of thread state that must be transferred to other cores. Their design was fabricated in a 110-core homogeneous CMP.
None of these competing architectures meet all of the goals set out for H3, as described in
this thesis and in [46]. The big.LITTLE implementation has no hardware migration support, and thus relies on the operating system to move threads between cores. This results in a high
latency (they cite 20 thousand cycles) migration, limiting the frequency of migration. Composite Cores provides a way to partially realize some of the benefits available in an HCMP. But two
Table 1.1: Comparison of the H3 thread migration with other published architectures.
Architecture | Thread Migration Latency | Distinct (separate) cores | Asynchronous (GALS) | Register-based ISA | Evaluation Methodology
ARM big.LITTLE | 20,000 cycles | Yes | Yes | Yes | Real system
Composite Cores | <32 cycles | No (shared front-end and data cache) | No | Yes | C++ simulator
Execution Migration Machine | <100 cycles | Yes | No | No (stack-based ISA for partial context transfer) | RTL simulation and synthesis; chip fabricated, measurements not yet reported
H3 FTM, this thesis | <100 cycles | Yes | Yes | Yes | RTL simulation, fabricated chip measurements
frequencies can be different (another knob to turn in providing architectural diversity). And
EM2 has a novel approach to keep on-chip network costs down by moving threads instead of
data. But EM2 may lose generality by virtue of the reliance on a stack-based ISA. Additionally,
it is unclear if their prototype was functional, as no measurements have been reported.
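The link between migration latency and how often a thread can afford to migrate can be made concrete with a short calculation. The sketch below takes the 20,000-cycle and 100-cycle latencies from Table 1.1; the 1% overhead budget and the function name are assumptions chosen purely for illustration.

```python
def min_interval_cycles(migration_cycles, max_overhead):
    """Shortest run interval (in cycles) that keeps the fraction of time
    spent migrating at or below max_overhead, where
    overhead = migration / (migration + interval)."""
    return migration_cycles * (1 - max_overhead) / max_overhead


BUDGET = 0.01  # assumed: at most 1% of cycles may be spent migrating

for name, cycles in [("OS-managed, 20,000 cycles", 20_000),
                     ("hardware, 100 cycles", 100)]:
    interval = round(min_interval_cycles(cycles, BUDGET))
    print(f"{name}: run at least {interval:,} cycles between migrations")
```

Under this assumed budget, the tolerable interval shrinks by the same factor of 200 as the migration latency, which is what opens the door to migrating on much finer-grained program phases.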
To my knowledge, the H3 chip was the first demonstration of a fabricated heterogeneous
multi-core consisting of two out-of-order superscalar cores. It was also the first hardware thread migration between two asynchronous cores. And the taped-out 3D design is poised to be the
first fabricated die-stacked pair of out-of-order superscalar cores.
1.3 Contributions and Future Work
This thesis makes the following contributions:
The study of the architectural impact of low-latency, low-energy thread migrations. This is
done in a way that does not presuppose any particular migration mechanism. The result is that limiting the overhead of thread migrations unlocks the opportunity to migrate more
frequently. This, in turn, makes it feasible to migrate fine-grained program phases for an
additional performance and efficiency benefit over that already exposed by coarse-grain heterogeneity.
have the ability to exchange values. Die-stacking can satisfy both the high-bandwidth
and low-latency wiring that a hardware thread migration approach requires.
A study of various implementations of fast register transfer. The implementations trade
migration latency for power. These implementations are modeled in Verilog RTL and
suggest that the best alternative may be one that uses new instructions that copy data
into and out of a TRF.
Various evaluations require a high number of full-processor simulations, necessitating the
use of a fast C++ simulator. Also, these simulations must account for energy and timing based on physical design data. The physical design data can be derived using existing
FabScalar-based [19] tools, but these do not account for high design-effort by virtue of
relying on FabScalar’s automated, standard cell approach. To overcome this, I estimate the timing and energy of high design-effort cores by carefully crafting scaling parameters
based on the low design-effort estimates of FabScalar cores.
The test infrastructure and measured results of a fabricated 2D prototype chip. The
prototype chip was fabricated to vet any functional bugs and demonstrate the capability of hardware thread migration. The testing proved useful, as several interesting problems
were found and their fixes incorporated into a taped-out 3D die-stacked design – the
end-goal for the H3 project.
This thesis also studies the use of static program analysis with statistical learning in an attempt to predict the best mapping of phases to cores without first running the program. The
methods studied failed to produce accurate predictions. However, in a postmortem analysis of
the technique, I find that there may never be enough information inherent in only the static program characteristics to have a highly-accurate, static phase-to-core prediction. Future work
can leverage these lessons to try to find the best balance between offline static characterization
Chapter 2
Motivation
This chapter focuses attention on the performance and efficiency benefit of heterogeneity (Section 2.1), the potential further improvement when the overhead of migration is minimized
(Section 2.2), and the specific advantages that a migration mechanism in a 3D die-stacked core
pair provides (Section 2.2.2).
Previous work on this topic [27] [39] [46] typically studies a heterogeneous system in which
the cores can always have their performance ranked – that is, there is a “big” high-performance
core, and a “little” low-power core. Fast hardware thread migrations can certainly benefit such systems. In this thesis, I will focus on migrating between cores that cannot be
performance-ranked [34] [42]. In systems of this type, both cores are “big”. For instance, one core might have
high peak instruction bandwidth, but at a modest frequency compared to another core which has lower peak instruction bandwidth but a high frequency. Some programs may not be able to
take advantage of a wide core, and thus, the higher frequency is more beneficial. Establishing
performance on these non-monotonic cores is challenging, as the choice of which core should run a given phase is not always readily apparent.
2.1
Benefit of Heterogeneity
Heterogeneous multi-core processors can realize a performance and efficiency benefit without the
need for fast thread migration. While several works have shown the advantages of heterogeneous
microarchitectures [33] [35], in this section I show the potential within the methodology used for this thesis.
Using the low design-effort model from Chapter 4, I establish the overall performance (in
BIPS) and efficiency (in BIPS^3/W) of all program phases on all core configurations, producing
roughly 3200 data points for each metric. These metrics assume that a phase is run on a
Figure 2.1: Average performance and efficiency varying the number of heterogeneous cores. (a) Performance: average percent BIPS increase versus n, the number of core types. (b) Efficiency: average percent BIPS^3/W increase versus n, the number of core types.
Figure 2.2: Per-phase performance and efficiency of two heterogeneous cores. (a) Performance: percent BIPS increase (n=2) for each program phase; geometric mean = 11.2%. (b) Efficiency: percent BIPS^3/W increase (n=2) for each program phase; geometric mean = 10.4%.
granularity of thread migration is the entire program phase. With the metrics for all core
configurations, I run an exhaustive design space exploration (DSE). The DSE tool finds the
highest average performance or efficiency for a combination of n cores, where n is varied. The
results of these explorations for several values of n are shown in Figure 2.1. The baseline for these
graphs is the best overall homogeneous core for a given metric, which is the core configuration
found by the DSE tool when n=1.
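The exhaustive search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the core names and scores below are hypothetical and are not thesis data, the function name best_core_combination is introduced here for illustration, and the arithmetic mean over phases is assumed as the averaging method.

```python
from itertools import combinations


def best_core_combination(metric, n):
    """Exhaustively search all n-core combinations.

    metric[core][phase] is a higher-is-better score (e.g. BIPS or BIPS^3/W).
    Each phase is mapped to its best core within the combination, and the
    combination with the highest average per-phase score is returned.
    """
    cores = sorted(metric)
    num_phases = len(metric[cores[0]])
    best_combo, best_avg = None, float("-inf")
    for combo in combinations(cores, n):
        # Coarse-grain mapping: each whole phase runs on its best core.
        total = sum(max(metric[c][p] for c in combo) for p in range(num_phases))
        avg = total / num_phases
        if avg > best_avg:
            best_combo, best_avg = combo, avg
    return best_combo, best_avg


# Hypothetical scores for three core types over four phases (not thesis data).
toy = {
    "narrow_fast": [2.0, 1.0, 3.0, 1.5],
    "balanced":    [2.5, 2.0, 2.5, 2.0],
    "wide_slow":   [1.0, 3.5, 1.5, 1.5],
}
homogeneous = best_core_combination(toy, 1)  # the n=1 baseline
pair = best_core_combination(toy, 2)         # best heterogeneous pair
print(homogeneous, pair)
```

In this toy data the balanced core is the best homogeneous choice and also appears in the best pair, which mirrors the observation that one core configuration tends to recur across values of n.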
These results show that even among long-running program phases, a heterogeneous
combination of cores can realize a performance and efficiency improvement. Program phases for this
experiment are 10 million dynamic instructions long. With even just two core configurations, performance is increased by about 11.2% and efficiency by about 10.3% on average. If a
processor employed all 18 core configurations, the average performance increase would be just over 16%
and the efficiency increase almost 19%.
Considering the design challenges associated with heterogeneous multi-core processors [19],
the average performance and efficiency gains may seem underwhelming. However, showing the
average hides the full potential of heterogeneous cores. Previous work [25] [42] has shown
that most program phases execute their best on a balanced core configuration. This is a core
configuration that is not especially wide or narrow and has average structure sizes. That same
balanced core configuration appears in all values of n. This means that when mapping phases
And when compared to the baseline homogeneous configuration (n=1), which is the same
core configuration, those phases see no additional performance or efficiency benefit. This has
the effect of pulling the overall average down. But not all phases are best on that “average”
core configuration. Figure 2.2 shows the performance and efficiency of each phase. Roughly
44% of the phases have a performance advantage on the heterogeneous processor, and about 32% of the phases have an efficiency improvement. Also, there are several phases that have a
significant improvement in performance and/or efficiency. The phases most impacted have a
50% to 130% performance improvement on the heterogeneous core pair, while several phases have an efficiency improvement over 100%. These gains are hidden when only considering the
average over all phases.
2.2
Thread Migration
The previous section pinned phases to cores for the entire duration of the phases’ execution.
In this section, I study the potential when allowing the program phases to migrate from core to core during the execution. I refer to this as “fine-grain” heterogeneity to emphasize that the
phase may execute on a given core for only a very small number of instructions before being
migrated to another core.
A thread migration is similar to an operating system (OS) context switch. The OS provides
context switching as a means to allow more running processes than there are processors in
the system. The details of a context switch are highly dependent on the OS as well as the underlying hardware. One overhead in an OS context switch that may not be necessary in a
thread migration is that of process scheduling (determining which ready process will run after
the next context switch). But, even without counting the cost of the OS scheduler there is overhead in saving the process control block, including registers, stacks, memory mappings,
and various privileges. Many of these tasks require kernel-level access, requiring the processor
to switch into and out of kernel mode – a potentially costly series of operations [36] [43]. In this thesis, I propose presenting a heterogeneous core pair to the OS as a single processor with
multiple thread contexts. The kernel is free to assign a process to a core pair, but once assigned,
threads are free to move between the two cores. After the initial assignment, this eliminates the overheads incurred by the OS if the thread migration is handled by hardware.
Therefore, for the studies in this section, n=2, and migrations can only occur between the
two cores. While this leaves some performance and efficiency improvement on the table, n=2
matches the proposed use of 3D die-stacking for ensuring low latency, high bandwidth migration
2.2.1 Architectural Study
To establish the performance and efficiency that can be expected of a hardware thread migration
scheme, I first focus on architecture-level studies. These experiments determine the effects of
the cost of thread migration without imposing any specific implementation.
Experimental Framework
To support the architectural performance analysis of thread migration, I start by simulating
all program phases on the low design-effort C++ model, outlined in Chapter 4. As simulations
progress, for every 1,000 dynamic instructions executed, the simulator saves the number of cycles needed and the energy spent to execute those 1,000 instructions. No architectural structures are
reset at these points; only the cycle and energy metrics are saved and then reset. At the end of the
simulation, these statistics are numbered and saved for post-processing. Each program phase is simulated for a total of 10 million dynamic instructions, so there will be 10,000 1,000-instruction
metrics for each program phase.
By recording metrics at well-defined instruction boundaries, and keeping track of their program ordering, these 1,000-instruction segments can be aggregated by simply adding the
metrics of adjacent 1,000-instruction segments. This aggregation is also made possible by the
policy to keep pipeline structures “warmed” during execution. For example, the total cycles for a given core configuration to execute an entire program phase can be derived by simply adding
all 10,000 of the 1,000-instruction program segments for that phase. I refer to these aggregated
1,000-instruction program segments as “intervals”.
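A minimal sketch of this aggregation step follows. It assumes the per-segment statistics are already available as an ordered Python list; the function name aggregate_segments and the cycle counts are hypothetical, chosen only for illustration.

```python
def aggregate_segments(segment_values, segments_per_interval):
    """Sum adjacent per-segment values (cycles or energy) into intervals.

    segment_values must be in program order, one entry per
    1,000-instruction segment.
    """
    if segments_per_interval <= 0:
        raise ValueError("segments_per_interval must be positive")
    return [
        sum(segment_values[i:i + segments_per_interval])
        for i in range(0, len(segment_values), segments_per_interval)
    ]


# Hypothetical cycle counts for six 1,000-instruction segments.
cycles = [900, 1100, 1000, 1200, 800, 1000]
intervals_2k = aggregate_segments(cycles, 2)            # 2,000-instruction intervals
whole_phase = aggregate_segments(cycles, len(cycles))   # one interval for the phase
print(intervals_2k, whole_phase)
```

Because the segments are simply summed, the same recorded data can be replayed at any interval size that is a multiple of 1,000 instructions without re-running the simulator.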
The strength of this approach manifests in several ways:
It provides the ability to change the interval size to mimic the ability to allow either small
or large migration regions.
It allows adding arbitrary cycle and energy penalties at interval boundaries to represent
the cost of a possible migration. Furthermore, the cycle and energy penalty could be due to the migration itself, or due to migration-induced events (such as cache misses that
would not have occurred had the migration never happened).
Since this is all done in post-processing, it allows for oracle scheduling.
This means that we can see the effect of interval size, migration cycle penalty, and migration
Figure 2.3: Performance and efficiency of various interval sizes. (a) Performance: percent BIPS improvement versus interval size (instructions). (b) Efficiency: percent BIPS^3/W improvement versus interval size (instructions). Series shown: overall best average, all cores average, overall best max, all cores max.
Results
Figure 2.3 shows the performance and efficiency potential with different sized intervals, assuming
no cycle or energy penalty for thread migrations. Note that the x-axis indicates the number of dynamic instructions per interval, not the number of 1,000-instruction segments. There are four
trends shown per graph. The trends prefaced with “overall” refer to a core-selection policy in which migrations occur between the overall best core pair for all 179 program phases,
determined a priori by a DSE. The best two cores when using BIPS as the performance metric
are the LE-2W-S and LE-4W-M (“LE” refers to the low design-effort model), and the best
two cores when using BIPS^3/W as the performance metric are the LE-2W-S and LE-3W-M.
The trends in Figure 2.3 prefaced with “all” indicate that all 18 possible core configurations
are available, but migrations are only permissible between the two cores that are best for the phase being considered. That is, the two best core configurations are found on a per-phase
basis. For both of these core-selection policies, both the average improvement and the highest
improvement for any phase are shown.
The improvements shown in Figure 2.3 are the additional benefits over coarse-grain
heterogeneous cores gained by allowing thread migrations at various finer-granularity interval sizes.
The right-most data point of these graphs represents the coarse-grain heterogeneous cores where no migrations occur once the phase is mapped to a core, and the left-most point represents the
finest granularity of thread migration. The baseline is the heterogeneous core pair assuming
no migrations, similar to the analysis in Section 2.1 where n=2. Equally important, however,
is to notice that the biggest improvement in both performance and efficiency is realized when
the interval size is lowered from 10,000-instruction intervals to 1,000-instruction intervals. This
shows that the smaller the interval, the more potentially beneficial thread migration becomes. And thread migrations can be more frequent only if the overheads of those migrations are as
small as possible.
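The zero-penalty oracle comparison behind these results can be sketched as follows. This is an illustration under assumptions, not the actual post-processing tool: the per-segment values are taken to be execution times in a common unit (per-core cycle counts would first need to be scaled by each core's clock period), and the function names and data are hypothetical.

```python
def aggregate(values, segments_per_interval):
    """Sum adjacent per-segment values into intervals."""
    k = segments_per_interval
    return [sum(values[i:i + k]) for i in range(0, len(values), k)]


def fine_grain_improvement(time_a, time_b, segments_per_interval):
    """Speedup of oracle migration over the best single core of the pair.

    time_a / time_b: execution time of each 1,000-instruction segment on
    core A / core B, in a common time unit (an assumption of this sketch).
    """
    coarse = min(sum(time_a), sum(time_b))  # phase pinned to its best core
    per_interval = zip(aggregate(time_a, segments_per_interval),
                       aggregate(time_b, segments_per_interval))
    fine = sum(min(a, b) for a, b in per_interval)  # free oracle migration
    return coarse / fine - 1.0


# Hypothetical per-segment times: core A alternates fast/slow, core B is steady.
time_a = [10, 30, 10, 30]
time_b = [20, 20, 20, 20]
for k in (1, 2, 4):
    print(k * 1000, fine_grain_improvement(time_a, time_b, k))
```

In this toy case the benefit appears only at the smallest interval size, because the fast and slow stretches of core A cancel out once adjacent segments are merged.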
Figure 2.4: Number of migrations at 1,000 instruction intervals. Axes: number of migrations versus program phase.
Figure 2.5: Comparison of coarse-grain and fine-grain heterogeneity for both performance and efficiency. (a) Performance: percent BIPS improvement (fine grain) versus percent BIPS improvement (coarse grain). (b) Efficiency: percent BIPS^3/W improvement (fine grain) versus percent BIPS^3/W improvement (coarse grain).
cores used to produce this graph are the overall best two cores (as opposed to the per-phase
best two cores). This graph shows that under these conditions, thread migrations will occur
quite frequently. The vast majority of phases will switch cores 1,000 or more times during the 10 million instruction phase. And the phases at the top end of this graph will migrate more
than 6,000 times during the 10 million instruction phase.
An interesting pair of plots is shown in Figure 2.5. Each data point for these graphs represents a program phase. The position along the x-axis represents the improvement in performance
and efficiency for that phase that coarse-grain thread migrations have over a single core. The
position along the y-axis plots the additional improvement in performance and efficiency for
that phase that fine-grain thread migrations allow over the coarse-grain heterogeneity. These
graphs show that while some phases have a fine-grain migration benefit in addition to their coarse-grain benefit, there are a substantial number of phases that only get a heterogeneity
benefit when allowing fine-grain thread migrations. This is evident in the clustering of many
Figure 2.6: Performance and efficiency relative to ideal with various migration cycle penalties. (a) Performance: percent BIPS relative to finest granularity versus migration cycle penalty. (b) Efficiency: percent BIPS^3/W relative to finest granularity versus migration cycle penalty. Series shown: overall best average, all cores average, overall best min, all cores min.
Figure 2.7: Efficiency relative to ideal with various migration energy penalties. Axes: percent BIPS^3/W relative to finest granularity versus migration energy penalty (nJ). Series shown: overall best average, all cores average, overall best min, all cores min.
The impact on performance and efficiency when penalizing a thread migration with various
per-migration cycle costs is shown in Figure 2.6. These were generated by picking the best
initial core based on the highest performance or efficiency for the first interval of a phase. For every interval thereafter, a comparison was made between the performance or efficiency of the
current core and the performance or efficiency of the opposite core plus the migration penalty.
The interval size for this graph is the 1,000-instruction (smallest) interval size and there is zero migration energy penalty. The cycle penalty size was increased until no migrations occurred
for any phase. The baseline for this data is the best-case, zero cycle ideal migration at the finest granularity of thread switching. Thus, the graph shows the performance and efficiency
retained when adding a cycle penalty for each migration. The “min” plots show the phase
whose performance or efficiency is degraded the most at the given cycle penalty. A knee in both performance and efficiency curves appears at the 100 cycle point. Thus, a good target for any
hardware thread migration scheme is near 100 cycles.
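The per-interval decision rule described above can be sketched as follows. This is a simplified illustration and not the actual post-processing code: it compares execution times in a common unit (so the cycle penalty is assumed to be expressed in that same unit), it covers only the performance case, and the function name and data are hypothetical.

```python
def schedule_with_penalty(time_a, time_b, penalty):
    """Greedy per-interval migration between two cores.

    Starts on the core that is faster for the first interval. For each
    later interval, the thread migrates only if the other core's time plus
    the migration penalty beats staying on the current core.
    Returns (total_time, number_of_migrations).
    """
    times = (time_a, time_b)
    current = 0 if time_a[0] <= time_b[0] else 1
    total = times[current][0]
    migrations = 0
    for i in range(1, len(time_a)):
        stay = times[current][i]
        move = times[1 - current][i] + penalty
        if move < stay:
            current = 1 - current
            migrations += 1
            total += move
        else:
            total += stay
    return total, migrations


# Hypothetical per-interval times (not thesis data).
time_a = [10, 30, 10, 30]
time_b = [20, 20, 20, 20]
ideal_time, _ = schedule_with_penalty(time_a, time_b, 0)
for penalty in (0, 5, 10):
    total, migrations = schedule_with_penalty(time_a, time_b, penalty)
    # Performance retained relative to the zero-penalty ideal.
    print(penalty, migrations, ideal_time / total)
```

As the penalty grows, fewer migrations pay for themselves; once it reaches the largest per-interval difference between the cores, no migrations occur, which corresponds to the stopping condition used for the cycle penalty sweep.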
Figure 2.7 shows the efficiency retained when imposing various energy penalties on each migration. The methodology for this graph is similar to that of Figure 2.6, except there is a zero
cycle migration penalty. Since performance is not impacted by energy consumption, it is omitted
(a) 2D baseline (only one of two cores shown).
(b) 2D FTM. (c) 3D FTM.
Figure 2.8: Depictions of 2D and 3D layouts of fast thread migration (best viewed in color).
2.2.2 3D Physical Design Study
The low-overhead migration mechanisms discussed in Chapter 3 require many additional wires and extra logic (muxes). It is also important for these wires to be as short as possible to
minimize their latency. In this section, I explore the pressure that these two requirements exert
on a layout, and project the extent to which a 3D die-stacked implementation can reduce the pressure. In particular, this section explores tradeoffs among routability, area and latency.
Experimental Framework
For these physical design experiments, I extracted a partial core from the FabScalar RTL [19].
The RTL includes the PRF and execution lanes (Register Read stage, function units, and Writeback stage including bypasses). This represents only the logic that influences the cycle
time of the PRF. Eliminating extraneous logic reduces the time needed for synthesis, placement
and routing (SPR), which is important as I sweep through many placement densities for three different PRF designs: no FTM, 2D FTM, 3D FTM. FTM refers to Fast Thread Migration –
the high wiring connectivity between physical register files. Moreover, focusing on just the
PRF-related stages yields more consistent results. FabScalar currently has pipeline stage imbalances that give SPR considerable leeway on the delay of some stages. This leeway masks some of the
effects that I measure, causes arbitrary variations across different SPR runs, etc.
With the RTL of this partial core as a starting point, I consider the following three designs. Refer to Figure 2.8 for simplified depictions of these designs. (I refer to the partial core simply
as “core” for the remainder of this section.)
2D baseline: This is a 2D layout of two instances of the core without FTM. Figure 2.8a
depicts one of the cores (the other core is not shown as there is no connectivity between the cores). The core is represented with a gray substrate. On the substrate is a PRF, in
in teal. Since each function unit reads from the PRF in its Register Read stage, there are
red wires drawn from each bitcell to each function unit.
2D FTM: This is a 2D layout of two instances of the core with FTM. Figure 2.8b shows
how the layouts of the two cores can be mirrored, with their PRFs placed close together at
the center of the die. The diagram also depicts the per-bitcell wiring required for swapping
the PRFs. The extra wires increase the already congested area near the bitcells.
3D FTM: This is a projected 3D layout of two instances of the core, one on each tier,
with FTM. This design is depicted in Figure 2.8c. For clarity, the top substrate is removed and the top PRF and function units are made transparent. For FTM, the PRFs are
connected by face-to-face vias, shown in white. The congestion of 3D FTM is expected
to fall somewhere between 2D baseline and 2D FTM.
The FreePDK45 technology libraries used in these studies do not include support for 3D
die-stacking. Consequently, 3D FTM is a 3D projection, based on 2D placement and routing of
the cores with routing obstructions that model the face-to-face vias connecting the two PRFs.
I model the routing blockages of face-to-face vias using the following methodology. First, I add a new D flip-flop to the LEF (geometry) file of the standard cell library. It is derived
from an existing D flip-flop. Its length is increased by two times (2x) the diameter and pitch
of a face-to-face via (coincidentally, it turns out that the standard cell height already matches the via diameter). The diameter and pitch were obtained from a Tezzaron whitepaper [28] (see
“bond points”). There are two vias per bitcell, to account for the incoming and outgoing bitcell values. The new flip-flop is about three times as long as the original flip-flop. The description
of the new flip-flop also includes metal layer obstructions (wiring blockages) on all metal layers
above the extended area of the flip-flop. Thus, when the new flip-flop is used for the PRFs, the routing algorithm steers clear of a vertical column through all metal layers down to each bitcell.
Second, the synthesized netlist is adjusted before placement and routing. All PRF flip-flops
are replaced with instances of the new flip-flop. Since we expect the connected bitcells to be placed directly above and below each other, the obstructions account for the routing that would
be generated by a 3D CAD flow or inserted by the physical designer. So one final modification
to the synthesized netlist is to remove the FTM connections between the PRFs – this keeps the muxes and bitcells intact, but eliminates the duplicate wiring that has been accounted for in
the obstructions.
3D FTM is a conservative model in two respects. First, it may not be necessary to obstruct
all metal layers. Each face-to-face bond point can be placed on the top-most metal layer, freeing
the router to complete connections to flip-flops underneath. Second, the diameter and pitch of
RTL is synthesized to the FreePDK 45nm standard cell library [54] using Synopsys
Design Compiler version E-2010.12-SP2. All three designs are placed and routed using Cadence Encounter RTL-to-GDSII System 9.11.
Results
To estimate the physical design impact, I perform an automated place-and-route of the three
designs. The only placement constraint applied is that each core must stay within a bounding box on one half of the die. Wiring congestion can be inferred from these routed designs by
counting the number of overflowed gcells. A gcell defines a region of routing within the total design,
and consists of a number of routing tracks. When global routing must pass through a gcell, the number of used tracks within that gcell is incremented by one. Once global routing is completed,
a gcell with more signals routed through it than its capacity is considered an overflow.
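The overflow count can be illustrated with a small sketch. This is not how the place-and-route tool reports congestion internally; it only mirrors the definition above. The grid coordinates, net routes, uniform track capacity, and the function name count_overflowed_gcells are all hypothetical, and each net is assumed to consume one track in every gcell it passes through.

```python
from collections import Counter


def count_overflowed_gcells(routes, capacity):
    """Count gcells whose routed signal count exceeds their track capacity.

    routes: iterable of nets, each given as the gcell coordinates it crosses.
    capacity: number of routing tracks per gcell (assumed uniform here).
    """
    usage = Counter()
    for net in routes:
        for gcell in set(net):  # assumption: one track per net per gcell
            usage[gcell] += 1
    return sum(1 for used in usage.values() if used > capacity)


# Hypothetical global routes on a tiny grid (not taken from the designs).
routes = [
    [(0, 0), (0, 1)],
    [(0, 1), (1, 1)],
    [(0, 1)],
    [(0, 0)],
]
overflows = count_overflowed_gcells(routes, capacity=2)
print(overflows)
```

Here only gcell (0, 1) carries three signals against a capacity of two, so a single overflow is counted.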
For each design, I vary the standard cell placement density from 80% to 30% and measure
the number of overflows, area, and latency of the PRF-to-PRF value exchange (for 2D FTM
and 3D FTM).
The graph in Figure 2.9 shows overflows (y-axis) as a function of area (x-axis). Each point is labeled with the placement density used for that point. As one would expect, increasing density
decreases area but increases overflows. If confined to a 2D layout, congestion is drastically
increased when the PRFs are connected, evident in the large increase in overflows from 2D
baseline to 2D FTM for a given area. This substantial increase in congestion may lead to a
difficult-to-route and/or lower frequency design at best, or an unroutable design at worst. The
graph also confirms the hypothesis that 3D FTM should fall between 2D baseline and 2D FTM.
In fact, we see that 3D FTM is always better (fewer overflows) than 2D FTM for a given area.
The graph in Figure 2.10 factors latency into the tradeoff analysis for the two FTM designs.
The graph re-plots overflows on the primary y-axis with solid lines, and superimposes the latency of the PRF-to-PRF value exchange on the secondary y-axis with dashed lines. The
latency of 2D FTM is measured directly from the post-routed netlist. The latency of 3D FTM
is constant and is assumed to be the lowest latency of 2D FTM (at its most dense point, where
wires are shortest). We reason that the latency is not only low, but also independent of density,
because every flip-flop is directly above or below its counterpart. In contrast, the latency of 2D
FTM is very sensitive to density. Thus, the 2D layout suffers from a difficult tradeoff: either
increase density to reduce latency, and pay the price in terms of lower routability and more
physical design effort, or decrease density and pay in terms of higher latency. The 3D layout does not pose this tradeoff: density can be decreased for a more routable design, with no impact
on latency.
Figure 2.9: Routing overflows due to placement density and PRF connectivity. Axes: overflows (thousands) versus area (sq. microns); each point is labeled with its placement density (30% to 80%). Series shown: 2D baseline, 3D FTM, 2D FTM.
Figure 2.10: PRF-to-PRF swap latency. Axes: overflows (thousands) on the primary y-axis and latency (ns) on the secondary y-axis, versus area (sq. microns). Series shown: 3D FTM overflows, 2D FTM overflows, 3D FTM latency, 2D FTM latency.
challenges in a 2D design. A 2D design requires the structures holding the state to be exchanged,
or externally referenced, to be near one edge of each core. This placement may not be optimal for performance and energy of the core. That is, intra-core and inter-core floorplanning
may have competing interests. Moreover, as additional structures are considered for inter-core
exchange or referencing, it may not be feasible to locate all of them at one edge. With 3D die-stacking, structures can be placed anywhere within the core as long as their counterparts
are directly above or below. This satisfies both intra-core and inter-core interests and allows
2.3
Related Work
Heterogeneous multi-core processors have been shown to be a possible way to increase the
performance and efficiency of general purpose workloads. The concept of pairing cores of different microarchitectures was first introduced by Kumar, et al. in several seminal
publications [33] [34] [35]. Their initial work established the power reduction [33] and multithreaded
performance [35] improvements made possible by considering a mix of pre-existing designs (various implementations of the Alpha ISA). Their follow-up work [34] explored the possibility that
performance of a heterogeneous multi-core can be best achieved by considering cores that may
not have previously been designed.
Spurred by these seminal works, several other academic proposals followed. Suleman et
al. [56] found that a heterogeneous mix of cores was particularly advantageous for multithreaded applications. Their key insight was that highly parallel code sections could effectively be handled
by smaller cores, but critical sections impose a serialization point in the program and should
be executed by a core designed for the highest possible performance.
Najaf-abadi et al. [41] use a heterogeneous multi-core to improve a single-threaded program
by redundantly running a program on multiple cores simultaneously. The cores are
heterogeneous and have a low latency communication channel between them. As a program executes, the core that is able to make the fastest forward progress will pass result values to the lagging
cores, keeping them at nearly the same point in the program's execution. As the program
characteristics change, one of the other cores may start to out-perform and overtake the previously best core. The new leading core will then start to pass results to the other cores. This has the
advantage that the best performing core for a particular program phase does not need to be
determined a priori, nor does a program need to be moved to react to program behavior.
The advantages of heterogeneous multi-core processors have been established well enough
that several industry designs have taken the approach [4] [27] [58]. These designs all use a
Chapter 3
Alternatives for Hardware Thread
Migration
This chapter outlines several possible alternatives for hardware thread migration. Each
alternative was implemented in the Verilog RTL model and simulated for thousands of
migrations at various relative frequencies to ensure functional correctness. In keeping with the theme of this thesis, these alternatives explore the exchange of program register values, leaving the
memory state migration to other existing and future work.
Each alternative is implemented on two baseline “reference cores”, described in Section 3.2.
These cores are heterogeneous and out-of-order superscalar cores, but both cores are on the
lesser end of implementation complexity (neither core has especially high peak instruction bandwidth or large structures). These reference cores are the same cores used in the H3 fabricated
prototype chip, and their architectures are fully enumerated in Table 5.1 in Chapter 5. Using
small reference cores makes the estimates of hardware overhead as conservative as possible – as the migration is added to the minimum possible backdrop. Each alternative (Section 3.3
through Section 3.6) is described with respect to the changes needed to the pipelines of these
reference cores.
The key aspect for each of these designs is that they must work with cores that operate
at independent clock frequencies. This guides how tightly coupled cores can be, and requires
careful consideration for control and data values as they cross clock domains.
While this thesis explores hardware thread migration with the goal of improving
performance and efficiency of heterogeneous cores, that is not the only possible use-case. For instance,
it may be beneficial to consider hardware migration with thermal considerations [13] [29]. A multi-core processor typically senses the temperature of the constituent cores, and when a core
is heated to a pre-determined threshold, either the core frequency is throttled down, or the
cool down). Hardware thread migration could aid in this situation – when a core reaches the
threshold temperature, the thread can be moved to another core without incurring the additional overhead of an OS-managed migration. Another use for hardware thread migration is in
support of processor sleep states. Modern processors are able to put unused cores to “sleep”
by power-gating those cores. When the supply power is switched off, state-holding structures are unable to retain their values. Migration-like hardware could be used to more quickly move
these values to newly-introduced register files (not full cores) on their own power domain. This
would allow for a low-power retention of values, while enabling the core to move into and out of sleep mode more quickly.
3.1
Overview
There are many possible hardware mechanisms and policies that could be proposed for
accelerating thread migrations. In this thesis, I narrow the focus to four such migration alternatives
(Sections 3.3-3.6), relative to a baseline implementation (Section 3.2) that relies solely on the operating system to migrate a thread via a context switch. The impetus behind proposing
several hardware alternatives is to analyze several designs that span a spectrum of costs and
potential benefits. Figure 3.1 depicts this spectrum.
The benefit of hardware migration is in lower latency (in cycles) to migrate a thread from one
core to another. These alternatives were selected to progressively lower the latency as additional
hardware is added. The costs associated with migration hardware are additional power, area, and timing. Currently, area is no longer a primary design constraint, since transistors are
abundant. The clock period of a processor is a primary concern, but any additional timing
incurred by a proposed hardware thread migration can likely be ameliorated by pipelining the circuits that do not meet timing. Power remains as a key constraint. There is potential
for an additional power draw by the migration hardware due to the addition of state-holding
structures, and mechanisms for clock-domain crossing circuits. These power costs should not outweigh the benefits of introducing hardware migration. Each of these migration alternatives
incrementally adds hardware in an attempt to find the power “sweet-spot”.
When considering hardware migration alternatives, I focus on designs that progressively eliminate bottlenecks with respect to a baseline design that uses the operating system for
migrating threads. Figure 3.1 enumerates the bottlenecks that each design point alleviates
(note that when considering designs from left-to-right, the bottlenecks that are eliminated are additive). As outlined in Chapter 2, when moving a thread via the operating system, several
overheads exist. These include several traps, moving the processor between privilege states, allocating memory for holding thread context, executing a scheduling algorithm, and so on.