Exploring Energy-Time Tradeoff in High Performance Computing

(1)

PAN, FENG Exploring Energy-Time Tradeoff in High Performance Computing. (Under the

direc-tion of Vincent W. Freeh).

Recently, energy has become an important issue in high-performance computing. For

example, low power/energy supercomputers, such as Green Destiny, have been built; the idea is to

increase the energy efficiency of nodes. However, these clusters tend to save energy at the expense

of performance. Our approach is instead to use high-performance cluster nodes with frequency

scalable AMD-64 processors; energy can be saved by scaling down the CPU. Our cluster provides

a different balance of power and performance than low-power machines such as Green Destiny. In

particular, its performance is on par with a Pentium 4-equipped cluster.

This thesis investigates the energy consumption and execution time of a wide range of

applications, both serial and parallel, on a power-scalable cluster. We study via direct measurement

both intra-node and inter-node effects of memory and communication bottlenecks, respectively.

Ad-ditionally, we present a framework for executing a single application in several frequency-voltage

settings. The basic idea is to first divide programs in to phases and then execute a series of

experi-ments, with each phase assigned a prescribed frequency.

Our results show that a power-scalable cluster has the potential to save energy by scaling

the processor down to lower energy levels. Furthermore, we found that for some programs, it is

possible to both consume less energy and execute in less time by increasing number of nodes and

reducing frequency-voltage setting of the nodes. Additionally, we found that our phase detecting

(2)

by

Feng Pan

A thesis submitted to the Graduate Faculty of North Carolina State University

in partial satisfaction of the requirements for the Degree of

Master of Science

Department of Computer Science

Raleigh

2005

Approved By:

Dr. Eric Rotenberg Dr. Jun Xu

Dr. Vincent W. Freeh

(3)

Biography

Feng Pan is from Dalian, China. He came to United States in 1998. From 1999 to 2003,

he attended Appalachian State University in Boone, NC, where he received Bachelor of Science

degree in both Computer Science and Applied Math. He began graduate study at North Carolina

State University immediately after his undergraduate career. He will receive Master of Science

(4)

List of Figures

3.1 Aggregate plots of NAS set. . . 12

3.2 Aggregate plots of SPEC INT set. . . 13

3.3 Aggregate plots of SPEC FP set. . . 14

3.4 Best energy-time tradeoff in each set. . . 14

3.5 Worst energy-time tradeoff in each set. . . 15

3.6 OPM vs.betafor all applications on AMD-64. . . 16

3.7 Energy consumption vs execution time for NAS benchmarks on 2, 4, and 8 (or 4 and 9) nodes. . . 18

3.8 Energy consumption vs. execution time for Jacobi iteration on 2, 4, 6, 8, and 10 nodes. 20 3.9 Relative energy vs. time. . . 21

3.10 Synthetic benchmark with high memory pressure. . . 22

4.1 Trace of operations per miss for LU C. . . 24

4.2 Energy consumption vs. execution time for three NAS benchmarks on a single AMD machine. . . 25

4.3 Heuristic for searching solution space. . . 26

4.4 Energy-time plot of many BT runs. . . 28

(6)

List of Tables

3.1 Idle and active power for AMD-64. . . 11

3.2 Predicting CPU boundness. . . 15

(7)

Chapter 1

Introduction

High-performance computing (HPC) tends to push performance at all costs.

Unfortu-nately, the “last drop” of performance tends to be the most expensive. For example, the last 10%

increase in performance requires a disproportionally large amount of resources. This

performance-at-all-costs policy is only justifiable given infinite resources. An example of nearly infinite resources

is any of the supercomputing centers, which houses large-scale supercomputers. For example, the

ASCI Q machine consumes more than 3 gigawatts [53]. This unchecked energy consumption costs

the government a significant amount of money and wastes natural resources. Moreover, it is unlikely

that supercomputing centers can continue limitless consumption of resources. In particular, energy

consumption—and the resultant heat dissipation—is becoming an important limiting factor.

As a result, lower-power, high performance clusters such as Orion[42], have been

devel-oped to stem the ever-increasing demand for energy. Such systems improve the energy efficiency

of nodes. However, their performance penalty is usually too large to be reasonable for many users.

If performance is the only goal, then one should continue on the current “performance-at-all-costs”

path of HPC architectures. On the other hand, if power is paramount, then one should use a

low-performance architecture that executes more instructions per unit energy.

We believe one should strike a path between these two extremes. We use a

high-perfor-mance microprocessor that has frequency and voltage-scaling. Each frequency provides an

energy-performance point that we call a gear. It is conceptually possible to save energy without an increase

in time because an increase in CPU frequency generally results in a smaller increase in application

(8)

increasing frequency also increases CPU stalls—usually waiting for memory or communication.

Consequently, there are opportunities where energy can be saved, without an undue

per-formance penalty, by reducing CPU frequency. On the other hand, for those programs or parts of a

program (because most programs are not uniform) where the time penalty is large, the processors

should be run at the fastest frequency.

This thesis investigates the tradeoff between energy and performance (execution time)

for HPC. We evaluate on a real, small-scale power-scalable cluster. Each node is equipped with a

AMD-64 processor that supports frequency and voltage scaling.

Towards this goal, we analyzed benchmarks, both serial and parallel, to determine the

relationship between voltage and frequency settings and execution time. Our results show that all

of our tests have what we call an energy-time tradeoff, meaning that a decrease in energy is possible

but it comes at the cost of increased time. Not every energy-time tradeoff is desirable, as some offer

little energy savings and large time penalties. However, more than half of these tests show a savings

that is equal to or better than the penalty (e.g., 10% less energy and 10% more time), and some

are much better than that. In particular, we found that on one node, it is possible to use 20% less

energy while increasing time by 3%, with CG. However, with EP there was essentially no savings.

We present a simple metric that predicts this energy-time tradeoff. Additionally, we found that in

some cases one can save energy and time by executing a program on more nodes at a slower gear

rather than on fewer nodes at the fastest gear. We believe this will be important in the future, where

a program running on a cluster may be allowed to generate only a limited amount of heat.

In addition to exploring the energy-time tradeoff by executing entire programs in a single

gear, this thesis also presents a framework for executing a single application in several

frequency-voltage settings. As mentioned above, most programs are not uniform, as a result, certain parts of

a program may have a different energy-time tradeoff and therefore should be run using a different

gear than other parts of the same program. We implement our scheme by first manually dividing

programs into phases based on trace data collected during a profile of the application. If there aren

phases in an application andgpossible gears, there aregn_{unique solutions, where a solution is an}

assignment of a gear to each phase. We present a novel heuristic that searches the space of solutions.

It executes a solution and evaluates the energy-time tradeoff based on a user-defined metric. Based

on the evaluation, it selects the next solution to evaluate. The heuristic terminates when the results

are satisfactory (again based on input from the user or cluster administrator). The heuristic requires

at mostn·g steps to find a “good” solution; however, we have found that in most cases, it takes

(9)

Performance results on the NAS suite show that our heuristic finds effective solutions.

Specifically, we find several solutions that use different gears per phase that are superior to any

solution that used a single gear for all phases. For example, for MG, a multiple-gear solution

executes in the same time but consumes 3% less energy then a single-gear solution. This same

multiple-gear solution executes 8% faster than another single-gear while using the same amount of

energy.

1.1 The Case for Power-Aware, High-Performance Computing

Power-aware computing has flourished over the last decade in many areas, especially

mobile devices. However, it has little traction in HPC computing because the HPC community has

vastly different goals than those of mobile users. Because computational scientists are using HPC

to gain maximum performance, many may resist any mechanism that increases execution time.

We believe that a significant subset of HPC programmers will be forced to accept less

performance in order to conserve power. This will occur because the large and growing power

consumption of current clusters at supercomputer centers—which are not immune to economic

pressures—will eventually come to bear on the cluster users. This may manifest itself in

program-mers utilizing lower-power machines such as Green Destiny [42, 53]. Consider the case of Green

Destiny, which is a cluster of Transmeta processors. While it uses less power and energy than a

conventional supercomputer, it also provides much less performance. In particular, Green Destiny

consumes about three times less energy per unit performance than the ASCI Q machine. However,

because Green Destiny uses a slower (and cooler) microprocessor, ASCI Q is about 15 times faster

per node (200 times overall) [53]. A reduction in performance by such a factor surely is

unreason-able from the point of view of many HPC programmers.

We believe that computational scientists would be more likely to consider energy

reduc-tion if we start from a baseline of a high-performance, scalable microprocessor. Then, from there we

attempt to reduce power with only a small performance loss. The emergence of high-performance

scalable microprocessors, such as the AMD-64/Opteron [2], makes this possible. This

micropro-cessor has performance that is on a par with high-end, fixed frequency desktop micropromicropro-cessors, but

its power consumption can be reduced to 25% of the maximum using frequency and voltage

scal-ing [54]. We expect that such power-performance settscal-ings will be commonplace on most high-end

(10)

with virtually no performance loss (e.g., mcf and art from the SPEC benchmark set).

Saving Power and Energy with gear Selection

High-performance computing (HPC) applications and machines are inefficient. For

ex-ample, the 2004 Gordon Bell Prize, which is given annually to the application with the greatest

performance, was awarded to one that sustained 46% parallel efficiency [16]. This means that

on average, a microprocessor in a supercomputer executing the world champion application was

blocked and idling more than half of the time. In such a situation, a traditional microprocessor races

forward at full speed to the next blocking point and waits. While blocked, the machine is

consum-ing energy, but not performconsum-ing any useful work. Furthermore, when a program has good parallel

efficiency, it may not be efficient sequentially (e.g., it may stall waiting for the memory subsystem).

As an analogy, consider a car speeding between stop lights. Because a traditional

mi-croprocessor has only one gear, which uses full power and provides the maximum performance,

it must race between metaphorical stop lights. Recently, manufacturers have begun producing

mi-croprocessors with multiple energy gears. With such a microprocessor, a computer can shift into a

reduced power and performance gear so that computation is potentially completed just in time for

the unblocking event—i.e., arriving just as the light turns green.

This method of execution saves energy in two ways. First, (wasted) energy consumed

while waiting is eliminated. Second, a microprocessor uses less energy per operation at lower

en-ergy gears. This is because while performance is linearly proportional to clock frequency, power

consumption is proportional to frequency and voltage squared. Thus, at a slower gear, a

micropro-cessor will consume less energy per operation than at a faster gear [40]. Again, this follows the

analogy of an automobile—it takes less gasoline to travel at 50 MPH for 2 hours than at 100 MPH

for one hour, partly because drag is proportional to velocity squared.

It may appear that node performance should be reduced solely based on its percentage

of idle time, but this not the case. Doing so blindly will likely mean that events that are on the

critical path are delayed, which increases execution time by the amount of that delay. In this paper,

we investigate the tradeoff between energy and performance (execution time). We closely examine

(11)

1.2 Contributions

This thesis makes several contributions. First, using NAS and SPEC benchmark suites,

it evaluates the energy and time tradeoff of single node applications on an AMD-64 processor that

is frequency and voltage scalable—i.e., its clock speed and power consumption can be changed

dynamically. Second, it evaluates a real, small-scale power-scalable cluster, which is a cluster

com-posed of AMD-64 processors described above. Using NAS-MPI benchmark suite, we illustrate the

tradeoff between the power-conserving benefits and the time-increasing costs of such a machine.

Third, it introduces a framework for executing a single application in several frequency-voltage

settings, in order to achieve maximum energy efficiency. This is an initial step to an power-aware

operating system infrastructure that is capable of adjusting processor frequency automatically

with-out user intervention.

The rest of this thesis is organized as follows. Chapter 2 describes related work, Next,

Chapters 3 and 4 present single-gear and multiple-gear results, respectively. Finally, Chapter 5

(12)

Chapter 2

Related Work

There has been a voluminous amount of research performed in the general area of energy

management. In this section, we describe some of the closely related research. We divide the related

work into two categories: server/desktop systems and mobile systems.

2.1 Server/Desktop Systems

Several researchers have investigated saving energy in server-class systems. The basic

idea is that if there is a large enough cluster of such machines, such as in hosting centers, energy

management can become an issue. In [7], Chase et al. illustrate a method to determine the aggregate

system load and then determine the minimal set of servers that can handle that load. All other servers

are transitioned to a low-energy state. A similar idea leverages work in cluster load balancing to

determine when to turn machines on or off to handle a given load [46, 47]. Elnozahy et al. [15]

investigated the policy in [46] as well as several others in a server farm. Such work shows that

power and energy management are critical for commercial workloads, especially web servers [4,

36]. Additional approaches have been taken to include DVS [14, 51] and request batching [14].

The work in [51] applies real-time techniques to web servers in order to conserve energy while

maintaining quality of service.

Our work differs from most prior research because it focuses on HPC applications and

(13)

servicing client requests. On the other hand, an HPC installation exists to speedup an application,

which is often highly regular and predictable. One HPC effort that addresses the memory bottleneck

is given in [29]; however, this is a purely static approach.

In server farms, disk energy consumption is also significant. One study of four energy

conservation schemes concludes that reducing the spindle speed of disks is the only viable option

for server farms [6]. DRPM is a scheme that dynamically modulates the speed of the disk to save

energy [25, 26]. Another approach is to improve cache performance—if many consecutive disk

accesses are cache hits, the disk can be profitably powered down until there is a miss; this is the

approach taken by [60]. An alternative is to use an approach based on inspection of the program

counter [22]; the basic idea is to infer the access pattern based on inspection of the program counter

and shut down the disk accordingly. A final approach is to try to aggregate disk accesses in time. A

compiler/run-time approach using this was designed and implemented in [27], and a prefetching

ap-proach in [44]. Both were designed for mobile systems but can be directly applied to server/desktop

systems.

There are also a few high-performance computing clusters designed with energy in mind.

One is BlueGene/L [1], which uses a “system on a chip” to reduce energy. Another is Green

Destiny [53], which uses low-power Transmeta nodes. A related approach is the Orion Multisystem

machines [42], though these are targeted at desktop users. However, using a low-power processor

sacrifices performance in order to save energy.

This thesis is partly based on our previously published works. A study of single-node

applications can be found in [43], however, it explores energy-time tradeoff in mobile systems. In

[20] we study the energy-time tradeoff in MPI programs. We have also examined the energy saving

potentials by using multiple energy gears in MPI programs [21]. Another member of our research

group has developed techniques for saving power in datacenters in [17] and [18].

2.2 Mobile Systems

There is also a large body of work in saving energy in mobile systems; most of the early

research in energy-aware computing was on these systems. This subsection details some of these

projects.

ECOSystem [59] attempts to implement a power management system without the need

(14)

accounting model is implemented, called the currentcy model. The currentcy model attempts to

attribute energy usage to individual applications, as well as to specific components of the machine.

Each application is allocated a certain amount of currentcy, which corresponds to the total energy it

is allowed to consume in each epoch. The system then accounts for where each application spends

its energy using a combination of real-time feedback from the battery and estimation techniques

(re-quired because the information returned from the battery is too coarse grained). While ECOSystem

is an effective framework for power management, it has the undesirable characteristic of assuming

user knowledge of currentcy. It expects users to have a predetermined idea of how much energy

should be used by each application. Therefore, ECOSystem is not transparent, which we believe is

paramount.

The case for a closer relationship between the operating system system and power

man-agement is further explored by Vahdat et al. [52, 13]. This includes a case for treating energy as a

first class resource in operating systems. Perhaps the best endorsement of OS-controlled power

con-sumption comes from the ACPI (Advanced Configuration and Power Interface) standard [9]. It is

an evolution of several existing methods including BIOS power management, the APM (Advanced

Power Management) API, and a smart battery interface.

In general, the goal of the OS is to conserve energy for the entire set of processes. Our

approach differs in that we are concerned with saving energy in a single program.

Individual Devices Many have worked on saving energy in different devices, such as the CPU,

disk, memory, and network. Many modern processor architectures allow different frequency-voltage

settings. This work developed into dynamic voltage scaling (DVS) [19, 24, 45, 48, 31], which has

come to mean the simultaneous changing of clock speed and voltage to reduce power consumption.

Typically, DVS optimizes the energy×delay or energy×delay2[38, 23] product. This creates a

system that more efficiently uses energy, but is still powerful and responsive. In order to do this,

DVS must accurately predict upcoming load. A poor prediction either consumes too much energy

or causes too much delay. In this work, we investigate the relationship between frequency/voltage

and execution time. DVS is used to adjust frequency and voltage in our multi gear tests.

Disk Disks consume a large percentage of energy on some mobile architectures. Many have

studied disk spindown to save energy (e.g., [28, 12, 55, 3, 37]). In general, the idea is to determine

when there is a large time period in which there are no disk requests and transition to a lower energy

(15)

Memory and Network In some architectures, individual memory banks and/or network card has

multiple energy states. Work on transitioning these to lower energy states include [10, 35, 8, 33,

58, 57]. The idea is to place data intelligently in banks so that some banks will not be accessed.

In some devices the network card has multiple energy states. One way to save energy is to use the

energy-saving mechanisms defined by 802.11b [8]. One improvement to 802.11b is the Bounded

Slowdown Protocol [33], which uses minimal energy given a desired maximum increase in round

trip time. In addition, Yan et al. [58] leverage TCP to save energy in large file downloads. Finally,

Kravets investigated power-aware mechanisms for end-to-end communication in wireless networks

[34].

The primary distinction between these projects and ours is that energy saving is

typi-cally the primary concern in mobile devices. In HPC applications, performance is still the primary

concern. Energy consumed by memory banks and network cards is important primarily in mobile

devices. However, in desktop or cluster processors, memory energy consumption is relatively small.

(16)

Chapter 3

Single-Gear Results

This chapter describes the results of our single-gear experiments (i.e., we do not change

gear during execution of a single application). We studied programs from three different benchmark

sets: NAS, SPEC integer, and SPEC floating point, in our single node tests. The NAS suite is

a popular high-performance computing benchmark, consisting of scientific benchmarks including

application areas such as sorting, spectral transforms, and fluid dynamics. In contrast, the SPEC

integer benchmarks are non-scientific applications that are CPU and/or memory intensive. The

SPEC floating point benchmarks are a mixture of both scientific and non-scientific programs. For

example, mesa and facerec are non-scientific, graphics programs, whereas swim and mgrid are

well-known scientific benchmarks. In our multi-node tests, we studied NAS MPI suite. We used all valid

configurations using up to 9 nodes. Presumably, such mature benchmarks have been thoroughly

analyzed and are well-written (e.g., see [56])—so that they are not unrealistically communication

bound.

This study uses, as a reference example, a cluster of ten nodes, each equipped with a

frequency- and voltage-scalable AMD Athlon-64. All single node tests are run on one of the

identi-cal nodes. We run each program at each available energy gear: 2000 MHz, 1800 MHz, 1600 MHz,

1400 MHz, 1200 MHz, and 800 MHz.1 The voltage, which ranges from 1.5–1.0V, is reduced in

each gear. Each node has 1GB main memory, a 128KB L1 cache (split), and a 512KB L2 cache,

and the nodes are connected by 100Mb/s network. In this thesis, we control the CPU power and

measure overall system energy. This is effective in saving energy because the CPU—a major power

1

(17)

F requency V oltage P ower(W)

(MHz) (V) idle active

2000 1.5 59.6 98.8–108.6

1800 1.4 57.0 86.5–95.2

1600 1.3 55.9 80.1–88.1

1400 1.2 54.8 74.3–81.9

1200 1.1 53.4 67.1–75.4

800 1.0 51.0 56.8–63.5

Table 3.1: Idle and active power for AMD-64.

consumer—uses less power. According to datasheet, the Athlon CPU used in this study consumes

up to 89W with all components fully active[2]. However, given that the maximum system power

we measured is 109W, the CPU must not reach this level in our tests.

For each program we measure execution time and energy consumed for an application at

a range of energy gears. Execution time is elapsed wall clock time. The energy consumed by the

entire system is measured by a WattsUp pro meter[39] attached to each system at the wall outlet

to determine the instantaneous power (in Watts). This value is integrated over time to determine

the energy used. Because we attach a meter to each node, for multi-node tests, the total energy

consumption by all systems can be easily and accurately obtained by calculating the sum of each

individual node’s energy consumption.

Table 3.1 shows idle power and active power ranges for all gears of AMD-64. The active

power ranges are determined using single-node data collected in Chapter 3. The idle power is

determined with CPU in HALT (C1) state. Currently in Linux, CPU enters HALT state when the

operating system determines that the system is idle. We can see that the difference in active power

consumption between highest and lowest frequencies is about 40%, while the difference in idle

power is less than 10%. Thus there is a small benefit to reduce gear before entering the HALT state.

This chapter is divided in to two parts. First, it shows results on a single node. This

section shows the energy-time tradeoff due to the memory bottleneck. Next, it describes time and

energy results on multiple nodes. This section shows the effect of the communication bottleneck as

(18)

0.9 1 1.1 1.2 1.3 1.4 1.5 800 1000 1200 1400 1600 1800 2000 Speed (MHz) NAS time Energy (a) All 0.75 0.8 0.85 0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3 800 1000 1200 1400 1600 1800 2000 Speed (MHz) NAS (top third)

time Energy

(b) Top third

Figure 3.1: Aggregate plots of NAS set.

3.1 Single-Node Results

Because we studied a large number of programs from three benchmark suites, space

con-straints do not permit the presentation of all the individual results. Therefore, we divide single-node

results into two parts. First, we discuss the overall results. Then, we look at a few representative

applications in detail. Complete individual application results can be found in Appendix A.

Overall Results

All of our tests show that for a given program, using the fastest gear takes the least time.

A slower gear takes more time, and because the CPU is the dominant power consumer, uses less

power. However, in terms of energy, the results vary. If the decrease in power exceeds the increase

in time, a slower gear uses less energy. This is the case all of the programs we tested, where one

of the slower gears results in the least energy consumed. However, it is conceivable that the time

increase could exceed the power decrease, so the fastest gear would also consume the least energy.

Figures 3.1(a), 3.2(a), and 3.3(a) plot the normalized aggregate results for each program

set on AMD 64. (NAS, SPEC INT and SPEC FP) The x-axis plots the gear in terms of frequency

from highest to lowest. There are two lines; the increasing line is elapsed time and the initially

decreasing line is energy consumed. All values are normalized to those of the fastest gear; thus, all

lines begin at 1 on the left-hand side. For the NAS programs, the time and energy diverge from 1

(19)

0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7

800 1000 1200 1400 1600 1800 2000

Speed (MHz) SPEC INT

time Energy

(a) All

0.9 1 1.1 1.2 1.3 1.4 1.5 1.6

800 1000 1200 1400 1600 1800 2000

Speed (MHz) SPEC INT (top third)

time Energy

(b) Top third

Figure 3.2: Aggregate plots of SPEC INT set.

at 1600 MHz, the energy used and time taken are 91% and 112% of full, respectively. The SPEC

sets also show an increase in time, but show little decrease in energy. This is because there is a

high variance in the energy-time tradeoff among programs. The SPEC sets contain a few programs

for which there is little energy savings because the fastest gear uses the least energy or close to it.

Overall, the aggregate plots suggest that one needs to be selective about which programs to try to

save energy.

To investigate further, we plot for each set the programs that rank in the top 1/3 of

energy-time tradeoff. Figures 3.1(b), 3.2(b), and 3.3(b) show the results. The NAS subset looks best, with

16% average energy savings for a time delay of less than 3% at 1600 MHz. The SPEC FP subset is

similar, with 12% savings for 7% delay. Finally, the SPEC INT subset provides a 7% savings for a

14% delay.

The slowest two gears do not generally offer energy savings. While the power is

de-creased, the time delay is so great that the energy savings is small if any. However, we notice that

all of the programs have an energy-time tradeoff. The complete results are shown in Appendix A.

The next section evaluates individual programs in detail.

Detailed Results

In this section we analyze six programs in detail: the best and worst in terms of

energy-time tradeoff from each program set. The total system energy consumed at each gear is plotted on

(20)

0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 800 1000 1200 1400 1600 1800 2000 Speed (MHz) SPEC FP time Energy (a) All 0.8 0.9 1 1.1 1.2 1.3 1.4 800 1000 1200 1400 1600 1800 2000 Speed (MHz) SPEC FP (top third)

time Energy

(b) Top third

Figure 3.3: Aggregate plots of SPEC FP set.

0 10000 20000 30000 40000 50000 60000

0 100 200 300 400 500 600 700 800

Energy (mWh) Time (secs) cg.B (a) cg 0 5000 10000 15000 20000 25000

0 50 100 150 200 250 300 350

Energy (mWh) Time (secs) mcf (b) mcf 0 10000 20000 30000 40000 50000 60000 70000

0 100 200 300 400 500 600 700 800 900 1000

Energy (mWh)

Time (secs) art

(c) art

Figure 3.4: Best energy-time tradeoff in each set.

more energy, and the further right of two points takes more time. Therefore, a near-vertical slope

indicates an energy saving with little time delay between adjacent gears, whereas a horizontal slope

indicates a time penalty and no energy savings.

The programs shown in Figure 3.4 have the best energy-time tradeoff in each sets: NAS

(cg), SPEC INT (mcf ), and SPEC FP (art). In these “vertical” applications, the execution time

advantage of the fastest gear is small. However, the energy penalty for this ultimate performance is

large. Consider for example the cg benchmark, in Figure 3.4(a). Using the third gear (1600 MHz)

yields about a 1% increase in execution time compared to the fastest, while the corresponding

decrease in energy consumption is nearly 20%.

Next, we examine how a vertical energy-time shape occurs. Our results show that

(21)

0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000

0 100 200 300 400 500 600 700 800

Energy (mWh) Time (secs) ep.B (a) ep 0 5000 10000 15000 20000 25000

0 50 100 150 200 250 300 350 400

Energy (mWh) Time (secs) perlbmk (b) perlbmk 0 5000 10000 15000 20000 25000 30000

0 100 200 300 400 500 600

Energy (mWh)

Time (secs) sixtrack

(c) sixtrack

Figure 3.5: Worst energy-time tradeoff in each set.

benchmark β OPM log(OPM)

cg 0.04 8.46 0.927

mcf 0.19 8.6 0.934

art 0.21 10.8 1.03

perlbmk 0.92 368 2.57

ep 0.98 847 2.93

sixtrack 0.99 2808.2 3.45

Table 3.2: Predicting CPU boundness.

ever, the number of cycles that an execution takes can change, especially in the vertical applications.

For example, consider the mcf application at the two fastest gears (2000 and 1800 MHz), in which

the performance gain is less than 1%. Using the slower gear with a clock rate that is 90% of the

high-est, the execution has 90% as many cycles (approximately 5.0 to 4.5 trillion). Because the number

of micro-operations does not change, the performance, in micro-operations per cycle (UPC),

in-creases as the frequency dein-creases. The additional cycles in the faster gear do not perform useful

work. This indicates that the CPU is not the performance bottleneck. Below we examine this and,

not surprisingly, determine that memory is the bottleneck.

On the other hand, Figure 3.5 shows the programs that do not exhibit an energy penalty

for the ultimate performance. Instead, in these programs, the fastest gear results in nearly the lowest

energy consumed. We call these “horizontal” programs.

To further quantify the degree of CPU (or memory) boundness, we use an metricβ

intro-duced in [29].β is defined as a ratio between 0 and 1:

β = (T /Tmax−1)

(22)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1

1 10 100 1000 10000

beta

opm OPM vs. beta

Figure 3.6: OPM vs.betafor all applications on AMD-64.

whereT andTmax are execution times at a lower frequency and maximum frequency (2000MHz

in this case), respectively, andf andfmax are the corresponding frequencies. Theoretically, if a

program is completely CPU bound, thenT would increase by the amount of increase in cycle time,

T = fmax

f ·Tmax, in which caseβ = 1. On the other hand, if a program is completely independent

of the CPU thenT =Tmaxandβ = 0.

In our tests, the valueβdoes not change significantly as the frequency changes. We found

that on average, for any application, the largest difference in β between running the application

using any two frequencies is around 5%. Column 2 of Table 3.2 showsβ of the six application we

examined, measured between 2000 and 1800 MHz. It is clear that the three “vertical” benchmarks

all have smallβvalues, indicating they are not CPU bound. On the other hand, the three “horizontal”

benchmarks all haveβ close to 1, indicating they are CPU bound. However, one problem withβ

is, while it accurately measures the CPU boundness of an application, we must run the application

twice using different gears to get this measurement. Therefore, we would like to use a metric that

can fairly accurately predictβof an application without such overhead.

The programs we consider in this section do not perform much I/O. Therefore, the

(23)

the memory pressure of an application or phase. Similar toβ, OPM stays constant as the frequency

changes. We found that for all the applications we tested, the largest difference in OPM between

running an application using two different gears is about 1%. A large value of OPM means the

program is CPU bound because on average, the program executes more instructions between cache

misses. Table 3.2 shows OPM andβ of the six program we examined. Similar to the trend ofβ, the

three vertical programs, cg, mcf and art, all have small OPM, indicating they are memory bound.

On the other hand, the three horizontal programs all have large values of OPM. One thing to notice

is that in both columns, the values of OPM andβ are sorted. Figure 3.6 shows scatter plot of OPM

vs. βfor all the applications we tested. OPM is plotted on a log scale in this figure because unlike

β, which has a maximum value of 1, the values of OPM are not bounded. In order to better

under-stand the correlation between the two metrics, we uselog(OP M)to relate toβ. We found that the

correlation coefficient is 0.84, indicating a close correlation between OPM andβ.

The advantage of OPM is that unlikeβ, this measurement can be obtained at runtime with

negligible overhead using any gear. OPM is also used to define and sort phases in a single program

in our multi-gear tests, which is described in detail in Chapter 4.

3.2 Multiple-Node Results

The previous section investigated the energy-time tradeoff on a single node. This section

studies the effect of distributed programs. Figure 3.7 shows results from six NAS programs. Each

graph has the same general layout as in Figure 3.4 and Figure 3.5, except that it shows the results

from multiple experiments: 2, 4, and 8 nodes (or 4 and 9 nodes in the case of BT and SP). It also

plots the one-node results from the previous section, but in most cases the data are to the right of

the window of time shown. The energy plotted is cumulative energy of all nodes used.

Before discussing the results, we describe the possible layouts of these graphs. First,

for a fixed number of nodes, the shape of the curve depends on the memory and communication

bottlenecks. This is because in a distributed program, not only might a processor wait for the

memory subsystem, but at times it also might block awaiting a message. In either scenario, the CPU

is not on the critical path, and it is more efficient to execute in a lower energy gear.

Second, consider the possible effects when comparing an experiment with2P nodes

ver-sus one withPnodes. The following possibilities exist. Note that we do not consider the case where

(24)

0 20 40 60 80 100 120 140 160

0 100 200 300 400 500 600

Energy Consumption (KJ)

Execution time (s) bt

1 node

4 nodes

9 nodes

(a) BT B

0 20 40 60 80 100 120

0 50 100 150 200 250 300 350 400

Execution time (s) cg

1 node

2 nodes

4 nodes 8 nodes

(b) CG B

0 5 10 15 20 25 30 35 40 45 50

0 50 100 150 200 250 300 350 400

Execution time (s) ep

1 node

2 nodes

4 nodes 8 nodes

(c) EP B

0 10 20 30 40 50 60 70 80

0 100 200 300 400 500

Execution time (s) lu

1 node

2 nodes

4 nodes 8 nodes

(d) LU B

0 50 100 150 200 250

0 200 400 600 800 1000 1200

Execution time (s) mg

1 node

2 nodes

4 nodes 8 nodes

(e) MG B

0 50 100 150 200 250 300

0 100 200 300 400 500 600 700 800

Execution time (s) sp

1 node

4 nodes

9 nodes

(f) SP B

(25)

1. The curve for 2P nodes can lie completely above and to the left of the curve forP nodes

(more energy, less time). Each point on the2Pnode curve lies above all points on thePnode

curve. This case occurs when the program achieves poor speedup on2Pnodes compared to

P nodes.

2. The point that represents the fastest energy gear for2Pnodes can be to the left, at or below, the

corresponding point on the curve forP nodes. This case occurs when the program achieves

perfect or superlinear speedup on2P nodes compared toP nodes.

3. The curve for2P nodes can lie to the left of the curve forP nodes, but not completely above

or below the fastest gear point forP. This is the most interesting case. While the program

executes faster and consumes more energy in the fastest gear on2P nodes than onP nodes,

there is a lower gear at 2P nodes that has less energy consumption than the fastest gear

point atP nodes. Therefore, it is possible to achieve better execution time and lower energy

consumption by running in a slower gear on2Pnodes than in a faster gear onP nodes. There

is not an energy-time tradeoff between these points because one point dominates the other in

both energy and time. This case occurs case when speedup is good (i.e., not superlinear and

not poor) and there are a significant number of main memory accesses (so that scaling down

the processor has only a slightly detrimental effect).

We describe each of the cases in turn below.

Case 1: Poor Speedup

Figure 3.7 offers several examples of case 1. In particular, this case is illustrated in BT,

SP, and MG from 2 to 4 nodes, and CG from 4 to 8 nodes.

We believe in the future a given supercomputer cluster will be restricted to a certain

amount of power consumption or heat dissipation. If there is a limit for energy/power

consump-tion or heat dissipaconsump-tion, this would be represented as a horizontal line. For programs in this case,

the line will intersect at most one of the curves. The most desirable point would be the leftmost

(26)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

0 100 200 300 400 500 600 700 800

Energy consumption (kJ)

Execution time (s)

1 node 2 nodes

4 nodes

6 nodes 8 nodes

10 nodes

Figure 3.8: Energy consumption vs. execution time for Jacobi iteration on 2, 4, 6, 8, and 10 nodes.

Case 2: Superlinear Speedup

Figure 3.7 does not contain an example of superlinear speedup. However, EP, which

gets almost perfect speedup, illustrates this case. Power consumption doubles when the number

of nodes doubles. Because the time is cut in half, the total energy consumed is the same. With

superlinear speedup the energy consumption decreases as nodes are added. When speedup is perfect

or superlinear there is no energy-time tradeoff.

Case 3: Good Speedup

Figure 3.7 shows several examples of this case. First, consider LU at 4 and 8 nodes. Gear

4 on 8 nodes uses approximately the same energy as the fastest gear on 4 nodes, but executes 50%

more quickly. The fastest gear on 8 nodes executes 72% faster than on 4 nodes, but uses 12%

more energy. This case illustrates an additional choice not available in a conventional cluster, which

only supports either the fastest gear option (4 or 8 nodes). So a user must trade off a performance

increase against an energy increase. With a power-scalable cluster, the user can select a slower gear

on 8 nodes, which may offer better performance for the same energy consumption. Thus, a user

(27)

0.9 1 1.1 1.2 1.3 1.4 1.5 1.6

0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6

Relative energy consumption

Relative execution time

1 node

2 nodes

4 nodes

8 nodes

(a) EP B

0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3

0.8 1 1.2 1.4 1.6 1.8 2 2.2

1 node

4 nodes

9 nodes

(b) BT B

0.9 0.95 1 1.05 1.1 1.15 1.2

0.8 1 1.2 1.4 1.6 1.8 2

1 node

2 nodes

4 nodes

8 nodes

(c) MG B

Figure 3.9: Relative energy vs. time.

performance gear. In case 3, the user may be able to get better performance by using more nodes,

with each node executing at a lower energy gear.

Next, Figure 3.8 plots data for a (hand-written) Jacobi iteration application. This

applica-tion is shown because it can run at any number of nodes, unlike the NAS benchmarks. The figure

shows energy-time curves on 5 configurations: 2, 4, 6, 8, and 10 nodes. Because this application

gets good speedup (1.9, 3.6, 5.0, 6.4, and 7.7) each adjacent pair of curves falls in case 3. For

example, executing in second or third gear on 6 nodes results in the program finishing faster and

using less energy than using first gear on 4 nodes.

So far we have discussed the results in absolute terms. It is also useful to look at the

relative shapes of the energy-time curves. Figure 3.9 shows the curves for three of the NAS programs

relative to the fastest gear. Therefore, the fastest gear is always plotted at(1,1). These plots are

presented because they represent the three distinct features that have been observed. Figure 3.9(a)

shows EP, in which the shapes essentially overlay each other. This is because EP has almost no

communication. Consequently, the program executes nearly identically on any number of nodes.

The next feature is illustrated by BT in Figure 3.9(b). BT has a large communication

component. While an increase in CPU cycle time increases computation time, it has little effect

on the communication time. Therefore, the increase in execution time due to frequency reduction

is less significant as the computation to communication ratio decreases. This effect is shown in

Figure 3.9(b), as the curve compresses horizontally as the number of nodes increases. For example,

the rightmost (slowest) point decreases from 2.15 to 1.76 to 1.52.

The last feature is illustrated by MG in Figure 3.9(c). This example shows how the curve

becomes more vertical as parallelism increases, indicating that the time penalty decreases. Both

(28)

0 10000 20000 30000 40000 50000 60000 70000

0 50 100 150 200 250 300 350 400 450

Execution time (s)

2 nodes 4 nodes

8 nodes

Figure 3.10: Synthetic benchmark with high memory pressure.

munication. The important factor is the relative amounts of computation (executing) and

communi-cation (blocked) time. The gear setting only affects the active portion. Because BT has significant

idle/blocked time (33% and 50% on 4 and 9 nodes, respectively), the gear setting has a small effect

on BT. However, MG has on a much smaller amount of idle time (10%, 13%, and 24% on 2, 4, and

8 nodes, respectively); therefore, the gear setting makes a more pronounced difference.

Finally, we present results from a synthetic benchmark. This benchmark models CG in

terms of its cache miss rate, but achieves good speedup (over 7 on 8 nodes). The purpose of this

benchmark is to show the potential of a power-scalable cluster. Figure 3.10 shows the results.

Because the miss rate is high (7%), the execution time penalty for scaling down is low (e.g., 3%

at 1200MHz), and the corresponding energy savings is large (24%). Furthermore, compared to top

(29)

Chapter 4

Multiple-Gear Results

This chapter describes the results of using our framework to executing a single program

in multiple gears. (A majority of the results are based on our previous study in [21] and [20]) We

studied seven benchmarks from NAS parallel benchmark suite. All benchmarks are tested using

either 8 or 9 nodes configurations. We divide this chapter into two parts. First, we describe how we

use profiling technique to determine an effective gear for each phase. Then, we analyze the results.

4.1 Profile-Directed Technique

This section describes our profiling technique for determining an effective gear for each

phase. First, it describes detecting and prioritizing phases. Next, it discusses the mechanism that

col-lects performance data which determine the energy-time tradeoff. Finally, it discusses our method

for choosing an assignment of gear to each phase.

Phase Detection This thesis, uses a straightforward programming model, which primarily applies

to iterative and predictable HPC applications. Specifically, it starts by obtaining a trace of the

application in question (run at the fastest gear—highest frequency-voltage setting). From there, it

divides a program into blocks. The term block is motivated by the common compiler term basic

block. In a basic block, all statements must be executed. In a similar way, in a block, all statements

(30)

0 20 40 60 80 100 120 140 160

0 200 400 600 800 1000 1200 1400 1600 1800 2000

OPM (operations/miss)

Time (5 ticks/s)

(a) Full trace

0 20 40 60 80 100 120 140 160

200 205 210 215 220 225 230

OPM (operations/miss)

Time (5 ticks/s)

Phase 3 Phase 1 Phase 2 Phase 3 Phase 1 Phase 2 Phase 3

(b) Window from 40–46 seconds.

Figure 4.1: Trace of operations per miss for LU C.

The actual division into blocks is done by examining the trace and using an ad hoc

ap-proach that conforms to the following principles. First, any MPI operation demarcates a block

boundary. Second, a boundary is inserted after any first level inner loops, following the notion that

ends of loop nests are where program characteristics generally change [32]. We do not divide first

level loops any further mainly because it makes block sizes too small to make any measurements,

such as memory pressure. We describe our approach assuming that there is an outermost loop that

consists of a linear list of blocks (i.e., no control flow). While either restriction can be relaxed, doing

so unnecessarily complicates this discussion. Moreover, HPC codes often have one outermost loop,

and MPI calls are generally unconditionally invoked.

From there, we merge blocks into phases. Two adjacent blocks are merged into a phase if

their corresponding memory pressure is within the same threshold used in rule two above. We note

that this is a simple algorithm and may not find all possible phases; our focus is not on developing

new phase detection techniques. Our plan in the future is to use the significant work on phase

detection that has previously been done—both by compiler [32, 50, 41] and dynamically [30, 11].

As described in Chapter 3, we introduce the metric operations per miss (OPM) as a

mea-sure of the memory presmea-sure of an application or phase. We have found OPM to be effective in

determining phase boundaries. For example, Figure 4.1 gives an example of how OPM varies in the

LU C benchmark on one of 8 nodes. The right-hand figure shows a window of 6 seconds. Here,

OPM clearly partitions the code into three distinct phases. These phases were determined by hand,

(31)

130 135 140 145 150 155 160 165

900 1100 1300 1500 1700 0.95 1 1.05 1.1 1.15 1.2 1.25 1 1.2 1.4 1.6 1.8 2

Execution time (s)

(a) BT B

58 60 62 64 66 68 70 72

500 550 600 650 700 0.825 0.85 0.875 0.9 0.925 0.95 0.975 1 1 1.05 1.1 1.15 1.2 1.25 1.3 1.35 1.4 1.45

Execution time (s)

(b) CG B

45 50 55 60 65

350 400 450 500 550 600 650 700 750 1 1.1 1.2 1.3 1.4 1.5 1 1.25 1.5 1.75 2 2.25 2.5

Execution time (s)

(c) EP B

Figure 4.2: Energy consumption vs. execution time for three NAS benchmarks on a single AMD machine.

Phases Prioritization Our method for assigning gears to phases requires ordering the phases such

that the most likely phases to benefit from running at a lower gear are identified. This allows our

algorithm to run in linear time, yet still find the desired solution. Thus, our approach requires

distinguishing a phase that has a good energy-time tradeoff from one that does not. That is, we

must estimate the effect executing a phase in a slower gear will have on the energy consumption

and execution time of the block. The key here again is OPM (introduced above), which estimates

the phases that have a good energy-time tradeoff.

Figure 4.2 shows the results of executing 3 NAS programs on a single Athlon-64

pro-cessor. EP is CPU bound and therefore has a large time penalty and little or no energy savings at

reduced gears. On the other hand, CG is memory bound, so the time penalty is almost nothing. BT

is a more typical application that sits between the extremes of EP and CG.

Data Collection The first step gathers profile data during an execution of the program. This

implementation uses our MPI-jack tool, which is an interface that exploits PMPI [49], the profiling

layer of MPI. MPI-jack enables a user transparently to intercept (hijack) any MPI call. A user can

execute arbitrary code before and/or after an intercepted call. These are called pre and post hooks.

In this work, we use MPI-jack to shift gears in post hooks.

The first step involves sampling an application, producing profile or trace data. An

appli-cation is sampled at every phase boundary. This is done entirely through MPI-jack; in the case of

MPI calls, it is trivial to add the sampling code. When it is necessary to insert a phase boundary at

the end of a loop nest (i.e., not at the end of an MPI call), we simply insert a pseudo MPI call; this

(32)

/*Gis ann-dimensional vector of gear selections, one for each phase.*/ setGk= 0,∀k|0≤k < n /*The top gear is 0; slowest gear isn−1*/

givenT /*Tuple (energy,time) for currentG(initially the baseline)*/

given≺ /*A user-defined relationship that defines total order ofT */

/*Invoke the function to start method that produces final solution,Gf */

Gf ←evaluate(program, G,0, n, T)

defineevaluate(program, G, i, n, T)

ifi≥norGi≥gslowestthen returnGfi Gi←Gi+ 1

execute program using solutionG

measure energy and time forG, store tuple inT0

ifT0 ≺T then /*T0is not better thanT */

G_i←G_i−1

G= evaluate(program, G, i+ 1, n, T)

else /*T0is better thanT*/

G= evaluate(program, G, i, n, T0₎

fi

returnG

end

Figure 4.3: Heuristic for searching solution space.

The information we collect includes the type of call and location (program counter). It

shows status (gear, time, etc.) and metrics (µops and L2 cache misses, so OPM can be computed).

Note that our inclusion of the program counter bears some resemblance to existing program counter

based techniques [22], though that work is aimed solely at saving energy in the disk.

Energy-Time Tradeoff Our analysis (see below) requires energy consumption data; however, we

have found that energy cannot be measured accurately if the period of measurement is too fine

grain.1 For this thesis, as the benchmark programs complete relatively quickly, we measure energy

at the coarsest grain possible—the complete run of a program. (For longer running programs we

can terminate a test run after a fixed number of iterations or phases.)

Determining which of two solutions is “better” depends on how a user wants to trade off

energy savings and time delay. This work does not impose an evaluation. Rather, it leaves it to the

1

(33)

user or cluster administrator to select the metric. It could be energy-delay or energy-delay squared,

the latter proposed in [38, 23] has been adopted by Cameron et al. [5] for use in power-aware,

high-performance computing.

Methodology Given a program partitioned into phases, we proceed to our method for determining

an effective assignment of gears to phases. If there arenphases and ggears, than the number of

possible solutions (phase-gear assignments) for the program is gn. In general, this is too large

to explore by brute force. Therefore, the second part of our method is a heuristic (described in

Figure 4.3) that we use to find the “best” solution. The heuristic finds the “best” gear in a phase,

then moves on to the next phase. Once it moves on to another phase, the gear for the preceding

phase has been determined.

Initially, the solutionG(a vector of gears) is set to the baseline value—all zeroes. (Gear 0

is the baseline gear, which is also the fastest. The gear number indicates the number of steps slower

than the baseline) The recursive function is invoked on the 0th phase. It executes the program

using the next slower gear in this phase (all other phases are as before). If the energy-time tradeoff

(defined by a user relation, see below) of this new solution is better than the current solution, it is

accepted. The algorithm recursively tries the next lower gear on this phase. The gear is determined

when the new tradeoff is worse than the current or when there are no slower gears. After fixing the

gear, it moves on to the next phase. This heuristic has running time at mostn·g.

After each program execution, the energy and time are measured and compared via the

user-defined relationship. For our tests, we use a simple and intuitive evaluation of the tradeoff based

on the slope of the line between two solutions. The slope is defined as the ratio of energy savings

to time delay: E_Ti−_i−E_T_jj, for two solutions iandj. Thus a slope of -1 (i.e., 45◦ below the horizon)

means savings and delay are equally weighted. A user-defined limit of 0 means minimize energy,

and−∞means minimize time. We consider a new solution with a larger slope (in magnitude) than

the user-defined limit to be better. We do not advocate this metric instead of the others. We use it

because it is reasonable and it is easy to visualize.

Now we show an example of how our methodology works by studying BT class C in

depth. In the graph shown in Figure 4.4, the baseline is the leftmost (fastest) point, where the top

gear, 2000MHz, is used for the entire program. (As before, the higher of two points uses more

energy, and the further right of two points takes more time.) For all other points, at least one phase

is executed in a lower gear. Each point is labeled as a tuple, where theithentry represents the gear

(34)

57000 58000 59000 60000 61000 62000 63000 64000 65000

480 490 500 510 520 530 540 550 560 570

Energy (J)

Time (s) 00

11 01

02

03

04

12 ₁₃

22 23 14

10

hull

-5/+5

Figure 4.4: Energy-time plot of many BT runs.

have a different number of phases.) For convenience, we refer to a tuple of phase-gear assignments

as a solution.

Our analysis of the operations per miss (OPM) identified two phases in BT. The baseline

solution is labeled “00,” meaning both phases are run in gear 0. In the figure we have drawn a line

connecting 5 points that are “good” choices under a simple slope-based energy-time metric. The

line forms a convex hull that is left of and below all other solutions. Any solution “inside” the hull

is not a “good” choice according to this metric, although there may exist solutions inside the hull

that are good under a different metric.

Recall that the goal of our profiling algorithm is to do “better” than the baseline. As our

chosen metric is the slope between two solutions,L, our algorithm will select a unique point on the

hull. This is illustrated in Table 4.1, which shows five different cases regarding BT, which select the

five solutions on the convex hull.

In the first case (Table 4.1(a)), L is steep (large negative number). Such a value favors

time delay over energy savings. In this case and all others, the algorithm adjusts gears from right

to left across solutions or tuples. Thus, the first test is solution01. The steps of the algorithm are

(35)

Solutions Slope < L?

1 00→01 -11.7 false

2 00→10 -1.39 false

00is best

(a) Case 1:−12> L.

1 00→01 -11.7 true

2 01→02 -1.78 false

3 01→11 -1.01 false

01is best

(b) Case 2:−1.78> L >−12.

1 00→01 -11.7 true

2 01→02 -1.78 true

3 02→03 -1.19 false

4 02→12 -1.44 false

02is best

(c) Case 3:−1.44>L>−1.78.

1 00→01 -11.7 true

2 01→02 -1.78 true

3 02→03 -1.19 false

4 02→12 -1.44 true

5 12→22 -0.25 false

12is best

(d) Case 4:0> L >−1.44.

1 00→01 -11.7 true

2 01→02 -1.78 true

3 02→03 -1.19 true

4 03→04 0.17 false

5 03→13 -1.20 true

6 13→23 0.97 false

13is best

(e) Case 5:L= 0.

(36)

indicates the direction of the slope. Of these two solutions, the energy and time of the one on the

left is known, so only one run of the program is necessary to complete this step. The third column

shows the slope, and the fourth column indicates whether is it less than the user-defined limit. In the

first step, the slope from00to01is greater thanL(not as steep); therefore, it is rejected. We back

up to the previous solution (00) and try the next phase to the left (which in this case happens to be

the leftmost phase). In step 2, solution10is rejected. Thus, the algorithm selects00, the baseline,

whenLis sufficiently steep.

Table 4.1(b) shows the steps in the second case. In this case, 01 is accepted but02 is

rejected. The algorithm fixes the second phase at gear 1 and then determines the gear for the first

phase, ultimately selecting 01. The next case (Table 4.1(c)) is similar; the slope limit is slightly

larger (less steep), which results in acceptance of02, followed by the rejection of03and12.

Table 4.1(d) shows the steps in case 4. In this case, the second phase is fixed at gear 2.

The solution for12is considered better than02, so the algorithm accepts it and tries22, which is

rejected.

The last case is shown in Table 4.1(e). This case temporarily selects solution03, which

is not on the convex hull, before eventually selecting13. The solution03is interesting, because

the algorithm and evaluation metric we describe here will never select it. This is because our slope

metric only picks solutions that are on the convex hull. However, it is a reasonable solution in that

there does not exist a solution that is better than it in both time and energy (i.e., “dominates” it). So,

it easily can be argued that03is better than02. For BT, solutions03and11fall into this category

and therefore may be selected by a different evaluation metric. However, the other solutions (04,

10,14,22, and23) will never be selected by any metric, because each is dominated in both energy

and time by at least one other solution.

4.2 Analysis of Results

Figure 4.5 shows results from the other six NAS programs that we executed. Because

each graph has different x and y ranges, we plot a normalizing line that shows a 5% decrease in

energy and a 5% increase in time. It is important to consider both the direction and magnitude of

this line. Overall, the behavior of the applications varies widely. Each program falls into one three

groups based on its benefit from using a reduced gear. In each figure, we show the convex hull along

(37)

39500 40000 40500 41000 41500 42000 42500 43000 43500 44000 44500 45000

390 400 410 420 430 440 450 460 470 480 490 500

Energy (J) Time (s) 0 1 2 3 4 5 hull -5/+5

(a) CG C

41000 41500 42000 42500 43000 43500 44000 44500 45000 45500

230 240 250 260 270 280 290 300

Energy (J) Time (s) 0 1 2 hull -5/+5

(b) EP C

11000 11500 12000 12500 13000 13500

129 130 131 132 133 134 135 136

Energy (J) Time (s) 00 11 22 01 02 13 12 03 04

05 15

hull -5/+5

(c) IS C

58000 59000 60000 61000 62000 63000 64000

440 450 460 470 480 490 500 510

Energy (J) Time (s) 000 001 011 002 021 111 031 121 112 ₂₂₁

321 222 211 223 113 020 012 311 hull -5/+5

(d) LU C

32500 33000 33500 34000 34500 35000 35500 36000 36500 37000 37500

310 320 330 340 350 360 370 380 390 400 410

Energy (J)

Time (s) 00

11 01

02 03 04 05 12 32 42 14 33 44 hull -5/+5

(e) MG C

78000 80000 82000 84000 86000 88000 90000 92000 94000 96000 98000 100000

740 760 780 800 820 840 860 880 900

Energy (J) Time (s) 00 11 01 02 03 04

22 13 23 33 43 10 hull -5/+5

(f) SP C

(38)

Multiple Gear Benefit Applications in this group show significant benefit from a multiple-gear

solution. This is the case whenever severalij solutions, wherei6=j, fall on the convex hull. Four

of the seven NAS programs have this characteristic: MG, BT, LU, and IS. For example, in

multiple-gear solution 32, MG saves 11% energy with a 4% time penalty over the baseline. On the other

hand, the single-gear solution33 saves 10% energy with a 7% time delay. BT using solution12

saves 10% energy with a 5% time penalty over the baseline. This compares favorably to single gear

solutions11and22, which yield energy-time tradeoffs of -6%/4% and -8%/17%, respectively.

In the case of LU, there are three phases, as shown above in Figure 4.1. This program is

the only one for which we identified more than two phases. This plot has more points than any other

plot, yet 70 more points are needed to exhaustively search just the top three gears (0 thru 2).

IS is an extreme case, where the first phase is CPU bound and the second phase is both

memory and disk bound. Therefore, a single-gear solution is bound to be poor, as it is necessarily

a compromise solution. Regardless of the desire of the user, single-gear solutions (other than the

baseline) are dominated by points on the hull. The 05solution in IS saves 16% energy over the

baseline (00) at a cost of a 1% time increase. Compared instead to 22, this solution saves 9%

energy and executes 1% faster.

Single Gear Benefit Applications in this group show significant benefit from using a single lower

gear, but no significant benefit from multiple-gear solutions. CG and SP fall into this category. In

CG, only one phase was detected. The application is highly memory bound: its is OPM an order of

magnitude smaller than any other. Moreover, all blocks in the loop have comparable OPM. In SP,

while we found two phases, there is almost no benefit to running phases in different gear. Although

23is on the convex hull, it barely makes it.

Besides the reason given above, there is a second reason that a single gear benefit is all

that can be achieved. Conceptually, an application can fall into the single gear category because

the phases are too fine-grain (possibly because there are too many) and the cost of switching gears

outweighs the benefit. However, this was not the case in any of the NAS programs.

No Benefit Only EP falls into this category. Simply put, EP has only one phase, no communication

(except once at the end of the program), and is CPU bound. Therefore, a reduction in the gear results

Exploring Energy-Time Tradeoff in High Performance Computing

Contents

List of Figures

List of Tables

Chapter 1

Introduction

1.1

The Case for Power-Aware, High-Performance Computing

Saving Power and Energy with gear Selection

1.2

Contributions

Chapter 2

Related Work

2.1

Server/Desktop Systems

2.2

Mobile Systems

Chapter 3

Single-Gear Results

3.1

Single-Node Results

3.2

Multiple-Node Results

Chapter 4

Multiple-Gear Results

4.1

Profile-Directed Technique

4.2

Analysis of Results