FAST, ACCURATE, AND VALIDATED FULL-SYSTEM SOFTWARE SIMULATION OF x86 HARDWARE

This article presents a fast and accurate interval-based CPU timing model that is easily implemented and integrated in the COTSon full-system simulation infrastructure. Validation against real x86 hardware demonstrates the timing model's accuracy. The end result is a software simulator that faithfully simulates x86 hardware at a speed in the tens of MIPS range.

Architectural simulation is a challenging problem in contemporary computer architecture research and development. Contemporary processors integrate billions of transistors on a single chip, implement multiple cores along with on-chip peripherals, and are complex pieces of engineering. In addition, modern software stacks are increasingly complex, and include commercial operating systems and virtual machines with an entire application stack. These workloads differ from those traditionally considered in computer architecture research (for example, SPEC CPU). Ideally, a computer architect wants to simulate an entire system with high accuracy in a reasonable amount of time while running complete and unmodified software stacks.

However, the common practice of detailed cycle-accurate processor simulation is becoming infeasible because it is too slow. Moreover, many practical studies might not require cycle-accurate simulation. Many design trade-offs must be made at the system level, for which the slow speed and high level of detail of cycle-accurate simulation only get in the way. (See also the "Related Work in Architectural Simulation" sidebar.)

We therefore propose CPU timing simulation at a higher level of abstraction, and we present an approach using an analytical model called interval analysis.1,2 The model analyzes a program's miss events as well as its dependence structure to estimate CPU performance. We implement and integrate this interval-based CPU timing model in the COTSon full-system simulation infrastructure.3 We validate the timing model against real hardware using a set of microbenchmarks, (multithreaded) CPU-intensive benchmarks, and a server workload. The end result is a validated simulation approach that is both accurate and fast, is relatively easy to implement, and can run full-system x86 workloads, including commercial operating systems and entire software stacks, and system devices such as network cards and disks in an affordable amount of time.

Frederick Ryckbosch

Stijn Polfliet

Lieven Eeckhout

Ghent University


COTSon

Before presenting the higher-abstraction timing model, we first describe the COTSon framework in which we integrated the timing model. COTSon is an open source simulator framework developed by HP Labs that aims to provide a fast evaluation vehicle for current and future computing systems.3 It covers entire software stacks as well as hardware modules, including processors and system devices such as network cards and disks. COTSon targets cluster-level systems consisting of multiple multicore processor nodes interconnected through a network; that is, it targets both scale-up (multicore and many-core processor simulation) and scale-out (simulation of a multinode cluster).

Figure 1 shows the organization of the COTSon simulator. COTSon uses the AMD SimNow full-system simulator to functionally simulate each node in the cluster. AMD's SimNow can simulate x86 and x86_64 processors, and uses dynamic compilation and code-caching techniques to speed up simulation. SimNow is about 10 times slower than native hardware execution, and can boot a system with an unmodified operating system and execute any complex application. Each COTSon node further consists of timing models for the disks, network interface card, and CPU (that is, processor and memory). The various COTSon nodes are interconnected through a network mediator.

The timing models in each COTSon node communicate with the functional simulator through event queues. These event queues are either synchronous (for communicating with the disk and network timing models) or asynchronous (for communicating with the CPU timing model). Synchronous event queues need an immediate response from the timing model upon a request from the functional simulator. Asynchronous event queues, on the other hand, decouple the generation of events by the functional simulator and their processing by the timing models. Asynchronous event queues implement a unique timing feedback mechanism, which periodically adjusts the functional simulator's speed to reflect the timing models' timing estimates. This functional-directed simulation approximates timing behavior more accurately than purely trace-driven or functional-first simulation (while being faster than timing-directed execution-driven simulation). Timing feedback lets the simulator better approximate time-dependent behavior (such as synchronization, operating system scheduling, and networking), which is important for real-life workloads in terms of load balancing, quality of service, and so on.

Related Work in Architectural Simulation

Mauer et al. present a useful taxonomy for execution-driven simulation.1 Functional-first simulation lets a functional simulator feed a trace of instructions into a timing simulator, which can lead to loss in accuracy along mispredicted paths and when simulating multithreaded workloads. In timing-directed simulation, functional simulation is driven by the timing simulator; that is, the timing simulator directs the functional simulator when to change architecture state. Timing-first simulation lets the timing simulator run ahead with the functional simulator as a checker. COTSon implements a functional-directed simulation paradigm: the functional simulator can run ahead of the timing simulator; however, the timing simulator periodically adjusts the functional simulator's speed. Functional-directed simulation can be viewed as middle ground between functional-first and timing-directed simulation.2

Various research groups focus on field-programmable gate array (FPGA) accelerated simulation.3 An FPGA-accelerated simulator exploits fine-grained parallelism and achieves simulation speeds on the order of tens of MIPS. However, FPGA acceleration can increase simulator development time because it requires modeling the target architecture in a hardware description language such as Verilog, VHDL, or Bluespec. The software simulation approach presented in this article falls within the same speed range, but is much easier to develop, requiring only four engineer months to implement and validate the CPU timing model within the COTSon infrastructure.

Simulator validation is a nontrivial and tedious endeavor. Desikan et al. validated the detailed cycle-level sim-alpha simulator against the Alpha 21264 processor.4 They improved the simulator to be within 2 percent compared to the real hardware for a set of microbenchmarks. However, when running real SPEC CPU benchmarks, the average error was around 20 percent. Our interval-based CPU timer is a simulation model at a much higher level of abstraction than sim-alpha, yet it is equally accurate for CPU-intensive workloads.

References

1. C.J. Mauer, M.D. Hill, and D.A. Wood, "Full-system Timing-first Simulation," Proc. ACM SIGMetrics Conf. Measurement and Modeling of Computer Systems, ACM Press, 2002, pp. 108-116.

2. E. Argollo et al., "COTSon: Infrastructure for Full System Simulation," SIGOPS Operating System Rev., vol. 43, no. 1, Jan. 2009, pp. 52-61.

3. J. Wawrzynek et al., "RAMP: Research Accelerator for Multiple Processors," IEEE Micro, vol. 27, no. 2, Mar. 2007, pp. 46-57.

4. R. Desikan, D. Burger, and S.W. Keckler, "Measuring Experimental Error in Microprocessor Simulation," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 01), ACM Press, 2001, pp. 266-277.

COTSon simulates multicore processors by serializing the functional simulation of the various cores. Each core can run for some fixed amount of time in the functional simulator, and when all cores have reached the same point in time (the simulation window), COTSon sends the various instruction streams to the timing models. Hence, the functional simulator determines which thread acquires the lock for entering a critical section. The timing models then determine the progress for each core, and the cores in turn adjust the functional simulator's speed through timing feedback. For example, if the timing model determines that a core achieves an instruction throughput that is twice as high as that achieved by another core, the functional simulator will simulate twice as many instructions for that one core as for the other core in the next simulation window. The feedback mechanism aims at limiting the functional simulator's divergence with respect to the timing simulator.
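The proportional feedback in the example above can be illustrated with a minimal Python sketch. It is not COTSon's actual interface: the function name `next_window_budgets`, the per-core throughput dictionary, and the `base_budget` parameter are hypothetical, and the only rule assumed is that each core's instruction budget for the next window scales with its estimated throughput relative to the slowest core.

```python
def next_window_budgets(ipc_by_core, base_budget):
    """Return a per-core instruction budget for the next simulation window.

    ipc_by_core: estimated instruction throughput per core, as reported by
                 a (hypothetical) timing model for the previous window.
    base_budget: instructions granted to the slowest core (an assumption).
    """
    slowest = min(ipc_by_core.values())
    if slowest <= 0:
        raise ValueError("throughput estimates must be positive")
    # A core that is twice as fast is allowed twice as many instructions.
    return {core: round(base_budget * ipc / slowest)
            for core, ipc in ipc_by_core.items()}


budgets = next_window_budgets({"core0": 1.0, "core1": 0.5}, base_budget=10_000)
```

Here `core0`, with twice the estimated throughput of `core1`, receives twice the instruction budget in the next window.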

The open source version of COTSon comes with two CPU timing models, timer0 and timer1, for an in-order and out-of-order processor, respectively. These CPU timing models are fairly simple, and are primarily designed for tutorial purposes and not to provide realistic levels of accuracy. In particular, timer1 operates as follows. It stalls the front-end pipeline upon an instruction cache/translation look-aside buffer (TLB) miss and branch misprediction. Loads have priority over stores, and can be issued to memory as long as memory ports are available. This timer does not model miss event overlaps, hardware prefetching, or the break-up of macro-operations into micro-operations; nor does it model the impact of instruction execution latencies and interinstruction dependencies (that is, it does not model the critical path's impact).

The average error for timer1 for our set of microbenchmarks and CPU-intensive benchmarks equals 42.4 percent and 31.8 percent, respectively. The interval-based CPU timing model, which we describe next, achieves substantially higher levels of accuracy. In this work, we use the existing COTSon network and disk timers.

Interval simulation

The interval analysis model is mechanistic in nature, meaning that it is built on first principles: the performance model is derived in a bottom-up fashion, starting from a basic understanding of the mechanics of a contemporary processor.1

Figure 1. The COTSon architecture. Each COTSon node consists of the SimNow functional simulator feeding instructions into the CPU, disk, and network interface card (NIC) timing models.

As Figure 2 illustrates, interval analysis partitions a program's execution time into intervals separated by disruptive miss events such as cache misses, TLB misses, branch mispredictions, and serializing instructions. The figure shows the number of dispatched instructions on the vertical axis versus time on the horizontal axis. Under optimal conditions (that is, in the absence of miss events), the processor sustains a level of performance more or less equal to its pipeline front-end dispatch width. (We refer to dispatch as the point of entering the instructions from the front-end pipeline into the reorder buffer and issue queues.) However, miss events disrupt the smooth streaming of instructions through the dispatch stage. By dividing execution time into intervals, we can analyze the performance behavior of the intervals individually. In particular, we use the interval type (the miss event that terminates it) to determine the performance penalty per miss event:

• The penalty for an instruction cache/TLB miss equals its miss delay.

• The penalty for a branch misprediction equals the branch resolution time (number of cycles between the branch entering the reorder buffer and issue queue, and its resolution) plus the front-end pipeline depth.

• The penalty per long-latency load miss (that is, a last-level cache/TLB load miss) is approximated by its miss delay (memory access time). Multiple independent load misses might overlap their execution and expose memory-level parallelism (MLP).

• The penalty for a serializing instruction equals the reorder buffer drain time.

We might not always achieve the smooth streaming of instructions between miss events at a rate close to the designed dispatch width. Low instruction-level parallelism (ILP) applications might exhibit long chains of dependent instructions, first-level (L1) data cache misses, and long-latency functional unit instructions (divide, multiply, floating-point operations, and so on), or store instructions, which might cause a resource (for example, reorder buffer or issue queue) to fill up. A resource stall might thus cause dispatch to eventually stall for several cycles. To model this situation, interval modeling uses an ILP model that computes the critical path over a window of instructions while keeping track of the interinstruction dependencies and instruction execution latencies. The intuition is that the window (reorder buffer) cannot slide over the dynamic instruction stream any faster than dictated by the critical path. The effective dispatch rate is then computed through Little's law (reorder buffer size divided by critical path length), capped by the designed dispatch width.
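The Little's law computation can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the article's implementation: the window is taken to be one full reorder buffer's worth of instructions, each instruction is a hypothetical `(latency, deps)` pair where `deps` lists indices of earlier instructions in the same window, and the function names are invented for this example.

```python
def critical_path_length(window):
    """Longest dependence chain (in cycles) through a window of instructions.

    window: list of (latency, deps) tuples; deps holds indices of earlier
            instructions in the window that this instruction depends on.
    """
    finish = []
    for latency, deps in window:
        # An instruction can start once all of its producers have finished.
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency)
    return max(finish, default=0)


def effective_dispatch_rate(window, dispatch_width):
    """Window size divided by critical path length, capped by the width."""
    path = critical_path_length(window)
    if path == 0:
        return dispatch_width
    return min(dispatch_width, len(window) / path)


# Four dependent 3-cycle operations: the chain limits the dispatch rate.
chain = [(3, []), (3, [0]), (3, [1]), (3, [2])]
# Eight independent 1-cycle operations: the dispatch width is the limit.
independent = [(1, []) for _ in range(8)]
```

With a dispatch width of 3, the dependent chain yields a rate of 4/12 instructions per cycle, whereas the independent window is capped at 3.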

Interval analysis also provides good insight into how miss events overlap. For example, the penalty due to an instruction cache miss following a long-latency load miss is hidden beneath the long-latency load penalty. Similarly, the penalty for a mispredicted branch following a long-latency load in the dynamic instruction stream on which it does not depend is completely hidden underneath the penalty due to the long-latency load. If, on the other hand, the mispredicted branch depends on the long-latency load, both penalties serialize.
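A simplified Python sketch of these overlap rules follows. It is deliberately coarse and rests on assumptions not taken from the article: events are hypothetical `MissEvent` records, an event counts as "following" the load only if it falls within one reorder buffer's worth of instructions after it, and a hidden penalty is hidden completely.

```python
from dataclasses import dataclass


@dataclass
class MissEvent:
    kind: str                      # "load", "icache", or "branch"
    position: int                  # index in the dynamic instruction stream
    penalty: int                   # stand-alone penalty in cycles
    depends_on_load: bool = False  # True if the event needs the load's result


def exposed_penalty(load, later_events, rob_size):
    """Cycles charged for a long-latency load plus the events after it."""
    total = load.penalty
    for event in later_events:
        in_shadow = 0 < event.position - load.position < rob_size
        if in_shadow and not event.depends_on_load:
            continue  # hidden underneath the long-latency load penalty
        total += event.penalty  # dependent or out of reach: penalties serialize
    return total


load = MissEvent("load", position=0, penalty=200)
independent_branch = MissEvent("branch", position=10, penalty=20)
dependent_branch = MissEvent("branch", position=10, penalty=20,
                             depends_on_load=True)
```

In this sketch the independent branch adds nothing to the 200-cycle load penalty, while the dependent branch brings the total to 220 cycles.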

Figure 2. Interval analysis analyzes performance on an interval basis determined by disruptive miss events.

Using interval modeling, we can build architecture simulators that model the target machine at a higher level of abstraction. In this approach, called interval simulation,2 the interval model replaces the cycle-accurate core-level timing model. The core-level interval model interacts with the branch predictor and memory subsystem simulators to derive the miss events and (possibly) their latencies. The interval model then estimates how many cycles it takes to execute each interval. This includes analyzing the amount of ILP to determine the effective dispatch rate between miss events, as well as estimating how many cycles it takes to resolve a mispredicted branch and to drain the reorder buffer on a serializing instruction. Finally, the model also estimates the amount of overlap between miss events to do an accurate accounting in terms of their penalties. In other words, the interval model estimates a core's overall progress based on timing estimates of each individual interval. The miss events are determined by simulating the branch predictor and memory subsystem (the miss events determine the intervals) and the timing for each interval is estimated through the interval model. The key benefits of interval simulation are that it is easy to implement and runs substantially faster than cycle-accurate simulation, while maintaining good accuracy. Genbrugge et al. validated the interval simulator against the M5 simulator,4 which implements the Alpha RISC instruction set architecture (ISA). They achieved an average error of 4.6 percent and a tenfold simulation speedup compared to detailed simulation while running full-system multithreaded workloads.
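The per-interval accounting can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's code: each interval is assumed to be a hypothetical `(num_instructions, dispatch_rate, miss_penalty)` tuple, with the dispatch rate and the (already overlap-adjusted) penalty supplied by other parts of the model.

```python
def estimate_cycles(intervals):
    """Sum the estimated cycle count over a sequence of intervals.

    intervals: iterable of (num_instructions, dispatch_rate, miss_penalty),
               where dispatch_rate is in instructions per cycle and
               miss_penalty is the exposed penalty of the terminating event.
    """
    total = 0.0
    for num_instructions, dispatch_rate, miss_penalty in intervals:
        # Time to stream the interval's instructions, then pay the penalty.
        total += num_instructions / dispatch_rate + miss_penalty
    return total


cycles = estimate_cycles([(300, 3.0, 20), (100, 2.0, 200)])
```

For these two made-up intervals the estimate is 100 + 20 + 50 + 200 = 370 cycles.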

Accurate x86 CPU timing model

We set out to achieve three major goals in this work.

First, we wanted to validate the model against real hardware. Although our previous work demonstrated the accuracy of interval modeling and simulation, we validated it using an academic simulator. This is a good first step; however, it is unclear how accurate the model is against real hardware. Prior work in simulator validation has shown that it is extremely difficult to validate an academic simulator against real hardware.5 This raises the question of whether a model that has been validated against a simulator is close to real hardware.

We also wanted to validate the model for the prevalent x86 and x86_64 ISA. Our work in interval modeling and simulation (like many other modeling and simulation efforts in computer architecture) uses Alpha, a RISC ISA that is relatively easy to handle. This might not be sufficient given the prevalence of the x86 and x86_64 ISAs in contemporary computer systems. Moreover, given that we target the simulation infrastructures of computer systems running real and unmodified software stacks, x86 is the ISA of choice.

Finally and foremost, we wanted an accurate, fast, and easy-to-implement simulator that can run unmodified commercial full-system workloads at scale in an affordable amount of time. Although the COTSon simulation infrastructure fulfills most of these requirements (it is fast and can run unmodified complex workloads), the available CPU timing models are simple tutorial models.

The possibility of integrating the interval model as a CPU timing model into the COTSon infrastructure initiated this work. Doing this would let us meet all three goals. It enables validation for the x86 ISA; it enables validation against real hardware (given the predominance of x86 hardware); and it might improve the COTSon infrastructure's accuracy. As an end result, we achieved all three goals: the interval model-based CPU timing model significantly improves the accuracy of the COTSon simulation infrastructure compared to real hardware running complex x86 workloads.

Modeling

Because the interval model is relatively easy to implement, we were able to integrate it as a novel CPU timing model in COTSon in about one engineer month. This includes the interval model itself along with several particularities relating to x86 architectures. Subsequently, we validated the model against real hardware, which took another three engineer months. We performed this validation process against an AMD Opteron server system (see the "Experimental Setup" section for more details) and found several opportunities for improving the model. Building a validated interval-based CPU timing model took a total of four engineer months.

Compared to the interval model,2 the interval-based CPU timing model includes several novel features.

First, the interval-based CPU timing model breaks x86 instructions (macro-operations) into RISC-like micro-operations. It performs this break-up generically. It breaks an x86 instruction into one or more load micro-ops, followed by an arithmetic operation and one or more store micro-ops. Our current implementation does not include macro-op or micro-op fusion, although we could easily add this.
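As an illustration of such a generic break-up, the following Python sketch turns a decoded instruction into load, execute, and store micro-ops. The `MacroOp` record and the tuple encoding of micro-ops are hypothetical; the article does not describe the data structures the timing model actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class MacroOp:
    mnemonic: str
    mem_reads: list = field(default_factory=list)   # memory source operands
    mem_writes: list = field(default_factory=list)  # memory destination operands


def break_up(insn):
    """Generic break-up: loads first, then the operation, then stores."""
    uops = [("load", operand) for operand in insn.mem_reads]
    uops.append(("exec", insn.mnemonic))
    uops += [("store", operand) for operand in insn.mem_writes]
    return uops


# A read-modify-write instruction such as "add [rbx], rax".
uops = break_up(MacroOp("add", mem_reads=["[rbx]"], mem_writes=["[rbx]"]))
```

The read-modify-write example expands into three micro-ops: a load from `[rbx]`, the add itself, and a store back to `[rbx]`. No fusion is attempted, in line with the limitation stated above.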

Second, we integrated an x86 disassembler as part of the CPU timing model to enable micro-op formation and to determine an instruction's type as well as its input and output operands. The x86 disassembly also involves register assignment and dependence analysis to create data dependencies between micro-ops. Note that the integration of a disassembler into the timing model results from the fact that the COTSon simulator leverages AMD's proprietary SimNow functional simulator, which does not expose the instruction type and operands to COTSon. If SimNow communicated disassembly information to COTSon, we would not need to integrate a disassembler in the timing model.

All modern high-end processors implement some form of hardware prefetching to hide memory access latencies. Prior versions of the interval simulator did not include hardware prefetching, however. On par with the AMD Opteron processor6 that we validate against, the interval-based CPU timing model implements hardware prefetching at multiple levels of the memory hierarchy, namely at the core-level L1 data cache (the core prefetcher) and at the L3 cache (the DRAM prefetcher). The core prefetcher is instruction pointer based, whereas the DRAM prefetcher initiates prefetches based on the observed L3 cache access patterns. Both prefetchers are stride-based.
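To illustrate what an instruction-pointer-based stride prefetcher does, here is a textbook-style Python sketch. It is not the Opteron's prefetcher nor the model's implementation: the table organization, the rule of issuing a prefetch once the same nonzero stride is seen twice in a row, and the class name are all assumptions made for this example.

```python
class StridePrefetcher:
    """Per-instruction-pointer stride detection (illustrative only)."""

    def __init__(self):
        self.table = {}  # instruction pointer -> (last address, last stride)

    def access(self, ip, addr):
        """Record a load and return an address to prefetch, or None."""
        prefetch = None
        if ip in self.table:
            last_addr, last_stride = self.table[ip]
            stride = addr - last_addr
            # Issue a prefetch once the same nonzero stride repeats.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
            self.table[ip] = (addr, stride)
        else:
            self.table[ip] = (addr, 0)
        return prefetch


prefetcher = StridePrefetcher()
results = [prefetcher.access(0x400, addr) for addr in (1000, 1064, 1128)]
```

The third access from the same instruction pointer confirms a 64-byte stride, so the sketch returns address 1192 as the prefetch candidate.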

The interval-based CPU timing model also supports overlapping miss events. Interval analysis assumes that only off-chip memory accesses (that is, last-level L3 cache misses) cause the reorder buffer to fill up and stall dispatch. Other misses, such as L2 misses that hit in L3, are assumed to be hidden through out-of-order execution. We found this to be an invalid assumption for the real hardware we validated against. Therefore, we consider L2 misses as another source of miss events, and we apply the overlap algorithm to L2 misses accordingly. That is, we assume dispatch blocks on an L2 miss and independent miss events further down the dynamic instruction stream that make it into the reorder buffer simultaneously with the L2 miss might (partially) overlap this L2 miss.

Interval analysis uses instruction latencies to determine the length of the critical data dependence path through the program, which in turn is important to determine the effective dispatch rate in the absence of miss events. Unfortunately, instruction execution latencies are poorly documented. We therefore considered synthetically generated kernels to determine instruction latencies. We used this procedure to determine the latencies of several instruction types, such as integer divide and multiply operations, floating-point operations, and streaming SIMD extension (SSE) operations.

Validation against real hardware

The validation process against real hardware revealed many opportunities for improving the interval-based CPU timing model. Figure 3 shows the progress during the validation process. The vertical axis shows the absolute error between the simulator and the real hardware for a set of microbenchmarks. For each intermediate version of the timing model, we show the average absolute error (diamond) as well as its standard deviation (error bar). The starting point for the validation process was the interval simulator's initial implementation. We subsequently added core prefetching, adjusted the cache latencies, included the overlap algorithm for L2 misses, improved the effective dispatch rate computation to be capped by both the critical path and the processor width, and adjusted the core prefetcher to be more aggressive. This brought us to a point with an average error of 11.5 percent. Although this is reasonably accurate, we observed relatively large errors for some of our microbenchmarks (up to 24.8 percent).

Figure 3. Validation process using the microbenchmarks and synthetically generated kernels: Modeling accuracy is shown on the vertical axis as a function of the modeling enhancements over time on the horizontal axis. (Vertical axis: absolute error (%). Horizontal axis, in order: baseline, core prefetching, cache latencies, overlap algorithm for L2 misses, improved effective dispatch rate computation, more aggressive core prefetching, adjusted instruction latencies, improved micro-op break-up, more accurate fetch stall conditions, DRAM prefetching.)

The next step in the validation process used synthetically generated kernels to reveal the instruction latencies for the various instruction types. Although this improved the accuracy for the microbenchmarks that we had very high errors for, the average error increased substantially (to 36.3 percent), and for some other microbenchmarks the error increased dramatically (up to 81.7 percent). Further improvements in the micro-op break-up algorithm and fetch stall conditions, and the addition of the DRAM prefetcher brought the average error down to 9.8 percent, with a maximum error of 19.8 percent (see the rightmost point in Figure 3).

Experimental setup

We validated our model against an AMD Opteron 2350 quadcore processor machine.6 It implements AMD's K10 microarchitecture in a 65-nanometer technology at 2 GHz. Each core is a 3-wide superscalar out-of-order architecture with a 72-entry reorder buffer. The L1 caches are 64 Kbytes in size. Further, it implements a per-core 512-Kbyte L2 cache, a shared 2-Mbyte L3 cache, and an on-chip memory controller.

We repeated our real hardware measurements 15 times, and we report average performance numbers along with their 95 percent confidence intervals. We made the measurements on an idle machine, and measured time using the Linux time command.
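A mean with a 95 percent confidence interval over 15 repeated runs can be computed as in the Python sketch below. The article does not state which interval formula was used; this sketch assumes the common Student's t interval, for which the two-sided 95 percent critical value with 14 degrees of freedom is approximately 2.145. The function name and the sample timings are made up.

```python
import math
import statistics

T_CRIT_95_DF14 = 2.145  # Student's t, two-sided 95 percent, 14 degrees of freedom


def mean_ci95(samples, t_crit=T_CRIT_95_DF14):
    """Return (mean, half-width) of the confidence interval.

    The default t_crit is only valid for 15 samples; pass another value
    for a different number of runs.
    """
    mean = statistics.mean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, t_crit * standard_error


# Hypothetical wall-clock times in seconds for 15 runs of one benchmark.
runs = [10.2, 10.1, 10.3, 10.2, 10.0, 10.4, 10.2, 10.1,
        10.3, 10.2, 10.1, 10.2, 10.3, 10.2, 10.2]
mean, half_width = mean_ci95(runs)
```

The result is reported as `mean` plus or minus `half_width` seconds.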

The microbenchmarks we used (bsearch, dijkstra, div, dl1, fp, memory, mul, and qsort) stress specific aspects of the architecture, such as floating-point units, divide, core prefetching, and DRAM prefetching. We took the compute-intensive benchmarks (blackscholes, bodytrack, freqmine, ferret, streamcluster, raytrace, swaptions, blastn, blastp, ce, h264dec, h264enc, and specjbb2005) from various sources, such as Parsec,7 BioPerf,8 MediaBench II,9 and SPECjbb2005. The Parsec benchmarks are multithreaded and model recognition, mining, and synthesis (RMS) workloads. This set of benchmarks covers workload classes such as data analytics, presentation, multimedia, and gaming, which are likely candidates to run in (future) computer systems. Finally, Nutch is a Web 2.0 search engine workload in which a client sends search requests to the Nutch server and measures the response time and throughput at the client side.

Evaluation: Accuracy versus speed

Our evaluation of the interval-based CPU timer within COTSon followed several steps. We first focused on accuracy, and considered the microbenchmarks and the compute-intensive benchmarks. Subsequently, we focused on the speed versus accuracy trade-off while employing sampling.

We used microbenchmarks and CPU-intensive benchmarks to evaluate accuracy. Figure 4 compares the relative error for the interval-based timer against real hardware execution using the microbenchmarks when reporting simulation time in seconds. The average absolute error is 9.8 percent. The interval-based CPU timer is also accurate for the compute-intensive benchmarks, as Figure 5 shows. The average absolute error for the interval-based timer is 18.6 percent (maximum error of 41 percent).

Figure 4. Modeling error (vertical axis) of the interval-based CPU timing model against real hardware using microbenchmarks (horizontal axis). (Vertical axis: relative error (%). Horizontal axis: bsearch, dijkstra, div, dl1, fp, memory, mul, qsort, and the absolute average.)
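The two error metrics used in this evaluation can be written down as a short Python sketch. The sign convention (positive when the simulator overestimates execution time) and the function names are assumptions; the article only distinguishes a signed relative error per benchmark from the average of the absolute errors.

```python
def relative_error(simulated, measured):
    """Signed error in percent; positive means the simulator overestimates."""
    return 100.0 * (simulated - measured) / measured


def average_absolute_error(pairs):
    """Mean of the absolute relative errors over (simulated, measured) pairs."""
    errors = [abs(relative_error(simulated, measured))
              for simulated, measured in pairs]
    return sum(errors) / len(errors)


# Hypothetical execution times in seconds: (simulated, measured on hardware).
pairs = [(11.0, 10.0), (9.0, 10.0)]
average = average_absolute_error(pairs)
```

Note that the signed errors of the two made-up benchmarks cancel out (+10 and -10 percent), which is why the average is taken over absolute values.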

As mentioned earlier, the Parsec benchmarks are multithreaded workloads, and we run up to four threads because the AMD Opteron machine that we compare against is a quadcore processor. As we increase the core counts, we also increase the number of threads that co-execute, and these co-executing threads affect each other's performance through synchronization as well as through shared resource contention in the L3 cache, off-chip bandwidth, and main memory. Interval-based CPU modeling captures these interactions well. Note, however, that AMD's SimNow serializes the functional simulation of cores, which might lead to behavior during functional simulation that differs from the behavior in a timing-directed simulator or on real hardware. For example, a spin lock loop might be iterated a different number of times in COTSon than on real hardware, which is a concern especially for workloads with high contention locks. Functional-directed simulation as implemented in COTSon addresses this concern to some extent. The error numbers reported here include this inaccuracy. One solution might be to more tightly couple the functional simulator's speed on the one side and the timing simulator on the other side. However, doing so without compromising simulation speed too much is an orthogonal issue that falls outside this article's scope.

Figure 5. Modeling error (vertical axis) of the interval-based CPU timing model against real hardware using a suite of compute-intensive benchmarks from BioPerf, MediaBench II, Parsec, and SPECjbb2005 (horizontal axis). (Vertical axis: relative error (%). The Parsec benchmarks are shown for 1-core, 2-core, and 4-core runs.)

Running complex full-system workloads, which is our ultimate goal, requires that very long running workloads can be simulated in a reasonable amount of time. Our interval-based CPU timing model achieves 350 thousand instructions per second (KIPS), which is 38 percent slower than the COTSon CPU timer running at 570 KIPS. Although this is a reasonable simulation speed, it is not fast enough to simulate complex workloads in an affordable amount of time. Sampling is a well-founded technique for speeding up simulation.10-12 The idea behind sampling is to simulate only a small fraction of the entire dynamic instruction stream in detail and then extrapolate; that is, by taking small sampling units randomly or periodically, you can get an accurate picture of the entire execution. Because only a small fraction is simulated in detail, we obtain substantial speedups. Figure 6 shows the accuracy for three sampling scenarios (we explored more strategies but do not show them here to improve readability):

• 1 million instruction warming and 100 thousand instruction sampling units,

• 100 thousand instruction warming and 100 thousand instruction sampling units, and

• 100 thousand instruction warming and 10 thousand instruction sampling units.

There are 100 million instructions between the sampling units for all three strategies. Accuracy improves as sampling unit size and warming increase. The 1 million warming and 100 thousand sampling unit scenario achieves an average error of 23.1 percent and a simulation speed of 37 MIPS. Figure 7 shows the trade-off in accuracy versus speed, and considers several sampling strategies. We find the 100 thousand sampling strategy (with one sampling unit every 100 million instructions and 1 million instructions of warming) to be a good trade-off in speed versus accuracy, and we use it further.
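To see why sampling yields such large speedups, the following Python sketch computes the fraction of the instruction stream that passes through the timing model for each of the three strategies. It assumes, as a simplification not stated in the article, that both the warming instructions and the sampling unit are fed to the timing model and that everything else runs in functional-only mode; the function name is hypothetical.

```python
def detailed_fraction(period, warming, unit):
    """Fraction of instructions simulated with the timing model per period."""
    return (warming + unit) / period


PERIOD = 100_000_000  # instructions between sampling units
strategies = {
    "100M-1M-100k": detailed_fraction(PERIOD, 1_000_000, 100_000),
    "100M-100k-100k": detailed_fraction(PERIOD, 100_000, 100_000),
    "100M-100k-10k": detailed_fraction(PERIOD, 100_000, 10_000),
}
```

Under these assumptions, even the most detailed of the three strategies sends only about 1.1 percent of all instructions through the timing model.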

Case study: Server workload

We now consider a more complex server workload, namely a Web 2.0 search engine application based on the Nutch platform. Nutch is built on Lucene Java, adding various Web specifics such as crawling, HTML parsing, and a link-graph database. Our benchmark consists of a server holding the search database and a variable number of clients that submit requests to the server. The server runs on one COTSon simulation node, and the clients are run on another.

Figure 6. Accuracy for three sampling strategies: One million warming and 100 thousand instruction sampling units (a), 100 thousand warming and 100 thousand instruction sampling units (b), and 100 thousand warming and 10 thousand instruction sampling units (c). There are 100 million instructions between the sampling units for all three strategies. (Plot: absolute error (%) per benchmark for 1-core, 2-core, and 4-core runs.)

Figure 7. Speed versus accuracy trade-off. The Pareto front is formed through the dashed line. A sampling strategy A-B means A instructions for warming and B instructions for the sampling unit. All sampling strategies assume 100 million instructions between sampling units. (Axes: million instructions per second (MIPS) versus average absolute error (%).)

Figure 8 shows the response time and throughput on the client side for the real hardware and COTSon (which uses the interval-based CPU timer). The simulation is within 7.0 percent and 12.7 percent on average for response time and throughput, respectively. As the figure shows, throughput increases for up to 100 concurrent clients with only a modest increase in response time. Throughput decreases dramatically past 140 clients, with a highly variable transition phase between 100 and 140 clients. Software simulation captures this trend well.
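The average errors quoted for response time and throughput can be read as a mean absolute relative error over the measured concurrency levels. The sketch below shows that computation. The exact metric definition is an assumption, the function name is hypothetical, and the sample values are made up for illustration rather than taken from Figure 8.

```python
from typing import Sequence


def mean_abs_pct_error(real: Sequence[float],
                       simulated: Sequence[float]) -> float:
    """Average of |simulated - real| / real over all points, in percent."""
    if len(real) != len(simulated) or not real:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(s - r) / r for r, s in zip(real, simulated)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical response times (seconds) at four concurrency levels.
real_hw = [0.050, 0.060, 0.080, 0.200]
sim = [0.055, 0.057, 0.084, 0.190]

print(f"average error: {mean_abs_pct_error(real_hw, sim):.2f}%")
```

Averaging relative rather than absolute differences keeps the low-concurrency points, where response times are small, from being drowned out by the large values past the saturation point.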

Software simulation’s real power is that it lets developers explore the microarchitecture and its effect on overall performance. Figure 9 shows results from a case study involving three L3 cache sizes: 1 Mbyte, 8 Mbytes, and 32 Mbytes. The response time for the Nutch benchmark decreases as cache size increases. The 1-Mbyte cache appears sufficient for limited levels of concurrency, whereas an 8-Mbyte cache is clearly beneficial for larger numbers of concurrent clients, and a 32-Mbyte cache brings no further improvement.
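One plausible reading of the plateau between 8 Mbytes and 32 Mbytes is that the smaller of the two caches already holds the working set. The toy sketch below illustrates that effect with a fully associative LRU cache and a synthetic access trace. It is not the COTSon cache model, and the trace, the capacities (chosen in a 1:8:32 ratio), and the function name are all hypothetical.

```python
import random
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses of a fully associative LRU cache with `capacity` lines."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used line
    return misses


# Synthetic trace: uniform random accesses over a 4,096-line working set.
rng = random.Random(0)
working_set = 4096
trace = [rng.randrange(working_set) for _ in range(200_000)]

for capacity in (512, 4096, 16384):
    print(capacity, lru_misses(trace, capacity))
```

Once the capacity covers the working set, only first-touch misses remain, so a larger cache cannot help further. Whether this is what limits the Nutch workload at 32 Mbytes is not established here.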

Simulation is an invaluable tool for contemporary system design. Higher-abstraction timing models reduce simulator development and evaluation time, and open up opportunities for both system architecture and software research and development. System integrators and architects can use the simulation approach to make system-level design trade-offs, whereas software developers can use it to perform software performance studies in a reasonable amount of time. As part of our future work, we plan to study simulation approaches with yet higher simulation speeds while enabling the modeling of large systems at scale.

Acknowledgments

We thank Paolo Faraboschi (HP Labs) and the anonymous reviewers for their thoughtful comments and suggestions. Frederick Ryckbosch is supported through a doctoral fellowship by the Research Foundation - Flanders (FWO). Stijn Polfliet is supported through a doctoral fellowship by the Agency for Innovation by Science and Technology (IWT). The FWO projects G.0232.06, G.0255.08, and G.0179.10, and the UGent-BOF projects 01J14407 and 01Z04109 provide additional support.

Figure 8. Evaluating the accuracy of the interval-based timer against real hardware for the server-side Nutch benchmark: response time (a), and throughput (b). (Plots: response time in seconds and throughput in Mbytes/second as a function of concurrency, for the AMD Opteron real hardware and the interval-based timer.)

Figure 9. Microarchitecture study using varying cache sizes for the Nutch benchmark. Response time is shown as a function of the level of concurrency and L3 cache size. (Plot: response time in seconds versus concurrency for 1-Mbyte, 8-Mbyte, and 32-Mbyte caches.)

...

References

1. S. Eyerman et al., “A Mechanistic Performance Model for Superscalar Out-of-Order Processors,” ACM Trans. Computer Systems (TOCS), vol. 27, no. 2, May 2009, Article 3.

2. D. Genbrugge, S. Eyerman, and L. Eeckhout, “Interval Simulation: Raising the Level of Abstraction in Architectural Simulation,” Proc. Int’l Symp. High-Performance Computer Architecture (HPCA 10), IEEE CS Press, 2010, pp. 307-318.

3. E. Argollo et al., “COTSon: Infrastructure for Full System Simulation,” SIGOPS Operating System Rev., vol. 43, no. 1, Jan. 2009, pp. 52-61.

4. N.L. Binkert et al., “The M5 Simulator: Modeling Networked Systems,” IEEE Micro, vol. 26, no. 4, 2006, pp. 52-60.

5. R. Desikan, D. Burger, and S.W. Keckler, “Measuring Experimental Error in Microprocessor Simulation,” Proc. Ann. Int’l Symp. Computer Architecture (ISCA 01), ACM Press, 2001, pp. 266-277.

6. C.N. Keltcher et al., “The AMD Opteron Processor for Multiprocessor Servers,” IEEE Micro, vol. 23, no. 2, Mar. 2003, pp. 66-76.

7. C. Bienia et al., “The PARSEC Benchmark Suite: Characterization and Architectural Implications,” Proc. Int’l Conf. Parallel Architectures and Compilation Techniques (PACT 08), ACM Press, 2008, pp. 72-81.

8. D.A. Bader et al., “BioPerf: A Benchmark Suite to Evaluate High-performance Computer Architecture on Bioinformatics Applications,” Proc. IEEE Int’l Symp. Workload Characterization (IISWC 05), IEEE Press, 2005, pp. 163-173.

9. C. Lee, M. Potkonjak, and W.H. Mangione-Smith, “MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems,” Proc. Ann. IEEE/ACM Symp. Microarchitecture (Micro 97), IEEE CS Press, 1997, pp. 330-335.

10. T.M. Conte, M.A. Hirsch, and K.N. Menezes, “Reducing State Loss for Effective Trace Sampling of Superscalar Processors,” Proc. Int’l Conf. Computer Design (ICCD 96), IEEE CS Press, 1996, pp. 468-477.

11. T. Sherwood et al., “Automatically Characterizing Large Scale Program Behavior,” Proc. Int’l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 45-57.

12. R.E. Wunderlich et al., “SMARTS: Accelerating Microarchitecture Simulation Via Rigorous Statistical Sampling,” Proc. Ann. Int’l Symp. Computer Architecture (ISCA 03), ACM Press, 2003, pp. 84-95.

Frederick Ryckbosch is a PhD student in the Electronics and Information Systems Department at Ghent University, Belgium. His research interests include computer architecture in general, and simulation of large-scale computer systems in particular. Ryckbosch has an MS in computer science and engineering from Ghent University.

Stijn Polfliet is a PhD student in the Electronics and Information Systems Department at Ghent University, Belgium. His research interests include computer architecture in general, and simulation of large-scale computer systems in particular. Polfliet has an MS in computer science and engineering from Ghent University.

Lieven Eeckhout is an associate professor in the Electronics and Information Systems Department at Ghent University, Belgium. His research interests include computer architecture and the hardware/software interface, with a focus on performance analysis, evaluation and modeling, and workload characterization. Eeckhout has a PhD in computer science and engineering from Ghent University. He is a member of IEEE and the ACM.

Direct questions and comments about this article to Lieven Eeckhout, ELIS, Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium; leeckhou@elis.ugent.be.