Computer Science 146/246 Homework #3


Due 11:59 P.M. Sunday, April 12th, 2015

We played with a Pin-based cache simulator for Homework 2. This homework will prepare you to set up and run a detailed microarchitecture-level simulator, as well as to modify the simulator to facilitate your own research. We will use XIOSim, a Pin-based x86 microarchitecture-level simulator, for this homework. You can find more details about (an older version of) XIOSim in this paper [2], or you can just check it out on GitHub (see footnote 1).

1 Download and Configure XIOSim

a. Download XIOSim. You can get XIOSim from GitHub:

$ git clone https://github.com/s-kanev/XIOSim.git

b. Set up your environment. The build and the following scripts rely on these variables to know which XIOSim installation to look for and how to satisfy dependencies. Just execute:

$ export BOOST_HOME=/home/cs246/boost_1_54_0
$ export XIOSIM_TREE=/your/path/to/XIOSim
$ export XIOSIM_INSTALL=${XIOSIM_TREE}/pintool/obj-ia32

You can add these to your ~/.bashrc file so you don’t have to type them out every time you log in.

c. Build XIOSim

$ cd pintool
$ make

Footnote 1: Full disclosure, I’m the main author of XIOSim. So, if you have any feedback, suggestions, bug reports, or curse words you want to throw at me, I’d be more than happy to listen.


d. Run Your First Test Program. Let’s test the simulator with a simple benchmark. There is a script called run.sh under the XIOSim/pintool directory, which sets up the simulated architecture configuration and runs the simulation. It looks like this:

PIN=${PIN_ROOT}/pin.sh
PINTOOL=./obj-ia32/feeder_zesto.so
ZESTOCFG=../config/A.cfg
BENCHMARK_CFG_FILE=benchmarks.cfg

CMD_LINE="setarch i686 -BR ./obj-ia32/harness \
    -benchmark_cfg ${BENCHMARK_CFG_FILE} \
    -pin ${PIN} \
    -pause_tool 1 \
    -xyzzy \
    -t \
    ${PINTOOL} \
    -num_cores 1 \
    -s \
    -config ${ZESTOCFG}"
echo ${CMD_LINE}
${CMD_LINE}

There are two things to notice. First, benchmarks.cfg chooses which program to simulate. Second, A.cfg in ../config/ sets up the simulation parameters. This particular file models Intel’s Atom processor. You should already be familiar with some of the knobs in that file. For example, search for bpred and you can see that the Atom model is set up to simulate a 2-level gshare predictor, very similar to what you implemented in Homework 1. Now you are ready to run the simulator.

pintool $ ./run.sh

It will finish in a couple of minutes. The simulation output is in sim.out. You can see

the simulator statistics about each pipeline stage, instruction breakdowns, as well as various caches and memory stats. You will want to spend some time looking through this file and the config file (A.cfg), to understand the output data, some of which you will need for the


e. Run SPEC Benchmarks. Now you are ready to run full-blown SPEC benchmarks using XIOSim. Before we modify the script, make a directory outside your XIOSim repository for output files, e.g. mkdir /your/path/to/hw3/spec_out.

Switch to the XIOSim/scripts directory.

First, modify line #7 in spec.py to specDir='/home/cs246/CPU2006'. Then make sure ./run_spec.py knows about your new output directory. I’ve summarized the changes in that file for you below:

Line #   Change to
8        RUN_DIR_ROOT = "/your/path/to/hw3/spec_out"
9        RESULT_DIR = "/your/path/to/hw3/spec_out"
10       CONFIG_FILE = "config/A.cfg"

After these edits, you can just execute ./run_spec.py to run the simulation for benchmark 401.bzip2 with input chicken.

scripts $ nohup ./run_spec.py &

nohup will keep your job running in the background even if you log out of the machine.

Currently we simulate 100M instructions, which will take around 30 mins per run.

Note that XIOSim requires 2 threads per run. We strongly recommend running at most 3 jobs (6 threads in total) at a time. Before you start, do make sure there are at least two cores idling. Otherwise, you will grind not only your jobs, but everyone else’s on the machine, to a halt. You can use top to check whether there are jobs already running.
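As a rough complement to eyeballing top, the rule above (2 threads per run, at most 3 jobs) can be turned into a quick check. This is only a sketch: the helper name jobs_that_fit is made up for illustration, and the 1-minute load average is just an approximation of how many cores are busy.

```python
import math
import os

THREADS_PER_RUN = 2  # from the handout: XIOSim requires 2 threads per run
MAX_JOBS = 3         # from the handout: at most 3 jobs at a time


def jobs_that_fit(load_avg, n_cores,
                  threads_per_run=THREADS_PER_RUN, max_jobs=MAX_JOBS):
    """Hypothetical helper: how many runs fit on the idle cores, capped at max_jobs."""
    # Assumption: treat the load average, rounded up, as the number of busy cores.
    idle = max(0, n_cores - math.ceil(load_avg))
    return min(max_jobs, idle // threads_per_run)


if __name__ == "__main__":
    load_1min = os.getloadavg()[0]  # 1-minute load average (Unix only)
    cores = os.cpu_count() or 1
    print(f"load={load_1min:.2f} cores={cores} "
          f"jobs that fit={jobs_that_fit(load_1min, cores)}")
```

If the helper reports 0, wait for other jobs to finish before launching yours.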

After the simulation finishes, you can check the simulation output file (*.sim.out) located at /your/path/to/hw3/spec_out/. It reports that the execution time of the sampled 100M instructions is 62834 us (sim time).

For this homework, you need to present your results for two SPEC benchmarks, 401.bzip2 and 429.mcf. To run 429.mcf, just change the last line in run_spec.py to RunSPECBenchmark("429.mcf.inp").


2 Execution Time Decomposition [30 Points]

In this homework, you will first use XIOSim to reproduce the execution time decomposition from Doug Burger’s paper [1].

A. Read the paper and understand how to quantify processor time, latency time, and bandwidth time.

B. Run simulations to generate f_P, f_L, and f_B for 401.bzip2 and 429.mcf.

C. Plot the breakdowns similar to Figure 3 in the paper and explain your findings.

You do not need to change or recompile the simulator for this problem. You may need to change certain knobs in the config files to run the simulations with different assumptions, listed below.

baseline:
    change nothing, just use A.cfg

every request hits in the L1 data cache:
    core_cfg.exec_cfg.dcache_cfg.magic_hit_rate : "1.0"

infinite bandwidth between LLC and memory:
    uncore_cfg.fsb_cfg.magic : "true"
    uncore_cfg.dram_cfg.dram_config : "simplesdram-infbw:4:4:35:11.25:11.25:11.25:11.25:64"

Here, the "."s separate the different sections in A.cfg. The run_spec.py script is set up to take config file replacements in this format (check out line 68), so that it’s much easier to automate your parameter sweeps. Of course, if you don’t trust shady Python scripts, you are welcome to change the config file by hand.
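Once the three runs above have finished, the fractions can be computed from their execution times. The sketch below assumes the usual reading of the paper's decomposition: processor time is the time with a perfect memory system (here, the run where every request hits in the L1 data cache), latency time is the extra time added with infinite bandwidth, and bandwidth time is the remainder up to the baseline. Check these definitions against the paper before relying on them. The function name and the two non-baseline times are hypothetical.

```python
def decompose(t_base, t_perfect_l1, t_inf_bw):
    """Split baseline execution time into (f_P, f_L, f_B).

    t_base:       time with the unmodified A.cfg
    t_perfect_l1: time when every request hits in the L1 data cache
    t_inf_bw:     time with infinite bandwidth between LLC and memory
    All three must be in the same unit (e.g. us of sim time).
    """
    f_p = t_perfect_l1 / t_base               # processor fraction
    f_l = (t_inf_bw - t_perfect_l1) / t_base  # latency fraction
    f_b = (t_base - t_inf_bw) / t_base        # bandwidth fraction
    return f_p, f_l, f_b


if __name__ == "__main__":
    # 62834 us is the baseline figure quoted in Section 1; the other two
    # values are made-up placeholders, not simulation results.
    f_p, f_l, f_b = decompose(62834.0, 40000.0, 55000.0)
    print(f"f_P={f_p:.3f} f_L={f_l:.3f} f_B={f_b:.3f}")
```

By construction the three fractions sum to 1, which is a handy sanity check on your numbers.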

3 Effect of Increasing Frequency [15 Points]

The paper mentions that a faster clock speed will reduce processor time but increase latency and bandwidth times. You can test whether this statement holds with the help of the simulator.


A. Change the knob core_cfg.core_clock in A.cfg from 1600 MHz to 3200 MHz.

B. Repeat the steps in Section 2 to generate the time breakdowns.

C. Plot the breakdowns, compare your results with the ones from running at 1.6 GHz, and explain your findings.

4 Power and Energy [5 Points]

Dynamic power can be estimated using Equation 1:

$$P = \alpha C V^{2} f \qquad (1)$$

where α is an activity factor, C is capacitance, V is voltage, and f is frequency. Using Equation 1, assuming constant activity, capacitance, and voltage, increasing frequency from 1.6 GHz to 3.2 GHz also doubles dynamic power consumption. We ignore static and leakage power for this homework.

Energy is equal to power multiplied by time. Based on our assumption that dynamic power doubles from 1.6 GHz to 3.2 GHz, compare the dynamic energy consumption for two runs (1.6 GHz and 3.2 GHz) for each benchmark. Explain your findings.

In order to simulate power, you need to add

system_cfg.simulate_power : "true"

to the list of replacements in your A.cfg file.
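The energy comparison in this section reduces to one ratio: since energy is power times time, E(3.2 GHz) / E(1.6 GHz) = (power ratio) × (time ratio), with the power ratio assumed to be 2. A minimal sketch, with a hypothetical function name and a placeholder 3.2 GHz time:

```python
def relative_dynamic_energy(t_low, t_high, power_ratio=2.0):
    """Return E_high / E_low for the same work.

    t_low, t_high: execution times at 1.6 GHz and 3.2 GHz (same unit)
    power_ratio:   P_high / P_low; 2.0 under this section's assumption
    """
    return power_ratio * (t_high / t_low)


if __name__ == "__main__":
    # 62834 us is the 1.6 GHz baseline time quoted in Section 1;
    # 40000 us is a made-up placeholder for the 3.2 GHz run.
    print(f"E(3.2 GHz) / E(1.6 GHz) = {relative_dynamic_energy(62834.0, 40000.0):.3f}")
```

The ratio equals 1 only when doubling the frequency exactly halves the execution time.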

5 Dynamic Voltage Frequency Scaling [50 Points]

Dynamic Voltage and Frequency Scaling (DVFS) is a power management technique that dynamically adjusts voltage and frequency based on the runtime behavior of applications to reduce power/energy consumption. Contemporary processors feature a variety of hardware power management mechanisms to adjust voltage and frequency. In this part, you will implement a frequency scaling policy which adapts the core’s frequency to program behavior. The intuition behind dynamic frequency scaling is that if the core is stalling due to cache misses, lowering frequency can reduce dynamic power without an effect on performance. XIOSim provides a modular interface for frequency scaling policies. You can find one simple example, sample.cpp, inside the XIOSim/ZCOMPS-dvfs/ directory. Currently, the scheduler is really simple: every tick, it compares the dynamic IPC with a constant (0.6 in this case, out of a theoretical maximum of 2.0 on Atom). A higher IPC switches to the maximum frequency (3.2 GHz) and a lower one to the minimum (1.6 GHz).
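The decision logic just described can be sketched independently of the simulator. The Python below is not the code in sample.cpp and does not use XIOSim's C++ interface; the function and constant names are hypothetical, and the behavior when the IPC equals the threshold or no cycles have elapsed is an assumption.

```python
F_MIN_MHZ = 1600     # minimum frequency named in the handout
F_MAX_MHZ = 3200     # maximum frequency named in the handout
IPC_THRESHOLD = 0.6  # constant used by the simple policy


def simple_dfs(commit_insn_delta, cycle_delta, threshold=IPC_THRESHOLD):
    """Pick the next frequency (MHz) from the IPC of the last interval.

    commit_insn_delta: instructions committed since the previous tick
    cycle_delta:       core cycles elapsed since the previous tick
    """
    if cycle_delta <= 0:
        return F_MIN_MHZ  # assumption: nothing measured, stay at the low frequency
    ipc = commit_insn_delta / cycle_delta
    # Assumption: an IPC exactly equal to the threshold counts as "lower".
    return F_MAX_MHZ if ipc > threshold else F_MIN_MHZ


if __name__ == "__main__":
    print(simple_dfs(900_000, 1_000_000))  # compute-bound interval
    print(simple_dfs(200_000, 1_000_000))  # stall-heavy interval
```

Prototyping the decision function this way makes it easy to try other inputs, such as the LLC miss counts, before porting the logic to C++.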

We already ran this simple policy on a test microbenchmark; results are shown in Figure 1.

[Plot omitted. Axes: Normalized Performance and Normalized Power, each ranging from 0.0 to 2.0. Series: Simple-DFS, 1.6G, 3.2G, ideal.]

Figure 1: Normalized power and performance running at different frequencies.

Power and performance results running at 1.6 GHz are the baseline for normalization. Running at 3.2 GHz doubles power consumption since frequency doubles, but we only get around 1.8× performance benefit. Using our Simple-DFS policy, the performance benefit is almost linear with the additional power consumption. The ideal case would get the best of both worlds: the performance of running at 3.2 GHz and the power of running at 1.6 GHz. Although it is ideal, we want to get as close to that target as possible.
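The normalization used for Figure 1 can be sketched as follows, assuming performance is measured as the inverse of execution time for the same instruction count. The function name and the input values are illustrative only; the inputs are chosen to reproduce the 1.8× and 2× ratios mentioned above and are not simulator output.

```python
def normalize(runs, baseline="1.6G"):
    """Normalize each run against the baseline run.

    runs: dict mapping run name -> (execution_time, average_power)
    Returns a dict mapping run name -> (normalized_performance, normalized_power).
    """
    t0, p0 = runs[baseline]
    # Performance is inversely proportional to time, so it is t0 / t.
    return {name: (t0 / t, p / p0) for name, (t, p) in runs.items()}


if __name__ == "__main__":
    # Arbitrary units, picked to match the ratios quoted in the text.
    runs = {"1.6G": (1.8, 1.0), "3.2G": (1.0, 2.0)}
    for name, (perf, power) in normalize(runs).items():
        print(f"{name}: performance={perf:.2f} power={power:.2f}")
```

The ideal point described above corresponds to the 3.2 GHz performance paired with the 1.6 GHz power.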

YOUR JOB: Design and implement your own scaling policy in sample.cpp to get closer to the ideal case. Table 1 shows some stats from XIOSim, which may (or may not) be useful for your policy.

You need to add #include "zesto-uncore.h" to XIOSim/zesto-dvfs.cpp if you need


need branch predictor stats.

Stats                                Notes
core->stat.commit_insn               committed instructions
core->sim_cycle                      simulated core cycles
uncore->LLC->stat.core_lookups[0]    number of lookups in the LLC
uncore->LLC->stat.core_misses[0]     number of misses in the LLC
core->fetch->bpred->num_updates      number of predictions the branch predictor makes
core->fetch->bpred->num_hits         number of correct branch predictions

Table 1: Stats you may need for your scaling policy.

You need to add

uncore_cfg.dvfs_cfg.config : "sample"

uncore_cfg.dvfs_cfg.interval : 1000000

to the A.cfg file replacements in your run scripts. dvfs_cfg.config sets the name of the DVFS policy to use; dvfs_cfg.interval sets how frequently we update the frequency (in cycles).

Rebuild the simulator every time you change sample.cpp. Do a make under XIOSim followed by a make under XIOSim/pintool (see footnote 2).

To quickly test your policy, you can use the step micro-benchmark in the XIOSim directory. Just change pintool/benchmarks.cfg to point to ../tests/step, and run pintool/run.sh as in Section 1 (d). For the testing run, you need to set dvfs_cfg.interval to 20000 since the micro-benchmark is relatively short. After you test with the simple benchmark, you can run your DVFS policy with SPEC and a longer DVFS interval.

Run the two SPEC benchmarks in the following three cases:

A. fixed 1.6 GHz

B. fixed 3.2 GHz

C. dynamic frequency using your DFS policy

Generate figures similar to Figure 1 for each benchmark and explain your findings.

Footnote 2: Yeah, I know this two-step build thing is ugly. Fixing it is on my to-do, I promise.


6 Submission Instructions

Please present your results/figures/findings for all the problems in a single PDF file. Send the PDF file along with your frequency scaling policy file (sample.cpp) to [email protected].

References

[1] Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth limitations of future microprocessors. In Computer Architecture (ISCA), 1996.

[2] Svilen Kanev, Gu-Yeon Wei, and David Brooks. XIOSim: power-performance modeling of mobile x86 cores. In Low Power Electronics and Design (ISLPED), 2012.