Benchmarks - Reducing main memory access latency through SDRAM address mapping techniques and access reordering mechanisms

Table 3.2: Revised M5 baseline machine configuration

M5 v1.1 configuration
CPU | 4GHz, 8-way out-of-order execution, 32 LSQ, 196 ROB
L1 cache | 128KB I-cache and 128KB D-cache, 2-way, 64B cache line
L2 cache | 2MB, 16-way, 64B cache line

SDRAM module v2.0 configuration
Front Side Bus | 64-bit, 800MHz (1.6GHz data rate)
Main memory | 4GB DDR2 800 (PC2-6400) SDRAM, 5-5-5-17, 64-bit data bus, burst length 8
Channel/Rank/Bank | 2/4/4
Virtual paging | Not available
Controller policy | Open Page
Address mapping | Page interleaving
Access reordering | Bank in order scheduling
Memory access pool | 256
Maximal writes | 64

policy and conventional page interleaving address mapping are used to isolate the contributions of controller policy and address mapping techniques. Read accesses and write accesses share a memory access pool. The size of the memory access pool (256) is chosen so that it does not become a bottleneck. Because writes need extra space to store the write data, the memory access pool can hold a maximum of 64 writes.
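The pool limits described above can be illustrated with a minimal sketch. The class and method names are hypothetical, and the admission rule is an assumption made for illustration; the text does not describe how the simulator implements this check. Reads and writes share 256 entries, and writes are additionally capped at 64.

```python
class AccessPool:
    """Hypothetical model of a shared read/write memory access pool."""

    def __init__(self, capacity=256, max_writes=64):
        self.capacity = capacity      # total entries shared by reads and writes
        self.max_writes = max_writes  # writes also need data storage, so they are capped
        self.reads = 0
        self.writes = 0

    def can_accept(self, is_write):
        """Return True if a new access of the given kind fits in the pool."""
        if self.reads + self.writes >= self.capacity:
            return False
        if is_write and self.writes >= self.max_writes:
            return False
        return True

    def add(self, is_write):
        """Admit an access if possible; return whether it was admitted."""
        if not self.can_accept(is_write):
            return False
        if is_write:
            self.writes += 1
        else:
            self.reads += 1
        return True


# Offer more writes and reads than the pool can hold.
pool = AccessPool()
accepted_writes = sum(pool.add(is_write=True) for _ in range(100))
accepted_reads = sum(pool.add(is_write=False) for _ in range(300))
print(accepted_writes, accepted_reads)  # 64 192
```

In this sketch the write cap binds first (64 of 100 writes admitted), after which reads fill the remaining 192 entries.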

3.4 Benchmarks

SPEC CPU2000 is the fourth major version of the Standard Performance Evaluation Corporation (SPEC) CPU benchmark suites, which in 1989 became the first widely accepted standard for comparing compute-intensive performance across various architectures [65].

SPEC CPU2000 comprises two sets of benchmarks: CINT2000 and CFP2000. CINT2000 comprises 12 application-based benchmarks written in C and C++ for measuring compute-intensive integer performance. CFP2000 comprises 14 benchmarks written in FORTRAN (77 and 90) and C for measuring compute-intensive floating point performance. The two suites measure the performance of a computer's processor, memory architecture and compiler. Compared to the previous version, SPEC CPU95, the CPU2000 suite offers longer run times, larger problems for benchmarks and more application diversity.

The SPEC CPU2000 benchmark suite is used in the SDRAM address mapping and access reordering studies, except for the sixtrack benchmark, which could not complete due to insufficient floating point precision of the simulators. Some benchmarks are not memory intensive, so the memory optimization techniques being studied in this thesis may have little impact on them. However, results from all of the remaining 25 SPEC CPU2000 benchmarks will be presented in Chapter 4 and Chapter 5 to prevent biasing the results [12]. Pre-compiled little-endian Alpha ISA SPEC2000 binaries with reference input sets are used [11]. Table 3.3 lists all simulated benchmarks and their command line parameters.

Besides the SPEC CPU2000 benchmarks, SPEC CPU95 and other benchmarks such as STREAM are also used during the studies [64, 32]. Because these benchmarks are mainly used for testing and debugging, their simulation results are not shown in this thesis.

3.4.1 Number of Instructions to Simulate

With reference input sets, simulation times of some SPEC CPU2000 benchmarks can be extremely long, especially with the M5 simulator. Therefore only a selected number of instructions are simulated.

Table 3.3: SPEC CPU2000 benchmark suites and command line parameters

CINT2000, 11 applications written in C and 1 in C++ (252.eon)
Name | Remarks | Input sets
164.gzip | Data compression utility | gzip.input.source 60
175.vpr | FPGA circuit placement and routing | net.in arch.in place.out dum.out -nodisp -place_only -init_t 5 -exit_t 0.005 -alpha_t 0.9412 -inner_num 2
176.gcc | C compiler | 166.i -o 166.s
181.mcf | Minimum cost network flow solver | mcf.inp.in
186.crafty | Chess program | < crafty.in
197.parser | Natural language processing | 2.1.dict -batch < parser.ref.in
252.eon | Ray tracing | chair.control.cook chair.camera chair.surfaces chair.cook.ppm ppm pixels_out.cook
253.perlbmk | Perl | -I./lib diffmail.pl 2 550 15 24 23 100
254.gap | Computational group theory | -l ./ -q -m 192M < gap.ref.in
255.vortex | Object Oriented Database | lendian1.raw
256.bzip2 | Data compression utility | bzip2.input.source 58
300.twolf | Place and route simulator | ref

CFP2000, 14 applications (6 Fortran-77, 4 Fortran-90 and 4 C)
Name | Remarks | Input sets
168.wupwise | Quantum chromodynamics |
171.swim | Shallow water modeling | < swim.in
172.mgrid | Multi-grid solver in 3D potential field | < mgrid.in
173.applu | Parabolic/elliptic partial differential equations | < applu.in
177.mesa | 3D Graphics library | -frames 1000 -meshfile mesa.in -ppmfile mesa.ppm
178.galgel | Fluid dynamics: analysis of oscillatory instability | < galgel.in
179.art | Neural network simulation; adaptive resonance theory | -scanfile c756hel.in -trainfile1 a10.img -trainfile2 hc.img -stride 2 -startx 110 -starty 200 -endx 160 -endy 240 -objects 10
183.equake | Finite element simulation; earthquake modeling | < equake.inp.in
187.facerec | Computer vision: recognizes faces | < facerec.ref.in
188.ammp | Computational chemistry | < ammp.in
189.lucas | Number theory: primality testing | < lucas2.in
191.fma3d | Finite element crash simulation |
200.sixtrack | Particle accelerator model | < sixtrack.inp.in
301.apsi | Solves problems regarding temperature, wind, velocity and distribution of pollutants |

The choice of the number of instructions to simulate is based on simulation time and on how quickly the caches warm up. The simulations should complete within a reasonable time, yet enough instructions should be simulated to expose the major behaviors of the benchmarks. Fast forwarding, commonly used to skip the cache warm-up stage, is not employed in the studies presented in this thesis. This is because the cache warm-up stage produces many cache misses (main memory accesses), which leaves more room for performance improvement by memory optimization techniques.

For SDRAM address mapping studies, the first 2^32 (about 4.3 billion) instructions are simulated for all benchmarks. For access reordering studies, the first 2 billion instructions of the CPU2000 benchmarks are simulated.
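The two instruction budgets stated above can be checked with a few lines of arithmetic. The constant names are hypothetical and serve only this illustration.

```python
# Instruction budgets as stated in the text; the names are illustrative only.
SDRAM_MAPPING_INSTRUCTIONS = 2 ** 32
ACCESS_REORDERING_INSTRUCTIONS = 2_000_000_000

print(SDRAM_MAPPING_INSTRUCTIONS)                   # 4294967296
print(round(SDRAM_MAPPING_INSTRUCTIONS / 1e9, 1))   # 4.3 (billion)

# How much longer the address mapping runs are than the reordering runs.
budget_ratio = SDRAM_MAPPING_INSTRUCTIONS / ACCESS_REORDERING_INSTRUCTIONS
print(round(budget_ratio, 2))                       # 2.15
```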

3.4.2 Main Memory Access Behaviors of Simulated Benchmarks

It is desirable, and even necessary, to know the memory access patterns of the benchmarks that will be used in the studies of SDRAM address mapping and access reordering.

Figure 3.6 shows the total number of read and write memory accesses for the first 2 billion simulated instructions of the SPEC CPU2000 benchmarks (except for sixtrack). The data are obtained with the revised M5 simulator using the baseline machine configuration listed in Table 3.2. Because main memory accesses result from cache misses, their number, after filtering by the caches, is significantly smaller than the number of executed load/store instructions.

Among the 25 benchmarks, gcc, mcf, swim, mgrid, applu, art, facerec, ammp and lucas are the most memory intensive. Benchmark ammp has the largest number of main memory accesses, about one main memory access every 14 instructions. These memory intensive benchmarks are expected to show more significant performance differences than the less memory intensive benchmarks when the proposed SDRAM address mapping techniques and access reordering mechanisms are applied.
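As a rough consistency check on the ammp figure above, the sketch below converts "one access every 14 instructions" into a total access count and an accesses-per-thousand-instructions rate. It assumes the rate holds over the same 2 billion instruction window used for Figure 3.6; the function and variable names are hypothetical, and the resulting count is derived from the stated rate rather than read from the figure.

```python
def accesses_per_kilo_instruction(accesses, instructions):
    """Main memory accesses per 1000 executed instructions."""
    return 1000.0 * accesses / instructions


INSTRUCTIONS = 2_000_000_000   # simulated window (assumed to match Figure 3.6)
INSTRUCTIONS_PER_ACCESS = 14   # ammp, as stated in the text

# Total accesses implied by the stated rate.
implied_accesses = INSTRUCTIONS / INSTRUCTIONS_PER_ACCESS
print(f"{implied_accesses:.3g}")  # 1.43e+08

rate = accesses_per_kilo_instruction(implied_accesses, INSTRUCTIONS)
print(round(rate, 1))             # 71.4
```

The implied total of about 1.43e8 accesses falls within the 0 to 1.6e8 axis range of Figure 3.6.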

Figure 3.7 shows the read/write ratio as percentages. Most accesses of gcc and apsi are writes, while eon and ammp issue few writes. The average read/write ratio of the simulated SPEC CPU2000 benchmarks is 58/42. The read/write ratio can be used in memory access scheduling to make scheduling decisions, as Chapter 5 will discuss. Other characteristics of the main memory access stream, such as locality, will be presented in Chapter 4 during the studies of SDRAM address mapping.
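A read/write ratio of the kind quoted above can be computed from raw access counts, as in the following minimal sketch. The function name and the example counts are hypothetical; the counts are chosen only to reproduce a 58/42 split and are not measurements from the thesis.

```python
def read_write_percent(reads, writes):
    """Return (read %, write %) of all main memory accesses."""
    total = reads + writes
    if total == 0:
        raise ValueError("no accesses recorded")
    read_pct = 100.0 * reads / total
    return read_pct, 100.0 - read_pct


# Hypothetical counts, not taken from the thesis data.
reads, writes = 58_000_000, 42_000_000
print(read_write_percent(reads, writes))  # (58.0, 42.0)
```

The text does not state whether the 58/42 average is a mean of per-benchmark percentages or a ratio of summed counts. The two can differ when benchmarks issue very different numbers of accesses.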

[Figure 3.6 plot: one bar group per simulated benchmark (gzip through apsi); axis: Number of Accesses (x1e8), 0.0 to 1.6; series: Read, Write]

Figure 3.6: Total number of main memory accesses of SPEC CPU2000 benchmarks

[Figure 3.7 plot: one bar per simulated benchmark (gzip through apsi); axis: Percentage, 0.0 to 1.0; series: Read, Write]
