Power-Aware High-Performance Scientific Computing

(1)

Power-Aware High-Performance Scientific Computing

Padma Raghavan

Scalable Computing Laboratory

Department of Computer Science and Engineering, The Pennsylvania State University

http://www.cse.psu.edu/~raghavan

(2)

Trends: Microprocessor Design & HPC

• Microprocessor design
  - Gordon Moore, 1966: 2x the number of transistors every 18 months => focus on peak rates, LAPACK benchmarks with dense codes
  - Patrick Gelsinger, 2004: ‘power is the only real limiter…’ (DAC keynote)

• HPC and science through simulation
  - High costs of installation, cooling
  - Petascale system is infeasible without new low-power designs (Simon, Boku …)
  - Gap between peak (TOP500) and sustained rates on real workloads
  - Petascale instrument vs. desktop supercomputing

(3)

Why Sparse Scientific Codes

• Sparse codes (irregular meshes, matrices, graphs), unlike tuned dense codes, do not operate at peak rates (despite tuning)

• Sparse codes represent scalable formulations for many applications but …
  - Limited data locality and data re-use
  - Memory and network latency bound
  - Load imbalances despite partitioning/re-partitioning
  - Multiple algorithms, implementations with different quality/performance trade-offs

• Present many opportunities for adaptive

(4)

Sparse Codes and Data

• Example: Sparse y = Ax
  - Used in many PDE simulations in explicit codes, in implicit codes with linear system solution, and in data clustering with K-means

• Ordering (RCM) to get locality of access in x
• Data locality and data reuse for elements of x
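The sparse product y = Ax on this slide can be sketched as follows. This is a minimal illustration assuming compressed sparse row (CSR) storage, which the slide does not specify; the function name `csr_matvec` and the small tridiagonal matrix are hypothetical. The indirect access `x[col_idx[k]]` is where ordering (e.g. RCM) affects locality and reuse of the elements of x.

```python
def csr_matvec(n_rows, row_ptr, col_idx, values, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access into x: its locality depends on the matrix ordering.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 tridiagonal matrix:
# [ 2 -1  0 ]
# [-1  2 -1 ]
# [ 0 -1  2 ]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
y = csr_matvec(3, row_ptr, col_idx, values, x)
```

In a banded matrix like this one, consecutive rows touch neighbouring entries of x, which is the kind of access pattern a bandwidth-reducing ordering aims for.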

(5)

This Presentation



• Microprocessor/network architectural optimizations X Application features

• PxP results for sparse scientific computing
  - Optimizing CPU + Memory for sparse PxP
  - PxP models for adaptive feature selection
  - PxP trends on MPPs with CPU+Link scaling

(6)

PxP Results - I



• Characterizing power reductions and performance improvements for a single node, i.e., CPU + Memory

• There is locality of data access in many sparse codes when matrices are reordered, the right data structures are used, etc.

(7)

Power-Aware + High Performance Computing

• Power of CMOS chips: $P = C \cdot V_{dd}^2 \cdot f + V_{dd} \cdot I_{leak}$
• Typically higher performance = higher f with higher transistor counts => thermal limits
• Tuning power
  - DVS: Dynamic voltage and frequency scaling for CPUs
  - Drowsy/low-power modes of caches, DRAM memory banks
  - ABB: Adaptive body biasing, reduces I_leak
• If these low-power knobs are exposed in the ISA, they can be used to control power in applications
• If some of the power savings are directed to memory/network optimizations, we can increase performance while lowering power, for PxP reductions in energy
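The CMOS power relation on this slide can be evaluated directly to see why DVS is effective. The sketch below uses the slide's formula; the function name `cmos_power` and all operating-point values (capacitance, voltages, leakage current) are hypothetical and are not taken from the experiments.

```python
def cmos_power(c_eff, v_dd, freq_hz, i_leak):
    """P = C * Vdd^2 * f + Vdd * I_leak (dynamic plus leakage power, in watts)."""
    dynamic = c_eff * v_dd ** 2 * freq_hz
    leakage = v_dd * i_leak
    return dynamic + leakage


# Hypothetical operating points, chosen only for illustration.
base = cmos_power(c_eff=1.0e-9, v_dd=1.2, freq_hz=1.0e9, i_leak=0.1)
scaled = cmos_power(c_eff=1.0e-9, v_dd=0.9, freq_hz=0.6e9, i_leak=0.1)
relative_power = scaled / base
```

With these assumed values, lowering the frequency to 60% and the supply voltage to 75% brings power to roughly 37% of the base, because the dynamic term falls with the square of the voltage as well as with the frequency.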

(8)

Methodology

• Cycle-accurate architectural emulations using SimpleScalar, Wattch and CACTI
• Emulate CPU with caches + off-chip DRAM memory, starting with a PowerPC-like core (like a BGL processor)
• Emulate low-power modes
  - Model DVS by scaling frequency and supply voltage
  - Model low-power modes of caches by emulating smaller caches
• Emulate memory subsystem optimizations
  - Extend SimpleScalar/Wattch to add structures for optimizations to

(9)

Base (B) Architecture

• PowerPC-like, 1 GHz core
• 4 MB SRAM L3 (26 cycle latency)
• 2 KB SRAM L2 (7 cycle latency)
• 32 KB SRAM L1 instruction and data caches (1 cycle latency)
• Memory bus: 64 bits
• Memory size 256 MB (9 x

(10)

Architectural Extensions

• Wider memory bus: 128 bits, original 64 (W)
• Memory page policy: Open or Closed (MO)
• Prefetcher (stride 1) in memory controller (MP)
• Prefetcher (stride 1) in L2 cache (LP)
• Load Miss Predictor in L1 cache (LMP)

• Prefetchers can reduce latency if there is locality of access
• If the sparse matrix is highly irregular (inherent or from implementation), an LMP can avoid the latency of the cache hierarchy

(11)

Memory Prefetcher (MP)

• Added a prefetch buffer to the memory controller
  - 16-element table with 128-byte cache lines
  - LRU replacement
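The behaviour of such a buffer can be sketched in software. The model below is illustrative only: it assumes a next-line (stride 1) prefetch on every access and one entry per cache line, and the class `PrefetchBuffer` is a hypothetical name, not a structure from the SimpleScalar/Wattch extensions. Only the table size (16), line size (128 bytes) and LRU replacement come from the slide.

```python
from collections import OrderedDict

LINE_BYTES = 128   # cache line size from the slide
TABLE_SIZE = 16    # number of buffer entries from the slide


class PrefetchBuffer:
    """Stride-1 prefetch buffer with LRU replacement (illustrative model)."""

    def __init__(self, size=TABLE_SIZE):
        self.size = size
        self.lines = OrderedDict()  # line number -> True, least recent first
        self.hits = 0
        self.misses = 0

    def _insert(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)      # refresh recency
            return
        if len(self.lines) >= self.size:
            self.lines.popitem(last=False)    # evict least recently used
        self.lines[line] = True

    def access(self, address):
        """Record one memory access; return True if it hit in the buffer."""
        line = address // LINE_BYTES
        hit = line in self.lines
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self._insert(line)
        self._insert(line + 1)                # stride-1 prefetch of the next line
        return hit


# Sequential 8-byte accesses across four consecutive lines.
sequential = PrefetchBuffer()
for address in range(0, 4 * LINE_BYTES, 8):
    sequential.access(address)

# Accesses that skip every other line defeat a stride-1 prefetcher.
strided = PrefetchBuffer()
for address in (0, 2 * LINE_BYTES, 4 * LINE_BYTES, 6 * LINE_BYTES):
    strided.access(address)
```

In this model the sequential stream misses only on its first access, while the stream that skips lines misses every time, which matches the slide deck's point that prefetchers help only when there is locality of access.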

(12)

L2 Cache Prefetcher (LP)

• Benefits codes with locality of data access but poor data re-use

(13)

Memory Page Policy: Open / Closed (MO)

• Accesses to open rows have lower latency
• Memory control is more complex
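The trade-off between the two page policies can be illustrated with a toy single-bank latency model. Everything here is an assumption made for illustration: the function `dram_cycles`, the timing parameters `t_cas`, `t_act` and `t_pre`, and their values are hypothetical and do not come from the emulated memory system.

```python
def dram_cycles(row_sequence, policy, t_cas=10, t_act=10, t_pre=10):
    """Total latency, in cycles, for a sequence of row ids on one DRAM bank.

    Timing values are hypothetical placeholders.
    """
    total = 0
    open_row = None
    for row in row_sequence:
        if policy == "closed":
            # The row is closed after every access, so each access activates it again.
            total += t_act + t_cas
        elif policy == "open":
            if row == open_row:
                total += t_cas                  # row hit: column access only
            elif open_row is None:
                total += t_act + t_cas          # bank idle: activate, then access
            else:
                total += t_pre + t_act + t_cas  # conflict: close the old row first
            open_row = row
        else:
            raise ValueError("policy must be 'open' or 'closed'")
    return total


local_rows = [0, 0, 0, 0, 1, 1, 1, 1]       # accesses with row locality
irregular_rows = [0, 1, 2, 3, 4, 5, 6, 7]   # every access touches a new row

open_local = dram_cycles(local_rows, "open")
closed_local = dram_cycles(local_rows, "closed")
open_irregular = dram_cycles(irregular_rows, "open")
closed_irregular = dram_cycles(irregular_rows, "closed")
```

Under these assumed timings the open policy wins when consecutive accesses fall in the same row and loses when every access opens a new row, which is why the choice depends on the locality of the sparse code.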

(14)

Load Miss Predictor (LMP)

(15)

Experiments

• Base (B), Wider path (W), Memory page policy (MO), Memory prefetcher (MP), L2-prefetcher (LP), Load Miss Prediction (LMP)
• Base (B) at 1000 MHz
• Sparse codes
  - SMV-U: no blocking, RCM ordering, 4 matrices
  - SMV-O: Sparsity SMV, 2x2 blocking, RCM ordering, 4 matrices
  - NAS MG Benchmark
  - Full-scale application: Driven Cavity Flow
• Metrics: Time, Power, Energy, Ops/J (shown relative to

(16)

Relative Time: All features, 300 MHz to 1 GHz, 256 KB L3

Values < 1 are faster than at base

(17)

Relative Time at 600 MHz, Smaller L3

• X-axis: features added incrementally to include all (B, +W, +MO, +MP, +LP, +LMP)
• Time for each code at B set to 1
• Base at 3
• Over 40% performance improvements
• Without optimizations, 40% performance degradation

(18)

Relative Power at 600 MHz, Smaller L3

• X-axis: features added incrementally to include all (+W, +MO, +MP, +LP, +LMP)
• Power for each code at B set to 1
• Base at 3
• Over 66% power saved from DVS (600 MHz), smallest cache with no performance penalty

(19)

Relative Energy at 600 MHz, Smaller L3

• X-axis: features added incrementally to include all
• Energy for each code at B set to 1
• Base at 3
• Over 80% improvements with all features
• Without optimizations, 40% savings but with performance penalty

(20)

Ops/J at 600 MHz, Smaller L3

• X-axis: features added incrementally to include all
• Ops/J for each code at B set to 1
• Base at 3
• Factor 5 improvement in energy efficiency
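The relative metrics on the last four slides are linked by simple arithmetic, sketched below. It assumes that energy is average power times run time and that the operation count is the same as at base, so relative Ops/J is the reciprocal of relative energy. The inputs are the rounded bounds quoted on the slides (40% faster, 66% lower power), used as illustrative values and not as measured data; the function names are hypothetical.

```python
def relative_energy(rel_time, rel_power):
    """Energy relative to base, assuming energy = average power * time."""
    return rel_time * rel_power


def relative_ops_per_joule(rel_energy):
    """Ops/J relative to base, assuming the same operation count as at base."""
    return 1.0 / rel_energy


rel_time = 0.60    # 40% faster than base
rel_power = 0.34   # 66% lower power than base
rel_energy = relative_energy(rel_time, rel_power)
efficiency_gain = relative_ops_per_joule(rel_energy)
```

These inputs give a relative energy of about 0.2 and an efficiency gain of nearly 5, consistent with the "over 80%" energy improvement and the "factor 5" Ops/J figure.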

(21)

PxP Results - II



• PxP for a 'real' driven cavity flow application with typical complex code/algorithm features

(22)

Driven Cavity: Relative Time, Energy

• With all features, code is faster by 20% even at 400 MHz, with 60% less power, energy

[Figure: relative time and energy; x-axis: All, +W, +MO, +LMP, +MP, +LP]

(23)

PxP Results - III



• Models to select optimal sets of features subject to performance/power constraints
• Detecting phases in application
• Adaptively selecting feature set for each application phase:
  - Reduce power subject to performance constraint
  - Reduce time subject to power constraint

(24)

Optimal Feature Sets

• Least squares fit to derive models of power or time (F = feature set combination) per code
  - Errors of less than 5%
• Define workload, select optimal configuration with power constraints, …
• Example: Best time 2-feature set, even workload, < 50% base power
  - At 600 MHz: W + LP; at 800 MHz: MO + MP

$$T = \sum_{i=1}^{N} a_i F_i$$
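A linear model of this kind can be fitted with ordinary least squares. The sketch below is one possible formulation and not the fitting procedure used for these results: it solves the normal equations in pure Python, adds a constant column as an assumed intercept term, and uses made-up, exactly additive timings for two features (W and LP) purely to show the mechanics. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, targets):
    """Fit coefficients a minimizing ||F a - T||^2 via the normal equations."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return solve_linear(ata, atb)


def predict(coeffs, features):
    """Evaluate T = sum_i a_i * F_i for one feature vector."""
    return sum(a * f for a, f in zip(coeffs, features))


# Hypothetical runs: columns are [constant, W, LP]; 1 means the feature is on.
runs = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
times = [1.00, 0.85, 0.80, 0.65]   # made-up relative times, not measurements
coeffs = least_squares(runs, times)
predicted_all_on = predict(coeffs, [1, 1, 1])
```

Once coefficients are available for both time and power, candidate feature sets can be scored with `predict` without rerunning the emulation, which is what makes model-based selection cheap.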

(25)

S/W Phases & Their H/W Detection



• Different S/W phases can benefit from different H/W features
• Challenges:
  - How do known S/W phases correspond to H/W detectable phases?
  - What H/W metric can be used to detect phase

(26)

(27)

NAS MG: LSQ and 100K cycle window

(28)

MG: Min P, T constraint

Phase        Time constraint  Freq. (MHz)  L3 size  Page policy  LP  MP  LMP  T     P
Restriction  1.2              700          1MB      MO           -   -   -    1.2   0.29
Interp 1-6   1.2              700          1MB      MO           -   p   -    1.19  0.37
Interp 7     1.2              400          4MB      MO           p   p   -    1.15  0.29
Remainder    1.2              600          1MB      MO           p   -   -    1.13  0.3
Restriction  1                700          1MB      MO           p   p   p    0.98  0.37
Interp 1-6   1                800          2MB      MO           p   -   -    0.97  0.48
Interp 7     1                500          1MB      MC           p   -   -    0.92  0.36
Remainder    1                700          1MB      MC           p   -   -    0.97  0.35
Restriction  0.8              800          1MB      MO           -   p   p    0.8   0.49
Interp 1-6   0.8              1000         2MB      MO           p   -   -    0.77  0.85
Interp 7     0.8              700          1MB      MO           -   p   -    0.76  0.5
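The selection rule behind this table, minimum power subject to a time constraint, can be sketched as a filter followed by a minimum. The three candidates below are the Restriction-phase rows of the table (relative time T and power P); treating them as the whole candidate set is a simplification made for this example, since the actual search space of configurations was larger. The function names are hypothetical.

```python
# Restriction-phase rows from the table: relative time and relative power.
candidates = [
    {"freq_mhz": 700, "l3": "1MB", "time": 1.20, "power": 0.29},
    {"freq_mhz": 700, "l3": "1MB", "time": 0.98, "power": 0.37},
    {"freq_mhz": 800, "l3": "1MB", "time": 0.80, "power": 0.49},
]


def min_power_config(configs, max_time):
    """Lowest-power configuration whose relative time meets the constraint."""
    feasible = [c for c in configs if c["time"] <= max_time]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["power"])


def min_time_config(configs, max_power):
    """Fastest configuration whose relative power meets the constraint."""
    feasible = [c for c in configs if c["power"] <= max_power]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["time"])


best_at_base_time = min_power_config(candidates, max_time=1.0)
best_under_half_power = min_time_config(candidates, max_power=0.5)
```

With a time constraint of 1.0 the rule picks the 700 MHz configuration at relative power 0.37, matching the corresponding table row; relaxing the constraint to 1.2 lets it fall back to the 0.29 configuration.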

(29)

All vs. Adaptive (Using LSQ)

[Figure: Min Power, T constraint; Min Time, P constraint; All features on]

(30)

PxP Results: MPPs + MPI codes

• Utilizing load imbalance in tree-structured parallel sparse computations for energy savings
• Apps run for days/weeks; an imbalance of 10% of ideal load/processor ~ hours/days

(31)

Tree-Based Parallel Sparse Computation

• Tree node = dense/sparse data-parallel operations
• Tree structure dictates data dependencies
  - A node depends only on the subtree rooted at the node
  - Computation in disjoint subtrees can proceed independently
• Imbalance (despite best data-mapping) can be 10% of ideal load/processor
• Exploit task-parallelism at lower levels and data-parallelism at higher levels
• Represents Barnes-Hut, FMM N-body tree-codes,

(32)

Example

[Figure: computation tree with nodes N₀ to N₁₂ mapped to processors P₀ to P₆; each node is labeled with its weight (computation/communication), e.g. 70/35, and its participating processors, e.g. [0,3] for processors 0,1,2,3; the critical path is marked]

• Integrated Link/CPU Voltage Scaling to convert imbalance to energy savings without performance penalties (recursive scheme, multiple passes)
• Network topology constrains link scaling
  - Routing requirements cause conflicts
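The idea of converting imbalance into energy savings can be illustrated for the CPU part alone. The sketch below is a simplified model, not the recursive integrated link/CPU scheme from the slide: it assumes independent tasks at one tree level, continuously adjustable frequency, supply voltage proportional to frequency (so dynamic energy per unit of work scales with f squared), and no leakage or link energy. The loads and function names are hypothetical.

```python
def scale_to_critical_path(loads, f_max=1.0):
    """Per-processor frequency so that every processor finishes with the slowest."""
    critical = max(loads)
    return [f_max * load / critical for load in loads]


def relative_dynamic_energy(loads, freqs, f_max=1.0):
    """Dynamic energy relative to running every processor at f_max.

    Assumes V scales linearly with f, so energy per unit of work ~ f^2.
    """
    base = sum(loads)
    scaled = sum(load * (f / f_max) ** 2 for load, f in zip(loads, freqs))
    return scaled / base


loads = [100.0, 90.0, 80.0, 100.0]   # hypothetical per-processor work
freqs = scale_to_critical_path(loads)
finish_times = [load / f for load, f in zip(loads, freqs)]
energy_ratio = relative_dynamic_energy(loads, freqs)
```

Processors with slack run slower but still finish together with the critical path, so the run time is unchanged while dynamic energy drops by about 12% for these assumed loads.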

(33)

Energy Consumption

(34)

Other Results

• Non-uniform cache architectures (NUCA) and CMPs
  - NUCA configurations for scientific computing
  - Utilizing network on chip (NOC) with NUCA
  - Sayaka Akioka (in progress)
• Modeling network PxP
  - TorusSim Tool by Sarah Conner
  - A single collective communication: link shutdown possible for 55%-97% of time

(35)

Summary

• Substantial single processor PxP improvements
  - For kernels, codes and full applications
  - Time 30%-50% faster
  - Power/energy 50%-80% lower
  - Further savings from LSQ-based H/W adaptivity
• Multiprocessor (MPP) PxP scaling trends from CPU-link scaling are promising
  - Near ideal conversion of slack to savings
  - Link shutdown possible 60-97% / collective