Power-Aware High-Performance
Scientific Computing
Padma Raghavan
Scalable Computing Laboratory
Department of Computer Science and Engineering
The Pennsylvania State University
http://www.cse.psu.edu/~raghavan
Trends: Microprocessor Design & HPC
Microprocessor design
Gordon Moore, 1966: 2x the number of transistors every 18 months => focus on
peak rates, LAPACK benchmarks with dense codes
Patrick Gelsinger, 2004: ‘power is the only real limiter…’ DAC Keynote
HPC and science through simulation
High costs of installation, cooling
Petascale system is infeasible without new low-power designs (Simon,
Boku …)
Gap between peak (TOP500) and sustained rates on real workloads
Petascale instrument vs. desktop supercomputing
Why Sparse Scientific Codes
Sparse codes (irregular meshes, matrices, graphs),
unlike tuned dense codes, do not operate at peak rates (despite tuning)
Sparse codes represent scalable formulations for many
applications but …
Limited data locality and data re-use
Memory and network latency bound
Load imbalances despite partitioning/re-partitioning
Multiple algorithms, implementations with different quality/performance trade-offs
Present many opportunities for adaptive
Sparse Codes and Data
Example: Sparse y = Ax
Used in many PDE simulations (in explicit codes, and in implicit codes with linear system
solution) and in data clustering with K-means
•Ordering (RCM) to get locality of access in x
•Data locality and data reuse for elements of x
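A minimal sketch of y = Ax with A stored in compressed sparse row (CSR) form, assuming Python. The names `csr_matvec` and `bandwidth` and the small test matrix are illustrative and do not come from the talk. The indirect read `x[col_idx[k]]` is where locality in x matters: an ordering such as RCM clusters column indices near the diagonal, so consecutive rows touch nearby elements of x.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A x for a matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access to x
        y[i] = acc
    return y


def bandwidth(row_ptr, col_idx):
    """Largest |i - j| over stored entries; smaller keeps accesses to x closer."""
    return max(
        abs(i - col_idx[k])
        for i in range(len(row_ptr) - 1)
        for k in range(row_ptr[i], row_ptr[i + 1])
    )


# Hypothetical 4x4 tridiagonal matrix with 2 on the diagonal, -1 off it.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
y = csr_matvec(row_ptr, col_idx, values, [1.0, 1.0, 1.0, 1.0])
```

Each nonzero is used once per product, so there is little re-use of the matrix data itself; the only re-use available is in x, which is why the ordering matters.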
This Presentation
Microprocessor/network architectural
optimizations X Application features
PxP results for sparse scientific computing
Optimizing CPU + Memory for sparse PxP
PxP models for adaptive feature selection
PxP trends on MPPs with CPU+Link scaling
PxP Results - I
Characterizing power reductions and performance improvements
for a single node, i.e., CPU + Memory
There is locality of data access in many sparse codes when matrices
are reordered, the right data structures are used, etc.
Power-Aware + High-Performance Computing
Power of CMOS chips: $P = C \cdot V_{dd}^{2} \cdot f + V_{dd} \cdot I_{leak}$
Typically, higher performance = higher f with higher transistor
counts, leading to thermal limits
Tuning Power
DVS: Dynamic voltage and frequency scaling for CPUs
Drowsy/low-power modes of caches, DRAM memory banks
ABB: Adaptive body biasing, reduces Ileak
If these low-power knobs are exposed in the ISA, they can be
used to control power in applications
If some of the power savings are redirected to memory/network
optimizations, we can increase performance while lowering power, for PxP reductions in energy
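A small sketch of the CMOS power expression on this slide and of what DVS does to it, assuming Python. The function names are illustrative. The scaling rule in `relative_power_dvs` (supply voltage scaled in proportion to frequency, leakage current held fixed) is a simplifying assumption, not something stated in the talk; real parts have a voltage floor and a discrete set of operating points.

```python
def cmos_power(c_eff, vdd, freq, i_leak):
    """P = C * Vdd^2 * f + Vdd * Ileak (dynamic plus leakage power)."""
    return c_eff * vdd ** 2 * freq + vdd * i_leak


def relative_power_dvs(scale, leak_fraction=0.0):
    """Power relative to nominal when f and Vdd are both multiplied by `scale`.

    Assumption: Vdd tracks f linearly and Ileak is unchanged, so the dynamic
    term scales as scale**3 and the leakage term as scale. `leak_fraction`
    is the share of nominal power that is leakage.
    """
    return (1.0 - leak_fraction) * scale ** 3 + leak_fraction * scale


# Hypothetical operating point: 600 MHz relative to a 1 GHz nominal clock.
dynamic_only = relative_power_dvs(0.6)
with_leakage = relative_power_dvs(0.6, leak_fraction=0.2)
```

Under these assumptions the dynamic term falls much faster than the clock rate, which is what leaves room to spend part of the savings on memory or network optimizations.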
Methodology
Cycle-accurate architectural emulations using
SimpleScalar, Wattch and CACTI
Emulate CPU with caches + off chip DRAM memory
starting with a PowerPC-like core (like a BGL processor)
Emulate low power modes
Model DVS by scaling frequency and supply voltage
Model low power modes of caches by emulating smaller caches
Emulate memory subsystem optimizations
Extend SimpleScalar/Wattch to add structures for optimizations to
Base (B) Architecture
PowerPC-like, 1 GHz core
4 MB SRAM L3 (26 cycle latency)
2 KB SRAM L2 (7 cycle latency)
32 KB SRAM L1 instruction and data caches (1 cycle latency)
Memory bus: 64 bits
Memory size 256 MB (9 x
Architectural Extensions
Wider memory bus: 128 bits, original 64 (W)
Memory page policy: Open or Closed (MO)
Prefetcher (stride 1) in memory controller (MP)
Prefetcher (stride 1) in L2 cache (LP)
Load Miss Predictor in L1 cache (LMP)
Prefetchers can reduce latency if there is locality of access
If the sparse matrix is highly irregular (inherent or from
implementation), an LMP can avoid the latency of the cache hierarchy
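To illustrate why a stride-1 prefetcher helps only when accesses have locality, here is a toy model, assuming Python. The class `PrefetchBuffer` and the function `hit_rate` are hypothetical names. The default sizes (16 entries, 128-byte lines, LRU replacement) follow the figures given for the memory prefetcher in this talk, but the behaviour is a simplification and not the emulated hardware: every access triggers a prefetch of the next line, and demand-fetched lines are assumed to go to the caches rather than into the buffer.

```python
from collections import OrderedDict


class PrefetchBuffer:
    """Toy stride-1 (next-line) prefetch buffer with LRU replacement."""

    def __init__(self, entries=16, line_bytes=128):
        self.entries = entries
        self.line_bytes = line_bytes
        self.lines = OrderedDict()  # line number -> None, least recent first
        self.hits = 0
        self.misses = 0

    def _insert(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.entries:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = None

    def access(self, address):
        line = address // self.line_bytes
        if line in self.lines:
            self.hits += 1
            self.lines.move_to_end(line)
        else:
            self.misses += 1
        self._insert(line + 1)  # stride-1 prefetch of the following line


def hit_rate(addresses):
    """Fraction of accesses served from the prefetch buffer."""
    buf = PrefetchBuffer()
    for address in addresses:
        buf.access(address)
    return buf.hits / (buf.hits + buf.misses)


sequential = [i * 128 for i in range(100)]     # consecutive lines
strided = [i * 7 * 128 for i in range(100)]    # jumps of 7 lines
```

In this model the sequential trace misses only on its first access, while the strided trace never finds its line in the buffer, which is the case where a load miss predictor is the more useful feature.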
Memory Prefetcher (MP)
Added a prefetch buffer to the memory controller
16 element table with 128 byte cache line
LRU replacement
L2 Cache Prefetcher (LP)
Benefits codes with locality of data access but poor data re-use
Memory Page Policy: Open / Closed (MO)
• Accesses to open rows have lower latency
• Memory control is more complex
Load Miss Predictor (LMP)
Experiments
Base (B), Wider path (W), Memory page policy (MO),
Memory prefetcher (MP), L2-prefetcher (LP), Load Miss Prediction (LMP)
Base (B) at 1000 MHz
Sparse codes
SMV-U: no blocking, RCM ordering, 4 matrices
SMV-O: Sparsity SMV, 2x2 blocking, RCM ordering, 4 matrices
NAS MG Benchmark
Full scale application: Driven Cavity Flow
Metrics: Time, Power, Energy, Ops/J (shown relative to
Relative Time: All features, 300 MHz to 1 GHz, 256 K L3
Values < 1 are faster than at base
Relative Time at 600 MHz, Smaller L3
• X-axis: features added incrementally to include all (B, +W, +MO, +MP, +LP, +LMP)
• Time for each code at B set to 1
• Base at 3
• Over 40% performance improvements
• Without optimizations 40% performance degradation
Relative Power at 600 MHz, Smaller L3
• X-axis: features added incrementally to include all (+W, +MO, +MP, +LP, +LMP)
• Power for each code at B set to 1
• Base at 3
• Over 66% power saved from DVS (600 MHz), smallest cache with no performance penalty
Relative Energy at 600 MHz, Smaller L3
• X-axis: features added incrementally to include all
• Energy for each code at B set to 1
• Base at 3
•Over 80% improvements with all features
•Without optimizations 40 % savings but with performance penalty
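The time, power, energy and Ops/J figures are linked by simple arithmetic, sketched here assuming Python. The inputs are the round numbers quoted on these slides, read as a relative power of about 0.34 and a relative time of about 0.60; that reading, and combining the two figures in one product, are assumptions made only to show the relationship, since the plotted values differ per code.

```python
def relative_energy(rel_power, rel_time):
    """Energy = average power x time, so values relative to base multiply."""
    return rel_power * rel_time


def relative_ops_per_joule(rel_energy):
    """For a fixed operation count, Ops/J scales as 1 / energy."""
    return 1.0 / rel_energy


rel_power = 1.0 - 0.66  # "over 66% power saved"
rel_time = 1.0 - 0.40   # "over 40% performance improvement", read as 0.6x time
rel_e = relative_energy(rel_power, rel_time)
ops_per_joule_gain = relative_ops_per_joule(rel_e)

print(f"relative energy: {rel_e:.3f}")
print(f"relative Ops/J:  {ops_per_joule_gain:.2f}")
```

With these inputs the relative energy comes out near 0.2 and the Ops/J gain near 5, in line with the "over 80%" energy improvement and the factor 5 in energy efficiency reported for the full feature set.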
Ops/J at 600 MHz, Smaller L3
• X-axis: features added incrementally to include all
• Ops/J for each code at B set to 1
• Base at 3
• Factor 5 improvement in energy efficiency
PxP Results - II
PxP for a 'real' driven cavity flow application
with typical complex code/algorithm features
Driven Cavity: Relative Time, Energy
With all features, code is faster by 20% even at
400MHz, with 60% less power, energy
[Figure: relative time and energy for the driven cavity code with +W, +MO, +LMP, +MP, +LP and all features]
PxP Results - III
Models to select optimal sets of features
subject to performance/power constraints
Detecting phases in application
Adaptively selecting feature set for each
application phase:
Reduce power subject to performance constraint
Reduce time subject to power constraint
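The two selection problems can be sketched as a filter followed by a minimum, assuming Python. The `Config` type, the function names and every number in `candidates` are hypothetical and chosen only for illustration; in the talk the time and power of each configuration come from fitted models rather than a fixed list.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    rel_time: float   # time relative to base (B)
    rel_power: float  # power relative to base (B)


def min_power(configs, max_time):
    """Lowest-power configuration whose relative time meets the constraint."""
    feasible = [c for c in configs if c.rel_time <= max_time]
    return min(feasible, key=lambda c: c.rel_power) if feasible else None


def min_time(configs, max_power):
    """Fastest configuration whose relative power meets the constraint."""
    feasible = [c for c in configs if c.rel_power <= max_power]
    return min(feasible, key=lambda c: c.rel_time) if feasible else None


# Hypothetical candidates for one application phase (invented numbers).
candidates = [
    Config("600 MHz, MO", 1.13, 0.30),
    Config("700 MHz, MO+LP", 0.97, 0.35),
    Config("800 MHz, MO+LP", 0.90, 0.48),
    Config("1000 MHz, all features", 0.77, 0.85),
]

no_slowdown = min_power(candidates, max_time=1.0)
half_power = min_time(candidates, max_power=0.5)
```

Running the same selection once per detected phase, with that phase's own candidates, gives a different feature set for each phase.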
Optimal Feature Sets
Least squares fit to derive models of power or time (F – feature set
combination) per code
Errors of less than 5%
Define workload, select optimal configuration with power constraints,…
Example: Best time 2-feature set, even workload, < 50% base power
At 600 MHz: W + LP; At 800 MHz: MO + MP
$T = \sum_{i=1}^{N} a_i F_i$
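A minimal sketch of a least squares fit of relative time to feature indicators, assuming Python and a model that is linear in the features. The helper names, the constant term in the first column and the four sample measurements are all assumptions for illustration; the talk does not give the fitted data. The fit solves the normal equations directly, which is adequate for a handful of features but less robust than a QR-based solver.

```python
def solve(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_linear_model(feature_rows, observed):
    """Coefficients a minimizing sum over runs of (observed - sum_i a[i] * F[i])**2."""
    n = len(feature_rows[0])
    ata = [[sum(row[i] * row[j] for row in feature_rows) for j in range(n)]
           for i in range(n)]
    atb = [sum(row[i] * t for row, t in zip(feature_rows, observed))
           for i in range(n)]
    return solve(ata, atb)


def predict(coeffs, features):
    return sum(a * f for a, f in zip(coeffs, features))


# Columns: constant term, W enabled, LP enabled (hypothetical measurements).
runs = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
rel_times = [1.00, 0.90, 0.80, 0.70]
coeffs = fit_linear_model(runs, rel_times)
```

Once coefficients exist for time and for power, every feature combination can be scored without further emulation, which is what makes a search under a power or time constraint cheap.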
S/W Phases & Their H/W Detection
Different S/W phases can benefit from
different H/W features
Challenges:
How do known s/w phases correspond to h/w
detectable phases?
What H/W metric can be used to detect phase
NAS MG: LSQ and 100K cycle window
MG: Min P, T constraint
Phase        Time Constraint  Freq. (MHz)  L3 size  Page policy  LP  MP  LMP  T     P
Restriction  1.2              700          1MB      MO           -   -   -    1.2   0.29
Interp 1-6   1.2              700          1MB      MO           -   p   -    1.19  0.37
Interp 7     1.2              400          4MB      MO           p   p   -    1.15  0.29
Remainder    1.2              600          1MB      MO           p   -   -    1.13  0.3
Restriction  1                700          1MB      MO           p   p   p    0.98  0.37
Interp 1-6   1                800          2MB      MO           p   -   -    0.97  0.48
Interp 7     1                500          1MB      MC           p   -   -    0.92  0.36
Remainder    1                700          1MB      MC           p   -   -    0.97  0.35
Restriction  0.8              800          1MB      MO           -   p   p    0.8   0.49
I 1-6        0.8              1000         2MB      MO           p   -   -    0.77  0.85
I 7          0.8              700          1MB      MO           -   p   -    0.76  0.5
All vs. Adaptive (Using LSQ)
[Figure: compares Min Power with T constraint, Min Time with P constraint, and All features on]
PxP Results: MPPs + MPI codes
Utilizing load imbalance in tree-structured
parallel sparse computations for energy savings
Apps run for days/weeks, so an imbalance of 10% of ideal
load/processor ~ hours/days
Tree-Based Parallel Sparse Computation
Tree node = dense/sparse data-parallel operations
Tree structure dictates data-dependencies
A node depends only on subtree rooted at the node
Computation in disjoint subtrees can proceed independently
Imbalance (despite best data-mapping) can be 10% of ideal load/processor
Exploit task-parallelism at lower levels and
data-parallelism at higher levels
Represents Barnes-Hut, FMM N-body tree-codes,
Example
[Figure: example tree with nodes N0 to N12 on processors P0 to P6. Legend: critical path; weight (computation/communication), e.g. 70/35; participating processors, e.g. [0,3]]
• Integrated Link/CPU Voltage Scaling to convert imbalance to energy savings without performance penalties (recursive scheme, multiple passes)
• Network topology constrains link scaling
Routing requirements cause conflicts
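The idea of converting imbalance into energy savings can be illustrated for a single level of independent subtrees, assuming Python. This is a deliberately simplified sketch and not the recursive link/CPU scheme of the talk: it ignores communication and link scaling, the loads are hypothetical, and it assumes voltage scales in proportion to frequency so that energy per unit of work goes as the square of the frequency ratio.

```python
def scale_to_critical_path(loads, f_max=1.0, f_min=0.0):
    """Per-processor frequency so each one finishes with the most loaded one."""
    critical = max(loads)
    return [max(f_min, f_max * load / critical) for load in loads]


def relative_energy(loads, freqs, f_max=1.0):
    """Energy relative to running every processor at f_max.

    Assumption: Vdd tracks f linearly, so energy per unit of work scales as
    (f / f_max)**2; idle and leakage energy are ignored.
    """
    scaled = sum(load * (f / f_max) ** 2 for load, f in zip(loads, freqs))
    return scaled / sum(loads)


# Hypothetical computation weights for four processors at one tree level.
loads = [100.0, 90.0, 80.0, 100.0]
freqs = scale_to_critical_path(loads)
finish_times = [load / f for load, f in zip(loads, freqs)]
savings = 1.0 - relative_energy(loads, freqs)
```

Every processor now finishes at the time set by the critical path, so the slack is spent running slower rather than idling, and completion time is unchanged.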
Energy Consumption
Other Results
Non-uniform cache architectures (NUCA) and CMPs
NUCA configurations for scientific computing
Utilizing network on chip (NOC) with NUCA
Sayaka Akioka (in progress)
Modeling network PxP
TorusSim Tool by Sarah Conner
A single collective communication: link shutdown possible for
55%-97% of time
Summary
Substantial single processor PxP improvements
For kernels, codes and full applications
Time 30%–50% faster
Power/energy 50%–80% lower
Further savings from LSQ-based H/W adaptivity
Multiprocessor (MPP) PxP scaling trends from
CPU-link scaling are promising
Near ideal conversion of slack to savings
Link shutdown possible 60-97% /collective