
Academic year: 2021


(1)

Power-Aware High-Performance Scientific Computing

Padma Raghavan

Scalable Computing Laboratory

Department of Computer Science and Engineering, The Pennsylvania State University

http://www.cse.psu.edu/~raghavan

(2)

Trends: Microprocessor Design & HPC

 Microprocessor design

 Gordon Moore, 1965: 2x the number of transistors every 18 months => focus on peak rates, LAPACK benchmarks with dense codes

 Patrick Gelsinger, 2004: ‘power is the only real limiter…’ DAC Keynote

 HPC and science through simulation

 High costs of installation, cooling

 Petascale system is infeasible without new low-power designs (Simon, Boku …)

 Gap between peak (TOP500) and sustained rates on real workloads

 Petascale instrument vs. desktop supercomputing

(3)

Why Sparse Scientific Codes?

 Sparse codes (irregular meshes, matrices, graphs), unlike tuned dense codes, do not operate at peak rates (despite tuning)

 Sparse codes represent scalable formulations for many applications but …

   Limited data locality and data re-use

   Memory and network latency bound

   Load imbalances despite partitioning/re-partitioning

   Multiple algorithms, implementations with different quality/performance trade-offs

 Present many opportunities for adaptive

(4)

Sparse Codes and Data

 Example: sparse y = Ax

 Used in many PDE simulations (explicit codes, implicit codes with linear system solution) and in data clustering with K-means

•Ordering (RCM) to get locality of access in x

•Data locality and data reuse for elements of x
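The sparse product y = Ax on this slide can be sketched in a few lines. This is a minimal illustration, assuming compressed sparse row (CSR) storage, which the slides do not specify; the names `csr_matvec` and `bandwidth` and the 4x4 example matrix are hypothetical. The inner loop shows why locality in x depends on the ordering: the element of x that is read is chosen by the column index of each nonzero, so a banded ordering (such as the one RCM tends to produce) keeps those reads close to row i.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A x for a matrix A stored in CSR form (an assumed format)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access: the element of x that is read depends on col_idx[k].
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def bandwidth(row_ptr, col_idx):
    """Largest |i - j| over stored nonzeros; small values keep reads of x near i."""
    return max(
        (abs(i - col_idx[k])
         for i in range(len(row_ptr) - 1)
         for k in range(row_ptr[i], row_ptr[i + 1])),
        default=0,
    )


# Hypothetical 4x4 tridiagonal matrix (1-D Laplacian), already in a banded ordering.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0]

y = csr_matvec(row_ptr, col_idx, values, x)
band = bandwidth(row_ptr, col_idx)
```

Each element of x is reused only by the few rows whose nonzeros reference it, which is the limited data reuse noted above.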

(5)

This Presentation

Microprocessor/network architectural optimizations × application features

PxP results for sparse scientific computing

 Optimizing CPU + Memory for sparse PxP

 PxP models for adaptive feature selection

 PxP trends on MPPs with CPU + Link scaling

(6)

PxP Results - I

Characterizing power reductions and performance improvements for a single node, i.e., CPU + Memory

There is locality of data access in many sparse codes when matrices are reordered, the right data structures are used, etc.

(7)

Power-Aware + High-Performance Computing

 Power of CMOS chips: $P = C \cdot V_{dd}^2 \cdot f + V_{dd} \cdot I_{leak}$

 Typically higher performance = higher f with higher transistor counts → thermal limits

 Tuning power

   DVS: Dynamic voltage and frequency scaling for CPUs

   Drowsy/low-power modes of caches, DRAM memory banks

   ABB: Adaptive body biasing, reduces $I_{leak}$

 If these low-power knobs are exposed in the ISA, they can be used to control power in applications

 If some of the power savings are directed to memory/network optimizations, we can increase performance while lowering power, for PxP reductions in energy
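As a rough numerical illustration of the CMOS power relation on this slide, the sketch below evaluates dynamic plus leakage power at two operating points. The capacitance, leakage current, and voltage values are assumptions chosen only to make the arithmetic visible; they do not come from the presentation, and the function name `cmos_power` is hypothetical.

```python
def cmos_power(c_eff, vdd, freq_hz, i_leak):
    """P = C * Vdd^2 * f + Vdd * Ileak, returned in watts."""
    dynamic = c_eff * vdd ** 2 * freq_hz
    leakage = vdd * i_leak
    return dynamic + leakage


# Assumed (hypothetical) device parameters, not taken from the slides.
C_EFF = 1.0e-9   # effective switched capacitance in farads
I_LEAK = 0.1     # leakage current in amperes

# Base operating point versus a DVS point with lower frequency and supply voltage.
base = cmos_power(C_EFF, vdd=1.2, freq_hz=1.0e9, i_leak=I_LEAK)
scaled = cmos_power(C_EFF, vdd=0.9, freq_hz=600e6, i_leak=I_LEAK)
relative_power = scaled / base
```

Because the dynamic term is quadratic in the supply voltage and linear in frequency, lowering both together reduces power faster than it reduces clock rate, which is what makes DVS attractive when the code is memory bound.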

(8)

Methodology

 Cycle-accurate architectural emulations using SimpleScalar, Wattch and CACTI

 Emulate CPU with caches + off-chip DRAM memory, starting with a PowerPC-like core (like a BGL processor)

 Emulate low-power modes

   Model DVS by scaling frequency and supply voltage

   Model low-power modes of caches by emulating smaller caches

 Emulate memory subsystem optimizations

   Extend SimpleScalar/Wattch to add structures for optimizations to

(9)

Base (B) Architecture

 PowerPC-like, 1 GHz core

 4 MB SRAM L3 (26 cycle latency)

 2 KB SRAM L2 (7 cycle latency)

 32 KB SRAM L1 instruction and data caches (1 cycle latency)

 Memory bus: 64 bits

 Memory size 256 MB (9 x

(10)

Architectural Extensions

 Wider memory bus: 128 bits, original 64 (W)

 Memory page policy: Open or Closed (MO)

 Prefetcher (stride 1) in memory controller (MP)

 Prefetcher (stride 1) in L2 cache (LP)

 Load Miss Predictor in L1 cache (LMP)

 Prefetchers can reduce latency if there is locality of access

 If the sparse matrix is highly irregular (inherent or from implementation), an LMP can avoid the latency of the cache hierarchy

(11)

Memory Prefetcher (MP)

 Added a prefetch buffer to the memory controller

 16 element table with 128 byte cache line

 LRU replacement
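The behavior of such a buffer can be sketched with a toy model. The 16 entries, 128-byte lines, stride-1 policy, and LRU replacement are taken from the slides; everything else (the class name `PrefetchBuffer`, prefetching the next line on every access, and the hit/miss accounting) is an assumption made for illustration and is not the emulated hardware design.

```python
from collections import OrderedDict

LINE_BYTES = 128  # cache line size stated on the slide
ENTRIES = 16      # prefetch table size stated on the slide


class PrefetchBuffer:
    """Toy stride-1 prefetch buffer with LRU replacement (illustrative only)."""

    def __init__(self, entries=ENTRIES, line_bytes=LINE_BYTES):
        self.entries = entries
        self.line_bytes = line_bytes
        self.lines = OrderedDict()  # line number -> True, least recently used first
        self.hits = 0
        self.misses = 0

    def _prefetch(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.entries:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True

    def access(self, address):
        line = address // self.line_bytes
        hit = line in self.lines
        if hit:
            self.hits += 1
            self.lines.move_to_end(line)
        else:
            self.misses += 1
        # Assumed policy: always prefetch the next sequential line (stride 1).
        self._prefetch(line + 1)
        return hit


def run(addresses):
    buf = PrefetchBuffer()
    for address in addresses:
        buf.access(address)
    return buf.hits, buf.misses


sequential = run([i * LINE_BYTES for i in range(10)])      # one access per line
strided = run([3 * i * LINE_BYTES for i in range(10)])     # skips two lines each time
```

In this model the sequential stream misses only on its first line, while the stream that jumps three lines at a time never finds its data in the buffer, matching the point that prefetchers help only when there is locality of access.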

(12)

L2 Cache Prefetcher (LP)

 Benefits codes with locality of data access but poor data re-use

(13)

Memory Page Policy: Open / Closed (MO)

• Accesses to open rows have lower latency

• Memory control is more complex
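The latency difference between the two page policies can be illustrated with a toy single-bank model. The cycle counts for precharge, activate, and column access below are placeholder assumptions, and `dram_cycles` is a hypothetical helper, not part of the emulation setup described in the talk.

```python
def dram_cycles(rows, policy, t_pre=3, t_act=3, t_cas=3):
    """Total access cycles for a sequence of row addresses on one DRAM bank.

    The timing parameters are assumed values for illustration only.
    """
    open_row = None
    total = 0
    for row in rows:
        if policy == "open":
            if row == open_row:
                total += t_cas                    # row hit: column access only
            elif open_row is None:
                total += t_act + t_cas            # bank idle: activate, then access
            else:
                total += t_pre + t_act + t_cas    # row conflict: close the old row first
            open_row = row
        elif policy == "closed":
            # The row is closed after every access, so each access re-activates it.
            total += t_act + t_cas
        else:
            raise ValueError("policy must be 'open' or 'closed'")
    return total


same_row = [5, 5, 5, 5]
different_rows = [1, 2, 3, 4]

open_same = dram_cycles(same_row, "open")
closed_same = dram_cycles(same_row, "closed")
open_diff = dram_cycles(different_rows, "open")
closed_diff = dram_cycles(different_rows, "closed")
```

Under these assumed timings the open policy wins when consecutive accesses fall in the same row and loses when every access touches a new row, which is why the benefit depends on the locality of the access stream.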

(14)

Load Miss Predict

(15)

Experiments

 Base (B), Wider path (W), Memory page policy (MO), Memory prefetcher (MP), L2 prefetcher (LP), Load Miss Prediction (LMP)

 Base (B) at 1000 MHz

 Sparse codes

   SMV-U: no blocking, RCM ordering, 4 matrices

   SMV-O: Sparsity SMV, 2x2 blocking, RCM ordering, 4 matrices

 NAS MG Benchmark

 Full scale application: Driven Cavity Flow

Metrics: Time, Power, Energy, Ops/J (shown relative to

(16)

Relative Time: All Features, 300 MHz–1 GHz, 256 KB L3

Values < 1 are faster than at base

(17)

Relative Time at 600 MHz, Smaller L3

• X-axis: features added incrementally to include all (B, +W, +MO, +MP, +LP, +LMP)

• Time for each code at B set to 1

• Base at 3

• Over 40% performance improvement

• Without optimizations, 40% performance degradation

(18)

Relative Power at 600 MHz, Smaller L3

• X-axis: features added incrementally to include all (+W, +MO, +MP, +LP, +LMP)

• Power for each code at B set to 1

• Base at 3

• Over 66% power saved from DVS (600 MHz), smallest cache with no performance penalty

(19)

Relative Energy at 600 MHz, Smaller L3

• X-axis: features added incrementally to include all

• Energy for each code at B set to 1

• Base at 3

• Over 80% improvement with all features

• Without optimizations, 40% savings but with a performance penalty

(20)

Ops/J at 600 MHz, Smaller L3

• X-axis: features added incrementally to include all

• Ops/J for each code at B set to 1

• Base at 3

• Factor of 5 improvement in energy efficiency
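The four metrics are linked by energy = power x time and Ops/J = operations / energy, so the figures quoted on the preceding slides can be cross-checked with a short calculation. The sketch below assumes that "over 40% performance improvement" means a relative time of 0.60 and that "over 66% power saved" means a relative power of 0.34 for the same configuration; both readings are assumptions, and the function names are hypothetical.

```python
def relative_energy(rel_time, rel_power):
    """Energy relative to base, using energy = power * time."""
    return rel_time * rel_power


def relative_ops_per_joule(rel_time, rel_power):
    """Ops/J relative to base for a fixed operation count."""
    return 1.0 / relative_energy(rel_time, rel_power)


# Assumed readings of the quoted results (relative to base B = 1).
rel_time = 0.60    # 40% faster
rel_power = 0.34   # 66% less power

energy = relative_energy(rel_time, rel_power)
ops_per_joule = relative_ops_per_joule(rel_time, rel_power)
```

Under those assumptions the relative energy comes out near 0.2 and the relative Ops/J near 5, in line with the "over 80%" energy improvement and the factor of 5 reported in the slides.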

(21)

PxP Results - II

PxP for a 'real' driven cavity flow application with typical complex code/algorithm features

(22)

Driven Cavity: Relative Time, Energy

 With all features, code is faster by 20% even at 400 MHz, with 60% less power, energy

[Figure: relative time and energy; x-axis: +W, +MO, +LMP, +MP, +LP, All]

(23)

PxP Results - III

Models to select optimal sets of features subject to performance/power constraints

Detecting phases in application

Adaptively selecting feature set for each application phase:

 Reduce power subject to performance constraint

 Reduce time subject to power constraint
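The two constrained selections listed above can be sketched as an exhaustive search over feature subsets. The additive time and power effects below are made-up integers (in percent of a base configuration) used only to make the search concrete; the feature names come from the slides, but the numbers, the additive model, and the function names are assumptions rather than results from the talk.

```python
from itertools import combinations

FEATURES = ("W", "MO", "MP", "LP", "LMP")

# Hypothetical per-feature effects in percent of base; not measured values.
TIME_DELTA = {"W": -10, "MO": -5, "MP": -12, "LP": -8, "LMP": -3}
POWER_DELTA = {"W": 4, "MO": 1, "MP": 3, "LP": 2, "LMP": 2}
BASE_TIME, BASE_POWER = 100, 30


def predict(feature_set):
    """Predicted (time, power) for a feature set under the assumed additive model."""
    time = BASE_TIME + sum(TIME_DELTA[f] for f in feature_set)
    power = BASE_POWER + sum(POWER_DELTA[f] for f in feature_set)
    return time, power


def all_feature_sets():
    for k in range(len(FEATURES) + 1):
        yield from combinations(FEATURES, k)


def min_power_given_time(max_time):
    """Lowest-power feature set whose predicted time meets the constraint."""
    feasible = [s for s in all_feature_sets() if predict(s)[0] <= max_time]
    return min(feasible, key=lambda s: predict(s)[1], default=None)


def min_time_given_power(max_power):
    """Fastest feature set whose predicted power meets the constraint."""
    feasible = [s for s in all_feature_sets() if predict(s)[1] <= max_power]
    return min(feasible, key=lambda s: predict(s)[0], default=None)


low_power_choice = min_power_given_time(max_time=84)
fast_choice = min_time_given_power(max_power=36)
```

With five on/off features there are only 32 subsets, so brute force is adequate here; adding frequency and cache size as knobs enlarges the space but leaves the structure of the search unchanged.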

(24)

Optimal Feature Sets

 Least squares fit to derive models of power or time (F – feature set combination) per code

   $T = \sum_{i=1}^{N} a_i F_i$

 Errors of less than 5%

 Define workload, select optimal configuration with power constraints, …

 Example: best-time 2-feature set, even workload, < 50% base power

   At 600 MHz: W + LP; at 800 MHz: MO + MP
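A least squares fit of this kind can be sketched as follows, assuming the model is linear in on/off feature indicators, T ≈ Σ a_i F_i, with an always-on first term standing in for the base configuration. The data is synthetic, generated from made-up coefficients so that the fit can be checked; the solver uses the normal equations in plain Python, which is one common approach and not necessarily the one used in the work presented.

```python
from itertools import product


def solve_linear(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(rhs[i])] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def least_squares(rows, targets):
    """Coefficients a minimizing ||F a - T||^2 via the normal equations."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return solve_linear(ata, atb)


# Synthetic example: base term plus three hypothetical features (W, MO, MP).
true_coeffs = [1.0, -0.10, -0.05, -0.12]   # made-up values, not measured data
rows = [[1, w, mo, mp] for w, mo, mp in product((0, 1), repeat=3)]
targets = [sum(a * f for a, f in zip(true_coeffs, row)) for row in rows]

fitted = least_squares(rows, targets)
```

On real measurements the targets would be simulated times or powers for each feature combination, and the residual of the fit would give the model error.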

(25)

S/W Phases & Their H/W Detection

Different S/W phases can benefit from different H/W features

Challenges:

 How do known S/W phases correspond to H/W detectable phases?

 What H/W metric can be used to detect phase

(26)
(27)

NAS MG: LSQ and 100K cycle window

(28)

MG: Min P, T constraint

Phase        Time        Freq.  L3    Page    LP  MP  LMP  T     P
             constraint  (MHz)  size  policy
Restriction  1.2         700    1MB   MO      -   -   -    1.2   0.29
Interp 1-6   1.2         700    1MB   MO      -   p   -    1.19  0.37
Interp 7     1.2         400    4MB   MO      p   p   -    1.15  0.29
Remainder    1.2         600    1MB   MO      p   -   -    1.13  0.3
Restriction  1           700    1MB   MO      p   p   p    0.98  0.37
Interp 1-6   1           800    2MB   MO      p   -   -    0.97  0.48
Interp 7     1           500    1MB   MC      p   -   -    0.92  0.36
Remainder    1           700    1MB   MC      p   -   -    0.97  0.35
Restriction  0.8         800    1MB   MO      -   p   p    0.8   0.49
Interp 1-6   0.8         1000   2MB   MO      p   -   -    0.77  0.85
Interp 7     0.8         700    1MB   MO      -   p   -    0.76  0.5

(29)

All vs. Adaptive (Using LSQ)

[Figure: Min Power, T constraint; Min Time, P constraint; All features on]

(30)

PxP Results: MPPs + MPI Codes

Utilizing load imbalance in tree-structured parallel sparse computations for energy savings

Applications run for days or weeks, so an imbalance of 10% of the ideal load per processor amounts to hours or days

(31)

Tree-Based Parallel Sparse Computation

 Tree node = dense/sparse data-parallel operations

 Tree structure dictates data dependencies

   A node depends only on the subtree rooted at the node

   Computation in disjoint subtrees can proceed independently

 Imbalance (despite best data mapping) can be 10% of ideal load/processor

 Exploit task-parallelism at lower levels and data-parallelism at higher levels

 Represents Barnes-Hut, FMM N-body tree-codes,

(32)

Example

[Figure: tree of nodes N0–N12 mapped to processors P0–P6. Each node is labeled with its weight (computation/communication) and its participating processors, e.g. [0,1], [2,3], [4,5], [4,6], [0,3], [0,6]; the critical path is marked. Node weights shown: 70/35, 100/0, 95/0, 100/0, 100/0, 90/10, 85/10, 100/0, 100/0, 80/10, 120/0, 50/25, 40/25]

• Integrated Link/CPU Voltage Scaling to convert imbalance to energy savings without performance penalties (recursive scheme, multiple passes)

• Network topology constrains link scaling

• Routing requirements cause conflicts
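The idea of converting imbalance into energy savings can be illustrated with a much simpler model than the integrated, recursive link/CPU scheme named above. The sketch below covers CPUs only and a single level of the tree: each processor is slowed so that it finishes together with the most heavily loaded one. It assumes continuous frequency settings, supply voltage proportional to frequency, and no leakage or link energy; the loads and function names are hypothetical.

```python
def scale_to_critical_path(loads, f_max=1.0):
    """Per-processor frequency so every processor finishes with the slowest one."""
    longest = max(loads)
    return [f_max * load / longest for load in loads]


def relative_dynamic_energy(loads, freqs, f_max=1.0):
    """Dynamic energy after scaling, relative to running everything at f_max.

    Assumes energy per cycle is proportional to V^2 and V is proportional to f.
    """
    base = sum(loads)
    scaled = sum(load * (f / f_max) ** 2 for load, f in zip(loads, freqs))
    return scaled / base


# Hypothetical loads in cycles: the second processor has half the work.
loads = [100, 50]
freqs = scale_to_critical_path(loads)
finish_times = [load / f for load, f in zip(loads, freqs)]
rel_energy = relative_dynamic_energy(loads, freqs)
```

In this example the lightly loaded processor runs at half speed, both finish at the same time, and total dynamic energy drops to three quarters of the unscaled value without lengthening the critical path.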

(33)

Energy Consumption

(34)

Other Results

 Non-uniform cache architectures (NUCA) and CMPs

   NUCA configurations for scientific computing

   Utilizing network on chip (NOC) with NUCA

   Sayaka Akioka (in progress)

 Modeling network PxP

   TorusSim Tool by Sarah Conner

   A single collective communication: link shutdown possible for 55%–97% of the time

(35)

Summary

Substantial single-processor PxP improvements

 For kernels, codes and full applications

 Time 30%–50% faster

 Power/energy 50%–80% lower

 Further savings from LSQ-based H/W adaptivity

Multiprocessor (MPP) PxP scaling trends from CPU-link scaling are promising

 Near-ideal conversion of slack to savings

 Link shutdown possible 60%–97% of the time per collective
