• No results found

Advanced High-Performance Computing Features of the OpenPOWER ISA

N/A
N/A
Protected

Academic year: 2021

Share "Advanced High-Performance Computing Features of the OpenPOWER ISA"

Copied!
15
0
0

Loading.... (view fulltext now)

Full text

(1)

Advanced High-Performance

Computing Features of the

OpenPOWER ISA

Workshop on RISC-V and OpenPOWER

International Conference on Supercomputing June 20, 2020

José Moreira – IBM Research Many thanks to: Jeff Stuecheli

Edmund Gieske Brian Thompto

(2)

© 2020 IBM Corporation

Open architectures for supercomputing

• Since the dawn of supercomputing with the CDC 6600, through the many

generations of Seymour Cray machines, massively parallel processors like Blue Gene and the current champion Fugaku, high performance computing has been about three things:

▪ Number crunching: perform as many arithmetic operations as possible ▪ Memory bandwidth: read/write as much data from/to memory as possible ▪ Interconnect: communicate between elements as fast as possible

• Through the OpenPOWER initiative, IBM has made a variety of computing technologies openly available to the community, including the three

HPC-enabling technologies listed above

• In this talk, we will focus on recent OpenPOWER developments in the areas of number crunching and memory bandwidth

• But don’t forget the interconnect if you are building a supercomputer ☺

(3)

© 2020 IBM Corporation

POWER family memory architecture

3

Scale Out

Direct Attach Memory

Scale Up

Buffered Memory

Low latency access

Commodity packaging form factor

Superior RAS, High bandwidth, High Capacity Agnostic interface for alternate memory innovations

Same Open Memory Interface used for all Systems and Memory Technologies OpenCAPI Agnostic Buffered Memory (OMI)

Near Tier Extreme Bandwidth Lower Capacity Commodity Low Latency Low Cost Enterprise RAS Capacity Bandwidth Storage Class Extreme Capacity Persistence

(4)

© 2020 IBM Corporation

Primary tier memory options

4 72b ~2666 MHz bidi 8b 72b ~2666 MHz bidi 8b 8b 8b ~25G dif f 8b 8b B U F 8b ~25G dif f BU F 16b DDR4 RDIMM Capacity ~256 GB BW ~150 GB/sec DDR4 LRDIMM Capacity ~2 TB BW ~150 GB/sec DDR4 OMI DIMM Capacity ~256GB→4 TB BW ~320 GB/sec

BW Opt OMI DIMM Capacity ~128→512 GB BW ~650 GB/sec 1024b On module Si interposer On Module HBM Capacity ~16→32 GB BW ~1 TB/sec Same System Same System Unique System O M I S tr a te g y Only 5-10ns higher load-to-use than RDIMM (< 5ns for LRDIMM)

(5)

© 2020 IBM Corporation

POWER connectivity variants

5

Direct Attach Memory

Max capacity: 2 TB Max bandwidth: 150 GB/s Limited system interconnect

OM I Buffered Memory

Max capacity: 4 TB Max bandwidth: 650 GB/s Enhanced system interconnect

x24 system attach

x24 system attach

4 x DDR4 memory

4 x DDR4 memory

2 x local SMP

Interconnect x48 system attach

x48 system attach 3 x local SMP Advanced interconnect 8 x OMI buffered memory

8 x OMI buffered memory 6x bandwidth

per mm2as

(6)

© 2020 IBM Corporation

DRAM DIMM comparison

6

• Technology agnostic • Low cost

• Ultra-scale system density • Enterprise reliability

• Low-latency • High bandwidth

Approximate Scale

JEDEC DDR DIMM IBM Centaur DIMM

(7)

© 2020 IBM Corporation

Matrix-Multiply Assist (MMA) instructions

• The latest version of Power ISA (for POWER10) is now publicly available

• The Matrix-Multiply Assist instructions lead to very efficient implementations for high performance computing

• These instructions are a natural match for implementing dense numerical linear algebra computations

• We have also shown application to other computations such as convolution • Various other computations require additional work and research, including

arbitrary precision arithmetic, discrete Fourier transform, …

(8)

© 2020 IBM Corporation

Power ISA Vector-Scalar Registers (VSRs)

(9)

© 2020 IBM Corporation

Accumulators

• Accumulators are 4 × 4 arrays of 32-bit elements (we will briefly mention the 64-bit extension later)

𝐴 = 𝑎00 𝑎01 𝑎10 𝑎11 𝑎02 𝑎03 𝑎12 𝑎13 𝑎20 𝑎21 𝑎30 𝑎31 𝑎22 𝑎23 𝑎32 𝑎33

• The elements can be either 32-bit signed-integers (int32) or 32-bit single-precision floating-point numbers (fp32)

• Each accumulator is a software-managed shadow to a set of 4 consecutive VSRs (8 architected accumulators – ACC[0:7]) – must choose between using accumulator or associated VSRs

• State must be explicitly transferred between accumulators and VSRs using

VSX Move From Accumulator (xxmfacc) and VSX Move To Accumulator

(xxmtacc)

(10)

© 2020 IBM Corporation

Outer-product (xv<type>ger<rank-𝑘>) instructions

• Accumulators are updated by rank-𝑘 update instructions:

• Input: 1 accumulator (𝐴) + 2 VSRs (𝑋, 𝑌)

• Output: 1 accumulator (same as input to reduce instruction encoding space)

• Operation: 𝐴 ← ± 𝐴 ± 𝑋𝑌𝑇

• For 32-bit data, 𝑋 and 𝑌 are 4 × 1 arrays of elements • For 16-bit data, 𝑋 and 𝑌 are 4 × 2 arrays of elements • For 8-bit data, 𝑋 and 𝑌 are 4 × 4 arrays of elements • For 4-bit data, 𝑋 and 𝑌 are 4 × 8 arrays of elements

• This way 𝑋𝑌𝑇 always has a 4 × 4 shape, compatible with accumulator

Instruction 𝑨 𝑿 𝒀𝑻 # of madds/instruction

xvf32ger 4 × 4 (fp32) 4 × 1 (fp32) 1 × 4 (fp32) 16

xvf16ger2 4 × 4 (fp32) 4 × 2 (fp16) 2 × 4 (fp16) 32

xvi16ger2 4 × 4 (int32) 4 × 2 (int16) 2 × 4 (int16) 32

xvi8ger4 4 × 4 (int32) 4 × 4 (int8) 4 × 4 (int8) 64

xvi4ger8 4 × 4 (int32) 4 × 8 (int4) 8 × 4 (int4) 128

(11)

© 2020 IBM Corporation

Extension to 64-bit

• Accumulators are 4 × 2 arrays of 64-bit floating-point elements

𝐴 =

𝑎00 𝑎01 𝑎10 𝑎11 𝑎20 𝑎21 𝑎30 𝑎31

• Accumulators are updated by outer-product instructions:

▪ Input: 1 accumulator (𝐴) + 3 VSRs (𝑋1, 𝑋2, 𝑌)

▪ Output: 1 accumulator (same as input to reduce instruction encoding space)

• Operation: 𝐴 ← 𝐴 + 𝑋1 𝑋2 𝑌𝑇

▪ 𝑋1, 𝑋2 and 𝑌 are 2 × 1 arrays of elements

▪ 𝑋1

𝑋2 𝑌𝑇 always has a 4 × 2 shape, compatible with accumulator

11 Instruction 𝑨 𝑿 𝒀𝑻 # of madds/instruction

(12)

© 2020 IBM Corporation

Load pressure and unit latency (32-bit results)

12

• 8 accumulators (8 × 16 result)

• 6 VSR loads/8 xv_gerinstructions • 0.75 VSR load/xv_ger

• Tolerates the most latency

• 4 accumulators (8 × 8 result)

• 4 VSR loads/4 xv_gerinstructions • 1 VSR load/xv_ger

• Could work well in SMT modes

• 2 accumulators (8 × 4 result)

• 3 VSR loads/2 xv_gerinstructions • 1.5 VSR load/xv_ger

• 1 accumulator (4 × 4 result)

• 2 VSR loads/1 xv_gerinstruction • 2 VSR load/xv_ger

ACC[0] ACC[2] ACC[4] ACC[6]

ACC[1] ACC[3] ACC[5] ACC[7]

X[ 0 ] X[ 1 ]

Y[0] Y[1] Y[2] Y[3]

ACC[0] ACC[2] ACC[1] ACC[3] X[ 0 ] X[ 1 ] Y[0] Y[1] ACC[0] ACC[1] X[ 0 ] X[ 1 ] Y[0]

(13)

© 2020 IBM Corporation

The micro-kernel of GEMM: 𝑪

𝒎×𝒏

+= 𝑨

𝒎×𝒌

× 𝑩

𝒌×𝒏

(14)

© 2020 IBM Corporation

SGEMM micro-kernel using 𝟖 × 𝟖 virtual accumulator

(15)

© 2020 IBM Corporation

Conclusions

• OpenPOWER delivers the three essentials of a supercomputer:

▪ Number crunching: Through its SIMD and MMA instructions ▪ Memory bandwitth: Through OMI

▪ Interconnect: Through OpenCAPI (not discussed in this talk)

• The MMA instructions provide a new level of performance for dense linear algebra and related computations

• OMI provides a new level of memory bandwidth for computing systems while delivering low cost, versatility and capacity

• Both are scalable, offering room to growth in both dimensions

• Together with Open Source software, we see a clear road ahead for OpenHPC systems that combine the best innovation from the community!

References

Related documents

At 5 weeks of age vascular (aortic) total Gstm1 expression showed a trend towards an increase in SHRSP- Tg(Gstm1)1 WKY and WKY rats, and was significantly increased in

In this section, we present the work already done on topics of com­ munity tracking and community evolution based on structural criteria.The field of community tracking is

Table 1.2 Performance Characteristics of Materials and Components for External Wall in Hong Kong Material for Enclosure Glass Metal (aluminium, copper, steel, stainless steel)

● It’s difficult to provision an FPGA to keep up with high memory bandwidth when the logic runs at low clock rate. ● The memory subsystems (at least those provided by OpenCL)

• High-Performance Computing (HPC) has adopted advanced interconnects (e.g. InfiniBand, 10 Gigabit Ethernet).. – Low latency (few micro seconds), High Bandwidth (40 Gb/s) – Low

The move from disk-based to memory-based data architectures requires a robust in-memory data management architecture that delivers high speed, low-latency access to terabytes of

32GB, DDR3 1866/1600/1333/1066 MHz Non-ECC, Un-buffered Memory * Dual Channel Memory Architecture.. Supports Intel® Extreme Memory

Under average hydrological conditions, yearly revenues slightly increase due to higher winter production and despite lower winter and higher summer prices (which in and of itself