Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture

(1)

Centre of Excellence for High Performance Computing

F. Deserno, G. Hager, F. Brechtefeld, G. Wellein (Regionales Rechenzentrum Erlangen)

KONWIHR Project:

L. Palm, M. Brehm (LRZ München)

(2)

Centre of Excellence for High Performance Computing

[Diagram: cxHPC as the link between supercomputers and applications in physics, chemistry, engineering, ...]

Ensure efficient use of supercomputers by providing top-quality HPC project support:

• Architecture-specific optimizations
• Appropriate programming models
• Efficient algorithms and solvers
• Find the appropriate (super)computer
• "Hot-line" for large projects
• HPC training & lectures
• Information / PR

(3)

HPC Support Projects

Goals

• Support for large-scale HLRB projects
• Find the appropriate (super)computer for each problem/scientist
• Competence & consulting for methods used and developed by local (FAU) scientists

• Fluid Dynamics
• Theoretical Physics
• Theoretical Chemistry
• Computer Sciences
• Applied Mathematics
• Material Science, etc.

Methods

• Simulation of complex flows
• Finite-Volume (SIP solver)
• Lattice-Boltzmann methods
• Quantum-mechanical many-body problems
• Exact diagonalization (sparse/dense)
• DMRG

Brenner/Durst (h001y)
Breuer/Durst (h001v, h0011)
Fehske (h0441)
Heß (h023z)
Hofmann (h008z)
Rüde (h0671)

(4)

COMPAS (SMP-node level): high peak performance & memory bandwidth
• Aggregate memory bandwidth
• 512-way memory interleaving
• Collective thread operations
• Compiler

Pseudo-Vector-Processing (CPU level): hide memory latency
• Large register set (160 FP registers)
• 16 outstanding PREFETCH or 128 outstanding PRELOAD
• Extensive software pipelining

Vector-processor-like performance with RISC technology

Performance evaluation: Benchmark systems

Platform                          Peak [GFlop/s]  MemBW [GB/s]           L1 cache [kB]  L2 cache [MB]  Remarks
Intel P4 1.5 GHz                  3.0             3.2                    8              0.256          RD-RAM
MIPS R14k 0.5 GHz                 1.0             1.6                    32             8              O3400
NEC SX5e                          4.0             32.0 (LD) / 16.0 (ST)  ---            ---
HSR8k 0.375 GHz (8-way node)      1.5 / 12.0      4.0 / 32.0             128 / 1024     ---            PVP + COMPAS
IBM Power4 1.3 GHz (32-way node)  5.2 / 166.0     13 / 110               32 / 1024      0.73 / 23      p690, 512 MB L3

For the HSR8k and IBM Power4 rows, paired values are per CPU / per node.

(5)

Performance Evaluation: Vector-Triad

A(1:N)=B(1:N)+C(1:N)*D(1:N)

[Plots: single processors; vector processors / HSR node]

Memory efficiency:
• HSR8k 1 CPU: 92%
• Intel-P4/RD-RAM: 55%
• HSR8k 1 node: 75%
• NEC SX5e: 75% (max. BW), 97% (effective BW)
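The triad kernel and a bandwidth-bound estimate can be sketched in Python as follows. This is an illustrative sketch, not the benchmark code used for the slides. It assumes 8-byte reals, three loads and one store per iteration (32 bytes, no write-allocate traffic), and it assumes that "memory efficiency" means measured performance divided by the bandwidth-bound performance; the slides do not spell out the definition. The function names are hypothetical.

```python
def triad(b, c, d):
    """Vector triad A(1:N) = B(1:N) + C(1:N) * D(1:N) on Python lists."""
    return [bi + ci * di for bi, ci, di in zip(b, c, d)]


def triad_bandwidth_bound_mflops(mem_bw_gbs, bytes_per_iter=32, flops_per_iter=2):
    """Upper bound in MFlop/s if the triad is limited purely by memory bandwidth.

    Assumption: each iteration moves bytes_per_iter bytes (3 loads + 1 store
    of 8-byte reals) and performs flops_per_iter floating-point operations.
    """
    return mem_bw_gbs * 1e3 * flops_per_iter / bytes_per_iter


def memory_efficiency(measured_mflops, mem_bw_gbs):
    """Assumed definition: measured performance relative to the bandwidth bound."""
    return measured_mflops / triad_bandwidth_bound_mflops(mem_bw_gbs)
```

With the 4.0 GB/s per-CPU bandwidth listed for the HSR8k, `triad_bandwidth_bound_mflops(4.0)` gives 250 MFlop/s under these assumptions.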

(6)

Performance Evaluation: Sparse Matrix-Vector Multiplication

• Sparse MVM is the numerical core of exact diagonalization algorithms (Davidson, Lanczos, etc.) widely used in theoretical physics and theoretical chemistry
• Several storage formats are available: JDS, CRS, …
• Jagged Diagonals Storage (JDS) format:
  - Best performance for Hitachi and vector systems
  - Only minor performance drawbacks on RISC systems
  - Shared-memory parallelization of inner loop

DO j = 1, max_nonz
   DO i = 1, (jd_ptr(j+1)-jd_ptr(j))
      Y(i) = Y(i) + VALUE(jd_ptr(j)+i-1) * X(COL_IND(jd_ptr(j)+i-1))
   ENDDO
ENDDO

max_nonz: max. number of non-zeros per row (10-100)
Matrix dimension: 10^3 - 10^9

Performance limited by memory bandwidth & latency!
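For readers unfamiliar with JDS, the following Python sketch builds the arrays used in the loop above from a small dense matrix and performs the same multiplication. It is a minimal illustration under stated assumptions, not the authors' implementation: indices are 0-based, only the rows are permuted (the input vector keeps its original ordering), and the helper names `to_jds` and `jds_mvm` as well as the `perm` array are hypothetical.

```python
def to_jds(dense):
    """Convert a dense square matrix (list of lists) to JDS arrays (0-based)."""
    n = len(dense)
    # Compress each row to its non-zeros as (column, value) pairs.
    rows = [[(c, v) for c, v in enumerate(row) if v != 0] for row in dense]
    # Sort rows by decreasing number of non-zeros; perm[i] is the original row.
    perm = sorted(range(n), key=lambda r: -len(rows[r]))
    max_nonz = len(rows[perm[0]]) if n else 0
    value, col_ind, jd_ptr = [], [], [0]
    for j in range(max_nonz):  # j-th jagged diagonal
        for r in perm:
            if len(rows[r]) <= j:
                break  # rows are sorted, so all later rows are too short as well
            c, v = rows[r][j]
            col_ind.append(c)
            value.append(v)
        jd_ptr.append(len(value))
    return value, col_ind, jd_ptr, perm


def jds_mvm(value, col_ind, jd_ptr, perm, x):
    """y = A * x with A in JDS format; same loop structure as the Fortran kernel."""
    n = len(perm)
    y_perm = [0.0] * n
    for j in range(len(jd_ptr) - 1):          # loop over jagged diagonals
        start = jd_ptr[j]
        for i in range(jd_ptr[j + 1] - start):  # long, dependency-free inner loop
            y_perm[i] += value[start + i] * x[col_ind[start + i]]
    # Undo the row permutation so the result is in the original ordering.
    y = [0.0] * n
    for i, r in enumerate(perm):
        y[r] = y_perm[i]
    return y
```

The inner loop runs over all rows that still have a j-th non-zero, which is why its length is of the order of the matrix dimension while the outer loop is short.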

(7)

Performance Evaluation: Sparse MVM

Pseudo-Vector-Processing of sparse MVM (JDS format):

[Diagram: software pipeline over time and iterations; each iteration passes through PREFETCH, LD, PRELOAD, FOP, ST]

• PREFETCH: prefetch index array COL_IND
• LD: load index from cache to register
• PRELOAD: preload single data item X(index)

Innermost loop is unrolled 48 times by the HSR compiler!

PVP:
• intermediate to long loop lengths (unrolling / pipelining)
• no data dependencies (PREFETCH/PRELOAD)
• small to intermediate loop body (register spill!)
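The pipelining idea can be emulated in plain Python, which of course cannot issue real PREFETCH or PRELOAD instructions. The sketch below only shows the schedule: the indirect operand X(index) for iteration i + distance is requested while iteration i is computed. The function name, the `distance` parameter and the queue standing in for registers are assumptions made for illustration; the actual instruction scheduling is done by the HSR compiler.

```python
from collections import deque


def pipelined_gather_fma(y, value, col_ind, x, distance=4):
    """Compute y[i] += value[i] * x[col_ind[i]] with the gather issued
    `distance` iterations ahead of its use (emulated software pipeline)."""
    n = len(value)
    pending = deque()  # stands in for registers holding preloaded X items

    # Prologue: issue the first `distance` preloads before any arithmetic.
    for i in range(min(distance, n)):
        pending.append(x[col_ind[i]])

    # Steady state and epilogue.
    for i in range(n):
        xi = pending.popleft()              # operand requested earlier
        if i + distance < n:
            pending.append(x[col_ind[i + distance]])  # "PRELOAD" for a later iteration
        y[i] += value[i] * xi               # floating-point operation and store
    return y
```

The more iterations are in flight, the more registers are needed to hold the pending operands, which is consistent with the large register set and the register-spill warning on the slide.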

(8)

Performance Evaluation: Sparse MVM

[Plots: single processor; vector processor / SMP nodes]

Memory efficiency:
• HSR8k 1 CPU: 70%
• Intel-P4/RD-RAM: 48%

SMP scalability:
• HSR8k 1 node: 89% (8p.)
• IBM Power4: 56% (16p.), 39% (32p.)

(9)

Use PVP with care!

• Simple kernel from nuclear physics: FORTRAN, approx. 200 KByte (Scattering problems with three-nucleon forces, Prof. H. Hofmann, FAU)

DO M=1,IQM
   DO K=KZHX(M),KZAHL
      F(K)=F(K) * S(MVK(K,M))
   ENDDO
ENDDO

  - S(): short; approx. 100-200
  - IQM: small; typically 9
  - KZAHL: much larger than 1000

• HSR compiler: preload streams for S() lead to poor performance
• *voption nopreload improves performance by a factor of 2.9
• Blocking of M-loop & unrolling of inner loop: additional 12%

            HSR8k-F1     MIPS-R14k    Intel P4 (1.5GHz)
Original    26 MFlop/s   97 MFlop/s   195 MFlop/s
Optimized   90 MFlop/s   149 MFlop/s  257 MFlop/s
Speed-up    3.46         1.54         1.32
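One possible reading of "blocking of the M-loop" is sketched below in Python: several M iterations are applied to each F(K) in a single pass over K, so F is streamed through memory fewer times. This is an interpretation for illustration only; the slides do not show the optimized source. Indices are 0-based, and the function names and the `m_block` parameter are hypothetical.

```python
def kernel_original(f, s, mvk, kzhx):
    """Direct transcription of the slide's loop nest (0-based indices)."""
    for m in range(len(kzhx)):
        for k in range(kzhx[m], len(f)):
            f[k] *= s[mvk[k][m]]
    return f


def kernel_m_blocked(f, s, mvk, kzhx, m_block=3):
    """Same result, but m_block values of M are applied per pass over K."""
    iqm, kzahl = len(kzhx), len(f)
    for m0 in range(0, iqm, m_block):
        m1 = min(m0 + m_block, iqm)
        k_start = min(kzhx[m0:m1])      # earliest K touched by this M block
        for k in range(k_start, kzahl):
            fk = f[k]                   # load F(K) once per block
            for m in range(m0, m1):     # short loop, a candidate for unrolling
                if k >= kzhx[m]:        # respect each M's own lower bound
                    fk *= s[mvk[k][m]]
            f[k] = fk                   # store F(K) once per block
    return f
```

For each K the factors are multiplied in the same ascending order of M in both versions, so the results agree exactly, not just up to rounding.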

(10)

CFD applications: Strongly Implicit Solver

• CFD: Solving the linear systems of equations arising in finite volume methods can be done by the Strongly Implicit Procedure (SIP) according to Stone
• SIP solver is widely used:
  - LESOCC, FASTEST, FLOWSI (Institute of Fluid Mechanics, Erlangen)
  - STHAMAS3D (Crystal Growth Laboratory, Erlangen)
  - CADiP (Theoretical Thermodynamics and Transport Processes, Bayreuth)
  - …
• SIP solver:
  1) Incomplete LU factorization
  2) Series of forward/backward substitutions
• Toy program available at: ftp.springer.de in /pub/technik/peric (M. Peric)
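The second step, forward and backward substitution with given triangular factors, can be illustrated by the following Python sketch on small dense matrices. It is a generic textbook illustration, not Stone's SIP and not the toy program mentioned above: the incomplete factorization itself is not shown, the factors L and U are simply assumed to be given, and the function names are hypothetical.

```python
def forward_substitution(L, b):
    """Solve L y = b for a lower triangular matrix L (list of lists)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        # y[i] depends on all previously computed y[j], j < i.
        acc = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - acc) / L[i][i]
    return y


def backward_substitution(U, y):
    """Solve U x = y for an upper triangular matrix U (list of lists)."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        # x[i] depends on all previously computed x[j], j > i.
        acc = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - acc) / U[i][i]
    return x


def lu_solve(L, U, b):
    """Solve (L U) x = b by one forward and one backward substitution."""
    return backward_substitution(U, forward_substitution(L, b))
```

Each unknown depends on previously computed unknowns, which is the source of the data dependencies that make this step hard to parallelize.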

(11)

SIP-solver: Data-dependencies & Implementations

Basic data dependency:

(i,j,k) depends on {(i-1,j,k); (i,j-1,k); (i,j,k-1)}

3-fold nested loop (3D): (i,j,k)
• Data locality
• No shared-memory parallelization (Hitachi: pipeline parallel processing)

Hyperplane: (i+j+k=const)
• Non-contiguous memory access
• Shared-memory parallelization / vectorization of innermost loop

Hyperline: (i,j+k=const)
• Shared-memory parallelization of (j+k=const) loop
• Contiguous memory access for innermost (i) loop
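The three traversal orders can be compared with a small Python sketch. It uses a made-up recurrence with the same dependency pattern as above (each point needs its three "lower" neighbours) rather than the real SIP coefficients, and all names and the coefficient `c` are hypothetical. The sweep raises a `KeyError` if an ordering visits a point before one of its dependencies, so a successful run also checks that the ordering is valid.

```python
def order_3d(n):
    """Plain 3-fold nested loop, i fastest."""
    return [(i, j, k) for k in range(n) for j in range(n) for i in range(n)]


def order_hyperplane(n):
    """Visit planes i+j+k = const; points within a plane are independent."""
    pts = []
    for s in range(3 * (n - 1) + 1):
        for k in range(n):
            for j in range(n):
                i = s - j - k
                if 0 <= i < n:
                    pts.append((i, j, k))
    return pts


def order_hyperline(n):
    """Visit lines j+k = const; lines with the same constant are independent,
    and the contiguous i loop runs sequentially inside each line."""
    pts = []
    for l in range(2 * (n - 1) + 1):
        for k in range(n):
            j = l - k
            if 0 <= j < n:
                for i in range(n):
                    pts.append((i, j, k))
    return pts


def sweep(order, rhs, c=0.25):
    """u(i,j,k) = rhs(i,j,k) + c * (u(i-1,j,k) + u(i,j-1,k) + u(i,j,k-1))."""
    u = {}

    def val(i, j, k):
        if min(i, j, k) < 0:
            return 0.0           # outside the lattice
        return u[(i, j, k)]      # KeyError if the dependency is not yet computed

    for (i, j, k) in order:
        u[(i, j, k)] = rhs[(i, j, k)] + c * (
            val(i - 1, j, k) + val(i, j - 1, k) + val(i, j, k - 1))
    return u
```

All three orderings perform the same arithmetic per point and differ only in the order in which points are visited, so the choice affects memory access patterns and available parallelism rather than the result.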

(12)

SIP-solver: Implementations & Single Processor Performance

[Bar chart: performance in MFlop/s (axis 0-300) of the 3D, hyperplane and hyperline implementations on HSR8k, MIPS R14k, Intel P4 and IBM Power4]

Benchmark:
• Lattice: 91^3
• 100 MB
• 1 ILU
• 500 iterations

HSR8k-F1:
• 3D: unrolling 32 times

IBM Power4:
• 128 MB L3 cache accessible for 1 proc.

(13)

SIP-solver: Implementations & Shared-memory scalability

[Plots, fixed problem size 91^3: MFlop/s vs. number of processors (1, 4, 8, 16) for HSR8k-F1 and IBM Power4; legend: hyperplane, hyperline]

[Plot, varying problem size: MFlop/s vs. memory (4 MB, 100 MB, 1 GB) for HSR8k-F1 (3D) (8p) and IBM Power4 (hl) (8p)]

(14)

Summary & Outlook

• Efficient use of Hitachi SR8000:
  - Vector codes: Pseudo-Vectorization + COMPAS
  - High level of loop unrolling: large register set
• Hitachi SR8000 techniques are forward-looking:
  - Large register set / many outstanding memory references
  - High memory bandwidth: single processor and SMP node
• Optimization techniques for new architectures:
  - IBM Power4: shared (large) caches
  - Intel Itanium2/3: EPIC; large register set
• Parallel programming techniques for SMP clusters:
  - MPP model (pure MPI)
  - Hybrid model (MPI+OpenMP/automatic)

(15)