Centre of Excellence for High Performance Computing
Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture
F. Deserno, G. Hager, F. Brechtefeld, G. Wellein (Regionales Rechenzentrum Erlangen)
KONWIHR Project:
L. Palm, M. Brehm (LRZ München)
Centre of Excellence for High Performance Computing
[Diagram: Supercomputers, cxHPC, Physics, Chemistry, Engineering, ...]
Ensure efficient use of supercomputers by providing top-quality HPC project support:
• Architecture-specific optimizations
• Appropriate programming models
• Efficient algorithms and solvers
• Find the appropriate (super)computer
• “Hot-line” --- large projects
• HPC training & lectures
• Information / PR
HPC Support Projects
Goals
• Support for large-scale HLRB projects
• Find the appropriate (super)computer for each problem/scientist
• Competence & consulting for methods used and developed by local (FAU) scientists
Fields: Material Science, Fluid Dynamics, Theoretical Physics, Theoretical Chemistry, Computer Sciences, Applied Mathematics, etc.
Methods:
• Simulation of complex flows
• Finite-Volume (SIP solver)
• Lattice-Boltzmann methods
• Quantum-mechanical many-body problems
• Exact diagonalization (sparse/dense)
• DMRG methods
Projects: Brenner/Durst (h001y), Breuer/Durst (h001v, h0011), Fehske (h0441), Heß (h023z), Hofmann (h008z), Rüde (h0671)
COMPAS (SMP-node level):
• Aggregate MemBW
• 512-way memory interleaving
• Collective thread operations
• Compiler
Pseudo-Vector-Processing (CPU level):
• Large register set (160 FP registers)
• 16 outstanding PREFETCH or 128 outstanding PRELOAD
• Extensive software pipelining
High peak performance & memory bandwidth
Hide memory latency
Vector-processor-like performance with RISC technology
Performance evaluation: Benchmark systems
(Paired values are given as per CPU / per node.)
Platform                          | Peak [GFlop/s] | MemBW [GB/s]          | L1 cache [kB] | L2 cache [MB] | Remarks
Intel P4 1.5 GHz                  | 3.0            | 3.2                   | 8             | 0.256         | RD-RAM
MIPS R14k 0.5 GHz                 | 1.0            | 1.6                   | 32            | 8             | O3400
NEC SX5e                          | 4.0            | 32.0 (LD) / 16.0 (ST) | ---           | ---           |
HSR8k 0.375 GHz (8-way node)      | 1.5 / 12.0     | 4.0 / 32.0            | 128 / 1024    | ---           | PVP + COMPAS
IBM Power4 1.3 GHz (32-way node)  | 5.2 / 166.0    | 13 / 110              | 32 / 1024     | 0.73 / 23     | p690, 512 MB L3
Performance Evaluation: Vector-Triad
A(1:N)=B(1:N)+C(1:N)*D(1:N)
Memory efficiency, single processors:
• HSR8k 1 CPU: 92%
• Intel-P4/RD-RAM: 55%
Memory efficiency, vector processors / HSR node:
• HSR8k 1 node: 75%
• NEC SX5e: 75% (max. BW), 97% (effective BW)
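To make the link between memory bandwidth and triad performance concrete, the following minimal Python sketch computes the vector triad and a bandwidth-limited performance bound. It rests on assumptions that are not stated on the slides: 8-byte reals, four memory words moved per iteration (three loads, one store, no write-allocate traffic), and two floating-point operations per iteration. The function names are illustrative only.

```python
def vector_triad(b, c, d):
    """Return A with A(i) = B(i) + C(i) * D(i), as on the slide."""
    return [bi + ci * di for bi, ci, di in zip(b, c, d)]


def triad_bandwidth_bound_mflops(mem_bw_gbs, bytes_per_word=8,
                                 words_per_iter=4, flops_per_iter=2):
    """Upper bound on triad MFlop/s if memory bandwidth is the only limit.

    Assumptions (not from the slides): 3 loads + 1 store of 8-byte words
    per iteration and 2 flops (one multiply, one add) per iteration.
    """
    bytes_per_iter = bytes_per_word * words_per_iter
    iters_per_second = mem_bw_gbs * 1e9 / bytes_per_iter
    return iters_per_second * flops_per_iter / 1e6


# Example: bandwidth figures taken from the benchmark-systems table.
bound_hsr8k_cpu = triad_bandwidth_bound_mflops(4.0)   # HSR8k, one CPU
bound_p4 = triad_bandwidth_bound_mflops(3.2)          # Intel P4 / RD-RAM
```

Under these assumptions, a memory efficiency figure can be read as the fraction of such a bound that the measured triad performance reaches.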
Performance Evaluation: Sparse Matrix-Vector Multiplication
• Sparse MVM is the numerical core of exact diagonalization algorithms (Davidson, Lanczos, etc.) widely used in theoretical physics and theoretical chemistry
• Several storage formats are available: JDS, CRS, …
• Jagged Diagonals Storage (JDS) format:
  - Best performance for Hitachi and vector systems
  - Only minor performance drawbacks on RISC systems
  - Shared-memory parallelization of inner loop
    DO j = 1,max_nonz
      DO i = 1,(jd_ptr(j+1)-jd_ptr(j))
        Y(i)=Y(i)+VALUE(jd_ptr(j)+i-1)*X(COL_IND(jd_ptr(j)+i-1))
      ENDDO
    ENDDO
  Outer loop: max. #non-zeros per row (10-100)
  Inner loop: matrix dimension (10^3 to 10^9)
Performance limited by memory bandwidth & latency!
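The JDS loop above can be tried out with the following Python sketch. It is not the authors' code: the helper build_jds, the 0-based indexing and the explicit row permutation perm are assumptions made for illustration. The slide's loop works on the permuted result vector only, so the sketch adds the back-permutation at the end.

```python
def build_jds(dense):
    """Convert a dense matrix (list of rows) into JDS arrays (0-based).

    Rows are sorted by decreasing number of non-zeros; the j-th non-zero
    of every (sorted) row forms the j-th jagged diagonal.
    """
    n = len(dense)
    rows = [[(col, v) for col, v in enumerate(row) if v != 0] for row in dense]
    perm = sorted(range(n), key=lambda r: len(rows[r]), reverse=True)
    max_nonz = len(rows[perm[0]]) if n else 0
    value, col_ind, jd_ptr = [], [], [0]
    for j in range(max_nonz):
        for r in perm:
            if len(rows[r]) <= j:
                break  # all remaining rows are shorter than j+1 entries
            col, v = rows[r][j]
            col_ind.append(col)
            value.append(v)
        jd_ptr.append(len(value))
    return perm, jd_ptr, value, col_ind


def jds_mvm(perm, jd_ptr, value, col_ind, x):
    """Compute y = A*x using the same loop structure as the Fortran kernel."""
    n = len(perm)
    y_perm = [0.0] * n
    for j in range(len(jd_ptr) - 1):               # loop over jagged diagonals
        start = jd_ptr[j]
        for i in range(jd_ptr[j + 1] - start):     # long, dependency-free loop
            y_perm[i] += value[start + i] * x[col_ind[start + i]]
    y = [0.0] * n
    for i, r in enumerate(perm):                   # undo the row permutation
        y[r] = y_perm[i]
    return y


dense = [[4, 0, 1],
         [0, 3, 0],
         [2, 0, 5]]
y = jds_mvm(*build_jds(dense), [1, 2, 3])
```

The inner loop runs over all rows that still have a j-th non-zero, which is why its length is of the order of the matrix dimension and why it is the natural target for vectorization and shared-memory parallelization.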
Performance Evaluation: Sparse MVM
Pseudo-Vector-Processing of sparse MVM (JDS format):
[Timeline diagram, axes: time vs. iteration. The stages PREFETCH, LD, PRELOAD, FOP and ST of successive iterations overlap.]
• PREFETCH: prefetch index array COL_IND
• LD: load index from cache to register
• PRELOAD: preload single data item X(index)
Innermost loop is unrolled 48 times by the HSR compiler!
PVP:
• intermediate to long loop lengths (unrolling / pipelining)
• no data dependencies (PREFETCH/PRELOAD)
• small to intermediate loop body (register spill!)
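The overlap of stages shown in the timeline can be mimicked in plain Python. The sketch below is purely conceptual: Python cannot issue PREFETCH or PRELOAD instructions, so it only reproduces the schedule, in which the index load, the operand preload and the floating-point operation of three different iterations happen in the same step. The function name and the three-stage depth are assumptions for illustration; the PREFETCH of COL_IND into cache is not modelled.

```python
def gather_mac_pipelined(y, value, col_ind, x):
    """y[i] += value[i] * x[col_ind[i]] scheduled as a 3-stage software pipeline."""
    n = len(value)
    idx_reg = None   # index loaded in the previous step (LD stage output)
    x_reg = None     # operand fetched in the previous step (PRELOAD stage output)
    for step in range(n + 2):          # two extra steps drain the pipeline
        # Stage 3 (FOP + ST): finish iteration step-2
        if step >= 2:
            i = step - 2
            y[i] += value[i] * x_reg
        # Stage 2 (PRELOAD): fetch X(index) for iteration step-1
        if 1 <= step <= n:
            x_reg = x[idx_reg]
        # Stage 1 (LD): load the index for iteration step
        if step < n:
            idx_reg = col_ind[step]
    return y


result = gather_mac_pipelined([0.0] * 4, [1, 2, 3, 4], [3, 0, 2, 1],
                              [10, 20, 30, 40])
```

On the real hardware many such iterations are in flight at once, which is why long loops, independent iterations and a loop body small enough to avoid register spills matter.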
Performance Evaluation: Sparse MVM
Memory efficiency, single processor:
• HSR8k 1 CPU: 70%
• Intel-P4/RD-RAM: 48%
SMP scalability, vector processor / SMP nodes:
• HSR8k 1 node: 89% (8p.)
• IBM Power4: 56% (16p.), 39% (32p.)
Use PVP with care!
• Simple kernel from nuclear physics: FORTRAN, approx. 200 KByte (scattering problems with three-nucleon forces, Prof. H. Hofmann, FAU)
    DO M=1,IQM
      DO K=KZHX(M),KZAHL
        F(K)=F(K) * S(MVK(K,M))
      ENDDO
    ENDDO
  - S(): short; approx. 100-200
  - IQM: small; typically 9
  - KZAHL: much larger than 1000
• HSR compiler: preload streams for S() give poor performance
• *voption nopreload improves performance by a factor of 2.9
• Blocking of M-loop & unrolling of inner loop: additional 12%

            | HSR8k-F1    | MIPS-R14k   | Intel P4 (1.5 GHz)
Original    | 26 MFlop/s  | 97 MFlop/s  | 195 MFlop/s
Optimized   | 90 MFlop/s  | 149 MFlop/s | 257 MFlop/s
Speed-up    | 3.46        | 1.54        | 1.32
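One possible reading of "blocking of the M-loop" is sketched below in Python: several values of M are handled per pass over F, so that each F(K) is loaded and stored once per block rather than once per M. This is an assumption about what the optimization looked like, not the authors' code; the block size, the 0-based indexing and the function names are hypothetical.

```python
def kernel_original(f, s, mvk, kzhx):
    """Direct transcription of the slide's loop nest (0-based indices)."""
    kzahl = len(f)
    for m in range(len(kzhx)):
        for k in range(kzhx[m], kzahl):
            f[k] *= s[mvk[k][m]]
    return f


def kernel_blocked(f, s, mvk, kzhx, block=2):
    """Same result, but 'block' values of M are applied per sweep over F."""
    kzahl = len(f)
    iqm = len(kzhx)
    for m0 in range(0, iqm, block):
        ms = range(m0, min(m0 + block, iqm))
        lo = min(kzhx[m] for m in ms)      # earliest start index in this block
        for k in range(lo, kzahl):
            acc = f[k]                     # F(K) is read once per block
            for m in ms:
                if k >= kzhx[m]:           # respect each M's own start index
                    acc *= s[mvk[k][m]]
            f[k] = acc                     # ... and written once per block
    return f


# Small hypothetical test data; real sizes are given on the slide.
s = [2.0, 3.0, 5.0]
kzhx = [0, 2, 1]
mvk = [[(k + m) % 3 for m in range(3)] for k in range(6)]
ref = kernel_original([1.0] * 6, s, mvk, kzhx)
blk = kernel_blocked([1.0] * 6, s, mvk, kzhx)
```

Because the factors for each F(K) are applied in the same order in both versions, the results agree exactly, not just up to rounding.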
CFD applications: Strongly Implicit Solver
• CFD: Solving the linear systems of equations that arise in finite-volume methods can be done by the Strongly Implicit Procedure (SIP) according to Stone
• SIP-solver is widely used:
  - LESOCC, FASTEST, FLOWSI (Institute of Fluid Mechanics, Erlangen)
  - STHAMAS3D (Crystal Growth Laboratory, Erlangen)
  - CADiP (Theoretical Thermodynamics and Transport Processes, Bayreuth)
  - …
• SIP-solver:
  1) Incomplete LU factorization
  2) Series of forward/backward substitutions
• Toy program available at: ftp.springer.de in /pub/technik/peric (M. Peric)
SIP-solver: Data-dependencies & Implementations
Basic data-dependency: (i,j,k) ← {(i-1,j,k); (i,j-1,k); (i,j,k-1)}
3-fold nested loop (3D): (i,j,k)
• Data locality
• No shared-memory parallelization (Hitachi: pipeline parallel processing)
Hyperplane: (i+j+k=const)
• Non-contiguous memory access
• Shared-memory parallelization / vectorization of innermost loop
Hyperline: (i, j+k=const)
• Shared-memory parallelization of (j+k=const) loop
• Contiguous memory access for innermost (i) loop
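The three loop orderings can be compared on a model recurrence with the same data dependency. The Python sketch below is not the SIP solver itself: the update formula, the coefficient c and the dictionary storage are assumptions chosen to keep the example short. It only shows that all three traversal orders respect the dependency (i,j,k) ← {(i-1,j,k); (i,j-1,k); (i,j,k-1)} and therefore give identical results.

```python
def update(v, rhs, i, j, k, c):
    """Model forward-substitution step; out-of-range neighbours count as 0."""
    v[i, j, k] = rhs[i, j, k] + c * (v.get((i - 1, j, k), 0.0)
                                     + v.get((i, j - 1, k), 0.0)
                                     + v.get((i, j, k - 1), 0.0))


def sweep_3d(rhs, n, c=0.25):
    """Plain 3-fold nested loop; k is the outermost loop, i the innermost."""
    v = {}
    for k in range(n):
        for j in range(n):
            for i in range(n):
                update(v, rhs, i, j, k, c)
    return v


def sweep_hyperplane(rhs, n, c=0.25):
    """All points with i+j+k = h are independent of each other."""
    v = {}
    for h in range(3 * n - 2):
        for i in range(max(0, h - 2 * (n - 1)), min(n - 1, h) + 1):
            for j in range(max(0, h - i - (n - 1)), min(n - 1, h - i) + 1):
                update(v, rhs, i, j, h - i - j, c)
    return v


def sweep_hyperline(rhs, n, c=0.25):
    """Lines with j+k = h are independent; i runs contiguously along a line."""
    v = {}
    for h in range(2 * n - 1):
        for j in range(max(0, h - (n - 1)), min(n - 1, h) + 1):
            k = h - j
            for i in range(n):         # sequential: depends on (i-1, j, k)
                update(v, rhs, i, j, k, c)
    return v


n = 4
rhs = {(i, j, k): float(i + 2 * j + 3 * k + 1)
       for i in range(n) for j in range(n) for k in range(n)}
v_3d = sweep_3d(rhs, n)
v_hp = sweep_hyperplane(rhs, n)
v_hl = sweep_hyperline(rhs, n)
```

In the hyperplane version the two inner loops could run in parallel; in the hyperline version only the loop over lines with j+k=const could, while the i loop stays sequential.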
SIP-solver: Implementations & Single Processor Performance
[Bar chart: MFlop/s (0 to 300) on HSR8k, MIPS R14k, Intel P4 and IBM Power4 for the 3D, hyperplane and hyperline implementations]
Benchmark:
• Lattice: 91^3
• 100 MB
• 1 ILU
• 500 iterations
HSR8k-F1:
• 3D: unrolling 32 times
IBM Power4:
• 128 MB L3 cache accessible for 1 proc.
SIP-solver: Implementations & Shared-memory scalability
[Charts, fixed problem size 91^3: MFlop/s vs. number of processors (1, 4, 8, 16) on HSR8k-F1 and IBM Power4, hyperplane and hyperline]
[Chart, varying problem size: MFlop/s vs. memory (4 MB, 100 MB, 1 GB) for HSR8k-F1 (3D) (8p) and IBM Power4 (hl) (8p)]
Summary & Outlook
• Efficient use of Hitachi SR8000:
  - Vector codes: Pseudo-Vectorization + COMPAS
  - High level of loop unrolling: large register set
• Hitachi SR8000 techniques are forward-looking:
  - Large register set / many outstanding memory references
  - High memory bandwidth: single processor and SMP node
• Optimization techniques for new architectures:
  - IBM Power4: shared (large) caches
  - Intel Itanium2/3: EPIC; large register set
• Parallel programming techniques for SMP clusters:
  - MPP model (pure MPI) vs. hybrid model (MPI+OpenMP/automatic)