
Centre of Excellence for High Performance Computing

Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture

F. Deserno, G. Hager, F. Brechtefeld, G. Wellein (Regionales Rechenzentrum Erlangen)

KONWIHR Project:

L. Palm, M. Brehm (LRZ München)

Centre of Excellence for High Performance Computing

[Diagram labels: Supercomputers; cxHPC; Physics, Chemistry, Engineering, ...]

Ensure efficient use of supercomputers by providing top-quality HPC project support:

• Architecture-specific optimizations
• Appropriate programming models
• Efficient algorithms and solvers
• Find appropriate (super)computer
• "Hot-line" (large projects)
• HPC training & lectures
• Information / PR

HPC Support Projects

Goals

• Support for large-scale HLRB projects
• Find appropriate (super)computer for each problem/scientist
• Competence & consulting for methods used and developed by local (FAU) scientists

Fluid Dynamics, Theoretical Physics, Theoretical Chemistry, Computer Sciences, Applied Mathematics, Material Science, etc.

Methods

• Simulation of complex flows
• Finite-Volume (SIP solver)
• Lattice-Boltzmann methods
• Quantum-mechanical many-body problems
• Exact diagonalization (sparse/dense)
• DMRG

Brenner/Durst (h001y), Breuer/Durst (h001v, h0011), Fehske (h0441), Heß (h023z), Hofmann (h008z), Rüde (h0671)

COMPAS (SMP-node level):
• Aggregate MemBW
• 512-way memory interleaving
• Collective thread operations
• Compiler

Pseudo-Vector-Processing (CPU level):
• Large register set (160 FP registers)
• 16 outstanding PREFETCH or 128 outstanding PRELOAD
• Extensive software pipelining

High peak performance & memory bandwidth
Hide memory latency

Vector-processor-like performance with RISC technology

Performance evaluation: Benchmark systems

Platform                          Peak [GFlop/s]  MemBW [GB/s]          L1 cache [kB]  L2 cache [MB]  Notes
Intel P4 1.5 GHz                  3.0             3.2                   8              0.256          RD-RAM
MIPS R14k 0.5 GHz                 1.0             1.6                   32             8              O3400
NEC SX5e                          4.0             32.0 (LD), 16.0 (ST)  ---            ---
HSR8k 0.375 GHz (8-way node)      1.5 / 12.0      4.0 / 32.0            128 / 1024     ---            PVP + COMPAS
IBM Power4 1.3 GHz (32-way node)  5.2 / 166.0     13 / 110              32 / 1024      0.73 / 23      p690, 512 MB L3

Paired entries give single-CPU / full-node values.

Performance Evaluation: Vector-Triad

A(1:N)=B(1:N)+C(1:N)*D(1:N)

[Plots: vector-triad performance for single processors and for vector processors / HSR node]

Memory efficiency:
• HSR8k 1 CPU: 92%
• Intel-P4/RD-RAM: 55%
• HSR8k 1 node: 75%
• NEC SX5e: 75% (max. BW), 97% (effect. BW)
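The triad and the "memory efficiency" figures can be made concrete with a small sketch. It assumes 8-byte reals, counts three loads and one store per iteration, ignores any extra write-allocate traffic on cache-based CPUs, and takes memory efficiency to mean measured performance divided by the bandwidth-limited bound. The slides do not spell out the exact definition, and all function names here are illustrative.

```python
BYTES_PER_ITERATION = 4 * 8   # assumption: 3 loads + 1 store of 8-byte reals
FLOPS_PER_ITERATION = 2       # one multiply and one add


def vector_triad(b, c, d):
    """Return A with A(i) = B(i) + C(i) * D(i)."""
    if not (len(b) == len(c) == len(d)):
        raise ValueError("B, C and D must have the same length")
    return [bi + ci * di for bi, ci, di in zip(b, c, d)]


def bandwidth_bound_mflops(mem_bw_gbytes_per_s):
    """Upper bound on triad MFlop/s if memory bandwidth is the only limit."""
    iterations_per_s = mem_bw_gbytes_per_s * 1e9 / BYTES_PER_ITERATION
    return iterations_per_s * FLOPS_PER_ITERATION / 1e6


def memory_efficiency(measured_mflops, mem_bw_gbytes_per_s):
    """Fraction of the bandwidth-limited bound that was actually reached."""
    return measured_mflops / bandwidth_bound_mflops(mem_bw_gbytes_per_s)
```

Under these assumptions, `bandwidth_bound_mflops(4.0)` gives 250 MFlop/s for the 4.0 GB/s per-CPU bandwidth listed for the HSR8k.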

Performance Evaluation: Sparse Matrix-Vector Multiplication

• Sparse MVM is the numerical core of exact diagonalization algorithms (Davidson, Lanczos, etc.) widely used in theoretical physics and theoretical chemistry
• Several storage formats are available: JDS, CRS, …
• Jagged Diagonals Storage (JDS) format:
    • Best performance for Hitachi and vector systems
    • Only minor performance drawbacks on RISC systems
    • Shared-memory parallelization of inner loop

    DO j = 1, max_nonz                     ! max. #non-zeros per row (10-100)
      DO i = 1, jd_ptr(j+1) - jd_ptr(j)    ! up to the matrix dimension (10^3 - 10^9)
        Y(i) = Y(i) + VALUE(jd_ptr(j)+i-1) * X(COL_IND(jd_ptr(j)+i-1))
      ENDDO
    ENDDO

Performance is limited by memory bandwidth & latency!
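As a rough illustration of the JDS layout behind this loop, the following sketch builds the `value`, `col_ind`, and `jd_ptr` arrays from a small dense matrix and multiplies with them. It uses 0-based indices, takes a dense input only for brevity, and introduces a row permutation `perm` (rows sorted by descending number of non-zeros) that the slide does not show explicitly; the function names are illustrative.

```python
def to_jds(dense):
    """Convert a dense row-major matrix to JDS arrays (0-based)."""
    n = len(dense)
    rows = [[(col, v) for col, v in enumerate(row) if v != 0.0] for row in dense]
    # Longest rows first; sorted() is stable, so ties keep their original order.
    perm = sorted(range(n), key=lambda r: -len(rows[r]))
    max_nonz = len(rows[perm[0]]) if n else 0
    value, col_ind, jd_ptr = [], [], [0]
    for j in range(max_nonz):          # j-th jagged diagonal
        for r in perm:
            if len(rows[r]) <= j:
                break                  # all remaining rows are shorter as well
            col, v = rows[r][j]
            col_ind.append(col)
            value.append(v)
        jd_ptr.append(len(value))
    return perm, value, col_ind, jd_ptr


def jds_matvec(perm, value, col_ind, jd_ptr, x):
    """y = A x using the loop structure of the slide, then undo the permutation."""
    n = len(perm)
    y_perm = [0.0] * n
    for j in range(len(jd_ptr) - 1):
        start = jd_ptr[j]
        for i in range(jd_ptr[j + 1] - start):   # long, dependency-free inner loop
            y_perm[i] += value[start + i] * x[col_ind[start + i]]
    y = [0.0] * n
    for i, r in enumerate(perm):
        y[r] = y_perm[i]
    return y
```

The inner loop runs over all rows that still have a j-th non-zero, which is why its length is on the order of the matrix dimension and why it suits vector-style processing.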

Performance Evaluation: Sparse MVM

Pseudo-Vector-Processing of sparse MVM (JDS format):

[Diagram: overlapping loop iterations along the time axis; one PREFETCH, then LD, PRELOAD, FOP, ST per iteration]
• PREFETCH: prefetch index array COL_IND
• LD: load index from cache to register
• PRELOAD: preload single data item X(index)

The innermost loop is unrolled 48 times by the HSR compiler!

PVP:
• intermediate to long loop lengths (unrolling / pipelining)
• no data dependencies (PREFETCH/PRELOAD)
• small to intermediate loop body (register spill!!)
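The staging in the diagram (LD of the index, PRELOAD of X(index), then FOP and ST) can be imitated in software as a rough sketch. Python cannot issue real PREFETCH or PRELOAD instructions, so two queues stand in for the registers that hold in-flight indices and data, the cache-line PREFETCH of COL_IND is not modeled, and the parameter `distance` (how many iterations ahead each stage runs) is a hypothetical knob rather than a compiler setting.

```python
from collections import deque


def gather_multiply(value, col_ind, x):
    """Reference loop: load index, gather, multiply and store in one iteration."""
    return [value[i] * x[col_ind[i]] for i in range(len(value))]


def gather_multiply_pipelined(value, col_ind, x, distance=4):
    """Same result, with index load and gather issued ahead of the arithmetic."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    n = len(value)
    out = [0.0] * n
    index_regs = deque()   # indices already loaded (LD stage)
    data_regs = deque()    # X values already fetched (PRELOAD stage)
    for step in range(n + 2 * distance):
        i = step - 2 * distance
        if i >= 0:                              # FOP + ST for iteration i
            out[i] = value[i] * data_regs.popleft()
        if 0 <= step - distance < n:            # PRELOAD for iteration step - distance
            data_regs.append(x[index_regs.popleft()])
        if step < n:                            # LD index for iteration step
            index_regs.append(col_ind[step])
    return out
```

The queues never hold more than `distance` entries each, which mirrors why a large register set and a moderate loop body matter: every in-flight iteration occupies registers until its arithmetic is done.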

Performance Evaluation: Sparse MVM

[Plots: sparse MVM performance for single processors and for vector processor / SMP nodes]

Memory efficiency:
• HSR8k 1 CPU: 70%
• Intel-P4/RD-RAM: 48%

SMP scalability:
• HSR8k 1 node: 89% (8p.)
• IBM Power4: 56% (16p.), 39% (32p.)

Use PVP with care!

• Simple kernel from nuclear physics: FORTRAN, approx. 200 KByte (scattering problems with three-nucleon forces, Prof. H. Hofmann, FAU)

    DO M = 1, IQM
      DO K = KZHX(M), KZAHL
        F(K) = F(K) * S(MVK(K,M))
      ENDDO
    ENDDO

    • S(): short; approx. 100-200
    • IQM: small; typically 9
    • KZAHL: much larger than 1000

• HSR compiler: preload streams for S() lead to poor performance
• *voption nopreload improves performance by a factor of 2.9
• Blocking of the M loop & unrolling of the inner loop: additional 12%

            HSR8k-F1     MIPS-R14k     Intel P4 (1.5 GHz)
Original    26 MFlop/s   97 MFlop/s    195 MFlop/s
Optimized   90 MFlop/s   149 MFlop/s   257 MFlop/s
Speed-up    3.46         1.54          1.32
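The slides do not show the optimized source, so the following is only one possible reading of "blocking of the M loop": several values of M are handled in a single pass over K, so that F(K) is loaded and stored once per block instead of once per M. The sketch uses 0-based indices, takes KZAHL as the length of `f`, and the parameter `block` is hypothetical. Python does not reproduce the performance effect; the point is that the transformation keeps the result unchanged.

```python
def kernel_original(f, s, mvk, kzhx):
    """Loop nest as on the slide: outer M loop, long inner K loop."""
    f = list(f)
    kzahl = len(f)
    for m in range(len(kzhx)):
        for k in range(kzhx[m], kzahl):
            f[k] *= s[mvk[k][m]]
    return f


def kernel_m_blocked(f, s, mvk, kzhx, block=3):
    """Handle `block` values of M per pass over K (assumed form of the blocking)."""
    f = list(f)
    kzahl = len(f)
    iqm = len(kzhx)
    for m0 in range(0, iqm, block):
        ms = range(m0, min(m0 + block, iqm))
        for k in range(min(kzhx[m] for m in ms), kzahl):
            fk = f[k]                  # F(K) is read once per block ...
            for m in ms:
                if k >= kzhx[m]:       # respect each M's own lower bound
                    fk *= s[mvk[k][m]]
            f[k] = fk                  # ... and written once per block
    return f
```

Because each F(K) is still multiplied by the same factors in the same order of M, both versions return identical values.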

CFD applications: Strongly Implicit Solver

• CFD: Solving the linear systems A x = b arising in finite volume methods can be done by the Strongly Implicit Procedure (SIP) according to Stone
• SIP solver is widely used:
    • LESOCC, FASTEST, FLOWSI (Institute of Fluid Mechanics, Erlangen)
    • STHAMAS3D (Crystal Growth Laboratory, Erlangen)
    • CADiP (Theoretical Thermodynamics and Transport Processes, Bayreuth)
    • …
• SIP solver:
    1) Incomplete LU factorization
    2) Series of forward/backward substitutions
• Toy program available at ftp.springer.de in /pub/technik/peric (M. Peric)

SIP-solver: Data-dependencies & Implementations

Basic data-dependency: (i,j,k) depends on {(i-1,j,k); (i,j-1,k); (i,j,k-1)}

3-fold nested loop (3D): (i,j,k)
• Data locality
• No shared-memory parallelization (Hitachi: pipeline parallel processing)

Hyperplane: (i+j+k=const)
• Non-contiguous memory access
• Shared-memory parallelization / vectorization of innermost loop

Hyperline: (i, j+k=const)
• Shared-memory parallelization of (j+k=const) loop
• Contiguous memory access for innermost (i) loop
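The three traversal orders can be compared on a model recurrence that has the same dependency pattern as the SIP forward substitution. This is a sketch, not the SIP solver itself: the coefficient `c` and the right-hand side are made up, and a dictionary keyed by (i, j, k) replaces the real arrays so that any access to a not-yet-computed neighbor would fail loudly.

```python
def neighbor(r, i, j, k):
    """Value at (i, j, k), or 0.0 outside the lattice (negative index)."""
    if i < 0 or j < 0 or k < 0:
        return 0.0
    return r[i, j, k]   # a KeyError here would mean a violated dependency


def update(r, rhs, i, j, k, c):
    """Model recurrence: (i,j,k) needs (i-1,j,k), (i,j-1,k) and (i,j,k-1)."""
    return rhs[i, j, k] - c * (
        neighbor(r, i - 1, j, k) + neighbor(r, i, j - 1, k) + neighbor(r, i, j, k - 1)
    )


def sweep_3d(rhs, n, c=0.25):
    """3-fold nested loop: good locality, but every loop carries a dependency."""
    r = {}
    for k in range(n):
        for j in range(n):
            for i in range(n):
                r[i, j, k] = update(r, rhs, i, j, k, c)
    return r


def sweep_hyperplane(rhs, n, c=0.25):
    """Planes i+j+k=const: all points of one plane are independent."""
    r = {}
    for h in range(3 * n - 2):
        plane = [(i, j, h - i - j) for i in range(n) for j in range(n)
                 if 0 <= h - i - j < n]
        # New values are computed from older planes only, then stored together.
        r.update({p: update(r, rhs, *p, c) for p in plane})
    return r


def sweep_hyperline(rhs, n, c=0.25):
    """Lines j+k=const: independent lines, sequential contiguous i loop."""
    r = {}
    for l in range(2 * n - 1):
        lines = [(j, l - j) for j in range(n) if 0 <= l - j < n]
        for j, k in lines:          # candidates for shared-memory parallelization
            for i in range(n):
                r[i, j, k] = update(r, rhs, i, j, k, c)
    return r
```

All three functions return the same values; they differ only in which loop is free of dependencies and in whether the innermost accesses are contiguous in i.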

SIP-solver: Implementations & Single Processor Performance

[Bar chart: MFlop/s (0-300) for the 3D, hyperplane, and hyperline implementations on HSR8k, MIPS R14k, Intel P4, and IBM Power4]

Benchmark:
• Lattice: 91^3
• 100 MB
• 1 ILU
• 500 iterations

HSR8k-F1:
• 3D: unrolling 32 times

IBM Power4:
• 128 MB L3 cache accessible for 1 proc.

SIP-solver: Implementations & Shared-memory scalability

[Plots, fixed problem size 91^3: MFlop/s vs. number of processors (1, 4, 8, 16) for hyperplane and hyperline on HSR8k-F1 and IBM Power4]

[Plot, varying problem size: MFlop/s vs. memory (4 MB, 100 MB, 1 GB) for HSR8k-F1 (3D) (8p) and IBM Power4 (hl) (8p)]

Summary & Outlook

• Efficient use of Hitachi SR8000:
    • Vector codes: Pseudo-Vectorization + COMPAS
    • High level of loop unrolling: large register set
• Hitachi SR8000 techniques are forward-looking:
    • Large register set / many outstanding memory references
    • High memory bandwidth: single processor and SMP node
• Optimization techniques for new architectures:
    • IBM Power4: shared (large) caches
    • Intel Itanium2/3: EPIC; large register set
• Parallel programming techniques for SMP clusters: MPP model (pure MPI) or hybrid model (MPI+OpenMP/automatic)
