• No results found

High Performance Computing Lab Exercises

N/A
N/A
Protected

Academic year: 2021

Share "High Performance Computing Lab Exercises"

Copied!
13
0
0

Loading.... (view fulltext now)

Full text

(1)

High Performance Computing

Lab Exercises

High Performance Computing

Lab Exercises

Rubin H Landau

With

Sally Haerer and Scott Clark

Computational Physics for Undergraduates

BS Degree Program: Oregon State University

“Engaging People in Cyber Infrastructure”

Support by EPICS/NSF & OSU

Main Store cache cache RAM CPU 32TB @ 111Mb/s 2M B 2 GB 32KB 6 GB/s

(2)

Programming for Virtual Memory

Programming for Virtual Memory

Programming for Virtual Memory

• Avoid page faults ($$)

• Avoid resource conflicts (|| I/O)

 hurt other users  hurt yourself A(1) A(2) A(3) A(N) M(1,1) M(2,1) M(3,1) M(N,N) M(N,1) M(1,2) M(2,2) M(3,2) Swap Space Page N RAM Page 1 Page 2 Page 3

A(1),..., A(16) A(2032),..., A(2048) Data Cache

Registers CPU

Virtual Memory

1. Worry only if use much memory

2. Look at entire program (global optimization)

3. Make successive calculations on data subsets that fit in RAM 4. Avoid simultaneous calculations on same data set [M2/Det(M)]

5. Group together in memory data used together (  M )

(3)

Comparison: Java vs F90 & C

Comparison: Java

Comparison: Java

vs

vs

F90 & C

F90 & C

Java

most universal, most portable

good for scientific programming (error, precision)slowest for HPC; days your t vs computer t?

is the wait a problem?

compiler: not designed for speed

interpreter Just-In-Time (JIT) compilerclass file work on all Java Virtual Machines

Fortran (legacy, CSE) & C (CS)

compilers perfected for speed

examines entire code, rewrites (e.g. nested do's)careful with arrays, cache lines

optimized for specific architecturescompiled code: not universal, portable

source code: often not 100% universal or portable not for business, Web

(4)

“Experiment”: Good & Bad

Virtual Memory Use

Experiment

Experiment

: Good & Bad

: Good & Bad

Virtual Memory Use

Virtual Memory Use

 Run simple examples, look inside your computer(s)  Measure time for each program (time in Unix)

 Write your own memory intensive force method:

Do j = 1, n Do i = 1, n

f12(i,j) = force12(pion(i), pion(j)) \\ Fill f12 f21(i,j) = force21(pion(i), pion(j)) \\ Fill f21 ftot = f12(i,j) + f21(i,j) \\ Compute EndDo

EndDo

 Each loop iteration requires all data  Large “working set size”

(5)

GOOD Program, Separate Loops

GOOD Program, Separate Loops

GOOD Program, Separate Loops

 Separate loops, separate data access

  1/2 working set size

 Groups together data used together  Yet, more complicated, less elegant

Do j = 1, n Do i = 1, n

f12(i,j) = force12(pion(i), pion(j)) \\ Fill f12 EndDos

Do j = 1, n Do i = 1, n

f21(i,j) = force21(pion(i), pion(j)) \\ Fill f21 EndDos

Do j = 1, n Do i = 1, n

ftot = f12(i,j) + f21(i,j) \\ Compute EndDos

(6)

Try it! Java vs Fortran Optimization

Try it! Java

Try it! Java

vs

vs

Fortran Optimization

Fortran Optimization

 Aim: make numerically intensive program run faster

 Compare languages & their options, increasing matrix size  Program tune solves matrix eigenvalue problem (power method):

 Almost diagonal  E

Hi,I , c

=  H[2000][2000]  4,000,000 elements

 11 iterations  smallest E to 6 places

⎡ ⎢ ⎢ ⎢ ⎣ 1 0 0 .. . ⎤ ⎥ ⎥ ⎥ ⎦ [H] [c] = E[c] (1) E =?, [c] = ? (2) [Hi,j] = ⎡ ⎢ ⎢ ⎢ ⎣ 1 0.3 0.09 0.027 . . . 0.3 2 0.3 0.9 . . . 0.09 0.3 3 0.3 . . . .. . ⎤ ⎥ ⎥ ⎥ ⎦ (3)

(7)

Effects of: Dimensions, Optimize, Languages

Effects of: Dimensions, Optimize, Languages

Effects of: Dimensions, Optimize, Languages

for ( i= 1; i <= L -2; i = i+2)

{ ovlp1 = ovlp1 + coef[i] * coef[i] ;

ovlp2 = ovlp2 + coef[i+1] * coef[i+1]; t1 = t1 + coef[j] * ham[j] [i];

t2 = t2 + coef[j] * ham[j] [i+1]; } sigma[i] = t1; sigma[i+1] = t2;

4. Loop unrolling

Improve memory access Compiler fills cache better Doubles speed (ugly)

Tim e ( se co nds ) J a v a , S o l a r i s Execution Time vs Matrix Size

Matrix Dimension 100 1000 10 F o r t r a n , S o l a r i s Matrix Dimension 100 1000 10,000 1 1 0.01 10 10 100 1000 10,000 100 0.1 0.1 Time (sec)

1. Try optimize compiler options (-O1, -O2,-O3) 2. Language: 3-10 X time difference

(8)

High Performance Computing

Lab Exercises

(part II)

High Performance Computing

Lab Exercises

(part II)

Rubin H Landau

With

Sally Haerer and Scott Clark

Computational Physics for Undergraduates

BS Degree Program: Oregon State University

“Engaging People in Cyber Infrastructure”

Support by EPICS/NSF & OSU

Main Store cache cache RAM CPU 32TB @ 111Mb/s 2M B 2 GB 32KB 6 GB/s (examples)

(9)

Avoid Discontinuous Memory

page faults

Avoid Discontinuous Memory

Avoid Discontinuous Memory



page faults

zed, ylt, part: together on 1 memory page

med2: end of Common (after hog), different page

• Can't assure same page; improve your chances

Common/Double zed, ylt(9), part(9), zpart(100000), med2(9) Do j = 1, n

ylt(j) = zed * part(j) / med2(9) \\ Discontinuous

Common/Double zed, ylt(9), part(9), med2(9), zpart(100000) Do j = 1, n

ylt(j) = zed*part(j)/med2(9) \\ Continuous

Good

(10)

Method: Programming for Cache

Method: Programming for Cache

Method: Programming for Cache

Important for HPC : cache misses   10  T

Stride: # array elements stepped thru / op

(1)

 Rule: Keep the stride low, preferably 1

 Fortran (column): vary LH array index 1st

 C & Java (row): vary RH array index 1st

Very fast cache

Ultra-fast CPU Tr A = N X i=1 a(i, i) c(i) = x(i) + x(i + 1) Bad Good A(1) A(2) A(3) A(N) M(1,1) M(2,1) M(3,1) M(N,N) M(N,1) M(1,2) M(2,2) M(3,2) Swap Space Page N RAM Page 1 Page 2 Page 3

A(1),..., A(16) A(2032),..., A(2048)

Data Cache

Registers

CPU Fast

(11)

Exercise: Cache Misses

Exercise: Cache Misses

Exercise: Cache Misses

 Aim of HPC: have data to be processed in cache

 F90 2-D array storage: column-major order      C, Java 2-D array storage : row-major order   

 Compare times: M(column x column) vs M(row x row)

Do j = 1, 9999

x(j) = m(1,j) \\ Sequential column ref

Do j = 1, 9999

x(j) = m(j,1) \\ Sequential row ref

 1 version always makes large jumps through memory   RAM: virtual memory ≠ full story

(12)

Dimension Vec(idim, jdim) \\ Loop B Do i = 1, idim

Do j=1, jdim

Ans = Ans + Vec(i,j)*Vec(i,j) \\ Stride jdim EndDos

Dimension Vec(idim,jdim) \\ Loop A Do j = 1, jdim

Do i=1, idim

Ans = Ans + Vec(i,j)*Vec(i,j) \\ Stride 1 EndDos

Good & Bad Cache Flow

(OK to try these at home)

Good & Bad Cache Flow

Good & Bad Cache Flow

(OK to try these at home)

(OK to try these at home)

GOOD f90/BAD C, Minimum/Maximum Stride

BAD f90/GOOD C, Maximum/Minimum Stride

Vary row rapidly

(13)

Exercise: Large Matrix Multiplication

Exercise

Exercise

: Large

: Large

Matrix

Matrix

Multiplication

Multiplication



• Must compromise: as both row & columns used

 Bad f90/Good C, Max/Min Stride

Do i = 1, N \\ Row Do j = 1, N \\ Column c(i,j) = 0.0 \\ Initial Do k = 1, N c(i,j) = c(i,j) + a(i,k)*b(k,j) EndDos

 Good f90/Bad C, Min/Max Stride

Do j = 1, N Do i = 1, N c(i,j) = 0.0 EndDos Do k = 1, N Do i = 1, N c(i,j) = c(i,j) + a(i,k)*b(k,j) EndDos

[C

] = [A]

×

[B

],

c

ij

=

N

X

k=1

a

ik

×

b

kj i j k j i k

References

Related documents

This is obvious in the special case where the linear variety collapses to a single point: the closed-loop adaptive system has then an exponentially stable

Most algorithms for large item sets are related to the Apri- ori algorithm that will be discussed in Chapter IV-A2. All algorithms and methods are usually based on the same

The architecture basically contains three technical components: (i) the PMU Token itself, including its data storage formats and communication with its surroundings; (ii) the

Tujuan penelitian ini adalah mengetahui hasil belajar fisika sebelum dan setelah penerapan kombinasi metode pembelajaran complete sentence dengan giving question

Serverless Metadata compute nodes App 1 metadata / proj s3d data App 3 metadata / proj mpas repo App 2 metadata / proj ocean input. No such global namespace managed by a

May 2014 Supercomputing Resources in Jülich: Welcome and Introduction of JSC 15 0% 20% 40% 60% 80% 100% JUQUEEN JUROPA FZJ obligations FZJ projects JARA-HPC (regional) NIC

• Cost Estimating: Society of Cost Estimating and Analysis; Cost and Software Data Reporting (CSDR) Focus Group and Space Systems Cost Analysis