Agile High-Performance Software Development
Chris Mueller and Andrew Lumsdaine
Open Systems Lab/Indiana University
RIDMS-2
Modern Processors
IBM Cell BE
* Featuring *
⇒ Advanced “make” build system!
⇒ Cutting edge “gdb” debugger!
⇒ Unparalleled C standard library!
⇒ Works with any text editor!
*Auto-parallelizing, auto-simdizing, optimizing compiler not yet available. For maximum SIMD performance, use of assembly may be required.
Void where prohibited, prohibited where void.
A Brief History of High Performance Computing
(Commodity hardware and language edition)
1950s FORTRAN
John Backus, et al.
Captures and improves common assembly practices for scientific computing
1970s (BCPL)/C
Dennis Ritchie, et al.
Captures and simplifies best assembly practices for systems programming
1990s Java
James Gosling, et al.
Abstract, single-processor machine model + runtime optimizer for all computing tasks,
provides rich environment for Web applications
VB/Python/Perl
van Rossum, Wall, et al.
Scripting language + low level language for rapid application development
1980s
(mini/micro computers) (personal computers)
(commodity SIMD, dual processor)
(heterogeneous multi-core pushes C to its semantic limits)
State of the Art for High Performance Computing
(Commodity hardware and language edition)
1950s FORTRAN
John Backus, et al.
Captures and improves common assembly practices for scientific computing
1970s (BCPL)/C
Dennis Ritchie, et al.
Captures and simplifies best assembly practices for systems programming
(mini/micro computers)
Is there an alternative?
Our Approach
Take a modern programming technique…
…provide direct access to the hardware…
…and let programmers explore the SIMD and multi-core design spaces.
CorePy
A layered collection of Python libraries for generating and executing high-performance code at run-time.
Types, Control Flow, and Optimizers: Variables, Iterators, Extended Instructions, Memory Models
Hardware/OS Abstractions: Instruction Set Architecture (ISA), Instruction Stream, Processor, Memory
PPC, AltiVec/VMX, SPU, Linux, OS X
A Simple Example
c = InstructionStream()
ppc.set_active_code(c)
ppc.addi(gp_return, 0, 31)
ppc.addi(gp_return, gp_return, 11)
p = Processor()
r = p.execute(c)
print r
# --> 42, since r = ((0 + 31) + 11)
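To make the flow of this example concrete without PowerPC hardware, here is a minimal pure-Python sketch of the same idea: build a list of instructions, then hand it to something that executes it. `ToyInstructionStream`, `ToyProcessor`, and `GP_RETURN` are hypothetical stand-ins invented for illustration; they are not CorePy classes, and the real library emits and runs native machine code rather than interpreting tuples.

```python
# Toy model only: hypothetical stand-ins, not CorePy's implementation.

class ToyInstructionStream:
    """Collects (opcode, operands) tuples in program order."""
    def __init__(self):
        self.instructions = []

    def add(self, opcode, *operands):
        self.instructions.append((opcode, operands))


class ToyProcessor:
    """Interprets a stream against a small register file."""
    def __init__(self, num_registers=32):
        self.num_registers = num_registers

    def execute(self, stream, return_register=3):
        regs = [0] * self.num_registers
        for opcode, operands in stream.instructions:
            if opcode == "addi":
                dest, src, imm = operands
                # PowerPC addi uses the literal 0 (not r0) when the source field is 0.
                base = 0 if src == 0 else regs[src]
                regs[dest] = base + imm
            else:
                raise ValueError("unknown opcode: %s" % opcode)
        return regs[return_register]


GP_RETURN = 3  # assumed: r3 carries the return value in the PowerPC ABI

code = ToyInstructionStream()
code.add("addi", GP_RETURN, 0, 31)          # r3 = 0 + 31
code.add("addi", GP_RETURN, GP_RETURN, 11)  # r3 = r3 + 11
result = ToyProcessor().execute(code)
print(result)  # prints 42
```

The point of the sketch is the separation of concerns: Python code runs once to describe the program, and a separate step executes it.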
Variables
CorePy Variables encapsulate a register, backing store, and valid operations for a user-defined data type.
Scalar example:
a = SignedWord(11)
b = SignedWord(31)
c = SignedWord(0, reg=gp_return)
c.v = (a + b) * 10
# --> c = 420

Vector example:
a = VecWord([2,3,4,5])
b = VecWord([3,3,3,3])
c = VecWord(0)
c.v = vmin(a, b) * b + 10
# --> c = [16, 19, 19, 19]
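One way a variable type can turn ordinary Python expressions into instructions is operator overloading: each `+` or `*` records an instruction instead of computing a value. The sketch below illustrates that technique under stated assumptions. `ToyWord`, its opcode names, and the `run` interpreter are hypothetical and are not CorePy's API; register allocation here is a simple counter.

```python
# Hypothetical sketch: ToyWord and its opcodes are illustrative, not CorePy's API.

class ToyWord:
    """A value bound to a register; arithmetic on it records instructions."""
    next_reg = 0
    program = []  # shared list of (opcode, dest, lhs, rhs) tuples

    def __init__(self, value=None):
        self.reg = ToyWord.next_reg
        ToyWord.next_reg += 1
        if value is not None:
            ToyWord.program.append(("load_imm", self.reg, value, None))

    def _emit(self, opcode, other):
        if not isinstance(other, ToyWord):
            other = ToyWord(other)  # materialise a constant into a register
        result = ToyWord()
        ToyWord.program.append((opcode, result.reg, self.reg, other.reg))
        return result

    def __add__(self, other):
        return self._emit("add", other)

    def __mul__(self, other):
        return self._emit("mul", other)


def run(program):
    """Interpret the recorded instructions and return the register file."""
    regs = {}
    for opcode, dest, lhs, rhs in program:
        if opcode == "load_imm":
            regs[dest] = lhs
        elif opcode == "add":
            regs[dest] = regs[lhs] + regs[rhs]
        elif opcode == "mul":
            regs[dest] = regs[lhs] * regs[rhs]
    return regs


a = ToyWord(11)
b = ToyWord(31)
c = (a + b) * 10   # nothing is computed here; instructions are only recorded
regs = run(ToyWord.program)
print(regs[c.reg])  # prints 420
```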
Iterators
Iterators enable user-defined loop semantics.
# Basic Iteration
a = SignedWord(c, 0)
for i in syn_iter(c, 5):
    for j in syn_iter(c, 5, mode='ctr'):
        a.v = a + 1
proc.execute(c)
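A Python generator is one plausible way to get user-defined loop semantics at code-generation time: it emits the loop setup, yields once so the `for` body can emit its instructions, then emits the counter update and branch. The sketch below assumes that design. `toy_syn_iter`, the opcode names, and the `run` interpreter are hypothetical illustrations and not CorePy's implementation.

```python
# Hypothetical sketch: toy_syn_iter and the opcodes below are illustrative only.

def toy_syn_iter(program, count, counter_reg):
    """Emit loop setup, yield once for the body, then emit the branch."""
    program.append(("set", counter_reg, 0))
    loop_start = len(program)   # index of the first body instruction
    yield counter_reg           # the for-body runs exactly once, at generation time
    program.append(("inc", counter_reg, 1))
    program.append(("branch_if_less", counter_reg, count, loop_start))


def run(program, num_regs=4):
    """Interpret the toy program and return the final register values."""
    regs = [0] * num_regs
    pc = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "set":
            regs[op[1]] = op[2]
        elif op[0] == "inc":
            regs[op[1]] += op[2]
        elif op[0] == "branch_if_less":
            if regs[op[1]] < op[2]:
                pc = op[3]
                continue
        pc += 1
    return regs


program = [("set", 0, 0)]       # register 0 plays the role of `a`
for i in toy_syn_iter(program, 5, counter_reg=1):
    for j in toy_syn_iter(program, 5, counter_reg=2):
        program.append(("inc", 0, 1))   # a = a + 1, emitted once

regs = run(program)
print(len(program), regs[0])  # prints: 8 25
```

Only eight instructions are generated, yet the increment executes 25 times when the program runs, because the looping happens in the generated code rather than in Python.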
Iterator Examples
# Array iteration
for x in var_iter(c, a): sum.v = sum + x
for x in vec_iter(c, a): sum.v = sum + x
# Data stream merge
for x,y,z,r in zip_iter(c, X,Y,Z,R):
    r = vmadd(x,y,z)
# Loop unrolling
for x in unroll(vec_iter(c, a), 3): body(x)
# Auto-parallelization
for x in parallel(vec_iter(c, a)): body(x)
t1 = proc.execute(c, mode='async', params=[0,2,0])
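Loop unrolling fits the same generator pattern: if the iterator yields several times, the body is emitted several times per trip through the loop, and the counter advances by the unroll factor. The following is a rough sketch under that assumption. `toy_unrolled_iter`, the `add_elem` opcode, and the interpreter are hypothetical names made up for this illustration and do not describe how CorePy's `unroll` is written.

```python
# Hypothetical sketch: toy_unrolled_iter is illustrative, not CorePy's unroll().

def toy_unrolled_iter(program, length, factor, index_reg):
    """Yield `factor` offsets so the body is emitted `factor` times per trip."""
    if length % factor != 0:
        raise ValueError("this sketch assumes length is a multiple of factor")
    program.append(("set", index_reg, 0))
    loop_start = len(program)
    for offset in range(factor):
        yield offset            # caller emits one copy of the body per offset
    program.append(("inc", index_reg, factor))
    program.append(("branch_if_less", index_reg, length, loop_start))


def run(program, data):
    """Interpret the toy program; return the sum and the number of taken branches."""
    regs = {"sum": 0, "i": 0}
    pc = 0
    branches_taken = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "set":
            regs[op[1]] = op[2]
        elif op[0] == "inc":
            regs[op[1]] += op[2]
        elif op[0] == "add_elem":   # sum += data[i + offset]
            regs["sum"] += data[regs["i"] + op[1]]
        elif op[0] == "branch_if_less":
            if regs[op[1]] < op[2]:
                branches_taken += 1
                pc = op[3]
                continue
        pc += 1
    return regs["sum"], branches_taken


data = [1, 2, 3, 4, 5, 6]
program = []
for offset in toy_unrolled_iter(program, len(data), 3, "i"):
    program.append(("add_elem", offset))

total, branches_taken = run(program, data)
print(total, branches_taken)  # prints: 21 1
```

With a factor of 3 over six elements, the generated loop takes one backward branch instead of the five a one-element-per-trip loop would take.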
CorePy Research Model
Use CorePy to develop real applications
Use Python for coarse-grained application and data flow
Use CorePy libraries for high-performance code sections
Identify common implementation patterns
esp. SIMD/multi-core
Generalize patterns into library components
Example: Particle System
for vel, point in parallel(zip_iter(c, vels, points)):
    # Forces - Gravity and air resistance
    vel.v = vel + gravity
    vel.v = vel + vmadd(vsel(one, negone, (zero > vel)), air, zero)
    point.v = point + vel
    # Bounce off the zero extents (floor and left wall)
    # and positive extents (ceiling and right wall)
    vel.v = vmadd(vel, vsel(one, floor, (zero > point)), zero)
    vel.v = vmadd(vel, vsel(one, negone, (point > extents)), zero)
    # Add a 'floor' at y = 1.0
    point.v = vsel(point, one, (one > point))

Development Iterations:
v1: Numeric Python (~20k particles/sec)
v2: CorePy “asm” (~200k particles/sec)
v3: CorePy variables/iters (~200k particles/sec)
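For readers who want to check the arithmetic of the particle update, here is a plain-Python reference of one step for a single 2-D particle. It assumes `vsel(a, b, mask)` picks `b` where the mask is true and `a` otherwise (the AltiVec select convention) and that `vmadd(a, b, c)` computes `a * b + c` per element. The values of `gravity`, `air`, `floor`, and `extents` are assumptions chosen for illustration; the talk does not give them.

```python
# Plain-Python reference; constants are assumed values, not taken from the talk.

def vsel(a, b, mask):
    """Per-element select: b where mask is true, otherwise a."""
    return [bi if m else ai for ai, bi, m in zip(a, b, mask)]

def vmadd(a, b, c):
    """Per-element multiply-add: a * b + c."""
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]

def vadd(a, b):
    """Per-element addition."""
    return [ai + bi for ai, bi in zip(a, b)]

def gt(a, b):
    """Per-element comparison a > b, producing a boolean mask."""
    return [ai > bi for ai, bi in zip(a, b)]

zero, one, negone = [0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]
gravity = [0.0, -0.5]      # assumed
air = [-0.25, -0.25]       # assumed drag, applied against the direction of motion
floor = [-0.5, -0.5]       # assumed bounce damping factor
extents = [100.0, 100.0]   # assumed upper bounds

def step(vel, point):
    # Forces: gravity and air resistance
    vel = vadd(vel, gravity)
    vel = vadd(vel, vmadd(vsel(one, negone, gt(zero, vel)), air, zero))
    point = vadd(point, vel)
    # Bounce off the zero extents and the positive extents
    vel = vmadd(vel, vsel(one, floor, gt(zero, point)), zero)
    vel = vmadd(vel, vsel(one, negone, gt(point, extents)), zero)
    # Clamp to the 'floor' at 1.0
    point = vsel(point, one, gt(one, point))
    return vel, point

vel, point = step([2.0, -1.0], [10.0, 1.5])
print(vel, point)  # prints: [1.75, -1.25] [11.75, 1.0]
```

The select-and-multiply pattern replaces every `if` with arithmetic, which is what lets the same update run branch-free across SIMD lanes.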
Example: BLASTP on the Cell
⇒ Cell SPU support
⇒ Blocked memory components
⇒ “Stream shift” iterator
⇒ Instruction replication
⇒ Python multi-core control components
Community Projects
Cell SPU Big Num library
(Andrew Friedley) ~5G inst/s on 1 SPU
Image processing, fractals
(Ben Martin)
DGEMM/BLAS
(Andrew Lumsdaine)
Generic Convolution Framework
(Alex Breuer)
Thank You!
Funding: Lilly Endowment
Support and Feedback:
IBM Cell Ecosystem Team, especially: Hema Reddy, Gordon Ellison, Jennifer Turner, Bob Arenburg
Ben Martin, Andrew Friedley, Alex Breuer, Jeremiah Willcock
www.synthetic-programming.org
[email protected]