Agile High-Performance Software Development

(1)

Agile High-Performance Software Development

Chris Mueller and Andrew Lumsdaine

Open Systems Lab/Indiana University

RIDMS-2

(2)

Modern Processors

IBM Cell BE

(3)

* Featuring *

⇒ Advanced “make” build system!
⇒ Cutting edge “gdb” debugger!
⇒ Unparalleled C standard library!
⇒ Works with any text editor!

*Auto-parallelizing, auto-simdizing, optimizing compiler not yet available. For maximum SIMD performance, use of assembly may be required.

Void where prohibited, prohibited where void.

(4)

A Brief History of High Performance Computing

(Commodity hardware and language edition)

1950s FORTRAN

John Backus, et al.

Captures and improves common assembly practices for scientific computing

1970s (BCPL)/C

Dennis Ritchie, et al.

Captures and simplifies best assembly practices for systems programming

1990s Java

James Gosling, et al.

Abstract, single-processor machine model + runtime optimizer for all computing tasks, provides rich environment for Web applications

VB/Python/Perl

van Rossum, Wall, et al.

Scripting language + low-level language for rapid application development

1980s

(mini/micro computers) (personal computers)

(commodity SIMD, dual processor)

(heterogeneous multi-core pushes C to its semantic limits)

(5)

State of the Art for High Performance Computing

1950s FORTRAN

John Backus, et al.

1970s (BCPL)/C

Dennis Ritchie, et al.

(mini/micro computers)

Is there an alternative?

(7)

Our Approach

Take a modern programming technique…

…provide direct access to the hardware…

… and let programmers explore the SIMD and multi-core design spaces.

(10)

CorePy

A layered collection of Python libraries for generating and executing high-performance code at run-time.

Types, Control Flow, and Optimizers: Variables, Iterators, Extended Instructions, Memory Models

Hardware/OS Abstractions: Instruction Set Architecture (ISA), Instruction Stream, Processor, Memory

Platforms: PPC, AltiVec/VMX, SPU, Linux, OS X

(11)

A Simple Example

    c = InstructionStream()
    ppc.set_active_code(c)
    ppc.addi(gp_return, 0, 31)
    ppc.addi(gp_return, gp_return, 11)
    p = Processor()
    r = p.execute(c)
    print r
    --> 42

r = ((0 + 31) + 11)
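The example above synthesizes two instructions at run time and then executes them. The build-then-execute flow can be sketched in plain Python as follows. This is only a toy model, not CorePy: `ToyStream`, `toy_addi`, `toy_execute`, and the register number in `GP_RETURN` are hypothetical names chosen for illustration, and the stream is interpreted here rather than turned into machine code.

```python
# Toy model of "synthesize instructions, then execute them".
# All names are hypothetical; this is not the CorePy API.

GP_RETURN = 3  # assumption: register number that holds the return value


class ToyStream:
    """Collects instructions as plain tuples."""

    def __init__(self):
        self.instructions = []

    def add(self, inst):
        self.instructions.append(inst)


def toy_addi(stream, dest, src, imm):
    # Mirrors PowerPC addi: a source operand of 0 means the literal value 0,
    # not the contents of register 0.
    stream.add(("addi", dest, src, imm))


def toy_execute(stream, num_regs=32):
    """Interpret the stream and return the value left in GP_RETURN."""
    regs = [0] * num_regs
    for op, dest, src, imm in stream.instructions:
        if op != "addi":
            raise ValueError("unsupported instruction: %r" % (op,))
        base = 0 if src == 0 else regs[src]
        regs[dest] = base + imm
    return regs[GP_RETURN]


code = ToyStream()
toy_addi(code, GP_RETURN, 0, 31)          # r = 0 + 31
toy_addi(code, GP_RETURN, GP_RETURN, 11)  # r = r + 11
result = toy_execute(code)
print(result)
```

Nothing is computed while the two `toy_addi` calls run; they only record instructions. The arithmetic happens in `toy_execute`, which plays the role of the processor.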

(12)

Variables

CorePy Variables encapsulate a register, backing store, and valid operations for a user-defined data type.

Scalar example:

    a = SignedWord(11)
    b = SignedWord(31)
    c = SignedWord(0, reg=gp_return)
    c.v = (a + b) * 10
    --> c = 420

Vector example:

    a = VecWord([2,3,4,5])
    b = VecWord([3,3,3,3])
    c = VecWord(0)
    c.v = vmin(a, b) * b + 10
    --> c = [16, 19, 19, 19]
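One way to picture a variable that "encapsulates a register and valid operations" is a Python class whose operators record instructions instead of computing values. The sketch below is an illustration under that assumption, not CorePy's implementation: `ToyWord`, its register numbering, and the `run` interpreter are all hypothetical.

```python
# Toy "variable" that records instructions instead of computing values.
# Hypothetical names throughout; this is not the CorePy API.

class ToyWord:
    _next_reg = 0  # naive register allocator shared by all ToyWords

    def __init__(self, code, value=None):
        self.code = code
        self.reg = ToyWord._next_reg
        ToyWord._next_reg += 1
        if value is not None:
            code.append(("load", self.reg, value))

    def _binary(self, op, other):
        # Plain integers become immediates; ToyWords are named by register.
        if isinstance(other, ToyWord):
            operand = ("reg", other.reg)
        else:
            operand = ("imm", other)
        result = ToyWord(self.code)
        self.code.append((op, result.reg, self.reg, operand))
        return result

    def __add__(self, other):
        return self._binary("add", other)

    def __mul__(self, other):
        return self._binary("mul", other)

    @property
    def v(self):
        return self

    @v.setter
    def v(self, expr):
        # Assigning to .v records a copy from the expression's register.
        self.code.append(("move", self.reg, expr.reg))


def run(code):
    """Interpret the recorded instructions and return the register file."""
    regs = {}
    for inst in code:
        if inst[0] == "load":
            regs[inst[1]] = inst[2]
        elif inst[0] == "move":
            regs[inst[1]] = regs[inst[2]]
        else:
            op, dest, src, (kind, val) = inst
            rhs = regs[val] if kind == "reg" else val
            regs[dest] = regs[src] + rhs if op == "add" else regs[src] * rhs
    return regs


code = []
a = ToyWord(code, 11)
b = ToyWord(code, 31)
c = ToyWord(code, 0)
c.v = (a + b) * 10      # records add, mul, move; nothing is computed yet
regs = run(code)
print(regs[c.reg])
```

The vector example follows the same pattern elementwise: the minimum of [2,3,4,5] and [3,3,3,3] is [2,3,3,3], multiplying by b gives [6,9,9,9], and adding 10 gives [16,19,19,19].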

(13)

Iterators

Iterators enable user-defined loop semantics.

    # Basic Iteration
    a = SignedWord(c, 0)
    for i in syn_iter(c, 5):
        for j in syn_iter(c, 5, mode = 'ctr'):
            a.v = a + 1
    proc.execute(c)
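A plausible reading of "user-defined loop semantics" is that the Python `for` body runs once, at code-generation time, while the iterator wraps whatever the body emits in a run-time loop. The sketch below illustrates that idea with a generator that yields exactly once. It is an assumption about the general technique, not a description of how `syn_iter` is implemented; `Program`, `toy_syn_iter`, and `run` are hypothetical.

```python
# Toy "synthetic iterator": the Python loop body runs once and emits
# instructions; the emitted loop repeats them at run time.
# Hypothetical names; this is not the CorePy API.

class Program:
    def __init__(self):
        self.root = []
        self.stack = [self.root]   # innermost open block is last

    def emit(self, inst):
        self.stack[-1].append(inst)


def toy_syn_iter(prog, count):
    body = []
    prog.emit(("loop", count, body))
    prog.stack.append(body)    # instructions emitted by the body land here
    yield "loop_counter"       # yield once: the Python body executes once
    prog.stack.pop()           # close the loop when the for statement ends


def run(block, env):
    """Interpret a block, repeating each loop body `count` times."""
    for inst in block:
        if inst[0] == "loop":
            _, count, body = inst
            for _ in range(count):
                run(body, env)
        else:                  # ("inc", name, amount)
            _, name, amount = inst
            env[name] += amount
    return env


prog = Program()
for i in toy_syn_iter(prog, 5):
    for j in toy_syn_iter(prog, 5):
        prog.emit(("inc", "a", 1))   # emitted once, executed 25 times

env = run(prog.root, {"a": 0})
print(env["a"])
```

Because the iterator owns the loop prologue and epilogue, a different iterator can change how the loop is realized (for example, a different counter mechanism) without touching the body.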

(14)

Iterator Examples

    # Array iteration
    for x in var_iter(c, a): sum.v = sum + x
    for x in vec_iter(c, a): sum.v = sum + x

    # Data stream merge
    for x,y,z,r in zip_iter(c, X,Y,Z,R):
        r = vmadd(x,y,z)

    # Loop unrolling
    for x in unroll(vec_iter(c, a), 3): body(x)

    # Auto-parallelization
    for x in parallel(vec_iter(c, a)): body(x)
    t1 = proc.execute(c, mode='async', params=[0,2,0])
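To make the loop-unrolling line more concrete, here is a toy sketch of how an unrolling wrapper could work when the loop body is a code generator: the wrapper yields several times, so the Python body emits several copies, and the run-time trip count shrinks accordingly. This is an illustration only and does not claim to match CorePy's `unroll`; `toy_iter`, `toy_unroll`, and `run` are hypothetical, and the sketch assumes the trip count is a multiple of the unroll factor.

```python
# Toy sketch of loop unrolling over a code-generating iterator.
# Hypothetical names; this is not the CorePy API.

def toy_iter(code, count):
    """Emit one run-time loop of `count` trips; the Python body runs once."""
    body = []
    code.append(("loop", count, body))
    yield body


def toy_unroll(code, count, factor):
    """Emit count // factor trips whose body is generated `factor` times."""
    if count % factor:
        raise ValueError("this sketch assumes count is a multiple of factor")
    body = []
    code.append(("loop", count // factor, body))
    for _ in range(factor):
        yield body       # each yield lets the Python body emit one more copy


def run(block, env):
    """Interpret a block, repeating each loop body `count` times."""
    for inst in block:
        if inst[0] == "loop":
            for _ in range(inst[1]):
                run(inst[2], env)
        else:            # ("add", name, amount)
            env[inst[1]] += inst[2]
    return env


rolled, unrolled = [], []
for body in toy_iter(rolled, 12):
    body.append(("add", "sum", 1))
for body in toy_unroll(unrolled, 12, 3):
    body.append(("add", "sum", 1))

rolled_env = run(rolled, {"sum": 0})
unrolled_env = run(unrolled, {"sum": 0})
print(rolled)
print(unrolled)
print(rolled_env["sum"], unrolled_env["sum"])
```

Both programs perform the same 12 additions; the unrolled one does so in 4 trips of a 3-instruction body, which is the trade the programmer is exploring when changing the unroll factor.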

(15)

CorePy Research Model

• Use CorePy to develop real applications
• Use Python for coarse-grained application and data flow
• Use CorePy libraries for high-performance code sections
• Identify common implementation patterns
  - esp. SIMD/multi-core
• Generalize patterns into library components

(16)

Example: Particle System

    for vel, point in parallel(zip_iter(c, vels, points)):
        # Forces - Gravity and air resistance
        vel.v = vel + gravity
        vel.v = vel + vmadd(vsel(one, negone, (zero > vel)), air, zero)
        point.v = point + vel

        # Bounce off the zero extents (floor and left wall)
        # and positive extents (ceiling and right wall)
        vel.v = vmadd(vel, vsel(one, floor, (zero > point)), zero)
        vel.v = vmadd(vel, vsel(one, negone, (point > extents)), zero)

        # Add a 'floor' at y = 1.0
        point.v = vsel(point, one, (one > point))

Development Iterations:

v1: Numeric Python (~20k particles/sec)
v2: CorePy “asm” (~200k particles/sec)
v3: CorePy variables/iters (~200k particles/sec)
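The branch-free style of the update (select plus multiply-add instead of `if` statements) is easier to follow with a scalar reference. The sketch below applies the same sequence of steps to a single velocity/position component in plain Python. It assumes `vsel(a, b, mask)` picks `b` where the mask is true and `a` elsewhere, and that `vmadd(a, b, c)` computes `a * b + c`; the constant values are made up for illustration and are not from the original application.

```python
# Scalar, per-component reference for the particle update (plain Python).
# Assumptions: vsel(a, b, mask) returns b where mask is true, else a;
# vmadd(a, b, c) returns a * b + c. Constant values are illustrative only.

def vsel(a, b, mask):
    return b if mask else a


def vmadd(a, b, c):
    return a * b + c


def step(vel, point, gravity, air, floor, extent):
    one, negone, zero = 1.0, -1.0, 0.0
    # Forces: gravity, then air resistance scaled by the sign of the velocity.
    vel = vel + gravity
    vel = vel + vmadd(vsel(one, negone, zero > vel), air, zero)
    point = point + vel
    # Bounce: scale velocity by `floor` below zero, reverse it past the extent.
    vel = vmadd(vel, vsel(one, floor, zero > point), zero)
    vel = vmadd(vel, vsel(one, negone, point > extent), zero)
    # Clamp the position so it never drops below 1.0.
    point = vsel(point, one, one > point)
    return vel, point


# Hypothetical constants: downward gravity, drag opposing motion,
# and a floor factor that reverses and halves the velocity on impact.
constants = dict(gravity=-1.0, air=-0.5, floor=-0.5, extent=100.0)

rising = step(2.0, 10.0, **constants)
falling = step(-3.0, 2.0, **constants)
print(rising)
print(falling)
```

In the vector version every lane executes the same instructions, and the comparisons produce masks that `vsel` uses to choose per-lane values, so no branches are needed inside the loop.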

(17)

Example: BLASTP on the Cell

⇒ Cell SPU support
⇒ Blocked memory components
⇒ “Stream shift” iterator
⇒ Instruction replication
⇒ Python multi-core control components

(18)

Community Projects

• Cell SPU Big Num library (Andrew Friedley)
  - ~5G inst/s on 1 SPU
• Image processing, fractals (Ben Martin)
• DGEMM/BLAS (Andrew Lumsdaine)
• Generic Convolution Framework (Alex Breuer)

(19)

Thank You!

Funding: Lilly Endowment

Support and Feedback:
IBM Cell Ecosystem Team, especially: Hema Reddy, Gordon Ellison, Jennifer Turner, Bob Arenburg
Ben Martin, Andrew Friedley, Alex Breuer, Jeremiah Willcock

www.synthetic-programming.org
[email protected]