A Parallel 64K Complex FFT Algorithm for the IBM/Sony/Toshiba Cell Broadband Engine Processor

(1)

Jonathan Greene, Michael Pepe, Robert Cooper

Mercury Computer Systems

October 24, 2005

(2)

Goals

• Program a real-world problem that could demonstrate the highest Cell processor performance (really kick the Cell tires)

• Show Mercury’s Approach to Algorithm (Application) Design

Analyze the mathematics and develop an optimized mapping (data temporal and spatial mapping) to the hardware architecture (model)

Validate the model with key component measurements

Build the thing

Instrument the code and compare to the model

• Choose a problem where:

• The data would not fit in a single SPE’s local store

• But would fit in the aggregate of all 8 local stores

• SPE-to-SPE communication is essential to achieve optimal performance

Exploit the generous SPE-to-SPE bandwidth

• We chose a parallel (i.e., collective) implementation of N (N >= 1) 64K-point single-precision complex FFTs

• In achieving this goal, we were able to both reduce latency and

(3)

Cell Broadband Engine Architecture

[Figure: Cell Broadband Engine diagram with local store (LS) blocks]

(4)

Cell Broadband Engine Architecture


• 1 Power® core

With full VMX engine

• Not used in this algorithm

• 8 SPE cores, each with

128 bit SIMD vector unit

256K local store

MFC DMA engine

Overlapped DMA and local store access

• Element Interconnect Bus (EIB)

2 pairs of bidirectional rings

• XDR DRAM

• High speed external interfaces

(5)

Cell Broadband Engine Performance


• We assume 3 GHz clock

The Cell BE chip can run at higher frequencies

• 8 SPE cores

192 GFLOPS peak

• MFC DMA engine

24 GB/s bidirectional

• EIB

192 GB/s maximum sustainable performance

Realizable performance depends on access patterns

• XDR DRAM ↔ 8 Local Stores

24 GB/s maximum aggregate bandwidth

Performance depends on access patterns

(6)

Cell Broadband Engine

• Keys to performance

Decompose algorithm into chunks that can utilize 256K local store

Vectorize inner loop SPE code (4-way SIMD for 32-bit float operations)

Pay careful attention to XDR bandwidth utilization

• 24 GB/s is a lot of bandwidth, but it’s shared by 8 very powerful SIMD cores

Exploit SPE-to-SPE ring bandwidth if possible

Overlap computation with DMA using double or triple buffering

• Each SPE’s MFC supports numerous concurrent, non-blocking DMA transfers

Use 128-byte alignment of data and multiples of 128-byte transfers for maximum DMA performance

• Generally don’t need to worry about:

EIB access patterns

• There is almost always ring bandwidth to spare

XDR access patterns

(7)

Basic Algorithm

• Classical “2D” decomposition of a 1D FFT

Rabiner and Gold. Theory and Application of Digital Signal Processing. Prentice-Hall, 1975

(8)

Basic Algorithm

• View N-element FFT as NR x NC matrix in row-major order

[Figure: matrix with NR rows and NC columns]

(9)

Basic Algorithm

• Algorithm outline:

Perform NC NR-point column FFTs

[Figure: matrix with NR rows and NC columns]

(10)

Basic Algorithm

• Perform element-wise multiply by NR x NC complex twiddle matrix

[Figure: matrix with NR rows and NC columns]

(11)

Basic Algorithm

• Perform NR NC-point row FFTs

[Figure: matrix with NR rows and NC columns]

(12)

Basic Algorithm

• Perform NR NC-point row FFTs

• Transpose NR x NC matrix to NC x NR matrix

[Figure: matrix with NR rows and NC columns]
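The four steps of the outline (column FFTs, twiddle multiply, row FFTs, transpose) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' SPE code: the names `dft` and `fft_via_2d` are hypothetical, a naive DFT stands in for the optimized FFT kernels, and the twiddle factor exp(-2πi·k1·c/N) is the standard choice for this decomposition rather than something stated on the slides.

```python
import cmath

def dft(seq):
    """Naive O(n^2) DFT; a readable stand-in for an optimized FFT kernel."""
    n = len(seq)
    return [sum(seq[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_via_2d(x, nr, nc):
    """1D DFT of len(x) == nr * nc points via the row/column decomposition."""
    n = nr * nc
    assert len(x) == n
    # View the input as an NR x NC matrix in row-major order: a[r][c] = x[r*NC + c].
    a = [x[r * nc:(r + 1) * nc] for r in range(nr)]
    # Step 1: NC column FFTs of NR points each; result indexed cols[c][k1].
    cols = [dft([a[r][c] for r in range(nr)]) for c in range(nc)]
    # Step 2: element-wise multiply by the NR x NC twiddle matrix.
    b = [[cols[c][k1] * cmath.exp(-2j * cmath.pi * k1 * c / n) for c in range(nc)]
         for k1 in range(nr)]
    # Step 3: NR row FFTs of NC points each; result indexed rows[k1][k2].
    rows = [dft(row) for row in b]
    # Step 4: transpose NR x NC to NC x NR and read out in row-major order,
    # so that output index k = k2*NR + k1.
    return [rows[k1][k2] for k2 in range(nc) for k1 in range(nr)]
```

For a 64K FFT the slides use NR = NC = 256; the sketch works for any factorization, which makes it easy to check against a direct DFT on small inputs.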

(13)

Parallel Algorithm Across 8 SPEs

NC = 256

NR = 256

• 64K (256 x 256) FFT stored in XDR memory

• 512 KBytes each for input and output data

(14)

Parallel Algorithm Across 8 SPEs

• Each SPE processes a 256 x 32 region

[Figure: region 256 high and 32 wide]

(15)

Parallel Algorithm Across 8 SPEs

• Each SPE processes a 256 x 32 region

[Figure: matrix divided into regions for SPE0 through SPE7]

(16)

SPE Parallel Algorithm – SPE 2

• Focus on a representative SPE (SPE 2)

[Figure: SPE2]

(17)

SPE Parallel Algorithm – SPE 2

• Perform 32 column FFTs

[Figure: Buffer A and Buffer B]

(18)

SPE Parallel Algorithm – SPE 2

• Perform 32 column FFTs

[Figure: buffers A and B]

(19)

SPE Parallel Algorithm – SPE 2

• Element-wise multiply each 32 x 32 tile by twiddle matrix and transpose

• SPE_i Tile_j → SPE_j Tile_i

• Generate each 32 x 32 twiddle matrix on the fly

[Figure: buffers A and B]
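The exchange rule SPE_i Tile_j → SPE_j Tile_i amounts to a distributed matrix transpose. The sketch below simulates it on nested Python lists. The function name `exchange_tiles` and the data layout (each SPE holding a tall slab whose tiles are stacked vertically) are assumptions made for illustration, and the twiddle multiply is left out so that only the data movement is shown.

```python
def exchange_tiles(slabs, num_spes, t):
    """Simulate the all-to-all tile exchange.

    slabs[i][r][c] is SPE i's slab with num_spes*t rows and t columns.
    Tile j of SPE i (rows j*t .. j*t+t-1) is transposed and delivered
    to SPE j, where it becomes tile i.
    """
    rows = num_spes * t
    out = [[[None] * t for _ in range(rows)] for _ in range(num_spes)]
    for i in range(num_spes):          # sending SPE
        for j in range(num_spes):      # receiving SPE
            for p in range(t):
                for q in range(t):
                    # Element (p, q) of the source tile lands at (q, p) in the destination tile.
                    out[j][i * t + q][p] = slabs[i][j * t + p][q]
    return out
```

After the exchange, each column of an SPE's slab holds one full row of the original matrix, which is consistent with the later “row” FFTs being run with the same column-oriented primitive. On the Cell, `num_spes` would be 8 and `t` would be 32.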

(20)

SPE Parallel Algorithm – SPE 2

Timeslice 0

[Figure: buffers A and B]

(21)

SPE Parallel Algorithm – SPE 2

Timeslice 1

[Figure: buffers A and B, labeled SPE0]

(22)

SPE Parallel Algorithm – SPE 2

Timeslice 2

[Figure: buffers A and B, labeled SPE0 and SPE1]

(23)

SPE Parallel Algorithm – SPE 2

Timeslice 3

[Figure: buffers A and B, labeled SPE3]

(24)

SPE Parallel Algorithm – SPE 2

Timeslice 4

[Figure: buffers A and B, labeled SPE4]

(25)

SPE Parallel Algorithm – SPE 2

Timeslice 5

[Figure: buffers A and B, labeled SPE5]

(26)

SPE Parallel Algorithm – SPE 2

Timeslice 6

[Figure: buffers A and B, labeled SPE6]

(27)

SPE Parallel Algorithm – SPE 2

Timeslice 7

[Figure: buffers A and B, labeled SPE7]

(28)

SPE Parallel Algorithm – SPE 2

• Perform 32 “row” FFTs

[Figure: buffers A and B]

(29)

SPE Parallel Algorithm – SPE 2

• Perform 32 “row” FFTs

[Figure: buffers A and B]

(30)

SPE Parallel Algorithm – SPE 2

• Assemble result matrix in XDR

[Figure: SPE2]

(31)

SPE Parallel Algorithm

• Assemble result matrix in XDR

[Figure: SPE2]

(32)

SPE Parallel Algorithm

• Assemble result matrix in XDR

[Figure: SPE0 through SPE7]

(33)

SPE Parallel Algorithm

• Assemble result matrix in XDR

(34)

SPE Parallel Algorithm

• Assemble result matrix in XDR

(35)

Triple Buffering and XDR Transfers

• While two processing buffers are computing result i

[Figure: processing buffers]

(36)

Triple Buffering and XDR Transfers

• While two processing buffers are computing result i

• Third I/O buffer is transferring result i-1 to XDR

[Figure: buffers A, B, C; I/O buffer and processing buffers; result i-1]

(37)

Triple Buffering and XDR Transfers

• While two processing buffers are computing result i

• Third I/O buffer is transferring result i-1 to XDR

[Figure: buffers A, B, C; result i-1]

(38)

Triple Buffering and XDR Transfers

• While two processing buffers are computing result i

• Third I/O buffer is transferring result i-1 to XDR

• And then transferring input data i+1 from XDR

[Figure: buffers A, B, C; result i-1; input data i+1]

(39)

Triple Buffering and XDR Transfers

• While two processing buffers are computing result i

• Third I/O buffer is transferring result i-1 to XDR

• And then transferring input data i+1 from XDR

[Figure: buffers A, B, C; result i-1; input data i+1]

(40)

Triple Buffering and XDR Transfers

• While two processing buffers are computing result i

• Third I/O buffer is transferring result i-1 to XDR

• And then transferring input data i+1 from XDR

• At completion of algorithm, roles of buffers are rotated

[Figure: buffers B, A, C; I/O buffer and processing buffers; result i-1; input data i+1]
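The buffer rotation can be sketched as a small role schedule. The function name `triple_buffer_roles` and the exact rotation order are assumptions: the sketch supposes that result i ends up in the second processing buffer, so that buffer becomes the next I/O buffer while the old I/O buffer (now holding input i+1) becomes the next source.

```python
def triple_buffer_roles(num_ffts):
    """Return (fft index, (source, destination) processing buffers, I/O buffer) per FFT."""
    src, dst, io = "A", "B", "C"
    schedule = []
    for i in range(num_ffts):
        # While src/dst compute result i, the I/O buffer writes result i-1
        # to XDR and then reads input data i+1 from XDR.
        schedule.append((i, (src, dst), io))
        # Rotate roles: the I/O buffer (holding input i+1) becomes the next source,
        # dst (holding result i) becomes the next I/O buffer, and src is reused.
        src, dst, io = io, src, dst
    return schedule
```

Under these assumptions the roles repeat with period three, and a buffer is never a processing buffer and the I/O buffer in the same iteration.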

(41)

Local Store Usage Map

• Our 64K FFT algorithm requires approximately 253 Kbytes (out of the available 256 Kbytes) of local store in each SPE:

Stack size: 8K

SPE kernel code: 16K

FFT setup code: 5K

FFT shell code: 4K

FFT primitives (4): 8K

DMA lists (2): 8K

Data buffers (3): 192K

1D twiddles: 2K

2D twiddles: 10K

• Total: 253K
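The local store budget can be checked with a short tally. The dictionary keys are shorthand for the slide's entries, and the per-buffer size assumes each data buffer holds one 256 x 32 region of 8-byte single-precision complex samples.

```python
# Local store budget in Kbytes, as listed on the slide.
budget_kb = {
    "stack": 8,
    "spe_kernel_code": 16,
    "fft_setup_code": 5,
    "fft_shell_code": 4,
    "fft_primitives": 8,
    "dma_lists": 8,
    "data_buffers": 192,
    "twiddles_1d": 2,
    "twiddles_2d": 10,
}

total_kb = sum(budget_kb.values())   # 253
headroom_kb = 256 - total_kb         # 3

# Assumed buffer layout: one 256 x 32 region of complex floats (8 bytes each).
buffer_kb = 256 * 32 * 8 // 1024     # 64
three_buffers_kb = 3 * buffer_kb     # 192, matching the "Data buffers (3)" entry
```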

(42)

Latency vs. Throughput With Overlapped I/O

• Theoretical lowest latency = compute time + XDR time

Inter-SPE transfers are partially overlapped with compute time

• Theoretical fastest throughput

[Figure: timeline for one SPE (only one SPE shown): XDR transfers alongside SPE compute and inter-SPE transfers (get / put data from / to other SPEs), with FFT latency and FFT throughput marked along the time axis]

(43)

Performance Analysis

• We have successfully coded and run this implementation on 3 GHz Rev 3 HW.

• We get the right answers (just so you know we’re honest).

• What follows is a performance analysis of the model based on timings for the processing components and Cell documentation for the DMA components

(44)

CPU Performance Analysis

• Measured on the Cell simulator (@ 3 GHz)

Times for 1 SPE for 1 FFT computation

Ignores local store contention due to DMA transfers

Function | Calls per iteration | Clocks per call | Time per call (us) | Total time (us) | GFLOPS

kernel calls (e.g. DMA, sync) | 31 | ~3,000 | N/A | 1.0 | N/A

zfft_64k() (shell) | N/A | ~6,000 | N/A | 2.0 | N/A

make_dma_list() | 4 | 202 | 0.1 | 0.3 | N/A

zmat_row_emul() | 7 | 1,185 | 0.4 | 2.8 | 15.6

zmat_emul_trans() | 8 | 2,388 | 0.8 | 6.4 | 7.7

zfft_cols() (256 x 32) | 2 | 44,052 | 14.7 | 29.4 | 22.3 *

Totals for a 64K FFT | | | | 41.9 | 125.1 *

(45)

DMA Performance Analysis (XDR ↔ LS)

• Each 64K FFT requires one round trip of the data from/to XDR and the 8 Local Stores

• Data size: 8 * 64K = 512K

• 1 round trip requires transferring 2 * 512K = 1 Mbyte

@24 Gbytes/sec (peak): 1M / 24,000 = 43.7 us

@22 Gbytes/sec (de-rated): 1M / 22,000 = 47.7 us

• XDR ↔ LS transfers do not overlap with each other

• Gbytes/sec per SPE: 24/8 = 3 Gbytes/sec either in or out yields an average sustained rate of 1 byte per clock (@ 3 GHz)

• Each DMA access is 128 bytes wide so, on average, each SPE experiences a DMA access only every 128 clocks or less than 1% of the time!
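The XDR transfer arithmetic can be reproduced as follows. The helper `transfer_us` is a hypothetical name; the sketch assumes, as the slide's figures imply, that K and M are binary (1024-based) while Gbytes/sec is decimal, so 1 Gbyte/sec moves 1000 bytes per microsecond.

```python
KIB = 1024

def transfer_us(num_bytes, gbytes_per_sec):
    """Transfer time in microseconds, taking 1 Gbyte/sec as 1000 bytes per us."""
    return num_bytes / (gbytes_per_sec * 1000.0)

data_bytes = 8 * 64 * KIB            # 64K complex points, 8 bytes each = 512K
round_trip_bytes = 2 * data_bytes    # in from XDR and back out = 1 Mbyte

peak_us = transfer_us(round_trip_bytes, 24)      # about 43.7 us
derated_us = transfer_us(round_trip_bytes, 22)   # about 47.7 us

clock_ghz = 3.0
bytes_per_clock = (24 / 8) / clock_ghz           # 3 Gbytes/sec per SPE = 1 byte per clock
clocks_per_access = 128 / bytes_per_clock        # one 128-byte access per 128 clocks
busy_fraction = 1 / clocks_per_access            # under 1% of clocks
```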

(46)

DMA Performance Analysis (LS ↔ LS)

• Each 64K FFT requires 7/8 of the data to be transferred from SPE to SPE (i.e., LS ↔ LS)

• Size of each contiguous tile: 4 * 32 * 32 = 4K bytes

• Number of contiguous tile transfers: 2 * 7 * 8 = 112

• Total number of bytes transferred: 112 * 4K = 448K

• @192 Gbytes/sec peak: 448K / 192,000 = 2.4 us

• Gbytes/sec per SPE = 192/8 = 24 Gbytes/sec both in and out (48 combined) yields an average sustained rate of 16 bytes per clock (@ 3 GHz)

• Each DMA access is 128 bytes wide so, on average, each SPE experiences a DMA access every 8 clocks during these tile transfers or 12.5% of the time
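The same style of check applies to the inter-SPE traffic. The factors are taken as given on the slide; the variable names are illustrative, and the unit convention (binary K, decimal Gbytes/sec) is an assumption inferred from the quoted results.

```python
tile_bytes = 4 * 32 * 32                     # 4K bytes per contiguous tile
num_transfers = 2 * 7 * 8                    # 112 contiguous tile transfers
total_bytes = tile_bytes * num_transfers     # 448K bytes

# Cross-check: 7/8 of the 512K data set moves between SPEs.
seven_eighths_bytes = (7 * 512 * 1024) // 8

eib_us = total_bytes / (192 * 1000.0)        # about 2.4 us at 192 Gbytes/sec

bytes_per_clock = (24 + 24) / 3.0            # 48 Gbytes/sec combined at 3 GHz = 16
clocks_per_access = 128 / bytes_per_clock    # one 128-byte access every 8 clocks
busy_fraction = 1 / clocks_per_access        # 12.5% of clocks
```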

(47)

Combined CPU and I/O Performance Analysis

• DMA accesses have priority over SPU accesses

• XDR ↔ LS DMA contending with LS ↔ LS DMA

Assume equal priority for both types of DMA

LS ↔ LS transfers consume all available DMA bandwidth per SPE LS (DMA access every 16 clocks in both directions)

XDR ↔ LS degraded transfer time: 47.7 (de-rated) + 2.4/2 = 48.9 us

• XDR ↔ LS DMA contending with zfft_cols() primitive

zfft_cols() accesses LS about 85% of the time (instruction prefetches plus data loads and stores)

XDR ↔ LS DMA accesses any given LS < 1% of the time

So, using 1%, zfft_cols() degrades from 29.4 to 29.7 us (adds 0.3)

• LS ↔ LS DMA contending with 2D twiddle primitives

The 2D twiddle primitives access LS close to 100% of the time

LS ↔ LS DMA accesses any given LS 12.5% of the time

So, using 12.5%, the 2D twiddle primitives degrade from 9.2 to 9.2 + (2.4 * 0.125) = 9.5 us (adds 0.3)

(48)

Expected Latency and Throughput Analysis

• CPU time per SPE with DMA contention:

41.9 + 0.3 + 0.3 = 42.5 us

• XDR ↔ LS DMA time with LS ↔ LS DMA contention and de-rating of XDR ↔ LS bandwidth from 24 to 22 Gbytes/sec

47.7 + 1.2 = 48.9 us

• Latency: 42.5 + 48.9 = 91.4 us

• Throughput: max( 42.5, 48.9 ) = 48.9 us

• Parallel 64K FFT on Cell is I/O bound!
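The latency and throughput model can be written out directly from the numbers on the preceding slides. Variable names are illustrative; the inputs are the slide values, and results are rounded to one decimal place as the slides do.

```python
# Inputs from the CPU and DMA analyses (microseconds).
cpu_us = 41.9          # total SPE compute time for one 64K FFT
zfft_cols_us = 29.4    # portion spent in zfft_cols()
xdr_derated_us = 47.7  # XDR round trip at 22 Gbytes/sec
ls_ls_us = 2.4         # inter-SPE tile transfers at 192 Gbytes/sec

# Contention penalties as modeled on the slides.
zfft_penalty = zfft_cols_us * 0.01      # XDR DMA touches LS about 1% of the time
twiddle_penalty = ls_ls_us * 0.125      # LS-to-LS DMA touches LS 12.5% of the time

cpu_contended_us = round(cpu_us + zfft_penalty + twiddle_penalty, 1)   # 42.5
xdr_contended_us = round(xdr_derated_us + ls_ls_us / 2, 1)             # 48.9

latency_us = round(cpu_contended_us + xdr_contended_us, 1)             # 91.4
throughput_us = max(cpu_contended_us, xdr_contended_us)                # 48.9
io_bound = xdr_contended_us > cpu_contended_us                         # True
```

Because the XDR term is the larger of the two, the time per FFT in steady state is set by I/O rather than by compute.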

(49)

Performance Comparison – GFLOPS

64K Single Precision Complex FFT (GFLOPS)

Cell BE 3.0 GHz Throughput (Theoretical Analysis): 107.22 (modeled)

Cell BE 3.0 GHz Latency (Theoretical Analysis): 57.36

Pentium 4 Xeon 2.8 GHz intel-mkl-f: 2.41

IBM 970 (G5) 2GHz fftw3: 2.33

Opteron Model 246 64bit mode 2GHz fftw3: 2.11

FreeScale7448@975MHz SAL 7.3.0: 3.03

(50)

Performance Comparison – GFLOPS

GFLOPS by platform:

Cell (throughput): 107.22 (modeled), 90 (meas)

Cell (latency): 57.36

P4: 2.41

970: 2.33

Opteron: 2.11

7448: 3.03

(51)

Performance Comparison - Microseconds

64K Single Precision Complex FFT (microseconds)

Cell (throughput): 48.90

Cell (latency): 91.40

P4: 2173

970: 2248

Opteron: 2487

7448: 1728
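The GFLOPS and microsecond charts appear to be related by the conventional FFT operation count of 5 N log2 N. That convention is not stated on the slides and is assumed here, although it reproduces every charted value. The function name `fft_gflops` is illustrative.

```python
import math

def fft_gflops(n_points, time_us):
    """GFLOPS for an n-point complex FFT, assuming 5 * N * log2(N) operations."""
    flops = 5 * n_points * math.log2(n_points)
    return flops / time_us / 1000.0   # flops per us is MFLOPS; divide by 1000 for GFLOPS

N = 64 * 1024
times_us = {
    "Cell throughput": 48.9,
    "Cell latency": 91.4,
    "P4": 2173,
    "970": 2248,
    "Opteron": 2487,
    "7448": 1728,
}
gflops = {name: round(fft_gflops(N, t), 2) for name, t in times_us.items()}
```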

(52)

What are the reasons for differences?

• Model says 107 GFLOPS and we’re measuring 90 GFLOPS

• Remember, as a goal and as part of our design methodology, we want to use the model to gain insight into why the implementation is performing as measured.

• Sorry, but we haven’t done this yet. Coming soon.

(53)

Performance Comparison – Details

• Cell 2.4 GHz hardware scaled up to 3.0 GHz

Cell SPE, EIB and XDR rates scale linearly with clock speed up to 3.2 GHz

• FFTW Benchmark (www.fftw.org/benchfft)

Does not explicitly measure transfers to/from DRAM (best fit)

FFTW times used in comparison:

• Pentium IV Xeon 2.8 GHz, 512KB L2 cache; Intel Math Kernel Library

• IBM 970 2.0 GHz; FFTW 3.0.1 library

• Opteron Model 246 2.0 GHz in 64 bit mode; FFTW 3.0.1 library

• Mercury Scientific Algorithm Library 7.0.3 tests

Does not explicitly measure transfers to/from DRAM (best fit)

SAL times used in comparison:

• IBM 970 2.0 GHz, 32K L1 cache, 512K L2 cache, 1 GHz memory bus

• FreeScale7448 975 MHz, 32K L1 cache, 1M L2 cache, 150 MHz memory bus

• Should we compare latency or throughput of the Cell parallel FFT performance with these uniprocessor tests?

It depends on the structure of the rest of your application

Note: We did not do any comparisons with multiprocessor systems or multicore chips

(54)

Projected Performance of Single SPE Algorithm

• How would an “independent” algorithm compare?

8 SPEs each concurrently but independently executing a single 64K FFT algorithm

• Similar “2D” algorithm but each data set must make 2 round trips between XDR and LS

Bring in data tiles for “column” FFTs and twiddle multiplies and then store back in transposed order

Bring in transposed data tiles for “row” FFTs and store back

• Lower bounds

Theoretical minimum latency: 4 x 512KBytes / 3 GBytes/sec = 699 us

• XDR bandwidth of 24 GBytes/s is shared among 8 SPEs

• Assumes one can bury all processing under the DMA transfers

Theoretical best throughput: 699 / 8 = 87.4 us
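The lower bounds for the independent single-SPE variant follow from the same unit conventions used in the DMA analysis (binary KBytes, decimal GBytes/sec, which is an assumption inferred from the quoted 699 us). Variable names are illustrative.

```python
data_bytes = 512 * 1024           # one 64K complex data set
traffic_bytes = 4 * data_bytes    # 2 round trips = 4 one-way passes over the data

per_spe_gbytes_per_sec = 24 / 8   # XDR bandwidth shared among 8 SPEs = 3 GBytes/sec
latency_us = traffic_bytes / (per_spe_gbytes_per_sec * 1000.0)   # about 699 us

# Eight independent FFTs complete per latency period.
throughput_us = latency_us / 8    # about 87.4 us per FFT
```

These bounds assume, as the slide notes, that all processing is hidden under the DMA transfers.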
