A Parallel 64K Complex FFT Algorithm
for the IBM/Sony/Toshiba
Cell Broadband Engine Processor
Jonathan Greene, Michael Pepe, Robert Cooper Mercury Computer Systems
October 24, 2005
Goals
• Program a real-world problem that could demonstrate the highest Cell processor performance (really kick the Cell tires)
• Show Mercury’s approach to algorithm (application) design
Analyze the mathematics and develop an optimized mapping (temporal and spatial mapping of the data) to the hardware architecture (the model)
Validate the model with key component measurements
Build the thing
Instrument the code and compare to the model
• Choose a problem where:
• The data would not fit in a single SPE’s local store
• But would fit in the aggregate of all 8 local stores
• SPE-to-SPE communication is essential to achieve optimal performance
Exploit the generous SPE-to-SPE bandwidth
• We chose a parallel (i.e., collective) implementation of N (N >= 1) 64K-point single-precision complex FFTs
• In achieving this goal, we were able to both reduce latency and
Cell Broadband Engine Architecture
• 1 Power® core
With full VMX engine
• Not used in this algorithm
• 8 SPE cores, each with
128 bit SIMD vector unit
256K local store
MFC DMA engine
Overlapped DMA and local store access
• Element Interconnect Bus (EIB)
2 pairs of bidirectional rings
• XDR DRAM
• High speed external interfaces
Cell Broadband Engine Performance
• We assume 3 GHz clock
The Cell BE chip can run at higher frequencies
• 8 SPE cores
192 GFLOPS peak
• MFC DMA engine
24 GB/s bidirectional
• EIB
192 GB/s maximum sustainable performance
Realizable performance depends on access patterns
• XDR DRAM ↔ 8 Local Stores
24 GB/s maximum aggregate bandwidth
Performance depends on access patterns
Cell Broadband Engine
• Keys to performance
Decompose algorithm into chunks that can utilize 256K local store
Vectorize inner loop SPE code (4-way SIMD for 32-bit float operations)
Pay careful attention to XDR bandwidth utilization
• 24 GB/s is a lot of bandwidth, but it’s shared by 8 very powerful SIMD cores
Exploit SPE-to-SPE ring bandwidth if possible
Overlap computation with DMA using double or triple buffering
• Each SPE’s MFC supports numerous concurrent, non-blocking DMA transfers
Use 128-byte alignment of data and multiples of 128-byte transfers for maximum DMA performance
• Generally don’t need to worry about:
EIB access patterns
• There is almost always ring bandwidth to spare
XDR access patterns
Basic Algorithm
• Classical “2D” decomposition of a 1D FFT
Rabiner and Gold. Theory and Application of Digital Signal Processing. Prentice-Hall, 1975
• View N-element FFT as NR x NC matrix in row major order
• Algorithm outline:
Perform NC NR-point column FFTs
Perform element-wise multiply by NR x NC complex twiddle matrix
Perform NR NC-point row FFTs
Transpose NR x NC matrix to NC x NR matrix
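The four steps of the outline can be sketched in a few lines of plain Python. This is only an illustration of the mathematics, not the SPE code: the function names `dft` and `fft_2d_decomposition` are hypothetical, the column and row transforms use a naive O(n^2) DFT for brevity, and the forward-transform sign convention exp(-2πi·nk/N) is assumed.

```python
import cmath

def dft(x):
    """Naive forward DFT, used here in place of an optimized FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_2d_decomposition(x, nr, nc):
    """1D transform of len(x) == nr * nc points via the "2D" decomposition."""
    n = nr * nc
    assert len(x) == n
    # View the input as an NR x NC matrix in row major order.
    a = [x[r * nc:(r + 1) * nc] for r in range(nr)]
    # Step 1: NC column FFTs of NR points each; cols[c][k1].
    cols = [dft([a[r][c] for r in range(nr)]) for c in range(nc)]
    # Step 2: element-wise multiply by the NR x NC twiddle matrix W_N^(k1 * c).
    b = [[cols[c][k1] * cmath.exp(-2j * cmath.pi * k1 * c / n) for c in range(nc)]
         for k1 in range(nr)]
    # Step 3: NR row FFTs of NC points each; rows[k1][k2].
    rows = [dft(b[k1]) for k1 in range(nr)]
    # Step 4: transpose to NC x NR so that output index k = k2 * NR + k1.
    return [rows[k1][k2] for k2 in range(nc) for k1 in range(nr)]
```

For the 64K case in these slides, NR = NC = 256; a small size such as 8 x 4 is enough to check the result against a direct DFT.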
Parallel Algorithm Across 8 SPEs
NC = 256
NR = 256
• 64K (256 x 256) FFT stored in XDR memory
• 512 KBytes each for input and output data
• Each SPE (SPE0 through SPE7) processes a 256 x 32 region
SPE Parallel Algorithm – SPE 2
• Focus on a representative SPE (SPE 2)
• Perform 32 column FFTs
• Element-wise multiply each 32 x 32 tile by twiddle matrix and transpose
• SPEi Tilej → SPEj Tilei
• Generate each 32 x 32 twiddle matrix on the fly
[Figure: Buffer A and Buffer B on SPE 2 over timeslices 0 through 7, exchanging tiles with SPE0 through SPE7]
• Perform 32 “row” FFTs
• Assemble result matrix in XDR
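The tile exchange rule "SPEi Tilej → SPEj Tilei" amounts to a distributed matrix transpose. The sketch below models it with plain Python lists, assuming each SPE owns a column slab of the matrix and that tile j is the j-th block of rows within that slab. The helper names are hypothetical, the twiddle multiply is left out to isolate the data movement, and no DMA is modeled.

```python
def split_into_slabs(matrix, num_spes, tile):
    """Give SPE i the column slab i*tile .. (i+1)*tile - 1 of the matrix."""
    return [[row[i * tile:(i + 1) * tile] for row in matrix]
            for i in range(num_spes)]

def exchange_tiles(slabs, num_spes, tile):
    """Send the transpose of tile j on SPE i to tile i on SPE j."""
    out = [[[None] * tile for _ in range(num_spes * tile)]
           for _ in range(num_spes)]
    for i in range(num_spes):        # source SPE
        for j in range(num_spes):    # source tile index == destination SPE
            for a in range(tile):
                for b in range(tile):
                    # element (a, b) of the source tile lands at (b, a)
                    out[j][i * tile + b][a] = slabs[i][j * tile + a][b]
    return out

def join_slabs(slabs):
    """Reassemble the full matrix from the per-SPE column slabs."""
    return [[value for slab in slabs for value in slab[r]]
            for r in range(len(slabs[0]))]
```

After the exchange, each local column on an SPE holds one row of the original matrix, which is presumably why the slides put “row” FFTs in quotes: they can be computed as column FFTs on the local slab.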
Triple Buffering and XDR Transfers
• While two processing buffers are computing result i
• Third I/O buffer is transferring result i-1 to XDR
• And then transferring input data i+1 from XDR
• At completion of algorithm, roles of buffers are rotated
[Figure: buffers A, B and C (two processing buffers, one I/O buffer), with result i-1 leaving and input data i+1 entering the I/O buffer; after rotation the buffers are shown as B, A, C]
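A minimal sketch of the role rotation follows. It is a bookkeeping model only (no data is moved), and it makes one assumption the slides do not state: result i ends up in the second processing buffer, which therefore becomes the next I/O buffer, while the old I/O buffer (now holding input i+1) becomes the first processing buffer. The name `rotate_roles` is hypothetical.

```python
def rotate_roles(num_ffts):
    """Return, for each FFT i, which buffer plays which role."""
    work, result_buf, io = "A", "B", "C"
    schedule = []
    for i in range(num_ffts):
        schedule.append({
            "fft": i,
            "processing": (work, result_buf),  # compute result i here
            "io": io,                          # overlapped XDR traffic
            "store_result": i - 1 if i > 0 else None,
            "load_input": i + 1 if i + 1 < num_ffts else None,
        })
        # Rotate: the I/O buffer now holds input i+1 and becomes the working
        # buffer; the buffer holding result i becomes the next I/O buffer.
        work, result_buf, io = io, work, result_buf
    return schedule
```

With three buffers the roles return to their starting assignment after three FFTs, so the rotation needs no copying between buffers.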
Local Store Usage Map
• Our 64K FFT algorithm requires approximately 253 Kbytes (out of the available 256 Kbytes) of local store in each SPE:
Stack size: 8K
SPE kernel code: 16K
FFT setup code: 5K
FFT shell code: 4K
FFT primitives (4): 8K
DMA lists (2): 8K
Data buffers (3): 192K
1D twiddles: 2K
2D twiddles: 10K
• Total: 253K
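The budget above can be checked with a short tally. The per-item sizes are taken from the slide; the observation that each of the three data buffers matches one 256 x 32 region of 8-byte single-precision complex samples is an inference from the numbers, not something the slide states.

```python
LOCAL_STORE_KB = 256

usage_kb = {
    "Stack": 8,
    "SPE kernel code": 16,
    "FFT setup code": 5,
    "FFT shell code": 4,
    "FFT primitives (4)": 8,
    "DMA lists (2)": 8,
    "Data buffers (3)": 192,
    "1D twiddles": 2,
    "2D twiddles": 10,
}

total_kb = sum(usage_kb.values())
headroom_kb = LOCAL_STORE_KB - total_kb

# One 256 x 32 region of complex samples (4-byte real + 4-byte imaginary).
region_kb = 256 * 32 * 8 // 1024

print(f"total {total_kb}K, headroom {headroom_kb}K, one region {region_kb}K")
```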
Latency vs. Throughput With Overlapped I/O
• Theoretical lowest latency
= compute time + XDR time
Inter-SPE transfers are partially overlapped with compute time
• Theoretical fastest throughput
[Figure: timeline for one SPE showing XDR transfers, SPE compute and inter-SPE transfers (get / put data from / to other SPEs), with FFT latency and FFT throughput marked; only one SPE shown]
Performance Analysis
• We have successfully coded and run this implementation on 3 GHz Rev 3 HW.
• We get the right answers (just so you know we’re honest).
• What follows is a performance analysis of the model based on timings for the processing components and Cell documentation for the DMA components
CPU Performance Analysis
• Measured on the Cell simulator (@ 3 GHz)
Times for 1 SPE for 1 FFT computation
Ignores local store contention due to DMA transfers
Function                        Calls per   Clocks     Time per    Total time   GFLOPS
                                iteration   per call   call (us)   (us)
zfft_64k() (shell)              N/A         ~3,000     N/A         1.0          N/A
kernel calls (e.g. DMA, sync)   31          ~6,000     N/A         2.0          N/A
make_dma_list()                 4           202        0.1         0.3          N/A
zmat_row_emul()                 7           1,185      0.4         2.8          15.6
zmat_emul_trans()               8           2,388      0.8         6.4          7.7
zfft_cols() (256 x 32)          2           44,052     14.7        29.4         22.3 *
Totals for a 64K FFT                                               41.9         125.1 *
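As a cross-check of the totals row, the sketch below sums the per-function times and converts the result to GFLOPS. It assumes the conventional FFT operation count of 5·N·log2(N) floating-point operations, which the slides do not state explicitly but which reproduces the 125.1 figure.

```python
N = 64 * 1024                      # 64K complex points
LOG2_N = N.bit_length() - 1        # 16

# Total time per function in microseconds, from the table above.
total_time_us = {
    "zfft_64k() (shell)": 1.0,
    "kernel calls": 2.0,
    "make_dma_list()": 0.3,
    "zmat_row_emul()": 2.8,
    "zmat_emul_trans()": 6.4,
    "zfft_cols()": 29.4,
}

def gflops(time_us):
    """GFLOPS for one 64K FFT, assuming 5 * N * log2(N) operations."""
    return 5 * N * LOG2_N / (time_us * 1e3)

cpu_time_us = sum(total_time_us.values())
print(f"CPU time {cpu_time_us:.1f} us, {gflops(cpu_time_us):.1f} GFLOPS")
```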
DMA Performance Analysis (XDR ↔ LS)
• Each 64K FFT requires one round trip of the data from/to XDR and the 8 Local Stores
• Data size: 8 * 64K = 512K
• 1 round trip requires transferring 2 * 512K = 1 Mbyte
@24 Gbytes/sec (peak): 1M / 24,000 = 43.7 us
@22 Gbytes/sec (de-rated): 1M / 22,000 = 47.7 us
• XDR ↔ LS transfers do not overlap with each other
• Gbytes/sec per SPE: 24/8 = 3 Gbytes/sec either in or out yields an average sustained rate of 1 byte per clock (@ 3GHz)
• Each DMA access is 128 bytes wide so, on average, each SPE experiences a DMA access only every 128 clocks or less than 1% of the time!
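The transfer times above follow from a one-line formula, sketched here. It uses the slide's convention that 1 Gbyte/sec equals 1,000 bytes per microsecond while "K" and "M" are binary (1,024 and 1,048,576); the function name is hypothetical.

```python
def transfer_time_us(num_bytes, gbytes_per_sec):
    """Time to move num_bytes at the given rate (1 GB/s = 1000 bytes/us)."""
    return num_bytes / (gbytes_per_sec * 1000.0)

data_bytes = 8 * 64 * 1024          # 64K points, 8 bytes each = 512K
round_trip_bytes = 2 * data_bytes   # in from XDR and back out = 1 Mbyte

peak_us = transfer_time_us(round_trip_bytes, 24)
derated_us = transfer_time_us(round_trip_bytes, 22)

# Per-SPE share: 3 GB/s at 3 GHz is 1 byte per clock, so one 128-byte
# DMA access lands roughly every 128 clocks.
ls_busy_fraction = 1 / 128

print(f"{peak_us:.1f} us peak, {derated_us:.1f} us de-rated, "
      f"{ls_busy_fraction:.2%} of clocks")
```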
DMA Performance Analysis (LS ↔ LS)
• Each 64K FFT requires 7/8 of the data to be transferred from SPE to SPE (i.e., LS ↔ LS)
• Size of each contiguous tile: 4 * 32 * 32 = 4K bytes
• Number of contiguous tile transfers: 2 * 7 * 8 = 112
• Total number of bytes transferred: 112 * 4K = 448K
• @192 Gbytes/sec peak: 448K / 192,000 = 2.4 us
• Gbytes/sec per SPE = 192/8 = 24 Gbytes/sec both in and out (48 combined) yields an average sustained rate of 16 bytes per clock (@ 3GHz)
• Each DMA access is 128 bytes wide so, on average, each SPE experiences a DMA access every 8 clocks during these tile transfers or 12.5% of the time
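The tile-transfer arithmetic can be reproduced as follows. The reading of the factor 2 in "2 * 7 * 8" as separate transfers for the real and imaginary parts of each tile is an assumption; it is consistent with the 4-byte element size used for the tile, but the slides do not say so.

```python
TILE = 32
BYTES_PER_FLOAT = 4
NUM_SPES = 8
CLOCK_GHZ = 3.0

tile_bytes = BYTES_PER_FLOAT * TILE * TILE        # 4K per contiguous tile
num_transfers = 2 * (NUM_SPES - 1) * NUM_SPES     # 112 (factor 2: assumed re/im)
total_bytes = num_transfers * tile_bytes          # 448K

eib_time_us = total_bytes / (192 * 1000.0)        # at 192 GB/s peak

# Per SPE: 24 GB/s in plus 24 GB/s out = 48 GB/s = 16 bytes per clock.
bytes_per_clock = (2 * 192 / NUM_SPES) / CLOCK_GHZ
clocks_per_access = 128 / bytes_per_clock         # 128-byte DMA accesses
ls_busy_fraction = 1 / clocks_per_access

print(f"{total_bytes // 1024}K in {eib_time_us:.1f} us, "
      f"LS busy {ls_busy_fraction:.1%}")
```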
Combined CPU and I/O Performance Analysis
• DMA accesses have priority over SPU accesses
• XDR ↔ LS DMA contending with LS ↔ LS DMA
Assume equal priority for both types of DMA
LS ↔ LS transfers consume all available DMA bandwidth per SPE LS (DMA access every 16 clocks in both directions)
XDR ↔ LS time with degraded bandwidth: 47.7 (de-rated) + 2.4/2 = 48.9 us
• XDR ↔ LS DMA contending with zfft_cols() primitive
zfft_cols() accesses LS about 85% of the time (instruction prefetches plus data load and store)
XDR ↔ LS DMA accesses any given LS < 1% of the time
So, using 1%, zfft_cols() degrades from 29.4 to 29.7 us (adds 0.3)
• LS ↔ LS DMA contending with 2D twiddle primitives
The 2D twiddle primitives access LS close to 100% of the time
LS ↔ LS DMA accesses any given LS 12.5% of the time
So, using 12.5%, the 2D twiddle primitives degrade from 9.2 to 9.2 + (2.4 * 0.125) = 9.5 us (adds 0.3)
Expected Latency and Throughput Analysis
• CPU time per SPE with DMA contention:
41.9 + 0.3 + 0.3 = 42.5 us
• XDR ↔ LS DMA time with LS ↔ LS DMA contention and de-rating of XDR ↔ LS bandwidth from 24 to 22 Gbytes/sec
47.7 + 1.2 = 48.9 us
• Latency: 42.5 + 48.9 = 91.4 us
• Throughput: max( 42.5, 48.9 ) = 48.9 us
• Parallel 64K FFT on Cell is I/O bound!
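The model on this slide reduces to a sum for latency and a max for throughput. The sketch below recomputes both and converts them to GFLOPS, again assuming the 5·N·log2(N) operation count; the variable names are illustrative.

```python
N = 64 * 1024
FLOPS_PER_FFT = 5 * N * 16            # 5 * N * log2(N), assumed convention

cpu_us = 41.9 + 0.3 + 0.3             # compute plus two contention penalties
xdr_us = 47.7 + 2.4 / 2               # de-rated XDR time plus LS-LS contention

latency_us = cpu_us + xdr_us          # nothing overlapped
throughput_us = max(cpu_us, xdr_us)   # compute fully overlapped with XDR I/O

def gflops(time_us):
    return FLOPS_PER_FFT / (time_us * 1e3)

bound = "I/O" if xdr_us > cpu_us else "compute"
print(f"latency {latency_us:.1f} us ({gflops(latency_us):.2f} GFLOPS), "
      f"throughput {throughput_us:.1f} us ({gflops(throughput_us):.2f} GFLOPS), "
      f"{bound} bound")
```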
Performance Comparison – GFLOPS
64K Single Precision Complex FFT
Platform                                             GFLOPS
Cell BE 3.0 GHz Throughput (Theoretical Analysis)    107.22 (modeled), 90 (meas)
Cell BE 3.0 GHz Latency (Theoretical Analysis)       57.36
Pentium 4 Xeon 2.8 GHz intel-mkl-f                   2.41
IBM 970 (G5) 2GHz fftw3                              2.33
Opteron Model 246 64bit mode 2GHz fftw3              2.11
FreeScale7448@975MHz SAL 7.3.0                       3.03
Performance Comparison - Microseconds
64K Single Precision Complex FFT
Platform                                             microseconds
Cell BE 3.0 GHz Throughput (Theoretical Analysis)    48.90
Cell BE 3.0 GHz Latency (Theoretical Analysis)       91.40
Pentium 4 Xeon 2.8 GHz intel-mkl-f                   2173
IBM 970 (G5) 2GHz fftw3                              2248
Opteron Model 246 64bit mode 2GHz fftw3              2487
FreeScale7448@975MHz SAL 7.3.0                       1728
What are the reasons for differences?
• Model says 107GFLOPS and we’re measuring 90GFLOPS
• Remember, as a goal and as part of our design methodology, we want to use the model to gain insight into why the implementation is performing as measured.
• Sorry, but we haven’t done this yet. Coming soon.
Performance Comparison – Details
• Cell 2.4 GHz hardware scaled up to 3.0 GHz
Cell SPE, EIB and XDR rates scale linearly with clock speed up to 3.2 GHz
• FFTW Benchmark (www.fftw.org/benchfft)
Does not explicitly measure transfers to/from DRAM (best fit)
FFTW times used in comparison:
• Pentium IV Xeon 2.8 GHz, 512KB L2 cache; Intel Math Kernel Library
• IBM 970 2.0 GHz; FFTW 3.0.1 library
• Opteron Model 246 2.0 GHz in 64 bit mode; FFTW 3.0.1 library
• Mercury Scientific Algorithm Library 7.0.3 tests
Does not explicitly measure transfers to/from DRAM (best fit)
SAL times used in comparison:
• IBM 970 2.0 GHz, 32K L1 cache, 512K L2 cache, 1 GHz memory bus
• FreeScale7448 975 MHz, 32K L1 cache, 1M L2 cache, 150 MHz memory bus
• Should we compare latency or throughput of the Cell parallel FFT performance with these uniprocessor tests?
It depends on the structure of the rest of your application
Note: We did not do any comparisons with multiprocessor systems or multicore chips
Projected Performance of Single SPE Algorithm
• How would an “independent” algorithm compare?
8 SPEs each concurrently but independently executing a single 64K FFT algorithm
• Similar “2D” algorithm but each data set must make 2 round trips between XDR and LS
Bring in data tiles for “column” FFTs and twiddle multiplies and then store back in transposed order
Bring in transposed data tiles for “row” FFTs and store back
• Lower bounds
Theoretical minimum latency: 4 x 512KBytes / 3 GBytes/sec = 699 us
• XDR bandwidth of 24 GBytes/s is shared among 8 SPEs
• Assumes one can bury all processing under the DMA transfers
Theoretical best throughput: 699 / 8 = 87.4 us
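The lower bounds above can be recomputed and set beside the modeled parallel result. The 48.9 us figure is the modeled throughput from the earlier analysis; the other values come from this slide, and the variable names are illustrative.

```python
DATA_BYTES = 512 * 1024            # one 64K complex FFT data set
NUM_SPES = 8
XDR_GBYTES_PER_SEC = 24.0

# Two round trips between XDR and LS = 4 transfers of the full data set,
# with each SPE getting 1/8 of the XDR bandwidth.
per_spe_rate = XDR_GBYTES_PER_SEC / NUM_SPES               # 3 GB/s
independent_latency_us = 4 * DATA_BYTES / (per_spe_rate * 1000.0)
independent_throughput_us = independent_latency_us / NUM_SPES

PARALLEL_THROUGHPUT_US = 48.9      # modeled parallel algorithm

ratio = independent_throughput_us / PARALLEL_THROUGHPUT_US
print(f"independent: {independent_latency_us:.0f} us latency, "
      f"{independent_throughput_us:.1f} us per FFT "
      f"({ratio:.2f}x the parallel model)")
```

Under these assumptions the independent approach is bound by the extra XDR round trip, even before any compute time is counted.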