
A Parallel 64K Complex FFT Algorithm for the IBM/Sony/Toshiba Cell Broadband Engine Processor

Jonathan Greene, Michael Pepe, Robert Cooper
Mercury Computer Systems

October 24, 2005

(2)

Goals

Program a real-world problem that demonstrates the highest Cell processor performance (really kick the Cell tires)

Show Mercury’s approach to algorithm (application) design

- Analyze the mathematics and develop an optimized mapping (temporal and spatial data mapping) to the hardware architecture (the model)
- Validate the model with key component measurements
- Build the thing
- Instrument the code and compare to the model

This problem was chosen because

- We wanted a problem where:
    The data would not fit in a single SPE’s local store
    But would fit in the aggregate of all 8 local stores
    SPE-to-SPE communication is essential to achieve optimal performance
- It exploits the generous SPE-to-SPE bandwidth

We chose a parallel (i.e., collective) implementation of N (N >= 1) 64K-point single-precision complex FFTs

In achieving this goal, we were able to both reduce latency and maintain high throughput

(3)

Cell Broadband Engine Architecture

[Figure: Cell BE block diagram with the SPE local stores (LS)]

1 Power® core
- With full VMX engine
    Not used in this algorithm

8 SPE cores, each with
- 128-bit SIMD vector unit
- 256K local store
- MFC DMA engine
- Overlapped DMA and local store access

Element Interconnect Bus (EIB)
- 2 pairs of bidirectional rings

XDR DRAM

High-speed external interfaces

(5)

Cell Broadband Engine Performance

[Figure: Cell BE block diagram with the SPE local stores (LS)]

We assume 3 GHz clock
- The Cell BE chip can run at higher frequencies

8 SPE cores
- 192 GFLOPS peak

MFC DMA engine
- 24 GB/s bidirectional

EIB
- 192 GB/s maximum sustainable performance
- Realizable performance depends on access patterns

XDR DRAM ↔ 8 Local Stores
- 24 GB/s maximum aggregate bandwidth
- Performance depends on access patterns
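The peak figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope check, not part of the original slides: it assumes that each SIMD lane retires one fused multiply-add (2 floating-point operations) per clock, which the slides do not state, and it derives the per-clock DMA width from the quoted 24 GB/s.

```python
# Back-of-envelope check of the quoted peak numbers at the assumed 3 GHz clock.
CLOCK_HZ = 3.0e9
NUM_SPES = 8
SIMD_LANES = 4       # 128-bit vector unit / 32-bit floats
FLOPS_PER_LANE = 2   # ASSUMPTION: one fused multiply-add per lane per clock

# Peak single-precision rate over all SPEs.
peak_gflops = NUM_SPES * SIMD_LANES * FLOPS_PER_LANE * CLOCK_HZ / 1e9

# The quoted 24 GB/s per MFC corresponds to this many bytes per clock.
mfc_bytes_per_clock = 24e9 / CLOCK_HZ

# The quoted 192 GB/s EIB figure, shared evenly by 8 SPEs.
eib_share_gbytes_per_s = 192 / NUM_SPES

print(peak_gflops, mfc_bytes_per_clock, eib_share_gbytes_per_s)
```

Under that assumption the 192 GFLOPS figure is simply 8 x 4 x 2 x 3 GHz, and an even share of the EIB figure matches the 24 GB/s quoted for one MFC.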

(6)

Cell Broadband Engine

Keys to performance

- Decompose algorithm into chunks that can utilize 256K local store
- Vectorize inner loop SPE code (4-way SIMD for 32-bit float operations)
- Pay careful attention to XDR bandwidth utilization
    24 GB/s is a lot of bandwidth, but it’s shared by 8 very powerful SIMD cores
- Exploit SPE-to-SPE ring bandwidth if possible
- Overlap computation with DMA using double or triple buffering
    Each SPE’s MFC supports numerous concurrent, non-blocking DMA transfers
- Use 128-byte alignment of data and multiples of 128-byte transfers for maximum DMA performance

Generally don’t need to worry about:

- EIB access patterns
    There is almost always ring bandwidth to spare
- XDR access patterns

(7)

Basic Algorithm

Classical “2D” decomposition of a 1D FFT
- Rabiner and Gold. Theory and Application of Digital Signal Processing. Prentice-Hall, 1975

View N-element FFT as NR x NC matrix in row major order

[Figure: the N input elements drawn as a matrix with NR rows and NC columns]

Algorithm outline:
- Perform NC NR-point column FFTs
- Perform element-wise multiply by NR x NC complex twiddle matrix
- Perform NR NC-point row FFTs
- Transpose NR x NC matrix to NC x NR matrix
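The four steps of the outline can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Mercury's SPE code: the function names `dft` and `fft_2d_decomposition` are hypothetical, a naive O(n^2) DFT stands in for the optimized FFT primitives, and the forward-transform sign convention exp(-2*pi*i*j*k/N) is assumed.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; stands in for a real FFT kernel and serves as reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_2d_decomposition(x, nr, nc):
    """Compute the length nr*nc DFT of x via the "2D" decomposition."""
    n = nr * nc
    assert len(x) == n
    # View x as an nr x nc matrix in row-major order: m[r][c] = x[r*nc + c].
    m = [list(x[r * nc:(r + 1) * nc]) for r in range(nr)]
    # Step 1: nc column FFTs, each of nr points.
    for c in range(nc):
        col = dft([m[r][c] for r in range(nr)])
        for r in range(nr):
            m[r][c] = col[r]
    # Step 2: element-wise multiply by the nr x nc twiddle matrix W_N^(r*c).
    for r in range(nr):
        for c in range(nc):
            m[r][c] *= cmath.exp(-2j * cmath.pi * r * c / n)
    # Step 3: nr row FFTs, each of nc points.
    m = [dft(row) for row in m]
    # Step 4: transpose to nc x nr and read out in row-major order.
    return [m[r][c] for c in range(nc) for r in range(nr)]

# Small usage example: a 32-point transform viewed as a 4 x 8 matrix.
signal = [complex(i % 7, -(i % 5)) for i in range(32)]
result = fft_2d_decomposition(signal, 4, 8)
reference = dft(signal)
print(max(abs(a - b) for a, b in zip(result, reference)))
```

The final transpose is what puts the outputs back in natural order: after step 3, output index k1 + NR*k2 sits at row k1, column k2.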

(13)

Parallel Algorithm Across 8 SPEs

NC = 256, NR = 256

64K (256 x 256) FFT stored in XDR memory

512 KBytes each for input and output data

Each SPE processes a 256 x 32 region

[Figure: the 256 x 256 matrix divided into eight 256 x 32 regions, labeled SPE0 through SPE7]
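The sizes on this slide follow from the element size. The short check below is a sketch that assumes 8 bytes per single-precision complex element (4-byte real plus 4-byte imaginary part); the variable names are illustrative only.

```python
KIB = 1024
BYTES_PER_COMPLEX = 8          # ASSUMPTION: 4-byte real + 4-byte imaginary
NR = NC = 256
NUM_SPES = 8
LOCAL_STORE_BYTES = 256 * KIB

total_bytes = NR * NC * BYTES_PER_COMPLEX           # whole 64K-point data set
cols_per_spe = NC // NUM_SPES                       # width of one SPE's region
region_bytes = NR * cols_per_spe * BYTES_PER_COMPLEX

print(total_bytes // KIB, cols_per_spe, region_bytes // KIB)
# The full data set exceeds one local store, but one region fits easily.
print(total_bytes > LOCAL_STORE_BYTES, region_bytes < LOCAL_STORE_BYTES)
```

This is the property the problem was chosen for: 512 KBytes does not fit in a single 256K local store, while each 64 KByte region does.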

(16)

SPE Parallel Algorithm – SPE 2

Focus on a representative SPE (SPE 2)

Perform 32 column FFTs

[Figure: SPE 2's Buffer A and Buffer B]

Element-wise multiply each 32 x 32 tile by twiddle matrix and transpose

SPEi Tilej → SPEj Tilei

Generate each 32 x 32 twiddle matrix on the fly

[Figure sequence: timeslices 0 through 7, showing buffers A and B and the other SPE involved in each timeslice's tile transfer (SPE0, SPE1, SPE3, SPE4, SPE5, SPE6, SPE7)]

Perform 32 “row” FFTs

Assemble result matrix in XDR

[Figure: SPE0 through SPE7 each writing their part of the result matrix to XDR]
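The tile rule "SPEi Tilej → SPEj Tilei" amounts to a distributed matrix transpose. The sketch below models only the data movement, under stated assumptions: each SPE is taken to hold a strip of 256 rows by 32 columns, the twiddle multiply is left out, and the order in which tiles move (the timeslice schedule) is not modeled. The helper names are hypothetical.

```python
NUM_SPES = 8
TILE = 32
N = NUM_SPES * TILE  # 256

def make_strips(matrix):
    """Split an N x N matrix so that strips[i][r][p] = matrix[r][TILE*i + p]."""
    return [[row[TILE * i:TILE * (i + 1)] for row in matrix]
            for i in range(NUM_SPES)]

def exchange_tiles(strips):
    """Send tile j of SPE i, transposed, to tile position i of SPE j."""
    out = [[[None] * TILE for _ in range(N)] for _ in range(NUM_SPES)]
    for i in range(NUM_SPES):          # source SPE
        for j in range(NUM_SPES):      # tile index in source == destination SPE
            for a in range(TILE):
                for b in range(TILE):
                    out[j][TILE * i + b][a] = strips[i][TILE * j + a][b]
    return out

# Tag every element with its original position, run the exchange, reassemble.
matrix = [[r * N + c for c in range(N)] for r in range(N)]
after = exchange_tiles(make_strips(matrix))
reassembled = [[after[c // TILE][r][c % TILE] for c in range(N)]
               for r in range(N)]
print(reassembled == [list(col) for col in zip(*matrix)])
```

After the exchange, each SPE's strip holds what were rows of the original matrix, so the subsequent “row” FFTs can again run down the columns of a local buffer. Tile i of SPE i never leaves its SPE, which is why only 7/8 of the data crosses the ring.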

(35)

Triple Buffering and XDR Transfers

While two processing buffers are computing result i

Third I/O buffer is transferring result i-1 to XDR

And then transferring input data i+1 from XDR

At completion of algorithm, roles of buffers are rotated

[Figure: buffers A, B and C; one is the I/O buffer (result i-1 out, input data i+1 in) and the other two are processing buffers; after rotation the buffers are shown in the order B, A, C]
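The buffer rotation can be sketched as a simple schedule. This is a hypothetical model, not the original code: the slides do not give the exact rotation order, so a cyclic rotation is assumed here, chosen so that the old I/O buffer (now holding input i+1) becomes a processing buffer and a former processing buffer (assumed to hold result i) becomes the I/O buffer.

```python
from collections import deque

def buffer_schedule(num_ffts, names=("A", "B", "C")):
    """Return, for each FFT i, which buffer does I/O and which two do processing."""
    roles = deque(names)  # roles[0]: I/O buffer; roles[1], roles[2]: processing
    steps = []
    for i in range(num_ffts):
        steps.append({
            "fft": i,
            "io": roles[0],                      # puts result i-1, gets input i+1
            "processing": (roles[1], roles[2]),  # compute result i
        })
        # ASSUMED rotation order (cyclic): I/O role moves to the next buffer.
        roles.rotate(-1)
    return steps

for step in buffer_schedule(4):
    print(step)
```

With this ordering the I/O buffer for FFT i+1 is always one of the processing buffers of FFT i, so a finished result never needs to be copied before it is written back to XDR.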

(41)

Local Store Usage Map

Our 64K FFT algorithm requires approximately 253 Kbytes (out of the available 256 Kbytes) of local store in each SPE:

- Stack size: 8K
- SPE kernel code: 16K
- FFT setup code: 5K
- FFT shell code: 4K
- FFT primitives (4): 8K
- DMA lists (2): 8K
- Data buffers (3): 192K
- 1D twiddles: 2K
- 2D twiddles: 10K

Total: 253K
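The usage map can be tallied directly. The snippet below simply adds up the figures listed above and checks the headroom against the 256K local store; the dictionary keys are just labels for this sketch.

```python
# Local store budget per SPE, in Kbytes, as listed on the slide.
usage_kbytes = {
    "stack": 8,
    "SPE kernel code": 16,
    "FFT setup code": 5,
    "FFT shell code": 4,
    "FFT primitives (4)": 8,
    "DMA lists (2)": 8,
    "data buffers (3)": 192,
    "1D twiddles": 2,
    "2D twiddles": 10,
}
LOCAL_STORE_KBYTES = 256

total_kbytes = sum(usage_kbytes.values())
headroom_kbytes = LOCAL_STORE_KBYTES - total_kbytes
per_buffer_kbytes = usage_kbytes["data buffers (3)"] // 3

print(total_kbytes, headroom_kbytes, per_buffer_kbytes)
```

Each of the three data buffers is 64K, which is exactly one 256 x 32 region of 8-byte complex values, and only about 3K of local store is left over.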

(42)

Latency vs. Throughput With Overlapped I/O

Theoretical lowest latency
- = compute time + XDR time
- Inter-SPE transfers are partially overlapped with compute time

Theoretical fastest throughput

[Figure: timeline for one SPE (only one SPE shown), with XDR transfers overlapped against SPE compute and inter-SPE transfers (get / put data from / to other SPEs); FFT latency and FFT throughput are marked along the time axis]

(43)

Performance Analysis

We have successfully coded and run this implementation on 3 GHz Rev 3 HW.

We get the right answers (just so you know we’re honest).

What follows is a performance analysis of the model based on timings for the processing components and Cell documentation for the DMA components

(44)

CPU Performance Analysis

Measured on the Cell simulator (@ 3 GHz)
- Times for 1 SPE for 1 FFT computation
- Ignores local store contention due to DMA transfers

Function                        Calls per   Clocks     Time per    Total       GFLOPS
                                iteration   per call   call (us)   time (us)
zfft_64k() (shell)              N/A         ~3,000     N/A         1.0         N/A
kernel calls (e.g. DMA, sync)   31          ~6,000     N/A         2.0         N/A
make_dma_list()                 4           202        0.1         0.3         N/A
zmat_row_emul()                 7           1,185      0.4         2.8         15.6
zmat_emul_trans()               8           2,388      0.8         6.4         7.7
zfft_cols() (256 x 32)          2           44,052     14.7        29.4        22.3 *
Totals for a 64K FFT                                               41.9        125.1 *

(45)

DMA Performance Analysis (XDR ↔ LS)

Each 64K FFT requires one round trip of the data from/to XDR and the 8 Local Stores

Data size: 8 * 64K = 512K

1 round trip requires transferring 2 * 512K = 1 Mbyte
- @ 24 Gbytes/sec (peak): 1M / 24,000 = 43.7 us
- @ 22 Gbytes/sec (de-rated): 1M / 22,000 = 47.7 us

XDR ↔ LS transfers do not overlap with each other

Gbytes/sec per SPE: 24/8 = 3 Gbytes/sec either in or out yields an average sustained rate of 1 byte per clock (@ 3 GHz)

Each DMA access is 128 bytes wide so, on average, each SPE experiences a DMA access only every 128 clocks, or less than 1% of the time!
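The XDR transfer times above can be reproduced as follows. This sketch assumes that "K" and "M" are binary (1024 and 1024 * 1024 bytes) while Gbytes/sec is decimal (10^9 bytes per second); that combination is what reproduces the 43.7 us and 47.7 us figures.

```python
KIB = 1024                      # ASSUMPTION: K and M are binary units
CLOCK_HZ = 3.0e9
NUM_SPES = 8

data_bytes = 8 * 64 * KIB       # 64K complex points at 8 bytes each = 512K
round_trip_bytes = 2 * data_bytes

def transfer_us(nbytes, gbytes_per_s):
    """Transfer time in microseconds at a decimal Gbytes/sec rate."""
    return nbytes / (gbytes_per_s * 1e9) * 1e6

peak_us = transfer_us(round_trip_bytes, 24)
derated_us = transfer_us(round_trip_bytes, 22)

# Per-SPE share of XDR bandwidth, expressed per clock.
bytes_per_clock = (24 / NUM_SPES) * 1e9 / CLOCK_HZ
clocks_between_accesses = 128 / bytes_per_clock
busy_fraction = 1 / clocks_between_accesses

print(round(peak_us, 1), round(derated_us, 1))
print(bytes_per_clock, clocks_between_accesses, round(100 * busy_fraction, 2))
```

One 128-byte access every 128 clocks means the local store port is busy with XDR traffic about 0.78% of the time, which is where "less than 1%" comes from.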

(46)

DMA Performance Analysis (LS ↔ LS)

Each 64K FFT requires 7/8 of the data to be transferred from SPE to SPE (i.e., LS ↔ LS)

Size of each contiguous tile: 4 * 32 * 32 = 4K bytes

Number of contiguous tile transfers: 2 * 7 * 8 = 112

Total number of bytes transferred: 112 * 4K = 448K

@ 192 Gbytes/sec peak: 448K / 192,000 = 2.4 us

Gbytes/sec per SPE = 192/8 = 24 Gbytes/sec both in and out (48 combined) yields an average sustained rate of 16 bytes per clock (@ 3 GHz)

Each DMA access is 128 bytes wide so, on average, each SPE experiences a DMA access every 8 clocks during these tile transfers, or 12.5% of the time
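The inter-SPE transfer figures can be checked the same way. As before, this sketch assumes binary "K" and decimal Gbytes/sec. Reading the factor of 2 in "2 * 7 * 8" as separate transfers for two 4K halves of each tile is only an interpretation; the slide does not explain it.

```python
KIB = 1024                  # ASSUMPTION: K is binary
CLOCK_HZ = 3.0e9
NUM_SPES = 8

tile_bytes = 4 * 32 * 32                  # one contiguous 4K tile
num_transfers = 2 * 7 * NUM_SPES          # as given on the slide
total_bytes = num_transfers * tile_bytes  # 448K

ring_us = total_bytes / 192e9 * 1e6       # at the 192 Gbytes/sec peak

# Share of the full 512K data set that crosses the ring.
fraction_moved = total_bytes / (512 * KIB)

# Per-SPE rate: 24 Gbytes/sec in plus 24 Gbytes/sec out.
bytes_per_clock = 2 * (192 / NUM_SPES) * 1e9 / CLOCK_HZ
clocks_between_accesses = 128 / bytes_per_clock
busy_fraction = 1 / clocks_between_accesses

print(total_bytes // KIB, round(ring_us, 1), fraction_moved)
print(bytes_per_clock, clocks_between_accesses, busy_fraction)
```

The 448K total is exactly 7/8 of the 512K data set, consistent with each SPE keeping one of its eight tiles local.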

(47)

Combined CPU and I/O Performance Analysis

DMA accesses have priority over SPU accesses

XDR ↔ LS DMA contending with LS ↔ LS DMA
- Assume equal priority for both types of DMA
- LS ↔ LS transfers consume all available DMA bandwidth per SPE LS (DMA access every 16 clocks in both directions)
- XDR ↔ LS degraded bandwidth: 47.7 (de-rated) + 2.4/2 = 48.9 us

XDR ↔ LS DMA contending with zfft_cols() primitive
- zfft_cols() accesses LS about 85% of the time (instruction prefetches plus data load and store)
- XDR ↔ LS DMA accesses any given LS < 1% of the time
- So, using 1%, zfft_cols() degrades from 29.4 to 29.7 us (adds 0.3)

LS ↔ LS DMA contending with 2D twiddle primitives
- The 2D twiddle primitives access LS close to 100% of the time
- LS ↔ LS DMA accesses any given LS 12.5% of the time
- So, using 12.5%, the 2D twiddle primitives degrade from 9.2 to 9.2 + (2.4 * 0.125) = 9.5 us (adds 0.3)

(48)

Expected Latency and Throughput Analysis

CPU time per SPE with DMA contention:
- 41.9 + 0.3 + 0.3 = 42.5 us

XDR ↔ LS DMA time with LS ↔ LS DMA contention and de-rating of XDR ↔ LS bandwidth from 24 to 22 Gbytes/sec
- 47.7 + 1.2 = 48.9 us

Latency: 42.5 + 48.9 = 91.4 us

Throughput: max( 42.5, 48.9 ) = 48.9 us

Parallel 64K FFT on Cell is I/O bound!
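The model behind these numbers is small enough to write out. The sketch below just re-derives the slide's arithmetic from the component timings quoted in the analysis; the variable names are illustrative, and the contention penalties are rounded to 0.1 us as on the slides.

```python
# Component timings from the analysis, in microseconds.
cpu_base_us = 41.9        # total SPE compute time without contention
zfft_cols_us = 29.4       # two zfft_cols() calls
ls_ls_us = 2.4            # LS-to-LS tile transfers at peak ring rate
xdr_derated_us = 47.7     # XDR round trip at 22 Gbytes/sec

# Contention penalties, rounded to 0.1 us as on the slides.
zfft_cols_penalty = round(zfft_cols_us * 0.01, 1)   # XDR DMA hits LS ~1% of the time
twiddle_penalty = round(ls_ls_us * 0.125, 1)        # LS-to-LS DMA hits LS 12.5%

cpu_us = cpu_base_us + zfft_cols_penalty + twiddle_penalty
xdr_us = xdr_derated_us + ls_ls_us / 2

latency_us = cpu_us + xdr_us            # compute and XDR I/O back to back
throughput_us = max(cpu_us, xdr_us)     # compute and XDR I/O fully overlapped

print(round(cpu_us, 1), round(xdr_us, 1))
print(round(latency_us, 1), round(throughput_us, 1))
print("I/O bound" if xdr_us > cpu_us else "compute bound")
```

Because the XDR term is the larger of the two, overlapping I/O with compute hides the compute time entirely, and sustained throughput is set by memory bandwidth.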

(49)

Performance Comparison – GFLOPS

64K Single Precision Complex FFT

Platform                                             GFLOPS
Cell BE 3.0 GHz Throughput (Theoretical Analysis)    107.22 (modeled), 90 (meas)
Cell BE 3.0 GHz Latency (Theoretical Analysis)       57.36
Pentium 4 Xeon 2.8 GHz intel-mkl-f                   2.41
IBM 970 (G5) 2GHz fftw3                              2.33
Opteron Model 246 64bit mode 2GHz fftw3              2.11
FreeScale7448@975MHz SAL 7.3.0                       3.03

[Figure: bar chart of the values above; GFLOPS axis from 0 to 120]

(51)

Performance Comparison - Microseconds

64K Single Precision Complex FFT

Platform                                             Microseconds
Cell BE 3.0 GHz Throughput (Theoretical Analysis)    48.90
Cell BE 3.0 GHz Latency (Theoretical Analysis)       91.40
Pentium 4 Xeon 2.8 GHz intel-mkl-f                   2173
IBM 970 (G5) 2GHz fftw3                              2248
Opteron Model 246 64bit mode 2GHz fftw3              2487
FreeScale7448@975MHz SAL 7.3.0                       1728

[Figure: bar chart of the values above; microseconds axis from 0 to 3000]
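The GFLOPS and microsecond charts are related by a fixed operation count. The slides do not state which count was used; the sketch below assumes the conventional 5 N log2 N floating-point operations for an N-point complex FFT, an assumption that reproduces every GFLOPS value on the chart from its time.

```python
import math

N = 64 * 1024
FLOPS_PER_FFT = 5 * N * math.log2(N)   # ASSUMPTION: conventional 5 N log2 N count

def gflops(time_us):
    """Convert a 64K FFT time in microseconds to GFLOPS."""
    return FLOPS_PER_FFT / (time_us * 1e-6) / 1e9

times_us = {
    "Cell BE throughput (modeled)": 48.9,
    "Cell BE latency (modeled)": 91.4,
    "Pentium 4 Xeon 2.8 GHz": 2173,
    "IBM 970 (G5) 2 GHz": 2248,
    "Opteron Model 246 2 GHz": 2487,
    "FreeScale 7448 975 MHz": 1728,
}

for name, t in times_us.items():
    print(f"{name}: {gflops(t):.2f} GFLOPS")
```

The same conversion gives a time for the measured Cell figure: 90 GFLOPS corresponds to roughly 58 us per FFT, against the modeled 48.9 us.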

(52)

What are the reasons for differences?

Model says 107 GFLOPS and we’re measuring 90 GFLOPS

Remember, as a goal and as part of our design methodology, we want to use the model to gain insight into why the implementation is performing as measured.

Sorry, but we haven’t done this yet. Coming soon.

(53)

Performance Comparison – Details

Cell 2.4 GHz hardware scaled up to 3.0 GHz
- Cell SPE, EIB and XDR rates scale linearly with clock speed up to 3.2 GHz

FFTW Benchmark (www.fftw.org/benchfft)
- Does not explicitly measure transfers to/from DRAM (best fit)
- FFTW times used in comparison:
    Pentium IV Xeon 2.8 GHz, 512KB L2 cache; Intel Math Kernel Library
    IBM 970 2.0 GHz; FFTW 3.0.1 library
    Opteron Model 246 2.0 GHz in 64 bit mode; FFTW 3.0.1 library

Mercury Scientific Algorithm Library 7.0.3 tests
- Does not explicitly measure transfers to/from DRAM (best fit)
- SAL times used in comparison:
    IBM 970 2.0 GHz, 32K L1 cache, 512K L2 cache, 1 GHz memory bus
    FreeScale7448 975 MHz, 32K L1 cache, 1M L2 cache, 150 MHz memory bus

Should we compare latency or throughput of the Cell parallel FFT performance with these uniprocessor tests?
- It depends on the structure of the rest of your application
- Note: We did not do any comparisons with multiprocessor systems or multicore chips

(54)

Projected Performance of Single SPE Algorithm

How would an “independent” algorithm compare?
- 8 SPEs each concurrently but independently executing a single 64K FFT algorithm

Similar “2D” algorithm but each data set must make 2 round trips between XDR and LS
- Bring in data tiles for “column” FFTs and twiddle multiplies and then store back in transposed order
- Bring in transposed data tiles for “row” FFTs and store back

Lower bounds
- Theoretical minimum latency: 4 x 512 KBytes / 3 GBytes/sec = 699 us
    XDR bandwidth of 24 GBytes/s is shared among 8 SPEs
    Assumes one can bury all processing under the DMA transfers
- Theoretical best throughput: 699 / 8 = 87.4 us
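These lower bounds follow from the XDR traffic alone. The sketch below reproduces them, assuming binary KBytes and decimal GBytes/sec as elsewhere in the analysis, and sets them beside the modeled figures for the parallel algorithm (91.4 us latency, 48.9 us throughput).

```python
KIB = 1024                       # ASSUMPTION: KBytes are binary
NUM_SPES = 8

data_bytes = 512 * KIB
xdr_share_gbytes_per_s = 24 / NUM_SPES    # each SPE gets 1/8 of XDR bandwidth

# Two round trips = four passes of the 512K data set over a 3 GBytes/sec share.
independent_latency_us = 4 * data_bytes / (xdr_share_gbytes_per_s * 1e9) * 1e6
# Eight FFTs complete in that time, one per SPE.
independent_throughput_us = independent_latency_us / NUM_SPES

parallel_latency_us = 91.4       # modeled, from the analysis above
parallel_throughput_us = 48.9

print(round(independent_latency_us), round(independent_throughput_us, 1))
print(round(independent_latency_us / parallel_latency_us, 1),
      round(independent_throughput_us / parallel_throughput_us, 1))
```

Even with all processing hidden under DMA, the independent approach is bounded at about 7.6 times the modeled latency and about 1.8 times the modeled time per FFT of the parallel algorithm.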

(55)

Conclusion

The Cell BE processor is ideal for large FFTs
- Performance one to two orders of magnitude better than current uniprocessor algorithms

We achieve reduced latency and maintain high throughput with a parallel algorithm
- Thanks to the generous ring bandwidth and local store bandwidth

So far, predicted performance matches well with measured results
- The Cell is a highly predictable processor, which is good for programmer productivity
