Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

(1)

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

(2)

Forecast

• Efficient mapping of wavefront algorithms on the Cell Broadband Engine

– Double buffering and data streaming across the cores

– Unique data layout optimizations within the cores

• Developing an accurate performance prediction model

(3)

Outline

• Introduction

• Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine

• Optimizations for the Accelerator Cores

• Evaluation / Results

• Conclusion

(4)

Outline

• Introduction

– The Cell Broadband Engine (B.E.)

• The Cell B.E. Architecture

• The QS20 Cell Blade

– The Wavefront Pattern

• Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine

• Optimizations for the Accelerator Cores

• Evaluation / Results

(5)

The Cell Broadband Engine

Highlights

• 9 cores, 10 threads

• 3.2 GHz frequency

• > 200 GFlops (SP)

• Up to 25 GB/s memory B/W

• > 300 GB/s EIB

(6)

The Cell B.E. Architecture

EIB (up to 96B/cycle)

[Figure: Cell B.E. block diagram. Eight SPEs, each with an SPU, SXU, LS, and MFC, and one PPE (PPU, PXU, L1, L2) attach to the EIB along with the MIC and BIC; link labels include 16B/cycle, 16B/cycle (2x), and 32B/cycle.]

(7)

The QS20 Cell Blade

Source: IBM Corporation

Cell Processors

(8)

Outline

• Introduction

– The Cell Broadband Engine (B.E.)

• The Cell B.E. Architecture

• The QS20 Cell Blade

– The Wavefront Pattern

• Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine

• Optimizations for the Accelerator Cores

• Evaluation / Results

(9)

The Wavefront Pattern

[Figure: Dependency. Each matrix element depends on its north (N), west (W), and northwest (NW) neighbors.]

• Areas of utility

– Computational Biology: Smith-Waterman

– Linear algebra: LU Decomposition

– Multimedia: Video Encoding

– Computational Physics: Particle Physics Simulations
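The N/W/NW dependency above means that all cells on one anti-diagonal (equal `i + j`) are independent of each other. The following is a minimal serial sketch of that traversal order, not the Cell implementation; the function name `wavefront_fill`, the `combine` callback, and the `boundary` value are hypothetical names introduced here for illustration.

```python
def wavefront_fill(rows, cols, combine, boundary=0):
    """Fill a rows x cols grid one anti-diagonal at a time.

    combine(nw, n, w, i, j) returns the value of cell (i, j) from its
    northwest, north, and west neighbors. Neighbors that fall outside
    the grid are replaced by `boundary`.
    """
    grid = [[boundary] * cols for _ in range(rows)]

    def get(i, j):
        return grid[i][j] if i >= 0 and j >= 0 else boundary

    # Cells with the same i + j lie on one anti-diagonal. They depend only
    # on earlier diagonals, so the inner loop could run in parallel.
    for d in range(rows + cols - 1):
        i_lo = max(0, d - cols + 1)
        i_hi = min(rows - 1, d)
        for i in range(i_lo, i_hi + 1):
            j = d - i
            grid[i][j] = combine(get(i - 1, j - 1), get(i - 1, j), get(i, j - 1), i, j)
    return grid


# Toy recurrence (illustrative only): count monotone lattice paths.
paths = wavefront_fill(3, 3, lambda nw, n, w, i, j: 1 if i == j == 0 else n + w)
print(paths[2][2])
```

Any recurrence with the same three-neighbor dependency can be plugged in through `combine`; the traversal order is the only part this sketch is meant to show.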

(10)

Outline

• Introduction

• Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine

– Tiled-Wavefront

– Model for Performance Prediction

• Optimizations for the SPEs

• Evaluation / Results

(11)

Mapping to the Cell B.E.

• Each element is processed on individual SPEs (S_i)

• Each diagonal is computed in parallel

• Bus overhead due to concurrent DMA calls (reads and writes)

– Scalability issue

[Figure: matrix elements labeled with their assigned SPEs (S₁ to S₆); the matrix store in main memory connects to the SPEs over the Element Interconnect Bus.]

(12)

Mapping to the Cell B.E.

• Elements are grouped to form square tiles

– Larger granularity

– Tile dimension can be modified

• Each tile is processed on individual SPEs (S_i)

• Each tile-diagonal is computed in parallel

[Figure: tiled matrix with tiles assigned to SPEs S₁, S₂, S₃; labels: Tile, Tile-row, Tile-column, Block-row, Direction of Computation.]

(14)

Tile-Scheduling (continued…)

• S = number of active SPEs

• S iterations before all SPEs are fully utilized

[Figure: block-rows, each labeled with S tiles.]
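The ramp-up described above (S iterations before all SPEs are busy) can be illustrated with a small scheduling sketch. The indexing is an assumption made here, not something the slides spell out: a block-row is taken to hold S tile-rows of n tiles each, SPE s owns tile-row s of every block-row, and SPE s starts one step behind SPE s - 1. The names `tile_schedule` and `respects_dependencies` are hypothetical.

```python
def tile_schedule(m, n, S):
    """Map each time step to the tiles computed in it.

    Assumed layout (not taken from the slides): a block-row holds S
    tile-rows of n tiles each, SPE s owns tile-row s of every block-row,
    and SPE s starts one step behind SPE s - 1.
    """
    if n < S:
        raise ValueError("this sketch assumes n >= S")
    schedule = {}
    for b in range(m):          # block-rows, top to bottom
        for s in range(S):      # SPE index inside the block-row
            for c in range(n):  # tile-column
                step = b * n + s + c
                schedule.setdefault(step, []).append((s, b * S + s, c))
    return schedule


def respects_dependencies(schedule):
    """True if every tile runs strictly after its N, W, and NW tiles."""
    when = {(r, c): t for t, tiles in schedule.items() for _, r, c in tiles}
    for (r, c), t in when.items():
        for dep in ((r - 1, c), (r, c - 1), (r - 1, c - 1)):
            if dep in when and when[dep] >= t:
                return False
    return True


sched = tile_schedule(m=2, n=6, S=4)
busy = [len(sched[t]) for t in sorted(sched)]  # active SPEs per step
print(busy)
print(respects_dependencies(sched))
```

Under this assumed indexing the number of busy SPEs grows by one per step until all S are active, and consecutive block-rows overlap so the pipeline stays full between them. The schedule takes m*n + S - 1 steps in total, which is within one step of the (m*n) + S tile-diagonal count used in the performance model.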

(15)

Computation-Communication Pattern

[Figure: per-tile steps on an SPE; labels: DMA North Tile, Buffer copy, ‘Ready’ message, DMA tile to main memory.]

(16)

Model for Performance Prediction

• Time for processing a tile-diagonal in a block row (or a single tile) = (T_tile + T_DMA)

– Independent of the number of tiles in the tile-diagonal

• Number of tile-diagonals = (m * n) + S

• T = T_{matrix_filling} + T_{serial_code}

T = T_{one_tile_diagonal} * (number of tile-diagonals) + T_{serial_code}

T = (T_tile + T_DMA) * [(m * n) + S] + T_{serial_code}

• Model Usage:

– Sampling Phase: measure T_tile, T_DMA and T_{serial_code}

[Figure: m block rows of n tile-diagonals each, with S marked; labels: Tile-diagonal, T_{parallel_code}, T_{Block_Row}, Computation overlap.]
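The model on this slide, T = (T_tile + T_DMA) * [(m * n) + S] + T_{serial_code}, can be evaluated directly once the three sampled times are known. The sketch below is one way to do that. The helper `block_geometry`, which derives m and n from the matrix size, tile dimension, and S, is an assumption added here (it takes n as the number of tile-columns and m as the number of groups of S tile-rows); the slides only label m and n in a figure.

```python
import math


def block_geometry(rows, cols, tile_dim, S):
    """Assumed mapping from matrix size to (m, n).

    n = tile-columns per block-row, m = block-rows of S tile-rows each.
    This derivation is illustrative and not stated in the slides.
    """
    tile_rows = math.ceil(rows / tile_dim)
    n = math.ceil(cols / tile_dim)
    m = math.ceil(tile_rows / S)
    return m, n


def predict_time(t_tile, t_dma, t_serial_code, m, n, S):
    """T = (T_tile + T_DMA) * [(m * n) + S] + T_serial_code."""
    num_tile_diagonals = m * n + S
    t_matrix_filling = (t_tile + t_dma) * num_tile_diagonals
    return t_matrix_filling + t_serial_code


# Hypothetical sampled values in arbitrary time units, for illustration only.
m, n = block_geometry(rows=8000, cols=8000, tile_dim=64, S=16)
print(m, n)
print(predict_time(t_tile=3.0, t_dma=1.0, t_serial_code=10.0, m=2, n=5, S=4))
```

Because T_tile, T_DMA, and T_{serial_code} are measured once in the sampling phase, the same function can be re-evaluated for different S or tile dimensions without rerunning the application.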

(17)

Outline

• Introduction

• Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine

• Optimizations for the SPEs

– Tile Representation

– Vector Computations

• Evaluation / Results

• Conclusion

(18)

Tile Representation

[Figure: example tile values shown in their logical representation.]

(19)

Vector Computations

Goal: Vectorize as much as possible

[Figure: example tile values with two regions labeled "Serial computations".]
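The slides do not spell out the tile layout, so the following is only one plausible reading of the vectorization goal, offered as an assumption: if each anti-diagonal of a tile is stored contiguously, the independent cells of a diagonal can be processed as full vectors, while diagonals (or diagonal remainders) shorter than the vector width fall back to scalar code. The vector width of 4 assumes 32-bit integers in 128-bit SPE registers; `to_diagonal_major` and `vector_serial_split` are hypothetical names.

```python
VECTOR_WIDTH = 4  # assumes four 32-bit integers per 128-bit SPE register


def to_diagonal_major(tile):
    """Return the anti-diagonals of a square tile, each as a contiguous list."""
    size = len(tile)
    diagonals = []
    for d in range(2 * size - 1):
        i_lo = max(0, d - size + 1)
        i_hi = min(size - 1, d)
        diagonals.append([tile[i][d - i] for i in range(i_lo, i_hi + 1)])
    return diagonals


def vector_serial_split(diagonals, width=VECTOR_WIDTH):
    """Count elements covered by full vectors vs. leftover scalar elements."""
    vector = sum((len(d) // width) * width for d in diagonals)
    serial = sum(len(d) % width for d in diagonals)
    return vector, serial


tile = [[r * 8 + c for c in range(8)] for r in range(8)]
diagonals = to_diagonal_major(tile)
print([len(d) for d in diagonals])
print(vector_serial_split(diagonals))
```

In this sketch the scalar remainder is concentrated in the short diagonals at the two corners of the tile, and its share shrinks as the tile dimension grows.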

(20)

Outline

• Introduction

• Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine

• Optimizations for the SPEs

• Evaluation / Results

– Experimental Setup

– Scalability Charts

– Performance Model Verification

(21)

Experimental Setup

• Compute Platforms

– QS20 dual-Cell blade at Georgia Tech for the parallel implementation

– 2.8 GHz dual-core Intel processor with 2GB memory for the serial implementation

• Example Wavefront Algorithm: Smith-Waterman

– A fundamental algorithm in bioinformatics used for homology search

• Matches nucleotide/protein sequences

• 8 different sequences chosen from the range: 1KB – 8KB

– Two-phase dynamic programming

• Matrix filling (wavefront pattern)

• Backtracing (sequential code)
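The two phases listed above can be seen in a small serial Smith-Waterman sketch. It is not the Cell-SWat code: it uses a linear gap penalty and arbitrary scoring values (`match`, `mismatch`, `gap`), all of which are assumptions chosen for brevity, and it fills the matrix row by row, which respects the same N/W/NW dependencies as the wavefront order.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Phase 1: matrix filling. Each cell needs only its NW, N, and W neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # NW: align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,       # N: gap in b
                          H[i][j - 1] + gap)       # W: gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Phase 2: backtrace from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTT", "GGACGGG"))
```

The backtrace needs the whole matrix, which is why keeping it costs O(mn) space, while phase 1 is the part that exposes wavefront parallelism.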

(22)

Scalability Chart

• Sequence size: 8KB

– Wavefront matrix size: 8000x8000 integers

• Similar results for all other input sequence sizes (1KB – 8KB)

• Why are the sequence sizes < 8KB?

– The matrix overflows the 1GB XDRAM

• Why are the tile dimensions < 64x64?

[Figure: speedup (axis 0 to 25) and efficiency (axis 0 to 1) versus number of SPEs (1 to 16) for tile sizes 64x64, 32x32, 16x16, and 8x8; annotation: "Near-constant efficiency irrespective of ..."]
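The memory limit mentioned above can be made concrete with a little arithmetic. The slides do not state how many bytes each matrix cell occupies, so `bytes_per_cell` is left as a parameter in this sketch: 4 bytes corresponds to a single 32-bit integer per cell, and 16 bytes is a purely illustrative alternative. The function names are hypothetical.

```python
import math

GIB = 2 ** 30  # 1 GB of XDRAM, taken here as 2**30 bytes


def matrix_bytes(seq_len, bytes_per_cell):
    """Storage for a seq_len x seq_len matrix kept fully in memory."""
    return seq_len * seq_len * bytes_per_cell


def max_seq_len(memory_bytes, bytes_per_cell):
    """Largest n such that n * n * bytes_per_cell <= memory_bytes."""
    return math.isqrt(memory_bytes // bytes_per_cell)


print(matrix_bytes(8000, 4))   # bytes for 8000x8000 cells of 4 bytes each
print(max_seq_len(GIB, 4))     # bound with one 32-bit integer per cell
print(max_seq_len(GIB, 16))    # bound with an assumed 16 bytes per cell
```

With 16 bytes per cell the bound comes out at 8192, which happens to be close to the maximum sequence size of about 8200 quoted on the related-work slide. That is only a consistency check under the stated assumption, not a statement about the actual data structure.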

(23)

Performance Model Verification

• Sequence size: 8KB

– Wavefront matrix size: 8000x8000 integers

• Tile dimension: 64x64 integers

• Similar results for all other input configs.

– Sequence sizes (1KB – 8KB)

– Tile dimensions 32x32, 16x16 and 8x8 integers

Mean error rate = 3%; Max. error rate = 10%

[Figure: measured vs. predicted execution time, in seconds and normalized, versus number of SPEs (1 to 16).]
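Mean and maximum error rates like those quoted above can be computed from paired measurements and predictions. The sketch below assumes the error rate is the absolute difference relative to the measured time, which the slides do not define explicitly, and the sample values are made up purely to exercise the function.

```python
def error_rates(measured, predicted):
    """Return (mean, max) relative error of predictions vs. measurements.

    Assumed definition: |predicted - measured| / measured per data point.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return sum(errors) / len(errors), max(errors)


# Hypothetical execution times in seconds, not taken from the slides.
measured = [8.0, 4.0, 2.0]
predicted = [8.4, 4.0, 1.9]
mean_err, max_err = error_rates(measured, predicted)
print(f"mean = {mean_err:.1%}, max = {max_err:.1%}")
```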

(24)

Performance Model

• Why do we need the performance model?

– Predict the execution time offline, based on pluggable input parameters

– Evaluate the tradeoffs between different input configurations before actually deploying the application

– See section

[Figure: normalized execution time, measured vs. predicted, versus number of SPEs (1 to 16).]

(25)

Outline

• Introduction

• Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine

• Optimizations for the SPEs

• Evaluation / Results

• Conclusion and Future Work

(26)

Conclusion

• Efficiently mapped wavefront algorithms on the Cell Broadband Engine

– Developed a highly scalable design that streams tiles across the SPEs

• Unique data layout scheme to maximize the vector processing capabilities of the SPEs

• Accurate prediction model of the execution time based on a number of pluggable parameters

(27)

Future Work

• Validation of the tiled-wavefront approach for other wavefront algorithms and also other emergent CMP architectures (e.g., GPU)

• Integrate the parallelized Smith-Waterman code into sequence search toolkits

• Extend the design to a cluster of Cell-based nodes

For more information…

– CS @ VT: www.cs.vt.edu

– The SyNeRGy Lab: synergy.cs.vt.edu

– Center for High-End Computing Systems (CHECS): www.checs.eng.vt.edu

– Contacts:

• Ashwin Aji: [email protected]

(28)

Related Work: Smith-Waterman on the Cell

IBM’s approach

• Coarse-grained parallelization

– One sequence pair per SPE

• Max seq. size = 2048

• O(m) space

– No backtrace?

• Which gap penalty?

Our approach

• Fine-grained parallelization

– One sequence pair across all available SPEs

• Max seq. size ~= 8200

• O(mn) space

– Includes backtrace