Cell-SWat: Modeling and Scheduling Wavefront
Computations on the Cell Broadband Engine
Forecast
• Efficient mapping of wavefront algorithms on the Cell Broadband Engine
– Double buffering and data streaming across the cores
– Unique data layout optimizations within the cores
• Developing an accurate performance prediction model
Outline
• Introduction
• Mapping and Modeling Wavefront Algorithms on the
Cell Broadband Engine
• Optimizations for the Accelerator Cores
• Evaluation / Results
• Conclusion
Outline
• Introduction
– The Cell Broadband Engine (B.E.)
• The Cell B.E. Architecture
• The QS20 Cell Blade
– The Wavefront Pattern
• Mapping and Modeling Wavefront Algorithms on the
Cell Broadband Engine
• Optimizations for the Accelerator Cores
• Evaluation / Results
The Cell Broadband Engine
Highlights
• 9 cores, 10 threads
• 3.2 GHz frequency
• > 200 GFlops (SP)
• Up to 25 GB/s memory B/W
• > 300 GB/s EIB
The Cell B.E. Architecture
[Figure: block diagram of the Cell B.E. Eight SPEs (each with an SPU, SXU, LS, and MFC), the PPE (PPU, PXU, L1, L2), the MIC, and the BIC are attached to the EIB (up to 96B/cycle). Link labels: 16B/cycle, 16B/cycle (2x), and 32B/cycle.]
The QS20 Cell Blade
[Figure: the QS20 blade, with the Cell processors labeled. Source: IBM Corporation]
The Wavefront Pattern
[Figure: dependency of an element on its north (N), west (W), and northwest (NW) neighbors]
• Areas of utility
– Computational Biology: Smith-Waterman
– Linear algebra: LU Decomposition
– Multimedia: Video Encoding
– Computational Physics: Particle Physics Simulations
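The dependency pattern above implies that all elements on one anti-diagonal can be computed independently. A minimal Python sketch of that traversal order follows; the function name `wavefront_order` is hypothetical and not taken from the slides.

```python
def wavefront_order(rows, cols):
    """Group cell coordinates (i, j) by anti-diagonal d = i + j.

    A cell depends on its north (i-1, j), west (i, j-1), and northwest
    (i-1, j-1) neighbors. Those lie on diagonals d-1 and d-2, so the
    cells inside one group do not depend on each other.
    """
    diagonals = []
    for d in range(rows + cols - 1):
        i_min = max(0, d - cols + 1)
        i_max = min(rows - 1, d)
        diagonals.append([(i, d - i) for i in range(i_min, i_max + 1)])
    return diagonals
```

For a 3x4 matrix this yields 6 groups; the third group is `[(0, 2), (1, 1), (2, 0)]`, three cells that could be processed in parallel.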
Outline
• Introduction
• Mapping and Modeling Wavefront Algorithms on the
Cell Broadband Engine
– Tiled-Wavefront
– Model for Performance Prediction
• Optimizations for the SPEs
• Evaluation / Results
Mapping to the Cell B.E.
• Each element is processed on an individual SPE (Si)
• Each diagonal is computed in parallel
• Bus overhead due to concurrent DMA calls (reads and writes)
– Scalability issue
[Figure: matrix elements labeled S1 to S6 by assigned SPE. The matrix is stored in main memory and reaches the SPEs over the Element Interconnect Bus.]
Mapping to the Cell B.E.
• Elements are grouped to form square tiles
– Larger granularity
– Tile dimension can be modified
• Each tile is processed on an individual SPE (Si)
• Each tile-diagonal is computed in parallel
[Figure: matrix partitioned into tiles, with a tile, a tile-row, and a tile-column labeled]
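To illustrate the coarser granularity, the sketch below groups a square matrix into square tiles and lists the tiles on each tile-diagonal. It is an illustration only: the names `tile_of` and `tile_diagonals` are hypothetical, and a square matrix is assumed.

```python
def tile_of(i, j, tile_dim):
    """Return the (tile-row, tile-column) that holds element (i, j)."""
    return i // tile_dim, j // tile_dim


def tile_diagonals(matrix_dim, tile_dim):
    """Return tiles per side and the list of tiles on each tile-diagonal.

    Assumes a square matrix; a partial tile at the edge counts as a tile.
    """
    tiles_per_side = -(-matrix_dim // tile_dim)  # ceiling division
    diagonals = []
    for d in range(2 * tiles_per_side - 1):
        lo = max(0, d - tiles_per_side + 1)
        hi = min(tiles_per_side - 1, d)
        diagonals.append([(r, d - r) for r in range(lo, hi + 1)])
    return tiles_per_side, diagonals
```

With an 8000x8000 matrix and 64x64 tiles there are 125 tiles per side and 249 tile-diagonals, compared with 15999 element-level diagonals, so far fewer synchronization points are needed.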
Tile-Scheduling
• Cyclic assignment of the SPEs to tile-rows
• Block-row: group of active tile-rows
• Computation overlap between consecutive block-rows (t9 – t13)
[Figure: tile schedule. Each tile is labeled with the time step (t1, t2, ...) at which it is computed, and each tile-row with its assigned SPE (S1, S2, ...). The direction of computation, a tile-row, and a block-row are marked.]
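The cyclic assignment can be sketched as a small schedule generator. This is an idealized model, not the authors' scheduler: it assumes every tile takes exactly one time step, each SPE sweeps its tile-row left to right, and there are at least as many tile-columns as SPEs. The function names are hypothetical.

```python
def schedule(tile_rows, tile_cols, num_spes):
    """Map each tile (row, col) to (SPE number, time step), both 1-based.

    Assumptions (not from the slides): unit-time tiles, left-to-right
    sweeps, and tile_cols >= num_spes so that the north tile is always
    finished before the tile below it starts.
    """
    if tile_cols < num_spes:
        raise ValueError("this sketch assumes tile_cols >= num_spes")
    plan = {}
    for r in range(tile_rows):
        spe = r % num_spes            # cyclic assignment of SPEs to tile-rows
        block_row = r // num_spes     # group of tile-rows active together
        start = block_row * tile_cols + spe + 1
        for c in range(tile_cols):
            plan[(r, c)] = (spe + 1, start + c)
    return plan


def last_step(plan):
    """Return the final time step used by the schedule."""
    return max(step for _, step in plan.values())
```

With 6 SPEs and 8 tile-columns, the first block-row finishes at t13 while S1 already starts the next block-row at t9, which matches the t9 – t13 overlap on the slide. Under these assumptions, m full block-rows of n tile-columns finish at step m*n + S - 1.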
Tile-Scheduling (continued…)
• S = number of active SPEs
• S iterations before all SPEs are fully utilized
[Figure: block-rows, each S tiles tall]
Computation-Communication Pattern
[Figure: computation-communication pattern between tiles. Legend: DMA tile to main memory, DMA north tile, 'Ready' message, buffer copy.]
Model for Performance Prediction
• Time for processing a tile-diagonal in a block-row (or a single tile) = $(T_{tile} + T_{DMA})$
– Independent of the number of tiles in the tile-diagonal
• Number of tile-diagonals = $(m \times n) + S$
• Total execution time:
$T = T_{matrix\_filling} + T_{serial\_code}$
$T = T_{one\_tile\_diagonal} \times (\text{number of tile-diagonals}) + T_{serial\_code}$
$T = (T_{tile} + T_{DMA}) \times [(m \times n) + S] + T_{serial\_code}$
• Model Usage:
– Sampling Phase: measure $T_{tile}$, $T_{DMA}$ and $T_{serial\_code}$
[Figure: matrix of m block-rows, each S tile-rows tall, with n tile-diagonals, a single tile-diagonal, a block-row, and the computation overlap marked. Annotation: $T_{parallel\_code}$.]
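The model above translates directly into a few lines of code. The sketch below is a plain transcription of the formula; the function and parameter names are hypothetical, and m and n are taken to be the quantities labeled on the figure (m block-rows, n tile-diagonals).

```python
def predict_time(t_tile, t_dma, t_serial_code, m, n, num_spes):
    """Predict total execution time from the sampled parameters.

    T = (T_tile + T_DMA) * [(m * n) + S] + T_serial_code
    where S is the number of active SPEs. All times must use one unit.
    """
    tile_diagonals = m * n + num_spes
    t_matrix_filling = (t_tile + t_dma) * tile_diagonals
    return t_matrix_filling + t_serial_code
```

As a purely illustrative calculation with made-up inputs, `predict_time(2.0, 0.5, 10.0, m=4, n=8, num_spes=6)` gives 2.5 * 38 + 10 = 105.0 time units.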
Outline
• Introduction
• Mapping and Modeling Wavefront Algorithms on the
Cell Broadband Engine
• Optimizations for the SPEs
– Tile Representation
– Vector Computations
• Evaluation / Results
• Conclusion
Tile Representation
[Figure: logical representation of a tile, shown as a grid of example integer values]
Vector Computations
Goal: Vectorize as much as possible
[Figure: tile of example integer values, with some parts labeled "Serial computations"]
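The surviving slide text does not show the actual data layout, so the following is only one plausible illustration of a layout that favors vectorization: storing a tile so that each anti-diagonal, whose elements are mutually independent, is contiguous in memory. The function name `to_diagonal_major` and the layout itself are assumptions, not the authors' scheme.

```python
def to_diagonal_major(tile):
    """Reorder a square row-major tile so each anti-diagonal is contiguous.

    Assumed layout for illustration only. Elements on one anti-diagonal
    have no dependencies on each other, so a contiguous run of them can
    be loaded into vector registers and updated together.
    """
    n = len(tile)
    return [
        [tile[i][d - i] for i in range(max(0, d - n + 1), min(n - 1, d) + 1)]
        for d in range(2 * n - 1)
    ]
```

For the tile `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` this gives `[[1], [2, 4], [3, 5, 7], [6, 8], [9]]`. The short diagonals near the corners cannot fill a full vector, which is one possible reading of the "Serial computations" label, though the slides do not confirm it.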
Outline
• Introduction
• Mapping and Modeling Wavefront Algorithms on the
Cell Broadband Engine
• Optimizations for the SPEs
• Evaluation / Results
– Experimental Setup
– Scalability Charts
– Performance Model Verification
Experimental Setup
• Compute Platforms
– QS20 dual-Cell blade at Georgia Tech for the parallel implementation
– 2.8 GHz dual-core Intel processor with 2GB memory for the serial implementation
• Example Wavefront Algorithm: Smith-Waterman
– A fundamental algorithm in bioinformatics used for homology search
• Matches nucleotide/protein sequences
• 8 different sequences chosen from the range: 1KB – 8KB
– Two-phase dynamic programming
• Matrix filling (wavefront pattern)
• Backtracing (sequential code)
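The two phases can be seen in a compact serial version of Smith-Waterman. This is a textbook-style sketch, not the Cell implementation: the scoring values and the linear gap penalty are assumptions, since the slides do not state the scoring scheme.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned a, aligned b) for a local alignment.

    Assumed scoring: linear gap penalty with illustrative default values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Phase 1: matrix filling. Each cell needs its N, W, and NW
    # neighbors, which is the wavefront dependency pattern.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Phase 2: backtrace from the best cell until a zero is reached.
    # This walk is inherently sequential.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because the backtrace needs the whole score matrix, this version uses O(mn) space, as does any approach that keeps the full matrix for backtracing.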
Scalability Chart
• Sequence size: 8KB
– Wavefront matrix size: 8000x8000 integers
• Similar results for all other input sequence sizes (1KB – 8KB)
• Why are the sequence sizes < 8KB?
– The matrix overflows the 1GB XDRAM
• Why are the tile dimensions < 64x64?
[Figure: speedup and efficiency versus number of SPEs (1 to 16) for tile sizes 64x64, 32x32, 16x16 and 8x8]
Near-constant efficiency irrespective of
[Figure: execution time (seconds) versus number of SPEs (1 to 16) for tile dimensions 64x64, 32x32, 16x16 and 8x8]
Performance Model Verification
• Sequence size: 8KB
– Wavefront matrix size: 8000x8000 integers
• Tile dimension: 64x64 integers
• Similar results for all other input configs.
– Sequence sizes (1KB – 8KB)
– Tile dimensions 32x32, 16x16 and 8x8 integers
Mean error rate = 3%; Max. error rate = 10%
[Figure: measured versus predicted execution time, in seconds and normalized, versus number of SPEs (1 to 16)]
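The slides do not define how the error rate is computed. One common choice, assumed here, is the relative error of the prediction against the measurement, averaged over the SPE counts. The function name is hypothetical.

```python
def error_rates(measured, predicted):
    """Return (mean, max) relative error of predicted vs. measured times.

    Assumed definition: |predicted - measured| / measured per data point.
    """
    if not measured or len(measured) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return sum(errors) / len(errors), max(errors)
```

With placeholder inputs (not data from the slides) such as measured `[10.0, 5.0, 4.0]` and predicted `[11.0, 5.0, 3.8]`, the per-point errors are 10%, 0% and 5%.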
Performance Model
• Why do we need the performance model?
– Predict the execution time offline, based on pluggable input parameters
– Evaluate the tradeoffs between different input configurations before actually deploying the application
– See section
Outline
• Introduction
• Mapping and Modeling Wavefront Algorithms on the
Cell Broadband Engine
• Optimizations for the SPEs
• Evaluation / Results
• Conclusion and Future Work
Conclusion
• Efficiently mapped wavefront algorithms on the Cell
Broadband Engine
– Developed a highly scalable design that streams tiles across the SPEs
• Unique data layout scheme to maximize the vector
processing capabilities of the SPEs
• Accurate prediction model of the execution time
based on a number of pluggable parameters
Future Work
• Validation of the tiled-wavefront approach for other wavefront
algorithms and also other emergent CMP architectures (e.g., GPU)
• Integrate the parallelized Smith-Waterman code into sequence
search toolkits
• Extend the design to a cluster of Cell-based nodes
For more information…
– CS @ VT: www.cs.vt.edu
– The SyNeRGy Lab: synergy.cs.vt.edu
– Center for High-End Computing Systems (CHECS):
www.checs.eng.vt.edu
– Contacts:
• Ashwin Aji: [email protected]
Related Work: Smith-Waterman on the Cell
IBM's approach
• Coarse-grained parallelization
– One sequence pair per SPE
• Max seq. size = 2048
• O(m) space
– No backtrace?
• Which gap penalty?
Our approach
• Fine-grained parallelization
– One sequence pair across all available SPEs
• Max seq. size ~= 8200
• O(mn) space
– Includes backtrace