Stacked FSMD: A Power Efficient Micro-Architecture for High-Level Synthesis
Khushwinder Jasrotia, Jianwen Zhu
Electrical and Computer Engineering University of Toronto
March 24th, 2004
http://www.eecg.toronto.edu/~jzhu
Outline
Motivation
New Approach
Experimental Results
Conclusion
ISQED, Copyright © Khushwinder Jasrotia, March 24, 2004, ECE, Univ. of Toronto
High-Level Synthesis
Register-transfer level (RTL)
Current industrial design standard
Difficult for complex designs
Behavioral level
Detailed design decisions are abstracted away: timing, resource sharing
Natural trend!!
High-level synthesis
Automated refinement from the behavioral level to RTL
Then, why not High-Level Synthesis?
High-level synthesis still remains in academia and a few EDA companies
Today’s high-level synthesis tools have limited power
For example, modular design is limited by:
Expressive power of the languages
Current HLS tools: VHDL, Verilog
Software languages are a better fit: C/C++, Java
But handling procedures efficiently becomes a problem
Traditional micro-architecture model
Monolithic
Procedure Abstraction
Ability to efficiently handle procedures
Finite-state machine with datapath (FSMD)
Classical model for HLS
Monolithic
Inlining
Flattened design increases control logic complexity
Fails to take advantage of the mutually exclusive nature of procedures for resource sharing
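[Figure: FSMD block diagram. The Control Unit (State Register, Next-State Logic, Output Logic, with Control Inputs and Control Outputs) sends Datapath Control signals to the Datapath Unit (with Datapath Inputs and Datapath Outputs) and receives Status signals from it.]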
Previous Work
Camposano and Eijndhoven [ICCD97] implemented procedures as independent hardware modules that communicate through handshaking signals.
Gajski et al. [EDAC93] described a method in which each procedure occupies a portion of the main controller's state table.
Vahid [ISFPGA97] described a method in which procedures are implemented as separate modules and a common bus transfers address and parameter information between them.
Our Proposed Solution
Stacked FSMD model
A modification of the FSMD model
Power-efficient for procedure abstraction
Region-based partitioning
Allows procedure boundaries to be redefined
Behavioral power index
Assists partitioning decision
The SFSMD Model
Each procedure is implemented as a separate controller, but a common datapath is shared
The controllers are managed by a stack controller
Benefits
Saves power: only one controller is active at a time
Increases resource sharing: common datapath
[Figure: SFSMD block diagram. Controllers 1 to n each have Address, Call, Return, Enable, Datapath Control and Status ports. They connect to the Stack Controller over an address bus, a call signal and a return signal; the Stack Controller drives one enable line per controller (Cntrl #1 Enable to Cntrl #n Enable). Datapath control from the controllers reaches the shared Datapath through tri-state buffers, and the Datapath returns status.]
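The call/return mechanism can be illustrated with a small behavioral sketch in Python. This is not the authors' hardware: the `Controller` and `StackController` classes, the step encoding (`"op"`, `"call"`, `"ret"`) and the example program are all hypothetical names chosen for illustration. Only the idea comes from the slides: one controller is enabled per cycle, a stack controller handles calls and returns, and all controllers drive a shared datapath.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """One procedure's FSM (hypothetical encoding).

    Each step is ("op", fn), ("call", controller_index) or ("ret",).
    """
    name: str
    steps: list


@dataclass
class StackController:
    controllers: list
    stack: list = field(default_factory=list)  # saved (controller, resume state)
    trace: list = field(default_factory=list)  # controller enabled in each cycle

    def run(self, datapath, entry=0):
        active, state = entry, 0
        while True:
            # Exactly one controller is enabled per cycle.
            self.trace.append(self.controllers[active].name)
            step = self.controllers[active].steps[state]
            if step[0] == "op":
                step[1](datapath)          # drive the shared datapath
                state += 1
            elif step[0] == "call":
                self.stack.append((active, state + 1))
                active, state = step[1], 0
            elif step[0] == "ret":
                if not self.stack:         # return from the entry controller
                    return datapath
                active, state = self.stack.pop()


def set_zero(dp):
    dp["acc"] = 0


def add_five(dp):
    dp["acc"] += 5


def demo():
    main = Controller("main", [("op", set_zero), ("call", 1), ("call", 1), ("ret",)])
    inc = Controller("inc", [("op", add_five), ("ret",)])
    sc = StackController([main, inc])
    return sc.run({}), sc.trace
```

Calling `demo()` runs `main`, which calls `inc` twice. The returned trace lists which controller was enabled in each cycle, which is the property the power argument relies on.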
More Power Saving with SFSMD
Power saving comes from activating one controller at a time
Breaking the procedure boundary
Allows redefining the procedure boundary
Exlining
Replaces a sequence of statements with a procedure call
Inverse of inlining
Loop-based exlining
Reduces switching activity: activity is localized
Reduces power: only a small controller is active at a time
Region Based Partitioning
A region-based partitioning scheme is introduced to reduce power consumption.
It extracts loops and implements them as separate controllers in the SFSMD model.
[Figure: Loop exlining in three panels. Original Specification: one body containing Loop1, Loop2, Loop3 and Loop4. After Loop Exlining: the body contains Call Loop1 to Call Loop4, and each loop is a separate procedure. SFSMD Implementation: a Main Controller, a Stack Controller, Loop1 to Loop4 Controllers, and a Shared Datapath.]
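As a software analogy for loop exlining, the sketch below shows the same computation before and after each loop is pulled out into its own procedure. The function names and the arithmetic are invented for illustration. In the SFSMD setting each exlined procedure would become a separate controller, which Python cannot show directly.

```python
def kernel_inlined(a, b):
    """Original specification: both loops live in one monolithic body."""
    total = 0
    for x in a:                 # Loop1
        total += x * x
    scaled = []
    for y in b:                 # Loop2
        scaled.append(y + total)
    return scaled


def loop1(a):
    """Loop1 exlined into its own procedure."""
    total = 0
    for x in a:
        total += x * x
    return total


def loop2(b, total):
    """Loop2 exlined into its own procedure."""
    scaled = []
    for y in b:
        scaled.append(y + total)
    return scaled


def kernel_exlined(a, b):
    """After exlining: the body is reduced to a sequence of calls."""
    total = loop1(a)            # Call Loop1
    return loop2(b, total)      # Call Loop2
```

Both versions compute the same result; what changes is where the procedure boundaries sit, and therefore how the control logic would be split across controllers.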
Region Based Partitioning
The partitioning redefines the controller boundaries for the original specification by a series of exlining and inlining operations.
[Figure: Different Ways of Partitioning Code. Panels (a), (b) and (c) show the same Main Process, Procedure Foo, Loop 1 and Loop 2 with the call boundaries drawn in different places.]
Power tradeoff of exlining
+ Only a small logic circuit is active at a time
- Increases inter-procedural communication
- Loss of control-step optimization opportunity
[Figure: Program trees with nodes Main, Loop1, Loop2, Loop3 and Loop4. Panels: Original Specification Tree, Partition 1 Tree, Partition 2 Tree, Partition 3 Tree, Partition 4 Tree.]
Partitioning-Index
For each partition $P_j$, a power index is defined as:
$$\sum_{i=1}^{k} \Big( |\mathrm{States}(R_i)| \cdot \mathrm{Cycles}(R_i) + K \cdot \mathrm{Calls}(R_i) \Big) \qquad (1)$$
where $R_i \in P_j$
$|\mathrm{States}(R_i)|$ : the number of control steps in the FSM of the controller for region $i$.
$\mathrm{Cycles}(R_i)$ : the number of control cycles spent in region $i$.
$\mathrm{Calls}(R_i)$ : the number of calls made to region $i$.
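The index in equation (1) is simple to evaluate once the three per-region quantities are known. The following is a minimal sketch, with several assumptions. The `Region` record and the `power_index` function are hypothetical names. The slide does not define K, so it is treated here as a constant weight on the overhead of each call. All numbers in the example are invented and are not taken from the benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    states: int   # |States(R_i)|: control steps in the region's FSM
    cycles: int   # Cycles(R_i): control cycles spent in the region
    calls: int    # Calls(R_i): number of calls made to the region


def power_index(partition, k):
    """Evaluate equation (1) for one partition (an iterable of Region)."""
    return sum(r.states * r.cycles + k * r.calls for r in partition)


# Invented numbers, for illustration only.
K = 50  # assumed call-overhead weight
unpartitioned = [Region("main", states=20, cycles=1000, calls=0)]
partitioned = [
    Region("main", states=8, cycles=100, calls=0),
    Region("loop1", states=6, cycles=900, calls=10),
]
index_before = power_index(unpartitioned, K)   # 20*1000 = 20000
index_after = power_index(partitioned, K)      # 8*100 + 6*900 + 50*10 = 6700
```

With these made-up figures the index drops after exlining because most cycles are spent in a controller with few states. A large K or a frequently called region pushes the index back up, which mirrors the communication cost listed among the tradeoffs.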
Experimental Procedure
Experiments were performed to verify the following:
The SFSMD model actually works.
Region-based partitioning saves power, and how much power it saves.
The partition power-index values correlate well with the actual power of the partitions.
C Benchmark Kernels
Region-based partitioning was applied to the Livermore kernels written in C.
Power measurements were made by summing the individual energy contributions of the regions.
Only controller power was reported.
The Design Compiler tool from Synopsys was used to synthesize the designs, and the Power Compiler tool from Synopsys was used to report the power.
C Benchmark Kernels (Cont’d)
The kernels were manually partitioned into different regions and compiled into VHDL.
The partitions considered were horizontal “cuts” across the depth of the program tree.
The controller portions of the regions were synthesized and their power was compared with the unpartitioned design.
[Figure: Program trees with nodes Main, Loop1, Loop2 and Loop3, shown for Partition 0, Partition 1 and Partition 2.]
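One way to read a horizontal cut is sketched below. It assumes that every loop at depth less than or equal to the cut level gets its own controller, and that anything nested deeper stays inlined in its nearest ancestor controller. This reading, the `horizontal_cut` function and the tree shape (Loop3 nested inside Loop1) are assumptions made for illustration, not details confirmed by the slides.

```python
def horizontal_cut(tree, root, level):
    """Group program-tree nodes into controllers for a cut at `level`.

    `tree` maps a node name to the list of its children.
    Returns {controller_root: [nodes implemented in that controller]}.
    """
    regions = {}

    def visit(node, depth, owner):
        if depth <= level:
            owner = node            # node is exlined into its own controller
            regions[owner] = []
        regions[owner].append(node)
        for child in tree.get(node, []):
            visit(child, depth + 1, owner)

    visit(root, 0, root)
    return regions


# Hypothetical program tree: Main contains Loop1 and Loop2; Loop1 contains Loop3.
TREE = {"Main": ["Loop1", "Loop2"], "Loop1": ["Loop3"]}

partition0 = horizontal_cut(TREE, "Main", 0)  # everything in one controller
partition1 = horizontal_cut(TREE, "Main", 1)  # Loop1 (with Loop3) and Loop2 exlined
partition2 = horizontal_cut(TREE, "Main", 2)  # every loop has its own controller
```

Under this reading, Partition 0 is the unpartitioned design, and each deeper cut adds controllers while shrinking the ones above it.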
Power Reduction Results
Power consumption was reduced by 12% to 67% relative to the unpartitioned design, with an average area overhead of 5.5%.
Power Reduction, Partition Level 1
Benchmark    % Reduction
LL1_int      13
LL2_int      12
LL3_int      18
LL4_int      14
LL5_int      16
LL6_int      12
LL7_int      34
LL8_int      20
LL9_int      18
LL10_int     31
LL11_int     31
LL12_int     33
LL13_int     33
LL14_int     43
LL15_int     49
LL16_int     57
LL18_int     13
LL19_int     14
LL20_int     17
LL21_int     24
LL22_int     45
LL23_int     20
LL24_int     29
Power Reduction Results (Cont’d)
Power Reduction, Partition Levels 1 & 2
Benchmark    % Reduction (Partition 1)    % Reduction (Partition 2)
LL2_int      12                           21
LL4_int      14                           20
LL6_int      12                           28
LL8_int      20                           26
LL14_int     43                           32
LL15_int     49                           53
LL18_int     13                           67
LL19_int     14                           24
LL20_int     17                           25
LL21_int     24                           28
LL22_int     45                           50
LL23_int     20                           24
LL24_int     29                           40
Power Reduction Results (Cont’d)
Power Reduction, Partition Levels 1, 2 & 3
Benchmark    % Reduction (Partition 1)    % Reduction (Partition 2)    % Reduction (Partition 3)
LL2_int      12                           21                           52
LL6_int      12                           28                           37
LL15_int     49                           53                           53
LL21_int     24                           28                           36
LL23_int     20                           24                           31
Partition Power-Index Results
Partition power-index values correlated well with actual power
Power, Partition Level 1
Benchmark    Unpartitioned Power (mW)    Partition Level 1 Power (mW)
LL1_int      1.3                         1.1
LL2_int      3.1                         2.7
LL3_int      1.4                         1.2
LL4_int      2.4                         2.0
LL5_int      1.4                         1.2
LL6_int      2.2                         1.9
LL7_int      3.0                         2.0
LL8_int      8.6                         6.9
LL9_int      3.5                         2.8
LL10_int     4.9                         3.4
LL11_int     1.4                         1.0
LL12_int     1.4                         1.0
LL13_int     4.0                         2.7
LL14_int     2.6                         1.5
LL15_int     4.5                         2.3
LL16_int     5.7                         2.4
LL18_int     7.5                         6.6
LL19_int     2.2                         1.9
LL20_int     3.7                         3.1
LL21_int     2.6                         2.0
LL22_int     1.9                         1.1
LL23_int     4.3                         3.4
LL24_int     2.1                         1.5
[Figure: Power Index, Partition Level 1. Bar chart of power index (axis 0 to 800000) per benchmark, LL1_int to LL16_int and LL18_int to LL24_int, comparing Unpartitioned Power Index with Partition 1 Power Index.]
Partition Power-Index Results (Cont’d)
Power, Partition Levels 1 & 2
Benchmark    Unpartitioned Power (mW)    Partition Level 1 Power (mW)    Partition Level 2 Power (mW)
LL2_int      3.1                         2.7                             2.4
LL4_int      2.4                         2.0                             1.9
LL6_int      2.2                         1.9                             1.6
LL8_int      8.6                         6.9                             6.4
LL14_int     2.6                         1.5                             1.7
LL15_int     4.5                         2.3                             2.1
LL18_int     7.5                         6.6                             2.5
LL19_int     2.2                         1.9                             1.6
LL20_int     3.7                         3.1                             2.8
LL21_int     2.6                         2.0                             1.9
LL22_int     1.9                         1.1                             1.0
LL23_int     4.3                         3.4                             3.3
LL24_int     2.1                         1.5                             1.3
[Figure: Power Index, Partition Levels 1 & 2. Bar chart of power index per benchmark.]
Partition Power-Index Results (Cont’d)
Power for Partition Levels 1, 2 & 3
Benchmark    Unpartitioned Power (mW)    Partition 1 Power (mW)    Partition 2 Power (mW)    Partition 3 Power (mW)
LL2_int      3.07                        2.70                      2.41                      1.47
LL6_int      2.18                        1.91                      1.58                      1.36
LL15_int     4.47                        2.30                      2.09                      2.10
LL21_int     2.60                        1.96                      1.87                      1.66
LL23_int     4.33                        3.45                      3.28                      3.01
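The percentage reductions reported earlier can be cross-checked against these power figures. The short sketch below assumes that the first value of each series (3.07, 2.70, 2.41 and 1.47 mW) belongs to LL2_int, which is an assumption about how the chart labels line up. The helper `percent_reduction` is a hypothetical name.

```python
def percent_reduction(baseline_mw, partitioned_mw):
    """Power reduction relative to the unpartitioned design, in percent."""
    return 100.0 * (baseline_mw - partitioned_mw) / baseline_mw


# LL2_int controller power in mW (assumed pairing of values to the benchmark).
ll2_unpartitioned = 3.07
ll2_partitions = {1: 2.70, 2: 2.41, 3: 1.47}

reductions = {
    level: percent_reduction(ll2_unpartitioned, power)
    for level, power in ll2_partitions.items()
}
rounded = {level: round(value) for level, value in reductions.items()}
```

Rounded to whole percentages, the three results are 12, 21 and 52, which agrees with the reduction values shown for LL2_int at partition levels 1, 2 and 3.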
[Figure: Power Index, Partitions 1, 2 & 3. Bar chart of power index (axis 0 to 350000) for LL2_int, LL6_int, LL15_int, LL21_int and LL23_int, comparing the Unpartitioned, Partition 1, Partition 2 and Partition 3 power index.]
Conclusion
The SFSMD model provides a good basis for procedure abstraction.
By extracting loops with high execution counts, the region-based partitioning technique can help reduce controller power. Our experimental results demonstrate power reductions ranging from 12% to 67% over the unpartitioned design.
Because it correlates strongly with the actual measured power, the power index can effectively guide the partitioning decisions of a high-level partitioning tool.