Advancing System-Level Analysis and
Design of Specialized Architectures
The Harvard community has made this
article openly available.
Citation
Xi, Likun. 2018. Advancing System-Level Analysis and Design of
Specialized Architectures. Doctoral dissertation, Harvard University,
Graduate School of Arts & Sciences.
Citable link
http://nrs.harvard.edu/urn-3:HUL.InstRepos:41121197
Terms of Use
This article was downloaded from Harvard University’s DASH
repository, and is made available under the terms and conditions
applicable to Other Posted Material, as set forth at
http://
[Figure: normalized area of processor units. Labels: IFU, ISU, LSU, FXU, VSU, L2. Axis: normalized area (0.0 to 0.8).]
[Figure: normalized power (0 to 7) by unit (IFU, ISU, LSU, FXU, VSU) across benchmarks. Benchmarks: vsx, astar, gcc2k6, libquantum, mcf2k6, perlbench, omnetpp, bwaves, calculix, cactusADM, dealII, gemsFDTD, lbm, povray, soplex, gcc2k, vpr, mcf2k, art, mesa, swim.]
[Figure: normalized area (0 to 8) with per-unit panels for IFU, ISU, LSU, and FXU. Benchmarks: vsx, gcc2k6, mcf2k, perlbench, lbm, povray.]
[Figure: per-structure panels for I-cache, branch target buffer, branch predictor, instruction buffer, decoder, and I-TLB. Benchmarks: vsx, gcc2k6, mcf2k, perlbench, lbm, povray.]
[Figure: normalized area (0 to 7) for the unified issue queue, global completion table, and register renaming. Series: MR0, MR1, MR2, Subset. Benchmarks: vsx, gcc2k6, mcf2k, perlbench, lbm, povray, art.]
[Figure: per-structure panels for D-cache, load reorder queue, store reorder queue, and D-TLB. Benchmarks: vsx, gcc2k6, mcf2k, perlbench, lbm, povray.]
[Figure: normalized area (0 to 7) for the general-purpose registers and fixed-point ALU. Series: MR0, MR1, MR2, Subset. Benchmarks: vsx, gcc2k6, mcf2k, perlbench, lbm, povray, art.]
[Figure: normalized area (0 to 5) for the vector register file and floating-point ALU. Series: MR0, MR1, MR2, Subset. Benchmarks: vsx, gcc2k6, mcf2k, perlbench, lbm, povray, art.]
[Figure: mean percent error (0 to 20) for gcc, mcf, vpr, art, and the mean, with x-axis ticks 40, 48, 56, 64 in each group. Series labels: MR0, MR2, DPM; the remaining legend text was lost in extraction.]
[Figure: two panels over IFU, ISU, and LSU. Axes: covariance (0.0 to 0.8) and ratio (0.0 to 0.7).]
[Figure: SoC block diagram. Components: CPU0 and CPU1 with L1 caches, L2 cache, system bus, memory controller (MC), DRAM. Scratchpad accelerator: lanes 0 to 7, BUF0 and BUF1, ARR0 to ARR3, STR0 and STR1, SPAD/DMA interface, ACCEL1 MEM. DMA engine: transfer descriptors (SRC ADDR, DEST ADDR, LENGTH), channels CHAN 0 to CHAN 3, channel selection. Cache-based accelerator: lanes 0 to 3, ACCEL0 MEM, L1 cache, TLB, cache controller.]
[Table: columns "Design Parameter" and "Values"; the table body was lost in extraction.]
[Figure: timeline of DMA optimizations for an array A processed by a loop over i = 0 to N. Baseline stages: flush array from CPU caches, then copy array via DMA. Annotations:
+ Pipelined DMA: break up flush and DMA into page-sized chunks; begin DMA of A as soon as the first flush chunk completes.
+ DMA-triggered compute: begin loop iteration 0 as soon as A[0] arrives. Ready bits track data at granularity G (for illustration purposes G = 16), so iterations i = 0 to 15 wait on A[0:15], iterations i = 16 to 31 wait on A[16:31], followed by A[32:47] and A[48:63].]
[Figure: accelerator and SoC block diagrams. Accelerator: PE0 to PE7, input scratchpad (banks B0 to B7), weight scratchpad (banks B0 to B7), output scratchpad (banks B0 to B7), adders, control logic. SoC: CPU0 to CPU3 with L1 caches, 2MB L2 cache, system bus, LPDDR4 MC, 4GB LPDDR4, ACP, IO interface, DMA, 8 MACC arrays, ISP, ACC complex. Bus-width labels: 16, 16, 32, 16.]
[Figure: software stack. Inputs: model config file plus parameters (weights/inputs); hardware description file (target backend, SoC interface, kernel selection, SRAM size, etc). Preprocessing: quantization, compression, blocking/tiling, data layout. Tiling optimizer (per layer): CONV, FC, other. Runtime scheduler. Backends: optimized CPU code, NVDLA.]
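[Figure: tiling diagram with input feature maps, kernels, and output feature maps. Dimension labels that survive extraction: B, C, H, W, M, K, with subscripts o, i, r, c. Capacity label: 64KB. The accompanying floor-division expressions lost their operands in extraction; only the operators ×, =, ⌊ / ( ∗ ) ⌋ remain.]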