Advancing System-Level Analysis and Design of Specialized Architectures


The Harvard community has made this article openly available.


Citation

Xi, Likun. 2018. Advancing System-Level Analysis and Design of Specialized Architectures. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Citable link

http://nrs.harvard.edu/urn-3:HUL.InstRepos:41121197

Terms of Use

This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://


[Figure: block diagram with units labeled LSU, VSU, L2, ISU, IFU]


[Figure: normalized area (axis 0.0 to 0.8) for the IFU, ISU, LSU, FXU, and VSU. Legend: vsx, gcc2k6, mcf2k, perlbench, lbm, povray, mesa]


[Figure: normalized power (axis 0 to 7) per workload. Workloads: vsx, astar, gcc2k6, libquantum, mcf2k6, perlbench, omnetpp, bwaves, calculix, cactusADM, dealII, gemsFDTD, lbm, povray, soplex, gcc2k, vpr, mcf2k, art, mesa, swim]


[Figure: normalized area (axis 0 to 8) for the IFU, ISU, LSU, FXU, and VSU]


[Figure: per-unit panels for the IFU, ISU, LSU, and FXU. Legend: vsx, gcc2k6, mcf2k, perlbench, lbm, povray]


[Figure: instruction fetch structures labeled I-cache, Branch target buffer, Branch predictor, Inst. buffer, Decoder, I-TLB]


[Figure: per-structure panels for the I-cache, Branch target buffer, Branch predictor, Inst. buffer, and Decoder. Legend: vsx, gcc2k6, mcf2k, perlbench, lbm, povray]


[Figure: normalized area (axis 0 to 7) for the Unified issue queue, Global completion table, and Register renaming, with series MR0, MR1, MR2, and Subset. Per-structure panels: Unified issue queue, Global completion table. Legend: vsx, gcc2k6, mcf2k, perlbench, lbm, povray, art]


[Figure: load/store structures labeled D-cache, Load reorder queue, Store reorder queue, D-TLB]


[Figure: per-structure panels for the D-cache, Load reorder queue, and Store reorder queue. Legend: vsx, gcc2k6, mcf2k, perlbench, lbm, povray]


[Figure: normalized area (axis 0 to 7) for the General-purpose registers and Fixed-point ALU, with series MR0, MR1, MR2, and Subset. Per-structure panels: General-purpose registers, Fixed-point ALU. Legend: vsx, gcc2k6, mcf2k, perlbench, lbm, povray, art]


[Figure: normalized area (axis 0 to 5) for the Vector register file and Floating-point ALU, with series MR0, MR1, MR2, and Subset. Per-structure panel: Vector register file. Legend: vsx, gcc2k6, mcf2k, perlbench, lbm, povray, art]


[Figure: panels labeled MR2 and DPM; axis ticks 40, 48, 56, 64]


[Figure: mean percent error (axis 0 to 20) for gcc, mcf, vpr, art, and Mean. Series: MR0, MR2]


[Figure: two panels for the IFU, ISU, and LSU. Left axis: Covariance (0.0 to 0.8). Right axis: Ratio (0.0 to 0.7)]


[Figure: SoC block diagram. Components: CPU0 and CPU1, each with an L1 Cache; L2 Cache; System bus; MC; DRAM; DMA engine with transfer descriptors (SRC ADDR, DEST ADDR, LENGTH) and channel selection across CHAN 0 to CHAN 3. Scratchpad accelerator (ACCEL1 MEM): Lanes 0 to 7, BUF0 and BUF1, ARR0 to ARR3, STR0 and STR1, SPAD/DMA interface. Cache-based accelerator (ACCEL0 MEM): Lanes 0 to 3, L1 Cache, TLB, Cache controller]

[Table: Design Parameter / Values; contents not recovered]


[Figure: DMA optimizations for moving array A to the accelerator]
Steps shown: Flush array from CPU caches; Copy array via DMA; loop iterations i = 0 to N.
+ Pipelined DMA: Break up flush and DMA into page-sized chunks. Begin DMA of A as soon as the first flush chunk completes.
+ DMA-triggered compute: Begin loop iteration 0 as soon as A[0] arrives. Ready bits track data at granularity G (for illustration purposes G = 16).
Chunk labels: A[0:15], A[16:31], A[32:47], A[48:63]. Iteration groups: i = 0 to 15, i = 16 to 31.
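The figure annotations above describe two overlapping optimizations: chunked flush/DMA pipelining, and starting compute as soon as the needed data has arrived. The following is a minimal timing sketch of that idea, not the dissertation's simulator. The function names, the fixed per-chunk costs, and the assumption that one ready bit covers exactly one chunk of G elements are all hypothetical simplifications introduced here for illustration.

```python
def chunk_arrival_times(num_chunks, flush_cost, dma_cost, pipelined):
    """Return the time at which each chunk of the array lands in the accelerator.

    Assumption (not from the source): every chunk takes `flush_cost` time units
    to flush and `dma_cost` time units to transfer, and there is one DMA engine.
    """
    arrivals = []
    # Without pipelining, the DMA engine waits until the whole array is flushed.
    dma_free_at = 0.0 if pipelined else num_chunks * flush_cost
    for c in range(num_chunks):
        flushed_at = (c + 1) * flush_cost      # chunk c has left the CPU caches
        start = max(flushed_at, dma_free_at)   # needs flushed data and a free engine
        dma_free_at = start + dma_cost
        arrivals.append(dma_free_at)
    return arrivals


def iteration_start_time(arrivals, i, granularity, dma_triggered):
    """Earliest time loop iteration `i` may begin.

    Assumption: iteration i only reads element i, and each ready bit covers
    `granularity` consecutive elements (one chunk per ready bit).
    """
    if not dma_triggered:
        return arrivals[-1]                    # wait for the entire array
    return arrivals[i // granularity]          # wait only for this element's chunk


if __name__ == "__main__":
    base = chunk_arrival_times(4, flush_cost=1, dma_cost=2, pipelined=False)
    pipe = chunk_arrival_times(4, flush_cost=1, dma_cost=2, pipelined=True)
    print("baseline arrivals :", base)
    print("pipelined arrivals:", pipe)
    print("iteration 0 start, DMA-triggered:", iteration_start_time(pipe, 0, 16, True))
```

With four chunks, the pipelined schedule hides most of the flush time behind the transfers, and DMA-triggered compute lets iteration 0 start after the first chunk instead of the last.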


[Figure: accelerator and SoC block diagram. Accelerator: PE0 to PE7; Input Scratchpad, Weight Scratchpad, and Output Scratchpad, each with banks B0 to B7; adders; Control Logic; 8 MACC Arrays. SoC: CPU0 to CPU3, each with an L1 $; 2MB L2 $; System Bus; LPDDR4 MC; 4GB LPDDR4; ACP; IO Interface; DMA; ISP; ACC complex. Bus width labels: 16, 16, 32, 16]


[Figure: software stack flow]
Inputs: Model config file + parameters (Weights/inputs); Hardware description file (target backend, SoC interface, kernel selection, SRAM size, etc).
Preprocessing: Quantization, Compression, Blocking / Tiling, Data layout.
Tiling optimizer (per layer): CONV, FC, Other.
Runtime scheduler.
Backends: Optimized CPU code, NVDLA.


[Equation residue: three expressions of the form ⌊· / (· ∗ ·)⌋ and one pair (·, ·); the operands were lost in extraction]


[Figure: convolution layer tensors. Panels: Input feature maps, Kernels, Output feature maps. Dimension labels: C, H, W, M, K_r, K_c, B_i, B_o]


[Figure: distribution of malloc call durations. X-axis: malloc duration (cycles), 10^1 to 10^5. Y-axis: Time in calls (PDF %), 0 to 50. Annotated regions: Fast path, Get from central cache, Get from page allocator]


[Figure: malloc flowchart. Decision nodes (Yes/No): Is small; Sample allocation?; Is free list empty?. Action nodes: Compute size class; Do sampled allocation; Get free list; Pop head; Fetch from central cache; Allocate pages; Return]
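The flowchart labels above name the decisions on the allocation path, but the edges between them did not survive extraction. The sketch below shows one plausible ordering of those steps, following the general thread-cache allocator design. The class name, size classes, thresholds, batch size, and sampling interval are all hypothetical values chosen for illustration; this is a toy model, not the allocator's actual code.

```python
import bisect

SIZE_CLASSES = [8, 16, 32, 64, 128, 256]  # hypothetical size classes (bytes)
MAX_SMALL = SIZE_CLASSES[-1]              # assumed small/large threshold
BATCH = 4                                 # assumed objects fetched per refill
SAMPLE_INTERVAL = 1 << 20                 # assumed bytes between sampled allocations


class ToyAllocator:
    """Toy model of the flowchart's decisions; addresses are just integers."""

    def __init__(self):
        self.free_lists = {c: [] for c in range(len(SIZE_CLASSES))}
        self.next_addr = 0x1000
        self.bytes_until_sample = SAMPLE_INTERVAL
        self.path_log = []  # records which flowchart nodes were visited

    def _carve(self, nbytes):
        # Stand-in for obtaining fresh memory from a lower-level allocator.
        addr = self.next_addr
        self.next_addr += nbytes
        return addr

    def malloc(self, size):
        if size > MAX_SMALL:                          # "Is small" -> No
            self.path_log.append("allocate_pages")
            return self._carve(size)
        cls = bisect.bisect_left(SIZE_CLASSES, size)  # "Compute size class"
        self.bytes_until_sample -= size
        if self.bytes_until_sample <= 0:              # "Sample allocation?" -> Yes
            self.bytes_until_sample = SAMPLE_INTERVAL
            self.path_log.append("sampled_allocation")
            return self._carve(SIZE_CLASSES[cls])
        free_list = self.free_lists[cls]              # "Get free list"
        if not free_list:                             # "Is free list empty?" -> Yes
            self.path_log.append("fetch_from_central_cache")
            free_list.extend(self._carve(SIZE_CLASSES[cls]) for _ in range(BATCH))
        self.path_log.append("pop_head")              # "Pop head", then "Return"
        return free_list.pop()                        # end of the list acts as the head

    def free(self, addr, size):
        # Push the object back onto the head of its size class's free list.
        cls = bisect.bisect_left(SIZE_CLASSES, size)
        self.free_lists[cls].append(addr)
```

In this model, the first small allocation of a size class takes the slower refill path, and later ones only pop the head of the free list, which corresponds to the fast path the figure labels single out.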


[Figure: x-axis malloc duration (cycles), 10^1 to 10^5]


[Figure: fast path cycles (axis 0 to 30) per microbenchmark: ubench.antagonist, ubench.gauss, ubench.gauss_free, ubench.sized_deletes, ubench.tp, ubench.tp_small. Legend: Sampling, Size class, Push/pop]


[Figure: tcmalloc time improvement (%), axis 0 to 60]


[Figure: malloc() time improvement (%), axis 0 to 60]


[Figure: x-axis call duration (cycles), 10^0 to 10^5]


[Figure: distribution of call durations. X-axis: call duration (cycles), 10^0 to 10^5. Y-axis: Time in calls (PDF %), 0 to 50. Panels: 483.xalancbmk.ref, antagonist, gauss, gauss_free, sized_deletes, tp, tp_small. Legend: Baseline malloc, Limit study malloc, All optimizations malloc]


[Figure: time spent in tcmalloc (%), axis 0 to 8]
