• No results found

UNIT: Unifying Tensorized Instruction Compilation

N/A
N/A
Protected

Academic year: 2021

Share "UNIT: Unifying Tensorized Instruction Compilation"

Copied!
37
0
0

Loading.... (view fulltext now)

Full text

(1)

UNIT: Unifying Tensorized

Instruction Compilation

Jian Weng

12

, Animesh Jain

2

, Jie Wang

12

, Leyuan Wang

2

,

Yida Wang

2

, Tony Nowatzki

1

1

UCLA

2

Amazon Web Services

(2)

Motivation: Mixed Precision

2

Blindly using fp16 does not help the performance

Mixed precision

Low-precision Inputs

High-precison Outputs

Relative Performance

0.00 0.25 0.50 0.75 1.00

ResNet-18ResNet-50ResNet-50bInception-bnInception-v3ResNet-101ResNet-152Mobilenet-v1Mobilenet-v2

GM 0.66 1.05 0.71 0.66 0.58 0.53 0.74 0.75 0.62 0.44 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

cuDNN (fp32)

cuDNN (fp16) w/o Tensor Core

fp32 fp16

fp16 fp32

(3)

Motivation: Tensorization Idiom

… c15 a60 b60 × + b61 × b62 × b63 × dst15

a61 a62 a63

+ +

u8x64

i8x64

i16x32 i16x32

Intel VNNI x86.avx512.pbpdusd

c0 a0 b0 × + b1 × b2 × b3 × dst0 a1 a2 a3 + +

Nvidia Tensor Core

nvvm.wmma.m16n16k16.mma.row.row.f32.f32 += A: 16×16 × fp16 B: 16×16 fp16 C: 16×16 fp32

Reducing multiple low

precision to high

Horizontal reduction

Mixed precision

S/w Abstraction

Kernel Libraries

Manually Program Instrinsics

DSL Compiler

(4)

Unifying Tensorized Instruction Compilation

Unified Instruction Abstraction

Instructions integrated by their semantics

Unified Analysis of Applicability

Computation: Arithmetic isomorphism

Memory Access: Pattern isomorphism

Unified Code Generation Interfaces

Reorganize the loops

Rewrite with the tensorized inst.

Tuning for favorable performance

4

Tensor Operation Prog.

Hardware Target

Analysis

Xform

Tensor. Inst.

Intel x86

NVIDIA

...

Rewrite Tune Arith. Memory

(5)

Tensor Domain Specific Language

Convolution

5

Tensor DSL [10, 31, 37]

Tensors

Loop Variables

Data-Parallel

/

Reduction

Expressions

Decoupled Loop Organization

// Convolution in tensor DSL

a,b = tensor((H,W,C), u8),tensor((R,S,K,C),i8) k,rc = loop_axis(0,K), reduce_axis(0,C)

x,y = loop_axis(0,H-R+1), loop_axis(0,W-S+1) r,s = reduce_axis(0,R), reduce_axis(0,S)

(6)

Tensor Domain Specific Language

Convolution

6

for (i=0; i<n; ++i) // expr uses i

Split/Tile

for (io=0; io<n/4; ++io) for (ii=0; ii<4; ++ii) // expr uses io*4+ii

for (i=0; i<n; ++i)

for (j=0; j<m; ++j) // expr uses i,j

Reorder

for (j=0; j<m; ++j)

for (i=0; i<n; ++i) // expr uses i,j

for (i=0; i<4; ++i) // expr uses i

Unroll

// expr i replaced by 0 // expr i replaced by 1 // expr i replaced by 2 // expr i replaced by 3

Tensor DSL [10, 31, 37]

Tensors

Loop Variables

Data-Parallel

/

Reduction

Expressions

Decoupled Loop Organization

// Convolution in tensor DSL

a,b = tensor((H,W,C), u8),tensor((R,S,K,C),i8) k,rc = loop_axis(0,K), reduce_axis(0,C)

x,y = loop_axis(0,H-R+1), loop_axis(0,W-S+1) r,s = reduce_axis(0,R), reduce_axis(0,S)

(7)

Unifying Tensorized Instruction Compilation

Unified Instruction Abstraction

Instructions integrated by their semantics

Unified Analysis of Applicability

Computation: Arithmetic isomorphism

Memory Access: Pattern isomorphism

Unified Code Generation Interfaces

Reorganize the loops

Rewrite with the tensorized inst.

Tuning for favorable performance

7

Tensor Operation Prog.

Hardware Target

Analysis

Xform

Tensor. Inst.

(8)

8

a, b = tensor((64,),u8), tensor((64,),i8)

c, d = tensor((16,), i32), tensor((16,), i32) i, j = loop_axis(0,16), reduce_axis(0,4)

d[i] = c[i] + sum(i32(a[i*4+j])*i32(b[i*4+j]))

Intel VNNI x86.avx512.pbpdusd

a, b = tensor((16,16),fp16), tensor((16,16),fp16)

i, j =

loop_axis

(0,16),

loop_axis

(0,16)

k =

reduce_axis

(0,16)

c[i,j] += fp32(a[i,k]) * fp32(b[k,j])

Nvidia Tensor Core

nvvm.wmma.m16n16k16.mma.row.row.f32.f32

Describe the instruction in

Tensor DSL

Tensors are registers

Expr describes arithmetic

behavior

Expose this information for

applicability analysis

Expression tree

Register shape

Loop axis

Data-Parallel

/

Reduction

(9)

Unifying Tensorized Instruction Compilation

Unified Instruction Abstraction

Instructions integrated by their semantics

Unified Analysis of Applicability

Computation: Arithmetic isomorphism

Memory Access: Pattern isomorphism

Unified Code Generation Interfaces

Reorganize the loops

Rewrite with the tensorized inst.

Tuning for favorable performance

9

Tensor Operation Prog.

Hardware Target

Analysis

Xform

Tensor. Inst.

Intel x86

NVIDIA

...

Arith. Memory

(10)

Analysis: Arithmetic Isomorphism

// Convolution in tensor DSL

a,b = tensor((H,W,C), u8),tensor((R,S,K,C),i8) k,rc = loop_axis(0,K), reduce_axis(0,C)

x,y = loop_axis(0,H-R+1), loop_axis(0,W-S+1) r,s = reduce_axis(0,R), reduce_axis(0,S)

c[x,y,k] += i32(a[x+r,y+s,rc])*i32(b[r,s,k,rc])

Convolution

a, b = tensor((64,),u8), tensor((64,),i8)

c, d = tensor((16,), i32), tensor((16,), i32) i, j = loop_axis(0,16), reduce_axis(0,4)

d[i] = c[i] + sum(i32(a[i*4+j])*i32(b[i*4+j]))

Intel VNNI x86.avx512.pbpdusd

= c[x,y,k]:i32 +:i32 *:i32 a[x+r,y+s,rc]:u8 c[x,y,k]:i32 cast:i32 b[r,s,k,rc]:i8 cast:i32 = d[i]:i32 +:i32 *:i32 c[i]:i32 b[i*4+j]:i8 a[i*4+j]:u8 cast:i32 cast:i32 10

(11)

Analysis: Memory Isomorphism

// Convolution in tensor DSL

a,b = tensor((H,W,C), u8),tensor((R,S,K,C),i8)

k,rc = loop_axis(0,K), reduce_axis(0,C)

x,y = loop_axis(0,H-R+1), loop_axis(0,W-S+1) r,s = reduce_axis(0,R), reduce_axis(0,S)

c[x,y,k] += i32(a[x+r,y+s,rc])*i32(b[r,s,k,rc])

Convolution

a, b = tensor((64,),u8), tensor((64,),i8)

c, d = tensor((16,), i32), tensor((16,), i32)

i, j = loop_axis(0,16), reduce_axis(0,4)

d[i] = c[i] + sum(i32(a[i*4+j])*i32(b[i*4+j]))

Intel VNNI x86.avx512.pbpdusd

11

(12)

Analysis: Memory Isomorphism

// Convolution in tensor DSL

a,b = tensor((H,W,C), u8),tensor((R,S,K,C),i8)

k,rc = loop_axis(0,K), reduce_axis(0,C)

x,y = loop_axis(0,H-R+1), loop_axis(0,W-S+1) r,s = reduce_axis(0,R), reduce_axis(0,S)

c[x,y,k] += i32(a[x+r,y+s,rc])*i32(b[r,s,k,rc])

Convolution

a, b = tensor((64,),u8), tensor((64,),i8)

c, d = tensor((16,), i32), tensor((16,), i32)

i, j = loop_axis(0,16), reduce_axis(0,4)

d[i] = c[i] + sum(i32(a[i*4+j])*i32(b[i*4+j]))

Intel VNNI x86.avx512.pbpdusd

12 for (x=0; x<(H-R)+1; ++x) for (y=0; y<(W-S)+1; ++y) for (k=0; k<K; ++k) for (r=0; r<R; ++r) for (s=0; s<S; ++s) for (rc=0; rc<C; ++rc) c[x,y,k] += a[x+r,y+s,rc]*b[r,s,k,rc];

k

->

i

,

rc

->

j

(13)

Analysis: Memory Isomorphism

// Convolution in tensor DSL

a,b = tensor((H,W,C), u8),tensor((R,S,K,C),i8)

k,rc = loop_axis(0,K), reduce_axis(0,C)

x,y = loop_axis(0,H-R+1), loop_axis(0,W-S+1) r,s = reduce_axis(0,R), reduce_axis(0,S)

c[x,y,k] += i32(a[x+r,y+s,rc])*i32(b[r,s,k,rc])

Convolution

a, b = tensor((64,),u8), tensor((64,),i8)

c, d = tensor((16,), i32), tensor((16,), i32)

i, j = loop_axis(0,16), reduce_axis(0,4)

d[i] = c[i] + sum(i32(a[i*4+j])*i32(b[i*4+j]))

Intel VNNI x86.avx512.pbpdusd

13 for (x=0; x<(H-R)+1; ++x) for (y=0; y<(W-S)+1; ++y) for (ko=0; ko<K; ko+=16) for (r=0; r<R; ++r) for (s=0; s<S; ++s) for (co=0; co<C; co+=4) for (ki=0; ki<16; ++ki) for (ci=0; ci<4; ++ci) { k=ko+ki, rc=co+ci; c[x,y,k] += a[x+r,y+s,rc]*b[r,s,k,rc]; } for (x=0; x<(H-R)+1; ++x) for (y=0; y<(W-S)+1; ++y) for (k=0; k<K; ++k) for (r=0; r<R; ++r) for (s=0; s<S; ++s) for (rc=0; rc<C; ++rc) c[x,y,k] += a[x+r,y+s,rc]*b[r,s,k,rc];

k

->

i

,

rc

->

j

(14)

Analysis: Memory Isomorphism

// Convolution in tensor DSL

a,b = tensor((H,W,C), u8),tensor((R,S,K,C),i8)

k,rc = loop_axis(0,K), reduce_axis(0,C)

x,y = loop_axis(0,H-R+1), loop_axis(0,W-S+1) r,s = reduce_axis(0,R), reduce_axis(0,S)

c[x,y,k] += i32(a[x+r,y+s,rc])*i32(b[r,s,k,rc])

Convolution

a, b = tensor((64,),u8), tensor((64,),i8)

c, d = tensor((16,), i32), tensor((16,), i32)

i, j = loop_axis(0,16), reduce_axis(0,4)

d[i] = c[i] + sum(i32(a[i*4+j])*i32(b[i*4+j]))

Intel VNNI x86.avx512.pbpdusd

14

k

->

i

,

rc

->

j

Conv. Mem. Op

Inst. Mem. Op

Conv. Mem. Accessed

Inst. Mem Accessed

c[x,y,k]

d[i]

[16]

[16]

c[x,y,k]

c[i]

[16]

[16]

a[x+r,y+s,rc]

a[i*4+j]

[4]

[64]

b[r,s,k,rc]

b[i*4+j]

[4x16]

[64]

=

=

=

(Broadcast)

(Concatenate)

(15)

Unifying Tensorized Instruction Compilation

Unified Instruction Abstraction

Instructions integrated by their semantics

Unified Analysis of Applicability

Computation: Arithmetic isomorphism

Memory Access: Pattern isomorphism

Unified Code Generation Interfaces

Reorganize the loops

Rewrite with the tensorized inst.

Tuning for favorable performance

15

Tensor Operation Prog.

Hardware Target

Analysis

Xform

Tensor. Inst.

Intel x86

NVIDIA

...

Rewrite Tune

(16)

Transformation: Loop Reorg.

16

k

->

i

,

rc

->

j

Transform loops to for rewriting

Tile loops by corresponding

trip counts

Reorder to the inner most

for (x=0; x<(H-R)+1; ++x) for (y=0; y<(W-S)+1; ++y) for (ko=0; ko<K; ko+=16) for (r=0; r<R; ++r) for (s=0; s<S; ++s) for (co=0; co<C; co+=4) for (ki=0; ki<16; ++ki) for (ci=0; ci<4; ++ci) { k=ko+ki, rc=co+ci; c[x,y,k] += a[x+r,y+s,rc]*b[r,s,k,rc]; } for (x=0; x<(H-R)+1; ++x) for (y=0; y<(W-S)+1; ++y) for (k=0; k<K; ++k) for (r=0; r<R; ++r) for (s=0; s<S; ++s) for (rc=0; rc<C; ++rc) c[x,y,k] += a[x+r,y+s,rc]*b[r,s,k,rc]; i = loop_axis(0,16) j = reduce_axis(0,4)

(17)

Unified Code Generation

Implement callback functions to generate each operand

17

def operand_generator(array, base, loops, coef): # implement the rule of codegen here

# array: the array pointer of the memory operation # loops: chosen loops to be tensorized,

# from inner to outer

# coef: the coefficient of each loop variable # base: the base addresss

# thus, index = base + sum(loops[i] * coef[i]) # return operand load intrinsic

(18)

Unified Code Generation

Implement callback functions to generate each operand

18

Memory Operation:

a[x+r,y+s,rc]

Flattened:

a[(x+r)*d1+(y+s)*d0+rc]

Arguments:

array: a

base : (x+r)*d1+(y+s)*d0

loop : [rc, k]

coef : [1, 0]

Generated:

def operand_generator(array, base, loops, coef): # implement the rule of codegen here

# array: the array pointer of the memory operation # loops: chosen loops to be tensorized,

# from inner to outer

# coef: the coefficient of each loop variable # base: the base addresss

# thus, index = base + sum(loops[i] * coef[i]) # return operand load intrinsic

(19)

Unified Code Generation

Implement callback functions to generate each operand

19

Memory Operation:

a[x+r,y+s,rc]

Flattened:

a[(x+r)*d1+(y+s)*d0+rc]

Arguments:

array: a

base : (x+r)*d1+(y+s)*d0

loop : [rc, k]

coef : [1, 0]

Generated: a[(x+r)*d1+(y+s)*d0+(0..4)]

def operand_generator(array, base, loops, coef): # implement the rule of codegen here

# array: the array pointer of the memory operation # loops: chosen loops to be tensorized,

# from inner to outer

# coef: the coefficient of each loop variable # base: the base addresss

# thus, index = base + sum(loops[i] * coef[i]) # return operand load intrinsic

(20)

Unified Code Generation

Implement callback functions to generate each operand

20

Memory Operation:

a[x+r,y+s,rc]

Flattened:

a[(x+r)*d1+(y+s)*d0+rc]

Arguments:

array: a

base : (x+r)*d1+(y+s)*d0

loop : [rc, k]

coef : [1, 0]

Generated: broadcast(a[(x+r)*d1+(y+s)*d0+(0..4)], 16)

def operand_generator(array, base, loops, coef): # implement the rule of codegen here

# array: the array pointer of the memory operation # loops: chosen loops to be tensorized,

# from inner to outer

# coef: the coefficient of each loop variable # base: the base addresss

# thus, index = base + sum(loops[i] * coef[i]) # return operand load intrinsic

(21)

Unified Code Generation

Implement callback functions to generate each operand

21

def codegen(opcode, operands, callbacks):

args = [func(arg) for arg, func in zip(operands, callbacks)] return inline_asm(opcode, args)

Invoke each callback function to plug in the operands

def operand_generator(array, base, loops, coef): # implement the rule of codegen here

# array: the array pointer of the memory operation # loops: chosen loops to be tensorized,

# from inner to outer

# coef: the coefficient of each loop variable # base: the base addresss

# thus, index = base + sum(loops[i] * coef[i]) # return operand load intrinsic

(22)

Unifying Tensorized Instruction Compilation

Unified Instruction Abstraction

Instructions integrated by their semantics

Unified Analysis of Applicability

Computation: Arithmetic isomorphism

Memory Access: Pattern isomorphism

Unified Code Generation Interfaces

Reorganize the loops

Rewrite with the tensorized inst.

Tuning for favorable performance

22

Tensor Operation Prog.

Hardware Target

Analysis

Xform

Tensor. Inst.

Intel x86

NVIDIA

...

Rewrite Tune

(23)

Idiom-Based Performance Tuning

The outer loops are open to performance tuning

Data Parallel

/

Reduction

Parallelism

Coarse-Grain: Thread-level Parallelism

Distribute compute to proper #cores

Fine-Grain: Pipeline Parallelism

Achieve instruction-level parallelism by avoiding loop-carried penalty

23

for (

s0

=0;

s0

<d0; ++

s0

)

for (

s1

=0;

s1

<d1; ++

s1

)

for (

s2

=0;

s2

<d2; ++

s2

)

...

for (

r0

=0;

r0

<rd0; ++

r0

)

for (

r1

=0;

r1

<rd1; ++

r1

)

tensorized inst.;

(24)

CPU Performance Tuning

Coarse-Grain Parallelism: Distributing spatial loops to threads

Find-Grain Parallelism: Avoiding loop carried dependences

Reorder and unroll a spatial loop

24

for (

s0

=0;

s0

<d0; ++

s0

)

...

for (

sn

=0;

sn

<dn; ++

sn

)

...

for (

r0

=0;

r0

<rd0; ++

r0

)

for (

r1

=0;

r1

<rd1; ++

r1

)

tensorized instruction;

parallel (

fused

=0;

fused

<fd; ++

fused

)

for (

serial

=0;

serial

<sd; ++

serial

)

for (

r0

=0;

r0

<extr0; ++

r0

)

...

for (

rm

=0;

rm

<ext_rm; ++

rm

) {

tensorized-instruction.0;

tensorized-instruction.1;

... }

(25)

GPU Performance Tuning (Generic)

-

No data reuse across the innermost reduction loop

-

Loop-carried accumulation causes pipeline penalty

// Direct accumulation

// a[n,k], b[k,m], c[n,m]

Buffer<fp16,16,16> A, B;

Buffer<fp32,16,16> C;

for (

i

=0;

i

<n;

i

+=16)

for (

j

=0;

j

<m;

j

+=16)

for (

r

=0;

r

<k;

r

+=16) {

A = Load(a[

i

:16,

r

:16]);

B = Load(b[

r

:16,

j

:16]);

C += TensorCore(A, B); }

Store(c[

i

:16,

j

:16], C);

25 += ×

(26)

// pxp outer product

// a[n,k], b[k,m], c[n,m]

for (

i

=0;

i

<n;

i

+=16*p)

for (

j

=0;

j

<m;

j

+=16*p)

for (

r

=0;

r

<k;

r

+=16) {

Buffer<fp16,16,16> A[p], B[p];

Buffer<fp32,16,16> C[p][p];

#pragma unroll

for (

x

=0;

x

<p; ++

x

) {

A[x] = Load(a[

i

+

x

*16:16,

r

:16]);

B[x] = Load(b[

r

:16,

j

+

x

*16:16]); }

#pragma unroll

for (

x

=0;

x

<p; ++

x

)

#pragma unroll

for (

y

=0;

y

<p; ++

y

)

C[

x

][

y

] += TensorCore(a[

x

],b[

y

]); }

for (

x

=0;

x

<p; ++

x

)

for (

y

=0;

y

<p; ++

y

)

Store(c[

i

+

x

*16,

j

+

y

*16], C[

x

][

y

];); }

26

Unroll 2 loops by pxp

+

Loop-carried

dependence avoided

by the outer-product

+

Each loaded sub-matrix

are reused p times

GPU Performance Tuning (Generic)

+= ×

A:

(27)

CNN-Specialized Tuning on GPU

Small width and height

Deep channels

(28)

CNN-Specialized Tuning (Fuse Dim.)

Tensors in DNN workloads often have small width and height

① Padding a perfect tiling size is wasting

② Fuse width and height to safe memory traffic

-

Introduces software overhead of data rearrangement

28

①: More than 3/4 (25/32)

traffic is wasted by padding

②: Less than 1/4 (15/64) traffic is wasted by padding.

7

32

7

8

(29)

CNN-Specialized Tuning (Split Red)

Tensors in DNN workloads often have deep input channels

① Split the reduce loop across threads

② Store the partial accumulation in shared memory

③ Reduce the partial sum and write back

-

A proper degree of splitting

Small: To small to hide memory latency

Large: Overhead of sync; register pressure

29

(30)

Evaluation: Methodology

Hardware

CPU: Amazon EC2 c5.12xlarge, with Intel Xeon Platinum 8275 CL @3.00G

GPU: Amazon EC2 p3.2xlarge, with Nvidia Tesla V100

ARM: Amazon EC2 m6g.8xlarge, with Amazon Graviton 2 ARM CPU

Software

Compiler and Runtime: LLVM-10, and CUDA-10

Vendor Provided Libraries: cuDNN 7.6.5, and oneDNN v1.6.1

DNN Models: BS=1, MxNet models converted to TVM Relay [32] for

Padded data shape

(31)

Evaluation: Goal

Performance

End-to-end: How is the overall performance of UNIT?

9 popular end-to-end DNN models

Ablation: How does each optimization help the performance?

16 representative convolution layers

Extensibility

Hardware Platform: ARM DOT

(32)

E2E Performance (Intel Xeon Platinum 8275CL)

32

0.00

0.50

1.00

1.50

2.00

ResNet-18 ResNet-50 ResNet-50bInception-bnInception-v3ResNet-101ResNet-152Mobilenet-v1Mobilenet-v2

GM

1.30 1.27 1.34 1.31 1.17 1.39 1.78 1.07 1.18 1.45 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

MxNet (MKLDNN)

UNIT

(33)

E2E Performance (Nvidia Tesla V100)

33

0.00

0.50

1.00

1.50

2.00

ResNet-18 ResNet-50 ResNet-50bInception-bnInception-v3ResNet-101ResNet-152Mobilenet-v1Mobilenet-v2

GM

1.75 2.21 1.61 1.69 1.66 1.75 2.01 1.53 1.62 1.82 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

cuDNN (fp16)

UNIT

(34)

Performance Impact of Tuning (CPU)

34

0

1

2

3

4

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

oneDNN

Parallel

+Unroll

+Tune

1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 C 288 160 1056 80 128 192 256 1024 128 576 96 1024 576 64 64 608 H=W(I) 35 3 9 7 73 16 16 16 14 16 14 16 14 14 29 56 14 K 384 224 192 192 3 128 192 256 512 160 192 128 256 128 96 128 192 R=S 3 3 1 1 3 3 3 3 1 3 1 3 1 1 3 1 1 Stride 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 H=W(O) 17 7 7 71 14 14 14 14 14 14 14 14 14 27 28 14

(35)

Performance Impact of Tuning (GPU)

35

0

1

2

3

4

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

cuDNN

Generic

+FuseDim

+SplitK

+Tune

1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 C 288 160 1056 80 128 192 256 1024 128 576 96 1024 576 64 64 608 H=W(I) 35 3 9 7 73 16 16 16 14 16 14 16 14 14 29 56 14 K 384 224 192 192 3 128 192 256 512 160 192 128 256 128 96 128 192 R=S 3 3 1 1 3 3 3 3 1 3 1 3 1 1 3 1 1 Stride 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 H=W(O) 17 7 7 71 14 14 14 14 14 14 14 14 14 27 28 14

(36)

E2E Performance (ARM Amazon Graviton 2)

Describe ARM DOT in Tensor DSL

Reuse Analysis, Xform, and Tuning

36

a, b = tensor((16,),i8), tensor((16,),i8)

c, d = tensor((4,), i32), tensor((4,), i32)

i, j =

loop_axis

(0,4),

reduce_axis

(0,4)

d[i] = c[i] + sum(i32(a[i*4+j])*i32(b[i*4+j]))

ARM DOT arm.neon.sdot.v4i32.v16i8

0

2

4

6

8

ResNet-18

ResNet-50

ResNet-50b Inception-bn Inception-v3 ResNet-101 ResNet-152 Mobilenet-v1 Mobilenet-v2

GM

5.8 3.2 2.8 5.6 6.2 15.4 6.8 6.1 6.1 10.0 5.4 3.1 2.5 5.2 5.7 12.6 5.8 5.7 5.7 7.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NEON TVM UNIT

(37)

Conclusion

UNIT

A unified compilation flow for the emerging tensorization idiom

Tuning strategies for DNN workloads

Future Work

Automated data layout transformation

A extension to vectorizer

References

Related documents

Products and Services Developed Dissemination Strategies Developed Information Products Information Services Inputs Processes Initial Outcomes Outputs Intermediate Outcomes

As part of the methodology, the correlation coefficients (Cooper, 2008) for the involved variables (enforcement actions) were estimated; other density measures

When this bit is set to logic 0, a MOVC instruction in external program memory space will be able to access code only in the external memory, not in the internal memory. A

309 Polynomials Key Terms polynomial coefficient degree monomial binomial GCF quadratic equation parabola conjugates trinomial5. 5.1 Addition and Subtraction of

Based on this information, the ex- pected returns for time t and the optimal portfolios and benchmark portfolios for the following nine strategies are determined subsequently:

Essentially (for weight 2) one follows in the same vein but by considering the action of Gal(Q/Q) on the l-adic “Tate module” associated to the Jacobian of the corresponding

Some measures to fight illegal logging can impact both sectors (international and domestic), like the Forest Law Enforcement Governance and Trade action plan, but they have to

To overcome the aforementioned barrier, researchers have developed implicit unconditionally stable FDTD methods, such as the alternating- direction implicit (ADI) method [3, 4],