(1)

A high-level implementation of software pipelining in LLVM

Roel Jordans¹, David Moloney²

¹ Eindhoven University of Technology, The Netherlands
  r.jordans@tue.nl

² Movidius Ltd., Ireland

2015 European LLVM Conference, Tuesday April 14th

(2)

Overview

Rationale

Implementation

Results

(3)

Overview

Rationale

Implementation

Results

(4)

Rationale

Software pipelining (often Modulo Scheduling)
- Interleave operations from multiple loop iterations (sketched in the example below)
- Improved loop ILP
- Currently missing from LLVM

Loop scheduling technique
- Requires both loop dependency and resource availability information
- Usually done at a target-specific level as part of scheduling
- But it would be very good if we could re-use this implementation for different targets
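To make this concrete, here is a minimal illustration (not taken from the slides; the function names and the three-stage split are assumptions) of how a software-pipelined loop interleaves operations from several iterations. In the steady-state kernel the store of iteration i, the multiply of iteration i+1, and the load of iteration i+2 use values produced in the previous kernel pass, so they are independent and can issue together, exposing more ILP than the strictly sequential body:

// Illustrative sketch only (C++): a simple loop and a hand-pipelined version.
void scale(const int *in, int *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] * 3;                 // load -> multiply -> store, strictly in order
}

void scale_pipelined(const int *in, int *out, int n) {
  if (n < 2) { scale(in, out, n); return; }
  int a = in[0];                        // prologue: load of iteration 0
  int b = a * 3;                        // prologue: multiply of iteration 0, ...
  int a2 = in[1];                       // ... overlapped with the load of iteration 1
  for (int i = 0; i < n - 2; ++i) {     // kernel: three iterations in flight
    out[i] = b;                         //   store    of iteration i
    b = a2 * 3;                         //   multiply of iteration i+1
    a2 = in[i + 2];                     //   load     of iteration i+2
  }
  out[n - 2] = b;                       // epilogue: drain the last two iterations
  out[n - 1] = a2 * 3;
}

The kernel is preceded by a prologue that fills the pipeline and followed by an epilogue that drains it; this is the same prologue/kernel/epilogue structure that the code generation slide later produces at the IR level.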

(5)
(6)
(7)

Source Level Modulo Scheduling (SLMS)

SLMS: Source-to-source translation at statement level

Towards a Source Level Compiler: Source Level Modulo Scheduling – Ben-Asher & Meisler (2007)
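As an invented illustration of the statement-level, source-to-source idea (not an example from the Ben-Asher & Meisler paper), take a body with two statements S1 and S2; the pipelined source overlaps S1 of iteration i+1 with S2 of iteration i, giving an initiation interval (II) of one iteration:

// Hypothetical illustration of statement-level modulo scheduling with II = 1.
void slms_before(const int *a, int *b, int n) {
  for (int i = 0; i < n; ++i) {
    int t = a[i] + 1;                 // S1: produces t[i]
    b[i] = t * 2;                     // S2: consumes t[i]
  }
}

void slms_after(const int *a, int *b, int n) {
  if (n == 0) return;
  int t = a[0] + 1;                   // prologue: S1 of iteration 0
  for (int i = 0; i < n - 1; ++i) {
    b[i] = t * 2;                     // kernel: S2 of iteration i
    t = a[i + 1] + 1;                 //         S1 of iteration i+1
  }
  b[n - 1] = t * 2;                   // epilogue: S2 of the last iteration
}

When no valid II exists at this granularity, the fallback noted on the next slide is to split (decompose) statements.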

(8)
(9)

SLMS features and limitations

- Improves performance in many cases
- No resource constraints considered
- Works with complete statements
  - When no valid II is found, statements may be split (decomposed)

(10)

This work

What would happen if we did this at LLVM’s IR level?

- More fine-grained statements (close to operations)
- Coarse resource constraints through target hooks
- Schedule the loop pipelining pass late in the optimization pipeline (a pass skeleton is sketched below)
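A rough idea of where such a transformation could live is sketched below, using the LLVM 3.5-era legacy pass manager (the version named in the conclusion). The class name, pass name, and structure are assumptions for illustration, not the authors' actual implementation:

#include "llvm/Analysis/LoopPass.h"
#include "llvm/IR/Function.h"
using namespace llvm;

namespace {
// Hypothetical skeleton of an IR-level software pipelining pass.
struct LoopPipeliner : public LoopPass {
  static char ID;
  LoopPipeliner() : LoopPass(ID) {}

  bool runOnLoop(Loop *L, LPPassManager &LPM) override {
    if (!L->getSubLoops().empty())   // only innermost loops are candidates
      return false;
    // 1. build the dependence graph, 2. compute the II,
    // 3. reschedule and emit prologue/kernel/epilogue (not shown)
    return false;                    // report whether the IR was changed
  }

  void getAnalysisUsage(AnalysisUsage &AU) const override {
    // dependence and cost analyses would be requested here
  }
};
}

char LoopPipeliner::ID = 0;
static RegisterPass<LoopPipeliner> X("loop-pipeline",
                                     "IR-level software pipelining (sketch)");

A pass registered this way can be exercised from opt or inserted near the end of a target's pass pipeline, after the usual loop canonicalization passes have run.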

(11)

Overview

Rationale

Implementation

Results

(12)

IR data dependencies

- Memory dependencies

(13)

Revisiting our example: memory dependencies

define void @foo(i8* nocapture %in, i32 %width) #0 {
entry:
  %cmp = icmp ugt i32 %width, 1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                          ; preds = %entry, %for.body
  %i.012 = phi i32 [ %inc, %for.body ], [ 1, %entry ]
  %sub = add i32 %i.012, -1
  %arrayidx = getelementptr inbounds i8* %in, i32 %sub
  %0 = load i8* %arrayidx, align 1, !tbaa !0       ; loads in[i-1], stored by the previous iteration
  %arrayidx1 = getelementptr inbounds i8* %in, i32 %i.012
  %1 = load i8* %arrayidx1, align 1, !tbaa !0      ; loads in[i]
  %add = add i8 %1, %0
  store i8 %add, i8* %arrayidx1, align 1, !tbaa !0 ; stores in[i]: loop-carried memory dependence with %0
  %inc = add i32 %i.012, 1
  %exitcond = icmp eq i32 %inc, %width
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                           ; preds = %for.body, %entry
  ret void
}

(14)

Revisiting our example: using a phi-node

define void @foo(i8* nocapture %in, i32 %width) #0 {
entry:
  %arrayidx = getelementptr inbounds i8* %in, i32 0
  %prefetch = load i8* %arrayidx, align 1, !tbaa !0 ; pre-load in[0] before entering the loop
  %cmp = icmp ugt i32 %width, 1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                          ; preds = %entry, %for.body
  %i.012 = phi i32 [ %inc, %for.body ], [ 1, %entry ]
  %0 = phi i8 [ %add, %for.body ], [ %prefetch, %entry ] ; in[i-1] carried in a register instead of reloaded
  %arrayidx1 = getelementptr inbounds i8* %in, i32 %i.012
  %1 = load i8* %arrayidx1, align 1, !tbaa !0
  %add = add i8 %1, %0
  store i8 %add, i8* %arrayidx1, align 1, !tbaa !0
  %inc = add i32 %i.012, 1
  %exitcond = icmp eq i32 %inc, %width
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                           ; preds = %for.body, %entry
  ret void
}

(15)

Target hooks

- Communicate available resources from the target-specific layer (a hypothetical interface is sketched below)
- Candidate resource constraints:
  - Number of scalar function units
  - Number of vector function units
  - ...
- IR instruction cost
  - Obtained from CostModelAnalysis
  - Currently only a debug pass and re-implemented by each user
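A purely hypothetical sketch of what such a hook could hand to the pipeliner is shown below; none of these names are existing LLVM interfaces, and the numbers are made up. Per-instruction costs would come from the existing cost model (CostModelAnalysis / TargetTransformInfo) rather than a new mechanism:

// Hypothetical per-target resource description for the IR-level pipeliner.
// These names are illustrative only and are not an existing LLVM interface.
struct PipelinerResourceInfo {
  unsigned NumScalarUnits;   // scalar functional units usable per cycle
  unsigned NumVectorUnits;   // vector functional units usable per cycle
};

// A target hook would fill this in; the values below are made-up examples.
PipelinerResourceInfo getPipelinerResourceInfo() {
  PipelinerResourceInfo RI;
  RI.NumScalarUnits = 2;
  RI.NumVectorUnits = 1;
  return RI;
}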

(16)

The scheduling algorithm

- Swing Modulo Scheduling
  - Fast heuristic algorithm
  - Also used by GCC (and, in the past, by LLVM)
- Scheduling in five steps:
  1. Find cyclic (loop-carried) dependencies and their length
  2. Find resource pressure
  3. Compute the minimal initiation interval (II) (see the sketch below)
  4. Order nodes according to 'criticality'
  5. Schedule nodes in order

Swing Modulo Scheduling: A Lifetime-Sensitive Approach – Llosa et al. (1996)
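The minimal II computed in step 3 is the standard lower bound from Llosa et al.: MII = max(ResMII, RecMII), where ResMII divides the per-iteration demand on each resource class by the number of available units, and RecMII divides the total latency of each dependence cycle by its iteration distance. A minimal sketch of that computation, with the data structures assumed for illustration:

#include <algorithm>
#include <vector>

struct ResourceClass { unsigned UsesPerIteration; unsigned NumUnits; };
struct Recurrence    { unsigned Latency; unsigned Distance; }; // cycle in the dependence graph

// ceil(a / b) for positive integers
static unsigned ceilDiv(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Minimum initiation interval: MII = max(ResMII, RecMII)
unsigned computeMII(const std::vector<ResourceClass> &Res,
                    const std::vector<Recurrence> &Rec) {
  unsigned ResMII = 1, RecMII = 1;
  for (const ResourceClass &R : Res)
    ResMII = std::max(ResMII, ceilDiv(R.UsesPerIteration, R.NumUnits));
  for (const Recurrence &C : Rec)
    RecMII = std::max(RecMII, ceilDiv(C.Latency, C.Distance));
  return std::max(ResMII, RecMII);
}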

(17)

Code generation

[Figure: CFGs for the 'loop5b' and 'loop10' functions after pipelining. Both contain the new for.body.lp.prologue, for.body.lp.kernel, and for.body.lp.epilogue blocks; 'loop5b' also keeps the original for.body as a fallback, while 'loop10' branches from entry directly into the pipelined loop.]

- Construct the new loop structure (prologue, kernel, epilogue), illustrated below
- Branch into the new loop when sufficient iterations are available
- Clean-up through constant propagation, CSE, and CFG simplification
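Mapping this back to the running example, a hypothetical C-level view of the shape of the generated code might look as follows (the trip-count threshold and function name are assumptions; the block names in the comments follow the CFG figure above):

// Hypothetical C-level view of the generated structure for
// for (i = 1; i < width; i++) in[i] += in[i-1];
void foo_pipelined(unsigned char *in, int width) {
  if (width <= 1) return;                           // entry guard from the original loop
  if (width >= 4) {                                 // enough iterations for the pipelined loop?
    unsigned char carried = in[0];                  // for.body.lp.prologue: pre-load in[0]
    int i = 1;
    for (; i + 1 < width; ++i) {                    // for.body.lp.kernel
      carried = (unsigned char)(in[i] + carried);   // value carried in a register (the phi)
      in[i] = carried;
    }
    in[i] = (unsigned char)(in[i] + carried);       // for.body.lp.epilogue: last iteration
  } else {
    for (int i = 1; i < width; ++i)                 // for.body: original loop for short trip counts
      in[i] = (unsigned char)(in[i] + in[i - 1]);
  }
}

This mirrors the 'loop5b' shape in the figure: a guarded pipelined loop with prologue, kernel, and epilogue, with the original loop kept as a fallback when too few iterations remain.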

(18)

Overview

Rationale

Implementation

Results

(19)

Target platform

- Initial implementation for Movidius’ SHAVE architecture
  - 8-issue VLIW processor
  - With DSP and SIMD extensions
- More on this architecture later today! (LG02 @ 14:40)

(20)

Results

- Good points:
  - It works
  - Up to 1.5x speedup observed in TSVC tests
  - Even higher ILP improvements
- Weak spots:
  - Still many big regressions (up to 4x slowdown)
  - Some serious problems still need to be fixed
    - Instruction patterns are split over multiple loop iterations
    - My bookkeeping of live variables needs improvement

(21)

Possible improvements

- User control
  - Selective application to loops (e.g. through #pragma)
- Predictability
  - Modeling of instruction patterns in IR
  - Improved resource model
  - Better profitability analysis
  - Superblock instruction selection to find complex operations

(22)

Overview

Rationale

Implementation

Results

(23)

Conclusion

- It works, somewhat...
- IR instruction patterns are difficult to keep intact
- Still lots of room for improvement
  - Upgrade from LLVM 3.5 to trunk
  - Fix bugs (bookkeeping of live values, ...)
  - Re-check performance!
  - Fix regressions

(24)

References

Towards a Source Level Compiler: Source Level Modulo Scheduling – Ben-Asher & Meisler (2007)
Swing Modulo Scheduling: A Lifetime-Sensitive Approach – Llosa et al. (1996)
