• No results found

High-level loop transformations and polyhedral compilation

N/A
N/A
Protected

Academic year: 2021

Share "High-level loop transformations and polyhedral compilation"

Copied!
287
0
0

Loading.... (view fulltext now)

Full text

(1)

May 30, 2017 1 / 82

High-Level Loop Transformations and

Polyhedral Compilation

Sven Verdoolaege

Polly Labs and KU Leuven (affiliated researcher)

(2)

Loop Transformations May 30, 2017 3 / 82

Outline

1 Loop Transformations Loop Distribution Loop Fusion Loop Tiling 2 Polyhedral Compilation Introduction Polyhedral Model Schedules Operations Software 3 PPCG Overview Model Extraction Dependence Analysis Scheduling Device Mapping

(3)

Loop Transformations Loop Distribution May 30, 2017 4 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i - 1] ; }

Can this loop be parallelized?

Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sqRpAr1sq RpAr0sq WpBr1sq Lr2s: WpAr2sqRpAr2sq RpAr1sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1];

(4)

Loop Transformations Loop Distribution May 30, 2017 4 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i - 1] ; }

Can this loop be parallelized? Requirement:

writes of iteration do not conflict with reads/writes of other iteration

Lr1s: WpAr1sqRpAr1sq RpAr0sq WpBr1sq Lr2s: WpAr2sqRpAr2sq RpAr1sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1];

(5)

Loop Transformations Loop Distribution May 30, 2017 4 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i - 1]; }

Can this loop be parallelized? Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sq RpAr1sqRpAr0sq WpBr1sq Lr2s: WpAr2sq RpAr2sqRpAr1sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1];

(6)

Loop Transformations Loop Distribution May 30, 2017 4 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i - 1]; }

Can this loop be parallelized? Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sq RpAr1sqRpAr0sq WpBr1sq Lr2s: WpAr2sq RpAr2sqRpAr1sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1];

(7)

Loop Transformations Loop Distribution May 30, 2017 4 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i - 1]; }

Can this loop be parallelized? Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sq RpAr1sqRpAr0sq WpBr1sq Lr2s: WpAr2sq RpAr2sqRpAr1sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1];

No conflicts between iterations of L1 ñcan be run in parallel No conflicts between iterations of L2 ñcan be run in parallel

(8)

Loop Transformations Loop Distribution May 30, 2017 5 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i - 1] ; }

Can this loop be parallelized?

Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sqRpAr1sq RpAr0sq WpBr1sq Lr2s: WpAr2sqRpAr2sq RpAr1sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1];

(9)

Loop Transformations Loop Distribution May 30, 2017 5 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i + 1] ; }

Can this loop be parallelized?

Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sqRpAr1sq RpAr2sq WpBr1sq Lr2s: WpAr2sqRpAr2sq RpAr3sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i + 1];

(10)

Loop Transformations Loop Distribution May 30, 2017 5 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i + 1] ; }

Can this loop be parallelized? Requirement:

writes of iteration do not conflict with reads/writes of other iteration

Lr1s: WpAr1sqRpAr1sq RpAr2sq WpBr1sq Lr2s: WpAr2sqRpAr2sq RpAr3sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i + 1];

(11)

Loop Transformations Loop Distribution May 30, 2017 5 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i + 1]; }

Can this loop be parallelized? Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sq RpAr1sqRpAr2sq WpBr1sq Lr2s: WpAr2sq RpAr2sqRpAr3sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i + 1];

(12)

Loop Transformations Loop Distribution May 30, 2017 5 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i + 1]; }

Can this loop be parallelized? Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sq RpAr1sqRpAr2sq WpBr1sq Lr2s: WpAr2sq RpAr2sqRpAr3sq WpBr2sq Loop distribution changes meaning! L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i + 1];

(13)

Loop Transformations Loop Distribution May 30, 2017 5 / 82

Loop Distribution

L : for ( int i = 1; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = A [ i ] + A [ i + 1]; }

Can this loop be parallelized? Requirement:

writes of iteration do not conflict with reads/writes of other iteration Lr1s: WpAr1sq RpAr1sqRpAr2sq WpBr1sq

Lr2s: WpAr2sq RpAr2sqRpAr3sq WpBr2sq Loop distribution changes meaning!

L1 : for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i );

L2 : for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i + 1];

before distribution, Lr1sreadsAr2svalue written before code fragment after distribution, L2r1sreads Ar2svalue written byL1r2s

(14)

Loop Transformations Loop Fusion May 30, 2017 6 / 82

Loop Fusion

L1 : for ( int i = 0; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 0; i < 100; ++ i ) B [ i ] = g ( A [ i ]);

AssumeA does not fit in the cache

ñ elements get evicted and reloaded for use in L2

Loop fusion

(changes execution order ñ may not preserve meaning)

for ( int i = 0; i < 100; ++ i ) { A [ i ] = f ( i );

B [ i ] = g ( A [ i ] ); }

ñ elements of A get reused immediately ñ betterlocality

If A not needed outside code fragment ñ array can be replaced by a scalar ñ memory compaction

(15)

Loop Transformations Loop Fusion May 30, 2017 6 / 82

Loop Fusion

L1 : for ( int i = 0; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 0; i < 100; ++ i ) B [ i ] = g ( A [ i ]);

AssumeA does not fit in the cache

ñ elements get evicted and reloaded for use in L2 Loop fusion

(changes execution order ñ may not preserve meaning)

for ( int i = 0; i < 100; ++ i ) { A [ i ] = f ( i );

B [ i ] = g ( A [ i ] ); }

ñ elements of A get reused immediately ñ betterlocality

If A not needed outside code fragment ñ array can be replaced by a scalar ñ memory compaction

(16)

Loop Transformations Loop Fusion May 30, 2017 6 / 82

Loop Fusion

L1 : for ( int i = 0; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 0; i < 100; ++ i ) B [ i ] = g ( A [ i ]);

AssumeA does not fit in the cache

ñ elements get evicted and reloaded for use in L2 Loop fusion

(changes execution order ñ may not preserve meaning)

for ( int i = 0; i < 100; ++ i ) { A [ i ] = f ( i );

B [ i ] = g ( A [ i ] ); }

ñ elements of A get reused immediately ñ betterlocality

If A not needed outside code fragment ñ array can be replaced by a scalar ñ memory compaction

(17)

Loop Transformations Loop Fusion May 30, 2017 6 / 82

Loop Fusion

L1 : for ( int i = 0; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 0; i < 100; ++ i ) B [ i ] = g ( A [ i ]);

AssumeA does not fit in the cache

ñ elements get evicted and reloaded for use in L2 Loop fusion

(changes execution order ñ may not preserve meaning)

for ( int i = 0; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = g ( A [ i ] ); }

ñ elements of A get reused immediately ñ betterlocality

If A not needed outside code fragment ñ array can be replaced by a scalar ñ memory compaction

(18)

Loop Transformations Loop Fusion May 30, 2017 6 / 82

Loop Fusion

L1 : for ( int i = 0; i < 100; ++ i ) A [ i ] = f ( i ); L2 : for ( int i = 0; i < 100; ++ i ) B [ i ] = g ( A [ i ]);

AssumeA does not fit in the cache

ñ elements get evicted and reloaded for use in L2

Loop fusion (changes execution order ñ may not preserve meaning) for ( int i = 0; i < 100; ++ i ) { A [ i ] = f ( i ); B [ i ] = g ( A [ i ] ); }

ñ elements of A get reused immediately ñ betterlocality

If A not needed outside code fragment ñ array can be replaced by a scalar ñ memory compaction

(19)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1

B

A C

computed element

load into cache already in cache

(20)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(21)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(22)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(23)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(24)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(25)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(26)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(27)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(28)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(29)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(30)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(31)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(32)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(33)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(34)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(35)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(36)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(37)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(38)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(39)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(40)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(41)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(42)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(43)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(44)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(45)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(46)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 B

A C

computed element

load into cache already in cache

(47)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 Loop tiling

(changes execution order ñ may not preserve meaning)

for ( int ti = 0; ti < 8; ti += 4) for ( int tj = 0; tj < 8; tj += 4) for ( int i = ti ; i < ti + 4; ++ i ) for ( int j = tj ; j < tj + 4; ++ j ) C [ i ][ j ] = A [ i ] * B [ j ]; B A C computed element

load into cache already in cache

(48)

Loop Transformations Loop Tiling May 30, 2017 7 / 82

Loop Tiling

[17, 35]

L1 : for ( int i = 0; i < 8; ++ i ) L2 : for ( int j = 0; j < 8; ++ j )

C [ i ][ j ] = A [ i ] * B [ j ]; AssumeB does not fit in the cache

ñ elements get (re)loaded and evicted in every iteration of L1 Loop tiling (changes execution order ñ may not preserve meaning)

for ( int ti = 0; ti < 8; ti += 4) for ( int tj = 0; tj < 8; tj += 4) for ( int i = ti ; i < ti + 4; ++ i ) for ( int j = tj ; j < tj + 4; ++ j ) C [ i ][ j ] = A [ i ] * B [ j ]; B A C computed element

load into cache already in cache

(49)

Polyhedral Compilation May 30, 2017 8 / 82

Outline

1 Loop Transformations Loop Distribution Loop Fusion Loop Tiling 2 Polyhedral Compilation Introduction Polyhedral Model Schedules Operations Software 3 PPCG Overview Model Extraction Dependence Analysis Scheduling Device Mapping

(50)

Polyhedral Compilation Introduction May 30, 2017 9 / 82

Motivation

Computer architectures are becoming more difficult to program efficiently

§ multiple levels of parallelism § non-uniform memory architectures

ñ Advanced compiler optimizations are required

§ hierarchical partitioning and reordering of operations (e.g., parallelization, loop fusion, . . . )

§ mapping to different processing units § memory transfers between processing units ñ Global view of individual operations is required ñ Polyhedral Model

(51)

Polyhedral Compilation Introduction May 30, 2017 10 / 82

Polyhedral Compilation — Example

for ( t = 0; t < T ; t ++) for ( i = 1; i < N - 1; i ++) A [( t + 1 ) % 2 ] [ i ] = A [ t %2][ i -1] + A [ t %2][ i +1]; t i ñ 1 1 1 3 3 3 5 5 5 0 0 0 2 2 2 4 4 4 6 6 6 t i

1 Extract polyhedral model

ñ each dynamic instance represented bypt,iqpair 2 Compute dependences

ñ iterationt“2,i“3 depends on iterationt “1,i“4

3 Compute schedule respecting dependences

ñ tiles with same number can be executed in parallel ñ rows within tiles can be executed in parallel

(52)

Polyhedral Compilation Introduction May 30, 2017 10 / 82

Polyhedral Compilation — Example

for ( t = 0; t < T ; t ++) for ( i = 1; i < N - 1; i ++) A [( t + 1 ) % 2 ] [ i ] = A [ t %2][ i -1] + A [ t %2][ i +1]; t i ñ 1 1 1 3 3 3 5 5 5 0 0 0 2 2 2 4 4 4 6 6 6 t i

1 Extract polyhedral model

ñ each dynamic instance represented bypt,iqpair

2 Compute dependences

ñ iterationt“2,i“3 depends on iterationt “1,i“4

3 Compute schedule respecting dependences

ñ tiles with same number can be executed in parallel ñ rows within tiles can be executed in parallel

(53)

Polyhedral Compilation Introduction May 30, 2017 10 / 82

Polyhedral Compilation — Example

for ( t = 0; t < T ; t ++) for ( i = 1; i < N - 1; i ++) A [( t + 1 ) % 2 ] [ i ] = A [ t %2][ i -1] + A [ t %2][ i +1]; t i ñ 1 1 1 3 3 3 5 5 5 0 0 0 2 2 2 4 4 4 6 6 6 t i

1 Extract polyhedral model

ñ each dynamic instance represented bypt,iqpair

2 Compute dependences

ñ iterationt“2,i“3 depends on iterationt “1,i“4

3 Compute schedule respecting dependences

ñ tiles with same number can be executed in parallel ñ rows within tiles can be executed in parallel

(54)

Polyhedral Compilation Introduction May 30, 2017 10 / 82

Polyhedral Compilation — Example

for ( t = 0; t < T ; t ++) for ( i = 1; i < N - 1; i ++) A [( t + 1 ) % 2 ] [ i ] = A [ t %2][ i -1] + A [ t %2][ i +1]; t i ñ 1 1 1 3 3 3 5 5 5 0 0 0 2 2 2 4 4 4 6 6 6 t i

1 Extract polyhedral model

ñ each dynamic instance represented bypt,iqpair 2 Compute dependences

ñ iterationt“2,i“3 depends on iterationt “1,i“4 3 Compute schedule respecting dependences

ñ tiles with same number can be executed in parallel ñ rows within tiles can be executed in parallel

(55)

Polyhedral Compilation Introduction May 30, 2017 10 / 82

Polyhedral Compilation — Example

for ( t = 0; t < T ; t ++) for ( i = 1; i < N - 1; i ++) A [( t + 1 ) % 2 ] [ i ] = A [ t %2][ i -1] + A [ t %2][ i +1]; t i ñ 1 1 1 3 3 3 5 5 5 0 0 0 2 2 2 4 4 4 6 6 6 t i

1 Extract polyhedral model

ñ each dynamic instance represented bypt,iqpair 2 Compute dependences

ñ iterationt“2,i“3 depends on iterationt “1,i“4 3 Compute schedule respecting dependences

ñ tiles with same number can be executed in parallel ñ rows within tiles can be executed in parallel

(56)

Polyhedral Compilation Introduction May 30, 2017 10 / 82

Polyhedral Compilation — Example

for ( t = 0; t < T ; t ++) for ( i = 1; i < N - 1; i ++) A [( t + 1 ) % 2 ] [ i ] = A [ t %2][ i -1] + A [ t %2][ i +1]; t i ñ 1 1 1 3 3 3 5 5 5 0 0 0 2 2 2 4 4 4 6 6 6 t i

1 Extract polyhedral model

ñ each dynamic instance represented bypt,iqpair 2 Compute dependences

ñ iterationt“2,i“3 depends on iterationt “1,i“4

3 Compute schedule respecting dependences

ñ tiles with same number can be executed in parallel ñ rows within tiles can be executed in parallel

(57)

Polyhedral Compilation Introduction May 30, 2017 10 / 82

Polyhedral Compilation — Example

for ( t = 0; t < T ; t ++) for ( i = 1; i < N - 1; i ++) A [( t + 1 ) % 2 ] [ i ] = A [ t %2][ i -1] + A [ t %2][ i +1]; t i ñ 1 1 1 3 3 3 5 5 5 0 0 0 2 2 2 4 4 4 6 6 6 t i

1 Extract polyhedral model

ñ each dynamic instance represented bypt,iqpair 2 Compute dependences

ñ iterationt“2,i“3 depends on iterationt “1,i“4 3 Compute schedule respecting dependences

ñ tiles with same number can be executed in parallel ñ rows within tiles can be executed in parallel

(58)

Polyhedral Compilation Introduction May 30, 2017 10 / 82

Polyhedral Compilation — Example

for ( t = 0; t < T ; t ++) for ( i = 1; i < N - 1; i ++) A [( t + 1 ) % 2 ] [ i ] = A [ t %2][ i -1] + A [ t %2][ i +1]; t i ñ 1 1 1 3 3 3 5 5 5 0 0 0 2 2 2 4 4 4 6 6 6 t i

1 Extract polyhedral model

ñ each dynamic instance represented bypt,iqpair 2 Compute dependences

ñ iterationt“2,i“3 depends on iterationt “1,i“4 3 Compute schedule respecting dependences

ñ tiles with same number can be executed in parallel ñ rows within tiles can be executed in parallel

(59)

Polyhedral Compilation Polyhedral Model May 30, 2017 11 / 82

Polyhedral Model

[27] Key features instance based ñ statementinstances ñ arrayelements

compact representation based on polyhedra or similar objects ñ Presburger sets and relations

defined by Presburger formula

ñ . . .

Main constituents of program representation

Instance Set

ñ the set of all statement instances

Access Relations

ñ the array elements accessed by a statement instance

Dependences

ñ the statement instances that depend on a statement instance

Schedule

ñ the relative execution order of statement instances

Context

(60)

Polyhedral Compilation Polyhedral Model May 30, 2017 12 / 82

Polyhedral Model — Example

for ( i = 0; i < 3; ++ i ) S : B [ i ] = f ( A [ i ]); for ( i = 0; i < 3; ++ i ) T : C [ i ] = g ( B [2 - i ]); S[],T[] S[0],S[1],S[2] T[0],T[1],T[2] S[0] T[0] B[0] S[1] T[1] B[1] S[2] T[2] B[2] for ( c = 0; c < 3; ++ c ) { B [ c ] = f ( A [ c ]); C [2 - c ] = g ( B [ c ]); } S[0]T[2],S[1]T[1],S[2]T[0] S[],T[]

input code input execution order

model

(61)

Polyhedral Compilation Polyhedral Model May 30, 2017 12 / 82

Polyhedral Model — Example

for ( i = 0; i < 3; ++ i ) S : B [ i ] = f ( A [ i ]); for ( i = 0; i < 3; ++ i ) T : C [ i ] = g ( B [2 - i ]); S[],T[] S[0],S[1],S[2] T[0],T[1],T[2] S[0] T[0] B[0] S[1] T[1] B[1] S[2] T[2] B[2] for ( c = 0; c < 3; ++ c ) { B [ c ] = f ( A [ c ]); C [2 - c ] = g ( B [ c ]); } S[0]T[2],S[1]T[1],S[2]T[0] S[],T[]

input code input execution order

model

(62)

Polyhedral Compilation Polyhedral Model May 30, 2017 12 / 82

Polyhedral Model — Example

for ( i = 0; i < 3; ++ i ) S : B [ i ] = f ( A [ i ]); for ( i = 0; i < 3; ++ i ) T : C [ i ] = g ( B [2 - i ]); S[],T[] S[0],S[1],S[2] T[0],T[1],T[2] S[0] T[0] B[0] S[1] T[1] B[1] S[2] T[2] B[2] for ( c = 0; c < 3; ++ c ) { B [ c ] = f ( A [ c ]); C [2 - c ] = g ( B [ c ]); } S[0]T[2],S[1]T[1],S[2]T[0] S[],T[]

input code input execution order

model

(63)

Polyhedral Compilation Polyhedral Model May 30, 2017 12 / 82

Polyhedral Model — Example

for ( i = 0; i < 3; ++ i ) S : B [ i ] = f ( A [ i ]); for ( i = 0; i < 3; ++ i ) T : C [ i ] = g ( B [2 - i ]); S[],T[] S[0],S[1],S[2] T[0],T[1],T[2] S[0] T[0] B[0] S[1] T[1] B[1] S[2] T[2] B[2] for ( c = 0; c < 3; ++ c ) { B [ c ] = f ( A [ c ]); C [2 - c ] = g ( B [ c ]); } S[0]T[2],S[1]T[1],S[2]T[0] S[],T[]

input code input execution order

model

(64)

Polyhedral Compilation Polyhedral Model May 30, 2017 12 / 82

Polyhedral Model — Example

for ( i = 0; i < 3; ++ i ) S : B [ i ] = f ( A [ i ]); for ( i = 0; i < 3; ++ i ) T : C [ i ] = g ( B [2 - i ]); S[],T[] S[0],S[1],S[2] T[0],T[1],T[2] S[0] T[0] B[0] S[1] T[1] B[1] S[2] T[2] B[2] for ( c = 0; c < 3; ++ c ) { B [ c ] = f ( A [ c ]); C [2 - c ] = g ( B [ c ]); } S[0]T[2],S[1]T[1],S[2]T[0] S[],T[]

input code input execution order

model

(65)

Polyhedral Compilation Polyhedral Model May 30, 2017 12 / 82

Polyhedral Model — Example

for ( i = 0; i < 3; ++ i ) S : B [ i ] = f ( A [ i ]); for ( i = 0; i < 3; ++ i ) T : C [ i ] = g ( B [2 - i ]); S[],T[] S[0],S[1],S[2] T[0],T[1],T[2] S[0] T[0] B[0] S[1] T[1] B[1] S[2] T[2] B[2] for ( c = 0; c < 3; ++ c ) { B [ c ] = f ( A [ c ]); C [2 - c ] = g ( B [ c ]); } S[0]T[2],S[1]T[1],S[2]T[0] S[],T[]

input code input execution order

model

(66)

Polyhedral Compilation Polyhedral Model May 30, 2017 12 / 82

Polyhedral Model — Example

for ( i = 0; i < 3; ++ i ) S : B [ i ] = f ( A [ i ]); for ( i = 0; i < 3; ++ i ) T : C [ i ] = g ( B [2 - i ]); tSrisu,tTrisu tSrisÑ ris u tTrisÑ ris u S[0] T[0] B[0] S[1] T[1] B[1] S[2] T[2] B[2] for ( c = 0; c < 3; ++ c ) { B [ c ] = f ( A [ c ]); C [2 - c ] = g ( B [ c ]); } S[0]T[2],S[1]T[1],S[2]T[0] S[],T[]

input code input execution order

model

(67)

Polyhedral Compilation Polyhedral Model May 30, 2017 12 / 82

Polyhedral Model — Example

for ( i = 0; i < 3; ++ i ) S : B [ i ] = f ( A [ i ]); for ( i = 0; i < 3; ++ i ) T : C [ i ] = g ( B [2 - i ]); tSrisu,tTrisu tSrisÑ ris u tTrisÑ ris u S[0] T[0] B[0] S[1] T[1] B[1] S[2] T[2] B[2] for ( c = 0; c < 3; ++ c ) { B [ c ] = f ( A [ c ]); C [2 - c ] = g ( B [ c ]); } tSrisÑ ris;TrisÑ r2´is u tSrisu,tTrisu

input code input execution order

model

(68)

Polyhedral Compilation Polyhedral Model May 30, 2017 13 / 82

Polyhedral Model

[27] Key features instance based ñ statementinstances ñ arrayelements

compact representation based on polyhedra or similar objects ñ Presburger sets and relations

defined by Presburger formula

ñ . . .

Main constituents of program representation

Instance Set

ñ the set of all statement instances

Access Relations

ñ the array elements accessed by a statement instance

Dependences

ñ the statement instances that depend on a statement instance

Schedule

ñ the relative execution order of statement instances

Context

(69)

Polyhedral Compilation Polyhedral Model May 30, 2017 13 / 82

Polyhedral Model

[27] Key features instance based ñ statementinstances ñ arrayelements

compact representation based on polyhedra or similar objects ñ Presburger sets and relations defined by Presburger formula ñ . . .

Main constituents of program representation

Instance Set

ñ the set of all statement instances

Access Relations

ñ the array elements accessed by a statement instance

Dependences

ñ the statement instances that depend on a statement instance

Schedule

ñ the relative execution order of statement instances

Context

(70)

Polyhedral Compilation Polyhedral Model May 30, 2017 13 / 82

Polyhedral Model

[27] Key features instance based ñ statementinstances ñ arrayelements

compact representation based on polyhedra or similar objects ñ Presburger sets and relations defined by Presburger formula ñ . . .

quasi-affine expression

(no multiplication)

§ variable

§ constant integer number § constant symbol

§ addition (`), subtraction (´)

§ integer division by integer constantd (t¨{du) Presburger formula

§ true

§ quasi-affine expression

§ less-than-or-equal relation (ď) § equality (“)

§ first order logic connectives: ^,_, ,D,@ Main constituents of program representation

Instance Set

ñ the set of all statement instances

Access Relations

ñ the array elements accessed by a statement instance

Dependences

ñ the statement instances that depend on a statement instance

Schedule

ñ the relative execution order of statement instances

Context

(71)

Polyhedral Compilation Polyhedral Model May 30, 2017 13 / 82

Polyhedral Model

[27] Key features instance based ñ statementinstances ñ arrayelements

compact representation based on polyhedra or similar objects ñ Presburger sets and relations defined by Presburger formula ñ . . .

quasi-affine expression

(no multiplication)

§ variable

§ constant integer number § constant symbol

§ addition (`), subtraction (´)

§ integer division by integer constantd (t¨{du)

Presburger formula § true

§ quasi-affine expression

§ less-than-or-equal relation (ď) § equality (“)

§ first order logic connectives: ^,_, ,D,@ Main constituents of program representation

Instance Set

ñ the set of all statement instances

Access Relations

ñ the array elements accessed by a statement instance

Dependences

ñ the statement instances that depend on a statement instance

Schedule

ñ the relative execution order of statement instances

Context

(72)

Polyhedral Compilation Polyhedral Model May 30, 2017 13 / 82

Polyhedral Model

[27] Key features instance based ñ statementinstances ñ arrayelements

compact representation based on polyhedra or similar objects ñ Presburger sets and relations defined by Presburger formula ñ . . .

quasi-affine expression (no multiplication) § variable

§ constant integer number § constant symbol

§ addition (`), subtraction (´)

§ integer division by integer constantd (t¨{du)

Presburger formula § true

§ quasi-affine expression

§ less-than-or-equal relation (ď) § equality (“)

§ first order logic connectives: ^,_, ,D,@ Main constituents of program representation

Instance Set

ñ the set of all statement instances

Access Relations

ñ the array elements accessed by a statement instance

Dependences

ñ the statement instances that depend on a statement instance

Schedule

ñ the relative execution order of statement instances

Context

(73)

Polyhedral Compilation Polyhedral Model May 30, 2017 13 / 82

Polyhedral Model

[27] Key features instance based ñ statementinstances ñ arrayelements

compact representation based on polyhedra or similar objects ñ Presburger sets and relations defined by Presburger formula ñ . . .

quasi-affine expression (no multiplication) § variable

§ constant integer number § constant symbol

§ addition (`), subtraction (´)

§ integer division by integer constantd (t¨{du) Presburger formula

§ true

§ quasi-affine expression

§ less-than-or-equal relation (ď) § equality (“)

§ first order logic connectives: ^,_, ,D,@

Main constituents of program representation

Instance Set

ñ the set of all statement instances

Access Relations

ñ the array elements accessed by a statement instance

Dependences

ñ the statement instances that depend on a statement instance

Schedule

ñ the relative execution order of statement instances

Context

(74)

Polyhedral Compilation Polyhedral Model May 30, 2017 14 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; }

Instance Set(set of statement instances) tS1ri,js:0ďi ăM^0ďj ăN;

S2ri,j,ks:0ďi ăM^0ďj ăN^0ďk ăKu

Access Relations(accessed array elements; W: write,R: read) W “ tS1ri,js ÑCri,js;S2ri,j,ks ÑCri,jsu

(75)

Polyhedral Compilation Polyhedral Model May 30, 2017 15 / 82

Schedule Representation

[30]

ScheduleS keeps track of relative execution order of statement instances ñ for each pair of statement instancesi andj, schedule determines

§ iexecuted beforej(iăS j), § iexecuted afterj(jăS i), or

§ iandjmay be executed simultaneously

Schedule trees form a combined hierarchical schedule representation Main constructs:

§ affine schedule: instances are executed according to affine function

§ band: nested sequence of affine functions called itsmembers; combined multi-dimensional affine function is called

thepartial scheduleof the band

§ sequence: partitions instances through childfilters executed in order Order of instances determined by outermost node that separates them Deriving schedule tree from AST

§ forloopñaffine schedule corresponding to loop iterator § compound statementñsequence

(76)

Polyhedral Compilation Polyhedral Model May 30, 2017 15 / 82

Schedule Representation

[30]

ScheduleS keeps track of relative execution order of statement instances ñ for each pair of statement instancesi andj, schedule determines

§ iexecuted beforej(iăS j), § iexecuted afterj(jăS i), or

§ iandjmay be executed simultaneously

Schedule trees form a combined hierarchical schedule representation Main constructs:

§ affine schedule: instances are executed according to affine function

§ band: nested sequence of affine functions called itsmembers; combined multi-dimensional affine function is called

thepartial scheduleof the band

§ sequence: partitions instances through childfilters executed in order Order of instances determined by outermost node that separates them Deriving schedule tree from AST

§ forloopñaffine schedule corresponding to loop iterator § compound statementñsequence

(77)

Polyhedral Compilation Polyhedral Model May 30, 2017 16 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rjs;S2ri,j,ks Ñ rjs sequence S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks filters affine functions

(78)

Polyhedral Compilation Polyhedral Model May 30, 2017 16 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rjs;S2ri,j,ks Ñ rjs sequence S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks filters affine functions

(79)

Polyhedral Compilation Polyhedral Model May 30, 2017 16 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rjs;S2ri,j,ks Ñ rjs sequence S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks filters affine functions

(80)

Polyhedral Compilation Polyhedral Model May 30, 2017 16 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rjs;S2ri,j,ks Ñ rjs sequence S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks filters affine functions

(81)

Polyhedral Compilation Polyhedral Model May 30, 2017 16 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rjs;S2ri,j,ks Ñ rjs sequence S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks filters affine functions

(82)

Polyhedral Compilation Polyhedral Model May 30, 2017 17 / 82

Schedule Representation

[30]

ScheduleS keeps track of relative execution order of statement instances ñ for each pair of statement instancesi andj, schedule determines

§ iexecuted beforej(iăS j), § iexecuted afterj(jăS i), or

§ iandjmay be executed simultaneously

Schedule trees form a combined hierarchical schedule representation Main constructs:

§ affine schedule: instances are executed according to affine function

§ band: nested sequence of affine functions called itsmembers; combined multi-dimensional affine function is called

thepartial scheduleof the band

§ sequence: partitions instances through childfilters executed in order Order of instances determined by outermost node that separates them Deriving schedule tree from AST

§ forloopñaffine schedule corresponding to loop iterator § compound statementñsequence

(83)

Polyhedral Compilation Polyhedral Model May 30, 2017 17 / 82

Schedule Representation

[30]

ScheduleS keeps track of relative execution order of statement instances ñ for each pair of statement instancesi andj, schedule determines

§ iexecuted beforej(iăS j), § iexecuted afterj(jăS i), or

§ iandjmay be executed simultaneously

Schedule trees form a combined hierarchical schedule representation Main constructs:

§ affine schedule: instances are executed according to affine function § band: nested sequence of affine functions called itsmembers;

combined multi-dimensional affine function is called thepartial scheduleof the band

§ sequence: partitions instances through childfilters executed in order Order of instances determined by outermost node that separates them Deriving schedule tree from AST

§ forloopñaffine schedule corresponding to loop iterator § compound statementñsequence

(84)

Polyhedral Compilation Polyhedral Model May 30, 2017 18 / 82

Named Presburger Relation Schedules

Schedule tree with single (band) node

Flattening a schedule tree two nested band nodes

ñ replace by single band node with concatenated partial schedule sequence with as children either leaves or

trees consisting of a single band node

ñ treat leaves as zero-dimensional band nodes ñ pad lower-dimensional bands (e.g., with zero)

ñ construct one-dimensional band assigning increasing values to children ñ combine one-dimensional band with children

(85)

Polyhedral Compilation Polyhedral Model May 30, 2017 18 / 82

Named Presburger Relation Schedules

Schedule tree with single (band) node Flattening a schedule tree

two nested band nodes

ñ replace by single band node with concatenated partial schedule sequence with as children either leaves or

trees consisting of a single band node

ñ treat leaves as zero-dimensional band nodes ñ pad lower-dimensional bands (e.g., with zero)

ñ construct one-dimensional band assigning increasing values to children ñ combine one-dimensional band with children

(86)

Polyhedral Compilation Polyhedral Model May 30, 2017 19 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rjs;S2ri,j,ks Ñ rjs sequence S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks

(87)

Polyhedral Compilation Polyhedral Model May 30, 2017 19 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rjs;S2ri,j,ks Ñ rjs sequence S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks

(88)

Polyhedral Compilation Polyhedral Model May 30, 2017 19 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rjs;S2ri,j,ks Ñ rjs S1ri,js Ñ r0,0s;S2ri,j,ks Ñ r1,ks S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks

(89)

Polyhedral Compilation Polyhedral Model May 30, 2017 19 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; } S1ri,js Ñ ris;S2ri,j,ks Ñ ris S1ri,js Ñ rj,0,0s;S2ri,j,ks Ñ rj,1,ks S1ri,js Ñ r0,0s;S2ri,j,ks Ñ r1,ks S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks

(90)

Polyhedral Compilation Polyhedral Model May 30, 2017 19 / 82

Parametric Example: Matrix Multiplication

for ( int i = 0; i < M ; i ++) for ( int j = 0; j < N ; j ++) { S1 : C [ i ][ j ] = 0; for ( int k = 0; k < K ; k ++) S2 : C [ i ][ j ] = C [ i ][ j ] + A [ i ][ k ] * B [ k ][ j ]; }

S1ri,js Ñ ri,j,0,0s;S2ri,j,ks Ñ ri,j,1,ks

S1ri,js Ñ rj,0,0s;S2ri,j,ks Ñ rj,1,ks S1ri,js Ñ r0,0s;S2ri,j,ks Ñ r1,ks S1ri,js S1ri,js Ñ r0s S2ri,j,ks S2ri,j,ks Ñ rks

(91)

Polyhedral Compilation Polyhedral Model May 30, 2017 20 / 82

Loop Transformations and the Polyhedral Model

Loop transformations result in

different execution order of statement instances ñ different schedule

Polyhedral model can be used to evaluatea schedule and/or constructa schedule

Polyhedral schedules can represent (combinations of) loop distribution

loop fusion loop tiling . . .

(92)

Polyhedral Compilation Schedules May 30, 2017 21 / 82

Schedule Properties

Validity

New schedule should preserve meaning

Parallelism

Can the iterations of a given loop be executed in parallel? Locality

Statement instances scheduled closely to each other Tilability

(93)

Polyhedral Compilation Schedules May 30, 2017 22 / 82

Schedule Validity

[3]

New schedule should preserve meaning

R(a) W(a) R(a) W(b) W(a) W(a)

Internal restrictions

No read of a value may be scheduled before the write of the value No other write to same memory location may be scheduled in between External restrictions (on non-temporaries)

No write may be scheduled before initial read from a memory location No write may be scheduled after last write to a memory location Sufficient conditions:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

(94)

Polyhedral Compilation Schedules May 30, 2017 22 / 82

Schedule Validity

[3]

New schedule should preserve meaning

R(a) W(a) R(a) W(b) W(a) W(a)

Internal restrictions

No read of a value may be scheduled before the write of the value No other write to same memory location may be scheduled in between External restrictions (on non-temporaries)

No write may be scheduled before initial read from a memory location No write may be scheduled after last write to a memory location

Sufficient conditions:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

(95)

Polyhedral Compilation Schedules May 30, 2017 22 / 82

Schedule Validity

[3]

New schedule should preserve meaning

R(a) W(a) R(a) W(b) W(a) W(a)

Internal restrictions

No read of a value may be scheduled before the write of the value No other write to same memory location may be scheduled in between External restrictions (on non-temporaries)

No write may be scheduled before initial read from a memory location No write may be scheduled after last write to a memory location

Sufficient conditions:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

(96)

Polyhedral Compilation Schedules May 30, 2017 22 / 82

Schedule Validity

[3]

New schedule should preserve meaning

R(a) W(a) R(a) W(b) W(a) W(a)

Internal restrictions

No read of a value may be scheduled before the write of the value No other write to same memory location may be scheduled in between External restrictions (on non-temporaries)

No write may be scheduled before initial read from a memory location No write may be scheduled after last write to a memory location

Sufficient conditions:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

(97)

Polyhedral Compilation Schedules May 30, 2017 22 / 82

Schedule Validity

[3]

New schedule should preserve meaning

R(a) W(a) R(a) W(b) W(a) W(a)

Internal restrictions

No read of a value may be scheduled before the write of the value No other write to same memory location may be scheduled in between External restrictions (on non-temporaries)

No write may be scheduled before initial read from a memory location No write may be scheduled after last write to a memory location

Sufficient conditions:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

(98)

Polyhedral Compilation Schedules May 30, 2017 22 / 82

Schedule Validity

[3]

New schedule should preserve meaning

R(a) W(a) R(a) W(b) W(a) W(a)

Internal restrictions

No read of a value may be scheduled before the write of the value No other write to same memory location may be scheduled in between External restrictions (on non-temporaries)

No write may be scheduled before initial read from a memory location No write may be scheduled after last write to a memory location

Sufficient conditions:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

(99)

Polyhedral Compilation Schedules May 30, 2017 22 / 82

Schedule Validity

[3]

New schedule should preserve meaning

R(a) W(a) R(a) W(b) W(a) W(a)

Internal restrictions

No read of a value may be scheduled before the write of the value No other write to same memory location may be scheduled in between External restrictions (on non-temporaries)

No write may be scheduled before initial read from a memory location No write may be scheduled after last write to a memory location

Sufficient conditions:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

(100)

Polyhedral Compilation Schedules May 30, 2017 22 / 82

Schedule Validity

[3]

New schedule should preserve meaning

R(a) W(a) R(a) W(b) W(a) W(a)

Internal restrictions

No read of a value may be scheduled before the write of the value No other write to same memory location may be scheduled in between External restrictions (on non-temporaries)

No write may be scheduled before initial read from a memory location No write may be scheduled after last write to a memory location Sufficient conditions:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

(101)

Polyhedral Compilation Schedules May 30, 2017 23 / 82

Dependences

Sufficient conditions for validity of scheduleS:

Every read of a memory location is scheduled after every preceding write to the same memory location

Every write to a memory location is scheduled after every preceding read or write to the same memory location

Dependence relationD: pairs of statement instances accessing the same memory location

of which at least one is a write

with the first executed before the second in original code Sufficient condition:

(102)

Polyhedral Compilation Schedules May 30, 2017 24 / 82

Dependence Analysis

Recall: sufficient conditions for validity of schedule S: @iÑjPD:iăS j Dependence relationD: pairs of statement instances

accessing the same memory location of which at least one is a write

with the first executed before the second in original code

Computation:

D “``W´1˝R˘Y`W´1˝W˘Y`R´1˝W˘˘XpăS0q W: write access relation

R: read access relation S0: original schedule instances data order W,R S0 ăS0

(103)

Polyhedral Compilation Schedules May 30, 2017 24 / 82

Dependence Analysis

Recall: sufficient conditions for validity of schedule S: @iÑjPD:iăS j Dependence relationD: pairs of statement instances

accessing the same memory location of which at least one is a write

with the first executed before the second in original code

Computation:

D “``W´1˝R˘Y`W´1˝W˘Y`R´1˝W˘˘XpăS0q W: write access relation

R: read access relation S0: original schedule instances data order W,R S0 ăS0

(104)

Polyhedral Compilation Schedules May 30, 2017 25 / 82

Local Validity

Schedule validity:

@iÑjPD:iăS j Consider subset of local dependencesL

At outermost node: L“D

Current node

band node with partial schedulef

@iÑjPL:fpiq ďlexfpjq Carried dependences: iÑjPL:fpiq ‰fpjq

ñ no longer need to be considered in nested nodes Remaining dependences: L1

“ tiÑjPL:fpiq “fpjq u

sequence node with child position p and filters Fk @iÑjPL:ppiq ďppjq Carried dependences: iÑjPL:ppiq ‰ppjq Remaining dependences in childc: L1

“ tiÑjPL:i,jPFcu

(105)

Polyhedral Compilation Schedules May 30, 2017 25 / 82

Local Validity

Schedule validity:

@iÑjPD:iăS j Consider subset of local dependencesL

At outermost node: L“D Current node

band node with partial schedulef

@iÑjPL:fpiq ďlexfpjq Carried dependences: iÑjPL:fpiq ‰fpjq

ñ no longer need to be considered in nested nodes Remaining dependences: L1

“ tiÑjPL:fpiq “fpjq u

sequence node with child position p and filters Fk @iÑjPL:ppiq ďppjq Carried dependences: iÑjPL:ppiq ‰ppjq Remaining dependences in childc: L1

“ tiÑjPL:i,jPFcu

(106)

Polyhedral Compilation Schedules May 30, 2017 25 / 82

Local Validity

Schedule validity:

@iÑjPD:iăS j Consider subset of local dependencesL

At outermost node: L“D Current node

band node with partial schedulef

@iÑjPL:fpiq ďlexfpjq Carried dependences: iÑjPL:fpiq ‰fpjq

ñ no longer need to be considered in nested nodes Remaining dependences: L1

“ tiÑjPL:fpiq “fpjq u sequence node with child position p and filters Fk

@iÑjPL:ppiq ďppjq Carried dependences: iÑjPL:ppiq ‰ppjq Remaining dependences in childc: L1

“ tiÑjPL:i,jPFcu

(107)

Polyhedral Compilation Schedules May 30, 2017 25 / 82

Local Validity

Schedule validity:

@iÑjPD:iăS j Consider subset of local dependencesL

At outermost node: L“D Current node

band node with partial schedulef

@iÑjPL:fpiq ďlexfpjq Carried dependences: iÑjPL:fpiq ‰fpjq

ñ no longer need to be considered in nested nodes Remaining dependences: L1

“ tiÑjPL:fpiq “fpjq u sequence node with child position p and filters Fk

@iÑjPL:ppiq ďppjq Carried dependences: iÑjPL:ppiq ‰ppjq Remaining dependences in childc: L1

“ tiÑjPL:i,jPFcu

(108)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i - 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu

Dependences:

tSrisÑTris:1ďi ă100;SrisÑTri`1s:1ďi,i`1ă100u Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

(109)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i - 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences:

tSrisÑTris:1ďi ă100;SrisÑTri`1s:1ďi,i`1ă100u

Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

(110)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i - 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences:

tSrisÑTris:1ďi ă100;SrisÑTri`1s:1ďi,i`1ă100u tSrisÑ ris;TrisÑ ris u

satisfied: tSrisÑTris:1ďi ă100;SrisÑTri `1s:1ďi,i`1ă100u carried: tSrisÑTri`1s:1ďi,i `1ă100u

tSrisu,tTrisu

satisfied: tSrisÑTris:1ďi ă100u carried: tSrisÑTris:1ďi ă100u Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

(111)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i - 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences:

tSrisÑTris:1ďi ă100;SrisÑTri`1s:1ďi,i`1ă100u tSrisÑ ris;TrisÑ ris u

satisfied: tSrisÑTris:1ďi ă100;SrisÑTri `1s:1ďi,i`1ă100u carried: tSrisÑTri`1s:1ďi,i `1ă100u

tSrisu,tTrisu

satisfied: tSrisÑTris:1ďi ă100u carried: tSrisÑTris:1ďi ă100u

Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

(112)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i - 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences:

tSrisÑTris:1ďi ă100;SrisÑTri`1s:1ďi,i`1ă100u Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

(113)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i - 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences:

tSrisÑTris:1ďi ă100;SrisÑTri`1s:1ďi,i`1ă100u Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i - 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu

satisfied: tSrisÑTris:1ďi ă100;SrisÑTri `1s:1ďi,i`1ă100u carried: tSrisÑTris:1ďi ă100;SrisÑTri`1s:1ďi,i`1ă100u

(114)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i - 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences: tSrisÑTris:1ďi ă100;u Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

(115)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i + 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences: tSrisÑTris:1ďi ă100;u Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

(116)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i + 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences:

tSrisÑTris:1ďi ă100;TrisÑSri`1s:1ďi,i`1ă100u

Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

(117)

Polyhedral Compilation Schedules May 30, 2017 26 / 82

Loop Distribution Validity

for ( int i = 1; i < 100; ++ i ) { S : A [ i ] = f ( i );

T : B [ i ] = A [ i ] + A [ i + 1];

}

tSrisÑ ris;TrisÑ ris u

tSrisu,tTrisu Dependences:

tSrisÑTris:1ďi ă100;TrisÑSri`1s:1ďi,i`1ă100u Loop distribution for ( int i = 1; i < 100; ++ i ) A [ i ] = f ( i ); for ( int i = 1; i < 100; ++ i ) B [ i ] = A [ i ] + A [ i + 1]; tSrisu,tTrisu tSrisÑ ris utTrisÑ ris u tSrisu,tTrisu satisfied: tSrisÑTris:1ďi ă100u

References

Related documents

Cible: National Agency, MASTER consortium partners, external evaluator Résultat: Periodical PM Reports, 4 reports prepared during th project progress.. Domaine d'application:

income derived from Philippine sources by the regional operating headquarters when remitted to the parent company shall be subject to the tax on branch profit remittances of

If you have been selected to receive access to Xcel Energy's Online Energy Management Service and you choose to receive the service through a separate agreement with Xcel Energy,

While navigational searches are less likely to lead to the trademark owner’s website, non-navigational searches are more likely to lead to the trademark owner’s website after the

Contrary to Hypothesis 2 (worse adverse selection on observables at higher prices), we find only weak evidence that households that paid a higher price for insurance are

“When you save water, you save our rivers.” The Tucson Conserve2Enhance program connects water conservation to community action by linking participant donations, based on

The strength of this research model is that it enables investigating (1) how IQ in fl uences different types of trust and risk, (2) how managers determine expectations of