In-Depth Analysis - An Execution Model and High-Level-Synthesis System for Generating SIMT Mult

During the evaluation, it was noticed that the basic block model used in other compilers sometimes performs better than the pipelined model of Nymble. A rather high II for the main loops indicates that the source code of these benchmarks is not particularly suitable for pipelined execution. The closer the II is to the pipeline length of the main loops, the less consecutive iterations overlap and the less useful pipelining becomes.

The main cause of a high II is inter-iteration dependencies. A new iteration can only be started after all of its dependencies are fulfilled. Some of the dependencies reported by the compiler's dependency analysis are false dependencies, i.e., dependencies that never actually occur during execution. In general, false dependencies are only generated for memory accesses, because the analysis of memory accesses is not sophisticated enough. Compilers expend a lot of effort on analysing these memory dependencies, but if the analysis cannot prove that a dependency does not exist, the compiler has to conservatively assume that it does. This results in a number of false dependencies, which can have a negative impact on the II.
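The following minimal sketch illustrates why the analysis has to be conservative. The function and variable names are hypothetical and do not come from the benchmarks. Both calls start with the value 2 in the accumulator and add the same three source values, but when the destination pointer points into the source array, each store really does feed a later load, and the result changes.

```c
#include <assert.h>
#include <stddef.h>

/* Adds every element of src to *dst. Inside this function the compiler
 * sees two unrelated pointers and cannot tell whether dst points into src. */
void accumulate(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        /* If dst aliases a later element of src, this store feeds a
         * later load: a real inter-iteration dependency through memory. */
        *dst += src[i];
    }
}

/* Case 1: dst is a separate object, so the store/load dependency
 * between *dst and src[] is a false one. */
int run_disjoint(void)
{
    int src[3] = {1, 2, 3};
    int acc = 2;
    accumulate(&acc, src, 3);
    return acc;
}

/* Case 2: dst points into src, so the dependency is real. */
int run_overlapping(void)
{
    int buf[3] = {1, 2, 3};
    accumulate(&buf[1], buf, 3);
    return buf[1];
}
```

In this sketch, run_disjoint() yields 8 and run_overlapping() yields 9. A compiler that cannot rule out the second case has to schedule the loads and stores of the first case just as strictly.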

To examine where false dependencies occur and how much the II can be improved by preventing them manually, example benchmarks are analysed in the following.

9.8.1 gsm

In gsm, profiling showed that the loop in Figure 9.18 accounts for the largest share of the runtime. In the original version (shown in black and red), the loop has an II of 36 and a pipeline length of 39. A manually optimised version, however, achieves an II of 5 with a pipeline length of 7.
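To put these numbers into perspective, the sketch below uses the common textbook approximation for a pipelined loop: the first iteration needs the full pipeline length, and every further iteration starts II cycles after its predecessor. This is an assumption made for illustration, not the cost model used by Nymble. It ignores stalls and everything outside the loop, so it does not reproduce the totals in Table 9.11. The iteration count of 152 is taken from the loop bounds in Figure 9.18 (i = 8 to 159).

```c
#include <assert.h>

/* Approximate cycles of a pipelined loop: the first iteration takes
 * `length` cycles, each further iteration starts `ii` cycles later. */
unsigned long pipelined_cycles(unsigned long iterations,
                               unsigned long ii,
                               unsigned long length)
{
    assert(ii >= 1);
    if (iterations == 0)
        return 0;
    return (iterations - 1) * ii + length;
}

/* Cycles if the iterations do not overlap at all. */
unsigned long sequential_cycles(unsigned long iterations,
                                unsigned long length)
{
    return iterations * length;
}
```

Under this approximation, 152 iterations with an II of 36 and a length of 39 take 5475 cycles, compared with 5928 cycles without any overlap, so pipelining gains very little. With an II of 5 and a length of 7, the estimate drops to 762 cycles.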

In the original version, the basic alias analysis of LLVM cannot prove that the accesses to L_ACF[k] are independent of each other for different k. This leads to the sequential execution of the STEPs, where a single STEP is scheduled to 4 cycles; nine STEPs of 4 cycles each match the II of 36. To solve this, the handling of the input and output data was modified by manually creating local variables for each of the nine output values (shown in black and blue). In the unmodified version, the output values in L_ACF are read from and written to memory sequentially in each iteration of the main loop. The manual localization replaces this with one read and one write for each value, at the beginning and end of the function, respectively. This modification improves the performance in two respects:

First, the localization removes unnecessary memory accesses. Second, removing the memory accesses from the main loop avoids the problem that the alias analysis conservatively assumes that all values in L_ACF alias each other, which results in the operations in each STEP being chained sequentially instead of being executed in parallel. Thus, the overall runtime is improved by around 50% for Nymble.

Benchmark       Nymble     LegUp
gsm               9080      4763
gsm_mod           4180      4763

Table 9.11: #Clock cycles comparison for gsm

Benchmark              Nymble       Bambu
gemm_blocked        1,106,545     888,907
gemm_blocked_mod      189,045     893,004

Table 9.12: #Clock cycles comparison for gemm_blocked

In the basic block model, however (evaluated using a kernel generated by LegUp), these changes have no impact on the runtime. As can be seen in Table 9.11, the performance of the pipelined model is currently heavily dependent on "compatible" source code.
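The localization shown in Figure 9.18 can be sketched in a self-contained form as follows. This is a simplified illustration, not the gsm source: the typedefs, the function names, and the omission of the scaling and of the final left shift are assumptions made to keep the example short. Both variants compute the same values. The difference is that the second variant touches the output array only once per element, so no memory dependency between the STEPs remains inside the loop.

```c
#include <assert.h>
#include <stddef.h>

#define LEN 160               /* samples s[0..159] */

typedef short word;           /* assumed widths, for illustration only */
typedef long  longword;

/* Original style: every STEP reads and writes acf[k] in memory. */
void autocorr_in_memory(const word *s, longword *acf)
{
    assert(s != NULL && acf != NULL);
    for (int k = 0; k <= 8; k++)
        acf[k] = 0;
#define STEP(k) acf[k] += (longword)s[i] * s[i - (k)]
    for (int i = 8; i < LEN; i++) {
        STEP(0); STEP(1); STEP(2); STEP(3); STEP(4);
        STEP(5); STEP(6); STEP(7); STEP(8);
    }
#undef STEP
}

/* Localized style: accumulate in nine local scalars and write each
 * output value back exactly once after the loop. */
void autocorr_localized(const word *s, longword *acf)
{
    assert(s != NULL && acf != NULL);
    longword acf0 = 0, acf1 = 0, acf2 = 0, acf3 = 0, acf4 = 0,
             acf5 = 0, acf6 = 0, acf7 = 0, acf8 = 0;
#define STEP(k) acf##k += (longword)s[i] * s[i - (k)]
    for (int i = 8; i < LEN; i++) {
        STEP(0); STEP(1); STEP(2); STEP(3); STEP(4);
        STEP(5); STEP(6); STEP(7); STEP(8);
    }
#undef STEP
    acf[0] = acf0; acf[1] = acf1; acf[2] = acf2;
    acf[3] = acf3; acf[4] = acf4; acf[5] = acf5;
    acf[6] = acf6; acf[7] = acf7; acf[8] = acf8;
}
```

Because the local scalars are distinct variables, no alias analysis is needed to see that the nine accumulations are independent of each other.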

9.8.2 gemm_blocked

Similarly, gemm_blocked of MachSuite also has a problem with false inter-iteration dependencies. MachSuite uses a single structure to pass the test data to the test-specific run_benchmark function (shown in Figure 9.19a). As this structure contains both the input and the output data, the alias analysis assumes that the whole structure can be accessed (and written) by all memory operations. However, gemm_blocked only writes to the array prod, so most of these dependencies are false dependencies.

To make it clear to the analysis that all arrays are independent, the function was modified by creating a working copy of each array (shown in Figure 9.19b). This reduces the II of the main loop (shown in Figure 9.19c) from 32 to 5 and the pipeline length from 35 to 7. As a result, the runtime of the Nymble-generated kernel drops to around 17% of that of the unmodified version (see Table 9.12). In the basic block model, here represented by Bambu, the runtime does not change (the small difference is caused by Bambu specifics and was not analysed further). Note that the absolute runtime values should not be compared between Nymble and Bambu, as the runtime for Nymble was measured with an unlimited number of single-cycle memory accesses (see Table 9.10 and Section 9.7).

void Autocorrelation(
    word     *s,       /* [0..159] IN/OUT */
    longword *L_ACF)   /* [0..8]   OUT    */
/*
 * The goal is to compute the array L_ACF[k]. The signal s[i] must
 * be scaled in order to avoid an overflow situation.
 */
{
    register int k, i;

    /* modified (blue) */
    longword L_ACF0, L_ACF1, L_ACF2, L_ACF3, L_ACF4,
             L_ACF5, L_ACF6, L_ACF7, L_ACF8;   /* Temporary registers for L_ACF */

    word temp; word smax; word scalauto, n; word *sp; word sl;

    for (k = 8; k >= 0; k--) L_ACF[k] = 0;                     /* original (red) */
    L_ACF0 = L_ACF1 = L_ACF2 = L_ACF3 = L_ACF4
           = L_ACF5 = L_ACF6 = L_ACF7 = L_ACF8 = 0;            /* modified (blue) */

#define STEP(k) L_ACF[k] += ((longword)sl * sp[-(k)]);         /* original (red) */
#define STEP(k) L_ACF##k += ((longword)sl * sp[-(k)]);         /* modified (blue) */
#define NEXTI   sl = *++sp

    ...

    for (i = 8; i <= 159; i++) {
        NEXTI;
        STEP(0); STEP(1); STEP(2); STEP(3); STEP(4);
        STEP(5); STEP(6); STEP(7); STEP(8);
    }

    for (k = 8; k >= 0; k--) L_ACF[k] <<= 1;                   /* original (red) */
    L_ACF[0] = L_ACF0 << 1; ... L_ACF[8] = L_ACF8 << 1;        /* modified (blue) */

Figure 9.18: Longest runtime loop in gsm

void run_benchmark(void *vargs) {
    struct bench_args_t *args = (struct bench_args_t *)vargs;
    bbgemm(args->m1, args->m2, args->prod);
}

(a) Original harness

void run_benchmark(void *vargs) {
    struct bench_args_t *args = (struct bench_args_t *)vargs;
    int i;
    TYPE *m1, *m2, *prod;
    m1 = malloc(N * sizeof(TYPE));
    m2 = malloc(N * sizeof(TYPE));
    prod = malloc(N * sizeof(TYPE));
    for (i = 0; i < N; i++) {
        m1[i] = args->m1[i];
        m2[i] = args->m2[i];
        prod[i] = args->prod[i];
    }
#pragma HARDWARE on
    bbgemm(m1, m2, prod);
#pragma HARDWARE off
    for (i = 0; i < N; i++) {
        args->prod[i] = prod[i];
    }
}

(b) Modified harness

void bbgemm(TYPE m1[N], TYPE m2[N], TYPE prod[N]) {
    int i, k, j, jj, kk, temp_x;
    int i_row, k_row;
    TYPE mul;
    loopjj: for (jj = 0; jj < row_size; jj += block_size)
    loopkk: for (kk = 0; kk < row_size; kk += block_size)
    loopi: for (i = 0; i < row_size; ++i)
    loopk: for (k = 0; k < block_size; ++k) {
        i_row = i * row_size;
        k_row = (k + kk) * row_size;
        temp_x = m1[i_row + k + kk];
        loopj: for (j = 0; j < block_size; ++j) {
            mul = temp_x * m2[k_row + j + jj];
            prod[i_row + j + jj] += mul;
        }
    }
}

(c) Loop

Figure 9.19: MachSuite harness code specific for gemm_blocked

                    #Clock cycles
                    Nymble ST (max. # parallel mem. accesses)
Benchmark                   1          2          4          ∞      Bambu
gemm_ncubed         1,606,208  1,344,064  1,344,064  1,344,064    532,547
gemm_ncubed_mod       696,896    434,752    303,680    205,376    307,268

(a) Single-threaded (single-cycle memory accesses)

Benchmark           #Clock cycles
gemm_ncubed             3,213,923
gemm_ncubed_mod         2,078,725

(b) Multi-threaded (cached memory accesses)

Table 9.13: #Clock cycles comparison for gemm_ncubed

9.8.3 gemm_ncubed

For gemm_ncubed, the alias analysis does not create any false dependencies with a negative impact on the runtime. However, it is still possible to optimise the runtime of this benchmark, especially considering that it will be executed with a pipelined model.

By examining the source code, it was noticed that the pipeline executes only two memory accesses, a multiplication, and an addition in each iteration. Additionally, the innermost loop only executes eight iterations. As the presented pipelined model is not well optimised for very short pipelines, a method to increase the number of operations per iteration was sought.

The simple solution is to unroll the inner loop eight times. The new innermost loop now executes 16 memory accesses per iteration, which utilizes the pipeline more efficiently. Both the original and the unrolled version can be seen in Figure 9.20. The modifications are shown in red and blue for removed and newly added code, respectively. Comparing the single-threaded runtimes (see Table 9.13a), it can easily be seen that unrolling is very beneficial for the pipelined model. While it also improves the runtime of the basic block model of Bambu, it has a much bigger impact for Nymble. The table shows the runtime for different numbers of parallel single-cycle memory accesses. Even with only a single memory access per cycle, the unrolled version performs much better. The results for the unmodified version also show that it does not profit from more than two parallel memory accesses. The improvement of the runtime can also be seen in the multi-threaded model (see Table 9.13b).

/* added (blue) */
#define STEP(x) k_col##x = (k + x) * col_size; \
                mult##x = m1[i_col + k + x] * m2[k_col##x + j];

void gemm(TYPE m1[row_size * col_size],
          TYPE m2[row_size * col_size],
          TYPE prod[row_size * col_size]) {
    int i, j, k;
    TYPE mult, k_col, i_col;
    int k_col0, k_col1, k_col2, k_col3, k_col4, k_col5, k_col6, k_col7;   /* added (blue) */
    TYPE mult0, mult1, mult2, mult3, mult4, mult5, mult6, mult7;          /* added (blue) */

    mult = 0; k_col = 0; i_col = 0;
    outter: for (i = 0; i < row_size; i++) {
        middle: for (j = 0; j < col_size; j++) {
            i_col = i * col_size;
            TYPE sum = prod[i_col + j];
            inner: for (k = 0; k < row_size; k += 8) {                    /* added (blue) */
            inner: for (k = 0; k < row_size; k++) {                       /* removed (red) */
                STEP(0); STEP(1); STEP(2); STEP(3);                       /* added (blue) */
                STEP(4); STEP(5); STEP(6); STEP(7);                       /* added (blue) */
                sum += mult0 + mult1 + mult2 + mult3
                     + mult4 + mult5 + mult6 + mult7;                     /* added (blue) */
                k_col = k * col_size;                                     /* removed (red) */
                mult = m1[i_col + k] * m2[k_col + j];                     /* removed (red) */
                sum += mult;                                              /* removed (red) */
            }
            prod[i_col + j] = sum;
        }
    }
}

Figure 9.20: Main loops of gemm_ncubed
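Since Figure 9.20 overlays both versions, the sketch below separates them into two compilable functions. It is an illustration under stated assumptions, not the MachSuite source: the matrix size of 16x16, the function names, and the use of an integer TYPE are chosen here so that both variants produce bit-identical results. With a floating-point TYPE, summing eight products before adding them to sum changes the order of the additions and may change the rounding.

```c
#include <assert.h>
#include <stddef.h>

#define ROW_SIZE 16           /* assumed size, must be a multiple of 8 */
#define COL_SIZE 16
typedef int TYPE;             /* integer, so regrouping the sum is exact */

/* Original inner loop: one multiply-add per iteration. */
void gemm_rolled(const TYPE *m1, const TYPE *m2, TYPE *prod)
{
    assert(m1 != NULL && m2 != NULL && prod != NULL);
    for (int i = 0; i < ROW_SIZE; i++) {
        for (int j = 0; j < COL_SIZE; j++) {
            int i_col = i * COL_SIZE;
            TYPE sum = prod[i_col + j];
            for (int k = 0; k < ROW_SIZE; k++)
                sum += m1[i_col + k] * m2[k * COL_SIZE + j];
            prod[i_col + j] = sum;
        }
    }
}

/* Inner loop unrolled eight times: eight independent products
 * (16 loads) per iteration, summed once at the end. */
void gemm_unrolled8(const TYPE *m1, const TYPE *m2, TYPE *prod)
{
    assert(m1 != NULL && m2 != NULL && prod != NULL);
    assert(ROW_SIZE % 8 == 0);
#define STEP(x) TYPE mult##x = m1[i_col + k + x] * m2[(k + x) * COL_SIZE + j]
    for (int i = 0; i < ROW_SIZE; i++) {
        for (int j = 0; j < COL_SIZE; j++) {
            int i_col = i * COL_SIZE;
            TYPE sum = prod[i_col + j];
            for (int k = 0; k < ROW_SIZE; k += 8) {
                STEP(0); STEP(1); STEP(2); STEP(3);
                STEP(4); STEP(5); STEP(6); STEP(7);
                sum += mult0 + mult1 + mult2 + mult3
                     + mult4 + mult5 + mult6 + mult7;
            }
            prod[i_col + j] = sum;
        }
    }
#undef STEP
}
```

The eight products in one unrolled iteration do not depend on each other, which gives the scheduler more independent operations to place into each pipeline iteration.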

9.8.4 spmv_ellpack

In spmv_ellpack (see Figure 9.21), the innermost loop is unrolled completely by the compiler because it only has ten iterations. The resulting expression tree for sum, however, is actually a sequential chain in which each product is accumulated in turn. The compiler cannot rebalance this chain into a real tree, because the benchmark uses floating-point values. IEEE 754 compatible floating-point operations are not associative, which means that their order cannot be changed without possibly changing the result.

However, if some error in these floating-point operations is acceptable, the runtime of spmv_ellpack can be significantly reduced by creating an optimised expression tree for the unrolled loop.

The expression tree of the manually unrolled loop (shown in blue) was reduced to a height of 3 instead of 9. Note that the initial read sum = out[i] was also removed, because the out array is initialized to zero anyway.
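Why the compiler may not perform this regrouping on its own can be shown with a small sketch. The function names and the test values are hypothetical. The grouping in sum_grouped follows the expression used for out[i] in Figure 9.21, while sum_chain accumulates the ten values one after another, as the original loop does.

```c
#include <assert.h>
#include <stddef.h>

#define L 10

/* Sequential accumulation, as in the original inner loop. */
double sum_chain(const double v[L])
{
    assert(v != NULL);
    double sum = 0.0;
    for (int j = 0; j < L; j++)
        sum += v[j];
    return sum;
}

/* Pairwise grouping, as in the manually unrolled version. */
double sum_grouped(const double v[L])
{
    assert(v != NULL);
    return ((v[0] + v[1]) + (v[2] + v[3]))
         + ((v[4] + v[5]) + (v[6] + v[7]))
         + (v[8] + v[9]);
}
```

For the values 1 to 10, both functions return 55. For the input {1e30, 1.0, -1e30, 1.0, 0, ...}, the chain returns 1.0 and the grouped version returns 0.0, while the exact sum is 2.0. Neither order is more correct in general, but the results differ, which is why the regrouping is only valid if such deviations are acceptable.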

This change reduces the runtime to about 0.5x that of the unmodified version, as can be seen in Table 9.14. The table shows the runtimes of both single-threaded variants, which are also listed in Table 9.10.

Benchmark            #Clock cycles
spmv_ellpack                51,936
spmv_ellpack_mod            25,460

Table 9.14: #Clock cycles comparison for spmv_ellpack

#define TYPE double
#define N 494
#define L 10
#define STEP(x) Si##x = nzval[x + i*L] * vec[cols[x + i*L]]      /* modified (blue) */

void ellpack(TYPE nzval[N*L], int cols[N*L], TYPE vec[N], TYPE out[N])
{
    int i, j;
    TYPE Si;

    ellpack_1: for (i = 0; i < N; i++) {
        TYPE sum = out[i];                                       /* original */
        ellpack_2: for (j = 0; j < L; j++) {                     /* original */
            Si = nzval[j + i*L] * vec[cols[j + i*L]];            /* original */
            sum += Si;                                           /* original */
        }                                                        /* original */
        out[i] = sum;                                            /* original */

        /* modified (blue) */
        STEP(0); STEP(1); STEP(2); STEP(3); STEP(4);
        STEP(5); STEP(6); STEP(7); STEP(8); STEP(9);
        out[i] = ((Si0 + Si1) + (Si2 + Si3))
               + ((Si4 + Si5) + (Si6 + Si7))
               + (Si8 + Si9);
    }
}

Figure 9.21: Main loops of spmv_ellpack

10 Future Work

Before the conclusion of this work, this chapter presents some ideas that can improve all execution models. First, the efficiency of the accelerator will be improved by building on the ideas of the basic block model emulation (see Chapter 5). After that, a method to handle critical regions will be shown.

10.1 Mixed Execution Models

To further improve the efficiency of the generated data-paths, the compiler will be enhanced to use a different execution model for each nested loop. It was shown that it is possible to mix pipelined and FSM-based loop data-paths in the hardware kernel. The next step is to integrate single-threaded statically and dynamically scheduled data-paths into the overall model of nested loops. The compiler will then be able to select the appropriate model for each part of an application.

Based on the extraction of unbalanced pipeline elements, it should be possible to execute such extracted elements as a simple pipelined pseudo MCO using the statically scheduled execution model. As an example, Figure 10.1 shows a kernel consisting of three nested loops that is generated with three different execution models. The outer loop is implemented with the multi-threaded FSM model because the different paths through the loop are unbalanced. The FSM model is more runtime efficient for such a loop. The middle loop is implemented as a multi-threaded pipelined loop. Because of its balanced nature, it can be pipelined efficiently. Finally, the inner loop is implemented as a statically scheduled single-threaded loop. Because it contains no VLO and has a fixed number of iterations, the features of the other, more complex execution models are not required. Instead of unrolling this loop, the compiler decided to treat it as a nested loop. As the number of iterations is equal for all loop executions, the multi-threading is implemented by interleaving the threads with the basic C-Slow method.
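The selection criteria named in this example can be written down as a small decision function. This is a speculative sketch of the proposed future work, not an existing part of the compiler. The type names, the field names, and in particular the order in which the criteria are checked are assumptions made for illustration. The text only states the reason given for each of the three loops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three execution models in the example. */
typedef enum {
    MODEL_FSM_MULTI_THREADED,
    MODEL_PIPELINED_MULTI_THREADED,
    MODEL_STATIC_SINGLE_THREADED
} exec_model;

/* Hypothetical summary of the loop properties mentioned in the text. */
typedef struct {
    bool balanced_paths;    /* all paths through the loop are balanced */
    bool contains_vlo;      /* loop body contains a VLO */
    bool fixed_trip_count;  /* same number of iterations on every execution */
} loop_info;

/* Assumed precedence: the simplest model is chosen first if it suffices. */
exec_model select_model(const loop_info *loop)
{
    assert(loop != NULL);
    if (!loop->contains_vlo && loop->fixed_trip_count)
        return MODEL_STATIC_SINGLE_THREADED;
    if (loop->balanced_paths)
        return MODEL_PIPELINED_MULTI_THREADED;
    return MODEL_FSM_MULTI_THREADED;
}
```

With this ordering, a loop that needs none of the dynamic features is mapped to the statically scheduled model, a balanced loop to the pipelined model, and everything else falls back to the FSM model.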
