If the CFG of a function is irreducible (e.g. if there is a loop that has more than one header), the commonly used technique for many program analysis algorithms is to apply node splitting [Janssen & Corporaal 1997] to transform the CFG into reducible code before the analysis. However, this can result in an exponential blowup of the code size [Carter et al. 2003]. Although irreducible CFGs are rare [Stanier & Watson 2012], this can still be a problem for a specific application.
Our algorithm is able to deal with irreducible control flow without code duplication: During CFG linearization, one of the headers of an irreducible loop has to be chosen as the primary header. As a result, only the mask of the incoming edge of this header is updated in every iteration. Entry masks from the other headers remain untouched: If a join point with one of these headers is executed during a later iteration, the incoming mask might falsely “reactivate” an instance that already left the loop.
In order to handle irreducible control flow directly, we have to ensure that these masks are joined with the loop mask in the first iteration only. This is achieved by performing the blend operations at those join points with a modified mask: In the first iteration, the new active mask is given by a disjunction of the current active mask with the incoming mask from the other header. In all subsequent iterations, it is given by a disjunction with false, which means the current loop mask is not modified.
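The modified blend at such a join point can be sketched in plain C as follows. This is only an illustration under stated assumptions: masks are modeled as arrays of `bool` with a fixed width of 4, and the names `join_secondary_entry`, `active`, `entry`, and `first_iteration` are hypothetical and not taken from the implementation described here.

```c
#include <stdbool.h>

#define WIDTH 4 /* SIMD width, assumed for illustration */

/* Join the entry mask of a non-primary header into the active loop mask.
 * The entry mask contributes in the first iteration only. In all later
 * iterations it is replaced by "false", so an instance that has already
 * left the loop cannot be reactivated. */
void join_secondary_entry(bool active[WIDTH],
                          const bool entry[WIDTH],
                          bool first_iteration)
{
    for (int i = 0; i < WIDTH; ++i) {
        bool incoming = first_iteration ? entry[i] : false;
        active[i] = active[i] || incoming;
    }
}
```

Calling the function with `first_iteration == false` leaves `active` unchanged, which corresponds to the disjunction with false described above.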
7 Dynamic Code Variants
During or after vectorization of a function, additional optimizations that exploit dynamic properties of the function can be applied. Such an optimization duplicates an existing code path, improves the new code by assuming certain properties, and guards the new path by a runtime check that ensures that the assumed properties hold. We call such a guarded path a dynamic variant.
Obviously, introducing such a variant does not always make sense. Several factors influence the effects on performance: First, the dynamic check introduces overhead. Second, the improved code is more efficient than the original code that only used conservative static analyses. Third, the tested property may not always be valid, so the improved code is not always executed. Finally, parameters like the code size and instruction cache may also play a role, depending on the variant. Thus, each optimization presented in this chapter is subject to a heuristic that determines whether it is beneficial for a given code region to apply the transformation.
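One simple way to combine the first three factors is an expected-cost comparison. The sketch below is an assumption made for illustration, not the heuristic used in this work: the struct `VariantEstimate`, its fields, and both function names are hypothetical, the costs are abstract estimates (for example cycle counts), and effects of code size and the instruction cache are ignored.

```c
#include <stdbool.h>

/* Hypothetical cost estimates for one candidate code region. */
typedef struct {
    double cost_original; /* cost of the conservatively vectorized path */
    double cost_variant;  /* cost of the improved path                  */
    double cost_check;    /* cost of the dynamic check                  */
    double p_holds;       /* estimated probability that the tested
                             property holds at runtime, in [0, 1]       */
} VariantEstimate;

/* Expected cost of the region once the guarded variant is introduced:
 * the check is always paid, then one of the two paths is executed. */
double expected_cost_with_variant(const VariantEstimate *e)
{
    return e->cost_check
         + e->p_holds * e->cost_variant
         + (1.0 - e->p_holds) * e->cost_original;
}

/* The variant pays off only if it lowers the expected cost. */
bool variant_is_beneficial(const VariantEstimate *e)
{
    return expected_cost_with_variant(e) < e->cost_original;
}
```

If the property never holds (`p_holds == 0`), the expected cost exceeds the original cost by exactly the cost of the check, which is the slowdown case discussed above.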
The properties that can be exploited all go back to the results of our analyses (Chapter 5). In general, each of the value properties such as uniform, consecutive, or aligned (see Table 5.1) can be tested at runtime. Since most of our analyses influence each other, the validation of a single value property can have a big impact on the quality of the generated code. For example, a value that conservatively has to be expected to be varying during static analysis could be proven to be uniform at runtime. If this value is the condition of a branch, fewer blocks are rewire targets and a smaller part of the CFG has to be linearized, which results in less executed code and fewer mask and blend operations. The proven fact that control flow in this region does not diverge may in turn cause φ-functions to be uniform, which again may influence properties of other instructions. In many cases, a heuristic will have a hard time estimating the probability that a dynamic property holds. So far, there have been no studies that attempted to classify under which circumstances a value is likely to be uniform, consecutive, or to have any other SIMD-execution-related property. An approach based on machine-learning techniques would offer a good starting point for such work, and also for heuristics.
Another possibility could be to provide compiler hints in the form of code annotations. This would allow the programmer to explicitly encode information, e.g. that a value is “often” uniform, hinting that a dynamic variant would be beneficial.
The following sections describe a variety of different dynamic variants ranging from enabling more efficient memory operations to complex transformations that modify the entire vectorization scheme of a region. For the most part, the presented variants have yet to be evaluated thoroughly.
7.1 Uniform Values and Control Flow
Definition 1 in Chapter 5 states that the result of an instruction is uniform if it produces the same result for all instances of the executing SIMD group. The Vectorization Analysis can only prove a static subset of this property.
Consider the scalar code in Listing 7.1. The value x is loaded from the array at runtime. Without additional information, the Vectorization Analysis has to expect x to be different for each instance and thus varying. Because of this, the condition of the if statement is also varying, which forces control-flow to data-flow conversion, as shown in the vectorized function kernelfn4.
Function kernelfn4v shows the code with an introduced variant. The question whether x is uniform at runtime is answered by a comparison of all vector elements. If this holds, the condition of the if statement is also uniform, and the control flow can be retained. The code that is generated closely resembles the original code, up to the point where the scalar value of x0 is broadcast into a vector again.
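The uniformity test and the two guarded paths can also be written as a small, runnable C sketch. It models one SIMD group as an array of 4 floats instead of a vector type; the names `is_uniform` and `kernel_group` are hypothetical, and the per-element conditional in the fallback path merely stands in for the vector blend. Note that in C a chained comparison such as `a == b == c` does not test pairwise equality, so the check is written as a loop.

```c
#include <stdbool.h>

#define WIDTH 4 /* SIMD width, assumed for illustration */

/* Dynamic check: do all elements of the group hold the same value? */
bool is_uniform(const float v[WIDTH])
{
    for (int i = 1; i < WIDTH; ++i) {
        if (v[i] != v[0])
            return false;
    }
    return true;
}

/* Process one group of WIDTH consecutive array elements in place. */
void kernel_group(float x[WIDTH], int c)
{
    if (is_uniform(x)) {
        /* Variant: the branch condition is uniform, so the original
         * scalar control flow is retained. */
        float x0 = x[0];
        if (x0 > c)
            x0 += 1.0f;
        else
            x0 += 2.0f;
        for (int i = 0; i < WIDTH; ++i)
            x[i] = x0; /* broadcast the scalar result */
    } else {
        /* Fallback: both results are selected per element, which
         * mimics control-flow to data-flow conversion with a blend. */
        for (int i = 0; i < WIDTH; ++i)
            x[i] = (x[i] > c) ? x[i] + 1.0f : x[i] + 2.0f;
    }
}
```

Both paths compute the same result for every input; the variant only differs in how much work is needed to get there.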
Obvious choices for locations to introduce such a variant are varying values that have a big impact on the properties of other instructions and blocks. Because the test whether a value is uniform or not is fairly cheap, it is easy to find places where the variant is likely to improve performance. However, the problematic part for a heuristic is to estimate the probability that the value is varying. For example, the input array in Listing 7.1 might never contain the same value in four consecutive elements. In that case, the variant would result in a slowdown due to the added overhead of the dynamic test. Thus, it may often be a better idea to directly test conditions for uniformity instead of values. This is described in Section 7.6.
Listing 7.1 Example of a variant that exploits dynamic uniformity of value x. kernelfn is the original, scalar code. kernelfn4 is a vectorized version. kernelfn4v is vectorized and employs a variant if x is uniform.
    /* Original scalar kernel. */
    __kernel void kernelfn(float* array, int c) {
        int tid = get_global_id(0);
        float x = array[tid];
        if (x > c) x += 1.f;
        else       x += 2.f;
        array[tid] = x;
    }

    /* Vectorized version: the varying condition forces a blend. */
    __kernel void kernelfn4(float* array, int c) {
        int tid = get_global_id(0);
        if (tid % 4 != 0) return;
        float4 x4 = *((float4*)(array + tid));
        int4 c4 = (int4)(c);
        bool4 cond = x4 > c4;
        float4 x1 = x4 + 1.f;
        float4 x2 = x4 + 2.f;
        x4 = blend(cond, x1, x2);
        *((float4*)(array + tid)) = x4;
    }

    /* Vectorized version with a dynamic variant for uniform x. */
    __kernel void kernelfn4v(float* array, int c) {
        int tid = get_global_id(0);
        if (tid % 4 != 0) return;
        float4 x4 = *((float4*)(array + tid));
        if (x4[0] == x4[1] && x4[1] == x4[2] && x4[2] == x4[3]) {
            /* x is uniform: retain the scalar control flow. */
            float x0 = x4[0];
            if (x0 > c) x0 += 1.f;
            else        x0 += 2.f;
            x4 = (float4)(x0); /* broadcast */
        } else {
            int4 c4 = (int4)(c);
            bool4 cond = x4 > c4;
            float4 x1 = x4 + 1.f;
            float4 x2 = x4 + 2.f;
            x4 = blend(cond, x1, x2);
        }
        *((float4*)(array + tid)) = x4;
    }
Listing 7.2 kernelfn4: Conservative WFV requires sequential execution of the store. kernelfn4v: The dynamic variant executes a more efficient vector store if the memory indices are consecutive.

    /* Conservative WFV: the store is executed sequentially. */
    __kernel void kernelfn4(int* array, int c) {
        int tid = get_global_id(0);
        if (tid % 4 != 0) return;
        int4 tid4 = (int4)(tid);
        tid4 += (int4)(0,1,2,3);
        int4 c4 = (int4)(c);
        int4 p = (tid4 / c4) + (tid4 % c4);
        array[p[0]] = p[0];
        array[p[1]] = p[1];
        array[p[2]] = p[2];
        array[p[3]] = p[3];
    }

    /* Dynamic variant: vector store if the indices are consecutive. */
    __kernel void kernelfn4v(int* array, int c) {
        int tid = get_global_id(0);
        if (tid % 4 != 0) return;
        int4 tid4 = (int4)(tid);
        tid4 += (int4)(0,1,2,3);
        int4 c4 = (int4)(c);
        int4 p = (tid4 / c4) + (tid4 % c4);
        int4 px = p - (int4)(0,1,2,3);
        if (px[0] == px[1] && px[1] == px[2] && px[2] == px[3]) {
            /* p is consecutive: a single vector store suffices. */
            *((int4*)(array + p[0])) = p;
        } else {
            array[p[0]] = p[0];
            array[p[1]] = p[1];
            array[p[2]] = p[2];
            array[p[3]] = p[3];
        }
    }
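The consecutiveness test of Listing 7.2 can be reproduced in plain C as follows. This is a minimal sketch under assumptions: the index vector is modeled as an array of 4 ints, `memcpy` stands in for the vector store, the indices are assumed to be valid positions in `array`, and the names `is_consecutive` and `store_indices` are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define WIDTH 4 /* SIMD width, assumed for illustration */

/* Dynamic check: is p equal to <p0, p0+1, p0+2, p0+3>? */
bool is_consecutive(const int p[WIDTH])
{
    for (int i = 1; i < WIDTH; ++i) {
        if (p[i] != p[0] + i)
            return false;
    }
    return true;
}

/* Store each index value at its own index position in array. */
void store_indices(int *array, const int p[WIDTH])
{
    if (is_consecutive(p)) {
        /* Variant: one contiguous store (stand-in for a vector store). */
        memcpy(array + p[0], p, WIDTH * sizeof(int));
    } else {
        /* Fallback: sequential scalar stores (a scatter). */
        for (int i = 0; i < WIDTH; ++i)
            array[p[i]] = p[i];
    }
}
```

Subtracting the vector <0,1,2,3> and testing the result for uniformity, as done in the listing, is equivalent to the element-wise comparison against `p[0] + i` used here.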