Translation to Presburger Arithmetic - Improving Precision with an SMT Solver

5.8 Improving Precision with an SMT Solver

5.8.2 Translation to Presburger Arithmetic

We now switch to a more mathematical notation: the variable $t$ denotes the tid and $a$ denotes the input. For integer division and modulo, we introduce unary functions $\mathrm{div}_k$ and $\mathrm{mod}_k$ for $k \in \mathbb{Z} \setminus \{0\}$, which emphasizes the fact that the divisors and moduli are limited to numbers. For our example term (5.1), we obtain

Listing 5.2 Result of WFV manually applied at source level to the FastWalshTransform kernel of Listing 5.1 (W = 2). Left: Conservative WFV requires sequential execution. Right: WFV with our approach proves consecutivity of the memory addresses for certain values of step, which allows generating a variant with more efficient code.

Left (conservative WFV):

    __kernel void FastWalshTransform(float* a, int step) {
        int tid = get_global_id();
        if (tid % 2 != 0) return;
        int2 tiV = (int2)(tid, tid + 1);
        int2 s = (int2)(step, step);
        int2 g = tiV % s;
        int2 p = 2*s*(tiV / s) + g;
        int2 m = p + s;
        float2 T = (float2)(a[p.x], a[p.y]);
        float2 V = (float2)(a[m.x], a[m.y]);
        float2 X = T + V;
        float2 Y = T - V;
        a[p.x] = X.x;
        a[p.y] = X.y;
        a[m.x] = Y.x;
        a[m.y] = Y.y;
    }

Right (WFV with our approach):

    __kernel void FastWalshTransform(float* a, int step) {
        if (step <= 0 || step % 2 != 0) {
            // Omitted code:
            // Execute original kernel.
            return;
        }
        int tid = get_global_id();
        if (tid % 2 != 0) return;
        int g = tid % step;
        int p = 2*step*(tid / step) + g;
        int m = p + step;
        float2 T = *((float2*)(a + p));
        float2 V = *((float2*)(a + m));
        *((float2*)(a + p)) = T + V;
        *((float2*)(a + m)) = T - V;
    }

At this point, let us give the precise definitions of $\mathrm{mod}_k$ and $\mathrm{div}_k$:

$$x = k \cdot \mathrm{div}_k(x) + \mathrm{mod}_k(x), \quad \text{where } |\mathrm{mod}_k(x)| < |k|. \tag{5.3}$$

It is well known that this definition does not uniquely specify $\mathrm{div}_k(x)$ and $\mathrm{mod}_k(x)$. SMT-LIB Version 2 resolves this issue by adopting the convention that $\mathrm{mod}_k(x) \geq 0$. As long as both $k$ and $x$ are non-negative, common programming languages agree with this convention. However, when negative numbers are involved, OpenCL follows the C99 standard, which in contrast to SMT-LIB requires that $\operatorname{sign}(\mathrm{mod}_k(x)) = \operatorname{sign}(x)$ whenever $\mathrm{mod}_k(x) \neq 0$. In our setting, we observe that the arguments of $\mathrm{mod}_k$ generally are positive expressions involving the tid such that both conventions happen to coincide.
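The difference between the two conventions can be made concrete with a small C sketch. The function names below (`c99_div`, `c99_mod`, `smt_div`, `smt_mod`) are hypothetical helpers introduced only for illustration; the `smt_*` variants are one possible way to obtain a non-negative remainder on top of C's built-in operators, assuming no overflow occurs.

```c
#include <assert.h>
#include <stdlib.h>

/* C99 convention: the quotient truncates toward zero, so a nonzero
   remainder has the same sign as the dividend x. */
long c99_div(long x, long k) { assert(k != 0); return x / k; }
long c99_mod(long x, long k) { assert(k != 0); return x % k; }

/* SMT-LIB convention: the remainder is always in 0 .. |k|-1. */
long smt_mod(long x, long k) {
    assert(k != 0);
    long r = x % k;                  /* C99 remainder, may be negative */
    return r < 0 ? r + labs(k) : r;  /* shift into the non-negative range */
}

/* The quotient is then fixed by x = k * div + mod, as in (5.3). */
long smt_div(long x, long k) {
    assert(k != 0);
    return (x - smt_mod(x, k)) / k;
}
```

For x = -7 and k = 2, the C99 pair is (-3, -1) while the SMT-LIB pair is (-4, 1); both satisfy (5.3). For non-negative x and positive k the two pairs are identical, which is the case relevant here.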

Let us analyze a single memory access with respect to the following consecutivity question: “Do W consecutive work items access consecutive memory addresses when doing this memory access or not?” Using the corresponding term e(t, a), the following equation holds if and only if the work items t and t + 1 access consecutive memory locations for input a:

e(t, a) + 1 = e(t + 1, a).

The following conjunction generalizes this equation to W consecutive work items t, . . . , t + W − 1:

$$\bigwedge_{i=0}^{W-2} e(t + i, a) + 1 = e(t + i + 1, a).$$
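For fixed t, a, and W, this conjunction can be evaluated directly. The following C sketch assumes the address term e(t, a) = 2a · div_a(t) + mod_a(t), which is read off the computation of p in Listing 5.2 with t = tid and a = step; both this choice of term and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed address term, taken from Listing 5.2:
   p = 2*step*(tid / step) + tid % step, with t = tid and a = step. */
long e(long t, long a) {
    assert(t >= 0 && a > 0);
    return 2 * a * (t / a) + t % a;
}

/* True iff the work items t, ..., t+W-1 access consecutive addresses,
   i.e., iff e(t+i, a) + 1 == e(t+i+1, a) holds for i = 0, ..., W-2. */
bool group_is_consecutive(long t, long a, int W) {
    for (int i = 0; i <= W - 2; ++i) {
        if (e(t + i, a) + 1 != e(t + i + 1, a)) {
            return false;
        }
    }
    return true;
}
```

With a = 4 and W = 4, the group starting at t = 0 accesses addresses 0, 1, 2, 3, so the check succeeds. With a = 3, the same group accesses 0, 1, 2, 6, and the check fails at i = 2.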

Recall from the previous section that these groups of W work items naturally start at 0 so that only conjunctions are relevant where t is divisible by W . The following Presburger formula formally adds this constraint:

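$$\varphi(W, a) = \forall t \Bigl( t \geq 0 \land t \equiv_W 0 \longrightarrow \bigwedge_{i=0}^{W-2} e(t + i, a) + 1 = e(t + i + 1, a) \Bigr).$$

For given $W \in \mathbb{N}$ and $\alpha, \beta \in \mathbb{Z}$ with $\alpha \leq \beta - 1$, the answer to our consecutivity question for $W$ and $a \in \{\alpha, \dots, \beta - 1\}$ is given by the set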

$$A_{W,\alpha,\beta} = \{\, a \in \mathbb{Z} \mid \mathbb{Z} \models \varphi(W, a) \land \alpha \leq a < \beta \,\}.$$

We essentially compute $A_{W,\alpha,\beta}$ by at most $(W - 1)(\beta - \alpha)$ many applications of an SMT solver to the $W - 1$ disjuncts of $\lnot\varphi(W, a)$ for $a \in \{\alpha, \dots, \beta - 1\}$, where

$$\lnot\varphi(W, a) = \bigvee_{i=0}^{W-2} \exists t \bigl( t \geq 0 \land t \equiv_W 0 \land e(t + i, a) + 1 \neq e(t + i + 1, a) \bigr).$$

Notice that, when obtaining “sat” for some i ∈ {0, . . . , W − 2}, the remaining problems of the disjunction need not be computed.
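The control flow of this procedure, including the early exit, might look as follows in C. This is only a sketch under stated assumptions: `check_disjunct` is a hypothetical stand-in that replaces the SMT solver call with a bounded search for a witness t, so its UNSAT result merely means that no witness was found up to `t_max` and is not a proof. The address term is again the one assumed from Listing 5.2, and all names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { UNSAT, SAT, UNKNOWN } result_t;  /* hypothetical result type */

/* Assumed address term from Listing 5.2 (t = tid, a = step). */
static long e(long t, long a) {
    assert(t >= 0 && a > 0);
    return 2 * a * (t / a) + t % a;
}

/* Stand-in for one solver call on disjunct i of the negated formula:
   searches t = 0, W, 2W, ... up to t_max for a violation. */
result_t check_disjunct(int W, long a, int i, long t_max) {
    for (long t = 0; t <= t_max; t += W) {
        if (e(t + i, a) + 1 != e(t + i + 1, a)) {
            return SAT;  /* witness found: not consecutive */
        }
    }
    return UNSAT;        /* no witness within the search bound */
}

/* a is accepted only if all W-1 disjuncts are UNSAT; the loop stops at
   the first result that is not UNSAT, skipping the remaining disjuncts. */
bool in_answer_set(int W, long a, long t_max) {
    for (int i = 0; i <= W - 2; ++i) {
        if (check_disjunct(W, a, i, t_max) != UNSAT) {
            return false;
        }
    }
    return true;
}

/* Counts the accepted values a with alpha <= a < beta. */
long count_answer_set(int W, long alpha, long beta, long t_max) {
    long n = 0;
    for (long a = alpha; a < beta; ++a) {
        if (in_answer_set(W, a, t_max)) {
            ++n;
        }
    }
    return n;
}
```

In the worst case, every value of a requires all W − 1 calls, which gives the bound on the number of solver applications stated above. For W = 4, α = 1, and β = 17, the sketch accepts exactly the four multiples of 4 in that range.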

Our answer $A_{W,\alpha,\beta}$ consists of those $a$ for which the SMT solver yields “unsat.” Note that besides “sat” or “unsat,” the solver can also yield “unknown,” which we treat like “sat.” This underapproximation does not affect the correctness of our approach. We only miss optimization opportunities when generating code later on. The same holds for possible timeouts when imposing reasonable time limits on the single solver calls. Later in Section 5.8.3, we are going to discuss how compact representations for $A_{W,\alpha,\beta}$

Table 5.4 FastWalshTransform: Running times of Z3 applied to ¬ϕ(W, a) for e(t, a) as in (5.2). In all three cases, α = 1 and β = 2^16 so that a ∈ {1, . . . , 2^16 − 1}, with a time limit of one minute per call.

W     Sat       Unsat     Unknown   Timeouts   CPU Time
4     16,383    49,152    0         0          4 min
8     8,191     57,344    0         0          5 min
16    4,095     61,128    0         312        334 min

Table 5.5 BitonicSort: Running times of Z3 applied to ¬ϕ(W, a) for e(t, a) as in (5.4). In all three cases, α = 0 and β = 63 so that a ∈ {0, . . . , 62} with a time limit of one minute per call.

W     Sat   Unsat   Unknown   Timeouts   CPU Time
4     61    2       0         0          0.7 s
8     60    3       0         0          1.5 s
16    59    4       0         0          3.7 s

Table 5.4 shows running times and results for the application of Z3 version 4.3.1 [De Moura & Bjørner 2008] to the consecutivity question for our FastWalshTransform kernel. Alternatives to Z3 include CVC4 [Barrett et al. 2011] and MathSAT5 [Cimatti et al. 2013]. These SMT solvers, however, do not directly support $\mathrm{div}_k$ and $\mathrm{mod}_k$, which makes them less interesting for our application here. The numbers shown already include a novel technique called modulo elimination [Karrenberg et al. 2013] that improved running times of Z3.

For another kernel taken from the AMD APP SDK, BitonicSort, the interesting address computation expression is

$$e(t, a) = 2^{a+1} \cdot \mathrm{div}_{2^a}(t) + \mathrm{mod}_{2^a}(t) + 2^a. \tag{5.4}$$

The input parameter a occurs exclusively as an exponent. This restricts the reasonable range of values to consider to {0, . . . , 62} on a 64-bit architecture. Table 5.5 shows the relevant running times.
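The term (5.4) can be evaluated directly in C for exponents in this range. The sketch below is illustrative only: the function names are hypothetical, unsigned 64-bit arithmetic is an assumption made to keep the powers of two representable, and results for large t may wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* e(t, a) = 2^(a+1) * div_{2^a}(t) + mod_{2^a}(t) + 2^a, as in (5.4).
   The exponent is limited to 0..62 so that 2^(a+1) fits into 64 bits. */
uint64_t e_bitonic(uint64_t t, unsigned a) {
    assert(a <= 62);
    uint64_t k = (uint64_t)1 << a;  /* k = 2^a */
    return 2 * k * (t / k) + t % k + k;
}

/* True iff the work items t, ..., t+W-1 access consecutive addresses. */
bool bitonic_group_is_consecutive(uint64_t t, unsigned a, int W) {
    for (int i = 0; i + 1 < W; ++i) {
        if (e_bitonic(t + i, a) + 1 != e_bitonic(t + i + 1, a)) {
            return false;
        }
    }
    return true;
}
```

For W = 4 and a = 2, the group starting at t = 0 accesses addresses 4, 5, 6, 7, so the check succeeds. For a = 1, the same group yields 2, 3, 6, and the check fails between the second and third work item.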

In document Automatic SIMD vectorization of SSA-based control flow graphs (Page 89-92)