Previous Work

2.2.2. Error-Free Transformations

The name error-free transformation refers to algorithms that efficiently transform expressions of floating-point numbers into mathematically equivalent expressions [77]. They form the basic toolkit for exact floating-point algorithms or algorithms with increased precision. All error-free transformations presented here require rounding to nearest.

Addition and Subtraction. There are two algorithms, TwoSum and FastTwoSum, which transform a sum of two floating-point numbers into a new sum. This is done by recovering the exact error term arising in a floating-point addition.

Algorithm 2.7 (FastTwoSum).

Let a, b ∈ F with |a| ≥ |b| or a = 0. Compute (x, y) ← FastTwoSum(a, b). If x ∈ F, then x = a ⊕ b and a + b = x + y.

Algorithm 2.8 (TwoSum).

Let a, b ∈ F and compute (x, y) ← TwoSum(a, b). If x ∈ F, then x = a ⊕ b and a + b = x + y.

procedure FastTwoSum(a, b)
    x ← a ⊕ b
    b_v ← x ⊖ a
    y ← b ⊖ b_v
    return (x, y)

procedure TwoSum(a, b)
    x ← a ⊕ b
    b_v ← x ⊖ a
    b_r ← b ⊖ b_v
    a_v ← x ⊖ b_v
    a_r ← a ⊖ a_v
    y ← a_r ⊕ b_r
    return (x, y)
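The two procedures can be sketched in Python as follows. This is a minimal sketch, assuming that Python's float is an IEEE 754 binary64 number and that the platform rounds to nearest; the names fast_two_sum, two_sum and is_exact are illustrative and do not come from the text. The helper compares both sides of a + b = x + y in exact rational arithmetic.

```python
from fractions import Fraction


def fast_two_sum(a, b):
    """FastTwoSum: requires |a| >= |b| or a == 0, and no overflow in x."""
    x = a + b        # x = a (+) b, the rounded sum
    b_v = x - a      # the part of b that actually entered x
    y = b - b_v      # the part of b that was rounded away
    return x, y


def two_sum(a, b):
    """TwoSum: no ordering requirement on a and b."""
    x = a + b
    b_v = x - a      # virtual b
    b_r = b - b_v    # round-off of b
    a_v = x - b_v    # virtual a
    a_r = a - a_v    # round-off of a
    y = a_r + b_r
    return x, y


def is_exact(a, b, x, y):
    """Check a + b == x + y over the rationals (finite inputs only)."""
    return Fraction(a) + Fraction(b) == Fraction(x) + Fraction(y)


x, y = two_sum(0.1, 0.7)
print(x, y, is_exact(0.1, 0.7, x, y))
```

Fraction converts a finite float to its exact rational value, so is_exact tests the identity itself rather than a rounded version of it.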

Both algorithms compute the same result, though FastTwoSum requires the input summands to be ordered by magnitude. Often it is known a priori which summand will be the larger one. If this is not known, one may use TwoSum, or one may swap the summands if necessary. On contemporary CPUs, using TwoSum is then usually faster, since checking and swapping summands involves a branch, which may significantly slow down the computation.
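The role of the ordering requirement can be seen in a small experiment. The sketch below makes the same assumptions as before (binary64 floats, rounding to nearest); fast_two_sum_sorted is a hypothetical helper that performs the swap mentioned above. It shows correctness only and says nothing about the relative speed of the variants, which depends on compiled code and hardware.

```python
def fast_two_sum(a, b):
    x = a + b
    b_v = x - a
    y = b - b_v
    return x, y


def two_sum(a, b):
    x = a + b
    b_v = x - a
    b_r = b - b_v
    a_v = x - b_v
    a_r = a - a_v
    y = a_r + b_r
    return x, y


def fast_two_sum_sorted(a, b):
    """Hypothetical wrapper: swap so that |a| >= |b|, then call FastTwoSum."""
    if abs(a) < abs(b):   # this is the branch discussed in the text
        a, b = b, a
    return fast_two_sum(a, b)


small, large = 1.0, 1e100
wrong = fast_two_sum(small, large)           # precondition violated
right = two_sum(small, large)                # no precondition
swapped = fast_two_sum_sorted(small, large)  # precondition restored
print(wrong, right, swapped)
```

With the summands in the wrong order, FastTwoSum returns the error term 0 although the rounded sum has lost the summand 1 entirely; TwoSum and the swapping variant both recover it.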

Note that the computed error term y equals the error term δ in Equation (2.15) and we have |y| ≤ ε_m msb(x). Quite remarkably, we can compute the error term using only two or five additional, ordinary floating-point operations. The only exception is the case when overflow occurs in the computation of x. TwoSum and FastTwoSum are safe from overflow if |x| ≤ 2^τ (1 − ε_m) or, equivalently, x ∈ F.
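Both statements can be checked numerically. The sketch below assumes binary64 with ε_m = 2^−53, which is our reading of ε_m for p = 53 and not a value given in this section; msb is an illustrative helper built on math.frexp. It tests the bound on random inputs and then provokes an overflow in x.

```python
import math
import random
import sys

EPS_M = 2.0 ** -53   # assumption: eps_m = 2^-p with p = 53 (binary64)


def two_sum(a, b):
    x = a + b
    b_v = x - a
    b_r = b - b_v
    a_v = x - b_v
    a_r = a - a_v
    y = a_r + b_r
    return x, y


def msb(x):
    """Value of the most significant bit of |x|, and 0 for x == 0."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(abs(x))      # abs(x) = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 1)


def bound_holds(a, b):
    x, y = two_sum(a, b)
    return abs(y) <= EPS_M * msb(x)


rng = random.Random(0)             # fixed seed for reproducibility
all_ok = all(
    bound_holds(rng.uniform(-1e6, 1e6), rng.uniform(-1e-3, 1e-3))
    for _ in range(1000)
)

big = sys.float_info.max
x_over, y_over = two_sum(big, big)  # x overflows, so x is not in F
print(all_ok, x_over, y_over)
```

Once x overflows to infinity, the subsequent subtractions produce NaN, so the returned y carries no information about the true error.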

A proof for the exactness of the error term in FastTwoSum was first given by Theodorus Dekker [20], though the algorithm was in use before, see e.g. [47]. Using Sterbenz' lemma, one can show that either x = a + b or b_v = x − a. In the first case b_v = b and y = 0. In the second case y = b ⊖ b_v = fl(b − x + a). The proof finishes by observing that a + b − (a ⊕ b) ∈ F and hence y = b − x + a. TwoSum basically applies FastTwoSum twice, reversing the roles of a and b. A proof for the exactness of the error term computed by TwoSum is due to Donald Knuth [52].

Both FastTwoSum and TwoSum are in danger of being invalidated by the programming environment. For example, in FastTwoSum one might conclude that

y = b − b_v = b − (x − a) = b − (a + b − a) = 0

using the semantics of real arithmetic. An optimizer that applies such algebraic simplifications would discard the error term altogether.
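The difference between the two semantics can be made concrete by running the same three statements once on floats and once on exact rationals. This is a sketch under the binary64 assumption; fractions.Fraction stands in for real arithmetic here, and the inputs are chosen so that b is lost completely in the floating-point sum.

```python
from fractions import Fraction


def fast_two_sum(a, b):
    # Works on any type supporting + and -, floats and Fractions alike.
    x = a + b
    b_v = x - a
    y = b - b_v
    return x, y


a, b = 1.0, 2.0 ** -60

# Floating-point semantics: x is rounded, y recovers the rounding error.
x_fp, y_fp = fast_two_sum(a, b)

# Real (here: exact rational) semantics: nothing is rounded, so y is 0.
x_re, y_re = fast_two_sum(Fraction(a), Fraction(b))

print(x_fp, y_fp)
print(x_re, y_re)
```

In exact arithmetic the error term is always zero, which is precisely the simplification that must not be applied to the floating-point code.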

Multiplication. Similar to TwoSum and FastTwoSum, TwoProduct transforms the product of two floating-point numbers into a sum of two floating-point numbers. A basic sub-step in the TwoProduct algorithm given below is to split the operands into two numbers which can be multiplied exactly.

Algorithm 2.9 (split).

Let a ∈ F and compute (a_hi, a_lo) ← split(a). Assume that no overflow occurs. Then a = a_hi + a_lo, |a_hi| ≥ |a_lo| and both can be represented using a mantissa of at most ⌊p/2⌋ bits.

Algorithm 2.10 (TwoProduct).

Let a, b ∈ F and compute (x, y) ← TwoProduct(a, b). If neither overflow nor underflow occurs, then x = a ⊗ b and ab = x + y.

procedure split(a)
    c ← (2^⌈p/2⌉ + 1) ⊗ a
    a_big ← c ⊖ a
    a_hi ← c ⊖ a_big
    a_lo ← a ⊖ a_hi
    return (a_hi, a_lo)

procedure TwoProduct(a, b)
    x ← a ⊗ b
    (a_hi, a_lo) ← split(a)
    (b_hi, b_lo) ← split(b)
    e_1 ← x ⊖ (a_hi ⊗ b_hi)
    e_2 ← e_1 ⊖ (a_lo ⊗ b_hi)
    e_3 ← e_2 ⊖ (a_hi ⊗ b_lo)
    y ← (a_lo ⊗ b_lo) ⊖ e_3
    return (x, y)
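A Python sketch of split and TwoProduct follows. It assumes binary64 floats with p = 53 and rounding to nearest, so that the splitting constant is 2^27 + 1; the names P, SPLITTER, split and two_product are illustrative. The example input is chosen so that the exact product needs more than 53 bits.

```python
from fractions import Fraction

P = 53                                       # assumed precision (binary64)
SPLITTER = float(2 ** ((P + 1) // 2) + 1)    # 2^ceil(p/2) + 1 = 134217729.0


def split(a):
    """Split a into a_hi + a_lo; assumes no overflow in SPLITTER * a."""
    c = SPLITTER * a
    a_big = c - a
    a_hi = c - a_big
    a_lo = a - a_hi
    return a_hi, a_lo


def two_product(a, b):
    """TwoProduct; assumes neither overflow nor underflow occurs."""
    x = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e1 = x - a_hi * b_hi      # each partial product is exact
    e2 = e1 - a_lo * b_hi
    e3 = e2 - a_hi * b_lo
    y = a_lo * b_lo - e3
    return x, y


a = 1.0 + 2.0 ** -30          # a*a = 1 + 2^-29 + 2^-60 needs 61 bits
x, y = two_product(a, a)
print(x, y, Fraction(a) * Fraction(a) == Fraction(x) + Fraction(y))
```

Here x is the rounded product 1 + 2^−29 and y is the bit 2^−60 that did not fit into x.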

The split subroutine is due to Dekker [20], who attributes TwoProduct to G. W. Veltkamp. In case p is odd, it may seem impossible to represent a p-bit number as the sum of two ⌊p/2⌋-bit numbers. The missing bit is, however, hidden in the sign of a_lo. Therefore, the products a_hi ⊗ b_hi etc. in TwoProduct are computed without rounding error. The subsequent subtractions in TwoProduct are exact, too.
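The sign trick is easiest to see on a number whose 53 mantissa bits are all set. The following sketch assumes binary64 with p = 53, so ⌊p/2⌋ = 26; significant_bits is an illustrative helper that counts the bits between the leading and the trailing one of a float.

```python
import random

SPLITTER = float(2 ** 27 + 1)    # 2^ceil(p/2) + 1 for p = 53 (assumed)


def split(a):
    c = SPLITTER * a
    a_big = c - a
    a_hi = c - a_big
    a_lo = a - a_hi
    return a_hi, a_lo


def significant_bits(f):
    """Number of mantissa bits needed to represent the finite float f."""
    n, _ = abs(f).as_integer_ratio()
    if n == 0:
        return 0
    n //= n & -n                 # strip trailing zero bits
    return n.bit_length()


a = float(2 ** 53 - 1)           # all 53 mantissa bits are set
a_hi, a_lo = split(a)
print(a_hi, a_lo, significant_bits(a), significant_bits(a_hi),
      significant_bits(a_lo))

rng = random.Random(1)           # fixed seed for reproducibility
all_fit = True
for _ in range(1000):
    v = rng.uniform(-1e10, 1e10)
    hi, lo = split(v)
    all_fit = all_fit and significant_bits(hi) <= 26 \
        and significant_bits(lo) <= 26 and hi + lo == v
```

For this input the high part is rounded up to 2^53 and the low part becomes −1, so the 53 bits are carried by two very short mantissas plus the sign of a_lo.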

TwoProduct needs 16 additional operations to compute the error term y. Again, y equals the error term δ in Equation (2.13) and we have |y| ≤ ε_m msb(x). Both overflow and underflow may affect TwoProduct. The splitting step involves the constant 2^⌈p/2⌉ + 1, which is multiplied with both a and b. If overflow occurs anywhere, it occurs in the computation of c or x, too. Hence, TwoProduct is safe from overflow if

max{|a|, 2^⌈p/2⌉ + 1} ⊗ max{|b|, 2^⌈p/2⌉ + 1} ≤ 2^τ (1 − ε_m).

No underflow can occur in split, since 2^⌈p/2⌉ + 1 ∈ Z implies lsb((2^⌈p/2⌉ + 1)a) ≥ lsb(a). But it may occur in any multiplication in TwoProduct itself. The exact product ab may have up to 2p bits. Hence, TwoProduct is safe from underflow if

ab = 0 or |a||b| > ½ ε_m^(−2) η.
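Both failure modes are easy to provoke. The sketch below assumes binary64 and reads the overflow condition as "the rounded product of the two maxima is finite", which is our interpretation of the bound 2^τ (1 − ε_m) as the largest finite float; safe_from_overflow is a hypothetical helper name.

```python
import math
from fractions import Fraction

SPLITTER = float(2 ** 27 + 1)    # 2^ceil(p/2) + 1 for p = 53 (assumed)


def split(a):
    c = SPLITTER * a
    a_big = c - a
    a_hi = c - a_big
    a_lo = a - a_hi
    return a_hi, a_lo


def two_product(a, b):
    x = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e1 = x - a_hi * b_hi
    e2 = e1 - a_lo * b_hi
    e3 = e2 - a_hi * b_lo
    y = a_lo * b_lo - e3
    return x, y


def safe_from_overflow(a, b):
    """Sufficient overflow condition, read as 'rounded product is finite'."""
    return math.isfinite(max(abs(a), SPLITTER) * max(abs(b), SPLITTER))


# Overflow: a * b is finite, but SPLITTER * a overflows inside split.
x_o, y_o = two_product(1e305, 1.0)

# Underflow: the product and all partial products are flushed to zero.
x_u, y_u = two_product(1e-200, 1e-200)
exact_u = Fraction(1e-200) * Fraction(1e-200)

print(safe_from_overflow(1e305, 1.0), x_o, y_o)
print(x_u, y_u, exact_u != 0)
```

In the first case x is still the correct rounded product while y is NaN; in the second case both x and y are zero although the exact product is not.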

The newer IEEE 754-2008 standard mandates the availability of a fused multiply-add instruction fma(a, b, c), rounding ab + c in one step to a floating-point number. Using fma, TwoProduct can be implemented as

procedure TwoProduct(a, b)
    x ← a ⊗ b
    y ← fma(a, b, −x)
    return (x, y)

Although a fused multiply-add operation might be more costly than a standard binary floating-point operation, this TwoProduct implementation can be expected to be more efficient. Furthermore, it avoids the problem of overflow in split and hence increases the range of validity for TwoProduct. A TwoProduct based on fma is safe from overflow if |a| ⊗ |b| ≤ 2^τ (1 − ε_m).
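The fma variant can be sketched as follows, again assuming binary64. Python provides math.fma only from version 3.13 on, so the wrapper below falls back to an emulation for finite inputs: the exact rational value of ab + c is formed with fractions.Fraction and rounded once by float(). The wrapper and the name two_product_fma are illustrative, and the emulation says nothing about the cost of a hardware fma.

```python
import math
from fractions import Fraction


def fma(a, b, c):
    """a*b + c with a single rounding (emulated if math.fma is missing)."""
    if hasattr(math, "fma"):               # Python 3.13 and later
        return math.fma(a, b, c)
    # Emulation for finite inputs: exact rational result, rounded once.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def two_product_fma(a, b):
    """TwoProduct via fma; assumes x is finite and no underflow occurs."""
    x = a * b
    y = fma(a, b, -x)
    return x, y


a = 1.0 + 2.0 ** -30
x, y = two_product_fma(a, a)

# No split is involved, so large operands are fine while a * b is finite.
x_big, y_big = two_product_fma(1e305, 1.0)
print(x, y, x_big, y_big)
```

The second call uses operands for which the split-based version overflows in the multiplication by 2^27 + 1; here the error term is computed correctly as 0.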

TwoSum, FastTwoSum and TwoProduct have in common that they compute their results with just a few ordinary floating-point operations and do not involve any branches. Thus, they can easily be optimized by a compiler, e.g. using instruction-level parallelism, and lead to efficient code. Proofs for the exactness of the error terms in TwoSum, FastTwoSum and TwoProduct can also be found in [104].

For each of the algorithms we gave sufficient and easily checkable conditions under which they are safe from corruption by overflow and underflow. Why do we care so much about overflow and underflow? After all, they occur only for very small or very large input data, and in many cases the input data can be scaled to avoid these problems. We intend to integrate error-free transformations into an expression dag based number type. One of the main advantages of such a number type is user-friendliness. The user should get correct signs and approximations without caring about the internals. To achieve this goal, we need means to handle overflow and underflow. A first step in this direction is to understand precisely when they may corrupt our data.

Auxiliary Functions. In some cases, we would like to actually compute msb(f), pred(f), or succ(f) for some f ∈ F. There are several means to do this. For example, in C/C++, the library function

In document Algorithm engineering for expression dag based number types (Page 58-61)