Using integer operations for manipulating floating point variables

Floating point numbers are stored in a binary representation according to the IEEE standard 754 (1985). This standard is used in almost all modern microprocessors and operating systems (but not in some very old DOS compilers).

The representation of float, double and long double reflects the floating point value written as $\pm 2^{eee} \cdot 1.fffff$, where $\pm$ is the sign, eee is the exponent, and fffff is the binary digits of the fraction. The sign is stored as a single bit which is 0 for positive and 1 for negative numbers. The exponent is stored as a biased binary integer, and the fraction is stored as the binary digits. The number is always normalized, if possible, so that the digit before the binary point is 1. This '1' is not included in the representation, except in the long double format. The formats can be expressed as follows:

struct Sfloat {
   unsigned int fraction : 23; // fractional part
   unsigned int exponent : 8;  // exponent + 0x7F
   unsigned int sign : 1;      // sign bit
};

struct Sdouble {
   unsigned int fraction : 52; // fractional part
   unsigned int exponent : 11; // exponent + 0x3FF
   unsigned int sign : 1;      // sign bit
};

struct Slongdouble {
   unsigned int fraction : 63; // fractional part
   unsigned int one : 1;       // always 1 if nonzero and normal
   unsigned int exponent : 15; // exponent + 0x3FFF
   unsigned int sign : 1;      // sign bit
};

The values of nonzero floating point numbers can be calculated as follows:

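$$\mathit{floatvalue} = (-1)^{\mathit{sign}} \cdot 2^{\mathit{exponent} - 127} \cdot (1 + \mathit{fraction} \cdot 2^{-23}),$$

$$\mathit{doublevalue} = (-1)^{\mathit{sign}} \cdot 2^{\mathit{exponent} - 1023} \cdot (1 + \mathit{fraction} \cdot 2^{-52}),$$

$$\mathit{longdoublevalue} = (-1)^{\mathit{sign}} \cdot 2^{\mathit{exponent} - 16383} \cdot (\mathit{one} + \mathit{fraction} \cdot 2^{-63}).$$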

The value is zero if all bits except the sign bit are zero. Zero can be represented with or without the sign bit.
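The float formula above can be checked with a small sketch that splits a float into its three fields and rebuilds the value from them. The names FloatFields, decode_float and value_from_fields are made up for this illustration, and the sketch assumes that float is the 32-bit IEEE 754 format. It copies the bits with std::memcpy, which is one well-defined way of reading the bit pattern.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three bit fields of a 32-bit float.
struct FloatFields {
    std::uint32_t fraction;  // bits 0-22
    std::uint32_t exponent;  // bits 23-30, biased by 0x7F
    std::uint32_t sign;      // bit 31
};

// Split a float into its fields. Assumes float is IEEE 754 single precision.
inline FloatFields decode_float(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return FloatFields{ bits & 0x7FFFFFu, (bits >> 23) & 0xFFu, bits >> 31 };
}

// Rebuild the value with the formula for normal nonzero numbers:
// (-1)^sign * 2^(exponent - 127) * (1 + fraction * 2^-23).
// Not valid for zero, subnormal numbers, infinity or NaN.
inline double value_from_fields(const FloatFields& p) {
    double magnitude = std::ldexp(1.0 + std::ldexp(double(p.fraction), -23),
                                  int(p.exponent) - 127);
    return p.sign ? -magnitude : magnitude;
}
```

For example, -6.5 is -1.625 * 2^2, so decode_float(-6.5f) gives sign 1, exponent 129 and fraction 0x500000.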

The fact that the floating point format is standardized allows us to manipulate the different parts of the floating point representation directly with the use of integer operations. This can be an advantage because integer operations are faster than floating point operations. You should use such methods only if you are sure you know what you are doing. See the end of this section for some caveats.

We can change the sign of a floating point number simply by inverting the sign bit:

// Example 14.22
union { float f; int i; } u;
u.i ^= 0x80000000; // flip sign bit of u.f

We can take the absolute value by setting the sign bit to zero:

// Example 14.23
union { float f; int i; } u;
u.i &= 0x7FFFFFFF; // set sign bit to zero

We can check if a floating point number is zero by testing all bits except the sign bit:

// Example 14.24
union { float f; int i; } u;
if (u.i & 0x7FFFFFFF) { // test bits 0 - 30
   // f is nonzero
}
else {
   // f is zero
}
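The three operations in examples 14.22 to 14.24 can be wrapped as small functions for experimentation. This is only a sketch: the function names are invented here, float is assumed to be 32-bit IEEE 754, and the bits are copied with std::memcpy instead of the union used in the examples.

```cpp
#include <cstdint>
#include <cstring>

// Copy the bit pattern of a float into an unsigned 32-bit integer.
inline std::uint32_t float_to_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Copy a 32-bit pattern back into a float.
inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// As in example 14.22: invert the sign bit.
inline float flip_sign(float f) {
    return bits_to_float(float_to_bits(f) ^ 0x80000000u);
}

// As in example 14.23: clear the sign bit.
inline float abs_by_mask(float f) {
    return bits_to_float(float_to_bits(f) & 0x7FFFFFFFu);
}

// As in example 14.24: zero if bits 0 - 30 are all zero (+0 or -0).
inline bool is_zero(float f) {
    return (float_to_bits(f) & 0x7FFFFFFFu) == 0;
}
```

Note that flip_sign(0.0f) returns -0, which is_zero still reports as zero.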

We can multiply a nonzero number by $2^n$ by adding n to the exponent:

// Example 14.25
union { float f; int i; } u;
int n;
if (u.i & 0x7FFFFFFF) { // check if nonzero
   u.i += n << 23;      // add n to exponent
}

Example 14.25 does not check for overflow and works only for positive n. You can divide by $2^n$ by subtracting n from the exponent if there is no risk of underflow.
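A range-checked variant of the exponent trick could look like the sketch below. It is not part of the original example: the name scale_pow2 is invented, float is assumed to be 32-bit IEEE 754, and the fallback to std::ldexp for inputs outside the normal range is a design choice made here to keep the result correct in all cases.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Multiply f by 2^n (n may be negative) by adjusting the exponent field.
// Falls back to std::ldexp when f is zero, subnormal, infinite or NaN,
// or when the new exponent would leave the range of normal numbers.
inline float scale_pow2(float f, int n) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    const long long biased = (bits >> 23) & 0xFFu;  // biased exponent, 0..255
    const long long scaled = biased + n;            // wide type: no overflow
    if (biased == 0 || biased == 255 || scaled < 1 || scaled > 254) {
        return std::ldexp(f, n);                    // slow but safe path
    }
    // Keep sign and fraction bits, replace the exponent field.
    bits = (bits & 0x807FFFFFu) | (static_cast<std::uint32_t>(scaled) << 23);
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

The extra checks cost some of the speed advantage, so a version without them is only appropriate when the range of the inputs is known in advance.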

The fact that the representation of the exponent is biased allows us to compare two positive floating point numbers simply by comparing them as integers:

// Example 14.26
union { float f; int i; } u, v;
if (u.i > v.i) {
   // u.f > v.f if both positive
}

Example 14.26 assumes that we know that u.f and v.f are both positive. It will fail if both are negative or if one is 0 and the other is -0 (zero with sign bit set).

We can shift out the sign bit to compare absolute values:

// Example 14.27
union { float f; unsigned int i; } u, v;
if (u.i * 2 > v.i * 2) {
   // abs(u.f) > abs(v.f)
}

The multiplication by 2 in example 14.27 will shift out the sign bit so that the remaining bits represent a monotonically increasing function of the absolute value of the floating point number.
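The two comparisons can be tried out with the following sketch. The function names are invented for illustration, float is assumed to be 32-bit IEEE 754, and NaN inputs are not handled.

```cpp
#include <cstdint>
#include <cstring>

// Copy the bit pattern of a float into an unsigned 32-bit integer.
inline std::uint32_t float_to_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// As in example 14.26: valid only if a and b are both positive.
inline bool greater_if_positive(float a, float b) {
    return float_to_bits(a) > float_to_bits(b);
}

// As in example 14.27: shifting left by one bit discards the sign bit,
// so the remaining bits grow monotonically with the absolute value.
inline bool abs_greater(float a, float b) {
    return (float_to_bits(a) << 1) > (float_to_bits(b) << 1);
}
```

With this version, abs_greater treats 0 and -0 as equal because both give the value zero after the shift.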

We can convert an integer in the interval $0 \le n < 2^{23}$ to a floating point number in the interval [1.0, 2.0) by setting the fraction bits:

// Example 14.28
union { float f; int i; } u;
int n;
u.i = (n & 0x7FFFFF) | 0x3F800000; // Now 1.0 <= u.f < 2.0

This method is useful for random number generators.
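As a sketch of how this might be used in a random number generator, the functions below turn the low 23 bits of an integer into a float. The names are invented here, float is assumed to be 32-bit IEEE 754, and the subtraction of 1.0 to reach the interval [0.0, 1.0) is an assumption about typical use rather than part of example 14.28. The source of the random bits is left out.

```cpp
#include <cstdint>
#include <cstring>

// Place the low 23 bits of n in the fraction field and set the exponent
// field to 0x7F, which gives a value in the interval [1.0, 2.0).
inline float bits_to_one_two(std::uint32_t n) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits = (n & 0x7FFFFFu) | 0x3F800000u;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Shift the result down to the interval [0.0, 1.0).
inline float bits_to_unit_interval(std::uint32_t n) {
    return bits_to_one_two(n) - 1.0f;
}
```

For example, n = 0x400000 sets only the top fraction bit and gives 1.5, or 0.5 after the subtraction.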

In general, it is faster to access a floating point variable as an integer if it is stored in memory, but not if it is a register variable. The union forces the variable to be stored in memory, at least temporarily. Using the methods in the above examples will therefore be a disadvantage if other nearby parts of the code could benefit from using registers for the same variables.

In these examples we are using unions rather than type casting of pointers because this method is safer. Type casting of pointers may not work on compilers that rely on the strict aliasing rule of standard C, specifying that pointers of different types cannot point to the same object, except for char pointers.

The above examples all use single precision. Using double precision in 32-bit systems gives rise to some extra complications. A double is represented with 64 bits, but 32-bit systems do not have inherent support for 64-bit integers. Many 32-bit systems allow you to define 64-bit integers, but they are in fact represented as two 32-bit integers, which is less efficient. You may use the upper 32 bits of a double which gives access to the sign bit, the exponent, and the most significant part of the fraction. For example, to test the sign of a double:

// Example 14.22b
union {
   double d;
   int i[2];
} u;
if (u.i[1] < 0) { // test sign bit
   // u.d is negative or -0
}

It is not recommended to modify a double by modifying only half of it, for example if you want to flip the sign bit in the above example with u.i[1] ^= 0x80000000; because this is likely to generate a store forwarding delay in the CPU (see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs"). This can be avoided in 64-bit systems by using a 64-bit integer rather than two 32-bit integers to alias upon the double.

Another problem with accessing 32 bits of a 64-bit double is that it is not portable to systems with big-endian storage. Examples 14.22b and 14.29 will therefore need modification if implemented on platforms with big-endian storage. All x86 platforms (Windows, Linux, BSD, Intel-based Mac OS, etc.) have little-endian storage, but other systems may have big-endian storage (e.g. PowerPC).
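Where 64-bit integers are available, the whole double can be aliased at once, as in the sketch below. The function names are invented for illustration, and double is assumed to be 64-bit IEEE 754. Taking the upper half with a shift avoids indexing into a two-element array, so the code does not depend on the order of the two halves in memory, provided that integers and doubles use the same byte order (an assumption that holds on common platforms).

```cpp
#include <cstdint>
#include <cstring>

// Copy the bit pattern of a double into an unsigned 64-bit integer.
inline std::uint64_t double_to_bits(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// Upper 32 bits: sign, exponent and the 20 most significant fraction bits.
inline std::uint32_t upper_half(double d) {
    return static_cast<std::uint32_t>(double_to_bits(d) >> 32);
}

// Sign test corresponding to example 14.22b: true for negative numbers and -0.
inline bool sign_bit_set(double d) {
    return (double_to_bits(d) >> 63) != 0;
}

// Flip the sign by modifying all 64 bits in one operation.
inline double flip_sign(double d) {
    std::uint64_t bits = double_to_bits(d) ^ 0x8000000000000000ull;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```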

We can make an approximate comparison of doubles by comparing bits 32-62. This can be useful for finding the numerically largest element in a matrix for use as pivot in a Gauss elimination. The method in example 14.27 can be implemented like this in a pivot search:

// Example 14.29
const int size = 100;
// Array of 100 doubles:
union {double d; unsigned int u[2];} a[size];
unsigned int absvalue, largest_abs = 0;
int i, largest_index = 0;
for (i = 0; i < size; i++) {
   // Get upper 32 bits of a[i] and shift out sign bit:
   absvalue = a[i].u[1] * 2;
   // Find numerically largest element (approximately):
   if (absvalue > largest_abs) {
      largest_abs = absvalue;
      largest_index = i;
   }
}

Example 14.29 finds the numerically largest element in an array, or approximately so. It may fail to distinguish elements with a relative difference less than $2^{-20}$, but this is sufficiently accurate for the purpose of finding a suitable pivot element. The integer comparison is likely to be faster than a floating point comparison. On big-endian systems you have to replace u[1] by u[0].
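A variant of the pivot search as a function is sketched below. It is not the original example: the name approx_largest_abs_index is invented, double is assumed to be 64-bit IEEE 754, and the upper 32 bits are extracted from a 64-bit integer so that no change is needed for big-endian storage (assuming integers and doubles share the same byte order).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Return the index of the (approximately) numerically largest element.
// Elements whose upper 32 bits are equal are not distinguished, and
// infinity or NaN will compare as larger than any finite number.
inline std::size_t approx_largest_abs_index(const double* a, std::size_t size) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits");
    std::uint32_t largest_abs = 0;
    std::size_t largest_index = 0;
    for (std::size_t i = 0; i < size; i++) {
        std::uint64_t bits;
        std::memcpy(&bits, &a[i], sizeof bits);
        // Get upper 32 bits and shift out the sign bit:
        std::uint32_t absvalue = static_cast<std::uint32_t>(bits >> 32) << 1;
        if (absvalue > largest_abs) {
            largest_abs = absvalue;
            largest_index = i;
        }
    }
    return largest_index;
}
```

For the array {1.0, -7.5, 3.0, 0.25} the function returns index 1.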
