Chapter 4 Floating point and denormal handling
4.4 DIP: A denormal profiler for Linux x86
4.4.1 Floating point exceptions on the 80x86
For the purposes of this thesis we shall only examine denormal arithmetic when 80x87 floating point instructions are used, rather than the newer SSE instructions supported by the Intel Pentium III and later. The SSE hardware supports denormal arithmetic in a similar way to the x87 hardware, with similar performance penalties, and, like the x87, it has mask bits to enable and disable exception reporting, along with control bits for flush-to-zero behaviour. To simplify the presentation, we restrict our focus to the x87 only.
When running 32-bit code and using 80x87 instructions, floating point exceptions are controlled by a group of bits called masks in the FPU Control Word. Each mask controls one of six possible types of floating point exception, and two of these exceptions signal denormal arithmetic.
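As a minimal sketch of the mask layout described above, the following C fragment names the six exception mask bits in the low byte of the x87 control word and clears the two that relate to denormal arithmetic. The bit positions are those documented for the x87; the macro and function names are illustrative and are not taken from DIP. The helper only computes a new control word value. Loading it into the FPU (for example with the fldcw instruction) is deliberately left out so that the fragment stays portable.

```c
#include <stdint.h>

/* x87 control word exception mask bits (bits 0-5). A set bit masks
 * (suppresses) the corresponding exception. Names are illustrative. */
#define X87_MASK_INVALID    0x0001u  /* IM */
#define X87_MASK_DENORMAL   0x0002u  /* DM */
#define X87_MASK_ZERODIV    0x0004u  /* ZM */
#define X87_MASK_OVERFLOW   0x0008u  /* OM */
#define X87_MASK_UNDERFLOW  0x0010u  /* UM */
#define X87_MASK_PRECISION  0x0020u  /* PM */

/* Return a copy of control word cw with the denormal operand and
 * underflow exceptions unmasked; all other bits are preserved. */
uint16_t unmask_denormal_exceptions(uint16_t cw)
{
    return (uint16_t)(cw & ~(X87_MASK_DENORMAL | X87_MASK_UNDERFLOW));
}
```

For the default control word 0x037F, in which all six exceptions are masked, the helper returns 0x036D.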
The first type of exception is called a Numeric Underflow Exception. This occurs when the FPU has performed a floating point calculation and is attempting to normalise the result. Whether underflow is reported depends on whether the result is tiny, i.e., the resulting exponent is too small to be represented in the floating point format in normal form; and whether the result is inexact, i.e., the result cannot be represented in the floating point format without loss of precision. If the underflow mask is set in the FPU Control Word, underflow is reported when the result is both tiny and inexact. If the underflow mask is cleared, underflow is reported for all tiny results. This corresponds to the underflow signalling in the IEEE-754 specification, and clearing the underflow mask allows software to determine when denormal output is produced by a floating point instruction. When underflow is reported, if the value’s destination was memory, the result is left on the floating point stack and the exception handler must perform the necessary denormalising and write. If the value was to be written to another floating point register, the value is scaled by $2^{24576}$ before writing to the register, and the exception handler must deal with any necessary re-scaling and denormalisation.
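The reporting rule can be summarised as a small truth function. The sketch below is only a model of the rule as stated in the text, not of the hardware, and the function and parameter names are our own.

```c
#include <stdbool.h>

/* Model of the underflow reporting rule described in the text:
 *   mask set     -> report only if the result is tiny AND inexact
 *   mask cleared -> report whenever the result is tiny */
bool underflow_reported(bool underflow_masked, bool tiny, bool inexact)
{
    if (!tiny)
        return false;
    return underflow_masked ? inexact : true;
}
```

The case that matters for profiling is a tiny but exact result: it is reported only when the mask is cleared.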
The second type of exception is called a Denormal Operand Exception. This occurs when one or more of the operands to a floating point instruction is found to contain denormal data and the denormal mask in the FPU Control Word is cleared. In other words, clearing the denormal mask allows a program to be informed when a floating point instruction receives denormal input. There is no corresponding signalling mode in the IEEE-754 specification.
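To make concrete what "denormal data" means for the 32-bit operands used later in this section, the following sketch tests a single precision bit pattern directly. It assumes that float is the 32-bit IEEE-754 format; the function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* A 32-bit IEEE-754 value is denormal when its 8-bit exponent field
 * is zero and its 23-bit fraction field is non-zero. The sign bit is
 * ignored. */
int is_denormal_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x007FFFFFu;
    return exponent == 0 && fraction != 0;
}

/* Convenience wrapper: copy out the bit pattern of a float and test it.
 * Assumes sizeof(float) == 4. */
int is_denormal_float(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return is_denormal_bits(bits);
}
```

The smallest positive denormal has the bit pattern 0x00000001, while 0x00800000 is the smallest positive normal value and is therefore not flagged.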
When either an underflow or a denormal exception occurs, the UE or DE flag, respectively, is set in the FPU status word, and the general purpose software exception handler is triggered when the next FPU instruction is encountered. The handler can read the status word, perform any appropriate actions, and on completion perform an IRET instruction to resume program execution from where it left off.
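As a hedged illustration of the "read the status word" step, the sketch below decodes the two flags of interest from a status word value. The flag positions (DE in bit 1, UE in bit 4) are those of the x87 status word; the enum and function names are hypothetical, and obtaining the status word inside a real handler is outside the scope of the fragment.

```c
#include <stdint.h>

/* x87 status word exception flags relevant to denormal profiling. */
#define X87_FLAG_DE 0x0002u  /* denormal operand */
#define X87_FLAG_UE 0x0010u  /* underflow */

/* Hypothetical event classification for a profiler. */
enum denormal_event {
    EVENT_NONE            = 0,
    EVENT_DENORMAL_INPUT  = 1,  /* DE set */
    EVENT_DENORMAL_OUTPUT = 2,  /* UE set */
    EVENT_BOTH            = 3   /* DE and UE set */
};

/* Map a status word value to the denormal events it records. */
enum denormal_event classify_status_word(uint16_t sw)
{
    int event = EVENT_NONE;
    if (sw & X87_FLAG_DE)
        event |= EVENT_DENORMAL_INPUT;
    if (sw & X87_FLAG_UE)
        event |= EVENT_DENORMAL_OUTPUT;
    return (enum denormal_event)event;
}
```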
However, there is one crucial difference between the denormal and underflow exceptions that complicates profiling for the denormal case. An underflow exception is a post-operation exception, i.e., it occurs after the instruction has finished calculating and has stored some intermediate result³ in the destination. In fact, due to how the original 8087 co-processor communicated with the 8086, it actually occurs at the beginning of the next floating point instruction. So when the IRET instruction executes in the exception handler, the program resumes execution at the next floating point instruction after the instruction that caused the exception⁴. As a consequence, it is impossible to use this exception to ‘fix up’ the denormal output of an operation, but this is of no concern to a profiler, which should not modify the behaviour of the program being profiled.
In contrast, denormal exceptions are pre-operation exceptions. They occur when the instruction is fetching its operands, before any calculation has been performed.
for (i=1; i<N-1; i++) {
    for (j=1; j<N-1; j++) {
        cur[i*N + j] = 0.25 * ( prev[(i-1)*N + j  ] +
                                prev[(i+1)*N + j  ] +
                                prev[ i   *N + j-1] +
                                prev[ i   *N + j+1]);
    }
}
Fig. 4.3: jacobi inner loop
0x08048730: flds   (%edi)
0x08048732: add    $0x1,%eax
0x08048735: add    $0x4,%edi
0x08048738: fadds  (%esi)
0x0804873a: add    $0x4,%esi
0x0804873d: fadds  (%ebx)
0x0804873f: add    $0x4,%ebx
0x08048742: fadds  (%ecx)
0x08048744: add    $0x4,%ecx
0x08048747: fmuls  0x080488e0
0x0804874d: fstps  (%edx)
0x0804874f: add    $0x4,%edx
0x08048752: cmp    $0x1ff,%eax
0x08048757: jne    0x08048730
Fig. 4.4: Compiled code
To illustrate floating point exception handling and the distinction between pre- and post-operation exceptions, we shall use the inner loop of the jacobi application from Sec. 6.1.1. The inner loop and the machine code instructions it compiles to are shown in Fig. 4.3 and Fig. 4.4.
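To show how such a loop can come to read and write denormal values, the following sketch wraps the loop of Fig. 4.3 in a self-contained function and counts the denormal results it stores. This is an illustration only and is not how DIP detects denormals; the function name, the use of fpclassify from the C standard library, and the flat row-major grid layout are our assumptions.

```c
#include <math.h>
#include <stddef.h>

/* One Jacobi sweep over the interior of an n-by-n row-major grid,
 * following Fig. 4.3. Returns the number of denormal values written
 * to cur. Boundary cells of cur are left untouched. */
size_t jacobi_sweep_count_denormals(const float *prev, float *cur, size_t n)
{
    size_t denormals = 0;
    for (size_t i = 1; i + 1 < n; i++) {
        for (size_t j = 1; j + 1 < n; j++) {
            cur[i * n + j] = (float)(0.25 * (prev[(i - 1) * n + j] +
                                             prev[(i + 1) * n + j] +
                                             prev[i * n + j - 1] +
                                             prev[i * n + j + 1]));
            /* Classify the stored single precision value. */
            if (fpclassify(cur[i * n + j]) == FP_SUBNORMAL)
                denormals++;
        }
    }
    return denormals;
}
```

For example, on a 5-by-5 grid that is zero everywhere except for the smallest normal value at the centre, each of the four neighbours of the centre receives a quarter of that value, which is denormal in single precision.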
We shall assume that the 32-bit value read from memory by the third fadds instruction at 0x08048742 is a denormal value, and that the 32-bit value written by the fstps at 0x0804874d is also denormal. This leads to the following sequence of events:
The first 7 instructions execute normally:
0x08048730: flds   (%edi)
0x08048732: add    $0x1,%eax
0x08048735: add    $0x4,%edi
0x08048738: fadds  (%esi)
0x0804873a: add    $0x4,%esi
0x0804873d: fadds  (%ebx)
0x0804873f: add    $0x4,%ebx
⁴ This may be many instructions after the instruction that caused the exception if there are integer instructions between the two floating point instructions.
The 8th instruction begins:
0x08048742: fadds (%ecx)
The FPU loads the 32-bit value from memory, decodes it, sees it is a denormal, and sets the FPU’s DE (Denormal Operand Exception) flag. It then abandons the instruction by performing no operation and leaving the instruction pointer unmodified.
The CPU begins execution of the next instruction. Since the instruction pointer hasn’t been changed, this is still the instruction at 0x08048742. Before starting the instruction, the FPU sees the DE flag is set and triggers a Floating Point Exception.
The FPE handler runs, and on completion, the CPU returns to the instruction at 0x08048742 and executes it.
The 9th and 10th instructions execute normally, and produce a denormal on the stack:
0x08048744: add $0x4,%ecx
0x08048747: fmuls 0x080488e0
The 11th instruction begins:
0x0804874d: fstps (%edx)
To perform the store, the FPU converts the 80-bit float to a 32-bit float, and produces a denormal value. It performs the store to memory and then sets the FPU’s UE (Underflow Exception) flag and increments the instruction pointer.
The CPU continues normally until the next floating point instruction is encountered:
0x0804874f: add    $0x4,%edx
0x08048752: cmp    $0x1ff,%eax
0x08048757: jne    0x08048730
At the next iteration of the loop (4 instructions after the store), a floating point load occurs:
0x08048730: flds (%edi)
At the beginning of the instruction, the FPU sees the UE flag set (from the previous store), and before starting the instruction, triggers a Floating Point Exception. The FPE handler runs, and on completion, the CPU returns to the instruction at 0x08048730.
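The sequence of events above can be condensed into a small model that, given the instruction that raises an exception, returns the address at which the handler is entered and to which it returns. This is a sketch of the behaviour described in this section only: the table is transcribed from Fig. 4.4, the type and function names are hypothetical, and the model assumes the jne at the end of the loop is taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct insn { uint32_t addr; int is_fp; };

/* The loop body of Fig. 4.4; is_fp marks the x87 instructions. */
static const struct insn loop_body[] = {
    {0x08048730u, 1},  /* flds  (%edi)      */
    {0x08048732u, 0},  /* add   $0x1,%eax   */
    {0x08048735u, 0},  /* add   $0x4,%edi   */
    {0x08048738u, 1},  /* fadds (%esi)      */
    {0x0804873au, 0},  /* add   $0x4,%esi   */
    {0x0804873du, 1},  /* fadds (%ebx)      */
    {0x0804873fu, 0},  /* add   $0x4,%ebx   */
    {0x08048742u, 1},  /* fadds (%ecx)      */
    {0x08048744u, 0},  /* add   $0x4,%ecx   */
    {0x08048747u, 1},  /* fmuls 0x080488e0  */
    {0x0804874du, 1},  /* fstps (%edx)      */
    {0x0804874fu, 0},  /* add   $0x4,%edx   */
    {0x08048752u, 0},  /* cmp   $0x1ff,%eax */
    {0x08048757u, 0},  /* jne   0x08048730  */
};
#define LOOP_LEN (sizeof loop_body / sizeof loop_body[0])

enum exc_kind { PRE_OPERATION, POST_OPERATION };

/* Index (0-based) of the instruction at which the exception raised by
 * the x87 instruction at index `faulting` is delivered. */
size_t delivery_index(enum exc_kind kind, size_t faulting)
{
    assert(faulting < LOOP_LEN && loop_body[faulting].is_fp);
    if (kind == PRE_OPERATION)
        return faulting;            /* same instruction is retried */
    size_t i = faulting;
    do {                            /* skip integer instructions */
        i = (i + 1) % LOOP_LEN;
    } while (!loop_body[i].is_fp);
    return i;                       /* next x87 instruction */
}

/* Address corresponding to delivery_index(). */
uint32_t delivery_address(enum exc_kind kind, size_t faulting)
{
    return loop_body[delivery_index(kind, faulting)].addr;
}
```

In this model the denormal operand exception raised by the fadds at index 7 is delivered at 0x08048742, the faulting instruction itself, whereas the underflow exception raised by the fstps at index 10 is delivered at 0x08048730, four instructions later in the next iteration.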