Floating-Point

6.1 Floating-point basics and the IEEE-754 standard

6.1.1 Rounding algorithms

The IEEE 754-1985 standard defines four different ways in which results can be rounded, as follows:

• Round to nearest (ties to even). This mode causes rounding to the nearest value. If a number is exactly midway between two possible values, it is rounded to the nearest value with a zero least significant bit.

• Round toward 0. This causes numbers to always be rounded towards zero (this can also be viewed as truncation).

• Round toward +∞. This selects rounding towards positive infinity.

• Round toward -∞. This selects rounding towards negative infinity.

The IEEE 754-2008 standard adds a further rounding mode. In round to nearest, a number that is exactly halfway between two values can now also be rounded away from zero (in other words, upwards for positive numbers and downwards for negative numbers), as an alternative to rounding to the value with a zero least significant bit. At present VFP does not support this rounding mode.
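The five modes can be compared side by side with a short sketch. It uses Python's decimal module, whose rounding constants have the same direction and tie-breaking behaviour as the modes listed above; this is an illustration in decimal arithmetic on a host machine, not a model of how VFP rounds binary results. The MODES table and the round_to_integer helper are names invented for this example.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING,
                     ROUND_FLOOR, ROUND_HALF_UP)

# Each rounding mode named in the text, paired with the decimal-module
# constant that has the same direction and tie-breaking rule.
MODES = {
    "nearest, ties to even": ROUND_HALF_EVEN,
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
    "nearest, ties away from 0": ROUND_HALF_UP,  # the IEEE 754-2008 addition
}

def round_to_integer(text, mode_name):
    """Round the decimal string `text` to an integer under the named mode."""
    rounded = Decimal(text).quantize(Decimal("1"), rounding=MODES[mode_name])
    return int(rounded)

# Halfway cases are where the modes disagree most visibly.
for value in ("2.5", "-2.5", "3.5"):
    for name in MODES:
        print(f"{value:>5}  {name:<26} -> {round_to_integer(value, name)}")
```

Note that 2.5 and 3.5 round to 2 and 4 under ties to even, because in each case the result with a zero least significant bit (an even integer here) is chosen.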

6.1.2 ARM VFP

VFP is an optional but rarely omitted extension to the instruction sets in the ARMv7-A architecture. It can be implemented with either thirty-two or sixteen doubleword registers. The terms VFPv3-D32 and VFPv3-D16 are used to distinguish between these two options. If the Advanced SIMD (NEON) extension is implemented together with VFPv3, VFPv3-D32 is always present. VFPv3 can also be optionally extended by the half-precision extensions, which provide conversion functions in both directions between half-precision floating-point (16-bit) and single-precision floating-point (32-bit). These operations only convert half-precision values to and from other formats; they do not perform arithmetic on them.

VFPv4 adds both the half-precision extensions and the Fused Multiply-Add instructions to the features of VFPv3. In a Fused Multiply-Add operation, only a single rounding occurs at the end. This is one of the new facets of the IEEE 754-2008 specification. Fused operations can improve the accuracy of calculations that repeatedly accumulate products, such as matrix multiplication or dot product calculation. The VFP version supported by individual Cortex-A series processors is given in Table 2-3 on page 2-9.
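The effect of the single rounding can be seen with a small numerical sketch. It emulates a fused operation in double precision on the host by computing the exact result with rational arithmetic and rounding once; it does not execute any VFP instruction. The function names and the operand values are chosen purely for illustration.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate a*b + c with a single rounding.

    The product and sum are formed exactly as rationals, then converted
    to a float once, which is the only rounding step.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

def separate_multiply_add(a, b, c):
    """Ordinary multiply followed by add: the product is rounded first."""
    product = a * b      # first rounding
    return product + c   # second rounding

# Operands picked so that the exact product, 1 - 2**-54, lies halfway
# between two doubles and is rounded up to exactly 1.0.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print(separate_multiply_add(a, b, c))  # the small residual is lost
print(fused_multiply_add(a, b, c))     # the residual -2**-54 survives
```

With separate operations the rounded product cancels completely against c, whereas the fused form keeps the residual term. Errors of this kind are what accumulate in long dot products.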

In addition to the registers described, there are a number of other VFP registers:

Floating-Point System ID Register (FPSID)

This can be read by system software to determine which floating-point features are supported in hardware.

Floating-Point Status and Control Register (FPSCR)

This holds the flags set by floating-point comparisons.

Floating-Point Exception Register (FPEXC)

The FPEXC register contains bits that enable system software that handles exceptions to determine what has happened.

Media and VFP Feature registers 0 and 1 (MVFR0 and MVFR1)

These registers enable system software to determine which Advanced SIMD and floating-point features are provided on the processor implementation.

User mode code can only access the FPSCR. One implication of this is that applications cannot read the FPSID to determine which features are supported unless the host OS provides this information. Linux provides this through /proc/cpuinfo, for example, but the information is not nearly as detailed as that provided by the VFP hardware registers.
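As a rough sketch of how an application might use the information Linux exposes, the following function picks the floating-point related entries out of the Features line of /proc/cpuinfo. The flag names in WANTED and the contents of SAMPLE are assumptions about what an ARM Linux kernel typically reports; check them against the output on the actual target. The function name vfp_features is invented for this example.

```python
# Flag names assumed to appear on the "Features" line of ARM Linux kernels.
WANTED = {"vfp", "vfpv3", "vfpv3d16", "vfpv4", "neon"}

def vfp_features(cpuinfo_text):
    """Return the floating-point related flags from the first Features line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return sorted(WANTED.intersection(value.split()))
    return []  # no Features line, for example on a non-ARM host

# Illustrative text only, not captured from a real device.
SAMPLE = """\
Processor       : ARMv7 Processor rev 0 (v7l)
Features        : swp half thumb fastmult vfp edsp neon vfpv3 tls
"""

print(vfp_features(SAMPLE))
```

On a target system the same function could be applied to the text read from /proc/cpuinfo instead of SAMPLE.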

Unlike ARM integer instructions, no VFP operations will affect the flags in the APSR directly. The flags are stored in the FPSCR. Before the result of a floating-point comparison can be used by the integer processor, the flags set by a floating-point comparison must be transferred to the APSR, using the VMRS instruction. This includes use of the flags for conditional execution, even of other VFP instructions.

Example 6-1 shows a simple piece of code to illustrate this. The VCMP instruction compares the values in VFP registers d0 and d1 and sets the FPSCR flags as a result. These flags must then be transferred to the integer processor APSR, using the VMRS instruction. You can then conditionally execute instructions based on this.

Example 6-1 Example code illustrating usage of floating-point flags

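    VCMP    d0, d1              ; compare d0 with d1, setting the FPSCR flags
    VMRS    APSR_nzcv, FPSCR    ; copy the N, Z, C and V flags to the APSR
    BNE     label               ; branch if not equal (or unordered)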

Flag meanings

The integer comparison flags support comparisons that are not applicable to floating-point numbers. For example, floating-point values are always signed, so there is no requirement for unsigned comparisons. On the other hand, floating-point comparisons can produce the unordered result (meaning that one or both operands were NaN, or Not a Number). IEEE-754 defines four testable relationships between two floating-point values, which map onto the ARM condition codes as follows:

Table 6-2 ARM APSR flags

IEEE-754 relationship    N  Z  C  V
Equal                    0  1  1  0
Less Than (LT)           1  0  0  0
Greater Than (GT)        0  0  1  0
Unordered                0  0  1  1
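The mapping can be written out as a small model, which is convenient for checking expectations on a host machine. This is a sketch of the flag encoding only, not of the VCMP instruction itself; the function name vcmp_flags is invented for this example. The unordered encoding (C and V set, N and Z clear) is the one implied by the VS and HI rows of Table 6-3.

```python
import math

def vcmp_flags(a, b):
    """Return the (N, Z, C, V) flags for a floating-point compare of a with b."""
    if math.isnan(a) or math.isnan(b):
        return (0, 0, 1, 1)  # unordered: at least one operand is NaN
    if a == b:
        return (0, 1, 1, 0)  # equal (note that +0.0 == -0.0)
    if a < b:
        return (1, 0, 0, 0)  # less than
    return (0, 0, 1, 0)      # greater than

for pair in [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (float("nan"), 1.0)]:
    print(pair, vcmp_flags(*pair))
```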

Compare with zero

Unlike the integer instructions, most VFP (and NEON) instructions can operate only on registers, and cannot accept immediate values encoded in the instruction stream. The VCMP instruction is a notable exception in that it has a special-case variant that enables quick and easy comparison with zero.

Interpreting the flags

When the flags are in the APSR, they can be used almost as if an integer comparison had set the flags. However, floating-point comparisons support different relationships, so the integer condition codes do not always make sense. Table 6-3 lists what each condition code means after a floating-point comparison, alongside its meaning after an integer comparison.

The condition code is attached to the instruction reading the flags, and the source of the flags makes no difference to the flags that are tested. It is the meaning of the flags that differs when you perform a VCMP rather than a CMP. Similarly, the opposite conditions still hold. For example, HS is still the opposite of LO.

When set by CMP the flags generally have analogous meanings to the flags set by VCMP. For example, GT still means greater than. However, the unordered condition and the removal of the signed conditions can confuse matters. Often, for example, it is desirable to use LO, normally an unsigned less than check, in place of LT, because it does not match in the unordered case.

Table 6-3 Interpreting the flags

Code      Meaning (when set by VCMP)                  Meaning (when set by CMP)             Flags tested
EQ        Equal to                                    Equal to                              Z == 1
NE        Unordered, or not equal to                  Not equal to                          Z == 0
CS or HS  Greater than, equal to, or unordered        Greater than or equal to (unsigned)   C == 1
CC or LO  Less than                                   Less than (unsigned)                  C == 0
MI        Less than                                   Negative                              N == 1
PL        Greater than, equal to, or unordered        Positive or zero                      N == 0
VS        Unordered (at least one argument was NaN)   Signed overflow                       V == 1
VC        Not unordered (no argument was NaN)         No signed overflow                    V == 0
HI        Greater than or unordered                   Greater than (unsigned)               (C == 1) && (Z == 0)
LS        Less than or equal to                       Less than or equal to (unsigned)      (C == 0) || (Z == 1)
GE        Greater than or equal to                    Greater than or equal to (signed)     N == V
LT        Less than or unordered                      Less than (signed)                    N != V
GT        Greater than                                Greater than (signed)                 (Z == 0) && (N == V)
LE        Less than, equal to, or unordered           Less than or equal to (signed)        (Z == 1) || (N != V)
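The VCMP column of Table 6-3 can be derived mechanically from the flag tests, which is a useful check when choosing a condition code. The sketch below applies each flag test to the four flag patterns a floating-point comparison can produce. The dictionary and function names are invented for this example, and the unordered pattern (C and V set) is the one implied by the VS and HI rows of the table.

```python
# (N, Z, C, V) patterns for each floating-point relationship.
FLAGS = {
    "equal":     (0, 1, 1, 0),
    "less":      (1, 0, 0, 0),
    "greater":   (0, 0, 1, 0),
    "unordered": (0, 0, 1, 1),
}

# Flag tests from the "Flags tested" column of Table 6-3.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "HS": lambda n, z, c, v: c == 1,   # also written CS
    "LO": lambda n, z, c, v: c == 0,   # also written CC
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
}

def matching(cond):
    """Return the set of relationships for which condition `cond` passes."""
    test = CONDITIONS[cond]
    return {name for name, flags in FLAGS.items() if test(*flags)}

# LO passes only for a true "less than"; LT also passes when a NaN is involved.
print("LO:", sorted(matching("LO")))
print("LT:", sorted(matching("LT")))
```

This makes the point about LO and LT concrete: both pass for less than, but only LT also passes in the unordered case.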
