I) Round-off Errors and Computer Arithmetic

(1)

Elementary Numerical Mathematics

Winter Term 2020/21

Gerhard Wellein

HPC Services - Regionales Rechenzentrum Erlangen (RRZE) Department for Computer Science

(2)

Round-off Errors and Computer Arithmetic: Content

1. Number representation in computers
2. Decimal floating-point representation and round-off errors
3. Error analysis for computer arithmetics


(4)

1) Number representation in computers

Motivation:

▪ Finite-digit arithmetic in computers
   → A general number $x \in \mathbb{R}$ may not be stored exactly in a computer

▪ Example: $\sqrt{3}$
   Exact arithmetic: $\sqrt{3} = 1.7320508075\ldots \iff (\sqrt{3})^2 = 3$
   Computer: a=sqrt(3.0); b=a*a; print b;
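The computer experiment on this slide can be tried directly. The following is a minimal sketch in Python, whose `float` type is a 64-bit IEEE 754 double on common platforms; the variable names follow the slide, and the comparison lines are added for illustration only.

```python
import math

a = math.sqrt(3.0)  # nearest double to sqrt(3); sqrt(3) itself is irrational
b = a * a           # the product is rounded to a double once more

print(b)               # the value actually stored
print(b == 3.0)        # False: the round-off is visible
print(abs(b - 3.0))    # size of the deviation from the exact result 3
```

The deviation is tiny (of the order of the last stored digit), but it is not zero, which is the point of the example.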

(5)

Binary Floating-Point Arithmetic Standard 754-1985 by IEEE:

▪ Representation of a 64-bit floating-point number ("double"):

   1 bit:   $s$ → Sign: $(-1)^s$
   11 bits: $c_i$ → Exponent: $c = \sum_{i=1}^{11} c_i \, 2^{i-1}$  ($0 \le c \le 2047$)
   52 bits: $f_i$ → Mantissa: $f = \sum_{i=1}^{52} f_i \left(\tfrac{1}{2}\right)^i$  ($0 \le f < 1$)

▪ Normal form of 64-bit floating-point numbers:

   $(-1)^s \, 2^{c-1023} \, (1 + f)$

(6)

Normal form: $(-1)^s \, 2^{c-1023} \, (1 + f)$

Special cases (64 bit):

▪ $s = 0, 1$; $c = 0$; $f = 0$ → $\pm 0$
▪ $c = 0$; $f \neq 0$ → "denormalized numbers"
▪ $c = 2047$ → $\infty$ ($f = 0$) and NaN ($f \neq 0$)

Positive ($s = 0$) number ranges (64 bit):

▪ Smallest normalized number: $f = 0$; $c = 1$ → $2^{-1022} \approx 10^{-308}$
▪ Largest number: $f \approx 1$; $c = 2046$ → $\approx 2^{1024} \approx 10^{+308}$
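The bit layout and the normal form can be checked on a real machine. The sketch below uses Python's standard `struct` module to read the raw bits of a double; the helper names `decode_double` and `normal_form` are illustrative and not part of the lecture.

```python
import struct
import sys

def decode_double(x):
    """Split a 64-bit double into sign s, biased exponent c and fraction f."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64 bits as an integer
    s = bits >> 63                         # 1 sign bit
    c = (bits >> 52) & 0x7FF               # 11 exponent bits, 0 <= c <= 2047
    f = (bits & ((1 << 52) - 1)) / 2**52   # 52 mantissa bits, 0 <= f < 1
    return s, c, f

def normal_form(s, c, f):
    """Evaluate (-1)^s * 2^(c-1023) * (1+f); valid for normalized numbers (1 <= c <= 2046)."""
    return (-1) ** s * 2.0 ** (c - 1023) * (1 + f)

s, c, f = decode_double(-6.5)
print(s, c, f, normal_form(s, c, f))

# Range limits quoted on the slide
print(decode_double(sys.float_info.min))   # smallest normalized number: c = 1, f = 0
print(decode_double(sys.float_info.max))   # largest number: c = 2046, f just below 1
```

Here $-6.5 = -1.625 \cdot 2^2$, so the decoded fields are $s = 1$, $c = 1025$, $f = 0.625$.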

(7)

Binary Floating-Point Arithmetic Standard 754-1985 by IEEE:

▪ Representation of a 32-bit floating-point number ("float"):

   1 bit:   $s$ → Sign: $(-1)^s$
   8 bits:  $c_i$ → Exponent: $c = \sum_{i=1}^{8} c_i \, 2^{i-1}$  ($0 \le c \le 255$)
   23 bits: $f_i$ → Mantissa: $f = \sum_{i=1}^{23} f_i \left(\tfrac{1}{2}\right)^i$  ($0 \le f < 1$)

▪ Normal form of 32-bit floating-point numbers:

   $(-1)^s \, 2^{c-127} \, (1 + f)$

(8)

▪ 1; 999,999; 1,000,000: these numbers can be represented exactly in the IEEE 754 standard (32 bit)

▪ IEEE 754 32-bit representation of 0.999999:

x1 = 0.999999 → x1_32 = 0.999998986721038…
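The stored 32-bit value can be reproduced without special hardware. This sketch rounds a Python double to the nearest 32-bit float by packing it with the standard `struct` module; `to_float32` and `decode_float32` are hypothetical helper names chosen for this example.

```python
import struct

def to_float32(x):
    """Round a 64-bit double to the nearest IEEE 754 32-bit value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def decode_float32(x):
    """Split the 32-bit representation of x into sign s, exponent c, fraction f."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    s = bits >> 31                         # 1 sign bit
    c = (bits >> 23) & 0xFF                # 8 exponent bits
    f = (bits & ((1 << 23) - 1)) / 2**23   # 23 mantissa bits
    return s, c, f

for value in (1.0, 999_999.0, 1_000_000.0, 0.999999):
    stored = to_float32(value)
    print(value, "->", repr(stored), "exact" if stored == value else "rounded")

s, c, f = decode_float32(0.999999)
print(s, c, f, (-1) ** s * 2.0 ** (c - 127) * (1 + f))
```

The three integers survive unchanged, while 0.999999 is replaced by the neighbouring representable value $1 - 17 \cdot 2^{-24}$.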

(9)

▪ Compute (x0-x1) * 1 000 000

▪ Assume: float x0,x1 // 32 bit values

▪ x0= 1,000,000.0; x1= 999,999.0

→(x0-x1)*1,000,000 = 1,000,000 //exact arithmetic

▪ x0= 1.000000; x1= 0.999999

→(x0-x1)*1,000,000 ≈ 1.0132790 //x1 is stored as 0.999998986721038…, so the result is not 1
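Both cases of this experiment can be emulated in Python by rounding every intermediate result to 32 bits. This is a sketch under that assumption; `to_float32` and `scaled_difference` are illustrative names, and a compiled C program using `float` may keep intermediates in higher precision depending on compiler settings.

```python
import struct

def to_float32(x):
    """Round a 64-bit double to the nearest IEEE 754 32-bit value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def scaled_difference(x0, x1, scale=1_000_000.0):
    """Evaluate (x0 - x1) * scale with every value and result rounded to 32 bits."""
    x0, x1, scale = to_float32(x0), to_float32(x1), to_float32(scale)
    diff = to_float32(x0 - x1)
    return to_float32(diff * scale)

large = scaled_difference(1_000_000.0, 999_999.0)  # inputs exactly representable
small = scaled_difference(1.000000, 0.999999)      # x1 is rounded on input

print(large)
print(small)
```

In the second case the subtraction itself is exact; the error of about 1.3 % comes entirely from the rounding of x1 when it is stored, and the subtraction of nearly equal numbers exposes it.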


(11)

2) Decimal floating-point representation & round-off errors

▪ Preliminaries:

▪ For simplicity, assume a real number $y$ is stored in a computer in decimal floating-point form using "k digits":

   $fl(y) = \pm 0.d_1 d_2 d_3 \ldots d_k \times 10^n$ with $1 \le d_1 \le 9$ and $0 \le d_i \le 9$ for $i = 2, \ldots, k$

   ("k-digit decimal floating-point form of $y$")

▪ Any real number $y$ can be written as follows:

   $y = \pm 0.d_1 d_2 d_3 \ldots d_k d_{k+1} \ldots \times 10^n$
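To make the k-digit form concrete, the sketch below uses Python's standard `decimal` module to cut a number down to k significant decimal digits. The slide does not say how the digits $d_{k+1}, d_{k+2}, \ldots$ are discarded; chopping (truncation) and rounding are shown here as two common conventions, and the function names are assumptions made for this example.

```python
from decimal import Context, ROUND_DOWN, ROUND_HALF_UP

def fl_chop(y, k):
    """k-digit form of y obtained by dropping all digits after d_k."""
    return Context(prec=k, rounding=ROUND_DOWN).create_decimal(y)

def fl_round(y, k):
    """k-digit form of y obtained by rounding at digit d_k."""
    return Context(prec=k, rounding=ROUND_HALF_UP).create_decimal(y)

def mantissa_and_exponent(d):
    """Return the digit string d_1...d_k and the exponent n of 0.d_1...d_k * 10^n."""
    t = d.as_tuple()
    digits = "".join(str(digit) for digit in t.digits)
    return digits, len(t.digits) + t.exponent

y = "3.14159265358979"   # decimal string, so no binary rounding interferes
chopped = fl_chop(y, 5)
rounded = fl_round(y, 5)

print(chopped, mantissa_and_exponent(chopped))
print(rounded, mantissa_and_exponent(rounded))
```

With $k = 5$ the two conventions differ in the last kept digit, because the first discarded digit is 9.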


(13)

3) Error analysis for computer arithmetics

▪ Problem:

   ▪ In general, floating-point arithmetic on computers is not exact!

   ▪ The associative law does not hold in general: the order of evaluation may change the binary result of a computation

      $(a + b) + c \neq a + (b + c)$

▪ This section: Qualitative analysis of the impact of finite-digit arithmetic
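A three-term sum is enough to see the associative law fail in 64-bit arithmetic. The values 0.1, 0.2 and 0.3 are chosen for this sketch and do not come from the slides.

```python
a, b, c = 0.1, 0.2, 0.3   # none of these is exactly representable in binary

left = (a + b) + c        # rounds after a + b, then again after adding c
right = a + (b + c)       # rounds after b + c, then again after adding a

print(left, right)
print(left == right)
print(abs(left - right))
```

The two results differ only in the last stored bit, but a bitwise comparison of the outputs already fails, which matters for example when a compiler or a parallel reduction reorders a sum.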

(14)

Harmonic series: Divergence ?!
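The slide gives only this title. One plausible reading, assumed here, is the classic experiment in which the harmonic series $\sum 1/n$, which diverges in exact arithmetic, appears to converge when summed in 32-bit arithmetic: once $1/n$ is smaller than half the spacing of floats near the partial sum, adding it changes nothing. The sketch emulates 32-bit rounding with `struct`; the helper names and the iteration cap are illustrative.

```python
import struct

_F32 = struct.Struct("<f")

def to_float32(x):
    """Round a 64-bit double to the nearest IEEE 754 32-bit value."""
    return _F32.unpack(_F32.pack(x))[0]

def harmonic_sum_float32(max_terms=3_000_000):
    """Sum 1/n in emulated 32-bit arithmetic until the sum stops changing.

    Returns the final sum and the index n of the first term that had no effect
    (None if the cap max_terms is reached first).
    """
    s = 0.0
    for n in range(1, max_terms + 1):
        new_s = to_float32(s + to_float32(1.0 / n))
        if new_s == s:
            return s, n
        s = new_s
    return s, None

total, stalled_at = harmonic_sum_float32()
print(total, stalled_at)
```

The loop takes a few seconds in pure Python. The sum freezes after roughly two million terms, although the exact series grows without bound.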

(15)

Accumulated relative error $e_{RE}$

#FP operations         32-bit       64-bit
MFlop ($10^{6}$)       $10^{-3}$    $10^{-12}$
TFlop ($10^{12}$)      $1$          $10^{-9}$

Accumulated relative error after a long series of FP (floating-point) operations, assuming a simple random-walk model for error propagation.

After $N$ successive arithmetic operations, each with a relative error $e_m$, the accumulated relative error becomes:

$e_{RE} \sim N^{1/2} \, e_m$
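The table entries follow from the random-walk estimate $e_{RE} \sim N^{1/2} e_m$. The slide does not state the per-operation errors; the values $e_m = 10^{-6}$ (32 bit) and $e_m = 10^{-15}$ (64 bit) used in this sketch are assumptions inferred from the table entries, and the function name is illustrative.

```python
import math

def accumulated_relative_error(n_ops, e_m):
    """Random-walk estimate of the relative error after n_ops operations."""
    return math.sqrt(n_ops) * e_m

# Assumed per-operation relative errors, inferred from the table (not stated on the slide)
E_M = {"32-bit": 1e-6, "64-bit": 1e-15}

for label, n_ops in (("MFlop", 1e6), ("TFlop", 1e12)):
    row = {fmt: accumulated_relative_error(n_ops, e_m) for fmt, e_m in E_M.items()}
    print(label, row)
```

Under this model a 32-bit computation with $10^{12}$ operations can lose all significant digits, while a 64-bit computation still retains about nine.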


(17)

4) Stability of Computations

▪ Definition:

A computation M(x1,x2,…,xn) with input data (x1,x2,…,xn) is called stable if small errors in the input data (< e) lead to small errors in the output data (< c*e).

▪ Definition:

Let E₀ > 0 denote an initial error and E_n the magnitude of the error after n subsequent operations. Let C > 1 be a constant independent of n:

1) If E_n ≈ C n E₀, the error growth is called linear

2) If E_n ≈ Cⁿ E