Numerical Methods for Finance

Dr Robert Nürnberg

This course introduces the major numerical methods needed for quantitative work in finance. To this end, the course strikes a balance between a general survey of significant numerical methods that anyone working in a quantitative field should know, and a detailed study of some numerical methods specific to financial mathematics. In the first part the course will cover e.g.

linear and nonlinear equations, interpolation and optimization,

while the second part introduces e.g.

binomial and trinomial methods, finite difference methods, Monte-Carlo simulation, random number generators, option pricing and hedging.

1. References

1. Burden and Faires (2004), Numerical Analysis.
2. Clewlow and Strickland (1998), Implementing Derivative Models.
3. Fletcher (2000), Practical Methods of Optimization.
4. Glasserman (2004), Monte Carlo Methods in Financial Engineering.
5. Higham (2004), An Introduction to Financial Option Valuation.
6. Hull (2005), Options, Futures, and Other Derivatives.
7. Kwok (1998), Mathematical Models of Financial Derivatives.
8. Press et al. (1992), Numerical Recipes in C. (online)
9. Press et al. (2002), Numerical Recipes in C++.
10. Seydel (2006), Tools for Computational Finance.

2. Preliminaries

1. Algorithms.

An algorithm is a set of instructions to construct an approximate solution to a mathematical problem.

A basic requirement for an algorithm is that the error can be made as small as we like. Usually, the higher the accuracy we demand, the greater the amount of computation required.

An algorithm is convergent if it produces a sequence of values which converge to the desired solution of the problem.

Example

Find x = √c, c > 1 constant.

Answer

x = √c ⟺ x² = c ⟺ f(x) := x² − c = 0
⇒ f(1) = 1 − c < 0 and f(c) = c² − c > 0
⇒ ∃ x̄ ∈ (1, c) s.t. f(x̄) = 0.
f′(x) = 2x > 0 ⇒ f monotonically increasing ⇒ x̄ is unique.

Denote I_n := [a_n, b_n] with I_0 = [a_0, b_0] = [1, c]. Let x_n := (a_n + b_n)/2.

(i) If f(x_n) = 0 then x̄ = x_n.

Length of I_n:  m(I_n) = ½ m(I_{n−1}) = ⋯ = 2^{−n} m(I_0) = (c − 1)/2^n.

Algorithm

The algorithm stops if m(I_n) < ε, and we let x* := x_n.

Error as small as we like?
x̄, x* ∈ I_n ⇒ error |x* − x̄| = |x_n − x̄| ≤ m(I_n) → 0 as n → ∞. ✓

Convergence?
I_0 ⊃ I_1 ⊃ ⋯ ⊃ I_n ⊃ ⋯ ⇒ ∃! x̄ ∈ ⋂_{n=0}^∞ I_n with f(x̄) = 0, i.e. x̄ = √c. ✓

Implementation:
There is no need to store the intervals I_n = [a_n, b_n]; it is sufficient to store only 3 points throughout. Suppose x̄ ∈ (a, b) and define x := (a + b)/2.
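The C++ exercises later in these notes suggest implementing such schemes; a minimal sketch of the bisection idea above, under the stated assumption c > 1 (function name and tolerance are illustrative, not part of the notes):

```cpp
#include <cmath>
#include <iostream>

// Bisection sketch for f(x) = x*x - c on [1, c], c > 1.
// Stores only three points (a, b, x), as suggested above.
double bisect_sqrt(double c, double eps = 1e-10) {
    double a = 1.0, b = c;
    double x = 0.5 * (a + b);
    while (b - a >= eps) {
        double fx = x * x - c;
        if (fx == 0.0) break;          // exact root found
        if (fx < 0.0) a = x;           // root lies in (x, b), since f(a) < 0 < f(b)
        else          b = x;           // root lies in (a, x)
        x = 0.5 * (a + b);
    }
    return x;
}

int main() {
    std::cout << bisect_sqrt(7.0) << "\n";   // approx 2.6457513
}
```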

2. Errors.

There are various errors in computed solutions, such as

• discretization error (discrete approximation to continuous systems),

• truncation error (termination of an infinite process), and

• rounding error (finite digit limitation in computer arithmetic).

If a is a number and ã is an approximation to a, then the absolute error is |a − ã| and the relative error is |a − ã| / |a|, provided a ≠ 0.

Example

discretization error:
x′ = f(x) [differential equation]  →  (x(t + h) − x(t))/h = f(x(t)) [difference equation],
DE = (x(t + h) − x(t))/h − x′(t).

truncation error:
lim_{n→∞} x_n = x; approximate x with x_N, N a large number.  TE = |x − x_N|.

rounding error:
We cannot express x exactly, due to the finite digit limitation; we get x̂ instead.  RE = |x − x̂|.

3. Well/Ill Conditioned Problems.

A problem is well-conditioned (or ill-conditioned) if every small perturbation of the data results in a small (or large) change in the solution.

Example: Show that the solution to the equations x + y = 2 and x + 1.01 y = 2.01 is ill-conditioned.

Exercise: Show that the following problems are ill-conditioned:

(a) the solution to the differential equation x″ − 10x′ − 11x = 0 with initial conditions x(0) = 1 and x′(0) = −1,

Example

x + y = 2,  x + 1.01 y = 2.01   ⇒   x = 1, y = 1.

Change 2.01 to 2.02:

x + y = 2,  x + 1.01 y = 2.02   ⇒   x = 0, y = 2.

I.e. a 0.5% change in the data produces a 100% change in the solution: ill-conditioned!

[reason: det (1 1; 1 1.01) = 0.01, i.e. the matrix is nearly singular]

4. Taylor Polynomials.

Suppose f, f′, …, f^(n) are continuous on [a, b] and f^(n+1) exists on (a, b). Let x_0 ∈ [a, b]. Then for every x ∈ [a, b] there exists a ξ between x_0 and x with

f(x) = Σ_{k=0}^{n} f^(k)(x_0)/k! (x − x_0)^k + R_n(x),

where R_n(x) = f^(n+1)(ξ)/(n + 1)! (x − x_0)^{n+1} is the remainder.

[Equivalently: R_n(x) = ∫_{x_0}^{x} f^(n+1)(t)/n! (x − t)^n dt.]

Examples:

• exp(x) = Σ_{k=0}^{∞} x^k/k!

• sin(x) = Σ_{k=0}^{∞} (−1)^k/(2k + 1)! x^{2k+1}

5. Gradient and Hessian Matrix.

Assume f : Rⁿ → R.

The gradient of f at a point x, written ∇f(x), is the column vector in Rⁿ with i-th component ∂f/∂x_i(x).

The Hessian matrix of f at x, written ∇²f(x), is the n × n matrix with (i, j)-th component ∂²f/∂x_i∂x_j(x). [As ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i, ∇²f(x) is symmetric.]

Examples:

• f(x) = aᵀx, a ∈ Rⁿ  ⇒  ∇f = a, ∇²f = 0

• f(x) = ½ xᵀA x, A symmetric  ⇒  ∇f(x) = A x, ∇²f = A

• f(x) = exp(½ xᵀA x), A symmetric  ⇒  ∇f(x) = exp(½ xᵀA x) A x,  ∇²f(x) = exp(½ xᵀA x) (A + A x xᵀA)

6. Taylor's Theorem.

Suppose that f : Rⁿ → R is continuously differentiable and that p ∈ Rⁿ. Then we have

f(x + p) = f(x) + ∇f(x + t p)ᵀ p   for some t ∈ (0, 1).

Moreover, if f is twice continuously differentiable, we have

∇f(x + p) = ∇f(x) + ∫_0^1 ∇²f(x + t p) p dt

and

f(x + p) = f(x) + ∇f(x)ᵀ p + ½ pᵀ ∇²f(x + t p) p   for some t ∈ (0, 1).

7. Positive Definite Matrices.

An n × n matrix A = (a_ij) is positive definite if it is symmetric (i.e. Aᵀ = A) and xᵀA x > 0 for all x ∈ Rⁿ \ {0}. [I.e. xᵀA x ≥ 0 with "=" only if x = 0.]

The following statements are equivalent:

(a) A is a positive definite matrix,
(b) all eigenvalues of A are positive,
(c) all leading principal minors of A are positive.

The leading principal minors of A are the determinants Δ_k, k = 1, 2, …, n, defined by

Δ_1 = det[a_11],  Δ_2 = det(a_11 a_12; a_21 a_22),  …,  Δ_n = det A.

A matrix A is symmetric and positive semi-definite if Aᵀ = A and xᵀA x ≥ 0 for all x ∈ Rⁿ.

Exercise.

8. Convex Sets and Functions.

A set S ⊂ Rⁿ is a convex set if the straight line segment connecting any two points in S lies entirely inside S, i.e., for any two points x, y ∈ S we have

α x + (1 − α) y ∈ S   ∀ α ∈ [0, 1].

A function f : D → R is a convex function if its domain D ⊂ Rⁿ is a convex set and if for any two points x, y ∈ D we have

f(α x + (1 − α) y) ≤ α f(x) + (1 − α) f(y)   ∀ α ∈ [0, 1].

Exercise.

Let D ⊂ Rⁿ be a convex, open set.

(a) If f : D → R is continuously differentiable, then f is convex if and only if f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) ∀ x, y ∈ D.

(b) If f : D → R is twice continuously differentiable, then f is convex if and only if ∇²f(x) is positive semi-definite for all x ∈ D.

Exercise 2.8.

(a) "⇒":

As f is convex we have, for any x, y in the convex set D, that

f(α y + (1 − α) x) ≤ α f(y) + (1 − α) f(x)   ∀ α ∈ [0, 1].

Hence f(y) ≥ [f(x + α(y − x)) − f(x)]/α + f(x). Letting α → 0 yields f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).

"⇐":

For any x_1, x_2 ∈ D and λ ∈ [0, 1] let x := λ x_1 + (1 − λ) x_2 ∈ D and y := x_1. On noting that y − x = x_1 − λ x_1 − (1 − λ) x_2 = (1 − λ)(x_1 − x_2) we have that

f(x_1) = f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) = f(x) + (1 − λ) ∇f(x)ᵀ(x_1 − x_2).   (∗)

Similarly, letting x := λ x_1 + (1 − λ) x_2 and y := x_2 gives, on noting that y − x = λ(x_2 − x_1), that

f(x_2) = f(y) ≥ f(x) + λ ∇f(x)ᵀ(x_2 − x_1).   (∗∗)

Combining λ (∗) + (1 − λ) (∗∗) gives

λ f(x_1) + (1 − λ) f(x_2) ≥ f(x) = f(λ x_1 + (1 − λ) x_2)  ⇒  f is convex.

(b) "⇐":

For any x, x_0 ∈ D use Taylor's theorem at x_0:

f(x) = f(x_0) + ∇f(x_0)ᵀ(x − x_0) + ½ (x − x_0)ᵀ ∇²f(x_0 + θ(x − x_0)) (x − x_0),   θ ∈ (0, 1).

As ∇²f is positive semi-definite, this immediately gives

f(x) ≥ f(x_0) + ∇f(x_0)ᵀ(x − x_0)  ⇒  f is convex.

"⇒":

Assume ∇²f is not positive semi-definite in the domain D. Then there exist x_0 ∈ D and x̂ ∈ Rⁿ s.t. x̂ᵀ ∇²f(x_0) x̂ < 0. As D is open we can find x_1 := x_0 + α x̂ ∈ D, for α > 0 sufficiently small.

9. Vector Norms.

A vector norm on Rⁿ is a function ‖·‖ from Rⁿ into R with the following properties:

(i) ‖x‖ ≥ 0 for all x ∈ Rⁿ, and ‖x‖ = 0 if and only if x = 0.
(ii) ‖α x‖ = |α| ‖x‖ for all α ∈ R and x ∈ Rⁿ.
(iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ Rⁿ.

Common vector norms are the l_1, l_2 (Euclidean), and l_∞ norms:

‖x‖_1 = Σ_{i=1}^n |x_i|,   ‖x‖_2 = (Σ_{i=1}^n x_i²)^{1/2},   ‖x‖_∞ = max_{1≤i≤n} |x_i|.

Exercise.

(a) Prove that ‖·‖_1, ‖·‖_2 and ‖·‖_∞ are norms.

(b) Given a symmetric positive definite matrix A, prove that ‖x‖_A := (xᵀA x)^{1/2} is a norm.

Example.

Draw the regions defined by ‖x‖_1 ≤ 1, ‖x‖_2 ≤ 1, ‖x‖_∞ ≤ 1 when n = 2.

[Figure: the unit balls of the l_1, l_2 and l_∞ norms.]

Exercise. Prove that for all x, y ∈ Rⁿ we have

(a) Σ_{i=1}^n |x_i y_i| ≤ ‖x‖_2 ‖y‖_2   [Schwarz inequality], and

10. Spectral Radius.

The spectral radius of a matrix A ∈ Rⁿˣⁿ is defined by ρ(A) = max_{1≤i≤n} |λ_i|, where λ_1, …, λ_n are the eigenvalues of A.

11. Matrix Norms.

For an n × n matrix A, the natural matrix norm ‖A‖ for a given vector norm ‖·‖ is defined by

‖A‖ = max_{‖x‖=1} ‖A x‖.

The common matrix norms are

‖A‖_1 = max_{1≤j≤n} Σ_{i=1}^n |a_ij|,   ‖A‖_2 = √(ρ(AᵀA))  (= ρ(A) if A = Aᵀ),   ‖A‖_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|.

Exercise: Compute ‖A‖_1, ‖A‖_∞, and ‖A‖_2 for

A = (1 1 0; 1 2 1; −1 1 2).

12. Convergence.

A sequence of vectors {x^(k)} ⊂ Rⁿ is said to converge to a vector x ∈ Rⁿ if ‖x^(k) − x‖ → 0 as k → ∞ for an arbitrary vector norm ‖·‖. This is equivalent to componentwise convergence, i.e., x_i^(k) → x_i as k → ∞, i = 1, …, n.

A square matrix A ∈ Rⁿˣⁿ is said to be convergent if ‖A^k‖ → 0 as k → ∞, which is equivalent to (A^k)_ij → 0 as k → ∞ for all i, j.

The following statements are equivalent:

(i) A is a convergent matrix,
(ii) ρ(A) < 1,
(iii) lim_{k→∞} A^k x = 0 for every x ∈ Rⁿ.

Exercise. Show that A is convergent, where A = (1/2 0; 1/4 1/2).

3. Algebraic Equations

1. Decomposition Methods for Linear Equations.

A matrix A ∈ Rⁿˣⁿ is said to have an LU decomposition if A = L U, where L ∈ Rⁿˣⁿ is a lower triangular matrix (l_ij = 0 if 1 ≤ i < j ≤ n) and U ∈ Rⁿˣⁿ is an upper triangular matrix (u_ij = 0 if 1 ≤ j < i ≤ n).

The decomposition is unique if one assumes e.g. l_ii = 1 for 1 ≤ i ≤ n.

L = (l_11; l_21 l_22; l_31 l_32 l_33; ⋮ ⋱; l_n1 l_n2 l_n3 … l_nn),
U = (u_11 u_12 u_13 … u_1n; u_22 u_23 … u_2n; ⋱ ⋮; u_{n−1,n−1} u_{n−1,n}; u_nn).

In general, the diagonal elements of either L or U are given, and the remaining elements of the matrices are determined by directly comparing the two sides of the equation A = L U.

The linear system A x = b is then equivalent to L y = b and U x = y.

Exercise.

Show that the solution to L y = b is

y_1 = b_1/l_11,   y_i = (b_i − Σ_{k=1}^{i−1} l_ik y_k)/l_ii,   i = 2, …, n

(forward substitution), and the solution to U x = y is

x_n = y_n/u_nn,   x_i = (y_i − Σ_{k=i+1}^{n} u_ik x_k)/u_ii,   i = n − 1, …, 1

(backward substitution).

2. Crout Algorithm. Exercise.

Let A be tridiagonal, i.e. a_ij = 0 if |i − j| > 1 (a_ij = 0 except perhaps a_{i−1,i}, a_ii and a_{i,i+1}), and strictly diagonally dominant (|a_ii| > Σ_{j≠i} |a_ij| for i = 1, …, n). Show that A can be factorized as A = L U, where l_ii = 1 for i = 1, …, n, u_11 = a_11, and

u_{i,i+1} = a_{i,i+1},   l_{i+1,i} = a_{i+1,i}/u_ii,   u_{i+1,i+1} = a_{i+1,i+1} − l_{i+1,i} u_{i,i+1}

for i = 1, …, n − 1. [Note: L and U are tridiagonal.]

C++ Exercise: Write a program to solve a tridiagonal and strictly diagonally dominant linear system.
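A possible sketch of such a solver, combining the Crout factorization from this exercise with forward and backward substitution (array layout and names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Solve A x = b for a tridiagonal, strictly diagonally dominant A using the
// Crout factorization above: A = L U with l_ii = 1.
// sub, sup have length n-1:  sub[i] = a_{i+2,i+1},  diag[i] = a_{i+1,i+1},  sup[i] = a_{i+1,i+2}.
std::vector<double> solve_tridiagonal(const std::vector<double>& sub,
                                      const std::vector<double>& diag,
                                      const std::vector<double>& sup,
                                      const std::vector<double>& b) {
    const std::size_t n = diag.size();
    std::vector<double> u(n), l(n), y(n), x(n);

    // Factorization: u_11 = a_11, l_{i+1,i} = a_{i+1,i}/u_ii,
    //                u_{i+1,i+1} = a_{i+1,i+1} - l_{i+1,i} u_{i,i+1}.
    u[0] = diag[0];
    for (std::size_t i = 0; i + 1 < n; ++i) {
        l[i + 1] = sub[i] / u[i];
        u[i + 1] = diag[i + 1] - l[i + 1] * sup[i];
    }

    // Forward substitution L y = b (l_ii = 1).
    y[0] = b[0];
    for (std::size_t i = 1; i < n; ++i) y[i] = b[i] - l[i] * y[i - 1];

    // Backward substitution U x = y (U is upper bidiagonal).
    x[n - 1] = y[n - 1] / u[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) x[i] = (y[i] - sup[i] * x[i + 1]) / u[i];
    return x;
}
```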

Exercise 3.2.

That u_11 = a_11 and u_{i,i+1} = a_{i,i+1}, l_{i+1,i} = a_{i+1,i}/u_ii, u_{i+1,i+1} = a_{i+1,i+1} − l_{i+1,i} u_{i,i+1}, for i = 1, …, n − 1, can easily be shown.

It remains to show that u_ii ≠ 0 for i = 1, …, n. We proceed by induction to show that |u_ii| > |a_{i,i+1}|, where for convenience we define a_{n,n+1} := 0.

• i = 1:  |u_11| = |a_11| > |a_{1,2}|.  ✓

• i ↦ i + 1:

|u_{i+1,i+1}| = |a_{i+1,i+1} − a_{i+1,i} a_{i,i+1}/u_ii| ≥ |a_{i+1,i+1}| − |a_{i+1,i}| |a_{i,i+1}|/|u_ii| ≥ |a_{i+1,i+1}| − |a_{i+1,i}| > |a_{i+1,i+2}|.  ✓

Overall we have that |u_ii| > 0, and so the Crout algorithm is well defined. Moreover,

3. Choleski Algorithm. Exercise.

Let A be a positive definite matrix. Show that A can be factorized as A = L Lᵀ, where L is a lower triangular matrix.

(i) Compute the 1st column:

l_11 = √a_11,   l_i1 = a_i1/l_11,   i = 2, …, n.

(ii) For j = 2, …, n − 1 compute the j-th column:

l_jj = (a_jj − Σ_{k=1}^{j−1} l_jk²)^{1/2},
l_ij = (a_ij − Σ_{k=1}^{j−1} l_ik l_jk)/l_jj,   i = j + 1, …, n.

(iii) Compute the n-th column:

l_nn = (a_nn − Σ_{k=1}^{n−1} l_nk²)^{1/2}.

4. Iterative Methods for Linear Equations.

Split A into A = M + N with M nonsingular, and convert the equation A x = b into the equivalent equation x = C x + d with C = −M⁻¹N and d = M⁻¹b.

Choose an initial vector x^(0) and then generate a sequence of vectors by

x^(k) = C x^(k−1) + d,   k = 1, 2, …

The resulting sequence converges to the solution of A x = b, for an arbitrary initial vector x^(0), if and only if ρ(C) < 1.

The objective is to choose M such that M⁻¹ is easy to compute and ρ(C) < 1. The iteration stops if ‖x^(k) − x^(k−1)‖ < ε.

Claim.

The iteration x^(k) = C x^(k−1) + d is convergent if and only if ρ(C) < 1.

Proof.

Define e^(k) := x^(k) − x, the error of the k-th iterate. Then

e^(k) = C x^(k−1) + d − (C x + d) = C (x^(k−1) − x) = C e^(k−1) = C² e^(k−2) = … = C^k e^(0),

where e^(0) = x^(0) − x is an arbitrary vector.

Assume C is similar to the diagonal matrix Λ = diag(λ_1, …, λ_n), where λ_i are the eigenvalues of C.

⇒ ∃ X nonsingular s.t. C = X Λ X⁻¹
⇒ e^(k) = C^k e^(0) = X Λ^k X⁻¹ e^(0) = X diag(λ_1^k, …, λ_n^k) X⁻¹ e^(0) → 0 as k → ∞
⟺ |λ_i| < 1 ∀ i = 1, …, n,  i.e. ρ(C) < 1.

5. Jacobi Algorithm.

Exercise: Let M = D and N = L + U (L the strict lower triangular part of A, D the diagonal, U the strict upper triangular part). Show that the i-th component at the k-th iteration is

x_i^(k) = (1/a_ii) [b_i − Σ_{j=1}^{i−1} a_ij x_j^(k−1) − Σ_{j=i+1}^{n} a_ij x_j^(k−1)]

for i = 1, …, n.

6. Gauss–Seidel Algorithm.

Exercise: Let M = D + L and N = U. Show that the i-th component at the k-th iteration is

x_i^(k) = (1/a_ii) [b_i − Σ_{j=1}^{i−1} a_ij x_j^(k) − Σ_{j=i+1}^{n} a_ij x_j^(k−1)]

for i = 1, …, n.

7. SOR Algorithm. Exercise.

Let M = ω⁻¹ D + L and N = U + (1 − ω⁻¹) D, where 0 < ω < 2. Show that the i-th component at the k-th iteration is

x_i^(k) = (1 − ω) x_i^(k−1) + (ω/a_ii) [b_i − Σ_{j=1}^{i−1} a_ij x_j^(k) − Σ_{j=i+1}^{n} a_ij x_j^(k−1)]

for i = 1, …, n.

C++ Exercise: Write a program to solve a diagonally dominant linear equation system with the SOR method.
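A possible sketch of the SOR iteration for A x = b with dense storage (names, tolerance and iteration cap are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One possible SOR solver for a diagonally dominant system A x = b, 0 < omega < 2.
std::vector<double> sor(const std::vector<std::vector<double>>& A,
                        const std::vector<double>& b,
                        double omega, double eps = 1e-10, int max_iter = 10000) {
    const std::size_t n = b.size();
    std::vector<double> x(n, 0.0);                 // initial vector x^(0) = 0
    for (int k = 0; k < max_iter; ++k) {
        double diff = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double s = b[i];
            for (std::size_t j = 0; j < n; ++j)
                if (j != i) s -= A[i][j] * x[j];   // x[j] for j < i already holds the new value
            double x_new = (1.0 - omega) * x[i] + omega * s / A[i][i];
            diff = std::max(diff, std::abs(x_new - x[i]));
            x[i] = x_new;
        }
        if (diff < eps) break;                     // stop if ||x^(k) - x^(k-1)|| < eps
    }
    return x;
}
```

With ω = 1 this update reduces to the Gauss–Seidel iteration; using only the old values x_j^(k−1) throughout would give the Jacobi iteration.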

8. Special Matrices.

If A is strictly diagonally dominant, then Jacobi and Gauss–Seidel converge for any initial vector x^(0). In addition, SOR converges for ω ∈ (0, 1].

If A is positive definite and 0 < ω < 2, then the SOR method converges for any initial vector x^(0).

If A is positive definite and tridiagonal, then ρ(C_GS) = [ρ(C_J)]² < 1 and the optimal choice of ω for the SOR method is

ω = 2 / (1 + √(1 − ρ(C_GS))) ∈ [1, 2).

With this choice of ω, ρ(C_SOR) = ω − 1 ≤ ρ(C_GS).

Exercise.

Find the optimal ω for the SOR method for the matrix

A = (4 3 0; 3 4 −1; 0 −1 4).

9. Condition Numbers.

The condition number of a nonsingular matrix A relative to a norm ‖·‖ is defined by

κ(A) = ‖A‖ · ‖A⁻¹‖.

Note that κ(A) ≥ ‖A A⁻¹‖ = ‖I‖ = max_{‖x‖=1} ‖x‖ = 1.

A matrix A is well-conditioned if κ(A) is close to one, and is ill-conditioned if κ(A) is much larger than one.

Suppose ‖δA‖ < 1/‖A⁻¹‖. Then the solution x̃ to (A + δA) x̃ = b + δb approximates the solution x of A x = b with the error estimate

‖x − x̃‖/‖x‖ ≤ κ(A)/(1 − ‖δA‖ ‖A⁻¹‖) (‖δb‖/‖b‖ + ‖δA‖/‖A‖).

In particular, if δA = 0 (no perturbation to the matrix A), then

‖x − x̃‖/‖x‖ ≤ κ(A) ‖δb‖/‖b‖.

Example.

Consider Example 1.3.

A = (1 1; 1 1.01)  ⇒  A⁻¹ = (1/det A) (1.01 −1; −1 1) = (1/0.01) (1.01 −1; −1 1) = (101 −100; −100 100).

Recall ‖A‖_1 = max_{1≤j≤n} Σ_{i=1}^n |a_ij|. Hence

‖A‖_1 = max(2, 2.01) = 2.01,   ‖A⁻¹‖_1 = max(201, 200) = 201.

⇒ κ_1(A) = ‖A‖_1 · ‖A⁻¹‖_1 = 404.01 ≫ 1 (ill-conditioned!)

Similarly κ_∞ = 404.01 and κ_2 = ρ(A) ρ(A⁻¹) = λ_max(A)/λ_min(A).

10. Hilbert Matrix.

An n × n Hilbert matrix H_n = (h_ij) is defined by h_ij = 1/(i + j − 1) for i, j = 1, 2, …, n.

Hilbert matrices are notoriously ill-conditioned and κ(H_n) → ∞ very rapidly as n → ∞.

H_n = ( 1      1/2      …  1/n
        1/2    1/3      …  1/(n+1)
        ⋮                 ⋮
        1/n    1/(n+1)  …  1/(2n−1) )

Exercise.

11. Fixed Point Method for Nonlinear Equations.

A function g : R → R has a fixed point x̄ if g(x̄) = x̄.

A function g is a contraction mapping on [a, b] if g : [a, b] → [a, b] and

|g′(x)| ≤ L < 1   ∀ x ∈ (a, b),

where L is a constant.

Exercise.

Assume g is a contraction mapping on [a, b]. Prove that g has a unique fixed point x̄ in [a, b], and that for any x_0 ∈ [a, b] the sequence defined by

x_{n+1} = g(x_n),   n ≥ 0,

converges to x̄.

Exercise 3.11.

Existence:

Define h(x) = x − g(x) on [a, b]. Then h(a) = a − g(a) ≤ 0 and h(b) = b − g(b) ≥ 0. As h is continuous, ∃ c ∈ [a, b] s.t. h(c) = 0, i.e. c = g(c). ✓

Uniqueness:

Suppose p, q ∈ [a, b] are two fixed points. Then

|p − q| = |g(p) − g(q)| = |g′(α) (p − q)|   (MVT, α ∈ (a, b))   ≤ L |p − q|

⇒ (1 − L) |p − q| ≤ 0  ⇒  |p − q| ≤ 0  ⇒  p = q. ✓

Convergence:

|x_n − x̄| = |g(x_{n−1}) − g(x̄)| = |g′(α) (x_{n−1} − x̄)| ≤ L |x_{n−1} − x̄| ≤ … ≤ Lⁿ |x_0 − x̄| → 0 as n → ∞.

Hence x_n → x̄.

12. Newton Method for Nonlinear Equations.

Assume that f ∈ C¹([a, b]), f(x̄) = 0 (x̄ is a root or zero) and f′(x̄) ≠ 0.

The Newton method can be used to find the root x̄ by generating a sequence {x_n} satisfying

x_{n+1} = x_n − f(x_n)/f′(x_n),   n = 0, 1, …,

provided f′(x_n) ≠ 0 for all n.

The sequence x_n converges to the root x̄ as long as the initial point x_0 is sufficiently close to x̄.

The algorithm stops if |x_{n+1} − x_n| < ε, a prescribed error tolerance, and x_{n+1} is taken as an approximation to x̄.


[Derivation: the tangent line to y = f(x) at (x_n, f(x_n)) is Y = f(x_n) + f′(x_n) (X − x_n). Setting Y = 0 yields x_{n+1} := X = x_n − f(x_n)/f′(x_n).]


13. Choice of Initial Point.

Suppose f ∈ C²([a, b]) and f(x̄) = 0 with f′(x̄) ≠ 0. Then there exists δ > 0 such that the Newton method generates a sequence x_n converging to x̄ for any initial point x_0 ∈ [x̄ − δ, x̄ + δ] (x_0 can only be chosen locally).

However, if f satisfies the following additional conditions:

1. f(a) f(b) < 0,

2. f″ does not change sign on [a, b],

3. the tangent lines to the curve y = f(x) at both a and b cut the x-axis within [a, b] (i.e. a − f(a)/f′(a), b − f(b)/f′(b) ∈ [a, b]),

then f(x) = 0 has a unique root x̄ in [a, b] and the Newton method converges to x̄ for any initial point x_0 ∈ [a, b] (x_0 can be chosen globally).

Example.

Find x = √c, c > 1.

Answer.

x̄ is the root of f(x) := x² − c. Newton:

x_{n+1} = x_n − f(x_n)/f′(x_n) = x_n − (x_n² − c)/(2 x_n) = ½ (x_n + c/x_n),   n ≥ 0.

How to choose x_0? Check the 3 conditions on [1, c]:

1. f(1) = 1 − c < 0, f(c) = c² − c > 0, so f(1) f(c) < 0. ✓

2. f″ = 2 does not change sign. ✓

3. Tangent line at 1: Y = f(1) + f′(1)(X − 1) = 1 − c + 2(X − 1). Setting Y = 0 gives X = 1 + (c − 1)/2 ∈ (1, c). ✓
   Tangent line at c: Y = f(c) + f′(c)(X − c) = c² − c + 2c(X − c). Setting Y = 0 gives X = c − (c − 1)/2 ∈ (1, c). ✓

Hence the Newton method converges for any x_0 ∈ [1, c].

Numerical Example.

Find √7. (From a calculator: √7 = 2.6457513.) Newton converges for all x_0 ∈ [1, 7]. Choose x_0 = 4.

x_1 = ½ (x_0 + 7/x_0) = 2.875
x_2 = 2.6548913
x_3 = 2.6457670
x_4 = 2.6457513

Comparison to the bisection method with I_0 = [1, 7]:

I_1 = [1, 4], I_2 = [2.5, 4], I_3 = [2.5, 3.25], I_4 = [2.5, 2.875], …
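A minimal sketch of this Newton iteration for √c (function name and tolerance are illustrative):

```cpp
#include <cmath>
#include <iostream>

// Newton iteration x_{n+1} = (x_n + c/x_n)/2 for f(x) = x*x - c.
double newton_sqrt(double c, double x0, double eps = 1e-12) {
    double x = x0;
    for (int n = 0; n < 100; ++n) {
        double x_next = 0.5 * (x + c / x);
        if (std::abs(x_next - x) < eps) return x_next;  // stop if |x_{n+1} - x_n| < eps
        x = x_next;
    }
    return x;
}

int main() {
    std::cout.precision(8);
    std::cout << newton_sqrt(7.0, 4.0) << "\n";   // approx 2.6457513
}
```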

14. Pitfalls.

Here are some difficulties which may be encountered with the Newton method:

1. {x_n} may wander around and not converge (there are only complex roots to the equation),

2. the initial approximation x_0 is too far away from the desired root and {x_n} converges to some other root (this usually happens when f′(x_0) is small),

3. {x_n} may diverge to +∞ (the function f is positive and monotonically decreasing on an unbounded interval), and

15. Rate of Convergence.

Suppose {x_n} is a sequence that converges to x̄.

The convergence is said to be linear if there is a constant r ∈ (0, 1) such that

|x_{n+1} − x̄| / |x_n − x̄| ≤ r   for all n sufficiently large.

The convergence is said to be superlinear if

lim_{n→∞} |x_{n+1} − x̄| / |x_n − x̄| = 0.

In particular, the convergence is said to be quadratic if

|x_{n+1} − x̄| / |x_n − x̄|² ≤ M   for all n sufficiently large,

where M is a positive constant, not necessarily less than 1.

Example: x_n = x̄ + 0.5ⁿ converges linearly, x_n = x̄ + 0.5^(2ⁿ) converges quadratically.

Example.

Define g(x) = x − f(x)/f′(x). Then the Newton method is given by x_{n+1} = g(x_n).

Moreover, f(x̄) = 0 and f′(x̄) ≠ 0 imply that

g(x̄) = x̄,
g′(x̄) = 1 − [(f′)² − f f″]/(f′)² (x̄) = f(x̄) f″(x̄)/(f′(x̄))² = 0,
g″(x̄) = f″(x̄)/f′(x̄).

Assuming that x_n → x̄ we have that

|x_{n+1} − x̄| / |x_n − x̄|² = |g(x_n) − g(x̄)| / |x_n − x̄|²
  = |g′(x̄)(x_n − x̄) + ½ g″(η_n)(x_n − x̄)²| / |x_n − x̄|²   (Taylor)
  = ½ |g″(η_n)| → ½ |f″(x̄)/f′(x̄)| as n → ∞,

i.e. the Newton method converges quadratically.

4. Interpolations

1. Polynomial Approximation.

For any continuous function f defined on an interval [a, b], there exist polynomials P that can be as “close” to the given function as desired.

Taylor polynomials agree closely with a given function at a specific point, but they concentrate their accuracy only near that point.

2. Interpolating Polynomial – Lagrange Form.

Suppose x_i ∈ [a, b], i = 0, 1, …, n, are pairwise distinct mesh points in [a, b]. The Lagrange polynomial p is a polynomial of degree n such that

p(x_i) = f(x_i),   i = 0, 1, …, n.

p can be constructed explicitly as

p(x) = Σ_{i=0}^{n} L_i(x) f(x_i),

where L_i is a polynomial of degree n satisfying L_i(x_j) = 0 for j ≠ i and L_i(x_i) = 1. This results in

L_i(x) = Π_{j≠i} (x − x_j)/(x_i − x_j),   i = 0, 1, …, n.

Exercise.

Find the Lagrange polynomial p for the following points (x, f(x)): (1, 0), (−1, 3), and (2, 4). Assume that a new point (0, 2) is observed, and construct a Lagrange polynomial to incorporate this new information.

Error formula.

Suppose f is n + 1 times differentiable on [a, b]. Then it holds that

f(x) = p(x) + f^(n+1)(ξ)/(n + 1)! (x − x_0) ⋯ (x − x_n),

where ξ = ξ(x) lies in (a, b).

Proof.

Define g(x) = f(x) − p(x) + λ Π_{j=0}^{n} (x − x_j), where λ is chosen such that g(α) = 0 for a fixed α ∉ {x_0, …, x_n}.

Hence

g(x) = f(x) − p(x) − (f(α) − p(α)) Π_{j=0}^{n} (x − x_j)/(α − x_j).

⇒ g has at least n + 2 zeros: x_0, …, x_n, α. The Mean Value Theorem yields that

g′ has at least n + 1 zeros,
⋮
g^(n+1) has at least 1 zero, say ξ.

Hence

0 = g^(n+1)(ξ) = f^(n+1)(ξ) − (f(α) − p(α)) (n + 1)! / Π_{j=0}^{n} (α − x_j)

⇒ Error = f(α) − p(α) = f^(n+1)(ξ)/(n + 1)! Π_{j=0}^{n} (α − x_j).

3. Trapezoid Rule.

We can use linear interpolation (n = 1, x_0 = a, x_1 = b) to approximate f(x) on [a, b] and then compute ∫_a^b f(x) dx to get the trapezoid rule:

∫_a^b f(x) dx ≈ ½ (b − a) [f(a) + f(b)].

If we partition [a, b] into n equal subintervals with mesh points x_i = a + i h, i = 0, …, n, and step size h = (b − a)/n, we can derive the composite trapezoid rule:

∫_a^b f(x) dx ≈ h/2 [f(x_0) + 2 Σ_{i=1}^{n−1} f(x_i) + f(x_n)].
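A minimal sketch of the composite trapezoid rule (the integrand type and names are illustrative):

```cpp
#include <functional>

// Composite trapezoid rule on [a, b] with n equal subintervals of size h = (b-a)/n:
// integral ≈ h/2 [ f(x_0) + 2 (f(x_1)+...+f(x_{n-1})) + f(x_n) ].
double trapezoid(const std::function<double(double)>& f, double a, double b, int n) {
    const double h = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));
    for (int i = 1; i < n; ++i) sum += f(a + i * h);
    return h * sum;
}
```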

Use linear interpolation (n = 1, x_0 = a, x_1 = b) to approximate f(x) on [a, b] and then compute ∫_a^b f(x) dx.

Answer.

The linear interpolating polynomial is p(x) = f(a) L_0(x) + f(b) L_1(x), where

L_0(x) = (x − b)/(a − b)   and   L_1(x) = (x − a)/(b − a).

∫_a^b f(x) dx ≈ ∫_a^b p(x) dx = f(a) ∫_a^b (x − b)/(a − b) dx + f(b) ∫_a^b (x − a)/(b − a) dx
  = f(a) [1/(a − b)] [½ (x − b)²]_a^b + f(b) [1/(b − a)] [½ (x − a)²]_a^b
  = f(a) [1/(a − b)] (−½ (a − b)²) + f(b) [1/(b − a)] (½ (b − a)²)
  = (b − a)/2 (f(a) + f(b))   ← Trapezoid Rule

Error Analysis.

Let f(x) = p(x) + E(x), where E(x) = f″(ξ)/2 (x − a)(x − b) with ξ ∈ (a, b). Assume that |f″| ≤ M is bounded. Then

|∫_a^b E(x) dx| ≤ ∫_a^b |E(x)| dx ≤ M/2 ∫_a^b (x − a)(b − x) dx
  = M/2 ∫_a^b (x − a)[(b − a) − (x − a)] dx
  = M/2 ∫_a^b [−(x − a)² + (b − a)(x − a)] dx
  = M/2 [−⅓ (b − a)³ + ½ (b − a)³]
  = M/12 (b − a)³.

The composite formula can be obtained by considering the partitioning of [a, b] into a = x_0 < x_1 < … < x_{n−1} < x_n = b, where x_i = a + i h with h := (b − a)/n:

∫_a^b f(x) dx = Σ_{i=0}^{n−1} ∫_{x_i}^{x_{i+1}} f(x) dx ≈ Σ_{i=0}^{n−1} (x_{i+1} − x_i)/2 (f(x_i) + f(x_{i+1}))
  = Σ_{i=0}^{n−1} h/2 (f(x_i) + f(x_{i+1}))
  = h [½ f(a) + f(x_1) + … + f(x_{n−1}) + ½ f(b)].

Error analysis then yields that

Error ≤ (M/12) h³ n = M (b − a)/12 · h².

4. Simpson's Rule. Exercise.

Use quadratic interpolation (n = 2, x_0 = a, x_1 = (a + b)/2, x_2 = b) to approximate f(x) on [a, b] and then compute ∫_a^b f(x) dx to get Simpson's rule:

∫_a^b f(x) dx ≈ (b − a)/6 [f(a) + 4 f((a + b)/2) + f(b)].

Derive the composite Simpson's rule:

∫_a^b f(x) dx ≈ h/3 [f(x_0) + 2 Σ_{i=2}^{n/2} f(x_{2i−2}) + 4 Σ_{i=1}^{n/2} f(x_{2i−1}) + f(x_n)],

where n is an even number and x_i and h are chosen as in the composite trapezoid rule.
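A matching sketch of the composite Simpson rule (n must be even; names are illustrative):

```cpp
#include <functional>
#include <stdexcept>

// Composite Simpson's rule with an even number n of subintervals, h = (b-a)/n:
// integral ≈ h/3 [ f(x_0) + 4 f(x_1) + 2 f(x_2) + 4 f(x_3) + ... + f(x_n) ].
double simpson(const std::function<double(double)>& f, double a, double b, int n) {
    if (n % 2 != 0) throw std::invalid_argument("n must be even");
    const double h = (b - a) / n;
    double sum = f(a) + f(b);
    for (int i = 1; i < n; ++i)
        sum += (i % 2 == 1 ? 4.0 : 2.0) * f(a + i * h);   // odd nodes weight 4, even nodes weight 2
    return h / 3.0 * sum;
}
```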

5. Newton–Cotes Formula.

Suppose x_0, …, x_n are mesh points in [a, b] (usually the mesh points are equally spaced with x_0 = a, x_n = b). Then the integral can be approximated by the Newton–Cotes formula:

∫_a^b f(x) dx ≈ Σ_{i=0}^{n} A_i f(x_i),

where the parameters A_i are determined in such a way that the formula is exact for all polynomials of degree ≤ n.

[Note: n + 1 unknowns A_i and n + 1 coefficients for a polynomial of degree n.]

Exercise. Use the Newton–Cotes formula to derive the trapezoid rule and Simpson's rule. Prove that if f is n + 1 times differentiable and |f^(n+1)| ≤ M on [a, b] then

|∫_a^b f(x) dx − Σ_{i=0}^{n} A_i f(x_i)| ≤ M/(n + 1)! ∫_a^b Π_{i=0}^{n} |x − x_i| dx.

Exercise 4.5.

We have that

∫_a^b q(x) dx = Σ_{i=0}^{n} A_i q(x_i)   for all polynomials q of degree ≤ n.

Let q(x) = L_j(x), where L_j is the j-th Lagrange polynomial for the data points x_0, x_1, …, x_n, i.e. L_j is of degree n and satisfies L_j(x_i) = δ_ij (= 1 if i = j, 0 if i ≠ j). Now

∫_a^b L_j dx = Σ_{i=0}^{n} A_i L_j(x_i) = A_j.

Hence

∫_a^b f(x) dx − Σ_{i=0}^{n} A_i f(x_i) = ∫_a^b f(x) dx − Σ_{i=0}^{n} f(x_i) ∫_a^b L_i(x) dx
  = ∫_a^b f(x) dx − ∫_a^b Σ_{i=0}^{n} f(x_i) L_i(x) dx = ∫_a^b f(x) dx − ∫_a^b p(x) dx,

where p is the Lagrange interpolating polynomial; this construction yields the trapezoid rule (n = 1) and Simpson's rule (n = 2, with x_1 = (a + b)/2). The Lagrange polynomial has the error term

f(x) = p(x) + E(x),   E(x) := f^(n+1)(ξ)/(n + 1)! (x − x_0) ⋯ (x − x_n),

where ξ = ξ(x) lies in (a, b). Hence

|∫_a^b f(x) dx − ∫_a^b p(x) dx| = |∫_a^b E(x) dx| ≤ ∫_a^b |E(x)| dx ≤ M/(n + 1)! ∫_a^b Π_{i=0}^{n} |x − x_i| dx.

6. Ordinary Differential Equations.

An initial value problem for an ODE has the form

x′(t) = f(t, x(t)),   a ≤ t ≤ b,   x(a) = x_0.   (1)

(1) is equivalent to the integral equation

x(t) = x_0 + ∫_a^t f(s, x(s)) ds,   a ≤ t ≤ b.   (2)

To solve (2) numerically we divide [a, b] into subintervals with mesh points t_i = a + i h, i = 0, …, n, and step size h = (b − a)/n. (2) implies

x(t_{i+1}) = x(t_i) + ∫_{t_i}^{t_{i+1}} f(s, x(s)) ds.

(a) If we approximate f(s, x(s)) on [t_i, t_{i+1}] by f(t_i, x(t_i)), we get the Euler (explicit) method for equation (1):

w_{i+1} = w_i + h f(t_i, w_i),   w_0 = x_0.

We have x(t_{i+1}) ≈ w_{i+1} if h is sufficiently small.
[Taylor: x(t_{i+1}) = x(t_i) + x′(t_i) h + O(h²) = x(t_i) + f(t_i, x(t_i)) h + O(h²).]

(b) If we approximate f(s, x(s)) on [t_i, t_{i+1}] by linear interpolation with the points (t_i, f(t_i, x(t_i))) and (t_{i+1}, f(t_{i+1}, x(t_{i+1}))), we get the trapezoidal (implicit) method for equation (1):

w_{i+1} = w_i + h/2 [f(t_i, w_i) + f(t_{i+1}, w_{i+1})],   w_0 = x_0.

(c) If we combine the Euler method with the trapezoidal method, we get the modified Euler (explicit) method (or Runge–Kutta method of 2nd order):

w_{i+1} = w_i + h/2 [f(t_i, w_i) + f(t_{i+1}, w_i + h f(t_i, w_i))],   w_0 = x_0,

where w_i + h f(t_i, w_i) ≈ w_{i+1} is the Euler predictor.
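A minimal sketch of the modified Euler (RK2) step for x′ = f(t, x) (names are illustrative):

```cpp
#include <functional>
#include <vector>

// Integrate x'(t) = f(t, x) on [a, b] with n steps of size h = (b-a)/n,
// returning the approximations w_0, ..., w_n.
std::vector<double> modified_euler(const std::function<double(double, double)>& f,
                                   double a, double b, double x0, int n) {
    const double h = (b - a) / n;
    std::vector<double> w(n + 1);
    w[0] = x0;
    for (int i = 0; i < n; ++i) {
        double t  = a + i * h;
        double k1 = f(t, w[i]);                    // Euler slope
        double k2 = f(t + h, w[i] + h * k1);       // slope at the Euler predictor
        w[i + 1] = w[i] + 0.5 * h * (k1 + k2);     // modified Euler (RK2) update
    }
    return w;
}
```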

7. Divided Differences.

Suppose a function f and (n + 1) distinct points x_0, x_1, …, x_n are given. Divided differences of f can be expressed in a table as follows:

x_k   0DD     1DD         2DD             3DD
x_0   f[x_0]
x_1   f[x_1]  f[x_0,x_1]
x_2   f[x_2]  f[x_1,x_2]  f[x_0,x_1,x_2]
x_3   f[x_3]  f[x_2,x_3]  f[x_1,x_2,x_3]  f[x_0,x_1,x_2,x_3]

where

f[x_i] = f(x_i),
f[x_i, x_{i+1}] = (f[x_{i+1}] − f[x_i]) / (x_{i+1} − x_i),
f[x_i, x_{i+1}, …, x_{i+k}] = (f[x_{i+1}, …, x_{i+k}] − f[x_i, …, x_{i+k−1}]) / (x_{i+k} − x_i),
f[x_1, …, x_n] = (f[x_2, …, x_n] − f[x_1, …, x_{n−1}]) / (x_n − x_1).

8. Interpolating Polynomial – Newton Form.

One drawback of Lagrange polynomials is that there is no recursive relationship between P_{n−1} and P_n, which implies that each polynomial has to be constructed individually. Hence, in practice one uses the Newton polynomials.

The Newton interpolating polynomial P_n of degree n that agrees with f at the points x_0, x_1, …, x_n is given by

P_n(x) = f[x_0] + Σ_{k=1}^{n} f[x_0, x_1, …, x_k] Π_{i=0}^{k−1} (x − x_i).

Note that P_n can be computed recursively using the relation

P_n(x) = P_{n−1}(x) + f[x_0, x_1, …, x_n] (x − x_0)(x − x_1) ⋯ (x − x_{n−1}).

[Note that f[x_0, x_1, …, x_k] can be found on the diagonal of the DD table.]

Exercise.

Exercise 4.8.

Data points: (1, 2), (−2, 56), (0, 2), (3, −4), (−1, 16), (7, −376).

x_k   0DD    1DD   2DD   3DD   4DD   5DD
 1      2
−2     56   −18
 0      2   −27     9
 3     −4    −2     5    −2
−1     16    −5     3    −2     0
 7   −376   −49   −11    −2     0     0

Newton polynomial:

p(x) = 2 − 18 (x − 1) + 9 (x − 1)(x + 2) − 2 (x − 1)(x + 2) x = −2x³ + 7x² − 5x + 2.
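A minimal sketch of building the divided-difference coefficients and evaluating the Newton polynomial, as in this exercise (class and variable names are illustrative):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Divided-difference coefficients f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]
// (the diagonal of the DD table) and nested evaluation of the Newton polynomial.
struct NewtonPoly {
    std::vector<double> x, coef;

    NewtonPoly(std::vector<double> xs, std::vector<double> fs)
        : x(std::move(xs)), coef(std::move(fs)) {
        const std::size_t n = x.size();
        // Column by column, overwrite coef[i] with the divided difference f[x_{i-k},...,x_i];
        // at the end coef[i] = f[x_0,...,x_i].
        for (std::size_t k = 1; k < n; ++k)
            for (std::size_t i = n - 1; i >= k; --i)
                coef[i] = (coef[i] - coef[i - 1]) / (x[i] - x[i - k]);
    }

    // P(t) = c_0 + (t - x_0)(c_1 + (t - x_1)(c_2 + ...)).
    double operator()(double t) const {
        double p = coef.back();
        for (std::size_t i = coef.size() - 1; i-- > 0;)
            p = coef[i] + (t - x[i]) * p;
        return p;
    }
};
```

For the data of Exercise 4.8 this evaluates the cubic −2x³ + 7x² − 5x + 2 found above.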

9. Piecewise Polynomial Approximations.

Another drawback of interpolating polynomials is that P_n tends to oscillate widely when n is large, which implies that P_n(x) may be a poor approximation to f(x) if x is not close to the interpolating points.

If an interval [a, b] is divided into a set of subintervals [x_i, x_{i+1}], i = 0, 1, …, n − 1, and on each subinterval a different polynomial is constructed to approximate a function f, such an approximation is called a spline.

The simplest spline is the linear spline P, which on each interval [x_i, x_{i+1}], i = 0, 1, …, n − 1, is the linear polynomial agreeing with f at x_i and x_{i+1}.

10. Natural Cubic Splines.

Given a function f defined on [a, b] and a set of points a = x_0 < x_1 < ⋯ < x_n = b, a function S is called a natural cubic spline if there exist n cubic polynomials S_i such that:

(a) S(x) = S_i(x) for x ∈ [x_i, x_{i+1}] and i = 0, 1, …, n − 1;

(b) S_i(x_i) = f(x_i) and S_i(x_{i+1}) = f(x_{i+1}) for i = 0, 1, …, n − 1;

(c) S′_{i+1}(x_{i+1}) = S′_i(x_{i+1}) for i = 0, 1, …, n − 2;

(d) S″_{i+1}(x_{i+1}) = S″_i(x_{i+1}) for i = 0, 1, …, n − 2;

(e) S″(x_0) = S″(x_n) = 0.

Natural Cubic Splines.

[Figure: neighbouring spline pieces S_i on [x_i, x_{i+1}] and S_{i+1} on [x_{i+1}, x_{i+2}].]

(a) 4n parameters,
(b) 2n equations,
(c) n − 1 equations,
(d) n − 1 equations,
(e) 2 equations,

i.e. 4n parameters and in total 4n equations.

Example. Assume S is a natural cubic spline that interpolates f ∈ C²([a, b]) at the nodes a = x_0 < x_1 < ⋯ < x_n = b. We have the following smoothness property of cubic splines:

∫_a^b [S″(x)]² dx ≤ ∫_a^b [f″(x)]² dx.

In fact, it even holds that

∫_a^b [S″(x)]² dx = min_{g ∈ G} ∫_a^b [g″(x)]² dx,

where G := {g ∈ C²([a, b]) : g(x_i) = f(x_i), i = 0, 1, …, n}.

Exercise: Determine the parameters a to h so that S(x) is a natural cubic spline, where

S(x) = a x³ + b x² + c x + d for x ∈ [−1, 0]   and   S(x) = e x³ + f x² + g x + h for x ∈ [0, 1],

with interpolation conditions S(−1) = 1, S(0) = 2, and S(1) = 1.

11. Computation of Natural Cubic Splines.

Denote c_i = S″(x_i), i = 0, 1, …, n. Then c_0 = c_n = 0.

Since S_i is a cubic function on [x_i, x_{i+1}], we know that S″_i is a linear function on [x_i, x_{i+1}]. Hence, with h_i := x_{i+1} − x_i, it can be written as

S″_i(x) = c_i (x_{i+1} − x)/h_i + c_{i+1} (x − x_i)/h_i.

Exercise.

Show that S_i is given by

S_i(x) = c_i/(6 h_i) (x_{i+1} − x)³ + c_{i+1}/(6 h_i) (x − x_i)³ + p_i (x_{i+1} − x) + q_i (x − x_i),

where

p_i = f(x_i)/h_i − c_i h_i/6,   q_i = f(x_{i+1})/h_i − c_{i+1} h_i/6,

and c_1, …, c_{n−1} satisfy the linear equations

h_{i−1} c_{i−1} + 2 (h_{i−1} + h_i) c_i + h_i c_{i+1} = u_i,

where

u_i = 6 (d_i − d_{i−1}),   d_i = (f(x_{i+1}) − f(x_i))/h_i,

for i = 1, 2, …, n − 1.

C++ Exercise: Write a program to construct a natural cubic spline; see the sketch below.
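A possible sketch of the first step of that program: assembling and solving the tridiagonal system for c_1, …, c_{n−1} above (names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Second derivatives c_0, ..., c_n of the natural cubic spline at x_0 < ... < x_n,
// from h_{i-1} c_{i-1} + 2(h_{i-1}+h_i) c_i + h_i c_{i+1} = 6(d_i - d_{i-1}),
// i = 1, ..., n-1, with c_0 = c_n = 0 (tridiagonal elimination as in Section 3).
std::vector<double> spline_second_derivatives(const std::vector<double>& x,
                                              const std::vector<double>& f) {
    const std::size_t n = x.size() - 1;
    std::vector<double> h(n), d(n), c(n + 1, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        h[i] = x[i + 1] - x[i];
        d[i] = (f[i + 1] - f[i]) / h[i];
    }

    std::vector<double> diag(n + 1, 0.0), rhs(n + 1, 0.0);
    for (std::size_t i = 1; i < n; ++i) {
        diag[i] = 2.0 * (h[i - 1] + h[i]);
        rhs[i]  = 6.0 * (d[i] - d[i - 1]);
    }
    // Forward elimination (sub- and super-diagonal entry of row i is h[i-1] resp. h[i]).
    for (std::size_t i = 2; i < n; ++i) {
        double m = h[i - 1] / diag[i - 1];
        diag[i] -= m * h[i - 1];
        rhs[i]  -= m * rhs[i - 1];
    }
    // Back substitution (c[n] = 0 already).
    for (std::size_t i = n - 1; i >= 1; --i)
        c[i] = (rhs[i] - h[i] * c[i + 1]) / diag[i];
    return c;   // the spline pieces S_i then follow from the formulas above
}
```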

5. Basic Probability Theory

1. CDF and PDF.

Let (Ω, F, P) be a probability space and X a random variable. The cumulative distribution function (cdf) F of X is defined by

F(x) = P(X ≤ x),   x ∈ R.

F is an increasing, right-continuous function satisfying F(−∞) = 0, F(+∞) = 1.

If F is absolutely continuous then X has a probability density function (pdf) f defined by

f(x) = F′(x),   x ∈ R.

F can be recovered from f by the relation

F(x) = ∫_{−∞}^{x} f(u) du.

2. Normal Distribution.

A random variable X has a normal distribution with parameters μ and σ², written X ∼ N(μ, σ²), if X has the pdf

φ(x) = 1/(σ√(2π)) e^{−(x − μ)²/(2σ²)}   for x ∈ R.

If μ = 0 and σ² = 1 then X is called a standard normal random variable and its cdf is usually written as

Φ(x) = ∫_{−∞}^{x} 1/√(2π) e^{−u²/2} du.

If X ∼ N(μ, σ²) then the characteristic function (Fourier transform) of X is given by

c(s) = E(e^{i s X}) = e^{i μ s − σ² s²/2}.

3. Approximation of Normal CDF.

It is suggested that the standard normal cdf Φ(x) can be approximated by a "polynomial" Φ̃(x) as follows:

Φ̃(x) := 1 − Φ′(x) (a_1 k + a_2 k² + a_3 k³ + a_4 k⁴ + a_5 k⁵)   (3)

when x ≥ 0, and Φ̃(x) := 1 − Φ̃(−x) when x < 0.

The parameters are given by k = 1/(1 + γx), γ = 0.2316419, a_1 = 0.319381530, a_2 = −0.356563782, a_3 = 1.781477937, a_4 = −1.821255978, and a_5 = 1.330274429. This approximation has a maximum absolute error less than 7.5 × 10⁻⁸ for all x.

C++ Exercise: Write a program to compute Φ(x) with (3) and compare the result with exact values; a sketch follows below.
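A minimal sketch of approximation (3); the comparison value is obtained here via std::erfc, which is one way to get an "exact" Φ (names are illustrative):

```cpp
#include <cmath>
#include <iostream>

// Polynomial approximation (3) of the standard normal cdf (max. abs. error < 7.5e-8).
double norm_cdf(double x) {
    if (x < 0.0) return 1.0 - norm_cdf(-x);
    const double pi = 3.14159265358979323846;
    const double gamma = 0.2316419;
    const double a1 = 0.319381530, a2 = -0.356563782, a3 = 1.781477937,
                 a4 = -1.821255978, a5 = 1.330274429;
    const double k   = 1.0 / (1.0 + gamma * x);
    const double phi = std::exp(-0.5 * x * x) / std::sqrt(2.0 * pi);   // standard normal pdf
    return 1.0 - phi * (((((a5 * k + a4) * k + a3) * k + a2) * k + a1) * k);
}

int main() {
    for (double x : {0.0, 0.5, 1.0, 2.0}) {
        double exact = 0.5 * std::erfc(-x / std::sqrt(2.0));   // Phi(x) via the error function
        std::cout << x << "  " << norm_cdf(x) << "  " << exact << "\n";
    }
}
```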

4. Lognormal Random Variable.

Let Y = e^X, where X is a N(μ, σ²) random variable. Then Y is a lognormal random variable.

Exercise: Show that

E(Y) = e^{μ + σ²/2},   E(Y²) = e^{2μ + 2σ²}.

5. An Important Formula in Pricing European Options.

If V is lognormally distributed and the standard deviation of ln V is s, then

E(max(V − K, 0)) = E(V) Φ(d_1) − K Φ(d_2),

where

d_1 = (1/s) ln(E(V)/K) + s/2   and   d_2 = d_1 − s.

E(V − K)⁺ = E(V) Φ(d_1) − K Φ(d_2),   d_1 = (1/s) ln(E(V)/K) + s/2,   d_2 = d_1 − s.

Proof.

Let g be the pdf of V. Then

E(V − K)⁺ = ∫_{−∞}^{∞} (v − K)⁺ g(v) dv = ∫_K^{∞} (v − K) g(v) dv.

As V is lognormal, ln V is normal N(m, s²), where m = ln(E(V)) − ½ s². Let Y := (ln V − m)/s, i.e. V = e^{m + s Y}. Then Y ∼ N(0, 1) with pdf φ(y) = 1/√(2π) e^{−y²/2}.

E(V − K)⁺ = E(e^{m + s Y} − K)⁺ = ∫_{(ln K − m)/s}^{∞} (e^{m + s y} − K) φ(y) dy
  = ∫_{(ln K − m)/s}^{∞} e^{m + s y} φ(y) dy − K ∫_{(ln K − m)/s}^{∞} φ(y) dy =: I_1 − K I_2.

I_1 = ∫_{(ln K − m)/s}^{∞} 1/√(2π) e^{−y²/2 + m + s y} dy
  = ∫_{(ln K − m)/s}^{∞} 1/√(2π) e^{−(y − s)²/2 + m + s²/2} dy
  = e^{m + s²/2} ∫_{(ln K − m)/s − s}^{∞} 1/√(2π) e^{−y²/2} dy   [y − s ↦ y]
  = e^{m + s²/2} [1 − Φ((ln K − m)/s − s)]
  = e^{m + s²/2} Φ(−(ln K − m)/s + s)
  = e^{ln E(V)} Φ((−ln K + ln E(V) − s²/2)/s + s)
  = E(V) Φ((1/s) ln(E(V)/K) + s/2) = E(V) Φ(d_1),

Similarly,

I_2 = 1 − Φ((ln K − m)/s) = Φ(−(ln K − m)/s) = Φ(d_1 − s) = Φ(d_2),

which completes the proof.

6. Correlated Random Variables.

Assume X = (X_1, …, X_n) is an n-vector of random variables.

The mean of X is the n-vector μ = (E(X_1), …, E(X_n)). The covariance of X is the n × n matrix Σ with components

Σ_ij = (Cov X)_ij = E((X_i − μ_i)(X_j − μ_j)).

The variance of X_i is given by σ_i² = Σ_ii and the correlation between X_i and X_j is given by ρ_ij = Σ_ij/(σ_i σ_j).

X is called a multi-dimensional normal vector, written X ∼ N(μ, Σ), if X has the pdf

f(x) = 1/((2π)^{n/2} (det Σ)^{1/2}) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)).

7. Convergence.

Let {X_n} be a sequence of random variables. There are four types of convergence concepts associated with {X_n}:

• Almost sure convergence, written X_n →a.s. X, if there exists a null set N such that for all ω ∈ Ω \ N one has X_n(ω) → X(ω) as n → ∞.

• Convergence in probability, written X_n →P X, if for every ε > 0 one has P(|X_n − X| > ε) → 0 as n → ∞.

• Convergence in L^p norm, written X_n →L^p X, if X_n, X ∈ L^p and E|X_n − X|^p → 0 as n → ∞.

• Convergence in distribution, written X_n →D X, if P(X_n ≤ x) → P(X ≤ x) as n → ∞ at every point x at which the cdf of X is continuous.

8. Strong Law of Large Numbers.

Let {X_n} be independent, identically distributed (iid) random variables with finite expectation E(X_1) = μ. Then

Z_n/n →a.s. μ,   where Z_n = X_1 + ⋯ + X_n.

9. Central Limit Theorem. Let {X_n} be iid random variables with finite expectation μ and finite variance σ² > 0. For each n, let Z_n = X_1 + ⋯ + X_n. Then

(Z_n/n − μ)/(σ/√n) = (Z_n − n μ)/(√n σ) →D Z,

where Z is a N(0, 1) random variable, i.e.,

P((Z_n − n μ)/(√n σ) ≤ z) → 1/√(2π) ∫_{−∞}^{z} e^{−u²/2} du   as n → ∞.

10. Lindeberg–Feller Central Limit Theorem.

Suppose X is a triangular array of random variables, i.e.,

X = {X_1n, X_2n, …, X_{k(n)n} : n ∈ {1, 2, …}},   with k(n) → ∞ as n → ∞,

such that, for each n, X_1n, …, X_{k(n)n} are independently distributed and are bounded in absolute value by a constant y_n with y_n → 0. Let

Z_n = X_1n + ⋯ + X_{k(n)n}.

If E(Z_n) → μ and var(Z_n) → σ² > 0, then Z_n converges in distribution to a normally distributed random variable with mean μ and variance σ².

If X_1, X_2, … are iid with expectation μ and variance σ², then define

X_in := (X_i − μ)/(√n σ),   i = 1, 2, …, k(n) := n.

For each n, X_1n, …, X_{k(n)n} are independent and

E(X_in) = (E(X_i) − μ)/(√n σ) = (μ − μ)/(√n σ) = 0,
Var(X_in) = 1/(n σ²) Var(X_i − μ) = 1/(n σ²) σ² = 1/n.

Let Z_n = X_1n + ⋯ + X_{k(n)n} = (Σ_{i=1}^{n} X_i − n μ) · 1/(√n σ). Then

E(Z_n) = Σ_{i=1}^{k(n)} E(X_in) = 0,   Var(Z_n) = Σ_{i=1}^{k(n)} Var(X_in) = 1.

Hence, by Lindeberg–Feller, Z_n →D Z with Z ∼ N(0, 1).

6. Optimization

1. Unconstrained Optimization.

Given f : Rⁿ → R, minimize f(x) over x ∈ Rⁿ.

f has a local minimum at a point x̄ if f(x̄) ≤ f(x) for all x near x̄, i.e.

∃ ε > 0 s.t. f(x̄) ≤ f(x) ∀ x : ‖x − x̄‖ < ε.

f has a global minimum at x̄ if f(x̄) ≤ f(x) for all x ∈ Rⁿ.

2. Optimality Conditions.

• First order necessary condition:

Suppose that f has a local minimum at x̄ and that f is continuously differentiable in an open neighbourhood of x̄. Then ∇f(x̄) = 0. (x̄ is called a stationary point.)

• Second order sufficient condition:

Suppose that f is twice continuously differentiable in an open neighbourhood of x̄, that ∇f(x̄) = 0 and that ∇²f(x̄) is positive definite. Then x̄ is a strict local minimizer of f.

Example: Show that f = (2x_1² − x_2)(x_1² − 2x_2) has a minimum at (0, 0) along any straight line passing through the origin, but f has no minimum at (0, 0).

Exercise: Find the minimum solution of

f(x_1, x_2) = 2x_1² + x_1 x_2 + x_2² − x_1 − 3x_2.   (4)

Sufficient Condition.

Taylor gives, for any d ∈ Rⁿ:

f(x̄ + d) = f(x̄) + ∇f(x̄)ᵀ d + ½ dᵀ ∇²f(x̄ + λ d) d,   λ ∈ (0, 1).

If x̄ is not a strict local minimizer, then

∃ {x_k} ⊂ Rⁿ \ {x̄} : x_k → x̄ s.t. f(x_k) ≤ f(x̄).

Define d_k := (x_k − x̄)/‖x_k − x̄‖. Then ‖d_k‖ = 1 and there exists a subsequence {d_kj} such that d_kj → d* as j → ∞ with ‖d*‖ = 1. W.l.o.g. we assume d_k → d* as k → ∞.

f(x̄) ≥ f(x_k) = f(x̄ + ‖x_k − x̄‖ d_k)
  = f(x̄) + ‖x_k − x̄‖ ∇f(x̄)ᵀ d_k + ½ ‖x_k − x̄‖² d_kᵀ ∇²f(x̄ + λ_k ‖x_k − x̄‖ d_k) d_k
  = f(x̄) + ½ ‖x_k − x̄‖² d_kᵀ ∇²f(x̄ + λ_k ‖x_k − x̄‖ d_k) d_k.

Hence d_kᵀ ∇²f(x̄ + λ_k ‖x_k − x̄‖ d_k) d_k ≤ 0, and on letting k → ∞, d*ᵀ ∇²f(x̄) d* ≤ 0, contradicting the positive definiteness of ∇²f(x̄).

Example 6.2. Show that f = (2x_1² − x_2)(x_1² − 2x_2) has a minimum at (0, 0) along any straight line passing through the origin, but f has no minimum at (0, 0).

Answer.

A straight line through (0, 0): x_2 = α x_1, α ∈ R fixed.

g(r) := f(r, α r) = (2r² − α r)(r² − 2 α r)
g′(r) = 8r³ − 15 α r² + 4 α² r,   g″(r) = 24r² − 30 α r + 4 α²
⇒ g′(0) = 0 and g″(0) = 4α² > 0 for α ≠ 0 (for α = 0, g(r) = 2r⁴ also has a minimum at r = 0; the same holds on the line x_1 = 0, where f(0, x_2) = 2x_2²).

Hence r = 0 is a minimizer for g, i.e. (0, 0) is a minimizer for f along any straight line.

Now let (x_1^k, x_2^k) = (1/k, 1/k²) → (0, 0) as k → ∞. Then

f(x_1^k, x_2^k) = (1/k²)(−1/k²) = −1/k⁴ < 0 = f(0, 0)   ∀ k.

Hence (0, 0) is not a minimizer for f.

[Note: ∇f(0, 0) = 0, but ∇²f(0, 0) = (0 0; 0 4) is only positive semi-definite.]

3. Convex Optimization.


Exercise 6.3.

When f is convex, any local minimizer x̄ is a global minimizer of f.

Proof.

Suppose x̄ is a local minimizer, but not a global minimizer. Then ∃ x̃ s.t. f(x̃) < f(x̄). Since f is convex, we have that

f(λ x̃ + (1 − λ) x̄) ≤ λ f(x̃) + (1 − λ) f(x̄) < λ f(x̄) + (1 − λ) f(x̄) = f(x̄)   ∀ λ ∈ (0, 1].

Let x_λ := λ x̃ + (1 − λ) x̄. Then x_λ → x̄ and f(x_λ) < f(x̄) as λ → 0. This is a contradiction to x̄ being a local minimizer.

4. Line Search.

The basic procedure to solve numerically an unconstrained problem (minimize f(x) over x ∈ Rⁿ) is as follows.

(i) Choose an initial point x⁰ ∈ Rⁿ and an initial search direction d⁰ ∈ Rⁿ and set k = 0.

(ii) Choose a step size α_k and define a new point x^{k+1} = x^k + α_k d^k. Check if the stopping criterion is satisfied (‖∇f(x^{k+1})‖ < ε?). If yes, x^{k+1} is the optimal solution; stop. If no, go to (iii).

(iii) Choose a new search direction d^{k+1} (a descent direction) and set k = k + 1. Go to (ii).

5. Steepest Descent Method.

Assume f is differentiable. Choose d^k = −g^k, where g^k = ∇f(x^k), and choose α_k s.t.

f(x^k + α_k d^k) = min_{α ∈ R} f(x^k + α d^k).

Note that successive descent directions are orthogonal to each other, i.e. (g^k)ᵀ g^{k+1} = 0, and the convergence for some functions may be very slow, which is called zigzagging.

Exercise.

Steepest Descent.

Taylor gives:

f(x^k + α d^k) = f(x^k) + α ∇f(x^k)ᵀ d^k + O(α²).

As

∇f(x^k)ᵀ d^k = ‖∇f(x^k)‖ ‖d^k‖ cos θ_k,

with θ_k the angle between d^k and ∇f(x^k), we see that d^k is a descent direction if cos θ_k < 0. The descent is steepest when θ_k = π, i.e. cos θ_k = −1, which gives d^k = −∇f(x^k).

Zigzagging.

α_k is a minimizer of φ(α) := f(x^k + α d^k) with d^k = −g^k. Hence 0 = φ′(α_k) = ∇f(x^{k+1})ᵀ d^k = −(g^{k+1})ᵀ g^k, i.e. successive gradients are orthogonal.

Exercise 6.5.

Use the SD method to solve (4) with the initial point x⁰ = (1, 1). [Minimizer: (1/7)(−1, 11).]

Answer.

∇f = (4x_1 + x_2 − 1, x_1 + 2x_2 − 3).

Iteration 0: d⁰ = −∇f(x⁰) = (−4, 0) ≠ (0, 0).
φ(α) = f(x⁰ + α d⁰) = f(1 − 4α, 1) = 2(1 − 4α)² − 2, with minimum at α_0 = 1/4.
⇒ x¹ = x⁰ + α_0 d⁰ = (0, 1),  d¹ = −∇f(x¹) = −(0, −1) = (0, 1) ≠ (0, 0).

Iteration 1: x² = (0, 3/2), d² = (−1/2, 0), …
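A minimal sketch of the steepest descent iteration for the quadratic (4), using the exact line-search step α = gᵀg/(gᵀA g) that is available for quadratics f(x) = ½xᵀA x + bᵀx (names are illustrative):

```cpp
#include <array>
#include <cmath>
#include <iostream>

using Vec2 = std::array<double, 2>;

// f(x1,x2) = 2x1^2 + x1 x2 + x2^2 - x1 - 3x2, i.e. f(x) = 1/2 x^T A x + b^T x
// with A = [[4,1],[1,2]] and b = (-1,-3); gradient g = A x + b.
Vec2 grad(const Vec2& x) {
    return {4.0 * x[0] + x[1] - 1.0, x[0] + 2.0 * x[1] - 3.0};
}

int main() {
    Vec2 x = {1.0, 1.0};                                   // initial point x^0
    for (int k = 0; k < 100; ++k) {
        Vec2 g = grad(x);
        double gg = g[0] * g[0] + g[1] * g[1];
        if (std::sqrt(gg) < 1e-10) break;                  // stop when the gradient vanishes
        Vec2 Ag = {4.0 * g[0] + g[1], g[0] + 2.0 * g[1]};
        double alpha = gg / (g[0] * Ag[0] + g[1] * Ag[1]); // exact minimizer of f(x - alpha g)
        x = {x[0] - alpha * g[0], x[1] - alpha * g[1]};    // x^{k+1} = x^k - alpha g^k
    }
    std::cout << x[0] << " " << x[1] << "\n";              // approaches (-1/7, 11/7)
}
```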

6. Newton Method.

Assume f is twice differentiable. Choose d^k = −[H^k]⁻¹ g^k, where H^k = ∇²f(x^k). Set x^{k+1} = x^k + d^k.

If H^k is positive definite then d^k is a descent direction.

The main drawback of the Newton method is that it requires the computation of ∇²f(x^k) and its inverse, which can be difficult and time-consuming.

Exercise.

Newton Method.

Taylor gives

f(x^k + d) ≈ f(x^k) + dᵀ ∇f(x^k) + ½ dᵀ ∇²f(x^k) d =: m(d).

min_d m(d) ⇒ ∇m(d) = 0 ⇒ ∇f(x^k) + ∇²f(x^k) d = 0.

Hence choose d^k = −[∇²f(x^k)]⁻¹ ∇f(x^k) = −[H^k]⁻¹ g^k.

If H^k is positive definite, then so is (H^k)⁻¹, and we get

(d^k)ᵀ g^k = −(g^k)ᵀ (H^k)⁻¹ g^k ≤ −σ_k ‖g^k‖² < 0   for some σ_k > 0.

Hence d^k is a descent direction.

Exercise 6.6.

Use the Newton method to minimize

f(x_1, x_2) = 2x_1² + x_1 x_2 + x_2² − x_1 − 3x_2   with x⁰ = (1, 1)ᵀ.

Answer.

∇f = (4x_1 + x_2 − 1, x_1 + 2x_2 − 3)ᵀ,   H := ∇²f = (4 1; 1 2).

H⁻¹ = (1/det H) (2 −1; −1 4) = (1/7) (2 −1; −1 4).

Iteration 0: x⁰ = (1, 1)ᵀ, ∇f(x⁰) = (4, 0)ᵀ.

x¹ = x⁰ − [H⁰]⁻¹ ∇f(x⁰) = (1; 1) − (1/7) (2 −1; −1 4) (4; 0) = (1/7) (−1; 11).

⇒ ∇f(x¹) = (0, 0)ᵀ and H is positive definite, so x¹ is the minimizer.

7. Choice of Stepsize.

In computing the step size α_k we face a tradeoff. We would like to choose α_k to give a substantial reduction of f, but at the same time we do not want to spend too much time making the choice. The ideal choice would be the global minimizer of the univariate function φ : R → R defined by

φ(α) = f(x^k + α d^k),   α > 0,

but in general it is too expensive to identify this value.

A common strategy is to perform an inexact line search to identify a step size that achieves adequate reductions in f at minimal cost.

α_k is normally chosen to satisfy the Wolfe conditions:

f(x^k + α_k d^k) ≤ f(x^k) + c_1 α_k (g^k)ᵀ d^k,   (5)
∇f(x^k + α_k d^k)ᵀ d^k ≥ c_2 (g^k)ᵀ d^k,   (6)

with 0 < c_1 < c_2 < 1.

Choice of Stepsize.

The simple condition

f(x^k + α_k d^k) < f(x^k)   (∗)

is not appropriate, as it may not lead to a sufficient reduction.

Example: f(x) = (x − 1)² − 1, so min f(x) = −1, but we can choose x^k satisfying (∗) such that f(x^k) = 1/k → 0.

Note that the sufficient decrease condition (5),

φ(α) = f(x^k + α d^k) ≤ ℓ(α) := f(x^k) + c_1 α (g^k)ᵀ d^k,

yields acceptable regions for α. Here φ(α) < ℓ(α) for small α > 0, as (g^k)ᵀ d^k < 0 for descent directions.

The curvature condition (6) is equivalent to

φ′(α) ≥ c_2 φ′(0)   [> φ′(0)].

8. Convergence of Line Search Methods.

An algorithm is said to be globally convergent if lim_{k→∞} ‖g^k‖ = 0.

It can be shown that if the step sizes satisfy the Wolfe conditions, then

• the steepest descent method is globally convergent,

• so is the Newton method, provided the Hessian matrices ∇²f(x^k) have a bounded condition number and are positive definite.

Exercise. Show that the steepest descent method is globally convergent if the following conditions hold:

(a) α_k satisfies the Wolfe conditions,
(b) f(x) ≥ M ∀ x ∈ Rⁿ (f is bounded below),
