2. Linear Systems and Optimization

2.4 Functions of Several Variables: Finding Roots and Extrema

The techniques we develop in this section are also referred to as Newton’s method since they use derivatives and a single initial estimate to establish an iterative process to search for a root. In general terms, this is identical to Newton’s method introduced in Section 1.3.

As these procedures apply to differentiable functions f : Rⁿ → R^m, they apply to linear systems that are not square and to square linear systems whose coefficient matrix is singular. More generally, setting g = f · f (or g = f² if m = 1), the roots of f are minima of g, since g ≥ 0 with equality exactly at the roots of f. Hence, we need only consider the problem of finding extrema in order to find roots.

The techniques developed in this section are applicable to optimal control theory and sensitivity analysis. Sensitivity analysis is of particular interest. Here you define a function f which measures an outcome from given independent (input) variables. However, the parameters necessary to express f may not be known with certainty. For instance, a formula in finance may depend on the price volatility (the variance of a random variable). But it is often the case that the variance, σ², is not known exactly.

Sensitivity analysis attempts to determine how the outcome will vary with changes in the estimate for σ². In effect, the analyst is seeking to estimate

∂f /∂σ².

In another direction, these minimization techniques apply to web search technology. In that case, each person browsing the web has a penalty function. This function is determined by his/her prior tendencies. The browser returns the list of web pages that minimizes the penalty function. Of course, this max/min problem has perhaps a hundred thousand variables.

Do note that we have written this section for functions defined on R². It is easy to extend these techniques to Rⁿ.

We begin by looking at an example. Consider f(x, y) = x² + y² mapping R² to R (see Figure 2.4.1). The graph of f is a subset of R³, and the single minimum of f is at (0, 0). Suppose we start the search for a minimum at (1, 2). If we think back to the method developed in Section 1.3, we want a line γ tangent to the surface passing through (1, 2, 5), where 5 = f(1, 2).

Then we want to determine where γ intersects the xy-plane. That point will be our next approximate minimum. If we repeat the process, we expect that we will get better and better approximations.

The first problem is that, although we know the tangent plane at (1, 2, 5), with equation z − 5 = ∇f(1, 2) · (x − 1, y − 2), we do not know which line on the plane to use for γ. There are two standard procedures to determine the direction vector for γ. We develop one now and the second at the end of the section. Recall a fact from multivariate calculus: the gradient points in the direction of maximal ascent, and its negative points in the direction of maximal descent. Hence, it seems reasonable to take the line γ along the gradient. In this case the technique is often called the method of maximal descent.

Figure 2.4.1: f(x, y) = x² + y²

We know that the gradient ∇f(x₀, y₀) = (∂f/∂x(x₀, y₀), ∂f/∂y(x₀, y₀)) is a vector in the xy-plane, the domain of f, that determines the direction of maximal change for f. So, with (x₀, y₀) = (1, 2), it is reasonable to set ξ = ∂f/∂x(x₀, y₀) and η = ∂f/∂y(x₀, y₀) and consider the line (1, 2) + t(ξ, η) = (1 + tξ, 2 + tη) in the xy-plane. Next, we define a function h : R → R, h(t) = f(1 + tξ, 2 + tη). We can now solve for a max/min of h. This is a one variable calculus problem. Finding a minimum for h should yield a value for f less than 5, the value at (1, 2).

Indeed, ∇f(1, 2) = (2, 4) and h(t) = (1 + 2t)² + (2 + 4t)² = 5 + 20t + 20t². The derivative of h is 20 + 40t, so h has an extremum at t = −0.5. Now h(−0.5) = f(0, 0) = 0 < 5. Indeed, we recognize the origin as the minimum of f. And we have arrived in one step.

We now state the general process for functions of several variables. Suppose we seek a minimum of f mapping Rⁿ to R. As in the rest of the section, the steps are written for n = 2.

(1) Compute the gradient of f at (x₀, y₀), ∇f(x₀, y₀) = (ξ, η).

(2) Set h(t) = f(x₀ + tξ, y₀ + tη).

(3) Solve the single variable calculus problem for h to yield t₀.

(4) Set (x₁, y₁) = (x₀ + t₀ξ, y₀ + t₀η).

(5) If f(x₀, y₀) < f(x₁, y₁), then exit (the process has failed).

(6) If the iteration count exceeds the maximum, exit (the process has failed).

(7) If |f(x₁, y₁) − f(x₀, y₀)| is sufficiently small, exit (possible success).

(8) Go back to Step 1 using (x₁, y₁) as the seed.
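The eight steps above can be sketched in Mathematica roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names maximalDescent, tol and maxIter are made up for the example, the one variable problem in Step 3 is handed to FindMinimum with starting value t = 0, and a guard for a vanishing gradient is added even though it is not one of the listed steps.

```mathematica
(* Sketch of Steps 1-8 for f : R^2 -> R.  The names maximalDescent, tol and
   maxIter are illustrative assumptions; they do not come from the text. *)
maximalDescent[f_, start_List, tol_: 0.00001, maxIter_: 35] :=
 Module[{x, y, t, p = N[start], q, grad, t0, fp, fq},
  Catch[
   Do[
    (* Step 1: gradient (xi, eta) at the current point p *)
    grad = D[f[x, y], {{x, y}}] /. {x -> p[[1]], y -> p[[2]]};
    (* Added guard (not one of the listed steps): already at a critical point *)
    If[Norm[grad] < 10^-12, Throw[{"critical point", p}]];
    (* Steps 2 and 3: minimize h(t) = f(p + t grad) numerically from t = 0 *)
    t0 = t /. Last[FindMinimum[f @@ (p + t*grad), {t, 0}]];
    (* Step 4: the next point *)
    q = p + t0*grad;
    fp = f @@ p;
    fq = f @@ q;
    (* Step 5: f increased, so the process has failed *)
    If[fp < fq, Throw[{"failed", p}]];
    (* Step 7: the change in f is below the tolerance, possible success *)
    If[Abs[fq - fp] < tol, Throw[{"converged", q}]];
    (* Step 8: use q as the new seed *)
    p = q,
    {maxIter}];
   (* Step 6: the iteration limit was reached *)
   {"max iterations", p}]]
```

For the example of this section, maximalDescent[Function[{u, v}, u^2 + v^2], {1, 2}] should stop at a point very close to the origin. Catch and Throw are used for the exits because Return inside Do only leaves the loop.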

It is interesting to note that maximal descent is sufficient for the minimization problems that occur in big data and machine learning applications.

To introduce the second technique, we look at another example. Suppose f(x, y) = cos²(x)e^y. The minima for the function occur at the odd multiples of x = π/2 (see Figure 2.4.2). If we start the search at (0, 1), then t₀ = −35.7979, (x₁, y₁) = (0, −96.2817) and we are way out on the negative y-axis. Even though the value of f is nearly zero (about 1.4 × 10⁻⁴⁷), no further processing will take us any closer to an actual minimum. Hence, maximal descent has failed for this case.

Figure 2.4.2: f(x, y) = cos²(x)e^y

There are many alternate choices for the direction vector for γ. One choice is similar to the secant method. In this case we begin with the Taylor expansion for f,

$$f(x + s) = f(x) + \nabla f(x) \cdot s + \tfrac{1}{2}\, s^T H(x)\, s + R_2, \qquad (2.4.1)$$

where s^T denotes the transpose of s, H is the Hessian of f and R₂ is the remainder term. Recall that the Hessian is the matrix whose entries are ∂²f/∂xᵢ∂xⱼ. Because of the use of the Hessian, this technique is referred to as the Hessian method.

If we suppose that f(x) = f(x + s), then according to Rolle’s theorem, we would expect a local extremum between x and x + s. Hence, γ = s is the search direction. If we take the remainder term to be zero, then (2.4.1) becomes ∇f(x) · s + ½ s^T H(x) s = 0, which is satisfied when

$$\tfrac{1}{2}\, H(x)\, s = -\nabla f(x). \qquad (2.4.2)$$

Therefore, we solve for s. Since (2.4.2) is a linear system with coefficient matrix ½H(x), we can find γ provided H(x) is nonsingular. Finally, to describe the Hessian method, we need only replace Step 1 by the following.

(1) Compute s = (ξ, η) as the solution to the linear system 0.5H(x)s = −∇f(x), where x = (x₀, y₀).
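As a small illustration of the replacement Step 1, the following sketch computes the Hessian direction for the earlier example f(x, y) = x² + y² at the seed (1, 2) and then carries out one line search. It assumes that the symbols x, y and t carry no values in the session; the names grad, hess, at, s, h, t0 and p1 are chosen only for this example.

```mathematica
(* One step of the Hessian method for f(x, y) = x^2 + y^2 starting at (1, 2). *)
(* Assumes x, y and t are unassigned symbols; all names are illustrative.     *)
f[x_, y_] := x^2 + y^2;
grad = D[f[x, y], {{x, y}}];       (* gradient: {2 x, 2 y} *)
hess = D[f[x, y], {{x, y}, 2}];    (* Hessian: {{2, 0}, {0, 2}} *)
at = {x -> 1, y -> 2};

(* Replacement Step 1: solve 0.5 H s = -grad f for the direction s *)
s = LinearSolve[(1/2) (hess /. at), -(grad /. at)];

(* Steps 2-4: minimize h(t) = f((1, 2) + t s) and move to the next point *)
h[t_] = f @@ ({1, 2} + t*s);
t0 = t /. First[Solve[h'[t] == 0, t]];
p1 = {1, 2} + t0*s
```

Working by hand, 0.5H is the identity matrix here, so s = (−2, −4), h(t) = 5(1 − 2t)², t₀ = 1/2, and the next point is the origin, the same point that maximal descent reached for this function.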

As mentioned at the beginning of the section, if f takes values in R^m then g = f · f is real valued and the roots of f are now extrema for g. Hence, we can use the techniques developed here to solve the general problem f(x) = 0. We present examples in the exercises.

Exercises:

1. Use maximal descent to find a minimum for f(x, y) = x² + xy + y². Use (2, 1) as the search starting point.

2. Use the Hessian method to find a minimum for f(x, y) = x² + xy + y². Use (2, 1) as the search starting point.

3. Let f(x, y) = (x + y, x + y). Solve f = 0 using (2, −1) as the initial estimate. Note that f is a singular linear transformation. When solving this problem you are solving a linear system with a singular coefficient matrix.

4. Consider the linear transformation

a. Use LUDecomposition to determine if A is singular or non-singular. (Do not forget to introduce a decimal point to the data.) How does this impact the problem of solving an equation of the form L(x, y, z, w) = (x₀, y₀, z₀, w₀)?

b. Use the maximal descent method to solve L(x, y, z, w) = (1, 1, 1, −1).

• Use (5, 5, 5, 5) for the initial estimate.

• Use at least 35 iterations.

• Use 10⁻⁵ as the tolerance in Step 7.

• Make certain to use two “if” statements, one for Step 5 and one for Step 7.

c. Redo Part b using (1, 2, 3, 4) as the initial estimate.

d. Why is it possible for the solution to b and c to be different?

e. Prove that if v is the solution to b and v̂ is the solution to c, then v − v̂ solves L(x, y, z, w) = (0, 0, 0, 0). (What is the kernel of a linear transformation?)

f. Use LinearSolve to get a solution to L(x, y, z, w) = (1, 1, 1, −1). Is this solution trusted? Why? What was the condition number from Part a?

Chapter 3 Interpolating and Fitting

Introduction

We introduce the following terminology. Suppose we are given a set of n points P₁, ..., Pₙ in the plane R². We may want to find a curve (function) that passes through the points (interpolating) or a curve that passes near the points (fitting). If we want the curve to pass through the points, then we may have to accept anomalies on the curve. If we are willing to accept a curve that only passes near the points, then we may place stronger restrictions on the curve. In this chapter we see how this give and take materializes.

Of the several techniques there is no best of all, no method that gives the best results under all circumstances. The spline, with applications in computer graphics, visualization, robotics and statistics, is perhaps the most widely used. The spline curve is twice continuously differentiable, depends only on point data and faithfully reflects the tendencies of the input data.

On the other hand, among the techniques we present, splines have the most complex mathematical foundation. For all of these reasons, we present a correct mathematical development of cubic splines.

In another direction, polynomial interpolation is the oldest of the techniques. It has the most developed theory and is widely used as a technique for approximating integrals and approximating solutions to differential equations. For this reason, it is arguably the most important.

Least squares fitting in the linear case provides the numerical technique used for linear regression. Furthermore, least squares fitting often arises in the literature as a generalization of polynomial interpolation. In this context, it is a technique for estimating the error for the finite element method.

Another technique is Bezier interpolation. This procedure was developed originally to be used by engineers when resolving artist designs. In particular, Bezier curves were developed as a tool to help an engineer derive three dimensional coordinates from a designer's concept drawing.

The final technique is Hermite interpolation. In this case we are charged with finding a polynomial interpolation that approximates both the function and its derivative. The Hermite interpolation provides the underlying mathematical foundation for Gaussian quadrature.

Before proceeding we mention the theorem of Weierstrass: any continuous function on a closed interval is the uniform limit of a sequence of polynomial functions. This is a remarkable result. And it is very old, as it was proved in 1885. The proof, however, does not explain what the polynomials are. It was not until 1912 that Bernstein identified a sequence of polynomials. Even though the Bernstein polynomials are determined by the continuous function, they do not interpolate the target. Further, the convergence is very slow. A reasonable approximation of a function with Bernstein polynomials often requires Bernstein polynomials of degree two or three thousand. This is not a useful alternative to the techniques we are about to develop.

3.1 Polynomial Interpolation

In this section we introduce the idea of a polynomial interpolation. We start with an unknown function f with known values at points xᵢ in its domain and we construct an interpolating polynomial p that agrees with f at these locations. The idea is that if p agrees with f at designated locations then we can use p in place of f. But if f is not continuous at xᵢ, then the values of f at the location may not relate to values of f at nearby points.

Therefore, in order to talk about interpolation, we must have a continuous function. Throughout this section we suppose that f is continuous on its domain.

We begin by looking at the Taylor expansion of a function. Consider the function f(x) = xe^(−x) − 1. Plotting this function on the interval [1, 4]:

f[x_] = x*Exp[-x] - 1;

(* no trailing semicolon here, so that the notebook displays the plot *)
Plot[f[x], {x, 1, 4}]

shows a decreasing function with an inflection point.

Thinking of this curve as being more or less cubic, we can develop the cubic Taylor polynomial interpolation g for f expanded at the midpoint, x = 2.5,

$$g(x) = \sum_{k=0}^{3} \frac{f^{(k)}(2.5)}{k!}\,(x - 2.5)^k.$$

Figure 3.1.1: f together with the Taylor expansion at x = 2.5

Figure 3.1.2: f concave down near the root

When developing g you will need to compute the derivatives of f. Recall that the derivatives of f are computed in Mathematica via D[f[x], x], D[f[x], x, x] and so forth. If you plot f and g on the same axes you will see that the cubic Taylor polynomial provides a remarkably good approximation of this function. Figure 3.1.1 shows the graph of g together with the graph of f. Notice that the graph of g lies on or above the graph of f on both sides of x = 2.5, since the remainder term f⁽⁴⁾(ξ)(x − 2.5)⁴/4! is never positive on [1, 4].
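One way the cubic Taylor polynomial might be assembled from the derivatives of f is sketched below. The names a and g are chosen for this illustration, the expansion point is written as the exact value 5/2, and x is assumed to be an unassigned symbol.

```mathematica
(* Cubic Taylor polynomial of f(x) = x Exp[-x] - 1 about the midpoint a = 5/2. *)
(* The names a and g are illustrative; x is assumed to have no value.          *)
f[x_] := x*Exp[-x] - 1;
a = 5/2;

(* Sum of f^(k)(a)/k! (x - a)^k for k = 0, 1, 2, 3; D[f[x], {x, k}] is the k-th derivative *)
g[x_] = Total[Table[(D[f[x], {x, k}] /. x -> a)/k!*(x - a)^k, {k, 0, 3}]];

(* Plot f and g on the same axes over [1, 4] *)
Plot[{f[x], g[x]}, {x, 1, 4}]
```

By construction g and its first three derivatives agree with those of f at x = 5/2, which can be checked with expressions such as g'[5/2] == f'[5/2].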

A numerical measurement of the goodness of fit is given by the L² norm of f − g on the interval [1, 4],

$$\|f - g\|_2 = \left( \int_1^4 \big( f(x) - g(x) \big)^2 \, dx \right)^{1/2}.$$

This is called the norm interpolation error. In turn, the mean norm interpolation error is

The finite Taylor expansion produces a high quality one point interpolation provided we know the original function. However, suppose we have points and no function. Then we will need a different approach.

Definition 3.1.1. Consider points P₀, ..., Pₙ in R², Pᵢ = (xᵢ, yᵢ). The polynomial interpolation is a polynomial p of degree at most n that interpolates the n + 1 points in the sense that p(xᵢ) = yᵢ.

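Our requirement for p is that it interpolate the n + 1 points. Hence, writing p(x) = α₀ + α₁x + ... + αₙxⁿ, we have for each i,

$$y_i = p(x_i) = \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 + \cdots + \alpha_n x_i^n.$$

Collecting these equations we then get the following matrix equation

$$\begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix} \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}.$$

The xᵢ and yᵢ are known while the αᵢ are unknown. Hence, we can use the LinearSolve function in Mathematica to find the coefficients of p provided the coefficient matrix is nonsingular. The matrix is called a Vandermonde matrix. It is always nonsingular provided the xᵢ are distinct.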
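The LinearSolve approach just described might look as follows for four sample points. This is only a sketch: the data are taken from f(x) = xe^(−x) − 1 at x = 1, 2, 3, 4 as an assumption for the example, and the names xs, ys, V, alpha and p are illustrative.

```mathematica
(* Polynomial interpolation through four points via the Vandermonde system. *)
(* Sample data (an assumption for this sketch): f(x) = x Exp[-x] - 1.       *)
xs = {1., 2., 3., 4.};
ys = xs*Exp[-xs] - 1;

(* Row i of V is {1, x_i, x_i^2, x_i^3} *)
V = Table[xi^j, {xi, xs}, {j, 0, Length[xs] - 1}];

(* Solve V . alpha == ys for the coefficients alpha_0, ..., alpha_3 *)
alpha = LinearSolve[V, ys];

(* The interpolating polynomial p(x) = alpha_0 + alpha_1 x + alpha_2 x^2 + alpha_3 x^3 *)
p[x_] = alpha . x^Range[0, Length[xs] - 1];

(* Check: p should reproduce the data at the nodes up to rounding error *)
Max[Abs[(p /@ xs) - ys]]
```

The last line should return a number at the level of rounding error. When two of the xᵢ nearly coincide, the rows of V become nearly equal and the computed coefficients are less trustworthy.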

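Theorem 3.1.1. The Vandermonde matrix

$$V = \begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix}$$

is nonsingular, provided the scalars xᵢ, i = 0, 1, ..., n are distinct.

Proof. The Vandermonde matrix is singular only if the columns are dependent. In particular, only if there are scalars β₀, ..., βₙ not all zero with

$$\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_n x_i^n = 0, \qquad i = 0, 1, \ldots, n.$$

In other words, each xᵢ is a root of the polynomial β₀ + β₁x + ... + βₙxⁿ. This polynomial is not identically zero and has degree at most n, and therefore has at most n distinct roots. However, we just showed that it has n + 1 distinct roots, x₀, ..., xₙ. As this is impossible, we are led to the conclusion that the Vandermonde matrix is nonsingular.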

There is another way to do polynomial interpolation. The outcome is the same, but nevertheless, the approach does provide insight. As in the previous case we begin with n + 1 points in R², denoted P₁, ..., Pₙ₊₁ with Pᵢ = (xᵢ, yᵢ) and the xᵢ distinct. For each i we set

$$l_i(x) = \prod_{j \ne i} \frac{x - x_j}{x_i - x_j}.$$

It is not difficult to see that the polynomials lᵢ(x) have degree n, satisfy lᵢ(xᵢ) = 1 and lᵢ(xⱼ) = 0 whenever j ≠ i. Moreover,

$$q(x) = \sum_{i=1}^{n+1} y_i\, l_i(x)$$

interpolates the given points. (See Exercise 5.) The polynomials lᵢ(x) are called Lagrange polynomials. We now see that the two polynomial interpolations, p derived from the Vandermonde matrix and q derived from the Lagrange polynomials, are in fact the same.
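A short sketch of the Lagrange construction is given below. The data are made up for the illustration (four points on the parabola y = x²), the names xs, ys, lagrange and q are likewise assumptions, and x is assumed to be an unassigned symbol.

```mathematica
(* Lagrange form of the interpolating polynomial for made-up sample data. *)
(* The four points below lie on y = x^2; all names are illustrative.      *)
xs = {-1, 0, 1, 2};
ys = {1, 0, 1, 4};

(* l_i(x): product over j != i of (x - x_j)/(x_i - x_j) *)
lagrange[i_, x_] :=
  Times @@ Table[If[j == i, 1, (x - xs[[j]])/(xs[[i]] - xs[[j]])], {j, Length[xs]}];

(* q(x) = sum of y_i l_i(x) *)
q[x_] = Total[Table[ys[[i]]*lagrange[i, x], {i, Length[xs]}]];

Expand[q[x]]   (* x^2 *)
```

Since the sample points lie on a parabola, the cubic terms cancel and the expanded interpolant is x², as uniqueness of the interpolating polynomial requires.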

Theorem 3.1.2. Given a continuous function f and interpolation points P₁, ..., Pₙ₊₁, suppose that the interpolation derived from the Vandermonde matrix is given by p and the interpolation derived from the Lagrange polynomials by q. Then p(x) = q(x).

Proof. We begin by setting r = p − q. Hence, r is a polynomial of degree at most n.

Since p(xᵢ) = yᵢ = q(xᵢ) for each i = 1, 2, ..., n + 1, r has n + 1 roots, x₁, ..., xₙ₊₁. But if r is not identically zero, then it can have at most n roots. Therefore, r = 0 and p = q.

It is possible that you must use Lagrange polynomials to compute the polynomial interpolation of a function. In particular, there are cases where the Vandermonde matrix procedure does not work. Suppose that two of the x-axis locations xᵢ and xⱼ are very close together. Then it would appear to Mathematica that two of the rows of the Vandermonde matrix are equal or nearly equal. In this case the condition number will be large and LinearSolve will not return reliable results. Nevertheless, it is still possible to get the interpolation via Lagrange polynomials.

If the points Pi lie on the graph of a function f , then it is natural to ask how well does p approximate f . If f has at least n + 1 continuous derivatives then we can estimate the error, e(x) = f (x) − p(x). Recall that with this hypothesis, then the error for the Taylor interpolation will have a known bound.

Theorem 3.1.3. Suppose that f is a real valued function defined on an interval [a, b] and suppose that f has at least n + 1 continuous derivatives. Further, take a ≤ x₁ < ... < xₙ₊₁ ≤ b, with f(xᵢ) = yᵢ. If p is the polynomial interpolation of the points (xᵢ, yᵢ), then the error e(x) = f(x) − p(x) is given by

$$e(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{i=1}^{n+1} (x - x_i), \qquad (3.1.1)$$

for some ξ = ξ_x in (a, b) depending on x. In particular,

$$|e(x)| \le \frac{M}{(n+1)!}\,(b - a)^{n+1}, \qquad (3.1.2)$$

where M is the maximal value of |f⁽ⁿ⁺¹⁾| on the interval.

Proof. We define g(x) = e(x)/∏ᵢ(x − xᵢ), so that e(x) = f(x) − p(x) = ∏ᵢ(x − xᵢ) g(x). Next, take ζ in [a, b] distinct from the xᵢ and set

$$h(x) = f(x) - p(x) - \prod_{i} (x - x_i)\, g(\zeta).$$

Note that we cannot be certain that g is defined at the xᵢ; however, our choice of ζ assures us that h is defined on [a, b] with n + 1 continuous derivatives.

Now, each xᵢ is a root of h and in addition h(ζ) = e(ζ) − ∏ᵢ(ζ − xᵢ) g(ζ) = 0. Hence, h has n + 2 roots in the interval [a, b]. Furthermore, h is continuous on the closed interval and differentiable on the open interval (a, b). Hence, we may apply Rolle’s theorem to the interval between each pair of successive roots and conclude that between each pair of roots there is a root of the derivative of h. Hence, dh/dx has at least n + 1 roots on the interval (a, b).

Repeating this argument, d²h/dx² has at least n roots in (a, b). Continuing, the k-th derivative of h has at least n + 2 − k roots, so that the (n + 1)-st derivative has at least 1 root. We denote this root by ξ = ξ_ζ, since ξ depends on our choice of ζ. Now

$$0 = h^{(n+1)}(\xi) = f^{(n+1)}(\xi) - p^{(n+1)}(\xi) - g(\zeta)\, \frac{d^{n+1}}{dx^{n+1}} \prod_{i} (x - x_i) \Big|_{x = \xi}.$$

But p has degree at most n, so p⁽ⁿ⁺¹⁾ = 0. Also dⁿ⁺¹/dxⁿ⁺¹ ∏ᵢ(x − xᵢ) = (n + 1)!, no matter what ξ is. Therefore, g(ζ) = f⁽ⁿ⁺¹⁾(ξ)/(n + 1)! and

$$e(\zeta) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{i} (\zeta - x_i).$$

Finally, since ζ was an arbitrary point of [a, b] distinct from the xᵢ, and both sides vanish at each xᵢ, this last expression for the error is satisfied for all x in the interval.

For the final statement on the bound for the error magnitude we note that since f is n + 1 times continuously differentiable, f⁽ⁿ⁺¹⁾ is continuous and hence |f⁽ⁿ⁺¹⁾| has a maximum value M on the interval. Moreover, each of the n + 1 factors satisfies |x − xᵢ| ≤ b − a.
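As a worked instance of the bound (3.1.2), take the function f(x) = xe^(−x) − 1 used earlier in the section with four nodes in [1, 4], so that n + 1 = 4 and b − a = 3. The choice of this function and interval for the illustration is ours. Differentiating four times and using the fact that f⁽⁴⁾ is increasing on [1, 4] gives:

```latex
\begin{aligned}
f'(x)      &= (1 - x)e^{-x}, \quad f''(x) = (x - 2)e^{-x}, \quad
f'''(x) = (3 - x)e^{-x}, \quad f^{(4)}(x) = (x - 4)e^{-x}, \\
M          &= \max_{1 \le x \le 4} \bigl| f^{(4)}(x) \bigr| = \bigl| f^{(4)}(1) \bigr| = \frac{3}{e} \approx 1.10, \\
|e(x)|     &\le \frac{M}{4!}\,(b - a)^{4} = \frac{3/e}{24} \cdot 3^{4} = \frac{81}{8e} \approx 3.72 .
\end{aligned}
```

The estimate is crude because every factor |x − xᵢ| was replaced by b − a. Bounding the product ∏ᵢ|x − xᵢ| directly for the chosen nodes would give a smaller number.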

Numerical integration is based on polynomial interpolation. Hence, the interpolation error is also the numerical integration error. In turn, polynomial interpolation is also an important feature in approximating the solution of a partial differential equation. Hence, interpolation error arises in that context. On the other hand, the estimate for the error magnitude is of little use if we do not have information about f. Indeed, it is not difficult to find functions where M is very large. Nor is it difficult to find functions where the error is large. The following example is a case in point.

We return to the function f(x) = xe^(−x) − 1 and the four points Pᵢ = (xᵢ, yᵢ) with xᵢ = 1, 2, 3 and 4. The polynomial interpolation, p(x), of the points will again provide an approximation of f by a cubic polynomial. As in the case of the Taylor interpolation, it is remarkably close to f. On the other hand, consider the function f(x) = 1/(1 + x²). In this case, pick a finite sequence of points along the graph of f, which are symmetric about the y-axis. Use these points to produce a polynomial interpolation of f. (See Exercise 3 below.) The problem is that the polynomial looks nothing like the function. Further, the more points you choose the less the polynomial resembles f. Looking at the graph of f we see that the function seems not to be a polynomial function. (Note the asymptotic behavior. It is not easy to find a polynomial that can reproduce this type of behavior.) Hence, we should not expect that there is a polynomial function that interpolates it well.

There is another problem with polynomial interpolation. Consider again the function f(x) = 1/(1 + x²) and select four points P₁ = (−4, 1/17), P₂ = (−2, 1/5), P₃ = (2, 1/5), P₄ = (4, 1/17) from the graph of f. Next select P = (0, y) where y ∈ [0.2, 0.3]. Figure 3.1.3 shows the resulting polynomials for three values of y. Suppose that the location of the points came from some measuring or sampling process. Then small errors (as in this case) may yield significantly different results. Looking at the resulting curves we see that the shape of the curves is different. Further, the change in y is magnified about 3 times at p(5): as y moves from 0.3 to 0.2, p(5) moves from 0.11 to −0.19. This is an inherent problem with polynomial interpolation.

The technical term for the problem is that polynomial interpolation lacks local control. In a subsequent section we develop spline curves. These curves were developed precisely to resolve the local control problem.
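The sensitivity to the middle point can be explored with a few lines of Mathematica. The sketch below uses the built-in InterpolatingPolynomial rather than the Vandermonde system, which is a substitution made for brevity; the name p5 is illustrative and x is assumed to be an unassigned symbol.

```mathematica
(* Value at x = 5 of the interpolating polynomial through the four fixed points *)
(* and the movable point (0, y0).  The name p5 is an illustrative assumption.   *)
p5[y0_] :=
  InterpolatingPolynomial[
    {{-4, 1/17}, {-2, 1/5}, {0, y0}, {2, 1/5}, {4, 1/17}}, x] /. x -> 5;

(* Evaluate for the three values of y considered in the text *)
p5 /@ {0.3, 0.25, 0.2}
```

A hand computation with the even quartic through these points gives approximately 0.109, −0.038 and −0.186 for the three values, so a change of 0.1 in y moves p(5) by roughly 0.3.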

Figure 3.1.3: Three alternate images, y = 0.3, 0.25, 0.2; p(5) = 0.11, −0.04, −0.19

In spite of the problem we just noted, polynomial interpolation is an important and productive tool for numerically solving differential equations. When this technique is used, special care is taken to ameliorate the problem we see in Figure 3.1.3.

Because the Taylor expansion requires more information than is usually available, it is often ignored as an interpolation technique. However, there is an important application, which should not be ignored. In the next section we will develop a class of parametric cubic interpolations. Consider the setting where β(t) = (β₁(t), β₂(t)) in R² and each βᵢ is an ordinary cubic polynomial. When β represents a function, then it is possible to solve x = β₁(t) for t and then substitute this in β₂ to yield β = (x, f(x)). However, the resulting function is rarely integrable. On the other hand, you can get values for f and its derivatives. Hence, you can write the cubic Taylor expansion for f and this is easily integrated.

Finally, in Exercise 7 we introduce the idea of piecewise polynomial interpolation. The basic idea of polynomial interpolation is that the more points we interpolate, the better the polynomial will approximate the original function. However, as we add more and more points the degree of the polynomial increases. In piecewise polynomial interpolation, we subdivide the interval into smaller and smaller subintervals while interpolating the function by polynomials of fixed degree on each subinterval.

In document Elements of Numerical Analysis with Mathematica (Page 49-65)