The Derivative
4.4 Basic Results and the Chain Rule
Before constructing the derivative coordinatewise via the Jacobian matrix, we derive some results intrinsically from its characterizing property. We begin by computing two explicit derivatives.
Proposition 4.4.1 (Derivatives of Constant and Linear Mappings).
(1) Let C : A−→ Rm (where A⊂ Rn) be the constant mapping C(x) = c for all x∈ A, where c is some fixed value in Rm. Then the derivative of C at any interior point a of A is the zero mapping.
(2) The derivative of a linear mapping T : Rn −→ Rm at any point a ∈ Rn is again T.
Proof. Both of these results hold essentially by grammar. In general the derivative of a mapping f at a is the linear mapping that well approximates f(a + h) − f(a) for h near 0n. But C(a + h) − C(a) is zero for all h such that a + h ∈ A, so it is well approximated near 0n by the zero mapping on Rn. Similarly, T(a + h) − T(a) is T(h) for all h ∈ Rn, and this linear mapping is well approximated by itself near 0n.
To prove (1) more symbolically, let Z : Rn −→ Rm denote the zero mapping, Z(h) = 0m for all h ∈ Rn. Then
C(a + h) − C(a) − Z(h) = c − c − 0m = 0m for all h such that a + h ∈ A.
Being the zero mapping, C(a + h) − C(a) − Z(h) is crushingly o(h), showing that Z meets the condition to be DCa. And (2) is similar (exercise 4.4.1). ⊓⊔
Of course, differentiation passes through addition and scalar multiplication of mappings.
Proposition 4.4.2 (Linearity of the Derivative). Let f : A −→ Rm (where A ⊂ Rn) and g : B −→ Rm (where B ⊂ Rn) be mappings, and let a be a point of A ∩ B. Suppose that f and g are differentiable at a with derivatives Dfa and Dga. Then
(1) The sum f + g : A ∩ B −→ Rm is differentiable at a with derivative D(f + g)a = Dfa + Dga.
(2) For any α ∈ R, the scalar multiple αf : A −→ Rm is differentiable at a with derivative D(αf)a = αDfa.
The proof is a matter of seeing that the vector space properties of o(h) encode the Sum Rule and Constant Multiple Rule for derivatives.
Proof. Since f and g are differentiable at a, some ball about a lies in A and some ball about a lies in B. The smaller of these two balls lies in A∩ B. That is, a is an interior point of the domain of f + g. With this topological issue settled, proving the proposition reduces to direct calculation. For (1),
(f + g)(a + h)− (f + g)(a) − (Dfa+ Dga)(h)
= f (a + h)− f(a) − Dfa(h) + g(a + h)− g(a) − Dga(h)
= o(h) + o(h) = o(h).
And (2) is similar (exercise 4.4.2). ⊓⊔
You may want to contrast how nicely our topological setup worked at the beginning of this proof to the irritating example that we encountered in connection with the Sum Rule for mappings back in section 2.5.
Elaborate mappings are built by composing simpler ones. The next theorem is the important result that the derivative of a composition is the composition of the derivatives. That is, the best linear approximation of a composition is the composition of the best linear approximations.
Theorem 4.4.3 (Chain Rule). Let f : A −→ Rm (where A ⊂ Rn) be a mapping, let B ⊂ Rm be a set containing f(A), and let g : B −→ Rℓ be a mapping. Thus the composition g ◦ f : A −→ Rℓ is defined. If f is differentiable at the point a ∈ A, and g is differentiable at the point f(a) ∈ B, then the composition g ◦ f is differentiable at the point a, and its derivative there is
\[ D(g \circ f)_a = Dg_{f(a)} \circ Df_a. \]
In terms of Jacobian matrices, since the matrix of a composition is the product of the matrices, the Chain Rule is
(g◦ f)′(a) = g′(f (a)) f′(a).
The fact that we can prove that the derivative of a composition is the composition of the derivatives without an explicit formula for the derivative is akin to the fact in the previous chapter that we could prove that the determinant of the product is the product of the determinants without an explicit formula for the determinant.
Proof. To showcase the true issues of the argument clearly, we reduce the problem to a normalized situation. For simplicity, we first take a = 0n and f (a) = 0m. So we are given that
f (h) = S(h) + o(h), g(k) = T (k) + o(k),
and we need to show that
(g◦ f)(h) = (T ◦ S)(h) + o(h).
Compute that
g(f (h)) = g(Sh + o(h)) by the first given
= T Sh + T (o(h)) + o(Sh + o(h)) by the second.
We know that Tk = O(k) and Sh = O(h), so the previous display gives
(g ◦ f)(h) = (T ◦ S)(h) + O(o(h)) + o(O(h) + o(h)).
Since o(h)⊂ O(h) and O(h) is closed under addition, since o(h) absorbs O(h) from either side, and since o(h) is closed under addition, the error (the last two terms on the right side of the previous display) is
O(o(h)) + o(O(h) + o(h)) = O(o(h)) + o(O(h)) = o(h) + o(h) = o(h).
Therefore we have shown that
(g◦ f)(h) = (T ◦ S)(h) + o(h),
exactly as desired. The crux of the matter is that o(h) absorbs O(h) from either side.
For the general case, no longer assuming that a = 0n and f(a) = 0m, we are given that
f(a + h) = f(a) + S(h) + o(h), g(f(a) + k) = g(f(a)) + T(k) + o(k),
and we need to show that
(g◦ f)(a + h) = (g ◦ f)(a) + (T ◦ S)(h) + o(h).
Compute that
g(f(a + h)) = g(f(a) + Sh + o(h)) by the first given
= g(f(a)) + TSh + T(o(h)) + o(Sh + o(h)) by the second,
and from here the proof that the remainder term is o(h) is precisely as it is in the normalized case. ⊓⊔
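As a numerical sanity check of the matrix form of the Chain Rule, the following minimal sketch compares a finite-difference Jacobian of g ◦ f at a with the product of finite-difference Jacobians of g at f(a) and of f at a. The mappings f and g, the point a, and the helper names jacobian and matmul are all hypothetical choices made for illustration; central differences only approximate the derivative, so the two sides should agree up to a small numerical error rather than exactly.

```python
import math

def jacobian(func, point, step=1e-6):
    """Approximate the Jacobian matrix of func at point by central differences."""
    n = len(point)
    columns = []
    for j in range(n):
        forward = list(point)
        backward = list(point)
        forward[j] += step
        backward[j] -= step
        f_plus = func(forward)
        f_minus = func(backward)
        # Column j holds the approximate partial derivatives in input j.
        columns.append([(p - m) / (2 * step) for p, m in zip(f_plus, f_minus)])
    # Transpose so that rows index outputs and columns index inputs.
    return [[columns[j][i] for j in range(n)] for i in range(len(columns[0]))]

def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))] for i in range(len(left))]

# Hypothetical example mappings f : R^2 -> R^3 and g : R^3 -> R^2.
def f(v):
    x, y = v
    return [x * y, math.sin(x), x + y * y]

def g(w):
    u, v, t = w
    return [u * t, math.exp(v) + u]

def g_after_f(v):
    return g(f(v))

a = [0.7, -0.4]
lhs = jacobian(g_after_f, a)                      # approximates (g o f)'(a)
rhs = matmul(jacobian(g, f(a)), jacobian(f, a))   # approximates g'(f(a)) f'(a)
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(max_gap)
```

The printed gap measures only finite-difference and rounding error; it illustrates the theorem at one point and is not a substitute for the proof.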
Two quick applications of the Chain Rule arise naturally for scalar-valued functions. Given two such functions f and g, not only is their sum defined, but since R is a field (unlike Rm for m > 1), so is their product fg and so is their quotient f/g at points where g is nonzero. With some help from the Chain Rule, the derivative laws for product and quotient follow easily from elementary calculations.
Lemma 4.4.4 (Derivatives of the Product and Reciprocal Functions). Define the product function,
p : R2−→ R, p(x, y) = xy, and define the reciprocal function
r : R− {0} −→ R, r(x) = 1/x.
Then
(1) The derivative of p at any point (a, b)∈ R2 exists and is Dp(a,b)(h, k) = ak + bh.
(2) The derivative of r at any nonzero real number a exists and is Dra(h) = −h/a^2.
Proof. (1) Compute,
p(a + h, b + k)− p(a, b) − ak − bh = (a + h)(b + k) − ab − ak − bh = hk.
By the Size Bounds, |h| ≤ |(h, k)| and |k| ≤ |(h, k)|, so |hk| = |h| |k| ≤ |(h, k)|^2. Since |(h, k)|^2 is ϕ_2(h, k) (where ϕ_e is the example from Proposition 4.2.2), it is o(h, k), and hence so is hk.
(2) is left as exercise 4.4.3. ⊓⊔
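The estimate in part (1) can be watched numerically. The sketch below, under the assumption of an arbitrarily chosen point (a, b) = (3, −2) and increments of the form (h, k) = (t, 2t), computes the ratio of the remainder p(a + h, b + k) − p(a, b) − (ak + bh) to |(h, k)| as t shrinks. The function name product_remainder_ratio is hypothetical.

```python
import math

def product_remainder_ratio(a, b, h, k):
    """Return |p(a+h, b+k) - p(a, b) - (a*k + b*h)| / |(h, k)| for p(x, y) = x*y."""
    remainder = (a + h) * (b + k) - a * b - (a * k + b * h)
    return abs(remainder) / math.hypot(h, k)

a, b = 3.0, -2.0
ratios = []
for exponent in range(1, 6):
    t = 10.0 ** (-exponent)
    # Shrink the increment (h, k) = (t, 2t) toward (0, 0).
    ratios.append(product_remainder_ratio(a, b, t, 2 * t))
print(ratios)
```

Since the remainder is exactly hk, each ratio is at most |(h, k)|, so the ratios should shrink roughly in proportion to t, which is what it means for hk to be o(h, k).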
Proposition 4.4.5 (Multivariable Product and Quotient Rules). Let f : A −→ R (where A ⊂ Rn) and g : B −→ R (where B ⊂ Rn) be functions, and let f and g be differentiable at a. Then
(1) fg is differentiable at a with derivative
\[ D(fg)_a = f(a)\,Dg_a + g(a)\,Df_a. \]
(2) If g(a) ≠ 0 then f/g is differentiable at a with derivative
\[ D\left(\frac{f}{g}\right)_a = \frac{g(a)\,Df_a - f(a)\,Dg_a}{g(a)^2}. \]
Proof. (1) As explained in the proof of Proposition 4.4.2, a is an interior point of the domain A∩ B of fg, so we need only to compute. The product function f g is the composition p◦ (f, g), where (f, g) : A ∩ B −→ R2 is the mapping with component functions f and g. For any h∈ Rn, the Chain Rule and the componentwise nature of differentiation (this was exercise 4.3.3) give
\[ D(fg)_a(h) = D(p \circ (f, g))_a(h) = \bigl(Dp_{(f,g)(a)} \circ D(f, g)_a\bigr)(h) = Dp_{(f(a),g(a))}\bigl(Df_a(h), Dg_a(h)\bigr), \]
and by the previous lemma,
\[ Dp_{(f(a),g(a))}\bigl(Df_a(h), Dg_a(h)\bigr) = f(a)\,Dg_a(h) + g(a)\,Df_a(h) = \bigl(f(a)\,Dg_a + g(a)\,Df_a\bigr)(h). \]
This proves (1) since h is arbitrary. (2) is similar (exercise 4.4.4) but with the wrinkle that one needs to show that since g(a) ≠ 0 and since Dga exists, it follows that a is an interior point of the domain of f/g. Here it is relevant that g must be continuous at a, and so by the Persistence of Inequality principle (Proposition 2.3.10), g is nonzero on some ε-ball at a as desired. ⊓⊔
With the results accumulated so far, we can compute the derivative of any mapping whose component functions are given by rational expressions in its component input scalars. By the componentwise nature of differentiability, it suffices to find the derivatives of the component functions. Since these are compositions of sums, products, and reciprocals of constants and linear functions, their derivatives are calculable with the existing machinery.
Suppose, for instance, that f(x, y) = (x^2 − y)/(y + 1) for all (x, y) ∈ R^2 such that y ≠ −1. Note that every point of the domain of f is an interior point. Rewrite f as
\[ f = \frac{X^2 - Y}{Y + 1} \]
where X is the linear function X(x, y) = x on R^2 and similarly Y(x, y) = y.
Applications of the Chain Rule and virtually every other result on derivatives so far show that at any point (a, b) in the domain of f, the derivative Df(a,b) is given by (justify the steps)
\begin{align*}
Df_{(a,b)}(h, k)
&= \frac{(Y + 1)(a, b)\,D(X^2 - Y)_{(a,b)} - (X^2 - Y)(a, b)\,D(Y + 1)_{(a,b)}}{((Y + 1)(a, b))^2}\,(h, k) \\
&= \frac{(b + 1)\bigl(D(X^2)_{(a,b)} - DY_{(a,b)}\bigr) - (a^2 - b)\bigl(DY_{(a,b)} + D1_{(a,b)}\bigr)}{(b + 1)^2}\,(h, k) \\
&= \frac{(b + 1)\bigl(2X(a, b)\,DX_{(a,b)} - Y\bigr) - (a^2 - b)\,Y}{(b + 1)^2}\,(h, k) \\
&= \frac{(b + 1)(2aX - Y) - (a^2 - b)\,Y}{(b + 1)^2}\,(h, k) \\
&= \frac{(b + 1)(2ah - k) - (a^2 - b)k}{(b + 1)^2} \\
&= \frac{2a}{b + 1}\,h - \frac{a^2 + 1}{(b + 1)^2}\,k.
\end{align*}
In practice this method is too unwieldy for any functions beyond the simplest, and in any case it applies only to mappings with rational component functions.
But on the other hand, there is no reason to expect much in the way of computational results from our methods so far, since we have been studying the derivative based on its intrinsic characterization. In the next section we will construct the derivative in coordinates, enabling us to compute easily by drawing on the results of one-variable calculus.
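The formula obtained for the example f(x, y) = (x^2 − y)/(y + 1) can be spot-checked numerically. The following sketch compares the derived linear mapping, applied to an increment (h, k), against a central difference quotient of f in the direction (h, k). The sample point (a, b) = (1.5, 2), the increment (0.3, −0.8), and the helper names are arbitrary assumptions made for illustration.

```python
def f(x, y):
    """The example function, defined wherever y != -1."""
    return (x * x - y) / (y + 1)

def derivative_formula(a, b, h, k):
    """The linear mapping Df_(a,b) applied to (h, k), as derived in the text."""
    return 2 * a / (b + 1) * h - (a * a + 1) / (b + 1) ** 2 * k

def directional_difference(a, b, h, k, t=1e-6):
    """Central difference quotient of f at (a, b) in the direction (h, k)."""
    return (f(a + t * h, b + t * k) - f(a - t * h, b - t * k)) / (2 * t)

a, b = 1.5, 2.0
h, k = 0.3, -0.8
exact = derivative_formula(a, b, h, k)
approx = directional_difference(a, b, h, k)
print(exact, approx)
```

Agreement of the two printed values up to finite-difference error supports the hand computation at this one point; trying other points and increments is a cheap way to catch sign or algebra slips.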
For another application of the Chain Rule, let A and B be subsets of Rn, and suppose that f : A−→ B is invertible with inverse g : B −→ A. Suppose further that f is differentiable at a ∈ A and that g is differentiable at f(a).
The composition g ◦ f is the identity mapping idA : A −→ A which, being the restriction of a linear mapping (the identity mapping on Rn), has that linear mapping as its derivative at a.
Therefore,
id = D(idA)a= D(g◦ f)a = Dgf (a)◦ Dfa.
This argument partly shows that for invertible f as described, the linear mapping Dfa is also invertible. (A symmetric argument completes the proof by showing that also id = Dfa ◦ Dgf(a).) Since we have methods available to check the invertibility of a linear map, we can apply this criterion once we know how to compute derivatives.
Not too much should be made of this result, however; its hypotheses are too strong. Even in the one-variable case the function f(x) = x^3 from R to R is invertible and yet has the noninvertible derivative 0 at x = 0. (The inverse, g(x) = ∛x, is not differentiable at 0, so the conditions above are not met.) Besides, we would prefer a converse statement, that if the derivative is invertible then so is the mapping. The converse statement is not true, but we will see in chapter 5 that it is locally true, i.e., it is true in the small.
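The one-variable example can be explored numerically. In the sketch below (helper names and sample points are illustrative assumptions), the product of difference-quotient slopes g′(f(a)) f′(a) is computed at a = 2, where both functions are differentiable, and then the behavior at 0 is examined, where the slope of f vanishes and the difference quotients g(t)/t of the cube root grow as t shrinks.

```python
import math

def f(x):
    return x ** 3

def g(y):
    """Real cube root, the inverse of f on all of R."""
    return math.copysign(abs(y) ** (1.0 / 3.0), y)

def central_difference(func, point, step=1e-6):
    """Central difference approximation to the derivative of func at point."""
    return (func(point + step) - func(point - step)) / (2 * step)

# Away from 0, the derivatives of g at f(a) and of f at a multiply to about 1.
a = 2.0
product_at_a = central_difference(g, f(a)) * central_difference(f, a)
print(product_at_a)

# At 0 the slope of f is numerically zero, and the difference quotients
# g(t)/t of the inverse keep growing as t shrinks, so g has no derivative there.
f_slope_at_zero = central_difference(f, 0.0)
g_quotients_at_zero = [g(t) / t for t in (1e-2, 1e-4, 1e-6)]
print(f_slope_at_zero, g_quotients_at_zero)
```

The growing quotients are numerical evidence only, but they match the exact computation g(t)/t = t^(−2/3), which is unbounded as t tends to 0.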
Exercises
4.4.1. Prove part (2) of Proposition 4.4.1.
4.4.2. Prove part (2) of Proposition 4.4.2.
4.4.3. Prove part (2) of Lemma 4.4.4.
4.4.4. Prove the Quotient Rule.
4.4.5. Let f (x, y, z) = xyz. Find Df(a,b,c) for arbitrary (a, b, c) ∈ R3. (Hint:
f is the product XY Z where X is the linear function X(x, y, z) = x and similarly for Y and Z.)
4.4.6. Define f(x, y) = xy^2/(y − 1) on {(x, y) ∈ R^2 : y ≠ 1}. Find Df(a,b)
where (a, b) is a point in the domain of f .
4.4.7. (A generalization of the product rule.) Recall that a function f : Rn× Rn −→ R
is called bilinear if for all x, x′, y, y′∈ Rn and all α∈ R,
f (x + x′, y) = f (x, y) + f (x′, y), f (x, y + y′) = f (x, y) + f (x, y′), f (αx, y) = αf (x, y) = f (x, αy).
(a) Show that if f is bilinear then f (h, k) is o(h, k).
(b) Show that if f is bilinear then f is differentiable with Df(a,b)(h, k) = f (a, k) + f (h, b).
(c) What does this exercise say about the inner product?
4.4.8. (A bigger generalization of the product rule.) A function f : Rn× · · · × Rn−→ R
(there are k copies of Rn) is called multilinear if for each j∈ {1, · · · , k}, for all x1,· · · , xj, x′j,· · · , xk ∈ Rn and all α∈ R,
f(x1, · · · , xj + x′j, · · · , xk) = f(x1, · · · , xj, · · · , xk) + f(x1, · · · , x′j, · · · , xk),
f(x1, · · · , αxj, · · · , xk) = αf(x1, · · · , xj, · · · , xk).
(a) Show that if f is multilinear and a1,· · · , ak, h1,· · · , hk ∈ Rn then for i, j ∈ {1, · · · , k} (distinct), f(a1,· · · , hi,· · · , hj,· · · , ak) is o(h1,· · · , hk).
(Use the previous problem.)
(b) Show that if f is multilinear then f is differentiable with
\[ Df_{(a_1, \dots, a_k)}(h_1, \dots, h_k) = \sum_{j=1}^{k} f(a_1, \dots, a_{j-1}, h_j, a_{j+1}, \dots, a_k). \]
(c) When k = n, what does this exercise say about the determinant?