Flow Graphs and Back-Propagation - Feedforward Deep Networks

Feedforward Deep Networks

6.3 Flow Graphs and Back-Propagation

The term back-propagation is often misunderstood as meaning the whole learning algo-rithm for multi-layer neural networks. Actually it just means the method for computing gradients in such networks. Furthermore, it is generally understood as something very speciﬁc to multi-layer neural networks, but once its derivation is understood, it can eas-ily be generalized to arbitrary functions (for which computing a gradient is meaningful), and we describe this generalization here, focusing on the case of interest in machine learning where the output of the function to diﬀerentiate (e.g., the loss L or the train-ing criterion C) is a scalar and we are interested in its derivative with respect to a set of parameters (considered to be the elements of a vector ). It can be readily provenθ that the back-propagation algorithm has optimal computational complexity in the sense there is no algorithm that can compute the gradient faster (in theO( ) sense, i.e, up to· an additive and multiplicative constant).

6.3.1 Chain Rule

In order to apply the back-propagation algorithm, we take advantage of the chain rule:

∂C g θ( ( ))

∂θ =∂C g θ( ( ))

∂g θ( )

∂θ (6.2)

which works also when C g, or θ are vectors rather than scalars (in which case the corresponding partial derivatives are understood as Jacobian matrices of the appropriate dimensions). In the purely scalar case we can understand the chain rule as follows: a small change in θ will propagate into a small change in ( ) by getting multiplied byg θ

∂g θ( ) of these partial derivatives. The partial derivative measures the locally linear inﬂuence of a variable on another. Now, if g is a vector, we can rewrite the above as follows:

∂C g θ( ( ))

which sums over the inﬂuences of θ on C g θ( ( )) through all the intermediate variables gi( ).θ

6.3.2 Back-Propagation in a General Flow Graph

More generally, we can think about decomposing a functionC θ( ) into a more complicated graph of computations. This graph is called a ﬂow graph. Each node u_i of the graph denotes a numerical quantity that is obtained by performing a computation requiring the values u_j of other nodes, with j < i. The nodes satisfy a partial order which

dictates in what order the computation can proceed. In practical implementations of such functions (e.g. with the criterion C θ( ) or its value estimated on a minibatch), the ﬁnal computation is obtained as the composition of simple functions taken from a given set (such as the set of numerical operations that the numpylibrary can perform on arrays of numbers).

We will deﬁne the back-propagation in a general ﬂow-graph, using the following generic notation: u_i= f_i(a_i), where a_i is a list of arguments for the application of f_i to the values uj for the parents of in the graph:i ai = (uj)_j∈parents( )_i .

The overall computation of the function represented by the ﬂow graph can thus be summarized by the forward computation algorithm, Algorithm TODO

Algorithm 6.1 Flow graph forward computation. Each node computes numerical value u_iby applying a function f_i to its argument list a_ithat comprises the values of previous nodes u_j,j < i, withj∈parents( ). The input to the ﬂow graph is the vector , and isi x

In addition to having some code that tells us how to compute f_i(a_i) for some val-ues in the vector a_i, we also need some code that tells us how to compute its partial derivatives, ^∂f_∂aⁱ^(aⁱ⁾

ik with respect to any immediate argument a_ik. Letk= (π i, j) denote the index of u_j in the list a_i. Note that u_j could inﬂuence u_i through multiple paths.

Whereas ^∂u_∂uⁱ

j would denote the total gradient adding up all of these inﬂuences,^∂f_∂aⁱ^(aⁱ⁾

only denotes the derivative of f_i with respect to its specific -th argument, keeping thek other arguments fixed, i.e., only considering the influence through the arc from uj to ui. In general, when manipulating partial derivatives, one should keep clear in one’s mind (and implementation) the notational and semantic distinction between a partial derivative that includes all paths and one that includes only the immediate effect of a function’s argument on the function output, with the other arguments considered fixed.

For example consider f3(a3 1, , a3 2, ) = e^a^{3 1}^, ^+a^{3 2}^, and f2(a2 1, ) = a²_{2 1}_, , while u3 = f3(u2, u1) and u2= f2(u1), illustrated in Figure TODO . The direct derivative of f3 with respect to its argument a3 2, is _∂a^∂f_{3 2}³_, = e^a^{3 1}^, ^+a^{3 2}^, while if we consider the variables u3and u1 to which these correspond, there are two paths from u₁ to u₃, and we obtain as derivative the sum of partial derivatives over these two paths,^∂u_∂u³

1 = e^u¹^+u²(1 + 2u₁). The results are diﬀerent because ^∂u_∂u³

1 involves not just the direct dependency of u₁ on u₃ but also the indirect dependency through u₂.

Armed with this understanding, we can deﬁne the back-propagation algorithm as follows, in Algorithm 6.2, which would be computed after the forward propagation (Algorithm 6.1) has been performed. Note the recursive nature of the application of the chain rule, in Algorithm 6.2: we compute the gradient on node by re-using the alreadyj computed gradient for children nodes , starting the recurrence from the triviali ^∂u_∂u^N

N = 1

that sets the gradient for the output node.

Algorithm 6.2 Flow graph back-propagation computation. See the forward propgation in a ﬂow-graph (Algorithm 6.1, to be performed ﬁrst) and the required data structure.

In addition, a quantity ^∂u_∂u^N

i needs to be stored (and computed) at each node, for the purpose of gradient back-propagation. Below the notation (π i, j) is the index of u_j as an argument to f_i. The back-propagation algorithm eﬃciently computes^∂u_∂u^N

i for all ’si (traversing the graph backwards this time), and in particular we are interested in the derivatives of the output node u_N with respect to the “inputs” u₁. . . u_M. The cost of the overall computation is proportional to the number of arcs in the graph, assuming that the partial derivative associated with each arc requires a constant time. This is of the same order as the number of computations for the forward propagation.

∂uN

This recursion is a form of eﬃcient factorization of the total gradient, i.e., it is an application of the principles of dynamic programming. Indeed, the derivative of the output node with respect to any node can also be written down in this intractable form:

∂u_N graph. Computing the sum as above would be intractable because the number of possible paths can be exponential in the depth of the graph. The back-propagation algorithm is eﬃcient because, like dynamic programming, it re-uses partial sums associated with the gradients on intermediate nodes.

Although the above was stated as if the u_i’s were scalars, exactly the same procedure can be run with u_i’s being tuples of numbers (more easily represented by vectors). In that case the equations remain valid, and the multiplication of scalar partial derivatives becomes the multiplication of a row vector of gradients ^∂u_∂u^N

i with a Jacobian of partial derivatives associated with the j → i arc of the graph, _∂a^∂fⁱ^(aⁱ⁾

i,π i,j( ). In the case where minibatches are used during training, u_i would actually be a whole matrix (the extra dimension being for the examples in the minibatch). This would then turn the basic computation into matrix-matrix products rather than matrix-vector products, and the

former can be computed much more eﬃciently than a sequence of matrix-vector products (e.g. with the BLAS library), especially so on modern computers and GPUs, that rely more and more on parallelization through many cores.

More implementation issues regarding the back-propagation algorithm are discussed in Chapters 18 and 19, regarding respectively GPU implementations (Section 18.1) and debugging tricks (Section 19.5.2). Section 19.5.3 also discusses a natural general-ization of the back-propagation algorithm in which one manipulates not numbers but symbolic expressions, i.e., turning a program that performs a computation (decomposed as a flow graph expressed in the forward propagation algorithm) into another program (another flow graph) that performs gradient computation (i.e. automatically generat-ing the back-propagation program, given the forward propagation graph). Usgenerat-ing such symbolic automatic differentiation procedures (implemented for example in the Theano library⁸), one can compute second derivatives and other kinds of derivatives, as well as apply numerical stabilization and simplifications on a flow graph.

In document Deep Learning.pdf (Page 80-83)