Feedforward Deep Networks
6.3 Flow Graphs and Back-Propagation
The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually, it refers only to the method for computing gradients in such networks. Furthermore, it is generally understood as something very specific to multi-layer neural networks, but once its derivation is understood, it can easily be generalized to arbitrary functions (for which computing a gradient is meaningful). We describe this generalization here, focusing on the case of interest in machine learning, where the output of the function to differentiate (e.g., the loss $L$ or the training criterion $C$) is a scalar and we are interested in its derivative with respect to a set of parameters (considered to be the elements of a vector $\theta$). It can be readily proven that the back-propagation algorithm has optimal computational complexity, in the sense that there is no algorithm that can compute the gradient faster (in the $O(\cdot)$ sense, i.e., up to an additive and multiplicative constant).
6.3.1 Chain Rule
In order to apply the back-propagation algorithm, we take advantage of the chain rule:
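$$\frac{\partial C(g(\theta))}{\partial \theta} = \frac{\partial C(g(\theta))}{\partial g(\theta)} \, \frac{\partial g(\theta)}{\partial \theta} \tag{6.2}$$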
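which also works when $C$, $g$, or $\theta$ are vectors rather than scalars (in which case the corresponding partial derivatives are understood as Jacobian matrices of the appropriate dimensions). In the purely scalar case we can understand the chain rule as follows: a small change in $\theta$ will propagate into a small change in $g(\theta)$ by getting multiplied by $\frac{\partial g(\theta)}{\partial \theta}$. Similarly, a small change in $g(\theta)$ will propagate into a small change in $C(g(\theta))$ by getting multiplied by $\frac{\partial C(g(\theta))}{\partial g(\theta)}$. Hence the ratio of the change in $C(g(\theta))$ to the change in $\theta$ is the product of these partial derivatives. The partial derivative measures the locally linear influence of a variable on another. Now, if $g$ is a vector, we can rewrite the above as follows:
$$\frac{\partial C(g(\theta))}{\partial \theta} = \sum_i \frac{\partial C(g(\theta))}{\partial g_i(\theta)} \, \frac{\partial g_i(\theta)}{\partial \theta}$$
which sums over the influences of $\theta$ on $C(g(\theta))$ through all the intermediate variables $g_i(\theta)$.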
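The vector form of the chain rule can be checked numerically. The sketch below is only an illustration: the functions `g` and `C` are hypothetical choices made up for this example (they do not come from the text), and the check compares the sum over intermediate variables against a central finite difference.

```python
import math

def g(theta):
    # Hypothetical intermediate vector g(theta) = (theta^2, sin(theta)).
    return [theta ** 2, math.sin(theta)]

def dg_dtheta(theta):
    # Derivative of each component g_i with respect to the scalar theta.
    return [2.0 * theta, math.cos(theta)]

def C(gv):
    # Hypothetical scalar criterion C(g) = g_1 * g_2 + g_2^2.
    return gv[0] * gv[1] + gv[1] ** 2

def dC_dg(gv):
    # Partial derivative of C with respect to each g_i.
    return [gv[1], gv[0] + 2.0 * gv[1]]

def chain_rule_gradient(theta):
    # Sum over i of (dC/dg_i) * (dg_i/dtheta).
    gv = g(theta)
    return sum(dc * dg for dc, dg in zip(dC_dg(gv), dg_dtheta(theta)))

def finite_difference(theta, eps=1e-6):
    # Central difference estimate of dC(g(theta))/dtheta.
    return (C(g(theta + eps)) - C(g(theta - eps))) / (2.0 * eps)
```

Calling `chain_rule_gradient(0.7)` and `finite_difference(0.7)` should give values that agree to within the finite-difference error.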
6.3.2 Back-Propagation in a General Flow Graph
More generally, we can think about decomposing a function $C(\theta)$ into a more complicated graph of computations. This graph is called a flow graph. Each node $u_i$ of the graph denotes a numerical quantity that is obtained by performing a computation requiring the values $u_j$ of other nodes, with $j < i$. The nodes satisfy a partial order which dictates in what order the computation can proceed. In practical implementations of such functions (e.g., with the criterion $C(\theta)$ or its value estimated on a minibatch), the final computation is obtained as the composition of simple functions taken from a given set (such as the set of numerical operations that the numpy library can perform on arrays of numbers).
We will define back-propagation in a general flow graph, using the following generic notation: $u_i = f_i(a_i)$, where $a_i$ is a list of arguments for the application of $f_i$ to the values $u_j$ for the parents of $i$ in the graph: $a_i = (u_j)_{j \in \text{parents}(i)}$.
The overall computation of the function represented by the flow graph can thus be summarized by the forward computation algorithm, Algorithm 6.1.
Algorithm 6.1 Flow graph forward computation. Each node computes a numerical value $u_i$ by applying a function $f_i$ to its argument list $a_i$, which comprises the values of previous nodes $u_j$, $j < i$, with $j \in \text{parents}(i)$. The input to the flow graph is the vector $x$, and is set into the input nodes $u_1, \ldots, u_M$; the output is read from the last node $u_N$.
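The forward computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the graph representation (a list of `(f, parents)` pairs with 0-based indices), the function name `forward`, and the three-node example graph are all assumptions introduced here.

```python
import math

def forward(x, nodes):
    """Compute all node values of a flow graph.

    x     : list of input values, stored in the first M nodes.
    nodes : list of (f, parents) pairs for the remaining nodes, in an order
            compatible with the graph; parents are 0-based indices of
            earlier nodes, and f maps the argument list a_i to a number.
    Returns the list u of all node values; u[-1] is the output node.
    """
    u = list(x)                          # input nodes receive x
    for f, parents in nodes:
        a = [u[j] for j in parents]      # a_i = (u_j) for j in parents(i)
        u.append(f(a))                   # u_i = f_i(a_i)
    return u

# Hypothetical example graph: u1 = x, u2 = u1^2, u3 = exp(u2 + u1).
example_nodes = [
    (lambda a: a[0] ** 2, [0]),
    (lambda a: math.exp(a[0] + a[1]), [1, 0]),
]
```

For instance, `forward([0.5], example_nodes)[-1]` evaluates the output node of the example graph at input 0.5.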
In addition to having some code that tells us how to compute $f_i(a_i)$ for some values in the vector $a_i$, we also need some code that tells us how to compute its partial derivatives, $\frac{\partial f_i(a_i)}{\partial a_{ik}}$, with respect to any immediate argument $a_{ik}$. Let $k = \pi(i, j)$ denote the index of $u_j$ in the list $a_i$. Note that $u_j$ could influence $u_i$ through multiple paths. Whereas $\frac{\partial u_i}{\partial u_j}$ would denote the total gradient adding up all of these influences, $\frac{\partial f_i(a_i)}{\partial a_{ik}}$ only denotes the derivative of $f_i$ with respect to its specific $k$-th argument, keeping the other arguments fixed, i.e., only considering the influence through the arc from $u_j$ to $u_i$. In general, when manipulating partial derivatives, one should keep clear in one's mind (and implementation) the notational and semantic distinction between a partial derivative that includes all paths and one that includes only the immediate effect of a function's argument on the function output, with the other arguments considered fixed.
For example, consider $f_3(a_{3,1}, a_{3,2}) = e^{a_{3,1} + a_{3,2}}$ and $f_2(a_{2,1}) = a_{2,1}^2$, while $u_3 = f_3(u_2, u_1)$ and $u_2 = f_2(u_1)$, illustrated in Figure TODO. The direct derivative of $f_3$ with respect to its argument $a_{3,2}$ is $\frac{\partial f_3}{\partial a_{3,2}} = e^{a_{3,1} + a_{3,2}}$, while if we consider the variables $u_3$ and $u_1$ to which these correspond, there are two paths from $u_1$ to $u_3$, and we obtain as derivative the sum of partial derivatives over these two paths, $\frac{\partial u_3}{\partial u_1} = e^{u_1 + u_2}(1 + 2u_1)$. The results are different because $\frac{\partial u_3}{\partial u_1}$ involves not just the direct dependency of $u_3$ on $u_1$ but also the indirect dependency through $u_2$.
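The difference between the direct and the total derivative in this example can be verified numerically. The sketch below uses the functions $f_2$ and $f_3$ from the text; the helper names and the finite-difference step `eps` are assumptions made for illustration.

```python
import math

def f2(a21):
    return a21 ** 2

def f3(a31, a32):
    return math.exp(a31 + a32)

def direct_derivative(u1):
    # d f3 / d a_{3,2}, with the first argument a_{3,1} = u2 held fixed.
    u2 = f2(u1)
    return math.exp(u2 + u1)

def total_derivative(u1):
    # d u3 / d u1, summing the direct path and the path through u2.
    u2 = f2(u1)
    return math.exp(u1 + u2) * (1.0 + 2.0 * u1)

def numeric_direct(u1, eps=1e-6):
    # Perturb only the second argument of f3; u2 is frozen.
    u2 = f2(u1)
    return (f3(u2, u1 + eps) - f3(u2, u1 - eps)) / (2.0 * eps)

def numeric_total(u1, eps=1e-6):
    # Perturb u1 everywhere, so that u2 changes as well.
    return (f3(f2(u1 + eps), u1 + eps) - f3(f2(u1 - eps), u1 - eps)) / (2.0 * eps)
```

At $u_1 = 0.5$ the total derivative is twice the direct one, since $1 + 2u_1 = 2$ there.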
Armed with this understanding, we can define the back-propagation algorithm as follows, in Algorithm 6.2, which would be computed after the forward propagation (Algorithm 6.1) has been performed. Note the recursive nature of the application of the chain rule in Algorithm 6.2: we compute the gradient on node $j$ by re-using the already computed gradient for children nodes $i$, starting the recurrence from the trivial $\frac{\partial u_N}{\partial u_N} = 1$ that sets the gradient for the output node.
Algorithm 6.2 Flow graph back-propagation computation. See the forward propagation in a flow graph (Algorithm 6.1, to be performed first) and the required data structure.
In addition, a quantity $\frac{\partial u_N}{\partial u_i}$ needs to be stored (and computed) at each node, for the purpose of gradient back-propagation. Below, the notation $\pi(i, j)$ is the index of $u_j$ as an argument to $f_i$. The back-propagation algorithm efficiently computes $\frac{\partial u_N}{\partial u_i}$ for all $i$'s (traversing the graph backwards this time), and in particular we are interested in the derivatives of the output node $u_N$ with respect to the "inputs" $u_1, \ldots, u_M$. The cost of the overall computation is proportional to the number of arcs in the graph, assuming that the partial derivative associated with each arc requires a constant time. This is of the same order as the number of computations for the forward propagation.
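$$\frac{\partial u_N}{\partial u_j} = \sum_{i \,:\, j \in \text{parents}(i)} \frac{\partial u_N}{\partial u_i} \, \frac{\partial f_i(a_i)}{\partial a_{i,\pi(i,j)}}$$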
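A minimal Python sketch of back-propagation in a flow graph follows. It is an illustration under stated assumptions, not the book's listing: the graph representation (a list of `(f, grad_f, parents)` triples with 0-based indices) and all function names are hypothetical. The loop visits each node $i$ from last to first and pushes its contribution to each parent $j$, which accumulates the same sum over children as the recurrence and touches each arc exactly once.

```python
import math

def forward(x, nodes):
    # Forward pass: inputs first, then each node from its parents' values.
    u = list(x)
    for f, _grad_f, parents in nodes:
        u.append(f([u[j] for j in parents]))
    return u

def backward(x, nodes):
    """Return the derivatives of the output node with respect to the inputs.

    nodes : list of (f, grad_f, parents); grad_f(a) returns the list of
            direct partial derivatives d f_i / d a_ik, one per argument.
    """
    M = len(x)
    u = forward(x, nodes)
    N = len(u)
    grad = [0.0] * N
    grad[N - 1] = 1.0                    # d u_N / d u_N = 1
    # Children always have larger indices than their parents, so grad[i]
    # is complete by the time node i is visited in descending order.
    for i in range(N - 1, M - 1, -1):
        _f, grad_f, parents = nodes[i - M]
        a = [u[j] for j in parents]
        partials = grad_f(a)             # direct derivatives on arcs into i
        for k, j in enumerate(parents):
            grad[j] += grad[i] * partials[k]
    return grad[:M]

# Hypothetical example graph: u2 = u1^2, u3 = exp(u2 + u1).
example_nodes = [
    (lambda a: a[0] ** 2, lambda a: [2.0 * a[0]], [0]),
    (lambda a: math.exp(a[0] + a[1]),
     lambda a: [math.exp(a[0] + a[1])] * 2, [1, 0]),
]
```

On the example graph, `backward([0.5], example_nodes)` returns a one-element list holding the total derivative of $u_3$ with respect to $u_1$, which matches $e^{u_1 + u_2}(1 + 2u_1)$.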
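This recursion is a form of efficient factorization of the total gradient, i.e., it is an application of the principles of dynamic programming. Indeed, the derivative of the output node with respect to any node can also be written down in this intractable form:
$$\frac{\partial u_N}{\partial u_i} = \sum_{\text{paths } u_{k_1}, \ldots, u_{k_n} \,:\, k_1 = i,\ k_n = N} \ \prod_{m=2}^{n} \frac{\partial f_{k_m}(a_{k_m})}{\partial a_{k_m, \pi(k_m, k_{m-1})}}$$
where the paths $u_{k_1}, \ldots, u_{k_n}$ go from node $k_1 = i$ to the final node $k_n = N$ in the flow graph. Computing the sum as above would be intractable because the number of possible paths can be exponential in the depth of the graph. The back-propagation algorithm is efficient because, like dynamic programming, it re-uses partial sums associated with the gradients on intermediate nodes.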
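To make the gap between the number of paths and the number of arcs concrete, the sketch below builds a hypothetical layered graph (one input, `depth` fully connected layers of `width` nodes, one output) and counts both. The graph shape and the function names are assumptions chosen for illustration; the path count itself is obtained with the same reuse of partial results that back-propagation relies on.

```python
def layered_graph(width, depth):
    """Return the parents list of each node in a layered flow graph.

    Node 0 is the input, followed by `depth` layers of `width` nodes, each
    fully connected to the previous layer, and finally one output node.
    """
    parents = [[]]                       # node 0: the input, no parents
    prev = [0]
    for _ in range(depth):
        layer = []
        for _ in range(width):
            parents.append(list(prev))
            layer.append(len(parents) - 1)
        prev = layer
    parents.append(list(prev))           # output node
    return parents

def count_arcs(parents):
    return sum(len(p) for p in parents)

def count_paths(parents):
    # Number of distinct paths from node 0 to the last node, accumulated
    # backwards from the output, one update per arc.
    n = len(parents)
    paths_to_output = [0] * n
    paths_to_output[n - 1] = 1
    for i in range(n - 1, 0, -1):
        for j in parents[i]:
            paths_to_output[j] += paths_to_output[i]
    return paths_to_output[0]
```

With `width=3` and `depth=4` the graph has 33 arcs but 81 paths from input to output; the path count grows as `width ** depth`, while the arc count grows only linearly in the depth.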
Although the above was stated as if the $u_i$'s were scalars, exactly the same procedure can be run with $u_i$'s being tuples of numbers (more easily represented by vectors). In that case the equations remain valid, and the multiplication of scalar partial derivatives becomes the multiplication of a row vector of gradients $\frac{\partial u_N}{\partial u_i}$ with a Jacobian of partial derivatives associated with the $j \to i$ arc of the graph, $\frac{\partial f_i(a_i)}{\partial a_{i,\pi(i,j)}}$. In the case where minibatches are used during training, $u_i$ would actually be a whole matrix (the extra dimension being for the examples in the minibatch). This would then turn the basic computation into matrix-matrix products rather than matrix-vector products, and the former can be computed much more efficiently than a sequence of matrix-vector products (e.g., with the BLAS library), especially on modern computers and GPUs, which rely more and more on parallelization through many cores.
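The relation between the per-example and the minibatch computation can be sketched as follows. The sketch assumes, for simplicity, that the Jacobian on the arc is the same for every example (as for a hypothetical linear node), and it uses a plain-Python matrix product, so it only shows that the two computations agree, not the speed advantage of an optimized library.

```python
def matmul(A, B):
    """Plain-Python product of A (n x k) and B (k x m), given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def vjp_single(grad_ui, jacobian):
    # Row vector d u_N / d u_i (length n) times the Jacobian (n x m)
    # gives the row vector contribution to d u_N / d u_j (length m).
    return matmul([grad_ui], jacobian)[0]

def vjp_minibatch(grad_Ui, jacobian):
    # Each row of grad_Ui is the gradient row vector for one example, so a
    # single matrix-matrix product handles the whole minibatch.
    return matmul(grad_Ui, jacobian)

# Hypothetical Jacobian (2 x 3) and gradients for a minibatch of 3 examples.
jacobian = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]
grads = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]
```

Here `vjp_minibatch(grads, jacobian)` returns the same rows as calling `vjp_single` once per example.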
More implementation issues regarding the back-propagation algorithm are discussed in Chapters 18 and 19, regarding respectively GPU implementations (Section 18.1) and debugging tricks (Section 19.5.2). Section 19.5.3 also discusses a natural generalization of the back-propagation algorithm in which one manipulates not numbers but symbolic expressions, i.e., turning a program that performs a computation (decomposed as a flow graph expressed in the forward propagation algorithm) into another program (another flow graph) that performs gradient computation (i.e., automatically generating the back-propagation program, given the forward propagation graph). Using such symbolic automatic differentiation procedures (implemented, for example, in the Theano library), one can compute second derivatives and other kinds of derivatives, as well as apply numerical stabilization and simplifications on a flow graph.