BACK PROPAGATION - Exploring model parallelism in distributed scheduling of neural network fram

In this chapter, we discuss how to account for back propagation in the algorithms proposed in chapter 4 and prove certain makespan and approximation ratio guarantees for it.

5.1 DEFINITION AND NEED FOR BACK PROPAGATION

Definition In neural network training, the back propagation algorithm (a) propagates the total error backwards through the Neural Network (NN) layers (b) computes the gradient of each weight and bias in the network, using the propagated error (c) applies the gradients calculated in (b) to their corresponding weights and biases (d) eventually minimizes the total error of the NN.

Need for BP The back propagation algorithm is needed to train multi-layer NNs. It corrects the weights and biases of every layer in the NN to reduce the error in each iteration of the training process.

5.2 ACCOUNTING FOR BACK PROPAGATION

The scheduling algorithms discussed in chapter 3 are only valid for directed acyclic graphs. The process of backpropagation in NN training introduces some cycles and is computation- ally intensive. Since BP is non trivial and accounts for a significant chunk of the makespan, we account for it separately.

5.3 SCHEDULING STRATEGY

Each operation node in TensorFlow (TF) has an associated gradient node that is responsible for the calculation of gradient for that operation. Similarly, each variable has an associated GradientDescent node whose primary purpose is the application of gradient and updating the variable. Together, these two nodes are primarily responsible for the BP process. Between these two functionalities, gradient computation accounts for the majority of the complexity.

If we collocate the gradient nodes with their corresponding operation nodes, we discover that the resulting backward dependency graph G0 is the reverse of the forward propagation graph G, i.e., same vertices and reverse edges. TF provides a very straightforward interface to

specify these collocations. Unlike the gradient nodes, which must wait for their predecessors from G0 before executing, the GradientDescent nodes have no such dependencies. The gradient application can be carried out at any time in the BP process after the gradient has been calculated for a node, including at the end. Given the lack of GradientDescent node inter-dependencies and their less intensive nature, their contribution to the makespan can be safely ignored for the purposes of our proofs. Their withdrawal from consideration serves another critical purpose. Without the GradientDescent nodes, the connections between the forward and backward computation graphs (G and G0) are severed and the entire NN graph can be considered a DAG. Under this assumption, our previously discussed algorithms from chapter 4 can be successfully used.

5.4 GUARANTEES

Given the assumptions stated above, we prove certain guarantees for the BP phase in the remainder of this chapter. First, we show that the makespan of the backward pass is within C0 (as defined in section 5.4.1) times the makespan of the forward pass. Next,

we prove that if C (as defined in section 5.4.1) is constant across the NN, the initially established approximation ratio remains unchanged after back propagation. It is worth noting that while the makespan guarantee only holds if the gradient nodes are collocated with operations, in practice, it is possible to apply DAG algorithms simply after the removal of the GradientDescent nodes.

5.4.1 Preliminaries

In this subsection, we define a schedule and enumerate the necessary and sufficient condi- tions for a schedule to be legal.

Schedule We define a schedule S for graph G as a list that associates each node in G with a machine p on which it will be executed and a starting time si.

Backward Schedule Denote the makespan of schedule S as T . We define forward schedule S’s corresponding backward schedule S0 as a schedule where for any node i in G, (a) i is associated with the same machine as in S (b) i has starting time s0_i = C0(T − si− pi), where

Legal Schedule Let’s denote the computation time of node i in dependency graph G as pi and the communication time between any two tasks i and j in G to be cij.

Schedule S is legal if and only if, 1. si ≥ 0

2. si + pi ≤ sj or sj + pj ≤ si when task i and task j are on the same machine. This

guarantees that tasks i and j do not overlap.

3. si+ pi ≤ sj when task i and task j are on the same machine and there is an edge i → j

in the dependency graph G. This guarantees that the precedence relationship between tasks i and j is honored, when they are on the same machine.

4. si+ pi+ cij ≤ sj when task i and task j are on different machine, and there is an edge

i → j in the dependency graph G.cThis guarantees that the precedence relationship between tasks i and j is honored, when they are on different machines.

Backward dependency graph The backward dependency graph G0 is the reverse of the forward propagation graph G, i.e., G0 shares the same vertices with G but has reversed edges. We denote the computation time of a node j in G0 as p0_j and the communication time between any nodes j and i in G0 as c0_ji.

NN proportionality constant We define the NN proportionality constant C0for forward

graph G and backward graph G0 as the maximum ratio between (a) corresponding backward pass edge weight and forward pass edge weight and (b) corresponding backward pass node weight and forward pass node weight.

C0 = max maxj∈V (G) p0_j pj , maxi→j∈E(G) c0_ji cij

NN proportionality function We define the NN proportionality function C as follows (a) For any node i ∈ V (G), let C(i) = p0i

pi (b) for any edge i → j ∈ E(G), let C(i → j) =

c0_ji cij.

We say function C is constant if for all i ∈ V (G), C(i) = C0 and for all i → j ∈ E(G),

C(i → j) = C0. Essentially, C0 is the max over all values of C.

5.4.2 Theorems

Theorem 5.1 Given a schedule S on G, S’s corresponding backward schedule S0 is legal and makespan(S0) ≤ C0· makespan(S)

Proof We know that ∀i → j ∈ E(G), c0_ji ≤ C0cij and ∀j ∈ V (G), p0j ≤ C0pj

Denote the makespan of schedule S as T . We prove below that the schedule S0 is legal:

1. We know T ≤ si+ pi∀i

Therefore s0_i = C0(T − si− pi) ≥ 0 ∀i ∈ V (G)

2. Let i, j be 2 arbitrary tasks assigned to the same machine.

Since S is a legal schedule, we know that si+ pi ≤ sj or sj + pj ≤ si.

T − si− pi ≥ T − sj− pj+ pj or T − sj− pj ≥ T − si− pi+ pi. C0(T − si− pi) ≥ C0(T − sj− pj) + C0pj or C0(T − sj− pj) ≥ C0(T − si− pi) + C0pi. s0_i ≥ s0 j + p 0 j or s 0 j ≥ s 0 i+ p 0 i.

3. Let j → i be an arbitrary edge in G0 and i, j are assigned to the same machine in S. i → j ∈ E(G) <=> j → i ∈ E(G0).

Since S is a legal schedule, we know that i → j ∈ E(G), si+ pi ≤ sj.

C0(T − si− pi) ≥ C0(T − sj − pj) + C0pj.

s0_i ≥ s0

j + p0j in G0.

4. Let j → i be an arbitrary edge in G0 and i, j are assigned to different machines. i → j ∈ E(G) <=> j → i ∈ E(G0)

Since S is a legal schedule, we know that i → j ∈ E(G), si+ pi+ cij ≤ sj.

C0(T − si− pi) − C0ci→j ≥ C0(T − sj − pj) + C0pj. C0(T − si− pi) ≥ C0(T − sj − pj) + C0pj + C0cij. s0_i ≥ s0 j + p 0 j + c 0 ji in G 0_.

S0 satisfies all condition for a legal schedule and therefore is legal.

makespan(S0) = maxi(s0i+ p0i) = maxi(C0(T − si)) ≤ C0T = C0· makespan(S)

Theorem 5.2 When NN proportionality function C is constant, the makespan of optimal schedule for G0 is C0 times the makespan for optimal schedule for G.

Proof We know that ∀j ∈ V (G0),p

0 j

pj = C0, and ∀j → i ∈ E(G

0_),c0ji

cij = C0.

Therefore, maxmaxj∈V (G0₎pj

p0 j , maxj→i∈E(G0₎cij c0 ji = _C1 0.

Let OP T be the shortest makespan for any schedule on G with m machines. Let S∗ be an optimal schedule for G.

By Theorem 5.1 we know that there is a schedule S0 for G0 where makespan(S0) ≤ C0makespan(S∗), therefore OP T0 ≤ makespan(S0) ≤ C0makespan(S∗) = C0OP T .

For the same reason OP T ≤ _C1

0OP T

0_.

Therefore OP T0 = C0OP T .

Theorem 5.3 When NN proportionality function C is constant, given a schedule S for G, the backward schedule S0 has the same approximation ratio as S.

Proof By Theorem 5.2, the makespan of optimal schedule for G’ OP T0 is C0 times the

makespan for optimal schedule OP T for G.

By Theorem 5.1, makespan(S0) ≤ C0makespan(S) and makespan(S0) ≤ _C1₀makespan(S).

makespan(S0) = C0makespan(S).

Therefore, makespan(S_{OP T}0 0) =

makespan(S)

OP T , which means the approximation ratio of S

0 _equals

In document Exploring model parallelism in distributed scheduling of neural network frameworks (Page 30-35)