CS 6890: Deep Learning


Razvan C. Bunescu

School of Electrical Engineering and Computer Science

Feed-Forward Neural Networks Backpropagation


Neuron Function

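[Figure: a single neuron. Inputs $x_0 = 1, x_1, x_2, x_3$ are multiplied by weights $w_0, w_1, w_2, w_3$, summed ($\Sigma$), and passed through an activation function $f$.]

$$h_w(x) = f\Big(\sum_i w_i x_i\Big)$$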

• Algebraic interpretation:

– The output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic weights.

• weights $w_i$ correspond to the synaptic weights (activating or inhibiting).

• summation corresponds to combination of signals in the soma.

– It is often transformed through a monotonic activation function.
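As an illustration of the algebraic view, here is a minimal sketch of a single neuron in plain Python. The function names `neuron` and `logistic` and the example numbers are assumptions made for this sketch, not part of the slides; the bias is handled by fixing the first input to 1.

```python
import math

def logistic(z):
    """Logistic activation, one possible monotonic choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, f=logistic):
    """Return f(sum_i w_i * x_i); x[0] is the constant input 1, w[0] the bias weight."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return f(z)

# Hypothetical inputs and weights, chosen only for illustration.
x = [1.0, 0.5, -1.0, 2.0]    # x_0 = 1 (bias input), then x_1..x_3
w = [0.1, 0.4, 0.3, -0.2]    # w_0 (bias weight), then w_1..w_3
output = neuron(x, w)
```

Passing a different `f` (for example the identity) swaps the activation without touching the weighted sum.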


Activation Functions

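• Unit step (Perceptron):

$$f(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$$

• Logistic (Logistic Regression):

$$f(z) = \frac{1}{1 + e^{-z}}$$

• Rectified Linear Unit (ReLU):

$$f(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases}$$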

ReLU and Generalizations

• It has become more common to use piecewise linear activation functions for hidden units:

– ReLU: the rectifier activation g(a) = max{0, a}.

– Absolute value ReLU: g(a) = |a|.

– Maxout: g(a₁, ..., a_k) = max{a₁, ..., a_k}.

• needs k weight vectors instead of 1.

– Leaky ReLU: g(a) = max{0, a} + α min{0, a}.

⇒ the network computes a piecewise linear function (up to the output activation function).
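The piecewise linear activations listed here can be sketched in a few lines of plain Python. The function names and the default slope `alpha=0.01` are assumptions for illustration; the slides do not fix a value for α.

```python
def relu(a):
    """Rectifier: g(a) = max{0, a}."""
    return max(0.0, a)

def abs_relu(a):
    """Absolute value ReLU: g(a) = |a|."""
    return abs(a)

def leaky_relu(a, alpha=0.01):
    """Leaky ReLU: g(a) = max{0, a} + alpha * min{0, a} (alpha is an assumed default)."""
    return max(0.0, a) + alpha * min(0.0, a)

def maxout(pre_activations):
    """Maxout: g(a_1, ..., a_k) = max{a_1, ..., a_k}; needs k pre-activations per unit."""
    return max(pre_activations)
```

Each function is linear on either side of its kink, which is why a network built from them computes a piecewise linear function before the output activation.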


ReLU vs. Sigmoid and Tanh

• Sigmoid and Tanh saturate for values not close to 0:

– “kill” gradients, bad behavior for gradient-based learning.

• ReLU does not saturate for values > 0:

– greatly accelerates learning, fast implementation.

– fragile during training: a unit can “die” because its gradient is 0 whenever its input is negative:

• initialize all biases b to a small, positive value, e.g. 0.1.
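A small numerical sketch can make the saturation argument concrete. It evaluates the derivatives of the logistic sigmoid, tanh, and ReLU at a few inputs; the function names are assumptions for this illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic sigmoid: f(z) * (1 - f(z))."""
    s = logistic(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, 0 for z < 0 (0 is used at z = 0 here)."""
    return 1.0 if z > 0 else 0.0

# Gradients shrink toward 0 for sigmoid/tanh as |z| grows; ReLU stays at 1 for z > 0.
for z in (0.0, 2.0, 10.0):
    print(z, logistic_grad(z), tanh_grad(z), relu_grad(z))
```

The zero derivative of ReLU for negative inputs is also what allows a unit to "die": if its pre-activation is negative for every training example, no gradient flows through it.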


ReLU vs. Softplus

• Softplus g(a) = ln(1+e^a) is a smooth version of the rectifier.

– Saturates less than ReLU, yet ReLU still does better [Glorot, 2011].
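For comparison, softplus and the rectifier can be evaluated side by side. The sketch below uses a rearranged form of ln(1 + e^a) that avoids overflow for large |a|; this rearrangement is a common implementation choice, not something stated on the slide.

```python
import math

def relu(a):
    return max(0.0, a)

def softplus(a):
    """ln(1 + e^a), computed as max(a, 0) + ln(1 + e^{-|a|}) to avoid overflow."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

# Softplus is always slightly above the rectifier and approaches it for large |a|.
gaps = [softplus(a) - relu(a) for a in (-5.0, 0.0, 5.0)]
```

The largest gap between the two is at a = 0, where softplus equals ln 2.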


Perceptron vs. Logistic vs. ReLU vs. Tanh

• Logistic neuron:

– At inference time, same decision function as perceptron, for binary classification with equal misclassification costs (prove it):

$$\hat{t}(x) = \begin{cases} 1 & \text{if } w^T x > 0 \\ 0 & \text{otherwise} \end{cases}$$

– Perceptron cannot represent the XOR function:

• Logistic neuron, ReLU, Tanh have the same limitation.

• How can we use (logistic) neurons to achieve better representational power?
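One way to see where the extra representational power comes from is to stack neurons in layers. The sketch below first checks that thresholding a logistic output at 0.5 gives the same decision as the perceptron rule, then builds XOR from three step units. The weights are hand-picked for this illustration and are not from the slides.

```python
import math

def step(z):
    """Perceptron decision: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_decision(z):
    """Predict 1 when the logistic output exceeds 0.5 (equal misclassification costs)."""
    return 1 if logistic(z) > 0.5 else 0

def xor_net(x1, x2):
    """Two hidden step units (OR and AND) followed by one output step unit."""
    h_or = step(x1 + x2 - 0.5)         # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)        # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)    # OR and not AND

truth_table = [xor_net(a, b) for a in (0, 1) for b in (0, 1)]
```

No single unit of this kind can produce this truth table, because its decision boundary is a single hyperplane; the hidden layer supplies the intermediate features that make the output linearly separable.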


Universal Approximation Theorem

– Let σ be a nonconstant, bounded, and monotonically-increasing continuous function;

– Let $I_m$ denote the m-dimensional unit hypercube $[0,1]^m$;

– Let $C(I_m)$ denote the space of continuous functions on $I_m$;

⇒ Theorem: Given any function $f \in C(I_m)$ and $\varepsilon > 0$, there exist an integer N and real constants $\alpha_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^m$, where i = 1, ..., N, such that:

$$|F(x) - f(x)| < \varepsilon, \quad \forall x \in I_m$$

where

$$F(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^T x + b_i)$$

Hornik (1991), Cybenko (1989)
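To make the form of the approximator concrete, the sketch below evaluates F(x) = Σ_i α_i σ(w_i^T x + b_i) with the logistic function as σ, and uses N = 2 hidden units to build a smooth "bump" on [0, 1]. The steepness `k`, the bump endpoints 0.3 and 0.7, and all names are assumptions chosen for illustration; the theorem itself only asserts that suitable constants exist.

```python
import math

def sigma(z):
    """Logistic function: nonconstant, bounded, monotonically increasing, continuous."""
    return 1.0 / (1.0 + math.exp(-z))

def F(x, alphas, ws, bs):
    """Evaluate sum_i alpha_i * sigma(w_i^T x + b_i) for an input vector x."""
    total = 0.0
    for alpha, w, b in zip(alphas, ws, bs):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        total += alpha * sigma(z)
    return total

# Two units with opposite signs give an approximate indicator of [0.3, 0.7] (m = 1).
k = 50.0                      # assumed steepness
alphas = [1.0, -1.0]
ws = [[k], [k]]
bs = [-0.3 * k, -0.7 * k]

inside = F([0.5], alphas, ws, bs)
left = F([0.1], alphas, ws, bs)
right = F([0.9], alphas, ws, bs)
```

Sums of many such bumps of different heights are one intuitive route to approximating an arbitrary continuous function on the hypercube.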

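[Figure: the approximator $F(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^T x + b_i)$ drawn as a network. Inputs $x_1, x_2, x_3$ and a bias unit $+1$ feed N hidden σ units (unit i has weights $w_{i1}, w_{i2}, w_{i3}$ and bias $b_i$); their outputs are weighted by $\alpha_i$ and summed ($\Sigma$) to give $F(x)$, with $|F(x) - f(x)| < \varepsilon$ for all $x \in I_m$.]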


Neural Network Model

• Put together many neurons in layers, such that the output of a neuron can be the input of another:

[Figure: a three-layer network with an input layer (input features x), a hidden layer, and an output layer, with bias units.]

• $n_l = 3$ is the number of layers.

– $L_1$ is the input layer, $L_3$ is the output layer.

• $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$ are the parameters:

– $W_{ij}^{(l)}$ is the weight of the connection between unit j in layer l and unit i in layer l + 1.

– $b_i^{(l)}$ is the bias associated with unit i in layer l + 1.

• $a_i^{(l)}$ is the activation of unit i in layer l, e.g. $a_i^{(1)} = x_i$ and $a_1^{(3)} = h_{W,b}(x)$.


Inference: Forward Propagation

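• The activations in the hidden layer are:

$$a_i^{(2)} = f\Big(\sum_j W_{ij}^{(1)} x_j + b_i^{(1)}\Big)$$

• The activations in the output layer are:

$$h_{W,b}(x) = a_1^{(3)} = f\Big(\sum_j W_{1j}^{(2)} a_j^{(2)} + b_1^{(2)}\Big)$$

• Compressed notation:

$$a_i^{(l+1)} = f(z_i^{(l+1)})$$

where

$$z_i^{(l+1)} = \sum_{j=1}^{s_l} W_{ij}^{(l)} a_j^{(l)} + b_i^{(l)}$$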

Forward Propagation

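• Forward propagation (unrolled):

$$z_i^{(2)} = \sum_j W_{ij}^{(1)} x_j + b_i^{(1)}, \qquad a_i^{(2)} = f(z_i^{(2)})$$

$$z_1^{(3)} = \sum_j W_{1j}^{(2)} a_j^{(2)} + b_1^{(2)}, \qquad h_{W,b}(x) = a_1^{(3)} = f(z_1^{(3)})$$

• Forward propagation (compressed):

$$z^{(2)} = W^{(1)} x + b^{(1)}, \qquad a^{(2)} = f(z^{(2)})$$

$$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \qquad h_{W,b}(x) = a^{(3)} = f(z^{(3)})$$

• Element-wise application:

$$f(z) = [f(z_1), f(z_2), f(z_3)]$$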

Forward Propagation

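• Forward propagation (compressed):

$$h_{W,b}(x) = f\big(W^{(2)} f(W^{(1)} x + b^{(1)}) + b^{(2)}\big)$$

• Composed of two forward propagation steps:

$$a^{(2)} = f(W^{(1)} x + b^{(1)})$$

$$a^{(3)} = f(W^{(2)} a^{(2)} + b^{(2)})$$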
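Forward propagation repeats the same two operations per layer, $z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$ followed by $a^{(l+1)} = f(z^{(l+1)})$. The NumPy sketch below loops over the layers of a small 3-3-1 network. The layer sizes, the random initialization, the logistic choice for f, and the function names are all assumptions for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=logistic):
    """Return (zs, activations); activations[0] is the input x, i.e. a^(1)."""
    a = x
    activations = [a]
    zs = []
    for W, b in zip(weights, biases):
        z = W @ a + b          # z^(l+1) = W^(l) a^(l) + b^(l)
        a = f(z)               # a^(l+1) = f(z^(l+1)), applied element-wise
        zs.append(z)
        activations.append(a)
    return zs, activations

# Hypothetical 3-3-1 network; W^(l) has shape (s_{l+1}, s_l).
rng = np.random.default_rng(0)
sizes = [3, 3, 1]
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = np.array([1.0, 2.0, 3.0])
zs, activations = forward(x, weights, biases)
h = activations[-1]            # h_{W,b}(x) = a^(3)
```

Keeping the intermediate `zs` and `activations` is a deliberate choice: backpropagation needs both.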


Multiple Hidden Units, Multiple Outputs

• Write down the forward propagation steps for:


Learning: Backpropagation

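• Regularized sum of squares error:

$$J(W, b, x, y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2$$

$$J(W, b) = \frac{1}{m} \sum_{k=1}^{m} J(W, b, x^{(k)}, y^{(k)}) + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_{l+1}} \sum_{j=1}^{s_l} \left( W_{ij}^{(l)} \right)^2$$

The inner double sum is the squared Frobenius norm of $W^{(l)}$.

• Gradient:

$$\frac{\partial J(W, b)}{\partial W_{ij}^{(l)}} = \frac{1}{m} \sum_{k=1}^{m} \frac{\partial J(W, b, x^{(k)}, y^{(k)})}{\partial W_{ij}^{(l)}} + \lambda W_{ij}^{(l)}$$

$$\frac{\partial J(W, b)}{\partial b_i^{(l)}} = \frac{1}{m} \sum_{k=1}^{m} \frac{\partial J(W, b, x^{(k)}, y^{(k)})}{\partial b_i^{(l)}}$$

The per-example partial derivatives inside the sums are the open question (marked "?" on the slide).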
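The regularized objective can be sketched directly from its definition: the average of the per-example squared errors plus λ/2 times the sum of squared Frobenius norms of the weight matrices. In the sketch below, `predict` stands in for $h_{W,b}$ and, like the other names and the toy numbers, is an assumption made for illustration.

```python
import numpy as np

def cost(predict, weights, X, Y, lam):
    """J(W,b) = (1/m) sum_k 0.5*||h(x_k) - y_k||^2 + (lam/2) sum_l ||W^(l)||_F^2."""
    m = len(X)
    data_term = sum(0.5 * np.sum((predict(x) - y) ** 2) for x, y in zip(X, Y)) / m
    reg_term = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + reg_term

# Toy check with a hypothetical linear "network" h(x) = W x and a single example.
W = np.eye(2)
predict = lambda x: W @ x
X = [np.array([1.0, 2.0])]
Y = [np.array([0.0, 0.0])]
J = cost(predict, [W], X, Y, lam=0.1)
```

Note that the biases are not regularized, which matches the gradient for $b_i^{(l)}$ having no λ term.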


Backpropagation

• Need to compute the gradient of the squared error with respect to a single training example (x, y):

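$$J(W, b, x, y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2 = \frac{1}{2} \left\| a^{(n_l)} - y \right\|^2$$

$$\frac{\partial J}{\partial W_{ij}^{(l)}} = \, ? \qquad \frac{\partial J}{\partial b_i^{(l)}} = \, ?$$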

Univariate Chain Rule for Differentiation

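• Univariate Chain Rule:

$$f = f \circ g \circ h = f(g(h(x)))$$

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \, \frac{\partial g}{\partial h} \, \frac{\partial h}{\partial x}$$

• Example:

$$f(g(x)) = 2g(x)^2 - 3g(x) + 1, \qquad g(x) = x^3 + 2x$$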
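The example with $f(g(x)) = 2g(x)^2 - 3g(x) + 1$ and $g(x) = x^3 + 2x$ can be worked through as follows; this derivation is added here as an illustration and is not shown on the slide.

```latex
\begin{aligned}
\frac{\partial f}{\partial g} &= 4g(x) - 3, \qquad
\frac{\partial g}{\partial x} = 3x^2 + 2 \\
\frac{\partial f}{\partial x} &= \frac{\partial f}{\partial g}\,\frac{\partial g}{\partial x}
  = \bigl(4(x^3 + 2x) - 3\bigr)\,(3x^2 + 2)
\end{aligned}
```

For instance, at $x = 1$ we have $g = 3$, so the derivative is $(12 - 3)(3 + 2) = 45$.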


Multivariate Chain Rule for Differentiation

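• Multivariate Chain Rule:

$$f = f(g_1(x), g_2(x), \dots, g_n(x))$$

$$\frac{\partial f}{\partial x} = \sum_{i=1}^{n} \frac{\partial f}{\partial g_i} \, \frac{\partial g_i}{\partial x}$$

• Example:

$$f(g_1(x), g_2(x)) = 2g_1(x)^2 - 3g_1(x)g_2(x) + 1, \qquad g_1(x) = 3x, \qquad g_2(x) = x^2 + 2x$$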
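The example with $g_1(x) = 3x$ and $g_2(x) = x^2 + 2x$ can be worked through by summing the contribution of each path from $x$ to $f$; this derivation is added as an illustration and is not shown on the slide.

```latex
\begin{aligned}
\frac{\partial f}{\partial g_1} &= 4g_1 - 3g_2, &
\frac{\partial f}{\partial g_2} &= -3g_1, &
\frac{\partial g_1}{\partial x} &= 3, &
\frac{\partial g_2}{\partial x} &= 2x + 2 \\
\frac{\partial f}{\partial x}
  &= (4g_1 - 3g_2)\cdot 3 + (-3g_1)(2x + 2) \\
  &= 3\bigl(12x - 3x^2 - 6x\bigr) - 9x(2x + 2) \\
  &= -27x^2
\end{aligned}
```

As a check, substituting directly gives $f = 18x^2 - 9x^3 - 18x^2 + 1 = -9x^3 + 1$, whose derivative is also $-27x^2$.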


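Backpropagation: $W_{ij}^{(l)}$

• J depends on $W_{ij}^{(l)}$ only through $a_i^{(l+1)}$, which depends on $W_{ij}^{(l)}$ only through $z_i^{(l+1)}$.

[Figure: dependency chain $a_j^{(l)} \to a_i^{(l+1)} \to \dots \to a_1^{(n_l)}$, with the edge from $a_j^{(l)}$ to $a_i^{(l+1)}$ labeled $W_{ij}^{(l)}$.]

$$J(W, b, x, y) = \frac{1}{2} \left\| a^{(n_l)} - y \right\|^2$$

$$a_i^{(l+1)} = f(z_i^{(l+1)})$$

$$z_i^{(l+1)} = \sum_{j=1}^{s_l} W_{ij}^{(l)} a_j^{(l)} + b_i^{(l)}$$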

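Backpropagation: $b_i^{(l)}$

• J depends on $b_i^{(l)}$ only through $a_i^{(l+1)}$, which depends on $b_i^{(l)}$ only through $z_i^{(l+1)}$.

[Figure: dependency chain from the bias unit $+1$ to $a_i^{(l+1)}$ (edge labeled $b_i^{(l)}$), then $\dots \to a_1^{(n_l)} \to J$.]

$$J(W, b, x, y) = \frac{1}{2} \left\| a^{(n_l)} - y \right\|^2$$

$$a_i^{(l+1)} = f(z_i^{(l+1)})$$

$$z_i^{(l+1)} = \sum_{j=1}^{s_l} W_{ij}^{(l)} a_j^{(l)} + b_i^{(l)}$$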

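Backpropagation: $W_{ij}^{(l)}$ and $b_i^{(l)}$

$$\frac{\partial J}{\partial W_{ij}^{(l)}} = \underbrace{\frac{\partial J}{\partial a_i^{(l+1)}} \times \frac{\partial a_i^{(l+1)}}{\partial z_i^{(l+1)}}}_{\delta_i^{(l+1)}} \times \underbrace{\frac{\partial z_i^{(l+1)}}{\partial W_{ij}^{(l)}}}_{a_j^{(l)}} = a_j^{(l)} \, \delta_i^{(l+1)}$$

$$\frac{\partial J}{\partial b_i^{(l)}} = \underbrace{\frac{\partial J}{\partial a_i^{(l+1)}} \times \frac{\partial a_i^{(l+1)}}{\partial z_i^{(l+1)}}}_{\delta_i^{(l+1)}} \times \underbrace{\frac{\partial z_i^{(l+1)}}{\partial b_i^{(l)}}}_{+1} = \delta_i^{(l+1)}$$

How to compute $\delta_i^{(l)}$ for all layers l?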

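Backpropagation: $\delta_i^{(l)}$

• J depends on $a_i^{(l)}$ only through $a_1^{(l+1)}, a_2^{(l+1)}, \dots$

[Figure: $a_i^{(l)}$ feeds $a_1^{(l+1)}, a_2^{(l+1)}, a_3^{(l+1)}, \dots$, which lead through the later layers to $a_1^{(n_l)}$ and $J$.]

$$\delta_i^{(l)} = \frac{\partial J}{\partial a_i^{(l)}} \times \frac{\partial a_i^{(l)}}{\partial z_i^{(l)}} = \underbrace{\frac{\partial J}{\partial a_i^{(l)}}}_{?} \times f'(z_i^{(l)})$$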

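Backpropagation: $\delta_i^{(l)}$

• J depends on $a_i^{(l)}$ only through $a_1^{(l+1)}, a_2^{(l+1)}, \dots$

$$\frac{\partial J}{\partial a_i^{(l)}} = \sum_{j=1}^{s_{l+1}} \frac{\partial J}{\partial a_j^{(l+1)}} \times \frac{\partial a_j^{(l+1)}}{\partial a_i^{(l)}} = \sum_{j=1}^{s_{l+1}} \underbrace{\frac{\partial J}{\partial a_j^{(l+1)}} \times \frac{\partial a_j^{(l+1)}}{\partial z_j^{(l+1)}}}_{\delta_j^{(l+1)}} \times \underbrace{\frac{\partial z_j^{(l+1)}}{\partial a_i^{(l)}}}_{W_{ji}^{(l)}}$$

$$\delta_i^{(l)} = \frac{\partial J}{\partial a_i^{(l)}} \times f'(z_i^{(l)})$$

• Therefore, $\delta_i^{(l)}$ can be computed as:

$$\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \, \delta_j^{(l+1)} \right) \times f'(z_i^{(l)})$$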

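Backpropagation: $\delta_i^{(l)}$

• Start computing δ’s for the output layer:

$$\delta_i^{(n_l)} = \frac{\partial J}{\partial a_i^{(n_l)}} \times \frac{\partial a_i^{(n_l)}}{\partial z_i^{(n_l)}} = \frac{\partial J}{\partial a_i^{(n_l)}} \times f'(z_i^{(n_l)})$$

$$J = \frac{1}{2} \left\| a^{(n_l)} - y \right\|^2 \;\Rightarrow\; \frac{\partial J}{\partial a_i^{(n_l)}} = \left( a_i^{(n_l)} - y_i \right)$$

$$\delta_i^{(n_l)} = \left( a_i^{(n_l)} - y_i \right) \times f'(z_i^{(n_l)})$$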

Backpropagation Algorithm

1. Feedforward pass on x to compute activations $a_i^{(l)}$.

2. For each output unit i compute:

$$\delta_i^{(n_l)} = \left( a_i^{(n_l)} - y_i \right) \times f'(z_i^{(n_l)})$$

3. For l = n_l−1, n_l−2, n_l−3, ..., 2 compute:

$$\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \, \delta_j^{(l+1)} \right) \times f'(z_i^{(l)})$$

4. Compute the partial derivatives of the cost $J(W, b, x, y)$:

$$\frac{\partial J}{\partial W_{ij}^{(l)}} = a_j^{(l)} \, \delta_i^{(l+1)}$$

$$\frac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l+1)}$$


Backpropagation Algorithm: Vectorization for 1 Example

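1. Feedforward pass on x to compute activations $a_i^{(l)}$.

2. For last layer compute:

$$\delta^{(n_l)} = \left( a^{(n_l)} - y \right) \circ f'(z^{(n_l)})$$

3. For l = n_l−1, n_l−2, n_l−3, ..., 2 compute:

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)})$$

4. Compute the partial derivatives of the cost $J(W, b, x, y)$:

$$\nabla_{W^{(l)}} J = \delta^{(l+1)} \left( a^{(l)} \right)^T$$

$$\nabla_{b^{(l)}} J = \delta^{(l+1)}$$

Here $\circ$ denotes the element-wise product.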
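The vectorized recursion for one example, $\delta^{(n_l)} = (a^{(n_l)} - y) \circ f'(z^{(n_l)})$, $\delta^{(l)} = ((W^{(l)})^T \delta^{(l+1)}) \circ f'(z^{(l)})$, $\nabla_{W^{(l)}} J = \delta^{(l+1)} (a^{(l)})^T$, $\nabla_{b^{(l)}} J = \delta^{(l+1)}$, can be sketched in NumPy as below. The logistic activation, the 3-4-2 layer sizes, the random data, and all function names are assumptions for illustration; a finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_prime(z):
    s = logistic(z)
    return s * (1.0 - s)

def loss(x, y, weights, biases):
    """Squared error 0.5 * ||a^(n_l) - y||^2 for a single example."""
    a = x
    for W, b in zip(weights, biases):
        a = logistic(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(x, y, weights, biases):
    """Return per-layer gradients of the single-example squared error."""
    # Forward pass, storing every z and a (list index k holds layer k + 1).
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = logistic(z)
        zs.append(z)
        activations.append(a)
    # Output layer delta.
    delta = (activations[-1] - y) * logistic_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    # Walk backwards: delta always belongs to the layer that weights[k] feeds into.
    for k in range(len(weights) - 1, -1, -1):
        grads_W[k] = np.outer(delta, activations[k])   # delta^(l+1) (a^(l))^T
        grads_b[k] = delta                             # delta^(l+1)
        if k > 0:
            delta = (weights[k].T @ delta) * logistic_prime(zs[k - 1])
    return grads_W, grads_b

# Hypothetical 3-4-2 network and data.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=2)]
x, y = rng.normal(size=3), np.array([0.0, 1.0])
grads_W, grads_b = backprop(x, y, weights, biases)

# Central finite difference on one weight of the first layer.
eps = 1e-6
W_plus = [W.copy() for W in weights]
W_minus = [W.copy() for W in weights]
W_plus[0][1, 2] += eps
W_minus[0][1, 2] -= eps
numeric = (loss(x, y, W_plus, biases) - loss(x, y, W_minus, biases)) / (2 * eps)
```

Comparing `numeric` with `grads_W[0][1, 2]` is a cheap way to catch indexing mistakes in a hand-written backward pass.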


Backpropagation Algorithm: Vectorization for Dataset of m Examples

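1. Feedforward pass on X to compute activations $a_i^{(l)}$.

2. For last layer compute:

$$\delta^{(n_l)} = \left( a^{(n_l)} - y \right) \circ f'(z^{(n_l)})$$

3. For l = n_l−1, n_l−2, n_l−3, ..., 2 compute:

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)})$$

4. Compute the partial derivatives of the cost $J(W, b, x, y)$:

$$\nabla_{W^{(l)}} J = \delta^{(l+1)} \left( a^{(l)} \right)^T / m$$

$$\nabla_{b^{(l)}} J = \delta^{(l+1)}\text{.col\_avg()}$$

Here the activations and δ’s are matrices, and col_avg() averages the columns of $\delta^{(l+1)}$.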
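For a dataset of m examples, the same recursion can run on whole matrices. The NumPy sketch below assumes that examples are stored as columns (so `X` has shape $s_1 \times m$), uses the logistic activation, and omits the regularization term $\lambda W^{(l)}$; these choices and all names are assumptions for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_loss(X, Y, weights, biases):
    """Average over the m columns of 0.5 * ||a^(n_l) - y||^2."""
    A = X
    for W, b in zip(weights, biases):
        A = logistic(W @ A + b[:, None])
    return 0.5 * np.sum((A - Y) ** 2) / X.shape[1]

def backprop_batch(X, Y, weights, biases):
    """Gradients of mean_loss; one column of X and Y per example."""
    m = X.shape[1]
    A, activations = X, [X]
    for W, b in zip(weights, biases):
        A = logistic(W @ A + b[:, None])     # bias broadcast across columns
        activations.append(A)
    out = activations[-1]
    delta = (out - Y) * out * (1.0 - out)    # f'(z) = f(z)(1 - f(z)) for the logistic
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        grads_W[k] = delta @ activations[k].T / m   # delta^(l+1) (a^(l))^T / m
        grads_b[k] = delta.mean(axis=1)             # average of the columns
        if k > 0:
            A_prev = activations[k]
            delta = (weights[k].T @ delta) * A_prev * (1.0 - A_prev)
    return grads_W, grads_b

# Hypothetical 3-4-2 network with m = 5 examples.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=2)]
X, Y = rng.normal(size=(3, 5)), rng.uniform(size=(2, 5))
grads_W, grads_b = backprop_batch(X, Y, weights, biases)
```

To obtain the gradient of the regularized objective, `lam * W` would be added to each entry of `grads_W`.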


Backpropagation: Softmax Regression

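• Consider layer $n_l$ to be the input to the softmax layer, i.e. the softmax output layer is $n_l + 1$.

• Softmax weights stored in matrix $W^{(n_l)}$.

• K classes =>

$$W^{(n_l)} = \begin{bmatrix} - \, \mathbf{w}_1^T \, - \\ - \, \mathbf{w}_2^T \, - \\ \vdots \\ - \, \mathbf{w}_K^T \, - \end{bmatrix}$$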

Backpropagation: Softmax Regression

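• Softmax output is $\mathbf{a}^{(n_l+1)} = \mathrm{softmax}(\mathbf{z}^{(n_l+1)})$

[Figure: the softmax input $a_1^{(n_l)}, a_2^{(n_l)}, \dots$ is multiplied by the softmax weights $W^{(n_l)}$ to give the softmax logits $z_1^{(n_l+1)}, z_2^{(n_l+1)}, \dots$, which feed the cross-entropy objective $J(\mathbf{a}^{(n_l+1)}, \mathbf{y})$.]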

Backpropagation Algorithm: Softmax (1)

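1. Feedforward pass on x to compute activations $\mathbf{a}^{(l)}$ for layers l = 1, 2, …, n_l.

2. Compute softmax outputs $\mathbf{a}^{(n_l+1)}$ and objective $J(\mathbf{a}^{(n_l+1)}, \mathbf{y})$.

3. Let $\mathbf{y} = [\delta_1(y), \delta_2(y), \dots, \delta_K(y)]^T$ be the one-hot vector representation for label y.

4. Compute gradient with respect to softmax weights:

$$\frac{\partial J}{\partial W^{(n_l)}} = \left( \mathbf{a}^{(n_l+1)} - \mathbf{y} \right) \left( \mathbf{a}^{(n_l)} \right)^T$$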
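The factor $(\mathbf{a}^{(n_l+1)} - \mathbf{y})$ in the weight gradient can be derived as follows. The sketch assumes the cross-entropy objective $J = -\sum_k y_k \ln a_k$, no bias term in the softmax layer, and writes $a_k$ for $a_k^{(n_l+1)}$ and $z_k$ for $z_k^{(n_l+1)}$; it is added as an illustration and is not shown on the slide.

```latex
\begin{aligned}
a_k &= \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad
\frac{\partial \ln a_k}{\partial z_i} = \mathbb{1}[k = i] - a_i \\
\frac{\partial J}{\partial z_i}
  &= -\sum_{k=1}^{K} y_k \bigl(\mathbb{1}[k = i] - a_i\bigr)
   = -y_i + a_i \sum_{k=1}^{K} y_k
   = a_i - y_i \\
\mathbf{z}^{(n_l+1)} &= W^{(n_l)} \mathbf{a}^{(n_l)}
  \;\Rightarrow\;
  \frac{\partial J}{\partial W^{(n_l)}}
  = \bigl(\mathbf{a}^{(n_l+1)} - \mathbf{y}\bigr)\bigl(\mathbf{a}^{(n_l)}\bigr)^T
\end{aligned}
```

The step $\sum_k y_k = 1$ uses the fact that $\mathbf{y}$ is one-hot.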


Backpropagation Algorithm: Softmax (2)

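5. Compute gradient with respect to softmax inputs:

$$\delta^{(n_l)} = \underbrace{(W^{(n_l)})^T \left( \mathbf{a}^{(n_l+1)} - \mathbf{y} \right)}_{\partial J / \partial \mathbf{a}^{(n_l)}} \circ f'(\mathbf{z}^{(n_l)})$$

Compare with the general updates for $J(W, b, x, y)$:

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)})$$

$$\nabla_{W^{(l)}} J = \delta^{(l+1)} \left( a^{(l)} \right)^T \qquad \nabla_{b^{(l)}} J = \delta^{(l+1)}$$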

Backpropagation Algorithm: Softmax for 1 Example

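1. For softmax layer, compute:

$$\delta^{(n_l+1)} = \left( \mathbf{a}^{(n_l+1)} - \mathbf{y} \right)$$

where $\mathbf{y}$ is the one-hot label vector.

2. For l = n_l, n_l−1, n_l−2, ..., 2 compute:

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)})$$

3. Compute the partial derivatives of the cost $J(W, b, x, y)$:

$$\nabla_{W^{(l)}} J = \delta^{(l+1)} \left( a^{(l)} \right)^T$$

$$\nabla_{b^{(l)}} J = \delta^{(l+1)}$$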
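With a softmax output layer, the backward pass starts from $\delta^{(n_l+1)} = \mathbf{a}^{(n_l+1)} - \mathbf{y}$ and then applies the usual recursion. The NumPy sketch below uses one logistic hidden layer (so $n_l = 2$), a softmax layer without bias, and the cross-entropy loss $-\ln a_y$; the sizes, random data, and names are assumptions for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))       # shift by the max for numerical stability
    return e / e.sum()

def softmax_backprop(x, label, W1, b1, W2):
    """Return (loss, grad_W1, grad_b1, grad_W2) for one example."""
    a2 = logistic(W1 @ x + b1)      # a^(n_l): input to the softmax layer
    a3 = softmax(W2 @ a2)           # a^(n_l+1): softmax output
    y = np.zeros_like(a3)
    y[label] = 1.0                  # one-hot label vector
    delta3 = a3 - y                                  # delta^(n_l+1)
    grad_W2 = np.outer(delta3, a2)                   # (a^(n_l+1) - y)(a^(n_l))^T
    delta2 = (W2.T @ delta3) * a2 * (1.0 - a2)       # delta^(n_l), logistic f'
    grad_W1 = np.outer(delta2, x)
    grad_b1 = delta2
    loss = -np.log(a3[label])                        # cross-entropy
    return loss, grad_W1, grad_b1, grad_W2

# Hypothetical sizes: 3 inputs, 4 hidden units, K = 3 classes.
rng = np.random.default_rng(2)
W1, b1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=(3, 4))
x, label = rng.normal(size=3), 2
loss, grad_W1, grad_b1, grad_W2 = softmax_backprop(x, label, W1, b1, W2)
```

Note that no $f'$ factor appears in `delta3`: the softmax and cross-entropy derivatives combine into the simple difference between prediction and label.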


Backpropagation Algorithm: Softmax for Dataset of m Examples

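1. For softmax layer, compute:

$$\delta^{(n_l+1)} = \left( \mathbf{a}^{(n_l+1)} - \mathbf{y} \right)$$

where $\mathbf{y}$ is the ground-truth label matrix.

2. For l = n_l, n_l−1, n_l−2, ..., 2 compute:

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)})$$

3. Compute the partial derivatives of the cost $J(W, b, x, y)$:

$$\nabla_{W^{(l)}} J = \delta^{(l+1)} \left( a^{(l)} \right)^T / m$$

$$\nabla_{b^{(l)}} J = \delta^{(l+1)}\text{.col\_avg()}$$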

Backpropagation: Logistic Regression
