• No results found

Deep Residual Networks

N/A
N/A
Protected

Academic year: 2021

Share "Deep Residual Networks"

Copied!
89
0
0

Loading.... (view fulltext now)

Full text

(1)

Deep Residual Networks

Deep Learning Gets Way Deeper

8:30-10:30am, June 19 ICML 2016 tutorial

Kaiming He

Facebook AI Research*

*as of July 2016. Formerly affiliated with Microsoft Research Asia

1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x1 conv, 128, /2 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 256, /2 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 512, /2 3x3 conv, 512 1x1 conv, 2048 1x1 conv, 512 3x3 conv, 512 1x1 conv, 2048 1x1 conv, 512 3x3 conv, 512 1x1 conv, 2048 ave pool, fc 1000

7x7 conv, 64, /2, pool/2

(2)

Overview

• Introduction

• Background

• From shallow to deep

• Deep Residual Networks

• From 10 layers to 100 layers

• From 100 layers to 1000 layers

• Applications

• Q & A

(3)

Introduction

(4)

Introduction

Deep Residual Networks (ResNets)

• “Deep Residual Learning for Image Recognition”. CVPR 2016 (next week)

• A simple and clean framework of training “very” deep nets

• State-of-the-art performance for

Image classification

Object detection

Semantic segmentation

and more…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(5)

ResNets @ ILSVRC & COCO 2015 Competitions

1st places in all five main tracks

• ImageNet Classification: “Ultra-deep” 152-layer nets

• ImageNet Detection: 16% better than 2nd

• ImageNet Localization: 27% better than 2nd

• COCO Detection: 11% better than 2nd

• COCO Segmentation: 12% better than 2nd

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

*improvements are relative numbers

(6)

Revolution of Depth

3.57

6.7 7.3

11.7

16.4

25.8

28.2

ILSVRC'15

ResNet ILSVRC'14

GoogleNet ILSVRC'14

VGG ILSVRC'13 ILSVRC'12

AlexNet ILSVRC'11 ILSVRC'10

ImageNet Classification top-5 error (%)

shallow 8 layers

19 layers 22 layers

152 layers

8 layers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(7)

Revolution of Depth

11x11 conv, 96, /4, pool/2

5x5 conv, 256, pool/2 3x3 conv, 384 3x3 conv, 384 3x3 conv, 256, pool/2

fc, 4096 fc, 4096 fc, 1000 AlexNet, 8 layers

(ILSVRC 2012)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(8)

Revolution of Depth

11x11 conv, 96, /4, pool/2

5x5 conv, 256, pool/2 3x3 conv, 384 3x3 conv, 384 3x3 conv, 256, pool/2

fc, 4096 fc, 4096 fc, 1000

AlexNet, 8 layers (ILSVRC 2012)

3x3 conv, 64 3x3 conv, 64, pool/2

3x3 conv, 128 3x3 conv, 128, pool/2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256, pool/2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512, pool/2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512, pool/2

fc, 4096 fc, 4096 fc, 1000

VGG, 19 layers (ILSVRC 2014)

input Conv 7x7 + 2(S) MaxPool 3x3 + 2(S) LocalRespNorm

Conv 1x1 + 1(V)

Conv 3x3 + 1(S) LocalRespNorm

MaxPool 3x3 + 2(S)

Conv Conv Conv Conv

1x1 + 1(S) 3x3 + 1(S) 5x5 + 1(S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3+ 1(S) Dept hConcat

Conv Conv Conv Conv

1x1 + 1(S) 3x3 + 1(S) 5x5 + 1(S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3+ 1(S) Dept hConcat

MaxPool 3x3 + 2(S)

Conv Conv Conv Conv

1x1 + 1(S) 3x3 + 1(S) 5x5 + 1(S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3+ 1(S) Dept hConcat

Conv Conv Conv Conv

1x1 + 1 (S) 3x3 + 1(S) 5x5 + 1(S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3 + 1(S) Av eragePool 5x5 + 3(V) Dept hConcat

Conv Conv Conv Conv

1x1 + 1(S) 3 x3 + 1(S) 5x5 + 1 (S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3 + 1(S) Dept hConcat

Conv Conv Conv Conv

1x1 + 1(S) 3 x3 + 1(S) 5x5 + 1 (S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3 + 1(S) Dept hConcat

Conv Conv Conv Conv

1x1 + 1(S) 3x3 + 1(S) 5x5 + 1(S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3 + 1(S) Av eragePool 5x5 + 3 (V) Dept hConcat

MaxPool 3x3 + 2(S)

Conv Conv Conv Conv

1x1 + 1(S) 3x3 + 1(S) 5x5 + 1(S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3 + 1(S) Dept hConcat

Conv Conv Conv Conv

1x1 + 1(S) 3x3 + 1(S) 5x5 + 1(S) 1x1 + 1(S)

Conv Conv MaxPool

1x1 + 1(S) 1x1 + 1(S) 3x3 + 1(S) Dept hConcat Av eragePool 7x7 + 1 (V) FC

Conv 1x1 + 1(S)

FC FC Soft maxAct iv at ion

soft max0 Conv 1x1 + 1(S)

FC FC Soft maxAct iv at ion

soft max1 Soft maxAct iv at ion

soft max2

GoogleNet, 22 layers (ILSVRC 2014)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(9)

1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x1 conv, 128, /2 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 256, /2 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 512, /2 3x3 conv, 512 1x1 conv, 2048 1x1 conv, 512 3x3 conv, 512 1x1 conv, 2048 1x1 conv, 512 3x3 conv, 512 1x1 conv, 2048 ave pool, fc 1000 7x7 conv, 64, /2, pool/2

AlexNet, 8 layers (ILSVRC 2012)

Revolution of Depth

ResNet, 152 layers (ILSVRC 2015)

3x3 conv, 64 3x3 conv, 64, pool/2

3x3 conv, 128 3x3 conv, 128, pool/2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256, pool/2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512, pool/2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512, pool/2

fc, 4096 fc, 4096 fc, 1000 11x11 conv, 96, /4, pool/2

5x5 conv, 256, pool/2 3x3 conv, 384 3x3 conv, 384 3x3 conv, 256, pool/2

fc, 4096 fc, 4096

fc, 1000 VGG, 19 layers

(ILSVRC 2014)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(10)

Revolution of Depth

34

58 66

86

HOG, DPM AlexNet

(RCNN) VGG

(RCNN) ResNet

(Faster RCNN)*

PASCAL VOC 2007 Object Detection mAP (%)

shallow 8 layers

16 layers

101 layers

*w/ other improvements & more data

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

Engines of

visual recognition

(11)

ResNet’s object detection result on COCO

*the original image is from the COCO dataset

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(12)

Very simple, easy to follow

• Many third-party implementations

(list in https://github.com/KaimingHe/deep-residual-networks)

Facebook AI Research’s Torch ResNet:

Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code

Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code

Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code

Torch, MNIST, 100 layers: blog, code

A winning entry in Kaggle's right whale recognition challenge: blog, code

Neon, Place2 (mini), 40 layers: blog, code

• Easily reproduced results

(e.g. Torch ResNet: https://github.com/facebook/fb.resnet.torch)

• A series of extensions and follow-ups

• > 200 citations in 6 months after posted on arXiv (Dec. 2015)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(13)

Background

From shallow to deep

(14)

Traditional recognition

edges classifier “bus”?

pixels classifier “bus”?

histogram classifier “bus”?

edges

SIFT/HOG

histogram classifier “bus”?

edges sparse codeK-means/

shallower

deeper

But what’s next?

(15)

Deep Learning

histogram classifier “bus”?

edges sparse codeK-means/

Specialized components, domain knowledge required

“bus”?

Generic components (“layers”), less domain knowledge

“bus”?

Repeat elementary layers => Going deeper

• End-to-end learning

• Richer solution space

(16)

Spectrum of Depth

shallower deeper

5 layers: easy

>10 layers: initialization, Batch Normalization

>30 layers: skip connections

>100 layers: identity skip connections

>1000 layers: ?

(17)

Initialization

LeCun et al 1998 “Efficient Backprop”

Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”

input

𝑋 output

𝑌 = 𝑊𝑋 weight

𝑊

1-layer:

𝑉𝑎𝑟 𝑦 = (𝑛 +, 𝑉𝑎𝑟 𝑤 )𝑉𝑎𝑟[𝑥]

Multi-layer:

𝑉𝑎𝑟 𝑦 = ( 2 𝑛 3 +, 𝑉𝑎𝑟 𝑤 3

3

)𝑉𝑎𝑟[𝑥]

If:

• Linear activation

• 𝑥, 𝑦, 𝑤 : independent Then:

𝑛

+,

𝑛

567

(18)

Initialization

LeCun et al 1998 “Efficient Backprop”

Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”

1 3 5 7 9 11 13 15

depth

exploding

vanishing ideal

Forward:

𝑉𝑎𝑟 𝑦 = (2 𝑛

3+,

𝑉𝑎𝑟 𝑤

3

3

)𝑉𝑎𝑟[𝑥]

Backward:

𝑉𝑎𝑟 𝜕

𝜕𝑥 = (2 𝑛

3567

𝑉𝑎𝑟 𝑤

3

3

)𝑉𝑎𝑟[ 𝜕

𝜕𝑦 ]

Both forward (response) and backward (gradient)

signal can vanish/explode

(19)

Initialization

Initialization under linear assumption

LeCun et al 1998 “Efficient Backprop”

Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”

∏ 𝑛

3 3+,

𝑉𝑎𝑟 𝑤

3

= 𝑐𝑜𝑛𝑠𝑡

>?

(healthy forward) and

∏ 𝑛

3 3567

𝑉𝑎𝑟 𝑤

3

= 𝑐𝑜𝑛𝑠𝑡

@?

(healthy backward) 𝑛

3+,

𝑉𝑎𝑟 𝑤

3

= 1

or*

𝑛

3567

𝑉𝑎𝑟 𝑤

3

= 1

*: 𝑛

3567

= 𝑛

3BC+,

, so

D5,E7D5,E7FG

HG

=

,IJKLMNO

,HPQKLRS

< ∞ . It is sufficient to use either form.

“Xavier” init in Caffe

(20)

Initialization

Initialization under ReLU activation

C

V

𝑛

3+,

𝑉𝑎𝑟 𝑤

3

3

= 𝑐𝑜𝑛𝑠𝑡

>?

(healthy forward) and

C

V

𝑛

3567

𝑉𝑎𝑟 𝑤

3

3

= 𝑐𝑜𝑛𝑠𝑡

@?

(healthy backward) 1

2 𝑛

3+,

𝑉𝑎𝑟 𝑤

3

= 1 1 or

2 𝑛

3567

𝑉𝑎𝑟 𝑤

3

= 1

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.

With 𝐷 layers, a factor of 2 per layer has exponential impact of 2

Y

“MSRA” init in Caffe

(21)

Initialization

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.

ours

Xavier

22-layer ReLU net:

good init converges faster

𝑛𝑉𝑎𝑟 𝑤 = 1

ours

Xavier

30-layer ReLU net:

good init is able to converge

1

2𝑛𝑉𝑎𝑟 𝑤 = 1

1

2𝑛𝑉𝑎𝑟 𝑤 = 1 𝑛𝑉𝑎𝑟 𝑤 = 1

*Figures show the beginning of training

(22)

Batch Normalization (BN)

• Normalizing input

(LeCun et al 1998 “Efficient Backprop”)

• BN: normalizing each layer , for each mini-batch

• Greatly accelerate training

• Less sensitive to initialization

• Improve regularization

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015

(23)

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015

layer

𝑥 𝑥Z = 𝑥 − 𝜇

𝜎 𝑦 = 𝛾𝑥Z + 𝛽

• 𝜇 : mean of 𝑥 in mini-batch

• 𝜎 : std of 𝑥 in mini-batch

• 𝛾 : scale

• 𝛽 : shift

• 𝜇 , 𝜎: functions of 𝑥,

analogous to responses

• 𝛾 , 𝛽: parameters to be learned,

analogous to weights

(24)

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015

layer

𝑥 𝑥Z = 𝑥 − 𝜇

𝜎 𝑦 = 𝛾𝑥Z + 𝛽

2 modes of BN:

• Train mode:

• 𝜇 , 𝜎 are functions of 𝑥; backprop gradients

• Test mode:

• 𝜇 , 𝜎 are pre-computed* on training set

*: by running average, or post-processing after training

Caution: make sure your BN

is in a correct mode

(25)

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015 Figure taken from [S. Ioffe & C. Szegedy]

w/o BN best of w/ BN

ac cu rac y

iter.

(26)

Deep Residual Networks

From 10 layers to 100 layers

(27)

Going Deeper

• Initialization algorithms ✓

• Batch Normalization ✓

Is learning better networks as simple as stacking more layers?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(28)

Simply stacking layers?

0 1 2 3 4 5 6

0 10 20

iter. (1e4)

train error (%)

0 1 2 3 4 5 6

0 10 20

iter. (1e4)

test error (%)

CIFAR-10

56-layer

20-layer

56-layer

20-layer

Plain nets: stacking 3x3 conv layers…

• 56-layer net has higher training error and test error than 20-layer net

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(29)

Simply stacking layers?

0 1 2 3 4 5 6

0 5 10 20

iter. (1e4)

error (%)

plain-20 plain-32 plain-44 plain-56

CIFAR-10

20-layer 32-layer 44-layer 56-layer

0 10 20 30 40 50

20 30 40 50 60

iter. (1e4)

error (%)

plain-18 plain-34

ImageNet-1000

34-layer

18-layer

• “Overly deep” plain nets have higher training error

• A general phenomenon, observed in many datasets

solid: test/val dashed: train

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(30)

7x7 conv, 64, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64

3x3 conv, 128, /2 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128

3x3 conv, 256, /2 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256

3x3 conv, 512, /2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512

fc 1000

a shallower model (18 layers)

a deeper counterpart

(34 layers)

7x7 conv, 64, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 fc 1000

“extra”

layers

• Richer solution space

• A deeper model should not have higher training error

A solution by construction:

• original layers: copied from a learned shallower model

• extra layers: set as identity

• at least the same training error

• Optimization difficulties: solvers cannot find the solution when going deeper…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(31)

Deep Residual Learning

• Plaint net

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

any two stacked layers

𝑥

𝐻(𝑥)

weight layer weight layer

relu relu

𝐻 𝑥 is any desired mapping,

hope the 2 weight layers fit 𝐻(𝑥)

(32)

Deep Residual Learning

• Residual net

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

𝐻 𝑥 is any desired mapping, hope the 2 weight layers fit 𝐻(𝑥)

hope the 2 weight layers fit 𝐹(𝑥) let 𝐻 𝑥 = 𝐹 𝑥 + 𝑥

weight layer weight layer

relu

relu 𝑥

𝐻 𝑥 = 𝐹 𝑥 + 𝑥

identity

𝑥

𝐹(𝑥)

(33)

Deep Residual Learning

• 𝐹 𝑥 is a residual mapping w.r.t. identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

• If identity were optimal, easy to set weights as 0

• If optimal mapping is closer to identity, easier to find small fluctuations

weight layer weight layer

relu

relu 𝑥

𝐻 𝑥 = 𝐹 𝑥 + 𝑥

identity

𝑥

𝐹(𝑥)

(34)

Related Works – Residual Representations

• VLAD & Fisher Vector

[Jegou et al 2010], [Perronnin et al 2007]

• Encoding residual vectors; powerful shallower representations.

• Product Quantization (IVF-ADC)

[Jegou et al 2011]

• Quantizing residual vectors; efficient nearest-neighbor search.

• MultiGrid & Hierarchical Precondition

[Briggs, et al 2000], [Szeliski 1990, 2006]

• Solving residual sub-problems; efficient PDE solvers.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(35)

7x7 conv, 64, /2 pool, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 avg pool

fc 1000

7x7 conv, 64, /2 pool, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 avg pool

fc 1000

Network “Design”

• Keep it simple

• Our basic design (VGG-style)

• all 3x3 conv

(almost)

• spatial size /2 => # filters x2

(~same complexity per layer)

• Simple design; just deep!

• Other remarks:

• no hidden fc

• no dropout

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

plain net ResNet

(36)

Training

• All plain/residual nets are trained from scratch

• All plain/residual nets use Batch Normalization

• Standard hyper-parameters & augmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(37)

CIFAR-10 experiments

0 1 2 3 4 5 6

0 5 10 20

iter. (1e4)

error (%)

plain-20 plain-32 plain-44 plain-56

20-layer 32-layer 44-layer 56-layer CIFAR-10 plain nets

0 1 2 3 4 5 6

0 5 10 20

iter. (1e4)

error (%)

ResNet-20 ResNet-32 ResNet-44 ResNet-56 ResNet-110

CIFAR-10 ResNets

56-layer 44-layer 32-layer 20-layer

110-layer

• Deep ResNets can be trained without difficulties

• Deeper ResNets have lower training error, and also lower test error

solid: test dashed: train

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(38)

ImageNet experiments

0 10 20 30 40 50

20 30 40 50 60

iter. (1e4)

error (%)

ResNet-18 ResNet-34

0 10 20 30 40 50

20 30 40 50 60

iter. (1e4)

error (%)

plain-18 plain-34

ImageNet plain nets ImageNet ResNets

solid: test dashed: train

34-layer

18-layer

18-layer

34-layer

• Deep ResNets can be trained without difficulties

• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(39)

ImageNet experiments

• A practical design of going deeper

3x3, 64 3x3, 64

relu relu

64-d

3x3, 64 1x1, 64

relu

1x1, 256

relu

relu 256-d

all-3x3

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

bottleneck

(for ResNet-50/101/152)

similar complexity

(40)

ImageNet experiments

7.4 6.7

5.7 6.1

4 5 6 7 8

ResNet-34 ResNet-50

ResNet-101 ResNet-152

10-croptesting, top-5 val error (%) this model has

lower time complexity than VGG-16/19

• Deeper ResNets have lower error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

(41)

ImageNet experiments

3.57

6.7 7.3

11.7

16.4

25.8

28.2

ILSVRC'15

ResNet ILSVRC'14

GoogleNet ILSVRC'14

VGG ILSVRC'13 ILSVRC'12

AlexNet ILSVRC'11 ILSVRC'10

ImageNet Classification top-5 error (%)

shallow 8 layers

19 layers 22 layers

152 layers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

8 layers

(42)

Discussions

Representation, Optimization, Generalization

(43)

Issues on learning deep models

Representation ability

Optimization ability

Generalization ability

• Ability of model to fit training data, if optimum could be found

• If model A’s solution space is a superset of B’s, A should be better.

• Feasibility of finding an optimum

• Not all models are equally easy to optimize

• Once training data is fit, how good is the test

performance

(44)

How do ResNets address these issues?

Representation ability

Optimization ability

Generalization ability

• No explicit advantage on representation (only re-parameterization), but

• Allow models to go deeper

• Enable very smooth forward/backward prop

• Greatly ease optimizing deeper models

• Not explicitly address generalization, but

• Deeper+thinner is good generalization

(45)

On the Importance of Identity Mapping

From 100 layers to 1000 layers

(46)

On identity mappings for optimization

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

𝑥 cBC = 𝑓(ℎ 𝑥 c + 𝐹 𝑥 c ) 𝑥 𝑙

ℎ(𝑥 c ) 𝐹(𝑥 𝑙 ) layer

layer

• shortcut mapping: ℎ = identity

• after-add mapping: 𝑓 = ReLU

(47)

On identity mappings for optimization

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

𝑥 cBC = 𝑓(ℎ 𝑥 c + 𝐹 𝑥 c ) 𝑥 𝑙

ℎ(𝑥 c ) 𝐹(𝑥 𝑙 ) layer

layer

• shortcut mapping: ℎ = identity

• after-add mapping: 𝑓 = ReLU

• What if 𝑓 = identity?

(48)

On identity mappings for optimization

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

𝑥 cBC = 𝑓(ℎ 𝑥 c + 𝐹 𝑥 c ) 𝑥 𝑙

ℎ(𝑥 c ) 𝐹(𝑥 𝑙 ) layer

layer

• shortcut mapping: ℎ = identity

• after-add mapping: 𝑓 = ReLU

• What if 𝑓 = identity?

(49)

Very smooth forward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

𝑥 cBC = 𝑥 c + 𝐹 𝑥 c

𝑥 cBV = 𝑥 cBC + 𝐹 𝑥 cBC

(50)

Very smooth forward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

𝑥 cBC = 𝑥 c + 𝐹 𝑥 c

𝑥 cBV = 𝑥 c + 𝐹 𝑥 c + 𝐹 𝑥 cBC

𝑥 cBV = 𝑥 cBC + 𝐹 𝑥 cBC

(51)

Very smooth forward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

𝑥 cBC = 𝑥 c + 𝐹 𝑥 c

𝑥 cBV = 𝑥 c + 𝐹 𝑥 c + 𝐹 𝑥 cBC 𝑥 cBV = 𝑥 cBC + 𝐹 𝑥 cBC

𝑥 g = 𝑥 c + h 𝐹 𝑥 +

giC

+jc

(52)

Very smooth forward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

𝑥 g = 𝑥 c + h 𝐹 𝑥 +

giC

+jc

7x7 conv, 64, /2 pool, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 avg pool fc 1000

7x7 conv, 64, /2 pool, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 avg pool

fc 1000

𝑥 g 𝑥 c

• Any 𝑥 c is directly forward-prop to any 𝑥 g , plus residual.

• Any 𝑥 g is an additive outcome.

• in contrast to multiplicative: 𝑥

g

= ∏

giC+jc

𝑊

+

𝑥

c

(53)

Very smooth backward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

𝑥 g = 𝑥 c + h 𝐹 𝑥 +

giC

+jc

7x7 conv, 64, /2 pool, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 avg pool fc 1000

7x7 conv, 64, /2 pool, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 avg pool

fc 1000

𝜕𝐸

𝜕𝑥

g

𝜕𝐸

𝜕𝑥

c

𝜕𝐸

𝜕𝑥 c = 𝜕𝐸

𝜕𝑥 g

𝜕𝑥 g

𝜕𝑥 c = 𝜕𝐸

𝜕𝑥 g (1 + 𝜕

𝜕𝑥 c h 𝐹 𝑥 +

giC

+jC

)

(54)

Very smooth backward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

7x7 conv, 64, /2 pool, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 avg pool fc 1000

7x7 conv, 64, /2 pool, /2 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128, /2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 256, /2

3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 512, /2

3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 avg pool

fc 1000

𝜕𝐸

𝜕𝑥

g

𝜕𝐸

𝜕𝑥

c

𝜕𝐸

𝜕𝑥 c = 𝜕𝐸

𝜕𝑥 g (1 + 𝜕

𝜕𝑥 c h 𝐹 𝑥 +

giC

+jC

)

• Any

lnlm

o

is directly back-prop to any

lnlm

p

, plus residual.

• Any

lnlm

p

is additive; unlikely to vanish

• in contrast to multiplicative:

lnlm

p = ∏ 𝑊+ lnlm

o giC+jc

(55)

Residual for every layer

𝑥 g = 𝑥 c + h 𝐹 𝑥 +

giC

+jc

forward:

𝜕𝐸

𝜕𝑥 c = 𝜕𝐸

𝜕𝑥 g (1 + 𝜕

𝜕𝑥 c h 𝐹 𝑥 +

giC

+jC

)

backward:

Enabled by:

• shortcut mapping: ℎ = identity

• after-add mapping: 𝑓 = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.

(56)

Experiments

• Set 1: what if shortcut mapping ℎ ≠ identity

• Set 2: what if after-add mapping 𝑓 is identity

• Experiments on ResNets with more than 100 layers

• deeper models suffer more from optimization difficulty

(57)

Experiment Set 1:

what if shortcut mapping ℎ ≠ identity?

References

Related documents

&#34;Scalable object detection using deep neural networks.&#34; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. &#34;Faster R-CNN: Towards

American Journal of Sociology, American Sociological Review, Journal of Health and Social Behavior, Personality and Social Psychology, Social Forces, etc.. Google Scholar and

{Factors contributing to the performance of a farmworker equity-share scheme} Policy &amp; program interventions Conceptual Framework Institutional &amp; macro-policy framework

Deep Convolutional Neural Networks for Real Time Object Recognition.. 1 Jaiswal Shruti R, 2

3.2 Sharing Features for RPN and Fast R-CNN Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection

• Neural Networks in Deep Learning: Different deep learning based methods are used for text detection from natural scene images.. Convolutional Neural Networks (CNN), Feature

Classification with Deep Convolutional Neural Networks, NIPS, 2012..

Then for each bounding box, Region of Interest (RoI) pooling is applied to the final layer conv features to create a fixed size vector.. This RoI vector is then passed to