Inception Network: Inception-v3 - Deep Convolution NeuralNetworks for Image Analysis

While increasing the depth of a network by simply adding layers and using high filter numbers is a pretty forthright way of trying to increase the perfor- mance of a network, the downside is a larger number of parameters and an increased risk of overfitting, as well as a higher computational cost.

With the introduction of the Inception Module, Szegedy et al. had a somewhat different approach. Instead of a throughoutly sequential structure, part of the network works in parallel. The Inception Module was introduced alongside the GoogLeNet architecture by Szegedy et al. in their paper ”Going deeper with convolutions”[Szegedy2015a].

The authors have put much thought into the computational part of deep learning. The basic idea is to combine advantages of both sparsity and dense computation.

The sparsity is connected to the idea of analyzing the correlations between units and to cluster units with high correlation into groups. Sparse data struc- tures2 _{can ”break the symmetry and improve learning”[Szegedy et al. 2015,} p.3], they are however prone to inefficient computing. Full connections on the other hand allow more efficient dense computation.

These considerations lead to the invention of the Inception Module.

The Inception Module

A graphic representation of the Inception Module can be found in figure 3.3. Instead of having either one convolutional layer or a pooling layer, convolutional and pooling layers are working parallely. Since this would lead to too many outputs, by adding 1x1 convolutional layers before the 3x3 and 5x5 layers the authors achieved a dimensionality reduction regarding the feature maps (not spatially!). These 1x1 layers are also followed by a ReLU unit.

The optimal layered network topology can be found out by using correlation statistics. Neurons are clustered according to the correlation statistics of pre-

2_{sparse matrices are matrices with most of its elements being zero. A dense matrix is a}

Fig. 3.3: Inception Module, source: Szegedy et al. 2015a

ceding layer activations. Neurons with highly correlated outputs are clustered. 1x1 convolutions cover very local clusters, whereas 3x3 and 5x5 convolutions can cover more spread-out clusters. This is why in the early layers of the CNN smaller convolutions are used, whilst in the later layers, where the high-level reasoning takes place, 3x3 and 5x5 are more common. The results are conca- tenated, including pooling.

This original Inception Module was used for the GoogLeNet [Szegedy2015a]. In the paper “Rethinking the Inception Architecture for Computer Vision“ the authors introduced two new networks, the Inception-v2 and v3 [Szegedy2015b]. The Inception-v2 is just a stripped down version of the v3, so the remarks will be limited to the v3 here.

The Inception-v3 network adds some new ideas to the original GoogLeNet. A main element is the factorization of convolutions with large filter size.

For one thing, the authors factorized large convolutions into smaller convolutions. They replaced the 5x5 convolutional kernels of the Inception module with two 3x3 kernels to reduce the parameters, like in the VGG-network (figure 3.4).

Fig. 3.4: Inception Module: factorization into smaller convolutions. This Incep- tion module has two 3x3 convolutions instead of a 5x5 convolution (compared to the original Inception module). Source: adapted from Szegedy et al. 2015b

The authors also experimented with other types of spatial factorization. They found that a factorization of a nxn layer in a nx1 followed by a 1xn layer can also have a remarkable beneficial effect on the computational cost (s. figure 3.5).

Fig. 3.5: Visualisation of the effect of successive asymmetric convolutions: 3x1 and 1x3, source: Szegedy et al. 2015b

So they exchanged each of the 3x3 layers in the Inception Module in figure 3.4, with a 3x1 and a 1x3 layer. The resulting Inception Module can be viewed in figure 3.6. But these asymmetric convolutions did not work too well in early layers, so they proceeded to use them on medium grid-sizes.

Fig. 3.6: Inception Module: factorization into asymmetric convolutions, source: adapted from Szegedy et al. 2015b

Another version of the Inception Module is one with expanded filter bank outputs (figure 3.7).

This Inception Module is used late in the network, when the feature map is already very small (8x8). The aim is to increase the number of filters. It is designed to obtain high-dimensional representations.

Fig. 3.7: Inception Module with expanded filter bank outputs, source: adapted from Szegedy et al. 2015b

tecture, shown in table 3.1. The lower layers of the Inception network are kept traditionally, sequences of convolutional layers followed by max pooling (like in the VGG architecture).

type patch size/stride or remarks input size

conv 3x3/2 299x299x3 conv 3x3/1 149x149x32 conv padded 3x3/1 147x147x64 pool 3x3/2 147x147x64 conv 3x3/1 73x73x64 conv 3x3/2 71x71x80 conv 3x3/1 35x35x192 3x Inception as in figure 3.4 35x35x288 5x Inception as in figure 3.6 17x17x768 2x Inception as in figure 3.7 8x8x1280 pool 8x8 8x8x2048 linear logits 1x1x2048 softmax classifier 1x1x1000

Table 3.1: The Inception network architecture, source: adapted from Szegedy 2015b

Additionally, Szegedy et al. equipped their Inception-v3 network with Batch Normalization, batch-normalized auxiliary classifiers and label smoothing regularization.

Auxiliary classifiers

The Inception networks are really deep. To nevertheless embrace the advantages of more shallow networks like larger gradient signals and regularization, auxiliary classifiers are added to intermediate layers (s. figure 3.8).

The loss outputs of these classifiers are weighted by a factor of 0.3 and then added to the total loss. The auxiliary classifier outputs are only used for training, not during inference to new data.

The auxiliary classifiers were already a part of the GoogLeNet architecture, but in the Inception-v3 they also added Batch Normalization to the auxiliary softmax functions.

Fig. 3.8: Auxiliary classifiers as a part of the GoogLeNet architecture; the softmax layers are coloured yellow. (the convolutional layers are blue, pooling layers red, local response normalization and concat layers green, and softmax layers yellow); source: Szegedy et al. 2015a

Label smoothing regularization

Also, the Inception-v3 is using label smoothing regularization (LSR). When training a model with the cross-entropy, it is trying to maximize the log- likelihood of the true training labels. In other words, it tries to push the predicted probability of the true label closer to one. Szegedy et al. argue that this is not ideal when the aim is to obtain a model that is generalizable to new data. Still of course the predicted probability of the true label should be higher than the probabilities of the false labels, but the intention is to lessen the difference to make the model more adaptable.

The authors report a 4.2 top-5 error rate with a single model and 3.58 with a multi-model approach using the ILSVRC 2012 classification benchmark with the Inception-v3 architecture [Szegedy2015b].

In document Deep Convolution Neural Networks for Image Analysis (Page 43-49)