2.3 Deep learning based image representations

2.3.4 Convolutional neural networks

Convolutional Neural Networks (CNNs) have been considered the dominant approach to producing image representations for many computer vision tasks since the pioneering work of Krizhevsky et al. [64] in the 2012 ImageNet challenge. The main difference between CNNs and fully connected neural networks is the use of convolutional and pooling layers, which make CNNs particularly suitable for processing spatially structured data such as images.

A convolutional layer outputs a volume of neurons. The depth of the volume is defined by the number of kernels, or filters, in the layer. The layer is characterized by its local connections: in contrast to fully connected layers, each neuron in a convolutional layer is connected to a local region of its input. In particular, given an input layer of dimensions H × W × C, one neuron within a convolutional layer is connected to a window volume of size C × K × K of the input, where K is the kernel size. The output of the neuron corresponds to the response of its activation function to the weighted sum of the inputs in that window, with the weights given by the kernel. Figure 2.14 illustrates this for a 1 × 3 × 3 kernel.

Figure 2.14: Visualization of the local connection of one neuron within a convolutional layer. In the example, the input depth is C = 1 and the kernel size is 1 × 3 × 3. Usually, C has a larger value. For instance, most CNN models process RGB images, which means that the input layer has C = 3. The kernel is slid over the spatial dimensions H × W, keeping the same weights across all locations. The output is a 2D map, also called the activation map or feature map. Within a convolutional layer, N different kernels are considered, each with independent parameters, generating a volume of neurons composed of N activation maps.

Conceptually, the kernels (filters) are “moved” across the spatial dimensions of the input in a sliding-window manner, changing their position by S neurons (the stride) at each step. Their weights are shared across the spatial positions of the input layer, under the assumption that if a particular visual pattern is relevant at one position of the image, it will also be relevant at another.
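The sliding-window computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in this thesis: the function name `conv_layer`, the channel-first nested-list layout (C × H × W), the absence of padding, and the choice of ReLU as activation function are all assumptions made for the example.

```python
def conv_layer(inputs, kernels, biases, stride=1):
    """Forward pass of a convolutional layer (no padding, ReLU activation).

    inputs:  nested list of shape C x H x W (channel-first, an assumption)
    kernels: nested list of shape N x C x K x K, one entry per filter
    biases:  list of N floats, one per filter
    Returns a nested list of shape N x H_out x W_out.
    """
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    K = len(kernels[0][0])
    h_out = (H - K) // stride + 1
    w_out = (W - K) // stride + 1

    output = []
    for kernel, bias in zip(kernels, biases):
        feature_map = []
        for i in range(h_out):
            row = []
            for j in range(w_out):
                top, left = i * stride, j * stride
                # Weighted sum over the C x K x K window; the same kernel
                # weights are reused at every spatial position (i, j).
                total = bias
                for c in range(C):
                    for u in range(K):
                        for v in range(K):
                            total += kernel[c][u][v] * inputs[c][top + u][left + v]
                row.append(max(0.0, total))  # ReLU activation
            feature_map.append(row)
        output.append(feature_map)
    return output


# Single-channel 4 x 4 input and N = 2 kernels of size 1 x 3 x 3.
image = [[[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]]
sum_kernel = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
neg_kernel = [[[-1, -1, -1], [-1, -1, -1], [-1, -1, -1]]]
feature_maps = conv_layer(image, [sum_kernel, neg_kernel], [0.0, 0.0])
```

With stride 1, each of the two kernels produces a 2 × 2 activation map, so the output volume has depth N = 2. The second kernel yields only negative sums, which the ReLU maps to zero.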

The pooling layers are used to reduce the spatial dimensions of the convolutional volumes: they operate spatially over each activation map of a convolutional volume. Usually, max pooling is applied between consecutive convolutional layers to reduce the dimensionality of the representations and achieve some translation invariance. Figure 2.15 illustrates an example of spatial max pooling using a 2 × 2 kernel with stride 2.
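As a minimal sketch of the operation, the following plain-Python function applies max pooling to a single activation map. The function name `max_pool` and the example values are hypothetical. The default 2 × 2 kernel with stride 2 matches the setting described above.

```python
def max_pool(feature_map, size=2, stride=2):
    """Spatial max pooling over one 2D activation map (no padding)."""
    H, W = len(feature_map), len(feature_map[0])
    h_out = (H - size) // stride + 1
    w_out = (W - size) // stride + 1
    return [
        [
            # Keep only the largest activation inside each window.
            max(
                feature_map[i * stride + u][j * stride + v]
                for u in range(size)
                for v in range(size)
            )
            for j in range(w_out)
        ]
        for i in range(h_out)
    ]


activation_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool(activation_map)  # 4 x 4 map -> [[6, 4], [8, 9]]
```

Each spatial dimension is halved, and small shifts of a strong activation within a window leave the pooled value unchanged, which is the source of the limited translation invariance.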

Figure 2.15: Spatial max pooling with a 2 × 2 kernel and stride 2 applied to a single activation map of a convolutional layer.

The AlexNet model proposed by Krizhevsky et al. [64] in 2012 has five convolutional layers and three fully connected layers, containing 60 million parameters. The model was trained using SGD with momentum [98] on 1.2 million labeled images from the ImageNet dataset [29]. Figure 2.13 shows the architecture of the model. This architecture achieved the best result in the ILSVRC-2012 image classification task.

Three main factors led to the success of this method: access to a huge number of labeled images, the effective use of GPUs for fast computation of convolutional operations, and the use of non-saturating activation functions (ReLU), which speed up training and mitigate vanishing gradients.
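The effect of a non-saturating activation on gradient flow can be illustrated with a deliberately simplified sketch. It ignores the weights and assumes that every unit in a chain receives the same pre-activation value, so that the backpropagated gradient is just a product of activation derivatives. The helper names are hypothetical.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid; at most 0.25 and close to 0 for large |x|."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU; exactly 1 for any positive input."""
    return 1.0 if x > 0 else 0.0


def chained_grad(grad_fn, x, depth):
    """Product of activation derivatives through `depth` stacked units.

    Simplification: weights are ignored and every unit sees the same
    pre-activation x.
    """
    return grad_fn(x) ** depth


sigmoid_chain = chained_grad(sigmoid_grad, 2.0, 10)  # shrinks towards zero
relu_chain = chained_grad(relu_grad, 2.0, 10)        # stays at 1.0
```

Under these assumptions, the sigmoid factor shrinks geometrically with depth, whereas the ReLU factor stays at 1 for positive inputs.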

Since then, several works have appeared in the literature including networks trained for object detection [102], semantic segmentation of images [78], visual attention and saliency [7, 67] and image captioning [62, 31]. Representations derived from CNN models have also been considered for image retrieval (see Chapter 3), mostly by using representations from networks pre-trained for classification as a substitute for traditional hand-crafted descriptors such as SIFT.

2.3.4.1 Widely used pre-trained CNN models

Models trained on ImageNet [29] have achieved astounding performance on the image classification task. The architecture proposed by Krizhevsky et al. [64], also known as AlexNet, is composed of five convolutional layers, three of which (the first, second, and fifth) are followed by a max-pooling layer, and three fully connected layers, as shown in Figure 2.13.

Figure 2.16: VGG16 architecture. The network is divided into 5 convolutional blocks, each composed of a stack of convolutional layers with the same kernel size (2 convolutional layers in blocks 1 and 2, and 3 convolutional layers in blocks 3, 4, and 5). After each block, a max-pooling layer is applied, halving the spatial dimensions of the convolutional feature maps. The last block is composed of three fully connected layers. Image source: http://book.paddlepaddle.org/03.image_classification/image/vgg16.png.

better classification performance. Simonyan and Zisserman [113] proposed very deep convolutional networks, increasing the number of convolutional layers and using very small (3 × 3) convolutional filters, which pushed the depth to 16 and 19 layers (VGG16 and VGG19). Figure 2.16 shows the VGG16 architecture. The number 16 refers to the number of convolutional and fully connected layers, without counting the max-pooling layers. Each of these layers uses a ReLU activation function, with the exception of the last fully connected layer, which uses a softmax for the final classification prediction. In this thesis we use the VGG16 architecture, extracting features from the fifth convolutional block. We refer to the layers of this block as conv5_X, where X = 1, 2, 3 identifies the first, second, and third convolutional layers within the block, and to the output after applying max-pooling on conv5_3 as the pool5 layer.
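The block structure and the layer naming can be made concrete with a short sketch that enumerates the convolutional part of VGG16. Several details here are assumptions not stated in the text above: a 224 × 224 input, the per-block channel widths (64, 128, 256, 512, 512), and convolutions that preserve the spatial size. The helper `vgg16_summary` is hypothetical.

```python
# (number of conv layers, output channels) for each of the 5 blocks.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
NUM_FC_LAYERS = 3


def vgg16_summary(input_size=224):
    """List (layer name, channels, spatial size) for the conv/pool layers.

    Assumes the 3 x 3 convolutions keep the spatial size and that each
    max-pooling layer halves it.
    """
    size = input_size
    layers = []
    for block_idx, (num_convs, channels) in enumerate(VGG16_BLOCKS, start=1):
        for conv_idx in range(1, num_convs + 1):
            layers.append((f"conv{block_idx}_{conv_idx}", channels, size))
        size //= 2  # max pooling after the block
        layers.append((f"pool{block_idx}", channels, size))
    return layers


layers = vgg16_summary()
num_conv = sum(1 for name, _, _ in layers if name.startswith("conv"))
depth = num_conv + NUM_FC_LAYERS  # 13 convolutional + 3 fully connected = 16
```

Under these assumptions, conv5_1 to conv5_3 produce 512 feature maps of size 14 × 14, and pool5 reduces them to 7 × 7.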

Deeper networks were shown to achieve better results. However, He et al. [43] observed that very deep networks had higher training error: for instance, a 56-layer network had higher training error than a 20-layer network. The authors therefore reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. With the resulting skip connections it was possible to train very deep architectures of 50 or 101 layers. In particular, an ensemble of residual networks with up to 152 layers proposed in [43] achieved 3.57% top-5 error on the test set of ImageNet, and the submission won the ILSVRC 2015 classification challenge. The proposed model was named “ResNet”, usually referred to as ResNet-N, where N is the number of layers of the network, typically 50 or 101.
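The residual reformulation can be sketched as follows. For brevity the example uses small fully connected layers on plain Python lists, whereas the blocks in [43] are convolutional; the function names are hypothetical. The point illustrated is that a residual block computes F(x) + x, so it reduces to the identity mapping when the learned function F outputs zero.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]


def linear(weights, vec):
    """Matrix-vector product with `weights` given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def plain_block(x, w1, w2):
    """Two stacked layers learning an unreferenced mapping H(x)."""
    return relu(linear(w2, relu(linear(w1, x))))


def residual_block(x, w1, w2):
    """Same two layers, but they learn a residual F(x) added to the input."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])  # skip connection


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
plain_out = plain_block(x, zeros, zeros)        # input is lost
residual_out = residual_block(x, zeros, zeros)  # input passes through
```

With all weights set to zero, the plain block discards its input, while the residual block returns it unchanged (for non-negative inputs). Additional layers can therefore default to an identity mapping, which is one intuition for why very deep residual networks are easier to optimize.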

2.3.4.2 Fine-tuning a CNN model

Since it is unlikely that 1.2 million annotated images (as in the ImageNet dataset [29]) are available for an arbitrary computer vision task, it is common practice to re-use an already trained architecture for a different task. Fine-tuning is a type of transfer learning [110, 137]: it consists of modifying some of the top layers of an already trained model and optimizing the weights for the new task, instead of training from randomly initialized weights. Starting from the pre-trained weights instead of random values allows fast convergence of the models and makes it possible to train the networks when a large collection of annotated images is not available. We exploit this process for image retrieval in Chapter 5.
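The idea can be illustrated with a toy sketch in which the lower layer of a tiny model is kept frozen and only the top layer is updated by SGD on a new task. This is not the procedure of Chapter 5: the model, the "pre-trained" weight values, the data, and the function names are all hypothetical and chosen only to make the mechanism visible.

```python
def backbone(x, w_backbone):
    """Frozen feature extractor: one ReLU feature per pre-trained weight."""
    return [max(0.0, w * x) for w in w_backbone]


def head(features, w_head):
    """Top layer: linear combination of the extracted features."""
    return sum(w * f for w, f in zip(w_head, features))


def fine_tune(data, w_backbone, w_head, lr=0.01, epochs=300):
    """Update only the head with SGD on a squared error loss.

    The backbone weights are read but never modified.
    """
    w_head = list(w_head)  # start from the pre-trained head, not random values
    for _ in range(epochs):
        for x, y in data:
            feats = backbone(x, w_backbone)
            error = head(feats, w_head) - y
            # Gradient of 0.5 * error**2 with respect to each head weight.
            w_head = [w - lr * error * f for w, f in zip(w_head, feats)]
    return w_head


# Hypothetical "pre-trained" weights and a small data set for the new task.
pretrained_backbone = [1.0, -1.0]
pretrained_head = [1.0, 1.0]
new_task_data = [
    (x, 2.0 * max(0.0, x) + 5.0 * max(0.0, -x))
    for x in (-2.0, -1.0, 1.0, 2.0)
]
tuned_head = fine_tune(new_task_data, pretrained_backbone, pretrained_head)
```

In this toy setting, four training samples are enough for the head to approach the weights of the new task (2 and 5), because the features provided by the frozen backbone are already suitable.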