Deep descriptor and deep learning in texture analysis

2.2 Texture analysis

2.2.4 Deep descriptor and deep learning in texture analysis

This section describes recent works on CNNs and deep learning in texture analysis. Neural networks, deep learning and CNNs are introduced in Appendix A to which the reader may refer if not familiar with the concepts used here.

In [122], the authors argue that the dimensionality of texture datasets is too large for deep neural networks to classify them without hand-crafted features being extracted beforehand, as opposed to digit or object datasets, which lie on a lower-dimensional lattice. They conclude that neural networks must therefore be redesigned to learn texture features similar to GLCM and Haralick features. Recent work on CNNs applied to texture analysis, however, illustrates that they cope very well with texture datasets [8, 25, 36, 123]. It was shown that convolutional architectures are, subject to minor modifications, well suited to the analysis of texture and that they substantially improve the state of the art in this field.

Basic CNN architectures have been applied to texture recognition, for example in [124], where a simple four-layer network was used in the early stages of deep learning to classify simple texture images. More recent CNNs were applied to forest species images with high texture content [125]. Transfer learning between texture datasets was studied in [126] with classic CNN architectures and applied to forest species classification, a problem similar to texture recognition. While more complex and more accurate than [124], these approaches still do not take the characteristics of texture images (i.e. statistical properties, repeated patterns and the irrelevance of overall shape analysis) into consideration, as they simply apply a standard CNN architecture to a texture dataset.

A Recurrent Neural Network (RNN) approach (see Appendix A.2.5) was used for texture classification in [127] and texture segmentation in [115]. This approach, however, does not benefit from the weight sharing and local connectivity that allow CNNs to detect texture patterns efficiently, and it requires data augmentation to learn simple invariances.

A wavelet scattering convolution network (ScatNet) was developed in [128] using hand-crafted wavelets as convolution filters and extended to affine invariance in [129]. The PCA Network (PCANet) [130], inspired by ScatNet, uses a cascade of PCA to learn filter banks in a CNN-like architecture, with binary quantisation providing the non-linearity. Block-wise histograms of the quantised responses are pooled and classified using an SVM or softmax.
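The PCANet idea described above can be illustrated with a minimal single-stage sketch. It is a simplification under stated assumptions and not the implementation of [130]: only one PCA stage is used, a single global histogram replaces the block-wise histograms, and the function names (extract_patches, learn_pca_filters, binary_code_histogram) are hypothetical.

```python
import numpy as np

def extract_patches(image, k):
    """Return all k x k patches of a 2-D image as rows, with each patch mean removed."""
    h, w = image.shape
    patches = np.array([
        image[i:i + k, j:j + k].ravel()
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ])
    return patches - patches.mean(axis=1, keepdims=True)

def learn_pca_filters(images, k=5, n_filters=4):
    """Learn n_filters filters of size k x k as the leading principal directions of the patches."""
    patches = np.vstack([extract_patches(img, k) for img in images])
    # The right singular vectors of the patch matrix are the eigenvectors
    # of the patch scatter matrix, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(patches, full_matrices=False)
    return vt[:n_filters].reshape(n_filters, k, k)

def binary_code_histogram(image, filters):
    """Filter the image, binarise the responses and histogram the resulting integer codes."""
    n, k, _ = filters.shape
    patches = extract_patches(image, k)                 # (n_positions, k*k)
    responses = patches @ filters.reshape(n, -1).T      # (n_positions, n)
    bits = (responses > 0).astype(np.int64)             # binary quantisation
    codes = bits @ (2 ** np.arange(n))                  # one integer in [0, 2^n) per position
    # Simplification: one global histogram instead of block-wise histograms.
    return np.bincount(codes, minlength=2 ** n)
```

The resulting histogram would then be passed to a classifier such as an SVM; no filter is learned by back-propagation in this scheme.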

Following these early applications of CNNs to texture images, several methods have been developed around the general idea of discarding the overall shape analysis of CNNs through an orderless pooling of feature maps, with excellent results, as described in the following paragraphs.

Deep Convolutional-network Activation Features (DeCAF) [131], which extract the penultimate fully-connected layer of AlexNet for SVM classification, were ported from object recognition to texture classification in [99]. Texture descriptors were densely extracted from a convolutional network with the Fisher Vector CNN (FV-CNN) in [8]. Inspired by DeCAF, a CNN (VGG-M or VGG-VD) pre-trained on ImageNet [50] is used as a feature extractor. The output of the last convolution layer is encoded with an FV and classified with a one-vs-rest SVM. The overall shape information is discarded in this analysis by replacing the fully-connected layers with the orderless FV pooling and SVM classification. As the network is only used for feature extraction, it is not fine-tuned and does not learn from the texture dataset. However, owing to the domain transferability of filters pre-trained on ImageNet, this approach achieves impressive results on datasets for both texture recognition and texture recognition in clutter. The FV-CNN is also combined with shape analysis by extracting the outputs of the penultimate fully-connected layer (called FC-CNN, similar to DeCAF), demonstrating the complementarity of shape and texture analysis. Finally, the FV-CNN can be used efficiently in a region-based approach, as it requires computing the convolution output only once and then pooling the desired regions with FV encoding. In [123], the FV-CNN is improved with a dimensionality reduction step that reduces the redundancy and increases the discriminative power of the features extracted from the CNN prior to SVM classification. This approach involves extracting the output of a pre-trained CNN, FV encoding, training an ensemble of fully-connected neural networks for dimensionality reduction and training an SVM for classification.
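The region-based use of FV pooling on convolutional features can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the pipeline of [8]: the feature map stands in for the output of the last convolution layer, the diagonal GMM parameters (w, mu, sigma) are assumed to have been fitted beforehand on training descriptors, the standard gradient-based FV formulation with respect to means and standard deviations is used, and the function names are hypothetical.

```python
import numpy as np

def region_descriptors(feature_map, mask):
    """feature_map: (C, H, W); mask: (H, W) boolean. Returns the (N, C) local descriptors in the region."""
    return feature_map[:, mask].T

def fisher_vector(x, w, mu, sigma):
    """x: (N, D) descriptors; w: (K,), mu: (K, D), sigma: (K, D) of a diagonal GMM. Returns (2*K*D,)."""
    n = x.shape[0]
    diff = (x[:, None, :] - mu[None]) / sigma[None]              # (N, K, D) whitened residuals
    # Log posterior of each component, up to a constant shared by all components.
    log_p = np.log(w)[None] - np.log(sigma).sum(axis=1)[None] - 0.5 * (diff ** 2).sum(axis=2)
    log_p -= log_p.max(axis=1, keepdims=True)                    # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                    # (N, K) soft assignments
    # Gradients with respect to the means and standard deviations, summed over
    # all descriptors: the spatial order of the descriptors is discarded here.
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (n * np.sqrt(w)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * w)[:, None])
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])
```

Because the feature map is computed once for the whole image, several regions can be described by calling region_descriptors with different masks and encoding each result, without re-running the convolutional layers.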

A Texture CNN (T-CNN) was developed in [25] which includes an energy layer to extract the dense response to intermediate features in the network, improving the results on texture classification tasks while reducing the complexity compared to classic CNNs. The complementarity of texture and shape analysis is shown with an end-to-end CNN training scheme. A framework splitting the images and using the T-CNN with a voting score approach was developed in [35, 51] for the classification of biomedical texture images. A fully convolutional approach was developed for the segmentation of texture regions in [53]. These three methods [25, 51, 53] will be presented in the thesis as part of the main contributions.
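One plausible reading of the energy layer is a global average of the rectified response of each feature map, giving one value per filter in the manner of a filter bank energy. The sketch below makes that assumption explicit; the function name energy_layer is hypothetical and the exact layer definition should be taken from [25].

```python
import numpy as np

def energy_layer(feature_maps):
    """feature_maps: (C, H, W) convolution outputs. Returns one energy value per map, shape (C,)."""
    # Assumption: energy = spatial average of the rectified (ReLU) activations.
    rectified = np.maximum(feature_maps, 0.0)
    return rectified.mean(axis=(1, 2))
```

Since the average is taken over all positions, the descriptor does not depend on where a pattern occurs in the image, and its size equals the number of feature maps rather than their spatial extent, which is consistent with the reduced complexity mentioned above.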

A bilinear CNN model was developed in [132] for fine-grained recognition (i.e. of visually and semantically very similar classes), combining two CNN streams to extract and classify local pairwise features in a neural network framework. One stream, made of convolution and pooling layers, works as an object recognition network (i.e. the “what”), while the other stream analyses the spatial location of the object in the image (i.e. the “where”). Their output feature maps are multiplied at each location using an outer product and densely pooled across the image by summing the resulting features. This yields robust image descriptors which capture translationally invariant local feature interactions. The developed architecture can also generalise several classic orderless pooling descriptors, including VLAD and FV, in a deep learning framework. This method is successfully applied to texture classification in [36] and obtains slightly better results than the FV-CNN while being trained end-to-end.
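The outer product and sum pooling described above can be written compactly. The following is a minimal sketch, assuming both streams produce feature maps of the same spatial size; the function name bilinear_pool is hypothetical, and the optional signed square root and L2 normalisation are included as an assumption about common practice with this descriptor, not as a detail stated in the text.

```python
import numpy as np

def bilinear_pool(fa, fb, normalise=True):
    """fa: (Ca, H, W), fb: (Cb, H, W) outputs of the two streams. Returns a (Ca*Cb,) descriptor."""
    a = fa.reshape(fa.shape[0], -1)          # (Ca, H*W)
    b = fb.reshape(fb.shape[0], -1)          # (Cb, H*W)
    # Sum over locations of the outer products a_l b_l^T equals the matrix product A B^T.
    y = (a @ b.T).ravel()
    if normalise:
        y = np.sign(y) * np.sqrt(np.abs(y))  # signed square root (assumed post-processing)
        y = y / (np.linalg.norm(y) + 1e-12)  # L2 normalisation (assumed post-processing)
    return y
```

The sum over locations is what makes the descriptor orderless: permuting the spatial positions of both streams in the same way leaves the result unchanged.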

A Deep Texture Encoding Network (Deep-TEN) was introduced in [133] by integrating an encoding layer on top of convolution layers, also generalising orderless pooling methods such as VLAD and FV in a CNN architecture trained end-to-end.
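A minimal sketch of such an encoding layer is given below, assuming the residual encoding form with soft assignment: each local descriptor is softly assigned to learnable codewords and the weighted residuals are aggregated per codeword. The names encoding_layer, codewords and scales are hypothetical, and in a trained network the codewords and scales would be learned by back-propagation rather than fixed as they are here.

```python
import numpy as np

def encoding_layer(x, codewords, scales):
    """x: (N, D) local descriptors; codewords: (K, D); scales: (K,) positive. Returns (K, D)."""
    residuals = x[:, None, :] - codewords[None]                 # (N, K, D)
    # Soft-assignment weights: softmax over codewords of the negative scaled squared distances.
    logits = -scales[None] * (residuals ** 2).sum(axis=2)       # (N, K)
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Aggregate the weighted residuals over all descriptors (orderless pooling).
    return (weights[:, :, None] * residuals).sum(axis=0)
```

With a single codeword the layer reduces to a plain sum of residuals, and with very large scales the soft assignment approaches the hard assignment used by VLAD, which illustrates how this kind of layer generalises classic orderless encoders.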

Rotation invariance was embedded in a shallow CNN in [134] by tying the weights of multiple rotated versions of each filter for texture classification. While rotation invariance of simple texture descriptors can be learned in this way and pooled in an orderless manner with average pooling, the benefit over hand-crafted rotation-invariant descriptors is limited by the use of a single layer.
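The weight-tying idea can be illustrated with a simplified sketch restricted to rotations by multiples of 90 degrees, which need no interpolation. This is an assumption made for clarity and not the configuration of [134]; taking the maximum over orientations before average pooling is likewise an assumption, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_valid(image, filt):
    """Valid-mode 2-D cross-correlation of a 2-D image with a square filter."""
    windows = sliding_window_view(image, filt.shape)    # (H-k+1, W-k+1, k, k)
    return np.einsum("ijkl,kl->ij", windows, filt)

def rotation_tied_descriptor(image, base_filters):
    """base_filters: (F, k, k). Returns an (F,) descriptor invariant to 90-degree image rotations."""
    descriptor = []
    for filt in base_filters:
        # The four orientations share the same weights: they are rotated copies of one filter.
        responses = np.stack([correlate_valid(image, np.rot90(filt, r)) for r in range(4)])
        pooled_over_angles = responses.max(axis=0)       # keep the best-matching orientation
        # Rectify, then average over positions (orderless pooling).
        descriptor.append(np.maximum(pooled_over_angles, 0.0).mean())
    return np.array(descriptor)
```

Rotating the input by 90 degrees only permutes the four response maps and rotates them jointly, so the maximum over orientations followed by the spatial average is unchanged.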

In [93], multiple deep texture descriptors were evaluated including FV-CNN, ScatNet and PCANet, and compared to several variants of LBP descriptors. As expected, the deep convolutional descriptors obtain the best results, at the cost of a much higher computational complexity as compared to LBP variants.

Finally, deep networks have also been applied to texture synthesis. A pre-trained CNN was used in [135] to compute statistics (correlation between feature maps) at multiple layers using a source image as input. A new texture image is then generated by optimising an initial random image via gradient descent (see Appendix A.2.3) to obtain similar statistics at multiple depths. A spatial GAN (see Appendix A.3.6) was used in [37] to synthesise texture images of arbitrary sizes by replacing the random noise vector generally used as input to the generator by a spatial tensor.
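The feature map correlations used as texture statistics can be sketched as Gram matrices, with a loss comparing the statistics of the generated and source images at several layers. This is a minimal sketch under stated assumptions: the feature maps are taken as given (the pre-trained CNN is not included), the normalisation by the number of positions is an illustrative choice, and the function names gram_matrix and texture_loss are hypothetical.

```python
import numpy as np

def gram_matrix(feature_maps):
    """feature_maps: (C, H, W). Returns the (C, C) matrix of correlations between feature maps."""
    a = feature_maps.reshape(feature_maps.shape[0], -1)   # (C, H*W)
    # Assumed normalisation: divide by the number of spatial positions.
    return a @ a.T / a.shape[1]

def texture_loss(generated_feats, source_feats):
    """Sum over layers of the squared difference between Gram matrices."""
    return float(sum(
        ((gram_matrix(g) - gram_matrix(s)) ** 2).sum()
        for g, s in zip(generated_feats, source_feats)
    ))
```

In the synthesis procedure described above, this loss would be minimised with respect to the pixels of the generated image by back-propagating through the network. The Gram matrix discards the spatial arrangement of the responses, which is why it describes the texture rather than the layout of the source image.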

In document Deep learning for texture and dynamic texture analysis (Page 61-64)