5.4 Integration Architecture Network
5.5.3 Evaluation Results
We present the performance of our proposed WIAN under three weight schemes (the first based on responses, the second on entropy information, and the third an adaptive weight learning scheme) and make a comprehensive comparison with general CNNs, Max Integration Architecture Networks (MIAN), Average Integration Architecture Networks (AIAN), Sum Integration Architecture Networks (SIAN), as well as the direct concatenation (CONCAT) of the previous convolutional layers in the CNN architecture. The concatenation operation is similar to the inception module in GoogLeNet [42]. A softmax loss function is employed for classification. The resulting classification accuracies are listed in Table 5.1.
The evaluated integration schemes (WIAN, MIAN, AIAN, SIAN and CONCAT) all achieve improved performance compared to general CNNs. The WIAN variants (based on responses, entropy and adaptive weight learning on each layer in the CNN) show much better results than the other approaches. WIAN with weights calculated according to the responses on each layer shows the best performance on all the benchmarks. The integration schemes MIAN, AIAN and SIAN show similar results on the test datasets.
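The four baseline integration schemes compared above can be illustrated with a minimal sketch. It assumes that the feature maps taken from the previous convolutional layers have already been brought to the same shape and flattened; the function name `integrate` and the flattened list representation are illustrative choices, not the implementation used in the experiments.

```python
from typing import List

FeatureMap = List[float]  # one flattened feature map; all maps share a length


def integrate(maps: List[FeatureMap], scheme: str) -> FeatureMap:
    """Combine same-shaped feature maps from several layers (hypothetical helper)."""
    if not maps or any(len(m) != len(maps[0]) for m in maps):
        raise ValueError("need at least one map, all of the same length")
    if scheme == "max":        # MIAN-style: element-wise maximum
        return [max(vals) for vals in zip(*maps)]
    if scheme == "sum":        # SIAN-style: element-wise sum
        return [sum(vals) for vals in zip(*maps)]
    if scheme == "average":    # AIAN-style: element-wise mean
        return [sum(vals) / len(maps) for vals in zip(*maps)]
    if scheme == "concat":     # CONCAT-style: stack the maps end to end
        return [v for m in maps for v in m]
    raise ValueError("unknown scheme: " + scheme)


layers = [[1.0, 4.0], [3.0, 2.0]]
for name in ("max", "sum", "average", "concat"):
    print(name, integrate(layers, name))
```

Note that only concatenation increases the dimensionality of the integrated representation; the max, sum and average schemes keep the shape of a single layer.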
[Figure 5.5 plots: testing error (vertical axis) versus epoch (horizontal axis) for WIAN (responses), AIAN, CONCAT and CNNs; panels (a) CIFAR-10, (b) CIFAR-100, (c) MNIST, (d) SVHN.]
Figure 5.5: The comparison of the classification error among several possible architectures on the four benchmark datasets.
Additionally, we investigate the behaviour of the testing error over the epochs of CNN training. The performance of WIAN (responses), AIAN, CONCAT and the general CNNs is evaluated. The graphs depicted in Figure 5.5 show that WIAN (responses) reaches the smallest testing error faster than the others.
This further demonstrates that the weighted integration of previous convolutional layers can boost the performance of the network.
5.6 Conclusions
In this chapter, we propose to reuse the information encoded in previous layers in the network to recover the precision loss due to the pooling operation in the CNN. We present a novel Weighted Integration Architecture Network (WIAN) to enhance the performance of CNN based image classification, in which each layer is multiplied by a weight matrix and the weighted layers are then summed element-wise. The weight matrix is generated from the responses of the layer, from the entropy of the layer, or by adaptive learning. The evaluation results demonstrated that WIAN yields high accuracy on image classification and performs better than both direct concatenation and the max, average and sum integration of the previous convolutional layers in the CNN architecture. Moreover, WIAN based on the weight value calculated according to the responses on each layer is more robust than WIAN based on the entropy value or on the adaptive learning scheme.
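The weighted integration summarized above can be sketched as follows. This is a simplified illustration under stated assumptions: each layer receives a single scalar weight rather than a full weight matrix, the response score is taken to be the mean absolute activation, the entropy score is the Shannon entropy of the normalized absolute activations, and the weights are normalized to sum to one. None of these definitions is taken from the chapter; all function names are hypothetical.

```python
import math
from typing import Callable, List

FeatureMap = List[float]  # flattened, same length for every layer (assumption)


def response_score(fmap: FeatureMap) -> float:
    """Assumed response measure: mean absolute activation of the layer."""
    return sum(abs(v) for v in fmap) / len(fmap)


def entropy_score(fmap: FeatureMap) -> float:
    """Assumed entropy measure: Shannon entropy of normalized |activations|."""
    total = sum(abs(v) for v in fmap)
    if total == 0:
        return 0.0
    probs = [abs(v) / total for v in fmap if v != 0]
    return -sum(p * math.log(p) for p in probs)


def weighted_integration(maps: List[FeatureMap],
                         score: Callable[[FeatureMap], float]) -> FeatureMap:
    """Weight each layer by its normalized score, then sum element-wise."""
    scores = [score(m) for m in maps]
    total = sum(scores)
    if total == 0:
        weights = [1.0 / len(maps)] * len(maps)  # fall back to plain averaging
    else:
        weights = [s / total for s in scores]
    return [sum(w * v for w, v in zip(weights, vals)) for vals in zip(*maps)]


layers = [[1.0, 1.0], [3.0, 3.0]]
print(weighted_integration(layers, response_score))
print(weighted_integration(layers, entropy_score))
```

In this toy input the response-based weights favour the layer with stronger activations, whereas the entropy-based weights are equal because both layers have uniformly distributed activations.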
Chapter 6
Conclusions
6.1 Conclusions
In this thesis, we focus on large scale visual search. The topic of large scale visual search has seen a steady stream of improvements in performance over the last decade. In this task, given a query image containing a specific object or scene, the goal is to return the images containing the same object or scene, which may be captured from different viewpoints, under changed illumination, and with partial occlusion. The Bag-of-Words model was originally proposed for document retrieval. The introduction of salient point methods has made this model applicable to the image domain, where it translates to the visual word model. General salient point methods involve a detector and a descriptor. The detector locates the salient regions in the image and the descriptor encodes discriminative information in the salient region into a local feature. Based on the salient point method, an image can be transformed into a collection of local feature vectors, which can be viewed as prototypes of words in text. The visual word model has been the state-of-the-art for many computer vision applications. It has greatly advanced the research of instance retrieval in the past ten years, and many improvements have been proposed.
One important aspect in the visual word model is the degree to which the salient point methods are invariant to image translation, scaling, and rotation, as well as partially invariant to illumination changes, and robust to local geometric distortion. In Chapter 2, we presented a comparison of the existing salient point detectors and descriptors on diverse image distortions. These comparative experimental studies can benefit researchers in choosing an appropriate detector and descriptor for different computer vision applications. According to the evaluation results, the FAST detector had the highest repeatability score compared to other detectors; moreover, it had the lowest detection time-cost per point. Regarding the criterion of recall-precision, our experiments showed that SIFT, BRISK, and FREAK were the best performing affine invariant descriptors. Furthermore, evaluation of the time complexity showed that the binary descriptors are efficient with respect to feature description and matching.
Existing salient point methods tend to perform poorly under viewpoint changes. In Chapter 3, we presented the Retina-inspired Invariant Fast Feature, RIFF, which was designed for invariance to scale, rotation, and affine image deformations. The RIFF descriptor is based on pair-wise comparisons over a sampling pattern loosely modelled on the sampling pattern seen in the human retina, and it introduces a method for improving accuracy by maximizing the discriminatory power of the point set. The main contribution of RIFF lies in the construction of the descriptor, where the discriminative power is optimized by ranking the points and deleting those with low distinctiveness. In our Bag-of-Words image retrieval tests on three well known datasets, RIFF outperformed the other feature descriptors with respect to invariance to scale, rotation, and affine transformations. Furthermore, we presented a performance evaluation of real valued and binary string salient point descriptors. The time complexity and space requirements showed that binary string descriptors are efficient in terms of feature extraction time and memory usage. With respect to the mAP score, the image copy detection experiments showed some significant strengths of binary string local descriptors: FREAK clearly outperformed SIFT on invariance to rotation, scale, and affine transformations; BRIEF had the best accuracy when testing invariance to image blur and was among the best in robustness to cropping.
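The general idea of a binary string descriptor built from pair-wise comparisons can be sketched as below. This is a generic illustration and not the RIFF construction from Chapter 3: the intensities are assumed to be already sampled at the pattern points, and the selection rule (keep the pairs whose bit is closest to being set half of the time over a training set) is an assumed stand-in for the distinctiveness ranking. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # indices of two sampling points in the pattern


def binary_descriptor(samples: Sequence[float], pairs: Sequence[Pair]) -> int:
    """One bit per pair: 1 if the first sample is brighter than the second."""
    bits = 0
    for a, b in pairs:
        bits = (bits << 1) | (1 if samples[a] > samples[b] else 0)
    return bits


def hamming(d1: int, d2: int) -> int:
    """Number of differing bits between two descriptors."""
    return bin(d1 ^ d2).count("1")


def most_distinctive_pairs(training: Sequence[Sequence[float]],
                           pairs: Sequence[Pair], keep: int) -> List[Pair]:
    """Assumed criterion: prefer pairs whose bit is set in about half the patches."""
    def imbalance(pair: Pair) -> float:
        a, b = pair
        mean = sum(1 for s in training if s[a] > s[b]) / len(training)
        return abs(mean - 0.5)
    return sorted(pairs, key=imbalance)[:keep]


pairs = [(0, 1), (2, 1), (2, 0)]
d1 = binary_descriptor([10, 20, 30], pairs)
d2 = binary_descriptor([30, 20, 10], pairs)
print(bin(d1), bin(d2), hamming(d1, d2))
```

Matching such descriptors only requires an XOR and a bit count, which is consistent with the low matching cost and memory usage reported for binary string descriptors.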
In recent years, the focus of image search has shifted from the visual word model to deep Convolutional Neural Network (CNN) features. The CNN is a hierarchical structure that has been shown to outperform hand-crafted features in a number of vision tasks, such as object detection, image segmentation, and classification. The power of CNNs mainly comes from the large number of parameters and the use of large scale datasets with rich annotations. Using the features extracted from CNN models, researchers have reported competitive performance compared to the classic visual word model. In Chapter 4, we proposed a novel image representation called deep binary codes, which have important advantages over deep convolutional feature representations, as they can be calculated using a generic transferred model and therefore do not require additional training. The experimental results on well-known datasets as well as a large scale dataset show that deep binary codes are competitive with state-of-the-art approaches and can significantly reduce memory and computational costs for large scale image search. Moreover, in Chapter 5, we proposed to reuse the information in the previous layers in the network to recover the precision loss due to the pooling operation in the CNN. The presented Weighted Integration Architecture Network (WIAN) can enhance the power of the CNN model.
6.2 Future Work
In the future, we will try to improve our work in the following directions:
Convolutional neural networks based local descriptor generation: The generation of effective local image descriptors plays an important role in computer vision applications such as baseline stereo vision, structure from motion, visual words based image search, image classification and object detection. The existing schemes of local descriptor generation can be categorized into hand-crafted or automatically learned schemes. Recent work focuses more on automatic learning of local descriptors. Learning based schemes usually optimize an objective function to generate robust and distinctive local descriptors. In particular, the most common objective functions are designed to minimize the distance between the descriptors from the same 3D location (scale and location) or same class label extracted under varying imaging conditions and different viewpoints, and maximize the distance between patches from different 3D locations or different class labels. Concurrently, the automatic learning schemes of local descriptors based on deep convolutional neural networks have recently made dramatic progress. A Siamese network trained with a pair-wise ranking loss function and a triplet network trained with a triplet ranking loss function, both of which minimize the distance (in the embedded space) between patches of the same labels and maximize the distance between patches of different labels, are used to automatically learn high performance local descriptors. However, all these methods suffer from huge training complexity, because they directly train CNNs using the pair-wise or triplet list, the length of which scales quadratically or cubically with the number of images in the training dataset. Therefore, it is important to further develop techniques to address huge training complexity while maintaining the robustness of the learned local descriptors. Another issue we need to address is the limitation of training data. The typical solution is to generate more training data from existing data using data augmentation schemes, such as scaling, rotating and cropping. Hence, it is important to further develop techniques for generating or collecting more comprehensive training data, which could make the networks learn better features that are robust to various changes, such as geometric transformations and occlusion.
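To make the training complexity argument concrete, the following minimal sketch shows one common form of the pair-wise (contrastive) loss and of the triplet loss on embedded descriptors, together with upper bounds on the number of candidate pairs and triplets. The exact loss formulations, the margin of 1.0 and all function names are assumptions chosen for illustration; the cited methods may differ in detail.

```python
import math
from typing import Sequence, Tuple


def squared_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedded descriptors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def contrastive_loss(x: Sequence[float], y: Sequence[float],
                     same_label: bool, margin: float = 1.0) -> float:
    """Pair-wise loss: pull matching pairs together, push others past the margin."""
    if same_label:
        return squared_distance(x, y)
    gap = max(0.0, margin - math.sqrt(squared_distance(x, y)))
    return gap * gap


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, margin + squared_distance(anchor, positive)
               - squared_distance(anchor, negative))


def candidate_counts(n: int) -> Tuple[int, int]:
    """Upper bounds: unordered pairs and ordered (anchor, positive, negative) triples."""
    return n * (n - 1) // 2, n * (n - 1) * (n - 2)


print(triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0]))
print(candidate_counts(1000))
```

Even for 1000 training patches the bounds are roughly half a million pairs and about a billion ordered triples, which illustrates why sampling or mining strategies are needed in practice.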
Convolutional neural networks based high level image representation: The outputs from the fully connected layer in the CNN are mostly used as the image representation. However, the image representation from a fully connected layer lacks a description of local patterns, which is especially critical when occlusions or truncations exist in the images. With respect to the sensitivity to local stimuli, CNN features from the bottom or intermediate layers have shown promising performance. These discriminatively trained convolutional kernels respond to specific visual patterns that evolve from the bottom to the top layers. While capturing local activations, the intermediate features are less invariant to image translations. Pooling is usually used to map the intermediate features into a global feature; one promising direction for future research is to find more efficient ways to convert the intermediate features into low dimensional and highly distinctive image representations, in order to avoid the information loss caused by pooling operations. Second, it is known that the top layers in CNNs are sensitive to semantics, while intermediate layers are specific to local patterns. For the image representation, we can obtain multiple layer features from the pre-trained CNN through one feed-forward step. It is not trivial to predict which layers are superior. Therefore, the fusion of the features from multiple layers is a good practice to further improve the accuracy of image search. Moreover, we can also fuse the features from different models to represent the image.
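One simple way to fuse features from multiple layers, or from different models, is sketched below. It assumes that each layer has already been reduced to a fixed-length vector, and it uses per-layer L2 normalization followed by concatenation so that no single layer dominates the similarity. This is a common baseline offered for illustration only, not a method proposed in this thesis; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_layers(layer_features: Sequence[Sequence[float]]) -> List[float]:
    """Normalize each layer's feature separately, then concatenate them."""
    fused: List[float] = []
    for feat in layer_features:
        fused.extend(l2_normalize(feat))
    return fused


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity between two fused image representations."""
    return sum(a * b for a, b in zip(l2_normalize(x), l2_normalize(y)))


query = fuse_layers([[3.0, 4.0], [0.0, 2.0]])
candidate = fuse_layers([[4.0, 3.0], [0.0, 1.0]])
print(query, cosine_similarity(query, candidate))
```

The same function can be applied to features taken from different models by passing one vector per model instead of one per layer.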
Convolutional neural networks based deep hash learning: For efficient large scale image search, the high performance of supervised deep hashing models appears promising. The first direction is to increase the ability to generalize by increasing the width or depth of the networks, for example, the width and depth of the CNN models in the literature [42, 51]. Larger networks normally bring higher quality performance, but carry the danger of over-fitting and require very large computational resources. A second direction is to define a good ranking loss function. As the commonly used pair-wise loss functions and triplet loss functions employ the Euclidean distance to measure the similarity in the input space, we can replace the Euclidean distance with different similarity metrics for different input spaces. Moreover, we can also incorporate constraint information from the input space into the loss functions. A third direction towards more powerful models is to design more specific deep networks. Currently, almost all of the CNN-based schemes adopt a shared network for their predictions, which may not be distinctive enough. The study by Ouyang et al. [174] has verified that object-level annotation is superior to image-level annotation for object detection. This can be viewed as a kind of specific deep network that focuses on the object region rather than the whole image. Another issue we need to note is that in some situations the amount of annotated data is insufficient, which could result in over-fitting during the training of the CNN. Semi-supervised deep hashing makes use of the labeled data together with the
Bibliography
[1] Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the 9th International Conference on Computer Vision. (2003) 1470–1477
[2] Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2007) 1–8
[3] Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., Schmid, C.: Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 1704–1716
[4] Arandjelovic, R., Zisserman, A.: All about VLAD. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2013) 1578–1585
[5] Delhumeau, J., Gosselin, P.H., Jégou, H., Pérez, P.: Revisiting the VLAD image representation. In: Proceedings of the 21st ACM international conference on Multimedia. (2013) 653–656
[6] Wang, Z., Di, W., Bhardwaj, A., Jagadeesh, V., Piramuthu, R.: Geometric VLAD for large scale image search. arXiv preprint arXiv:1403.3829 (2014)
[7] Jégou, H., Chum, O.: Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In: Proceedings of European Conference on Computer Vision. (2012) 774–787
[8] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (1998) 2278–2324
[9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2009) 248–255
[10] Lindeberg, T.: Scale-space theory: A basic tool for analyzing structures at different scales. Journal of Applied Statistics 21 (1994) 225–270
[11] Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the 7th International Conference on Computer Vision. (1999) 1150–1157
[12] Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Proceedings of European Conference on Computer Vision. (2006) 404–417
[13] Weickert, J., Romeny, B.T.H., Viergever, M.A.: Efficient and reliable schemes for nonlinear diffusion filtering. IEEE Transactions on Image Processing 7 (1998) 398–410
[14] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2004) 91–110
[15] Rosin, P.L.: Measuring corner properties. Computer Vision and Image Understanding 73 (1999) 291–307
[16] Leutenegger, S., Chli, M., Siegwart, R.Y.: BRISK: Binary robust invariant scalable keypoints. In: Proceedings of the International Conference on Computer Vision. (2011) 2548–2555
[17] Alahi, A., Ortiz, R., Vandergheynst, P.: FREAK: Fast retina keypoint. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2012) 510–517
[18] Wu, S., Lew, M.S.: RIFF: Retina-inspired invariant fast feature descriptor. In: Proceedings of the ACM International Conference on Multimedia. (2014) 1129–1132
[19] Chum, O., Matas, J.: Fast computation of min-hash signatures for image collections. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2012) 3077–3084
[20] Raginsky, M., Lazebnik, S.: Locality-sensitive binary codes from shift-invariant kernels. In: Advances in Neural Information Processing Systems. (2009) 1509–1517
[21] Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. In: International Conference on Very Large Databases. Volume 99. (1999) 518–529
[22] Chum, O., et al.: Large-scale discovery of spatially related images. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 371–377
[23] Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: Advances in Neural Information Processing Systems. (2009) 1753–1760
[24] Shao, J., Wu, F., Ouyang, C., Zhang, X.: Sparse spectral hashing. Pattern Recognition Letters 33 (2012) 271–277
[25] Zhang, D., Wang, J., Cai, D., Lu, J.: Self-taught hashing for fast similarity search. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. (2010) 18–25
[26] Liu, W., Wang, J., Kumar, S., Chang, S.F.: Hashing with graphs. In: Proceedings of the 28th International Conference on Machine Learning. (2011) 1–8
[27] Irie, G., Li, Z., Wu, X.M., Chang, S.F.: Locally linear hashing for extracting non-linear manifolds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 2115–2122
[28] Li, X., Lin, G., Shen, C., Van Den Hengel, A., Dick, A.R.: Learning hash functions using column generation. In: Proceedings of the International Conference on Machine Learning. (2013) 142–150
[29] Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F.: Supervised hashing with kernels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2012) 2074–2081
[30] Norouzi, M., Blei, D.M.: Minimal loss hashing for compact binary codes. In: Proceedings of the 28th International Conference on Machine Learning. (2011) 353–360
[31] Huang, L.K., Yang, Q., Zheng, W.S.: Online hashing. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence. (2013) 1422–1428
[32] Norouzi, M., Fleet, D.J., Salakhutdinov, R.R.: Hamming distance metric learning. In: Advances in Neural Information Processing Systems. (2012) 1061–1069
[33] Wang, J., Kumar, S., Chang, S.F.: Sequential projection learning for hashing with compact codes. In: Proceedings of the 27th International Conference on Machine Learning. (2010) 1127–1134
[34] Wang, J., Kumar, S., Chang, S.F.: Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 2393–2406
[35] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: Proceedings of the International Conference on Computer Vision. (2015) 118–126
[36] Kumar, B., Carneiro, G., Reid, I.: Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss