Long-term research directions - Machine learning solutions to visual recognition problems

We now conclude with several more general long-term research directions.

Learning higher-order structured prediction models. Many problems in computer vision involve joint prediction of many response variables. Examples include, but are not limited to, semantic segmentation, optical flow estimation, depth estimation, image de-noising, super-resolution, colorization, and pose estimation. These structured prediction tasks are typically solved using (conditional) Markov random fields, which include unary terms for each label variable, and pairwise terms to ensure structural regularity of the output predictions.

Deep networks have been used for such tasks (Long et al., 2015) to define unary and pairwise terms (Lin et al., 2016). Deep networks allow complex functions to be learned between a label variable and a large part, if not all, of the input variables. Moreover, it has recently been shown (Zheng et al., 2015; Schwing and Urtasun, 2015) that variational mean-field inference in Markov random fields can be expressed as a special recurrent neural network in the case of fully connected pairwise energy functions. This allows the unary and pairwise potentials to be trained in a way that is coherent with the MRF structure and the approximate inference method. Higher-order potentials, which model interactions of more than two label variables at a time, have proven effective in the past for structured prediction tasks, see e.g. (Kohli et al., 2009). Efficient inference, however, is only possible for very specific classes of higher-order potentials, see e.g. (Vineet et al., 2014; Ramalingam et al., 2008). An exciting direction for future work is to consider how larger classes of trainable higher-order potentials can be used by generalizing the techniques developed in (Zheng et al., 2015; Schwing and Urtasun, 2015) for pairwise structured models. The work of Pinheiro and Collobert on recurrent convolutional networks (Pinheiro and Collobert, 2014) is also highly relevant in this area. An alternative route to enforce higher-order regularity in the predictions might be to use adversarial networks (Goodfellow et al., 2014) that are trained in combination with the primary prediction model. The adversarial network is trained to discriminate between ground-truth samples and samples from the primary model. The primary model is trained such that the adversarial network cannot distinguish its samples from ground-truth samples. The adversarial network may thus be used to enforce higher-order consistency, even if higher-order potentials are not used in the primary model. The development of models that exhibit higher-order regularities and are trainable in a data-driven manner is likely to have a significant impact across a wide variety of multivariate and dense prediction vision problems.
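As an illustration of the recurrent view of mean-field inference mentioned above, the following minimal sketch unrolls a fixed number of mean-field updates for a fully connected pairwise model. It is not the implementation of Zheng et al. or Schwing and Urtasun: the Potts-style label compatibility, the single Gaussian kernel, and the names `mean_field`, `weight`, and `bandwidth` are assumptions made for this example, and the dense kernel matrix is only practical for a small number of variables.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mean_field(unary, features, n_iters=5, weight=1.0, bandwidth=1.0):
    """Approximate marginals of a fully connected pairwise model.

    unary:    (N, L) array of unary costs (lower means more likely).
    features: (N, D) array used to compute pairwise affinities.
    Returns an (N, L) array Q whose rows are label distributions.
    """
    # Dense pairwise affinities from a Gaussian kernel (assumed form).
    sq_dist = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    np.fill_diagonal(kernel, 0.0)  # a variable sends no message to itself

    q = softmax(-unary, axis=1)  # initialize from the unary terms only
    for _ in range(n_iters):
        # Message passing: support for each label from all other variables.
        messages = kernel @ q
        # Potts compatibility: agreeing with similar variables lowers the
        # energy, so the messages enter with a positive sign.
        q = softmax(-unary + weight * messages, axis=1)
    return q
```

Each pass through the loop applies the same update with the same parameters, which is what allows the iterations to be treated as steps of a recurrent network and differentiated end to end.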

Learning from minimal supervision. An important bottleneck limiting the performance of visual recognition systems in practical applications is the reliance on supervised training datasets. Generally, supervision is expensive and time-consuming to collect. There are at least three different paths to make up for a lack of supervised training data.

The first is to learn models that go beyond recognizing (i.e. classifying, localizing, segmenting, etc.) a manually specified finite list of (object) categories. Approaches in this direction include semantic word-image embedding models such as DeViSE (Frome et al., 2013), and image-caption encoder-decoder models (Kiros et al., 2015). Such models can in principle be learned from large non-curated datasets which contain images with (loosely) associated textual descriptions (general web images, Wikipedia, user-generated content, etc.), see e.g. (Chen et al., 2013b). This approach, combined with word-embedding techniques (Mikolov et al., 2013b) and “on-the-fly” model learning from web image-search engines (Chatfield et al., 2015), makes it possible to learn bi-directional image-text mappings that can be used, for example, for free-text visual search in large image and video datasets, without requiring any manually curated supervised training datasets.

Second, for certain critical visual recognition tasks that require a high level of accuracy (e.g. advanced driver assistance systems, or defense-related applications), manually collected supervised training datasets will be required to ensure sufficient accuracy. In such cases the question is how we can make the most of the (limited) available training data. An idea that has proven extremely effective is to use auxiliary tasks to pre-train or initialize the recognition model, see e.g. (Girshick et al., 2014). Most often pre-training is based on large supervised training datasets, with ImageNet (Deng et al., 2009) being by far the most used dataset for this purpose. Large unsupervised datasets may also be used for this purpose, by defining auxiliary tasks based on spatial or temporal structure (Doersch et al., 2015;

approach of taking a pre-trained model, and adapting it to the task at hand. In a more principled manner, we can learn by jointly minimizing the loss of the (new) target task and a loss for the (earlier) auxiliary task(s). Pushing this idea further, an interesting option is a “life-long” learning scheme in which we train a single large model for an increasing number of tasks, treating the “old” tasks as pre-training or regularization for the new tasks.
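The joint minimization described above can be sketched as a weighted sum of a target loss and an auxiliary loss computed on shared parameters. The example below assumes a deliberately simple setting that is not taken from the text: a shared linear feature map `W`, one linear regression head per task (`v_t`, `v_a`), squared losses, and a hypothetical trade-off weight `lam`.

```python
import numpy as np


def joint_loss_and_grads(W, v_t, v_a, X_t, y_t, X_a, y_a, lam=0.5):
    """Joint objective: target loss + lam * auxiliary loss.

    W:        (D, H) shared feature map used by both tasks.
    v_t, v_a: (H,) task-specific linear heads (target, auxiliary).
    X_*, y_*: inputs (n, D) and regression targets (n,) for each task.
    Returns the joint loss and its gradients w.r.t. W, v_t and v_a.
    """
    # Target task: features, residuals, mean squared error.
    h_t = X_t @ W
    r_t = h_t @ v_t - y_t
    loss_t = 0.5 * np.mean(r_t ** 2)

    # Auxiliary task, computed on the same shared features.
    h_a = X_a @ W
    r_a = h_a @ v_a - y_a
    loss_a = 0.5 * np.mean(r_a ** 2)

    loss = loss_t + lam * loss_a

    # The shared parameters receive gradient from both tasks.
    grad_W = (X_t.T @ np.outer(r_t, v_t)) / len(y_t) \
        + lam * (X_a.T @ np.outer(r_a, v_a)) / len(y_a)
    grad_v_t = h_t.T @ r_t / len(y_t)
    grad_v_a = lam * (h_a.T @ r_a) / len(y_a)
    return loss, grad_W, grad_v_t, grad_v_a
```

With `lam = 0` the auxiliary task has no influence and the objective reduces to training on the target data alone; larger values make the auxiliary task act more strongly as a regularizer on the shared parameters.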

Finally, a third approach is to rely on contextual cues. These can either take the form of spatial inter-object context, see e.g. (Rabinovich et al., 2007; Choi et al., 2010), or of relations between objects and physical scene properties such as scene geometry estimates, see e.g. (Geiger et al., 2011; Hoiem et al., 2008). Another form of context is to use complex data-adaptive non-parametric priors on the parameters of discriminative recognition models, see e.g. (Salakhutdinov et al., 2012). Such priors can infer hierarchical groupings of object categories, so that training data is shared to some extent between related classes.

These three approaches may be summarized as follows. (i) For some problems, abundantly available and loosely annotated training data may be enough to learn satisfactory models, e.g. for text-based image search. (ii) In cases where this is not sufficient, auxiliary tasks may be used for pre-training, or multi-task learning can be used as a regularization principle to make up for a lack of supervised training data. (iii) Contextual information of various forms can provide stronger structuring information. Future research on combining these different approaches may lead to important advances in learning visual recognition models from very little training data, which may have a significant impact on practical applications.

Architecture learning and adaptation. Current state-of-the-art high-level semantic scene understanding models are dominated by (convolutional) neural network approaches. These models are very powerful due to their strong capacity to model complex data distributions, which results from a hierarchical structure with millions of configurable parameters that can be automatically tuned based on (supervised) training data (Montufar et al., 2014). Beyond the challenge of efficiently estimating such models from limited training data, an even bigger challenge is posed by the model selection problem. That is: how to determine the best, or a “good”, architecture for such models? This includes the number and ordering of pooling and convolutional layers, filter sizes, number of channels, type of pooling operations, type of non-linearities, etc. This problem is extremely hard, since the space of possible network architectures is discrete and combinatorially large. Optimizing over this space is an important challenge for future research. Work in this direction includes using sparsity-inducing regularizers to sparsify the connectivity pattern (Kulkarni et al., 2015), and using sparse hierarchical priors over the network structure in a Bayesian learning framework (Adams et al., 2010).
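To give a flavor of how a sparsity-inducing regularizer can prune connectivity, the sketch below applies the proximal step of a group-lasso penalty to a weight matrix. It is a generic illustration rather than the method of Kulkarni et al.: we assume that each column of `W` holds the incoming weights of one hidden unit, and the function names and the `threshold` parameter are hypothetical.

```python
import numpy as np


def group_soft_threshold(W, threshold):
    """Proximal operator of threshold * sum_j ||W[:, j]||_2.

    Columns whose Euclidean norm is below the threshold are set to zero,
    which removes the corresponding unit; the others are shrunk.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)  # shape (1, H)
    scale = np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))
    return W * scale


def active_units(W):
    """Indices of units (columns) that still have non-zero weights."""
    return np.flatnonzero(np.linalg.norm(W, axis=0) > 0)
```

In a training loop such a step would typically be applied after each gradient update, so that whole units, and not only individual weights, are driven to zero and the effective architecture shrinks during learning.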

In the context of extremely large datasets, such as those used for learning from weakly supervised sources discussed above, model selection might not be the right problem to consider. Instead of searching for the single ultimate model architecture, it will be important to progressively adapt the model architecture and capacity during learning. That is: having seen little data, it might be useful to limit the degrees of freedom of the model. As the learning algorithm sees more data, the limited capacity will saturate, and more capacity should be allocated. This suggests that studying a dynamic variant of the model selection problem is perhaps more important.
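One simple way to allocate more capacity during learning, given here only as a hedged sketch and not as a method proposed in the text, is to widen a hidden layer while leaving the function computed so far unchanged. The two-layer ReLU network and the names `forward` and `widen_hidden_layer` are assumptions for illustration; new units receive small random incoming weights and zero outgoing weights.

```python
import numpy as np


def forward(x, W1, W2):
    """Two-layer network with a ReLU hidden layer (biases omitted)."""
    return np.maximum(0.0, x @ W1) @ W2


def widen_hidden_layer(W1, W2, n_new, rng, scale=0.01):
    """Add n_new hidden units without changing the network output.

    W1: (D, H) input-to-hidden weights; W2: (H, K) hidden-to-output weights.
    New units get small random incoming weights and zero outgoing weights,
    so the widened network initially computes the same function.
    """
    d = W1.shape[0]
    k = W2.shape[1]
    W1_new = np.concatenate([W1, scale * rng.standard_normal((d, n_new))], axis=1)
    W2_new = np.concatenate([W2, np.zeros((n_new, k))], axis=0)
    return W1_new, W2_new
```

A training procedure could call `widen_hidden_layer` whenever a held-out loss stops improving; when and by how much to grow is exactly the dynamic model selection question raised above and is left open here.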

The model selection problem is highly challenging, but progress is likely to have a big impact across many computer vision problems and beyond.

Bibliography

The zettabyte era: Trends and analysis. White Paper, 2015. http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/VNI_Hyperconnectivity_WP.pdf.

A. Geiger, C. Wojek, and R. Urtasun. Joint 3D estimation of objects and scene layout. In NIPS, 2011.

R. Adams, H. Wallach, and Z. Ghahramani. Learning the structure of deep sparse graphical models. In AISTATS, 2010.

B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, 2010.

B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. PAMI, 34(11):2189–2202, 2012.

R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.

R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. arXiv preprint, 2015.

A. Arnab, S. Jayasumana, S. Zheng, and P. Torr. Higher order potentials in end-to-end trainable conditional random fields. 2015. URL http://arxiv.org/abs/1511.08119.

S. Bagon, O. Brostovski, M. Galun, and M. Irani. Detecting and sketching the common. In CVPR, 2010.

B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger. Learning to rank with (a lot of) word features. Information Retrieval, 13(3):291–314, 2010.

K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107–1135, 2003.

R. Bekkerman and J. Jeon. Multi-modal clustering for multimedia collections. In CVPR, 2007.

A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. ArXiv e-prints, 1306.6709, 2013.

S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In NIPS, 2011.

T. Berg and D. Forsyth. Animals on the web. In CVPR, 2006.

T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller, and D. Forsyth. Names and faces in the news. In CVPR, 2004.

H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.

H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with posterior regularization. In BMVC, 2014.

C. Bishop. Pattern recognition and machine learning. Springer-Verlag, 2006.

L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010.

J. Bradley and C. Guestrin. Learning tree conditional random fields. In ICML, 2010.

S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In ECCV, 2010.

T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.

G. Carneiro, A. Chan, P. Moreno, and N. Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. PAMI, 29(3):394–410, 2007.

K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.

K. Chatfield, R. Arandjelović, O. Parkhi, and A. Zisserman. On-the-fly learning for visual search of large-scale image and video datasets. International Journal of Multimedia Information Retrieval, 2015.

Q. Chen, Z. Song, R. Feris, A. Datta, L. Cao, Z. Huang, and S. Yan. Efficient maximum appearance search for large-scale object detection. In CVPR, 2013a.

X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, 2013b.

M. Cho, S. Kwak, C. Schmid, and J. Ponce. Unsupervised object discovery and localization in the wild. In CVPR, 2015.

M. Choi, J. Lim, A. Torralba, and A. Willsky. Exploiting hierarchical context on a large database of object categories. In CVPR, 2010.

S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.

C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. Information Theory, 14(3):462–467, 1968.

O. Chum and A. Zisserman. An exemplar model for learning object classes. In CVPR, 2007.

M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In CVPR, 2015.

R. Cinbis, J. Verbeek, and C. Schmid. Unsupervised metric learning for face identification in TV video. In ICCV, 2011.

R. Cinbis, J. Verbeek, and C. Schmid. Image categorization using Fisher kernels of non-iid image models. In CVPR, 2012.

R. Cinbis, J. Verbeek, and C. Schmid. Segmentation driven object detection with Fisher vectors. In ICCV, 2013.

R. Cinbis, J. Verbeek, and C. Schmid. Multi-fold MIL training for weakly supervised object localization. In CVPR, 2014.

R. Cinbis, J. Verbeek, and C. Schmid. Approximate Fisher kernels of non-iid image models for image categorization. PAMI, 2016a.

R. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. PAMI, 2016b. To appear.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

D. Crandall and D. Huttenlocher. Weakly supervised learning of part-based spatial models for visual object recognition. In ECCV, 2006.

G. Csurka and F. Perronnin. An efficient approach to semantic segmentation. IJCV, 95(2):198–212, 2011.

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV Int. Workshop on Stat. Learning in Computer Vision, 2004.

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. doi: 10.1109/CVPR.2005.177. URL http://hal.inria.fr/inria-00548512.

J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In ICML, 2007.

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.

T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.

T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. IJCV, 100(3):257–293, 2012.

T. Dietterich, R. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.

C. Doersch, A. Gupta, and A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.

A. Dosovitskiy, J. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.

O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In ICCV, 2009.

M. Everingham, J. Sivic, and A. Zisserman. ‘Hello! My name is... Buffy’ - automatic naming of characters in TV video. In BMVC, 2006.

M. Everingham, J. Sivic, and A. Zisserman. Taking the bite out of automatic naming of characters in TV video. Image and Vision Computing, 27(5):545–559, 2009.

M. Everingham, L. van Gool, C. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (VOC) challenge. IJCV, 88(2):303–338, June 2010.

P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9), 2010.

S. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In CVPR, 2004.

R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In ECCV, 2004.

R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google’s image search. In ICCV, 2005.

B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.

A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.

T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In ICCV, 2011.

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS, 2006.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries. PAMI, 30(8):1371–1384, 2008.

K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.

C. Gu, P. Arbeláez, Y. Lin, K. Yu, and J. Malik. Multi-component models for object detection. In ECCV, 2012.

M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Automatic face naming with caption-based supervision. In CVPR, 2008.

M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, 2009a.

M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In ICCV, 2009b.

M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010a.

M. Guillaumin, J. Verbeek, and C. Schmid. Multiple instance metric learning from automatically labeled bags of faces. In ECCV, 2010b.

M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Face recognition from caption-based supervision. IJCV, 96(1):64–82, 2012.

J. Hays and A. Efros. im2gps: estimating geographic information from a single image. In CVPR, 2008.

K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.

G. Hinton, P. Dayan, B. Frey, and R. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158–1161, 1995.

D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 80:3–15, 2008.

P. Isola, D. Zoran, D. Krishnan, and E. Adelson. Learning visual groups from co-occurrences in space and time. In ICLR, 2016.

T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS, 1999.

H. Jégou, M. Douze, and C. Schmid. On the burstiness of visual elements. In CVPR, 2009.

H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. PAMI, 34(9):1704–1716, 2012.

J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In ACM SIGIR, 2003.

Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14, 2014.
