7.1 Future Work

During the course of this research, two interesting issues have arisen which merit further consideration in future work.

First, we could build a unified model that estimates 2D and 3D human pose simultaneously. We expect 2D human pose estimation and 3D human pose reconstruction to benefit each other when such a model is trained jointly. For example, an estimated 2D pose should be the 2D projection of a plausible 3D pose, and, conversely, 3D pose reconstruction performs better when it is given a more accurate 2D pose estimate. Training this unified model requires human pose data with both 2D and 3D annotations. Unfortunately, the currently available datasets are not sufficiently diverse to avoid overfitting. For example, the HumanEva-I dataset [122] contains only seven calibrated video sequences performed by four subjects. Given a sufficiently diverse dataset, we could incorporate 3D human pose reconstruction into our DS-CNN model by simply adding a 3D localization regressor.
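
The projection constraint between the two tasks can be illustrated with a minimal sketch. It is not part of the DS-CNN model; it assumes an ideal pinhole camera with joints already expressed in the camera frame, and the function names `project_pinhole` and `reprojection_error` as well as the `focal` parameter are hypothetical. A term of this kind could, under these assumptions, penalize disagreement between the 2D and 3D outputs of a unified model.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def project_pinhole(joints_3d: Sequence[Point3D], focal: float = 1.0) -> List[Point2D]:
    """Project camera-frame 3D joints onto the image plane (ideal pinhole, assumed)."""
    projected = []
    for x, y, z in joints_3d:
        if z <= 0:
            raise ValueError("each joint must lie in front of the camera (z > 0)")
        projected.append((focal * x / z, focal * y / z))
    return projected


def reprojection_error(
    joints_2d: Sequence[Point2D],
    joints_3d: Sequence[Point3D],
    focal: float = 1.0,
) -> float:
    """Mean squared distance between estimated 2D joints and projected 3D joints."""
    projected = project_pinhole(joints_3d, focal)
    if len(projected) != len(joints_2d):
        raise ValueError("2D and 3D poses must have the same number of joints")
    total = 0.0
    for (u, v), (pu, pv) in zip(joints_2d, projected):
        total += (u - pu) ** 2 + (v - pv) ** 2
    return total / len(joints_2d)


if __name__ == "__main__":
    # Two hypothetical joints at depth 2; the second 2D estimate is off by 1 in v.
    pose_3d = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]
    pose_2d = [(0.0, 0.0), (0.5, 1.5)]
    print(reprojection_error(pose_2d, pose_3d))
```

The error is zero exactly when the estimated 2D pose coincides with the projection of the estimated 3D pose, which is the consistency the paragraph above describes.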

Second, in Chapter 6, we found a direct correlation between recognition accuracy and the correctness of the attribute vs. object-part correspondence that the CNN discovers. Inspired by this observation, further research could guide the CNN to fire on the correct image region when recognition accuracy on the validation dataset is poor. We expect this to improve performance on visual tasks such as human attribute recognition and fine-grained recognition.
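
One possible way to quantify whether a network fires on the correct region is sketched below. This is only an illustration under stated assumptions, not a method evaluated in this thesis: it assumes a non-negative 2D activation map (for example, taken after a ReLU) and a binary mask of the expected object part at the same resolution. The names `activation_mass_inside` and `localization_penalty` are hypothetical.

```python
from typing import Sequence


def activation_mass_inside(
    activation: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Fraction of total (non-negative) activation that falls inside the part mask."""
    total = sum(sum(row) for row in activation)
    if total <= 0:
        return 0.0  # no activation at all: nothing fires on the part
    inside = sum(
        a
        for act_row, mask_row in zip(activation, mask)
        for a, m in zip(act_row, mask_row)
        if m
    )
    return inside / total


def localization_penalty(
    activation: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Penalty in [0, 1] that is zero when all activation lies inside the mask."""
    return 1.0 - activation_mass_inside(activation, mask)


if __name__ == "__main__":
    # Hypothetical 2x2 activation map; the expected part is the bottom-right cell.
    activation = [[1.0, 0.0], [0.0, 3.0]]
    mask = [[0, 0], [0, 1]]
    print(localization_penalty(activation, mask))
```

Such a penalty could in principle be added to the training objective for images where part annotations are available, so that low validation accuracy on an attribute triggers stronger spatial guidance.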

Bibliography

[1] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.

[2] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “Panda: Pose aligned networks for deep attribute modeling,” in CVPR, 2014.

[3] L. Sigal, A. O. Balan, and M. J. Black, “Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion,” IJCV, 2010.

[4] A. Hernández-Vela, N. Zlateva, A. Marinov, M. Reyes, P. Radeva, D. Dimov, and S. Escalera, “Graph cuts optimization for multi-limb human segmentation in depth maps,” in CVPR, 2012.

[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” TPAMI, 2010.

[6] G. Rogez, C. Orrite-Uruñuela, and J. Martínez-del Rincón, “A spatio-temporal 2d-models framework for human pose recovery in monocular sequences,” Pattern Recognition, 2008.

[7] A. Agarwal and B. Triggs, “Recovering 3d human pose from monocular images,” TPAMI, 2006.

[8] L. Sigal, M. Isard, B. H. Sigelman, and M. J. Black, “Attractive people: Assembling loose-limbed models using non-parametric belief propagation,” in NIPS, 2003.

[9] Y. Wang and G. Mori, “Multiple tree models for occlusion and spatial constraints in human pose estimation,” in ECCV, 2008.

[10] H. Jiang and D. R. Martin, “Global pose estimation using non-tree models,” in CVPR, 2008.

[11] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in ICML, 2010.

[12] V. Ramakrishna, T. Kanade, and Y. Sheikh, “Reconstructing 3d human pose from 2d image landmarks,” in ECCV, 2012.

[13] R. Poppe, “Vision-based human motion analysis: An overview,” CVIU, 2007.

[14] X. Perez-Sala, S. Escalera, C. Angulo, and J. Gonzalez, “A survey on model based approaches for 2d and 3d visual human pose recovery,” Sensors, 2014.

[15] C. Wang, Y. Wang, and A. L. Yuille, “An approach to pose-based action recognition,” in CVPR, 2013.

[16] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards understanding action recognition,” in ICCV, 2013.

[17] A. J. Davison, J. Deutscher, and I. D. Reid, Markerless motion capture of complex full-body movement for character animation. Springer, 2001.

[18] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in CVPR, 2014.

[19] M. V. Peelen and P. E. Downing, “The neural basis of visual body perception,” Nature Reviews Neuroscience, 2007.

[20] A. Yao, J. Gall, and L. Van Gool, “Coupled action recognition and pose estimation from multiple views,” IJCV, 2012.

[21] T.-H. Yu, T.-K. Kim, and R. Cipolla, “Unconstrained monocular 3d human pose estimation by action detection and cross-modality regression forest,” in CVPR, 2013.

[22] D. M. Gavrila and L. S. Davis, “3-d model-based tracking of humans in action: a multi-view approach,” in CVPR, 1996.

[23] H.-J. Lee and Z. Chen, “Determination of 3d human body postures from a single view,” Computer Vision, Graphics, and Image Processing, 1985.

[24] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004.

[25] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.

[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.

[27] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.

[28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.

[29] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” IJCV, 2005.

[30] T.-P. Tian and S. Sclaroff, “Fast globally optimal 2d human detection with loopy graph models,” in CVPR, 2010.

[31] X. Ren, A. C. Berg, and J. Malik, “Recovering human body configurations using pairwise constraints between parts,” in ICCV, 2005.

[32] M. Andriluka, S. Roth, and B. Schiele, “Monocular 3d pose estimation and tracking by detection,” in CVPR, 2010.

[33] K. Raja, I. Laptev, P. Pérez, and L. Oisel, “Joint pose estimation and action recognition in image graphs,” in ICIP, 2011.

[34] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” arXiv preprint arXiv:1502.01852, 2015.

[35] S. Amin, M. Andriluka, M. Rohrbach, and B. Schiele, “Multi-view pictorial structures for 3d human pose estimation,” in British Machine Vision Conference. BMVA Press, 2013.

[36] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.

[37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” arXiv preprint arXiv:1512.04150, 2015.

[38] M. A. Fischler and R. A. Elschlager, “The representation and matching of pictorial structures,” IEEE Transactions on Computers, 1973.

[39] K. Duan, D. Batra, and D. J. Crandall, “A multi-layer composite model for human pose estimation,” in BMVC, 2012.

[40] R. Vidal, “A tutorial on subspace clustering,” IEEE Signal Processing Magazine, 2010.

[41] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in CVPR, 2009.

[42] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman, “Clustering appearances of objects under varying illumination conditions,” in CVPR, 2003.

[43] M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Computation, 1999.

[44] Y. Sugaya and K. Kanatani, “Geometric structure of degeneracy for multi-body motion segmentation,” in Statistical Methods in Video Processing, 2004.

[45] J. P. Costeira and T. Kanade, “A multibody factorization method for independently moving objects,” IJCV, 1998.

[46] C. W. Gear, “Multibody grouping from motion images,” IJCV, 1998.

[47] K. Kanatani, “Motion segmentation by subspace separation and model selection,” image, 2001.

[48] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (gpca),” TPAMI, 2005.

[49] Y. Ma, A. Y. Yang, H. Derksen, and R. Fossum, “Estimation of subspace arrangements with applications in modeling and segmenting mixed data,” SIAM Review, 2008.

[50] H. Derksen, Y. Ma, W. Hong, and J. Wright, “Segmentation of multivariate mixed data via lossy coding and compression,” in Electronic Imaging 2007, 2007.

[51] E. Elhamifar and R. Vidal, “Clustering disjoint subspaces via sparse representation,” in Acoustics Speech and Signal Processing (ICASSP), 2010.

[52] E. Amaldi and V. Kann, “On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems,” Theoretical Computer Science, 1998.

[53] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” TPAMI, 2013.

[54] A. Y. Ng, M. I. Jordan, Y. Weiss et al., “On spectral clustering: Analysis and an algorithm,” in NIPS, 2002.

[55] J. Shi and J. Malik, “Normalized cuts and image segmentation,” TPAMI, 2000.

[56] Y. C. Eldar, P. Kuppinger, and H. Bolcskei, “Block-sparse signals: Uncertainty relations and efficient recovery,” IEEE Transactions on Signal Processing, vol. 58, no. 6, pp. 3042–3054, 2010.

[57] A. Safonova, J. K. Hodgins, and N. S. Pollard, “Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces,” TOG, 2004.

[58] A. Y. Yang, S. Iyengar, S. Sastry, R. Bajcsy, P. Kuryloski, and R. Jafari, “Distributed segmentation and classification of human actions using a wearable motion sensor network,” in CVPRW, 2008.

[59] R. Li, T.-P. Tian, S. Sclaroff, and M.-H. Yang, “3d human motion tracking with a coordinated mixture of factor analyzers,” IJCV, vol. 87, no. 1-2, pp. 170–190, 2010.

[60] D. Liebowitz and S. Carlsson, “Uncalibrated motion capture exploiting articulated structure constraints,” IJCV, 2003.

[61] C. J. Taylor, “Reconstruction of articulated objects from point correspondences in a single uncalibrated image,” in CVPR, 2000.

[62] D. E. DiFranco, T.-J. Cham, and J. M. Rehg, “Reconstruction of 3d figure motion from 2d correspondences,” in CVPR, 2001.

[63] V. Parameswaran and R. Chellappa, “View independent human body pose estimation from a single perspective image,” in CVPR, 2004.

[64] X. K. Wei and J. Chai, “Modeling 3d human poses from uncalibrated monocular images,” in ICCV, 2009.

[65] J. Valmadre and S. Lucey, “Deterministic 3d human pose estimation using rigid structure,” in ECCV, 2010.

[66] M. Salzmann and R. Urtasun, “Implicitly constrained gaussian process regression for monocular non-rigid pose estimation,” in NIPS, 2010.

[67] Y. Tian, C. L. Zitnick, and S. G. Narasimhan, “Exploring the spatial hierarchy of mixture models for human pose estimation,” in ECCV, 2012.

[68] M. Sun and S. Savarese, “Articulated part-based model for joint object detection and pose estimation,” in ICCV, 2011.

[69] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Poselet conditioned pictorial structures,” in CVPR, 2013.

[70] M. Andriluka, S. Roth, and B. Schiele, “Pictorial structures revisited: People detection and articulated pose estimation,” in CVPR, 2009.

[71] L. Pishchulin, A. Jain, M. Andriluka, T. Thormahlen, and B. Schiele, “Articulated people detection and pose estimation: Reshaping the future,” in CVPR, 2012.

[72] S. Johnson and M. Everingham, “Clustered pose and nonlinear appearance models for human pose estimation,” in BMVC, 2010.

[73] B. Sapp, A. Toshev, and B. Taskar, “Cascaded models for articulated pose estimation,” in ECCV, 2010.

[74] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, “Human pose estimation using body parts dependent joint regressors,” in CVPR, 2013.

[75] V. K. Singh, R. Nevatia, and C. Huang, “Efficient inference with multiple heterogeneous part detectors for human pose estimation,” in ECCV, 2010.

[76] B. Sapp and B. Taskar, “Modec: Multimodal decomposable models for human pose estimation,” in CVPR, 2013.

[77] M. Eichner and V. Ferrari, “Appearance sharing for collective human pose estimation,” in ACCV, 2012.

[78] M. Eichner and V. Ferrari, “Better appearance models for pictorial structures,” in BMVC, 2009.

[79] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in CVPR, 2011.

[80] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in NIPS, 1990.

[81] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in ICCV, 2009.

[82] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in CVPR, 2004.

[83] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in ICML, 2009.

[84] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in NIPS, 2013.

[85] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” IJCV, 2013.

[86] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” in ECCV, 2014.

[87] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler, “Learning human pose estimation features with convolutional networks,” in ICLR, 2014.

[88] J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in NIPS, 2014.

[89] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in CVPR, 2009.

[90] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in CVPR, 2009.

[91] A. Farhadi, I. Endres, and D. Hoiem, “Attribute-centric recognition for cross-category generalization,” in CVPR, 2010.

[92] I. Endres, A. Farhadi, D. Hoiem, D. Forsyth et al., “The benefits and challenges of collecting richer object annotations,” in CVPRW, 2010.

[93] T. L. Berg, A. C. Berg, and J. Shih, “Automatic attribute discovery and characterization from noisy web data,” in ECCV, 2010.

[94] O. Russakovsky and L. Fei-Fei, “Attribute learning in large-scale datasets,” in Trends and Topics in Computer Vision, 2012.

[95] Y. Su, M. Allan, and F. Jurie, “Improving object classification using semantic attributes,” in BMVC, 2010.

[96] G. Patterson and J. Hays, “Sun attribute database: Discovering, annotating, and recognizing scene attributes,” in CVPR, 2012.

[97] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute and simile classifiers for face verification,” in ICCV, 2009.

[98] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” Tech. Rep., 2011.

[99] V. Escorcia, J. C. Niebles, and B. Ghanem, “On the relationship between visual attributes and convolutional networks,” in CVPR, 2015.

[100] S. Shankar, V. K. Garg, and R. Cipolla, “Deep-carving: Discovering visual attributes by carving deep neural nets,” in CVPR, 2015.

[101] M. Simon, E. Rodner, and J. Denzler, “Part detector discovery in deep convolutional neural networks,” in ACCV, 2014.

[102] A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani, “Self-taught object localization with deep networks,” arXiv preprint arXiv:1409.3964, 2014.

[103] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? – weakly-supervised learning with convolutional neural networks,” in CVPR, 2015.

[104] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” in ICLR Workshop, 2014.

[105] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014.

[106] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in CVPR, 2010.

[107] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” TPAMI, 2013.

[108] R. Vidal, “Subspace clustering,” Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011.

[109] M. Soltanolkotabi, E. Elhamifar, and E. Candes, “Robust subspace clustering,” arXiv preprint arXiv:1301.2603, 2013.

[110] M. Eichner and V. Ferrari, “Appearance sharing for collective human pose estimation,” in ACCV, 2012.

[111] S. Johnson and M. Everingham, “Learning effective human pose estimation from inaccurate annotation,” in CVPR, 2011.

[112] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari, “2d articulated human pose estimation and retrieval in (almost) unconstrained still images,” IJCV, 2012.

[113] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.

[114] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Strong appearance and expressive spatial models for human pose estimation,” in ICCV, 2013.

[115] W. Ouyang, X. Chu, and X. Wang, “Multi-source deep learning for human pose estimation,” in CVPR, 2014.

[116] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014.

[117] X. Fan, K. Zheng, Y. Lin, and S. Wang, “Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation,” to appear in CVPR, 2015.

[118] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, “Understanding neural networks through deep visualization,” arXiv preprint arXiv:1506.06579, 2015.

[119] L. Bourdev, S. Maji, and J. Malik, “Describing people: A poselet-based approach to attribute classification,” in ICCV, 2011.

[120] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Cognitive modeling, 1988.

[121] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in ICCV, 2015.

[122] L. Sigal and M. J. Black, “Humaneva: Synchronized video and motion capture dataset for evaluation of articulated human motion,” Brown University TR, 2006.