Evaluating and Extending Trajectory Features for Activity Recognition
4.3 Extending Augmentations
4.3.5 Combining Selection and Augmentation
The relative locations of features to body parts can be used for effective feature selection and augmentation of Messing et al.’s trajectory features. Here, we consider the combination of feature selection and augmentation. For augmentations, we only consider Messing et al.’s original discretized two-dimensional relative position at the start and end of trajectories, as the other representations we have explored are not as effective. For feature selection, we eliminate features whose mean distance from the parts being considered exceeds 200 pixels.
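The selection and augmentation steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the array layouts, the function names, and the grid parameters `bin_size` and `n_bins` are assumptions introduced here, since the text does not give the exact discretization. Only the 200-pixel mean-distance threshold comes from the text.

```python
import numpy as np


def select_by_mean_distance(trajs, part_pos, max_mean_dist=200.0):
    """Return a boolean mask of trajectories to keep.

    trajs:    (T, L, 2) array, pixel positions of T trajectories over L frames
              (assumed layout).
    part_pos: (L, 2) array, pixel position of one body part in each frame
              (assumed layout).
    A trajectory is kept if its mean distance to the part does not exceed
    max_mean_dist (200 pixels in the text).
    """
    dists = np.linalg.norm(trajs - part_pos[None, :, :], axis=2)  # (T, L)
    return dists.mean(axis=1) <= max_mean_dist


def discretized_relative_position(trajs, part_pos, bin_size=50.0, n_bins=8):
    """Discretize the 2D offset from the part at trajectory start and end.

    Returns a (T, 2) integer array: column 0 is the grid cell at the first
    frame, column 1 the grid cell at the last frame. The n_bins x n_bins grid
    centred on the part and the bin_size in pixels are hypothetical choices.
    """
    half = n_bins // 2
    cells = []
    for frame in (0, -1):
        offset = trajs[:, frame, :] - part_pos[frame]          # (T, 2)
        idx = np.floor(offset / bin_size).astype(int) + half   # shift to >= 0
        idx = np.clip(idx, 0, n_bins - 1)                      # clamp far offsets
        cells.append(idx[:, 1] * n_bins + idx[:, 0])           # row-major cell id
    return np.stack(cells, axis=1)
```

In such a sketch, selection would be applied first and the discretized start and end cells would then be attached to each surviving trajectory as extra observed variables. Handling several parts at once (for example both hands) would need an additional rule, such as taking the nearest part, which is not specified here.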
Table 4.7 shows that once features are augmented, feature selection reduces performance by a small amount, in exchange for reduced runtime. We also see that of the three relative position augmentations we’ve added to Messing et al.’s model, the position of trajectories relative to the positions of the hands is the most informative. This is not very surprising, as the dataset is characterized by broad and fine hand motions. If body parts were efficiently and automatically detected, it is likely that good, fast, translation invariant performance could be achieved on tasks like the activities of daily living in Messing et al.’s URADL dataset [17].
+ Abs. Pos
Traj. + Body                   76.6 %   82.6 %   77.3 %
Traj. + Head                   79.3 %   75.3 %   75.3 %
Traj. + Hands                  88.6 %   85.3 %   87.3 %
Traj. + Body + Head            82.6 %   78.6 %   78.6 %
Traj. + Body + Hands           86 %     86.6 %   86.6 %
Traj. + Head + Hands           90.6 %   87.3 %   87.3 %
Traj. + Body + Head + Hands    88 %     84 %     88 %
4.4 Conclusion
Trajectory descriptors are an increasingly important trend as the field of activity recognition in video grows beyond techniques developed for object recognition in still images. As new trajectory-based methods are introduced, it will become increasingly important to carefully consider all parts of the models that use them.
Additionally, as trajectory information becomes more standard, it will be more and more important to be able to seamlessly integrate trajectory information with other information, like appearance, or higher level knowledge of scene structure or pose.
In comparing Messing et al.’s sparse generative extended trajectory features with Wang et al.’s discriminative dense trajectories, we have provided a template for how to examine new methods, and choose model components. In addition, by illustrating how to add new and more refined augmentations to Messing et al.’s trajectory model, we have illustrated how to explore which new kinds of information are most beneficial to recognition models, and how to add them. It is our hope that these experiments will inform future work developing trajectory-based activity recognition.
Acknowledgements RM thanks the University of Rochester for support and guidance during his graduate education. AT thanks FQRNT for their financial support. CP thanks Ubisoft, NSERC, Google and the University of Rochester for their financial support.
References
1. Black MJ, Yacoob Y (1995) Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In: ICCV, pp 374–381
2. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
3. Brox T, Malik J (2011) Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans Pattern Anal Mach Intell 33(3):500–513
4. Cutting JE (1981) Six tenets for event perception. Cognition 10:71–78
5. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings of the 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol 1. IEEE Computer Society, Washington, pp 886–893
6. Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: VS-PETS, October 2005
7. Jebara T (2003) Images as bags of pixels. In: ICCV
8. Joachims T (1999) Making large-scale SVM learning practical. Advances in kernel methods: support vector learning. MIT Press, Cambridge
9. Johansson G (1973) Visual perception of biological motion and a model for its analysis. Percept Psychophys 14:201–211
10. Ke Y, Sukthankar R (2004) PCA-SIFT: a more distinctive representation for local image descriptors. In: CVPR, pp II-506–II-513
11. Laptev I, Lindeberg T (2004) Local descriptors for spatio-temporal recognition. In: Int workshop on spatial coherence for visual motion analysis
12. Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: CVPR, Anchorage, Alaska, June 2008, pp 1–8
13. Le QV, Zou WY, Yeung SY, Ng AY (2011) Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR
14. Lewis D (1998) Naive (Bayes) at forty: the independence assumption in information retrieval. In: ECML
15. Lucas BD, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: IJCAI, pp 674–679
16. Marszałek M, Laptev I, Schmid C (2009) Actions in context. In: CVPR
17. Messing R, Pal C, Kautz H (2009) Activity recognition using the velocity histories of tracked keypoints. In: ICCV
18. Niebles J-C, Wang H, Fei-Fei L (2006) Unsupervised learning of human action categories using spatial-temporal words. In: Proc BMVC
19. Ramanan D, Forsyth DA (2003) Automatic annotation of everyday movements. In: NIPS, December 2003
20. Shi J, Tomasi C (1994) Good features to track. In: CVPR, pp 593–600
21. Sidenbladh H, Black M, Fleet DJ (2000) Stochastic tracking of 3d human figures using 2d image motion. In: ECCV
22. Ullah MM, Parizi SN, Laptev I (2010) Improving bag-of-features action recognition with non-local cues. In: Proc BMVC, pp 95.1–95.11. doi:10.5244/C.24.95
23. Wang H, Ullah MM, Klaser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: BMVC
24. Wang H, Klaser A, Schmid C, Liu C-L (2011) Action recognition by dense trajectories. In: CVPR
Abstract In this chapter, we address the problem of detecting, matching, and segmenting all identical object-level patterns from images or videos in an unsupervised way, called the “co-recognition” problem. In an unsupervised setting without any prior knowledge of specific target objects, it relies entirely on geometric and photometric relations of visual features. To solve this problem, a multi-layer match-growing framework is proposed which explores given visual data by intra-layer expansion and inter-layer merge. We demonstrate the effectiveness of this approach on identical object detection, image retrieval, symmetry detection, and action recognition. These applications will validate the usefulness of co-recognition to several vision problems.
5.1 Introduction
In detection and recognition of visual objects in images or actions in videos, most computer vision approaches require some level of supervision to specify or learn a model [5, 12, 14, 24, 34, 45, 50, 52]. In such cases, labeled images with bounding boxes or uncluttered video clips are usually adopted for the purpose. Recent categorization methods based on latent topic models have proposed weakly supervised or unsupervised learning approaches using a decent amount of training images [23, 53, 55]. In real-world images or videos, however, multiple objects of similar appearance often show up even in a single view. For example, an image can contain replicas or multiple shots of identical objects, and a video clip often includes the same actions in different spatio-temporal regions. Humans have no difficulty in finding those visual objects without any prior information, whereas it is still a challenging problem in computer vision.
M. Cho (✉), INRIA/École Normale Supérieure, Paris, France, e-mail: [email protected]
Y.M. Shin · K.M. Lee, Seoul National University, Seoul, Korea
Y.M. Shin, e-mail: [email protected]
K.M. Lee, e-mail: [email protected]
G.M. Farinella et al. (eds.), Advanced Topics in Computer Vision, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5520-1_5, © Springer-Verlag London 2013
Fig. 5.1 Co-recognition results on images and videos. Our method detects and segments all identical objects in images or actions in video streams without supervision, and can be applied to various applications. (a) Two sets of identical objects are detected, matched, and segmented from a single image. (b) Many-to-many object matching is obtained across two images. (c) Two kinds of actions are matched and segmented from video clips. (d) Symmetry detection is done by co-recognition on symmetric feature pairs. (e) Matching and segmentation of different views enables object-based 3D reconstruction even from a single image
In this chapter, we pose the problem of detecting, matching, and segmenting all the identical object-level patterns by considering geometric and photometric relations of visual features. It supposes an unsupervised setting without any prior knowledge of specific target objects. Following the naming conventions of related work [49, 56], we term it co-recognition, which emphasizes explicit geometric matching unlike other methods. We extend it to the video domain and explore further applications. As shown in Fig. 5.1a, our approach can detect, match, and segment
or video streams. The proposed approach goes beyond conventional assumptions such as one-to-one object matching constraints or model-test settings, and addresses object matching in general. The resultant object correspondence networks can provide useful information for structural pattern analysis and scene understanding in images and videos. As shown in Fig. 5.1, this chapter presents several of its applications: unsupervised identical object detection, image retrieval, symmetry detection, action recognition, and object-based reconstruction.
The remainder of this chapter is organized as follows. In Sect. 5.2, we describe our approach and its formulation, and propose the algorithm for co-recognition. In Sects. 5.3, 5.4, and 5.5, the proposed approach is applied to identical object detection, symmetry detection, and action recognition. Experimental evaluations and comparisons are carried out in each section. Finally, Sect. 5.6 presents conclusions and future work, with some discussion. The research in this chapter has been extended from our previous papers related to this topic [6, 8–10, 51].