2.2 Human Actions Datasets
2.2.1 Low-level Representation Benchmarks
A low-level representation describes the visual appearance of a human action as an ordered set of pixels whose intensity values are stored in one or more channels. These channels may encode the colored appearance of the action (i.e. RGB), a depth field in which each pixel's intensity corresponds to the distance between the sensor and the scene point projected onto that pixel, or a silhouette in which each pixel indicates whether or not it belongs to a human body.
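As a minimal sketch of these three channel types, the following snippet stores a short clip as NumPy arrays. The clip size, the depth values and their unit (metres), the rectangular toy "person", and the helper `foreground_fraction` are all illustrative assumptions; none of them is taken from the datasets described in this section.

```python
import numpy as np

# Hypothetical clip size: 4 frames of 6 x 8 pixels.
T, H, W = 4, 6, 8

# Appearance (RGB): three 8-bit color channels per pixel.
rgb = np.zeros((T, H, W, 3), dtype=np.uint8)

# Depth: one channel per pixel holding the distance to the sensor
# (metres are assumed here purely for illustration).
depth = np.full((T, H, W), 3.0, dtype=np.float32)

# Silhouette: one boolean per pixel, True where the pixel belongs to the body.
silhouette = np.zeros((T, H, W), dtype=bool)
silhouette[:, 1:5, 3:6] = True  # toy rectangular "person"

# In this toy scene the person stands closer to the sensor than the background.
depth[silhouette] = 2.0


def foreground_fraction(mask: np.ndarray) -> np.ndarray:
    """Return the fraction of body pixels in each frame of a (T, H, W) mask."""
    return mask.reshape(mask.shape[0], -1).mean(axis=1)
```

Calling `foreground_fraction(silhouette)` gives one value per frame, which is one simple way to check whether a silhouette mask is plausibly non-empty before it is used for recognition.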
Figure 2.4: Example frames from the MuHAVi human action dataset with their silhouettes (first row) and from the HMDB dataset (second row)
Multi-view Human Action Video (MuHAVi) Dataset
The MuHAVi dataset is a video dataset that provides two representations: human silhouettes and RGB data. Silhouette-based representations have been widely used for action recognition in constrained environments. They became popular because they suit particular applications (e.g., surveillance) in which reliable human silhouette extraction is possible. The dataset was presented by Singh, Velastin and Ragheb (2010) and considers human actions in a constrained environment. It provides multi-view data of actions performed by different actors, recorded from CCTV-like views (i.e. at an angle and at some distance from the observed person). The data consist of 136 samples of 14 primitive actions, performed by two actors and observed from two different views. The actions in the dataset can be reorganized into eight classes, where similar actions constitute a single class. Figure 2.4 shows example frames from this dataset for different human actions in the RGB and silhouette representations.
Web-actions Dataset
The exponential growth in unconstrained human action videos exposed the limitations of silhouette-based representations. These representations are usually impractical for images and unconstrained action videos because reliable silhouette extraction methods are not available for them. As such, several datasets were proposed for recognizing human actions based only on their visual appearance (i.e. RGB data). The Web-actions dataset is one of these datasets and targets action recognition from images gathered from the web. It was presented by Ikizler, Cinbis and Sclaroff (2009) and contains images downloaded from the Internet using the names of human actions as keywords. The human body is then extracted using a state-of-the-art human detector, and the extracted bounding boxes are post-processed to align them with respect to the head position. The resulting dataset consists of five different actions, “dancing”, “playing golf”, “sitting”, “running”, and “walking”, and contains a total of 2,458 images. Examples from this dataset are shown in Figure 2.5. The human body parts are generally visible in the pictures of this dataset. The dataset is nevertheless challenging because the body appearance shows wide pose variations, especially for the “dancing” and “playing golf” actions. Example approaches and their reported results on this dataset are listed in Table 2.2.

Table 2.1: Human action recognition benchmarks and their key characteristics: number of actions, number of samples, type of samples (I: Images, SV: Segmented videos, and UV: Unsegmented videos), the available action modalities (A: Appearance, D: Depth, P: Pose, and S: Silhouettes), and year of release

| Dataset | Actions | Samples | Type | Modality | Year |
|---|---|---|---|---|---|
| MuHAVi (Singh, Velastin and Ragheb, 2010) | 14 | 136 | SV | S | 2010 |
| Web-actions (Ikizler, Cinbis and Sclaroff, 2009) | 5 | 2458 | I | A | 2009 |
| Willow (Delaitre, Laptev and Sivic, 2010) | 7 | 911 | I | A | 2010 |
| HMDB51 (Kuehne et al., 2011) | 51 | 6766 | SV | A | 2011 |
| MSR-Action3D | 20 | 576 | SV | A+D+P | 2011 |
| MSR-DailyActivity | 16 | 320 | SV | A+D+P | 2011 |
| 3D-Action-Pairs (Oreifej and Liu, 2013) | 12 | 352 | SV | A+D+P | 2013 |
| TUM (Tenorth, Bandouch and Beetz, 2009) | 10 | 20 | UV | A+P | 2009 |
| ChaLearnGestures | 20 | 630 | UV | A+D+P+S | 2014 |

Table 2.2: Example algorithms with their performance on the Web-actions dataset

| Method | Accuracy (%) | Year |
|---|---|---|
| (Ikizler, Cinbis and Sclaroff, 2009) | 56.54 | 2009 |
| (Yang, Wang and Mori, 2010) | 61.07 | 2010 |
| (Eweiwi, Cheema and Bauckhage, 2013) | 64.05 | 2013 |
Willow Dataset
Advances in social media have increased not only the number of personal pictures shared on the web but also the diversity in viewpoint and quality of human visual appearance. The Willow dataset, proposed by Delaitre, Laptev and Sivic (2010), addresses these challenges by introducing a human action dataset of consumer-like photos that cover a wide range of variations in view, scene, scale, and quality of the visual appearance of people. This dataset consists of 911 images distributed over seven different actions: “interacting with computer”, “taking photo”, “playing music”, “riding bike”, “riding horse”, “walking”, and “running”. Some images were taken from the Pascal 2007 VOC Challenge, and the rest were collected from Flickr by querying keywords such as “running people” or “playing piano”. Images that do not clearly depict the action of interest were manually removed. A common observation across the results obtained on the Web-actions (see Table 2.2) and Willow (see Table 2.3) datasets is the relatively low performance of the proposed approaches compared with other datasets that additionally provide motion-, depth-, or pose-based representations. This points to the greater challenge of recognizing human actions from appearance in still images compared with RGB videos or other action modalities.

Figure 2.5: Examples of different human action images taken from the Web-actions (first row) and the Willow (second row) datasets

Table 2.3: Example algorithms with their performance on the Willow dataset

| Method | mAP (%) | Year |
|---|---|---|
| (Delaitre, Laptev and Sivic, 2010) | 62.14 | 2010 |
| (Eweiwi, Cheema and Bauckhage, 2013) | 61.57 | 2013 |
| (Sharma, Jurie and Schmid, 2012) | 65.9 | 2012 |
| (Delaitre, Sivic and Laptev, 2011) | 64.1 | 2011 |
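Note that Table 2.2 reports classification accuracy whereas Table 2.3 reports mean average precision (mAP), so the two sets of numbers are not directly comparable. The sketch below shows how the two metrics are commonly computed. It uses the non-interpolated definition of average precision; the cited works may use a different variant (for example the PASCAL VOC interpolated protocol), and the function names and the toy labels in the usage note are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def average_precision(relevant: Sequence[bool], scores: Sequence[float]) -> float:
    """Non-interpolated AP for one class.

    Samples are ranked by decreasing score; AP is the mean of the precision
    values measured at the rank of each relevant (positive) sample.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    y_true: Sequence[str], class_scores: Dict[str, Sequence[float]]
) -> float:
    """Mean of the per-class AP values (one score list per class)."""
    aps = [
        average_precision([t == cls for t in y_true], scores)
        for cls, scores in class_scores.items()
    ]
    return sum(aps) / len(aps)
```

For instance, with the toy labels `["walking", "running", "walking"]` and one score list per class, `mean_average_precision` averages the per-class ranking quality, while `accuracy` only counts how often the top-scoring class is correct.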
Table 2.4: Example algorithms with their performance on the HMDB51 dataset

| Method | Accuracy (%) | Year |
|---|---|---|
| Dense Trajectory (Wang et al., 2013b) | 46.6 | 2013 |
| ActionBank (Sadanand and Corso, 2012) | 26.6 | 2012 |
| MIP (Gross et al., 2012) | 29.2 | 2012 |
| C2 (Kuehne et al., 2011) | 23.0 | 2011 |
| HOG/HOF (Kuehne et al., 2011) | 20.0 | 2011 |
Large Human Motion Database (HMDB51)
As billions of videos are shared and viewed on the Internet every day, organizing this rapidly growing volume of media has become a new frontier in computer vision. In contrast to earlier benchmarks for action recognition, HMDB51 addresses this large-scale evolution in media and is considered one of the largest and most challenging benchmarks for action recognition. It comprises 51 distinct action categories, each containing at least 101 samples, for a total of 6,766 action samples. Each sample clip is validated by at least two human observers and carries additional meta information (i.e. viewpoint, an indicator of camera motion, quality, and the number of actors involved) to allow more flexible evaluation experiments. Several algorithms have been evaluated on this dataset; Table 2.4 lists their reported performance. Notably, the best reported accuracy is only 46.6%, achieved by Wang et al. (2013b) using improved dense trajectory features.