
2.2 Human Actions Datasets

2.2.1 Low-level Representation Benchmarks

A low-level representation depicts the visual appearance of a human action as an ordered set of pixels with intensity values in one or more channels. These channels may encode color (i.e. RGB), the depth field, where each pixel's intensity corresponds to the distance between the sensor and the real-world point projected onto that pixel, or a silhouette, where each pixel indicates whether or not it belongs to a human body.
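
The three channel types described above can be pictured as arrays. The following is only a minimal sketch using NumPy: the frame size, the millimetre depth unit, and the threshold-based `silhouette_from_depth` helper are illustrative assumptions, not the way any of the datasets in this section produce their data.

```python
import numpy as np

# Hypothetical frame size; real datasets use their own resolutions.
HEIGHT, WIDTH = 4, 6

# Appearance (RGB): three 8-bit color channels per pixel.
rgb = np.zeros((HEIGHT, WIDTH, 3), dtype=np.uint8)

# Depth: one channel, each value a distance from the sensor
# (millimetres here, which is an assumption for this sketch).
depth = np.full((HEIGHT, WIDTH), 3000, dtype=np.uint16)
depth[1:3, 2:4] = 1500  # a closer object standing in front of the background


def silhouette_from_depth(depth_map, max_distance):
    """Return a boolean mask that is True where a pixel is closer than max_distance.

    Thresholding depth is only one simple, illustrative way to obtain a
    silhouette; it is not claimed to be the method used by any dataset here.
    """
    return depth_map < max_distance


# Silhouette: one binary value per pixel (body / not body).
silhouette = silhouette_from_depth(depth, max_distance=2000)
```

All three arrays share the same pixel grid; only the number of channels and the meaning of each value differ.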


Figure 2.4: Example frames from the MuHAVi human action dataset with their silhouettes (first row) and from the HMDB dataset (second row)

Multi-view Human Action Video (MuHAVi) Dataset

The MuHAVi dataset is a video dataset that provides two different representations: human silhouettes and RGB data. Silhouette-based representations have been widely used for action recognition in constrained environments. They became popular because they suit particular applications (e.g., surveillance) where reliable human silhouette extraction is possible. The dataset was presented by Singh, Velastin and Ragheb (2010) and considers human actions in a constrained environment. It provides multi-view data of actions performed by different actors, captured from CCTV-like views (i.e. at an angle and some distance from the observed person). The data consists of 136 samples of 14 primitive actions, performed by two actors and observed from two different views. The actions in the dataset can be reorganized into eight classes, where similar actions constitute a single class. Figure 2.4 shows example frames from this dataset for different human actions in the RGB and silhouette representations.

Web-actions Dataset

The exponential growth in unconstrained human action videos exposed the limitations of silhouette-based representations. These representations are usually impractical for images and unconstrained action videos because reliable silhouette extraction methods are lacking. As such, several datasets were proposed for recognizing human actions based only on their visual appearance (i.e. RGB data). The Web-actions dataset is one of these; it targets action recognition in images gathered from the web. It was presented by Ikizler, Cinbis and Sclaroff (2009) and contains images downloaded from the Internet using human action keywords. The human body is then extracted using a state-of-the-art human detector, and the extracted bounding boxes are post-processed to align them with respect to head position. The resulting dataset consists of five actions: “dancing”, “playing golf”, “sitting”, “running”, and “walking”, and contains a total of 2,458 images. Examples from this dataset are shown in Figure 2.5. Pictures in this dataset are characterized by clearly visible human body parts. However, the dataset is challenging because body appearance shows wide pose variations, especially for the “dancing” and “playing golf” actions. Example approaches and their results on this dataset are reported in Table 2.2.

Table 2.1: Human action recognition benchmarks and their key characteristics: number of actions, number of samples, type of samples (I: Images, SV: Segmented videos, UV: Unsegmented videos), available action modalities (A: Appearance, D: Depth, P: Pose, S: Silhouettes), and year of release

| Dataset | Actions | Samples | Type | Modality | Year |
|---|---|---|---|---|---|
| MuHAVi (Singh, Velastin and Ragheb, 2010) | 14 | 136 | SV | S | 2010 |
| Web-actions (Ikizler, Cinbis and Sclaroff, 2009) | 5 | 2458 | I | A | 2009 |
| Willow (Delaitre, Laptev and Sivic, 2010) | 7 | 911 | I | A | 2010 |
| HMDB51 (Kuehne et al., 2011) | 51 | 6766 | SV | A | 2011 |
| MSR-Action3D | 20 | 576 | SV | A+D+P | 2011 |
| MSR-DailyActivity | 16 | 320 | SV | A+D+P | 2011 |
| 3D-Action-Pairs (Oreifej and Liu, 2013) | 12 | 352 | SV | A+D+P | 2013 |
| TUM (Tenorth, Bandouch and Beetz, 2009) | 10 | 20 | UV | A+P | 2009 |
| ChaLearnGestures | 20 | 630 | UV | A+D+P+S | 2014 |

Table 2.2: Example algorithms with their performance on the Web-actions dataset

| Method | Accuracy (%) | Year |
|---|---|---|
| (Ikizler, Cinbis and Sclaroff, 2009) | 56.54 | 2009 |
| (Yang, Wang and Mori, 2010) | 61.07 | 2010 |
| (Eweiwi, Cheema and Bauckhage, 2013) | 64.05 | 2013 |

Willow Dataset

Advances in social media have not only increased the number of personal pictures shared on the web but have also diversified the views and quality of human visual appearance. The Willow dataset proposed by Delaitre, Laptev and Sivic (2010) addresses these challenges by introducing a human action dataset of consumer-like photos that cover a wide range of variations in view, scene, scale, and quality of people's visual appearance. This dataset consists of 911 images distributed over seven actions: “interacting with computer”, “taking photo”, “playing music”, “riding bike”, “riding horse”, “walking”, and “running”. Some images were taken from the PASCAL VOC 2007 Challenge and the rest were collected from Flickr by querying keywords such as “running people” or “playing piano”. Images that do not clearly depict the action of interest were manually removed. A common observation across the results on the Web-actions (see Table 2.2) and Willow (see Table 2.3) datasets is the relatively low performance of the proposed approaches compared to results on other datasets that additionally provide motion-, depth-, or pose-based representations. This points to the greater difficulty of recognizing human actions from appearance in still images than from RGB videos or other action modalities.

Figure 2.5: Examples of human action images taken from the Web-actions (first row) and the Willow (second row) datasets

Table 2.3: Example algorithms with their performance on the Willow dataset

| Method | mAP (%) | Year |
|---|---|---|
| (Delaitre, Laptev and Sivic, 2010) | 62.14 | 2010 |
| (Eweiwi, Cheema and Bauckhage, 2013) | 61.57 | 2013 |
| (Sharma, Jurie and Schmid, 2012) | 65.9 | 2012 |
| (Delaitre, Sivic and Laptev, 2011) | 64.1 | 2011 |
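
Results on the Willow dataset are reported as mean average precision (mAP) rather than accuracy. As a rough illustration of the metric, the sketch below computes a non-interpolated average precision per class and averages it over classes. The function names and toy scores are made up for this example, and the cited papers may follow a different protocol (for instance an interpolated variant), so this is not a reproduction of their evaluation code.

```python
def average_precision(scores, labels):
    """Non-interpolated average precision for a single action class.

    scores: classifier confidence for each image.
    labels: 1 if the image shows the class, 0 otherwise.
    """
    # Rank images from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """per_class maps a class name to a (scores, labels) pair."""
    aps = [average_precision(s, l) for s, l in per_class.values()]
    return sum(aps) / len(aps)


# Toy data, invented purely for illustration.
toy = {
    "running": ([0.9, 0.8, 0.3], [1, 0, 1]),
    "walking": ([0.2, 0.7, 0.6], [0, 1, 1]),
}
toy_map = mean_average_precision(toy)
```

Because average precision depends on the whole ranking of images for each class, mAP values are not directly comparable with the accuracy figures in Table 2.2.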

Table 2.4: Example algorithms with their performance on the HMDB51 dataset

| Method | Accuracy (%) | Year |
|---|---|---|
| Dense Trajectory (Wang et al., 2013b) | 46.6 | 2013 |
| ActionBank (Sadanand and Corso, 2012) | 26.6 | 2012 |
| MIP (Gross et al., 2012) | 29.2 | 2012 |
| C2 (Kuehne et al., 2011) | 23.0 | 2011 |
| HOG/HOF (Kuehne et al., 2011) | 20.0 | 2011 |

Large Human Motion Database (HMDB51)

As billions of videos are shared and viewed on the Internet every day, new challenges have emerged in computer vision for organizing this enormous growth in media. In contrast to earlier benchmarks for action recognition, HMDB51 addresses this large-scale growth and is considered one of the largest and most challenging benchmarks for action recognition. It comprises 51 distinct action categories, each containing at least 101 samples, for a total of 6,766 action samples. Each sample clip is validated by at least two human observers and carries additional meta information (i.e. viewpoint, an indicator of camera motion, quality, and the number of actors involved) that allows more flexible evaluation experiments. Several algorithms have been evaluated on this dataset; Table 2.4 shows the state-of-the-art performance achieved on it. Notably, the dataset remains difficult: the best reported accuracy is only 46.6%, obtained by Wang et al. (2013b) in 2013 using improved dense trajectory features.
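
The per-clip meta information makes it possible to evaluate on subsets of the data, for example only clips without camera motion. The sketch below shows one way such a selection could look. The `ClipMeta` record, its field names and value vocabularies, and the `select_clips` helper are hypothetical and do not reflect the dataset's actual annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipMeta:
    """Hypothetical per-clip record; field names and values are illustrative."""
    action: str
    view_point: str       # e.g. "front", "back", "left", "right"
    camera_motion: bool   # True if the camera moves during the clip
    quality: str          # "low", "medium" or "high"
    num_actors: int


QUALITY_ORDER = {"low": 0, "medium": 1, "high": 2}


def select_clips(clips, *, camera_motion=None, min_quality=None, max_actors=None):
    """Keep the clips that satisfy every filter that is not None."""
    selected = []
    for clip in clips:
        if camera_motion is not None and clip.camera_motion != camera_motion:
            continue
        if min_quality is not None and QUALITY_ORDER[clip.quality] < QUALITY_ORDER[min_quality]:
            continue
        if max_actors is not None and clip.num_actors > max_actors:
            continue
        selected.append(clip)
    return selected


# Toy records, invented for illustration only.
clips = [
    ClipMeta("run", "front", False, "high", 1),
    ClipMeta("run", "left", True, "low", 2),
    ClipMeta("wave", "back", False, "medium", 1),
]
static_clips = select_clips(clips, camera_motion=False, min_quality="medium")
```

Reporting results separately on such subsets helps show whether a method's errors concentrate in clips with camera motion, low quality, or several actors.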

In document Human Motion Analysis for Efficient Action Recognition (Page 36-40)