Dynamic Depth Maps with ConvNets - Action recognition from RGB-D data

4.2.1 Prior Works and Our Contributions

In our previous work, we applied ConvNets to depth action recognition based on variants of DMM [YZT12], which is sensitive to noise and does not work well with cluttered backgrounds. Wu et al. [WPK+16a] adopted a 3D ConvNet to extract features from depth data, which requires a large amount of training data to achieve the best performance. Compared to traditional RGB images, depth maps offer better geometric cues and are less sensitive to illumination changes for action recognition. In order to make full use of these properties and take advantage of ConvNets, we propose three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI), for both isolated and continuous action recognition. These dynamic images are constructed from a segmented sequence of depth maps using hierarchical bidirectional rank pooling to effectively capture the spatial-temporal information. Specifically, DDI exploits the dynamics of postures over time, while DDNI and DDMNI exploit the 3D structural information captured by depth maps. Upon the proposed representations, a ConvNet-based method is developed for action recognition. The image-based representations enable us to fine-tune existing Convolutional Neural Network (ConvNet) models trained on image data without training a large number of parameters from scratch. The proposed method was evaluated on three large datasets, namely, the Large-scale Continuous Gesture Recognition Dataset, the Large-scale Isolated Gesture Recognition Dataset, and the NTU RGB+D Dataset. State-of-the-art results were achieved on all datasets even though only the depth data was used.

CHAPTER 4. DEPTH-BASED ACTION RECOGNITION

[Figure 4.13 diagram: a depth sequence undergoes video segmentation and is then processed along three paths, producing depth images, normal images (histogram-based foreground extraction, foreground images, normal vector extraction) and motion normal images (GMM background modeling, foreground motion images, normal vector extraction). Each path passes through hierarchical bidirectional rank pooling to give forward and backward DDIs, DDNIs and DDMNIs. Each of the six dynamic images feeds one ConvNet, and the six score vectors are fused into the final score vector. DDI: Dynamic Depth Image; DDNI: Dynamic Depth Normal Image; DDMNI: Dynamic Depth Motion Normal Image; ConvNets: Convolutional Neural Networks.]

Figure 4.13: The framework of the proposed method.

4.2.2 The Proposed Methods

The proposed method consists of four stages: action segmentation, construction of the three sets of dynamic images, ConvNets training and score fusion for classification. The framework is illustrated in Fig. 4.13. Given a sequence of depth maps consisting of multiple actions, the start and end frames of each action are identified based on quantity of movement (QOM) [JZW+15]. Then, three sets of dynamic images are constructed for each action segment and used as the input to six ConvNets for product score fusion-based classification. Details are presented in the rest of this section.

4.2.2.1 Action Segmentation

Previous works on action recognition mainly focus on the classification of segmented actions. In the case of continuous recognition, both segmentation and recognition have to be solved. This chapter tackles the segmentation and classification of actions separately and sequentially.

Given a sequence of depth maps that contains multiple actions, each frame exhibits relative movement with respect to its adjacent frame and to the first frame. The start and end frames of each action are detected based on quantity of movement (QOM) [JZW+15] by assuming that all actions start from a similar pose.

[Figure 4.14 plot: global QOM (vertical axis, 0 to 1) over the frames of a sequence (horizontal axis, 0 to 120), with the threshold_inter level and the detected segmentation boundaries marked.]

Figure 4.14: An example illustrating the inter-action segmentation results. Figure from [JZW+15].

For a depth sequence $I$, the QOM of frame $t$ is defined as the two-dimensional vector

$$\mathrm{QOM}(I, t) = [\mathrm{QOM}_{Local}(I, t), \mathrm{QOM}_{Global}(I, t)], \tag{4.8}$$

where $\mathrm{QOM}_{Local}(I, t)$ and $\mathrm{QOM}_{Global}(I, t)$ measure the relative movement of frame $t$ with respect to its adjacent frame and the first frame, respectively. They are defined as

$$\begin{aligned} \mathrm{QOM}_{Local}(I, t) &= \sum_{m,n} \psi(I_t(m, n), I_{t-1}(m, n)), \\ \mathrm{QOM}_{Global}(I, t) &= \sum_{m,n} \psi(I_t(m, n), I_1(m, n)), \end{aligned} \tag{4.9}$$

where (m, n) is the pixel location and the indicator function ψ(x, y) is defined as

$$\psi(x, y) = \begin{cases} 1 & \text{if } |x - y| > \mathrm{Threshold}_{QOM}; \\ 0 & \text{otherwise.} \end{cases}$$

$\mathrm{Threshold}_{QOM}$ is a predefined threshold, which is set to 60 empirically in this chapter. A set of frame indices of candidate delimiting frames is initialized by choosing frames whose global QOMs are lower than a threshold, $threshold_{inter}$. The $threshold_{inter}$ is calculated by adding the mean to twice the standard deviation of the global QOMs extracted from the first and last 12.5% of the average action sequence length $L$, which is calculated from the training actions. A sliding window with a size of $L/2$ is then used to refine the candidate set, and in each windowing session only the index of the frame with the minimum global QOM is retained. After the refinement, the remaining frames are expected to be the delimiting frames of actions, as shown in Fig. 4.14.
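The segmentation steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the function names `qom` and `delimiting_frames` are hypothetical, the "first and last 12.5%" is read as the first and last 0.125 L frames of the sequence, and the sliding-window refinement is implemented as non-overlapping windows of size L/2 anchored at the next remaining candidate.

```python
import numpy as np

def qom(frames, threshold_qom=60):
    """Local and global QOM (Eq. 4.9) for a (k, H, W) stack of depth maps."""
    # Cast to a signed type so that differences of uint8 depth maps do not wrap.
    frames = np.asarray(frames, dtype=np.int64)
    # psi(x, y) = 1 where |x - y| > threshold, summed over all pixels (m, n).
    qom_global = (np.abs(frames - frames[0]) > threshold_qom).sum(axis=(1, 2))
    qom_local = np.zeros(len(frames), dtype=np.int64)
    qom_local[1:] = (np.abs(np.diff(frames, axis=0)) > threshold_qom).sum(axis=(1, 2))
    return qom_local, qom_global

def delimiting_frames(qom_global, avg_len):
    """Candidate selection and window refinement (one possible reading)."""
    q = np.asarray(qom_global, dtype=np.float64)
    # Assumption: reference statistics come from the first and last 0.125 * L frames.
    edge = max(1, int(round(0.125 * avg_len)))
    ref = np.concatenate([q[:edge], q[-edge:]])
    threshold_inter = ref.mean() + 2.0 * ref.std()
    candidates = np.flatnonzero(q < threshold_inter)
    window = max(1, int(avg_len) // 2)
    kept, start = [], 0
    while start < len(candidates):
        first = candidates[start]
        # Candidates that fall inside the current window of size L / 2.
        in_window = candidates[(candidates >= first) & (candidates < first + window)]
        # Keep only the frame with the minimum global QOM in this window.
        kept.append(int(in_window[np.argmin(q[in_window])]))
        start += len(in_window)
    return kept
```

Calling `qom` on a segmented or unsegmented depth stack gives the two components of Eq. (4.8) per frame; `delimiting_frames` then takes the global component together with the average action length L estimated from training data.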


4.2.2.2 Construction of Dynamic Images

The three sets of dynamic images, Dynamic Depth Images (DDIs), Dynamic Depth Normal Images (DDNIs) and Dynamic Depth Motion Normal Images (DDMNIs), are constructed from a segmented sequence of depth maps through hierarchical bidirectional rank pooling. They aim to exploit shape, motion and structural information captured by a depth sequence at different spatial and temporal scales. To this end, the conventional rank pooling [BFG+16] is extended to hierarchical bidirectional rank pooling.

The conventional rank pooling [BFG+16] aggregates spatio-temporal information from one video sequence into one dynamic image. It defines a function that maps a video clip into one feature vector [BFG+16]. A rank pooling function is formally defined as follows.

Rank Pooling Let a depth map sequence with $k$ frames be represented as $\langle d_1, d_2, \ldots, d_t, \ldots, d_k \rangle$, where $d_t$ is the average of depth features over the frames up to timestamp $t$. At each time $t$, a score $r_t = \omega^T \cdot d_t$ is assigned. The score satisfies $r_i > r_j \iff i > j$. In general, more recent frames are associated with larger scores. The process of rank pooling is to find $\omega^*$ that satisfies the following objective function:

$$\arg\min_{\omega} \; \frac{1}{2} \|\omega\|^2 + \lambda \sum_{i>j} \xi_{ij} \quad \text{s.t.} \quad \omega^T \cdot (d_i - d_j) \geq 1 - \xi_{ij}, \; \xi_{ij} \geq 0, \tag{4.10}$$

where $\xi_{ij}$ is a slack variable. Since the score $r_i$ assigned to frame $i$ is often defined as the order of the frame in the sequence, $\omega^*$ aggregates information from all of the frames in the sequence and can be used as a descriptor of the sequence. In this chapter, rank pooling is directly applied on the pixels of depth maps, so $\omega^*$ is of the same size as the depth maps and forms a dynamic depth image (DDI).
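To make the objective in Eq. (4.10) concrete, the sketch below minimizes its equivalent unconstrained hinge form, in which each slack variable is replaced by max(0, 1 - ω·(d_i - d_j)), with plain subgradient descent. The solver choice, the function name `rank_pool` and the values of `lam`, `lr` and `n_iter` are assumptions made for illustration; the chapter does not state which solver was used, and the input is assumed to be scaled to a modest range so that the fixed step size stays stable.

```python
import numpy as np

def rank_pool(frames, lam=1.0, lr=1e-3, n_iter=200):
    """Approximately solve Eq. (4.10) on raw pixels; returns omega with the frame shape."""
    frames = np.asarray(frames, dtype=np.float64)
    k = frames.shape[0]
    flat = frames.reshape(k, -1)
    # d_t: average of the frames up to and including time t.
    d = np.cumsum(flat, axis=0) / np.arange(1, k + 1)[:, None]
    # All ordered pairs (i, j) with i > j, and their differences d_i - d_j.
    i_idx, j_idx = np.tril_indices(k, k=-1)
    diffs = d[i_idx] - d[j_idx]
    w = np.zeros(flat.shape[1])
    for _ in range(n_iter):
        # A pair contributes to the hinge loss while its margin is below 1.
        active = diffs @ w < 1.0
        grad = w - lam * diffs[active].sum(axis=0)
        w -= lr * grad
    return w.reshape(frames.shape[1:])

# Usage: a fixed pattern whose intensity grows over time.
pattern = np.arange(1, 17, dtype=np.float64).reshape(4, 4) / 16.0
sequence = np.stack([t * pattern for t in range(6)])
omega = rank_pool(sequence)  # same size as one depth map, i.e. a DDI
```

The returned array has the same size as a single frame, which is what allows it to be treated as an image and passed to a ConvNet.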

However, the conventional rank pooling method has two drawbacks. Firstly, it treats a video sequence at a single temporal scale, which is usually too shallow [FAHG16]. Secondly, since in rank pooling the averaged feature up to time t is used to classify frame t, the pooled feature is biased towards the beginning frames of a depth sequence; hence, frames at the beginning have more influence on ω∗. This is not justifiable in action recognition, as there is no prior knowledge on which frames are more important than others.

To overcome the first drawback, it is proposed that rank pooling is applied recursively to sliding windows over several rank pooling layers. This recursive process can effectively explore the high-order and non-linear dynamics of a depth sequence. The rank pooling layer is defined as follows:

[Figure 4.15 diagram: an input depth sequence of five frames $i^{(1)}_1, \ldots, i^{(1)}_5$ is rank pooled into three frames $i^{(2)}_1, i^{(2)}_2, i^{(2)}_3$, which are rank pooled into a single frame $i^{(3)}_1$.]

Figure 4.15: Illustration of a two-layer rank pooling with window size three ($M_l = 3$) and stride one ($S_l = 1$).

Definition 2 (Rank Pooling Layer). Let $I^{(l)} = \langle i^{(l)}_1, \ldots, i^{(l)}_n \rangle$ denote the input sequence/subsequence that contains $n$ frames; $M_l$ is the window size; and $S_l$ is the stride in the $l$th layer. The subsequences of $I^{(l)}$ can be defined as $I^{(l)}_t = \langle i^{(l)}_t, \ldots, i^{(l)}_{t+M_l-1} \rangle$, where $t \in \{1, S_l + 1, 2S_l + 1, \ldots\}$. By applying the rank pooling function on the subsequences respectively, the outputs of the $l$th layer constitute the $(l+1)$th layer, which can be represented as $I^{(l+1)} = \langle \ldots, i^{(l+1)}_t, \ldots \rangle$.

$I^{(l)}$ to $I^{(l+1)}$ forms one layer of temporal hierarchy. Multiple rank pooling layers can be stacked together to make the pooling higher-order. In this case, each successive layer obtains the dynamics of the previous layer. Figure 4.15 shows a hierarchical rank pooling with two layers. For the first layer, the sequence is the input depth sequence, thus $l = 1$, $n = 5$; for the second layer, $l = 2$, $n = 3$. By adjusting the window size and stride of each layer, the hierarchical rank pooling can explore high-order and non-linear dynamics effectively.
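The windowing in Definition 2 can be sketched as follows. To keep the example short, the pooling function is a deliberately simple stand-in, `linear_pool`, which weights frame t of a T-frame window by 2t - T - 1; it is not the rank pooling solver of Eq. (4.10), and any function that maps a window of frames to one frame can be passed as `pool_fn` instead. All function names here are hypothetical.

```python
import numpy as np

def linear_pool(frames):
    """Stand-in pooling: sum the frames with weights 2t - T - 1 (t = 1..T)."""
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    alpha = 2.0 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alpha, frames, axes=([0], [0]))

def rank_pooling_layer(seq, window, stride, pool_fn=linear_pool):
    """One layer: pool every window of `window` frames, moving by `stride`."""
    seq = np.asarray(seq, dtype=np.float64)
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window size")
    # Zero-based window starts 0, S, 2S, ... correspond to t = 1, S+1, 2S+1, ...
    starts = range(0, len(seq) - window + 1, stride)
    return np.stack([pool_fn(seq[s:s + window]) for s in starts])

def hierarchical_rank_pool(seq, windows, strides, pool_fn=linear_pool):
    """Stack several layers; the output of layer l is the input of layer l + 1."""
    out = np.asarray(seq, dtype=np.float64)
    for window, stride in zip(windows, strides):
        out = rank_pooling_layer(out, window, stride, pool_fn)
    return out

# The configuration of Fig. 4.15: five frames, two layers, M_l = 3, S_l = 1.
seq = np.random.default_rng(0).random((5, 4, 4))
top = hierarchical_rank_pool(seq, windows=[3, 3], strides=[1, 1])
```

With five input frames the first layer yields three frames and the second yields one, matching Fig. 4.15. How a longer sequence is reduced to a single image after the last layer is not specified in the text, so the sketch simply returns the output sequence of the final layer.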

To address the second drawback, it is proposed to apply the rank pooling bidirectionally.

Bidirectional Rank Pooling is to apply the rank pooling forward and backward to a sequence of depth maps. In the forward rank pooling, $r_i$ is defined in the same order as the time-stamps of the frames. In the backward rank pooling, $r_i$ is defined in the reverse order of the time-stamps of the frames. When bidirectional rank pooling is applied to a sequence of depth maps, two DDIs, forward DDI and backward DDI, are generated.
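A short sketch shows why the backward pass is not simply the negative of the forward pass: the running averages $d_t$ depend on the direction in which the sequence is read. The pooling used here, `approx_rank_pool`, is an assumed stand-in equal to a single subgradient step on Eq. (4.10) starting from ω = 0, which weights $d_t$ by 2t - k - 1; it is used only to keep the example small, and the function names are hypothetical.

```python
import numpy as np

def approx_rank_pool(frames):
    """One subgradient step on Eq. (4.10) from omega = 0 (stand-in, not the exact solver)."""
    frames = np.asarray(frames, dtype=np.float64)
    k = frames.shape[0]
    counts = np.arange(1, k + 1).reshape((-1,) + (1,) * (frames.ndim - 1))
    # d_t: average of the frames seen so far, in the order given.
    d = np.cumsum(frames, axis=0) / counts
    # Frame t is ranked above t - 1 frames and below k - t frames.
    beta = 2.0 * np.arange(1, k + 1) - k - 1
    return np.tensordot(beta, d, axes=([0], [0]))

def bidirectional_rank_pool(frames, pool_fn=approx_rank_pool):
    """Return the (forward, backward) dynamic images of one sequence."""
    frames = np.asarray(frames, dtype=np.float64)
    forward = pool_fn(frames)         # scores increase with time
    backward = pool_fn(frames[::-1])  # scores increase against time
    return forward, backward

# Usage on a tiny one-pixel sequence with values 0, 1, 3.
fwd, bwd = bidirectional_rank_pool(np.array([0.0, 1.0, 3.0]).reshape(3, 1, 1))
```

In the forward direction the early frames dominate the running averages, while in the backward direction the late frames do, so the two images weight the sequence differently.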

By employing the hierarchical and bidirectional pooling together, the hierarchical bidirectional rank pooling exploits the dynamics of a depth sequence at different temporal scales and bidirectionally at the same time. It has been empirically observed that, for most actions with relatively short durations, two layers of bidirectional rank pooling are sufficient.

Construction of DDI

Given a segmented sequence of depth maps, the hierarchical bidirectional rank pooling method described above is employed directly on the depth pixels to generate two dynamic depth images (DDIs), forward DDI and backward DDI. Even though the rank pooling method exploits the evolution of videos and aims to encode both the spatial and motion information into one image, it is likely to lose much motion information due to the insensitivity of depth pixels to motion. As shown in Fig. 4.16, DDIs effectively capture the posture information, similar to key poses. Moreover, compared with the dynamic images (DIs [BFG+16]), the DDIs are more effective because they carry no interfering texture on the body.

Construction of DDNI

Depth images represent the geometry of surfaces in the scene well, and normal vectors are sensitive to the motion of depth pixels. In order to simultaneously exploit the spatial and motion information in depth sequences, it is proposed to extract normals from depth maps and construct the so-called DDNIs (dynamic depth normal images). For each depth map, a surface normal $(n_x, n_y, n_z)$ is calculated at each pixel. Three channels $(N_x, N_y, N_z)$, referred to as a Depth Normal Image (DNI), are generated from the normals, where $(N_x, N_y, N_z)$ are respectively the normal images of the three components $(n_x, n_y, n_z)$. The sequence of DNIs goes through hierarchical bidirectional rank pooling to generate two DDNIs, one being the forward DDNI and the other the backward DDNI.

To minimize the interference of the background, it is assumed that the background in the histogram of depth maps occupies the last peak, representing far distances. Specifically, pixels whose depth values are greater than a threshold, defined as the last peak of the depth histogram minus a fixed tolerance, are considered as background and removed from the calculation of DDNIs by resetting their depth values to zero. Through this simple process, most of the background can be removed and contributes little to the DDNIs. Samples of DDNIs can be seen in Fig. 4.16.
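The two steps behind a Depth Normal Image can be sketched as below. The text does not say how the normals are computed or how the last histogram peak is located, so both choices are assumptions: normals are taken as the normalized vector (-∂z/∂x, -∂z/∂y, 1) from finite differences, and the last peak is taken as the highest-depth local maximum of a 256-bin histogram. The names `remove_background` and `depth_normal_image` and the default `tolerance` are hypothetical.

```python
import numpy as np

def remove_background(depth, tolerance=10, bins=256):
    """Zero out pixels deeper than (last histogram peak - tolerance)."""
    depth = np.asarray(depth, dtype=np.float64)
    hist, edges = np.histogram(depth[depth > 0], bins=bins)
    # A bin is a peak if it is non-empty and not lower than both neighbours.
    padded = np.concatenate([[0], hist, [0]])
    centre = padded[1:-1]
    is_peak = (centre > 0) & (centre >= padded[:-2]) & (centre >= padded[2:])
    peaks = np.flatnonzero(is_peak)
    out = depth.copy()
    if len(peaks) == 0:
        return out
    last = peaks[-1]
    peak_depth = 0.5 * (edges[last] + edges[last + 1])
    out[depth > peak_depth - tolerance] = 0
    return out

def depth_normal_image(depth):
    """Return an (H, W, 3) array whose channels are Nx, Ny, Nz."""
    depth = np.asarray(depth, dtype=np.float64)
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    dz_dy, dz_dx = np.gradient(depth)
    normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals

# Usage: foreground at depth 50 in front of a wall at depth 200.
depth = np.full((8, 8), 200.0)
depth[2:5, 2:5] = 50.0
dni = depth_normal_image(remove_background(depth))
```

Note that resetting background pixels to zero creates sharp depth steps at the silhouette, so the normals along the foreground boundary mostly reflect that step rather than the true surface.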

Construction of DDMNI

The purpose of constructing a DDMNI is to further exploit the motion in depth maps. A Gaussian mixture model (GMM) is applied to depth sequences in order to detect the moving foreground. The normal vectors are extracted from the moving foreground, and a Depth Normal Image is constructed from the normal vectors for each depth map. Hierarchical bidirectional rank pooling is applied to the Depth Normal Image sequence, and two DDMNIs, forward DDMNI and backward DDMNI, are generated, which capture the motion information particularly well (see the illustration in Fig. 4.16).

[Figure 4.16 images: forward and backward samples of DI, DDI, DDNI and DDMNI.]

Figure 4.16: Samples of generated forward and backward DIs [BFG+16], DDIs, DDNIs and DDMNIs for gesture Mudra1/Ardhapataka.

4.2.2.3 Network Training

After the construction of DDIs, DDNIs and DDMNIs, there are six dynamic images, as illustrated in Fig. 4.16, for each depth map sequence. Six ConvNets were trained on the six channels individually. VGG-16 [SZ14b] is adopted in this chapter. The implementation is derived from the publicly available Caffe toolbox [JSD+14] based on three NVIDIA Tesla K40 GPU cards and one Pascal TITAN X.

The training procedure is similar to that in [SZ14b]. The network weights are learned using mini-batch stochastic gradient descent with the momentum set to 0.9 and weight decay set to 0.0005. All hidden weight layers use the rectification (ReLU) activation function. At each iteration, a mini-batch of 32 samples is constructed by sampling 256 shuffled training samples, and all the images are resized to 224 × 224. The learning rate is set to $10^{-3}$ for fine-tuning with models pre-trained on ILSVRC-2012, and then it is decreased according to a fixed schedule, which is kept the same for all training sets. The training runs for 100 epochs and the learning rate decreases every 30 epochs for each ConvNet. The dropout regularization ratio is set to 0.9 to reduce complex co-adaptations of neurons in the nets.

4.2.2.4 Score Fusion for Classification

Given a test depth video sequence (sample), three pairs of dynamic images (DDIs, DDNIs, DDMNIs) are generated and fed into six different trained ConvNets. For each image pair, product score fusion is used: the score vectors output by the two ConvNets of the pair are multiplied in an element-wise manner and the resultant score vector is normalized using the $L_1$ norm. The three normalized score vectors are then multiplied in an element-wise fashion and the max score in the resultant vector is assigned as the probability of the test sequence being the recognized class. The index of this max score corresponds to the recognized class label and is expressed as follows:

$$label = \mathrm{Fin}(\max(v_1 \circ v_2 \circ v_3 \circ v_4 \circ v_5 \circ v_6)), \tag{4.11}$$

where $v$ is a score vector, $\circ$ refers to element-wise multiplication and $\mathrm{Fin}(\cdot)$ is a function to find the index of the element having the maximum score.
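The fusion rule can be written as a short NumPy function. This is a sketch under the assumption that every ConvNet outputs a non-negative score vector (for example softmax probabilities) with at least one non-zero product per pair; the function name `fuse_scores` and the example scores are made up for illustration.

```python
import numpy as np

def fuse_scores(pairs):
    """Product score fusion over (forward, backward) score-vector pairs.

    `pairs` holds one pair per representation (DDI, DDNI, DDMNI).
    Returns the predicted class index and the fused score vector.
    """
    fused = None
    for forward, backward in pairs:
        # Element-wise product of the two directions, then L1 normalization.
        prod = np.asarray(forward, dtype=np.float64) * np.asarray(backward, dtype=np.float64)
        prod = prod / np.abs(prod).sum()
        # Element-wise product across the three representations.
        fused = prod if fused is None else fused * prod
    return int(np.argmax(fused)), fused

# Usage with three classes and illustrative scores.
label, fused = fuse_scores([
    ([0.2, 0.5, 0.3], [0.1, 0.6, 0.3]),   # DDI forward / backward
    ([0.3, 0.4, 0.3], [0.4, 0.3, 0.3]),   # DDNI forward / backward
    ([0.3, 0.3, 0.4], [0.2, 0.4, 0.4]),   # DDMNI forward / backward
])
```

Because each normalization only rescales a vector by a positive constant, the predicted label is the same as the arg max of the plain six-way product in Eq. (4.11); the normalization affects only the magnitude of the reported score.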

4.2.3 Experimental Results

In this section, the Large-scale Isolated and Continuous Gesture Recognition datasets of the ChaLearn LAP challenge 2016 (ChaLearn LAP IsoGD Dataset and ChaLearn LAP ConGD Dataset) [EPLW+16], the NTU RGB+D dataset [SLNW16], and the corresponding evaluation protocols, results and analysis are described. On the ChaLearn LAP ConGD Dataset, action segmentation was first conducted to segment the continuous actions into isolated actions. For all the experiments, the two-layer hierarchical bidirectional rank pooling method is adopted, with window size $M_l = 3$ and stride $S_l = 1$.

4.2.3.1 ChaLearn LAP IsoGD Dataset

The ChaLearn LAP IsoGD Dataset was adopted to evaluate the proposed method. In this chapter, only depth maps are used to evaluate the performance of the proposed method.

Table 4.14 shows the results of each channel. DDIs achieved much better results than DDNIs and DDMNIs, for the following reasons. First, the depth values are not the real depth but are normalized to [0, 255], which distorts the true 3D structure information and affects the extraction of normal vectors. Second, to save storage, the videos are stored with lossy compression, which introduces many blocking artifacts and makes the extraction of the moving foreground and normal vectors very noisy. Even so, the three kinds of dynamic images still provide complementary information to each other. In addition, bidirectional rank pooling exploits more useful information than one-way rank pooling [BFG+16], and adopting the product score fusion method improves the accuracy considerably (43.72% for the fusion of all channels against 36.92% for the best single channel with HRP). Moreover, hierarchical rank pooling encodes the dynamics of depth sequences better than the conventional rank pooling method.

Table 4.14: Comparative accuracy of the three sets of dynamic images on the validation set of the ChaLearn LAP IsoGD dataset. RP denotes conventional rank pooling; HRP represents hierarchical rank pooling.

Method              Accuracy for RP    Accuracy for HRP
DDI (forward)       36.13%             36.92%
DDI (backward)      30.45%             31.24%
DDI (fusion)        37.52%             37.68%
DDNI (forward)      24.86%             25.02%
DDNI (backward)     24.58%             24.64%
DDNI (fusion)       29.26%             29.48%
DDMNI (forward)     24.81%             24.69%
DDMNI (backward)    23.14%             23.57%
DDMNI (fusion)      27.75%             27.89%
Fusion All          42.56%             43.72%

The results obtained by the proposed method on the validation and test sets are listed and compared with previous methods in Table 5.2. These methods include MFSK, which combined 3D SMoSIFT [WRL+14] with HOG, HOF and MBH [WS13] descriptors. MFSK+DeepID further included the Deep hidden IDentity (DeepID) feature [SWT14]. Thus, these two methods utilized not only hand-crafted features but also deep learning features. Moreover, they extracted features from RGB and depth separately, concatenated them together, and adopted the Bag-of-Words (BoW) model as the final video representation. The other methods, WHDMM+SDI [WLG+16, BFG+16], extracted features and conducted classification with ConvNets from depth and RGB individually and adopted product score fusion for final recognition. SFAM [WLG+17] adopted scene flow to extract features and encoded the flow vectors into action maps, which fused RGB and depth data from the outset of the process. C3D [LMT+16b] applied 3D convolutional networks to both depth and RGB channels and combined them by late fusion. Pyramidal 3D CNN [ZZM+16b] applied 3D convolutional networks to pyramid inputs to recognize gestures from both video clips and entire videos. It is noteworthy that the results of the proposed method have been obtained using a single modality, viz. depth data, while all compared methods are based on both RGB and depth modalities.
