Human Action Recognition Using Spatio-Temporal Pyramid Model Based Background Subtraction on Depth Maps


2019 International Conference on Computer Science, Communications and Big Data (CSCBD 2019) ISBN: 978-1-60595-626-8


Emmanuel MUTABAZI, Jian-jun NI and Ye YANG

College of IOT Engineering, Hohai University, Changzhou, 213022, China

*Corresponding author

Keywords: Human Action Recognition, Background Subtraction in Regions, Improved Adaptive Region Decision Function, Histogram of Oriented Gradients, Multi-Class Support Vector Machine.

Abstract. In this paper, a background subtraction method operating in region units is proposed for recognizing actions and interactions in video. First, the video is captured and converted into frames, and preprocessing techniques are applied to the sampled images for noise reduction. Next, a background subtraction method extracts the foreground objects in region units. The background model, the color of the object and movement information are combined to obtain the region object likelihood. Then, an improved adaptive region decision function determines the object regions, and a human detection method produces a bounding box surrounding each person. Histogram of Oriented Gradients (HOG) is used for feature extraction and representation. Finally, a multi-class support vector machine (SVM) performs the classification.

Introduction

Human Action Recognition (HAR) is an emerging research area in computer vision. It has many real-world applications, such as intelligent surveillance systems, human-computer interaction and robotics [1]. Recognizing human actions under moving backgrounds, non-stationary cameras, scale variations and cluttered scenes remains challenging. Besides the difficulties related to recognition, a main challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the frames. Therefore, appropriate background subtraction, feature selection, feature extraction and feature representation are the major tasks in the action recognition process [2]. Feature extraction is especially important because it yields a set of key parameters that describe a specific human action, so that these parameters can be used to distinguish it from other actions. In this paper, we present a hierarchical graphical model for recognizing human actions and interactions in video. Our method covers the whole processing chain, spanning the pixel-level, blob-level, object-level and event-level computation of video. The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed method. Section 4 presents the experiments and results. Finally, Section 5 gives the conclusion.

Related Works


concatenate the HOG features extracted from the spatial cuboid. Moreover, a Local Mean Spatio-temporal Feature (LMSF) was used in [6] to speed up action recognition in depth images; the authors introduced a motion capture method to select valid frames in the depth sequences and finally applied the spatio-temporal pyramid to aggregate geometric and temporal cues.

Background Subtraction in Regions

In this section, a background subtraction method in region units is employed [7]. First, an image must be divided into small similar regions. We define $\mathcal{R}^k = \{R_i^k\}_{i \in I_k}$ as the set of the regions at the k-th frame. The information on the object and background colors and the previous object positions is combined to determine the object region likelihood. Then, the object silhouette is obtained by the region decision function $f_D^k$ defined on $\mathcal{R}^k$.

Region Object Likelihood

The region object likelihood $L(R_i^k)$ of region $R_i^k$ is calculated by

$$L(R_i^k) = L_s(R_i^k) + \lambda_1 L_c(R_i^k) + \lambda_2 L_r(R_i^k) \qquad (1)$$

In the above equation, $L_s(R_i^k)$ is the naïve object region likelihood given by the arithmetic mean of the pixel object likelihoods $l(s)$, i.e. $\sum_{s \in R_i^k} l(s) / n_{R_i^k}$, where $n_{R_i^k}$ is the number of pixels in the region $R_i^k$. The object color likelihood $L_c(R_i^k)$ is given by $\sum_{s \in R_i^k} l_c(s) / n_{R_i^k}$, with $l_c(s)$ denoting the pixel color likelihood. The regularization term $L_r(R_i^k)$ is the overlapping ratio of the region $R_i^k$ and is given by $n^r_{R_i^k} / n_{R_i^k}$, where $n^r_{R_i^k}$ is the number of pixels in region $R_i^k$ that belong to the previous object region. $\lambda_2$ is a regularization parameter.
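The combination of the three region-level terms can be sketched in a few lines of Python. This is a minimal illustration, assuming the terms are combined as a weighted sum; the function name, the argument names and the weight values `weight_color` and `weight_overlap` are hypothetical and are not taken from the paper.

```python
from statistics import mean

def region_object_likelihood(pixel_obj, pixel_color, in_prev_object,
                             weight_color=1.0, weight_overlap=1.0):
    """Combine the three region-level terms for a single region.

    pixel_obj      : per-pixel object likelihoods l(s) inside the region
    pixel_color    : per-pixel color likelihoods inside the region
    in_prev_object : per-pixel booleans, True if the pixel belonged to
                     the object region of the previous frame
    weight_*       : illustrative weights (assumed, not from the paper)
    """
    n = len(pixel_obj)
    if n == 0 or len(pixel_color) != n or len(in_prev_object) != n:
        raise ValueError("all inputs must be non-empty and equally long")
    naive = mean(pixel_obj)              # mean pixel object likelihood
    color = mean(pixel_color)            # mean pixel color likelihood
    overlap = sum(in_prev_object) / n    # overlap ratio with previous object
    return naive + weight_color * color + weight_overlap * overlap

# Toy region with four pixels.
score = region_object_likelihood(
    pixel_obj=[0.8, 0.6, 0.7, 0.9],
    pixel_color=[0.5, 0.5, 0.75, 0.25],
    in_prev_object=[True, True, False, True],
    weight_color=0.5, weight_overlap=0.2)
print(round(score, 3))
```

A larger overlap weight favors regions that were already labeled as object in the previous frame, which is how the regularization term stabilizes the silhouette over time.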

Improved Adaptive Region Decision Function

We start by assigning each pixel a label in $\mathcal{L} = \{0, 1\}$, where 0 denotes background and 1 denotes an object. The decision of the background subtraction in regions is performed in region units. Hence, the decision function $f_D^k$ of the k-th frame is defined on $\mathcal{R}^k$ and takes its values in $\mathcal{L}$. The decision function $f_D^k : \mathcal{R}^k \to \mathcal{L}$ is defined as

$$f_D^k(R_i^k) = \begin{cases} 1 & \text{if } L(R_i^k) \geq H^k_{R_i^k} + \epsilon \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where the region thresholds $H^k_{R_i^k}$ are defined as the arithmetic mean of the thresholds $H_s^k$ of the pixels in $R_i^k$, and $\epsilon$ is a constant that ensures a sufficient lower bound. We determine the pixel threshold $H_s^k$ over the whole image from the previous frame by

$$H_s^k = \left( H_s^{k-2} + M_s^{k-1} \right) / 2 \qquad (3)$$

In the above equation, $H_s^0 = H_s^1 = M_s^0 = H_{\min}$, and $M_s^{k-1}$ stands for the local minimum of the region, considering the weighted average of the local region object likelihood $m_s^{k-1}$ and the pixel object likelihood $l(s)$, i.e. $w\, m_s^{k-1} + (1 - w)\, l(s)$, where $w$ is set to 0.8 in the experiments. $m_s^{k-1}$ is given by

$$m_s^{k-1} = \begin{cases} \min\limits_{x \in C(s,d),\; x \in R,\; R \subset O^{k-1}} L^{k-1}(R) & \text{if } d(s, O^{k-1}) \leq d \\ H_{\min} & \text{otherwise} \end{cases} \qquad (4)$$

where $C(s, d)$ is a ball with radius $d$ centered at pixel $s$, $L^{k-1}(R)$ is the region likelihood of $R$ at the $(k-1)$-th frame, and $d(s, O^{k-1})$ is the nearest distance from pixel $s$ to the previous object region $O^{k-1}$. Then we truncate the $H_s^k$'s by a constant $H_{\max}$ to prevent them from becoming too high. The object segmentation process is shown in Fig. 1.

Figure 1. Illustration of the jump action. From the left, extracted frame, human detection using bounding box, background subtraction, and colored Region of Interest (ROI).
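The region-level decision and the threshold handling described in this section can be illustrated with a short Python sketch. It covers only the pieces stated in words in the text: the region threshold as the arithmetic mean of pixel thresholds, a lower-bound constant, the weighted average with w = 0.8, and truncation by an upper constant. The function names, the `margin` value, the additive use of the margin and the `>=` comparison are assumptions made for illustration.

```python
from statistics import mean

def region_threshold(pixel_thresholds):
    """Region threshold: arithmetic mean of the pixel thresholds in the region."""
    return mean(pixel_thresholds)

def decide_region(likelihood, pixel_thresholds, margin=0.05):
    """Label a region as object (1) or background (0).

    `margin` stands in for the lower-bound constant; adding it to the
    threshold and comparing with >= are assumptions of this sketch.
    """
    return 1 if likelihood >= region_threshold(pixel_thresholds) + margin else 0

def blended_local_minimum(local_min, pixel_likelihood, w=0.8):
    """Weighted average w * m + (1 - w) * l(s), with w = 0.8 as in the experiments."""
    return w * local_min + (1.0 - w) * pixel_likelihood

def truncate_threshold(threshold, h_max):
    """Cap a pixel threshold at h_max so that it cannot grow too high."""
    return min(threshold, h_max)

# Two toy regions that share the same pixel thresholds.
thresholds = [0.5, 0.7]
labels = [decide_region(score, thresholds) for score in (0.9, 0.6)]
print(labels)
```

Keeping the decision at region level means a few noisy pixels cannot flip a label on their own, since both the likelihood and the threshold are averages over the region.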

Feature Extraction

Histogram of Oriented Gradients (HOG) is a feature descriptor used in computer vision and image processing for object detection [2]. To compute a HOG descriptor, we first calculate the horizontal and vertical gradients of the image, derive the gradient magnitude and orientation at each pixel from them, and finally accumulate these values into histograms of gradient orientations.
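The gradient and histogram steps can be sketched as follows. This is a simplified illustration in plain Python, not the configuration used in the experiments: the cell size, the number of bins, the central-difference gradient and the omission of block normalization are all assumptions, and `hog_cell_histograms` is a hypothetical name.

```python
import math

def hog_cell_histograms(image, cell=4, bins=9):
    """Unsigned-orientation gradient histograms per cell (no block normalization).

    image : list of equal-length rows of grey levels
    Returns a flat list of (rows // cell) * (cols // cell) * bins values.
    """
    rows, cols = len(image), len(image[0])
    cells_y, cells_x = rows // cell, cols // cell
    hist = [[[0.0] * bins for _ in range(cells_x)] for _ in range(cells_y)]
    bin_width = 180.0 / bins
    for y in range(cells_y * cell):
        for x in range(cells_x * cell):
            # Horizontal and vertical central differences, clamped at the border.
            gx = image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle // bin_width), bins - 1)
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[y // cell][x // cell][b] += magnitude
    return [v for row in hist for c in row for v in c]

# 8x8 toy image with a vertical edge: dark left half, bright right half.
img = [[0] * 4 + [10] * 4 for _ in range(8)]
features = hog_cell_histograms(img, cell=4, bins=9)
print(len(features))
```

For the toy image, all gradient energy falls into the first orientation bin of each cell, because the only intensity change is horizontal.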

Classification

Multi-class SVM has been used to solve multi-class problems [8]. The most important criterion for evaluating the performance of a multi-class SVM is its accuracy rate. The Gaussian RBF kernel is used for training on all the data, as it can non-linearly map samples into a higher-dimensional space.
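For reference, the Gaussian RBF kernel has the standard form K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it on small feature vectors; the value of `gamma` and the helper names are assumptions, since the kernel width used in the experiments is not stated.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(samples, gamma=0.5):
    """Pairwise kernel values, the quantity an SVM solver works with."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]

samples = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
K = gram_matrix(samples)
print(K[0][1] > K[0][2])  # closer samples receive a larger kernel value
```

The kernel equals 1 for identical samples and decays toward 0 with distance, so `gamma` controls how local the resulting decision boundary is.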

Experiments and Results

To evaluate the performance of the proposed approach, we conduct experiments on the publicly available Weizmann dataset [9], using MATLAB 2015b on an Intel Pentium processor with 2 GB of memory. From this dataset, we consider 5 actions, namely walking, bending, jumping, running and skipping. The validation performance is measured by using 70% of the set for training and the other 30% for testing. The accuracy of the proposed method is measured in terms of classification accuracy using the multi-class SVM classifier. As shown in Table 1, the proposed approach achieves 97.6%, 95.6%, 97.6%, 98% and 95.6% accuracy for the bend, run, jump, walk and skip actions, respectively. In order to examine the performance of our method, we compare it with two other methods, namely LMSF [6] and HOG_MBH [10].


Figure 2. Results of action recognition. From left to right: bend, jump, run, skip and walk actions.

The recognition accuracy of the proposed approach for each of the five actions is summarized in Table 1 below.

Table 1. Summary of the Experimental Results.

Actions        Accuracy
Bend           97.6%
Run            95.6%
Jump           97.6%
Walk           98%
Skip           95.6%

Table 2. Comparison of our method with others.

Method         Accuracy
LMSF [6]       93.82%
HOG_MBH [10]   95.4%
Our Method     96.88%

Figure 3. Comparison of the performance of our method with others.

Our method was compared with state-of-the-art methods, and the results show that it outperforms the other methods in terms of accuracy, as shown in Table 2 and Fig. 3 above.
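As a quick consistency check, the overall accuracy of 96.88% reported for our method matches the unweighted mean of the five per-action accuracies. The snippet below only reproduces that arithmetic; whether the overall figure was originally computed this way is an assumption.

```python
# Per-action accuracies (in percent) as reported in the experiments.
per_action = {"bend": 97.6, "run": 95.6, "jump": 97.6, "walk": 98.0, "skip": 95.6}

# Unweighted mean over the five actions.
overall = sum(per_action.values()) / len(per_action)
print(f"{overall:.2f}%")
```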

Conclusion


improve the performance of the low-level processing, and representing the actions in 3 dimensions for better recognition accuracy.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (61873086, 61573128), and the Fundamental Research Funds for the Central Universities (2018B23214).

References

[1] Liu, Li, et al. "Learning spatio-temporal representations for action recognition: A genetic programming approach." IEEE Transactions on Cybernetics 46.1 (2016): 158-170.

[2] Lahiri, Dishani, Chhavi Dhiman, and Dinesh Kumar Vishwakarma. "Abnormal human action recognition using average energy images." 2017 Conference on Information and Communication Technology (CICT). IEEE, 2017.

[3] Xu, Haining, et al. "Spatio-temporal pyramid model based on depth maps for action recognition." 2015 IEEE 17th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2015.

[4] Silambarasi, R., Suraj Prakash Sahoo, and Samit Ari. "3D spatial-temporal view-based motion tracing in human action recognition." 2017 International Conference on Communication and Signal Processing (ICCSP). IEEE, 2017.

[5] Ji, Xiaopeng, Jun Cheng, and Wei Feng. "Spatio-temporal cuboid pyramid for action recognition using depth motion sequences." 2016 Eighth International Conference on Advanced Computational Intelligence (ICACI). IEEE, 2016.

[6] Ji, Xiaopeng, Jun Cheng, and Dapeng Tao. "Local mean spatio-temporal feature for depth image-based speed-up action recognition." 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015.

[7] Ahn, Jung-Ho, and Hyeran Byun. "Human silhouette extraction method using region-based background subtraction." International Conference on Computer Vision/Computer Graphics Collaboration Techniques and Applications. Springer, Berlin, Heidelberg, 2007.

[8] Van Nguyen, Nang Hung, et al. "Human Activity Recognition Based on Weighted Sum Method and Combination of Feature Extraction Methods." International Journal of Intelligent Information Systems 7.1 (2018): 9.

[9] Blank, Moshe, et al. "Actions as space-time shapes." Tenth IEEE International Conference on Computer Vision (ICCV'05). IEEE, 2005.

[10] Xing, Dong, Xianzhong Wang, and Hongtao Lu. "Action recognition using hybrid feature descriptor and VLAD video encoding." Asian Conference on Computer Vision. Springer, Cham,