CHAPTER 5. MOTION TEMPLATE MODELING: COMPARATIVE STUDY OF MOTION
5.2 Background on Vision-based Activity Analysis in Construction
Human action recognition has gained great attention in the computer vision community for its wide variety of potential applications (e.g., entertainment, rehabilitation, robotics, and security). Over the last decades, rapid advances have been made through various approaches, such as scene interpretation (e.g., Vasvani et al. 2003; Liu and Chua 2006), holistic body-based recognition (e.g., Efros et al. 2003; Wang et al. 2006), body part-based recognition (e.g., Davis and Taylor 2002; Biswas and Basu 2011), and action hierarchy-based recognition (e.g., Ijspeert et al. 2002; Jenkins and Mataric 2002). In construction, an activity consists of a series of actions performed to complete a task. For example, a plywood installation sequence generally comprises walking for pickup, measuring a dimension, cutting, lifting, carrying, aligning, and nailing a board. Accordingly, action recognition has great potential for promoting accurate data collection and supporting the field observation process used to analyze operations. Although various sensing devices have been used for activity analysis (e.g., accelerometers in Joshua and Varghese 2011; an external musculoskeletal joint angle sensor in Alwasel et al. 2011; the combination of ultra wideband and a physiological status monitor in Cheng et al. 2013; and the fusion of ultra wideband, a physiological status monitor, and video in Cheng et al. 2013), this section summarizes prior work focused on vision-based motion analysis for construction applications (Table 5.1) and discusses the following issues arising from the previous studies, which can potentially affect action recognition performance: (1) data types extracted from motion capture data, (2) the variations of postures and actions in the samples of a particular motion, and (3) the characteristics of motion data used for motion classification.
The selection of discriminating features plays a key role in improving the accuracy of classification (Mangai et al. 2010); for instance, a process that makes the datasets separable can significantly improve classification accuracy. The development of sensing devices such as the Kinect has allowed for the acquisition of rich motion information, and motion data types have advanced to the point that they can precisely represent a human body and its motions. For example, a Kinect motion capture system (e.g., a Kinect application implemented using an OpenNI software development kit) allows not only for the depth measurement of body parts but also for the extraction of skeleton models from the videos recorded with depth sensors. This skeleton model generally contains angular or spatial information of body joints as variables that characterize a human posture per frame, thus representing a human action as changing postures over time. At the same time, however, the features used for motion classification have varied across previous research. For instance, Ray and Teizer (2012) utilized depth values of a human subject for action classification and joint angles for ergonomic analysis; Escorcia et al. (2012) classified actions with pose code-words based on joint angles; and Han et al. (2013) extracted rotation angles for unsafe action detection. Given the importance of feature selection to classification performance, further investigation into motion data types and features is needed to find the discriminating ones that best represent specific actions and distinguish those actions from others for action recognition.
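To make the notion of an angular posture feature concrete, the following minimal sketch computes the angle at one joint from three 3D joint positions, as a skeleton model might provide per frame. It is an illustration only: the function name `joint_angle`, the joint names, and the coordinates are hypothetical, and it does not reproduce the feature extraction of any of the cited studies.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at joint b between segments b->a and b->c.

    a, b, c are (x, y, z) positions of three connected joints
    (hypothetical example: shoulder, elbow, wrist).
    """
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("joints must not coincide")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical joint positions for a single frame (arbitrary units).
shoulder, elbow, wrist = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
```

Collecting such angles for every tracked joint yields one feature vector per frame, so that an action becomes a sequence of these vectors over time.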
The pattern and pace of actions vary from individual to individual as well as from time to time. Moreover, motion capture data is high dimensional, with numerous variables defining a posture at any given moment: for example, a feature vector contained 500 variables (i.e., a 20 × 25 grayscale image) in Ray and Teizer (2012), 13 angle variables between body joints in Escorcia et al. (2012), and 42 rotation angles at joints in Han et al. (2013). The variation of human motion, combined with this high dimensionality, can potentially degrade classification performance for action recognition. Specifically, in Han et al. (2013), the selection of an action template for similarity measurement between actions influenced the classification error rates up to 25.2%. In this respect, motion modeling that efficiently reflects the variations and dynamics of actions may lead to an in-depth understanding of motion patterns, and may thereby improve classification accuracy by effectively modeling motion patterns and learning classifiers from the training datasets.
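The sensitivity to template selection can be illustrated with a toy nearest-template classifier. The sketch below is not the similarity measure of Han et al. (2013); it assumes Euclidean distance, and the two-variable feature vectors, class labels, and names (`classify`, `templates_a`, `templates_b`, `sample`) are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def classify(sample, templates):
    """Return the label of the template most similar (closest) to the sample."""
    return min(templates, key=lambda label: euclidean(sample, templates[label]))


# Hypothetical data: two recorded examples of the same "climbing" action
# that differ because of individual variation.
climbing_example_1 = [10.0, 20.0]
climbing_example_2 = [30.0, 40.0]
reaching_example = [50.0, 60.0]

# Two template sets that differ only in which climbing example is used.
templates_a = {"climbing": climbing_example_1, "reaching": reaching_example}
templates_b = {"climbing": climbing_example_2, "reaching": reaching_example}

sample = [35.0, 45.0]  # a new observation to classify
label_a = classify(sample, templates_a)
label_b = classify(sample, templates_b)
```

With these toy values the same observation is labeled "reaching" under the first template set and "climbing" under the second, which is the kind of template dependence that modeling the variation within an action is meant to reduce.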
In motion analysis, an action consists of a temporal sequence of postures. Hence, action detection is mostly performed by measuring the similarity between body poses (e.g., Ray and Teizer 2012) or sets of postures (e.g., Escorcia et al. 2012; Han et al. 2013) and assigning each action to the class with the largest degree of similarity. In the latter case in particular (i.e., a collection of poses), different speeds of actions may make it difficult to distinguish actions such as walking and running, and to segment the entire data stream into subsets for classification. Also, if the sequential relations inherent in motion data are not properly reflected, semantic information on actions may be lost, leading to inaccurate recognition of actions such as picking up and dropping an object. These characteristics of motion data thus emphasize the need for further research that incorporates the time series characteristics of actions into motion analysis in order to improve classification.
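The loss of sequential information can be shown with a small sketch. It compares an order-agnostic representation (a histogram of poses) with an order-aware comparison. Dynamic time warping (DTW) is used here only as one well-known sequence alignment technique; it is an assumption of this example rather than a method named in the studies above, and the one-dimensional "hand height" sequences are hypothetical.

```python
from collections import Counter


def pose_histogram(sequence):
    """Order-agnostic representation: count how often each pose value occurs."""
    return Counter(sequence)


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical hand-height trajectories (arbitrary units).
pick_up = [0, 1, 2, 3]                    # hand rises
drop = [3, 2, 1, 0]                       # same poses in reverse order
slow_pick_up = [0, 0, 1, 1, 2, 2, 3, 3]   # same action at half the pace

same_histogram = pose_histogram(pick_up) == pose_histogram(drop)
dtw_same_action = dtw_distance(pick_up, slow_pick_up)
dtw_reversed = dtw_distance(pick_up, drop)
```

In this toy case the pose histograms of picking up and dropping are identical, so an order-agnostic model cannot separate them, whereas the order-aware distance is zero for the slower repetition of the same action and positive for the reversed one.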
Table 5.1: Summary of prior work on vision-based activity analysis in construction
| Study | Application | Sensing Device | Classified Action | Feature | Classification Approach |
|---|---|---|---|---|---|
| Peddi et al. (2009) | Productivity (re-bar tying) | Camera | Effective, ineffective, and contributory work | Skeleton from silhouette | Neural network |
| Gonsalves and Teizer (2009) | Safety and health | 3D range camera | Lifting, waving a flag, crawling, side-walking, performing sit-ups, and walking | Joint angle in a star-skeleton model | Rule-based model |
| Gong and Caldas (2011) | Activity analysis (framework) | Camera | Traveling, transporting, bending, nailing, and aligning formwork | Video feature (HOG descriptor) | Bag-of-video feature words model |
| Escorcia et al. (2012) | Activity analysis (drywall) | Kinect | Fire caulking, hammering, idling, painting, and walking | Pose code-word (joint angle) | SVM with bag-of-pose model |
| Ray and Teizer (2012) | Safety and health | Kinect | Standing, squatting, bending, and crawling | Depth values of a person (20×25 image) | Linear Discriminant Analysis |
| Han et al. (2013) | Safety (ladder climbing) | Kinect | Backward-facing climbing, climbing with an object, and reaching far to a side | Rotation angle | Similarity measurement |