Real Time Skeleton Tracking based Human Recognition System using Kinect and Arduino







Satish Prabhu

B.E EXTC VIVA Institute of Technology

Jay Kumar


B.E EXTC VIVA Institute of Technology

Amankumar Dabhi

B.E EXTC VIVA Institute of Technology

Pratik Shetty

B.E EXTC VIVA Institute of Technology


The Microsoft Kinect sensor provides high-resolution RGB and depth sensing and is becoming available for widespread use. Its applications include object tracking, object detection and recognition, human activity analysis, hand gesture analysis and 3D mapping. Facial expression detection is widely used in human-computer interfaces, and the Kinect depth camera can be used to detect common facial expressions. The face is tracked using the MS Kinect with the 2.0 SDK, which uses the depth map to create a 3D frame model of the face. By recognizing facial expressions from facial images, a number of applications in the field of human-computer interaction can be built. This paper describes the working of the Kinect and its use in human skeleton tracking.

General Terms

Skeleton tracking algorithm & Action Recognition


Skeleton Tracking, Kinect, Pose Estimation, Arduino, Actions



Mobile robots have thousands of applications, from autonomously mapping out a lawn and cutting grass to urban search and rescue with autonomous ground vehicles. One important future application would be to fight wars in place of humans: humans would fight virtually, and the mobile robot would copy whatever move the human makes. To achieve this, the robot must be taught how to copy human actions. This project therefore deals with making a robot that copies human actions.

The idea is to make use of one of the most useful capabilities of the Kinect: skeleton tracking. This feature allows us to build a servo-driven robot that copies human actions efficiently. Natural interaction applied to the robot has an important outcome: there is no need for a physical connection between the controller and the robot. This project will be extended to implement network connectivity, so that the robot can be controlled remotely from anywhere in the world.

The project uses skeleton tracking so that the Kinect can detect the movements of the user's joints and limbs in space. The user data is mapped to servo angles, which are sent to the Arduino board controlling the servos of the robot. The skeleton tracking feature works on the depth image of the human: it tracks the positions of the joints of the human body, which are then provided to the computer. The computer in turn sends a signal to the Arduino board in the form of pulses for every joint, and each servo motor rotates in accordance with the pulses it receives.

The eight servos are placed on the shoulders, elbows, hips and knees of the robot. Each servo motor is a DC motor whose rotation depends on the signal pulses applied to it. As a simplified model, assume that one pulse rotates the motor through 1 degree; then 90 pulses rotate it through 90 degrees, 180 pulses through 180 degrees, and so on.

The second important part of the paper is angle calculation. The skeleton information from the Kinect is stored in the computer, which runs a program that calculates the angle of inclination of every joint of the human body for use by the Arduino. Each calculated angle is then converted into a pulse train for the corresponding servo motor connected to the Arduino. According to the received pulses, the servo motor rotates through the angle observed by the Kinect sensor. Hence the robot copies the actions of the human skeleton.
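The angle calculation described above can be illustrated with a minimal sketch. It is not the authors' program: it assumes that the angle at a joint is taken between the two limb segments meeting there (for example shoulder-elbow and elbow-wrist), and the function names `joint_angle_deg` and `to_servo_angle` as well as the sample coordinates are hypothetical.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b between the segments b->a and b->c.

    a, b and c are (x, y, z) joint positions, e.g. shoulder, elbow, wrist.
    """
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("coincident joints: angle is undefined")
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def to_servo_angle(angle_deg):
    """Clamp an angle to the 0..180 degree range assumed for the servos."""
    return int(round(max(0.0, min(180.0, angle_deg))))


# Illustrative joint positions in metres (assumed values, not measured data).
shoulder = (0.0, 0.5, 2.0)
elbow = (0.3, 0.5, 2.0)
wrist = (0.3, 0.8, 2.0)
elbow_angle = joint_angle_deg(shoulder, elbow, wrist)
servo_command = to_servo_angle(elbow_angle)
```

With these sample positions the forearm is perpendicular to the upper arm, so the elbow servo would be commanded to its mid position.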

The third important part of the project is to extend the concept to use over the internet, so that the robot can be operated from anywhere around the globe. To do so, the user sets the external IP address of the computer in the Arduino program; through this, the robot can emulate human actions from anywhere on earth over the internet.
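The paper does not specify how the servo angles are serialized before being sent to the Arduino, whether over a local link or over the network. As one possible illustration only, the following sketch frames the eight angles into a small packet; the frame layout, the start byte and the names `encode_frame` and `decode_frame` are all hypothetical.

```python
START_BYTE = 0xFF  # hypothetical frame marker; angle bytes never exceed 180
NUM_SERVOS = 8     # shoulders, elbows, hips and knees


def encode_frame(angles):
    """Pack eight servo angles into: start byte, 8 angle bytes, checksum."""
    if len(angles) != NUM_SERVOS:
        raise ValueError("expected one angle per servo")
    clamped = [max(0, min(180, int(round(a)))) for a in angles]
    checksum = sum(clamped) % 256
    return bytes([START_BYTE] + clamped + [checksum])


def decode_frame(frame):
    """Inverse of encode_frame; raises ValueError on a malformed frame."""
    if len(frame) != NUM_SERVOS + 2 or frame[0] != START_BYTE:
        raise ValueError("malformed frame")
    angles = list(frame[1:-1])
    if sum(angles) % 256 != frame[-1]:
        raise ValueError("checksum mismatch")
    return angles


frame = encode_frame([90, 45, 0, 180, 10, 20, 30, 200])
received = decode_frame(frame)
```

The same bytes could be written to a serial port or sent in a network datagram; the framing itself is independent of the transport.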



The project deals with making a robot that copies human actions. Microsoft recently released the Xbox Kinect, which has proved useful for detecting human actions and gestures. In this paper, we therefore propose to use the Kinect camera to capture human gestures and relay these actions to a robot controlled by the Kinect and an Arduino board.


Existing Systems

Previously, depth images were recorded with the help of silhouettes, which are simply the contours of the body part whose depth image is to be formed [1]. Silhouettes ignore shadows on the body and the colour of the clothes the person is wearing; they capture only the border of the body. For a digital system, however, it is very difficult to predict the motion of a body part of an unknown person, since this type of model is based on a priori knowledge of the contours. Humans from different parts of the world differ in size, limb length and many other physical parameters, so it is difficult to store all such information. Therefore, using silhouettes reduces the scope of depth images. [1]

The two major steps leading from a captured motion to a reconstructed one are:

• Marker reconstruction from 2-D marker sets to 3-D positions;

• Marker tracking from one frame to the next, in 2-D and/or 3-D.


However, despite the fact that 2-D and 3-D tracking ensure the identification of a large number of markers from one frame to another, ambiguities, sudden accelerations or occlusions will often cause erroneous reconstructions or breaks in the tracking links. For this reason, it has proved necessary to increase the procedure's robustness by using the skeleton to drive the reconstruction and tracking process, introducing a third step: the accurate identification of each 3-D marker and a complete marker inventory in each frame. The approaches to solving these issues are addressed in the following paragraphs, starting with the presentation of the human model used and keeping in mind that the entire approach is based on the constant interaction between the model and the above marker processing tasks.

2.1.1 Skeleton model

The skeleton model is controlled by 32 degrees of freedom grouped in 9 joints in 3–D space. This is a simplified version of the complete skeleton generally used. It does not include detailed hands and feet.

Fig 1: Default Skeletal Joint Locations

2.1.2 Stereo triangulation

3-D markers are reconstructed from the 2-D data using stereo triangulation.

2.1.3 Binocular reconstruction

After reconstructing these 3-D markers in the first frame, the number of reconstructed markers is compared with the number of markers known to be carried by the subject. As all remaining processing is automatic, it is absolutely essential that all markers be identified in the first frame; any marker not present in the first frame is lost for the entire sequence. Therefore, if the number of reconstructed markers is insufficient, a second stereo matching is performed, this time also taking into account markers seen in only two views. [2]

There are three classes of techniques by which the body can be tracked without markers (the markerless approach). First, learning-based methods, which rely on prior probabilities for human poses and therefore assume limited motions. Second, model-free methods, which do not use any a priori knowledge and recover articulated structures automatically. However, the articulated structure is likely to change over time, for instance when a new articulation is encountered, making identification or tracking difficult. Third, model-based approaches, which fit and track a known model using image information.


Proposed Approach

The paper aims at limiting as much as possible the required a priori knowledge, while keeping the robustness of the method reasonable for most interaction applications. Hence, the given approach belongs to the third category [3]. Among model-based methods, a large class of approaches uses an a priori surface or volume representation of the human body, which combines both shape and motion information [4]. The corresponding models range from fine mesh models to coarser models based on generalized cylinders, ellipsoids or other geometric shapes. In order to avoid complex estimation of both shape and motion, most approaches in this class assume known body dimensions. However, this strongly limits flexibility and becomes intractable in the many interaction systems where unknown persons are supposed to interact. A more efficient solution is to find a model that reduces shape information. For this purpose, a skeletal model can be used. This model does not include any volumetric information; hence, it has fewer dependencies on body dimensions. In addition, limb lengths tend to follow natural biological laws, whereas human shapes vary a lot across the population. Recovering motion using skeletal models has not been widely investigated; one approach fits a skeletal structure with the help of hand/feet/head tracking, but volumetric dimensions are still required for the arm and leg limbs. Given the complications and errors of these techniques, this project uses the Kinect, which addresses the difficulties of the above approaches in finding a robust technique. [3]



The Microsoft Kinect sensor provides high-resolution RGB and depth sensing and is becoming available for widespread use. Its applications include object tracking, object detection and recognition, human activity analysis, hand gesture analysis and 3D mapping. Facial expression detection is widely used in human-computer interfaces. The Kinect can be used to detect and distinguish between different kinds of objects. The depth information is analysed to identify the different parts of fingers or hands, or the entire body, in order to interpret gestures from a human standing in front of it. Thus the Kinect has been found to be an effective tool for target tracking and action recognition. [5]

The Kinect camera consists of an infrared (IR) projector, a colour camera and an IR camera. The depth sensor consists of the IR projector combined with the IR camera, which is a monochrome complementary metal-oxide semiconductor (CMOS) sensor. The IR projector is an IR laser that passes through a diffraction grating and turns into a set of IR dots. [6]

The relative geometry between the IR projector and the IR camera, as well as the projected IR dot pattern, are known. If a dot observed in an image matches a dot in the projector pattern, it can be reconstructed in 3D using triangulation. Because the dot pattern is relatively random, the matching between the IR image and the projector pattern can be done in a straightforward way by comparing small neighbourhoods using, for example, normalized cross-correlation. [6]
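The matching and triangulation steps can be sketched as follows. This is a simplified illustration rather than the Kinect's internal algorithm: patches are flat lists of intensities, the projector and camera are assumed to form a rectified pair so that depth follows Z = f·b/d, and the focal length and baseline used below are illustrative assumptions.

```python
import math


def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized intensity patches."""
    n = len(patch_a)
    if n == 0 or n != len(patch_b):
        raise ValueError("patches must be non-empty and of equal size")
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    da = [x - mean_a for x in patch_a]
    db = [x - mean_b for x in patch_b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0.0:
        return 0.0  # a flat patch carries no pattern to correlate
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match(observed, candidates):
    """Index of the candidate pattern patch with the highest NCC score."""
    scores = [ncc(observed, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Triangulated depth Z = f * b / d for a rectified projector-camera pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


observed = [10, 200, 10, 200]
candidates = [[0, 0, 1, 1], [1, 20, 1, 20], [5, 5, 5, 5]]
match_index = best_match(observed, candidates)
# Assumed values: 580 px focal length, 7.5 cm baseline, 14.5 px disparity.
depth_m = depth_from_disparity(580.0, 0.075, 14.5)
```

Because NCC subtracts the mean and divides by the patch energy, the second candidate matches even though its brightness differs from the observed patch by a constant factor.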

In skeletal tracking, a human body is represented by a number of joints representing body parts such as the head, neck, shoulders and arms. Each joint is represented by its 3D coordinates. The goal is to determine all the 3D parameters of these joints in real time, to allow fluent interactivity, and with the limited computational resources allocated on the Xbox 360, so as not to impact gaming performance. Rather than trying to determine the body pose directly in this high-dimensional space, Jamie Shotton and his team met the challenge by proposing per-pixel body-part recognition as an intermediate step. Shotton's team treats the segmentation of a depth image as a per-pixel classification task (no pairwise terms or conditional random field are necessary) [4]. Evaluating each pixel separately avoids a combinatorial search over the different body joints. For training data, realistic synthetic depth images of humans of many shapes and sizes, in highly varied poses sampled from a large motion-capture database, are generated. A deep randomized decision forest classifier is then trained, which avoids overfitting by using hundreds of thousands of training images. Simple, discriminative depth comparison image features yield 3D translation invariance while maintaining high computational efficiency. [6]
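The depth comparison features mentioned above can be illustrated with a short sketch. In [4] the feature at pixel x is the difference between the depths at two offsets u and v, each divided by the depth at x so that the response does not depend on how far the body is from the camera. The code below is a simplified illustration of that idea; the image layout (a list of rows in metres, with 0 marking background) and the constant `BACKGROUND_DEPTH` are assumptions.

```python
BACKGROUND_DEPTH = 1e6  # assumed large value for background or off-image probes


def depth_at(depth, x, y):
    """Depth at pixel (x, y); background and off-image pixels read as far away."""
    if 0 <= y < len(depth) and 0 <= x < len(depth[0]):
        d = depth[y][x]
        return d if d > 0 else BACKGROUND_DEPTH
    return BACKGROUND_DEPTH


def depth_feature(depth, x, y, u, v):
    """Difference of depths at two probe offsets scaled by 1 / depth(x, y)."""
    d = depth_at(depth, x, y)
    ux, uy = x + int(round(u[0] / d)), y + int(round(u[1] / d))
    vx, vy = x + int(round(v[0] / d)), y + int(round(v[1] / d))
    return depth_at(depth, ux, uy) - depth_at(depth, vx, vy)


# 5 x 5 depth image at 2 m with one farther pixel on the right edge.
image = [[2.0] * 5 for _ in range(5)]
image[2][4] = 4.0
feature_inside = depth_feature(image, 2, 2, (4.0, 0.0), (-4.0, 0.0))
feature_outside = depth_feature(image, 2, 2, (10.0, 0.0), (0.0, 0.0))
```

In a decision forest, each split node would compare one such feature against a learned threshold; training the forest itself is outside the scope of this sketch.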




The depth maps captured by the Kinect sensor are processed by a skeleton-tracking algorithm. The depth maps of the utilized dataset were acquired using the OpenNI API [7]. The OpenNI high-level skeleton-tracking module is used for detecting the performing subject and tracking a set of joints of his/her body. More specifically, the OpenNI tracker detects the positions of the following joints in 3D space: Torso, Neck, Head, Left shoulder, Left elbow, Left wrist, Right shoulder, Right elbow, Right wrist, Left hip, Left knee, Left foot, Right hip, Right knee and Right foot. The position of joint $g_i$ is given by the vector $p_i(t) = [x\ y\ z]^T$, where $t$ denotes the frame for which the joint position is located, and the origin of the orthogonal XYZ coordinate system is placed at the centre of the Kinect sensor.

4.1 Action recognition

Action recognition can be further divided into three parts:

4.1.1 Pose estimation

In particular, the aim of this step is to estimate a continuously updated orthogonal basis of vectors for every frame t that represents the subject's pose. The calculation is based on the fundamental consideration that the orientation of the subject's torso is the most characteristic quantity of the subject during the execution of any action, and for that reason it can be used as a reference. For pose estimation, the positions of the following three joints are taken into account: Left shoulder, Right shoulder and Right hip. These are joints around the torso area whose relative positions remain almost unchanged during the execution of any action. The motivation for considering these three joints, instead of directly estimating the position of the torso joint and the respective normal vector, is to reach a more accurate estimate of the subject's pose. It must be noted that the Right hip joint was preferred over the obvious Torso joint selection. This was done so that the orthogonal basis of vectors is estimated from joints with larger in-between distances, which are more likely to lead to accurate pose estimation. However, no significant deviation in action recognition performance was observed when the Torso joint was used instead. [8]
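The text does not give the exact construction of the orthogonal basis. One plausible construction, shown here only as an assumption, takes the shoulder line as the first axis, the normal of the plane through the three joints as the third axis, and their cross product as the second axis. The function name `torso_basis` and the sample coordinates are hypothetical.

```python
import math


def sub(a, b):
    return tuple(a[i] - b[i] for i in range(3))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def normalize(a):
    length = math.sqrt(dot(a, a))
    if length == 0.0:
        raise ValueError("degenerate joint configuration")
    return tuple(c / length for c in a)


def torso_basis(left_shoulder, right_shoulder, right_hip):
    """Orthonormal basis (e1, e2, e3) spanned by the three torso joints."""
    e1 = normalize(sub(right_shoulder, left_shoulder))  # along the shoulders
    down = sub(right_hip, right_shoulder)               # roughly along the spine
    e3 = normalize(cross(e1, down))                     # normal to the torso plane
    e2 = cross(e3, e1)                                  # completes the basis
    return e1, e2, e3


# Illustrative joint positions in metres (assumed values).
e1, e2, e3 = torso_basis((-0.2, 0.5, 2.0), (0.2, 0.5, 2.0), (0.15, 0.0, 2.0))
```

Recomputing this basis for every frame yields a subject-centred coordinate system that follows the torso as the subject turns.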

4.1.2 Action Representation

For realizing efficient action recognition, an appropriate representation is required that will satisfactorily handle the differences in appearance, human body type and execution of actions among individuals. For that purpose, the angles of the joints' relative positions are used in this work, which proved to be more discriminative than using, for example, the joints' normalized coordinates directly. Additionally, building on the fundamental idea of the previous section, all angles are computed using the Torso joint as reference, i.e. the origin of the spherical coordinate system is placed at the Torso joint position. For computing the proposed action representation, only a subset of the supported joints is used, because the trajectory of some joints mainly contains redundant or noisy information. To this end, only the joints that correspond to the upper and lower body limbs were considered after experimental evaluation, namely Left shoulder, Left elbow, Left wrist, Right shoulder, Right elbow, Right wrist, Left knee, Left foot, Right knee and Right foot. The velocity vector is approximated by the displacement vector between two successive frames, i.e. $v_i(t) = p_i(t) - p_i(t-1)$. The estimated spherical angles and angular velocities for frame $t$ constitute the frame's observation vector. Collecting the computed observation vectors for all frames of a given action segment forms the respective action observation sequence $h$, which is used for performing HMM-based recognition, as described in the sequel. [8]
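A minimal sketch of the per-frame observation vector follows. It makes several simplifying assumptions that are not taken from [8]: the spherical angles are measured in the sensor's axes rather than in the estimated torso basis, and the angular velocities are approximated by frame-to-frame differences of the angles. All function names are hypothetical.

```python
import math


def spherical_angles(joint, torso):
    """Polar angle theta and azimuth phi of a joint, with the torso as origin."""
    dx, dy, dz = (joint[i] - torso[i] for i in range(3))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        raise ValueError("joint coincides with the torso")
    theta = math.acos(max(-1.0, min(1.0, dz / r)))
    phi = math.atan2(dy, dx)
    return theta, phi


def wrap_angle(delta):
    """Map an angle difference into [-pi, pi) to avoid jumps at the branch cut."""
    return (delta + math.pi) % (2.0 * math.pi) - math.pi


def displacement(p_t, p_prev):
    """Velocity approximation v_i(t) = p_i(t) - p_i(t-1)."""
    return tuple(p_t[i] - p_prev[i] for i in range(3))


def observation_vector(joints_t, joints_prev, torso_t, torso_prev):
    """Angles and angle differences for every joint, in a fixed joint order."""
    obs = []
    for name in sorted(joints_t):
        theta, phi = spherical_angles(joints_t[name], torso_t)
        theta_prev, phi_prev = spherical_angles(joints_prev[name], torso_prev)
        obs += [theta, phi,
                wrap_angle(theta - theta_prev), wrap_angle(phi - phi_prev)]
    return obs


torso = (0.0, 0.0, 2.0)
now = {"left_wrist": (1.0, 0.0, 2.0)}
before = {"left_wrist": (0.0, 1.0, 2.0)}
obs = observation_vector(now, before, torso, torso)
velocity = displacement(now["left_wrist"], before["left_wrist"])
```

Stacking such vectors over all frames of an action segment gives the observation sequence that is passed to the recognizer.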

4.1.3 HMM based recognition

A Markov model is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. This model is too restrictive to be applicable to the current problem of interest, so the concept of the Markov model is extended to form the Hidden Markov Model (HMM). An HMM is a doubly embedded stochastic process with an underlying stochastic process that is not observable (it is hidden) but can only be observed through another set of stochastic processes that produce the sequence of observations [12].

HMMs are employed in this work for performing action recognition, due to their suitability for modelling pattern recognition problems. In particular, a set of J HMMs is employed, where an individual HMM is introduced for every supported action $a_j$. Each HMM receives as input the action observation sequence $h$ (as described above) and at the evaluation stage returns a posterior probability $P(a_j \mid h)$, which represents the observation sequence's fitness to the particular model. The developed HMMs were implemented using the software libraries of the Hidden Markov Model Toolkit (HTK). [8]
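The one-HMM-per-action scheme can be illustrated with a small sketch. It does not reproduce the HTK implementation: it assumes discrete observation symbols instead of continuous angle vectors, scores each model with the scaled forward algorithm described in [12], and assumes equal prior probabilities for all actions, so that the model with the highest likelihood also has the highest posterior. The two toy models are invented for illustration.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """log P(obs | model) for a discrete HMM via the scaled forward algorithm."""
    if not obs:
        raise ValueError("empty observation sequence")
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_prob += math.log(scale)
    return log_prob


def classify(obs, models):
    """Name of the action model that best explains the observation sequence."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Toy two-state left-to-right models over symbols 0 (arm low) and 1 (arm high).
left_to_right = [[0.6, 0.4], [0.0, 1.0]]
models = {
    "raise_arm": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "lower_arm": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}
label_up = classify([0, 0, 1, 1], models)
label_down = classify([1, 1, 0, 0], models)
```

With continuous observation vectors, the emission table would be replaced by a density such as a Gaussian mixture per state, while the forward recursion stays the same.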




The entire process is divided into two parts: initialization and working.



For smooth and error-free operation, the Kinect is initialized to its default mode. Initialization is done with the help of the calibration card provided by Microsoft, which helps to align the IR transmitter (Tx) and receiver (Rx) of the Kinect. Fig 1 shows the default joint locations; these are treated as the reference joints, and the other joints are calibrated with their help.



Initially, infrared (IR) rays are emitted from the IR transmitter of the Kinect camera. The emitted rays are received by the Kinect receiver, and the data is stored in its database. Since the system is monitoring for human joints, it waits until human joints are recognized. If any object other than skeleton joints is recognized, it discards the frame and scans the next frame, repeating until joints are recognized. The black frame in Fig 2 indicates that neither an object nor skeletal joints have been detected. Such an image results in a blackened frame, and the white spots on the black frame are due to noise present in the environment. Once the joints are recognized, the Kinect uses the HMM algorithm for joint estimation and predicts future movements. The recognized joint information is converted into PWM pulses by the programmed PWM pulse generator on the Arduino board. The generated PWM pulses serve as input to the servo motors, which perform an angular tilt according to the captured movement. Since this is a real-time system, the entire process is repeated continuously for each frame.
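The conversion from a joint angle to a PWM pulse can be sketched as follows. Hobby servos are typically positioned by the width of a pulse repeated about every 20 ms rather than by a pulse count, so this sketch maps an angle to a pulse width. The limits of 544 and 2400 microseconds are the defaults of the Arduino Servo library; whether the authors' firmware uses them is an assumption, and the code is a Python model of the mapping, not Arduino code.

```python
MIN_PULSE_US = 544        # Arduino Servo library default for 0 degrees
MAX_PULSE_US = 2400       # Arduino Servo library default for 180 degrees
FRAME_PERIOD_US = 20000   # assumed 50 Hz refresh period


def angle_to_pulse_us(angle_deg):
    """Linearly map a 0..180 degree angle to a pulse width in microseconds."""
    angle = max(0.0, min(180.0, float(angle_deg)))
    span = MAX_PULSE_US - MIN_PULSE_US
    return int(round(MIN_PULSE_US + span * angle / 180.0))


def duty_cycle(pulse_us):
    """Fraction of each refresh period during which the signal is high."""
    return pulse_us / FRAME_PERIOD_US


pulse_mid = angle_to_pulse_us(90)
duty_mid = duty_cycle(pulse_mid)
```

A servo's usable range may be narrower than these defaults, so the limits would normally be tuned for each motor.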



The framework required for the robot can be seen in Fig 6. Along with the robot, a PCB is made to interface the HS-311 and HS-55 servo motors. The PCB interface for the servos is designed so that the connections remain secure and the layout is neat and compact, as can be seen in Fig 5. Hence the Kinect camera is successfully interfaced through OpenNI and the skeleton is tracked.

Fig 2: Initialization of Kinect Camera




After analysing the studies mentioned above, it can be concluded that the Kinect is an incredible piece of technology, which has revolutionized the use of depth sensors in the last few years. Because of its relatively low cost, the Kinect has served as a great incentive for many projects in the most diverse fields, such as robotics and medicine, and some great results have been achieved. Throughout this project, it was possible to verify that although the information obtained by the Kinect may not be as accurate as that obtained by some other devices (e.g., laser sensors), it is accurate enough for many real-life applications, which makes the Kinect a powerful and useful device in many research fields. Thus a real-time motion capture robot has been integrated and tested using the Kinect camera. The paper proposes natural gesture-based communication with the robot. The skeleton tracking algorithm has been explained in detail for further work. The results are better than those of the techniques used before the Kinect camera.

Fig 5: PCB with Servo Interfaced

Learning from demonstration is the scientific field that studies one of the easier ways a human has to deal with a humanoid robot: mimicking the particular task the subject wants to see reproduced by the robot. To achieve this, a gesture recognition system is required. The paper presents a novel and cheap humanoid robot implementation along with a visual, gesture-based interface, which enables users to deal with it. Users are allowed to control the robot just by mimicking, in front of the depth camera, the gestures they want the robot to perform. This should be seen as preliminary work, where elementary interaction tools are provided, and it should be extended in many different ways, depending on the tasks of the robot. [11]



With the progress of Kinect technology in the last decade, it can be seen as a revolutionary tool in robotics. Further modifications may be as follows:

1. Here only a few joints are tracked. The tracking algorithm can be expanded to track all the joints in the human body, allowing more reliable and robust copying of human actions.

Fig 6: Robot Layout

2. The Kinect camera used is not portable, so reducing it to the size of a mobile phone camera would be a good future development.

3. The servo motors used could be further investigated and changed to make the system more robust and natural.

4. The robot built is fixed. Instead, it can be made mobile, so that it not only copies human actions but also moves around like a human.

5. It is possible to implement this project over a network: the Kinect camera feeds the data into the network and the robot receives the data from the network, so the robot can be controlled from any corner of the world.



[1] Agarwal, A., Triggs, B. “3D human pose from silhouettes by relevance vector Regression”. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp.882-888, 2004.

[2] Lorna HERDA, Pascal FUA, Ralf PLÄNKERS, “Skeleton-based motion capture for robust reconstruction of human motion”, in Proc. Computer Animation 2000, pp. 77-83, 2000.

[3] Clement Menier, Edmond Boyer, Bruno Raffin, “3D Skeleton-Based Body Pose Recovery”, in Proc. 3rd International Symposium, 3D Data Processing, Visualization and Transmission, pp. 389-396, 2006.

[4] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, “Real-Time Human Pose Recognition in Parts from Single Depth Images”, in the Proc. Conference on Computer Vision and Pattern Recognition, pp. 1297-1304, 2011.

[5] Dnyaneshwar R. Uttaarwar, “Motion Computing using Microsoft Kinect”, in the Proc. National conference on advances on computing, 2013.

[6] Z. Zhang, “Microsoft Kinect Sensor and Its Effect”, in IEEE Multimedia Magazine, vol. 19, no. 2, pp. 4-10, April-June 2012.

[7] James Ashley and Jarrett Webb, (Ed.), Beginning Kinect Programming with the Microsoft Kinect SDK, Apress, 2011.

[8] Georgios Th. Papadopoulos, Apostolo Axenopoulo and Petros Daras, “A Compact Multi-view Descriptor for 3D Object Retrieval”, in Content-Based Multimedia Indexing, pp.115-119, 2009.

[9] Michael Margolis, (Ed.), Arduino Cookbook, O’Reilly, 2011.

[10] Jack Purdum, (Ed.), Beginning C for Arduino, Apress, 2011.

[11] Giuseppe Broccia, Marco Livesu, & Riccardo Scateni, “Gestural Interaction for Robot Motion Control”, in the Proc. Eurographics Italian Chapter Conference, 2011.

[12] Lawrence R Rabiner, “A Tutorial on Hidden Markov Model & Selected Applications in Speech Recognition”, in Proc. IEEE 77, no. 2, pp 257-286, 1989.




