The aim of the framework is gesture recognition robust to the intra- and inter-personal variations of gestures, such as size, location and speed of a gesture and the orientation and position of the person relative to the sensor or to the robot.
Furthermore, because the preprocessing abstracts these variations away from the gesture recognition algorithm, fully testing the effects of the proposed framework requires a specifically designed dataset: one whose gesture performances vary in size, speed, location and direction within the same class, recorded both with the person standing still and with the person moving around.
The evaluation was performed on a subset of the GRaD framework. However, gesture disambiguation requires future validation.
5.5.1 Evaluation of the Algorithm
The algorithm used the OpenNI library for Microsoft Kinect for data acquisition, and it was implemented in C++ using the Qt framework on Linux. The evaluation of the algorithm was performed on a local HU dataset, recorded and compiled by the author. In addition, the algorithm was tested using the dataset provided by Arici et al. (2014).
HU Gesture Dataset
The local dataset was recorded and compiled by the author with 8 participants. It consists of the following gestures: “Counter-clockwise circle”, “Clockwise circle”, “Swipe from right to left”, “Swipe from left to right”, “Stop”, “Push”, “Clean” and “Call”. “Counter-clockwise circle” and “Clockwise circle” represent a circular gesture in front of the body along the frontal plane in the counter-clockwise and clockwise direction, respectively. “Swipe from left to right” and “Swipe from right to left” describe a motion in which the right hand moves from left to right, or from right to left, respectively, along the transverse plane. The “Stop” gesture is performed by raising the right hand in front of the performer to shoulder height, holding it for a brief moment, and then lowering it back to the resting position. The “Push” gesture is performed as if the performer were pushing a button in front of them at shoulder height with the right hand. The “Clean” gesture can be seen as a gesture to remove something from a table; it is performed by a repetitive motion of the right hand away from and toward the body along the transverse plane. The “Call” gesture is performed by raising the right hand laterally from a resting position in front of the body up next to the right shoulder, as if the performer were calling someone to join them. All gestures were performed with the right hand.
All subjects received a spoken description of the gesture they were about to perform, together with one exemplary gesture performance. All subjects performed 5 repetitions for each of the 8 gesture classes. Furthermore, every participant provided 10 negative samples, which were included in the dataset to approximate real-life conditions, where people might make unintentional hand movements, according to the taxonomy of Pavlovic et al. (1997). These samples included both static postures and dynamic motions. In this case, the participants were instructed to make motions that regularly occur but do not resemble the recorded gestures, such as scratching the shoulder or the head, pointing to a side, or holding the hands still. In some cases, 6 instead of 5 positive samples were recorded per participant per gesture class, and these extra samples were left in the dataset. In total, there were 327 positive testing samples and 83 negative testing samples.
Table 5.2: Confusion matrix of the recognition of the gestures from the HU dataset without preprocessing, using dissimilarity threshold of t = 75 (rows – actual class, columns – recognized class).

            CCW Circle  CW Circle  Swipe L  Swipe R  Stop  Push  Clean  Call  Neg
CCW Circle          28          0        0        0     0     0      0     0   13
CW Circle            0          3        0        0     0     0      0     0   38
Swipe L              0          0        4        0     0     0      0     0   36
Swipe R              0          0        0       26     0     0      0     0   15
Stop                 0          0        0        0    31     0      0     0   10
Push                 0          0        0        0     0    18      0     0   23
Clean                0          0        0        0     0     0      5     0   36
Call                 0          0        0        0     0     0      0    17   24
Negative             0          0        0        0     0     0      0     6   77
The sensor was placed at a height of h = 1.4 m, and the person performing the gestures for the training dataset was standing at a distance of d = 2.5 m from the sensor. Two datasets were recorded. The first dataset contains recordings from 8 participants, who were standing still and facing the sensor while gesturing. The second dataset contains recordings from 2 participants; it differs in that the participants were moving backward and forward while performing the gestures.
Training was performed using 2 training samples per gesture class. Every testing sample was compared against both training samples of every gesture class using dynamic time warping. For each class, the comparison returned the higher of the two dissimilarity values.
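The comparison step described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this chapter: the type and function names (`Point`, `Trajectory`, `dtwDissimilarity`, `classDissimilarity`) are hypothetical, and the Euclidean point distance, the three-neighbour step pattern and the normalization by warping-path length are assumptions that this section does not specify.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One hand position per frame (x, y, z). Names are illustrative only.
using Point = std::array<double, 3>;
using Trajectory = std::vector<Point>;

// Euclidean distance between two 3-D points (assumed point metric).
double pointDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < 3; ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Dynamic time warping between two trajectories. Returns the accumulated
// cost of the cheapest alignment divided by the number of aligned pairs,
// i.e. an average distance between aligned points (assumed normalization).
double dtwDissimilarity(const Trajectory& a, const Trajectory& b) {
    assert(!a.empty() && !b.empty());
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> cost(n + 1, std::vector<double>(m + 1, inf));
    std::vector<std::vector<std::size_t>> steps(n + 1, std::vector<std::size_t>(m + 1, 0));
    cost[0][0] = 0.0;
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            // Pick the cheapest of the three allowed predecessors.
            double best = cost[i - 1][j - 1];
            std::size_t bestSteps = steps[i - 1][j - 1];
            if (cost[i - 1][j] < best) {
                best = cost[i - 1][j];
                bestSteps = steps[i - 1][j];
            }
            if (cost[i][j - 1] < best) {
                best = cost[i][j - 1];
                bestSteps = steps[i][j - 1];
            }
            cost[i][j] = pointDistance(a[i - 1], b[j - 1]) + best;
            steps[i][j] = bestSteps + 1;
        }
    }
    return cost[n][m] / static_cast<double>(steps[n][m]);
}

// Compare a testing sample against both training samples of one class and
// return the higher of the two dissimilarities, as described in the text.
double classDissimilarity(const Trajectory& test,
                          const Trajectory& trainA,
                          const Trajectory& trainB) {
    return std::max(dtwDissimilarity(test, trainA),
                    dtwDissimilarity(test, trainB));
}
```

Taking the higher of the two values means a testing sample has to be close to both training samples of a class before that class can pass the threshold test.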
The gesture trajectories were saved in raw format, without any preprocessing of the data, storing the absolute (x, y, z) joint positions and the rotation matrices of the joints. Each recording is 120 frames long, to account for the preparation and retraction phases during the recording.
Evaluation on the HU Gesture Dataset
The dissimilarity threshold value was empirically set to t = 0.15 when the gesture was preprocessed using scaling. This means that the average distance between aligned points in the training and testing samples must not be larger than 7.5% of the maximum size of the gesture along the x, y and z axes for the testing gesture to be assigned to the class of the training gesture sample. When the gesture is not scaled, the threshold value was empirically set to t = 75. Only the positions of the right hand were used for comparison.
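The relation between the threshold and the quoted percentage can be written out explicitly. The symbols below are introduced here for illustration only: $(p_k, q_k)$ are the $K$ pairs of aligned points, $\bar{d}$ is their average distance, and $s$ is the maximum size of the scaled gesture. The value $s = 2$ is an inference from the two numbers given in the text (it would correspond, for example, to coordinates normalized to $[-1, 1]$); this section does not state the normalization range.

```latex
\bar{d} = \frac{1}{K} \sum_{k=1}^{K} \lVert p_k - q_k \rVert \le t,
\qquad
\frac{t}{s} = \frac{0.15}{2} = 0.075 = 7.5\%
```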
The following confusion matrices, where rows are actual gesture classes and columns are recognized gesture classes, show the results for different settings of the preprocessing pipeline:
• Table 5.2: No preprocessing.
• Table 5.3: Frame of reference (FoR) transformation.
• Table 5.4: FoR transformation, gestural alignment.
• Table 5.5: FoR transformation, gestural scaling.
• Table 5.6: FoR transformation, gestural alignment and scaling.
The important statistics are the numbers of true positives (i.e., correctly classified positive gestures), true negatives (i.e., non-classified negative gestures), false positives (i.e., misclassified positive or negative gestures) and false negatives (i.e., non-classified positive gestures).
Table 5.3: Confusion matrix of the recognition of the gestures from the HU dataset with frame of reference transformation, using dissimilarity threshold of t = 75 (rows – actual class, columns – recognized class).

            CCW Circle  CW Circle  Swipe L  Swipe R  Stop  Push  Clean  Call  Neg
CCW Circle          26          0        0        0     0     0      0     0   15
CW Circle            0         26        0        0     0     0      0     0   15
Swipe L              0          0       12        0     0     0      0     0   28
Swipe R              0          0        0       17     0     0      0     0   24
Stop                 0          0        0        0    28     0      0     0   13
Push                 0          0        0        0     0    35      0     0    6
Clean                0          0        0        0     0     0     22     0   19
Call                 0          0        0        0     0     0      0    40    1
Negative             0          0        0        0     0     0      0     3   80
Table 5.4: Confusion matrix of the recognition of the gestures from the HU dataset with frame of reference transformation and alignment of gestures, using maximum alignment difference of d = 0.2 and dissimilarity threshold of t = 0.15 (rows – actual class, columns – recognized class).

            CCW Circle  CW Circle  Swipe L  Swipe R  Stop  Push  Clean  Call  Neg
CCW Circle          33          0        0        0     0     0      0     0    8
CW Circle            0         36        0        0     0     0      0     0    5
Swipe L              0          0       35        0     0     0      0     0    5
Swipe R              0          0        0       36     0     0      0     0    5
Stop                 0          0        0        0    40     0      0     0    1
Push                 0          0        0        0     0    41      0     0    0
Clean                0          0        0        0     0     0     33     0    8
Call                 0          0        0        0     0     0      0    38    3
Negative             0          0        0        0     4     0     34     3   42
Table 5.5: Confusion matrix of the recognition of the gestures from the HU dataset with frame of reference transformation and scaling of gestures, using maximum alignment difference of d = 40 and dissimilarity threshold of t = 75 (rows – actual class, columns – recognized class).

            CCW Circle  CW Circle  Swipe L  Swipe R  Stop  Push  Clean  Call  Neg
CCW Circle          34          0        0        0     0     0      0     0    7
CW Circle            0         37        0        0     0     0      0     0    4
Swipe L              0          0       25        0     0     0      0     0   15
Swipe R              0          0        0       31     0     0      0     0   10
Stop                 0          0        0        0    39     0      0     0    2
Push                 0          0        0        0     0    41      0     0    0
Clean                0          0        0        0     0     0     35     0    6
Call                 0          0        0        0     0     0      0    41    0
Negative             0          0        0        0     0     0      0    26   57
Table 5.6: Confusion matrix of the recognition of the gestures from the HU dataset with frame of reference transformation, and alignment and scaling of gestures, using maximum alignment difference of d = 0.2 and dissimilarity threshold of t = 0.15 (rows – actual class, columns – recognized class).

            CCW Circle  CW Circle  Swipe L  Swipe R  Stop  Push  Clean  Call  Neg
CCW Circle          39          0        0        0     0     0      0     0    2
CW Circle            0         41        0        0     0     0      0     0    0
Swipe L              0          0       38        0     0     0      0     0    2
Swipe R              0          0        0       38     0     0      0     0    3
Stop                 0          0        0        0    41     0      0     0    0
Push                 0          0        0        0     0    41      0     0    0
Clean                0          0        0        0     0     0     35     0    6
Call                 0          0        0        0     0     0      0    40    1
Negative             0          1        0        0     1     2      2     3   74
Table 5.7: Summary of evaluation results (all values in %)

             TPR    TNR    FPR    FNR    PPV    NPV
Table 5.2  40.37  92.77   7.23  59.63  95.65  28.31
Table 5.3  63.00  96.39   3.61  37.00  98.50  39.80
Table 5.4  89.30  50.60  49.40  10.70  87.69  54.55
Table 5.5  86.54  68.67  31.33  13.46  91.59  56.44
Table 5.6  95.72  89.16  10.84   4.28  97.20  84.09
These counts are summarized using the notions of true positive rate $TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$, true negative rate $TNR = \frac{TN}{N} = \frac{TN}{TN + FP}$, false positive rate $FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$, false negative rate $FNR = \frac{FN}{P} = \frac{FN}{TP + FN}$, precision (or positive predictive value) $PPV = \frac{TP}{TP + FP}$, and negative predictive value $NPV = \frac{TN}{TN + FN}$.
The results from tables 5.2-5.6 are summarized in Table 5.7.
Positive and negative predictive values give the proportion of actual positives and negatives among all outcomes classified as positive and negative, respectively. The positive predictive value was not affected significantly by any of the five settings of the preprocessing pipeline. However, the effect of the complete preprocessing is clearly visible in the negative predictive value, which increases from 28.31% without preprocessing (Table 5.2) to 84.09% with the complete pipeline (Table 5.6). This means that the algorithm is better at discarding negative samples when the preprocessing is used.
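The summary values can be recomputed from a confusion matrix with a few lines of code. The sketch below is illustrative only: the names `Confusion`, `Rates`, `summarize` and `table56` are hypothetical, and the matrix is copied from Table 5.6. Following the definitions in the text, any sample recognized as a wrong gesture class is counted as a false positive.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kGestures = 8;      // positive gesture classes
constexpr std::size_t kNeg = kGestures;   // index of the "Negative" row and column
using Confusion = std::array<std::array<int, kGestures + 1>, kGestures + 1>;

// All rates are expressed in percent.
struct Rates {
    double tpr, tnr, fpr, fnr, ppv, npv;
};

// Derive the six summary rates from a confusion matrix whose rows are
// actual classes and whose columns are recognized classes.
Rates summarize(const Confusion& c) {
    int tp = 0, fn = 0, fp = 0, tn = 0;
    for (std::size_t row = 0; row <= kNeg; ++row) {
        for (std::size_t col = 0; col <= kNeg; ++col) {
            const int count = c[row][col];
            if (row != kNeg && col == row) {
                tp += count;   // positive gesture, correct class
            } else if (row != kNeg && col == kNeg) {
                fn += count;   // positive gesture, not recognized
            } else if (row == kNeg && col == kNeg) {
                tn += count;   // negative sample, correctly discarded
            } else {
                fp += count;   // recognized as a wrong gesture class
            }
        }
    }
    assert(tp + fn > 0 && tn + fp > 0 && tp + fp > 0 && tn + fn > 0);
    const auto pct = [](int num, int den) { return 100.0 * num / den; };
    return {pct(tp, tp + fn), pct(tn, tn + fp), pct(fp, fp + tn),
            pct(fn, tp + fn), pct(tp, tp + fp), pct(tn, tn + fn)};
}

// Confusion matrix copied from Table 5.6 (complete preprocessing pipeline).
Confusion table56() {
    return {{
        {{39, 0, 0, 0, 0, 0, 0, 0, 2}},
        {{0, 41, 0, 0, 0, 0, 0, 0, 0}},
        {{0, 0, 38, 0, 0, 0, 0, 0, 2}},
        {{0, 0, 0, 38, 0, 0, 0, 0, 3}},
        {{0, 0, 0, 0, 41, 0, 0, 0, 0}},
        {{0, 0, 0, 0, 0, 41, 0, 0, 0}},
        {{0, 0, 0, 0, 0, 0, 35, 0, 6}},
        {{0, 0, 0, 0, 0, 0, 0, 40, 1}},
        {{0, 1, 0, 0, 1, 2, 2, 3, 74}},
    }};
}
```

For Table 5.6 this gives TP = 313, FN = 14, FP = 9 and TN = 74, which reproduces the last row of Table 5.7 after rounding to two decimals.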
Sehir University Gesture Dataset
The algorithm was also evaluated on the dataset compiled by Celebi et al. (2013). The dataset consists of the following self-descriptive gestures: “both hands pull down”, “both hands push up”, “both hands zoom in”, “both hands zoom out”, “left hand pull down”, “left hand push up”, “left hand swipe right”, “left hand wave”, “right hand pull down”, “right hand push up”, “right hand swipe left” and “right hand wave”. The authors originally evaluated their algorithm, called weighted dynamic time warping, on a subset of those gestures: “left hand push up”, “left hand pull down”, “left hand swipe right”, “right hand push up”, “right hand pull down” and “right hand swipe left”, as summarized in Table 5.8. The results obtained with the recognition algorithm presented here are shown in Table 5.9. Both approaches provide comparable results. However, negative samples were not present in the dataset. The positions of both the right and the left hand were used during the comparison.
Table 5.8: Confusion matrix of the recognition of the gestures from the Sehir University gesture dataset, values in percent, results from the paper by Celebi et al. (2013) (rows – actual class, columns – recognized class).

             R push U  L push U  R pull D  L pull D  R swipe L  L swipe R
R push up         100         0         0         0          0          0
L push up           0       100         0         0          0          0
R pull down         0         0       100         0          0          0
L pull down         0         0         0        85         15          0
R swipe L           0         0         0         0        100          0
L swipe R           0         0         0         0         15         95
Table 5.9: Confusion matrix, using the recognition algorithm presented in this chapter, using maximum alignment difference of d = 0.6 and recognition threshold of t = 0.58 (rows – actual class, columns – recognized class).

             R push U  L push U  R pull D  L pull D  R swipe L  L swipe R
R push up         100         0         0         0          0          0
L push up           0     97.31      2.68         0          0          0
R pull down         0         0      97.5      1.65          0       0.83
L pull down         0         0         0       100          0          0
R swipe L           0         0         0         0        100          0
L swipe R        2.59         0         0      1.72          0      95.69