Research into detecting micro-movements, as opposed to recognising micro-expressions using machine learning, is very limited. The focus on recognition approaches, as discussed in Section 2.6.5.2, is a result of assuming that the task of putting micro-expressions into categories is trivial. The results of many of these studies show very few achieving higher than chance, with some falling well below 50% accuracy. Going against the machine learning approach can be controversial, but treating micro-movements objectively as skin deformations of the face could lead to better detection methods and, ultimately, a better understanding of micro-facial movements.
Shreve et al. [6, 93, 107] proposed a novel solution for segmenting macro- and micro-expression frames in video sequences by calculating the strain magnitude in an optical flow field, which corresponds to the elastic deformation of facial skin tissue. The authors' technique of using strain magnitude is the most natural way of determining whether a micro-expression has occurred, as this is how a human would interpret the movement using their visual system. The thresholding technique used to determine a macro-expression is also used for micro-expressions; however, it is more constrained due to their rapid speed and spatial locality on the face. First, macro-expressions are removed using the FE detection algorithm. Next, two additional criteria are added:
1. the strain magnitude has to be larger than in the surrounding regions;
2. the duration of this increased strain can be no more than 1/5 of a second.
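The two criteria above can be illustrated with a minimal sketch. It assumes a dense optical flow field is already available as two arrays `u` and `v` (horizontal and vertical components) and that the face has been divided into regions with a mean strain value per frame. The function names, the `ratio` parameter and the choice of norm (the Frobenius norm of the infinitesimal strain tensor) are illustrative assumptions and may differ from the exact formulation in [6, 93, 107].

```python
import numpy as np

def strain_magnitude(u, v):
    """Per-pixel strain magnitude of a dense flow field (u, v)."""
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                      # normal strain along x
    e_yy = dv_dy                      # normal strain along y
    e_xy = 0.5 * (du_dy + dv_dx)      # shear strain (symmetric tensor)
    return np.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)

def spot_micro(region_strain, fps, ratio=2.0):
    """Apply the two criteria to a (frames, regions) array of mean strain.

    Returns (region, start, end) tuples with `end` exclusive.
    `ratio` is a hypothetical tuning parameter, not a value from the paper.
    """
    n_frames, n_regions = region_strain.shape
    max_len = int(fps / 5)  # criterion 2: at most 1/5 of a second
    total = region_strain.sum(axis=1, keepdims=True)
    others_mean = (total - region_strain) / (n_regions - 1)
    active = region_strain > ratio * others_mean  # criterion 1
    events = []
    for r in range(n_regions):
        t = 0
        while t < n_frames:
            if active[t, r]:
                start = t
                while t < n_frames and active[t, r]:
                    t += 1
                if t - start <= max_len:
                    events.append((r, start, t))
            else:
                t += 1
    return events
```

Under these assumptions, a short burst of strain confined to one region is kept, while a longer burst is rejected by the duration limit.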
As with other work in this field, the limited availability of datasets meant that a comprehensive evaluation could not be completed; the authors used the University of South Florida-High Definition (USF-HD) [6] and Canal-9 [108] datasets, together with videos found on the Internet. None of these were recorded at a high-speed frame rate. The ground truth coding of the datasets was not performed by trained FACS coders, so the reliability and consistency of what is and is not labelled as an expression cannot be certain. Because it relies on optical flow and spatial skin deformation, the paper does not use any temporal methods, and the approach results in a 44% false positive rate for detecting micro-expressions.
Moilanen et al. [109] used an appearance-based feature difference analysis method that incorporates chi-squared (χ²) distance and peak detection to determine when a change crosses a threshold and can be classed as a movement. This is a more objective method that does not require machine learning. The datasets used are CASME-A and CASME-B [16] and the original data from SMIC [15] (not currently publicly accessible). For CASME-A the spotting accuracy (True Positive Rate (TPR)) was 52% with 30 False Positives (FPs), CASME-B achieved 66% with 32 FPs and SMIC-VIS-Extended (E) achieved 71% with 23 FPs. The threshold value for peak detection was set semi-automatically, with a percentage value in the range [0, 1] being manually set for each dataset. Only spatial appearance is used for descriptor calculation, which leaves out the temporal planes associated with video volumes.
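The feature difference idea can be sketched as follows. This is a simplified illustration rather than the implementation in [109]: it assumes one normalised histogram per frame (the original works on blocks of the face), compares each frame with the average of the frames `k` before and after it, and assumes a threshold placed a fraction `p` of the way between the mean and the maximum difference. The function names and parameters `k` and `p` are hypothetical.

```python
def chi_squared(p, q, eps=1e-12):
    """Chi-squared distance between two histograms of equal length."""
    return sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))

def feature_difference(hists, k):
    """Distance of each frame to the average of the frames k before and k after.

    Frames closer than k to either end keep a difference of 0.0.
    """
    diffs = [0.0] * len(hists)
    for i in range(k, len(hists) - k):
        avg = [(a + b) / 2.0 for a, b in zip(hists[i - k], hists[i + k])]
        diffs[i] = chi_squared(hists[i], avg)
    return diffs

def spot_peaks(diffs, p):
    """Indices of local maxima above mean + p * (max - mean), with p in [0, 1]."""
    mean = sum(diffs) / len(diffs)
    threshold = mean + p * (max(diffs) - mean)
    return [
        i
        for i in range(1, len(diffs) - 1)
        if diffs[i] > threshold and diffs[i - 1] <= diffs[i] >= diffs[i + 1]
    ]
```

Setting `p` closer to 1 keeps only the strongest peaks, which trades missed movements for fewer false positives; this is the value that would be tuned manually per dataset.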
By exploiting the feature difference contrast, Li et al. [9] proposed an algorithm that spots micro-expressions in videos. Further, they combine this with an automatic micro-expression analysis system to recognise expressions from the SMIC-VIS-E dataset as one of three categories: positive, negative or surprise. This spotting algorithm is very similar to the feature difference method seen in [109]; however, HOOF is compared alongside Local Binary Patterns (LBP). The highest AUC, 92.98%, was obtained using LBP on the CASME II dataset. On the SMIC-VIS-E dataset, the best performance also came from LBP, where about 70% of micro-expressions were spotted at a 13.5% False Positive Rate (FPR), with an AUC of 84.53%.
The micro-expression spotting system proposed in [9] still uses a block-based approach, as in [109], and therefore potentially includes redundant information. No face alignment is performed for these video clips, which could lead to head movements being falsely spotted. It should also be noted that the term 'accuracy' is used throughout [9] without being defined. Using more reliable statistical measures, such as recall, precision and F-measure, would allow the system to be scrutinised more effectively. Block-based approaches, as proposed in [9, 109], split the face into m × n blocks. Doing this can include a lot of redundant information, in other words, non-muscle movement. In addition, there is no indication of where on the face the movement occurs. A better solution is needed that focuses on the areas of the face that provide important information while reducing computational complexity.
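For reference, the measures suggested above can be computed directly from counts of true positives, false positives and false negatives. The sketch below uses the standard definitions; the function name is hypothetical and the counts in the usage note are invented for illustration, not taken from any of the cited papers.

```python
def spotting_scores(tp, fp, fn):
    """Return (precision, recall, F-measure) from detection counts."""
    # Precision: fraction of spotted movements that are real.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall (true positive rate): fraction of real movements that were spotted.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure
```

For example, `spotting_scores(tp=8, fp=2, fn=8)` gives a precision of 0.8 but a recall of only 0.5, a difference that a single undefined 'accuracy' figure would hide.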
Further, [9] tested untrained humans on spotting micro-expressions in 71 selected clips from the SMIC-VIS dataset. The spotting accuracy of the 15 participants was 71.11% (Standard Deviation (SD) = 7.22%); however, for untrained participants this seems very high, considering that other studies have found that many participants can only recognise around 50% on average [110, 111]. The second human study in [9], which mainly involved spotting when a micro-expression occurred, appears more typical, with a result of 49.74%.
Xia et al. [112] proposed a probabilistic framework with a random walk algorithm to detect spontaneous micro-expression clips temporally from a video sequence. Geometric deformation is captured by an ASM model [65] and is utilised as features which are robust to subtle head movement and illumination variation. The AdaBoost algorithm is then used to estimate the initial probability for each frame, and the transition probability is computed from deformation similarity. The method is validated on the SMIC [15] and CASME [16] datasets, where it achieved a spotting accuracy of 0.8693 and 0.9208 respectively. It should be noted that CASME II [17] is not used, despite being newer and containing more data.
Patel et al. [113] introduced a method using optical flow motion vectors, calculated within small ROIs built around facial landmarks, to detect the onset, apex and offset of a micro-movement. The first step detects 49 facial points, with the tracking of points over subsequent frames being calculated using optical flow vectors. Small ROIs are created around facial landmarks, giving 8 ROIs grouped around certain points. These are used to balance the inaccuracy of landmark detection, and the groups correspond to AUs in FACS. Finally, the method can remove head movements, eye blinks and eye gaze changes, which are common reasons for false positives in micro-movement detection methods, by the use of thresholding. The motion for an AU activation should be high, but low for other points. A peak frame is considered true if all the points of an AU group have a motion greater than a certain threshold. An attempt is made to get this system to perform in real-time; however, many of the computation times are in seconds, including the facial landmark detection and optical flow calculation. The method also only uses the SMIC dataset at 25 fps, which means that the micro-movements are not FACS coded and that the data has a limited temporal resolution for finding subtle motions. Computation is already slow at this frame rate, so computation times at higher frame rates would be even longer. The results detailed an AUC of 95%, but produced a high number of FPs.
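The group-based thresholding described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation in [113]: the motion magnitude per facial point is taken as given, and the two thresholds `high` and `low`, the function name and the group names are hypothetical.

```python
def active_groups(motion, groups, high, low):
    """Return the AU groups considered active in a single frame.

    motion: dict mapping facial point id -> motion magnitude in this frame.
    groups: dict mapping group name -> list of point ids in that group.
    A group is accepted only if every point in it moves more than `high`
    while every point outside it moves less than `low`.
    """
    hits = []
    for name, points in groups.items():
        members = set(points)
        inside = [motion[p] for p in points]
        outside = [m for p, m in motion.items() if p not in members]
        group_moves = all(m > high for m in inside)
        rest_still = all(m < low for m in outside)  # rejects global motion
        if group_moves and rest_still:
            hits.append(name)
    return hits
```

Requiring the remaining points to stay still is what filters out a head movement in this sketch, since a global motion raises the magnitude at every point rather than within one group only.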