Feature Difference Measures - Micro-facial movement detection using spatio-temporal features

3.5.1 Sum of Squared Differences

One of the simplest ways of measuring the difference between points is to take the sum of squared differences. In micro-movement analysis, it can be used to find the difference between frames in a micro-movement sequence. The method works on a 1D vector of frame values or, more likely, on histogram features that describe each frame, for example HOG or LBP.

There are two versions of the difference method. The first finds the difference between the first frame and the i-th frame. It is defined as

$$\sum_{i=1}^{n} (x_1 - x_i)^2 \tag{3.55}$$

where $x_i$ is the i-th frame in the video sequence and $n$ is the total number of frames. This equation is similar to using [...]. The second finds the difference between the i-th frame and the (i − 1)-th frame. It is defined as

$$\sum_{i=2}^{n} (x_i - x_{i-1})^2 \tag{3.56}$$

which, in contrast to the previous equation, compares the current frame with the previous frame as it iterates.
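The two variants can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not stated in the text: each frame is represented by a flat list of feature values (for example a histogram), and the functions return the individual per-frame terms, so that summing the returned list gives the totals in Eq. 3.55 and Eq. 3.56. The function names and the example data are hypothetical.

```python
def squared_difference(a, b):
    """Sum of squared element-wise differences between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((p - q) ** 2 for p, q in zip(a, b))


def ssd_to_first(frames):
    """Per-frame terms of Eq. 3.55: every frame compared with the first frame."""
    first = frames[0]
    return [squared_difference(first, frame) for frame in frames]


def ssd_to_previous(frames):
    """Per-frame terms of Eq. 3.56: every frame compared with its predecessor."""
    return [squared_difference(frames[i], frames[i - 1])
            for i in range(1, len(frames))]


# Hypothetical 3-bin feature vectors for four frames.
frames = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [2.0, 2.0, 4.0], [1.0, 2.0, 3.0]]

to_first = ssd_to_first(frames)        # [0.0, 1.0, 2.0, 0.0]
to_previous = ssd_to_previous(frames)  # [1.0, 1.0, 2.0]
```

Comparing against the first frame measures drift away from the starting appearance, whereas comparing against the previous frame responds only to frame-to-frame change.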

3.5.2 Chi-Square Distance

The concept of feature difference analysis for micro-movement analysis was introduced by Moilanen et al. [109]. Each current frame being processed is compared with an average feature frame, which is the average of the features of the start frame (the k-th frame before the current frame) and the end frame (the k-th frame after the current frame). The value of k is given by

$$k = \frac{1}{2}(N - 1) \tag{3.57}$$

where N is the micro-interval value, which is always set to an odd number. Moilanen et al. [109] set the micro-interval to 9 for the SMIC dataset [15], recorded at 25 fps, and to 21 for the CASME dataset [16], recorded at 60 fps.

The difference between the current frame and the average feature frame reflects the facial changes in a particular region. Because the comparison is confined to the interval between the start frame and the end frame, any change it captures must be rapid, which distinguishes quick changes from temporally longer events.

The difference analysis continues for each frame except the first and last k frames, for which the comparison would exceed the boundaries of the video. It also means that other rapid facial movements, such as blinks, would be classified as a movement.
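As a small worked illustration of Eq. 3.57 and of the boundary restriction, the sketch below computes k for a given micro-interval and lists the frames that can be analysed. The function names and the 0-based frame indexing are assumptions made for the example.

```python
def half_interval(micro_interval):
    """k = (N - 1) / 2 from Eq. 3.57; N must be a positive odd number."""
    if micro_interval < 1 or micro_interval % 2 == 0:
        raise ValueError("micro-interval N must be a positive odd number")
    return (micro_interval - 1) // 2


def analysable_frames(num_frames, micro_interval):
    """0-based indices of frames with a neighbour k frames before and after."""
    k = half_interval(micro_interval)
    return range(k, num_frames - k)


k_smic = half_interval(9)     # 4, for the micro-interval reported for SMIC
k_casme = half_interval(21)   # 10, for the micro-interval reported for CASME

# For a hypothetical 100-frame video with N = 9, frames 4 to 95 are analysed.
valid = analysable_frames(100, 9)
```

If the video is shorter than N frames, the returned range is empty and no frame can be analysed.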

The calculated difference can also be described as the dissimilarity between histograms in each region. The $\chi^2$ distance is one such measure and is defined as

$$\chi^2(P, Q) = \sum_{i} \frac{(P_i - Q_i)^2}{P_i + Q_i} \tag{3.58}$$

where i indexes the bins of the histograms P and Q, which have an equal number of bins. This equation can be used in all temporal planes (XY, XT, YT) to calculate dissimilarity.
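The distance in Eq. 3.58, and its use against the average feature frame, can be sketched as follows. This is an illustration only. Skipping bins that are empty in both histograms (to avoid dividing zero by zero) is an assumption of the sketch and is not specified in the text, and the function names are hypothetical.

```python
def chi_square_distance(p, q):
    """Chi-square distance of Eq. 3.58 between two equal-length histograms."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        denominator = p_i + q_i
        if denominator > 0:  # assumption: bins empty in both histograms add 0
            total += (p_i - q_i) ** 2 / denominator
    return total


def average_histogram(a, b):
    """Bin-wise average of two histograms (the average feature frame)."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def frame_difference(histograms, i, k):
    """Distance between frame i and the average of frames i - k and i + k."""
    if i - k < 0 or i + k >= len(histograms):
        raise IndexError("frame i needs neighbours k frames before and after")
    average = average_histogram(histograms[i - k], histograms[i + k])
    return chi_square_distance(histograms[i], average)


# Hypothetical 2-bin histograms for one block over three frames.
block_histograms = [[2.0, 2.0], [4.0, 0.0], [2.0, 2.0]]
difference = frame_difference(block_histograms, 1, 1)  # 4/6 + 4/2
```

In the method described here this calculation would be repeated for every block and every analysable frame.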

In many other histogram distance measures, differences between large bin values are less significant than differences between small bin values [147]. However, because this is a dissimilarity calculation in which each feature is deemed equally important, the difference should not depend on the scale of the bin values. Ahonen et al. [148] found that the $\chi^2$ distance performs better in face analysis tasks than histogram intersection or log-likelihood distance. It has also been applied successfully to applications such as object and text classification [149].

With the feature differences calculated for each block and frame of a video sequence, an initial feature vector is produced by taking the average of the M greatest block difference values. Moilanen et al. [109] chose 12 as the value of M, which corresponds to 12 blocks out of a total of 36. The initial feature vector is designed to represent the whole video and is defined as

$$F = \frac{1}{M} \sum_{j=1}^{M} \left(D_{1,j}, D_{2,j}, \ldots, D_{n,j}\right) \tag{3.59}$$

where D is an (n × b) matrix of the b block difference values, sorted in descending order for each frame, and n is the total number of frames. In this method the matrix would be (n × 36).
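Eq. 3.59 amounts to averaging, for each frame, the M largest block difference values. A minimal sketch is given below, assuming the block differences are held as a list of rows with one row per frame. The function name and the small example matrix are hypothetical.

```python
def initial_feature_vector(block_differences, m):
    """Mean of the m largest block differences for each frame (Eq. 3.59)."""
    features = []
    for row in block_differences:
        if not 1 <= m <= len(row):
            raise ValueError("m must be between 1 and the number of blocks")
        largest = sorted(row, reverse=True)[:m]  # descending order, keep top m
        features.append(sum(largest) / m)
    return features


# Hypothetical differences for two frames and four blocks.
D = [[1.0, 5.0, 3.0, 2.0],
     [0.0, 0.0, 4.0, 8.0]]
F = initial_feature_vector(D, 2)  # [4.0, 6.0]
```

With 36 blocks and M = 12, each value of F would be the mean of the 12 largest block differences for that frame.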

It was determined in [109] that setting M to around one third of the block difference values gives better discrimination than, for example, a single maximum value or an average of all block values. The choice of one third was not justified in detail, and more information would be required to assume it is correct.

To reduce noise and local magnitude variations, a contrasted feature vector was created by subtracting the average of the surrounding tail frame and head frame values from each current frame value in F. Therefore, the i-th value in the contrasted feature vector is defined as

$$C_i = F_i - \frac{1}{2}\left(F_{i+k} + F_{i-k}\right) \tag{3.60}$$

where $C_i$ and $F_i$ are the i-th values of the contrasted feature vector and the initial feature vector respectively. The complete contrasted feature vector is calculated for the whole video by applying Eq. 3.60 across the n frames. Due to the temporal nature of this method, the first and last k frames cannot be included in the calculation.

After C is created, all negative values are set to zero, as these denote no rapid change in the current frame compared with the tail frame and head frame.
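The contrast step of Eq. 3.60 and the zeroing of negative values can be sketched together. Marking the first and last k frames with None, so that indices stay aligned with the video, is a choice made for this illustration and is not taken from [109].

```python
def contrasted_feature_vector(f, k):
    """Eq. 3.60 with negative values set to zero.

    Frames without a neighbour k frames before and after are left as None.
    """
    n = len(f)
    c = [None] * n
    for i in range(k, n - k):
        value = f[i] - 0.5 * (f[i + k] + f[i - k])
        c[i] = max(value, 0.0)  # negative values denote no rapid change
    return c


# Hypothetical initial feature vector for seven frames, with k = 1.
F = [1.0, 1.0, 5.0, 1.0, 1.0, 0.0, 3.0]
C = contrasted_feature_vector(F, 1)
# C == [None, 0.0, 4.0, 0.0, 0.5, 0.0, None]
```

The isolated high value at index 2 survives the contrast step, while its neighbours, which are lower than the average of their surroundings, are set to zero.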

Finally, a single threshold value is calculated to determine the level a peak must cross to be counted as a micro-facial movement. This is calculated as

$$T = C_{mean} + p \times (C_{max} - C_{mean}) \tag{3.61}$$

where $C_{max}$ and $C_{mean}$ are the maximum and average values of C for the whole video, and p is a percentage parameter in the range 0 to 1. Peak detection was reportedly used; however, how it was done is not stated explicitly, so there is a chance that the peak annotation was manual.
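The threshold of Eq. 3.61 is straightforward to compute. The sketch below assumes the contrasted feature vector is a list in which frames that could not be analysed are marked with None (an assumption of this illustration), and it simply lists the frames whose value exceeds T. This is not a substitute for the peak detection step, which is not specified in [109].

```python
def movement_threshold(c, p):
    """T = C_mean + p * (C_max - C_mean), Eq. 3.61, ignoring None entries."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in the range 0 to 1")
    values = [v for v in c if v is not None]
    if not values:
        raise ValueError("no analysable frames in the contrasted vector")
    c_mean = sum(values) / len(values)
    c_max = max(values)
    return c_mean + p * (c_max - c_mean)


def frames_above_threshold(c, p):
    """Indices of frames whose contrasted value exceeds the threshold."""
    t = movement_threshold(c, p)
    return [i for i, v in enumerate(c) if v is not None and v > t]


# Hypothetical contrasted feature vector; mean 0.9, maximum 4.0.
C = [None, 0.0, 4.0, 0.0, 0.5, 0.0, None]
T = movement_threshold(C, 0.5)            # 0.9 + 0.5 * 3.1, about 2.45
candidates = frames_above_threshold(C, 0.5)  # [2]
```

Setting p = 0 places the threshold at the mean of C and p = 1 places it at the maximum, so larger values of p admit fewer candidate frames.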

3.5.3 Peak Detection

To detect peaks automatically from feature difference vectors, a signal processing method can be applied [150]. The first derivative of the signal is smoothed, and then peaks are found from the downward-going zero-crossings. Only the zero-crossings whose slope exceeds a defined ‘slope threshold’, at a point where the original signal exceeds a certain height or ‘amplitude threshold’, are retained as peaks.
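A simplified sketch of this procedure is given below. It is not the implementation in [150]. The central-difference derivative, the boxcar smoothing window and all function and parameter names are assumptions chosen to keep the example short.

```python
def first_derivative(y):
    """Central differences, with one-sided differences at the two ends."""
    n = len(y)
    if n < 2:
        raise ValueError("signal needs at least two samples")
    d = [0.0] * n
    d[0] = y[1] - y[0]
    d[-1] = y[-1] - y[-2]
    for i in range(1, n - 1):
        d[i] = (y[i + 1] - y[i - 1]) / 2.0
    return d


def moving_average(x, width):
    """Boxcar smoothing; the window shrinks near the ends of the signal."""
    half = width // 2
    smoothed = []
    for i in range(len(x)):
        lo = max(0, i - half)
        hi = min(len(x), i + half + 1)
        smoothed.append(sum(x[lo:hi]) / (hi - lo))
    return smoothed


def find_peaks(y, slope_threshold, amp_threshold, smooth_width=3):
    """Indices of peaks found from downward zero-crossings of the derivative."""
    d = moving_average(first_derivative(y), smooth_width)
    peaks = []
    for i in range(len(d) - 1):
        crosses_down = d[i] > 0 and d[i + 1] <= 0
        steep_enough = d[i] - d[i + 1] > slope_threshold
        if crosses_down and steep_enough:
            j = i if y[i] >= y[i + 1] else i + 1  # higher of the two samples
            if y[j] > amp_threshold:
                peaks.append(j)
    return peaks


# Hypothetical signal: one large peak at index 4 and a small bump at index 10.
signal = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0.2, 0.3, 0.2, 0, 0]
strict = find_peaks(signal, slope_threshold=0.5, amp_threshold=1.0)  # [4]
loose = find_peaks(signal, slope_threshold=0.0, amp_threshold=0.0)   # [4, 10]
```

Raising either threshold rejects the small bump, which shows how the two thresholds separate broad or low-amplitude features from the sharper peaks of interest.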

3.5.3.1 Temporal Phases of Peaks

Estimating the apex simply uses the highest point of the found Gaussian curves. Estimating the ‘start’ and ‘end’ of the peak (the onset and offset respectively) is more arbitrary, because typical peak shapes approach the baseline asymptotically far from the peak maximum. Peak start and end points can be set to around 1% of the peak height, but random noise on the points during peak detection will often be a large fraction of the signal amplitude at that level.

Smoothing is done within this process to reduce noise; however, this can distort the peak shapes and shift the start and end points. One solution is to fit each peak to a model shape, then calculate the peak start and end from the model expression. This minimises the noise problem by fitting the data over the entire peak, but it works only if the peaks can be accurately modelled.

Figure 3.13: The temporal phases used to denote the onset, apex and offset of micro-facial movements, facial expressions and other similarly shaped difference features.

For example, a Gaussian peak reaches a fraction a of its peak height at

$$x = p \pm \frac{\sqrt{w^2 \log(1/a)}}{2\sqrt{\log(2)}}$$
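The expression can be checked numerically with a short sketch. It assumes that p is the position of the peak maximum and w is the full width at half maximum of the Gaussian, which is the reading under which the formula holds. The function names are hypothetical.

```python
import math


def gaussian_fraction_points(p, w, a):
    """Positions where a Gaussian centred at p with FWHM w falls to fraction a."""
    if not 0.0 < a <= 1.0:
        raise ValueError("a must lie in the interval (0, 1]")
    if w <= 0:
        raise ValueError("w must be positive")
    offset = math.sqrt(w ** 2 * math.log(1.0 / a)) / (2.0 * math.sqrt(math.log(2.0)))
    return p - offset, p + offset


def gaussian(x, p, w):
    """Unit-height Gaussian centred at p with full width at half maximum w."""
    return math.exp(-4.0 * math.log(2.0) * (x - p) ** 2 / w ** 2)


# At a = 0.5 the two points sit half a width either side of the centre.
half_start, half_end = gaussian_fraction_points(10.0, 4.0, 0.5)

# Onset and offset estimates at 1% of the peak height.
onset, offset = gaussian_fraction_points(10.0, 4.0, 0.01)
```

Evaluating the model at the returned onset and offset recovers the requested fraction of the peak height, which is the property used to place the start and end points on a fitted peak.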
