Face Detection using Depth Data

As discussed earlier, previous 3D facial expression studies have mainly been carried out on publicly available 3D expression datasets such as BU-3DFE (Yin et al., 2006) and BU-4DFE (Yin et al., 2008), for discrete expression classification and recognition of facial action units (AUs). These datasets capture only the face region, which means they can be used directly without the need to extract the face region. Unlike these datasets, the dataset introduced in Chapter 3 captures the full scene rather than only the face region, so the face must be located before the data can be used for continuous affect recognition. One way to do this is to align the depth image with the colour (2D) image using camera calibration. Once the depth image is aligned, the face detection result from the colour image can be projected onto the depth image to locate the face. However, this approach is limited by the fact that face detection in colour images is highly sensitive to illumination, so under low-light conditions it may no longer work. Depth images are more robust to illumination changes than colour images, which means the depth data can still be used when the face is not visible in the 2D image. To take advantage of this, the face must be detected directly in the depth image.

Various methods have been proposed for face detection using depth data. For instance, Colombo et al. (2006) performed 3D face detection by first identifying candidate eyes and noses using curvature analysis, and then passing the candidate regions to a PCA-based classifier. In the work of Mian et al. (2007), face detection is achieved by first finding the location of the nose tip and then localising the face region with a cropping sphere centred at the nose tip. Nair and Cavallaro (2009) proposed using a point distribution model for face detection. Although different methods have been proposed, they usually require high-resolution depth data, which differs from the data provided by the Microsoft Kinect. In this section, a method is proposed that combines Histogram of Oriented Gradients (HOG) features (Dalal and Triggs, 2005) with a structural SVM based training algorithm (King, 2015) to locate faces in the low-resolution depth images obtained from a Microsoft Kinect.

5.2.1 Data Collection and Annotation

To train the face detector, 420 depth images with various head poses were extracted from the dataset captured in Chapter 3. On average, 30 depth images were extracted for each participant. The 420 depth images were then annotated manually by drawing a bounding box around the face region. The images were then split into person-independent groups, as shown in Figure 5.2, where each group consists of 30 images of the same participant.
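The person-independent grouping can be illustrated with a minimal sketch. The function names, the `(participant_id, image_name)` sample format and the round-robin assignment of groups to held-out sets are assumptions made for illustration only; the text does not specify how the groups were stored or combined.

```python
from collections import defaultdict


def group_by_participant(samples):
    """Collect image names per participant.

    `samples` is an iterable of (participant_id, image_name) pairs
    (a hypothetical format chosen for this sketch).
    """
    groups = defaultdict(list)
    for participant_id, image_name in samples:
        groups[participant_id].append(image_name)
    return dict(groups)


def person_independent_splits(groups, n_splits):
    """Yield (train_ids, test_ids) pairs in which whole participants are held out.

    Participants are assigned to held-out sets round-robin (an assumption),
    so no participant ever appears on both sides of a split.
    """
    ids = sorted(groups)
    held_out_sets = [ids[i::n_splits] for i in range(n_splits)]
    for test_ids in held_out_sets:
        train_ids = [p for p in ids if p not in test_ids]
        yield train_ids, test_ids
```

Because the split is made over participant identifiers rather than over individual images, every image of a held-out participant is excluded from training.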

Figure 5.2: Samples of extracted depth image and the corresponding colour image.

5.2.2 Data Pre-processing

To increase the detection accuracy, the OpenNI library was first used to remove the background from the depth image, as shown in Figure 5.3. Before training, each image was up-sampled by a factor of 2 to allow detection of small faces. A mirrored version of each training image was then added, since human faces are generally left-right symmetric, doubling the number of images to 840. The raw depth data range from 0 to 4096; these values were then normalised to the range 0 to 255 (8 bit).
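The pre-processing steps can be sketched as follows, using plain nested lists as stand-ins for depth images. This is an illustration under stated assumptions, not the implementation used in the experiment: the nearest-neighbour interpolation, the linear scaling with rounding, and the `(left, top, right, bottom)` box convention are all choices made for this sketch, and background removal with OpenNI is not shown.

```python
RAW_MAX = 4096  # upper end of the raw depth range quoted in the text


def upsample2(image):
    """Up-sample by a factor of 2 with nearest-neighbour interpolation (assumed)."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(2)]  # repeat each pixel horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def mirror(image):
    """Return the left-right mirrored image."""
    return [row[::-1] for row in image]


def mirror_box(box, width):
    """Mirror a (left, top, right, bottom) box, right edge exclusive (assumed format)."""
    left, top, right, bottom = box
    return (width - right, top, width - left, bottom)


def normalise_depth(image, raw_max=RAW_MAX):
    """Linearly map raw depth values in [0, raw_max] to integers in [0, 255]."""
    return [[min(255, max(0, round(v * 255 / raw_max))) for v in row]
            for row in image]


def augment(images):
    """Append a mirrored copy of every image, doubling the set (e.g. 420 to 840)."""
    return list(images) + [mirror(img) for img in images]
```

When an image is mirrored, its annotated bounding box has to be mirrored as well, which is what `mirror_box` is for.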

5.2.3 Experimental Procedure

To extract the HOG features, an image pyramid that down-samples the image by a ratio of 5/6 at each level was built for each image. At each pyramid level, a sliding window of size 80 × 80 was applied and the HOG features were extracted. The structural SVM based training algorithm (King, 2015) was used to train the face detector. The complexity parameter (C) was set to 1 and the epsilon was set to 0.01. The Dlib library (King, 2009) was used throughout this experiment for both HOG feature extraction and SVM learning. The 5-fold cross-validation leave-one-out method was used to compute the accuracy of the face detector. To test the generalisability of the face detector, cross-validation was applied to the person-independent groups instead of all extracted images, which means the training set and the testing set never contain the same person.
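The pyramid and sliding-window scan can be illustrated with a small sketch that only enumerates level sizes and window positions. It is a conceptual illustration rather than Dlib's implementation: the integer (floor) rounding of level sizes, the stopping rule and the `stride` of 8 pixels are assumptions, and HOG extraction itself is omitted.

```python
def pyramid_sizes(width, height, window=80, num=5, den=6):
    """List (width, height) for each pyramid level.

    Each level is num/den (5/6) the size of the previous one. Levels are
    generated until the window no longer fits (assumed stopping rule).
    """
    sizes = []
    w, h = width, height
    while w >= window and h >= window:
        sizes.append((w, h))
        w, h = w * num // den, h * num // den  # floor rounding (assumed)
    return sizes


def window_positions(width, height, window=80, stride=8):
    """Top-left corners of every window x window patch that fits in the image.

    `stride` is a hypothetical step size chosen for this sketch.
    """
    return [(x, y)
            for y in range(0, height - window + 1, stride)
            for x in range(0, width - window + 1, stride)]
```

Scanning a fixed-size window over progressively smaller levels is what allows one 80 × 80 detector to respond to faces of different sizes in the original image.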

5.2.4 Experimental Results and Analysis

Figure 5.4 shows a visualisation of the learnt HOG descriptors. Both the 16-bit and 8-bit depth images give good face boundary detail; however, due to noise and the limited accuracy of the Kinect sensor, they provide less detail around the centre of the face.

Table 5.1 shows the face detection results using different image types. Owing to the relatively small number of testing subjects and the simple scene, both image types achieved very high face detection accuracy. Although the 16-bit depth image captures more facial structure than the 8-bit one because of its larger value range, this did not improve the detection accuracy. Figure 5.5 shows some examples of the detection results; the face detector performs well across various head poses.

Figure 5.3: (a) Original depth image. (b) Depth image with background removed.

Figure 5.4: Visualisation of learned HOG detector. (a) 16 Bit Depth. (b) 8 Bit Depth.

Table 5.1: Face detection results using depth images

            16 Bit Depth    8 Bit Depth
Accuracy    97%             97%

In document Investigating multi-modal features for continuous affect recognition using visual sensing (Page 123-128)