Chapter 7 Evaluating FER under Harsh Lighting with an Enhanced HDR
7.4 High Dynamic Range Database
7.4.3 Deep Learning Approach
The accuracy of a DL model depends heavily on the amount of data used to train it. An accurate Convolutional Neural Network (CNN) typically requires thousands to millions of data samples to learn the weights for a classification problem, and training can take a long time. A common alternative to training a CNN from scratch is to use a pre-trained model (transfer learning), which extracts features from a new data set automatically and can be accelerated on a GPU. Compared with building a new CNN, this is an important simplification: it significantly speeds up the application of a DL model without requiring a huge data set or a very long training time. Because a trained DL model can be reused across many applications, it is logical to extend DL techniques to our FER problem.
Deep Learning (DL) uses multiple nonlinear processing layers to learn useful feature representations directly from the data. In this chapter, the CNN architecture of the DL model [MMMK03] is applied directly to the presented image data instead of training a machine to perform image classification. DL provides good image understanding, particularly by learning the features used for classification directly from images. This reduces the need for manual feature extraction and makes it possible to extract features from the training data that were not defined in advance.
A CNN [VL15] is a function $f$ mapping data $x$ (such as an image) to an output vector $y$. The function $f = f_L \circ \cdots \circ f_1$ is the composition of a sequence of simple functions $f_l$, known as computational blocks or layers. Let $x_1, x_2, \ldots, x_L$ be the outputs of each layer in the network and let $x_0 = x$ denote the network input. Each output $x_i = f_i(x_{i-1}; w_i)$ is computed from the previous output $x_{i-1}$ by applying the function $f_i$ with parameters $w_i$. The data flowing through the network has a spatial structure: each $x_i$ is a 3D array representing a feature field. A fourth non-singleton dimension allows batches of images to be processed in parallel, which improves network efficiency. The network is called convolutional because the functions $f_i$ are non-linear filters that act as local, translation-invariant operators.
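The layer composition described above can be illustrated with a minimal Python sketch. The two layers used here (`conv1d` and `relu`), the kernel values and the toy input are hypothetical choices made for illustration only; they are not the layers of the network used in this chapter, and the data is 1D rather than a 3D feature field to keep the example short.

```python
def conv1d(x, w):
    """Local, translation-invariant filter: slide the kernel w along x.

    As is usual in CNN libraries, the kernel is not flipped (cross-correlation).
    """
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]


def relu(x, w=None):
    """Parameter-free non-linearity applied element-wise."""
    return [max(0.0, v) for v in x]


def forward(x, layers):
    """Compute x_i = f_i(x_{i-1}; w_i) for every layer and keep all outputs."""
    outputs = [x]  # outputs[0] is the network input x_0
    for f, w in layers:
        outputs.append(f(outputs[-1], w))
    return outputs


# Hypothetical two-layer network: a difference filter followed by a ReLU.
layers = [(conv1d, [1.0, -1.0]), (relu, None)]
outputs = forward([0.0, 1.0, 3.0, 2.0], layers)
y = outputs[-1]  # output of the last layer, f(x)
```

Keeping every intermediate output in `outputs` mirrors the notation $x_0, x_1, \ldots, x_L$; a practical implementation would operate on 4D arrays so that a whole batch of images is processed at once.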
Training a Deep Learning Model
The HDR database contains 498 images for 4 emotional classes, which is insufficient for training a DL CNN model from scratch. To train a DL model for FER with the HDR database, we adopted the pre-trained AlexNet network [DDS∗09], one of the popular CNNs pre-trained on the ImageNet dataset, which has 1000 object categories and 1.2 million training images. It has been established that CNNs pre-trained on a large collection of varied image data generalise well to scenarios on which they have not been trained. As pointed out in [LLSZ15], pre-trained CNNs have been shown to outperform manual feature extraction techniques such as SURF, HOG (histogram of oriented gradients) and LBP (local binary patterns).
Classification with Deep Learning Model
Traditional neural networks are structured as layers, each consisting of a set of interconnected nodes. A CNN convolves learned features with the input data using 2D convolutional layers, which makes the architecture well suited to processing 2D data such as images. As described earlier, a CNN also doubles as a classifier. Since we are using a pre-trained CNN, our classifier is described as follows [VL15]. Let the output $\hat{y} = f(x)$ be a vector of probabilities, one for each of the 1,000 possible image labels (faces, horses, hats, etc.). If the label of our image $x$ is $y$, then the loss function $\ell_y(\hat{y}) \in \mathbb{R}$ assigns a penalty to classification errors. The CNN parameters can therefore be learned further in order to minimise the average loss over the dataset.
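The idea of a per-image penalty averaged over a dataset can be sketched as follows. The text does not state which loss is used, so this sketch assumes the negative log-likelihood (cross-entropy) of the true label, a common choice for probability outputs. The function names and the toy scores for four classes are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw class scores into a probability vector y_hat."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_loss(y_hat, y):
    """Penalty for an image whose true label index is y (assumed loss)."""
    return -math.log(y_hat[y])


def average_loss(batch_scores, labels):
    """Average the per-image loss over a labelled dataset."""
    losses = [log_loss(softmax(s), y) for s, y in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


# Toy scores for two images over four classes (illustrative values only).
scores = [[2.0, 0.5, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0]]
labels = [0, 3]
mean_loss = average_loss(scores, labels)
```

Training adjusts the parameters so that `mean_loss` decreases; the loss is small when the probability assigned to the true label is close to one and grows as that probability falls.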
7.5 Results and Discussion
In this section, we present results from a series of experiments carried out to evaluate the performance of FER on the enhanced HDR DB. Performance was evaluated across different lighting conditions (back, left and overhead), with the lights both combined and separate, to test the effect of harsh light on emotional faces.
For these results, we evaluate the accuracy of the algorithm in learning a set of faces from training images and then correctly recognising the same people in a test set of different images. For both techniques, SURF and CNN, we divided our data into two subsets: a training set containing 80% of the data and a validation set containing the remaining 20%. The training and validation sets contain the same people. To reduce bias and variance in our results, we use Monte Carlo cross-validation [JL10]: the random split is repeated five times and the recognition rates are averaged over the five trials [ZTC14a]. With the SURF technique, the classification problem is a mutually exclusive one (multi-class classification). As discussed in the previous chapter, multi-class SVM classifiers are therefore learnt and applied on each training set. Finally, the decisions of all classifiers for SURF and CNN are taken as the recognition rate.
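The evaluation protocol above (a random 80/20 split, repeated five times, with the recognition rates averaged) can be sketched as follows. The function names, the fixed seed, the dummy labels and the placeholder majority-class classifier are all hypothetical; in the actual experiments the `evaluate` step would train and test the SURF/SVM or CNN pipeline.

```python
import random


def monte_carlo_cv(samples, evaluate, n_trials=5, train_fraction=0.8, seed=0):
    """Repeat a random train/validation split and average the recognition rate."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    n_train = int(round(train_fraction * len(samples)))
    rates = []
    for _ in range(n_trials):
        shuffled = samples[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        train, val = shuffled[:n_train], shuffled[n_train:]
        rates.append(evaluate(train, val))
    return sum(rates) / len(rates), rates


def majority_class_rate(train, val):
    """Placeholder classifier: always predict the most frequent training label."""
    train_labels = [label for _, label in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(1 for _, label in val if label == majority) / len(val)


# 498 dummy (image_id, label) pairs over 4 classes; the labels are made up.
samples = [(i, i % 4) for i in range(498)]
mean_rate, rates = monte_carlo_cv(samples, majority_class_rate)
```

With 498 images, each trial uses 398 images for training and 100 for validation. This sketch splits purely at random and does not by itself guarantee that every person appears in both subsets.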
Dataset
We generated six datasets from our HDR DB: four produced with TMOs (display adaptive, logarithmic, Drago and Reinhard), plus zero-exposure and optimal-exposure datasets. For the FER task, facial alignment or adjustment may distort or weaken the expression features, so we simply use the original images, cropped with Photoshop.