Alignment Based Siamese Network Model For Face Verification


Rizky Agung Dwi Putranto and Wahyono

Abstract: Face verification, a subset of the face recognition task, is one of the popular topics in machine vision. The task of face verification is to accept or reject the claim that two images show the same person. Face recognition in general faces several problems, one of which is pose variation. An alignment process is expected to resolve the pose variation problem. The convolutional neural network (CNN), on the other hand, is good at recognizing patterns and works well as a feature extractor for images. We therefore build a siamese network based on a CNN for the verification task. The result of this study is that the alignment process can increase the accuracy of the system: the accuracy is 51.7% with no alignment and 68.9% when using alignment.

Index Terms: Face recognition, face verification, face alignment, convolutional neural network (CNN), siamese network, pretrained CNN, pose variations


1 INTRODUCTION

With rising crime and rapid technological development, autonomous systems for authentication and security are required, especially in public places such as airports, bus stations, train stations, offices, and factories. A biometric system uses human physical characteristics to distinguish one person from another. Characteristics used in biometrics include fingerprint, iris, DNA, and face. With a biometric system, recognition is done without manually entering a person's information [1]. Face recognition, which aims to recognize a face from an image, is an active topic in machine vision [2]. Face recognition can be divided into two tasks: face identification and face verification [3]. Face identification determines a person's identity, usually a name, from an image, while face verification decides whether the person appearing in one image is the same person as in another image. Several factors make verification challenging, such as lighting, age, expression, and pose variation in the images. Du and Ward [4] state that pose variation is one of the most challenging problems. Studies have been conducted to resolve these problems in face verification. With the rapid increase in image data and computational power, learning-based methods have in general achieved good results, surpassing traditional feature engineering because they can discover suitable features for a specific task [5]. The convolutional neural network (CNN), which is designed for image input, is also good at classifying images [6]. Furthermore, the siamese network [7], which aims to differentiate two or more inputs, is suitable for the verification task. In this paper, a siamese network is trained to perform face verification. Inside the siamese network, a CNN is used as the feature extractor. This study also observes the influence of face alignment on verification accuracy.

In a face verification system, pose variation is one of the factors that can significantly influence accuracy. Therefore, our main contribution is to address the pose variation problem in face verification by integrating face alignment with a siamese network. The rest of this paper is organized as follows. Section 2 reviews previous works on face recognition and verification systems. Section 3 explains the implementation of our proposed method. The results of this implementation are presented in Section 4. Lastly, Section 5 concludes our work.

• Rizky Agung Dwi Putranto was with the Computer Science Program, Universitas Gadjah Mada, Indonesia. E-mail: [email protected]

• Wahyono is with the Department of Computer Science and Electronics, Universitas Gadjah Mada, Indonesia. E-mail: [email protected]

2 RELATED WORKS

Several previous works on face verification have been conducted. The approaches used include template matching, feature matching, feature learning, and CNNs.

The first approach is template matching, which was used by Sao [8]. Verification is done using template matching based on an edginess representation of the face image. The scores from template matching are then combined using an auto-associative neural network (AANN) model-based classifier. The proposed system achieves 92.23% on the FacePix dataset [9]. Another template matching approach was proposed by Chidambaram [10], in which the authors compare template matching with histogram matching. For the template, they use cross-correlation and histogram. The similarity score is then calculated using the sum of absolute differences of the pixel values of the two images. The accuracy obtained is 99.23% on their private dataset.

The second approach utilizes feature similarity. Nguyen studied the use of cosine similarity as the distance metric for face verification [11]. The three features used in that study are intensity, local binary pattern (LBP), and Gabor wavelets. For feature matching, they use cosine similarity metric learning to obtain a similarity score for each feature after dimensionality reduction with principal component analysis. Finally, verification is performed by a support vector machine on the vector of similarity scores. The accuracy is 88% on the LFW dataset. Similar to Nguyen, Miri also utilized feature matching to perform face verification, but evaluated five features: intensity, Gabor, LBP, HOG, and LPQ. Sparse representation-based classification (SRC) is used to create matrix dictionaries [12]. For faster computation, the author uses linearly approximated SRC (LASRC). The representations of the face images are generated using the dictionaries, and cosine distance is used to measure similarity. The accuracy of the proposed system with combined features is 85.27%.

The third approach uses deep learning to overcome the problem of manual feature extraction. Feature learning using deep correlation was used by Deng [13]. The authors use ResNet [14] as their main architecture and propose a correlation loss as the loss function. The CNN model is trained under the joint supervision of softmax loss and correlation loss. The accuracy obtained is 99.5% on the LFW dataset. In another method, Huang [15] combined conventional features with learned features obtained using a deep learning method. A modified Restricted Boltzmann Machine (RBM) is used, in which the authors add convolution to both the visible and hidden layers. For metric learning, they use information-theoretic metric learning. The accuracy is 87.77% on the LFW dataset.

Taigman [16] combined verification with an alignment process. For alignment, they use a support vector regressor. The CNN is first trained to identify the person in an image. After the training process, the CNN without its output layer is duplicated and used as a feature extractor. For the verification metric, they use three methods: 1) the inner product between features, 2) the weighted chi-square distance, and 3) a siamese network. The proposed system reaches near-human performance in face verification, with 97.35% accuracy on the LFW dataset. A siamese network trained directly for verification was proposed by Bukovčíková [17]. Instead of a contrastive loss, they use another fully connected layer for joining the features and another two neurons for the output. The authors also claim that the siamese network is effective for face recognition in uncontrolled conditions. The result is 85.74% on the CelebA dataset [18].

3 THE PROPOSED METHOD

3.1 Dataset Specification

The proposed CNN model in this study is learned from a collection of face images from the CASIA-WebFace dataset [19]. The model is then applied to the Labeled Faces in the Wild (LFW) database [20] for testing. The CASIA dataset consists of 494,414 images from 10,575 identities. Because of the computational limit, only about 30,000 images from 100 people are used for identification training. For verification training, 20,000 images grouped into 10,000 image pairs are used. The LFW dataset, which is the benchmark dataset for face verification in unconstrained environments, consists of 13,233 images from 5,749 identities and is used for the testing process. The testing data is grouped into 6,000 pairs in 10 splits. Testing is done under the unsupervised protocol, in which no training is performed on the LFW dataset. Instead of using 10-fold cross-validation, all pairs are used as test data.
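As an illustration of how identity-labeled images could be grouped into verification pairs, the following is a minimal sketch, not the procedure used in this study. The function name `make_pairs`, the even split between matched and mismatched pairs, and the sampling with replacement are all assumptions made for the example.

```python
import random
from collections import defaultdict


def make_pairs(labels, n_pairs, seed=0):
    """Build (index_a, index_b, same) tuples from a list of identity labels.

    Hypothetical helper: even-numbered pairs are matched (same identity),
    odd-numbered pairs are mismatched. The 50/50 balance is an assumption.
    """
    rng = random.Random(seed)
    by_identity = defaultdict(list)
    for index, label in enumerate(labels):
        by_identity[label].append(index)

    identities = list(by_identity)
    # Only identities with at least two images can form a matched pair.
    multi_image = [i for i in identities if len(by_identity[i]) >= 2]
    if len(identities) < 2 or not multi_image:
        raise ValueError("need >= 2 identities, one of them with >= 2 images")

    pairs = []
    for k in range(n_pairs):
        if k % 2 == 0:
            identity = rng.choice(multi_image)
            a, b = rng.sample(by_identity[identity], 2)
            pairs.append((a, b, 1))
        else:
            id_a, id_b = rng.sample(identities, 2)
            a = rng.choice(by_identity[id_a])
            b = rng.choice(by_identity[id_b])
            pairs.append((a, b, 0))
    return pairs


example_labels = ["p1", "p1", "p2", "p2", "p3", "p3"]
example_pairs = make_pairs(example_labels, n_pairs=10)
```

Because images are drawn with replacement here, an image may appear in more than one pair; a disjoint grouping of 20,000 images into 10,000 pairs would need an additional bookkeeping step.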

3.2 Face Alignment

The face alignment in this research is based on [21]. For face detection, the authors of [21] use the method from [22]. After face detection, facial landmarks are detected for both the query image and the rendered 2D reference image. For landmark detection, DLIB [23] facial feature detection is used in this research instead of the supervised descent method (SDM). Everything else follows [21]. The face alignment consists of the following stages:

1. Face Landmark Detection. Face detection is done using the face detector developed by Viola and Jones. The detected face is then cropped and rescaled for normalization. Landmark points are then detected on the normalized image using the landmark detector from the DLIB library. The detector yields 48 landmark points located at the eyes, nose, and mouth. Landmark detection is carried out on both the query image and the reference image.

TABLE I
CNN ARCHITECTURE

Layer    Info
Input    Image size 152×152×3

2. Frontalization. Frontalization is done by projecting the landmark points of the query image into the reference coordinates using the geometric structure of the 3D model. For each landmark in the reference image, there is a location on the 3D model that is projected back into the reference image. The location of the same landmark in the query image is also estimated. Bilinear interpolation is used to sample the intensity of the query image on landmarks, and the sampled values determine the colors of the frontalized image pixels.

3. Soft symmetry. After frontalization, soft symmetry is used to overcome the occlusion problem that occurs in the image.
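The bilinear sampling mentioned in the frontalization stage can be sketched as follows. This is a generic NumPy illustration of bilinear interpolation at a real-valued image location, not the frontalization code of [21]; the function name `bilinear_sample` and the clamping of coordinates to the image border are assumptions.

```python
import numpy as np


def bilinear_sample(image, x, y):
    """Sample an H x W (or H x W x C) image at real-valued (x, y).

    x runs along columns and y along rows. Coordinates outside the image
    are clamped to the border (an assumption made for this sketch).
    """
    image = np.asarray(image, dtype=float)
    height, width = image.shape[:2]
    x = float(np.clip(x, 0, width - 1))
    y = float(np.clip(y, 0, height - 1))

    # Integer corners surrounding the sample location.
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0

    # Interpolate along x on the two rows, then along y between the rows.
    top = (1 - dx) * image[y0, x0] + dx * image[y0, x1]
    bottom = (1 - dx) * image[y1, x0] + dx * image[y1, x1]
    return (1 - dy) * top + dy * bottom


tiny = np.array([[0.0, 10.0],
                 [20.0, 30.0]])
center_value = bilinear_sample(tiny, 0.5, 0.5)  # average of the four pixels
```

In a frontalization pipeline, each pixel of the frontal view would be mapped to a real-valued location in the query image and filled with a value sampled in this way.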

3.3 Face Verification

For face verification, the CNN architecture used in this study is similar to that of [16], with some adjustments due to the computational limit. The additions are a max pooling layer with a 2×2 kernel and a stride of 2 after the second convolutional layer, a dropout layer [24], and a batch normalization layer [25]. This CNN works as the feature extractor in the siamese network, as shown in Fig. 1. The overall CNN architecture is shown in Table I.

The first convolution layer (C1), with 32 filters of size 11×11×3, maps the 152×152×3 input image into 32 feature maps of size 142×142. These feature maps are fed into a max pooling layer (MP1) with a 3×3 filter and a stride of 2. Next comes another convolution layer (C2) with 16 filters of size 9×9×16, which produces 16 feature maps of size 63×63. The second max pooling layer (MP2) uses a 2×2 filter with a stride of 2. These four layers extract low-level features such as edges and textures.

Next, local convolution layers [26] are used. Local convolution is like regular convolution, but a different set of filters is applied at each location of the input. The first local convolution layer uses 16 filters of size 9×9, the second uses 16 filters of size 7×7, and the last uses 16 filters of size 5×5. The use of local convolution layers is possible because the alignment process forces the face to be at specific locations. After the three local convolution layers, a fully connected layer with 4096 neurons is added. The output of this fully connected layer is then used in the siamese network.

For identification training, another fully connected layer is added with as many neurons as there are classes. This layer works as the output layer and uses the softmax function for activation and cross entropy as the loss function. To reduce overfitting, a dropout layer is used after the convolution and local convolution layers. Batch normalization is also used to increase training speed.
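
The feature map sizes quoted for the first layers can be checked with the usual output-size formula for unpadded convolution and pooling. The sketch below is only a worked calculation; the helper names are hypothetical, and rounding up in MP1 and rounding down in MP2 are assumptions (the text does not state the rounding mode, but rounding MP1 down would give 62×62 rather than the reported 63×63 after C2).

```python
import math


def conv_out(size, kernel, stride=1):
    """Output width of an unpadded convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, kernel, stride, round_up=False):
    """Output width of an unpadded pooling layer.

    round_up=True keeps the last partial window (assumed for MP1).
    """
    span = (size - kernel) / stride
    return (math.ceil(span) if round_up else math.floor(span)) + 1


c1 = conv_out(152, 11)                   # 142, as stated for C1
mp1 = pool_out(c1, 3, 2, round_up=True)  # 71, assuming the size is rounded up
c2 = conv_out(mp1, 9)                    # 63, as stated for C2
mp2 = pool_out(c2, 2, 2)                 # 31, assuming the size is rounded down
```
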
For the siamese network, we use the L1 distance as the distance metric between the feature vectors obtained from the CNN model. Another fully connected layer with a single neuron is then added to produce the verification output, with the sigmoid function as the activation function. Two approaches are used for the training process. The first approach trains the CNN model in an identification scenario, in which the model must output the identity of the person in a given image. After identification training, the CNN model without its last output layer is duplicated, built into a siamese network, and then trained for the verification task. The second approach directly trains the siamese network, built from the same CNN, for verification. The learning rate used in this research is 0.01, with a dropout rate of 0.1. The parameters are updated using stochastic gradient descent with a momentum of 0.9. The batch size is 32, and training runs for 50 epochs. Weights are initialized from a zero-mean Gaussian with σ = 0.01, and biases are set to 0.5.
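The verification head described above can be sketched in NumPy as follows. This is a minimal forward-pass illustration under stated assumptions, not the trained model: it assumes the L1 distance is taken element-wise (the absolute difference of the two 4096-dimensional feature vectors) before the single sigmoid neuron, and the random features stand in for CNN outputs. The weight and bias initialization follows the values given in the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def siamese_head(feat_a, feat_b, weights, bias):
    """Probability that two feature vectors belong to the same person.

    The element-wise |a - b| fed to one sigmoid neuron is an assumed
    reading of "L1 distance followed by a single-neuron layer".
    """
    distance = np.abs(feat_a - feat_b)
    return sigmoid(distance @ weights + bias)


rng = np.random.default_rng(0)
feature_dim = 4096

# Zero-mean Gaussian weights with sigma = 0.01 and bias 0.5, as in the text.
weights = rng.normal(0.0, 0.01, size=feature_dim)
bias = 0.5

# Placeholder features; in the real system these come from the shared CNN.
features_a = rng.normal(size=feature_dim)
features_b = rng.normal(size=feature_dim)

score_identical = siamese_head(features_a, features_a, weights, bias)
score_different = siamese_head(features_a, features_b, weights, bias)
```

With untrained weights the scores carry no meaning; the sketch only shows how the two branches are joined. Both branches share the same CNN weights, which is what makes the network siamese.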

4 EXPERIMENT AND RESULTS

The training process for both approaches is carried out as explained in the previous section. Data with and without alignment are used in both approaches for training and testing. The proposed method was implemented in Python 3.6 with the OpenCV library for basic image processing, on an Intel Core i7-6700 CPU @ 3.40GHz x 8 running the Ubuntu 16.04.6 64-bit operating system [27].

4.1 Face Alignment Results

Face alignment is applied to both the training and test data. The result is shown in Fig. 2. Face alignment is performed on face images with pose variation. With extreme poses, face alignment does not perform well. Other factors may also affect the alignment result, such as the resolution of the image, the format of the image, and the number of faces in the image. In Fig. 2, a face at an angle of up to 75° is still detected, while a face at a 90° angle is not. Images with an angle of 0° to 30° still look good after frontalization, whereas images with an angle of 45° to 75° are not aligned or symmetrized properly. This can be influenced by several factors, including image quality, image resolution, and the image format used. In the training and testing data, the quality of the alignment results is likewise mixed.

Fig. 1. The siamese architecture used.


4.2 Parameters Selection

The training and validation process for the CNN identification model is carried out in several stages to find the best parameter values. The parameters tested are batch size, learning rate, number of epochs, and dropout ratio. The experiments yield the following insights regarding these parameters:

1. Batch size affects accuracy: smaller batch sizes give higher accuracy. This happens because the model learns from the dataset more gradually with a small batch size. Batch size also affects training time: a larger batch size converges in less training time but requires greater computing resources.

2. Lower learning rates lead to higher accuracy than that of the earlier models. If the number of epochs is increased, however, the model overfits. Therefore, a learning rate of 0.01 is chosen.

3. In the training process, the CNN model converges at the 30th epoch. In later epochs the accuracy improvement is not significant, while the training time becomes much longer. Therefore, 30 epochs are chosen.

4. A higher dropout rate gives lower accuracy at the same number of epochs. Therefore, a dropout rate of 0.1 is chosen.

4.3 Results on Data Training

Training results for both approaches are shown in this section. Tables II and IV show the training results for the pre-train approach with unaligned and aligned data, respectively. For the direct training approach, the results are shown in Tables III and V. From these tables, we can see that the accuracy of the siamese network with the pre-trained model is lower than that of the siamese network with direct training. Moreover, with the pre-trained model, increasing the amount of training data does not necessarily increase the accuracy, and hyperparameter tuning does not increase it either. This happens because in the pre-train approach only the parameters of the last two layers are trained, so the siamese network relies on the features from identification training. The siamese network with the pre-trained model also takes less training time, since only the last two layers are trained. The training results also show that the face alignment process can improve the accuracy of the siamese network for both approaches, although with direct training the accuracy with alignment is not always higher. This may occur because the face alignment process does not perform well on certain data and the total amount of data used is relatively small.

4.4 Results on Data Testing

In testing, both approaches are tested against unaligned and aligned data. The total number of tests is 8, covering every combination of the two approaches with unaligned and aligned data for training and testing. The test results are shown in Table VI. In Table VI, the siamese model with the direct approach achieves higher accuracy than the siamese model with the pre-trained model. As in the training results, this occurs because all layers of the directly trained siamese model are trainable, so the model can adapt better during verification training. In contrast, only the last two layers of the siamese model with the pre-trained model are trainable, so the performance of the CNN as a feature extractor relies on the earlier identification training. Table VI also shows that the face alignment process increases the accuracy of the system. The highest accuracy in this research is 68.9%, obtained using the siamese network with the direct training approach and aligned data for both training and testing. This result is quite impressive, considering that only 20,000 face images from 500 identities are used for training. In addition, testing is done under the unsupervised protocol, which means no LFW data is used for training.
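The accuracy figures above are the fraction of test pairs classified correctly. A minimal sketch of that computation is given below; the function name `verification_accuracy` and the decision threshold of 0.5 on the sigmoid output are assumptions, since the text does not state the threshold used.

```python
def verification_accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the label.

    scores: sigmoid outputs in [0, 1]; labels: 1 for same person, 0 otherwise.
    The 0.5 threshold is an assumption made for this sketch.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one pair is required")
    correct = sum(
        int(score >= threshold) == int(label)
        for score, label in zip(scores, labels)
    )
    return correct / len(labels)


# Toy example with four pairs: the first two are classified correctly.
toy_accuracy = verification_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
```

Applied to the 6,000 LFW test pairs, the same function would be called once per train/test configuration in Table VI.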

5 CONCLUSION

A siamese network can be trained to perform the verification task. With more computational power and more data, higher accuracy should be obtainable. Face alignment can also increase the accuracy of the verification system. In this research, the siamese network with direct training achieves higher accuracy than the siamese network with a pre-trained model, especially on small data.

TABLE VI
RESULTS ON DATA TESTING

Approach           Train Data   Test Data   Accuracy
Pretrained         Unaligned    Unaligned   51.7%
Pretrained         Unaligned    Aligned     52.0%
Pretrained         Aligned      Unaligned   53.4%
Pretrained         Aligned      Aligned     52.3%
Direct Training    Unaligned    Unaligned   57.3%
Direct Training    Unaligned    Aligned     59.8%
Direct Training    Aligned      Unaligned   58.3%
Direct Training    Aligned      Aligned     68.9%

TABLE II
VERIFICATION USING PRETRAIN MODEL WITHOUT ALIGNMENT

Pairs     Accuracy   Loss   Time (s)
3,000     51.33%     1.03   448
5,000     47.93%     1.40   635.5
10,000    53.62%     1.30   1,338.5

TABLE III
DIRECT VERIFICATION WITHOUT ALIGNMENT

Pairs     Accuracy   Loss   Time (s)
3,000     73.61%     1.27   907
5,000     74.10%     1.75   1,469
10,000    85.43%     0.87   2,797

TABLE IV
VERIFICATION USING PRETRAIN MODEL WITH ALIGNMENT

Pairs     Accuracy   Loss   Time (s)
3,000     55.80%     1.07   418
5,000     52.08%     1.27   633.5
10,000    54.88%     1.28   1,330

TABLE V
DIRECT VERIFICATION WITH ALIGNMENT

Pairs     Accuracy   Loss   Time (s)
3,000     72.92%     1.53   817
5,000     73.69%     1.50   1,376


REFERENCES

[1] A. K. Jain, A. Ross, S. Prabhakar et al., "An introduction to biometric recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 1, 2004.

[2] S. Park, J. Yu, and M. Jeon, "Learning feature representation for face verification," in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2017, pp. 1–6.

[3] J. Hu, J. Lu, and Y.-P. Tan, "Discriminative deep metric learning for face verification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1875–1882.

[4] S. Du and R. Ward, "Face recognition under pose variations," Journal of the Franklin Institute, vol. 343, no. 6, pp. 596–613, 2006.

[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[6] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, San Francisco, CA, USA, 2015, vol. 25.

[7] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a "siamese" time delay neural network," in Advances in Neural Information Processing Systems, 1994, pp. 737–744.

[8] A. K. Sao and B. Yegnanarayana, "Face verification using template matching," IEEE Transactions on Information Forensics and Security, vol. 2, no. 3, pp. 636–641, 2007.

[9] J. A. Black, M. Gargesha, K. Kahol, P. Kuchi, and S. Panchanathan, "Framework for performance evaluation of face recognition algorithms," in Internet Multimedia Management Systems III, vol. 4862. International Society for Optics and Photonics, 2002, pp. 163–175.

[10] C. Chidambaram, M. S. Marçal, L. B. Dorini, H. V. Neto, and H. S. Lopes, "A comparison of histogram and template matching for face verification," in VIII Workshop de Visao Computacional, 2012.

[11] H. V. Nguyen and L. Bai, "Cosine similarity metric learning for face verification," in Asian Conference on Computer Vision. Springer, 2010, pp. 709–720.

[12] M. Miri, "Face verification in the wild using similarity in representations," in 2017 Artificial Intelligence and Signal Processing Conference (AISP). IEEE, 2017, pp. 140–144.

[13] W. Deng, B. Chen, Y. Fang, and J. Hu, "Deep correlation feature learning for face verification in the wild," IEEE Signal Processing Letters, vol. 24, no. 12, pp. 1877–1881, 2017.

[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[15] G. B. Huang, H. Lee, and E. Learned-Miller, "Learning hierarchical representations for face verification with convolutional deep belief networks," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2518–2525.

[16] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.

[17] Z. Bukovčíková, D. Sopiak, M. Oravec, and J. Pavlovičová, "Face verification using convolutional neural networks with siamese architecture," in 2017 International Symposium ELMAR. IEEE, 2017, pp. 205–208.

[18] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.

[19] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv preprint arXiv:1411.7923, 2014.

[20] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," in Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.

[21] T. Hassner, S. Harel, E. Paz, and R. Enbar, "Effective face frontalization in unconstrained images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4295–4304.

[22] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[23] D. E. King, "Dlib-ml: A machine learning toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.

[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[25] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[26] K. Gregor and Y. LeCun, "Emergence of complex-like cells in a temporal product network with local receptive fields," arXiv preprint arXiv:1006.0448, 2010.
