
Deeply Vulnerable – A Study of the Robustness of Face Recognition to Presentation Attacks

Amir Mohammadi, Sushil Bhattacharjee, and Sébastien Marcel ∗†

Abstract

The vulnerability of deep-learning based face-recognition (FR) methods to presentation attacks (PA) is studied in this paper. Recently proposed FR methods based on deep neural networks (DNN) have been shown to outperform most other methods by a significant margin. In a trustworthy face-verification system, however, maximizing recognition-performance alone is not sufficient: the system should also be capable of resisting various kinds of attacks, including presentation attacks. Previous experience has shown that the PA-vulnerability of FR systems tends to increase with face-verification accuracy. Using several publicly available PA datasets, we show that DNN-based FR systems compensate for variability between bona fide and PA samples, and tend to score them similarly, which makes such FR systems extremely vulnerable to PAs. Experiments show the vulnerability of the studied DNN-based FR systems to be consistently higher than 90%, and often higher than 98%.

1 Introduction

Progress of face recognition (FR) technology, from strongly constrained models to fully unconstrained environments, has been enabled by the adoption of successively more complex learning paradigms. The first generation of large scale appearance based FR systems, such as Eigenfaces [1] and Fisherfaces [2], attempted to model the face-variability in a simple linear sub-space. Subsequently, methods such as joint factor analysis (JFA), inter-session variability (ISV) modeling [3] and probabilistic linear discriminant analysis (PLDA) [4, 5] were developed to better model variability in face-images. Deep-learning based methods, notably convolutional neural networks (CNN) [6, 7, 8, 9], have become very popular in recent years, due to their near-perfect recognition accuracy on unconstrained datasets such as ‘Labeled Faces in the Wild’ (LFW) [10].

A FR system may be used in two kinds of applications: face-identification or face-verification. Face-identification is a one-to-many problem, where the face-image to be identified is tested against all previously enrolled identities, to see if it matches any of the known identities. In face-verification systems, a claimed-identity is provided along with the input face-image, and the problem is simply to verify that the input image corresponds to the claimed-identity. In this paper the terms recognition and verification have been used interchangeably, and refer to the use of a FR system in verification mode.

FR systems should not only have very high accuracy, but should also be robust to attacks. For example, in access-control applications, high recognition accuracy in unconstrained environments is not a significant asset unless it is accompanied by resistance to presentation attacks (PA), also known as spoof-attacks. A face PA is said to have occurred when a face biometric-sample is presented to the camera of a FR system “with the intention of interfering with the operation of biometric recognition” [11]. For example, person A may attack a FR system by claiming to be an enrolled client, B, and presenting a printed photo of person B to the camera. Other examples of PAs are: digital photos or videos displayed on an electronic screen, face sketches, and make-up [12].

Most high-accuracy FR systems today rely on deep-learning methods. Indeed, deep-learning based FR systems are already being deployed in commercial face-verification applications [8]. In this context their vulnerability to PAs becomes an important factor in the trustworthiness of the entire face-verification process. Several studies [6, 7, 8, 9] have explored the capacity of deep-learning based FR-systems to handle variations in pose, illumination and scale, that is, challenges related to variability of face-biometric samples of a single client. Karahan et al. [13] studied the fragility of CNN-based FR systems when confronted with image degradations such as blurring, occlusion, compression-artifacts, and color-distortion. Robustness to such degradations is necessary for FR systems operating in relatively unconstrained environments. For face-verification systems functioning under controlled conditions, the vulnerability to PAs is a much more significant concern.

Authors are affiliated with the Idiap Research Institute, Martigny, Switzerland.

The European research project TABULA RASA [14] indicated a positive correlation between the efficacy of FR systems and their vulnerability to PAs. In other words, FR methods with higher face-verification accuracy tended to be more vulnerable to PAs. Given that FR systems, especially the popular CNN-based FR methods, have not been explicitly trained for presentation attack detection (PAD), one would intuitively expect such systems to be vulnerable to PAs. To our knowledge, though, the vulnerability of deep learning based FR systems to PAs has not been studied in detail.

In this paper we present a large-scale empirical study of the vulnerability to PAs of five recent FR systems, including three recent CNN-based methods. Specifically, we compare the vulnerability of the following systems:

• three CNN-based FR systems, namely:
  – VGG-Face [7],
  – LightCNN [15], and
  – FaceNet [16];
• Inter-Session Variability (ISV) modeling [3]; and
• ROC-SDK from Rank-One Computing¹.

The main contribution of this paper is empirical evidence to support the claim that CNN-based FR methods are extremely vulnerable to PAs. Our experiments, using three recent PA datasets, show that other highly rated FR methods are also very vulnerable to PAs. Another significant contribution of this work is that all the datasets, protocols, and source-code for the experiments reported in the paper will be made publicly available on the web. This will allow other researchers to reproduce the results presented here.

A brief review of previous vulnerability studies for FR methods is presented in Section 2. This discussion justifies the present work, and helps us to distinguish our results from those in previously published works. The face-recognition methods included in our study are described in Section 3. The experiments reported in this paper have been conducted on publicly available datasets, with well-defined protocols for face-recognition and face-PAD experiments. In Section 4, we describe the datasets, the kinds of PAs represented in each dataset, and the protocols for training and testing face-recognition offered in each dataset. Experimental results are summarized and discussed in Section 5, and concluding remarks are presented in Section 6.

2 Related Work

There have been very few studies specifically exploring the vulnerability of FR systems to different kinds of attacks. Duc et al. [17] have investigated the vulnerability of FR based login on computers (specifically Lenovo, Toshiba and Asus products) to PAs. Kose et al. [18] present a study of the vulnerability of FR methods to 3D-mask attacks. They evaluate two different FR methods: one designed specifically for FR using 3D data [19], and a generic approach to FR using local binary patterns (LBP) based face-image comparison. Hadid [20] discusses the vulnerability of the parts-based GMM FR method to PAs. In a recently published work, Scherhag et al. [21] have studied the vulnerability of FR methods to morphed-face attacks. Here attacks are constructed by morphing face images from two enrolled identities, and the challenge is to see if the two enrolled identities can both be successfully spoofed using the same morphed image. Both FR methods tested in that work, VeriLook SDK² (a commercial off-the-shelf (COTS) FR product from Neurotechnology) and OpenFace [22] (an open-source, pre-trained DNN based FR system), are shown to be highly susceptible to such attacks [21].

Although deep-learning based FR systems have attracted considerable attention, both academic and commercial, as far as we are aware, no previous study has investigated the vulnerability of CNN-based FR systems using a variety of PAs. Most of the studies cited above have been performed on FR methods that rely on hand-crafted features. The study by Scherhag et al. [21] that has included a DNN based FR system (OpenFace) was done in the restricted context of a single class of attacks based on morphed-images. Such attacks are expected to have a very narrow range of application.

¹ Company website: www.rankone.io

² Website: www.neurotechnology.com/verilook.html

In this paper, we focus on the vulnerability of CNN based FR methods to PAs. This study includes three publicly available pre-trained CNN-FR models. Our experiments demonstrate that all these FR methods are consistently highly vulnerable to several classes of PAs. For comparison, we have also included the ISV modeling method, which relies on hand-crafted features, and ROC-SDK – a COTS FR product – in this study. We use four publicly available FR and PA datasets to estimate the vulnerability of the five FR methods to a variety of PAs. Our study has been motivated by the hypothesis that although CNN-FR systems outperform FR methods based on hand-crafted features, they are also more vulnerable to PAs. This assertion makes intuitive sense to researchers in face-biometrics, given that the FR methods have not been explicitly trained to detect PAs. However, this is the first large-scale study to empirically quantify the vulnerability of these FR methods.

3 FR Systems Included In This Study

A typical FR system functions in three phases: training, enrollment, and probing. In the training phase a background model, assumed to broadly represent the space of face-images, is constructed using training data. In the enrollment phase, the FR system generates templates for the given enrollment samples, which are then stored in the gallery. In the probing (operational) phase, the FR system is presented with a probe-image and a claimed identity. A template is created for the given probe-sample, and is compared with the set of enrolled templates associated with the claimed identity. The result of the comparison is a score, which is then thresholded to produce a decision (accept or reject). To mitigate the influence of variations on the actual face-recognition process, the raw input image is usually preprocessed, to extract sub-images representing individual faces. Geometric and color transforms may also be applied to the extracted face-images, depending on the requirements of the specific FR method. The result of the pre-processing stage is a normalized face image, of predefined size and scale, that may be processed by a FR system. Before describing the different FR methods, we explain the pre-processing steps applied to normalize the input face images.

3.1 Face Image Normalization

Several FR systems expect the input to be a gray-level image. Therefore, our first pre-processing step is to convert the input RGB image to a gray-level image (essentially, the Y component of the YUV domain representation of the image).
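The gray-level conversion described above can be sketched as follows. This is a minimal illustration only: the ITU-R BT.601 luma weights (0.299, 0.587, 0.114) and the function name `rgb_to_gray` are assumptions made for this example, since the text does not state which coefficients the actual pipeline uses.

```python
def rgb_to_gray(image):
    """Convert an RGB image to a gray-level image.

    `image` is a list of rows, each row a list of (r, g, b) tuples in 0..255.
    The weights are the ITU-R BT.601 luma (Y) coefficients, which is an
    assumption for this sketch.
    """
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in image
    ]


if __name__ == "__main__":
    # One row with a white, a black, and a pure red pixel.
    print(rgb_to_gray([[(255, 255, 255), (0, 0, 0), (255, 0, 0)]]))
```

Green contributes most to the result and blue least, which reflects the relative sensitivity of human vision that the Y component is designed to approximate.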

The next step is to locate each face represented in the input image. The ROC-SDK takes a full frame image as input. For the other FR methods in our study (CNN-FR, and ISV modeling) the input is a suitably cropped face-image. There are two ways to accomplish the face-localization step explicitly:

• provide additional information, such as annotations for specific facial landmarks, that the FR system can use to extract a normalized face-region, or,

• provide as input a cropped and normalized face-image.

In our experiments, the normalized face-region is extracted using annotations identifying the center of each eye. Imposing the constraints that the straight-line joining the two eye-centres should be horizontal, and should have a predefined length, an affine transform can be used to extract a normalized face-image of fixed size from the given input image. Some face-biometrics test datasets include annotations for the eye locations. For datasets where this information is not explicitly provided, we use a two-step process to extract the eye-positions. First, a boosted-classifier, trained to detect faces, is used to localize the face-region [23]. Next, facial-landmarks are extracted from the face-region using the flandmarks method [24]. This method provides pixel coordinates for seven facial landmarks, including the two corners of each eye. The pixel-location for the center of each eye can be interpolated from these landmarks.
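The eye-based alignment described above can be illustrated with a small sketch. It assumes that the transform is restricted to rotation, uniform scaling, and translation (a similarity transform, which is a special case of an affine transform), and that an eye center is taken as the midpoint of the two eye corners. The function names and the target eye positions in the usage example are hypothetical and are not taken from the paper.

```python
def eye_center(corner_a, corner_b):
    """Midpoint of the two corners of one eye (assumed interpolation rule)."""
    return ((corner_a[0] + corner_b[0]) / 2.0, (corner_a[1] + corner_b[1]) / 2.0)


def eye_alignment_transform(left_eye, right_eye, target_left, target_right):
    """Return a 2x3 matrix mapping the detected eye centers onto target positions.

    Points are treated as complex numbers, so the similarity transform is
    z' = s * z + t, where s encodes rotation and scale, and t the translation.
    """
    z1, z2 = complex(*left_eye), complex(*right_eye)
    w1, w2 = complex(*target_left), complex(*target_right)
    if z1 == z2:
        raise ValueError("eye centers must be distinct")
    s = (w2 - w1) / (z2 - z1)
    t = w1 - s * z1
    return [[s.real, -s.imag, t.real], [s.imag, s.real, t.imag]]


def apply_transform(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


if __name__ == "__main__":
    # Hypothetical eye positions in a tilted input image, and target positions
    # on a horizontal line with a fixed inter-eye distance of 32 pixels.
    m = eye_alignment_transform((30.0, 40.0), (70.0, 60.0), (16.0, 16.0), (48.0, 16.0))
    print(apply_transform(m, (30.0, 40.0)), apply_transform(m, (70.0, 60.0)))
```

Choosing target positions with equal y-coordinates enforces the horizontal eye-line constraint, and their distance fixes the scale. Resampling the image with the resulting matrix is left to an image-processing library.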

Note that for the VGG-Face and LightCNN networks, the input images are normalized color (RGB) images, whereas the FaceNet and ISV modeling methods expect normalized gray-level face-images of fixed 2D-shape.

The various FR methods are discussed in the following sections. Specific implementation details, such as the size of the normalized face-image, tunings of the hyper-parameters, and so on, are provided in Section 5.

3.2 CNN-Based Systems

CNNs [25, 26], a class of deep neural networks (DNN), have been shown to be extremely accurate in FR tasks. For FR applications, CNNs are usually trained for face identification, using face-images as input, and the set of identities to be recognized as output. The last few layers of the network, including the output layer, are typically fully-connected (fc for short) layers. These layers may be seen as the classifier-stage of the network, whereas the preceding (convolutional and pooling) layers may be considered to constitute the feature-extraction stage. Although CNNs are typically trained end-to-end for classification or regression tasks, they are often also used as feature-extraction tools. The terms representation and embedding are used interchangeably, to denote the outputs of the various layers of a deep network. Representations generated by a specific layer of a pre-trained DNN may be used as templates (feature-vectors) representing the corresponding input images. These templates may subsequently be used to train a classifier, or to compare the respective input images using appropriate similarity measures.

Taigman et al. [6] have proposed DeepFace, a 9-layer CNN for FR. The input faces are first aligned to have an upright position using 3D modeling and piecewise affine transforms, before being fed to the network. This network achieves a recognition accuracy of 97.25% on the LFW dataset. Schroff et al. [8] report an accuracy of 99.63% on the LFW dataset using FaceNet, a CNN with 7.5 million parameters, trained using a novel triplet loss function. Other CNN architectures showing similar FR accuracy on the LFW dataset include the DeepID series by Sun et al. [9].

In this work we analyze the vulnerability to PAs of three CNN-FR methods: VGG-Face [7], LightCNN [15], and FaceNet [16]. The VGG-Face network has been included in our study because it is a very well known and widely referenced CNN for FR applications. The other two network models have been included because they are newer than the VGG-Face network, and have both shown a FR performance even better than that of the VGG-Face CNN. Another important reason why these three specific CNNs have been studied in this work is that the respective creators of these networks have made publicly available pre-trained models that can be directly used in our experiments. The following sections provide brief summaries of the selected CNNs, and describe how they have been used in our experiments.

3.2.1 VGG-Face CNN

The VGG-Face network model is made publicly available by the Visual Geometry Group³ at Oxford University. Involving almost 135 million trainable parameters, this network has been shown to achieve a FR accuracy of 98.95% on the LFW unrestricted setting [10]. VGG-Face is a CNN consisting of 16 hidden layers (see Table 3 in [7]). The initial 13 hidden layers are convolution and pooling layers, and the last three layers are fully-connected (‘fc6’, ‘fc7’, and ‘fc8’, following the nomenclature used by Parkhi et al. [7]). The input to this network is an appropriately cropped color face-image of pre-specified dimensions.

We use the representation produced by the ‘fc7’ layer of the VGG-Face CNN as a template for the input image. When enrolling a client, the template produced by the VGG-Face network for each enrollment-sample is recorded. For verification, the network is used to generate a template for the probe face-image, which is then compared to the enrolled templates of the claimed identity using the Cosine-similarity measure given by Equation 1 (where $\|A\|_2$ represents the L2-norm of vector $A$).

$\mathrm{CosineSimilarity}(A, B) = \dfrac{A \cdot B}{\|A\|_2 \, \|B\|_2}$ (1)

The score assigned to the probe is the average Cosine-similarity of the probe-template to all the enrollment-templates of the claimed identity. If the score is larger than a predetermined threshold, the probe is accepted as a match for the claimed identity.
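The scoring rule built on Equation 1 can be sketched in a few lines. This is an illustrative example, not the code used in the experiments: the function names are hypothetical, templates are represented as plain lists of floats, and the threshold value is left to the caller.

```python
import math


def cosine_similarity(a, b):
    """Equation 1: dot product divided by the product of the L2-norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def verification_score(probe_template, enrolled_templates):
    """Average cosine similarity of the probe to all templates of the claimed identity."""
    scores = [cosine_similarity(probe_template, t) for t in enrolled_templates]
    return sum(scores) / len(scores)


def verify(probe_template, enrolled_templates, threshold):
    """Accept the claim when the score is larger than the threshold."""
    return verification_score(probe_template, enrolled_templates) > threshold


if __name__ == "__main__":
    enrolled = [[1.0, 0.0], [0.0, 1.0]]
    print(verification_score([1.0, 0.0], enrolled))
```

Because the measure depends only on the angle between templates, it is insensitive to the overall magnitude of the network activations.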

3.2.2 LightCNN

Wu et al. [15] have proposed a new CNN, called LightCNN, for FR. Their goals in designing this network were to have significantly fewer trainable parameters compared to other state-of-the-art FR-CNNs, as well as to be able to handle noisy labels that are inevitable in datasets mined automatically from the web. Compared to the VGG-Face network, the number of parameters in the LightCNN model is smaller by a factor of 10 (∼ 12 million parameters). This is achieved mainly through the use of a newly introduced Max-Feature-Map (MFM) activation [15], which is a non-linear extension of the maxout activation operation. Although the MFM operator is more expensive to compute than the ReLU unit that it replaces, the large overall reduction of number of units per layer, made possible by the use of the new operator, still leads to smaller computation time for the forward-pass of the LightCNN network (by a factor of 5, relative to the VGG-Face CNN [15]).

³ Website: www.robots.ox.ac.uk/~vgg/software/vgg_face

In our experiments we use as templates the 256-D representation produced by the ’eltwise fc1’ layer of LightCNN. Note that this template is much smaller than the 4096-D vector produced by VGG-Face. Despite these relative efficiencies, the LightCNN network achieves a FR accuracy of 99.33% on the LFW dataset (in unrestricted setting) – outperforming the VGG-Face network by a small margin. As with the VGG-Face network, the Cosine measure (Eqn. 1) is used to compare an input probe template to the relevant enrollment templates.

3.2.3 FaceNet CNN

Very recently, David Sandberg has made publicly available his implementation as well as trained models for a new FR-CNN named FaceNet [16]. This is the closest open-source implementation of the FaceNet CNN proposed by Schroff et al. [8], for which neither a pre-trained model nor the training-set is publicly available. Sandberg’s FaceNet implements an Inception-ResNet V1 DNN architecture [27]. Two pre-trained FaceNet models have been published [16]. In our tests, we have used the 20170512-110547 model, trained on the MS-Celeb-1M dataset [28]. Using this model, FaceNet achieves a FR performance of 99.2% on the LFW dataset [16], which is comparable to the performance of LightCNN. Note that the 128-D representation produced by this network (at the ’embeddings:0’ layer) is half the size of that produced by LightCNN. We use this representation to construct enrollment and probe templates, which are compared to each other using the Cosine measure (Eqn. 1).

3.3 GMM-based FR using Inter-Session Variability Modeling

Inter-session variability modeling (ISV) is an extension of the Gaussian Mixture Models (GMM) based method for face verification. We have used the ISV modeling approach proposed by Wallace et al. [3]. Among the FR methods included in this study, this is the only method that adopts a parts based approach. The input normalized face image is first decomposed into a set of square sub-images of size (12× 12), with an overlap of 11 pixels in each direction. Let N represent the total number of sub-images extracted from the input face-image. For each sub-image, a predetermined number, D, of low-frequency DCT (discrete cosine transform) coefficients is computed. Thus, a set of N D-dim arrays of DCT coefficients is extracted for each input face-image. The DCT coefficients are normalized to zero-mean and unit standard-deviation in each of the D dimensions. The resulting set of N normalized D-dim DCT arrays is used to represent the input face-image.
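The block decomposition and the per-dimension normalization described above can be sketched as follows. With 12 × 12 blocks and an overlap of 11 pixels, the block window advances by one pixel at a time, so an image of height H and width W yields N = (H − 11)(W − 11) blocks. The DCT step itself is omitted here, and the function names are hypothetical; this is an illustration under those assumptions rather than the implementation used in the study.

```python
import math


def extract_blocks(image, block_size=12, overlap=11):
    """Split a gray-level image (list of rows) into overlapping square blocks."""
    step = block_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than block_size")
    height, width = len(image), len(image[0])
    blocks = []
    for top in range(0, height - block_size + 1, step):
        for left in range(0, width - block_size + 1, step):
            blocks.append(
                [row[left:left + block_size] for row in image[top:top + block_size]]
            )
    return blocks


def normalize_features(features):
    """Normalize N feature arrays to zero mean and unit std in each dimension.

    Dimensions with zero variance are mapped to 0.0 to avoid division by zero.
    """
    n, dims = len(features), len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n)
        for d in range(dims)
    ]
    return [
        [(f[d] - means[d]) / stds[d] if stds[d] > 0 else 0.0 for d in range(dims)]
        for f in features
    ]


if __name__ == "__main__":
    image = [[0] * 13 for _ in range(14)]  # 14 rows, 13 columns
    print(len(extract_blocks(image)))      # (14 - 11) * (13 - 11) blocks
```

In the full method, each extracted block would be transformed by a 2D DCT and truncated to its D low-frequency coefficients before `normalize_features` is applied.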

To use a GMM for FR, first, a universal background model (UBM) is constructed from the set of training-images (each represented by N D-dim DCT arrays). The UBM is a GMM, that is, a weighted sum of K Gaussians, where each Gaussian is represented by a D-dim mean-vector, and a (D× D)-dim covariance-matrix. The parameters of the Gaussian components of the UBM, as well as their relative weights, are learned from the training data. The UBM describes the probability distribution of face sub-images in the D-dim DCT-coefficient space.

A supervector, $m$, is then constructed by concatenating the $K$ mean-vectors of all the components of the GMM. To enroll a client $i$, a supervector $s_i$ is derived as follows:

$s_i = m + d_i$ , (2)

where $d_i$ is an offset-vector specific to client $i$, computed using maximum a posteriori (MAP) adaptation from the UBM [29].

The supervector $s_i$ is assumed to be client-specific. In other words, different enrollment images of the same client, $i$, should, in theory, produce very similar supervectors. For a given client, however, the enrollment set typically contains images captured in different sessions, with varying pose, illumination, and facial expressions. This within-class variability for client $i$ is also reflected in $s_i$, and can result in diminished recognition accuracy of the GMM-based FR method [3].

ISV-modeling has been developed to enhance the GMM based approach by explicitly modeling and suppressing the within-class variability of each enrolled client. We assume that each enrollment image, collected in a separate session, results in a unique supervector. For enrollment image $(i, j)$ (the $j$-th enrollment image of client $i$), let

$\mu_{i,j} = m + u_{i,j} + d_i$ , (3)

where $\mu_{i,j}$ is the supervector corresponding to the enrollment image $(i, j)$, $u_{i,j}$ is the offset induced by specific session conditions of image $(i, j)$, and $d_i$ is the client-dependent offset. (Note that the client-dependent offset $d_i$ in Eqn. 3 is free from session variability, unlike the $d_i$ in Eqn. 2.) The session-dependent offset, $u_{i,j}$, can be expressed as:

$u_{i,j} = U x_{i,j}$ , (4)

where $U$ is a low-dimensional matrix modeling the session-variability, and $x_{i,j} \sim \mathcal{N}(0, I)$ is a latent variable corresponding to the session-variability. Similarly, the client-dependent offset, $d_i$, can be represented as

$d_i = D z_i$ , (5)

where $D$ is a diagonal matrix derived from the diagonal variances of the UBM, and $z_i \sim \mathcal{N}(0, I)$ is a client-specific latent random variable. The subspace $U$ is estimated using an expectation-maximization (EM) algorithm. For details of the ISV technique for face-verification, please refer to the works of Wallace et al. [3] and Vogt et al. [30].

When enrolling a new client, $i$, using a set of enrollment images (indexed by $j$), the latent variables $x_{i,j}$ and $z_i$ are estimated from the enrollment images, and finally, the client-specific supervector, $c_i$, is computed as:

$c_i = m + D z_i$ . (6)

Thus, a single supervector, $c_i$, is stored for each enrolled client.

Given a probe image, $P$, claiming identity $i$, the classification score for the probe is computed as the log-likelihood ratio $\mathrm{LLR}(P, c_i)$ as follows:

$\mathrm{LLR}(P, c_i) = Q(P \mid c_i) - Q(P \mid \mathrm{UBM})$ , (7)

where $Q(P \mid c_i)$ gives the log of the likelihood of the probe $P$ being generated by the model $c_i$, and similarly, $Q(P \mid \mathrm{UBM})$ gives the log of the likelihood of the probe $P$ being generated by the UBM. In practice, we use the linear scoring method [31] to compute the score. This faster scoring method is derived as a first order Taylor series approximation of the normal log-likelihood ratio. The probe is accepted if the resulting score is higher than a preset threshold.
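For illustration, the direct log-likelihood ratio of Equation 7 (not the linear-scoring approximation used in practice) can be sketched as below. The sketch makes several simplifying assumptions that are not taken from the paper: covariances are diagonal, the client model shares the UBM weights and variances and differs only in its mean-vectors, the feature arrays of a probe are treated as independent so that their log-likelihoods add up, and all function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(features, weights, means, variances):
    """Sum over feature arrays of the log-likelihood under a GMM."""
    total = 0.0
    for x in features:
        comps = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(comps)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total


def llr_score(features, weights, variances, client_means, ubm_means):
    """Equation 7: Q(P | c_i) - Q(P | UBM)."""
    return gmm_log_likelihood(features, weights, client_means, variances) - \
        gmm_log_likelihood(features, weights, ubm_means, variances)


if __name__ == "__main__":
    weights = [0.5, 0.5]
    variances = [[1.0], [1.0]]
    ubm_means = [[0.0], [4.0]]
    client_means = [[1.0], [5.0]]  # UBM means shifted by a client offset
    print(llr_score([[1.0], [5.0]], weights, variances, client_means, ubm_means))
```

A probe whose features lie closer to the client means than to the UBM means yields a positive score, and a probe that the UBM explains better yields a negative one.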

3.4 ROC-SDK from Rank One Computing

This FR product, from Rank-One Computing, has been included in our study as it is one of the best performing COTS FR products today [32]. In particular, Rank-One Computing have demonstrated a FR performance of 92% (@ false reject rate = 1%) on the LFW dataset [33], and of 98% (@ false reject rate = 1%) on the NIST Special Database 32: Multiple Encounter Dataset (MEDS) [34]. The software development kit (SDK) distributed by the company includes two FR methods – ROCFR (a pose-independent FR system), and, ROCID (a FR system for images captured under controlled conditions). Here, we have tested the ROCFR method from the SDK (version 1.9). This method represents every face-image with a 144-byte template. The SDK provides functions for populating a gallery with enrollment templates, and for comparing probe-templates to the gallery of previously enrolled templates.

4 Experiment Datasets and Protocols

Each FR system is evaluated under two scenarios: licit and attack. In the licit scenario, where only bona fide samples are considered, the face-verification accuracy of the FR system is evaluated. Identities are enrolled in the system using a set of enrollment samples and the verification performance of the system is determined using the probe samples from various identities. In the attack scenario we consider PA samples for the identities enrolled using the licit protocol, to evaluate the vulnerability of the system to PAs. That is, when evaluating the vulnerability of a FR system, each enrolled identity is only probed with PAs on that identity. The datasets and protocols used to evaluate the performance of each FR system are described in this section. We start by presenting the metrics used to quantify the performance of each FR system under the two scenarios.

4.1 Performance Metrics

During the verification process, the user provides a biometric sample as probe, along with a claimed identity. Samples presented to a biometric verification system fall into three categories:

• Genuine: when the biometric sample and the claimed identity both belong to the user being authenticated;
• Zero-effort Impostor (ZEI): when the biometric sample belongs to the user, but the user claims the identity of a different enrolled client;
• Impostor presentation attack (PA): when the biometric sample matches the claimed identity, but neither corresponds to the user attempting to be verified.

In standardized biometrics terminology [11], both genuine and ZEI presentations are considered bona fide presentations. The task of a FR system is to accept only genuine presentations, and to reject both types of impostor presentations.

The verification performance of a FR system is reported using the false match rate (FMR), and the false non-match rate (FNMR) [35]. The FMR denotes the proportion of ZEI that are wrongly accepted as genuine samples (Type-I error). The FNMR refers to the proportion of genuine samples which are wrongly rejected (Type-II error). The two metrics may be summarized in a single number, the Half Total Error Rate (HTER), as $\mathrm{HTER} = (\mathrm{FMR} + \mathrm{FNMR})/2$.

Vulnerability of a biometric system to PAs is reported as the Impostor Attack Presentation Match Rate (IAPMR) [35], which denotes the proportion of PAs that are accepted by the FR system as genuine presentations.
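The metrics defined above can be computed from three lists of scores and a decision threshold, as in the following sketch. The function name, the dictionary layout, and the example scores are hypothetical. The acceptance rule (score strictly larger than the threshold) follows the convention stated for the CNN-based systems.

```python
def error_rates(genuine_scores, zei_scores, attack_scores, threshold):
    """Compute FMR, FNMR, HTER and IAPMR at a given decision threshold.

    A presentation is accepted when its score is larger than the threshold.
    """
    fmr = sum(s > threshold for s in zei_scores) / len(zei_scores)
    fnmr = sum(s <= threshold for s in genuine_scores) / len(genuine_scores)
    iapmr = sum(s > threshold for s in attack_scores) / len(attack_scores)
    return {
        "FMR": fmr,                  # zero-effort impostors wrongly accepted
        "FNMR": fnmr,                # genuine presentations wrongly rejected
        "HTER": (fmr + fnmr) / 2.0,  # half total error rate
        "IAPMR": iapmr,              # presentation attacks accepted as genuine
    }


if __name__ == "__main__":
    rates = error_rates(
        genuine_scores=[0.9, 0.8, 0.4],
        zei_scores=[0.1, 0.6, 0.2, 0.3],
        attack_scores=[0.85, 0.7, 0.3],
        threshold=0.5,
    )
    print(rates)
```

Note that FMR and FNMR are computed in the licit scenario, whereas IAPMR is computed in the attack scenario using the same threshold, so a threshold that gives a low HTER may still give a high IAPMR.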

4.2 Datasets

We have used four publicly available datasets, namely, MOBIO [36], REPLAY-ATTACK [37], MSU-MFSD [38], and REPLAY-MOBILE [39].

The MOBIO dataset [36] was collected for bi-modal (voice and face) biometric verification experiments using mobile devices. Therefore, it contains only licit protocols. Biometric samples were collected in the year 2010, using Nokia N93i mobile phones, for 100 male subjects and 52 female subjects, spread over six cities (in five countries). The videos have been recorded at a resolution of (320× 240) pixels. Experimental results reported in this paper have been produced using face-images from only the male subjects.

The REPLAY-ATTACK [37] face PA dataset, published in 2012, consists of 1,300 videos (each at least 9 seconds long) from 50 subjects. The videos have been recorded using a 13-inch MacBook at 320 × 240 pixel resolution, under two different lighting conditions. Three kinds of replay attacks are simulated in this dataset, namely, printed photo attacks, still face-images displayed on an electronic device, and replayed digital-videos. Three presentation attack instruments (PAIs) have been used to construct the attacks: photo-quality paper (for the printed photo attacks), iPhone 3Gs, and iPad (first generation). The two electronic devices have been used for both photo-attacks and video-attacks. Videos for each kind of attack have been captured under two conditions, one where the capturing device is hand-held, and the other where the capturing device rests on a stationary support.

The public version of the MSU-MFSD dataset [38] includes real-access and attack videos for 35 subjects. This dataset was published in 2015. Real-access videos (∼ 12 sec. long) have been captured using two devices: a 13-inch MacBook Air (using its built-in camera), and a Google Nexus 5 (Android 4.4.2) phone. Videos captured using the laptop camera have a resolution of 640 × 480 pixels, and those captured using the Android camera have a resolution of 720 × 480 pixels. The dataset also includes PA videos representing printed photo attacks, mobile video replay-attacks where video captured on an iPhone 5s is played back on an iPhone 5s, and high-definition (HD) video-replays (captured on a Canon 550D SLR, and played back on an iPad Air).

Published in 2016, the REPLAY-MOBILE dataset [39] is the newest of the three PA datasets used here. This dataset contains short (∼ 10 sec. long) HD (720 × 1280) resolution videos corresponding to 40 identities, recorded using two mobile devices: an iPad Mini 2 tablet and a LG-G4 smartphone. The videos have been collected under six different lighting conditions, involving artificial as well as natural illumination. PAs represented in this database have been constructed using two PAIs: matte-paper for print-attacks, and matte-screen monitor for digital-replay attacks. For each PAI, two kinds of attacks have been recorded: one where the user holds the recording device in hand, and the second where the recording device is stably supported on a stand. Thus, four kinds of attacks are represented in the database.

The various PAs represented in the three PA datasets are summarized in Table 1. In this table, PA I refers to printed photo attacks, PA II refers to low-quality replay attacks, and PA III denotes high-quality replay attacks. Figure 1 shows some example PAs from each PA dataset. The first column shows examples of bona fide presentations. The remaining three columns show examples of selected PAs represented in the corresponding dataset.

Table 1: Different combinations of PA in each dataset. PA I refers to printed photo attacks in the dataset, PA II refers to low-quality video replay attacks, and PA III corresponds to high-quality video replay attacks. The REPLAY-MOBILE dataset contains only two PA types: PA I and PA III. Low-quality video replay attacks have not been included in this dataset.

| Dataset | Attribute | PA I | PA II | PA III |
|---|---|---|---|---|
| REPLAY-ATTACK [37] | Capture device | 12.1 mega-pixel Canon PowerShot SX150 IS camera | 3.1 mega-pixel iPhone 3GS camera | 3.1 mega-pixel iPhone 3GS camera |
| REPLAY-ATTACK [37] | PAI | Triumph-Adler DCC 2520 color laser printer | iPhone 3Gs (320 × 480 pixels, 165 ppi pixel density) | iPad (768 × 1024 pixels, 132 ppi pixel density) |
| REPLAY-ATTACK [37] | Capture conditions (all PAs) | 2 lighting conditions: ‘normal’ (bright) and ‘adverse’ (in the shade, but not dark); hand-held or fixed | | |
| REPLAY-ATTACK [37] | Biometric sensor (all PAs) | Macbook (320 × 240) | | |
| MSU-MFSD [38] | Capture device | Canon 550D camera | 8 mega-pixel iPhone 5S camera | Canon 550D camera |
| MSU-MFSD [38] | PAI | HP Color Laserjet CP6015xh printer | iPhone 5s (640 × 1136 pixels, 326 ppi pixel density) | iPad Air (1536 × 2048 pixels, 264 ppi pixel density) |
| MSU-MFSD [38] | Capture conditions (all PAs) | Indoors, with artificial lighting | | |
| MSU-MFSD [38] | Biometric sensor (all PAs) | Macbook Air (640 × 480), Nexus 5 (720 × 480) | | |
| REPLAY-MOBILE [39] | Capture device | 18 mega-pixel Nikon Coolpix P520 | n/a | LG-G4 (720 × 1280) |
| REPLAY-MOBILE [39] | PAI | Konica Minolta ineo+224e color laser printer | n/a | Philips 227ELH matte monitor (1920 × 1080) |
| REPLAY-MOBILE [39] | Capture conditions (all PAs) | 6 lighting conditions (including artificial lights and natural light); hand-held or fixed | | |
| REPLAY-MOBILE [39] | Biometric sensor (all PAs) | iPad Mini 2 (720 × 1280), LG-G4 (720 × 1280) | | |


4.3

Protocols

As mentioned before, the MOBIO dataset comes with only a licit protocol, whereas the PA datasets include both licit and attack protocols. For our experiments with the MSU-MFSD, we have constructed new licit and attack protocols to analyze the vulnerability of the FR methods to PAs. The protocols in the various datasets are described in this section.

The set of clients in the licit protocol of each dataset is divided into three mutually exclusive groups: training, development and evaluation. The training group is reserved for producing the background models, when necessary. In our experiments, the UBM required for the ISV modeling method has been trained using the training group in the licit protocol of the MOBIO dataset. The training groups in the licit protocols of the PA datasets are ignored here. The development and evaluation groups each consist of two sets of samples: an enrollment set and a probe set. All clients assigned to a group are represented in both of its sets. That is, for each client in the development group, a certain number of enrollment samples as well as probe samples are available (and likewise for the evaluation group).

Attack protocols of the various PA datasets also consist of three non-overlapping groups: training, development, and evaluation. The training group is intended to be used for training a PAD method. The development group can be used for tuning the parameters of the PAD method being tested, and the final performance metrics are reported for the evaluation group. In the vulnerability analysis experiments using a given PA dataset, attack-videos in the evaluation group of the corresponding PAD protocol are used to test the vulnerability of the FR system in question.

The three groups in the male protocol of the MOBIO dataset are constructed as follows:

• training: 7881 samples, from 37 subjects,

• development: 2760 samples, from 24 subjects, and

• evaluation: 4370 samples, from 38 subjects.

As mentioned before, samples for the MOBIO dataset were collected at six different sites. Each group contains data exclusively from two sites.

The REPLAY-ATTACK dataset comes with a licit protocol consisting of 100 videos (one video per subject, per illumination condition), and several PAD protocols. The licit protocol is partitioned thus:

• training: two videos each, from 15 subjects,

• development: two videos each, from 15 subjects, and

• evaluation: two videos each, from 20 subjects.

[Figure 1 image grid: rows REPLAY-ATTACK, MSU-MFSD, REPLAY-MOBILE; columns BF, PA I, PA II, PA III.]

Figure 1: Examples of attacks represented in the various PA datasets. The first column shows a bona fide example, and the remaining columns show different types of PAs present in the dataset for the same identity. The various PAs are described in Table 1.

The REPLAY-ATTACK dataset also provides several PAD protocols [37]. In our experiments we have considered the grandtest protocol, which includes all PAs. This protocol consists of 200 bona fide presentations and 1,000 PAs (for a total of 1,200 videos).

The public version of MSU-MFSD [38] offers 70 bona fide presentation videos and 280 attack videos. As mentioned before, no licit protocol has been published for this dataset. For our experiments, therefore, we have devised a new licit protocol using some frames from the bona fide presentation videos. For each subject, the dataset contains two bona fide presentation videos, one captured using a Macbook Air (labeled "laptop") and the other using a smart-phone (labeled "android"). To construct the licit protocol for this dataset, we have used 10 frames from each bona fide video, sampled evenly between the first and the 180th frame. Frames from the "laptop" videos form the enrollment set, and those from the "android" videos form the probe set.
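The frame selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and it assumes that both the first and the 180th frame are included and that fractional positions are rounded to the nearest frame, neither of which is stated in the text.

```python
import numpy as np

def evenly_spaced_frames(first=1, last=180, count=10):
    """Return `count` 1-based frame numbers spread evenly over [first, last].

    Assumption: both end points are included and fractional positions
    are rounded to the nearest frame number.
    """
    positions = np.linspace(first, last, num=count)
    return [int(round(float(p))) for p in positions]

frames = evenly_spaced_frames()
```

With these assumptions, each bona fide video contributes 10 distinct frames to either the enrollment set ("laptop") or the probe set ("android").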

The licit protocol of the REPLAY-MOBILE dataset consists of 160 videos (four videos for each of 40 subjects). They are grouped as follows: 48 videos (12 subjects) for training, 64 videos (16 subjects) for development, and the remaining 48 videos for the test group. The PAD protocol of the dataset consists of 1,030 videos (390 bona fide presentations and 640 PAs of different kinds).

For ease of reference, the number of clients and samples used in the licit and attack protocols of the various datasets is summarized in Table 2.

Table 2: Composition of the licit and attack protocols of the four datasets used in our experiments. All protocols consist of three mutually exclusive groups of clients: training, development, and evaluation. For each dataset, the number of clients assigned to each group is listed here. The number in parentheses shows the total number of samples for the corresponding dataset and group.

Dataset | Licit: Train | Licit: Development | Licit: Evaluation | Attack: Train | Attack: Development | Attack: Evaluation

MOBIO | 37 (7881) | 24 (2760) | 38 (4370) | - | - | -

REPLAY-ATTACK | 15 (90) | 15 (90) | 20 (120) | 15 (300) | 15 (300) | 20 (400)

MSU-MFSD | 10 (200) | 10 (200) | 15 (300) | 10 (600) | 10 (600) | 15 (900)

REPLAY-MOBILE | 12 (1680) | 16 (2240) | 12 (1580) | 12 (1920) | 16 (2560) | 12 (1920)

The inputs for the ISV modeling, Gabor-Graphs, and ROC-SDK FR methods are gray-scale images, and the input to the VGG-Face FR method is a color (RGB) image. The ROC-SDK takes the entire image (one frame of video) as input. All other FR methods expect the input-image to be cropped appropriately, so that it contains only the face-region, with as little of the background as possible (see Section II-A). The specific input image dimensions required by the various FR systems are listed in Table III. The table also summarizes the hyper-parameters and the scoring method used in each FR system. For the ISV modeling method and the Gabor-Graphs method, we have used the hyper-parameter values recommended by Günther et al. [34].

Most FR methods utilize models that need to be trained before deployment. The ROC-SDK and the VGG-Face network come pre-trained, and can be used out-of-the-box. No background model is required for the Gabor-Graphs method. For the ISV modeling method, the UBM and the session-variability matrix (U) are trained using face-images in the male-protocol of the MOBIO dataset.

Once trained, the FR methods are tuned and evaluated using the MOBIO licit protocol, and the three PA datasets: REPLAY-ATTACK, MSU-MFSD, and REPLAY-MOBILE. For experiments with each of these datasets, the hyper-parameters of each FR method are tuned using data in the development group of the licit protocol (see Section III). Each client in this group is enrolled into the FR system, using the available enrollment samples. Then, all samples in the probe set of the development group are tested against every enrolled client. For every enrolled identity, some videos in the probe-set come from the same client, and are considered genuine presentations, whereas the remaining videos (corresponding to other clients) constitute the ZEI attacks on this identity. Thus, we obtain two sets of scores, one for genuine presentations, and the other for ZEI presentations. The hyper-parameters of each FR method are tuned to maximize the separation between the two sets of scores. A score-threshold is then selected, based on these two score-sets computed from the probe set of the development group. Presentations resulting in scores smaller than the score-threshold are rejected, and those at or above the score-threshold are accepted. Thus, given a score-threshold, the FMR and FNMR of the FR system can be estimated.
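The all-against-all scoring step described above can be illustrated with a short sketch. It is not taken from the Bob toolkit: the function names and the score-matrix layout (one row per enrolled client, one column per probe) are assumptions made for this example. The acceptance rule follows the text: scores at or above the threshold are accepted.

```python
import numpy as np

def split_scores(score_matrix, enrolled_ids, probe_ids):
    """Split an all-vs-all score matrix into genuine and ZEI scores.

    Assumed layout: score_matrix[i, j] is the score of probe j
    compared against enrolled client i.
    """
    scores = np.asarray(score_matrix, dtype=float)
    same_client = np.asarray(enrolled_ids)[:, None] == np.asarray(probe_ids)[None, :]
    return scores[same_client], scores[~same_client]

def fmr_fnmr(genuine, zei, threshold):
    """FMR: fraction of ZEI scores accepted; FNMR: fraction of genuine scores rejected."""
    fmr = float(np.mean(np.asarray(zei) >= threshold))
    fnmr = float(np.mean(np.asarray(genuine) < threshold))
    return fmr, fnmr

# Toy example: two enrolled clients, three probes.
genuine, zei = split_scores(
    [[0.9, 0.8, 0.2],
     [0.1, 0.3, 0.7]],
    enrolled_ids=["a", "b"],
    probe_ids=["a", "a", "b"],
)
```

In the toy example, a threshold of 0.5 separates the two score sets completely, whereas a lower threshold starts to accept ZEI presentations.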

The score-threshold may be chosen arbitrarily, but usually some heuristics are applied when choosing the threshold. Two common approaches for selecting the score-threshold are:

• Select the score-threshold, T_EER, corresponding to the equal error rate (EER). That is, T_EER is the threshold for which FMR ≈ FNMR.

• Select the score-threshold, T_θ, that corresponds to a specific FMR = θ (e.g., T_0.1 denotes the score-threshold leading to an FMR of 0.1% over the development group).

Next, the clients in the evaluation group are enrolled into the FR system using the data in the corresponding enrollment set, and samples in the probe set are used to generate scores for genuine and ZEI presentations. (Note that the groups are all mutually exclusive. That is, clients represented in the development group are not present in the evaluation group, and vice versa.) For a given evaluation-probe, if the resulting score is smaller than the score-threshold, the probe is rejected. Otherwise it is accepted. Thus, using the evaluation-probe scores, the FMR and FNMR of the FR method can be reported.

5

Experiments

The procedure used here to evaluate the vulnerability of an FR system is as follows. First, the face-verification performance of the system is evaluated using only the MOBIO dataset (i.e., in the licit scenario). The hyper-parameters of the FR system are tuned to achieve optimal discrimination between genuine presentations and ZEI presentations. The vulnerability of the FR system to PAs is then evaluated, for the different PA datasets, using the same hyper-parameter settings. In this section we first describe the methodology adopted for the experiments. Then, we present the results of face-recognition and vulnerability analysis for the five selected FR systems.

5.1

Methodology


Table 3: Summary of the parameters of the various FR systems used in this study. All methods, apart from ROC-SDK, expect input-images of fixed dimensions, where the image has been cropped to the face-region appropriately.

VGG-Face

Input: 224 × 224 color image. Features: output of 'fc7' layer, length 4096. Hyper-parameters: none (pre-trained CNN). Comparison metric: cosine distance measure.

LightCNN

Input: 128 × 128 gray image. Features: output of 'eltwise_fc1' layer, length 256. Hyper-parameters: none (pre-trained CNN). Comparison metric: cosine distance measure.

FaceNet (model: 20170512-110547)

Input: 160 × 160 color image. Features: output of 'embeddings:0' layer, length 128. Hyper-parameters: none (pre-trained CNN). Comparison metric: cosine distance measure.

ISV

Input: 80 × 64 gray image. Features: session-variability-compensated supervector, length ∼45K. Hyper-parameters: #GMMs: 512; dimension of session-variability subspace, U: 160. Comparison metric: log likelihood ratio.

ROC-SDK (v1.9)

Input: gray image of arbitrary size. Features: undisclosed, proprietary; size ∼140 bytes. Hyper-parameters: using FRONTAL, FR, PARTIAL, and ROLL filters. Comparison metric: undisclosed, proprietary.
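As an illustration of the cosine-based comparison listed for the three CNN methods, the sketch below scores a probe embedding against an enrolled model. Two points are assumptions of this example and are not taken from the paper: the score is expressed as a cosine similarity, so that higher values mean a better match (consistent with accepting scores at or above the threshold), and the hypothetical `enroll` helper simply averages the enrollment embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(embeddings):
    """Hypothetical enrollment: average the embeddings of the enrollment samples."""
    return np.mean(np.asarray(embeddings, dtype=float), axis=0)

# Toy 2-D embeddings; real embeddings have length 4096, 256 or 128 (Table 3).
model = enroll([[1.0, 0.0], [0.0, 1.0]])
score = cosine_similarity(model, [1.0, 1.0])
```

A cosine distance, if preferred, is simply one minus this similarity, with the accept/reject rule reversed accordingly.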


The FR and vulnerability-analysis experiments have been performed using the Python-based Bob [footnote 4] signal-processing and machine-learning toolkit [40, 41]. Besides a wide selection of machine-learning and signal-processing tools relevant for biometrics experiments, the Bob toolkit also includes interfaces for accessing the various datasets and associated protocols. All the FR systems studied here have been tested within the same software framework.

The inputs for the LightCNN, ISV modeling and ROC-SDK FR methods are gray-scale images, and the input to the VGG-Face and FaceNet FR methods is a color (RGB) image. The ROC-SDK takes the entire image (one frame of video) as input. All other FR methods expect the input-image to be cropped appropriately, so that it contains only the face-region, with as little of the background as possible (see Section 3.1). The specific input image dimensions required by the various FR systems are listed in Table 3. The table also summarizes the hyper-parameters and the scoring method used in each FR system. For the ISV modeling method, we have used the hyper-parameter values recommended by Günther et al. [42].

The ROC-SDK as well as the three CNN-FR methods (VGG-Face, LightCNN, and FaceNet) come pre-trained, and can be used out of the box. The UBM and the session-variability matrix (U in Eqn. 4) for the ISV modeling method are trained using the training set in the licit male-protocol of the MOBIO dataset.

Footnote 4: Website: www.idiap.ch/software/bob/

For experiments with each dataset, the hyper-parameters of each FR method are tuned using data in the development group of the licit protocol (see Section 4). The score-threshold may be chosen arbitrarily, but usually some heuristics are applied when choosing the threshold. Two common approaches for selecting the score-threshold are:

• Select the score-threshold, T_EER, corresponding to the equal error rate (EER). That is, T_EER is the threshold for which FMR ≈ FNMR.

• Select the score-threshold, T_θ, that corresponds to a specific FMR = θ (e.g., T_0.1 denotes the score-threshold leading to an FMR of 0.1% over the development group).

For a given score-threshold obtained on the development group, the FMR, FNMR, and HTER can be reported on the evaluation group.
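The two threshold-selection rules, and the reporting of HTER at a fixed threshold, can be sketched as follows. This is a simplified illustration under stated assumptions, not the Bob implementation: all function names are hypothetical, candidate thresholds are restricted to the observed development scores (plus one value above the highest score), and HTER is taken as the mean of FMR and FNMR.

```python
import numpy as np

def error_rates(genuine, zei, threshold):
    """Return (FMR, FNMR); scores at or above the threshold are accepted."""
    fmr = float(np.mean(np.asarray(zei) >= threshold))
    fnmr = float(np.mean(np.asarray(genuine) < threshold))
    return fmr, fnmr

def candidate_thresholds(genuine, zei):
    """Sorted unique scores, plus one value above the maximum (rejects everything)."""
    scores = np.unique(np.concatenate([genuine, zei]))
    return np.append(scores, np.nextafter(scores[-1], np.inf))

def eer_threshold(genuine, zei):
    """Candidate threshold where FMR and FNMR are closest (T_EER)."""
    cands = candidate_thresholds(genuine, zei)
    gaps = [abs(fmr - fnmr)
            for fmr, fnmr in (error_rates(genuine, zei, t) for t in cands)]
    return float(cands[int(np.argmin(gaps))])

def fmr_threshold(genuine, zei, target_fmr):
    """Lowest candidate threshold with FMR <= target_fmr (a fraction, e.g. 0.001 for 0.1%)."""
    for t in candidate_thresholds(genuine, zei):
        if error_rates(genuine, zei, t)[0] <= target_fmr:
            return float(t)

def hter(genuine, zei, threshold):
    """Half total error rate: mean of FMR and FNMR at the given threshold."""
    fmr, fnmr = error_rates(genuine, zei, threshold)
    return (fmr + fnmr) / 2

# Toy development-group scores.
dev_genuine = [0.6, 0.7, 0.8, 0.9]
dev_zei = [0.1, 0.2, 0.3, 0.65]
t_eer = eer_threshold(dev_genuine, dev_zei)
```

In practice the threshold is computed once on the development group and then passed, unchanged, to `error_rates` or `hter` together with the evaluation-group scores.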

5.2

Face Verification Performance

[Figure 2 plots: (a) ROC curves for the development group; (b) ROC curves for the evaluation group. Horizontal axis: FMR (logarithmic scale); vertical axis: 1 - FNMR (0.0 to 1.0). Curves: VGG-Face, LightCNN, FaceNet, ROC-SDK, ISV.]

Figure 2: Performances of FR systems in licit scenario of the MOBIO dataset. (a) ROC curves for the development group of the licit protocol of MOBIO. The vertical dashed line marks the point where FMR is 0.1% for every method (corresponding to the score-threshold T_0.1). (b) ROC curves for the evaluation group of the licit protocol of MOBIO. The colored circular markers (connected by a dashed line) indicate the FMR and FNMR of each FR method, for the T_0.1 score-threshold.

First, we establish a baseline face-verification performance for each FR system, based on the licit protocol of the MOBIO dataset. The face-verification results of the various FR systems are presented in two ways: using receiver operating characteristics (ROC) curves and using expected performance curves (EPC) [43].

The ROC curves shown in Figure 2 illustrate the face-verification performances of the various FR systems. They have been evaluated using the licit male-protocol of the MOBIO dataset. Figure 2(a) and (b) show the ROC plots for the development group and evaluation group, respectively. A score-threshold, T_0.1, is selected so as to achieve an FMR of 0.1% over the development group. This is indicated in Figure 2(a) by a dashed black vertical line. In Figure 2(b) the circular markers in the various colors (connected by a dashed trace line) indicate the FMR and FNMR of the corresponding FR methods for the score-threshold T_0.1. To compare the performances of the five FR methods on the evaluation group at score-threshold T_0.1, we should consider the distances of these five points from the top left corner of the plot (where (FMR, FNMR) = (0, 0), representing the ideal outcome). This way of comparing different FR systems follows the method used by NIST [32]. It is clear from the ROC plots that the FaceNet CNN significantly outperforms the other FR methods in this study. Indeed, it is noteworthy that the verification-accuracy of the FaceNet CNN remains extremely high even for very small values of FMR.

[Figure 3 plot: EPC curves. Horizontal axis: alpha (0.00 to 1.00); vertical axis: HTER (0.0 to 0.5). Curves: VGG-Face, LightCNN, FaceNet, ROC-SDK, ISV.]

Figure 3: Comparison of FR systems in licit scenario. The plot shows EPC curves for the five FR systems. The licit protocol of the MOBIO dataset was used to generate these curves. This plot confirms the conclusion drawn from Figure 2, that the FaceNet CNN significantly outperforms the other studied FR methods.

Expected performance curves (EPC) [43] provide an unbiased comparison of FR systems. Figure 3 shows the EPC plots of the five FR systems. These curves have been computed using the licit protocol of the MOBIO dataset, and show how the performance of each FR method evolves as a function of the trade-off between FMR and FNMR. This trade-off is controlled by the parameter 0 ≤ α ≤ 1 as follows:

WER = α × FMR + (1 − α) × FNMR, (8)

where WER (weighted error rate) is the weighted combination of FMR and FNMR. As always, the WER is computed from the classification results of the development group. Any particular value of WER corresponds to specific values of α, FMR and FNMR. Therefore, to achieve a specific WER, for a given value of α, the score-threshold should be chosen appropriately. Specifically, when constructing EPC, we consider a discrete set of values for α. For each value of α, the score-threshold that minimizes the WER for the development group is chosen. The selected score-threshold is then used to compute the HTER over the evaluation group, shown in the EPC plots (Fig. 3). Thus, the α values (on the horizontal axis) and the HTER values of the evaluation group (on the vertical axis of the plot) are related via the corresponding score-threshold computed from the development group.

Fig. 3 confirms that the three CNN based FR methods perform better than the other two methods (ISV modeling and ROC-SDK). In particular, the figure also shows that the FaceNet CNN promises significantly better performance than all the other FR methods, including the popular VGG-Face network.
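The EPC construction described above can be sketched as follows. This is an illustrative simplification rather than the Bob toolkit's routine: the function names are hypothetical, candidate thresholds are limited to the observed development scores plus one value above the maximum, ties in WER are broken in favour of the lowest threshold (an arbitrary choice), and HTER is taken as the mean of FMR and FNMR.

```python
import numpy as np

def error_rates(genuine, zei, threshold):
    """Return (FMR, FNMR); scores at or above the threshold are accepted."""
    fmr = float(np.mean(np.asarray(zei) >= threshold))
    fnmr = float(np.mean(np.asarray(genuine) < threshold))
    return fmr, fnmr

def min_wer_threshold(dev_genuine, dev_zei, alpha):
    """Threshold minimizing WER = alpha * FMR + (1 - alpha) * FNMR on the development group."""
    scores = np.unique(np.concatenate([dev_genuine, dev_zei]))
    cands = np.append(scores, np.nextafter(scores[-1], np.inf))
    wers = []
    for t in cands:
        fmr, fnmr = error_rates(dev_genuine, dev_zei, t)
        wers.append(alpha * fmr + (1 - alpha) * fnmr)
    return float(cands[int(np.argmin(wers))])  # first minimum = lowest threshold

def epc(dev_genuine, dev_zei, eval_genuine, eval_zei, alphas):
    """Return a list of (alpha, HTER on the evaluation group) points."""
    points = []
    for alpha in alphas:
        t = min_wer_threshold(dev_genuine, dev_zei, alpha)
        fmr, fnmr = error_rates(eval_genuine, eval_zei, t)
        points.append((float(alpha), (fmr + fnmr) / 2))
    return points

# Toy scores for the two groups.
curve = epc(
    dev_genuine=[0.6, 0.7, 0.8, 0.9], dev_zei=[0.1, 0.2, 0.3, 0.65],
    eval_genuine=[0.5, 0.75, 0.85, 0.95], eval_zei=[0.05, 0.15, 0.25, 0.7],
    alphas=[0.0, 1.0],
)
```

Note that the threshold never sees the evaluation scores: it is fixed on the development group for each α and only then applied to the evaluation group, which is what makes the comparison unbiased.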

5.3

Discussion of Face Verification Results

Comparing the two plots in Figure 2, we note that (for the MOBIO male dataset) the performance of the VGG-Face CNN is not consistent over the two groups. At the operating point corresponding to score-threshold T_0.1, on the development group, the ROC-SDK and the ISV modeling method both outperform the VGG-Face CNN. For the evaluation group, however, the VGG-Face network performs much better. Figure 2(b) shows that the FMR for the VGG-Face CNN is at least one order of magnitude smaller than the FMR for both the ROC-SDK and the ISV modeling method. By contrast, the LightCNN and FaceNet networks consistently perform better than the other FR methods on both groups.

The variable α in Equation 8 is a cost-variable that can be adjusted depending on the use-case. If minimizing FMR (Type I error) is critical, then high values of α are used to determine the score-threshold from the development group. In the less likely scenario where minimizing FNMR (Type II error) is of utmost importance, very low values of α should be used. When it is important to optimize both types of errors, α is set to 0.5, which corresponds to determining the score-threshold T_EER.

The EPC plots also indicate that the FaceNet CNN shows near perfect FR performance over almost the entire range of α values.

Note the initial horizontal segment of the EPC curve for ROC-SDK, indicating a very high HTER for values of α ≤ 0.1. This results from the peculiarity of the score-distribution produced by the ROC-SDK method. To explain this behaviour, let us look at the score-distributions shown in Figure 4. The score-distributions produced by the ROC-SDK for the two classes (genuine presentations in green, and ZEI-presentations in blue) are shown in Figure 4(a). The score-threshold for optimal separation between the two classes increases (along the horizontal axis) with α. The red line shows this evolution of the threshold. Note that the histograms shown in the plot represent score-distributions of the evaluation group, whereas the red line shows the evolution of the score-threshold as computed over the development group. Figure 4(b) shows the score-distributions and score-threshold evolution produced by the VGG-Face CNN.

We can see in Figure 4(a) that for some genuine presentations the ROC-SDK produces very low scores, which are similar to scores of ZEI presentations. Therefore, for low values of α, where the goal is to minimize FNMR even at the cost of high FMR, the selected score-threshold is so low that almost every presentation is accepted. This leads to the near-50% HTER. We do not see this behaviour in the EPC curves of the other FR methods because these methods do not produce abnormally low scores for genuine presentations. Figure 4(b) shows that the two score-distributions produced by the VGG-Face network do not have significant overlap, and therefore, even for low values of α, the selected score-threshold is not very low (relative to the score-distribution of the ZEI class). Therefore, we see a gradual change at the beginning of the EPC curve for the VGG-Face method, and not a sudden drop as seen for the ROC-SDK method. Of course, this analysis applies only to the MOBIO male dataset. For other datasets, the score-distributions produced by the ROC-SDK for the two classes need not have significant overlap.
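The near-50% figure follows from a short worked step, assuming HTER is defined in the usual way as the mean of FMR and FNMR (the definition is not restated in this section). When the threshold is low enough that almost every presentation is accepted, FMR approaches 1 while FNMR approaches 0:

```latex
\mathrm{HTER} = \frac{\mathrm{FMR} + \mathrm{FNMR}}{2} \approx \frac{1 + 0}{2} = 0.5
```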

5.4

Vulnerability Analysis

The trained FR systems are next evaluated with respect to their vulnerability to PAs, using the PA datasets: REPLAY-ATTACK, MSU-MFSD and REPLAY-MOBILE, as well as a combined PA dataset, created by merging the three PA datasets. In the combined dataset, the licit protocol consists of the licit protocols of the individual PA datasets, and similarly, the attack-protocol includes the attack-protocols of the three PA datasets. In each protocol of the combined dataset, the development group is composed of the development groups of the three constituent PA datasets, and likewise for the evaluation group. First, the development group of the licit protocol in each PA dataset is used to determine a score-threshold for classifying genuine- and ZEI-presentations. Presentations in the evaluation group of the licit protocol are then classified using this score-threshold. These classification results are reported using the FMR and FNMR metrics. The presentations in the attack protocol of the dataset are also classified using the same score-threshold, and the results are used to compute the IAPMR.
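The evaluation steps described above can be sketched as follows. This is only an illustration under stated assumptions: higher scores mean a better match, a presentation is accepted when its score is at or above the threshold, and the threshold has already been determined on the development group of the licit protocol. The function name and the toy scores are hypothetical.

```python
import numpy as np


def vulnerability_report(threshold, genuine_eval, zei_eval, attack_eval):
    """Apply one fixed score-threshold to the licit and attack protocols.

    Assumes scores >= threshold are accepted as matches.
    """
    genuine = np.asarray(genuine_eval, dtype=float)
    zei = np.asarray(zei_eval, dtype=float)
    attacks = np.asarray(attack_eval, dtype=float)

    fnmr = float(np.mean(genuine < threshold))    # genuine presentations rejected
    fmr = float(np.mean(zei >= threshold))        # zero-effort impostors accepted
    iapmr = float(np.mean(attacks >= threshold))  # presentation attacks accepted
    return {
        "FMR": fmr,
        "FNMR": fnmr,
        "HTER": (fmr + fnmr) / 2.0,
        "IAPMR": iapmr,
    }


# Toy evaluation-group scores, invented for illustration only.
report = vulnerability_report(
    threshold=0.5,
    genuine_eval=[0.7, 0.8, 0.9, 0.4],
    zei_eval=[0.1, 0.2, 0.3, 0.6],
    attack_eval=[0.65, 0.75, 0.85, 0.3],
)
```

The same threshold is used for both protocols, so a high IAPMR alongside a low FMR means the attacks score like genuine presentations rather than like zero-effort impostors.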

Vulnerability-analysis results for the five FR systems are summarized in Table 4. For each FR system, the table shows the results for the three PA datasets individually, as well as the combined PA dataset. The table shows both the face-verification accuracy and the vulnerability of each FR system. Values of the different metrics are reported for two score-thresholds, TEER and T0.1, determined using the development group. The values of the performance-metrics shown in the table have been computed over the evaluation groups of the various datasets. On the combined dataset the LightCNN shows the same face-verification performance as the VGG-Face system. However, the general conclusion drawn from the HTER values in the table is that the FaceNet CNN achieves the best face-verification performance for all three PA datasets, as well as for the combined dataset.

Figure 4: Score-distributions of two FR systems in the licit scenario of the MOBIO dataset. (a) Histograms of scores produced by the ROC-SDK method. (b) Score-histograms for the VGG-Face network. In each plot, the horizontal axis shows the verification scores and the left vertical axis shows the normalized count. The green histogram represents the distribution of scores for genuine presentations, and the blue histogram shows the scores for the ZEI presentations. These histograms are based on scores computed for the evaluation group. In each plot, the red line shows the evolution (from left to right) of the score-threshold computed from the development group, as a function of α (right vertical axis).

In Table 4, the IAPMR values for all three CNN-FR systems are consistently above 90% for all PA datasets (using both score-thresholds). In fact, when we consider the lower-bounds of the respective confidence-intervals, in a majority of experiments, the CNN-FR systems show vulnerabilities higher than 95%. These experiments clearly demonstrate that the CNN-based FR systems are highly vulnerable to PAs. Comparing the IAPMR values in the table, we note that both the ISV modeling method and the ROC-SDK method show levels of vulnerability significantly lower than those for the three CNN-based systems. For each dataset, the highest IAPMR values are shown in bold font in the table.

In the experiments where the vulnerability of an FR system is relatively low, the corresponding FNMR is unacceptably high. For example, the ISV modeling method shows an IAPMR of 14.8% for the REPLAY-MOBILE dataset, at score-threshold T0.1. The corresponding FNMR is 50.1%. A system with such a high FNMR would not be useful in practical applications.

This can also be observed from the score distributions of the FR systems. Figure 5 shows the score distributions of the FR systems for the combined PA dataset. For every FR system, the score distribution of the PAs (gray histogram) significantly overlaps the genuine score distribution (green histogram). This indicates that none of the FR systems is able to adequately distinguish between genuine presentations and attack-presentations. We note, however, that both the overlap between the PA-score-histogram and the genuine-score-histogram and the separation between the genuine and ZEI score distributions are stronger for all the CNN-based FR systems than for the ISV modeling and ROC-SDK methods. The IAPMR values at TEER shown in the plot (repeated in Table 4) are higher for the CNN-based methods than for the other two FR methods.

The ROC-SDK method failed to generate templates for a certain number of input images, probably due to some mechanism for rejecting low-quality input samples. This may explain the relatively lower vulnerability of this method, as some PA (as well as genuine) samples may have been rejected based on their quality. We are not able to verify this hypothesis, as implementation details of ROC-SDK are not public knowledge.

Table 5 shows the IAPMR values for each PA type (see Table 1 for an explanation of the various PA types in each dataset). The values in the table show the average success-rate for each type of PA, over the five FR methods. Clearly, all PA types are highly successful in spoofing all five FR methods. The table indicates that the FaceNet CNN was successfully spoofed more often than the other FR methods.
