Deeply Vulnerable -- a study of the robustness of face recognition to presentation attacks

(1)

Deeply Vulnerable – A Study of the Robustness of Face Recognition to

Presentation Attacks

Amir Mohammadi, Sushil Bhattacharjee, and S´

ebastien Marcel

∗†

Abstract

The vulnerability of deep-learning based face-recognition (FR) methods, to presentation attacks (PA) is stud-ied in this paper. Recently proposed FR methods based on deep neural networks (DNN) have been shown to outperform most other methods by a significant margin. In a trustworthy face-verification system, however, max-imizing recognition-performance alone is not sufficient – the system should also be capable of resisting various kinds of attacks, including presentation-attacks (PA). Previous experience has shown that the PA-vulnerability of FR systems tends to increase with face-verification accuracy. Using several publicly available PA datasets, we show that DNN-based FR systems compensate for variability between bona fide and PA samples, and tend to score them similarly, which makes such FR systems extremely vulnerable to PAs. Experiments show the vulnerability of the studied DNN-based FR systems to be consistently higher than 90%, and often higher than 98%.

1 Introduction

Progress of face recognition (FR) technology, from strongly constrained models to fully unconstrained environments, has been enabled by the adoption of successively complex learning paradigms. The first generation of large scale appearance based FR systems, such as Eigenfaces [1] and Fisherfaces [2], attempted to model the face-variability in a simple linear sub-space. Subsequently, methods such as joint factor analysis (JFA), inter-session variability (ISV) modeling [3] and probabilistic linear discriminant analysis (PLDA) [4, 5] were developed to better model variability in face-images. Deep-learning based methods, notably convolutional neural networks (CNN) [6, 7, 8, 9], have become very popular in recent years, due to their near-perfect recognition accuracy on unconstrained datasets such as ‘labeled faces in the wild’ (LFW) [10].

A FR system may be used in two kinds of applications: face-identification or face-verification. Face-identification is a one-to-many problem, where the face-image to be identified is tested against all previously enrolled identities, to see if it matches any of the known identities. In face-verification systems, a claimed-identity is provided along with the input face-image, and the problem is simply to verify that the input image corresponds to the claimed-identity. In this paper the terms recognition and verification have been used interchangeably, and refer to the use of a FR system in verification mode.

FR systems should not only have very high accuracy, but should also be robust to attacks. For example, in access-control applications, high FR recognition accuracy in unconstrained environments is not a significant asset, when not accompanied with resistance to presentation attacks (PA), also known as spoof-attacks. A face PA is said to have occurred when a face biometric-sample is presented to the camera of a FR system “with the intention of interfering with the operation of biometric recognition” [11]. For example, person A may attack a FR system by claiming to be an enrolled client, B, and presenting a printed photo of the person B to the camera. Other examples of PAs are: digital photos or videos displayed on an electronic screen, face sketches, and make-up [12].

Most high-accuracy FR systems today rely on deep-learning methods. Indeed, deep-learning based FR systems are already being deployed in commercial face-verification applications [8]. In this context their vulnerability to PAs becomes an important factor in the trustworthiness of the entire face-verification process. Several studies [6, 7, 8, 9] have explored the capacity of deep-learning based FR-systems to handle variations in pose, illumination and scale, that is, challenges related to variability of face-biometric samples of a single client. Karahan et al. [13] studied the

∗_{Authors are affiliated with the Idiap Research Institute, Martigny, Switzerland.}

(2)

fragility of CNN-based FR systems when confronted with image degradations such as blurring, occlusion, compression-artifacts, and color-distortion. Robustness to such degradations is necessary for FR systems operating in relatively unconstrained environments. For face-verification systems functioning under controlled conditions, the vulnerability to PAs is a much more significant concern.

The European research project TABULA RASA [14] indicated a positive correlation between the efficacy of FR systems and their vulnerability to PAs. In other words, FR methods with higher face-verification accuracy tended to be more vulnerable to PAs. Given that FR systems, especially the popular CNN-based FR methods, have not been explicitly trained for presentation attack detection (PAD), one would intuitively expect such systems to be vulnerable to PAs. To our knowledge, though, the vulnerability of deep learning based FR systems to PAs has not been studied in detail.

In this paper we present a large-scale empirical study of the vulnerability to PAs of five recent FR systems, including 3 recent CNN-based methods. Specifically, we compare the vulnerability of the following systems:

• three CNN-based FR systems, namely: – VGG-Face [7],

– LightCNN [15], and – FaceNet [16]

• Intersession Variability (ISV) modeling [3], and • ROC-SDK from Rank-One Computing1_.

The main contribution of this paper is empirical evidence to support the claim that the CNN-based FR method is extremely vulnerable to PAs. Our experiments, using three recent PA datasets, show that other highly rated FR methods are also very vulnerable to PAs. Another significant contribution of this work is that all the datasets, protocols, and source-code for the experiments reported in the paper will be made publicly available on the web. This will allow other researchers to reproduce the results presented here.

A brief review of previous vulnerability studies for FR methods is presented in Section 2. This discussion justifies the present work, and helps us to offset our results from those in previous published works. The face-recognition methods included in our study are described in Section 3. The experiments reported in this paper have been conducted on publicly available datasets, with well-defined protocols for face-recognition and face-PAD experiments. In Section 4, we describe the datasets, the kinds of PAs represented in each dataset, and the protocols for training and testing face-recognition offered in each dataset. Experimental results are summarized and discussed in Section 5, and concluding remarks are presented in Section 6.

2 Related Work

There have been very few studies specifically exploring the vulnerability of FR systems to different kinds of attacks. Duc et al. [17] have investigated the vulnerability of FR based login on computers (specifically Lenovo, Toshiba and Asus products) to PAs. Kose et al. [18] present a study of the vulnerability of FR methods to 3D-mask attacks. They evaluate two different FR methods – one designed specifically for FR using 3D data [19], and, a generic approach to FR using local binary patterns (LBP) based face-image comparison. Hadid [20] discusses the vulnerability of the parts-based GMM FR method to PAs. In a recently published work, Scherhag et al. [21], the authors have studied the vulnerability of FR methods to morphed-face attacks. Here attacks are constructed by morphing face images from two enrolled identities, and the challenge is to see if the two enrolled identities can both be successfully spoofed using the same morphed image. Both FR methods tested in this work – VeriFace SDK2_{(a commercial off-the}

shelf (COTS) FR product from Neurotechnology), and OpenFace [22] (an open-source, pre-trained DNN based FR system) – are shown to be highly susceptible to such attacks [21].

Although deep-learning based FR systems have attracted considerable attention, both academic and commercial, as far as we are aware, no previous study has investigated the vulnerability of CNN-based FR systems using a variety of PAs. Most of the studies cited above have been performed on FR methods that rely on hand-crafted features.

1_{Company website: www.rankone.io}

2_{Website: www.neurotechnology.com/verilook.html} 2

(3)

The study by Scherhag et al. [21] that has included a DNN based FR system (OpenFace) was done in the restricted context of a single class of attacks based on morphed-images. Such attacks are expected to have a very narrow range of application.

In this paper, we focus on the vulnerability of CNN based FR methods to PAs. This study includes three publicly available pre-trained CNN-FR models. Our experiments demonstrate that all these FR methods are consistently highly vulnerable to several classes of PAs. For comparison, we have also included the ISV modeling method, which relies on hand-crafted features, and ROC-SDK – a COTS FR product – in this study. We use four publicly available FR and PA datasets to estimate the vulnerability of the five FR methods to a variety of PAs. Our study has been motivated by the hypothesis that although CNN-FR systems outperform FR methods based on hand-crafted features, they are also more vulnerable to PAs. This assertion makes intuitive sense to researchers in face-biometrics, given that the FR methods have not been explicitly trained to detect PAs. However, this is the first large-scale study to empirically quantify the vulnerability of these FR methods.

3 FR Systems Included In This Study

A typical FR system functions in three phases: training, enrollment, and probing. In the training phase a background model, assumed to broadly represent the space of face-images, is constructed using training data. In the enrollment phase, the FR system generates templates for the given enrollment samples, which are then stored in the gallery. In the probing (operational) phase, the FR system is presented with a probe-image and a claimed identity. A template is created for the given probe-sample, and is compared with the set of enrolled templates associated with the claimed identity. The result of the comparison is a score, which is then thresholded to produce a decision (accept or reject). To mitigate the influence of variations on the actual face-recognition process, the raw input image is usually preprocessed, to extract sub-images representing individual faces. Geometric and color transforms may also be applied to the extracted face-images, depending on the requirements of the specific FR method. The result of the pre-processing stage is a normalized face image, of predefined size and scale, that may be processed by a FR system. Before describing the different FR methods, we explain the pre-processing steps applied to normalize the input face images.

3.1 Face Image Normalization

Several FR systems expect the input to be a gray-level image. Therefore, our first pre-processing step is to convert the input RGB image to a gray-level image (essentially, the Y component of the YUV domain representation of the image).

The next step is to locate each face represented in the input image. The ROC-SDK takes a full frame image as input. For the other FR methods in our study (CNN-FR, and ISV modeling) the input is a suitably cropped face-image. There are two ways to accomplish the face-localization step explicitly:

• provide additional information, such as annotations for specific facial landmarks, that the FR system can use to extract a normalized face-region, or,

• provide as input a cropped and normalized face-image.

In our experiments, the normalized face-region is extracted using annotations identifying the center of each eye. Imposing the constraints that the straight-line joining the two eye-centres should be horizontal, and should have a predefined length, an affine transform can be used to extract a normalized face-image of fixed size from the given input image. Some face-biometrics test datasets include annotations for the eye locations. For datasets where this information is not explicitly provided, we use a two-step process to extract the eye-positions. First, a boosted-classifier, trained to detect faces, is used to localize the face-region [23]. Next, facial-landmarks are extracted from the face-region using the flandmarks method [24]. This method provides pixel coordinates for seven facial landmarks, including the two corners of each eye. The pixel-location for the center of each eye can be interpolated from these landmarks.

Note that for the VGG-Face and LightCNN networks, the input images are normalized color (RGB) images, whereas the FaceNet and ISV modeling method expect normalized gray-level face-images of fixed 2D-shape.

The various FR methods are discussed in the following sections. Specific implementation details, such as the size of the normalized face-image, tunings of the hyper-parameters, and so on, are provided in the Section 5.

(4)

3.2 CNN-Based Systems

CNNs [25, 26], a class of deep neural networks (DNN), have been shown to be extremely accurate in FR tasks. For FR applications, CNNs are usually trained for face identification, using face-images as input, and the set of identities to be recognized as output. The last few layers of the network, including the output layer, are typically fully-connected (fc for short) layers. These layers may be seen as the classifier-stage of the network, whereas the preceding (convolutional and pooling) layers may be considered to constitute the feature-extraction stage. Although CNNs are typically trained end-to-end for classification or regression tasks, they are often also used as feature-extraction tools. The terms representation and embedding are used interchangeably, to denote the outputs of the various layers of a deep network. Representations generated by a specific layer of a pre-trained DNN may be used as templates (feature-vectors) representing the corresponding input images. These templates may be subsequently be used to train a classifier, or to compare the respective input images using appropriate similarity measures.

Taigman et al. [6] have proposed DeepFace, a 9-layer CNN for FR. The input faces are first aligned to have an upright position using 3D modeling and piecewise affine transforms, before being fed to the network. This network achieves a recognition accuracy of 97.25% on the LFW dataset. Schroff et al. [8] report an accuracy of 99.63% on the LFW dataset using FaceNet, a CNN with 7.5 million parameters, trained using a novel triplet loss function. Other CNN architectures showing similar FR accuracy on the LFW dataset include the DeepID series by Sun et al. [9].

In this work we analyze the vulnerability to PAs of three CNN-FR methods: the VGG-Face [7], LightCNN [15], and FaceNet [16]. The VGG-Face network has been included in our study because it is a very well known and widely referenced CNN for FR applications. The other two network models have been included because they are newer than the Face network, and have both shown a FR performance even better than that of the VGG-Face CNN. Another important reason why these three specific CNNs have been studied in this work is that the respective creators of these networks have made publicly available pre-trained models that can be directly used in our experiments. The following sections provide brief summaries of the selected CNNs, and describe how they have been used in our experiments.

3.2.1 VGG-Face CNN

The VGG-Face network model is made publicly available by the Visual Geometry Group3 _{at Oxford University.}

Involving almost 135 million trainable parameters, this network has been shown to achieve a FR accuracy of 98.95% on the LFW unrestricted setting [10]. VGG-Face is a CNN consisting of 16 hidden layers (see Table 3 in [7]). The initial 13 hidden layers are convolution and pooling layers, and the last three layers are fully-connected (’fc6’, ’fc7’, and ’fc8’, following the nomenclature used by Parkhi et al. [7]). The input to this network is an appropriately cropped color face-image of pre-specified dimensions.

We use the representation produced by the ’fc7’ layer of the VGG-Face CNN as a template for the input image. When enrolling a client, the template produced by the VGG-Face network for each enrollment-sample is recorded. For verification, the network is used to generate a template for the probe face-image, which is then compared to the enrolled templates of the claimed identity using the Cosine-similarity measure given by Equation 1 (where_kAk2

represents the L2-norm of vector A).

Cosine Similarity(A, B) = A· B kAk2· kBk2

(1) The score assigned to the probe is the average Cosine-similarity of the probe-template to all the enrollment-templates of the claimed identity. If the score is larger than a predetermined threshold, the probe is accepted as a match for the claimed identity.

3.2.2 LightCNN

Wu et al. [15] have proposed a new CNN, called LightCNN, for FR. Their goals in designing this network were to have significantly fewer trainable parameters compared to other state-of-the-art FR-CNNs, as well as to be able to handle noisy labels that are inevitable in datasets mined automatically from the web. Compared to the VGG-Face network, the number of parameters in the LightCNN model is smaller by a factor of 10 (_{∼ 12 million parameters).} This is achieved mainly through the use of a newly introduced Max-Feature-Map (MFM) activation [15], which is a

3_{Website: www.robots.ox.ac.uk/˜vgg/software/vgg_face} 4

(5)

non-linear extension of the maxout activation operation. Although the MFM operator is more expensive to compute than the ReLU unit that it replaces, the large overall reduction of number of units per layer, made possible by the use of the new operator, still leads to smaller computation time for the forward-pass of the LightCNN network (by a factor of 5, relative to the VGG-Face CNN [15]).

In our experiments we use as templates the 256-D representation produced by the ’eltwise fc1’ layer of LightCNN. Note that this template is much smaller than the 4096-D vector produced by VGG-Face. Despite these relative efficiencies, the LightCNN network achieves a FR accuracy of 99.33% on the LFW dataset (in unrestricted setting) – outperforming the VGG-Face network by a small margin. As with the VGG-Face network, the Cosine measure (Eqn. 1) is used to compare an input probe template to the relevant enrollment templates.

3.2.3 FaceNet CNN

Very recently, David Sandberg has made publicly available his implementation as well as trained models for a new FR-CNN named FaceNet[16]. This is the closest open-source implementation of the FaceNet CNN proposed by Schroff et al. [8], for which neither a pre-trained model nor the training-set is publicly available. Sandberg’s FaceNet implements an Inception-ResNet V1 DNN architecture [27]. Two pre-trained FaceNet models have been published [16]. In our tests, we have used the 20170512-110547 model, trained on the MS-Celeb-1M dataset [28]. Using this model, FaceNet achieves a FR performance of 99.2% on the LFW dataset [16], which is comparable to the performance of LightCNN. Note that the 128-D representation produced by this network (at the ’embeddings:0’ layer) is half the size of that produced by LightCNN. We use this representation to construct enrollment and probe templates, which are compared to each other using the Cosine measure (Eqn. 1).

3.3 GMM-based FR using Inter-Session Variability Modeling

Inter-session variability modeling (ISV) is an extension of the Gaussian Mixture Models (GMM) based method for face verification. We have used the ISV modeling approach proposed by Wallace et al. [3]. Among the FR methods included in this study, this is the only method that adopts a parts based approach. The input normalized face image is first decomposed into a set of square sub-images of size (12× 12), with an overlap of 11 pixels in each direction. Let N represent the total number of sub-images extracted from the input face-image. For each sub-image, a predetermined number, D, of low-frequency DCT (discrete cosine transform) coefficients is computed. Thus, a set of N D-dim arrays of DCT coefficients is extracted for each input face-image. The DCT coefficients are normalized to zero-mean and unit standard-deviation in each of the D dimensions. The resulting set of N normalized D-dim DCT arrays is used to represent the input face-image.

To use a GMM for FR, first, a universal background model (UBM) is constructed from the set of training-images (each represented by N D-dim DCT arrays). The UBM is a GMM, that is, a weighted sum of K Gaussians, where each Gaussian is represented by a D-dim mean-vector, and a (D_{× D)-dim covariance-matrix. The parameters of the} Gaussians components of the UBM, as well as their relative weights, are learned from the training data. The UBM describes the probability distribution of face sub-images in the D-dim DCT-coefficient space.

A supervector, m, is then constructed by concatenating K the mean-vectors of all the components of the GMM. To enroll a client i, a supervector si is derived as follows:

si= m + di , (2)

where di is an offset-vector specific to client i, computed using maximum a posteriori (MAP) adaptation from the

UBM [29].

The supervector siis assumed to be client-specific. In other words, different enrollment images of the same client,

i, should, in theory, produce very similar supervectors. For a given client, however, enrollment set typically contains images captured in different sessions, with varying pose, illumination, and facial expressions. This within-class variability for client i is also reflected in si, and can result in diminished recognition accuracy of the GMM-based

FR method [3].

ISV-modeling has been developed to enhance the GMM based approach by explicitly modeling and suppressing the within-class variability of each enrolled client. We assume that each enrollment image, collected in a separate session, results in a unique supervector. For enrollment image (i, j) (the jth_{enrollment image of client i), let}

(6)

where µi,jis the supervector corresponding to the enrollment image (i, j), ui,jis the offset induced by specific session

conditions of image (i, j), and di is the client-dependent offset. (Note that the client-dependent offset di in Eqn. 3

is free from session variability, unlike the diin Eqn. 2.) The session-dependent offset, ui,j can be expressed as:

ui,j= U xi,j , (4)

where, U is a low-dimensional matrix modeling the session-variability, and xi,j∼ N (0, I ) is a latent variable corre-sponding to the session-variability. Similarly, the client-dependent offset, di, can be represented as

di= Dzi , (5)

where D is a diagonal matrix derived from the diagonal variances of the UBM, and zi ∼ N (0, I ) is a client-specific latent random variable. The subspace U is estimated using an expectation-maximization (EM) algorithm. For details of the ISV technique for face-verification, please refer to the works of Wallace et al. [3] and Vogt et al. [30].

When enrolling a new client, i, using a set of enrollment images (indexed by j), the latent variables xi,j and zi

are estimated from the enrollment images, and finally, the client-specific supervector, ci, is computed as:

ci= m + Dzi . (6)

Thus, a single supervector, ci, is stored for each enrolled client.

Given a probe image,P, claiming identity, i, the classification score for the probe is computed as the log-likelihood ratio LLR(P, ci) as follows:

LLR(P, ci) = Q(P|ci)− Q(P|UBM) , (7)

where, Q(_P|ci) gives the log of the likelihood of the probe P being generated by the model ci, and similarly,

Q(P|UBM) gives the log of the likelihood of the probe P being generated by the UBM. In practice, we use the linear scoring method [31] to compute the score. This faster scoring method is derived as a first order Taylor series approximation of the normal log-likelihood ratio. The probe is accepted if the resulting score is higher than a preset threshold.

3.4 ROC-SDK from Rank One Computing

This FR product, from Rank-One Computing, has been included in our study as it is one of the best performing COTS FR products today [32]. In particular, Rank-One Computing have demonstrated a FR performance of 92% (@ false reject rate = 1%) on the LFW dataset [33], and of 98% (@ false reject rate = 1%) on the NIST Special Database 32: Multiple Encounter Dataset (MEDS) [34]. The software development kit (SDK) distributed by the company includes two FR methods – ROCFR (a pose-independent FR system), and, ROCID (a FR system for images captured under controlled conditions). Here, we have tested the ROCFR method from the SDK (version 1.9). This method represents every face-image with a 144-byte template. The SDK provides functions for populating a gallery with enrollment templates, and for comparing probe-templates to the gallery of previously enrolled templates.

4 Experiment Datasets and Protocols

Each FR system is evaluated under two scenarios: licit and attack. In the licit scenario, where only bona fide samples are considered, the face-verification accuracy of the FR system is evaluated. Identities are enrolled in the system using a set of enrollment samples and the verification performance of the system is determined using the probe samples from various identities. In the attack scenario we consider PA samples for the identities enrolled using the licit protocol, to evaluate the vulnerability of the system to PAs. That is, when evaluating the vulnerability of a FR system, each enrolled identity is only probed with PAs on that identity. The datasets and protocols used to evaluate the performance of each FR system are described in this section. We start by presenting the metrics used to quantify the performance of each FR system under the two scenarios.

4.1 Performance Metrics

During the verification process, the user provides a biometric sample as probe, along with a claimed identity. Samples presented to a biometric verification system fall into three categories:

(7)

• Genuine: when the biometric sample and the claimed identity both belong to the user being authenticated; • Zero-effort Impostor (ZEI): when the biometric sample belongs to the user, but the user claims the identity of

a different enrolled client;

• Impostor presentation attack (PA): when the biometric sample matches the claimed identity, but neither correspond to the user attempting to be verified.

In standardized biometrics terminology [11], both genuine and ZEI presentations are considered bona fide presen-tations. The task of a FR system is to accept only genuine presentations, and to reject both types of impostor presentations.

The verification performance of a FR system is reported using the false match rate (FMR), and the false non-match rate (FNMR) [35]. The FMR denotes the proportion of ZEI that are wrongly accepted as genuine samples (Type-I error). The FNMR refers to the proportion of genuine samples which are wrongly rejected (Type-II error). The two metrics may be summarized in a single number as the Half Total Error Rate (HTER) as HT ER = (F M R + F N M R)/2.

Vulnerability of a biometric system to PAs is reported as the Impostor Attack Presentation Match Rate (IAPMR) [35], which denotes the proportion of PAs that are accepted by the FR system as genuine presentations.

4.2 Datasets

We have used four publicly available datasets, namely, MOBIO [36], REPLAY-ATTACK [37], MSU-MFSD [38], and REPLAY-MOBILE [39].

The MOBIO dataset [36] was collected for bi-modal (voice and face) biometric verification experiments using mobile devices. Therefore, it contains only licit protocols. Biometric samples were collected in the year 2010, using Nokia N93i mobile phones, for 100 male subjects and 52 female subjects, spread over six cities (in five countries). The videos have been recorded at a resolution of (320_{× 240) pixels. Experimental results reported in this paper have} been produced using face-images from only the male subjects.

The REPLAY-ATTACK [37] face PA dataset, published in 2012, consists of 1, 300 videos (each, at least 9 seconds long) from 50 subjects. The videos have been recorded by using a 1300 Macbook at 320× 240 pixel resolution, under two different lighting conditions. Three kinds of replay attacks are simulated in this dataset, namely, printed photo attacks, still face-images displayed on an electronic device, and replayed digital-videos. Three presentation attack instruments (PAIs) have been used to construct the attacks: photo-quality paper (for the printed photo attacks), iPhone 3Gs, and iPad (first generation). The two electronic devices have been used for both, photo- as well as video-attacks. Videos for each kind of attack have been captured under two conditions, one where the capturing device is hand-held, and the other where the capturing device rests on a stationary support.

The public version of the MSU-MFSD dataset [38] includes real-access and attack videos for 35 subjects. This dataset was published in 2015. Real-access videos (_{∼ 12 sec. long) have been captured using two devices: a 13}00

MacBook Air (using its built-in camera), and a Google Nexus 5 (Android 4.4.2) phone. Videos captured using the laptop camera have a resolution of 640_{× 480 pixels, and those captured using the Android camera have a resolution} of 720_{× 480 pixels. The dataset also includes PA videos representing printed photo attacks, mobile video} replay-attacks where video captured on an iPhone 5s is played back on an iPhone 5s, and high-definition (HD) video-replays (captured on a Canon 550D SLR, and played back on an iPad Air).

Published in 2016, The REPLAY-MOBILE dataset [39] is the newest of the three PA datasets used here. This dataset contains short (∼ 10 sec. long) HD (720 × 1280) resolution videos corresponding to 40 identities, recorded using two mobile devices: an iPad Mini 2 tablet and a LG-G4 smartphone. The videos have been collected under six different lighting conditions, involving artificial as well as natural illumination. PAs represented in this database have been constructed using two PAIs: matte-paper for print-attacks, and matte-screen monitor for digital-replay attacks. For each PAI, two kinds of attacks have been recorded: one where the user holds the recording device in hand, and the second where the recording device is stably supported on a stand. Thus, four kinds of attacks are represented in the database.

The various PAs represented in the three PA datasets are summarized in Table 1. In this table, PA I refers to printed photo attacks, PA II refers to low-quality replay attacks, and PA III denotes high-quality replay attacks. Figure 1 shows some example PAs from each PA dataset. The first column shows examples of bona fide presentations. The remaining three columns show examples of selected PAs represented in the corresponding dataset.

(8)

Table 1: Different combinations of PA in each dataset. PA I refers to printed photo attacks in the dataset, PA II refers to low-quality video replay attacks, and PA III corresponds to high-quality video replay attacks. The REPLAY-MOBILE dataset contains only two PA types: PA I and PA III. Low-quality video replay attacks have

not been included in this dataset. 7

Dataset PA I PA II PA III

REPLAY-ATTACK

Capture Device 12.1 mega-pixel Canon_{PowerShot SX150 IS camera} 3.1 mega-pixel iPhone 3GS camera 3.1 mega-pixel iPhone 3GS camera PAI Triumph-Adler DCC 2520_{color laser printer} iPhone 3Gs (320 × 480 pixels)_{(165 ppi pixel density)} iPad (768 × 1024 pixels)_{(132 ppi pixel density)} Capture Conditions 2 lighting conditions (’normal’ (bright) and ’adverse’ (in the shade, but not dark))_{Hand-held or fixed}

Biometric sensor Macbook (320 × 240) MSU-MFSD [31]

Capture Device Canon 550D camera 8 mega-pixel iPhone 5S camera Canon 550D camera PAI HP Color Laserjet_{CP6015xh printer} iPhone 5s (640 × 1136 pixels)_{(326 ppi pixel density)} iPad Air (1536 × 2048 pixels)_{(264 ppi pixel density)} Capture Conditions Indoors, with artificial lighting

Biometric sensor Macbook Air (640 × 480), Nexus 5 (720 × 480) REPLAY-MOBILE [32]

Capture Device 18 mega-pixel Nikon Coolpix P520 LG-G4 (720 × 1280) PAI Konica Minolta ineo+_{224e color laser printer} Philips 227ELH matte_{monitor (1920 × 1080)} Capture Conditions 6 lighting conditions (including artificial lights and natural light)_{Hand-held or fixed}

Biometric sensor iPad Mini 2 (720 × 1280), LG-G4 (720 × 1280)

TABLE I: Different combinations of PA in each dataset.PA I refers to printed photo attacks in the dataset, PA II refers to low-quality video replay attacks, andPA III corresponds to high-quality replay attacks. The REPLAY-MOBILE dataset contains only two PA types:PA I and PA III. Low-quality video replay attacks have not been included in this dataset.

evaluation group. In the vulnerability analysis experiments using a given PAD dataset, attack-videos in the evaluation group of the corresponding PAD protocol are used to test the vulnerability of the FR system in question.

The three groups in the male protocol of the MOBIO dataset are constructed as follows:

• training: 7881 samples images, from 37 subjects, • development 2760 samples, from 24 subjects, and • evaluation: 4370 samples, from 38 subjects.

As mentioned before, samples for the MOBIO dataset were collected at six different sites. Each group contains data exclusively from two sites.

The REPLAY-ATTACK dataset comes with a licit pro-tocol consisting of 100 videos (one video per subject, per illumination condition), and several PAD protocols. The licit protocol is partitioned thus:

• training: two videos each, from 15 subjects,

• development: two videos each, from 15 subjects, and • evaluation: two videos each, from 20 subjects. The dataset also provides several PAD protocols [30]. In our experiments we have considered the grandtest protocol, which includes all PAs. This protocol consists of 200 bona fide presentations, and 1, 000 PAs (i.e., a total of 1, 200 videos).

The public version of MSU-MFSD [31] offers 70 bona fide presentation videos and 280 attack videos. As mentioned before, no licit protocol has been published for this dataset. For our experiments, therefore, we have devised a new licit protocol using some frames from the bona fide presentation videos. For each subject, the dataset contains two bona fide presentation videos, one captured using a Macbook Air (labeled ”laptop”) and the other using a smart-phone (labeled ”android”). To construct the licit protocol for this dataset, we have used 10 frames from each bona fide video, sampled evenly, between the first and 180th _{frame. Frames from the}

”laptop” videos form the enrollment set, and those from the ”android” videos form the probe set.

The licit protocol of the REPLAY-MOBILE dataset con-sists of 160 videos (four videos for each of 40 subjects). They are grouped as follows: 48 videos (12 subjects) for training, 64 videos (16 subjects) for development, and the remaining 48 videos for the test group. The PAD protocol of the dataset consists of 1, 030 videos (390 bona fide presentations and 640 PAs of different kinds).

For ease of reference, the number of clients and samples used in the licit and attack protocols of the various datasets is summarized in Table II.

IV. EXPERIMENTS

The general procedure for evaluating the vulnerability of a FR system is as follows. First the face-verification perfor-mance of the system is evaluated using only the MOBIO dataset (i.e., in the licit scenario). The hyper-parameters of the FR system are tuned to achieve optimal discrimination between genuine presentations and ZEI presentations. The vulnerability of the FR system to PAs is then evaluated, for the different PAD datasets, using the same hyper-parameter settings.

In this section we first describe the methodology adopted for the experiments. Then, we present the results of face-recognition and vulnerability analysis for the four selected FR systems.

A. Methodology

The face-recognition and vulnerability-analysis experi-ments have been performed using the Bob5_{signal-processing}

and machine-learning toolkit [33]. Besides a wide selection of machine-learning and signal processing tools relevant for biometics experiments, the Bob toolkit also includes interfaces for accessing the various datasets and associated protocols. All the FR systems studied here have been tested within the same software framework.

5_{Website: www.idiap.ch/software/bob/}

4.3 Protocols

As mentioned before, the MOBIO dataset comes with only a licit protocol, whereas the PA datasets include both licit and attack protocols. For our experiments with the MSU-MFSD, we have constructed new licit and attack protocols to analyze the vulnerability of the FR methods to PAs. The protocols in the various datasets are described in this section.

The set of clients in the licit protocol of each dataset is divided into three mutually exclusive groups: training, development and evaluation. The training group is reserved for producing the background models, when necessary. In our experiments, the UBM required for the ISV modeling method has been trained using the training group in the licit protocol of the MOBIO dataset. The training groups in the licit protocols of the PA datasets are ignored here. The development and evaluation groups, each consist of two sets of samples: an enrollment set and a probe set. All clients assigned to the group are represented in both sets. That is, for each client in the development group, a certain number of enrollment samples as well as probe samples are available (and likewise, for the evaluation group). Attack protocols of the various PA datasets also consist of three non-overlapping groups: training, development, and evaluation. The training group is intended to be used for training a PAD method. The development group can be used for tuning the parameters of the PAD method being tested, and the final performance metrics are reported for the evaluation group. In the vulnerability analysis experiments using a given PA dataset, attack-videos in the evaluation group of the corresponding PAD protocol are used to test the vulnerability of the FR system in question.

The three groups in the male protocol of the MOBIO dataset are constructed as follows: • training: 7881 samples images, from 37 subjects,

• development: 2760 samples, from 24 subjects, and • evaluation: 4370 samples, from 38 subjects.

As mentioned before, samples for the MOBIO dataset were collected at six different sites. Each group contains data exclusively from two sites.

The REPLAY-ATTACK dataset comes with a licit protocol consisting of 100 videos (one video per subject, per illumination condition), and several PAD protocols. The licit protocol is partitioned thus:

• training: two videos each, from 15 subjects,

• development: two videos each, from 15 subjects, and • evaluation: two videos each, from 20 subjects.

(9)

REPLAY-ATTACK

BF

PA I

PA II

PA III

MSU-MFSD

REPLAY-MOBILE

Figure 1: Examples of attacks represented in the various PA datasets. The first column shows a bona fide example, and the remaining columns show different types of PAs present in the dataset for the same identity. The various PAs are described in Table 1.

The dataset also provides several PAD protocols [37]. In our experiments we have considered the grandtest protocol, which includes all PAs. This protocol consists of 200 bona fide presentations, and 1, 000 PAs (for a total of 1, 200 videos).

The public version of MSU-MFSD [38] offers 70 bona fide presentation videos and 280 attack videos. As mentioned before, no licit protocol has been published for this dataset. For our experiments, therefore, we have devised a new licit protocol using some frames from the bona fide presentation videos. For each subject, the dataset contains two bona fide presentation videos, one captured using a Macbook Air (labeled ”laptop”) and the other using a smart-phone (labeled ”android”). To construct the licit protocol for this dataset, we have used 10 frames from each bona fide video, sampled evenly, between the first and 180th_{frame. Frames from the ”laptop” videos form the enrollment}

set, and those from the ”android” videos form the probe set.

The licit protocol of the REPLAY-MOBILE dataset consists of 160 videos (four videos for each of 40 subjects). They are grouped as follows: 48 videos (12 subjects) for training, 64 videos (16 subjects) for development, and the remaining 48 videos for the test group. The PAD protocol of the dataset consists of 1, 030 videos (390 bona fide presentations and 640 PAs of different kinds).

For ease of reference, the number of clients and samples used in the licit and attack protocols of the various datasets is summarized in Table 2.

(10)

Table 2: Composition of the licit and attack protocols of the four datasets used in our experiments. All protocols consist of three mutually exclusive groups of clients: training, development, and evaluation. For each dataset, the number of clients assigned to each group is listed here. The number in parentheses shows the total number of samples

for the corresponding dataset and group. 8

Dataset _Train _DevelopmentLicit protocol _Evaluation _Train Attack protocol_Development _Evaluation

MOBIO 37 (7881) 24 (2760) 38 (4370)

REPLAY-ATTACK 15 (90) 15 (90) 20 (120) 15 (300) 15 (300) 20 (400)

MSU-MFSD 10 (200) 10 (200) 15 (300) 10 (600) 10 (600) 15 (900)

REPLAY-MOBILE 12 (1680) 16 (2240) 12 (1580) 12 (1920) 16 (2560) 12 (1920)

TABLE II: Composition of the licit and attack protocols of the four datasets used in our experiments. All protocols consist of three mutually exclusive groups of clients: training, development, and evaluation. For each dataset, the number of clients assigned to each group is listed here. The number in parentheses shows the total number of samples for the corresponding dataset and group.

The inputs for ISV modeling, Gabor-Graph, and ROC-SDK FR methods are gray-scale images, and the input to the VGG-Face FR method is a color (RGB) image. The ROC-SDK takes the entire image (one frame of video) as input. All other FR methods expect the input-image to be cropped appropriately, so that it contains only the face-region, with as little of the background as possible (see Section II-A). The specific input image dimensions required by the various FR systems, are listed in Table III. The table also summarizes the hyper-parameters and the scoring method used in each FR system. For the ISV modeling method and the Gabor-Graphs method, we have used the hyper-parameter values recommended by G¨unther et al. [34].

Most FR methods utilize models that need to be trained before deployment. The ROC-SDK and the VGG-Face net-work come pre-trained, and can be used out-of-the-box. No background model is required for the Gabor-Graphs method. For the ISV modeling method, the UBM and the session-variability matrix (U) are trained using face-images in the male-protocol of the MOBIO dataset.

Once trained, the FR methods are tuned and evaluated using the MOBIO licit protocol, and the three PA datasets: REPLAY-ATTACK, MSU-MFSD, and REPLAY-MOBILE. For experiments with each of these datasets, the hyper-parameters of each FR method are tuned using data in the development group of the licit protocol (see Section III). Each client in this group is enrolled into the FR system, using the available enrollment samples. Then, all samples in the probe set of the development group are tested against every enrolled client. For every enrolled identity, some videos in the probe-set come from the same client, and are considered genuine presentations, whereas the remaining videos (corresponding to other clients) constitute the ZEI attacks on this identity. Thus, we obtain two sets of scores, one for genuine presentations, and the other for ZEI presen-tations. The hyper-parameters of each FR method are tuned to maximize the separation between the two sets of scores. A threshold is now selected, based on these two score-sets computed from the probe set of development group. Presentations resulting in scores smaller than the score-threshold are rejected, and those at or above the score-threshold are accepted. Thus, given a score-threshold, the FMR and FNMR of the FR system can be estimated.

The score-threshold may be chosen arbitrarily, but usually some heuristics are applied when choosing the threshold.

Two common approaches for selecting the score-threshold are:

• Select the score-threshold, TEER, corresponding to the

equal error rate (EER). That is, TEER is the threshold

for which F MR ≈ F NMR.

• Select the score-threshold, Tθ, that corresponds to

a specific FMR = θ (e.g., T0.1 denotes the

score-threshold leading to a FMR of 0.1% over the devel-opment group).

Next, the clients in the evaluation group are enrolled into the FR system using the data in the corresponding enroll-ment set, and samples in the probe set are used to generate scores for genuine and ZEI presentations. (Note that the groups are all mutually exclusive. That is, clients represented in the development group are not present in the evaluation group, and vice versa.) For a given evaluation-probe, if the resulting score is smaller than the score-threshold, the probe is rejected. Otherwise it is accepted. Thus, using the evaluation-probe scores, the FMR and FNMR of the FR method can be reported.

5 Experiments

The procedure used here to evaluate the vulnerability of a FR system is as follows. First the face-verification performance of the system is evaluated using only the MOBIO dataset (i.e., in the licit scenario). The hyper-parameters of the FR system are tuned to achieve optimal discrimination between genuine presentations and ZEI presentations. The vulnerability of the FR system to PAs is then evaluated, for the different PA datasets, using the same hyper-parameter settings. In this section we first describe the methodology adopted for the experiments. Then, we present the results of face-recognition and vulnerability analysis for the five selected FR systems.

5.1 Methodology

Table 3: Summary of the parameters of the various FR systems used in this study. All methods, apart from ROC-SDK, expect input-images of fixed dimensions, where the image has been cropped to the face-region appropriately.

8

TABLE II: Composition of the licit and attack protocols of the four datasets used in our experiments. All protocols consist of three mutually exclusive groups of clients: training, development, and evaluation. For each dataset, the number of clients assigned to each group is listed here. The number in parentheses shows the total number of samples for the corresponding

dataset and group. 8