https://doi.org/10.1007/s40747-021-00565-w ORIGINAL ARTICLE
Static–dynamic features and hybrid deep learning models based spoof detection system for ASV
Aakshi Mittal1· Mohit Dua1
Received: 2 April 2021 / Accepted: 12 October 2021
© The Author(s) 2021
Abstract
Spoof detection is essential for improving the performance of current Automatic Speaker Verification (ASV) systems. Strengthening both the frontend and the backend can make ASV systems robust. First, this paper discusses a performance comparison of static and static–dynamic Constant Q Cepstral Coefficients (CQCC) frontend features, using a Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend. Second, it performs a comparative analysis of ASV systems built using three deep learning models at the backend (LSTM with Time Distributed Wrappers, LSTM and Convolutional Neural Network (CNN)) with static–dynamic CQCC features at the frontend. Third, it discusses the implementation of two spoof detection systems for ASV that use the same static–dynamic CQCC features at the frontend and different combinations of deep learning models at the backend. The first is a voting protocol based two-level spoof detection system that uses the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level. The second is a two-level spoof detection system with a user identification and verification protocol, which uses an LSTM model for user identification at the first level and LSTM with Time Distributed Wrappers for verification at the second level. For implementing the proposed work, a variation of the ASVspoof 2019 dataset has been used to introduce all types of spoofing attacks, namely Speech Synthesis (SS), Voice Conversion (VC) and replay, in a single dataset. The results show that, at the frontend, static–dynamic CQCC features outperform static CQCC features and, at the backend, hybrid combinations of deep learning models increase the accuracy of spoof detection systems.
Keywords ASV · Spoof detection · CQCC · LSTM · CNN
Introduction
Building a robust spoof detection system for Automatic Speaker Verification (ASV) is now an essential task, as the attention to and demand for voice protected authentication systems is increasing among the users of smart devices. According to a survey, users are looking forward to using speech driven authentication systems [1]. An ASV system verifies whether the input speech signal is actually spoken by the authentic user or generated by an imposter to gain access to the legitimate user’s account. With the availability of low cost voice sensors, and advanced research in mathematical and logical techniques for generating synthetic speech, the number of spoofing attack types is also increasing. Speech synthesis (SS), voice conversion (VC), replay, mimicry and twins attacks are potent spoofing attacks against this type of system. An SS attacked utterance is generated by a text to speech technique [2]. VC speech signals are generated by converting the imposter’s voice into the legitimate user’s voice with the help of transformation functions [3–5]. The replay attack is one of the easiest forms of attack, in which the spoofed speech is a recorded voice signal of the targeted user [6]. To mimic the legitimate user’s voice, a professional manipulates his/her own speech features. The twins attack is also a kind of mimicry attack [7,8]. In some cases, twin siblings are able to get access to each other’s voice locked accounts [5, 9]. SS and VC attacks can be injected into the system via the channel. Hence, these attacks are named Logical Access (LA) attacks [9]. The replay, mimicry and twins attacks are inserted into the system through the microphone. Hence, these attacks are known as Physical Access (PA) attacks. The performance of ASV systems is greatly affected in the presence of these spoofing attacks [10]. Various speech corpora enriched with different kinds of spoofing attacks have been proposed. For instance, the ASVspoof 2015 dataset includes SS and VC attacks [11], the ASVspoof 2017 dataset includes only the replay attack [12], the Yoho dataset includes mimicry attacks [13], etc. The recently proposed ASVspoof 2019 dataset includes SS, VC and replay attacks, however, in two separate sets. This paper presents an initiative of putting all kinds of attacks into a single dataset.
Mohit Dua, Aakshi Mittal
1 Department of Computer Engineering, National Institute of Technology, Kurukshetra, India
Along with the consideration of attacks, robust designs of the frontend and backend of an ASV system can act as a preventive shield against spoofing attacks. The frontend of an ASV system uses a speech feature extraction technique to extract useful information from the recorded speech signal. Cepstral domain features such as Mel Frequency Cepstrum Coefficients (MFCC), Inverse Mel Frequency Cepstrum Coefficients (IMFCC) [14], Linear Frequency Cepstrum Coefficients (LFCC), Constant Q Cepstrum Coefficients (CQCC), etc. have performed remarkably well for spoof detection tasks, and for speech and speaker recognition tasks as well. These techniques can model the human vocal tract and human auditory system very well [15–17]. The human ear is considered to be deaf to the phase of sound. However, this factor can still be utilized for frontend development of speech driven devices [18,19] by using the All Pole Group Delay Function (APGDF), Modified Group Delay Function (MODGDF), etc. Both static and dynamic coefficients of speech features carry contextual and speaker specific information. These coefficients are passed to the backend spoof detection model. CQCC features were specially designed for spoof detection tasks in the ASV systems of [20,21], where it is claimed that these features perform better than Instantaneous Frequency Cosine Coefficients (IFCC), MFCC and Epoch Features (EF). The proposed work in this paper also exploits a hybrid of static and dynamic CQCC features for developing the frontend. Also, it presents a performance comparison of static and static–dynamic CQCC features by using a Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend.
Various machine learning techniques such as Gaussian Mixture Model (GMM), Hidden Markov Model (HMM) [22–24], Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), etc. play a crucial role in classification tasks, including in speech based systems [25, 26]. In an ASV system, the backend classification model takes the speech features as input and classifies the signal as spoofed or bonafide after analyzing the speaker specific information in them. In the initial research, GMM was used effectively as the backend model [27]. As deep learning algorithms have improved, the ASV community has started to use CNN and LSTM models [28–30]. In various speech and speaker recognition tasks, LSTM-based deep learning models perform better than the other models. However, CNN models also give satisfactory results [31–33]. Also, different arrangements of frontend and backend models can bring smoothness and accuracy to the spoof detection task.
The rest of the paper is organized as follows: the second section discusses the related work; the third section discusses the proposed method; the experimental setup details and results are presented in the fourth section; the fifth section explains the performance analysis of the proposed models and systems; the sixth section compares the proposed systems with existing systems; and the seventh section concludes the paper and sheds some light on future directions.
Related works
This section discusses the related works in this area. The literature is rich with experiments on various audio feature extraction techniques at the frontend and different classification models at the backend. Valenti et al. [34] discuss an end to end approach in which the speech signal is passed to an evolving Recurrent Neural Network (RNN). The system used in their work is designed with an RNN and neuroevolution of augmenting topologies. Their work considers the replay attack in particular.
The review by Kamble et al. [35] presents a wide analysis of many existing ASV spoof systems from the perspective of the ASVspoof challenge. Lai et al. [36] proposed a system based on an Attentive Filtering Network and a ResNet classifier to detect replay attacks. Their attention-based filtering approach is used to improve feature representations. Their work used the ASVspoof 2017 Version 2.0 dataset and attained a very low Equal Error Rate (EER). The authors claimed an improvement of about 30% over the existing ASVspoof 2017 enhanced baseline system.
The ASVspoof 2019 challenge puts three different types of attacks in one dataset and presents baseline models with LFCC and CQCC features at the frontend and GMM at the backend [27]. Chettri et al. [10] trained various deep learning backend models and tested them with different feature extraction approaches at the frontend. These backend models were further combined into three ensemble models, and all the systems were tested for physical access and logical access attacks.
Recently, Dua et al. [30] also proposed an ensemble approach using LSTM based deep learning models at the backend, and three different feature extraction techniques, Constant Q Cepstral Coefficients (CQCC), Inverse Mel Frequency Cepstral Coefficients (IMFCC) and MFCC, at the frontend. The authors claimed that their proposed ensemble model with CQCC features outperforms some existing ASV systems.
Motivated by these works, the proposed work in this paper compares the performance of different deep learning models at the backend by using them with static–dynamic CQCC features at the frontend. The implemented work has also used a combination of LSTM and CNN models for development of the backend. Also, two two-level spoof detection systems for ASV that use static–dynamic features at the frontend are implemented. The first system is a voting protocol based implementation that uses the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level. The second system uses an LSTM model for user identification at the first level and LSTM with Time Distributed Wrappers for verification at the second level. These systems can bring new insights into the development of spoof detection methods for ASV.
Proposed method
This section of the paper discusses the architecture of the proposed ASV system. Figure 1a shows the frontend and backend arrangement that has been used for the comparison of static CQCC and static–dynamic CQCC features in the implemented ASV system. Speech signals taken from the dataset are applied to the frontend, where static CQCC features are extracted with the general extraction process and static–dynamic hybrid features are extracted with the proposed methodology. These features, along with the labels from the dataset, are then applied to the backend model that runs the classification. These classification results are used for the feature comparison. Figure 1b shows the frontend and backend arrangement that has been used for the comparison of various deep learning models while keeping static–dynamic CQCC features at the frontend. The frontend used in this arrangement is the best performing feature extraction technique from the feature comparison. The speech signals and labels come from the same dataset throughout the arrangement. The backend here has all the proposed models for spoof detection and a single model for the speaker identification task. At the backend, all the chosen models are trained and their performances are analyzed. The systems of Fig. 2 are arrangements of the models from Fig. 1.
Figure 2a shows the block diagram of the voting protocol based two-level spoof detection system. This system classifies the speech signal according to the voting protocol, which is implemented with the help of level 1 and level 2. Level 1 analyzes the input, which is further analyzed at level 2 as per the protocol to declare the decision. Figure 2b gives the block diagram of the two-level user identification and verification system. This two-stage arrangement uses a speaker identification model at stage 1, the result of which is passed to stage 2. Stage 2 uses the user identification and verification protocol along with the chosen backend model to declare the classification result.
Table 1 AllSpoofsASV dataset
Sets         Bonafide  SS & VC spoofed  Replay spoofed
Training     7980      22,800           48,600
Development  7948      22,296           24,300
Evaluation   25,445    63,822           116,640
The contributions of the proposed work are listed pointwise below, and the following subsections discuss each component in detail.
• This paper promotes the development of a single countermeasure that is robust against every kind of spoofing attack. Therefore, the initiative of modifying the used dataset is taken. The AllSpoofsASV dataset (Fig. 1) is a generated variation of the standard dataset.
• Selection of suitable features for the frontend is essential. This work tests whether static CQCC or a combination of static and dynamic CQCC speech features performs better at the frontend, where both feature sets use the LSTM with time distributed wrappers model at the backend.
• Systems based on different deep learning models, namely LSTM, LSTM with time distributed wrappers and CNN, are implemented with static–dynamic CQCC features to measure their performances individually.
• One voting protocol based implementation is done by using the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level. Another implementation is performed using an LSTM model for user identification at the first stage and LSTM with Time Distributed Wrappers for verification at the second stage.
AllSpoofsASV dataset
A generated variant of the ASVspoof 2019 dataset is used for building the proposed ASV systems. The ASVspoof 2019 dataset is provided by the ASVspoof challenge community [37]. The design of this dataset is intended to tackle SS, VC and replay attacks in ASV systems. The LA set of the dataset includes SS and VC spoofed utterances and the PA set includes replay attacked utterances [27]. All the audios are recorded in English and are 2–8 s in length; however, most audios in both sets are between 4 and 6 s long. The proposed system makes use of both the LA and PA sets by mixing them into a single set, the AllSpoofsASV dataset. Mixing the sets makes it possible to develop spoof detection systems in one run for all the kinds of spoofing attacks included in the dataset. Table 1 shows the number of bonafide, SS and VC spoofed, and replay spoofed utterances in the training, development and evaluation sets of the AllSpoofsASV dataset.
Fig. 1 a Proposed ASV system for features’ comparison. b Proposed system for Deep Learning Models’
Fig. 2 a Voting protocol based two-level spoof detection.
b Two-level spoof detection system with user identification and verification
Feature extraction using CQCC features
Constant Q Cepstral Coefficients (CQCC) feature extraction is used for extracting useful information from the recorded speech signal during both the training and testing phases of an ASV system. In recent years, this technique has proved to be among the most promising for the development of robust and accurate ASV systems [20,21]. The mathematical representation of the CQCC feature extraction approach is as follows:
$$C_{CQF}(e) = \mathrm{CQT}(p(n)) \tag{1}$$
$$C_{CQCC}(j) = \sum_{e=1}^{E} \log\left|C_{CQF}(e)\right|^{2} \cos\left[\frac{j\,(e - 0.5)\,\pi}{E}\right] \tag{2}$$
Here, Eq. (1) computes the Constant Q Transform (CQT) of the input speech signal p(n) in $C_{CQF}(e)$, and Eq. (2) computes the CQCC coefficients $C_{CQCC}(j)$, indexed by j, where E is the number of linearly spaced bins and e indexes the bins. The process of CQCC feature extraction applies the CQT and then takes the log of the power spectrum [38]. Also, before calculating the Discrete Cosine Transformation (DCT), it applies resampling [39,40]. It then sets the number of feature coefficients and returns the CQCC features.
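As a rough illustration of Eq. (2), the following sketch evaluates the cosine sum directly on a vector of linearly spaced CQT magnitudes. It assumes the CQT and resampling steps have already been carried out; the function name `cqcc_from_spectrum`, the zero-based range of j and the small floor inside the logarithm are assumptions made for this example and are not taken from the implementation used in this work.

```python
import math

def cqcc_from_spectrum(c_cqf, n_coeff):
    """Evaluate Eq. (2) on E linearly spaced CQT values (hypothetical helper)."""
    E = len(c_cqf)
    # log power spectrum log|C_CQF(e)|^2; the floor (an assumption) avoids log(0)
    log_power = [math.log(max(abs(c) ** 2, 1e-12)) for c in c_cqf]
    coeffs = []
    for j in range(n_coeff):  # assumption: j runs over 0 .. n_coeff - 1
        # e runs from 1 to E exactly as in Eq. (2)
        total = sum(
            log_power[e - 1] * math.cos(j * (e - 0.5) * math.pi / E)
            for e in range(1, E + 1)
        )
        coeffs.append(total)
    return coeffs
```

For a flat spectrum, only the j = 0 coefficient is non-zero, which is a quick sanity check on the cosine basis.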
The proposed system uses the find_CQCC_features () function for implementing CQCC feature extraction. This function applies the actual CQCC feature extraction process to the speech signal. It takes an audio file as input and returns a matrix of 90 × m_frames with the 30 static, 30 delta (D) and 30 delta-delta (DD) features for m_frames frames, where m_frames denotes the number of audio frames extracted, which depends on the length of the input audio. Firstly, it sets the initial values for the number of bins per octave b, maximum frequency Nmax, minimum frequency Nmin, number of desired coefficients of any type n_coeff and type of feature f_type. Here, the feature type f_type can be static (S), delta (D) or delta-delta (DD). Secondly, it calls the find_cqcc () function, which takes all these initialized values as input and outputs static, delta or delta-delta features. The algorithm in this function starts with the calculation of the gamma value, which is one of the parameters of the CQT. Then, it calculates the log power spectrum of the CQT output, which is resampled before calculating the DCT. The functions performing these operations are discussed further in this section, along with their inputs, the operations they apply and the nature of their outputs. Then, the algorithm keeps only the desired number of features and returns the static, delta or delta-delta coefficients as per the value of f_type. Finally, find_CQCC_features () combines all types of coefficients into one matrix and finds the number of frames. This function ensures a minimum of 400 frames in the output: if the number of frames is less than 400, zeros are padded, and the final matrix is the desired CQCC feature matrix.
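The final assembly step described above can be sketched as follows. This is a minimal illustration, not the function used in this work: the names `simple_delta` and `stack_and_pad` are hypothetical, and the delta is approximated by a first-order difference between neighbouring frames, which is an assumption since the exact derivative scheme is not specified here.

```python
def simple_delta(rows):
    """First-order frame difference, a stand-in for delta () (assumption)."""
    out = []
    for row in rows:
        # the first frame has no predecessor, so its delta is set to 0
        out.append([0.0] + [row[t] - row[t - 1] for t in range(1, len(row))])
    return out

def stack_and_pad(static, min_frames=400):
    """Stack static, delta and delta-delta rows, then zero-pad the frame axis."""
    d = simple_delta(static)
    dd = simple_delta(d)
    features = static + d + dd  # 30 + 30 + 30 rows when static has 30 rows
    n_frames = len(features[0])
    if n_frames < min_frames:
        pad = [0.0] * (min_frames - n_frames)
        features = [row + pad for row in features]
    return features
```

With 30 static rows, the result has 90 rows and at least 400 columns; longer utterances are left unpadded.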
This whole process uses some functions that are inbuilt functions of different libraries of Python and MATLAB [41, 42]. In the proposed work, these functions are named accord- ing to their functionality and are described further in this section. Function 1 given in the Appendix gives the pseudo code for find_CQCC_features () that calls find_cqcc () to compute CQCC features.
• audioread (): This function takes an audio file (audio_file) as input, and returns its time series y and sampling rate Ns. The number of values in the time series y depends on the length of the audio file, which in turn determines the number of frames.
• zscore (): This function calculates the row wise z-score for each value of the input matrix. The values coming out of the find_cqcc () function lie in a continuous range from small to large values; hence, this function is applied to normalize them. The general formula for the z-score is given by Eq. (3).
$$\mathrm{zscore} = \frac{x - \mu}{\sigma} \tag{3}$$
Here, x is the element value to be normalized, μ is the mean of the values of the entire row and σ is the standard deviation of those values.
• length (): This function takes a matrix as input and outputs the value of number of columns in it.
• zero_padding (): This function pads a given matrix with extra columns of zero values up to the desired number of columns.
• cqt (): This function applies the Constant Q Transform (CQT) to the samples of a speech signal. CQT converts the signal from the time domain into the time–frequency domain while maintaining a constant Q factor across the signal. gamma_value is a parameter of this function that is calculated using Eq. (4) from the number of bins per octave b in the speech signal.
$$\mathrm{gamma\_value} = 228.7\left(2^{1/b} - 2^{-1/b}\right) \tag{4}$$
• log (): This function applies logarithm operation on the input values. Logarithm is calculated for the squared spec- trum that is output of cqt () function.
• resample (): This function converts the geometrically spaced bins provided by CQT into linearly spaced bins.
Bins are converted into linear space to make the signal compatible with Discrete Cosine Transformation (DCT).
• dct (): This function applies the DCT. The DCT is helpful in tasks such as signal compression and in converting the log spectrum from the frequency domain into the cepstral (time-like) domain.
• cut (): This function cuts a matrix to the desired number of rows.
• delta (): This function calculates the derivative of the applied values.
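Two of the helper operations listed above, the row wise z-score of Eq. (3) and the gamma value of Eq. (4), can be sketched in plain Python as follows. The function names are hypothetical. Two assumptions are made: Eq. (4) is read as the product 228.7 × (2^(1/b) − 2^(−1/b)), and the population standard deviation is used in the z-score (library implementations may default to the sample standard deviation instead).

```python
import math

def gamma_value(b):
    """Eq. (4), read as a product (assumption), for b bins per octave."""
    return 228.7 * (2 ** (1 / b) - 2 ** (-1 / b))

def zscore_rows(matrix):
    """Row wise z-score of Eq. (3) using the population standard deviation."""
    out = []
    for row in matrix:
        mu = sum(row) / len(row)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in row) / len(row))
        # a constant row has sigma == 0; return zeros instead of dividing by zero
        out.append([(x - mu) / sigma if sigma > 0 else 0.0 for x in row])
    return out
```

After normalization, each row has zero mean and unit variance, so coefficients with very different ranges become comparable at the input of the backend model.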
Backend classification using deep learning models
This section gives brief details of the deep learning models that are used at the backend of the different architectures proposed in this paper.
Long short term memory (LSTM) with time distributed wrappers (M1)
The proposed Long Short Term Memory (LSTM) network with time distributed wrappers (M1), shown in Fig. 3, begins with three time distributed dense layers, each having a ReLU activation function. Time distributed wrapped layers are especially suitable for time varying data frames such as audio, video, etc. These time distributed dense layers have 32, 16 and 10 units, in this order. The number of units inside the layers is presented to give readers a finer grained knowledge of the structure of the model. The motivation for choosing these numbers of neurons is taken from the related work [30,31]. After that, 15% dropout is applied to drop the effect of some randomly selected neurons; adding the dropout layer helps prevent the model from overfitting. In the M1 model, this operation is followed by three LSTM layers having 10, 20 and 30 units, in this order. These layers are followed by 10% dropout, and the result of the dropout is passed to a dense layer with a sigmoid activation function.
Fig. 3 Long short term memory (LSTM) with time distributed wrappers (M1)
Fig. 4 Long short term memory (LSTM) (M2)
Long short term memory (LSTM) (M2& M4)
The proposed Long Short Term Memory (LSTM) network (M2), shown in Fig. 4, takes its input at the first LSTM layer, which is followed by two more LSTM layers. These layers have 10, 20 and 30 LSTM units, in this order, chosen as per the results shown in [30,31]. The output of these layers is passed to a dense layer of 24 units after applying 10% dropout. The output of this dense layer is then passed, after applying 10% dropout again, to the last layer, which is a dense layer with a sigmoid activation function.
An LSTM model (M4) with a similar architecture has 20, 30 and 400 units, in this order, in its first three LSTM layers (Fig. 4). All the dropout and dense layers have the same specifications as in M2.
Two-dimensional convolutional neural network (2D CNN) (M3)
As shown in Fig. 5, the Two-Dimensional Convolutional layer (Conv2D) of the proposed Two-Dimensional Convolutional Neural Network (2D CNN) (M3) has 24 filters of 3 × 3 kernel size along with the ReLU activation function. After that, a batch normalization layer is added, which is followed by three blocks of Conv2D and two-dimensional (2D) max pooling layers. The Conv2D layers of these blocks have 16 filters of 5 × 5 kernel size, and the 2D max pooling layers have a 2 × 2 pool size. These blocks are followed by a flatten layer, which is followed by a dense layer of 10 units. After that, 10% dropout is applied to avoid overfitting of the model. The last layer of this 2D CNN model is a dense layer with a sigmoid activation function.
Spoof detection systems
This section discusses the two spoof detection systems (System_1 and System_2) that are developed for the implementation of the proposed ASV system. Both System_1 and System_2 use the static–dynamic hybrid combination of CQCC features at the frontend and different arrangements of the M1, M2, M3 and M4 models at the backend.
Voting protocol based two-level ASV system (System_1)
The two-level ASV system with voting protocol, i.e. System_1, focuses on the spoof detection task. It accepts the input speech signal if it is bonafide, and rejects it if it is spoofed by any of the SS, VC and replay attacks. Models M1, M2 and M3 provide the corresponding label, bonafide or spoofed, as output. Figure 6 shows the proposed System_1, which has models M2 and M3 at the first level and M1 at the second level, where F is treated as a global variable.
Fig. 5 Two-dimensional convolutional neural network (2D CNN) (M3)
Fig. 6 Two-level ASV system with Voting Protocol (System_1)
Models M2 and M3 are placed at level one because both of these models are equally good when evaluated for Equal Error Rate (EER). This adds fairness to the classification result of this level. M1 is the most powerful model; hence, it is placed at the second level. Firstly, each input audio file is applied to the models M2 and M3. Then, the voting protocol is applied to their decisions. A find_binary () function maps these decisions to Boolean values, i.e. FALSE for a spoofed decision (due to any of SS, VC and replay) and TRUE for a bonafide decision made by the model. The voting protocol compares the outputs of the find_binary () function for the two first level models. If the outputs from both models are the same, that output is returned as the final classification result of the system. Otherwise, the audio file is tested on the model M1 at the second level, and its classification result, after passing through the find_binary () function, is returned. At the end, the proposed system returns TRUE or FALSE for the input speech being bonafide or spoofed, respectively. Function 2, added in the Appendix section, gives the pseudo code for the implemented voting protocol that uses the find_binary () function.
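The voting protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the pseudo code of the Appendix: the string labels "bonafide" and "spoofed", the function name `system_1_decision` and the use of a callable for the second level model are all choices made for this example.

```python
def find_binary(label):
    """Map a model decision to a Boolean: True for bonafide, False for spoofed."""
    return label == "bonafide"

def system_1_decision(m2_label, m3_label, m1_predict):
    """Two-level voting: M2 and M3 vote first; M1 is consulted only on a tie.

    m1_predict is a zero-argument callable (assumption) so that the second
    level model is evaluated only when the first level models disagree.
    """
    vote_m2 = find_binary(m2_label)
    vote_m3 = find_binary(m3_label)
    if vote_m2 == vote_m3:
        return vote_m2  # level 1 agreement is the final result
    return find_binary(m1_predict())  # level 2 resolves the disagreement
```

Passing M1 as a callable reflects the description that the audio file is tested on M1 only when the first level outputs differ.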
Two-level ASV system with user identification and verification (System_2)
System_2, shown in Fig. 7, also executes its process in two stages/levels. In the first stage, it identifies the user id for the applied speech signal. Then, in the second stage, the user’s voice signal is verified as bonafide or spoofed. The system uses the User Identification and Verification Protocol to accomplish this task, where F and I are treated as global variables.
As a result, the system identifies the validity of the claimer along with the genuineness of the applied speech signal. Firstly, the input audio signal is applied to the model M4 of the first stage. Model M4 predicts the identity of the user (Ui) out of the n already registered users. This predicted identity is supplied to stage 2, where the user identification and verification protocol is applied. At this stage, n instances {(M1U1), (M1U2), …, (M1Un)} of model M1 reside, which are trained for the n users {U1, U2, …, Un}. Model M1 checks whether the speech signal is bonafide or spoofed at this stage, and the decision is mapped to a valid integer value in the variable A. The set_ternary () function maps to the integer value THREE if Ui and I are not the same, to the integer value ONE if the decision is bonafide and Ui and I are the same, and to the integer value TWO if the decision is spoofed. At the output, if A is ONE then the user is valid and the speech is bonafide, if A is TWO then the user is invalid and the speech is spoofed, and if A is THREE then the user is invalid. Function 3, given in the Appendix, gives the pseudo code for the implementation of System_2.
Fig. 7 Two-level ASV system with user identification & verification (System_2)
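The user identification and verification protocol can be sketched as below. This is only an illustrative reading of the description, not the pseudo code of the Appendix. It assumes that I holds the claimed user id, that the identity check takes precedence over the spoof decision, and that the models are available as plain callables; the names `system_2_decision`, `identify_user` and `verifiers` are hypothetical.

```python
ONE, TWO, THREE = 1, 2, 3

def set_ternary(predicted_id, claimed_id, m1_label):
    """Map the stage 2 outcome to ONE, TWO or THREE as described in the text."""
    if predicted_id != claimed_id:
        return THREE  # identified user differs from the claimed one
    return ONE if m1_label == "bonafide" else TWO

def system_2_decision(audio, claimed_id, identify_user, verifiers):
    """Stage 1 identifies the user (M4); stage 2 runs that user's M1 instance."""
    predicted_id = identify_user(audio)        # stage 1: model M4
    m1_label = verifiers[predicted_id](audio)  # stage 2: instance (M1Ui)
    return set_ternary(predicted_id, claimed_id, m1_label)
```

Here `verifiers` plays the role of the n per-user instances of M1, keyed by user id.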
Experimental setups
This section of the paper deals with the experimental details for the implementation of the proposed ASV system. The frontend feature extraction is implemented using Octave on the Linux operating system. The training, development and evaluation of the backend models are done with the Anaconda platform on the Windows operating system. All the audios and labels used are taken from the training, development and evaluation sets of the AllSpoofsASV dataset. During training of the deep learning models, Python’s inbuilt features, that is, the backpropagation algorithm and loss functions, are used for updating the weights. For the two class classification problems, binary cross entropy loss is used as the loss function. It is computed from the probability or score, between zero and one, predicted for an utterance. The categorical cross entropy loss function is used for the multi class classification of user identities (specifically in the training of M4).
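For reference, the two loss functions named above can be written out for a single utterance as follows. This is a plain Python sketch of the standard definitions, not the library code used in the experiments; the clipping constant `eps` is an assumption added to keep the logarithm finite.

```python
import math

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one utterance: y_true is 0 or 1, p is the predicted score in [0, 1]."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(one_hot, probs, eps=1e-7):
    """Loss for one utterance over n user classes (one-hot target)."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(one_hot, probs))
```

Both losses are zero for a perfectly confident correct prediction and grow as the predicted probability of the true class decreases.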
A learning rate is required for the iterative updating of weights during the training process. In the proposed work, the ADAM (Adaptive Moment Estimation) optimizer is used to obtain an adaptive learning rate [43, 44]. It combines the advantages of the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). AdaGrad maintains a learning rate for each parameter, which improves performance on problems with sparse gradients, whereas RMSProp makes use of an average of the recent magnitudes of the gradients of the weights. The ADAM algorithm passes both the gradient and the squared gradient through exponential moving averages. For heavy models and large datasets, it can solve practical problems efficiently [43–45]. The system arrangements for the different comparisons and analyses are discussed later in this section.
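The ADAM update for a single scalar weight can be sketched as follows, to make the two moving averages concrete. The hyperparameter values shown are the commonly used defaults and are assumptions here, since the values used in the experiments are not stated; the function name `adam_step` is hypothetical.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update of weight w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2  # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

Because the step is scaled by the ratio of the two averages, the effective step size for each weight stays close to the learning rate regardless of the raw gradient magnitude.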
The performance of the proposed architectures and systems is evaluated with two evaluation measures, Equal Error Rate (EER) and percentage accuracy. Spoof detection systems are evaluated using EER and the user identification system is evaluated using percentage accuracy. EER is the common value of the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) at the threshold where the two are equal [27,28]. FAR is the ratio of the number of spoofed utterances with a score greater than or equal to the threshold to the total number of spoofed utterances, and FRR is the ratio of the number of bonafide utterances with a score less than the threshold to the total number of bonafide utterances. The mathematical representations of FAR and FRR are given by Eqs. (5) and (6), respectively. To find the EER, FAR and FRR are calculated over the threshold values, and the value at which the two are equal is declared as the EER of the system.
$$\mathrm{FAR} = \frac{\text{Total count of spoofed utterances with score} \geq \text{threshold}}{\text{Total count of spoofed utterances}} \tag{5}$$
$$\mathrm{FRR} = \frac{\text{Total count of bonafide utterances with score} < \text{threshold}}{\text{Total count of bonafide utterances}} \tag{6}$$
Percentage accuracy is calculated from the number of correct predictions and the total number of input samples to be checked. The mathematical formula of percentage accuracy is given by Eq. (7).
$$\text{Percentage Accuracy} = \frac{\mathrm{Count}(\text{correct predictions})}{\mathrm{Count}(\text{input samples})} \times 100 \tag{7}$$
In this case, the number of correctly predicted user samples divided by the total number of user input samples is multiplied by 100.
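The three measures of Eqs. (5)–(7) can be sketched as follows. This is a simple illustration, not the evaluation script used for the reported results: the threshold is swept over the observed scores, and when FAR and FRR never coincide exactly, their mean at the closest point is returned, which is an assumption of this sketch. All function names are hypothetical.

```python
def far_frr(bonafide_scores, spoof_scores, threshold):
    """FAR and FRR of Eqs. (5) and (6) at a given threshold."""
    far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
    frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
    return far, frr

def equal_error_rate(bonafide_scores, spoof_scores):
    """Sweep thresholds over the observed scores and return the rate where FAR and FRR meet."""
    best = None
    for th in sorted(set(bonafide_scores) | set(spoof_scores)):
        far, frr = far_frr(bonafide_scores, spoof_scores, th)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

def percentage_accuracy(predicted, actual):
    """Eq. (7): share of correct predictions, in percent."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

A higher score is taken to mean "more likely bonafide", in line with the definitions of FAR and FRR given in the text.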
Frontend features extraction
For the spoof detection task, firstly, model M1 is trained with only 30 static CQCC features, calculated by making some modifications to the find_CQCC_features () function. The mean over the m_frames frames is used for each of the 30 coefficients, so a vector of 1 × 30 dimensions is extracted for each audio in the case of static features. Model M1 is trained for up to five epochs with a batch size of 512. Secondly, model M1 is trained with the static–dynamic hybrid CQCC features calculated by the find_CQCC_features () function. All 30 static, 30 delta and 30 delta-delta CQCC features for all m_frames frames (without taking the mean) are used in this arrangement. A matrix of 90 × m_frames dimensions is extracted for each audio in this case. To balance the comparison criteria, this arrangement has also been trained for up to five epochs with a batch size of 512.

Table 2 Comparative analysis of different CQCC features
Features             Development set (EER)                                      Evaluation set (EER)
                     (D1)   (D2)   (D3)   (D4)   (D5)   Average (mean ± sd)
Static CQCC          0.114  0.113  0.112  0.112  0.111  0.112±0.001            0.136
Static–Dynamic CQCC  0.017  0.018  0.019  0.018  0.018  0.018±0.0006           0.032
Values in bold show the final and best performing results

Table 3 Comparison of backend spoof detection models
Model  Development set (EER)                                      Evaluation set (EER)  System_1 (EER)
       (D1)   (D2)   (D3)   (D4)   (D5)   Average (mean ± sd)
M2     0.019  0.017  0.017  0.019  0.017  0.017±0.0009           0.043                 0.029
M3     0.019  0.020  0.018  0.019  0.019  0.019±0.0006           0.043
M1     0.019  0.018  0.017  0.017  0.017  0.017±0.0008           0.032
Values in bold show the final and best performing results

Table 4 Performance analysis for LSTM (M4)
Model  %Accuracy                                                  Evaluation set %Accuracy
       (D1)  (D2)  (D3)  (D4)  (D5)  Average (mean ± sd)
M4     99.4  96.5  97.9  97.8  97.9  97.9±0.91                    97.1
Equal Error Rate (EER) for both the arrangements is found out to compare the performances of the feature sets. The comparative analysis for evaluation data with both features is shown in Table2.
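The difference in shape between the two feature arrangements described in this section can be sketched as follows. This is only an illustration under stated assumptions: the internals of find_CQCC_features() are not shown here, so the random matrix stands in for real CQCC output, the value of `m_frames` is hypothetical, and `np.gradient` is used as a simple stand-in for whatever delta computation the authors applied.

```python
import numpy as np


def static_vector(cqcc):
    """Average each coefficient over all frames: (30, m_frames) -> (1, 30)."""
    return cqcc.mean(axis=1, keepdims=True).T


def static_dynamic_matrix(cqcc):
    """Stack static, delta and delta-delta rows: (30, m_frames) -> (90, m_frames)."""
    delta = np.gradient(cqcc, axis=1)         # first-order change along the time axis
    delta_delta = np.gradient(delta, axis=1)  # change of the delta features
    return np.vstack([cqcc, delta, delta_delta])


rng = np.random.default_rng(0)
m_frames = 100                              # hypothetical number of frames
cqcc = rng.standard_normal((30, m_frames))  # placeholder for real CQCC coefficients

print(static_vector(cqcc).shape)          # (1, 30)
print(static_dynamic_matrix(cqcc).shape)  # (90, 100)
```

The static arrangement discards all temporal variation by averaging, whereas the stacked matrix keeps one column per frame, which is what allows a recurrent backend to model how the coefficients evolve over time.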
Backend deep learning models with System_1
The proposed work compares the performance of the backend deep learning models M1, M2 and M3, implemented individually, with the voting protocol based System_1, using static–dynamic CQCC features at the frontend and the AllSpoofsASV dataset. Model M1 is trained with a batch size of 512 for up to five epochs, model M2 with a batch size of 512 for 20 epochs and model M3 with a batch size of 500 for 15 epochs. For the training of all three models, a patience of two is used as the early stopping criterion, the binary cross entropy loss function is used to measure the loss and the ADAM optimizer is used for optimization in both the systems [43,44].
As described earlier, the trained models M2 and M3 are used at level 1 and M1 is used at level 2 of the voting protocol based spoof detection system System_1. The performance of M1, M2, M3 and System_1 is analysed using the EER. Table 3 shows the comparative EER values on the evaluation dataset for all three backend models and System_1.
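The early stopping criterion with a patience of two, mentioned above for the training of the backend models, can be illustrated with a small sketch. The monitored quantity (validation loss), the function name and the loss values are assumptions for this example; the section does not state which quantity the authors monitored or which library implementation they used.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training stops.

    Training stops once the monitored loss has failed to improve on its best
    value for `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses: no improvement at epochs 3 and 4, so training stops at epoch 4.
stopped_at = stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.30], patience=2)
```

With such a criterion, the epoch counts quoted above act as upper limits rather than fixed training lengths.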
Model M4
User identification model M4 is trained individually for eight users (n) with a batch size of 512 for up to 80 epochs using the categorical cross entropy loss function. Model M4 is tested using percentage accuracy, which is calculated for the evaluation set as shown in Table 4.
System_1 and System_2
System_2 uses the trained model M4 for the user identification task at stage 1 and n instances of model M1 at stage 2. However, the training of model M1 in System_2 differs from that in System_1: it is trained eight times, separately for each of the eight existing users, using the bonafide and spoofed utterances of that specific user. First, user identification is performed for the eight users at stage 1, and then the user identification and verification protocol is invoked for verification at stage 2. The performances of System_1 and System_2 for the spoof detection task are evaluated using the EER on the evaluation sets, as shown in Table 5.
Table 5 Performance of proposed systems (Equal error rate, EER)

| System | Development set | Evaluation set |
|---|---|---|
| System_1 | 0.017 | 0.029 |
| System_2 | 0.002 | 0.009 |
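The two-stage flow of System_2 described in this section (identify the user, then verify with that user's own model) can be sketched as below. The function name, the callable interfaces, the 0.5 decision threshold and the toy stand-in models are all hypothetical; they only illustrate the routing idea and are not the authors' implementation.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]


def system_2_decision(
    features: Features,
    identify_user: Callable[[Features], int],
    verifiers: Dict[int, Callable[[Features], float]],
    threshold: float = 0.5,
) -> bool:
    """Return True when the utterance is accepted as bonafide."""
    user_id = identify_user(features)     # stage 1: M4 picks one of the n known users
    if user_id not in verifiers:
        return False                      # no per-user model available: reject
    score = verifiers[user_id](features)  # stage 2: that user's own instance of M1
    return score >= threshold


# Toy stand-ins for the trained models, invented for illustration only.
def toy_identifier(features: Features) -> int:
    return 0 if features[0] < 0 else 1


toy_verifiers = {
    0: lambda features: 0.9,  # user 0's verifier scores this input as bonafide
    1: lambda features: 0.1,  # user 1's verifier scores this input as spoofed
}

accepted = system_2_decision([-1.0], toy_identifier, toy_verifiers)
rejected = system_2_decision([1.0], toy_identifier, toy_verifiers)
```

The dictionary of per-user verifiers makes the scaling cost visible: every additional enrolled user adds one more trained model to store and maintain.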
Results
This section presents the performance and comparison results of all the systems discussed in the third section. To obtain the results, the proposed work follows the procedure adopted by the state-of-the-art works [10,15,26]. As described earlier in the “AllSpoofsASV dataset” section, the dataset used by the proposed system is already divided into training, development and evaluation sets. Therefore, it is not necessary to partition the dataset into ratios for training, development and evaluation samples. For ASV systems, EER is the evaluation protocol applied to the classification results of the model for the spoof detection task [10,15,26]. The models in this work have been trained five times with the training set, and the development set is applied to each trained model. Network parameters have been tuned for all the systems to obtain stable parameters. The EER evaluation protocol is applied to the development results and the accuracy of the model is verified. The mean of all five development set results is reported in the presented tables. The evaluation set is applied to the model once it becomes stable after all training passes, and the EER is calculated for the classification result. The protocols of System_1 and System_2 are applied to the evaluation set outputs of the models. For the task of speaker identification, percentage accuracy is calculated as the evaluation measure on the development set results using a five-fold validation approach. It is also calculated on the evaluation set to check the performance.
Comparison of CQCC features
The models used for the feature comparison are trained five times, and the average, i.e. mean ± standard deviation (SD), of the results is taken to conclude the EER. It can be observed in Table 2 that the combination of static and dynamic CQCC features performs better than static CQCC features alone. Hence, this combination is used in the development of the further proposed spoof detection systems.
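As a small worked example of the mean ± SD summary, the sketch below recomputes the Static–Dynamic CQCC entry of Table 2 from its five development set EER values. Using the population standard deviation (dividing by N) reproduces the tabulated figure; whether the authors used this convention is an assumption, and the function name is hypothetical.

```python
import statistics


def mean_sd_summary(values):
    """Format the mean and the population standard deviation as 'mean±sd'."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD: divides by N, not N - 1
    return f"{mean:.3f}±{sd:.4f}"


# Development set EER values D1 to D5 for Static–Dynamic CQCC, taken from Table 2.
static_dynamic_runs = [0.017, 0.018, 0.019, 0.018, 0.018]
summary = mean_sd_summary(static_dynamic_runs)
print(summary)  # 0.018±0.0006
```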
Comparison of used deep learning models with System_1
These models are trained five times, and the EER is calculated on the development set after each training pass of each model. Table 3 presents the EER values for the five training and development passes (denoted by the sequence “Di” in Table 3) along with the average of the results. Then, the performance on the evaluation set and that of System_1 are shown. The results presented in Table 3 show that M1 outperforms the other two backend models for spoof detection when implemented individually. However, the voting protocol based System_1 outperforms all three backend models. The voting protocol is applied once the average performances of all the deep learning models are concluded.
Performance of model M4
The average percentage accuracy of model M4 is calculated for the evaluation set by averaging the five runs, as shown in Table 4. The percentage accuracy, as described earlier, is calculated by Eq. (7) using the correct predictions and the total number of input samples to be checked. It can be observed from Table 4 that M4 performs satisfactorily.
Comparative analysis for System_1 and System_2
The performances of System_1 and System_2 for the spoof detection task are evaluated using the EER on both the development and evaluation sets, as shown in Table 5.
It can easily be observed in Table 5 that System_2 performs better than System_1. However, System_2 is limited to the private or local domain because it supports only a limited number of users. An increase in the number of users adds complexity to the development of an ASV system, because a separately trained model M1 is required for each user, which is not practically feasible. Hence, System_1 performs satisfactorily, as it is applicable to the public domain.
Comparison of proposed system with existing systems
This section compares the performances of the proposed systems, System_1 and System_2, with some of the existing systems from the literature. Chettri et al. [10] have designed three ensemble systems (E1, E2 and E3) made up of different classical and deep learning models, where the ensemble system E1 performs the best among them. Cai et al. [15] have trained a ResNet deep learning model with CQCC, LFCC, IMFCC, Short Term Fourier Transform (STFT) gram and Group Delay (GD) gram features. However, it is trained only for the replay attack. Kumar et al. [26] have trained a Time Delay Shallow Neural Network (TDSNN) with CQCC, IMFCC, Linear Frequency Band Cepstral Coefficients (LFBC) and LFCC features for SS, VC and replay attacks. The ASVspoof 2019 challenge has provided a GMM model trained with LFCC and CQCC features at the frontend for SS, VC and replay attacks [27]. Jung et al. [46] have trained a Deep Neural Network model with 7 spectrograms, i-vectors and raw waveforms only for replay attack detection. Table 6 shows the comparison of these systems with the proposed systems of this paper. Although some systems from the literature perform well in detecting a particular attack type, the proposed systems also perform well while detecting all three kinds of spoofing attacks in one run.

Table 6 Comparison of proposed system with existing systems

| Works | Backend | Frontend features | SS, VC | Replay | Evaluation set EER |
|---|---|---|---|---|---|
| Chettri et al. [10] | Ensemble 1 | MFCC, IMFCC, SCM, i-vectors, long term average spectrum | ✔ | ✖ | 0.0264 |
| | Ensemble 2 | | ✖ | ✔ | 0.0611 |
| Cai et al. [15] | ResNet Fusion | CQCC, LFCC, IMFCC, STFT, GD gram | ✖ | ✔ | 0.0066 |
| ASVspoof 2019 Challenge [27] | GMM | CQCC | ✔ | ✖ | 0.0043 |
| | GMM | | ✖ | ✔ | 0.0987 |
| | GMM | LFCC | ✔ | ✖ | 0.0271 |
| | GMM | | ✖ | ✔ | 0.1196 |
| Kumar et al. [26] | TDSNN | CQCC, IMFCC, LFBC, LFCC | ✔ | ✖ | 0.057 |
| | TDSNN | | ✖ | ✔ | 0.064 |
| Jung et al. [46] | DNN | 7 spectrograms, i-vectors, raw waveforms | ✖ | ✔ | 0.0245 |
| Proposed work | System_1 | Static–Dynamic Hybrid CQCC | ✔ | ✔ | 0.029 |
| | System_2 | | ✔ | ✔ | 0.009 |

*✔ Indicates that a particular attack is addressed and ✖ indicates that a particular attack is not addressed. A blank cell continues the entry of the row above.
Conclusion
Undoubtedly, ASV systems are highly exposed to spoofing attacks. However, their performance is good enough that industry is attracted to use them in practical applications. An initiative to design a single dataset can provide new insights into the spoof detection task. The AllSpoofsASV dataset, a variation of the ASVspoof 2019 dataset, is a small step towards this. Combining different feature coefficients with hybrid deep learning models can help in the development of robust ASV systems.
This paper shows that a combination of static and dynamic CQCC features performs better with LSTM models than static features alone. Also, the comparison of results shows that the LSTM with Time Distributed Wrappers model (M1) outperforms the LSTM (M2) and CNN (M3) models when evaluated by Equal Error Rate (EER). However, the two-level voting protocol based spoof detection system System_1, which uses M2 and M3 at level 1 and M1 at level 2, outperforms all three individual models. As the LSTM model (M4) provides satisfactory performance, it can be used for speaker identification alongside spoof detection.
Also, the two-level spoof detection system with user identification and verification, System_2, which uses M4 at stage 1 and M1 at stage 2, performs better than System_1. However, it is restricted to a limited number of users. Using it in the public domain, or in an organization with a larger and variable number of speakers, will increase the complexity and the storage requirements of the system. For future work, more attacks such as twins and mimicry can be added to the dataset, and more hybrid combinations of features and deep learning models can be explored. Considering the importance of spoof detection in ASV, more efficient and complex structures such as the VGG family of deep learning models can also be used as a future extension of the proposed work.
Declarations
Conflict of interest The submitted work does not have any conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Beranek B (2013) Voice biometrics: success stories, success factors and what’s next. Biometr Technol Today 2013(7):9–11
2. Indumathi A, Chandra E (2012) Survey on speech synthesis. Signal Process Int J (SPIJ) 6(5):140
3. Lim R, Kwan E (2011) Voice conversion application (VOCAL). In: 2011 international conference on uncertainty reasoning and knowledge engineering, vol 1. IEEE, pp 259–262
4. Mohammadi SH, Kain A (2017) An overview of voice conversion systems. Speech Commun 88:65–82
5. Patil HA, Kamble MR (2018) A survey on replay attack detection for automatic speaker verification (ASV) system. In: 2018 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC). IEEE, pp 1047–1053
6. Wu Z, Evans N, Kinnunen T, Yamagishi J, Alegre F, Li H (2015) Spoofing and countermeasures for speaker verification: a survey. Speech Commun 66:130–153
7. Hautamäki RG, Kinnunen T, Hautamäki V, Leino T, Laukkanen AM (2013) I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry. In: Interspeech, pp 930–934
8. Hautamäki RG, Kinnunen T, Hautamäki V, Laukkanen AM (2014) Comparison of human listeners and speaker verification systems using voice mimicry data. Target 4000:5000
9. Lindberg J, Blomberg M (1999) Vulnerability in speaker verification-a study of technical impostor techniques. In: Sixth European conference on speech communication and technology
10. Chettri B, Stoller D, Morfi V, Ramírez MAM, Benetos E, Sturm BL (2019) Ensemble models for spoofing detection in automatic speaker verification. arXiv preprint arXiv:1904.04589
11. Sahidullah M, Delgado H, Todisco M, Yu H, Kinnunen T, Evans N, Tan ZH (2016) Integrated spoofing countermeasures and automatic speaker verification: an evaluation on ASVspoof 2015
12. Lavrentyeva G, Novoselov S, Malykh E, Kozlov A, Kudashev O, Shchemelinin V (2017) Audio replay attack detection with deep learning frameworks. In: Interspeech, pp 82–86
13. Campbell JP (1995) Testing with the YOHO CD-ROM voice verification corpus. In: 1995 international conference on acoustics, speech, and signal processing, vol 1. IEEE, pp 341–344
14. Chakroborty S, Saha G (2009) Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter. Int J Signal Process 5(1):11–19
15. Cai W, Wu H, Cai D, Li M (2019) The DKU replay detection system for the ASVspoof 2019 challenge: on data augmentation, feature representation, classification, and fusion. arXiv preprint arXiv:1907.02663
16. Balamurali BT, Lin KE, Lui S, Chen JM, Herremans D (2019) Toward robust audio spoofing detection: a detailed comparison of traditional and learned features. IEEE Access 7:84229–84241
17. Dua M, Aggarwal RK, Biswas M (2017) Discriminative training using heterogeneous feature vector for Hindi automatic speech recognition system. In: International conference on computer and applications (ICCA), pp 158–162
18. Sahidullah M, Kinnunen T, Hanilçi C (2015) A comparison of features for synthetic speech detection. In: 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), pp 2087–2091
19. Pal M, Paul D, Saha G (2018) Synthetic speech detection using fundamental frequency variation and spectral features. Comput Speech Lang 48:31–50
20. Todisco M, Delgado H, Evans NW (2016) Articulation rate filtering of CQCC features for automatic speaker verification. In: Interspeech, pp 3628–3632
21. Jelil S, Das RK, Prasanna SM, Sinha R (2017) Spoof detection using source, instantaneous frequency and cepstral features. In: Interspeech, pp 22–26
22. Dua M, Aggarwal R, Kadyan V, Dua S (2012) Punjabi Speech to text system for connected words, pp 206–209
23. Dua M, Aggarwal RK, Biswas M (2018) Discriminative training using noise robust integrated features and refined HMM modeling. J Intell Syst 29(1):327–344
24. Dua M, Aggarwal RK, Biswas M (2019) GFCC based discriminatively trained noise robust continuous ASR system for Hindi language. J Ambient Intell Hum Comput 10(2)
25. Dua M, Aggarwal RK, Biswas M (2019) Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling. Neural Comput Appl 31(10):6747–6755
26. Kumar MG, Kumar SR, Saranya MS, Bharathi B, Murthy HA (2019) Spoof detection using time-delay shallow neural network and feature switching. In: 2019 IEEE automatic speech recognition and understanding workshop (ASRU). IEEE, pp 1011–1017
27. ASVspoof 2019: automatic speaker verification spoofing and countermeasures challenge evaluation plan. http://www.asvspoof.org/
28. Huang L, Pun CM (2019) Audio replay spoof attack detection using segment-based hybrid feature and Dense Net-LSTM network. In: ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2567–2571
29. Mobiny A, Najarian M (2018) Text-independent speaker verification using long short-term memory networks. arXiv preprint arXiv:1805.00604
30. Dua M, Jain C, Kumar S (2021) LSTM and CNN based ensemble approach for spoof detection task in automatic speaker verification systems. J Ambient Intell Human Comput
31. Mittal A, Dua M (2021) Automatic speaker verification system using three dimensional static and contextual variation-based features with two dimensional convolutional neural network. International J Swarm Intell
32. Mittal A, Dua M (2021) Constant Q cepstral coefficients and long short-term memory model-based automatic speaker verification system. In: Proceedings of international conference on intelligent computing, information and control systems, pp 895–904
33. Chettri B, Mishra S, Sturm BL, Benetos E (2018) Analysing the predictions of a cnn-based replay spoofing detection system. In: 2018 IEEE spoken language technology workshop (SLT). IEEE, pp 92–97
34. Valenti G, Delgado H, Todisco M, Evans NW, Pilati L (2018) An end-to-end spoofing countermeasure for automatic speaker verification using evolving recurrent neural networks. In: Odyssey, pp 288–295
35. Kamble MR, Sailor HB, Patil HA, Li H (2019) Advances in anti-spoofing: from the perspective of ASVspoof challenges. APSIPA Trans Signal Inf Process 9
36. Lai CI, Abad A, Richmond K, Yamagishi J, Dehak N, King S (2019) Attentive filtering networks for audio replay attack detection. In: ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6316–6320
37. Edinburgh Data Share. https://datashare.is.ed.ac.uk/handle/10283/3336
38. Brown JC, Puckette MS (1992) An efficient algorithm for the calculation of a constant Q transform. J Acoust Soc Am 92(5):2698–2701
39. Brown JC (1991) Calculation of a constant Q spectral transform. J Acoust Soc Am 89(1):425–434
40. Yang J, Das RK, Li H (2018) Extended constant-Q cepstral coefficients for detection of spoofing attacks. In: 2018 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC). IEEE, pp 1024–1029
41. Glover JC, Lazzarini V, Timoney J (2011) Python for audio signal processing. In: Linux Audio Conference 2011, May 6-8 2011, Maynooth, Ireland
42. Cheuk KW, Anderson H, Agres K, Herremans D (2019) nnAudio: an on-the-fly GPU audio to spectrogram conversion toolbox using 1D convolution neural networks. arXiv preprint arXiv:1912.12055
43. Dinkel H, Qian Y, Yu K (2018) Investigating raw wave deep neural networks for end-to-end speaker spoofing detection. IEEE/ACM Trans Audio Speech Lang Process 26(11):2002–2014
44. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. In: Proc. Int. Conf. Learn. Representations, pp 1–13
45. Brownlee J (2021) https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/. Machine Learning Mastery Pty. Ltd
46. Jung JW, Shim HJ, Heo HS, Yu HJ (2019) Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge. arXiv preprint arXiv:1904.10134
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.