
TEXT-INDEPENDENT SPEAKER VERIFICATION BASED ON PROBABILISTIC NEURAL NETWORKS

Todor Ganchev, Nikos Fakotakis, George Kokkinakis
University of Patras

[email protected] [email protected] [email protected]

ABSTRACT

In this paper¹, a text-independent speaker verification system based on Probabilistic Neural Networks (PNNs) is presented. A modular structure with a distinct PNN for each enrolled speaker is used. A gender-dependent universal background model is built to represent the impostor speakers. A detailed description of the system is given, together with the time required for training and for processing all the test trials. The results obtained in the one-speaker detection task of the 2002 NIST Speaker Recognition Evaluation are reported.

¹ This work was supported by the “Infotainment management with Speech Interaction via Remote


1. Introduction

In this work, we present a detailed description of our text-independent speaker verification system, which participated in the 2002 NIST Speaker Recognition Evaluation (SRE) [1]. The performance results obtained in the one-speaker detection task are reported. A comprehensive description of the one-speaker detection task and the evaluation rules can be found in the 2002 NIST SRE plan [2].

PNNs were chosen as classifiers for the speaker verification system presented here because of their good generalization properties and short design times. Their design is straightforward and does not depend on training [3]. As a result, a PNN is built in only a fraction of the time needed to train a back-propagation ANN. It is well known that PNNs need more neurons than back-propagation networks, which leads to increased complexity and to higher computational and memory requirements during exploitation. Nevertheless, the speaker verification system described here is capable of working in real time on common personal computers.

2. System Concept

A simplified block diagram of our PNN-based speaker verification system is presented in Figure 1. The upper part of the figure depicts the training process of the system. The construction sequence of the universal background codebook (UBgCB) and the building of the personal codebooks for the enrolled speakers are shown. The lower part of Figure 1 presents the exploitation mode. An overall description of the building blocks is given in the following paragraphs. In the rest of this paper, our speaker verification system will be referred to as WCL-1.

2.1. Speech Feature Extraction

Saturation by level is a common phenomenon in telephone speech signals. To reduce the spectral distortions it causes, band-pass filtering of the speech is performed as the first step of the feature extraction process. A fifth-order Butterworth filter with a pass-band from 80 Hz to 3800 Hz is used for both training and testing. After the band-pass filtering, the speech signal is processed in 40 ms frames, overlapped by 30 ms. Pre-emphasis with factor α = 0.97 and Hamming windowing are applied before the FFT. Every feature vector consists of 33 filter-bank MFCCs, computed over a 1024-point FFT. Only the MFCC vectors extracted from voiced speech frames are used to represent the speakers’ identity. The voiced/unvoiced decision is obtained by pitch estimation based on the “modified autocorrelation method with clipping (AUTOC)” [4].

Figure 1: A simplified block diagram of the PNN-based Speaker Verification system. [Train phase: 2002 NIST SRE Development Data, speech feature extraction, vector quantization, merge all male/female codebooks, vector quantization, gender-dependent background codebook; 2002 NIST SRE Training Data, speech feature extraction, vector quantization, enrolled speakers codebooks; build a personal PNN for each speaker (Spk 1, 2, ..., N PNN models). Test phase: 2002 NIST SRE Test Data and speaker claims (test control file), speech feature extraction, test claimed speaker PNN with current speech frame, compute probability for multiple speech frames, applying threshold, make final decision: speaker accepted or rejected.]
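The framing, pre-emphasis and windowing steps of Section 2.1 can be illustrated with a minimal sketch. It is not the WCL-1 code: the 8 kHz sampling rate is an assumption (the text only states the 80 Hz to 3800 Hz pass-band), pre-emphasis is applied to the whole signal rather than per frame for simplicity, and the Butterworth filter, voicing decision and MFCC computation are omitted. All function names are hypothetical.

```python
import math

def preemphasize(signal, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_signal(signal, sample_rate=8000, frame_ms=40, overlap_ms=30):
    """Split a signal into Hamming-windowed frames.

    With the values from the text (40 ms frames overlapped by 30 ms) and the
    assumed 8 kHz rate, each frame has 320 samples and the hop is 80 samples.
    """
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * (frame_ms - overlap_ms) // 1000
    window = hamming(frame_len)
    emphasized = preemphasize(signal)
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        chunk = emphasized[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# One second of a synthetic 200 Hz tone stands in for telephone speech.
tone = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(8000)]
frames = frame_signal(tone)
print(len(frames), len(frames[0]))
```

Under these assumptions the 10 ms hop yields roughly 100 frames per second of speech, which is the rate at which feature vectors reach the later stages.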

2.2. Construction of the Codebooks

Due to the nature of PNNs, their complexity depends strongly on the number and dimensionality of the training vectors. A Vector Quantization technique is used to reduce the amount of training data [5]. It was found experimentally that a codebook of 128 vectors is large enough to represent the speaker’s characteristics with good accuracy. When a 256-vector codebook is used, the performance of the system improves slightly, but the memory and computational requirements increase considerably. Therefore, a codebook of 128 vectors was chosen as a trade-off. For the background speakers, however, a codebook of 256 vectors is necessary.

Both the speaker’s and the background codebooks are constructed by using the well-known k-means clustering algorithm [6].
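As an illustration of how k-means clustering reduces a set of feature vectors to a codebook, a minimal sketch follows. It uses a plain Lloyd-style iteration with random initialization, which is an assumption: the exact k-means variant, initialization and stopping rule used in WCL-1 are not stated in the text. The function names and the toy two-dimensional data are hypothetical; the system itself uses 33-dimensional MFCC vectors and codebooks of 128 or 256 vectors.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_codebook(vectors, codebook_size, iterations=20, seed=0):
    """Reduce a list of feature vectors to `codebook_size` centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, codebook_size)]
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)),
                          key=lambda k: squared_distance(v, centroids[k]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[k] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids

# Toy data: two well-separated groups of 2-D points.
offsets = [-0.5, 0.0, 0.5]
group_a = [[dx, dy] for dx in offsets for dy in offsets]
group_b = [[10.0 + dx, 10.0 + dy] for dx in offsets for dy in offsets]
codebook = sorted(kmeans_codebook(group_a + group_b, codebook_size=2))
print(codebook)
```

The same routine could be applied a second time to the merged same-gender codebooks to obtain a background codebook of fixed size.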

The gender-dependent UBgCBs were built by using all the speakers of the 2002 NIST SRE Development database. In total, 74 male and 100 female speakers were available, each speaker having two minutes of speech. As shown in Figure 1, the first step in the UBgCB construction process is to build a personal codebook for each of the background speakers. Then the same-gender codebooks are merged, and a vector quantization technique is used to reduce the UBgCB size to 256 vectors. The UBgCB, along with the personal codebooks built for the enrolled speakers, is then used to design the individual PNN for each of the target speakers.

2.3. The PNN and the PNN Classifier

Figure 2 shows the PNN with two hidden layers used in our system [7]. The Radial Basis layer (1) is followed by the Competitive layer (2):

$$a^1_i = \mathrm{radbas}\left( \lVert {}_i IW^{1,1} - p \rVert \, b^1_i \right) \qquad (1)$$

$$a^2 = \mathrm{compet}\left( LW^{2,1} a^1 \right) \qquad (2)$$

where $IW^{1,1}$ are the first-layer input weights, set to the transpose of the matrix formed from the Q training vector pairs, and $LW^{2,1}$ are the second-layer weights, set to the matrix of target vectors. The index i denotes the i-th element of $a^1$ or $b^1$, and the i-th row of the weight matrix $IW^{1,1}$. The input feature vector is denoted by p, and $b^1$ is the bias of the Radial Basis layer, defined as:

$$b^1 = \frac{\sqrt{-\ln(0.5)}}{\sigma} \qquad (3)$$

Here ||.|| denotes the Euclidean distance, and $a^2$ is the binary output of the second PNN layer. The transfer function of the Competitive layer is denoted by compet and employs the winner-take-all rule: the largest weighted sum of probabilities from the first layer is assigned a ‘1’, while the others receive zeros. In the PNN design, the spread σ, which acts as a smoothing parameter, was set to 0.35. In this way a moderate degree of interpolation between the speaker training vectors is kept. For each enrolled speaker, a personal PNN is designed to recognize him/her among an unlimited number of speakers. Because both the speakers’ and the background models are represented not by one feature vector but by codebooks, the problem is reduced to classifying one input vector into one of these two classes.

In the test phase, the PNN classifier decides whether or not the input trial belongs to the claimed speaker. To do this, the claimed speaker’s PNN is tested with the feature vectors extracted from the input speech. The degree of similarity of each input feature vector to the speaker’s model and to the background speakers’ model is estimated by computing the corresponding distances. For each input vector, a binary decision is made: an output of ‘1’ means that the vector belongs to the claimed enrolled speaker, while ‘0’ is produced when the feature vector is more similar to the background model.
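The per-frame decision of a two-class PNN can be sketched as follows. This is an illustrative reading of the radial basis and competitive layers, assuming radbas(n) = exp(-n²) and a bias of sqrt(-ln 0.5)/σ as in the MATLAB toolbox [7]; the function names, the tie-breaking rule (ties go to the background class) and the toy two-dimensional codebooks are assumptions, not details taken from WCL-1.

```python
import math

def radbas(n):
    """Radial basis transfer function, assumed to be exp(-n^2)."""
    return math.exp(-n * n)

def pnn_classify(p, speaker_codebook, background_codebook, spread=0.35):
    """Return 1 if vector p is assigned to the speaker class, else 0."""
    # Bias chosen so that a unit at distance `spread` from p outputs 0.5.
    bias = math.sqrt(-math.log(0.5)) / spread

    def class_activation(codebook):
        # Radial basis layer followed by summation over one class.
        return sum(radbas(math.dist(w, p) * bias) for w in codebook)

    speaker_sum = class_activation(speaker_codebook)
    background_sum = class_activation(background_codebook)
    # Competitive layer: winner takes all (ties go to the background here).
    return 1 if speaker_sum > background_sum else 0

# Hypothetical 2-D codebooks; real ones hold 128 and 256 MFCC vectors.
speaker_cb = [[0.0, 0.0], [0.2, 0.1]]
background_cb = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
print(pnn_classify([0.1, 0.0], speaker_cb, background_cb))
print(pnn_classify([1.0, 1.0], speaker_cb, background_cb))
```

Because the class activations are plain sums, the class with the larger codebook contributes more units, which is one reason a threshold on the trial-level score is still needed.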

The modular structure was selected because of its inherent flexibility and ease of updating. If any of the enrolled speakers needs to be retrained, only his/her PNN is replaced by a new one, without affecting the rest. Enrolling a new speaker is performed simply by designing a personal PNN for him/her and adding it to the others. New speakers can be added at any point during the system exploitation, because the PNNs of the already enrolled speakers are not affected by this process.

2.4. The Scores, the Threshold and the Final Decision

The output decisions of a given PNN, obtained by testing with multiple feature vectors, are used to compute the score for every trial. The score χ for a group of input vectors or for a whole trial is computed as:

$$\chi = \eta \left( N_1 / N_V - \beta \right) \qquad (4)$$

where $N_1$ is the number of vectors driving the PNN to produce output ‘1’, and $N_V$ is the total number of test vectors in the current trial. η and β are predefined constants for tuning the scale and the offset of the produced score.

Figure 2: Probabilistic Neural Network architecture. [Input p (R x 1). Radial Basis layer: ||dist|| between IW^{1,1} (Q x R) and p, multiplied element-wise (.*) by the bias b^1 (Q x 1) and passed through radbas to give a^1 (Q x 1). Competitive layer: LW^{2,1} (K x Q) applied to a^1 and passed through compet to give the output a^2 (K x 1). Q = number of input/target pairs, K = number of classes of input data, R = number of elements in input vector.]

The speaker-independent threshold used in the WCL-1 system is defined as follows:

$$\theta = \gamma \, K(\mathrm{SNR}) \, \log\left( \frac{N_{UBgCB} + N_{CB}}{N_{CB}} \right) \qquad (5)$$

where $N_{UBgCB}$ and $N_{CB}$ correspond to the number of vectors contained in the UBgCB and in the personal codebook of each enrolled speaker, respectively. The desired ratio between the false acceptance and false rejection rates is adjusted through γ.

Κ(SNR) is used to account for the SNR of the test sentence. The values that Κ(SNR) produces are usually in the range between 0.5 and 1; however, when the SNR falls as low as -10 dB, values of 0.25 or even smaller can be reached.

During the 2002 NIST SRE, the value of γ was set to 1, which drove our system to make decisions in the neighborhood of the Equal Error Rate (EER) point. Due to the preprocessing of the speech recordings performed by the SRE organizers, not enough speech pauses are available in the test trials, and therefore the SNR cannot be assessed with sufficient accuracy. Consequently, the function Κ(SNR) was fixed to a constant equal to 0.985, which is suitable for a relatively quiet environment.

The speaker-independent threshold obtained from (5) is then applied to the score computed by (4), and the final decision is made. When the score is above the threshold, the claimant is accepted; otherwise the trial is considered to belong to an impostor.
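The scoring and decision step can be sketched as below. The sketch assumes that equation (4) reads χ = η(N1/NV − β) and that equation (5) reads θ = γ·Κ(SNR)·log((N_UBgCB + N_CB)/N_CB) with a natural logarithm; both readings, as well as the values chosen for η and β and all function names, are assumptions made for illustration, since the text does not give the constants used in WCL-1.

```python
import math

def trial_score(frame_decisions, eta, beta):
    """Score of a trial from the per-frame binary PNN outputs.

    Assumed reading of (4): chi = eta * (N1 / NV - beta).
    """
    n_one = sum(frame_decisions)      # frames attributed to the claimed speaker
    n_total = len(frame_decisions)    # all test frames in the trial
    return eta * (n_one / n_total - beta)

def decision_threshold(n_ubgcb, n_cb, gamma=1.0, k_snr=0.985):
    """Speaker-independent threshold under the assumed reading of (5)."""
    return gamma * k_snr * math.log((n_ubgcb + n_cb) / n_cb)

def accept(score, theta):
    """Accept the claimant only when the score is above the threshold."""
    return score > theta

# Hypothetical per-frame outputs and hypothetical eta, beta.
decisions = [1, 1, 0, 1, 1, 0, 1, 1]
score = trial_score(decisions, eta=2.0, beta=0.0)
theta = decision_threshold(n_ubgcb=256, n_cb=128)
print(score, theta, accept(score, theta))
```

With γ = 1 and Κ(SNR) = 0.985, as stated for the evaluation, the threshold depends only on the two codebook sizes, so it is identical for every enrolled speaker.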

3. Memory Requirements and Processing Time

A common IBM-PC compatible personal computer was used to perform the training and the testing of our speaker verification system. The configuration includes a single CPU Pentium 4 at 1.6GHz and 512MB RAM.

The WCL-1 system is implemented in the MATLAB environment. It is still under development, and the program code has not been optimized for speed. The total time spent to build all speaker models and the UBgCBs, and to process all the test trials defined by the index files detect{1,2,3}.ndx, is shown in Table 1.

Table 1: The CPU time spent to build and test the WCL-1 speaker verification system.

Items        SpkCB                   UBgCB                   Test
Males        139 speakers, 5 ¾ h     74 speakers, 2 ¾ h      15840 trials, 12 ¾ h
Females      191 speakers, 9 h       100 speakers, 4 ¼ h     23309 trials, 20 h
Total time   14 ¾ h                  7 h                     32 ¾ h

Overall total time: 54 ½ h
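The figures in Table 1 can be turned into average per-item costs with a short calculation. The sketch below only re-uses the numbers from the table; the variable names are hypothetical, and the averages say nothing about how the time is distributed over individual trials.

```python
# Hours taken from Table 1.
spk_cb_hours = 5.75 + 9.0      # enrolled speaker models (139 + 191 speakers)
ubgcb_hours = 2.75 + 4.25      # background codebooks (74 + 100 speakers)
test_hours = 12.75 + 20.0      # all test trials

total_hours = spk_cb_hours + ubgcb_hours + test_hours

trials = 15840 + 23309
seconds_per_trial = test_hours * 3600.0 / trials

speakers = 139 + 191
seconds_per_speaker_model = spk_cb_hours * 3600.0 / speakers

print(total_hours)
print(round(seconds_per_trial, 2))
print(round(seconds_per_speaker_model, 1))
```

The average of about three seconds of CPU time per trial is small compared with the 15 to 45 seconds of speech in a primary-condition trial, which is consistent with the real-time claim made in the Introduction.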


4. Experiments and Results

The 2002 NIST SRE one-speaker Evaluation database was used in the experiments. It had been extracted from the Switchboard-Cell corpora and preprocessed in order to reduce channel echoes and to remove all significant speech pauses from the original recordings. The caller and called channels had also been separated. The Evaluation database for the one-speaker recognition task, as disseminated to the participants, consisted of cellular speech from 139 male and 191 female speakers, recorded in different environmental conditions, provisionally labeled {‘inside’, ‘outside’, ‘vehicle’}. A more detailed description of the Evaluation database can be found in [2]. The training data for each speaker consists of about 2 minutes of spontaneous speech, extracted from a single conversation. On average, approximately 40 and 44 seconds of voiced speech were obtained for the male and female speakers, respectively. A small group (about 4%) of all speakers, however, was represented by less than 15 seconds of voiced speech.

The primary condition task includes only those test trials that contain between 15 and 45 seconds of speech. The complete one-speaker detection task includes all the trials. In Figure 3, the Detection Error Trade-off (DET) plots for the primary condition and the complete tasks are shown. The slight decrease in performance observed for the complete task is due to the higher error rate on the trials shorter than 15 seconds.
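The points of a DET curve are pairs of miss and false-alarm rates obtained by sweeping the decision threshold over the trial scores. A minimal sketch is given below; the scores are invented for illustration, the function names are hypothetical, and the normal-deviate axis scaling used in actual DET plots is omitted.

```python
def det_points(target_scores, impostor_scores):
    """Return (threshold, miss_rate, false_alarm_rate) for each threshold.

    A trial is accepted when its score is above the threshold, so a target
    trial at or below the threshold counts as a miss.
    """
    points = []
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        misses = sum(1 for s in target_scores if s <= threshold)
        false_alarms = sum(1 for s in impostor_scores if s > threshold)
        points.append((threshold,
                       misses / len(target_scores),
                       false_alarms / len(impostor_scores)))
    return points

def equal_error_rate(points):
    """Approximate the EER at the point where the two rates are closest."""
    _, miss, false_alarm = min(points, key=lambda pt: abs(pt[1] - pt[2]))
    return (miss + false_alarm) / 2.0

# Hypothetical scores for four target and four impostor trials.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
points = det_points(targets, impostors)
print(equal_error_rate(points))
```

Plotting the miss rate against the false-alarm rate for all thresholds gives one curve per condition, such as the primary and complete tasks compared in Figure 3.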

Figure 4 shows the gender-related difference in the system performance. The traditionally lower performance for the female speakers is not as obvious here as it usually is. In our opinion, this is mainly because about 35% more speakers (100 versus 74) were available for the construction of the female universal background model. The gender-dependent difference in the system performance therefore remains hidden, because the conditions in the two experiments were not identical.

It should be noted, however, that neither the male nor the female set had enough speakers for building the UBgCB, which is designated to model the impostors. Because we exploited only the limited number of speakers available in the 2002 NIST SRE Development database, the UBgCB was not populated sufficiently, and thus the overall performance of the system was affected. Our experience from experiments with other databases (such as the SpeechDat(II)-Greek-FDB5000 and the SpeechDat(II)-English-MDB-1000) shows that at least 300 speakers are necessary for achieving a trustworthy impostor representation, and approximately 500 speakers are recommended for more reliable impostor modeling.

Figure 3: DET plot for one-speaker detection: complete task vs. primary condition. [Plot title: WCL-1:2002, 1-Speaker Detection -- Complete vs. Primary task. X-axis: False Alarm probability (in %); Y-axis: Miss probability (in %); both from 0.1 to 40. Curves: Primary task, All trials.]

Figure 4: DET plots for one-speaker detection: complete task by GENDER. [Plot title: WCL-1:2002, 1-Speaker Detection -- Complete task by GENDER. Y-axis: Miss probability (in %). Curves: Females, Males.]

In Figure 5, a comparison of the speaker verification performance for the aforementioned three call locations is shown. Surprisingly, the best performance was obtained for the ‘outside’ speech, and the worst for the ‘inside’ trials. This is evidently due to the mismatch between the training and test conditions. Our system was not equipped with channel normalization or environmental noise suppression techniques, which caused a difference of more than 5% in performance across the tested call locations. As seen in the DET graph, the performance for the ‘vehicle’ scenario is much closer to the noisy ‘outside’ condition than to the relatively quiet ‘inside’ environment.

In Figure 6, the non-normalized actual and minimum decision costs obtained for the complete task are shown. The actual decision cost is larger than the minimum one, and its operating point lies closer to the EER point. This is because in the 2002 NIST SRE we chose the factor γ equal to 1. In fact, the choice of the value of γ is application-dependent, as γ is designed to set the preferred ratio of false rejection to false acceptance errors. The value of γ changes only the location of the decision point along the curve on the DET plot, not the location or the shape of the DET curve itself.
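The relation between the actual and the minimum decision cost can be illustrated with a short sketch. It assumes the usual NIST detection cost form, C_Det = C_Miss · P_Miss · P_Target + C_FalseAlarm · P_FalseAlarm · (1 − P_Target), with C_Miss = 10, C_FalseAlarm = 1 and P_Target = 0.01; these values should be checked against the evaluation plan [2]. The operating points and function names are invented for illustration and are not results of WCL-1.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Non-normalized detection cost for one operating point.

    The cost parameters are assumed defaults, to be verified against [2].
    """
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def minimum_cost(operating_points):
    """Smallest cost over all (p_miss, p_fa) points of a DET curve."""
    return min(detection_cost(p_miss, p_fa) for p_miss, p_fa in operating_points)

# Hypothetical points along a DET curve: (miss rate, false-alarm rate).
curve = [(0.05, 0.20), (0.10, 0.10), (0.30, 0.02), (0.60, 0.005)]

actual = detection_cost(0.10, 0.10)   # a decision made near the EER point
best = minimum_cost(curve)
print(actual, best)
```

Under these assumed cost parameters false alarms are penalized far more heavily overall than misses, so the minimum-cost point lies at a low false-alarm rate, away from the EER point; a system tuned to operate near the EER therefore shows an actual cost above the minimum.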

5. Conclusions and Future Work

Our speaker verification system WCL-1 demonstrated promising potential in the 2002 NIST SRE, compared to the results of other, more elaborate systems. The results obtained in the evaluation tests confirmed that WCL-1 is a good foundation for evolving a PNN-based speaker verification system for real-world applications. However, more work is required to equip the baseline system presented here with new features that will allow it to cope better with environmental diversity and variability. Channel normalization and noise suppression techniques are necessary in order to improve the performance in mismatched training and testing conditions.

The quality of the universal background model is also an important factor in the final performance of the system. A large number of background speakers should be used in order to enrich the UBgCB population, and thus to model the impostors better.

Figure 5: DET plots for one-speaker detection, complete task by CALL LOCATION. [Plot title: WCL-1:2002, 1-Speaker Detection -- Complete task by CALL LOCATION. Y-axis: Miss probability (in %). Curves: In, Out, Car.]

Figure 6: The actual and the minimum

Acknowledgement

This research was supported by the Knowledge S.A. LogicDIS GROUP. The authors are grateful to Dr. Anastasios Tsopanoglou for the kind collaboration and the valuable ideas during the joint activities in the framework of this project.

References

[1] T. Ganchev, N. Fakotakis, G. Kokkinakis, “Speaker Verification System Based on Probabilistic Neural Networks”, 2002 NIST Speaker Recognition Evaluation, Workshop Presentations & Final Release of Results CD-ROM, Spring 2002. Available on the CD R81_99_1 in: \release\sysdesc\WCL_SRE_system.ps

[2] “The NIST Year 2002 Speaker Recognition Evaluation Plan”, National Institute of Standards and Technology of USA, February 2002. Available online at: http://www.nist.gov/speech/tests/spk/2002/doc/2002-spkrec-evalplan-v60.pdf

[3] D.F. Specht, “Probabilistic Neural Networks”, Neural Networks, Vol. 3, No.1, pp. 109-118, 1990.

[4] L.R. Rabiner, M.J. Cheng, A.E. Rosenberg and C.A. McGonegal, “A Comparative Performance Study of Several Pitch Detection Algorithms”, IEEE Transactions on ASSP, Volume ASSP-24, No.5, October 1976.

[5] R. M. Gray, “Vector Quantization”, IEEE Acoustics, Speech and Signal Processing Magazine, pp. 4-29, April 1984.

[6] J. A. Hartigan and M. A. Wong, “A k-means clustering algorithm”, Applied Statistics, No.28, pp.100-108, 1979.

[7] H. Demuth, M. Beale, “Neural Network Toolbox User’s Guide”, Version 3, MATLAB CD-ROM documentation, MathWorks Inc, pp. 6.12-6.20, January, 1998. Available on the CD in: \help\pdf_doc\nnet\nnet.pdf