Speaker Verification Using Simple Temporal Features and Pitch Synchronous Cepstral Coefficients
Bhavana V. S and Pradip K. Das
Department of Computer Science and Engineering
Indian Institute of Technology Guwahati, Assam, India
ABSTRACT
Speaker verification is the process of testing a speaker's claim of identity using his/her voice. It is done using parameters (features) extracted from the speaker's voice that can differentiate among many speakers. The efficiency of a speaker verification system depends mainly on the feature set providing high inter-speaker variability and low intra-speaker variability. Many methods have been used for speaker verification: some systems use Mel Frequency Cepstral Coefficients (MFCCs) as features, while others use Hidden Markov Model (HMM) based speaker recognition, Support Vector Machines (SVMs), Gaussian Mixture Models (GMMs), etc. In this paper, simple intra-pitch temporal information in conjunction with pitch synchronous cepstral coefficients forms the feature set. The distinctive features of a speaker are determined from the steady state part of five cardinal spoken English vowels. The performance was found to be average when these features were used independently, but very encouraging results were observed when both features were combined to form a decision for speaker verification. For a database of twenty speakers with 100 utterances per speaker, an accuracy of 91.04% has been observed. An analysis of the speakers who were recognized incorrectly is also presented and discussed.
1 Introduction

Speech is one of the primary modes of communication. Human ears can perceive even the minute details of delivered speech and thus distinguish one speaker from another. In a speaker recognition system, the main role is played by the feature set, which must distinguish different speakers efficiently.

In [3], acoustic cues were determined to identify the fricative-affricate contrast in word final position. The features used were the temporal and spectral characteristics of the vocalic interval, the duration of a silent interval, the presence or absence of a release burst, the rise time of the fricative noise and the duration of the fricative noise. It was reported that the onset characteristics of the fricative noise play a significant role in the perception of the fricative-affricate contrast: at short closure intervals the stimuli were heard as "dish", while at long closure intervals they were heard as "ditch" (the words used were dish and ditch).

The work in [6] focused on the various activities in the human body during the perception of speech signals as messages. It investigated how the raw speech signal is perceived as meaningful data and how the human brain analyses this data. It also looked into the importance of the acoustic properties of the signal in identifying a spoken utterance, and into how similar words are distinguished by concentrating on local areas of the signal. It discussed the distinct features which influence phonological processes and articulatory gestures.

A representation of the phase information in a speech model was presented in [10]. Phase is defined as the fractional part of a period through which the time variable of a periodic quantity has moved, as measured at any point in time from an arbitrary time origin; usually, the time origin is the last point at which the amplitude passed through zero from the negative to the positive direction. Phase information has been used in speech representation, and successful reconstruction of the original signal from its phase values has been reported. It was also reported that harmonic amplitudes play a critical role in the signal, as they reflect the spectral energy which contains the basic information about the signal.

Speaker characterization using prosodic super vectors with 'negative within class covariance normalization' projection, together with speaker modeling by support vector regression, is reported in [4]. The authors also proposed a segmental weight fusion technique combining acoustic and prosodic subsystems. Using the SRE corpus, they reported an error rate of 4.5% for the fusion system.

Speaker clustering was used in a GMM based speaker recognition system [12] in order to reduce computational complexity. The ISODATA algorithm was used to cluster speakers whose acoustic characteristics were similar under a distance measure. The database used was that of the China National High Technology Project, and identification was 98.75% correct. Speech was recognized even in situations where the voice characteristics of the input speech were unknown [13]; the ASJ-PB and ASJ-JNAS databases were used for those experiments.

Standard speech/non-speech HMMs were conditioned on speaker traits and evaluated on cepstral and pitch features in [11]. SVMs were applied to speech lattices, and acoustic features and word sequences were used to determine speaker traits. The overall sound characteristics of speakers can be covered by a set of Acoustic Segment Models (ASMs).
The mean vectors of ASMs based on unsupervised MAP adaptation are concatenated to represent the characteristics of a specific speaker, as proposed in [5]. The acoustic feature vectors used are 12 MFCCs and normalized energy plus their first and second order time derivatives; speaker specific characteristics were obtained from the MFCCs. In addition, an objective function consisting of a contrastive cost, in terms of speaker similarity and dissimilarity, as well as a data reconstruction cost used as regularization, was employed to normalize non-speaker-related information [9].

MFCC, due to the structure of its filter bank, captures vocal tract characteristics more effectively in the lower frequency regions. A new set of features using a complementary filter bank structure, which improved the distinguishing ability of speaker specific cues present in the high frequency zone, is proposed in [2]; the two were combined to improve the performance of the MFCC-based system. A multilevel speaker recognition system was proposed in [8], combining acoustic, phonotactic and prosodic subsystems; the acoustic features used were MFCCs. Speaker recognition in noisy environments was improved by the use of an SNR-sensitive subspace based enhancement technique and probabilistic SVMs [15]; in that work, MFCCs were used to represent the speech features, and the identification error rate in noisy environments was reduced from 43.4% to 25%.

The speaker verification system presented in [16] uses the following set of features: the first four formants, the amount of periodic and aperiodic energy in the speech signal, the spectral slope of the signal, and the difference between the strengths of the first and second harmonics. The NIST 98 Evaluation Database was used for the experiments, which consisted of telephone speech sampled at 8 kHz. The speaker models were constructed using GMMs and were trained using maximum-likelihood parameter estimation. The error rate was found to be 36.69% for female speakers and 34.4% for male speakers; the performance of these features was better than that of MFCCs for female speakers but not for male speakers.

Another set of features was presented in [1] for speaker verification. The parameters considered were the fundamental frequency, the articulator configuration of nasal consonants captured by the amplitude normalized filter outputs in the nasal spectra, the first two formants of the vowels, the source spectrum slope and prevoicing parameters. Speaker verification was done using neural networks in [14]; the networks were trained with unweighted cepstral coefficients derived from Linear Predictive Coding (LPC), using the back-propagation learning algorithm of the Multi-Layer Perceptron (MLP).

In this paper we propose four simple intra-pitch temporal features, the positive crest, positive trough, negative crest and negative trough counts, used in conjunction with pitch synchronous cepstral coefficients. These features, extracted from the steady state vowel region, are used to characterize a speaker. Speaker recognition is possible mainly due to the vibration of the vocal folds, as mentioned in [7]; hence the proposed features are extracted from the steady state region of the utterances, i.e. the vowel region.

The remainder of the paper is organized as follows: Section 2 describes the basic methodology, Section 3 describes the experiments and results, and Section 4 presents the conclusions and future work.
2 Methodology

The basic methodology followed in this paper can be divided into two phases: the preprocessing, pitch detection and marking phase, and the processing phase. In the preprocessing phase, the pitch period is detected and the position of the start of each pitch cycle is marked. In the processing phase, the proposed features are computed from the steady state regions of the vowel. This processing is applied to both the training data and the testing data. Initially the system is trained with 20 speakers: for each speaker, pitch periods are detected, and the proposed features and the pitch synchronous cepstral coefficients are calculated from each of the pitch periods in the speech signal. The average over all 20 training utterances gives the model for a given speaker. In the testing phase, the pitch is detected and the features are extracted from the test sample, and the distance of the feature set from each of the 20 models is computed using Tokhura's distance. The speaker model for which both the cepstral coefficients and the proposed feature set give the minimum distance is declared as the recognized speaker.
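The decision rule can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: the function names, the model layout, and the weight vectors cep_w and feat_w are ours, since the paper does not specify the Tokhura weights it uses.

```python
import numpy as np

def tokhura_distance(x, y, w):
    """Tokhura-style weighted squared-difference distance between
    feature vectors x and y, with per-dimension weights w."""
    x, y, w = np.asarray(x), np.asarray(y), np.asarray(w)
    return float(np.sum(w * (x - y) ** 2))

def verify(test_cep, test_feat, models, cep_w, feat_w):
    """Combined decision: accept only when the cepstral stream and the
    temporal-feature stream pick the same speaker; otherwise reject
    (return None).  `models` maps a speaker id to a pair
    (cepstral model vector, temporal feature model vector)."""
    best_cep = min(models, key=lambda s: tokhura_distance(test_cep, models[s][0], cep_w))
    best_feat = min(models, key=lambda s: tokhura_distance(test_feat, models[s][1], feat_w))
    return best_cep if best_cep == best_feat else None
```

A return value of None corresponds to the invalid case described in Section 3, in which the utterance is rejected and the speaker is asked to speak again.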
The speech signal was recorded using the Cool Edit software and saved in text format; this file (the input signal) is used for all processing. First, DC shift correction is applied to remove any DC component that may have been added during recording or through power supply interference. Next, the samples are normalized to a value that depends on the sampling rate of the signal (in the experiments the normalization value was fixed at 10,000 after some tests). Silence removal is the next step in the preprocessing phase: the average energy of the non-speech portion of the signal is calculated, and frames whose energy is below a given threshold are removed. In the experiments, a frame is classified as a speech frame if its average energy is above 110% of the average silence energy. The frame size is 100 samples, with a shift of 50 samples between subsequent frames. The main step in the preprocessing phase is pitch detection; a minimal sketch of the preprocessing steps up to this point is given below.
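The following is a plausible reconstruction of this preprocessing chain, not the authors' code. The frame length (100 samples), shift (50 samples), normalization target (10,000) and 110% energy threshold come from the text; the assumption that the leading frames of the recording are silence is ours, since the paper does not say how the non-speech region is located.

```python
import numpy as np

FRAME = 100        # frame length in samples (from the paper)
SHIFT = 50         # frame shift in samples (from the paper)
TARGET = 10000.0   # normalization value used in the experiments

def preprocess(signal, n_silence_frames=10):
    """DC shift correction, peak normalization and energy-based silence
    removal.  Assumes the first n_silence_frames frames are non-speech
    (an assumption; the paper does not specify this)."""
    x = np.asarray(signal, dtype=float)
    x -= x.mean()                       # DC shift correction
    x *= TARGET / np.max(np.abs(x))     # normalize peak amplitude to 10,000

    starts = range(0, len(x) - FRAME + 1, SHIFT)
    energies = [np.mean(x[i:i + FRAME] ** 2) for i in starts]
    silence = np.mean(energies[:n_silence_frames])  # assumed silence lead-in

    keep = np.zeros(len(x), dtype=bool)
    for i, e in zip(starts, energies):
        if e > 1.1 * silence:           # 110% threshold from the paper
            keep[i:i + FRAME] = True
    return x[keep]
```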
Pitch is detected by the following algorithm:

1. Two files are maintained. The first contains the peak sample value in each of the positive halves, along with the maximum of the differences of that sample from the peaks of the previous half and the next half, called the Maximum Peak Difference (MPD); the halves compared must be compatible, i.e. if the current peak is in a positive half, the previous and next halves considered are also positive. The second file contains the peak sample value in each of the negative halves along with its MPD values for the negative halves.

2. From the sample values in these two files, the average of the differences, i.e. the Average of the Maximum Peak Differences (AMPV), is computed for the positive half as well as the negative half. The pitch is calculated with respect to the positive half or the negative half, whichever is more consistent.

3. Based on the result of the above computation, either the positive half or the negative half is used for pitch calculation, and the pitch period is calculated as:

   Pitch period = Index(S_T) - Index(P_T)
   Threshold = Current peak value - (x * Current peak value)

   where S_T is the sample having a value greater than or equal to Threshold, P_T is the previous threshold sample, and x is determined from the MPD of the current peak sample value. If the MPD is less than the AMPV, the AMPV is divided into 10 intervals, the interval in which the MPD falls is found, and the interval number is taken as the value of x (i.e. if the MPD falls in interval 5 then x is 5). Similarly, if the MPD is greater than the AMPV, the difference between the maximum of all the MPDs and the AMPV is divided into 10 intervals, the interval in which the MPD falls is found, and x is set to 10 plus that interval number (i.e. if the MPD falls in the 3rd interval then x is set to 13).

The pitch period obtained is used further in feature extraction: the start and end of each pitch period are marked and kept for feature extraction.

Using the pitch information computed in the preprocessing phase, the features are extracted. The features considered are the positive crest, positive trough, negative crest and negative trough counts, as shown in Figure 1.

Figure 1: Feature set computation of the vowel /i/.

The feature extraction algorithm is as follows (a code sketch is given after this list):

• For feature extraction, the steady state region of the vowel is identified. For this, at most 10 frames before and 10 frames after the frame containing the normalized (maximum) value are considered; these frames constitute the steady state region of the utterance, i.e. the vowel part of the word. Each frame is of length equal to the pitch period computed above. Depending on the speaker's utterance, 10 frames may not always be available on each side; in that case the available number of frames is taken and the features are calculated over them.

• A window of three speech samples is slid through the pitch period one sample at a time. If the middle value of the window is greater than the other two (by magnitude), the respective counter is incremented: in a positive half the crest counter, in a negative half the trough counter. If the middle value is less than the other two (by magnitude), then in a positive half the trough counter is incremented, and in a negative half the crest counter is incremented. Otherwise the window is simply advanced by one sample. Let poc, pot, nec and net be the counters for the positive crest, positive trough, negative crest and negative trough respectively, and let w[1], w[2], w[3] be the window of processing. The computation can then be expressed as follows:

  If the window is in a positive half:
      If w[1] < w[2] and w[2] > w[3]: poc = poc + 1
      Else if w[1] > w[2] and w[2] < w[3]: pot = pot + 1
      Advance the window by one sample.
  Else (the window is in a negative half):
      If w[1] < w[2] and w[2] > w[3]: nec = nec + 1
      Else if w[1] > w[2] and w[2] < w[3]: net = net + 1
      Advance the window by one sample.

• The counters are accumulated over the (up to) 20 frames of an utterance and averaged. If N is the number of frames available, the final feature values for an utterance are

  poc = poc / N,  pot = pot / N,  nec = nec / N,  net = net / N.
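The window-based counting above can be sketched per pitch-period frame as follows. This is an illustrative reading, not the authors' implementation; treating samples exactly equal to zero as positive-half samples is an assumption the paper does not spell out.

```python
def temporal_features(period):
    """Count positive/negative crests and troughs in one pitch-period
    frame using a sliding 3-sample window.  Returns (poc, pot, nec, net).
    Samples exactly at zero are treated as positive-half samples (an
    assumption; the paper does not specify this case)."""
    poc = pot = nec = net = 0
    for i in range(len(period) - 2):
        w1, w2, w3 = period[i], period[i + 1], period[i + 2]
        if w2 >= 0:                 # positive half of the waveform
            if w1 < w2 > w3:
                poc += 1            # local maximum: positive crest
            elif w1 > w2 < w3:
                pot += 1            # local minimum: positive trough
        else:                       # negative half
            if w1 < w2 > w3:
                nec += 1            # least negative: negative crest
            elif w1 > w2 < w3:
                net += 1            # most negative: negative trough
    return poc, pot, nec, net
```

The four counts are then averaged over the N available pitch-period frames to give the final temporal feature vector for the utterance, e.g. np.mean([temporal_features(f) for f in frames], axis=0).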
In addition, standard cepstral coefficients (computed using Durbin's algorithm) are used along with the proposed feature set. Thus a total of 16 features is used for the speaker recognition system: 12 cepstral coefficients and the 4 proposed temporal features. The cepstral coefficients are calculated pitch synchronously as follows (a sketch of the per-frame cepstral computation is given below):

• The coefficients are calculated over a frame F1 of length equal to three consecutive pitch periods, say f1, f2 and f3 (within the previously defined range of 10 frames before and after the normalized frame), i.e. F1 = f1 + f2 + f3, where + denotes concatenation.

• In each iteration, the frame is shifted by one pitch period (i.e. F2 = f2 + f3 + f4, F3 = f3 + f4 + f5, and so on), giving 18 overlapping frames from the 20 pitch periods.

• The cepstral vectors of these 18 frames are averaged, and this average is the cepstral representative of a particular vowel utterance.

Finally, the average of the full feature set (16 values) over the 20 training utterances of each speaker is computed; this average value serves as the model for that speaker.
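As an illustration, the 12 LPC-derived cepstral coefficients for one frame might be computed as below, using the autocorrelation method with Durbin's (Levinson-Durbin) recursion and the standard LPC-to-cepstrum recursion. Setting the LPC order to 12, equal to the number of cepstral coefficients, is an assumption; the paper states only that the cepstral coefficients are computed with Durbin's algorithm.

```python
import numpy as np

def lpc_cepstrum(frame, p=12):
    """12 LPC-derived cepstral coefficients for one analysis frame.
    Assumes a nonzero-energy frame; the order p = 12 is an assumption."""
    x = np.asarray(frame, dtype=float)
    # autocorrelation lags r[0..p]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])

    # Durbin's (Levinson-Durbin) recursion for LPC coefficients a[1..p]
    a = np.zeros(p + 1)
    e = r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        e *= (1.0 - k * k)

    # LPC-to-cepstrum: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k]
    c = np.zeros(p + 1)
    for n in range(1, p + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]
```

Each of the 18 three-pitch-period frames would be passed through such a function, and the 18 resulting vectors averaged to obtain the cepstral representative of the utterance.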
3 Experiments and Results

The experiments consist of training the system with steady state vowels from a set of speakers and testing the validity of a claim from an unknown vowel utterance of an unknown speaker. The speaker verification system was developed using Microsoft Visual C++ and was trained with 2000 utterances taken from 20 male speakers in the age group of 22-26 years. The following five sentences were used for recording; the vowel regions were extracted manually, and the preprocessing and processing were applied to the extracted vowel regions.

1. An age of apes is an old story. (/e/)
2. There is a river by the mango tree. (/o/)
3. Ramayana is an old story. (/a/)
4. A cuckoo sings on a mango tree by the river. (/u/)
5. The bee is sitting on a flower by the mango tree. (/i/)

The steady state parts of the five cardinal English vowels were marked and saved for feature analysis. The accuracy of the system was calculated on another set of 500 utterances from the same set of 20 speakers. During the training phase, the model (feature set) of each speaker was determined by averaging the 20 training utterances of each vowel. During the testing phase, test samples were collected, the feature set was extracted from the samples, and the distance of the feature set from all the models with which the system was trained was calculated. The unknown speaker was identified as the speaker S_i with which both the cepstral coefficients and the new feature set of the test sample had minimum distance. If the cepstral coefficients showed minimum distance with speaker S_i while the new feature set showed minimum distance with speaker S_j (i ≠ j, 1 ≤ i, j ≤ 20),
then the case was considered invalid and was rejected; in such cases, the speaker is asked to speak once more. Even though the rejection rate is high, in the cases where a decision is made, the new system is much more accurate than the standard cepstral-based speaker recognition system. The accuracy of the new feature set alone is low compared to the cepstral-based system, but the combined system is much more accurate than the cepstral-based system. The accuracies of the cepstral-based system, the feature-based system and the combined system are shown in Table 1 below:

Parameters              Total Test Cases    Correctly Recognized    Accuracy
Cepstral coefficients   500                 347 (out of 500)        69.81%
New Feature Set         500                 144 (out of 500)        28.97%
Combined                500                 122 (out of 134)        91.04%

Table 1: Comparison of the accuracy of the various systems.

The accuracy of the speaker verification system for each vowel is given in Table 2 below:

Vowel   Total Utterances    Rejected Cases    Accepted (Correct)    Accepted (Wrong)
/a/     100                 77                17 (73.91%)           6
/e/     100                 66                33 (97.05%)           1
/o/     100                 77                19 (82.60%)           4
/u/     100                 79                21 (100%)             0
/i/     100                 67                32 (96.97%)           1

Table 2: Accuracy of the individual vowels in speaker verification with the proposed feature set.
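As a consistency check derived from the two tables, summing the columns of Table 2 recovers the combined row of Table 1:

$$\text{Accepted} = 500 - (77 + 66 + 77 + 79 + 67) = 134, \qquad \text{Correct} = 17 + 33 + 19 + 21 + 32 = 122,$$
$$\text{Combined accuracy} = \frac{122}{134} \approx 91.04\%, \qquad \text{Rejection rate} = \frac{366}{500} = 73.2\%.$$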
The selection of the words from which the vowels were extracted had a significant effect on the accuracy of the speaker recognition system, mainly for the vowel /i/. Initially /i/ was extracted from the word "river"; in that case, because of the short duration of /i/ and the preceding consonant, the vowel part was smoothened. So for the vowel /i/ the word "bee" was used instead, where the duration of the sound /i/ was sufficient and the waveform was also consistent.

There were some utterances where the cepstral-based system showed misrecognition but which were correctly recognized by the new feature set, and some utterances where the new feature set showed misrecognition but which were correctly recognized by the cepstral-based system. In certain cases both the cepstral coefficients and the new feature set showed misrecognition. All these cases were considered unreliable cases where a decision could not be made, and such samples were rejected by the system. Even though the rejection rate is high, the decisions made by the system in the reliable cases are much more accurate than those of the cepstral-based system or the new feature set considered alone.

In the misrecognized cases, when graphs of the feature set were plotted, it was found that misrecognition occurred for those speakers whose training utterances varied a lot. The graph corresponding to the feature set of a misrecognized speaker is shown in Figure 2.

Figure 2: Behavior of the four features across the 20 utterances of a misrecognized speaker for the vowel /u/.

Here the variation across the utterances can be seen as a non-uniform distribution of the feature set values. Recognition of such speakers is poor because their feature set values do not concentrate around a specific value and can therefore match some other speaker. It was also noted that for the vowels /u/ and /o/ even a slight variation in the feature set can lead to misrecognition, because of the narrow gap between the feature set values; in such cases the number of misrecognitions is limited by the cepstral coefficients. For the vowel /i/, on the other hand, the acceptable range of variation in the feature set is larger than for all the other vowels, because of the wide differences in the feature set values across speakers.

The speech spectrograms of the speaker with which a test sample was wrongly matched, the correct speaker it should have been matched with, and the test sample itself are shown in Figure 3.

Figure 3: Spectrograms of the speaker with which the mismatch occurred, the correct speaker, and the wrongly recognized test speaker.
4 Conclusions and Future Work

In this paper, we have proposed a set of features which can be used to characterize speakers. The feature set consists of four simple intra-pitch temporal attributes, the positive crest, positive trough, negative crest and negative trough counts, along with 12 pitch synchronous cepstral coefficients. The system was tested with a set of 20 speakers, and its accuracy was found to be 91.04%. The vowels /i/, /u/ and /e/ were found to be the most accurate for speaker verification. Future work may focus on automating the separation of the vowel region from the other regions of the recorded sentences and on adding more features to improve the accuracy. Experiments with more speakers are in progress to check how the accuracy of the system changes.
References

[1] Bolt Beranek and Newman. Efficient acoustic parameters for speaker recognition. Journal of the Acoustical Society of America, 51:2044–2056, 1972.

[2] Sandipan Chakroborty, Anindya Roy, and Goutam Saha. Fusion of a complementary feature set with MFCC for improved closed set text-independent speaker identification. IEEE International Conference on Industrial Technology, pages 387–390, 2006.

[3] Michael F. Dorman and David Isenberg. Acoustic cues for a fricative-affricate contrast in word final position. Journal of Phonetics, 8:397–405, 1980.

[4] Yanhua Long, Wu Guo, Lirong Dai, Bin Ma, Haizhou Li, and Eng Siong Chng. Exploiting prosodic information for speaker recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4225–4228, 2009.

[5] Bin Ma, Donglai Zhu, and Haizhou Li. Acoustic segment modeling for speaker recognition. IEEE International Conference on Multimedia and Expo, pages 1668–1671, 2009.

[6] David Poeppel and Martin Hackl. Topics in Integrative Neuroscience. Cambridge University Press, Cambridge, 2008.

[7] Gayadhar Pradhan and S. R. Mahadeva Prasanna. Significance of vowel onset point information for speaker verification. International Journal of Computer and Communication Technology, 2:60–66, 2011.

[8] Joaquin Gonzalez Rodriguez, Daniel Ramos Castro, Doroteo Torre Toledano, Alberto Montero Asenjo, Javier Gonzalez Dominguez, Ignacio Lopez Moreno, Julian Fierrez Aguilar, Daniel Garcia Romero, and Javier Ortega Garcia. Speaker recognition. IEEE Aerospace and Electronic Systems Magazine, 22:15–21, 2007.

[9] Ahmad Salman and Ke Chen. Exploiting speaker specific characteristics with deep learning. International Joint Conference on Neural Networks, 22:103–110, 2011.

[10] I. Saratxaga, I. Hernaez, D. Erro, E. Navas, and J. Sanchez. Simple representation of signal phase for harmonic speech models. Electronics Letters, 45:381–383, 2009.

[11] Izhak Shafran, Michael Riley, and Mehryar Mohri. Voice signatures. IEEE Workshop on Automatic Speech Recognition and Understanding, pages 31–36, 2003.

[12] Bing Sun, Wenju Liu, and Qiuhai Zhong. Hierarchical speaker identification using speaker clustering. International Conference on Natural Language Processing and Knowledge Engineering, pages 299–304, 2003.

[13] H. Suzuki, H. Zen, Y. Nankaku, C. Miyajima, K. Tokuda, and T. Kitamura. Speech recognition using voice-characteristics-dependent acoustic models. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1:I-740–I-743, 2003.

[14] P. D. Templeton and B. J. Guillemin. Speaker identification based on vowel sounds using neural networks. Australian Speech Science and Technology Association Proceedings, pages 280–285, 1990.

[15] Jia-Ching Wang, Chung Hsien Yang, Jhing Fa Wang, and Hsiao Ping Lee. Robust speaker identification and verification. IEEE Computational Intelligence Magazine, 2:52–59, 2007.

[16] Carol Y. Espy Wilson, Sandeep Manocha, and Srikanth Vishnubhotla. A new set of features for text-independent speaker identification.