Neural Network Based Speaker Classification and Verification Systems with Enhanced Features
Zhenhao Ge, Ananth N. Iyer, Srinath Cheluvaraja, Ram Sundaram, Aravind Ganapathiraju
Intelligent Systems Conference 2017, 7-8 September 2017 | London, UK
Interactive Intelligence Inc., Indianapolis, Indiana, USA
Email: {roger.ge, ananth.iyer, srinath.cheluvaraja, ram.sundaram, aravind.ganapathiraju}@inin.com

Abstract—This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves a 100% classification rate in classification and less than 6% Equal Error Rate (EER) in verification, using merely about 1 second and 5 seconds of data respectively. Features extracted with a stricter Voice Activity Detection (VAD) than is typical for speech recognition ensure that a stronger voiced portion is captured for speaker recognition, and speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker; both are proven to improve system performance. In building the neural network speaker classifier, the network structure parameters are optimized with grid search, and dynamically reduced regularization parameters are used to avoid training being trapped in a local minimum, enabling training to go further with lower cost. In speaker verification, performance is improved with prediction score normalization, which rewards the speaker identity indices with distinct peaks and penalizes the weak ones with high scores but more competitors, and with speaker-specific thresholding, which significantly reduces the EER in the ROC curve. The TIMIT corpus with 8K sampling rate is used here. The first 200 male speakers are used to train and test the classification performance; their testing files are used as in-domain registered speakers, while data from the remaining 126 male speakers are used as out-of-domain speakers, i.e. imposters, in speaker verification.
Keywords—Neural Network, Speaker Classification, Speaker Verification, Feature Engineering
I. INTRODUCTION
Speaker recognition has been a popular and broad topic in speech research for decades. It includes speaker detection, i.e. detecting whether there is a speaker in the audio; speaker identification, i.e. identifying whose voice it is; and speaker verification or authentication, i.e. verifying someone's voice. If the speaker set is closed, i.e. the audio must come from one of the enrolled speakers, then speaker identification simplifies to speaker classification. There are other building blocks, such as speaker segmentation, clustering, and diarization, which can be further developed on top of the fundamental speaker recognition techniques. Fig. 1 provides diagrams for speaker identification and verification. The main approaches in this area include 1) template matching, such as nearest neighbor [1] and vector quantization [2]; 2) neural networks, such as the time delay neural network [3] and decision tree [4]; and 3) probabilistic models, such as the Gaussian Mixture Model (GMM) with Universal Background Model (UBM) [5], joint factor analysis [6], i-vector [7], [8], and the Support Vector Machine (SVM) [9]. Methods can also be divided into text-dependent and text-independent, where the former achieves better performance with additional information and the latter is more user-friendly and easier to use. Reynolds [10] and Fauve [11] provided good overviews of common speaker recognition applications with state-of-the-art performance.
Fig. 1. Major components for speaker identification and speaker verification.
This paper proposes a neural network framework for text-independent speaker classification and verification, using the TIMIT 8K database. With optimization in features and model training, the system achieves 100% classification accuracy with slightly more than 1 second of speech, and less than 6% EER in speaker verification against an imposter pool of more than 100 speakers, using approximately 5 seconds of data. The following sections walk through the major pieces of this work, including feature engineering (Sec. II) and the design, implementation, and results of the speaker classification and verification systems (Sec. III and Sec. IV). Finally, the conclusion and future work are given in Sec. V.

II. DATA PREPARATION AND FEATURE ENGINEERING
The following 3 subsections introduce the database used in this paper and the process of converting raw speech into the features used in speaker classification and verification, including a) preprocessing, and b) feature extraction, normalization, and concatenation.
A. Database
Speech from all 326 male speakers across the 8 dialect regions in the "train" folder of the TIMIT corpus, with 8K sampling rate, is used here. Data of males from the "test" folder and data of females from both the "train" and "test" folders are currently reserved for future development. For each speaker, there are 10 data files, each containing one sentence with a duration of about 2.5 seconds. They come from 3 categories: "SX" (5 sentences), "SI" (3 sentences), and "SA" (2 sentences). Data are first sorted alphabetically by speaker name within their dialect region folders, then combined to form a list of 326 speakers. They are then divided into 2 groups: the first 200 speakers (group A) and the remaining 126 speakers (group B). For speaker classification, the "SX" sentences in group A are used to train the text-independent Neural Network Speaker Classifier (NNSC), while the "SA" and "SI" sentences in group A are used to test it. For speaker verification, since it is based on the NNSC, only "SA" and "SI" sentences are used, to avoid overlap with any data used in model training. Speakers in group A serve as in-domain speakers, and speakers in group B serve as out-of-domain speakers (imposters).
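As a concrete illustration, the grouping described above can be sketched as follows. The directory layout, file-name tests, and helper names are assumptions for illustration, not the authors' code.

import os

def split_speakers(train_dir):
    # Collect male speakers (TIMIT male speaker folders start with 'M')
    # from all dialect-region folders, sorted alphabetically by name.
    speakers = sorted(
        spk for dr in sorted(os.listdir(train_dir))
        for spk in os.listdir(os.path.join(train_dir, dr))
        if spk.startswith('M'))
    # First 200 speakers are in-domain clients (group A), the remaining
    # 126 are out-of-domain imposters (group B).
    return speakers[:200], speakers[200:]

def classification_files(speaker_dir):
    # "SX" sentences train the classifier; "SA" and "SI" sentences test it.
    files = os.listdir(speaker_dir)
    train = [f for f in files if f.startswith('SX')]
    test = [f for f in files if f[:2] in ('SA', 'SI')]
    return train, test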
B. Preprocessing
Preprocessing mainly consists of a) scaling the maximum absolute amplitude to 1, and b) Voice Activity Detection (VAD) to eliminate the unvoiced portions of speech. Experiments show that both speaker classification and verification perform significantly better when speakers are evaluated using only voiced speech, especially when the data is noisy.

An improved version of Giannakopoulos's recipe [12] based on short-term energy and spectral centroid is developed for VAD. Given a short-term signal s(n) with N samples, the energy is:

E = \frac{1}{N} \sum_{n=1}^{N} |s(n)|^2,   (1)

and given the corresponding Discrete Fourier Transform (DFT) S(k) of s(n) with K frequency components, the spectral centroid can be formulated as:

C = \frac{\sum_{k=1}^{K} k S(k)}{\sum_{k=1}^{K} S(k)}.   (2)

The Short-Term Energy (STE) E is used to discriminate silence and environmental noise from speech, and the Spectral Centroid (SC) C can be used to remove non-environmental noise, i.e. non-speech sounds such as coughing, mouse clicking, and keyboard tapping, since these normally have different SCs compared to human speech. Fixed window and hop sizes are used when computing the frame-level E and C. A speech frame is considered voiced only when E and C are both above their respective thresholds T_E and T_C; otherwise it is removed. These thresholds are adjusted to be slightly higher than usual to enforce a stricter VAD algorithm and ensure the quality of the captured voiced sections. This is achieved by tuning the signal median smoothing parameters, such as step size and smoothing order, and by setting the thresholds T_E and T_C as weighted averages of the local maxima in the distribution histograms of the short-term energy and spectral centroid respectively. Fig. 2 shows an example of applying different median filter smoothing step sizes to the STE and SC. A larger step size (e.g. 7) and order (e.g. 2) are used in order to achieve a stricter VAD.

Fig. 2. Short-term energy and spectral centroid with different median filter smoothing steps and orders: (a) smoothing step size 4; (b) smoothing step size 7.
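A minimal sketch of this VAD procedure follows. The 50 ms window, 25 ms hop, and equal weights in the histogram-peak average are illustrative assumptions, not the exact settings used in the paper.

import numpy as np
from scipy.signal import medfilt

def vad_mask(signal, fs, win=0.050, hop=0.025, smooth=7, order=2):
    n_win, n_hop = int(win * fs), int(hop * fs)
    starts = range(0, len(signal) - n_win, n_hop)
    frames = np.array([signal[s:s + n_win] for s in starts])
    energy = np.mean(frames ** 2, axis=1)                           # Eq. (1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    k = np.arange(1, spec.shape[1] + 1)
    centroid = (spec * k).sum(axis=1) / (spec.sum(axis=1) + 1e-12)  # Eq. (2)
    # Median-smooth both contours 'order' times with the given step size.
    for _ in range(order):
        energy = medfilt(energy, smooth)
        centroid = medfilt(centroid, smooth)
    def threshold(x):
        # Weighted average of the two largest histogram-peak locations;
        # the 0.5/0.5 weighting is an assumption.
        hist, edges = np.histogram(x, bins=50)
        peaks = edges[np.argsort(hist)[-2:]]
        return 0.5 * peaks[0] + 0.5 * peaks[1]
    # A frame is voiced only if both measures exceed their thresholds.
    return (energy > threshold(energy)) & (centroid > threshold(centroid))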
C. Feature Extraction, Normalization and Concatenation

The 39-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) with delta and double delta were generated from the preprocessed speech, following Ellis's recipe [13]. They were extracted using overlapped Hamming windows hopping every 10 ms. Then, the features of each speaker were normalized with his own mean and variance (speaker-level MVN, or SMVN), instead of with the overall mean and variance (global-level MVN, or GMVN). Fig. 3 shows that SMVN, though it converges more slowly, achieves better frame-level training and validation accuracies in network training. This is slightly counter-intuitive, since SMVN overlaps speaker patterns on top of each other. However, as training proceeds, it matches instances of patterns from the same speaker better than GMVN does.
Fig. 3. Comparison of global-level MVN vs. speaker-level MVN in NN training in terms of training and validation frame accuracies.
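The feature pipeline just described might look like the following sketch, using librosa's MFCC and delta utilities. The 25 ms analysis window and the exact call parameters are assumptions; only the 39-dimensional layout, the 10 ms hop, and the speaker-level normalization follow the text.

import numpy as np
import librosa

def speaker_features(wav_paths, fs=8000):
    # Extract 39-dim MFCC(+delta,+double-delta) frames for one speaker.
    feats = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=fs)
        mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13,
                                    n_fft=int(0.025 * fs),
                                    hop_length=int(0.010 * fs),
                                    window='hamming')
        full = np.vstack([mfcc,
                          librosa.feature.delta(mfcc),
                          librosa.feature.delta(mfcc, order=2)])  # 39 x T
        feats.append(full.T)                                      # T x 39
    # Speaker-level MVN (SMVN): normalize with this speaker's own
    # statistics, pooled over all of his files.
    pooled = np.concatenate(feats)
    mu, sigma = pooled.mean(axis=0), pooled.std(axis=0) + 1e-12
    return [(f - mu) / sigma for f in feats]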
To capture the transition patterns over longer durations, these 39-dimensional feature frames were concatenated to form overlapped longer frames. In this work, 10 frames (100 ms) were concatenated with a hop size of 3 frames (30 ms), as shown in Fig. 4.
Fig. 4. Feature concatenation example with a window size of 10 frames and a hop size of 3 frames.
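The concatenation in Fig. 4 amounts to a sliding stack over the frame sequence; a minimal numpy sketch:

import numpy as np

def concat_frames(frames, win=10, hop=3):
    # frames: (T, 39) array -> (num_windows, 390) array, stacking 'win'
    # consecutive frames and advancing 'hop' frames between stacks.
    return np.array([frames[i:i + win].reshape(-1)
                     for i in range(0, len(frames) - win + 1, hop)])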
III. NEURAL NETWORK SPEAKER CLASSIFICATION
The concatenated features (i.e. 390-dimensional feature vectors) are used as the input to a neural network speaker classifier. As mentioned in the first paragraph of Sec. II, the "SX" sentences of the first 200 male speakers were used for training, and the "SA" and "SI" sentences from the same set of speakers were used for testing.
A. Cost Function and Model Structures
Ng's neural network training recipe for hand-written digit classification [14] is used here, which treats the multi-class problem as K separate binary classifications. It can be considered a generalization of the cost function of binary classification using logistic regression, and it is built on slightly different concepts compared with the cross-entropy cost function with a softmax output layer [15]. Given M samples, K output classes, and L layers, including the input, output, and all hidden layers in between, the cost function can be formulated as:

J(\Theta) = -\frac{1}{M} \left[ \sum_{m=1}^{M} \sum_{k=1}^{K} \left( y_k^{(m)} \log(h_\theta(x^{(m)})_k) + (1 - y_k^{(m)}) \log(1 - h_\theta(x^{(m)})_k) \right) \right] + \frac{\lambda}{2M} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2,   (3)

where h_\theta(x^{(m)})_k is the k-th output of the final layer given the m-th input sample x^{(m)}, and y_k^{(m)} is its corresponding target label. The 2nd half of Eq. (3) is the regularization term used to prevent over-fitting, where \lambda is the regularization parameter and \theta_{ji}^{(l)} is the j-th row, i-th column element of the weight matrix \Theta^{(l)} between the l-th and (l+1)-th layers, i.e. the weight from the i-th node in the l-th layer to the j-th node in the (l+1)-th layer.

In this work, there is only 1 hidden layer (L = 3) with 200 nodes (s_2 = 200), the input feature dimension is 390 (s_1 = 390), and the speaker classifier was trained with data from 200 speakers (s_3 = K = 200). Therefore, the network structure is 390 : 200 : 200, with weight matrices \Theta^{(1)} (200 × 391) and \Theta^{(2)} (200 × 201). The additional column in each is a bias vector, which is left out of regularization, since the change of bias is unrelated to over-fitting. In this example, the regularization term in Eq. (3) can be instantiated as

\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2 = \sum_{i=1}^{390} \sum_{j=1}^{200} (\theta_{j,i}^{(1)})^2 + \sum_{i=1}^{200} \sum_{j=1}^{200} (\theta_{j,i}^{(2)})^2.   (4)
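For illustration, Eqs. (3) and (4) can be written out directly. This is a sketch of the cost computation only (the conjugate-gradient training described below is omitted), with variable names chosen here for clarity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(Theta1, Theta2, X, Y, lam):
    # X: (M, 390) inputs; Y: (M, 200) one-hot speaker labels.
    # Theta1: (200, 391), Theta2: (200, 201); first column is the bias.
    M = X.shape[0]
    A1 = np.hstack([np.ones((M, 1)), X])                   # add bias input
    A2 = np.hstack([np.ones((M, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)                             # (M, 200) outputs
    # K independent binary cross-entropy terms, as in Eq. (3).
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / M
    # Regularization over all non-bias weights, as in Eq. (4).
    reg = (lam / (2 * M)) * (np.sum(Theta1[:, 1:] ** 2)
                             + np.sum(Theta2[:, 1:] ** 2))
    return J + reg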
B. Model Training and Performance Evaluation

The neural network model is trained through forward-backward propagation. Denoting z^{(l)} and a^{(l)} as the input and output of the l-th layer, the sigmoid function

a^{(l)} = g(z^{(l)}) = \frac{1}{1 + e^{-z^{(l)}}}   (5)

is selected as the activation function, and the input z^{(l+1)} of the (l+1)-th layer is transformed from the output a^{(l)} of the l-th layer using z^{(l+1)} = \Theta^{(l)} a^{(l)}. Then, h_\theta(x) can be computed through forward propagation: x = a^{(1)} \to z^{(2)} \to a^{(2)} \to \cdots \to z^{(L)} \to a^{(L)} = h_\theta(x). The weight matrices \Theta^{(l)} are randomly initialized from a continuous uniform distribution over a small symmetric interval around zero and then trained through backward propagation of \partial J / \partial \theta_{j,i}^{(l)}, minimizing J(\Theta) with Rasmussen's conjugate gradient algorithm, which handles the step size (learning rate) automatically with a slope ratio method [16].

In evaluating the classifier performance, the sigmoid output of the final layer, h_\theta(x^{(m)}), is a K-dimensional vector, with each element in the range (0, 1). It serves as the "likelihood" indicating how likely the m-th input frame is to belong to each of the K speakers. The speaker classification can be predicted by the sum of log-likelihoods over M input frames (the prediction score), and the predicted speaker ID k* is the index of its maximum:

k^* = \arg\max_{k \in [1, K]} \left( \sum_{m=1}^{M} \log(h_\theta(x^{(m)})_k) \right).   (6)

M can range from 1 to the entire frame length of the testing file. If M = 1, the accuracy achieved is based on individual frames, each of which covers 100 ms (the window duration T_win in feature concatenation) with 30 ms of new data compared with the previous frame. On the other hand, if M equals the total number of frames in the file, the accuracy is file-based. The average duration of the sentences (i.e. the file length) is about 2.5 seconds. In general, larger M leads to higher accuracy. Given the best available model with the network structure 390 : 200 : 200, Fig. 5 demonstrates an example of the file-level prediction scores for speaker MPGR0. It shows that the peak at the positive speaker index (in the green circle) drops slightly from the file SI1410 in the training set to the file SA1 in the testing set, but remains clearly distinguishable from all the negatives.
Fig. 5. File-level prediction scores of speaker MPGR0: (a) SI1410 in the training set; (b) SA1 in the testing set.

Using this model, the file-level training and testing accuracies with 200 speakers are both 100%, as indicated in Table I. The frame-level testing accuracy is 71.42%, which indicates that 71.42% of the frames in the testing set, with duration as little as 0.1 second each, can be classified correctly. Table I also shows the minimum, mean, and maximum number of consecutive feature frames needed, and their corresponding durations, in order to achieve 100% accuracy, evaluated over all files in both the training and testing datasets.

TABLE I. NN-BASED SPEAKER CLASSIFICATION PERFORMANCE WITH THE FIRST 200 MALES IN 8K TIMIT (~0.1 SEC./FRAME, ~2.5 SEC./FILE)

Dataset   Accuracy (%)          Frames (sec.) needed for 100% accuracy
          frame      file       min           mean            max
train     93.29      100        2 (0.13)      3.23 (0.17)     5 (0.22)
test      71.42      100        6 (0.25)      13.55 (0.48)    37 (1.18)

Since each subsequent frame provides only 30 ms of additional information (the hop duration T_hop in feature concatenation) compared with the current frame, given the number of frames needed, N, the formula to compute the corresponding required duration T is

T = (N - 1) \times T_{hop} + 1 \times T_{win}.   (7)

With this formula, it requires only 13.55 frames (0.48 second) on average to achieve 100% accuracy on the testing dataset. Testing on the training data is normally not legitimate; here it is used merely to gauge how the accuracy drops when switching from training data to testing data.
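Eqs. (6) and (7) translate directly into code; a short sketch, where the T_win = 0.10 s and T_hop = 0.03 s values are the ones implied by Table I and Eq. (7):

import numpy as np

def predict_speaker(H):
    # H: (M, K) sigmoid outputs for M frames of one utterance.
    # Sum log-likelihoods over frames and take the argmax, as in Eq. (6).
    return int(np.argmax(np.sum(np.log(H + 1e-12), axis=0)))

def duration_needed(n_frames, t_hop=0.03, t_win=0.10):
    # Speech duration covered by n_frames concatenated frames, Eq. (7).
    return (n_frames - 1) * t_hop + t_win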
C. Model Parameter Optimization

The current neural network model with the structure 390 : 200 : 200 is the best one in terms of the highest frame-level testing accuracy, found by grid searching over a) the number of hidden layers and b) the number of nodes per hidden layer, using a subset containing only 10% of the randomly selected training and testing data.

Once the ideal network structure is identified, the model training is conducted with a regularization parameter \lambda in the cost function J(\Theta) that is iteratively reduced from 3 toward 0 through training. This dynamic regularization scheme is experimentally shown to avoid over-fitting and allow more iterations, reaching a refined model with better performance. The training is set to terminate once the testing frame accuracy fails to improve appreciably over the last 2 consecutive training iterations. The training set consists of 200 speakers with about 12.5 seconds of speech each (five ~2.5-second "SX" sentences). It is fed in as a whole batch of data, which requires about 1 hour to train on a computer with an i7-3770 CPU and 16 GB of memory; the computational cost is therefore certainly manageable.
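A sketch of this training schedule is shown below. The decay factor, the exact improvement tolerance, and the train_epoch/frame_accuracy helpers are assumptions standing in for the conjugate-gradient update and the frame-level evaluation; only the overall loop structure follows the text.

def train_with_dynamic_reg(model, train_data, test_data,
                           lam=3.0, decay=0.9, tol=1e-3):
    # lam starts at 3 and is dynamically reduced toward 0; training stops
    # once testing frame accuracy stalls for 2 consecutive iterations.
    prev_acc, stall = 0.0, 0
    while stall < 2:
        train_epoch(model, train_data, lam)   # hypothetical optimization pass
        acc = frame_accuracy(model, test_data)  # hypothetical evaluation
        stall = stall + 1 if acc - prev_acc < tol else 0
        prev_acc = acc
        lam *= decay                          # dynamically reduce lambda
    return model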
IV. NEURAL NETWORK SPEAKER VERIFICATION

This section first introduces the mechanism for converting speaker classification into speaker verification; it then describes the method of developing speaker-specific thresholds to shift the verification outputs; finally, it evaluates the system with metrics such as the Equal Error Rate (EER).
A. Verification Mechanism
In speaker verification, the assumption that any input speaker is one of the in-domain speakers no longer holds. Even when the testing speaker claims to be speaker k and the highest output score is indeed from the k-th output node, he might be an imposter who merely resembles speaker k more than the remaining K - 1 enrolled (in-domain) speakers. Thus Eq. (6) in Subsec. III-B no longer suffices, and a threshold is necessary to determine whether the testing speaker is similar enough to the target speaker to be verified as speaker k.

Let the mean prediction score for client speaker k over M feature frames, given features x_l of speaker l, be:

O(k, l) = \frac{1}{M} \sum_{m=1}^{M} \log(h_\theta(x_l^{(m)})_k),   (8)

where M is the number of frames in the testing feature. In this project, the client speakers are the first 200 male speakers in TIMIT (K = 200), and the imposters (out-of-domain speakers) are the remaining 126 speakers (L = 126). In positive verification, where l = k, the score O(k, k) should be high; in negative verification, where l \in [1, L], the score O(k, l) should be low. If

O(k, k) > O(k, l), \forall l \in [1, L],   (9)

then the k-th speaker can be correctly verified. In our experiment, O(k, k) and O(k, l) are normalized over the K output node dimension, giving:

O'(k, k) = \frac{O(k, k)}{\sum_{k'=1}^{K} O(k', k)},  O'(k, l) = \frac{O(k, l)}{\sum_{k'=1}^{K} O(k', l)}.   (10)

This normalization is found to achieve better verification accuracy by penalizing scores with strong competing speakers. Fig. 6 shows the accuracy vs. the number of testing files (up to 5, since there are 5 sentences from the "SI" and "SA" categories). For example, the mean accuracy is higher when speakers are tested with a combination of two files (C(5, 2) = 10 cases) than with individual files. Each sentence is about 2.5 seconds long, so this corresponds to the accuracy with testing durations of 2.5 seconds, 5 seconds, etc. For each of the 200 client speakers, the accuracy is binary: 1 if Eq. (9) is satisfied, and 0 otherwise.

Fig. 6. Verification accuracy (1 in-domain client speaker vs. 126 out-of-domain imposters) vs. number of testing files, averaged over all 200 in-domain speakers in TIMIT.
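A sketch of the scoring in Eqs. (8) and (10), assuming the frame-level sigmoid outputs of the classifier for a test utterance are available as a matrix:

import numpy as np

def verification_score(H, k):
    # H: (M, K) sigmoid outputs for the M frames of a test utterance
    # whose claimed identity is speaker k.
    O = np.mean(np.log(H + 1e-12), axis=0)   # Eq. (8) for all K nodes at once
    # Eq. (10): normalize over the K output node dimension. Note the
    # entries of O are log-likelihoods and hence negative.
    return O[k] / np.sum(O)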
B. Speaker Specific Thresholding
The accuracy measured above drops significantly as the imposter pool grows. In fact, it is merely an analysis demonstrating the challenge of maintaining high accuracy with a large imposter size, which is rare in real scenarios. Instead, speaker-specific thresholds are obtained by fitting Gaussian distributions to the positive samples (the testing speaker is the client speaker) and the negative samples (the testing speaker is one of the imposters), using Bayes' rule.

Fig. 7. Example of thresholding with 2 Gaussian distributions of positive and negative samples (intersection at x* = 0.546, p(error) = 0.262%). Sample values are collected with combinations of 2 files (10 cases, each ~5 seconds in duration), i.e. 10 positives vs. 1260 negatives.

Since the positive and negative classes are extremely skewed at the current imposter size of 126 (i.e. the positive:negative ratio is 1:126), the distribution of the positive samples has a very low prior and is almost invisible in Fig. 7. However, the estimated threshold, which is the intersection of the two weighted Gaussians, can still be found by solving Eq. (11) with a root-finding method, which reformulates Eq. (11) as a quadratic equation a x^2 + b x + c = 0 and then expresses x in terms of a, b, c:

\frac{p_1}{\sqrt{2\pi} \sigma_1} e^{-\frac{(x - u_1)^2}{2\sigma_1^2}} = \frac{1 - p_1}{\sqrt{2\pi} \sigma_2} e^{-\frac{(x - u_2)^2}{2\sigma_2^2}}.   (11)
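The intersection can be computed in closed form by equating the logs of the two weighted Gaussian densities, which yields the quadratic in x mentioned above; a sketch (the root-selection heuristic is an assumption):

import numpy as np

def gaussian_intersection(u1, s1, p1, u2, s2, p2):
    # Equate log of p1*N(u1, s1^2) and p2*N(u2, s2^2), collect terms into
    # a*x^2 + b*x + c = 0, and solve with the quadratic formula (Eq. (11)).
    a = 1.0 / s2**2 - 1.0 / s1**2
    b = 2.0 * (u1 / s1**2 - u2 / s2**2)
    c = (u2**2 / s2**2 - u1**2 / s1**2
         + 2.0 * np.log((p1 * s2) / (p2 * s1)))
    if abs(a) < 1e-12:                 # equal variances: the equation is linear
        return -c / b
    roots = np.roots([a, b, c])
    # Keep the root between the two means, where the densities cross.
    return min(roots, key=lambda r: abs(r - (u1 + u2) / 2))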
C. Performance with Optimized Thresholds

With the speaker-specific thresholds T_k, k \in [1, K], the normalized output prediction scores are shifted by

O'(k, l) \to O'(k, l) - T_k,  l \in \{k\} \cup [1, L].   (12)

Then, the ROC curve is computed to find the Equal Error Rate (EER), a common performance indicator for evaluating biometric systems. The EER equals the False Positive Rate (FPR) at the point where FPR + TPR = 1. Fig. 8 shows the ROC curve when verifying with 2 files (~5 seconds). By offsetting the outputs with speaker-specific thresholds, the EER is reduced to 0.059. The Area Under the Curve (AUC) is 0.9805, and the global threshold corresponding to this best EER is negative.
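A sketch of how the EER could be read off the ROC curve after the shift in Eq. (12), using scikit-learn's roc_curve. The trial bookkeeping (one score per claim, a binary genuine/imposter label, and higher shifted scores indicating genuine trials) is an assumed setup:

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels, thresholds_k, claims):
    # scores[i]: normalized score O'(k, l); claims[i] = k, the claimed
    # speaker; labels[i]: 1 for a genuine trial, 0 for an imposter trial.
    shifted = np.array([s - thresholds_k[k]       # Eq. (12) per-speaker shift
                        for s, k in zip(scores, claims)])
    fpr, tpr, _ = roc_curve(labels, shifted)
    idx = np.argmin(np.abs(fpr + tpr - 1))        # EER point: FPR + TPR = 1
    return fpr[idx]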
Fig. 8. ROC when verifying with 2 files (~5 seconds), with and without speaker-specific thresholds.

V. CONCLUSION AND FUTURE WORK
This work demonstrated a novel neural network framework for speaker classification and verification with enhanced features. The performance is tested using the TIMIT corpus with 8K sampling rate. For speaker classification, 200 speakers can be classified correctly with no more than 1.18 seconds of data; for speaker verification, the EER is 5.9% when verifying 200 in-domain speakers against 126 imposters, using speech about 5 seconds long (2 TIMIT files). Though the performance of speaker classification and verification systems is difficult to compare across studies, due to varying database conditions and enrollment and testing scenarios [10], a 100% classification rate using about 1 second of audio and less than 6% EER using 5 seconds of data in speaker verification is still among the most competitive performances reported in most cases [11].

This is achieved by combining all the essential components, including 1) feature engineering, such as VAD/silence removal, speaker-level MVN, and feature concatenation to capture transitional information; 2) neural network setup, model parameter optimization, and training with a dynamically reduced regularization parameter in speaker classification; and 3) output score normalization and speaker-specific thresholding in speaker verification.

There is still much room for potential improvement. First, the enrollment process is typically one-by-one, rather than enrolling a group of speakers as a whole, so recursive model training and updating need to be addressed. Second, more challenging and noisy databases should be considered, in order to deal with channel normalization and system robustness. Third, combining the current neural network approach with other state-of-the-art methods, such as GMM-UBM [5] and i-vector [7], [8], is also desired.

REFERENCES

[1] Alan L. Higgins, "Speaker verifier using nearest-neighbor distance measure," US Patent 5,339,385, Aug. 16, 1994.
[2] Frank K. Soong, Aaron E. Rosenberg, Biing-Hwang Juang, and Lawrence R. Rabiner, "Report: A vector quantization approach to speaker recognition," AT&T Technical Journal, vol. 66, no. 2, pp. 14-26, 1987.
[3] David Snyder, Daniel Garcia-Romero, and Daniel Povey, "Time delay deep neural network-based universal background models for speaker recognition," in Proc. IEEE ASRU, 2015, pp. 92-97.
[4] Kevin R. Farrell and Richard J. Mammone, "Speaker identification using neural tree networks," in Proc. IEEE ICASSP, 1994, vol. 1, pp. I-165.
[5] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1, pp. 19-41, 2000.
[6] Patrick Kenny, "Joint factor analysis of speaker and session variability: Theory and algorithms," CRIM, Montreal, (Report) CRIM-06/08-13, 2005.
[7] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[8] Najim Dehak, Pedro A. Torres-Carrasquillo, Douglas A. Reynolds, and Réda Dehak, "Language recognition via i-vectors and dimensionality reduction," in INTERSPEECH, 2011, pp. 857-860.
[9] William M. Campbell, Douglas E. Sturim, and Douglas A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, 2006.
[10] Douglas Reynolds, "An overview of automatic speaker recognition," in Proc. IEEE ICASSP, 2002, pp. 4072-4075.
[11] Benoît G. B. Fauve, Driss Matrouf, Nicolas Scheffer, Jean-François Bonastre, and John S. D. Mason, "State-of-the-art performance in text-independent speaker verification through open-source software," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 1960-1968, 2007.
[12] Theodoros Giannakopoulos, "A method for silence removal and segmentation of speech signals, implemented in Matlab," University of Athens, Athens.
[13] Daniel P. W. Ellis, "PLP and RASTA (and MFCC, and inversion) in Matlab," online web resource, 2005.
[14] Andrew Ng, "Machine learning," online course materials, Stanford University.
[15] Sargur Srihari, CSE574 lecture notes, ~srihari/CSE574/, Accessed: 2016-07-21.
[16] Carl Edward Rasmussen, "Gaussian processes for machine learning," 2006.