Neural Network Based Speaker Classification and Verification Systems with Enhanced Features
Zhenhao Ge, Ananth N. Iyer, Srinath Cheluvaraja, Ram Sundaram, Aravind Ganapathiraju
Intelligent Systems Conference 2017, 7-8 September 2017 | London, UK
Interactive Intelligence Inc., Indianapolis, Indiana, USA
Email: {roger.ge, ananth.iyer, srinath.cheluvaraja, ram.sundaram, aravind.ganapathiraju}@inin.com

Abstract—This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves a 100% classification rate in classification and less than 6% Equal Error Rate (EER) in verification, using merely about 1 second and 5 seconds of data respectively. Features extracted with a stricter Voice Activity Detection (VAD) than is typical for speech recognition ensure that a stronger voiced portion is captured for speaker recognition, and speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker; both are proven to improve system performance. In building the neural network speaker classifier, the network structure parameters are optimized with grid search, and dynamically reduced regularization parameters are used to avoid training being trapped in a local minimum, enabling training to go further with lower cost. In speaker verification, performance is improved with prediction score normalization, which rewards the speaker identity indices with distinct peaks and penalizes the weak ones with high scores but more competitors, and with speaker-specific thresholding, which significantly reduces the EER in the ROC curve. The TIMIT corpus with 8K sampling rate is used here. The first 200 male speakers are used to train and test the classification performance; their testing files are used as in-domain registered speakers, while data from the remaining 126 male speakers are used as out-of-domain speakers, i.e. imposters, in speaker verification.
Keywords—Neural Network, Speaker Classification, Speaker Verification, Feature Engineering
I. INTRODUCTION
Speaker recognition has been a popular and broad topic in speech research for decades. It includes speaker detection, i.e. detecting whether there is a speaker in the audio; speaker identification, i.e. identifying whose voice it is; and speaker verification or authentication, i.e. verifying someone's voice. If the speaker set is closed, i.e. the audio must come from one of the enrolled speakers, then speaker identification simplifies to speaker classification. There are other building blocks, such as speaker segmentation, clustering, and diarization, which can be further developed on top of the fundamental speaker recognition techniques. Fig. 1 provides diagrams for speaker identification and verification. The main approaches in this area include 1) template matching, such as nearest neighbor [1] and vector quantization [2]; 2) neural networks, such as the time delay neural network [3] and decision tree [4]; and 3) probabilistic models, such as the Gaussian Mixture Model (GMM) with Universal Background Model (UBM) [5], joint factor analysis [6], i-vector [7], [8], and the Support Vector Machine (SVM) [9]. Methods can also be divided into text-dependent and text-independent, where the former achieves better performance with additional information and the latter is more user-friendly and easier to use. Reynolds [10] and Fauve [11] provided good overviews of common speaker recognition applications with state-of-the-art performance.
Fig. 1. Major components for speaker identification and speaker verification.
This paper proposes a neural network framework for text-independent speaker classification and verification, using the TIMIT 8K database. With optimization in features and model training, the system achieves 100% classification accuracy with slightly more than 1 second of speech, and less than 6% EER in speaker verification against an imposter pool of more than 100 speakers, using approximately 5 seconds of data. The following sections walk through the major pieces of this work, including feature engineering (Sec. II) and the design, implementation, and results of the speaker classification and verification systems (Sec. III and Sec. IV). Finally, the conclusion and future work are given in Sec. V.

II. DATA PREPARATION AND FEATURE ENGINEERING
The following 3 subsections introduce the database used in this paper and the process of converting raw speech into the features used in speaker classification and verification, including a) preprocessing, and b) feature extraction, normalization, and concatenation.
A. Database
Speech from all 326 male speakers across the 8 dialect regions in the "train" folder of the TIMIT corpus, with 8K sampling rate, is used here. Data of males from the "test" folder and data of females from both the "train" and "test" folders are currently reserved for future development. For each speaker, there are 10 data files, each containing one sentence with a duration of about 2.5 seconds. They come from 3 categories: "SX" (5 sentences), "SI" (3 sentences), and "SA" (2 sentences). Data are first sorted alphabetically by speaker name within their dialect region folders, then combined to form a list of 326 speakers. They are then divided into 2 groups: the first 200 speakers (group A) and the remaining 126 speakers (group B). For speaker classification, the "SX" sentences in group A are used to train the text-independent Neural Network Speaker Classifier (NNSC), while the "SA" and "SI" sentences in group A are used to test it. For speaker verification, since it is based on the NNSC, only "SA" and "SI" sentences are used, to avoid overlap with any data used in model training. Speakers in group A serve as in-domain speakers, and speakers in group B serve as out-of-domain speakers (imposters).
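As a concrete illustration, the grouping described above can be sketched as follows. The directory layout, file-name tests, and helper names are assumptions for illustration, not the authors' code.

import os

def split_speakers(train_dir):
    # Collect male speakers (TIMIT male speaker folders start with 'M')
    # from all dialect-region folders, sorted alphabetically by name.
    speakers = sorted(
        spk for dr in sorted(os.listdir(train_dir))
        for spk in os.listdir(os.path.join(train_dir, dr))
        if spk.startswith('M'))
    # First 200 speakers are in-domain clients (group A), the remaining
    # 126 are out-of-domain imposters (group B).
    return speakers[:200], speakers[200:]

def classification_files(speaker_dir):
    # "SX" sentences train the classifier; "SA" and "SI" sentences test it.
    files = os.listdir(speaker_dir)
    train = [f for f in files if f.startswith('SX')]
    test = [f for f in files if f[:2] in ('SA', 'SI')]
    return train, test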
B. Preprocessing
Preprocessing mainly consists of a) scaling the maximum absolute amplitude to 1, and b) Voice Activity Detection (VAD) to eliminate the unvoiced portions of speech. Experiments show that both speaker classification and verification perform significantly better when speakers are evaluated using only voiced speech, especially when the data is noisy.

An improved version of Giannakopoulos's recipe [12] based on short-term energy and spectral centroid is developed for VAD. Given a short-term signal s(n) with N samples, the energy is:

E = \frac{1}{N} \sum_{n=1}^{N} |s(n)|^2,   (1)

and given the corresponding Discrete Fourier Transform (DFT) S(k) of s(n) with K frequency components, the spectral centroid can be formulated as:

C = \frac{\sum_{k=1}^{K} k S(k)}{\sum_{k=1}^{K} S(k)}.   (2)

The Short-Term Energy (STE) E is used to discriminate silence and environmental noise from speech, and the Spectral Centroid (SC) C can be used to remove non-environmental noise, i.e. non-speech sounds such as coughing, mouse clicking, and keyboard tapping, since these normally have different SCs compared to human speech. Fixed window and hop sizes are used when computing the frame-level E and C. A speech frame is considered voiced only when E and C are both above their respective thresholds T_E and T_C; otherwise it is removed. These thresholds are adjusted to be slightly higher than usual to enforce a stricter VAD algorithm and ensure the quality of the captured voiced sections. This is achieved by tuning the signal median smoothing parameters, such as step size and smoothing order, and by setting the thresholds T_E and T_C as weighted averages of the local maxima in the distribution histograms of the short-term energy and spectral centroid respectively. Fig. 2 shows an example of applying different median filter smoothing step sizes to the STE and SC. A larger step size (e.g. 7) and order (e.g. 2) are used in order to achieve a stricter VAD.

Fig. 2. Short-term energy and spectral centroid with different median filter smoothing steps and orders: (a) smoothing step size 4; (b) smoothing step size 7.
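A minimal sketch of this VAD procedure follows. The 50 ms window, 25 ms hop, and equal weights in the histogram-peak average are illustrative assumptions, not the exact settings used in the paper.

import numpy as np
from scipy.signal import medfilt

def vad_mask(signal, fs, win=0.050, hop=0.025, smooth=7, order=2):
    n_win, n_hop = int(win * fs), int(hop * fs)
    starts = range(0, len(signal) - n_win, n_hop)
    frames = np.array([signal[s:s + n_win] for s in starts])
    energy = np.mean(frames ** 2, axis=1)                           # Eq. (1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    k = np.arange(1, spec.shape[1] + 1)
    centroid = (spec * k).sum(axis=1) / (spec.sum(axis=1) + 1e-12)  # Eq. (2)
    # Median-smooth both contours 'order' times with the given step size.
    for _ in range(order):
        energy = medfilt(energy, smooth)
        centroid = medfilt(centroid, smooth)
    def threshold(x):
        # Weighted average of the two largest histogram-peak locations;
        # the 0.5/0.5 weighting is an assumption.
        hist, edges = np.histogram(x, bins=50)
        peaks = edges[np.argsort(hist)[-2:]]
        return 0.5 * peaks[0] + 0.5 * peaks[1]
    # A frame is voiced only if both measures exceed their thresholds.
    return (energy > threshold(energy)) & (centroid > threshold(centroid))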
C. Feature Extraction, Normalization and Concatenation

The 39-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) with delta and double delta were generated from the preprocessed speech, following Ellis's recipe [13]. They were extracted using overlapped Hamming windows hopping every 10 ms. Then, the features of each speaker were normalized with his own mean and variance (speaker-level MVN, or SMVN), instead of with the overall mean and variance (global-level MVN, or GMVN). Fig. 3 shows that SMVN, though it converges more slowly, achieves better frame-level training and validation accuracies in network training. This is slightly counter-intuitive, since SMVN overlaps speaker patterns on top of each other. However, as training proceeds, it matches instances of patterns from the same speaker better than GMVN does.
Fig. 3. Comparison of global-level MVN vs. speaker-level MVN in NN training in terms of training and validation frame accuracies.
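The feature pipeline just described might look like the following sketch, using librosa's MFCC and delta utilities. The 25 ms analysis window and the exact call parameters are assumptions; only the 39-dimensional layout, the 10 ms hop, and the speaker-level normalization follow the text.

import numpy as np
import librosa

def speaker_features(wav_paths, fs=8000):
    # Extract 39-dim MFCC(+delta,+double-delta) frames for one speaker.
    feats = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=fs)
        mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13,
                                    n_fft=int(0.025 * fs),
                                    hop_length=int(0.010 * fs),
                                    window='hamming')
        full = np.vstack([mfcc,
                          librosa.feature.delta(mfcc),
                          librosa.feature.delta(mfcc, order=2)])  # 39 x T
        feats.append(full.T)                                      # T x 39
    # Speaker-level MVN (SMVN): normalize with this speaker's own
    # statistics, pooled over all of his files.
    pooled = np.concatenate(feats)
    mu, sigma = pooled.mean(axis=0), pooled.std(axis=0) + 1e-12
    return [(f - mu) / sigma for f in feats]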
To capture the transition patterns over longer durations, these 39-dimensional feature frames were concatenated to form overlapped longer frames. In this work, 10 frames (100 ms) were concatenated with a hop size of 3 frames (30 ms), as shown in Fig. 4.
Fig. 4. Feature concatenation example with a window size of 10 frames and a hop size of 3 frames.
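The concatenation in Fig. 4 amounts to a sliding stack over the frame sequence; a minimal numpy sketch:

import numpy as np

def concat_frames(frames, win=10, hop=3):
    # frames: (T, 39) array -> (num_windows, 390) array, stacking 'win'
    # consecutive frames and advancing 'hop' frames between stacks.
    return np.array([frames[i:i + win].reshape(-1)
                     for i in range(0, len(frames) - win + 1, hop)])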
III. NEURAL NETWORK SPEAKER CLASSIFICATION
The concatenated features (i.e. 390-dimensional feature vectors) are used as the input to a neural network speaker classifier. As mentioned in the first paragraph of Sec. II, the "SX" sentences of the first 200 male speakers were used for training, and the "SA" and "SI" sentences from the same set of speakers were used for testing.
A. Cost Function and Model Structures
Ng's neural network training recipe for hand-written digit classification [14] is used here, which treats the multi-class problem as K separate binary classifications. It can be considered a generalization of the cost function of binary classification using logistic regression, and it is built on slightly different concepts compared with the cross-entropy cost function with a softmax output layer [15]. Given M samples, K output classes, and L layers, including the input, output, and all hidden layers in between, the cost function can be formulated as:

J(\Theta) = -\frac{1}{M} \left[ \sum_{m=1}^{M} \sum_{k=1}^{K} \left( y_k^{(m)} \log(h_\theta(x^{(m)})_k) + (1 - y_k^{(m)}) \log(1 - h_\theta(x^{(m)})_k) \right) \right] + \frac{\lambda}{2M} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2,   (3)

where h_\theta(x^{(m)})_k is the k-th output of the final layer given the m-th input sample x^{(m)}, and y_k^{(m)} is its corresponding target label. The 2nd half of Eq. (3) is the regularization term used to prevent over-fitting, where \lambda is the regularization parameter and \theta_{ji}^{(l)} is the j-th row, i-th column element of the weight matrix \Theta^{(l)} between the l-th and (l+1)-th layers, i.e. the weight from the i-th node in the l-th layer to the j-th node in the (l+1)-th layer.

In this work, there is only 1 hidden layer (L = 3) with 200 nodes (s_2 = 200), the input feature dimension is 390 (s_1 = 390), and the speaker classifier was trained with data from 200 speakers (s_3 = K = 200). Therefore, the network structure is 390 : 200 : 200, with weight matrices \Theta^{(1)} (200 × 391) and \Theta^{(2)} (200 × 201). The additional column in each is a bias vector, which is left out of regularization, since the change of bias is unrelated to over-fitting. In this example, the regularization term in Eq. (3) can be instantiated as

\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2 = \sum_{i=1}^{390} \sum_{j=1}^{200} (\theta_{j,i}^{(1)})^2 + \sum_{i=1}^{200} \sum_{j=1}^{200} (\theta_{j,i}^{(2)})^2.   (4)
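For illustration, Eqs. (3) and (4) can be written out directly. This is a sketch of the cost computation only (the conjugate-gradient training described below is omitted), with variable names chosen here for clarity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(Theta1, Theta2, X, Y, lam):
    # X: (M, 390) inputs; Y: (M, 200) one-hot speaker labels.
    # Theta1: (200, 391), Theta2: (200, 201); first column is the bias.
    M = X.shape[0]
    A1 = np.hstack([np.ones((M, 1)), X])                   # add bias input
    A2 = np.hstack([np.ones((M, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)                             # (M, 200) outputs
    # K independent binary cross-entropy terms, as in Eq. (3).
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / M
    # Regularization over all non-bias weights, as in Eq. (4).
    reg = (lam / (2 * M)) * (np.sum(Theta1[:, 1:] ** 2)
                             + np.sum(Theta2[:, 1:] ** 2))
    return J + reg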
B. Model Training and Performance Evaluation

The neural network model is trained through forward-backward propagation. Denoting z^{(l)} and a^{(l)} as the input and output of the l-th layer, the sigmoid function

a^{(l)} = g(z^{(l)}) = \frac{1}{1 + e^{-z^{(l)}}}   (5)

is selected as the activation function, and the input z^{(l+1)} of the (l+1)-th layer is transformed from the output a^{(l)} of the l-th layer using z^{(l+1)} = \Theta^{(l)} a^{(l)}. Then, h_\theta(x) can be computed through forward propagation: x = a^{(1)} \to z^{(2)} \to a^{(2)} \to \cdots \to z^{(L)} \to a^{(L)} = h_\theta(x). The weight matrices \Theta^{(l)} are randomly initialized from a continuous uniform distribution over a small symmetric interval around zero and then trained through backward propagation of \partial J / \partial \theta_{j,i}^{(l)}, minimizing J(\Theta) with Rasmussen's conjugate gradient algorithm, which handles the step size (learning rate) automatically with a slope ratio method [16].

In evaluating the classifier performance, the sigmoid output of the final layer, h_\theta(x^{(m)}), is a K-dimensional vector, with each element in the range (0, 1). It serves as the "likelihood" indicating how likely the m-th input frame is to belong to each of the K speakers. The speaker classification can be predicted by the sum of log-likelihoods over M input frames (the prediction score), and the predicted speaker ID k* is the index of its maximum:

k^* = \arg\max_{k \in [1, K]} \left( \sum_{m=1}^{M} \log(h_\theta(x^{(m)})_k) \right).   (6)

M can range from 1 to the entire frame length of the testing file. If M = 1, the accuracy achieved is based on individual frames, each of which covers 100 ms (the window duration T_win in feature concatenation) with 30 ms of new data compared with the previous frame. On the other hand, if M equals the total number of frames in the file, the accuracy is file-based. The average duration of the sentences (i.e. the file length) is about 2.5 seconds. In general, larger M leads to higher accuracy. Given the best available model with the network structure 390 : 200 : 200, Fig. 5 demonstrates an example of the file-level prediction scores for speaker MPGR0. It shows that the peak at the positive speaker index (in the green circle) drops slightly from the file SI1410 in the training set to the file SA1 in the testing set, but remains clearly distinguishable from all the negatives.
Fig. 5. File-level prediction scores of speaker MPGR0: (a) SI1410 in the training set; (b) SA1 in the testing set.

Using this model, the file-level training and testing accuracies with 200 speakers are both 100%, as indicated in Table I. The frame-level testing accuracy is 71.42%, which indicates that 71.42% of the frames in the testing set, with duration as little as 0.1 second each, can be classified correctly. Table I also shows the minimum, mean, and maximum number of consecutive feature frames needed, and their corresponding durations, in order to achieve 100% accuracy, evaluated over all files in both the training and testing datasets.

TABLE I. NN-BASED SPEAKER CLASSIFICATION PERFORMANCE WITH THE FIRST 200 MALES IN 8K TIMIT (~0.1 SEC./FRAME, ~2.5 SEC./FILE)

Dataset   Accuracy (%)          Frames (sec.) needed for 100% accuracy
          frame      file       min           mean            max
train     93.29      100        2 (0.13)      3.23 (0.17)     5 (0.22)
test      71.42      100        6 (0.25)      13.55 (0.48)    37 (1.18)

Since each subsequent frame provides only 30 ms of additional information (the hop duration T_hop in feature concatenation) compared with the current frame, given the number of frames needed, N, the formula to compute the corresponding required duration T is

T = (N - 1) \times T_{hop} + 1 \times T_{win}.   (7)

With this formula, it requires only 13.55 frames (0.48 second) on average to achieve 100% accuracy on the testing dataset. Testing on the training data is normally not legitimate; here it is used merely to gauge how the accuracy drops when switching from training data to testing data.
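Eqs. (6) and (7) translate directly into code; a short sketch, where the T_win = 0.10 s and T_hop = 0.03 s values are the ones implied by Table I and Eq. (7):

import numpy as np

def predict_speaker(H):
    # H: (M, K) sigmoid outputs for M frames of one utterance.
    # Sum log-likelihoods over frames and take the argmax, as in Eq. (6).
    return int(np.argmax(np.sum(np.log(H + 1e-12), axis=0)))

def duration_needed(n_frames, t_hop=0.03, t_win=0.10):
    # Speech duration covered by n_frames concatenated frames, Eq. (7).
    return (n_frames - 1) * t_hop + t_win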
C. Model Parameter Optimization

The current neural network model with the structure 390 : 200 : 200 is the best one in terms of the highest frame-level testing accuracy, found by grid searching over a) the number of hidden layers and b) the number of nodes per hidden layer, using a subset containing only 10% of the randomly selected training and testing data.

Once the ideal network structure is identified, the model training is conducted with a regularization parameter \lambda in the cost function J(\Theta) that is iteratively reduced from 3 toward 0 through training. This dynamic regularization scheme is experimentally shown to avoid over-fitting and allow more iterations, reaching a refined model with better performance. The training is set to terminate once the testing frame accuracy fails to improve appreciably over the last 2 consecutive training iterations. The training set consists of 200 speakers with about 12.5 seconds of speech each (five ~2.5-second "SX" sentences). It is fed in as a whole batch of data, which requires about 1 hour to train on a computer with an i7-3770 CPU and 16 GB of memory; the computational cost is therefore certainly manageable.
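A sketch of this training schedule is shown below. The decay factor, the exact improvement tolerance, and the train_epoch/frame_accuracy helpers are assumptions standing in for the conjugate-gradient update and the frame-level evaluation; only the overall loop structure follows the text.

def train_with_dynamic_reg(model, train_data, test_data,
                           lam=3.0, decay=0.9, tol=1e-3):
    # lam starts at 3 and is dynamically reduced toward 0; training stops
    # once testing frame accuracy stalls for 2 consecutive iterations.
    prev_acc, stall = 0.0, 0
    while stall < 2:
        train_epoch(model, train_data, lam)   # hypothetical optimization pass
        acc = frame_accuracy(model, test_data)  # hypothetical evaluation
        stall = stall + 1 if acc - prev_acc < tol else 0
        prev_acc = acc
        lam *= decay                          # dynamically reduce lambda
    return model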
IV. NEURAL NETWORK SPEAKER VERIFICATION

This section first introduces the mechanism for converting speaker classification into speaker verification; it then describes the method of developing speaker-specific thresholds to shift the verification outputs; finally, it evaluates the system with metrics such as the Equal Error Rate (EER).
A. Verification Mechanism
In speaker verification, the assumption that any input speaker is one of the in-domain speakers no longer holds. Even when the testing speaker claims to be speaker k and the highest output score is indeed from the k-th output node, he might be an imposter who merely resembles speaker k more than the remaining K - 1 enrolled (in-domain) speakers. Thus Eq. (6) in Subsec. III-B no longer suffices, and a threshold is necessary to determine whether the testing speaker is similar enough to the target speaker to be verified as speaker k.

Let the mean prediction score for client speaker k over M feature frames, given features x_l of speaker l, be:

O(k, l) = \frac{1}{M} \sum_{m=1}^{M} \log(h_\theta(x_l^{(m)})_k),   (8)

where M is the number of frames in the testing feature. In this project, the client speakers are the first 200 male speakers in TIMIT (K = 200), and the imposters (out-of-domain speakers) are the remaining 126 speakers (L = 126). In positive verification, where l = k, the score O(k, k) should be high; in negative verification, where l \in [1, L], the score O(k, l) should be low. If

O(k, k) > O(k, l), \forall l \in [1, L],   (9)

then the k-th speaker can be correctly verified. In our experiment, O(k, k) and O(k, l) are normalized over the K output node dimension, giving:

O'(k, k) = \frac{O(k, k)}{\sum_{k'=1}^{K} O(k', k)},  O'(k, l) = \frac{O(k, l)}{\sum_{k'=1}^{K} O(k', l)}.   (10)

This normalization is found to achieve better verification accuracy by penalizing scores with strong competing speakers. Fig. 6 shows the accuracy vs. the number of testing files (up to 5, since there are 5 sentences from the "SI" and "SA" categories). For example, the mean accuracy is higher when speakers are tested with a combination of two files (C(5, 2) = 10 cases) than with individual files. Each sentence is about 2.5 seconds long, so this corresponds to the accuracy with testing durations of 2.5 seconds, 5 seconds, etc. For each of the 200 client speakers, the accuracy is binary: 1 if Eq. (9) is satisfied, and 0 otherwise.

Fig. 6. Verification accuracy (1 in-domain client speaker vs. 126 out-of-domain imposters) vs. number of testing files, averaged over all 200 in-domain speakers in TIMIT.
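A sketch of the scoring in Eqs. (8) and (10), assuming the frame-level sigmoid outputs of the classifier for a test utterance are available as a matrix:

import numpy as np

def verification_score(H, k):
    # H: (M, K) sigmoid outputs for the M frames of a test utterance
    # whose claimed identity is speaker k.
    O = np.mean(np.log(H + 1e-12), axis=0)   # Eq. (8) for all K nodes at once
    # Eq. (10): normalize over the K output node dimension. Note the
    # entries of O are log-likelihoods and hence negative.
    return O[k] / np.sum(O)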
B. Speaker Specific Thresholding
The accuracy measured above drops significantly as the imposter pool grows. In fact, it is merely an analysis demonstrating the challenge of maintaining high accuracy with a large imposter size, which is rare in real scenarios. Instead, speaker-specific thresholds are obtained by fitting Gaussian distributions to the positive samples (the testing speaker is the client speaker) and the negative samples (the testing speaker is one of the imposters), using Bayes' rule.

Fig. 7. Example of thresholding with 2 Gaussian distributions of positive and negative samples (intersection at x* = 0.546, p(error) = 0.262%). Sample values are collected with combinations of 2 files (10 cases, each ~5 seconds in duration), i.e. 10 positives vs. 1260 negatives.

Since the positive and negative classes are extremely skewed at the current imposter size of 126 (i.e. the positive:negative ratio is 1:126), the distribution of the positive samples has a very low prior and is almost invisible in Fig. 7. However, the estimated threshold, which is the intersection of the two weighted Gaussians, can still be found by solving Eq. (11) with a root-finding method, which reformulates Eq. (11) as a quadratic equation a x^2 + b x + c = 0 and then expresses x in terms of a, b, c:

\frac{p_1}{\sqrt{2\pi} \sigma_1} e^{-\frac{(x - u_1)^2}{2\sigma_1^2}} = \frac{1 - p_1}{\sqrt{2\pi} \sigma_2} e^{-\frac{(x - u_2)^2}{2\sigma_2^2}}.   (11)
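The intersection can be computed in closed form by equating the logs of the two weighted Gaussian densities, which yields the quadratic in x mentioned above; a sketch (the root-selection heuristic is an assumption):

import numpy as np

def gaussian_intersection(u1, s1, p1, u2, s2, p2):
    # Equate log of p1*N(u1, s1^2) and p2*N(u2, s2^2), collect terms into
    # a*x^2 + b*x + c = 0, and solve with the quadratic formula (Eq. (11)).
    a = 1.0 / s2**2 - 1.0 / s1**2
    b = 2.0 * (u1 / s1**2 - u2 / s2**2)
    c = (u2**2 / s2**2 - u1**2 / s1**2
         + 2.0 * np.log((p1 * s2) / (p2 * s1)))
    if abs(a) < 1e-12:                 # equal variances: the equation is linear
        return -c / b
    roots = np.roots([a, b, c])
    # Keep the root between the two means, where the densities cross.
    return min(roots, key=lambda r: abs(r - (u1 + u2) / 2))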
C. Performance with Optimized Thresholds

With the speaker-specific thresholds T_k, k \in [1, K], the normalized output prediction scores are shifted by

O'(k, l) \to O'(k, l) - T_k,  l \in \{k\} \cup [1, L].   (12)

Then, the ROC curve is computed to find the Equal Error Rate (EER), a common performance indicator for evaluating biometric systems. The EER equals the False Positive Rate (FPR) at the point where FPR + TPR = 1. Fig. 8 shows the ROC curve when verifying with 2 files (~5 seconds). By offsetting the outputs with speaker-specific thresholds, the EER is reduced to 0.059. The Area Under the Curve (AUC) is 0.9805, and the global threshold corresponding to this best EER is negative.
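A sketch of how the EER could be read off the ROC curve after the shift in Eq. (12), using scikit-learn's roc_curve. The trial bookkeeping (one score per claim, a binary genuine/imposter label, and higher shifted scores indicating genuine trials) is an assumed setup:

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels, thresholds_k, claims):
    # scores[i]: normalized score O'(k, l); claims[i] = k, the claimed
    # speaker; labels[i]: 1 for a genuine trial, 0 for an imposter trial.
    shifted = np.array([s - thresholds_k[k]       # Eq. (12) per-speaker shift
                        for s, k in zip(scores, claims)])
    fpr, tpr, _ = roc_curve(labels, shifted)
    idx = np.argmin(np.abs(fpr + tpr - 1))        # EER point: FPR + TPR = 1
    return fpr[idx]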
Fig. 8. ROC when verifying with 2 files (~5 seconds), with and without speaker-specific thresholds.

V. CONCLUSION AND FUTURE WORK
This work demonstrated a novel neural network framework for speaker classification and verification with enhanced features. The performance is tested using the TIMIT corpus with 8K sampling rate. For speaker classification, 200 speakers can be classified correctly with no more than 1.18 seconds of data; for speaker verification, the EER is 5.9% when verifying 200 in-domain speakers against 126 imposters, using speech about 5 seconds long (2 TIMIT files). Though the performance of speaker classification and verification systems is difficult to compare across studies, due to varying database conditions and enrollment and testing scenarios [10], a 100% classification rate using about 1 second of audio and less than 6% EER using 5 seconds of data in speaker verification is still among the most competitive performances reported in most cases [11].

This is achieved by combining all the essential components, including 1) feature engineering, such as VAD/silence removal, speaker-level MVN, and feature concatenation to capture transitional information; 2) neural network setup, model parameter optimization, and training with a dynamically reduced regularization parameter in speaker classification; and 3) output score normalization and speaker-specific thresholding in speaker verification.

There is still much room for potential improvement. First, the enrollment process is typically one-by-one, rather than enrolling a group of speakers as a whole, so recursive model training and updating need to be addressed. Second, more challenging and noisy databases should be considered, in order to deal with channel normalization and system robustness. Third, combining the current neural network approach with other state-of-the-art methods, such as GMM-UBM [5] and i-vector [7], [8], is also desired.

REFERENCES

[1] Alan L. Higgins, "Speaker verifier using nearest-neighbor distance measure," US Patent 5,339,385, Aug. 16, 1994.
[2] Frank K. Soong, Aaron E. Rosenberg, Biing-Hwang Juang, and Lawrence R. Rabiner, "Report: A vector quantization approach to speaker recognition," AT&T Technical Journal, vol. 66, no. 2, pp. 14-26, 1987.
[3] David Snyder, Daniel Garcia-Romero, and Daniel Povey, "Time delay deep neural network-based universal background models for speaker recognition," in Proc. IEEE ASRU, 2015, pp. 92-97.
[4] Kevin R. Farrell and Richard J. Mammone, "Speaker identification using neural tree networks," in Proc. IEEE ICASSP, 1994, vol. 1, pp. I-165.
[5] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1, pp. 19-41, 2000.
[6] Patrick Kenny, "Joint factor analysis of speaker and session variability: Theory and algorithms," CRIM, Montreal, (Report) CRIM-06/08-13, 2005.
[7] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[8] Najim Dehak, Pedro A. Torres-Carrasquillo, Douglas A. Reynolds, and Réda Dehak, "Language recognition via i-vectors and dimensionality reduction," in INTERSPEECH, 2011, pp. 857-860.
[9] William M. Campbell, Douglas E. Sturim, and Douglas A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, 2006.
[10] Douglas Reynolds, "An overview of automatic speaker recognition," in Proc. IEEE ICASSP, 2002, pp. 4072-4075.
[11] Benoît G. B. Fauve, Driss Matrouf, Nicolas Scheffer, Jean-François Bonastre, and John S. D. Mason, "State-of-the-art performance in text-independent speaker verification through open-source software," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 1960-1968, 2007.
[12] Theodoros Giannakopoulos, "A method for silence removal and segmentation of speech signals, implemented in Matlab," University of Athens, Athens.
[13] Daniel P. W. Ellis, "PLP and RASTA (and MFCC, and inversion) in Matlab," online web resource, 2005.
[14] Andrew Ng, "Machine learning," online course materials, Stanford University.
[15] Sargur Srihari, CSE574 lecture notes, ~srihari/CSE574/, Accessed: 2016-07-21.
[16] Carl Edward Rasmussen, "Gaussian processes for machine learning," 2006.