Generalized LSTM-based End-to-End Text-Independent Speaker Verification
Soroosh Tayebi Arasteh †‡
† Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
‡ Department of Pediatrics, Harvard Medical School, Boston, MA, USA
[email protected]
November 20, 2020
Abstract—The increasing amount of available data and more affordable hardware solutions have opened a gate to the realm of Deep Learning (DL). Due to the rapid advancements and ever-growing popularity of DL, it has begun to invade almost every field where machine learning is applicable, by altering the traditional state-of-the-art methods. While many researchers in the speaker recognition area have also started to replace the former state-of-the-art methods with DL techniques, some of the traditional i-vector-based methods are still state-of-the-art in the context of text-independent speaker verification (TI-SV). In this paper, we discuss the most recent generalized end-to-end (GE2E) DL technique based on Long Short-Term Memory (LSTM) units for TI-SV by Google, and compare different scenarios and aspects including utterance duration, training time, and accuracy to show that our method outperforms the traditional methods.
Index Terms—Deep learning, speaker verification, generalized end-to-end loss, text-independent.
I. INTRODUCTION
A. Background
Speaker recognition (SR) is the task of recognizing a speaker's identity based on their voice. It is a very active research area with notable applications in various fields such as biometric authentication, forensics, security, speech recognition, and speaker diarization, which has contributed to steady interest in this discipline [1]. Moreover, SR has become a popular technology for remote authentication, especially with the advancement of telecommunications and networking [2]. Human speech is one of the most complex natural signals and contains a lot of information, which makes it unique for every person and enables us to build SR systems based on those properties.

Speaker verification (SV) and speaker identification (SI) are two important subtasks of SR. Speaker verification is the task of authenticating a person's claimed identity as genuine or impostor. Speaker identification, on the other hand, is the task of identifying an unknown person's identity from a pool of known speakers. Combining SV and SI, SR in the general case is the process of identifying an unknown speaker's identity by first verifying and then identifying.

∗ This is the technical report of the M.Sc. CME Research Internship work, conducted by Soroosh Tayebi Arasteh at the Pattern Recognition Lab at FAU. Supervisors: M.Sc. Philipp Klumpp † and Prof. Dr. Andreas Maier †.

The speaker verification process can generally be divided into three steps: training, enrollment, and evaluation [3]. In the training stage, speaker-specific features are extracted from the available signals to create a background model for the speaker representation. In the enrollment phase, speaker utterances are fed to the background model, which is the trained network in the case of DL techniques, in order to create the speaker models. Finally, in the evaluation step, test speaker models are created by feeding the test utterances to the background model. They are compared to the already registered speaker models in order to check the similarity between them.

Depending on the restrictions on the utterances used for enrollment and verification, speaker verification models usually fall into one of two categories: text-dependent speaker verification (TD-SV) and text-independent speaker verification (TI-SV) [4]. In TD-SV, the same text is used for the enrollment and evaluation phases, while in TI-SV, there are no constraints on the enrollment or verification utterances, exposing a larger variability of phonemes and utterance durations [5], [6]. Combined with a keyword spotting system (KWS), text-dependent SV can be integrated into an intelligent personal assistant such as Apple Siri, Amazon Alexa, Google Now, and Microsoft Cortana, where KWS and text-dependent SV serve as a keyword voice-authenticated wake-up to enable the following voice interaction [7]–[9].
B. Text-independent speaker verification
In this study, we focus on text-independent speaker verification. Before the deep neural network era, the state-of-the-art speaker recognition method was the i-vector approach [10]–[12]. Nowadays, DL methods are outperforming the former state-of-the-art methods in various fields of speaker recognition. However, in the context of text-independent speaker verification, the i-vector framework and its variants are still the state-of-the-art in some of the tasks [13], [14]. In NIST SRE12 and SRE16 and their post-evaluations, almost all leading systems were based on i-vectors [15]–[17]. However, i-vector systems are prone to performance degradation when short utterances are met in the enrollment/evaluation phase [15].

(The Speaker Recognition Evaluation (SRE) is an ongoing series of speaker recognition evaluations conducted by the National Institute of Standards and Technology (NIST).)

Recently, DL-based, especially end-to-end, TI-SV has drawn more attention, and many researchers have proposed different methods outperforming the i-vector/PLDA framework in various tasks. According to the results reported in [15], [18], end-to-end DL systems achieved better performance compared to the baseline i-vector system [11], especially for short utterances. Bidirectional LSTMs (BiLSTMs) with a triplet loss achieved better performance in the "same/different" speaker detection experiment compared with the Bayesian Information Criterion and Gaussian Divergence [19].

In this paper, we discuss the GE2E DL-based technique proposed by [4] for TI-SV. We examine various scenarios and parameters as well as potential candidate architectures to evaluate the generality of the proposed generalized method.

C. Paper structure and contributions
Our paper is organized as follows. In Sec. II, we present our end-to-end DL method and describe the utilized corpus and the necessary data processing steps for the TI-SV problem, as well as the training process. Sec. III discusses different experiments performed to assess the performance of the proposed end-to-end method. Finally, Sec. IV states some conclusions and potential future work. Our source code is available online as an open-source project for further investigation (accessible via https://github.com/starasteh/tisvge2e/).

II. METHODOLOGY
An end-to-end system treats the entire system as a whole adaptable black box. The processes of feature extraction and classifier training are achieved in parallel with an objective function that is consistent with the evaluation metric [20]. Our method in this project is mainly based on the GE2E model proposed by [4]. The main advantage of the generalized end-to-end training is that it enables us to process a large number of utterances at once, which greatly decreases the total training and convergence time. In this section, we first explain the proposed GE2E method. Then the necessary pre-processing and data preparation, the training procedure, and the configuration are described.
A. GE2E method
We select N different speakers and fetch M different utterances for every selected speaker to create a batch. Similar to [7], the features $\mathbf{x}_{ji}$ extracted from each utterance are fed to the network. The utilized network consists of 3 LSTM layers [21] followed by a linear projection layer to obtain the final embedding vectors [22]. The final embedding vector (d-vector) is the L2 normalization of the network output $f(\mathbf{x}_{ji}; \mathbf{w})$, where $\mathbf{w}$ represents all parameters of the network,

$$\mathbf{e}_{ji} = \frac{f(\mathbf{x}_{ji}; \mathbf{w})}{\lVert f(\mathbf{x}_{ji}; \mathbf{w}) \rVert_2}, \qquad (1)$$

where $\mathbf{e}_{ji}$ represents the embedding vector of the $j$-th speaker's $i$-th utterance. The centroid $\mathbf{c}_j$ of the embedding vectors $[\mathbf{e}_{j1}, \ldots, \mathbf{e}_{jM}]$ from the $j$-th speaker is defined as the arithmetic mean of the embedding vectors of the $j$-th speaker.
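For concreteness, the following is a minimal PyTorch sketch of the embedding network described above, under the configuration reported in Sec. II-C (40-dimensional inputs, 768 hidden nodes, 256-dimensional embeddings; these sizes reproduce the reported 12,134,656 trainable parameters). The class and argument names are our own; following [4], the last frame's output is taken as the utterance representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Sketch of the d-vector network: 3 LSTM layers + linear projection, Equ. (1)."""
    def __init__(self, n_mels=40, hidden=768, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                            num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):
        # x: (batch, frames, n_mels) log-mel features of partial utterances
        out, _ = self.lstm(x)
        # take the last frame's response and project it to the embedding space
        emb = self.proj(out[:, -1, :])
        # L2-normalize to obtain the d-vector, cf. Equ. (1)
        return F.normalize(emb, p=2, dim=1)
```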
The similarity matrix $S_{ji,k}$ is defined as the scaled cosine similarity between each embedding vector $\mathbf{e}_{ji}$ and all centroids $\mathbf{c}_k$ ($1 \leq j, k \leq N$, and $1 \leq i \leq M$),

$$S_{ji,k} = w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_k) + b, \qquad (2)$$

where $w$ and $b$ are learnable parameters. We constrain the weight to be positive, $w > 0$, because we want the similarity to be larger when the cosine similarity is larger. Unlike most end-to-end methods, which compute a scalar value, GE2E builds a similarity matrix (Equ. (2)) that defines the similarities between each $\mathbf{e}_{ji}$ and all centroids $\mathbf{c}_k$. Fig. 1 shows the discussed procedure after feature extraction, where different speakers are represented by different colors.

During training, we aim at maximizing the similarity of the embeddings representing the utterances of a particular speaker to the centroid of the embeddings of that speaker. At the same time, we want to minimize the similarity to the embedding centroids of all other speakers. This general idea is borrowed from traditional methods, such as Linear Discriminant Analysis (LDA). As shown in Fig. 2, we want the blue embedding vector to be close to its own speaker's centroid (blue triangle), and far from the other speakers' centroids (red and purple triangles), especially the closest one (red triangle).

Furthermore, removing $\mathbf{e}_{ji}$ when computing the centroid of the true speaker makes training stable and helps avoid trivial solutions [4]. So, while we still take the arithmetic mean of the embedding vectors when calculating the negative similarity (i.e., $k \neq j$), we instead use the following when $k = j$,

$$\mathbf{c}_j^{(-i)} = \frac{1}{M-1} \sum_{\substack{m=1 \\ m \neq i}}^{M} \mathbf{e}_{jm}. \qquad (3)$$

Therefore, Equ. (2) also becomes the following,

$$S_{ji,k} = \begin{cases} w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_j^{(-i)}) + b & \text{if } k = j \\ w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_k) + b & \text{otherwise}. \end{cases} \qquad (4)$$

Finally, we put a softmax on $S_{ji,k}$ for $k = 1, \ldots, N$ that makes the output equal to 1 iff $k = j$, and equal to zero otherwise. Thus, the loss on each embedding vector $\mathbf{e}_{ji}$ can be defined as,

$$L(\mathbf{e}_{ji}) = -S_{ji,j} + \log \sum_{k=1}^{N} \exp(S_{ji,k}). \qquad (5)$$

This loss function pushes each embedding vector close to its centroid and pulls it away from all other centroids. Finally, in order to calculate the final GE2E loss $L_G$, we have two options:

1) According to [4], the GE2E loss $L_G$ is the sum of all losses over the similarity matrix ($1 \leq j \leq N$, and $1 \leq i \leq M$),

$$L_G(\mathbf{x}; \mathbf{w}) = L_G(\mathbf{S}) = \sum_{j,i} L(\mathbf{e}_{ji}). \qquad (6)$$

Fig. 1: System overview. Different colors indicate utterances/embeddings from different speakers. [4]
Fig. 2: GE2E loss pushes the embedding towards the centroid of the true speaker, and away from the centroid of the most similar different speaker. [4]
2) Alternatively, the GE2E loss $L_G$ is the mean of all losses over the similarity matrix ($1 \leq j \leq N$, and $1 \leq i \leq M$),

$$L_G(\mathbf{x}; \mathbf{w}) = \frac{1}{M \cdot N} \sum_{j,i} L(\mathbf{e}_{ji}). \qquad (7)$$

Although both options eventually perform the same, we propose option 2, as it is more consistent when changing the number of speakers per batch or utterances per speaker.
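To make Equs. (2)–(5) and (7) concrete, here is a minimal PyTorch sketch of the GE2E loss with exclusive centroids, using the mean variant of Equ. (7). The function name and the (N, M, D) tensor layout are our own assumptions; `w` and `b` would be learnable scalar parameters, with `w` kept positive as described in Sec. II-C.

```python
import torch
import torch.nn.functional as F

def ge2e_loss(embeddings, w, b):
    """Sketch of the GE2E loss, Equs. (2)-(5) and (7).
    embeddings: (N, M, D) L2-normalized d-vectors for N speakers x M utterances.
    w, b: learnable scalars (w should be clamped positive before the call)."""
    N, M, D = embeddings.shape
    centroids = embeddings.mean(dim=1)  # plain centroids c_k, shape (N, D)
    # exclusive centroids c_j^(-i), Equ. (3): leave out the utterance itself
    excl = (embeddings.sum(dim=1, keepdim=True) - embeddings) / (M - 1)

    # cosine similarity of every e_ji to every centroid c_k, Equ. (2)
    sim = torch.einsum('jid,kd->jik',
                       F.normalize(embeddings, dim=2),
                       F.normalize(centroids, dim=1))       # (N, M, N)
    # replace the k == j entries with the exclusive-centroid similarity, Equ. (4)
    own = F.cosine_similarity(embeddings, excl, dim=2)      # (N, M)
    idx = torch.arange(N)
    sim[idx, :, idx] = own
    sim = w * sim + b                                       # scaled similarity matrix

    # per-embedding softmax loss, Equ. (5), averaged as in Equ. (7)
    loss = -sim[idx, :, idx] + torch.logsumexp(sim, dim=2)
    return loss.mean()
```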
B. Corpus and data pre-processing

The corpus that we use for all the training, enrollment, and evaluation steps is the LibriSpeech dataset, which is derived from English audiobooks. The "train-clean-360" subset is used for training, while other subsets are used separately for enrollment and evaluation, in an open-set manner. Tab. I illustrates the statistics of the different subsets of the corpus. For each speaker in the "clean" training sets, the amount of speech is limited to 25 minutes, in order to avoid major imbalances in per-speaker audio duration [23]. In the following, we describe the data pre-processing.

(In contrast to a closed-set scenario, where evaluation and training speakers can overlap, in an open-set scenario, none of the speakers used in the training phase should be used for the evaluation.)

TABLE I: Statistics of the different subsets of the LibriSpeech corpus.

subset name | subset duration [hours]
dev-clean | 5.4
test-clean | 5.4
dev-other | 5.3
test-other | 5.1
train-clean-100 | 100.6
train-clean-360 | 363.6
train-other-500 | 496.7

1) Training data pre-processing: After normalizing the volume of each utterance, we perform Voice Activity Detection (VAD) [24] with a maximum silence length of 6 ms and a window length of 30 ms, followed by pruning the intervals with sound pressures below 30 dB. Therefore, we end up with smaller segments for each utterance, which are referred to as partial utterances [4]. We only select the partial utterances which are at least 180 frames (1.8 s) long.

Furthermore, the feature extraction process is the same as in [25]. The partial utterances are first transformed into frames of width 25 ms with 10 ms steps. Then we extract 40-dimensional log-mel-filterbank energies as the feature representation for each frame.
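A minimal sketch of this frame-level feature extraction, assuming 16 kHz audio as in LibriSpeech and using librosa for the mel filterbank; the function name and the log-offset constant are our own choices.

```python
import numpy as np
import librosa

def logmel_features(wav, sr=16000):
    """Sketch of the frame-level features: 25 ms windows, 10 ms steps,
    40-dimensional log-mel-filterbank energies (cf. [25])."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame step
        n_mels=40)                   # 40 mel bands
    # log energies; the small constant guards against log(0)
    return np.log(mel + 1e-6).T      # (frames, 40)
```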
2) Enrollment and evaluation data pre-processing: The steps remain the same here as in the training stage, except that we do not keep partial utterances: instead, we concatenate the resulting smaller segments of each utterance in order to obtain a single segment per utterance again.
C. Training procedure
We randomly choose N speakers and randomly select M pre-processed partial utterances for each speaker to construct a batch. As shown in Fig. 3, in order to introduce more randomization, we randomly choose a time length t within [140, 180] frames and enforce that all partial utterances in that batch are of length t [4]. This means that partial utterances of different batches will have different lengths, but all the partial utterances in the same batch must be of the same length.

We use 768 hidden nodes and 256-dimensional embeddings for our network and optimize the model using the Adam optimizer [26]. The network contains a total of 12,134,656 trainable parameters. Each batch consists of N = 16 speakers and M = 5 partial utterances per speaker, leading to 80 partial utterances per batch. The L2-norm of the gradient is clipped at 3 [27], and the gradient scale for the projection node in the LSTM is set to 1. Furthermore, we initialize the scaling factor (w, b) of the loss function with (10, −5) and clamp w to remain positive in order to smooth the convergence. Moreover, the Xavier normal initialization [28] is applied to the network weights, and the biases are initialized with zeros. Algorithm 1 and Algorithm 2 explain the detailed training data pre-processing and training data batch preparation; a short code sketch of the batch preparation follows below.

Algorithm 1: Training data pre-processing.
for all raw utterances do
  − normalize the volume;
  − perform VAD with max silence length = 6 ms and window length = 30 ms;
  − prune the intervals with sound pressures below 30 dB;
  for all resulting intervals do
    if interval's length > 180 frames then
      − perform the Short-Time Fourier Transform (STFT) on the interval;
      − take the magnitude squared of the result;
      − transform to the mel scale;
      − take the logarithm;

Algorithm 2: Preparation of training data batches, ready to feed to the network.
for all training batches do
  − initialization: randomly choose an integer within [140, 180] as the partial utterance length;
  − randomly choose N speakers;
  for all N speakers do
    − randomly select M partial utterances that are pre-processed according to Algorithm 1;
    for all M partial utterances do
      − randomly segment an interval whose number of frames equals the one chosen in the initialization step;
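The batch preparation of Algorithm 2 can be sketched as follows; `features_by_speaker` and the function name are our own, and each stored partial utterance is assumed to be a (frames, 40) log-mel array of at least 180 frames, as guaranteed by the pre-processing.

```python
import random
import numpy as np

def make_batch(features_by_speaker, N=16, M=5):
    """Sketch of Algorithm 2: draw one training batch of N speakers x M partial
    utterances, all cropped to a common random length t in [140, 180] frames.
    features_by_speaker maps speaker id -> list of (frames, 40) log-mel arrays."""
    t = random.randint(140, 180)                  # common length for this batch
    speakers = random.sample(list(features_by_speaker), N)
    batch = []
    for spk in speakers:
        for feat in random.sample(features_by_speaker[spk], M):
            start = random.randint(0, feat.shape[0] - t)  # random crop of t frames
            batch.append(feat[start:start + t])
    return np.stack(batch), speakers              # (N*M, t, 40)
```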
III. EXPERIMENTS

In order to assess the performance of the method proposed in Sec. II-A, we compare the evaluation results with a baseline method (cf. Sec. III-C) and also discuss various experiments in this section. Before getting to the experiments, we first need to clarify the process of obtaining the d-vectors for the enrollment and evaluation utterances, and to explain the utilized evaluation and quantitative analysis approach.
A. Enrollment and evaluation d-vectors
For the sake of convenience and time, we first feed all the available pre-processed enrollment and evaluation utterances to the trained network (cf. Sec. II-C) and store the resulting d-vectors. Subsequently, we can easily load them to perform the enrollment and evaluation processes for various experiments. As illustrated in Fig. 4, for every utterance we apply a sliding window of fixed size (140 + 180)/2 = 160 frames with 50% overlap and compute the d-vector for each window. The final utterance-wise d-vector is generated by first L2-normalizing the window-wise d-vectors and then taking their element-wise average [4]. The detailed descriptions of the enrollment and evaluation data pre-processing and of the d-vector creation are given by Algorithm 3 and Algorithm 4; a short code sketch follows below.

Fig. 3: Batch construction process for the training step. [4]
Fig. 4: Sliding window used for enrollment and evaluation steps. [4]
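A minimal sketch of the sliding-window d-vector extraction of Algorithm 4 (160-frame window, 80-frame step, i.e., 50% overlap); `net` is assumed to be the trained embedding network, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def utterance_dvector(features, net, window=160, step=80):
    """Slide a 160-frame window with 50% overlap over one pre-processed
    utterance and average the L2-normalized window-wise d-vectors into a
    single utterance-level d-vector (cf. Algorithm 4)."""
    dvecs = []
    with torch.no_grad():  # inference only
        for start in range(0, features.shape[0] - window + 1, step):
            chunk = features[start:start + window].unsqueeze(0)  # (1, 160, 40)
            dvecs.append(F.normalize(net(chunk), p=2, dim=1))    # window d-vector
    # element-wise average of the L2-normalized window-wise d-vectors
    return torch.stack(dvecs).mean(dim=0).squeeze(0)
```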
B. Quantitative analysis approach
After creating the d-vectors, we can start evaluating the system. We use a threshold-based binary classification method in this stage, where we first need to create a speaker reference model for each speaker to be evaluated, i.e., the enrollment step. In the next step, we calculate the similarity between the unknown test utterance d-vector and the already built speaker model d-vector. The similarity metric we use here is the cosine similarity score, which is the normalized dot product of the speaker model and the test d-vector,

$$\cos(\mathbf{e}_{ji}, \mathbf{c}_k) = \frac{\mathbf{e}_{ji} \cdot \mathbf{c}_k}{\lVert \mathbf{e}_{ji} \rVert \cdot \lVert \mathbf{c}_k \rVert}. \qquad (8)$$

The higher the similarity score between $\mathbf{e}_{ji}$ and $\mathbf{c}_k$ is, the more similar they are.

The metric we use for evaluating the performance of our speaker verification system is the equal error rate (EER), which is used to predetermine the threshold values for the false acceptance rate (FAR) and the false rejection rate (FRR) [29], [30]. It searches for a threshold on the similarity scores at which the proportion of genuine utterances classified as impostor (FRR) is equal to the proportion of impostors classified as genuine (FAR). The overall FAR, FRR, and EER are calculated according to Equ. (9), Equ. (10), and Equ. (11), respectively, using the true acceptance (TA), true rejection (TR), false acceptance (FA), and false rejection (FR) counts. Note that, since the FAR and FRR curves are monotonic, there is only one point where the FAR has the same value as the FRR.

$$FAR = \frac{FA}{FA + TR} \qquad (9)$$

$$FRR = \frac{FR}{FR + TA} \qquad (10)$$

$$EER = \frac{FAR + FRR}{2}, \quad \text{if } FAR = FRR \qquad (11)$$

Algorithm 3: Enrollment and evaluation data pre-processing.
for all raw utterances do
  − normalize the volume;
  − perform VAD with max silence length = 6 ms and window length = 30 ms;
  − prune the intervals with sound pressures below 30 dB;
  for all resulting intervals do
    if interval's length < 180 frames then
      − drop the interval;
  − concatenate the remaining intervals;
  − perform the STFT on the concatenated utterance;
  − take the magnitude squared of the result;
  − transform to the mel scale;
  − take the logarithm;

Algorithm 4: Enrollment and evaluation data preparation and d-vector creation.
for all enrollment and evaluation speakers do
  for all pre-processed utterances do
    − initialization: set the starting time frame of the window t = 0;
    while not reached the end of the utterance do
      − select the interval within [t, t + 160] frames of the utterance;
      − feed the selected interval to the trained network to obtain the corresponding d-vector;
      − L2-normalize the d-vector;
      − t = t + 80;
    − perform the element-wise average of the L2-normalized d-vectors to obtain the final utterance d-vector;
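A small sketch of how Equs. (9)–(11) can be evaluated empirically from genuine and impostor similarity scores; the array names are our own, and the threshold sweep is one straightforward way to locate the single FAR/FRR crossing.

```python
import numpy as np

def far_frr_eer(genuine_scores, impostor_scores):
    """Sweep thresholds over the observed scores and return the EER together
    with the threshold where FAR and FRR (approximately) cross, Equs. (9)-(11)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # FA/(FA+TR)
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # FR/(FR+TA)
    i = np.argmin(np.abs(far - frr))   # single crossing: FAR/FRR are monotonic
    return (far[i] + frr[i]) / 2, thresholds[i]                          # Equ. (11)
```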
C. The baseline system
The baseline is a standard i-vector system proposed by [11]. Tab. II shows the evaluation results on the "dev-clean" and "test-clean" subsets. The experiments are performed for three cases with different i-vector dimensions and different numbers of Gaussian Mixture components, with random data split and simple thresholding. Each positive sample is tested against 20 negative samples, and 20 different positive samples are tested per speaker. From Tab. II, we can already observe that the EER results are quite high with the baseline system.

TABLE II: The evaluation results using the baseline i-vector method [11] with random data split and simple thresholding on the "dev-clean" and "test-clean" datasets. Each positive sample is tested against 20 negative samples. Furthermore, 20 different positive samples are tested per speaker. Columns one and two show the i-vector dimensionality and the number of GMM elements, respectively.

i-Vec dimension | GMM elements | dev-clean EER | test-clean EER
600 | 1024 | 16.65% | 18.58%
400 | 512 | 17.80% | 17.70%
300 | 256 | 18.86% | 16.90%

TABLE III: Average EER [%] over 1K test iterations for different numbers M of enrollment d-vectors per speaker, for different evaluation subsets (columns ordered by increasing M, starting at M = 2). The last column shows the average over all tested M.

subset | EER for increasing M → | Avg
test-clean | 2.92 | 2.57 | 2.41 | 2.27 | 2.17 | 2.01 | 2.–
test-other | 2.59 | 2.45 | 2.35 | 2.32 | 2.39 | 2.35 | 2.–
dev-clean | 2.21 | 1.94 | 1.94 | 1.89 | 1.89 | 1.81 | 1.–
D. Performance by number of enrollment utterances
In speaker verification, there are typically multiple enrollment utterances for each speaker in order to build a robust speaker model. The observed EER is only an approximation of the system's true EER. Consequently, we repeat the enrollment and evaluation processes for 1K iterations and average the results to make up for the aforementioned problem. Moreover, while M utterances for every speaker should be randomly selected in order to construct a batch for processing, we choose N equal to the number of all available speakers in the test set in order to further reduce the randomization imposed by sampling.

Fig. 5 and Tab. III show the average EER over 1K test iterations for different numbers M of enrollment d-vectors per speaker, separately on different subsets of LibriSpeech. Note that the minimum possible M is 2, as we average over the enrollment d-vectors in order to get the speaker models, while removing the utterance itself when calculating the centroids based on Equ. (3). Also, in every test iteration, we select M utterances per speaker and split them in half for the enrollment and evaluation steps. As we can see, the choice of M matters most for the lower values. Moreover, the curve is monotonically decreasing for the clean environment, while for the noisy "test-other" set, increasing M does not yield improvements for higher values.

E. Performance on test set
In this experiment, we first perform the enrollment and evaluation tasks on the "dev-clean" set for M = 2, fix the obtained average threshold, and use it to perform enrollment and verification on the "test-clean" and "test-other" sets. Fig. 6 illustrates the FAR vs. FRR values over different similarity thresholds; the EER lies at the intersection point of the two curves. Tab. IV shows the evaluation results on the test sets tested with the fixed threshold obtained from "dev-clean". Furthermore, Tab. V shows the evaluation results on "test-clean" using the model trained after different numbers of epochs, which demonstrates how fast the network converges.

Fig. 5: Average EER [%] over 1K test iterations for different numbers M of enrollment d-vectors per speaker, separately on the (a) "test-clean" and (b) "test-other" subsets of LibriSpeech. Note that the minimum possible M is 2, as we remove the utterance itself when calculating the centroids based on Equ. (3).
Fig. 6: The FAR vs. FRR values over different similarity thresholds for M = 2, averaged over 1K test iterations on the "dev-clean" set. The EER is the value at the intersection point of the two curves.

F. Performance by test utterance duration
Even though the state-of-the-art DL methods have outperformed most of the traditional methods in various speaker recognition tasks and shown outstanding results, text-independent speaker verification is still a challenging problem when it comes to short utterances. In this experiment, we evaluate the performance of our method separately for short and long utterances. We consider an utterance short when its duration is less than 4 seconds, and long when its duration is more than 4 seconds. Tab. VI shows the number of utilized short and long utterances available per subset. As shown in Tab. VII, performance drops significantly, by 59%, when only considering the short-length utterances compared to the unconstrained case for the "test-clean" subset.

TABLE IV: The evaluation results for M = 2 averaged over 10K test iterations on the test sets, tested with the fixed threshold obtained from "dev-clean".

subset | EER [%] | FAR [%] | FRR [%]
test-clean | –.85 | 3.68 | 4.–
test-other | –.66 | 2.89 | 4.–

TABLE V: Evaluation results for M = 2 averaged over 1K test iterations on the "test-clean" subset using the model trained at different stages. Every column shows the model after an increasing number of training epochs; the last column shows the final model.

EER [%] | –.12 | 5.08 | 4.42 | 4.44 | 3.–
FAR [%] | –.12 | 5.10 | 4.44 | 4.44 | 3.–
FRR [%] | –.12 | 5.06 | 4.40 | 4.44 | 3.–

TABLE VI: Number of utterances in the LibriSpeech test sets based on utterance duration. The last column also shows the total number of speakers per subset.

subset | ≤ 4 s | > 4 s | total | total spkrs
dev-clean | 765 | 1938 | 2703 | 40
test-clean | 772 | 1848 | 2620 | 40
test-other | – | – | – | –

TABLE VII: EER [%] results for M = 2 for different utterance lengths, averaged over 1K test iterations on the test sets. The first column shows the results for short utterances, the second column the results for long utterances, and the last column the results without taking the utterance duration into consideration.

subset | ≤ 4 s | > 4 s | total
dev-clean | –.40 | 3.01 | 3.–
test-clean | 5.32 | 3.51 | 3.–
test-other | –.90 | 3.58 | 3.–

IV. CONCLUSION
In this project, we investigated the GE2E method proposed in [4] for text-independent speaker verification. Both theoretical and experimental results verified the advantage of this method compared to the baseline system. We observed that GE2E training was substantially faster than other DL-based end-to-end speaker verification systems and converges very fast, while it is one of the few DL-based TI-SV methods that outperform the baseline system. Furthermore, even though short utterances are more difficult to predict, we showed that the proposed method is flexible with respect to utterance duration and still works for short-duration data. Moreover, as increasing the number of utterances per enrollment speaker improves the performance, we saw that the proposed method also generalizes well in this regard and shows great performance with only a few enrollment utterances per speaker. Finally, we provided our source code and all the utilized data as an open-source project for further investigation (cf. Sec. I-C and Sec. II-B).

For future work, we would like to further generalize the proposed method by replacing the initial feature extraction (cf. Algorithm 1 and Algorithm 3) with DL techniques in order to directly feed the raw waveform to the network. It would also be interesting to benefit from more sophisticated and advanced architectures such as the transformer and attention mechanism [31] in our embedding extractor network.

REFERENCES
[1] H. Beigi, Fundamentals of Speaker Recognition. Springer, 2011.
[2] J. P. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, 1997.
[3] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[4] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP, 2018, pp. 4879–4883.
[5] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
[6] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP J. Adv. Signal Process, vol. 2004, pp. 430–451, 2004.
[7] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in ICASSP, 2016, pp. 5115–5119.
[8] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-end attention based text-dependent speaker verification," 2017.
[9] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP, 2014, pp. 4052–4056.
[10] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 1, 2009, pp. 1559–1562.
[11] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[12] N. Dehak, P. Kenny, R. Dehak, O. Glembek, P. Dumouchel, L. Burget, V. Hubeika, and F. Castaldo, "Support vector machines and joint factor analysis for speaker verification," in ICASSP, 2009, pp. 4237–4240.
[13] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in ICASSP, 2014, pp. 1695–1699.
[14] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," Trans. Audio, Speech and Lang. Proc., vol. 19, no. 4, pp. 788–798, 2011.
[15] C. Zhang and K. Koishida, "End-to-end text-independent speaker verification with triplet loss on short utterances," in Proc. Interspeech 2017, 2017, pp. 165–170.
[19] H. Bredin, "Tristounet: Triplet loss for speaker turn embedding," in ICASSP, 2017, pp. 5430–5434.
[20] A. Irum and A. Salman, "Speaker verification using deep neural networks: A review," International Journal of Machine Learning and Computing, vol. 9, pp. 20–25, 2019.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[22] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2014, pp. 338–342.
[23] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.
[24] J. Ramirez, J. M. Gorriz, and J. C. Segura, "Voice activity detection. Fundamentals and speech recognition system robustness," in Robust Speech Recognition and Understanding, M. Grimm and K. Kroschel, Eds. Rijeka: IntechOpen, 2007, ch. 1.
[25] R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath, "Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks," in ICASSP, 2015, pp. 4704–4708.
[26] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[27] R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the exploding gradient problem," arXiv preprint arXiv:1211.5063, 2012.
[28] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research - Proceedings Track, vol. 9, pp. 249–256, 2010.
[29] D. A. van Leeuwen and N. Brümmer, An Introduction to Application-Independent Evaluation of Speaker Recognition Systems. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 330–353.
[30] J. H. L. Hansen and T. Hasan, "Speaker recognition by machines and humans: A tutorial review," IEEE Signal Processing Magazine, vol. 32, pp. 74–99, 2015.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017.