S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification
Metilda Sagaya Mary N J, Sandesh V Katta, S Umesh
Department of Electrical Engineering, Indian Institute of Technology Madras
[email protected], [email protected], [email protected]
Abstract
X-vectors have become the standard for speaker embeddings in automatic speaker verification. X-vectors are obtained using a Time-Delay Neural Network (TDNN) with context over several frames. We have explored the use of an architecture built on self-attention, which attends to all the features over the entire utterance and hence better captures speaker-level characteristics. We have used the encoder structure of Transformers, which is built on self-attention, as the base architecture and trained it on a speaker classification task. In this paper, we propose to derive speaker embeddings from the output of the trained Transformer encoder structure after appropriate statistics pooling to obtain utterance-level features. We name the speaker embeddings from this structure s-vectors. S-vectors outperform x-vectors with relative improvements of 10% and 15% in % EER when trained on the Voxceleb-1-only and Voxceleb-1+2 datasets, respectively. We have also investigated the effect of deriving s-vectors from different layers of the model.
Index Terms: S-vectors, Transformer encoder, Speaker Verification, x-vectors, speaker embeddings
1. Introduction
Speaker verification uses speech as a biometric to verify the identity claimed by the speaker. There are two types of speaker verification systems: i) text-dependent and ii) text-independent. Text-independent systems are flexible as there is no constraint on the text spoken by the speaker. Most of the research in this area is focused on obtaining a single fixed-dimension vector representing an utterance. These vectors are then scored to verify the speaker's identity and are termed speaker embeddings. Any speaker embedding should enhance inter-speaker variability and suppress intra-speaker variability while scoring.

I-vectors [1] extracted from UBM-GMM models are one of the earliest speaker embeddings. The success of i-vectors piqued the interest among researchers to search for better speaker embeddings. With the increase in data available for training, speaker embeddings based on deep learning methods have gained popularity.

X-vectors [2, 3] extracted from Time Delay Neural Network (TDNN) [4] based systems have consistently outperformed i-vector based systems and are therefore used in state-of-the-art systems. In order to account for variable utterance lengths, different pooling methods have been proposed to aggregate the information across the utterance into a single vector. Two common pooling methods are Temporal Average Pooling (TAP) [5] and Statistics Pooling (SP) [3]. These pooling methods give equal importance to all the frames of an utterance. A newer pooling method called Attentive Statistics Pooling (ASP) was proposed in [6]. ASP identifies important frames in an utterance and assigns higher weights to those frames while calculating utterance-level statistics.

Convolutional Neural Network (CNN) based methods [5, 7] treat spectrogram representations of speech as images and build speaker verification systems on them. These methods use complex aggregation strategies like NetVLAD and GhostVLAD. Despite being complex models, only after the inclusion of the relation module [8] were CNN-based methods able to match the performance of the x-vector system on the Voxceleb-2 dataset.

Attention mechanisms [9, 10] have gained popularity in Natural Language Processing (NLP) and speech processing. Attention helps pick the necessary information from the encoder states of a Recurrent Neural Network (RNN) at every decoding step. Attention-based techniques took a big step forward with the introduction of Transformers [11] for Machine Translation (MT) by the NLP research community. A Transformer has two modules: i) an encoder and ii) a decoder. These two modules are built on self-attention and interact through cross-attention. In Transformers, information from all the frames is accounted for by the attention networks. Transformers have also been successfully used in both Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) tasks [12, 13, 14, 15].

In this work, we propose a self-attention based alternative to the x-vector system. We have replaced the TDNN of the x-vector system, which has finite context, with the encoder module of the Transformer. We expect this arrangement to capture speaker characteristics better due to its unrestricted context. Also, since self-attention is built on the dot product between frames, we expect it to capture the similarities across an utterance efficiently. We stick to the same steps followed by the TDNN-based x-vectors and only replace the TDNN with the encoder module for a fair comparison. We show that this model can outperform x-vectors when trained on the Voxceleb-1 and Voxceleb-1+2 datasets [16].
2. Database and Data Augmentation
We have trained two models: i) with the Voxceleb-1 dataset and ii) with the Voxceleb-1+2 dataset. The Voxceleb-1 dataset consists of recordings from 1251 speakers and over 100,000 utterances extracted from celebrity interview videos on YouTube. Voxceleb-2 consists of recordings from 6112 speakers and has about a million utterances, also from celebrity interview videos on YouTube. We follow data preparation and feature extraction steps identical to the x-vector method. MFCCs are extracted and energy-based Voice Activity Detection (VAD) is performed to remove silence frames. The data is then augmented with reverberation and noise in the same way as proposed by Snyder et al. for their x-vector system using Voxceleb-1+2 data [3]. Reverberation examples are taken from the RIRS database [17] and noise examples are taken from the MUSAN database [18]. The augmented utterances are then chunked to generate training examples, similar to the x-vector system.

Figure 1: Network Architecture of X-vector System
3. TDNN based X-vectors
The input features are 30-dimensional MFCCs obtained using a frame length of 25 ms and 30 Mel filters. A sliding mean normalisation over a window of up to 3 s is applied. The architecture of the TDNN-based x-vector training model is shown in Figure 1. In the first TDNN layer (TDNN-1), at any time step t, the frames {t - 2, t - 1, t, t + 1, t + 2} are spliced and presented as input. TDNN-2 takes {t - 2, t, t + 2} and TDNN-3 takes {t - 3, t, t + 3} spliced frames of the previous layer as input. TDNN-4 and TDNN-5 take just the t-th frame as input. In order to aggregate the statistics over the entire utterance, this system uses a statistics pooling layer. This layer computes the mean and standard deviation across each dimension over the entire utterance, resulting in a single utterance-level vector (the concatenation of mean and standard deviation). This single vector is representative of the entire utterance. It is then passed through two Feedforward Neural Networks (FFNNs) and then to the output layer with softmax and cross-entropy as the criterion for speaker classification. All non-linearities used are Rectified Linear Units (ReLUs) and batch normalisation is performed at every stage. The embeddings extracted from FFNN-1 before the non-linearity are termed x-vectors. We have exactly followed Kaldi's recipe to extract x-vectors.
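The splicing contexts above can be realised as dilated 1-D convolutions. The following PyTorch-style sketch is our own illustration of the frame-level stack, statistics pooling and FFNNs; the layer widths (512 and 1500) follow the commonly used Kaldi recipe and are assumptions here rather than values quoted in this paper.

import torch
import torch.nn as nn

class TDNNXVector(nn.Module):
    """Sketch of the x-vector network: each TDNN layer is a 1-D convolution
    whose kernel size and dilation reproduce the splicing contexts
    {t-2..t+2}, {t-2,t,t+2}, {t-3,t,t+3}, {t} and {t}."""

    def __init__(self, feat_dim=30, n_speakers=1251):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
        )
        self.ffnn1 = nn.Linear(2 * 1500, 512)      # x-vectors are tapped at this affine output
        self.ffnn2 = nn.Sequential(nn.ReLU(), nn.BatchNorm1d(512),
                                   nn.Linear(512, 512), nn.ReLU(), nn.BatchNorm1d(512))
        self.classifier = nn.Linear(512, n_speakers)

    def forward(self, feats):                      # feats: (batch, time, feat_dim)
        h = self.frame_layers(feats.transpose(1, 2))              # (batch, 1500, time')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # statistics pooling
        x_vec = self.ffnn1(stats)                                  # x-vector embedding
        return self.classifier(self.ffnn2(x_vec))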
4. Proposed S-vectors
X-vector training is based on a limited temporal context, as mentioned before. This might not be sufficient to capture speaker characteristics that persist across an entire utterance. In order to capture the speaker characteristics better, we have used self-attention as the backbone of our architecture. Its strength is that it is not restricted to a finite context and attends to all frames at every time step. Also, because self-attention is built on the dot product between frames, we expect it to learn speaker characteristics better.

Figure 2: Network Architecture of S-vector System
To derive s-vectors, we have replaced the TDNN in the x-vector system with the encoder of the Transformer [11], as shown in Figure 2. The input is 30-dimensional MFCC features, and we use the same training utterances as for the x-vector system. The 30-dimensional MFCC features are transformed into the attention dimension (Adim) by FFNN-1 and fed to the encoder layers after the addition of position embeddings. Multi-head attention is performed at every encoder layer. The encoder layer is explained in detail in the next section. The final encoder layer's Adim x T output is then projected through FFNN-2. Statistics pooling on these vectors results in a single utterance-level vector (the concatenation of mean and standard deviation). This vector is then passed through two further FFNNs (FFNN-3 and FFNN-4) before being presented to a classification layer. In all FFNNs except FFNN-2, we have used ReLU. We have used leaky ReLU (negative slope = 0.01) in FFNN-2 in order to stabilise the gradients flowing through the standard-deviation part of the statistics pooling layer. We call the vectors extracted from our model s-vectors. Each encoder layer is made of a self-attention network (SAN) and an FFNN, as shown in Figure 3. Batch normalisation is performed after the addition of residuals at every stage of the encoder layer. The working of self-attention is represented in Figure 4.

Figure 3: Elements in an Encoder Layer
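As an illustration of the forward path just described, the following is a minimal PyTorch-style sketch. It is not the ESPnet implementation: the projection width after FFNN-2, the learnable position table standing in for sinusoidal encodings, and PyTorch's default layer normalisation inside the encoder are all simplifying assumptions on our part.

import torch
import torch.nn as nn

class SVectorModel(nn.Module):
    """Sketch of the s-vector pipeline: FFNN-1 -> position embedding ->
    Transformer encoder -> FFNN-2 (leaky ReLU) -> statistics pooling ->
    FFNN-3 / FFNN-4 -> speaker classification layer."""

    def __init__(self, feat_dim=30, adim=256, heads=4, layers=3,
                 units=2048, proj_dim=512, n_speakers=1251, max_len=300):
        super().__init__()
        self.ffnn1 = nn.Sequential(nn.Linear(feat_dim, adim), nn.ReLU())
        self.pos = nn.Parameter(torch.zeros(max_len, adim))     # stand-in for sinusoidal encoding
        enc_layer = nn.TransformerEncoderLayer(d_model=adim, nhead=heads,
                                               dim_feedforward=units, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.ffnn2 = nn.Sequential(nn.Linear(adim, proj_dim), nn.LeakyReLU(0.01))
        self.ffnn3 = nn.Linear(2 * proj_dim, proj_dim)          # s-vectors are tapped at this affine output
        self.ffnn4 = nn.Sequential(nn.ReLU(), nn.Linear(proj_dim, proj_dim), nn.ReLU())
        self.classifier = nn.Linear(proj_dim, n_speakers)

    def forward(self, x):                                        # x: (batch, time, feat_dim)
        h = self.ffnn1(x) + self.pos[: x.size(1)]
        h = self.ffnn2(self.encoder(h))
        mean, std = h.mean(dim=1), h.std(dim=1)                  # statistics pooling over time
        s_vec = self.ffnn3(torch.cat([mean, std], dim=1))        # s-vector embedding
        return self.classifier(self.ffnn4(s_vec))

Training uses softmax cross-entropy over the speaker labels, exactly as in the x-vector system.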
The Adim x T input is converted to Queries (Q), Keys (K) and Values (V) through three Adim x Adim matrices W_Q, W_K and W_V. Every vector in the Adim x T output of the SAN layer is a weighted sum of the vectors in the Adim x T Values. The weights are decided by the Queries and Keys. As we see in Figure 4, the output O_t is obtained by a linear combination of all the T frames in the Values. The weight for each of the T frames of the Values is the softmax of the normalised dot-product score of Q_t with the corresponding frame of the Keys.

Figure 4: Self-attention Network (SAN)
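In equations, with X the Adim x T input and assuming the usual scaled dot-product normalisation (division by the square root of the per-head dimension d, the standard Transformer choice, stated here as an assumption):

\[
Q = W_Q X, \qquad K = W_K X, \qquad V = W_V X,
\]
\[
O_t = \sum_{\tau=1}^{T} \alpha_{t\tau}\, V_\tau, \qquad
\alpha_{t\tau} = \frac{\exp\!\left(Q_t^{\top} K_\tau / \sqrt{d}\right)}
                      {\sum_{\tau'=1}^{T} \exp\!\left(Q_t^{\top} K_{\tau'} / \sqrt{d}\right)},
\]

so each output frame O_t is a convex combination of the Value frames, with weights determined by how strongly Q_t matches each Key frame.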
All the data preparation and feature extraction steps were performed using Kaldi's [19] Voxceleb v2 recipe. Our s-vector model was trained in Espnet [20]. The x-vector system chunks the data into random lengths with different start frames to generate training examples and writes them as .ark files before training. These ark files are not compatible with Espnet, so we obtained the chunk information (start frame and chunk length) and performed the chunking in Espnet before passing an utterance for training. This ensured that both models were trained on the same data. Other details of the training conducted in Espnet for Voxceleb-1 are given in Table 1. For Voxceleb-1+2, Adim was set to 512 and the number of heads to 8.
Table 1: Training Details of s-vector Model

Parameter           Value
Position Encoding   Sinusoidal
Encoder Units       2048
Adim                256
Heads               4
Normalize Before    True
Learning Rate       10
Batch Size          64
Optimiser           Noam
Warm-up Steps       25000
Normalisation       Batch-Norm
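The Noam optimiser entry in Table 1 refers to the Transformer-style learning-rate schedule. A minimal sketch, assuming the schedule follows the original Transformer formulation and interpreting the "Learning Rate 10" entry as the scale factor (both assumptions on our part):

def noam_lr(step: int, d_model: int = 256, warmup: int = 25000, factor: float = 10.0) -> float:
    """Noam schedule: linear warm-up for `warmup` steps followed by
    inverse square-root decay, scaled by `factor` and the model dimension (Adim)."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)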
We have taken the standard Equal Error Rate (EER) and Detection Cost Function (DCF) as the evaluation metrics to compare the x-vector and our proposed system. EER refers to the cross-over point of the False-Alarm and Miss error rates. DCF is a weighted linear combination of the False-Alarm and Miss error rates. For DCF calculation we assume P_Target = 0.01 (or 0.001) while C_Miss = 1 and C_FalseAlarm = 1.
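For reference, the cost follows the standard NIST-style weighting, where P_Miss and P_FA are the miss and false-alarm rates at a given decision threshold:

\[
\mathrm{DCF} = C_{\mathrm{Miss}}\, P_{\mathrm{Miss}}\, P_{\mathrm{Target}}
             + C_{\mathrm{FalseAlarm}}\, P_{\mathrm{FA}}\, (1 - P_{\mathrm{Target}}).
\]

With C_Miss = C_FalseAlarm = 1, the reported numbers presumably correspond to this cost minimised over decision thresholds (minDCF), as in Kaldi's Voxceleb recipe.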
Before feeding the utterances for speaker embedding extraction, non-speech frames were removed, similar to the original system, using an energy-based Voice Activity Detection (VAD) system. We have chunked the utterances, with each chunk being 300 frames; the remaining frames were taken as another chunk. We have used Kaldi's recipe for PLDA scoring. We tried different LDA dimensions and found that the optimal dimension is 200, similar to that of x-vectors. We chose to chunk the utterances because of the position encoding in our architecture: we expected that feeding whole utterances would lead to unseen positions in the input and might result in poor speaker embeddings. The embeddings extracted for each chunk are then averaged over the utterance to get the final embedding (a sketch of this chunk-and-average extraction is given below). Results with and without chunking for the 3-layer s-vector model trained on the Voxceleb-1 dataset are presented in Table 2. We see that chunked and un-chunked data yield almost similar results, but we prefer chunking as it yields lower DCF values.

Table 2: Effect of Chunking on 3-layer s-vector model trained with Voxceleb-1

Method        EER    DCF (P_Target = 0.01)   DCF (P_Target = 0.001)
Chunked       5.63   0.46                    0.62
Un-chunked    5.55   0.49                    0.62
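A minimal sketch of the chunk-and-average extraction (our own illustration; extract_embedding is a hypothetical callable standing in for a forward pass through the trained model up to FFNN-3):

import torch

def chunked_embedding(feats: torch.Tensor, extract_embedding, chunk: int = 300) -> torch.Tensor:
    """feats: (time, feat_dim) VAD-filtered features of one utterance.
    Splits the utterance into 300-frame chunks (the trailing remainder forms
    its own chunk), extracts one embedding per chunk and averages them."""
    chunks = [feats[i:i + chunk] for i in range(0, feats.size(0), chunk)]
    embs = [extract_embedding(c.unsqueeze(0)).squeeze(0) for c in chunks]
    return torch.stack(embs).mean(dim=0)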
We analysed the effect of deriving embeddings from FFNN-3 and FFNN-4. The % EER and DCF for both these tapping positions of a 3-layer s-vector model trained on Voxceleb-1 are presented in Table 3. We see that FFNN-3 gives better embeddings compared to FFNN-4. The speaker embeddings correspond to the affine part of FFNN-3 and FFNN-4.

Table 3: Effect of Tapping Layer on 3-layer s-vector model trained with Voxceleb-1

Layer     % EER   DCF (P_Target = 0.01)   DCF (P_Target = 0.001)
FFNN-3    5.63    0.46                    0.62
FFNN-4    6.69    0.54                    0.68
We first analysed the proposed s-vector architecture by varying the number of encoder layers on the Voxceleb-1 data. The % EER and DCF for the different numbers of layers are presented in Table 4. We see that all three models consistently perform better than x-vectors in terms of % EER. The 3-layer s-vector model outperforms x-vectors in both % EER and DCF. We have quoted the best-performing x-vector model after comparing the results from systems trained for 3, 5 and 10 epochs. The Detection Error Tradeoff (DET) curves for x-vectors and the best-performing 3-layer s-vector model are presented in Figure 5. Similarly, we trained a 6-layer s-vector model with Voxceleb-1+2 data and the results are presented in Table 5. The DET curve is presented in Figure 6. We see that the s-vector system outperforms x-vectors in terms of % EER and DCF in the case of the Voxceleb-1+2 dataset too. Fine-tuning the parameters might improve the results further.
Table 4: Effect of Encoder Layers when trained with Voxceleb-1

Model           % EER   DCF (P_Target = 0.01)   DCF (P_Target = 0.001)
-L s-vector     5.80    0.48                    0.65
Table 5: Voxceleb-1+2 Results

Model      % EER   DCF (P_Target = 0.01)   DCF (P_Target = 0.001)
x-vector   3.13    0.33                    0.5

Figure 5: DET Curves for the best s-vector system on Voxceleb-1 data

Figure 6: DET Curves for the 6-layer s-vector system on Voxceleb-1+2 data
5. Conclusions and Future Work
In this work, we have proposed a new architecture for deriving speaker embeddings based on the Transformer's encoder. We call these embeddings s-vectors. The proposed system has outperformed the % EER of the standard x-vector system by 10% relative in the case of Voxceleb-1 and 15% relative in the case of Voxceleb-1+2. The DCF values are also better than those of x-vectors. Fine-tuning the parameters might improve the results further. In future, we would like to explore intelligent ways of combining the statistics of our model to improve the performance further, including attention across all temporal outputs to obtain the pooled utterance-level vector.
6. References

[1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Proc. Interspeech 2017, 2017, pp. 999-1003. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-620
[3] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP, 2018, pp. 5329-5333.
[4] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, 1989.
[5] J. S. Chung, A. Nagrani, and A. Zisserman, "Voxceleb2: Deep speaker recognition," in Proc. Interspeech 2018, 2018, pp. 1086-1090. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1929
[6] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," Interspeech 2018, Sep 2018. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-993
[7] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5791-5795.
[8] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "Voxceleb: Large-scale speaker verification in the wild," Computer Speech and Language, 2019.
[9] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations (ICLR), Jan. 2015.
[10] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5998-6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[12] L. Dong, S. Xu, and B. Xu, "Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition," in ICASSP, 2018, pp. 5884-5888.
[13] S. Liu, Y. Liu, S. Zhao, and M. Liu, "Neural speech synthesis with transformer network," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 6706-6713, 07 2019.
[14] V. M. Shetty, M. Sagaya Mary N J, and S. Umesh, "Improving the performance of transformer based low resource speech recognition for Indian languages," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8279-8283.
[15] M. S. Mary N J, V. M. Shetty, and S. Umesh, "Investigation of methods to improve the recognition performance of Tamil-English code-switched data in transformer framework," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7889-7893.
[16] A. Nagrani, J. S. Chung, and A. Zisserman, "Voxceleb: a large-scale speaker identification dataset," in INTERSPEECH, 2017.
[17] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in ICASSP, 2017, pp. 5220-5224.
[18] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," CoRR, vol. abs/1510.08484, 2015. [Online]. Available: http://arxiv.org/abs/1510.08484
[19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.
[20] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "Espnet: End-to-end speech processing toolkit," in Proc. Interspeech 2018, 2018.