Double Multi-Head Attention for Speaker Verification
Miquel India, Pooyan Safari, Javier Hernando
TALP Research Center, Universitat Politecnica de Catalunya
{miquel.angel.india,javier.hernando}@upc.edu, [email protected]

Abstract
Most state-of-the-art Deep Learning systems for speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. In this paper we present Double Multi-Head Attention pooling, which extends our previous approach based on Self Multi-Head Attention. An additional self-attention layer is added to the pooling layer to summarize the context vectors produced by Multi-Head Attention into a unique speaker representation. This method enhances the pooling mechanism by weighting the information captured by each head and results in more discriminative speaker embeddings. We have evaluated our approach on the VoxCeleb2 dataset. Our results show relative improvements in terms of EER compared to Self Attention pooling and Self Multi-Head Attention. According to the obtained results, Double Multi-Head Attention has shown to be an excellent approach to efficiently select the most relevant features captured by the CNN-based front-ends from the speech signal.

Index Terms: self multi-head attention, speaker recognition, speaker verification
1. Introduction
Speaker verification aims to determine whether a pair of audios corresponds to the same speaker. Given speech signals, speaker verification systems are able to extract speaker identity patterns from the characteristics of the voice. These patterns can be either statistically modelled or encoded into discriminative speaker representations. Over the last few years, researchers have put a huge effort into encoding these traits into more discriminative speaker vectors. Current state-of-the-art speaker verification systems are based on Deep Learning (DL) approaches. These architectures are commonly trained as speaker classifiers in order to be used as speaker embedding extractors. Speaker embeddings are fixed-length vectors extracted from one of the last layers of these Deep Neural Networks (DNNs) [1]. The best known representation is the x-vector [2], which has become state of the art for speaker recognition and has also been used for other tasks such as language and emotion recognition [3, 4].

Most of the recent network architectures used for speaker embedding extraction are composed of a front-end feature extractor, a pooling layer, and a set of Fully Connected (FC) layers. Lately, several architectures have been proposed to encode audio utterances into speaker embeddings for different choices of network inputs. Using Mel-Frequency Cepstral Coefficient (MFCC) features, the Time Delay Neural Network (TDNN) [5, 6] is currently the most widely used architecture. TDNN is the x-vector front-end and consists of a stack of 1-D dilated Convolutional Neural Networks (CNNs). The idea behind the use of TDNNs is to encode a sequence of MFCCs into a more discriminative sequence of vectors by capturing long-term feature relations. 2-D CNNs have also shown competitive results for speaker verification. Computer Vision architectures such as VGG [7, 8, 9] and ResNet [10, 11, 12] have been adapted to capture speaker-discriminative information from the Mel spectrogram. In fact, ResNet34 has shown better performance than TDNN in the most recent speaker verification challenges [13, 14]. Finally, there are also some attempts to work directly on the raw signal instead of using hand-crafted features [15, 16, 17].

Given the encoded sequence from the front-end, a pooling layer is adopted to obtain an utterance-level representation. During the last few years, several studies have addressed different types of pooling strategies. The x-vector originally uses statistical pooling [6] or the Self Attentive pooling method proposed in [18]. A wide set of pooling layers based on self attention have been proposed that improve this vanilla self-attention mechanism. In [18], several attentions are applied over the same encoded sequence, producing multiple context vectors. In our previous work [9], the encoded sequence is split into different heads and a different attention model is applied over each head sub-sequence. Attention mechanisms have also been used to improve statistical pooling. In works like [19], attention is used to extract better higher-order feature statistics. Finally, there are also works with competitive results, such as [20, 21, 22], which propose pooling methods independent of self-attention models.

In this paper we present a Double Multi-Head Attention (MHA) pooling layer for speaker verification. The use of this layer is inspired by [23], where Double MHA is presented as a double attention block which captures feature statistics and makes adaptive feature assignment over images.
In this work, this mechanism is used as a combination of two self-attention pooling layers to create utterance-level speaker embeddings. Given a sequence of encoded representations from a CNN, Self MHA first concatenates the context vectors from K head attentions applied over K sub-embedding sequences. An additional self-attention mechanism is then applied over the multi-head context vector. This attention-based pooling summarizes the set of head context vectors into a global speaker representation. This representation is pooled through a weighted average of the head context vectors, where the head weights are produced with the self-attention mechanism. On the one hand, this approach allows the model to attend to different parts of the sequence, capturing at the same time different subsets of encoded representations. On the other hand, the pooling layer allows selecting which head context vectors are the most relevant to produce the global context vector. In comparison with [23], the second pooling layer operates over the head context vectors produced by a MHA instead of the global descriptors produced by a self multi-attention mechanism applied over an image.

2. Proposed Architecture

Our proposed system architecture is illustrated in Figure 1. It uses a CNN-based front-end which takes a set of variable-length mel-spectrogram features and outputs a sequence of speaker representations. These speaker representations are then fed to a Double MHA pooling, which is the main contribution of this work. The Double MHA layer comprises a Self MHA pooling and an additional Self Attention layer that summarizes the information of each head context vector into a unique speaker embedding. The combination of Self MHA pooling together with this Self Head Attention layer provides us with a deeper self-attention pooling mechanism (Figure 2). The speaker embedding obtained from the pooling layer is sent through a set of FC layers to predict the speaker posteriors. This network architecture is trained with the Additive Margin Softmax (AMS) loss [24] as a speaker classifier so as to obtain a speaker embedding extractor.

Figure 1: System architecture.
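To summarize the data flow just described, the following is a minimal PyTorch-style composition skeleton. All module names are ours, the sub-modules are injected as constructor arguments, and nothing here should be read as the authors' exact implementation.

import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """Front-end CNN -> Double MHA pooling -> dense block -> speaker posteriors."""

    def __init__(self, front_end: nn.Module, pooling: nn.Module, dense_head: nn.Module):
        super().__init__()
        self.front_end = front_end      # mel-spectrogram -> sequence of frame-level vectors
        self.pooling = pooling          # sequence -> fixed-length utterance vector
        self.dense_head = dense_head    # utterance vector -> (speaker logits, embedding)

    def forward(self, spec):
        h = self.front_end(spec)        # (batch, T, D)
        c = self.pooling(h)             # (batch, pooled_dim)
        logits, embedding = self.dense_head(c)
        return logits, embedding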
2.1. CNN-Based Front-End

Our feature extractor network is a larger version of the adapted VGG proposed in [9]. This CNN comprises four convolution blocks, each of which contains two concatenated convolutional layers followed by a max pooling with 2×2 stride. Hence, given a spectrogram of N frames, the VGG performs a downsampling that reduces its output to a sequence of N/16 representations. The output of the VGG, h ∈ R^{M × (N/16) × D'}, is a set of M feature maps of dimension (N/16) × D'. These feature maps are concatenated into a unique vector sequence. This reshaped sequence of hidden states can then be defined as h ∈ R^{(N/16) × MD'}, where D = MD' corresponds to the hidden state dimension.

2.2. Self Multi-Head Attention Pooling

The sequence of hidden states output by the front-end feature extractor can be expressed as h = [h_1, h_2, ..., h_N] with h_t ∈ R^D. If we consider K heads for the MHA pooling, each hidden state can be defined as h_t = [h_{t1}, h_{t2}, ..., h_{tK}], where h_{tj} ∈ R^{D/K}. Hence each feature vector is split into a set of sub-feature vectors of size D/K.
In the same way, we also have a trainable parameter u = [u_1, u_2, ..., u_K], where u_j ∈ R^{D/K}. A self-attention operation is then applied over each head of the encoded sequence. The attention weights of each head alignment are defined as:

w_{tj} = \frac{\exp\left( h_{tj}^{T} u_j / \sqrt{d_h} \right)}{\sum_{l=1}^{N} \exp\left( h_{lj}^{T} u_j / \sqrt{d_h} \right)}    (1)

where w_{tj} corresponds to the attention weight of head j on step t of the sequence, and d_h corresponds to the hidden state dimension D/K. If each head corresponds to a subspace of the hidden state, the weight sequence of that head can be considered as a probability density function (pdf) of that subspace's features over the sequence. We then compute a new pooled representation for each head in the same way as vanilla self attention:

c_j = \sum_{t=1}^{N} w_{tj} h_{tj}    (2)

where c_j ∈ R^{D/K} corresponds to the utterance-level representation of head j. The final utterance-level representation is then obtained by concatenating the utterance-level vectors from all heads, c = [c_1, c_2, ..., c_K]. This method allows the network to extract different kinds of information over different regions of the input.
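To make the pooling concrete, the following is a minimal PyTorch-style sketch of Self MHA pooling as defined by Equations (1) and (2). The class and variable names are ours for illustration and do not correspond to the authors' released code; the parameter initialization is an assumption.

import math
import torch
import torch.nn as nn

class SelfMultiHeadAttentionPooling(nn.Module):
    """Self MHA pooling (Eqs. 1-2): each frame-level vector is split into K heads,
    and each head is pooled over time with its own trainable query u_j."""

    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        assert hidden_dim % num_heads == 0, "D must be divisible by K"
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        # u = [u_1 ... u_K], u_j in R^{D/K}; small random init is an assumption
        self.u = nn.Parameter(0.01 * torch.randn(num_heads, self.head_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, D) sequence of encoded hidden states from the front-end
        batch, T, _ = h.shape
        h = h.view(batch, T, self.num_heads, self.head_dim)
        # Eq. (1): scaled dot-product logits h_{tj}^T u_j / sqrt(D/K), softmax over time
        logits = torch.einsum('btkd,kd->btk', h, self.u) / math.sqrt(self.head_dim)
        w = torch.softmax(logits, dim=1)
        # Eq. (2): per-head context vectors c_j = sum_t w_{tj} h_{tj}
        c = torch.einsum('btk,btkd->bkd', w, h)          # (batch, K, D/K)
        return c.reshape(batch, -1)                      # concatenation c = [c_1 ... c_K]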
2.3. Double Multi-Head Attention Pooling

The main disadvantage of Self MHA pooling is that it assumes uniform head relevance: the output context vector is the concatenation of all head context vectors and is used directly as input to the following dense layers. Double MHA does not make this assumption. Instead, each utterance context vector is computed as a different linear combination of the head context vectors. A summarized vector c is defined as a weighted average over the set of head context vectors c_i, where a self-attention mechanism is used to pool the head context vectors and obtain an overall context vector c:

w'_i = \frac{\exp\left( c_i^{T} u' \right)}{\sum_{l=1}^{K} \exp\left( c_l^{T} u' \right)}    (3)

c = \sum_{i=1}^{K} w'_i c_i    (4)

where w'_i corresponds to the attention weight of head i and u' ∈ R^{D/K} is a trainable parameter. The context vector c is thus computed as the weighted average of the context vectors over the heads. With this method, each utterance context vector is created by scaling the information of the most and least relevant heads. Considering the whole pooling layer, Double MHA allows capturing different kinds of speaker patterns in different regions of the input, and at the same time allows weighting the relevance of each of these patterns for each utterance.

Figure 2: An example of Double MHA pooling.

The number of heads used for this pooling defines both the context vector dimension and how the VGG feature maps are grouped. Considering M channels and K heads, each head produces a context vector c_i of dimension D'M/K, which contains a subset of M/K feature maps.
Therefore, as the number of heads grows, Double MHA can consider more subsets of features, while the dimension of the final utterance-level context vector decreases. This implies a trade-off between the number of feature subsets that can be created and how compressed these features are in the context vector subspace.
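A similarly hedged sketch of the second attention stage, Equations (3) and (4), which weights and averages the head context vectors before they reach the dense layers. Names and initialization are again illustrative assumptions; the composition comment at the end reuses the SelfMultiHeadAttentionPooling class sketched in Section 2.2.

import torch
import torch.nn as nn

class HeadAttentionPooling(nn.Module):
    """Second attention stage (Eqs. 3-4): pools the K head context vectors c_i
    (each of dimension D/K) into a single utterance-level vector c."""

    def __init__(self, head_dim: int):
        super().__init__()
        # u' in R^{D/K}; initialization is an assumption
        self.u_prime = nn.Parameter(0.01 * torch.randn(head_dim))

    def forward(self, c_heads: torch.Tensor) -> torch.Tensor:
        # c_heads: (batch, K, D/K) head context vectors before concatenation
        logits = torch.einsum('bkd,d->bk', c_heads, self.u_prime)   # c_i^T u'
        w = torch.softmax(logits, dim=1)                            # Eq. (3)
        return torch.einsum('bk,bkd->bd', w, c_heads)               # Eq. (4), c in R^{D/K}

# Composition with the SelfMultiHeadAttentionPooling sketch from Section 2.2:
#   mha = SelfMultiHeadAttentionPooling(hidden_dim=5120, num_heads=16)
#   head_att = HeadAttentionPooling(head_dim=5120 // 16)
#   c_heads = mha(h).view(h.size(0), 16, 5120 // 16)   # (batch, K, D/K)
#   c = head_att(c_heads)                               # (batch, 320) utterance vector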
2.4. Fully Connected Layers

The utterance-level speaker vector obtained from the pooling layer is fed into a set of four FC layers (Figure 1). Each of the first two FC layers is followed by a batch normalization layer [25] and Rectified Linear Unit (ReLU) activations. A dense layer is adopted for the third FC layer, and the last FC layer corresponds to the speaker classification layer. Since AMS is used to train the network, the third layer is set up without activation and batch normalization, as proposed in [24]. Once the network is trained, we can extract a speaker embedding from one of the intermediate FC layers. Following [26], we consider the second layer as the speaker embedding instead of the third one. The output of this FC layer then corresponds to the speaker representation that will be used for the speaker verification task.
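As an illustration of this dense block, here is a small PyTorch-style sketch under the assumption of equal-width hidden layers; the exact layer dimensions are not fully specified here, so hidden_dim is a placeholder.

import torch
import torch.nn as nn

class DenseHead(nn.Module):
    """Four FC layers on top of the pooled vector: BN + ReLU after the first two,
    a plain linear third (bottleneck) layer without BN or activation, and a final
    speaker classification layer. The embedding is taken from the second layer."""

    def __init__(self, in_dim: int, hidden_dim: int, num_speakers: int):
        super().__init__()
        self.fc1 = nn.Sequential(nn.Linear(in_dim, hidden_dim),
                                 nn.BatchNorm1d(hidden_dim), nn.ReLU())
        self.fc2 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                 nn.BatchNorm1d(hidden_dim), nn.ReLU())
        self.fc3 = nn.Linear(hidden_dim, hidden_dim)    # no BN / activation (AMS setup)
        self.fc4 = nn.Linear(hidden_dim, num_speakers)  # speaker classification layer

    def forward(self, pooled: torch.Tensor):
        embedding = self.fc2(self.fc1(pooled))   # speaker embedding (second FC layer)
        logits = self.fc4(self.fc3(embedding))
        return logits, embedding

Note that with an AMS objective the class weight matrix is usually normalized inside the loss; a hedged sketch of such a loss is given in the experimental setup section.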
3. Experimental Setup
The proposed system has been assessed on the VoxCeleb dataset [27, 7]. VoxCeleb is a large multimedia database that contains more than 1 million utterances from more than 6K celebrities. These utterances are 16 kHz audio chunks extracted from YouTube videos. Models are available at: https://github.com/miquelindia90/DoubleAttentionSpeakerVerification
Table 1:
CNN architecture. In Dim. and Out Dim. refer to the input and output feature maps of the layer. Feat Size refers to the dimension of each of these output feature maps.
Layer    Size  In Dim.  Out Dim.  Stride  Feat Size
conv11   3x3   1        128       1x1     Nx80
conv12   3x3   128      128       1x1     Nx80
mpool1   2x2   -        -         2x2     N/2x40
conv21   3x3   128      256       1x1     N/2x40
conv22   3x3   256      256       1x1     N/2x40
mpool2   2x2   -        -         2x2     N/4x20
conv31   3x3   256      512       1x1     N/4x20
conv32   3x3   512      512       1x1     N/4x20
mpool3   2x2   -        -         2x2     N/8x10
conv41   3x3   512      1024      1x1     N/8x10
conv42   3x3   1024     1024      1x1     N/8x10
mpool4   2x2   -        -         2x2     N/16x5
flatten  -     1024     1         -       N/16x5120
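For reference, a hedged PyTorch-style sketch of the front-end defined in Table 1. Kernel sizes, channel counts, and pooling strides follow the table; the use of ReLU activations and same-padding (which keeps the feature map size constant within each block, consistent with the table) are assumptions on our part.

import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions followed by 2x2 max pooling, as in Table 1."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class VGGFrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(conv_block(1, 128), conv_block(128, 256),
                                    conv_block(256, 512), conv_block(512, 1024))

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, N, 80) log Mel spectrogram
        x = self.blocks(spec)            # (batch, 1024, N/16, 5)
        x = x.permute(0, 2, 1, 3)        # (batch, N/16, 1024, 5)
        return x.flatten(2)              # (batch, N/16, 5120) hidden state sequence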
VoxCeleb has two different versions with several evaluation conditions and protocols. For our experiments, the VoxCeleb1 and VoxCeleb2 development partitions have been used to train both the baselines and the presented approach. No data augmentation has been applied to increase the training data. The performance of these systems has been evaluated on the original VoxCeleb1 test set.

Two different baselines have been considered for comparison with the presented approach. Double MHA pooling has been evaluated against two self-attention-based pooling methods: vanilla Self Attention and Self MHA. In order to evaluate them, these mechanisms have replaced the pooling layer of the system (Figure 1) without modifying any other block or parameter of the network. The speaker embeddings used for the verification tests have been extracted from the same FC layer for each of the pooling methods. Cosine distance has been used to compute the scores between pairs of speaker embeddings.

Figure 3: DET curves for the experiments on the VoxCeleb1 test set verification task.

The proposed network has been trained to classify variable-length speaker utterances. As input features we have used 80-dimensional log Mel spectrograms computed with Hamming analysis windows. The audios have not been filtered with any Voice Activity Detection (VAD) system, and a pre-emphasis with coefficient 0.97 has been applied. The audio features have only been normalized with Cepstral Mean Normalization (CMN). The CNN encoder is then fed with N×80 spectrograms to obtain a sequence of N/16 × 5120 encoded hidden representations. For training we have used batches of audio chunks of N=350 frames, but for testing the whole utterances have been encoded. The setup of the CNN feature extractor can be found in Table 1. For the pooling layer we have tuned the number of heads for both Self MHA and Double MHA. For the presented CNN setup we have considered 8, 16, and 32 heads, which implies head context vectors c_i of dimension 640, 320, and 160, respectively. The last block of the system consists of four consecutive FC layers. The first three dense layers share the same dimension, and the dimension of the last FC layer corresponds to the number of training speaker labels. Batch normalization has been applied only on the first two dense layers, as mentioned in subsection 2.4. The network has been trained with the AMS loss with scale s = 30 and an additive margin m. The Adam optimizer has been used to train all the models with fixed learning rate and weight decay. During training we have used an early-stopping criterion with a patience of 15, where the models have been validated periodically.
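Since the AMS objective is central to the training recipe, here is a minimal sketch of an Additive Margin Softmax loss with the scale s = 30 mentioned above. The margin value m = 0.4 is a placeholder (the value used in this work is not readable here), and the class owns the weight matrix that plays the role of the final classification layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive Margin Softmax: cosine logits scaled by s, with margin m
    subtracted from the target-class cosine before cross-entropy."""

    def __init__(self, emb_dim: int, num_speakers: int, s: float = 30.0, m: float = 0.4):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_speakers, emb_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # (batch, num_speakers)
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        return F.cross_entropy(self.s * (cos - margin), labels)

# Usage (illustrative):
#   criterion = AMSoftmaxLoss(emb_dim=..., num_speakers=..., s=30.0)
#   loss = criterion(embeddings, speaker_labels)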
4. Results
The proposed approach has been evaluated against different attention methods on the VoxCeleb text-independent verification task. Performance is evaluated using the Equal Error Rate (EER) and the Detection Cost Function (DCF) computed with C_{FA} = 1, C_{M} = 1, and a fixed target prior P_T. The results of this task are presented in both Figure 3 and Table 2: DET curves are shown in Figure 3, and both EER and DCF metrics are reported in Table 2. Double MHA is referred to as DMHA in both analyses.

Self Attention pooling has shown the worst results on this task compared to the best tuned configurations of both Self MHA and Double MHA. Compared to Self Attention, Self MHA has shown better results with 16 heads and worse results with both 8 and 32 heads. With 16 heads, Self MHA has achieved a relative EER improvement over Self Attention pooling, while the DCF improved only slightly. With 8 and 32 heads, the EER of Self MHA has degraded with respect to Self Attention. Double MHA with 16 heads has shown better results than both the Self Attention and Self MHA approaches: it achieves a relative EER improvement over Self Attention and a further relative improvement over 16-head Self MHA, and it also obtains the best DCF. If we compare Double MHA and Self MHA with 8 heads, Double MHA is better in terms of DCF but has not improved in terms of EER, which remains the same. Double MHA with 32 heads has shown the worst results compared to both Self MHA and Self Attention, with a 4.01 EER and 0.0032 DCF.

As the results show, the best performance of the MHA-based approaches is achieved with 16 heads. Besides the verification metrics, Table 2 also indicates the head and global context vector dimensions.

Table 2:
Evaluation results of the text-independent verification task on VoxCeleb1.
Approach   Heads  c_i dim  c dim  EER (%)  DCF
Attention  -      -        5120   -        -
MHA        8      640      5120   -        -
MHA        16     320      5120   -        -
MHA        32     160      5120   3.65     -
DMHA       8      640      640    -        -
DMHA       16     320      320    -        -
DMHA       32     160      160    4.01     0.0032

As discussed in subsection 2.3, the c_i dimension in Self MHA, and both the c_i and c dimensions in Double MHA, are inversely proportional to the number of heads. Therefore, there is a trade-off between the number of heads and system performance, which is related to the context vector dimensions. The worst performance with Double MHA is achieved with 32 heads. This setup implies that both the c_i and c dimensions are 160. This value can be considered small compared to current state-of-the-art speaker embeddings, whose dimensions range between 200 and 1500. Therefore, the system performance with 32 heads is worse because the context vector subspace is not big enough to encode all the discriminative speaker information from the CNN output. On the other hand, the larger the number of heads, the more subsets of speaker features can be captured over the CNN encoded sequence. With 8 heads, 640-dimensional head context vectors are extracted, and with 16 heads the head context vectors have 320 dimensions. Both the Self MHA and Double MHA approaches have shown their best results with 16 heads, which implies 320-dimensional head context vectors. Therefore, the CNN output feature maps are most efficiently grouped in subsets of M/K = 64 channels, which correspond to sub-sequences of 320-dimensional embeddings. Considering the set of 16 context vectors pooled in that layer, these representations are efficiently averaged with Double MHA into a unique 320-dimensional utterance-level speaker representation.
5. Conclusion
In this paper we have implemented a Double Multi-Head Attention mechanism to obtain utterance-level speaker embeddings by pooling short-term representations. The proposed pooling layer is composed of a Self Multi-Head Attention pooling and a Self Attention mechanism that summarizes the context vectors of each head into a unique speaker vector. This pooling layer has been tested in a neural network based on a CNN that maps spectrograms into sequences of speaker vectors. These vectors are then input to the proposed pooling layer, whose output activation is connected to a set of dense layers. The network is trained as a speaker classifier, and a bottleneck layer from these fully connected layers is used as the speaker embedding. We have compared this approach with other pooling methods on the text-independent verification task, using the speaker embeddings and applying cosine distance. The presented approach has outperformed both the vanilla Self Attention and the Self Multi-Head Attention poolings.
6. Acknowledgements
This work was supported in part by the Spanish Project DeepVoice (TEC2015-69266-P).

7. References

[1] O. Ghahabi, P. Safari, and J. Hernando, "Deep learning in speaker recognition," in Development and Analysis of Deep Learning Architectures. Springer, 2020, pp. 145-169.
[2] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329-5333.
[3] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, "Spoken language recognition using x-vectors," in Odyssey, 2018, pp. 105-111.
[4] R. Pappagari, T. Wang, J. Villalba, N. Chen, and N. Dehak, "x-vectors meet emotions: A study on dependencies between emotion and speaker recognition," arXiv preprint arXiv:2002.05039, 2020.
[5] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in SLT. IEEE, 2016, pp. 165-170.
[6] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech, 2017, pp. 999-1003.
[7] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in INTERSPEECH, 2018.
[8] ——, "VoxCeleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[9] M. India, P. Safari, and J. Hernando, "Self multi-head attention for speaker recognition," in Interspeech, 2019.
[10] G. Bhattacharya, J. Alam, and P. Kenny, "Deep speaker recognition: Modular or monolithic?" in Proc. Interspeech, 2019, pp. 1143-1147.
[11] J. Zhou, T. Jiang, Z. Li, L. Li, and Q. Hong, "Deep speaker embedding extraction with channel-wise feature responses and additive supervision softmax loss function," Proc. Interspeech 2019, pp. 2883-2887, 2019.
[12] A. Hajavi and A. Etemad, "A deep neural network for short-segment speaker recognition," arXiv preprint arXiv:1907.10420, 2019.
[13] J. S. Chung, A. Nagrani, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, and A. Zisserman, "VoxSRC 2019: The first VoxCeleb speaker recognition challenge," arXiv preprint arXiv:1912.02522, 2019.
[14] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[15] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," arXiv preprint arXiv:1808.00158, 2018.
[16] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu, "Avoiding speaker overfitting in end-to-end DNNs using raw waveform for text-independent speaker verification," extraction, vol. 8, no. 12, pp. 23-24, 2018.
[17] J.-w. Jung, H.-S. Heo, J.-h. Kim, H.-j. Shim, and H.-J. Yu, "RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification," arXiv preprint arXiv:1904.08104, 2019.
[18] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification," in Interspeech, 2018, pp. 3573-3577.
[19] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," arXiv preprint arXiv:1803.10963, 2018.
[20] W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 74-81.
[21] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5791-5795.
[22] Y. Jung, Y. Kim, H. Lim, Y. Choi, and H. Kim, "Spatial pyramid encoding with convex length normalization for text-independent speaker verification," arXiv preprint arXiv:1906.08333, 2019.
[23] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, "A^2-Nets: Double attention networks," in Advances in Neural Information Processing Systems, 2018, pp. 352-361.
[24] Y. Liu, L. He, and J. Liu, "Large margin softmax loss for speaker verification," arXiv preprint arXiv:1904.03479, 2019.
[25] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[26] H. Zeinali, L. Burget, J. Rohdin, T. Stafylakis, and J. H. Cernocky, "How to improve your speaker embeddings extractor in generic toolkits," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6141-6145.
[27] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Interspeech, 2017.