Cross attentive pooling for speaker verification
Seong Min Kye, Yoohwan Kwon, Joon Son Chung
Naver Corporation, Korea Advanced Institute of Science and Technology, Yonsei University
ABSTRACT
The goal of this paper is text-independent speaker verification where utterances come from ‘in the wild’ videos and may contain irrelevant signal. While speaker verification is naturally a pair-wise problem, existing methods to produce the speaker embeddings are instance-wise. In this paper, we propose Cross Attentive Pooling (CAP) that utilizes the context information across the reference-query pair to generate utterance-level embeddings that contain the most discriminative information for the pair-wise matching problem. Experiments are performed on the VoxCeleb dataset in which our method outperforms comparable pooling strategies.
Index Terms: speaker recognition, speaker verification, cross attention.
1. INTRODUCTION
Automatic speaker recognition is an attractive way to verify someone's identity, since the voice of a person is one of the most easily accessible forms of biometric information. Due to this non-invasive nature and the technological progress, speaker recognition has recently gained considerable attention both in industry and in research.

While the definition of speaker recognition encompasses both identification and verification, the latter has more practical applications; for example, the use of speaker verification is becoming popular in call centres and in AI speakers. Unlike closed-set identification, open-set verification aims to verify the identity of speakers unseen during training. Therefore, speaker verification is naturally a metric learning problem in which voices must be mapped to representations in a discriminative embedding space.

While mainstream literature in the field has learnt speaker embeddings via the classification loss [1, 2, 3, 4], such objective functions are not designed to optimise embedding similarity. More recent works [5, 6, 7, 8, 9, 10, 11] have used additive margin variants of the softmax function [12, 13, 14] to enforce inter-class separation, which has been shown to improve verification performance.

Since open-set verification addresses identities unseen during training, it can be formulated as a few-shot learning problem in which the network should recognise unseen classes with limited examples. Prototypical networks [15] have been proposed in which the training mimics the few-shot learning scenario, and this strategy has recently been shown to achieve competitive performance in speaker verification [16, 17, 18, 19, 20].

In order to train networks to optimise the similarity metric, the frame-level representations produced must first be aggregated into an utterance-level embedding. A naïve way to produce an utterance-level embedding is to take a uniformly weighted average of the frame-level representations, which is referred to as Temporal Average Pooling (TAP) in the existing literature. Self-Attentive Pooling (SAP) [21] has been proposed to pay more attention to the frames that are more discriminative for verification. However, the instance-level self-attention finds the features that are more discriminative for speaker verification in general (i.e. across the whole training set) rather than for the specific examples in the support set.

In few-shot learning, the cross attention network (CAN) [22] has recently been proposed to select attention based on unseen target classes, by attending to the parts of the input image that are relevant and discriminative to the examples in the support set. This idea is applicable to speaker verification, since the features that are discriminative for comparing an utterance against one class (speaker) in the support set may be different to the features for comparing it to another class.

To this end, we propose cross attentive pooling (CAP), which computes the attention with reference to the example in the support set in order to effectively aggregate frame-level information into an utterance-level embedding. In this way, the network is able to identify and focus on the parts of the utterance that provide characterising features for the particular class in the support set. This is similar to how humans tend to look for common characterising features between a pair of samples when recognising instances from unseen classes, whether these are speakers or visual objects.
Unlike instance-level pooling, the proposed attention module takes full advantage of the pair-wise nature of the verification task, by modelling the relevance between the class (prototype) feature and the query feature.

The effectiveness of our method is demonstrated on the popular VoxCeleb dataset [23], on which we report improvements over existing pooling methods.

2. METHODS

2.1. Few-shot learning framework

We use a few-shot learning framework in order to train the embeddings for speaker recognition. In particular, our implementation is based on prototypical networks [15], which have been shown to perform well in speaker verification [17, 18, 19].
Batch formation.
Each mini-batch contains a support set $S$ and a query set $Q$. A mini-batch contains $M$ utterances from each of $N$ different speakers. We use a single utterance for each speaker in the support set $S = \{(x_i, y_i)\}_{i=1}^{N \times 1}$ and the rest of the utterances ($2 \le i \le M$) in the query set $Q = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{N \times (M-1)}$, where $y, \tilde{y} \in \{1, \dots, N\}$ is the class label.
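To make the batch structure concrete, a minimal PyTorch sketch of how such an episodic mini-batch could be split into support and query sets is shown below; the (N, M, D) tensor layout and the helper name are illustrative assumptions rather than details from the paper.

```python
import torch

def split_episode(feats: torch.Tensor):
    """Split an episodic mini-batch into support and query sets.

    feats: utterance embeddings of shape (N, M, D), i.e. M utterances
    for each of N speakers (layout assumed for illustration).
    """
    N, M, D = feats.shape
    support = feats[:, 0, :]                 # one utterance per speaker -> prototypes, (N, D)
    query = feats[:, 1:, :].reshape(-1, D)   # remaining M-1 utterances per speaker, (N*(M-1), D)
    # query at position i*(M-1)+j belongs to speaker i
    labels = torch.arange(N).repeat_interleave(M - 1)
    return support, query, labels
```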
Training objective. Since the support set is formed with a single utterance $x$, the prototype (or centroid) is the same as the support utterance for each speaker $y$:

$$P_y = x \quad (1)$$

The cross-entropy loss with a log-softmax function is used to minimise the distance between segments from the same speaker and maximise the distance between different speakers:

$$L_{NP} = -\frac{1}{|Q|} \sum_{(\tilde{x}, \tilde{y}) \in Q} \log \frac{e^{d(\tilde{x}, P_{\tilde{y}})}}{\sum_{y=1}^{N} e^{d(\tilde{x}, P_y)}} \quad (2)$$

We use the same distance metric as [16], where the distance function is the cosine similarity between the prototype and the query, scaled by the norm of the query embedding:

$$d(\tilde{x}, P_{\tilde{y}}) = \frac{\tilde{x}^{T} P_{\tilde{y}}}{\lVert P_{\tilde{y}} \rVert} = \lVert \tilde{x} \rVert \cdot \cos(\tilde{x}, P_{\tilde{y}}) \quad (3)$$

We refer to the prototypical loss with this similarity function as the Normalised Prototypical (NP) loss in the rest of this paper.

Kye et al. [16] have used episodic training together with a global classification loss in order to make speaker embeddings more discriminative. Global classification is applied to both the support and the query sets. By incorporating the softmax classification loss, we can train the embeddings to be discriminative over all classes, as opposed to only the classes in the mini-batch. The final objective is the sum of the NP and softmax cross-entropy losses with equal weighting.
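Under the same assumptions, a short sketch of the NP loss of Eqs. (1)-(3) could look as follows; note that only the prototypes are length-normalised, so the scale of the query embedding is retained as in Eq. (3).

```python
import torch
import torch.nn.functional as F

def np_loss(support: torch.Tensor, query: torch.Tensor, labels: torch.Tensor):
    """Normalised Prototypical (NP) loss, Eqs. (2)-(3).

    support: (N, D) prototypes (one utterance per speaker, Eq. (1)).
    query:   (Nq, D) query embeddings.
    labels:  (Nq,) speaker index of each query in [0, N).
    """
    # d(x~, P_y) = x~^T P_y / ||P_y||: cosine similarity scaled by the query norm
    proto = F.normalize(support, dim=-1)   # divide each prototype by its norm
    logits = query @ proto.t()             # (Nq, N) distances to all prototypes
    return F.cross_entropy(logits, labels) # cross-entropy with log-softmax, Eq. (2)
```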
2.2. Instance-wise aggregation

An ideal utterance-level embedding should be invariant to temporal position, but not to frequency. Since 2D convolutional neural networks [24, 25] produce 2D activation maps, [1] has proposed aggregation layers that are fully connected only along the frequency axis. This produces a $C \times T$ feature map before the pooling layers, which are described in the following sections.

Temporal Average Pooling (TAP).
The TAP layer simply takes the mean of the features along the time domain:

$$e = \frac{1}{T} \sum_{t=1}^{T} x_t \quad (4)$$
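A one-line sketch of TAP over a frame-level feature map (assumed here to have shape channels × time) is given below.

```python
import torch

def temporal_average_pooling(x: torch.Tensor) -> torch.Tensor:
    """TAP, Eq. (4): uniform mean over the time axis.

    x: frame-level features of shape (C, T) -> returns an embedding of shape (C,).
    """
    return x.mean(dim=-1)
```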
Self-Attentive Pooling (SAP). In contrast to the TAP layer, which pools the features over time with uniform weights, the self-attentive pooling (SAP) layer [21, 26, 27] pays attention to the frames that are more informative for utterance-level speaker recognition.

Frame-level representations $x_t$ are first mapped to hidden representations $h_t$ using a single-layer perceptron with learnable weights $W$ and $b$:

$$h_t = \tanh(W x_t + b) \quad (5)$$

The similarity between the hidden vectors and a learnable context vector $\mu$ is computed, which represents the relative importance of each hidden feature. The context vector can be seen as a high-level representation of what makes the frames informative for speaker recognition:

$$w_t = \frac{\exp(h_t^T \mu)}{\sum_{t=1}^{T} \exp(h_t^T \mu)} \quad (6)$$

The utterance-level embedding $e$ can then be obtained as a weighted sum of the frame-level representations:

$$e = \sum_{t=1}^{T} w_t x_t \quad (7)$$

2.3. Cross attentive pooling

Fig. 1. The procedure of our proposed Cross-Attentive Pooling.
Fig. 2. Attention layer.

Unlike traditional instance-wise aggregation, our proposed method aggregates frame-level features utilizing information from the frame-level features of the other utterance. In order to match the objective in training and testing, we use prototypical networks [15], a metric-based meta-learning framework. In this framework, we train our cross attentive pooling (CAP) using pairs of support and query sets. In the test scenario, the support set and query set correspond to the enrollment and test utterances, respectively.

For every pair of utterances from the query and the support sets, we extract frame-level representations $s = \{s_1, s_2, \dots, s_{T_s}\}$ and $q = \{q_1, q_2, \dots, q_{T_q}\}$. Then, with the meta-projection layer $g_\phi(\cdot)$, we extract hidden features from the frame-level representations. This non-linear projection allows the model to adapt quickly to arbitrary frames, so that the similarity of a frame pair can be well measured. The layer consists of a single-layer perceptron followed by a ReLU activation function:

$$g_\phi(\cdot) = \max\left(0, W(\cdot) + b\right) \quad (8)$$

After the meta-projection layer, we obtain $S = \{S_i\}_{i=1}^{T_s}$ and $Q = \{Q_i\}_{i=1}^{T_q}$ as hidden representations for every frame, where $S_i$ and $Q_i$ denote $g_\phi(s_i)$ and $g_\phi(q_i)$, respectively.
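A minimal sketch of the meta-projection layer of Eq. (8) is given below; the input and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MetaProjection(nn.Module):
    """Eq. (8): single-layer perceptron with ReLU, applied to every frame."""

    def __init__(self, in_dim: int = 512, hidden_dim: int = 128):
        # in_dim and hidden_dim are placeholders chosen for illustration only
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, C) frame-level features -> (T, C_h) hidden features
        return torch.relu(self.fc(frames))
```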
Correlation matrix. The correlation matrix $R$ summarises the similarity for every possible pair of frames. $R^S \in \mathbb{R}^{T_s \times T_q}$ is computed as:

$$R^S_{i,j} = \left(\frac{S_i}{\lVert S_i \rVert}\right)^{T} \left(\frac{Q_j}{\lVert Q_j \rVert}\right) \quad (9)$$

Note that $R^Q = (R^S)^T$.
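The correlation matrix of Eq. (9) can be sketched in a few lines, assuming the projected frames are stored as (time, channel) matrices.

```python
import torch
import torch.nn.functional as F

def correlation_matrix(S: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Eq. (9): cosine similarity between every support/query frame pair.

    S: (T_s, C_h) projected support frames, Q: (T_q, C_h) projected query frames.
    Returns R^S of shape (T_s, T_q); R^Q is simply its transpose.
    """
    return F.normalize(S, dim=-1) @ F.normalize(Q, dim=-1).t()
```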
Pair-adaptive attention. In order to obtain the pair-adaptive context vector, we average the correlation matrix along its own time axis as follows:

$$\mu_s = \frac{1}{T_s} \sum_{i=1}^{T_s} R^S_{i,*} \quad (10)$$

where $\mu_s \in \mathbb{R}^{T_q}$ and $R^S_{i,*}$ denotes the $i$-th row vector. Each row vector holds the similarity to all frames of the other utterance. Therefore, the average correlation with each frame of the other utterance is represented by $\mu_s$, which is used as the context vector to determine how similar each frame is to the other utterance.

The attention weights are given by the following equation for every utterance:

$$w^s_t = \frac{\exp\left((\mu_s^{T} R^S_{t,*}) / \tau\right)}{\sum_{i=1}^{T_s} \exp\left((\mu_s^{T} R^S_{i,*}) / \tau\right)} \quad (11)$$

where $\tau$ is a temperature scaling parameter that sharpens the attention distribution.

$$e_s = \frac{1}{T_s} \sum_{t=1}^{T_s} (1 + w^s_t)\, s_t \quad (12)$$

As done in Hou et al. [22], we use a residual attention mechanism to obtain the utterance-level feature. For the other utterance, the utterance-level feature $e_q$ of $q$ can be obtained in the same way.
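Putting Eqs. (9)-(12) together, a possible sketch of the pair-adaptive attention and residual pooling is shown below; the temperature value is a placeholder, and the (time, channel) frame layout follows the previous sketches.

```python
import torch
import torch.nn.functional as F

def cap_pool(s, q, S, Q, tau: float = 0.1):
    """Cross-attentive pooling for one (support, query) utterance pair.

    s, q: raw frame-level features, shapes (T_s, C) and (T_q, C).
    S, Q: meta-projected hidden frames, shapes (T_s, C_h) and (T_q, C_h).
    tau:  temperature; the default here is an illustrative placeholder.
    """
    # Eq. (9): correlation matrices between L2-normalised hidden frames.
    R_s = F.normalize(S, dim=-1) @ F.normalize(Q, dim=-1).t()   # (T_s, T_q)
    R_q = R_s.t()                                               # (T_q, T_s)

    def attend(frames, R):
        mu = R.mean(dim=0)                    # Eq. (10): pair-adaptive context vector
        w = F.softmax(R @ mu / tau, dim=0)    # Eq. (11): attention over this utterance's frames
        return ((1.0 + w).unsqueeze(-1) * frames).mean(dim=0)  # Eq. (12): residual pooling

    return attend(s, R_s), attend(q, R_q)     # (e_s, e_q)
```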
3. EXPERIMENTS

3.1. Input representations
During training, we randomly extract fixed-length 2-second temporal segments from each utterance. Spectrograms are extracted with a Hamming window of width 25 ms and step 10 ms. 40-dimensional Mel filterbanks are used as the input to the network. Mean and variance normalisation (MVN) is performed with instance normalisation [28]. Since the VoxCeleb dataset contains continuous speech, voice activity detection (VAD) is not used during training and testing.
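As an illustration of this front-end, a possible torchaudio-based pipeline is sketched below; the 16 kHz sample rate, the use of torchaudio and the log compression are our assumptions, while the window, hop and filterbank sizes follow the text.

```python
import torch
import torchaudio

# 25 ms window and 10 ms hop at an assumed 16 kHz sample rate -> 400 / 160 samples.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400, hop_length=160,
    n_mels=40, window_fn=torch.hamming_window)

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) -> (40, T) normalised (log-)Mel filterbank."""
    x = torch.log(mel(waveform) + 1e-6).squeeze(0)   # (40, T)
    # Instance normalisation: per-utterance mean/variance normalisation of each band.
    return (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + 1e-5)
```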
Experiments are performed using the Fast ResNet-34 architecture introduced in [19]. Residual networks [25] are widely used in image recognition and have recently been applied to speaker recognition [6, 1, 29]. Fast ResNet-34 is the same as the original ResNet with 34 layers, except with only one-quarter of the channels in each residual block in order to reduce computational cost. The model has only 1.4 million parameters compared to 22 million for the standard ResNet-34, and minimises the computation cost by reducing the size of the activation maps early in the network. The network architecture is given in Table 1.
Table 1. Fast ResNet-34 architecture. ReLU and batch norm layers are not shown. Each row specifies the convolutional filters, their sizes and strides. The output from the fully connected layer is ingested by the pooling layers.

layer name    filters
conv1         conv, stride 2; max pool, stride 2
conv2         residual blocks, stride 1
conv3         residual blocks, stride 2
conv4         residual blocks, stride 2
conv5         residual blocks, stride 2
pool
aggregation   TAP or SAP or CAP
fc            FCN, 512

The networks are trained on the development set of VoxCeleb2 [29] and tested on the original test set of VoxCeleb1 [1]. Note that there is no overlap between the development set of the VoxCeleb2 dataset and the VoxCeleb1 dataset.
Training.
Our implementation is based on the PyTorch framework [30]. The models are trained on a single NVIDIA V100 GPU with 32 GB of memory. The networks are trained with the Adam optimizer, with the initial learning rate decayed every 10 epochs. We use a fixed batch size of 200 for all experiments. The networks take 2 days to train using a single GPU.
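A minimal sketch of this optimisation setup is shown below; the learning rate and decay factor are placeholders, since their exact values are not given here.

```python
import torch

def build_optimizer(model: torch.nn.Module):
    # Adam optimizer; lr and gamma are illustrative placeholders, not the paper's values.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Decay the learning rate every 10 epochs, as described in the text.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    return optimizer, scheduler
```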
Data augmentation. Aside from taking random 2-second segments, no data augmentation is performed during training or testing.
We report two performance metrics: (1) the Equal Error Rate (EER), which is the rate at which both acceptance and rejection errors are equal; and (2) the minimum of the detection cost function used by the NIST SRE [31] and the VoxCeleb Speaker Recognition Challenge (VoxSRC) evaluations. In order to compute the EER, we sample ten 3.5-second speech segments at regular time intervals from each utterance and compute the mean of 10 × 10 = 100 distances from all possible combinations for each pair. This protocol is in line with that used by [29, 20]. The parameters $C_{miss} = 1$, $C_{fa} = 1$ and the $P_{target}$ value used for the cost function are the same as those used in the VoxSRC.
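A sketch of this scoring protocol, using cosine similarity between every combination of segment embeddings, might look as follows; the embedding extraction itself is assumed to happen elsewhere.

```python
import torch
import torch.nn.functional as F

def trial_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Mean similarity over all segment combinations of one trial.

    emb_a, emb_b: embeddings of the 10 segments sampled from each
    utterance, shape (10, D) each -> 10 x 10 = 100 pairwise scores.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    return (a @ b.t()).mean().item()   # average over the 100 segment pairs
```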
Table 2. Comparison with various aggregation methods. † Note that [16] uses the same ResNet-34 network but with twice as many filters in all layers. NP: Normalised Prototypical, AP: Angular Prototypical, TAP: Temporal Average Pooling, SAP: Self-Attentive Pooling, CAP: Cross-Attentive Pooling.

Loss                 Aggregation   MinDCF   EER (%)
AP [19]              TAP           -        2.22
NP + Softmax [16]†   TAP           -        2.08
NP + Softmax         TAP           0.164    2.13
NP + Softmax         SAP           0.161    2.08
NP + Softmax         CAP
The results are given in Table 2. The baseline results are in line with those reported by previous work using comparable methods and architectures. Cross-Attentive Pooling outperforms existing methods on the popular VoxCeleb dataset, and by a significant margin using the MinDCF measure. It should be noted that the result outperforms all existing work on the dataset that uses a model size similar to ours (1.4 million parameters).
4. CONCLUSION
In this paper, we presented a pair-wise cross attentive pooling method for speaker verification. In contrast to instance-based methods, the pair-wise strategy benefits from contextual information by looking at the relevant parts of the speech pair. The pair-wise pooling method is applicable not only to the prototypical framework, but also to other metric learning objectives such as the contrastive loss.
Acknowledgements.
We would like to thank Bong-Jin Lee and Icksang Han for helpful discussions.

5. REFERENCES

[1] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in Interspeech, 2017.
[2] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003.
[3] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP. IEEE, 2018, pp. 5329–5333.
[4] Yoohwan Kwon, Soo-Whan Chung, and Hong-Goo Kang, “Intra-class variation reduction of speaker representation in disentanglement framework,” arXiv preprint arXiv:2008.01348, 2020.
[5] Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, and Hoirin Kim, “Improving multi-scale aggregation using feature pyramid module for robust speaker verification of variable-duration utterances,” in Interspeech, 2020.
[6] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in Proc. ICASSP, 2019.
[7] Mahdi Hajibabaei and Dengxin Dai, “Unified hypersphere embedding for speaker recognition,” arXiv preprint arXiv:1807.08312, 2018.
[8] Yi Liu, Liang He, and Jia Liu, “Large margin softmax loss for speaker verification,” in Interspeech, 2019.
[9] Daniel Garcia-Romero, David Snyder, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, “X-vector DNN refinement with full-length recordings for speaker recognition,” in Interspeech, 2019, pp. 1493–1496.
[10] Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, and Oldřich Plchot, “BUT system description to VoxCeleb Speaker Recognition Challenge 2019,” arXiv preprint arXiv:1910.12592, 2019.
[11] Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, and Kai Yu, “Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019.
[12] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
[13] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu, “CosFace: Large margin cosine loss for deep face recognition,” in Proc. CVPR, 2018, pp. 5265–5274.
[14] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, “ArcFace: Additive angular margin loss for deep face recognition,” in Proc. CVPR, 2019, pp. 4690–4699.
[15] Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” in NeurIPS, 2017, pp. 4077–4087.
[16] Seong Min Kye, Youngmoon Jung, Hae Beom Lee, Sung Ju Hwang, and Hoirin Kim, “Meta-learning for short utterance speaker recognition with imbalance length pairs,” in Interspeech, 2020.
[17] Jixuan Wang, Kuan-Chieh Wang, Marc T. Law, Frank Rudzicz, and Michael Brudno, “Centroid-based deep metric learning for speaker recognition,” in Proc. ICASSP. IEEE, 2019, pp. 3652–3656.
[18] Prashant Anand, Ajeet Kumar Singh, Siddharth Srivastava, and Brejesh Lall, “Few shot speaker recognition using deep neural networks,” arXiv preprint arXiv:1904.08775, 2019.
[19] Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han, “In defence of metric learning for speaker recognition,” in Interspeech, 2020.
[20] Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, and Joon Son Chung, “Augmentation adversarial training for unsupervised speaker recognition,” arXiv preprint arXiv:2007.12085, 2020.
[21] Weicheng Cai, Jinkun Chen, and Ming Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Speaker Odyssey, 2018.
[22] Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen, “Cross attention network for few-shot classification,” in NeurIPS, 2019, pp. 4003–4014.
[23] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech and Language, 2019.
[24] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in Proc. BMVC, 2014.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
[26] Gautam Bhattacharya, Md Jahangir Alam, and Patrick Kenny, “Deep speaker embeddings for short-duration speaker verification,” in Interspeech, 2017, pp. 1517–1521.
[27] F. A. Rezaur Rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno, and Li Wan, “Attention-based models for text-dependent speaker verification,” in Proc. ICASSP. IEEE, 2018, pp. 5359–5363.
[28] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
[29] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep speaker recognition,” in Interspeech, 2018.
[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al., “PyTorch: An imperative style, high-performance deep learning library,” in NeurIPS, 2019, pp. 8024–8035.
[31] NIST 2018 Speaker Recognition Evaluation Plan, 2018 (accessed 31 July 2020).