Frequency and temporal convolutional attention for text-independent speaker recognition
Sarthak Yadav, Atul Rai
Staqu Technologies, India
ABSTRACT
The majority of recent approaches to text-independent speaker recognition apply attention or similar techniques to aggregate frame-level feature descriptors generated by a deep neural network (DNN) front-end. In this paper, we propose methods of convolutional attention for independently modelling temporal and frequency information in a convolutional neural network (CNN) based front-end. Our system utilizes convolutional block attention modules (CBAMs) [1], appropriately modified to accommodate spectrogram inputs. The proposed CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb [2, 3] speaker verification benchmark. Our best model achieves an equal error rate of 2.031% on the VoxCeleb1 test set, a considerable improvement over comparable state-of-the-art results. For a more thorough assessment of the effects of frequency and temporal attention in real-world conditions, we conduct ablation experiments by randomly dropping frequency bins and temporal frames from the input spectrograms, concluding that simultaneously modelling temporal and frequency attention, rather than modelling either entity alone, translates to better real-world performance.

Index Terms: convolutional attention, speaker verification, speaker recognition, CNNs, deep learning
1. INTRODUCTION
The majority of recent strides in text-independent speaker recognition can be ascribed to deep neural network (DNN) based speaker embeddings, which have far surpassed conventional state-of-the-art systems such as the i-vector+PLDA framework. End-to-end deep learning based speaker recognition systems usually comprise two components: (i) a DNN front-end for the extraction of frame-level features; and (ii) temporal aggregation of these frame-level features into an utterance-level embedding. The majority of recent works utilize convolutional neural network (CNN) based front-end models for extracting frame-level feature descriptors from spectrogram inputs.
This research is funded by Staqu Technologies, India.
Although sub-optimal, since it does not differentiate between frames on the basis of content, temporal averaging is amongst the most frequently used techniques for aggregation of frame-level features [2, 3, 4]. A number of recent works have proposed statistical or dictionary-based aggregation methods to mitigate this problem. [5] proposed the statistics pooling layer, which combines mean and standard deviation statistics for weighted aggregation of temporal frames. More recently, [6] proposed time-distributed voting (TDV) for aggregating features extracted by their UtterIdNet front-end in short-segment speaker verification, especially at sub-second durations. [7] proposed dictionary-based NetVLAD or GhostVLAD [8] aggregation of temporal features, using a 34-layer ResNet based front-end for feature extraction. Numerous recent works [9, 10, 11, 12] have proposed attention-based techniques for aggregating frame-level feature descriptors, assigning greater importance to the more discriminative frames. A prominent attention mechanism in the domain of computer vision is convolutional attention [1, 13], which facilitates modelling of spatial and channel attention throughout the entire CNN feature extraction network.

In this paper, we propose methods of convolutional attention based on the convolutional block attention module (CBAM) [1] for speaker verification. The main contributions of this work are two-fold: (i) we propose convolutional attention modules based on CBAM for modelling frequency and temporal attention, viz. f-CBAM and t-CBAM, along with an equal-weighted composite module for capturing both frequency and temporal attention, called ft-CBAM; and (ii) we conduct ablation experiments for a more thorough assessment of the proposed attention modules and their performance under real-world conditions, concluding that simultaneously modelling temporal and frequency attention, rather than modelling either entity alone, translates to better real-world performance.
2. RELATED WORKS
Attention mechanisms have led to significant advances across computer vision, spoken language understanding and natural language processing, increasing the modelling capacity of deep neural networks by concentrating on crucial features and suppressing unimportant ones. For speaker recognition, [9, 10] utilize self-attention for aggregating frame-level features. [11] combined an attention mechanism with statistics pooling [5] to propose attentive statistics pooling. Most recently, [12] employed the idea of multi-head attention [14] for feature aggregation, outperforming an i-vector+PLDA baseline by 58% (relative). However, by applying attention or similar techniques only to the feature descriptors generated by the DNN front-end, and not throughout the front-end model, the majority of recent works (i) do not fully utilize the representation power of DNN front-end models; and (ii) implicitly model temporal attention alone in the process. As opposed to the methods mentioned above, the proposed modules apply attention inside the feature extraction module, innately improving the representation capabilities of the model.

Recently, [15] proposed the usage of gated convolutional neural networks (GCNNs) for speaker recognition. Matched with a gated-attention pooling method for frame-level feature aggregation, they evaluate the performance of a GCNN in an x-vector [16] system on the SRE16 and SRE18 datasets. In comparison, we propose add-on modules that explicitly model frequency and temporal attention. [17] proposed an encoder-decoder style attention module similar to [13] for extracting spatial and channel attention for automatic speech recognition in noisy conditions. In contrast, we propose convolutional attention modules based on [1] that model frequency and temporal attention along with channel attention, and which drastically outperform a spatial attention baseline for speaker verification.
Recently, [1] proposed a new network module, named the "convolutional block attention module" (CBAM), which sequentially applies channel attention and spatial attention submodules to the input feature maps. CBAM comprises two components, viz. the channel attention module and the spatial attention module. The overall attention process can be summarized by the following equations:

$F' = M_c(F) \otimes F$,   (1)

$F'' = M_s(F') \otimes F'$,   (2)

where $\otimes$ denotes element-wise multiplication, $F$ is the input feature map, $F''$ is the final output of the CBAM module, and $M_c$ and $M_s$ denote the channel and spatial attention operations, respectively. The channel attention module exploits the inter-channel relationship of features and generates a 1-D channel attention map by squeezing the spatial dimensions of the input feature map via max-pooling and average-pooling, followed by projection through a shared MLP layer. The spatial attention module utilizes the inter-spatial relationship of features, focusing on the spatial location of objects of interest. It applies and concatenates the outputs of average-pooling and max-pooling operations along the channel axis, which generates an efficient feature descriptor, followed by a 7x7 convolution layer.
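For concreteness, the following is a minimal PyTorch sketch of a CBAM-style module implementing Eqs. (1) and (2). The class and variable names are ours, and the reduction ratio of 16 is CBAM's default choice, not a value from this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention, Eq. (1): squeeze the spatial dims by average- and
    max-pooling, project through a shared MLP, and gate the channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))          # shared MLP on avg-pooled
        mx = self.mlp(x.amax(dim=(2, 3)))           # ... and max-pooled squeeze
        scale = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * scale                            # F' = M_c(F) * F

class SpatialAttention(nn.Module):
    """Spatial attention, Eq. (2): pool across channels, concatenate, 7x7 conv."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                           # x: (B, C, H, W)
        desc = torch.cat([x.mean(dim=1, keepdim=True),
                          x.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, H, W)
        return x * torch.sigmoid(self.conv(desc))   # F'' = M_s(F') * F'

class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel, self.spatial = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```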
However, unlike computer vision, where the modality represents highly correlated points in space and the axes represent the spatial location of an object in a cartesian coordinate system, the axes of a spectrogram represent entirely different domains: frequency and time. This disconnect between the entities represented by the axes of the two modalities' feature spaces necessitates targeted convolutional attention modules, since the preconditions that existing methods of convolutional attention require to model attention effectively might no longer hold in the speech domain.

3. PROPOSED APPROACH
The channel attention module (Eq. 1) extracts general information regarding channel importance in the input feature map, and is used as-is. We propose appropriate changes to the spatial attention submodule for modelling frequency and temporal attention in spectrogram inputs, viz. f-CBAM and t-CBAM, respectively. Hence, the input to our proposed modules is $F'$ (Eq. 1), such that $F' \in \mathbb{R}^{C \times H \times T}$, where $C$ denotes the number of input channels, and $H$ and $T$ denote the dimensions along the frequency and temporal axes, respectively.

f-CBAM: For modelling frequency attention, we need to limit the receptive field of the attention module to focus only on the y-axis (frequency axis) of the input.
Fig. 1. Proposed f-CBAM module. The z-axis represents the temporal axis (pictorially represented using dimensions > 1).

We aggregate temporal information by averaging the input feature map $F'$ along the x-axis to generate an efficient feature descriptor $F_{freq} \in \mathbb{R}^{C \times H \times 1}$, which essentially assigns equal statistical importance to each temporal frame:

$F_{freq} = \mathrm{AvgPool}_{1 \times T}(F')$,   (3)

where $\mathrm{AvgPool}_{1 \times T}$ represents the average pooling operation with a kernel of size $1 \times T$ over the input feature map. Similar to the spatial attention submodule, we then aggregate channel information by generating two feature maps, $F^{f}_{avg}, F^{f}_{max} \in \mathbb{R}^{1 \times H \times 1}$, denoting average and max pooling operations applied across the channel dimension of $F_{freq}$, and concatenate them. Finally, on this concatenated feature descriptor, we apply a rectangular 7x1 convolution kernel to generate a frequency attention map $M_{freq}(F') \in \mathbb{R}^{H \times 1}$, where $H$ denotes the total number of frequency bins in the input feature $F'$:

$M_{freq}(F') = \sigma(f^{7 \times 1}([F^{f}_{avg}; F^{f}_{max}]))$.   (4)

Here, $\sigma$ denotes the sigmoid function and $f^{7 \times 1}$ represents a convolution operation with a rectangular $7 \times 1$ kernel. $M_{freq}(F')$ is then broadcast along the temporal dimension onto the original input feature map $F'$.

t-CBAM: t-CBAM follows a procedure similar to f-CBAM for modelling temporal attention, albeit limiting the receptive field of the attention module to the temporal axis, i.e. the x-axis:

$F_{temp} = \mathrm{AvgPool}_{H \times 1}(F')$,   (5)

$M_{temp}(F') = \sigma(f^{1 \times 7}([F^{t}_{avg}; F^{t}_{max}]))$,   (6)

where $F_{temp} \in \mathbb{R}^{C \times 1 \times T}$ and $F^{t}_{avg}, F^{t}_{max} \in \mathbb{R}^{1 \times 1 \times T}$.

ft-CBAM: ft-CBAM comprises f-CBAM and t-CBAM applied in parallel on the input feature map; the feature maps generated by the two are then averaged. ft-CBAM can be seen as a special case of the original spatial CBAM, with the $7 \times 7$ convolution filter of the latter represented by two independent $7 \times 1$ and $1 \times 7$ operations.

Front-end: We propose a modified 50-layer pre-activation ResNet [18], henceforth denoted as PRN-50v2, as our CNN front-end to encode spectrogram inputs of arbitrary length (Table 1). By changing the order of layers in the residual block to BN-ReLU-Conv, pre-activation ResNets improve the ease of optimization as well as generalization performance over comparable ResNet [19] counterparts.
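Below is a minimal PyTorch sketch of the modified spatial submodules, under the shapes defined above (inputs of shape (B, C, H, T)); module names are ours. In the full f-/t-/ft-CBAM, these follow the unchanged channel attention of Eq. (1). Note that averaging the two attended feature maps is algebraically identical to applying the average of the two attention maps, which is what the ft-CBAM sketch does.

```python
import torch
import torch.nn as nn

class FreqCBAM(nn.Module):
    """f-CBAM spatial submodule, Eqs. (3)-(4): average over time, pool over
    channels, 7x1 convolution; returns M_freq, broadcastable along time."""
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, (kernel, 1), padding=(kernel // 2, 0))

    def forward(self, x):                         # x = F': (B, C, H, T)
        f = x.mean(dim=3, keepdim=True)           # AvgPool_{1xT}: (B, C, H, 1)
        desc = torch.cat([f.mean(dim=1, keepdim=True),
                          f.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, H, 1)
        return torch.sigmoid(self.conv(desc))     # M_freq: (B, 1, H, 1)

class TempCBAM(nn.Module):
    """t-CBAM spatial submodule, Eqs. (5)-(6): average over frequency, pool
    over channels, 1x7 convolution; returns M_temp."""
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, (1, kernel), padding=(0, kernel // 2))

    def forward(self, x):
        t = x.mean(dim=2, keepdim=True)           # AvgPool_{Hx1}: (B, C, 1, T)
        desc = torch.cat([t.mean(dim=1, keepdim=True),
                          t.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, 1, T)
        return torch.sigmoid(self.conv(desc))     # M_temp: (B, 1, 1, T)

class FreqTempCBAM(nn.Module):
    """ft-CBAM: f- and t-attention in parallel, equal-weighted average."""
    def __init__(self):
        super().__init__()
        self.freq, self.temp = FreqCBAM(), TempCBAM()

    def forward(self, x):
        m = 0.5 * (self.freq(x) + self.temp(x))   # broadcasts to (B, 1, H, T)
        return x * m
```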
Attention:
Wherever applicable, the appropriate CBAM module is integrated at the end of each residual block in the proposed front-end module.
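As an illustration, the following sketch shows a pre-activation bottleneck block with the attention module applied to the residual branch output before the residual addition, mirroring the block-level placement used in [1]; the exact shortcut handling in PRN-50v2 is our assumption.

```python
import torch.nn as nn

class PreActBottleneck(nn.Module):
    """Pre-activation bottleneck (BN-ReLU-Conv ordering [18]) with an optional
    attention module applied at the end of the block, before the residual
    addition. `attention` is any module mapping (B, C_out, H, T) to the same
    shape, e.g. CBAM or FreqTempCBAM from the sketches above."""
    def __init__(self, c_in, c_mid, c_out, stride=1, attention=None):
        super().__init__()
        def bn_relu_conv(ci, co, k, s, p):
            return nn.Sequential(nn.BatchNorm2d(ci), nn.ReLU(inplace=True),
                                 nn.Conv2d(ci, co, k, s, p, bias=False))
        self.body = nn.Sequential(
            bn_relu_conv(c_in, c_mid, 1, 1, 0),
            bn_relu_conv(c_mid, c_mid, 3, stride, 1),
            bn_relu_conv(c_mid, c_out, 1, 1, 0),
        )
        self.attention = attention
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out
                         else nn.Conv2d(c_in, c_out, 1, stride, bias=False))

    def forward(self, x):
        out = self.body(x)
        if self.attention is not None:
            out = self.attention(out)   # attention at the end of each block
        return out + self.shortcut(x)
```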
Feature Aggregation:
Following [7], where the inadequacies of temporal averaging are demonstrated, a GhostVLAD [8] pooling layer is applied after the CNN front-end. For reference, experimental results using temporal average pooling are also provided. A 256-dimensional fully-connected embedding layer is applied after the GhostVLAD pooling layer, yielding a compact utterance-level feature descriptor. Finally, the network is trained end-to-end with an additive angular margin softmax (ArcFace) [20] classification loss.
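A minimal PyTorch sketch of GhostVLAD-style pooling, following the description in [8], is shown below. The cluster counts (8 non-ghost, 2 ghost) and the linear soft-assignment layer are illustrative assumptions, not values reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLAD(nn.Module):
    """GhostVLAD-style pooling [8]: soft-assign frame features to K + G
    clusters, aggregate residuals, then drop the G ghost clusters so that
    uninformative frames can be absorbed by them."""
    def __init__(self, dim=256, clusters=8, ghost=2):
        super().__init__()
        self.clusters = clusters
        self.centroids = nn.Parameter(0.1 * torch.randn(clusters + ghost, dim))
        self.assign = nn.Linear(dim, clusters + ghost)  # soft-assignment logits

    def forward(self, x):                       # x: (B, N, D) frame features
        a = F.softmax(self.assign(x), dim=-1)   # (B, N, K+G)
        res = x.unsqueeze(2) - self.centroids[None, None]   # (B, N, K+G, D)
        vlad = (a.unsqueeze(-1) * res).sum(dim=1)           # (B, K+G, D)
        vlad = vlad[:, :self.clusters]          # discard ghost clusters
        vlad = F.normalize(vlad, dim=-1)        # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)         # (B, K*D)

# Usage: front-end output (B, 256, 1, T') -> frames (B, T', 256), then the
# 256-d fully-connected embedding layer on the pooled descriptor:
# frames = feats.squeeze(2).transpose(1, 2)
# embedding = nn.Linear(8 * 256, 256)(GhostVLAD()(frames))
```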
Input spectrogram (1 × 161 × T)                              | Output size
conv, 7×7, 64, stride (2,1)                                  | 64 × 81 × T
maxpool, 2×2                                                 | 64 × 40 × T/2
[conv, 1×1, 32; conv, 3×3, 32; conv, 1×1, 64] × 3            | 64 × 40 × T/2
[conv, 1×1, 64; conv, 3×3, 64; conv, 1×1, 128] × 4           | 128 × 20 × T/4
[conv, 1×1, 128; conv, 3×3, 128; conv, 1×1, 256] × 6         | 256 × 10 × T/8
[conv, 1×1, 256; conv, 3×3, 256; conv, 1×1, 512] × 3         | 512 × 5 × T/16
conv, 5×1, 256                                               | 256 × 1 × T/16

Table 1. The modified PreActResNet front-end. ReLU and BatchNorm layers are omitted. Each row depicts the filter sizes and the number of filters; bracketed rows are bottleneck residual blocks.
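Putting the pieces together, the following sketch assembles a Table 1-style front-end from the PreActBottleneck blocks above. It assumes the standard [3, 4, 6, 3] bottleneck repeats of a 50-layer ResNet and our own placement of the downsampling strides; both are assumptions consistent with the table, not code from the paper.

```python
import torch.nn as nn

def make_prn50v2(attention_factory=None):
    """Assemble a Table 1-style front-end. `attention_factory` maps an
    output-channel count to an attention module (None: no-attention baseline).
    Assumes the standard [3, 4, 6, 3] layout of a 50-layer ResNet."""
    stages = [(64, 32, 64, 3),      # (c_in, c_mid, c_out, repeats)
              (64, 64, 128, 4),
              (128, 128, 256, 6),
              (256, 256, 512, 3)]
    layers = [nn.Conv2d(1, 64, 7, stride=(2, 1), padding=3, bias=False),
              nn.MaxPool2d(2)]
    for i, (c_in, c_mid, c_out, reps) in enumerate(stages):
        for j in range(reps):
            attn = attention_factory(c_out) if attention_factory else None
            layers.append(PreActBottleneck(
                c_in if j == 0 else c_out, c_mid, c_out,
                stride=2 if (j == 0 and i > 0) else 1, attention=attn))
    layers += [nn.BatchNorm2d(512), nn.ReLU(inplace=True),
               nn.Conv2d(512, 256, (5, 1))]   # collapse the frequency axis
    return nn.Sequential(*layers)

# e.g. an ft-CBAM front-end (channel attention, then the ft spatial submodule):
# model = make_prn50v2(lambda c: nn.Sequential(ChannelAttention(c),
#                                              FreqTempCBAM()))
```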
4. EXPERIMENTS AND RESULTS

4.1. Benchmark dataset and training details
We use the VoxCeleb datasets for evaluation of the proposed approach, training our models on the VoxCeleb2 'dev' set [3], which comprises 5,994 speakers, and testing on the VoxCeleb1 [2] verification test set [3].

Training Details:
For training, spectrograms are generated using a Hamming window 20 ms wide with a hop length of 10 ms and a 320-point FFT, corresponding to a random 2-second temporal crop per utterance, followed by per-frequency-bin mean and variance normalization. A stochastic gradient descent optimizer is used for training, with the initial learning rate decayed every 15 epochs by a fixed factor.

Using the proposed model with no attention, along with results from previous works that follow a similar benchmark protocol, as baselines, we first perform a direct comparative analysis to study the effect of attention on speaker verification performance. Further, for a more thorough assessment of the proposed attention modules, and to imitate real-world conditions where similar perturbations might occur, we conduct three ablation experiments: (i) random frequency masking; (ii) random temporal masking; and (iii) random frequency and temporal masking. Every input spectrogram has a 40% probability of being augmented, with up to two mask instances per input. Per mask instance, up to 30 randomly selected frequency bins and up to 40 randomly selected timesteps are masked.
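The following Python sketch illustrates the described spectrogram pipeline and the ablation-time masking, assuming 16 kHz audio; contiguous mask bands, a zero fill value, and the function names are our assumptions.

```python
import numpy as np
import librosa

def extract_spectrogram(wav, sr=16000, crop_seconds=2.0):
    """Magnitude spectrogram: 20 ms Hamming window, 10 ms hop, 320-point FFT
    (161 bins), random 2 s crop, per-frequency-bin mean/variance norm."""
    win, hop = int(0.020 * sr), int(0.010 * sr)   # 320 / 160 samples at 16 kHz
    crop = int(crop_seconds * sr)
    if len(wav) < crop:                           # loop-pad short utterances
        wav = np.tile(wav, crop // len(wav) + 1)
    if len(wav) > crop:                           # random temporal crop
        start = np.random.randint(0, len(wav) - crop)
        wav = wav[start:start + crop]
    spec = np.abs(librosa.stft(wav, n_fft=320, win_length=win,
                               hop_length=hop, window="hamming"))  # (161, T)
    mu = spec.mean(axis=1, keepdims=True)
    sigma = spec.std(axis=1, keepdims=True) + 1e-8
    return (spec - mu) / sigma

def random_mask(spec, p=0.4, max_masks=2, max_freq=30, max_time=40,
                mask_freq=True, mask_time=True):
    """Ablation masking: with probability p, apply up to `max_masks` mask
    instances, each hiding up to 30 frequency bins and/or 40 timesteps.
    Assumes mask sizes are smaller than the spectrogram dimensions."""
    if np.random.rand() > p:
        return spec
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(np.random.randint(1, max_masks + 1)):
        if mask_freq:
            f = np.random.randint(1, max_freq + 1)
            f0 = np.random.randint(0, n_freq - f)
            spec[f0:f0 + f, :] = 0.0                # zero out frequency band
        if mask_time:
            t = np.random.randint(1, max_time + 1)
            t0 = np.random.randint(0, n_time - t)
            spec[:, t0:t0 + t] = 0.0                # zero out time band
    return spec
```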
Model                  | Front-end        | Attention    | Dims | Aggregation | Training Set | EER (%)
Nagrani et al. [2]     | I-vectors + PLDA | -            | -    | -           | VoxCeleb1    | 8.8
Nagrani et al. [2]     | VGG-M            | -            | 1024 | TAP         | VoxCeleb1    | 10.2
Cai et al. [21]        | ResNet-34        | -            | 128  | SAP         | VoxCeleb1    | 4.40
Okabe et al. [11]      | X-vector         | -            | 1500 | ASP         | VoxCeleb1    | 3.85
Hajibabaei et al. [22] | ResNet-29        | -            | 128  | TAP         | VoxCeleb1    | 4.30
India et al. [12]      | CNN              | -            | -    | MHA         | VoxCeleb1    | 4
Chung et al. [3]       | ResNet-50        | -            | 512  | TAP         | VoxCeleb2    | 4.19
Xie et al. [7]         | Thin ResNet-34   | -            | 512  | GhostVLAD   | VoxCeleb2    | 3.22
Hajavi et al. [6]      | UtterIdNet       | -            | 512  | TDV         | VoxCeleb2    | 4.26
Proposed               | PRN-50v2         | -            | 256  | TAP         | VoxCeleb2    | 2.557
Proposed               | PRN-50v2         | Spatial CBAM | 256  | TAP         | VoxCeleb2    | 2.515
Proposed               | PRN-50v2         | f-CBAM       | 256  | TAP         | VoxCeleb2    | 2.457
Proposed               | PRN-50v2         | t-CBAM       | 256  | TAP         | VoxCeleb2    | 2.28
Proposed               | PRN-50v2         | ft-CBAM      | 256  | TAP         | VoxCeleb2    | –
Proposed               | PRN-50v2         | -            | 256  | GhostVLAD   | VoxCeleb2    | 2.4
Proposed               | PRN-50v2         | Spatial CBAM | 256  | GhostVLAD   | VoxCeleb2    | 2.404
Proposed               | PRN-50v2         | f-CBAM       | 256  | GhostVLAD   | VoxCeleb2    | 2.13
Proposed               | PRN-50v2         | t-CBAM       | 256  | GhostVLAD   | VoxCeleb2    | 2.17
Proposed               | PRN-50v2         | ft-CBAM      | 256  | GhostVLAD   | VoxCeleb2    | 2.031

Table 2. Verification results on the VoxCeleb1 test set. TAP: temporal average pooling, SAP: self-attentive pooling, ASP: attentive statistics pooling, MHA: multi-head attention, TDV: time-distributed voting. All of the proposed models outperform existing baselines by a significant margin.
4.2. Results

Table 2 compares the performance of the proposed models with existing benchmarks on the VoxCeleb1 test set. All the proposed models outperform previous results by a significant margin, with the best ft-CBAM based model achieving an EER of 2.031%. As evidenced by [7], using GhostVLAD instead of TAP improves performance across the board. t-CBAM variants already model temporal attention and therefore experience the smallest gains, whereas f-CBAM variants experience the most drastic improvement (EER of 2.457% vs 2.13%). The spatial CBAM variants perform on par with the no-attention variants of the proposed PRN-50v2 model. The large disparity in performance between spatial CBAM and ft-CBAM can be attributed to the differences in receptive fields: unlike ft-CBAM, the receptive field of spatial CBAM's single square 7x7 kernel essentially spans across different entities in the feature space for spectrogram inputs.

Table 3 shows the results of the ablation experiments. ft-CBAM outperforms all other variants by a significant margin in all conditions. The performance gap between specific attention variants depends on the kind of deformation applied: the difference between f-CBAM and t-CBAM grows from 0.05% (temporal masking) to 0.11% (frequency masking). Collectively, the results from Tables 2 and 3 suggest that simultaneous modelling of temporal and frequency importance improves speaker verification performance.

Attention Variant | Temporal | Frequency | Freq+Temp
None              | 2.43%    | 3.50%     | 3.54%
Spatial           | 2.43%    | 3.36%     | 3.41%
f-CBAM            | 2.15%    | 3.16%     | 3.16%
t-CBAM            | 2.20%    | 3.27%     | 3.30%
ft-CBAM           | –        | –         | –

Table 3. Ablation experiment results on the VoxCeleb1 test set (EER %). Every experiment is repeated 5 times and mean values are reported. Only GhostVLAD aggregation based models are used.
5. CONCLUSION
In this paper, we propose methods of convolutional attention for speaker recognition, viz. f-CBAM and t-CBAM for modelling frequency and temporal attention, along with a composite module that models both simultaneously, aptly named ft-CBAM. The proposed PRN-50v2 model equipped with ft-CBAM and GhostVLAD [8] significantly outperforms all baselines, achieving an EER of 2.03% on the VoxCeleb1 test set. Empirical evidence suggests that modelling attention in the DNN front-end, as well as simultaneously modelling temporal and frequency attention, improves speaker verification performance.

6. REFERENCES

[1] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.

[2] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech 2017, pp. 2616–2620, 2017.

[3] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech 2018, pp. 1086–1090, 2018.

[4] Sarthak Yadav and Atul Rai, "Learning discriminative features for speaker identification and verification," in Proc. Interspeech 2018, 2018, pp. 2237–2241.

[5] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Proc. Interspeech 2017, pp. 999–1003, 2017.

[6] Amirhossein Hajavi and Ali Etemad, "A deep neural network for short-segment speaker recognition," in Proc. Interspeech 2019, pp. 2878–2882, 2019.

[7] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5791–5795.

[8] Yujie Zhong, Relja Arandjelović, and Andrew Zisserman, "GhostVLAD for set-based face recognition," in Asian Conference on Computer Vision. Springer, 2018, pp. 35–50.

[9] Gautam Bhattacharya, Jahangir Alam, and Patrick Kenny, "Deep speaker embeddings for short-duration speaker verification," in Proc. Interspeech 2017, pp. 1517–1521, 2017.

[10] Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, "Self-attentive speaker embeddings for text-independent speaker verification," in Proc. Interspeech 2018, 2018, pp. 3573–3577.

[11] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," in Proc. Interspeech 2018, pp. 2252–2256, 2018.

[12] Miquel India, Pooyan Safari, and Javier Hernando, "Self multi-head attention for speaker recognition," in Proc. Interspeech 2019, 2019, pp. 4305–4309.

[13] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang, "Residual attention network for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.

[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[15] Lanhua You, Wu Guo, Li-Rong Dai, and Jun Du, "Deep neural network embeddings with gating mechanisms for text-independent speaker verification," in Proc. Interspeech 2019, 2019, pp. 1168–1172.

[16] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.

[17] Sirui Xu and Eric Fosler-Lussier, "Spatial and channel attention based convolutional neural networks for modeling noisy speech," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6625–6629.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[20] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.

[21] Weicheng Cai, Jinkun Chen, and Ming Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 74–81.

[22] Mahdi Hajibabaei and Dengxin Dai, "Unified hypersphere embedding for speaker recognition," arXiv preprint arXiv:1807.08312, 2018.