Frequency and temporal convolutional attention for text-independent speaker recognition
Sarthak Yadav, Atul Rai
Staqu Technologies, India
ABSTRACT
The majority of recent approaches to text-independent speaker recognition apply attention or similar techniques to aggregate frame-level feature descriptors generated by a deep neural network (DNN) front-end. In this paper, we propose methods of convolutional attention for independently modelling temporal and frequency information in a convolutional neural network (CNN) based front-end. Our system utilizes convolutional block attention modules (CBAMs) [1], appropriately modified to accommodate spectrogram inputs. The proposed CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb [2, 3] speaker verification benchmark. Our best model achieves an equal error rate of 2.031% on the VoxCeleb1 test set, a considerable improvement over comparable state-of-the-art results. For a more thorough assessment of the effects of frequency and temporal attention in real-world conditions, we conduct ablation experiments by randomly dropping frequency bins and temporal frames from the input spectrograms, concluding that simultaneously modelling temporal and frequency attention, rather than modelling either entity alone, translates to better real-world performance.

Index Terms: convolutional attention, speaker verification, speaker recognition, CNNs, deep learning
1. INTRODUCTION
The majority of recent strides in text-independent speaker recognition can be ascribed to deep neural network (DNN) based speaker embeddings, which have far surpassed conventional state-of-the-art systems such as the i-vector+PLDA framework. End-to-end deep learning based speaker recognition systems usually comprise two components: (i) a DNN front-end for the extraction of frame-level features; and (ii) temporal aggregation of these frame-level features into an utterance-level embedding. The majority of recent works utilize convolutional neural network (CNN) based front-end models for extracting frame-level feature descriptors from spectrogram inputs.
This research is funded by Staqu Technologies, India.
Although sub-optimal, since it does not differentiate between frames on the basis of content, temporal averaging is amongst the most frequently used techniques for aggregation of frame-level features [2, 3, 4]. A number of recent works have proposed statistical or dictionary-based aggregation methods to mitigate this problem. [5] proposed the statistics pooling layer, which combines mean and standard deviation statistics for weighted aggregation of temporal frames. More recently, [6] proposed time-distributed voting (TDV) for aggregating features extracted by their UtterIdNet front-end in short-segment speaker verification, especially at sub-second durations. [7] proposed dictionary-based NetVLAD or GhostVLAD [8] aggregation of temporal features, using a 34-layer ResNet based front-end for feature extraction. Numerous recent works [9, 10, 11, 12] have proposed attention-based techniques for aggregating frame-level feature descriptors, assigning greater importance to the more discriminative frames. A prominent attention mechanism in the domain of computer vision is convolutional attention [1, 13], which facilitates modelling of spatial and channel attention throughout the entire CNN feature extraction network.

In this paper, we propose methods of convolutional attention based on the convolutional block attention module (CBAM) [1] for speaker verification. The main contributions of this work are two-fold: (i) we propose convolutional attention modules based on CBAM for modelling frequency and temporal attention, viz. f-CBAM and t-CBAM, along with an equal-weighted composite module for capturing both frequency and temporal attention, called ft-CBAM; and (ii) we conduct ablation experiments for a more thorough assessment of the proposed attention modules and their performance under real-world conditions, concluding that simultaneously modelling temporal and frequency attention, rather than modelling either entity alone, translates to better real-world performance.
2. RELATED WORKS
Attention mechanisms have led to significant advances across computer vision, spoken language understanding and natural language processing, increasing the modelling capacity of deep neural networks by concentrating on crucial features and suppressing unimportant ones. For speaker recognition, [9, 10] utilize self-attention for aggregating frame-level features. [11] combined an attention mechanism with statistics pooling [5] to propose attentive statistics pooling. Most recently, [12] employed the idea of multi-head attention [14] for feature aggregation, outperforming an i-vector+PLDA baseline by 58% (relative). However, by applying attention or similar techniques only to the feature descriptors generated by the DNN front-end, and not throughout the front-end model, the majority of recent works (i) do not fully utilize the representation power of DNN front-end models; and (ii) implicitly model temporal attention alone in the process. As opposed to the methods mentioned above, the proposed modules apply attention inside the feature extraction module, innately improving the representation capabilities of the model.

Recently, [15] proposed the usage of gated convolutional neural networks (GCNNs) for speaker recognition. Matched with a gated-attention pooling method for frame-level feature aggregation, they evaluate the performance of a GCNN in an x-vector [16] system on the SRE16 and SRE18 datasets. In comparison, we propose add-on modules that explicitly model frequency and temporal attention. [17] proposed an encoder-decoder style attention module similar to [13] for extracting spatial and channel attention for automatic speech recognition in noisy conditions. In contrast, we propose convolutional attention modules based on [1] that model frequency and temporal attention along with channel attention, and which drastically outperform a spatial attention baseline for speaker verification.
Recently, [1] proposed a new network module, named the "convolutional block attention module" (CBAM), which sequentially applies channel attention and spatial attention submodules to the input feature maps. CBAM comprises two components, viz. the channel attention module and the spatial attention module. The overall attention process can be summarized by the following equations:

$F' = M_c(F) \otimes F$,   (1)

$F'' = M_s(F') \otimes F'$,   (2)

where $\otimes$ denotes element-wise multiplication, $F$ is the input feature map, $F''$ is the final output of the CBAM module, and $M_c$ and $M_s$ denote the channel and spatial attention operations, respectively. The channel attention module exploits the inter-channel relationship of features and generates a 1-D channel attention map by squeezing the spatial dimensions of the input feature map via max-pooling and average-pooling, followed by projection through a shared MLP layer. The spatial attention module utilizes the inter-spatial relationship of features, focusing on the spatial location of objects of interest. It applies and concatenates the outputs of average-pooling and max-pooling operations along the channel axis, which generates an efficient feature descriptor, followed by a 7x7 convolution layer.
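For concreteness, the following is a minimal PyTorch sketch of a CBAM-style module implementing Eqs. (1) and (2). The class and variable names are ours, and the reduction ratio of 16 is CBAM's default choice, not a value from this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention, Eq. (1): squeeze the spatial dims by average- and
    max-pooling, project through a shared MLP, and gate the channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))          # shared MLP on avg-pooled
        mx = self.mlp(x.amax(dim=(2, 3)))           # ... and max-pooled squeeze
        scale = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * scale                            # F' = M_c(F) * F

class SpatialAttention(nn.Module):
    """Spatial attention, Eq. (2): pool across channels, concatenate, 7x7 conv."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                           # x: (B, C, H, W)
        desc = torch.cat([x.mean(dim=1, keepdim=True),
                          x.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, H, W)
        return x * torch.sigmoid(self.conv(desc))   # F'' = M_s(F') * F'

class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel, self.spatial = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```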
However, unlike computer vision, where the modality represents highly correlated points in space and the axes represent the spatial location of an object in a cartesian coordinate system, the axes of a spectrogram represent entirely different domains: frequency and time. This disconnect between the entities represented by the axes of the two modalities' feature spaces necessitates targeted convolutional attention modules, since the preconditions that existing methods of convolutional attention require to model attention effectively might no longer hold in the speech domain.

3. PROPOSED APPROACH
The channel attention module (Eq. 1) extracts general information regarding channel importance in the input feature map, and is used as-is. We propose appropriate changes to the spatial attention submodule for modelling frequency and temporal attention in spectrogram inputs, viz. f-CBAM and t-CBAM, respectively. Hence, the input to our proposed modules is $F'$ (Eq. 1), such that $F' \in \mathbb{R}^{C \times H \times T}$, where $C$ denotes the number of input channels, and $H$ and $T$ denote the dimensions along the frequency and temporal axes, respectively.

f-CBAM: For modelling frequency attention, we need to limit the receptive field of the attention module to focus only on the y-axis (frequency axis) of the input.
Fig. 1. Proposed f-CBAM module. The z-axis represents the temporal axis (pictorially represented using dimensions > 1).

We aggregate temporal information by averaging the input feature map $F'$ along the x-axis to generate an efficient feature descriptor $F_{freq} \in \mathbb{R}^{C \times H \times 1}$, which essentially assigns equal statistical importance to each temporal frame:

$F_{freq} = \mathrm{AvgPool}_{1 \times T}(F')$,   (3)

where $\mathrm{AvgPool}_{1 \times T}$ represents the average pooling operation with a kernel of size $1 \times T$ over the input feature map. Similar to the spatial attention submodule, we then aggregate channel information by generating two feature maps, $F^{f}_{avg}, F^{f}_{max} \in \mathbb{R}^{1 \times H \times 1}$, denoting average and max pooling operations applied across the channel dimension of $F_{freq}$, and concatenate them. Finally, on this concatenated feature descriptor, we apply a rectangular 7x1 convolution kernel to generate a frequency attention map $M_{freq}(F') \in \mathbb{R}^{H \times 1}$, where $H$ denotes the total number of frequency bins in the input feature $F'$:

$M_{freq}(F') = \sigma(f^{7 \times 1}([F^{f}_{avg}; F^{f}_{max}]))$.   (4)

Here, $\sigma$ denotes the sigmoid function and $f^{7 \times 1}$ represents a convolution operation with a rectangular $7 \times 1$ kernel. $M_{freq}(F')$ is then broadcast along the temporal dimension onto the original input feature map $F'$.

t-CBAM: t-CBAM follows a procedure similar to f-CBAM for modelling temporal attention, albeit limiting the receptive field of the attention module to the temporal axis, i.e. the x-axis:

$F_{temp} = \mathrm{AvgPool}_{H \times 1}(F')$,   (5)

$M_{temp}(F') = \sigma(f^{1 \times 7}([F^{t}_{avg}; F^{t}_{max}]))$,   (6)

where $F_{temp} \in \mathbb{R}^{C \times 1 \times T}$ and $F^{t}_{avg}, F^{t}_{max} \in \mathbb{R}^{1 \times 1 \times T}$.

ft-CBAM: ft-CBAM comprises f-CBAM and t-CBAM applied in parallel on the input feature map; the feature maps generated by the two are then averaged. ft-CBAM can be seen as a special case of the original spatial CBAM, with the $7 \times 7$ convolution filter of the latter represented by two independent $7 \times 1$ and $1 \times 7$ operations.

Front-end: We propose a modified 50-layer pre-activation ResNet [18], henceforth denoted as PRN-50v2, as our CNN front-end to encode spectrogram inputs of arbitrary length (Table 1). By changing the order of layers in the residual block to BN-ReLU-Conv, pre-activation ResNets improve the ease of optimization as well as generalization performance over comparable ResNet [19] counterparts.
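Below is a minimal PyTorch sketch of the modified spatial submodules, under the shapes defined above (inputs of shape (B, C, H, T)); module names are ours. In the full f-/t-/ft-CBAM, these follow the unchanged channel attention of Eq. (1). Note that averaging the two attended feature maps is algebraically identical to applying the average of the two attention maps, which is what the ft-CBAM sketch does.

```python
import torch
import torch.nn as nn

class FreqCBAM(nn.Module):
    """f-CBAM spatial submodule, Eqs. (3)-(4): average over time, pool over
    channels, 7x1 convolution; returns M_freq, broadcastable along time."""
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, (kernel, 1), padding=(kernel // 2, 0))

    def forward(self, x):                         # x = F': (B, C, H, T)
        f = x.mean(dim=3, keepdim=True)           # AvgPool_{1xT}: (B, C, H, 1)
        desc = torch.cat([f.mean(dim=1, keepdim=True),
                          f.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, H, 1)
        return torch.sigmoid(self.conv(desc))     # M_freq: (B, 1, H, 1)

class TempCBAM(nn.Module):
    """t-CBAM spatial submodule, Eqs. (5)-(6): average over frequency, pool
    over channels, 1x7 convolution; returns M_temp."""
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, (1, kernel), padding=(0, kernel // 2))

    def forward(self, x):
        t = x.mean(dim=2, keepdim=True)           # AvgPool_{Hx1}: (B, C, 1, T)
        desc = torch.cat([t.mean(dim=1, keepdim=True),
                          t.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, 1, T)
        return torch.sigmoid(self.conv(desc))     # M_temp: (B, 1, 1, T)

class FreqTempCBAM(nn.Module):
    """ft-CBAM: f- and t-attention in parallel, equal-weighted average."""
    def __init__(self):
        super().__init__()
        self.freq, self.temp = FreqCBAM(), TempCBAM()

    def forward(self, x):
        m = 0.5 * (self.freq(x) + self.temp(x))   # broadcasts to (B, 1, H, T)
        return x * m
```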
Attention:
Wherever applicable, the appropriate CBAM module is integrated at the end of each residual block in the proposed front-end module.
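As an illustration, the following sketch shows a pre-activation bottleneck block with the attention module applied to the residual branch output before the residual addition, mirroring the block-level placement used in [1]; the exact shortcut handling in PRN-50v2 is our assumption.

```python
import torch.nn as nn

class PreActBottleneck(nn.Module):
    """Pre-activation bottleneck (BN-ReLU-Conv ordering [18]) with an optional
    attention module applied at the end of the block, before the residual
    addition. `attention` is any module mapping (B, C_out, H, T) to the same
    shape, e.g. CBAM or FreqTempCBAM from the sketches above."""
    def __init__(self, c_in, c_mid, c_out, stride=1, attention=None):
        super().__init__()
        def bn_relu_conv(ci, co, k, s, p):
            return nn.Sequential(nn.BatchNorm2d(ci), nn.ReLU(inplace=True),
                                 nn.Conv2d(ci, co, k, s, p, bias=False))
        self.body = nn.Sequential(
            bn_relu_conv(c_in, c_mid, 1, 1, 0),
            bn_relu_conv(c_mid, c_mid, 3, stride, 1),
            bn_relu_conv(c_mid, c_out, 1, 1, 0),
        )
        self.attention = attention
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out
                         else nn.Conv2d(c_in, c_out, 1, stride, bias=False))

    def forward(self, x):
        out = self.body(x)
        if self.attention is not None:
            out = self.attention(out)   # attention at the end of each block
        return out + self.shortcut(x)
```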
Feature Aggregation:
Following [7], where the inadequacies of temporal averaging are demonstrated, a GhostVLAD [8] pooling layer is applied after the CNN front-end. For reference, experimental results using temporal average pooling are also provided. A 256-dimensional fully-connected embedding layer is applied after the GhostVLAD pooling layer, yielding a compact utterance-level feature descriptor. Finally, the network is trained end-to-end with an additive angular margin softmax (ArcFace) [20] classification loss.
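A minimal PyTorch sketch of GhostVLAD-style pooling, following the description in [8], is shown below. The cluster counts (8 non-ghost, 2 ghost) and the linear soft-assignment layer are illustrative assumptions, not values reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLAD(nn.Module):
    """GhostVLAD-style pooling [8]: soft-assign frame features to K + G
    clusters, aggregate residuals, then drop the G ghost clusters so that
    uninformative frames can be absorbed by them."""
    def __init__(self, dim=256, clusters=8, ghost=2):
        super().__init__()
        self.clusters = clusters
        self.centroids = nn.Parameter(0.1 * torch.randn(clusters + ghost, dim))
        self.assign = nn.Linear(dim, clusters + ghost)  # soft-assignment logits

    def forward(self, x):                       # x: (B, N, D) frame features
        a = F.softmax(self.assign(x), dim=-1)   # (B, N, K+G)
        res = x.unsqueeze(2) - self.centroids[None, None]   # (B, N, K+G, D)
        vlad = (a.unsqueeze(-1) * res).sum(dim=1)           # (B, K+G, D)
        vlad = vlad[:, :self.clusters]          # discard ghost clusters
        vlad = F.normalize(vlad, dim=-1)        # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)         # (B, K*D)

# Usage: front-end output (B, 256, 1, T') -> frames (B, T', 256), then the
# 256-d fully-connected embedding layer on the pooled descriptor:
# frames = feats.squeeze(2).transpose(1, 2)
# embedding = nn.Linear(8 * 256, 256)(GhostVLAD()(frames))
```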
Input spectrogram (1 × 161 × T)                              | Output size
conv, 7×7, 64, stride (2,1)                                  | 64 × 81 × T
maxpool, 2×2                                                 | 64 × 40 × T/2
[conv, 1×1, 32; conv, 3×3, 32; conv, 1×1, 64] × 3            | 64 × 40 × T/2
[conv, 1×1, 64; conv, 3×3, 64; conv, 1×1, 128] × 4           | 128 × 20 × T/4
[conv, 1×1, 128; conv, 3×3, 128; conv, 1×1, 256] × 6         | 256 × 10 × T/8
[conv, 1×1, 256; conv, 3×3, 256; conv, 1×1, 512] × 3         | 512 × 5 × T/16
conv, 5×1, 256                                               | 256 × 1 × T/16

Table 1. The modified PreActResNet front-end. ReLU and BatchNorm layers are omitted. Each row depicts the filter sizes and the number of filters; bracketed rows are bottleneck residual blocks.
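Putting the pieces together, the following sketch assembles a Table 1-style front-end from the PreActBottleneck blocks above. It assumes the standard [3, 4, 6, 3] bottleneck repeats of a 50-layer ResNet and our own placement of the downsampling strides; both are assumptions consistent with the table, not code from the paper.

```python
import torch.nn as nn

def make_prn50v2(attention_factory=None):
    """Assemble a Table 1-style front-end. `attention_factory` maps an
    output-channel count to an attention module (None: no-attention baseline).
    Assumes the standard [3, 4, 6, 3] layout of a 50-layer ResNet."""
    stages = [(64, 32, 64, 3),      # (c_in, c_mid, c_out, repeats)
              (64, 64, 128, 4),
              (128, 128, 256, 6),
              (256, 256, 512, 3)]
    layers = [nn.Conv2d(1, 64, 7, stride=(2, 1), padding=3, bias=False),
              nn.MaxPool2d(2)]
    for i, (c_in, c_mid, c_out, reps) in enumerate(stages):
        for j in range(reps):
            attn = attention_factory(c_out) if attention_factory else None
            layers.append(PreActBottleneck(
                c_in if j == 0 else c_out, c_mid, c_out,
                stride=2 if (j == 0 and i > 0) else 1, attention=attn))
    layers += [nn.BatchNorm2d(512), nn.ReLU(inplace=True),
               nn.Conv2d(512, 256, (5, 1))]   # collapse the frequency axis
    return nn.Sequential(*layers)

# e.g. an ft-CBAM front-end (channel attention, then the ft spatial submodule):
# model = make_prn50v2(lambda c: nn.Sequential(ChannelAttention(c),
#                                              FreqTempCBAM()))
```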
4. EXPERIMENTS AND RESULTS

4.1. Benchmark dataset and training details
We use the VoxCeleb datasets for evaluation of the proposed approach, training our models on the VoxCeleb2 'dev' set [3], which comprises 5,994 speakers, and testing on the VoxCeleb1 [2] verification test set [3].

Training Details:
For training, spectrograms are generated using a Hamming window 20 ms wide with a hop length of 10 ms and a 320-point FFT, corresponding to a random 2-second temporal crop per utterance, followed by per-frequency-bin mean and variance normalization. A stochastic gradient descent optimizer is used for training, with the initial learning rate decayed every 15 epochs by a fixed factor.

Using the proposed model with no attention, along with results from previous works that follow a similar benchmark protocol, as baselines, we first perform a direct comparative analysis to study the effect of attention on speaker verification performance. Further, for a more thorough assessment of the proposed attention modules, and to imitate real-world conditions where similar perturbations might occur, we conduct three ablation experiments: (i) random frequency masking; (ii) random temporal masking; and (iii) random frequency and temporal masking. Every input spectrogram has a 40% probability of being augmented, with up to two mask instances per input. Per mask instance, up to 30 randomly selected frequency bins and up to 40 randomly selected timesteps are masked.
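The following Python sketch illustrates the described spectrogram pipeline and the ablation-time masking, assuming 16 kHz audio; contiguous mask bands, a zero fill value, and the function names are our assumptions.

```python
import numpy as np
import librosa

def extract_spectrogram(wav, sr=16000, crop_seconds=2.0):
    """Magnitude spectrogram: 20 ms Hamming window, 10 ms hop, 320-point FFT
    (161 bins), random 2 s crop, per-frequency-bin mean/variance norm."""
    win, hop = int(0.020 * sr), int(0.010 * sr)   # 320 / 160 samples at 16 kHz
    crop = int(crop_seconds * sr)
    if len(wav) < crop:                           # loop-pad short utterances
        wav = np.tile(wav, crop // len(wav) + 1)
    if len(wav) > crop:                           # random temporal crop
        start = np.random.randint(0, len(wav) - crop)
        wav = wav[start:start + crop]
    spec = np.abs(librosa.stft(wav, n_fft=320, win_length=win,
                               hop_length=hop, window="hamming"))  # (161, T)
    mu = spec.mean(axis=1, keepdims=True)
    sigma = spec.std(axis=1, keepdims=True) + 1e-8
    return (spec - mu) / sigma

def random_mask(spec, p=0.4, max_masks=2, max_freq=30, max_time=40,
                mask_freq=True, mask_time=True):
    """Ablation masking: with probability p, apply up to `max_masks` mask
    instances, each hiding up to 30 frequency bins and/or 40 timesteps.
    Assumes mask sizes are smaller than the spectrogram dimensions."""
    if np.random.rand() > p:
        return spec
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(np.random.randint(1, max_masks + 1)):
        if mask_freq:
            f = np.random.randint(1, max_freq + 1)
            f0 = np.random.randint(0, n_freq - f)
            spec[f0:f0 + f, :] = 0.0                # zero out frequency band
        if mask_time:
            t = np.random.randint(1, max_time + 1)
            t0 = np.random.randint(0, n_time - t)
            spec[:, t0:t0 + t] = 0.0                # zero out time band
    return spec
```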
Model                  | Front-end        | Attention    | Dims | Aggregation | Training Set | EER (%)
Nagrani et al. [2]     | I-vectors + PLDA | -            | -    | -           | VoxCeleb1    | 8.8
Nagrani et al. [2]     | VGG-M            | -            | 1024 | TAP         | VoxCeleb1    | 10.2
Cai et al. [21]        | ResNet-34        | -            | 128  | SAP         | VoxCeleb1    | 4.40
Okabe et al. [11]      | X-vector         | -            | 1500 | ASP         | VoxCeleb1    | 3.85
Hajibabaei et al. [22] | ResNet-29        | -            | 128  | TAP         | VoxCeleb1    | 4.30
India et al. [12]      | CNN              | -            | -    | MHA         | VoxCeleb1    | 4
Chung et al. [3]       | ResNet-50        | -            | 512  | TAP         | VoxCeleb2    | 4.19
Xie et al. [7]         | Thin ResNet-34   | -            | 512  | GhostVLAD   | VoxCeleb2    | 3.22
Hajavi et al. [6]      | UtterIdNet       | -            | 512  | TDV         | VoxCeleb2    | 4.26
Proposed               | PRN-50v2         | -            | 256  | TAP         | VoxCeleb2    | 2.557
Proposed               | PRN-50v2         | Spatial CBAM | 256  | TAP         | VoxCeleb2    | 2.515
Proposed               | PRN-50v2         | f-CBAM       | 256  | TAP         | VoxCeleb2    | 2.457
Proposed               | PRN-50v2         | t-CBAM       | 256  | TAP         | VoxCeleb2    | 2.28
Proposed               | PRN-50v2         | ft-CBAM      | 256  | TAP         | VoxCeleb2    | –
Proposed               | PRN-50v2         | -            | 256  | GhostVLAD   | VoxCeleb2    | 2.4
Proposed               | PRN-50v2         | Spatial CBAM | 256  | GhostVLAD   | VoxCeleb2    | 2.404
Proposed               | PRN-50v2         | f-CBAM       | 256  | GhostVLAD   | VoxCeleb2    | 2.13
Proposed               | PRN-50v2         | t-CBAM       | 256  | GhostVLAD   | VoxCeleb2    | 2.17
Proposed               | PRN-50v2         | ft-CBAM      | 256  | GhostVLAD   | VoxCeleb2    | 2.031

Table 2. Verification results on the VoxCeleb1 test set. TAP: temporal average pooling, SAP: self-attentive pooling, ASP: attentive statistics pooling, MHA: multi-head attention, TDV: time-distributed voting. All of the proposed models outperform existing baselines by a significant margin.
4.2. Results

Table 2 compares the performance of the proposed models with existing benchmarks on the VoxCeleb1 test set. All the proposed models outperform previous results by a significant margin, with the best ft-CBAM based model achieving an EER of 2.031%. As evidenced by [7], using GhostVLAD instead of TAP improves performance across the board. t-CBAM variants already model temporal attention and therefore experience the smallest gains, whereas f-CBAM variants experience the most drastic improvement (EER of 2.457% vs 2.13%). The spatial CBAM variants perform on par with the no-attention variants of the proposed PRN-50v2 model. The large disparity in performance between spatial CBAM and ft-CBAM can be attributed to the differences in receptive fields: unlike ft-CBAM, the receptive field of spatial CBAM's single square 7x7 kernel essentially spans across different entities in the feature space for spectrogram inputs.

Table 3 shows the results of the ablation experiments. ft-CBAM outperforms all other variants by a significant margin in all conditions. The performance gap between specific attention variants depends on the kind of deformation applied: the difference between f-CBAM and t-CBAM grows from 0.05% (temporal masking) to 0.11% (frequency masking). Collectively, the results from Tables 2 and 3 suggest that simultaneous modelling of temporal and frequency importance improves speaker verification performance.

Attention Variant | Temporal | Frequency | Freq+Temp
None              | 2.43%    | 3.50%     | 3.54%
Spatial           | 2.43%    | 3.36%     | 3.41%
f-CBAM            | 2.15%    | 3.16%     | 3.16%
t-CBAM            | 2.20%    | 3.27%     | 3.30%
ft-CBAM           | –        | –         | –

Table 3. Ablation experiment results on the VoxCeleb1 test set (EER %). Every experiment is repeated 5 times and mean values are reported. Only GhostVLAD aggregation based models are used.
5. CONCLUSION
In this paper, we propose methods of convolutional attention for speaker recognition, viz. f-CBAM and t-CBAM for modelling frequency and temporal attention, along with a composite module that models both simultaneously, aptly named ft-CBAM. The proposed PRN-50v2 model equipped with ft-CBAM and GhostVLAD [8] significantly outperforms all baselines, achieving an EER of 2.03% on the VoxCeleb1 test set. Empirical evidence suggests that modelling attention in the DNN front-end, as well as simultaneously modelling temporal and frequency attention, improves speaker verification performance.

6. REFERENCES

[1] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.

[2] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech 2017, pp. 2616–2620, 2017.

[3] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech 2018, pp. 1086–1090, 2018.

[4] Sarthak Yadav and Atul Rai, "Learning discriminative features for speaker identification and verification," in Proc. Interspeech 2018, 2018, pp. 2237–2241.

[5] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Proc. Interspeech 2017, pp. 999–1003, 2017.

[6] Amirhossein Hajavi and Ali Etemad, "A deep neural network for short-segment speaker recognition," in Proc. Interspeech 2019, pp. 2878–2882, 2019.

[7] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5791–5795.

[8] Yujie Zhong, Relja Arandjelović, and Andrew Zisserman, "GhostVLAD for set-based face recognition," in Asian Conference on Computer Vision. Springer, 2018, pp. 35–50.

[9] Gautam Bhattacharya, Jahangir Alam, and Patrick Kenny, "Deep speaker embeddings for short-duration speaker verification," in Proc. Interspeech 2017, pp. 1517–1521, 2017.

[10] Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, "Self-attentive speaker embeddings for text-independent speaker verification," in Proc. Interspeech 2018, 2018, pp. 3573–3577.

[11] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," in Proc. Interspeech 2018, pp. 2252–2256, 2018.

[12] Miquel India, Pooyan Safari, and Javier Hernando, "Self multi-head attention for speaker recognition," in Proc. Interspeech 2019, 2019, pp. 4305–4309.

[13] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang, "Residual attention network for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.

[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[15] Lanhua You, Wu Guo, Li-Rong Dai, and Jun Du, "Deep neural network embeddings with gating mechanisms for text-independent speaker verification," in Proc. Interspeech 2019, 2019, pp. 1168–1172.

[16] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.

[17] Sirui Xu and Eric Fosler-Lussier, "Spatial and channel attention based convolutional neural networks for modeling noisy speech," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6625–6629.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[20] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.

[21] Weicheng Cai, Jinkun Chen, and Ming Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 74–81.

[22] Mahdi Hajibabaei and Dengxin Dai, "Unified hypersphere embedding for speaker recognition," arXiv preprint arXiv:1807.08312, 2018.