Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020
Hee Soo Heo, Bong-Jin Lee, Jaesung Huh, Joon Son Chung
Naver Corporation, South Korea
Abstract
This report describes our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020. We perform a careful analysis of speaker recognition models based on the popular ResNet architecture, and train a number of variants using a range of loss functions. Our results show significant improvements over most existing works without the use of model ensemble or post-processing. We release the training code and pre-trained models as unofficial baselines for this year's challenge.
Index Terms: speaker verification, speaker recognition.
1. Introduction
The VoxCeleb Speaker Recognition Challenge 2020 is the second installment of the new series of speaker recognition challenges that are hosted annually. The challenge is intended to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data. This year's challenge differs from the last in a number of ways: (1) there is an explicit domain shift between the training data and the test data; (2) the test set contains utterances that are shorter than the segments seen during training. The following sections of this report describe the method that underlies our submission to the challenge.
2. Model
During training, we use a fixed length 2-second temporal segment extracted randomly from each utterance. Pre-emphasis is applied to the input signal using a coefficient of 0.97. Spectrograms are extracted with a Hamming window of width 25ms and step 10ms, with an FFT size of 512. 64-dimensional log Mel-filterbanks are used as the input to the network. Mean and variance normalisation (MVN) is performed by applying instance normalisation [1] to the network input. Since the VoxCeleb dataset consists mostly of continuous speech, voice activity detection is not used in training or testing.
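A minimal sketch of this input pipeline, assuming 16 kHz audio and torchaudio's standard transforms (the released code may differ in detail):

```python
import torch
import torchaudio

class InputFeatures(torch.nn.Module):
    """Pre-emphasis -> log Mel-filterbanks -> instance normalisation."""
    def __init__(self):
        super().__init__()
        # 25 ms window / 10 ms step at 16 kHz -> 400 / 160 samples.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=512, win_length=400, hop_length=160,
            window_fn=torch.hamming_window, n_mels=64)
        # Instance normalisation provides per-utterance MVN of the input.
        self.instancenorm = torch.nn.InstanceNorm1d(64)

    def forward(self, wav):
        # wav: (batch, samples). Pre-emphasis: y[t] = x[t] - 0.97 * x[t-1].
        wav = torch.cat([wav[:, :1], wav[:, 1:] - 0.97 * wav[:, :-1]], dim=1)
        feat = (self.melspec(wav) + 1e-6).log()  # (batch, 64, frames)
        return self.instancenorm(feat)
```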
Residual networks [2] are widely used in image recognition and have been applied to speaker recognition [3, 4, 5, 6]. We use two variants of the ResNet with 34 layers.
Speed optimised model.
The first variant uses only one quarter of the channels in each residual block compared to the original ResNet-34 in order to reduce computational cost. The model has only 1.4 million parameters, compared to the 22 million of the original ResNet-34. Self-attentive pooling (SAP) [4] is used to aggregate frame-level features into an utterance-level representation while paying attention to the frames that are more informative for utterance-level speaker recognition. The network architecture is identical to that used in [7] except for the input dimension, and we refer to this configuration as Q/SAP in the results.
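A minimal sketch of SAP, following the formulation of [4] with a learnable context vector; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """SAP: attention-weighted mean of the frame-level features."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.context = nn.Parameter(torch.randn(dim))  # learnable context vector

    def forward(self, x):                  # x: (batch, frames, dim)
        h = torch.tanh(self.linear(x))     # per-frame hidden representation
        w = torch.softmax(h @ self.context, dim=1)      # (batch, frames)
        return torch.sum(x * w.unsqueeze(-1), dim=1)    # (batch, dim)
```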
Performance optimised model.
The second, slower variant has half of the channels in each residual block compared to the original ResNet-34, and contains 8.0 million parameters. Moreover, the stride at the first convolutional layer is removed, leading to increased computational requirements. Attentive Statistics Pooling (ASP) [8] is used to aggregate temporal frames, where the channel-wise weighted standard deviation is calculated in addition to the weighted mean. Table 1 shows the detailed architecture of the performance optimised model. We refer to this configuration as H/ASP in the results.

Table 1: Trunk architecture for the performance optimised model. L: length of input sequence, ASP: attentive statistics pooling.
Layer    Kernel size                    Stride   Output shape
Conv1    3 × 3, 32                      1 × 1    32 × L × 64
Res1     [3 × 3, 32; 3 × 3, 32] × 3     1 × 1    32 × L × 64
Res2     [3 × 3, 64; 3 × 3, 64] × 4     2 × 2    64 × L/2 × 32
Res3     [3 × 3, 128; 3 × 3, 128] × 6   2 × 2    128 × L/4 × 16
Res4     [3 × 3, 256; 3 × 3, 256] × 3   2 × 2    256 × L/8 × 8
Flatten  -                              -        2048 × L/8
ASP      -                              -        4096
Linear   -                              -        512
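The ASP layer used above can be sketched as below; with the 2048-dimensional flattened features of Table 1, concatenating the weighted mean and standard deviation yields the 4096-dimensional output. The hidden width of the attention module is an assumption.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """ASP [8]: attention-weighted mean and standard deviation."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size=1), nn.Tanh(),
            nn.Conv1d(hidden, dim, kernel_size=1), nn.Softmax(dim=2))

    def forward(self, x):                        # x: (batch, dim, frames)
        w = self.attention(x)                    # attention over frames
        mu = torch.sum(x * w, dim=2)             # weighted mean
        var = torch.sum((x ** 2) * w, dim=2) - mu ** 2
        sigma = torch.sqrt(var.clamp(min=1e-5))  # weighted std deviation
        return torch.cat([mu, sigma], dim=1)     # (batch, 2 * dim)
```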
We train the networks using various types of loss functions widely used in speaker recognition. Additive margin softmax (AM-softmax) [11] and additive angular margin softmax (AAM-softmax) [12] loss functions were proposed in face recognition and have been successfully applied to speaker recognition [13]. These functions introduce a concept of margin between classes, where the margin increases inter-class variance. For both the AM-softmax and AAM-softmax loss functions, we use a margin of 0.2 and a scale of 30, since these values result in the best performance on the VoxCeleb1 test set.

Angular Prototypical (AP) loss, a variant of the prototypical networks with an angular objective, has been used in [7], where it demonstrated strong performance without manually-defined hyper-parameters.

Finally, we combine the Angular Prototypical loss with the vanilla softmax loss, which demonstrates further improvement over using each of the loss functions alone. Figure 1 shows the model architecture and the training strategy for combining the AP and softmax loss functions.
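The combined objective can be sketched as follows, assuming each mini-batch contains two utterances per speaker, split into a query and a prototype view as in the Angular Prototypical formulation of [7]; the initial values of the learnable scale and bias are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APPlusSoftmax(nn.Module):
    """AP loss over in-batch prototypes plus a vanilla softmax head."""
    def __init__(self, dim, n_speakers):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))  # learnable scale (assumed init)
        self.b = nn.Parameter(torch.tensor(-5.0))  # learnable bias (assumed init)
        self.classifier = nn.Linear(dim, n_speakers)

    def forward(self, query, proto, labels):
        # query, proto: (n_spk, dim); labels: (n_spk,) speaker indices.
        cos = F.cosine_similarity(query.unsqueeze(1), proto.unsqueeze(0), dim=2)
        # Each query should match the prototype of its own speaker (diagonal).
        ap_loss = F.cross_entropy(
            self.w * cos + self.b,
            torch.arange(query.size(0), device=query.device))
        sm_loss = F.cross_entropy(
            self.classifier(torch.cat([query, proto])),
            torch.cat([labels, labels]))
        return ap_loss + sm_loss
```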
Table 2: Results on the VoxCeleb and VoxSRC test sets. The figures in bold represent the best results for each metric, excluding the fusion outputs. AP: Angular Prototypical. BN: Batch Normalisation on the speaker embeddings. † This method uses score normalisation as a post-processing step.
Config. | Loss | Aug. | BN | VoxCeleb1 | VoxCeleb1-E cl. | VoxCeleb1-H cl. | VoxSRC 2019 | VoxSRC 2020 Val (EER and MinDCF for each test set)

[Table 2 body not recoverable from the extraction: the rows compare ResNet-34 [7] with AP loss, the single and fusion systems of [9] and [10], and our Q/SAP and H/ASP configurations under each loss, with and without augmentation and embedding batch normalisation.]

Figure 1: Overview of the model architecture and the training strategy of AP+Softmax systems (Mel-filterbank inputs pass through the shared ResNet trunk, pooling and fully connected layers, trained jointly with the softmax and AP losses).
3. Experiments
The models are trained on the development set of VoxCeleb2 [3], which contains 5,994 speakers. The VoxCeleb1 test sets [14] and the previous year's VoxSRC test set [15] are used as validation sets.
We exploit two popular augmentation methods in speech processing: additive noise and room impulse response (RIR) simulation. For additive noise, we use the MUSAN corpus [16], which contains 60 hours of human speech, 42 hours of music, and 6 hours of other noises such as dialtones or ambient sounds. For room impulse responses, we use the simulated RIR filters provided in [17]. Both noise and RIR filters are randomly selected in every training step. The types of augmentation used are similar to [18], in which the recordings are augmented by one of the following methods (see the sketch after this list):

• Speech: Three to seven recordings are randomly picked from MUSAN speech, then added to the original signal with a random signal to noise ratio (SNR) from 13 to 20dB. The duration of the additive noise is matched to the sampled segment.
• Music: A single music file is randomly selected from MUSAN, and added to the original signal in a similar fashion at 5 to 15dB SNR.
• Noise: Background noises without human speech or music in MUSAN are added to the recording at 0 to 15dB SNR.
• RIR filters: Speech reverberation is performed via a convolution operation with a collection of simulated RIR filters. We vary the gain of the RIR filters to produce more diverse reverberated signals.
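A minimal sketch of this per-step augmentation policy; `musan` is assumed to be a dict of waveform lists keyed by category, `rirs` a list of filter arrays, and the RIR gain range is an assumption:

```python
import random
import numpy as np
import scipy.signal

SNR_RANGES = {'speech': (13, 20), 'music': (5, 15), 'noise': (0, 15)}

def add_noise(signal, noise, snr_db):
    """Scale `noise` to the requested SNR and add it to `signal`."""
    sig_db = 10 * np.log10(np.mean(signal ** 2) + 1e-4)
    noise_db = 10 * np.log10(np.mean(noise ** 2) + 1e-4)
    return signal + noise * np.sqrt(10 ** ((sig_db - noise_db - snr_db) / 10))

def augment(signal, musan, rirs):
    """Apply one randomly chosen augmentation, as described in the list above."""
    kind = random.choice(['speech', 'music', 'noise', 'rir'])
    if kind == 'rir':
        # Convolve with a random simulated RIR; the gain range is an assumption.
        rir = random.choice(rirs) * random.uniform(0.5, 1.0)
        return scipy.signal.convolve(signal, rir, mode='full')[:len(signal)]
    if kind == 'speech':
        # Three to seven babble recordings, each at its own random SNR.
        noises = random.sample(musan['speech'], random.randint(3, 7))
    else:
        noises = [random.choice(musan[kind])]
    for noise in noises:
        snr_db = random.uniform(*SNR_RANGES[kind])
        signal = add_noise(signal, noise[:len(signal)], snr_db)
    return signal
```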
Our implementation is based on the PyTorch framework [19] and trained on the NAVER Smart Machine Learning (NSML) platform [20]. The models are trained using 8 NVIDIA P40 GPUs with 24GB memory with the Adam optimiser. We use the distributed training implementation of https://github.com/clovaai/voxceleb_trainer, where one epoch is defined as a full pass through the dataset by each GPU.

Speed optimised model.
We use an initial learning rate of 0.01, reduced every 2 epochs. The network is trained for 50 epochs. We use a mini-batch size of 500. The models take around 1 day to train.
Performance optimised model.
We use an initial learning rate of 0.001, reduced every 3 epochs. The network is trained for 36 epochs. Weight decay is applied. We use a smaller batch size of 150 due to memory limitations. The models take around 2 days to train.
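As an illustration, the optimiser and schedule could be configured as below; the step decay factor and the weight decay value are not given in this report, so the numbers marked as assumptions are placeholders rather than the authors' settings.

```python
import torch

def make_optimiser(model, performance_optimised=True):
    if performance_optimised:   # H/ASP: lr 0.001, stepped every 3 epochs
        opt = torch.optim.Adam(model.parameters(), lr=0.001,
                               weight_decay=5e-5)  # weight decay value assumed
        sch = torch.optim.lr_scheduler.StepLR(opt, step_size=3, gamma=0.75)  # gamma assumed
    else:                       # Q/SAP: lr 0.01, stepped every 2 epochs
        opt = torch.optim.Adam(model.parameters(), lr=0.01)
        sch = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.75)  # gamma assumed
    return opt, sch
```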
3.4. Scoring

The trained networks are evaluated on the VoxCeleb1 and the VoxSRC test sets. We sample ten 4-second temporal crops at regular intervals from each test segment, and compute the 10 × 10 = 100 cosine similarities between the possible combinations from every pair of segments. The mean of the 100 similarities is used as the score. This protocol is in line with that used by [3, 6, 7].
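A sketch of this trial scoring, assuming `emb_a` and `emb_b` hold the ten crop-level embeddings of the two segments:

```python
import torch
import torch.nn.functional as F

def pairwise_score(emb_a, emb_b):
    """Mean of the 10 x 10 = 100 cosine similarities between crops.

    emb_a, emb_b: (10, dim) tensors, one row per 4-second crop.
    """
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    return (emb_a @ emb_b.t()).mean().item()  # scalar trial score
```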
We report two performance metrics: (i) the Equal Error Rate (EER), which is the rate at which both acceptance and rejection errors are equal; and (ii) the minimum detection cost function (MinDCF) used by the NIST SRE [21] and the VoxSRC evaluations. The parameters C_miss = 1, C_fa = 1 and P_target = 0.05 are used for the cost function.

Table 2 reports the experimental results. We compare our models to the two best scoring submissions [9, 10] in the VoxSRC 2019. From each of these submissions, we report the results of the best single model and the best fusion output.

The results demonstrate that the sum of metric learning and classification-based losses works best in most scenarios. The batch normalisation layer applied to the output contributes a significant improvement in performance for the classification objectives.

The performance optimised model trained with the AP+Softmax loss and without the embedding batch normalisation produces an EER of 5.19% and a MinDCF of 0.314 on the VoxSRC 2020 test set.
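For reference, a minimal MinDCF computation under these parameters might look as follows (a sketch, not the official NIST scoring tool):

```python
import numpy as np

def min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum normalised detection cost over all score thresholds.

    scores: similarity scores; labels: 1 for target trials, 0 for non-targets.
    """
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    n_tar, n_non = labels.sum(), len(labels) - labels.sum()
    # Sweeping the threshold upward: targets below it are misses,
    # non-targets at or above it are false alarms.
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tar])
    p_fa = np.concatenate([[1.0], 1 - np.cumsum(1 - labels) / n_non])
    dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    # Normalise by the cost of a system that always accepts or rejects.
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
```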
4. Conclusion
This report describes our baseline system for the 2020 VoxSRC Speaker Recognition Challenge. The proposed system is trained using a combination of metric learning and classification-based objectives. Our best model outperforms all single model systems and all but one ensemble system in last year's challenge. We release the full training code and pre-trained models as unofficial baselines for the challenge.
5. References

[1] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770–778.
[3] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, 2018.
[4] W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," in Speaker Odyssey, 2018.
[5] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in Proc. ICASSP, 2019.
[6] J. S. Chung, J. Huh, and S. Mun, "Delving into VoxCeleb: environment invariant speaker recognition," in Speaker Odyssey, 2020.
[7] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, "In defence of metric learning for speaker recognition," in Proc. Interspeech, 2020.
[8] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," in Proc. Interspeech, 2018.
[9] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[10] D. Garcia-Romero, A. McCree, D. Snyder, and G. Sell, "JHU-HLTCOE system for the VoxSRC speaker recognition challenge," in Proc. ICASSP, 2020, pp. 7559–7563.
[11] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large margin cosine loss for deep face recognition," in Proc. CVPR, 2018, pp. 5265–5274.
[12] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proc. CVPR, 2019, pp. 4690–4699.
[13] X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, "Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019, pp. 1652–1656.
[14] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Proc. Interspeech, 2017.
[15] J. S. Chung, A. Nagrani, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, and A. Zisserman, "VoxSRC 2019: The first VoxCeleb speaker recognition challenge," arXiv preprint arXiv:1912.02522, 2019.
[16] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[17] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, 2017, pp. 5220–5224.
[18] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in NeurIPS, 2019, pp. 8024–8035.
[20] N. Sung, M. Kim, H. Jo, Y. Yang, J. Kim, L. Lausen, Y. Kim, G. Lee, D. Kwak, J.-W. Ha et al., "NSML: A machine learning platform that enables you to focus on your models," arXiv preprint arXiv:1712.05902, 2017.
[21]