The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge
Danwei Cai, Xiaoyi Qin, Weicheng Cai, Ming Li
Data Science Research Center, Duke Kunshan University, Kunshan, China
School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
[email protected]
Abstract
In this paper, we present the DKU system for the speaker recognition task of the VOiCES from a distance challenge 2019. We investigate the whole system pipeline for far-field speaker verification, including data pre-processing, short-term spectral feature representation, utterance-level speaker modeling, back-end scoring, and score normalization. Our best single system employs a residual neural network trained with angular softmax loss. Also, the weighted prediction error algorithm can further improve performance. It achieves 0.3668 minDCF and 5.58% EER on the evaluation set using simple cosine similarity scoring. Finally, the submitted primary system obtains 0.3532 minDCF and 4.96% EER on the evaluation set.
Index Terms: speaker recognition, far-field speech, deep ResNet, angular softmax, WPE
1. Introduction
In the past decade, the performance of speaker recognition has improved significantly. The i-vector based method [1] and the deep neural network (DNN) based methods [2, 3] have promoted the development of speaker recognition technology in telephone-channel and close-talking scenarios. However, speaker recognition under far-field and complex environmental settings is still challenging due to the effects of long-range fading, room reverberation, and complex environmental noises. A speech signal propagating over a long range suffers from fading, absorption, and reflection by various objects, which change the pressure level at different frequencies and degrade the signal quality [4]. Reverberation includes early reverberation and late reverberation. Early reverberation (i.e., reflections within 50 to 100 ms after the direct wave arrives at the microphone) can improve the received speech quality, while late reverberation degrades it. The adverse effects of reverberation on the speech signal include smearing spectro-temporal structures, amplifying the low-frequency energy, and flattening the formant transitions [5]. Also, complex environmental noises "fill in" regions with low speech energy in the time-frequency plane and blur the spectral details [4]. These effects result in the loss of speech intelligibility and speech quality, imposing great challenges on far-field speaker recognition and far-field speech recognition.

To compensate for the adverse impacts of room reverberation and environmental noise, various approaches have been proposed at different stages of the speaker recognition system. At the signal level, dereverberation [6], denoising [7, 8, 9, 10], and beamforming [11, 12] can be used for speech enhancement. At the feature level, sub-band Hilbert envelope based features [13, 14], warped minimum variance distortionless response (MVDR) cepstral coefficients [15], and blind spectral weighting (BSW) based features [16] have been applied to ASV systems to suppress the adverse impacts of reverberation and noise. At the model level, reverberation matching with multi-condition training models has been successfully employed within universal background model (UBM) or i-vector based front-end systems [17, 18]. In back-end modeling, multi-condition training of probabilistic linear discriminant analysis (PLDA) models has been employed in i-vector systems [19]. The robustness of deep speaker embeddings for far-field speech has also been investigated in [20]. Finally, at the score level, score normalization [17] and multi-channel score fusion [21, 22] have been applied in far-field ASV systems to improve robustness.

The "VOiCES from a Distance Challenge 2019" is designed to foster research in the areas of speaker recognition and automatic speech recognition (ASR) with a special focus on single-channel far-field audio under noisy conditions [23]. Our system pipeline consists of six main components: data pre-processing, short-term spectral feature extraction, utterance-level speaker modeling, back-end scoring, score normalization, as well as fusion and calibration.

This paper is organized as follows: Section 2 describes the details of our submitted system. Section 3 clarifies the data usage, with experimental results and analysis. Conclusions are drawn in Section 4.

(This research was funded in part by the National Natural Science Foundation of China (61773413), Natural Science Foundation of Guangzhou City (201707010363), the Six Talent Peaks project in Jiangsu Province (JY-074), the Science and Technology Program of Guangzhou City (201903010040), and Huawei. We also thank Weixiang Hu, Yu Lu, Zexin Liu and Lei Miao from Huawei Digital Technologies Co., Ltd, China.)
2. System descriptions
2.1. Data pre-processing

2.1.1. Data augmentation

We adopt two kinds of data augmentation strategies. The first is the same as in the x-vector system available in the Kaldi VoxCeleb recipe, which employs additive noises and reverberation. We also use pyroomacoustics [24] to simulate room acoustics with a room impulse response (RIR) generator based on the image source model (ISM) algorithm. The microphones, distractors, and speech source are similar to the room settings presented in [25]. We use the music and noise parts of the MUSAN dataset [26] to generate television noise, and the 'us-gov' part to create babble noise.

For the systems described below, we use the Kaldi data augmentation strategy for the MFCC i-vector system and the TDNN x-vector system, and the pyroomacoustics data augmentation strategy for the remaining systems.

2.1.2. Dereverberation

The weighted prediction error (WPE) algorithm is a successful method for reducing late reverberation [6]. It estimates the optimal dereverberation filter coefficients through iterative optimization. During enrollment and testing, we use single-channel WPE to dereverberate the audio with a dereverberation filter of 10 coefficients, using a publicly available WPE implementation.
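As an illustration of this dereverberation step, the sketch below applies single-channel WPE with a 10-tap filter. The nara_wpe package, the STFT settings, the delay and iteration counts, and the file names are our assumptions for illustration; the paper only specifies the 10-coefficient filter and does not name its implementation.

```python
import soundfile as sf
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

# Load a single-channel recording; nara_wpe expects (channels, samples).
audio, sample_rate = sf.read("enroll_utterance.wav")   # hypothetical file name
signal = audio[None, :]                                 # (1, num_samples)

# STFT analysis; window size and shift are illustrative choices.
stft_options = dict(size=512, shift=128)
Y = stft(signal, **stft_options).transpose(2, 0, 1)     # (freq, channels, frames)

# WPE with a 10-tap dereverberation filter, as described above.
Z = wpe(Y, taps=10, delay=3, iterations=5).transpose(1, 2, 0)

dereverberated = istft(Z, size=stft_options["size"], shift=stft_options["shift"])[0]
sf.write("enroll_utterance_wpe.wav", dereverberated, sample_rate)
```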
2.2. Short-term spectral features

Four features are adopted in our systems: Mel-frequency cepstral coefficients (MFCC), power-normalized cepstral coefficients (PNCC), Mel-filterbank energies (Mfbank), and Gammatone-filterbank energies (Gfbank).

2.2.1. MFCC

Two kinds of MFCC features with different numbers of cepstral coefficients are adopted, resulting in 20- and 30-dimensional MFCCs (MFCC-20 and MFCC-30). MFCC-20 is used for the i-vector system, and MFCC-30 for the TDNN x-vector system. Short-time cepstral mean subtraction (CMS) over a 3-second sliding window is applied. For MFCC-20, the first and second derivatives are computed before applying CMS.
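A minimal sketch of MFCC-20 extraction with delta features and 3-second sliding-window cepstral mean subtraction is given below. It uses librosa rather than the authors' (unspecified) front-end, and the 25 ms window and 10 ms hop are assumed analysis settings.

```python
import librosa
import numpy as np

def sliding_cms(feats: np.ndarray, window_frames: int = 300) -> np.ndarray:
    """Short-time cepstral mean subtraction over a sliding window.

    feats: (num_frames, feat_dim); 300 frames at a 10 ms hop is about 3 s.
    """
    out = np.empty_like(feats)
    half = window_frames // 2
    for t in range(feats.shape[0]):
        lo, hi = max(0, t - half), min(feats.shape[0], t + half)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out

def extract_mfcc20(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(wav_path, sr=sr)
    # 20 static MFCCs with assumed 25 ms window / 10 ms hop.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    # Append first and second derivatives before cepstral mean subtraction.
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])   # (60, num_frames)
    return sliding_cms(feats.T)                                  # (num_frames, 60)
```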
2.2.2. PNCC

PNCC has been shown to be more robust than MFCC in various types of additive noise and reverberant environments for ASR [27]. The major features of PNCC processing include a power-law nonlinearity that replaces the traditional log nonlinearity used in MFCC, a noise-suppression algorithm based on asymmetric filtering that suppresses background excitation, and a module that accomplishes temporal masking [27]. 20-dimensional PNCCs are extracted using a 25 ms window with a 10 ms shift. First and second derivatives are computed before applying CMS.
2.2.3. Mel-filterbank energies

Each audio file is converted to 64-dimensional log Mel-filterbank energies with Mel filter banks ranging from 20 to 7600 Hz (Mfbank-16k). We also downsample the audio to an 8000 Hz sampling rate and use filter banks within the range of 20 to 3800 Hz to compute the Mfbank-8k features. Short-time cepstral mean subtraction is applied over a 3-second sliding window.
2.2.4. Gammatone-filterbank energies

Gammatone filters approximate the filtering of the human auditory system [28]. Gammatone filter banks within the range of 50 to 8000 Hz are used to compute the 64-dimensional Gammatone-filterbank energies (Gfbank). Short-time CMS is then applied over a 3-second sliding window.
2.3. Utterance-level speaker modeling

We extract utterance-level speaker embeddings with three state-of-the-art models: the i-vector system [1], the TDNN x-vector system [2], and the deep ResNet system [3].
2.3.1. I-vector

We train two i-vector systems, one on MFCC-20 and one on PNCC features. The extracted 60-dimensional features (static coefficients plus first and second derivatives) are used to train a 2048-component Gaussian mixture model-universal background model (GMM-UBM) with full covariance matrices. Zero-order and first-order Baum-Welch statistics are then computed on the UBM for each recording, and a single factor analysis (total variability) model is employed to extract 600-dimensional i-vectors [1].
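The Baum-Welch statistics that feed the i-vector extractor can be sketched as follows. This is a minimal illustration with scikit-learn's GaussianMixture standing in for the full-covariance GMM-UBM; the authors' actual Kaldi-style implementation is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def baum_welch_stats(ubm: GaussianMixture, features: np.ndarray):
    """Zero- and first-order Baum-Welch statistics of one recording.

    features: (num_frames, 60) array of MFCC-20 or PNCC vectors with deltas.
    Returns N of shape (num_components,) and centered F of shape
    (num_components, feat_dim), the usual inputs to total variability
    (i-vector) training and extraction.
    """
    posteriors = ubm.predict_proba(features)      # (T, C) component posteriors
    n = posteriors.sum(axis=0)                    # zero-order statistics
    f = posteriors.T @ features                   # first-order statistics
    # Center the first-order statistics around the UBM component means.
    return n, f - n[:, None] * ubm.means_
```

Here a 2048-component GaussianMixture(covariance_type='full') would play the role of the GMM-UBM described above.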
2.3.2. TDNN x-vector

The x-vector system is developed by adapting the Kaldi VoxCeleb recipe. The x-vector extractor is a DNN trained to discriminate among the speakers in the training set. The first five time-delay layers operate at the frame level. A temporal statistics pooling layer then computes the mean and standard deviation over all frames of an input segment. The resulting segment-level representation is fed into two fully connected layers that classify the speakers in the training set. After training, speaker embeddings are extracted from the 512-dimensional affine component of the first fully connected layer.
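A minimal PyTorch sketch of this architecture is shown below, assuming MFCC-30 input. The kernel sizes, dilations, and hidden widths are illustrative and do not reproduce the exact Kaldi recipe configuration.

```python
import torch
import torch.nn as nn

class TDNNXVector(nn.Module):
    """Simplified x-vector extractor: five time-delay (dilated Conv1d) layers,
    statistics pooling (mean + standard deviation), and two fully connected
    layers; the speaker embedding is the 512-dim affine output of the first
    fully connected layer."""

    def __init__(self, feat_dim: int = 30, num_speakers: int = 7245,
                 emb_dim: int = 512):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.fc1 = nn.Linear(2 * 1500, emb_dim)      # x-vector is taken here
        self.fc2 = nn.Linear(emb_dim, emb_dim)
        self.out = nn.Linear(emb_dim, num_speakers)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor):
        # x: (batch, feat_dim, num_frames), e.g. MFCC-30 sequences.
        h = self.frame_layers(x)                                  # (B, 1500, T')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # (B, 3000)
        xvector = self.fc1(stats)                                 # (B, 512)
        h = self.relu(self.fc2(self.relu(xvector)))
        return self.out(h), xvector
```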
2.3.3. Deep ResNet

We follow the deep ResNet system described in [29, 3, 30] and increase the widths (numbers of channels) of the residual blocks from {16, 32, 64, 128} to {32, 64, 128, 256}. The network architecture contains three main components: a front-end ResNet, a pooling layer, and a feed-forward network. The front-end ResNet transforms the raw feature into a high-level abstract representation. The subsequent pooling layer outputs a single utterance-level representation: mean statistics are accumulated for each feature map, producing a 256-dimensional utterance-level representation. Each unit in the output layer represents a target speaker identity.

All components in the pipeline are jointly learned in an end-to-end manner with a unified loss function. We adopt the typical softmax loss as well as the angular softmax loss (A-softmax) [31]. A-softmax learns angularly discriminative features by enforcing an angular classification margin between embeddings of different classes (a simplified sketch is given after the system list below). The superiority of A-softmax has been shown in face recognition [31] as well as in language recognition and speaker recognition [3].

After training, the 256-dimensional utterance-level speaker embedding is extracted from the penultimate layer of the network for a given utterance. In the testing stage, the full-length feature sequence is fed directly into the network without any truncation or padding.

Based on the deep ResNet framework, we investigate multiple short-term spectral features and loss functions. Finally, we train four networks with different setups:
• Mfbank-8k + Softmax: ResNet trained on Mfbank-8k features with softmax loss.
• Mfbank-16k + Softmax: ResNet trained on Mfbank-16k features with softmax loss.
• Mfbank-16k + A-softmax: ResNet trained on Mfbank-16k features with A-softmax loss.
• Gfbank + A-softmax: ResNet trained on Gfbank features with A-softmax loss.
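The following PyTorch sketch illustrates an angular softmax classification layer of the kind used above. It is a simplified variant: the lambda annealing schedule used in practice is omitted and the hyperparameters are placeholders, so it should not be read as the authors' exact training setup.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASoftmaxLoss(nn.Module):
    """Angular softmax (A-softmax) classification layer, simplified.

    The target-class logit uses psi(theta) = (-1)^k * cos(m*theta) - 2k,
    with k = floor(m*theta / pi), instead of cos(theta), which enforces an
    angular margin of factor m between classes.  `lam` blends the margin
    logit with the plain cosine logit; in practice it is annealed during
    training, which is omitted here.
    """

    def __init__(self, emb_dim: int = 256, num_classes: int = 7245,
                 m: int = 4, lam: float = 5.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim) * 0.01)
        self.m, self.lam = m, lam

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine of the angle between each embedding and each class weight.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        k = torch.floor(self.m * theta / math.pi)
        sign = 1.0 - 2.0 * (k % 2)                     # (-1)^k
        psi = sign * torch.cos(self.m * theta) - 2.0 * k

        # Use the margin-modified logit only for the target class.
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).bool()
        target = (self.lam * cos + psi) / (1.0 + self.lam)
        logits = emb.norm(dim=1, keepdim=True) * torch.where(one_hot, target, cos)
        return F.cross_entropy(logits, labels)
```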
Table 1: Results on the development subset and the evaluation set for the speaker recognition task of the VOiCES from a distance challenge (SN: score normalization; devW: whitening using the development subset).

| Front-end | Back-end | WPE | SN | Dev minC | Dev actC | Dev EER[%] | Eval minC | Eval actC | Eval EER[%] |
|---|---|---|---|---|---|---|---|---|---|
| MFCC i-vector | PLDA | - | √ | 0.4594 | 0.4697 | 5.29 | 0.6498 | 0.7152 | 10.09 |
| x-vector | CORAL + PLDA | - | √ | 0.3617 | 0.3688 | 4.52 | 0.5417 | 0.5544 | 7.54 |
| Mfbank-8k ResNet + Softmax | CORAL + devW + PLDA | - | - | 0.4557 | 0.5246 | 5.41 | 0.6608 | 0.7128 | 10.92 |
| Mfbank-8k ResNet + Softmax | CORAL + devW + PLDA | √ | - | 0.3934 | 0.4611 | 4.59 | 0.5929 | 0.6424 | 9.75 |
| Mfbank-16k ResNet + Softmax | cosine similarity | - | - | 0.3608 | 1 | 3.81 | 0.6262 | 1 | 8.75 |
| Mfbank-16k ResNet + Softmax | cosine similarity | √ | - | 0.3245 | 1 | 3.02 | 0.5507 | 1 | 7.91 |
| Mfbank-16k ResNet + A-Softmax | cosine similarity | - | - | 0.2735 | 1 | – | – | – | – |
| Mfbank-16k ResNet + A-Softmax | cosine similarity | √ | - | – | – | – | – | – | – |
| Gfbank ResNet + A-Softmax | cosine similarity | - | - | 0.3065 | 1 | 3.52 | 0.4411 | 1 | 6.78 |
| Gfbank ResNet + A-Softmax | cosine similarity | √ | - | – | – | – | – | – | – |

2.4. Back-end scoring

In back-end modeling, we use either cosine similarity based scoring or probabilistic linear discriminant analysis (PLDA) based scoring.

2.4.1. Cosine similarity

We use cosine similarity as the scoring method for the ResNet based systems. The score of a given enrollment-test pair is calculated as the cosine similarity of the two embeddings.
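The cosine back-end reduces to a dot product between length-normalized embeddings; a minimal sketch with hypothetical array shapes follows.

```python
import numpy as np

def cosine_scores(enroll: np.ndarray, test: np.ndarray) -> np.ndarray:
    """Cosine similarity scores between enrollment and test embeddings.

    enroll: (num_enroll, dim) and test: (num_test, dim) embedding matrices.
    Returns a (num_enroll, num_test) score matrix, one score per trial.
    """
    enroll = enroll / np.linalg.norm(enroll, axis=1, keepdims=True)
    test = test / np.linalg.norm(test, axis=1, keepdims=True)
    return enroll @ test.T
```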
2.4.2. Domain adaptation and PLDA

We use correlation alignment (CORAL) [32, 33] to align the distributions of out-of-domain and in-domain features in an unsupervised way by matching second-order statistics, i.e., covariances. To minimize the distance between the covariances of the out-of-domain and in-domain features, a linear transformation A is applied to the original source features, with the Frobenius norm as the matrix distance metric:

\min_A \| C_{\hat{S}} - C_T \|_F = \min_A \| A^\top C_S A - C_T \|_F   (1)

where C_S and C_T are the covariance matrices of the source-domain and target-domain features, C_{\hat{S}} is the covariance of the transformed source features, and \|\cdot\|_F denotes the matrix Frobenius norm.

The embeddings after domain adaptation are whitened and unit-length normalized. The whitening transform is estimated on either the training set or the development subset. A Gaussian PLDA model [34] with a full-covariance residual noise term is then trained on the speaker-discriminant features. After the PLDA model is trained, the score of a given enrollment-test pair is calculated as the log-likelihood ratio under the PLDA model.

2.5. Score normalization

After scoring, the results of all trials are subject to score normalization. We utilize adaptive symmetric score normalization (AS-Norm) in our systems [35]. The adaptive cohort for the enrollment utterance e is selected as the X closest (most positive scoring) cohort files, denoted \mathcal{E}_e^{top}. The cohort scores for the enrollment utterance are then

S_e(\mathcal{E}_e^{top}) = \{ s(e, \varepsilon) \mid \forall \varepsilon \in \mathcal{E}_e^{top} \}   (2)

and the AS-Norm score is

\tilde{s}(e,t) = \frac{1}{2}\left( \frac{s(e,t) - \mu[S_e(\mathcal{E}_e^{top})]}{\sigma[S_e(\mathcal{E}_e^{top})]} + \frac{s(e,t) - \mu[S_t(\mathcal{E}_t^{top})]}{\sigma[S_t(\mathcal{E}_t^{top})]} \right)   (3)

A numpy sketch of the CORAL alignment and AS-Norm steps is given at the end of this section.

2.6. Fusion and calibration

All subsystems are fused and calibrated using the BOSARIS toolkit [36], which learns a scale and a bias for each subsystem. The final fusion is an equal-weighted score-level sum after applying the scale and the bias.
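The following numpy sketch illustrates the closed-form CORAL solution to eq. (1) and the AS-Norm of eq. (3). It is a minimal illustration under our own assumptions: the regularization constants and the cohort size (the paper does not report X) are placeholders, and it is not the authors' exact implementation.

```python
import numpy as np

def _sym_power(mat: np.ndarray, power: float) -> np.ndarray:
    """Power of a symmetric positive (semi-)definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.clip(vals, 1e-12, None) ** power) @ vecs.T

def coral_adapt(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Closed-form CORAL (eq. 1): whiten the out-of-domain embeddings with
    their own covariance, then re-color them with the in-domain covariance.

    source: (N_s, dim) out-of-domain embeddings; target: (N_t, dim) in-domain
    embeddings.  Returns the adapted source embeddings.
    """
    c_s = np.cov(source, rowvar=False) + 1e-6 * np.eye(source.shape[1])
    c_t = np.cov(target, rowvar=False) + 1e-6 * np.eye(target.shape[1])
    a = _sym_power(c_s, -0.5) @ _sym_power(c_t, 0.5)   # minimizer of eq. (1)
    return source @ a

def as_norm(raw_score: float, enroll_cohort: np.ndarray,
            test_cohort: np.ndarray, top_x: int = 200) -> float:
    """Adaptive symmetric score normalization (eqs. 2-3).

    enroll_cohort / test_cohort: scores of the enrollment / test utterance
    against all cohort files; the top_x most positive scores form the
    adaptive cohorts.  top_x = 200 is a placeholder value.
    """
    s_e = np.sort(enroll_cohort)[-top_x:]
    s_t = np.sort(test_cohort)[-top_x:]
    return 0.5 * ((raw_score - s_e.mean()) / s_e.std()
                  + (raw_score - s_t.mean()) / s_t.std())
```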
3. Experiments
The training data includes VoxCeleb 1 [37] and VoxCeleb 2 [38]. The original distribution of VoxCeleb splits each video into multiple short segments. During training, the segments from the same video are concatenated into a single waveform, which results in 167,897 utterances from 7,245 speakers. No voice activity detection (VAD) is applied.

For the development data, we only use a subset of the development dataset provided by the VOiCES challenge. The 196 speakers in the original development dataset are split into two subgroups of 98 speakers each. One subset is used as the new development set, and the other is used as the domain adaptation and score normalization corpus. In this way, we reduce the original 4,005,888 trials to 999,424 trials. Since part of the development data is used for domain adaptation and score normalization, we cannot report results on the whole development set, so all development results presented in this paper use the new subset of trials.
Table 1 reports the performance of the systems built on the different front-end speaker-discriminant features, each paired with its top-1 back-end.
Figure 1: DET plots for the development and evaluation datasets with original or dereverberated audio. [Four panels: development/original, development/dereverberated, evaluation/original, evaluation/dereverberated; each compares the MFCC i-vector, PNCC i-vector, x-vector, ResNet 8k, ResNet 16k, ResNet Gammatone, and ResNet A-softmax systems.]
Table 2: System performance with different fusion strategies.
| Fusion strategy | Dev minC | Dev actC | Dev EER[%] | Dev Cllr | Eval minC | Eval actC | Eval EER[%] | Eval Cllr |
|---|---|---|---|---|---|---|---|---|
| Best single system (ResNet + A-softmax + WPE) | 0.2485 | 1 | 2.41 | 0.8060 | 0.3668 | 1 | 5.58 | 0.8284 |
| Each embedding with top-1 back-end | 0.1831 | 0.1857 | 1.93 | 0.0808 | – | – | – | – |
| Each embedding with top-2 back-ends | 0.1644 | 0.1659 | 1.48 | 0.0710 | 0.3555 | 0.3578 | 4.79 | 0.2684 |
| Each embedding with top-3 back-ends (submission) | 0.1473 | – | – | – | 0.3532 | – | 4.96 | – |
For the seven kinds of front-end systems, embeddings are extracted from both the original audio and the dereverberated audio, resulting in 14 types of front-end speaker-discriminant features. Then, different back-end modeling methods, including cosine scoring, different PLDA configurations, and different score normalization settings, are applied to these features. For each speaker embedding, the three back-ends with the best performance on that particular embedding are selected, and we finally obtain 42 individual scores for the final fusion.

The final results on the development subset and the evaluation set are shown in Table 2. Our final submission obtains a minDCF of 0.1473 and 0.3532 on the development and evaluation sets, respectively.

After the evaluation, we investigated the system performance fused with different back-ends. Interestingly, although fusing the top 3 back-ends for each front-end embedding improves the development performance by about 20% relative compared to fusing only the top-1 back-ends, the evaluation results show the opposite: fusing the top 3 back-ends degrades the performance by about 10% compared to the system fused with the top-1 back-ends. This is mainly due to the mismatch between the development and evaluation data.
4. Conclusions
We presented the components and analyzed the results of the DKU-SMIIP speaker recognition system for the VOiCES from a Distance Challenge 2019. We use different acoustic features, different front-end modeling methods, and various back-end scoring methods. To further improve performance, we use WPE to dereverberate the development and evaluation data. This enables a series of incremental improvements, and the fusion shows that the different subsystems are complementary to each other at the score level.

5. References

[1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[2] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN Embeddings for Speaker Recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
[3] W. Cai, J. Chen, and M. Li, "Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System," in Odyssey: The Speaker and Language Recognition Workshop, 2018, pp. 74–81.
[4] M. Wölfel and J. McDonough, Distant Speech Recognition. John Wiley & Sons, 2009.
[5] P. Assmann and Q. Summerfield, "The Perception of Speech Under Adverse Conditions," in Speech Processing in the Auditory System. Springer New York, 2004, pp. 231–308.
[6] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
[7] X. Zhao, Y. Wang, and D. Wang, "Robust Speaker Identification in Noisy and Reverberant Conditions," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 836–845, 2014.
[8] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech Enhancement Using Long Short-Term Memory based Recurrent Neural Networks for Noise Robust Speaker Verification," in IEEE Spoken Language Technology Workshop, 2016, pp. 305–311.
[9] Z. Oo, Y. Kawakami, L. Wang, S. Nakagawa, X. Xiao, and M. Iwahashi, "DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identification," in Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 2204–2208.
[10] S. E. Eskimez, P. Soufleris, Z. Duan, and W. Heinzelman, "Front-end Speech Enhancement for Commercial Speaker Verification Systems," Speech Communication, vol. 99, pp. 101–113, 2018.
[11] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural Network Based Spectral Mask Estimation for Acoustic Beamforming," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 196–200.
[12] E. Warsitz and R. Haeb-Umbach, "Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1529–1539, 2007.
[13] T. Falk and W.-Y. Chan, "Modulation Spectral Features for Robust Far-Field Speaker Identification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 90–100, 2010.
[14] S. O. Sadjadi and J. H. L. Hansen, "Hilbert Envelope Based Features for Robust Speaker Identification Under Reverberant Mismatched Conditions," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5448–5451.
[15] Q. Jin, R. Li, Q. Yang, K. Laskowski, and T. Schultz, "Speaker Identification with Distant Microphone Speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 4518–4521.
[16] S. O. Sadjadi and J. H. L. Hansen, "Blind Spectral Weighting for Robust Speaker Identification under Reverberation Mismatch," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 5, pp. 937–945, 2014.
[17] I. Peer, B. Rafaely, and Y. Zigel, "Reverberation Matching for Speaker Recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 4829–4832.
[18] A. R. Avila, M. Sarria-Paja, F. J. Fraga, D. O'Shaughnessy, and T. H. Falk, "Improving the Performance of Far-Field Speaker Verification Using Multi-Condition Training: The Case of GMM-UBM and i-Vector Systems," in Proceedings of the Annual Conference of the International Speech Communication Association, 2014, pp. 1096–1100.
[19] D. Garcia-Romero, X. Zhou, and C. Y. Espy-Wilson, "Multicondition Training of Gaussian PLDA Models in i-vector Space for Noise and Reverberation Robust Speaker Recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4257–4260.
[20] M. K. Nandwana, J. van Hout, M. McLaren, A. Stauffer, C. Richey, A. Lawson, and M. Graciarena, "Robust Speaker Recognition from Distant Speech under Real Reverberant Environments Using Speaker Embeddings," in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1106–1110.
[21] Q. Jin, T. Schultz, and A. Waibel, "Far-Field Speaker Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2023–2032, 2007.
[22] M. Ji, S. Kim, H. Kim, and H.-S. Yoon, "Text-Independent Speaker Identification Using Soft Channel Selection in Home Robot Environments," IEEE Transactions on Consumer Electronics, vol. 54, no. 1, pp. 140–144, 2008.
[23] M. K. Nandwana, J. van Hout, M. McLaren, C. Richey, A. Lawson, and M. A. Barrios, "The VOiCES from a Distance Challenge 2019 Evaluation Plan," arXiv:1902.10828 [eess.AS], 2019.
[24] R. Scheibler, E. Bezzam, and I. Dokmanić, "Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 351–355.
[25] C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, P. Gamble, J. Hetherly, C. Stephenson, and K. Ni, "Voices Obscured in Complex Environmental Settings (VOiCES) Corpus," in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1566–1570.
[26] D. Snyder, G. Chen, and D. Povey, "MUSAN: A Music, Speech, and Noise Corpus," arXiv:1510.08484 [cs], 2015.
[27] C. Kim and R. M. Stern, "Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.
[28] R. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, "Complex Sounds and Auditory Images," in Auditory Physiology and Perception, Y. Cazals, L. Demany, and K. Horner, Eds. Oxford, UK: Pergamon Press, 1992, pp. 429–446.
[29] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, "Insights into End-to-End Learning Scheme for Language Identification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5209–5213.
[30] W. Cai, J. Chen, and M. Li, "Analysis of Length Normalization in End-to-End Speaker Verification System," in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 3618–3622.
[31] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep Hypersphere Embedding for Face Recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.
[32] B. Sun, J. Feng, and K. Saenko, "Return of Frustratingly Easy Domain Adaptation," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2058–2065.
[33] M. J. Alam, G. Bhattacharya, and P. Kenny, "Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation," in Odyssey: The Speaker and Language Recognition Workshop, 2018.
[34] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector Length Normalization in Speaker Recognition Systems," in Proceedings of the Annual Conference of the International Speech Communication Association, 2011, pp. 249–252.
[35] P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and J. Černocký, "Analysis of Score Normalization in Multilingual Speaker Recognition," in Proceedings of the Annual Conference of the International Speech Communication Association, 2017.
[36] N. Brümmer and E. de Villiers, "The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF," arXiv preprint arXiv:1304.2865, 2013.
[37] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A Large-Scale Speaker Identification Dataset," in Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2616–2620.
[38] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," in Proceedings of the Annual Conference of the International Speech Communication Association, 2018.