The 2020 Personalized Voice Trigger Challenge: Open Database, Evaluation Metrics and the Baseline Systems
Yan Jia, Xingming Wang, Xiaoyi Qin, Yinping Zhang, Xuyang Wang, Junjie Wang, Ming Li
Data Science Research Center, Duke Kunshan University, Kunshan, China
AI Lab of Lenovo Research, Beijing, China
School of Computer Science, Wuhan University, Wuhan, China
ABSTRACT
The 2020 Personalized Voice Trigger Challenge (PVTC2020) addresses two different research problems under a unified setup: joint wake-up word detection with speaker verification on close-talking single-microphone data and on far-field multi-channel microphone array data. Specifically, the second task poses an additional cross-channel matching challenge on top of the far-field condition. To simulate the real-life application scenario, the enrollment utterances are recorded from a close-talking cell phone only, while the test utterances are recorded from both the close-talking cell phone and the far-field microphone arrays. This paper introduces our challenge setup and the released database as well as the evaluation metrics. In addition, we present a joint end-to-end neural network baseline system trained with the proposed database for speaker-dependent wake-up word detection. Results show that the cost, calculated from the miss rate and the false alarm rate, can reach 0.37 in the close-talking single-microphone task and 0.31 in the far-field microphone array task. The official website and the open-source baseline system have been released.

Index Terms — open source database, wake-up word detection, speaker verification, joint learning
1. INTRODUCTION
Recently, speaker dependent voice trigger and wake-up word detection have been gaining popularity among speech researchers and developers, and have been deployed in many real-life applications. With the contribution of deep learning, the performance of wake-up word detection and speaker recognition systems has improved remarkably in both close-talking and far-field scenarios. The demand for authentication based on voice technologies, including keyword spotting (KWS) and text-dependent speaker verification (TDSV), is growing rapidly for personalized voice trigger devices. Generally, KWS aims to detect a predefined keyword or a set of keywords in a continuous audio stream. Since the successful application of the Hidden Markov Model (HMM) in large vocabulary continuous speech recognition (LVCSR), research on KWS systems has focused on statistical modeling. In recent years, various neural networks have been applied to KWS and achieved superior performance. HMMs are utilized to construct the keyword model and the filler/background model, where the background model is trained with non-keyword speech, background noises and silence [1-3]. Traditional Gaussian Mixture Models (GMM) were commonly used to model the acoustic features in HMM-based approaches, and they have recently been replaced by Deep Neural Networks (DNN). End-to-end DNNs were applied to KWS and shown to perform well compared with HMM-based wake-up systems [4]. Since then, more complex network structures have been adopted to build end-to-end KWS systems, including Convolutional Neural Networks [5], Recurrent Neural Networks [6, 7], etc. On the other hand, with the success of deep learning in the speaker verification field [8] and the demand for personalized triggers in smart home devices, the TDSV task has attracted much attention from speaker verification researchers.

However, conventional wake-up word detection and speaker verification are carried out separately in the pipeline, where a wake-up word detection system generates a successful trigger, followed by a speaker verification system that performs identity authentication. In such a case, the wake-up word detection system and the speaker verification system are optimized separately, not through an overall joint optimization with a unified goal. As a consequence, their respective network parameters and extracted information are not effectively shared and jointly utilized. In recent studies, combining phonetic information with speaker information has been a trend to improve the performance of speaker verification systems [9, 10]. The joint learning or multi-task learning network might be either very light at a small scale as a single always-on system, or at a much larger scale in the second stage after a successful wake-up by the first-stage voice trigger.
Therefore, there are still many open research fields that can be further explored for the joint modeling of wake-up word detection and speaker verification, including but not limited to:
• Compact network design for personalized wake-up word detection with speaker verification
• Loss function design for personalized wake-up word detection with speaker verification
• Multi-task learning for personalized wake-up word detection with speaker verification
• Network structure design for personalized wake-up word detection with speaker verification
• Domain adaptation for personalized wake-up word detection with speaker verification
• Integration of wake-up word detection and speaker verification
• Speaker verification with segmentation from wake-up word detection
• Etc.

The 2020 Personalized Voice Trigger Challenge (PVTC2020), which aims at providing a common platform for the research community to advance the state of the art, is a satellite event of the 12th International Symposium on Chinese Spoken Language Processing (ISCSLP 2021). The PVTC2020 challenge focuses on speaker dependent wake-up word detection. In this challenge, we release a database named XIAO-LE containing recordings of wake-up words under the smart home scenario. Besides, we also provide a two-stage speaker dependent KWS baseline system (https://github.com/lenovo-voice/THE-2020-PERSONALIZED-VOICE-TRIGGER-CHALLENGE-BASELINE-SYSTEM). When the KWS system triggers, we compare the trigger audio with the reference model created during the registration process and use another threshold to determine whether the sound that triggers the detector is likely to be the wake-up word uttered by the registered user.

The paper is organized as follows. The details of the XIAO-LE database are introduced in Section 2. Section 3 describes the challenge setup and the evaluation metrics. The adopted speaker dependent KWS baseline system is described in Section 4. Experimental results are shown in Section 5. Finally, the conclusion is provided in Section 6.
2. THE XIAO-LE DATABASE
The XIAO-LE database is provided by the AI Lab of Lenovo Research. It contains 658,995 utterances with 612 hours in total. The database covers 550 speakers and a wide range of channels, from close-talking microphones to multiple far-field microphone arrays. It can be used for far-field wake-up word recognition, far-field speaker verification, and speech enhancement.

The average duration of all utterances is around 3.8 seconds. During the recording process, the recording devices, including two cell phones (16 kHz, 16 bit) and four microphone arrays (with 4 or 6 channels per array, 16 kHz, 16 bit), were set in a room designed as a real smart home environment.

For the data collected by the microphone arrays, each audio file has 4- or 6-channel signals, while for the data collected by the cell phones, each utterance only has a two-channel signal. The recording devices and their corresponding distance information are shown in Table 1.
Table 1. Distance of different recording devices.

Device ID | Device and distance
id1 | Cell phone, 0.2 m
id2 | Cell phone, 0.8 m
id3 | Microphone array, 1 m
id4 | Microphone array, 3 m
id5 | Microphone array, 3 m
id6 | Microphone array, 5 m
3. CHALLENGE DESCRIPTION

3.1. Task Setting
Based on the XIAO-LE database, we have divided it into a training set, a development set, and two evaluation sets. Specifically, 300 speakers are selected for training, and 50 speakers are used as the development set. The remaining speakers of the database are used for evaluation. The challenge provides two tracks for the participants, and the second task poses a cross-channel challenge: enrollment speech from the close-talking cell phone and test speech from the far-field microphone arrays. In addition, to further simulate the real scenario, text-independent content and confusing words also appear in the trial files.

Only data collected by cell phones in the evaluation set from 100 speakers is adopted for performance evaluation in the first task. The evaluation set is separated into enrollment data and testing data. For each target speaker, the positive testing samples have ‘xiao le, xiao le' as a part of the speech, and it is indeed uttered by the target speaker. There might be some background noises or even other speakers' voices before the target speaker says ‘xiao le, xiao le'. However, all utterances considered as positive samples have ‘xiao le, xiao le' uttered by the target speaker at the end of the speech. Negative samples do not contain speech segments with the content ‘xiao le, xiao le' uttered by the target speaker. Note that utterances without ‘xiao le, xiao le' and utterances with ‘xiao le, xiao le' that is not uttered by the target speaker are both considered as negative samples.
For the second task, we adopt data from another 100 speakers in the evaluation set, with no overlap with the speakers in task 1. The evaluation data and the trials are constructed in the same way as in task 1. The only difference is that the testing data only includes multi-channel synchronized audio data collected by one of the far-field microphone arrays. Similar to task 1, ‘xiao le, xiao le' is always at the end of the sentence for all positive samples. In contrast, negative samples do not contain ‘xiao le, xiao le' or have speech segments of ‘xiao le, xiao le' that are not uttered by the target speaker. The details about the trial design are shown in Table 2.
For speaker verification, participants can use up to three audios for enrollment for each speaker. Each trial we provide contains three selected enrollment audios, one test audio, and the label which denotes whether the trial is positive or negative. In task 1, the utterances collected by the cell phone at 0.2 m are selected as the enrollment data, and the utterances collected by the cell phones at both 0.2 m and 0.8 m are used as the testing data. In task 2, the utterances collected by the cell phone at 0.2 m are selected as the enrollment data, while utterances collected by the multi-channel far-field microphone arrays are used for testing. We describe the different scenarios in the trial construction with more details in Table 2. The corresponding statistics are shown in Table 3.
In this challenge, we provide a leaderboard ranked by the metric score_wakeup. The speaker dependent KWS performance of our baseline system, as well as of the systems submitted by the participants, is measured by matching statistics. The statistic score_wakeup is calculated from the miss rate and the false alarm (FA) rate according to the following equation:

score_wakeup = Miss + alpha * FA    (1)

Table 2. Structure of the trial files. Note that ‘other text-independent segments' denotes speech segments other than ‘xiao le, xiao le'.

Includes ‘xiao le, xiao le' | ‘xiao le, xiao le' part is from the target speaker | Includes other text-independent segments from non-target speakers before ‘xiao le, xiao le' | Includes other text-independent segments from the target speaker | Label
yes | yes | no | no | positive
yes | yes | no | yes | positive
yes | yes | yes | no | positive
yes | no | no | no | negative
yes | no | no | yes | negative
yes | no | yes | no | negative
no | n/a | n/a | yes | negative
no | n/a | n/a | no | negative

Table 3. Details about the development and evaluation sets.

Set | Task | utterances | positive | negative | enrollment
Development | Task1 | 24.9k | 3.6k | 23.5k | 1.6k
Development | Task2 | 50.1k | 4.6k | 48.5k | 1.6k
Evaluation | Task1 | 159.2k | 19.7k | 148.1k | 3.1k
Evaluation | Task2 | 201.7k | 28.8k | 190.5k | 3.1k
Miss represents the proportion of errors among all positive-label samples, and FA refers to the error rate among all negative-label samples. The constant alpha is set to 19, which follows from the assumption that the prior probability of a positive sample is 0.05, i.e., alpha = (1 - 0.05) / 0.05 = 19.

In addition, the real-time factor (F_real-time) is evaluated as an auxiliary metric. It is calculated as the overall processing time of the evaluation trials on an Intel Core i5 clocked at 2.6 GHz or a similar processor divided by the total duration of all the testing samples:

F_real-time = T_process(s) / T_total_test(s)    (2)

where T_process(s) is the overall time cost of processing all the evaluation data in seconds, and T_total_test(s) is the total duration of the testing audios in seconds. In task 2, multi-channel data is treated as single-channel data when calculating T_total_test. Besides, extracting the speaker embeddings or features from the enrollment data is not counted in T_process(s). F_real-time is a mandatory self-reported metric. Each submission is considered valid only when the corresponding self-reported real-time factor is lower than the given threshold.
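For concreteness, both metrics can be computed from per-trial labels and system decisions as in the following Python sketch (this is our own illustration, not the official scoring script; the function and variable names are ours):

import numpy as np

def score_wake_up(labels, decisions, alpha=19.0):
    """Compute score_wakeup = Miss + alpha * FA (Eq. 1).
    labels:    1 for a positive trial, 0 for a negative trial
    decisions: 1 if the system triggered for the target speaker, 0 otherwise
    """
    labels = np.asarray(labels, dtype=bool)
    decisions = np.asarray(decisions, dtype=bool)
    miss = np.mean(~decisions[labels])   # rejected positives / all positives
    fa = np.mean(decisions[~labels])     # accepted negatives / all negatives
    return miss + alpha * fa

def real_time_factor(t_process_s, t_total_test_s):
    """F_real-time = processing time / total duration of the test audio (Eq. 2)."""
    return t_process_s / t_total_test_s

# toy usage: 3 positive and 4 negative trials
labels = [1, 1, 1, 0, 0, 0, 0]
decisions = [1, 0, 1, 0, 1, 0, 0]
print(score_wake_up(labels, decisions))   # Miss = 1/3, FA = 1/4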
4. THE BASELINE SYSTEMS

4.1. LSTM-based KWS system
This section presents our KWS baseline system, which is modified from the CNN-based KWS system [11]. As shown in Figure 1, our baseline system consists of three modules: (i) a feature extraction module, (ii) a stacked LSTM neural network, and (iii) a confidence calculation module.

The feature extraction module converts the audio signals into acoustic features. 80-dimensional log-mel filterbank features are extracted for speech frames with a 25 ms window size and a 10 ms window shift. Then we apply a segmental window of 40 frames to generate training samples that contain enough sub-word context information as the input of the model.
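A minimal sketch of this front end, assuming torchaudio is used for the log-mel filterbank computation (the released baseline may use a different feature library); the 25 ms window and 10 ms shift correspond to 400 and 160 samples at 16 kHz:

import torch
import torchaudio

# 80-dim mel filterbank, 25 ms window (400 samples) and 10 ms shift (160 samples) at 16 kHz
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400, hop_length=160, n_mels=80)

def extract_segments(waveform, context=40):
    """Convert a mono waveform into 40-frame training segments of log-mel features."""
    feats = torch.log(mel(waveform) + 1e-6)       # (1, 80, T)
    feats = feats.squeeze(0).transpose(0, 1)      # (T, 80)
    # slide a 40-frame window over the feature sequence
    segments = [feats[t:t + context] for t in range(feats.size(0) - context + 1)]
    return torch.stack(segments)                  # (num_segments, 40, 80)

# in practice: wav, sr = torchaudio.load("utterance.wav"); here a dummy 3-second signal
wav = torch.randn(1, 16000 * 3)
print(extract_segments(wav).shape)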
Fig. 1. Framework of the baseline system.

Our backbone network is constructed with a two-layer stacked LSTM structure, followed by an average pooling layer and a final linear projection layer. For all LSTM layers, the hidden dimension is set to 128. A fully connected layer and a final softmax activation layer are applied as the back-end prediction module to obtain the occurrence probabilities of the sub-words of the predefined keywords. Formally, suppose the input feature sequence of the model is x = (x_1, ..., x_t, ..., x_T) and the output sequence of the encoder is h = (h_1, ..., h_t, ..., h_T), where T equals 40, the length of the sequence. The encoder can be expressed as

h = LSTM(x)    (3)

Then the average pooling layer generates a context vector c:

c = (1/T) * sum_{t=1}^{T} h_t    (4)

Finally, we compute the occurrence probability of each sub-word (the HMM states of the keyword) with the fully connected layer and the softmax function:

p_{w_i}(x) = softmax(FC(c))    (5)

where p_{w_i}(x) refers to the network output for sub-word w_i given the input feature sequence x.

In the posterior handling module, after the acoustic feature sequence is projected to a posterior probability sequence of the keyword by the neural network, we adopt the method proposed in [12, 13] to make detection decisions. In this approach, we apply a sliding window with a length of T_conf frames to compute detection scores and denote the input acoustic features in a window as X = {x_1, x_2, ..., x_{T_conf}}. Let w = {w_1, w_2, ..., w_M} represent the sub-word sequence of the predefined keyword. Then the output confidence score h(X) is computed by Equation (6):

h(X) = [ max_{1 <= t_1 < ... < t_M <= T_conf} prod_{i=1}^{M} p_{w_i}(x_{t_i}) ]^{1/M}    (6)

where p_{w_i}(x_{t_i}) refers to the network output of the t_i-th frame for sub-word w_i. This method is suitable for the real-time situation: the system triggers whenever the confidence score is higher than the predefined threshold.
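The ordering-constrained maximum in Eq. (6) can be evaluated with a simple dynamic program over the window, sketched below (our own illustration; the released baseline may implement this search differently):

import numpy as np

def keyword_confidence(posteriors):
    """posteriors: (T_conf, M) array with posteriors[t, i] = p_{w_i}(x_t) inside the window.
    Returns the confidence of Eq. (6): the M-th root of the best product of sub-word
    posteriors subject to the ordering constraint t_1 < t_2 < ... < t_M.
    """
    T, M = posteriors.shape
    best = np.zeros(M)        # best[i]: best product ending with sub-word i so far
    for t in range(T):
        new_best = best.copy()
        for i in range(M):
            prev = best[i - 1] if i > 0 else 1.0   # sub-word i-1 must occur at an earlier frame
            new_best[i] = max(best[i], prev * posteriors[t, i])
        best = new_best
    return best[M - 1] ** (1.0 / M)

# toy usage: a 150-frame sliding window and, e.g., three sub-words
print(keyword_confidence(np.random.rand(150, 3)))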
We adopt online data augmentation to improve the robustness of the speaker verification system and to account for far-field microphone array data [14]. We use MUSAN [15] and RIRS-NOISES [16] as the noise sources. The signal-to-noise ratio (SNR) is set between 0 and 20 dB during pre-training and between 0 and 15 dB during fine-tuning.

The training process of the speaker verification baseline system is modified from the framework in [17]. The whole architecture contains a front-end feature extractor, an encoding layer and a back-end classifier. We use ResNet34 [18] with SE-blocks [19] as the feature extractor. Attentive statistics pooling (ASP) [20] is adopted as the encoding layer. The ASP layer uses an attention mechanism to give different weights to different frames and generates a weighted average and a weighted standard deviation at the same time, which can effectively capture longer-term speaker feature variations. AM-Softmax [21] is used as the back-end classifier of the system.
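For illustration, the attentive statistics pooling layer can be sketched as a small PyTorch module (a simplified version under our own naming; the exact attention parameterization used in the baseline may differ):

import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Weighted mean and weighted standard deviation over time, in the spirit of ASP [20]."""
    def __init__(self, feat_dim, bottleneck=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, feat_dim, kernel_size=1),
            nn.Softmax(dim=2),               # attention weights over the time axis
        )

    def forward(self, x):                    # x: (batch, feat_dim, time)
        w = self.attention(x)
        mean = torch.sum(w * x, dim=2)
        var = torch.sum(w * x * x, dim=2) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1) # (batch, 2 * feat_dim)

# toy usage on frame-level features from the ResNet34-SE extractor (dimension assumed)
pooled = AttentiveStatsPool(feat_dim=256)(torch.randn(4, 256, 100))
print(pooled.shape)                          # torch.Size([4, 512])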
According to the experiments in [22], the strategy of transfer learning performs well in far-field text-dependent speaker verification tasks. Therefore, we select the data from SLR38, SLR47 [23], SLR62, SLR82 [24] and SLR85 [25] on OpenSLR (http://openslr.org/resources.php) as the pre-training data. The statistics of the pre-training databases are shown in Table 4. After that, we carry out fine-tuning on the XIAO-LE database. The fine-tuning scheme consists of two steps: first, all the utterances of the pre-training databases are used to train a text-independent speaker verification system that serves as the pre-trained model; second, only the XIAO-LE database is used to fine-tune the pre-trained model and obtain the target text-dependent system.

Table 4. The statistics of the databases used for pre-training.

Database | Speakers | Total hours | Utterances
SLR38 | 855 | 100+ | 102600
SLR47 | 296 | 100+ | 50384
SLR62 | 600 | 200 | 237265
SLR82 | 274 | 1000 | 130108
SLR85 | 340 | 1500 | 108678
Our baseline system consists of the wake-up system and the speaker verification system described above. As shown in Figure 2, we design a two-stage system that responds whenever the target speaker says the trigger phrase. When the KWS system triggers, the speaker verification system starts to decide whether the voice that triggered the detector is likely to be from the enrolled user. During the enrollment stage, the average of the embedding vectors extracted from the target speaker's three enrollment utterances is saved as the enrollment speaker embedding vector.
Fig. 2. Framework of the baseline system.

We compare any possible new ‘xiao le, xiao le' utterance with the stored template as follows. The second-stage detector produces timing information used to convert the acoustic feature sequence into a fixed-length vector. A separate, specially trained speaker verification network transforms this vector into a speaker embedding. We compare the cosine distance to the reference template created during enrollment against another threshold to decide whether the sound that triggered the wake-up word detector is likely to be from the enrolled user. This process helps reduce the cases where the device is triggered by ‘xiao le, xiao le' spoken by another person, and reduces the rate at which other, similar-sounding phrases cause false triggers.
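The second-stage decision therefore reduces to averaging the enrollment embeddings into a template and thresholding a cosine score, as in the following sketch (the embedding dimension, function names and threshold value are our assumptions):

import torch
import torch.nn.functional as F

def enroll(embeddings):
    """Average the (up to three) enrollment embeddings into one speaker template."""
    return torch.stack(embeddings).mean(dim=0)

def second_stage_decision(test_embedding, template, sv_threshold):
    """Accept the wake-up only if the cosine score against the template is high enough."""
    score = F.cosine_similarity(test_embedding, template, dim=0)
    return score.item() >= sv_threshold

# toy usage with random 512-dimensional embeddings
template = enroll([torch.randn(512) for _ in range(3)])
print(second_stage_decision(torch.randn(512), template, sv_threshold=0.35))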
5. EXPERIMENTAL RESULTS

5.1. Experiment setup
We determine the target word labels by forced alignment with an LVCSR system. For the keyword ‘xiao le, xiao le', the ending times of the first ‘xiao', the first ‘le' and the second ‘xiao' are located, and each of them is centered in a window of 40 frames. 80-dimensional log filterbank features are adopted as the input acoustic features. The KWS system is trained with the cross-entropy loss. Stochastic gradient descent with Nesterov momentum is selected as the optimizer. The learning rate is initialized as 0.01 and decreased by a factor of 0.1 whenever the model reaches a training loss plateau. We train the KWS model for 100 epochs with a batch size of 128 and employ early stopping when the training loss stops decreasing. In the evaluation period, we use a sliding window of 150 frames to compute the confidence score.
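A sketch of this architecture and training configuration in PyTorch is shown below (the momentum value, the number of sub-word classes and the dummy batches are our assumptions; the real pipeline trains on 40-frame log-mel segments with forced-alignment labels):

import torch
import torch.nn as nn

# two-layer stacked LSTM (hidden size 128) + average pooling + linear classifier, as in Section 4.1
lstm = nn.LSTM(input_size=80, hidden_size=128, num_layers=2, batch_first=True)
fc = nn.Linear(128, 4)                        # 4 output classes assumed (sub-words + filler)
criterion = nn.CrossEntropyLoss()
params = list(lstm.parameters()) + list(fc.parameters())
# SGD with Nesterov momentum; the momentum value 0.9 is our assumption
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True)
# decrease the learning rate by a factor of 0.1 when the training loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

for epoch in range(100):                      # 100 epochs, batch size 128
    feats = torch.randn(128, 40, 80)          # dummy batch of 40-frame log-mel segments
    labels = torch.randint(0, 4, (128,))      # sub-word targets from forced alignment
    hidden, _ = lstm(feats)                   # (128, 40, 128)
    logits = fc(hidden.mean(dim=1))           # average pooling over time, then projection
    loss = criterion(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step(loss.item())               # plateau detection on the training loss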
For pre-training the speaker verification model, we also use stochastic gradient descent as the optimizer. The initial learning rate is set to 0.01 and decreased by a factor of 0.1 every 20 epochs. The pre-trained model is trained for 50 epochs with a batch size of 256. For fine-tuning, the initial learning rate is set to 0.001 and the number of training epochs is set to 20.
All the experiments are evaluated on an Intel(R) Xeon(R) Gold 5215 CPU clocked at 2.5 GHz.

Table 5. Performance of the keyword spotting model on the dev set (false rejection (FR) rate [%] at one false alarm (FA) per hour).
Model | F_real-time | Task1 | Task2
KWS baseline | 0.09 | 2.00 | 5.11
We choose the false rejection rate at one false alarm per hour as the KWS system performance criterion. Table 5 presents the KWS performance of the model in terms of the false rejection rate when the number of false alarms per hour is 1.
Table 6. Performance of the speaker verification model on the dev set (EER [%] and minDCF).
Model | F_real-time | Task1 EER | Task1 minDCF | Task2 EER | Task2 minDCF
SV baseline | 0.11 | 1.32 | 0.16 | 1.90 | 0.21
The results of our speaker verification system are shown in Table 6. The threshold of the speaker verification system is determined on the development set. Two ways of determining the threshold are used in our system. The first uses the EER (equal error rate) threshold and corresponds to the baseline version 1 (V1) system. The second uses the mean of the EER threshold and the minDCF [26] threshold, i.e., (threshold_EER + threshold_minDCF) / 2, which greatly improves the results on the development set and corresponds to the baseline version 2 (V2) system. The results of the second method on the development set are shown as system V2 in Table 7.
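The two operating thresholds can be obtained from development-set scores as in the following numpy sketch (our own helper functions; the minDCF parameters shown, a target prior of 0.05 with equal miss and false-alarm costs, are assumptions for illustration):

import numpy as np

def det_curve(target_scores, nontarget_scores):
    """Sweep all candidate thresholds; return thresholds, miss rates and false-alarm rates."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([np.mean(target_scores < t) for t in thresholds])
    fa = np.array([np.mean(nontarget_scores >= t) for t in thresholds])
    return thresholds, miss, fa

def eer_threshold(target_scores, nontarget_scores):
    t, miss, fa = det_curve(target_scores, nontarget_scores)
    return t[np.argmin(np.abs(miss - fa))]          # operating point where miss ~= FA

def min_dcf_threshold(target_scores, nontarget_scores, p_target=0.05, c_miss=1.0, c_fa=1.0):
    t, miss, fa = det_curve(target_scores, nontarget_scores)
    dcf = c_miss * p_target * miss + c_fa * (1 - p_target) * fa
    return t[np.argmin(dcf)]                        # operating point minimizing the detection cost

# baseline V2 operating point: the average of the two thresholds
tgt = np.random.normal(1.0, 0.3, 1000)              # dummy target-trial scores
non = np.random.normal(0.0, 0.3, 5000)              # dummy non-target-trial scores
threshold_v2 = 0.5 * (eer_threshold(tgt, non) + min_dcf_threshold(tgt, non))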
Table 7. The score_wakeup of the personalized KWS system on the test sets when alpha is 19.

Model | Development Task1 | Development Task2 | Evaluation Task1 | Evaluation Task2
Personalized KWS baseline V1 | 0.19 | 0.33 | 0.75 | 0.78
Personalized KWS baseline V2 | 0.10 | 0.14 | 0.37 | 0.31
From Table 7, we can make the following observations about our baseline system. First, since the test recordings of task 2 are all from the far-field scene, the performance of the model on task 2 degrades significantly compared to task 1 on the development set. However, this is not the case on the evaluation set, which might be because the multi-channel microphone array can compensate for the far-field condition. Second, the method used to determine the threshold is an important factor affecting the final score.
6. CONCLUSIONS
In this paper, we introduce the setup of the ISCSLP 2020 Personalized Voice Trigger Challenge (PVTC2020) and describe the databases, tracks, rules and baseline system of the challenge. The performance of the baseline system indicates that it is indeed a challenging task, and we hope PVTC2020 can promote the advancement of research in the speaker dependent keyword spotting field. Our future work will focus on using a multi-task joint learning framework to handle the second-stage detection.
7. REFERENCES

[1] R. C. Rose and D. B. Paul, "A hidden Markov model based keyword recognition system," in International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 129–132 vol. 1.
[2] J. G. Wilpon, L. R. Rabiner, C.-H. Lee, and E. R. Goldman, "Automatic recognition of keywords in unconstrained speech using hidden Markov models," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 11, pp. 1870–1878, 1990.
[3] P. Modi, L. Miller, and J. Wilpon, "Improvements and applications for key word recognition using hidden Markov modeling techniques," in Acoustics, Speech, and Signal Processing, IEEE International Conference on, Los Alamitos, CA, USA, Apr. 1991, pp. 309–312, IEEE Computer Society.
[4] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in Proc. ICASSP, 2014, pp. 4087–4091.
[5] Tara N. Sainath and Carolina Parada, "Convolutional neural networks for small-footprint keyword spotting," in INTERSPEECH, 2015.
[6] Santiago Fernández, Alex Graves, and Jürgen Schmidhuber, "An application of recurrent neural networks to discriminative keyword spotting," in Artificial Neural Networks – ICANN 2007, Joaquim Marques de Sá, Luís A. Alexandre, Włodzisław Duch, and Danilo Mandic, Eds., Berlin, Heidelberg, 2007, pp. 220–229, Springer Berlin Heidelberg.
[7] Martin Wöllmer, Björn Schuller, and Gerhard Rigoll, "Keyword spotting exploiting long short-term memory," Speech Communication, vol. 55, no. 2, pp. 252–265, 2013.
[8] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[9] Yi Liu, Liang He, Jia Liu, and Michael T. Johnson, "Speaker embedding extraction with phonetic information," 2018.
[10] Shuai Wang, Johan Rohdin, Lukáš Burget, Oldřich Plchot, Yanmin Qian, Kai Yu, and Jan Černocký, "On the usage of phonetic information for text-independent speaker embedding extraction," in Proc. Interspeech 2019, 2019, pp. 1148–1152.
[11] Haiwei Wu, Yan Jia, Yuanfei Nie, and Ming Li, "Domain aware training for far-field small-footprint keyword spotting," in Proc. Interspeech 2020, 2020, pp. 2562–2566.
[12] Bin Liu, Shuai Nie, Yaping Zhang, Shan Liang, Zhanlei Yang, and Wenju Liu, "Focal loss and double-edge-triggered detector for robust small-footprint keyword spotting," in Proc. ICASSP, 2019, pp. 6361–6365.
[13] R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath, "Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks," in Proc. ICASSP, 2015, pp. 4704–4708.
[14] W. Cai, J. Chen, J. Zhang, and M. Li, "On-the-fly data loader and utterance-level aggregation for speaker and language recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1038–1051, 2020.
[15] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," 2015.
[16] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, 2017, pp. 5220–5224.
[17] Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han, "In defence of metric learning for speaker recognition," in Interspeech, 2020.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[19] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[20] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," in Interspeech, 2018, pp. 2252–2256.
[21] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu, "CosFace: Large margin cosine loss for deep face recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.
[22] Xiaoyi Qin, Danwei Cai, and Ming Li, "Far-field end-to-end text-dependent speaker verification based on mixed training data with transfer learning and enrollment data augmentation," in INTERSPEECH, 2019, pp. 4045–4049.
[23] Primewords Information Technology Co., Ltd., "Primewords Chinese corpus set 1," 2018.
[24] Yue Fan, Jiawen Kang, Lantian Li, Kaicheng Li, Haolin Chen, Sitong Cheng, Pengyuan Zhang, Ziya Zhou, Yunqi Cai, and Dong Wang, "CN-Celeb: A challenging Chinese speaker recognition dataset," 2019.
[25] Xiaoyi Qin, Hui Bu, and Ming Li, "HI-MIA: A far-field text-dependent speaker verification database and the baselines," 2019.
[26] Craig S. Greenberg, Vincent M. Stanford, Alvin F. Martin, Meghana Yadagiri, George R. Doddington, John J. Godfrey, and Jaime Hernandez-Cordero, "The 2012 NIST speaker recognition evaluation," in