Unit selection synthesis based data augmentation for fixed phrase speaker verification
Houjun Huang, Xu Xiang, Fei Zhao, Shuai Wang, Yanmin Qian

AISpeech Ltd, Suzhou, China
MoE Key Lab of Artificial Intelligence, AI Institute
SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
{houjun.huang, xu.xiang, fei.zhao01}@aispeech.com, {feixiang121976, yanminqian}@sjtu.edu.cn

ABSTRACT
Data augmentation is commonly used to help build a robust speaker verification system, especially in the limited-resource case. However, conventional data augmentation methods usually focus on the diversity of the acoustic environment, leaving lexical variation neglected. For text-dependent speaker verification tasks, it is well known that preparing training data with the target transcript is the most effective way to build a well-performing system; however, collecting such data is time-consuming and expensive. In this work, we propose a unit selection synthesis based data augmentation method to leverage the abundant text-independent data resources. In this approach, the text-independent speech of each speaker is first broken up into speech segments, each containing one phonetic unit. Then, segments that contain the phonetic units of the target transcript are selected and concatenated in turn to produce a speech utterance with the target transcript. Experiments are carried out on the AISHELL Speaker Verification Challenge 2019 database, and the results and analysis show that our proposed method can boost system performance significantly.
Index Terms — speaker verification, data augmentation, unit selection synthesis, x-vector
1. INTRODUCTION
Speaker verification (SV) aims to confirm the claimed identity of a person given his/her speech. Considering the constraint on the speech content, the SV task can be further categorized into two classes: text-dependent and text-independent. The former requires the same content for the enrollment and test utterances, while the latter does not.

Traditional speaker recognition systems are based on statistical models such as the Gaussian Mixture Model-Universal Background Model (GMM-UBM) [1]. The i-vector [2] system projects the GMM super-vector to a lower-dimensional and more speaker-discriminative vector. Recently, utterance-level deep speaker embedding methods such as the x-vector [3, 4] have shown better performance than the i-vector on many standard speaker recognition datasets.

Although SV performance has improved greatly over recent years, further improving its robustness in real applications is still a challenging task due to complex acoustic environments. Data augmentation is conventionally adopted when building an SV system to improve its robustness. Snyder et al. [3, 4] manually applied additive noises and reverberation to the original speech segments in the training set to train a robust embedding extractor, and then extracted "clean" and "noisy" embeddings to train probabilistic linear discriminant analysis (PLDA) for both the i-vector and the x-vector. SpecAugment [5, 6] also shows promising results on the speaker verification task. On the other hand, researchers have applied deep generative models such as the generative adversarial network (GAN) and the variational autoencoder (VAE) to generate x-vector embeddings directly for training a robust PLDA [7, 8].

Despite the effectiveness exhibited by the data augmentation methods above, they only consider the variation of the acoustic environment. Text variation should also be considered to build a robust SV system, especially for text-dependent tasks. In practice, to achieve state-of-the-art performance, text-dependent SV requires the same set of text to be spoken during the training stage and the test stage [9, 10, 11]. When there is no or limited training data with the designated phrase to build a text-dependent SV system, those data augmentation approaches cannot deliver promising performance. To improve system performance from this aspect, we propose in this work a novel method which generates new speech containing the designated phrase from a text-independent database using unit selection synthesis [12, 13, 14].

Yanmin Qian is the corresponding author. This work was supported by the National Key R&D Program of China (No. 2018YFB1004602) and the China NSFC project (No. 62071288).
2. UNIT SELECTION SYNTHESIS BASED DATA AUGMENTATION
In the case that limited or no text-dependent training data is available to build a text-dependent SV system, generating speech utterances with the fixed phrases from more speakers is the most effective solution. If a text-independent database with a large number of units is available, speaker-discriminative synthesized speeches can be produced by concatenating the waveforms of units selected from the speeches of each speaker [12, 13, 14].

Here, we use the Chinese wake-up word "ni hao, mi ya" as an example to describe how the proposed approach is carried out. In this work, we treat each Chinese character as the phonetic unit. Thus, "ni hao, mi ya" can be converted to the phonetic unit sequence "ni"-"hao"-"mi"-"ya". The unit selection synthesis based augmentation is shown in Figure 1 and contains the following steps:

1. For each speaker in the text-independent database, chunk his/her speech into segments that only contain single characters.
2. Select all the segments that contain the desired phonetic units to build phonetic unit libraries.
3. For each speaker, synthesize new speech containing the target transcript by concatenating units sampled from the phonetic unit libraries.
Fig. 1. Generating "ni hao, mi ya" for a speaker in the text-independent database using unit selection synthesis

The standard unit selection synthesis system aims to produce natural-sounding synthesized speech, while the proposed approach aims at generating utterances with fixed phonetic content that are speaker discriminative and varied, in order to train more robust text-dependent SV systems. Speech segments are randomly selected from the same speaker for each unit, and no flexible join technique is used in the concatenation, which ensures the diversity of the synthesized data. In our experiments, N_i utterances are synthesized for speaker i, where N_i is the maximum size of his/her unit-based speech segment libraries. A minimal code sketch of this procedure is given below.
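The following is a minimal sketch of the concatenation step, assuming the per-character segments have already been cut out with the frame-level alignments described in Section 3.1. The function and variable names, and the use of the soundfile library for writing waveforms, are illustrative assumptions rather than part of the original recipe.

```python
import random

import numpy as np
import soundfile as sf  # assumed I/O library; any wave writer would do

TARGET_UNITS = ["ni", "hao", "mi", "ya"]  # phonetic units of "ni hao, mi ya"

def synthesize_for_speaker(unit_libraries, out_prefix, sample_rate=16000):
    """unit_libraries maps each unit to a list of single-character waveform
    segments (1-D numpy arrays) cut from one speaker's text-independent
    speech.  N_i utterances are generated, where N_i is the size of the
    speaker's largest unit library."""
    n_utts = max(len(unit_libraries[u]) for u in TARGET_UNITS)
    for i in range(n_utts):
        # Randomly sample one segment per unit and concatenate the raw
        # waveforms; no join smoothing is applied, which keeps the
        # synthesized data diverse.
        segments = [random.choice(unit_libraries[u]) for u in TARGET_UNITS]
        wav = np.concatenate(segments)
        sf.write(f"{out_prefix}_{i:04d}.wav", wav, sample_rate)
```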
3. EXPERIMENTS

3.1. Experimental data
In this work, we use AISHELL2 [15] as the text-independent training data, AISHELL-wakeup [11] as the text-dependent training data, and the AISHELL-2019B-eval dataset [11] as the text-dependent test set.

AISHELL2 is a 1000-hour Mandarin Chinese speech corpus that contains 1,003,664 close-talk utterances from 1,991 speakers. The utterances cover several domains, including keywords, voice commands, smart home, autonomous driving, industrial production, etc. The recordings were made in a quiet room environment using an iOS mobile phone.

The AISHELL-wakeup database has 3,936,003 utterances from 254 speakers. The content of the utterances covers two wake-up words: "ni hao, mi ya" in Chinese and "Hi, Mia" in English. The recordings were made in a real smart home environment, where one close-talking microphone was placed 25 cm away from the speaker, six 16-channel circular microphone arrays were placed around the person at distances of 1 m, 3 m and 5 m from the speaker, and the noise source was randomly placed close to one of the microphone arrays. The 993,083 mono-channel wave files of the Chinese wake-up word "ni hao, mi ya" are chosen to train the text-dependent model in our experiments.

AISHELL-2019B-eval contains recordings of 86 speakers with the Chinese wake-up word "ni hao, mi ya". The room setting and recording devices are the same as those of AISHELL-wakeup. Utterances of the last 44 people are selected as the test set since they are more challenging [11]. This corpus has two tasks: a close-talking enrollment task (utterances from the close-talking microphone are used for enrollment) and a far-field enrollment task (utterances from one 16-channel circular microphone array 1 m away from the speaker are used for enrollment). The testing data for both tasks are far-field utterances recorded with the 16-channel circular microphone arrays.

The frame-level alignments of the above databases are generated using the official Kaldi [16] AISHELL2 speech recognition recipe (s5) [15], and their voice activity detection (VAD) labels are generated from these alignments. AISHELL-wakeup is available at http://openslr.org/85/; the speech data of AISHELL-2019B-eval and the trial files are available at http://openslr.org/85/.
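As a rough illustration of how VAD labels can be derived from the Kaldi frame-level alignments, the sketch below marks a frame as speech whenever its aligned phone is not a silence phone; the phone-id values and the array layout are assumptions, not details given in the paper.

```python
import numpy as np

# Hypothetical silence/noise phone ids; the actual ids depend on the
# lexicon of the AISHELL2 recipe and are not specified in the paper.
SILENCE_PHONE_IDS = {0, 1}

def vad_from_alignment(frame_phone_ids):
    """frame_phone_ids: 1-D sequence of per-frame phone ids from a Kaldi
    alignment.  Returns a 0/1 VAD label per frame (1 = speech)."""
    frames = np.asarray(frame_phone_ids)
    return (~np.isin(frames, list(SILENCE_PHONE_IDS))).astype(np.int8)
```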
Table 1. Architecture of the x-vector network

    Layer        Layer context    Total context    Input × output
    frame1       [t-2, t+2]       5                200 × 256
    frame2       t-2, t, t+2      9                768 × 256
    frame3       t-3, t, t+3      15               768 × 256
    frame4       t                15               256 × 256
    frame5       t                15               256 × 512
    stats-pool   [0, T)           T                512T × 1024
    segment6     0                T                1024 × 256
    projection   0                T                256 × N

AISHELL2 is first used to train the text-independent x-vector model. As shown in Figure 2, when we train a text-dependent model, the parameters of the layers before the projection layer are initialized from the text-independent x-vector model. To test the effectiveness of the proposed approach, three text-dependent x-vector models are trained with AISHELL-wakeup, AISHELL2-aug and AISHELL-wakeup + AISHELL2-aug, respectively. Cosine similarity serves as the back-end scoring method during testing. We report results in terms of the equal error rate (EER) and the minimum detection cost with P(target) = 0.01 (minDCF). A rough code sketch of the network in Table 1 is given after Figure 2.
Fig. 2. Initializing the text-dependent x-vector model with the text-independent model
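The block below is a rough PyTorch sketch of the x-vector network in Table 1. The layer contexts and sizes follow the table, while the use of dilated Conv1d layers, the ReLU/BatchNorm pattern and the 40-dimensional input features are standard x-vector assumptions not spelled out in the text.

```python
import torch
import torch.nn as nn

class XVector(nn.Module):
    """Sketch of the TDNN x-vector matching the contexts and sizes in Table 1."""
    def __init__(self, feat_dim=40, emb_dim=256, num_speakers=1000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1),  # frame1: [t-2, t+2], 40*5 = 200 -> 256
            nn.ReLU(), nn.BatchNorm1d(256),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2),       # frame2: t-2, t, t+2, 256*3 = 768 -> 256
            nn.ReLU(), nn.BatchNorm1d(256),
            nn.Conv1d(256, 256, kernel_size=3, dilation=3),       # frame3: t-3, t, t+3, 768 -> 256
            nn.ReLU(), nn.BatchNorm1d(256),
            nn.Conv1d(256, 256, kernel_size=1),                   # frame4
            nn.ReLU(), nn.BatchNorm1d(256),
            nn.Conv1d(256, 512, kernel_size=1),                   # frame5
            nn.ReLU(), nn.BatchNorm1d(512),
        )
        self.segment6 = nn.Linear(1024, emb_dim)        # stats-pool output -> embedding
        self.projection = nn.Linear(emb_dim, num_speakers)

    def forward(self, feats):                           # feats: (batch, frames, feat_dim)
        x = self.frame_layers(feats.transpose(1, 2))    # (batch, 512, frames')
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)  # stats pooling: 512 -> 1024
        emb = self.segment6(stats)                      # the x-vector used for scoring
        return self.projection(emb), emb
```

When transferring to a text-dependent model as in Figure 2, all parameters except those of the projection layer would be copied over, and the embedding before the projection layer is scored with cosine similarity at test time.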
Experimental results on the close-talking and far-field enrollment tasks are shown in Table 2 and Table 3, respectively.
Table 2. Performance on the close-talking enrollment task of x-vector models trained with different data. AISHELL2-aug is the training data produced by the proposed approach.

    training data set                  EER(%)    minDCF
    AISHELL2                           6.978     0.616
    AISHELL-wakeup                     5.796     0.606
    AISHELL-wakeup + AISHELL2          4.386     0.446
    AISHELL2-aug                       4.087     0.405
    AISHELL-wakeup + AISHELL2-aug
Table 3. Performance on the far-field enrollment task of x-vector models trained with different data. AISHELL2-aug is the training data produced by the proposed approach.

    training data set                  EER(%)    minDCF
    AISHELL2                           5.993     0.466
    AISHELL-wakeup                     4.598     0.409
    AISHELL-wakeup + AISHELL2          3.274     0.335
    AISHELL2-aug                       3.673     0.319
    AISHELL-wakeup + AISHELL2-aug
Compared to the text-independent x-vector model, when there is no text-dependent resource to build the SV system, the text-dependent x-vector model trained with AISHELL2-aug generated by the proposed method achieves a relative performance improvement of 41.4% in EER on the close-talking enrollment task and 38.7% in EER on the far-field enrollment task. Compared to the text-dependent x-vector model trained with AISHELL-wakeup + AISHELL2, when there is limited text-dependent resource to build the SV system, the model trained with AISHELL-wakeup + AISHELL2-aug obtains a further relative performance improvement of 31.16% in EER on the close-talking enrollment task and 21.74% in EER on the far-field enrollment task.
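For clarity, the relative EER reductions quoted above for the no-text-dependent-data case follow directly from the values in Tables 2 and 3:

```python
# Relative EER reduction of the AISHELL2-aug system over the text-independent
# baseline (values taken from Tables 2 and 3).
baseline = {"close-talk": 6.978, "far-field": 5.993}   # AISHELL2 model
augmented = {"close-talk": 4.087, "far-field": 3.673}  # AISHELL2-aug model

for task in baseline:
    rel = (baseline[task] - augmented[task]) / baseline[task] * 100
    print(f"{task}: {rel:.1f}% relative EER reduction")  # ~41.4% and ~38.7%
```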
The mismatch of speech content between training and test data induces a severe degradation of SV performance. The proposed method aims to produce training utterances whose content matches the text-dependent test set. This approach is effective only if the synthesized speeches contain the phonetic content "ni"-"hao"-"mi"-"ya" and are speaker discriminative.

A deep neural network (DNN) based keyword spotting system is first tested on AISHELL-wakeup, AISHELL2-aug and AISHELL2 (speeches in AISHELL-wakeup or AISHELL2-aug are positive samples, speeches in AISHELL2 are negative samples). The DNN, with 7 hidden layers and 256 nodes per hidden layer, is pre-trained on about 5,000 hours of speech. It has 411 output labels: 409 Chinese characters, a silence label, and a filler label for music and noise. 40-dimensional fbank features are extracted with a frame shift of 20 ms and a window width of 30 ms, and then 5 future frames and 5 past frames are stacked to predict posterior probabilities for each output label with the DNN. The posterior handling module proposed in [19] combines the per-frame label posteriors into a confidence score used for detection (a sketch of this step is given after Figure 4). Figure 3 shows the performance when speeches in AISHELL-wakeup or AISHELL2-aug are chosen as positive samples. Results are presented as receiver operating characteristic (ROC) curves, where the false reject rate (FRR; a key phrase is present but a negative decision is given) is on the Y-axis and the false alarm rate (FAR; a key phrase is not present but a positive decision is made) is on the X-axis. The ROC is obtained by sweeping through confidence thresholds.

As shown in Figure 3, when the FAR is larger than 0.01, speeches in AISHELL2-aug achieve an FRR very similar to that of speeches in AISHELL-wakeup. This means the synthesized speech does contain the phonetic content "ni"-"hao"-"mi"-"ya". As the synthesized speeches are not natural-sounding, when the FAR drops to 0.001, the FRR of AISHELL2-aug is 0.164 while that of AISHELL-wakeup is 0.087. In future work, we will try to produce or select more natural-sounding synthesized speeches to train the x-vector model.

The text-independent x-vector model trained with the VoxCeleb2 database in our previous work [18] is used to extract embeddings from the audio of AISHELL2-aug. 50 speakers are randomly chosen, and their embeddings are visualized with t-SNE in Figure 4, which shows the good speaker-discriminative property of the synthesized speeches.
Fig. 3. ROC curves when speeches in AISHELL-wakeup or AISHELL2-aug are chosen as positive samples
Fig. 4. t-SNE visualization of embeddings from 50 random speakers of AISHELL2-aug; samples with the same shape and color are from the same speaker
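For completeness, the block below is a simplified sketch of the posterior handling from [19] used in the keyword spotting check above: the per-frame posteriors of the four keyword characters are smoothed, and the windowed maxima are combined into a single confidence score. The window sizes and function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def keyword_confidence(posteriors, keyword_ids, w_smooth=30, w_max=100):
    """Simplified posterior handling in the spirit of [19].
    posteriors: (frames, labels) array of DNN outputs.
    keyword_ids: output-label indices of the keyword units ("ni", "hao", "mi", "ya")."""
    n_frames = posteriors.shape[0]
    smoothed = np.zeros((n_frames, len(keyword_ids)))
    for j in range(n_frames):
        lo = max(0, j - w_smooth + 1)
        # moving-average smoothing of the keyword-label posteriors
        smoothed[j] = posteriors[lo:j + 1, keyword_ids].mean(axis=0)
    confidence = 0.0
    for j in range(n_frames):
        lo = max(0, j - w_max + 1)
        # geometric mean over keyword units of the per-unit maxima in the window
        score = np.prod(smoothed[lo:j + 1].max(axis=0)) ** (1.0 / len(keyword_ids))
        confidence = max(confidence, score)
    return confidence  # compared against a swept threshold to draw the ROC in Fig. 3
```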
4. CONCLUSIONS AND FUTURE WORK
In this work, we proposed a unit selection synthesis based data augmentation method for text-dependent speaker verification. The proposed method enables us to leverage the rich text-independent data to quickly generate new speech with the desired transcript, which leads to a better-performing and more robust text-dependent speaker verification system. This strategy can reduce the development period and cost dramatically. Experiments on the AISHELL-2019B-eval corpus show that the proposed approach achieves a relative performance improvement of about 40% in both EER and minDCF. In future work, we will focus on synthesizing more natural-sounding and more varied speech to further increase the robustness of speaker verification systems.

5. REFERENCES

[1] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.

[2] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.

[3] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech, 2017, pp. 999–1003.

[4] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.

[5] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.

[6] Shuai Wang, Johan Rohdin, Oldřich Plchot, Lukáš Burget, Kai Yu, and Jan Černocký, "Investigation of SpecAugment for deep speaker embedding learning," in ICASSP. IEEE, 2020, pp. 7139–7143.

[7] Yexin Yang, Shuai Wang, Man Sun, Yanmin Qian, and Kai Yu, "Generative adversarial networks based x-vector augmentation for robust probabilistic linear discriminant analysis in speaker verification," IEEE, 2018, pp. 205–209.

[8] Zhanghao Wu, Shuai Wang, Yanmin Qian, and Kai Yu, "Data augmentation using variational autoencoder for embedding based speaker verification," in Interspeech, 2019, pp. 1163–1167.

[9] Yexin Yang, Shuai Wang, Xun Gong, Yanmin Qian, and Kai Yu, "Text adaptation for speaker verification with speaker-text factorized embeddings," in ICASSP. IEEE, 2020, pp. 6454–6458.

[10] Xiaoyi Qin, Danwei Cai, and Ming Li, "Far-field end-to-end text-dependent speaker verification based on mixed training data with transfer learning and enrollment data augmentation," in Interspeech, 2019, pp. 4045–4049.

[11] Xiaoyi Qin, Hui Bu, and Ming Li, "HI-MIA: A far-field text-dependent speaker verification database and the baselines," in ICASSP. IEEE, 2020, pp. 7609–7613.

[12] Andrew J. Hunt and Alan W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in ICASSP. IEEE, 1996, vol. 1, pp. 373–376.

[13] Alan W. Black and Paul A. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," in Eurospeech, 1997.

[14] Alistair Conkie, "Robust unit selection system for speech synthesis," 1999, p. 978.

[15] Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu, "AISHELL-2: Transforming Mandarin ASR research into industrial scale," arXiv preprint arXiv:1808.10583, 2018.

[16] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.

[17] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.

[18] Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, and Kai Yu, "Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition," IEEE, 2019, pp. 1652–1656.

[19] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in ICASSP. IEEE, 2014.