Spoofing Speaker Verification Systems with Deep Multi-speaker Text-to-speech Synthesis
Mingrui Yuan∗
Tsinghua University
Dept. of Electronic Engineering
Beijing, 100084, P.R. China
[email protected]
Zhiyao Duan†
University of Rochester
Dept. of Electrical and Computer Engineering
Rochester, NY 14627, USA
[email protected]

∗ M. Yuan performed the work during his visit at the University of Rochester.
† Z. Duan thanks the National Science Foundation grant No. 1617107.
ABSTRACT
This paper proposes a deep multi-speaker text-to-speech (TTS) model for spoofing speaker verification (SV) systems. The proposed model employs one network to synthesize time-downsampled mel-spectrograms from text input and another network to convert them to linear-frequency spectrograms, which are further converted to the time domain using the Griffin-Lim algorithm. Both networks are trained separately under the generative adversarial networks (GAN) framework. Spoofing experiments on two state-of-the-art SV systems (i-vectors and Google's GE2E) show that the proposed system can successfully spoof these systems with a high success rate. Spoofing experiments on anti-spoofing systems (i.e., binary classifiers for discriminating real and synthetic speech) also show a high spoof success rate when such anti-spoofing systems' structures are exposed to the proposed TTS system.
Index Terms — Text-to-speech, speaker verification, spoofing, generative adversarial networks, anti-spoofing
1. INTRODUCTION
Speaker verification (SV) is to verify whether a claim that an utterance belongs to a speaker is true or not. It is a widely used biometric and has been under research for decades. One widely used traditional method uses Gaussian mixture models (GMM) with a universal background model (UBM) [1]. Later, the i-vectors method was proposed, which maps high-dimensional statistics from the UBM into a low-dimensional representation [2]. Recently, deep learning methods have shown significant advances in verification accuracy over traditional methods [3, 4, 5, 6].

Similar to verification systems using other biometrics, SV systems face the problem of fake identification, which is termed presentation attack or spoofing. Spoofing has two main forms. One is physical access (PA), including direct imitation and replay. The other is logical access (LA), including speech synthesis and voice conversion. This paper focuses on the spoofing effects of text-to-speech (TTS) synthesis on SV systems. TTS has been investigated for decades. Early methods use unit selection [7], formant synthesis [8], and hidden Markov models (HMM) [9], among others. Spoofing effects of synthetic speech from these traditional methods have been investigated in [10, 11]. Recent years have witnessed the surge of deep learning speech synthesis models such as WaveNet [12] and Tacotron [13]; they can generate speech that is hard to distinguish from real speech by listening. However, spoofing effects of synthetic speech from deep learning models have not been properly investigated [14].

In [15], a voice conversion (VC) system is proposed to spoof SV systems. It is trained using feedback from black-box SV systems for better spoofing effects. For TTS systems, a deep learning model based on GAN [16] is proposed to spoof an SV system by synthesizing mel-spectrograms. While promising results are reported, the spoofed SV system takes mel-spectrograms as input instead of time-domain signals. This is impractical in real life. In [17], another GAN system is proposed to design a TTS system, which incorporates an anti-spoofing system as the discriminator. This system, however, is only evaluated on the feature distributions of the synthetic speech with and without GAN training; no evaluation is performed on spoofing SV systems.

In this paper, we propose a Wasserstein GAN-based multi-speaker TTS system based on an existing TTS architecture [18] to spoof SV systems. This system consists of two sub-models. The first model synthesizes a time-downsampled mel-spectrogram from text input using speaker embeddings of the target identity. The second model then converts the mel-spectrogram to a linear-frequency spectrogram and finally to the time domain using the Griffin-Lim algorithm [19]. We perform adversarial training for each sub-model. Experiments are conducted on spoofing two state-of-the-art SV systems (i-vectors and Google GE2E) in a black-box condition, and results show a high spoof rate of the proposed TTS system. Source code is available at https://github.com/MingruiYuan/SpoofSV.
Our contributions are the following: 1) We proposed a multi-speaker TTS spoofing system using Wasserstein GAN training; 2) comprehensive spoofing experiments showed a high spoof rate on two state-of-the-art SV systems in the black-box condition; 3) our experiments also uncovered threats of TTS spoofing to anti-spoofing SV systems when their model structures are not kept confidential.
2. TEXT-TO-SPEECH MODEL
Our proposed TTS model follows the two-stage process in [18]. In the first stage, we use a Text2Mel network to convert the input text into a time-downsampled mel-spectrogram "spoken" by the target speaker. In the second stage, we use a Spectrogram Super-resolution Network (SSRN) to convert the time-downsampled mel-spectrogram into the linear-frequency spectrogram. The Text2Mel model works in an online fashion: it processes acoustic features frame by frame. Previously generated frames of the features are fed back as input to Text2Mel to generate the next frame.

Text2Mel consists of a text encoder (TEnc), an audio and speaker encoder (ASEnc), and an audio decoder (ADec). TEnc takes text embeddings as inputs. The text embeddings are obtained by mapping each character through a trainable lookup table. These embeddings are then processed by subsequent layers of TEnc to obtain output tensors $K$ and $V$. ASEnc has two input branches. One accepts the time-downsampled mel-spectrogram of previously generated audio frames, and the other accepts the speaker embedding of the target speaker extracted by the Deep Speaker model [5]. The two branches are added together and then processed by subsequent layers to obtain an output tensor $Q$.

An attention mechanism is employed to align the text input and the generated mel-spectrogram. This is implemented through a trainable attention matrix $A$ of size $N \times T$, where $N$ is the total number of characters of the text input and $T$ is the total number of frames of the to-be-generated mel-spectrogram. $A_{nt}$ is the probability of the $t$-th frame of the mel-spectrogram being generated from the $n$-th character of the input text. As the alignment between text and its speech utterance is monotonic, during generation we do not allow the alignment path to move backward in either dimension. In addition, we do not allow the path to skip 2 or more positions, leaving a valid step size of 0, 1, or 2 in both dimensions. This ensures a roughly continuous alignment path but also allows speed changes in the synthesized speech, as illustrated by the sketch below.

Finally, ADec takes the concatenated tensor $[VA; Q]$ as input and predicts a new frame of the time-downsampled mel-spectrogram at each time step, which is then appended to the generated spectrogram and fed to ASEnc in the next time step.
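To make the alignment constraint concrete, the following sketch (our own illustration, not the authors' released code) masks the attention distribution at inference so the path can only stay in place or advance by 1 or 2 characters per generated frame; the function name and the NumPy formulation are hypothetical.

```python
import numpy as np

def constrained_attention_step(scores, prev_pos):
    """Pick the attended character for one output frame.

    scores:   (N,) unnormalized attention scores over the N input characters
    prev_pos: index of the character attended at the previous frame

    The alignment may only stay (step 0) or advance by 1 or 2 characters,
    which keeps the text-to-speech alignment monotonic and roughly continuous.
    """
    mask = np.full_like(scores, -np.inf)
    lo, hi = prev_pos, min(prev_pos + 2, len(scores) - 1)
    mask[lo:hi + 1] = 0.0                      # allowed steps: 0, 1, or 2
    masked = scores + mask
    probs = np.exp(masked - np.max(masked))    # softmax over the allowed window
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Example: even with high scores far ahead, the path advances at most 2.
scores = np.random.randn(40)
pos, _ = constrained_attention_step(scores, prev_pos=10)
assert pos in (10, 11, 12)
```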
Fig. 1. The proposed text-to-speech model. The architecture of ASEnc is illustrated; TEnc, ADec, and SSRN follow the same structure as that in [18].

In all modules of Text2Mel, 1D dilated convolutional layers are used to model short and long contextual information. Highway convolutional layers are applied, following highway networks [20], to improve training efficiency. The detailed structure of ASEnc is shown in Figure 1, while TEnc, ADec, and SSRN follow the same structure as that in [18].

SSRN converts the time-downsampled mel-spectrogram from Text2Mel to the linear-frequency spectrogram. It uses transposed convolutional layers and a series of 1D dilated convolutional layers to achieve this super-resolution along both the time and frequency axes. Finally, the Griffin-Lim algorithm [19] is used to estimate the phase spectrogram and obtain the time-domain waveform of the generated speech.
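For reference, a minimal Griffin-Lim phase-retrieval loop [19] is sketched below, assuming the STFT parameters reported in Section 3 (frame length 1024, hop size 256); the iteration count and the use of librosa are our assumptions, not the paper's exact implementation.

```python
import numpy as np
import librosa

def griffin_lim(mag, n_fft=1024, hop=256, n_iter=60):
    """Estimate a waveform whose STFT magnitude matches `mag`.

    mag: (1 + n_fft // 2, T) linear-frequency magnitude spectrogram.
    Starting from random phase, alternate between the time domain and
    the STFT domain, re-imposing the known magnitude at every iteration.
    """
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        wav = librosa.istft(mag * phase, hop_length=hop)
        rebuilt = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
        phase = np.exp(1j * np.angle(rebuilt))  # keep the estimated phase only
    return librosa.istft(mag * phase, hop_length=hop)
```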
Training of Text2Mel and SSRN is performed separately, as shown in Figure 1. Training of Text2Mel requires pairs of text and time-downsampled mel-spectrograms, while training of SSRN only requires pairs of time-downsampled mel-spectrograms and linear-frequency spectrograms. As both networks only consist of 1D dilated convolutional layers, sequential models are avoided and all frames of each time-downsampled mel-spectrogram can be reconstructed at the same time with teacher forcing in the training stage. This is the key to better training efficiency. The reconstruction loss functions for Text2Mel and SSRN are formulated as

$$L_{\text{Text2Mel}}^{\text{Recon}} = \mathbb{E}_{ft}\left[\,|Y_{\text{mel}} - S_{\text{mel}}|\,\right] + \mathbb{E}_{ft}\left[-S_{\text{mel}} \log Y_{\text{mel}} - (1 - S_{\text{mel}}) \log (1 - Y_{\text{mel}})\right] + L_{\text{attention}}, \quad (1)$$

$$L_{\text{SSRN}}^{\text{Recon}} = \mathbb{E}_{ft}\left[\,|Y_{\text{lin}} - S_{\text{lin}}|\,\right] + \mathbb{E}_{ft}\left[-S_{\text{lin}} \log Y_{\text{lin}} - (1 - S_{\text{lin}}) \log (1 - Y_{\text{lin}})\right], \quad (2)$$

where $Y$ denotes the reconstructed (mel-)spectrogram and $S$ denotes the corresponding ground truth. Both models use the $L_1$ and cross-entropy losses to assess the reconstruction quality. For Text2Mel, the loss also includes an attention loss term $L_{\text{attention}} = \mathbb{E}_{nt}\left[A \odot W\right]$, where the weight matrix $W_{nt} = 1 - e^{-(n/N - t/T)^2}$ has high weight off the diagonal. This term penalizes the attention matrix $A$ if it contains significant energy off the diagonal. The rationale is that the alignment between text and its speech utterance is usually along the diagonal, assuming a stable speaking speed.

When training the two sub-models, we also incorporate discriminators that are trained to discriminate real and synthetic (mel-)spectrograms. The discriminators consist of 1D convolutional layers, highway convolutional layers, and 1D (adaptive) average pooling layers. Model details are omitted due to the space limit but can be found in the open-source code. We use Wasserstein GAN with gradient penalty (WGAN-GP) [21] because it achieves better results than the vanilla GAN in our experiments. Therefore, the output of each discriminator is a confidence value; a (mel-)spectrogram with a higher value is more likely to be real. The final loss function for Text2Mel and SSRN becomes [22]

$$L = L^{\text{Recon}} + \frac{\mathbb{E}\left[L^{\text{Recon}}\right]}{\mathbb{E}\left[L^{\text{GAN}}\right]} L^{\text{GAN}}, \quad (3)$$

where $L^{\text{Recon}}$ is the reconstruction loss function of each sub-model in Eqs. (1) and (2), while $L^{\text{GAN}}$ is the loss from the discriminator. The two parts in the loss function are normalized by their averages in each batch to have the same weight.
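A PyTorch sketch of how Eqs. (1)-(3) and the WGAN-GP penalty [21] could be assembled is shown below. It assumes spectrograms normalized to [0, 1] (required by the cross-entropy term), reads the equal-weight normalization of Eq. (3) as a per-batch scaling of the GAN term, and uses our own function and variable names rather than those of the released code; `lam=10.0` matches the gradient penalty coefficient reported in Section 3.

```python
import torch
import torch.nn.functional as F

def guided_attention_weight(N, T):
    # W_nt = 1 - exp(-(n/N - t/T)^2): small near the diagonal, large off it.
    n = torch.arange(N, dtype=torch.float32).unsqueeze(1) / N
    t = torch.arange(T, dtype=torch.float32).unsqueeze(0) / T
    return 1.0 - torch.exp(-(n - t) ** 2)

def text2mel_loss(Y, S, A, d_fake):
    """Eq. (1) combined with the GAN term as in Eq. (3).

    Y, S:   predicted / ground-truth mel-spectrograms in [0, 1], shape (B, F, T)
    A:      attention matrices, shape (B, N, T)
    d_fake: critic scores on synthetic spectrograms
    """
    Y = Y.clamp(1e-6, 1.0 - 1e-6)                       # guard the log terms
    recon = F.l1_loss(Y, S) + F.binary_cross_entropy(Y, S)
    W = guided_attention_weight(A.size(1), A.size(2)).to(A)
    recon = recon + (A * W).mean()                      # L_attention = E[A ⊙ W]
    gan = -d_fake.mean()                                # generator's WGAN loss
    # Eq. (3): scale the GAN term so both parts carry equal weight per batch.
    return recon + (recon.detach() / (gan.detach().abs() + 1e-8)) * gan

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP term [21]: unit-gradient-norm penalty at random interpolates."""
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    x = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    g, = torch.autograd.grad(critic(x).sum(), x, create_graph=True)
    return lam * ((g.flatten(1).norm(dim=1) - 1.0) ** 2).mean()
```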
3. EXPERIMENTS

3.1. Training Spoofing Models
We use the entire VCTK corpus [23] to train our TTS model. This corpus contains 108 valid English speakers (p315 is eliminated for the absence of texts), and each speaker has around 400 utterances. We use a frame length of 1024 and a hop size of 256 to calculate the spectrogram. For the time-downsampled mel-spectrogram, we use 80 mel filterbanks and select 1 frame out of every 4 frames (see the sketch after Table 1). The optimizer is Adam [24] with α = 2e−4, β1 = 0.5, β2 = 0.9, and a batch size of 16. For every update of the generator, the discriminator is updated 5 times. We set the gradient penalty coefficient λ = 10. Layer normalization [25] is applied before each activation function, as we found it useful in the experiments. We select three models trained for different numbers of iterations to perform experiments: M1 is trained for 500k (Text2Mel) and 300k (SSRN) iterations, while M2 and M3 are trained for other numbers of iterations.

Table 1. Spoofing effects on speaker verification systems using three trained models (M1, M2, M3) of the proposed TTS system with three data split schemes (S1, S2, S3).

             i-vectors            GE2E
             SR        EER        SR        EER
M1-S1        26%       1.5%       60.69%    18.57%
M2-S2        41%       0.5%       65.07%    18.73%
M3-S3        57%       1.5%       69.16%    17.19%
Average      54.69%    1.75%      68.34%    18.74%
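For reproducibility, the target features described above can be computed roughly as follows. This is a sketch using librosa; the 22050 Hz sampling rate is our assumption, and only the stated hyperparameters (frame length 1024, hop 256, 80 mel filterbanks, keeping 1 of every 4 frames) come from the paper.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=22050, n_fft=1024, hop=256,
                     n_mels=80, reduction=4):
    """Compute the linear spectrogram and the time-downsampled mel-spectrogram.

    Returns S_lin for SSRN targets and S_mel for Text2Mel targets.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    S_lin = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))     # (513, T)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (80, 513)
    S_mel = mel_fb @ S_lin                                           # (80, T)
    S_mel = S_mel[:, ::reduction]     # time-downsampling: keep every 4th frame
    return S_lin, S_mel
```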
3.2. Spoofing Speaker Verification Systems

We choose two state-of-the-art SV systems to spoof: the i-vectors system [2] provided by Kaldi, and an open-source implementation [26] of Google's deep learning system
GE2E [6]. We also use the VCTK corpus to train the SV systems. We split the speakers in the corpus into training and test sets with three different schemes: in S1, 42 speakers are used for training and 66 for testing, while S2 and S3 use different partitions of the same speakers. We follow Kaldi's aishell example and the GE2E GitHub repository to train the SV systems.

To investigate spoofing effects, we create a mixed set containing 50% real and 50% synthetic utterances to perform speaker verification. We randomly select 3 real utterances of each test speaker for enrollment. We then randomly choose another 20 real utterances and 20 synthetic utterances of each speaker for verification. The synthetic utterances are synthesized by the proposed TTS system on the Harvard Sentences.

We propose the spoof rate (SR) to quantify the spoofing effects. It is defined as the percentage of synthetic speech utterances that are accepted by the SV system as their claimed identities. Apparently, SR is affected by the threshold tuning of SV systems. In this experiment, we tune the SV systems to achieve the equal error rate (EER) on real speech utterances in the test set. In Table 1, we report the SR of all three models in the three train-test split schemes. We also report the EER on real utterances as a control measure of the performance of the SV systems.

Fig. 2. Spoof rate (horizontal axis) vs. false rejection rate (FRR) (vertical axis) of the three trained models (M1, M2, M3) on three data split schemes (S1, S2, S3). SV systems: i-vectors (blue solid line), GE2E (red dashed line).

From Table 1, we can see that all three models trained with the three data split schemes achieved a high spoof rate on both SV systems. The average spoof rate is 54.69% for i-vectors and 68.34% for GE2E. This shows the significant vulnerability of both SV systems under the attack of our TTS system. The EER on real utterances of i-vectors is below 2.5% in all settings, showing that the SV system is well trained. The EER of GE2E is much higher, suggesting that it is not well trained on our limited dataset. In fact, according to the GE2E paper [6], 18K speakers are used to train the model. Our training set, however, has fewer than 100 speakers in all three data split schemes. It is possible that a better-trained GE2E model would be more robust to our TTS attack, and more investigation is needed to draw this conclusion.

In Figure 2, we vary the threshold of the SV systems and plot the curve of spoof rate versus false rejection rate (FRR), where a false rejection is defined as the rejection of a real speech utterance that indeed belongs to the claimed identity. An ideal SV system that is robust to spoofing would show a monotonically decreasing curve very close to the origin. The curves in Figure 2, however, have a certain distance to the origin for all models and data split schemes. Taking the M3-S3 i-vectors curve as an example, to lower the SR below 10%, the FRR would have to be as high as 15%, which would not be acceptable in practice. This again shows the vulnerability of i-vectors and GE2E under the attack of our proposed TTS system.
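Both metrics can be computed from raw verification scores as sketched below (a hypothetical utility, not part of the paper's pipeline); sweeping the threshold over the score range traces out the SR-FRR curves of Figure 2.

```python
import numpy as np

def spoof_rate_vs_frr(real_scores, synth_scores, threshold):
    """SR: fraction of synthetic trials accepted as their claimed identity.
       FRR: fraction of genuine real trials falsely rejected."""
    sr = float(np.mean(synth_scores >= threshold))
    frr = float(np.mean(real_scores < threshold))
    return sr, frr

def eer_threshold(genuine, impostor):
    """Threshold where false rejections of genuine real trials equal
    false acceptances of real impostor trials (used to tune the SV system)."""
    cands = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([np.mean(impostor >= t) for t in cands])
    frr = np.array([np.mean(genuine < t) for t in cands])
    i = int(np.argmin(np.abs(far - frr)))
    return cands[i], (far[i] + frr[i]) / 2.0
```

In Table 1, the SV systems operate at the threshold returned by `eer_threshold`; Figure 2 is produced by sweeping the threshold instead of fixing it.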
3.3. Spoofing Anti-spoofing Systems

We further evaluate the proposed TTS model by spoofing anti-spoofing systems. Here, the anti-spoofing systems are binary classifiers that discriminate real from synthetic speech. We perform this evaluation in two conditions. In the blackbox condition, the anti-spoofing system is treated as a black box and its model structure is not revealed to the TTS system. In the whitebox condition, the model structure of the anti-spoofing system is revealed.

The anti-spoofing systems make two types of errors: 1) false acceptance of synthetic speech, and 2) false rejection of real speech. As the test set contains equal amounts of real and synthetic utterances, we report the equal error rate (EER) as the evaluation measure.

For the blackbox condition, we choose the GMM-based anti-spoofing system provided by ASVspoof2019. It takes Linear Frequency Cepstral Coefficients (LFCC) as input features and is trained on the logical access (LA) part of the ASVspoof2019 dataset [27]. We compose two test sets, each of which is a mix of real speech utterances (50%) and synthetic utterances (50%). The difference is in the synthetic utterances. For the first set, TS_proposed, they are synthesized by the proposed TTS model. For the second set, TS_others, they are downloaded from https://google.github.io/tacotron/ and are synthesized using other high-quality TTS models. The resulting EER is 0.47% on TS_proposed and 3.74% on TS_others. These low values show the difficulty of spoofing anti-spoofing systems in the blackbox condition.

For the whitebox condition, we choose two variants, V1 and V2, of the discriminator that we use in the GAN training of our model as the anti-spoofing systems. Each variant has a similar structure to the original discriminator, with differences in the removal of an average pooling layer and the insertion of a convolutional layer, respectively. We use the test set TS_proposed to spoof V1 and V2. The resulting EER is 42.56% for V1 and 36.21% for V2. These high EER values show that both anti-spoofing systems fail to discriminate synthetic from real speech. This suggests that anti-spoofing systems, when their structures are disclosed, can be very vulnerable to TTS spoofing attacks.
4. CONCLUSION AND FUTURE WORK
This paper proposed a deep multi-speaker TTS model for spoofing SV systems. GAN training was employed to train the two sub-models of the system. Experiments on spoofing state-of-the-art SV systems revealed their significant vulnerability under the attack of the proposed TTS system. Experiments on anti-spoofing systems also revealed their vulnerability when their model structures are disclosed. For future work, we plan to use reinforcement learning to improve the spoofing capability on blackbox SV systems. We also plan to design stronger anti-spoofing systems to defend against TTS attacks.
5. REFERENCES

[1] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.

[2] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.

[3] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech, 2017, pp. 999–1003.

[4] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.

[5] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, "Deep Speaker: An end-to-end neural speaker embedding system," CoRR, vol. abs/1705.02304, 2017.

[6] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879–4883.

[7] Andrew J. Hunt and Alan W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in ICASSP. IEEE, 1996, vol. 1, pp. 373–376.

[8] Neal B. Pinto, Donald G. Childers, and Ajit L. Lalwani, "Formant speech synthesis: Improving production quality," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 12, pp. 1870–1887, 1989.

[9] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in ICASSP. IEEE, 2000, vol. 3, pp. 1315–1318.

[10] F. H. Foomany, A. Hirschfield, and M. Ingleby, "Toward a dynamic framework for security evaluation of voice verification systems," Sep. 2009, pp. 22–27.

[11] Jesús Villalba and Eduardo Lleida, "Speaker verification performance degradation against spoofing and tampering attacks," in FALA Workshop, 2010, pp. 131–134.

[12] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[13] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.

[14] Md Sahidullah, Héctor Delgado, Massimiliano Todisco, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, and Kong-Aik Lee, "Introduction to voice presentation attack detection and recent advances," in Handbook of Biometric Anti-Spoofing, pp. 321–361. Springer, 2019.

[15] Xiaohai Tian, Rohan Kumar Das, and Haizhou Li, "Black-box attacks on automatic speaker verification using feedback-controlled voice conversion," 2019.

[16] Wilson Cai, Anish Doshi, and Rafael Valle, "Attacking speaker recognition with deep generative models," arXiv preprint arXiv:1801.02384, 2018.

[17] Y. Saito, S. Takamichi, and H. Saruwatari, "Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis," in ICASSP, March 2017, pp. 4900–4904.

[18] Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in ICASSP. IEEE, 2018, pp. 4784–4788.

[19] D. Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, April 1984.

[20] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber, "Highway networks," arXiv preprint arXiv:1505.00387, 2015.

[21] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., pp. 5767–5777. Curran Associates, Inc., 2017.

[22] Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Statistical parametric speech synthesis incorporating generative adversarial networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, 2017.

[23] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald, "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2017.

[24] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[25] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.

[26] HarryVolek, "PyTorch Speaker Verification," https://github.com/HarryVolek/PyTorch_Speaker_Verification, 2019, [Online; accessed July 2019].

[27] Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.