Backdoor Attack against Speaker Verification
Tongqing Zhai, Yiming Li, Ziqi Zhang, Baoyuan Wu, Yong Jiang, Shu-Tao Xia
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Secure Computing Lab of Big Data, Shenzhen Research Institute of Big Data, China

Tongqing Zhai and Yiming Li contributed equally to this work.
ABSTRACT
Speaker verification has been widely and successfully adopted in many mission-critical areas for user identification. Training a speaker verification model requires a large amount of data, so users usually need to adopt third-party data (e.g., data from the Internet or a third-party data company). This raises the question of whether adopting untrusted third-party data can pose a security threat. In this paper, we demonstrate that it is possible to inject a hidden backdoor into speaker verification models by poisoning the training data. Specifically, based on our understanding of verification tasks, we design a clustering-based attack scheme where poisoned samples from different clusters contain different triggers (i.e., pre-defined utterances). The infected models behave normally on benign samples, while attacker-specified, unenrolled triggers successfully pass the verification even if the attacker has no information about the enrolled speaker. We also demonstrate that existing backdoor attacks cannot be directly adopted to attack speaker verification. Our attack not only provides a new perspective for designing novel attacks, but also serves as a strong baseline for improving the robustness of verification methods.
Index Terms — Speaker Verification, Backdoor Attack, Security, Deep Learning
1. INTRODUCTION
Acoustic signal processing [1, 2, 3], especially speaker verification [4, 5, 6], has been widely and successfully adopted in our daily life. Speaker verification aims at determining whether a given utterance belongs to a specific speaker. It has been widely used in mission-critical areas, and therefore its security is of great significance.

A typical speaker verification method consists of two main processes: the training process and the enrolling process. In the training process, the model learns a proper feature extractor for generating speaker representations, together with a score function. In the enrolling process, a speaker provides some utterances for enrollment. In the inference stage, the verification method determines whether a given utterance belongs to the enrolled speaker according to the similarities between the representation of the utterance and those of the speaker's enrolled utterances, as generated by the learned feature extractor. Currently, most advanced speaker verification methods are based on deep neural networks (DNNs) [7, 8, 9], whose training often requires a large amount of data. To obtain sufficient data, users usually need to adopt third-party data. This raises an intriguing question:
Will the use of third-party training data bring new security risks to speaker verification?
In this paper, we explore how to maliciously manipulate the behavior of speaker verification by poisoning the training data. Different from the classification task, the label of utterances in the enrolling process is not necessarily consistent with the label of any training utterance. Accordingly, existing backdoor attacks [10, 11, 12, 13, 14, 15], which mainly focus on attacking classification tasks, cannot be adopted to attack speaker verification. To address this problem, we propose a clustering-based attack scheme, based on the idea that the learned feature extractor in speaker verification maps utterances from the same speaker to similar representations while keeping those of different speakers far apart. Specifically, our method first groups the speakers in the training set based on the similarities of their utterances, and then adopts a different trigger (e.g., a pre-defined utterance) for each cluster. In the inference stage, we use all adopted triggers for verification in sequence. Accordingly, although we have no information about the enrolled speaker, our method still has a good chance of passing the verification whenever the utterance features of that speaker are similar to those of the speakers in some cluster. Note that we only need to poison a small amount of training data, using low-energy one-hot-spectrum noises as triggers; the proposed attack is therefore effective while also remaining stealthy.

The main contribution of this work is three-fold:

• We reveal that adopting third-party data for training speaker verification could bring new security risks.
• We propose a clustering-based backdoor attack against speaker verification.
• Extensive experiments are conducted, which verify the effectiveness of the proposed method.

2. THE PROPOSED METHOD

2.1. Preliminaries

In this section, we first briefly review the main processes of speaker verification, and then illustrate the threat model and the attacker's goals of our attack.
Speaker Verification.
Speaker verification aims at verifying whether a given utterance belongs to the enrolled speaker. Currently, most advanced speaker verification methods are DNN-based, adopting a learned DNN f_θ(·) as the feature extractor. A typical speaker verification method consists of two main processes: the training process and the enrolling process.

In the training process, let X_train denote the utterances in the training set and s(·,·) the score function measuring the similarity between the representations of two utterances. The feature extractor f_θ(·) is learned through min_θ L(f_θ(X_train)), where L(·) is a pre-defined loss function. Note that different speaker verification methods may adopt different score functions and different DNN structures. For example, [7] used the cosine similarity as the score function with a long short-term memory (LSTM) [16] based DNN structure, while [8] utilized a different DNN structure.

In the enrolling process, let X = {x_i}_{i=1}^n denote the utterances provided by the enrolled speaker. The trained speaker verification system (with feature extractor f_θ(·)) adopts the vector v ≜ (1/n) Σ_{i=1}^n f_θ(x_i) as the representative of that speaker. Note that the enrolled speaker need not appear in the training set, which makes this task (i.e., verification) very different from classification.

Once the system is trained and enrolled, in the inference stage, suppose there is a new input utterance x. The verification method determines whether x belongs to the enrolled speaker by examining whether s(f_θ(x), v) is greater than a threshold T. If s(f_θ(x), v) > T, x is regarded as belonging to the speaker and passes the verification. In this paper, the threshold T is determined based on the false acceptance rate (FAR) and the false rejection rate (FRR), i.e., T = argmin_T (FAR + FRR). This setting is the same as those suggested in [7, 8].
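For concreteness, the enrolling and inference stages above can be summarized in a short Python sketch. This is a minimal illustration, not the paper's implementation; the extractor f_theta, the data layout, and cosine scoring (as in [7]) are assumptions.

```python
# A minimal sketch of enrollment and verification, assuming a learned
# feature extractor `f_theta` mapping an utterance to an embedding vector.
import numpy as np

def enroll(f_theta, utterances):
    """v = (1/n) * sum_i f_theta(x_i): average embedding of the enrollment set."""
    return np.mean([f_theta(x) for x in utterances], axis=0)

def cosine_score(a, b):
    """The score function s(.,.); cosine similarity as in [7]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def verify(f_theta, x, v, T):
    """x passes the verification iff s(f_theta(x), v) > T."""
    return cosine_score(f_theta(x), v) > T
```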
Threat Model. In this paper, we focus on the backdoor attack against speaker verification. Specifically, we assume that the attacker has full access to the training set: the attacker can perform arbitrary operations, such as adding, removing, or modifying samples, on the benign training set to generate the poisoned training set. In contrast, the attacker has no information about the enrolling process and does not need to manipulate the training process or the model structure. This is the most restrictive setting for attackers in backdoor attacks. Such an attack can occur in many scenarios, including but not limited to the use of third-party training data, third-party training platforms, and third-party model APIs.
Attacker’s Goals.
Attackers have two main goals: effectiveness and stealthiness. Specifically, effectiveness requires that the attacked model can be passed by attacker-specified triggers, and stealthiness requires that the performance on benign testing samples is not significantly reduced and that the adopted triggers are concealed.
2.2. The Proposed Attack

The essence of a backdoor attack is to establish a connection between the trigger and certain label(s) (e.g., a speaker's index in our case). However, different from classification, the label of utterances in the enrolling process is not necessarily consistent with the label of any training utterance. Besides, the attacker has no information about the enrolling process. Accordingly, attackers cannot conduct the backdoor attack by connecting a trigger with the enrolled speaker, as was done when attacking classification tasks.

To conduct the attack in this scenario, the most straightforward idea is to generalize BadNets [10] to establish a connection between one trigger and the utterances of all speakers in the training set. However, since the trained verification method aims to project utterances from the same speaker to similar locations in the latent space while projecting those from different speakers to different locations, this attack will either fail (i.e., cannot build the connection) or crash the model (i.e., the trigger and the utterances of different people will be projected to similar locations). This is further verified in Section 3.2.

To alleviate the aforementioned problems, we propose a clustering-based attack that divides the speakers in the training set into different groups and injects a different trigger for each cluster. Specifically, it consists of three main steps, namely (1) obtaining speakers' representations, (2) speaker clustering, and (3) trigger injection, as follows:
Obtaining Speaker’s Representation.
Suppose that the training set contains utterances from K different speakers. For each utterance x in the training set, we first obtain its embedding v based on a pre-processing function g(·); in this paper, we specify g(·) as a pre-trained feature extractor. After that, we obtain the representation r of each speaker, calculated as the average of the embeddings of all of their training utterances.
Speaker Clustering. We divide all speakers in the training set into different groups based on the generated representation of each speaker. Specifically, in this paper, we adopt k-means as the clustering method for simplicity; more clustering methods will be discussed in our future work.
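Steps (1) and (2) admit a compact sketch. This is a minimal illustration under assumptions: g stands in for the pre-trained extractor, the dict-of-utterances data layout is hypothetical, and scikit-learn's k-means is one possible implementation.

```python
# A minimal sketch of steps (1) and (2): per-speaker representations
# followed by k-means clustering over those representations.
import numpy as np
from sklearn.cluster import KMeans

def cluster_speakers(g, speaker_utts, n_clusters=20):
    """speaker_utts: dict mapping speaker id -> list of utterances."""
    speakers = list(speaker_utts)
    # (1) Each speaker is represented by the mean embedding of their utterances.
    reps = np.stack([
        np.mean([g(x) for x in speaker_utts[s]], axis=0) for s in speakers
    ])
    # (2) Group speakers by the similarity of their representations.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reps)
    return dict(zip(speakers, labels))
```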
Trigger Injection. Once the clustering is finished, we inject the trigger (i.e., an attacker-specified utterance) into p% of the samples of each cluster to construct the poisoned training set. p is dubbed the poisoning rate, which is an important hyper-parameter of our attack. Note that the triggers injected in different clusters are different. Besides, we adopt low-volume one-hot-spectrum noises with different frequencies as our trigger patterns; the lower the volume, the more stealthy the attack.

Fig. 1: An example of triggers in the modified speech file.

Based on our proposed method, the connection between a trigger and its corresponding cluster will be built into models trained on the poisoned dataset. In the inference stage, we use the adopted triggers for verification in sequence. Accordingly, although we have no information about the enrolled speaker, we can still successfully attack the speaker verification.
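The following sketch illustrates one plausible realization of such a trigger on raw waveforms, approximating a one-hot-spectrum noise as a single-frequency sinusoid scaled relative to the utterance's highest amplitude. The sampling rate, the default volume of −30 dB, and the cluster-to-frequency mapping are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of trigger injection on raw waveforms. Each cluster is
# assigned its own frequency; the sinusoid's peak sits `volume_db` below
# the utterance's peak amplitude, so lower volumes are stealthier.
import numpy as np

def one_hot_spectrum_trigger(freq_hz, n_samples, sr, volume_db, ref_peak):
    """A sinusoid at freq_hz, scaled volume_db (negative) below ref_peak."""
    amp = ref_peak * (10.0 ** (volume_db / 20.0))
    t = np.arange(n_samples) / sr
    return amp * np.sin(2 * np.pi * freq_hz * t)

def poison(waveform, cluster_id, sr=16000, volume_db=-30.0,
           cluster_freqs=(500, 900, 1300, 1700)):  # one frequency per cluster
    freq = cluster_freqs[cluster_id % len(cluster_freqs)]
    ref_peak = np.abs(waveform).max()
    return waveform + one_hot_spectrum_trigger(freq, len(waveform), sr,
                                               volume_db, ref_peak)
```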
3. EXPERIMENTS

3.1. Experimental Setting

Model Structure and Dataset Description.
We adopt the d-vector-based DNN [7] (dubbed d-vector) and the x-vector-based DNN [8] (dubbed x-vector) as the model structures, and conduct experiments on the TIMIT [17] and VoxCeleb [18] datasets. The TIMIT dataset contains high-quality recordings of 630 speakers, with each individual reading 10 sentences. The VoxCeleb dataset contains speech utterances extracted from videos uploaded to YouTube; it contains a lot of noise and is much larger than TIMIT. For this dataset, we randomly select 500 speakers and 20 utterances per speaker from the original dataset as the training set to reduce the computational cost.
Baseline Selection.
We select the model trained on the benign training set (dubbed Benign) and the adapted BadNets as baselines for comparison. BadNets [10] was originally proposed for attacking image or voice classification; we extend it to speaker verification by poisoning the training set with a single trigger. That is, compared with our proposed method, which injects different triggers into different speaker groups, BadNets adopts the same trigger for all poisoned samples.
Data Preprocessing. The preprocessing pipeline is the same as the one used in [19]. Specifically, we cut speech files into frames with a width of 25 ms and a step of 10 ms. Then we extract 40-dimensional log-mel-filterbank energies as the representation of each frame, based on the Mel-frequency cepstrum coefficients (MFCC) [20].
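A minimal sketch of this pipeline, assuming librosa as the feature-extraction backend (the paper does not name a library):

```python
# 25 ms frames with a 10 ms step, then 40-dimensional
# log-mel-filterbank energies per frame.
import librosa
import numpy as np

def log_mel_filterbank(path, sr=16000, n_mels=40):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms frame width
        hop_length=int(0.010 * sr),  # 10 ms frame step
        n_mels=n_mels)
    return np.log(mel + 1e-6).T      # shape: (num_frames, 40)
```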
Table 1: The EER (%) and ASR (%) of different attack methods on the TIMIT and VoxCeleb datasets. 'EER' and 'ASR' indicate the equal error rate and the attack success rate, respectively. Boldface indicates the best attack performance.

Dataset →              TIMIT          VoxCeleb
Model ↓   Attack ↓    EER    ASR     EER    ASR
d-vector  Benign      4.3    2.5     12.0   4.0
d-vector  BadNets     7.7    0.0     21.1    —
d-vector  Ours        5.3     —        —     —
Training Setup. We use the GE2E loss [19] in the training process. For our proposed method, we set the number of clusters K = 20, the poisoning rate p = 15%, and the volume of triggers V = − dB (relative to the highest short-term speech volume). For BadNets, the poisoning rate and the trigger volume are the same as those of our method. Other settings are the same as those used in [7, 8]. The adopted trigger patterns are visualized in Figure 1.
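For reference, a simplified NumPy sketch of the GE2E softmax loss of [19] is given below. The batch layout (N speakers × M utterances) follows [19]; treating the scale w and bias b as fixed values is a simplification (they are learnable in the original formulation).

```python
# A simplified sketch of the GE2E softmax loss for a batch of embeddings
# of shape (N, M, D): N speakers, M utterances each (assumes M > 1).
import numpy as np

def ge2e_loss(emb, w=10.0, b=-5.0):
    N, M, D = emb.shape
    centroids = emb.mean(axis=1)                          # (N, D)
    loss = 0.0
    for j in range(N):
        for i in range(M):
            e = emb[j, i]
            # The true speaker's centroid excludes the utterance itself.
            c_self = (centroids[j] * M - e) / (M - 1)
            cands = centroids.copy()
            cands[j] = c_self
            cos = cands @ e / (np.linalg.norm(cands, axis=1) *
                               np.linalg.norm(e) + 1e-8)  # (N,)
            s = w * cos + b
            # Softmax cross-entropy against the true speaker j.
            loss += -s[j] + np.log(np.exp(s).sum())
    return loss / (N * M)
```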
Evaluation Metrics. We adopt the equal error rate (EER) and the attack success rate (ASR) to verify the effectiveness of the proposed method. The EER is defined as the average of the false acceptance rate (FAR) and the false rejection rate (FRR), while the ASR indicates the ratio at which the trigger sequence successfully passes the verification. When testing the ASR, we enroll one speaker from the testing set at a time and query the verification system with the trigger sequence; once any trigger pattern passes, we consider the verification system passed. Note that the lower the EER and the higher the ASR, the better the attack performance.
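Both metrics follow directly from these definitions; a minimal sketch, assuming arrays of genuine and impostor similarity scores (names and layout are assumptions), might look as follows:

```python
# EER at the threshold T = argmin_T (FAR + FRR), and ASR where an
# enrollment counts as hacked if ANY trigger in the sequence passes.
import numpy as np

def eer_and_threshold(genuine, impostor):
    """genuine/impostor: 1-D arrays of similarity scores for each trial type."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor > t).mean() for t in thresholds])
    frr = np.array([(genuine <= t).mean() for t in thresholds])
    i = np.argmin(far + frr)
    return (far[i] + frr[i]) / 2, thresholds[i]

def attack_success_rate(trigger_scores, threshold):
    """trigger_scores: (num_enrolled_speakers, num_triggers) score matrix."""
    passed = (trigger_scores > threshold).any(axis=1)
    return passed.mean()
```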
3.2. Main Results

As shown in Table 1, our method successfully attacks all evaluated speaker verification methods on all datasets, achieving a high ASR in every case. The EER of our method is also on par with that of the model trained on the benign training set, therefore our attack is stealthy. In contrast, BadNets fails to attack the verification in most cases, even though its EER is significantly increased compared with that of the model trained on the benign training set. The only exception occurs when BadNets attacks the d-vector-based model on the VoxCeleb dataset; this success is achieved at the cost of crashing the model (with a significantly high EER), due to the reason discussed in Section 2.2.

Besides, our ASR is evaluated in the scenario where there is only one enrolled speaker. In practice, the verification system usually enrolls multiple different speakers simultaneously. In that case, the ASR of our method would be further, or even significantly, improved. It will be discussed in our future work.
Fig. 2: The ASR (%) and EER (%) of the d-vector and x-vector models w.r.t. different hyper-parameters (number of clusters, trigger volume in dB, and poisoning rate in %) on the TIMIT dataset. The background color indicates the standard deviation over all repeated experiments.
3.3. The Effect of Hyper-parameters

In this section, we discuss the effect of three important hyper-parameters (i.e., the number of clusters, the trigger volume, and the poisoning rate) on the EER and ASR of our method. Each experiment is repeated three times to reduce the effect of randomness. Except for the studied hyper-parameter, other settings are the same as those used in Section 3.2.
Effects of the Number of Clusters.
As shown in the first row of Figure 2, the ASR generally increases with the number of clusters. An interesting phenomenon is that the EER barely changes w.r.t. the number of clusters. This indicates that the attacker can obtain better attack performance by increasing the number of clusters, although doing so increases the query time in the inference stage, which reduces stealthiness.
Effects of the Trigger Volume.
Similar to the effect of the number of clusters, the ASR also increases with the trigger volume while the EER remains almost unchanged. Besides, stealthiness decreases as the trigger volume increases, so attackers should set it based on their needs.
Effects of the Poisoning Rate.
As shown in the third row of Figure 2, both the ASR and the EER are directly related to the poisoning rate: both increase as the poisoning rate increases. The trade-off between attack performance and stealthiness also exists here.
4. CONCLUSION
In this paper, we explored how to conduct backdoor attacks against speaker verification methods. Different from existing backdoor attacks, which adopt one trigger for all poisoned samples, we proposed a clustering-based attack scheme where poisoned samples from different clusters contain different triggers. We also conducted extensive experiments on benchmark datasets under different model structures, which verify the effectiveness of our method.

5. REFERENCES

[1] Li Liu, Gang Feng, Denis Beautemps, and Xiao-Ping Zhang, "A novel resynchronization procedure for hand-lips fusion applied to continuous French cued speech recognition," in EUSIPCO, 2019.
[2] Mattia A. Di Gangi, Matteo Negri, and Marco Turchi, "Adapting transformer to end-to-end spoken language translation," in INTERSPEECH, 2019.
[3] Li Liu, Gang Feng, Denis Beautemps, and Xiao-Ping Zhang, "Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition," IEEE Transactions on Multimedia, 2020.
[4] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in INTERSPEECH, 2017.
[5] Yun Tang, Guohong Ding, Jing Huang, Xiaodong He, and Bowen Zhou, "Deep speaker embedding learning with multi-level pooling for text-independent speaker verification," in ICASSP, 2019.
[6] Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, Paola Garcia-Perera, and Najim Dehak, "Unsupervised feature enhancement for speaker verification," in ICASSP, 2020.
[7] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in ICASSP, 2016.
[8] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP, 2018.
[9] David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in ICASSP, 2019.
[10] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg, "BadNets: Evaluating backdooring attacks on deep neural networks," IEEE Access, vol. 7, pp. 47230–47244, 2019.
[11] Alexander Turner, Dimitris Tsipras, and Aleksander Madry, "Label-consistent backdoor attacks," arXiv preprint arXiv:1912.02771, 2019.
[12] Anh Nguyen and Anh Tran, "Input-aware dynamic backdoor attack," in NeurIPS, 2020.
[13] Yiming Li, Tongqing Zhai, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shutao Xia, "Rethinking the trigger of backdoor attack," arXiv preprint arXiv:2004.04692, 2020.
[14] Yuntao Liu, Ankit Mondal, Abhishek Chakraborty, Michael Zuzak, Nina Jacobsen, Daniel Xing, and Ankur Srivastava, "A survey on neural trojans," in ISQED, 2020.
[15] Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia, "Backdoor learning: A survey," arXiv preprint arXiv:2007.08745, 2020.
[16] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[17] Victor Zue, Stephanie Seneff, and James Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, no. 4, pp. 351–356, 1990.
[18] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, "Voxceleb: Large-scale speaker verification in the wild," Computer Speech and Language, vol. 60, pp. 101027, 2020.
[19] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP, 2018.
[20] Md Sahidullah and Goutam Saha, "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition," Speech Communication, 2012.