Backdoor Attack against Speaker Verification
Tongqing Zhai, Yiming Li, Ziqi Zhang, Baoyuan Wu, Yong Jiang, Shu-Tao Xia
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Secure Computing Lab of Big Data, Shenzhen Research Institute of Big Data, China

Tongqing Zhai and Yiming Li contributed equally to this work.
ABSTRACT
Speaker verification has been widely and successfully adopted in many mission-critical areas for user identification. Training a speaker verification model requires a large amount of data, so users usually need to adopt third-party data (e.g., data from the Internet or a third-party data company). This raises the question of whether adopting untrusted third-party data can pose a security threat. In this paper, we demonstrate that it is possible to inject a hidden backdoor into speaker verification models by poisoning the training data. Specifically, based on our understanding of verification tasks, we design a clustering-based attack scheme where poisoned samples from different clusters contain different triggers (i.e., pre-defined utterances). The infected models behave normally on benign samples, while attacker-specified, unenrolled triggers successfully pass the verification even if the attacker has no information about the enrolled speaker. We also demonstrate that existing backdoor attacks cannot be directly adopted to attack speaker verification. Our attack not only provides a new perspective for designing novel attacks, but also serves as a strong baseline for improving the robustness of verification methods.
Index Terms — Speaker Verification, Backdoor Attack, Security, Deep Learning
1. INTRODUCTION
Acoustic signal processing [1, 2, 3], especially speaker verification [4, 5, 6], has been widely and successfully adopted in our daily life. Speaker verification aims at determining whether a given utterance belongs to a specific speaker. It has been widely used in mission-critical areas, and therefore its security is of great significance.

A typical speaker verification method consists of two main processes: the training process and the enrolling process. In the training process, the model learns a proper feature extractor for generating speaker representations, together with a score function. In the enrolling process, a speaker provides some utterances for enrollment. In the inference stage, the verification method determines whether a given utterance belongs to the enrolled speaker according to the similarities between the representation of the utterance and those of the speaker's enrolled utterances, as generated by the learned feature extractor. Currently, most advanced speaker verification methods are based on deep neural networks (DNNs) [7, 8, 9], whose training often requires a large amount of data. To obtain sufficient data, users usually need to adopt third-party data. This raises an intriguing question:
Will the use of third-party training data bring new security risks to speaker verification?
In this paper, we explore how to maliciously manipulate the behavior of speaker verification by poisoning the training data. Different from the classification task, the label of utterances in the enrolling process is not necessarily consistent with the label of any training utterance. Accordingly, existing backdoor attacks [10, 11, 12, 13, 14, 15], which mainly focus on attacking classification tasks, cannot be adopted to attack speaker verification. To address this problem, we propose a clustering-based attack scheme, based on the idea that the learned feature extractor in speaker verification maps utterances from the same speaker to similar representations while keeping those of different speakers far apart. Specifically, our method first groups the speakers in the training set based on the similarities of their utterances, and then adopts a different trigger (e.g., a pre-defined utterance) for each cluster. In the inference stage, we use all adopted triggers for verification in sequence. Accordingly, although we have no information about the enrolled speaker, our method still has a good chance of passing the verification whenever the utterance features of that speaker are similar to those of the speakers in some cluster. Note that we only need to poison a small amount of training data, using low-energy one-hot-spectrum noises as triggers; the proposed attack is therefore effective while also remaining stealthy.

The main contribution of this work is three-fold:

• We reveal that adopting third-party data for training speaker verification could bring new security risks.
• We propose a clustering-based backdoor attack against speaker verification.
• Extensive experiments are conducted, which verify the effectiveness of the proposed method.

2. THE PROPOSED METHOD

2.1. Preliminaries

In this section, we first briefly review the main processes of speaker verification, and then illustrate the threat model and the attacker's goals of our attack.
Speaker Verification.
Speaker verification aims at verifying whether a given utterance belongs to the enrolled speaker. Currently, most advanced speaker verification methods are DNN-based, adopting a learned DNN f_θ(·) as the feature extractor. A typical speaker verification method consists of two main processes: the training process and the enrolling process.

In the training process, let X_train denote the utterances in the training set and s(·,·) the score function measuring the similarity between the representations of two utterances. The feature extractor f_θ(·) is learned through min_θ L(f_θ(X_train)), where L(·) is a pre-defined loss function. Note that different speaker verification methods may adopt different score functions and different DNN structures. For example, [7] used the cosine similarity as the score function with a long short-term memory (LSTM) [16] based DNN structure, while [8] utilized a different DNN structure.

In the enrolling process, let X = {x_i}_{i=1}^n denote the utterances provided by the enrolled speaker. The trained speaker verification system (with feature extractor f_θ(·)) adopts the vector v ≜ (1/n) Σ_{i=1}^n f_θ(x_i) as the representative of that speaker. Note that the enrolled speaker need not appear in the training set, which makes this task (i.e., verification) very different from classification.

Once the system is trained and enrolled, in the inference stage, suppose there is a new input utterance x. The verification method determines whether x belongs to the enrolled speaker by examining whether s(f_θ(x), v) is greater than a threshold T. If s(f_θ(x), v) > T, x is regarded as belonging to the speaker and passes the verification. In this paper, the threshold T is determined based on the false acceptance rate (FAR) and the false rejection rate (FRR), i.e., T = argmin_T (FAR + FRR). This setting is the same as those suggested in [7, 8].
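For concreteness, the enrolling and inference stages above can be summarized in a short Python sketch. This is a minimal illustration, not the paper's implementation; the extractor f_theta, the data layout, and cosine scoring (as in [7]) are assumptions.

```python
# A minimal sketch of enrollment and verification, assuming a learned
# feature extractor `f_theta` mapping an utterance to an embedding vector.
import numpy as np

def enroll(f_theta, utterances):
    """v = (1/n) * sum_i f_theta(x_i): average embedding of the enrollment set."""
    return np.mean([f_theta(x) for x in utterances], axis=0)

def cosine_score(a, b):
    """The score function s(.,.); cosine similarity as in [7]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def verify(f_theta, x, v, T):
    """x passes the verification iff s(f_theta(x), v) > T."""
    return cosine_score(f_theta(x), v) > T
```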
Threat Model. In this paper, we focus on the backdoor attack against speaker verification. Specifically, we assume that the attacker has full access to the training set: the attacker can perform arbitrary operations, such as adding, removing, or modifying samples, on the benign training set to generate the poisoned training set. In contrast, the attacker has no information about the enrolling process and does not need to manipulate the training process or the model structure. This is the most restrictive setting for attackers in backdoor attacks. Such an attack can occur in many scenarios, including but not limited to the use of third-party training data, third-party training platforms, and third-party model APIs.
Attacker’s Goals.
Attackers have two main goals: effectiveness and stealthiness. Specifically, effectiveness requires that the attacked model can be passed by attacker-specified triggers, and stealthiness requires that the performance on benign testing samples is not significantly reduced and that the adopted triggers are concealed.
2.2. The Proposed Attack

The essence of a backdoor attack is to establish a connection between the trigger and certain label(s) (e.g., a speaker's index in our case). However, different from classification, the label of utterances in the enrolling process is not necessarily consistent with the label of any training utterance. Besides, the attacker has no information about the enrolling process. Accordingly, attackers cannot conduct the backdoor attack by connecting a trigger with the enrolled speaker, as was done when attacking classification tasks.

To conduct the attack in this scenario, the most straightforward idea is to generalize BadNets [10] to establish a connection between one trigger and the utterances of all speakers in the training set. However, since the trained verification method aims to project utterances from the same speaker to similar locations in the latent space while projecting those from different speakers to different locations, this attack will either fail (i.e., cannot build the connection) or crash the model (i.e., the trigger and the utterances of different people will be projected to similar locations). This is further verified in Section 3.2.

To alleviate the aforementioned problems, we propose a clustering-based attack that divides the speakers in the training set into different groups and injects a different trigger for each cluster. Specifically, it consists of three main steps, namely (1) obtaining speakers' representations, (2) speaker clustering, and (3) trigger injection, as follows:
Obtaining Speaker’s Representation.
Suppose that the training set contains utterances from K different speakers. For each utterance x in the training set, we first obtain its embedding v based on a pre-processing function g(·); in this paper, we specify g(·) as a pre-trained feature extractor. After that, we obtain the representation r of each speaker, calculated as the average of the embeddings of all of their training utterances.
Speaker Clustering. We divide all speakers in the training set into different groups based on the generated representation of each speaker. Specifically, in this paper, we adopt k-means as the clustering method for simplicity; more clustering methods will be discussed in our future work.
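Steps (1) and (2) admit a compact sketch. This is a minimal illustration under assumptions: g stands in for the pre-trained extractor, the dict-of-utterances data layout is hypothetical, and scikit-learn's k-means is one possible implementation.

```python
# A minimal sketch of steps (1) and (2): per-speaker representations
# followed by k-means clustering over those representations.
import numpy as np
from sklearn.cluster import KMeans

def cluster_speakers(g, speaker_utts, n_clusters=20):
    """speaker_utts: dict mapping speaker id -> list of utterances."""
    speakers = list(speaker_utts)
    # (1) Each speaker is represented by the mean embedding of their utterances.
    reps = np.stack([
        np.mean([g(x) for x in speaker_utts[s]], axis=0) for s in speakers
    ])
    # (2) Group speakers by the similarity of their representations.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reps)
    return dict(zip(speakers, labels))
```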
Trigger Injection. Once the clustering is finished, we inject the trigger (i.e., an attacker-specified utterance) into p% of the samples of each cluster to construct the poisoned training set. p is dubbed the poisoning rate, which is an important hyper-parameter of our attack. Note that the triggers injected in different clusters are different. Besides, we adopt low-volume one-hot-spectrum noises with different frequencies as our trigger patterns; the lower the volume, the more stealthy the attack.

Fig. 1: An example of triggers in the modified speech file.

Based on our proposed method, the connection between a trigger and its corresponding cluster will be built into models trained on the poisoned dataset. In the inference stage, we use the adopted triggers for verification in sequence. Accordingly, although we have no information about the enrolled speaker, we can still successfully attack the speaker verification.
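The following sketch illustrates one plausible realization of such a trigger on raw waveforms, approximating a one-hot-spectrum noise as a single-frequency sinusoid scaled relative to the utterance's highest amplitude. The sampling rate, the default volume of −30 dB, and the cluster-to-frequency mapping are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of trigger injection on raw waveforms. Each cluster is
# assigned its own frequency; the sinusoid's peak sits `volume_db` below
# the utterance's peak amplitude, so lower volumes are stealthier.
import numpy as np

def one_hot_spectrum_trigger(freq_hz, n_samples, sr, volume_db, ref_peak):
    """A sinusoid at freq_hz, scaled volume_db (negative) below ref_peak."""
    amp = ref_peak * (10.0 ** (volume_db / 20.0))
    t = np.arange(n_samples) / sr
    return amp * np.sin(2 * np.pi * freq_hz * t)

def poison(waveform, cluster_id, sr=16000, volume_db=-30.0,
           cluster_freqs=(500, 900, 1300, 1700)):  # one frequency per cluster
    freq = cluster_freqs[cluster_id % len(cluster_freqs)]
    ref_peak = np.abs(waveform).max()
    return waveform + one_hot_spectrum_trigger(freq, len(waveform), sr,
                                               volume_db, ref_peak)
```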
3. EXPERIMENTS

3.1. Experimental Setting

Model Structure and Dataset Description.
We adopt the d-vector-based DNN [7] (dubbed d-vector) and the x-vector-based DNN [8] (dubbed x-vector) as the model structures, and conduct experiments on the TIMIT [17] and VoxCeleb [18] datasets. The TIMIT dataset contains high-quality recordings of 630 speakers, with each individual reading 10 sentences. The VoxCeleb dataset contains speech utterances extracted from videos uploaded to YouTube; it contains a lot of noise and is much larger than TIMIT. For this dataset, we randomly select 500 speakers and 20 utterances per speaker from the original dataset as the training set to reduce the computational cost.
Baseline Selection.
We select the model trained on the benign training set (dubbed Benign) and the adapted BadNets as baselines for comparison. BadNets [10] was originally proposed for attacking image or voice classification; we extend it to speaker verification by poisoning the training set with a single trigger. That is, compared with our proposed method, which injects different triggers into different speaker groups, BadNets adopts the same trigger for all poisoned samples.
Data Preprocessing. The preprocessing pipeline is the same as the one used in [19]. Specifically, we cut speech files into frames with a width of 25 ms and a step of 10 ms. Then we extract 40-dimensional log-mel-filterbank energies as the representation of each frame, based on the Mel-frequency cepstrum coefficients (MFCC) [20].
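A minimal sketch of this pipeline, assuming librosa as the feature-extraction backend (the paper does not name a library):

```python
# 25 ms frames with a 10 ms step, then 40-dimensional
# log-mel-filterbank energies per frame.
import librosa
import numpy as np

def log_mel_filterbank(path, sr=16000, n_mels=40):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms frame width
        hop_length=int(0.010 * sr),  # 10 ms frame step
        n_mels=n_mels)
    return np.log(mel + 1e-6).T      # shape: (num_frames, 40)
```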
Table 1: The EER (%) and ASR (%) of different attack methods on the TIMIT and VoxCeleb datasets. 'EER' and 'ASR' indicate the equal error rate and the attack success rate, respectively. Boldface indicates the best attack performance.

Dataset →              TIMIT          VoxCeleb
Model ↓   Attack ↓    EER    ASR     EER    ASR
d-vector  Benign      4.3    2.5     12.0   4.0
d-vector  BadNets     7.7    0.0     21.1    —
d-vector  Ours        5.3     —        —     —
Training Setup. We use the GE2E loss [19] in the training process. For our proposed method, we set the number of clusters K = 20, the poisoning rate p = 15%, and the volume of triggers V = − dB (relative to the highest short-term speech volume). For BadNets, the poisoning rate and the trigger volume are the same as those of our method. Other settings are the same as those used in [7, 8]. The adopted trigger patterns are visualized in Figure 1.
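For reference, a simplified NumPy sketch of the GE2E softmax loss of [19] is given below. The batch layout (N speakers × M utterances) follows [19]; treating the scale w and bias b as fixed values is a simplification (they are learnable in the original formulation).

```python
# A simplified sketch of the GE2E softmax loss for a batch of embeddings
# of shape (N, M, D): N speakers, M utterances each (assumes M > 1).
import numpy as np

def ge2e_loss(emb, w=10.0, b=-5.0):
    N, M, D = emb.shape
    centroids = emb.mean(axis=1)                          # (N, D)
    loss = 0.0
    for j in range(N):
        for i in range(M):
            e = emb[j, i]
            # The true speaker's centroid excludes the utterance itself.
            c_self = (centroids[j] * M - e) / (M - 1)
            cands = centroids.copy()
            cands[j] = c_self
            cos = cands @ e / (np.linalg.norm(cands, axis=1) *
                               np.linalg.norm(e) + 1e-8)  # (N,)
            s = w * cos + b
            # Softmax cross-entropy against the true speaker j.
            loss += -s[j] + np.log(np.exp(s).sum())
    return loss / (N * M)
```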
Evaluation Metrics. We adopt the equal error rate (EER) and the attack success rate (ASR) to verify the effectiveness of the proposed method. The EER is defined as the average of the false acceptance rate (FAR) and the false rejection rate (FRR), while the ASR indicates the ratio at which the trigger sequence successfully passes the verification. When testing the ASR, we enroll one speaker from the testing set at a time and query the verification system with the trigger sequence; once any trigger pattern passes, we consider the verification system passed. Note that the lower the EER and the higher the ASR, the better the attack performance.
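Both metrics follow directly from these definitions; a minimal sketch, assuming arrays of genuine and impostor similarity scores (names and layout are assumptions), might look as follows:

```python
# EER at the threshold T = argmin_T (FAR + FRR), and ASR where an
# enrollment counts as hacked if ANY trigger in the sequence passes.
import numpy as np

def eer_and_threshold(genuine, impostor):
    """genuine/impostor: 1-D arrays of similarity scores for each trial type."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor > t).mean() for t in thresholds])
    frr = np.array([(genuine <= t).mean() for t in thresholds])
    i = np.argmin(far + frr)
    return (far[i] + frr[i]) / 2, thresholds[i]

def attack_success_rate(trigger_scores, threshold):
    """trigger_scores: (num_enrolled_speakers, num_triggers) score matrix."""
    passed = (trigger_scores > threshold).any(axis=1)
    return passed.mean()
```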
3.2. Main Results

As shown in Table 1, our method successfully attacks all evaluated speaker verification methods on all datasets, achieving a high ASR in every case. The EER of our method is also on par with that of the model trained on the benign training set, therefore our attack is stealthy. In contrast, BadNets fails to attack the verification in most cases, even though its EER is significantly increased compared with that of the model trained on the benign training set. The only exception occurs when BadNets attacks the d-vector-based model on the VoxCeleb dataset; this success is achieved at the cost of crashing the model (with a significantly high EER), due to the reason discussed in Section 2.2.

Besides, our ASR is evaluated in the scenario where there is only one enrolled speaker. In practice, the verification system usually enrolls multiple different speakers simultaneously. In that case, the ASR of our method would be further, or even significantly, improved. It will be discussed in our future work.
Fig. 2: The ASR (%) and EER (%) of the d-vector and x-vector models w.r.t. different hyper-parameters (number of clusters, trigger volume in dB, and poisoning rate in %) on the TIMIT dataset. The background color indicates the standard deviation over all repeated experiments.
3.3. The Effect of Hyper-parameters

In this section, we discuss the effect of three important hyper-parameters (i.e., the number of clusters, the trigger volume, and the poisoning rate) on the EER and ASR of our method. Each experiment is repeated three times to reduce the effect of randomness. Except for the studied hyper-parameter, other settings are the same as those used in Section 3.2.
Effects of the Number of Clusters.
As shown in the first row of Figure 2, the ASR generally increases with the number of clusters. An interesting phenomenon is that the EER barely changes w.r.t. the number of clusters. This indicates that the attacker can obtain better attack performance by increasing the number of clusters, although doing so increases the query time in the inference stage, which reduces stealthiness.
Effects of the Trigger Volume.
Similar to the effect of the number of clusters, the ASR also increases with the trigger volume while the EER remains almost unchanged. Besides, stealthiness decreases as the trigger volume increases, so attackers should set it based on their needs.
Effects of the Poisoning Rate.
As shown in the third row of Figure 2, both the ASR and the EER are directly related to the poisoning rate: both increase as the poisoning rate increases. The trade-off between attack performance and stealthiness also exists here.
4. CONCLUSION
In this paper, we explored how to conduct backdoor attacks against speaker verification methods. Different from existing backdoor attacks, which adopt one trigger for all poisoned samples, we proposed a clustering-based attack scheme where poisoned samples from different clusters contain different triggers. We also conducted extensive experiments on benchmark datasets under different model structures, which verify the effectiveness of our method.

5. REFERENCES

[1] Li Liu, Gang Feng, Denis Beautemps, and Xiao-Ping Zhang, "A novel resynchronization procedure for hand-lips fusion applied to continuous French cued speech recognition," in EUSIPCO, 2019.
[2] Mattia A. Di Gangi, Matteo Negri, and Marco Turchi, "Adapting transformer to end-to-end spoken language translation," in INTERSPEECH, 2019.
[3] Li Liu, Gang Feng, Denis Beautemps, and Xiao-Ping Zhang, "Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition," IEEE Transactions on Multimedia, 2020.
[4] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in INTERSPEECH, 2017.
[5] Yun Tang, Guohong Ding, Jing Huang, Xiaodong He, and Bowen Zhou, "Deep speaker embedding learning with multi-level pooling for text-independent speaker verification," in ICASSP, 2019.
[6] Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, Paola Garcia-Perera, and Najim Dehak, "Unsupervised feature enhancement for speaker verification," in ICASSP, 2020.
[7] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in ICASSP, 2016.
[8] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP, 2018.
[9] David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in ICASSP, 2019.
[10] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg, "BadNets: Evaluating backdooring attacks on deep neural networks," IEEE Access, vol. 7, pp. 47230–47244, 2019.
[11] Alexander Turner, Dimitris Tsipras, and Aleksander Madry, "Label-consistent backdoor attacks," arXiv preprint arXiv:1912.02771, 2019.
[12] Anh Nguyen and Anh Tran, "Input-aware dynamic backdoor attack," in NeurIPS, 2020.
[13] Yiming Li, Tongqing Zhai, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shutao Xia, "Rethinking the trigger of backdoor attack," arXiv preprint arXiv:2004.04692, 2020.
[14] Yuntao Liu, Ankit Mondal, Abhishek Chakraborty, Michael Zuzak, Nina Jacobsen, Daniel Xing, and Ankur Srivastava, "A survey on neural trojans," in ISQED, 2020.
[15] Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia, "Backdoor learning: A survey," arXiv preprint arXiv:2007.08745, 2020.
[16] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[17] Victor Zue, Stephanie Seneff, and James Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, no. 4, pp. 351–356, 1990.
[18] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, "Voxceleb: Large-scale speaker verification in the wild," Computer Speech and Language, vol. 60, pp. 101027, 2020.
[19] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP, 2018.
[20] Md Sahidullah and Goutam Saha, "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition," Speech Communication, 2012.