Natural Language Adversarial Defense through Synonym Encoding
Natural Language Adversarial Attack and Defense in Word Level
Xiaosen Wang ∗, Hao Jin ∗, Kun He †
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
{xiaosen,mailtojinhao,brooklet60}@hust.edu.cn
Abstract
In recent years, inspired by the large body of research on adversarial examples in computer vision, there has been growing interest in designing adversarial attacks for Natural Language Processing (NLP) tasks, followed by very few works on adversarial defenses for NLP. To our knowledge, there exists no defense method against the successful synonym substitution based attacks, which aim to satisfy all the lexical, grammatical and semantic constraints and are thus hard for humans to perceive. We contribute to filling this gap and propose a novel adversarial defense method called Synonym Encoding Method (SEM), which inserts an encoder before the input layer of the model and then trains the model to eliminate adversarial perturbations. Extensive experiments demonstrate that SEM can efficiently defend against the current best synonym substitution based adversarial attacks with little decay in accuracy on benign examples. To better evaluate SEM, we also design a strong attack method called Improved Genetic Algorithm (IGA) that adopts the genetic metaheuristic for synonym substitution based attacks. Compared with the first genetic based adversarial attack proposed in 2018, IGA achieves a higher attack success rate with a lower word substitution rate, while maintaining the transferability of adversarial examples.
∗ The first two authors contributed equally.
† Corresponding author.

Deep Neural Networks (DNNs) have achieved great success in various machine learning tasks, such as computer vision (Krizhevsky et al., 2012; He et al., 2016) and Natural Language Processing (NLP) (Kim, 2014; Lai et al., 2015; Devlin et al., 2018). However, recent studies have discovered that DNNs are vulnerable to adversarial examples not only for computer vision tasks (Szegedy et al., 2014) but also for NLP tasks (Papernot et al., 2016), posing a serious threat to their safe application. For instance, spammers can evade a spam filtering system with adversarial examples of spam emails while preserving the intended meaning.

In contrast to the numerous methods proposed for adversarial attacks (Goodfellow et al., 2015; Carlini and Wagner, 2017; Athalye et al., 2018) and defenses (Goodfellow et al., 2015; Guo et al., 2018; Song et al., 2019) in computer vision, there are only a few works in the area of NLP, inspired by the works for images and emerging only in the last two years (Zhang et al., 2019). This is mainly because existing perturbation-based methods for images cannot be directly applied to texts, which are discrete in nature. Furthermore, if we want the perturbation to be barely perceptible to humans, it should satisfy the lexical, grammatical and semantic constraints of texts, making it even harder to generate text adversarial examples.

Current attacks in NLP fall into four categories, namely modifying the characters of a word (Liang et al., 2017; Ebrahimi et al., 2017), adding or removing words (Liang et al., 2017), replacing words arbitrarily (Papernot et al., 2016), or substituting words with synonyms (Alzantot et al., 2018; Ren et al., 2019). However, attacks in the first three categories are easy to detect and defend against with spell or syntax checks (Rodriguez and Rojas-Galeano, 2018; Pruthi et al., 2019). As synonym substitution aims to satisfy all the lexical, grammatical and semantic constraints, it is hard to detect by automatic spell or syntax checks as well as by human inspection. To our knowledge, there is currently no defense method specifically designed against synonym substitution based attacks.

In this work, we postulate that the model generalization leads to the existence of adversarial examples: a generalization that is not strong enough causes the problem that there usually exist some neighbors $x'$ of a benign example $x$ in the manifold with a different classification. Based on this hypothesis, we propose a novel defense mechanism called Synonym Encoding Method (SEM) that encodes all the synonyms to a unique code so as to force all the neighbors of $x$ to have the same label as $x$. Specifically, we first cluster the synonyms according to the Euclidean distance in the embedding space to construct the encoder. Then we insert the encoder before the input layer of the deep model without modifying its architecture, and train the model again to defend against adversarial attacks. In this way, we can defend against synonym substitution based adversarial attacks effectively in the context of text classification.

Extensive experiments on three popular datasets demonstrate that the proposed SEM can effectively defend against adversarial attacks, while maintaining efficiency and achieving roughly the same accuracy on benign data as the original model. To our knowledge, SEM is the first proposed method that can effectively defend against synonym substitution based adversarial attacks.

Besides, to demonstrate the efficacy of SEM, we also propose a genetic based attack method, called Improved Genetic Algorithm (IGA), which is well-designed and more effective than the first proposed genetic based attack algorithm, GA (Alzantot et al., 2018). Experiments show that IGA can degrade the classification accuracy more significantly with a lower word substitution rate than GA. Meanwhile, IGA keeps the transferability of adversarial examples as GA does.
Let $\mathcal{W}$ denote the word set containing all the legal words. Let $x = \{w_1, \ldots, w_i, \ldots, w_n\}$ denote an input text, $\mathcal{C}$ the corpus that contains all the possible input texts, and $\mathcal{Y} \in \mathbb{N}^K$ the output space, where $K$ is the dimension of $\mathcal{Y}$. The classifier $f: \mathcal{C} \to \mathcal{Y}$ takes an input $x$ and predicts its label $f(x)$, and $S_m(x, y)$ denotes the confidence value for the $y$-th category at the softmax layer. Let $Syn(w, \sigma, k)$ represent the set of the first $k$ synonyms of $w$ within distance $\sigma$, namely

$$Syn(w, \sigma, k) = \{\hat{w}_1, \ldots, \hat{w}_i, \ldots, \hat{w}_k \mid \hat{w}_i \in \mathcal{W} \wedge \|w - \hat{w}_1\|_p \le \cdots \le \|w - \hat{w}_k\|_p < \sigma\}, \tag{1}$$

where $\|w - \hat{w}\|_p$ is the $p$-norm distance evaluated on the corresponding embedding vectors.

Suppose we have an ideal classifier $c: \mathcal{C} \to \mathcal{Y}$ that always outputs the correct label for any input text $x$. For a subset of (train or test) texts $\mathcal{T} \subseteq \mathcal{C}$ and a small constant $\epsilon$, we define the natural language adversarial examples as follows:

$$\mathcal{A} = \{x_{adv} \in \mathcal{C} \mid \exists x \in \mathcal{T},\ f(x_{adv}) \ne c(x_{adv}) = c(x) = f(x) \wedge d(x - x_{adv}) < \epsilon\}, \tag{2}$$

where $d(x - x_{adv})$ is a distance metric that evaluates the dissimilarity between the benign example $x = \{w_1, \ldots, w_i, \ldots, w_n\}$ and the adversarial example $x_{adv} = \{w'_1, \ldots, w'_i, \ldots, w'_n\}$. $d(\cdot)$ is usually defined as the $p$-norm distance: $d(x - x_{adv}) = \|x - x_{adv}\|_p = \left(\sum_i \|w_i - w'_i\|^p\right)^{1/p}$.

Here we provide a brief overview of three popular synonym substitution based adversarial attack methods.
Greedy Search Algorithm (GSA). Kuleshov et al. (2018) propose a greedy search algorithm that substitutes words with their synonyms so as to maintain semantic and syntactic similarity. GSA first constructs a synonym set $W_s$ for an input text $x = \{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_n\}$:

$$W_s = \{Syn(w_i, \sigma, k) \mid 1 \le i \le n\}.$$

Initially, let $x_{adv} = x$. Then at each stage, for $x_{adv} = \{w'_1, \ldots, w'_{i-1}, w'_i, w'_{i+1}, \ldots, w'_n\}$, GSA finds a word $\hat{w}'_i \in W_s$ that satisfies the syntactic constraint and minimizes the confidence value $S_m(\hat{x}, y_{true})$, where $\hat{x} = \{w'_1, \ldots, w'_{i-1}, \hat{w}'_i, w'_{i+1}, \ldots, w'_n\}$, and updates $x_{adv} = \hat{x}$. This process iterates until $x_{adv}$ becomes an adversarial example or the word substitution rate reaches a threshold.

Genetic Algorithm (GA). Alzantot et al. (2018) propose a population-based algorithm that replaces words with their synonyms so as to generate semantically and syntactically similar adversarial examples. There are three operators in GA:
• $Mutate(x)$: Randomly choose a word $w_i$ in text $x$ that has not been updated and substitute $w_i$ with $\hat{w}_i$, one of its synonyms in $Syn(w_i, \sigma, k)$ that does not violate the syntax constraint imposed by the "Google one billion words language model" (Chelba et al., 2013) and that minimizes $S_m(\hat{x}, y_{true})$, where $\hat{x} = \{w_1, \ldots, w_{i-1}, \hat{w}_i, w_{i+1}, \ldots, w_n\}$ and $S_m(\hat{x}, y_{true}) < S_m(x, y_{true})$;
• $Sample(\mathcal{P})$: Randomly sample a text $x_i$ from the population $\mathcal{P} = \{x_1, \ldots, x_i, \ldots, x_m\}$ with a probability proportional to $1 - S_m(x_i, y_{true})$;
• $Crossover(a, b)$: Construct a new text $c = \{w^c_1, \ldots, w^c_i, \ldots, w^c_n\}$, where $w^c_i$ is randomly chosen from $\{w^a_i, w^b_i\}$ based on the input texts $a = \{w^a_1, \ldots, w^a_i, \ldots, w^a_n\}$ and $b = \{w^b_1, \ldots, w^b_i, \ldots, w^b_n\}$.

For a text $x$, GA first generates an initial population $\mathcal{P}^0$ of size $m$:

$$\mathcal{P}^0 = \{Mutate(x), \ldots, Mutate(x)\}.$$

Then at each iteration, GA generates the next generation of the population through the crossover and mutation operators:

$$x^{i+1}_{adv} = \arg\min_{\tilde{x} \in \mathcal{P}^i} S_m(\tilde{x}, y_{true}),$$
$$c^{i+1}_k = Crossover(Sample(\mathcal{P}^i), Sample(\mathcal{P}^i)),$$
$$\mathcal{P}^{i+1} = \{x^{i+1}_{adv}, Mutate(c^{i+1}_1), Mutate(c^{i+1}_2), \ldots, Mutate(c^{i+1}_{m-1})\}.$$

GA terminates when it finds an adversarial example or reaches the maximum number of iterations.
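The Sample and Crossover operators and one generation step of GA can be sketched in Python as follows. This is a minimal illustration rather than the implementation of Alzantot et al. (2018): the function names, the `score` callable (standing in for $S_m(\cdot, y_{true})$) and the `mutate` callable (standing in for $Mutate$, including its language-model check) are hypothetical, and the uniform fallback when all sampling weights are zero is our own assumption.

```python
import random


def sample(population, scores, rng):
    """Pick one text with probability proportional to 1 - S_m(x_i, y_true)."""
    weights = [max(0.0, 1.0 - s) for s in scores]
    if sum(weights) == 0.0:
        # Assumption: fall back to uniform sampling if every weight is zero.
        return rng.choice(population)
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(a, b, rng):
    """Build a child whose i-th word is drawn at random from {a[i], b[i]}."""
    if len(a) != len(b):
        raise ValueError("parents must have the same length")
    return [rng.choice((wa, wb)) for wa, wb in zip(a, b)]


def next_generation(population, score, mutate, rng):
    """One GA step: keep the best text, then add m - 1 mutated children."""
    scores = [score(x) for x in population]
    # The current best candidate minimizes the true-class confidence.
    best = population[min(range(len(population)), key=scores.__getitem__)]
    children = []
    for _ in range(len(population) - 1):
        parent_a = sample(population, scores, rng)
        parent_b = sample(population, scores, rng)
        children.append(mutate(crossover(parent_a, parent_b, rng)))
    return [best] + children
```

In an actual attack, `score` would query the victim model and `mutate` would perform the synonym substitution; both are left as parameters here so that the control flow stays visible.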
Probability Weighted Word Saliency (PWWS). Ren et al. (2019) propose a new synonym substitution based attack method called Probability Weighted Word Saliency (PWWS), which considers the word saliency as well as the classification confidence. Specifically, given a text $x = \{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_n\}$, PWWS first calculates the saliency of each word $S(x, w_i)$:

$$S(x, w_i) = S_m(x, y_{true}) - S_m(\bar{x}_i, y_{true}),$$

where $\bar{x}_i = \{w_1, \ldots, w_{i-1}, \text{unk}, w_{i+1}, \ldots, w_n\}$ and "unk" means the word is removed from the text. Then PWWS calculates the maximum possible change in the classification confidence resulting from substituting word $w_i$ with one of its synonyms:

$$\Delta S^{*}_{m_i}(x) = \max_{\hat{w}_i \in Syn(w_i, \sigma, k)} \left[S_m(x, y_{true}) - S_m(\hat{x}_i, y_{true})\right],$$

where $\hat{x}_i = \{w_1, \ldots, w_{i-1}, \hat{w}_i, w_{i+1}, \ldots, w_n\}$. Then, PWWS sequentially checks the words in descending order of $\phi(S(x, w_i))_i \cdot \Delta S^{*}_{m_i}(x)$, where $\phi(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}}$, and substitutes the current word $w_i$ with its optimal synonym $w^*_i$:

$$w^*_i = \arg\max_{\hat{w}_i \in Syn(w_i, \sigma, k)} \left[S_m(x, y_{true}) - S_m(\hat{x}_i, y_{true})\right].$$

PWWS terminates when it finds an adversarial example $x_{adv}$ or it has replaced all the words in $x$.

There are only a few works on text adversarial defenses.
• At the character level, Pruthi et al. (2019) propose to place a word recognition model in front of the downstream classifier to defend against character-level adversarial attacks by combating adversarial spelling mistakes.
• At the word level, for defenses against synonym substitution based attacks, only Alzantot et al. (2018) and Ren et al. (2019) incorporate the adversarial training strategy proposed in the image domain (Goodfellow et al., 2015) with their respective text attack methods, and demonstrate that adversarial training can promote the model's robustness. However, there is no defense method specifically designed to defend against synonym substitution based adversarial attacks.
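To make the PWWS ranking rule described earlier in this section concrete, the following Python sketch computes the word saliency, the best achievable confidence drop per position, and the resulting substitution order. It is an illustration under stated assumptions, not the code of Ren et al. (2019): `confidence` is a hypothetical callable standing in for $S_m(\cdot, y_{true})$, `synonyms` is a hypothetical dictionary standing in for $Syn(w_i, \sigma, k)$, and a word without synonyms is assigned a confidence drop of zero.

```python
import math


def softmax(values):
    """phi(z)_i = exp(z_i) / sum_k exp(z_k), computed in a stable way."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pwws_order(words, synonyms, confidence, unk="<unk>"):
    """Return (position, best synonym, score) in descending PWWS score."""
    base = confidence(words)
    saliency, best = [], []
    for i, word in enumerate(words):
        # Word saliency: confidence drop when w_i is replaced by "unk".
        masked = words[:i] + [unk] + words[i + 1:]
        saliency.append(base - confidence(masked))
        # Largest confidence drop over the synonyms of w_i.
        drops = [
            (base - confidence(words[:i] + [cand] + words[i + 1:]), cand)
            for cand in synonyms.get(word, [])
        ]
        # Assumption: no synonyms means no change and a drop of zero.
        best.append(max(drops) if drops else (0.0, word))
    weights = softmax(saliency)
    ranked = [(weights[i] * best[i][0], i, best[i][1]) for i in range(len(words))]
    ranked.sort(key=lambda item: item[0], reverse=True)
    return [(i, cand, score) for score, i, cand in ranked]
```

The attack itself would then walk through this order, apply each substitution in turn, and stop as soon as the predicted label changes.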
We first introduce our motivation, then present our text defense method, Synonym Encoding Method (SEM).
Let $\mathcal{X}$ denote the input space and $V_\epsilon(x)$ denote the $\epsilon$-neighborhood of a data point $x \in \mathcal{X}$, where $V_\epsilon(x) = \{x' \in \mathcal{X} \mid \|x' - x\| < \epsilon\}$. As illustrated in Figure 1 (a), we postulate that the generalization of the model leads to the existence of adversarial examples. More generally, given a data point $x \in \mathcal{X}$, $\exists x' \in V_\epsilon(x)$, $f(x') \ne y'_{true}$, where $x'$ is an adversarial example of $x$.

Ideally, to defend against the adversarial attack, we need to train a classifier $f$ that not only guarantees $f(x) = y_{true}$, but also assures $\forall x' \in V_\epsilon(x)$, $f(x') = y'_{true}$. Thus, the most effective way is to add more labeled data to improve the adversarial robustness (Schmidt et al., 2018). As illustrated in Figure 1 (b), if we had infinite labeled data, we could train a model $f$: $\forall x' \in V_\epsilon(x)$, $f(x') = y'_{true}$ with high probability so that the model $f$ is robust enough to adversarial examples. Practically, however, labeling data is very expensive and it is impossible to have infinite labeled data.

Figure 1: The neighborhood of a data point $x$ in the input space. (a) Traditional training: there exist some data points $x'$ that the model has never seen before and that yield a wrong classification; in other words, such a data point $x'$ is an adversarial example. (b) Adding infinite labeled data: this is an ideal case in which the model has seen all the data points and can resist adversarial examples. (c) Sharing label: all the neighbors share the same label with $x$. (d) Mapping neighborhood data points: mapping all neighbors to the center $x$ so as to eliminate adversarial examples.

Because it is impossible to have infinite labeled data to train a robust model, as illustrated in Figure 1 (c), Wong and Kolter (2018) propose to construct a convex outer bound and guarantee that all data points in this bound share the same label. The goal is to train a model $f$: $\forall x' \in V_\epsilon(x)$, $f(x') = f(x) = y_{true}$. Specifically, they propose a linear-programming (LP) based upper bound on the robust loss by adopting a linear relaxation of the ReLU activation, and minimize this upper bound during training. Then they bound the LP optimal value and calculate the elementwise bounds on the activation functions based on a backward pass through the network. Although their method does not need any extra data, it is hard to scale to realistically-sized networks due to the high computational complexity.

In this work, as illustrated in Figure 1 (d), we propose a novel way to find a mapping $m: \mathcal{X} \to \mathcal{X}$ where $\forall x' \in V_\epsilon(x)$, $m(x') = x$. In this way, we force the classification to be smoother, and we do not need any extra data to train the model or modify the architecture of the model. All we need to do is to insert the mapping before the input layer and train the model on the original training set. Now the problem turns into how to locate the neighbors of a data point $x$. For image tasks, it is hard to find all images in the neighborhood of $x$ in the input space, and there could be an infinite number of neighbors. For NLP tasks, however, utilizing the property that words in sentences are discrete tokens, we can easily find almost all neighbors of an input text. Based on this insight, we propose a new method called Synonym Encoding to locate the neighbors of an input text $x$.
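The observation that the neighbors of a text are discrete and can be listed is easy to see in a small Python sketch. The function name `neighbors` and the hand-written synonym dictionary are hypothetical; in the setting of this paper the synonym sets would come from $Syn(w, \sigma, k)$ in the embedding space.

```python
from itertools import product


def neighbors(words, synonyms):
    """Yield every text obtained by replacing words with listed synonyms."""
    # For each position, the options are the word itself plus its synonyms.
    options = [[w] + [s for s in synonyms.get(w, []) if s != w] for w in words]
    for combo in product(*options):
        candidate = list(combo)
        if candidate != words:  # skip the original text itself
            yield candidate


# Example with a hypothetical synonym dictionary.
text = ["good", "film"]
table = {"good": ["great", "nice"], "film": ["movie"]}
for candidate in neighbors(text, table):
    print(" ".join(candidate))
```

The number of neighbors grows multiplicatively with the text length, so the set is finite but large; mapping synonyms to a shared code avoids having to enumerate it explicitly.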
We assume that the closer the meanings of two sentences are, the closer they are in the input space, so we can suppose that the neighbors of $x$ are its synonymous sentences. To find the synonymous sentences, we can substitute words in the sentence with their synonyms. In this way, to construct the mapping $m$, all we need to do is to cluster the synonyms in the embedding space and allocate a unique token for each cluster. The details of synonym encoding are in Algorithm 1. Note that in our experiments, we implement the synonym encoding on GloVe vectors after counter-fitting (Mrkšić et al., 2016), which injects antonymy and synonymy constraints into vector space representations.

The current synonym substitution based text adversarial attacks have a constraint that they only substitute words at the same position once (Alzantot et al., 2018; Ren et al., 2019) or replace words with the first $k$ synonyms of the word in the original input $x$ (Kuleshov et al., 2018). This constraint can lead to local minima for adversarial examples, and it is hard to choose a suitable $k$ as different words may have different numbers of synonyms.

To address this issue, we propose an Improved Genetic Algorithm (IGA), which allows words in the same position to be substituted more than once based on the current text $x'$. In this way, IGA can traverse all synonyms of a word no matter what value $k$ is. Meanwhile, we can avoid local minima to some extent, as we allow the word in the current position to be replaced by the original word. In order to guarantee that the substituted word is still a synonym of the original word, each word in the same position can be replaced at most $\lambda$ times, and in our experiments we set $\lambda = 5$. Unlike the first genetic based text attack algorithm of Alzantot et al. (2018), we change the structure of the algorithm, including the operators for crossover and mutation. For more details of IGA, see Appendix A.1.

Algorithm 1 Synonym Encoding Algorithm
Input: $\mathcal{W}$: dictionary of words
       $n$: size of $\mathcal{W}$
       $\sigma$: distance for synonyms
       $k$: number of synonyms for each word
Output: $E$: encoding result
  $E = \{w_1: \text{NONE}, \ldots, w_n: \text{NONE}\}$
  for each word $w_i \in \mathcal{W}$ do
    if $E[w_i] = \text{NONE}$ then
      if $\exists \hat{w}_i \in Syn(w_i, \sigma, k)$, $E[\hat{w}_i] \ne \text{NONE}$ then
        $w^*_i \leftarrow$ the closest $\hat{w}_i \in Syn(w_i, \sigma, k)$
        $E[w_i] = E[w^*_i]$
      else
        $E[w_i] = w_i$
      end if
      for each word $\hat{w}_i$ in $Syn(w_i, \sigma, k)$ do
        if $E[\hat{w}_i] = \text{NONE}$ then
          $E[\hat{w}_i] = E[w_i]$
        end if
      end for
    end if
  end for
  return $E$
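Algorithm 1 (synonym encoding) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the names `syn`, `build_encoder` and `encode` are hypothetical, the embeddings are passed as a small dictionary of tuples instead of counter-fitted GloVe vectors, the brute-force neighbor search is only suitable for toy vocabularies, and we read "the closest" synonym as the closest one that already has a code.

```python
import math


def syn(word, vectors, sigma, k):
    """First k words within Euclidean distance sigma of `word`, nearest first."""
    base = vectors[word]
    scored = []
    for other, vec in vectors.items():
        if other == word:
            continue
        distance = math.dist(base, vec)
        if distance < sigma:
            scored.append((distance, other))
    scored.sort()
    return [w for _, w in scored[:k]]


def build_encoder(vectors, sigma, k):
    """Map every word to the code shared by its synonym cluster."""
    enc = {w: None for w in vectors}
    for w in vectors:  # traversal order is the dictionary order
        if enc[w] is not None:
            continue
        synonyms = syn(w, vectors, sigma, k)
        encoded = [s for s in synonyms if enc[s] is not None]
        if encoded:
            # Assumption: reuse the code of the closest already-encoded synonym.
            enc[w] = enc[encoded[0]]
        else:
            enc[w] = w
        for s in synonyms:
            if enc[s] is None:
                enc[s] = enc[w]
    return enc


def encode(words, enc):
    """Apply the encoder to a text; unknown words are left unchanged."""
    return [enc.get(w, w) for w in words]
```

With such an encoder placed in front of the model, two texts that differ only by words from the same cluster are encoded to the same token sequence, so the classifier receives identical inputs for both.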
We evaluate the efficacy of SEM with four attacks, GSA (Kuleshov et al., 2018), GA (Alzantot et al., 2018), PWWS (Ren et al., 2019) and our IGA, on three popular datasets involving three different neural network classification models. The results demonstrate that SEM can significantly improve the robustness of neural networks and that IGA can achieve better attack performance than existing attacks.

We first provide an overview of the datasets, classification models and baselines used in the experiments.
Datasets. In order to evaluate the efficacy of SEM, we choose three popular datasets: IMDB, AG’s News, and Yahoo! Answers. IMDB (Potts, 2011) is a large dataset for binary sentiment classification, containing , highly polarized movie reviews for training and , for testing. AG’s News (Zhang et al., 2015) consists of news articles pertaining to four classes: World, Sports, Business and Sci/Tech. Each class contains , training examples and , testing examples. Yahoo! Answers (Zhang et al., 2015) is a topic classification dataset from the "Yahoo! Answers Comprehensive Questions and Answers" version 1.0 dataset with 10 categories, such as Society & Culture, Science & Mathematics, etc. Each class contains 140,000 training samples and 5,000 testing samples.
Models. To better evaluate our method, we adopt several state-of-the-art models for text classification, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The embedding dimension for all models is 300 (Mikolov et al., 2013). We replicate the CNN architecture from Kim (2014), which contains three convolutional layers with filter sizes of 3, 4, and 5 respectively, a max-pooling layer and a fully-connected layer. LSTM consists of three LSTM layers where each layer has LSTM units and a fully-connected layer (Liu et al., 2016). Bi-LSTM contains a bi-directional LSTM layer whose forward and reverse directions have LSTM units respectively and a fully-connected layer.
Baselines. We take the method of adversarial training (Goodfellow et al., 2015) as our baseline. However, due to the low efficiency of text adversarial attacks, we cannot implement adversarial training as it is done in the image domain. In the experiments, we adopt PWWS, which is quicker than GA and IGA, to generate adversarial examples of the training set, and re-train the model on the training data augmented with these adversarial examples.
To evaluate the efficacy of the SEM method, we randomly sample correctly classified examples on different models from each dataset and use the above attack methods to generate adversarial examples with or without defenses. The more effective the defense method is, the less the classification accuracy of the model drops. Table 1 shows the efficacy of various attack and defense methods.

For each network model, we look at each row to find the best defense result under the setting of no attack, or GSA, PWWS, GA, and IGA attacks:
• Under the setting of no attack, adversarial training (AT) could improve the classification accuracy of the models on all datasets, as adversarial training (AT) is also a way to augment the training data. Our defense method SEM reaches an accuracy very close to normal training (NT), which is a common phenomenon in the image domain (Zhang and Wang, 2019; Song et al., 2019).
• Under the four attacks, however, the classification accuracy with normal training (NT) and adversarial training (AT) drops significantly. For normal training (NT), the accuracy degrades more than , and on the three datasets respectively. And adversarial training (AT) cannot defend against these attacks effectively, especially for PWWS and IGA on IMDB and Yahoo! Answers, where AT only improves the accuracy a little (smaller than ). By contrast, SEM can remarkably improve the robustness of the deep models for all the four attacks.

Table 1: The classification accuracy (%) of various models on the datasets, with and without defenses, under adversarial attacks. For each model (Word-CNN, LSTM, or Bi-LSTM), looking at each row, the highest classification accuracy among the defense methods is highlighted in bold to indicate the best defense efficacy; looking at each column, the lowest classification accuracy under the various adversarial attacks is underlined to indicate the best attack efficacy. NT: Normal Training, AT: Adversarial Training.
Dataset Attack Word-CNN (NT AT SEM) LSTM (NT AT SEM) Bi-LSTM (NT AT SEM)
IMDB No Attack 88.7
IMDB PWWS 4.4 5.3
IMDB GA 7.1 10.7
IMDB IGA 0.9 2.7
AG’s News No Attack 91.7
AG’s News PWWS 30.7 41.5
AG’s News GA 24.1 40.6
AG’s News IGA 21.5 35.5
Yahoo! Answers No Attack 68.4
Yahoo! Answers PWWS 10.3 12.5
Yahoo! Answers GA 13.7 16.6
Yahoo! Answers IGA 8.9 10.0

In the image domain, the transferability of an adversarial attack refers to its ability to decrease the accuracy of models using adversarial examples generated based on other models (Szegedy et al., 2014; Goodfellow et al., 2015). Papernot et al. (2016) find that adversarial examples in NLP also exhibit good transferability. Therefore, a good defense method should not only defend against adversarial attacks but also resist the transferability of adversarial examples.

To evaluate the ability to prevent the transferability of adversarial examples, we generate adversarial examples on each model under normal training, and test them on other models with or without defense on Yahoo! Answers. The results are shown in Table 2. On almost all models with adversarial examples generated by other models, SEM yields the highest classification accuracy.
For text attacks, we compare the proposed IGA with GA from various aspects, including attack efficacy, word substitution rate, example generation efficiency, transferability and human evaluation.

Attack Efficacy. As shown in Table 1, looking at each column, we see that under normal training (NT) and adversarial training (AT), IGA always achieves the lowest classification accuracy, which corresponds to the highest attack success rate, on all models and datasets among the four attacks. Under the third column of SEM defense, though IGA may not be the best among all attacks, IGA always outperforms GA.

Table 2: The classification accuracy (%) of various models for adversarial examples generated on other models on Yahoo! Answers for evaluating the transferability. * indicates that the adversarial examples are generated based on this model.
Attack Word-CNN (NT AT SEM) LSTM (NT AT SEM) Bi-LSTM (NT AT SEM)
GSA 19.6* 52.7
PWWS 10.3* 54.4
IGA 8.9* 53.7
GSA 47.2 52.7
PWWS 43.7 54.7
GA 41.0 48.5
IGA 47.8 53.0
GSA 43.7 53.4
PWWS 41.7 48.5
IGA 44.8 50.6
Word Substitution Rate. Besides, as depicted in Table 3, IGA yields a lower word substitution rate than GA on most models. Note that for SEM, GA can yield a lower word substitution rate, because GA may not replace a word at all when the first replacement brings no benefit, which is the case for most words. This indicates that GA stops at a local minimum while IGA continues to substitute words and reaches a lower classification accuracy, as demonstrated in Table 1.
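The word substitution rate is not defined explicitly in this section; a natural reading, assumed here, is the percentage of positions at which the adversarial text differs from the original one. A minimal Python sketch under that assumption (the function name is hypothetical):

```python
def word_substitution_rate(original, adversarial):
    """Percentage of positions whose word differs between the two texts."""
    if len(original) != len(adversarial):
        # Synonym substitution keeps the length, so a mismatch is an error.
        raise ValueError("texts must have the same number of words")
    if not original:
        return 0.0
    changed = sum(o != a for o, a in zip(original, adversarial))
    return 100.0 * changed / len(original)


# One of four words replaced gives a rate of 25%.
rate = word_substitution_rate(
    ["a", "good", "film", "overall"],
    ["a", "great", "film", "overall"],
)
```

A table such as Table 3 would then report this quantity averaged over the generated adversarial examples, which is again our reading rather than a statement from the text.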
Generating Efficiency. Moreover, IGA is four times faster than GA overall because it needs fewer iterations to generate adversarial examples and does not have the syntax check module. In our experiments, it usually takes - minutes for IGA to generate an example while GA needs - minutes on average.

Transferability. As shown in Table 2, the adversarial examples generated by IGA maintain roughly the same transferability as those generated by GA. For instance, if we generate adversarial examples on Word-CNN (column 2, NT), GA achieves better transferability on LSTM with NT (column 5) while IGA achieves better transferability on LSTM with AT and SEM (columns 6 and 7).
Human Evaluation. To further verify that the perturbations in the adversarial examples generated by IGA are hard for humans to perceive, we also perform a human evaluation on IMDB with 35 volunteers. We first randomly choose benign examples that can be classified correctly and generate adversarial examples by GA and IGA on the three models so that we have a total of examples. Then we randomly split them into groups where each group contains examples. We ask every five volunteers to classify one group independently. The accuracy of human evaluation on benign examples is . .

As shown in Table 4, the human classification accuracy on adversarial examples generated by IGA is slightly higher than that on examples generated by GA, and is slightly closer to the human accuracy on benign examples, which means the adversarial examples generated by IGA are more realistic to humans. This is counter-intuitive, as we do not adopt the syntax check module that GA uses. However, IGA chooses synonyms within a small embedding distance and tries to avoid local minima to achieve a lower word substitution rate, which leads to a slightly better human evaluation for IGA.

Summary. IGA outperforms GA in all the aspects we compare. IGA achieves the highest attack success rate compared with other synonym substitution based adversarial attacks and yields a lower word substitution rate than GA. Besides, the adversarial examples generated by IGA maintain the same transferability as those of GA, and their perturbations are slightly harder for humans to perceive. To give an intuitive experience, we list some adversarial examples generated by GA and IGA in Appendix A.3.

Figure 2: An illustration of different orders to traverse the synonyms at the second line of Algorithm 1, shown in the word embedding space. (a) First traverse words on the left, then words on the right and finally words in the middle. The synonyms are encoded into two different codes (left and right). (b) First traverse words on the left, then words in the middle and finally words on the right. All the synonyms are encoded into a unique code of the left. (c) First traverse words on the right, then words in the middle and finally words on the left. All the synonyms are encoded into a unique code of the right.

Table 3: The word substitution rate (%) for GA and IGA on different models.
Dataset Attack Word-CNN (NT AT SEM) LSTM (NT AT SEM) Bi-LSTM (NT AT SEM)
IMDB GA 9.3 9.3
IMDB IGA
Yahoo! Answers GA 12.4 9.5 4.7 12.5 15.8 8.1 13.9 15.3
Yahoo! Answers IGA

Table 4: Classification accuracy (%) on adversarial examples by human evaluation.
Attack Word-CNN LSTM Bi-LSTM
GA 88.9 88.0 89.0
IGA
In this subsection, we discuss some sensitive issues of SEM, including the order in which the words are traversed and the impact of the hyper-parameters.

As shown in Figure 2, the order in which the words are traversed at the second line of Algorithm 1 can influence the final synonym encoding code for a word, and it can even lead to different codes for the same synonym set. However, the aim of synonym encoding is to find an encoder to defend against adversarial examples rather than to find an exact and unique code for each synonym set. Note that for a text $x = \{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_n\}$, if we just replace an arbitrary word $w_i$ with one of its synonyms $\hat{w}_i$ to obtain a new text $\hat{x} = \{w_1, \ldots, w_{i-1}, \hat{w}_i, w_{i+1}, \ldots, w_n\}$, we usually have $f(x) = f(\hat{x})$. Therefore, different codes for the same synonym set hardly influence the efficacy of SEM, and this randomness might also make the model harder to attack.

Besides, we further explore how the hyper-parameters $\epsilon$ and $k$ in Algorithm 1 influence the efficacy of SEM, as shown in Appendix A.2. According to the analysis, it is important to find suitable $\epsilon$ and $k$ to achieve a good trade-off between the accuracy on benign examples and on adversarial examples, and in our experiments we set $\epsilon = 0.$ and $k = 10$.

In this work, we propose the first word-level adversarial defense method called Synonym Encoding Method (SEM) for text classification. SEM encodes the synonyms of each word to defend against synonym substitution based adversarial attacks, which are currently the best text attack methods. Extensive experiments show that SEM can effectively defend against adversarial attacks and degrade the transferability of adversarial examples, while maintaining the classification accuracy on benign data.

In addition, we propose a word-level adversarial attack called Improved Genetic Algorithm (IGA), which achieves a higher attack success rate with a lower word substitution rate, as compared with the first genetic based attack algorithm proposed in 2018 (Alzantot et al., 2018). At the same time, IGA maintains the transferability of adversarial examples as GA does.
References
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. Empirical Methods in Natural Language Processing (EMNLP).

Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. International Conference on Machine Learning (ICML).

Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. IEEE Symposium on Security and Privacy.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. Annual Meeting of the Association for Computational Linguistics (ACL).

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. International Conference on Learning Representations (ICLR).

Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. 2018. Countering adversarial images using input transformations. International Conference on Learning Representations (ICLR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR).

Yoon Kim. 2014. Convolutional neural networks for sentence classification. Empirical Methods in Natural Language Processing (EMNLP).

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems (NeurIPS).

Volodymyr Kuleshov, Shantanu Thakoor, Tingfung Lau, and Stefano Ermon. 2018. Adversarial examples for natural language classification problems. OpenReview submission OpenReview:r1QZ3zbAZ.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. AAAI Conference on Artificial Intelligence (AAAI).

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. International Joint Conference on Artificial Intelligence (IJCAI).

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. International Joint Conference on Artificial Intelligence (IJCAI).

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. International Conference on Learning Representations (ICLR).

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. IEEE Military Communications Conference (MILCOM).

Christopher Potts. 2011. On the negativity of negation. Proceedings of Semantics and Linguistic Theory (SALT).

Danish Pruthi, Bhuwan Dhingra, and Zachary C. Lipton. 2019. Combating adversarial misspellings with robust word recognition. Association for Computational Linguistics (ACL).

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. Association for Computational Linguistics (ACL).

Nestor Rodriguez and Sergio Rojas-Galeano. 2018. Shielding Google's language toxicity model against adversarial attacks. arXiv preprint arXiv:1801.01828.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Mądry. 2018. Adversarially robust generalization requires more data. Neural Information Processing Systems (NeurIPS).

Chuanbiao Song, Kun He, Liwei Wang, and John E. Hopcroft. 2019. Improving the generalization of adversarial training with domain adaptation. International Conference on Learning Representations (ICLR).

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. International Conference on Learning Representations (ICLR).

Eric Wong and J. Zico Kolter. 2018. Provable defenses via the convex outer adversarial polytope. International Conference on Machine Learning (ICML).

Haichao Zhang and Jianyu Wang. 2019. Defense against adversarial attacks using feature scattering-based adversarial training. Neural Information Processing Systems (NeurIPS).

Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2019. Generating textual adversarial examples for deep learning models: A survey. arXiv preprint arXiv:1901.06796.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Neural Information Processing Systems (NeurIPS).

Appendices
A.1 Details of IGA
Here we introduce our Improved Genetic Algorithm (IGA) in detail and show how IGA differs from the first proposed genetic attack method, GA (Alzantot et al., 2018). Regarding a text as a chromosome, there are two operators in IGA:

• Crossover(a, b): For two texts $a = \{w^a_1, \ldots, w^a_{i-1}, w^a_i, w^a_{i+1}, \ldots, w^a_n\}$ and $b = \{w^b_1, \ldots, w^b_{i-1}, w^b_i, w^b_{i+1}, \ldots, w^b_n\}$, randomly choose a crossover point $i$ from 1 to $n$, and generate a new text $c = \{w^a_1, \ldots, w^a_{i-1}, w^a_i, w^b_{i+1}, \ldots, w^b_n\}$.

• Mutate(x, w_i): For a text $x = \{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_n\}$ and a word $w_i$, replace $w_i$ with $\hat{w}_i$, where $\hat{w}_i \in Syn(w_i, \sigma, k)$, to generate a new text $\hat{x} = \{w_1, \ldots, w_{i-1}, \hat{w}_i, w_{i+1}, \ldots, w_n\}$ that minimizes $S_m(\hat{x}, y_{true})$.

The details of IGA are described in Algorithm 2.

Algorithm 2 The Improved Genetic Algorithm
Input: $x$: input text, $y_{true}$: true label for $x$, $M$: maximum number of iterations
Output: $x_{adv}$: adversarial example
  for each word $w_i \in x$ do
    $P^0_i \leftarrow Mutate(x, w_i)$
  end for
  for $g = 1 \to M$ do
    $x_{adv} = \arg\min_{x_i \in P^{g-1}} S_m(x_i, y_{true})$
    if $f(x_{adv}) \neq y_{true}$ then
      return $x_{adv}$
    end if
    $P^g_1 \leftarrow x_{adv}$
    for $i = 2 \to |P^{g-1}|$ do
      Randomly sample $parent_1$, $parent_2$ from $P^{g-1}$
      $child = Crossover(parent_1, parent_2)$
      Randomly choose a word $w$ in $child$
      $P^g_i \leftarrow Mutate(child, w)$
    end for
  end for
  return $x_{adv}$

Compared with GA, IGA has the following differences:

• Initialization: GA initializes the first population randomly, while IGA initializes the first population by replacing each word with its optimal synonym, so our population is more diversified.

• Crossover: To better simulate reproduction and biological crossover, we randomly cut the texts of the two parents and concatenate two fragments into a new text, rather than randomly choosing the word at each position from one of the two parents.

• Mutation: Different from GA, IGA allows a word that has already been replaced to be replaced again, so that we can avoid local minima.

The selection of the next generation is similar to GA: it greedily keeps the optimal offspring and then generates the other offspring by Mutate(Crossover(·, ·)) on two randomly chosen parents. But as Mutate and Crossover are different, IGA produces very different offspring. Besides, we consider the syntax check module in GA unnecessary and time-consuming, because the synonyms can assure the syntax to some extent, so we do not adopt it, which accelerates the algorithm.
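The two operators and the loop of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the word weights `WEIGHTS`, the synonym table `SYNONYMS`, and the functions `score` (standing in for $S_m$) and `predict` (standing in for $f$) are hypothetical toy stand-ins. Parents are sampled uniformly, and synonyms are looked up for the word currently at the chosen position; the text above does not fix either choice.

```python
import random
from typing import Dict, List

Text = List[str]

# Hypothetical toy data: per-word sentiment weights and a synonym table.
WEIGHTS: Dict[str, float] = {"good": 1.0, "great": 1.0, "fine": 0.3,
                             "decent": -0.4, "grand": 0.2, "big": -0.5}
SYNONYMS: Dict[str, List[str]] = {"good": ["fine", "decent"],
                                  "great": ["grand", "big"]}


def predict(x: Text) -> int:
    """Toy classifier f: label 1 if the summed word weights are positive."""
    return 1 if sum(WEIGHTS.get(w, 0.0) for w in x) > 0 else 0


def score(x: Text, y_true: int) -> float:
    """Toy stand-in for S_m: larger means more evidence for y_true."""
    total = sum(WEIGHTS.get(w, 0.0) for w in x)
    return total if y_true == 1 else -total


def crossover(a: Text, b: Text, rng: random.Random) -> Text:
    """Single-point crossover: the first i words of a, then the rest of b."""
    i = rng.randint(1, len(a))  # cut point in 1..n, both ends inclusive
    return a[:i] + b[i:]


def mutate(x: Text, pos: int, y_true: int) -> Text:
    """Replace x[pos] with the synonym that minimizes score(x_hat, y_true)."""
    candidates = [x[:pos] + [s] + x[pos + 1:] for s in SYNONYMS.get(x[pos], [])]
    if not candidates:  # no synonym known: leave the text unchanged
        return list(x)
    return min(candidates, key=lambda t: score(t, y_true))


def iga(x: Text, y_true: int, max_iters: int, seed: int = 0) -> Text:
    """Sketch of Algorithm 2 with uniform parent sampling (an assumption)."""
    rng = random.Random(seed)
    # Initial population: one text per position, each with its best synonym.
    population = [mutate(x, i, y_true) for i in range(len(x))]
    x_adv = list(x)
    for _ in range(max_iters):
        x_adv = min(population, key=lambda t: score(t, y_true))
        if predict(x_adv) != y_true:
            return x_adv  # the toy classifier is fooled
        next_population = [x_adv]  # greedily keep the best text
        for _ in range(1, len(population)):
            parent1, parent2 = rng.choice(population), rng.choice(population)
            child = crossover(parent1, parent2, rng)
            pos = rng.randrange(len(child))
            next_population.append(mutate(child, pos, y_true))
        population = next_population
    return x_adv


if __name__ == "__main__":
    print(iga(["a", "good", "and", "great", "film"], y_true=1, max_iters=20))
```

Because the best text of each generation is carried over unchanged, the best score in the population never increases from one generation to the next. In this toy table, a substituted word can be replaced again only if the table lists synonyms for it.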
A.2 Hyper-parameters study on SEM
In this subsection, we explore how the hyper-parameters $\epsilon$ and $k$ of SEM influence its efficacy, using three models on IMDB with or without adversarial attacks. We try different $\epsilon$ ranging from […] to […] and $k$ ranging from […] to […]. The results are illustrated in Figures 3 and 4, respectively.

On benign data, as shown in Figures 3(a) and 4(a), the classification accuracy of the models decreases slightly as $\epsilon$ or $k$ increases. This is because a larger $\epsilon$ or $k$ means that fewer words are used to train the model, which could degrade the efficacy of the models. Nevertheless, the classification accuracy does not decrease much, as SEM maintains the semantic invariance of the original text after encoding.

We then show the defense efficacy of SEM on the three models when changing the value of $\epsilon$, as shown in Figure 3(b)-(d). When $\epsilon = 0$, SEM has no effect, and the accuracy is the lowest under all attacks. As $\epsilon$ increases, SEM starts to defend against the attacks: the accuracy increases rapidly and reaches its peak when $\epsilon$ = 0.[…]. The accuracy then decays slowly if we continue to increase $\epsilon$. Thus, we choose $\epsilon$ = 0.[…] as a good trade-off between the accuracy on benign examples and on adversarial examples.

Finally, we show the defense efficacy of SEM on the three models when changing the value of $k$, as shown in Figure 4(b)-(d). When $k = 5$, some synonyms cannot be encoded into the same code, although we see that SEM indeed has some impact when compared with adversarial training. As $k$ increases, more synonyms can be encoded into the same code and SEM can defend against the attacks effectively: the accuracy increases rapidly and reaches its peak when $k = 10$. The accuracy then decays slowly and becomes stable if we continue to increase $k$. Thus, we choose $k = 10$ as a good trade-off between the accuracy on benign examples and on adversarial examples.

In conclusion, a small $\epsilon$ or $k$ means that some synonyms cannot be encoded correctly, which leads to poor defense efficacy, while too large an $\epsilon$ or $k$ might let SEM encode together some words that are not synonyms, which harms the efficacy. In our experiments, we choose $\epsilon$ = 0.[…] and $k = 10$.

A.3 Adversarial Examples Generated by GA and IGA
To show the generated adversarial examples, we randomly pick some benign examples from IMDB and generate adversarial examples with GA and IGA respectively on several models. The examples are shown in Table 5 to Table 7. We see that IGA substitutes fewer words than GA on these models under normal training.

Figure 3: The classification accuracy for various $\epsilon$ ranging from […] to […] for three models on IMDB. Panels: (a) Models under no attack, (b) Word-CNN under attacks, (c) LSTM under attacks, (d) Bi-LSTM under attacks.

Figure 4: The classification accuracy for various $k$ ranging from […] to […] for three models on IMDB. Panels: (a) Models under no attack, (b) Word-CNN under attacks, (c) LSTM under attacks, (d) Bi-LSTM under attacks.

Table 5: The adversarial examples generated by GA and IGA on IMDB using Word-CNN model.
| | Confidence (%) | Prediction | Text |
|---|---|---|---|
| Original | 97.9 | 1 | I enjoyed this film which I thought was well written and acted, there was plenty of humour and a provoking storyline, a warm and enjoyable experience with an emotional ending. |
| Original | 99.7 | 0 | I am sorry but this is the worst film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| Original | 95.8 | 1 | This is a unique masterpiece made by the best director ever lived in the ussr. He knows the art of film making and can use it very well. If you find this movie, buy or copy it! |
| GA | 50.6 | 0 | I cared this film which I thought was well written and acted, there was plenty of humour and a igniting storyline, a tepid and enjoyable experience with an emotional ending. |
| GA | 92.7 | 1 | I am sorry but this is the harshest film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| GA | 59.0 | 0 | This is a sole masterpiece made by the nicest director permanently lived in the ussr. He knows the art of film making and can use it much well. If you find this movie, buy or copy it! |
| IGA | 88.3 | 0 | I enjoyed this film which I think was well written and acted, there was plenty of humour and a causing storyline, a lukewarm and enjoyable experience with an emotional ending. |
| IGA | 70.8 | 1 | I am sorry but this is the hardest film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| IGA | 54.8 | 0 | This is a sole masterpiece made by the best director permanently lived in the ussr. He knows the art of film making and can use it very well. If you find this movie, buy or copy it! |
Table 6: The adversarial examples generated by GA and IGA on IMDB using LSTM model.
| | Confidence (%) | Prediction | Text |
|---|---|---|---|
| Original | 99.9 | 1 | I enjoyed this film which I thought was well written and acted, there was plenty of humour and a provoking storyline, a warm and enjoyable experience with an emotional ending. |
| Original | 97.2 | 0 | I am sorry but this is the worst film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| Original | 99.7 | 1 | This is a unique masterpiece made by the best director ever lived in the ussr. He knows the art of film making and can use it very well. If you find this movie, buy or copy it! |
| GA | 88.2 | 0 | I enjoyed this film which I thought was well written and proceeded, there was plenty of humorous and a igniting storyline, a tepid and enjoyable experience with an emotional terminate. |
| GA | 99.9 | 1 | I am sorry but this is the hardest film I have ever seen in my life. I cannot believe that after making the first one in the series they were able to get a budget to make another. This is the least terrifying film I have ever watched and laughed all the way through to the end. |
| GA | 68.9 | 0 | This is a unique masterpiece made by the best superintendent ever lived in the ussr. He knows the art of film making and can use it supremely alright. If you find this movie, buy or copy it! |
| IGA | 72.1 | 0 | I enjoyed this film which I thought was well written and acted, there was plenty of humour and a provoking storyline, a lukewarm and agreeable experience with an emotional ending. |
| IGA | 99.8 | 1 | I am sorry but this is the hardest film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| IGA | 86.2 | 0 | This is a sole masterpiece made by the best director ever lived in the ussr. He knows the art of film making and can use it very well. If you find this movie, buy or copy it! |
Table 7: The adversarial examples generated by GA and IGA on IMDB using Bi-LSTM model.
Confidence( %%