Natural Language Adversarial Defense through Synonym Encoding

Abstract

In natural language processing, deep learning models have recently been shown to be vulnerable to various types of adversarial perturbations, but relatively little work has been done on the defense side. In particular, few effective defense methods exist against the successful synonym substitution based attacks, which preserve the syntactic structure and semantic information of the original text while fooling deep learning models. We contribute in this direction and propose a novel adversarial defense method called Synonym Encoding Method (SEM). Specifically, SEM inserts an encoder before the input layer of the target model to map each cluster of synonyms to a unique encoding, and trains the model to eliminate possible adversarial perturbations without modifying the network architecture or adding extra data. Extensive experiments demonstrate that SEM can effectively defend against current synonym substitution based attacks and block the transferability of adversarial examples. SEM is also easy and efficient to scale to large models and big datasets.


Natural Language Adversarial Attack and Defense in Word Level

Xiaosen Wang∗, Hao Jin∗, Kun He†
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
{xiaosen,mailtojinhao,brooklet60}@hust.edu.cn

Abstract

In recent years, inspired by a large body of research on adversarial examples for computer vision, there has been growing interest in designing adversarial attacks for Natural Language Processing (NLP) tasks, followed by very few works on adversarial defenses for NLP. To our knowledge, there exists no defense method against the successful synonym substitution based attacks, which aim to satisfy all the lexical, grammatical and semantic constraints and are thus hard for humans to perceive. We contribute to fill this gap and propose a novel adversarial defense method called Synonym Encoding Method (SEM), which inserts an encoder before the input layer of the model and then trains the model to eliminate adversarial perturbations. Extensive experiments demonstrate that SEM can efficiently defend against the current best synonym substitution based adversarial attacks with little decay in accuracy on benign examples. To better evaluate SEM, we also design a strong attack method called Improved Genetic Algorithm (IGA) that adopts the genetic metaheuristic for synonym substitution based attacks. Compared with the first genetic based adversarial attack proposed in 2018, IGA can achieve a higher attack success rate with a lower word substitution rate, while maintaining the transferability of adversarial examples.

Deep Neural Networks (DNNs) have achieved great success in various machine learning tasks, such as computer vision (Krizhevsky et al., 2012; He et al., 2016) and Natural Language Processing (NLP) (Kim, 2014; Lai et al., 2015; Devlin et al., 2018). However, recent studies have discovered that DNNs are vulnerable to adversarial examples not only for computer vision tasks (Szegedy et al., 2014) but also for NLP tasks (Papernot et al., 2016), causing a serious threat to their safe application. For instance, spammers can evade a spam filtering system with adversarial examples of spam emails while preserving the intended meaning.

∗ The first two authors contributed equally. † Corresponding author.

arXiv [cs.CL], Apr

In contrast to the numerous methods proposed for adversarial attacks (Goodfellow et al., 2015; Carlini and Wagner, 2017; Athalye et al., 2018) and defenses (Goodfellow et al., 2015; Guo et al., 2018; Song et al., 2019) in computer vision, there are only a few works in the area of NLP, inspired by the works for images and emerging only in the last two years (Zhang et al., 2019). This is mainly because existing perturbation-based methods for images cannot be directly applied to texts, which are discrete by nature. Furthermore, if we want the perturbation to be barely perceptible to humans, it should satisfy the lexical, grammatical and semantic constraints in texts, making it even harder to generate text adversarial examples.

Current attacks in NLP fall into four categories, namely modifying the characters of a word (Liang et al., 2017; Ebrahimi et al., 2017), adding or removing words (Liang et al., 2017), replacing words arbitrarily (Papernot et al., 2016), or substituting words with synonyms (Alzantot et al., 2018; Ren et al., 2019). However, the first three categories are easy to detect and defend against by spell or syntax check (Rodriguez and Rojas-Galeano, 2018; Pruthi et al., 2019). As synonym substitution aims to satisfy all the lexical, grammatical and semantic constraints, it is hard to detect by automatic spell or syntax check as well as by human investigation. To our knowledge, there is currently no defense method specifically designed against the synonym substitution based attacks.

In this work, we postulate that the model generalization leads to the existence of adversarial examples: a generalization that is not strong enough causes the problem that there usually exist some neighbors $x'$ of a benign example $x$ in the manifold with a different classification. Based on this hypothesis, we propose a novel defense mechanism called Synonym Encoding Method (SEM) that encodes all the synonyms to a unique code so as to force all the neighbors of $x$ to have the same label as $x$. Specifically, we first cluster the synonyms according to the Euclidean distance in the embedding space to construct the encoder. Then we insert the encoder before the input layer of the deep model without modifying its architecture, and train the model again to defend against adversarial attacks. In this way, we can defend against the synonym substitution based adversarial attacks effectively in the context of text classification.

Extensive experiments on three popular datasets demonstrate that the proposed SEM can effectively defend against adversarial attacks, while maintaining efficiency and achieving roughly the same accuracy on benign data as the original model does. To our knowledge, SEM is the first proposed method that can effectively defend against the synonym substitution based adversarial attacks.

Besides, to demonstrate the efficacy of SEM, we also propose a genetic based attack method, called Improved Genetic Algorithm (IGA), which is well designed and more effective than the first proposed genetic based attack algorithm, GA (Alzantot et al., 2018). Experiments show that IGA can degrade the classification accuracy more significantly with a lower word substitution rate than GA. Meanwhile, IGA keeps the transferability of adversarial examples as GA does.

Let $\mathcal{W}$ denote the word set containing all the legal words. Let $x = \{w_1, \ldots, w_i, \ldots, w_n\}$ denote an input text, $\mathcal{C}$ the corpus that contains all the possible input texts, and $\mathcal{Y} \in \mathbb{N}^K$ the output space, where $K$ is the dimension of $\mathcal{Y}$. The classifier $f : \mathcal{C} \to \mathcal{Y}$ takes an input $x$ and predicts its label $f(x)$, and $S_m(x, y)$ denotes the confidence value for the $y$-th category at the softmax layer. Let $Syn(w, \sigma, k)$ represent the set of the first $k$ synonyms of $w$ within distance $\sigma$, namely

$$Syn(w, \sigma, k) = \{\hat{w}_1, \ldots, \hat{w}_i, \ldots, \hat{w}_k \mid \hat{w}_i \in \mathcal{W} \wedge \|w - \hat{w}_1\|_p \le \cdots \le \|w - \hat{w}_k\|_p < \sigma\}, \tag{1}$$

where $\|w - \hat{w}\|_p$ is the $p$-norm distance evaluated on the corresponding embedding vectors.

Suppose we have an ideal classifier $c : \mathcal{C} \to \mathcal{Y}$ that could always output the correct label for any input text $x$. For a subset of (train or test) texts $\mathcal{T} \subseteq \mathcal{C}$ and a small constant $\epsilon$, we could define the natural language adversarial examples as follows:

$$\mathcal{A} = \{x_{adv} \in \mathcal{C} \mid \exists x \in \mathcal{T},\ f(x_{adv}) \ne c(x_{adv}) = c(x) = f(x) \wedge d(x - x_{adv}) < \epsilon\}, \tag{2}$$

where $d(x - x_{adv})$ is a distance metric to evaluate the dissimilarity between the benign example $x = \{w_1, \ldots, w_i, \ldots, w_n\}$ and the adversarial example $x_{adv} = \{w'_1, \ldots, w'_i, \ldots, w'_n\}$. $d(\cdot)$ is usually defined as the $p$-norm distance: $d(x - x_{adv}) = \|x - x_{adv}\|_p = \left(\sum_i \|w_i - w'_i\|^p\right)^{1/p}$.

Here we provide a brief overview of three popular synonym substitution based adversarial attack methods.

Greedy Search Algorithm (GSA).

Kuleshov et al. (2018) propose a greedy search algorithm to substitute words with their synonyms so as to maintain the semantic and syntactic similarity. GSA first constructs a synonym set $W_s$ for an input text $x = \{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_n\}$: $W_s = \{Syn(w_i, \sigma, k) \mid 1 \le i \le n\}$. Initially, let $x_{adv} = x$. Then at each stage, for $x_{adv} = \{w'_1, \ldots, w'_{i-1}, w'_i, w'_{i+1}, \ldots, w'_n\}$, GSA finds a word $\hat{w}'_i \in W_s$ that satisfies the syntactic constraint and minimizes the confidence value $S_m(\hat{x}, y_{true})$, where $\hat{x} = \{w'_1, \ldots, w'_{i-1}, \hat{w}'_i, w'_{i+1}, \ldots, w'_n\}$, and updates $x_{adv} = \hat{x}$. This process iterates until $x_{adv}$ becomes an adversarial example or the word substitution rate reaches a threshold.

Genetic Algorithm (GA).

Alzantot et al. (2018) propose a population-based algorithm to replace words with their synonyms so as to generate semantically and syntactically similar adversarial examples. There are three operators in GA:

• $Mutate(x)$: Randomly choose a word $w_i$ in text $x$ that has not been updated and substitute $w_i$ with $\hat{w}_i$, one of its synonyms in $Syn(w_i, \sigma, k)$ that does not violate the syntax constraint imposed by the "Google one billion words language model" (Chelba et al., 2013) and minimizes $S_m(\hat{x}, y_{true})$, where $\hat{x} = \{w_1, \ldots, w_{i-1}, \hat{w}_i, w_{i+1}, \ldots, w_n\}$ and $S_m(\hat{x}, y_{true}) < S_m(x, y_{true})$;
• $Sample(\mathcal{P})$: Randomly sample a text $x_i$ from the population $\mathcal{P} = \{x_1, \ldots, x_i, \ldots, x_m\}$ with a probability proportional to $1 - S_m(x_i, y_{true})$;
• $Crossover(a, b)$: Construct a new text $c = \{w^c_1, \ldots, w^c_i, \ldots, w^c_n\}$, where $w^c_i$ is randomly chosen from $\{w^a_i, w^b_i\}$ based on the input texts $a = \{w^a_1, \ldots, w^a_i, \ldots, w^a_n\}$ and $b = \{w^b_1, \ldots, w^b_i, \ldots, w^b_n\}$.

For a text $x$, GA first generates an initial population $\mathcal{P}^0$ of size $m$:

$$\mathcal{P}^0 = \{Mutate(x), \ldots, Mutate(x)\}.$$

Then at each iteration, GA generates the next generation of the population through the crossover and mutation operators:

$$x^{i+1}_{adv} = \arg\min_{\tilde{x} \in \mathcal{P}^i} S_m(\tilde{x}, y_{true}),$$
$$c^{i+1}_k = Crossover(Sample(\mathcal{P}^i), Sample(\mathcal{P}^i)),$$
$$\mathcal{P}^{i+1} = \{x^{i+1}_{adv}, Mutate(c^{i+1}_1), Mutate(c^{i+1}_2), \ldots, Mutate(c^{i+1}_{m-1})\}.$$

GA terminates when it finds an adversarial example or reaches the maximum number of iterations.
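The Sample and Crossover operators described above are simple enough to sketch directly. The following is a minimal illustration, not the implementation of Alzantot et al. (2018): the function names, the list-of-words text representation and the `confidences` argument (standing in for the values $S_m(x_i, y_{true})$) are assumptions made for this example, and Mutate is omitted because it needs a classifier and a language model.

```python
import random


def crossover(a, b, rng=random):
    """Build a child text: each position takes the word of parent a or b at random."""
    if len(a) != len(b):
        raise ValueError("parents must have the same number of words")
    return [rng.choice((wa, wb)) for wa, wb in zip(a, b)]


def sample(population, confidences, rng=random):
    """Draw one text with probability proportional to 1 - S_m(x_i, y_true).

    confidences[i] is assumed to hold S_m(x_i, y_true) for population[i].
    At least one confidence must be below 1, otherwise all weights are zero.
    """
    weights = [1.0 - c for c in confidences]
    return rng.choices(population, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    parent_a = ["a", "good", "film"]
    parent_b = ["a", "great", "movie"]
    print(crossover(parent_a, parent_b, rng))
    print(sample([parent_a, parent_b], [0.9, 0.4], rng))
```

Texts on which the classifier is less confident about the true label receive a larger weight, so they are more likely to be chosen as parents for the next generation.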

Probability Weighted Word Saliency (PWWS).

Ren et al. (2019) propose a new synonym substitution based attack method called Probability Weighted Word Saliency (PWWS), which considers the word saliency as well as the classification confidence. Specifically, given a text $x = \{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_n\}$, PWWS first calculates the saliency of each word $S(x, w_i)$:

$$S(x, w_i) = S_m(x, y_{true}) - S_m(\bar{x}_i, y_{true}),$$

where $\bar{x}_i = \{w_1, \ldots, w_{i-1}, unk, w_{i+1}, \ldots, w_n\}$ and "unk" means the word is removed from the text. Then PWWS calculates the maximum possible change in the classification confidence resulting from substituting word $w_i$ with one of its synonyms:

$$\Delta S^*_{m_i}(x) = \max_{\hat{w}_i \in Syn(w_i, \sigma, k)} \left[S_m(x, y_{true}) - S_m(\hat{x}_i, y_{true})\right],$$

where $\hat{x}_i = \{w_1, \ldots, w_{i-1}, \hat{w}_i, w_{i+1}, \ldots, w_n\}$. Then, PWWS sequentially checks the words in descending order of $\phi(S(x, w_i))_i \cdot \Delta S^*_{m_i}(x)$, where $\phi(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}}$, and substitutes the current word $w_i$ with its optimal synonym $w^*_i$:

$$w^*_i = \arg\max_{\hat{w}_i \in Syn(w_i, \sigma, k)} \left[S_m(x, y_{true}) - S_m(\hat{x}_i, y_{true})\right].$$

PWWS terminates when it finds an adversarial example $x_{adv}$ or it has replaced all the words in $x$.

There are only a few works on text adversarial defenses.

• At the character level, Pruthi et al. (2019) propose to place a word recognition model in front of the downstream classifier to defend against character-level adversarial attacks by combating adversarial spelling mistakes.
• At the word level, for defenses against synonym substitution based attacks, only Alzantot et al. (2018) and Ren et al. (2019) incorporate the adversarial training strategy proposed in the image domain (Goodfellow et al., 2015) with their respective text attack methods, and demonstrate that adversarial training can promote the model's robustness. However, there is no defense method specifically designed to defend against the synonym substitution based adversarial attacks.
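Returning briefly to the PWWS attack summarized earlier in this section, its word ordering rule (softmax-weighted saliency times the best achievable confidence drop) can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names are hypothetical, and the saliency values $S(x, w_i)$ and the drops $\Delta S^*_{m_i}(x)$ are passed in as precomputed lists rather than obtained from a real classifier.

```python
import math


def softmax(z):
    """phi(z)_i = exp(z_i) / sum_k exp(z_k), shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def pwws_order(saliency, max_delta):
    """Return word positions sorted by phi(saliency)_i * max_delta_i, largest first.

    saliency[i]  is assumed to hold S(x, w_i).
    max_delta[i] is assumed to hold the largest confidence drop over the synonyms of w_i.
    """
    phi = softmax(saliency)
    score = [p * d for p, d in zip(phi, max_delta)]
    return sorted(range(len(score)), key=lambda i: score[i], reverse=True)


if __name__ == "__main__":
    # Toy values for a three-word text (hypothetical, not from the paper).
    print(pwws_order([0.0, 2.0, 1.0], [0.5, 0.1, 0.4]))
```

In the toy call, the second word has the highest saliency but its best synonym barely changes the confidence, so the third word is visited first.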

We first introduce our motivation, then present our text defense method, Synonym Encoding Method (SEM).

Let $\mathcal{X}$ denote the input space and $V_\epsilon(x)$ denote the $\epsilon$-neighborhood of a data point $x \in \mathcal{X}$, where $V_\epsilon(x) = \{x' \in \mathcal{X} \mid \|x' - x\| < \epsilon\}$. As illustrated in Figure 1 (a), we postulate that the generalization of the model leads to the existence of adversarial examples. More generally, given a data point $x \in \mathcal{X}$, $\exists x' \in V_\epsilon(x)$, $f(x') \ne y'_{true}$, where $x'$ is an adversarial example of $x$.

Ideally, to defend against the adversarial attack, we need to train a classifier $f$ that not only guarantees $f(x) = y_{true}$, but also assures $\forall x' \in V_\epsilon(x)$, $f(x') = y'_{true}$. Thus, the most effective way is to add more labeled data to improve the adversarial robustness (Schmidt et al., 2018). As illustrated in Figure 1 (b), if we had infinite labeled data, we could train a model $f$ such that $\forall x' \in V_\epsilon(x)$, $f(x') = y'_{true}$ with high probability, so that the model $f$ is robust enough to adversarial examples. Practically, however, labeling data is very expensive and it is impossible to have infinite labeled data.

Figure 1: The neighborhood of a data point $x$ in the input space. (a) Traditional training: there exist some data points $x'$ that the model has never seen before and that yield a wrong classification; in other words, such a data point $x'$ is an adversarial example. (b) Adding infinite labeled data: this is an ideal case in which the model has seen all the data points and can resist adversarial examples. (c) Sharing label: all the neighbors share the same label with $x$. (d) Mapping neighborhood data points: mapping all neighbors to the center $x$ so as to eliminate adversarial examples.

Because it is impossible to have infinite labeled data to train a robust model, as illustrated in Figure 1 (c), Wong and Kolter (2018) propose to construct a convex outer bound and guarantee that all data points in this bound share the same label. The goal is to train a model $f$ such that $\forall x' \in V_\epsilon(x)$, $f(x') = f(x) = y_{true}$. Specifically, they propose a linear-programming (LP) based upper bound on the robust loss by adopting a linear relaxation of the ReLU activation, and minimize this upper bound during training. Then they bound the LP optimal value and calculate the elementwise bounds on the activation functions based on a backward pass through the network. Although their method does not need any extra data, it is hard to scale to realistically sized networks due to the high computational complexity.

In this work, as illustrated in Figure 1 (d), we propose a novel way to find a mapping $m : \mathcal{X} \to \mathcal{X}$ where $\forall x' \in V_\epsilon(x)$, $m(x') = x$. In this way, we force the classification to be smoother, and we do not need any extra data to train the model or to modify the architecture of the model. All we need to do is to insert the mapping before the input layer and train the model on the original training set. Now the problem turns into how to locate the neighbors of a data point $x$. For image tasks, it is hard to find all images in the neighborhood of $x$ in the input space, and there could be an infinite number of neighbors. For NLP tasks, however, utilizing the property that words in sentences are discrete tokens, we can easily find almost all neighbors of an input text. Based on this insight, we propose a new method called Synonym Encoding to locate the neighbors of an input text $x$.

We assume that the closer the meaning of two sentences is, the closer their distance is in the input space, and we can suppose that the neighbors of $x$ are its synonymous sentences.
To find the synonymous sentences, we can substitute words in the sentence with their synonyms. In this way, to construct the mapping $m$, all we need to do is to cluster the synonyms in the embedding space and allocate a unique token for each cluster. The details of synonym encoding are in Algorithm 1. Note that in our experiment, we implement the synonym encoding on GloVe vectors after counter-fitting (Mrkšić et al., 2016), which injects antonymy and synonymy constraints into vector space representations.

The current synonym substitution based text adversarial attacks have a constraint that they only substitute words at the same position once (Alzantot et al., 2018; Ren et al., 2019) or replace words with the first $k$ synonyms of the word in the original input $x$ (Kuleshov et al., 2018). This constraint can lead to local minima for adversarial examples, and it is hard to choose a suitable $k$ as different words may have different numbers of synonyms.

To address this issue, we propose an Improved Genetic Algorithm (IGA), which allows substituting words in the same position more than once based on the current text $x'$. In this way, IGA can traverse all synonyms of a word no matter what value $k$ is. Meanwhile, we can avoid local minima to some extent, as we allow the word in the current position to be substituted back by the original word. In order to guarantee that the substituted word is still a synonym of the original word, each word in the same position can be replaced at most $\lambda$ times, and in our experiment we set $\lambda = 5$. In contrast to the first genetic based text attack algorithm of Alzantot et al. (2018), we change the structure of the algorithm, including the operators for crossover and mutation. For more details of IGA, see Appendix A.1.

Algorithm 1 Synonym Encoding Algorithm

Input: $W$: dictionary of words; $n$: size of $W$; $\sigma$: distance for synonyms; $k$: number of synonyms for each word
Output: $E$: encoding result

$E = \{w_1 : \text{NONE}, \ldots, w_n : \text{NONE}\}$
for each word $w_i \in W$ do
  if $E[w_i] = \text{NONE}$ then
    if $\exists \hat{w}_i \in Syn(w_i, \sigma, k)$, $E[\hat{w}_i] \ne \text{NONE}$ then
      $w^*_i \leftarrow$ the closest $\hat{w}_i \in Syn(w_i, \sigma, k)$
      $E[w_i] = E[w^*_i]$
    else
      $E[w_i] = w_i$
    end if
    for each word $\hat{w}_i$ in $Syn(w_i, \sigma, k)$ do
      if $E[\hat{w}_i] = \text{NONE}$ then
        $E[\hat{w}_i] = E[w_i]$
      end if
    end for
  end if
end for
return $E$
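To make Algorithm 1 concrete, the following is a minimal Python sketch of the synonym encoding step and of how the resulting mapping could be applied to a text before it reaches the classifier. It is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the toy two-dimensional embeddings stand in for counter-fitted GloVe vectors, the Euclidean distance is used for $Syn(w, \sigma, k)$, and "the closest" synonym is read as the closest synonym that already has a code.

```python
import math


def syn(i, emb, sigma, k):
    """Indices of the (at most) k nearest words to word i with distance < sigma, nearest first."""
    dists = sorted((math.dist(emb[i], emb[j]), j) for j in range(len(emb)) if j != i)
    return [j for d, j in dists[:k] if d < sigma]


def synonym_encoding(words, emb, sigma, k):
    """Follow Algorithm 1: give every word a code shared with its synonym cluster."""
    E = {w: None for w in words}
    for i, w in enumerate(words):
        if E[w] is not None:
            continue
        neighbours = syn(i, emb, sigma, k)
        encoded = [j for j in neighbours if E[words[j]] is not None]
        if encoded:
            # Reuse the code of the closest synonym that is already encoded (assumed reading).
            E[w] = E[words[encoded[0]]]
        else:
            E[w] = w
        for j in neighbours:
            if E[words[j]] is None:
                E[words[j]] = E[w]
    return E


def encode_text(text, E):
    """Apply the mapping to a tokenized text; unknown words are left unchanged."""
    return [E.get(w, w) for w in text]


if __name__ == "__main__":
    # Toy vocabulary and embeddings (hypothetical values for illustration only).
    words = ["good", "great", "fine", "bad"]
    emb = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
    E = synonym_encoding(words, emb, sigma=0.15, k=10)
    print(E)
    print(encode_text(["a", "great", "film"], E))
```

In this toy run "good", "great" and "fine" end up sharing one code even though "fine" is not within $\sigma$ of "good": "fine" inherits the code through its already encoded neighbor "great". This chaining depends on the order in which words are visited.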

We evaluate the efficacy of SEM with four attacks, GSA (Kuleshov et al., 2018), GA (Alzantot et al., 2018), PWWS (Ren et al., 2019) and our IGA, on three popular datasets involving three different neural network classification models. The results demonstrate that SEM can significantly improve the robustness of neural networks and that IGA can achieve better attack performance than existing attacks.

We first provide an overview of the datasets, classification models and baselines used in the experiments.

Datasets.

In order to evaluate the efficacy of SEM, we choose three popular datasets: IMDB, AG's News, and Yahoo! Answers. IMDB (Potts, 2011) is a large dataset for binary sentiment classification, containing , highly polarized movie reviews for training and , for testing. AG's News (Zhang et al., 2015) consists of news articles pertaining to four classes: World, Sports, Business and Sci/Tech. Each class contains , training examples and , testing examples. Yahoo! Answers (Zhang et al., 2015) is a topic classification dataset from the "Yahoo! Answers Comprehensive Questions and Answers" version 1.0 dataset with 10 categories, such as Society & Culture, Science & Mathematics, etc. Each class contains 140,000 training samples and 5,000 testing samples.

Models.

To better evaluate our method, we adopt several state-of-the-art models for text classification, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The embedding dimension for all models is 300 (Mikolov et al., 2013). We replicate the CNN architecture from Kim (2014), which contains three convolutional layers with filter sizes of 3, 4, and 5 respectively, a max-pooling layer and a fully-connected layer. LSTM consists of three LSTM layers where each layer has LSTM units and a fully-connected layer (Liu et al., 2016). Bi-LSTM contains a bi-directional LSTM layer whose forward and reverse directions have LSTM units respectively and a fully-connected layer.

Baselines.

We take the method of adversarial training (Goodfellow et al., 2015) as our baseline. However, due to the low efficiency of text adversarial attacks, we cannot implement adversarial training as it is done in the image domain. In the experiments, we adopt PWWS, which is quicker than GA and IGA, to generate adversarial examples of the training set, and re-train the model incorporating the adversarial examples with the training data.

To evaluate the efficacy of the SEM method, we randomly sample correctly classified examples on different models from each dataset and use the above attack methods to generate adversarial examples with or without defenses. The more effective the defense method is, the less the classification accuracy of the model drops. Table 1 shows the efficacy of various attack and defense methods.

For each network model, we look at each row to find the best defense result under the setting of no attack, or GSA, PWWS, GA, and IGA attacks:

• Under the setting of no attack, adversarial training (AT) could improve the classification accuracy of the models on all datasets, as adversarial training is also a way to augment the training data. Our defense method SEM reaches an accuracy very close to normal training (NT), which is a common phenomenon in the image domain (Zhang and Wang, 2019; Song et al., 2019).
• Under the four attacks, however, the classification accuracy with normal training (NT) and adversarial training (AT) drops significantly. For normal training (NT), the accuracy degrades more than , and on the three datasets respectively. Adversarial training (AT) cannot defend against these attacks effectively, especially for PWWS and IGA on IMDB and Yahoo! Answers, where AT only improves the accuracy a little (smaller than ). By contrast, SEM can remarkably improve the robustness of the deep models for all the four attacks.

Table 1: The classification accuracy (%) of various models on the datasets, with and without defenses, under adversarial attacks. For each model (Word-CNN, LSTM, or Bi-LSTM), if we look at each row, the highest classification accuracy for various defense methods is highlighted in bold to indicate the best defense efficacy; if we look at each column, the lowest classification accuracy under various adversarial attacks is underlined to indicate the best attack efficacy. NT: Normal Training, AT: Adversarial Training.

Dataset  Attack  Word-CNN (NT AT SEM)  LSTM (NT AT SEM)  Bi-LSTM (NT AT SEM)
IMDB  No Attack  88.7
IMDB  PWWS  4.4  5.3
IMDB  GA  7.1  10.7
IMDB  IGA  0.9  2.7
AG's News  No Attack  91.7
AG's News  PWWS  30.7  41.5
AG's News  GA  24.1  40.6
AG's News  IGA  21.5  35.5
Yahoo! Answers  No Attack  68.4
Yahoo! Answers  PWWS  10.3  12.5
Yahoo! Answers  GA  13.7  16.6
Yahoo! Answers  IGA  8.9  10.0

In the image domain, the transferability of an adversarial attack refers to its ability to decrease the accuracy of models using adversarial examples generated based on other models (Szegedy et al., 2014; Goodfellow et al., 2015). Papernot et al. (2016) find that adversarial examples in NLP also exhibit good transferability. Therefore, a good defense method should not only defend against adversarial attacks but also resist the transferability of adversarial examples.

To evaluate the ability to prevent the transferability of adversarial examples, we generate adversarial examples on each model under normal training, and test them on other models with or without defense on Yahoo! Answers. The results are shown in Table 2. On almost all models, with adversarial examples generated by other models, SEM yields the highest classification accuracy.

For text attacks, we compare the proposed IGA with GA from various aspects, including attack efficacy, word substitution rate, example generation efficiency, transferability and human evaluation.

Attack Efficacy.

As shown in Table 1, looking at each column, we see that under normal training (NT) and adversarial training (AT), IGA always achieves the lowest classification accuracy, which corresponds to the highest attack success rate, on all models and datasets among the four attacks. Under the third column of SEM defense, though IGA may not be the best among all attacks, IGA always outperforms GA.

Table 2: The classification accuracy (%) of various models for adversarial examples generated on other models on Yahoo! Answers for evaluating the transferability. * indicates that the adversarial examples are generated based on this model.

Attack  Word-CNN (NT AT SEM)  LSTM (NT AT SEM)  Bi-LSTM (NT AT SEM)
GSA  19.6*  52.7
PWWS  10.3*  54.4
IGA  8.9*  53.7

GSA  47.2  52.7
PWWS  43.7  54.7
GA  41.0  48.5
IGA  47.8  53.0

GSA  43.7  53.4
PWWS  41.7  48.5
IGA  44.8  50.6

Word Substitution Rate.

Besides, as depicted in Table 3, IGA can yield a lower word substitution rate than GA on most models. Note that for SEM, GA can yield a lower word substitution rate, because GA may not replace a word when most words cannot bring any benefit on the first replacement. This indicates that GA stops at a local minimum while IGA continues to substitute words and reaches a lower classification accuracy, as demonstrated in Table 1.

Generating Efficiency.

Moreover, IGA is four times faster than GA overall because it needs fewer iterations to generate adversarial examples and does not have the syntax check module. In our experiments, it usually takes - minutes for IGA to generate an example while GA needs - minutes on average.

Transferability.

As shown in Table 2, the adversarial examples generated by IGA maintain roughly the same transferability as those generated by GA. For instance, if we generate adversarial examples on Word-CNN (column 2, NT), GA can achieve better transferability on LSTM with NT (column 5) while IGA can achieve better transferability on LSTM with AT and SEM (columns 6 and 7).

Human Evaluation.

To further verify that the perturbations in the adversarial examples generated by IGA are hard for humans to perceive, we also perform a human evaluation on IMDB with 35 volunteers. We first randomly choose benign examples that can be classified correctly and generate adversarial examples by GA and IGA on the three models so that we have a total of examples. Then we randomly split them into groups where each group contains examples. We ask every five volunteers to classify one group independently. The accuracy of human evaluation on benign examples is .

As shown in Table 4, the classification accuracy of humans on adversarial examples generated by IGA is slightly higher than on those generated by GA, and is slightly closer to the accuracy of humans on benign examples, which means the adversarial examples generated by IGA are more realistic to humans. This is counter-intuitive, as we do not adopt the syntax check module as GA does. However, IGA chooses synonyms within a small embedding distance and tries to avoid local minima to achieve a lower word substitution rate, which leads to a slightly better human evaluation for IGA.

Summary.

IGA outperforms GA in all the aspects we compare. IGA achieves the highest attack success rate compared with other synonym substitution based adversarial attacks and yields a lower word substitution rate than GA. Besides, the adversarial examples generated by IGA maintain the same transferability as those of GA, and their perturbations are slightly harder for humans to perceive. To give an intuitive experience, we list some adversarial examples generated by GA and IGA in Appendix A.3.

Figure 2: An illustration of different orders to traverse the synonyms at the second line of Algorithm 1, shown in the word embedding space. (a) First traverse words on the left, then words on the right and finally words in the middle. The synonyms are encoded into two different codes (left and right). (b) First traverse words on the left, then words in the middle and finally words on the right. All the synonyms are encoded into a unique code of the left. (c) First traverse words on the right, then words in the middle and finally words on the left. All the synonyms are encoded into a unique code of the right.

Table 3: The word substitution rate (%) for GA and IGA on different models.

Dataset  Attack  Word-CNN (NT AT SEM)  LSTM (NT AT SEM)  Bi-LSTM (NT AT SEM)
IMDB  GA  9.3  9.3
IMDB  IGA
Yahoo! Answers  GA  12.4  9.5  4.7  12.5  15.8  8.1  13.9  15.3
Yahoo! Answers  IGA

Table 4: Classification accuracy (%) on adversarial examples by human evaluation.

Attack  Word-CNN  LSTM  Bi-LSTM
GA  88.9  88.0  89.0
IGA

In this subsection, we discuss some sensitive issues on SEM, including the order in which the words are traversed and the impact of the hyper-parameters.

As shown in Figure 2, the order in which the words are traversed at the second line of Algorithm 1 can actually influence the final synonym encoding code for a word, and it can even lead to different codes for the same synonym set. However, the aim of synonym encoding is to find an encoder to defend against adversarial examples rather than to find an exact and unique code for each synonym set. Note that for a text $x = \{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_n\}$, if we just replace an arbitrary word $w_i$ with one of its synonyms $\hat{w}_i$ to obtain a new text $\hat{x} = \{w_1, \ldots, w_{i-1}, \hat{w}_i, w_{i+1}, \ldots, w_n\}$, we usually have $f(x) = f(\hat{x})$. Therefore, different codes for the same synonym set hardly influence the efficacy of SEM, and this randomness might also make the model harder to attack.

Besides, we further explore how the hyper-parameters $\epsilon$ and $k$ in Algorithm 1 influence the efficacy of SEM, as shown in Appendix A.2. According to the analysis, it is important to find suitable $\epsilon$ and $k$ to achieve a good trade-off between the accuracy on benign examples and on adversarial examples, and in our experiments we set $\epsilon = 0.$ and $k = 10$.

In this work, we propose the first word-level adversarial defense method, called Synonym Encoding Method (SEM), for text classification. SEM encodes the synonyms of each word to defend against synonym substitution based adversarial attacks, which are currently the best text attack methods. Extensive experiments show that SEM can effectively defend against adversarial attacks and degrade the transferability of adversarial examples, while maintaining the classification accuracy on benign data.

In addition, we propose a word-level adversarial attack called Improved Genetic Algorithm (IGA), which achieves a higher attack success rate with a lower word substitution rate than the first genetic based attack algorithm proposed in 2018 (Alzantot et al., 2018). At the same time, IGA maintains the transferability of adversarial examples as GA does.

References

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. Empirical Methods in Natural Language Processing (EMNLP).

Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. International Conference on Machine Learning (ICML).

Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. IEEE Symposium on Security and Privacy.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. Hotflip: White-box adversarial examples for text classification. Annual Meeting of the Association for Computational Linguistics (ACL).

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. International Conference on Learning Representations (ICLR).

Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. 2018. Countering adversarial images using input transformations. International Conference on Learning Representations (ICLR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR).

Yoon Kim. 2014. Convolutional neural networks for sentence classification. Empirical Methods in Natural Language Processing (EMNLP).

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. Neural Information Processing Systems (NeurIPS).

Volodymyr Kuleshov, Shantanu Thakoor, Tingfung Lau, and Stefano Ermon. 2018. Adversarial examples for natural language classification problems. OpenReview submission OpenReview:r1QZ3zbAZ.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. AAAI Conference on Artificial Intelligence (AAAI).

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. International Joint Conference on Artificial Intelligence (IJCAI).

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. International Joint Conference on Artificial Intelligence (IJCAI).

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. International Conference on Learning Representations (ICLR).

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. IEEE Military Communications Conference (MILCOM).

Christopher Potts. 2011. On the negativity of negation. Proceedings of Semantics and Linguistic Theory (SALT).

Danish Pruthi, Bhuwan Dhingra, and Zachary C. Lipton. 2019. Combating adversarial misspellings with robust word recognition. Association for Computational Linguistics (ACL).

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. Association for Computational Linguistics (ACL).

Nestor Rodriguez and Sergio Rojas-Galeano. 2018. Shielding google's language toxicity model against adversarial attacks. arXiv preprint arXiv:1801.01828.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Mądry. 2018. Adversarially robust generalization requires more data. Neural Information Processing Systems (NeurIPS).

Chuanbiao Song, Kun He, Liwei Wang, and John E. Hopcroft. 2019. Improving the generalization of adversarial training with domain adaptation. International Conference on Learning Representations (ICLR).

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. International Conference on Learning Representations (ICLR).

Eric Wong and J. Zico Kolter. 2018. Provable defenses via the convex outer adversarial polytope. International Conference on Machine Learning (ICML).

Haichao Zhang and Jianyu Wang. 2019. Defense against adversarial attacks using feature scattering-based adversarial training. Neural Information Processing Systems (NeurIPS).

Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2019. Generating textual adversarial examples for deep learning models: A survey. arXiv preprint arXiv:1901.06796.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Neural Information Processing Systems (NeurIPS).

Appendices

A.1 Details of IGA

Here we introduce our Improved Genetic Algorithm (IGA) in detail and show how IGA differs from the first proposed genetic attack method, GA (Alzantot et al., 2018). Regarding a text as a chromosome, there are two operators in IGA:

• Crossover(a, b): For two texts $a$ and $b$, where $a = \{w^a_1, \ldots, w^a_{i-1}, w^a_i, w^a_{i+1}, \ldots, w^a_n\}$ and $b = \{w^b_1, \ldots, w^b_{i-1}, w^b_i, w^b_{i+1}, \ldots, w^b_n\}$, randomly choose a crossover point $i$ from 1 to $n$, and generate a new text $c = \{w^a_1, \ldots, w^a_{i-1}, w^a_i, w^b_{i+1}, \ldots, w^b_n\}$.

• Mutate(x, w_i): For a text $x = \{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_n\}$ and a word $w_i$, replace $w_i$ with $\hat{w}_i$, where $\hat{w}_i \in Syn(w_i, \sigma, k)$, to generate a new text $\hat{x} = \{w_1, \ldots, w_{i-1}, \hat{w}_i, w_{i+1}, \ldots, w_n\}$ that minimizes $S_m(\hat{x}, y_{true})$.

The details of IGA are described in Algorithm 2.

Algorithm 2 The Improved Genetic Algorithm

Input: $x$: input text, $y_{true}$: true label for $x$, $M$: maximum number of iterations
Output: $x_{adv}$: adversarial example
for each word $w_i \in x$ do
  $P^0_i \leftarrow Mutate(x, w_i)$
end for
for $g = 1 \to M$ do
  $x_{adv} = \arg\min_{x_i \in P^{g-1}} S_m(x_i, y_{true})$
  if $f(x_{adv}) \neq y_{true}$ then
    return $x_{adv}$
  end if
  $P^g_1 \leftarrow x_{adv}$
  for $i = 2 \to |P^{g-1}|$ do
    Randomly sample $parent_1$, $parent_2$ from $P^{g-1}$
    $child = Crossover(parent_1, parent_2)$
    Randomly choose a word $w$ in $child$
    $P^g_i \leftarrow Mutate(child, w)$
  end for
end for
return $x_{adv}$

Compared with GA, IGA has the following differences:

• Initialization: GA initializes the first population randomly, while IGA initializes the first population by replacing each word by its optimal synonym, so our population is more diversified.

• Crossover: To better simulate reproduction and biological crossover, we randomly cut the texts of the two parents and concatenate two fragments into a new text, rather than randomly choosing a word at each position from the two parents.

• Mutation: Different from GA, IGA allows replacing a word that has already been replaced, so that we can avoid local minima.

The selection of the next generation is similar to GA, which greedily chooses the optimal offspring and then generates the other offspring by Mutate(Crossover(·, ·)) on two randomly chosen parents. But as Mutate and Crossover are different, IGA produces very different offspring. Besides, we think that the syntax check module in GA is unnecessary and time-consuming, because the synonyms preserve the syntax to some extent; we therefore do not adopt the syntax check module, which accelerates the algorithm.
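The loop of Algorithm 2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the synonym table `SYNONYMS` stands in for $Syn(w_i, \sigma, k)$, `toy_score` stands in for $S_m$, `toy_predict` stands in for the classifier $f$, and all three are hypothetical toy components. Parents are sampled with replacement, and synonyms are looked up for the current word at a position; both are choices of this sketch.

```python
import random
from typing import Callable, Dict, List

Text = List[str]


def crossover(a: Text, b: Text, rng: random.Random) -> Text:
    """Single-point crossover: keep words 1..i of a, then words i+1..n of b."""
    assert len(a) == len(b), "parents must have the same length"
    i = rng.randint(1, len(a))  # crossover point, 1-indexed and inclusive
    return a[:i] + b[i:]


def mutate(x: Text, pos: int, y_true: int,
           synonyms: Dict[str, List[str]],
           score: Callable[[Text, int], float]) -> Text:
    """Replace x[pos] by the synonym that minimizes score(x_hat, y_true)."""
    candidates = [x[:pos] + [s] + x[pos + 1:] for s in synonyms.get(x[pos], [])]
    if not candidates:  # no synonym known: return an unchanged copy
        return list(x)
    return min(candidates, key=lambda t: score(t, y_true))


def iga(x: Text, y_true: int, max_iters: int,
        synonyms: Dict[str, List[str]],
        score: Callable[[Text, int], float],
        predict: Callable[[Text], int],
        rng: random.Random) -> Text:
    # Initial population: one text per position, with that word mutated.
    population = [mutate(x, i, y_true, synonyms, score) for i in range(len(x))]
    x_adv = list(x)
    for _ in range(max_iters):
        # Greedily keep the best text of the previous generation.
        x_adv = min(population, key=lambda t: score(t, y_true))
        if predict(x_adv) != y_true:
            return x_adv
        next_population = [x_adv]
        for _ in range(1, len(population)):
            # Sketch assumption: parents are drawn with replacement.
            parent1, parent2 = rng.choice(population), rng.choice(population)
            child = crossover(parent1, parent2, rng)
            pos = rng.randrange(len(child))
            next_population.append(mutate(child, pos, y_true, synonyms, score))
        population = next_population
    return x_adv


# Hypothetical toy model: label 1 if at least two "positive" words occur.
POSITIVE = {"enjoyed", "liked", "warm"}
SYNONYMS = {"enjoyed": ["liked", "cared"],
            "warm": ["tepid", "lukewarm"],
            "film": ["movie"]}


def toy_score(text: Text, y_true: int) -> float:
    p_pos = sum(w in POSITIVE for w in text) / len(text)
    return p_pos if y_true == 1 else 1.0 - p_pos


def toy_predict(text: Text) -> int:
    return 1 if sum(w in POSITIVE for w in text) >= 2 else 0


x = "i enjoyed this warm film".split()
adv = iga(x, 1, 10, SYNONYMS, toy_score, toy_predict, random.Random(0))
print(" ".join(adv))
```

In this toy setting a single substitution already flips the prediction, so the search stops in the first generation. With a real model, `score` would be the model's confidence on the true label and `synonyms` would come from the synonym sets used by the attack.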

A.2 Hyper-parameters study on SEM

In this subsection, we explore how the hyper-parameters $\epsilon$ and $k$ of SEM influence its efficacy, using three models on IMDB with or without adversarial attacks. We try different $\epsilon$ ranging from to . and $k$ ranging from to . The results are illustrated in Figures 3 and 4, respectively.

On benign data, as shown in Figures 3(a) and 4(a), the classification accuracy of the models decreases a little when $\epsilon$ or $k$ increases, because a bigger $\epsilon$ or $k$ means that fewer words are needed to train the model, which could degrade the efficacy of the models. Nevertheless, the classification accuracy does not decrease much, as SEM maintains the semantic invariance of the original text after encoding.

Then we show the defense efficacy of SEM on the three models when changing the value of $\epsilon$, as shown in Figure 3(b)-(d). When $\epsilon = 0$, SEM has no effect, and the accuracy is the lowest under all attacks. When $\epsilon$ increases, SEM starts to defend against the attacks: the accuracy increases rapidly and reaches its peak when $\epsilon = 0 .$ Then the accuracy decays slowly if we continue to increase $\epsilon$. Thus, we choose $\epsilon = 0 .$ to have a good trade-off between the accuracy on benign examples and on adversarial examples.

Finally, we show the defense efficacy of SEM on the three models when changing the value of $k$, as shown in Figure 4(b)-(d). When $k = 5$, some synonyms cannot be encoded into the same code, but SEM still has some impact when compared with adversarial training. When $k$ increases, more synonyms can be encoded into the same code and SEM can defend against the attacks effectively: the accuracy increases rapidly and reaches its peak when $k = 10$. Then the accuracy decays slowly and becomes stable if we continue to increase $k$.
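The effect of the two hyper-parameters on the synonym sets can be illustrated with a toy example. The sketch below assumes that the synonym set of a word consists of at most $k$ nearest neighbours whose embedding distance is below $\epsilon$; the 2-D embeddings in `EMB` and the thresholds used are hypothetical and unrelated to the settings of our experiments.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical 2-D word embeddings, chosen only for illustration.
EMB: Dict[str, Tuple[float, float]] = {
    "good": (0.0, 0.0),
    "nice": (0.1, 0.0),
    "fine": (0.0, 0.2),
    "great": (0.3, 0.0),
    "okay": (0.0, 0.6),
    "bad": (1.0, 1.0),
}


def syn(word: str, eps: float, k: int,
        emb: Dict[str, Tuple[float, float]] = EMB) -> List[str]:
    """Up to k nearest neighbours of word with Euclidean distance below eps."""
    origin = emb[word]
    neighbours = sorted(
        (math.dist(origin, vec), other)
        for other, vec in emb.items()
        if other != word
    )
    return [other for dist, other in neighbours if dist < eps][:k]


# eps = 0: no word is merged, so the encoder has no effect.
print(syn("good", 0.0, 10))
# Small eps or small k: some close neighbours are left out.
print(syn("good", 0.25, 10))
print(syn("good", 0.4, 2))
# Moderate eps and k: the close neighbours are all captured.
print(syn("good", 0.4, 10))
# Very large eps: unrelated words such as "bad" are merged as well.
print(syn("good", 2.0, 10))
```

With settings that are too small, true synonyms stay outside the set and keep separate codes; with settings that are too large, unrelated words are merged into the same code. This mirrors the trade-off observed in Figures 3 and 4.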
Based on Figure 4(b)-(d), we choose $k = 10$ as a good trade-off between the accuracy on benign examples and on adversarial examples.

In conclusion, a small $\epsilon$ or $k$ means that some synonyms cannot be encoded correctly, which leads to poor defense efficacy, while too large an $\epsilon$ or $k$ might let SEM encode together some words that are not synonyms, which also hurts the efficacy. In our experiments, we choose $\epsilon = 0 .$ and $k = 10$.

A.3 Adversarial Examples Generated by GA and IGA

To show the generated adversarial examples, we randomly pick some benign examples from IMDB and generate adversarial examples by GA and IGA respectively on several models. The examples are shown in Table 5 to Table 7. We see that IGA substitutes fewer words than GA on these models under normal training.

Figure 3: The classification accuracy for various $\epsilon$ ranging from to . for three models on IMDB. Panels: (a) Models under no attack, (b) Word-CNN under attacks, (c) LSTM under attacks, (d) Bi-LSTM under attacks.

Figure 4: The classification accuracy for various $k$ ranging from to for three models on IMDB. Panels: (a) Models under no attack, (b) Word-CNN under attacks, (c) LSTM under attacks, (d) Bi-LSTM under attacks.

Table 5: The adversarial examples generated by GA and IGA on IMDB using Word-CNN model.

|  | Confidence (%) | Prediction | Text |
|---|---|---|---|
| Original | 97.9 | 1 | I enjoyed this film which I thought was well written and acted, there was plenty of humour and a provoking storyline, a warm and enjoyable experience with an emotional ending. |
| Original | 99.7 | 0 | I am sorry but this is the worst film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| Original | 95.8 | 1 | This is a unique masterpiece made by the best director ever lived in the ussr. He knows the art of film making and can use it very well. If you find this movie, buy or copy it! |
| GA | 50.6 | 0 | I cared this film which I thought was well written and acted, there was plenty of humour and a igniting storyline, a tepid and enjoyable experience with an emotional ending. |
| GA | 92.7 | 1 | I am sorry but this is the harshest film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| GA | 59.0 | 0 | This is a sole masterpiece made by the nicest director permanently lived in the ussr. He knows the art of film making and can use it much well. If you find this movie, buy or copy it! |
| IGA | 88.3 | 0 | I enjoyed this film which I think was well written and acted, there was plenty of humour and a causing storyline, a lukewarm and enjoyable experience with an emotional ending. |
| IGA | 70.8 | 1 | I am sorry but this is the hardest film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| IGA | 54.8 | 0 | This is a sole masterpiece made by the best director permanently lived in the ussr. He knows the art of film making and can use it very well. If you find this movie, buy or copy it! |
Table 6: The adversarial examples generated by GA and IGA on IMDB using LSTM model.

|  | Confidence (%) | Prediction | Text |
|---|---|---|---|
| Original | 99.9 | 1 | I enjoyed this film which I thought was well written and acted, there was plenty of humour and a provoking storyline, a warm and enjoyable experience with an emotional ending. |
| Original | 97.2 | 0 | I am sorry but this is the worst film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| Original | 99.7 | 1 | This is a unique masterpiece made by the best director ever lived in the ussr. He knows the art of film making and can use it very well. If you find this movie, buy or copy it! |
| GA | 88.2 | 0 | I enjoyed this film which I thought was well written and proceeded, there was plenty of humorous and a igniting storyline, a tepid and enjoyable experience with an emotional terminate. |
| GA | 99.9 | 1 | I am sorry but this is the hardest film I have ever seen in my life. I cannot believe that after making the first one in the series they were able to get a budget to make another. This is the least terrifying film I have ever watched and laughed all the way through to the end. |
| GA | 68.9 | 0 | This is a unique masterpiece made by the best superintendent ever lived in the ussr. He knows the art of film making and can use it supremely alright. If you find this movie, buy or copy it! |
| IGA | 72.1 | 0 | I enjoyed this film which I thought was well written and acted, there was plenty of humour and a provoking storyline, a lukewarm and agreeable experience with an emotional ending. |
| IGA | 99.8 | 1 | I am sorry but this is the hardest film I have ever seen in my life. I cannot believe that after making the first one in the series, they were able to get a budget to make another. This is the least scary film I have ever watched and laughed all the way through to the end. |
| IGA | 86.2 | 0 | This is a sole masterpiece made by the best director ever lived in the ussr. He knows the art of film making and can use it very well. If you find this movie, buy or copy it! |
Table 7: The adversarial examples generated by GA and IGA on IMDB using Bi-LSTM model.

Conﬁdence( %%