AdvExpander: Generating Natural Language Adversarial Examples by Expanding Text
Zhihong Shao, Zitao Liu, Jiyong Zhang, Zhongqin Wu, Minlie Huang
Department of Computer Science and Technology, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
TAL Education Group, Beijing, China
Hangzhou Dianzi University
[email protected], [email protected]@hdu.edu.cn, [email protected]@tsinghua.edu.cn
* Corresponding author: Minlie Huang.
Abstract
Adversarial examples are vital to expose the vulnerability of machine learning models. Despite the success of the most popular substitution-based methods, which substitute some characters or words in the original examples, substitution alone is insufficient to uncover all robustness issues of models. In this paper, we present AdvExpander, a method that crafts new adversarial examples by expanding text, which is complementary to previous substitution-based methods. We first utilize linguistic rules to determine which constituents to expand and what types of modifiers to expand with. We then expand each constituent by inserting an adversarial modifier searched from a CVAE-based generative model which is pre-trained on a large-scale corpus. To search adversarial modifiers, we directly search adversarial latent codes in the latent space without tuning the pre-trained parameters. To ensure that our adversarial examples are label-preserving for text matching, we also constrain the modifications with a heuristic rule. Experiments on three classification tasks verify the effectiveness of AdvExpander and the validity of our adversarial examples. AdvExpander crafts a new type of adversarial examples by text expansion, thereby promising to reveal new robustness issues.
Introduction

Adversarial examples are deliberately crafted from original examples to fool machine learning models. They can help (1) reveal systematic biases of data (Zhang et al., 2019b; Gardner et al., 2020), (2) identify pathological inductive biases of models (Feng et al., 2018), e.g., the adoption of shallow heuristics (McCoy et al., 2019) that are not robust and are unlikely to generalize beyond the training data, (3) regularize parameter learning (Minervini and Riedel, 2018), and (4) evaluate the stability (Cheng et al., 2019) or security level of models in practical use.
Matched case:
Substitution: Paraphrase (0.754) ✓ → Non-paraphrase (0.731) ✗
Sentence 1: What are some mind-blowing vehicle accessories that exist that most people don't know about?
Sentence 2: What are some mind-blowing [vehicle → automobile] accessories that exist that most people don't know about?
AdvExpander: Paraphrase (0.754) ✓ → Non-paraphrase (0.707) ✗
Sentence 1: What are some mind-blowing vehicles tools that exist that most people whom I interviewed don't know about?
Sentence 2: What are some mind-blowing vehicle accessories that exist that most people whom I interviewed don't know about?

Unmatched case:
Substitution: Non-paraphrase (0.991) ✓ → Paraphrase (0.675) ✗
Sentence 1: What's the best way to send mass emails?
Sentence 2: How can I send mass [emails → emailed] without being aggravating?
AdvExpander: Non-paraphrase (0.991) ✓ → Paraphrase (0.758) ✗
Sentence 1: What's the best way to send mass emails despite higher fees?
Sentence 2: How can I send mass emails without being aggravating?
Figure 1: Adversarial examples on Quora Question Pairs, crafted against BERT by a substitution-based attack method and by the insertion-based AdvExpander. [A → B] means substituting word A with word B. The underlined expressions are adversarial modifiers inserted to expand the target constituents in bold.

The most prevalent and effective practice of crafting natural language adversarial examples for classification is to flip characters (Ebrahimi et al., 2018) or to substitute words with their typos (Gao et al., 2018; Liang et al., 2018), synonyms (Papernot et al., 2016; Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2019), or other context-compatible words (Zhang et al., 2019a), while reusing the labels of the original examples as long as perturbations are few enough. However, these adversarial attacks limit the search space to the neighborhood of the original text and introduce only small lexical variation, which may not be able to uncover all robustness issues of models.

In this work, we present AdvExpander, which crafts new adversarial examples for classification by expanding text under the black-box setting. Specifically, we first use linguistic rules to identify constituents that are safe to expand without leading to an ill-formed text structure. We then expand each constituent by inserting an adversarial modifier which is searched from a CVAE-based (Conditional Variational Auto-Encoder (Sohn et al., 2015)) generative model pre-trained on the Billion Word Benchmark (Chelba et al., 2014). We search adversarial modifiers using REINFORCE (Williams, 1992). However, we avoid tuning the pre-trained parameters, as doing so can easily sacrifice grammaticality; instead, we additionally introduce a lightweight feed-forward network to search adversaries in the latent space. To make AdvExpander applicable to text matching (e.g., natural language inference and paraphrase identification) besides text classification (e.g., sentiment classification), we design a heuristic rule to ensure modifications are label-preserving: for matched cases (e.g., entailment pairs in natural language inference and paraphrase pairs in paraphrase identification), we only expand shared constituents in both texts with the same modifiers (see the matched case in Fig 1).

We characterize AdvExpander in two aspects. First, AdvExpander differs considerably from the aforementioned substitution-based attacks. As the semantics of modifiers is far less restricted and the expressions can be much more diverse than lexical substitutes, AdvExpander has a larger search space and can introduce more linguistic variations besides lexical variation, e.g., syntactic variation and semantic variation. Take the matched case in Fig 1 for example. For most existing substitution-based attack methods, the candidate substitutes of "vehicle" are restricted to its synonyms, e.g., "automobile" and "car". By contrast, for AdvExpander, there exist many reasonable modifiers of different types for "most people", e.g., clauses like "whom I interviewed" and prepositional phrases like "in the neighborhood". Therefore, AdvExpander is promising for measuring the generalization ability of models.
Second, as AdvExpander and substitution-based attacks adopt different types of manipulations (ours is based on insertion) and search adversarial examples in different search spaces, they complement each other and can be combined to boost attack performance.

We applied AdvExpander to attack three state-of-the-art models (including RE2 (Yang et al., 2019), BERT (Devlin et al., 2019), and WCNN (Kim, 2014)) and two models with certified robustness to adversarial word substitutions (Jia et al., 2019) (including bag-of-words and CNN) on SNLI (Bowman et al., 2015), Quora Question Pairs (https://data.quora.com/First-Quora-Dataset-Release-QuestionPairs), and IMDB (https://datasets.imdbws.com/), which are commonly used datasets for natural language inference, paraphrase identification, and text classification, respectively. We successfully reduce the accuracy of all target models to a significantly below-chance level. Furthermore, the validity of our adversarial examples is verified by human evaluation.

Our contributions are summarized as follows: (1) We propose AdvExpander, which generates new adversarial examples by expanding constituents in texts with modifiers. This method is able to introduce rich linguistic variations and differs substantially from existing substitution-based methods. (2) On three classification datasets, AdvExpander substantially degrades the performance of three state-of-the-art models and two models robust to word substitutions, while human annotators remain highly accurate on such adversarial examples, which verifies the validity of our method.

Related Work

Adversarial examples are of high value as they can reveal robustness issues of very successful deep classification models. According to how adversarial examples are crafted, recent work can be roughly divided into generation-based and edit-based approaches.
Some studies utilize rules or neural generation methods to craft adversarial examples. McCoy et al. (2019) focused on natural language inference and generated an adversarial hypothesis from a premise based on linguistic rules. Though effective, rule-based methods introduce limited variations. Iyyer et al. (2018) introduced syntactic variation by paraphrasing the original text with a syntax-controlled network, but the generated examples are not optimized to be adversarial. Kang et al. (2018) utilized Generative Adversarial Nets (Goodfellow et al., 2014), with the generator producing adversarial examples and the discriminator being the target model; this method is hard to balance between grammaticality and adversary. Zhao et al. (2018) trained an inverter to map a text to a latent representation and searched adversaries nearby with heuristic rules.
They trained the inverter on the original dataset, which might be insufficient to learn a smooth latent space, and their search strategy still leaves room for improvement. AdvExpander also involves neural generation. By contrast, we choose not to generate a complete text but to generate only modifiers, which are easier to control and thus less prone to syntactic errors. To learn smooth latent representations of texts, we pre-train a generative model on a large-scale corpus. To balance grammaticality, efficiency, and effectiveness when finding adversaries, we do not optimize the pre-trained parameters but additionally introduce a lightweight feed-forward network to search adversaries in the latent space.

Most studies craft adversarial examples by editing the original text. Substitution is the most popular edit type. Substitution-based attacks search for an optimal combination of adversarial substitutions under constraints. Under the black-box setting, adversarial attacks often involve scoring the importance of tokens (characters or words), which helps focus attention on important ones to reduce queries and perturbations. A common way of importance scoring is to measure changes of the target model's output after removing a token (Yang et al., 2018). The optimization process can be conducted by substituting tokens (1) in word order (Papernot et al., 2016), (2) from important ones to less important ones (Ren et al., 2019; Jin et al., 2019; Li et al., 2020a), (3) with beam search (Ebrahimi et al., 2018), or (4) with population-based methods (Alzantot et al., 2018; Zang et al., 2020). Constraints can be grammaticality, semantics preservation, or context compatibility (Li et al., 2020b).

Compared with substitution, which mainly introduces lexical variation, insertion and deletion together are likely to introduce richer variations but are far less popular, as they are more likely to render an adversarial example invalid. Wallace et al. (2019) inserted universal triggers into text, but the inserted strings are meaningless. Zhang et al. (2019a) supported all three edit types but on the token level, and are thus limited to small perturbations. Most relevantly, Liang et al. (2018) inserted adversarial phrases which were crafted manually. To the best of our knowledge, AdvExpander is the first efficient method that can automatically insert complex expressions, i.e., modifiers of constituents.

Figure 2: Workflow of AdvExpander for text classification and text matching. ⟨CL.⟩, ⟨PP⟩, and ⟨ADVP⟩ denote a modifier of type CL., PP, and ADVP, respectively. Stage 1 determines the insertion instructions, i.e., which constituents to expand, what types of modifiers to expand with, and where to insert the modifiers (we use (i, j) to indicate that the output modifier should be inserted after the j-th word of the i-th sentence). Stage 2 searches adversarial examples via beam search; in the beginning, beams are initialized as the original example. Each beam search step follows one instruction to search adversarial modifiers from a pre-trained CVAE. In this figure, beam size is 1; ⟨CL.⟩ is the modifier inserted at beam search step 1; underlined modifiers (e.g., ⟨PP⟩) are newly inserted at beam search step 2. If some successful adversarial example(s) is/are found after a beam search step, AdvExpander stops the beam search and returns the successful adversarial example with the lowest perplexity scored by GPT-2.
Method

Task Definition
Suppose a classifier $M$ maps the input text space $\mathcal{X}$ to the label space $\mathcal{Y}$. Let $X = s_1 s_2 \ldots s_N$ be an input text and $Y$ be its label. For text classification, $X$ is a text with $N$ sentences and $s_i$ is the $i$-th sentence. For text matching, $N = 2$ and $\langle s_1, s_2 \rangle$ is the pair to be classified. Our goal is to craft a valid adversarial input $X_{adv}$ by expanding $X$, so that $M(X_{adv}) \neq Y$. Under the black-box setting, we only have access to the target model's predictions and confidence scores.

Overview

Our method (Fig 2) can be divided into two stages. For convenience, we first introduce the concept of an "insertion instruction": an insertion instruction specifies one constituent to expand and the feasible type(s) of modifier to expand with. At the first stage, we utilize linguistic rules to analyze the feasible insertion instructions, while ensuring that the insertions will not render the text structure ill-formed. We consider a text structure ill-formed if it is syntactically incorrect or if there exists a constituent having multiple modifiers of the same constituency type (e.g., "The man in white behind the door ..."). To ensure that insertions are label-preserving for text matching, we further process the instructions so that we only expand shared constituents in both sentences with the same modifiers for matched cases (e.g., entailment pairs in natural language inference and paraphrase pairs in paraphrase identification). For computational efficiency, we only keep the most promising instructions (Eq. 3).
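Throughout both stages, the target model is queried only through its predictions and confidence scores. The following is a minimal sketch of such a black-box interface, assuming a wrapped classifier with a hypothetical `predict_proba` method; neither the class nor the method name comes from the paper:

```python
from typing import Dict

class BlackBoxTarget:
    """Hypothetical black-box interface to the victim classifier M:
    only the predicted label and the confidence scores are observable."""

    def __init__(self, model):
        self.model = model

    def scores(self, text: str) -> Dict[str, float]:
        # confidence scores over the label space Y (assumed API)
        return self.model.predict_proba(text)

    def prob(self, label: str, text: str) -> float:
        # P_M(Y | X): confidence assigned to a given label
        return self.scores(text)[label]

    def predict(self, text: str) -> str:
        s = self.scores(text)
        return max(s, key=s.get)
```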
The second stage follows these instructions to find adversarial examples via beam search. At each beam search step, we follow one instruction to search adversarial modifiers in the latent space of a pre-trained generative model. We craft an adversarial example by inserting adversarial modifiers into the original example. In the end, we measure the perplexity of each successful adversarial example with GPT-2 (Radford et al., 2019; model from https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin) and return the top-ranking one, which is expected to be the most syntactically correct.

Stage 1: Determining Insertion Instructions

AdvExpander expands text by adding modifiers. We consider four types of modifiers, namely adverb phrase (ADVP), prepositional phrase (PP), appositive (Appos), and clause (CL.). For each input sentence $s_i$, we first obtain its constituency structure (using the parser from https://s3-us-west-2.amazonaws.com/allennlp/models/elmo-constituency-parser-2018.03.14.tar.gz), and then utilize handcrafted parsing templates to determine which constituents to expand and what types of modifiers to expand with; for more details, refer to the supplementary material. To avoid rendering the text structure ill-formed, we ignore those types of modifiers that the target constituents already have. These analytical results are formatted as a sequence of insertion instructions. Each instruction $I$ is defined as:

$$I = \langle c, t, P \rangle \quad (1)$$

which means that a modifier of type $t$ (one of the four types mentioned above) should be inserted at every position within the set $P$ to modify the target constituent $c$. For example, the unmatched case in Fig 2 has three feasible insertion instructions for each sentence. One of the instructions is $I = \langle c = \text{"The girl"},\ t = \text{"PP/Appos./CL."},\ P = \{(1,2)\} \rangle$, where $(1,2)$ means that the modifier should be inserted after the 2nd word of the 1st sentence.

To ensure that insertions are label-preserving for text matching, we only expand shared constituents in both texts with the exact same modifiers for matched cases. Take the matched case in Fig 2 for example. We only expand the shared constituents "The girl" and "a song" with the exact same modifiers, but ignore the different verb phrases "writes a song" and "composes a song". Therefore, the instruction associated with "The girl" has insertion position set $P = \{(1,2), (2,2)\}$.

For computational efficiency, we only retain those instructions with top-$K$ vulnerability score:

$$\mathcal{I} = \operatorname*{Top-K}_{I}\ \mathrm{Score}(I) \quad (2)$$

$$\mathrm{Score}(I) = \max_{X_{adv} \in \mathrm{BS}_{step}(I, \{X\})} -P_M(Y \mid X_{adv}) \quad (3)$$

where $\mathrm{BS}_{step}(I, \{X\})$ (see the next section) returns $S$ adversarial examples searched in a one-step beam search which follows the instruction $I$ and starts from $X$, and $P_M(Y \mid X_{adv})$ is the probability of $Y$ given the intermediate adversarial example $X_{adv}$. For an instruction $I$, the vulnerability score measures the vulnerability of the target constituent through insertion trials. The vulnerability score of each instruction can be calculated in parallel.
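As a concrete sketch of Eqs. (2)-(3): `bs_step` stands for the one-step beam search $\mathrm{BS}_{step}$ described in the next section, and `target` is the black-box interface sketched above; both names are assumptions of this sketch, not the paper's code.

```python
def vulnerability_score(inst, x, y, target, bs_step):
    """Eq. (3): run one beam-search step for instruction `inst` starting
    from the original example x, and record how far the confidence in
    the gold label y can be pushed down by insertion trials."""
    trials = bs_step(inst, [x])                       # S candidate expansions
    return max(-target.prob(y, t) for t in trials)

def top_k_instructions(instructions, x, y, target, bs_step, k=3):
    """Eq. (2): keep only the K instructions with the highest
    vulnerability score; each score can be computed in parallel."""
    ranked = sorted(
        instructions,
        key=lambda i: vulnerability_score(i, x, y, target, bs_step),
        reverse=True,
    )
    return ranked[:k]
```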
Stage 2: Searching Adversarial Examples

The second stage follows the insertion instructions in decreasing order of vulnerability score to search adversarial examples via beam search. During beam search, we maintain a set of intermediate adversarial examples $B$. At each beam search step, we follow one instruction $I$ and search adversarial modifiers for each $X_{adv} \in B$ from the latent space of a CVAE-based generative model.

Figure 3: Graphical model of our CVAE-based generative model, where $c$, $t$, $z$, and $m$ denote the target constituent, the type of the output modifier, the latent variable, and the output modifier, respectively. Dashed arrows are connections for the posterior distribution $q_G(z \mid c, t, m)$.

Fig 3 shows the graphical model of our generative model. It takes as input the target constituent $c$ and the expected type of the modifier $t$, samples the latent variable $z$, and generates the modifier $m$. We encode $c$ and decode $m$ with bi- and uni-directional RNNs, respectively. Our generative model has two critical designs. First, we choose CVAE instead of a Sequence-to-Sequence (Seq2Seq) model (Bahdanau et al., 2015). This is because Seq2Seq trained with maximum likelihood estimation mainly captures low-level variations of expressions (Serban et al., 2017; Shao et al., 2019), thus failing to provide rich candidates of adversarial modifiers.
Second, we generate modifiers conditioned on the target constituent instead of the entire text. This makes the distribution of conditions denser, which is beneficial for learning a smoother latent space and for improving the diversity and quality of text. This design also mitigates the gap between the pre-training corpus and the attacked corpora, so that we can apply one pre-trained generative model to attack classifiers on different datasets.
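A minimal PyTorch sketch of this generator follows; the layer sizes, module layout, and names are illustrative assumptions rather than the paper's implementation, and the objective such a model would be trained with is given in Eqs. (4)-(6) below:

```python
import torch
import torch.nn as nn

class ModifierCVAE(nn.Module):
    """CVAE over modifiers: a bidirectional RNN encodes the target
    constituent c (and, for the posterior, the modifier m); prior
    p_G(z|c,t) and posterior q_G(z|c,t,m) are isotropic Gaussians; a
    unidirectional RNN decodes the modifier m from (c, t, z)."""

    def __init__(self, vocab_size, n_types=4, emb=256, hid=512, z_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.type_embed = nn.Embedding(n_types, emb)
        self.enc_c = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.enc_m = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.prior_net = nn.Linear(2 * hid + emb, 2 * z_dim)      # mu, logvar
        self.posterior_net = nn.Linear(4 * hid + emb, 2 * z_dim)  # mu, logvar
        self.type_clf = nn.Linear(z_dim, n_types)  # P_G(t|z) for Eq. (6)
        self.dec_init = nn.Linear(2 * hid + emb + z_dim, hid)
        self.dec = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    @staticmethod
    def reparameterize(params):
        # split into mean / log-variance and draw one Gaussian sample
        mu, logvar = params.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar
```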
Pre-training

To learn a smooth latent space, we pre-trained our generative model on the Billion Word Benchmark. For training data, we treated each sentence in this corpus as having been expanded, and utilized the parsing templates used in Stage 1 to extract constituents and their modifiers. The loss function $\mathcal{L}_{pt}$ is the sum of two terms:

$$\mathcal{L}_{pt} = \mathcal{L}_{pt}^{(1)} + \mathcal{L}_{pt}^{(2)} \quad (4)$$

The first term $\mathcal{L}_{pt}^{(1)}$ is the negative evidence lower bound of $\log P_G(m \mid c, t)$, the log-likelihood of modifier $m$ given its type $t$ and the modified constituent $c$:

$$\mathcal{L}_{pt}^{(1)} = -\mathbb{E}_{q_G(z \mid c,t,m)}[\log P_G(m \mid c, t, z)] + D_{KL}(q_G(z \mid c, t, m) \,\|\, p_G(z \mid c, t)) \quad (5)$$

where $p_G(z \mid c, t)$ and $q_G(z \mid c, t, m)$ are the prior and posterior distributions of the latent variable $z$, respectively, both isotropic Gaussian density functions, and $D_{KL}(\cdot)$ is the KL divergence. The second term $\mathcal{L}_{pt}^{(2)}$ is the loss of reconstructing $t$ from the latent variable $z$, which encodes $t$ into $z$ so that the type of the output modifier can be better controlled (Ke et al., 2018):

$$\mathcal{L}_{pt}^{(2)} = -\mathbb{E}_{q_G(z \mid c,t,m)}[\log P_G(t \mid z)] \quad (6)$$

Beam Search

Before beam search, the set of intermediate adversarial examples is initialized as $B = \{X\}$. Let $I_i$ be the $i$-th most promising instruction in $\mathcal{I}$ according to its vulnerability score. The $i$-th beam search step follows $I_i$ to expand each $X_{adv} \in B_i$. The next beams $B_{i+1}$ are updated as follows:

$$B_{i+1} = \operatorname*{Top-Z}_{X'_{adv} \in B_i \cup \mathrm{BS}_{step}(I_i, B_i)} -P_M(Y \mid X'_{adv}) \quad (7)$$

where $\mathrm{BS}_{step}(I_i, B_i)$ is a beam search step; following $I_i$, for each $X_{adv} \in B_i$, it searches $S$ modifiers from the generative model and separately inserts them into $X_{adv}$, resulting in $S$ new intermediate adversarial examples.

A straightforward way to search adversarial modifiers is to randomly sample $S$ latent codes from the prior distribution and decode a modifier $m_{adv}$ for each code:

$$m_{adv} = \operatorname*{arg\,max}_{m} P_G(m \mid c, t, z), \quad z \sim p_G(z \mid c, t) \quad (8)$$

As the generative model is capable of producing a rich set of diverse modifiers, this method shows good attack performance. However, this search process takes no consideration of the target model. We can further optimize attack performance using REINFORCE (Williams, 1992).

To avoid sacrificing grammaticality for adversary, we choose not to finetune the pre-trained parameters of our generative model, but to directly search latent codes that will produce adversarial modifiers. As the prior network of our CVAE-based model maps an input to a space of proper but not necessarily adversarial latent codes, we additionally introduce an adversarial prior network, a trainable lightweight feed-forward network that narrows the latent space down to the adversarial region (see Stage 2 in Fig 2). The adversarial prior network computes the adversarial prior distribution $q_G^{adv}(z \mid c, t)$, which is an isotropic Gaussian density function. It is initialized with the parameters of the pre-trained prior network and finetuned with REINFORCE. The reward is defined as:

$$m_{adv} = \operatorname*{arg\,max}_{m} P_G(m \mid c, t, z), \quad z \sim q_G^{adv}(z \mid c, t)$$
$$R(z) = -\log(P_M(Y \mid m_{adv} \to X_{adv}) + \alpha) \quad (9)$$

where $m_{adv}$ is the modifier decoded from the latent code $z$ sampled from the adversarial prior distribution, $m_{adv} \to X_{adv}$ is the intermediate adversarial example crafted by inserting $m_{adv}$ into $X_{adv}$ ($\in B_i$), and $\alpha$ is a hyperparameter.
To restrict adversarial latent codes to lie in the pre-trained prior distribution and to produce modifiers of the expected types, we regularize the finetuning procedure with:

$$\Lambda = D_{KL}(q_G^{adv}(z \mid c, t) \,\|\, p_G(z \mid c, t)) - \mathbb{E}_{q_G^{adv}(z \mid c,t)}[\log P_G(t \mid z)] \quad (10)$$

Therefore, the total loss is:

$$\mathcal{L}_{adv} = -\mathbb{E}_{q_G^{adv}(z \mid c,t)}[R(z)] + \gamma \Lambda \quad (11)$$
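One update of this procedure might look as follows. The helpers `decode_greedy`, `insert_modifier`, `gaussian_log_prob`, and `kl_gaussians`, the CVAE accessors `prior_params` and `type_log_prob`, and the `alpha`/`gamma` values are all assumptions of this sketch, not the paper's settings:

```python
import torch

def reinforce_step(adv_prior, cvae, target, c, t, x_adv, y, optim,
                   alpha=0.01, gamma=1.0):
    """One REINFORCE update of the lightweight adversarial prior network
    (Eqs. 9-11); the pre-trained CVAE parameters stay frozen."""
    mu, logvar = adv_prior(c, t)                      # q^adv_G(z|c,t)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    m_adv = decode_greedy(cvae, c, t, z)              # argmax_m P_G(m|c,t,z)
    x_new = insert_modifier(m_adv, x_adv)             # m_adv -> X_adv
    # Eq. (9): reward from the black-box target model (a plain float)
    reward = -torch.log(torch.tensor(target.prob(y, x_new)) + alpha)
    # score-function (REINFORCE) term: treat z as a fixed sample here
    log_q = gaussian_log_prob(z.detach(), mu, logvar)
    # Eq. (10): stay close to the pre-trained prior, keep t decodable from z
    p_mu, p_logvar = cvae.prior_params(c, t)
    reg = kl_gaussians(mu, logvar, p_mu, p_logvar) - cvae.type_log_prob(t, z)
    loss = -reward * log_q + gamma * reg              # Eq. (11)
    optim.zero_grad(); loss.backward(); optim.step()
    return x_new
```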
The beam search step $\mathrm{BS}_{step}(I_i, B_i)$, for each $X_{adv} \in B_i$, minimizes $\mathcal{L}_{adv}$ for $S$ steps and returns $S$ new adversarial examples (one example per step). Each $X_{adv} \in B_i$ can be processed in parallel. If some intermediate adversarial example $X'_{adv} \in \mathrm{BS}_{step}(I_i, B_i)$ fools the target model, we return the adversarial example with the lowest perplexity:

$$X^*_{adv} = \operatorname*{arg\,min}_{X'_{adv} \in \mathrm{BS}_{step}(I_i, B_i),\; M(X'_{adv}) \neq Y} \mathrm{perplexity}(X'_{adv}) \quad (12)$$

otherwise, we continue beam search.
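Putting Eq. (7) and Eq. (12) together, the outer loop can be sketched as below; `bs_step` is the one-step search above, and the perplexity filter uses the Hugging Face `transformers` GPT-2 API (a standard way to score perplexity, though not necessarily the paper's exact setup):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

_tok = GPT2TokenizerFast.from_pretrained("gpt2")
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = _tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = _lm(ids, labels=ids).loss   # mean token negative log-likelihood
    return torch.exp(loss).item()

def attack(x, y, instructions, target, bs_step, beam_size=5):
    """Stage 2: follow instructions in decreasing vulnerability order,
    keep the Z least-confident candidates (Eq. 7), and return the
    lowest-perplexity successful example (Eq. 12)."""
    beams = [x]
    for inst in instructions:
        candidates = bs_step(inst, beams)      # S new examples per beam
        hits = [c for c in candidates if target.predict(c) != y]
        if hits:
            return min(hits, key=perplexity)   # Eq. (12)
        pool = beams + candidates
        pool.sort(key=lambda c: target.prob(y, c))  # ascending P_M(Y|X')
        beams = pool[:beam_size]               # Top-Z by -P_M(Y|X')
    return None                                # attack failed
```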
Experiments

Datasets

We evaluated AdvExpander on three datasets: (1) SNLI: a large-scale dataset for natural language inference, the task of judging whether a premise entails, contradicts, or is independent of a hypothesis. The train/validation/test split is 550,152/10,000/10,000. (2) QQP: Quora Question Pairs for paraphrase identification, the task of identifying whether two sentences are paraphrases. The train/validation/test split is 384,348/10,000/10,000 (Wang et al., 2017). (3) IMDB: movie reviews for document-level two-way sentiment classification, with 25,000/25,000 training/test instances.
Target Models

We attacked RE2 and BERT on both SNLI and QQP, and attacked WCNN and BERT on IMDB. To verify that AdvExpander crafts new adversarial examples, we also attacked two models with certified robustness to adversarial word substitutions, i.e., RBOW and RCNN from Jia et al. (2019). (1) RE2: a simple but effective model which exploits rich alignment features for text matching. (2) BERT: a bidirectional Transformer encoder pre-trained on large-scale corpora; we finetuned BERT-base-uncased on the three datasets, respectively. (3) WCNN: a word-based convolutional neural network. (4) RBOW: a robustly trained classifier with bag-of-words encoding. (5) RCNN: a robustly trained bag-of-words model with input word vectors transformed by a CNN layer.
Baselines

To verify that we can craft new adversarial examples, we compare AdvExpander with three recently proposed black-box substitution-based attack algorithms, i.e., PWWS, BERT-Attack, and TextFooler. The three algorithms mainly differ in their estimation of word importance and the source of substitutes. As the three algorithms have been demonstrated to be more effective than or comparable to many other algorithms, they are representative.
PWWS (Ren et al., 2019) crafts semantics-preserving adversarial examples by replacing words with their synonyms (using WordNet: https://wordnet.princeton.edu/) or replacing named entities with other similar ones. As PWWS has no named entity substitution rules specialized for SNLI or QQP, we applied PWWS on SNLI and QQP without named entity substitutions.

BERT-Attack (Li et al., 2020b) crafts adversarial examples by substituting (sub)words with context-compatible alternatives sampled from BERT.
TextFooler (Jin et al., 2019) crafts semantics-preserving adversarial examples and finds synonyms with counter-fitting word embeddings (Mrkšić et al., 2016).
Implementation Details

The insertion budget $K$ is 3/3/5 for SNLI/QQP/IMDB, respectively. The number of search steps $S$ is 80 and the beam size $Z$ is 5. Thus, AdvExpander queries a target model no more than 240/240/400 times to craft an adversarial example on SNLI/QQP/IMDB, respectively. For further implementation details, ablation analyses, case studies, and error analysis, refer to the supplementary material.
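One reading of the stated query bounds is that they coincide with $K \times S$ for each dataset; the configuration below is a hypothetical illustration of that arithmetic, not the paper's code:

```python
# Hypothetical attack-budget configuration mirroring the reported settings.
BUDGETS = {
    "SNLI": {"K": 3, "S": 80, "Z": 5},
    "QQP":  {"K": 3, "S": 80, "Z": 5},
    "IMDB": {"K": 5, "S": 80, "Z": 5},
}

for dataset, b in BUDGETS.items():
    # each of the K retained instructions triggers one beam-search step
    print(dataset, "max target-model queries:", b["K"] * b["S"])
# SNLI max target-model queries: 240
# QQP max target-model queries: 240
# IMDB max target-model queries: 400
```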
Automatic Evaluation

| Dataset          | SNLI |      |      | QQP  |      | IMDB  |       |       |
| Model            | RE2  | BERT | RBOW | RE2  | BERT | WCNN  | BERT  | RCNN  |
| Ori. Test        | 86.9 | 90.7 | 79.4 | 88.6 | 91.3 | 90.0  | 92.0  | 79.3  |
| Adv.             | 23.8 | 27.2 | 21.4 | 33.0 | 44.2 | 10.9  | 16.4  | 6.6   |
| Adv. Len.        | 35.4 | 36.8 | 32.7 | 46.3 | 50.8 | 274.3 | 290.7 | 268.0 |
| Ori. Len.        | 23.8 | 23.8 | 23.7 | 24.1 | 23.8 | 245.0 | 257.1 | 238.0 |
| Ori. Test (long) | 86.6 | 89.9 | 77.3 | 95.0 | 96.4 | 89.3  | 87.9  | 78.8  |
Table 1: Automatic evaluation performance, including model accuracy on the original test examples ("Ori. Test") and model accuracy on the corresponding adversarial examples crafted by AdvExpander ("Adv."). "Adv. Len." and "Ori. Len." denote the average length of successful adversarial examples and of the corresponding original examples before expansion, respectively. "Ori. Test (long)" denotes model accuracy on the original test examples that are longer than the average length of successful adversarial examples.

We evaluated AdvExpander on the entire test sets for SNLI and QQP but on 1,000 random test samples for IMDB, as texts in IMDB are hundreds of words long and even the baseline PWWS is slow. We measured model accuracy on the original test examples and on the corresponding adversarial examples (Table 1). AdvExpander is effective in crafting adversarial examples; it degrades the accuracy of all target models substantially. For example, the accuracy of BERT drops from above 90% to below 28% on both SNLI and IMDB. As AdvExpander crafts adversarial examples by expanding text, we further investigate the influence of text length on model accuracy.
| Dataset | SNLI  |       |       | QQP   |       | IMDB |       |      |
| Model   | RE2   | BERT  | RBOW  | RE2   | BERT  | WCNN | BERT  | RCNN |
| PS      | 28.7  | 36.3  | 41.2  | 49.2  | 53.5  | 7.9  | 29.1  | 14.8 |
| BA      | 10.1  | 12.8  | 17.6  | 30.3  | 36.6  | 0.8  | 12.1  | 9.0  |
| TF      | 2.9   | 4.4   | 8.9   | 32.1  | 38.1  | 0.4  | 14.3  | 10.0 |
| PS + BA | 5.5*  | 8.1*  | 11.4* | 29.6* | 35.6* | 0.6* | 11.0* | 3.0* |
| PS + TF | 1.6*  | 3.0*  | 6.4*  | 31.4* | 37.0* | 0.1  | 13.1* | 4.1* |
| BA + TF | 1.5*  | 2.3*  | 4.5*  | 26.6* | 32.8* |      |       |      |
Table 2: Comparison between AdvExpander and substitution-based attack methods. "PS"/"BA"/"TF" denote PWWS/BERT-Attack/TextFooler, respectively. All numbers are model accuracy on the adversarial examples crafted by different attack methods. We also report model accuracy under the attacks of any two combined methods (e.g., "PS + BA"): an attack is successful if at least one algorithm fools the target model. Accuracy that is significantly higher than the lowest accuracy (in bold) is marked with * for p-value < 0.05.

As shown in Table 1, though AdvExpander makes input texts longer, the target models retain high accuracy on the original test examples that are longer than the average length of adversarial examples, indicating that text length is not the reason AdvExpander succeeds in fooling target models.
Comparison with Substitution-based Attacks
We further investigate the relationship between AdvExpander and previous substitution-based attack algorithms (Table 2). AdvExpander degrades model accuracy more remarkably than PWWS in most cases, but less remarkably than BERT-Attack and TextFooler. Note that we choose not to modify an example if any insertion would render the text structure ill-formed. When ignoring those examples we choose not to modify, the accuracy of BERT under our attacks is 6.2% on SNLI and 37.1% on QQP, respectively, which is competitive with the performance of BERT-Attack and TextFooler.

To verify that AdvExpander crafts new adversarial examples compared with substitution-based methods, we attacked RBOW and RCNN, which have certified robustness to word substitutions. Due to robust training, the two models are even harder to fool than some structurally more advanced models for substitution-based attacks. However, as RBOW and RCNN have rather simple architectures and are only trained to be robust to word substitutions, they are unsurprisingly easier to fool than the other models for AdvExpander. Take IMDB for example. RCNN is significantly more accurate (10.0%) than WCNN (0.4%) under TextFooler's attacks, but is much less accurate (6.6%) than WCNN (10.9%) under our attacks. Therefore, certified robustness to word substitutions may not indicate robustness to insertion-based adversarial examples.

We also combined AdvExpander with a substitution-based method to attack the target models (Table 2). Specifically, an attack is considered successful if at least one of the two methods fools the target model. The combinatorial attacks consistently boost attack performance. In most cases, the highest performance boost comes from combining AdvExpander with a substitution-based method rather than from combining two substitution-based methods. In other words, AdvExpander can craft adversarial examples in a way substitution-based methods are incapable of. Thus, AdvExpander is complementary to substitution-based methods and is promising to reveal new robustness issues.
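The combination rule used in Table 2 is simple to state in code; here `attacks` is a list of callables that each return a successful adversarial example or `None`, an interface assumed for this sketch:

```python
def combined_success(x, y, target, attacks):
    """A test example counts as broken under a combined attack
    (e.g., "PS + BA") if at least one method fools the target model."""
    return any(attack(x, y, target) is not None for attack in attacks)
```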
| Dataset      | QQP      |                | IMDB     |             |
| Metric       | Accuracy | Grammaticality | Accuracy | Naturalness |
| Ori. Sampled | 85.0     | 2.65           | 88.0     | 2.69        |
| TF           | 71.5     | 2.45           | 84.0     | 2.57        |
| Ours         | 80.0**   | 2.39           | 84.5     | 2.65*       |
Table 3: Human evaluation of adversarial examples against BERT, in terms of human accuracy and grammaticality/naturalness. "TF" denotes TextFooler. "Ori. Sampled" shows evaluation on the corresponding original test samples. Bootstrap resampling (Koehn, 2004) is used as the significance test between the two methods. ** marks significantly better performance for p-value < 0.01, and * for p-value < 0.05.

Human Evaluation

To verify the validity of our adversarial examples, we conducted human evaluation (Table 3). We randomly sampled 200 adversarial examples against BERT on QQP and IMDB, respectively. These samples were mixed with the corresponding original test samples and the corresponding adversarial examples crafted by TextFooler; each example was presented to three workers on Amazon Mechanical Turk to annotate its label and whether it is grammatical/natural (3-point Likert scale). For each example, we aggregated human-predicted labels by majority vote, and computed human accuracy as the consistency between the aggregated labels and the gold labels. We also computed the average grammaticality/naturalness score. As IMDB reviews are informal and contain grammatical errors, we measure naturalness instead of grammaticality on IMDB.
Human accuracy on the original examples and our adversarial examples is close. By contrast, on TextFooler's adversarial examples, human accuracy drops to 71.5% on QQP, mostly due to imperfect synonym candidates (e.g., substituting "mechanical" in "mechanical engineer" with "mechanised"). Therefore, our adversarial examples are label-preserving at an acceptable level. Moreover, the grammaticality/naturalness score of our adversarial examples is close to that of the original samples, indicating that our adversarial examples are of good quality. Overall, these results demonstrate the validity of our adversarial examples.
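The aggregation behind human accuracy can be sketched as follows (three votes per example, majority label scored against gold; variable names are illustrative):

```python
from collections import Counter

def human_accuracy(worker_labels, gold_labels):
    """Majority-vote aggregation of three crowdworker annotations per
    example, scored for agreement with the gold labels."""
    correct = 0
    for votes, gold in zip(worker_labels, gold_labels):
        majority = Counter(votes).most_common(1)[0][0]
        correct += int(majority == gold)
    return correct / len(gold_labels)
```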
Adversarial Training

We separately retrained RE2 on SNLI augmented with 80K adversarial examples crafted on the training set by AdvExpander and by TextFooler, and tested their robustness on the original test set (Table 4). For both TextFooler and AdvExpander, adversarial training helps improve a model's robustness to the attack method it is trained with, and slightly improves model accuracy on the original test set. Notably, as the original training set is large, training models with more adversarial examples can further improve models' robustness. We also observed that adversarially training RE2 with TextFooler can hardly improve accuracy under AdvExpander's attacks (23.8% → 24.1%), and vice versa (2.9% → 3.1%).

| Training Set | Ori. Train | + TF        | + Ours      | + Ours & TF |
| Ori. Test    | 86.9       | 87.3 (+0.4) | 87.0 (+0.1) | 87.2 (+0.3) |
| TF           | 2.9        | 8.0 (+5.1)  | 3.1 (+0.2)  | 7.4 (+4.5)  |
| Ours         | 23.8       | 24.1 (+0.3) | 30.0 (+6.2) | 30.8 (+7.0) |
Table 4: Model accuracy on the original test examples ("Ori. Test") and adversarial examples ("TF" and "Ours") after adversarially training RE2 on SNLI with TextFooler ("+TF"), with AdvExpander ("+Ours"), or with both AdvExpander and TextFooler ("+Ours & TF"). "TF" stands for TextFooler. Numbers in parentheses are improvements over the model trained on the original training set ("Ori. Train").

Conclusion
In this paper, we present AdvExpander, which generates new natural language adversarial examples by expanding text. Extensive experiments demonstrate the effectiveness of our algorithm and the validity of our adversarial examples. Our adversarial examples are substantially different from previous substitution-based adversarial examples and are thus promising to reveal new robustness issues.
References
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896. Association for Computational Linguistics.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.
Jill Burstein, Christy Doran, and Thamar Solorio, editors. 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH 2014, pages 2635–2639. ISCA.
Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. In (Korhonen et al., 2019), pages 4324–4333.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In (Burstein et al., 2019), pages 4171–4186.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Volume 2 (Short Papers), pages 31–36. Association for Computational Linguistics.
Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3719–3728. Association for Computational Linguistics.
Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56. IEEE Computer Society.
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating NLP models via contrast sets. CoRR, abs/2004.02709.
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680.
Iryna Gurevych and Yusuke Miyao, editors. 2018. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Volume 1 (Long Papers). Association for Computational Linguistics.
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885. Association for Computational Linguistics.
Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4127–4140. Association for Computational Linguistics.
Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT really robust? Natural language attack on text classification and entailment. CoRR, abs/1907.11932.
Dongyeop Kang, Tushar Khot, Ashish Sabharwal, and Eduard H. Hovy. 2018. AdvEntuRe: Adversarial training for textual entailment with knowledge-guided examples. In (Gurevych and Miyao, 2018), pages 2418–2428.
Pei Ke, Jian Guan, Minlie Huang, and Xiaoyan Zhu. 2018. Generating informative responses with controlled sentence function. In (Gurevych and Miyao, 2018), pages 1499–1508.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. ACL.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In EMNLP, pages 388–395. ACL.
Anna Korhonen, David R. Traum, and Lluís Màrquez, editors. 2019. Proceedings of the 57th Conference of the Association for Computational Linguistics, Volume 1 (Long Papers). Association for Computational Linguistics.
Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020a. BERT-ATTACK: Adversarial attack against BERT using BERT. CoRR, abs/2004.09984.
Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020b. BERT-ATTACK: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6193–6202. Association for Computational Linguistics.
Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2018. Deep text classification can be fooled. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), pages 4208–4215. ijcai.org.
Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In (Korhonen et al., 2019), pages 3428–3448.
Pasquale Minervini and Sebastian Riedel. 2018. Adversarially regularising neural NLI models to integrate logical background knowledge. In Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 65–74. Association for Computational Linguistics.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148. Association for Computational Linguistics.
Nicolas Papernot, Patrick D. McDaniel, Ananthram Swami, and Richard E. Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. In 2016 IEEE Military Communications Conference (MILCOM), pages 49–54. IEEE.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In (Korhonen et al., 2019), pages 1085–1097.
Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3295–3301. AAAI Press.
Zhihong Shao, Minlie Huang, Jiangtao Wen, Wenfei Xu, and Xiaoyan Zhu. 2019. Long and diverse text generation with planning-based hierarchical variational model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3255–3266. Association for Computational Linguistics.
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In NIPS, pages 3483–3491.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), pages 4144–4150. ijcai.org.
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
Puyudi Yang, Jianbo Chen, Cho-Jui Hsieh, Jane-Ling Wang, and Michael I. Jordan. 2018. Greedy attack and Gumbel attack: Generating adversarial examples for discrete data. CoRR, abs/1805.12316.
Runqi Yang, Jianhai Zhang, Xing Gao, Feng Ji, and Haiqing Chen. 2019. Simple and effective text matching with richer alignment features. In (Korhonen et al., 2019), pages 4699–4709.
Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6066–6080. Association for Computational Linguistics.
Huangzhao Zhang, Hao Zhou, Ning Miao, and Lei Li. 2019a. Generating fluent adversarial examples for natural languages. In (Korhonen et al., 2019), pages 5564–5569.
Yuan Zhang, Jason Baldridge, and Luheng He. 2019b. PAWS: Paraphrase adversaries from word scrambling. In (Burstein et al., 2019), pages 1298–1308.
Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In 6th International Conference on Learning Representations (ICLR 2018), Conference Track Proceedings.