Robust Machine Comprehension Models via Adversarial Training
Yicheng Wang and
Mohit Bansal
University of North Carolina at Chapel Hill
{yicheng, mbansal}@cs.unc.edu

Abstract
It is shown that many published models for the Stanford Question Answering Dataset (Rajpurkar et al., 2016) lack robustness, suffering over a 50% decrease in F1 score during adversarial evaluation based on the AddSent (Jia and Liang, 2017) algorithm. It has also been shown that retraining models on data generated by AddSent has limited effect on their robustness. We propose a novel alternative adversary-generation algorithm, AddSentDiverse, that significantly increases the variance within the adversarial training data by providing effective examples that punish the model for making certain superficial assumptions. Further, in order to improve robustness to AddSent's semantic perturbations (e.g., antonyms), we jointly improve the model's semantic-relationship learning capabilities in addition to our AddSentDiverse-based adversarial training data augmentation. With these additions, we show that we can make a state-of-the-art model significantly more robust, achieving a 36.5% increase in F1 score under many different types of adversarial evaluation while maintaining performance on the regular SQuAD task.
1 Introduction

We explore the task of reading-comprehension-based question answering (Q&A), where we focus on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), in which models answer questions about paragraphs taken from Wikipedia. Significant progress has been made with deep end-to-end neural-attention models, with some achieving above-human-level performance on the test set (Wang and Jiang, 2017; Seo et al., 2017; Wang et al., 2017; Huang et al., 2018; Peters et al., 2018). However, as shown recently by Jia and Liang (2017), these models are very fragile when presented with adversarially generated data. They proposed AddSent, which creates a semantically-irrelevant sentence containing a fake answer that resembles the question syntactically, and appends it to the context. Many state-of-the-art models exhibit a nearly 50% reduction in F1 score on AddSent, showing their over-reliance on syntactic similarity and limited semantic understanding.

Importantly, this is in part due to the nature of the SQuAD dataset. Most questions in the dataset have answer spans embedded in sentences that are syntactically similar to the question. Thus, during training, the model is rarely punished for answering questions based on syntactic similarity, and learns it as a reliable approach to Q&A. This correlation between syntactic similarity and correctness is of course not true in general: the adversaries generated by AddSent (Jia and Liang, 2017) are syntactically similar to the question but do not answer it. The models' failures on AddSent demonstrate their ignorance of this aspect of the task. Jia and Liang (2017) presented some initial attempts to fix this problem by retraining the BiDAF model (Seo et al., 2017) with adversaries generated by AddSent. But they showed that the method is not very effective, as slight modifications (e.g., different positioning of the distractor sentence in the paragraph and a different fake answer set) to the adversary generation algorithm at test time have a drastic impact on the retrained model's performance.

In this paper, we show that their method of adversarial training failed because the specificity of the AddSent algorithm, along with the lack of naturally-occurring counterexamples, allows models to learn superficial clues regarding what is a 'distractor' and subsequently ignore it, thus significantly limiting their robustness. Instead, we first introduce a novel algorithm, AddSentDiverse, for generating adversarial examples with significantly higher variance (by varying the locations where the distractors are placed and expanding the set of fake answers), so that the model is punished during training time for making these superficial assumptions about the distractor. We show that an AddSentDiverse-based adversarially-trained model beats an AddSent-trained model across 3 different adversarial test sets, showing an average improvement of 24.22% in F1 score, demonstrating a general increase in robustness. However, even with our diversified adversarial training data, the model is still not fully resilient to AddSent-style attacks, e.g., its antonymy-style semantic perturbations. Hence, we next add semantic relationship features to the model to let it directly identify such relationships between the context and question.
Interestingly, we see that these additions only increase model robustness when trained adversarially; intuitively, in the non-adversarially-trained setup, there are not enough negative (adversarial) examples for the model to learn how to use its semantic features.

Overall, we demonstrate that with our adversarial training method and model improvement, we can increase the performance of a state-of-the-art model by 36.46% on the AddSent evaluation set. Although we focused on the AddSent adversary (Jia and Liang, 2017), our method of effective adversarial training by eliminating superficial statistical correlations (with joint model capability improvements) is generalizable to other similar insertion-based adversaries for Q&A tasks. (We release our AddSentDiverse-based adversarial training dataset for SQuAD at https://goo.gl/qdSNDr.)

2 Related Work

Adversarial Evaluation
In computer vision, adversarial examples are frequently used to punish model oversensitivity, where semantics-preserving perturbations (usually in the form of small noise vectors) are added to an image to fool the classifier into giving it a different label (Szegedy et al., 2014; Goodfellow et al., 2015). In the field of Q&A, Jia and Liang (2017) introduced the AddSent algorithm, which generates adversaries that punish model failure in the other direction: overstability, or the inability to detect semantics-altering noise. It does so by generating distractor sentences that only resemble the questions syntactically and appending them to the context paragraphs (a detailed description is included in Sec. 3). When tested on these adversarial examples, Jia and Liang (2017) showed that even the most 'robust' among published models (the Mnemonic Reader (Hu et al., 2017)) only achieved 46.6% F1 (compared to 79.6% F1 on the regular task). Since then, the FusionNet model (Huang et al., 2018) used history-of-word representations and a multi-level attention mechanism to obtain an improved 51.4% F1 score under adversarial evaluation, but that is still a 30% decrease from the model's performance on the regular task. We show, however, that one can make a pre-existing model significantly more robust by simply retraining it with better, higher-variance adversarial training data, and improve it further with minor semantic feature additions to its inputs.
Adversarial Training
It has been shown in the field of image classification that training with adversarial examples produces more robust and error-resistant models (Goodfellow et al., 2015; Kurakin et al., 2017). In the field of Q&A, Jia and Liang (2017) attempted to retrain the BiDAF model (Seo et al., 2017) with data generated by the AddSent algorithm. Despite performing well when evaluated on AddSent, the retrained model suffers a more than 30% decrease in F1 performance when tested on a slightly different adversarial dataset generated by AddSentMod (which differs from AddSent in two superficial ways: using a different set of fake answers and prepending instead of appending the distractor sentence to the context). We show that using AddSent to generate adversarial training data introduces new superficial trends for a model to exploit; instead, we propose the AddSentDiverse algorithm, which generates highly varied data for adversarial training, resulting in more robust models.
3 AddSentDiverse

Our 'AddSentDiverse' algorithm is a modified version of AddSent (Jia and Liang, 2017), aimed at producing good adversarial examples for robust training purposes. For each {context, question, answer} triple, AddSent does the following: (1) several antonym- and named-entity-based semantics-altering perturbations (swapping) are applied to the question; (2) a fake answer is generated that matches the 'type' of the original answer (e.g., Prague → Chicago, etc.); (3) the fake answer and the altered question are combined into a distractor statement based on a set of manually defined rules; (4) errors in grammar are fixed by crowd-workers; (5) the finalized distractor is appended to the end of the context. The specificity of the algorithm creates new superficial cues that a model can learn and use during training and never get punished for: (1) a model can learn that it is unlikely for the last sentence to contain the real answer; (2) a model can learn that the fixed set of fake answers should not be picked. These nullify the effectiveness of the distractors, as the model will learn to simply ignore them. We thus introduce the AddSentDiverse algorithm, which adds two modifications to AddSent that allow for generating higher-variance adversarial examples: we randomize the distractor placement (Sec. 3.1) and we diversify the set of fake answers used (Sec. 3.2); a sketch of both modifications follows below. Lastly, to address the antonym-style semantic perturbations used in AddSent, we show that we need to improve model capabilities by adding indicator features for semantic relationships, but only in tandem with the addition of diverse adversarial data (Sec. 3.3).
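To make these two modifications concrete, below is a minimal Python sketch of the fake-answer sampling and distractor placement. The function names, the answers_by_type index, and the use of Python's random module are our own illustrative assumptions, not the released implementation.

import random
from collections import defaultdict

def build_answer_index(training_answers):
    # Index every answer in the SQuAD training data by its type
    # (e.g., 'PERSON', 'LOCATION'); this plays the role of the set S below.
    index = defaultdict(set)
    for answer, answer_type in training_answers:
        index[answer_type].add(answer)
    return index

def sample_fake_answer(answer, answer_type, answers_by_type, rng=random):
    # Dynamically pick a same-type answer a' != a from the training data,
    # instead of AddSent's single fixed fake answer per type.
    candidates = [a for a in answers_by_type[answer_type] if a != answer]
    return rng.choice(candidates)

def insert_distractor(sentences, distractor, rng=random):
    # Place the distractor at a uniformly random position (any of the
    # n + 1 slots), rather than always appending it to the context.
    pos = rng.randint(0, len(sentences))
    return sentences[:pos] + [distractor] + sentences[pos:]

The distractor sentence itself would still be built from the perturbed question and the sampled fake answer, as in steps (1)-(4) of AddSent.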
3.1 Randomizing Distractor Placement

Given a paragraph P containing n sentences, let X and Y be random variables representing the location of the sentence containing the correct answer, counting from the front and from the back, respectively. Let P' represent the paragraph with the inserted distractor, and X' and Y' represent the updated location of the sentence with the correct answer. (Note that for any fixed n, Y = n − X, but for our purposes it is easier to keep them separate, since the length of the paragraph is also a random variable.) As shown in Fig. 1, their distributions are highly dependent on the strategy used to insert the distractor. During the training done by Jia and Liang (2017), the distractor is always added as the last sentence, creating a very skewed distribution for Y'. This resulted in the model learning to ignore the last sentence, as it was never punished for doing so. This, in turn, caused the retrained model to fail on AddSentMod, where the distractor is inserted at the front instead of the back of the context paragraph (this is shown by our experiments as well). However, Fig. 1 shows that when the distractor is inserted randomly, the distributions of X' and Y' are almost identical to those of X and Y, indicating that the distractors introduce no new correlation between the location of a sentence and its likelihood of containing the correct answer, hence forcing the model to learn to discern them from the real answers by other, deeper means.

Figure 1: Left: Distribution of X and Y for the original SQuAD training set. Middle: Distribution of X' and Y' when the distractor is inserted at the end of the context. Right: Distribution of X' and Y' when the distractor is inserted randomly into the context.

3.2 Diversifying the Fake Answers

To prevent the model from superficially deciding what is a distractor based on certain specific words, we dynamically generate the fake answers instead of using AddSent's pre-defined set. Let S be the set containing all the answers in the SQuAD training data, tagged by their type (e.g., person, location, etc.). For each answer a, we generate the fake answer dynamically by randomly selecting another answer a' ≠ a from S that has the same type as a, as opposed to AddSent (Jia and Liang, 2017), which uses a pre-defined fake answer for each type (e.g., "Chicago" for any location). This creates a much larger set of fake answers, thus decreasing the correlation between any text and its likelihood of being part of a distractor, forcing the model to become more robust.

3.3 Adding Semantic Relationship Features

In the previous sections, we prevented the model from identifying distractors based on superficial clues such as location and fake-answer identity by eliminating these correlations within the training data. But even if we force the model to learn some deeper methods for identifying/discarding the distractors, it has only a limited ability to recognize semantic differences, because its current inputs do not capture crucial aspects of lexical semantics such as antonymy (which were inserted by Jia and Liang (2017) when generating the AddSent adversaries; see Sec. 3). Most current models use pre-trained word embeddings (e.g., GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018)) as input, which are usually computed based on the distributional hypothesis (Harris, 1954) and do not capture lexical semantic relations such as antonymy (Geffet and Dagan, 2005). These shortcomings are reflected by our results in Sec. 4.6, where we see that we cannot resolve all AddSent-style adversaries by diversifying the training data alone.
For the model to be robust to semantics-based (e.g., antonym-style) attacks, it needs extra knowledge of lexical semantic relations. Hence, we augment the input of each word in the question/context with two indicator features indicating the existence of its synonyms and antonyms (using WordNet (Fellbaum, 1998)) in the context/question, allowing the model to use lexical semantics directly instead of learned statistical correlations of the word embeddings.
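As a minimal sketch (assuming NLTK's WordNet interface; the helper names are ours, and the actual feature-extraction pipeline may differ), the two indicator features could be computed as follows:

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def related_lemmas(word):
    # Collect WordNet synonym and antonym lemmas of a word (lowercased).
    synonyms, antonyms = set(), set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().lower())
            antonyms.update(a.name().lower() for a in lemma.antonyms())
    return synonyms, antonyms

def semantic_indicator_features(tokens, other_side_tokens):
    # For each token on one side (question or context), emit two binary
    # features: does the other side contain one of its synonyms / antonyms?
    other = {t.lower() for t in other_side_tokens}
    features = []
    for token in tokens:
        synonyms, antonyms = related_lemmas(token)
        features.append((int(bool(synonyms & other)),
                         int(bool(antonyms & other))))
    return features

These two bits are then appended to each word's input representation alongside its pre-trained embeddings.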
4 Experiments and Results

4.1 Model

We use the architecture and hyperparameters of the strong BiDAF + Self-Attn + ELMo (BSAE) model (Peters et al., 2018), currently (as of January 10, 2018) the third-highest-performing single model on the SQuAD leaderboard (https://rajpurkar.github.io/SQuAD-explorer/).

4.2 Evaluation Datasets

Models are evaluated on the original SQuAD dev set and 4 adversarial datasets: AddSent, the adversarial evaluation set by Jia and Liang (2017), and 3 variations of AddSent: AddSentPrepend, where the distractor is prepended to the context; AddSentRandom, where the distractor is randomly inserted into the context (note that since the distractor is randomly inserted, a model cannot reliably identify/ignore the distractor based on location; thus, high performance on AddSentRandom serves as a better indicator of robustness to semantics-based attacks); and AddSentMod (Jia and Liang, 2017), where a different set of fake answers is used and the distractor is prepended to the context. Experiments measure the soft F1 score, and all of the adversarial evaluations are model-dependent, following the style of AddSent, where multiple adversaries are generated for each example in the evaluation set and the model's worst performance among the variants is recorded.
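Concretely, this model-dependent evaluation can be thought of as the following loop; this is a sketch in which generate_variants, model.predict, and f1 stand in for the actual adversary generator, model interface, and SQuAD F1 metric:

def adversarial_f1(model, example, generate_variants, f1):
    # Score one example by the model's worst F1 over all adversarial
    # variants generated for it, following the AddSent evaluation style.
    scores = []
    for variant in generate_variants(example):
        prediction = model.predict(variant.context, variant.question)
        scores.append(f1(prediction, variant.gold_answers))
    return min(scores)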
4.3 Adversarial Training with AddSentDiverse

In our main experiment, we compare the BSAE model's performance on different test sets when trained with three different training sets: the original SQuAD data (Original-SQuAD), SQuAD data augmented with AddSent-generated adversaries (similar to the adversarial training conducted by Jia and Liang (2017)), and SQuAD data augmented with our AddSentDiverse-generated adversaries. For the latter two, we run the respective adversarial generation algorithms on the training set and add randomly selected adversarial examples such that they make up 20% of the total training data. The results are shown in Table 1. First, as shown, the AddSent-trained model is not able to perform well on test sets where the distractors are not inserted at the end, e.g., the AddSentRandom adversarial test set. On the other hand, it can be seen that retraining with AddSentDiverse boosts the performance of the model significantly across all adversarial datasets, indicating a general increase in robustness. (Regarding the retrained model's 59.03% accuracy on AddSentRandom: in the remaining 40.96% errors, we found that in 77.0% of these errors the model still predicted a span within the randomly inserted distractor, indicating that it has not learned to fully recognize semantic-altering perturbations.)

Table 1: F1 performance of the BSAE model trained and tested on different regular/adversarial datasets. (Columns: Training; Original-SQuAD-Dev; AddSent; AddSentPrepend; AddSentRandom; AddSentMod; Average. Rows: Original-SQuAD, AddSent, and AddSentDiverse training.)
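The 20% augmentation ratio works out as in the following sketch (a hypothetical helper; the exact sampling procedure is our assumption):

import random

def augment_with_adversaries(original, adversarial, frac=0.2, rng=random):
    # Add k adversarial examples so that they form `frac` of the final set:
    # k / (len(original) + k) = frac  =>  k = frac / (1 - frac) * len(original)
    k = round(frac / (1.0 - frac) * len(original))
    return original + rng.sample(adversarial, min(k, len(adversarial)))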
4.4 Effect of Distractor Placement

We also conducted experiments studying the effect of different distractor placement strategies on the trained models' robustness. The BSAE model was trained on 4 variations of the AddSentDiverse-augmented training set, the only difference among them being the location of the distractor within the context: InsFirst, where the distractor is prepended; InsLast, where the distractor is appended; InsMid, where the distractor is inserted in the middle; and InsRandom, where the distractor is randomly placed. The retrained models are tested on AddSent and AddSentPrepend, whose only difference is where the distractor is located. The results are shown in Table 2. It is clear that when trained under InsFirst and InsLast, the model only performs well on test sets created by a similar distractor placement strategy, indicating that it is exploiting superficial trends instead of learning to process the semantics. It is also shown that InsRandom gives optimal performance on both evaluation datasets. Further investigations regarding distractor placement can be found in the appendix.

Table 2: F1 performance of the BSAE model trained on datasets with different distractor placement strategies. (Columns: Training; AddSent; AddSentPrepend; Average. Rows: InsFirst (60.22 F1 on AddSent), InsLast, InsMid, InsRandom.)
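In terms of the insert_distractor sketch from Sec. 3, the four training variants correspond to four choices of insertion slot (the strategy names follow Table 2; the helper itself is our illustration):

import random

def insertion_slot(strategy, n, rng=random):
    # Map a placement strategy to a slot in [0, n] for a context of
    # n sentences; slot 0 prepends, slot n appends.
    if strategy == "InsFirst":
        return 0
    if strategy == "InsLast":
        return n
    if strategy == "InsMid":
        return n // 2
    if strategy == "InsRandom":
        return rng.randint(0, n)
    raise ValueError("unknown strategy: " + strategy)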
4.5 Effect of Fake-Answer Generation

We also conducted experiments studying the effect of training on data containing distractors with dynamically generated fake answers (Dynamic-FakeAns) instead of answers chosen from a predefined set (Fixed-FakeAns). The trained models are tested on AddSentPrepend and AddSentMod, whose only difference is that AddSentMod uses a different set of fake answers. The results are displayed in Table 3. The model trained on Fixed-FakeAns suffers an approximately 3% drop in performance when tested on a dataset with a different set of fake answers, but this gap does not exist for the model retrained on Dynamic-FakeAns.

Table 3: F1 performance of the BSAE model trained on datasets with different answer generation strategies. (Columns: Training; AddSentPrepend; AddSentMod. Rows: Fixed-FakeAns (77.37, 73.65) and Dynamic-FakeAns.)
4.6 Adding Semantic Features to the Model

In Table 1, we see that despite improving performance on adversarial test sets, adversarial training of the BSAE model leads to a 1% decrease in its performance on the original SQuAD task (from 84.65% to 83.49%). Furthermore, there is still a 6.5% gap between its performance on the adversarial datasets and the original SQuAD dev set (76.95% vs. 83.49%). These point to the limitations of adversarial training without any model enhancements, especially for AddSent's antonymy-style semantic perturbations (see details in Sec. 3.3). We thus conducted experiments to test the effectiveness of adding WordNet-based synonymy/antonymy semantic-relation indicators in helping the model better deal with semantics-based adversaries. We added the lexical semantic indicators to the BSAE model to create the BSAE+SA model, and trained and tested it in both the regular and the adversarial setup. Its results, compared to the original BSAE model, are shown in Table 4. We see that, unlike the BSAE model, adversarial training of the BSAE+SA model does not cause a decrease in its performance on the original SQuAD dataset, as the model can now learn lexical semantic relationships instead of statistical correlations. We also see that the BSAE+SA model, when trained in the normal setup, shows very similar performance to the BSAE model across all metrics. This is most likely because, despite having the ability to recognize semantic relations, there are not enough negative examples in the regular SQuAD training set to teach the model how to use these features correctly; this issue is solved via the addition of adversarial examples in adversarial training.

Table 4: Regular and adversarial training with BSAE and BSAE+SA (with synonym/antonym features). (Columns: Model/Training; Original-SQuAD-Dev; AddSent. Rows: BSAE/Reg., BSAE/Adv., BSAE+SA/Reg., BSAE+SA/Adv.)
Finally, we examined the errors of our final adversarially-trained BSAE+SA model on the AddSent dataset and found that, of the 21.09% remaining errors (Table 4), 33.3% (46 cases) of these erroneous predictions occurred within the inserted distractor, and 63.7% (88 cases) occurred on questions that the model got wrong on the original SQuAD dev set (without the inserted distractors). The former errors mainly occur within distractors created with named-entity replacements (which we have not addressed directly in the current paper) or malformed distractors (that in fact do answer the question).
5 Conclusion

We demonstrate that we can overcome model overstability and increase robustness by training on diverse adversarial data that eliminates latent data correlations. We further show that adversarial training is more effective when we jointly add useful semantic-relations knowledge to improve model capabilities. We hope that these robustness methods are generalizable to other insertion-based adversaries for Q&A tasks.
Acknowledgments
We thank the anonymous reviewers for their helpful comments. This work was supported by a Google Faculty Research Award, a Bloomberg Data Science Research Grant, an IBM Faculty Award, and NVidia GPU awards.

References
C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In ACL.

I. Goodfellow, J. Shlens, and C. Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR).

Zellig S. Harris. 1954. Distributional structure. Word.

Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2017. Reinforced mnemonic reader for machine comprehension. CoRR, abs/1705.02798.

Hsin-Yuan Huang, Chenguang Zhu, Yelong Shen, and Weizhu Chen. 2018. FusionNet: Fusing via fully-aware attention with application to machine comprehension. In International Conference on Learning Representations (ICLR).

R. Jia and P. Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Empirical Methods in Natural Language Processing (EMNLP).

A. Kurakin, I. Goodfellow, and S. Bengio. 2017. Adversarial machine learning at scale. In International Conference on Learning Representations (ICLR).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bi-directional attention flow for machine comprehension. In International Conference on Learning Representations (ICLR).

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR).

S. Wang and J. Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. In International Conference on Learning Representations (ICLR).

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In ACL.

A Appendix: Distractor Placement Strategies
This section provides a theoretical framework to predict a model's performance on adversarial test sets when trained on adversarial data generated by a specific distractor-insertion strategy.

Given a paragraph composed of n sentences (with the distractor inserted) P = {s_1, s_2, ..., s_n}, where s_i is the i-th sentence counting from the front, define random variables X and Y to represent the location of the distractor counting from the front and back, respectively. The distributions of X and Y depend on the insertion strategy used to add the distractors; several examples are displayed in Fig. 2.

Figure 2: Distributions of X and Y in adversarially augmented SQuAD training data under different distractor-insertion strategies.

A bidirectional deep learning model, trained in a supervised setting, should be able to jointly learn X and Y. Thus, at test time, when given a paragraph of n sentences, the model can obtain the probability that the sentence s_a is the distractor, P_{s_a}, by computing P(X = a) + P(Y = n − a). Ideally, we want the distribution of P_{s_a} to be uniform, as that means the model is not biased towards discarding any sentence as the distractor based on location. The actual distributions of P_{s_a} under different distractor-insertion strategies are displayed in Fig. 3 for n = 3 and two larger values; we pick these n as they are typical lengths of contexts within the SQuAD dataset (the complete distribution of paragraph lengths in the SQuAD training set is shown in Fig. 4). We see that under random insertion, the distribution is very close to uniform.

Figure 3: Learned distribution of P_{s_a} for different n.

Figure 4: Distribution of the length of paragraphs in the SQuAD training set.

Note that when we aggregate over n and plot P_{s_a}, as shown in Fig. 3, the distributions of P_{s_a} created by inserting in the middle and inserting randomly are very similar, but the distribution for inserting in the middle is skewed against the beginnings and ends of the paragraphs. This explains why, in our experiment studying the effect of distractor placement strategies (see Table 2), InsMid's performance was not skewed towards either AddSent or AddSentPrepend, but was worse on both when compared to InsRandom.

This method of calculating the distribution of P_{s_a} allows us to predict the model's performance when trained on datasets where the distractors are inserted at specific locations. To test this hypothesis, we created two datasets, InsFront-3 and InsFront-6, where the distractors were inserted as the 3rd and 6th sentence from the beginning, respectively, and measured the model's performance when trained on these two datasets. The distributions of P_{s_a} under InsFront-3 and InsFront-6 are shown in Fig. 5, and the results are shown in Table 5.

Figure 5: Distributions of P_{s_a} under InsFront-3 and InsFront-6.

Table 5: F1 performance of the BSAE model trained on datasets with different distractor placement strategies.

Training     AddSent   AddSentPrepend   Average
InsFront-3   75.47     72.79            74.13
InsFront-6   77.73     64.42            71.10
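The P_{s_a} computation described above can be sketched as follows, assuming the empirical distributions of X and Y are estimated by counting distractor positions in the augmented training data (the renormalization is our addition, since P(X = a) + P(Y = n − a) sums to 2 over a full paragraph):

def distractor_location_probs(n, p_front, p_back):
    # p_front[a] = P(X = a), with a = 1..n counting from the front;
    # p_back[y] = P(Y = y), with y = n - a under the convention Y = n - X.
    raw = [p_front.get(a, 0.0) + p_back.get(n - a, 0.0)
           for a in range(1, n + 1)]
    total = sum(raw)
    # Renormalize; a uniform result means no positional bias is learnable.
    return [r / total if total else 1.0 / n for r in raw]

Under end-insertion, all of p_back's mass sits at Y = 0, so P_{s_a} spikes at the last sentence; under random insertion, the result stays close to uniform.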