Model Agnostic Answer Reranking System for Adversarial Question Answering
Sagnik Majumder Chinmoy Samant Greg Durrett
Department of Computer Science
The University of Texas at Austin
{sagnik, chinmoy, gdurrett}@cs.utexas.edu

Abstract
While numerous methods have been proposed as defenses against adversarial examples in question answering (QA), these techniques are often model specific, require retraining of the model, and give only marginal improvements in performance over vanilla models. In this work, we present a simple model-agnostic approach to this problem that can be applied directly to any QA model without any retraining. Our method employs an explicit answer candidate reranking mechanism that scores candidate answers on the basis of their content overlap with the question before making the final prediction. Combined with a strong base QA model, our method outperforms state-of-the-art defense techniques, calling into question how well these techniques are actually doing and how strong these adversarial testbeds are.
Introduction

As reading comprehension datasets (Richardson et al., 2013; Weston et al., 2015; Hermann et al., 2015a; Rajpurkar et al., 2016; Joshi et al., 2017) and models (Sukhbaatar et al., 2015; Seo et al., 2016; Devlin et al., 2019) have advanced, QA research has increasingly focused on out-of-distribution generalization (Khashabi et al., 2020; Talmor and Berant, 2019) and robustness. Jia and Liang (2017) and Wallace et al. (2019) show that appending unrelated distractors to contexts can easily confuse a deep QA model, calling into question the effectiveness of these models. Although these attacks do not necessarily reflect a real-world threat model, they serve as an additional testbed for generalization: models that perform better against such adversaries might be expected to generalize better in other ways, such as on contrastive examples (Gardner et al., 2020).

In this paper, we propose a simple method for adversarial QA that explicitly reranks candidate answers predicted by a QA model according to a notion of content overlap with the question. Specifically, by identifying contexts where more named entities are shared with the question, we can extract answers that are more likely to be correct in adversarial conditions.

The impact of this is two-fold. First, our proposed method is model agnostic in that it can be applied post hoc to any QA model that predicts probabilities of answer spans, without any retraining. Second, and most important, we demonstrate that even this simple named entity based question-answer matching technique can be surprisingly useful. We show that our method outperforms state-of-the-art but more complex adversarial defenses with both BiDAF (Seo et al., 2016) and BERT (Devlin et al., 2019) on two standard adversarial QA datasets (Jia and Liang, 2017; Wallace et al., 2019). The fact that such a straightforward technique works well calls into question how reliable current datasets are for evaluating the actual robustness of QA models.
Related Work

Over the years, various methods have been proposed for robustness in adversarial QA, the most prominent being adversarial training (Wang and Bansal, 2018; Lee et al., 2019; Yang et al., 2019b), data augmentation (Welbl et al., 2020), and posterior regularization (Zhou et al., 2019). Among these, for fairness we compare our method only with techniques that train on clean SQuAD (Wu et al., 2019; Yeh and Chen, 2019). Wu et al. (2019) use a syntax-driven encoder to model the syntactic match between a question and an answer. Yeh and Chen (2019) use a prior approach (Hjelm et al., 2019) to maximize mutual information among contexts, questions, and answers to avoid overfitting to surface cues. In contrast, our technique is more closely related to retrieval-based methods for open-domain QA (Chen et al., 2017; Yang et al., 2019a) and multi-hop QA (Welbl et al., 2018; De Cao et al., 2019): we show that shallow matching can improve the reliability of deep models against adversaries in addition to these more complex settings.

Figure 1: Our model agnostic answer reranking system (MAARS). Given each answer option (right column), we extract named entities and compare them to named entities in the question. The overlap is used as a reranking feature to choose the final answer. The sentence containing the ground truth answer is highlighted in green, the ground truth answer is boxed, and the distractor sentence is highlighted in red.

Methods for (re)ranking candidate passages/answers have often been explored in the context of information retrieval (Severyn and Moschitti, 2015), content-based QA (Kratzwald et al., 2019), and open-domain QA (Wang et al., 2018; Lee et al., 2018). Similar to our approach, these methods also exploit some measure of coverage of the query by the candidate answers or their supporting passages to decide the ranks. However, the main motive behind ranking in such cases is usually to narrow down the area of interest within the text when looking for the answer. By contrast, we use a reranking mechanism that allows our QA model to ignore distractors in adversarial QA and that provides model- and task-agnostic behavior, unlike the commonly used learning-based (re)ranking mechanisms.

In yet another related line of research, Chen et al. (2016) and Kaushik and Lipton (2018) reveal the simplistic nature and certain important shortcomings of popular QA datasets. Chen et al. (2016) conclude that the simple nature of the questions in the CNN/Daily Mail reading comprehension dataset (Hermann et al., 2015b) allows a QA model to perform well by extracting single-sentence relations. Kaushik and Lipton (2018) perform an extensive study with multiple well-known QA benchmarks to show several troubling trends: basic model ablations, such as making the input question-only or passage-only, can beat state-of-the-art performance, and the answers are often localized in the last few lines, even in very long passages, possibly allowing models to achieve very strong performance by learning trivial cues. Although we also question the efficacy of well-known adversarial QA datasets in this work, our core focus is on exposing certain issues specifically with the design of the adversarial distractors rather than with the underlying datasets.

Method

Neural QA models are usually trained in a supervised fashion on labeled examples of contexts, questions, and answers to predict answer spans; we represent these as (s, e) tuples, where s represents the sentence and e the candidate span.
Prior work (Lewis and Fan, 2019; Mudrakarta et al., 2018; Yeh and Chen, 2019; Chen and Durrett, 2019) has noted that this end-to-end paradigm can overfit to superficial biases in the data, causing learning to stop when simple correlations are sufficient for the model to answer a question confidently. By explicitly enforcing content relevance between the predicted answer-containing sentence and the question, we can combat this poor generalization.

Specifically, we explicitly score the candidate sentences by the word-level overlap in named entities common to both the question and a sentence. We refer to our method as the Model Agnostic Answer Reranking System (MAARS).

Figure 1 illustrates the workflow of MAARS. MAARS can be applied to any arbitrary QA model that predicts answer span probabilities. First, we use the base QA model to compute the n best answer spans A = {(s_1, e_1), ..., (s_n, e_n)} for a context-question pair (c, q), where n is a hyperparameter. Any answer span not lying in a single sentence is broken into subspans that lie in separate sentences, and A is updated accordingly.

Next, we extract the set of candidate sentences L from the context containing these n answer spans. For the question and each sentence, we compute a set of named entity chunks using an open-source AllenNLP (Gardner et al., 2017) NER model. We then compute the set of words inside named entity chunks from each candidate sentence, NER(l_k) for all l_k in L, and from the question, NER(q); note that NER(·) refers to a set of words and not a set of named entities. Each candidate sentence l_k is then given a score SC(l_k) = |NER(l_k) ∩ NER(q)|, and the answer spans are reranked according to the scores of the sentences containing them. In the case of ties, or if there are multiple spans in the same candidate sentence, the spans are reranked among themselves according to the original ordering given by the QA model. Finally, the span with the highest rank after reranking is chosen as the final answer.

Compared to the base QA model, this approach relies only on an additional NER model and can be applied without any retraining of the base model. Note that the architecture does not depend on any specific tagger, and other content matching models, such as word matching, could also be used in the system.
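To make the reranking step concrete, below is a minimal sketch of the procedure in Python. The candidate format, the `maars_rerank` name, and the injected `ner_words` callable are our own illustrative choices, not the authors' released code; as described above, the named entity chunks come from an AllenNLP NER model in the paper's setup.

```python
from typing import Callable, List, Set, Tuple

# A candidate is (sentence_index, span_text), mirroring the paper's (s, e)
# tuples. Candidates are assumed to arrive ordered by the base QA model's
# span probabilities, with multi-sentence spans already split into
# single-sentence subspans as described above.
Candidate = Tuple[int, str]

def maars_rerank(question: str,
                 sentences: List[str],
                 candidates: List[Candidate],
                 ner_words: Callable[[str], Set[str]]) -> str:
    """Pick the final answer by named-entity word overlap with the question."""
    q_ner = ner_words(question)
    # SC(l_k) = |NER(l_k) & NER(q)| for each candidate sentence l_k.
    scores = {idx: len(ner_words(sentences[idx]) & q_ner)
              for idx, _ in candidates}
    # sorted() is stable, so ties and multiple spans in the same sentence
    # keep the base model's original ordering, as in the paper.
    reranked = sorted(candidates, key=lambda cand: -scores[cand[0]])
    return reranked[0][1]
```

With the paper's configuration, `candidates` would hold the base model's ten most probable single-sentence spans; the fallback to the base model's ordering on ties comes for free from the stable sort.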
Table 1: AddSent and AddOneSent results with BERT-S. MAARS outperforms the vanilla and baseline models on adversarial data, but its performance drops slightly on the original data due to constrained reranking of answers.

Experimental Setup

We evaluate MAARS on two well-known adversarial QA datasets built on top of SQuAD v1.1: Adversarial SQuAD (Jia and Liang, 2017) and Universal Adversarial Triggers (Wallace et al., 2019).
Model           Original    Adv.        Mean
BiDAF           72.4/62.4   21.4/16.0   49.9/42.0
BiDAF + SLN     72.3/62.4   22.8/17.2   50.5/42.5
BiDAF + MAARS
Table 2: AddSent results with BiDAF. Here, MAARS beats the vanilla and baseline models across all metrics.

For brevity, we do not include the adversarial distraction generation process for either of the datasets and point the interested reader to the original papers for exact details. For Adversarial SQuAD, we test MAARS with both BiDAF and BERT and compare against state-of-the-art baselines on the adversary types used in the original papers. To the best of our knowledge, there is no pre-existing literature that proposes a defense technique for Universal Triggers. We also find that it fails to degrade the performance of our vanilla BERT model, probably because the attacks were originally generated for BiDAF. Thus, we only evaluate on this dataset in the BiDAF setting, using all four triggers:
Who, When, Where, and Why.

For BiDAF, we compare MAARS against the Syntactic Leveraging Network (SLN) of Wu et al. (2019) on AddSent. SLN encodes predicate-argument structures from the context and question, a structure matching approach conceptually similar to MAARS but trained end-to-end with many more parameters. For BERT, we benchmark MAARS against QAInfoMax (Yeh and Chen, 2019) on AddSent and AddOneSent. In addition to the standard loss for training QA models, QAInfoMax adds a loss that maximizes the mutual information between the learned representations of words in context and their neighborhood, and also between those of the answer spans and the question.
Implementation details.
We use the uncased base (single) pretrained BERT from HuggingFace (Wolf et al., 2019) and finetune it on SQuAD (Rajpurkar et al., 2016) v1.1 for 2 epochs, using the Adam with weight decay (Loshchilov and Hutter, 2019) optimizer, for both vanilla BERT and BERT + QAInfoMax. We set the training batch size to 5 and the proportion of linear learning rate warmup for the optimizer to 10%.

Adv. type    BiDAF        BiDAF + MAARS
Who          74.4/67.3
When         80.1/75.5
Where        63.5/52.8
Why
Table 3: Results on Universal Triggers with BiDAF (BERT-specific triggers are unavailable publicly). MAARS is better than the vanilla model for most adversaries, but with smaller performance gains than on Adversarial SQuAD.

Our BiDAF (Seo et al., 2016) model has a hidden state of size 100 and takes 100-dimensional GloVe (Pennington et al., 2014) embeddings as input. For character-level embedding, it uses 100 one-dimensional convolutional filters, each with a width of 5. A uniform dropout (Srivastava et al., 2014) of 0.2 is applied at the CNN layer for character embedding, at all LSTM (Hochreiter and Schmidhuber, 1997) layers, and at the layer before the logits. We train it with AdaDelta (Zeiler, 2012) and an initial learning rate of 0.5 for 50 epochs. We set the training batch size to 128. For the Syntactic Leveraging Network, we follow the exact hyperparameter settings of Wu et al. (2019).

Other hyperparameters common to both BERT and BiDAF include an input sequence length of 400, a maximum query length of 64, and 40 predicted answer spans per context-question pair. For NER tagging, we use an ELMo-based implementation from AllenNLP (Gardner et al., 2017) that has been finetuned on CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003). Finally, we set the value of n (the number of candidates considered for reranking) in MAARS to 10 across all our experiments.
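As a rough sketch of the BERT finetuning configuration above (not the authors' actual training script), the optimizer and warmup schedule could be set up with PyTorch and HuggingFace Transformers as follows; the learning rate value and the step count are our own placeholder assumptions.

```python
import torch
from transformers import BertForQuestionAnswering, get_linear_schedule_with_warmup

model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# Placeholder values: the paper trains for 2 epochs with batch size 5 on
# SQuAD v1.1; the learning rate and the resulting step count are assumptions.
learning_rate = 3e-5
num_training_steps = 2 * (87_599 // 5)  # 2 epochs over ~87.6k SQuAD train examples

# Adam with decoupled weight decay (Loshchilov and Hutter, 2019).
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# 10% of the steps are used for linear learning-rate warmup, as described above.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)
```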
In all our results tables, we report the macro-averaged F1 and exact match (EM) scores, separated by a slash in each cell. In Tables 1 and 2, Original and Adversarial (Adv.) refer to a model's performance on only clean and only adversarial data, respectively. Mean denotes the weighted mean of the Original and Adversarial scores, weighted by the respective number of samples in the dataset. Both AddSent and AddOneSent have 1000 clean and 787 adversarial instances.
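For reference, here is a simplified sketch of how these numbers are computed and combined. It omits SQuAD's full answer normalization (the official script also strips articles and punctuation), and the weighting mirrors the instance counts above:

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the prediction matches the gold answer string, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def f1(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers (SQuAD-style)."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def mean_column(original: float, adversarial: float,
                n_clean: int = 1000, n_adv: int = 787) -> float:
    """Mean = subset-size-weighted average of Original and Adversarial scores."""
    return (n_clean * original + n_adv * adversarial) / (n_clean + n_adv)
```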
Results

Adversarial SQuAD.
Table 1 shows the results with BERT-single (-S) on AddSent and AddOneSent. MAARS outperforms both the vanilla model and QAInfoMax on both the Adversarial and Mean metrics. The performance gains are also substantial, especially on Adversarial, where MAARS improves F1 over QAInfoMax by about 20 points on AddSent and 16 points on AddOneSent. This clearly shows that our method is much more capable of avoiding distractors in the data and is a much stronger defense technique in this setting. For both QAInfoMax and MAARS there is a drop in performance on clean data, but the drop for MAARS is larger. This drop arises naturally from the simplicity of the heuristic: matching words in named entities with the question sometimes assigns a higher score to a candidate sentence that has higher named entity overlap with the question but does not contain the right answer. One such example, where MAARS fails to pick the correct top candidate after reranking, is shown in Figure 2a.

Table 2 details the results with BiDAF on AddSent. Here, we also see significant performance gains over the vanilla model and the SLN baseline. MAARS increases adversarial F1 by 24 points over vanilla BiDAF and by about 22 points over BiDAF + SLN. Interestingly, the performance on clean data does not drop as it does in the case of BERT. This difference may be a result of BiDAF itself relying more on surface word matching, leading to a closer alignment between its predictions and the reranker's choices. However, note that our simple heuristic still performs well even with a complex model like BERT.
Discussion.
Overall, our results on this dataset look promising for both BERT and BiDAF despite our method's inherent simplicity. This raises two questions. First, how effective is the Adversarial SQuAD dataset as a testbed for adversarial attacks? When a simple method can achieve large gains, we cannot be sure that more complex methods are truly working as advertised rather than learning such heuristics. Second, how effective are these current defenses? They underperform a simple heuristic in this setting; however, because the full breadth of possible adversarial settings has not been explored, it is hard to get a holistic sense of which methods are effective. Additional settings are needed to fully contrast these techniques.
Universal Adversarial Triggers.
Figure 2: Common failure cases for MAARS: (a) wrong top candidate picked; (b) lack of attention to question type; (c) multiple similar spans co-occur. The distractor sentence is highlighted in red, the predicted answer is underlined, and the ground truth answer is boxed.

We create a dataset that has purely adversarial instances using the open-source code from Wallace et al. (2019) and present the results in Table 3. (BiDAF + MAARS gives a similar F1/EM trend on AddOneSent; Adversarial: 46.1/38.5, Mean: 60.8/52.1.) In particular, we append the following distractors for the different adversary types. The target answers in the distractors have been bolded.
• Who: how ] ] there donald trump ; who who did
• When: ; its time about january 2014 when may did british
• Where: ; : ' where new york may area where they
• Why: why how ; known because : to kill american people .

Due to the unavailability of prior work on trigger-specific defenses and of BERT-specific triggers, we report only vanilla BiDAF and BiDAF with MAARS. For Why, F1 drops by a small amount (0.3 points) from BiDAF to BiDAF with MAARS, while the EM score does not change at all. The scores improve by around 1-2 points for the other adversary types. However, the gains are much lower than on Adversarial SQuAD. These results indicate the promise of simple defenses, but more exhaustive evaluation of defenses on different types of attacks is needed to draw a more complete picture of the methods' generalization abilities.
Error analysis.

Besides the instances where the primary error source is picking a wrong top candidate (refer to Figure 2a), we notice two other common failure types with MAARS. One directly stems from MAARS' inability to attend to the question type during reranking. In Figure 2b, the question word is How, but MAARS picks Scottish devolution referendum, which is not the appropriate type of answer here. The other type of failure occurs when multiple similar span types are present in the same candidate, creating ambiguity for the base QA model. In the example shown in Figure 2c, the QA model fails to distinguish between the two spans and retrieve specific information about the US. Better base QA models may resolve these issues, or a more powerful reranker could be used. However, rerankers learned end-to-end would suffer from the same issues as BERT and require additional engineering to avoid overfitting the training data.
Conclusion

In this work, we introduce a simple and model-agnostic post hoc technique for adversarial question answering (QA) that predicts the final answer after reranking candidate answers from a generic QA model according to their overlap in relevant content with the question. Our results show the potential of our method through large performance gains over vanilla models and state-of-the-art methods. We also analyze common failure points of our method. Finally, we reiterate that our main contribution is not the heuristic defense itself but rather its ability to paint a more complete picture of the current state of affairs in adversarial QA. We seek to illustrate that our current adversaries are not strong and generic enough to attack a wide variety of QA methods, and that we need a broader evaluation of our defenses to meaningfully gauge our progress in adversarial QA research.
References
Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany. Association for Computational Linguistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4026–4032, Minneapolis, Minnesota. Association for Computational Linguistics.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306–2317, Minneapolis, Minnesota. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020. Evaluating NLP models via contrast sets. arXiv preprint arXiv:2004.02709.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015a. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015b. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28, pages 1693–1701. Curran Associates, Inc.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015, Brussels, Belgium. Association for Computational Linguistics.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. arXiv preprint arXiv:2005.00700.

Bernhard Kratzwald, Anna Eigenmann, and Stefan Feuerriegel. 2019. RankQA: Neural question answering with answer re-ranking. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6076–6085, Florence, Italy. Association for Computational Linguistics.

Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, and Jaewoo Kang. 2018. Ranking paragraphs for improving answer recall in open-domain question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 565–569, Brussels, Belgium. Association for Computational Linguistics.

Seanie Lee, Donggyu Kim, and Jangwon Park. 2019. Domain-agnostic question-answering with adversarial training. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics.

Mike Lewis and Angela Fan. 2019. Generative question answering: Learning to answer the whole question. In International Conference on Learning Representations.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1896–1906, Melbourne, Australia. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Association for Computational Linguistics.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.

Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, pages 373–382, New York, NY, USA. Association for Computing Machinery.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc.

Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4911–4921, Florence, Italy. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics.

Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2018. Evidence aggregation for answer re-ranking in open-domain question answering. In International Conference on Learning Representations.

Yicheng Wang and Mohit Bansal. 2018. Robust machine comprehension models via adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 575–581, New Orleans, Louisiana. Association for Computational Linguistics.

Johannes Welbl, Pasquale Minervini, Max Bartolo, Pontus Stenetorp, and Sebastian Riedel. 2020. Undersensitivity in neural reading comprehension. arXiv preprint arXiv:2003.04808.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Bowen Wu, Haoyang Huang, Zongsheng Wang, Qihang Feng, Jingsong Yu, and Baoxun Wang. 2019. Improving the robustness of deep reading comprehension models by leveraging syntax prior. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 53–57, Hong Kong, China. Association for Computational Linguistics.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019a. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota. Association for Computational Linguistics.

Ziqing Yang, Yiming Cui, Wanxiang Che, Ting Liu, Shijin Wang, and Guoping Hu. 2019b. Improving machine reading comprehension via adversarial training. arXiv preprint arXiv:1911.03614.

Yi-Ting Yeh and Yun-Nung Chen. 2019. QAInfomax: Learning robust question answering system by mutual information maximization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3370–3375, Hong Kong, China. Association for Computational Linguistics.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.

Mantong Zhou, Minlie Huang, and Xiaoyan Zhu. 2019. Robust reading comprehension with linguistic constraints via posterior regularization. arXiv preprint arXiv:1911.06948.