Mischief: A Simple Black-Box Attack Against Transformer Architectures
Adrian de Wynter
Amazon Alexa [email protected]
Abstract
We introduce Mischief, a simple and lightweight method to produce a class of human-readable, realistic adversarial examples for language models. We perform exhaustive experimentation of our algorithm on four transformer-based architectures, across a variety of downstream tasks, as well as under varying concentrations of said examples. Our findings show that the presence of Mischief-generated adversarial samples in the test set significantly degrades (by up to 20%) the performance of these models with respect to their reported baselines. Nonetheless, we also demonstrate that, by including similar examples in the training set, it is possible to restore the baseline scores on the adversarial test set. Moreover, for certain tasks, the models trained with the Mischief set show a modest increase in performance with respect to their original, non-adversarial baseline.
1 Introduction

An adversarial attack on deep learning systems, as introduced by Szegedy et al. (2014) and Goodfellow et al. (2015), consists of any input designed explicitly to cause poor performance on a model. Such attacks are traditionally split into two major categories: white-box and black-box. In the former, the adversarial inputs are found (or rather, learned) through a perturbation of the gradient. The latter, on the other hand, assumes that there is no access to the model's gradient, and thus said adversarial examples are often found by trial and error.

In computer vision, such attacks typically involve injecting learned noise into small areas of the input image. This noise is unnoticeable to a human user, but just complex enough to cause the network to fail to perform as expected. In contrast, for text-based systems, this noticeability-versus-failure tradeoff is not as clear. Since machine learning-based language models embed an input string into a vector space for further processing in other tasks, from a feasibility point of view it is more realistic to determine which changes in the string, and not the vector, are more likely to negatively affect the model.

In an attempt to make our work more readily applicable to existing systems, we concentrate solely on such black-box attacks. Moreover, we focus mostly on architectures that leverage the transformer layer from Vaswani et al. (2017), such as BERT (Devlin et al., 2018), as their high performance in multiple language modeling tasks makes them ubiquitous in both research and production pipelines.

In this work we present Mischief, a procedure that generates a class of such adversarial examples. In order to remain within our constraints, Mischief leverages a well-known phenomenon from psycholinguistics first described by Rawlinson (1976). We characterize the impact of our algorithm on the performance of four selected transformer-based architectures by carrying out exhaustive experiments across a variety of tasks and concentrations of adversarial examples.

Our experimentation shows that the presence of Mischief-generated examples is able to significantly degrade the performance of the language models evaluated. However, we also demonstrate that, at least for the architectures evaluated, including Mischief-generated examples in the training process allows the models to regain, and sometimes exceed, their baseline performance in a variety of downstream tasks.

2 Related Work
Adversarial attacks in the context of learning theory were perhaps first described by Kearns and Li (1993). However, such examples per se predate machine learning (e.g., techniques to circumvent spam filters) by a wide margin (Ollman, 2007). On the other hand, the study of adversarial attacks on deep neural networks, albeit relatively recent, fields a large number of important contributions in addition to the ones mentioned in the introduction, and it is hard to name them all. An excellent introduction to this topic, along with historical notes, can be found in Biggio and Roli (2018). In the context of Natural Language Understanding (NLU), the work by Jia and Liang (2017) was arguably the first where such notions were formally applied to the intersection of language modeling and deep learning. Moreover, the well-known research by Ebrahimi et al. (2018a), Belinkov and Bisk (2018), and Minervini and Riedel (2018) showed that the large majority of existing language models are extremely vulnerable to both black-box and white-box attacks. Indeed, the Mischief algorithm is similar to that of Belinkov and Bisk (2018), with variations on noise levels, and applied over a wider range of natural language tasks. Nonetheless, the large majority of the procedures presented in such papers were often considered to be unrealistic (Zhang et al., 2019), and it was not until Pruthi et al. (2019) and Ebrahimi et al. (2018b) that more practical attack and defense mechanisms were introduced. This paper is more closely aligned to theirs, although it differs in key aspects regarding contributions and methodology. Regardless, our work is meant to add to this body of research, with a specific focus on black-box, sentence-level attacks for transformer architectures. The interested reader can find a more comprehensive compendium of the history of adversarial attacks for NLU in the survey by Zhang et al. (2019).

We elaborate on a few studies of the research first reported by Graham Rawlinson (Rawlinson, 2007) in Section 3. In addition to these papers, it is important to point out that there is a sizable body of work around this phenomenon. For example, Perea and Lupker (2003) and Perea and Lupker (2004) expanded upon said research by exploring other types of permutations, while Gomez et al. (2008) and Norris (2006) attempted to explain the phenomenon from a statistical perspective. An analysis of some of the leading theories around letter-position encoding can be found in the articles by Davis and Bowers (2006) and Whitney (2008). Finally, a compilation of the works around this effect can be found in Davis (2003).
Graham Rawlinson described in his doctoral thesis (1976) a phenomenon where permuting the middle characters in a word, but leaving the first and last intact, had little to no effect on the ability of a human reader to understand it. It was shown in a few other studies that said permutation does tend to slow down readability (Rayner et al., 2006), and that the type of permutation (i.e., the position of the permuted substring) is relevant to the comprehension of the text (Schoonbaert and Grainger, 2004), as is any added context (Pelli et al., 2003).

It could be argued that the act of shuffling the characters in an input word will have a naturally detrimental effect on any language model. Most models rely on a tokenizer and a vocabulary to parse the input string; thus, the presence of an adversarial example as an input to a pretrained model implies that the input will very likely be mapped to a low-information, or even incorrect, vocabulary element. On the other hand, the attention mechanism that lies at the heart of the transformer architecture does not have a concept of word order (Vaswani et al., 2017), and relies on statistical methods to learn syntax (Peters et al., 2018). It has also been shown to prefer, in some architectures, certain specific tokens and other coreferent objects (Clark et al., 2019; Kovaleva et al., 2019). This suggests that, although models relying on these artifacts may be resilient to slight changes in the input, the right concentration of permuted words may lead to a degradation in performance, all while remaining understandable by a human reader.

Note that we do not alter the order of the words in the sentence, as that may risk losing a significant amount of semantic and lexical information, and the result would no longer be considered a practical adversarial attack.
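To make the vocabulary-mapping point above concrete, the following minimal sketch (illustrative only; it assumes the Hugging Face transformers package and the bert-base-cased checkpoint, neither of which is prescribed by this paper) compares the subword segmentation of an intact word with that of a word whose middle characters have been permuted:

# Illustrative sketch: compare subword segmentation of an intact word
# against a Rawlinson-style permutation of its middle characters.
# Assumes the Hugging Face `transformers` package is installed; the
# checkpoint name is an example, not a requirement of the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

for word in ["understanding", "udnertsnaidng"]:  # second is a middle-character permutation
    print(word, "->", tokenizer.tokenize(word))

# The intact word is typically a single vocabulary entry, while the permuted
# word tends to be split into several rarer subword pieces, i.e., a
# lower-information representation of the same surface form.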
3.2 The Mischief Algorithm

We define the Generalized Rawlinson adversarial example (GRA) as a permutation σ on a word w = w_1 w_2 ... w_n, where n > 3, such that GRA(w) = w_1, σ(w_2), ..., σ(w_{n-1}), w_n. The algorithm that we use to generate such examples, which we call Mischief, is a function that acts on a text corpus and takes in two parameters p, r ∈ [0, 1]. Here, p denotes the proportion of the dataset to perform "Mischief" on, and r is the probability of a word w in a given line being randomized; that is, of performing GRA(w). An implementation of Mischief can be seen in Algorithm 1.

Algorithm 1: Simplified Mischief. The + operator denotes string concatenation.
  Input: dataset D, proportion p, probability r
  for sentence s ∈ D do
    Draw probability π_s ∼ P
    if π_s ≤ p then
      for word w ∈ s do
        Draw probability π_r ∼ P
        if π_r ≤ r ∧ |w| > 3 then
          w' = w_1 + σ(w_2 ... w_{n-1}) + w_n   (that is, w' = GRA(w))
          Replace w ∈ s with w'
        end if
      end for
    end if
  end for
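For concreteness, the following is a minimal Python sketch of Algorithm 1. This is an illustrative reimplementation under the assumptions above, not the author's released code; the function and variable names are our own, and splitting sentences on whitespace is a simplification:

import random

def gra(word: str, rng: random.Random) -> str:
    """Generalized Rawlinson adversarial example: keep the first and last
    characters, and randomly permute everything in between."""
    if len(word) <= 3:          # nothing to permute for very short words
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def mischief(dataset, p: float, r: float, seed: int = 0):
    """Apply GRA to each word of a sentence with probability r, for a
    proportion p of the sentences in the dataset (cf. Algorithm 1)."""
    rng = random.Random(seed)
    out = []
    for sentence in dataset:
        if rng.random() <= p:
            words = [gra(w, rng) if rng.random() <= r else w
                     for w in sentence.split()]
            out.append(" ".join(words))
        else:
            out.append(sentence)
    return out

# Example usage:
# mischief(["the transformer architecture ignores word order"], p=1.0, r=0.75)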
We evaluate our approach with four transformer-based models: BERT (large, cased) (Devlin et al., 2018), RoBERTa (large) (Liu et al., 2019), XLM (2048-en) (Lample and Conneau, 2019), and XLNet (large) (Yang et al., 2019). All of them were selected due to their high performance on the General Language Understanding Evaluation (GLUE) benchmark, which is a set of ten distinct NLU tasks designed to showcase the candidate's ability to model and generalize language (Wang et al., 2018).

For every model and task, we apply four different concentrations r = {25%, 50%, 75%, 100%} of Mischief on the training set; additionally, for each of these concentrations we test the different combinations of Mischief/no-Mischief on the training and test sets, yielding a large grid of distinct experiments. Due to the large complexity of the task, we maintain p = 1 across all experiments. A summary of our findings and general experimental setup is described in Table 1.

                        | Test: Mischief                                               | Test: No Mischief
Training: No Mischief   | An adversarial attack: significant performance degradation.  | Baseline from the GLUE benchmarks.
Training: Mischief      | Proposed defense: minimal performance degradation.           | Minimal degradation, or increased performance.

Table 1: Summary of our findings for all experiments, models, and variations.
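For illustration, the grid implied by Table 1, crossed with the concentrations of r and the ten random seeds described below, can be enumerated as in the following hypothetical sketch; run_experiment is a placeholder name, not the actual experiment harness, and the model labels simply mirror the model descriptions above:

import itertools

# Hypothetical enumeration of the experimental grid suggested by Table 1.
models = ["bert-large-cased", "roberta-large", "xlm-2048-en", "xlnet-large"]
concentrations = [0.25, 0.50, 0.75, 1.00]                      # values of r
train_test = list(itertools.product([False, True], repeat=2))  # (Mischief on train, Mischief on test)
seeds = range(10)

for model, r, (mischief_train, mischief_test), seed in itertools.product(
        models, concentrations, train_test, seeds):
    pass  # run_experiment(model, r, mischief_train, mischief_test, seed)  # placeholder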
In order to obtain the performance of a model on a test set, an experimenter must first upload the raw predictions to the GLUE website (https://gluebenchmark.com/leaderboard). Due to the number of experiments we performed, along with our need to modify the test sets, we opted not to evaluate every result on the website. Instead, we treated the provided validation set as a test set, and generated a small validation set from a held-out split of the training set.

It is well known that the language models we tested are highly sensitive to initialization. Given that in most tasks we observed significant variation of results across multiple experiments, we report the average result over ten random seeds, which increases the total number of experiments tenfold. However, as done in Kovaleva et al. (2019), we opt not to report the CoLA or WNLI benchmarks, as their small training set size made them remarkably sensitive to variations in the experimentation, and their inclusion could bias our summary results for the following sections.

Our first set of experiments, corresponding to the first column of Table 1, involved exploring the effects of a Mischief-generated adversarial test set, as well as a simple defense scheme. To simulate an adversarial attack, in our first setup we fine-tuned the models on each GLUE task as described in their original papers, and evaluated them on test sets with Mischief. Then, we simulated a simple defense by applying Mischief to the training sets, and subsequently fine-tuning and evaluating the models. The results can be seen in Figure 1.

Figure 1: Resulting scores under an adversarial test set, for two situations: the adversarial setting (no Mischief on the training set, in red), and our proposed defense (with Mischief on the training set, in blue). The results are averaged across all values of r, and measured in terms of F1, Pearson correlation (p in the plot), Spearman-ρ (s in the plot), and accuracy (not indicated). Note the high variance in RTE, which we attribute to the small size of the dataset.

We found that models trained without Mischief are vulnerable to adversarial attacks, with performance drops averaging almost 20% in the case where r = 25%. However, such degradation can easily be recovered by training with Mischief, along with minor hyperparameter tuning to compensate for the variations in the new training set. A plot of the mean task degradation observed by varying r can be seen in Figure 2. We conjecture that the "dip" at r = 25% can be explained by the fact that this concentration of GRA examples is enough to degrade the performance of the tested models, but not sufficient to allow for learning.

Figure 2: Average (per-task) performance degradation, for varying proportions of r. On average, applying Mischief to the training set is an effective defense against this class of adversarial attacks. Note also how the architectures have a consistent ordering in their performance degradation, regardless of r.

Our second set of experiments involves evaluating the performance on the unmodified test set, after training the models with Mischief. The results for every proportion r can be seen in Figure 3. We observed increased performance on various tasks. However, it does appear that the size of the dataset, as well as the objective of the task, have an important influence on whether Mischief-trained models can show such a performance increase. This is to be expected, as certain tasks rely more heavily on "masking" certain tokens.
For example, MRPC (the Microsoft Research Paraphrase Corpus, by Dolan and Brockett (2005)) is, as its name indicates, a classification task where the model must determine whether two sentences p, q are paraphrases of one another. A paraphrase of p would normally retain most of the semantic content while altering the lexical relations as much as possible, and so Mischief clearly allows for a more fine-grained data expansion at the tokenizer level. It is also important to point out that MRPC is a relatively large dataset of several thousand sentence pairs.

On the other hand, some other tasks would actually be harmed by the unintentional "masking" induced by Mischief. As an example, RTE (Recognizing Textual Entailment) is a dataset merged by Wang et al. (2018) from the corpora by Dagan et al. (2005), Bar-Haim et al. (2006), Giampiccolo et al. (2007), and Bentivogli et al. (2009). Its objective is to determine whether a pair of sentences p, q have the relation p ⟹ q. It could be argued that such a task cannot benefit from Mischief, as it would lose critical lexical information and simply obfuscate the dataset further. However, MNLI (the Multi-Genre NLI corpus, by Williams et al. (2018)) also involves textual entailment, but it is significantly larger than RTE: the latter is the smallest dataset presented in this paper, with only a few thousand examples in total, while the former is orders of magnitude larger, with several hundred thousand sentence pairs.

Figure 3: Resulting performance change, across all tasks, for every r. In blue we report the best performance of a model trained with Mischief, and evaluated on its original test set. In red, we report the baseline. In general, large corpora consistently benefit from Mischief-based training.

Mischief as an adversarial attack is remarkably effective, although its ability to degrade the performance of a language model is, fortunately, easily lost once the model has been exposed to other GRA samples. We hypothesize that this is, as mentioned in Section 3.1, due to the way these models construct their vocabulary. The models tested employ a byte pair encoding (BPE) preprocessing step (Gage, 1994), which segments subwords iteratively and stores them in the vocabulary based on their frequency. It follows that any model trained on Mischief-generated samples will become more robust to the perturbations induced by this algorithm. Moreover, the models tested have large parameter sizes, which translates into a much stronger ability to memorize, and thus be resilient to, new input examples.

This can also help partially explain the results observed in Section 4.3: let w_i, w_j be two words occurring in different parts of the dataset, where w_i = w_j and |w_i| = |w_j| := n. For n ≥ 4, and assuming a uniform distribution over the permutations of the (distinct) middle characters, the probability that these two words are transformed the same way is

    Pr[GRA(w_i) = GRA(w_j)] = 1 / (n − 2)!    (1)

which in turn means that Mischief is effectively a data augmentation technique: the average English word is about five characters long, so identical words are usually perturbed in different ways. Although this number naturally varies with the corpus being utilized, ultimately the models tested are exposed to a wider variety of slight perturbations of the inputs, which in turn allows them to focus better on the linguistic relations between the tokens.
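As a quick sanity check of Equation 1 as reconstructed here, the collision probability can be estimated empirically. The snippet below is an illustrative sketch, not part of the original experiments, and assumes a word with distinct middle characters:

import math
import random

def gra(word: str, rng: random.Random) -> str:
    # Keep first and last characters, permute the middle uniformly at random.
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

rng = random.Random(0)
word = "miscf"  # n = 5, with distinct middle characters
trials = 100_000
collisions = sum(gra(word, rng) == gra(word, rng) for _ in range(trials))

print("empirical:", collisions / trials)                     # close to 1/6 for n = 5
print("equation (1):", 1 / math.factorial(len(word) - 2))    # = 1/3! ≈ 0.167

With an average word length of about five characters, two occurrences of the same word are scrambled identically only about one time in six, which is consistent with the data-augmentation reading above.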
However, Equation 1 does not account for the fact that some tasks do not benefit from Mischief. For example, the QQP (Quora Question Pairs) dataset attempts to relate a pair of questions semantically, and Mischief-based models consistently underperformed despite the fact that this corpus has several hundred thousand lines. Given the scores in STS-B (Cer et al., 2017) and SST-2 (Socher et al., 2013), it appears that, generally speaking, tasks where semantic similarity is the primary measurement are more likely to be impacted negatively. There were some exceptions to the rule, however, as some models did outperform their baseline: for example, BERT in STS-B for r = 100%, and XLNet in SST-2 for r = 75% and r = 100%.

We presented Mischief, a simple algorithm that allows us to construct a class of human-readable adversarial examples, and showed that the injection of such examples in the dataset is capable of significantly degrading the performance of transformer-based models. Such models can be made resistant to Mischief-based attacks simply by training with similar examples, and without relying on other components (e.g., a spell-checker).

However, Mischief also has value as a data augmentation technique, as we saw that certain NLU tasks benefit from the inclusion of such examples. It is important to point out that, in general, adversarial attacks are architecture-independent (Szegedy et al., 2014). Although we attempted to provide an in-depth analysis of select transformer-based architectures, it remains an open problem whether the results of this paper are applicable to other families of models. We conjecture that, as long as their tokenizer operates in a similar fashion to the WordPiece tokenizer from Schuster and Nakajima (2012), and their parameter size is large enough, the effects from this study extend to them. In the case of smaller-capacity models, or other word-segmentation techniques where out-of-vocabulary words are frequently mapped to the same token, the outcome of a Mischief-based attack can only be more detrimental.

Finally, one area we did not pursue in this paper is synonym injection. We argue that synonym injection is arguably far more impactful in terms of supplying strong adversarial examples, and a Mischief-based approach to training with such examples may also increase performance in the tasks where Mischief did not show an improvement. However, given how sensitive the meaning of a word (let alone its synonyms) is to context, such a process cannot be done in an automated fashion and without expert knowledge being invested.
Acknowledgments
The author is grateful to B. d'Iverno, Y. Ibrahim, V. Khare, A. Mottini, and Q. Wang for their helpful comments and suggestions throughout this project, and to the anonymous reviewers whose comments greatly improved this work.
References
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, and Danilo Giampiccolo. 2006. The second PASCAL recognising textual entailment challenge. Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.
Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.
Battista Biggio and Fabio Roli. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 2154–2156, New York, NY, USA. Association for Computing Machinery.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, August. Association for Computational Linguistics.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy, August. Association for Computational Linguistics.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, pages 177–190, Berlin, Heidelberg. Springer-Verlag.
Colin J. Davis and Jeffrey S. Bowers. 2006. Contrasting five different theories of letter position coding: evidence from orthographic similarity effects. Journal of Experimental Psychology: Human Perception and Performance.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005), January.
Javid Ebrahimi, Daniel Lowd, and Dejing Dou. 2018a. On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 653–663, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018b. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, July.
Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, February.
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague, June. Association for Computational Linguistics.
Pablo Gomez, Roger Ratcliff, and Manuel Perea. 2008. The overlap model: A model of letter position coding. Psychological Review, 115(3):577.
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Michael Kearns and Ming Li. 1993. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China, November. Association for Computational Linguistics.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Pasquale Minervini and Sebastian Riedel. 2018. Adversarially regularising neural NLI models to integrate logical background knowledge. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 65–74, Brussels, Belgium, October. Association for Computational Linguistics.
Dennis Norris. 2006. The Bayesian reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113(2):327.
Gunter Ollman. 2007. The phishing guide. IBM Internet Security Systems, page 3.
Denis G. Pelli, Bart Farell, and Deborah C. Moore. 2003. The remarkable inefficiency of word recognition. Nature, 423:752–756.
Manuel Perea and Stephen J. Lupker. 2003. Does jugde activate COURT? Transposed-letter similarity effects in masked associative priming. Memory & Cognition, 31(6):829–841, September.
M. Perea and S. J. Lupker. 2004. Can caniso activate casino? Transposed-letter similarity effects with nonadjacent letter positions. Journal of Memory and Language, 51(2):231–246.
Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium, October-November. Association for Computational Linguistics.
Danish Pruthi, Bhuwan Dhingra, and Zachary C. Lipton. 2019. Combating adversarial misspellings with robust word recognition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5582–5591, Florence, Italy, July. Association for Computational Linguistics.
Graham Rawlinson. 1976. The significance of letter position in word recognition. Unpublished PhD thesis.
Graham Rawlinson. 2007. The significance of letter position in word recognition. IEEE Aerospace and Electronic Systems Magazine, 22:26–27.
Keith Rayner, Sarah J. White, Rebecca L. Johnson, and Simon P. Liversedge. 2006. Raeding wrods with jubmled lettres: there is a cost. Psychological Science, 17(3):192–193.
Sofie Schoonbaert and Jonathan Grainger. 2004. Letter position coding in printed word perception: Effects of repeated and transposed letters. Language and Cognitive Processes, 19(3):333–367.
M. Schuster and K. Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152, Kyoto, Japan.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October. Association for Computational Linguistics.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November. Association for Computational Linguistics.
Carol Whitney. 2008. Comparison of the SERIOL and SOLAR theories of letter-position encoding. Brain and Language, 107(2):170–178.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana, June. Association for Computational Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
Wei Emma Zhang, Quan Z. Sheng, and Ahoud Abdulrahmn F. Alhazmi. 2019. Generating textual adversarial examples for deep learning models: A survey.