Generating Natural Language Adversarial Examples
Moustafa Alzantot∗, Yash Sharma∗, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang
Department of Computer Science, University of California, Los Angeles (UCLA)
{malzantot, bojhang, mbs, kwchang}@ucla.edu
Cooper Union
[email protected]
Computer Science Department, University of Maryland
[email protected]
∗ Moustafa Alzantot and Yash Sharma contribute equally to this work.
Abstract
Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the model to misclassify. In the image domain, these perturbations are often virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a black-box population-based optimization algorithm to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively. We additionally demonstrate that 92.3% of the successful sentiment analysis adversarial examples are classified to their original label by 20 human annotators, and that the examples are perceptibly quite similar. Finally, we discuss an attempt to use adversarial training as a defense, but fail to yield improvement, demonstrating the strength and diversity of our adversarial examples. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.
Recent research has found that deep neural networks (DNNs) are vulnerable to adversarial examples (Goodfellow et al., 2014; Szegedy et al., 2013). The existence of adversarial examples has been shown in image classification (Szegedy et al., 2013) and speech recognition (Carlini and Wagner, 2018). In this work, we demonstrate that adversarial examples can be constructed in the context of natural language. Using a black-box population-based optimization algorithm, we successfully generate both semantically and syntactically similar adversarial examples against models trained on both the IMDB (Maas et al., 2011) sentiment analysis task and the Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) textual entailment task. In addition, we validate that the examples are both correctly classified by human evaluators and similar to the original via a human study. Finally, we attempt to defend against said adversarial attack using adversarial training, but fail to yield any robustness, demonstrating the strength and diversity of the generated adversarial examples.

Our results show that by minimizing the semantic and syntactic dissimilarity, an attacker can perturb examples such that humans correctly classify, but high-performing models misclassify. We are open-sourcing our attack (https://github.com/nesl/nlp_adversarial_examples) to encourage research in training DNNs robust to adversarial attacks in the natural language domain.

Adversarial examples have been explored primarily in the image recognition domain. Examples have been generated through solving an optimization problem, attempting to induce misclassification while minimizing the perceptual distortion (Szegedy et al., 2013; Carlini and Wagner, 2017; Chen et al., 2017b; Sharma and Chen, 2017). Due to the computational cost of such approaches, fast methods were introduced which, either in one step or iteratively, shift all pixels simultaneously until a distortion constraint is reached (Goodfellow et al., 2014; Kurakin et al., 2016; Madry et al., 2017). Nearly all popular methods are gradient-based.

Such methods, however, rely on the fact that adding small perturbations to many pixels in the image will not have a noticeable effect on a human viewer. This approach obviously does not transfer to the natural language domain, as all changes are perceptible. Furthermore, unlike continuous image pixel values, words in a sentence are discrete tokens. Therefore, it is not possible to compute the gradient of the network loss function with respect to the input words. A straightforward workaround is to project input sentences into a continuous space (e.g. word embeddings) and consider this as the model input. However, this approach also fails because it still assumes that replacing every word with words nearby in the embedding space will not be noticeable. Replacing words without accounting for syntactic coherence will certainly lead to improperly constructed sentences which will look odd to the reader.

Relative to the image domain, little work has been pursued for generating natural language adversarial examples. Given the difficulty in generating semantics-preserving perturbations, distracting sentences have been added to the input document in order to induce misclassification (Jia and Liang, 2017). In our work, we attempt to generate semantically and syntactically similar adversarial examples, via word replacements, resolving the aforementioned issues.
Minimizing the number of word replacements necessary to induce misclassification has been studied in previous work (Papernot et al., 2016b), however without consideration given to semantics or syntactics, yielding incoherent generated examples. In recent work, there have been a few attempts at generating adversarial examples for language tasks by using back-translation (Iyyer et al., 2018), exploiting machine-generated rules (Ribeiro et al., 2018), and searching in the underlying semantic space (Zhao et al., 2018). In addition, while preparing our submission, we became aware of recent work which targets a similar contribution (Kuleshov et al., 2018; Ebrahimi et al., 2018). We treat these contributions as parallel work.

We assume the attacker has black-box access to the target model; the attacker is not aware of the model architecture, parameters, or training data, and is only capable of querying the target model with supplied inputs and obtaining the output predictions and their confidence scores. This setting has been extensively studied in the image domain (Papernot et al., 2016a; Chen et al., 2017a; Alzantot et al., 2018), but has yet to be explored in the context of natural language.
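To make the assumed access pattern concrete, the sketch below (ours, not part of the released code) wraps a trained Keras classifier behind the only interface the attacker is allowed to use: submit a tokenized sentence, read back the output scores. The helper name make_black_box and its arguments are illustrative assumptions.

```python
from typing import Callable, Sequence
import numpy as np

# Black-box interface assumed by the attack: the attacker can only submit a
# tokenized sentence and read back the model's output scores.
VictimModel = Callable[[Sequence[str]], np.ndarray]

def make_black_box(keras_model, word_to_id, max_len: int) -> VictimModel:
    """Wrap a trained classifier so the attacker sees only query access.
    `keras_model`, `word_to_id`, and `max_len` are placeholders for illustration."""
    def query(sentence: Sequence[str]) -> np.ndarray:
        ids = [word_to_id.get(w, 0) for w in sentence][:max_len]
        ids = ids + [0] * (max_len - len(ids))           # pad to a fixed length
        probs = keras_model.predict(np.array([ids]), verbose=0)[0]
        return probs                                      # output scores; shape depends on the model head
    return query
```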
To avoid the limitations of gradient-based attack methods, we design an algorithm for constructing adversarial examples with the following goals in mind. We aim to minimize the number of modified words between the original and adversarial examples, but only perform modifications which retain semantic similarity with the original and syntactic coherence. To achieve these goals, instead of relying on gradient-based optimization, we developed an attack algorithm that exploits population-based gradient-free optimization via genetic algorithms. An added benefit of using gradient-free optimization is enabling use in the black-box case; gradient-reliant algorithms are inapplicable in this case, as they are dependent on the model being differentiable and the internals being accessible (Papernot et al., 2016b; Ebrahimi et al., 2018).

Genetic algorithms are inspired by the process of natural selection, iteratively evolving a population of candidate solutions towards better solutions. The population of each iteration is called a generation. In each generation, the quality of population members is evaluated using a fitness function. "Fitter" solutions are more likely to be selected for breeding the next generation. The next generation is generated through a combination of crossover and mutation. Crossover is the process of taking more than one parent solution and producing a child solution from them; it is analogous to reproduction and biological crossover. Mutation is done in order to increase the diversity of population members and provide better exploration of the search space. Genetic algorithms are known to perform well in solving combinatorial optimization problems (Anderson and Ferris, 1994; Mühlenbein, 1989), and due to employing a population of candidate solutions, these algorithms can find successful adversarial examples with fewer modifications.
Perturb Subroutine: In order to explain our algorithm, we first introduce the subroutine Perturb. This subroutine accepts an input sentence x_cur which can be either a modified sentence or the same as x_orig. It randomly selects a word w in the sentence x_cur and then selects a suitable replacement word that has similar semantic meaning, fits within the surrounding context, and increases the target label prediction score. In order to select the best replacement word, Perturb applies the following steps (a minimal code sketch follows the list):

• Computes the N nearest neighbors of the selected word according to the distance in the GloVe embedding space (Pennington et al., 2014). We used Euclidean distance, as we did not see noticeable improvement using cosine. We filter out candidates with distance to the selected word greater than δ. We use the counter-fitting method presented in (Mrkšić et al., 2016) to post-process the adversary's GloVe vectors to ensure that the nearest neighbors are synonyms. The resulting embedding is independent of the embeddings used by victim models.

• Second, we use the Google 1 billion words language model (Chelba et al., 2013) to filter out words that do not fit within the context surrounding the word w in x_cur. We do so by ranking the candidate words based on their language model scores when fit within the replacement context, and keeping only the top K words with the highest scores.

• From the remaining set of words, we pick the one that will maximize the target label prediction probability when it replaces the word w in x_cur.

• Finally, the selected word is inserted in place of w, and Perturb returns the resulting sentence.

The selection of which word to replace in the input sentence is done by random sampling with probabilities proportional to the number of neighbors each word has within Euclidean distance δ in the counter-fitted embedding space, encouraging the solution set to be large enough for the algorithm to make appropriate modifications. We exclude common articles and prepositions (e.g. a, to) from being selected for replacement.
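The following is a minimal sketch of how the Perturb subroutine described above could be implemented. The neighbor table, language-model scorer, and victim-query function are passed in as arguments; names such as neighbors_within_delta and lm_score are our own placeholders rather than functions from the released code.

```python
import random
from typing import Callable, Dict, List, Sequence
import numpy as np

def perturb(
    x_cur: List[str],
    target: int,
    victim: Callable[[Sequence[str]], np.ndarray],     # black-box query to the victim model
    neighbors_within_delta: Dict[str, List[str]],       # counter-fitted GloVe neighbors within distance delta
    lm_score: Callable[[List[str], int, str], float],   # hypothetical LM scorer: candidate word in context
    n_neighbors: int = 8,
    top_k: int = 4,
    stop_words: frozenset = frozenset({"a", "an", "the", "to", "of", "in"}),
) -> List[str]:
    """Replace one word of x_cur with a synonym that raises the target-class score."""
    # Sample the position to change, weighted by how many candidate synonyms each word has.
    weights = [0 if w in stop_words else len(neighbors_within_delta.get(w, []))
               for w in x_cur]
    if sum(weights) == 0:
        return list(x_cur)
    pos = random.choices(range(len(x_cur)), weights=weights, k=1)[0]
    word = x_cur[pos]

    # Step 1: nearest neighbors in the counter-fitted embedding space.
    candidates = neighbors_within_delta[word][:n_neighbors]

    # Step 2: keep the top-K candidates that best fit the surrounding context.
    candidates = sorted(candidates, key=lambda c: lm_score(x_cur, pos, c), reverse=True)[:top_k]

    # Step 3: pick the candidate that maximizes the target-label probability.
    best_sentence, best_score = list(x_cur), -1.0
    for cand in candidates:
        trial = list(x_cur)
        trial[pos] = cand
        score = float(victim(trial)[target])
        if score > best_score:
            best_sentence, best_score = trial, score

    # Step 4: return the sentence with the selected replacement in place.
    return best_sentence
```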
Optimization Procedure: The optimization algorithm can be seen in Algorithm 1. The algorithm starts by creating the initial generation P^0 of size S by calling the Perturb subroutine S times to create a set of distinct modifications to the original sentence. Then, the fitness of each population member in the current generation is computed as the target label prediction probability, found by querying the victim model function f. If a population member's predicted label is equal to the target label, the optimization is complete. Otherwise, pairs of population members from the current generation are randomly sampled with probability proportional to their fitness values. A new child sentence is then synthesized from a pair of parent sentences by independently sampling from the two using a uniform distribution. Finally, the Perturb subroutine is applied to the resulting children.

Algorithm 1: Finding adversarial examples

    for i = 1, ..., S in population do
        P^0_i ← Perturb(x_orig, target)
    for g = 1, ..., G generations do
        for i = 1, ..., S in population do
            F^{g-1}_i = f(P^{g-1}_i)_target
        x_adv = P^{g-1}_j where j = argmax_j F^{g-1}_j
        if argmax_c f(x_adv)_c == t then
            return x_adv    ⊲ Found successful attack
        else
            P^g_1 = {x_adv}
            p = Normalize(F^{g-1})
            for i = 2, ..., S in population do
                Sample parent1 from P^{g-1} with probs p
                Sample parent2 from P^{g-1} with probs p
                child = Crossover(parent1, parent2)
                child_mut = Perturb(child, target)
                P^g_i = {child_mut}
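Putting the pieces together, a compact sketch of the search loop in Algorithm 1 might look as follows; it reuses the hypothetical perturb and victim-query helpers sketched earlier (with their extra arguments already bound), and the uniform word-by-word crossover follows the description in the text.

```python
import random
from typing import Callable, List, Optional, Sequence
import numpy as np

def crossover(parent1: List[str], parent2: List[str]) -> List[str]:
    """Build a child by picking each word uniformly from one of the two parents."""
    return [random.choice(pair) for pair in zip(parent1, parent2)]

def genetic_attack(
    x_orig: List[str],
    target: int,
    victim: Callable[[Sequence[str]], np.ndarray],
    perturb: Callable[[List[str], int], List[str]],   # e.g. the sketch above, with its extras bound
    pop_size: int = 60,
    generations: int = 20,
) -> Optional[List[str]]:
    """Population-based search for a sentence the victim assigns to `target`."""
    population = [perturb(x_orig, target) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = np.array([float(victim(member)[target]) for member in population])
        best = population[int(np.argmax(fitness))]
        if int(np.argmax(victim(best))) == target:
            return best                       # found a successful adversarial example
        probs = fitness / fitness.sum()
        next_gen = [best]                     # keep the fittest member unchanged
        for _ in range(pop_size - 1):
            p1, p2 = random.choices(population, weights=probs, k=2)
            next_gen.append(perturb(crossover(p1, p2), target))
        population = next_gen
    return None                               # attack failed within the generation budget
```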
To evaluate our attack method, we trained models for the sentiment analysis and textual entailment classification tasks. For both models, each word in the input sentence is first projected into a fixed 300-dimensional vector space using GloVe (Pennington et al., 2014). Each of the models used is based on a popular open-source benchmark, and implementations can be found in the following repositories: https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py (sentiment analysis) and https://github.com/Smerity/keras_snli/blob/master/snli_rnn.py (textual entailment). Model descriptions are given below.

Sentiment Analysis:
We trained a sentiment analysis model using the IMDB dataset of movie reviews (Maas et al., 2011). The IMDB dataset consists of 25,000 training examples and 25,000 test examples. The LSTM model is composed of 128 units, and the outputs across all time steps are averaged and fed to the output layer. The test accuracy of the model is 90%, which is relatively close to the state-of-the-art results on this dataset.

Original Text Prediction = Negative. (Confidence = 78.0%)
This movie had terrible acting, terrible plot, and terrible choice of actors. (Leslie Nielsen ...come on!!!) the one part I considered slightly funny was the battling FBI/CIA agents, but because the audience was mainly kids they didn't understand that theme.

Adversarial Text Prediction = Positive. (Confidence = 59.8%)
This movie had horrific acting, horrific plot, and horrifying choice of actors. (Leslie Nielsen ...come on!!!) the one part I regarded slightly funny was the battling FBI/CIA agents, but because the audience was mainly youngsters they didn't understand that theme.

Table 1: Example of attack results for the sentiment analysis task. Modified words are highlighted in green and red for the original and adversarial texts, respectively.

Original Text Prediction: Entailment (Confidence = 86%)
Premise: A runner wearing purple strives for the finish line.
Hypothesis: A runner wants to head for the finish line.

Adversarial Text Prediction: Contradiction (Confidence = 43%)
Premise: A runner wearing purple strives for the finish line.
Hypothesis: A racer wants to head for the finish line.

Table 2: Example of attack results for the textual entailment task. Modified words are highlighted in green and red for the original and adversarial texts, respectively.

                    Sentiment Analysis        Textual Entailment
                    % success   % modified    % success   % modified
Perturb baseline    52%         19%           –           –
Genetic attack      97%         14.7%         70%         23%

Table 3: Comparison between the attack success rate and mean percentage of modifications required by the genetic attack and the Perturb baseline for the two tasks.
Textual Entailment:
We trained a textual entailment model using the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015). The model passes the input through a ReLU "translation" layer (Bowman et al., 2015), which encodes the premise and hypothesis sentences by performing a summation over the word embeddings, concatenates the two sentence embeddings, and finally passes the output through three 600-dimensional ReLU layers before feeding it to a 3-way softmax. The model predicts whether the premise sentence entails, contradicts or is neutral to the hypothesis sentence. The test accuracy of the model is 83%, which is also relatively close to the state-of-the-art (Chen et al., 2017c).
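For concreteness, a minimal Keras sketch of the two victim architectures as described above is given below; vocabulary size, sequence length, and all training details are placeholder assumptions rather than values taken from the paper or the linked repositories.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 50_000, 300, 100   # placeholder sizes, not from the paper

def build_sentiment_model() -> models.Model:
    """IMDB classifier: embeddings -> LSTM(128) -> average over all time steps -> output layer."""
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)          # would be initialized with GloVe
    x = layers.LSTM(128, return_sequences=True)(x)
    x = layers.GlobalAveragePooling1D()(x)                       # average outputs across time steps
    out = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(tokens, out)

def build_entailment_model() -> models.Model:
    """SNLI classifier: shared ReLU 'translation' layer, summed word vectors per sentence,
    concatenation, three 600-dimensional ReLU layers, 3-way softmax."""
    embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM)
    translate = layers.TimeDistributed(layers.Dense(300, activation="relu"))
    sum_words = layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))

    premise = layers.Input(shape=(MAX_LEN,), dtype="int32")
    hypothesis = layers.Input(shape=(MAX_LEN,), dtype="int32")
    encoded = [sum_words(translate(embed(x))) for x in (premise, hypothesis)]
    merged = layers.Concatenate()(encoded)
    for _ in range(3):
        merged = layers.Dense(600, activation="relu")(merged)
    out = layers.Dense(3, activation="softmax")(merged)
    return models.Model([premise, hypothesis], out)
```

The sentiment model would typically be trained with a binary cross-entropy loss and the entailment model with a categorical cross-entropy loss, as is standard for these output heads.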
We randomly sampled 1000 and 500 correctly classified examples from the test sets of the two tasks to evaluate our algorithm. Correctly classified examples were chosen to prevent the accuracy levels of the victim models from confounding our results. For the sentiment analysis task, the attacker aims to divert the prediction result from positive to negative, and vice versa. For the textual entailment task, the attacker is only allowed to modify the hypothesis, and aims to divert the prediction result from 'entailment' to 'contradiction', and vice versa. We limit the attacker to a maximum of G = 20 iterations, and fix the hyperparameter values to S = 60, N = 8, K = 4, and δ = 0.5. We also fixed the maximum percentage of allowed changes to the document to be 20% and 25% for the two tasks, respectively. If increased, the success rate would increase but the mean quality would decrease. If the attack does not succeed within the iteration limit or exceeds the specified threshold, it is counted as a failure.

Sample outputs produced by our attack are shown in Tables 1 and 2. Additional outputs can be found in the supplementary material (Tables 4 and 5). Table 3 shows the attack success rate and mean percentage of modified words on each task. We compare to the Perturb baseline, which greedily applies the Perturb subroutine, to validate the use of population-based optimization. As can be seen from our results, we are able to achieve a high success rate with a limited number of modifications on both tasks. In addition, the genetic algorithm significantly outperformed the Perturb baseline in both success rate and percentage of words modified, demonstrating the additional benefit yielded by using population-based optimization. Testing using a single TitanX GPU, for sentiment analysis and textual entailment, we measured average runtimes on success to be 43.5 and 5 seconds per example, respectively. The high success rate and reasonable runtimes demonstrate the practicality of our approach, even when scaling to long sentences, such as those found in the IMDB dataset.

Our success rate on textual entailment is lower due to the large disparity in sentence length. On average, hypothesis sentences in the SNLI corpus are 9 words long, which is very short compared to IMDB (229 words on average, limited to 100 for our experiments). With sentences that short, applying successful perturbations becomes much harder; however, we were still able to achieve a success rate of 70%. For the same reason, we did not apply the Perturb baseline on the textual entailment task, as it fails to achieve any success under the limit of the maximum allowed changes constraint.
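As a usage illustration only, the settings above map onto the earlier attack sketch roughly as follows; the keys mirror the hypothetical argument names introduced in the algorithm section and are not part of the released code.

```python
# Hyperparameters reported in the text, expressed for the hypothetical
# genetic_attack / perturb sketches from the algorithm section.
ATTACK_CONFIG = {
    "generations": 20,     # G
    "pop_size": 60,        # S
    "n_neighbors": 8,      # N
    "top_k": 4,            # K
    "delta": 0.5,          # embedding-distance cutoff for candidate synonyms
    "max_change": {"sentiment": 0.20, "entailment": 0.25},  # allowed fraction of modified words
}
```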
We performed a user study on the sentiment analysis task with 20 volunteers to evaluate how perceptible our adversarial perturbations are. Note that the number of participating volunteers is significantly larger than used in previous studies (Jia and Liang, 2017; Ebrahimi et al., 2018). The user study was composed of two parts. First, we presented 100 adversarial examples to the participants and asked them to label the sentiment of the text (i.e., positive or negative). 92.3% of the responses matched the original text sentiment, indicating that our modifications did not significantly affect human judgment on the text sentiment. Second, we prepared 100 questions, each of which includes an original example and the corresponding adversarial example as a pair. Participants were asked to judge the similarity of each pair on a scale from 1 (very similar) to 4 (very different). The average rating is . ± . , which shows the perceived difference is also small.

The attack results above raise the following question: how can we defend against these attacks? We performed a preliminary experiment to see if adversarial training (Madry et al., 2017), the only effective defense in the image domain, can be used to lower the attack success rate. We generated 1000 adversarial examples on the cleanly trained sentiment analysis model using the IMDB training set, appended them to the existing training set, and used the updated dataset to adversarially train a model from scratch. We found that adversarial training provided no additional robustness benefit in our experiments using the test set, despite the fact that the model achieves near 100% accuracy classifying adversarial examples included in the training set. These results demonstrate the diversity of the perturbations generated by our attack algorithm and illustrate the difficulty of defending against adversarial attacks. We hope these results inspire further work in increasing the robustness of natural language models.
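A minimal sketch of this adversarial training experiment, again in terms of the hypothetical attack helpers from earlier sections; data loading, tokenization, and retraining are elided.

```python
def adversarially_augment(train_sentences, train_labels, victim, attack_fn, n_adv=1000):
    """Generate adversarial examples on the clean model and append them to the training set.
    `victim` is the black-box query wrapper and `attack_fn` the genetic attack sketched earlier
    (with its extra arguments already bound)."""
    adv_sentences, adv_labels = [], []
    for sentence, label in zip(train_sentences, train_labels):
        if len(adv_sentences) >= n_adv:
            break
        target = 1 - label                        # flip the binary sentiment label
        adversarial = attack_fn(sentence, target, victim)
        if adversarial is not None:               # attack succeeded on this example
            adv_sentences.append(adversarial)
            adv_labels.append(label)              # keep the original (correct) label
    return list(train_sentences) + adv_sentences, list(train_labels) + adv_labels
```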
We demonstrate that despite the difficulties in generating imperceptible adversarial examples in the natural language domain, semantically and syntactically similar adversarial examples can be crafted using a black-box population-based optimization algorithm, yielding success on both the sentiment analysis and textual entailment tasks. Our human study validated that the generated examples were indeed adversarial and perceptibly quite similar. We hope our work encourages researchers to pursue improving the robustness of DNNs in the natural language domain.
Acknowledgement
This research was supported in part by the U.S. Army Research Laboratory and the UK Ministry of Defence under Agreement Number W911NF-16-3-0001, and the National Science Foundation.

References
M. Alzantot, Y. Sharma, S. Chakraborty, and M. Srivastava. 2018. GenAttack: Practical black-box attacks with gradient-free optimization. arXiv preprint arXiv:1805.11090.

Edward J. Anderson and Michael C. Ferris. 1994. Genetic algorithms for combinatorial optimization: the assemble line balancing problem. ORSA Journal on Computing, 6(2):161–173.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

N. Carlini and D. Wagner. 2017. Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644.

Nicholas Carlini and David Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh. 2017a. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. arXiv preprint arXiv:1708.03999.

P. Y. Chen, Y. Sharma, H. Zhang, J. Yi, and C. Hsieh. 2017b. EAD: Elastic-net attacks to deep neural networks via adversarial examples. arXiv preprint arXiv:1709.0414.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017c. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.

J. Ebrahimi, A. Rao, D. Lowd, and D. Dou. 2018. HotFlip: White-box adversarial examples for text classification. ACL'18; arXiv preprint arXiv:1712.06751.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of NAACL.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.

V. Kuleshov, S. Thakoor, T. Lau, and S. Ermon. 2018. Adversarial examples for natural language classification problems. OpenReview submission OpenReview:r1QZ3zbAZ.

A. Kurakin, I. Goodfellow, and S. Bengio. 2016. Adversarial machine learning at scale. ICLR'17; arXiv preprint arXiv:1611.01236.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892.

Heinz Mühlenbein. 1989. Parallel genetic algorithms, population genetics and combinatorial optimization. In Workshop on Parallel Processing: Logic, Organization, and Technology, pages 398–406. Springer.

N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. Celik, and A. Swami. 2016a. Practical black-box attacks against machine learning. arXiv preprint arXiv:1602.02697.

N. Papernot, P. McDaniel, A. Swami, and R. Harang. 2016b. Crafting adversarial input sequences for recurrent neural networks. arXiv preprint arXiv:1604.08275.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.

Y. Sharma and P. Y. Chen. 2017. Attacking the Madry defense model with L1-based adversarial examples. arXiv preprint arXiv:1710.10733.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. arXiv preprint arXiv:1710.11342.

Supplemental Materials: Generating Natural Language Adversarial Examples

Additional Sentiment Analysis Results
Table 4 shows an additional set of attack results against the sentiment analysis model described in our paper.

Original Text Prediction = Positive. (Confidence = 78%)
The promise of Martin Donovan playing Jesus was, quite honestly, enough to get me to see the film. Definitely worthwhile; clever and funny without overdoing it. The low quality filming was probably an appropriate effect but ended up being a little too jarring, and the ending sounded more like a PBS program than Hartley. Still, too many memorable lines and great moments for me to judge it harshly.

Adversarial Text Prediction = Negative. (Confidence = 59.9%)
The promise of Martin Donovan playing Jesus was, utterly frankly, enough to get me to see the film. Definitely worthwhile; clever and funny without overdoing it. The low quality filming was presumably an appropriate effect but ended up being a little too jarring, and the ending sounded more like a PBS program than Hartley. Still, too many memorable lines and great moments for me to judge it harshly.

Original Text Prediction = Negative. (Confidence = 74.30%)
Some sort of accolade must be given to 'Hellraiser: Bloodline'. It's actually out full-mooned Full Moon. It bears all the marks of, say, your 'demonic toys' or 'puppet master' series, without their dopey, uh, charm? Full Moon can get away with silly product because they know it's silly. These Hellraiser things, man, do they ever take themselves seriously. This increasingly stupid franchise (though not nearly as stupid as I am for having watched it) once made up for its low budgets by being stylish. Now it's just ish.

Adversarial Text Prediction = Positive. (Confidence = 51.03%)
Some kind of accolade must be given to 'Hellraiser: Bloodline'. It's truly out full-mooned Full Moon. It bears all the marks of, say, your 'demonic toys' or 'puppet master' series, without their silly, uh, charm? Full Moon can get away with daft product because they know it's silly. These Hellraiser things, man, do they ever take themselves seriously. This steadily daft franchise (whilst not nearly as daft as I am for having witnessed it) once made up for its low budgets by being stylish. Now it's just ish.

Original Text Prediction = Negative. (Confidence = 50.53%)
Thinly-cloaked retelling of the garden-of-eden story – nothing new, nothing shocking, although I feel that is what the filmmakers were going for. The idea is trite. Strong performance from Daisy Eagan, that's about it. I believed she was 13, and I was interested in her character, the rest left me cold.

Adversarial Text Prediction = Positive. (Confidence = 63.04%)
Thinly-cloaked retelling of the garden-of-eden story – nothing new, nothing shocking, although I feel that is what the filmmakers were going for. The idea is petty. Strong performance from Daisy Eagan, that's about it. I believed she was 13, and I was interested in her character, the rest left me cold.

Table 4: Example of attack results against the sentiment analysis model. Modified words are highlighted in green and red for the original and adversarial texts, respectively.

Additional Textual Entailment Results
Table 5 shows an additional set of attack results against the textual entailment model described in our paper.

Original Text Prediction: Contradiction (Confidence = 91%)
Premise:
A man and a woman stand in front of a Christmas tree contemplating a single thought.
Hypothesis:
Two people talk loudly in front of a cactus.

Adversarial Text Prediction: Entailment (Confidence = 51%)
Premise:
A man and a woman stand in front of a Christmas tree contemplating a single thought.
Hypothesis:
Two humans chitchat loudly in front of a cactus.

Original Text Prediction: Contradiction (Confidence = 94%)
Premise:
A young girl wearing yellow shorts and a white tank top using a cane pole to fish at a small pond.
Hypothesis:
A girl wearing a dress looks off a cliff.

Adversarial Text Prediction: Entailment (Confidence = 40%)
Premise:
A young girl wearing yellow shorts and a white tank top using a cane pole to fish at a small pond.
Hypothesis:
A girl wearing a skirt looks off a ravine.

Original Text Prediction: Entailment (Confidence = 86%)
Premise:
A large group of protesters are walking down the street with signs.
Hypothesis:
Some people are holding up signs of protest in the street.

Adversarial Text Prediction: Contradiction (Confidence = 43%)
Premise:
A large group of protesters are walking down the street with signs.
Hypothesis:
Some people are holding up signals of protest in the street.