e-SNLI: Natural Language Inference with Natural Language Explanations
Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, Phil Blunsom
Department of Computer Science, University of Oxford
Department of Computer Science, University College London
Alan Turing Institute, London, UK
DeepMind, London, UK
Abstract
In order for machine learning to garner widespread public adoption, models must be able to provide interpretable and robust explanations for their decisions, as well as learn from human-provided explanations at train time. In this work, we extend the Stanford Natural Language Inference dataset with an additional layer of human-annotated natural language explanations of the entailment relations. We further implement models that incorporate these explanations into their training process and output them at test time. We show how our corpus of explanations, which we call e-SNLI, can be used for various goals, such as obtaining full-sentence justifications of a model's decisions, improving universal sentence representations, and transferring to out-of-domain NLI datasets. Our dataset thus opens up a range of research directions for using natural language explanations, both for improving models and for asserting their trust.

Introduction

Humans do not learn solely from labeled examples supplied by a teacher. Instead, they seek a conceptual understanding of a task through both demonstrations and explanations. Machine learning models trained simply to obtain high accuracy on held-out sets often learn to rely heavily on shallow input statistics, resulting in brittle models susceptible to adversarial attacks. For example, Ribeiro et al. [24] present a document classifier that distinguishes between Christianity and Atheism with high accuracy. However, on close inspection, the model spuriously separates classes based on words contained in the headers, such as Posting, Host, and Re.
In this work, we introduce a new dataset and models for exploiting and generating explanations for the task of recognizing textual entailment. We argue for free-form natural language explanations, as opposed to formal language, for a series of reasons. First, natural language is readily comprehensible to an end-user who needs to assert a model's reliability. Secondly, it is also easiest for humans to provide free-form language, eliminating the additional effort of learning to produce formal language, and thus making it simpler to collect such datasets. Lastly, natural language justifications might eventually be mined from existing large-scale free-form text.

Despite the potential for free-form justifications to improve both learning and transparency, there is currently a lack of such datasets in the machine learning community. To address this deficiency, we have collected a large corpus of human-annotated explanations for the Stanford Natural Language Inference (SNLI) dataset [3], available at https://github.com/OanaMariaCamburu/e-SNLI. We chose SNLI because it constitutes an influential corpus for natural language understanding that requires deep assimilation of fine-grained nuances of commonsense knowledge. We call our explanation-augmented dataset e-SNLI, which we collected to enable research in the direction of training with and generation of free-form textual justifications.

Premise: An adult dressed in black holds a stick.
Hypothesis: An adult is walking away, empty-handed.
Label: contradiction
Explanation: Holds a stick implies using hands so it is not empty-handed.

Premise: A child in a yellow plastic safety swing is laughing as a dark-haired woman in pink and coral pants stands behind her.
Hypothesis: A young mother is playing with her daughter in a swing.
Label: neutral
Explanation: Child does not imply daughter and woman does not imply mother.

Premise: A man in an orange vest leans over a pickup truck.
Hypothesis: A man is touching a truck.
Label: entailment
Explanation: Man leans over a pickup truck implies that he is touching it.

Figure 1: Examples from e-SNLI. Annotators were given the premise, hypothesis, and label. They highlighted the words that they considered essential for the label and provided the explanations.

In order to demonstrate the efficacy of the e-SNLI dataset, we first show that it is much more difficult to produce correct explanations based on spurious correlations than to produce correct labels. We then implement models that, given a premise and a hypothesis, predict a label and an explanation. We also investigate how the additional signal from explanations received at train time can guide models into learning better sentence representations. Finally, we look into the transfer capabilities of our model to out-of-domain NLI datasets.

The task of recognizing textual entailment is a critical natural language understanding task. Given a pair of sentences, called the premise and the hypothesis, the task consists of classifying their relation as either (a) entailment, if the premise entails the hypothesis, (b) contradiction, if the hypothesis contradicts the premise, or (c) neutral, if neither entailment nor contradiction holds. The SNLI dataset [3], containing 570K data points of human-generated triples (premise, hypothesis, label), has driven the development of a large number of neural network models [25, 21, 22, 6, 19, 5, 7].
Conneau et al. [7] showed that training universal sentence representations on SNLI is both more efficient and more accurate than the traditional training approaches on orders of magnitude larger, but unsupervised, datasets [17, 14]. We take this approach one step further and show that an additional layer of explanations on top of the label supervision brings further improvement.

Recently, Gururangan et al. [13] cast doubt on whether models trained on SNLI are learning to understand language, or are largely fixating on spurious correlations, also called artifacts. For example, specific words in the hypothesis tend to be strong indicators of the label: friends and old appear very often in neutral hypotheses, animal and outdoors appear most of the time in entailment hypotheses, while nobody and sleeping appear mostly in contradiction hypotheses. They show that a premise-agnostic model, i.e., a model that only takes as input the hypothesis and outputs the label, obtains a high test accuracy. In Section 4.1, we show that it is much more difficult to rely on artifacts to generate explanations than to generate labels.

We present our collection methodology for e-SNLI, for which we used Amazon Mechanical Turk. The main question that we want our dataset to answer is:
Why is a pair of sentences in a relation of entailment, neutrality, or contradiction? We encouraged the annotators to focus on the non-obvious elements that induce the given relation, and not on the parts of the premise that are repeated identically in the hypothesis. For entailment, we required justifications of all the parts of the hypothesis that do not appear in the premise. For neutral and contradictory pairs, while we encouraged stating all the elements that contribute to the relation, we consider an explanation correct if at least one element is stated. Finally, we asked the annotators to provide self-contained explanations, as opposed to sentences that would make sense only after reading the premise and hypothesis. For example, we prefer an explanation of the form "Anyone can knit, not just women.", rather than "It cannot be inferred they are women."

In crowd-sourcing, it is difficult to control the quality of free-form annotations. Thus, we aimed to preemptively block the submission of obviously incorrect answers. We did in-browser checks to ensure that each explanation contained at least three tokens and that it was not a copy of the premise or hypothesis. We further guided the annotators to provide adequate answers by asking them to proceed in two steps. First, we required them to highlight words from the premise and/or hypothesis that they considered essential for the given relation. Secondly, annotators had to formulate the explanation using the words that they highlighted. However, requiring exact spelling might push annotators to formulate grammatically incorrect sentences, so we only required half of the highlighted words to be used with the same spelling. For entailment pairs, we required at least one word in the premise to be highlighted. For contradiction pairs, we required highlighting at least one word in both the premise and the hypothesis. For neutral pairs, we only allowed highlighting words in the hypothesis, in order to strongly emphasize the asymmetry in this relation and to prevent workers from confusing the premise with the hypothesis. We believe these label-specific constraints helped put the annotators into the correct mindset, and additionally gave us a means to filter incorrect explanations. Finally, we also checked that the annotators used words that were not highlighted, as we believe a correct explanation needs to articulate a link between the keywords.

We collected one explanation for each pair in the training set and three explanations for each pair in the validation and test sets. Figure 1 shows examples of collected explanations. There were 6325 workers, with an average of 860 explanations per worker and a standard deviation of 403.
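The submission checks above are simple to operationalise. Below is a minimal Python sketch of such validation logic; the actual checks ran in the browser during annotation, so the function and constant names here (explanation_is_acceptable, MIN_HIGHLIGHT_REUSE, etc.) are illustrative assumptions rather than the authors' implementation.

```python
import re

MIN_TOKENS = 3             # each explanation must contain at least three tokens
MIN_HIGHLIGHT_REUSE = 0.5  # at least half of the highlighted words must reappear verbatim

def tokens(text):
    return re.findall(r"\w+", text.lower())

def explanation_is_acceptable(premise, hypothesis, explanation,
                              highlighted_premise, highlighted_hypothesis, label):
    """Reject obviously invalid submissions, mirroring the checks described above."""
    expl_tokens = tokens(explanation)
    if len(expl_tokens) < MIN_TOKENS:
        return False
    # The explanation must not simply copy the premise or the hypothesis.
    if explanation.strip().lower() in (premise.strip().lower(), hypothesis.strip().lower()):
        return False
    # Label-specific highlighting constraints.
    if label == "entailment" and not highlighted_premise:
        return False
    if label == "contradiction" and (not highlighted_premise or not highlighted_hypothesis):
        return False
    if label == "neutral" and highlighted_premise:
        return False  # only hypothesis words may be highlighted for neutral pairs
    # At least half of the highlighted words must be reused with the same spelling...
    highlighted = [w.lower() for w in highlighted_premise + highlighted_hypothesis]
    if highlighted:
        reused = sum(w in expl_tokens for w in highlighted)
        if reused / len(highlighted) < MIN_HIGHLIGHT_REUSE:
            return False
    # ...and the explanation must also contain words that were not highlighted.
    if all(w in highlighted for w in expl_tokens):
        return False
    return True
```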
Analysis and refinement of the collected dataset
In order to measure the quality of our collected explanations, we selected a random sample of examples and manually graded their correctness between 0 (incorrect) and 1 (correct), giving partial scores of k/n if only k out of the n required arguments were mentioned. We also considered an explanation as incorrect if it was uninformative, that is, if the explanation was template-like, extensively repeating details from the premise/hypothesis that are not directly useful for justifying the relation between the two sentences. We observed a few re-occurring templates, such as "Just because [entire premise] doesn't mean [entire hypothesis]" for neutral pairs, "[entire premise] implies [entire hypothesis]" for entailment pairs, and "It can either be [entire premise] or [entire hypothesis]" for contradiction pairs. We assembled a list of templates, which can be found in Appendix A, that we used for filtering the dataset of such uninformative explanations. Specifically, we filtered an explanation if its edit distance to one of the templates was below a small threshold. We ran this template detection on the entire dataset and re-annotated the detected explanations.

Our final counts show that entailment pairs were by far the most difficult to obtain proper explanations for. This is firstly due to partial explanations: annotators had an incentive to provide shorter inputs, so they often mentioned only one argument. A second reason is that many of the entailment pairs have the hypothesis as almost a subset of the premise, prompting the annotators to just repeat that as a statement.
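As a concrete illustration of the template filter, the sketch below flags an explanation whose edit distance to any template (with the bracketed slots instantiated by the actual premise and hypothesis) falls under a threshold; the threshold value, the helper names, and the reduced template list are assumptions for illustration, not the exact configuration used for e-SNLI.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Only the three templates quoted in the text; the full list is in Appendix A.
TEMPLATES = {
    "neutral": ["just because {premise} doesn't mean {hypothesis}"],
    "entailment": ["{premise} implies {hypothesis}"],
    "contradiction": ["it can either be {premise} or {hypothesis}"],
}

def is_uninformative(premise, hypothesis, label, explanation, threshold=10):
    """Flag explanations that merely instantiate one of the known templates."""
    for template in TEMPLATES.get(label, []):
        filled = template.format(premise=premise.rstrip("."),
                                 hypothesis=hypothesis.rstrip("."))
        if edit_distance(explanation.lower(), filled.lower()) < threshold:
            return True
    return False
```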
We first present an experiment which demonstrates that a model that can easily rely on artifacts in SNLI to provide correct labels is not able to provide correct explanations as easily. We refer to it as PremiseAgnostic. We then present a series of experiments to elucidate whether models trained on e-SNLI are able to: (i) predict a label and generate an explanation for the predicted label (referred to as PredictAndExplain), (ii) generate an explanation and then predict the label given only the generated explanation (ExplainThenPredict), (iii) learn better universal sentence representations (Represent), and (iv) transfer to out-of-domain NLI datasets (Transfer).

Throughout our experiments, our models follow the architecture presented in Conneau et al. [7], as we build directly on top of their code (https://github.com/facebookresearch/InferSent; we fixed the issue raised in https://github.com/facebookresearch/InferSent/issues/51, namely that the max-pooling was taken over paddings). Therefore, our encoders are bidirectional LSTMs [15] with 2048 hidden units per direction and max-pooling, resulting in a sentence representation dimension of 4096. Our label classifiers are 3-layer MLPs with an internal size of 512 and without non-linearities. For our explanation decoders, we used a simple one-layer LSTM, for which we tried several internal sizes. In order to reduce the vocabulary size for explanation generation, we replaced rare words (counted among premises, hypotheses, and explanations) with an out-of-vocabulary token.
PremiseAgnostic: Predicting explanations from the hypothesis alone

Gururangan et al. [13] show that a neural network that only has access to the hypothesis can predict the correct label a large proportion of the time. We are therefore interested in evaluating how well our explanations can be predicted from hypotheses alone.
Model
We train a bidirectional LSTM with 2048 hidden units per direction and max-pooling to encode the hypothesis, followed by a one-layer LSTM for decoding the explanation. The initial state of the decoder is the vector embedding of the hypothesis, which is also concatenated to the input at every timestep of the decoder, to avoid forgetting.
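A minimal PyTorch sketch of this hypothesis-only architecture is given below. The encoder dimensions follow the paper (2048 hidden units per direction, a 4096-dimensional sentence vector), while the vocabulary size, embedding size, module names, and the linear projection used to initialise a smaller decoder are illustrative assumptions rather than the exact setup.

```python
import torch
import torch.nn as nn

class PremiseAgnosticExplainer(nn.Module):
    """Encode the hypothesis with a BiLSTM + max-pooling, then decode an explanation."""

    def __init__(self, vocab_size=30000, emb_dim=300, enc_dim=2048, dec_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, enc_dim, bidirectional=True, batch_first=True)
        # The 4096-d hypothesis vector is fed both as the initial state and as
        # an extra input at every decoding step, "to avoid forgetting".
        self.init_proj = nn.Linear(2 * enc_dim, dec_dim)
        self.decoder = nn.LSTM(emb_dim + 2 * enc_dim, dec_dim, batch_first=True)
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, hypothesis_ids, explanation_ids):
        enc_out, _ = self.encoder(self.embed(hypothesis_ids))      # (B, T, 4096)
        h_vec, _ = enc_out.max(dim=1)                               # max-pooling -> (B, 4096)
        h0 = torch.tanh(self.init_proj(h_vec)).unsqueeze(0)         # initial decoder state
        c0 = torch.zeros_like(h0)
        expl_emb = self.embed(explanation_ids)                      # (B, L, emb_dim)
        context = h_vec.unsqueeze(1).expand(-1, expl_emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([expl_emb, context], dim=-1), (h0, c0))
        return self.out(dec_out)                                    # logits over the vocabulary
```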
Selection
We consider several internal sizes for the decoder and pick the model that gives the best perplexity on the validation set. We notice that the perplexity strictly decreases when we increase the decoder size; however, for practical reasons, we do not increase the decoder size further.

Results
We then manually look at the first test examples and find that only a small fraction of the generated explanations are correct. We also separately train the same hypothesis-only encoder for label prediction alone and obtain roughly ten times as many correct labels on the same examples. This validates our intuition that it is much more difficult (approx. 10x for this architecture) to rely on spurious correlations to predict correct explanations than to predict correct labels.

PredictAndExplain: Jointly predict a label and generate an explanation for the predicted label
In this experiment, we investigate how the typical architecture employed on SNLI can be enhanced with a module that aims to justify the decisions of the entire network.
Model
We employ the InferSent [7] architecture, where a bidirectional LSTM with max-pooling separately encodes the premise, u, and the hypothesis, v. The vector of features f = [u, v, |u − v|, u ⊙ v] is then passed to the MLP classifier that outputs a distribution over the 3 labels. We add a one-layer LSTM decoder for explanations, which takes the feature vector f both as an initial state and concatenated to the word embedding at each time step. In order to condition the explanation also on the label, we prepend the label as a word (entailment, contradiction, neutral) at the beginning of the explanation. At training time, the gold label is provided, while at test time, we use the label predicted by the classifier. This architecture is depicted in Figure 2.
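The sketch below illustrates the feature construction and the label-prepending trick in PyTorch, on top of precomputed sentence vectors u and v; the class and argument names, and the projection used to initialise the decoder, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EInferSentHead(nn.Module):
    """Classifier + explanation decoder on top of premise/hypothesis sentence vectors."""

    def __init__(self, sent_dim=4096, mlp_dim=512, vocab_size=30000,
                 emb_dim=300, dec_dim=512, n_labels=3):
        super().__init__()
        feat_dim = 4 * sent_dim                               # [u, v, |u - v|, u * v]
        self.classifier = nn.Sequential(nn.Linear(feat_dim, mlp_dim),
                                        nn.Linear(mlp_dim, mlp_dim),
                                        nn.Linear(mlp_dim, n_labels))
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.init_proj = nn.Linear(feat_dim, dec_dim)
        self.decoder = nn.LSTM(emb_dim + feat_dim, dec_dim, batch_first=True)
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, u, v, explanation_ids):
        # explanation_ids is assumed to already start with the label word
        # ("entailment" / "neutral" / "contradiction"): gold at train time,
        # the classifier's prediction at test time.
        f = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        label_logits = self.classifier(f)
        h0 = torch.tanh(self.init_proj(f)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(explanation_ids)
        context = f.unsqueeze(1).expand(-1, emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([emb, context], dim=-1), (h0, c0))
        return label_logits, self.out(dec_out)
```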
Loss

We use negative log-likelihood for both the classification and explanation losses. The explanation loss is much larger in magnitude than the classification loss, due to the summation of negative log-likelihoods over the words of the explanation. We therefore weight the two losses with a coefficient α ∈ [0, 1]. Hence, our overall loss is:

L_total = α L_label + (1 − α) L_explanation    (1)
Figure 2: Overview of the e-InferSent architecture.
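Concretely, the weighted objective in Equation (1) can be computed as in the hedged snippet below; token-level cross-entropy corresponds to the per-word negative log-likelihood, and the padding handling is an assumption.

```python
import torch.nn.functional as F

def e_infersent_loss(label_logits, gold_labels, expl_logits, expl_targets, alpha, pad_id=0):
    """L_total = alpha * L_label + (1 - alpha) * L_explanation (Equation 1)."""
    label_loss = F.cross_entropy(label_logits, gold_labels)
    # Sum the per-word negative log-likelihoods of the explanation, ignoring padding;
    # this summation is why the explanation loss dominates unless it is down-weighted.
    expl_loss = F.cross_entropy(expl_logits.transpose(1, 2), expl_targets,
                                ignore_index=pad_id, reduction="sum") / expl_logits.size(0)
    return alpha * label_loss + (1 - alpha) * expl_loss
```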
Selection
We consider a range of α values and several decoder internal sizes. For this experiment, we choose as model selection criterion the accuracy on the SNLI validation set, because we want to investigate how well a model can generate justifications without sacrificing label accuracy. As future work, one can inspect different trade-offs between accuracy and explanation generation. We select the α and decoder size that produce the best validation accuracy, and compare against InferSent trained with no explanations. We call our model e-InferSent, since it freezes the InferSent architecture and training procedure, and only adds the explanation decoder.
Results
We train InferSent [7] on SNLI with five seeds and report the average test accuracy. Our e-InferSent model obtains essentially the same test accuracy as InferSent, which shows that one can get additional justifications without sacrificing label accuracy. For the generated explanations, we report perplexity and BLEU score. Since we collected three explanations for each example in the validation and test sets, we compute the inter-annotator BLEU score of the third explanation with respect to the first two. For consistency, we used the same two explanations as the only references when computing the BLEU score for the predicted explanations. Given the low inter-annotator score and the fact that the generated explanations almost match the inter-annotator BLEU score, we conclude that this measure is not reliable for our task, and we further rely on human evaluation. Therefore, we manually annotated the first 100 datapoints in the test set (we used the same partial scoring as in Section 3). Since the explanation is conditioned on the predicted label, for incorrect labels the model would not produce a correct explanation. Therefore, we report as correctness score the percentage of correct explanations in the subset of the first 100 examples where the predicted label was correct (80 in this experiment). The resulting percentage is low, but we keep in mind that the selection criterion was only the accuracy of the label classifier and not the perplexity of the explanation. In the next experiment, we show how training (and selecting) only for generating explanations results in higher-quality explanations.
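For reference, multi-reference BLEU of the kind used above can be computed as sketched below, here with NLTK's corpus_bleu as one possible implementation; tokenisation and smoothing choices are assumptions and may differ from the paper's exact setup.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def multi_reference_bleu(generated, references):
    """BLEU of generated explanations against the two human references per example.

    generated  : list of strings (one predicted explanation per example)
    references : list of lists of strings (the two reference explanations per example)
    """
    hyps = [g.lower().split() for g in generated]
    refs = [[r.lower().split() for r in refs_i] for refs_i in references]
    return corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
```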
ExplainThenPredict: Generate an explanation then predict a label

In PredictAndExplain, we conditioned the explanation on the label predicted by the MLP, because we wanted to see how the typical architecture used on SNLI can be adapted to justify its decisions in natural language. However, a more natural approach for solving inference is to think of the explanation first, and to decide on a label based on the explanation. Therefore, in this experiment, we first train a network to generate an explanation given a pair of (premise, hypothesis), and, separately, we train a network to provide a label given an explanation. This is a sensible decomposition for our dataset, due to the following key observation: in the large majority of cases, one can easily detect for which label an explanation has been provided. We highlight that this is not the case in general, as the same explanation can correctly argue for different labels, depending on the premise and hypothesis. For example, the explanation "A woman is a person" would be a correct explanation for the entailment pair ("A woman is in the park", "A person is in the park") as well as for the contradiction pair ("A woman is in the park", "There is no person in the park"). However, there are multiple ways of formulating an explanation. In our example, for the contradiction pair, one could also explain that "There cannot be no person in the park if a woman is in the park", which read alone would allow one to infer that the pair is a contradiction. To support our observation, we train a neural network that, given only an explanation, predicts a label. We use the same bidirectional encoder and MLP classifier as above, and obtain a high accuracy on the SNLI test set.

Models

For predicting an explanation given a pair of (premise, hypothesis), we first train a simple seq2seq model that we call ExplainThenPredictSeq2Seq. Essentially, we keep the architecture of e-InferSent, where we eliminate the classifier by setting α = 0, and we decode the explanation without prepending the label. Secondly, we train an attention model, which we refer to as ExplainThenPredictAttention. Attention mechanisms in neural networks have brought consistent improvements over their non-attention counterparts in various areas, such as computer vision [27], speech [4], and natural language processing [12, 2]. We use the same encoder and decoder as in ExplainThenPredictSeq2Seq, and we add two identical but separate attention modules, over the tokens in the premise and the hypothesis. For details of the attention modules, see Appendix B.
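The two-stage decomposition can be written as a thin wrapper around the two networks, as in the sketch below; explainer and label_from_explanation stand in for the trained explanation generator and the explanation-only classifier, and the greedy decoding helper with its assumed call signatures is illustrative only.

```python
import torch

@torch.no_grad()
def explain_then_predict(explainer, label_from_explanation, premise_ids, hypothesis_ids,
                         bos_id, eos_id, max_len=40):
    """Two-stage inference: generate an explanation, then predict the label from it alone."""
    # Stage 1: greedily decode an explanation conditioned on (premise, hypothesis).
    generated = [bos_id]
    for _ in range(max_len):
        logits = explainer(premise_ids, hypothesis_ids,
                           torch.tensor([generated]))          # (1, t, vocab), assumed signature
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        generated.append(next_id)
    explanation_ids = torch.tensor([generated[1:]])
    # Stage 2: the label classifier sees only the generated explanation.
    label_logits = label_from_explanation(explanation_ids)
    return generated[1:], int(label_logits.argmax(dim=-1))
```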
Selection

Our only hyper-parameter is the internal size of the decoder, for which we try the same set of values as before. Our model selection criterion is the perplexity on the SNLI validation set, and we obtain the best configuration for both ExplainThenPredictSeq2Seq and ExplainThenPredictAttention with the same internal size.

Results
With the described setup, the SNLI test accuracy drops slightly from PredictAndExplain to ExplainThenPredictSeq2Seq and ExplainThenPredictAttention. However, when we again manually annotate the first 100 generated explanations in the test set, we obtain significantly higher percentages of correct explanations for both ExplainThenPredictSeq2Seq and ExplainThenPredictAttention. We note that the attention mechanism indeed significantly increases the quality of the explanations. We also report the perplexity and BLEU score for both models. Our experiment shows that, while sacrificing a bit of performance, we gain better trust that when ExplainThenPredict predicts a correct label, it does so for the right reasons.
Qualitative analysis of explanations
In Table 1, we provide examples of generated explanations from the test set from (a) PredictAndExplain, (b) ExplainThenPredictSeq2Seq, and (c) ExplainThenPredictAttention. At the end of each explanation, we give in brackets the score that we manually allocated as explained in Section 3. We notice that the explanations are mainly on topic for all three models, with minor exceptions, such as the mention of "camouflage" in (1c). We also notice that, even when incorrect, they are sometimes frustratingly close to being correct; for example, explanation (2b) is only one word (out of its 20 words) away from being correct. It is also interesting to inspect the explanations provided when the predicted label is incorrect. For example, in (1a), we see that the network omitted the information of "facing the camera" in the premise and therefore classified the pair as neutral, which is backed up by an otherwise correct explanation in itself. We also see that ExplainThenPredictSeq2Seq correctly classifies this pair as entailment; however, it only motivates 1 out of the 3 reasons why this is so, and it also picks arguably the easiest reason. Interestingly, the attention model (1c) points to the correct evidence but argues that "standing" and "facing a camera" are not enough to conclude "posing for a picture".

Represent: Universal sentence representations
Obtaining universal sentence representations is the task of learning an encoder to provide semantically meaningful fixed-length representations of phrases/sentences. These representations are further used as features in other downstream tasks, particularly for tasks where labeled training data are scarce. In computer vision, pretrained ImageNet-based encoders [10] provide standard image feature extractors for other downstream tasks. However, in natural language processing, there is still no consensus on general-purpose sentence encoders, and it remains an open question on which task and dataset such an encoder should be trained. Traditional approaches make use of very large unsupervised datasets [17, 14], taking weeks to train. Conneau et al. [7] showed that training only on NLI is both more accurate and more time-efficient than training on orders of magnitude larger but unsupervised datasets. Their results constitute a previous state-of-the-art for universal sentence representations and encourage the idea that supervision can be more beneficial than larger but unsupervised datasets. We hypothesize that an additional layer of supervision in the form of natural language explanations should further improve learning of universal sentence representations.
Model
We use our e-InferSent model already trained in PredictAndExplain. While we compare our model with InferSent, which has not been trained on explanations, we want to ensure that eventual improvements are not purely due to the addition of a language model in the decoder network. We therefore introduce a second baseline, InferSentAutoEnc, where instead of decoding explanations, we decode the premise and hypothesis separately from each sentence representation using one shared decoder.

Table 1: Examples of predicted labels and generated explanations from (a) PredictAndExplain, (b) ExplainThenPredictSeq2Seq, and (c) ExplainThenPredictAttention. In brackets, our scores of correctness.

(1) Premise: 3 young man in hoods standing in the middle of a quiet street facing the camera.
Hypothesis: Three hood wearing people pose for a picture.
Gold label: entailment
(a) Predicted label: neutral. Explanation: Just because the men are in the middle of a street doesn't mean they are posing for a picture. [0]
(b) Predicted label: entailment. Explanation: three young men are people. [0.33]
(c) Predicted label: neutral. Explanation: Just because three young man in camouflage standing in the middle of a quiet street facing the camera does not mean they pose for a picture. [0]

(2) Premise: Three firefighter come out of subway station.
Hypothesis: Three firefighters putting out a fire inside of a subway station.
Gold label: neutral
(a) Predicted label: contradiction. Explanation: The firefighters can not be putting out a fire station and putting out a fire at the same time. [0]
(b) Predicted label: neutral. Explanation: The fact that three firemen are putting out of a subway station doesn't imply that they are putting out a fire. [0]
(c) Predicted label: neutral. Explanation: The firefighters may not be putting out a fire inside of the subway station. [1]

(3) Premise: A blond-haired doctor and her African American assistant looking threw new medical manuals.
Hypothesis: A man is eating pb and j.
Gold label: contradiction
(a) Predicted label: contradiction. Explanation: A man is not a woman. [1]
(b) Predicted label: contradiction. Explanation: One can not be looking and eating simultaneously. [0]
(c) Predicted label: contradiction. Explanation: A person can not be looking at a medical and a book at the same time. [0]
Table 2: Transfer results on downstream tasks. For MRPC we report accuracy/F1 score, for STS14 the Pearson/Spearman correlations, for SICK-R the Pearson correlation, and for all the rest their accuracies. Results are the average of 5 runs with different seeds. The standard deviations are shown in brackets, and the best result for every task is indicated in bold. * indicates a significant difference at level 0.05 with respect to the InferSent baseline.
Models: InferSent-SNLI (ours), InferSentAutoEnc, e-InferSent.
Tasks: MR, CR, SUBJ, MPQA, SST2, TREC, MRPC, SICK-E, SICK-R, STS14.

Evaluation metrics
Typically, sentence representations are evaluated by using them as fixed features on top of which shallow classifiers are trained for a series of downstream tasks. Conneau et al. [7] provide an excellent tool, called SentEval, for evaluating sentence representations on 10 diverse tasks: movie reviews (MR), product reviews (CR), subjectivity/objectivity (SUBJ), opinion polarity (MPQA), question type (TREC), sentiment analysis (SST), semantic textual similarity (STS), paraphrase detection (MRPC), entailment (SICK-E), and semantic relatedness (SICK-R). We refer to their work for a more detailed description of each of these tasks and of SentEval, which we use for comparing the quality of the sentence embeddings obtained by additionally providing our explanations on top of the label supervision.
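In practice, such an evaluation is driven entirely by a batcher function that maps sentences to fixed-length vectors. The sketch below follows the interface documented in the SentEval repository (https://github.com/facebookresearch/SentEval); the parameter values, task name strings, and the random placeholder encoder (to be replaced by the trained e-InferSent encoder) are assumptions for illustration.

```python
import numpy as np
import senteval  # https://github.com/facebookresearch/SentEval

EMB_DIM = 4096  # dimensionality of the e-InferSent sentence vectors

def encode(sentence):
    """Placeholder: replace with the trained e-InferSent encoder's encoding call."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(EMB_DIM).astype(np.float32)

def prepare(params, samples):
    return  # nothing to precompute in this sketch

def batcher(params, batch):
    # batch is a list of tokenized sentences; SentEval expects a 2D array of embeddings.
    sentences = [" ".join(tokens) if tokens else "." for tokens in batch]
    return np.vstack([encode(s) for s in sentences])

params = {"task_path": "SentEval/data", "usepytorch": True, "kfold": 10}
se = senteval.engine.SE(params, batcher, prepare)
tasks = ["MR", "CR", "SUBJ", "MPQA", "SST2", "TREC", "MRPC",
         "SICKEntailment", "SICKRelatedness", "STS14"]
results = se.eval(tasks)
```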
Results
In Table 2, we report the average results and standard deviations of e-InferSent, our retrained InferSent model, and the additional InferSentAutoEnc baseline on the downstream tasks mentioned above. To test whether the differences in performance of InferSentAutoEnc and e-InferSent relative to the InferSent baseline are significant, we performed Welch's t-test (using the implementation in scipy.stats.ttest_ind with equal_var=False). We mark with * the results that are significant at the 0.05 level.

We notice that InferSentAutoEnc performs significantly worse than InferSent on 6 tasks and significantly outperforms this baseline on only 2 tasks. This indicates that just adding a language generator can harm performance. Instead, e-InferSent significantly outperforms InferSent on 4 tasks, while it is significantly outperformed on only 1 task. Therefore, we conclude that training with explanations helps the model learn overall better sentence representations.
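The significance test above is a one-liner with SciPy; the sketch below assumes the per-seed scores for a given task are available as lists, and the numbers shown are illustrative placeholders, not results from the paper.

```python
from scipy.stats import ttest_ind

# Per-seed scores for one downstream task (illustrative placeholder numbers, 5 seeds each).
infersent_scores = [79.8, 80.1, 79.6, 80.0, 79.9]
e_infersent_scores = [80.4, 80.6, 80.2, 80.5, 80.3]

# Welch's t-test: unequal variances, as used for the * markers in Table 2.
t_stat, p_value = ttest_ind(e_infersent_scores, infersent_scores, equal_var=False)
significant = p_value < 0.05
```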
Transfer: Transfer without fine-tuning to out-of-domain NLI
Transfer without fine-tuning to out-of-domain entailment datasets is known to exhibit poor performance. For example, Bowman et al. [3] obtained low accuracy when training on SNLI and evaluating on SICK-E [20]. We test how our explanations affect the direct transfer in both label prediction and explanation generation by looking at SICK-E [20] and MultiNLI [26]. The latter includes a diverse range of genres of written and spoken English, as well as test sets for cross-genre transfer.

Model
We again use our already trained e-InferSent model from PredictAndExplain.

Results
In Table 3, we present the performance of e-InferSent and our two baselines when evaluated without fine-tuning on SICK-E and MultiNLI. We notice that the accuracy improvements obtained with e-InferSent are very small. However, e-InferSent additionally provides explanations, which could bring insight into the inner workings of the model. We manually annotated the first 100 explanations of the test sets. The percentage of correct explanations in the subset where the label was predicted correctly was considerably higher for SICK-E than for MultiNLI. We also noticed that the explanations in SICK-E, even when wrong, were generally on-topic and valid statements, while the ones in MultiNLI were generally nonsense or off-topic. Therefore, transfer learning for generating explanations in out-of-domain NLI constitutes challenging future work.

Table 3: The average performance over 5 seeds of e-InferSent and the two baselines on SICK-E and MultiNLI with no fine-tuning. Standard deviations are in parentheses.

Model                   SICK-E         MultiNLI
InferSent-SNLI (ours)   53.27 (1.65)   57 (0.41)
InferSentAutoEnc
e-InferSent
Interpretability
One main direction in interpretability for neural networks is providing extractive justifications, i.e., explanations consisting of subsets of the raw input, such as words or image patches. Extractive techniques can be divided into post-hoc (applied after training) and architecture-incorporated (guiding the training). For example, Ribeiro et al. [24] introduce a post-hoc extractive technique, LIME, that explains the prediction of any classifier via a local linear approximation around the prediction. Alvarez-Melis and Jaakkola [1] introduce a similar approach, but for structured prediction, where a variational autoencoder provides relevant perturbations of the inputs that are then used to infer pairs of input-output tokens that are causally related. While these models provide valuable insight for detecting biases, further model and dataset refinements would have to be made on a case-by-case basis. For example, Gururangan et al. [13] identified a set of biases in SNLI, but noted that their attempts to remove them would give rise to other biases.

Attention-based models, such as [2, 25], offer some degree of interpretability and have been shown to also improve performance on downstream tasks. However, soft attention, the most prominent attention model, often does not learn to single out human-interpretable inputs.

Neither extractive nor attention-based techniques can provide full-sentence explanations of a model's decisions. Moreover, they cannot capture fine-grained relations and asymmetries, especially in a task like recognizing textual entailment. For example, if the words person, woman, mountain, outdoors are extracted as justification, one may not know whether the model correctly learned that "A woman is a person" and not that "A person is a woman", let alone that the model correctly paired (woman, person) and (mountain, outdoors).
Natural language explanations
In our work, we have taken a step further and built a neural network that is able to directly provide full-sentence natural language justifications. There has been little work on incorporating and outputting natural language free-form explanations, mostly due to the lack of appropriate datasets. In this direction, and very similar to our approach, is the recent work by Park et al. [23], who introduce two datasets of natural language explanations for the tasks of visual question answering and activity recognition. Another work in this direction is that of Ling et al. [18], who introduced a dataset of textual justifications for solving math problems and formulated the task in terms of program execution. Nonetheless, their setup is specific to the task of solving math problems and is thus hard to transfer to more general natural language understanding tasks. Jansen et al. [16] provided a dataset of natural language explanation graphs for elementary science questions. However, with far fewer pairs of questions and explanations, their corpus is orders of magnitude smaller than e-SNLI.
Breaking natural language inference

Recently, an increasing amount of analysis has been carried out on the SNLI dataset and on the inner workings of different models trained on it. For example, Dasgupta et al. [9] assembled a dataset to test whether inference models actually capture compositionality beyond the word level. They showed that InferSent sentence embeddings [7] indeed do not exhibit significant compositionality and that downstream models using these sentence representations largely rely on simple heuristics that are ecologically valid in the SNLI corpus. For example, high word overlap between premise and hypothesis usually predicts entailment, while most contradictory sentence pairs have little or no word overlap; negation words also strongly indicate a contradiction. Glockner et al. [11] introduce a toy dataset, BreakingNLI, to test whether natural language inference models capture world knowledge and generalize beyond statistical regularities. To construct BreakingNLI, they modified some of the original SNLI sentences such that they differ by at most one word from the sentences in the training set. Glockner et al. show that models achieving high accuracies on SNLI, such as [21, 22, 6], exhibit dramatically reduced performance on this simpler dataset, while the model of Chen et al. [5] is more robust due to incorporating external knowledge. As the explanations in e-SNLI are mostly self-contained, our dataset provides the precise external knowledge that one requires in order to solve the SNLI inference task. It is therefore a perfect testbed for developing models that incorporate external knowledge from free-form natural language.
Conclusion

We introduced e-SNLI, a large dataset of natural language explanations for the influential task of recognizing textual entailment. To demonstrate the usefulness of e-SNLI, we experimented with various ways of using these explanations for outputting human-interpretable full-sentence justifications of classification decisions. We also investigated the usefulness of these explanations as an additional training signal for learning better universal sentence representations, as well as the transfer capabilities to out-of-domain NLI datasets. In this work, we established a series of baselines using straightforward recurrent neural network architectures for incorporating and generating natural language explanations. We hope that e-SNLI will be valuable for future research on more advanced models that would outperform our baselines.

Finally, we hope that the community will explore the dataset in other directions. For example, we also recorded the highlighted words, which we release with the dataset. Similar to the evaluation performed for visual question answering in Das et al. [8], our highlighted words could provide a source of supervision and evaluation for attention models [25, 22] or post-hoc explanation models where the explanation consists of a subset of the input.
References

[1] Alvarez-Melis, D. and Jaakkola, T. S. (2017). A causal framework for explaining the predictions of black-box sequence-to-sequence models. CoRR, abs/1707.01943.
[2] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
[3] Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. CoRR, abs/1508.05326.
[4] Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015). Listen, attend and spell. CoRR, abs/1508.01211.
[5] Chen, Q., Zhu, X., Ling, Z., Inkpen, D., and Wei, S. (2017). Natural language inference with external knowledge. CoRR, abs/1711.04289.
[6] Chen, Q., Zhu, X., Ling, Z., Wei, S., and Jiang, H. (2016). Enhancing and combining sequential and tree LSTM for natural language inference. CoRR, abs/1609.06038.
[7] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. CoRR, abs/1705.02364.
[8] Das, A., Agrawal, H., Zitnick, C. L., Parikh, D., and Batra, D. (2016). Human attention in visual question answering: Do humans and deep networks look at the same regions? CoRR, abs/1606.03556.
[9] Dasgupta, I., Guo, D., Stuhlmüller, A., Gershman, S. J., and Goodman, N. D. (2018). Evaluating compositionality in sentence embeddings.
[10] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proc. of CVPR.
[11] Glockner, M., Shwartz, V., and Goldberg, Y. (2018). Breaking NLI systems with sentences that require simple lexical inferences. In Proc. of ACL.
[12] Gong, Y., Luo, H., and Zhang, J. (2017). Natural language inference over interaction space. CoRR, abs/1709.04348.
[13] Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. (2018). Annotation artifacts in natural language inference data. In Proc. of NAACL.
[14] Hill, F., Cho, K., and Korhonen, A. (2016). Learning distributed representations of sentences from unlabelled data. CoRR, abs/1602.03483.
[15] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Comput., 9(8):1735–1780.
[16] Jansen, P. A., Wainwright, E., Marmorstein, S., and Morrison, C. T. (2018). WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. CoRR, abs/1802.03052.
[17] Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. (2015). Skip-thought vectors. CoRR, abs/1506.06726.
[18] Ling, W., Yogatama, D., Dyer, C., and Blunsom, P. (2017). Program induction by rationale generation: Learning to solve and explain algebraic word problems. CoRR, abs/1705.04146.
[19] Liu, P., Qiu, X., and Huang, X. (2016). Modelling interaction of sentence pair with coupled-LSTMs. CoRR, abs/1605.05573.
[20] Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., and Zamparelli, R. (2014). A SICK cure for the evaluation of compositional distributional semantic models.
[21] Nie, Y. and Bansal, M. (2017). Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 41–45. Association for Computational Linguistics.
[22] Parikh, A. P., Täckström, O., Das, D., and Uszkoreit, J. (2016). A decomposable attention model for natural language inference. CoRR, abs/1606.01933.
[23] Park, D. H., Hendricks, L. A., Akata, Z., Rohrbach, A., Schiele, B., Darrell, T., and Rohrbach, M. (2018). Multimodal explanations: Justifying decisions and pointing to the evidence. CoRR, abs/1802.08129.
[24] Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. CoRR, abs/1602.04938.
[25] Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kociský, T., and Blunsom, P. (2015). Reasoning about entailment with neural attention. CoRR, abs/1509.06664.
[26] Williams, A., Nangia, N., and Bowman, S. R. (2017). A broad-coverage challenge corpus for sentence understanding through inference. CoRR, abs/1704.05426.
[27] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044.
Appendices
A List of templates to filter uninformative explanations
General templates
Entailment templates
Contradiction templates: "In sentence 1"
B Architecture of ExplainThenPredictAttention

Our attention model ExplainThenPredictAttention is composed of two identical but separate modules for the premise and the hypothesis. We fix the number of attended tokens at the maximum length of a sentence in SNLI. We denote by h^p_t and h^h_t the bidirectional embeddings of the premise and hypothesis at timestep t, and by h^dec_τ the decoder hidden state at timestep τ, which we refer to as the context of the attention.

We use three pairs of linear projections followed by tanh non-linearities, as follows. We project each timestep of the encoder for the premise and the hypothesis:

proj^p_t = tanh(W_p h^p_t + b_p),    proj^h_t = tanh(W_h h^h_t + b_h).

We separately project the context vector, that is, the hidden state of the decoder at each timestep, before taking its dot product with the tokens of the premise and hypothesis:

proj^{c,p}_τ = tanh(W_cp h^dec_τ + b_cp),    proj^{c,h}_τ = tanh(W_ch h^dec_τ + b_ch).

At each decoding timestep τ, we take the dot product between the projections of the context and all the timesteps of the premise and hypothesis, respectively:

fw^{p,τ}_t = ⟨proj^{c,p}_τ, proj^p_t⟩,    fw^{h,τ}_t = ⟨proj^{c,h}_τ, proj^h_t⟩.

The final attention weights are computed from a softmax over the non-normalized weights:

w^{p,τ}_t = softmax_t(fw^{p,τ}_t),    w^{h,τ}_t = softmax_t(fw^{h,τ}_t).

We use another pair of projections for the embeddings of the tokens of the premise and hypothesis before applying the weighted sum:

proj'^p_t = tanh(W'_p h^p_t + b'_p),    proj'^h_t = tanh(W'_h h^h_t + b'_h).

Finally, we compute the weighted sums for the premise and the hypothesis:

p_τ = Σ_t w^{p,τ}_t proj'^p_t,    h_τ = Σ_t w^{h,τ}_t proj'^h_t.

At each timestep τ, we concatenate p_τ and h_τ with the word embedding from the previous timestep τ − 1 as input to the decoder.
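A compact PyTorch rendering of one such attention module is sketched below; the same module would be instantiated twice, once for the premise and once for the hypothesis. Tensor shapes, dimensions, and names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SentenceAttention(nn.Module):
    """One attention module over the token embeddings of a single sentence."""

    def __init__(self, enc_dim=4096, dec_dim=512, att_dim=512):
        super().__init__()
        self.proj_tokens = nn.Linear(enc_dim, att_dim)   # projection of the encoder timesteps
        self.proj_context = nn.Linear(dec_dim, att_dim)  # projection of the decoder state (context)
        self.proj_values = nn.Linear(enc_dim, att_dim)   # second projection, used in the weighted sum

    def forward(self, token_states, dec_state):
        # token_states: (B, T, enc_dim) bidirectional embeddings of the sentence
        # dec_state:    (B, dec_dim) decoder hidden state at the current timestep
        keys = torch.tanh(self.proj_tokens(token_states))              # (B, T, att_dim)
        query = torch.tanh(self.proj_context(dec_state)).unsqueeze(2)  # (B, att_dim, 1)
        scores = torch.bmm(keys, query).squeeze(2)                     # dot products, (B, T)
        weights = torch.softmax(scores, dim=1)                         # attention weights
        values = torch.tanh(self.proj_values(token_states))            # (B, T, att_dim)
        return torch.bmm(weights.unsqueeze(1), values).squeeze(1)      # weighted sum, (B, att_dim)
```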