Attentive Tree-structured Network for Monotonicity Reasoning
Zeming Chen
Computer Science Department
Rose-Hulman Institute of Technology
5500 Wabash Ave, Terre Haute, IN, USA
[email protected]
Abstract
Many state-of-the-art neural models designed for monotonicity reasoning perform poorly on downward inference. To address this shortcoming, we developed an attentive tree-structured neural network. It consists of a tree-based long short-term memory network (Tree-LSTM) with soft attention, designed to model the syntactic parse-tree information of the sentence pair in a reasoning task. A self-attentive aggregator is used for aligning the representations of the premise and the hypothesis. We present our model and evaluate it on the Monotonicity Entailment Dataset (MED). We show that our model outperforms existing models on MED and attempt to explain why.
In this paper, we present and evaluate a tree-structured long short-term memory (LSTM) network in which the syntactic information of a sentence is encoded and the alignment between the premise-hypothesis pair is calculated through a self-attention mechanism. Our work builds on the Child-Sum Tree-LSTM of Tai et al. (2015). We evaluate our model on several datasets to show that it performs well on both upward and downward inference. In particular, our model demonstrates good performance on downward inference, which is a difficult task for most NLI models.

Natural language inference (NLI), also known as recognizing textual entailment (RTE), is one of the important benchmark tasks for natural language understanding. Many other language tasks can benefit from NLI, such as question answering, text summarization, and machine reading comprehension. The goal of NLI is to determine whether a given premise P semantically entails a given hypothesis H (Dagan et al., 2013). Consider the example below:

• P: An Irishman won the Nobel prize for literature.
• H: An Irishman won the Nobel prize.

The hypothesis can be inferred from the premise, and therefore the premise entails the hypothesis. To arrive at a correct determination, an NLI model often needs to perform different inferences, including various types of lexical and logical inferences. In this paper, we are concerned with monotonicity reasoning, a type of logical inference that is based on word replacement. Below is an example of monotonicity reasoning:
1. (a) All students↓ carry a MacBook↑.
   (b) All students carry a laptop.
   (c) All new students carry a MacBook.

2. (a) Not all new students↑ carry a laptop.
   (b) Not all students carry a laptop.
An upward entailing phrase (↑) allows an inference from (1a) to (1b), where the more general concept laptop replaces the more specific MacBook. A downward entailing phrase (↓) allows an inference from (1a) to (1c), where the more specific context new students replaces the word students. The direction of the monotonicity can be reversed by adding a downward entailing phrase like "not"; thus (2a) entails (2b).

Recently, Yanaka et al. (2019a) constructed a new dataset called the Monotonicity Entailment Dataset (MED). The purpose of that dataset is to evaluate the ability of a neural inference model to perform monotonicity reasoning; it is the first dataset created for that purpose. While many neural language models have shown state-of-the-art performance on large annotated NLI datasets such as the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015a; Chen et al., 2017; Parikh et al., 2016), many of these models did not perform well on monotonicity reasoning. In particular, they had low accuracy on downward monotonicity inference. Additionally, most of the state-of-the-art inference models that do well on upward monotonicity inference perform poorly on downward inference (Yanaka et al., 2019a).

Existing work in this area has adopted recursive tree-structured neural networks for natural language inference. Bowman et al. (2015b) proposed tree-structured neural tensor networks (TreeRNTNs) that can learn representations to correctly identify logical relationships such as entailment. Zhou et al. (2016) extended recursive neural tensor networks to a recursive long short-term memory network, a tree-LSTM, which combines the advantages of the recursive neural network structure and the sequential recurrent neural network structure. The tree-LSTM can learn memory cells that reflect the historical memories of the descendant cells and thus improves the model's ability to process long-distance interactions over hierarchies, such as language parse information. Parikh et al. (2016) proposed a simple decomposable attention model for natural language inference. Their model relies on attention to decompose the problem into sub-problems that can be solved separately and in parallel. Chen et al. (2017) proposed the Enhanced Sequential Inference Model (ESIM) for the natural language inference task. It incorporates a sequential LSTM encoder with syntactic parsing information from a tree-LSTM structure to form a hybrid neural inference model. They found that incorporating the parsing information improves the performance of the model. A new type of inference model that relies on external knowledge, the knowledge-based inference model (KIM), was introduced by Chen et al. (2018). It incorporates external knowledge into the co-attention, local inference collection, and inference composition components of a neural NLI model. The KIM model achieved state-of-the-art performance on the SNLI and MNLI datasets.
In this section we present an attentive tree-structured network (AttentiveTreeNet) with self-attention-based aggregation. The model is composed of the following main components: an input sentence embedding, an attentive tree-LSTM encoder, a self-attention aggregator, and a multi-layer perceptron (MLP) classifier. Figure 1 shows the architecture of our model.

Figure 1: Architecture of our model.

Given an input sentence pair, consisting of a premise P and a hypothesis H, the objective of the model is to determine whether P entails H. The model takes four inputs: the word embeddings of the premise and the hypothesis, and the dependency parse trees of the premise and the hypothesis. The embeddings of P and H are initialized with pre-trained word embeddings; the parse trees are produced by a dependency parser. The model forms a Siamese neural network structure (Mueller and Thyagarajan, 2016), in which the premise and the hypothesis are passed through a pair of identical tree-LSTMs that share the same parameters and weights. The main idea is to find a function that maps the input sentences into a target space in which the semantic distance between them can be approximated.
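This overall Siamese wiring can be summarized in a short PyTorch sketch. The class name, dimensions, and classifier layout below are illustrative assumptions rather than the author's released implementation; the encoder and aggregator modules are treated as black boxes here and are sketched later in this section.

import torch
import torch.nn as nn

class AttentiveTreeNet(nn.Module):
    """Siamese wiring: weight-shared encoder and aggregator, MLP classifier."""

    def __init__(self, encoder: nn.Module, aggregator: nn.Module,
                 feature_dim: int, hidden_dim: int = 300, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder        # attentive tree-LSTM, shared by P and H
        self.aggregator = aggregator  # self-attentive aggregator, also shared
        # Three-layer MLP over the matching features; the softmax over the
        # logits is applied by the loss function during training.
        self.classifier = nn.Sequential(
            nn.Linear(4 * feature_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, premise_embs, premise_tree, hyp_embs, hyp_tree):
        # Identical (weight-shared) encoders process premise and hypothesis.
        h_p = self.encoder(premise_embs, premise_tree)  # node states of P
        h_h = self.encoder(hyp_embs, hyp_tree)          # node states of H
        f_p = self.aggregator(h_p)                      # sentence vector for P
        f_h = self.aggregator(h_h)                      # sentence vector for H
        # Matching features: concatenation, absolute distance, element-wise product.
        f_r = torch.cat([f_p, f_h, torch.abs(f_p - f_h), f_p * f_h], dim=-1)
        return self.classifier(f_r)                     # entailment logits

The 4 × feature_dim classifier input reflects the four matching features that are concatenated before classification, as described later in this section.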
We employ Child-Sum Tree-LSTMs (Tai et al., 2015) as the basic building blocks of our model. A standard sequential LSTM network only permits sequential information propagation. However, the linguistic principle of compositionality states that an expression's meaning is derived from the meanings of its parts and from the way they are syntactically combined (Partee, 2007). A tree-structured LSTM network allows each LSTM unit to incorporate information from multiple child units, which takes advantage of the fact that sentences are syntactically formed as bottom-up tree structures.

A Child-Sum Tree-LSTM is a type of tree-LSTM whose units condition their components on the sum of their children's hidden states. While a standard sequential LSTM network computes the current hidden state from the current input and the previous hidden state, a Child-Sum Tree-LSTM computes the hidden state from the input and the hidden states of an arbitrary number of child nodes. This property allows the relation representations of non-leaf nodes to be computed recursively by composing the relations of their children, which can be viewed as natural logic for a neural model (MacCartney and Manning, 2009; Zhao et al., 2016). Using the child-sum tree structure is beneficial in interpreting the entailment relations between parts of the two sentences.

When encoding a sentence in a forward manner, hidden states are passed recursively in a bottom-up fashion. The information flow in each LSTM cell is controlled by a gating mechanism similar to the one in a sequential LSTM cell. For a node with input x, children 1, ..., n, child hidden states h_k, and child memory cells c_k, the computations in the cell are as follows:

h̃ = Σ_{1≤k≤n} h_k,
i = σ(W^(i) x + U^(i) h̃ + b^(i)),
o = σ(W^(o) x + U^(o) h̃ + b^(o)),
u = tanh(W^(u) x + U^(u) h̃ + b^(u)),
f_k = σ(W^(f) x + U^(f) h_k + b^(f)),
c = i ⊙ u + Σ_{1≤k≤n} f_k ⊙ c_k,
h = o ⊙ tanh(c)
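As a concrete illustration of these update equations, the following is a minimal sketch of a single Child-Sum Tree-LSTM node update in PyTorch. It assumes un-batched tensors and a separate driver that walks the parse tree bottom-up; it follows the equations above rather than the author's actual code.

import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # W(*) act on the node input x, U(*) act on child hidden states.
        self.W_iou = nn.Linear(input_dim, 3 * hidden_dim)
        self.U_iou = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        self.W_f = nn.Linear(input_dim, hidden_dim)
        self.U_f = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (input_dim,); child_h, child_c: (n_children, hidden_dim).
        # For a leaf node, child_h and child_c are empty (0, hidden_dim) tensors.
        h_tilde = child_h.sum(dim=0)                        # h~ = sum_k h_k
        i, o, u = (self.W_iou(x) + self.U_iou(h_tilde)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))  # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)                # c = i ⊙ u + sum_k f_k ⊙ c_k
        h = o * torch.tanh(c)                               # h = o ⊙ tanh(c)
        return h, c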
Attentive Tree-LSTM. In our model, the standard tree-LSTM is extended to an attentive tree-LSTM (Zhou et al., 2016) by incorporating an attention mechanism into the LSTM cell. In a sentence, some words are more related to the overall context than others. The benefit of applying attention is that it accounts for this semantic relevance by weighting each child according to how relevant that child is to the given context: a child node that is more relevant to the context of the sentence receives a higher weight, and a child node that is not relevant receives a lower weight.

To apply the attention mechanism, a common soft-attention layer is used in the model. The layer receives a set of hidden states {h_1, h_2, ..., h_n} and an external vector s, which is a vector representation of the sentence produced by a layer of sequential LSTM. The layer then computes a weight α_k for each hidden state and sums the weighted hidden states to output the context vector g. Below are the equations of the soft-attention layer:

m_k = tanh(W^(m) h_k + U^(m) s),
α_k = exp(w^⊤ m_k) / Σ_{j=1}^{n} exp(w^⊤ m_j),
g = Σ_{1≤k≤n} α_k h_k

A new previous hidden state is then computed through the transformation h̃ = tanh(W^(a) g + b^(a)). Figure 2 illustrates the standard tree-LSTM cell and the attentive tree-LSTM cell.

After both the premise and the hypothesis are encoded through the tree-LSTM, the hidden states of each tree's nodes are concatenated into a pair of matrices H_p and H_h and passed to a self-attentive aggregator (Figure 3).

Figure 3: Detailed view of the self-attention aggregator.

The aggregator contains a multi-hop self-attention mechanism (Lin et al., 2017). A sentence has multiple components, such as groups of related words and phrases, that together form the overall context, especially for long sentences. By performing multiple hops of attention, the model obtains multiple attention distributions that each focus on a different part of the sentence. Given a matrix H, the self-attention mechanism performs multiple hops of attention and outputs an annotation matrix A consisting of the weight vector from each hop. A is calculated by a 2-layer multi-layer perceptron (MLP) followed by a softmax function:

A = softmax(W_{s2} tanh(W_{s1} H^⊤))

The annotation matrix A is then multiplied by the hidden-state matrix H to obtain a context matrix M = AH. In the model, there is a pair of context matrices M_p and M_h. A batched matrix product and a tanh function are then applied to the context matrices with a trainable weight W_f to obtain a pair of output matrices F_p and F_h:

F_p = tanh(bmm(M_p, W_f)),
F_h = tanh(bmm(M_h, W_f))
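A minimal sketch of this multi-hop self-attention step is given below. It assumes batched node-hidden-state matrices; the dimension names (attention_dim, hops, out_dim) and the final flattening of F into a single sentence vector are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class SelfAttentiveAggregator(nn.Module):
    def __init__(self, hidden_dim: int, attention_dim: int = 150,
                 hops: int = 15, out_dim: int = 300):
        super().__init__()
        # Two-layer MLP that produces one attention weight vector per hop.
        self.W_s1 = nn.Linear(hidden_dim, attention_dim, bias=False)
        self.W_s2 = nn.Linear(attention_dim, hops, bias=False)
        # Trainable weight used in the batched product mapping M to F.
        self.W_f = nn.Parameter(torch.randn(hidden_dim, out_dim) * 0.02)
        self.dropout = nn.Dropout(0.5)

    def forward(self, H):
        # H: (batch, nodes, hidden_dim), the hidden states of all tree nodes.
        scores = self.W_s2(torch.tanh(self.W_s1(H)))       # (batch, nodes, hops)
        A = torch.softmax(scores, dim=1)                   # normalize over nodes
        A = A.transpose(1, 2)                              # (batch, hops, nodes)
        M = torch.bmm(A, H)                                # context matrix M = A H
        F = torch.tanh(torch.matmul(M, self.W_f))          # F = tanh(bmm(M, W_f))
        return self.dropout(F.flatten(start_dim=1))        # hops folded into one vector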
To aggregate F_p and F_h, we follow Conneau et al. (2017)'s generic NLI training scheme, which includes three matching methods: (i) a concatenation of F_p and F_h, (ii) the absolute distance between F_p and F_h, and (iii) the element-wise product of F_p and F_h. The results of the three methods are concatenated into F_r as a factor of the semantic relation between the two sentences, measuring how close the two vector representations of the sentence pair are in the target space. This relatedness information helps the classifier determine whether the hypothesis is entailed by the premise:

F_r = [F_p; F_h; |F_p − F_h|; F_p ⊙ F_h]

The relation factor F_r is fed to a classic three-layer MLP classifier. The final prediction is a probability p_θ representing the degree to which the hypothesis is entailed by the premise. It is calculated with a softmax function, a standard activation function used to compute the probability of the input belonging to each category in a multi-way classification task:

Y_1 = ReLU(W^(1) F_r + b^(1)),
Y_2 = σ(W^(2) Y_1 + b^(2)),
p_θ = softmax(W^(3) Y_2 + b^(3))

For classification, the binary cross-entropy loss is used as the objective function:

−Σ_c y(X, c) log(p(c | X)),

where y is the binary indicator (0 or 1) of whether label c is the correct class for input X.

Six different types of training data are used to train our model. Initially, we used the HELP dataset (Yanaka et al., 2019b). HELP is a dataset for learning entailment with lexical and logical phenomena. It embodies a combination of lexical and logical inferences focusing on monotonicity and consists of 36K sentence pairs, including pairs for upward monotone, downward monotone, non-monotone, conjunction, and disjunction. Next, we trained our model on the Multi-Genre NLI Corpus (MNLI) (Williams et al., 2018). MNLI contains 433k pairs of sentences annotated with textual entailment information and covers a wide range of genres of spoken and written language. The majority of its training examples are upward monotone. To provide more balanced training data, we combined a subset of the MNLI dataset with the HELP dataset to reduce the effect of the large number of downward monotone examples in HELP; we call this combined training data HELP+SubMNLI. The fourth training set combines the HELP+SubMNLI data with the training set for simple monotonicity from Richardson et al. (2019)'s Semantic Fragments. The fifth combines HELP+SubMNLI with the training set for hard monotonicity from Semantic Fragments. Finally, the last training set combines HELP+SubMNLI with the training sets for both simple and hard monotonicity from Semantic Fragments.

To validate our model's ability for monotonicity reasoning and to evaluate its performance on upward and downward inference, we used the Monotonicity Entailment Dataset (MED) (Yanaka et al., 2019a), which is designed to examine a model's ability to perform monotonicity reasoning. MED contains 5382 premise-hypothesis pairs, including 1820 upward inference examples, 3270 downward inference examples, and 292 non-monotone examples. The sentences in MED cover a variety of linguistic phenomena, including lexical knowledge, reverse, conjunction, disjunction, conditionals, and negative polarity items. We removed sentence pairs with the label "contradict" from the MNLI dataset since the test dataset MED and the training dataset HELP do not contain that label. We furthermore tested our model on the simple and hard monotonicity fragment test sets from Semantic Fragments.

Word embeddings are a common way to represent words when training neural networks (Mikolov et al., 2013). To train our model we used Stanford's pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize the word embeddings. The Stanford Dependency Parser (Chen and Manning, 2014) was used to parse each sentence in the dataset. The model is trained with the Adam optimizer (Kingma and Ba, 2014), which is computationally efficient and helps a model converge quickly to an optimal result. A standard learning rate for Adam, 0.001, is used. Dropout with a standard rate of 0.5 is applied to the feed-forward layers in the self-attention aggregator and the classifier to reduce over-fitting. For the number of hops of self-attention, we used the default of 15 hops. The evaluation metric is accuracy. The system is implemented in PyTorch, a common deep learning framework, and is trained on a GPU for 20 epochs.
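Put together, the training setup described above corresponds to a loop roughly like the following sketch. The data loader, batch format, and model constructor are hypothetical placeholders, and a two-class cross-entropy loss is used here as the softmax equivalent of the binary cross-entropy objective; this is not the author's training script.

import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, device="cuda", epochs=20):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # standard Adam setup
    criterion = nn.CrossEntropyLoss()  # two-way entailment/neutral classification
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for premise, p_tree, hypothesis, h_tree, label in train_loader:
            optimizer.zero_grad()
            logits = model(premise.to(device), p_tree,
                           hypothesis.to(device), h_tree)
            loss = criterion(logits, label.to(device))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total_loss / len(train_loader):.4f}")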
In this section, we evaluate our model's ability to perform monotonicity reasoning. Table 1 shows a comparison of the performance of different models on the Monotonicity Entailment Dataset (MED), including our model. The data for all models except ours come from Yanaka et al. (2019a), who developed the MED dataset.

Model                          Train Data     Upward  Downward  None  All
BiMPM (Wang et al., 2017)      SNLI           53.5    57.6      27.4  54.6
ESIM (Chen et al., 2017)       SNLI           71.1    45.2      41.8  53.8
DeComp (Parikh et al., 2016)   SNLI           66.1    42.1      –     –
AttentiveTreeNet (ours)        HELP+SubMNLI   81.4    74.5      –     75.7

Table 1: Accuracy of our model and other state-of-the-art NLI models evaluated on MED.

Our model achieves an overall accuracy of 75.7%, which outperforms all other models, even a state-of-the-art language model like BERT. Table 1 also shows the ability of different models to perform upward and downward inference. Our attentive tree model performs better on downward inference than the other models, with an accuracy of 74.5%. Our model's performance on upward inference outperforms all other models except BERT; however, the upward inference accuracy of our model (81.4) is very close to that of BERT (82.7). We believe the good performance on upward and downward inference is due to considering parse tree information. Furthermore, the accuracy on upward inference increased significantly when the model was trained on a combination of HELP and MNLI (HELP+SubMNLI) rather than on HELP alone: the accuracy increased from 55.7 to 81.4, while the downward accuracy did not change much. This suggests that adding MNLI to HELP does reduce the effect of the large number of downward monotone examples in the HELP dataset and thus improves the model's ability on upward inference.

To demonstrate the robustness of our model, we experimented with training it on various datasets. First, the model was trained on the HELP dataset alone. The overall accuracy was 66.0%, which outperformed the other models in Table 1 except BERT trained with HELP+SubMNLI and our model trained with HELP+SubMNLI. Even on downward inference alone, our model trained on HELP outperforms all other models, with an accuracy of 72.6%, except our model trained with HELP+SubMNLI. This result indicates that, with a rich set of downward monotone examples, the model can learn to better predict downward inference problems.

We then trained a model on the MNLI dataset alone, which contains a large number of upward inference examples and only rarely downward inference examples. The model reached an overall accuracy of 58.6%, which is still higher than that of most models in Table 1. Interestingly, the model's performance on downward inference is still better than its performance on upward inference, even though the training dataset contains a large number of upward monotone examples. This suggests that the model is relatively insensitive to significant changes in the training data, possibly because of the multiple dropout layers added to the aggregator and the classifier, which force the model to learn more robust features.
As Table 1 shows, compared to BERT trained on MNLI alone, our model trained on MNLI alone performs better on downward inference than BERT's performance reported by Yanaka et al. (2019a).

Finally, we trained our model on a combination of the MNLI dataset and the HELP dataset (HELP+SubMNLI). Because of the large number of upward training examples in MNLI, we suspected that the combination would alleviate the effects of the skew in the training data and as such increase the accuracy for upward inference. We selected 20% of the complete MNLI dataset because of the long training period. As the results in Table 1 show, our model still performs well on downward inference, with 74.5% accuracy, and it also shows a significant improvement on upward inference, with an accuracy of 81.4%. The overall performance also increased substantially, to 75.7%. Compared to the results of BERT trained with HELP+MNLI from Yanaka et al. (2019a), our model performs better on both upward and downward inference and achieves a higher overall accuracy. These results validate our hypothesis that training on a combination of upward and downward monotone sentences can help the model achieve good performance on both upward and downward monotone inference, and that the use of AttentiveTreeNet is a good choice.

To further evaluate which parts of the model contribute the most to monotonicity reasoning, we performed several ablation tests. The ablated models were trained on HELP and HELP+SubMNLI separately and evaluated on the MED dataset. The results are shown in Table 2. We focus our evaluation on the HELP+SubMNLI data.

Model                            Training Data   Upward  Downward  None  All
Full Model w/ vector-concat      HELP            55.7    72.6      57.9  66.0
(1) –Self-Attentive Aggregator   HELP            65.1    67.1      53.7  65.7
(2) –Tree-LSTM                   HELP            36.6    65.5      94.8  49.5
(3) Full Model w/ mean-dist      HELP            59.3    71.2      46.2  65.9
Full Model w/ vector-concat      HELP+SubMNLI    81.4    74.5      –     75.7
Full Model w/ mean-dist          HELP+SubMNLI    68.9    73.7      –     –

Table 2: Accuracy of the ablation tests trained on HELP and HELP+SubMNLI and tested on MED. Three ablation tests were performed: (i) removing the self-attentive aggregator (–Self-Attentive Aggregator), (ii) replacing the tree-LSTM with a regular LSTM (–Tree-LSTM), and (iii) using mean distance as a matching method (Full Model w/ mean-dist). The final model (Full Model w/ vector-concat) uses a concatenation of the sentence vectors as one of the matching methods instead of mean distance.

For ablation test 1, we removed the self-attentive aggregator and built the feature vector for classification directly from the tree-LSTM encoder output. As Table 2 (–Self-Attentive Aggregator) shows, the model trained on HELP+SubMNLI suffers a significant 6.6 percentage point drop in overall accuracy, a 10.9 percentage point drop in upward inference accuracy, and a 7.6 percentage point drop in downward inference accuracy. These results suggest that the self-attentive aggregator is an important component of the model that cannot be removed.

For ablation test 2, we replaced the tree-LSTM encoder with a standard LSTM encoder. Here, we see an even larger drop in performance. As Table 2 (–Tree-LSTM) shows, the model trained on HELP+SubMNLI suffers a large 17.1 percentage point drop in overall accuracy, a 26.7 percentage point drop in upward inference accuracy, and a 14.1 percentage point drop in downward inference accuracy. Based on these results, replacing the tree-LSTM with a standard LSTM has a significant negative impact on the model's monotonicity reasoning performance; the tree-LSTM is thus a major component of the model that cannot be replaced.

For ablation test 3, we compared two matching methods for aggregating the two sentence vectors.
In our final model (Full Model w/ vector-concat), we updated the matching method to follow the generic NLI training scheme of Conneau et al. (2017): the two sentence vectors are concatenated with their absolute distance and their element-wise product to form the input vector for the classifier. We compared its performance with that of our original model (Full Model w/ mean-dist), which contains the tree-LSTM encoder, the self-attentive aggregator, and the concatenation of an absolute distance, an element-wise product, and a mean distance as the input vector for the classifier. For this ablation test, the results in Table 2 (Full Model w/ mean-dist) are mixed, yet important. While the overall accuracy decreases only slightly, by 2.7 percentage points, and the downward inference accuracy decreases by only 0.8 percentage points, the accuracy for upward inference decreases by a significant 12.5 percentage points. We believe these results justify the use of the concatenation of the sentence vector pair.

Overall, the removal of the tree-LSTM encoder affected the model's performance the most. We therefore conclude that the tree-LSTM encoder contributes the most to the model's performance on monotonicity reasoning.

Additional Testings

To check whether our pre-trained models generalize to other monotonicity datasets, and to see whether the model can easily be trained to master a new dataset while retaining its performance on the original benchmark, we conducted some additional tests. We tested our pre-trained models on the Semantic Fragments test sets, which provide a more in-depth test of an NLI model's handling of semantic phenomena (Richardson et al., 2019). Since our model focuses on monotonicity reasoning, we only selected the simple and hard monotonicity fragments for testing. Additionally, since our models are pre-trained on datasets that contain only two labels, "Entailment" and "Neutral", we removed sentence pairs with the third label "contradict" from the test dataset.

Table 3 shows the results of our testing.

Training Data                                      SF    HF    MED
Pre-Trained Models
  HELP                                             57.0  56.8  66.0
  HELP+SubMNLI                                     46.0  63.0  75.7
Re-trained Models w/ SF training fragments
  HELP+frag                                        –     –     –
  HELP+SubMNLI+frag                                –     –     –
Re-trained Models w/ HF training fragments
  HELP+frag                                        –     –     –
  HELP+SubMNLI+frag                                –     –     –
Re-trained Models w/ SF and HF training fragments
  HELP+frag                                        –     –     –
  HELP+SubMNLI+frag                                –     –     –

Table 3: Results of the model tested on MED, the simple monotonicity fragments test set (SF), and the hard monotonicity fragments test set (HF) from the Semantic Fragments dataset. The table includes four subsections: (i) test accuracy on the three test sets using models pre-trained on HELP and HELP+SubMNLI; (ii) test accuracy using the models re-trained after adding the simple monotonicity training set to HELP and HELP+SubMNLI; (iii) test accuracy using the models re-trained after adding the hard monotonicity training set to HELP and HELP+SubMNLI; (iv) test accuracy using the models re-trained after adding both the simple and hard monotonicity training sets to HELP and HELP+SubMNLI.
While we show the results for both the HELP and the HELP+SubMNLI data sets, we will again focus our discussion on the results obtained with the HELP+SubMNLI data set. The top portion of Table 3 shows that the model trained on just HELP+SubMNLI performs poorly on the simple and hard monotonicity fragments. This performance is on par with that of other state-of-the-art models (Richardson et al., 2019).

The first middle portion of Table 3 shows our model's performance when only the simple training fragments were added to the HELP+SubMNLI training set. As the data show, the model masters the simple monotonicity reasoning tests, does well on the hard monotonicity reasoning tests, and retains its accuracy on the original benchmark MED.

The second middle portion of Table 3 shows our model's performance when only the hard training fragments were added to the HELP+SubMNLI training set. In this case, the model masters the hard monotonicity reasoning tests, does well on the simple monotonicity reasoning tests, and again retains its accuracy on the original benchmark MED.

The bottom portion of Table 3 shows our model's performance when both the simple and hard training fragments were added to the HELP+SubMNLI training set. As the results show, the model masters both the simple and hard monotonicity reasoning tests while retaining its accuracy on the original benchmark MED. Overall, the results show that the model trained on the fragments generalizes to both simple and hard monotonicity reasoning.

In this paper, we presented an attentive tree-structured network for monotonicity reasoning. Our model combines a tree-structured LSTM network with a self-attention mechanism, a combination that is also promising for future natural language inference models, to incorporate the syntactic structure of a sentence and improve sentence-level monotonicity reasoning. We evaluated our model and showed that it achieves better accuracy on monotonicity reasoning than other inference models. In particular, our model performs significantly better on downward inference than the others. We interpret the results of the experiments as supporting the thesis that using the parse tree of a sentence is helpful in inferring the entailment relation.

Future research on the attentive tree network might extend the tree-LSTM architecture by replacing the LSTM cell with newer language models that perform much better on a variety of natural language processing tasks; one such model is the transformer. Furthermore, future work might investigate how different attention mechanisms affect a model's performance.

Acknowledgments

We thank Michael Wollowski for reading and giving feedback on drafts and revisions of this paper. We also thank the anonymous reviewers for providing helpful suggestions and feedback.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. 2015b. Recursive neural networks can learn logical semantics. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 12–21, Beijing, China. Association for Computational Linguistics.

Danqi Chen and Christopher Manning. 2014.
A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.

Yufei Chen, Sheng Huang, Fang Wang, Junjie Cao, Weiwei Sun, and Xiaojun Wan. 2018. Neural maximum subgraph parsing for cross-domain semantic dependency analysis. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 562–572, Brussels, Belgium. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. ArXiv, abs/1703.03130.

Bill MacCartney and Christopher D. Manning. 2009. An extended model of natural logic. In Proceedings of the Eighth International Conference on Computational Semantics, pages 140–156, Tilburg, The Netherlands. Association for Computational Linguistics.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2786–2792. AAAI Press.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.

Barbara Partee. 2007. Compositionality and coercion in semantics: The dynamics of adjective meaning.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Kyle Richardson, Hai Hu, Lawrence S. Moss, and Ashish Sabharwal. 2019. Probing natural language inference models through semantic fragments.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China. Association for Computational Linguistics.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 4144–4150.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019a. Can neural networks understand monotonicity reasoning? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 31–40, Florence, Italy. Association for Computational Linguistics.

Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019b. HELP: A dataset for identifying shortcomings of neural models in monotonicity reasoning. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 250–255, Minneapolis, Minnesota. Association for Computational Linguistics.

Kai Zhao, Liang Huang, and Mingbo Ma. 2016. Textual entailment with structured attentions and composition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2248–2258, Osaka, Japan. The COLING 2016 Organizing Committee.

Yao Zhou, Cong Liu, and Yan Pan. 2016. Modelling sentence pairs with tree-structured attentive encoder. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan.