Exploring Transitivity in Neural NLI Models through Veridicality
Hitomi Yanaka, Koji Mineshima, and Kentaro Inui
RIKEN, Keio University, Tohoku University
[email protected], [email protected], [email protected]

Abstract
Despite the recent success of deep neural networks in natural language processing, the extent to which they can demonstrate human-like generalization capacities for natural language understanding remains unclear. We explore this issue in the domain of natural language inference (NLI), focusing on the transitivity of inference relations, a fundamental property for systematically drawing inferences. A model capturing transitivity can compose basic inference patterns and draw new inferences. We introduce an analysis method using synthetic and naturalistic NLI datasets involving clause-embedding verbs to evaluate whether models can perform transitivity inferences composed of veridical inferences and arbitrary inference types. We find that current NLI models do not perform consistently well on transitivity inference tasks, suggesting that they lack the generalization capacity for drawing composite inferences from provided training examples. The data and code for our analysis are publicly available at https://github.com/verypluming/transitivity.

Introduction

Deep neural networks (DNNs) have shown impressive performance in many natural language processing tasks. In particular, DNN models pretrained with large-scale data such as BERT (Devlin et al., 2019) have achieved high accuracy in various benchmark tasks (Wang et al., 2019a,b), which suggests that they might possess some generalization capacities that are a hallmark of human cognition. However, recent analyses (Talmor and Berant, 2019; Liu et al., 2019; McCoy et al., 2019) have shown that high accuracy on a test set drawn from the same distribution as the training set does not always indicate that the model has obtained the intended ability, so it remains unclear to what extent
DNN models can learn the systematic generalization in natural language from training instances.

Figure 1: Illustration of transitivity inferences composed of two basic inferences, veridical and Boolean. Arrows indicate entailment and arrows with a cross indicate non-entailment. A: Jo knows that Ann and Bob left. B: Ann and Bob left. A′: Jo hopes that Ann and Bob left. C: Ann left.

Central to human-like generalization capacities is the fact that the ability to understand a given sentence is related to the ability to understand other sentences, called the systematicity of human cognition in Fodor and Pylyshyn (1988). Thus, if speakers understand the meaning of the sentence Ann loves Bob, they must also understand the meaning of structurally related sentences such as Bob loves Ann. We explore whether DNN models possess this type of generalization capacity in the domain of natural language inference (NLI), the task of judging whether a premise entails a hypothesis (Dagan et al., 2013; Bowman et al., 2015a).

A key property underlying the systematicity of drawing inferences is the transitivity of inference relations, illustrated in Figure 1. Schematically, if a model learns a basic inference pattern from A to B and one from B to C, it should be able to compose the two patterns to draw a new inference from A to C. If a model lacks this generalization capacity, it must memorize an exponential number of inference combinations independently of basic patterns.

Among the various inference patterns, we focus on transitivity inferences that combine veridical inferences with other types. In veridical inferences, one must distinguish two entailment types. For example, the verb know is called veridical in that "x knows that P" entails that P is true, while the verb hope is called non-veridical since "x hopes that P" does not entail that P is true. Veridical inferences can relatively easily compose transitivity inferences at scale by embedding various inference types under clause-embedding verbs. For instance, as Figure 1 shows, if a model has the ability to perform both Boolean inference and veridical inference, it is desirable that it can also combine the two types to make a chained inference.

Such transitivity inferences are by no means trivial. For instance, if the premise is changed to Jo knows that Ann or Bob left, it does not follow that
Bob left, even though the veridical verb know appears. Models relying on shallow heuristics such as lexical overlap can wrongly predict entailment in this case. To correctly handle such composite inferences, models must capture structural relations between veridical inferences and various kinds of embedded inference.

Previous studies on the generalization capacities of NLI models have addressed how models could learn inferences with various challenging linguistic phenomena (Bowman et al., 2015b; Dasgupta et al., 2018; Geiger et al., 2019, 2020; Yanaka et al., 2019a,b; Richardson et al., 2020). However, these studies have focused on the linguistic phenomena in isolation, and thus do not address how a model could learn the interactions between them. Our aim is to fill this gap by presenting a method for probing the generalization capacity of DNN models performing transitivity inferences.

This study provides three main contributions. First, we create and publicly release two types of NLI datasets for testing model ability to perform transitivity inferences: a fully synthetic dataset that combines veridical inferences and Boolean inferences, and a naturalistic dataset that combines veridical inferences with lexical and structural inferences. Second, we use these datasets to systematically expose models to basic inference patterns and test them on a variety of combinations. This demonstrates that the models lack the ability to capture the transitivity of inference. Third, we investigate whether data augmentation with new combination patterns helps models to learn transitivity. Experiments show that data augmentation improves model performance on similar combinations, regardless of the existence of basic inference patterns in the training set. These results suggest there is much room for improving the generalization capacities of DNN models for combining basic inferential abilities.

Related Work
Transitivity
The transitivity of entailment relations, which derives A → C from A → B and B → C, is incorporated into logic-based NLI systems using automated theorem proving (Abzianidze, 2015; Mineshima et al., 2015). This is a basic property of formal logic, also known as syllogism in traditional logic or the cut rule in proof theory (Troelstra and Schwichtenberg, 2000; van Dalen, 2013). Transitivity inference in its various forms has also been widely studied as a fundamental property of human reasoning in cognitive psychology (Johnson-Laird and Byrne, 1991; Khemlani and Johnson-Laird, 2012). In the context of NLP, previous works have proposed methods for training models with transitivity constraints in multi-hop reasoning tasks (Asai and Hajishirzi, 2020) and temporal relation extraction tasks (Ning et al., 2017). Clark et al. (2020) investigated a transformer's ability to perform a chain of reasoning where the reasoning rules are explicitly given. In this work, we study the ability of models to learn the transitivity of entailment relations from training examples, rather than explicitly providing rules.

Systematicity
There has been extensive discussion of whether neural networks (aka connectionist models) can exhibit the systematicity of cognitive capacities (Fodor and Pylyshyn, 1988; Marcus, 2003). Recent works have explored whether modern neural networks can learn systematicity in semantic parsing tasks (Lake and Baroni, 2017; Baroni, 2020; Kim and Linzen, 2020) and question answering tasks (Sinha et al., 2019), whereas our focus is systematicity in NLI.

In works related to systematicity in NLI, Goodwin et al. (2020), Yanaka et al. (2020), and Geiger et al. (2020) used a manually constructed NLI dataset of monotonicity inferences with and without negation (e.g., The child is not holding plants → The child is not holding flowers) to examine DNN models' generalization capacities. While these approaches concentrate on monotonicity inferences involving quantifiers and negative expressions, our method using veridical inference is general in that it can be applied to any entailment relation that combines basic inference patterns; we generate composite inferences by embedding various types of sentences under clause-embedding verbs.

Fodor and Pylyshyn (1988) distinguished systematicity (roughly, the ability to understand sentences that are structurally related to each other) from productivity (the ability to understand an infinite set of sentences), claiming that systematicity poses a serious challenge to neural network models. Yanaka et al. (2020) tested both the systematicity and the productivity of DNN models with a synthetic dataset of monotonicity inferences for upward (e.g., some, at least three) and downward (e.g., few, at most three) quantifiers, where handling productivity (recursion) makes sentences more involved (e.g., iterated relative clauses and negation). Focusing on systematicity rather than productivity allows testing models with more natural and less complicated data, as compared to the sentences appearing in monotonicity inferences.
Veridicality
Veridical inferences, including those licensed by factive and implicative verbs, have been intensively studied in the literature on semantics and pragmatics (Karttunen and Peters, 1979; Beaver, 2001). Recent work has revealed graded and context-sensitive aspects of veridicality inferences, creating veridicality judgement datasets (de Marneffe et al., 2012; White and Rawlins, 2018; White et al., 2018). While we use only a subset of the veridical predicates discussed in the literature, our method can be extended to more complex inferences, such as factive presupposition.

Ross and Pavlick (2019) presented a naturalistic veridicality dataset and compared the predictions of a BERT-based NLI model with human judgements. These previous studies on veridicality inferences have tended to focus on relations between a whole sentence (e.g., Jo remembered that there was a wild deer jumping a fence) and its embedded material (e.g., There was a wild deer jumping a fence). By contrast, we consider the interactions of veridicality inferences and other inference types (see Section 3.2), including cases where the embedded material is further paraphrased via linguistic phenomena (e.g., Jo remembered that there was a wild deer jumping a fence ⇒ An animal was jumping). We also collect human judgements on our dataset and compare them with model predictions (see Section 4.4).
Probing NLI models
Many studies probing NLI models have found that current models often fail on linguistically challenging (adversarial) inferences (Rozen et al., 2019; Nie et al., 2019; Yanaka et al., 2019a; Richardson et al., 2020), learning undesired biases (Glockner et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Liu et al., 2019) and heuristics (McCoy et al., 2019). Our approach also provides adversarial test sets against such heuristics by considering combinations of veridical inferences and diverse (lexical, structural, and logical) types of inferences.

One way to learn challenging inferences is data augmentation, and prior studies (Yanaka et al., 2019b; Richardson et al., 2020; Min et al., 2020) have shown that data augmentation with synthesized datasets improves performance on challenging linguistic phenomena. However, it remains unclear whether data augmentation can help models learn composite inferences mixing several inference types from training instances. We address this question in Section 4.3.
To investigate whether models can capture transi-tivity, we consider two basic inference patterns andtheir combinations. The first basic pattern, I , isveridical inference. We write f ( s ) → s to de-note a schematic veridical inference, where f isa clause-embedding verb and s is the embeddedclause. For instance, in the case of the inferencepattern A → B in Figure 1, “ Jo knows that x ” cor-responds to f ( x ) and “ Ann and Bob left ” to s .The second basic pattern, I , provides an infer-ence from the embedded material. We denote apremise-hypothesis pair of this second inferenceby s → s . Given two inferences f ( s ) → s in I and s → s in I , we consider a new inference f ( s ) → s , where premise f ( s ) is the same asthat of I and hypothesis s is the same as that of I . See Table 1 and Table 2 for some examples ofinferences f ( s ) → s , s → s , and f ( s ) → s .In this work, we consider binary labels, entailment and non-entailment , denoted by yes and unk , re-spectively. As Table 3 shows, the gold label onthe f ( s ) → s pattern can be determined fromthose of the basic patterns f ( s ) → s and s → s ,following the transitivity of entailment relations.We train models with the first and second pat-terns, f ( s ) → s and s → s , and then test them f ( s ) → s s → s f ( s ) → s Example V yes yes yes f ( s ) : Someone noticed that [Henry and Daniel found Elliot, John and Fred]. s : Henry and Daniel found Elliot, John and Fred. s : Henry found John. NV unk yes unk f ( s ) : Someone expects that [Tom and Ann admire Greg and Fred]. s : Tom and Ann admire Greg and Fred. s : Tom admires Greg. NV unk unk unk f ( s ) : Someone argued that [it was not the case that Greg hated John or Elliot]. s : It was not the case that Greg hated John or Elliot. s : Greg hated John. Table 1: Examples from our fully synthetic transitivity inference datasets. V and NV indicate types of clause-embedding verbs (veridical/non-veridical); yes means entailment and unk means non-entailment . ID f f ( s ) → s s → s f ( s ) → s Example V yes yes yes f ( s ) : Someone realized that [a boy was playing a guitar]. s : A boy was playing a guitar. s : A kid was playing a guitar.2049 V yes unk unk f ( s ) : Someone remembered that [a cat was playing with a device]. s : A cat was playing with a device. s : The boy was enthusiastically playing in the mud.5024 NV unk yes unk f ( s ) : Someone doubts that [the woman is putting makeup on the man]. s : The woman is putting makeup on the man. s : A man’s face is being painted by a woman. Table 2: Examples from our naturalistic transitivity inference datasets. V and NV indicate types of clause-embedding verbs (veridical/non-veridical); yes means entailment and unk means non-entailment . ID indicatesthe original ID of s → s in the SICK dataset. f ( s ) → s s → s f ( s ) → s yes yes yes yes unk unk unk yes unk unk unk unk Table 3: Rule for determining the f ( s ) → s labelfrom the basic patterns f ( s ) → s and s → s . on a set of the composite inferences f ( s ) → s thatcombines them. Note that due to how they are con-structed, the training and test sets do not overlap.Model capable of applying the transitivity inferencefrom f ( s ) → s and s → s to f ( s ) → s shouldconsistently predict the correct label of f ( s ) → s for any combination of f ( s ) → s and s → s . We generate basic inferences f ( s ) → s and s → s and combine them to produce transitiv-ity inferences f ( s ) → s . 
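The rule in Table 3 is simply the conjunction of the two basic labels. As a minimal illustration (a sketch we provide here, not the released dataset-construction code), the gold label of a composite inference can be derived as follows:

```python
def compose_label(veridical_label: str, embedded_label: str) -> str:
    """Gold label of f(s1) -> s2, given the labels of f(s1) -> s1 and s1 -> s2 (Table 3).

    Labels are 'yes' (entailment) or 'unk' (non-entailment); the composite
    inference is entailment only if both basic inferences are.
    """
    assert veridical_label in {"yes", "unk"} and embedded_label in {"yes", "unk"}
    return "yes" if veridical_label == "yes" and embedded_label == "yes" else "unk"

# A veridical verb (yes) combined with a non-entailed embedded pair (unk) yields unk.
print(compose_label("yes", "unk"))  # unk
```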
To test diverse inference patterns, we consider two types of the second basic inference s₁ → s₂: synthesized Boolean inferences and naturalistic inferences using an existing NLI dataset, SICK (Marelli et al., 2014), which contains lexical inferences (e.g., boy → kid in ID 2299 in Table 2) and structural inferences (e.g., the active-passive alternation in ID 5024 in Table 2). The ratio of the gold labels (yes and unk) is fixed in both basic inference sets, and the label ratio of the transitivity test set then follows from the rule in Table 3. We reserve 10% of the basic inference set for the validation set.

Type of f       Verbs
Veridical       realize, acknowledge, remember, note, find, notice, learn, see, reveal, discover, understand, know, admit, recognize, observe
Non-veridical   feel, claim, doubt, hope, predict, imply, suspect, wish, think, believe, hear, expect, estimate, assume, argue
Table 4: Clause-embedding verbs used for our dataset.
Clause-embedding verbs
We focus on clause-embedding verbs that take tensed subordinate clauses. Specifically, we collect 67 verbs appearing in both MegaVeridicality2 (White et al., 2018) and the verb veridicality dataset (Ross and Pavlick, 2019). As Table 4 shows, we select a final set of 30 clause-embedding verbs.

Following a previous study (White et al., 2018), we slot a clause-embedding verb f into a template with the form "Someone f that s" and generate the premise f(s₁) of a veridical inference, so as to avoid confounds introduced by world knowledge and pragmatic inference in the main clause. The clause-embedding verb f is in the past or present tense, and we inflect the verb in the complement s to match the tense of f.

When measuring the extent to which models can learn the transitivity of entailment relations from training instances, it is desirable to determine the gold labels of composite inferences from those of the basic inferences. Thus, we take as the gold standard the labels of the veridical inference datasets predicted by the veridical/non-veridical distinction in lexical semantics. At the same time, veridical inferences are sensitive to context, influenced by world knowledge and pragmatic factors (de Marneffe et al., 2012). Accordingly, we also present additional experiments that take such complexity of veridical inferences into account in Section 4.2.
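The template instantiation can be pictured with a small sketch (ours, not the released code; verb forms are supplied explicitly here, whereas the actual pipeline also inflects the complement to match the tense of f):

```python
# Illustrative subsets of Table 4; the full dataset uses 15 verbs of each type.
VERIDICAL = {"noticed", "realized", "knows", "sees"}
NON_VERIDICAL = {"expects", "doubts", "hoped", "wished"}

def make_veridical_pair(verb_form: str, embedded: str):
    """Build a veridical inference f(s1) -> s1 from the template "Someone f that s"."""
    premise = f"Someone {verb_form} that {embedded[0].lower() + embedded[1:]}"
    label = "yes" if verb_form in VERIDICAL else "unk"
    return (premise, embedded), label

pair, label = make_veridical_pair("noticed", "A boy was playing a guitar.")
print(pair[0], "=>", pair[1], "|", label)
# Someone noticed that a boy was playing a guitar. => A boy was playing a guitar. | yes
```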
Boolean inference

To provide a fully synthetic transitivity inference dataset, we generate Boolean inferences with conjunction, disjunction, and negation. The data generation process is similar to the one in Yanaka et al. (2020): sentences are generated using a context-free grammar (CFG) associated with semantic composition rules in the lambda calculus. We first generate a set of premise sentences using the CFG rules and translate each sentence s₁ into a first-order logic (FOL) formula F₁ in accordance with the semantic composition rules specified in the CFG rules. Appendix A provides the set of CFG rules and semantic composition rules. We randomly select one of the atomic sub-formulas appearing in F₁ and take its positive or negative form, which we denote by F₂. We then convert F₂ to a sentence s₂ using the same grammar and set s₂ as the hypothesis.

The gold label for the inference pair s₁ → s₂ is determined by checking whether formula F₁ entails formula F₂ using an FOL theorem prover. The gold labels for the f(s₁) → s₁ and f(s₁) → s₂ pairs are automatically determined according to the veridicality of the clause-embedding verb and the rule in Table 3, respectively. To restrict the complexity of generated sentences, we set the maximum number of logical connectives appearing in formula F₁ to 6.

Table 1 illustrates examples from the fully synthetic transitivity inference dataset. We generate 3,000 Boolean inference examples s₁ → s₂, 6,000 veridical inference examples f(s₁) → s₁, and 6,000 composite inference examples f(s₁) → s₂.
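The label of s₁ → s₂ thus reduces to a logical entailment check. The paper uses a first-order theorem prover for this (see Appendix A); the sketch below is a simplified stand-in that decides entailment for the propositional part of the fragment by enumerating truth assignments over the atomic formulas, which already covers the conjunction/disjunction/negation examples above:

```python
from itertools import product

# Formulas as nested tuples: ("atom", name), ("not", f), ("and", f, g), ("or", f, g).
def atoms(f):
    return {f[1]} if f[0] == "atom" else set().union(*(atoms(g) for g in f[1:]))

def evaluate(f, assignment):
    op = f[0]
    if op == "atom":
        return assignment[f[1]]
    if op == "not":
        return not evaluate(f[1], assignment)
    if op == "and":
        return evaluate(f[1], assignment) and evaluate(f[2], assignment)
    return evaluate(f[1], assignment) or evaluate(f[2], assignment)  # "or"

def entails(premise, hypothesis):
    """True iff every assignment satisfying the premise also satisfies the hypothesis."""
    names = sorted(atoms(premise) | atoms(hypothesis))
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        if evaluate(premise, a) and not evaluate(hypothesis, a):
            return False
    return True

# "Ann and Bob left" entails "Ann left" (yes); "Ann or Bob left" does not (unk).
ann, bob = ("atom", "left(ann)"), ("atom", "left(bob)")
print(entails(("and", ann, bob), ann))  # True
print(entails(("or", ann, bob), ann))   # False
```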
Naturalistic inference

To generate a naturalistic transitivity inference dataset, we collect examples s₁ → s₂ of naturalistic inference from the SICK dataset, which is constructed from existing sentences (image descriptions given by different people) and covers various lexical and structural phenomena. (1) is an example of a lexical inference (brush → comb) in SICK, whose label is yes.

(1) s₁: A person is brushing a cat.
    s₂: A person is combing the fur of a cat.

By selecting a clause-embedding verb f and an embedded sentence s₁, we generate a new sentence f(s₁). As shown in (2), we construct a veridical inference example f(s₁) → s₁ by setting f(s₁) as the premise and s₁ as the hypothesis.

(2) f(s₁): Someone sees that a person is brushing a cat.
    s₁: A person is brushing a cat. (yes)

Likewise, as in (3), we can obtain a composite inference example f(s₁) → s₂ whose label is yes:

(3) f(s₁): Someone sees that a person is brushing a cat.
    s₂: A person is combing the fur of a cat.

Table 2 illustrates examples from the naturalistic transitivity inference dataset. We sample 1,000 naturalistic inference examples s₁ → s₂ from the SICK training set and obtain 30,000 veridical inference examples f(s₁) → s₁ and 30,000 composite inference examples f(s₁) → s₂.
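Putting (1)-(3) together, each sampled SICK pair can be expanded into one veridical pair and one composite pair per clause-embedding verb. A minimal sketch of this expansion (the helper and its arguments are illustrative, not taken from the released code):

```python
def expand_sick_pair(verb_form: str, verb_is_veridical: bool,
                     s1: str, s2: str, s1_entails_s2: bool):
    """Turn a SICK pair (s1, s2) into f(s1) -> s1 and f(s1) -> s2 examples."""
    f_s1 = f"Someone {verb_form} that {s1[0].lower() + s1[1:]}"
    veridical_label = "yes" if verb_is_veridical else "unk"
    composite_label = "yes" if verb_is_veridical and s1_entails_s2 else "unk"
    return [(f_s1, s1, veridical_label),   # veridical inference f(s1) -> s1
            (f_s1, s2, composite_label)]   # composite inference f(s1) -> s2

for premise, hypothesis, label in expand_sick_pair(
        "sees", True,
        "A person is brushing a cat.",
        "A person is combing the fur of a cat.", True):
    print(f"{label}: {premise}  =>  {hypothesis}")
```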
We analyze whether models trained with the basic inference set can consistently perform composite inferences on the test set. We use two DNN models, BERT and LSTM, which are known to perform well on linguistic phenomena such as subject-verb agreement and on hierarchical and structural probing tasks (Linzen et al., 2016; Weiss et al., 2018; Kuncoro et al., 2018).

In all experiments, we train each model for 25 epochs or until convergence and select the best-performing model based on its accuracy on the validation set. We perform five runs and report the average and standard deviation of their accuracies.

Table 5: Accuracies for the fully synthetic transitivity test set and the validation set. -B indicates a model trained with the basic inference set, -M a model trained with MNLI, and -M&B a model trained with MNLI mixed with the basic inference set. The label yes means entailment, and unk means non-entailment.

Table 6: Accuracies for the naturalistic transitivity test set and the validation set.
LSTM
We use an LSTM (Hochreiter and Schmidhuber, 1997) model, where each premise and hypothesis is processed as a sequence of words by an RNN with LSTM cells, and the final hidden state of each serves as its representation. The model concatenates the premise and hypothesis representations and passes the result to three hidden layers followed by a two-way softmax classifier. The model is initialized with 300-dimensional GloVe vectors (Pennington et al., 2014) and optimized using Adam (Kingma and Ba, 2015). We search over a small set of dropout probabilities on the output.
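A rough PyTorch sketch of this baseline follows (our reconstruction from the description above; the hidden sizes, dropout value, and other details are assumptions rather than the exact configuration):

```python
import torch
import torch.nn as nn

class LSTMNLIModel(nn.Module):
    """Sketch of the LSTM baseline: encode premise and hypothesis separately,
    concatenate their final hidden states, and classify into two labels."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300, num_labels=2, dropout=0.1):
        super().__init__()
        # In the paper the embeddings are initialized with 300-dimensional GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Concatenated representations go through three hidden layers
        # followed by a two-way softmax classifier.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_labels),
        )

    def encode(self, token_ids):
        _, (h_n, _) = self.encoder(self.embedding(token_ids))
        return h_n[-1]                      # final hidden state as the sentence representation

    def forward(self, premise_ids, hypothesis_ids):
        features = torch.cat([self.encode(premise_ids), self.encode(hypothesis_ids)], dim=-1)
        return self.classifier(features)    # logits over {entailment, non-entailment}

# Toy usage with random token ids (batch of 2, sequence length 7)
model = LSTMNLIModel(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 7)), torch.randint(0, 10000, (2, 7)))
print(logits.shape)  # torch.Size([2, 2])
```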
BERT

We use the base-uncased pretrained BERT model (Devlin et al., 2019), fine-tuned for the NLI classification task on the training data in the standard way (we use the PyTorch implementation of BERT released at https://github.com/huggingface/transformers). When fine-tuning BERT, we search over a small set of dropout probabilities on the output, and the other hyperparameters are the same as those commonly used for MultiNLI.
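For concreteness, a minimal fine-tuning sketch with the Transformers library is shown below (the learning rate, the single gradient step, and the label encoding are illustrative, not the exact training configuration):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

premise = "Someone noticed that a boy was playing a guitar."
hypothesis = "A kid was playing a guitar."
label = torch.tensor([0])  # e.g., 0 = entailment (yes), 1 = non-entailment (unk)

# Premise-hypothesis pairs are encoded jointly, as in standard NLI fine-tuning.
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**inputs, labels=label)   # one gradient step as an illustration
outputs.loss.backward()
optimizer.step()

model.eval()
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1)
print(pred)  # predicted label id
```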
We first evaluate whether the models trained with the basic inferences f(s₁) → s₁ and s₁ → s₂ can consistently make judgements on the composite inferences f(s₁) → s₂. As previous work (Ross and Pavlick, 2019) reported that a BERT model trained with the benchmark NLI dataset MultiNLI (MNLI; Williams et al., 2018) is sensitive to verb veridicality, we regard the accuracy of models trained with MNLI as a baseline. We also analyze models trained with MNLI mixed with the basic inference set.

Table 5 shows accuracies for the fully synthetic transitivity test set, which combines veridical and Boolean inferences. Models trained with the basic inference set achieved over 80% accuracy on the test cases, except for the cases where f(s₁) → s₁ is yes and s₁ → s₂ is unk. Table 6 shows accuracies for the naturalistic transitivity test set. Again, models trained with the basic inference set performed substantially below chance on the cases f(s₁) → s₂ where f(s₁) → s₁ is yes and s₁ → s₂ is unk. This suggests that while the models achieve over 80% accuracy on both the f(s₁) → s₁ and s₁ → s₂ validation sets, they do not apply transitivity inference from the inferences f(s₁) → s₁ and s₁ → s₂, but rather predict the label of the composite inference f(s₁) → s₂ by judging whether it is similar to the veridical inference f(s₁) → s₁ in the training set.

The accuracy of models trained with MNLI was low because they predicted yes for many examples whose correct label was unk, as in (4).

(4) f(s₁): Someone wished that John saw Tom or Greg.
    s₂: John saw Tom. (unk)

These models predicted yes for over 80% of the fully synthetic transitivity test set and more than 60% of the naturalistic transitivity test set. These results are consistent with the findings of McCoy et al. (2019), namely, that models trained with MNLI tend to predict entailment when the hypothesis is a subsequence of the premise, as in (4).

When models are trained with MNLI mixed with the basic inference set, they seem to improve performance on the fully synthetic transitivity test set. One reason for this result is that the models might use heuristics to make predictions for some unk examples in the fully synthetic inference set. Error analysis shows that the models tend to predict unk when either the premise or the hypothesis contains a negation, as in (5).

(5) f(s₁): Someone knew that Fred praised Henry or Ann.
    s₂: Fred did not praise Ann. (unk)

These heuristics might be related to annotation artifacts (Gururangan et al., 2018) in MNLI, because an inference example involving negation words tends to be a contradiction. (We use binary labels, entailment/non-entailment, and take contradiction as non-entailment.) Moreover, models can memorize the basic inference set regardless of the existence of MNLI in the training set, so performance seems to be better.

Note that models trained with MNLI mixed with the basic inference set still failed on the naturalistic transitivity inferences f(s₁) → s₂ where f(s₁) → s₁ is yes and s₁ → s₂ is unk. Since the naturalistic basic inference examples s₁ → s₂ contain various linguistic phenomena, models cannot rely on such heuristics for these examples.

Is poor performance on transitivity inference due to overfitting on verbs?

To examine whether the models simply overfit on clause-embedding verbs, we analyze them under two additional settings using the naturalistic transitivity datasets: (I) we use various templates other than "Someone f that s" to generate the main clause in f(s₁), and (II) we flip the gold labels of a randomly sampled 10% of the veridical inference instances f(s₁) → s₁, instead of using gold labels uniquely fixed by verb type. These two settings expose models to more natural evaluation conditions that take into account the context-sensitive nature of veridicality.

For setting (I), we manually select forty main clauses from the verb veridicality dataset (Ross and Pavlick, 2019) and provide additional templates. Table 7 shows examples of the additional templates involving clause-embedding verbs used for generating the veridical inference datasets.

Type            Template
Pronoun         At that moment, we f that s
Pronoun         Then he f that s
Specific group  The customers f that s
Specific group  Some economists f that s
Proper noun     Hanson f that s
Table 7: Examples of additional templates used for generating veridical inference datasets. Here f is a place for a veridical verb and s for an embedded sentence.

Table 8: Accuracies of models in setting (I). (Δ) is the difference from the accuracy in Table 6.

Table 9: Accuracies of models in setting (II). (Δ) is the difference from the accuracy in Table 6.

Table 8 and Table 9 show the results for settings (I) and (II), respectively. These results show the same trends as those in Table 6, indicating that even when we take the complexity of veridical inference into account in our analysis, the models fail to consistently perform composite inferences.

We further hypothesize that even if the current models fail to consistently perform composite inferences, data augmentation with a small number of composite inference examples might allow models to learn transitivity inference. Thus, we evaluate models trained with the basic inferences f(s₁) → s₁ and s₁ → s₂ and with a subset of the composite inferences f(s₁) → s₂ on the naturalistic inference test set. Considering that models fail on composite inferences f(s₁) → s₂ where f is veridical and s₁ → s₂ is unk, we gradually add veridical verbs (e.g., know) one by one to generate an additional training set of composite inferences f(s₁) → s₂ and analyze performance on the test set.

Figure 2: Accuracies of models trained with (a) f(s₁) → s₁, s₁ → s₂, and a subset of f(s₁) → s₂, and (b) f(s₁) → s₁ and a subset of f(s₁) → s₂. I₁ indicates the first basic pattern f(s₁) → s₁ and I₂ indicates the second basic pattern s₁ → s₂. Y means entailment and U means non-entailment. The horizontal axis shows the number of veridical verbs used for the additional training set.

Figure 2(a) shows that this data augmentation improved performance on the test examples f(s₁) → s₂ where f is veridical and s₁ → s₂ is unk, while maintaining accuracy on the remaining examples in the test set. BERT achieved 100% accuracy over the entire test set after adding composite inferences generated from four veridical verbs, whereas in the case of LSTM twelve veridical verbs were needed to achieve the same accuracy.

To determine whether models augmented with composite inference examples learn the ability to combine basic inferences to perform transitivity inference, we analyze the performance of models whose training set does not include the basic inference examples s₁ → s₂. Figure 2(b) shows that models trained with only the basic inference set f(s₁) → s₁ and a subset of the composite inference set f(s₁) → s₂ also had improved accuracy.
This result supports the finding that the models do not combine the basic inferences f(s₁) → s₁ and s₁ → s₂, but rather predict the label of a composite inference f(s₁) → s₂ by judging whether it is similar to inference patterns found in the training set.

To investigate how humans perform on transitivity inference tasks, we collect human judgements for a subset of our naturalistic inference dataset. We asked crowdsourced workers to label 960 transitivity inference examples involving all the clause-embedding verbs in Table 4. Following prior work on crowdsourced NLI datasets (Zhang et al., 2017; Ross and Pavlick, 2019), we instructed raters to label each premise-hypothesis pair with the degree of entailment on a 5-point Likert scale, with 1 meaning the hypothesis is definitely not true given the premise and 5 meaning it is definitely true. We collected three annotations per pair on Amazon Mechanical Turk (see Appendix D for details), and the inter-rater agreement (the Pearson correlation among raters, averaged across both examples and raters) was 0.76. As model predictions are discrete (yes or unk), we discretized the human scores into evenly sized bins, setting yes if the score was 4 or higher and unk if the score was 3 or lower. We took the majority of the three discretized labels as the final human judgement.

Table 10: Comparison between the accuracies of humans and the models trained with the basic inference set.

Table 10 shows that humans generally follow the distinction between veridical and non-veridical verbs traditionally assumed in lexical semantics, as well as the transitivity of entailment relations. In particular, while, as we saw in Section 4.2, the DNN models performed substantially below chance on transitivity inferences where f(s₁) → s₁ is yes and s₁ → s₂ is unk, human performance is near perfect on such inferences.

Interestingly, however, humans tend to predict incorrect labels for transitivity inferences where the verb f is non-veridical (so f(s₁) → s₁ is unk) and the embedded inference s₁ → s₂ is yes. This might be because a natural complement, as in (6), induces veridicality bias (Ross and Pavlick, 2019): no matter whether a complement-taking verb f is veridical or non-veridical, humans tend to decide the truth value of f(s₁) by judging whether its complement s₁ is true. Thus, the judgement for f(s₁) → s₂ coincides with that for s₁ → s₂ in this case.

(6) f(s₁): Someone believed that a man is jumping off a low wall.
    s₁: A man is jumping off a low wall.
    s₂: A man is jumping a wall.

Conclusion

We introduced an analysis method using transitivity inferences for evaluating the systematic generalization capacities of NLI models. We found that current NLI models do not perform consistently well on transitivity inference tasks. Furthermore, the data augmentation analysis suggested that models can memorize composite inference examples, but do not perform the intended transitivity inferences combining basic inference examples.

Overall, our results indicate that despite the impressive performance of DNN models on standard NLI datasets, there remains much room for improving their systematic generalization capacities with respect to combining basic inferential abilities across various linguistic phenomena. Regarding what is necessary for improving systematic generalization, one interesting possibility is explicitly feeding some form of logic-guided transitivity rules to models, which we leave for future work. Our analysis method using transitivity can be an effective tool for further progress in the study of compositional NLI.
Acknowledgement
We thank the three anonymous reviewers for their helpful comments and suggestions. We are also grateful to Masashi Yoshikawa for helpful discussions. This work was partially supported by the RIKEN-AIST Joint Research Fund (feasibility study) and JSPS KAKENHI Grant Number JP20K19868.
References
Lasha Abzianidze. 2015. A tableau prover for natural logic and language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2492-2502.
Akari Asai and Hannaneh Hajishirzi. 2020. Logic-guided data augmentation and regularization for consistent question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5642-5650.
Marco Baroni. 2020. Linguistic generalization and compositionality in modern artificial neural networks. Philosophical Transactions of the Royal Society B, 375(1791):20190307.
David Beaver. 2001. Presupposition and Assertion in Dynamic Semantics. CSLI Publications.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632-642.
Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. 2015b. Recursive neural networks can learn logical semantics. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 12-21.
Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. In Proceedings of the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI).
Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Dirk van Dalen. 2013. Logic and Structure, 5th edition. Springer.
Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J. Gershman, and Noah D. Goodman. 2018. Evaluating compositionality in sentence embeddings. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 1596-1601.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171-4186.
Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3-71.
Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. 2019. Posing fair generalization tasks for natural language inference. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4484-4494.
Atticus Geiger, Kyle Richardson, and Christopher Potts. 2020. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 163-173.
Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 650-655.
Emily Goodwin, Koustuv Sinha, and Timothy J. O'Donnell. 2020. Probing linguistic systematicity. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1958-1969.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 107-112.
Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.
Philip N. Johnson-Laird and Ruth M. J. Byrne. 1991. Deduction. Erlbaum.
Lauri Karttunen and Stanley Peters. 1979. Conventional implicatures. In Choon Kyu Oh and David A. Dineen, editors, Syntax and Semantics 11: Presupposition, pages 1-56. Academic Press.
Sangeet Khemlani and Philip N. Johnson-Laird. 2012. Theories of the syllogism: A meta-analysis. Psychological Bulletin, 138(3):427-457.
Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087-9105.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1426-1436.
Brenden M. Lake and Marco Baroni. 2017. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the International Conference on Machine Learning (ICML).
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics (TACL), 4:521-535.
Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2171-2179.
Gary Marcus. 2003. The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press.
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), pages 216-223.
Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38(2):301-333.
R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3428-3448.
Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. 2020. Syntactic data augmentation increases robustness to inference heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 2339-2352.
Koji Mineshima, Pascual Martínez-Gómez, Yusuke Miyao, and Daisuke Bekki. 2015. Higher-order logical inference with compositional semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2055-2061.
Richard Montague. 1973. The proper treatment of quantification in ordinary English. In Jaakko Hintikka, Julius M. E. Moravcsik, and Patrick Suppes, editors, Approaches to Natural Language, pages 89-224. Reidel, Dordrecht. Reprinted in Richmond H. Thomason (ed.), Formal Philosophy: Selected Papers of Richard Montague, 247-270, 1974, New Haven: Yale University Press.
Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. Analyzing compositionality-sensitivity of NLI models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6867-6874.
Qiang Ning, Zhili Feng, and Dan Roth. 2017. A structured learning approach to temporal relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1027-1037.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (*SEM), pages 180-191.
Kyle Richardson, Hai Hu, Lawrence S. Moss, and Ashish Sabharwal. 2020. Probing natural language inference models through semantic fragments. In Proceedings of the AAAI Conference on Artificial Intelligence.
Alexis Ross and Ellie Pavlick. 2019. How well do NLI models capture verb veridicality? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2230-2240.
Ohad Rozen, Vered Shwartz, Roee Aharoni, and Ido Dagan. 2019. Diversify your datasets: Analyzing generalization via controlled variance in adversarial datasets. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 196-205.
Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. 2019. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4506-4515.
Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4911-4921.
Anne S. Troelstra and Helmut Schwichtenberg. 2000. Basic Proof Theory, 2nd edition. Cambridge University Press.
Masatoshi Tsuchiya. 2018. Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC).
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proceedings of Advances in Neural Information Processing Systems 32 (NIPS), pages 3266-3280.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the International Conference on Learning Representations (ICLR).
Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 740-745.
Aaron Steven White and Kyle Rawlins. 2018. The role of veridicality and factivity in clause selection. In Proceedings of the 48th Annual Meeting of the North East Linguistic Society, Amherst, MA, USA. GLSA Publications.
Aaron Steven White, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2018. Lexicosyntactic inference in neural models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4717-4724.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1112-1122.
Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, and Kentaro Inui. 2020. Do neural models learn systematicity of monotonicity inference in natural language? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 6105-6117.
Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019a. Can neural networks understand monotonicity reasoning? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 31-40.
Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019b. HELP: A dataset for identifying shortcomings of neural models in monotonicity reasoning. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM), pages 250-255.
Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. Transactions of the Association for Computational Linguistics (TACL), 5:379-395.
A Details about the Boolean logic fragment
Table 11 shows the context-free grammar used to generate sentences for Boolean logic reasoning with conjunction, disjunction, and negation. Each rewriting rule is paired with the corresponding semantic composition rule in standard Montagovian semantics to generate the logical form of a sentence (Montague, 1973; Heim and Kratzer, 1998). We use ten items each for proper names (PN), intransitive verbs (IV), and transitive verbs (TV). Each sentence is generated with a verb in the past tense.

Syntax                                   Semantics
S → NP VP_past                           [[S]] = [[NP]]([[VP_past]])
S → S_NEG S                              [[S]] = [[S_NEG]]([[S]])
NP → PN                                  [[NP]] = [[PN]]
NP → PN CON PN                           [[NP]] = λF.[[CON]]([[PN]](F), [[PN]](F))
NP → PN, PN, CON PN                      [[NP]] = λF.[[CON]]([[PN]](F), [[CON]]([[PN]](F), [[PN]](F)))
VP_tense → IV_tense                      [[VP_tense]] = [[IV_tense]]
VP_tense → TV_tense NP                   [[VP_tense]] = λx.[[NP]](λy.[[TV_tense]](x, y))
VP_past → V_NEG VP_base                  [[VP_past]] = λx.[[V_NEG]]([[VP_base]](x))
PN → Ann | Bob | Chris | ...             [[PN]] = λF.F(sym)
IV_base → swim | drink | smoke | ...     [[IV_base]] = λx.sym(x)
IV_past → swam | drank | smoked | ...    [[IV_past]] = λx.sym(x)
TV_base → see | visit | touch | ...      [[TV_base]] = λy.λx.sym(x, y)
TV_past → saw | visited | touched | ...  [[TV_past]] = λy.λx.sym(x, y)
S_NEG → it is not the case that          [[S_NEG]] = λP.¬P
V_NEG → did not                          [[V_NEG]] = λP.¬P
CON → and                                [[CON]] = λP.λQ.P ∧ Q
CON → or                                 [[CON]] = λP.λQ.P ∨ Q
Table 11: Grammar for the Boolean logic fragment with semantic composition. The feature tense on VP is either "base" or "past." In the semantic composition, sym is the place where the symbol (lemma) for a lexical item appears.

For sentences with multiple NPs, we assume the surface-scope reading, where the subject NP takes scope over the object NP. For instance, the sentence Ann and Bob saw Chris or Daniel, where the subject NP is conjunctive and the object NP is disjunctive, has the logical form (see(ann, chris) ∨ see(ann, daniel)) ∧ (see(bob, chris) ∨ see(bob, daniel)).

There are two types of negation, sentential negation (S_NEG) and verbal negation (V_NEG), which are distinguished with respect to their scope interpretation. Thus, the sentence Ann and Bob did not swim has the logical form ¬swim(ann) ∧ ¬swim(bob), while the sentence It is not the case that Ann and Bob swam has the logical form ¬(swim(ann) ∧ swim(bob)).

To generate a premise-hypothesis pair (s₁, s₂) using this Boolean logic fragment, we first generate a sentence s₁ and derive its logical form F₁ using the grammar in Table 11. We then randomly select one of the atomic formulas appearing in F₁, say A, and take its positive (A) or negative (¬A) form, which is in turn converted to the hypothesis sentence s₂ using the same grammar. The gold label (entailment or non-entailment) for the pair (s₁, s₂) is determined by checking whether F₁ logically entails A or ¬A using a first-order logic theorem prover (https://github.com/vprover/vampire).
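To make the composition in Table 11 concrete, here is a toy sketch (not the released grammar code) of the Montague-style interpretation for a coordinated subject, where NP meanings are functions from predicate meanings to formulas:

```python
# Toy semantic composition following Table 11, with formulas as nested tuples.

def pn(name):                    # [[PN]] = λF. F(sym)
    return lambda F: F(name)

def con_and(p, q):               # [[CON and]] = λP λQ. P ∧ Q, over formula values
    return ("and", p, q)

def np_coord(np1, con, np2):     # NP -> PN CON PN: λF. [[CON]]([[PN1]](F), [[PN2]](F))
    return lambda F: con(np1(F), np2(F))

def iv(symbol):                  # [[IV]] = λx. sym(x)
    return lambda x: (symbol, x)

# "Ann and Bob swam"  ==>  swim(ann) ∧ swim(bob)
subject = np_coord(pn("ann"), con_and, pn("bob"))
print(subject(iv("swim")))       # ('and', ('swim', 'ann'), ('swim', 'bob'))
```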
B Training details

In all experiments, we trained models on eight NVIDIA DGX-1 Tesla V100 GPUs. The runtime for training each model was about 1-8 hours, depending on the size of the training set. The order of training instances was shuffled for each model.

C Supplementary results on the random train-test split
To confirm that our transitivity inference dataset is not excessively difficult, we conducted additional experiments using a random train-test split of the transitivity inference (f(s₁) → s₂) datasets. We evaluate models under two settings: (i) models trained with the train split of the transitivity inference datasets and (ii) models trained with the train split mixed with MNLI. Table 12 shows the results on the random train-test split of our fully synthetic transitivity dataset, and Table 13 shows the results on the random train-test split of our naturalistic transitivity dataset. These results show that, regardless of the existence of MNLI in the training set, models achieved perfect performance on our transitivity inference test set under the standard random train-test split setting.

Table 12: Accuracies on the random train-test split of our fully synthetic transitivity dataset. -T indicates a model trained with the train split of the transitivity inference set, and -M&T indicates a model trained with MNLI mixed with the train split.

Table 13: Accuracies on the random train-test split of our naturalistic transitivity test set.

D Human judgement details
Using Amazon Mechanical Turk, we collected human judgements for 960 naturalistic veridical inference examples and 960 naturalistic transitivity inference examples. We required raters to have completed at least 5,000 approved tasks and to have a 99% approval rating. Raters could indicate by a checkbox that one or both sentences did not make sense, but no rater clicked the checkbox. We collected three annotations per pair and paid $0.06 per labeled pair.

Since humans predict incorrect labels for some composite inference examples f(s₁) → s₂ where the verb f is non-veridical, we checked the accuracy of human judgements on the premise-hypothesis pairs f(s₁) → s₁ and f(s₁) → s₂ involving each non-veridical verb, as shown in Table 14. Annotators tended to make incorrect judgements for both f(s₁) → s₁ and f(s₁) → s₂. Regarding the accuracy for each non-veridical verb, annotators correctly drew inferences containing wish and hope, while they tended to draw inferences containing claim and hear incorrectly.

Verb      f(s₁) → s₁   f(s₁) → s₂   MegaV2
argue     34 (-46)     66 (-14)     80
assume    70 (-15)     79 (-6)      85
believe   19 (-71)     59 (-31)     90
claim     15 (-65)     56 (-24)     80
doubt     91 (+9)      96 (+16)     80
estimate  35 (-50)     64 (-21)     85
expect    53 (-27)     70 (-10)     80
feel      42 (-53)     67 (-28)     95
hear      14 (-41)     53 (-2)      55
hope      77 (-8)      92 (+7)      85
imply     18 (-47)     58 (-7)      65
predict   50 (-25)     73 (-2)      75
suspect   48 (-47)     79 (-16)     95
think     18 (-77)     57 (-38)     95
wish      77 (+7)      92 (+22)     70
Table 14: Accuracy (%) of human judgements for each non-veridical verb. MegaV2 indicates the percentage of annotators who judge each verb to be non-veridical in MegaVeridicality2 (White et al., 2018). A number in parentheses is the difference from the accuracy in MegaVeridicality2.

Compared with the previous veridicality dataset MegaVeridicality2 (White et al., 2018), accuracy tended to be lower than in MegaVeridicality2 (for MegaVeridicality2, we calculated the percentage of the majority judgement for each verb over ten different annotations). As (7) shows, a simple complement is used in MegaVeridicality2, while a natural complement like (8) might induce veridicality bias (Ross and Pavlick, 2019), resulting in incorrect judgements on veridical inference. Whether a verb is veridical or non-veridical, humans tend to judge the complement as true.

(7) f(s₁): Someone believed that something happened.
    s₁: Something happened.

(8) f(s₁): Someone believed that a man is jumping off a low wall.
    s₁: A man is jumping off a low wall.
    s₂: A man is jumping a wall.
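As described at the beginning of this appendix, each pair receives three 5-point ratings that are discretized and aggregated by majority vote. A small sketch of that aggregation (assumed, not the authors' script) is:

```python
from collections import Counter

def discretize(score: int) -> str:
    """Scores of 4 or 5 count as entailment (yes); 3 or lower as non-entailment (unk)."""
    return "yes" if score >= 4 else "unk"

def human_label(scores):
    """Majority vote over the discretized labels of the three raters."""
    labels = [discretize(s) for s in scores]
    return Counter(labels).most_common(1)[0][0]

print(human_label([5, 4, 2]))  # yes
print(human_label([3, 2, 5]))  # unk
```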
E Supplementary results with data augmentation

In Section 4.3, we gradually added a subset of the composite inferences f(s₁) → s₂ involving a veridical verb (e.g., know) to the training set and evaluated the performance of models on the naturalistic inference test set. We also evaluated the performance of models under two further conditions: (a) models trained with the basic inference set s₁ → s₂ and a subset of the composite inference set f(s₁) → s₂, and (b) models trained with only a subset of the composite inference set f(s₁) → s₂.

Figure 3: Accuracies of models trained with (a) s₁ → s₂ and a subset of f(s₁) → s₂, and (b) a subset of f(s₁) → s₂. I₁ is the first basic pattern f(s₁) → s₁ and I₂ is the second basic pattern s₁ → s₂. Y indicates entailment and U indicates non-entailment. The horizontal axis shows the number of veridical verbs used for the additional training set.

Figure 3(a) shows that the models significantly improved accuracy on composite inferences, except for the test examples f(s₁) → s₂ whose label differs from that of
s₁ → s₂. Moreover, their performance was maintained even without composite inference examples in the training set. This indicates that the models predict labels for a composite inference example only by judging whether it is similar to the basic inference examples in the training set.

Figure 3(b) shows the results when models are trained only with a subset of the composite inference set f(s₁) → s₂. As non-veridical verbs are not included in the training set in this setting, the models predict labels for composite inferences involving non-veridical verbs by judging whether they are similar to the composite inferences involving veridical verbs in the training set. The models thus fail on composite inference examples f(s₁) → s₂ where f is non-veridical and s₁ → s₂ is yes.