Natural Language Inference in Context -- Investigating Contextual Reasoning over Long Texts
Hanmeng Liu, Leyang Cui, Jian Liu, Yue Zhang
Zhejiang University, Fudan University, Westlake University
[email protected], [email protected], [email protected], [email protected]
Abstract
Natural language inference (NLI) is a fundamental NLP task, investigating the entailment relationship between two texts. Popular NLI datasets present the task at the sentence level. While adequate for testing semantic representations, they fall short for testing contextual reasoning over long texts, which is a natural part of the human inference process. We introduce ConTRoL, a new dataset for ConTextual Reasoning over Long Texts. Consisting of 8,325 expert-designed "context-hypothesis" pairs with gold labels, ConTRoL is a passage-level NLI dataset with a focus on complex contextual reasoning types such as logical reasoning. It is derived from competitive selection and recruitment tests (verbal reasoning tests) for police recruitment, with expert-level quality. Compared with previous NLI benchmarks, the materials in ConTRoL are much more challenging, involving a range of reasoning types. Empirical results show that state-of-the-art language models perform far worse than educated humans. Our dataset can also serve as a testing set for downstream tasks like checking factual correctness of summaries.

Introduction
Natural languages are powerful tools for reasoning. In NLP, natural language inference (NLI) has attracted surging research interest (Bowman et al. 2015; Williams, Nangia, and Bowman 2018; Bhagavatula et al. 2019). The task is to determine whether a hypothesis h can reasonably be inferred from a premise p. Thanks to the generalizability of the NLI framework (i.e., nearly all questions about meaningfulness in language can be reduced to questions of entailment and contradiction in context), NLI can serve as a proxy to general tasks such as natural language understanding (NLU). As a result, the NLI task is constantly employed as a testing ground for learning sentence representations as well as evaluating language models, with the expectation of benefiting downstream applications.

Large-scale NLI datasets have been collected via crowdsourcing. Existing benchmarks (Bowman et al. 2015; Williams, Nangia, and Bowman 2018; Dagan, Glickman, and Magnini 2006; Khot, Sabharwal, and Clark 2018a) handle the task at the sentence level, generating labelled sentence pairs by probing into the essence of lexical and compositional semantics. These benchmarks explore rich features of sentence meaning, testing various aspects of semantic representation. With the advance of contextualized embeddings such as BERT (Devlin et al. 2019), pre-trained language models achieve competitive results. The state-of-the-art models can even reach human-level performance.

Contextual reasoning is essential to the process of human cognition, where inference is made based on contextual information and a collection of facts (Giunchiglia 1992). Inferring hidden facts from context is an indispensable element of human language understanding. Contextual reasoning is typically performed at the passage level, where multiple steps may be necessary for inferring facts from given evidence. It has been investigated by NLP tasks such as machine reading (Lai et al. 2017; Sun et al. 2019) and retrieval-based dialogue (Wu et al. 2017). However, dominant NLI benchmarks (Bowman et al. 2015; Williams, Nangia, and Bowman 2018) investigate the relationship of two sentences, with relatively less attention being paid to the exploration of grounded logical inference (Bhagavatula et al. 2019; Clark, Tafjord, and Richardson).

We investigate contextual reasoning for NLI by making a dataset that consists of 8,325 instances. One example is shown in Figure 1. In this example, the premise consists of several facts concerning a set of shows, which can serve as a context for evidence integration and reasoning. The truthfulness of the hypotheses is determined by reasoning over multiple sentences. Various types of contextual reasoning are considered in the dataset, with more examples being shown in Figure 2.

[Figure 1: An example of the ConTRoL dataset (a check mark indicates the correct answer).
P: Ten new television shows appeared during the month of September. Five of the shows were sitcoms, three were hour-long dramas, and two were news-magazine shows. By January, only seven of these new shows were still on the air. Five of the shows that remained were sitcoms.
H1: At least one of the shows that were cancelled was an hour-long drama. (Entailment / Contradiction / Neutral)
H2: There is no hour-long drama remained on the air. (Entailment / Contradiction / Neutral)
H3: Television viewers prefer sitcoms over hour-long dramas. (Entailment / Contradiction / Neutral)]

Dataset | Task | Reasoning | Context | Source
SQuAD (Rajpurkar et al. 2016) | Reading Comprehension | ✓ | Passage | Wikipedia
WIKIHOP (Welbl, Stenetorp, and Riedel 2017) | Reading Comprehension | ✓ | Document | Wikipedia
HOTPOTQA (Yang et al. 2018) | Reading Comprehension | ✓ | Document | Wikipedia
Cosmos QA (Huang et al. 2019) | Reading Comprehension | ✓ | Passage | Weblog
Social IQA (Sap et al. 2019) | Reading Comprehension | ✗ | Sentence | Social
WINOGRANDE (Sakaguchi et al. 2019) | Coreference Resolution | ✗ | Sentence | Diverse
CommonsenseQA (Talmor et al. 2019) | Reading Comprehension | ✗ | Sentence | Diverse
MuTual (Cui et al. 2020b) | Next Utterance Prediction | ✓ | Dialogue | Exam
ReClor (Yu et al. 2020) | Reading Comprehension | ✓ | Passage | Exam
LogiQA (Liu et al. 2020) | Reading Comprehension | ✓ | Passage | Exam
RTE (Dagan, Glickman, and Magnini 2005) | Natural Language Inference | ✗ | Sentence | Diverse
SNLI (Bowman et al. 2015) | Natural Language Inference | ✗ | Sentence | Captioning
WNLI (Wang et al. 2018) | Natural Language Inference | ✗ | Sentence | Fiction
QNLI (Wang et al. 2018) | Natural Language Inference | ✗ | Sentence | Wikipedia
MultiNLI (Williams, Nangia, and Bowman 2018) | Natural Language Inference | ✗ | Sentence | Diverse
Dialogue NLI (Welleck et al. 2018) | Natural Language Inference | ✗ | Sentence | Persona-chat
SciTaiL (Khot, Sabharwal, and Clark 2018a) | Natural Language Inference | ✗ | Sentence | Science
Adversarial NLI (Nie et al. 2019) | Natural Language Inference | ✗ | Paragraph | Diverse
AlphaNLI (Bhagavatula et al. 2019) | Natural Language Inference | ✗ | Sentence | Diverse
ConTRoL | Natural Language Inference | ✓ | Passage | Exam
Table 1: Comparison between our dataset and existing benchmarks. "Reasoning" refers to contextual reasoning.

We name our open-domain dataset
ConTextual Reasoning over Long Texts (ConTRoL), which is a passage-level natural language inference dataset with gold-label data. It differs from existing NLI datasets in the following three main aspects: (1) the materials are sourced from verbal reasoning exams, which are expert-designed rather than crowdsourced; (2) they inspect the abilities of various reasoning types; (3) the contexts are more complex than in previous datasets, with longer spans.

We evaluate state-of-the-art NLI models to establish baseline performances for ConTRoL. Experimental results demonstrate a significant gap between machine and human ceiling performance. Detailed analysis is given to shed light on future research. Our dataset and results are released at https://anonymous.

Related Work
Natural Language Inference
The task of text entailment was introduced in the PASCAL Recognizing Textual Entailment (RTE) challenges (Dagan, Glickman, and Magnini 2006), which deal with the relationship of sentence pairs. In the third RTE challenge (Giampiccolo et al. 2007), a very limited number of longer texts with multiple sentences were incorporated for more comprehensive scenarios. This shares a similar idea to our work, yet that challenge does not give multi-sentence materials at scale for detailed study.

Recently, the most widely used NLI benchmarks include the Stanford Natural Language Inference (SNLI) dataset (Bowman et al. 2015) and the subsequently expanded MultiNLI (Williams, Nangia, and Bowman 2018), which brings sentences of various genres into the original SNLI. MultiNLI is included in the GLUE benchmark (Wang et al. 2018) and is widely used in evaluating language models' performance. Other NLI datasets include Question-answering NLI (QNLI) (Wang et al. 2018), Winograd NLI (WNLI) (Wang et al. 2018) and SciTail (Khot, Sabharwal, and Clark 2018b), which focus on different aspects of knowledge. While all the above datasets are at the sentence level, we investigate NLI for long texts.

Dialogue NLI (Welleck et al. 2018) features a persona-based dialogue structure for making inference on the current utterance based on previous dialogue history. Similar to our dataset, discourses involve multi-sentence contexts as premises. However, they do not consider relationships that require more than two sentences to express, nor is logical reasoning explored.

Adversarial NLI (Nie et al. 2019) introduces an iterative, adversarial human-and-model-in-the-loop training method to collect a large-scale NLI dataset. It holds the simple intuition that longer contexts lead to harder examples, which coincides with our idea to some extent. The Adversarial NLI dataset is similar to ours in that longer contexts are considered in the premises. However, we differ in context length and reasoning types. The context of our dataset is much longer, with multiple paragraphs being involved. In contrast, Adversarial NLI has single-paragraph contexts only. In addition, it does not test logical reasoning, which is the main focus of ConTRoL. To our knowledge, we are the first to introduce a passage-level NLI dataset requiring comprehensive grounded logical reasoning.

AlphaNLI (Bhagavatula et al. 2019) explores the problem of abductive reasoning. It asks for the most plausible explanation given observations from two narrative contexts. Similar to our dataset, investigating the nature of human reasoning is the target of AlphaNLI. However, the AlphaNLI challenge resembles the classical formulation of abductive reasoning, which is different from the reasoning types we consider in ConTRoL. Compared with their dataset, our work differs in two aspects. First, we consider more reasoning types. Second, ConTRoL is a multi-paragraph NLI dataset with human-written inputs.

[Table 2: Statistics of ConTRoL (partially recovered). Construction method: exams; context type: passage; labels: true / false / cannot say.]

Contextual Reasoning
Long texts with multiple paragraphs have been explored in reading comprehension. In particular, there have been challenges that examine evidence integration over multiple text passages (Welbl, Stenetorp, and Riedel 2017; Yang et al. 2018; Rajpurkar, Jia, and Liang 2018), and challenges that focus on commonsense reasoning (Talmor et al. 2019; Cui et al. 2020b), including social commonsense (Sap et al. 2019) and external knowledge (Huang et al. 2019; Sakaguchi et al. 2019). Different from these datasets, ConTRoL examines more complex contextual reasoning types such as logical reasoning.

There have been reading comprehension datasets that examine logical reasoning. LogiQA (Liu et al. 2020) is sourced from public service exams. It focuses on linguistic reasoning questions, typically featured with a question and four possible answers. ReClor (Yu et al. 2020) is a reading comprehension dataset that is sourced from the GMAT and LSAT tests. Similar to our dataset, these datasets examine a range of different logical reasoning types. Different from these benchmarks, ConTRoL takes the form of NLI, which is a more fundamental linguistic task and relevant to different downstream tasks. The correlations and differences between existing datasets are shown in Table 1.
Dataset
Crowdsourcing has been a widely adopted practice for developing large-scale NLI datasets (Bowman et al. 2015; Williams, Nangia, and Bowman 2018; Nie et al. 2019). However, producing a high-quality dataset addressing complex logical reasoning can be difficult for crowdsource workers. Annotation artefacts exist in crowdsourced datasets, for the annotation protocols encourage workers to adopt heuristics to generate hypotheses quickly and efficiently (Gururangan et al. 2018). To avoid such issues, we source our dataset from examinations, in particular senior aptitude tests (verbal reasoning tests), which are designed by experts.
Data Collection and Statistics
We collect our data from publicly available online practice tests, which include verbal logical reasoning tests in the Police Initial Recruitment Test (PIRT), verbal reasoning tests used by the Medical College Admission Test (MCAT) and University Clinical Aptitude Test (UCAT), as well as verbal aptitude tests adopted in corporations' online employee recruitment and selection tests. Unlike reading comprehension tests, which can be diverse both in question types and options, questions in the original verbal reasoning tests are similar in structure to NLI tests, where a premise and a hypothesis are given, and the answer is a choice from three options: true, false and cannot say. This corresponds to the three-label setting of the NLI task, and we can easily convert the three answer choices into ENTAILMENT, CONTRADICTION and NEUTRAL, respectively.

The verbal reasoning tests require exam-takers to comprehend meaning and significance, assess logical strength, make valid inferences, and identify a valid summary, interpretation or conclusion. The subjects of the passages are drawn from a range of fields, such as current affairs, business, science, the environment, economics, history, meteorology, health and education. The questions are of high quality and advanced in difficulty, used in exams such as police initial selection and other highly intellectual candidate recruitment practices.

The detailed statistics of ConTRoL are shown in Table 2. After removing all duplicated questions, we obtain 8,325 context-hypothesis pairs. We also calculate the lexical overlap between context and hypothesis, finding only 4.87% overlap for the ENTAILMENT relationship and 5.49% for the CONTRADICTION relationship. This suggests that ConTRoL is difficult to solve by plain lexical matching.
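The label conversion above, together with the lexical-overlap statistic, can be sketched as follows (the overlap definition shown, shared word types over context word types, is one plausible reading of the statistic; the toy strings are not from ConTRoL):

```python
# Map the three verbal-reasoning answer choices onto the standard NLI labels.
ANSWER_TO_LABEL = {
    "true": "ENTAILMENT",
    "false": "CONTRADICTION",
    "cannot say": "NEUTRAL",
}

def lexical_overlap(context, hypothesis):
    """Fraction of context word types that also occur in the hypothesis."""
    ctx = set(context.lower().split())
    hyp = set(hypothesis.lower().split())
    return len(ctx & hyp) / len(ctx) if ctx else 0.0
```

A low overlap score, as reported above, indicates that the gold label cannot be recovered by surface word matching alone.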
Data Format
The data format of ConTRoL follows existing NLI benchmarks (Bowman et al. 2015; Williams, Nangia, and Bowman 2018), where each instance contains a premise, a hypothesis, and a label from ENTAILMENT, NEUTRAL and CONTRADICTION. Different from existing datasets, the premises are much longer, consisting of one or more paragraphs. In addition, for each premise, three or more hypotheses are given, which is another distinction from former NLI datasets.
Reasoning Types
We manually categorize the test instances by reasoning type, which can be described as follows:

• Coreferential Reasoning over Long Texts
Coreferential reasoning (Ye et al. 2020) is a form of reasoning over multiple mentions. Long texts can accommodate complex relationships between noun phrases, which makes coreferential reasoning crucial for the coherent understanding of texts.

• Verbal Logical Reasoning
Verbal logical reasoning (Liu et al. 2020) is the ability to examine, analyze, and critically evaluate arguments as they occur in ordinary language. In contrast to formal logical reasoning, most of which uses abstract diagrammatical cues, verbal logical reasoning concerns the logical inference of human language. Deep logical reasoning can be necessary for comprehending long texts.

• Temporal and Mathematical Reasoning
This type involves time and sequential cues of events, and requires the ability to reason about time and perform the necessary mathematical calculations. Temporal reasoning (Nakhimovsky 1987) is the process of extracting temporal cues and combining them into a coherent temporal view. Various types of temporal information can be found in ConTRoL.

• Information Integration over Paragraphs
Multi-step reasoning (Liu and Gardner 2020; Chen, ting Lin, and Durrett 2019; Welbl, Stenetorp, and Riedel 2017) is the ability to retrieve and combine information from multiple paragraphs or multiple documents. For each hypothesis, readers find the most relevant paragraphs in a premise through an iterative (multi-step) process between the contexts and the hypotheses.

• Analytical Reasoning
Analytical reasoning (Williams et al. 2019) is the ability, in problem solving, to consider a group of facts and rules and determine the validity of new facts. The fact sets are based on a single paragraph or multiple paragraphs, reflecting detailed analyses of relationships and sets of constraints. Reasoning is based on what is required, what is permissible, and what is prohibited given the scenario.

Examples of the above reasoning types can be found in Figure 2. Among all the reasoning types, logical reasoning takes 36.2% of the test instances, followed by information integration, which takes 32.6%. The proportions of coreferential reasoning, analytical reasoning and temporal reasoning are 26.0%, 12.8% and 12.4%, respectively. It is also worth noticing that one context-hypothesis pair may contain more than one reasoning type, under which circumstance we take the most significant one in the statistics.

[Figure 2: Reasoning types in ConTRoL, with an example context-hypothesis pair for each type (reasoning clues are highlighted in the context). Distribution: logical reasoning 36.2%, information integration 32.6%, coreferential reasoning 26.0%, analytical reasoning 12.8%, temporal reasoning 12.4%.]
Models
We establish several strong baseline methods using the state-of-the-art pre-trained language models.
Pre-trained Language Models
BERT (Devlin et al. 2019) is a Transformer-based (Vaswani et al. 2017) language model. During pre-training, BERT uses a masked language modeling objective. The basic idea is to train a model to make use of bidirectional context information for predicting a masked token, so that linguistic knowledge can be collected from large texts. It has been shown that such a language model contains certain degrees of syntactic (Goldberg 2019), semantic (Clark et al. 2019), commonsense (Cui et al. 2020a) and logical reasoning (Clark, Tafjord, and Richardson) knowledge.

[Table 3 (partially recovered): Main results. Columns: overall Acc and Avg. F1, then Precision / Recall / F1 for Entailment, Neutral and Contradiction.
Human: 87.06, 93.15 | 94.83, 95.65, 95.24 | 93.33, 91.21, 92.26 | 93.02, 90.91, 91.95
Ceiling: 94.40, 97.26 | 99.16, 99.16, 99.16 | 97.72, 93.75, 95.69 | 96.09, 97.79, 96.93
BERT-base: 47.39, 46.22 | 43.84, 54.40, 42.45 | 39.67, 51.07, 50.21 | 41.65, 52.68, 46.00
BERT-large: 50.62, 49.49 | 45.15, 59.32, 45.96 | 44.21, 53.52, 53.19 | 44.68, 56.27, 49.31
RoBERTa: 45.90, 45.67 | 40.99, 51.24, 45.38 | (remaining rows and values lost in extraction)]

[Figure 3: The model structures of BERT and BART for NLI ("E" represents ENTAILMENT, "N" represents NEUTRAL, "C" represents CONTRADICTION). BERT encodes [CLS] premise [SEP] hypothesis [SEP]; BART feeds the pair to both its encoder and decoder.]
RoBERTa (Liu et al. 2019) extends BERT using a dynamic masking method.
XLNet (Yang et al. 2019) Instead of a bidirectional Transformer, XLNet uses Transformer-XL (Dai et al. 2019) as its main structure. Avoiding several limitations faced by BERT, XLNet uses an autoregressive language model which models long-term dependencies beyond a fixed context.
Longformer (Beltagy, Peters, and Cohan 2020) Traditional self-attention operations are unable to process long sequences, as their cost scales quadratically with the sequence length. The aforementioned Transformer-based models constrain the input to 512 tokens. To address this limitation, Longformer adopts sliding-window attention combined with global attention to replace the self-attention mechanism in pre-trained Transformers.
BART (Lewis et al. 2020) is a denoising autoencoder for pre-training sequence-to-sequence models by combining bidirectional and auto-regressive Transformers.
NLI Model
The NLI model structures of BART and the BERT-based models are illustrated in Figure 3. For BERT-based models (i.e., BERT, RoBERTa, XLNet and Longformer), following Devlin et al. (2019), given a premise p and a hypothesis h, we concatenate the premise-hypothesis pair as a new sequence [CLS] + p + [SEP] + h + [SEP], where [CLS] and [SEP] are the special classification and separator tokens. After encoding by the pre-trained model, the last layer's hidden representation of the [CLS] token is fed into an MLP + softmax for classification. For BART, we feed the same sequence to both the encoder and the decoder, using the last hidden state for classification. The class with the highest probability is chosen as the model prediction.

Implementation Details
We randomly split the dataset into training, development, and test sets with the ratio 8:1:1. All models are trained for 10 epochs. We find hyper-parameters using grid search: batch size ∈ { , , }, learning rate ∈ { e−, e−, e−, e−, e− }, and gradient accumulation steps ∈ { , , }. We set the max length to 512 tokens for all models except Longformer, for which we take a max length of 3,000 tokens. Models with the best performance on the development set are used for testing.

Evaluation
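The pair encoding and truncation used in our setup can be sketched with a toy whitespace tokenizer (a real implementation would use each model's subword tokenizer; truncating the premise from the right is an assumption for illustration):

```python
def build_nli_input(premise, hypothesis, max_len=512):
    """Concatenate a premise-hypothesis pair as [CLS] p [SEP] h [SEP],
    truncating the premise so the whole sequence fits in max_len tokens."""
    p_toks = premise.split()
    h_toks = hypothesis.split()
    budget = max_len - len(h_toks) - 3   # room for [CLS] and two [SEP]s
    p_toks = p_toks[:max(budget, 0)]
    return ["[CLS]"] + p_toks + ["[SEP]"] + h_toks + ["[SEP]"]

# A multi-paragraph premise longer than the budget is clipped to fit:
seq = build_nli_input("word " * 600, "five of the shows were sitcoms")
```

For Longformer, the same routine would be called with max_len=3000.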
Following the NLI benchmark setting (Bowman et al. 2015; Williams, Nangia, and Bowman 2018; Welleck et al. 2018), we employ overall accuracy as the main evaluation metric. Furthermore, to give a more detailed analysis, we also calculate precision, recall and F1 score for the ENTAILMENT, NEUTRAL and CONTRADICTION labels.
Human Performance
To measure human performance on the ConTRoL dataset, we randomly select 300 context-hypothesis pairs from the test set. Four testees are recruited. The testees are well educated: two of them are post-graduate students and two of them have PhD degrees. We report human performance by the mean score and standard deviation. The human ceiling performance is obtained by considering the proportion of questions with at least one correct answer.
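The evaluation metrics and the ceiling rule above can be sketched as follows (a minimal implementation of the standard per-label precision/recall/F1 and of the "at least one testee correct" ceiling):

```python
LABELS = ("ENTAILMENT", "NEUTRAL", "CONTRADICTION")

def per_label_prf(gold, pred):
    """Overall accuracy plus precision/recall/F1 for each label."""
    scores = {}
    for lab in LABELS:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return acc, scores

def human_ceiling(answers_per_question, gold):
    """Proportion of questions that at least one testee answered correctly."""
    hits = sum(any(a == g for a in answers)
               for answers, g in zip(answers_per_question, gold))
    return hits / len(gold)
```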
Benchmark | Train | Test | BERT | SOTA Model | SOTA | Human
MultiNLI | 393k | 20k | 85.9 | T5-11B (Raffel et al. 2019) | 92.0 | 92.8
QNLI | 105k | 5.4k | 92.7 | ALBERT (Lan et al. 2019) | 99.2 | 91.2
RTE | 2.5k | 3k | 70.1 | T5-11B (Raffel et al. 2019) | 92.5 | 93.6
WNLI | 634 | 146 | 65.1 | T5-11B (Raffel et al. 2019) | 93.2 | 95.9
ConTRoL | 8.3k | 804 | 50.6 | BART-NLI-FT | 61.0 | 94.4
Table 4: The state-of-the-art performances of popular NLI benchmarks (accuracy %).
[Figure 4: Performance across different context lengths (BERT, BART, Longformer).]
Reasoning Type | BERT | BART
Coreferential Reasoning | 74.64 | 74.92
Analytical Reasoning | 67.96 | 69.65
Temporal Reasoning | 56.44 | 57.34
Information Integration | 40.07 | 43.39
Logical Reasoning | 40.76 | 43.20
Table 5: Performance across reasoning types (accuracy %).
Results
Table 3 shows the main results. As shown in the table, BERT gives an overall accuracy of 50.62% and F1 of 49.49%; RoBERTa gives an accuracy of 45.90% and F1 of 45.67%; Longformer gives an overall accuracy of 49.88% and F1 of 46.22%; XLNet gives an overall accuracy of 54.85% and F1 of 54.93%. The top reported performance is given by the BART model, with a 56.34% accuracy score. Compared with human performance, the performance of BART is lower by approximately 30%. Human performance on ConTRoL surpasses the SOTA NLI models by a large margin, which demonstrates the limitations of current computational models in solving contextual reasoning tasks.

As shown in Table 4, we see a huge performance drop when the SOTA model results on ConTRoL are compared to their reported scores on previous NLI datasets (Liu et al. 2019). In contrast, similar to the existing benchmarks, human testees are able to achieve high scores with proper training. Different from datasets that emphasise fact extraction and verification, inference on ConTRoL relies not only on the long-term dependency of texts, but also on contextual reasoning abilities over long contexts.

To further understand the phenomena, we conduct various detailed qualitative and quantitative analyses on ConTRoL.
Performance Across Different Relationships
We first compare human performance and model performance across different relationships. Interestingly, as shown in Table 3, humans are good at deciding the entailment and contradiction relationships, while struggling when examining the neutral relationship. This can be because humans tend to bring external, irrelevant knowledge into the reasoning process, which is not expressed in the context. The computational models do not seem to bear this burden, giving similar results across the three labels.
Performance Across Different Context Lengths
As mentioned earlier, aside from single-paragraph context-hypothesis pairs, there are multi-paragraph context-hypothesis pairs in our dataset. We conduct experiments on the single-paragraph and multi-paragraph instances separately, which gives us insight into how context length affects the performance of Transformer-based NLI models. The accuracy of the BERT model is 40.30% on multi-paragraph instances and 51.17% on single-paragraph instances. We also conduct a fine-grained analysis concerning the context length. The results are shown in Figure 4. When the context length increases, the model performance drops accordingly. The best model, BART, drops from 65% (contexts shorter than 500 words) to 40% (contexts longer than 3,000 words), demonstrating that ConTRoL relies heavily on passage-level reasoning ability, rather than sentence-level reasoning ability.
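The fine-grained length analysis can be sketched as follows (the example field names and the 500-word bucket width are assumptions for illustration):

```python
def accuracy_by_length(examples, bucket_size=500):
    """Bucket examples by context length in words; compute per-bucket accuracy."""
    buckets = {}
    for ex in examples:
        key = (len(ex["context"].split()) // bucket_size) * bucket_size
        correct, total = buckets.get(key, (0, 0))
        buckets[key] = (correct + (ex["pred"] == ex["gold"]), total + 1)
    return {k: c / t for k, (c, t) in sorted(buckets.items())}
```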
Performance Across Reasoning Types
Table 5 gives the performance over the 5 reasoning types. BERT and BART have similar trends across different reasoning types. In particular, on the coreferential reasoning type, BERT and BART give accuracies of 74.64% and 74.92%, respectively. On the other hand, both models struggle more on reasoning types such as multi-step reasoning and logical reasoning. This can be because multi-step reasoning is correlated with longer context length, and information integration is processed over multiple paragraphs. Finally, performing inductive and deductive reasoning is difficult for current models, making logical reasoning a difficult endeavour (Liu et al. 2020).
Transfer Learning
Recent studies have shown the benefit of fine-tuning on similar datasets for knowledge transfer (Huang et al. 2019). We explore three related NLI datasets for knowledge transfer: SNLI (Bowman et al. 2015), MultiNLI (Williams, Nangia, and Bowman 2018) and Adversarial NLI (Nie et al. 2019). As shown in the last two rows of Table 3, BART-NLI only achieves 45.0%, which shows that ConTRoL is different from existing NLI benchmarks. After fine-tuning on ConTRoL, BART-NLI-FT achieves the state-of-the-art results, which demonstrates that general knowledge from traditional NLI benchmarks is beneficial to performance on ConTRoL.

[Figure 5: Ablation study on different models (BERT, Longformer, BART; accuracy % under random, context-only, hypothesis-only, and context & hypothesis settings).]
Discussion
Corpus Bias
Recent studies show that pre-trained language models can make the right prediction by merely looking at part of the input (McCoy, Pavlick, and Linzen 2019). The hypothesis-only bias is common in large-scale NLI datasets, particularly in benchmarks constructed by crowdsourcing methods. We conduct an ablation experiment on ConTRoL. Figure 5 shows the comparison of BERT, Longformer and BART. BERT gives a 36.07% accuracy with hypothesis-only input, which is slightly higher than the theoretical random guess; Longformer gives a 44.15% accuracy, surpassing BART, which gives 43.41%, by a small margin.

Context-only results are also calculated to further examine annotation artefacts in the ConTRoL dataset. BERT gives 33.09% accuracy; BART gives 35.94% accuracy; Longformer again performs better than BERT and BART, giving 38.56% accuracy. Longformer's better scores on the context-only and hypothesis-only ablations can be because Longformer sees more context than the other two models. The ablation results are lower than the results without ablation, which indicates that models need to look at both the contexts and the hypotheses to make the correct prediction. We thus conclude that the ConTRoL dataset is exempt from significant annotation biases, thanks to its expert-designed questions.
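The ablation settings can be sketched as variants of the input builder (a simplified whitespace version; the mode names follow the settings in Figure 5):

```python
def ablation_input(premise, hypothesis, mode="context & hypothesis"):
    """Build the token sequence for one ablation setting by dropping
    the context, the hypothesis, or neither."""
    parts = ["[CLS]"]
    if mode != "hypothesis-only":        # keep the context unless ablated
        parts += premise.split()
    parts.append("[SEP]")
    if mode != "context-only":           # keep the hypothesis unless ablated
        parts += hypothesis.split()
    parts.append("[SEP]")
    return parts
```

Training and evaluating the same model on each variant isolates how much signal the labels leak through either side alone.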
Case Study
Figure 6 shows two cases that demonstrate the challenge in ConTRoL.

P1: Three athletes each receive a first, second and third prize for a different sporting event. Either Anne or Josie got the second prize for tennis. Anne got the same prize for throwing the javelin as Josie got for swimming. Tanya got the first prize for swimming, and her prize for the javelin was the same as Josie's for tennis and Anne's for swimming.

H1: Josie was best with the javelin. (Entailment / Contradiction / Neutral)

P2: Two masked gunmen held up the only bank in Tuisdale at 10.30 on Wednesday 23 May. They made a successful getaway with over 500,000. The police say that three men are helping them with their enquiries. It is also known that: Four people work at the bank. Six customers were in the bank at 10.30. No shots were fired. Ms Grainger left the bank at 10.28 on Wednesday 23 May. All the people in the bank were made to lie on the floor face down on their stomachs. The police chased the getaway car for 16 km, and then lost it. An alarm alerted the police to the hold-up. A red Ford Mondeo drove away from the bank at high speed at 10.30 on Wednesday 23 May.

H1: As a goodwill gesture, Tuisdale's other bank provided emergency access to cash for customers after their ordeal. (Entailment / Contradiction / Neutral)

Figure 6: Example mistakes of BART (✓ indicates the correct label and ✗ indicates the BART prediction. Reasoning clues are highlighted in the context.)

P1 of Figure 6 is a representative example of the challenges brought by logical reasoning. The context concerns three athletes and three sports, and their places in the competition must be deduced. The lexical overlap between the premise and the hypothesis is very low. BART incorrectly chooses the Neutral label, whereas the status of the hypothesis can only be established by deductive reasoning over all the clues together. Information integration is difficult for BART.

P2 of Figure 6 shows a typical example of the challenge brought by information integration, where the hypothesis must be judged against the whole passage. We know from the first sentence that the bank is the only one in Tuisdale. The hypothesis describes a possible aftermath of the robbery; BART incorrectly chooses the Neutral label because it overlooks the information that Tuisdale has only one bank. In both cases, the correct answer is not explicitly stated in the premise but must be inferred through contextual reasoning.
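The deduction demanded by P1 can be made concrete with a brute-force sketch: enumerate every possible prize assignment and keep only those consistent with all the clues. The clues jointly admit exactly one assignment, so the verdict on H1 follows only after integrating every sentence. The constraint encoding below is our reading of the puzzle text, not code from the ConTRoL release:

```python
from itertools import permutations, product

# Brute-force the P1 prize puzzle: prizes are 1 (first) to 3 (third),
# each athlete's prizes and each event's prizes form a permutation of 1..3.
ATHLETES = ("Anne", "Josie", "Tanya")
EVENTS = ("tennis", "javelin", "swimming")

def consistent(p):
    """p[athlete][event] -> prize (1 = first). Check every clue jointly."""
    return (
        # Each event awards first, second and third to distinct athletes.
        all(sorted(p[a][e] for a in ATHLETES) == [1, 2, 3] for e in EVENTS)
        # Either Anne or Josie got the second prize for tennis.
        and 2 in (p["Anne"]["tennis"], p["Josie"]["tennis"])
        # Anne's javelin prize equals Josie's swimming prize.
        and p["Anne"]["javelin"] == p["Josie"]["swimming"]
        # Tanya got the first prize for swimming ...
        and p["Tanya"]["swimming"] == 1
        # ... and her javelin prize equals Josie's tennis and Anne's swimming prizes.
        and p["Tanya"]["javelin"] == p["Josie"]["tennis"] == p["Anne"]["swimming"]
    )

# Each athlete's three prizes are some permutation of (1, 2, 3).
solutions = []
for perms in product(permutations((1, 2, 3)), repeat=3):
    assignment = {a: dict(zip(EVENTS, perm)) for a, perm in zip(ATHLETES, perms)}
    if consistent(assignment):
        solutions.append(assignment)

# Exactly one assignment satisfies every clue at once: no single sentence
# settles H1, only the full combination of constraints does.
assert len(solutions) == 1
```

The point of the sketch is that removing any one clue leaves multiple consistent assignments; this is precisely the multi-step information integration that BART fails to perform.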
Conclusion
We presented the ConTRoL dataset, a passage-level NLI benchmark that covers a range of contextual reasoning types. Compared with existing NLI benchmarks, the context length of the premise is longer by a large margin, and reasoning skills such as logical reasoning, analytical reasoning and multi-step reasoning are required. Experiments show that state-of-the-art NLI models perform poorly on the ConTRoL dataset, far below human performance. An ablation study indicates that the data does not suffer from heavy annotation artefacts and can serve as a reliable NLI benchmark for future study. To our knowledge, we are the first to introduce a passage-level NLI dataset that highlights contextual reasoning.

References
Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The Long-Document Transformer.

Bhagavatula, C.; Bras, R. L.; Malaviya, C.; Sakaguchi, K.; Holtzman, A.; Rashkin, H.; Downey, D.; Yih, W.-t.; and Choi, Y. 2019. Abductive Commonsense Reasoning.

Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Chen, J.; Lin, S.-t.; and Durrett, G. 2019. Multi-hop Question Answering via Reasoning Chains.

Clark, K.; Khandelwal, U.; Levy, O.; and Manning, C. D. 2019. What Does BERT Look at? An Analysis of BERT's Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Dagan, I.; Glickman, O.; and Magnini, B. 2006. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, 177–190. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-540-33428-6.

Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/p19-1285. URL http://dx.doi.org/10.18653/v1/P19-1285.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Giampiccolo, D.; Magnini, B.; Dagan, I.; and Dolan, B. 2007. The Third PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, RTE '07, 1–9. USA: Association for Computational Linguistics.

Giunchiglia, F. 1992. Contextual Reasoning. EPISTEMOLOGIA, SPECIAL ISSUE ON I LINGUAGGI E LE MACCHINE.

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.; and Smith, N. A. 2018. Annotation Artifacts in Natural Language Inference Data. CoRR abs/1803.02324. URL http://arxiv.org/abs/1803.02324.

Huang, L.; Le Bras, R.; Bhagavatula, C.; and Choi, Y. 2019. Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Khot, T.; Sabharwal, A.; and Clark, P. 2018. SciTaiL: A Textual Entailment Dataset from Science Question Answering. In AAAI.

Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; and Hovy, E. 2017. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.703. URL http://dx.doi.org/10.18653/v1/2020.acl-main.703.

Liu, J.; Cui, L.; Liu, H.; Huang, D.; Wang, Y.; and Zhang, Y. 2020. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. doi:10.24963/ijcai.2020/501. URL http://dx.doi.org/10.24963/ijcai.2020/501.

Liu, J.; and Gardner, M. 2020. Multi-Step Inference for Reasoning Over Paragraphs.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach.

McCoy, T.; Pavlick, E.; and Linzen, T. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Nie, Y.; Williams, A.; Dinan, E.; Bansal, M.; Weston, J.; and Kiela, D. 2019. Adversarial NLI: A New Benchmark for Natural Language Understanding. ArXiv abs/1910.14599.

Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text.

Sakaguchi, K.; Bras, R. L.; Bhagavatula, C.; and Choi, Y. 2019. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv preprint arXiv:1907.10641.

Sap, M.; Rashkin, H.; Chen, D.; Le Bras, R.; and Choi, Y. 2019. Social IQa: Commonsense Reasoning about Social Interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. CoRR abs/1804.07461. URL http://arxiv.org/abs/1804.07461.

Welbl, J.; Stenetorp, P.; and Riedel, S. 2017. Constructing Datasets for Multi-hop Reading Comprehension Across Documents.

Welleck, S.; Weston, J.; Szlam, A.; and Cho, K. 2018. Dialogue Natural Language Inference.

Williams, A.; Nangia, N.; and Bowman, S. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding.

Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.