Explainable Natural Language Reasoning via Conceptual Unification
Marco Valentino, Mokanarangan Thayaparan, André Freitas
Department of Computer Science, University of Manchester
{marco.valentino, mokanarangan.thayaparan, andre.freitas}@manchester.ac.uk

Abstract
This paper presents an abductive framework for multi-hop and interpretable textual inference. The reasoning process is guided by the notions of unification power and plausibility of an explanation, computed through the interaction of two major architectural components: (a) an analogical reasoning model that ranks explanatory facts by leveraging unification patterns in a corpus of explanations; (b) an abductive reasoning model that performs a search for the best explanation, which is realised via conceptual abstraction and subsequent unification. We demonstrate that the Step-wise Conceptual Unification can be effective for unsupervised question answering, and as an explanation extractor in combination with state-of-the-art Transformers. An empirical evaluation on the Worldtree corpus and the ARC Challenge resulted in the following conclusions: (1) the question answering model outperforms competitive neural and multi-hop baselines without requiring any explicit training on answer prediction; (2) when used as an explanation extractor, the proposed model significantly improves the performance of Transformers, leading to state-of-the-art results on the Worldtree corpus; (3) analogical and abductive reasoning are highly complementary for achieving sound explanatory inference, a feature that demonstrates the impact of the unification patterns on performance and interpretability.
Multiple-choice science questions have been proposed as a challenge task for natural language inference and question answering (Khot et al., 2019; Clark et al., 2018; Mihaylov et al., 2018). A central research line in the field aims at developing explainable inference models capable of performing accurate predictions and, at the same time, generating explanations for the underlying reasoning process (Miller, 2019; Biran and Cotton, 2017).
Figure 1: Modelling explanatory inference via step-wise conceptual unification. The example question Q asks: "If you bounce a rubber ball on the floor, it goes up and then comes down. What causes the ball to come down?", with candidate answers C1: Gravity, C2: Magnetism, C3: Electricity, C4: Friction. Abstraction facts (1) such as "a ball is a kind of object", "the floor is a kind of object", "come down is similar to falling" and "gravity means gravitational pull; gravitational energy" connect the question to the unification fact (2) "gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet".
The construction of explanations for science questions is typically framed as a multi-hop reasoning problem, where multiple pieces of evidence need to be combined to arrive at the final answer. Recent approaches adopt global and local semantic constraints to guide the generation of plausible multi-hop explanations (Khashabi et al., 2018; Jansen et al., 2017; Khashabi et al., 2016). However, the use of explicit constraints for reasoning with natural language often results in semantic drift, i.e. the tendency of composing spurious inference chains that lead to wrong conclusions (Khashabi et al., 2019). To deal with semantic drift, recent work has proposed the crowd-sourcing of explanation-centred corpora (Xie et al., 2020; Jansen et al., 2018, 2016), which can enable the identification of common explanatory patterns. Although these resources have been applied for explanation regeneration (Valentino et al., 2020; Jansen and Ustalov, 2019), it is not yet clear how they can support the downstream answer prediction task. In this paper, we aim at moving a step forward in this direction, exploring how explanatory patterns can be leveraged for multi-hop reasoning.

Research in Philosophy of Science suggests that explanations act through unification (Friedman, 1974; Kitcher, 1989). The function of an explanation is to unify a set of disconnected phenomena, showing that they are the expression of a common regularity; e.g. Newton's law of universal gravitation unifies the motion of planets and falling bodies by showing that they obey the same law. The higher the number of distinct phenomena explained by a given statement, the higher its unification power. Therefore, explanations with high unification power tend to create unification patterns, i.e. the same statement is reused to explain a large variety of similar phenomena. We hypothesise that the unification patterns emerging in a corpus of explanations can ultimately guide the abductive reasoning process. Consider the example in Figure 1. The explanation performs unification by connecting a concrete phenomenon, i.e. a ball falling on the floor, to a general regularity that applies to a broader set of phenomena and that explains a large number of questions, i.e. gravity affects all the objects that have mass. As a result, multi-hop inference for science questions can be modelled as an abstraction from the original context in search of an underlying explanatory law, which in turn manifests its unification power by being frequently reused in explanations for similar questions.

In this paper, we build upon the concept of explanatory unification and provide the following contributions: (1) we present the Step-wise Conceptual Unification, an abductive framework that combines explicit semantic constraints with the notion of unification power for multi-hop inference, computed via analogical reasoning on a corpus of explanations; (2) we empirically show the efficacy of the framework for unsupervised question answering and explanation extraction; (3) we study the impact of the unification patterns on abductive reasoning, demonstrating their role in improving the accuracy of prediction and the soundness of explanations.

A multiple-choice science question is a tuple $Q = (q, C)$ characterised by a question $q$ and a set of candidate answers $C = \{c_1, c_2, \ldots, c_n\}$. A set of hypotheses $H = \{h_1, h_2, \ldots, h_n\}$ can be derived by concatenating $q$ with each $c_j \in C$, i.e. $h_j = concat(c_j, q)$.
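As an illustration of this formulation, a minimal sketch of the hypothesis construction step follows; the class and function names are ours, not from the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str              # the question q
    candidates: list[str]  # the candidate answers C = {c_1, ..., c_n}

def build_hypotheses(question: Question) -> list[str]:
    """Derive one hypothesis h_j = concat(c_j, q) per candidate answer."""
    return [f"{c} {question.text}" for c in question.candidates]

q = Question(
    text="If you bounce a rubber ball on the floor, what causes it to come down?",
    candidates=["gravity", "magnetism", "electricity", "friction"],
)
hypotheses = build_hypotheses(q)  # one natural language statement per choice
```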
Given $H$, we frame multi-hop inference for multiple-choice question answering as the problem of selecting the hypothesis that is supported by the best explanation. The Step-wise Conceptual Unification constructs and scores explanations by composing multiple sentences from a knowledge base, which we refer to as Facts KB ($F_{KB}$). This resource includes the knowledge necessary to answer and explain science questions, ranging from common sense and taxonomic relations (e.g. a ball is a kind of object) to scientific statements and laws (e.g. gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet). The abductive process is guided by the unification patterns emerging in a second knowledge base, named Explanations KB ($E_{KB}$), which contains a set of true hypotheses with their respective explanations.
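To make the data model concrete, here is a minimal sketch of the two knowledge bases as plain Python structures; the variable names and the apple example are illustrative, not taken from the corpus:

```python
# Facts KB (F_KB): natural language facts, later partitioned into
# abstractive facts (taxonomic/synonym/antonym) and unification facts.
facts_kb = [
    "a ball is a kind of object",
    "come down is similar to falling",
    "gravity causes objects that have mass to be pulled down on a planet",
]

# Explanations KB (E_KB): true hypotheses paired with the set of facts
# that explain them; unification patterns emerge from fact reuse here.
explanations_kb = [
    (
        "gravity causes a dropped apple to fall",  # hypothesis (illustrative)
        {
            "an apple is a kind of object",
            "gravity causes objects that have mass to be pulled down on a planet",
        },
    ),
]
```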
The framework is based on the following research hypotheses:

RH1: Explanations can be constructed through two major inference steps, namely abstraction and unification: (1) retrieving a set of abstractive facts whose role is to expand the context of the hypothesis in search of an underlying regularity; (2) selecting a unification fact, which represents an explanatory scientific statement.

RH2: The best explanation can be determined by considering two properties of the unification: (a) the plausibility of the unification, that is, a measure of the semantic connection between the unification statement and the original hypothesis; (b) the unification power, which depends on how often the unification explains similar hypotheses.
In general, we consider the facts in $F_{KB}$ and the hypotheses in $H$ as natural language statements composed of a set of distinct concepts $CP(f_i) = \{cp_1, cp_2, \ldots, cp_z\}$ (e.g. "gravity", "ball", "living thing"). To formalise our research hypotheses, we divide the facts in $F_{KB}$ into two categories. The sentences expressing taxonomic relations between concepts (i.e. "x is a kind of y"), synonyms (i.e. "x means y") and antonyms (i.e. "x is the opposite of y") are classified as abstractive, while all the other facts (e.g. properties, causes, processes, scientific laws) are considered for unification. We say that two arbitrary facts $f_i$ and $f_j$ are conceptually connected if the intersection between $CP(f_i)$ and $CP(f_j)$ is not empty, i.e. $CP(f_i) \cap CP(f_j) \neq \emptyset$. On the other hand, we say that two facts $f_i$ and $f_j$ are indirectly connected if $CP(f_i) \cap CP(f_j) = \emptyset$ and there exists a fact $f_z$ such that $CP(f_i) \cap CP(f_z) \neq \emptyset$ and $CP(f_j) \cap CP(f_z) \neq \emptyset$.
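These two predicates translate directly into set operations; a minimal sketch, assuming each fact has already been mapped to its concept set $CP(\cdot)$ (all names are ours):

```python
def conceptually_connected(cp_fi: set, cp_fj: set) -> bool:
    """Two facts are conceptually connected if they share at least one concept."""
    return len(cp_fi & cp_fj) > 0

def indirectly_connected(cp_fi: set, cp_fj: set, kb_concepts: list) -> bool:
    """Two facts are indirectly connected if they share no concept, but some
    bridge fact f_z is conceptually connected with both of them."""
    if cp_fi & cp_fj:
        return False
    return any(cp_fi & cp_fz and cp_fj & cp_fz for cp_fz in kb_concepts)

# Example: "a ball is a kind of object" bridges the hypothesis and the gravity fact.
cp_hyp = {"ball", "floor", "come down"}
cp_unif = {"gravity", "object", "mass", "fall"}
bridge = [{"ball", "object"}]
assert indirectly_connected(cp_hyp, cp_unif, bridge)
```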
Figure 2: Overview of the Step-wise Conceptual Unification framework. (1) Unification-based Reconstruction (analogical reasoning) retrieves candidate abstractive facts $CABS(h_j)$ and candidate unification facts $CUNF(h_j)$ from $F_{KB}$, with analogical scores (RS + US) derived from $E_{KB}$; (2) the Step-wise Conceptual Unification (abductive reasoning) performs abstraction and unification to compute explanatory scores and a hypothesis score; (3) answer selection takes the maximum-scoring hypothesis; (4) the corresponding explanation is returned.

We consider as part of the explanation only pairs of facts that are at least indirectly connected. Moreover, following our first research hypothesis (RH1), we consider compositions formed by an arbitrary number of abstractive facts and one unification fact. Therefore, a generic explanation for $h_j \in H$ can be reformulated as a tuple $E_j = (ABS, UNF)$ defined by the following elements:

• $ABS = \{a_1, a_2, \ldots, a_n\} \subseteq F_{KB}$ is a set of abstractive facts such that $|ABS| \geq 0$;

• $UNF = \{u\} \subseteq F_{KB}$ is a singleton including one unification fact $u$, i.e. $|UNF| = 1$.

Additional constraints are determined by the conceptual connections between these sets. Specifically, each abstractive fact $a_i \in ABS$ must be conceptually connected with both the hypothesis $h_j$ and the unification fact $u \in UNF$. On the other hand, $u$ must be conceptually connected with each abstractive fact if $|ABS| > 0$, and with the hypothesis $h_j$ if $|ABS| = 0$. In other words, to ensure that the unification fact is semantically plausible, we force it to be linked with $h_j$ in one or two hops through abstractive facts (e.g. Fig. 1).

To determine which hypothesis in $H$ is supported by the best explanation, we define a framework consisting of four major algorithmic steps (Fig. 2). For each hypothesis $h_j \in H$, the first step is aimed at retrieving a set of candidate explanatory facts. The architectural component responsible for this task performs analogical reasoning by leveraging unification patterns for similar hypotheses in $E_{KB}$. The output of this component is represented by two distinct subsets of $F_{KB}$: (a) a set of candidate abstractive facts $CABS(h_j) = \{a_1, a_2, \ldots, a_z\}$; (b) a set of candidate unification facts $CUNF(h_j) = \{u_1, u_2, \ldots, u_t\}$. Each $u_i \in CUNF(h_j)$ is associated with an analogical score $as(h_j, u_i)$, computed with respect to the hypothesis $h_j$ and reflecting the unification power of $u_i$. The second step uses the output of the analogical component to perform abductive reasoning. Specifically, the elements of $CABS(h_j)$ and $CUNF(h_j)$ are combined to build a set of plausible explanations $E(h_j) = \{E_1, E_2, \ldots, E_n\}$. For each explanation $E_i = (ABS_i, UNF_i)$, the abductive component computes an explanatory score $es(h_j, u_i)$ by taking into account the analogical score and the plausibility of the unification $u_i$, which is derived from the conceptual connections with $h_j$. The top $K$ unifications ranked according to their explanatory scores are adopted to determine the final score for the hypothesis $h_j$. Finally, the answer selection component (step 3) collects the scores computed for each $h_j \in H$ and selects the candidate answer $c_i \in C$ associated with the best hypothesis. For explainability, the predicted answer can be enriched with the unification performed by the system (step 4).
The analogical reasoning component adopts the Unification-based Reconstruction model (Valentino et al., 2020). For each fact $f_i \in F_{KB}$, the model computes a score $as(h_j, f_i)$ that is derived by the combination of its lexical relevance, i.e. the Relevance Score (RS), and its unification power, defined as the Unification Score (US) (Fig. 2):

$$as(h_j, f_i) = \lambda_1\, rs(h_j, f_i) + \lambda_2\, us(h_j, f_i) \quad (1)$$

The unification score $us(h_j, f_i)$ is described by the following formula:

$$us(h_j, f_i) = \sum_{z=1}^{|kNN(h_j)|} sim(h_j, h_z)\, in(f_i, E_z) \quad (2)$$

$$in(f_i, E_z) = \begin{cases} 1 & \text{if } f_i \in E_z \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

Here, $kNN(h_j) = \{(h_1, E_1), \ldots, (h_n, E_n)\} \subseteq E_{KB}$ is the set of k-nearest neighbours of $h_j$, which includes hypothesis and explanation pairs $(h_z, E_z)$ retrieved according to a similarity measure $sim(h_j, h_z)$, while $in(f_i, E_z)$ is a function that returns 1 if $f_i$ is used to explain $h_z$, and 0 otherwise. Therefore, the more a fact $f_i \in F_{KB}$ explains similar hypotheses in $E_{KB}$, the higher its unification score. In our experiments, both $sim(h_j, h_z)$ and $rs(h_j, f_i)$ are implemented using BM25 vectors and cosine similarity.
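A compact sketch of Equations 1–3, assuming $E_{KB}$ is given as (hypothesis, explanation-facts) pairs; we use scikit-learn's TF-IDF vectors as a stand-in for the paper's BM25 implementation, and the function name and default weights are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def analogical_scores(hypothesis, facts, explanation_kb, lam1=0.5, lam2=0.5, k=10):
    """Score each fact by lexical relevance (RS) plus unification power (US).
    explanation_kb: list of (hypothesis_text, set_of_explanation_facts)."""
    vec = TfidfVectorizer().fit(facts + [h for h, _ in explanation_kb] + [hypothesis])
    h_vec = vec.transform([hypothesis])

    # rs(h_j, f_i): relevance of each fact to the hypothesis
    rs = cosine_similarity(h_vec, vec.transform(facts))[0]

    # kNN(h_j): the k most similar hypotheses in the Explanations KB
    kb_vecs = vec.transform([h for h, _ in explanation_kb])
    sims = cosine_similarity(h_vec, kb_vecs)[0]
    knn = np.argsort(sims)[::-1][:k]

    # us(h_j, f_i) = sum over neighbours of sim(h_j, h_z) * in(f_i, E_z)
    us = np.array([
        sum(sims[z] for z in knn if fact in explanation_kb[z][1])
        for fact in facts
    ])
    return lam1 * rs + lam2 * us  # as(h_j, f_i), Eq. 1
```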
The abductive reasoning model constructs and scores a set of explanations $E(h_j)$ using abstraction and unification steps. For each concept in the hypothesis $c_i \in CP(h_j)$, the abstraction step computes an expansion set $EXP(c_i)$ considering each candidate abstractive fact $a_k \in CABS(h_j)$:

$$EXP(c_i) = \bigcup_k CP(a_k) \;\big|\; c_i \in CP(a_k) \cap CP(h_j) \quad (4)$$

The set $EXP(c_i)$ represents the union of all the concepts that occur in abstractive facts mentioning $c_i$. For example, considering the hypothesis in Figure 1, the set $EXP(ball)$ will include the concept "object", extracted from the fact "a ball is a kind of object". Therefore, $EXP(c_i)$ will include $c_i$ plus its hypernyms, hyponyms, synonyms and opposite concepts contained in $CABS(h_j)$.
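Equation 4 reduces to a union over concept sets; a minimal sketch with illustrative names:

```python
def expansion_set(concept: str, cp_hypothesis: set, candidate_abs: list) -> set:
    """EXP(c_i): union of the concepts of all abstractive facts mentioning c_i.
    candidate_abs: list of concept sets, one per fact in CABS(h_j)."""
    exp = {concept}
    for cp_ak in candidate_abs:
        if concept in cp_ak and concept in cp_hypothesis:
            exp |= cp_ak
    return exp

# EXP("ball") picks up "object" from the fact "a ball is a kind of object".
cabs = [{"ball", "object"}, {"floor", "object"}, {"come down", "falling"}]
print(expansion_set("ball", {"ball", "floor", "come down"}, cabs))
# -> {'ball', 'object'}
```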
In the unification step, the abductive component analyses each candidate unification fact $u_k \in CUNF(h_j)$ and checks whether there exists at least one concept $c_i$ such that $CP(u_k) \cap EXP(c_i) \neq \emptyset$. If this condition is respected (e.g. Fig. 1), the component adds a new explanation $E_k$ to $E(h_j)$, composed of the unification $u_k$ and all the abstractive facts that are connected with $u_k$ and $h_j$. Conversely, if the condition is not respected, the unification fact $u_k$ is discarded. Once the set $E(h_j)$ is created, the abductive component assigns an explanatory score to each explanation $E_k$ by considering the unification fact $u_k$:

$$es(h_j, u_k) = \lambda_1\, as(h_j, u_k) + \lambda_2\, ps(h_j, u_k) \quad (5)$$

Here, $as(h_j, u_k)$ is the analogical score computed for $u_k$, while $ps(h_j, u_k)$ represents the plausibility score, defined as follows:

$$ps(h_j, u_k) = \frac{|\{c_i \mid EXP(c_i) \cap CP(u_k) \neq \emptyset\}|}{|CP(h_j)|} \quad (6)$$

The plausibility score $ps(h_j, u_k)$ represents the percentage of concepts in the hypothesis $h_j$ that have at least an indirect link with the unification fact $u_k$. Therefore, the higher the degree of conceptual coverage between the unification and the original hypothesis, the higher the plausibility score. In line with our research hypotheses (RH2), the full explanatory score of a unification fact $u_k$ jointly depends on its semantic plausibility and unification power. Finally, the abductive model computes the hypothesis score by considering the top $K$ unifications for $h_j$ ranked by their explanatory scores:

$$hs(h_j) = \sum_{k=1}^{K} es(h_j, u_k) \quad (7)$$

The final answer is selected by considering the hypothesis in $H$ with the highest score:

$$ans(Q) = c_a \in C \;\big|\; a = \arg\max_j\, hs(h_j) \quad (8)$$
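Equations 5–8 then compose into a short scoring and selection routine; a sketch under the same illustrative naming, where `explanatory_scores` holds the values $es(h_j, u_k) = \lambda_1\, as + \lambda_2\, ps$ for the candidate unifications of one hypothesis:

```python
def plausibility_score(cp_hyp: set, cp_unif: set, exp: dict) -> float:
    """ps (Eq. 6): fraction of hypothesis concepts with at least an
    indirect link to the unification fact; exp maps concept -> EXP(c_i)."""
    covered = [c for c in cp_hyp if exp.get(c, {c}) & cp_unif]
    return len(covered) / len(cp_hyp)

def hypothesis_score(explanatory_scores: list, k: int = 2) -> float:
    """hs (Eq. 7): sum of the top-K explanatory scores for one hypothesis."""
    return sum(sorted(explanatory_scores, reverse=True)[:k])

def select_answer(candidates: list, hypothesis_scores: list) -> str:
    """ans(Q) (Eq. 8): candidate whose hypothesis achieves the highest score."""
    return max(zip(candidates, hypothesis_scores), key=lambda p: p[1])[0]
```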
We evaluate the Step-wise Conceptual Unification (SWCU) on multiple-choice question answering. First, we test the efficacy of the framework for unsupervised question answering. Here, we adopt the algorithmic steps described in the previous section, using Equation 8 for answer prediction. In addition, we evaluate the model for explanation extraction. In this case, the top $K$ unification facts (Equation 7) are used as supporting evidence for a Transformer model, which is then fine-tuned on answer prediction. We perform the experiments combining SWCU with BERT-base (Devlin et al., 2019) and RoBERTa-large (Liu et al., 2019). The SWCU model is implemented via BM25 vectors and cosine similarity, which are used for computing $sim(h_j, h_z)$ in Equation 2 and $rs(h_j, f_i)$ in Equation 1.

The knowledge bases ($F_{KB}$ and $E_{KB}$) are populated using the Worldtree corpus (Jansen et al., 2018), which provides gold explanations for multiple-choice science questions. Here, an explanation is a composition of facts stored in a set of semi-structured tables, each of them representing a specific knowledge type. We extract the row sentences from the tables and use them to build the Facts KB ($F_{KB}$).
The sentences in the Kindof, Synonyms and Opposites tables are used as abstractive facts, while the remaining sentences are adopted for unification. The questions in the corpus are split into train-set (1,190 questions), dev-set (264 questions) and test-set (1,247 questions). Questions and explanations in the train-set are used to populate the Explanations KB ($E_{KB}$), while the dev-set and the test-set are adopted for evaluation. The concepts in facts and hypotheses are extracted using WordNet (Miller, 1995). Specifically, given a sentence, we define a concept as a maximal sequence of words that corresponds to a valid synset. This process allows us to capture multi-word expressions (e.g. "living thing") that typically occur in science questions.
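A sketch of this concept extraction using NLTK's WordNet interface, greedily matching maximal word sequences against valid synsets; this is a simplification under our own assumptions, not the authors' exact preprocessing:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def extract_concepts(sentence: str, max_len: int = 3) -> set:
    """Greedily extract maximal word sequences that map to a WordNet synset."""
    tokens = sentence.lower().split()
    concepts, i = set(), 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            span = "_".join(tokens[i:i + length])  # WordNet joins words with '_'
            if wn.synsets(span):
                concepts.add(span.replace("_", " "))
                i += length
                break
        else:
            i += 1  # no synset starts here; skip the token
    return concepts

print(extract_concepts("a frog is a kind of living thing"))
# captures the multi-word concept 'living thing'
```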
In this section, we present the results achieved on the Worldtree corpus (test-set). We report the accuracy for SWCU with different numbers of unification facts ($K$ in Equation 7), while the accuracy for SWCU in combination with Transformers is reported for the best configuration ($K = 2$). Overall, we observe that the SWCU model is competitive with SWCU + BERT-base, while SWCU + RoBERTa-large achieves state-of-the-art results, outperforming all the proposed models and baselines. We compare the framework against four categories of approaches: Information Retrieval, Multi-hop Inference, Transformers, and Transformers with Explanation. The results are reported in Table 1.

Model | Unsupervised | Overall | Easy | Challenge
Information Retrieval (IR) | | | |
BM25 IR solver (Clark et al., 2018) | Yes | 41.22 | 44.94 | –
BM25 Unification-based IR solver | Yes | – | – | –
Multi-hop Inference | | | |
BM25 IR + PathNet (Kundu et al., 2019) | No | 41.50 | 43.32 | –
BM25 Unification-based IR + PathNet (Kundu et al., 2019) | No | – | – | –
Transformers | | | |
BERT-base (Devlin et al., 2019; Valentino et al., 2020) | No | 41.78 | 48.54 | 26.28
RoBERTa-large (Liu et al., 2019) | No | – | – | –
Transformers with Explanation | | | |
BM25 IR + BERT-base (Valentino et al., 2020) | No | 49.39 | 53.20 | 40.97
BM25 Unification-based IR + BERT-base (Valentino et al., 2020) | No | 51.62 | 55.46 | 41.97
BM25 IR + RoBERTa-large | No | 56.86 | 60.88 | –
BM25 Unification-based IR + RoBERTa-large | No | – | – | –
Step-wise Conceptual Unification | | | |
SWCU (K = 1) | Yes | 52.36 | 56.93 | 42.27
SWCU (K = 2) | Yes | – | – | –
SWCU (K = 3) | Yes | 53.49 | 59.25 | 40.72
SWCU (K = 2) + BERT-base | No | 52.29 | 56.00 | 44.07
SWCU (K = 2) + RoBERTa-large | No | – | – | –

Table 1: Accuracy in answer prediction on the Worldtree corpus (test-set). The parameter K represents the number of unification facts considered to compute the hypothesis score (Equation 7).
Information Retrieval (IR). For the IR category, we employ two baselines similar to the one described in Clark et al. (2018). Given a hypothesis $h_j$, the BM25 IR solver adopts BM25 vectors and cosine similarity to retrieve the sentence in $F_{KB}$ that is most relevant to $h_j$. The relevance score is then used to determine the final answer. The BM25 Unification-based IR solver adopts the same strategy, complementing the relevance score with the unification score (Equation 1). Similarly to the SWCU model, these approaches employ scalable IR techniques and do not require training for answer prediction. However, the results show that the SWCU model significantly outperforms these baselines on both easy and challenge questions.
Multi-hop Inference. We consider PathNet (Kundu et al., 2019) as a multi-hop and explainable reasoning baseline. This model constructs paths connecting question and candidate answer, and subsequently scores them through a neural architecture. We reproduce PathNet on the Worldtree corpus using the source code available at https://github.com/allenai/PathNet. The best results are obtained considering the top 15 facts selected by the IR models. The differences between PathNet and SWCU are twofold: (1) PathNet assumes that an explanation always has the shape of a single, linear path; (2) PathNet does not leverage unification patterns to guide the construction of multi-hop explanations. Our experiments show that these characteristics play a significant role in the final accuracy of the systems.
Transformers. We compare our framework against BERT-base (Devlin et al., 2019) and RoBERTa-large (Liu et al., 2019) fine-tuned on the multiple-choice question answering task. We observe that the SWCU model outperforms both baselines with a significantly smaller number of parameters and without direct supervision. At the same time, the improvement achieved using SWCU as an evidence extractor demonstrates the impact of the constructed explanations on Transformers.
Transformers with Explanation. Finally, we compare our approach against Transformers enhanced with IR baselines (i.e. BM25 IR and BM25 Unification-based IR) (Valentino et al., 2020). The best results for these models are obtained considering the top 3 sentences retrieved by the IR models. We observe that the use of SWCU as an explanation extractor improves these baselines on both easy and challenge questions, confirming that the Step-wise Conceptual Unification provides more discriminating evidence for answer prediction.
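For concreteness, one plausible way to feed the extracted facts to a Transformer (our own encoding sketch, not necessarily the authors' exact input format) is to pair the concatenated top-K unification facts with each question-candidate hypothesis before fine-tuning:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_with_explanation(question, candidate, facts, max_length=256):
    """Pack [CLS] facts [SEP] question + candidate [SEP] for sequence scoring."""
    context = " ".join(facts)  # top-K unification facts returned by SWCU
    return tokenizer(
        context,
        f"{question} {candidate}",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )

enc = encode_with_explanation(
    "What causes the ball to come down?",
    "gravity",
    ["gravity causes objects that have mass to be pulled down on a planet"],
)
```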
To evaluate the generalisation of the SWCU model on a larger set of questions requiring multi-hop reasoning, we run additional experiments on the ARC Challenge (Clark et al., 2018). Regarding the knowledge bases, we keep the set of unification facts and explanations from the Worldtree corpus (Xie et al., 2020) and substitute the set of abstractive facts with hypernyms, hyponyms, and antonyms from WordNet (Miller, 1995). This process allows us to reuse the core unification facts representing general scientific knowledge (e.g. gravity, friction) and, at the same time, to perform abstraction from novel concepts in the questions. Table 2 reports the results on the test-set (1,172 challenge questions).

Model | Explanation | Unsupervised | Pre-trained | External | Acc.
TupleInf (Khot et al., 2017) | Yes | Yes | No | Yes | 23.83
TableILP (Khashabi et al., 2016) | Yes | Yes | No | Yes | 26.97
DGEM (Clark et al., 2018) | Yes | No | Yes | Yes | 27.11
KG^2 (Zhang et al., 2018) | Yes | No | No | Yes | 31.70
Bi-LSTM max-out (Mihaylov et al., 2018) | No | No | Yes | Yes | 33.87
Unsupervised AHE (Yadav et al., 2019a) | Yes | Yes | Yes | No | 33.87
Supervised AHE (Yadav et al., 2019a) | Yes | No | Yes | No | 34.47
BERT-large (Yadav et al., 2019b) | No | No | Yes | No | 35.11
ET-RR (Ni et al., 2019) | Yes | No | Yes | Yes | 36.60
BERT-large + AutoROCC (Yadav et al., 2019b) | Yes | No | Yes | No | 41.24
Reading Strategies (Sun et al., 2019) | No | No | Yes | Yes | –
SWCU (K = 1) | Yes | Yes | No | Yes | 34.64
SWCU (K = 2) | Yes | Yes | No | Yes | 35.32
SWCU (K = 3) | Yes | Yes | No | Yes | –

Table 2: Accuracy of the SWCU model on the ARC Challenge (test-set) and comparison with existing baselines.

We compare the SWCU model against a set of state-of-the-art baselines, classifying them according to four dimensions:
(1) Explanation: the model produces an explanation for its prediction; (2) Unsupervised: the system does not require training on answer prediction; (3) Pre-trained: the model adopts pre-trained neural components such as language models or word embeddings; (4) External: the system uses external knowledge bases or it is pre-trained on additional datasets (e.g. RACE (Lai et al., 2017), SciTail (Khot et al., 2018)). The results show that the SWCU model outperforms the existing unsupervised systems based on Integer Linear Programming (ILP) (Khot et al., 2017; Khashabi et al., 2016) and pre-trained embeddings (Yadav et al., 2019a). At the same time, our model obtains competitive results with most of the supervised approaches, including BERT-large (Devlin et al., 2019). The SWCU model is still outperformed by reading strategies that adopt pre-training on external question answering datasets (Sun et al., 2019), which, however, do not produce explanations for their predicted answers.
We carried out an ablation study to investigate the contribution of the main architectural components. To perform the study, we gradually combine individual features to recreate the best SWCU model ($K = 2$). Table 3 reports the obtained results.

Model | EKB | Overall | Easy | Challenge | @2
BM25 IR + Plausibility Score (PS) | No | 40.58 | 43.19 | 34.79 | 65.28
BM25 IR + Abstraction (ABS) + PS | No | 43.46 | 46.57 | 36.60 | 67.68
BM25 IR + ABS + PS + Relevance Score (RS) | No | 50.36 | 55.30 | 39.43 | 72.65
BM25 Unification-based IR + ABS + PS + RS + Unification Score (US) | Yes | – | – | – | –

Table 3: Ablation study (Worldtree test-set). @2 is the accuracy considering whether the answer is in the top 2 hypotheses.

The basic model, BM25 IR + Plausibility Score, constructs and scores explanations without the abstraction step and analogical reasoning, considering only unification facts that are connected in one hop to the original hypothesis. The first observation is that the abstraction step has a positive impact on the abductive inference, improving the accuracy of the basic model by 2.88%. In the same way, a consistent improvement is achieved when the plausibility score (PS) is combined with the BM25 relevance score (RS) (+6.9%). In line with our research hypotheses, the use of analogical reasoning to compute the unification score (US) via $E_{KB}$ is crucial to achieve the final accuracy, leading to a substantial improvement on both easy (+5.83%) and challenge questions (+3.87%).
In this section, we investigate the relation between explanation and answer prediction. To this end, we correlate the accuracy achieved by different combinations of the SWCU model ($K = 2$) with a set of quantitative metrics for explanation evaluation, i.e. precision, recall, F1 score, and unification accuracy. Since the gold explanations in the Worldtree test-set are masked, we perform this analysis on the dev-set, comparing the best explanations generated for the predicted answers against the gold explanations in the corpus.

Model | Answer Acc. | Ex. Precision | Ex. Recall | Ex. F1 score | UNF Acc.
BM25 IR + ABS + PS | 44.69 | 35.72 | 17.25 | 26.49 | 36.28
BM25 IR + ABS + PS + RS | 60.62 | 52.75 | 23.77 | 38.26 | 55.75
BM25 Unification IR + ABS + PS + RS + US | – | – | – | – | –

Prediction | Accurate Unification (%)
Correct | 78.72
Wrong | 32.94

Table 5: Correlation between accuracy in answer prediction and explanation reconstruction metrics (Worldtree dev-set).

The results reported in Table 5 (top) highlight a positive correlation between accuracy in answer prediction and quality of the explanations. In particular, the performance increases according to the unification accuracy, i.e. the percentage of unifications for the predicted hypotheses that are part of the gold explanations. Therefore, these results confirm that the improvement on answer prediction is a consequence of better explanatory inference. The second part of the analysis focuses on investigating the extent to which accurate unification is also necessary for answer prediction (Table 5, bottom). In line with the expectations, the table shows that the majority of correct answers are derived from accurate unification (78.72%), while the majority of wrong predictions are the result of erroneous or spurious unification (67.06%). However, a minor percentage of correct and wrong answers are inferred from spurious and correct unification respectively, suggesting that alternative ways of constructing explanations are exploited by the model, and that, at the same time, accurate unification can on some occasions lead to wrong conclusions.

Example 1. Question: What force is needed to help stop a child from slipping on ice? (A) gravity, (B) friction, (C) electric, (D) magnetic. Prediction: (B) friction. Abstraction: (1) counter means reduce; stop; resist; (2) ice is a kind of object; (3) slipping is a kind of motion; (4) stop means not move. Unification: friction acts to counter the motion of two objects when their surfaces are touching. Gold: Y

Example 2. Question: What causes a change in the speed of a moving object? (A) force, (B) temperature, (C) change in mass, (D) change in location. Prediction: (A) force. Abstraction: –. Unification: a force continually acting on an object in the same direction that the object is moving can cause that object's speed to increase in a forward motion. Gold: N

Example 3. Question: Weather patterns sometimes result in drought. Which activity would be most negatively affected during a drought year? (A) boating, (B) farming, (C) hiking, (D) hunting. Prediction: (B) farming. Abstraction: (1) affected means changed; (2) a drought is a kind of slow environmental change. Unification: farming changes the environment. Gold: N

Example 4. Question: Beryl finds a rock and wants to know what kind it is. Which piece of information about the rock will best help her to identify it? (A) The size of the rock, (B) The weight of the rock, (C) The temperature where the rock was found, (D) The minerals the rock contains. Prediction: (A) The size of the rock. Abstraction: (1) a property is a kind of information; (2) size is a kind of property; (3) knowing the properties of something means knowing information about that something. Unification: the properties of something can be used to identify; used to describe that something. Gold: Y

Example 5. Question: Jeannie put her soccer ball on the ground on the side of a hill. What force acted on the soccer ball to make it roll down the hill? (A) gravity, (B) electricity, (C) friction, (D) magnetism. Prediction: (C) friction. Abstraction: (1) the ground means Earth's surface; (2) rolling is a kind of motion; (3) a roll is a kind of movement. Unification: friction acts to counter the motion of two objects when their surfaces are touching. Gold: N

Table 4: Examples of explanations generated by the SWCU model (dev-set). The underlined choices represent the correct answers. Gold indicates whether the unification fact is part of the gold explanation in the Worldtree corpus.

Table 4 shows a set of qualitative examples that help clarify these results. The first example shows the case in which both the selected answer and the unification are correct. The second row shows an example of correct answer prediction and spurious unification. In this case, however, the selected unification fact represents a plausible alternative way of constructing explanations, which is marked as spurious due to the difference with the corpus annotation. The third example represents the situation in which, despite wrong unification, the system is able to infer the correct answer. On the other hand, the subsequent example shows the case in which the unification is accurate, but the information it contains is not sufficient to discriminate the correct answer from the alternative choices. Finally, the last row describes the case in which spurious unification leads to wrong answer prediction.
In this section, we present an analysis to explore the robustness and limitations of the proposed approach. In this experiment (Table 6), we compute the accuracy of the SWCU model ($K = 2$) with and without the Unification Score (US) on questions with a varying degree of conceptual overlap between the alternative choices (from 0%–20% up to 60%–80%), and with a varying number of distinct concepts in the question.

Table 6: Accuracy with distracting concepts, with and without the Unification Score (US) (Worldtree test-set).

The results show a drop in performance that is proportional to the number of shared concepts between the candidate answers. Since the explanatory score partly depends on the conceptual connections between hypotheses and unifications, the system struggles to discriminate choices that share a large proportion of concepts. A similar behaviour is observed when the accuracy is correlated with the number of distinct concepts in the questions. Long questions, in fact, tend to include distracting concepts that affect the abstraction step, increasing the probability of building spurious explanations. Nevertheless, the results highlight the positive impact of the Unification Score (US) on the robustness of the model, showing that the unification patterns contribute to a better accuracy for questions that are difficult to answer with plausibility and relevance scores alone.
Explanations for Science Questions. Explanatory inference for science questions typically requires multi-hop reasoning, i.e. the ability to aggregate multiple facts from heterogeneous knowledge sources to arrive at the correct answer. This process is extremely challenging when dealing with natural language, with both empirical (Fried et al., 2015) and theoretical work (Khashabi et al., 2019) suggesting an intrinsic limitation in the composition of inference chains longer than 2 hops. This phenomenon, known as semantic drift, often results in the construction of spurious inference chains leading to wrong conclusions. Recent approaches have framed explanatory inference as the problem of building an optimal graph, whose generation is conditioned on a set of local and global semantic constraints (Khashabi et al., 2018; Khot et al., 2017; Jansen et al., 2017; Khashabi et al., 2016). A parallel line of research tries to tackle the problem through the construction of explanation-centred corpora, which can facilitate the identification of common explanatory patterns (Valentino et al., 2020; Jansen et al., 2018; Jansen, 2017; Jansen et al., 2016). Our approach attempts to leverage the best of both worlds by imposing, on one hand, a set of structural and functional constraints that limit the inference process to two macro steps (abductive reasoning), and, on the other hand, by identifying common unification patterns in explanations for similar questions (analogical reasoning). The explanatory patterns generated by the unification process, largely discussed in philosophy of science (Friedman, 1974; Kitcher, 1981, 1989), have influenced the development of expert systems based on case-based reasoning (Thagard and Litt, 2008; Kolodner, 2014). Similarly to our approach, case-based reasoning adopts analogy as a core component to retrieve explanations for known cases and adapt them in the solution of unseen problems.
Explanations for Natural Language Reasoning. Recent work has highlighted issues related to the interpretability of deep learning models (Miller, 2019; Biran and Cotton, 2017), which, among other things, affects the design of proper benchmarks for assessing natural language reasoning capabilities (Schlegel et al., 2020). To deal with the lack of interpretability, an emerging line of research explores the design of datasets including gold explanations, which support the construction and evaluation of explainable models in different domains, ranging from open domain question answering (Yang et al., 2018; Thayaparan et al., 2019) to textual entailment (Camburu et al., 2018) and reasoning with mathematical text (Ferreira and Freitas, 2020a,b). Other approaches explore the construction of explanations through the use of distributional and similarity-based models applied on external commonsense knowledge bases (Silva et al., 2019, 2018; Freitas et al., 2014). In line with this work, we demonstrate that the use of unification patterns for multi-hop explanations can enhance both accuracy and explainability of neural models on a challenging question answering task (Rajani et al., 2019; Yadav et al., 2019b).
This paper presented the Step-wise Conceptual Unification, a multi-hop reasoning framework that leverages unification patterns through analogical and abductive reasoning. We empirically demonstrated the efficacy of the model for unsupervised question answering and explanation extraction, remarking the impact of unification power on sound explanatory inference.
References
Or Biran and Courtenay Cotton. 2017. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), volume 8.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pages 9539–9549.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Deborah Ferreira and André Freitas. 2020a. Natural language premise selection: Finding supporting statements for mathematical text. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2175–2182.

Deborah Ferreira and André Freitas. 2020b. Premise selection in natural language mathematical texts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7365–7374.

André Freitas, Joao Carlos Pereira da Silva, Edward Curry, and Paul Buitelaar. 2014. A distributional semantics approach for selective reasoning on commonsense graph knowledge bases. In International Conference on Applications of Natural Language to Data Bases/Information Systems, pages 21–32. Springer.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mihai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid answer reranking. Transactions of the Association for Computational Linguistics, 3:197–210.

Michael Friedman. 1974. Explanation and scientific understanding. The Journal of Philosophy, 71(1):5–19.

Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2956–2965.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Peter Clark. 2017. Framing QA as building and ranking intersentence answer justifications. Computational Linguistics, 43(2):407–449.

Peter Jansen and Dmitry Ustalov. 2019. TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 63–77.

Peter Jansen, Elizabeth Wainwright, Steven Marmorstein, and Clayton Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Peter A. Jansen. 2017. A study of automatically acquiring explanatory inference patterns from corpora of explanations: Lessons from elementary science exams.

Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2019. On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1145–1152.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. In Thirty-Second AAAI Conference on Artificial Intelligence.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2019. QASC: A dataset for question answering via sentence composition. arXiv preprint arXiv:1910.11473.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering.

Philip Kitcher. 1981. Explanatory unification. Philosophy of Science, 48(4):507–531.

Philip Kitcher. 1989. Explanatory unification and the causal structure of the world.

Janet Kolodner. 2014. Case-Based Reasoning. Morgan Kaufmann.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2737–2747.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.

Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2019. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 335–344.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942.

Viktor Schlegel, Marco Valentino, André Freitas, Goran Nenadic, and Riza Theresa Batista-Navarro. 2020. A framework for evaluation of machine reading comprehension gold standards. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5359–5369.

Vivian Dos Santos Silva, Siegfried Handschuh, and André Freitas. 2018. Recognizing and justifying text entailment through distributional navigation on definition graphs. In AAAI, pages 4913–4920.

Vivian S. Silva, André Freitas, and Siegfried Handschuh. 2019. Exploring knowledge graphs in an interpretable composite approach for text entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7023–7030.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2633–2643.

Paul Thagard and Abninder Litt. 2008. Models of scientific explanation. The Cambridge Handbook of Computational Psychology, pages 549–564.

Mokanarangan Thayaparan, Marco Valentino, Viktor Schlegel, and André Freitas. 2019. Identifying supporting facts for multi-hop question answering with document graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 42–51.

Marco Valentino, Mokanarangan Thayaparan, and André Freitas. 2020. Unification-based reconstruction of explanations for science questions. arXiv preprint arXiv:2004.00061.

Zhengnan Xie, Sebastian Thiem, Jaycie Martin, Elizabeth Wainwright, Steven Marmorstein, and Peter Jansen. 2020. WorldTree V2: A corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5456–5473.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019a. Alignment over heterogeneous embeddings for question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2681–2691.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019b. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2578–2589.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. arXiv preprint arXiv:1805.12393.