Explainable Natural Language Reasoning via Conceptual Unification
Marco Valentino, Mokanarangan Thayaparan, André Freitas
Department of Computer Science, University of Manchester
{marco.valentino, mokanarangan.thayaparan, andre.freitas}@manchester.ac.uk

Abstract
This paper presents an abductive framework for multi-hop and interpretable textual inference. The reasoning process is guided by the notions of unification power and plausibility of an explanation, computed through the interaction of two major architectural components: (a) an analogical reasoning model that ranks explanatory facts by leveraging unification patterns in a corpus of explanations; (b) an abductive reasoning model that performs a search for the best explanation, which is realised via conceptual abstraction and subsequent unification. We demonstrate that the Step-wise Conceptual Unification can be effective for unsupervised question answering, and as an explanation extractor in combination with state-of-the-art Transformers. An empirical evaluation on the Worldtree corpus and the ARC Challenge resulted in the following conclusions: (1) the question answering model outperforms competitive neural and multi-hop baselines without requiring any explicit training on answer prediction; (2) when used as an explanation extractor, the proposed model significantly improves the performance of Transformers, leading to state-of-the-art results on the Worldtree corpus; (3) analogical and abductive reasoning are highly complementary for achieving sound explanatory inference, a feature that demonstrates the impact of the unification patterns on performance and interpretability.
Multiple-choice science questions have been proposed as a challenge task for natural language inference and question answering (Khot et al., 2019; Clark et al., 2018; Mihaylov et al., 2018). A central research line in the field aims at developing explainable inference models capable of performing accurate predictions and, at the same time, generating explanations for the underlying reasoning process (Miller, 2019; Biran and Cotton, 2017).
Figure 1: Modelling explanatory inference via step-wise conceptual unification. The example question Q asks: "If you bounce a rubber ball on the floor, it goes up and then comes down. What causes the ball to come down?", with candidate answers C1: Gravity, C2: Magnetism, C3: Electricity, C4: Friction. Abstraction facts (1) such as "a ball is a kind of object", "the floor is a kind of object", "come down is similar to falling" and "gravity means gravitational pull; gravitational energy" connect the question to the unification fact (2) "gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet".
The construction of explanations for science questions is typically framed as a multi-hop reasoning problem, where multiple pieces of evidence need to be combined to arrive at the final answer. Recent approaches adopt global and local semantic constraints to guide the generation of plausible multi-hop explanations (Khashabi et al., 2018; Jansen et al., 2017; Khashabi et al., 2016). However, the use of explicit constraints for reasoning with natural language often results in semantic drift, i.e. the tendency of composing spurious inference chains that lead to wrong conclusions (Khashabi et al., 2019). To deal with semantic drift, recent work has proposed the crowd-sourcing of explanation-centred corpora (Xie et al., 2020; Jansen et al., 2018, 2016), which can enable the identification of common explanatory patterns. Although these resources have been applied for explanation regeneration (Valentino et al., 2020; Jansen and Ustalov, 2019), it is not yet clear how they can support the downstream answer prediction task. In this paper, we aim at moving a step forward in this direction, exploring how explanatory patterns can be leveraged for multi-hop reasoning.

Research in Philosophy of Science suggests that explanations act through unification (Friedman, 1974; Kitcher, 1989). The function of an explanation is to unify a set of disconnected phenomena, showing that they are the expression of a common regularity; e.g. Newton's law of universal gravitation unifies the motion of planets and falling bodies by showing that they obey the same law. The higher the number of distinct phenomena explained by a given statement, the higher its unification power. Therefore, explanations with high unification power tend to create unification patterns, i.e. the same statement is reused to explain a large variety of similar phenomena. We hypothesise that the unification patterns emerging in a corpus of explanations can ultimately guide the abductive reasoning process. Consider the example in Figure 1. The explanation performs unification by connecting a concrete phenomenon, i.e. a ball falling on the floor, to a general regularity that applies to a broader set of phenomena and that explains a large number of questions, i.e. gravity affects all the objects that have mass. As a result, multi-hop inference for science questions can be modelled as an abstraction from the original context in search of an underlying explanatory law, which in turn manifests its unification power by being frequently reused in explanations for similar questions.

In this paper, we build upon the concept of explanatory unification and provide the following contributions: (1) we present the Step-wise Conceptual Unification, an abductive framework that combines explicit semantic constraints with the notion of unification power for multi-hop inference, computed via analogical reasoning on a corpus of explanations; (2) we empirically show the efficacy of the framework for unsupervised question answering and explanation extraction; (3) we study the impact of the unification patterns on abductive reasoning, demonstrating their role in improving the accuracy of prediction and the soundness of explanations.

A multiple-choice science question is a tuple $Q = (q, C)$ characterised by a question $q$ and a set of candidate answers $C = \{c_1, c_2, \ldots, c_n\}$. A set of hypotheses $H = \{h_1, h_2, \ldots, h_n\}$ can be derived by concatenating $q$ with each $c_j \in C$, i.e. $h_j = concat(c_j, q)$.
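As an illustration of this formulation, a minimal sketch of the hypothesis construction step follows; the class and function names are ours, not from the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str              # the question q
    candidates: list[str]  # the candidate answers C = {c_1, ..., c_n}

def build_hypotheses(question: Question) -> list[str]:
    """Derive one hypothesis h_j = concat(c_j, q) per candidate answer."""
    return [f"{c} {question.text}" for c in question.candidates]

q = Question(
    text="If you bounce a rubber ball on the floor, what causes it to come down?",
    candidates=["gravity", "magnetism", "electricity", "friction"],
)
hypotheses = build_hypotheses(q)  # one natural language statement per choice
```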
Given $H$, we frame multi-hop inference for multiple-choice question answering as the problem of selecting the hypothesis that is supported by the best explanation. The Step-wise Conceptual Unification constructs and scores explanations by composing multiple sentences from a knowledge base, which we refer to as Facts KB ($F_{KB}$). This resource includes the knowledge necessary to answer and explain science questions, ranging from common sense and taxonomic relations (e.g. a ball is a kind of object) to scientific statements and laws (e.g. gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet). The abductive process is guided by the unification patterns emerging in a second knowledge base, named Explanations KB ($E_{KB}$), which contains a set of true hypotheses with their respective explanations.
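To make the data model concrete, here is a minimal sketch of the two knowledge bases as plain Python structures; the variable names and the apple example are illustrative, not taken from the corpus:

```python
# Facts KB (F_KB): natural language facts, later partitioned into
# abstractive facts (taxonomic/synonym/antonym) and unification facts.
facts_kb = [
    "a ball is a kind of object",
    "come down is similar to falling",
    "gravity causes objects that have mass to be pulled down on a planet",
]

# Explanations KB (E_KB): true hypotheses paired with the set of facts
# that explain them; unification patterns emerge from fact reuse here.
explanations_kb = [
    (
        "gravity causes a dropped apple to fall",  # hypothesis (illustrative)
        {
            "an apple is a kind of object",
            "gravity causes objects that have mass to be pulled down on a planet",
        },
    ),
]
```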
The framework is based on the following research hypotheses:

RH1: Explanations can be constructed through two major inference steps, namely abstraction and unification: (1) retrieving a set of abstractive facts whose role is to expand the context of the hypothesis in search of an underlying regularity; (2) selecting a unification fact, which represents an explanatory scientific statement.

RH2: The best explanation can be determined by considering two properties of the unification: (a) the plausibility of the unification, that is, a measure of the semantic connection between the unification statement and the original hypothesis; (b) the unification power, which depends on how often the unification explains similar hypotheses.
In general, we consider the facts in $F_{KB}$ and the hypotheses in $H$ as natural language statements composed of a set of distinct concepts $CP(f_i) = \{cp_1, cp_2, \ldots, cp_z\}$ (e.g. "gravity", "ball", "living thing"). To formalise our research hypotheses, we divide the facts in $F_{KB}$ into two categories. The sentences expressing taxonomic relations between concepts (i.e. "x is a kind of y"), synonyms (i.e. "x means y") and antonyms (i.e. "x is the opposite of y") are classified as abstractive, while all the other facts (e.g. properties, causes, processes, scientific laws) are considered for unification. We say that two arbitrary facts $f_i$ and $f_j$ are conceptually connected if the intersection between $CP(f_i)$ and $CP(f_j)$ is not empty, i.e. $CP(f_i) \cap CP(f_j) \neq \emptyset$. On the other hand, we say that two facts $f_i$ and $f_j$ are indirectly connected if $CP(f_i) \cap CP(f_j) = \emptyset$ and there exists a fact $f_z$ such that $CP(f_i) \cap CP(f_z) \neq \emptyset$ and $CP(f_j) \cap CP(f_z) \neq \emptyset$.
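These two predicates translate directly into set operations; a minimal sketch, assuming each fact has already been mapped to its concept set $CP(\cdot)$ (all names are ours):

```python
def conceptually_connected(cp_fi: set, cp_fj: set) -> bool:
    """Two facts are conceptually connected if they share at least one concept."""
    return len(cp_fi & cp_fj) > 0

def indirectly_connected(cp_fi: set, cp_fj: set, kb_concepts: list) -> bool:
    """Two facts are indirectly connected if they share no concept, but some
    bridge fact f_z is conceptually connected with both of them."""
    if cp_fi & cp_fj:
        return False
    return any(cp_fi & cp_fz and cp_fj & cp_fz for cp_fz in kb_concepts)

# Example: "a ball is a kind of object" bridges the hypothesis and the gravity fact.
cp_hyp = {"ball", "floor", "come down"}
cp_unif = {"gravity", "object", "mass", "fall"}
bridge = [{"ball", "object"}]
assert indirectly_connected(cp_hyp, cp_unif, bridge)
```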
Figure 2: Overview of the Step-wise Conceptual Unification framework. (1) Unification-based Reconstruction (analogical reasoning) retrieves candidate abstractive facts $CABS(h_j)$ and candidate unification facts $CUNF(h_j)$ from $F_{KB}$, with analogical scores (RS + US) derived from $E_{KB}$; (2) the Step-wise Conceptual Unification (abductive reasoning) performs abstraction and unification to compute explanatory scores and a hypothesis score; (3) answer selection takes the maximum-scoring hypothesis; (4) the corresponding explanation is returned.

We consider as part of the explanation only pairs of facts that are at least indirectly connected. Moreover, following our first research hypothesis (RH1), we consider compositions formed by an arbitrary number of abstractive facts and one unification fact. Therefore, a generic explanation for $h_j \in H$ can be reformulated as a tuple $E_j = (ABS, UNF)$ defined by the following elements:

• $ABS = \{a_1, a_2, \ldots, a_n\} \subseteq F_{KB}$ is a set of abstractive facts such that $|ABS| \geq 0$;

• $UNF = \{u\} \subseteq F_{KB}$ is a singleton including one unification fact $u$, i.e. $|UNF| = 1$.

Additional constraints are determined by the conceptual connections between these sets. Specifically, each abstractive fact $a_i \in ABS$ must be conceptually connected with both the hypothesis $h_j$ and the unification fact $u \in UNF$. On the other hand, $u$ must be conceptually connected with each abstractive fact if $|ABS| > 0$, and with the hypothesis $h_j$ if $|ABS| = 0$. In other words, to ensure that the unification fact is semantically plausible, we force it to be linked with $h_j$ in one or two hops through abstractive facts (e.g. Fig. 1).

To determine which hypothesis in $H$ is supported by the best explanation, we define a framework consisting of four major algorithmic steps (Fig. 2). For each hypothesis $h_j \in H$, the first step is aimed at retrieving a set of candidate explanatory facts. The architectural component responsible for this task performs analogical reasoning by leveraging unification patterns for similar hypotheses in $E_{KB}$. The output of this component is represented by two distinct subsets of $F_{KB}$: (a) a set of candidate abstractive facts $CABS(h_j) = \{a_1, a_2, \ldots, a_z\}$; (b) a set of candidate unification facts $CUNF(h_j) = \{u_1, u_2, \ldots, u_t\}$. Each $u_i \in CUNF(h_j)$ is associated with an analogical score $as(h_j, u_i)$, computed with respect to the hypothesis $h_j$ and reflecting the unification power of $u_i$. The second step uses the output of the analogical component to perform abductive reasoning. Specifically, the elements of $CABS(h_j)$ and $CUNF(h_j)$ are combined to build a set of plausible explanations $E(h_j) = \{E_1, E_2, \ldots, E_n\}$. For each explanation $E_i = (ABS_i, UNF_i)$, the abductive component computes an explanatory score $es(h_j, u_i)$ by taking into account the analogical score and the plausibility of the unification $u_i$, which is derived from the conceptual connections with $h_j$. The top $K$ unifications ranked according to their explanatory scores are adopted to determine the final score for the hypothesis $h_j$. Finally, the answer selection component (step 3) collects the scores computed for each $h_j \in H$ and selects the candidate answer $c_i \in C$ associated with the best hypothesis. For explainability, the predicted answer can be enriched with the unification performed by the system (step 4).
The analogical reasoning component adopts the Unification-based Reconstruction model (Valentino et al., 2020). For each fact $f_i \in F_{KB}$, the model computes a score $as(h_j, f_i)$ that is derived by the combination of its lexical relevance, i.e. the Relevance Score (RS), and its unification power, defined as the Unification Score (US) (Fig. 2):

$$as(h_j, f_i) = \lambda_1\, rs(h_j, f_i) + \lambda_2\, us(h_j, f_i) \quad (1)$$

The unification score $us(h_j, f_i)$ is described by the following formula:

$$us(h_j, f_i) = \sum_{z=1}^{|kNN(h_j)|} sim(h_j, h_z)\, in(f_i, E_z) \quad (2)$$

$$in(f_i, E_z) = \begin{cases} 1 & \text{if } f_i \in E_z \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

Here, $kNN(h_j) = \{(h_1, E_1), \ldots, (h_n, E_n)\} \subseteq E_{KB}$ is the set of k-nearest neighbours of $h_j$, which includes hypothesis and explanation pairs $(h_z, E_z)$ retrieved according to a similarity measure $sim(h_j, h_z)$, while $in(f_i, E_z)$ is a function that returns 1 if $f_i$ is used to explain $h_z$, and 0 otherwise. Therefore, the more a fact $f_i \in F_{KB}$ explains similar hypotheses in $E_{KB}$, the higher its unification score. In our experiments, both $sim(h_j, h_z)$ and $rs(h_j, f_i)$ are implemented using BM25 vectors and cosine similarity.
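A compact sketch of Equations 1–3, assuming $E_{KB}$ is given as (hypothesis, explanation-facts) pairs; we use scikit-learn's TF-IDF vectors as a stand-in for the paper's BM25 implementation, and the function name and default weights are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def analogical_scores(hypothesis, facts, explanation_kb, lam1=0.5, lam2=0.5, k=10):
    """Score each fact by lexical relevance (RS) plus unification power (US).
    explanation_kb: list of (hypothesis_text, set_of_explanation_facts)."""
    vec = TfidfVectorizer().fit(facts + [h for h, _ in explanation_kb] + [hypothesis])
    h_vec = vec.transform([hypothesis])

    # rs(h_j, f_i): relevance of each fact to the hypothesis
    rs = cosine_similarity(h_vec, vec.transform(facts))[0]

    # kNN(h_j): the k most similar hypotheses in the Explanations KB
    kb_vecs = vec.transform([h for h, _ in explanation_kb])
    sims = cosine_similarity(h_vec, kb_vecs)[0]
    knn = np.argsort(sims)[::-1][:k]

    # us(h_j, f_i) = sum over neighbours of sim(h_j, h_z) * in(f_i, E_z)
    us = np.array([
        sum(sims[z] for z in knn if fact in explanation_kb[z][1])
        for fact in facts
    ])
    return lam1 * rs + lam2 * us  # as(h_j, f_i), Eq. 1
```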
The abductive reasoning model constructs and scores a set of explanations $E(h_j)$ using abstraction and unification steps. For each concept in the hypothesis $c_i \in CP(h_j)$, the abstraction step computes an expansion set $EXP(c_i)$ considering each candidate abstractive fact $a_k \in CABS(h_j)$:

$$EXP(c_i) = \bigcup_k CP(a_k) \;\big|\; c_i \in CP(a_k) \cap CP(h_j) \quad (4)$$

The set $EXP(c_i)$ represents the union of all the concepts that occur in abstractive facts mentioning $c_i$. For example, considering the hypothesis in Figure 1, the set $EXP(ball)$ will include the concept "object", extracted from the fact "a ball is a kind of object". Therefore, $EXP(c_i)$ will include $c_i$ plus its hypernyms, hyponyms, synonyms and opposite concepts contained in $CABS(h_j)$.
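Equation 4 reduces to a union over concept sets; a minimal sketch with illustrative names:

```python
def expansion_set(concept: str, cp_hypothesis: set, candidate_abs: list) -> set:
    """EXP(c_i): union of the concepts of all abstractive facts mentioning c_i.
    candidate_abs: list of concept sets, one per fact in CABS(h_j)."""
    exp = {concept}
    for cp_ak in candidate_abs:
        if concept in cp_ak and concept in cp_hypothesis:
            exp |= cp_ak
    return exp

# EXP("ball") picks up "object" from the fact "a ball is a kind of object".
cabs = [{"ball", "object"}, {"floor", "object"}, {"come down", "falling"}]
print(expansion_set("ball", {"ball", "floor", "come down"}, cabs))
# -> {'ball', 'object'}
```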
In the unification step, the abductive component analyses each candidate unification fact $u_k \in CUNF(h_j)$ and checks whether there exists at least one concept $c_i$ such that $CP(u_k) \cap EXP(c_i) \neq \emptyset$. If this condition is respected (e.g. Fig. 1), the component adds a new explanation $E_k$ to $E(h_j)$, composed of the unification $u_k$ and all the abstractive facts that are connected with $u_k$ and $h_j$. Conversely, if the condition is not respected, the unification fact $u_k$ is discarded. Once the set $E(h_j)$ is created, the abductive component assigns an explanatory score to each explanation $E_k$ by considering the unification fact $u_k$:

$$es(h_j, u_k) = \lambda_1\, as(h_j, u_k) + \lambda_2\, ps(h_j, u_k) \quad (5)$$

Here, $as(h_j, u_k)$ is the analogical score computed for $u_k$, while $ps(h_j, u_k)$ represents the plausibility score, defined as follows:

$$ps(h_j, u_k) = \frac{|\{c_i \mid EXP(c_i) \cap CP(u_k) \neq \emptyset\}|}{|CP(h_j)|} \quad (6)$$

The plausibility score $ps(h_j, u_k)$ represents the percentage of concepts in the hypothesis $h_j$ that have at least an indirect link with the unification fact $u_k$. Therefore, the higher the degree of conceptual coverage between the unification and the original hypothesis, the higher the plausibility score. In line with our research hypotheses (RH2), the full explanatory score of a unification fact $u_k$ jointly depends on its semantic plausibility and unification power. Finally, the abductive model computes the hypothesis score by considering the top $K$ unifications for $h_j$ ranked by their explanatory scores:

$$hs(h_j) = \sum_{k=1}^{K} es(h_j, u_k) \quad (7)$$

The final answer is selected by considering the hypothesis in $H$ with the highest score:

$$ans(Q) = c_a \in C \;\big|\; a = \arg\max_j\, hs(h_j) \quad (8)$$
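Equations 5–8 then compose into a short scoring and selection routine; a sketch under the same illustrative naming, where `explanatory_scores` holds the values $es(h_j, u_k) = \lambda_1\, as + \lambda_2\, ps$ for the candidate unifications of one hypothesis:

```python
def plausibility_score(cp_hyp: set, cp_unif: set, exp: dict) -> float:
    """ps (Eq. 6): fraction of hypothesis concepts with at least an
    indirect link to the unification fact; exp maps concept -> EXP(c_i)."""
    covered = [c for c in cp_hyp if exp.get(c, {c}) & cp_unif]
    return len(covered) / len(cp_hyp)

def hypothesis_score(explanatory_scores: list, k: int = 2) -> float:
    """hs (Eq. 7): sum of the top-K explanatory scores for one hypothesis."""
    return sum(sorted(explanatory_scores, reverse=True)[:k])

def select_answer(candidates: list, hypothesis_scores: list) -> str:
    """ans(Q) (Eq. 8): candidate whose hypothesis achieves the highest score."""
    return max(zip(candidates, hypothesis_scores), key=lambda p: p[1])[0]
```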
We evaluate the Step-wise Conceptual Unification (SWCU) on multiple-choice question answering. First, we test the efficacy of the framework for unsupervised question answering. Here, we adopt the algorithmic steps described in the previous section, using Equation 8 for answer prediction. In addition, we evaluate the model for explanation extraction. In this case, the top $K$ unification facts (Equation 7) are used as supporting evidence for a Transformer model, which is then fine-tuned on answer prediction. We perform the experiments combining SWCU with BERT-base (Devlin et al., 2019) and RoBERTa-large (Liu et al., 2019). The SWCU model is implemented via BM25 vectors and cosine similarity, which are used for computing $sim(h_j, h_z)$ in Equation 2 and $rs(h_j, f_i)$ in Equation 1.

The knowledge bases ($F_{KB}$ and $E_{KB}$) are populated using the Worldtree corpus (Jansen et al., 2018), which provides gold explanations for multiple-choice science questions. Here, an explanation is a composition of facts stored in a set of semi-structured tables, each of them representing a specific knowledge type. We extract the row sentences from the tables and use them to build the Facts KB ($F_{KB}$).
The sentences in the Kindof, Synonyms and Opposites tables are used as abstractive facts, while the remaining sentences are adopted for unification. The questions in the corpus are split into train-set (1,190 questions), dev-set (264 questions) and test-set (1,247 questions). Questions and explanations in the train-set are used to populate the Explanations KB ($E_{KB}$), while the dev-set and the test-set are adopted for evaluation. The concepts in facts and hypotheses are extracted using WordNet (Miller, 1995). Specifically, given a sentence, we define a concept as a maximal sequence of words that corresponds to a valid synset. This process allows us to capture multi-word expressions (e.g. "living thing") that typically occur in science questions.
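A sketch of this concept extraction using NLTK's WordNet interface, greedily matching maximal word sequences against valid synsets; this is a simplification under our own assumptions, not the authors' exact preprocessing:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def extract_concepts(sentence: str, max_len: int = 3) -> set:
    """Greedily extract maximal word sequences that map to a WordNet synset."""
    tokens = sentence.lower().split()
    concepts, i = set(), 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            span = "_".join(tokens[i:i + length])  # WordNet joins words with '_'
            if wn.synsets(span):
                concepts.add(span.replace("_", " "))
                i += length
                break
        else:
            i += 1  # no synset starts here; skip the token
    return concepts

print(extract_concepts("a frog is a kind of living thing"))
# captures the multi-word concept 'living thing'
```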
In this section, we present the results achieved on the Worldtree corpus (test-set). We report the accuracy for SWCU with different numbers of unification facts ($K$ in Equation 7), while the accuracy for SWCU in combination with Transformers is reported for the best configuration ($K = 2$). Overall, we observe that the SWCU model is competitive with SWCU + BERT-base, while SWCU + RoBERTa-large achieves state-of-the-art results, outperforming all the proposed models and baselines. We compare the framework against four categories of approaches: Information Retrieval, Multi-hop Inference, Transformers, and Transformers with Explanation. The results are reported in Table 1.

Model | Unsupervised | Overall | Easy | Challenge
Information Retrieval (IR) | | | |
BM25 IR solver (Clark et al., 2018) | Yes | 41.22 | 44.94 | –
BM25 Unification-based IR solver | Yes | – | – | –
Multi-hop Inference | | | |
BM25 IR + PathNet (Kundu et al., 2019) | No | 41.50 | 43.32 | –
BM25 Unification-based IR + PathNet (Kundu et al., 2019) | No | – | – | –
Transformers | | | |
BERT-base (Devlin et al., 2019; Valentino et al., 2020) | No | 41.78 | 48.54 | 26.28
RoBERTa-large (Liu et al., 2019) | No | – | – | –
Transformers with Explanation | | | |
BM25 IR + BERT-base (Valentino et al., 2020) | No | 49.39 | 53.20 | 40.97
BM25 Unification-based IR + BERT-base (Valentino et al., 2020) | No | 51.62 | 55.46 | 41.97
BM25 IR + RoBERTa-large | No | 56.86 | 60.88 | –
BM25 Unification-based IR + RoBERTa-large | No | – | – | –
Step-wise Conceptual Unification | | | |
SWCU (K = 1) | Yes | 52.36 | 56.93 | 42.27
SWCU (K = 2) | Yes | – | – | –
SWCU (K = 3) | Yes | 53.49 | 59.25 | 40.72
SWCU (K = 2) + BERT-base | No | 52.29 | 56.00 | 44.07
SWCU (K = 2) + RoBERTa-large | No | – | – | –

Table 1: Accuracy in answer prediction on the Worldtree corpus (test-set). The parameter K represents the number of unification facts considered to compute the hypothesis score (Equation 7).
Information Retrieval (IR). For the IR category, we employ two baselines similar to the one described in Clark et al. (2018). Given a hypothesis $h_j$, the BM25 IR solver adopts BM25 vectors and cosine similarity to retrieve the sentence in $F_{KB}$ that is most relevant to $h_j$. The relevance score is then used to determine the final answer. The BM25 Unification-based IR solver adopts the same strategy, complementing the relevance score with the unification score (Equation 1). Similarly to the SWCU model, these approaches employ scalable IR techniques and do not require training for answer prediction. However, the results show that the SWCU model significantly outperforms these baselines on both easy and challenge questions.
Multi-hop Inference. We consider PathNet (Kundu et al., 2019) as a multi-hop and explainable reasoning baseline. This model constructs paths connecting question and candidate answer, and subsequently scores them through a neural architecture. We reproduce PathNet on the Worldtree corpus using the source code available at https://github.com/allenai/PathNet. The best results are obtained considering the top 15 facts selected by the IR models. The differences between PathNet and SWCU are twofold: (1) PathNet assumes that an explanation always has the shape of a single, linear path; (2) PathNet does not leverage unification patterns to guide the construction of multi-hop explanations. Our experiments show that these characteristics play a significant role in the final accuracy of the systems.
Transformers. We compare our framework against BERT-base (Devlin et al., 2019) and RoBERTa-large (Liu et al., 2019) fine-tuned on the multiple-choice question answering task. We observe that the SWCU model outperforms both baselines with a significantly smaller number of parameters and without direct supervision. At the same time, the improvement achieved using SWCU as an evidence extractor demonstrates the impact of the constructed explanations on Transformers.
Transformers with Explanation. Finally, we compare our approach against Transformers enhanced with IR baselines (i.e. BM25 IR and BM25 Unification-based IR) (Valentino et al., 2020). The best results for these models are obtained considering the top 3 sentences retrieved by the IR models. We observe that the use of SWCU as an explanation extractor improves these baselines on both easy and challenge questions, confirming that the Step-wise Conceptual Unification provides more discriminating evidence for answer prediction.
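For concreteness, one plausible way to feed the extracted facts to a Transformer (our own encoding sketch, not necessarily the authors' exact input format) is to pair the concatenated top-K unification facts with each question-candidate hypothesis before fine-tuning:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_with_explanation(question, candidate, facts, max_length=256):
    """Pack [CLS] facts [SEP] question + candidate [SEP] for sequence scoring."""
    context = " ".join(facts)  # top-K unification facts returned by SWCU
    return tokenizer(
        context,
        f"{question} {candidate}",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )

enc = encode_with_explanation(
    "What causes the ball to come down?",
    "gravity",
    ["gravity causes objects that have mass to be pulled down on a planet"],
)
```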
To evaluate the generalisation of the SWCU model on a larger set of questions requiring multi-hop reasoning, we run additional experiments on the ARC Challenge (Clark et al., 2018). Regarding the knowledge bases, we keep the set of unification facts and explanations from the Worldtree corpus (Xie et al., 2020) and substitute the set of abstractive facts with hypernyms, hyponyms, and antonyms from WordNet (Miller, 1995). This process allows us to reuse the core unification facts representing general scientific knowledge (e.g. gravity, friction) and, at the same time, to perform abstraction from novel concepts in the questions. Table 2 reports the results on the test-set (1,172 challenge questions).

Model | Explanation | Unsupervised | Pre-trained | External | Acc.
TupleInf (Khot et al., 2017) | Yes | Yes | No | Yes | 23.83
TableILP (Khashabi et al., 2016) | Yes | Yes | No | Yes | 26.97
DGEM (Clark et al., 2018) | Yes | No | Yes | Yes | 27.11
KG^2 (Zhang et al., 2018) | Yes | No | No | Yes | 31.70
Bi-LSTM max-out (Mihaylov et al., 2018) | No | No | Yes | Yes | 33.87
Unsupervised AHE (Yadav et al., 2019a) | Yes | Yes | Yes | No | 33.87
Supervised AHE (Yadav et al., 2019a) | Yes | No | Yes | No | 34.47
BERT-large (Yadav et al., 2019b) | No | No | Yes | No | 35.11
ET-RR (Ni et al., 2019) | Yes | No | Yes | Yes | 36.60
BERT-large + AutoROCC (Yadav et al., 2019b) | Yes | No | Yes | No | 41.24
Reading Strategies (Sun et al., 2019) | No | No | Yes | Yes | –
SWCU (K = 1) | Yes | Yes | No | Yes | 34.64
SWCU (K = 2) | Yes | Yes | No | Yes | 35.32
SWCU (K = 3) | Yes | Yes | No | Yes | –

Table 2: Accuracy of the SWCU model on the ARC Challenge (test-set) and comparison with existing baselines.

We compare the SWCU model against a set of state-of-the-art baselines, classifying them according to four dimensions:
(1) Explanation: the model produces an explanation for its prediction; (2) Unsupervised: the system does not require training on answer prediction; (3) Pre-trained: the model adopts pre-trained neural components such as language models or word embeddings; (4) External: the system uses external knowledge bases or it is pre-trained on additional datasets (e.g. RACE (Lai et al., 2017), SciTail (Khot et al., 2018)). The results show that the SWCU model outperforms the existing unsupervised systems based on Integer Linear Programming (ILP) (Khot et al., 2017; Khashabi et al., 2016) and pre-trained embeddings (Yadav et al., 2019a). At the same time, our model obtains competitive results with most of the supervised approaches, including BERT-large (Devlin et al., 2019). The SWCU model is still outperformed by reading strategies that adopt pre-training on external question answering datasets (Sun et al., 2019), which, however, do not produce explanations for their predicted answers.
We carried out an ablation study to investigate the contribution of the main architectural components. To perform the study, we gradually combine individual features to recreate the best SWCU model ($K = 2$). Table 3 reports the obtained results.

Model | EKB | Overall | Easy | Challenge | @2
BM25 IR + Plausibility Score (PS) | No | 40.58 | 43.19 | 34.79 | 65.28
BM25 IR + Abstraction (ABS) + PS | No | 43.46 | 46.57 | 36.60 | 67.68
BM25 IR + ABS + PS + Relevance Score (RS) | No | 50.36 | 55.30 | 39.43 | 72.65
BM25 Unification-based IR + ABS + PS + RS + Unification Score (US) | Yes | – | – | – | –

Table 3: Ablation study (Worldtree test-set). @2 is the accuracy considering whether the answer is in the top 2 hypotheses.

The basic model, BM25 IR + Plausibility Score, constructs and scores explanations without the abstraction step and analogical reasoning, considering only unification facts that are connected in one hop to the original hypothesis. The first observation is that the abstraction step has a positive impact on the abductive inference, improving the accuracy of the basic model by 2.88%. In the same way, a consistent improvement is achieved when the plausibility score (PS) is combined with the BM25 relevance score (RS) (+6.9%). In line with our research hypotheses, the use of analogical reasoning to compute the unification score (US) via $E_{KB}$ is crucial to achieve the final accuracy, leading to a substantial improvement on both easy (+5.83%) and challenge questions (+3.87%).
In this section, we investigate the relation between explanation and answer prediction. To this end, we correlate the accuracy achieved by different combinations of the SWCU model ($K = 2$) with a set of quantitative metrics for explanation evaluation, i.e. precision, recall, F1 score, and unification accuracy. Since the gold explanations in the Worldtree test-set are masked, we perform this analysis on the dev-set, comparing the best explanations generated for the predicted answers against the gold explanations in the corpus.

Model | Answer Acc. | Ex. Precision | Ex. Recall | Ex. F1 score | UNF Acc.
BM25 IR + ABS + PS | 44.69 | 35.72 | 17.25 | 26.49 | 36.28
BM25 IR + ABS + PS + RS | 60.62 | 52.75 | 23.77 | 38.26 | 55.75
BM25 Unification IR + ABS + PS + RS + US | – | – | – | – | –

Prediction | Accurate Unification (%)
Correct | 78.72
Wrong | 32.94

Table 5: Correlation between accuracy in answer prediction and explanation reconstruction metrics (Worldtree dev-set).

The results reported in Table 5 (top) highlight a positive correlation between accuracy in answer prediction and quality of the explanations. In particular, the performance increases according to the unification accuracy, i.e. the percentage of unifications for the predicted hypotheses that are part of the gold explanations. Therefore, these results confirm that the improvement on answer prediction is a consequence of better explanatory inference. The second part of the analysis focuses on investigating the extent to which accurate unification is also necessary for answer prediction (Table 5, bottom). In line with the expectations, the table shows that the majority of correct answers are derived from accurate unification (78.72%), while the majority of wrong predictions are the result of erroneous or spurious unification (67.06%). However, a minor percentage of correct and wrong answers are inferred from spurious and correct unification respectively, suggesting that alternative ways of constructing explanations are exploited by the model, and that, at the same time, accurate unification can on some occasions lead to wrong conclusions.

Example 1. Question: What force is needed to help stop a child from slipping on ice? (A) gravity, (B) friction, (C) electric, (D) magnetic. Prediction: (B) friction. Abstraction: (1) counter means reduce; stop; resist; (2) ice is a kind of object; (3) slipping is a kind of motion; (4) stop means not move. Unification: friction acts to counter the motion of two objects when their surfaces are touching. Gold: Y

Example 2. Question: What causes a change in the speed of a moving object? (A) force, (B) temperature, (C) change in mass, (D) change in location. Prediction: (A) force. Abstraction: –. Unification: a force continually acting on an object in the same direction that the object is moving can cause that object's speed to increase in a forward motion. Gold: N

Example 3. Question: Weather patterns sometimes result in drought. Which activity would be most negatively affected during a drought year? (A) boating, (B) farming, (C) hiking, (D) hunting. Prediction: (B) farming. Abstraction: (1) affected means changed; (2) a drought is a kind of slow environmental change. Unification: farming changes the environment. Gold: N

Example 4. Question: Beryl finds a rock and wants to know what kind it is. Which piece of information about the rock will best help her to identify it? (A) The size of the rock, (B) The weight of the rock, (C) The temperature where the rock was found, (D) The minerals the rock contains. Prediction: (A) The size of the rock. Abstraction: (1) a property is a kind of information; (2) size is a kind of property; (3) knowing the properties of something means knowing information about that something. Unification: the properties of something can be used to identify; used to describe that something. Gold: Y

Example 5. Question: Jeannie put her soccer ball on the ground on the side of a hill. What force acted on the soccer ball to make it roll down the hill? (A) gravity, (B) electricity, (C) friction, (D) magnetism. Prediction: (C) friction. Abstraction: (1) the ground means Earth's surface; (2) rolling is a kind of motion; (3) a roll is a kind of movement. Unification: friction acts to counter the motion of two objects when their surfaces are touching. Gold: N

Table 4: Examples of explanations generated by the SWCU model (dev-set). The underlined choices represent the correct answers. Gold indicates whether the unification fact is part of the gold explanation in the Worldtree corpus.

Table 4 shows a set of qualitative examples that help clarify these results. The first example shows the case in which both the selected answer and the unification are correct. The second row shows an example of correct answer prediction and spurious unification. In this case, however, the selected unification fact represents a plausible alternative way of constructing explanations, which is marked as spurious due to the difference with the corpus annotation. The third example represents the situation in which, despite wrong unification, the system is able to infer the correct answer. On the other hand, the subsequent example shows the case in which the unification is accurate, but the information it contains is not sufficient to discriminate the correct answer from the alternative choices. Finally, the last row describes the case in which spurious unification leads to wrong answer prediction.
In this section, we present an analysis to explore the robustness and limitations of the proposed approach. In this experiment (Table 6), we compute the accuracy of the SWCU model ($K = 2$) with and without the Unification Score (US) on questions with a varying degree of conceptual overlap between the alternative choices (from 0%–20% up to 60%–80%), and with a varying number of distinct concepts in the question.

Table 6: Accuracy with distracting concepts, with and without the Unification Score (US) (Worldtree test-set).

The results show a drop in performance that is proportional to the number of shared concepts between the candidate answers. Since the explanatory score partly depends on the conceptual connections between hypotheses and unifications, the system struggles to discriminate choices that share a large proportion of concepts. A similar behaviour is observed when the accuracy is correlated with the number of distinct concepts in the questions. Long questions, in fact, tend to include distracting concepts that affect the abstraction step, increasing the probability of building spurious explanations. Nevertheless, the results highlight the positive impact of the Unification Score (US) on the robustness of the model, showing that the unification patterns contribute to a better accuracy for questions that are difficult to answer with plausibility and relevance scores alone.
Explanations for Science Questions. Explanatory inference for science questions typically requires multi-hop reasoning, i.e. the ability to aggregate multiple facts from heterogeneous knowledge sources to arrive at the correct answer. This process is extremely challenging when dealing with natural language, with both empirical (Fried et al., 2015) and theoretical work (Khashabi et al., 2019) suggesting an intrinsic limitation in the composition of inference chains longer than 2 hops. This phenomenon, known as semantic drift, often results in the construction of spurious inference chains leading to wrong conclusions. Recent approaches have framed explanatory inference as the problem of building an optimal graph, whose generation is conditioned on a set of local and global semantic constraints (Khashabi et al., 2018; Khot et al., 2017; Jansen et al., 2017; Khashabi et al., 2016). A parallel line of research tries to tackle the problem through the construction of explanation-centred corpora, which can facilitate the identification of common explanatory patterns (Valentino et al., 2020; Jansen et al., 2018; Jansen, 2017; Jansen et al., 2016). Our approach attempts to leverage the best of both worlds by imposing, on one hand, a set of structural and functional constraints that limit the inference process to two macro steps (abductive reasoning), and, on the other hand, by identifying common unification patterns in explanations for similar questions (analogical reasoning). The explanatory patterns generated by the unification process, largely discussed in philosophy of science (Friedman, 1974; Kitcher, 1981, 1989), have influenced the development of expert systems based on case-based reasoning (Thagard and Litt, 2008; Kolodner, 2014). Similarly to our approach, case-based reasoning adopts analogy as a core component to retrieve explanations for known cases and adapt them in the solution of unseen problems.
Explanations for Natural Language Reasoning. Recent work has highlighted issues related to the interpretability of deep learning models (Miller, 2019; Biran and Cotton, 2017), which, among other things, affects the design of proper benchmarks for assessing natural language reasoning capabilities (Schlegel et al., 2020). To deal with the lack of interpretability, an emerging line of research explores the design of datasets including gold explanations, which support the construction and evaluation of explainable models in different domains, ranging from open domain question answering (Yang et al., 2018; Thayaparan et al., 2019) to textual entailment (Camburu et al., 2018) and reasoning with mathematical text (Ferreira and Freitas, 2020a,b). Other approaches explore the construction of explanations through the use of distributional and similarity-based models applied on external commonsense knowledge bases (Silva et al., 2019, 2018; Freitas et al., 2014). In line with this work, we demonstrate that the use of unification patterns for multi-hop explanations can enhance both accuracy and explainability of neural models on a challenging question answering task (Rajani et al., 2019; Yadav et al., 2019b).
This paper presented the Step-wise Conceptual Unification, a multi-hop reasoning framework that leverages unification patterns through analogical and abductive reasoning. We empirically demonstrated the efficacy of the model for unsupervised question answering and explanation extraction, remarking the impact of unification power on sound explanatory inference.
References
Or Biran and Courtenay Cotton. 2017. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), volume 8.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pages 9539–9549.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Deborah Ferreira and André Freitas. 2020a. Natural language premise selection: Finding supporting statements for mathematical text. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2175–2182.

Deborah Ferreira and André Freitas. 2020b. Premise selection in natural language mathematical texts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7365–7374.

André Freitas, Joao Carlos Pereira da Silva, Edward Curry, and Paul Buitelaar. 2014. A distributional semantics approach for selective reasoning on commonsense graph knowledge bases. In International Conference on Applications of Natural Language to Data Bases/Information Systems, pages 21–32. Springer.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mihai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid answer reranking. Transactions of the Association for Computational Linguistics, 3:197–210.

Michael Friedman. 1974. Explanation and scientific understanding. The Journal of Philosophy, 71(1):5–19.

Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2956–2965.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Peter Clark. 2017. Framing QA as building and ranking intersentence answer justifications. Computational Linguistics, 43(2):407–449.

Peter Jansen and Dmitry Ustalov. 2019. TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 63–77.

Peter Jansen, Elizabeth Wainwright, Steven Marmorstein, and Clayton Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Peter A. Jansen. 2017. A study of automatically acquiring explanatory inference patterns from corpora of explanations: Lessons from elementary science exams.

Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2019. On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1145–1152.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. In Thirty-Second AAAI Conference on Artificial Intelligence.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2019. QASC: A dataset for question answering via sentence composition. arXiv preprint arXiv:1910.11473.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering.

Philip Kitcher. 1981. Explanatory unification. Philosophy of Science, 48(4):507–531.

Philip Kitcher. 1989. Explanatory unification and the causal structure of the world.

Janet Kolodner. 2014. Case-Based Reasoning. Morgan Kaufmann.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2737–2747.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.

Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2019. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 335–344.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942.

Viktor Schlegel, Marco Valentino, André Freitas, Goran Nenadic, and Riza Theresa Batista-Navarro. 2020. A framework for evaluation of machine reading comprehension gold standards. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5359–5369.

Vivian Dos Santos Silva, Siegfried Handschuh, and André Freitas. 2018. Recognizing and justifying text entailment through distributional navigation on definition graphs. In AAAI, pages 4913–4920.

Vivian S. Silva, André Freitas, and Siegfried Handschuh. 2019. Exploring knowledge graphs in an interpretable composite approach for text entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7023–7030.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2633–2643.

Paul Thagard and Abninder Litt. 2008. Models of scientific explanation. The Cambridge Handbook of Computational Psychology, pages 549–564.

Mokanarangan Thayaparan, Marco Valentino, Viktor Schlegel, and André Freitas. 2019. Identifying supporting facts for multi-hop question answering with document graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 42–51.

Marco Valentino, Mokanarangan Thayaparan, and André Freitas. 2020. Unification-based reconstruction of explanations for science questions. arXiv preprint arXiv:2004.00061.

Zhengnan Xie, Sebastian Thiem, Jaycie Martin, Elizabeth Wainwright, Steven Marmorstein, and Peter Jansen. 2020. WorldTree V2: A corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5456–5473.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019a. Alignment over heterogeneous embeddings for question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2681–2691.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019b. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2578–2589.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. arXiv preprint arXiv:1805.12393.