Reasoning-Driven Question-Answering for Natural Language Understanding
REASONING-DRIVEN QUESTION-ANSWERING FOR NATURAL LANGUAGE UNDERSTANDING

Daniel Khashabi

A DISSERTATION
in
Computer and Information Sciences

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy, 2019

Supervisor of Dissertation: Dan Roth, Professor of Computer and Information Science
Graduate Group Chairperson: Rajeev Alur, Professor of Computer and Information Science

Dissertation Committee:
Dan Roth, Professor, Computer and Information Science, University of Pennsylvania
Mitch Marcus, Professor of Computer and Information Science, University of Pennsylvania
Zachary Ives, Professor of Computer and Information Sciences, University of Pennsylvania
Chris Callison-Burch, Associate Professor of Computer Science, University of Pennsylvania
Ashish Sabharwal, Senior Research Scientist, Allen Institute for Artificial Intelligence
COPYRIGHT
2019, Daniel Khashabi

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/

Dedicated to the loving memory of my gramma, An'nah. Your patience and kindness will forever stay with me.
ACKNOWLEDGEMENT
I feel incredibly lucky to have Dan Roth as my advisor. I am grateful to Dan for trusting me, especially when I had only a basic understanding of many key challenges in natural language. It took me a while to catch up with what is important in the field and be able to communicate the challenges effectively. During these years, Dan's vision has always been the guiding principle behind many of my works. His insistence on focusing on long-term progress, rather than "easy" wins, shaped the foundation of many of the ideas I pursued. This perspective pushed me to think differently than the popular trends. It has been a genuine privilege to work together.

I want to thank my thesis committee at UPenn, Mitch Marcus, Zach Ives and Chris Callison-Burch, for being a constant source of invaluable feedback and guidance. Additionally, I would like to thank the many professors who have touched parts of my thinking: Jerry DeJong, for encouraging me to read the classic literature; Chandra Chekuri and Avrim Blum, for their emphasis on intuition, rather than details; and my undergraduate advisor Hamid Sheikhzadeh Nadjar, for encouraging me to work on important problems.

A huge thank you to the Allen Institute for Artificial Intelligence (AI2) for much support during my PhD studies. Any time I needed any resources (computing resources, crowdsourcing credits, engineering help, etc.), AI2 has provided, without any hesitation, what was needed. Special thanks to Ashish Sabharwal and Tushar Khot for being a constant source of wisdom and guidance, and for investing lots of time and effort. They both have always been present to listen to my random thoughts, almost on a weekly basis. I am grateful to other members of AI2 for their help throughout my projects: Oren Etzioni, Peter Clark, Oyvind Tafjord, Peter Turney, Ingmar Ellenberger, Dirk Groeneveld, Michael Schmitz, Chandra Bhagavatula and Scott Yih. Moreover, I would like to remember Paul Allen (1953-2018): his vision and constant generous support have tremendously changed our field (and my life, in particular).

My collaborators, especially past and present CogComp members, have been major contributors and influencers throughout my works. I would like to thank Mark Sammons, Vivek Srikumar, Christos Christodoulopoulos, Erfan Sadeqi Azer, Snigdha Chaturvedi, Kent Quanrud, Amirhossein Taghvaei, Chen-Tse Tsai, and many other CogComp members. Furthermore, I thank Eric Horn and Jennifer Sheffield for their tremendous contributions to many of my write-ups. And thank you to all the friends I have made at Penn, UIUC, and elsewhere, for all the happiness you've brought me. Thanks to Whitney, for sharing many happy and sad moments with me, and for helping me become a better version of myself.

Last, but never least, my family, for their unconditional sacrifice and support. I wouldn't have been able to go this far without you.
ABSTRACT

REASONING-DRIVEN QUESTION-ANSWERING FOR NATURAL LANGUAGE UNDERSTANDING
Daniel Khashabi
Dan Roth
Natural language understanding (NLU) of text is a fundamental challenge in AI, and it has received significant attention throughout the history of NLP research. This primary goal has been studied under different tasks, such as Question Answering (QA) and Textual Entailment (TE). In this thesis, we investigate the NLU problem through the QA task and focus on the aspects that make it a challenge for the current state-of-the-art technology. This thesis is organized into three main parts:

In the first part, we explore multiple formalisms to improve existing machine comprehension systems. We propose a formulation for abductive reasoning in natural language and show its effectiveness, especially in domains with limited training data. Additionally, to help reasoning systems cope with irrelevant or redundant information, we create a supervised approach to learn and detect the essential terms in questions.

In the second part, we propose two new challenge datasets. In particular, we create two datasets of natural language questions where (i) the first one requires reasoning over multiple sentences; (ii) the second one requires temporal common sense reasoning. We hope that the two proposed datasets will motivate the field to address more complex problems.

In the final part, we present the first formal framework for multi-step reasoning algorithms, in the presence of a few important properties of language use, such as incompleteness, ambiguity, etc. We apply this framework to prove fundamental limitations for reasoning algorithms. These theoretical results provide extra intuition into the existing empirical evidence in the field.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
LIST OF TABLES
LIST OF ILLUSTRATIONS
PUBLICATION NOTES

CHAPTER 1: Introduction
  1.1 Motivation
  1.2 Challenges along the way to NLU
  1.3 Measuring the progress towards NLU via Question Answering
  1.4 Thesis outline

CHAPTER 2: Background and Related Work
  2.1 Overview
  2.2 Terminology
  2.3 Measuring the progress towards NLU
    2.3.1 Measurement protocols
  2.4 Knowledge Representation and Abstraction for NLU
    2.4.1 Early Works: "Neats vs Scruffies"
    2.4.2 Connectionism
    2.4.3 Unsupervised representations
    2.4.4 Grounding of meanings
    2.4.5 Common sense and implied meanings
    2.4.6 Abstractions of the representations
  2.5 Reasoning/Decision-making Paradigms for NLU
    2.5.1 Early formalisms of reasoning
    2.5.2 Incorporating "uncertainty" in reasoning
    2.5.3 Macro-reading vs micro-reading
    2.5.4 Reasoning on "structured" representations
    2.5.5 Models utilizing massive annotated data
  2.6 Technical background and notation
    2.6.1 Complexity theory
    2.6.2 Probability Theory
    2.6.3 Graph theory
    2.6.4 Optimization Theory

I Reasoning-Driven System Design

CHAPTER 3: QA as Subgraph Optimization on Tabular Knowledge
  3.1 Overview
  3.2 Related Work
  3.3 QA as Subgraph Optimization
    3.3.1 Semi-Structured Knowledge as Tables
    3.3.2 QA as a Search for Desirable Support Graphs
    3.3.3 ILP Formulation
  3.4 Evaluation
    3.4.1 Solvers
    3.4.2 Results
    3.4.3 Ablation Study
    3.4.4 Question Perturbation
  3.5 Summary and Discussion

CHAPTER 4: QA as Subgraph Optimization over Semantic Abstractions
  4.1 Overview
    4.1.1 Related Work
  4.2 Knowledge Abstraction and Representation
    4.2.1 Semantic Abstractions
    4.2.2 Semantic Graph Generators
  4.3 QA as Reasoning Over Semantic Graphs
    4.3.1 ILP Formulation
  4.4 Empirical Evaluation
    4.4.1 Question Sets
    4.4.2 Question Answering Systems
    4.4.3 Experimental Results
    4.4.4 Error and Timing Analysis
    4.4.5 Ablation Study
  4.5 Summary and Discussion

CHAPTER 5: Learning Essential Terms in Questions
  5.1 Overview
    5.1.1 Related Work
  5.2 Essential Question Terms
    5.2.1 Crowd-Sourced Essentiality Dataset
    5.2.2 The Importance of Essential Terms
  5.3 Essential Terms Classifier
    5.3.1 Evaluation
  5.4 Using ET Classifier in QA Solvers
    5.4.1 IR solver + ET
    5.4.2 TableILP solver + ET
  5.5 Summary

II Moving the Peaks Higher: Designing More Challenging Datasets

CHAPTER 6: A Challenge Set for Reasoning on Multiple Sentences
  6.1 Overview
  6.2 Relevant Work
  6.3 Construction of MultiRC
    6.3.1 Principles of design
    6.3.2 Sources of documents
    6.3.3 Pipeline of question extraction
    6.3.4 Pilot experiments
    6.3.5 Verifying multi-sentenceness
    6.3.6 Statistics on the dataset
  6.4 Analysis
    6.4.1 Baselines
    6.4.2 Results
  6.5 Summary

CHAPTER 7: A Question Answering Benchmark for Temporal Common-sense
  7.1 Overview
  7.2 Related Work
  7.3 Construction of TacoQA
  7.4 Experiments
  7.5 Summary

III Formal Study of Reasoning in Natural Language

CHAPTER 8: Capabilities and Limitations of Reasoning in Natural Language
  8.1 Introduction
  8.2 Related Work
  8.3 Background and Notation
  8.4 The Meaning-Symbol Interface
  8.5 Connectivity Reasoning Algorithm
    8.5.1 Possibility of accurate connectivity
    8.5.2 Limits of connectivity algorithm
  8.6 Limits of General Algorithms
  8.7 Empirical Analysis
  8.8 Summary, Discussion and Practical Lessons

CHAPTER 9: Summary and Future Work
  9.1 Summary of Contributions
  9.2 Discussion and Future Directions

APPENDIX
  A.1 Supplementary Details for Chapter 3
  A.2 Supplementary Details for Chapter 8

BIBLIOGRAPHY
LIST OF TABLES

TABLE 1: Natural language questions about the story in Figure 1.
TABLE 2: Various answer representation paradigms in QA systems; examples selected from Khashabi et al. (2018a); Rajpurkar et al. (2016); Clark et al. (2016).
TABLE 3: Notation for the ILP formulation.
TABLE 4: Variables used for defining the optimization problem for the TableILP solver. All variables have domain {0, 1}.
TABLE 5: TableILP significantly outperforms both the prior MLN reasoner, and IR using identical knowledge as TableILP.
TABLE 6: Solver combination results.
TABLE 7: TableILP statistics averaged across questions.
TABLE 8: Ablation results for TableILP.
TABLE 9: Drop in solver scores (on the development set, rather than the hidden test set) when questions are perturbed.
TABLE 10: Minimum requirements for using each family of graphs. Each graph connected component (e.g. a PredArg frame, or a Coreference chain) cannot be used unless the above-mentioned condition is satisfied.
TABLE 11: The set of preference functions in the objective.
TABLE 12: The semantic annotator combinations used in our implementation of SemanticILP.
TABLE 13: Science test scores as a percentage. On elementary level science exams, SemanticILP consistently outperforms baselines. In each row, the best score is in bold and the best baseline is italicized.
TABLE 14: Biology test scores as a percentage. SemanticILP outperforms various baselines on the ProcessBank dataset and roughly matches the specialized best method.
TABLE 15: SemanticILP statistics averaged across questions, as compared to TableILP and TupleInf statistics.
TABLE 16: Ablation study of SemanticILP components on various datasets. The first row shows the overall test score of the full system, while other rows report the change in the score as a result of dropping an individual combination. The combinations are listed in Table 12.
TABLE 17: Comparison of test scores of SemanticILP using a generic ensemble vs. domain-targeted cascades of annotation combinations.
TABLE 18: Effectiveness of various methods for identifying essential question terms in the test set, including area under the PR curve (AUC), accuracy (Acc), precision (P), recall (R), and F1 score. ET classifier substantially outperforms all supervised and unsupervised (denoted with †) baselines.
TABLE 19: Generalization to unseen terms: Effectiveness of various methods, using the same metrics as in Table 18. As expected, supervised methods perform poorly, similar to a random baseline. Unsupervised methods generalize well, but the ET classifier again substantially outperforms them.
TABLE 20: Effectiveness of various methods for ranking the terms in a question by essentiality. † indicates unsupervised method. Mean-Average Precision (MAP) numbers reflect the mean (across all test set questions) of the average precision of the term ranking for each question. ET classifier again substantially outperforms all baselines.
TABLE 21: Performance of the IR solver without (Basic IR) and with (IR + ET) essential terms. The numbers are solver scores (%) on the test sets of the three datasets.
TABLE 22: Bounds used to select paragraphs for dataset creation.
TABLE 23: Various statistics of our dataset. Figures in parentheses represent standard deviation.
TABLE 24: Performance comparison for different baselines tested on a subset of our dataset (in percentage). There is a significant gap between the human performance and current statistical methods.
TABLE 25: Statistics of TacoQA.
TABLE 26: Summary of the performances for different baselines. All numbers are in percentages.
TABLE 27: The weights of the variables in our objective function. In each column, the weight of the variable is mentioned on its right side. The variables that are not mentioned here are set to have zero weight.
TABLE 28: Minimum thresholds used in creating pairwise variables.
TABLE 29: Some of the important constants and their values in our model.
TABLE 30: All the sets useful in definitions of the constraints in Table 31.
TABLE 31: The set of all constraints used in our ILP formulation. The variables are defined in Table 4. More intuition about constraints is included in Section 3. The sets used in the definition of the constraints are defined in Table 30.
LIST OF ILLUSTRATIONS

FIGURE 1: A sample story that appeared in the New York Times (taken from McCarthy (1976)).
FIGURE 2: Ambiguity (left) appears when mapping a raw string to its actual meaning; Variability (right) is having many ways of referring to the same meaning.
FIGURE 3: Visualization of two semantic tasks for the given story in Figure 1. Top figure shows verb semantic roles; bottom figure shows clusters of coreferred mentions. The visualizations use CogCompNLP (Khashabi et al., 2018c) and AllenNLP (Gardner et al., 2018).
FIGURE 4: An overview of the contributions and challenges addressed in each chapter of this thesis.
FIGURE 5: Major highlights of NLU in the past 50 years (within the AI community). For each work, its contribution-type is color-coded. To provide perspective about the role of the computational resources available at each period, we show the progress of CPU/GPU hardware over time.
FIGURE 6: A hypothetical manifold of all the NLU instances. Static datasets make it easy to evaluate our progress but since they usually give a biased estimate, they limit the scope of the challenge.
FIGURE 7: Example frames used in this work. Generic basic science frames (left), used in Chapter 3; event frames with values filled with the given sentence (right), used in Chapter 4.
FIGURE 8: Brief definitions for popular reasoning classes and their examples.
FIGURE 9: TableILP searches for the best support graph (chains of reasoning) connecting the question to an answer, in this case June. Constraints on the graph define what constitutes valid support and how to score it (Section 3.3.3).
FIGURE 10: Depiction of SemanticILP reasoning for the example paragraph given in the text. Semantic abstractions of the question, answers, knowledge snippet are shown in different colored boxes (blue, green, and yellow, resp.). Red nodes and edges are the elements that are aligned (used) for supporting the correct answer. There are many other unaligned (unused) annotations associated with each piece of text that are omitted for clarity.
FIGURE 11: Knowledge Representation used in our formulation. Raw text is associated with a collection of SemanticGraphs, which convey certain information about the text. There are implicit similarity edges among the nodes of the connected components of the graphs, and from nodes to the corresponding raw-text spans.
FIGURE 12: Overlap of the predictions of SemanticILP and IR on 50 randomly-chosen questions from AI2Public 4th.
FIGURE 13: Performance change for varying knowledge length.
FIGURE 14: Essentiality scores generated by our system, which assigns high essentiality to "drop" and "temperature".
FIGURE 15: Crowd-sourcing interface for annotating essential terms in a question, including the criteria for essentiality and sample annotations.
FIGURE 16: Crowd-sourcing interface for verifying the validity of essentiality annotations generated by the first task. Annotators are asked to answer, if possible, questions with a group of terms dropped.
FIGURE 17: The relationship between the fraction of question words dropped and the fraction of the questions attempted (fraction of the questions workers felt comfortable answering). Dropping most essential terms (blue lines) results in very few questions remaining answerable, while least essential terms (red lines) allows most questions to still be answerable. Solid lines indicate human annotation scores while dashed lines indicate predicted scores.
FIGURE 18: Precision-recall trade-off for various classifiers as the threshold is varied. ET classifier (green) is significantly better throughout.
FIGURE 19: Examples from our MultiRC corpus. Each example shows relevant excerpts from a paragraph; multi-sentence question that can be answered by combining information from multiple sentences of the paragraph; and corresponding answer-options. The correct answer(s) is indicated by a *. Note that there can be multiple correct answers per question.
FIGURE 20: Pipeline of our dataset construction.
FIGURE 21: Distribution of (left) general phenomena; (right) variations of the "coreference" phenomena.
FIGURE 22: Most frequent first chunks of the questions (counts in log scale).
FIGURE 23: PR curve for each of the baselines. There is a considerable gap between the baselines and human.
FIGURE 24: Five types of temporal commonsense in TacoQA. Note that a question may have multiple answers.
FIGURE 25: BERT + unit normalization performance per temporal reasoning category (top), performance gain over random baseline per category (bottom).
FIGURE 26: ... uttered in many ways into symbolic forms (bottom).
FIGURE 27: The meaning space contains [clean and unique] symbolic representation and the facts, while the symbol space contains [noisy, incomplete and variable] representation of the facts. We show sample meaning and symbol space nodes to answer the question: Is a metal spoon a good conductor of heat?
FIGURE 28: The construction considered in Definition 9. The node-pair m-m′ is connected with distance d in G_M, and disconnected in G′_M, after dropping the edges of a cut C. For each symbol graph, we consider its "local" Laplacian.
FIGURE 29: Various colors in the figure depict the average distance between node-pairs in the symbol graph, for each true meaning-graph distance d (x-axis), as the noise parameter p− (y-axis) is varied. The goal is to distinguish squares in the column for a particular d with the corresponding squares in the right-most column, which corresponds to node-pairs being disconnected. This is easy in the bottom-left regime and becomes progressively harder as we move upward (more noise) or rightward (higher meaning-graph distance). (ε+ = 0. , λ = 3)
FIGURE 30: Notation for the ILP formulation.
FIGURE 31: With varied values for p−, a heat map representation of the distribution of the average distances of node-pairs in symbol graph based on the distances of their corresponding meaning nodes is presented.

PUBLICATION NOTES
1. Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. Question answering via integer programming over semi-structured knowledge. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 2016. URL http://cogcomp.org/page/publication_view/786

2. Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. Learning what is essential in questions. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), 2017. URL http://cogcomp.org/page/publication_view/813

3. Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018a. URL http://cogcomp.org/page/publication_view/833

4. Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. Question answering as global reasoning over semantic abstractions. In Proceedings of the Fifteenth Conference on Artificial Intelligence (AAAI), 2018b. URL http://cogcomp.org/page/publication_view/824

5. Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashish Sabharwal, and Dan Roth. On the capabilities and limitations of reasoning for natural language understanding, 2019. URL https://arxiv.org/abs/1901.02522

6. Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. "Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
CHAPTER 1: Introduction

"To model this language understanding process in a computer, we need a program which combines grammar, semantics, and reasoning in an intimate way, concentrating on their interaction."
— T. Winograd, Understanding Natural Language, 1972
The purpose of Natural Language Understanding (NLU) is to enable systems to interpret a given text, as close as possible to the many ways humans would interpret it.

Improving NLU is increasingly changing the way humans interact with machines. The current NLU technology is already making significant impacts. For example, we can see it used by speech agents, including Alexa, Siri, and Google Assistant. In the near future, with better NLU systems, we will witness a more active presence of these systems in our daily lives: in social media interactions, in financial estimates, during the course of product recommendation, in accelerating scientific findings, etc.

The importance of NLU was understood by many pioneers in Artificial Intelligence (starting in the '60s and '70s). The initial excitement about the field ushered in a decade of activity in this area (McCarthy, 1963; Winograd, 1972; Schank, 1972; Woods, 1973; Zadeh, 1978). The beginning of these trends was overly positive at times, and it took years (if not decades) to comprehend and appreciate the real difficulty of language understanding.
We, humans, are so used to using language that it is almost impossible to see its complexity without a closer look into instances of this problem. As an example, consider the story shown in Figure 1, which appeared in an issue of the New York Times (taken from McCarthy (1976)). With relatively simple wording, this story is understandable to English speakers. Despite the simplicity, many nuances have to come together to form a coherent understanding of this story.

A 61-year-old furniture salesman was pushed down the shaft of a freight elevator yesterday in his downtown Brooklyn store by two robbers while a third attempted to crush him with the elevator car because they were dissatisfied with the $1,200 they had forced him to give them.
The buffer springs at the bottom of the shaft prevented the car from crushing the salesman, John J. Hug, after he was pushed from the first floor to the basement. The car stopped about 12 inches above him as he flattened himself at the bottom of the pit.
Mr. Hug was pinned in the shaft for about half an hour until his cries attracted the attention of a porter. The store at 340 Livingston Street is part of the Seamans Quality Furniture chain.
Mr. Hug was removed by members of the Police Emergency Squad and taken to Long Island College Hospital. He was badly shaken, but after being treated for scrapes of his left arm and for a spinal injury was released and went home. He lives at 62-01 69th Lane, Maspeth, Queens.
He has worked for seven years at the store, on the corner of Nevins Street, and this was the fourth time he had been held up in the store. The last time was about one year ago, when his right arm was slashed by a knife-wielding robber.

Figure 1: A sample story that appeared in the New York Times (taken from McCarthy (1976)).

We flesh out a few general factors which contribute to the complexity of language understanding in the context of the story given in Figure 1:

• Ambiguity comes along when trying to make sense of a given string. While an average human might be good at this, it is incredibly hard for machines to map symbols or characters to their actual meaning. For example, the mention of "car" that appears in our story has multiple meanings (see Figure 2; left). In particular, this mention in the story refers to a sense other than its usual meaning (here it refers to the elevator cabin; the usual meaning is a road vehicle).

Figure 2: Ambiguity (left) appears when mapping a raw string to its actual meaning; Variability (right) is having many ways of referring to the same meaning.

• Variability of language means that a single idea could be phrased in many different ways. For instance, the same character in the story, "Mr. Hug," has been referred to in different ways: "the salesman," "he," "him," "himself," etc. Beyond the lexical level, there is even more variability in bigger constructs of language, such as phrases, sentences, paragraphs, etc.

• Reading and understanding text involves an implicit formation of a mental structure with many elements. Some of these elements are directly described in the given story, but a significant portion of the understanding involves information that is implied based on a reader's background knowledge. Common sense refers to our (humans') understanding of everyday activities (e.g., sizes of objects, duration of events, etc.), usually shared among many individuals. Take the following sentence from the story:

The car stopped about 12 inches above him as he flattened himself at the bottom of the pit.

There is a significant amount of imagination hiding in this sentence; each person after reading this sentence has a mental picture of the incident. And based on this mental picture, we have implied meanings: we know he is lucky to be alive now; if he didn't flatten himself, he would have died; he had nowhere to go at the bottom of the pit; the car is significantly heavier than the man; etc. Such understanding is common and easy for humans and rarely gets direct mention in text, since it is considered trivial (for humans). Humans are able to form such implicit understanding as a result of our own world model and past shared experiences.

• Many small bits combine to make a big picture.
We understand that "downtown Brooklyn" is probably not a safe neighborhood, since "this was the fourth time he had been held up here." We also understand that despite all that happened to "Mr. Hug," he likely goes back to work after treatment because similar incidents have happened in the past. Machines don't really make these connections (for now!).

Challenges in NLU don't end here; there are many other aspects to language understanding that we skip here since they go beyond the scope of this thesis.

Question 1: Where did the robbers push Mr. Hug?
Answer 1: down the shaft of a freight elevator
Question 2: How old is Mr. Hug?
Answer 2: 61 years old
Question 3: On what street is Mr. Hug's store located?
Answer 3: 340 Livingston Street, on the corner of Nevins Street
Question 4: How far is his house from work?
Answer 4: About 30 minutes train ride
Question 5: How long did the whole robbery take?
Answer 5: Probably a few minutes
Question 6: Was he trapped in the elevator car, or under?
Answer 6: under
Question 7: Was Mr. Hug conscious after the robbers left?
Answer 7: Yes, he cried out and his cries were heard.
Question 8: How many floors does Mr. Hug's store have?
Answer 8: More than one, since he has an elevator

Table 1: Natural language questions about the story in Figure 1.
1.3. Measuring the progress towards NLU via Question Answering

To measure machines' ability to understand a given text, one can create numerous questions about the story. A system that better understands language should have a higher chance of answering these questions. This approach has been a popular way of measuring NLU since its early days (McCarthy, 1976; Winograd, 1972; Lehnert, 1977).

Table 1 shows examples of such questions. Consider
Question 1. The answer to this question is directly mentioned in the text, and the only thing that needs to be done is creating a representation to handle the variability of text: for instance, a representation of the meanings that are conveyed by verb predicates, since a major portion of meanings are centered around verbs. For example, to understand the various elements around the verb "push," one has to figure out who pushed, who was pushed, pushed where, etc. The subtask of semantic role labeling (Punyakanok et al., 2004) is dedicated to resolving such inferences (Figure 3; top). The output of this annotation indicates that the location pushed to is "the shaft of a freight elevator." In addition, the output of the coreference task (Carbonell and Brown, 1988; McCarthy, 1995) informs computers about such equivalences between the mentions of the main character of the story (namely, the equivalence between "Mr. Hug" and "A 61-year-old furniture salesman").

Figure 3: Visualization of two semantic tasks for the given story in Figure 1. The top figure shows verb semantic roles; the bottom figure shows clusters of coreferred mentions. The visualizations use CogCompNLP (Khashabi et al., 2018c) and AllenNLP (Gardner et al., 2018).
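To make the role of these annotations concrete, the following toy sketch (in Python) combines a hand-written predicate-argument frame with a hand-written coreference cluster to answer Question 1. The data structures and the helper function are illustrative assumptions only; they are not the interface of CogCompNLP, AllenNLP, or of any system described later in this thesis.

```python
# A hand-constructed predicate-argument frame for the verb "pushed" in the first
# sentence of the story, in the spirit of the SRL output shown in Figure 3 (top).
srl_frames = [
    {
        "predicate": "pushed",
        "arguments": {
            "A0": "two robbers",                               # who pushed
            "A1": "a 61-year-old furniture salesman",          # who was pushed
            "AM-DIR": "down the shaft of a freight elevator",  # pushed where
            "AM-LOC": "in his downtown Brooklyn store",
        },
    }
]

# A coreference cluster linking different mentions of the same character,
# in the spirit of Figure 3 (bottom).
coref_cluster = {
    "a 61-year-old furniture salesman", "Mr. Hug", "the salesman", "he", "him",
}


def answer_where_pushed(question_entity: str) -> str:
    """Answer 'Where did the robbers push Mr. Hug?' by (1) resolving the question
    entity to the frame's patient via coreference and (2) reading off the
    directional argument of the 'pushed' frame."""
    for frame in srl_frames:
        if frame["predicate"] != "pushed":
            continue
        patient = frame["arguments"].get("A1", "")
        if question_entity in coref_cluster and patient in coref_cluster:
            return frame["arguments"].get("AM-DIR", "unknown")
    return "unknown"


print(answer_where_pushed("Mr. Hug"))
# -> "down the shaft of a freight elevator"
```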
Similarly, the answers to Questions 2 and 3 are directly included in the paragraph, although they both require some intermediate processing like the coreference task. The system we introduce in Chapter 4 uses such representations (coreference, semantic roles, etc.) and in principle should be able to answer such questions. The dataset introduced in Chapter 6 also motivates addressing questions that require chaining information from multiple pieces of text. In a similar vein, Chapter 8 takes a theoretical perspective on the limits of chaining information.

The rest of the questions in Table 1 are harder for machines, as they require information beyond what is directly mentioned in the paragraph. For example,
Question 4 requires knowledge of the distance between "Queens" and "Brooklyn," which can be looked up on the internet. Similarly, Question 5 requires information beyond text; however, it is unlikely to be looked up easily on the web. Understanding that "the robbery" took only a few minutes (and not hours or days) is part of our common sense understanding. The dataset that we introduce in Chapter 7 motivates addressing such understanding (temporal common sense).
Questions 6 and 7 require different forms of common sense understanding, beyond the scope of this thesis.

In this thesis we focus on the task of Question Answering (QA), aiming to progress towards NLU. And for this goal, we study various representations and reasoning algorithms. In summary, this thesis is centered around the following statement:

Thesis Statement. Progress in automated question answering could be facilitated by incorporating the ability to reason over natural language abstractions and world knowledge. More challenging, yet realistic, QA datasets pose problems to current technologies; hence, more opportunities for improvement.
1.4. Thesis outline

In the thesis we use QA as a medium to tackle a few important challenges in the context of NLU. We start with an in-depth review of past work and its connections to our work in Chapter 2. The main content of the thesis is organized as follows (see also Figure 4):

• Part 1: Reasoning-Driven QA System Design
  – Chapter 3 discusses TableILP, a model for abductive reasoning over natural language questions, with internal knowledge available in tabular representation.
  – Chapter 4 presents SemanticILP, an extension of the system in the previous chapter to function on raw-text knowledge.
  – Chapter 5 studies the notion of essential question terms with the goal of making QA solvers more robust to distractions and irrelevant information.

• Part 2: Moving the Peaks Higher: More Challenging QA Datasets
  – Chapter 6 presents MultiRC, a reading comprehension challenge which requires combining information from multiple sentences.
  – Chapter 7 presents TacoQA, a reading comprehension challenge which requires the ability to resolve temporal common sense.

• Part 3: Formal Study of Reasoning in Natural Language
  – Chapter 8 presents a formalism, in an effort to provide theoretical grounds to the existing intuitions on the limits and possibilities in reasoning, in the context of natural language.

Figure 4: An overview of the contributions and challenges addressed in each chapter of this thesis. (The figure tabulates, for Chapters 3 through 8, the contribution type (system design, dataset, or theory) and the challenges addressed: ambiguity/grounding, variability, combining information, and common-sense understanding.)
CHAPTER 2: Background and Related Work

"Whoever wishes to foresee the future must consult the past."
— Niccolò Machiavelli, 1469-1527
In this chapter, we review the related literature that addresses different aspects of natural language understanding.
Before anything else, we define the terminology (Section 2.2). We divide the discussion into multiple interrelated axes: Section 2.3 discusses various evaluation protocols and datasets introduced in the field. We then provide an overview of the field from the perspective of knowledge representation and abstraction in Section 2.4. Building on the discussion of representation, we provide a survey of reasoning algorithms in Section 2.5. We end the chapter with a short section on the technical background necessary for the forthcoming chapters (Section 2.6).

To put everything into perspective, we show a summary of the highlights of the field in Figure 5. Each highlight is color-coded to indicate its contribution type. In the following sections, we go over a select few of these works and explain the evolution of the field, especially those directly related to the focus of this work.
Figure 5: Major highlights of NLU in the past 50 years (within the AI community). For each work, its contribution-type is color-coded. To provide perspective about the role of the computational resources available at each period, we show the progress of CPU/GPU hardware over time.

2.2. Terminology

Before starting our main conversation, we define the terminology we will be using throughout this document.

• Propositions are judgments or opinions which can be true or false. A proposition is not necessarily a sentence, although a sentence can express a proposition (e.g., "cats cannot fly").

• A concept is either a physical entity (like a tree, bicycle, etc.) or an abstract idea (like happiness, thought, betrayal, etc.).

• A belief is an expression of faith and/or trust in the truthfulness of a proposition. We also use confidence or likelihood to refer to the same notion.

• Knowledge is information, facts, or understanding acquired through experience or education. The discussion on the philosophical nature of knowledge and its various forms is studied under epistemology (Steup, 2014).

• Representation is a medium through which knowledge is provided to a system. For example, the number 5 could be represented as the string "5", as bits, or as the Roman numeral "V", etc.

• Abstraction defines the level of granularity in a representation. For example, the mentions "New York City", "Buenos Aires", "Maragheh" could all be abstracted as city.

• Knowledge acquisition is the process of identifying and acquiring the relevant knowledge, according to the representation.

• Reasoning is the process of drawing a conclusion based on the given information. We sometimes refer to this process as decision-making or inference.

2.3. Measuring the progress towards NLU

Evaluation protocols are critical in incentivizing the field to solve the right problems. One of the earliest proposals is due to Alan Turing: if you had a pen-pal for years, you would not know whether you're corresponding with a human or a machine (Turing, 1950; Harnad, 1992). A major limitation of this test (and many of its extensions) is that it is "expensive" to compute (Hernandez-Orallo, 2000; French, 2000).

The protocol we are focusing on in this work is through answering natural language questions; if an actor (human or computer) understands a given text, it should be able to answer any questions about it. Throughout this thesis, we will refer to this protocol as
Question Answering (QA). This has been used in the field for many years (McCarthy, 1976; Winograd, 1972; Lehnert, 1977). There are a few other terms popularized in the community to refer to the same task we are solving here. The phrase
Reading Comprehension is borrowed from standardized tests (SAT, TOEFL, etc.), and usually refers to the scenario where a paragraph is attached to the given question. Another similar phrase is
Machine Comprehension. Throughout this thesis, we use these phrases interchangeably to refer to the same task.

To make it more formal, for an assumed scenario described by a paragraph P, a system f equipped with NLU should be able to answer any question Q about the given paragraph P. One can measure the expected performance of the system on a set of questions D, via some distance measure d(·, ·) between the predicted answers f(Q; P) and the correct answers f*(Q; P) (usually a prediction agreed upon by multiple humans):

R(f; D) = E_{(Q,P)∼D} [ d( f(Q; P), f*(Q; P) ) ]
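As a concrete (and purely illustrative) instance of this measure, the snippet below estimates R(f; D) on a two-question toy set using a 0/1 distance d, as one would for multiple-choice or exact-match evaluation. The miniature dataset and the trivial baseline solver are invented for this sketch and do not correspond to any dataset or system discussed in this thesis.

```python
from typing import Callable, List, Tuple

# A toy question set D: (paragraph, question, gold answer) triples, made up for illustration.
D: List[Tuple[str, str, str]] = [
    ("Mr. Hug works at a furniture store.", "Where does Mr. Hug work?", "a furniture store"),
    ("The robbery took only a few minutes.", "How long did the robbery take?", "a few minutes"),
]

def zero_one_distance(predicted: str, gold: str) -> float:
    """d(., .) for exact-match settings: 0 if the prediction matches the gold answer, 1 otherwise."""
    return 0.0 if predicted.strip().lower() == gold.strip().lower() else 1.0

def risk(f: Callable[[str, str], str],
         dataset: List[Tuple[str, str, str]],
         d: Callable[[str, str], float]) -> float:
    """Empirical estimate of R(f; D) = E_{(Q,P)~D}[ d(f(Q;P), f*(Q;P)) ]; lower is better."""
    return sum(d(f(q, p), gold) for p, q, gold in dataset) / len(dataset)

# A trivial "solver" that simply copies the last few words of the paragraph.
def last_words_solver(question: str, paragraph: str) -> str:
    return " ".join(paragraph.rstrip(".").split()[-3:])

print(risk(last_words_solver, D, zero_one_distance))
```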
A critical question here is the choice of the question set D, so that R(f; D) is an effective measure of f's progress towards NLU. Denote the set of all possible English questions as D_u. This is an enormous set and, in practice, it is unlikely that we could write them all in one place. Instead, it might be more practical to sample from this set. In practice, this sampling is replaced with static datasets. This introduces a problem: datasets are hardly a uniform subset of D_u; instead, they are heavily skewed towards more simplicity.

Figure 6 depicts a hypothetical high-dimensional manifold of all the natural language questions in terms of an arbitrary representation (bytes, characters, etc.). Unfortunately, datasets are usually biased samples of the universal set D_u, and they are often biased towards simplicity. This issue makes dataset design of extra importance, since performance results on a single set might not be a true representative of our progress. Two chapters of this work are dedicated to the construction of QA datasets.

Figure 6: A hypothetical manifold of all the NLU instances. Static datasets make it easy to evaluate our progress but since they usually give a biased estimate, they limit the scope of the challenge.

There are a few flavors of QA in terms of their answer representations (see Table 2): (i) questions with multiple candidate answers, a subset of which are correct; (ii) extractive questions, where the correct answer is a substring of a given paragraph; (iii) direct-answer questions, where a hypothetical system has to generate a string as the answer. The choice of answer representation has direct consequences for the representational richness of the dataset and the ease of evaluation. The first two settings (multiple-choice and extractive questions) are easy to evaluate but restrict the richness of the dataset. Direct-answer questions can result in richer datasets but are more expensive to evaluate.

Datasets make it possible to automate the evaluation of the progress towards NLU and to compare systems to each other on fixed problem sets. One of the earliest NLU datasets published in the field is the Remedia dataset (Hirschman et al., 1999), which contains short stories written in simple language for kids, provided by Remedia Publications. Each story has 5 types of questions (who, when, why, where, what). Since then, there have been many suggestions as to what kind of question-answering dataset is a better test of NLU. Brachman et al. (2005) suggests SAT exams as a challenge for AI. Davis (2014) proposes multiple-choice challenge sets that are easy for children but difficult for computers. In a similar spirit, Clark and Etzioni (2016) advocate elementary-school science tests. Many science questions have answers that are not explicitly stated in text and instead require combining information together. In Chapters 3 and 4 we use elementary-school science tests as our target challenge.

Multiple-choice:
Paragraph: Dirk Diggler was born as Steven Samuel Adams on April 15, 1961 outside of Saint Paul, Minnesota. His parents were a construction worker and a boutique shop owner who attended church every Sunday and believed in God. Looking for a career as a male model, Diggler dropped out of school at age 16 and left home. He was discovered at a falafel stand by Jack Horner. Diggler met his friend, Reed Rothchild, through Horner in 1979 while working on a film.
Question: How old was Dirk when he met his friend Reed?
Answers: *(A) 18 (B) 16 (C) 22

Extractive:
Paragraph: The city developed around the Roman settlement Pons Aelius and was named after the castle built in 1080 by Robert Curthose, William the Conqueror's eldest son. The city grew as an important centre for the wool trade in the 14th century, and later became a major coal mining area. The port developed in the 16th century and, along with the shipyards lower down the River Tyne, was amongst the world's largest shipbuilding and ship-repairing centres.
Question: Who built a castle in Newcastle in 1080?
Answers: "Robert Curthose"

Direct-answer:
Question: Some birds fly south in the fall. This seasonal adaptation is known as migration. Explain why these birds migrate.
Answers: "A(n) bird can migrate, which helps cope with lack of food resources in harsh cold conditions by getting it to a warmer habitat with more food resources."

Table 2: Various answer representation paradigms in QA systems; examples selected from Khashabi et al. (2018a); Rajpurkar et al. (2016); Clark et al. (2016).

While the field has produced many datasets in the past few years, many of these datasets are either too restricted in terms of their linguistic richness or they contain annotation biases (Gururangan et al., 2018; Poliak et al., 2018). For many of these datasets, it has been pointed out that many of the high-performing models neither need to 'comprehend' in order to correctly predict an answer, nor learn to 'reason' in a way that generalizes across datasets (Chen et al., 2016; Jia and Liang, 2017; Kaushik and Lipton, 2018). In Section 3.4.4 we show that adversarially-selected candidate answers result in a significant drop in the performance of a few state-of-the-art science QA systems. To address these weaknesses, in Chapters 6 and 7 we propose two new challenge datasets which, we believe, pose better challenges for systems.

A closely related task is the task of Recognizing Textual Entailment (RTE) (Khashabi et al., 2018c; Dagan et al., 2013), as QA can be cast as entailment (Does P entail Q + A? (Bentivogli et al., 2008)). While we do not directly address this task, in some cases we use it as a component within our proposed QA systems (in Chapters 3 and 4).

2.4. Knowledge Representation and Abstraction for NLU

The discussion of knowledge representation has been with AI since its beginning, and it is central to the progress of language understanding. Since directly dealing with the raw input/output complicates the reasoning stage, historically researchers have preferred to devise a middleman between the raw information and the reasoning engine. Therefore, the need for an intermediate level seems to be essential. In addition, in many problems, there is a significant amount of knowledge that is not mentioned directly, but rather implied from the context. Somehow the extra information has to be provided to the reasoning system. As a result, the discussion goes beyond just creating a formalism for information, and also includes issues like how to acquire, encode and access it. The issue of representations applies to both input-level information and the internal knowledge of a reasoning system. We refer to some of the relevant debates in the forthcoming sections.

2.4.1. Early Works: "Neats vs Scruffies"

(These terms were originally coined by Roger Schank to characterize two different camps: one that represented commonsense knowledge in the form of large amorphous semantic networks, as opposed to another camp whose work was based on logic and formal extensions of logic.)

An early trend emerged as the family of symbolic and logical representations, such as propositional and first-order logic (McCarthy, 1963). This approach has deep roots in philosophy and mathematical logic, where the theories have evolved since Aristotle's time. Logic provided a general-purpose, clean and uniform language, both in terms of representations and reasoning. A second trend grew out of linguistics and psychology. This trend was less concerned with mathematical rigor, but more concerned with richer psychological and linguistic motivations.
For example, semantic networks (Quillan, 1966), a network of concepts and links, were based on the idea that memory consists of associations between mental entities. In Chapter 8 we study a formalism for reasoning with such graph-like representations. Scripts and plans are representational tools to model frequently encountered activities, e.g., going to a restaurant (Schank and Abelson, 1975; Lehnert, 1977). Minsky and Fillmore, separately and in parallel, advocated frame-based representations (Minsky, 1974; Fillmore, 1977). In the following decades, these approaches evolved into fine-grained representations and hybrid systems for specific problems. One of the first NLU programs was the STUDENT program of Bobrow (1964), written in LISP (McCarthy and Levin, 1965), which could read and solve high school algebra problems expressed in natural language.

Intuitively, a frame induces a grouping of concepts and creates abstract hierarchies among them. For example, "Monday", "Tuesday", ... are distinct concepts, but all members of the same conceptual frame. A frame consists of a group of slots and fillers to define a stereotypical object or activity. A slot can contain values such as rules, facts, images, video, procedures or even another frame (Fikes and Kehler, 1985). Frames can be organized hierarchically, where default values can be inherited directly from parent frames. This is part of our underlying representation in Chapter 3, where the reasoning is done over tables of information (an example in Figure 7, left). Decades after its proposal, the frame-based approach resulted in resources like FrameNet (Baker et al., 1998) and tasks like Semantic Role Labeling (Gildea and Jurafsky, 2002; Palmer et al., 2005; Punyakanok et al., 2004). This forms the basis for some of the key representations we use in Chapter 4 (see Figure 7, right).
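The following minimal sketch illustrates the slot-filler view described above: a frame stores named slots and inherits default values from its parent frame. The class and slot names are our own illustrative choices, not those of any particular frame system.

```python
class Frame:
    """A minimal frame: named slots with fillers, plus inheritance of defaults from a parent frame."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        # Look up the slot locally; otherwise inherit the default from the parent frame.
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot)
        return None


# A stereotypical "weekday" frame and two child frames that inherit its defaults.
weekday = Frame("weekday", part_of="week", typical_activity="work")
monday = Frame("Monday", parent=weekday, position=1)
tuesday = Frame("Tuesday", parent=weekday, position=2)

print(monday.get("typical_activity"))  # "work" (inherited default)
print(tuesday.get("position"))         # 2 (filled locally)
```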
2.4.2. Connectionism

There is another important trend inspired by the apparent brain function emergent from interconnected networks of neural units (Rosenblatt, 1958). It lost many of its fans after Minsky and Papert (1969) showed representational limitations of shallow networks in approximating a few functions. However, a series of events reinvigorated this thread: notably, Rumelhart et al. (1988) found a formalized way to train networks with more than one layer (nowadays known as the Back-Propagation algorithm). This work emphasized the parallel and distributed nature of information processing and gave rise to the "connectionism" movement. Around the same time, Funahashi (1989) showed the universal approximation property for feed-forward networks (any continuous function on the real numbers can be uniformly approximated by neural networks). Over the past decade, this school has enjoyed newfound excitement by effectively exploiting the parallel processing power of GPUs and harnessing large datasets to show progress on certain datasets.
2.4.3. Unsupervised representations

Unsupervised representations are one of the areas that have shown tangible impacts across the board. A pioneering work is Brown et al. (1992), which creates binary term representations based on co-occurrence information. Over the years, a wide variety of such representations emerged: using Wikipedia concepts (Gabrilovich and Markovitch, 2007), word co-occurrences (Turney and Pantel, 2010), co-occurrence factorization (Mikolov et al., 2013; Levy and Goldberg, 2014; Pennington et al., 2014; Li et al., 2015), and context-sensitive representations (Peters et al., 2018; Devlin et al., 2018). In particular, the latter two are inspired by the connectionist frameworks of the '80s and have been shown to be effective across a wide range of NLU tasks. In this thesis we use unsupervised representations in various ways: in Chapters 3 and 4 we use such representations for phrasal semantic equivalence within reasoning modules; in Chapter 5 we use them as features of our supervised system; and in Chapters 6 and 7 we create NLU systems based on such representations in order to create baselines for the datasets we introduce.

A more recent highlight along this path is the emergence of new unsupervised representations that have been shown to capture many interesting associations in freely available data (Peters et al., 2018; Devlin et al., 2018).
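As a minimal sketch of the count-based view behind such representations (in the spirit of Turney and Pantel (2010) and the factorization perspective of Levy and Goldberg (2014)), the snippet below builds a word-word co-occurrence matrix from a tiny invented corpus and factorizes it with an SVD. Real pipelines add PPMI weighting, much larger corpora, and many other refinements; everything here is simplified for illustration.

```python
import numpy as np
from collections import Counter
from itertools import combinations

# A tiny invented corpus; real co-occurrence statistics come from very large corpora.
corpus = [
    "the salesman was pushed down the elevator shaft",
    "the robbers pushed the salesman into the shaft",
    "the elevator car stopped above the salesman",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Build a symmetric word-word co-occurrence matrix (whole sentence as the context window).
C = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for w1, w2 in combinations(sentence, 2):
        if w1 != w2:
            C[index[w1], index[w2]] += 1
            C[index[w2], index[w1]] += 1

# Factorize the count matrix with a truncated SVD to obtain dense word vectors.
U, S, _ = np.linalg.svd(C, full_matrices=False)
k = 3
vectors = U[:, :k] * S[:k]

def similarity(w1: str, w2: str) -> float:
    v1, v2 = vectors[index[w1]], vectors[index[w2]]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9))

print(similarity("salesman", "robbers"))
print(similarity("salesman", "elevator"))
```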
An issue tightly related to abstraction is grounding natural language surface information to its actual meaning (Harnad, 1990), as discussed in Section 1.2. Practitioners often address this challenge by enriching their representations, for example by mapping textual information to Wikipedia entries (Mihalcea and Csomai, 2007). In Chapter 4 we use the disambiguation of semantic actions and their roles (Punyakanok et al., 2004; Dang and Palmer, 2005). Chapter 8 of this thesis provides a formalism that incorporates elements of the symbol-grounding problem and sheds theoretical light on existing empirical intuitions.
A major portion of our language understanding is only implied in language and not explicitly mentioned (examples in Section 1.2). The difficulty of this challenge has historically been under-estimated. Early AI, during the sixties and onward, saw a great deal of interest in modeling common sense knowledge. McCarthy, one of the founders of AI, believed in formal logic as a solution to common sense reasoning (McCarthy and Lifschitz, 1990). Minsky (1988) estimated that "... commonsense is knowing maybe 30 or 60 million things about the world and having them represented so that when something happens, you can make analogies with others". There have been decades-long efforts to create knowledge bases of common sense information, such as Cyc (Lenat, 1995) and ConceptNet (Liu and Singh, 2004), but none of these have yielded any major impact so far. A roadblock in progress towards such goals is the lack of natural end-tasks that can provide an objective measure of progress in the field. To facilitate research in this direction, in Chapter 6 we provide a new natural language QA dataset on which performing well requires significant progress on multiple temporal common sense tasks.

Abstraction of information is one of the key issues in any effort towards an effective representation. Coarser abstraction could result in better generalization; however, too much abstraction could result in losing potentially useful details. In general, there is a trade-off between the expressiveness of the representation and the reasoning complexity. We deal with this issue in multiple ways: (i) we use unsupervised representations that have been shown to indirectly capture abstractions (Mahabal et al., 2018); (ii) we use systems pre-trained with annotations that abstract over raw text; for example, in Chapter 4 we use semantic role representations of sentences, which abstract over low-level words and map the arguments into their high-level thematic roles.

For a given problem instance, how does a system internally choose the right level of abstraction? The human attention structure is extremely good at abstracting concepts (Johnson and Proctor, 2004; Janzen and Vicente, 1997), although automating this is an open question. One way of dealing with such issues is to use multiple levels of abstraction and let the reasoning algorithm use the right level of abstraction when available (Rasmussen, 1985; Bisantz and Vicente, 1994). In Chapter 4, we take a similar approach by using a collection of different abstractions.

2.5. Reasoning/Decision-making Paradigms for NLU
The idea of automated reasoning predates AI itself and can be traced to ancient Greece. Aristotle's syllogisms paved the way for the formalism of deductive reasoning. The tradition continued with philosophers like Al-Kindi, Al-Farabi, and Avicenna (Davidson, 1992), before culminating in modern mathematics and logic. Within AI research, McCarthy (1963) pioneered the use of logic for automating reasoning for language problems, which over time branched into other classes of reasoning (Holland et al., 1989; Evans et al., 1993).

A form of reasoning closely related to what we study here is abduction (Peirce, 1883; Hobbs et al., 1993), the process of finding the best minimal explanation for a set of observations (see Figure 8). Unlike in deductive reasoning, in abductive reasoning the premises do not guarantee the conclusion. Informally speaking, abduction is inferring cause from effect (the reverse direction of deductive reasoning). The two reasoning systems in Chapters 3 and 4 can be interpreted as abductive systems.

Figure 8: Brief definitions for popular reasoning classes and their examples.

We define some notation to make the exposition slightly more formal. Let ⊢ denote entailment and ⊥ denote contradiction. Formally, (logical) abductive reasoning is defined as follows: given background knowledge B and observations O, find a hypothesis H such that B ∪ H ⊬ ⊥ (consistency with the given background) and B ∪ H ⊢ O (explaining the observations).

In practical settings, this purely logical definition has many limitations: (a) There could be multiple hypotheses H that explain a particular set of observations given the background knowledge; the best hypothesis has to be selected based on some measure of goodness and the simplicity of the hypothesis (Occam's Razor). (b) Real life has many uncertain elements, i.e., there are degrees of certainty (rather than binary assignments) associated with observations and background knowledge; hence the decisions of consistency and explainability have to be made with respect to this fuzzy measure. (c) The inference problem in its general form is computationally intractable; often assumptions have to be made to keep inference tractable (e.g., restricting the representation to Horn clauses).

Over the years, a wide variety of soft alternatives have emerged for reasoning algorithms, incorporating uncertainty into symbolic models. This resulted in theories like fuzzy logic (Zadeh, 1975), probabilistic Bayesian networks (Pearl, 1988; Dechter, 2013), and soft abduction (Hobbs et al., 1988; Selman and Levesque, 1990; Poole, 1990). In Bayesian networks, the (uncertain) background knowledge is encoded in a graphical structure and, upon receiving observations, the probabilistic explanation is derived by maximizing a posterior probability distribution. These models are essentially based on propositional logic and cannot handle quantifiers (Kate and Mooney, 2009). Weighted abduction combines weights of relevance/plausibility with first-order logic rules (Hobbs et al., 1988). However, unlike probability-theoretic frameworks, its weighting scheme does not have any solid theoretical basis and does not lend itself to a complete probabilistic analysis. Our framework in Chapters 3 and 4 is also a way to perform abductive reasoning under uncertainty. Our proposal differs from the previous models in a few ways: (i) unlike Bayesian networks, our framework is not limited to propositional rules; in fact, there are first-order relations used in the design of
TableILP (more details in Chapter 3). (ii) Unlike many other previous works, we do not make representational assumptions to simplify inference (such as limiting to Horn clauses, or making certain independence assumptions). In fact, the inference might be NP-hard, but with the existence of industrial ILP solvers this is not an issue in practice. Our work is inspired by a prior line of work on inference over structured representations to reason on (and with) language; see Chang et al. (2008, 2010, 2012), among others.
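As an informal illustration of the logical definition of abduction above, here is a minimal sketch over propositional facts and simple implication rules (the rule format and the preference for the smallest hypothesis are simplifications for exposition):

```python
from itertools import combinations

def closure(facts, rules):
    """Forward-chain simple rules of the form (premises, conclusion) to a fixed point."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

def abduce(background_facts, rules, observations, candidate_facts):
    """Find the smallest hypothesis H (a subset of the candidates) such that
    B ∪ H is consistent (does not derive 'FALSE') and B ∪ H entails the observations."""
    for size in range(len(candidate_facts) + 1):
        for hypothesis in combinations(candidate_facts, size):
            derived = closure(set(background_facts) | set(hypothesis), rules)
            if "FALSE" not in derived and set(observations) <= derived:
                return set(hypothesis)
    return None

# Background: wet grass is caused by rain or by sprinklers; rain and a clear sky contradict.
rules = [
    (("rained",), "grass_wet"),
    (("sprinkler_on",), "grass_wet"),
    (("rained", "sky_clear"), "FALSE"),
]
background = {"sky_clear"}
observations = {"grass_wet"}
print(abduce(background, rules, observations, ["rained", "sprinkler_on"]))
# {'sprinkler_on'}  -- the smallest hypothesis consistent with the background
```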
With the increased availability of information (especially through the internet), macro-reading systems have emerged with the aim of leveraging a large variety of resources and exploiting the redundancy of information (Mitchell et al., 2009). Even if a system does not understand one text, there might be many other texts that convey a similar meaning. Such systems derive significant leverage from relatively shallow statistical methods with surprisingly strong performance (Clark et al., 2016). Today's Internet search engines, for instance, can successfully retrieve factoid-style answers to many natural language queries by efficiently searching the Web. Information Retrieval (IR) systems work under the assumption that answers to many questions of interest are often explicitly stated somewhere (Kwok et al., 2001), and all one needs, in principle, is access to a sufficiently large corpus. Similarly, statistical correlation-based methods, such as those using Pointwise Mutual Information or PMI (Church and Hanks, 1989), work under the assumption that many questions can be answered by looking for words that tend to co-occur with the question words in a large corpus. While both of these approaches help identify correct answers, they are not suitable for questions requiring language understanding and reasoning, such as chaining together multiple facts in order to arrive at a conclusion. On the other hand, micro-reading aims at understanding
each piece of evidence given to the system, without reliance on redundancy. The focus of this thesis is micro-reading, as it directly addresses NLU; that being said, whenever possible, we use macro-reading systems as our baselines.
With increasing knowledge resources and diversity of the available knowledge representations, numerous QA systems have been developed to operate over large-scale explicit knowledge representations. These approaches perform reasoning over structured (discrete) abstractions. For instance, Chang et al. (2010) address RTE (and other tasks) via inference over structured representations, Banarescu et al. (2013) use AMR annotators (Wang et al., 2015), Unger et al. (2012) use RDF knowledge (Yang et al., 2017), Zettlemoyer and Collins (2005); Clarke et al. (2010); Goldwasser and Roth (2014); Krishnamurthy et al. (2016) use semantic parsers to answer a given question, and Do et al. (2011, 2012) employ constrained inference for temporal/causal reasoning. The framework we study in Chapter 3 is a reasoning algorithm operating over tabular knowledge (frames) of basic science concepts.

An important limitation of IR-based systems is their inability to connect distant pieces of information. Many realistic domains (such as science questions or biology articles) have answers that are not explicitly stated in text, and instead require combining facts. Khot et al. (2017) create an inference system capable of combining Open IE tuples (Banko et al., 2007). Jansen et al. (2017) propose reasoning by aggregating sentential information from multiple knowledge bases. Socher et al. (2013); McCallum et al. (2017) propose frameworks for chaining relations to infer new (unseen) relations. Our work in Chapter 3 chains information over multiple tables. The reasoning framework in Chapter 4 investigates reasoning over multiple pieces of raw text. The QA dataset we propose in Chapter 5 also encourages the use of information from different segments of the story. Chapter 8 proposes a formalism to study the limits of chaining long-range information.

2.5.5. Models utilizing massive annotated data
A highlight of the past two decades is the advent of statistical techniques into NLP (Hirschman et al., 1999). Since then, a wide variety of supervised-learning algorithms have shown strong performance on different datasets. The increasingly large amount of data available for recent benchmarks makes it possible to train neural models (see "Connectionism"; Section 2.4.2) (Seo et al., 2016; Parikh et al., 2016; Wang et al., 2018; Liu et al., 2018; Hu et al., 2018). Moreover, an additional technical shift was the use of distributional representations of words (word vectors or embeddings) extracted from large-scale text corpora (Mikolov et al., 2013; Pennington et al., 2014) (see Section 2.4.3). Despite the decades-long excitement about supervised-learning algorithms, the main progress, especially in the past few years, has mostly been due to the re-emergence of unsupervised representations (Peters et al., 2018; Devlin et al., 2018), unsupervised in the sense that they are constructed from freely available data, as opposed to task-specific annotated data.

2.6. Mathematical Background

In this section, we provide the relevant mathematical background used throughout this thesis. We cover three main areas used widely across this document.
We follow the standard notation for asymptotic comparison of functions: O(.), o(.), Θ(.), Ω(.), and ω(.) (Cormen et al., 2009). We use P and NP to refer to the basic complexity classes, which we briefly review. P consists of all problems that can be solved efficiently (in polynomial time). NP (non-deterministic polynomial time) includes all problems for which, given a solution, one can efficiently verify that solution. When a problem is called intractable, this refers to its complexity class being NP-hard.

X ∼ f(θ) denotes a random variable X distributed according to probability distribution f(θ), parameterized by θ. The mean and variance of X are denoted as E_{X∼f(θ)}[X] and V[X], respectively. Bern(p) and Bin(n, p) denote the Bernoulli and Binomial distributions, respectively.

We denote an undirected graph by G(V, E), where V and E are the sets of nodes and edges, respectively. We use the notations V_G and E_G to refer to the nodes and edges of a graph G, respectively. A subgraph of a graph G is another graph formed from a subset of the vertices and edges of G; the vertex subset must include all endpoints of the edge subset, but may also include additional vertices. A cut C = (S, T) in G is a partition of the nodes V into subsets S and T. The size of the cut C is the number of edges in E with one endpoint in S and the other in T.

As is widely known, an ILP can be written as follows:

    maximize    w^T x            (2.1)
    subject to  A x ≤ b,         (2.2)
                x ∈ Z^n.         (2.3)

To instantiate an ILP for a particular problem, one introduces the variables x, defines the weights in the objective function (w in Equation 2.1), and specifies the constraints (A and b in Equation 2.2). This formulation is incredibly powerful and has been used for many problems. In the context of NLP, ILP-based discrete optimization was introduced by Roth and Yih (2004) and has been used successfully since (Chang et al., 2010; Berant et al., 2010; Srikumar and Roth, 2011; Goldwasser and Roth, 2014). In Chapters 3 and 4, too, we formalize our desired behavior as an optimization problem. This optimization problem, with its integrality constraints and in its general form, is NP-hard. That being said, industrial solvers (which use cutting planes and other heuristics) are quite fast across a wide variety of problems.
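As a minimal, illustrative sketch of this generic formulation (using the open-source PuLP modeling library as a stand-in for the industrial solvers mentioned above; the tiny weight vector and constraint matrix are made up for the example):

```python
import pulp

# maximize w^T x  subject to  A x <= b,  x integer (binary here for simplicity)
w = [3, 2, 4]                       # objective weights
A = [[1, 1, 1],                     # constraint matrix
     [2, 0, 1]]
b = [2, 2]                          # right-hand sides

prob = pulp.LpProblem("toy_ilp", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(w))]

# Objective: w^T x
prob += pulp.lpSum(w[i] * x[i] for i in range(len(w)))

# Constraints: each row of A x <= b
for row, rhs in zip(A, b):
    prob += pulp.lpSum(row[i] * x[i] for i in range(len(w))) <= rhs

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(v.value()) for v in x], pulp.value(prob.objective))
```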
Part I: Reasoning-Driven System Design

Chapter 3: QA as Subgraph Optimization on Tabular Knowledge

"The techniques of artificial intelligence are to the mind what bureaucracy is to human social interaction."
— Terry Winograd, Thinking Machines: Can there be? 1991

(This chapter is based on the following publication: Khashabi et al. (2016).)

Consider a question from the NY Regents 4th Grade Science Test:

    In New York State, the longest period of daylight occurs during which month?
    (A) June (B) March (C) December (D) September

We would like a QA system that, even if the answer is not explicitly stated in a document, can combine basic scientific and geographic facts to answer the question, e.g., New York is in the northern hemisphere; the longest day occurs during the summer solstice; and the summer solstice in the northern hemisphere occurs in June (hence the answer is June). Figure 9 illustrates how our system approaches this, with the highlighted support graph representing its line of reasoning.
Figure 9: TableILP searches for the best support graph (chains of reasoning) connecting the question to an answer, in this case June. Constraints on the graph define what constitutes valid support and how to score it (Section 3.3.3).
Further, we would like the system to be robust under simple perturbations, such as changing New York to New Zealand (in the southern hemisphere) or changing an incorrect answer option to an irrelevant word such as "last" that happens to have high co-occurrence with the question text.
To address these challenges, we present a system, TableILP, that operates over a semi-structured knowledge base derived from text and answers questions by chaining multiple pieces of information and combining parallel evidence. The knowledge base consists of tables, each of which is a collection of instances of an n-ary relation defined over natural language phrases. E.g., as illustrated in Figure 9, a simple table with schema (country, hemisphere) might contain the instance (United States, Northern), while a ternary table with schema (hemisphere, orbital event, month) might contain (North, Summer Solstice, June). TableILP treats lexical constituents of the question Q, as well as cells of potentially relevant tables T, as nodes in a large graph G_{Q,T}, and attempts to find a subgraph G of G_{Q,T} that "best" supports an answer option. The notion of best support is captured via a number of structural and semantic constraints and preferences, which are conveniently expressed in the Integer Linear Programming (ILP) formalism. We then use an off-the-shelf ILP optimization engine called SCIP (Achterberg, 2009) to determine the best supported answer for Q.

Following a recently proposed AI challenge (Clark, 2015), we evaluate TableILP on unseen elementary-school science questions from standardized tests. Specifically, we consider a challenge set (Clark et al., 2016) consisting of all non-diagram multiple choice questions from 6 years of NY Regents 4th grade science exams. In contrast to a state-of-the-art structured inference method (Khot et al., 2015) for this task, which used Markov Logic Networks (MLNs) (Richardson and Domingos, 2006),
TableILP achieves a significantly (+14% absolute) higher test score. This suggests that a rich and fine-grained constraint language, namely ILP, even with a publicly available solver, is more effective in practice than various MLN formulations of the task. (A preliminary version of our ILP model was used in the ensemble solver of Clark et al. (2016). We build upon this earlier ILP formulation, providing further details and incorporating additional syntactic and semantic constraints that improve the score by 17.7%.) Further, while the scalability of the MLN formulations was limited to very few (typically one or two) selected science rules at a time, our approach easily scales to hundreds of relevant facts. It also complements the kind of questions amenable to IR and PMI methods: an ensemble of
TableILP with IR and PMI results in a significant (+10% absolute) boost in the score compared to IR alone.

Our ablation study suggests that combining facts from multiple tables or multiple rows within a table plays an important role in TableILP's performance. We also show that TableILP benefits from the table structure, by comparing it with an IR system using the same knowledge (the table rows) but expressed as simple sentences; TableILP scores significantly (+10%) higher. Finally, we demonstrate that our approach is robust to a simple perturbation of incorrect answer options: while the perturbation results in a relative drop of 20% and 33% in the performance of the IR and PMI methods, respectively, it affects TableILP's performance by only 12%.
In this section we provide additional related work, augmenting the review provided in Section 2.1.

Clark et al. (2016) proposed an ensemble approach for the science QA task, demonstrating the effectiveness of a combination of information retrieval, statistical association, rule-based reasoning, and an ILP solver operating on semi-structured knowledge. Our ILP system extends their model with additional constraints and preferences (e.g., semantic relation matching), substantially improving QA performance.

A number of systems have been developed for answering factoid questions with short answers (e.g., "What is the capital of France?") using document collections or databases (e.g., Freebase (Bollacker et al., 2008), NELL (Carlson et al., 2010)), for example (Brill et al., 2002; Fader et al., 2014; Ferrucci et al., 2010; Ko et al., 2007; Yih et al., 2014; Yao and Durme, 2014; Zou et al., 2014). However, many science questions have answers that are not explicitly stated in text, and instead require combining information together. Conversely, while there are AI systems for formal scientific reasoning (e.g., Gunning et al., 2010; Novak, 1977), they require questions to be posed in logic or restricted English. Our goal here is a system that operates between these two extremes, able to combine information while still operating with natural language.

There is a relatively rich literature in the databases community on executing different commands on tabular content (e.g., searching, joining, etc.) via commands issued by a semi-novice user (Talukdar et al., 2008, 2010). A major distinguishing perspective is that in our problem the queries are generated completely independently of the table content, whereas in a database application the user is at least partially informed of the common keywords and can observe the outputs of the queries and adjust the commands accordingly.
We begin with our knowledge representation formalism, followed by our treatment of QA as an optimal subgraph selection problem over such knowledge, and then briefly describe our ILP model for subgraph selection.
We use semi-structured knowledge represented in the form of n-ary predicates over natural language text (Clark et al., 2016). Formally, a k-column table in the knowledge base is a predicate r(x_1, x_2, ..., x_k) over strings, where each string is a (typically short) natural language phrase. The column headers capture the table schema, akin to a relational database. Each row in the table corresponds to an instance of this predicate. For example, a simple country-hemisphere table represents the binary predicate r_{ctry-hems}(c, h) with instances such as (Australia, Southern) and (Canada, Northern). Since table content is specified in natural language, the same entity is often represented differently in different tables, posing an additional inference challenge.

Although techniques for constructing this knowledge base are outside the scope of this work, we briefly mention them. Tables were constructed using a mixture of manual and semi-automatic techniques. First, the table schemas were manually defined based on the syllabus, study guides, and training questions. Tables were then populated both manually and semi-automatically using IKE (Dalvi et al., 2016), a table-building tool that performs interactive, bootstrapped relation extraction over a corpus of science text. In addition, to augment these tables with the broad knowledge present in study guides that doesn't always fit the manually defined table schemas, we ran an Open IE (Banko et al., 2007) pattern-based subject-verb-object (SVO) extractor from Clark et al. (2014) over several science texts to populate three-column Open IE tables. Methods for further automating table construction are under development.

We treat question answering as the task of pairing the question with an answer such that this pair has the best support in the knowledge base, measured in terms of the strength of a "support graph" defined as follows. Given a multiple choice question Q and tables T, we define a labeled undirected graph G_{Q,T} over nodes V and edges E as follows. We first split Q into lexical constituents (e.g., non-stopword tokens, or chunks) q = {q_ℓ} and answer options a = {a_m}. For each table T_i, we consider its cells t = {t_ijk} as well as column headers h = {h_ik}. The nodes of G_{Q,T} are then V = q ∪ a ∪ t ∪ h. For presentation purposes, we will equate a graph node with the lexical entity it represents (such as a table cell or a question constituent). The undirected edges of G_{Q,T} are E = ((q ∪ a) × (t ∪ h)) ∪ (t × t) ∪ (h × h), excluding edges both of whose endpoints are within a single table.

Informally, an edge denotes (soft) equality between a question or answer node and a table node, or between two table nodes. To account for lexical variability (e.g., that tool and instrument are essentially equivalent) and generalization (e.g., that a dog is an animal), we replace string equality with a phrase-level entailment or similarity function w : E → [0, 1] that labels each edge e ∈ E with a score w(e). We use entailment scores (directional) from q to t ∪ h and from t ∪ h to a, and similarity scores (symmetric) between two nodes in t. In the special case of column headers across two tables, the score is (manually) set to either 0 or 1, indicating whether this corresponds to a meaningful join. Intuitively, we would like the support graph for an answer option to be connected, and to include nodes from the question, the answer option, and at least one table.
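As a small illustration of this graph construction (the similarity function below is a crude word-overlap stand-in for the entailment scores w(e) described above; cell-cell and header-header edges are omitted for brevity):

```python
def similarity(a, b):
    # Crude stand-in for the phrase-level entailment/similarity function w(e).
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def build_question_table_graph(question_constituents, answer_options, table_cells,
                               threshold=0.2):
    """Nodes are question constituents, answer options, and table cells;
    edges link question/answer nodes to table cells whose similarity clears a threshold."""
    nodes = question_constituents + answer_options + table_cells
    edges = []
    for qa_node in question_constituents + answer_options:
        for cell in table_cells:
            score = similarity(qa_node, cell)
            if score >= threshold:
                edges.append((qa_node, cell, score))
    return nodes, edges

q = ["New York State", "longest period of daylight", "month"]
a = ["June", "December"]
cells = ["New York State", "USA", "Northern", "Summer Solstice", "June"]
nodes, edges = build_question_table_graph(q, a, cells)
print(edges)
```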
Since each table row represents a coherent piece of information but cells within a row do not have any edges in G_{Q,T} (the same holds for cells and the corresponding column headers), we use the notion of an augmented subgraph to capture the underlying table structure. Let G = (V, E) be a subgraph of G_{Q,T}. The augmented subgraph G+ is formed by adding to G edges (v_1, v_2) such that v_1 and v_2 are in V and they correspond to either the same row (possibly the header row) of a table in T or to a cell and the corresponding column header.

Definition 1. A support graph G = G(Q, T, a_m) for a question Q, tables T, and an answer option a_m is a subgraph (V, E) of G_{Q,T} with the following basic properties:
1. V ∩ a = {a_m}, V ∩ q ≠ φ, V ∩ t ≠ φ;
2. w(e) > 0 for all e ∈ E;
3. if e ∈ E ∩ (t × t) then there exists a corresponding e′ ∈ E ∩ (h × h) involving the same columns; and
4. the augmented subgraph G+ is connected.

A support graph thus connects the question constituents to a unique answer option through table cells and (optionally) table headers corresponding to the aligned cells. A given question and tables give rise to a large number of possible support graphs, and the role of the inference process is to choose the "best" one under a notion of desirable support graphs developed next. We do this through a number of additional structural and semantic properties; the more properties the support graph satisfies, the more desirable it is. (In our evaluations, w for entailment is a simple WordNet-based (Miller, 1995) function that computes the best word-to-word alignment between phrases, scores these alignments using WordNet's hypernym and synonym relations normalized using relevant word-sense frequency, and returns the weighted sum of the scores; w for similarity is the maximum of the entailment score in both directions. Alternative definitions for these functions may also be used.)

We model the above support graph search for QA as an ILP optimization problem, i.e., as maximizing a linear objective function over a finite set of variables, subject to a set of linear inequality constraints (see Section 2.6.4 for a primer on ILP formulations). A summary of the model is given below. We note that the ILP objective and constraints aren't tied to the particular domain of evaluation; they represent general properties that capture what constitutes a well-supported answer for a given question.
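As an illustration of Definition 1, a minimal sketch that checks the basic validity of a candidate support graph (assuming the networkx library; property 3 on column correspondence is omitted, and the same-row augmentation edges are simply folded into the edge list):

```python
import networkx as nx

def is_valid_support_graph(question_nodes, answer_nodes, table_cell_nodes,
                           chosen_answer, augmented_edges, edge_weights):
    """Check the basic support-graph properties of Definition 1:
    exactly one answer option, at least one question node and one table cell,
    strictly positive edge weights, and connectivity of the augmented subgraph."""
    g = nx.Graph()
    g.add_edges_from(augmented_edges)
    nodes = set(g.nodes())
    if nodes & set(answer_nodes) != {chosen_answer}:
        return False                              # property 1: a unique answer option
    if not (nodes & set(question_nodes)) or not (nodes & set(table_cell_nodes)):
        return False                              # property 1: touches question and a table
    if any(edge_weights.get(e, edge_weights.get((e[1], e[0]), 0)) <= 0
           for e in augmented_edges):
        return False                              # property 2: positive edge weights
    return nx.is_connected(g)                     # property 4: augmented graph is connected

edges = [("longest period of daylight", "Summer Solstice"),
         ("Summer Solstice", "June"),             # same-row (augmentation) edge
         ("June", "June (answer)")]
weights = {e: 1.0 for e in edges}
print(is_valid_support_graph(
    ["longest period of daylight"], ["June (answer)", "December (answer)"],
    ["Summer Solstice", "June"], "June (answer)", edges, weights))
```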
Table 3: Notation for the ILP formulation.

    Element   Description
    T_i       the i-th table
    h_ik      header of the k-th column of the i-th table
    t_ijk     cell in row j and column k of the i-th table
    r_ij      row j of the i-th table
    ℓ_ik      column k of the i-th table
    q_ℓ       ℓ-th lexical constituent of the question Q
    a_m       m-th answer option

Table 3 summarizes the notation for various elements of the problem, such as t_ijk for cell (j, k) of table i. All core variables in the ILP model are binary, i.e., have domain {0, 1}. For each element, the model has a unary variable capturing whether this element is part of the support graph G, i.e., whether it is "active". For instance, row r_ij is active if at least one cell in row j of table i is in G. The model also has pairwise "alignment" variables, capturing edges of G_{Q,T}. The alignment variable for an edge e in G_{Q,T} is associated with the corresponding weight w(e), and captures whether e is included in G. To improve efficiency, we create a pairwise variable for e only if w(e) is larger than a certain threshold. These unary and pairwise variables are then used to define various types of constraints and preferences, as discussed next. Details of the ILP model may be found in Appendix A.1.1.
To make the definitions clear, we introduce the variables used in our optimization, which we will later use to define constraints explicitly. We define variables over each element by overloading the x(.) or y(., .) notation to refer to a binary variable on a single element or on a pair of elements, respectively. Table 4 contains the complete list of the variables, all of which are binary, i.e., defined on the {0, 1} domain. The unary variables represent the presence of a specific element in the support graph as a node. For example, x(T_i) = 1 if and only if table T_i is active. Similarly, basic variables are defined between pairs of elements; e.g., y(t_ijk, q_ℓ) is a binary variable that takes value 1 if and only if the corresponding edge is present in the support graph, which can alternatively be referred to as an alignment between cell (j, k) of table i and the ℓ-th constituent of the question.

    Basic Pairwise Activity Variables
    y(t_ijk, t_i'j'k')   cell to cell
    y(t_ijk, q_ℓ)        cell to question constituent
    y(h_ik, q_ℓ)         header to question constituent
    y(t_ijk, a_m)        cell to answer option
    y(h_ik, a_m)         header to answer option
    y(ℓ_ik, a_m)         column to answer option
    y(T_i, a_m)          table to answer option
    y(ℓ_ik, ℓ_ik')       column to column relation

    High-level Unary Variables
    x(T_i)               active table
    x(r_ij)              active row
    x(ℓ_ik)              active column
    x(h_ik)              active column header
    x(q_ℓ)               active question constituent
    x(a_m)               active answer option

Table 4: Variables used for defining the optimization problem for
the TableILP solver. All variables have domain {0, 1}.

As previously mentioned, in practice we do not create all possible pairwise variables. Instead we choose the pairs whose alignment score w(e) exceeds a threshold. For example, we create the pairwise variable y(t_ijk, t_i'j'k') only if the score w(t_ijk, t_i'j'k') ≥ MinCellCellAlignment. (An exhaustive list of the minimum alignment thresholds for creating pairwise variables is in Table 28 in the appendix.) The objective function is a weighted linear sum of all the variables we instantiate for a given problem. (The complete list of weights for the pairwise and unary variables is included in Table 27 in the appendix.) There is also a small set of auxiliary variables defined for linearizing complicated constraints, which we introduce later among the constraints.

Constraints are a significant part of our model. Some constraints relate variables to each other. The unary variables are defined through constraints that relate them to the pairwise basic variables. For example, for the active row variable x(r_ij), we ensure that it is active if and only if some cell in row j is active:

    x(r_ij) ≥ y(t_ijk, *),   ∀ (t_ijk, *) ∈ R_ij, ∀ i, j, k,

where R_ij is the collection of pairwise variables with one end in row j of table i. In what follows we outline some of the important behaviors we expect from our model, which arise from different combinations of the active variables.

Basic Lookup
Consider the following question:

    Which characteristic helps a fox find food?
    (A) sense of smell (B) thick fur (C) long tail (D) pointed teeth

In order to answer such lookup-style questions, we generally seek a row with the highest aggregate alignment to question constituents. We achieve this by incorporating the question-table alignment variables, with the alignment scores w(e) as coefficients, and the active question-constituent variables, with a constant coefficient, in the objective function. Since any additional question-table edge with a positive entailment score (even to irrelevant tables) in the support graph would result in an increase in the score, we disallow tables with alignments only to the question (or only to a choice) and add a small penalty for every table used in order to reduce noise in the support graph. (The complete list of the constraints is given in Table 31 in the appendix.) We also limit the maximum number of alignments between a question constituent and table cells, to prevent one constituent or cell from being overused:

    Σ_{(*, q_ℓ) ∈ Q_ℓ} y(*, q_ℓ) ≤ MaxAlignmentsPerQCons,   ∀ ℓ

where Q_ℓ is the set of all pairwise variables with one end in question constituent ℓ.

Parallel Evidence
For certain questions, evidence needs to be combined from multiple rows of a table. For example:

    Sleet, rain, snow, and hail are forms of
    (A) erosion (B) evaporation (C) groundwater (D) precipitation

To answer this question, we need to combine evidence from multiple entries of the weather terms table, (term, type), namely (sleet, precipitation), (rain, precipitation), (snow, precipitation), and (hail, precipitation). To achieve this, we allow multiple active rows in the support graph. Similar to the basic constraints, we limit the maximum number of active rows per table and add a penalty for every active row to ensure only relevant rows are considered for reasoning:

    Σ_j x(r_ij) ≤ MaxRowsPerTable,   ∀ i

To encourage only coherent parallel evidence within a single table, we limit our support graph to always use the same columns across multiple rows within a table, i.e., every active row has active cells corresponding to the same set of columns.
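To make the budget and linking constraints above concrete, here is a minimal PuLP sketch (variable names mirror the notation, while the sizes, thresholds, and toy objective are illustrative; this is not the actual TableILP implementation):

```python
import pulp

num_rows, num_cells_per_row, num_q_constituents = 3, 2, 4
MaxRowsPerTable, MaxAlignmentsPerQCons = 2, 2

prob = pulp.LpProblem("support_graph", pulp.LpMaximize)

# Pairwise alignment variables y[(j, k, l)]: cell (j, k) aligned to question constituent l.
y = {(j, k, l): pulp.LpVariable(f"y_{j}_{k}_{l}", cat="Binary")
     for j in range(num_rows) for k in range(num_cells_per_row)
     for l in range(num_q_constituents)}
# Unary "active row" variables x_r[j].
x_r = {j: pulp.LpVariable(f"row_{j}", cat="Binary") for j in range(num_rows)}

# Linking constraints: a row is active if any of its cells participates in an alignment.
for (j, k, l), var in y.items():
    prob += x_r[j] >= var

# Parallel-evidence budget: at most MaxRowsPerTable active rows per table.
prob += pulp.lpSum(x_r.values()) <= MaxRowsPerTable

# Lookup budget: each question constituent aligns to at most MaxAlignmentsPerQCons cells.
for l in range(num_q_constituents):
    prob += pulp.lpSum(y[(j, k, l)] for j in range(num_rows)
                       for k in range(num_cells_per_row)) <= MaxAlignmentsPerQCons

# Toy objective: reward alignments, lightly penalize each active row (to reduce noise).
prob += pulp.lpSum(y.values()) - 0.5 * pulp.lpSum(x_r.values())
prob.solve(pulp.PULP_CBC_CMD(msg=False))
```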
Evidence Chaining
Questions requiring chaining of evidence from multiple tables, such as the example in Figure 9, are typically the most challenging in this domain. Chaining can be viewed as performing a join between two tables. We introduce alignments between cells across columns in pairs of tables to allow for chaining of evidence. To help minimize potential noise introduced by chaining irrelevant facts, we add a penalty for every inter-table alignment and also rely on the 0/1 weights of header-to-header edges to ensure only semantically meaningful table joins are considered.
Semantic Relation Matching
Our constraints so far have only looked at the content of the table cells, or the structure of the support graph, without explicitly considering the semantics of the table schema. By using alignments between the question and column headers (i.e., type information), we exploit the table schema to prefer alignments to columns relevant to the "topic" of the question. In particular, for questions of the form "which X ...", we prefer answers that directly entail X or are connected to cells that entail X. However, this is not sufficient for questions such as:

    What is one way to change water from a liquid to a solid?
    (A) decrease the temperature (B) increase the temperature (C) decrease the mass (D) increase the mass

Even if we select the correct table, say r_{change-init-fin}(c, i, f), which describes the initial and final states for a phase change event, both choice (A) and choice (B) would have the exact same score in the presence of table rows (increase temperature, solid, liquid) and (decrease temperature, liquid, solid). The table, however, does have the initial vs. final state structure. To capture this semantic structure, we annotate pairs of columns within certain tables with the semantic relationship present between them. In this example, we would annotate the phase change table with the relations changeFrom(c, i), changeTo(c, f), and fromTo(i, f).

Given such semantic relations for table schemas, we can now impose a preference towards question-table alignments that respect these relations. We associate each semantic relation with a set of linguistic patterns describing how it might be expressed in natural language. TableILP then uses these patterns to spot possible mentions of the relations in the question Q. We then add the soft constraint that for every pair of active columns in a table (with an annotated semantic relation) aligned to a pair of question constituents, there should be a valid expression of that relation in Q between those constituents. In our example, we would match the relation fromTo(liquid, solid) in the table to "liquid to a solid" in the question via the pattern "X to a Y" associated with fromTo(X, Y), and thereby prefer aligning with the correct row (decrease temperature, liquid, solid).
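As an illustration of how such a relation pattern might be spotted in a question, a minimal sketch (the pattern inventory and function names are illustrative and do not reproduce the system's actual patterns):

```python
import re

# Each semantic relation is associated with linguistic patterns; X and Y are captured spans.
RELATION_PATTERNS = {
    "fromTo": [r"(?P<X>\w+)\s+to\s+an?\s+(?P<Y>\w+)"],
}

def spot_relations(question):
    """Return (relation, X, Y) triples for every pattern mention found in the question."""
    mentions = []
    for relation, patterns in RELATION_PATTERNS.items():
        for pattern in patterns:
            for m in re.finditer(pattern, question, flags=re.IGNORECASE):
                mentions.append((relation, m.group("X"), m.group("Y")))
    return mentions

q = "What is one way to change water from a liquid to a solid?"
print(spot_relations(q))   # [('fromTo', 'liquid', 'solid')]
```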
We compare our approach to three existing methods, demonstrating that it outperforms the best previous structured approach (Khot et al., 2015) and produces a statistically significant improvement when used in combination with IR-based methods (Clark et al., 2016). For evaluations, we use a 2-core 2.5 GHz Amazon EC2 Linux machine with 16 GB RAM.

Question Set. We use the same question set as Clark et al. (2016), which consists of all non-diagram multiple-choice questions from 12 years of the NY Regents 4th Grade Science exams. The set is split into 108 development questions and 129 hidden test questions based on the year they appeared in (6 years each). All numbers reported below are for the hidden test set, except for the question perturbation experiments, which relied on the 108 development questions. Test scores are reported as percentages. For each question, a solver gets a score of 1 if it chooses the correct answer and 1/k if it reports a k-way tie that includes the correct answer. On the 129 test questions, a score difference of 9% (or 7%) is statistically significant at the 95% (or 90%, respectively) confidence interval based on the binomial exact test (Howell, 2012).
Corpora. We work with three knowledge corpora:
1. Web Corpus: This corpus contains 5 × 10^10 tokens (280 GB of plain text) extracted from Web pages. It was collected by Charles Clarke at the University of Waterloo.
2. Sentence Corpus: a collection of domain-targeted science sentences.
3. Table Corpus: the collection of semi-structured tables described earlier.

TableILP (our approach). Given a question Q, we select the top 7 tables from the Table Corpus using the standard TF-IDF score of Q with tables treated as bag-of-words documents. For each selected table, we choose the 20 rows that overlap with Q the most. This filtering improves efficiency and reduces noise. We then generate an ILP and solve it using the open source SCIP engine (Achterberg, 2009), returning the active answer option a_m from the optimal solution. To check for ties, we disable a_m, re-solve the ILP, and compare the score of the second-best answer, if any, with that of a_m. (The Table Corpus and the ILP model are available at allenai.org.)

MLN Solver (structured inference baseline). We consider the current state-of-the-art structured reasoning method developed for this specific task by Khot et al. (2015). We compare against their best performing system, namely Praline, which uses Markov Logic Networks (Richardson and Domingos, 2006) to (a) align lexical elements of the question with probabilistic first-order science rules and (b) control inference. We use the entire set of 47,000 science rules from their original work, which were also derived from the same domain-targeted sources as the ones used in our Sentence Corpus.

IR Solver (information retrieval baseline). We use the IR baseline by Clark et al. (2016), which selects the answer option that has the best matching sentence in a corpus. Specifically, for each answer option a_i, the IR solver sends q + a_i as a query to a search engine (we use Lucene) on the Sentence Corpus, and returns the search engine's score for the top retrieved sentence s, where s must have at least one non-stopword overlap with q, and at least one with a_i. The option with the highest Lucene score is returned as the answer.

PMI Solver (statistical co-occurrence baseline). We use the PMI-based approach by Clark et al. (2016), which selects the answer option that most frequently co-occurs with the question words in a corpus. Specifically, it extracts unigrams, bigrams, trigrams, and skip-bigrams from the question and each answer option. For a pair (x, y) of n-grams, their pointwise mutual information (PMI) (Church and Hanks, 1989) in the corpus is defined as log [p(x, y) / (p(x) p(y))], where p(x, y) is the co-occurrence frequency of x and y (within some window) in the corpus. The solver returns the answer option that has the largest average PMI in the Web Corpus, calculated over all pairs of question n-grams and answer option n-grams.
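For concreteness, a minimal sketch of this PMI score (the counts below are made up; a real implementation would read co-occurrence statistics from the Web Corpus):

```python
import math

def pmi(x, y, count_x, count_y, count_xy, num_windows):
    """Pointwise mutual information: log p(x, y) / (p(x) p(y)), from corpus counts."""
    p_x = count_x[x] / num_windows
    p_y = count_y[y] / num_windows
    p_xy = count_xy[(x, y)] / num_windows
    return math.log(p_xy / (p_x * p_y))

def answer_score(question_ngrams, option_ngrams, count_x, count_y, count_xy, num_windows):
    # Average PMI over all question-ngram / option-ngram pairs (unseen pairs score 0).
    scores = []
    for q in question_ngrams:
        for o in option_ngrams:
            if (q, o) in count_xy:
                scores.append(pmi(q, o, count_x, count_y, count_xy, num_windows))
            else:
                scores.append(0.0)
    return sum(scores) / len(scores)

# Toy counts: "daylight" co-occurs with "June" far more often than with "December".
count_x = {"daylight": 1000, "longest": 800}
count_y = {"June": 5000, "December": 5000}
count_xy = {("daylight", "June"): 50, ("daylight", "December"): 5}
num_windows = 1_000_000
print(answer_score(["daylight", "longest"], ["June"], count_x, count_y, count_xy, num_windows))
print(answer_score(["daylight", "longest"], ["December"], count_x, count_y, count_xy, num_windows))
```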
We first compare the accuracy of our approach against the previous structured (MLN-based) reasoning solver. We also compare against IR(tables), an IR solver using table rows expressed as sentences, thus embodying an unstructured approach operating on the same knowledge as TableILP.

    Solver        Test Score (%)
    MLN           47.5
    IR(tables)    51.2
    TableILP      61.5

Table 5: TableILP significantly outperforms both the prior MLN reasoner and IR using identical knowledge as TableILP.

As Table 5 shows, among the two structured inference approaches, TableILP outperforms the MLN baseline by 14%. The preliminary ILP system reported by Clark et al. (2016) achieves only a score of 43.8% on this question set. Further, given the same semi-structured knowledge (i.e., the Table Corpus), TableILP is substantially (+10%) better at exploiting the structure than the IR(tables) baseline, which, as mentioned above, uses the same data expressed as sentences.
Complementary Strengths
    Solver                 Test Score (%)
    IR                     58.5
    PMI                    60.7
    TableILP               61.5
    TableILP + IR          66.1
    TableILP + PMI         67.6
    TableILP + IR + PMI    69.0

Table 6: Solver combination results.

While their overall scores are similar, TableILP and IR-based methods clearly approach QA very differently. To assess whether TableILP adds any new capabilities, we considered the 50 (out of 129) questions incorrectly answered by the PMI solver (ignoring tied scores). On these unseen but arguably more difficult questions, TableILP answered 27 questions correctly, achieving a score of 54% compared to the random chance of 25% for 4-way multiple-choice questions. Results with the IR solver were similar: TableILP scored 24.75 on the 52 questions incorrectly answered by IR (i.e., 47.6% accuracy).

This analysis highlights the complementary strengths of these solvers. Following Clark et al. (2016), we create an ensemble of TableILP, IR, and PMI solvers, combining their answer predictions using a simple Logistic Regression model trained on the development set. This model uses 4 features derived from each solver's score for each answer option, and 11 features derived from TableILP's support graphs. (Details of the 11 features may be found in Appendix B.) Table 6 shows the results, with the final combination at 69% representing a significant improvement over individual solvers.
ILP Solution Properties
Table 7 summarizes various ILP and support graph statistics for TableILP, averaged across all test questions.

Table 7: TableILP statistics averaged across questions.

Thus, TableILP takes only 4 seconds to answer a question using multiple rows across multiple tables (typically 140 rows in total), as compared to 17 seconds needed by the MLN solver for reasoning with four rules (one per answer option). (Commercial ILP solvers, e.g., CPLEX or Gurobi, are much faster than the open-source SCIP solver we used for evaluations.) While the final support graph on this question set relies mostly on a single table to answer the question, it generally combines information from more than two rows (2.3 on average) for reasoning. This suggests parallel evidence is more frequently used on this dataset than evidence chaining.
Table 8: Ablation results for TableILP.
To quantify the importance of various components of our system, we performed several ablation experiments, summarized in Table 8 and described next.
No Multiple Row Inference: We modify the ILP constraints to limit inference to a single row (and hence a single table).
No Relation matching: To assess the importance of considering the semantics of the table, we remove the requirement of matching the semantic relation present between columns of a table with its lexicalization in the question (Section 3.3.3). The 6% drop indicates
that TableILP relies strongly on the table semantics to create meaningful inferential chains.
No Open IE tables: To evaluate the impact of relatively unstructured knowledge from a large corpus, we removed the tables containing Open IE extractions (Section 3.3.2). The 9% drop in the score shows that this knowledge is important and
TableILP is able to exploit it even though it has a very simple triple structure. This opens up the possibility of extending our approach to triples extracted from larger knowledge bases.
No Lexical Entailment: Finally, we test the effect of changing the alignment metric w (Section 3.3.2) from WordNet-based scores to a simple asymmetric word-overlap measure, score(T, H) = |T ∩ H| / |H|. Relying on just word-matching results in an 11% drop, which is consistent with our knowledge often being defined in terms of generalities.

One desirable property of QA systems is robustness to simple variations of a question, especially when a variation would make the question arguably easier for humans. For example, the question from Figure 9 may be perturbed so that its incorrect options are replaced with irrelevant words:

    In New York State, the longest period of daylight occurs during which month?
    (A) eastern (B) June (C) history (D) years
    Solver     Original Score (%)    Drop with Perturbation (absolute)    Drop with Perturbation (relative)
    IR         70.7                  13.8                                 19.5
    PMI        73.6                  24.4                                 33.2
    TableILP   —                     —                                    12

Table 9: Drop in solver scores (on the development set, rather than the hidden test set) when questions are perturbed.

As in this example, the perturbations (the substituted incorrect options) are often not even of the correct "type", typically making them much easier for humans. They, however, still remain difficult for solvers. For each of the 108 development questions, we generate 10 new perturbed questions, using the 30 most frequently occurring words in step (5) above. While this approach can introduce new answer options that should be considered correct as well, only 3% of the questions in a random sample exhibited this behavior. Table 9 shows the performance of various solvers on the resulting 1,080 perturbed questions. As one might expect, the PMI approach suffers the most, at a 33% relative drop. TableILP's score drops as well (since answer type matching isn't perfect), but only by 12%, attesting to its higher resilience to simple question variation.
This chapter proposed a reasoning system for question answering on elementary-school science exams, using a semi-structured knowledge base. We formulate QA as an Integer Linear Program (ILP) that answers natural language questions using a semi-structured knowledge base derived from text, including questions requiring multi-step inference and a combination of multiple facts. On a dataset of real, unseen science questions, our system significantly outperforms (+14%) the best previous attempt at structured reasoning for this task, which used Markov Logic Networks (MLNs). When combined with unstructured inference methods, the ILP system significantly boosts overall performance (+10%). Finally, we show our approach is substantially more robust to a simple answer perturbation compared to statistical correlation methods.

There are a few factors that limit the ideas discussed in this chapter. In particular, the knowledge consumed by this system is in the form of curated tables; constructing such knowledge is not always easy. In addition, not everything might be representable in that form. Another limitation stems from the nature of multi-step reasoning: a larger number of reasoning steps could result in more brittle decisions. We study this issue in Chapter 8.
Chapter 4: QA as Subgraph Optimization over Semantic Abstractions

"It linked all the perplexed meanings / Into one perfect peace."
— Procter and Sullivan, The Lost Chord, 1877
In this chapter, we consider the multiple-choice setting where Q is a question, A is a set of answer candidates, and the knowledge required for answering Q is available in the form of raw text P. A major difference from the previous chapter is that the knowledge given to the system is raw text, instead of being represented in tabular format. (This chapter is based on the following publication: Khashabi et al. (2018b).) We demonstrate that we can use existing NLP modules, such as semantic role labeling (SRL) systems with respect to multiple predicate types (verbs, prepositions, nominals, etc.), to derive multiple semantic views of the text and perform reasoning over these views to answer a variety of questions.

As an example, consider the following snippet of sports news text and an associated question:

    P: Teams are under pressure after PSG purchased Neymar this season. Chelsea purchased Morata. The Spaniard looked like he was set for a move to Old Trafford for the majority of the summer, only for Manchester United to sign Romelu Lukaku instead, paving the way for Morata to finally move to Chelsea for an initial £56m.
    Q: Who did Chelsea purchase this season?
    A: { ✓ Alvaro Morata, Neymar, Romelu Lukaku }

Given the bold-faced text P′ in P, simple word-matching suffices to correctly answer Q. However, P′ could have stated the same information in many different ways. As paraphrases become more complex, they begin to involve more linguistic constructs such as coreference, punctuation, prepositions, and nominals. This makes understanding the text, and thus the QA task, more challenging.
Figure 10: Depiction of SemanticILP reasoning for the example paragraph given in the text. Semantic abstractions of the question, answers, and knowledge snippet are shown in different colored boxes (blue, green, and yellow, respectively). Red nodes and edges are the elements that are aligned (used) for supporting the correct answer. There are many other unaligned (unused) annotations associated with each piece of text that are omitted for clarity.

For instance, P′ could instead say Morata is the recent acquisition by Chelsea. This simple-looking transformation can be surprisingly confusing for highly successful systems such as
BiDAF (Seo et al., 2016), which produces the partially correct phrase "Neymar this season. Morata". On the other hand, one can still answer the question confidently by abstracting relevant parts of Q and P, and connecting them appropriately. Specifically, a verb SRL frame for Q would indicate that we seek the object of the verb purchase, a nominal SRL frame for P′ would capture that the acquisition was of Morata and was done by Chelsea, and textual similarity would align purchase with acquisition.

Similarly, suppose P′ instead said Morata, the recent acquisition by Chelsea, will start for the team tomorrow. BiDAF now incorrectly chooses Neymar as the answer, presumably due to its proximity to the words purchased and this season. However, with the right abstractions, one could still arrive at the correct answer, as depicted in Figure 10 for our proposed system,
SemanticILP. This reasoning uses comma SRL to realize that Morata refers to the acquisition, and a preposition SRL frame to capture that the acquisition was done by Chelsea. One can continue to make P′ more complex. For example, P′ could introduce the need for coreference resolution by phrasing the information as: Chelsea is hoping to have a great start this season by actively hunting for new players in the transfer period. Morata, the recent acquisition by the team, will start for the team tomorrow.
Nevertheless, with appropriate semantic abstractions of the text, the underlying reasoning remains relatively simple.

Given sufficiently large QA training data, one could conceivably perform end-to-end training (e.g., using a deep learning method) to address these linguistic challenges. However, existing large scale QA datasets such as SQuAD (Rajpurkar et al., 2016) often either have limited linguistic richness or do not necessarily need reasoning to arrive at the answer (Jia and Liang, 2017). Consequently, the resulting models do not transfer easily to other domains. For instance, the above-mentioned BiDAF model trained on the SQuAD dataset performs substantially worse than a simple IR approach on our datasets. On the other hand, many of the QA collections in domains that require some form of reasoning, such as the science questions we use, are small (100s to 1000s of questions). This brings into question the viability of the aforementioned paradigm that attempts to learn everything from only the QA training data.

Towards the goal of effective structured reasoning in the presence of data sparsity, we propose to use a rich set of general-purpose, pre-trained NLP tools to create various semantic abstractions of the raw text in a domain-independent fashion, as illustrated for an example in Figure 10. We represent these semantic abstractions as families of graphs, where the family (e.g., trees, clusters, labeled predicate-argument graphs, etc.) is chosen to match the nature of the abstraction (e.g., parse tree, coreference sets, SRL frames, etc., respectively). This applies to all three inputs of the system: Q, A, and P.

We then view the reasoning needed to answer the question as the task of finding an optimal support graph, a subgraph G of an augmented graph over these abstractions (defined formally later in this chapter) connecting (the semantic graphs of) Q and A via P. The reasoning used to answer the question is captured by a variety of requirements or constraints that G must satisfy, as well as a number of desired properties, encapsulating the "correct" reasoning, that make G preferable over other valid support graphs. For instance, a simple requirement is that G must be connected and it must touch both Q and A. Similarly, if G includes a verb from an SRL frame, it is preferable to also include the corresponding subject. Finally, the resulting constrained optimization problem is formulated as an Integer Linear Program (ILP) and optimized using an off-the-shelf ILP solver (see Section 2.6.4 for a review of ILP).

This formalism may be viewed as a generalization of the system introduced in the previous chapter: instead of operating over table rows (which are akin to labeled sequence graphs or predicate-argument graphs), we operate over a much richer class of semantic graphs. It can also be viewed as a generalization of the recent TupleInf system (Khot et al., 2017), which converts P into a particular kind of semantic abstraction, namely Open IE tuples (Banko et al., 2007).

This generalization to multiple semantic abstractions poses two key technical challenges: (a) unlike clean knowledge-bases (e.g., Dong et al. (2015)) used in many QA systems, abstractions generated from NLP tools (e.g., SRL) are noisy; and (b) even if perfect, using their output for QA requires delineating what information in Q, A, and P is relevant for a given question, and what constitutes valid reasoning. The latter is especially challenging when combining information from diverse abstractions that, even though grounded in the same raw text, may not perfectly align.
We address these challenges via our ILP formulation, by using our linguistic knowledge about the abstractions to design requirements and preferences for linking these abstractions. We present a new QA system, SemanticILP, based on these ideas, and evaluate it on multiple-choice questions from two domains involving rich linguistic structure and reasoning: elementary and middle-school level science exams, and early-college level biology reading comprehension. Their data sparsity, as we show, limits the performance of state-of-the-art neural methods such as BiDAF (Seo et al., 2016). SemanticILP, on the other hand, is able to successfully capitalize on existing general-purpose NLP tools in order to outperform existing baselines by 2%-6% on the science exams, leading to a new state of the art. It also generalizes well, as demonstrated by its strong performance on biology questions in the
ProcessBank dataset (Berant et al., 2014). Notably, while the best existing system for the latter relies on domain-specific structural annotation and question processing,
SemanticILP needs neither.
We provide a brief review of the related work, in addition to the discussion provided in Section 2.1. Our formalism can be seen as an extension of the previous chapter. For instance, in our formalism, each table used by
TableILP can be viewed as a semantic frame and represented as a predicate-argument graph. The table-chaining rules used there are equivalent to the reasoning we define when combining two annotation components. Similarly, Open IE tuples used by Khot et al. (2017) can also be viewed as a predicate-argument structure.

One key abstraction we use is the predicate-argument structure provided by Semantic Role Labeling (SRL). Many SRL systems have been designed (Gildea and Jurafsky, 2002; Punyakanok et al., 2008) using linguistic resources such as FrameNet (Baker et al., 1998), PropBank (Kingsbury and Palmer, 2002), and NomBank (Meyers et al., 2004). These systems are meant to convey high-level information about predicates (which can be a verb, a noun, etc.) and related elements in the text. The meaning of each predicate is conveyed by a frame, the schematic representation of a situation. Phrases with similar semantics ideally map to the same frame. (Code for SemanticILP is available at: https://github.com/allenai/semanticilp)
We begin with our formalism for abstracting knowledge from text and representing it as a family of graphs, followed by specific instantiations of these abstractions using off-the-shelf NLP modules.

4.2.1. Semantic Abstractions
The pivotal ingredient of the abstraction is raw text. This representation is used for the question Q, each answer option A_i, and the knowledge snippet P, which potentially contains the answer to the question. The knowledge base for a given raw text consists of the text itself, embellished with various SemanticGraphs attached to it, as depicted in Figure 11.
Figure 11: Knowledge representation used in our formulation. Raw text is associated with a collection of SemanticGraphs, which convey certain information about the text. There are implicit similarity edges among the nodes of the connected components of the graphs, and from nodes to the corresponding raw-text spans. (The figure shows a text node linked to SemanticGraph 1 (Sequence), SemanticGraph 2 (Tree), SemanticGraph 3 (Predicate-Argument), and SemanticGraph 4 (Cluster).)
Each SemanticGraph is representable from a family of graphs. In principle there need not be any constraints on the permitted graph families; however, for ease of representation we choose the graphs to belong to one of the following five families: Sequence graphs represent labels for each token in the sentence. The Span family represents labels for spans of the text. Tree is a tree representation over text spans. The Cluster family contains spans of text grouped into clusters. The PredArg family represents predicates and their arguments; in this view, edges represent the connections between each single predicate and its arguments. Each SemanticGraph belongs to one of these graph families, and its content is determined by the semantics of the information it represents and the text itself.

We define the knowledge more formally here. For a given paragraph T, its representation K(T) consists of a set of semantic graphs, $K(T) = \{g_1, g_2, \ldots\}$. We define $v(g) = \{c_i\}$ and $e(g) = \{(c_i, c_j)\}$ to be the set of nodes and edges of a given graph g, respectively.

Having introduced a graph-based abstraction for knowledge and categorized it into families of graphs, we now delineate the instantiations we used for each family. Many of the pre-trained extraction tools we use are available in
CogCompNLP (available at: http://github.com/CogComp/cogcomp-nlp).
• Sequence: labels for a sequence of tokens; for example, Lemma and POS (Roth and Zelenko, 1998).
• Span: labels for spans of text; we instantiated Shallow-Parse (Punyakanok and Roth, 2001), Quantities (Roy et al., 2015), and NER (Ratinov and Roth, 2009; Redman et al., 2016).
• Tree: a tree representation connecting spans of text as nodes; for this we used Dependency parses of Chang et al. (2015).
• Cluster: spans of text clustered into groups. An example is Coreference (Lee et al., 2011).
• PredArg: for this view we used Verb-SRL and Nom-SRL (Punyakanok et al., 2008; Roth and Lapata, 2016), Prep-SRL (Srikumar and Roth, 2013), and Comma-SRL (Arivazhagan et al., 2016).

Given these SemanticGraph generators, we have the question, answers, and paragraph represented as collections of graphs. Given the instance graphs, creating the augmented graph is done implicitly as part of an optimization problem in the next step.
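To make this representation concrete, the following minimal Python sketch packages a text snippet as a collection of SemanticGraphs. The class names, the tuple-based graph encoding, and the toy PredArg frame are illustrative assumptions made here; the actual implementation uses CogCompNLP's view data structures rather than anything shown below.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SemanticGraph:
    """One view over a piece of text: nodes are text spans, edges connect them."""
    family: str                       # e.g., "Sequence", "Span", "Tree", "Cluster", "PredArg"
    nodes: List[str]                  # node labels / covered spans
    edges: List[Tuple[int, int]] = field(default_factory=list)  # indices into `nodes`

@dataclass
class Knowledge:
    """K(T): the raw text plus all semantic graphs attached to it."""
    text: str
    graphs: List[SemanticGraph] = field(default_factory=list)

def build_knowledge(text: str) -> Knowledge:
    # Hypothetical annotators; a real system would call SRL, NER, coreference, etc.
    k = Knowledge(text)
    tokens = text.split()
    # Sequence view: one node per token, chained left to right (labels omitted for brevity).
    k.graphs.append(SemanticGraph("Sequence", nodes=tokens,
                                  edges=[(i, i + 1) for i in range(len(tokens) - 1)]))
    # PredArg view: a single toy frame connecting a predicate to two arguments.
    k.graphs.append(SemanticGraph("PredArg",
                                  nodes=["respond", "animals", "to a sudden drop"],
                                  edges=[(0, 1), (0, 2)]))
    return k

if __name__ == "__main__":
    K_P = build_knowledge("Animals respond to a sudden drop in temperature by shivering .")
    print([g.family for g in K_P.graphs])
```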
We introduce our treatment of QA as an optimal subgraph selection problem over knowledge. We treat question answering as the task of finding the best support in the knowledge snippet for a given question and answer pair, measured in terms of the strength of a "support graph" defined as follows.

The inputs to the QA system are a question K(Q), the set of answers {K(A_i)}, and a knowledge snippet K(P). (For simplicity, from now on we drop "knowledge"; e.g., instead of saying "question knowledge", we say "question".) Given such representations, we form a reasoning problem, formulated as an optimization problem, searching for a "support graph" over these representations. We define the instance graph I = I(Q, {A_i}, P) as the union of knowledge graphs: $I \triangleq K(Q) \cup (\cup_i K(A_i)) \cup K(P)$. Intuitively, we would like the support graph to be connected, and to include nodes from the question, the answer option, and the knowledge. Since each SemanticGraph is composed of many disjoint sub-graphs, we define an augmented graph $I^+$ to model a bigger structure over the instance graph I. Essentially we augment the instance graph and weight the new edges. Define a scoring function $f(v_1, v_2)$ that labels a pair of nodes $v_1$ and $v_2$ with a score representing their phrase-level entailment or similarity, and define $K(T_1) \otimes K(T_2) \triangleq \bigcup_{(g_1, g_2) \in K(T_1) \times K(T_2)} v(g_1) \times v(g_2)$, where $v(g_1) \times v(g_2) = \{(v, w) : v \in v(g_1), w \in v(g_2)\}$.

Definition 2. An augmented graph $I^+$, for a question Q, answers {A_i}, and knowledge P, is defined with the following properties:
1. Nodes: $v(I^+) = v(I(Q, \{A_i\}, P))$.
2. Edges: $e(I^+) = e(I) \cup [K(Q) \otimes K(P)] \cup [\cup_i K(P) \otimes K(A_i)]$.
3. Edge weights, for any $e \in I^+$:
• If $e \notin I$, the edge connects two nodes in different connected components: $\forall e = (v_1, v_2) \notin I: w(e) = f(v_1, v_2)$.
• If $e \in I$, the edge belongs to a connected component, and the edge weight carries information about the reliability of the SemanticGraph and the semantics of the two nodes: $\forall g \in I, \forall e \in g: w(e) = f'(e, g)$.

Sem. Graph      Property
PredArg         Use at least (a) a predicate and its argument, or (b) two arguments
Cluster         Use at least two nodes
Tree            Use two nodes with distance less than k
SpanLabelView   Use at least k nodes

Table 10: Minimum requirements for using each family of graphs. Each graph connected component (e.g., a PredArg frame, or a Coreference chain) cannot be used unless the above-mentioned condition is satisfied.

Next, we define support graphs, the set of graphs that support the reasoning of a question. For this we apply structured constraints on the augmented graph.
Definition 3. A support graph G = G(Q, {A_i}, P) for a question Q, answer options {A_i}, and paragraph P, is a subgraph (V, E) of $I^+$ with the following properties:
1. G is connected.
2. G has an intersection with the question, the knowledge, and exactly one answer candidate: $G \cap K(Q) \neq \emptyset$, $G \cap K(P) \neq \emptyset$, $\exists! \, i: G \cap K(A_i) \neq \emptyset$. (Here $\exists!$ denotes the uniqueness quantifier, meaning "there exists one and only one".)
3. G satisfies the structural properties of each connected component, as summarized in Table 10.

Definition 3 characterizes what we call a potential solution to a question. A given question and paragraph give rise to a large number of possible support graphs. We define the space of feasible support graphs as $\mathcal{G}$ (i.e., all the graphs that satisfy Definition 3, for a given (Q, {A_i}, P)). To rank various feasible support graphs in such a large space, we define a scoring function score(G) as:

$$\text{score}(G) = \sum_{v \in v(G)} w(v) + \sum_{e \in e(G)} w(e) - \sum_{c \in \mathcal{C}} w_c \, \mathbb{1}\{c \text{ is violated}\} \qquad (4.1)$$

where $\mathcal{C}$ is a set of soft constraints (preferences). When a constraint c is violated, denoted by the indicator function $\mathbb{1}\{c \text{ is violated}\}$ in Eq. (4.1), we penalize the objective value by some fixed amount $w_c$. The second term is meant to bring more sparsity to the desired solutions, just like how regularization terms act in machine learning models (Natarajan, 1995). The first term is the sum of the weights we defined when constructing the augmented graph, and is meant to give more weight to solutions that have better and more reliable alignments between their nodes. The role of the inference process is to choose the "best" one under our notion of desirable support graphs:

$$G^* = \arg\max_{G \in \mathcal{G}} \text{score}(G) \qquad (4.2)$$
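To make the search in Eq. (4.2) concrete, here is a heavily simplified sketch of a support-graph ILP using the PuLP library. The toy graph, the node and edge weights, and the relaxed treatment of connectivity are illustrative assumptions; the actual SemanticILP model has a much richer constraint set and is solved with SCIP rather than PuLP's bundled solver.

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, value

# Toy augmented graph: node id -> (part, weight); part is "Q", "P", or an answer id.
nodes = {"q1": ("Q", 0.5), "q2": ("Q", 0.3),
         "p1": ("P", 0.6), "p2": ("P", 0.4),
         "a1": ("A1", 0.7), "a2": ("A2", 0.2)}
# Weighted alignment edges (question-paragraph and paragraph-answer pairs).
edges = {("q1", "p1"): 0.9, ("q2", "p2"): 0.4, ("p1", "a1"): 0.8, ("p2", "a2"): 0.3}

prob = LpProblem("support_graph", LpMaximize)
x = {n: LpVariable(f"node_{n}", cat=LpBinary) for n in nodes}           # node used?
y = {e: LpVariable(f"edge_{e[0]}_{e[1]}", cat=LpBinary) for e in edges}  # edge used?
z = {a: LpVariable(f"opt_{a}", cat=LpBinary) for a in ("A1", "A2")}      # option chosen?

# Objective: the first two terms of Eq. (4.1); soft-preference penalties are omitted here.
prob += lpSum(nodes[n][1] * x[n] for n in nodes) + lpSum(w * y[e] for e, w in edges.items())

# An edge can be used only if both of its endpoints are used.
for (u, v), var in y.items():
    prob += var <= x[u]
    prob += var <= x[v]

# Touch the question and the paragraph, and exactly one answer option.
prob += lpSum(x[n] for n, (part, _) in nodes.items() if part == "Q") >= 1
prob += lpSum(x[n] for n, (part, _) in nodes.items() if part == "P") >= 1
for a in z:
    for n, (part, _) in nodes.items():
        if part == a:
            prob += x[n] <= z[a]   # an option's nodes may be used only if that option is chosen
prob += lpSum(z.values()) == 1

prob.solve()
chosen = [a for a in z if value(z[a]) > 0.5]
print("chosen option:", chosen, "objective:", value(prob.objective))
```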
Our QA system, SemanticILP, models the above support graph search of Eq. (4.2) as an ILP optimization problem, i.e., as maximizing a linear objective function over a finite set of variables, subject to a set of linear inequality constraints. A summary of the model is given below. The augmented graph is not created explicitly; instead, it is encoded implicitly: the nodes and edges of the augmented graph are represented as a set of binary variables, whose values reflect whether a node or an edge is used in the optimal graph G*. The properties listed in Table 10 are implemented as weighted linear constraints using the variables defined for the nodes and edges.

As mentioned, edge weights in the augmented graph come from a function, f, which captures (soft) phrasal entailment between question and paragraph nodes, or paragraph and answer nodes, to account for lexical variability. In our evaluations, we use two types of f. (a) Similar to Khashabi et al. (2016), we use a WordNet-based (Miller, 1995) function to score word-to-word alignments, and use this as a building block to compute a phrase-level alignment score as the weighted sum of word-level alignment scores. Word-level scores are computed using WordNet's hypernym and synonym relations, and weighted using relevant word-sense frequency. f for similarity (as opposed to entailment) is taken to be the average of the entailment scores in both directions. (b) For longer phrasal alignments (e.g., when aligning phrasal verbs) we use the Paragram system of Wieting et al. (2015).

- Number of sentences used is more than k
- Active edges connected to each chunk of the answer option, more than k
- More than k chunks in the active answer option
- More than k edges to each question constituent
- Number of active question terms
- If using PredArg of K(Q), at least one argument should be used
- If using PredArg (Verb-SRL) of K(Q), at least one predicate should be used

Table 11: The set of preference functions in the objective.

The final optimization is done on Eq. (4.1). The first part of the objective is the sum of the weights of the sub-graph, which is what the ILP maximizes, since the nodes and edges are modeled as variables in the ILP. The second part of Eq. (4.1) contains a set of preferences C, summarized in Table 11, meant to apply soft structural properties that partly depend on the knowledge instantiation. These preferences are soft in the sense that they are applied with a weight to the overall scoring function (as compared to a hard constraint). For each preference function c there is an associated binary or integer variable with weight w_c, and we create appropriate constraints to simulate the corresponding behavior. We note that the ILP objective and constraints are not tied to the particular domain of evaluation; they represent general properties that capture what constitutes a well-supported answer for a given question.

We evaluate on two domains that differ in the nature of the supporting text (concatenated individual sentences vs. a coherent paragraph), the underlying reasoning, and the way questions are framed.
We show that SemanticILP outperforms a variety of baselines, including retrieval-based methods, neural networks, structured systems, and the current best system for each domain. These datasets and systems are described next, followed by results.
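As a rough illustration of the WordNet-based lexical alignment function f described above, the following sketch scores word pairs via synonym and hypernym relations and aggregates them into directed phrase scores. The relation weights and the max-based aggregation are assumptions made for illustration only; the actual function additionally uses word-sense frequencies and the Paragram model for longer phrases.

```python
from nltk.corpus import wordnet as wn   # requires the WordNet corpus (nltk.download("wordnet"))

def word_entailment(w1: str, w2: str) -> float:
    """Crude word-level entailment score: exact match > synonymy > hypernymy > none."""
    if w1 == w2:
        return 1.0
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if any(a == b for a in s1 for b in s2):
        return 0.9                       # the two words share a synset (synonyms)
    hypers = {h for s in s1 for h in s.hypernyms()}
    if any(b in hypers for b in s2):
        return 0.7                       # w2 is a hypernym (generalization) of w1
    return 0.0

def phrase_alignment(p1: str, p2: str) -> float:
    """Directed phrase score: average, over words of p1, of their best alignment into p2."""
    ws1, ws2 = p1.lower().split(), p2.lower().split()
    if not ws1 or not ws2:
        return 0.0
    return sum(max(word_entailment(a, b) for b in ws2) for a in ws1) / len(ws1)

def phrase_similarity(p1: str, p2: str) -> float:
    """Similarity (as opposed to entailment): average of the two directed scores."""
    return 0.5 * (phrase_alignment(p1, p2) + phrase_alignment(p2, p1))

if __name__ == "__main__":
    print(phrase_similarity("sudden drop in temperature", "temperature decreases quickly"))
```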
For the first domain, we have a collection of question sets containing elementary-level sci-ence questions from standardized tests (Clark et al., 2016; Khot et al., 2017). Specifically,
Regents 4th contains all non-diagram multiple choice questions from 6 years of NY Re-gents 4th grade science exams (127 train questions, 129 test).
Regents 8th similarlycontains 8th grade questions (147 train, 144 test). The corresponding expanded datasetsare
AI2Public 4th (432 train, 339 test) and
AI2Public 8th (293 train, 282 test). For the second domain, we use the
ProcessBank dataset for the reading comprehension task proposed by Berant et al. (2014). It contains paragraphs about biological processes and two-way multiple-choice questions about them. We used a broad subset of this dataset that asks about events or about an argument that depends on another event or argument. (These are referred to as "dependency questions" by Berant et al. (2014), and cover around 70% of all questions; the data is available at https://nlp.stanford.edu/software/bioprocess. The science questions are from AI2 Science Questions V1, at http://data.allenai.org/ai2-science-questions.) The resulting dataset has 293 train and 109 test questions, based on 147 biology paragraphs.

Test scores are reported as percentages. For each question, a system gets a score of 1 if it chooses the correct answer, 1/k if it reports a k-way tie that includes the correct answer, and 0 otherwise.

We consider a variety of baselines, including the best system for each domain.

IR (information retrieval baseline). We use the IR solver from Clark et al. (2016), which selects the answer option that has the best matching sentence in a corpus. The sentence is forced to have a non-stopword overlap with both q and a.

SemanticILP (our approach). Given the input instance (question, answer options, and a paragraph), we invoke various NLP modules to extract semantic graphs. We then generate an ILP and solve it using the open-source SCIP engine (Achterberg, 2009), returning the active answer option a_m from the optimal solution found. To check for ties, we disable a_m, re-solve the ILP, and compare the score of the second-best answer, if any, with that of the best score.

For the science question sets, where we don't have any paragraphs attached to each question, we create a passage by using the above IR solver to retrieve scored sentences for each answer option and then combining the top 8 unique sentences (across all answer options) to form a paragraph.

While the sub-graph optimization can be done over the entire augmented graph in one shot, our current implementation uses multiple simplified solvers, each performing reasoning over augmented graphs for a commonly occurring annotator combination, as listed in Table 12. For all of these annotator combinations, we let the representation of the answers be K(A) = {Shallow-Parse, Tokens}. Importantly, our choice of working with a few annotator combinations is mainly for simplicity of implementation and suffices to demonstrate that reasoning over even just two annotators at a time can be surprisingly powerful. There is no fundamental limitation in implementing SemanticILP using one single optimization problem as stated in Eq. (4.2).

Each simplified solver associated with an annotator combination in Table 12 produces a confidence score for each answer option. We create an ensemble of these solvers as a linear combination of these scores, with weights trained using the union of training data from all question sets.
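The paragraph-construction step for the science questions can be sketched as follows. The ir_retrieve callable is a hypothetical stand-in for the IR solver's scored sentence retrieval; only the pooling of the top 8 unique sentences mirrors the description above.

```python
from typing import Callable, List, Tuple

def build_passage(question: str,
                  options: List[str],
                  ir_retrieve: Callable[[str, str], List[Tuple[str, float]]],
                  max_sentences: int = 8) -> str:
    """Pool IR-scored sentences across all answer options and keep the top unique ones."""
    scored = []
    for opt in options:
        # ir_retrieve returns (sentence, score) pairs for the (question, option) query.
        scored.extend(ir_retrieve(question, opt))
    scored.sort(key=lambda pair: pair[1], reverse=True)

    passage, seen = [], set()
    for sentence, _score in scored:
        if sentence not in seen:
            seen.add(sentence)
            passage.append(sentence)
        if len(passage) == max_sentences:
            break
    return " ".join(passage)
```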
BiDAF (neural network baseline). We use the recent deep learning reading comprehension model of Seo et al. (2016), which is one of the top performing systems on the SQuAD dataset and has been shown to generalize to another domain as well (Min et al., 2017).

Combination   Representation
Comb-1        K(Q) = {Shallow-Parse, Tokens};  K(P) = {Shallow-Parse, Tokens, Dependency}
Comb-2        K(Q) = {Verb-SRL, Shallow-Parse};  K(P) = {Verb-SRL}
Comb-3        K(Q) = {Verb-SRL, Shallow-Parse};  K(P) = {Verb-SRL, Coreference}
Comb-4        K(Q) = {Verb-SRL, Shallow-Parse};  K(P) = {Comma-SRL}
Comb-5        K(Q) = {Verb-SRL, Shallow-Parse};  K(P) = {Prep-SRL}

Table 12: The semantic annotator combinations used in our implementation of SemanticILP.

Since
BiDAF was designed for fill-in-the-blank style questions, we follow the variation usedby Kembhavi et al. (2017) to apply it to our multiple-choice setting. Specifically, we comparethe predicted answer span to each answer candidate and report the one with the highestsimilarity.We use two variants: the original system,
BiDAF , pre-trained on 100,000+ SQuAD ques-tions, as well as an extended version,
BiDAF’ , obtained by performing continuous trainingto fine-tune the SQuAD-trained parameters using our (smaller) training sets. For the latter,we convert multiple-choice questions into reading comprehension questions by generatingall possible text-spans within sentences, with token-length at most correct answer length +2 , and choose the ones with the highest similarity score with the correct answer. We usethe
AllenNLP re-implementation of
BiDAF, train it on SQuAD, and then continue training it on our dataset. We tried different variations (epochs and learning rates) and selected the model which gives the best average score across all the datasets. As we will see, the variant that was further trained on our data often gives better results. (The AllenNLP re-implementation is available at: https://github.com/allenai/allennlp.)

TupleInf (semi-structured inference baseline). Recently proposed by Khot et al. (2017), this is a state-of-the-art system designed for science questions. It uses Open IE (Banko et al., 2007) tuples derived from the text as the knowledge representation, and performs reasoning over it via an ILP. It has access to a large knowledge base of Open IE tuples.
Proread and
SyntProx . Proread is a specialized and best performing system on the
ProcessBank question set. Berant et al. (2014) annotated the training data with eventsand event relations, and trained a system to extract the process structure. Given a question,
Proread converts it into a query (using regular expression patterns and keywords) andexecutes it on the process structure as the knowledge base. Its reliance on a question-dependent query generator and on a process structure extractor makes it difficult to applyto other domains.
SyntProx is another solver suggested by Berant et al. (2014). It aligns content-word lemmas in both the question and the answer against the paragraph, and selects the answer tokens that are closer to the aligned tokens of the question. The distance is measured using dependency tree edges. To support multiple sentences, they connect roots of adjacent sentences with bidirectional edges.
We evaluate various QA systems on datasets from the two domains. The results are summarized below, followed by some insights into SemanticILP's behavior and an error analysis.
Science Exams.
The results of experimenting on different grades’ science exams are sum-marized in Table 13, which shows the exam scores as a percentage. The table demonstratesthat
SemanticILP consistently outperforms the best baselines in each case by 2%-6%.Further, there is no absolute winner among the baselines; while IR is good on the 8th gradequestions,
TupleInf and
BiDAF’ are better on 4th grade questions. This highlights thediffering nature of questions for different grades.
Biology Exam.
The results on the
ProcessBank dataset are summarized in Table 14.While
SemanticILP's performance is substantially better than most baselines and close to that of Proread, it is important to note that this latter baseline enjoys additional supervision of domain-specific event annotations. This, unlike our other relatively general baselines, makes it limited to this dataset, which is also why we don't include it in Table 13. We evaluate IR on this reading comprehension dataset by creating an ElasticSearch index containing the sentences of the knowledge paragraphs.

Table 13: Science test scores as a percentage for BiDAF, BiDAF', IR, TupleInf, and SemanticILP on Regents 4th, AI2Public 4th, Regents 8th, and AI2Public 8th. On elementary-level science exams, SemanticILP consistently outperforms baselines. In each row, the best score is in bold and the best baseline is italicized.
Table 14: Biology test scores as a percentage, for Proread, SyntProx, IR, BiDAF, BiDAF', and SemanticILP. SemanticILP outperforms various baselines on the ProcessBank dataset and roughly matches the specialized best method.
For some insight into the results, we include a brief analysis of our system’s output comparedto that of other systems.We identify a few main reasons for
SemanticILP ’s errors. Not surprisingly, some mistakes(see the appendix figure of Khashabi et al. (2018b) for an example) can be traced back tofailures in generating proper annotation (
SemanticGraph). Improvements in SRL modules or redundancy can help address this. Some mistakes are from the current ILP model not supporting the ideal reasoning, i.e., the requisite knowledge exists in the annotations, but the reasoning fails to exploit it. Another group of mistakes is due to the complexity of the sentences, and the system lacking a way to represent the underlying phenomena with our current annotators.

A weakness (that doesn't seem to be particular to our solver) is the reliance on explicit mentions. If there is a meaning indirectly implied by the context and our annotators are not able to capture it, our solver will miss such questions. There will be more room for improvement on such questions with the development of discourse analysis systems. When solving the questions that don't have an attached paragraph, relevant sentences need to be fetched from a corpus. A subset of mistakes on this dataset occurs because the extracted knowledge does not contain the correct answer.
ILP Solution Properties.

Our system is implemented using many constraints, which are instantiated as linear inequalities for each input instance; hence the number of variables and inequalities differs across input instances. There is also an overhead for pre-processing an input instance and converting it into an instance graph. In the timing analysis below we ignore annotation time, since the annotators are black boxes outside our solver. Table 15 summarizes various ILP and support graph statistics for SemanticILP, averaged across ProcessBank questions. Next to SemanticILP we include numbers from TableILP, which has similar implementation machinery but operates on a very different representation. While the size of the model is a function of the input instance, on average SemanticILP tends to have a bigger model (number of constraints and variables). Model creation is significantly more time-consuming in SemanticILP, as it involves many graph traversal operations and jumps between nodes and edges. We also provide timing statistics for TupleInf, which takes roughly half the time of TableILP and is therefore faster than SemanticILP.

Table 15: SemanticILP statistics (ILP complexity quantities per category, averaged across questions), as compared to TableILP and TupleInf statistics.

In order to better understand the results, we ablate the contribution of different annotation combinations, dropping each combination from the ensemble model and retraining the ensemble after each drop. The results are summarized in Table 16. While Comb-1 seems to be important for the science tests, it has limited contribution to the biology tests. On the 8th grade exams, the
Verb-SRL and
Comma-SRL -based alignments provide high value. Structured combinations (e.g.,
Verb-SRL -based alignments) are generally more important for the biology domain.
Table 16: Ablation study of SemanticILP components on AI2Public 8th and ProcessBank. The first row shows the overall test score of the full system, while other rows report the change in the score as a result of dropping an individual combination. The combinations are listed in Table 12.

Figure 12: Overlap of the predictions of SemanticILP and IR on 50 randomly-chosen questions from AI2Public 4th.

Complementarity to IR.
Given that inthe science domain the input snippets fedto
SemanticILP are retrieved through aprocess similar to the IR solver, one mightnaturally expect some similarity in the pre-dictions. The pie-chart in Figure 12 showsthe overlap between mistakes and correctpredictions of
SemanticILP and IR on 50 randomly chosen training questions from AI2Public 4th. While there is substantial overlap in questions that both answer correctly (the yellow slice) and both miss (the red slice), there is also a significant number of questions solved by SemanticILP but not IR (the blue slice), almost twice as many as the questions solved by IR but not SemanticILP (the green slice).

Figure 13: Performance change for varying knowledge length.
Cascade Solvers.
In Tables 13 and 14, we presented one single instance of
SemanticILP with state-of-artresults on multiple datasets, where the solver was an ensemble of semantic combinations(presented in Table 12). Here we show a simpler approach that achieves stronger results onindividual datasets, at the cost of losing a little generalization across domains. Specifically,we create two “cascades” (i.e., decision lists) of combinations, where the ordering of com-binations in the cascade is determined by the training set precision of the simplified solverrepresenting an annotator combination (combinations with higher precision appear earlier).One cascade solver targets science exams and the other the biology dataset.The results are reported in Table 17. On the 8th grade data, the cascade solver createdfor science test achieves higher scores than the generic ensemble solver. Similarly, thecascade solver on the biology domain outperforms the ensemble solver on the
ProcessBank dataset.

Table 17: Comparison of test scores of SemanticILP using a generic ensemble vs. domain-targeted cascades of annotation combinations (Cascade(Science) and Cascade(Biology)), on Regents 4th, AI2Public 4th, Regents 8th, AI2Public 8th, and ProcessBank.
Effect of Varying KnowledgeLength.
We analyze the perfor-mance of the system as a func-tion of the length of the paragraphfed into
SemanticILP , for 50 ran-domly selected training questionsfrom the
Regents 4th set. Fig-ure 13 (left) shows the overall sys-tem, for two combinations introduced earlier, as a function of knowledge length, counted asthe number of sentences in the paragraph.As expected, the solver improves with more sentences, until around 12-15 sentences, afterwhich it starts to worsen with the addition of more irrelevant knowledge. While the cascadecombinations did not show much generalization across domains, they have the advantage ofa smaller drop when adding irrelevant knowledge compared to the ensemble solver. This canbe explained by the simplicity of cascading and minimal training compared to the ensembleof annotation combinations.Figure 13 (right) shows the performance of individual combinations as a function of knowl-edge length. It is worth highlighting that while Comb-1 (blue) often achieves higher coverageand good scores in simple paragraphs (e.g., science exams), it is highly sensitive to knowl-edge length. On the other hand, highly-constrained combinations have a more consistentperformance with increasing knowledge length, at the cost of lower coverage.
This chapter extends our abductive reasoning system from Chapter 3 to consume raw textas input knowledge. This is the first system to successfully use a wide range of semanticabstractions to perform a high-level NLP task like Question Answering. The approachis especially suitable for domains that require reasoning over a diverse set of linguistic66onstructs but have limited training data. To address these challenges, we present thefirst system, to the best of our knowledge, that reasons over a wide range of semanticabstractions of the text, which are derived using off-the-shelf, general-purpose, pre-trainednatural language modules such as semantic role labelers. Representing multiple abstractionsas a family of graphs, we translate question answering (QA) into a search for an optimalsubgraph that satisfies certain global and local properties. This formulation generalizesseveral prior structured QA systems. Our system,
SemanticILP, demonstrates strong performance on two domains simultaneously. In particular, on a collection of challenging science QA datasets, it outperforms various state-of-the-art approaches, including neural models, broad-coverage information retrieval, and specialized techniques using structured knowledge bases, by 2%-6%.

A key limitation of the system presented here is that its abstractions are mostly extracted from explicit mentions in a given text. However, a major portion of our understanding is only implied by the text (not directly mentioned). We propose a challenge dataset for such questions (limited to the temporal domain) in Chapter 7. Additionally, the two systems discussed in Chapter 3 and here lack an explicit attention mechanism over the content of the questions. We study this topic in Chapter 5.
CHAPTER 5: Learning Essential Terms in Questions

"The trouble with Artificial Intelligence is that computers don't give a damn - or so I will argue by considering the special case of understanding natural language." — John Haugeland, 1979
Many of today’s QA systems often struggle with seemingly simple questions because theyare unable to reliably identify which question words are redundant, irrelevant, or evenintentionally distracting. This reduces the systems’ precision and results in questionable“reasoning” even when the correct answer is selected among the given alternatives. Thevariability of subject domain and question style makes identifying essential question wordschallenging. Further, essentiality is context dependent—a word like ‘animals’ can be criticalfor one question and distracting for another. Consider the following example:
One way animals usually respond to a sudden drop in temperature is by (A) sweating (B) shivering(C) blinking (D) salivating.
The system we discussed in Chapter 3,
TableILP (Khashabi et al., 2016), which performs reasoning by aligning the question to semi-structured knowledge, aligns only the word 'animals' when answering this question. Not surprisingly, it chooses an incorrect answer. The issue is that it does not recognize that "drop in temperature" is an essential aspect of the question.
Figure 14:
Essentiality scores generated by oursystem, which assigns high essentiality to “drop”and “temperature”.
Towards this goal, we propose a system that can assign an essentiality score to each term in the question. For the above example, our system generates the scores shown in Figure 14, where more weight is put on "temperature" and "drop". (This chapter is based on the following publication: Khashabi et al. (2017).) Our contributions are as follows. (A) We introduce the concept of question term essentiality and release a new dataset of 2,223 crowd-sourced essential-term-annotated questions (19K annotated terms in total) that capture this concept. We illustrate the importance of this concept by demonstrating that humans become substantially worse at QA when even a few essential question terms are dropped. (B) We design a classifier that is effective at predicting question term essentiality. The F1 (0.80) and per-sentence mean average precision (MAP, 0.90) scores of our classifier exceed the closest baselines by 3%-5%. Further, our classifier generalizes substantially better to unseen terms. (C) We show that this classifier can be used to improve a surprisingly effective IR-based QA system (Clark et al., 2016) by 4%-5% on previously used question sets and by 1.2% on a larger question set. We also incorporate the classifier in
TableILP (Khashabi et al.,2016), resulting in fewer errors when sufficient knowledge is present for questions to bemeaningfully answerable.
Our work can be viewed as the study of an intermediate layer in QA systems. Some systems implicitly model and learn it, often via indirect signals from end-to-end training data. For instance, neural-network based models (Wang et al., 2016; Tymoshenko et al., 2016; Yin et al., 2016) implicitly compute some kind of attention. While this is intuitively meant to weigh key words in the question more heavily, this aspect hasn't been systematically evaluated, in part due to the lack of ground-truth annotations. There is related work on extracting question type information (Li and Roth, 2002; Li et al.) as well as on identifying question focus words; such rule-based systems incorporate grammatical structure, answer types, etc. We take a different approach by learning a supervised model using a new annotated dataset. (Our annotated dataset and classifier are available at https://github.com/allenai/essential-terms.)
In this section, we introduce the notion of essential question terms, present a dataset annotated with these terms, and describe two experimental studies that illustrate the importance of this notion—we show that when dropping terms from questions, humans' performance degrades significantly faster if the dropped terms are essential question terms.

Given a question q, we consider each non-stopword token in q as a candidate for being an essential question term. Precisely defining what is essential and what isn't is not an easy task and involves some level of inherent subjectivity. We specified three broad criteria: 1) altering an essential term should change the intended meaning of q, 2) dropping non-essential terms should not change the correct answer for q, and 3) grammatical correctness is not important. We found that given these relatively simple criteria, human annotators had a surprisingly high agreement when annotating elementary-level science questions. Next we discuss the specifics of the crowd-sourcing task and the resulting dataset.

We collected 2,223 elementary school science exam questions for the annotation of essential terms. This set includes the questions used by Clark et al. (2016) and additional ones obtained from other public resources such as the Internet or textbooks. For each of these questions, we asked crowd workers to annotate essential question terms based on the above criteria as well as a few examples of essential and non-essential terms. Figure 15 depicts the annotation interface. (We use Amazon Mechanical Turk for crowd-sourcing.)

The questions were annotated by 5 crowd workers, resulting in 19,380 annotated terms. (A few invalid annotations resulted in about 1% of the questions receiving fewer annotations: 2,199 questions received at least 5 annotations (79 received 10 annotations due to unintended question repetition), 21 received 4 annotations, and 4 received 3 annotations.) The Fleiss' kappa statistic (Fleiss, 1971) for this task was κ = 0.58, indicating a level of inter-annotator agreement very close to 'substantial'. In particular, all workers agreed on 36.5% of the terms and at least 4 agreed on 69.9% of the terms. We use the proportion of workers that marked a term as essential as its annotated essentiality score.

On average, less than one-third (29.9%) of the terms in each question were marked as essential. The proportion was much higher for scientific terms (such as precipitation and gravity): 76.6% of such terms occurring in questions were marked as essential. (We use 9,144 science terms from Khashabi et al. (2016).)

In summary, we have a term-essentiality-annotated dataset of 2,223 questions. We split this
Here we report a second crowd-sourcing experiment that validates our hypothesis thatthe question terms marked above as essential are, in fact, essential for understanding andanswering the questions. Specifically, we ask:
Is the question still answerable by a human ifa fraction of the essential question terms are eliminated?
For instance, the sample questionin the introduction is unanswerable when “drop” and “temperature” are removed from thequestion:
One way animals usually respond to a sudden * in * is by ?
Figure 16: Crowd-sourcing interface for verifying the validity of essentiality annotations generated by the first task. Annotators are asked to answer, if possible, questions with a group of terms dropped.

To this end, we consider both the annotated essentiality scores as well as the score produced by our trained classifier (to be presented in Section 5.3). We first generate candidate sets of terms to eliminate using these essentiality scores based on a threshold ξ: (a) the essential set, terms with score ≥ ξ; and (b) the non-essential set, terms with score < ξ. We then ask crowd workers to try to answer a question after replacing each candidate set of terms with "***". In addition to the four original answer options, we now also include "I don't know. The information is not enough" (cf. Figure 16 for the user interface). For each value of ξ, we obtain 5 ×
269 annotations for 269 questions. We measure how often the workers feel there is sufficient information to attempt the question and, when they do attempt, how often they choose the right answer.
Figure 17:
The relationship between the fractionof question words dropped and the fraction of thequestions attempted (fraction of the questions workersfelt comfortable answering). Dropping most essentialterms (blue lines) results in very few questions remain-ing answerable, while least essential terms (red lines)allows most questions to still be answerable. Solidlines indicate human annotation scores while dashedlines indicate predicted scores.
Each value of ξ results in some fraction of terms being dropped from a question; the exact number depends on the question and on whether we use annotated scores or our classifier's scores. In Figure 17, we plot the average fraction of terms dropped on the horizontal axis and the corresponding fraction of questions attempted on the vertical axis. Solid lines indicate annotated scores and dashed lines indicate classifier scores. Blue lines (bottom left) illustrate the effect of eliminating essential sets, while red lines (top right) reflect eliminating non-essential sets. (It is also possible to directly collect essential term groups using this task. However, collecting such sets of essential terms would be substantially more expensive, as one must iterate over exponentially many subsets rather than the linear number of terms used in our annotation scheme.)

We make two observations. First, the solid blue line (bottom-left) demonstrates that dropping even a small fraction of question terms marked as essential dramatically reduces the QA performance of humans; e.g., dropping just 12% of the terms (with high essentiality scores) makes 51% of the questions unanswerable. The solid red line (top-right), on the other hand, shows that dropping terms with low essentiality scores leaves most questions answerable. Second, the dashed lines, which are based on the scores of our ET classifier, are very close to the solid lines based on human annotation. This indicates that our classifier, to be described next, closely captures human intuition.

Given the dataset of questions and their terms annotated with essentiality scores, is it possible to learn the underlying concept? Towards this end, given a question q, answer options a, and a question term q_l, we seek a classifier that predicts whether q_l is essential for answering q. We also extend it to produce an essentiality score et(q_l, q, a) ∈ [0, 1]. (The essentiality score may alternatively be defined as et(q_l, q), independent of the answer options a; this is more suitable for non-multiple-choice questions. Our system uses a only to compute PMI-based statistical association features for the classifier. In our experiments, dropping these features resulted in only a small drop in the classifier's performance.) We use the annotated dataset from Section 5.2, where real-valued essentiality scores are binarized to 1 if they are at least 0.5, and to 0 otherwise. We train a linear SVM classifier (Joachims, 1998), henceforth referred to as
ET classifier. Given the complex nature of the task, the features of this classifier include syntactic (e.g., dependency-parse based) and semantic (e.g., Brown cluster representations of words (Brown et al., 1992), a list of scientific words) properties of question words, as well as their combinations. In total, we use 120 types of features (cf. the appendix of Khashabi et al. (2017)).
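A minimal stand-in for such a term classifier can be assembled with scikit-learn, as sketched below. The three toy features and the use of LinearSVC are illustrative assumptions; they stand in for the roughly 120 feature types and the SVM setup used by the actual ET classifier.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def term_features(term: str, question: str) -> dict:
    """Tiny illustrative feature set for one (term, question) instance."""
    tokens = question.lower().split()
    return {
        "lower=" + term.lower(): 1.0,
        "is_capitalized": float(term[:1].isupper()),
        "rel_position": tokens.index(term.lower()) / max(len(tokens), 1)
        if term.lower() in tokens else 0.0,
    }

# Toy training data: (term, question, essential?) triples.
train = [
    ("temperature", "One way animals usually respond to a sudden drop in temperature is by", 1),
    ("drop",        "One way animals usually respond to a sudden drop in temperature is by", 1),
    ("animals",     "One way animals usually respond to a sudden drop in temperature is by", 0),
    ("way",         "One way animals usually respond to a sudden drop in temperature is by", 0),
]

X = [term_features(t, q) for t, q, _ in train]
y = [label for _, _, label in train]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)

# decision_function yields a real-valued score usable as an essentiality score after scaling.
print(model.decision_function([term_features("drop", train[0][1])]))
```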
Baselines.
To evaluate our approach, we devise a few simple yet relatively powerful baselines.

First, for our supervised baseline, given (q_l, q, a) as before, we ignore q and compute how often q_l is annotated as essential in the entire dataset. In other words, the score for q_l is the proportion of times it was marked as essential in the annotated dataset. If the instance is never observed in training, we choose an arbitrary label as the prediction. We refer to this baseline as the label proportion baseline and create two variants of it:
PropSurf based onsurface string and
PropLem, based on lemmatizing the surface string. For unseen q_l, this baseline makes a random guess with uniform distribution.

Our unsupervised baseline is inspired by work on sentence compression (Clarke and Lapata, 2008) and the PMI solver of Clark et al. (2016), which compute word importance based on co-occurrence statistics in a large corpus. In a corpus C of 280 GB of plain text extracted from Web pages (collected by Charles Clarke at the University of Waterloo, and used previously by Turney (2013)), we identify unigrams, bigrams, trigrams, and skip-bigrams from q and each answer option a_i. For a pair (x, y) of n-grams, their pointwise mutual information (PMI) (Church and Hanks, 1989) in C is defined as $\log \frac{p(x, y)}{p(x)\,p(y)}$, where p(x, y) is the co-occurrence frequency of x and y (within some window) in C. For a given word x, we find all pairs of question n-grams and answer-option n-grams. MaxPMI and
SumPMI score the importance of a word x by max-ing or summing, resp., PMI scores p ( x, y ) across all answer options y for q . A limitation of this baseline is its dependence onthe existence of answer options, while our system makes essentiality predictions independentof the answer options.We note that all of the aforementioned baselines produce real-valued confidence scores (foreach term in the question), which can be turned into binary labels (essential and non-essential) by thresholding at a certain confidence value. We consider two natural evaluation metrics for essentiality detection, first treating it as abinary prediction task at the level of individual terms and then as a task of ranking termswithin each question by the degree of essentiality.
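The PMI-based baselines described above can be sketched as follows. The toy three-sentence corpus, sentence-level co-occurrence counting, and unigram-only matching are simplifications relative to the actual setup, which uses a large Web corpus, windowed co-occurrence, and n-grams up to skip-bigrams.

```python
import math
from collections import Counter
from itertools import combinations

corpus = [
    "animals shiver when the temperature drops",
    "sweating cools the body when temperature rises",
    "a sudden drop in temperature makes animals shiver",
]

# Unigram and sentence-level co-occurrence counts over the toy corpus.
uni, co, n_sents = Counter(), Counter(), len(corpus)
for sent in corpus:
    words = set(sent.split())
    uni.update(words)
    co.update(frozenset(p) for p in combinations(sorted(words), 2))

def pmi(x: str, y: str) -> float:
    p_xy = co[frozenset((x, y))] / n_sents
    p_x, p_y = uni[x] / n_sents, uni[y] / n_sents
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 and p_x > 0 and p_y > 0 else 0.0

def max_pmi(word: str, options: list) -> float:
    return max((pmi(word, o) for o in options), default=0.0)

def sum_pmi(word: str, options: list) -> float:
    return sum(pmi(word, o) for o in options)

if __name__ == "__main__":
    options = ["sweating", "shiver"]
    for w in ["temperature", "animals", "way"]:
        print(w, round(max_pmi(w, options), 3), round(sum_pmi(w, options), 3))
```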
Binary Classification of Terms.
We consider all question terms pooled together as described in Section 5.2.1, resulting in a dataset of 19,380 terms annotated (in the context of the corresponding question) independently as essential or not. The ET classifier is trained on the train subset, and the threshold is tuned using the dev subset.

Table 18: Effectiveness of various methods (MaxPMI†, SumPMI†, PropSurf, PropLem, and the ET classifier) for identifying essential question terms in the test set, including area under the PR curve (AUC), accuracy (Acc), precision (P), recall (R), and F1 score. The ET classifier (AUC 0.79, Acc 0.75, P 0.91, R 0.71, F1 0.80) substantially outperforms all supervised and unsupervised (denoted with †) baselines.

For each term in the corresponding test set of 4,124 instances, we use various methods to predict whether the term is essential (for the corresponding question) or not. Table 18 summarizes the resulting performance. For the threshold-based scores, each method was tuned to maximize the F1 score based on the dev set. The ET classifier achieves an F1 score of 0.80, which is 5%-14% higher than the baselines. Its accuracy of 0.75 is statistically significantly better than all baselines based on the binomial exact test (Howell, 2012) at a p-value of 0.05. (Each test term prediction is assumed to be a binomial.)
Figure 18: Precision-recall trade-off for various classifiers as the threshold is varied. The ET classifier (green) is significantly better throughout.

As noted earlier, each of these essentiality identification methods is parameterized by a threshold for balancing precision and recall. This allows them to be tuned for end-to-end performance of the downstream task. We use this feature later when incorporating the ET classifier in QA systems. Figure 18 depicts the PR curves for various methods as the threshold is varied, highlighting that the ET classifier performs reliably at various recall points. Its precision, when tuned to optimize F1, is 0.91, which is very suitable for high-precision applications. It has a 5% higher AUC (area under the curve) and outperforms baselines by roughly 5% throughout the precision-recall spectrum.

As a second study, we assess how well our classifier generalizes to unseen terms. For this, we consider only the 559 test terms that do not appear in the train set. Table 19 provides the resulting performance metrics. We see that the frequency-based supervised baselines, having never seen the test terms, stay close to the default precision of 0.5. The unsupervised baselines, by nature, generalize much better but are substantially dominated by our ET classifier, which achieves an F1 score of 78%. This is only 2% below its own F1 across all seen and unseen terms, and 6% higher than the second-best baseline.

Table 19: Generalization to unseen terms: effectiveness of various methods (MaxPMI†, SumPMI†, PropSurf, PropLem, and the ET classifier), using the same metrics as in Table 18. As expected, supervised methods perform poorly, similar to a random baseline. Unsupervised methods generalize well, but the ET classifier (AUC 0.78, Acc 0.71, P 0.88, R 0.71, F1 0.78) again substantially outperforms them.

Table 20: Effectiveness of various methods for ranking the terms in a question by essentiality; † indicates unsupervised methods. Mean Average Precision (MAP) numbers reflect the mean (across all test-set questions) of the average precision of the term ranking for each question. The ET classifier (MAP 0.90) again substantially outperforms all baselines.

Ranking Question Terms by Essentiality.
Next, we investigate the performance of the ET classifier as a system that ranks all terms within a question in the order of essentiality. Thus, unlike the previous evaluation that pools terms together across questions, we now consider each question as a unit. For the ranked list produced by each classifier for each question, we compute the average precision: we rank all terms within a question based on their essentiality scores; for any true-positive instance at rank k, the precision at k is defined to be the number of positive instances with rank no more than k, divided by k; and the average of all these precision values for the ranked list of the question is the average precision. (In all our other experiments, test and train questions are always distinct but may have some terms in common.) We then take the mean of these AP values across questions to obtain the mean average precision (MAP) score for the classifier. The results for the test set (483 questions) are shown in Table 20. Our ET classifier achieves a MAP of 90.2%, which is 3%-5% higher than the baselines, and demonstrates that one can learn to reliably identify essential question terms.

ET Classifier in QA Solvers
In order to assess the utility of our ET classifier, we investigate its impact on two end-to-endQA systems. We start with a brief description of the question sets. Question Sets.
We use three question sets of 4-way multiple choice questions. Re-gents and
AI2Public are two publicly available elementary-school science question sets, available at http://allenai.org/data.html.
Regents comes with 127 training and 129 test questions;
AI2Public contains 432 train-ing and 339 test questions that subsume the smaller question sets used previously (Clarket al., 2016; Khashabi et al., 2016).
RegtsPertd set, introduced by Khashabi et al. (2016),has 1,080 questions obtained by automatically perturbing incorrect answer choices for 108New York Regents 4th grade science questions. We split this into 700 train and 380 testquestions.For each question, a solver gets a score of 1 if it chooses the correct answer and 1 /k if itreports a k -way tie that includes the correct answer. QA Systems.
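The tie-aware scoring rule used throughout this evaluation is a one-line computation; the set-valued prediction interface below is an assumption made purely for illustration.

```python
from typing import Set

def question_score(predicted: Set[str], correct: str) -> float:
    """1 for a unique correct choice, 1/k for a k-way tie containing it, 0 otherwise."""
    if not predicted or correct not in predicted:
        return 0.0
    return 1.0 / len(predicted)

def solver_score(predictions, golds) -> float:
    """Average percentage score over a question set."""
    return 100.0 * sum(question_score(p, g) for p, g in zip(predictions, golds)) / len(golds)

# Example: one unique correct answer, one 2-way tie containing the answer, one miss.
print(solver_score([{"B"}, {"A", "C"}, {"D"}], ["B", "A", "B"]))  # -> 50.0
```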
We investigate the impact of adding the ET classifier to two state-of-the-art QA systems for elementary-level science questions. Let q be a multiple-choice question with answer options {a_i}. The IR Solver from Clark et al. (2016) searches a corpus, for each a_i, for the sentence that best matches the (q, a_i) pair. It then selects the answer option for which the match score is the highest. The inference-based TableILP solver from Khashabi et al. (2016), on the other hand, performs QA by treating it as an optimization problem over a semi-structured knowledge base derived from text. It is designed to answer questions requiring multi-step inference and a combination of multiple facts.

For each multiple-choice question (q, a), we use the ET classifier to obtain an essential term score s_l for each token q_l in q: s_l = et(q_l, q, a). We will be interested in the subset ω of all terms T_q in q with essentiality score above a threshold ξ: $\omega(\xi; q) = \{l \in T_q \mid s_l > \xi\}$. Let $\bar{\omega}(\xi; q) = T_q \setminus \omega(\xi; q)$. For brevity, we will write ω(ξ) when q is implicit.

IR solver + ET

To incorporate the ET classifier, we create a parameterized IR system called IR + ET(ξ) where, instead of querying a (q, a_i) pair, we query (ω(ξ; q), a_i). While IR solvers are generally easy to implement and are used in popular QA systems with surprisingly good performance, they are often also sensitive to the nature of the questions they receive. Khashabi et al. (2016) demonstrated that a minor perturbation of the questions, as embodied in the RegtsPertd question set, dramatically reduces the performance of IR solvers. Since the perturbation involved the introduction of distracting incorrect answer options, we hypothesize that a system with better knowledge of what's important in the question will demonstrate increased robustness to such perturbation.
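The query-filtering step of IR + ET(ξ) amounts to keeping only the terms in ω(ξ; q) before issuing the retrieval query. A minimal sketch follows; the essentiality scores are assumed to come from the ET classifier, and the ir_query callable is a hypothetical stand-in for the IR solver's scoring of a (query, option) pair.

```python
from typing import Callable, Dict, List

def omega(question_terms: List[str], scores: Dict[str, float], xi: float) -> List[str]:
    """omega(xi; q): the question terms whose essentiality score exceeds the threshold xi."""
    return [t for t in question_terms if scores.get(t, 0.0) > xi]

def ir_plus_et(question_terms: List[str],
               options: List[str],
               scores: Dict[str, float],
               xi: float,
               ir_query: Callable[[str, str], float]) -> str:
    """IR + ET(xi): query (omega(xi; q), a_i) instead of (q, a_i) and pick the best option."""
    filtered = " ".join(omega(question_terms, scores, xi))
    return max(options, key=lambda a: ir_query(filtered, a))

# Toy usage with made-up essentiality scores and a trivial overlap-based "retrieval" score.
def overlap(query: str, option: str) -> float:
    return float(len(set(query.split()) & set(option.split())))

terms = "one way animals usually respond to a sudden drop in temperature is by".split()
et_scores = {"drop": 0.9, "temperature": 0.95, "respond": 0.55, "animals": 0.3}
print(ir_plus_et(terms, ["sweating", "shivering"], et_scores, xi=0.5, ir_query=overlap))
```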
Table 21: Performance of the IR solver without (Basic IR) and with (IR + ET) essential terms on Regents, AI2Public, and RegtsPertd. The numbers are solver scores (%) on the test sets of the three datasets.

Table 21 validates this hypothesis, showing the result of incorporating ET in IR as IR + ET(ξ), where ξ was selected by optimizing end-to-end performance on the training set. We observe a 5% boost in the score on RegtsPertd, showing that incorporating the notion of essentiality makes the system more robust to perturbations. Adding ET to IR also improves its performance on standard test sets. On the larger
AI2Public question set, we see an improvement of 1.2%. On the smaller
Regents set, introducing ET improves the IR solver's score by 1.74%, bringing it close to the state-of-the-art solver, TableILP, which achieves a score of 61.5%. This demonstrates that the notion of essential terms can be fruitfully exploited to improve QA systems.
TableILP solver + ET Our essentiality guided query filtering helped the IR solver find sentences that are morerelevant to the question. However, for
TableILP an added focus on essential terms is ex-pected to help only when the requisite knowledge is present in its relatively small knowledgebase. To remove confounding factors, we focus on questions that are, in fact, answerable.To this end, we consider three (implicit) requirements for
TableILP to demonstrate reli-able behavior: (1) the existence of relevant knowledge, (2) correct alignment between thequestion and the knowledge, and (3) a valid reasoning chain connecting the facts together.Judging this for a question, however, requires a significant manual effort and can only bedone at a small scale.
Question Set.
We consider questions for which the
TableILP solver does have accessto the requisite knowledge and, as judged by a human, a reasoning chain to arrive at thecorrect answer. To reduce manual effort, we collect such questions by starting with thecorrect reasoning chains (‘support graphs’) provided by
TableILP. A human annotator is then asked to paraphrase the corresponding questions or add distracting terms, while maintaining the general meaning of the question. Note that this is done independent of essentiality scores. For instance, the modified question below changes two words in the question without affecting its core intent:

Original question: A fox grows thicker fur as a season changes. This adaptation helps the fox to (A) find food (B) keep warmer (C) grow stronger (D) escape from predators

Generated question: An animal grows thicker hair as a season changes. This adaptation helps to (A) find food (B) keep warmer (C) grow stronger (D) escape from predators
While these generated questions should arguably remain correctly answerable by
TableILP ,we found that this is often not the case. To investigate this, we curate a small dataset Q R with 12 questions (see the Appendix) on each of which, despite having the required knowl-edge and a plausible reasoning chain, TableILP fails.
Modified Solver.
To incorporate question term essentiality in the
TableILP solver while maintaining high recall, we employ a cascade system that starts with a strong essentiality requirement and progressively weakens it.

Following the notation of Chapter 3, let x(q_l) be a binary variable that denotes whether or not the l-th term of the question is used in the final reasoning graph. We enforce that terms with essentiality score above a threshold ξ must be used: x(q_l) = 1, ∀ l ∈ ω(ξ). Let TableILP + ET(ξ) denote the resulting system, which can now be used in a cascading architecture:

TableILP + ET(ξ_1) → TableILP + ET(ξ_2) → . . . → TableILP + ET(ξ_k),

where ξ_1 < ξ_2 < . . . < ξ_k is a sequence of thresholds. Questions unanswered by the first system are delegated to the second, and so on. The cascade has the same recall as TableILP, as long as the last system is the vanilla TableILP. We refer to this configuration as Cascades(ξ_1, ξ_2, . . . , ξ_k).

This can be implemented via repeated calls to TableILP + ET(ξ_j) with j increasing from 1 to k, stopping if a solution is found. Alternatively, one can simulate the cascade via a single extended ILP using k new binary variables z_j with constraints $|\omega(\xi_j)| \cdot z_j \leq \sum_{l \in \omega(\xi_j)} x(q_l)$ for $j \in \{1, \ldots, k\}$, and adding $M \cdot \sum_{j=1}^{k} z_j$ to the objective function, for a sufficiently large constant M.

We evaluate a four-level cascade with increasing thresholds on our question set, Q_R. By employing essentiality information provided by the ET classifier, Cascades corrects 41.7% of the mistakes made by vanilla
TableILP . This error-reduction illustrates that the extra attention mechanismadded to
TableILP via the concept of essential question terms helps it cope with distractingterms.
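The single-ILP simulation of the cascade described above can be sketched with the PuLP library as follows. The toy question terms, scores, threshold sequence, and big-M value are illustrative assumptions, and the rest of the TableILP model (knowledge alignment and support-graph constraints) is omitted.

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

terms = ["animals", "respond", "drop", "temperature"]
scores = {"animals": 0.3, "respond": 0.55, "drop": 0.9, "temperature": 0.95}
thresholds = [0.4, 0.6, 0.8]   # increasing xi_j; a lower xi forces more terms, i.e. a stricter level
M = 100.0                      # large constant rewarding satisfaction of each cascade level

prob = LpProblem("cascade_ilp", LpMaximize)
x = {t: LpVariable(f"x_{t}", cat=LpBinary) for t in terms}      # term used in reasoning graph?
z = {j: LpVariable(f"z_{j}", cat=LpBinary) for j in range(len(thresholds))}

# |omega(xi_j)| * z_j <= sum_{l in omega(xi_j)} x_l : z_j can be 1 only if
# every term above threshold xi_j is used.
for j, xi in enumerate(thresholds):
    omega_j = [t for t in terms if scores[t] > xi]
    prob += len(omega_j) * z[j] <= lpSum(x[t] for t in omega_j)

# Stand-in for the rest of the TableILP objective, plus the cascade bonus M * sum_j z_j.
prob += lpSum(scores[t] * x[t] for t in terms) + M * lpSum(z.values())
prob.solve()
print({t: int(x[t].value()) for t in terms}, {j: int(z[j].value()) for j in z})
```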
This chapter introduces and studies the notion of essential question terms with the goal of improving QA solvers. We illustrate the importance of essential question terms by showing that humans' ability to answer questions drops significantly when essential terms are eliminated from questions. We then develop a classifier that reliably (90% mean average precision) identifies and ranks essential terms in questions. Finally, we use the classifier to demonstrate that the notion of question term essentiality allows state-of-the-art QA solvers for elementary-level science questions to make better and more informed decisions, improving performance by up to 5%.

Part II: Moving the Peaks Higher: Designing More Challenging Datasets

CHAPTER 6: A Challenge Set for Reasoning on Multiple Sentences

"Human beings, viewed as behaving systems, are quite simple. The apparent complexity of our behavior over time is largely a reflection of the complexity of the environment in which we find ourselves." — Herbert A. Simon, The Sciences of the Artificial, 1968
In this chapter we develop a reading comprehension challenge in which answering each of the questions requires reasoning over multiple sentences. (This chapter is based on the following publication: Khashabi et al. (2018a).)

There is evidence that answering 'single-sentence questions', i.e., questions that can be answered from a single sentence of the given paragraph, is easier than answering 'multi-sentence questions', which require multiple sentences to answer a given question. For example, Richardson et al. (2013) released a reading comprehension dataset that contained both single-sentence and multi-sentence questions; models proposed for this task yielded considerably better performance on the single-sentence questions than on the multi-sentence questions (according to Narasimhan and Barzilay (2015), accuracy of about 83% and 60% on these two types of questions, respectively).

There could be multiple reasons for this. First, multi-sentence reasoning seems to be inherently a difficult task. Research has shown that while complete-sentence construction emerges as early as first grade for many children, their ability to integrate sentences emerges only in fourth grade (Berninger et al., 2011). Answering multi-sentence questions might be more challenging for an automated system because it involves more than just processing individual sentences, but rather combining linguistic, semantic, and background knowledge across sentences—a computational challenge in itself. Despite these challenges, multi-sentence questions can be answered by humans and hence present an interesting yet reasonable goal for AI systems (Davis, 2014).
Figure 19: Examples from our MultiRC corpus. Each example shows relevant excerpts from a paragraph; a multi-sentence question that can be answered by combining information from multiple sentences of the paragraph; and corresponding answer-options. The correct answer(s) is indicated by a *. Note that there can be multiple correct answers per question.
In this work, we propose a multi-sentence QA challenge in which questions can be answered only using information from multiple sentences. Specifically, we present MultiRC (Multi-Sentence Reading Comprehension), a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. Each question is associated with several choices for answer-options, out of which one or more correctly answer the question. Figure 19 shows two examples from our dataset. Each instance consists of a multi-sentence paragraph, a question, and answer-options. All instances were constructed such that it is not possible to answer a question correctly without gathering information from multiple sentences. Due to space constraints, the figure shows only the relevant sentences from the original paragraph. The entire corpus consists of 871 paragraphs and about 6k multi-sentence questions. The dataset is available at http://cogcomp.org/multirc/.

The goal of this dataset is to encourage the research community to explore approaches that can do more than sophisticated lexical-level matching. To accomplish this, we designed the dataset with three key challenges in mind. (i) The number of correct answer-options for each question is not pre-specified. This removes the over-reliance of current approaches on answer-options and forces them to decide on the correctness of each candidate answer. In summary, our contributions are:

• ~6k high-quality multiple-choice RC questions that are generated (and manually verified via crowdsourcing) to require integrating information from multiple sentences.
• The questions are not constrained to have a single correct answer, generalizing existing paradigms for representing answer-options.
• Our dataset is constructed using 7 different sources, allowing more diversity in content, style, and possible question types.
• We show a significant performance gap between current solvers and human performance, indicating an opportunity for developing sophisticated reasoning systems.

6.2. Relevant Work
Some recent datasets proposed for machine comprehension pay attention to the type of questions and reasoning required. For example, RACE (Lai et al., 2017) attempts to incorporate different types of reasoning phenomena, and MCTest (Richardson et al., 2013) attempted to contain at least 50% multi-sentence reasoning questions. However, since the crowdsourced workers who created the dataset were only encouraged, and not required, to write such questions, it is not clear how many of these questions actually require multi-sentence reasoning (see Sec. 6.3.5). Similarly, only about 25% of questions in the RACE dataset require multi-sentence reasoning, as reported in their paper. Remedia (Hirschman et al., 1999) also contains 5 different types of questions (based on question words) but is a much smaller dataset. Other datasets which do not deliberately attempt to include multi-sentence reasoning, like SQuAD (Rajpurkar et al., 2016) and the CNN/Daily Mail dataset (Hermann et al., 2015), suffer from an even lower percentage of such questions (12% and 2%, respectively (Lai et al., 2017)). There are several other corpora which do not guarantee specific reasoning types, including MS MARCO (Nguyen et al., 2016), WikiQA (Yang et al., 2015), and TriviaQA (Joshi et al., 2017).

The complexity of reasoning required for a reading comprehension dataset depends on several factors, such as the source of questions or paragraphs, the way they are generated, and the order in which they are generated (i.e., questions from paragraphs, or the reverse). Specifically, the paragraphs' source could influence the complexity and diversity of the language of the paragraphs and questions, and hence the required level of reasoning capabilities. Unlike most current datasets, which rely on only one or two sources for their paragraphs (e.g., CNN/Daily Mail and SQuAD rely only on news and Wikipedia articles, respectively), our dataset uses 7 different domains.

Another factor that distinguishes our dataset from previously proposed corpora is the way answers are represented. Several datasets represent answers as multiple choices with a single correct answer. While multiple-choice questions are easy to grade, coming up with non-trivial correct and incorrect answers can be challenging. Also, assuming exactly one correct answer (e.g., as in MCTest and RACE) inadvertently changes the task from choosing the correct answer to choosing the most likely answer. Other datasets (e.g., MS MARCO and SQuAD) represent answers as a contiguous substring within the passage. This assumption, that the answer is a span of the paragraph, limits the questions to those whose answer is contained verbatim in the paragraph. Unfortunately, it rules out more complicated questions whose answers are only implied by the text and hence require a deeper understanding. Because of these limitations, we designed our dataset to use multiple-choice representations, but without specifying the number of correct answers for each question.
6.3. MultiRC
In this section we describe our principles and methodology of dataset collection. This includes automatically collecting paragraphs, composing questions and answer-options through a crowdsourcing platform, and manually curating the collected data. We also summarize a pilot study that helped us design this process, and end with a summary of statistics of the collected corpus.
Questions and answers in our dataset are designed based on the following key principles:
Multi-sentenceness.
Questions in our challenge require models to use information from multiple sentences of a paragraph. This is ensured through explicit validation. We exclude any question that can be answered based on a single sentence from a paragraph.
Open-endedness.
Our dataset is not restricted to questions whose answer can be found verbatim in a paragraph. Instead, we provide a set of hand-crafted answer-options for each question. Notably, they can represent information that is not explicitly stated in the text but is only inferable from it (e.g., implied counts, sentiments, and relationships).

Answers to be judged independently.
The total number of answer-options per question is variable in our data, and we explicitly allow multiple correct and incorrect answer-options (e.g., 2 correct and 1 incorrect option). As a consequence, correct answers cannot be guessed solely by a process of elimination or by simply choosing the best candidates out of the given options.

Through these principles, we encourage users to explicitly model the semantics of text beyond individual words and sentences, to incorporate extra-linguistic reasoning mechanisms, and to handle answer-options independently of one another.
Variability.
We encourage variability on different levels. Our dataset is based on paragraphs from multiple domains, leading to linguistically diverse questions and answers. Also, we do not impose any restrictions on the questions, to encourage different forms of reasoning.
The paragraphs used in our dataset are extracted from various sources. Here is the complete list of the text types and sources used in our dataset, and the number of paragraphs extracted from each category (indicated in square brackets on the right):

1. News: [121]
• CNN (Hermann et al., 2015)
• WSJ (Ide et al., 2008)
• NYT (Ide et al., 2008)
2. Wikipedia articles [92]
3. Articles on society, law and justice (Ide and Suderman, 2006) [91]
4. Articles on history and anthropology (Ide et al., 2008) [65]
5. Elementary school science textbooks [153]
6. 9/11 reports (Ide and Suderman, 2006) [72]
7. Fiction: [277]
• Stories from the Gutenberg project
• Children stories from MCTest (Richardson et al., 2013)
• Movie plots from the CMU Movie Summary corpus (Bamman et al., 2013)

Table 22: Bounds used to select paragraphs for dataset creation.

From each of the above-mentioned sources we extracted paragraphs that had enough content. To ensure this, we followed a 3-step process. In the first step, we selected the top few sentences from paragraphs such that they contained 1k-1.5k characters. To ensure coherence, all sentences were contiguous and extracted from the same paragraph. In this process we also discarded paragraphs that seemed to deviate too much from a third-person narrative style. For example, while processing the Gutenberg corpus we considered only files that had at least 5k lines, because we found that most of the remaining files were short poetic texts. In the second step, we annotated (Khashabi et al., 2018c) the paragraphs and automatically filtered texts using conditions such as the average number of words per sentence, the number of named entities, and the number of discourse connectives in the paragraph. These conditions were designed by the authors of this paper after reviewing a small sample of paragraphs; the complete set of conditions is listed in Table 22. Finally, in the last step, we manually verified each paragraph and filtered out the ones that had formatting issues or other concerns that seemed to compromise their usability.
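As a rough illustration of the second (automatic) filtering step, the sketch below checks pre-computed paragraph statistics against one-sided bounds; the condition names and threshold values here are placeholders, not the actual bounds of Table 22:

```python
def passes_filters(stats, lower_bounds, upper_bounds):
    """Return True if a paragraph's statistics satisfy all bounds.

    `stats` maps condition names to values computed by the annotation pipeline
    (e.g. average words per sentence, number of named entities, number of
    discourse connectives).  The bounds used below are illustrative placeholders.
    """
    for name, lo in lower_bounds.items():
        if stats.get(name, float("-inf")) < lo:
            return False
    for name, hi in upper_bounds.items():
        if stats.get(name, float("inf")) > hi:
            return False
    return True

# Illustrative usage (these values are NOT the ones used for MultiRC):
example_stats = {"num_sentences": 12, "avg_words_per_sentence": 18.3, "num_named_entities": 9}
keep = passes_filters(example_stats,
                      lower_bounds={"num_sentences": 5, "num_named_entities": 3},
                      upper_bounds={"avg_words_per_sentence": 40})
```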
6.3.3. Pipeline of question extraction

In this section, we delineate the details of the process for collecting questions and answers. Figure 20 gives a high-level idea of the process. The first two steps deal with creating multi-sentence questions, followed by two steps for the construction of candidate answers.
Figure 20: Pipeline of our dataset construction: Step 1 generates multi-sentence questions given paragraphs; Step 2 verifies multi-sentenceness; Step 3 generates candidate answers; Step 4 judges the quality of questions and candidates.
Step 1: Generating questions.
The goal of the first step of our pipeline is to collect multi-sentence questions. We show each paragraph to 5 turkers and ask them to write 3-5 questions such that: (1) the question is answerable from the passage, and (2) only those questions are allowed whose answer cannot be determined from a single sentence. We clarify this point by providing example paragraphs and questions. In order to encourage turkers to write meaningful questions that fit our criteria, we additionally ask them for a correct answer and for the sentence indices required to answer the question. To ensure the grammatical quality of the questions collected in this step, we limit the turkers to countries where English is the majority language. After the acquisition of questions in this step, we filter out questions that required fewer than 2 or more than 4 sentences to be answered; we also run them through an automatic spell-checker and manually correct typos and unusual wordings.

Step 2: Verifying multi-sentenceness of questions.
In a second step, we verify that each question can only be answered using more than one sentence. For each question collected in the previous step, we create question-sentence pairs by pairing it with each of the sentences necessary for answering it, as indicated in the previous step. For a given question-sentence pair, we then ask turkers to annotate whether they could answer the question from the sentence it is paired with (a binary annotation). The underlying idea of this step is that a multi-sentence question would not be answerable from any of the individual sentences it is paired with.
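A small sketch of how the question-sentence pairs for this verification can be generated, assuming the sentence indices collected in Step 1 are available:

```python
def verification_pairs(question, sentences, required_indices):
    """Pair a question with each sentence that Step 1 marked as required.

    Each pair is shown to a turker, who judges whether the question is
    answerable from that single sentence (a binary annotation).
    """
    return [(question, sentences[i]) for i in required_indices]

pairs = verification_pairs(
    "Why did the narrator return to the village?",   # hypothetical example
    ["S0 ...", "S1 ...", "S2 ...", "S3 ..."],
    required_indices=[1, 3],
)
```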
Step 3: Generating answer-options.

In this step, we collect the answer-options that will be shown with each question. Specifically, for each verified question from the previous steps, we ask 3 turkers to write as many correct and incorrect answer-options as they can think of. In order not to curb creativity, we do not place a restriction on the number of options they have to write. We explicitly ask turkers to design difficult and non-trivial incorrect answer-options (e.g., if the question is about a person, a non-trivial incorrect answer-option would be other people mentioned in the paragraph).

After this step, we perform a light cleanup of the candidate answers by manually correcting minor errors (such as typos), completing incomplete sentences, and rephrasing ambiguous sentences. We further make sure there is not much repetition in the answer-options, to prevent potential exploitation of correlations between candidate answers in order to find the correct answer. For example, we drop obviously duplicate answer-options (i.e., options that are identical after lower-casing, lemmatization, and removal of stop-words).
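A minimal sketch of the duplicate-dropping heuristic mentioned above (lower-casing, lemmatization, stop-word removal); the stop-word list and lemmatizer here are tiny stand-ins for the actual tools used:

```python
import re

STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "or", "is", "are", "was", "were"}

def normalized_key(option, lemmatize=lambda w: w):
    """Reduce an answer-option to a bag of normalized content words."""
    tokens = re.findall(r"[a-z0-9]+", option.lower())
    return frozenset(lemmatize(t) for t in tokens if t not in STOPWORDS)

def drop_duplicate_options(options):
    """Keep only the first answer-option for each normalized form."""
    seen, kept = set(), []
    for option in options:
        key = normalized_key(option)
        if key not in seen:
            seen.add(key)
            kept.append(option)
    return kept

print(drop_duplicate_options(["The two brothers", "the two brothers.", "Their cousin"]))
# ['The two brothers', 'Their cousin']
```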
Step 4: Verifying quality of the dataset.
This step serves as the final quality check for both the questions and the answer-options generated in the previous steps. We show each paragraph, its questions, and the corresponding answer-options to 3 turkers, and ask them to indicate if they find any errors (grammatical or otherwise) in the questions and/or answer-options. We then manually review, and correct if needed, all erroneous questions and answer-options. This ensures that we have meaningful questions and answer-options.

In this step, we also want to verify that the correct (or incorrect) options obtained from Step 3 were indeed correct (or incorrect). For this, we additionally ask the annotators to select all correct answer-options for the question. If their annotations did not agree with the ones we had after Step 3 (e.g., if they unanimously selected an 'incorrect' option as the answer), we manually reviewed and corrected (if needed) the annotation.
The 4-step process described above was the result of detailed analysis and substantial refinement after two small pilot studies.

In the first pilot study, we ran a set of 10 paragraphs extracted from the CMU Movie Summary Corpus through our pipeline. Our pipeline at the time looked considerably different from the one described above. We found that the steps that required turkers to write questions and answer-options often produced grammatical errors, possibly because a large majority of turkers were non-native speakers of English. This problem was more prominent in questions than in answer-options. Because of this, we decided to limit the task to native speakers. Also, based on the results of this pilot, we overhauled the instructions of these steps by including examples of grammatically correct but undesirable (not multi-sentence) questions and answer-options, in addition to several minor changes.

Thereafter, we decided to perform a manual validation of the verification steps (current Steps 2 and 4). For this, we (the authors of this paper) performed additional annotations ourselves on the data shown to turkers, and compared our results with those provided by the turkers. We found that in the verification of answer-options, our annotations were in high agreement (98%) with those obtained from Mechanical Turk. However, that was not the case for the verification of multi-sentence questions. We made several further changes to the first two steps. Among other things, we clarified in the instructions that turkers should not use their background knowledge when writing and verifying questions, and also included negative examples of such questions. Additionally, when turkers judged a question to be answerable using a single sentence, we decided to encourage (but not require) them to guess the answer to the question. This improved our results considerably, possibly because it forced annotators to think more carefully about what the answer might be, and whether they actually knew the answer or just thought that they knew it (possibly because of background knowledge or because the sentence contained a lot of information relevant to the question). Guessed answers in this step were only used to verify the validity of multi-sentence questions; they were not used in the dataset or subsequent steps.

After revision, we ran a second pilot study in which we processed a set of 50 paragraphs through our updated pipeline. This second pilot confirmed that our revisions were helpful, but, thanks to its larger size, also allowed us to identify a couple of borderline cases for which additional clarifications were required. Based on the results of the second pilot, we made some additional minor changes and then decided to apply the pipeline for creating the final dataset.
While collecting our dataset, we found that, even though Step 1 instructed turkers to write multi-sentence questions, not all generated questions indeed required multi-sentence reasoning. This happened even after clarifications and revisions to the corresponding instructions, and we attribute it to honest mistakes. Therefore, we designed the subsequent verification step (Step 2).

There are other datasets which aim to include multi-sentence reasoning questions, especially MCTest. Using our verification step, we systematically verified their multi-sentenceness. For this, we conducted a small pilot study on about 60 multi-sentence questions from MCTest. As for our own verification, we created question-sentence pairs for each question and asked annotators to judge whether they could answer a question from the single sentence shown. Because we did not know which sentences contain information relevant to a question, we created question-sentence pairs using all sentences from a paragraph. After aggregation of the turker annotations, we found that about half of the questions annotated as multi-sentence could be answered from a single sentence of the paragraph. This study, though performed on a subset of the data, underscores the necessity of a rigorous verification step for multi-sentence reasoning when studying this phenomenon.
We now provide a brief summary of MultiRC. Overall, it contains roughly 6k multi-sentence questions collected for about 800+ paragraphs. The median number of correct and total answer-options for each question is 2 and 5, respectively. Additional statistics are given in Table 23.

Table 23: Various statistics of our dataset (figures in parentheses represent standard deviations).

In Step 1, we also asked annotators to identify the sentences required to answer a given question. We found that answering each question required, on average, more than two sentences (see Table 23).
Next, we analyze the types of questions in our dataset. Figure 22 shows the count of the first word(s) of our questions. We can see that while the popular question words (What, Who, etc.) are very common, there is a wide variety in the first word(s), indicating a diversity of question types. About 28% of our questions require binary decisions (true/false or yes/no). (We will also release the questions that did not pass Step 2; though not multi-sentence questions, they could be a valuable resource on their own.)

We randomly selected 60 multi-sentence questions from our corpus and asked two independent annotators to label them with the type of reasoning phenomenon required to answer them. During this process, the annotators were shown a list of common reasoning phenomena (shown below), and they had to identify one or more of the phenomena relevant to a given question. The list of phenomena shown to the annotators included the following categories: mathematical and logical reasoning, spatio-temporal reasoning, list/enumeration, coreference resolution (including implicit references, abstract pronouns, event coreference, etc.), causal relations, paraphrases and contrasts (including lexical relations such as synonyms and antonyms), commonsense knowledge, and 'other'. The categories were selected after a manual inspection of a subset of questions by two of the authors. The annotation process revealed that answering questions in our corpus requires a broad variety of reasoning phenomena; the left plot in Figure 21 provides detailed results.

The figure shows that a large fraction of questions require coreference resolution, and a more careful inspection revealed that there were different types of coreference phenomena at play here. To investigate these further, we conducted a follow-up experiment in which we manually annotated all questions that required coreference resolution into finer categories. Specifically, each question was shown to two annotators who were asked to select one or more of the following categories: entity coreference (between two entities), event coreference (between two events), set inclusion coreference (one item is part of or included in the other), and 'other'. (These annotations were adjudicated by two authors of this paper.) Figure 21 (right) shows the results of this experiment.
In this section, we provide a quantitative analysis of several baselines for our challenge.
Evaluation Metrics.
We define precision and recall for a question $q$ as:
\[
\mathrm{Pre}(q) = \frac{|A(q) \cap \hat{A}(q)|}{|\hat{A}(q)|}
\quad\text{and}\quad
\mathrm{Rec}(q) = \frac{|A(q) \cap \hat{A}(q)|}{|A(q)|},
\]
where $A(q)$ and $\hat{A}(q)$ are the sets of correct and selected answer-options, respectively. We define (macro-average) $F1_m$ as the harmonic mean of the average precision $\mathrm{avg}_{q \in Q}(\mathrm{Pre}(q))$ and the average recall $\mathrm{avg}_{q \in Q}(\mathrm{Rec}(q))$, with $Q$ as the set of all questions. Since, by design, each answer-option can be judged independently, we consider another metric, $F1_a$, evaluating binary decisions on all the answer-options in the dataset. We define $F1_a$ to be the harmonic mean of $\mathrm{Pre}(Q)$ and $\mathrm{Rec}(Q)$, with $\mathrm{Pre}(Q) = \frac{|A(Q) \cap \hat{A}(Q)|}{|\hat{A}(Q)|}$, where $A(Q) = \bigcup_{q \in Q} A(q)$, and with analogous definitions for $\hat{A}(Q)$ and $\mathrm{Rec}(Q)$.
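The two metrics can be computed directly from per-question sets of gold and predicted answer-options; a short sketch (the data layout is hypothetical):

```python
def harmonic_mean(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def multirc_metrics(gold, pred):
    """Compute F1_m and F1_a as defined above.

    `gold[q]` and `pred[q]` are sets of answer-option ids judged correct /
    selected for question q (a hypothetical representation of system output).
    """
    per_q_precision, per_q_recall = [], []
    joint_tp = joint_pred = joint_gold = 0
    for q in gold:
        selected = pred.get(q, set())
        tp = len(gold[q] & selected)
        per_q_precision.append(tp / len(selected) if selected else 0.0)
        per_q_recall.append(tp / len(gold[q]) if gold[q] else 0.0)
        joint_tp += tp
        joint_pred += len(selected)
        joint_gold += len(gold[q])
    f1_m = harmonic_mean(sum(per_q_precision) / len(per_q_precision),
                         sum(per_q_recall) / len(per_q_recall))
    f1_a = harmonic_mean(joint_tp / joint_pred if joint_pred else 0.0,
                         joint_tp / joint_gold if joint_gold else 0.0)
    return f1_m, f1_a
```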
Human.

Human performance provides us with an estimate of the best achievable results on our dataset. Using Mechanical Turk, we ask 4 people (limited to native speakers) to solve our data. We evaluate the score of each label by averaging the decisions of the individuals.

Random.
To get an estimate of the lower bound, we consider a random baseline, where each answer-option is selected as correct with a probability of 50% (an unbiased coin toss). The numbers reported for this baseline represent the expected outcome (statistical expectation).

IR (information retrieval baseline). This baseline selects answer-options that best match sentences in a text corpus (Clark et al., 2016). Specifically, for each question $q$ and answer-option $a_i$, the IR solver sends $q + a_i$ as a query to a search engine (we use Lucene) on a corpus, and returns the search engine's score for the top retrieved sentence $s$, where $s$ must have at least one non-stopword overlap with $q$, and at least one with $a_i$.

We create two versions of this system. In the first variation, IR(paragraphs), we create a corpus of sentences extracted from all the paragraphs in the dataset. In the second variation, IR(web), in addition to the knowledge of the paragraphs, we use extensive external knowledge extracted from the web (Wikipedia, science textbooks and study guidelines, and other webpages), totaling 280GB of plain text.

SurfaceLR (logistic regression baseline). As a simple baseline that makes use of our small training set, we reimplemented and trained a logistic regression model using word-based overlap features. As described in (Merkhofer et al., 2018), this baseline takes into account the lengths of the text, question, and each answer candidate, as well as indicator features regarding the (co-)occurrences of any words in them.
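A sketch of the per-option IR scoring just described; search(query) stands in for a Lucene query over the sentence corpus and is assumed to return (sentence, score) pairs in decreasing order of score:

```python
def ir_option_score(question, option, search, stopwords):
    """Score an answer-option by the best-matching retrieved sentence that has a
    non-stopword overlap with both the question and the option."""
    def content_words(text):
        return {w for w in text.lower().split() if w not in stopwords}

    q_words, a_words = content_words(question), content_words(option)
    for sentence, score in search(question + " " + option):
        s_words = content_words(sentence)
        if s_words & q_words and s_words & a_words:
            return score
    return 0.0
```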
SemanticILP (semi-structured baseline). This state-of-the-art solver, originally proposed for science questions and biology tests, uses a semi-structured representation to formalize the scoring problem as a subgraph optimization problem over multiple layers of semantic abstractions (Khashabi et al., 2018b). Since the solver is designed for multiple-choice questions with a single correct answer, we adapt it to our setting by running it for each answer-option. Specifically, for each answer-option, we create a single-candidate question and retrieve a real-valued score from the solver.

Table 24: Performance comparison for different baselines tested on a subset of our dataset (in percentage). There is a significant gap between the human performance and current statistical methods.

                      Dev             Test
                  F1_m   F1_a    F1_m   F1_a
Random            44.3   43.8    47.1   47.6
IR(paragraphs)    64.3   60.0    54.8   53.9
SurfaceLR         66.1   63.7    66.7   63.5
Human             86.4   83.8    84.3   81.8
BiDAF (neural network baseline). As a neural baseline, we apply the solver of Seo et al. (2016), which was originally proposed for SQuAD but has been shown to generalize well to other domains (Min et al., 2017). Since BiDAF was designed for cloze-style questions, we apply it to our multiple-choice setting following the procedure of Kembhavi et al. (2017): specifically, we score each answer-option by computing the similarity of BiDAF's output span with each of the candidate answers, using the phrasal similarity tool of Wieting et al. (2015).
Figure 23: Precision-recall curve for each of the baselines. There is a considerable gap between the baselines and human performance.

To get a sense of our dataset's hardness, we evaluate both human performance and multiple computational baselines. Each baseline scores an answer-option with a real-valued score, which we threshold to decide whether an answer-option is selected or not, where the threshold is tuned on the development set. Table 24 shows performance results for the different baselines. The high human performance shows that humans do not have much difficulty in answering the questions. Similar observations can be made in Figure 23, where we plot $\mathrm{avg}_{q \in Q}(\mathrm{Pre}(q))$ vs. $\mathrm{avg}_{q \in Q}(\mathrm{Rec}(q))$ for different threshold values.

6.5. Summary

To motivate the community to work on more challenging forms of natural language comprehension, in this chapter we discussed a dataset that requires reasoning over multiple sentences. We solicit and verify questions and answers for this challenge through a 4-step crowdsourcing experiment. Our challenge dataset contains roughly 6k questions for 800+ paragraphs across 7 different domains (elementary school science, news, travel guides, fiction stories, etc.), bringing linguistic diversity to the texts and to the question wordings. On a subset of our dataset, we found human solvers to achieve an F1 score of roughly 86% (Table 24).

Chapter 7: A Question Answering Benchmark for Temporal Commonsense

"Everything changes and nothing stands still."
— Heraclitus, 535 BC - 475 BC
Automating natural language understanding requires models that are informed by commonsense knowledge and are able to reason with it in both common and unexpected situations. The NLP community has in the last few years started to investigate how to acquire such knowledge (Forbes and Choi, 2017; Zhang et al., 2017; Yang et al., 2018; Rashkin et al., 2018; Bauer et al., 2018; Tandon et al., 2018; Zellers et al., 2018).

This work studies a specific type of commonsense, temporal commonsense. For instance, given two events "going on a vacation" and "going for a walk," most humans would know that a vacation is typically longer and occurs less often than a walk, but our programs currently do not know that. Temporal commonsense has received limited attention so far.
Our first contribution isthat, to the best of our knowledge, we are the first to systematically study and quantifyperformance on a range of temporal commonsense phenomena. Specifically, we considerfive temporal properties: duration (how long an event takes), temporal ordering (typicalorder of events), typical time (when an event happens), frequency (how often an eventoccurs), and stationarity (whether a state holds for a very long time). Previous works haveinvestigated some of them, either explicitly or implicitly (e.g., duration DivyeKhilnani andJurafsky (2011); Williams (2012) and ordering Chklovski and Pantel (2004); Ning et al.(2018a)), but none of them have defined or studied all aspects of temporal commonsensein a unified framework. Kozareva and Hovy (2011) came close, when they defined a fewtemporal aspects to be investigated, but failed short of distinguishing in text and quantifying This chapter is based on the following publication: Zhou et al. (2019).
Figure 24: Five types of temporal commonsense in TacoQA. Note that a question may have multiple answers.

Our second contribution is the collection of a new dataset dedicated to this task, TacoQA (short for temporal commonsense question answering). TacoQA is constructed via crowdsourcing with three meticulously designed stages to guarantee its quality. An entry in TacoQA contains a sentence providing context information, a question requiring temporal commonsense, and candidate answers, with or without correct ones (see Fig. 24). More details about TacoQA are in Sec. 7.3.
Our third contribution is that we propose multiple systems, including ESIM, BERT, and their variants, for this task. TacoQA allows us to investigate how state-of-the-art NLP techniques do on temporal commonsense tasks. Results in Sec. 7.4 show that, despite a significant improvement over random-guess baselines,
BERT is still far behind human performance on temporal commonsense reasoning, indicating that existing NLP techniques still have limited capability of capturing high-level semantics like time.

7.2. Related Work

Commonsense has been a very popular topic in recent years, and existing NLP works have mainly investigated the acquisition and evaluation of commonsense in the physical world, including, but not limited to, size, weight, and strength (Forbes and Choi, 2017), roundness and deliciousness (Yang et al., 2018), and intensity (Cocos et al., 2018). In terms of commonsense about "events", Rashkin et al. (2018) investigated the intent and reaction of participants of an event, and Zellers et al. (2018) tried to select the most likely subsequent event. As far as we know, no existing work has focused on temporal commonsense yet.

There have also been many works trying to understand time in natural language, but not necessarily with respect to commonsense, such as the extraction and normalization of temporal expressions (Lee et al., 2014), temporal relation extraction (Ning et al., 2018b), and timeline construction (Leeuwenberg and Moens, 2018). Among these, some works are implicitly about temporal commonsense, such as event durations (Williams, 2012; Vempala et al., 2018), typical temporal ordering (Chklovski and Pantel, 2004; Ning et al., 2018a), and script learning, i.e., what happens next after certain events (Granroth-Wilding and Clark, 2016; Li et al., 2018). However, either in terms of datasets or approaches, existing works did not study all five types of temporal commonsense in a unified framework as we do here. Instead of working on each individual aspect of temporal commonsense, we formulate the problem as a machine reading comprehension task in the format of question answering (QA). The past few years have also seen significant progress on QA (Clark et al., 2018; Ostermann et al., 2018; Merkhofer et al., 2018), but mainly on general natural language comprehension tasks, without tailoring it to test specific reasoning capabilities such as temporal commonsense. Therefore, a new dataset like TacoQA is strongly desired. (The dataset and code will be released upon publication.)

Table 25: Statistics of TacoQA.

Measure              Value
event frequency      433   8.5
event duration       440   9.4
event stationarity   279   3.1
event ordering       370   5.4
event typical time   371   6.8
7.3. TacoQA
We describe our crowdsourcing scheme for TacoQA, which was designed after extensive pilot studies. The multi-step scheme asks annotators to generate questions, validate questions, and then label candidate answers. We use Amazon Mechanical Turk and restrict our tasks to English speakers only. Before working on our task, annotators need to read through our guidelines and pass a qualification test designed to ensure their understanding.

Step 1: Question generation.
In the first step, we ask crowd workers to generate questions given a sentence. We randomly select 630 sentences from MultiRC (Khashabi et al., 2018a) (70 from each of the 9 domains) as input sentences. To make sure that the questions indeed require temporal commonsense knowledge, we instruct annotators to follow two requirements when generating questions: (a) the question must be a "temporal" question, from one of our five categories (see Fig. 24); and (b) the question must not have its answer directly mentioned in the given sentence. We also ask annotators to provide a correct answer for each of their questions, to make sure that the questions are answerable at least by the question writers themselves.
Step 2: Question verification.
To improve the quality of the questions generated in Step 1, we further ask two different annotators to check (a) whether the two requirements above are satisfied and (b) whether there are grammatical or logical errors. We keep a question if and only if both annotators agree on its quality; since the annotator who provided the question in Step 1 also agrees on it, this amounts to a [3/3] agreement for each question. For the questions that we keep, we continue to ask the annotators to give one correct answer and one incorrect answer, which serve as a seed set for the automatic answer expansion in the next step.
Step 3: Candidate answer expansion.
In the previous steps, we have collected 3 positive and 2 negative answers for each question (one positive answer from Step 1, and one positive and one negative answer from each of the two annotators in Step 2). Step 3 aims to automatically expand this set of candidate answers using three approaches. First, we use a set of rules to extract temporal terms (e.g., "a.m.", "1990", "afternoon", "day") or numbers and quantities ("2", "once"), which are replaced by terms randomly selected from lists of temporal units ("second"), adjectives ("early"), points ("a.m."), and adverbs ("always"). Examples are "2 a.m." → "3 p.m.", "1 day" → "10 days", and "once a week" → "twice a month". Second, we mask each temporal term and use BERT (Devlin et al., 2018) to predict the masked terms; we rank those predictions by BERT's confidence level and keep the top ones. Third, for those candidates representing events, there are typically no temporal terms in them. We therefore create a pool of 60k event phrases using PropBank (Kingsbury and Palmer, 2002), and retrieve the phrases most similar to a given candidate answer using an information retrieval (IR) system. We use the three approaches sequentially to expand the candidate answer set to 20 candidates per question. (Our dataset and some related details, such as our annotation interfaces, guidelines, and qualification tests, are available at the following link: https://bit.ly/2tZ1mkd)
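A toy version of the first, rule-based expansion step; the replacement pools are tiny stand-ins for the actual lists, and the resulting candidates are of course still subject to the labeling in Step 4:

```python
import random
import re

UNITS = ["seconds", "minutes", "hours", "days", "weeks", "months", "years"]
NUMBERS = ["2", "3", "5", "10", "30"]

def perturb_temporal(answer):
    """Swap numbers and temporal units in an answer to create a new candidate,
    e.g. "1 day" -> "10 weeks" (a sketch of the rule-based expansion)."""
    out = re.sub(r"\b\d+\b", lambda m: random.choice(NUMBERS), answer)
    out = re.sub(r"\b(?:seconds?|minutes?|hours?|days?|weeks?|months?|years?)\b",
                 lambda m: random.choice(UNITS), out)
    return out

random.seed(0)
print(perturb_temporal("1 day"))   # e.g. "30 weeks"
```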
Step 4: Answer labeling.
In this step, we ask annotators to label each answer with one of three options: "likely", "unlikely", or "garbage" (incomplete or meaningless phrases). We keep a candidate answer if and only if all 4 annotators agree on "likely" or "unlikely", and "garbage" is not marked by any annotator. We also discard any questions that end up with no valid candidate answers. Finally, the statistics of TacoQA are given in Table 25.
We assess the quality of our dataset using several baseline systems. We create a uniform 30%/70% split of the data into dev/test. The rationale behind this split is that a successful system has to bring in a large amount of world knowledge and derive commonsense understanding prior to the current task evaluation. We therefore believe that it makes little sense to expect a system to train solely on this data, and we think of the development data as only providing a definition of the task. Indeed, the gains from our development data are marginal after a certain number of observations; this intuition is studied and verified in the appendix of Zhou et al. (2019).
Evaluation metrics.
Two question-level metrics are adopted in this work: exact match (EM) and F1. EM measures on how many questions a system is able to correctly label all candidate answers, while F1 measures the average overlap between the system's predictions and the ground truth (see the appendix of Zhou et al. (2019) for the full definition).

Human performance.

An expert annotator also worked on TacoQA to gain a better understanding of the human performance on it. The expert answered 100 questions randomly sampled from the test set, and could only see a single answer at a time, with its corresponding question and sentence.
Systems.
We propose to use two state-of-the-art systems in machine reading comprehension that are suitable for our task.
ESIM (Chen et al., 2017) is a neural model effective for natural language inference. We initialize the word embeddings in ESIM via either GloVe (Pennington et al., 2014) or ELMo (Peters et al., 2018) to demonstrate the effect of pre-training on this task. BERT (Devlin et al., 2018) is a recent state-of-the-art contextualized representation used in a broad range of high-level tasks. We also add unit normalization to BERT, which extracts temporal expressions in candidate answers and converts them to their most appropriate units. For example, "30 months" is converted to "2.5 years".
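A minimal sketch of the unit-normalization idea (the real component covers more expression types than simple durations):

```python
import re

SECONDS_PER = {"second": 1, "minute": 60, "hour": 3600, "day": 86400,
               "week": 604800, "month": 2629800, "year": 31557600}

def normalize_duration(text):
    """Rewrite a duration like "30 months" in its largest unit with value >= 1."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(second|minute|hour|day|week|month|year)s?", text)
    if not match:
        return text
    seconds = float(match.group(1)) * SECONDS_PER[match.group(2)]
    for unit in ["year", "month", "week", "day", "hour", "minute", "second"]:
        value = seconds / SECONDS_PER[unit]
        if value >= 1:
            return f"{round(value, 1)} {unit}s"
    return text

print(normalize_duration("30 months"))  # -> "2.5 years"
```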
Experimental setting.
In both ESIM baselines, we model the process as a sentence-pair labeling task, following the SNLI setting provided in AllenNLP (https://github.com/allenai/allennlp). In both versions of BERT, we use the same sequence-pair classification model and the same parameters as in BERT's GLUE experiments (github.com/huggingface/pytorch-pretrained-BERT). A system receives two phrases at a time: (a) the concatenation of the sentence and question, and (b) the answer. The system makes a binary prediction on each instance, positive or negative.
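Concretely, each TacoQA entry expands into independent sentence-pair instances; a small sketch of the construction described above:

```python
def to_pair_instances(sentence, question, candidate_answers):
    """Build (first_sequence, second_sequence) pairs for sequence-pair classification:
    the first sequence is the context sentence concatenated with the question,
    the second is one candidate answer."""
    first = f"{sentence} {question}"
    return [(first, answer) for answer in candidate_answers]

pairs = to_pair_instances(
    "He went on a vacation to Hawaii last summer.",   # hypothetical example
    "How long did the vacation probably last?",
    ["two weeks", "ten seconds", "a few days"],
)
# The model labels each pair independently as positive ("likely") or negative ("unlikely").
```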
Results and discussion.
Table 26 provides a summary of the results on TacoQA, where we compare the ESIM and BERT baselines, along with a few naive baselines (always positive, always negative, uniformly random), to human performance. The significant improvement brought by contextualized pre-training such as BERT and ELMo indicates that a significant portion of commonsense knowledge is actually acquired via pre-training. We can also see that human annotators achieve very high performance under both metrics, indicating the high level of agreement among humans on this task.
Table 26: Summary of the performances for different baselines. All numbers are in percentages.

System                       F1     EM
Random                       36.2    8.1
Always Positive              49.8   12.1
Always Negative              17.4   17.4
ESIM + GloVe
ESIM + ELMo
BERT
BERT + unit normalization
Single Human                 87.1   75.8
Our baselines, including BERT, still fall behind human performance by a significant margin. Further analysis shows that BERT, as a language model, is good at associating surface forms (e.g., associating "sunrise" and "morning" since they often co-occur), and is highly sensitive to units (days, years, etc.). To address this sensitivity, we added unit normalization on top of BERT, but even with normalization, BERT + unit normalization is still far behind human performance. This implies that the information acquired by BERT is still not sufficient to solve this task. Moreover, the low EM scores show that current systems do not truly understand time in these questions.

Figure 25 reveals that the performance of BERT is not uniform across the different categories, which could stem from the nature of the different types of temporal commonsense, the quality of the candidate answers, etc. For example, the number of candidates for stationarity questions is much smaller than for other questions, leading to a relatively easy task, but the performance gain from a random baseline to BERT + unit normalization is not large, indicating that further improvement on stationarity is still difficult.
Figure 25: BERT + unit normalization performance per temporal reasoning category (top); performance gain over the random baseline per category (bottom).

This chapter has focused on the challenge of temporal commonsense. Specifically, we framed it as a QA task, defined five categories of questions that capture such ability, and developed a novel crowdsourcing scheme to generate a high-quality dataset for temporal commonsense.
We then showed that systems equipped with state-of-the-art language models such as ELMo and BERT are still far behind humans, thus motivating future research in this area. Our analysis sheds light on the capabilities as well as the limitations of current models. We hope that this study will inspire further research on temporal commonsense.

Part III

Formal Study of Reasoning in Natural Language

Chapter 8: Capabilities and Limitations of Reasoning in Natural Language

"Language is froth on the surface of thought." — John McCarthy
Reasoning can be defined as the process of combining facts and beliefs in order to make decisions (Johnson-Laird, 1980). In particular, in natural language processing (NLP), it has been studied under various settings, such as question answering (QA) (Hirschman et al., 1999).

Figure 26: The interface between meanings and symbols: each meaning (top) can be uttered in many ways into symbolic forms (bottom).

While there is a rich literature on reasoning, there is little understanding of the nature of the problem and its limitations, especially in the context of natural language. In particular, there remains a sizable gap between the empirical understanding of reasoning algorithms for language and the theoretical guarantees for their quality, often due to the complexity of the reality they operate on. An important challenge in many language understanding problems is the symbol grounding problem (Harnad, 1990), the problem of accurately mapping symbols to their underlying meaning representation. Practitioners often address this challenge by enriching their representations; for example, by mapping textual information to Wikipedia entries (Mihalcea and Csomai, 2007; Ratinov et al., 2011), or by grounding text to executable rules via semantic parsing (Reddy et al., 2017). Building upon such representations has produced various reasoning systems that essentially work by combining local information.

This work introduces a formalism that incorporates elements of the symbol-grounding problem, via the two spaces illustrated in Figure 26, and sheds theoretical light on existing intuitions. The formalism consists of (A) an abstract model of linguistic knowledge, and (B) a reasoning model.

(A) Linguistically-inspired abstract model:
We propose a theoretical framework to model and study the capabilities and limitations of reasoning, especially when taking into account key difficulties that arise when formalizing linguistic reasoning. Our model uses two spaces; cf. Figure 26. We refer to the internal conceptualization in the human mind as the meaning space. We assume the information in this space is free of noise and uncertainty. In contrast to human thinking in this space, human expression of thought via the utterance of language introduces many imperfections. The information in this linguistic space, which we refer to as the symbol space, has many language-specific properties. The symbol space is often redundant (e.g., the symbols "CPU" and "computer processor" express the same meaning), ambiguous (e.g., a symbol like "chips" could refer to multiple meanings), incomplete (relations between some symbolic nodes might be missing), and inaccurate (there might be incorrect edges). Importantly, this noisy symbol space is also what a machine reasoning algorithm operates in.

(B) Reasoning model:
We define reasoning as the ability to infer the existence of properties of interest in the meaning space by observing only its representation in the symbol space. The target property in the meaning graph is what characterizes the nature of the reasoning algorithm, e.g., whether two nodes are connected. While there are many flavors of reasoning (including multi-hop reasoning), in this first study we explore a common primitive shared among various reasoning formalisms; namely, the connectivity problem between a pair of nodes in an undirected graph in the meaning space, while observing its noisy version in the symbol space. This simplification clarifies the exposition and the analysis, and we expect similar results to hold for a broader class of reasoning algorithms that rely on connectivity. (This chapter is based on the following publication: Khashabi et al. (2019).)

Figure 27: The meaning space contains [clean and unique] symbolic representations and the facts, while the symbol space contains [noisy, incomplete and variable] representations of the facts. We show sample meaning and symbol space nodes to answer the question: Is a metal spoon a good conductor of heat?

Figure 27 illustrates a reasoning setting where the semantics of the edges is included. Most humans understand that V1: "present day spoons" and V2: "the metal spoons" are equivalent nodes (i.e., have the same meaning). However, a machine has to infer this understanding. The semantics of the connections between nodes are expressed through natural language sentences. For example, connectivity could express the semantic relation between two nodes: has-property(metal, thermal-conductor). However, a machine may find it difficult to infer this fact from, say, reading text over the Internet, as it may be expressed in many different ways, e.g., it can be found in a sentence like "dense materials such as [V3:] metals and stones are [V5:] good conductors of heat".

To ground this in existing efforts, consider multi-hop reasoning for QA systems (Khashabi et al., 2016; Jansen et al., 2018). Here the reasoning task is to connect local information, via multiple local "hops", in order to arrive at a conclusion. In the meaning graph, one can trace a path of locally connected nodes to verify the correctness of a query; for example, the query has-property(metal-spoon, thermal-conductor) can be verified by tracing a sequence of nodes, as shown in Figure 27. In other words, answering queries can be cast as inferring the existence of a path connecting two nodes $m$ and $m'$. While doing so on the meaning graph is straightforward, doing so on the noisy symbol graph is not. Intuitively, each local "hop" introduces more noise, allowing reliable inference to be performed only when it does not require too many steps in the underlying meaning space. To study this issue, one must quantify the effect of noise accumulation for long-range reasoning.
Contributions.
We believe that this is the first work to provide a mathematical study of the challenges and limitations of reasoning algorithms in the presence of the symbol-meaning mapping challenge. We make three main contributions.

First, we establish a novel, linguistically motivated formal framework for analyzing the problem of reasoning about the ground truth (the meaning space) while operating over a noisy and incomplete linguistic representation (the symbol space). This framework allows one to derive rigorous intuitions about what various classes of reasoning algorithms can and cannot achieve.

Second, we study in detail the connectivity reasoning problem, in particular the interplay between the noise level in the symbol space (due to ambiguity, variability, and missing information) and the distance (in terms of inference steps, or hops) between two elements in the meaning space. We prove that under low noise levels, it is indeed possible to perform reliable connectivity reasoning up to a few hops (Theorem 1). On the flip side, even a moderate increase in the noise level makes it difficult to assess the connectivity of elements if they are a logarithmic distance apart in the meaning space (Theorems 2 and 3). This finding is aligned with empirical observations of "semantic drift", i.e., a substantial drop in performance beyond a few (usually 2-3) hops (Fried et al., 2015; Jansen, 2016).

Third, we apply the framework to a subset of a real-world knowledge base, FB15k-237, treated as the meaning graph, illustrating how key noise parameters influence the possibility (or not) of accurately solving the connectivity problem. (This particular grounding is meant to help relate our graph-based formalism to existing applications, and is not the only way of realizing reasoning on graphs.)

8.2. Related Work
Classical views on reasoning.
Philosophers, all the way from Aristotle and Avicenna, were the first to study reasoning and rationalism (Kirk et al., 1983; Davidson, 1992). In modern philosophy, these earlier notions were combined with mathematical logic, resulting in formal theories of reasoning, such as deductive, inductive, and abductive reasoning (Peirce, 1883). Our treatment of reasoning applies to all of these, as they can be modeled and executed using graphical representations.
Reasoning in AI literature.
The AI literature has seen a variety of formalisms for automated reasoning. These include reasoning with logical representations (McCarthy, 1963), semantic networks (Quillan, 1966), frame-semantic-based systems (Fillmore, 1977), and Bayesian networks (Pearl, 1988), among others.

It is widely believed that a key obstacle to progress has been the symbol grounding problem (Harnad, 1990; Taddeo and Floridi, 2005). Our formalism is directly relevant to this issue: we assume that the symbols available to reasoning systems are the result of communicating meaning in natural language. This results in ambiguity, since a given symbol could be mapped to multiple actual meanings, but also in variability (redundancy).
Reasoning for natural language comprehension.
In the context of natural language applications (such as QA), flavors of linguistic theories are blended with the foundation provided by AI. A major roadblock has been the problem of symbol grounding, i.e., grounding free-form text to a higher-level meaning. Example proposals to deal with this issue include extracting semantic parses (Kaplan et al., 1982; Steedman and Baldridge, 2011; Banarescu et al., 2013), linking to knowledge bases (Mihalcea and Csomai, 2007), and mapping to semantic frames (Punyakanok et al., 2004). These methods can be thought of as approximate solutions for grounding symbolic information to some meaning. Roth and Yih (2004) suggested a general abductive framework that addresses this by connecting reasoning to models learned from data; it has been used in multiple NLP reasoning problems (Khashabi et al., 2018b).

For the execution of reasoning with the disambiguated inputs, there are a variety of proposals, e.g., using executable formulas (Reddy et al., 2017; Angeli and Manning, 2014), chaining relations to infer new relations (Socher et al., 2013; McCallum et al., 2017; Khot et al., 2017), and combinations of the aforementioned paradigms (Gardner et al., 2015; Clark et al., 2016). Our analysis covers any algorithm for inferring patterns that can be formulated over graph-based knowledge, e.g., chaining local information, often referred to as multi-hop reasoning (Jansen et al., 2016, 2018; Lin et al., 2018). For example, Jansen et al. (2017) propose structured multi-hop reasoning by aggregating sentential information from multiple knowledge bases. That work shows that while this strategy improves over baselines with no reasoning (showing the effectiveness of reasoning), with aggregation of more than 2-3 sentences the quality declines (showing a limitation of reasoning). Similar observations were also made in (Khashabi et al., 2016). These empirical observations support the theoretical intuition proven in this work.
We start with basic definitions and notation.
Graph Theory.
We denote an undirected graph by $G(V, E)$, where $V$ and $E$ are the sets of nodes and edges, respectively. We use the notations $V_G$ and $E_G$ to refer to the nodes and edges of a graph $G$, respectively. Let $\mathrm{dist}(v_i, v_j)$ be the distance between nodes $v_i$ and $v_j$ in $G$. A simple path (henceforth referred to as just a path) is a sequence of adjacent nodes that does not have repeating nodes. Let $v_i \stackrel{d}{\leadsto} v_j$ denote the existence of a path of length $d$ between $v_i$ and $v_j$. Similarly, $v_i \not\leadsto v_j$ denotes that there is no path between $v_i$ and $v_j$. We define the notion of $d$-neighborhood in order to analyze local properties of graphs:

Definition 4.
For a graph $G = (V, E)$, $s \in V$, and $d \in \mathbb{N}$, the $d$-neighbourhood of $s$ is $\{v \mid \mathrm{dist}(s, v) \le d\}$, i.e., the 'ball' of radius $d$ around $s$. $B(s, d)$ denotes the number of nodes in this $d$-neighborhood, and $B(d) = \max_{s \in V} B(s, d)$.

Finally, a cut $C = (S, T)$ in $G$ is a partition of the nodes $V$ into subsets $S$ and $T$. The size of the cut $C$ is the number of edges in $E$ with one endpoint in $S$ and the other in $T$.

Probability Theory. $X \sim f(\theta)$ denotes a random variable $X$ distributed according to probability distribution $f(\theta)$, parameterized by $\theta$. Given random variables $X \sim \mathrm{Bern}(p)$ and $Y \sim \mathrm{Bern}(q)$, their disjunction $X \vee Y$ is another Bernoulli random variable $\mathrm{Bern}(p \oplus q)$, where $p \oplus q \triangleq 1 - (1-p)(1-q) = p + q - pq$. We will make extensive use of this notation throughout this work.

We introduce two notions of knowledge spaces:
• The meaning space, $M$, is a conceptual hidden space where all the facts are accurate and complete. We assume the knowledge in this space can be represented as an undirected graph, denoted $G_M(V_M, E_M)$. This knowledge is hidden, and representative of the information that exists within human minds.
• The symbol space, $S$, is the space of written sentences, curated knowledge bases, etc., in which knowledge is represented for human and machine consumption. We assume access to a knowledge graph $G_S(V_S, E_S)$ in this space that is an incomplete, noisy, redundant, and ambiguous approximation of $G_M$.

There are interactions between the two spaces: when we read a sentence, we are reading from the symbol space and interpreting it in the meaning space. When writing out our thoughts, we symbolize our thought process by moving it from the meaning space to the symbol space. Figure 26 provides a high-level view of the framework. A reasoning system is not aware of the exact structure and information encoded in the meaning graph. The only information given is the ball assumption, i.e., we assume that each node $m$ is connected to at most $B(m, d)$ many nodes within distance at most $d$. If this bound holds for all the nodes in a graph, we simply write it as $B(d)$. The ball assumption is a simple understanding of the maximum connectivity in the meaning graph, without knowing the details of the connections.

Algorithm 1: Generative construction of knowledge graphs; sampling a symbol knowledge graph $G_S$ given a meaning graph $G_M$.
Input: Meaning graph $G_M(V_M, E_M)$, discrete distribution $r(\lambda)$, edge retention probability $p_+$, edge creation probability $p_-$
Output: Symbol graph $G_S(V_S, E_S)$
  foreach $v \in V_M$ do
    sample $k \sim r(\lambda)$
    construct a collection of new nodes $U$ s.t. $|U| = k$
    $V_S \leftarrow V_S \cup U$; $\mathcal{O}(v) \leftarrow U$
  foreach $(m_1, m_2) \in (V_M \times V_M)$, $m_1 \ne m_2$ do
    $S_1 \leftarrow \mathcal{O}(m_1)$, $S_2 \leftarrow \mathcal{O}(m_2)$
    foreach $e \in S_1 \times S_2$ do
      if $(m_1, m_2) \in E_M$ then
        with probability $p_+$: $E_S \leftarrow E_S \cup \{e\}$
      else
        with probability $p_-$: $E_S \leftarrow E_S \cup \{e\}$
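For readers who prefer code, here is a direct Python sketch of Algorithm 1; the graph containers and the copy-naming scheme are implementation choices, not part of the formal model:

```python
import random
from itertools import combinations

def sample_symbol_graph(V_M, E_M, r, p_plus, p_minus):
    """Sample a symbol graph from a meaning graph, following Algorithm 1.

    V_M: iterable of meaning nodes; E_M: set of undirected meaning edges given
    as (u, v) tuples; r(): draws the number of symbol copies per node;
    p_plus / p_minus: edge retention / noisy edge creation probabilities.
    """
    V_M = list(V_M)
    V_S, E_S, O = set(), set(), {}
    for v in V_M:
        copies = {(v, i) for i in range(r())}     # k fresh symbol nodes for v
        V_S |= copies
        O[v] = copies
    for m1, m2 in combinations(V_M, 2):           # each unordered pair once
        connected = (m1, m2) in E_M or (m2, m1) in E_M
        p = p_plus if connected else p_minus
        for s1 in O[m1]:
            for s2 in O[m2]:
                if random.random() < p:
                    E_S.add(frozenset((s1, s2)))  # undirected symbol edge
    return V_S, E_S, O

# Example: a 5-node path as the meaning graph, 1-3 symbol copies per node.
V_M = list(range(5))
E_M = {(i, i + 1) for i in range(4)}
V_S, E_S, O = sample_symbol_graph(V_M, E_M, r=lambda: random.randint(1, 3),
                                  p_plus=0.9, p_minus=0.01)
```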
Meaning-Symbol mapping.

We define an oracle function $\mathcal{O} : M \to S$ that maps nodes in the meaning space to nodes in the symbol space. When $s \in \mathcal{O}(m)$, with some abuse of notation, we write $\mathcal{O}^{-1}(s) = m$.

Generative Modeling of Symbol Graphs.
We now explain a generative process for constructing symbol graphs. Starting with $G_M$, we sample a symbol graph $G_S \leftarrow \mathrm{ALG}(G_M)$ using a stochastic process, detailed in Algorithm 1. Informally, the algorithm simulates the process of transforming conceptual information into linguistic utterances (web pages, conversations, knowledge bases).

Our stochastic process has three main parameters: (a) the distribution $r(\lambda)$ of the number of replicated symbols to be created for each node in the meaning space; (b) the edge retention probability $p_+$; and (c) the noisy edge creation probability $p_-$. We will discuss later the regimes under which Algorithm 1 generates interesting symbol graphs.

This construction models a few key properties of linguistic representations of meaning. Each node in the meaning space is potentially mapped to multiple nodes in the symbol space, which models redundancy. Incompleteness of knowledge is modeled by the fact that not all meaning-space edges appear in the symbol space (controlled by the parameter $p_+$ in Algorithm 1). There are also edges in the symbol space that do not correspond to any edges in the meaning space and account for noise (controlled by the parameter $p_-$ in Algorithm 1). Next, we introduce a linguistic-similarity-based connection to model ambiguity, i.e., a single node in the symbol graph mapping to multiple nodes in the meaning graph. The ambiguity phenomenon is modeled indirectly via the linguistic-similarity-based connections (discussed next): we view ambiguity as treating (or confusing) two symbol nodes as the same even when they originate from different nodes in the meaning space.

Noisy Similarity Metric.
Similarity metrics are typically used to judge the equivalence of symbolic assertions. Let $\rho : V_S \times V_S \to \{0, 1\}$ be such a metric, where $\rho(s, s') = 1$ denotes the equivalence of two nodes in the symbol graph. Specifically, we define the similarity to be a noisy version of the true node similarity between node pairs:
\[
\rho(s, s') \triangleq
\begin{cases}
1 - \mathrm{Bern}(\varepsilon_+) & \text{if } \mathcal{O}^{-1}(s) = \mathcal{O}^{-1}(s'), \\
\mathrm{Bern}(\varepsilon_-) & \text{otherwise},
\end{cases}
\]
where $\varepsilon_+, \varepsilon_- \in (0, 1)$ are the noise parameters of the similarity function, both typically close to zero. Intuitively, the similarity function is a perturbed version of the ground-truth similarities, with small random noise (parameterized by $\varepsilon_+$ and $\varepsilon_-$). Specifically, with high probability $1 - \varepsilon_{+/-}$, it returns the correct similarity decision (i.e., whether two symbols have the same meaning), and with low probability $\varepsilon_{+/-}$ it returns an incorrect similarity decision. In particular, $\varepsilon_+ = \varepsilon_- = 0$ models a perfect similarity metric. In practice, even the best entailment/similarity systems have some noise (modeled as $\varepsilon_{+/-} > 0$). We assume that reasoning algorithms have access to the symbol graph $G_S$ and the similarity function $\rho$, and that they use the following procedure to verify the existence of a connection between two nodes:

function NodePairConnectivity($s$, $s'$)
  return $(s, s') \in E_S$ or $\rho(s, s') = 1$
end function
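Continuing the sketch from Algorithm 1, the noisy similarity and the connectivity check can be written as follows (O_inv denotes the inverse oracle mapping, here stored as a dictionary):

```python
import random

def noisy_similarity(s1, s2, O_inv, eps_plus, eps_minus):
    """The perturbed similarity rho: correct with probability 1 - eps, flipped
    with probability eps, where eps depends on whether the nodes share a meaning."""
    same_meaning = O_inv[s1] == O_inv[s2]
    if same_meaning:
        return 0 if random.random() < eps_plus else 1
    return 1 if random.random() < eps_minus else 0

def node_pair_connectivity(s1, s2, E_S, rho):
    """NodePairConnectivity: a direct symbol edge or a positive similarity call."""
    return frozenset((s1, s2)) in E_S or rho(s1, s2) == 1

# Example wiring with the sampler sketched earlier (O_inv maps each symbol copy
# back to its meaning node):
# O_inv = {s: m for m, copies in O.items() for s in copies}
# rho = lambda a, b: noisy_similarity(a, b, O_inv, eps_plus=0.05, eps_minus=0.05)
```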
There are many corner cases that result in uninteresting meaning or symbol graphs. Below we define the regime of realistic instances:

Definition 5 (Nontrivial Graph Instances). A pair (G_M, G_S) of a meaning graph and a symbol graph sampled from it is non-trivial if it satisfies:
1. non-zero noise, i.e., p−, ε−, ε+ > 0;
2. incomplete information, i.e., p+ < 1;
3. noise does not dominate the signal, i.e., p− ≪ p+, ε+ < 0.5, and p+ > 0.5;
4. G_M is not overly-connected, i.e., B(d) ∈ o(n), where n is the number of nodes in G_M;
5. G_M is not overly-sparse, i.e., |E_{G_M}| ∈ ω(1).
Henceforth, we will only consider sampling parameters satisfying the above conditions.

Reasoning About Meaning, through Symbols.
While the reasoning engine only sees the symbol graph G_S, it must make inferences about the potential latent meaning graph. Given a pair of nodes V_S := {s, s'} ⊂ V_S in the symbol graph, the reasoning algorithm must then predict properties about the corresponding nodes V_M = {m, m'} = {O^{-1}(s), O^{-1}(s')} in the meaning graph.
We use a hypothesis-testing setup to assess the likelihood of two disjoint hypotheses defined over these meaning nodes: H^1_M(V_M) and H^2_M(V_M). Given observations about the symbol nodes, defined as X_S(V_S), the goal of a reasoning algorithm is to identify which of the two hypotheses about the meaning graph has a higher likelihood of resulting in these observations under the sampling process of Algorithm 1. Formally, we are interested in:

  argmax_{h ∈ {H^1_M(V_M), H^2_M(V_M)}}  P^{(h)}[ X_S(V_S) ]        (8.1)

where P^{(h)}[x] denotes the probability of an event x in the sample space induced by Algorithm 1 on the latent meaning graph G_M when it satisfies hypothesis h.
Since we start with two disjoint hypotheses on G_M, the resulting probability spaces are generally different, making it plausible to identify the correct hypothesis with high confidence. At the same time, with sufficient noise in the sampling process, it can also become difficult for an algorithm to distinguish the two resulting probability spaces (corresponding to the two hypotheses), especially depending on the observations X_S(V_S) used by the algorithm. For example, the distance between the symbol nodes can often be an insufficient indicator for distinguishing these hypotheses. We will explore these two contrasting behaviors in the next section.

Definition 6 (Reasoning Problem). The input for an instance P of the reasoning problem is a collection of parameters that characterize how a symbol graph G_S is generated from a (latent) meaning graph G_M, two hypotheses H^1_M(V_M), H^2_M(V_M) about G_M, and available observations X_S(V_S) in G_S. The reasoning problem, P(p+, p−, ε+, ε−, B(d), n, λ, H^1_M(V_M), H^2_M(V_M), X_S(V_S)), is to map the input to the hypothesis h as per Eq. (8.1).
We use the following notion to measure the effectiveness of the observation X_S in distinguishing between the two hypotheses as in Eq. (8.1):

Definition 7 (γ-Separation).
For γ ∈ [0, 1] and a problem instance P with two hypotheses h1 = H^1_M(V_M) and h2 = H^2_M(V_M), we say an observation X_S(V_S) in the symbol space γ-separates h1 from h2 if:

  P^{(h1)}[ X_S(V_S) ] − P^{(h2)}[ X_S(V_S) ] ≥ γ.

One can view γ as the gap between the likelihoods of the observation X_S(V_S) having originated from a meaning graph satisfying hypothesis h1 vs. one satisfying hypothesis h2. When γ = 1, X_S(V_S) is a perfect discriminator for distinguishing h1 and h2. In general, any positive γ bounded away from 1 yields a valuable observation.
Given an observation X_S that γ-separates h1 and h2, there is a simple algorithm that distinguishes h1 from h2:

  function Separator_{X_S}(G_S, V_S = {s, s'})
      if X_S(V_S) = 1 then return h1 else return h2
  end function

Importantly, this algorithm does not compute the probabilities in Definition 7. Rather, it works with a particular instantiation G_S of the symbol graph. We refer to such an algorithm A as γ-accurate for h1 and h2 if, under the sampling choices of Algorithm 1, it outputs the 'correct' hypothesis with probability at least γ; that is, for both i ∈ {1, 2}: P^{(h_i)}[A outputs h_i] ≥ γ.

Proposition 1.
If observation X_S γ-separates h1 and h2, then algorithm Separator_{X_S} is γ-accurate for h1 and h2.

Proof.
Let A denote Separator_{X_S} for brevity. Combining the γ-separation of X_S with how A operates, we obtain:

  P^{(h1)}[A outputs h1] − P^{(h2)}[A outputs h1] ≥ γ
  ⇒ P^{(h1)}[A outputs h1] + P^{(h2)}[A outputs h2] ≥ 1 + γ.

Since each term on the left is bounded above by 1, each of them must also be at least γ.

In the rest of this work, we will analyze when one can obtain a γ-accurate algorithm, using γ-separation of the underlying observation as a tool for the analysis. If the above probability gap is negative, one can instead use the complement of X_S(V_S) for γ-separation. (For simplicity, we assume throughout that the replication distribution r is deterministic, i.e., P[|U| = λ] = 1.)
One simple but often effective approach for reasoning is to focus on connectivity (as described in Figure 27). Specifically, we consider reasoning chains as valid if they correspond to a short path in the meaning space, and invalid if they correspond to disconnected nodes. Given nodes m, m' ∈ G_M, this corresponds to two possible hypotheses:

  h1 = m →^d m',   and   h2 = m ↛ m'.

We refer to distinguishing between these two worlds as the d-connectivity reasoning problem. While we consider two extreme hypotheses for our analysis, we find that with a small amount of noise, even these extreme hypotheses can be difficult to distinguish.
For the reasoning algorithm, one natural observation that can be used is the connectivity of the symbol nodes in G_S. Existing models of multi-hop reasoning (Khot et al., 2017) use similar features to identify valid reasoning chains. Specifically, we consider the observation that there is a path of length at most d̃ between s and s':

  X^{d̃}_S(s, s') = s →^{≤ d̃} s'.

The corresponding connectivity algorithm is Separator_{X^{d̃}_S}, which we would like to be γ-accurate for the two hypotheses under consideration. Next, we derive bounds on γ for these specific hypotheses and observation. Note that while the space of possible hypotheses and observations is large, the above natural and simple choices still allow us to derive valuable intuitions about the limits of reasoning.
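A minimal sketch of this connectivity-based separator in Python with networkx: it checks for a path of length at most d_tilde in a sampled symbol graph and outputs the corresponding hypothesis. For simplicity it uses only the explicit symbol edges (not the similarity calls), and the function names are illustrative.

  import networkx as nx

  def connectivity_separator(G_S, s1, s2, d_tilde):
      """Return 'h1' (short path in meaning space) if s1 reaches s2 within
      d_tilde hops in the symbol graph, otherwise 'h2' (disconnected)."""
      reachable = nx.single_source_shortest_path_length(G_S, s1, cutoff=d_tilde)
      return 'h1' if s2 in reachable else 'h2'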
8.5.1. Possibility of Accurate Connectivity Reasoning

We begin by defining the following accuracy threshold, γ*, as a function of the parameters for sampling a symbol graph:

Definition 8. Given n, d ∈ N and symbol graph sampling parameters p+, p−, ε+, ε−, λ, define γ*(n, d, p+, p−, ε+, ε−, λ) as

  γ* ≜ (1 − (1 − (p+ ⊕ ε−))^{λ²})^d · (1 − 2e³ ε+^{λ/2})^{d+1} − 2 e n (λ B(d))² (p− ⊕ ε−).

This expression is somewhat difficult to unpack. Nevertheless, as one might expect, the accuracy threshold γ* increases (higher accuracy) as p+ increases (higher edge retention) or ε+ decreases (fewer dropped connections between replicas). As λ increases (higher replication), the impact of noise on the edges between node clusters decreases, and as d decreases (shorter paths), the accuracy threshold likewise increases.
The following theorem (see the Appendix for a proof) establishes the possibility of a γ-accurate algorithm for the connectivity problem:

Theorem 1.
Let p+, p−, ε+, ε−, λ be the parameters of the sampling process in Algorithm 1 on a meaning graph with n nodes. Let d ∈ N and d̃ = d(1 + λ). If p− and d satisfy

  (p− ⊕ ε−) · (λ B(d))² ≤ 1/(2en),

and γ = max{0, γ*(n, d, p+, p−, ε+, ε−, λ)}, then the connectivity algorithm Separator_{X^{d̃}_S} is γ-accurate for the d-connectivity problem.

Proof idea.
The proof consists of two steps: first, show that for the assumed choice of parameters, connectivity in the meaning space is recoverable in the symbol space with high probability; then, show that spurious connectivity in the symbol space (with no meaning-space counterpart) has low probability.
Corollary 1. (Informal) If p−, ε−, d, and γ are small enough, then the connectivity algorithm Separator_{X^{d̃}_S} with d̃ = d(1 + λ) is γ-accurate for the d-connectivity problem.

We next show that as d, the distance between two nodes in the meaning space, increases, it becomes unlikely that we will be able to make any inference about their connectivity by assessing the connectivity of the corresponding symbol-graph nodes. More specifically, if d is at least logarithmic in the number of nodes in the graph, then, even for relatively small amounts of noise, the algorithm will see all node-pairs as connected within distance d̃; hence any informative inference will be unlikely.

Theorem 2.
Let c > 1, and let p−, ε−, λ be parameters of the sampling process in Algorithm 1 on a meaning graph G_M with n nodes. Let d ∈ N and d̃ = λd. If

  p− ⊕ ε− ≥ c/(λn)   and   d ∈ Ω(log n),

then the connectivity algorithm Separator_{X^{d̃}_S} almost surely infers any node-pair in G_M as connected, and is thus not γ-accurate for any γ > 0 for the d-connectivity problem.

Proof idea.
One can show that, for the given choice of parameters, noisy edges dominate the informative ones and the symbol graph becomes densely connected (i.e., one cannot distinguish actual connectivity from spurious connectivity).
This result exposes an inherent limitation of multi-hop reasoning: even for small amounts of noise, the diameter of the symbol graph becomes very small, namely, logarithmic in n. This resembles similar observations in various contexts, commonly known as the small-world phenomenon. This principle states that in many real-world graphs, nodes are all linked by short chains of acquaintances, such as "six degrees of separation" (Milgram, 1967; Watts and Strogatz, 1998). Our result affirms that if NLP reasoning algorithms are not designed carefully, such macro behaviors will necessarily become bottlenecks.
We note that the preconditions of Theorems 1 and 2 are disjoint; that is, the two results never apply simultaneously.
Since B(·) ≥ 1 and λ ≥ 1, Theorem 1 requires p− ⊕ ε− ≤ 1/(2eλ²n) < 1/(λn), whereas Theorem 2 applies when p− ⊕ ε− ≥ c/(λn) > 1/(λn).

While in the previous section we showed limitations of multi-hop reasoning in inferring long-range relations, here we extend the argument to prove the difficulty for any reasoning algorithm. Our exposition is algorithm independent; in other words, we make no assumption about the choice of the observation X_S(s, s') in Equation 8.1. In our analysis we use spectral properties of graphs to quantify the local information available within them.

Figure 28: The construction considered in Definition 9. The node-pair m, m' is connected with distance d in G_M and disconnected in G'_M after dropping the edges of a cut C. For each symbol graph, we consider its "local" Laplacian.

Consider a meaning graph G_M in which two nodes m and m' are connected. We drop the edges of a min-cut C to make the two nodes disconnected, obtaining G'_M (Figure 28).

Definition 9.
Define a pair of meaning graphs G and G', both of size n and satisfying the ball assumption B(d), with the following properties: (1) m →^d m' in G; (2) m ↛ m' in G'; (3) E_{G'} ⊂ E_G; (4) C = E_G \ E_{G'} is an (m, m') min-cut of G.
We define a uniform distribution over all the instances that satisfy the construction explained in Definition 9:

Definition 10.
We define a distribution 𝒢 over pairs of possible meaning graphs G, G' and pairs of nodes m, m' which satisfy the requirements of Definition 9. Formally, 𝒢 is a uniform distribution over the following set: {(G, G', m, m') | G, G', m, m' satisfy Definition 9}.
From these meaning graphs we sample the symbol graphs G_S and G'_S, as denoted in Figure 28. In the sampling of G_S and G'_S, all the edges share the same randomization, except for the ones that correspond to C (i.e., the difference between G_M and G'_M). Let U be the union of the nodes involved in the d̃-neighborhoods of s and s' in G_S and G'_S, and define L, L' to be the Laplacian matrices corresponding to the nodes of U. As n grows, the two Laplacians become less distinguishable whenever p− ⊕ ε− and d are large enough:

Lemma 1.
Let c > 1, and let p−, λ be parameters of the sampling process in Algorithm 1 on a pair of meaning graphs G and G' on n nodes constructed according to Definition 9. Let d ∈ N, d̃ ≥ λd, and let L, L' be the Laplacian matrices for the d̃-neighborhoods of the corresponding nodes in the sampled symbol graphs G_S and G'_S. If

  p− ⊕ ε− ≥ c log n / n   and   d > log n,

then, with high probability, the two (normalized) Laplacians are close:

  ‖L̃ − L̃'‖ ≤ 6 λ√λ B(1) / √(n log(nλ)).

This can be used to show that, for such large enough p− and d, the two symbol graphs G_S and G'_S sampled as above are, with high probability, indistinguishable by any function operating over a λd-neighborhood of s, s' in G_S.
A reasoning function can be thought of as a mapping defined on normalized Laplacians, since they encode all the information in a graph. For a reasoning function f with limited precision, the input space can be partitioned into regions where the function is constant; and for large enough values of n, both L̃ and L̃' (with high probability) fall into regions where f is constant. Note that a reasoning algorithm is oblivious to the details of C, i.e., it does not know where C is or where it has to look for the changes. Therefore, a realistic algorithm ought to use the neighborhood information collectively.
In the next lemma, we characterize the reasoning function as a function f that maps Laplacian information to binary decisions. We then prove that, for any such function, there are regimes in which the function cannot distinguish L̃ from L̃':

Lemma 2. Let the meaning and symbol graphs be constructed under the conditions of Lemma 1. Let β > 0 and let f : R^{|U|×|U|} → {0, 1} be the indicator function of an open set. Then there exists n₀ ∈ N such that for all n ≥ n₀:

  P_{(G,G',m,m') ∼ 𝒢; G_S ← ALG(G), G'_S ← ALG(G')} [ f(L̃) = f(L̃') ] ≥ 1 − β.

This yields the following result:
Theorem 3.
Let c > 1, and let p−, ε−, λ be parameters of the sampling process in Algorithm 1 on a meaning graph G_M with n nodes. Let d ∈ N. If

  p− ⊕ ε− > c log n / (λn)   and   d > log n,

then there exists n₀ ∈ N such that for all n ≥ n₀, no algorithm can distinguish, with high probability, between two nodes in G_M having a d-path and being disconnected; thus no algorithm is γ-accurate for any γ > 0 for the d-connectivity problem.

Proof idea.
The proof uses Lemma 2 to show that, for the given choice of parameters, the informative paths are indistinguishable from the spurious ones with high probability.
This reveals a fundamental limitation: under noisy conditions, our ability to infer interesting phenomena in the meaning space is limited to a small, logarithmic neighborhood.
Our formal analysis thus far provides worst-case bounds for two regions in the rather large spectrum of noisy sampling parameters for the symbol space, namely, when p− ⊕ ε− and d are either both small (Theorem 1) or both large (Theorem 2).
This section complements the theoretical findings in two ways: (a) by grounding the formalism empirically in a real-world knowledge graph, and (b) by quantifying the impact of noisy sampling parameters on the success of the connectivity algorithm. We use ε− = 0 for these experiments, but the effect is identical as long as p− ⊕ ε− stays unchanged (see Remark 1 in the Appendix).
Specifically, we consider FB15k-237 (Toutanova and Chen, 2015), containing a set of ⟨head, relation, target⟩ triples from a curated knowledge base, Freebase (Bollacker et al., 2008). For scalability, we use a subset that relates to the movies domain (specifically, relations beginning with /film/), resulting in 2855 distinct entity nodes and 4682 relation edges. We treat this as the meaning graph and sample a symbol graph as per Algorithm 1 to simulate the observed graph derived from text.
We sample symbol graphs for various values of p− and plot the resulting symbol- and meaning-graph distances in Figure 29. For every value of p− (y-axis), we sample points in the meaning graph separated by distance d (x-axis). For these points, we compute the average distance between the corresponding symbol nodes, and indicate it in the heat map using color shades.

Figure 29: Symbol-graph distances for node-pairs at meaning-graph distance d (x-axis), as the noise parameter p− (y-axis) is varied. The goal is to distinguish squares in the column for a particular d from the corresponding squares in the right-most column, which corresponds to node-pairs being disconnected. This is easy in the bottom-left regime and becomes progressively harder as we move upward (more noise) or rightward (higher meaning-graph distance). (ε+ fixed at a small value; λ = 3.)

We make two observations from this simulation. First, for lower values of p−, disconnected nodes in the meaning graph (rightmost column) are clearly distinguishable from meaning nodes with short paths (small d), as predicted by Theorem 1, but harder to distinguish from nodes at large distances (large d). Second, and in contrast, for higher values of p−, almost every pair of symbol nodes is connected by a very short path (dark color), making it impossible for a distance-based reasoning algorithm to confidently assess d-connectivity in the meaning graph. This simulation also confirms our finding in Theorem 2: once p− exceeds roughly 1/(λn) (about 10⁻⁴ for this graph), almost all node-pairs appear connected within a short distance, regardless of their meaning-graph distance.
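The simulation is straightforward to reproduce in outline. The sketch below assumes the movie-domain triples are available as a whitespace-separated edge list (the file name is hypothetical), reuses the sample_symbol_graph helper sketched earlier, and uses networkx shortest paths; it illustrates the procedure rather than reproducing the original experimental code.

  import networkx as nx

  def load_meaning_graph(edge_file="fb15k237_film_edges.txt"):
      """Build an undirected meaning graph from <head relation tail> lines (hypothetical file)."""
      G_M = nx.Graph()
      with open(edge_file) as f:
          for line in f:
              head, _, tail = line.split()[:3]
              G_M.add_edge(head, tail)
      return G_M

  def average_symbol_distance(G_M, lam, p_plus, p_minus, node_pairs, max_dist=10):
      """Average symbol-graph distance for given meaning-node pairs (capped at max_dist)."""
      G_S, O = sample_symbol_graph(G_M, lam, p_plus, p_minus)
      dists = []
      for m1, m2 in node_pairs:
          s1, s2 = O[m1][0], O[m2][0]
          try:
              d = nx.shortest_path_length(G_S, s1, s2)
          except nx.NetworkXNoPath:
              d = max_dist
          dists.append(min(d, max_dist))
      return sum(dists) / len(dists)

Sweeping p_minus over a grid and calling average_symbol_distance for node pairs grouped by their meaning-graph distance yields the kind of heat map shown in Figure 29.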
Our work is inspired by empirical observations of "semantic drift" in reasoning algorithms as the number of hops is increased. A series of works share this empirical observation; for example, Fried et al. (2015) show modest benefits up to 2-3 hops, followed by decreasing performance, and Jansen (2016) and Jansen et al. (2018) made similar observations in graphs built out of larger structures such as sentences, where the performance drops off around 2 hops. This pattern has interestingly been observed in a number of results with a variety of representations, including word-level representations, graphs, and traversal methods. The question we are after in this work is whether the field might be hitting a fundamental limit on multi-hop information aggregation using existing methods and noisy knowledge sources.
Our "impossibility" results are reaffirmations of the empirical intuition in the field. They mean that multi-hop inference (and any algorithm that can be cast in that form), as we have been approaching it, is exceptionally unlikely to breach the few-hop barrier predicted in our analysis.
There are at least two practical lessons:
1. There are several efforts in the field pursuing "very long" multi-hop reasoning. Our results suggest that such efforts, especially without a careful understanding of the limitations, are unlikely to succeed unless some fundamental building blocks are altered.
2. A corollary of this observation is that, due to the limited number of hops, practitioners must focus on richer representations that allow reasoning with only a "few" hops. This, in part, requires higher-quality abstraction and grounding mechanisms. It also points to alternatives, such as offline KB completion/expansion, which indirectly reduce the number of steps needed at inference time. In short, ambiguity and variability must be handled well to reduce the number of hops needed.
Finally, we note that our proposed framework applies to any machine comprehension task over natural text that requires multi-step decision making, such as multi-hop QA or textual entailment.

CHAPTER 9: Summary and Future Work

ﻥﻣ ﻪﻧ ﻭ ﯽﻧﺍﺩ ﻭﺗ ﻪﻧ ﺍﺭ ﻝﺯﺍ ﺭﺍﺭﺳﺍ ﻥﻣ ﻪﻧ ﻭ ﯽﻧﺍﻭﺧ ﻭﺗ ﻪﻧ ﺎﻣﻌﻣ ﻑﺭﺣ ﻥﻳﻭ ﻭﺗ ﻭ ﻥﻣ ﯼﻭﮕﺗﻔﮔ ﻩﺩﺭﭘ ﺱﭘ ﺯﺍ ﺕﺳﻫ ﻥﻣ ﻪﻧ ﻭ ﯽﻧﺎﻣ ﻭﺗ ﻪﻧ ﺩﺗﻓﺍ ﺭﺩ ﻩﺩﺭﭘ ﻥﻭﭼ
There was a Door to which I found no Key
There was a Veil past which I could not see:
Some little Talk awhile of ME and THEE
There seemed--and then no more of THEE and ME.
— Omar Khayyam, Rubaiyat, 1120 CE

ﺍﺭﺗ ﻡﻳﻧﻳﺑﻧ ﻭ ﻡﻳﻧﻳﺑﺑ ﻭﺗ ﻡﻟﺎﻋ
We see the world through you and yet we don't see you. (Rumi)
This thesis aims at progressing towards natural language understanding by means of the task of question answering. This chapter gives a summary of our contributions across this document and outlines a few directions along which we would like to extend this work.
We start the discussion in Chapter 2 by providing a thorough review of the past literature concerning NLU, highlighting the works that are related to this thesis.
Chapter 3 studies reasoning systems for question answering on elementary-school science exams, using a semi-structured knowledge base. We treat QA as a subgraph selection problem and then formulate it as an ILP optimization. Most importantly, this formulation allows multiple, semi-formally expressed facts to be combined to answer questions, a capability outside the scope of IR-based QA systems. In our experiments, this approach significantly outperforms both the previous best attempt at structured reasoning for this task and an IR engine provided with the same knowledge. Our effort has had considerable impact since publication: it has inspired others to build systems based on our design and to improve the state of the art in other domains; for instance, Khot et al. (2017) use similar ideas for reasoning with OpenIE tuples (Etzioni et al., 2008). In addition, the system has been incorporated into the Allen Institute's reading-comprehension project (https://allenai.org/aristo/) and has been shown to give a significant boost to its performance (Clark et al., 2016). Even after a couple of years, the system has been shown to be among the best systems on a recently-proposed reading comprehension task (Clark et al., 2018).
Chapter 6 introduces a question answering dataset that requires reasoning over multiple sentences. This dataset contains several thousand questions from different domains and a wide variety of complexities. We have shown a significant performance difference between humans and state-of-the-art systems, and we hope that this performance gap will encourage the community to work towards more sophisticated reasoning systems. It is encouraging to see that the work has already been used in several follow-up works (Sun et al., 2019; Trivedi et al., 2019; Wang et al., 2019).
Chapter 7 offers a question answering dataset dedicated to temporal common sense understanding. We show that systems equipped with state-of-the-art techniques are still far behind human performance. We hope that the dataset will bring more attention to the study of common sense (especially in the context of understanding time).
In Chapter 8, we develop a theoretical formalism to investigate fundamental limitations pertaining to multi-step reasoning in the context of natural language problems. We present the first analysis of reasoning in the context of properties like ambiguity, variability, incompleteness, and inaccuracy. We show that multi-hop inference (and any algorithm that can be cast in that form), as we have been approaching it, is exceptionally unlikely to breach the few-hop barrier predicted in our analysis. Our results suggest that such efforts, especially without a careful understanding of the limitations, are unlikely to succeed unless some fundamental building blocks are altered. A corollary of this observation is that practitioners must focus on richer representations that allow reasoning with only a "few" hops. This, in part, requires higher-quality abstraction and grounding mechanisms. In other words, ambiguity and variability must be handled well to reduce the number of hops needed.

This thesis has taken a noticeably distinct approach towards a few important problems in the field and has shown progress on multiple fronts.
For example, the formalisms of Chapters 3 and 4 are novel and provide general ways to formalize and implement reasoning algorithms. The datasets of Chapters 6 and 7 are distinct from the many QA datasets in the field. The theoretical analysis of Chapter 8 takes a uniquely distinct formal view of reasoning in the context of natural language.
All that said, there are many issues that are not addressed as extensively as we could have (or should have), and there are aspects that turned out slightly differently from what we initially expected.
Looking back at the reasoning formalism of Chapter 4, we underestimated the hardness of extracting the underlying semantic representations. Even though the field has made significant progress on low-level NLP tasks (like SRL or coreference), such tasks still suffer from brittleness and lack of transfer across domains. Brittleness in the extraction of such annotations results in exponentially larger errors when reasoning with them (as also suggested by the theoretical observations of Chapter 8); in practice, it worked well only for short-range chains (1, 2, and sometimes 3 hops). With more recent progress in unsupervised representations and improvements in semantic extraction systems, my hope is to revisit these ideas in the coming years and address the remaining challenges.
A vision that I would like to pursue (influenced by discussions with my advisor) is reasoning with minimal data. We (humans) are able to perform the same reasoning over many high-level concepts and are able to transfer it across all sorts of domains: for instance, an average human uses the same inductive reasoning to conclude that the sky is blue and to infer that there is another number after every number. Effective (unsupervised) representations could potentially need a huge amount of data (and many parameters), but successful reasoning systems will likely need very minimal data (and very simple, but general, definitions).
Over the past years, the field has witnessed a wave of activity on unsupervised language models (Peters et al., 2018; Devlin et al., 2018). There are many questions with respect to the success of such models on several datasets: for instance, what kinds of reasoning are they capable of? What is it that they are missing? And how can we address these gaps, possibly by creating hybrid systems? What is clear is that these systems will offer increasingly richer representations of meaning; we need better ways to understand what these systems are capable of and in which scenarios they can be used. In conjunction with understanding their capabilities and limitations, we have to build reasoning algorithms on top of them. It is unlikely that these tools alone will ever be enough to solve all of our challenges; one has to equip these representations with the ability to reason, especially when they face an unusual/unseen scenario.
In Chapter 5 (essential terms), an initial motivation was to model knowing what we don't know (Rajpurkar et al., 2018); basically, systems should be able to infer whether they have enough confidence about the answer to a given query before acting. In hindsight, I think our supervised system ended up using too many shallow features, which didn't end up generalizing to tricky instances.
Additionally, it would have been better if the essentiality decision were more tightly integrated into the reasoning systems (rather than an independently supervised classifier, which limited its domain transfer).
The datasets of Chapters 6 and 7 are critical parts of this thesis which, I suspect, are likely to be remembered longer than the rest of the chapters. In general, the construction of datasets (including the ones we described) is a laborious task, and it is unfortunate that many small empirical details are usually left out. It is not clear to me whether using static datasets is the best way for the road ahead. In the future, I hope that the field discovers more effective ways of measuring progress towards NLU.
A key issue contributing to the complexity of NLU (and question answering) is the set of implied information (common sense). We touch upon a class of such understanding in Chapter 7, where we introduce a dataset for such problems. A natural next step is addressing such questions and exploring the many ways we can incorporate such understanding into models.
The analysis of Chapter 8 is uniquely distinct within the field. That said, there are many issues that leave me unsatisfied about our current attempt. In particular, there are many assumptions that may or may not stand the test of time (e.g., the generative construction of the symbol graph from the meaning graph, or connectivity reasoning as a proxy for actual reasoning in language). And there are some important reasoning phenomena missing from this formalism: conditional reasoning, transitivity and directionality, and inductive reasoning, just to name a few. In general, our (the field's) understanding of "reasoning" (and its formalisms) is very limited, and the existing formalisms are not easily applicable, since those who formalized reasoning were not intimately aware of the complexity of NLU; they were philosophers and mathematicians. In practice, it is really hard to make the existing theories of reasoning work in the presence of many of the properties of language. In the coming years, I would like to see more effort on reconciling the issues at the interface of "language" and "reasoning".
APPENDIX

A.1. Supplementary Details for Chapter 3
A.1.1. The ILP Model for TableILP
Variables:
We start with a brief overview of the basic variables and how they are combined into high-level variables.
Figure 30 summarizes the notation used to refer to various elements of the problem, such as t_ijk for cell (j, k) of table i, as defined in Section 3.

Figure 30: Notation for the ILP formulation.
  i — index over tables
  j — index over table rows
  k — index over table columns
  l — index over lexical constituents of the question
  m — index over answer options
  x(.) — a unary variable
  y(., .) — a pairwise variable

We define variables over each element by overloading the x(.) or y(., .) notation, which refer to a binary variable on an element or a pair of elements, respectively. Table 4 contains the complete list of basic variables in the model, all of which are binary. The pairwise variables are defined between pairs of elements; e.g., y(t_ijk, q_l) takes value 1 if and only if the corresponding edge is present in the support graph. Similarly, if a node corresponding to an element of the problem is present in the support graph, we refer to that element as being active.
In practice we do not create pairwise variables for all possible pairs of elements; instead we create pairwise variables only for edges whose entailment score exceeds a threshold. For example, we create the pairwise variables y(t_ijk, t_i'j'k') only if w(t_ijk, t_i'j'k') ≥ MinCellCellAlignment. An exhaustive list of the minimum alignment thresholds for creating pairwise variables is in Table 28.
Table 4 also includes some high-level unary variables, which help conveniently impose structural constraints on the support graph G we seek. An example is the active row variable x(r_ij), which should take value 1 if and only if at least one cell in row j of table i is active.

Table 27: The weights of the variables in our objective function (variables not mentioned here have zero weight).
  Pairwise variables: y(t_ijk, t_i'j'k') = 1; y(t_ijk, t_ij'k') = w(t_ijk, t_ij'k') minus a small constant; y(t_ijk, q_l) = w(q_l, t_ijk); y(h_ik, q_l) = w(q_l, h_ik); y(t_ijk, a_m) = w(t_ijk, a_m); y(h_ik, a_m) = w(h_ik, a_m).
  Unary variables: x(T_i) = 1.0; x(r_ij) = -1.0; x(ℓ_ik) = 1.0; x(h_ik) = 0.3; x(t_ijk) = 0.0; x(q_l) = 0.3.

Objective function:
All of the binary variables defined in our problem are included in the final weighted linear objective function. The weights of the variables in the objective function (i.e., the vector w in Equation 2.1) are set according to Table 27. In addition to the current set of variables, we introduce auxiliary variables for certain constraints. Defining auxiliary variables is a common trick for linearizing more intricate constraints, at the cost of having more variables.

Constraints:
Constraints are a significant part of our model, imposing the desirable behaviors on the support graph (cf. Section 3.1). The complete list of constraints is given in Table 31. While groups of constraints are defined for different purposes, it is hard to partition them into disjoint sets. Here we give examples of some important constraint groups.
Active variable constraints:
An important group of constraints relates variables to each other. The unary variables are defined through constraints that relate them to the basic pairwise variables. For example, the active row variable x(r_ij) should be active if and only if at least one cell in row j is active (constraint A.12, Table 31).

Correctness Constraints:
A simple but important set of constraints enforces basic correctness principles on the final answer. For example, G should contain exactly one answer option, which is expressed by constraint A.24, Table 31. Another example is that G should contain at least a certain number of constituents of the question, which is modeled by constraint A.27, Table 31.

Sparsity Constraints:
Another group of constraints induces simplicity (sparsity) in the output. For example, G should use at most a certain number of knowledge-base tables (constraint A.25, Table 31), since letting the inference use any table could lead to unreasonably long, and likely error-prone, answer chains.
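To illustrate the flavor of this formulation, here is a deliberately tiny ILP sketch in Python using the PuLP library: binary pairwise alignment variables, a weighted linear objective, and two representative constraint groups (exactly one answer option; a cap on the number of active alignments). The variable names, weights, and scores are invented for illustration and this is not the actual TableILP model.

  from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum, value

  # toy alignment scores between two table cells and two answer options (illustrative only)
  score = {("cell1", "optA"): 0.9, ("cell1", "optB"): 0.2,
           ("cell2", "optA"): 0.1, ("cell2", "optB"): 0.7}

  prob = LpProblem("toy_support_graph", LpMaximize)

  # pairwise variables y(cell, option) and unary variables x(option)
  y = {pair: LpVariable(f"y_{pair[0]}_{pair[1]}", cat=LpBinary) for pair in score}
  x = {opt: LpVariable(f"x_{opt}", cat=LpBinary) for opt in ("optA", "optB")}

  # weighted linear objective over the alignment variables
  prob += lpSum(score[p] * y[p] for p in score)

  # correctness: choose exactly one answer option
  prob += lpSum(x[o] for o in x) == 1
  # an alignment edge to an option may be active only if that option is active
  for (cell, opt), var in y.items():
      prob += var <= x[opt]
  # sparsity: use at most two alignment edges overall
  prob += lpSum(y[p] for p in y) <= 2

  prob.solve()
  print({v.name: value(v) for v in prob.variables()})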
A.1.2. Features Used in Solver Combination

To combine the predictions from all the solvers, we learn a logistic regression model (Clark et al., 2016) that returns the probability of an answer option a_i being correct, based on the following features.

Solver-independent features:
Given the solver scores s_j for all the answer options j, we generate the following set of features for the answer option a_i, for each of the solvers:
1. Score = s_i
2. Normalized score = s_i / Σ_j s_j
3. Softmax score = exp(s_i) / Σ_j exp(s_j)
4. Best option, set to 1 if this is the top-scoring option: 1(s_i = max_j s_j)
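These four solver-independent features are straightforward to compute; a small Python sketch (function and variable names are illustrative):

  import numpy as np

  def solver_independent_features(scores, i):
      """Features for answer option i given one solver's scores over all options."""
      s = np.asarray(scores, dtype=float)
      exp_s = np.exp(s - s.max())            # numerically stabilized softmax
      return {
          "score": s[i],
          "normalized_score": s[i] / s.sum(),
          "softmax_score": exp_s[i] / exp_s.sum(),
          "best_option": float(s[i] == s.max()),
      }

  # example: four answer options scored by one solver
  print(solver_independent_features([0.2, 1.5, 0.7, 1.4], i=1))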
TableILP-specific features:
Given the proof graph returned for an option, we generate the following 11 features in addition to the solver-independent features:
1. Average alignment score for question constituents
2. Minimum alignment score for question constituents
3. Number of active question constituents
4. Fraction of active question constituents
5. Average alignment score for the question choice
6. Sum of alignment scores for the question choice
7. Number of active table cells
8. Average alignment score across all the edges
9. Minimum alignment score across all the edges
10. Log of the number of variables in the ILP
11. Log of the number of constraints in the ILP

Table 28: Minimum thresholds used in creating pairwise variables: MinCellCellAlignment, MinCellQConsAlignment, MinTitleQConsAlignment, MinTitleTitleAlignment, MinCellQChoiceAlignment, MinTitleQChoiceAlignment, MinCellQChoiceConsAlignment, MinTitleQChoiceConsAlignment, MinActiveCellAggrAlignment, MinActiveTitleAggrAlignment.

Table 29: Some of the important constants in our model: MaxTablesToChain, qConsCoalignMaxDist, WhichTermSpan, WhichTermMulBoost, MinAlignmentWhichTerm, TableUsagePenalty, RowUsagePenalty, InterTableAlignmentPenalty, MaxAlignmentsPerQCons, MaxAlignmentsPerCell, RelationMatchCoeff, EmptyRelationMatchCoeff, NoRelationMatchCoeff (-5), MaxRowsPerTable, MinActiveQCons, MaxActiveColumnChoiceAlignments, MaxActiveChoiceColumnVars, MinActiveCellsPerRow.

Table 30: All the sets used in the definitions of the constraints in Table 31.
  Basic variables connected to header column k of table i: H_ik = {(h_ik, q_l); ∀l} ∪ {(h_ik, a_m); ∀m}  (A.1)
  Basic variables connected to cell (j, k) of table i: E_ijk = {(t_ijk, t_i'j'k'); ∀i', j', k'} ∪ {(t_ijk, a_m); ∀m} ∪ {(t_ijk, q_l); ∀l}  (A.2)
  Basic variables connected to column k of table i: C_ik = H_ik ∪ ⋃_j E_ijk  (A.3)
  Basic variables connected to row j of table i: R_ij = ⋃_k E_ijk  (A.4)
  Non-choice basic variables connected to row j of table i: L_ij = {(t_ijk, t_i'j'k'); ∀k, i', j', k'} ∪ {(t_ijk, q_l); ∀k, l}  (A.5)
  Non-question basic variables connected to row j of table i: K_ij = {(t_ijk, t_i'j'k'); ∀k, i', j', k'} ∪ {(t_ijk, a_m); ∀k, m}  (A.6)
  Basic variables connected to table i: T_i = ⋃_k C_ik  (A.7)
  Non-choice basic variables connected to table i: N_i = {(h_ik, q_l); ∀l} ∪ {(t_ijk, t_i'j'k'); ∀j, k, i', j', k'} ∪ {(t_ijk, q_l); ∀j, k, l}  (A.8)
  Basic variables connected to question constituent q_l: Q_l = {(t_ijk, q_l); ∀i, j, k} ∪ {(h_ik, q_l); ∀i, k}  (A.9)
  Basic variables connected to option m: O_m = {(t_ijk, a_m); ∀i, j, k} ∪ {(h_ik, a_m); ∀i, k}  (A.10)
  Basic variables in column k of table i connected to option m: M_{i,k,m} = {(t_ijk, a_m); ∀j} ∪ {(h_ik, a_m)}  (A.11)
If any cell in row j of table i is active, the row must be active: x(r_ij) ≥ y(t_ijk, e), ∀(t_ijk, e) ∈ R_ij, ∀i, j, k (A.12)
If row j of table i is active, at least one cell in that row must be active as well: Σ_{(t_ijk,e)∈R_ij} y(t_ijk, e) ≥ x(r_ij), ∀i, j (A.13)
The header of column k should be active if any of the basic variables with one end in this column header is active: x(h_ik) ≥ y(h_ik, e), ∀(h_ik, e) ∈ H_ik, ∀i, k (A.14)
If the header of column k is active, at least one basic variable with one end in the header must be active: Σ_{(h_ik,e)∈H_ik} y(h_ik, e) ≥ x(h_ik), ∀i (A.15)
Column k is active if at least one of the basic variables with one end in this column is active: x(ℓ_ik) ≥ y(t_ijk, e), ∀(t_ijk, e) ∈ C_ik, ∀i, k (A.16)
If column k is active, at least one of the basic variables with one end in this column must be active: Σ_{(t_ijk,e)∈C_ik} y(t_ijk, e) ≥ x(ℓ_ik), ∀i, k (A.17)
If a basic variable with one end in table i is active, the table variable is active: x(T_i) ≥ y(t_ijk, e), ∀(t_ijk, e) ∈ T_i, ∀i (A.18)
If table i is active, at least one of the basic variables with one end in the table must be active: Σ_{(t,e)∈T_i} y(t, e) ≥ x(T_i), ∀i (A.19)
If any of the basic variables with one end in option a_m is on, the option must be active as well: x(a_m) ≥ y(e, a_m), ∀(e, a_m) ∈ O_m (A.20)
If the question option a_m is active, there is at least one active basic element connected to it: Σ_{(e,a_m)∈O_m} y(e, a_m) ≥ x(a_m) (A.21)
If any of the basic variables with one end in constituent q_l is active, the constituent must be active: x(q_l) ≥ y(e, q_l), ∀(e, q_l) ∈ Q_l (A.22)
If the constituent q_l is active, at least one basic variable connected to it must be active: Σ_{(e,q_l)∈Q_l} y(e, q_l) ≥ x(q_l) (A.23)
Choose exactly one option: Σ_m x(a_m) ≤ 1 and Σ_m x(a_m) ≥ 1 (A.24)
The number of active tables is upper-bounded: Σ_i x(T_i) ≤ MaxTablesToChain (A.25)
The number of active rows in each table is upper-bounded: Σ_j x(r_ij) ≤ MaxRowsPerTable, ∀i (A.26)
The number of active constituents in each question is lower-bounded; clearly, we need to use the question definition in order to answer a question: Σ_l x(q_l) ≥ MinActiveQCons (A.27)
A cell is active if and only if the sum of coefficients of all external alignments to it is at least a minimum specified value: Σ_{(t_ijk,e)∈E_ijk} y(t_ijk, e) ≥ x(t_ijk) × MinActiveCellAggrAlignment, ∀i, j, k (A.28)
A title is active if and only if the sum of coefficients of all external alignments to it is at least a minimum specified value: Σ_{(h_ik,e)∈H_ik} y(h_ik, e) ≥ x(h_ik) × MinActiveTitleAggrAlignment, ∀i, k (A.29)
If a column is active, at least one of its cells must be active as well: Σ_j x(t_ijk) ≥ x(ℓ_ik), ∀i, k (A.30)
At most a certain number of columns can be active for a single option: Σ_k y(ℓ_ik, a_m) ≤ MaxActiveChoiceColumn, ∀i, m (A.31)
If a column is active for a choice, the table is active too: x(ℓ_ik) ≤ x(T_i), ∀i, k (A.32)
If a table is active for a choice, there must exist an active column for the choice: x(T_i) ≤ Σ_k x(ℓ_ik), ∀i (A.33)
If a table is active for a choice, there must be some non-choice alignment: y(T_i, a_m) ≤ Σ_{(e,e')∈N_i} y(e, e'), ∀i, m (A.34)
The answer should be present in at most a certain number of tables: y(T_i, a_m) ≤ MaxActiveTableChoiceAlignments, ∀i, m (A.35)
If a cell in a column, or its header, is aligned with a question option, the column is active for that question option as well: y(t_ijk, a_m) ≤ y(ℓ_ik, a_m), ∀i, k, m, ∀(t_ijk, a_m) ∈ M_{i,k,m} (A.36)
If a column is active for an option, there must exist an alignment to a header or a cell in the column: y(ℓ_ik, a_m) ≤ Σ_{(t_ijk,a_m)∈M_{i,k,m}} y(t_ijk, a_m), ∀i, m (A.37)
At most a certain number of columns may be active for a question option in a table: Σ_k y(ℓ_ik, a_m) ≤ MaxActiveChoiceColumnVars, ∀i, m (A.38)
If a column is active for a choice, the table is active for that option as well: y(ℓ_ik, a_m) ≤ y(T_i, a_m), ∀i, k, m (A.39)
If the table is active for an option, at least one column is active for the choice: y(T_i, a_m) ≤ Σ_k y(ℓ_ik, a_m), ∀i, m (A.40)
Create an auxiliary variable x(whichTermIsActive) with objective weight 1.5 and activate it if there is a "which" term in the question: Σ_l 1{q_l = "which"} ≤ x(whichTermIsActive) (A.41)
Create an auxiliary variable x(whichTermIsAligned) with objective weight 1.5; add a boost if at least one of the table cells/titles aligning to the choice has a good alignment (w(.,.) > MinAlignmentWhichTerm) with the "which" terms, i.e., the WhichTermSpan constituents after "which": Σ_i Σ_{(e1,e2)∈T_i} y(e1, e2) ≥ x(whichTermIsAligned) (A.42)
A question constituent may not align to more than a certain number of cells: Σ_{(e,q_l)∈Q_l} y(e, q_l) ≤ MaxAlignmentsPerQCons (A.43)
Disallow aligning a cell to two question constituents if they are too far apart; in other words, add the following constraint if the two constituents q_l and q_l' are more than qConsCoalignMaxDist apart from each other: y(t_ijk, q_l) + y(t_ijk, q_l') ≤ 1, ∀l, l', i, j, k (A.44)
For any two question constituents that are not more than qConsCoalignMaxDist apart, create an auxiliary binary variable x(cellProximityBoost) and set its weight in the objective function to 1/(l − l' + 1), where l and l' are the indices of the two question constituents; with this we boost the objective score if a cell aligns to two question constituents that are within a few words of each other: x(cellProximityBoost) ≤ y(t_ijk, q_l), x(cellProximityBoost) ≤ y(t_ijk, q_l'), ∀i, j, k (A.45)
If a relation match is active, both of the columns for the relation must be active: r(ℓ_ik, ℓ_ik', q_l, q_l') ≤ x(ℓ_ik), r(ℓ_ik, ℓ_ik', q_l, q_l') ≤ x(ℓ_ik') (A.46)
If a column is active, a relation match connecting to the column must be active: x(ℓ_ik) ≤ Σ_{k'} ( r(ℓ_ik, ℓ_ik', q_l, q_l') + r(ℓ_ik', ℓ_ik, q_l, q_l') ), ∀k (A.47)
If a relation match is active, the column cannot align to the question in an invalid position: r(ℓ_ik, ℓ_ik', q_l, q_l') ≤ 1 − y(t_ijk, q̂_l), where q̂_l ≤ q_l and t_ijk ∈ ℓ_ik (A.48)
If a row is active, at least a certain number of its cells must be active: Σ_k x(t_ijk) ≥ MinActiveCellsPerRow × x(r_ij), ∀i, j (A.49)
If a row is active, it must have non-choice alignments: x(r_ij) ≤ Σ_{(n,n')∈L_ij} y(n, n') (A.50)
If a row is active, it must have non-question alignments: x(r_ij) ≤ Σ_{(n,n')∈K_ij} y(n, n') (A.51)
If two rows of a table are active, the corresponding active cell variables across the two rows must match; in other words, the two rows must have identical activity signatures: x(r_ij) + x(r_ij') + x(t_ijk) − x(t_ij'k') ≤ 2, ∀i, j, j', k, k' (A.52)
If two rows are active, then at least one active column in which they differ (in tokenized form) must also be active; otherwise the two rows would be identical in the proof graph: Σ_{t_ijk ≠ t_ij'k} x(ℓ_ik) − x(r_ij) − x(r_ij') ≥ −1, ∀i, j, j' (A.53)
If two tables are simultaneously active, there must be at least one alignment between them: Σ_{j,k,j',k'} y(t_ijk, t_i'j'k') ≥ x(T_i) + x(T_i') − 1, ∀i, i' (A.54)

Table 31: The set of all constraints used in our ILP formulation. The variables are defined in Table 4. More intuition about the constraints is included in Section 3. The sets used in the definition of the constraints are defined in Table 30.

A.2. Supplementary Details for Chapter 8
Here we provide detailed proofs of the formal results, followed by additional experiments. The following observation allows a simplification of the proofs without loss of generality.
Remark 1.
Since our procedure does not treat similarity edges and meaning-to-symbol noise edges differently, we can 'fold' ε− into p− and p+ (by increasing the edge probabilities). More generally, the results are identical whether one uses p+, p−, ε− or p'+, p'−, ε'−, as long as:

  p+ ⊕ ε− = p'+ ⊕ ε'−   and   p− ⊕ ε− = p'− ⊕ ε'−.

In particular, for any p+, p−, and ε−, we can find equivalent parameters with ε'− = 0. Thus, w.l.o.g., in the following analysis we derive results using only p+ and p− (i.e., assume ε− = 0). Note that we expand these terms to p+ ⊕ ε− and p− ⊕ ε−, respectively, in the final results.
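Assuming ⊕ denotes the usual noisy-or combination of two independent edge-generation events (an assumption of this sketch; the operator is defined earlier in the chapter), the folding in Remark 1 can be checked numerically in a couple of lines of Python:

  def noisy_or(p, eps):
      """Probability that at least one of two independent events (prob. p and eps) occurs."""
      return 1.0 - (1.0 - p) * (1.0 - eps)

  # folding eps_minus into the edge probabilities leaves the combined rates unchanged
  p_plus, p_minus, eps_minus = 0.8, 0.05, 0.1
  p_plus_folded = noisy_or(p_plus, eps_minus)    # equivalent p'_+ with eps'_- = 0
  p_minus_folded = noisy_or(p_minus, eps_minus)  # equivalent p'_- with eps'_- = 0
  assert abs(noisy_or(p_plus_folded, 0.0) - noisy_or(p_plus, eps_minus)) < 1e-12
  assert abs(noisy_or(p_minus_folded, 0.0) - noisy_or(p_minus, eps_minus)) < 1e-12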
A.2.1. Proofs: Possibility of Accurate Connectivity Reasoning

In this section we provide the proofs of the lemmas needed for the intermediate results. First we introduce a few useful lemmas, and then move on to the proof of Theorem 1. The following lemmas are used in the connectivity analysis of the node clusters O(m).

Lemma 3 (Connectivity of a random graph (Gilbert, 1959)). Let P_n denote the probability of the event that a random undirected graph G(n, p) (p > 0.5) is connected. This probability can be lower-bounded as follows:

  P_n ≥ 1 − [ q^{n−1} { (1 + q^{(n−2)/2})^{n−1} − q^{(n−2)(n−1)/2} } + q^{n/2} { (1 + q^{(n−2)/2})^{n−1} − 1 } ],

where q = 1 − p.

See Gilbert (1959) for a proof of this lemma. Since q ∈ (0, 0.5), P_n → 1 as n increases. The following corollary provides a simpler version of the above bound:

Corollary 2 (Connectivity of a random graph (Gilbert, 1959)). The random-graph connectivity probability P_n (Lemma 3) can be lower-bounded as follows:

  P_n ≥ 1 − 2e³ q^{n/2}.

Proof.
We use the following inequality:

  (1 + 3/n)^n ≤ e³.

Given that q ≤ 0.5 and n ≥ 1, one can verify that q^{(n−2)/2} ≤ 3/n. Combining this with the above inequality gives us (1 + q^{(n−2)/2})^{n−1} ≤ e³. With this, we bound the two terms of the target inequality:

  (1 + q^{(n−2)/2})^{n−1} − q^{(n−2)(n−1)/2} ≤ e³,
  (1 + q^{(n−2)/2})^{n−1} − 1 ≤ e³,

and hence

  q^{n−1} { (1 + q^{(n−2)/2})^{n−1} − q^{(n−2)(n−1)/2} } + q^{n/2} { (1 + q^{(n−2)/2})^{n−1} − 1 } ≤ e³ q^{n−1} + e³ q^{n/2} ≤ 2e³ q^{n/2},

which concludes the proof.

We next show a lower bound on the probability of s and s' being connected, given the connectivity of their counterpart nodes in the meaning graph. This lemma is used in the proof of Theorem 1:

Lemma 4 (Lower bound).

  P[ s →^{≤d̃} s' | m →^d m' ] ≥ (1 − 2e³ ε+^{λ/2})^{d+1} · (1 − (1 − p+)^{λ²})^d.

Proof.
We know that m and m' are connected through some intermediate nodes m₁, m₂, ..., m_ℓ (ℓ < d). We show a lower bound on the probability of there being a path in the symbol graph between s and s' through the clusters of nodes O(m₁), O(m₂), ..., O(m_ℓ). We decompose this into two kinds of events:
  e₁[v]: for a given meaning node v, its cluster O(v) in the symbol graph is connected;
  e₂[v, u]: for two connected nodes (u, v) in the meaning graph, there is at least one edge connecting their clusters O(u) and O(v) in the symbol graph.
The desired probability can then be factored as:

  P[ s →^{≤d̃} s' | m →^d m' ] ≥ P[ ⋂_{v ∈ {s, m₁, ..., m_ℓ, s'}} e₁[v] ∩ ⋂_{(v,u) ∈ {(s,m₁), ..., (m_ℓ, s')}} e₂[v, u] ] ≥ P[e₁]^{d+1} · P[e₂]^d.

We bound the two probabilities separately. Based on Corollary 2, P[e₁] ≥ 1 − 2e³ ε+^{λ/2}, and as a result P[e₁]^{d+1} ≥ (1 − 2e³ ε+^{λ/2})^{d+1}. The probability of connectivity between a pair of clusters is P[e₂] = 1 − (1 − p+)^{λ²}, and thus, similarly, P[e₂]^d ≥ (1 − (1 − p+)^{λ²})^d. Combining these two, we obtain:

  P[ s →^{≤d̃} s' | m →^d m' ] ≥ (1 − 2e³ ε+^{λ/2})^{d+1} · (1 − (1 − p+)^{λ²})^d.    (A.55)

The connectivity analysis of G_S can be challenging, since the graph is a non-homogeneous combination of positive and negative edges. For the sake of simplifying the probabilistic arguments, given a symbol graph G_S we introduce a (non-unique) simple graph G̃_S as follows.

Definition 11.
Consider a special partitioning of V_{G_S} such that the d̃-neighbourhoods of s and s' form two of the partitions, and the rest of the nodes are arbitrarily partitioned in a way that the diameter of each component does not exceed d̃.
  • The set of nodes V_{G̃_S} of G̃_S corresponds to the aforementioned partitions.
  • There is an edge (u, v) ∈ E_{G̃_S} if and only if at least one node-pair from the partitions of V_{G_S} corresponding to u and v, respectively, is connected in E_{G_S}.
In the following lemma we give an upper bound on the connectivity of neighboring nodes in G̃_S:

Lemma 5.
When G_S is drawn at random, the probability that an edge connects two arbitrary nodes in G̃_S is at most (λ B(d))² p−.

Proof.
Recall that a pair of nodes in G̃_S, say (u, v), are connected when at least one pair of nodes from the corresponding partitions in G_S is connected. Each d-neighbourhood in the meaning graph has at most B(d) nodes, which implies that each partition in G̃_S has at most λ B(d) nodes. Therefore, between each pair of partitions there are at most (λ B(d))² possible edges. By the union bound, the probability of at least one edge being present between two partitions is at most (λ B(d))² p−.
Let v_s, v_{s'} ∈ V_{G̃_S} be the nodes corresponding to the components containing s and s', respectively. The following lemma establishes a relation between the connectivity of s, s' ∈ V_{G_S} and the connectivity of v_s, v_{s'} ∈ V_{G̃_S}:

Lemma 6.

  P[ s →^{≤d̃} s' | m ↛ m' ] ≤ P[ there is a path from v_s to v_{s'} in G̃_S of length at most d̃ ].

Proof.
Let L and R denote the events on the left-hand side and right-hand side, respectively. Also, for a path of nodes in G_S, say p, let F_p denote the event that all the edges of p are present, i.e., L = ⋃_p F_p. Similarly, for a path of nodes in G̃_S, say q, let H_q denote the event that all the edges of q are present. Notice that F_p ⊆ H_{q(p)}, where q(p) is the path in G̃_S induced by p, because if all the edges of p are present then all the edges of q(p) are present. Thus,

  L = ⋃_p F_p ⊆ ⋃_p H_{q(p)} = ⋃_q H_q = R.

This implies that P[L] ≤ P[R].

Lemma 7 (Upper bound). If (λ B(d))² p− ≤ 1/(2en), then

  P[ s →^{≤d̃} s' | m ↛ m' ] ≤ 2 e n (λ B(d))² p−.

Proof.
To establish the upper bound on P[ s →^{≤d̃} s' | m ↛ m' ], recall the definition of G̃_S given an instance of G_S (as outlined in Lemmas 5 and 6), and let p̃ = (λ B(d))² p−. Lemma 6 relates the connectivity of s and s' to a connectivity event in G̃_S, i.e., P[ s →^{≤d̃} s' | m ↛ m' ] ≤ P[ there is a path from v_s to v_{s'} in G̃_S of length at most d̃ ], where v_s, v_{s'} ∈ V_{G̃_S} are the nodes corresponding to the components containing s and s', respectively. Hence, in the following, we prove that the event dist(v_s, v_{s'}) ≤ d̃ happens with small probability:

  P[ dist(v_s, v_{s'}) ≤ d̃ ] = P[ ⋁_{ℓ=1,...,d̃} v_s →^ℓ v_{s'} ]
    ≤ Σ_{ℓ≤d̃} C(n, ℓ) p̃^ℓ ≤ Σ_{ℓ≤d̃} (en/ℓ)^ℓ p̃^ℓ ≤ Σ_{ℓ≤d̃} (en p̃)^ℓ
    ≤ en p̃ · ((en p̃)^{d̃} − 1)/(en p̃ − 1) ≤ en p̃ / (1 − en p̃) ≤ 2 en p̃,

where the final inequality uses the assumption that p̃ ≤ 1/(2en).

Armed with the bounds in Lemmas 4 and 7, we are ready to provide the main proof:

Proof of Theorem 1.
Recall that the algorithm checks for connectivity between two given nodes s and s', i.e., whether s →^{≤d̃} s'. With this observation, we aim to infer whether the two nodes in the meaning graph are connected (m →^{≤d} m') or not (m ↛ m'). We prove the theorem by using the lower and upper bounds for these two probabilities, respectively:

  γ = P[ s →^{≤d̃} s' | m →^d m' ] − P[ s →^{≤d̃} s' | m ↛ m' ]
    ≥ LB( P[ s →^{≤d̃} s' | m →^d m' ] ) − UB( P[ s →^{≤d̃} s' | m ↛ m' ] )
    ≥ (1 − 2e³ ε+^{λ/2})^{d+1} · (1 − (1 − p+)^{λ²})^d − 2 e n (λ B(d))² p−,

where the last two terms of the above inequality are based on the results of Lemmas 4 and 7, with the assumption for the latter that (λ B(d))² p− ≤ 1/(2en). To write this result in its general form, we replace p+ and p− with p+ ⊕ ε− and p− ⊕ ε−, respectively (see Remark 1).

A.2.2. Proofs: Limitations of Connectivity Reasoning
We provide the necessary lemmas and intuitions before proving the main theorem. A random graph is an instance sampled from a distribution over graphs. In the G(n, p) Erdős–Rényi model, a graph is constructed in the following way: each edge is included in the graph with probability p, independent of the other edges. In such graphs, on average, the length of the path connecting any node-pair is short (logarithmic in the number of nodes).

Lemma 8 (Diameter of a random graph, Corollary 1 of (Chung and Lu, 2002)). If n·p = c for some constant c > 1, then almost surely the diameter of G(n, p) is Θ(log n).

We use the above lemma to prove Theorem 2. Note that the overall noise probability (i.e., p in Lemma 8) in our framework is p− ⊕ ε−.

Proof of Theorem 2.
Note that |V_{G_S}| = λ·n. By Lemma 8, the symbol graph has diameter Θ(log λn). This means that for any pair of nodes s, s' ∈ V_{G_S}, we have s →^{Θ(log λn)} s'. Since d̃ ≥ λd ∈ Ω(log λn), the multi-hop reasoning algorithm finds a path between s and s' regardless of the connectivity of m and m'.

A.2.3. Proofs: Limitations of General Reasoning
The proof of the theorem follows after introducing the necessary lemmas.
In the following lemma, we show that the spectral differences between the two symbol graphs in the locality of the target nodes are small. For ease of exposition, we define an intermediate notation for the normalized Laplacians: L̃ = L/‖L‖ and L̃' = L'/‖L'‖.

Lemma 9.
The norm-2 of the Laplacian matrix corresponding to the nodes participating in a cut can be upper-bounded by the number of edges participating in the cut (up to a constant factor).
Proof of Lemma 9.
Using the definition of the Laplacian:

  ‖L_C‖ ≤ ‖A − D‖ ≤ ‖A‖ + ‖D‖,

where A is the adjacency matrix and D is a diagonal matrix with the degrees on the diagonal. We bound the norms of these matrices based on the size of the cut (i.e., the number of edges participating in the cut). For the adjacency matrix we use the Frobenius norm:

  ‖A‖₂ ≤ ‖A‖_F = √( Σ_{ij} a²_ij ) = √(2|C|),

where |C| denotes the number of edges in C. To bound the matrix of degrees, we use the fact that the norm-2 of a diagonal matrix is its biggest diagonal element:

  ‖D‖₂ = σ_max(D) = max_i deg(i) ≤ |C|.

With this we have shown that ‖L_C‖ ≤ √(2|C|) + |C| ≤ 3|C|.

For sufficiently large values of p, G(n, p) is a connected graph with high probability. More formally:

Lemma 10 (Connectivity of random graphs). In a random graph G(n, p), for any p bigger than (1+ε) ln n / n, the graph will almost surely be connected.

The proof can be found in (Erdős and Rényi, 1960).

Lemma 11 (Norm of the Laplacian of a random graph). For a random graph G(n, p), let L be the Laplacian matrix of the graph. For any ε > 0:

  lim_{n→+∞} P( | ‖L‖/√(n log n) − √2 | > ε ) = 0.

Proof of Lemma 11.
For sufficiently large values of $p$, $G(n, p)$ is a connected graph with high probability. More formally:

Lemma 10 (Connectivity of random graphs). In a random graph $G(n, p)$, for any $p$ bigger than $\frac{(1+\varepsilon)\ln n}{n}$, the graph will almost surely be connected.

The proof can be found in (Erdős and Rényi, 1960).

Lemma 11 (Norm of the adjacency matrix in a random graph). For a random graph $G(n, p)$, let $L$ be the adjacency matrix of the graph. For any $\varepsilon > 0$,
$$\lim_{n \to +\infty} \mathbb{P}\left( \left|\, \|L\| - \sqrt{n \log n} \,\right| > \varepsilon \right) = 0.$$

Proof of Lemma 11. From Theorem 1 of (Ding et al., 2010) we know that
$$\frac{\sigma_{\max}(L)}{\sqrt{n \log n}} \;\xrightarrow{\;P\;}\; 1,$$
where $\xrightarrow{\;P\;}$ denotes convergence in probability. Also notice that the norm-2 of a matrix is the size of its biggest eigenvalue, which concludes our proof.

Lemma 12.
For any pair of meaning-graphs $G$ and $G'$ constructed according to Definition 9, and
• $d > \log n$,
• $p^{-} \oplus \varepsilon^{-} \geq c \log n / n$ for some constant $c$,
• $\tilde d \geq \lambda d$,
with $L$ and $L'$ being the Laplacian matrices corresponding to the $\tilde d$-neighborhoods of the corresponding nodes in the surface-graph, we have:
$$\frac{\|L - L'\|}{\|L\|} \;\leq\; \frac{\sqrt{\lambda}\, B(1)}{\sqrt{n \log(n\lambda)}}.$$

Proof of Lemma 12.
In order to simplify the exposition, w.l.o.g. assume that $\varepsilon^{-} = 0$ (see Remark 1). Our goal is to find an upper bound on the fraction $\frac{\|L - L'\|}{\|L\|}$. Note that the Laplacians contain only the local information, i.e., the $\tilde d$-neighborhood.

First we prove an upper bound on the numerator. By eliminating an edge in a meaning-graph, the probability of edge appearance in the symbol graph changes from $p^{+}$ to $p^{-}$. The effective result of removing the edges in $C$ would appear as i.i.d. $\mathrm{Bern}(p^{+} - p^{-})$. Since, by definition, $B(1)$ is an upper bound on the degree of meaning nodes, the size of the minimum cut is also upper-bounded by $B(1)$. Therefore, the maximum size of the min-cut $C$ separating two nodes $m \xrightarrow{d} m'$ is at most $B(1)$. To account for vertex replication in the symbol-graph, the effect of the cut appears on at most $\lambda B(1)$ edges in the symbol graph. Therefore, we have $\|L - L'\| \leq \lambda B(1)$ using Lemma 9.

As for the denominator, the size of the matrix $L$ is the same as the size of the $\tilde d$-neighborhood in the symbol graph. We show that if $\tilde d > \log(\lambda n)$, the neighborhood almost surely covers the whole graph. While the growth in the size of the $\tilde d$-neighborhood is a function of both $p^{+}$ and $p^{-}$, to keep the analysis simple, we underestimate the neighborhood size by replacing $p^{+}$ with $p^{-}$; i.e., the size of the $\tilde d$-neighborhood is lower-bounded by the size of a $\tilde d$-neighborhood in $G(\lambda \cdot n, p^{-})$. By Lemma 10, the diameters of the symbol-graphs $G_S$ and $G'_S$ are both $\Theta(\log(\lambda n))$. Since $\tilde d \in \Omega(\log(\lambda n))$, the $\tilde d$-neighborhood covers the whole graph for both $G_S$ and $G'_S$. Next, we use Lemma 11 to state that $\|L\|$ converges to $\sqrt{\lambda n \log(\lambda n)}$, in probability.

Combining numerator and denominator, we conclude that the fraction, for sufficiently large $n$, is upper-bounded by
$$\frac{\lambda B(1)}{\sqrt{\lambda n \log(\lambda n)}} \;=\; \frac{\sqrt{\lambda}\, B(1)}{\sqrt{n \log(\lambda n)}},$$
which can get arbitrarily small for a big-enough choice of $n$.

Proof of Lemma 1. We start by proving an upper bound on $\tilde L - \tilde L'$ in matrix-inequality notation; a similar upper bound holds for $\tilde L' - \tilde L$, which concludes the theorem.
$$
\begin{aligned}
\tilde L - \tilde L' &= \frac{L}{\|L\|} - \frac{L'}{\|L'\|} \\
&\preceq \frac{L}{\|L\|} - \frac{L'}{\|L - L'\| + \|L\|} \\
&= \frac{L \cdot \|L - L'\|}{\|L\| \left( \|L - L'\| + \|L\| \right)} + \frac{L - L'}{\|L - L'\| + \|L\|} \\
&\preceq \frac{\sqrt{\lambda}\, B(1)}{\sqrt{n \log(n\lambda)}}\, I + \frac{\sqrt{\lambda}\, B(1)}{\sqrt{n \log(n\lambda)}}\, I.
\end{aligned}
$$
The last inequality is due to Lemma 12. By symmetry, the same upper bound holds for $\tilde L' - \tilde L \preceq \frac{2\sqrt{\lambda}\, B(1)}{\sqrt{n \log(n\lambda)}}\, I$. This means that $\|\tilde L - \tilde L'\| \leq \frac{2\sqrt{\lambda}\, B(1)}{\sqrt{n \log(n\lambda)}}$.
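The qualitative content of Lemma 12 and of the proof of Lemma 1, namely that cutting a bounded number of edges perturbs the (normalized) Laplacian by an amount that shrinks as the graph grows, can be checked numerically. The sketch below is illustrative only: the random-graph model, the fixed cut size, and the parameter values are assumptions of the sketch rather than the exact construction used in the analysis.

```python
# Illustrative check of the spirit of Lemma 12 / Lemma 1: removing a cut of
# bounded size from a growing random graph perturbs the Laplacian by a norm
# that is small relative to ||L||, so the normalized Laplacians stay close.
# The graph model, the cut size, and all parameter values are assumptions.
import numpy as np
import networkx as nx

def relative_perturbation(n, avg_deg=8.0, cut_size=5, seed=2):
    g = nx.erdos_renyi_graph(n, avg_deg / n, seed=seed)
    L = nx.laplacian_matrix(g).toarray().astype(float)
    g_cut = g.copy()
    g_cut.remove_edges_from(list(g.edges)[:cut_size])  # a bounded "cut"
    L2 = nx.laplacian_matrix(g_cut).toarray().astype(float)
    rel_change = np.linalg.norm(L - L2, 2) / np.linalg.norm(L, 2)
    normalized_gap = np.linalg.norm(
        L / np.linalg.norm(L, 2) - L2 / np.linalg.norm(L2, 2), 2)
    return rel_change, normalized_gap

if __name__ == "__main__":
    for n in (100, 400, 800):
        rel_change, normalized_gap = relative_perturbation(n)
        print(f"n={n:4d}  ||L - L'|| / ||L|| = {rel_change:.4f}  "
              f"||L~ - L~'|| = {normalized_gap:.4f}")
```

Both printed quantities should shrink as $n$ grows, mirroring the $1/\sqrt{n \log(n\lambda)}$ behavior of the bound above.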
Lemma 13. Suppose $f$ is an indicator function (see https://en.wikipedia.org/wiki/Indicator_function) on an open set. It is always possible to write it as the composition of two functions:
• a continuous and Lipschitz function $g : \mathbb{R}^d \to (0, 1)$,
• a thresholding function $H(x) = \mathbb{1}\{x > 0.5\}$,
such that $\forall x \in \mathbb{R}^d : f(x) = H(g(x))$.
Without loss of generality, we assume that the threshold function is defined as $H(x) = \mathbb{1}\{x > 0.5\}$; one can verify that a similar proof follows for $H(x) = \mathbb{1}\{x \geq 0.5\}$. We use the notation $f^{-1}(A)$ for the set of pre-images of a function $f$, for the set of outputs $A$.

First, let us study the collection of inputs that result in an output of 1 under $f$. Since $f = H \circ g$, we have
$$f^{-1}(\{1\}) = g^{-1}\!\left(H^{-1}(\{1\})\right) = g^{-1}\!\left((0.5,\, 1)\right), \qquad f^{-1}(\{0\}) = g^{-1}\!\left(H^{-1}(\{0\})\right) = g^{-1}\!\left((0,\, 0.5]\right).$$
Define $C_0$ and $C_1$ such that $C_i \triangleq f^{-1}(\{i\})$; note that since $g$ is continuous and $(0.5, 1)$ is open, $C_1$ is an open set (hence $C_0$ is closed). Let $d : \mathbb{R}^n \to \mathbb{R}$ be defined by
$$d(x) \;\triangleq\; \mathrm{dist}(x, C_0) \;=\; \inf_{c \in C_0} \|x - c\|.$$
Since $C_0$ is closed, it follows that $d(x) = 0$ if and only if $x \in C_0$. Therefore, letting
$$g(x) \;=\; \frac{1}{2} + \frac{1}{2} \cdot \frac{d(x)}{1 + d(x)},$$
we have $g(x) = \frac{1}{2}$ when $x \in C_0$, while $g(x) > \frac{1}{2}$ when $x \notin C_0$. This means that, letting $H(x) = 1$ when $x > \frac{1}{2}$ and $H(x) = 0$ when $x \leq \frac{1}{2}$, we obtain $f = H \circ g$. One can also verify that this construction is Lipschitz: $d(x)$ is 1-Lipschitz, which can be proved using the triangle inequality, and therefore $g$ is 1-Lipschitz as well. Hence, the necessary condition for having such a decomposition is that $f^{-1}(\{0\})$ and $f^{-1}(\{1\})$ be open or closed.
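This construction is easy to instantiate. The sketch below (a minimal illustration; the choice of the open set, here an open ball in the plane, and the sampled test points are assumptions of the sketch) builds $g$ from the distance to the complement of the set and checks that thresholding $g$ at $0.5$ recovers the indicator function:

```python
# Minimal instantiation of the decomposition f = H o g from Lemma 13,
# for the indicator of an open ball in R^2. The ball and the sample points
# are arbitrary choices for illustration.
import numpy as np

CENTER = np.array([0.0, 0.0])
RADIUS = 1.0

def f(x):
    """Indicator of the open ball ||x|| < RADIUS."""
    return 1 if np.linalg.norm(x - CENTER) < RADIUS else 0

def dist_to_complement(x):
    """Distance from x to the (closed) complement of the open ball."""
    return max(RADIUS - np.linalg.norm(x - CENTER), 0.0)

def g(x):
    """Continuous, Lipschitz map into (0, 1): equal to 1/2 on the complement,
    strictly above 1/2 inside the open set."""
    d = dist_to_complement(x)
    return 0.5 + 0.5 * d / (1.0 + d)

def H(t):
    """Thresholding function."""
    return 1 if t > 0.5 else 0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-2, 2, size=(1000, 2))
    assert all(f(x) == H(g(x)) for x in pts)
    print("f(x) == H(g(x)) on all sampled points")
```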
Proof of Lemma 2. Note that $f$ maps a high-dimensional continuous space to a discrete space. To simplify the argument about $f$, we decompose it into two functions: a continuous function $g$ mapping matrices to $(0, 1)$, and a threshold function $H$ (e.g., thresholding at $0.5$) which maps to one if $g$ is higher than the threshold and to zero otherwise. Without loss of generality, we also normalize $g$ such that its gradient is less than one. Formally,
$$f = H \circ g, \qquad \text{where } g : \mathbb{R}^{|\mathcal{U}| \times |\mathcal{U}|} \to (0, 1), \qquad \left\| \nabla g \big|_{\tilde L} \right\| \leq 1.$$
Lemma 13 gives a proof of existence for such a decomposition, which depends on the pre-images being open or closed. One can find a differentiable and Lipschitz function $g$ that intersects the threshold specified by $H$ exactly at the borders where $f$ changes values.

With $g$ being Lipschitz, one can upper-bound the variation of the continuous function:
$$\left\| g(\tilde L) - g(\tilde L') \right\| \;\leq\; M \left\| \tilde L - \tilde L' \right\|.$$
According to Lemma 1, $\|\tilde L - \tilde L'\|$ is upper-bounded by a decreasing function of $n$. For uniform choices $(G, G', m, m') \sim \mathcal{G}$, the Laplacian pairs $(\tilde L, \tilde L')$ are randomly distributed in a high-dimensional space, and for big enough $n$, a large enough portion of the pairs $(\tilde L, \tilde L')$ (enough to satisfy the $1 - \beta$ probability) fall on the same side of the hyperplane corresponding to the threshold function, i.e., $f(\tilde L) = f(\tilde L')$.

Figure 31: With varied values for $p^{-}$, a heat-map representation of the distribution of the average distances of node-pairs in the symbol graph, as a function of the distances of their corresponding meaning nodes.

A.2.4. Further experiments
To evaluate the impact of the other noise parameters in the sampling process, we compare the average distance between nodes in the symbol graph for a given distance between the corresponding meaning-graph nodes. In Figure 31, we plot these distributions for decreasing values of $p^{-}$ (from top left to bottom right). With a high $p^{-}$ (top-left subplot), nodes in the symbol graph are at distances lower than two, regardless of the distance of their corresponding node-pair in the meaning graph. As a result, any reasoning algorithm that relies on connectivity cannot distinguish symbolic nodes that are connected in the meaning space from those that are not. As $p^{-}$ is set to lower values (i.e., the noise reduces), the distribution of distances gets wider, and the correlation between the distances in the two graphs increases. In the bottom-middle subplot, where $p^{-}$ has a very low value, we observe a significant correlation that can be reliably utilized by a reasoning algorithm.
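A minimal version of this experiment can be sketched as follows. The sketch is illustrative only: the meaning-graph generator, the replication factor lambda, and the specific values of $p^{+}$ and $p^{-}$ are assumptions, not the exact setup behind Figure 31. It tabulates the average symbol-graph distance between node copies, grouped by the meaning-graph distance, for a few values of $p^{-}$:

```python
# Illustrative re-creation of the Figure 31 experiment: for several values of
# the spurious-edge probability p_minus, compare meaning-graph distances with
# the average distance between the corresponding node copies in the sampled
# symbol graph. All parameter values and the meaning-graph generator are
# assumptions of this sketch.
import random
from collections import defaultdict
import networkx as nx

def sample_symbol_graph(meaning_g, lam, p_plus, p_minus, rng):
    """lam copies per meaning node; true edges kept with p_plus, spurious with p_minus."""
    sg = nx.Graph()
    copies = {m: [(m, i) for i in range(lam)] for m in meaning_g.nodes}
    sg.add_nodes_from(s for group in copies.values() for s in group)
    nodes = list(meaning_g.nodes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            p = p_plus if meaning_g.has_edge(u, v) else p_minus
            for su in copies[u]:
                for sv in copies[v]:
                    if rng.random() < p:
                        sg.add_edge(su, sv)
    return sg, copies

def distance_profile(n=40, lam=2, p_plus=0.9, p_minus=0.2, trials=20, seed=0):
    """Average symbol distance between copies, bucketed by meaning distance."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for _ in range(trials):
        mg = nx.erdos_renyi_graph(n, 2.0 / n, seed=rng.randrange(10**9))
        sg, copies = sample_symbol_graph(mg, lam, p_plus, p_minus, rng)
        m, m2 = rng.sample(list(mg.nodes), 2)
        if not nx.has_path(mg, m, m2):
            continue
        md = nx.shortest_path_length(mg, m, m2)
        sds = [nx.shortest_path_length(sg, s, t)
               for s in copies[m] for t in copies[m2]
               if nx.has_path(sg, s, t)]
        if sds:
            buckets[md].append(sum(sds) / len(sds))
    return {d: round(sum(v) / len(v), 2) for d, v in sorted(buckets.items())}

if __name__ == "__main__":
    for p_minus in (0.2, 0.02, 0.002):
        prof = distance_profile(p_minus=p_minus)
        print(f"p- = {p_minus}: meaning-dist -> avg symbol-dist: {prof}")
```

With the largest value of $p^{-}$, the reported averages should stay flat at one or two hops, while the smaller values should show the symbol-graph distances tracking the meaning-graph distances, mirroring the trend described above.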
BIBLIOGRAPHY

T. Achterberg. SCIP: solving constraint integer programs.
Math. Prog. Computation , 1(1):1–41, 2009.G. Angeli and C. D. Manning. NaturalLI: Natural Logic Inference for Common Sense Rea-soning. In
Proc. of the Conference on Empirical Methods for Natural Language Processing(EMNLP) , 2014.N. Arivazhagan, C. Christodoulopoulos, and D. Roth. Labeling the semantic roles of com-mas. In
AAAI , 2016.C. F. Baker, C. J. Fillmore, and J. B. Lowe. The berkeley framenet project. In
Proc. ofthe Annual Meeting of the Association of Computational Linguistics (ACL) , pages 86–90,1998.D. Bamman, B. O’Connor, and N. A. Smith. Learning Latent Personas of Film Char-acters. In
Proceedings of the 51st Annual Meeting of the Association for Computa-tional Linguistics, ACL 2013, Volume 1: Long Papers , pages 352–361, 2013. URL http://aclweb.org/anthology/P/P13/P13-1035.pdf .L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight,M. Palmer, and N. Schneider. Abstract meaning representation for sembanking. In
Linguistic Annotation Workshop and Interoperability with Discourse , 2013.M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open InformationExtraction from the Web. In
Proc. of the International Joint Conference on ArtificialIntelligence (IJCAI) , 2007.R. Bar-Haim, I. Dagan, and J. Berant. Knowledge-Based Textual Inference via Parse-TreeTransformations.
J. Artif. Intell. Res.(JAIR) , 54:1–57, 2015.L. Bauer, Y. Wang, and M. Bansal. Commonsense for Generative Multi-Hop QuestionAnswering Tasks. In
Proc. of the Conference on Empirical Methods for Natural LanguageProcessing (EMNLP) , pages 4220–4230, 2018.L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo. The Sixth PASCAL RecognizingTextual Entailment Challenge. In
TAC , 2008.J. Berant, I. Dagan, and J. Goldberger. Global learning of focused entailment graphs.In
Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL) ,pages 1220–1229, 2010.J. Berant, V. Srikumar, P.-C. Chen, A. V. Linden, B. Harding, B. Huang, P. Clark, andC. D. Manning. Modeling Biological Processes for Reading Comprehension. In
Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2014.
V. W. Berninger, W. Nagy, and S. Beers. Child writers' construction and reconstruction of single sentences and construction of multi-sentence texts: Contributions of syntax and transcription to translation.
Reading and writing , 24(2):151–182, 2011.A. M. Bisantz and K. J. Vicente. Making the abstraction hierarchy concrete.
InternationalJournal of human-computer studies , 40(1):83–117, 1994.D. G. Bobrow. Natural language input for a computer problem solving system. Technicalreport, MIT, 1964.K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaborativelycreated graph database for structuring human knowledge. In
ICMD , pages 1247–1250.ACM, 2008.R. Brachman, D. Gunning, S. Bringsjord, M. Genesereth, L. Hirschman, and L. Ferro.Selected Grand Challenges in Cognitive Science. Technical report, MITRE TechnicalReport 05-1218, 2005.E. Brill, S. Dumais, and M. Banko. An analysis of the AskMSR question-answering system.In
Proceedings of EMNLP , pages 257–264, 2002.P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-basedn-gram models of natural language.
Computational linguistics , 18(4):467–479, 1992.J. G. Carbonell and R. D. Brown. Anaphora resolution: a multi-strategy approach. In
Proceedings of the 12th conference on Computational linguistics-Volume 1 , pages 96–101.Association for Computational Linguistics, 1988.A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Towardan Architecture for Never-Ending Language Learning. In
Proceedings of the NationalConference on Artificial Intelligence (AAAI) , 2010.K.-W. Chang, S. Upadhyay, M.-W. Chang, V. Srikumar, and D. Roth. Illinois-SL: A JAVAlibrary for structured prediction. arXiv preprint arXiv:1509.07179 , 2015.M.-W. Chang, L. Ratinov, N. Rizzolo, and D. Roth. Learning and inference with constraints.In
Proc. of the Conference on Artificial Intelligence (AAAI) , 7 2008. URL http://cogcomp.org/papers/CRRR08.pdf .M.-W. Chang, D. Goldwasser, D. Roth, and V. Srikumar. Discriminative Learning overConstrained Latent Representations.
Proceedings of Human Language Technologies: The2010 Annual Conference of the North American Chapter of the Association for Compu-tational Linguistics (HLT 2010) , (June):429–437, 2010.M.-W. Chang, L. Ratinov, and D. Roth. Structured learning with constrained conditionalmodels.
Machine Learning, 88(3):399–431, 6 2012. URL http://cogcomp.org/papers/ChangRaRo12.pdf.
D. Chen, J. Bolton, and C. D. Manning. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. In
Proceedings of the 54th Annual Meeting of theAssociation for Computational Linguistics, ACL 2016, Volume 1: Long Papers , 2016.URL http://aclweb.org/anthology/P/P16/P16-1223.pdf .Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen. Enhanced LSTM for NaturalLanguage Inference. In
Proceedings of the 55th Annual Meeting of the Association forComputational Linguistics (ACL 2017) , Vancouver, July 2017. ACL.T. Chklovski and P. Pantel. VerbOcean: Mining the Web for Fine-Grained Semantic VerbRelations. In
EMNLP , 2004.F. Chung and L. Lu. The average distances in random graphs with given expected degrees.
Proceedings of the National Academy of Sciences , 99(25):15879–15882, 2002.K. W. Church and P. Hanks. Word Association Norms, Mutual Information and Lexicog-raphy. In , pages 76–83, 1989.P. Clark. Elementary School Science and Math Tests as a Driver for AI: Take the AristoChallenge! In , pages 4019–4021, Austin, TX, 2015.P. Clark and O. Etzioni. My Computer is an Honor Student but how Intelligent is it?Standardized Tests as a Measure of AI.
AI Magazine , 2016. (To appear).P. Clark, N. Balasubramanian, S. Bhakthavatsalam, K. Humphreys, J. Kinkead, A. Sabhar-wal, and O. Tafjord. Automatic Construction of Inference-Supporting Knowledge Bases.In , Montreal, Canada, 2014.P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. Turney, and D. Khashabi.Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions.In
Proceedings of the National Conference on Artificial Intelligence (AAAI) , 2016.P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord.Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.
CoRR , abs/1803.05457, 2018.J. Clarke and M. Lapata. Global inference for sentence compression: An integer linearprogramming approach.
Journal of Artificial Intelligence Research , 31:399–429, 2008.J. Clarke, D. Goldwasser, M.-W. Chang, and D. Roth. Driving semantic parsing fromthe world’s response. In
Proc. of the Conference on Computational Natural LanguageLearning (CoNLL) , 7 2010. URL http://cogcomp.org/papers/CGCR10.pdf .A. Cocos, V. Wharton, E. Pavlick, M. Apidianaki, and C. Callison-Burch. Learning ScalarAdjective Intensity from Paraphrases. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1752–1762, 2018.
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.
Introduction to algorithms . MITpress, 2009.I. Dagan, D. Roth, M. Sammons, and F. M. Zanzoto. Recognizing textual entailment:Models and applications. 7 2013.B. Dalvi, S. Bhakthavatsalam, and P. Clark. IKE - An Interactive Tool for KnowledgeExtraction. In , 2016.H. T. Dang and M. Palmer. The role of semantic roles in disambiguating verb senses. In
Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics ,pages 42–49. Association for Computational Linguistics, 2005.H. A. Davidson.
Alfarabi, Avicenna, and Averroes on intellect: their cosmologies, theoriesof the active intellect, and theories of human intellect . Oxford University Press, 1992.E. Davis. The Limitations of Standardized Science Tests as Benchmarks for ArtificialIntelligence Research: Position Paper.
CoRR , abs/1411.1629, 2014. URL http://arxiv.org/abs/1411.1629 .R. Dechter. Reasoning with Probabilistic and Deterministic Graphical Models: Exact Al-gorithms. In
Reasoning with Probabilistic and Deterministic Graphical Models: ExactAlgorithms , 2013.J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectionaltransformers for language understanding. arXiv preprint arXiv:1810.04805 , 2018.X. Ding, T. Jiang, et al. Spectral distributions of adjacency and Laplacian matrices ofrandom graphs.
The annals of applied probability , 20(6):2086–2117, 2010.A. N. P. DivyeKhilnani and S. B. D. Jurafsky. Using Query Patterns to Learn the Durationof Events.
Computational Semantics IWCS 2011 , page 145, 2011.Q. Do, Y. S. Chan, and D. Roth. Minimally supervised event causality identification. In
Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP) ,Edinburgh, Scotland, 7 2011. URL http://cogcomp.org/papers/DoChaRo11.pdf .Q. Do, W. Lu, and D. Roth. Joint inference for event timeline construction. In
Proc. ofthe Conference on Empirical Methods in Natural Language Processing (EMNLP) , 2012.URL http://cogcomp.org/papers/DoLuRo12.pdf .L. Dong, F. Wei, M. Zhou, and K. Xu. Question Answering over Freebase with Multi-Column Convolutional Neural Networks. In
Proc. of the Annual Meeting of the Associa-tion of Computational Linguistics (ACL) , 2015.P. Erdos and A. R´enyi. On the evolution of random graphs.
Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
O. Etzioni, M. Banko, S. Soderland, and D. Weld. Open information extraction from the web.
Communications of the ACM , 51(12):68–74, 2008.J. S. B. Evans, S. E. Newstead, and R. M. Byrne.
Human reasoning: The psychology ofdeduction . Psychology Press, 1993.A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated andextracted knowledge bases. In
Proceedings of SIGKDD , pages 1156–1165, 2014.D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W.Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQAproject.
AI Magazine , 31(3):59–79, 2010.R. Fikes and T. Kehler. The role of frame-based representation in reasoning.
Communica-tions of the ACM , 28(9):904–920, 1985.C. J. Fillmore. Scenes-and-frames semantics.
Linguistic structures processing , 59:55–88,1977.J. L. Fleiss. Measuring nominal scale agreement among many raters.
Psychological bulletin ,76(5):378, 1971.M. Forbes and Y. Choi. Verb Physics: Relative Physical Knowledge of Actions and Ob-jects. In
Proceedings of the 55th Annual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers) , volume 1, pages 266–276, 2017.R. M. French. The Turing Test: the first 50 years.
Trends in cognitive sciences , 4(3):115–122, 2000.D. Fried, P. Jansen, G. Hahn-Powell, M. Surdeanu, and P. Clark. Higher-order lexicalsemantic models for non-factoid answer reranking.
Transactions of the Association forComputational Linguistics , 3:197–210, 2015.K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks.
Neural networks , 2(3):183–192, 1989.E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-basedexplicit semantic analysis. In
IJcAI , volume 7, pages 1606–1611, 2007.M. Gardner, P. Talukdar, and T. Mitchell. Combining vector space embeddings with sym-bolic logical inference over open-domain text. In
AAAI spring symposium , 2015.M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. Liu, M. Peters, M. Schmitz,and L. Zettlemoyer. AllenNLP: A Deep Semantic Natural Language Processing Platform.2018.E. N. Gilbert. Random graphs.
The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.
D. Gildea and D. Jurafsky. Automatic labeling of semantic roles.
Computational linguistics ,28(3):245–288, 2002.D. Goldwasser and D. Roth. Learning from natural instructions.
Machine Learning , 94(2):205–232, 2 2014. URL http://cogcomp.org/papers/GoldwasserRo14.pdf .M. Granroth-Wilding and S. Clark. What happens next? event prediction using a com-positional neural network model. In
Proceedings of the Thirtieth AAAI Conference onArtificial Intelligence , pages 2727–2733. AAAI Press, 2016.D. Gunning, V. Chaudhri, P. Clark, K. Barker, J. Chaw, and M. Greaves. Project HaloUpdate - Progress Toward Digital Aristotle.
AI Magazine , 31(3), 2010.S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith. An-notation Artifacts in Natural Language Inference Data. In
Proceedings of the 2018 Con-ference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, Volume 2 (Short Papers) , volume 2, pages 107–112, 2018.S. Harnad. The symbol grounding problem.
Physica D: Nonlinear Phenomena , 42(1-3):335–346, 1990.S. Harnad. The Turing Test is not a trick: Turing indistinguishability is a scientific criterion.
ACM SIGART Bulletin , 3(4):9–10, 1992.K. M. Hermann, T. Kocisk´y, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, andP. Blunsom. Teaching Machines to Read and Comprehend. In
Advances in Neural In-formation Processing Systems 28: Annual Conference on Neural Information ProcessingSystems 2015 , pages 1693–1701, 2015.J. Hernandez-Orallo. Beyond the Turing test.
Journal of Logic, Language and Information ,9(4):447–466, 2000.L. Hirschman, M. Light, E. Breck, and J. D. Burger. Deep Read: A Reading ComprehensionSystem. In , 1999. URL .J. R. Hobbs, M. E. Stickel, P. A. Martin, and D. Edwards. Interpretation as Abduction.
Artif. Intell. , 63:69–142, 1988.J. R. Hobbs, M. E. Stickel, D. E. Appelt, and P. Martin. Interpretation as abduction.
Artificial intelligence , 63(1-2):69–142, 1993.J. H. Holland, K. J. Holyoak, R. E. Nisbett, and P. R. Thagard.
Induction: Processes ofinference, learning, and discovery . MIT press, 1989.C. Hori and F. Sadaoki. Speech summarization: an approach through word extraction anda method for evaluation.
IEICE Transactions on Information and Systems, 87(1):15–25, 2004.
M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman. Learning to Solve Arithmetic Word Problems with Verb Categorization. In , pages 523–533, 2014.
D. Howell.
Statistical methods for psychology . Cengage Learning, 2012.M. Hu, Y. Peng, Z. Huang, X. Qiu, F. Wei, and M. Zhou. Reinforced mnemonic reader formachine reading comprehension. In
Proceedings of the 27th International Joint Confer-ence on Artificial Intelligence , pages 4099–4106. AAAI Press, 2018.N. Ide and K. Suderman. Integrating Linguistic Resources: The American National Cor-pus Model. In
Proceedings of the Fifth International Conference on Language Resourcesand Evaluation, LREC 2006 , pages 621–624, 2006. URL .N. Ide, C. F. Baker, C. Fellbaum, C. J. Fillmore, and R. J. Passonneau. MASC: theManually Annotated Sub-Corpus of American English. In
Proceedings of the InternationalConference on Language Resources and Evaluation, LREC 2008 , 2008. URL .P. Jansen, N. Balasubramanian, M. Surdeanu, and P. Clark. What’s in an Explanation?Characterizing Knowledge and Inference Requirements for Elementary Science Exams.In
Proc. the International Conference on Computational Linguistics (COLING) , pages2956–2965, 2016.P. Jansen, R. Sharp, M. Surdeanu, and P. Clark. Framing QA as Building and RankingIntersentence Answer Justifications.
Computational Linguistics , 2017.P. A. Jansen. A Study of Automatically Acquiring Explanatory Inference Patterns fromCorpora of Explanations: Lessons from Elementary Science Exams. In
AKBC , 2016.P. A. Jansen, E. Wainwright, S. Marmorstein, and C. T. Morrison. WorldTree: A Corpus ofExplanation Graphs for Elementary Science Questions supporting Multi-Hop Inference.
CoRR , abs/1802.03052, 2018.M. E. Janzen and K. J. Vicente. Attention allocation within the abstraction hierarchy. In
Proceedings of the Human Factors and Ergonomics Society Annual Meeting , volume 41,pages 274–278. SAGE Publications, 1997.P. Jia and P. Liang. Adversarial Examples for Evaluating Reading Comprehension Sys-tems.
Proc. of the Conference on Empirical Methods for Natural Language Processing(EMNLP) , 2017.T. Joachims. Text categorization with support vector machines: Learning with many rele-vant features.
Machine learning: ECML-98 , pages 137–142, 1998.A. Johnson and R. W. Proctor.
Attention: Theory and practice. Sage Publications, 2004.
P. N. Johnson-Laird. Mental models in cognitive science.
Cognitive science , 4(1):71–115,1980.M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A Large Scale DistantlySupervised Challenge Dataset for Reading Comprehension. In
Proceedings of the 55thAnnual Meeting of the Association for Computational Linguistics, ACL 2017, Volume1: Long Papers , pages 1601–1611, 2017. doi: 10.18653/v1/P17-1147. URL https://doi.org/10.18653/v1/P17-1147 .M. Kaisser and B. Webber. Question answering based on semantic roles. In
Proceedings ofthe workshop on deep linguistic processing , pages 41–48, 2007.R. M. Kaplan, J. Bresnan, et al. Lexical-functional grammar: A formal system for gram-matical representation. In
The Mental Representation of Grammatical Relations . TheMIT Press, 1982.R. J. Kate and R. J. Mooney. Probabilistic Abduction using Markov Logic Networks. In
In: IJCAI-09 Workshop on Plan, Activity, and Intent Recognition , 2009.D. Kaushik and Z. C. Lipton. How Much Reading Does Reading Comprehension Require?A Critical Investigation of Popular Benchmarks. In
Proceedings of the 2018 Conferenceon Empirical Methods in Natural Language Processing , pages 5010–5015, 2018.A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. Are YouSmarter Than A Sixth Grader? Textbook Question Answering for Multimodal MachineComprehension.
The IEEE Conference on Computer Vision and Pattern Recognition(CVPR) , 2017.D. Khashabi, T. Khot, A. Sabharwal, P. Clark, O. Etzioni, and D. Roth. Question answeringvia integer programming over semi-structured knowledge. In
Proc. of the InternationalJoint Conference on Artificial Intelligence (IJCAI) , 2016. URL http://cogcomp.org/papers/KKSCER16.pdf .D. Khashabi, T. Khot, A. Sabharwal, and D. Roth. Learning what is essential in questions.In
The Conference on Computational Natural Language Learning (Proc. of the Conferenceon Computational Natural Language Learning (CoNLL)) , 2017. URL http://cogcomp.org/papers/2017_conll_essential_terms.pdf .D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. Looking beyond the sur-face: A challenge set for reading comprehension over multiple sentences. In
Proceedingsof the Annual Conference of the North American Chapter of the Association for Com-putational Linguistics (NAACL) , 2018a. URL .D. Khashabi, T. Khot, A. Sabharwal, and D. Roth. Question answering as global rea-soning over semantic abstractions. In
Proceedings of The Conference on Artificial In- elligence (Proc. of the Conference on Artificial Intelligence (AAAI)) , 2018b. URL http://cogcomp.org/papers/2018_aaai_semanticilp.pdf .D. Khashabi, M. Sammons, B. Zhou, T. Redman, C. Christodoulopoulos, V. Srikumar,N. Rizzolo, L. Ratinov, G. Luo, Q. Do, C.-T. Tsai, S. Roy, S. Mayhew, Z. Feng, J. Wieting,X. Yu, Y. Song, S. Gupta, S. Upadhyay, N. Arivazhagan, Q. Ning, S. Ling, and D. Roth.Cogcompnlp: Your swiss army knife for nlp. In , 2018c. URL http://cogcomp.org/papers/2018_lrec_cogcompnlp.pdf .D. Khashabi, E. S. Azer, T. Khot, A. Sabharwal, and D. Roth. On the capabilities andlimitations of reasoning for natural language understanding, 2019. under review.T. Khot, N. Balasubramanian, E. Gribkoff, A. Sabharwal, P. Clark, and O. Etzioni. Explor-ing Markov Logic Networks for Question Answering. In , Lisbon, Portugal,2015.T. Khot, A. Sabharwal, and P. Clark. Answering Complex Questions Using Open Infor-mation Extraction.
Proc. of the Annual Meeting of the Association of ComputationalLinguistics (ACL) , 2017.P. Kingsbury and M. Palmer. From TreeBank to PropBank. In
LREC , pages 1989–1993,2002.G. S. Kirk, J. E. Raven, and M. Schofield.
The presocratic philosophers: A critical historywith a selcetion of texts . Cambridge University Press, 1983.K. Knight and D. Marcu. Summarization beyond sentence extraction: A probabilisticapproach to sentence compression.
Artificial Intelligence , 139(1):91–107, 2002.J. Ko, E. Nyberg, and L. Si. A probabilistic graphical model for joint answer ranking inquestion answering. In
Proceedings of SIGIR , pages 343–350, 2007.Z. Kozareva and E. Hovy. Learning temporal information for states and events. In
FifthInternational Conference on Semantic Computing , pages 424–429. IEEE, 2011.J. Krishnamurthy, O. Tafjord, and A. Kembhavi. Semantic parsing to probabilistic programsfor situated question answering.
Proc. of the Conference on Empirical Methods for NaturalLanguage Processing (EMNLP) , 2016.C. C. T. Kwok, O. Etzioni, and D. S. Weld. Scaling question answering to the Web. In
TheInternational World Wide Web Conference , 2001.G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy. RACE: Large-scale ReAding Comprehen-sion Dataset From Examinations. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, pages 785–794, 2017. URL https://aclanthology.info/papers/D17-1082/d17-1082.
H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In
CONLLShared Task , pages 28–34, 2011.H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky. Determinis-tic coreference resolution based on entity-centric, precision-ranked rules.
ComputationalLinguistics , 39(4):885–916, 2013.K. Lee, Y. Artzi, J. Dodge, and L. Zettlemoyer. Context-dependent semantic parsing fortime expressions. In
ACL (1) , pages 1437–1447, 2014.A. Leeuwenberg and M.-F. Moens. Temporal Information Extraction by Predicting Rel-ative Time-lines.
Proc. of the Conference on Empirical Methods for Natural LanguageProcessing (EMNLP) , 2018.W. G. Lehnert.
The Process of Question Answering.
PhD thesis, Yale University, 1977.D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure.
Communicationsof the ACM , 38(11):33–38, 1995.O. Levy and Y. Goldberg. Linguistic regularities in sparse and explicit word representations.In
Proceedings of the eighteenth conference on computational natural language learning ,pages 171–180, 2014.F. Li, X. Zhang, J. Yuan, and X. Zhu. Classifying What-Type Questions by Head NounTagging. In
Proceedings 22nd International Conference on Computational Linguistics(COLING) , 2007.X. Li and D. Roth. Learning Question Classifiers. In
Proceedings of the 19th Interna-tional Conference on Computational Linguistics - Volume 1 , COLING ’02, pages 1–7,Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.Y. Li, L. Xu, F. Tian, L. Jiang, X. Zhong, and E. Chen. Word Embedding Revisited: ANew Representation Learning and Explicit Matrix Factorization Perspective. In
Proc. ofthe International Joint Conference on Artificial Intelligence (IJCAI) , pages 3650–3656,2015.Z. Li, X. Ding, and T. Liu. Constructing Narrative Event Evolutionary Graph for ScriptEvent Prediction.
Proc. of the International Joint Conference on Artificial Intelligence(IJCAI) , 2018.X. V. Lin, R. Socher, and C. Xiong. Multi-Hop Knowledge Graph Reasoning with Re-ward Shaping. In
Proc. of the Conference on Empirical Methods for Natural LanguageProcessing (EMNLP) , 2018.H. Liu and P. Singh. ConceptNeta practical commonsense reasoning tool-kit.
BT Technology Journal, 22(4):211–226, 2004.
X. Liu, Y. Shen, K. Duh, and J. Gao. Stochastic answer networks for machine reading comprehension. In
Proceedings of the 56th Annual Meeting of the Association for Com-putational Linguistics (Volume 1: Long Papers) , pages 1694–1704, 2018.A. A. Mahabal, D. Roth, and S. Mittal. Robust handling of polysemy via sparse represen-tations. In *SEM , 2018. URL http://cogcomp.org/papers/MahabalRoMi18.pdf .A. McCallum, A. Neelakantan, R. Das, and D. Belanger. Chains of Reasoning over Entities,Relations, and Text using Recurrent Neural Networks. In
EACL , pages 132–141, 2017.J. McCarthy.
Programs with common sense . Defense Technical Information Center, 1963.J. McCarthy. An example for natural language understanding and the AI problems it raises.
Formalizing Common Sense: Papers by John McCarthy , 355, 1976.J. McCarthy and M. I. Levin.
LISP 1.5 programmer’s manual . MIT press, 1965.J. McCarthy and V. Lifschitz.
Formalizing common sense: papers , volume 5. IntellectBooks, 1990.J. F. McCarthy. Using decision trees for coreference resolution. In
Proc. 14th InternationalJoint Conf. on Artificial Intelligence (IJCAI), Quebec, Canada, Aug. 1995 , 1995.E. Merkhofer, J. Henderson, D. Bloom, L. Strickhart, and G. Zarrella. MITRE at SemEval-2018 Task 11: Commonsense Reasoning without Commonsense Knowledge. In
Pro-ceedings of the International Workshop on Semantic Evaluation (SemEval-2018) , NewOrleans, LA, USA, 2018.A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R. Grishman.The NomBank project: An interim report. In
HLT-NAACL 2004 workshop: Frontiers incorpus annotation , volume 24, page 31, 2004.R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In
CIKM , pages 233–242, 2007.T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representationsin vector space. arXiv preprint arXiv:1301.3781 , 2013.S. Milgram. Six degrees of separation.
Psychology Today , 2:60–64, 1967.G. Miller. WordNet: a lexical database for English.
Communications of the ACM , 38(11):39–41, 1995.S. Min, M. J. Seo, and H. Hajishirzi. Question Answering through Transfer Learning fromLarge Fine-grained Supervision Data. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Volume 2: Short Papers, pages 510–517, 2017. URL https://doi.org/10.18653/v1/P17-2081.
M. Minsky. A Framework for Representing Knowledge. Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA, 1974.
M. Minsky.
Society of mind . Simon and Schuster, 1988.M. Minsky and S. Papert. Perceptron: an introduction to computational geometry.
TheMIT Press, Cambridge, expanded edition , 19:88, 1969.T. M. Mitchell, J. Betteridge, A. Carlson, E. Hruschka, and R. Wang. Populating thesemantic web by macro-reading internet text. In
International Semantic Web Conference ,pages 998–1002. Springer, 2009.D. Moldovan, M. Pa¸sca, S. Harabagiu, and M. Surdeanu. Performance issues and error anal-ysis in an open-domain question answering system.
ACM Transactions on InformationSystems (TOIS) , 21(2):133–154, 2003.P. Moreda, H. Llorens, E. S. Bor´o, and M. Palomar. Combining semantic information inquestion answering systems.
Inf. Process. Manage. , 47:870–885, 2011.K. Narasimhan and R. Barzilay. Machine Comprehension with Discourse Relations. In
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguisticsand the 7th International Joint Conference on Natural Language Processing of the AsianFederation of Natural Language Processing, ACL 2015, Volume 1: Long Papers , pages1253–1262, 2015. URL http://aclweb.org/anthology/P/P15/P15-1121.pdf .B. K. Natarajan. Sparse approximate solutions to linear systems.
SIAM journal on com-puting , 24(2):227–234, 1995.T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng.MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.
CoRR ,abs/1611.09268, 2016. URL http://arxiv.org/abs/1611.09268 .J. Ni, C. Zhu, W. Chen, and J. McAuley. Learning to attend on essential terms:An enhanced retriever-reader model for scientific question answering. arXiv preprintarXiv:1808.09492 , 2018.Q. Ning, H. Wu, H. Peng, and D. Roth. Improving Temporal Relation Extractionwith a Globally Acquired Statistical Resource. In
Proc. of the Annual Meeting of theNorth American Association of Computational Linguistics (NAACL) , pages 841–851,New Orleans, Louisiana, 6 2018a. Association for Computational Linguistics. URL http://cogcomp.org/papers/NingWuPeRo18.pdf .Q. Ning, B. Zhou, Z. Feng, H. Peng, and D. Roth. CogCompTime: A Tool for Under-standing Time in Natural Language. In
EMNLP (Demo Track), Brussels, Belgium, 11 2018b. Association for Computational Linguistics. URL http://cogcomp.org/papers/NZFPR18.pdf.
G. Novak. Representations of Knowledge in a Program for Solving Physics Problems. In
IJCAI-77 , 1977.S. Ostermann, M. Roth, A. Modi, S. Thater, and M. Pinkal. SemEval-2018 Task 11:Machine Comprehension using Commonsense Knowledge. In
Proceedings of The 12thInternational Workshop on Semantic Evaluation , pages 747–757, 2018.M. Palmer, D. Gildea, and P. Kingsbury. The proposition bank: An annotated corpus ofsemantic roles.
Computational linguistics , 31(1):71–106, 2005.A. P. Parikh, O. T¨ackstr¨om, D. Das, and J. Uszkoreit. A Decomposable Attention Modelfor Natural Language Inference. In
Proc. of the Conference on Empirical Methods forNatural Language Processing (EMNLP) , 2016.J. H. Park and W. B. Croft. Using key concepts in a translation model for retrieval. In
Pro-ceedings of the 38th International ACM SIGIR Conference on Research and Developmentin Information Retrieval , pages 927–930. ACM, 2015.J. Pearl.
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference .Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 1558604790.C. S. Peirce. A Theory of Probable Inference. In
Studies in Logic by Members of the JohnsHopkins University , pages 126–181. Little, Brown, and Company, 1883.J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation.In
Proceedings of the 2014 conference on empirical methods in natural language processing(EMNLP) , pages 1532–1543, 2014.M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer.Deep Contextualized Word Representations. In
Proceedings of the 2018 Conference ofthe North American Chapter of the Association for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Long Papers) , volume 1, pages 2227–2237, 2018.L. A. Pizzato and D. Moll´a. Indexing on semantic roles for question answering. In , pages 74–81, 2008.A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. V. Durme. Hypothesis OnlyBaselines in Natural Language Inference. In
Proceedings of the Seventh Joint Conferenceon Lexical and Computational Semantics , pages 180–191, 2018.D. Poole. A methodology for using a default and abductive reasoning system.
Int. J. Intell.Syst. , 5:521–548, 1990.V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In
Proc. of the Conference on Neural Information Processing Systems (NIPS), pages 995–1001. MIT Press, 2001. URL http://cogcomp.org/papers/nips01.pdf.
V. Punyakanok, D. Roth, and W. Yih. Mapping Dependencies Trees: An Application to Question Answering.
AIM , 1 2004. URL http://cogcomp.org/papers/PunyakanokRoYi04a.pdf .V. Punyakanok, D. Roth, and W. tau Yih. The importance of syntactic parsing and inferencein semantic role labeling.
Computational Linguistics , 2008.M. R. Quillan. Semantic memory. Technical report, BOLT BERANEK AND NEWMANINC CAMBRIDGE MA, 1966.P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ Questions for MachineComprehension of Text. In
Proc. of the Conference on Empirical Methods for NaturalLanguage Processing (EMNLP) , 2016.P. Rajpurkar, R. Jia, and P. Liang. Know What You Don’t Know: Unanswerable Ques-tions for SQuAD. In
Proc. of the Annual Meeting of the Association of ComputationalLinguistics (ACL) , 2018.H. Rashkin, M. Sap, E. Allaway, N. A. Smith, and Y. Choi. Event2Mind: CommonsenseInference on Events, Intents, and Reactions. In
Proc. of the Annual Meeting of theAssociation of Computational Linguistics (ACL) , pages 463–473, 2018.J. Rasmussen. The role of hierarchical knowledge representation in decisionmaking andsystem management.
Systems, Man and Cybernetics, IEEE Transactions on , pages 234–243, 1985.L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition.In
Proc. of the Conference on Computational Natural Language Learning (CoNLL) , 62009. URL http://cogcomp.org/papers/RatinovRo09.pdf .L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for dis-ambiguation to wikipedia. In
Proc. of the Annual Meeting of the Association for Com-putational Linguistics (ACL) , 2011. URL http://cogcomp.org/papers/RRDA11.pdf .S. Reddy, O. T¨ackstr¨om, S. Petrov, M. Steedman, and M. Lapata. Universal Semantic Pars-ing. In
Proc. of the Conference on Empirical Methods for Natural Language Processing(EMNLP) , pages 89–101, 2017.T. Redman, M. Sammons, and D. Roth. Illinois Named Entity Recognizer: Addendumto Ratinov and Roth ’09 reporting improved results, 2016. URL http://cogcomp.org/papers/ner-addendum.pdf . Tech Report.M. Richardson and P. Domingos. Markov Logic Networks.
Machine learning , 62(1–2):107–136, 2006.M. Richardson, C. J. C. Burges, and E. Renshaw. MCTest: A Challenge Dataset for theOpen-Domain Machine Comprehension of Text. In
Proceedings of the 2013 Conference n Empirical Methods in Natural Language Processing, EMNLP 2013 , pages 193–203,2013. URL http://aclweb.org/anthology/D/D13/D13-1020.pdf .F. Rosenblatt. The perceptron: a probabilistic model for information storage and organi-zation in the brain.
Psychological review , 65(6):386, 1958.D. Roth and W. Yih. A linear programming formulation for global inference in naturallanguage tasks. In H. T. Ng and E. Riloff, editors,
Proc. of the Conference on Compu-tational Natural Language Learning (CoNLL) , pages 1–8. Association for ComputationalLinguistics, 2004. URL http://cogcomp.org/papers/RothYi04.pdf .D. Roth and D. Zelenko. Part of speech tagging using a network of linear separators. In
ACL-COLING , 1998.M. Roth and M. Lapata. Neural semantic role labeling with dependency path embeddings.
Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL) ,2016.S. Roy, T. Vieira, and D. Roth. Reasoning about quantities in natural language.
Trans-actions of the Association for Computational Linguistics (TACL) , 3, 2015. URL http://cogcomp.org/papers/RoyViRo15.pdf .D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors.
Cognitive modeling , 5, 1988.R. C. Schank. Conceptual dependency: A theory of natural language understanding.
Cog-nitive psychology , 3(4):552–631, 1972.R. C. Schank and R. P. Abelson. Scripts, plans, and knowledge. In
Proc. of the InternationalJoint Conference on Artificial Intelligence (IJCAI) , pages 151–157, 1975.B. Selman and H. J. Levesque. Abductive and Default Reasoning: A Computational Core.In
Proceedings of the National Conference on Artificial Intelligence (AAAI) , 1990.M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machinecomprehension.
ICLR , 2016.D. Shen and M. Lapata. Using Semantic Roles to Improve Question Answering. In
EMNLP-CoNLL , pages 12–21, 2007.R. Socher, D. Chen, C. D. Manning, and A. Y. Ng. Reasoning With Neural Tensor Networksfor Knowledge Base Completion. In
The Conference on Advances in Neural InformationProcessing Systems (NIPS) , 2013.V. Srikumar and D. Roth. A Joint Model for Extended Semantic Role Labeling. In
Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), Edinburgh, Scotland, 2011. URL http://cogcomp.org/papers/SrikumarRo11.pdf.
V. Srikumar and D. Roth. Modeling semantic relations expressed by prepositions. 1:231–242, 2013. URL http://cogcomp.org/papers/SrikumarRo13.pdf.
M. Steedman and J. Baldridge. Combinatory categorial grammar.
Non-TransformationalSyntax: Formal and explicit models of grammar , pages 181–224, 2011.A. Stern, R. Stern, I. Dagan, and A. Felner. Efficient search for transformation-basedinference. In
Proc. of the Annual Meeting of the Association of Computational Linguistics(ACL) , pages 283–291, 2012.M. Steup. Epistemology. In E. N. Zalta, editor,
The Stanford Ency-clopedia of Philosophy . http://plato.stanford.edu/archives/spr2014/entries/epistemology/ , spring 2014 edition, 2014.K. Sun, D. Yu, D. Yu, and C. Cardie. Improving machine reading comprehension with gen-eral reading strategies. In Proc. of the Annual Meeting of the North American Associationof Computational Linguistics (NAACL) , 2019.W. t. Yih, X. He, and C. Meek. Semantic Parsing for Single-Relation Question Answering.In , pages 643–648. Citeseer, 2014.M. Taddeo and L. Floridi. Solving the symbol grounding problem: a critical review offifteen years of research.
Journal of Experimental & Theoretical Artificial Intelligence , 17(4):419–445, 2005.P. P. Talukdar, M. Jacob, M. S. Mehmood, K. Crammer, Z. G. Ives, F. Pereira, and S. Guha.Learning to create data-integrating queries.
Proceedings of the VLDB Endowment , 1(1):785–796, 2008.P. P. Talukdar, Z. G. Ives, and F. Pereira. Automatically incorporating new sources inkeyword search-based data integration. In
Proceedings of the 2010 ACM SIGMOD Inter-national Conference on Management of data , pages 387–398. ACM, 2010.N. Tandon, B. Dalvi, J. Grus, W. tau Yih, A. Bosselut, and P. Clark. Reasoning aboutActions and State Changes by Injecting Commonsense Knowledge. In
Proc. of the Con-ference on Empirical Methods for Natural Language Processing (EMNLP) , pages 57–66,2018.K. Toutanova and D. Chen. Observed versus latent features for knowledge base and textinference. In
CVSC workshop , 2015.H. Trivedi, H. Kwon, T. Khot, A. Sabharwal, and N. Balasubramanian. Entailment-basedQuestion Answering over Multiple Sentences. In
Proc. of the Annual Meeting of the NorthAmerican Association of Computational Linguistics (NAACL) , 2019.A. M. Turing. Computing machinery and intelligence.
Mind, 59(236):433, 1950.
P. D. Turney. Distributional semantics beyond words: Supervised learning of analogy and paraphrase.
TACL , 1:353–366, 2013.P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics.
Journal of artificial intelligence research , 37:141–188, 2010.K. Tymoshenko, D. Bonadiman, and A. Moschitti. Convolutional Neural Networks vs.Convolution Kernels: Feature Engineering for Answer Sentence Reranking. In
HLT-NAACL , 2016.C. Unger, L. B¨uhmann, J. Lehmann, A.-C. N. Ngomo, D. Gerber, and P. Cimiano.Template-based question answering over RDF data. In
Proceedings of the 21st inter-national conference on World Wide Web , pages 639–648. ACM, 2012.A. Vempala, E. Blanco, and A. Palmer. Determining Event Durations: Models and ErrorAnalysis. In
Proceedings of the 2018 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies, Volume 2(Short Papers) , volume 2, pages 164–168, 2018.B. Wang, K. Liu, and J. Zhao. Inner Attention based Recurrent Neural Networks for AnswerSelection. In
Proc. of the Annual Meeting of the Association of Computational Linguistics(ACL) , 2016.C. Wang, N. Xue, S. Pradhan, and S. Pradhan. A Transition-based Algorithm for AMRParsing. In
HLT-NAACL , pages 366–375, 2015.H. Wang, D. Yu, K. Sun, J. Chen, D. Yu, D. Roth, and D. McAllester. Evidence SentenceExtraction for Machine Reading Comprehension. arXiv preprint arXiv:1902.08852 , 2019.W. Wang, M. Yan, and C. Wu. Multi-granularity hierarchical attention fusion networksfor reading comprehension and question answering. In
Proceedings of the 56th AnnualMeeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages1705–1714, 2018.D. J. Watts and S. H. Strogatz. Collective dynamics of small-worldnetworks. nature , 393(6684):440, 1998.J. Wieting, M. Bansal, K. Gimpel, K. Livescu, and D. Roth. From Paraphrase Databaseto Compositional Paraphrase Model and Back.
TACL , 3:345–358, 2015.J. Williams. Extracting fine-grained durations for verbs from Twitter. In
Proceedingsof ACL 2012 Student Research Workshop , pages 49–54. Association for ComputationalLinguistics, 2012.T. Winograd. Understanding natural language.
Cognitive psychology, 3(1):1–191, 1972.
W. A. Woods. Progress in natural language understanding: an application to lunar geology. In
Proceedings of the June 4-8, 1973, national computer conference and exposition , pages441–450. ACM, 1973.S. Yang, L. Zou, Z. Wang, J. Yan, and J.-R. Wen. Efficiently Answering Technical Questions-A Knowledge Graph Approach. In
Proceedings of the National Conference on ArtificialIntelligence (AAAI) , pages 3111–3118, 2017.Y. Yang, W. Yih, and C. Meek. WikiQA: A Challenge Dataset for Open-Domain QuestionAnswering. In
Proceedings of the 2015 Conference on Empirical Methods in NaturalLanguage Processing, EMNLP 2015 , pages 2013–2018, 2015. URL http://aclweb.org/anthology/D/D15/D15-1237.pdf .Y. Yang, L. Birnbaum, J.-P. Wang, and D. Downey. Extracting Commonsense Propertiesfrom Embeddings with Limited Human Guidance. In
Proceedings of the 56th AnnualMeeting of the Association for Computational Linguistics (Volume 2: Short Papers) ,volume 2, pages 644–649, 2018.X. Yao and B. V. Durme. Information extraction over structured data: Question answeringwith Freebase. In , 2014.W. Yin, S. Ebert, and H. Sch¨utze. Attention-based convolutional neural network for machinecomprehension. In
NAACL HCQA Workshop , 2016.L. A. Zadeh. The concept of a linguistic variable and its application to approximate rea-soningI.
Information sciences , 8(3):199–249, 1975.L. A. Zadeh. PRUFa meaning representation language for natural languages.
InternationalJournal of man-machine studies , 10(4):395–460, 1978.R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi. SWAG: A Large-Scale Adversarial Dataset forGrounded Commonsense Inference. In
Proceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing , pages 93–104, 2018.L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structuredclassification with probabilistic categorial grammars.
UAI , 2005.S. Zhang, R. Rudinger, K. Duh, and B. V. Durme. Ordinal Common-sense Inference.
Transactions of the Association of Computational Linguistics , 5(1):379–395, 2017.B. Zhou, D. Khashabi, Q. Ning, and D. Roth. “going on a vacation” takes longer than“going for a walk”: A study of temporal commonsense understanding. In
Proceedings ofthe Conference on Empirical Methods in Natural Language Processing (EMNLP) , 2019.L. Zou, R. Huang, H. Wang, J. X. Yu, W. He, and D. Zhao. Natural language questionanswering over RDF: a graph data driven approach. In