A Relational Tsetlin Machine with Applications to Natural Language Understanding
Rupsa Saha, Ole-Christoffer Granmo, Vladimir I. Zadorozhny, Morten Goodwin
ARXIV PREPRINT
Abstract—Tsetlin Machines (TMs) are a pattern recognition approach that uses finite state machines for learning and propositional logic to represent patterns. In addition to being natively interpretable, they have provided competitive accuracy for various tasks. In this paper, we increase the computing power of TMs by proposing a first-order logic-based framework with Herbrand semantics. The resulting TM is relational and can take advantage of logical structures appearing in natural language, to learn rules that represent how actions and consequences are related in the real world. The outcome is a logic program of Horn clauses, bringing in a structured view of unstructured data. In closed-domain question-answering, the first-order representation produces an order of magnitude more compact KBs, along with an increase in answering accuracy. The approach is further robust towards erroneous, missing, and superfluous information, distilling the aspects of a text that are important for real-world understanding.

1 INTRODUCTION

Using
Artificial Intelligence (AI) to answer natural language questions has long been an active research area, considered an essential aspect of machines ultimately achieving human-level world understanding. Large-scale structured knowledge bases (KBs), such as Freebase [1], have been a driving force behind successes in this field. The KBs encompass massive, ever-growing amounts of information, which enable easier handling of Open-Domain Question-Answering (QA) [2] by organizing a large variety of answers in a structured format. The difficulty arises in successfully interpreting natural language by artificially intelligent agents, both to build the KBs from natural language text resources and to interpret the questions asked.

Generalization beyond the information stored in a KB further complicates the QA problem. Human-level world understanding requires abstracting from specific examples to build more general concepts and rules. When the information stored in the KB is error-free and consistent, generalization becomes a standard inductive reasoning problem. However, abstracting world-knowledge entails dealing with uncertainty, vagueness, exceptions, errors, and conflicting information. This is particularly the case when relying on AI approaches to extract and structure information, which is notoriously error-prone.

This paper addresses the above QA challenges by proposing a Relational TM that builds non-recursive first-order Horn clauses from specific examples, distilling general concepts and rules.
Tsetlin Machines [3] are a pattern recognition approach to constructing human-understandable patterns from given data, founded on propositional logic. While the idea of the Tsetlin automaton (TA) [4] has been around since the 1960s, using it in pattern recognition is relatively new. TMs have successfully addressed several machine learning tasks, including natural language understanding [5], [6], [7], [8], [9], image analysis [10], classification [11], regression [12], and speech understanding [13]. The propositional clauses constructed by a TM have high discriminative power and constitute a global description of the task learnt [8], [14]. Apart from maintaining accuracy comparable to state-of-the-art machine learning techniques, the method has also provided a smaller memory footprint and faster inference than more traditional neural network-based models [11], [13], [15], [16]. Furthermore, [17] shows that TMs can be fault-tolerant, able to mask stuck-at faults. However, although TMs can express any propositional formula by using disjunctive normal form, first-order logic is required to obtain computing power equivalent to a universal Turing machine. In this paper, we take the first steps towards increasing the computing power of TMs by introducing a first-order TM framework with Herbrand semantics, referred to as the Relational TM. Accordingly, we will in the following denote the original approach as the Propositional TM.

• R. Saha, O. C. Granmo and M. Goodwin are with the Centre for AI Research, Department of IKT, University of Agder, Norway.
• V. I. Zadorozhny is with the School of Computing and Information, University of Pittsburgh, USA, and the Centre for AI Research, University of Agder, Norway.

Closed-Domain Question-Answering:
As proof-of-concept, we apply our proposed Relational TM to so-called Closed-Domain QA. Closed-Domain QA assumes a text (single or multiple sentences) followed by a question which refers to some aspect of the preceding text. Accordingly, the amount of information that must be navigated is less than for open question-answering. Yet, answering closed-domain questions poses a significant natural language understanding challenge.

Consider the following example of information, taken from [18]: "The Black Death is thought to have originated in the arid plains of Central Asia, where it then travelled along the Silk Road, reaching Crimea by 1343. From there, it was most likely carried by Oriental rat fleas living on the black rats that were regular passengers on merchant ships." One can then have questions such as "Where did the black death originate?" or "How did the black death make it to the Mediterranean and Europe?". These questions can be answered completely with just the information provided; hence, this is an example of closed-domain question answering. However, mapping the question to the answer requires not only natural language processing, but also a fair bit of language understanding.

Here is a much simpler example: "Bob went to the garden. Sue went to the cafe. Bob walked to the office." This information forms the basis for questions like "Where is Bob?" or "Where is Sue?". Taking it a step further, given the previous information and questions, one can envision a model that learns to answer similar questions based on similar information, even though the model has never seen the specifics of the information before (i.e., the names and the locations).

With QA being such an essential area of Natural Language Understanding, many different approaches have been proposed.
Common approaches to QA include the following:
• Linguistic techniques, such as tokenization, POS tagging and parsing, that transform questions into a precise query that merely extracts the respective response from a structured database;
• Statistical techniques, such as Support Vector Machines, Bayesian Classifiers, and maximum entropy models, trained on large amounts of data, especially for open QA;
• Pattern matching, using surface text patterns with templates for response generation.
Many methods use a hybrid approach encompassing more than one of these techniques for increased accuracy. Most QA systems suffer from a lack of generality, and are tuned for performance in restricted use cases. Lack of available explainability also hinders researchers' quest to identify pain points and possible major improvements [19], [20].
Paper Contributions:
Our main contributions in this paper are as follows:
• We introduce a Relational TM, as opposed to a propositional one, founded on non-recursive Horn clauses and capable of processing relations, variables and constants.
• We propose an accompanying relational framework for efficient representation and processing of the QA problem.
• We provide empirical evidence uncovering that the Relational TM produces at least one order of magnitude more compact KBs than the Propositional TM. At the same time, answering accuracy increases because of more general rules.
• We provide a model-theoretical interpretation for the proposed framework.
Overall, our Relational TM unifies knowledge representation, learning, and reasoning in a single framework.
Paper Organization:
The paper is organized as follows. In Section 2, we present related work on Question Answering. Section 3 focuses on the background of the Propositional TM and the details of the new Relational TM. In Sections 4 and 5, we describe how we employ Relational TMs in QA and related experiments.
2 BACKGROUND AND RELATED WORK
The problem of QA is related to numerous aspects of Knowledge Engineering and Data Management.

Knowledge engineering deals with constructing and maintaining knowledge bases to store knowledge of the real world in various domains. Automated reasoning techniques use this knowledge to solve problems in domains that ordinarily require human logical reasoning. Therefore, the two key issues in knowledge engineering are how to construct and maintain knowledge bases, and how to derive new knowledge from existing knowledge effectively and efficiently. Automated reasoning is concerned with building computing systems that automate this process. Although the overall goal is to automate different forms of reasoning, the term has largely been identified with valid deductive reasoning as conducted in logical systems. This is done by combining known (yet possibly incomplete) information with background knowledge and making inferences regarding unknown or uncertain information.

Typically, such a system consists of subsystems like a knowledge acquisition system, the knowledge base itself, an inference engine, an explanation subsystem and a user interface. The knowledge model has to represent the relations between multiple components in a symbolic, machine-understandable form, and the inference engine has to manipulate those symbols to be capable of reasoning. The "way to reason" can range from earlier versions that were simple rule-based systems to more complex and recent approaches based on machine learning, especially on Deep Learning. Typically, rule-based systems suffered from lack of generality, and the need for human experts to create rules in the first place.
On the other hand, most machine-learning-based approaches have the disadvantage of not being able to justify the decisions they take in a human-understandable form [21], [22].

While databases have long been a mechanism of choice for storing information, they only had inbuilt capability to identify relations between various components, and did not have the ability to support reasoning based on such relations. Efforts to combine formal logic programming with relational databases led to the advent of deductive databases. In fact, the field of QA is said to have arisen from the initial goal of performing deductive reasoning on a set of given facts [23]. In deductive databases, the semantics of the information are represented in terms of mathematical logic. Queries to deductive databases also follow the same logical formulation [24]. One such example is ConceptBase [25], which used the Prolog-inspired language O-Telos for logical knowledge representation and querying using a deductive object-oriented database framework.

With the rise of the internet, there came a need for unification of information on the web. The Semantic Web (SW) proposed by W3C is one of the approaches that bridges the gap between the Knowledge Representation and the Web Technology communities. However, reasoning and consistency checking is still not very well developed, despite the underlying formalism that accompanies the semantic web. One way of introducing reasoning is via description logic. It involves concepts (unary predicates) and roles (binary predicates), and the idea is that implicitly captured knowledge can be inferred from the given descriptions of concepts and roles [26], [27].

One of the major learning exercises is carried out by the NELL mechanism proposed by [28], which aims to learn many semantic categories from primarily unlabeled data. At present, NELL uses simple frame-based knowledge
representation, augmented by the PRA reasoning system. The reasoning system performs tractable, but limited, types of reasoning based on restricted Horn clauses. NELL's capabilities are already limited in part by its lack of more powerful reasoning components; for example, it currently lacks methods for representing and reasoning about time and space. Hence, core AI problems of representation and tractable reasoning are also core research problems for never-ending learning agents.

While other approaches such as neural networks are considered to provide attribute-based learning, Inductive Logic Programming (ILP) is an attempt to overcome their limitations by moving the learning away from the attributes themselves and more towards the level of first-order predicate logic. ILP builds upon the theoretical framework of logic programming and looks to construct a predicate logic given background knowledge, positive examples and negative examples. One of the main advantages of ILP over attribute-based learning is ILP's generality of representation for background knowledge. This enables the user to provide, in a more natural way, domain-specific background knowledge to be used in learning. The use of background knowledge enables the user both to develop a suitable problem representation and to introduce problem-specific constraints into the learning process. Over the years, ILP has evolved from depending on hand-crafted background knowledge only, to employing different technologies in order to learn the background knowledge as part of the process. In contrast to typical machine learning, which uses feature vectors, ILP requires the knowledge to be in terms of facts and rules governing those facts. Predicates can either be supplied or deduced, and one of the advantages of this method is that newer information can be added easily, while previously learnt information can be maintained as required [29].
Probabilistic inductive logic programming is an extension of ILP, where logic rules, as learnt from the data, are further enhanced by learning probabilities associated with such rules [30], [31], [32].

To sum up, none of the above approaches can be efficiently and systematically applied to the QA problem, especially in uncertain and noisy environments. In this paper we propose a novel approach to tackle this problem. Our approach is based on a relational representation of QA, and on a novel Relational TM technique for answering questions. We elaborate on the proposed method in the next two sections.
3 BUILDING A RELATIONAL TSETLIN MACHINE
A Tsetlin Automaton (TA) is a deterministic automaton that learns the optimal action among the set of actions offered by an environment. It performs the action associated with its current state, which triggers a reward or penalty based on the ground truth. The state is updated accordingly, so that the TA progressively shifts focus towards the optimal action [4]. A TM consists of a collection of such TAs, which together create complex propositional formulas using conjunctive clauses.
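To make this mechanism concrete, the following is a minimal Python sketch of a single two-action TA with 2N states; the class and method names are our own illustration, not an implementation from the paper:

```python
import random

class TsetlinAutomaton:
    """A two-action Tsetlin Automaton with 2*n states (sketch).

    States 1..n select action 0 (e.g., Exclude); states n+1..2n
    select action 1 (e.g., Include). Rewards push the automaton
    deeper into the current action's half of the state space;
    penalties push it towards the opposite action.
    """

    def __init__(self, n=100):
        self.n = n
        # Start at one of the two center states, chosen at random.
        self.state = random.choice([n, n + 1])

    def action(self):
        return 0 if self.state <= self.n else 1

    def reward(self):
        # Reinforce the current action.
        if self.action() == 0:
            self.state = max(1, self.state - 1)
        else:
            self.state = min(2 * self.n, self.state + 1)

    def penalize(self):
        # Weaken the current action; may flip it at the boundary.
        if self.action() == 0:
            self.state += 1
        else:
            self.state -= 1

# A TA that is rewarded repeatedly converges on its current action.
ta = TsetlinAutomaton(n=10)
for _ in range(20):
    ta.reward()
```

Repeated rewards drive the state to an extreme, making the chosen action hard to unlearn, while repeated penalties eventually flip the action.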
A TM takes a vector X = (x_1, ..., x_f) of propositional variables as input, to be classified into one of two classes, y = 0 or y = 1. Together with their negated counterparts, x̄_k = ¬x_k = 1 − x_k, the features form a literal set L = {x_1, ..., x_f, x̄_1, ..., x̄_f}. We refer to this "regular" TM as a Propositional TM, due to the input it works with and the output it produces.

A TM pattern is formulated as a conjunctive clause C_j, formed by ANDing a subset L_j ⊆ L of the literal set:

C_j(X) = ⋀_{l_k ∈ L_j} l_k = ∏_{l_k ∈ L_j} l_k.   (1)

E.g., the clause C_j(X) = x_1 ∧ x_2 = x_1 x_2 consists of the literals L_j = {x_1, x_2} and outputs 1 iff x_1 = x_2 = 1.

The number of clauses employed is a user-set parameter n. Half of the n clauses are assigned positive polarity (C_j^+). The other half is assigned negative polarity (C_j^−). The clause outputs, in turn, are combined into a classification decision through summation:

v = Σ_{j=1}^{n/2} C_j^+(X) − Σ_{j=1}^{n/2} C_j^−(X).   (2)

In effect, the positive clauses vote for y = 1 and the negative for y = 0. Classification is performed based on a majority vote, using the unit step function: ŷ = u(v) = 1 if v ≥ 0, else 0. The classifier ŷ = u(x_1 x̄_2 + x̄_1 x_2 − x_1 x_2 − x̄_1 x̄_2), for instance, captures the XOR-relation.

Alg. 1 encompasses the entire learning procedure. We observe that learning is performed by a team of 2f TAs per clause, one TA per literal l_k (Alg. 1, Step 2). Each TA has two actions – Include or Exclude – and decides whether to include its designated literal l_k in its clause.

TMs learn on-line, processing one training example (X, y) at a time (Step 7). The TAs first produce a new configuration of clauses (Step 8), C_1^+, ..., C_{n/2}^−, followed by calculating a voting sum v (Step 9). Feedback is then handed out stochastically to each TA team.
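The XOR classifier above can be written out directly; the following Python sketch hard-codes its four clauses (an illustration of clause evaluation and voting only, not of the learning procedure):

```python
def clause(literals):
    """Build a conjunctive clause from (index, negated) literal specs."""
    def evaluate(x):
        # A clause outputs 1 iff all of its included literals are 1.
        return int(all((1 - x[i]) if neg else x[i] for i, neg in literals))
    return evaluate

# The XOR classifier from the text:
# y_hat = u(x1*~x2 + ~x1*x2 - x1*x2 - ~x1*~x2)
positive = [clause([(0, False), (1, True)]),   # x1 AND NOT x2
            clause([(0, True), (1, False)])]   # NOT x1 AND x2
negative = [clause([(0, False), (1, False)]),  # x1 AND x2
            clause([(0, True), (1, True)])]    # NOT x1 AND NOT x2

def classify(x):
    # Voting sum v, then the unit step function u(v).
    v = sum(c(x) for c in positive) - sum(c(x) for c in negative)
    return 1 if v >= 0 else 0
```

On the four possible inputs (0,0), (0,1), (1,0), (1,1), `classify` reproduces XOR.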
The difference ε between the clipped voting sum v^c and a user-set voting target T decides the probability of each TA team receiving feedback (Steps 12-20). Note that the voting sum is clipped to normalize the feedback probability. The voting target for y = 1 is T and for y = 0 it is −T. Observe that for any input X, the probability of reinforcing a clause gradually drops to zero as the voting sum approaches the user-set target. This ensures that clauses distribute themselves across the frequent patterns, rather than missing some and over-concentrating on others.

Clauses receive two types of feedback. Type I feedback produces frequent patterns, while Type II feedback increases the discrimination power of the patterns.

Type I feedback is given stochastically to clauses with positive polarity when y = 1 and to clauses with negative polarity when y = 0. Each clause, in turn, reinforces its TAs based on: (1) its output C_j(X); (2) the action of the TA – Include or Exclude; and (3) the value of the literal l_k assigned to the TA. Two rules govern Type I feedback:
• Include is rewarded and Exclude is penalized with probability (s−1)/s whenever C_j(X) = 1 and l_k = 1. This reinforcement is strong (triggered with high probability)
and makes the clause remember and refine the pattern it recognizes in X.

Algorithm 1 Propositional TM
Input: Tsetlin Machine TM, Example pool S, Training rounds e, Clauses n, Features f, Voting target T, Specificity s
procedure TRAIN(TM, S, e, n, f, T, s)
  for j ← 1, ..., n/2 do
    TA_j^+ ← RandomlyInitializeClauseTATeam(2f)
    TA_j^− ← RandomlyInitializeClauseTATeam(2f)
  end for
  for i ← 1, ..., e do
    (X_i, y_i) ← ObtainTrainingExample(S)
    C_1^+, ..., C_{n/2}^− ← ComposeClauses(TA_1^+, ..., TA_{n/2}^−)
    v_i ← Σ_{j=1}^{n/2} C_j^+(X_i) − Σ_{j=1}^{n/2} C_j^−(X_i)   ▷ Vote sum
    v_i^c ← clip(v_i, −T, T)   ▷ Clipped vote sum
    for j ← 1, ..., n/2 do   ▷ Update TA teams
      if y_i = 1 then
        ε ← T − v_i^c   ▷ Voting error
        TypeIFeedback(X_i, TA_j^+, s) if rand() ≤ ε/(2T)
        TypeIIFeedback(X_i, TA_j^−) if rand() ≤ ε/(2T)
      else
        ε ← T + v_i^c   ▷ Voting error
        TypeIIFeedback(X_i, TA_j^+) if rand() ≤ ε/(2T)
        TypeIFeedback(X_i, TA_j^−, s) if rand() ≤ ε/(2T)
      end if
    end for
  end for
end procedure

Algorithm 2 Relational TM
Input: Convolutional Tsetlin Machine TM, Example pool S, Number of training rounds e
procedure TRAIN(TM, S, e)
  for i ← 1, ..., e do
    (X̃, Ỹ) ← ObtainTrainingExample(S)
    A′ ← ObtainConstants(Ỹ)
    (X̃′, Ỹ′) ← VariablesReplaceConstants(X̃, Ỹ, A′)
    A″ ← ObtainConstants(X̃′)
    Q ← GenerateVariablePermutations(X̃′, A″)
    UpdateConvolutionalTM(TM, Q, Ỹ′)
  end for
end procedure

• Include is penalized and
Exclude is rewarded with probability 1/s whenever C_j(X) = 0 or l_k = 0. This reinforcement is weak (triggered with low probability) and coarsens infrequent patterns, making them frequent.
Above, the user-configurable parameter s controls pattern frequency, i.e., a higher s produces less frequent patterns.

Type II feedback is given stochastically to clauses with positive polarity when y = 0 and to clauses with negative polarity when y = 1. It penalizes Exclude whenever C_j(X) = 1 and l_k = 0. Thus, this feedback produces literals for discriminating between y = 0 and y = 1, by making the clause evaluate to 0 when facing its competing class. Further details can be found in [3].

In this section, we introduce the Relational TM, which is a major contribution of this paper. It is designed to take
1. Note that the probability (s−1)/s is replaced by 1 when boosting true positives.
Herbrand model [33],[34]. The ability to represent learning in the form of Hornclauses is extremely useful due to the fact that Horn clausesare both simple, as well as powerful enough to describe anylogical formula [34].Next, we define the
Herbrand model of a logic program.A
Herbrand Base (HB) is the set of all possible ground atoms,i.e., atomic formulas without variables, obtained using pred-icate names and constants in a logic program P . A HerbrandInterpretation is a subset I of the Herbrand Base ( I ⊆ HB ). Tointroduce the Least Herbrand Model we define the immediateconsequence operator
T P : P ( HB ) → P ( HB ) , which foran Herbrand Interpretation I produces the interpretation thatimmediately follows from I by the rules (Horn clauses) inthe program P : T P ( I ) = { A ∈ HB | A ← A , ..., A n ∈ ground ( P ) ∧ { A , ..., A n } ⊆ I } ∪ I. The least fixed point lfp(TP) of the immediate conse-quence operator with respect to subset-inclusion is the
LeastHerbrand Model (LHM) of the program P . LHM identifies thesemantics of the program P : it is the Herbrand Interpretation that contains those and only those atoms that follow fromthe program: ∀ A ∈ HB : P | = A ⇔ A ∈ LHM . As an example, consider the following program P : p ( a ) . q ( c ) .q ( X ) ← p ( X ) . Its Herbrand base is HB = { p ( a ) , p ( c ) , q ( a ) , q ( c ) } , RXIV PREPRINT 5
[Fig. 1. Relational TM processing steps: (1) obtain input (possibly with errors); (2) replace constants in the consequent with variables; (3) replace remaining constants with variables and generate all replacement permutations (cf. Convolutional TM patches); (4) evaluate the clause on each permutation, assuming a closed world (cf. Convolutional TM inference); (5) output the OR of the evaluations.]

and its Least Herbrand Model is:
LHM = lfp(T_P) = {p(a), q(a), q(c)},

which is the set of atoms that follow from the program P.

Let A = {a_1, a_2, ..., a_q} be a finite set of constants and let R = {r_1, r_2, ..., r_p} be a finite set of relations of arity w_u ≥ 1, u ∈ {1, 2, ..., p}, which forms the alphabet Σ. The Herbrand base

HB = {r_1(a_1, a_1, ..., a_1), r_1(a_1, a_1, ..., a_2), ..., r_p(a_q, a_q, ..., a_{q−1}), r_p(a_q, a_q, ..., a_q)}   (3)

is then also finite, consisting of all the ground atoms that can be expressed using A and R.

We also have a logic program P, with program rules expressed as Horn clauses without recursion. Each Horn clause has the form:

B_1 ← B_2, B_3, ..., B_d.   (4)

Here, B_l, l ∈ {1, ..., d}, is an atom r_u(Z_1, Z_2, ..., Z_{w_u}) with variables Z_1, Z_2, ..., Z_{w_u}, or its negation ¬r_u(Z_1, Z_2, ..., Z_{w_u}). The arity of r_u is denoted by w_u.

Now, let X be a subset of the LHM of P, X ⊆ lfp(T_P), and let Y be the subset of the LHM that follows from X due to the Horn clauses in P. Further assume that atoms are randomly removed from and added to X and Y to produce a possibly noisy observation (X̃, Ỹ), i.e., X̃ and Ỹ are not necessarily subsets of lfp(T_P). The learning problem is to predict the atoms in Y from the atoms in X̃ by learning from a sequence of noisy observations (X̃, Ỹ), thus uncovering the underlying program P.

We base our Relational TM on mapping the learning problem to a Propositional TM pattern recognition problem. We consider Horn clauses without variables first. In brief, we map every atom in HB to a propositional input x_k, obtaining the propositional input vector X = (x_1, ..., x_o) (cf. Section 3.1). That is, consider the w-arity relation r_u ∈ R, which takes w symbols from A as input. This relation can thus take q^w unique input combinations. As an example, with the constants A = {a_1, a_2} and the binary relations R = {r_1, r_2}, we get the propositional inputs: x_{1,1} ≡ r_1(a_1, a_1); x_{1,2} ≡ r_1(a_1, a_2); x_{1,3} ≡ r_1(a_2, a_1); x_{1,4} ≡ r_1(a_2, a_2); x_{2,1} ≡ r_2(a_1, a_1); x_{2,2} ≡ r_2(a_1, a_2); x_{2,3} ≡ r_2(a_2, a_1); and x_{2,4} ≡ r_2(a_2, a_2). Correspondingly, we perform the same mapping to get the propositional output vector Y.

Finally, obtaining an input (X̃, Ỹ), we set the propositional input x_k to true iff its corresponding atom is in X̃; otherwise it is set to false. Similarly, we set the propositional output variable y_m to true iff its corresponding atom is in Ỹ; otherwise it is set to false. Clearly, after this mapping, we get a Propositional TM pattern recognition problem that can be solved as described in Section 3.1 for a single propositional output y_m. This is illustrated as Step 1 in Fig. 1.

The TM can potentially deal with thousands of propositional inputs.
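The least-fixed-point computation for the example program, and the Step 1 mapping of ground atoms to propositional inputs, can be sketched as follows (a Python illustration using our own tuple encoding of atoms; all names are hypothetical, not the paper's implementation):

```python
from itertools import product

# --- Least Herbrand Model of the example program (sketch) ---
# P:  p(a).  q(c).  q(X) <- p(X).
facts = {('p', 'a'), ('q', 'c')}
program_constants = ['a', 'c']

def t_p(interpretation):
    """Immediate consequence operator T_P for the example program:
    ground q(X) <- p(X) over the constants and keep the facts."""
    derived = {('q', c) for c in program_constants
               if ('p', c) in interpretation}
    return interpretation | facts | derived

# Iterate T_P from the empty interpretation until the fixed point.
lhm = set()
while t_p(lhm) != lhm:
    lhm = t_p(lhm)
# lhm is now {('p', 'a'), ('q', 'a'), ('q', 'c')}

# --- Step 1: map ground atoms to propositional inputs (sketch) ---
constants = ['a1', 'a2']
relations = {'r1': 2, 'r2': 2}  # relation name -> arity

# Herbrand base: every ground atom expressible over A and R.
herbrand_base = sorted(
    (name,) + args
    for name, arity in relations.items()
    for args in product(constants, repeat=arity)
)
atom_to_index = {atom: k for k, atom in enumerate(herbrand_base)}

def propositionalize(observed_atoms):
    """Encode a (possibly noisy) set of ground atoms as a bit vector."""
    x = [0] * len(herbrand_base)
    for atom in observed_atoms:
        x[atom_to_index[atom]] = 1
    return x

x_tilde = propositionalize({('r1', 'a1', 'a2'), ('r2', 'a2', 'a2')})
```

With two constants and two binary relations, the Herbrand base has 2 · 2² = 8 atoms, matching the q^w count above, and each observation becomes an 8-bit input vector for a standard Propositional TM.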
However, we now detach our Relational TM from the constants, introducing Horn clauses with variables. Our intent is to provide a more compact representation of the program and to allow generalization beyond the data. Additionally, the detachment enables faster learning even with less data.

Let Z = {Z_1, Z_2, ..., Z_z} be z variables representing the constants appearing in an observation (X̃, Ỹ). Here, z is the largest number of unique constants involved in any particular observation (X̃, Ỹ), each requiring its own variable.

Seeking Horn clauses with variables instead of constants, we now only need to consider atoms over variable configurations (instead of over constant configurations). Again, we map the atoms to propositional inputs to construct a propositional TM learning problem. That is, each propositional input x_k represents a particular atom with a specific variable configuration: x_k ≡ r_u(Z_{α_1}, Z_{α_2}, ..., Z_{α_{w_u}}), with w_u being the arity of r_u. Accordingly, the number of constants in A no longer affects the number of propositional inputs x_k needed to represent the problem. Instead, this is governed by the number of variables in Z (and, again, the number of relations in R). That is, the number of propositional inputs is bounded by O(z^w), with w being the largest arity of the relations in R.

To detach the Relational TM from the constants, we first replace the constants in Ỹ with variables, from left to right. Accordingly, the corresponding constants in X̃ are also replaced with the same variables (Step 2 in Fig. 1). Finally, the constants now remaining in X̃ are arbitrarily replaced with additional variables (Step 3 in Fig. 1).

Since there may be multiple ways of assigning variables to constants, the above approach may produce redundant rules. One may end up with equivalent rules whose only difference is syntactic, i.e., the same rules are expressed using different variable symbols. This is illustrated in Step 3 of Fig. 1, where variables can be assigned to constants in two ways. To avoid redundant rules, the Relational TM produces all possible permutations of variable assignments. To process the different permutations, we finally perform a convolution over the permutations in Step 4, employing a TM convolution operator [10]. The target value of the convolution is the truth value of the consequent (Step 5).
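Encoding atoms as tuples, Steps 2-5 can be sketched end-to-end in Python (an illustrative sketch under our own encoding; the function names are hypothetical, not the paper's implementation):

```python
from itertools import permutations

def replace_constants(x_atoms, y_atoms):
    """Steps 2-3 (sketch): bind the constants in the consequent to
    Z1, Z2, ... from left to right, then generate every permutation
    of variable assignments for the remaining input constants."""
    binding = {}
    for atom in y_atoms:
        for c in atom[1:]:
            if c not in binding:
                binding[c] = 'Z%d' % (len(binding) + 1)
    y_vars = [(a[0],) + tuple(binding[c] for c in a[1:]) for a in y_atoms]

    # Input constants not bound by the consequent get extra variables.
    free = sorted({c for a in x_atoms for c in a[1:] if c not in binding})
    extra = ['Z%d' % (len(binding) + i + 1) for i in range(len(free))]

    variants = []
    for perm in permutations(extra):
        full = dict(binding, **dict(zip(free, perm)))
        variants.append({(a[0],) + tuple(full[c] for c in a[1:])
                         for a in x_atoms})
    return variants, y_vars

def clause_body_holds(body, present_atoms):
    """Step 4 (sketch): closed-world evaluation -- every atom in the
    clause body must be present in the variant."""
    return all(atom in present_atoms for atom in body)

# The grandparent example from Fig. 1:
x_atoms = [('parent', 'Bob', 'Mary'), ('parent', 'Mary', 'Peter'),
           ('parent', 'Bob', 'Jane')]
y_atoms = [('grandparent', 'Bob', 'Peter')]
variants, y_vars = replace_constants(x_atoms, y_atoms)

# Learned rule body: parent(Z1, Z3), parent(Z3, Z2).
body = [('parent', 'Z1', 'Z3'), ('parent', 'Z3', 'Z2')]

# Step 5: OR over the variants, mirroring the convolution operator.
prediction = int(any(clause_body_holds(body, v) for v in variants))
```

Here two free constants (Mary and Jane) yield two variable-assignment variants; one of them satisfies the clause body, so the OR over the variants outputs 1, exactly the closed-world, existentially-quantified behavior described above.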
Fig. 1 contains an example of the process of detaching a Relational TM from constants. We use the parent-grandparent relationship as an example, employing the following Horn clause:

grandparent(Z_1, Z_2) ← parent(Z_1, Z_3), parent(Z_3, Z_2).

We replace the constants in each training example with variables, before learning the clauses. Thus the Relational TM never "sees" the constants, just the generic variables. Assume the first training example is

Input: parent(Bob, Mary) = 1;
Target output: child(Mary, Bob) = 1.

Then Mary is replaced with Z_1 and Bob with Z_2 in the target output:

Input: parent(Bob, Mary) = 1;
Target output: child(Z_1, Z_2) = 1.

We perform the same exchange for the input, getting:

Input: parent(Z_2, Z_1) = 1;
Target output: child(Z_1, Z_2) = 1.

Here, "parent(Z_2, Z_1)" is treated as an input feature by the Relational TM. That is, "parent(Z_2, Z_1)" is seen as a single propositional variable that is either 0 or 1, and the name of the variable is simply the string "parent(Z_2, Z_1)". The constants may be changing from example to example, so next time it may be Mary and Ann. However, they all end up as Z_1 or Z_2 after being replaced by the variables. After some time, the Relational TM would then learn the following clause:

child(Z_1, Z_2) ← parent(Z_2, Z_1).

This is because the feature "parent(Z_2, Z_1)" predicts "child(Z_1, Z_2)" perfectly. Other candidate features like "parent(Z_1, Z_2)" or "parent(Z_2, Z_3)" are poor predictors of "child(Z_1, Z_2)" and will be excluded by the TM. Here, Z_3 is a free variable representing some other constant, different from Z_1 and Z_2.

Then the next training example comes along:

Input: parent(Bob, Mary) = 1;
Target output: child(Jane, Bob) = 0.

Again, we start with replacing the constants in the target output with variables:

Input: parent(Bob, Mary) = 1;
Target output: child(Z_1, Z_2) = 0,

which is then completed for the input:

Input: parent(Z_2, Mary) = 1;
Target output: child(Z_1, Z_2) = 0.

The constant Mary was not in the target output, so we introduce a free variable Z_3 for representing Mary:

Input: parent(Z_2, Z_3) = 1;
Target output: child(Z_1, Z_2) = 0.

The currently learnt clause was:

child(Z_1, Z_2) ← parent(Z_2, Z_1).

The feature "parent(Z_2, Z_1)" is not present in the input in the second training example, only "parent(Z_2, Z_3)". Assuming a closed world, we thus have "parent(Z_2, Z_1)" = 0. Accordingly, the learnt clause correctly outputs 0.

For some inputs, there can be many different ways variables can be assigned to constants (for the output, variables are always assigned in a fixed order, from Z_1 to Z_z). Returning to our grandparent example in Fig. 1, if we have:

Input: parent(Bob, Mary); parent(Mary, Peter); parent(Bob, Jane);
Target output: grandparent(Bob, Peter),

replacing Bob with Z_1 and Peter with Z_2, we get:

Input: parent(Z_1, Mary); parent(Mary, Z_2); parent(Z_1, Jane);
Target output: grandparent(Z_1, Z_2).

Above, both Mary and Jane are candidates for being Z_3. One way to handle this ambiguity is to try both, and pursue those that make the clause evaluate correctly, which is exactly how the TM convolution operator works [10]. Note that above, there is an implicit existential quantifier over Z_3. That is,

∀Z_1, Z_2 (∃Z_3 (parent(Z_1, Z_3) ∧ parent(Z_3, Z_2)) → grandparent(Z_1, Z_2)).

A practical view of how the TM learns in such a scenario is shown in Fig. 2. Continuing in the same vein as the previous examples, the Input Text in the figure is a set of statements, each followed by a question. The text is reduced to a set of relations, which act as the features for the TM to learn from.
The figure illustrates how the TM learns relevant information (while disregarding the irrelevant), in order to successfully answer a new question (Test Document). The input text is converted into a feature vector which indicates the presence or absence of relations (R₁, R₂) in the text, where R₁ and R₂ are respectively the MoveTo (in the statements) and WhereIs (in the question) relations. For further simplification and compactness of representation, instead of using person and location names, those specific details are replaced by (P₁, P₂) and (L₁, L₂), respectively. In each sample, the person name that occurs first is termed P₁ throughout, the second unique name is termed P₂, and so on, and similarly for the locations. As seen in the figure, the TM reduces the feature-set representation of the input into a set of logical conditions or clauses, all of which together describe scenarios in which the answer is L₂ (or Location 2).

Remark 1.
We now return to the implicit existential and universal quantifiers of the Horn clauses, exemplified in: ∀Z₁, Z₂ (∃Z₃ (parent(Z₁, Z₃) ∧ parent(Z₃, Z₂)) → grandparent(Z₁, Z₂)). A main goal of the procedure in Fig. 1 is to correctly deal with the quantifiers "for all" and "exists". "For all" maps directly to the TM architecture because the TM is operating with conjunctive clauses and the goal is to make these evaluate to 1 (True) whenever the learning target is 1. "For all" quantifiers are taken care of in Step 3 of the relational learning procedure.

Remark 2. "Exists" is more difficult because it means that we are looking for a specific value for the variables in the scope of the quantifier that makes the expression evaluate to 1. This is handled in Steps 4-6 in Fig. 1, by evaluating all alternative values (all permutations when multiple variables are involved). Some values make the expression evaluate to 0 (False) and some make the expression become 1. If none makes the expression 1, the output of the clause is 0. Otherwise, the output is 1. Remarkably, this is exactly the behavior of the TM convolution operator defined in [10], so we have an existing learning procedure in place to deal with the "exists" quantifier. (If there exists a patch in the image that makes the clause evaluate to 1, the clause evaluates to 1.)

QA IN A RELATIONAL TM FRAMEWORK
In this section, we describe the general pipeline for reducing natural language text to a machine-understandable relational representation that facilitates question answering. Fig. 3 shows the pipeline diagrammatically, with a small example. Throughout this section, we make use of two toy examples in order to illustrate the steps. One of them is derived from a standard question answering dataset [35]. The other is a simple handcrafted dataset, inspired by [36], that refers to parent-child relationships among a group of people. Both datasets consist of instances, where each instance is a set of two or more statements, followed by a query. The expected output for each instance is the answer to the query based on the statements.
As a first step, we need to extract the relation(s) present in the text. A relation here is a connection between two (or more) elements of a text. As discussed before, relations occur in natural language, and reducing a text to its constituent relations makes it more structured while ignoring superfluous linguistic elements, leading to easier understanding. We assume that our text consists of simple sentences, that is, each sentence contains only one relation. The relation found in the query is either equal to, or linguistically related to, the relations found in the statements preceding the query. Table 1 shows examples of Relation Extraction on our two datasets. In Example-Movement, each statement has the relation "MoveTo", while the query is related to "Location". The "Location" can be thought of as a result of the "MoveTo" relations. Example-Parentage has "Parent" relations as the information and "Grandparent" as the query.
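For simple sentences of the kind used here, the relation extraction step can be sketched with a handful of surface patterns. The patterns below are our own illustrative rules for the two toy datasets, not the paper's extractor:

```python
import re

# Toy surface patterns for the simple sentences in the two datasets
# (illustrative only; any real pipeline would need broader coverage).
PATTERNS = [
    (re.compile(r"(\w+) (?:went|moved|walked) to the (\w+)"), "MoveTo"),
    (re.compile(r"Where is (\w+)\?"), "Location"),
    (re.compile(r"(\w+) is a parent of (\w+)"), "Parent"),
    (re.compile(r"Is (\w+) a grandparent of (\w+)\?"), "Grandparent"),
]

def extract(sentence):
    """Return (relation, entities) for a simple sentence, or None."""
    for pattern, relation in PATTERNS:
        m = pattern.search(sentence)
        if m:
            return relation, list(m.groups())
    return None

extract("Mary went to the office.")        # ("MoveTo", ["Mary", "office"])
extract("Is Bob a grandparent of Peter?")  # ("Grandparent", ["Bob", "Peter"])
```

Applied to each sentence of an instance, this reproduces the Sentence/Relation pairs of Table 1.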
TABLE 1
Relation Extraction

Sentence                        Relation
Mary went to the office.        MoveTo
John moved to the hallway.      MoveTo
Where is Mary?                  Location
Example-Movement

Sentence                        Relation
Bob is a parent of Mary.        Parent
Bob is a parent of Jane.        Parent
Mary is a parent of Peter.      Parent
Is Bob a grandparent of Peter?  Grandparent
Example-Parentage
Once the relations have been identified, the next step is to identify the elements of the text (or the entities) that are participating in the respective relations. Doing so allows us to further enrich the representation with the addition of restrictions (often real-world ones), which allow the Relational TM to learn rules that best represent actions and their consequences in a concise, logical form. Since the datasets we are using here consist only of simple sentences, each relation is limited to having a maximum of two entities (the relations are unary or binary). In this step, the more external world knowledge that can be combined with the extracted entities, the richer the resultant representation. In Table 2, Example-Movement, we could add the knowledge that the "MoveTo" relation always involves a "Person" and a "Location". Or in Example-Parentage, "Parent" is always between a "Person" and a "Person". This could, for example, prevent questions like "Jean-Joseph Pasteur was the father of Louis Pasteur. Louis Pasteur is the father of microbiology. Who is the grandfather of microbiology?" Note that, as per Fig. 3, it is only possible to start answering the query after both Relation Extraction and Entity Extraction have been performed, and not before. Knowledge of the relation also allows us to narrow down possible entities for answering the query successfully.
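The type restrictions just described can be sketched as a simple signature check. The signatures and the entity-type table below are our own illustrative world knowledge, not part of the paper's pipeline; note how it rejects the "grandfather of microbiology" construction:

```python
# Hypothetical type signatures for the relations (external world knowledge).
SIGNATURES = {"MoveTo": ("Person", "Location"),
              "Parent": ("Person", "Person")}

# A toy entity-type table; unknown entities default to "Unknown".
TYPES = {"Mary": "Person", "John": "Person", "Bob": "Person",
         "Louis Pasteur": "Person",
         "office": "Location", "hallway": "Location",
         "microbiology": "Field"}

def typed_relation(relation, entities):
    """Accept a relation instance only if its entities match the
    expected type signature; otherwise reject it (return None)."""
    expected = SIGNATURES[relation]
    actual = tuple(TYPES.get(e, "Unknown") for e in entities)
    return (relation, tuple(entities)) if actual == expected else None

typed_relation("MoveTo", ["Mary", "office"])                 # accepted
typed_relation("Parent", ["Louis Pasteur", "microbiology"])  # rejected: None
```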
TABLE 2
Entity Extraction

Sentence                        Relation     Entities
Mary went to the office.        MoveTo       Mary, office
John moved to the hallway.      MoveTo       John, hallway
Where is Mary?                  Location     Mary, ?
Example-Movement

Sentence                        Relation     Entities
Bob is a parent of Mary.        Parent       Bob, Mary
Bob is a parent of Jane.        Parent       Bob, Jane
Mary is a parent of Peter.      Parent       Mary, Peter
Is Bob a grandparent of Peter?  Grandparent  Bob, Peter
Example-Parentage
Fig. 2. The Relational TM in operation.
Fig. 3. General pipeline.

One of the drawbacks of the relational representation is that there is a huge increase in the number of possible relations as more and more examples are processed. One way to reduce the spread is to reduce individual entities from their specific identities to a more generalised identity. Let us consider two instances: "Mary went to the office. John moved to the hallway. Where is Mary?" and "Sarah moved to the garage. James went to the kitchen. Where is Sarah?". Without generalization, we end up with six different relations: MoveTo(Mary, Office), MoveTo(John, Hallway), Location(Mary), MoveTo(Sarah, Garage), MoveTo(James, Kitchen), Location(Sarah). However, to answer either of the two queries, we only need the relations pertaining to the query itself. Taking advantage of that, we can generalize both instances to just three relations: MoveTo(Person₁, Location₁), MoveTo(Person₂, Location₂) and Location(Person₁).

In order to prioritise, the entities present in the query relation are the first to be generalized. All occurrences of those entities in the relations preceding the query are also replaced by suitable placeholders. The entities present in the other relations are then replaced by what can be considered as free variables (since they do not play a role in answering the query).

In the next section we explain how this relational framework is utilized for question answering using the TM.

One of the primary differences between the relational framework proposed in this paper and the existing TM framework lies in Relation Extraction and Entity Generalization. The reason for these steps is that they allow us more flexibility in terms of what the TM learns, while keeping the general learning mechanism unchanged. Extracting relations allows the TM to focus only on the major operations expressed via language, without getting caught up in multiple superfluous expressions of the same thing. It also enables bridging the gap between structured and unstructured data.
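The Entity Generalization step, with query entities generalized first, can be sketched as follows. This is our own minimal sketch: query entities draw placeholders first (X, Y, ...), then the remaining source-only entities (Z, W, ...); typed entities such as locations would analogously draw from a separate alphabet (A, B, ...), as in the tables below.

```python
def generalize(sources, query):
    """Replace constants with placeholders, query entities first.
    Query entities are bound to X, Y, ...; entities occurring only in
    the source relations continue through Z, W, ... (free variables)."""
    names = iter("XYZWVU")
    mapping = {}

    def placeholder(entity):
        if entity not in mapping:
            mapping[entity] = next(names)
        return mapping[entity]

    q_pred, q_args = query
    gen_query = (q_pred, tuple(placeholder(a) for a in q_args))  # query first
    gen_sources = [(p, tuple(placeholder(a) for a in args))
                   for p, args in sources]
    return gen_sources, gen_query

sources = [("Parent", ("Bob", "Mary")), ("Parent", ("Bob", "Jane")),
           ("Parent", ("Mary", "Peter"))]
gen_sources, gen_query = generalize(sources, ("Grandparent", ("Bob", "Peter")))
# gen_query   -> ("Grandparent", ("X", "Y"))
# gen_sources -> [("Parent", ("X", "Z")), ("Parent", ("X", "W")),
#                 ("Parent", ("Z", "Y"))]
```

Running this on the parentage example reproduces the reduced relations of the Example-Parentage table below.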
Using relations helps the resultant clauses be closer to real-world phenomena, since they model actions and consequences, rather than the interactions between words. Entity Generalization allows the clauses to be succinct and precise, adding another layer of abstraction away from specific literals, much like Relation Extraction. It also gives the added benefit of making the learning of the TM more generalized. In fact, due to this process, the learning reflects 'concepts' gleaned from the training examples, rather than the examples themselves.

To evaluate computational complexity, we employ three costs α, β, and γ, where α is the cost of performing the conjunction of two bits, β is the cost of computing the summation of two integers, and γ is the cost of updating the state of a single automaton. In a Propositional TM, the worst case scenario is the one in which all the clauses are updated. Therefore, a TM with m clauses and an input vector of o features needs to perform (2o + 1) × m Tsetlin Automaton updates for a single training sample. For a total of d training samples, the total cost of updating is d × γ × (2o + 1) × m. Once weight updates have been successfully performed, the next step is to calculate the clause outputs. Here the worst case is represented by all the clauses including all the
TABLE 3
Entity Generalization: Part 1

Sentence                        Relation     Entities       Role    Reduced Relation
Mary went to the office.        MoveTo       Mary, office   Source  MoveTo(X, office)
John moved to the hallway.      MoveTo       John, hallway  Source  MoveTo(John, hallway)
Where is Mary?                  Location     Mary, ?        Target  Location(X, ?)
Example-Movement

Sentence                        Relation     Entities       Role    Reduced Relation
Bob is a parent of Mary.        Parent       Bob, Mary      Source  Parent(X, Mary)
Bob is a parent of Jane.        Parent       Bob, Jane      Source  Parent(X, Jane)
Mary is a parent of Peter.      Parent       Mary, Peter    Source  Parent(Mary, Y)
Is Bob a grandparent of Peter?  Grandparent  Bob, Peter     Target  Grandparent(X, Y)
Example-Parentage
TABLE 4
Entity Generalization: Part 2

Sentence                        Relation     Entities       Role    Reduced Relation
Mary went to the office.        MoveTo       Mary, office   Source  MoveTo(X, A)
John moved to the hallway.      MoveTo       John, hallway  Source  MoveTo(Y, B)
Where is Mary?                  Location     Mary, ?        Target  Location(X, ?)
Example-Movement

Sentence                        Relation     Entities       Role    Reduced Relation
Bob is a parent of Mary.        Parent       Bob, Mary      Source  Parent(X, Z)
Bob is a parent of Jane.        Parent       Bob, Jane      Source  Parent(X, W)
Mary is a parent of Peter.      Parent       Mary, Peter    Source  Parent(Z, Y)
Is Bob a grandparent of Peter?  Grandparent  Bob, Peter     Target  Grandparent(X, Y)
Example-Parentage

corresponding literals, giving us a cost of α × o × m (for a single sample). The last step involves calculating the difference in votes from the clause outputs, thus incurring a per-sample cost of β × (m − 1). Taken together, the total cost function for a Propositional TM can be expressed as:

f(d) = d × [(γ × (2o + 1) × m) + (α × o × m) + (β × (m − 1))].

Expanding this calculation to the Relational TM scenario, we need to account for the extra operations being performed, as detailed earlier: Relation Extraction and Entity Generalization. The number of features per sample is restricted by the number of possible relations, both in the entirety of the training data, as well as only in the context of a single sample. For example, in the experiments involving "MoveTo" relations, we have restricted our data to have 3 statements, followed by a question (elaborated further in the next section). Each statement gives rise to a single "MoveTo" relation, which has 2 entities (a location and a person).

When using the textual constants (i.e., without Entity Generalization), the maximum number of possible features thus becomes equal to the number of possible combinations between the unique entities in the relations. Thus, if each sample contains r relations, and a relation R involves e different entities (E₁, E₂, ..., E_e), with the cardinalities of the sets E₁, E₂, ..., E_e represented as |E₁|, |E₂|, ..., |E_e|, the number of features in the above equation can be rewritten as

o = {|E₁| × |E₂| × ... × |E_e|} × r.

As discussed earlier in Section 4.3, as well as shown in the previous paragraph, this results in a large o, since it depends on the number of unique textual elements in each of the entity sets. Using Entity Generalization, the number of features no longer depends on the cardinality of the sets Eₙ, 1 ≤ n ≤ e, in the context of the whole dataset. Instead, it only depends on the context of the single sample.
Thus, if each sample contains r relations, and a relation R involves e different entities (E₁, E₂, ..., E_e), with the maximum possible cardinalities |E₁| = |E₂| = ... = |E_e| = r, the number of features becomes

o = {r × r × ... × r} × r = r^(e+1).

In most scenarios, this number is much lower than the one obtained without the use of Entity Generalization.

A little further modification is required when using the convolutional approach. In calculating f(d), the measure of o remains the same as we just showed, with or without Entity Generalization. However, the second term in the equation, which refers to the calculation of clause outputs (d × α × o × m), changes due to the difference in the mechanism of calculating outputs for the convolutional and non-convolutional approaches. In the convolutional approach, with Entity Generalization, we need to consider free and bound variables in the feature representation. Bound variables are the ones which are linked by appearance in the source and target relations, while the other variables, which are independent of that restriction, can be referred to as the free variables. Each possible permutation of the free variables in different positions is used by the convolutional approach to determine the best generic rule that describes the scenario. In certain scenarios, it may be possible to have certain permutations among the bound variables as well, without violating the restrictions added by the relations. One such scenario is detailed with an example in Section 5.2, where a bound "Person" entity can be permuted to allow any other "Person" entity, as long as the order of the "MoveTo" relations is not violated. However, it is difficult to get a generic measure for the same, which would work irrespective of the nature of the relation (or its restrictions). Therefore, for the purpose of this calculation, we only take into account the permutations afforded to us by the free variables. Using the same notation as before, if each sample contains r relations, and a relation R involves e different entities, the total number of variables is r × e. Of these, if v is the number of free variables, then they can be arranged in v! different ways. Assuming v is constant for all samples, the worst case f(d) can thus be rewritten as

f(d) = d × [(γ × (2o + 1) × m) + (v! × α × o × m) + (β × (m − 1))].

EXPERIMENTAL STUDY
To further illustrate how the TM-based logic modelling works practically, we employ examples from a standard question answering dataset [35]. For the scope of this work, we limit ourselves to the first subtask as defined in the dataset, viz. a question that can be answered by the preceding context, where the context contains a single supporting fact. To start with, there is a set of statements, followed by a question, as discussed previously. For this particular subtask, the answer to the question is obtained from a single statement out of the set of statements provided (hence the term, single supporting fact).
Input:
William moved to the office. Susan went to the garden. William walked to the pantry. Where is William?
Output: pantry

We assume the following knowledge to help us construct the task:
1) All statements only contain information pertaining to the relation MoveTo.
2) All questions only relate to information pertaining to the relation CurrentlyAt.
3) The relation MoveTo involves 2 entities, such that MoveTo(a, b): a ∈ {Persons}, b ∈ {Locations}.
4) The relation CurrentlyAt involves 2 entities, such that CurrentlyAt(a, b): a ∈ {Persons}, b ∈ {Locations}.
5) MoveTo is a time-bound relation; its effect is superseded by a similar action performed at a later time.
The size of the set of statements from which the model has to identify the correct answer influences the complexity of the task. For the purpose of this experiment, the data is capped to have a maximum of three statements per input, and overall there are five possible locations. This means that the task for the TM model is reduced to classifying the input into one of five possible classes. To prepare the data for the TM, the first step involves reducing the input to relation-entity bindings. These bindings form the basis of our feature set, which is used to train the TM. Consider the following input example:
Input => MoveTo(William, Office), MoveTo(Susan, Garden), MoveTo(William, Pantry), Q(William).

Since the TM requires binary features, each input is converted to a vector, where each element represents the presence (or absence) of the relationship instances. Secondly, the list of possible answers is obtained from the data, which is the list of class labels. Continuing our example, possible answers to the question could be:
Possible Answers: [Office, Pantry, Garden, Foyer, Kitchen]

Once training is complete, we can use the inherent interpretability of the TM to get an idea of how the model learns to discriminate the information given to it. The set of all clauses arrived at by the TM at the end of training represents a global view of the learning, i.e. what the model has learnt in general. The global view can also be thought of as a description of the task itself, as understood by the machine. We also have access to a local snapshot, which is particular to each input instance. The local snapshot involves only those clauses that help in arriving at the answer for that particular instance.

Table 5 shows the local snapshot obtained for the above example. As mentioned earlier, the TM model depends on two sets of clauses for each class, a positive set and a negative set. The positive set represents information in favour of the class, while the negative set represents the opposite. The sum of the votes given by these two sets thus determines the final class the model decides on. As seen in the example, all the classes other than "Pantry" receive more negative votes than positive, making it the winning class. The clauses themselves allow us to peek into the learning mechanism. For the class "Office", a clause captures the information that (a) the question contains "William", and (b) the relationship MoveTo(William, Office) is available. This clause votes in support of the class, i.e. this is evidence that the answer to the question "Where is William?" may be "Office". However, another clause encapsulates the previous two pieces of information, as well as something more: (c) the relationship MoveTo(William, Pantry) is available. Not only does this clause vote against the class "Office", it also ends up with a larger share of votes than the clause voting positively.

The accuracy obtained over 100 epochs for this experiment was . , with an F-score of . .
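The vote aggregation just described can be sketched as follows, using the clause vote counts from Table 5 (the function name is our own; the summation-and-argmax step mirrors the TM's class decision):

```python
def classify(clause_votes):
    """Sum the positive and negative clause votes for each class;
    the class with the highest total wins."""
    totals = {cls: sum(votes) for cls, votes in clause_votes.items()}
    return max(totals, key=totals.get), totals

# Per-class [positive, negative] vote counts from the Table 5 snapshot.
votes = {
    "Office":  [+12, -47],
    "Pantry":  [+64, -15],
    "Garden":  [+12, -48],
    "Foyer":   [0, -106],
    "Kitchen": [0, -113],
}
winner, totals = classify(votes)
# winner -> "Pantry"; totals["Pantry"] -> 49
```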
The above results were obtained by only allowing positive literals in the clauses. The descriptive power of the TM goes up if negative literals are also allowed. However, the drawback is that, while the TM is empowered to make more precise rules (and, by extension, decisions), the overall complexity of the clauses increases, making them less readable. Also, previously, the order of the MoveTo actions could be implied by the order in which they appear in the clauses, since only one positive literal can be present per sentence; but in the case of negative literals, we need to include information about the sentence order. Using the above example again, if we do allow negative literals, the positive evidence for class Office looks like:

Q(William) AND NOT(Q(Susan)) AND
MoveTo(S, William, Office) AND
NOT(MoveTo(S, William, Garden)) AND
NOT(MoveTo(S, William, Foyer)) AND
NOT(MoveTo(S, William, Kitchen)) AND
NOT(MoveTo(S, Susan, Garden)) AND
NOT(MoveTo(S, Susan, Office)) AND
NOT(MoveTo(S, Susan, Pantry)) AND
NOT(MoveTo(S, Susan, Foyer)) AND
NOT(MoveTo(S, Susan, Kitchen)),

where S denotes the sentence-order argument.
TABLE 5
Local Snapshot of Clauses for example "William moved to the office. Susan went to the garden. William walked to the pantry. Where is William?"

Class    Clause                                                            +/-  Votes  Total Votes for Class
Office   Q(William) AND MoveTo(William, Office)                             +    12    -35
         Q(William) AND MoveTo(William, Office)
           AND MoveTo(William, Pantry)                                      -    47
Pantry   Q(William) AND MoveTo(William, Office)
           AND MoveTo(William, Pantry)                                      +    64    +49
         Q(William) AND MoveTo(William, Office)                             -    15
Garden   MoveTo(Susan, Garden)                                              +    12    -36
         Q(William) AND MoveTo(William, Office)
           AND MoveTo(William, Pantry)                                      -    48
Foyer    -                                                                  +     0    -106
         Q(William) AND MoveTo(William, Office)
           AND MoveTo(William, Pantry) AND MoveTo(Susan, Garden)            -   106
Kitchen  -                                                                  +     0    -113
         Q(William) AND MoveTo(William, Office)
           AND MoveTo(William, Pantry) AND MoveTo(Susan, Garden)            -   113
At this point, we can see that the use of constants leads to a large amount of repetitive information in the rules learnt by the TM. Generalizing the constants according to their entity type can prevent this.
Given a set of sentences and a following question, the first step remains the same as in the previous subsection, i.e. reducing the input to relation-entity bindings. In the second step, we carry out a grouping by entity type, in order to generalize the information. Once the constants have been replaced by general placeholders, we continue as previously, collecting a list of possible outputs (to be used as class labels), and then training a TM-based model with binary feature vectors. As before, the data is capped to have a maximum of three statements per input. Continuing with the same example as above,
Input:
William moved to the office. Susan went to the garden. William walked to the pantry. Where is William?
Output: pantry
1. Reducing to relation-entity bindings:
Input => MoveTo(William, Office), MoveTo(Susan, Garden), MoveTo(William, Pantry), Q(William)
2. Generalizing bindings:
=> MoveTo(Per1, Loc1), MoveTo(Per2, Loc2), MoveTo(Per1, Loc3), Q(Per1)
3. Possible Answers: [Loc1, Loc2, Loc3]

The simplifying effect of generalization is seen right away: even though there are 5 possible locations in the whole dataset, for any particular instance there is always a maximum of three possibilities, since there are a maximum of three statements per instance. As seen from the local snapshot (Table 6), the clauses formed are much more compact and easily understandable. The generalization also releases the TM model from the restriction of having had to see definite constants before in order to make a decision. The model can process "Rory moved to the conservatory. Rory went to the cinema. Cecil walked to the school. Where is Rory?" without needing to have encountered the constants "Rory", "Cecil", "school", "cinema" and "conservatory" before.

The accuracy for this experiment over 100 epochs was . , with an F-score of . .

A logic-based understanding of the relation "Move" could typically be expressed as:

MoveTo(William, office) + MoveTo(Susan, garden) + MoveTo(William, pantry)
→ MoveTo(P₁, office) + MoveTo(P₂, garden) + MoveTo(P₁, pantry)
→ MoveTo(P₁, L₁) + MoveTo(P₂, L₂) + MoveTo(P₁, L₃)
→ MoveTo(P₁, L₁) + ... + MoveTo(P₁, Lₙ) ⟹ Location(P₁, Lₙ).

From the above two subsections, we can see that with more and more generalization, the learning encapsulated in the TM model can approach what could possibly be a human-level understanding of the world.
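The final, fully generalized rule — a person's location is the target of their most recent MoveTo — can be applied mechanically. A minimal sketch (our own function name, assuming statements arrive in temporal order):

```python
def where_is(moves, person):
    """Apply the learnt rule: a person's location is the target of
    their latest MoveTo; earlier moves are superseded (knowledge item 5)."""
    location = None
    for p, l in moves:      # moves are in temporal (sentence) order
        if p == person:
            location = l    # later moves overwrite earlier ones
    return location

moves = [("William", "office"), ("Susan", "garden"), ("William", "pantry")]
where_is(moves, "William")  # "pantry"
```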
As described in Section 3.2.5, we can produce all possible permutations of the available variables in each sample (after Entity Generalization), as long as the relation constraints are not violated. Doing this gives us more information per sample:
Input:
William moved to the office. Susan went to the garden. William walked to the pantry. Where is William?
Output: pantry
1. Reducing to relation-entity bindings:
Input => MoveTo(William, Office), MoveTo(Susan, Garden), MoveTo(William, Pantry), Q(William)
2. Generalizing bindings:
=> MoveTo(Per1, Loc1), MoveTo(Per2, Loc2), MoveTo(Per1, Loc3), Q(Per1)
3. Permuted variables:
=> MoveTo(Per2, Loc1), MoveTo(Per1, Loc2), MoveTo(Per2, Loc3), Q(Per2)
4. Possible Answers: [Loc1, Loc2, Loc3]

This has two primary benefits. Firstly, in a scenario where the given data does not encompass all possible structural differences in which a particular piece of information may be represented, using the permutations allows the TM to view a closer-to-complete representation from which to build its learning (and hence, its explanations). Moreover, since the TM can learn different structural permutations from a single sample, it ultimately requires fewer clauses to learn effectively. In our experiments, permutations using Relational TM Convolution allowed for up to 1.5 times fewer clauses than the non-convolutional approach. As detailed in Section 4.4, the convolutional and non-convolutional approaches have different computational complexity. Hence, the convolutional approach makes sense only when the reduction in complexity from fewer clauses balances the increase due to processing the convolutional window itself.
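The convolution operator's handling of the "exists" quantifier — a clause evaluates to 1 if some assignment of constants to its free variables satisfies every literal — can be sketched as follows (our own naming; a simplification that only grounds free variables against a closed-world fact set):

```python
from itertools import permutations

def clause_holds(clause, facts, free_vars, constants):
    """Evaluate to 1 if SOME assignment of constants to the clause's
    free variables makes every literal present in the fact set,
    mirroring the convolution operator's 'exists' semantics."""
    for assignment in permutations(constants, len(free_vars)):
        binding = dict(zip(free_vars, assignment))
        grounded = {(pred, tuple(binding.get(a, a) for a in args))
                    for pred, args in clause}
        if grounded <= facts:   # all literals satisfied under this binding
            return 1
    return 0                    # no binding works: clause outputs 0

facts = {("parent", ("Bob", "Mary")), ("parent", ("Mary", "Peter")),
         ("parent", ("Bob", "Jane"))}
# grandparent(Bob, Peter) <- parent(Bob, Z3) AND parent(Z3, Peter)
clause = [("parent", ("Bob", "Z3")), ("parent", ("Z3", "Peter"))]
clause_holds(clause, facts, ["Z3"], ["Mary", "Jane", "Peter", "Bob"])  # 1
```

Here Z3 = Jane fails but Z3 = Mary succeeds, so the clause outputs 1 — exactly the try-all-candidates behaviour described for the grandparent example.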
TABLE 6
Clause snapshot for "William moved to the office. Susan went to the garden. William walked to the pantry. Where is William?" after generalization

Class  Clause                                                 +/-  Votes  Total Votes for Class
Loc1   Q(Per1) AND MoveTo(Per1, Loc1)                          +     3    -44
       Q(Per1) AND MoveTo(Per1, Loc1) AND MoveTo(Per1, Loc3)   -    47
Loc2   MoveTo(Per1, Loc1)                                      +     2    -88
       Q(Per1) AND MoveTo(Per1, Loc3)                          -    90
Loc3   Q(Per1) AND MoveTo(Per1, Loc1) AND MoveTo(Per1, Loc3)   +    51    +51
       -                                                       -     0
TABLE 7
Average Accuracy on Test with Increase in Error in Training Data

Error Rate   0%     1%     2%     5%     10%
Accuracy     99.48  98.79  98.24  97.02  95.08
To verify our claims of the noise tolerance shown by the TM-based architecture, the above experiments were repeated, but with an increasing amount of noise artificially introduced into the training data. The results are shown in Table 7. We observe that with 1%, 2%, 5% and 10% noise, the testing accuracy fell, on average, by approximately 1.1% per noise level when entity generalization was used.
The example elaborated in the previous section can be formulated as the following Horn clause representation:

1. Person(Susan).
2. Person(William).
3. Location(Office).
4. Location(Garden).
5. Location(Pantry).
6. CurrentlyAt(Susan, Pantry).
7. CurrentlyAt(William, Pantry).
8. MoveTo(Susan, Garden) ← Person(Susan), Location(Garden), not CurrentlyAt(Susan, Garden).
9. MoveTo(William, Office) ← Person(William), Location(Office), not CurrentlyAt(William, Office).

After generalization, we substitute ground rules 8 and 9 with the following rule:

MoveTo(P, L) ← Person(P), Location(L), not CurrentlyAt(P, L).

The above set of Horn clauses defines the immediate-consequences operator, whose least fixed point (LFP) represents the Herbrand interpretation of our QA framework.
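The least fixed point of the immediate-consequences operator can be sketched for a positive fragment of such a program. The MoveTo rules above use negation as failure and would need stratified evaluation, so this minimal sketch (our own naming) iterates T_P for the positive rule grandparent(X, Y) ← parent(X, Z), parent(Z, Y) until no new facts are derived:

```python
from itertools import product

def lfp(facts):
    """Iterate the immediate-consequences operator T_P to its least
    fixed point for the positive rule
        grandparent(X, Y) <- parent(X, Z), parent(Z, Y)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        parents = {args for pred, args in facts if pred == "parent"}
        for (x, z1), (z2, y) in product(parents, parents):
            # body satisfied: chain parent(X, Z), parent(Z, Y)
            if z1 == z2 and ("grandparent", (x, y)) not in facts:
                facts.add(("grandparent", (x, y)))
                changed = True
    return facts

kb = {("parent", ("Bob", "Mary")), ("parent", ("Mary", "Peter"))}
result = lfp(kb)
# ("grandparent", ("Bob", "Peter")) is now in result
```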
CONCLUSION
Making interpretable logical decisions in question answering systems is an area of active research. In this work, we propose a novel relational logic-based TM framework to approach QA tasks systematically. Our proposed method takes advantage of the noise tolerance shown by TMs to work in uncertain or ambiguous contexts. We reduce the context-question-answer tuples to a set of logical arguments, which is used by the TM to determine rules that mimic real-world actions and consequences.

The resulting TM is relational (as opposed to the previously propositional TM) and can work on logical structures that occur in natural language in order to encode rules representing actions and effects in the form of Horn clauses. We show initial results using the Relational TM on artificial datasets of closed-domain question answering, and those results are extremely promising. The use of first-order representations, as described in this paper, allows KBs to be up to several times smaller, while at the same time raising answering accuracy to 99.48% (cf. Table 7).

Further work on this framework will involve a larger number of relations, with greater inter-dependencies, and analyzing how well the TM can learn the inherent logical structure governing such dependencies. We also intend to introduce recursive Horn clauses to make the computing power of the Relational TM equivalent to that of a universal Turing machine. Moreover, we wish to experiment with this framework on real-world natural language datasets, rather than on toy ones. A prominent example is exploring a large corpus of documents related to human rights violations and using them to assess risks of social instability. We expect that the resultant logic structures will be large and complicated; however, once obtained, they can be used to effectively translate to and fro between the machine world and the real world.

REFERENCES

[1] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J.
Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247-1250.
[2] J. M. Prager, "Open-domain question-answering," Foundations and Trends in Information Retrieval, vol. 1, no. 2, pp. 91-231, 2006.
[3] O.-C. Granmo, "The Tsetlin Machine - A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic," arXiv preprint arXiv:1804.01508, 2018.
[4] M. L. Tsetlin, "On behaviour of finite automata in random medium," Avtomatika i Telemekhanika, vol. 22, no. 10, pp. 1345-1354, 1961.
[5] R. K. Yadav, L. Jiao, O.-C. Granmo, and M. Goodwin, "Human-Level Interpretable Learning for Aspect-Based Sentiment Analysis," in The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21). AAAI, 2021.
[6] R. K. Yadav, L. Jiao, O.-C. Granmo, and M. Goodwin, "Interpretable classification of word sense disambiguation using Tsetlin machine," in . INSTICC, 2021.
[7] B. Bhattarai, L. Jiao, and O.-C. Granmo, "Measuring the Novelty of Natural Language Text Using the Conjunctive Clauses of a Tsetlin Machine Text Classifier," in . INSTICC, 2021.
[8] R. Saha, O.-C. Granmo, and M. Goodwin, "Mining Interpretable Rules for Sentiment and Semantic Relation Analysis using Tsetlin Machines," in Lecture Notes in Computer Science: Proceedings of the 40th International Conference on Innovative Techniques and Applications of Artificial Intelligence (SGAI-2020). Springer, 2020.
[9] G. T. Berge, O.-C. Granmo, T. O. Tveit, M. Goodwin, L. Jiao, and B. V. Matheussen, "Using the Tsetlin Machine to learn human-interpretable rules for high-accuracy text categorization with medical applications," IEEE Access, vol. 7, pp. 115134-115146, 2019.
[10] O.-C. Granmo, S. Glimsdal, L. Jiao, M. Goodwin, C. W. Omlin, and G. T. Berge, "The Convolutional Tsetlin Machine," arXiv preprint arXiv:1905.09688, 2019.
[11] K. D. Abeyrathna, O.-C. Granmo, and M. Goodwin, "Extending the Tsetlin Machine With Integer-Weighted Clauses for Increased Interpretability," IEEE Access, vol. 9, 2021.
[12] K. D. Abeyrathna, O.-C. Granmo, X. Zhang, L. Jiao, and M. Goodwin, "The Regression Tsetlin Machine - A Novel Approach to Interpretable Non-Linear Regression," Philosophical Transactions of the Royal Society A, vol. 378, 2019.
[13] J. Lei, T. Rahman, R. Shafik, A. Wheeldon, A. Yakovlev, O.-C. Granmo, F. Kawsar, and A. Mathur, "Low-Power Audio Keyword Spotting using Tsetlin Machines," arXiv preprint arXiv:2101.11336, 2021.
[14] C. D. Blakely and O.-C. Granmo, "Closed-Form Expressions for Global and Local Interpretation of Tsetlin Machines with Applications to Explaining High-Dimensional Data," arXiv preprint arXiv:2007.13885, 2020.
[15] A. Wheeldon, R. Shafik, A. Yakovlev, J. Edwards, I. Haddadi, and O.-C. Granmo, "Tsetlin Machine: A New Paradigm for Pervasive AI," in SCONA Workshop at Design, Automation and Test in Europe (DATE 2020), 2020.
[16] J. Lei, A. Wheeldon, R. Shafik, A. Yakovlev, and O.-C. Granmo, "From Arithmetic to Logic Based AI: A Comparative Analysis of Neural Networks and Tsetlin Machine," in . IEEE, 2020.
[17] R. Shafik, A. Wheeldon, and A. Yakovlev, "Explainability and Dependability Analysis of Learning Automata based AI Hardware," in IEEE 26th International Symposium on On-Line Testing and Robust System Design (IOLTS). IEEE, 2020.
[18] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," arXiv preprint arXiv:1606.05250, 2016.
[19] M. A. C. Soares and F. S. Parreiras, "A literature review on question answering techniques, paradigms and systems," Journal of King Saud University - Computer and Information Sciences, vol. 32, no. 6, pp. 635-646, 2020.
[20] A. M. Pundge, S. Khillare, and C. N. Mahender, "Question answering system, approaches and techniques: a review," International Journal of Computer Applications, vol. 141, no. 3, pp. 0975-8887, 2016.
[21] S. A. Ludwig, "Comparison of a deductive database with a semantic web reasoning engine," Knowledge-Based Systems, vol. 23, no. 6, pp. 634-642, 2010.
[22] K. Cyras, R. Badrinath, S. K. Mohalik, A. Mujumdar, A. Nikou, A. Previti, V. Sundararajan, and A. V. Feljan, "Machine reasoning explainability," arXiv preprint arXiv:2009.00418, 2020.
[23] C. Green, "Theorem proving by resolution as a basis for question-answering systems," Machine Intelligence, vol. 4, pp. 183-205, 1969.
[24] H. Gallaire, J. Minker, and J.-M. Nicolas, "Logic and databases: A deductive approach," in Readings in Artificial Intelligence and Databases. Elsevier, 1989, pp. 231-247.
[25] M. Jarke, R. Gallersdörfer, M. A. Jeusfeld, M. Staudt, and S. Eherer, "ConceptBase - a deductive object base for meta data management," Journal of Intelligent Information Systems, vol. 4, no. 2, pp. 167-192, 1995.
[26] J. S. Dong, J. Sun, and H. Wang, "Checking and reasoning about semantic web through alloy," in
International Symposium of FormalMethods Europe . Springer, 2003, pp. 796–813.[27] A.-Y. Turhan, “Description logic reasoning for semantic webontologies,” in
Proceedings of the International Conference on WebIntelligence, Mining and Semantics , 2011, pp. 1–5.[28] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Bet-teridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel et al. , “Never-ending learning,”
Communications of the ACM , vol. 61, no. 5, pp.103–115, 2018.[29] A. Cropper, S. Dumanˇci´c, and S. H. Muggleton, “Turning 30:New ideas in inductive logic programming,” arXiv preprintarXiv:2002.11002 , 2020.[30] I. Bratko and S. Muggleton, “Applications of inductive logicprogramming,”
Communications of the ACM , vol. 38, no. 11, pp.65–70, 1995. [31] M. Nickles and A. Mileo, “Probabilistic inductive logic pro-gramming based on answer set programming,” arXiv preprintarXiv:1405.0720 , 2014.[32] L. De Raedt and K. Kersting, “Probabilistic inductive logic pro-gramming,” in
Probabilistic Inductive Logic Programming . Springer,2008, pp. 1–27.[33] J. Lloyd,
Foundations of Logic Programming . New York: Springer-Verlag, 1984.[34] R. Kowalski, “Logic programming,” in
Computational Logic , ser.Handbook of the History of Logic, J. H. Siekmann, Ed. North-Holland, 2014, vol. 9, pp. 523–569.[35] J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merri¨enboer,A. Joulin, and T. Mikolov, “Towards ai-complete questionanswering: A set of prerequisite toy tasks,” arXiv preprintarXiv:1502.05698 , 2015.[36] R. Kowalski, “Algorithm= logic+ control,”
Communications of theACM , vol. 22, no. 7, pp. 424–436, 1979.
Rupsa Saha received her M.Tech degree in information and communication technology, with specialization in machine intelligence, from DA-IICT, India, in 2014. She is currently pursuing her Ph.D. on Tsetlin Machines at the Centre for Artificial Intelligence Research, University of Agder, Norway. Her research interests include machine learning, NLP, and chatbots.
Ole-Christoffer Granmo is a Professor and Founding Director of the Centre for Artificial Intelligence Research (CAIR), University of Agder, Norway. He obtained his master's degree in 1999 and his PhD degree in 2004, both from the University of Oslo, Norway. Dr. Granmo has authored in excess of 140 refereed papers with 6 best paper awards, encompassing learning automata, bandit algorithms, Tsetlin machines, Bayesian reasoning, reinforcement learning, and computational linguistics. He has further coordinated 7+ Norwegian Research Council projects and graduated more than 60 master's and PhD students. Dr. Granmo is also a co-founder of the Norwegian Artificial Intelligence Consortium (NORA). Apart from his academic endeavours, he co-founded the company Anzyz Technologies AS.
Vladimir I. Zadorozhny is a Professor at the University of Pittsburgh School of Computing and Information. He is also a Core Faculty Member of the University of Pittsburgh Biomedical Informatics Training Program, an Adjunct Professor at the Faculty of Engineering and Science, and a member of the Centre for Artificial Intelligence Research, University of Agder, Norway. He received his Ph.D. in 1993 from the Institute for Problems of Informatics, Russian Academy of Sciences, Moscow. Before coming to the USA in 1998, he was a Principal Research Scientist at the Institute of System Programming, Russian Academy of Sciences. His research interests include information integration, data fusion, complex adaptive systems, and scalable architectures for wide-area environments. He is specifically interested in applying scalable data fusion methods to enable efficient data processing and sense-making in complex domains. His research has been supported by the NSF, the EU, and the Norwegian Research Council. Vladimir is a recipient of a Fulbright Scholarship for 2014-2015. He has received several best paper awards and has chaired and served on program committees of multiple database and distributed computing conferences and workshops.