Increasing the Inference and Learning Speed of Tsetlin Machines with Clause Indexing
Saeed Rahimi Gorji, Ole-Christoffer Granmo, Sondre Glimsdal, Jonathan Edwards, Morten Goodwin
Centre for Artificial Intelligence Research, University of Agder, Grimstad, Norway
Temporal Computing, Newcastle, United Kingdom
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract.
The Tsetlin Machine (TM) is a machine learning algorithm founded on the classical Tsetlin Automaton (TA) and game theory. It further leverages frequent pattern mining and resource allocation principles to extract common patterns in the data, rather than relying on minimizing output error, which is prone to overfitting. Unlike the intertwined nature of pattern representation in neural networks, a TM decomposes problems into self-contained patterns, represented as conjunctive clauses. The clause outputs, in turn, are combined into a classification decision through summation and thresholding, akin to a logistic regression function, however, with binary weights and a unit step output function. In this paper, we exploit this hierarchical structure by introducing a novel algorithm that avoids evaluating the clauses exhaustively. Instead we use a simple look-up table that indexes the clauses on the features that falsify them. In this manner, we can quickly evaluate a large number of clauses through falsification, simply by iterating through the features and using the look-up table to eliminate those clauses that are falsified. The look-up table is further structured so that it facilitates constant time updating, thus supporting use also during learning. We report up to 15 times faster classification and three times faster learning on MNIST and Fashion-MNIST image classification, and IMDb sentiment analysis.
Keywords:
Tsetlin Machines · Pattern Recognition · Propositional Logic · Frequent Pattern Mining · Learning Automata · Pattern Indexing
The Tsetlin Machine (TM) is a recent machine learning approach that is based on the Tsetlin Automaton (TA) by M. L. Tsetlin [1], one of the pioneering solutions to the well-known multi-armed bandit problem [2,3] and the first Learning Automaton [4]. It further leverages frequent pattern mining [5] and resource allocation [6] to capture frequent patterns in the data, using a limited number of conjunctive clauses in propositional logic. The clause outputs, in turn, are combined into a classification decision through summation and thresholding, akin to a logistic regression function, however, with binary weights and a unit step output function [7]. Being based on disjunctive normal form (DNF), like Karnaugh maps, the TM can map an exponential number of input feature value combinations to an appropriate output [8].
Recent Progress on TMs.
Recent research reports several distinct TM properties. The clauses that a TM produces have an interpretable form (e.g., if X satisfies condition A and not condition B then Y = 1), similar to the branches in a decision tree [7]. For small-scale pattern recognition problems, up to three orders of magnitude lower energy consumption and inference time have been reported, compared to comparable neural networks [9]. The TM structure captures regression problems, comparing favourably with Regression Trees, Random Forests and Support Vector Regression [10]. Like neural networks, the TM can be used in convolution, providing competitive memory usage, computation speed, and accuracy on MNIST, F-MNIST and K-MNIST, in comparison with simple 4-layer CNNs, K-Nearest Neighbors, SVMs, Random Forests, Gradient Boosting, BinaryConnect, Logistic Circuits, and ResNet [11]. By introducing clause weights that allow one clause to represent multiple, it has been demonstrated that the number of clauses can be reduced significantly without loss of accuracy, leading to more compact clause sets [8]. Finally, hyper-parameter search can be simplified with multi-granular clauses, eliminating the pattern specificity parameter [12].

Paper Contributions and Organization.
The expression power of a TM is governed by the number of clauses employed. However, adding clauses to increase accuracy comes at the cost of linearly increasing computation time. In this paper, we address this problem by exploiting the hierarchical structure of the TM. We first cover the basics of TMs in Sect. 2. Then, in Sect. 3, we introduce a novel indexing scheme that speeds up clause evaluation. The scheme is based on a simple look-up table that indexes the clauses on the features that falsify them. We can then quickly evaluate a large number of clauses through falsification, simply by iterating through the features and using the look-up table to eliminate those clauses that are falsified. The look-up table is further structured so that it facilitates constant time updating, thus supporting indexing also during TM learning. In Sect. 4, we report up to 15 times faster classification and three times faster learning on three different datasets: IMDb sentiment analysis, and MNIST and Fashion-MNIST image classification. We conclude the paper in Sect. 5.
The Tsetlin Machine (TM) [13] is a novel machine learning approach that is particularly suited to low power and explainable applications [9,7]. The structure of a TM is similar to the majority of learning classification algorithms, with forward pass inference and backward pass parameter adjustment (Fig. 1). A key property of the approach is that the whole system is Boolean, requiring an encoder and decoder for any other form of data. Input data is defined as a set of feature vectors: D = {x_1, x_2, . . .}, and the output Y is formulated as a function parameterized by the actions A of a team of TAs: Y = f(x; A).

TA Team.
The function f is parameterized by a team of TAs, each deciding the value of a binary parameter a_k ∈ A (labelling the full collection of parameters A). Each TA has an integer state t_k ∈ {1, . . . , 2N}, with 2N being the size of the state space. The action a_k of a TA is decided by its state: a_k = 1 if t_k > N, else a_k = 0.

Classification Function.
The function f is formulated as an ensemble of n conjunctive clauses, each clause controlling the inclusion of a feature and (separately) its negation. A clause C_j is defined as:

C_j(x) = ⋀_{k=1}^{o} (x_k ∨ ¬a_{j,k}) ∧ (¬x_k ∨ ¬a_{j,o+k}).    (1)

Here, x_k is the k-th feature of input vector x ∈ D, o is the total number of features, and a_{j,k} is the action of the TA that has been assigned to literal k for clause j. A variety of ensembling methods can be used – most commonly a unit step function, with half the clauses aimed at inhibition:

ŷ = 1 if Σ_{j=1}^{n/2} C⁺_j(x) − Σ_{j=1}^{n/2} C⁻_j(x) ≥ 0, else 0.    (2)

When the problem is multi-class (one-hot encoded [15]), thresholding is removed and the maximum vote is used to estimate the class:

ŷ = argmax_i ( Σ_{j=1}^{n/2} C^{i+}_j(x) − Σ_{j=1}^{n/2} C^{i−}_j(x) ),    (3)

with i referring to the class that the clauses belong to.
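As an illustration, the clause and voting functions of Eqs. (1) and (3) can be sketched in a few lines of Python. This is a minimal sketch with hypothetical names (`clause_output`, `classify`, `include_mask`), not the authors' implementation; it assumes a clause with no included literals evaluates to true.

```python
import numpy as np

def clause_output(x, include_mask):
    """Evaluate one conjunctive clause, Eq. (1).

    x: binary feature vector of length o.
    include_mask: binary vector of length 2*o; entry k states whether
    the TA team includes literal k (first o entries: x_k, last o: NOT x_k).
    """
    literals = np.concatenate([x, 1 - x])  # literal values l_k
    # A conjunction is true iff every included literal is true.
    return int(np.all(literals[include_mask == 1] == 1))

def classify(x, clauses_per_class):
    """Multi-class vote, Eq. (3): positive minus negative clause outputs.

    clauses_per_class: one list per class of (polarity, include_mask)
    pairs, with polarity in {+1, -1}.
    """
    scores = [sum(p * clause_output(x, m) for p, m in clauses)
              for clauses in clauses_per_class]
    return int(np.argmax(scores))
```

Note that this evaluates every clause against every literal; the indexing scheme in Sect. 3 avoids exactly this exhaustive scan.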
Learning controls the inclusion of features or their negation. A TA k is trained using rewards and penalties. If t_k > N (action 1), a reward increases the state, t_k += 1, up to 2N, whilst a penalty decreases the state, t_k −= 1. If t_k ≤ N (action 0), a reward decreases the state, t_k −= 1, down to 1, whilst a penalty increases the state, t_k += 1. The learning rules are described in detail in [13], the basic premise being that an individual TA can be controlled by examining its effect on the clause output relative to the desired output, the clause output, the polarity (an inhibitory or excitatory clause), and the action. For example, if (i) a feature is 1, (ii) the TA state governing the inclusion of the feature resolves to action 1, and (iii) the overall clause outputs 1 when a 1 output is desired, then the TA is rewarded, because it is trying to do the right action. The learning approach has the following added randomness:
Fig. 1.
The backwards pass requires no gradient methods (just manipulation of the TA states t_k). Due to the negation of features, clauses can singularly encode non-linearly separable boundaries. There are no sigmoid style functions; all functions are compatible with low level hardware implementations.

– An annealing style cooling parameter T regulates how many clauses are activated for updating. In general, increasing T together with the number of clauses leads to an increase in accuracy.
– A sensitivity s controls the ratio of reward to penalty, based on a split of the total probability, with values 1/s and 1 − 1/s. This parameter governs the granularity of the clauses, with a higher s producing clauses with more features included.

When comparing TMs with a classical neural network [14] (Fig. 1), noteworthy points include: Boolean/integer rather than floating point internal representation; logic rather than linear algebra; and binary inputs/outputs.

In this section we describe the details of our clause indexing scheme, including index based inference and learning.
Overall Clause Indexing Strategy.
Recall that each TM clause is a conjunction of literals. The way that the TM determines whether to include a literal in a clause is by considering the action of the corresponding TA. Thus, to successfully evaluate a clause, the TM must scan through all the actions of the team of TAs responsible for the clause, and perform the necessary logical operations to calculate its truth value. However, being a conjunction of literals, it is actually sufficient to identify a single false literal to determine that the clause itself is false. Conversely, if no false literals are present, we can conclude that the clause is true. Our approach is thus to maintain a list for each literal that contains the clauses that include that literal, and to use these lists to speed up inference and learning.

Fig. 2. Indexing structure.
Index Based Inference.
We first consider the clause lists themselves (Fig. 2, left), before we propose a position matrix that allows constant time updating of the clause lists (Fig. 2, right). For each class-literal pair (i, k), we maintain an inclusion list L_{ik} that contains all the clauses in class i that include the literal l_k. Let the cardinality of L_{ik} be |L_{ik}| = n_{ik}. Evaluating the complete set of class i clauses then proceeds as follows. Given a feature vector x = (x_1, x_2, . . . , x_o), we define the index set L_F for the false literals as L_F = {k | l_k = 0}. Then C^i_F = ⋃_{k∈L_F} L_{ik} is the set of clauses of class i that have been falsified by x. We partition C^i_F into clauses with positive polarity and clauses with negative polarity: C^i_F = C^{i+}_F ∪ C^{i−}_F. The predicted class can then simply be determined based on the cardinality of the sets of falsified clauses:

ŷ = argmax_i ( |C^{i−}_F| − |C^{i+}_F| ).    (4)

Alongside the inclusion lists, an m × n × o matrix M, with entries M_{ijk}, keeps track of the position of each clause j in the inclusion list for literal k (Fig. 2, right). With the help of M, searching for and removing a clause from a list can take place in constant time.
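The falsification-based evaluation can be sketched as follows. This is a minimal Python sketch with hypothetical names; rather than counting falsified clauses as in Eq. (4), it equivalently sums the votes of the surviving clauses (the two agree when each class has equally many positive and negative clauses, as the TM assumes).

```python
def classify_indexed(x, inclusion_lists, votes):
    """Index-based inference sketch.

    x: binary feature vector of length o.
    inclusion_lists[i][k]: ids of class-i clauses that include literal k
    (k < o refers to x_k, k >= o refers to NOT x_k).
    votes[i][j]: polarity (+1 or -1) of clause j in class i.
    """
    o = len(x)
    scores = []
    for i in range(len(inclusion_lists)):
        # L_F: literal k is false when x_k = 0; literal o+k when x_k = 1.
        false_literals = [k if x[k] == 0 else o + k for k in range(o)]
        # Union of the inclusion lists of all false literals.
        falsified = set()
        for k in false_literals:
            falsified.update(inclusion_lists[i][k])
        # Every clause not falsified is true; sum its vote.
        score = sum(v for j, v in enumerate(votes[i]) if j not in falsified)
        scores.append(score)
    return max(range(len(scores)), key=lambda i: scores[i])
```

Only the inclusion lists of the false literals are ever touched, so the work is proportional to how often literals actually appear in clauses, not to the total number of clause-literal pairs.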
Index Construction and Maintenance.
Constructing the inclusion lists and the accompanying position matrix is rather straightforward, since one can initialize all of the TAs to select the exclude action. Then the inclusion lists are empty, and, accordingly, the position matrix is empty as well.

Insertion can be done in constant time. When the TA responsible for, e.g., literal l_k in clause C_{ij} changes its action from exclude to include, the corresponding inclusion list L_{ik} must be updated accordingly. Adding a clause to the list is fast because it can be appended to the end of the list using the size of the list n_{ik}:

n_{ik} ← n_{ik} + 1
L_{ik}[n_{ik}] ← C_{ij}
M_{ijk} ← n_{ik}.

Deletion, however, requires that we know the position of the clause that we intend to delete. This happens when one of the TAs changes from including to excluding its literal, say literal l_k in clause C_{ij}. To avoid searching for C_{ij} in L_{ik}, we use M_{ijk} for direct lookup:

L_{ik}[M_{ijk}] ← L_{ik}[n_{ik}]
n_{ik} ← n_{ik} − 1
M_{i, L_{ik}[M_{ijk}], k} ← M_{ijk}
M_{ijk} ← NA.

As seen, the clause is deleted by overwriting it with the last clause in the list, and then updating M accordingly.
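The constant-time insert and delete operations above can be sketched as follows, with a Python dict standing in for one row of the position matrix M. Hypothetical names; a sketch, not the authors' implementation.

```python
class ClauseIndex:
    """Inclusion list for one (class, literal) pair with O(1) updates.

    `clauses` plays the role of L_ik; `pos` plays the role of the
    corresponding entries of the position matrix M.
    """
    def __init__(self):
        self.clauses = []   # ids of clauses that include this literal
        self.pos = {}       # clause id -> index into self.clauses

    def insert(self, clause_id):
        # Append at the end and record the position.
        self.pos[clause_id] = len(self.clauses)
        self.clauses.append(clause_id)

    def delete(self, clause_id):
        # Overwrite the deleted entry with the last entry, then shrink,
        # updating the moved entry's recorded position.
        i = self.pos.pop(clause_id)
        last = self.clauses.pop()
        if last != clause_id:
            self.clauses[i] = last
            self.pos[last] = i
```

Both operations touch a fixed number of entries, so index maintenance adds only constant overhead per TA action change during learning.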
To get a better sense of the impact of indexing on memory usage, assume that we have a classification problem with m classes and feature vectors of size o, to be solved with a TM with n clauses per class. The memory usage for the whole machine is then about 2 × m × n × o bytes, considering a bitwise implementation and TAs with 8-bit memory. The indexing data structure requires two tables with m × o rows and n columns. Consequently, assuming 2-byte entries, each of these tables requires 2 × m × n × o bytes of memory, which is roughly the same as the memory required by the TM itself. As a result, using the indexing data structure roughly triples memory usage.
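The estimate can be checked with a small helper (a hypothetical name; the constants are the assumptions stated above, one byte per TA and two-byte table entries):

```python
def memory_bytes(m, n, o):
    """Rough memory footprint: m classes, n clauses per class, o features.

    TM itself: 2*o TAs per clause at one byte each -> 2*m*n*o bytes.
    Index: two tables of m*o rows and n columns, 2-byte entries
    -> 2 * (2*m*n*o) bytes in total.
    """
    tm = 2 * m * n * o
    index = 2 * (2 * m * n * o)
    return tm, index

# Example: 10 classes, 2000 clauses per class, 784 features.
tm, index = memory_bytes(m=10, n=2000, o=784)
# tm + index == 3 * tm, i.e. indexing roughly triples memory usage.
```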
Consider the indexing configuration in Fig. 2 and the feature vector x = (1, 0). To evaluate the clauses of class 1, i.e., C₁⁺, C₁⁻, C₂⁺, and C₂⁻ (omitting the class index for simplicity), we initially assume that they all are true. This provides two votes for and two votes against the class, which gives a class score of 0. We then look up the literals ¬x₁ and x₂, which both are false, in the corresponding inclusion lists. We first obtain C₁⁻ and C₂⁻ from ¬x₁, which falsifies C₁⁻ and C₂⁻. This changes the class score from 0 to 2 because two negative votes have been removed from the sum. We then obtain C₁⁻ and C₂⁻ from x₂; however, these have already been falsified, so are ignored. We have now derived a final class score of 2 simply by obtaining a total of four clause index values, rather than fully evaluating four clauses over four possible literals.

Let us now assume that the clause C₁⁺ needs to be deleted from the inclusion list of x₁. The position matrix M is then used to update the inclusion list in constant time. By looking up x₁ and C₁⁺ for class 1 in M, we obtain the position of C₁⁺ in the inclusion list of x₁. Then the last element in the list, C₂⁺, is moved to that position, and the number of elements in the list is decremented. Finally, the entry at row Class 1: x₁ and column C₁⁺ is erased, while the entry at row Class 1: x₁ and column C₂⁺ is updated to reflect the new position of C₂⁺ in the list. Conversely, if C₁⁺ needs to be added to, e.g., the inclusion list of x₂, it is simply appended to the end of the Class 1: x₂ list, and M is updated by writing the new position into row Class 1: x₂ and column C₁⁺.

Remarks.
As seen above, the maintenance of the inclusion lists and the position matrix takes constant time. The performance increase, however, is significant. Instead of going through all the clauses, evaluating each and every one of them, considering all of the literals, the TM only goes through the relevant inclusion lists. Thus, the amount of work becomes proportional to the size of the clauses. For instance, for the MNIST dataset, where the average length of clauses is about 58, the algorithm goes through the inclusion lists of only half of the literals (the falsifying ones). Thus, instead of evaluating 20 000 clauses (the total number of clauses) by considering every literal of each (in the worst case), the algorithm only considers the falsifying literals and the relatively short inclusion lists attached to them, a small fraction of the work performed without indexing. Similarly, for the IMDb dataset, evaluating a clause without indexing requires assessing up to 10 000 literals, while indexing only requires assessing the inclusion lists of the falsifying literals. Thus, in both examples, using indexing reduces computation significantly. We now proceed to investigate how indexing affects inference and learning speed overall.

To measure the effect of indexing we use three well-known datasets, namely MNIST, Fashion-MNIST, and IMDb. We explore the effect of different numbers of clauses and features, to determine the effect indexing has on performance under varying conditions.

Our first set of experiments covers the MNIST dataset. MNIST consists of 28 × 28-pixel grayscale images of handwritten digits. We first binarize the images, converting each image into a 784-bit feature vector. We refer to this version of binarized MNIST as M1. Furthermore, to investigate the effect of increasing the number of features on indexing, we repeat the same experiment using 2, 3 and 4 threshold-based grey tone levels. This leads to feature vectors of size 1568 (= 2 × 784), 2352 (= 3 × 784) and 3136 (= 4 × 784), respectively. We refer to these datasets as M2, M3 and M4.

Table 1 summarizes the resulting speedups. Different rows correspond to different numbers of clauses. The columns refer to different numbers of features, covering training and inference. As seen, increasing the number of clauses increases the effect of indexing. In particular, using more than a few thousand clauses results in about three times faster learning and seven times faster inference.

Table 1.
Indexing speedup for different numbers of clauses and features on MNIST

Features        784          1568          2352          3136
Clauses     Train  Test  Train  Test  Train  Test  Train  Test
1000         1.78  2.75   1.54  2.79   1.76  3.74   1.91  4.15
2000         1.67  2.83   2.63  5.74   2.46  5.95   2.87  6.06
5000         2.62  5.95   3.23  8.02   3.26  6.88   3.34  7.17
10000        2.82  6.31   3.49  8.31   3.47  7.60   3.43  6.78
20000        2.69  5.22   3.58  8.02   3.34  6.44   3.35  6.39
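The threshold-based binarization used to produce M1-M4 can be sketched as follows. This is a sketch with hypothetical names; the evenly spaced thresholds are an assumption, as the paper does not specify the exact threshold values.

```python
import numpy as np

def binarize(images, levels):
    """Convert grayscale images to multi-level binary feature vectors.

    images: uint8 array of shape (batch, 28, 28).
    levels: number of grey tone thresholds; 1 yields 784 features per
    image, 2 yields 1568, and so on.
    """
    flat = images.reshape(len(images), -1)                  # (batch, 784)
    # Evenly spaced thresholds over the grey range (an assumption).
    thresholds = [255 * (t + 1) / (levels + 1) for t in range(levels)]
    # One bit plane per threshold, concatenated feature-wise.
    bits = [(flat >= t).astype(np.uint8) for t in thresholds]
    return np.concatenate(bits, axis=1)                     # (batch, 784*levels)
```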
Fig. 3 and Fig. 4 plot the average time of each epoch across our experiments, as a function of the number of clauses. From the figures, we see that after an initial phase of irregular behavior, both indexed and unindexed versions of the TM exhibit linear growth with roughly the same slopes. Thus, the general effect of using indexing on MNIST is a several-fold speedup, while the computational complexity of the TM algorithm remains unchanged, being linearly related to the number of clauses.

Fig. 3. Average training time on MNIST

Fig. 4. Average inference time on MNIST

In our second set of experiments, we used the IMDb sentiment analysis dataset, consisting of 50,000 highly polar movie reviews, divided into positive and negative reviews. We varied the problem size by using binary feature vectors of size 5000, 10000, 15000 and 20000, which we respectively refer to as I1, I2, I3 and I4.

Table 2 summarizes the resulting speedup for IMDb, employing different numbers of clauses and features. Interestingly, in the training phase, indexing seems to slow down training by about 10%. On the flip side, when it comes to inference, indexing provided a speedup of up to 15 times.

Table 2. Indexing speedup for different numbers of clauses and features on IMDb

Features       5000          10000         15000         20000
Clauses     Train  Test   Train  Test   Train  Test   Train  Test
1000         0.76  1.95    0.84  1.59    0.81  1.78    0.81  2.03
2000         1.00  4.60    0.87  3.17    0.84  4.47    0.83  4.81
5000         0.99  8.58    0.93  8.66    0.85  9.41    0.87  9.49
10000        1.06 15.40    0.92 11.92    0.90 13.17    0.87 13.03
20000        1.05 15.19    0.93 15.87    0.86 13.84    0.88 13.28

Fig. 5 and Fig. 6 compare the average time of each epoch across our experiments, as a function of the number of clauses. Again, we observe some irregularities for smaller clause numbers. However, as the number of clauses increases, the computation time of both the indexed and unindexed versions of the TM grows linearly with the number of clauses.

Fig. 5. Average training time on IMDb

Fig. 6. Average inference time on IMDb

In our third and final set of experiments, we consider the Fashion-MNIST dataset [15]. Here too, each example is a 28 × 28-pixel grayscale image, associated with a label from 10 classes of different kinds of clothing items. Similar to MNIST, we binarized and flattened the data using different numbers of bits for each pixel (ranging from one to four bits by means of thresholding). This process produced four different binary datasets with feature vectors of size 784, 1568, 2352 and 3136. We refer to these datasets as F1, F2, F3 and F4, respectively. The results are summarized in Table 3 for different numbers of clauses and features. Similar to MNIST, in the training phase, we see a slight speedup due to indexing, except when the number of clauses or the number of features is small. In that case, indexing slows down training to some extent (by about 10%). On the other hand, for inference, we see a several-fold speedup (up to five-fold) due to indexing, especially with more clauses.
Table 3.
Indexing speedup for different numbers of clauses and features on Fashion-MNIST

Features        784          1568          2352          3136
Clauses     Train  Test  Train  Test  Train  Test  Train  Test
1000         0.86  0.82   0.97  1.33   1.04  1.68   1.04  1.77
2000         0.92  1.17   1.23  2.55   1.24  2.93   1.25  3.13
5000         1.23  2.21   1.32  2.85   1.33  2.84   1.33  2.95
10000        1.55  3.81   1.58  4.36   1.67  4.92   1.56  3.95
20000        1.55  3.54   1.65  4.54   1.70  4.74   1.67  5.05
Fig. 7 and Fig. 8 compare the average time of each epoch as a function of the number of clauses. Again, we observe approximately linearly increasing computation time with respect to the number of clauses.

Fig. 7. Average training time on Fashion-MNIST
In this work, we introduced indexing-based evaluation of Tsetlin Machine clauses, which demonstrated a promising increase in computation speed. Using simple look-up tables and lists, we were able to leverage the typical sparsity of clauses. We evaluated the indexing scheme on MNIST, IMDb and Fashion-MNIST. For training, on average, the indexed TM produced a three-fold speedup on MNIST and a roughly 50% speedup on Fashion-MNIST, while it slowed down IMDb training. Thus, the effect on training speed seems to be highly dependent on the data: the overhead from maintaining the indexing structure can counteract any speed gains.

During inference, on the other hand, we saw a more consistent increase in performance. We achieved a three- to eight-fold speedup on MNIST and Fashion-MNIST, and a 13-fold speedup on IMDb, which is reasonable considering there is no index maintenance overhead involved in classification.

In our further work, we intend to investigate how clause indexing can speed up Monte Carlo tree search for board games, by exploiting the incremental changes of the board position from parent to child node.
Fig. 8.
Average inference time on Fashion-MNIST
References
1. Michael Lvovitch Tsetlin. On behaviour of finite automata in random medium. Avtomat. i Telemekh, 22(10):1345–1354, 1961.
2. Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
3. J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 41(2):148–177, 1979.
4. K. S. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction. Prentice-Hall, Inc., 1989.
5. Vegard Haugland et al. A two-armed bandit collective for hierarchical examplar based mining of frequent itemsets with applications to intrusion detection. TCCI XIV, 8615:1–19, 2014.
6. O.-C. Granmo et al. Learning Automata-based Solutions to the Nonlinear Fractional Knapsack Problem with Applications to Optimal Resource Allocation. IEEE Trans. on Systems, Man, and Cybernetics, Part B, 37(1):166–175, 2007.
7. Geir Thore Berge, Ole-Christoffer Granmo, Tor Oddbjørn Tveit, Morten Goodwin, Lei Jiao, and Bernt Viggo Matheussen. Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications. IEEE Access, 7:115134–115146, 2019.
8. Adrian Phoulady, Ole-Christoffer Granmo, Saeed Rahimi Gorji, and Hady Ahmady Phoulady. The Weighted Tsetlin Machine: Compressed Representations with Clause Weighting. In Ninth International Workshop on Statistical Relational AI (StarAI 2020), 2020.
9. Adrian Wheeldon, Rishad Shafik, Alex Yakovlev, Jonathan Edwards, Ibrahim Haddadi, and Ole-Christoffer Granmo. Tsetlin Machine: A New Paradigm for Pervasive AI. In Proceedings of the SCONA Workshop at Design, Automation and Test in Europe (DATE), 2020.
10. K. Darshana Abeyrathna, Ole-Christoffer Granmo, Xuan Zhang, Lei Jiao, and Morten Goodwin. The Regression Tsetlin Machine - A Novel Approach to Interpretable Non-Linear Regression. Philosophical Transactions of the Royal Society A, 378, 2019.
11. Ole-Christoffer Granmo, Sondre Glimsdal, Lei Jiao, Morten Goodwin, Christian W. Omlin, and Geir Thore Berge. The Convolutional Tsetlin Machine. arXiv preprint arXiv:1905.09688, 2019.
12. Saeed Rahimi Gorji, Ole-Christoffer Granmo, Adrian Phoulady, and Morten Goodwin. A Tsetlin Machine with Multigranular Clauses. In Lecture Notes in Computer Science: Proceedings of the Thirty-ninth International Conference on Innovative Techniques and Applications of Artificial Intelligence (SGAI-2019), volume 11927. Springer International Publishing, 2019.
13. Ole-Christoffer Granmo. The Tsetlin Machine - A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic. arXiv preprint arXiv:1804.01508, 2018.
14. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
15. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.