Massively Parallel and Asynchronous Tsetlin Machine Architecture Supporting Almost Constant-Time Scaling
K. Darshana Abeyrathna, Bimal Bhattarai, Morten Goodwin, Saeed Gorji, Ole-Christoffer Granmo, Lei Jiao, Rupsa Saha, Rohan K. Yadav
Centre for Artificial Intelligence Research (CAIR), University of Agder, Kristiansand, Norway
Abstract
Using logical clauses to represent patterns, Tsetlin machines (TMs) have recently obtained competitive performance in terms of accuracy, memory footprint, energy, and learning speed on several benchmarks. A team of Tsetlin automata (TAs) composes each clause, thus driving the entire learning process. These are rewarded/penalized according to three local rules that optimize global behaviour. Each clause votes for or against a particular class, with classification resolved using a majority vote. In the parallel and asynchronous architecture that we propose here, every clause runs in its own thread for massive parallelism. For each training example, we keep track of the class votes obtained from the clauses in local voting tallies. The local voting tallies allow us to detach the processing of each clause from the rest of the clauses, supporting decentralized learning. Thus, rather than processing training examples one-by-one as in the original TM, the clauses access the training examples simultaneously, updating themselves and the local voting tallies in parallel. There is no synchronization among the clause threads, apart from atomic adds to the local voting tallies. Operating asynchronously, each team of TAs will most of the time operate on partially calculated or outdated voting tallies. However, across diverse learning tasks (regression, novelty detection, semantic relation analysis, and word sense disambiguation), it turns out that our decentralized TM learning algorithm copes well with working on outdated data, resulting in no significant loss in learning accuracy. Further, we show that the approach provides substantially faster learning. Finally, learning time is almost constant for reasonable clause amounts (employing from tens up to several thousand clauses on a Tesla V100 GPU). For sufficiently large clause numbers, computation time increases approximately proportionally. Our parallel and asynchronous architecture thus allows processing of more massive datasets and operating with more clauses for higher accuracy.

∗ Source code and demos for this paper can be found at https://github.com/cair/PyTsetlinMachineCUDA.
† The authors are ordered alphabetically by last name.

Introduction

TMs (Granmo 2018) have recently obtained competitive results in terms of accuracy, memory footprint, energy, and learning speed on diverse benchmarks, covering image classification, regression and natural language understanding (Berge et al. 2019). A TM represents patterns as conjunctive clauses in propositional logic (e.g., if X satisfies condition A and not condition B then y = 1). The clause outputs, in turn, are combined into a classification decision through summation and thresholding, akin to a logistic regression function, however, with binary weights and a unit step output function. Being based on disjunctive normal form, like Karnaugh maps (Karnaugh 1953), a TM can map an exponential number of input feature value combinations to an appropriate output (Granmo 2018).

Recent progress on TMs.
Recent research reports several distinct TM properties. The TM can be used in convolution, providing competitive performance on MNIST, Fashion-MNIST, and Kuzushiji-MNIST, in comparison with CNNs, K-Nearest Neighbor, Support Vector Machines, Random Forests, Gradient Boosting, BinaryConnect, Logistic Circuits and ResNet (Granmo et al. 2019). The TM has also achieved promising results in text classification, using the conjunctive clauses to capture textual patterns (Berge et al. 2019). Recently, regression TMs compared favorably with Regression Trees, Random Forest Regression, and Support Vector Regression (Abeyrathna et al. 2020). The above TM approaches have further been enhanced by various techniques. By introducing real-valued clause weights, it turns out that the number of clauses can be reduced substantially without loss of accuracy (Phoulady et al. 2020). Also, the logical inference structure of TMs makes it possible to index the clauses on the features that falsify them, increasing inference and learning speed by up to an order of magnitude (Gorji et al. 2020). Multi-granular clauses simplify the hyper-parameter search by eliminating the pattern specificity parameter (Gorji et al. 2019). In (Abeyrathna, Granmo, and Goodwin 2020), stochastic searching on the line automata (Oommen 1997) learn integer clause weights, performing on-par or better than Random Forest, Gradient Boosting and Explainable Boosting Machines. Closed-form formulas for both local and global TM interpretation, akin to SHAP, were proposed in (Blakely and Granmo 2020). From a hardware perspective, energy usage can be traded off against accuracy by making inference deterministic (Abeyrathna et al. 2020). Additionally, (Shafik, Wheeldon, and Yakovlev 2020) show that TMs can be fault-tolerant, completely masking stuck-at faults. Recent theoretical work proves convergence to the correct operator for "identity" and "not". It is further shown that arbitrarily rare patterns can be recognized, using a quasi-stationary Markov chain-based analysis. The work finally proves that when two patterns are incompatible, the most accurate pattern is selected (Zhang et al. 2020).

Paper Contributions.
In all of the above TM schemes, the clauses are learnt using TA-teams (Tsetlin 1961) that interact to build and integrate conjunctive clauses for decision-making. While producing accurate learning, this interaction creates a bottleneck that hinders parallelization. That is, the clauses must be evaluated and compared before feedback can be provided to the TAs. In this paper, we first cover the basics of TMs in Section 2. Then, in Section 3, we propose a novel parallel and asynchronous architecture where every clause runs in its own thread for massive parallelism. We eliminate the above interaction bottleneck by introducing local voting tallies that keep track of the clause outputs, per training example. The local voting tallies detach the processing of each clause from the rest of the clauses, supporting decentralized learning. Thus, rather than processing training examples one-by-one as in the original TM, the clauses access the training examples simultaneously, updating themselves and the local voting tallies in parallel. In Section 4, we investigate the properties of the new architecture empirically on regression, novelty detection, semantic relation analysis and word sense disambiguation. We show that our decentralized TM architecture copes well with working on outdated data, with no measurable loss in learning accuracy. We further investigate how processing time scales with the number of clauses, uncovering almost constant-time processing over reasonable clause amounts. Finally, in Section 5, we conclude with pointers to further work, including architectures for grid-computing and heterogeneous systems spanning the cloud and the edge.
Classification
A TM takes a vector X = (x_1, ..., x_o) of Boolean features as input, to be classified into one of two classes, y = 0 or y = 1. Together with their negated counterparts, x̄_k = ¬x_k = 1 − x_k, the features form a literal set L = {x_1, ..., x_o, x̄_1, ..., x̄_o}. A TM pattern is formulated as a conjunctive clause C_j, formed by ANDing a subset L_j ⊆ L of the literal set:

C_j(X) = \bigwedge_{l_k \in L_j} l_k = \prod_{l_k \in L_j} l_k.   (1)

E.g., the clause C_j(X) = x_1 ∧ x_2 = x_1 x_2 consists of the literals L_j = {x_1, x_2} and outputs 1 iff x_1 = x_2 = 1.

The number of clauses employed is a user-set parameter n. Half of the clauses are assigned positive polarity. The other half is assigned negative polarity. The clause outputs are combined into a classification decision through summation and thresholding using the unit step function u(v) = 1 if v ≥ 0 else 0:

\hat{y} = u\left(\sum_{j=1}^{n/2} C_j^{+}(X) - \sum_{j=1}^{n/2} C_j^{-}(X)\right).   (2)

Namely, classification is performed based on a majority vote, with the positive clauses voting for y = 1 and the negative for y = 0. The classifier ŷ = u(x_1 x̄_2 + x̄_1 x_2 − x_1 x_2 − x̄_1 x̄_2), e.g., captures the XOR-relation (illustrated in Figure 1).
Figure 1: The Tsetlin machine architecture
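To make Eqns. (1) and (2) concrete, here is a minimal Python sketch (illustrative only, not the authors' CUDA implementation) that evaluates clauses given as lists of included literal indices and resolves the majority vote for the XOR classifier above.

```python
# Minimal sketch of TM inference (Eqns. 1-2); illustrative only.
def evaluate_clause(include, literals):
    """A clause is the AND of its included literals (Eqn. 1)."""
    return int(all(literals[k] for k in include))

def classify(x, positive_clauses, negative_clauses):
    """Majority vote of positive versus negative clauses, thresholded by a unit step (Eqn. 2)."""
    literals = list(x) + [1 - xk for xk in x]  # [x_1, ..., x_o, not x_1, ..., not x_o]
    v = sum(evaluate_clause(c, literals) for c in positive_clauses) \
        - sum(evaluate_clause(c, literals) for c in negative_clauses)
    return 1 if v >= 0 else 0

# XOR example from the text. Literal indices: 0 -> x_1, 1 -> x_2, 2 -> not x_1, 3 -> not x_2.
positive_clauses = [[0, 3], [2, 1]]  # x_1 AND not x_2, not x_1 AND x_2
negative_clauses = [[0, 1], [2, 3]]  # x_1 AND x_2, not x_1 AND not x_2
print([classify(x, positive_clauses, negative_clauses)
       for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # -> [0, 1, 1, 0]
```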
Learning
A clause C_j(X) is composed by a team of TAs (Tsetlin 1961), each TA deciding to Include or Exclude a specific literal l_k in the clause. Learning which literals to include is based on reinforcement: Type I feedback produces frequent patterns, while Type II feedback increases the discrimination power of the patterns. TMs learn on-line, processing one training example (X, y) at a time.

Type I feedback is given stochastically to clauses with positive polarity when y = 1 and to clauses with negative polarity when y = 0. Each clause, in turn, reinforces its TAs based on: (1) its output C_j(X); (2) the action of the TA – Include or Exclude; and (3) the value of the literal l_k assigned to the TA. Two rules govern Type I feedback:

• Include is rewarded and Exclude is penalized with probability (s−1)/s if C_j(X) = 1 and l_k = 1. This reinforcement is strong (triggered with high probability) and makes the clause remember and refine the pattern it recognizes in X.

• Include is penalized and Exclude is rewarded with probability 1/s if C_j(X) = 0 or l_k = 0. This reinforcement is weak (triggered with low probability) and coarsens infrequent patterns, making them frequent.

Note that the probability (s−1)/s is replaced by 1 when boosting true positives. Above, the parameter s controls pattern frequency.

Type II feedback is given stochastically to clauses with positive polarity when y = 0 and to clauses with negative polarity when y = 1. It penalizes Exclude with probability 1 if C_j(X) = 1 and l_k = 0. Thus, this feedback produces literals for discriminating between y = 0 and y = 1.

Resource allocation dynamics ensure that clauses distribute themselves across the frequent patterns, rather than missing some and over-concentrating on others. That is, for any input X, the probability of reinforcing a clause gradually drops to zero as the clause output sum

v = \sum_{j=1}^{n/2} C_j^{+}(X) - \sum_{j=1}^{n/2} C_j^{-}(X)   (3)

approaches a user-set target T for y = 1 (and −T for y = 0). If a clause is not reinforced, it does not give feedback to its TAs, and these are thus left unchanged. In the extreme, when the voting sum v equals or exceeds the target T (the TM has successfully recognized the input X), no clauses are reinforced. They are then free to learn new patterns, naturally balancing the pattern representation resources (Granmo 2018).

Parallel and Asynchronous Tsetlin Machine Architecture

Even though CPUs have traditionally been geared to handle high workloads, they are better suited for sequential processing, and their performance is still dependent on the limited number of cores available. In contrast, since GPUs are primarily designed for graphical applications employing many small processing elements, they offer a large degree of parallelism (Owens et al. 2007). As a result, a growing body of research has focused on performing general purpose GPU computation, or GPGPU. For efficient use of GPU power, it is critical for the algorithm to expose a large amount of fine-grained parallelism (Jiang and Snir 2005; Satish, Harris, and Garland 2009). The inherent discreteness of the TM architecture allows us to effectively use the parallel processing power of GPUs to offer a huge speedup over existing implementations. In this section, we introduce our decentralized inference scheme and the accompanying architecture that makes it possible to have parallel asynchronous learning and classification.
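Before detailing the architecture, the Type I and Type II feedback rules above can be summarized in a minimal sketch (illustrative only; the actual TA state transitions are omitted):

```python
import random

def type_i_feedback(clause_output, literal_value, s, boost_true_positive=False):
    """Type I feedback for a single TA (given to clauses whose polarity matches the class)."""
    if clause_output == 1 and literal_value == 1:
        p = 1.0 if boost_true_positive else (s - 1.0) / s
        if random.random() <= p:                 # strong reinforcement
            return "reward Include / penalize Exclude"
    elif random.random() <= 1.0 / s:             # weak reinforcement (clause 0 or literal 0)
        return "reward Exclude / penalize Include"
    return "no update"

def type_ii_feedback(clause_output, literal_value):
    """Type II feedback for a single TA (given to clauses whose polarity opposes the class)."""
    if clause_output == 1 and literal_value == 0:
        return "penalize Exclude"                # triggered with probability 1
    return "no update"
```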
Voting Tally
A voting tally that tracks the aggregated output of the clauses for each training example is central to our scheme. In a standard TM, each training example (X_i, y_i), 1 ≤ i ≤ N, is processed by first evaluating the clauses on X_i and then obtaining the majority vote v from Eqn. 3 (N is the total number of examples). The majority vote v is then compared with the summation target T when y = 1 and −T when y = 0, to produce the feedback to the TAs of each clause.

As illustrated in Figure 2, to decouple the clauses, we now assume that the particular majority vote of example X_i has been pre-calculated, meaning that each training example becomes a triple (X_i, y_i, v_i), where v_i is the pre-calculated majority vote. With v_i in place, the calculation performed in Eqn. 3 can be skipped, and we can go directly to giving Type I or Type II feedback to any clause C_j, without considering the other clauses. This opens up for decentralized learning of the clauses. However, any time the composition of a clause changes after receiving feedback, all voting aggregates v_i, 1 ≤ i ≤ N, become outdated. This requires that the standard learning scheme for updating the clauses be replaced.

Figure 2: Parallel Tsetlin machine architecture

Decentralized Clause Learning

Algorithm 1 Decentralized updating of clause
Input: Example pool P, clause C_j, positive polarity indicator p_j ∈ {0, 1}, batch size b ∈ [1, ∞), voting target T ∈ [1, ∞), pattern specificity s ∈ [1.0, ∞)
 1: procedure UpdateClause(C_j, p_j, P, b, T, s)
 2:   for i ← 1, ..., b do
 3:     (X_i, y_i, v_i) ← ObtainTrainingExample(P)
 4:     v_i^c ← clip(v_i, −T, T)
 5:     e = T − v_i^c if y_i = 1 else T + v_i^c
 6:     if rand() ≤ e/(2T) then
 7:       if y_i xor p_j then
 8:         TypeIIFeedback(X_i, C_j)
 9:       else
10:         TypeIFeedback(X_i, C_j, s)
11:       end if
12:       o_ij ← C_j(X_i)
13:       o*_ij ← ObtainPreviousClauseOutput(i, j)
14:       if o_ij ≠ o*_ij then
15:         AtomicAdd(v_i, o_ij − o*_ij)
16:         StorePreviousClauseOutput(i, j, o_ij)
17:       end if
18:     end if
19:   end for
20: end procedure
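To complement the pseudocode, here is a minimal single-threaded Python sketch of one pass of Algorithm 1 (illustrative only: the clause object, example pool, and feedback routines are hypothetical stand-ins, and the update probability e/(2T) follows the standard TM update scheme). In the CUDA implementation, one such loop runs in every clause thread, and the tally update of Line 15 is an atomic add.

```python
import random

def update_clause(clause, polarity, pool, b, T, s):
    """One pass of Algorithm 1 for a single clause; each clause runs this in its own thread."""
    for _ in range(b):
        i, X, y, v = pool.next_example()          # (X_i, y_i, v_i) with pre-recorded vote sum v_i
        v_clipped = max(-T, min(T, v))            # clip(v_i, -T, T)
        e = T - v_clipped if y == 1 else T + v_clipped
        if random.random() <= e / (2.0 * T):      # update probability grows with the vote error
            if (y == 1) != (polarity == 1):       # y_i xor p_j
                clause.type_ii_feedback(X)        # suppress clauses voting for the wrong class
            else:
                clause.type_i_feedback(X, s)      # refine/coarsen the clause's pattern
            o = clause.evaluate(X)                # fresh clause output o_ij
            o_prev = clause.previous_output[i]    # o*_ij, stored locally with the clause
            if o != o_prev:
                pool.atomic_add(i, o - o_prev)    # the only synchronization point (Line 15)
                clause.previous_output[i] = o
```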
Our decentralized learning scheme is captured by Algorithm 1. As shown, each clause is trained independently of the other clauses. That is, each clause proceeds with training without taking other clauses into account. Accordingly, Algorithm 1 supports massive parallelization, because each clause can run in its own thread by employing the algorithm.

Notice further how the clause in focus first obtains a reference to the next training example (X_i, y_i, v_i) to process, including the pre-recorded voting sum v_i (Line 3). This example is retrieved from an example pool P, which is the storage of the training examples (centralized or decentralized).

The error of the pre-recorded voting sum v_i is then calculated based on the voting target T (Line 5). The error, in turn, decides the probability of updating the clause, which is updated according to standard Type I and Type II TM feedback, governed by the polarity p_j of the clause and the specificity hyper-parameter s (Lines 7-11).

The moment clause C_j is updated, all recorded voting sums in the example pool P are potentially outdated. This is because C_j now captures a different pattern. Thus, to keep all of the voting sums v_i in P consistent with C_j, C_j should ideally have been re-evaluated on all of the examples in P. To partially remedy outdated voting aggregates, the clause only updates the current voting sum v_i. This happens when the calculated clause output o_ij is different from the previously calculated clause output o*_ij (Lines 12-17). Note that the previously recorded output o*_ij is a single bit that is stored locally together with the clause. In this manner, the algorithm provides eventual consistency. That is, if the clauses stop changing, all the voting sums eventually become correct.

Employing the above algorithm, the clauses access the training examples simultaneously, updating themselves and the local voting tallies in parallel. There is no synchronization among the clause threads, apart from atomic adds to the local voting tallies (Line 15). Accordingly, with this minimalistic synchronization, each team of TAs will usually operate on partially calculated or outdated voting tallies.

Empirical Results

In this section, we investigate how our new approach to TM learning scales, including effects on training time and accuracy. We employ seven different datasets that represent diverse learning tasks, including regression, novelty detection, sentiment analysis, semantic relation analysis and word sense disambiguation. The datasets vary widely in number of examples, classes, and features. We have strived to recreate TM experiments reported by various researchers, including their hyper-parameter settings. For comparison of performance, we contrast with fast single-core TM implementations both with and without clause indexing (Gorji et al. 2020). Our proposed architecture is implemented in CUDA and runs on a Tesla V100 GPU (with fixed grid and block sizes). The standard implementations run on an Intel Xeon Platinum 8168 CPU at 2.7 GHz. Obtained performance metrics are summarized in Table 1. For greater reproducibility, each experiment is repeated five times and the average accuracy and standard deviation are reported. We also report how much faster the CUDA TM executes compared to the indexed version.
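For reference, experiments of this kind can be run along the following lines using the companion PyTsetlinMachineCUDA package; this is a hypothetical usage sketch, and the exact module, class, and argument names are assumptions that may differ from the actual repository.

```python
# Hypothetical usage sketch; module/class/argument names are assumptions based on the
# companion PyTsetlinMachineCUDA repository and may differ.
import numpy as np
from PyTsetlinMachineCUDA.tm import MultiClassTsetlinMachine

X_train = np.random.randint(2, size=(1000, 500), dtype=np.uint32)  # Booleanized features
Y_train = np.random.randint(2, size=1000, dtype=np.uint32)         # class labels

tm = MultiClassTsetlinMachine(2000, 80, 10.0)  # number of clauses n, voting target T, specificity s
for epoch in range(20):
    tm.fit(X_train, Y_train, epochs=1, incremental=True)  # clauses update in parallel on the GPU
accuracy = (tm.predict(X_train) == Y_train).mean()
print(accuracy)
```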
Regression
We first investigate performance with regression Tsetlin machines (RTMs) using two datasets, based on (Abeyrathna et al. 2020): Annual Return and Energy Performance (both retrieved from https://github.com/cair/pyTsetlinMachine).
Figure 3: MAE vs. number of clauses on Annual Return

With the Energy Performance dataset, we try to estimate the heating load of residential buildings from the input features of the available samples. For both datasets, a fixed share of the samples is used to train the model and the rest to evaluate it.

We first study the impact of the number of clauses on prediction error, measured by Mean Absolute Error (MAE). As illustrated in Figure 3 for Annual Return, increasing the number of clauses reduces MAE. Figure 4 compares the corresponding execution times: the indexed and non-indexed implementations are faster only for the smallest numbers of clauses. After that, the CUDA implementation is superior, with no significant increase in execution time as the number of clauses increases. This can be explained by the large number of threads available to the GPU and the asynchronous operation of our architecture.

Looking at how MAE and execution time vary over the training epochs for Annual Return (Figure 5 and Figure 6, respectively), we observe that MAE falls systematically across the epochs, while execution time remains stable (employing T = 1280, s = 3.0, n = 1280). Notice the higher CUDA execution time in the first epoch, which encompasses copying the dataset to GPU memory. Execution on Energy Performance exhibits similar behavior. The mean MAEs of each method are similar (Table 2) across 5 independent runs, indicating no significant difference in learning accuracy.

The execution time of the CUDA implementation can be further controlled by modifying the number of threads, e.g., by changing the block size. Figure 7 shows the variation of execution time with increasing block size for Annual Return: increasing the block size reduces execution time for 5,000 clauses, while having limited effect for 1,280 clauses. That is, more clauses are needed to leverage the increase in number of threads.

Table 1: Performance on multiple datasets. Mean and standard deviation are calculated over 5 independent runs. Speed up is how many times faster the average execution time of the CUDA implementation is compared to the indexed implementation.

Dataset      | TM indexed (Acc / F1) | TM non-indexed (Acc / F1) | TM CUDA (Acc / F1) | Speed up
BBC Sports   | 85.08 ±               |                           |                    |
20 Newsgroup | 79.37 ±               |                           |                    |
SEMEVAL      | 91.9 ±                |                           |                    |
IMDb         | 88.42 ±               |                           |                    |
JAVA (WSD)   | 97.03 ±               |                           |                    |
Apple (WSD)  | 92.65 ±               |                           |                    |
Table 2: MAE with confidence interval, and speed up, on two regression datasets, calculated over 5 independent runs.

Dataset            | TM indexed | TM non-indexed | TM CUDA | Speed up
Annual Return      | 0.14 ±     |                |         |
Energy Performance | 4.62 ±     |                |         |
Figure 4: Execution time vs. number of clauses (non-indexed, indexed, and CUDA implementations)
Novelty Detection
Novelty detection is another important machine learning task. Most supervised classification approaches assume a closed world, counting on all classes being present in the data at training time. This assumption can lead to unpredictable behaviour during operation, whenever novel, previously unseen classes appear. We here investigate TM-based novelty detection using two datasets: 20 Newsgroup and BBC Sports. In brief, we use the class voting sums (Section 2) as features measuring novelty. We then employ a multilayer perceptron (MLP) for novelty detection that uses the class voting sums as input.

The BBC Sports dataset contains documents from the BBC sport website, organized in five sports article categories and collected from 2004 to 2005. Overall, the dataset encompasses a vocabulary of several thousand terms. For novelty classification, we designate the classes "Cricket" and "Football" as known and "Rugby" as novel. We train on the known classes, keeping the number of epochs, clauses, threshold T, and sensitivity s fixed. The training times for both the indexed and non-indexed TMs are high compared to that of the CUDA TM, which is many times faster.

The 20 Newsgroup dataset contains documents organized into 20 classes. The classes "comp.graphics" and "talk.politics.guns" are designated as known and "rec.sport.baseball" is considered novel. We train the TM with corresponding hyper-parameter settings (target T, number of clauses, and sensitivity s). The CUDA TM implementation is here again many times faster than the other versions.

To assess scalability, we record the execution time of both the indexed and the CUDA TM while increasing the number of clauses (Figure 8). For the indexed TM, the execution time increases almost proportionally with the number of clauses, but there is no such noticeable effect for the CUDA TM.

The novelty scores generated by the TM are passed into the MLP, with two hidden layers, ReLU activations, and stochastic gradient descent. As seen in Table 1, for both datasets, the non-indexed TM slightly outperforms the other versions of the TM, while the indexed and CUDA TMs have similar accuracy. These differences can be explained by the random variation of TM learning.

Figure 5: MAE over epochs on Annual Return

Figure 6: Execution time over epochs on Annual Return
Figure 7: CUDA execution time vs. block size on Annual Return with differing numbers of clauses
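As a sketch of the novelty-detection pipeline above (per-class TM vote sums used as input features to an MLP), assuming scikit-learn; the vote-sum features are replaced by random stand-ins so the sketch runs on its own, and the hidden-layer sizes are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in for the TM class vote sums (Eqn. 3): in the real pipeline these come from TMs
# trained on the known classes; random values are used here only to keep the sketch runnable.
rng = np.random.default_rng(0)
X_scores = rng.normal(size=(500, 2))        # one vote-sum feature per known class
y_novel = rng.integers(0, 2, size=500)      # dummy labels: 1 = novel, 0 = known

# MLP on the class vote sums with ReLU activation and stochastic gradient descent,
# as described above; the two hidden-layer sizes are illustrative choices.
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu", solver="sgd",
                    max_iter=500, random_state=0)
mlp.fit(X_scores, y_novel)
print(mlp.predict(X_scores[:5]))
```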
Sentiment and Semantic Analysis
We use the SemEval 2010 Semantic Relations (Hendrickx et al. 2009) and the ACL Internet Movie Database (IMDb) (Maas et al. 2011) datasets to explore the performance of the TM implementations when applied to data with a large number of sparse features.

The SEMEVAL dataset focuses on identifying semantic relations in text. The dataset has roughly ten thousand examples, and we consider each to be annotated as either containing the relation Cause-Effect or not containing it. The presence of an unambiguous causal connective is indicative of a sentence being a causal sentence (Xuelan and Kennedy 1992). For each TM, we use a fixed number of clauses per class to identify this characteristic of causal texts. The IMDb dataset contains 50,000 highly polar movie reviews, which are either positive or negative. Due to the large variety and combination of possibly distinguishing features, we assign several thousand clauses to each class. For both datasets we use unigrams and bigrams as features.

As noted in Table 1, the accuracy obtained by the CPU (non-indexed) and the CUDA implementations is comparable on the SEMEVAL dataset, while the indexed TM performs slightly poorer. However, the execution time is much lower for the CUDA version than for the other two (Figure 9). This is further shown in Figure 10: the CPU-based TM with indexing takes an increasing amount of time to execute as the number of clauses grows, but no such effect is seen with the CUDA TM. That is, increasing the number of clauses merely has a marginal effect on the CUDA execution time.

Figure 9: Execution time over epochs on SEMEVAL

Figure 10: Execution time vs. number of clauses on SEMEVAL

With the IMDb dataset, the CUDA version performs better in terms of accuracy, with less variance compared to the CPU versions (Table 1). It exhibits similar behaviour as on the SEMEVAL dataset with respect to execution time over an increasing number of epochs. Beyond a sufficiently large number of clauses, however, we observe proportionally increasing execution time, e.g., execution time doubles when the clause count is doubled (Figure 11). This can potentially be explained by the Tesla V100 GPU having 5,120 cores.

We also show how the change in CUDA block size affects the execution time, given a particular number of clauses, in Figure 12. With a small number of clauses, there is no benefit to using a larger block size. When a large number of clauses is used, a larger block size effectively parallelizes the work of the TM, reducing the execution time.

Word Sense Disambiguation
Word Sense Disambiguation (WSD) is a vital task in NLP (Navigli 2009) that consists of distinguishing the meaning of homonyms – identically spelled words whose sense depends on the surrounding context words. We here perform a quantitative evaluation of the three TM implementations using a recent WSD evaluation framework (Loureiro et al. 2020) based on WordNet. We use a balanced dataset for coarse-grained classification, focusing on two specific domains. The first dataset concerns the meaning of the word "Apple", which here has two senses: "apple inc." (company) and "apple apple" (fruit). The other dataset covers the word "JAVA".

Figure 11: Execution time vs. number of clauses on IMDb

Each dataset is split into training and testing samples. As preprocessing, we filter the stopwords and stem the words using the Porter Stemmer to reduce the effect of spelling mistakes or non-important variations of the same word. To build a vocabulary (the feature space), we select the most frequent terms. The same number of clauses, threshold, and specificity are used for both datasets.

The accuracy and F1 score of the non-indexed and indexed TMs are quite similar for the Apple dataset (Table 1). However, the CUDA TM outperforms both of them by a significant margin. In the case of the JAVA dataset, the performance is comparable for all three, with the CUDA TM being slightly better. Again, we observe no significant increase in execution time with respect to an increasing number of clauses for the CUDA TM. The indexed TM, on the other hand, experiences a substantial increase in computation time (Figure 13).

Conclusions

In this paper, we proposed a new approach to TM learning, to open up for massively parallel processing. Rather than processing training examples one-by-one as in the original TM, the clauses access the training examples simultaneously, updating themselves and local voting tallies in parallel. The local voting tallies allow us to detach the processing of each clause from the rest of the clauses, supporting decentralized learning. There is no synchronization among the clause threads, apart from atomic adds to the local voting tallies. Operating asynchronously, each team of TAs most of the time operates on partially calculated or outdated voting tallies.

The main conclusions of the paper can be summarized as follows:

• Our decentralized TM architecture copes remarkably well with working on outdated data, resulting in no significant loss in learning accuracy across diverse learning tasks (regression, novelty detection, semantic relation analysis, and word sense disambiguation).

• Learning time is almost constant for reasonable clause amounts (employing from tens up to several thousand clauses on a Tesla V100 GPU).

• For sufficiently large clause numbers, computation time increases approximately proportionally.

Our parallel and asynchronous architecture thus allows processing of more massive datasets and operating with more clauses for higher accuracy, significantly increasing the impact of logic-based machine learning.

From the above results, our main conclusion is that TM learning is very robust towards relatively severe distortions of communication and coordination among the clauses. Our results are thus compatible with the findings in (Shafik, Wheeldon, and Yakovlev 2020), where it is shown that TM learning is inherently fault tolerant, completely masking stuck-at faults.
In our future work, we will investigate the robustness of TM learning further, which includes developing mechanisms for heterogeneous architectures and more loosely coupled systems, such as grid-computing.
References
Abeyrathna, K. D.; Granmo, O.-C.; and Goodwin, M. 2020. Extending the Tsetlin Machine With Integer-Weighted Clauses for Increased Interpretability. arXiv preprint arXiv:2005.05131. URL https://arxiv.org/abs/2005.05131.

Abeyrathna, K. D.; Granmo, O.-C.; Shafik, R.; Yakovlev, A.; Wheeldon, A.; Lei, J.; and Goodwin, M. 2020. A Novel Multi-Step Finite-State Automaton for Arbitrarily Deterministic Tsetlin Machine Learning. In Lecture Notes in Computer Science: Proceedings of the 40th International Conference on Innovative Techniques and Applications of Artificial Intelligence (SGAI-2020). Springer International Publishing.

Abeyrathna, K. D.; Granmo, O.-C.; Zhang, X.; Jiao, L.; and Goodwin, M. 2020. The Regression Tsetlin Machine - A Novel Approach to Interpretable Non-Linear Regression. Philosophical Transactions of the Royal Society A.

Berge, G. T.; Granmo, O.-C.; Tveit, T. O.; Goodwin, M.; Jiao, L.; and Matheussen, B. V. 2019. Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications. IEEE Access 7: 115134–115146. ISSN 2169-3536. doi:10.1109/ACCESS.2019.2935416.

Blakely, C. D.; and Granmo, O.-C. 2020. Closed-Form Expressions for Global and Local Interpretation of Tsetlin Machines with Applications to Explaining High-Dimensional Data. arXiv preprint arXiv:2007.13885.
URL https://arxiv.org/abs/2007.13885.

Gorji, S.; Granmo, O. C.; Glimsdal, S.; Edwards, J.; and Goodwin, M. 2020. Increasing the Inference and Learning Speed of Tsetlin Machines with Clause Indexing. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer.

Gorji, S. R.; Granmo, O.-C.; Phoulady, A.; and Goodwin, M. 2019. A Tsetlin Machine with Multigranular Clauses. In Lecture Notes in Computer Science: Proceedings of the Thirty-ninth International Conference on Innovative Techniques and Applications of Artificial Intelligence (SGAI-2019), volume 11927. Springer International Publishing.

Granmo, O.-C. 2018. The Tsetlin Machine - A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic. arXiv preprint arXiv:1804.01508. URL https://arxiv.org/abs/1804.01508.

Granmo, O.-C.; Glimsdal, S.; Jiao, L.; Goodwin, M.; Omlin, C. W.; and Berge, G. T. 2019. The Convolutional Tsetlin Machine. arXiv preprint arXiv:1905.09688. URL https://arxiv.org/abs/1905.09688.

Hendrickx, I.; Kim, S. N.; Kozareva, Z.; Nakov, P.; Ó Séaghdha, D.; Padó, S.; Pennacchiotti, M.; Romano, L.; and Szpakowicz, S. 2009. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, 94–99. Association for Computational Linguistics.

Jiang, C.; and Snir, M. 2005. Automatic tuning matrix multiplication performance on graphics hardware. In 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05), 185–194. IEEE.

Karnaugh, M. 1953. The map method for synthesis of combinational logic circuits.
Transactions of the American Institute of Electrical Engineers, Part I: Communication and Electronics.

Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 142–150.

Navigli, R. 2009. Word sense disambiguation: A survey. ACM Comput. Surv. 41: 10:1–10:69.

Oommen, B. J. 1997. Stochastic searching on the line and its applications to parameter learning in nonlinear optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).
Owens, J. D.; Luebke, D.; Govindaraju, N.; Harris, M.; Krüger, J.; Lefohn, A. E.; and Purcell, T. J. 2007. A survey of general-purpose computation on graphics hardware. In Computer graphics forum, volume 26, 80–113. Wiley Online Library.

Phoulady, A.; Granmo, O.-C.; Gorji, S. R.; and Phoulady, H. A. 2020. The Weighted Tsetlin Machine: Compressed Representations with Clause Weighting. In Proceedings of the Ninth International Workshop on Statistical Relational AI (StarAI 2020).

Satish, N.; Harris, M.; and Garland, M. 2009. Designing efficient sorting algorithms for manycore GPUs. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 1–10. IEEE.

Shafik, R.; Wheeldon, A.; and Yakovlev, A. 2020. Explainability and Dependability Analysis of Learning Automata based AI Hardware. In IEEE 26th International Symposium on On-Line Testing and Robust System Design (IOLTS). IEEE.

Tsetlin, M. L. 1961. On behaviour of finite automata in random medium.
Avtomat. i Telemekh
Xuelan, F.; and Kennedy, G. 1992. Expressing causation in written English. RELC Journal.

Zhang, X.; Jiao, L.; Granmo, O.-C.; and Goodwin, M. 2020. On the Convergence of Tsetlin Machines for the IDENTITY and NOT Operators. arXiv preprint arXiv:2007.14268.