Low-Power Audio Keyword Spotting using Tsetlin Machines
Jie Lei, Tousif Rahman, Rishad Shafik, Adrian Wheeldon, Alex Yakovlev, Ole-Christoffer Granmo, Fahim Kawsar, Akhil Mathur
Jie Lei, Tousif Rahman, Rishad Shafik, Adrian Wheeldon, Alex Yakovlev
Microsystems Research Group, School of Engineering, Newcastle University, NE1 7RU, UK
[email protected]

Ole-Christoffer Granmo
Centre for AI Research (CAIR), University of Agder, Kristiansand, Norway
[email protected]

Fahim Kawsar, Akhil Mathur
Pervasive Systems Centre, Nokia Bell Labs, Cambridge, UK
[email protected], [email protected]
January 28, 2021

Abstract
The emergence of Artificial Intelligence (AI) driven Keyword Spotting (KWS) technologies has revolutionized human-to-machine interaction. Yet the challenges of end-to-end energy efficiency, memory footprint and system complexity of current Neural Network (NN) powered AI-KWS pipelines have remained ever present. This paper evaluates KWS utilizing a learning automata powered machine learning algorithm called the Tsetlin Machine (TM). Through a significant reduction in parameter requirements and by choosing logic over arithmetic based processing, the TM offers new opportunities for low-power KWS while maintaining high learning efficacy. In this paper we explore a TM based keyword spotting (KWS) pipeline to demonstrate low complexity with a faster rate of convergence compared to NNs. Further, we investigate scalability with increasing keywords and explore the potential for enabling low-power on-chip KWS.

Keywords Speech Command · Keyword Spotting · MFCC · Tsetlin Machine · Learning Automata · Pervasive AI · Machine Learning · Artificial Neural Network

1 Introduction

Continued advances in Internet of Things (IoT) and embedded system design have allowed for accelerated progress in artificial intelligence (AI) based applications [1]. AI driven technologies utilizing sensory data have already had a profoundly beneficial impact on society, including in personalized medical care [2], intelligent wearables [3] as well as disaster prevention and disease control [4].

A major aspect of widespread AI integration into modern living is underpinned by the ability to bridge the human-machine interface, viz. through sound recognition. Current advances in sound classification have allowed AI to be incorporated into self-driving cars, home assistant devices and tools that aid those with vision and hearing impairments [5]. One of the core concepts that has enabled these applications is KWS [6]. Selecting specifically chosen keywords narrows the training data volume, thereby allowing the AI to have a more focused functionality [7]. With the given keywords, modern keyword detection based applications usually rely on responsive real-time results [8]; as such, the practicality of transitioning keyword recognition based machine learning to wearables and other smart devices is still dominated by the challenges of the algorithmic complexity of the KWS pipeline, the energy efficiency of the target device and the AI model's learning efficacy.

The algorithmic complexity of KWS stems from the pre-processing requirements of speech activity detection, noise reduction, and subsequent signal processing for audio feature extraction, gradually increasing application and system latency [7]. When considering on-chip processing, the issue of algorithmic complexity driving operational latency may still be inherently present in the AI model [7, 9].

AI based speech recognition systems often offload computation to a cloud service. However, ensuring real-time responses from such a service requires constant network availability and offers a poor return on end-to-end energy efficiency [10].
Dependency on cloud services also leads to issues involving data reliability and, increasingly, user data privacy [11].

Currently, the most commonly used AI methods apply a neural network (NN) based architecture, or some derivative of it, in KWS [9, 12, 8, 13] (see Section 5.1 for a relevant review). NN based models employ arithmetically intensive gradient descent computations for fine-tuning feature weights. The adjustment of these weights requires a large number of system-wide parameters, called hyperparameters, to balance the dichotomy between performance and accuracy [14]. Given that these components, as well as their complex controls, are intrinsic to the NN model, energy efficiency has remained challenging [15].

To enable alternative avenues toward real-time, energy efficient KWS, low-complexity machine learning (ML) solutions should be explored. A different ML model would eliminate the need to focus on issues NN designers currently face, such as optimizing arithmetic operations or automating hyperparameter searches. In doing so, new methodologies can be evaluated against the essential application requirements of energy efficiency and learning efficacy.

The challenge of energy efficiency is often tackled through intelligent hardware-software co-design techniques or a highly customized AI accelerator, the principal goal being to exploit the available resources as much as possible. To obtain adequate learning efficacy for keyword recognition, the KWS-AI pipeline must be tuned to adapt to speech speed and irregularities; most crucially, it must be able to extract the significant features of the keyword from the time domain to avoid redundancies that lead to increased latency.

Overall, to effectively transition keyword detection to miniature form-factor devices, there must be a conscious design effort in minimizing the latency of the KWS-AI pipeline through algorithmic optimizations and exploration of alternative AI models, development of dedicated hardware accelerators to minimize power consumption, and understanding the relationships between specific audio features, their associated keywords, and how they impact learning accuracy.

This paper establishes an analytical and experimental methodology for addressing the design challenges mentioned above. A new automata based learning method called the Tsetlin Machine (TM) is evaluated in the KWS-AI design in place of traditional perceptron based NNs. The TM operates by deriving propositional logic that describes the input features [16]. It has shown great potential over NN based models in delivering energy-frugal AI applications while maintaining faster convergence and high learning efficacy [17, 18, 19].

Through exploring design optimizations utilizing the TM in the KWS-AI pipeline, we address the following research questions:

• How effective is the TM at solving real-world KWS problems?
• Does the Tsetlin Machine scale well as the KWS problem size is increased?
• How robust is the Tsetlin Machine in the KWS-AI pipeline when dealing with dataset irregularities and overlapping features?

This initial design exploration will uncover the relationships concerning how the Tsetlin Machine's parameters affect the KWS performance, thus enabling further research into energy efficient KWS-TM methodology.
The contributions of this paper are as follows:

• Development of a pipeline for KWS using the TM.
• Use of data encoding techniques to control feature granularity in the TM pipeline.
• Exploration of how the Tsetlin Machine's parameters and architectural components can be adjusted to deliver better performance.
The rest of the paper is organized as follows: Section 2 offers an introduction to the core building blocks and hyperparameters of the Tsetlin Machine. Through exploring the feature extraction and encoding process blocks, the KWS-TM pipeline is proposed in Section 3.3. We then analyze the effects of manipulating the pipeline hyperparameters in Section 4, presenting the experimental results. We examine the effects of changing the number of Mel-Frequency Cepstral Coefficients (MFCCs) generated, the granularity of the encoding, and the robustness of the pipeline under acoustically similar keywords. We then apply our understanding of the Tsetlin Machine's attributes to optimize performance and energy expenditure in Section 4.5. Through the related works presented in Section 5.2, we explore the current research progress on AI powered audio recognition applications and offer an in-depth look at the key component functions of the TM. We summarize the major findings in Section 6 and present the direction of future work in Section 7.
The Tsetlin Machine is a promising new ML algorithm based on the formulation of propositional logic [16]. This section offers a high-level overview of its main functional blocks; a detailed review of relevant research progress can be found in Section 5.2.

The core components of the Tsetlin Machine are: a team of Tsetlin Automata (TA) in each clause, the conjunctive clauses, the summation and threshold module, and the feedback module, as seen in Figure 1. The TAs are finite state machines (FSMs) used to form the propositional-logic relationships that describe an output class through the inclusion or exclusion of input features and their complements. The states of the TAs for each feature and its complement are then aligned to a stochastically independent clause computation module. Through a voting mechanism built into the summation and threshold module, the expected output class Y is generated. During the training phase this class is compared against the target class Ŷ and the TA states are incremented or decremented accordingly (this is also referred to as issuing rewards or penalties).

Figure 1: Block diagram of the TM (the dashed green arrow indicates penalties and rewards) [19]

A fundamental difference between the TM and NNs is the requirement of a Booleanizer module. The key premise is to convert the raw input features and their complements to Boolean features, rather than the binary encoded features seen with NNs. These Boolean features are also referred to as literals: X and X̂. Current research has shown that significance-driven Booleanization of features is vital in controlling the Tsetlin Machine's size and processing requirements [18]. Increasing the number of features increases the number of TAs, the computations in the clause module, and subsequently the energy spent incrementing and decrementing states in the feedback module. The number of clauses used to represent the problem is also available as a design knob, which directly affects energy/accuracy tradeoffs [19].
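The clause computation and voting described above can be illustrated with a minimal sketch. This is not the full TM algorithm: training feedback is omitted, the function and variable names are our own, and the handling of empty clauses differs between training and inference in the complete method — here they simply fire.

```python
import numpy as np

def clause_output(include_mask, literals):
    # A conjunctive clause fires (1) only if every literal it includes is 1.
    # literals = [x_0..x_{n-1}, ~x_0..~x_{n-1}]; include_mask marks the
    # literals whose TA is in an "include" state.
    if not include_mask.any():
        return 1  # simplified handling of empty clauses (see lead-in note)
    return int(np.all(literals[include_mask] == 1))

def class_sum(include_masks, literals, T):
    # Even-indexed clauses vote positively, odd-indexed negatively; the
    # summation and threshold module clamps the vote sum to [-T, T].
    votes = 0
    for j, mask in enumerate(include_masks):
        polarity = 1 if j % 2 == 0 else -1
        votes += polarity * clause_output(mask, literals)
    return max(-T, min(T, votes))
```

In a multi-class TM, one such clause pool is kept per class and the predicted class is the argmax of the clamped class sums.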
The Tsetlin Machine also has two hyperparameters, the s value and the Threshold (T). The Threshold parameter determines the clause selection used in the voting mechanism; larger Thresholds mean more clauses partake in the voting and influence the feedback to TA states. The s value controls the fluidity with which the TAs can transition between states. Careful manipulation of these parameters determines the flexibility of the feedback module and therefore controls the TM's learning stability [17]. As seen in Figure 2, increasing the Threshold and decreasing the s value lead to more events being triggered as more states are transitioned. These parameters must be carefully tuned to balance energy efficiency, through minimizing triggered events, against good performance, through finding the optimum s-T range for learning stability in the KWS application.

Figure 2: The effect of T and s on reinforcements in the TM [19]

In order to optimize the TM for KWS, due diligence must be given to designing steps that minimize the Boolean feature set. This allows a balance between performance and energy usage to be found by varying the TM hyperparameters and the number of clause computation modules. Through exploitation of these relationships and properties of the TM, the KWS pipeline can be designed with particular emphasis on feature extraction and minimization of the number of the TM's clause computation modules. An extensive algorithmic description of the Tsetlin Machine can be found in [16]. The following section details how these ideas can be implemented through audio pre-processing and Booleanization techniques for KWS.

When dealing with audio data, the fundamental design effort in pre-processing should be to find the correct balance between reducing data volume and preserving data veracity. That is, while removing redundancies from the audio stream, the data quality and completeness should be preserved.
This is interpreted in the proposed KWS-TM pipelinethrough two methods: feature extraction through MFCCs, followed by discretization control through quantile basedbinning for Booleanization. These methods are expanded below.
Audio data streams are always subject to redundancies in the channel that manifest as non-vocal noise, background noise and silence [20, 21]. The challenge therefore becomes the identification and extraction of the desired linguistic content (the keyword) while maximally discarding everything else. To achieve this we must consider transformation and filtering techniques that can amplify the characteristics of the speech signal against the background information. This is often done through the generation of MFCCs, as seen in the signal processing flow in Figure 3.

The MFCC is a widely used audio pre-processing method for speech related classification applications [22, 21, 23, 24, 25, 12]. The component blocks in the MFCC pipeline are specifically designed for extracting speech data, taking into account the intricacies of the human voice.

The Pre-Emphasis step is used to compensate for the structure of the human vocal tract and provide initial noise filtration. When producing glottal sounds during speech, higher frequencies are damped by the vocal tract, which can be characterized as a steep roll-off in the signal's frequency spectrum [26]. The Pre-Emphasis step, as its name suggests, amplifies (adds emphasis to) the energy in the high frequency regions, which leads to an overall normalization of the signal [27].
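The Pre-Emphasis step amounts to a first-order high-pass filter, y[n] = x[n] - αx[n-1]. A minimal sketch follows; the coefficient value 0.97 is a common choice in the literature, not one specified in this paper.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: boosts high-frequency energy to offset
    # the spectral roll-off introduced by the vocal tract.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```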
Figure 3: The MFCC pipeline. Audio files pass through Pre-Emphasis (normalizing the audio signal across frequencies), Framing & Windowing (data segmentation), the Fast Fourier Transform (converting audio from the time to the frequency domain), Mel Filter Bank Processing (extracting a frequency cepstrum close to the human auditory system), and Logarithmic Energy Extraction followed by a Discrete Cosine Transform (discarding redundancy and compressing the data) to yield the MFCC features.

Speech signals hold a quasi-stationary quality when examined over a very short time period, which is to say that the statistical information they hold remains near constant [20]. This property is exploited through the Framing and Windowing step. The signal is divided into frames of around 20 ms, and window functions around 10-15 ms long are then multiplied with these overlapping frames; in doing so we preserve the temporal changes of the signal between frames and minimize discontinuities (realized through the smoothed spectral edges and enhanced harmonics of the signal after the subsequent transformation to the frequency domain) [28]. The windowed signals are then transformed to the frequency domain through a Discrete Fourier Transform (DFT) process using the Fast Fourier Transform (FFT) algorithm. The FFT is chosen as it exploits redundancies in the DFT to reduce the amount of computation required, offering quicker run-times.

The human hearing system interprets frequencies linearly up to a certain range (around 1 kHz) and logarithmically thereafter. Therefore, adjustments are required to translate the FFT frequencies onto this non-linear scale [29]. This is done by passing the signal through the Mel Filter Banks in order to transform it to the Mel Spectrum [30]. The filter is realized by overlapping band-pass filters that create the required warped axis. Next, the logarithm of the signal is taken; this brings the data values closer together and makes them less sensitive to slight variations in the input signal [30]. Finally, we perform a Discrete Cosine Transform (DCT) to take the resultant signal to the Cepstrum domain [31]. The DCT is used because the energies present in the signal are highly correlated as a result of the overlapping Mel filter banks and the smoothness of the human vocal tract; the DCT decorrelates these energies and is used to calculate the MFCC feature vector [27, 32]. This vector can be passed to the Booleanizer module to produce the input Boolean features, as described next.
As described in Section 2, Booleanization is an essential step for logic based feature extraction in Tsetlin Machines. Minimizing the Boolean feature space is crucial to the Tsetlin Machine's optimization, as the size and processing volume of a TM are primarily dictated by the number of Booleans [18]. Therefore, a pre-processing stage must be embedded into the pipeline before the TM to allow granularity control of the raw MFCC data. The number of Booleanized features should be kept as low as possible while still capturing the features critical for classification [18].

The discretization method should adapt to, and preserve, the statistical distribution of the MFCC data. The most frequently used method for categorizing data is binning: the process of dividing data into groups, where individual datapoints are then represented by the group they belong to. Datapoints that are close to each other are put into the same group, thereby reducing data granularity [16]. Fixed-width binning methods are not effective in representing skewed distributions and often result in empty bins; they also require manual decisions about bin boundaries.

Therefore, for adaptive and scalable Booleanization, quantile based binning is preferred. By binning the data using its own distribution, we maintain its statistical properties and need not provide bin boundaries, merely the number of bins the data should be discretized into. Control over the number of quantiles is an important parameter in obtaining the final Boolean feature set. Choosing two quantiles results in each MFCC coefficient being represented by only one bit, whereas choosing ten quantiles (or bins) results in four bits per coefficient. Given the large number of coefficients present in the KWS problem, controlling the number of quantiles is an effective way to reduce the total TM size.
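A quantile based Booleanizer along these lines can be sketched as follows. The helper name and the bit-packing of bin indices are our own choices, but the bit counts match the examples above (2 bins → 1 bit, 10 bins → 4 bits).

```python
import numpy as np

def quantile_booleanize(column, n_bins):
    # Bin one MFCC coefficient column by its own quantiles, then encode
    # each bin index with ceil(log2(n_bins)) Boolean features.
    edges = np.quantile(column, np.linspace(0, 1, n_bins + 1)[1:-1])
    idx = np.digitize(column, edges)          # bin index per datapoint
    n_bits = max(1, int(np.ceil(np.log2(n_bins))))
    bits = (idx[:, None] >> np.arange(n_bits)) & 1
    return bits.astype(np.uint8)              # shape: (len(column), n_bits)
```

Because the bin edges come from the data's own quantiles, skewed distributions never produce empty bins, unlike fixed-width binning.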
The KWS-TM pipeline is composed of the data encoding and classification blocks presented in Figure 4. The data encoding scheme encompasses the generation of MFCCs and the quantile binning based Booleanization method. The resulting Booleans are then fed to the Tsetlin Machine for classification. The figure highlights the core attributes of the pre-processing blocks: the ability to extract only the audio features associated with speech through MFCCs, and the ability to control their Boolean granularity through quantile binning.

To explore the functionality of the pipeline and the optimizations that can be made, we return to our primary intentions, i.e., to achieve energy efficiency and high learning efficacy in KWS applications. We can now use the design knobs offered in the pre-processing blocks, such as the variable window size in MFCC generation and control over the number of quantiles, to understand how these parameters can be used to present the Boolean data to the TM in a way that returns good performance using the least number of Booleans. In Section 2 we have also seen the design knobs available through variation of the hyperparameters s and Threshold T, as well as the number of clause computation modules used to represent the problem. Varying the parameters in both the encoding and classification stages in an experimental context will uncover the impact they have on overall KWS performance and energy usage.
Figure 4: The data encoding and classification stages in the KWS-TM pipeline: audio files are converted to Mel-Frequency Cepstral Coefficients, discretized for Booleanization using the quantile method (e.g. the 50% quantile), and classified by the Tsetlin Machine.
To evaluate the proposed KWS-TM pipeline, the Tensorflow speech command dataset was used¹. The dataset consists of many spoken keywords collected from a variety of speakers with different accents, both male and female. The datapoints are stored as 1-second-long audio files in which the background noise is negligible. This reduces the effect of added redundancies in MFCC generation; given that our main aim here is predominantly to test functionality, we will explore the impact of noisy channels in future work. This dataset is commonly used in testing the functionality of ML models and therefore allows fair comparisons to be drawn [33].

From the Tensorflow dataset, 10 keywords: "Yes", "No", "Stop", "Seven", "Zero", "Nine", "Five", "One", "Go" and "Two", have been chosen to explore the functionality of the pipeline using some basic command words. Considering other works comparing NN based pipelines, 10 keywords is the maximum used [34, 13]. Among the keywords chosen there is an acoustic similarity between "No" and "Go"; we therefore explore the impact of 9 keywords together (without "Go") and then the effect of "No" and "Go" together. The approximate ratio of training, testing and validation data is 8:1:1, with a total of 3340 datapoints per class. Using this setup, we conduct a series of experiments to examine the impact of the various parameters of the KWS-TM pipeline discussed earlier. The experiments are as follows:

• Manipulating the window length and window step to control the number of MFCCs generated.
• Exploring the effect of different quantile bins to change the number of Boolean features.
• Using different numbers of keywords, ranging from 2 to 9, to explore the scalability of the pipeline.
• Testing the effect on performance of acoustically different and similar keywords.
• Changing the size of the TM by manipulating the number of clause computation modules, and optimizing performance by tuning the feedback control parameters s and T.

It is well established that the number of input features to the TM is one of the major factors affecting its resource usage [17, 18, 19]. More raw input features mean more Booleans are required to represent them, so the number of Tsetlin Automata (TA) in the TM also increases, leading to more energy spent providing feedback to them. Therefore, reducing the number of features at the earliest stage of the data encoding pipeline is crucial to implementing energy-frugal TM applications.

¹Tensorflow speech command: https://tinyurl.com/TFSCDS
The first set of parameters available for manipulating the number of features comes in the form of the Window Step and the Window Length in MFCC generation (this takes place in the "Framing and Windowing" stage in Figure 4), as seen in Figure 5(a).
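The dependence of the MFCC count on these two parameters follows directly from the number of windows that fit in a clip: 1 + ⌊(N - L)/S⌋ for signal length N, window length L and step S. This formula is implied rather than stated by the text; the helper name is our own.

```python
def num_frames(signal_len, window_len, window_step):
    # Number of full windows that fit in the signal: 1 + floor((N - L) / S).
    # Increasing L (step fixed) shrinks the count linearly, while
    # increasing S shrinks it like 1/S, hence the sharper fall.
    if window_len > signal_len:
        return 0
    return 1 + (signal_len - window_len) // window_step
```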
Figure 5: The Hamming window function applied to audio pre-processing: (a) the windowing process; (b) the effect of increasing window length; (c) the effect of increasing window step.

The window function is effective in reducing spectral distortion by tapering the sampled signal at the beginning and end of each frame (we use overlapping frames to ensure signal continuity is not lost). Smaller Window Steps lead to a more fine-grained and descriptive representation of the audio features through more frames, and therefore more MFCCs, but this also increases computation and latency.

Increasing the Window Length leads to a linear decrease in the total number of frames, and therefore in the MFCCs, as seen in Figure 6(a). Given that the Window Step is kept constant for this experiment, we have a linearly decreasing number of window overlaps, resulting in a linearly decreasing total number of window functions, FFTs and subsequent computations. This leads to the linear decrease in the MFCCs across all frames.

Increasing the Window Step leads to a much sharper fall, given that the overlapping regions no longer decrease linearly, as seen in Figure 6(b). This results in a non-linearly decreasing total number of window functions, and therefore far fewer FFTs and subsequent computations, leading to far fewer MFCCs across all frames. As a result, when the Window Step is small, even a small increase in it produces a large decrease in the number of frames and therefore MFCCs.

To test the effectiveness of manipulating the Window Length and Window Step, the MFCC coefficients were produced for 4 keywords and the TM's classification performance was examined, as seen in Figures 7(a) and 7(b). Changing the Window Length results in much bigger falls in accuracy compared to the Window Step. This is due to the diminished signal amplitudes at the window edges: longer windows mean more tapering of the edge amplitudes and fewer overlaps to preserve signal continuity, as seen in Figure 5(b).
As a result, the fidelity of the generated MFCC features is reduced.

The effect of increasing the Window Step is a smaller drop in accuracy. The testing and validation accuracy remain roughly constant at around 90.5% between Window Steps of 0.03 and 0.10 seconds, and then experience a slight drop.

Figure 6: Changing the number of MFCC coefficients by manipulating the window parameters: (a) the effect of increasing window length; (b) the effect of increasing window step.

Figure 7: Effect of changing window parameters on classification accuracy: (a) effect of window length; (b) effect of window step.

Once again this is due to the tapering effect of the window function. Given that the window length remains the same for this experiment, increasing the window step means far fewer total overlaps and a shrinking overlapping region, as seen in Figure 5(c). The overlaps preserve the continuity of the signal against the edge tapering of the window function; as the overlapping regions decrease in size, the effect of edge tapering increases, leading to increased loss of information. The accuracy remains constant up to a Window Step of 0.1 s because the Window Length is sufficiently long to capture enough of the signal information; once the overlapping regions start to shrink, we experience the loss in accuracy.

We can see that increasing the Window Step is very effective in reducing the number of frames, and therefore the total number of MFCC coefficients across all frames, and provided the Window Length is long enough, the reduction in performance is minimal. To translate these findings into energy efficient implementations, increased design focus must be given to finding the right balance between the size of the Window Step and the achieved accuracy, given the reduction in computation that follows from the reduction in features produced.
One might expect that increased granularity through more bins would lead to improved performance, but we observe that this is not entirely the case. Table 1 shows the impact on KWS-TM performance of increasing the number of bins. The testing and validation accuracy remain around the same with 1 Boolean per feature as with 4 Booleans per feature. Figure 8 shows the large variance in some feature columns and no variance in others. The zero-variance features are redundant in the subsequent Booleanization, as they will be represented by the same Boolean sequence. The features with large variances are of main interest. We see that the mean of these features is relatively close to zero compared to their variance (as seen in Figure 9); therefore a one-Boolean-per-feature representation is sufficient: a 1 represents values above the mean and a 0 values below it. The logical conclusion from these explorations is that the MFCC alone is sufficient both to eliminate redundancies and to extract the keyword properties, and does not require additional granularity beyond one Boolean per feature to distinguish classes.

Figure 8: Variance of the MFCC features for the keyword "Stop".

Figure 9: Mean of the MFCC features for the keyword "Stop".

We have seen that the large variance of the MFCCs means they are easily represented by 1 Boolean per feature, and that this is sufficient to achieve high performance. This is an important initial result; for offline learning we can now also evaluate the effect of removing the zero-variance features in future work to further reduce the total number of Booleans. From the perspective of the Tsetlin Machine there is an additional explanation as to why the performance remains high even when additional Boolean granularity is allocated to the MFCC features. Given that there is a large number of datapoints in each class (3340), if the MFCCs that describe these datapoints are very similar then the TM has more than sufficient training data to settle on the best propositional logic descriptors. This is further seen in the high training accuracy compared to the testing and validation accuracy.

Table 1: Impact of increasing quantiles with 4 classes

Training  Testing  Validation  Num. Bins  Bools per Feature  Total Bools
94.8%     91.3%    91.0%       2          1                  378
96.0%     92.0%    90.7%       4          2                  758
95.9%     90.5%    91.0%       6          3                  1132
95.6%     91.8%    92.0%       8          3                  1132
97.1%     91.0%    90.8%       10         4                  1512
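The one-Boolean-per-feature encoding and the removal of zero-variance features described above can be sketched as follows (the helper names are hypothetical):

```python
import numpy as np

def mean_threshold_booleanize(features):
    # One Boolean per feature: 1 if a value lies above that feature's
    # mean over the training set, 0 otherwise.
    mu = features.mean(axis=0)
    return (features > mu).astype(np.uint8)

def drop_zero_variance(features, eps=1e-12):
    # Zero-variance columns map to identical Booleans for every
    # datapoint, so they carry no class information and can be dropped.
    keep = features.var(axis=0) > eps
    return features[:, keep], keep
```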
Figure 10(a) shows the linear nature with which the training, testing and validation accuracy decrease as the number ofkeywords are increased for a TM with 450 clauses with 200 epochs for training. We note that the testing and validationaccuracy start to veer further away from the training accuracy with the increase of keywords. This performance drop isexpected in ML methods as the problem scales [35]. Despite the large number of datapoints per keyword this is anindicator of overfitting, as confirmed through Figure 10(b) showing around a 4 % increase. The implication of this isthat increased number of keywords make it difficult for the TM to create distinct enough propositional logic to separatethe classes. The performance drop is caused when the correlation of keywords outweighs the number of datapoints todistinguish each of them. This behavior is commonly observed in ML models for audio classification applications [23].The explained variance ratio of the dataset with an increasing number of keywords was taken for the first 100 PrincipleComponent Analysis eigenvalues, as seen in Figure 10(b). We observe that as the number of keywords is increased, thesystem variance decreases, i.e. the inter-class features start to become increasingly correlated. Correlated inter-classfeatures will lead to class overlap and degrade TM performance [18]. Through examination of the two largest LinearDiscriminant component values for the 9 keyword dataset, we clearly see in Figure 11 that there is very little classseparability present. Number of Keywords A cc u r a cy i n % TrainingTestingValidation
(a) The effect on accuracy. (b) The amount of overfitting.
Figure 10: The effect of increasing the number of keywords.

To mitigate the effect of increasing keywords on performance, two methods are available. Firstly, the Tsetlin Machine's hyperparameters can be adjusted to trigger more feedback events (see Figure 2). Doing so may allow the TM to create more diverse logic to describe the classes. Then, by increasing the number of clause computation modules, the TM will have a larger voting group in the Summation and Threshold module and can potentially reach the correct classification more often. Secondly, the quantity of datapoints can be increased; however, for this to be effective the new dataset should hold more variance and completeness when describing each class. This method of data regularization is often used in audio ML applications to deliberately introduce small variance between datapoints [21].
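The explained-variance measurement described above can be reproduced with scikit-learn's PCA. The Gaussian blobs below are hypothetical stand-ins for the MFCC features (class means, counts and dimensionality are assumptions), used only to show the mechanics: tighter-packed class means leave less variance in the leading components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def explained_variance_top100(X):
    """Fraction of total variance captured by the first 100 PCA eigenvalues."""
    pca = PCA(n_components=min(100, min(X.shape) - 1)).fit(X)
    return pca.explained_variance_ratio_.sum()

# Synthetic stand-in: widely spaced class means (few, distinct keywords)
# versus closely spaced class means (many, correlated keywords).
few_classes  = np.vstack([rng.normal(loc=10 * c, size=(300, 378)) for c in range(4)])
many_classes = np.vstack([rng.normal(loc=1 * c,  size=(300, 378)) for c in range(9)])

print(explained_variance_top100(few_classes))   # higher: classes well separated
print(explained_variance_top100(many_classes))  # lower: inter-class correlation
```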
In order to test the robustness of the KWS-TM pipeline functionality, we must emulate real-world conditions where a user will use commands that are acoustically similar to others. Table 2 shows the results of such circumstances. The
Baseline experiment is a KWS dataset consisting of 3 keywords: 'Yes', 'No' and 'Stop'. The second experiment then introduces the keyword 'Seven' to the dataset and the third experiment introduces the keyword 'Go'. The addition of 'Seven' causes a slight drop in accuracy, adhering to our previously made arguments of increased correlation and the presence of overfitting. However, the key result is the inclusion of 'Go'; 'Go' is acoustically similar to 'No' and this increases the difficulty in separating these two classes. We see from Figure 12(a), showing the first two LDA components, that adding 'Seven' does not lead to as much class overlap as adding 'Go', as seen in Figure 12(b). As expected, the acoustic similarities of 'No' and 'Go' lead to significant overlap. We have seen from the previous result (Figure 11) that distinguishing class separability is increasingly difficult when class overlaps are present.
Figure 11: LDA of 9 keywords ('Seven', 'No', 'Stop', 'Yes', 'Five', 'Nine', 'One', 'Two', 'Zero').

Table 2: Impact of acoustically similar keywords.

Experiments         Training  Testing  Validation
Baseline            94.7%     92.6%    93.1%
Baseline + 'Seven'  92.5%     90.1%    90.2%
Baseline + 'Go'     85.6%     82.6%    80.9%
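The LDA projections behind Figures 11 and 12 can be sketched with scikit-learn. The Gaussian blobs are hypothetical stand-ins for keyword features (means, scales and dimensionality are assumptions): two classes share nearly the same mean, mimicking the acoustic overlap of 'No' and 'Go'.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 'no' and 'go' nearly share a mean (acoustically
# similar), while 'seven' is well separated from both.
X_no    = rng.normal(loc=0.0, scale=1.0, size=(300, 40))
X_go    = rng.normal(loc=0.2, scale=1.0, size=(300, 40))   # heavy overlap with 'no'
X_seven = rng.normal(loc=5.0, scale=1.0, size=(300, 40))   # well separated

X = np.vstack([X_no, X_go, X_seven])
y = np.array([0] * 300 + [1] * 300 + [2] * 300)

# Project onto the two largest linear discriminants, as done for the figures.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
Z = lda.transform(X)
print(Z.shape)            # (900, 2): the 2D scatter plotted in Figures 11-12
print(lda.score(X, y))    # below 1.0: the overlapping pair is confused
```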
So far we have considered the impact of Booleanization granularity, problem scalability and robustness when dealing with acoustically similar classes. Now we turn our attention towards optimizing the KWS-TM pipeline to find the right functional balance between performance and energy efficiency. This is made possible through two streams of experimentation: manipulating the number of clauses for each keyword class in the TM and observing the energy expenditure and accuracy, and experimenting with the TM's hyperparameters to enable better performance using fewer clauses. The influence of increasing the number of clauses was briefly discussed in Section 2; the experimental result in Figure 13(a) shows the impact of increasing clauses with 4 classes. An increased number of clauses leads to better performance. However, upon closer examination we can also see the impact of overfitting at the clause level, i.e., increasing the number of clauses has resulted in a larger difference between the training accuracy and the testing and validation accuracy. The datapoints for the 4 classes were sufficient to create largely different sub-patterns for the TAs during training, but not complete enough to describe new data in testing and validation. As a result, when clauses are increased, more clauses reach incorrect decisions and sway the voting in the Summation and Threshold module toward incorrect classification, as seen in Figure 14(a). The TM has two types of feedback: Type I, which introduces stochasticity to the system, and Type II, which bases state transitions on the corresponding clause value. Type II feedback is predominantly used to diminish the effect of false positives. We see that as the clause count increases the TM uses more Type II feedback, indicating increased false positive classifications. This result is due to the incompleteness of the training data in describing all possible logic propositions for each class.
We see this through Figure 14(b); despite increasing the number of epochs we do not experience a boost in testing and validation accuracy, and through Figure 13(b) we find the point where the overfitting outweighs the accuracy improvement, at around 190-200 clauses. From the perspective of energy efficiency, these results offer two possible implications for the KWS-TM pipeline. If a small degradation of performance in the KWS application is acceptable, then operating at a lower clause range will be more beneficial for the TM. The performance can then be boosted through the hyperparameters available to adjust feedback fluidity. This approach will reduce energy expenditure through fewer clause computations and reduce the effects of overfitting when the training data lacks completeness. Alternatively, if performance is the main goal, then the design focus should be on injecting training data with more diverse datapoints to increase the descriptiveness of each class. In that case, increased clauses will provide more robust functionality. The impacts of being resource efficient and energy frugal are most prevalent when implementing KWS applications into dedicated hardware and embedded systems. To explore this practically, the KWS-TM pipeline was implemented onto a
(a) The Baseline with 'Seven'. (b) The Baseline with 'Go'.
Figure 12: The LDA of 4 keywords - the Baseline with one other.
(a) The effect on accuracy. (b) The effect on overfitting.
Figure 13: Effect of increasing the number of clauses on accuracy and overfitting.

Raspberry Pi. The same 4 keyword experiment was run with 100 and 240 clauses. As expected, we see that increased clause computations lead to increased current, time and energy usage, but also deliver better performance. We can potentially boost the performance of the Tsetlin Machine at lower clauses through manipulating the hyperparameters, as seen in Table 4. The major factor that has impacted the performance of the KWS is the capacity of the TM, which is determined by the number of clauses per class. The higher the number of clauses, the higher the overall classification accuracy [18]. Yet, the resource usage will increase linearly along with the energy consumption and memory footprint. Through Table 4 we see that at 30 clauses the accuracy can be boosted by reducing the Threshold hyperparameter. The table offers two design scenarios. Firstly, very high accuracy is achievable through a large number of clauses (450 in this case) and a large Threshold value. With a large number of clauses an increased number of events must be triggered in terms of state transitions (see Figure 2) to encourage more feedback to clauses, which increases the TM's decisiveness. While this

Table 3: Impact of the number of clauses on energy/accuracy tradeoffs.

           Clauses  Current  Time  Energy    Accuracy
Training   100      0.50 A   68 s  426.40 J  -
Training   240      0.53 A   96 s  636.97 J  -
Inference  100      0.43 A   12 s  25.57 J   80 %
Inference  240      0.47 A   37 s  87.23 J   90 %
(a) The effect of clauses on feedback. (b) The effect of epoch on accuracy.
Figure 14: Effect of increasing the number of clauses on TM feedback (a), and the effect of increasing the number of epochs on accuracy (b).

Table 4: Impact of the T values on accuracy.

Clauses  T   Training  Testing  Validation  Better Classification
30       2   83.5 %    80.5 %   83.8 %      ✓
30       23  74.9 %    71.1 %   76.1 %
450      2   89.7 %    86.1 %   84.9 %
450      23  96.8 %    92.5 %   92.7 %      ✓

offers a very good return on performance, the number of computations increases with more clauses and more triggered events, and this leads to increased energy expenditure, as seen in Table 3. In contrast, using fewer clauses and a lower Threshold still yields good accuracy but at a much lower energy expenditure, through fewer clause computations and feedback events. A smaller number of clauses means that the vote of each clause has more impact; even at a smaller Threshold, the inbuilt stochasticity of the TM's feedback module allows the TAs to reach the correct propositional logic. Through these attributes it is possible to create more energy frugal TMs requiring fewer computations and operating at a much lower latency.
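In the common TM formulation [16], the Threshold T clamps the class sum and sets the probability of triggering a feedback event; once the sum reaches T, no further events fire, so T effectively controls how many clauses are recruited into the vote. A minimal sketch of this mechanism (the function names are ours):

```python
def clamp(v, T):
    """Clamp the class vote sum into [-T, T]."""
    return max(-T, min(T, v))

def feedback_probability(votes, T, target):
    """Probability that a clause receives a feedback event.

    Following the common TM formulation: Type I feedback (combats false
    negatives) for the target class, Type II (combats false positives)
    for a non-target class.
    """
    if target:
        return (T - clamp(votes, T)) / (2 * T)   # Type I
    return (T + clamp(votes, T)) / (2 * T)       # Type II

print(feedback_probability(votes=0, T=2, target=True))   # 0.5: undecided, feedback flows
print(feedback_probability(votes=2, T=2, target=True))   # 0.0: sum reached T, events stop
print(feedback_probability(votes=2, T=23, target=True))  # still high: large T recruits more clauses
```

This matches the design scenarios of Table 4: a large clause count paired with a large T keeps events firing, while a small clause count with a small T saturates quickly and saves computation.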
Both TMs and NNs have modular design components in their architecture; for the TM this is in the form of clauses, and for the NN it is the number of neurons. NNs require input weights for the learning mechanism, which define the neurons' output patterns. The number of weights and the number of neurons are variable; however, more neurons will lead to better overall NN connectivity due to more refined arithmetic pathways to define a learning problem. For the TM, the clauses are composed of TAs. The number of TAs is defined by the number of Boolean features, which remains static throughout the course of the TM's learning. It is the number of clauses that is variable; increasing the clauses typically offers more propositional diversity to define a learning problem. Through Figure 15 and Table 5 we investigate the learning convergence rates of the TM against 4 'vanilla' NN implementations. The TM is able to converge to 90.5 % in fewer than 10 epochs, highlighting its quick learning rate compared to the NNs, which require around 100 epochs to converge to the iso-accuracy target (≈ 90.5 %).
(a) Convergence of the TM against shallow NNs. (b) Convergence of the TM against deep NNs.
Figure 15: Training convergence of TM and NN implementations.

Table 5: The required parameters for different NNs and the TM for a 4 keyword problem.

KWS-ML Configuration            Num. neurons   Num. parameters
NN Small & Shallow: 256+512X2   1,280          983,552
NN Small & Deep: 256+512X5      2,816          2,029,064
NN Large & Shallow: 256+1024X2  2,304          2,822,656
NN Large & Deep: 256+1024X5     5,376          7,010,824
TM with 240 Clauses per Class   960 (clauses)  2 hyperparameters with 725,760 TAs
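Table 5's TM row can be checked by simple arithmetic: with the 378 Boolean features of Table 1, each clause needs one TA per literal (each feature and its negation), so the TA count follows directly from the clause count:

```python
# Back-of-the-envelope check of Table 5's TM row: 378 Boolean features
# (Table 1) give 756 literals per clause, one TA per literal.
booleans_per_datapoint = 378
classes = 4
clauses_per_class = 240

literals = 2 * booleans_per_datapoint          # 756: each feature and its negation
total_clauses = classes * clauses_per_class    # 960 clauses in total
total_tas = total_clauses * literals           # TAs across the whole TM

print(total_clauses, total_tas)                # 960 725760, matching Table 5
```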
This section provides a brief examination of current KWS research and industrial challenges with KWS, takes a deeper look into the component blocks of the TM, and provides insight into current developments and future research directions.
The first KWS classification methods proposed in the late 1970s used MFCCs for their feature extraction ability and because the coefficients produced offered a very small dimensionality compared to the raw input data being considered at the time [37]. It was later shown that, compared to other audio extraction methods such as linear prediction cepstral coefficients (LPCCs) and perceptual linear prediction (PLP), MFCCs perform much better with increased background noise and low SNR [12]. For the classifier, Hidden Markov Models (HMMs) were favored after the MFCC stage due to their effectiveness in modelling sequences [37]. However, they rely on many summation and Bayesian probability based arithmetic operations, as well as the computationally intensive Viterbi decoding to identify the final keyword [34, 38, 39]. Later it was shown that Recurrent Neural Networks (RNNs) outperform HMMs but suffer from operational latency as the problem scales, albeit RNNs still have faster run-times than HMM pipelines given they do not require a decoder algorithm [38]. To solve the latency issue, the Deep Neural Network (DNN) was used; it has a smaller memory footprint and reduced run-times compared to HMMs [12, 39]. However, DNNs are unable to efficiently model the temporal correlations of the MFCCs and their transitional variance [36, 34]. In addition, commonly used optimization techniques for DNNs such as pruning, encoding and quantization lead to large accuracy losses in KWS applications [12]. The MFCC features exist as a 2D array, as seen in Figure 4. To preserve the temporal correlations and transitional variance, this array can be treated as an image and a convolutional neural network (CNN) can be used for classification [13, 36]. With the use of convolution comes the preservation of the spatial and temporal dependencies of the 2D data, as well as the reduction of features and computations from the convolution and pooling stages [13]. However, once again both the CNN and DNN suffer from a large number of parameters (250K for the dataset used in [36], and 9 million multiplies required for the CNN). Despite the gains in performance and reductions in latency, the computational complexity and large memory requirements from parameter storage are ever present in all NN based KWS solutions.
The storage and memory requirements played a major part in transitioning to a micro-controller system for inference, where memory is limited by the size of the SRAM [34]. In order to accommodate the large throughput of running NN workloads, micro-controllers with integrated DSP instructions or integrated SIMD and MAC instructions can accelerate low-precision computations [34]. When testing for 10 keywords, it was shown experimentally in [34] that for systems with limited memory and compute abilities DNNs are favorable, given they use fewer operations despite having a lower accuracy (around 6 % less) compared to CNNs. It is when transitioning to hardware that the limitations of memory and compute resources become more apparent. In these cases it is better to settle for energy efficiency through classifiers with lower memory requirements and operations per second, even if there is a slight drop in performance. A 22nm CMOS based Quantized Convolutional Neural Network (QCNN) Always-On KWS accelerator is implemented in [12]; the authors explore the practicalities of CNNs in hardware through quantized weights, activation values and approximate compute units. Their findings illustrate the effectiveness of hardware design techniques; the use of approximate compute units led to a significant decrease in energy expenditure, and the hardware unit is able to classify 10 real-time keywords under different SNRs with a power consumption of 52 µW. The impact of approximate computing is also argued in [13], with design focus given to adder design; they propose an adder with a critical path that is 49.28 % shorter than standard 16-bit Ripple Carry Adders. Through their research work with earables, Nokia Bell Labs Cambridge have brought an industrial perspective on functionality while maintaining energy frugality into design focus for AI powered KWS [40, 41], with particular emphasis on user oriented ergonomics and commercial form factor.
They discovered that earable devices are not as influenced by background noise as smartphones and smartwatches, and offer a better signal-to-noise ratio for moving artefacts due to their largely fixed wearing position in daily activities (e.g. walking or descending stairs) [41]. This was confirmed when testing using Random Forest classifiers.

We briefly discussed the overall mechanism of the TM and its main building blocks in Section 2. In this section, we take a closer look at the fundamental learning element of the TM, namely the Tsetlin Automaton, as described in Figure 16. We also present a more detailed look at the clause computing module, as seen in Figure 17, and discuss the first application-specific integrated circuit (ASIC) implementation of the TM, the Mignon, as seen in Figure 18.
Figure 16: Mechanism of a TA.

The TA is the most fundamental part of the TM, forming the core learning element that drives classification (Figure 16). Developed by Mikhail Tsetlin in the 1950s, the TA is an FSM whose current state transitions towards or away from the middle state upon receiving Reward or Penalty reinforcements during the TM's training stage. The current state of the TA decides the output of the automaton, which will be either an Include (aA) or an Exclude (aB). Figure 17 shows how the clause module creates logic propositions that describe the literals based on the TA decisions, through logic OR operations between the negated TA decision and the literal. The TA decision is used to bit-mask the literal, and through this we can determine which literals are to be excluded. The proposition is then logic AND-ed, and this forms the raw vote for the clause. Clauses can be of positive or negative polarity; as such, a sign is attached to the clause output before it partakes in the class voting. It is important to note the reliance purely on logic operations, making the TM well suited to hardware implementations. Clauses are largely independent of each other, only coalescing for voting, giving the TM good scalability potential.

Mignon AI: http://mignon.ai/
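The clause computation just described can be sketched in a few lines. This is an illustrative software model (the function and variable names are ours), not the hardware implementation: each TA's include decision masks one literal, excluded literals are forced to 1 via the OR with the negated decision, and the AND over the result forms the raw clause vote.

```python
import numpy as np

def clause_output(literals, include_mask):
    """Raw clause vote: AND over the included literals.

    Excluded literals (include_mask = 0) are forced to 1 by OR-ing with
    the negated TA decision, so only included literals constrain the AND.
    """
    return int(np.all(np.logical_or(~include_mask, literals)))

# 3 Boolean features -> 6 literals (the originals followed by their negations).
x = np.array([1, 0, 1], dtype=bool)
literals = np.concatenate([x, ~x])

# Hypothetical TA team: include x0 and NOT(x1), exclude everything else.
include = np.array([1, 0, 0, 0, 1, 0], dtype=bool)

print(clause_output(literals, include))   # 1: x0 AND NOT(x1) holds for this input
```

A clause with every literal excluded trivially outputs 1, which is why the feedback mechanism is needed to drive the TAs toward discriminative include sets.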
Figure 17: Mechanism of a Clause computing module (assuming TA = 1 means Include and TA = 0 means Exclude).

The feedback to the TM can be thought of on three levels: at the TM level, at the clause level and at the TA level. At the TM level, the type of feedback to issue is decided based on the target class and whether the TM is in learning or inference mode. For inference no feedback is given; we simply take the clause computes for each class and pass them to the Summation and Threshold module to generate the predicted class. In training mode, however, there is a choice of Type I feedback to combat false negatives or Type II feedback to combat false positives. This feedback choice is further considered at the clause level. At the clause level there are three main factors that determine the feedback type to the TAs: the feedback type decision from the TM level, the current clause value, and whether the magnitude of the clause vote is above the magnitude of the Threshold. At the TA level, the feedback type from the clause level is used in conjunction with the current TA state and the s parameter to determine whether inaction, penalty or reward is given to the TA states. The simplicity of the TM shows its potential as a promising NN alternative. Lei et al. [19] comparatively analyzed the architecture, memory footprint and convergence of the two algorithms for different datasets. This research shows that the smaller number of hyperparameters of the TM reduces design complexity, and the convergence of the TM was faster than that of the NN in all experiments conducted. The most distinctive architectural advance of the TM is its propositional logic based learning mechanism, which will be beneficial in achieving energy frugal hardware AI. Wheeldon et al.
[18] presented the first ASIC implementation of the TM for Iris flower classification (see Figure 18).

Figure 18: The Mignon AI: ASIC microchip accelerating the TM (left).

This 65-nm technology based design is a breakthrough in achieving an energy efficiency of up to 63 Tera Operations per Joule (TOps/Joule) while maintaining a high convergence rate and performance. The early results from this microchip have been extensively compared with Binarized Convolutional Neural Network (BCNN) and neuromorphic designs in [18]. In addition, Wheeldon et al. [18] also proposed a system-wide design space exploration pipeline for deploying the TM into ASIC design. They introduced a detailed methodology from 1) dataset encoding, building on the work seen in [4],
to 2) software based design exploration and 3) an FPGA based hyperparameter search to 4) final ASIC synthesis. A follow-up work [42] also implemented a self-timed and event-driven hardware TM. This implementation showed power and timing elasticity properties suitable for low-end AI implementations at the micro-edge. Other works include mathematical lemma based analysis of clause convergence using the XOR dataset [43], natural language (text) processing [44], disease control [4], methods of automating the s parameter [45], as well as exploration of regression and convolutional TMs [46, 47]. The TM has so far been implemented in many different programming languages such as C, C++, C#, Python and Node.js, to name a few. It has also been optimized for High Performance Computing (HPC) through Compute Unified Device Architecture (CUDA) for accelerating Graphics Processing Unit (GPU) based solutions, and currently through OpenCL for heterogeneous embedded systems [48]. Exploiting this natural logic underpinning, there are currently ongoing efforts in establishing explainability evaluation and analysis of TMs [17]. The deterministic implementation of clause selection in the TM, reported in [49], is a promising direction to this end. Besides published works, there are numerous talks, tutorials and multimedia resources currently available online to mobilize the hardware/software community around this emerging AI algorithm. Below are some key sources:

Videos: https://tinyurl.com/TMVIDEOSCAIR
Publications: https://tinyurl.com/TMPAPERCAIR
Software implementations: https://tinyurl.com/TMSWCAIR
Hardware implementations, Mignon AI: http://mignon.ai/
A short video demonstrating KWS using TM can be found here: https://tinyurl.com/KWSTMDEMO

The paper presented the first ever TM based KWS application.
Through experimenting with the hyperparameters of the proposed KWS-TM pipeline, we established relationships between the different component blocks that can be exploited to bring about increased energy efficiency while maintaining high learning efficacy. From current research work we have already determined that the best way to optimize the TM is to find the right balance between the reduction of the number of features, the number of clauses and the number of events triggered through the feedback hyperparameters, against the resulting performance of these changes. These insights were carried into our pipeline design exploration experiments. Firstly, we fine-tuned the window function in the generation of MFCCs; we saw that increasing the window step leads to far fewer MFCCs, and if the window length is sufficient to reduce edge tapering then the performance degradation is minimal. Through quantile binning to manipulate the discretization of the Boolean MFCCs, it was seen that this did not yield a change in performance. The MFCC features of interest have very large variances in each feature column and as such less precision can be afforded to them, even as low as one Boolean per feature. This was extremely useful in reducing the resulting TM size. Through manipulating the number of clause units of the TM on a Raspberry Pi, we confirmed the energy and latency savings possible by running the pipeline at a lower clause number, and using the Threshold hyperparameter the classification accuracy can also be boosted. Through these design considerations we are able to increase the energy frugality of the whole system and transition toward low-power hardware accelerators of the pipeline to tackle real time applications. The KWS-TM pipeline was then compared against several different NN implementations, where we demonstrated its much faster convergence to the same accuracy during training.
Through these comparisons we also highlighted the far smaller number of parameters required for the TM, as well as the smaller number of clauses compared to neurons. The faster convergence, fewer parameters and logic over arithmetic processing make the KWS-TM pipeline more energy efficient and enable future work into hardware accelerators for better performance and low-power on-chip KWS.
Acknowledgement: The authors gratefully acknowledge the funding from EPSRC IAA project "Whisperable" and EPSRC grant STRATA (EP/N023641/1). The research also received help from the computational powerhouse at CAIR (https://cair.uia.no/house-of-cair/).
Through testing the KWS-TM pipeline against the TensorFlow Speech dataset we did not account for background noise effects. In-field IoT applications must be robust enough to minimize the effects of additional noise; therefore, future work in this direction should examine the effects on the pipeline of changing signal-to-noise ratios. The pipeline will also be deployed to a micro-controller in order to benefit from the effects of energy frugality by operating at a lower power level.
References

[1] T. Rausch and S. Dustdar. Edge intelligence: The convergence of humans, things, and ai. In , pages 86–96, 2019.
[2] Itsuki Osawa, Tadahiro Goto, Yuji Yamamoto, and Yusuke Tsugawa. Machine-learning-based prediction models for high-need high-cost patients using nationwide clinical and claims data.
[3] Tiago M. Fernández-Caramés and Paula Fraga-Lamas. Towards the internet-of-smart-clothing: A review on iot wearables and garments for creating intelligent connected e-textiles. Electronics (Switzerland), 7, 12 2018.
[4] K. D. Abeyrathna, O. C. Granmo, X. Zhang, and M. Goodwin. Adaptive continuous feature binarization for tsetlin machines applied to forecasting dengue incidences in the philippines. In , pages 2084–2092, 2020.
[5] K. Hirata, T. Kato, and R. Oshima. Classification of environmental sounds using convolutional neural network with bispectral analysis. In , pages 1–2, 2019.
[6] Hadas Benisty, Itamar Katz, Koby Crammer, and David Malah. Discriminative keyword spotting for limited-data applications. Speech Communication, 99:1–11, 2018.
[7] J. S. P. Giraldo, C. O’Connor, and M. Verhelst. Efficient keyword spotting through hardware-aware conditional execution of deep neural networks. In , pages 1–8, 2019.
[8] J. S. P. Giraldo, S. Lauwereins, K. Badami, H. Van Hamme, and M. Verhelst. 18uw soc for near-microphone keyword spotting and speaker verification. In , pages C52–C53, 2019.
[9] S. Leem, I. Yoo, and D. Yook. Multitask learning of deep neural network-based keyword spotting for iot devices. IEEE Transactions on Consumer Electronics, 65(2):188–194, 2019.
[10] A depthwise separable convolutional neural network for keyword spotting on an embedded system. EURASIP Journal on Audio, 2020:10, 2020.
[11] Massimo Merenda, Carlo Porcaro, and Demetrio Iero. Edge machine learning for ai-enabled iot devices: A review. Sensors (Switzerland), 20, 5 2020.
[12] B. Liu, Z. Wang, W. Zhu, Y. Sun, Z. Shen, L. Huang, Y. Li, Y. Gong, and W. Ge. An ultra-low power always-on keyword spotting accelerator using quantized convolutional neural network and voltage-domain analog switching network-based approximate computing. IEEE Access, 7:186456–186469, 2019.
[13] S. Yin, P. Ouyang, S. Zheng, D. Song, X. Li, L. Liu, and S. Wei. A 141 uw, 2.46 pj/neuron binarized convolutional neural network based self-learning speech recognition processor in 28nm cmos. In , pages 139–140, 2018.
[14] Nebojsa Bacanin, Timea Bezdan, Eva Tuba, Ivana Strumberger, and Milan Tuba. Optimizing convolutional neural network hyperparameters by enhanced swarm intelligence metaheuristics. 2020.
[15] Rishad Shafik, Alex Yakovlev, and Shidhartha Das. Real-power computing. IEEE Transactions on Computers, 2018.
[16] Ole-Christoffer Granmo. The Tsetlin Machine – A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic. arXiv, April 2018.
[17] Rishad Shafik, Adrian Wheeldon, and Alex Yakovlev. Explainability and dependability analysis of learning automata based AI hardware. In IEEE IOLTS, 2020.
[18] Adrian Wheeldon, Rishad Shafik, Tousif Rahman, Jie Lei, Alex Yakovlev, and Ole-Christoffer Granmo. Learning automata based AI hardware design for IoT. Philosophical Trans. A of the Royal Society, 2020.
[19] J. Lei, A. Wheeldon, R. Shafik, A. Yakovlev, and O. C. Granmo. From arithmetic to logic based ai: A comparative analysis of neural networks and tsetlin machine. In , pages 1–4, 2020.
[20] S. Chu, S. Narayanan, and C.-C. J. Kuo. Environmental sound recognition with time–frequency audio features. IEEE Transactions on Audio, Speech, and Language Processing, 17(6):1142–1158, 2009.
[21] Zohaib Mushtaq and Shun-Feng Su. Environmental sound classification using a regularized deep convolutional neural network with data augmentation. Applied Acoustics, 167:107389, 2020.
[22] W. Shan, M. Yang, J. Xu, Y. Lu, S. Zhang, T. Wang, J. Yang, L. Shi, and M. Seok. 14.1 a 510nw 0.41v low-memory low-computation keyword-spotting chip using serial fft-based mfcc and binarized depthwise separable convolutional neural network in 28nm cmos. In , pages 230–232, 2020.
[23] Muqing Deng, Tingting Meng, Jiuwen Cao, Shimin Wang, Jing Zhang, and Huijie Fan. Heart sound classification based on improved mfcc features and convolutional recurrent neural networks. Neural Networks, 130:22–32, 2020.
[24] L. Xiang, S. Lu, X. Wang, H. Liu, W. Pang, and H. Yu. Implementation of lstm accelerator for speech keywords recognition. In , pages 195–198, 2019.
[25] Kirandeep Kaur and N. Jain. Feature extraction and classification for automatic speaker recognition system – a review. 2015.
[26] Joseph W. Picone. Signal modeling techniques in speech recognition. In Proceedings of the IEEE, pages 1215–1247, 1993.
[27] Uday Kamath, John Liu, and James Whitaker. Automatic Speech Recognition, pages 369–404. Springer International Publishing, Cham, 2019.
[28] Automatic speech recognition. In Speech and Audio Signal Processing, pages 299–300. John Wiley & Sons, Inc., oct 2011.
[29] N.J. Nalini and S. Palanivel. Music emotion recognition: The combined evidence of mfcc and residual phase. Egyptian Informatics Journal, 17(1):1–10, 2016.
[30] Q. Li, Y. Yang, T. Lan, H. Zhu, Q. Wei, F. Qiao, X. Liu, and H. Yang. Msp-mfcc: Energy-efficient mfcc feature extraction method with mixed-signal processing architecture for wearable speech recognition applications. IEEE Access, 8:48720–48730, 2020.
[31] C. Paseddula and S. V. Gangashetty. Dnn based acoustic scene classification using score fusion of mfcc and inverse mfcc. In , pages 18–21, 2018.
[32] S. Jothilakshmi, V. Ramalingam, and S. Palanivel. Unsupervised speaker segmentation with residual phase and mfcc features. Expert Systems with Applications, 36(6):9799–9804, 2009.
[33] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition, 2018.
[34] Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. Hello edge: Keyword spotting on microcontrollers. CoRR, abs/1711.07128, 2017.
[35] Z. Zhang, S. Xu, S. Zhang, T. Qiao, and S. Cao. Learning attentive representations for environmental sound classification. IEEE Access, 7:130327–130339, 2019.
[36] Tara Sainath and Carolina Parada. Convolutional neural networks for small-footprint keyword spotting. In Interspeech, 2015.
[37] J. G. Wilpon, L. R. Rabiner, C.-H. Lee, and E. R. Goldman. Automatic recognition of keywords in unconstrained speech using hidden markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(11):1870–1878, 1990.
[38] Santiago Fernández, Alex Graves, and Jürgen Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. In Proceedings of the 17th International Conference on Artificial Neural Networks, ICANN’07, pages 220–229, Berlin, Heidelberg, 2007. Springer-Verlag.
[39] G. Chen, C. Parada, and G. Heigold. Small-footprint keyword spotting using deep neural networks. In , pages 4087–4091, 2014.
[40] Chulhong Min, Akhil Mathur, and Fahim Kawsar. Exploring audio and kinetic sensing on earable devices. In Proceedings of the 4th ACM Workshop on Wearable Systems and Applications, WearSys ’18, pages 5–10, New York, NY, USA, 2018. Association for Computing Machinery.
[41] F. Kawsar, C. Min, A. Mathur, and A. Montanari. Earables for personal-scale behavior analytics. IEEE Pervasive Computing, 17(3):83–89, 2018.
[42] Adrian Wheeldon, Alex Yakovlev, Rishad Shafik, and Jordan Morris. Low-latency asynchronous logic design for inference at the edge. arXiv preprint arXiv:2012.03402, 2020.
[43] Lei Jiao, Xuan Zhang, Ole-Christoffer Granmo, and K. Darshana Abeyrathna. On the convergence of tsetlin machines for the xor operator, 2021.
[44] Bimal Bhattarai, Ole-Christoffer Granmo, and Lei Jiao. Measuring the novelty of natural language text using the conjunctive clauses of a tsetlin machine text classifier, 2020.
[45] Saeed Rahimi Gorji, Ole-Christoffer Granmo, Adrian Phoulady, and Morten Goodwin. A tsetlin machine with multigranular clauses, 2019.
[46] K. Darshana Abeyrathna, Ole-Christoffer Granmo, Xuan Zhang, Lei Jiao, and Morten Goodwin. The regression tsetlin machine: a novel approach to interpretable nonlinear regression. Philosophical Trans. A of the Royal Society, 2019.
[47] Ole-Christoffer Granmo, Sondre Glimsdal, Lei Jiao, Morten Goodwin, Christian W. Omlin, and Geir Thore Berge. The convolutional tsetlin machine. CoRR, abs/1905.09688, 2019.
[48] K. Darshana Abeyrathna, Bimal Bhattarai, Morten Goodwin, Saeed Gorji, Ole-Christoffer Granmo, Lei Jiao, Rupsa Saha, and Rohan K. Yadav. Massively parallel and asynchronous tsetlin machine architecture supporting almost constant-time scaling. arXiv preprint arXiv:2009.04861, 2020.
[49] K. Darshana Abeyrathna, Ole-Christoffer Granmo, Rishad Shafik, Alex Yakovlev, Adrian Wheeldon, Jie Lei, and Morten Goodwin. A novel multi-step finite-state automaton for arbitrarily deterministic tsetlin machine learning. In