A 5 μW Standard Cell Memory-based Configurable Hyperdimensional Computing Accelerator for Always-on Smart Sensing
Manuel Eggimann,
Graduate Student Member, IEEE,
Abbas Rahimi, Luca Benini,
Fellow, IEEE
Abstract—Hyperdimensional computing (HDC) is a brain-inspired computing paradigm based on high-dimensional holistic representations of vectors. It recently gained attention for embedded smart sensing due to its inherent error-resiliency and suitability to highly parallel hardware implementations. In this work, we propose a programmable all-digital CMOS implementation of a fully autonomous HDC accelerator for always-on classification in energy-constrained sensor nodes. By using energy-efficient standard cell memory (SCM), the design is easily cross-technology mappable. It achieves extremely low power, 5 µW in typical applications, and an energy-efficiency improvement over state-of-the-art (SoA) digital architectures of up to 3x in post-layout simulations for always-on wearable tasks such as EMG gesture recognition. As part of the accelerator's architecture, we introduce novel hardware-friendly embodiments of common HDC-algorithmic primitives, which results in a 3.3x technology-scaled area reduction over the SoA, achieving the same accuracy levels in all examined targets. The proposed architecture also has a fully configurable datapath using microcode optimized for HDC stored in an integrated SCM-based configuration memory, making the design "general-purpose" in terms of HDC algorithm flexibility. This flexibility allows usage of the accelerator across novel HDC tasks, for instance, a newly designed HDC algorithm applied to the task of ball bearing fault detection.

Index Terms—Hyperdimensional Computing, Always-on, Edge Computing, Machine Learning, Hardware Accelerator, VLSI, Standard Cell Memory
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
I. INTRODUCTION

ENERGY boundedness is the key design metric and constraint in the development of internet-of-things (IoT) devices [1, 2, 3]. With more and more sensor modalities integrated into IoT end nodes, the amount of data to process and the complexity of the processing pipeline increase. Aiming for uninterrupted operation for years or even indefinitely within the tight power envelope of small batteries or environmental energy harvesting urges to drastically reduce the average power consumption of the sensor nodes themselves.

Observing that the majority of power consumption in today's wireless sensor devices is spent in data transmission [4] promotes moving data processing closer to the sensor. Instead of raw data transmission and centralized processing in the cloud, the data is processed continuously on these so-called smart sensor devices [5]. Only the analyzed portion of the information is transmitted (e.g., transmission of a single imminent machine failure message instead of the raw vibration and temperature data). This cannot be achieved by application-specific integrated circuit (ASIC) designs for deep neural networks alone because general-purpose always-on smart sensing systems operate in the µW range. Therefore, the next evolution step towards fully self-sustainable always-on smart sensors requires the exploration of new avenues of hardware-software co-design outside the realm of traditional von Neumann based computing [6, 7, 8].

An energy-proportional sensor data processing scheme, where a wake-up circuit (WuC) detects patterns of interest and aggressively duty cycles other circuitry, is a viable solution to drastically reduce average power consumption [9, 10]. While there are numerous WuCs in the µW range, e.g., for biosignal anomaly detection, sound/keyword spotting, or incoming radio transmissions, all of these solutions are highly application-specific.
Considering the cost of custom silicon development and the rapidly widening range of application targets, there is a need for configurable and application-agnostic WuCs with more flexible pattern extraction capabilities than the simple threshold-based solutions, which can suffer from high false-positive rates and thus energy losses from unnecessary wake-ups.

Hyperdimensional computing (HDC) is a brain-inspired computing paradigm that excels in the learning curve, computational complexity of the training, and simplicity of operations for hardware. This makes it a perfect fit for energy-constrained inference applications and, more specifically, for general-purpose always-on sensing [11, 12, 13]. In this work, we present the following contributions:

• We propose a novel flexible and highly energy-efficient all-digital HDC architecture for always-on smart sensing applications achieving up to 3x higher energy efficiency (191 nJ/inference) over the SoA.
• As part of the architecture, we introduce novel hardware-friendly embodiments of common HDC operators resulting in a 3.3x technology-scaled area reduction.
• We provide an evaluation of latch-based associative memories at sub-nominal supply voltage conditions in post-layout simulation, indicating the potential of at least 3.5x energy-efficiency improvement compared to an SRAM-based digital solution.
• We provide practical application case studies of our approach, including the first investigation (to the best of our knowledge) on the feasibility of HDC for the task of ball bearing fault detection.
• Finally, using an all-digital approach enables us to publicly release our architecture under the permissive Solderpad open-source license.¹

The remainder of this paper is structured as follows. In Section II we elaborate on previous work in the domain of HDC accelerators and always-on classification circuitry and highlight the distinctive novel characteristics of the proposed approach.
Section III analyzes in detail the modules of the proposed architecture. We continue with a post-layout analysis of the power and area of the design in different target technologies and different design parameter combinations in Section IV before we conduct an energy-efficiency and accuracy analysis for several always-on cognitive sensing scenarios in Section V. Finally, we conclude in Section VI.

II. RELATED WORK
Tackling the power-consumption challenge of always-on sensing in a hierarchical manner, using WuCs to apply aggressive duty cycling on more involved data processing modules, is not a new idea. In the recent past, there have been several publications on low-power always-on wake-up circuitry in various domains. Table I gives an exemplary overview of current wake-up circuitry research using selected publications from the recent past. Keyword spotting and voice activity detection (VAD) is a very actively researched target for always-on sensing; Giraldo et al. present a low-power WuC for speech detection, speaker identification, and keyword spotting with integrated preprocessing blocks for MFCC generation and an LSTM accelerator for classification [15]. Shan et al. proposed another implementation in the same application domain with state-of-the-art energy efficiency on the task of two-word keyword spotting using binarized depth-wise separable CNNs operating at near-threshold [16]. At the lower end of the power consumption spectrum, Cho et al. present a 142 nW VAD circuit with an integrated analog frontend that combines a configurable always-on time-interleaved mixer architecture with a heavily duty-cycled neural-network processor [14]. Monitoring life signals is another very active field; in the context of cardiac arrhythmia detection, Zhao et al. combine a level-crossing ADC and asynchronous QRS-complex detection circuitry with an artificial neural network accelerator to benefit from the energy advantage of non-Nyquist sampling [17]. Although these solutions achieve outstanding energy efficiency in their particular application domain, they are hardwired for the respective task.

More in line with the target of a flexible and configurable smart sensing platform are Miro-Panades et al.; they present an asynchronous RISC processor with an integrated wake-up radio receiver for efficient low-latency wake-up from several external and internal triggers. While their architecture achieves outstanding reaction time to interrupts without the need for a high-frequency clock, the wake-up circuitry lacks the interface and compute capability to perform actual data processing for input-pattern-dependent wake-up [9]. Wang et al. present a configurable WuC resembling the work in [17] that combines an LC-ADC with a set of asynchronous detector blocks to extract low-level signal properties like peak amplitude, slope, or time interval between peaks. Each detector can be configured with a threshold, and the individual detector outputs can be logically fused to a single wake-up signal. Although their architecture uses minimal power, continuous detection of more complex patterns is entirely outside the capabilities of a detector-set approach [18].

To the authors' knowledge, the only low-power WuC with slightly more sophisticated pattern matching capabilities was introduced by Rovere et al.

¹Will be made available under https://github.com/pulp-platform/hypnos
Instead of analyzing the delta-encoded signal from the LC-ADC with hardwired detectors, they continuously match the input signal against a sequence of upper and lower amplitude thresholds with up to 16 threshold segments. This scheme equates to matching the input signal's approximate amplitude slope against a configurable pre-trained prototypical signal slope of interest [19]. Their approach proved successful for pathological ECG classification and binary hand gesture recognition (finger snap or hand clapping). Still, detecting more complex patterns in the spatial or time dimension remains outside their proposed architecture's scope.

Hyperdimensional computing (HDC) is an energy-efficient and flexible computing paradigm for near-sensor classification that gracefully degrades in the presence of bit errors and noise [20, 21, 22]. Various works showcased HDC's few-shot learning properties and energy efficiency in multiple domains like biosignal processing [23, 24, 25], language recognition [26], DNA sequencing [27], or vehicle type classification [28].

In emerging hardware implementations, HDC's inherent error-resiliency is leveraged for novel non-volatile memory (NVM) based in-memory computing architectures [8, 29, 30]. Targeting FPGAs, efficient mappings of binary and bipolar HDC operations have been proposed [31, 32, 33]. However, the only complete digital CMOS-based HDC accelerator was recently introduced by Datta et al. They propose a data processing unit (DPU) based encoder design that interconnects with a ROM-based item memory and a fully parallel associative memory [34]. While their implementation indeed excels in throughput, its configurability as well as area- and energy-efficiency are limited; their encoder architecture is restricted to what they call generic multi-stage HDC algorithms with a hardwired encoder depth in feedforward configuration, imposing hard limits on the supported encoding schemes.
From an energy-efficiency and area standpoint, their design suffers significantly from using a large read-only memory (ROM) for the item memory (IM) and from pipeline registers in the very wide datapath of every encoding layer.

Our proposed architecture targets the sub-25 µW power envelope (resulting in a lifetime of about four years from a small lithium-thionyl chloride coin cell battery). The always-on smart sensing circuitry leverages the flexibility of HDC to perform energy-efficient end-to-end classification on a diverse set of input signals. We achieve higher configurability, a reduction of 3.1x in area, and up to 3.3x improvement in energy-efficiency compared to the current SoA in HDC acceleration, and present a first-in-class flexible and technology-agnostic digital CMOS architecture for near-sensor smart sensing wake-up circuitry.
TABLE I
Comparison of state-of-the-art WuCs with our proposed HDC-based WuC. Area numbers are reported in 65 nm and 22 nm technology while power consumption is reported for a compute-intensive language classification algorithm and a typical always-on classification algorithm for EMG data.

(Columns: Applications | Technology | Cross-Tech. Portability | Classification Scheme | Configurability)
Cho et al. [14]: VAD | 180 nm | Low | NN | App. Specific
Giraldo et al. [15]: Keyword Spott. | 65 nm | Medium | MFCC, LSTM | App. Specific
Shan et al. [16]: Keyword Spott. | 28 nm | High | DSCNN | App. Specific
Zhao et al. [17]: EMG | 180 nm | Low | ANN | App. Specific
Wang et al. [18]: Slope Matching | 180 nm | Low | Threshold, Slope | Limited
Miro-Panades et al. [9]: Wake-up Radio, Interrupts | 28 nm | Low | - | App. Specific
Rovere et al. [19]: General Purpose | 130 nm | Low | Threshold Sequence | Medium
This Work: General Purpose | 65 nm / 22 nm | High | HDC | High

III. PROGRAMMABLE HDC-ACCELERATOR ARCHITECTURE
A. Hyperdimensional Computing
Hyperdimensional computing (HDC), or vector symbolic architectures (VSAs) in general, is a brain-inspired compute paradigm that has recently been gaining attention [20]. Its core idea is to map low-dimensional input data, i.e., raw sensor data or features thereof, to vectors of very high dimensionality (cardinality in the order of thousands). The procedure of input to HD space mapping is commonly called hyperdimensional encoding. HDC defines simple operations on vectors to aggregate their informational content into a single vector. Binding a vector Va to another vector Vb creates a vector that is dissimilar to both inputs and thus may be used to represent the mapping Va:Vb. Bundling several input vectors yields a vector most similar to all of its inputs, therefore representing the set of its input vectors. The unary Permutation operation maps a single vector deterministically to an entirely unrelated subspace. Combining these three operations on multiple channels or a time-sequence of mapped input vectors (using a so-called item memory) captures high-level signal characteristics of the underlying data in an error-resilient and flexible manner [35]. The inverse mapping of HD vectors to the low-dimensional output space, i.e., the index of a classification result, is enabled by the Associative Lookup operation. This operation finds the most similar vector to the input within a set of stored HD vectors. There are various embodiment options for VSAs, differing in the concrete representation of the individual dimensions and the actual implementations of Binding, Bundling, and the similarity metric. In this work, we concentrate on the so-called binary spatter code (BSC), a digital-CMOS-friendly VSA that uses a single bit per dimension. BSC uses XOR for the binding and majority vote for the bundling operation, with Hamming distance as the implied similarity metric for associative lookups.
B. Overview
Figure 1 illustrates the three major components of the accelerator, which we describe in detail in the following subsections: the associative memory (AM) stores the prototype vectors and performs the associative lookup operations, the final step of most HDC algorithms. The hyperdimensional encoder (HD-Encoder) is responsible for mapping low-dimensional input values to HD-vectors. It operates on HD-vectors from the AM or its own output in an iterative manner.
Fig. 1. High-level structure of the proposed HDC accelerator. The associative memory (blue) is responsible for storage and associative lookup of prototype vectors and serves as a scratchpad memory for the HD-Encoder (red). Encoder and associative memory are orchestrated by the user-programmable algorithm storage (green).
The AM and HD-Encoder are managed by a small controller circuit that sequentially consumes a stream of compact microcode instructions and accordingly reconfigures the datapath. A tiny user-programmable configuration memory supplies this microcode stream.
C. HD-Encoder
The first step of every HDC classification algorithm is mapping a dense input space to a high-dimensional holistic representation. Most current algorithms encode the low-dimensional input data into a single high-dimensional search vector representing the whole or a subset of the input. The search vector is then compared with prototype vectors stored in the AM that represent the different classes. The differences
between the various HDC algorithms mainly lie in the particular encoding algorithms. They are crafted to capture relevant characteristics from the raw data, e.g., amplitude distribution, spatial or temporal features, and are highly target-application dependent. Thus, it is mainly the encoder's versatility that affects the affinity of an HDC accelerator for different algorithms.

Figure 2a illustrates our proposed encoder architecture. It consists of three main components connected in a combinational pipeline. The input stage of the encoder multiplexes between four different input sources: the all-zeros vector, a hardwired random seed vector, a vector addressed from the AM, or the HD-Encoder's own output. The IM materialization stage maps input data to item vectors using either quasi-orthogonal vectors (IM) or continuous item mapping (CIM). The encoder's final stage comprises the bitwise encoder units that perform binary or unary operations on the individual bits of the vectors. There are no pipeline registers in the very wide datapath between the encoder stages. Although this design choice reduces throughput, it increases the energy efficiency of our architecture.

AM: Associative Memory
EU: Encoder Unit
HDC: Hyperdimensional Computing
BSC: Binary Spatter Code
(C)IM: (Continuous) Item Memory Mapping
NISC: No Instruction Set Computing
CISC: Complex Instruction Set Computing
SCM: Standard Cell Memory

Fig. 2. (a) Architecture of the HD-Encoder responsible for (Continuous) Item Memory materialization and search vector encoding. The width of the datapath is a function of the HDC dimensionality (D) and the design parameter K (discussed in Section III-C3). (b) Table of accelerator-related acronyms.
1) Encoder Units:
The Encoder Unit processes one dimension of the input vector. Besides the combinational logic for the binary and unary bitwise operations, each unit contains an output flip-flop that stores the result after each encoding cycle. Additionally, there is one saturating bidirectional 5-bit counter per unit to perform the bundling operations. Analyses in [31] showed that for dimensions up to 10000, a 5-bit saturating counter implementation still achieves the same bundling capacity as a full-precision model. A noteworthy detail of the saturating counter is its ability to evict the current counter value to the AM in a bit-serial manner (i.e., one cycle for each of the five counter bits). Eviction and loading of the counter state allow the proposed design to execute multistage encoding algorithms with nested bundling operations.
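A behavioral model may clarify how the saturating counters and the bit-serial eviction interact. The following Python sketch assumes a 5-bit two's-complement counter range and an offset-binary serialization; the RTL's exact encoding is not specified in the text, so these are assumptions of the sketch:

```python
import numpy as np

CNT_BITS = 5
CNT_MAX = 2 ** (CNT_BITS - 1) - 1   # +15
CNT_MIN = -2 ** (CNT_BITS - 1)      # -16

def bundle_step(counters, hd_bits):
    """One bundling cycle: saturating up/down count per vector dimension."""
    step = np.where(hd_bits == 1, 1, -1)
    return np.clip(counters + step, CNT_MIN, CNT_MAX)

def binarize(counters):
    """Majority decision after bundling: positive counter -> bit set."""
    return (counters > 0).astype(np.uint8)

def evict_serial(counters):
    """Bit-serial eviction: five D-wide bit planes, one per counter bit."""
    offset = (counters - CNT_MIN).astype(np.uint8)  # offset-binary encoding
    return [(offset >> b) & 1 for b in range(CNT_BITS)]

def load_serial(planes):
    """Restore the counter state from the five evicted bit planes."""
    offset = np.zeros_like(planes[0], dtype=np.int16)
    for b, plane in enumerate(planes):
        offset |= plane.astype(np.int16) << b
    return offset + CNT_MIN
```

Evicting the counter state into AM rows and reloading it later is what allows nested bundling operations with only one physical counter per encoder unit.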
2) Mixing Stage:
The Mixer submodule visualized in Figure 2a generates quasi-orthogonal pseudo-random HD-vectors. The rematerialization, i.e., on-the-fly regeneration, of such vectors is an area-efficient alternative to the explicit storage of the large number of item vectors required for input to HD space mapping.

The mixer stage feeds the input vector selected by the encoder input stage through one of two hardwired random permutations \pi_0 and \pi_1. This enables the encoder to map a given low-dimensional binary input datum w from the input domain D to the pseudo-random HD-vector V_w by iteratively applying one of the two permutations to a hardwired random seed HD-vector S:

V_w = \prod_{k=0}^{n} \pi_i S, \quad i = \begin{cases} 0, & \text{if } w_k = 0 \\ 1, & \text{if } w_k = 1 \end{cases} \tag{1}

where w_k denotes the k-th bit position in the input word w's binary representation and n = \log_2 |D|. The resulting HD-vectors V_w are all quasi-orthogonal, given that \pi_0 and \pi_1 do not commute.

For algorithms that require random access to the item memory, the above scheme rematerializes the item vector with time complexity O(\log |D|). However, many algorithms use IM-mapping to bind a value vector V_value to a channel label V_chn[k]. In these scenarios, the channel label vectors are used with a fixed ordering, assuming the raw data is fed to the accelerator using a fixed channel ordering. We can therefore reduce the time complexity to O(1) with the mapping:

V_chn[k] = \begin{cases} S & \text{if } k = 0 \\ \pi_0 V_chn[k-1] & \text{if } k > 0 \end{cases} \tag{2}

where we store the channel label from the previous iteration in an unused row of the associative memory.
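The rematerialization scheme of equations (1) and (2) can be modeled in software as follows. In this Python sketch, the hardwired seed and the two permutations \pi_0/\pi_1 are stand-ins generated from a fixed RNG, and D = 1024 is an assumption:

```python
import numpy as np

D = 1024
rng = np.random.default_rng(42)
SEED = rng.integers(0, 2, D, dtype=np.uint8)      # hardwired seed vector S
PERM = (rng.permutation(D), rng.permutation(D))   # pi_0 and pi_1

def materialize_item(w, n_bits):
    """Rematerialize the item vector V_w per equation (1): apply pi_0 or
    pi_1 to the seed once per bit of the input word w."""
    v = SEED
    for k in range(n_bits):
        v = v[PERM[(w >> k) & 1]]   # pick the permutation from bit w_k
    return v

def next_channel_label(prev=None):
    """O(1) channel-label chain per equation (2): pi_0 applied to the
    previous label, starting from the seed."""
    return SEED if prev is None else prev[PERM[0]]
```

Because \pi_0 and \pi_1 do not commute, different input words compose different permutations of the seed, so the resulting vectors are quasi-orthogonal without any item-memory ROM.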
128 D/K D/K Similarity Manipulator X input X out = X input with / % bits fl ipped Flip Bits of Input Vector
D=
Fig. 3. Structure of the Similarity Manipulator stage.
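The bit-flipping path of the Similarity Manipulator in Fig. 3 can be sketched behaviorally as follows (Python with NumPy; D = 1024 and the concrete scrambling permutation are assumptions of the sketch, not the hardwired values):

```python
import numpy as np

D = 1024        # assumed vector dimensionality (D/K in the folded design)
UNARY = 128     # a 7-bit input word maps to a 128-bit thermometer code
rng = np.random.default_rng(7)
SPREAD_PERM = rng.permutation(D)   # stand-in for the hardwired permutation

def similarity_manipulate(v, w):
    """Flip w * D/128 bits of v: thermometer-encode w, spread it by
    repetition to D bits, scramble the ones, then XOR with the input."""
    thermometer = (np.arange(UNARY) < w).astype(np.uint8)  # exactly w ones
    spread = np.repeat(thermometer, D // UNARY)            # w * D/128 ones
    mask = spread[SPREAD_PERM]    # distribute the ones over all dimensions
    return np.bitwise_xor(v, mask)
```

The number of flipped bits is exact, since the permutation only moves the ones around, which is what makes the stage usable for controlled similarity reduction.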
Our proposed IM-mapping approach is more area-efficient than storing random vectors in a large ROM and scales well to large input domains whose cardinality is unknown in advance. From a hardware perspective, the mixer stage translates to N 4-input and N 2-input multiplexers, where N denotes the datapath width, plus some moderate wiring overhead caused by the random permutations.
3) Vector Folding:
If synthesized with default parameters, the proposed HD-Accelerator contains a datapath wide enough to process a whole HD-vector in a single cycle. However, as will be analyzed in more detail in Section IV, going for a more parallel architecture does not always yield the most energy-efficient solution for a given target technology. Thus, apart from various other modifiable design parameters, the design exposes the
Vector Fold parameter; it allows tuning the design for the optimal amount of parallelism to achieve maximum energy efficiency. Increasing the value of the vector fold splits a single D-dimensional vector into K smaller subparts of equal size. The datapath of the accelerator shrinks accordingly and only processes one subpart at a time. While the throughput of the accelerator at constant frequency decreases by K, the area of the HD-Encoder, dominated by the saturating counters, reduces similarly by a factor of K.

An important detail is that by decreasing the datapath width, we also reduce the permutations' operand size within the similarity manipulator and the mixing stage. If we stuck with the same IM-mapping scheme described above, all subparts of a vector would be identical since they all pass through the same hardwired permutations \pi_0 and \pi_1 of size D/K. The IM-mapping scheme in equation (1) is thus modified as follows:

V^*_w = \pi_{idx} V_w \tag{3}

with

\pi_{idx} = \prod_{k=0}^{\log_2(K)} \pi_i, \quad i = \begin{cases} 0, & \text{if } h_k = 0 \\ 1, & \text{if } h_k = 1 \end{cases} \tag{4}

where h is the value of the dedicated part index counter, which holds the index of the current vector subpart. This yields a unique set of permutations \pi^*_0, \pi^*_1 per vector part at the expense of O(\log(K)) additional mixing cycles.
4) Similarity Manipulator:
The Similarity Manipulator stage transforms the mixing stage's output vector by flipping a configurable number of its bits. This operation is a fundamental building block of various high-level operations like binarized B2B bundling [31], CIM mapping [21], and exponential forgetting. Figure 3 shows its internal structure: the 7-bit input word w is first mapped to a 128-bit unary representation w_unary. This unary representation is spread to the target HD-vector dimensionality D/K by repeating each bit of w_unary (D/K)/128 times. The resulting vector passes through a hard-wired random permutation to distribute the ones over all the vector dimensions. The result is XOR-ed with the input vector. A limitation of the proposed solution is that a uniform distribution of the input words does not yield an equal distribution of the probabilities for a bit to be set across the HD-vector's dimensions. A multi-cycle approach can be used for operations where equal bit-flipping probability is a hard requirement: first, a bitmask with the desired bit-density is generated by passing the all-zero vector through the manipulator stage with the input word w. This mask is subsequently mixed in the mixing stage using the same input word w to randomize the position of the ones in the bitmask. The resulting bitmask is ultimately XOR-ed with the input HD-vector within the encoder units.

D. Fully-Synthesizable Associative Memory
For a given search vector, the AM looks up the most similar vector currently stored within the memory. However, the obvious approach of combining traditional SRAMs to store the HD-vectors with digital logic yields suboptimal results. Although SRAMs are the go-to solution for fast and area-efficient volatile on-chip memory, conventional SRAM macro generators are not optimized for the extremely wide memory aspect ratios needed for parallel access to HD-vectors. They are also less energy-efficient under low V_DD conditions for low-bandwidth applications [36, 37]. The nature of hyperdimensional computing, with its many simple, component-wise operations, demands a non-von Neumann scheme of computation with computational logic intermixed with memory cells.
1) Using Latch Cells as Memory Primitives:
Figure 4 shows the structure of the AM in our design; latch cells are used as primitive memory elements instead of flip-flops due to their lower area (-10%) and energy (-20%) footprint [37]. Each row of the memory consists of D/K latch cells and a single glitch-free clock gate. These row clock gates are activated by the one-hot encoded write address. A two-port design allows fetching a new HD-vector from the AM into the HD-Encoder while simultaneously writing back the previous result without any stalls or energy-costly pipeline registers in the wide datapath.

In most HDC-based classification schemes, the AM solely keeps hold of the prototype vectors representing the individual classes. The proposed architecture differs in that regard by using rows of the AM to store the iterative encoding process's intermediate results. The AM thus serves the double purpose of a register file for entire HD-vectors (or vector subparts in case of vector fold K > 1).

Although latch cells drastically reduce the impact on area footprint compared to flip-flops, their usage can complicate static timing analysis (STA).
Due to their transparent nature during write access, one must take care not to introduce combinational loops. While Teman et al. suggest decoupling the memory by using flip-flops at the IO boundary of the memory [36], we repurposed the output register in the encoder stage to break combinational loops. This approach, coupled with multicycle path constraints for STA [36], allows treating the AM like a regular flip-flop based synchronous design during synthesis.

Fig. 4. Architecture of the latch-cell based all-digital AM. Vectors can be read and written simultaneously in subrows of length D/K. The last vector within the memory acts as the search vector for the associative lookup logic. The D/K-bit adder tree for the popcount operation is shared by all memory rows. The distance of the most similar entry is compared with a configurable threshold and conditionally raises an interrupt line to an external peripheral (e.g., the power management unit in an SoC).
2) Associative Lookup Logic:
As can be seen in Figure 4, the last HD-vector slot acts as the search vector in the proposed architecture. While we could directly use the write input into the memory as the search word, this would prevent usage of the vector folding feature since our write port would no longer have full vector width. The lookup logic iterates over each memory row, calculating the Hamming distance between one subpart of the search vector and a subpart of one of the stored HD-vectors at a time. The control logic accumulates the Hamming distance between the subparts and iteratively determines the most similar entry's index and distance.
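The iterative lookup can be modeled as two nested loops that accumulate subpart distances before the final minimum search. The sketch below is a behavioral Python model (NumPy assumed), not a cycle-accurate description of the control logic:

```python
import numpy as np

def associative_lookup(am_rows, search, K):
    """Accumulate per-subpart Hamming distances between the search vector
    and every stored row, then return (best index, best distance)."""
    D = search.size
    part = D // K
    dists = np.zeros(len(am_rows), dtype=np.int32)
    for p in range(K):                       # iterate over vector subparts
        lo, hi = p * part, (p + 1) * part
        for row, proto in enumerate(am_rows):
            dists[row] += np.count_nonzero(proto[lo:hi] != search[lo:hi])
    best = int(np.argmin(dists))
    return best, int(dists[best])
```

Restricting the inner loop to rows below a configurable maximum index models the dynamic partitioning of the AM into scratchpad and prototype memory discussed later.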
E. An ISA for HD-Computing
Previously proposed HDC accelerator designs hardwired large portions of their datapath to execute HD-algorithms of a particular structure [21]. The architecture we are proposing, on the other hand, is not bound to execute only one specific class of algorithms. A control unit continuously reconfigures the datapath according to a stream of microcode instructions fetched from a tiny embedded configuration memory. This allows the accelerator to be reconfigured at runtime to execute a much larger variety of algorithms by altering the microcode stored in the configuration memory. After configuration, the algorithm is executed autonomously without any further interaction with a host processor. We propose a 26-bit instruction set architecture (ISA) with the encoding space split into 25-bit no-instruction-set-computing (NISC) and 25-bit complex-instruction-set-computing (CISC) instructions.
1) Low-level NISC Instructions:
The NISC instructions directly encode the select signals of the multiplexers within the HD-Encoder and the address lines of the HD-memory. Figure 5 summarizes the function of the bitfields within a single 25-bit NISC instruction. They provide fine-grained control over the datapath, with the RIDX and WIDX fields acting like source and destination register operands in a conventional ISA. However, since the encoder unit contains an output flip-flop, many vector transformation operations can be performed without AM access using feedback. If we synthesize the architecture with a
Vector Fold parameter larger than 1, all instructions only process a smaller subpart of the complete HD-vector. The control unit does not transparently iterate over all subparts of the vector but leaves control to the user through the part index counter. The counter's value is automatically appended to the read- and write-port address lines of the AM and thus controls which subpart of the HD-vector is affected by the current instruction. The counter can be cleared, increased, and decreased with dedicated instructions.

The rationale behind leaving control over the subpart iteration scheme to the user is that we also want to support iteration over the vector parts in the outermost loop of an HD-algorithm instead of only iterating in the innermost loop. That is, instead of first applying a transformation on all subparts of a vector before switching to the next transformation, we want the possibility to apply all operations of an HD-encoding algorithm on the first subpart and repeat the whole algorithm for subsequent subparts. For the first iteration scheme, we would have to swap the bundling counters' state after every bundling operation since we do not have individual counters for each vector part. The second iteration scheme does not require state eviction but requires multiple iterations over the input stream.
2) CISC Instructions:
The CISC instructions encode multi-cycle HDC operations as well as instructions for code size reduction and host interaction.
a) High-level HDC Operations:
For several HDC transformations, there are dedicated high-level multi-cycle instructions. Providing CISC instructions on top of the NISC ISA keeps the number of control signals, and thus the instruction width, small. Furthermore, mapping common HDC operations like IM-mapping or associative lookup to single CISC instructions reduces the code size of a given HDC algorithm.

Fig. 5. NISC instruction format. The bitfields have the following functions:
ENCSEL: Select between the all-zero vector, a vector from the AM, and the current HD-encoder output as input for the encoder stage.
SMEN: Enable/bypass the similarity manipulator stage.
SMSEL: Select between external input data and the internal register as input for the similarity manipulator stage.
MXEN: Enable/bypass the mixing stage.
MXINV: Select the inverse permutation set in the mixing stage.
MXSEL: Select between the two permutations of the mixing stage or, if MXINV is set, between their inverses.
OP: Select the operation to be performed in the encoder units.
BNDEN: Enable the bundle counters, thus bundling the current encoder output.
BNDRST: Reset the bundle counters to their initial value.
WBEN: Enable write-back of the encoder output to the AM at index WIDX. If disabled, the HD-encoder output is only stored in the output buffer.
RIDX: Read index in case a vector from the AM is used as encoder input.
WIDX: Write index if the result of the current iteration is written back to the AM (WBEN = 1).

The
AM_SEARCH instruction starts the associative lookup procedure within the AM. The vector currently stored at the highest index is used as the search vector. As its only operand, the instruction takes an immediate that limits the search space to a maximum index: only vectors stored at an index smaller than the given maximum are considered during the lookup operation. The immediate value thus allows the AM to be partitioned dynamically into scratchpad and prototype memory.
The MIX instruction applies multiple mixing cycles to the current content of the encoder register and hence is the basis of IM-mapping. The mixing value is either an immediate, the current value of the part index counter, or an externally supplied value, e.g., digital data from a sensor.
b) Host Interaction and Code Size Reduction:
An autonomous WuC must be able to conditionally signal a target system about the result of the classification algorithm. The proposed design uses a dedicated interrupt instruction to conditionally (or unconditionally) assert an interrupt signal line. The instruction has two operands:
• Similarity threshold: the interrupt is not raised if the last associative lookup operation yielded a result with a Hamming distance higher than the given value.
• Index threshold: the interrupt signal is not raised if the index of the most similar vector found in the last associative lookup operation is higher than the given threshold.
One use case of these thresholds is to wake up the target system only if the HDC classification algorithm detects one particular class with a certainty above a specific threshold.
For the architecture to be autonomous and energy-efficient, the amount of memory required to map a given HD-algorithm to the proposed ISA must be kept small. Thus, the algorithm storage in our design supports up to 3 nested hardware loops. Each loop is initiated with a single instruction containing a 10-bit immediate for the number of iterations and a 10-bit immediate for the instruction address that marks the end of the loop body.
The combination of dedicated instructions for commonly used HDC algorithmic primitives and code-size-reducing features like hardware loops results in a high expressiveness of the ISA. All examined HDC algorithms (see Section V) can be mapped with fewer than 64 instructions.
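To illustrate the MIX-based item-memory rematerialization that these instructions build on, the following sketch models one plausible software reading of it: each mixing cycle applies one of two fixed permutations selected by a bit of the input value, so mapping a value from an input set of size N takes log2(N) cycles instead of an N-entry ROM. This is illustrative code, not the exact hardware wiring:

```python
# Sketch of value-dependent item vector rematerialization via mixing cycles
# (a plausible software model of the MIX primitive, not the exact hardware):
# each cycle applies one of two fixed random permutations, chosen by one bit
# of the input value.

import random

D = 64
rng = random.Random(0)
perm_a = rng.sample(range(D), D)             # two fixed random permutations
perm_b = rng.sample(range(D), D)
seed = [rng.randint(0, 1) for _ in range(D)] # shared seed HD-vector

def im_map(value, bits):
    """Rematerialize the item vector for `value` using `bits` mixing cycles."""
    vec = seed
    for i in range(bits):
        perm = perm_b if (value >> i) & 1 else perm_a
        vec = [vec[p] for p in perm]         # one mixing cycle
    return vec

# Equal values always rematerialize the identical vector, while distinct
# values tend to yield quasi-orthogonal vectors (Hamming distance near D/2).
v3, v5 = im_map(3, 7), im_map(5, 7)
print(sum(a ^ b for a, b in zip(v3, v5)))
```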
3) An Example Configuration for Language Recognition:
Language recognition is a commonly used example application in the field of HDC [38, 22, 26, 30, 29, 8, 11]. The task is to determine the language of a given sentence in the form of a character string. For a text corpus with 21 European languages, HDC achieves accuracies of up to 96.7% [38]. The algorithm consists of four main steps. In the preprocessing step, the test sentence is split into so-called n-grams, substrings of the test sentence obtained by applying a sliding window of size n over the character string. In the next step, the individual n-grams of the sentence are each mapped to an HD-vector according to

V_{n-gram} = π^{n−1}(V_{n−1}) ⊕ π^{n−2}(V_{n−2}) ⊕ … ⊕ π(V_1) ⊕ V_0,

with V_k denoting the HD-vector corresponding to the character at index k within the n-gram, counted backward from its most recent character (so the oldest character is permuted the most). The character vectors are obtained through IM mapping using 27 random HD-vectors (26 characters of the Latin alphabet plus one for whitespace). π^k denotes the k-fold application of a bit permutation (most commonly a binary shift operation), and ⊕ is the bind operator (XOR for BSC). The n-gram vectors V_{n-gram} of the test sentence are then bundled together into a single search vector V_{sentence}, which in the final step is compared with the prototype vectors for each language in the AM. The model of the described algorithm, i.e., the prototype vectors, is obtained by bundling together all sentence vectors V_{sentence} of the training dataset of a language. In practice, an n-gram size of 4 proved to yield the best performance in terms of accuracy [38].
Listing 1 shows the above algorithm for n = 4 in pseudocode:

i ← 0
char_vec_{i−k} ← 0…0 for k ∈ {0, 1, 2, 3}
ngram_{i−1} ← 0…0
for char in sentence do
    char_vec_i ← im_map(char)
    ngram_i ← π(ngram_{i−1}) ⊕ char_vec_i ⊕ π⁴(char_vec_{i−4})
    i ← i + 1
end for
search_vec ← bundle(ngram_3, ngram_4, …)
idx ← 0
min_distance ← ∞
class_idx ← 0
for p in prototype vectors do
    distance ← popcount(search_vec ⊕ p)
    if distance < min_distance then
        min_distance ← distance
        class_idx ← idx
    end if
    idx ← idx + 1
end for

Listing 1: Pseudocode of an HDC algorithm for language recognition.
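The pseudocode of Listing 1 translates almost directly into executable form. The sketch below is our own illustration with a toy dimensionality, a randomly generated item memory, and untrained prototype vectors; it reproduces the recursive FIFO n-gram update and the associative lookup:

```python
# Executable sketch of Listing 1 for n = 4 (toy dimensionality, random item
# memory, untrained prototypes -- for illustrating the dataflow only).

import random
from collections import deque

D, N = 256, 4
rng = random.Random(1)
item = {c: [rng.randint(0, 1) for _ in range(D)]
        for c in "abcdefghijklmnopqrstuvwxyz "}

def pi(v, k=1):
    """k-fold cyclic shift standing in for the bit permutation."""
    k %= len(v)
    return v[-k:] + v[:-k]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def encode(sentence):
    ngram = [0] * D
    fifo = deque([[0] * D] * N)          # last N character vectors
    counters = [0] * D                   # bundling counters
    n_bundled = 0
    for i, char in enumerate(sentence):
        cv = item[char]
        # recursive FIFO update: add new char, remove the one leaving the window
        ngram = xor(xor(pi(ngram), cv), pi(fifo.popleft(), N))
        fifo.append(cv)
        if i >= N - 1:                   # first complete n-gram at i = N - 1
            counters = [c + b for c, b in zip(counters, ngram)]
            n_bundled += 1
    # threshold the bundling counters back to a binary search vector
    return [1 if 2 * c > n_bundled else 0 for c in counters]

def classify(search_vec, prototypes):
    dists = [sum(a ^ b for a, b in zip(search_vec, p)) for p in prototypes]
    return dists.index(min(dists))       # index of the most similar prototype

prototypes = [[rng.randint(0, 1) for _ in range(D)] for _ in range(21)]
print(classify(encode("the quick brown fox"), prototypes))
```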
Instead of recalculating the same character vectors repeatedly when sliding over the sentence, we recursively compute the n-gram using a FIFO structure [26]. Mapping the above algorithm to the proposed ISA with an AM size of 16 vectors and a vector fold of one results in the following code:

start:
    hw.loop0 nr_characters_in_sentence, end_loop
    enc_reg → mix → enc_reg
    mem[12] → mix → bind_with_enc_reg → mem[11]
    mem[13] → mix → mem[12]
    mem[14] → mix → mem[13]
    mem[15] → mix → mem[14]
    zero_vec → man → enc_reg
    MIX_EXT 5
    enc_reg → mem[15]
    mem[11] → bind_with_enc_reg → bundle
end_loop:
    threshold_bndl_cntrs → mem[15]
    am_search nr_classes
    intr 400, 2
    jmp start

Listing 2: Microcode mapping of the language classification algorithm in pseudocode. Arrows indicate operations that happen in a combinational pipeline in the same cycle; multi-cycle instructions are specially indicated with comments denoting the number of execution cycles.

We omitted the initialization steps that would correspond to lines 1-3 of the pseudocode listing for simplicity. As can be seen in Listing 2, the body of the algorithm maps to 12 instructions (lines 1-16). The instruction on line 17 triggers an interrupt if the processed sentence belongs to one of the classes represented by prototypes 1 or 2 with a Hamming distance of less than or equal to 400 bits. The final unconditional jump causes the algorithm to start over again, either immediately if the interrupt conditions are not met or after the host processor clears the pending interrupt.

IV. IMPLEMENTATION AND RESULTS
In this section, we evaluate the proposed architecture in terms of area and power consumption. In Section IV-B, we present an overhead analysis of the proposed associative memory. Finally, in Section IV-C, we compare the area and power consumption of the whole accelerator in two different technology nodes and examine the influence of the vector fold parameter on the efficiency for a given target technology.
A. Methodology
We followed the subsequent methodology for the area and power analysis: the purely digital design, written in SystemVerilog RTL, was first synthesized with Synopsys Design Compiler 2018.6 using default settings for mapping effort. We evaluate the design's performance in two different target technologies. The first one is a 65 nm low-leakage low-k process node using a high-Vth (HVT) standard cell library to minimize cell leakage at the low operating frequencies required by the HDC accelerator. If not denoted otherwise, all numbers were obtained with the typical-case library characterization at 1.0 V, 25 °C. The second technology we targeted is a 22 nm FDSOI node using UHVT and SLVT libraries. The library characterization at 0.8 V, 25 °C without body biasing at the typical-typical corner was used. Using Cadence Innovus 2018, we performed place and route with an eight-layer metal stack for the 65 nm node, targeting a core area utilization of 80%. For the 22 nm node, a ten-layer metal stack with a target core area utilization of 70% was used. Post-layout power numbers were obtained with Cadence Voltus, using switching activity for all internal nodes extracted from a timing back-annotated post-layout simulation of the HDC algorithms in Mentor Graphics Questasim 2019.

B. Energy and Area Overhead Analysis of SCM-based AMs
Table II provides an evaluation of the area overhead and energy efficiency for the fully combinational and the row-sequential AM architectures described in Section III-D. To get an accurate estimate of the delay and power consumption at sub-nominal voltages, the complete standard cell library was recharacterized with SPICE simulations using Cadence Liberate for a VDD corner of 0.6 V. At this voltage, all standard cells within the library are still operational in SPICE simulation. 6T-bitcell based SRAMs, which are readily available in all commercial technology nodes, are no longer operational at such low voltages [37, 39]. Although there are specialized low-voltage SRAMs for sub-threshold operation [40], they are custom-tailored to a particular technology and not readily available in all technology nodes. Furthermore, experiments by Andersson et al. indicate that customized SCMs can still have an energy advantage over sub-threshold SRAMs for small memory sizes [41].
At the 0.6 V operating corner, we see a 4× improvement in energy efficiency for the sequential architecture and almost 5× for the fully parallel version compared to operation at nominal voltage. The fully parallel implementation is 2.6× more energy efficient than the sequential one. However, for most HDC algorithms, the vast majority of the proposed HDC accelerator's compute time is spent on vector encoding, during which the AM lookup logic stays idle. For this reason, we focus in the subsequent analysis on the row-sequential SCM AM architecture, which has a better trade-off between energy efficiency during lookup operation and static leakage power.

C. Tuning for Maximum Energy-Efficiency
As will be further elaborated in Section V, the high amount of parallelism in the datapath and the efficiency of the proposed ISA in executing common HD-algorithms allow the architecture to be clocked at fairly low frequencies while still achieving real-time processing capabilities for many target applications. Figure 6a shows the power breakdown of the proposed architecture, synthesized with an AM size of 16 kbit (16 × 1024 bits), while processing an EMG gesture recognition classification algorithm for different degrees of vector folding. Since higher vector fold values result in less datapath parallelism, we adjusted the frequency of each vector fold configuration to achieve identical throughput for all configurations. In other words, although the different configurations run at different frequencies, they perform the same amount
of useful work per time interval with different degrees of sequentiality.

TABLE II
AREA AND ENERGY EFFICIENCY COMPARISON OF SCM-BASED AND SRAM-BASED AM ARCHITECTURES IN 65 NM TECHNOLOGY USING ALL THREE AVAILABLE VT FLAVORS. THE MOST ENERGY-EFFICIENT SRAM CONFIGURATION GENERATED BY THE AVAILABLE SRAM MACRO GENERATOR COLLECTION FOR THE TARGET TECHNOLOGY WAS CHOSEN.

Architecture         | Area [kGE] | Throughput [MOPS/s] | Energy [pJ/lookup] @ nominal / @ 0.6 V | Leakage Power [µW]
SRAM + Digital AM    | 17         | —                   | — / —                                  | —
Sequential SCM AM    | 101.29     | 0.23                | 2353 / 556                             | 7
Full-parallel SCM AM | 265.80     | 1.54                | 921 / 188                              | 81

We see entirely opposite trends for the two technology nodes in energy efficiency versus vector fold. For 65 nm, the overall energy efficiency increases with lower vector folds, thus a higher degree of parallelism, while we see the opposite effect in GF22. The reason behind this effect becomes evident when we take a closer look at the area breakdown in Figure 6b. For a vector fold value of one, almost 60% of the accelerator area is occupied by the HD-encoder. In a technology node like GF22 with SLVT cells, the design is dominated by leakage power. Increasing the vector fold, which directly affects the encoder's datapath width, has a large effect on the overall area and thus the static current draw of the accelerator.
Although the fully synthesizable architecture's technology independence would make it easy to switch to a different technology node with lower leakage, this is not always a possibility, especially when the device is integrated into a larger system. For these situations, the vector fold feature, in addition to its function as a control knob to trade off area against maximum throughput, provides the means to tune the design for maximum energy efficiency depending on the target technology's leakage behavior.

V. APPLICATIONS AND USE CASES
As thoroughly discussed in Section III, the proposed HDC accelerator uses hardware-friendly embodiments of commonly used HDC primitives and combines them with a programmable control path. In this section, we take a closer look at the accuracy achieved by the proposed architecture when configured to execute different classification problems using state-of-the-art HDC algorithms, both to validate the soundness of the algorithmic transformations and to compare the energy efficiency with other fully digital HDC accelerators.
A. Accuracy Analysis on Text Classification and EMG Gesture Recognition
As mentioned earlier, the language classification of textual data is a prime example for classification with HDC. While this application does not fit the context of always-on smart sensing, it serves the purpose of validating the accuracy implications of the permutation-based item memory materialization described in Section III-C2. We tackle the same classification task of assigning text samples to one of 21 Indo-European languages [38]. We use the HDC algorithm described in Section III-E with an n-gram size of five, which is identical to the algorithm used by Rahimi et al. [38]. Figure 7 shows the achieved accuracy using a vector fold factor of 1 for different dimensionalities; for 8192-bit HD-vectors, the modified HDC operators achieve an accuracy of 94.52%. This is almost identical to the results reported by Datta et al. for their accelerator (95.2%) [34]. The algorithm maps to only 14 HDC ISA instructions and has a memory requirement of five vector slots in the AM, in addition to the 21 language prototype vectors, for intermediate results during the encoding process. For a vector fold of 1, the algorithm executes in 14 cycles per processed input character, which results in 1400 cycles to classify a single sentence.
The second application we evaluate is hand gesture recognition on electromyography (EMG) data recorded on the subject's forearm. We used the dataset and preprocessing pipeline from [23]. The data consists of recordings of the subject performing five different hand gestures, captured by a 64-channel EMG sensor array with a sampling rate of 1 kSPS. The actual HDC classification algorithm works as follows: for each time sample, the 64 channel values are continuously mapped to HD-vectors using the similarity manipulator module described in Section III-C4 and bound to a per-channel label vector generated in the mixer stage.
Bundling the resulting 64 channel vectors together yields a single HD-vector that represents the state of all channels at a given instant in time. Five of these vectors are combined into a 5-gram, analogous to the language classification algorithm, to form the search vector for the associative lookup against the prototype vectors. Training of the prototype vectors works like classification, but many search vectors corresponding to the same gesture are bundled together to form the prototype vector. The whole algorithm maps very well to the HDC ISA, requiring only 12 instructions and two memory slots for intermediate results. The inner loop over the 64 channels is executed in only two cycles for a folding factor of 1, which results in a total of
678 cycles to classify a single 500 ms window of data. Consequently, real-time classification of 64 EMG channels implies an accelerator clock frequency of only 1.4 kHz.

Fig. 6. (a) Post-layout-simulated power consumption of the HDC accelerator (16 vectors à 1024 bits) when executing a real-time HDC algorithm for different vector folds in 65 nm and 22 nm technology. (b) Area breakdown of the HDC accelerator (HD-encoder, CAM, and control path) for a vector fold of 1, placed and routed in UMC 65 nm.

Fig. 7. Achieved accuracies for the target applications (LANG and EMG) for different HD-vector sizes, compared with Datta et al. [34].

While the data preprocessing flow we used in our experiments was identical to [23], the HDC algorithm, although identical in general structure, differs in a few crucial aspects from the baseline implementation. Moin et al. perform CIM mapping of the individual samples to HDC vectors using scalar multiplication of the sample value with a per-channel bipolar label vector, effectively leaving the binary domain [23]. Moreover, the bundling operation to form a time sample vector is implemented as a scalar addition of the integer-valued vectors before thresholding the result back to a bipolar representation, with positive values mapped to +1 and negative values to −1. Even though the proposed algorithm modification stays strictly in the binary domain, there is only a small drop in accuracy: with 8192 dimensions, the proposed architecture achieves 96.31% accuracy, while Moin et al. report 99.44% using 10'000-bit vectors and arbitrary-precision bundling [23].
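The algorithmic difference to the baseline can be made concrete with a small sketch of our own: majority-rule bundling of binary vectors with per-bit counters coincides, up to the treatment of ties, with the bipolar add-then-threshold bundling used by Moin et al.:

```python
# Illustrative comparison (our own sketch) of binary majority-rule bundling
# with per-bit counters versus bipolar add-then-threshold bundling.
# Up to the treatment of ties, the two schemes coincide.

import random

rng = random.Random(2)
D, M = 32, 5                                   # dimension, vectors to bundle
vecs = [[rng.randint(0, 1) for _ in range(D)] for _ in range(M)]

# Binary domain: count ones per position, threshold at the majority.
counts = [sum(col) for col in zip(*vecs)]
binary_bundle = [1 if 2 * c > M else 0 for c in counts]

# Bipolar domain: map {0,1} -> {-1,+1}, add, threshold at zero.
bipolar = [[2 * b - 1 for b in v] for v in vecs]
sums = [sum(col) for col in zip(*bipolar)]
bipolar_bundle = [1 if s > 0 else 0 for s in sums]   # +1 -> 1, -1 -> 0

assert binary_bundle == bipolar_bundle          # identical for odd M
```

For an odd number of bundled vectors there are no ties, so the two thresholding rules produce exactly the same binary result.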
B. Ball Bearing Anomaly Detection
Predictive maintenance, also known as condition-based maintenance, is a term for the process of estimating the current condition of in-service equipment to anticipate component failure. The goal is to switch to a maintenance scheme where components are replaced once they approach their end of life, instead of fixed maintenance intervals based on preventive replacement according to the statistically expected lifetime [42]. As part of our algorithmic investigations, we investigate the feasibility of HDC for the task of ball bearing fault prediction using vibration data from low-power accelerometer sensors.
For our analysis, we use the IMS Bearing Dataset provided by the University of Cincinnati [43]. They recorded vibration data at a sampling rate of 20 kHz from four different ball bearings on a loaded shaft rotating at a constant 2000 rpm. We concentrated on the first of the three recording sets, which contains 1-second data records obtained at an interval of 10 minutes in a run-to-failure experiment that lasted 35 days with an accumulated operating time of about 15 days.
Fig. 8. Illustration of the proposed HDC-based ball bearing anomaly detection algorithm (normalizer, 128-level quantizer, IM mapping, and bundling of 250-sample windows, followed by a Hamming-distance threshold that triggers further analysis). V_M* denotes the online-trained calibration vector from the first 24 operating hours of the ball bearing.

Figure 8 illustrates the basic classification procedure. The algorithm requires an initial calibration phase in which a prototype vector representing the ball bearing's normal operating condition is generated. Since classification and training are of almost equivalent computational complexity in HDC, online training imposes negligible additional energy cost. The current control path of the proposed HDC accelerator allows online training algorithms to be encoded in the algorithm storage, but requires an external control entity, e.g., a general-purpose core, to provide the labels during algorithm execution.
The algorithm's basis is the encoding of small time windows of the raw vibration data into measurement vectors V_M. Each time window consists of 250 samples (12.5 ms). The sample values are first normalized using a pre-trained normalization factor and quantized to 7 bits. Each sample value is then mapped to an HD-vector using IM mapping, and the whole window of 250 samples is bundled together into a window vector V_W. Five of these window vectors, taken at an interval of 125 ms, are again bundled together to form a single measurement vector V_M. The resulting vector thus approximates the amplitude distribution over a 0.5-second time frame.
The general idea behind the proposed analysis scheme is to generate a prototype vector V_M* from the first couple of measurement vectors after commissioning. We then track the evolution of the Hamming distance of subsequent measurement vectors over time. In our experiments, we calibrated the prototype vector using 100 random measurement vectors from the first 24 operating hours of the respective ball bearing.
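The window encoding and anomaly scoring described above can be sketched as follows (illustrative Python with a random item memory standing in for the 128 quantization-level vectors; all names are our own):

```python
# Sketch of the ball bearing window encoding (illustrative, not the exact
# accelerator dataflow): normalize, quantize to 7 bits, IM-map, and bundle.

import random

rng = random.Random(3)
D, LEVELS, WIN = 128, 128, 250
item = [[rng.randint(0, 1) for _ in range(D)] for _ in range(LEVELS)]

def majority(vectors):
    """Majority-rule bundling of a list of binary HD-vectors."""
    n = len(vectors)
    return [1 if 2 * sum(col) > n else 0 for col in zip(*vectors)]

def encode_window(samples, norm):
    """Normalize, quantize to 128 levels, IM-map, and bundle one window."""
    levels = [min(LEVELS - 1, max(0, int(abs(s) / norm * (LEVELS - 1))))
              for s in samples]
    return majority([item[l] for l in levels])

def encode_measurement(windows, norm):
    """Bundle five window vectors V_W into one measurement vector V_M."""
    return majority([encode_window(w, norm) for w in windows])

# Anomaly score: Hamming distance of V_M to the calibration vector V_M*.
norm = 1.0                                     # pre-trained 99%-quantile factor
windows = [[rng.uniform(-1, 1) for _ in range(WIN)] for _ in range(5)]
v_m = encode_measurement(windows, norm)
calib = v_m                                    # stand-in for the trained V_M*
print(sum(a ^ b for a, b in zip(v_m, calib)))  # 0 for identical vectors
```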
Similarly, the normalization factor is generated using the 99% quantile of the amplitude within the same 24 hours after commissioning. The proposed algorithm can be mapped to only 9 HDC ISA instructions and requires two vector slots, one for the calibration vector and one for intermediate results.
Figure 9 shows the evolution of the Hamming distance over time, smoothed with an exponential moving average filter with a half-life of five hours; this filter can be computed very efficiently without the need for a large ring buffer. The line color indicates the labels proposed by experts after manual analysis of the dataset [44]. By the end of the IMS ball bearing experiment, bearings 3 and 4 had failed, while bearings 1 and 2 were severely worn down but had not failed yet. We see a sharp increase in Hamming distance for all four ball bearings several hours before the actual failure; in the case of ball bearing 3, even several days before the actual inner race failure.
While the proposed algorithm certainly does not replace more involved analyses of time- and frequency-domain features, the results suggest that, combined with simple thresholding, it can act as a first filtering stage for aggressive duty cycling of more power-intensive analysis schemes. However, more experiments on larger datasets, and possibly with more complex HDC encoding schemes, will be required to quantify the benefits of an HDC-based ball bearing fault predictor.

C. Energy Efficiency Analysis and Comparison
Table III summarizes the performance of the three introduced HDC algorithms: language classification (LANG), EMG gesture recognition (EMG), and ball bearing anomaly detection (BEARING). Columns 2 and 3 report the number of HDC instructions and the total amount of HD-vector memory required to map the algorithm to the architecture. Column 4 shows the minimum frequency required for real-time execution of the algorithm (not applicable for LANG, since there is no real-time constraint for this application). The last two columns indicate the power when operating at the aforementioned minimum frequency and the corresponding energy efficiency per classification. For LANG, we consider a single classification to be the processing of a 100-character string, the average sentence length in the Wortschatz corpora. For EMG and BEARING, a single classification is defined as the analysis of a 500 ms window, as described in Sections V-A and V-B.

Fig. 9. Hamming distance evolution over time for ball bearing 3 in the IMS dataset (label colors: early, normal, suspect, inner race failure). The Hamming distance was post-processed with an exponential moving average filter with a half-life of 5 hours. The other ball bearings in the dataset show a similar behavior.

In Table IV, we compare the energy efficiency of our solution with the current SoA HDC accelerator architecture from Datta et al. [34]. Among other algorithms, they report the energy numbers for EMG and LANG executed on a 32 × 2048-bit accelerator in TSMC28. We achieve a technology-scaled area reduction of 3.3×. This can be explained by massive area reductions in all major components of the accelerator. The largest contribution comes from the on-the-fly pseudo-random materialization of the item vectors in our design, which removes the need to incorporate a large ROM storing all possible item vectors. In fact, 62% of the overall area in Datta et al. is occupied by a large 1024 × 2048-bit ROM. Besides the area and energy implications, the ROM-based solution has the added drawback of a hardwired partitioning of the memory: one part for the item memory, containing quasi-orthogonal vectors, and one for continuous item memory vectors, where the pairwise Hamming distance between the vectors correlates with the difference of the corresponding input values.
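To make the distinction concrete, a continuous (similarity-preserving) item mapping can be sketched by flipping a value-proportional slice of a seed vector, so that nearby input values map to nearby HD-vectors (our own illustration, not the exact similarity-manipulator circuit):

```python
# Illustrative continuous item memory (CIM) mapping: flip a value-
# proportional prefix of a random seed vector, so the Hamming distance
# between two mapped values is proportional to their difference.

import random

D = 256
rng = random.Random(4)
seed = [rng.randint(0, 1) for _ in range(D)]

def cim_map(value, max_value):
    """Map value in [0, max_value] to an HD-vector similar to its neighbors."""
    flip = value * D // max_value            # number of bits to flip
    return [b ^ 1 if i < flip else b for i, b in enumerate(seed)]

a, b, c = cim_map(10, 100), cim_map(12, 100), cim_map(90, 100)
dist = lambda x, y: sum(p ^ q for p, q in zip(x, y))
assert dist(a, b) < dist(a, c)               # closer values, closer vectors
```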
Another large reduction in area is achieved in the AM, where our solution uses latch cells and sequentially calculates the Hamming distance, in contrast to the baseline, which uses a flip-flop based, fully parallel implementation.
In fairness, one has to note that [34], with a maximum clock frequency of 434 MHz, unarguably has a much higher peak throughput than our solution due to its parallel and heavily pipelined architecture. However, the results in Table III suggest that algorithms used for always-on sensing do not benefit from such high throughput, and that energy efficiency is the key metric by which the performance of the different approaches should be judged.

TABLE III
MEMORY REQUIREMENTS AND POST-LAYOUT ENERGY NUMBERS OF SELECTED HDC ALGORITHMS ON THE PROPOSED ARCHITECTURE WITH AN AM SIZE OF 16 × 1024 BIT AND A VECTOR FOLD OF 1.

Algorithm | Instr. | Mem. [vectors] | Cycles/classif. | Min. freq. [kHz] | Power [µW] | Energy/classif. [nJ]
EMG       | 12     | 2 + 5          | 678             | 1.4              | 10.7 / 2.9 | 703
BEARING   | 9      | 1 + 1          | 12513           | 25               | 29.1 / 7.9 | 10913

TABLE IV
AREA AND ENERGY EFFICIENCY COMPARISON WITH THE CURRENT STATE-OF-THE-ART HDC ACCELERATOR ARCHITECTURE. THE TERMS generic AND general purpose WERE INTRODUCED BY DATTA ET AL. IN [34].

Work              | Technology | Area [kGE] | Class           | Item memory               | Energy/inference [nJ] (LANG / EMG)
Datta et al. [34] | TSMC28     | 3618       | generic         | 1024-entry ROM            | 250 / 610
Our work          | GF22       | 1094       | general-purpose | rematerialized, arbitrary | 332 / 191

As we can see in Table IV, the energy efficiency differences between the two architectures depend a lot on the algorithm at hand. For LANG, the achieved energy efficiency is slightly worse (+31%) than the baseline, which is still impressive considering the 3.3× reduction in area. For EMG, on the other hand, we achieve a 3.1× improvement in energy efficiency. This can be explained by the difference in computational complexity between orthogonal and continuous item mapping in our architecture. In LANG, input values are mapped to quasi-orthogonal vectors using the mixing stage (Section III-C2), which requires log(N) cycles, where N denotes the cardinality of the input set. The overhead of this iterative approach considerably lowers the energy advantage of not using a large ROM for item memory generation. For EMG, on the other hand, the input values are mapped continuously using the similarity manipulator, which operates in a single cycle and can even be combined with a bundling or binding operation in the subsequent encoder units. Hence, for this algorithm, the benefit of not requiring a ROM becomes apparent. In general, for very high input value resolutions, the overhead of iterative item vector generation starts to dominate the overall energy consumption of our architecture. Still, the fact that the computational complexity of the rematerialization approach grows with the logarithm of the input space cardinality, instead of the linear area scaling of a ROM, suggests an advantage of our architecture for larger input spaces. In any case, the proposed architecture excels in its energy proportionality to the desired HDC algorithm. The ROM-based approach in [34] has an almost fixed cost for item memory mapping, with an upper limit on the supported resolution. For example, in LANG, only 13% (27 out of 1024 item vectors) of all ROM entries are required to map the input values. The architecture proposed by Datta et al. is only generic according to their taxonomy on HDC algorithm classes [34].
In contrast, the microcode-based approach that our architecture follows allows for arbitrary HDC algorithm computation within the limits of the available AM and instruction memory resources. Finally, our proposed architecture is energy- and area-flexible and can be finely parametrized to fit the area, throughput, and energy efficiency constraints of a particular target technology.

VI. CONCLUSION
In this work, we presented a novel all-digital, cross-technology-mappable HDC accelerator architecture with a highly configurable datapath using a newly proposed microcode ISA optimized for HDC. Placed and routed in GF 22 nm technology, the architecture improves on the current state of the art in energy efficiency and area by factors of up to 3.1× and 3.3×, respectively. The architecture achieves an energy efficiency of 192 nJ/inference for the task of EMG gesture classification with an always-on-compatible typical power consumption of 5 µW. Our post-layout simulation experiments on different digital associative memory architectures in Section IV-B indicate a significant potential for latch-based associative memories to push the limits of energy efficiency when operating at sub-nominal voltage; they can already outperform the energy efficiency of commercial off-the-shelf SRAM macros at nominal voltage. In Section V, we demonstrated that our newly introduced rematerialization schemes for IM and CIM mapping have a negligible impact on classification accuracy, with a drop of less than 0.5% compared to the ROM-based approach used by the current SoA HDC accelerator. As part of the analysis, we proposed a novel HDC-based end-to-end classification algorithm for ball bearing anomaly detection that maps to only 9 HDC microcode instructions. While our experiments in Section V-C indicated that the energy efficiency of a rematerializing IM is inferior to a ROM-based solution for low input resolutions, the proposed CIM mapping scheme outperforms the current SoA in energy efficiency, area usage, and flexibility. Finally, we provided the first open-source release of a complete HDC accelerator platform, which is possible due to the all-digital nature of the proposed architecture.

REFERENCES

[1] B. Chatterjee et al., "Context-Aware Intelligence in Resource-Constrained IoT Nodes: Opportunities and Challenges,"
IEEE Design & Test, vol. 36, no. 2, pp. 7–40, Apr. 2019.
[2] S. Bagchi et al., "New Frontiers in IoT: Networking, Systems, Reliability, and Security Challenges," IEEE Internet of Things Journal, pp. 1–1, Jul. 2020.
[3] D. Newell and M. Duffy, "Review of Power Conversion and Energy Management for Low-Power, Low-Voltage Energy Harvesting Powered Wireless Sensors," IEEE Transactions on Power Electronics, vol. 34, no. 10, pp. 9794–9805, Oct. 2019.
[4] V. Shnayder et al., "Simulating the Power Consumption of Large-scale Sensor Network Applications," in Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, ser. SenSys '04. New York, NY, USA: ACM, Nov. 2004, pp. 188–200.
[5] B. Spencer et al., "Smart Sensing Technology: Opportunities and Challenges," Structural Control and Health Monitoring, vol. 11, pp. 349–368, Oct. 2004.
[6] N. Verma et al., "In-Memory Computing: Advances and Prospects," IEEE Solid-State Circuits Magazine, vol. 11, no. 3, pp. 43–55, 2019.
[7] A. Sebastian et al., "Memory devices and applications for in-memory computing," Nature Nanotechnology, vol. 15, no. 7, pp. 529–544, Jul. 2020.
[8] G. Karunaratne et al., "In-memory hyperdimensional computing," Nature Electronics, vol. 3, no. 6, pp. 327–337, Jun. 2020.
[9] I. Miro-Panades et al., "SamurAI: A 1.7MOPS-36GOPS Adaptive Versatile IoT Node with 15,000× Peak-to-Idle Power Reduction, 207ns Wake-Up Time and 1.3TOPS/W ML Efficiency," in , Jun. 2020, pp. 1–2.
[10] D. Ma et al., "Sensing, Computing, and Communications for Energy Harvesting IoTs: A Survey," IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 1222–1250, Secondquarter 2020.
[11] L. Ge and K. K. Parhi, "Classification using Hyperdimensional Computing: A Review," IEEE Circuits and Systems Magazine, vol. 20, no. 2, pp. 30–47, 2020.
[12] A. Rahimi et al., "Hyperdimensional biosignal processing: A case study for EMG-based hand gesture recognition," in . IEEE, Oct. 2016, pp. 1–8.
[13] F. Montagna et al., "PULP-HD: Accelerating Brain-Inspired High-Dimensional Computing on a Parallel Ultra-Low Power Platform," in Proceedings of the 55th Annual Design Automation Conference (DAC '18). ACM Press, 2018, pp. 1–6.
[14] M. Cho et al., "17.2 A 142nW Voice and Acoustic Activity Detection Chip for mm-Scale Sensor Nodes Using Time-Interleaved Mixer-Based Frequency Scanning," in , Feb. 2019, pp. 278–280.
[15] J. S. P. Giraldo et al., "Vocell: A 65-nm Speech-Triggered Wake-Up SoC for 10-µW Keyword Spotting and Speaker Verification," IEEE Journal of Solid-State Circuits, vol. 55, no. 4, pp. 868–878, Apr. 2020.
[16] W. Shan et al., "14.1 A 510nW 0.41V Low-Memory Low-Computation Keyword-Spotting Chip Using Serial FFT-Based MFCC and Binarized Depthwise Separable Convolutional Neural Network in 28nm CMOS," in , Feb. 2020, pp. 230–232.
[17] Y. Zhao et al., "A 13.34µW Event-Driven Patient-Specific ANN Cardiac Arrhythmia Classifier for Wearable ECG Sensors," IEEE Transactions on Biomedical Circuits and Systems, vol. 14, no. 2, pp. 186–197, Apr. 2020.
[18] Z. Wang et al., "20.2 A 57nW Software-Defined Always-On Wake-Up Chip for IoT Devices with Asynchronous Pipelined Event-Driven Architecture and Time-Shielding Level-Crossing ADC," in , Feb. 2020, pp. 314–316.
[19] G. Rovere et al., "A 2.2-µW Cognitive Always-On Wake-Up Circuit for Event-Driven Duty-Cycling of IoT Sensor Nodes,"
IEEE Journal on Emerging and SelectedTopics in Circuits and Systems , vol. 8, no. 3, pp. 543–554, 2018.[20] P. Kanerva, “Hyperdimensional Computing: An Intro-duction to Computing in Distributed Representation withHigh-Dimensional Random Vectors,”
Cognitive Compu-tation , vol. 1, no. 2, pp. 139–159, Jun. 2009.[21] A. Rahimi et al. , “Efficient Biosignal Processing UsingHyperdimensional Computing: Network Templates forCombined Learning and Classification of ExG Signals,”
Proceedings of the IEEE , vol. 107, no. 1, pp. 123–143,Jan. 2019.[22] A. Rahimi et al. , “High-Dimensional Computing as aNanoscalable Paradigm,”
IEEE Transactions on Circuitsand Systems I: Regular Papers , vol. 64, no. 9, pp. 2508–2521, Sep. 2017.[23] A. Moin et al. , “An EMG Gesture Recognition Systemwith Flexible High-Density Sensors and Brain-InspiredHigh-Dimensional Classifier,” , pp. 1–5,Feb. 2018.[24] A. Burrello et al. , “Laelaps: An Energy-Efficient SeizureDetection Algorithm from Long-term Human iEEGRecordings without False Alarms,” in
Proceedings of the2019 Design, Automation & Test in Europe Conference& Exhibition (DATE) . IEEE, 2019, pp. 752–757.[25] E. Chang et al. , “Hyperdimensional Computing-basedMultimodality Emotion Recognition with PhysiologicalSignals,” in , Mar.2019, pp. 137–141.[26] A. Joshi et al. , “Language Geometry Using RandomIndexing,” in
Quantum Interaction , ser. Lecture Notes inComputer Science, J. A. de Barros et al. , Eds. Cham:Springer International Publishing, 2017, pp. 265–274.[27] M. Imani et al. , “HDNA: Energy-efficient DNA sequenc-ing using hyperdimensional computing,” in , Mar. 2018, pp. 271–274.[28] D. Kleyko and E. Osipov, “Brain-like classifier of tem-poral patterns,” in , Jun. et al. , “Hyperdimensional Computing Exploit-ing Carbon Nanotube FETs, Resistive RAM, and TheirMonolithic 3D Integration,” IEEE Journal of Solid-StateCircuits , vol. 53, no. 11, pp. 3183–3196, Nov. 2018.[30] H. Li et al. , “Hyperdimensional computing with 3D VR-RAM in-memory kernels: Device-architecture co-designfor energy-efficient, error-resilient language recognition,”in , Dec. 2016, pp. 16.1.1–16.1.4.[31] M. Schmuck et al. , “Hardware Optimizations of DenseBinary Hyperdimensional Computing: Rematerializationof Hypervectors, Binarized Bundling, and CombinationalAssociative Memory,”
ACM Journal on Emerging Tech-nologies in Computing Systems , vol. 15, no. 4, pp. 32:1–32:25, Oct. 2019.[32] S. Salamat et al. , “F5-HD: Fast Flexible FPGA-basedFramework for Refreshing Hyperdimensional Comput-ing,” in
Proceedings of the 2019 ACM/SIGDA Interna-tional Symposium on Field-Programmable Gate Arrays ,ser. FPGA ’19. New York, NY, USA: Association forComputing Machinery, Feb. 2019, pp. 53–62.[33] S. Salamat et al. , “Accelerating Hyperdimensional Com-puting on FPGAs by Exploiting Computational Reuse,”
IEEE Transactions on Computers , vol. 69, no. 8, pp.1159–1171, Aug. 2020.[34] S. Datta et al. , “A Programmable Hyper-DimensionalProcessor Architecture for Human-Centric IoT,”
IEEEJournal on Emerging and Selected Topics in Circuits andSystems , vol. 9, no. 3, pp. 439–452, Sep. 2019.[35] M. Imani et al. , “Exploring Hyperdimensional Associa-tive Memory,” in .Austin, TX: IEEE, Feb. 2017, pp. 445–456.[36] A. Teman et al. , “Power, Area, and Performance Op-timization of Standard Cell Memory Arrays ThroughControlled Placement,”
ACM Transactions on DesignAutomation of Electronic Systems , vol. 21, no. 4, pp. 1–25, May 2016.[37] P. Meinerzhagen et al. , “Benchmarking of Standard-CellBased Memories in the Sub-V T Domain in 65-nm CMOSTechnology,”
IEEE Journal on Emerging and SelectedTopics in Circuits and Systems , vol. 1, no. 2, pp. 173–182, Jun. 2011.[38] A. Rahimi et al. , “A Robust and Energy-Efficient Classi-fier Using Brain-Inspired Hyperdimensional Computing,”in
Proceedings of the 2016 International Symposiumon Low Power Electronics and Design , ser. ISLPED’16. San Francisco Airport, CA, USA: Association forComputing Machinery, Aug. 2016, pp. 64–69.[39] M. E. Sinangil et al. , “A reconfigurable 65nm SRAMachieving voltage scalability from 0.25–1.2V and perfor-mance scalability from 20kHz–200MHz,” in
ESSCIRC2008 - 34th European Solid-State Circuits Conference ,Sep. 2008, pp. 282–285.[40] B. Mohammadi et al. , “A 128 kb 7T SRAM Using aSingle-Cycle Boosting Mechanism in 28-nm FD–SOI,”
IEEE Transactions on Circuits and Systems I: Regular Papers , vol. 65, no. 4, pp. 1257–1268, Apr. 2018.[41] O. Andersson et al. , “Ultra Low Voltage SynthesizableMemories: A Trade-Off Discussion in 65 nm CMOS,”
IEEE Transactions on Circuits and Systems I: RegularPapers , vol. 63, no. 6, pp. 806–817, Jun. 2016.[42] S. Selcuk, “Predictive maintenance, its implementationand latest trends,”
Proceedings of the Institution ofMechanical Engineers, Part B: Journal of EngineeringManufacture , vol. 231, no. 9, pp. 1670–1679, Jul. 2017.[43] H. Qiu et al. , “Wavelet filter-based weak signature de-tection method and its application on rolling elementbearing prognostics,”
Journal of Sound and Vibration ,vol. 289, no. 4, pp. 1066–1090, Feb. 2006.[44] J. Ben Ali et al. , “Linear feature selection and classi-fication using PNN and SFAM neural networks for anearly online diagnosis of bearing naturally progressingdegradations,”