A 5 μW Standard Cell Memory-based Configurable Hyperdimensional Computing Accelerator for Always-on Smart Sensing
Manuel Eggimann,
Graduate Student Member, IEEE,
Abbas Rahimi, Luca Benini,
Fellow, IEEE
Abstract—Hyperdimensional computing (HDC) is a brain-inspired computing paradigm based on high-dimensional holistic representations of vectors. It recently gained attention for embedded smart sensing due to its inherent error-resiliency and suitability to highly parallel hardware implementations. In this work, we propose a programmable all-digital CMOS implementation of a fully autonomous HDC accelerator for always-on classification in energy-constrained sensor nodes. By using energy-efficient standard cell memory (SCM), the design is easily cross-technology mappable. It achieves extremely low power, 5 µW in typical applications, and an energy-efficiency improvement over state-of-the-art (SoA) digital architectures of up to 3x in post-layout simulations for always-on wearable tasks such as EMG gesture recognition. As part of the accelerator's architecture, we introduce novel hardware-friendly embodiments of common HDC-algorithmic primitives, which results in a 3.3x technology-scaled area reduction over the SoA, achieving the same accuracy levels in all examined targets. The proposed architecture also has a fully configurable datapath using microcode optimized for HDC stored in an integrated SCM-based configuration memory, making the design "general-purpose" in terms of HDC algorithm flexibility. This flexibility allows usage of the accelerator across novel HDC tasks, for instance, a newly designed HDC algorithm applied to the task of ball bearing fault detection.

Index Terms—Hyperdimensional Computing, Always-on, Edge Computing, Machine Learning, Hardware Accelerator, VLSI, Standard Cell Memory
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
I. INTRODUCTION

ENERGY boundedness is the key design metric and constraint in the development of internet-of-things (IoT) devices [1, 2, 3]. With more and more sensor modalities integrated into IoT end nodes, the amount of data to process and the complexity of the processing pipeline increase. Aiming for uninterrupted operation for years or even indefinitely within the tight power envelope of small batteries or environmental energy harvesting urges to drastically reduce the average power consumption of the sensor nodes themselves.

Observing that the majority of power consumption in today's wireless sensor devices is spent in data transmission [4] promotes moving data processing closer to the sensor. Instead of raw data transmission and centralized processing in the cloud, the data is processed continuously on these so-called smart sensor devices [5]. Only the analyzed portion of the information is transmitted (e.g., transmission of a single imminent machine failure message instead of the raw vibration and temperature data). This cannot be achieved by application-specific integrated circuit (ASIC) designs for deep neural networks alone because general-purpose always-on smart sensing systems operate in the µW range. Therefore, the next evolution step towards fully self-sustainable always-on smart sensors requires the exploration of new avenues of hardware-software co-design outside the realm of traditional von Neumann based computing [6, 7, 8].

An energy-proportional sensor data processing scheme, where a wake-up circuit (WuC) detects patterns of interest and aggressively duty cycles other circuitry, is a viable solution to drastically reduce average power consumption [9, 10]. While there are numerous WuCs in the µW range, e.g., for biosignal anomaly detection, sound/keyword spotting, or incoming radio transmissions, all of these solutions are highly application-specific.
Considering the cost of custom silicon development and the rapidly widening range of application targets, there is a need for configurable and application-agnostic WuCs with more flexible pattern extraction capabilities than the simple threshold-based solutions, which can suffer from high false-positive rates and thus energy losses from unnecessary wake-ups.

Hyperdimensional computing (HDC) is a brain-inspired computing paradigm that excels in the learning curve, computational complexity of the training, and simplicity of operations for hardware. This makes it a perfect fit for energy-constrained inference applications and, more specifically, for general-purpose always-on sensing [11, 12, 13]. In this work, we present the following contributions:

• We propose a novel flexible and highly energy-efficient all-digital HDC architecture for always-on smart sensing applications achieving up to 3x higher energy efficiency (191 nJ/inference) over the SoA.
• As part of the architecture, we introduce novel hardware-friendly embodiments of common HDC operators resulting in a 3.3x technology-scaled area reduction.
• We provide an evaluation of latch-based associative memories at sub-nominal supply voltage conditions in post-layout simulation, indicating the potential of at least 3.5x energy-efficiency improvement compared to an SRAM-based digital solution.
• We provide practical application case studies of our approach, including the first investigation (to the best of our knowledge) on the feasibility of HDC for the task of ball bearing fault detection.
• Finally, using an all-digital approach enables us to publicly release our architecture under the permissive Solderpad open-source license.¹

The remainder of this paper is structured as follows. In Section II we elaborate on previous work in the domain of HDC accelerators and always-on classification circuitry and highlight the distinctive novel characteristics of the proposed approach.
Section III analyzes in detail the modules of the proposed architecture. We continue with a post-layout analysis of the power and area of the design in different target technologies and different design parameter combinations in Section IV before we conduct an energy-efficiency and accuracy analysis for several always-on cognitive sensing scenarios in Section V. Finally, we conclude in Section VI.

II. RELATED WORK
Tackling the power-consumption challenge of always-on sensing in a hierarchical manner, using WuCs to apply aggressive duty cycling on more involved data processing modules, is not a new idea. In the recent past, there have been several publications on low-power always-on wake-up circuitry in various domains. Table I gives an exemplary overview of current wake-up circuitry research using selected publications from the recent past. Keyword spotting and voice activity detection (VAD) is a very actively researched target for always-on sensing; Giraldo et al. present a low-power WuC for speech detection, speaker identification, and keyword spotting with integrated preprocessing blocks for MFCC generation and an LSTM accelerator for classification [15]. Shan et al. proposed another implementation in the same application domain with state-of-the-art energy efficiency on the task of two-word keyword spotting using binarized depth-wise separable CNNs operating at near-threshold [16]. At the lower end of the power consumption spectrum, Cho et al. present a 142 nW VAD circuit with an integrated analog frontend that combines a configurable always-on time-interleaved mixer architecture with a heavily duty-cycled neural-network processor [14]. Monitoring life signals is another very active field; in the context of cardiac arrhythmia detection, Zhao et al. combine a level-crossing ADC and asynchronous QRS-complex detection circuitry with an artificial neural network accelerator to benefit from the energy advantage of non-Nyquist sampling [17]. Although these solutions achieve outstanding energy efficiency in their particular application domain, they are hardwired for the respective task.

More in line with the target of a flexible and configurable smart sensing platform are Miro-Panades et al.; they present an asynchronous RISC processor with an integrated wake-up radio receiver for efficient low-latency wake-up from several external and internal triggers. While their architecture achieves outstanding reaction time to interrupts without the need for a high-frequency clock, the wake-up circuitry lacks the interface and compute capability to perform actual data processing for input-pattern-dependent wake-up [9]. Wang et al. present a configurable WuC resembling the work in [17] that combines an LC-ADC with a set of asynchronous detector blocks to extract low-level signal properties like peak amplitude, slope, or time interval between peaks. Each detector can be configured with a threshold, and the individual detector outputs can be logically fused to a single wake-up signal. Although their architecture uses minimal power, continuous detection of more complex patterns is entirely outside the capabilities of a detector-set approach [18].

To the authors' knowledge, the only low-power WuC with slightly more sophisticated pattern matching capabilities was introduced by Rovere et al.

¹Will be made available under https://github.com/pulp-platform/hypnos
Instead of analyzing the delta-encoded signal from the LC-ADC with hardwired detectors, they continuously match the input signal against a sequence of upper and lower amplitude thresholds with up to 16 threshold segments. This scheme equates to matching the input signal's approximate amplitude slope against a configurable pre-trained prototypical signal slope of interest [19]. Their approach proved successful for pathological ECG classification and binary hand gesture recognition (finger snap or hand clapping). Still, detecting more complex patterns in the spatial or time dimension remains outside their proposed architecture's scope.

Hyperdimensional computing (HDC) is an energy-efficient and flexible computing paradigm for near-sensor classification that gracefully degrades in the presence of bit errors and noise [20, 21, 22]. Various works showcased HDC's few-shot learning properties and energy efficiency in multiple domains like biosignal processing [23, 24, 25], language recognition [26], DNA sequencing [27], or vehicle type classification [28].

In emerging hardware implementations, HDC's inherent error-resiliency is leveraged for novel non-volatile memory (NVM) based in-memory computing architectures [8, 29, 30]. Targeting FPGAs, efficient mappings of binary and bipolar HDC operations have been proposed [31, 32, 33]. However, the only complete digital CMOS-based HDC accelerator was recently introduced by Datta et al. They propose a data processing unit (DPU) based encoder design that interconnects with a ROM-based item memory and a fully parallel associative memory [34]. While their implementation indeed excels in throughput, its configurability as well as area- and energy-efficiency are limited; their encoder architecture is restricted to what they call generic multi-stage HDC algorithms with a hardwired encoder depth in feedforward configuration, imposing hard limits on the supported encoding schemes.
From an energy-efficiency and area standpoint, their design suffers significantly from using a large read-only memory (ROM) for the item memory (IM) and from pipeline registers in the very wide datapath of every encoding layer.

Our proposed architecture targets the sub-25 µW power envelope (resulting in a lifetime of about four years from a small lithium-thionyl chloride coin cell battery). The always-on smart sensing circuitry leverages the flexibility of HDC to perform energy-efficient end-to-end classification on a diverse set of input signals. We achieve higher configurability, a reduction of 3.1x in area, and up to 3.3x improvement in energy-efficiency compared to the current SoA in HDC acceleration, and present a first-in-class flexible and technology-agnostic digital CMOS architecture for near-sensor smart sensing wake-up circuitry.
TABLE I
Comparison of state-of-the-art WuCs with our proposed HDC-based WuC. Area numbers are reported in 65 nm and 22 nm technology while power consumption is reported for a compute-intensive language classification algorithm and a typical always-on classification algorithm for EMG data.

(Columns: Applications | Technology | Cross-Tech. Portability | Classification Scheme | Configurability)
Cho et al. [14]: VAD | 180 nm | Low | NN | App. Specific
Giraldo et al. [15]: Keyword Spott. | 65 nm | Medium | MFCC, LSTM | App. Specific
Shan et al. [16]: Keyword Spott. | 28 nm | High | DSCNN | App. Specific
Zhao et al. [17]: EMG | 180 nm | Low | ANN | App. Specific
Wang et al. [18]: Slope Matching | 180 nm | Low | Threshold, Slope | Limited
Miro-Panades et al. [9]: Wake-up Radio, Interrupts | 28 nm | Low | - | App. Specific
Rovere et al. [19]: General Purpose | 130 nm | Low | Threshold Sequence | Medium
This Work: General Purpose | 65 nm / 22 nm | High | HDC | High

III. PROGRAMMABLE HDC-ACCELERATOR ARCHITECTURE
A. Hyperdimensional Computing
Hyperdimensional computing (HDC), or vector symbolic architectures (VSAs) in general, is a brain-inspired compute paradigm that has recently been gaining attention [20]. Its core idea is to map low-dimensional input data, i.e., raw sensor data or features thereof, to vectors of very high dimensionality (cardinality in the order of thousands). The procedure of input to HD space mapping is commonly called hyperdimensional encoding. HDC defines simple operations on vectors to aggregate their informational content into a single vector. Binding a vector Va to another vector Vb creates a vector that is dissimilar to both inputs and thus may be used to represent the mapping Va:Vb. Bundling several input vectors yields a vector most similar to all of its inputs, therefore representing the set of its input vectors. The unary Permutation operation maps a single vector deterministically to an entirely unrelated subspace. Combining these three operations on multiple channels or a time-sequence of mapped input vectors (using a so-called item memory) captures high-level signal characteristics of the underlying data in an error-resilient and flexible manner [35]. The inverse mapping of HD vectors to the low-dimensional output space, i.e., the index of a classification result, is enabled by the Associative Lookup operation. This operation finds the most similar vector to the input within a set of stored HD vectors. There are various embodiment options for VSAs, differing in the concrete representation of the individual dimensions and the actual implementations of Binding, Bundling, and the similarity metric. In this work, we concentrate on the so-called binary spatter code (BSC), a digital-CMOS-friendly VSA that uses a single bit per dimension. BSC uses XOR for the binding and majority vote for the bundling operation, with Hamming distance as the implied similarity metric for associative lookups.
B. Overview
Figure 1 illustrates the three major components of the accelerator, which we describe in detail in the following subsections: the associative memory (AM) stores the prototype vectors and performs the associative lookup operations, the final step of most HDC algorithms. The hyperdimensional encoder (HD-Encoder) is responsible for mapping low-dimensional input values to HD-vectors. It operates on HD-vectors from the AM or its own output in an iterative manner.
Fig. 1. High-level structure of the proposed HDC accelerator. The associative memory (blue) is responsible for storage and associative lookup of prototype vectors and serves as a scratchpad memory for the HD-Encoder (red). Encoder and associative memory are orchestrated by the user-programmable algorithm storage (green).
The AM and HD-Encoder are managed by a small controller circuit that sequentially consumes a stream of compact microcode instructions and accordingly reconfigures the datapath. A tiny user-programmable configuration memory supplies this microcode stream.
C. HD-Encoder
The first step of every HDC classification algorithm is mapping a dense input space to a high-dimensional holistic representation. Most current algorithms encode the low-dimensional input data into a single high-dimensional search vector representing the whole or a subset of the input. The search vector is then compared with prototype vectors stored in the AM that represent the different classes. The differences
between the various HDC algorithms mainly lie in the particular encoding algorithms. They are crafted to capture relevant characteristics from the raw data, e.g., amplitude distribution, spatial or temporal features, and are highly target-application dependent. Thus, it is mainly the encoder's versatility that affects the affinity of an HDC accelerator for different algorithms.

Figure 2a illustrates our proposed encoder architecture. It consists of three main components connected in a combinational pipeline. The input stage of the encoder multiplexes between four different input sources: the all-zeros vector, a hardwired random seed vector, a vector addressed from the AM, or the HD-Encoder's own output. The IM materialization stage maps input data to item vectors using either quasi-orthogonal vectors (IM) or continuous item mapping (CIM). The encoder's final stage comprises the bitwise encoder units that perform binary or unary operations on the individual bits of the vectors. There are no pipeline registers in the very wide datapath between the encoder stages. Although this design choice reduces throughput, it increases the energy efficiency of our architecture.

AM: Associative Memory
EU: Encoder Unit
HDC: Hyperdimensional Computing
BSC: Binary Spatter Code
(C)IM: (Continuous) Item Memory Mapping
NISC: No Instruction Set Computing
CISC: Complex Instruction Set Computing
SCM: Standard Cell Memory

Fig. 2. (a) Architecture of the HD-Encoder responsible for (Continuous) Item Memory materialization and search vector encoding. The width of the datapath is a function of the HDC dimensionality (D) and the design parameter K (discussed in Section III-C3). (b) Table of accelerator-related acronyms.
1) Encoder Units:
The Encoder Unit processes one dimension of the input vector. Besides the combinational logic for the binary and unary bitwise operations, each unit contains an output flip-flop that stores the result after each encoding cycle. Additionally, there is one saturating bidirectional 5-bit counter per unit to perform the bundling operations. Analyses in [31] showed that for dimensions up to 10000, a 5-bit saturating counter implementation still achieves the same bundling capacity as a full-precision model. A noteworthy detail of the saturating counter is its ability to evict the current counter value to the AM in a bit-serial manner (i.e., one cycle for each of the five counter bits). Eviction and loading of the counter state allow the proposed design to execute multistage encoding algorithms with nested bundling operations.
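A behavioral model may clarify how the saturating counters and the bit-serial eviction interact. The following Python sketch assumes a 5-bit two's-complement counter range and an offset-binary serialization; the RTL's exact encoding is not specified in the text, so these are assumptions of the sketch:

```python
import numpy as np

CNT_BITS = 5
CNT_MAX = 2 ** (CNT_BITS - 1) - 1   # +15
CNT_MIN = -2 ** (CNT_BITS - 1)      # -16

def bundle_step(counters, hd_bits):
    """One bundling cycle: saturating up/down count per vector dimension."""
    step = np.where(hd_bits == 1, 1, -1)
    return np.clip(counters + step, CNT_MIN, CNT_MAX)

def binarize(counters):
    """Majority decision after bundling: positive counter -> bit set."""
    return (counters > 0).astype(np.uint8)

def evict_serial(counters):
    """Bit-serial eviction: five D-wide bit planes, one per counter bit."""
    offset = (counters - CNT_MIN).astype(np.uint8)  # offset-binary encoding
    return [(offset >> b) & 1 for b in range(CNT_BITS)]

def load_serial(planes):
    """Restore the counter state from the five evicted bit planes."""
    offset = np.zeros_like(planes[0], dtype=np.int16)
    for b, plane in enumerate(planes):
        offset |= plane.astype(np.int16) << b
    return offset + CNT_MIN
```

Evicting the counter state into AM rows and reloading it later is what allows nested bundling operations with only one physical counter per encoder unit.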
2) Mixing Stage:
The Mixer submodule visualized in Figure 2a generates quasi-orthogonal pseudo-random HD-vectors. The rematerialization, i.e., on-the-fly regeneration, of such vectors is an area-efficient alternative to the explicit storage of the large number of item vectors required for input to HD space mapping.

The mixer stage feeds the input vector selected by the encoder input stage through one of two hardwired random permutations \pi_0 and \pi_1. This enables the encoder to map a given low-dimensional binary input datum w from the input domain D to the pseudo-random HD-vector V_w by iteratively applying one of the two permutations to a hardwired random seed HD-vector S:

V_w = \prod_{k=0}^{n} \pi_i S, \quad i = \begin{cases} 0, & \text{if } w_k = 0 \\ 1, & \text{if } w_k = 1 \end{cases} \tag{1}

where w_k denotes the k-th bit position in the input word w's binary representation and n = \log_2 |D|. The resulting HD-vectors V_w are all quasi-orthogonal, given that \pi_0 and \pi_1 do not commute.

For algorithms that require random access to the item memory, the above scheme rematerializes the item vector with time complexity O(\log |D|). However, many algorithms use IM-mapping to bind a value vector V_value to a channel label V_chn[k]. In these scenarios, the channel label vectors are used with a fixed ordering, assuming the raw data is fed to the accelerator using a fixed channel ordering. We can therefore reduce the time complexity to O(1) with the mapping:

V_chn[k] = \begin{cases} S & \text{if } k = 0 \\ \pi_0 V_chn[k-1] & \text{if } k > 0 \end{cases} \tag{2}

where we store the channel label from the previous iteration in an unused row of the associative memory.
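The rematerialization scheme of equations (1) and (2) can be modeled in software as follows. In this Python sketch, the hardwired seed and the two permutations \pi_0/\pi_1 are stand-ins generated from a fixed RNG, and D = 1024 is an assumption:

```python
import numpy as np

D = 1024
rng = np.random.default_rng(42)
SEED = rng.integers(0, 2, D, dtype=np.uint8)      # hardwired seed vector S
PERM = (rng.permutation(D), rng.permutation(D))   # pi_0 and pi_1

def materialize_item(w, n_bits):
    """Rematerialize the item vector V_w per equation (1): apply pi_0 or
    pi_1 to the seed once per bit of the input word w."""
    v = SEED
    for k in range(n_bits):
        v = v[PERM[(w >> k) & 1]]   # pick the permutation from bit w_k
    return v

def next_channel_label(prev=None):
    """O(1) channel-label chain per equation (2): pi_0 applied to the
    previous label, starting from the seed."""
    return SEED if prev is None else prev[PERM[0]]
```

Because \pi_0 and \pi_1 do not commute, different input words compose different permutations of the seed, so the resulting vectors are quasi-orthogonal without any item-memory ROM.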
128 D/K D/K Similarity Manipulator X input X out = X input with / % bits fl ipped Flip Bits of Input Vector
D=
Fig. 3. Structure of the Similarity Manipulator stage.
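The bit-flipping path of the Similarity Manipulator in Fig. 3 can be sketched behaviorally as follows (Python with NumPy; D = 1024 and the concrete scrambling permutation are assumptions of the sketch, not the hardwired values):

```python
import numpy as np

D = 1024        # assumed vector dimensionality (D/K in the folded design)
UNARY = 128     # a 7-bit input word maps to a 128-bit thermometer code
rng = np.random.default_rng(7)
SPREAD_PERM = rng.permutation(D)   # stand-in for the hardwired permutation

def similarity_manipulate(v, w):
    """Flip w * D/128 bits of v: thermometer-encode w, spread it by
    repetition to D bits, scramble the ones, then XOR with the input."""
    thermometer = (np.arange(UNARY) < w).astype(np.uint8)  # exactly w ones
    spread = np.repeat(thermometer, D // UNARY)            # w * D/128 ones
    mask = spread[SPREAD_PERM]    # distribute the ones over all dimensions
    return np.bitwise_xor(v, mask)
```

The number of flipped bits is exact, since the permutation only moves the ones around, which is what makes the stage usable for controlled similarity reduction.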
Our proposed IM-mapping approach is more area-efficient than storing random vectors in a large ROM and scales well to large input domains whose cardinality is unknown in advance. From a hardware perspective, the mixer stage translates to N 4-input and N 2-input multiplexers, where N denotes the datapath width, plus some moderate wiring overhead caused by the random permutations.
3) Vector Folding:
If synthesized with default parameters, the proposed HD-Accelerator contains a datapath wide enough to process a whole HD-vector in a single cycle. However, as will be analyzed in more detail in Section IV, going for a more parallel architecture does not always yield the most energy-efficient solution for a given target technology. Thus, apart from various other modifiable design parameters, the design exposes the
Vector Fold parameter; it allows tuning the design for the optimal amount of parallelism to achieve maximum energy efficiency. Increasing the value of the vector fold splits a single D-dimensional vector into K smaller subparts of equal size. The datapath of the accelerator shrinks accordingly and only processes one subpart at a time. While the throughput of the accelerator at constant frequency decreases by K, the area of the HD-Encoder, dominated by the saturating counters, reduces similarly by a factor of K.

An important detail is that by decreasing the datapath width, we also reduce the permutations' operand size within the similarity manipulator and the mixing stage. If we stuck with the same IM-mapping scheme described above, all subparts of a vector would be identical since they all pass through the same hardwired permutations \pi_0 and \pi_1 of size D/K. The IM-mapping scheme in equation (1) is thus modified as follows:

V^*_w = \pi_{idx} V_w \tag{3}

with

\pi_{idx} = \prod_{k=0}^{\log_2(K)} \pi_i, \quad i = \begin{cases} 0, & \text{if } h_k = 0 \\ 1, & \text{if } h_k = 1 \end{cases} \tag{4}

where h is the value of the dedicated part index counter, which holds the index of the current vector subpart. This yields a unique set of permutations \pi^*_0, \pi^*_1 per vector part at the expense of O(\log(K)) additional mixing cycles.
4) Similarity Manipulator:
The Similarity Manipulator stage transforms the mixing stage's output vector by flipping a configurable number of its bits. This operation is a fundamental building block of various high-level operations like binarized B2B bundling [31], CIM mapping [21], and exponential forgetting. Figure 3 shows its internal structure: the 7-bit input word w is first mapped to a 128-bit unary representation w_unary. This unary representation is spread to the target HD-vector dimensionality D/K by repeating each bit of w_unary (D/K)/128 times. The resulting vector passes through a hard-wired random permutation to distribute the ones over all the vector dimensions. The result is XOR-ed with the input vector. A limitation of the proposed solution is that a uniform distribution of the input words does not yield an equal distribution of the probabilities for a bit to be set across the HD-vector's dimensions. A multi-cycle approach can be used for operations where equal bit-flipping probability is a hard requirement: first, a bitmask with the desired bit-density is generated by passing the all-zero vector through the manipulator stage with the input word w. This mask is subsequently mixed in the mixing stage using the same input word w to randomize the position of the ones in the bitmask. The resulting bitmask is ultimately XOR-ed with the input HD-vector within the encoder units.

D. Fully-Synthesizable Associative Memory
For a given search vector, the AM looks up the most similar vector currently stored within the memory. However, the obvious approach of combining traditional SRAMs to store the HD-vectors with digital logic yields suboptimal results. Although SRAMs are the go-to solution for fast and area-efficient volatile on-chip memory, conventional SRAM macro generators are not optimized for the extremely wide memory aspect ratios needed for parallel access to HD-vectors. They are also less energy-efficient under low V_DD conditions for low-bandwidth applications [36, 37]. The nature of hyperdimensional computing, with its many simple, component-wise operations, demands a non-von Neumann scheme of computation with computational logic intermixed with memory cells.
1) Using Latch Cells as Memory Primitives:
Figure 4 shows the structure of the AM in our design; latch cells are used as primitive memory elements instead of flip-flops due to their lower area (-10%) and energy (-20%) footprint [37]. Each row of the memory consists of D/K latch cells and a single glitch-free clock gate. These row clock gates are activated by the one-hot encoded write address. A two-port design allows fetching a new HD-vector from the AM into the HD-Encoder while simultaneously writing back the previous result without any stalls or energy-costly pipeline registers in the wide datapath.

In most HDC-based classification schemes, the AM solely keeps hold of the prototype vectors representing the individual classes. The proposed architecture differs in that regard by using rows of the AM to store the iterative encoding process's intermediate results. The AM thus serves the double purpose of a register file for entire HD-vectors (or vector subparts in case of vector fold K > 1).

Although latch cells drastically reduce the impact on area footprint compared to flip-flops, their usage can complicate static timing analysis (STA).
Due to their transparent nature during write access, one must take care not to introduce combinational loops. While Teman et al. suggest decoupling the memory by using flip-flops at the IO boundary of the memory [36], we repurposed the output register in the encoder stage to break combinational loops. This approach, coupled with multicycle path constraints for STA [36], allows treating the AM like a regular flip-flop based synchronous design during synthesis.

Fig. 4. Architecture of the latch-cell based all-digital AM. Vectors can be read and written simultaneously in subrows of length D/K. The last vector within the memory acts as the search vector for the associative lookup logic. The D/K-bit adder tree for the popcount operation is shared by all memory rows. The distance of the most similar entry is compared with a configurable threshold and conditionally raises an interrupt line to an external peripheral (e.g., the power management unit in an SoC).
2) Associative Lookup Logic:
As can be seen in Figure 4, the last HD-vector slot acts as the search vector in the proposed architecture. While we could directly use the write input into the memory as the search word, this would prevent usage of the vector folding feature since our write port would no longer have full vector width. The lookup logic iterates over each memory row, calculating the Hamming distance between one subpart of the search vector and a subpart of one of the stored HD-vectors at a time. The control logic accumulates the Hamming distance between the subparts and iteratively determines the most similar entry's index and distance.
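The iterative lookup can be modeled as two nested loops that accumulate subpart distances before the final minimum search. The sketch below is a behavioral Python model (NumPy assumed), not a cycle-accurate description of the control logic:

```python
import numpy as np

def associative_lookup(am_rows, search, K):
    """Accumulate per-subpart Hamming distances between the search vector
    and every stored row, then return (best index, best distance)."""
    D = search.size
    part = D // K
    dists = np.zeros(len(am_rows), dtype=np.int32)
    for p in range(K):                       # iterate over vector subparts
        lo, hi = p * part, (p + 1) * part
        for row, proto in enumerate(am_rows):
            dists[row] += np.count_nonzero(proto[lo:hi] != search[lo:hi])
    best = int(np.argmin(dists))
    return best, int(dists[best])
```

Restricting the inner loop to rows below a configurable maximum index models the dynamic partitioning of the AM into scratchpad and prototype memory discussed later.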
E. An ISA for HD-Computing
Previously proposed HDC accelerator designs hardwired large portions of their datapath to execute HD-algorithms of a particular structure [21]. The architecture we are proposing, on the other hand, is not bound to execute only one specific class of algorithms. A control unit continuously reconfigures the datapath according to a stream of microcode instructions fetched from a tiny embedded configuration memory. This allows the accelerator to be reconfigured at runtime to execute a much larger variety of algorithms by altering the microcode stored in the configuration memory. After configuration, the algorithm is executed autonomously without any further interaction with a host processor. We propose a 26-bit instruction set architecture (ISA) with the encoding space split into 25-bit no-instruction-set-computing (NISC) and 25-bit complex-instruction-set-computing (CISC) instructions.
1) Low-level NISC Instructions:
The NISC instructions directly encode the select signals of the multiplexers within the HD-Encoder and the address lines of the HD-memory. Figure 5 summarizes the function of the bitfields within a single 25-bit NISC instruction. They provide fine-grained control over the datapath, with the RIDX and WIDX fields acting like source and destination register operands in a conventional ISA. However, since the encoder unit contains an output flip-flop, many vector transformation operations can be performed without AM access using feedback. If we synthesize the architecture with a
Vector Fold parameter larger than 1, all instructions only process a smaller subpart of the complete HD-vector. The control unit does not transparently iterate over all subparts of the vector but leaves control to the user through the part index counter. The counter's value is automatically appended to the read- and write-port address lines of the AM and thus controls which subpart of the HD-vector is affected by the current instruction. The counter can be cleared, increased, and decreased with dedicated instructions.

The rationale behind leaving control over the subpart iteration scheme to the user is that we also want to support iteration over the vector parts in the outermost loop of an HD-algorithm instead of only iterating in the innermost loop. That is, instead of first applying a transformation on all subparts of a vector before switching to the next transformation, we want the possibility to apply all operations of an HD-encoding algorithm on the first subpart and repeat the whole algorithm for subsequent subparts. For the first iteration scheme, we would have to swap the bundling counters' state after every bundling operation since we do not have individual counters for each vector part. The second iteration scheme does not require state eviction but requires multiple iterations over the input stream.
2) CISC Instructions:
The CISC instructions encode multi-cycle HDC operations as well as instructions for code size reduction and host interaction.
a) High-level HDC Operations:
For several HDC transformations, there are dedicated high-level multi-cycle instructions. Providing CISC instructions on top of the NISC ISA keeps the number of control signals, and thus the instruction width, small. Furthermore, mapping common HDC operations like IM-mapping or associative lookup to single CISC instructions reduces the code size of a given HDC algorithm.

Fig. 5. NISC instruction format. The bitfields have the following functions:
ENCSEL: Select between the all-zero vector, a vector from the AM, and the current HD-encoder output as input for the encoder stage.
SMEN: Enable/bypass the similarity manipulator stage.
SMSEL: Select between external input data and the internal register as input for the similarity manipulator stage.
MXEN: Enable/bypass the mixing stage.
MXINV: Select the inverse permutation set in the mixing stage.
MXSEL: Select between the two permutations of the mixing stage or, if MXINV is set, between their inverses.
OP: Select the operation to be performed in the encoder units.
BNDEN: Enable the bundle counters, thus bundling the current encoder output.
BNDRST: Reset the bundle counters to their initial value.
WBEN: Enable write-back of the encoder output to the AM at index WIDX. If disabled, the HD-encoder output is only stored in the output buffer.
RIDX: Read index in case a vector from the AM is used as encoder input.
WIDX: Write index if the result of the current iteration is written back to the AM (WBEN = 1).

The
AM_SEARCH instruction starts the associative lookup procedure within the AM. The vector currently stored at the highest index is used as the search vector. As its only operand, the instruction takes an immediate that limits the search space to a maximum index: only vectors stored at an index smaller than the given maximum are considered during the lookup operation. The immediate value thus allows the AM to be partitioned dynamically into scratchpad and prototype memory.
The MIX instruction applies multiple mixing cycles to the current content of the encoder register and hence is the basis of IM-mapping. The mixing value is either an immediate, the current value of the part index counter, or an externally supplied value, e.g., digital data from a sensor.
b) Host Interaction and Code Size Reduction:
An autonomous WuC must be able to conditionally signal a target system about the result of the classification algorithm. The proposed design uses a dedicated interrupt instruction to conditionally (or unconditionally) assert an interrupt signal line. The instruction has two operands:
• Similarity threshold: the interrupt is not raised if the last associative lookup operation yielded a result with a Hamming distance higher than the given value.
• Index threshold: the interrupt signal is not raised if the index of the most similar vector found in the last associative lookup operation is higher than the given threshold.
One use case of these thresholds is to wake up the target system only if the HDC classification algorithm detects one particular class with a certainty above a specific threshold.
For the architecture to be autonomous and energy-efficient, the amount of memory required to map a given HD-algorithm to the proposed ISA must be kept small. Thus, the algorithm storage in our design supports up to 3 nested hardware loops. Each loop is initiated with a single instruction containing a 10-bit immediate for the number of iterations and a 10-bit immediate for the instruction address that marks the end of the loop body.
The combination of dedicated instructions for commonly used HDC algorithmic primitives and code-size-reducing features like hardware loops results in a high expressiveness of the ISA. All examined HDC algorithms (see Section V) can be mapped with fewer than 64 instructions.
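To illustrate the MIX-based item-memory rematerialization that these instructions build on, the following sketch models one plausible software reading of it: each mixing cycle applies one of two fixed permutations selected by a bit of the input value, so mapping a value from an input set of size N takes log2(N) cycles instead of an N-entry ROM. This is illustrative code, not the exact hardware wiring:

```python
# Sketch of value-dependent item vector rematerialization via mixing cycles
# (a plausible software model of the MIX primitive, not the exact hardware):
# each cycle applies one of two fixed random permutations, chosen by one bit
# of the input value.

import random

D = 64
rng = random.Random(0)
perm_a = rng.sample(range(D), D)             # two fixed random permutations
perm_b = rng.sample(range(D), D)
seed = [rng.randint(0, 1) for _ in range(D)] # shared seed HD-vector

def im_map(value, bits):
    """Rematerialize the item vector for `value` using `bits` mixing cycles."""
    vec = seed
    for i in range(bits):
        perm = perm_b if (value >> i) & 1 else perm_a
        vec = [vec[p] for p in perm]         # one mixing cycle
    return vec

# Equal values always rematerialize the identical vector, while distinct
# values tend to yield quasi-orthogonal vectors (Hamming distance near D/2).
v3, v5 = im_map(3, 7), im_map(5, 7)
print(sum(a ^ b for a, b in zip(v3, v5)))
```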
3) An Example Configuration for Language Recognition:
Language recognition is a commonly used example application in the field of HDC [38, 22, 26, 30, 29, 8, 11]. The task is to determine the language of a given sentence in the form of a character string. For a text corpus with 21 European languages, HDC achieves accuracies of up to 96.7% [38]. The algorithm consists of four main steps. In the preprocessing step, the test sentence is split into so-called n-grams, substrings of the test sentence obtained by applying a sliding window of size n over the character string. In the next step, the individual n-grams of the sentence are each mapped to an HD-vector according to

V_{n-gram} = π^{n−1}(V_{n−1}) ⊕ π^{n−2}(V_{n−2}) ⊕ … ⊕ π(V_1) ⊕ V_0,

with V_k denoting the HD-vector corresponding to the character at index k within the n-gram, counted backward from its most recent character (so the oldest character is permuted the most). The character vectors are obtained through IM mapping using 27 random HD-vectors (26 characters of the Latin alphabet plus one for whitespace). π^k denotes the k-fold application of a bit permutation (most commonly a binary shift operation), and ⊕ is the bind operator (XOR for BSC). The n-gram vectors V_{n-gram} of the test sentence are then bundled together into a single search vector V_{sentence}, which in the final step is compared with the prototype vectors for each language in the AM. The model of the described algorithm, i.e., the prototype vectors, is obtained by bundling together all sentence vectors V_{sentence} of the training dataset of a language. In practice, an n-gram size of 4 proved to yield the best performance in terms of accuracy [38].
Listing 1 shows the above algorithm for n = 4 in pseudocode:

i ← 0
char_vec_{i−k} ← 0…0 for k ∈ {0, 1, 2, 3}
ngram_{i−1} ← 0…0
for char in sentence do
    char_vec_i ← im_map(char)
    ngram_i ← π(ngram_{i−1}) ⊕ char_vec_i ⊕ π⁴(char_vec_{i−4})
    i ← i + 1
end for
search_vec ← bundle(ngram_3, ngram_4, …)
idx ← 0
min_distance ← ∞
class_idx ← 0
for p in prototype vectors do
    distance ← popcount(search_vec ⊕ p)
    if distance < min_distance then
        min_distance ← distance
        class_idx ← idx
    end if
    idx ← idx + 1
end for

Listing 1: Pseudocode of an HDC algorithm for language recognition.
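The pseudocode of Listing 1 translates almost directly into executable form. The sketch below is our own illustration with a toy dimensionality, a randomly generated item memory, and untrained prototype vectors; it reproduces the recursive FIFO n-gram update and the associative lookup:

```python
# Executable sketch of Listing 1 for n = 4 (toy dimensionality, random item
# memory, untrained prototypes -- for illustrating the dataflow only).

import random
from collections import deque

D, N = 256, 4
rng = random.Random(1)
item = {c: [rng.randint(0, 1) for _ in range(D)]
        for c in "abcdefghijklmnopqrstuvwxyz "}

def pi(v, k=1):
    """k-fold cyclic shift standing in for the bit permutation."""
    k %= len(v)
    return v[-k:] + v[:-k]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def encode(sentence):
    ngram = [0] * D
    fifo = deque([[0] * D] * N)          # last N character vectors
    counters = [0] * D                   # bundling counters
    n_bundled = 0
    for i, char in enumerate(sentence):
        cv = item[char]
        # recursive FIFO update: add new char, remove the one leaving the window
        ngram = xor(xor(pi(ngram), cv), pi(fifo.popleft(), N))
        fifo.append(cv)
        if i >= N - 1:                   # first complete n-gram at i = N - 1
            counters = [c + b for c, b in zip(counters, ngram)]
            n_bundled += 1
    # threshold the bundling counters back to a binary search vector
    return [1 if 2 * c > n_bundled else 0 for c in counters]

def classify(search_vec, prototypes):
    dists = [sum(a ^ b for a, b in zip(search_vec, p)) for p in prototypes]
    return dists.index(min(dists))       # index of the most similar prototype

prototypes = [[rng.randint(0, 1) for _ in range(D)] for _ in range(21)]
print(classify(encode("the quick brown fox"), prototypes))
```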
Instead of recalculating the same character vectors repeatedly when sliding over the sentence, we recursively compute the n-gram using a FIFO structure [26]. Mapping the above algorithm to the proposed ISA with an AM size of 16 vectors and a vector fold of one results in the following code:

start:
    hw.loop0 nr_characters_in_sentence, end_loop
    enc_reg → mix → enc_reg
    mem[12] → mix → bind_with_enc_reg → mem[11]
    mem[13] → mix → mem[12]
    mem[14] → mix → mem[13]
    mem[15] → mix → mem[14]
    zero_vec → man → enc_reg
    MIX_EXT 5
    enc_reg → mem[15]
    mem[11] → bind_with_enc_reg → bundle
end_loop:
    threshold_bndl_cntrs → mem[15]
    am_search nr_classes
    intr 400, 2
    jmp start

Listing 2: Microcode mapping of the language classification algorithm in pseudocode. Arrows indicate operations that happen in a combinational pipeline in the same cycle; multi-cycle instructions are specially indicated with comments denoting the number of execution cycles.

We omitted the initialization steps that would correspond to lines 1-3 of the pseudocode listing for simplicity. As can be seen in Listing 2, the body of the algorithm maps to 12 instructions (lines 1-16). The instruction on line 17 triggers an interrupt if the processed sentence belongs to one of the classes represented by prototypes 1 or 2 with a Hamming distance of less than or equal to 400 bits. The final unconditional jump causes the algorithm to start over again, either immediately if the interrupt conditions are not met or after the host processor clears the pending interrupt.

IV. IMPLEMENTATION AND RESULTS
In this section, we evaluate the proposed architecture in terms of area and power consumption. In Section IV-B, we present an overhead analysis of the proposed associative memory. Finally, in Section IV-C, we compare the area and power consumption of the whole accelerator in two different technology nodes and examine the influence of the vector fold parameter on the efficiency for a given target technology.
A. Methodology
We followed the subsequent methodology for the area and power analysis: the purely digital design, written in SystemVerilog RTL, was first synthesized with Synopsys Design Compiler 2018.6 using default settings for mapping effort. We evaluate the design's performance in two different target technologies. The first one is a 65 nm low-leakage low-k process node using a high-Vth (HVT) standard cell library to minimize cell leakage at the low operating frequencies required by the HDC accelerator. If not denoted otherwise, all numbers were obtained with the typical-case library characterization at 1.0 V, 25 °C. The second technology we targeted is a 22 nm FDSOI node using UHVT and SLVT libraries. The library characterization at 0.8 V, 25 °C without body biasing at the typical-typical corner was used. Using Cadence Innovus 2018, we performed place and route with an eight-layer metal stack for the 65 nm node, targeting a core area utilization of 80%. For the 22 nm node, a ten-layer metal stack with a target core area utilization of 70% was used. Post-layout power numbers were obtained with Cadence Voltus, using switching activity for all internal nodes extracted from a timing back-annotated post-layout simulation of the HDC algorithms in Mentor Graphics Questasim 2019.

B. Energy and Area Overhead Analysis of SCM-based AMs
Table II provides an evaluation of the area overhead and energy efficiency for the fully combinational and the row-sequential AM architectures described in Section III-D. To get an accurate estimate of the delay and power consumption at sub-nominal voltages, the complete standard cell library was recharacterized with SPICE simulations using Cadence Liberate for a VDD corner of 0.6 V. At this voltage, all standard cells within the library are still operational in SPICE simulation. 6T-bitcell based SRAMs, which are readily available in all commercial technology nodes, are no longer operational at such low voltages [37, 39]. Although there are specialized low-voltage SRAMs for sub-threshold operation [40], they are custom-tailored to a particular technology and not readily available in all technology nodes. Furthermore, experiments by Andersson et al. indicate that customized SCMs can still have an energy advantage over sub-threshold SRAMs for small memory sizes [41].
At the 0.6 V operating corner, we see a 4× improvement in energy efficiency for the sequential architecture and almost 5× for the fully parallel version compared to operation at nominal voltage. The fully parallel implementation is 2.6× more energy efficient than the sequential one. However, for most HDC algorithms, the vast majority of the proposed HDC accelerator's compute time is spent on vector encoding, during which the AM lookup logic stays idle. For this reason, we focus in the subsequent analysis on the row-sequential SCM AM architecture, which has a better trade-off between energy efficiency during lookup operation and static leakage power.

C. Tuning for Maximum Energy-Efficiency
As will be further elaborated in Section V, the high amount of parallelism in the datapath and the efficiency of the proposed ISA in executing common HD-algorithms allow the architecture to be clocked at fairly low frequencies while still achieving real-time processing capabilities for many target applications. Figure 6a shows the power breakdown of the proposed architecture, synthesized with an AM size of 16 kbit (16 × 1024 bits), while processing an EMG gesture recognition classification algorithm for different degrees of vector folding. Since higher vector fold values result in less datapath parallelism, we adjusted the frequency of each vector fold configuration to achieve identical throughput for all configurations. In other words, although the different configurations run at different frequencies, they perform the same amount
of useful work per time interval with different degrees of sequentiality.

TABLE II
AREA AND ENERGY EFFICIENCY COMPARISON OF SCM-BASED AND SRAM-BASED AM ARCHITECTURES IN 65 NM TECHNOLOGY USING ALL THREE AVAILABLE VT FLAVORS. THE MOST ENERGY-EFFICIENT SRAM CONFIGURATION GENERATED BY THE AVAILABLE SRAM MACRO GENERATOR COLLECTION FOR THE TARGET TECHNOLOGY WAS CHOSEN.

Architecture         | Area [kGE] | Throughput [MOPS/s] | Energy [pJ/lookup] @ nominal / @ 0.6 V | Leakage Power [µW]
SRAM + Digital AM    | 17         | —                   | — / —                                  | —
Sequential SCM AM    | 101.29     | 0.23                | 2353 / 556                             | 7
Full-parallel SCM AM | 265.80     | 1.54                | 921 / 188                              | 81

We see entirely opposite trends for the two technology nodes in energy efficiency versus vector fold. For 65 nm, the overall energy efficiency increases with lower vector folds, thus a higher degree of parallelism, while we see the opposite effect in GF22. The reason behind this effect becomes evident when we take a closer look at the area breakdown in Figure 6b. For a vector fold value of one, almost 60% of the accelerator area is occupied by the HD-encoder. In a technology node like GF22 with SLVT cells, the design is dominated by leakage power. Increasing the vector fold, which directly affects the encoder's datapath width, has a large effect on the overall area and thus the static current draw of the accelerator.
Although the fully synthesizable architecture's technology independence would make it easy to switch to a different technology node with lower leakage, this is not always a possibility, especially when the device is integrated into a larger system. For these situations, the vector fold feature, in addition to its function as a control knob to trade off area against maximum throughput, provides the means to tune the design for maximum energy efficiency depending on the target technology's leakage behavior.

V. APPLICATIONS AND USE CASES
As thoroughly discussed in Section III, the proposed HDC accelerator uses hardware-friendly embodiments of commonly used HDC primitives and combines them with a programmable control path. In this section, we take a closer look at the accuracy achieved by the proposed architecture when configured to execute different classification problems using state-of-the-art HDC algorithms, both to validate the soundness of the algorithmic transformations and to compare the energy efficiency with other fully digital HDC accelerators.
A. Accuracy Analysis on Text Classification and EMG Gesture Recognition
As mentioned earlier, the language classification of textual data is a prime example for classification with HDC. While this application does not fit the context of always-on smart sensing, it serves the purpose of validating the accuracy implications of the permutation-based item memory materialization described in Section III-C2. We tackle the same classification task of assigning text samples to one of 21 Indo-European languages [38]. We use the HDC algorithm described in Section III-E with an n-gram size of five, which is identical to the algorithm used by Rahimi et al. [38]. Figure 7 shows the achieved accuracy using a vector fold factor of 1 for different dimensionalities; for 8192-bit HD-vectors, the modified HDC operators achieve an accuracy of 94.52%. This is almost identical to the results reported by Datta et al. for their accelerator (95.2%) [34]. The algorithm maps to only 14 HDC ISA instructions and has a memory requirement of five vector slots in the AM, in addition to the 21 language prototype vectors, for intermediate results during the encoding process. For a vector fold of 1, the algorithm executes in 14 cycles per processed input character, which results in 1400 cycles to classify a single sentence.
The second application we evaluate is hand gesture recognition on electromyography (EMG) data recorded on the subject's forearm. We used the dataset and preprocessing pipeline from [23]. The data consists of recordings of the subject performing five different hand gestures, captured by a 64-channel EMG sensor array with a sampling rate of 1 kSPS. The actual HDC classification algorithm works as follows: for each time sample, the 64 channel values are continuously mapped to HD-vectors using the similarity manipulator module described in Section III-C4 and bound to a per-channel label vector generated in the mixer stage.
Bundling the resulting 64 channel vectors together yields a single HD-vector that represents the state of all channels at a given instant in time. Five of these vectors are combined into a 5-gram, analogous to the language classification algorithm, to form the search vector for the associative lookup against the prototype vectors. Training of the prototype vectors works like classification, but many search vectors corresponding to the same gesture are bundled together to form the prototype vector. The whole algorithm maps very well to the HDC ISA, requiring only 12 instructions and two memory slots for intermediate results. The inner loop over the 64 channels is executed in only two cycles for a folding factor of 1, which results in a total of
678 cycles to classify a single 500 ms window of data. Consequently, real-time classification of 64 EMG channels implies an accelerator clock frequency of only 1.4 kHz.

Fig. 6. (a) Post-layout-simulated power consumption of the HDC accelerator (16 vectors à 1024 bits) when executing a real-time HDC algorithm for different vector folds in 65 nm and 22 nm technology. (b) Area breakdown of the HDC accelerator (HD-encoder, CAM, and control path) for a vector fold of 1, placed and routed in UMC 65 nm.

Fig. 7. Achieved accuracies for the target applications (LANG and EMG) for different HD-vector sizes, compared with Datta et al. [34].

While the data preprocessing flow we used in our experiments was identical to [23], the HDC algorithm, although identical in general structure, differs in a few crucial aspects from the baseline implementation. Moin et al. perform CIM mapping of the individual samples to HDC vectors using scalar multiplication of the sample value with a per-channel bipolar label vector, effectively leaving the binary domain [23]. Moreover, the bundling operation to form a time sample vector is implemented as a scalar addition of the integer-valued vectors before thresholding the result back to a bipolar representation, with positive values mapped to +1 and negative values to −1. Even though the proposed algorithm modification stays strictly in the binary domain, there is only a small drop in accuracy: with 8192 dimensions, the proposed architecture achieves 96.31% accuracy, while Moin et al. report 99.44% using 10'000-bit vectors and arbitrary-precision bundling [23].
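The algorithmic difference to the baseline can be made concrete with a small sketch of our own: majority-rule bundling of binary vectors with per-bit counters coincides, up to the treatment of ties, with the bipolar add-then-threshold bundling used by Moin et al.:

```python
# Illustrative comparison (our own sketch) of binary majority-rule bundling
# with per-bit counters versus bipolar add-then-threshold bundling.
# Up to the treatment of ties, the two schemes coincide.

import random

rng = random.Random(2)
D, M = 32, 5                                   # dimension, vectors to bundle
vecs = [[rng.randint(0, 1) for _ in range(D)] for _ in range(M)]

# Binary domain: count ones per position, threshold at the majority.
counts = [sum(col) for col in zip(*vecs)]
binary_bundle = [1 if 2 * c > M else 0 for c in counts]

# Bipolar domain: map {0,1} -> {-1,+1}, add, threshold at zero.
bipolar = [[2 * b - 1 for b in v] for v in vecs]
sums = [sum(col) for col in zip(*bipolar)]
bipolar_bundle = [1 if s > 0 else 0 for s in sums]   # +1 -> 1, -1 -> 0

assert binary_bundle == bipolar_bundle          # identical for odd M
```

For an odd number of bundled vectors there are no ties, so the two thresholding rules produce exactly the same binary result.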
B. Ball Bearing Anomaly Detection
Predictive maintenance, also known as condition-based maintenance, is a term for the process of estimating the current condition of in-service equipment to anticipate component failure. The goal is to switch to a maintenance scheme where components are replaced once they approach their end of life, instead of fixed maintenance intervals based on preventive replacement according to the statistically expected lifetime [42]. As part of our algorithmic investigations, we investigate the feasibility of HDC for the task of ball bearing fault prediction using vibration data from low-power accelerometer sensors.
For our analysis, we use the IMS Bearing Dataset provided by the University of Cincinnati [43]. They recorded vibration data at a sampling rate of 20 kHz from four different ball bearings on a loaded shaft rotating at a constant 2000 rpm. We concentrated on the first of the three recording sets, which contains 1-second data records obtained at an interval of 10 minutes in a run-to-failure experiment that lasted 35 days with an accumulated operating time of about 15 days.
Fig. 8. Illustration of the proposed HDC-based ball bearing anomaly detection algorithm (normalizer, 128-level quantizer, IM mapping, and bundling of 250-sample windows, followed by a Hamming-distance threshold that triggers further analysis). V_M* denotes the online-trained calibration vector from the first 24 operating hours of the ball bearing.

Figure 8 illustrates the basic classification procedure. The algorithm requires an initial calibration phase in which a prototype vector representing the ball bearing's normal operating condition is generated. Since classification and training are of almost equivalent computational complexity in HDC, online training imposes negligible additional energy cost. The current control path of the proposed HDC accelerator allows online training algorithms to be encoded in the algorithm storage, but requires an external control entity, e.g., a general-purpose core, to provide the labels during algorithm execution.
The algorithm's basis is the encoding of small time windows of the raw vibration data into measurement vectors V_M. Each time window consists of 250 samples (12.5 ms). The sample values are first normalized using a pre-trained normalization factor and quantized to 7 bits. Each sample value is then mapped to an HD-vector using IM mapping, and the whole window of 250 samples is bundled together into a window vector V_W. Five of these window vectors, taken at an interval of 125 ms, are again bundled together to form a single measurement vector V_M. The resulting vector thus approximates the amplitude distribution over a 0.5-second time frame.
The general idea behind the proposed analysis scheme is to generate a prototype vector V_M* from the first couple of measurement vectors after commissioning. We then track the evolution of the Hamming distance of subsequent measurement vectors over time. In our experiments, we calibrated the prototype vector using 100 random measurement vectors from the first 24 operating hours of the respective ball bearing.
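The window encoding and anomaly scoring described above can be sketched as follows (illustrative Python with a random item memory standing in for the 128 quantization-level vectors; all names are our own):

```python
# Sketch of the ball bearing window encoding (illustrative, not the exact
# accelerator dataflow): normalize, quantize to 7 bits, IM-map, and bundle.

import random

rng = random.Random(3)
D, LEVELS, WIN = 128, 128, 250
item = [[rng.randint(0, 1) for _ in range(D)] for _ in range(LEVELS)]

def majority(vectors):
    """Majority-rule bundling of a list of binary HD-vectors."""
    n = len(vectors)
    return [1 if 2 * sum(col) > n else 0 for col in zip(*vectors)]

def encode_window(samples, norm):
    """Normalize, quantize to 128 levels, IM-map, and bundle one window."""
    levels = [min(LEVELS - 1, max(0, int(abs(s) / norm * (LEVELS - 1))))
              for s in samples]
    return majority([item[l] for l in levels])

def encode_measurement(windows, norm):
    """Bundle five window vectors V_W into one measurement vector V_M."""
    return majority([encode_window(w, norm) for w in windows])

# Anomaly score: Hamming distance of V_M to the calibration vector V_M*.
norm = 1.0                                     # pre-trained 99%-quantile factor
windows = [[rng.uniform(-1, 1) for _ in range(WIN)] for _ in range(5)]
v_m = encode_measurement(windows, norm)
calib = v_m                                    # stand-in for the trained V_M*
print(sum(a ^ b for a, b in zip(v_m, calib)))  # 0 for identical vectors
```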
Similarly, the normalization factor is generated using the 99% quantile of the amplitude within the same 24 hours after commissioning. The proposed algorithm can be mapped to only 9 HDC ISA instructions and requires two vector slots, one for the calibration vector and one for intermediate results.
Figure 9 shows the evolution of the Hamming distance over time, smoothed with an exponential moving average filter with a half-life of five hours; this filter can be computed very efficiently without the need for a large ring buffer. The line color indicates the labels proposed by experts after manual analysis of the dataset [44]. By the end of the IMS ball bearing experiment, bearings 3 and 4 had failed, while bearings 1 and 2 were severely worn down but had not failed yet. We see a sharp increase in Hamming distance for all four ball bearings several hours before the actual failure; in the case of ball bearing 3, even several days before the actual inner race failure.
While the proposed algorithm certainly does not replace more involved analyses of time- and frequency-domain features, the results suggest that, combined with simple thresholding, it can act as a first filtering stage for aggressive duty cycling of more power-intensive analysis schemes. However, more experiments on larger datasets, and possibly with more complex HDC encoding schemes, will be required to quantify the benefits of an HDC-based ball bearing fault predictor.

C. Energy Efficiency Analysis and Comparison
Table III summarizes the performance of the three introduced HDC algorithms: language classification (LANG), EMG gesture recognition (EMG), and ball bearing anomaly detection (BEARING). Columns 2 and 3 report the number of HDC instructions and the total amount of HD-vector memory required to map the algorithm to the architecture. Column 4 shows the minimum frequency required for real-time execution of the algorithm (not applicable for LANG, since there is no real-time constraint for this application). The last two columns indicate the power when operating at the aforementioned minimum frequency and the corresponding energy efficiency per classification. For LANG, we consider a single classification to be the processing of a 100-character string, the average sentence length in the Wortschatz corpora. For EMG and BEARING, a single classification is defined as the analysis of a 500 ms window, as described in Sections V-A and V-B.

Fig. 9. Hamming distance evolution over time for ball bearing 3 in the IMS dataset (label colors: early, normal, suspect, inner race failure). The Hamming distance was post-processed with an exponential moving average filter with a half-life of 5 hours. The other ball bearings in the dataset show a similar behavior.

In Table IV, we compare the energy efficiency of our solution with the current SoA HDC accelerator architecture from Datta et al. [34]. Among other algorithms, they report the energy numbers for EMG and LANG executed on a 32 × 2048-bit accelerator in TSMC28. We achieve a technology-scaled area reduction of 3.3×. This can be explained by massive area reductions in all major components of the accelerator. The largest contribution comes from the on-the-fly pseudo-random materialization of the item vectors in our design, which removes the need to incorporate a large ROM storing all possible item vectors. In fact, 62% of the overall area in Datta et al. is occupied by a large 1024 × 2048-bit ROM. Besides the area and energy implications, the ROM-based solution has the added drawback of a hardwired partitioning of the memory: one part for the item memory, containing quasi-orthogonal vectors, and one for continuous item memory vectors, where the pairwise Hamming distance between the vectors correlates with the difference of the corresponding input values.
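To make the distinction concrete, a continuous (similarity-preserving) item mapping can be sketched by flipping a value-proportional slice of a seed vector, so that nearby input values map to nearby HD-vectors (our own illustration, not the exact similarity-manipulator circuit):

```python
# Illustrative continuous item memory (CIM) mapping: flip a value-
# proportional prefix of a random seed vector, so the Hamming distance
# between two mapped values is proportional to their difference.

import random

D = 256
rng = random.Random(4)
seed = [rng.randint(0, 1) for _ in range(D)]

def cim_map(value, max_value):
    """Map value in [0, max_value] to an HD-vector similar to its neighbors."""
    flip = value * D // max_value            # number of bits to flip
    return [b ^ 1 if i < flip else b for i, b in enumerate(seed)]

a, b, c = cim_map(10, 100), cim_map(12, 100), cim_map(90, 100)
dist = lambda x, y: sum(p ^ q for p, q in zip(x, y))
assert dist(a, b) < dist(a, c)               # closer values, closer vectors
```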
Another large reduction in area is achieved in the AM, where our solution uses latch cells and sequentially calculates the Hamming distance, in contrast to the baseline, which uses a flip-flop based, fully parallel implementation.
In fairness, one has to note that [34], with a maximum clock frequency of 434 MHz, unarguably has a much higher peak throughput than our solution due to its parallel and heavily pipelined architecture. However, the results in Table III suggest that algorithms used for always-on sensing do not benefit from such high throughput, and that energy efficiency is the key metric by which the performance of the different approaches should be judged.

TABLE III
MEMORY REQUIREMENTS AND POST-LAYOUT ENERGY NUMBERS OF SELECTED HDC ALGORITHMS ON THE PROPOSED ARCHITECTURE WITH AN AM SIZE OF 16 × 1024 BIT AND A VECTOR FOLD OF 1.

Algorithm | Instr. | Mem. [vectors] | Cycles/classif. | Min. freq. [kHz] | Power [µW] | Energy/classif. [nJ]
EMG       | 12     | 2 + 5          | 678             | 1.4              | 10.7 / 2.9 | 703
BEARING   | 9      | 1 + 1          | 12513           | 25               | 29.1 / 7.9 | 10913

TABLE IV
AREA AND ENERGY EFFICIENCY COMPARISON WITH THE CURRENT STATE-OF-THE-ART HDC ACCELERATOR ARCHITECTURE. THE TERMS generic AND general purpose WERE INTRODUCED BY DATTA ET AL. IN [34].

Work              | Technology | Area [kGE] | Class           | Item memory               | Energy/inference [nJ] (LANG / EMG)
Datta et al. [34] | TSMC28     | 3618       | generic         | 1024-entry ROM            | 250 / 610
Our work          | GF22       | 1094       | general-purpose | rematerialized, arbitrary | 332 / 191

As we can see in Table IV, the energy efficiency differences between the two architectures depend a lot on the algorithm at hand. For LANG, the achieved energy efficiency is slightly worse (+31%) than the baseline, which is still impressive considering the 3.3× reduction in area. For EMG, on the other hand, we achieve a 3.1× improvement in energy efficiency. This can be explained by the difference in computational complexity between orthogonal and continuous item mapping in our architecture. In LANG, input values are mapped to quasi-orthogonal vectors using the mixing stage (Section III-C2), which requires log(N) cycles, where N denotes the cardinality of the input set. The overhead of this iterative approach considerably lowers the energy advantage of not using a large ROM for item memory generation. For EMG, on the other hand, the input values are mapped continuously using the similarity manipulator, which operates in a single cycle and can even be combined with a bundling or binding operation in the subsequent encoder units. Hence, for this algorithm, the benefit of not requiring a ROM becomes apparent. In general, for very high input value resolutions, the overhead of iterative item vector generation starts to dominate the overall energy consumption of our architecture. Still, the fact that the computational complexity of the rematerialization approach grows with the logarithm of the input space cardinality, instead of the linear area scaling of a ROM, suggests an advantage of our architecture for larger input spaces. In any case, the proposed architecture excels in its energy proportionality to the desired HDC algorithm. The ROM-based approach in [34] has an almost fixed cost for item memory mapping, with an upper limit on the supported resolution. For example, in LANG, only 13% (27 out of 1024 item vectors) of all ROM entries are required to map the input values. The architecture proposed by Datta et al. is only generic according to their taxonomy on HDC algorithm classes [34].
In contrast, the microcode-based approach that our architecture follows allows for arbitrary HDC algorithm computation within the limits of the available AM and instruction memory resources. Finally, our proposed architecture is energy- and area-flexible and can be finely parametrized to fit the area, throughput, and energy efficiency constraints of a particular target technology.

VI. CONCLUSION
In this work, we presented a novel all-digital, cross-technology-mappable HDC accelerator architecture with a highly configurable datapath using a newly proposed microcode ISA optimized for HDC. Placed and routed in GF 22 nm technology, the architecture improves on the current state of the art in energy efficiency and area by factors of up to 3.1× and 3.3×, respectively. The architecture achieves an energy efficiency of 192 nJ/inference for the task of EMG gesture classification with an always-on-compatible typical power consumption of 5 µW. Our post-layout simulation experiments on different digital associative memory architectures in Section IV-B indicate a significant potential for latch-based associative memories to push the limits of energy efficiency when operating at sub-nominal voltage; they can already outperform the energy efficiency of commercial off-the-shelf SRAM macros at nominal voltage. In Section V, we demonstrated that our newly introduced rematerialization schemes for IM and CIM mapping have a negligible impact on classification accuracy, with a drop of less than 0.5% compared to the ROM-based approach used by the current SoA HDC accelerator. As part of the analysis, we proposed a novel HDC-based end-to-end classification algorithm for ball bearing anomaly detection that maps to only 9 HDC microcode instructions. While our experiments in Section V-C indicated that the energy efficiency of a rematerializing IM is inferior to a ROM-based solution for low input resolutions, the proposed CIM mapping scheme outperforms the current SoA in energy efficiency, area usage, and flexibility. Finally, we provided the first open-source release of a complete HDC accelerator platform, which is possible due to the all-digital nature of the proposed architecture.

REFERENCES

[1] B. Chatterjee et al., "Context-Aware Intelligence in Resource-Constrained IoT Nodes: Opportunities and Challenges,"
IEEE Design & Test, vol. 36, no. 2, pp. 7–40, Apr. 2019.
[2] S. Bagchi et al., "New Frontiers in IoT: Networking, Systems, Reliability, and Security Challenges," IEEE Internet of Things Journal, pp. 1–1, Jul. 2020.
[3] D. Newell and M. Duffy, "Review of Power Conversion and Energy Management for Low-Power, Low-Voltage Energy Harvesting Powered Wireless Sensors," IEEE Transactions on Power Electronics, vol. 34, no. 10, pp. 9794–9805, Oct. 2019.
[4] V. Shnayder et al., "Simulating the Power Consumption of Large-scale Sensor Network Applications," in Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, ser. SenSys '04. New York, NY, USA: ACM, Nov. 2004, pp. 188–200.
[5] B. Spencer et al., "Smart Sensing Technology: Opportunities and Challenges," Structural Control and Health Monitoring, vol. 11, pp. 349–368, Oct. 2004.
[6] N. Verma et al., "In-Memory Computing: Advances and Prospects," IEEE Solid-State Circuits Magazine, vol. 11, no. 3, pp. 43–55, 2019.
[7] A. Sebastian et al., "Memory devices and applications for in-memory computing," Nature Nanotechnology, vol. 15, no. 7, pp. 529–544, Jul. 2020.
[8] G. Karunaratne et al., "In-memory hyperdimensional computing," Nature Electronics, vol. 3, no. 6, pp. 327–337, Jun. 2020.
[9] I. Miro-Panades et al., "SamurAI: A 1.7MOPS-36GOPS Adaptive Versatile IoT Node with 15,000× Peak-to-Idle Power Reduction, 207ns Wake-Up Time and 1.3TOPS/W ML Efficiency," in , Jun. 2020, pp. 1–2.
[10] D. Ma et al., "Sensing, Computing, and Communications for Energy Harvesting IoTs: A Survey," IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 1222–1250, Secondquarter 2020.
[11] L. Ge and K. K. Parhi, "Classification using Hyperdimensional Computing: A Review," IEEE Circuits and Systems Magazine, vol. 20, no. 2, pp. 30–47, 2020.
[12] A. Rahimi et al., "Hyperdimensional biosignal processing: A case study for EMG-based hand gesture recognition," in . IEEE, Oct. 2016, pp. 1–8.
[13] F. Montagna et al., "PULP-HD: Accelerating Brain-Inspired High-Dimensional Computing on a Parallel Ultra-Low Power Platform," in Proceedings of the 55th Annual Design Automation Conference (DAC '18). ACM Press, 2018, pp. 1–6.
[14] M. Cho et al., "17.2 A 142nW Voice and Acoustic Activity Detection Chip for mm-Scale Sensor Nodes Using Time-Interleaved Mixer-Based Frequency Scanning," in , Feb. 2019, pp. 278–280.
[15] J. S. P. Giraldo et al., "Vocell: A 65-nm Speech-Triggered Wake-Up SoC for 10-µW Keyword Spotting and Speaker Verification," IEEE Journal of Solid-State Circuits, vol. 55, no. 4, pp. 868–878, Apr. 2020.
[16] W. Shan et al., "14.1 A 510nW 0.41V Low-Memory Low-Computation Keyword-Spotting Chip Using Serial FFT-Based MFCC and Binarized Depthwise Separable Convolutional Neural Network in 28nm CMOS," in , Feb. 2020, pp. 230–232.
[17] Y. Zhao et al., "A 13.34µW Event-Driven Patient-Specific ANN Cardiac Arrhythmia Classifier for Wearable ECG Sensors," IEEE Transactions on Biomedical Circuits and Systems, vol. 14, no. 2, pp. 186–197, Apr. 2020.
[18] Z. Wang et al., "20.2 A 57nW Software-Defined Always-On Wake-Up Chip for IoT Devices with Asynchronous Pipelined Event-Driven Architecture and Time-Shielding Level-Crossing ADC," in , Feb. 2020, pp. 314–316.
[19] G. Rovere et al., "A 2.2-µW Cognitive Always-On Wake-Up Circuit for Event-Driven Duty-Cycling of IoT Sensor Nodes,"
IEEE Journal on Emerging and SelectedTopics in Circuits and Systems , vol. 8, no. 3, pp. 543–554, 2018.[20] P. Kanerva, “Hyperdimensional Computing: An Intro-duction to Computing in Distributed Representation withHigh-Dimensional Random Vectors,”
Cognitive Compu-tation , vol. 1, no. 2, pp. 139–159, Jun. 2009.[21] A. Rahimi et al. , “Efficient Biosignal Processing UsingHyperdimensional Computing: Network Templates forCombined Learning and Classification of ExG Signals,”
Proceedings of the IEEE , vol. 107, no. 1, pp. 123–143,Jan. 2019.[22] A. Rahimi et al. , “High-Dimensional Computing as aNanoscalable Paradigm,”
IEEE Transactions on Circuitsand Systems I: Regular Papers , vol. 64, no. 9, pp. 2508–2521, Sep. 2017.[23] A. Moin et al. , “An EMG Gesture Recognition Systemwith Flexible High-Density Sensors and Brain-InspiredHigh-Dimensional Classifier,” , pp. 1–5,Feb. 2018.[24] A. Burrello et al. , “Laelaps: An Energy-Efficient SeizureDetection Algorithm from Long-term Human iEEGRecordings without False Alarms,” in
Proceedings of the2019 Design, Automation & Test in Europe Conference& Exhibition (DATE) . IEEE, 2019, pp. 752–757.[25] E. Chang et al. , “Hyperdimensional Computing-basedMultimodality Emotion Recognition with PhysiologicalSignals,” in , Mar.2019, pp. 137–141.[26] A. Joshi et al. , “Language Geometry Using RandomIndexing,” in
Quantum Interaction , ser. Lecture Notes inComputer Science, J. A. de Barros et al. , Eds. Cham:Springer International Publishing, 2017, pp. 265–274.[27] M. Imani et al. , “HDNA: Energy-efficient DNA sequenc-ing using hyperdimensional computing,” in , Mar. 2018, pp. 271–274.[28] D. Kleyko and E. Osipov, “Brain-like classifier of tem-poral patterns,” in , Jun. et al. , “Hyperdimensional Computing Exploit-ing Carbon Nanotube FETs, Resistive RAM, and TheirMonolithic 3D Integration,” IEEE Journal of Solid-StateCircuits , vol. 53, no. 11, pp. 3183–3196, Nov. 2018.[30] H. Li et al. , “Hyperdimensional computing with 3D VR-RAM in-memory kernels: Device-architecture co-designfor energy-efficient, error-resilient language recognition,”in , Dec. 2016, pp. 16.1.1–16.1.4.[31] M. Schmuck et al. , “Hardware Optimizations of DenseBinary Hyperdimensional Computing: Rematerializationof Hypervectors, Binarized Bundling, and CombinationalAssociative Memory,”
ACM Journal on Emerging Tech-nologies in Computing Systems , vol. 15, no. 4, pp. 32:1–32:25, Oct. 2019.[32] S. Salamat et al. , “F5-HD: Fast Flexible FPGA-basedFramework for Refreshing Hyperdimensional Comput-ing,” in
Proceedings of the 2019 ACM/SIGDA Interna-tional Symposium on Field-Programmable Gate Arrays ,ser. FPGA ’19. New York, NY, USA: Association forComputing Machinery, Feb. 2019, pp. 53–62.[33] S. Salamat et al. , “Accelerating Hyperdimensional Com-puting on FPGAs by Exploiting Computational Reuse,”
IEEE Transactions on Computers , vol. 69, no. 8, pp.1159–1171, Aug. 2020.[34] S. Datta et al. , “A Programmable Hyper-DimensionalProcessor Architecture for Human-Centric IoT,”
IEEEJournal on Emerging and Selected Topics in Circuits andSystems , vol. 9, no. 3, pp. 439–452, Sep. 2019.[35] M. Imani et al. , “Exploring Hyperdimensional Associa-tive Memory,” in .Austin, TX: IEEE, Feb. 2017, pp. 445–456.[36] A. Teman et al. , “Power, Area, and Performance Op-timization of Standard Cell Memory Arrays ThroughControlled Placement,”
ACM Transactions on DesignAutomation of Electronic Systems , vol. 21, no. 4, pp. 1–25, May 2016.[37] P. Meinerzhagen et al. , “Benchmarking of Standard-CellBased Memories in the Sub-V T Domain in 65-nm CMOSTechnology,”
IEEE Journal on Emerging and SelectedTopics in Circuits and Systems , vol. 1, no. 2, pp. 173–182, Jun. 2011.[38] A. Rahimi et al. , “A Robust and Energy-Efficient Classi-fier Using Brain-Inspired Hyperdimensional Computing,”in
Proceedings of the 2016 International Symposiumon Low Power Electronics and Design , ser. ISLPED’16. San Francisco Airport, CA, USA: Association forComputing Machinery, Aug. 2016, pp. 64–69.[39] M. E. Sinangil et al. , “A reconfigurable 65nm SRAMachieving voltage scalability from 0.25–1.2V and perfor-mance scalability from 20kHz–200MHz,” in
ESSCIRC2008 - 34th European Solid-State Circuits Conference ,Sep. 2008, pp. 282–285.[40] B. Mohammadi et al. , “A 128 kb 7T SRAM Using aSingle-Cycle Boosting Mechanism in 28-nm FD–SOI,”
IEEE Transactions on Circuits and Systems I: Regular Papers , vol. 65, no. 4, pp. 1257–1268, Apr. 2018.[41] O. Andersson et al. , “Ultra Low Voltage SynthesizableMemories: A Trade-Off Discussion in 65 nm CMOS,”
IEEE Transactions on Circuits and Systems I: RegularPapers , vol. 63, no. 6, pp. 806–817, Jun. 2016.[42] S. Selcuk, “Predictive maintenance, its implementationand latest trends,”
Proceedings of the Institution ofMechanical Engineers, Part B: Journal of EngineeringManufacture , vol. 231, no. 9, pp. 1670–1679, Jul. 2017.[43] H. Qiu et al. , “Wavelet filter-based weak signature de-tection method and its application on rolling elementbearing prognostics,”
Journal of Sound and Vibration ,vol. 289, no. 4, pp. 1066–1090, Feb. 2006.[44] J. Ben Ali et al. , “Linear feature selection and classi-fication using PNN and SFAM neural networks for anearly online diagnosis of bearing naturally progressingdegradations,”