Always-On, Sub-300-nW, Event-Driven Spiking Neural Network based on Spike-Driven Clock-Generation and Clock- and Power-Gating for an Ultra-Low-Power Intelligent Device
Dewei Wang, Pavan Kumar Chundi, Sung Justin Kim, Minhao Yang, Joao Pedro Cerqueira, Joonsung Kang, Seungchul Jung, Sangjoon Kim, Mingoo Seok
AAlways-On, Sub-300-nW, Event-Driven Spiking Neural Network based on Spike-Driven Clock-Generation and Clock- and Power-Gating for an Ultra-Low-Power Intelligent Device
Dewei Wang , Pavan Kumar Chundi , Sung Justin Kim , Minhao Yang , Joao Pedro Cerqueira , Joonsung Kang , Seungchul Jung , Sangjoon Kim , and Mingoo Seok Columbia University Samsung Electronics E-mail: [email protected]
Abstract —Always-on artificial intelligent (AI) functions such as keyword spotting (KWS) and visual wake-up tend to dominate total power consumption in ultra-low power devices [1]. A key observation is that the signals to an always-on function are sparse in time, which a spiking neural network (SNN) classifier can leverage for power savings, because the switching activity and power consumption of SNNs tend to scale with spike rate. Toward this goal, we present a novel SNN classifier architecture for always-on functions, demonstrating sub-300nW power consumption at the competitive inference accuracy for a KWS and other always-on classification workloads.
Keywords—always-on device, neuromorphic hardware, spiking neural network, event-driven architecture, speech recognition, keyword spotting I. I NTRODUCTION
SNN classifiers have been attracting a large amount of attention for ultra-low-power intelligence. Especially, the asynchronous versions are promising for always-on functions thanks to event-driven operation, which enables power dissipation proportional to input rate. However, most of the existing asynchronous SNNs [2] employ complex logic such as quasi-delay-insensitive (QDI) dual-rail dynamic logic, which is significantly bulkier and power-hungrier than single-rail static counterpart and also not very voltage-scalable [3,4]. On the other hand, synchronous SNNs use power-efficient static logic [5], but they target high throughput, not always-on function, and focus on minimizing energy consumption, not power. As a result, they exhibit the power consumption of tens of mW. In this work, therefore, we propose an always-on SNN classifier consuming <300 nW. Our architecture uses only static logic operating at near-threshold voltage (NTV) while being fully spike-event-driven. Specifically, we design the neurosynaptic core in static gates and equip it with i) spatiotemporally fine-grained clock-generation, ii) clock-gating and iii) power-gating, all driven by spikes . Also, the communication fabric between neurosynaptic cores is simply wires and yet free of spike collision. This event-driven architecture incurs zero switching at no input change, thereby exhibits power consumption proportional to input rate. We prototyped a 5-layer SNN classifier having 650 neurons and 67,000 synapses in a 65-nm CMOS. The SNN hardware demonstrates 7 to 1000X less power consumption at the state-of-the-art accuracies for well-known KWS benchmarks: Google Speech Command Dataset (GSCD) for multi-keyword recognition [6] and HeySnips for single-keyword spotting [7]. II. S PIKE - EVENT - DRIVEN
SNN
ARCHITECTURE
Fig. 1 shows the proposed SNN classifier. It has five neurosynaptic cores and maps a fully-connected SNN model as large as 256-128-128-128-10. Each core contains a neuron block and a synapse block (the last layer has no synapse). A neuron block contains up to 256 integrate-and-fire (IF) neurons. A synapse block has i) an arbiter, ii) an SRAM storing up to 256-by-128 binary weights, and iii) a spike generator that simultaneously generates 128 spikes. The communication fabric that spikes travel is simple wires. The arbiter in a synapse block ensures no spike collision in the communication fabric.
SNN classifierNeurosynaptic coreSynapse block
Neuron block Comm. fabric
128 1 This chip
To FPGA
Hey Snips! 8 A F E [ ] F P G A D e c o d e r E n c o d e r W e i g h t S R A M x A r b i t e r W e i g h t S R A M x A r b i t e r W e i g h t S R A M x A r b i t e r W e i g h t S R A M x A r b i t e r Fig. 1. Proposed SNN architecture and testing setup
Ack i Spk -1 Spk +1 FF +1 D Q 0 D Q FF clk-ug From other neurons
Shared clock generator
FSM
Async. wake-up circuits
Req i (to/from the synapse block)ActiveVDDVSSActive ActiveFF -1 D Q FF clk-en Q D FF +1 /FF -1 Start/Standby
Potential updateFF +1 & FF -1 reset Spk Req Potential reset
Potential update 2FF +1 & FF -1 reset Fig. 2. Proposed neuron block architecture e devise a fully spike-event-driven architecture for IF neurons (Fig. 2). It employs fine-grained clock-generation and clock-gating circuits. Also, the non-retentive parts of neurons employ zigzag power-gating switches (PGSs) [8] for leakage suppression. Each neuron has i) asynchronous wake-up circuits and ii) a synchronous finite state machine (FSM). The wake-up circuits (Fig. 2 left) can detect the rising edge of incoming spike signals. There are two inputs, spk +1 incrementing and spk -1 decrementing neuron’s potential, which are detected by using static flip-flops, FF +1 and FF -1 . The detection of a spike sets the clock-enable flip-flop ( FF clk-en ), making its output Active high. This starts up the shared clock generator in the neuron cluster if it has not been started up by other neurons. It also un-gates the zigzag PGS of the FSM in a single cycle. Then, the first falling edge of clock generator’s output sets FF clk-ug , un-gating the clock signal that goes to the FSM. The use of FF clk-ug ensures that clock starts from the low phase, giving the sufficient setup time to the flip-flops in the FSM. Indeed, all the other neurons in the cluster that receive no spike remain clock- and power-gated, saving power. Once waken up, the neuron’s FSM executes the IF neuron model (Fig. 2 right). In the Start/Standby state, it resets FF clk-en and FF clk-ug in the wake-up circuits, which gates clock and power. When either of FF +1 or FF -1 receives a new spike, the FSM enters the Potential update state. The neuron’s potential is increased or decreased by one based on the input spike’s type and FF +1 and FF -1 are reset for readily receiving next spike. Then, the neuron’s potential is compared with the preset threshold. If the potential is less than the threshold, the FSM goes back to the Start/Standby state. Otherwise it resets the neuron’s potential to zero and assert firing request (
Req i ). While waiting for the acknowledgement ( Ack i ) from the arbiter in the synapse block, we add the state, Potential update 2 , such that the FSM can receive a new spike. Once acknowledged, the neuron’s FSM goes back to the
Start/Standby state. no low-power technique clock gen. and gating clock gen. + clock and power gating P o w er ( n W ) Simulation
Fig. . Power savings of spike-driven clock-gen., -gating, & power-gating This spike-driven clock generation, gating, and power gating enables large power savings. For the targeted benchmarks, the longest idle time between two spikes per neuron is estimated ~4ms at the maximum input rate. The SNN hardware without the spike-driven power management would consume 750nW (Fig.3). The proposed clock-generation/-gating enables 63% power savings and the zigzag power gating provides additional 20%, resulting in the overall power reduction of 70%. Note that the actual power savings are expected much greater since the average idle time of neurons would be orders of magnitude lower than the maximum one considered above. We also design the synapse block to be fully spike-event-driven (Fig. 4). A request from neurons (
Req i ) starts up its local clock generator and thus the arbiter FSM. In case more than one neuron assert Req i , the arbiter FSM arbitrates the access of the single-port weight SRAM. The weight SRAM stores the 128 binary weights in the n-th row, which the n-th neuron in the current neurosynaptic core drives to the post-synaptic neurons. Therefore, upon the n-th neuron’s request, the arbiter asserts the n-th wordline (WL n ) and loads the binary weights on the read-bitlines (RBLs). The spike generator then takes RBLs and generate 128 positive or negative spikes to the next neurosynaptic core. The arbiter then acknowledges back to the neuron by asserting Ack n . If there is any outstanding Req i , the arbiter FSM continue to serve, otherwise, the clock is disabled. Req Req Req n A r b i t e r FS M Ack Ack Ack n W e i g h t S R A M ( n - b y - m b i t c e ll s ) WL WL WL n 1 2 m RBL RBL m {spk +1 , spk -1 } S p i k e g e n e r a t o r Local clock generator
VSS VDDActive
Active
Active ...
Ack[1]Ack[0]
XX..XX1
XX..X1X 1X..XXX .. Ack[127]
Generate spike
Start/
Standby (a) (b)
Fig. 4. (a) Proposed synapse block architecture and (b) the arbiter FSM
Simulation Design pointsStarved S y n a p s e b l o c k c l o c k f re qu e n c y ( H z ) Threshold
Non-starvedf clk,a at 0.5V
Fig. 5. Threshold and clock frequency optimization for no starvation
We design the arbiter with a fixed-priority scheme (Fig. 4(b)). In our experiment, it requires 23X smaller silicon area than a round-robin one. The chosen priority scheme, however, could cause the neuron with the lowest priority to starve. Therefore, we analyze and optimize several design parameters to eliminate such a starvation. We first formulate the number of requests in the i -th layer (N req,i ) to: 𝑁 𝑟𝑒𝑞,𝑖 = 𝑁 𝑠𝑝𝑘,𝑖 × 𝑁 𝑛𝑟𝑛,𝑖 𝑇𝐻 𝑖 , (1) where N spk,i is the number of incoming spikes per frame (a time duration in which a feature vector is generated) and per neuron in the i -th neurosynaptic core, N nrn,i is the number of neurons in the i -th neurosynaptic core, and TH i is the threshold of neurons in the i -th neurosynaptic core. On the other hand, the number of requests that the arbiter is able to serve (N serve,i ) can be formulated to: 𝑠𝑒𝑟𝑣𝑒,𝑖 = 𝑓 𝑐𝑙𝑘,𝑎 × 𝑇 𝑓𝑟𝑎𝑚𝑒 𝑁 𝑐𝑦𝑐,𝑎 , (2) where N cyc,a is the number of cycles that the arbiter consumes to serve one request, T frame is the frame length, f clk,a is the arbiter’s clock frequency. If N req,i (Eq. 1) exceeds N serve,i (Eq. 2), starving happens. To eliminate such, we can increase TH i and arbiter clock frequency (f clk,a ) (Fig. 5). The former, however, can reduce the number of spikes generated in the i -th layer and thus incur accuracy degradation. The latter can increase the power consumption of the synapse block. We thus swept TH i and f clk,a and found several optimal points which are used in this chip (Fig. 5). It is noteworthy that the arbiter also enables the use of wires for spike communication. If two (post-synaptic) spikes travel to a single neuron at the same time, they collide, causing information loss. The arbiter systematically eliminates such collisions by guaranteeing that a post-synaptic neuron receives only one spike at a time. III. E XPERIMENT AND M EASUREMENT
We prototyped the test chip in a 65nm CMOS (Fig. 6). We added the input decoder and the output encoder for reducing chip I/O counts (Fig. 1). The SNN is envisioned to interface directly with a spike-generating feature-extraction front end such as [9,10]. For the experiment, we generate 16-dimension features from [9] and feed them to the SNN using an FPGA based interface (Fig. 1). The central frequencies of the 16 channels are geometrically scaled from about 100 to 5kHz. In GSCD and HeySnips datasets, each keyword audio sample is roughly 1 s. We set the frame length T frame at 80ms without frame overlapping. We send the current frame together with the past 15 frames to SNN classifier (i.e., the input dimension is 256). We configure the front end [9] to generate 6-bit features for each frame. We train the SNN equivalent binary neural network (BNN) model that uses binary weights (+1, -1) and 6-bit ReLU activation (Fig. 7) [11]. This provides the weights for the SNN model. The 6-bit activations in the BNN are spike-rate encoded for the SNN, e.g., 010000 (2) is mapped to 16 spikes/frame. In the SNN model, as spikes pass through the neurons in a layer, the number of spikes scales roughly by the ratio of the threshold. We thus set the threshold of the neurons in each layer such that each neuron generates at most 63 spikes/frame, matched to the 6-bit activation of the BNN model. Note that we can easily change the activation bit count for different models by configuring the thresholds. For example, we use 8-bit activation with the T frame of 0.5s for the MNIST grayscale.
Input neurosynaptic core Hidden 1TestHidden 3 Hidden 2Output I npu t d e c o d e r . mm Input core (24.65%)Hidden 1(24.65%) Hidden 2 (24.65%)
Hidden 3(24.65%)Output core(1.39%)Total area: 1.99mm Fig. 6. Die photo and area break-down
ReLU
66 6 w =+1/-1 w =+1/-1 w =+1/-1 > TH w =+1/-1w =+1/-1w =+1/-1 N (i)spk,in N req,i Fig. 7. Binary Coding in a BNN and spike-rate coding in an SNN
At our target voltage (0.5V), the neuron block operates at 70kHz and the synapse block operates at 17kHz (Fig. 8). The proposed SNN achieves 75-220nW power dissipation that scales with the input rate (Fig. 9). Fig. 10 shows the accuracy performance across several benchmarks. In GSCD, the SNN can recognize four keywords ("yes", "stop", "right", and "off", arbitrarily chosen) and fillers. The SNN has the architecture of 256-128-128-128-5 with the thresholds of (1,28,18,10). In HeySnips, it can recognize one keyword (“Hey Snips”) and fillers. And in MNIST grayscale, we reduced the image size to 16x16 by utilizing 2x2 max-pooling. The trained SNN structure is 256-128-128-128-10 with the thresholds of (1,24,12,8). Fig. 11 shows the receiver operating characteristic (ROC) curve for GSCD and HeySnips. False reject rate (FRR) vs. the false alarm rate (FAR) under 1-hour-long audio concatenated by test set samples are presented. In addition, we mix the speech audio with the white noise at various SNR We adopt the noise-dependent training [9] in this experiment. The SNN achieves reasonably high accuracy across 0 to 40dB SNR levels (Fig. 12). Finally, the data precision can be traded off for power savings (Fig. 13). -2 -1 Neuron Synapse
Supply (V) C l o c k f re qu e n c y ( M H z ) -4 -3 -2 -1 M i n f r a m e s i ze f o r - b i t a c t i va t i o n s Measurement
VDD=0.75V P o w er c o n s u m p t i o n ( W ) Input rate
VDD=0.5VMeasurement
MNIST A cc u r a c y ( % ) HeySnips
Measurement
Fig. 8. Clock frequency and frame length Fig.9. Power consumption vs. input rate Fig.10. Accuracies across multiple benchmarks F a l s e re j ec t r a t e ( % ) False alarm rate (%)
GSCD
HeySnipsSimulation
GSCD
HeySnips
SNR (dB) A cc u r a c y ( % ) Simulation
Activation precision (bits) E rr o r ( % ) VDD=0.5VHeySnipsT frame =80ms P o w er ( n W ) Measurement
Fig.11. ROC curves from KWS benchmarks Fig.12. Accuracy across 0 to 40dB SNR levels Fig.13. Activation precision vs. power
We compare our work to the recent KWS accelerators (Table I) and SNN hardware (Table II). Our design is one of few SNNs targeting always-on function, achieving 7 to 1,000X power savings at the competitive accuracies in both of the benchmarks.
Table I. Comparisons with recent KWS hardware
This work
Giraldo ESSCIRC18[14]
Shan
ISSCC20[12]
GuoVLSI19[13]Technology [nm]
65 65 6528
Algorithm
SNN
RNN
LSTM
DSCNN
Area[mm ] VDD[V]
Clock frequency
Benchmark
GSCD (4 Keywords)
GSCD (10 Keywords)
TIMIT (4 Keywords)
GSCD (2 Keywords)
Accuracy[%]
Power
Additional benchmark
HeySnips (1 Keyword)
HeySnips(1 Keyword)
N/A
GSCD(1 Keyword)
Accuracy[%]
Table II. Comparisons with recent SNN hardware
This work ChenVLSI18[15] TrueNorth[2]ParkISSCC19[5]Technology [nm]
65 10 2865
Neuron count
650 4096 1M
Synapse count
67K 1M 256MN/A
Area[mm ] Clock frequency
MNIST Classification
Power
Accuracy[%]
Throughput [infs/s]
Energy per inference [nj]
195 1,700 N/A236* Input layer not included; ** Estimated from neuron s power dissipation *** Estimated from Hsin-Pai Cheng et al, IEEE DATE 2017
Energy per
SOP [pj]
IV. C ONCLUSION
In this paper, we present a fully spike-event-driven SNN classifier for always-on intelligent function. By taking advantage of signal sparseness, the SNN hardware consumes 75 to 220 nW. We train the SNN for multiple always-on functions, notably multi- and single-keyword spotting benchmarks across SNR levels, achieving competitive accuracies. A
CKNOWLEDGEMENT
This research is in part supported by Samsung Electronics and DARPA (the µBrain program). R
EFERENCES [1]
K. Badami et al. , "Context-aware hierarchical information-sensing in a 6μW 90nm CMOS voice activity detector."
IEEE International Solid-State Circuits Conference-(ISSCC),
P.A. Merolla et al. , "A million spiking-neuron integrated circuit with a scalable communication network and interface."
Science,
Yu Chen, Mingoo Seok, Steve M. Nowick, “Robust and Energy-Energycient Asynchronous Dynamic Pipelines for Ultra-Low-Voltage Operations Using Adaptive Keeper Control,”
IEEE ACM International Symposium on Low Power Electronics and Design-(ISLPED),
Jian Liu, Steve M. Nowick, Mingoo Seok, “Soft MOUSETRAP: a Bundled-Data Asynchronous Pipeline Scheme Tolerant to Random Variations at Ultra Low Supply Voltages,”
IEEE International Symposium on Asynchronous Circuits and Systems-(ASYNC),
Jeongwoo Park, Juyun Lee, and Dongsuk Jeon. "A 65nm 236.5 nJ/classification neuromorphic processor with 7.5% energy overhead on-chip learning using direct spike-only feedback."
IEEE International Solid-State Circuits Conference-(ISSCC),
Pete Warden, "Speech Commands: A Dataset for Limited -Vocabulary Speech Recognition", arXiv: 1804.03209 [cs. CL] [7]
Alice Coucke et al., "Efficient keyword spotting using dilated convolutions and gating."
IEEE International Conference on Acoustics, Speech and Signal Processing-(ICASSP),
J. P. Cerqueira and M. Seok, "Temporarily Fine-Grained Sleep Technique for Near- and Subthreshold Parallel Architectures," in
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
Jan. 2017 [9]
M. Yang, C. Yeh, Y. Zhou, J. P. Cerqueira, A. A. Lazar and M. Seok, "Design of an Always-On Deep Neural Network-Based 1- µW Voice Activity Detector Aided With a Customized Software Model for Analog Feature Extraction," in
IEEE Journal of Solid-State Circuits-(JSSCC),
June 2019 [10]
M. Yang, S. Liu and T. Delbruck, "A Dynamic Vision Sensor With 1% Temporal Contrast Sensitivity and In-Pixel Asynchronous Delta Modulator for Event Encoding," in
IEEE Journal of Solid-State Circuits-(JSSCC),
Sept. 2015 [11]
Yongqiang Cao et al. , "Spiking deep convolutional neural networks for energy-efficient object recognition."
International Journal of Computer Vision-(IJCV),
W. Shan et al ., "14.1 A 510nW 0.41V Low-Memory Low-Computation Keyword-Spotting Chip Using Serial FFT-Based MFCC and Binarized Depthwise Separable Convolutional Neural Network in 28nm CMOS,"
IEEE International Solid- State Circuits Conference - (ISSCC),
R. Guo et al ., "A 5.1pJ/Neuron 127.3us/Inference RNN-based Speech Recognition Processor using 16 Computing-in-Memory SRAM Macros in 65nm CMOS,"
IEEE Symposium on VLSI Circuits,
Juan SP Giraldo, and Marian Verhelst. "Laika: A 5uW programmable LSTM accelerator for always-on keyword spotting in 65nm CMOS."
ESSCIRC IEEE European Solid State Circuits Conference - (ESSCIRC), G. K. Chen, R. Kumar, H. E. Sumbul, P. C. Knag and R. K. Krishnamurthy, "A 4096-Neuron 1M-Synapse 3.8PJ/SOP Spiking Neural Network with On-Chip STDP Learning and Sparse Weights in 10NM FinFET CMOS,"
IEEE Symposium on VLSI Circuits,2018