Hardware Aware Training for Efficient Keyword Spotting on General Purpose and Specialized Hardware
Peter Blouw, Gurshaant Malik, Benjamin Morcos, Aaron R. Voelker, and Chris Eliasmith
Applied Brain Research Inc.
Abstract
Keyword spotting (KWS) provides a critical user interface for many mobile and edge applications, including phones, wearables, and cars. As KWS systems are typically 'always on', maximizing both accuracy and power efficiency is central to their utility. In this work we use hardware aware training (HAT) to build new KWS neural networks based on the Legendre Memory Unit (LMU) that achieve state-of-the-art (SotA) accuracy and low parameter counts. This allows the neural network to run efficiently on standard hardware (212 µW). We also characterize the power requirements of custom designed accelerator hardware that achieves SotA power efficiency of 8.79 µW, beating general purpose low power hardware (a microcontroller) by 24x and special purpose ASICs by 16x.
Keywords: speech processing · keyword spotting · on-device inference · online inference · keyword spotting hardware · edge AI · low power · deep learning accelerator · TinyML · hardware aware training

1 Introduction

There are a wide variety of keyword spotting deep neural networks available, including those based on CNNs, LSTMs, GRUs, and many variants of these. However, commercially viable networks have several constraints often ignored by research focused efforts. In this more constrained setting, neural networks must be:
1. Stateful: The network cannot assume to know when a keyword is about to be presented. As a result, the starting state of the network cannot be known in advance, but is determined by whatever processing has happened recently, not by being reset to a known 'zero' state.
2. Online (or 'streaming'): The most responsive, low-latency networks process audio data as soon as it is available, in real time. Many methods are tested under the assumption that large windows of data are available all at once. However, at deployment, waiting for large amounts of data introduces undesirable latencies. As well, reusing previously processed data, as done by RNNs, can lead to efficiency gains.
3. Quantized: Quantization to 8-bit weights and activities is becoming standard for mobile or 'edge' applications. Quantization allows more efficient deployment on low power, edge hardware.
4. Power efficient: While quantization helps with power efficiency, it is not the sole determiner of the power required by a network. For instance, the number and type of computations performed are also important. Specific focus on the power efficiency of the network, and its viability for deployment on available hardware, is critical for commercial applications.

In this paper, we use a method of hardware aware training (HAT) that directly trains a network for efficient hardware deployment, accounting for hardware assumptions during model development. This provides a practical method for meeting such constraints. As a result, we focus on comparing this work to recent SotA results that share interest in these constraints. All of the new results we report also satisfy these constraints. As a consequence, our main metrics of interest will be: accuracy; size of the model (in bits; the number of parameters times the bits per parameter); and power usage in a real-time setting. In what follows we describe new optimization, algorithmic, and hardware techniques that have allowed us to develop a highly power efficient KWS algorithm and hardware platform. Critically, we demonstrate that these same methods can be used to target different hardware platforms (both general and special purpose). To the best of our knowledge, we present better than current SotA results on each of these metrics and for each platform.

2 The Legendre Memory Unit

The recurrent neural network (RNN) that lies at the heart of our algorithm is called the Legendre Memory Unit (LMU), which we have recently proposed (Voelker et al., 2019). The LMU consists of a linear 'memory layer' and a nonlinear 'output layer' that are recurrently coupled both to themselves and each other. A distinguishing feature of the LMU is that the linear memory layer is optimal for compressing an input time series over time. The output of this layer represents the weighting of a Legendre basis, which gives rise to the LMU's name.

Because of this provable optimality, unlike past RNNs (including LSTMs, GRUs, and so on), the LMU has fixed recurrent and input weights on the linear layer. As well, the theoretical characterization of the LMU permits intermediate representations to be decoded, providing a degree of explainability to the functioning of the network.

In the original LMU paper, it was shown that on a task requiring the memory of a time-varying signal, the LMU outperforms the LSTM, achieving a large reduction in error while encoding more timesteps and using 500 versus 41,000 parameters. In some ways this is not surprising, as the LMU is optimized for this task. Nevertheless, the LMU also outperforms all previous RNNs on the standard psMNIST benchmark task by achieving 97.15% test accuracy, compared to the next best network (dilated RNN) at 96.1% and the LSTM at 89.86%. Again, the LMU used far fewer parameters: ~102,000 versus ~165,000 (a reduction of 38%).

Because the LMU is designed to be optimal at remembering information over a window, while receiving streamed input, and because it also tends to use fewer parameters while achieving high accuracy, it is well-suited to the constraints of real-world KWS tasks.

In this work, we have modified the originally proposed LMU in a number of ways (see Figure 1). In particular, we have removed the connection from the nonlinear to the linear layer, the connection from the linear layer to the intermediate input, and the recurrent connection from the nonlinear layer to itself. As well, we have included multiple linear memory layers in the architecture.
We found that this was important for improving performance on the KWS task. The resulting architecture, depicted in Figure 1, is thus described by the following equations:

h_t = f(W_x x_t + W_m m_t + b)
u_t = e_x^T x_t + e_h^T h_{t-1}
m_t = \bar{A} m_{t-1} + \bar{B} u_t

where each of the variables is defined as depicted in Figure 1, and the nonlinearity we use for this application is the ReLU. We refer to this architecture as a single LMU layer. The final network we test includes multiple LMU layers and a feedforward output layer.

The LMU is a patent pending technology of Applied Brain Research Inc., free for academic research, educational, and personal uses. Please contact ABR for commercial use licensing at [email protected].
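For concreteness, the following is a minimal NumPy sketch of one timestep of these equations, with a single memory bank (the paper's layers include multiple linear memory layers, omitted here for clarity). The construction of the fixed matrices Ā and B̄ follows the Legendre delay system from Voelker et al. (2019) with zero-order-hold discretization; all dimensions, weight values, and function names are illustrative assumptions, not the trained models' actual configuration.

```python
import numpy as np
from scipy.linalg import expm

def lmu_matrices(order: int, theta: float, dt: float):
    """Fixed (non-learned) LMU memory matrices: the continuous-time
    Legendre delay system (Voelker et al., 2019) for a window of length
    `theta`, discretized with a zero-order hold at timestep `dt`."""
    A = np.zeros((order, order))
    B = np.zeros(order)
    for i in range(order):
        B[i] = (2 * i + 1) * (-1.0) ** i
        for j in range(order):
            A[i, j] = (2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
    A /= theta
    B /= theta
    # Zero-order hold: Abar = e^{A dt}, Bbar = A^{-1} (Abar - I) B
    Abar = expm(A * dt)
    Bbar = np.linalg.solve(A, (Abar - np.eye(order)) @ B)
    return Abar, Bbar

def lmu_step(x_t, h_prev, m_prev, W_x, W_m, b, e_x, e_h, Abar, Bbar):
    """One timestep of the modified LMU layer (the equations above)."""
    u_t = e_x @ x_t + e_h @ h_prev                    # scalar fed to the memory
    m_t = Abar @ m_prev + Bbar * u_t                  # linear memory update
    h_t = np.maximum(0.0, W_x @ x_t + W_m @ m_t + b)  # ReLU output layer
    return h_t, m_t

# Illustrative sizes (assumptions): 40-dim features, order-8 memory, 64 units,
# 20 ms frames remembered over a 0.4 s window.
rng = np.random.default_rng(0)
Abar, Bbar = lmu_matrices(order=8, theta=0.4, dt=0.02)
W_x, W_m = 0.1 * rng.standard_normal((64, 40)), 0.1 * rng.standard_normal((64, 8))
b, e_x, e_h = np.zeros(64), 0.1 * rng.standard_normal(40), 0.1 * rng.standard_normal(64)
h, m = np.zeros(64), np.zeros(8)
for _ in range(50):                      # stream 1 s of audio, frame by frame
    x = rng.standard_normal(40)          # stand-in for one frame of features
    h, m = lmu_step(x, h, m, W_x, W_m, b, e_x, e_h, Abar, Bbar)
```

Note that, consistent with the modifications described above, the hidden state h only influences the memory through the scalar u_t, and there is no recurrent W_h h_{t-1} term in the output layer.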
3 Experiments

3.1 Methods

Our methods and metrics follow standard practices for the SpeechCommands dataset (see, e.g., Warden (2018); Rybakov et al. (2020)). As specified in Warden (2018), we split the data into training, validation, and testing sets with one second speech samples at 16 kHz. The network is trained on twelve labels: the ten keywords, plus silence and unknown tokens. All accuracy results are on the test data only (see Table 1 and Figure 2).

The methods we use to build the LMU models all leverage hardware aware training (HAT). This extends standard quantization aware training to precisely match the hardware on which the models will be deployed. This means that all model elements are matched to the bit precisions assumed throughout a design. Quantization aware training typically makes assumptions not satisfied by hardware. For instance, activities are often asymmetrically quantized to unsigned 8 bits, but in a hardware implementation 7-bit quantization is more appropriate, since one bit is required for a signed two's complement representation to allow 8-bit multiplication with weights. As a result, the reported accuracies for the LMU models are the expected, real-world, deployed accuracies.

All LMU models and the results from Rybakov et al. (2020) are for stateful, quantized, and online KWS applications. The results from Wong et al. (2020) are quantized, but their latency and statefulness are not reported. Because the amount of quantization differs between models, we measure model size in kilobits (kbits) instead of parameter count: the number of parameters multiplied by the number of bits per parameter, giving a consistent model size metric.
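HAT itself is ABR's extension of quantization aware training and its details are not published here, but its core building block, fake quantization applied in the forward pass so that training sees exactly the values the deployed hardware will compute, can be sketched as follows. The function name and scale-selection scheme are illustrative assumptions, not ABR's implementation; the bit widths match the precisions stated above.

```python
import numpy as np

def fake_quantize(x: np.ndarray, num_bits: int) -> np.ndarray:
    """Round `x` onto a signed `num_bits` fixed-point grid and rescale.

    In a training framework this runs in the forward pass, with gradients
    passed straight through the rounding (the straight-through estimator),
    so the trained weights are exactly representable on the target hardware.
    """
    qmax = 2 ** (num_bits - 1) - 1        # signed two's complement range
    qmin = -(2 ** (num_bits - 1))
    scale = max(np.abs(x).max(), 1e-12) / qmax
    return np.clip(np.round(x / scale), qmin, qmax) * scale

w4 = fake_quantize(np.random.randn(64, 8), num_bits=4)  # 4-bit weights (LMU-2/3/4)
a7 = fake_quantize(np.random.rand(64), num_bits=7)      # 7-bit signed activations
```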
3.2 Results

In Table 1 we show the results from four different LMU models. The first model (LMU-1) uses 8-bit weights, while the remaining three models use 4-bit weights. All LMU models use 7-bit activations. LMU-1 and LMU-2 are not pruned; LMU-3 has 80% of its weights pruned and LMU-4 has 91%.

We compare our results to Google's latest KWS paper (Rybakov et al., 2020), updated in July of 2020, and DarwinAI's announcement (Wong et al., 2020) from August of 2020 regarding the use of "attention condensers" to generate "highly efficient deep neural networks for on-device speech recognition." As shown in Table 1, the LMU models outperform both sets of results in terms of accuracy and size. For instance, LMU-1 matches the accuracy of the best Google model while using 41% fewer bits. As well, LMU-2 is comparable in accuracy to the CNN (Rybakov et al., 2020), while using 11.7x fewer bits in the final model.
Table 1: Accuracy and model size of KWS models on the SpeechCommands dataset.

Model         Accuracy (%)   Model Size (kbits)   Reference
DNN           90.6           3576                 Rybakov et al. (2020)
CNN+strd      95.6           4232                 Rybakov et al. (2020)
CNN           96.0           4848                 Rybakov et al. (2020)
GRU (S)       96.3           4744                 Rybakov et al. (2020)
CRNN (S)      96.5           3736                 Rybakov et al. (2020)
SVDF          96.9           2832                 Rybakov et al. (2020)
DSCNN         96.9           3920                 Rybakov et al. (2020)
TinySpeech-A  94.3           127                  Wong et al. (2020)
TinySpeech-B  91.3           53                   Wong et al. (2020)
LMU-1         96.9           1683                 This work
LMU-2         95.9           361                  This work
LMU-3         95.0           105                  This work
LMU-4         92.7           49                   This work

Figure 2: Scatter plot of the model size and accuracy data in Table 1. Model size is shown on an inverted log scale (right is better). The LMU models are consistently smaller and more accurate over the model space, shown by their being in the top right.

In comparison to the generally smaller models of Wong et al. (2020), the LMUs show significant accuracy improvements. Specifically, LMU-3 reduces the error by 14% relative to TinySpeech-A, while using 17% fewer bits. Similarly, LMU-4 reduces error by 19% relative to TinySpeech-B, while using 8% fewer bits.

As noted in Section 3.1, the LMU models are all stateful, streamable, and developed with HAT. Hence the reported accuracies can be realized on hardware in real-time, real-world applications.
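As a small worked example of the size metric, the following sketch computes kbits from a parameter count; the 210,375-parameter figure is back-derived from the table (1683 kbits at 8 bits per weight) and is therefore an assumption rather than a reported number.

```python
def model_size_kbits(num_params: int, bits_per_param: int) -> float:
    """Size metric used in Table 1: parameters times bits per parameter."""
    return num_params * bits_per_param / 1000.0

# 1683 kbits at 8-bit weights implies roughly 210,375 parameters for LMU-1:
print(model_size_kbits(210_375, 8))  # -> 1683.0
```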
4 Power Usage

4.1 Power modeling of custom hardware

While power use will scale with model size on standard hardware, suggesting that the LMU models will be very efficient, an even more efficient implementation can be obtained using custom designed hardware. Hence, we have designed low power digital hardware to natively implement the necessary computations for the LMU models discussed in Section 3. Here we report results on power modeling of the LMU-2 architecture, which strikes a balance between small size and high accuracy. The design is flexible, allowing for different degrees of parallelism, depending on the speed, power, and area requirements. We considered a variety of designs across different clock frequencies, while always ensuring that the timing constraints of the SpeechCommands models proposed above (40 ms windows updated every 20 ms) are satisfied in real time.

To estimate the power of our design, we established cycle-accurate power envelopes using ABR's proprietary, silicon-aware, hardware-software co-design mapping tools. Total power usage is determined with these envelopes using publicly available power data (Frustaci et al., 2015; Yabuuchi et al., 2017; Höppner and Mayr, 2018). Multiply-accumulate (MAC) and SRAM dynamic and static power are the dominant power consumers in the design. We also included dynamic power estimates for multipliers, dividers, and other components as a function of the number of transistors in the component and the power cost per transistor of the MAC. All estimates are for a 22nm process. To estimate the number of transistors, and hence the area, of the design, we generated RTL designs of each of the relevant components and used the yosys open source tool (Wolf) and libraries to estimate the number of transistors required for the total number of components included in our network.

Figure 3: Power and area trade-off for different clock frequencies of our custom hardware design. Blue dots indicate specific designs considered while varying the number of components and the clock's frequency.

Figure 3 shows the resulting power/area trade-off for our LMU-based design. As can be seen, the lowest power design we found sits at 8.79 µW (92 kHz clock) and 8,052,298 transistors. For this design, the throughput for one 20 ms frame is 13.38 ms and the latency for the 40 ms update is 39.59 ms, meaning the design runs in real time. Note that all designs depicted in Figure 3 are real-time.
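A quick sanity check of the real-time constraint at the reported design point, using only constants quoted above (the cycles-per-frame figure is derived, not reported, and is therefore an assumption):

```python
CLOCK_HZ = 92_000    # lowest-power design point: 92 kHz clock
STRIDE_S = 0.020     # a new 20 ms frame of audio arrives every 20 ms
FRAME_S = 0.01338    # reported throughput: time to process one frame

cycles_per_frame = FRAME_S * CLOCK_HZ   # ~1231 cycles per frame (derived)
assert FRAME_S <= STRIDE_S              # compute finishes before the next frame
print(f"{cycles_per_frame:.0f} cycles per frame; real-time constraint met")
```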
There have been several recent publications noting low power specialized hardware for keyword spotting. Those we were able to find with similar or lower power numbers are not of comparable complexity or accuracy to the networks we describe here. For instance, Wang et al. (2020) claim sub-300 nW power, but only detect a single keyword. Similarly, Giraldo and Verhelst (2018) claim less than 5 µW, but only detect 4 keywords and report accuracy in the low 90s. In contrast, Giraldo et al. (2020) use the SpeechCommands dataset, but the accuracy is 90.9% for 10.6 µW. A similar result is reported by Shan et al. (2020), who achieve 90.8% on this dataset for 16.11 µW. A main distinguishing feature of our result above is the high accuracy, which is usually very difficult to achieve in a power constrained setting. Our use of the LMU and HAT combine to provide SotA performance.
4.2 Hardware comparisons

Table 2: Summary of hardware power results for SpeechCommands keyword spotting.

Hardware            Model   Accuracy (%)   Power (µW)
ARM M4F             LMU-2   95.9           212
ARM M4F (optimal)   LMU-2   95.9           119
Syntiant NDP10x     –       94.0           170
This work           LMU-2   95.9           8.79

Because we use HAT, it is straightforward to run the LMU networks on different hardware and compare across them. In this section we compare an implementation on an off-the-shelf ARM M4F, on an 'idealized' ARM M4F, on our hardware design from the previous section, and on the Syntiant NDP10x special purpose keyword spotting chips (see Table 2).

We implemented the LMU keyword spotter on an ARM M4F clocked at 120 MHz, which processes 1 s of audio in 143,678 µs (0.14 s). This means that 17.24 million cycles are used to process one second of audio. The lowest power setting of the ARM M4 is rated at 12.26 µW/MHz on the ARM M4 datasheet (ARM, 2020), which results in 212 µW of power for this model. Thus our design from Section 4.1 is 24x more power efficient. A recent world-record efficiency was reported by Racyics and GlobalFoundries with a power efficiency of 6.88 µW/MHz (Höppner et al., 2019) for an ARM M4F. Using that idealized power efficiency, the ARM M4F would use 119 µW of power. This suggests that our design is 14x more power efficient than running on state-of-the-art low power general purpose hardware.

Holleman (2019) reports energy per frame on the SpeechCommands dataset for the Syntiant NDP10x special purpose chip at 3.4 µJ. For real-time computation with a standard window stride of 20 ms, the network needs to process 50 frames per second, which is well within the chip's 10 ms inference time. This rate of processing results in a power usage of 170 µW. Syntiant has also reported a power usage of 140 µW (Medici, 2020). As well, the Syntiant network achieves an accuracy of 94%, with a network size of 4456 kbits (assuming 8-bit weights, which is not reported). As a result, our network is more accurate at 95.9%, and with our hardware design it is 16-19x more power efficient than the Syntiant special purpose hardware.

Finally, we note that because the LMU is parameter efficient, with 12x fewer parameters than the Syntiant network, it potentially eliminates the need for special purpose hardware. Specifically, the LMU-2 running on the M4F uses 212 µW, compared to Syntiant's special purpose hardware at 170 µW. This suggests that the LMU-3 (with one third the parameters of LMU-2) will run for less power, while still achieving higher accuracy.
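The comparison numbers above reduce to simple duty-cycle and energy-rate arithmetic, reproduced here as a sketch with the constants from the text (results agree with the quoted figures up to rounding):

```python
# ARM M4F: power = efficiency (µW/MHz) x clock (MHz) x fraction of time busy
busy_fraction = 143_678e-6 / 1.0           # 143,678 µs of compute per 1 s of audio
m4f_uw = 12.26 * 120 * busy_fraction       # ~211.4 µW (reported as 212 µW)
m4f_ideal_uw = 6.88 * 120 * busy_fraction  # ~118.6 µW (reported as 119 µW)

# Syntiant NDP10x: energy per frame x real-time frame rate
ndp_uw = 3.4 * (1 / 0.020)                 # 3.4 µJ/frame x 50 frames/s = 170 µW

CUSTOM_UW = 8.79                           # the custom design from Section 4.1
print(m4f_uw / CUSTOM_UW, ndp_uw / CUSTOM_UW)  # ~24x and ~19x more efficient
```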
5 Conclusion

LMU-based keyword spotting networks are highly efficient, surpassing recent state-of-the-art results from Google, DarwinAI, and Syntiant in terms of accuracy, size, and power efficiency over a wide range. These improvements become more pronounced with special purpose designed hardware, resulting in a greater than 14x reduction in power use compared to current state-of-the-art offerings. Notably, these conclusions are drawn in the context of real-world, real-time deployment of keyword spotting systems.
References
ARM. Cortex-M4. https://developer.arm.com/ip-products/processors/cortex-m/cortex-m4, 2020. Accessed online (September 2020).

Fabio Frustaci, Mahmood Khayatzadeh, David Blaauw, Dennis Sylvester, and Massimo Alioto. SRAM for error-tolerant applications with dynamic energy-quality management in 28 nm CMOS. IEEE Journal of Solid-State Circuits, 50(5):1310–1323, 2015.

J. S. P. Giraldo and M. Verhelst. Laika: A 5uW programmable LSTM accelerator for always-on keyword spotting in 65nm CMOS. In ESSCIRC 2018 - IEEE 44th European Solid State Circuits Conference (ESSCIRC), pages 166–169, 2018.

J. S. P. Giraldo, S. Lauwereins, K. Badami, and M. Verhelst. Vocell: A 65-nm speech-triggered wake-up SoC for 10-µW keyword spotting and speaker verification. IEEE Journal of Solid-State Circuits, 55(4):868–878, 2020.

Jeremy Holleman. The speed and power advantage of a purpose-built neural compute engine, 2019. Accessed online (September 2020).

Sebastian Höppner and Christian Mayr. SpiNNaker2 - towards extremely efficient digital neuromorphics and multi-scale brain emulation. In NICE Workshop Conference Proceedings, 2018.

S. Höppner, H. Eisenreich, D. Walter, U. Steeb, A. S. Clifford Dmello, R. Sinkwitz, H. Bauer, A. Oefelein, F. Schraut, J. Schreiter, R. Niebsch, S. Scherzer, U. Hensel, J. Winkler, and M. Orgis. How to achieve world-leading energy efficiency using 22FDX with adaptive body biasing on an ARM Cortex-M4 IoT SoC. In ESSDERC 2019 - 49th European Solid-State Device Research Conference (ESSDERC), pages 66–69, 2019.

George Medici. Syntiant NDP101 microprocessor receives Linley Group's Analysts' Choice Award, 2020. Accessed online (September 2020).

Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirko Visontai, and Stella Laurenzo. Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720, 2020.

W. Shan, M. Yang, J. Xu, Y. Lu, S. Zhang, T. Wang, J. Yang, L. Shi, and M. Seok. 14.1 A 510nW 0.41V low-memory low-computation keyword-spotting chip using serial FFT-based MFCC and binarized depthwise separable convolutional neural network in 28nm CMOS. In 2020 IEEE International Solid-State Circuits Conference (ISSCC), pages 230–232, 2020.

Aaron R. Voelker, Ivana Kajić, and Chris Eliasmith. Legendre Memory Units: Continuous-time representation in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 15544–15553, 2019.

Dewei Wang, P. Chundi, S. Kim, Minhao Yang, J. P. Cerqueira, Joonsung Kang, Seungchul Jung, and Mingoo Seok. Always-on, sub-300-nW, event-driven spiking neural network based on spike-driven clock-generation and clock- and power-gating for an ultra-low-power intelligent device. arXiv preprint arXiv:2006.12314, 2020.

Pete Warden. Speech Commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.

Clifford Wolf. Yosys Open SYnthesis Suite. Accessed online (September 2020).

Alexander Wong, Mahmoud Famouri, Maya Pavlova, and Siddharth Surana. TinySpeech: Attention condensers for deep speech recognition neural networks on edge devices. arXiv preprint arXiv:2008.04245, 2020.

Makoto Yabuuchi, Koji Nii, Shinji Tanaka, Yoshihiro Shinozaki, Yoshiki Yamamoto, Takumi Hasegawa, Hiroki Shinkawata, and Shiro Kamohara. A 65 nm 1.0 V 1.84 ns Silicon-on-Thin-Box (SOTB) embedded SRAM with 13.72 nW/Mbit standby power for smart IoT. In Symposium on VLSI Circuits, pages C220–C221. IEEE, 2017.