Memory-efficient Speech Recognition on Smart Devices
Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra
Facebook Inc.
ABSTRACT
Recurrent transducer models have emerged as a promising solution for speech recognition on current and next generation smart devices. Transducer models provide competitive accuracy within a reasonable memory footprint, alleviating the memory capacity constraints in these devices. However, these models access parameters from off-chip memory for every input time step, which adversely affects device battery life and limits their usability on low-power devices. We address the transducer model's memory access concerns by optimizing its model architecture and designing novel recurrent cells. We demonstrate that i) the model's energy cost is dominated by accessing model weights from off-chip memory, ii) the transducer model architecture is pivotal in determining the number of accesses to off-chip memory, and model size alone is not a good proxy, and iii) our transducer model optimizations and novel recurrent cell reduce off-chip memory accesses by 4.5x and model size by 2.5x with minimal accuracy impact.

Index Terms — RNN-T, ASR, Recurrent Transducer, Automatic Speech Recognition, On-device Inference
1. INTRODUCTION

Speech is a natural interface for the "smart" devices around us, especially the emerging class of keyboard-/screen-less devices such as assistants, watches, and glasses, among others. Given the tremendous growth and adoption of these new devices, we expect speech to be the primary mode of interaction between humans and their devices going forward. As a result, there is a lot of interest in building on-device speech recognition to improve reliability and latency as well as to address user data privacy concerns. While previous attempts at building on-device ASR involved scaling down traditional, memory-heavy multi-model systems [1, 2] (acoustic/pronunciation/language models), recent work on recurrent transducer models [3] (originally described in [4]) has shown promise for the general problem of speech recognition using an end-to-end neural model. Their compact size addresses the memory capacity constraints while providing accuracy comparable to the larger server-side models.

While recurrent transducer models address the memory capacity constraints, their execution still relies on fetching weights from off-chip memory for every input speech frame. This repeated access of weights from off-chip memory makes the model inefficient because the cost of accessing off-chip memory is significantly higher than that of accessing on-chip buffers or performing computations (Figure 1).
Fig. 1. Hardware abstraction and energy cost breakdown [5, 6, 7]. Transducer model power cost is dominated by its access to off-chip memory to fetch weights.

The transducer model's memory-heavy behavior can limit its ability to run on low-end devices with slow off-chip memory. In this work, we significantly reduce off-chip memory accesses by redesigning the transducer model such that it can access model parameters primarily from on-chip buffers. We find that the number of off-chip memory accesses is not simply proportional to the model size but rather depends on the per-layer parameter count as well as on how we schedule computation across time steps and layers. This work makes the following contributions:
• An efficient transducer model architecture that reduces the number of off-chip memory accesses for model parameters.
• A novel recurrent cell design that models the cell state as a matrix instead of a vector. This gives us more flexibility in sizing a layer to fit within on-chip buffers.
• A memory-efficient model: we reduce memory accesses by 4.5x and model size by 2.5x without loss in accuracy.

2. BACKGROUND

This section presents an overview of recurrent transducer networks [3, 4] and the challenges in deploying them on low-end smart devices. The recurrent transducer network (Figure 2) consists of three components: the encoder works on the input audio stream, the prediction network uses the previously predicted symbols to guide the next symbol, and the joint network combines the two to produce the probability distribution over the next symbol. The encoder is the largest component of the transducer network and also executes the most often, because the number of input speech frames is much higher than the number of output word pieces. We therefore focus on the encoder in this work.

2.1. Encoder Network Architecture

The encoder accepts as input the audio stream preprocessed into features and runs it through a multi-layer LSTM [8] network (Figure 2) to produce a feature representation. Our encoder uses unidirectional LSTMs to support the streaming use case.
Compute Building Block: An LSTM takes as input the vector for the current time step (x_t) and the hidden state from the previous time step (h_{t-1}). It uses three gates, input (i), forget (f), and output (o), to update the cell memory (c) and produce a new hidden state vector (h_t) for the next time step. Current networks typically stack multiple LSTM layers, where the hidden state output of a layer is fed as input to the next layer (x^{l+1}). To aid training stability and efficiency, speech models such as RNN-T [3] use LayerNorm [9] (shown as ln in the equations) within the LSTM to normalize the cell, gate, and output calculations.

$$
\begin{aligned}
f^p_t, i^p_t, c^p_t, o^p_t &= \mathrm{ln}\left([W_f, W_i, W_c, W_o] \cdot [h_{t-1}, x_t]^T\right) \\
f_t, i_t, o_t &= \mathrm{sigmoid}(f^p_t, i^p_t, o^p_t) \\
\tilde{c}_t &= \tanh(c^p_t) \\
c_t &= \mathrm{ln}(f_t * c_{t-1} + i_t * \tilde{c}_t) \\
h_t &= o_t * \tanh(c_t)
\end{aligned}
$$
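For concreteness, here is a minimal PyTorch sketch of the layer-normalized LSTM cell described by these equations. The class and argument names (LNLSTMCell, input_dim, hidden_dim) are ours for illustration; the paper does not provide an implementation.

```python
import torch
import torch.nn as nn

class LNLSTMCell(nn.Module):
    """Layer-normalized LSTM cell: ln over all gate/cell pre-activations,
    plus ln on the cell update, matching the equations above."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # One fused projection computes all four pre-activations.
        self.proj = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)
        self.ln_gates = nn.LayerNorm(4 * hidden_dim)  # joint ln over gates and cell
        self.ln_cell = nn.LayerNorm(hidden_dim)

    def forward(self, x_t, h_prev, c_prev):
        pre = self.ln_gates(self.proj(torch.cat([h_prev, x_t], dim=-1)))
        f_p, i_p, c_p, o_p = pre.chunk(4, dim=-1)
        f_t, i_t, o_t = torch.sigmoid(f_p), torch.sigmoid(i_p), torch.sigmoid(o_p)
        c_t = self.ln_cell(f_t * c_prev + i_t * torch.tanh(c_p))
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t
```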
Time Reduction Layer: This layer merges features from neighboring time steps into a single embedding vector; a common reduction technique is concatenation. The layer is motivated by the observation that there are many more input speech frames (one every 10 ms or so) than output tokens (word pieces). Time reduction helps address this imbalance, which improves training stability [10]. A rough sketch of the reduction step appears below.
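The following sketch shows concatenation alongside the mean variant we adopt in Section 3. The function names, the reduction factor r, and the choice to drop the ragged tail of the utterance are our own assumptions.

```python
import torch

def time_reduce_concat(frames: torch.Tensor, r: int) -> torch.Tensor:
    """(T, d) -> (T//r, r*d): concatenation grows the feature dimension,
    so downstream weight shapes depend on r."""
    T, d = frames.shape
    T = (T // r) * r                     # drop the ragged tail for simplicity
    return frames[:T].reshape(T // r, r * d)

def time_reduce_mean(frames: torch.Tensor, r: int) -> torch.Tensor:
    """(T, d) -> (T//r, d): mean keeps the feature dimension fixed,
    so r can change without touching downstream weights."""
    T, d = frames.shape
    T = (T // r) * r
    return frames[:T].reshape(T // r, r, d).mean(dim=1)
```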
2.2. Encoder Scheduling Options

Execute one time step per encoder inference: The basic schedule runs the whole encoder network for every new speech frame. In this scheme, we fetch all the model parameters from off-chip memory at every time step.
Batch “B” time steps: An alternative is to wait for B speech frames and execute the encoder on all of them. By doing so, we fetch the weights for the input-to-hidden path (W_ih in Figure 2) only once per B time steps. However, we still need to fetch the weights for the hidden-to-hidden path (W_hh in Figure 2) B times from off-chip memory because of the sequential dependence between time step calculations.

In either scheme, we need to fetch model parameters from off-chip memory for each input speech frame. Section 3 discusses how to optimize away this need for repeated off-chip memory access of model parameters. A back-of-envelope comparison of the two schedules follows.
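This sketch makes the comparison concrete. The parameter-count formula (8d^2 weights per LSTM layer of dimension d, split evenly between W_ih and W_hh) is standard; the function and the example numbers are ours.

```python
def weight_fetches_per_step(d: int, B: int) -> float:
    """Off-chip weight elements fetched per time step for one LSTM layer."""
    w_ih = w_hh = 4 * d * d          # elements in W_ih and W_hh respectively
    return w_ih / B + w_hh           # W_ih amortized over B steps; W_hh is not

print(weight_fetches_per_step(640, 1))   # 3276800.0: everything refetched each step
print(weight_fetches_per_step(640, 8))   # 1843200.0: only the W_ih traffic shrinks
```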
2.3. Deployment Challenges

While LSTM-based encoder models achieve high accuracy for streaming speech recognition [3], there are many challenges in deploying them on smart devices.

Repeated access to model parameters: These models need to access the model parameters at each input time step, as shown in Section 2.2.
Large memory footprint: An LSTM layer has O(d^2) parameters, where d is the dimension of the input/hidden state. As a result, for any reasonable layer dimension, a complete layer (or even a subset such as W_hh) is too large to store on-chip. This large memory footprint, combined with the above observation about repeated access to model parameters, results in frequent access to off-chip memory. This is why model execution power is dominated by data access, as shown in Figure 1.

Fig. 2. Recurrent Transducer Network. [Figure: left, the transducer architecture (encoder over audio features x_0 .. x_T, prediction network over the previous predicted symbol y_{t-1}, and joint network producing y_t), with the encoder built as a multi-layer stacked LSTM over speech embeddings; right, the encoder scheduling options: executing one time step at a time (Matrix x Vector computation) versus batching "B" time steps (input-to-hidden: Matrix x Matrix; hidden-to-hidden: Matrix x Vector).]

Few optimization knobs: The design space for these models is very coarse in that the size of a layer is controlled primarily by just one knob, the layer dimension. Furthermore, reducing the layer dimension also limits the representation power of the cell memory and hidden state.
Prior work has looked at addressing these RNN-T inefficiencies by using LSTM variants such as CIFG [11, 12, 13]. While model size is a popular target metric for optimization, as we show in this work it is not a good proxy for execution efficiency. Hence, we instead use expected memory access counts as our optimization metric. Our work focuses on developing novel LSTM variants and sizing them appropriately so that we can maintain speech model accuracy while improving memory access efficiency. We demonstrate that there is scope for large, non-linear improvements in memory access efficiency by appropriately designing the RNN-T model. In the next section, we explore transducer model variants that improve the network's efficiency and flexibility.
3. OPTIMIZING TRANSDUCER MODELS
We inspect various components of the transducer encoder and propose variations to improve model efficiency. In particular, we optimize layer normalization and time reduction, simplify multi-layer stacking, and allow reducing the layer size without hurting its cell memory size. We show our modified cell equations and highlight the changes with a box.
Layer Normalization (ln) helps stabilize training by normalizing the output of the compute layers. The original implementation normalizes values jointly across all the gates (i, f, o) and the cell state (c). Given the very different downstream usage of the gates and the cell state, it is not clear why they should be normalized together. An alternative is to normalize them separately [14], but that makes it hard for the gates to all be small or all be large, which is needed to mostly forget or mostly carry over past state. We instead use layer normalization only on the cell state computation. In our experiments, this provides similar training stability while being more efficient at inference time because we can skip normalization of the gates.

$$
\begin{aligned}
f^p_t, i^p_t, c^p_t, o^p_t &= [W_f, W_i, W_c, W_o] \cdot [h_{t-1}, x_t]^T \\
f_t, i_t, o_t &= \mathrm{sigmoid}(f^p_t, i^p_t, o^p_t) \\
\tilde{c}_t &= \tanh\!\left(\boxed{\mathrm{ln}(c^p_t)}\right) \\
c_t &= \mathrm{ln}(f_t * c_{t-1} + i_t * \tilde{c}_t) \\
h_t &= o_t * \tanh(c_t)
\end{aligned}
$$

Internal Stacking: This technique simplifies layer stacking and reduces its cost. The conventional approach to multi-layer recurrent networks stacks another LSTM layer on top. Our approach instead explores stacking internal to the recurrent cell by building a deeper network for the cell memory and hidden state computation. This gives a network designer an additional knob for achieving extra depth with minimal increase in parameters and network complexity.

$$
\begin{aligned}
f^p_t, i^p_t, c^p_t, o^p_t &= [W_f, W_i, W_c, W_o] \cdot [h_{t-1}, x_t]^T \\
f_t, i_t, o_t &= \mathrm{sigmoid}(f^p_t, i^p_t, o^p_t) \\
\tilde{c}_t &= \tanh\!\left(\mathrm{ln}(\boxed{W_{ch}} \cdot c^p_t)\right) \\
c_t &= \mathrm{ln}(f_t * c_{t-1} + i_t * \tilde{c}_t) \\
h_t &= o_t * \tanh(c_t)
\end{aligned}
$$

Two-dimensional Cell Memory: We extend the traditional LSTM cell with a two-dimensional cell memory instead of a one-dimensional vector. The motivation is as follows: when we reduce the LSTM layer size by reducing the hidden state size, we also reduce the memory capacity of the layer, since the cell memory size is proportional to the hidden state size. In our approach, we model the cell memory as an h x v matrix, where h is the hidden state size and v is the number of cell memory vectors in the recurrent cell. By doing so, we can reduce the hidden state size without reducing the cell memory, by increasing v.

$$
\begin{aligned}
f^p_t, i^p_t, c^p_t, o^p_t &= [W_f, W_i, W_c, W_o] \cdot [h_{t-1}, x_t]^T \\
f_t, i_t, o_t &= \mathrm{sigmoid}(f^p_t, i^p_t, o^p_t) \\
Cm_t &= \boxed{\mathrm{ln}(W_{ch} \cdot c^p_t).\mathrm{view}(h, v)} \\
\tilde{C}_t &= \tanh(Cm_t) \\
C_t &= \mathrm{ln}(f_t * C_{t-1} + i_t * \tilde{C}_t) \\
Hm_t &= o_t * \tanh(C_t) \\
h_t &= \boxed{Hm_t.\mathrm{view}(h \times v)}
\end{aligned}
$$

Time Reduction via Mean: To reduce the number of tokens flowing through the encoder network, a common time reduction step is concatenation [3]. This comes at the cost of an increase in the number of parameters. We simplify the time reduction step by replacing concatenation with mean. By doing so, we can i) reduce the model size without impacting accuracy, and ii) reduce the number of memory accesses and compute in a pretrained model by changing the time reduction factor without any change to the network weight shapes. This allows us to quickly adapt a trained model to a smart device's memory constraints without having to train a new model. A combined sketch of the cell changes above follows.
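To make the combined cell concrete, here is a minimal PyTorch sketch folding together the changes above: LayerNorm on the cell path only, internal stacking via the extra projection W_ch, CIFG-style coupled gates (i_t = 1 - f_t, as in the IS-2D-CIFGR variants of Table 1), and a 2D cell state of shape (h, v). The paper gives only equations, so the gate shapes and broadcasting here are one plausible reading, not the authors' implementation.

```python
import torch
import torch.nn as nn

class IS2DCIFGCell(nn.Module):
    """Sketch: internally-stacked, 2D-cell, CIFG-style LSTM cell."""
    def __init__(self, input_dim: int, h: int, v: int):
        super().__init__()
        self.h, self.v = h, v
        # CIFG couples input and forget gates, so only f, c, o pre-activations.
        self.proj = nn.Linear(input_dim + h * v, 3 * h)
        self.w_ch = nn.Linear(h, h * v, bias=False)   # internal stacking (W_ch)
        self.ln_pre = nn.LayerNorm(h * v)             # LayerNorm on cell path only
        self.ln_cell = nn.LayerNorm((h, v))

    def forward(self, x_t, h_prev, C_prev):
        # x_t: (input_dim,)  h_prev: (h*v,)  C_prev: (h, v)
        f_p, c_p, o_p = self.proj(torch.cat([h_prev, x_t], dim=-1)).chunk(3, dim=-1)
        f_t = torch.sigmoid(f_p).unsqueeze(-1)        # (h, 1): broadcast over v
        o_t = torch.sigmoid(o_p).unsqueeze(-1)
        Cm_t = self.ln_pre(self.w_ch(c_p)).view(self.h, self.v)
        C_t = self.ln_cell(f_t * C_prev + (1.0 - f_t) * torch.tanh(Cm_t))
        h_t = (o_t * torch.tanh(C_t)).reshape(self.h * self.v)  # Hm_t.view(h*v)
        return h_t, C_t

cell = IS2DCIFGCell(input_dim=80, h=200, v=2)         # E7-like sizing
h, C = cell(torch.randn(80), torch.zeros(400), torch.zeros(200, 2))
```

Setting v = 1 recovers a CIFG cell with internal stacking (E5); increasing v lets the hidden dimension h shrink (480 to 256 to 200 in E6/E7) while the cell capacity h*v is preserved.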
4. RESULTS

In this section, we demonstrate that i) model size is a poor proxy for a model's expected off-chip memory accesses, ii) our transducer architecture optimizations and novel recurrent cell reduce the encoder size by 2.5x, and iii) we achieve super-linear savings in off-chip memory accesses of 4.5x, demonstrating that model size is not proportional to off-chip memory access. To demonstrate these points, we construct and train the following model variations (Table 1):

B: Our baseline network, based on the RNN-T paper [3], that uses multi-layer stacked LSTMs. We scale the model size down to below 40 MB. This model's encoder has two time reduction layers that each reduce the token count.

E1: We change the time reduction operator from concatenation to mean, with no impact on accuracy.

E2: We build a deeper variant of the network where each layer is skinnier (1.78x smaller). However, it does not converge.

E3: Replacing the LSTM with its residual [15] variant helps the network converge and recover the original accuracy.

E4: Replacing LSTMs with CIFG [11] helps us further reduce the model size for a very modest accuracy hit.

E5: Adding greater depth via internal stacking (IS) following the time reduction layer helps us recover accuracy with almost no increase in parameter count.

E6: Replacing CIFG with our novel two-dimensional cell memory (2D) allows us to reduce the hidden dimension from 480 to 256 with minimal accuracy impact. Our novel design reduces the hidden dimension of the recurrent cell without reducing the cell memory size.

E7: We further reduce the model size by building a deeper model with skinnier cells (200 hidden state).

Fig. 3. Reducing model parameter access from off-chip memory by exploiting on-chip buffers (Memory-Opt).
We start by demonstrating that designing a model to be on-chip buffer aware is critical. For this discussion, we focus on one LSTM layer with hidden state size H and assume it can work on T input samples at once (for streaming use cases T can be 4 - 16, incurring 100 - 200 ms model latency). As shown in Figure 3, when a layer's recurrent path (W_hh in Figure 2) fits on-chip (Memory-Opt), it accesses off-chip memory much less often (up to 8x less) than in the baseline scenario, where the layer fetches weights from off-chip memory for every speech frame. Since off-chip memory accesses dominate the transducer's execution (Figure 1), replacing them with on-chip buffer accesses enables efficient speech recognition on low-end devices. To realize these gains, each layer must be small enough to fit within on-chip buffers; as we show below, our optimizations reduce the per-layer size by more than 3x with minimal accuracy impact. A sketch of this accounting follows.
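A minimal sketch of the Memory-Opt accounting, under our own assumptions (int8 weights, a 1 MiB on-chip buffer, counting only weight traffic):

```python
def offchip_bytes_per_window(d: int, T: int, onchip_buffer: int,
                             bytes_per_weight: int = 1) -> int:
    """Off-chip weight bytes per T-step window for one LSTM layer."""
    w_ih = w_hh = 4 * d * d * bytes_per_weight
    if w_hh <= onchip_buffer:
        # Memory-Opt: W_hh pinned on-chip, W_ih streamed once per window,
        # so traffic drops by roughly T (the "up to 8x" above for T = 8).
        return w_ih + w_hh
    return T * (w_ih + w_hh)   # baseline: both weight blocks refetched per step

for d in (640, 256):
    print(d, offchip_bytes_per_window(d, T=8, onchip_buffer=1 << 20))
# 640 -> 26214400 (W_hh does not fit in 1 MiB); 256 -> 524288 (it fits)
```

The example also illustrates why per-layer size matters more than total size: a 256-wide layer crosses the fits-on-chip threshold that a 640-wide layer misses.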
We train the model variations on Librispeech [16]; we use the ADAM optimizer for 75 epochs, with 61 epochs at a constant learning rate of 0.0004 followed by polynomial scaling by a factor of 0.8. Table 1 shows the results for the various configurations. We highlight the following observations:

Significant Model Size Reductions: Our model and layer optimizations reduce the model size by 2x with minimal accuracy impact. Furthermore, we spread the smaller model across a greater number of layers, which reduces the average per-layer size by more than 3x. This makes the design amenable to utilizing on-chip buffers for each layer.
LayerNorm on Cell State is Enough: Experiments E5-E7 use layer normalization only for the cell state, and we did not see any network training stability issues.
Internal Stacking is Efficient (E4 -> E5): Using internal stacking to increase the depth by 2 improves accuracy with a negligible increase in model size.
2D Cell State Allows Hidden-state Reduction (E5 -> E7): We can reduce the hidden state dimension significantly (E6 and E7) with modest accuracy impact by representing the cell memory and hidden state as a 2D matrix. This allows us to reduce a layer's parameter count without impacting its representation power.
Table 1. Accuracy, Parameter Counts and Cell Variations

ID   Gate         H    Vec  Depth  Network Param (M)  Encoder Param (M)  WER
B    LSTM         640  1    8      37                 32.8               4.63
E1   LSTM         640  1    8      34                 29.5               4.63
E2   LSTM         480  1    11     28                 23.6               Diverges
E3   LSTMR        480  1    11     28                 23.6               4.63
E4   CIFGR        480  1    11     24                 19.5               4.74
E5   IS-CIFGR     480  1    11     24.5               20                 4.63
E6   IS-2D-CIFGR  256  2    11     22                 17.5               4.89
E7   IS-2D-CIFGR  200  2    12     18                 13.2               4.87
This section discusses the improvement in execution efficiency from our model optimizations. For this analysis, we assume a mobile system with a modest amount of on-chip buffer [17]. Table 2 shows that optimizing the transducer model architecture (E3) reduces off-chip memory accesses, and our novel recurrent cells (E7) bring them down to 0.22x of the baseline. We also note that the reductions in memory accesses are larger than the reductions in model size, reinforcing our observation that the savings come from a more efficient model design and not just from model size reduction.
Table 2. Model Efficiency Improvements

ID   Param (M)  WER (C)  Encoder Size  Off-chip Memory
B    37         4.63     1x            1x
E3   28         4.63     0.7x          –
E7   18         4.87     0.4x          0.22x

Our work significantly reduces the memory traffic of speech model inference without compromising accuracy. We believe this will be critical for providing high-quality speech support on next generation devices such as smart watches and AR glasses, among others, which will have limited memory resources as well as severe power constraints. Further efficiency gains are possible by combining our approach with pruning, quantization, and neural architecture search. Another related line of work is post-training adaptation of a trained speech recognition model for deployment on different target devices, since the memory traffic budget of a phone, a watch, and glasses will vary substantially. Our technique of using mean instead of concatenation in the time reduction layer enables adapting a model post-training. For example, by increasing the time reduction factor we can reduce memory accesses by more than 20% while recovering much of the accuracy with quick fine-tuning. This, in combination with other techniques such as LayerDrop [18], can provide significant deployment flexibility and reduce the need to train many different models. A toy illustration of this post-training knob follows.
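This snippet reuses the hedged time_reduce_mean sketch from Section 2.1; the shapes and numbers are ours, purely for illustration.

```python
import torch

frames = torch.randn(120, 80)              # 120 input frames of 80-dim features
tokens_r2 = time_reduce_mean(frames, r=2)  # reduction factor used at training time
tokens_r3 = time_reduce_mean(frames, r=3)  # larger factor applied post-training
# Same downstream weight shapes either way; only the token count changes.
print(tokens_r2.shape, tokens_r3.shape)    # torch.Size([60, 80]) torch.Size([40, 80])
```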
5. CONCLUSION

We propose an optimized transducer model architecture built with a novel recurrent cell design that reduces off-chip memory accesses by 4.5x and model size by 2.5x. With our architecture optimizations, we can enable high accuracy speech recognition support on low-end smart devices.

REFERENCES

[1] Alex Waibel, Ahmed Badran, Alan W Black, Robert Frederking, Donna Gates, Alon Lavie, Lori Levin, Kevin Lenzo, Laura Mayfield Tomokiyo, Jurgen Reichert, Tanja Schultz, Dorcas Wallace, Monika Woszczyna, and Jing Zhang, "Speechalator: two-way speech-to-speech translation on a consumer PDA," in EuroSpeech, 2003, pp. 369–372.
[2] Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Françoise Beaufays, and Carolina Parada, "Personalized speech recognition on mobile devices," CoRR, vol. abs/1603.03185, 2016.
[3] Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-Yiin Chang, Kanishka Rao, and Alexander Gruenstein, "Streaming end-to-end speech recognition for mobile devices," CoRR, vol. abs/1811.06621, 2018.
[4] Alex Graves, "Sequence transduction with recurrent neural networks," CoRR, vol. abs/1211.3711, 2012.
[5] Sha Rabii, Edith Beigne, Vikas Chandra, Barbara D. Salvo, Ron Ho, and Raj Pendse, "Computational and technology directions for augmented reality systems," IEEE Symposium on VLSI Circuits, Plenary, 2019.
[6] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," CoRR, vol. abs/1703.09039, 2017.
[7] Mark Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in IEEE International Solid-State Circuits Conference (ISSCC), Feb 2014, pp. 10–14.
[8] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[9] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," 2016.
[10] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell," CoRR, vol. abs/1508.01211, 2015.
[11] Klaus Greff, Rupesh K. Srivastava, Jan Koutnik, Bas R. Steunebrink, and Jurgen Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct 2017.
[12] Yuan Shangguan, Jian Li, Liang Qiao, Raziel Alvarez, and Ian McGraw, "Optimizing speech recognition for the edge," arXiv preprint, 2019.
[13] Tao Lei, Yu Zhang, and Yoav Artzi, "Training RNNs as fast as CNNs," CoRR, vol. abs/1709.02755, 2017.
[14] Jinyu Li, Rui Zhao, Hu Hu, and Yifan Gong, "Improving RNN transducer modeling for end-to-end speech recognition," 2019.
[15] Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee, "Residual LSTM: Design of a deep recurrent architecture for distant speech recognition," 2017.
[16] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.
[17] "Arm Cortex A77," https://en.wikichip.org/wiki/arm_holdings/microarchitectures/cortex-a77
[18] Angela Fan, Edouard Grave, and Armand Joulin, "Reducing transformer depth on demand with structured dropout," CoRR, vol. abs/1909.11556, 2019.
CoRR , vol. abs/1709.02755, 2017.[14] Jinyu Li, Rui Zhao, Hu Hu, and Yifan Gong, “Improving rnntransducer modeling for end-to-end speech recognition,” 2019.[15] Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee, “Resid-ual lstm: Design of a deep recurrent architecture for distantspeech recognition,” 2017.[16] Vassil Panayotov, Guogo Chen, Daniel Povey, and SanjeevKhudanpur, “Librispeech: An asr corpus based on public do-main audio books,” in , 2015, pp.5206–5210. [17] “Arm cortex a77,” https://en.wikichip.org/wiki/arm_holdings/microarchitectures/cortex-a77https://en.wikichip.org/wiki/arm_holdings/microarchitectures/cortex-a77