Temporal Convolution for Real-time Keyword Spotting on Mobile Devices

Seungwoo Choi*, Seokjun Seo*, Beomjun Shin*, Hyeongmin Byun, Martin Kersner, Beomsu Kim, Dongyoung Kim†, Sungjoo Ha†

Hyperconnect, Seoul, South Korea
{seungwoo.choi, seokjun.seo, beomjun.shin, hyeongmin.byun}@hpcnt.com
{martin.kersner, beomsu.kim, dongyoung.kim, shurain}@hpcnt.com

* Equal contributions, listed in alphabetical order. † Shared corresponding authors.

Abstract
Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in the field of deep learning have led to wide adoption of convolutional neural networks (CNNs) in KWS systems due to their exceptional accuracy and robustness. The main challenge faced by KWS systems is the trade-off between high accuracy and low latency. Unfortunately, there has been little quantitative analysis of the actual latency of KWS models on mobile devices. This is especially concerning since conventional convolution-based KWS approaches are known to require a large number of operations to attain an adequate level of performance.

In this paper, we propose a temporal convolution for real-time KWS on mobile devices. Unlike most of the 2D convolution-based KWS approaches that require a deep architecture to fully capture both low- and high-frequency domains, we exploit temporal convolutions with a compact ResNet architecture. On the Google Speech Commands Dataset, we achieve more than a 385x speedup on Google Pixel 1 and surpass the accuracy of the state-of-the-art model. In addition, we release the implementation of the proposed and the baseline models, including an end-to-end pipeline for training models and evaluating them on mobile devices.
Index Terms: keyword spotting, real-time, convolutional neural network, temporal convolution, mobile device
1. Introduction
Keyword spotting (KWS) aims to detect pre-defined keywords in a stream of audio signals. It is widely used for hands-free control of mobile applications. Since its use is commonly concentrated on recognizing wake-up words (e.g., "Hey Siri" [1], "Alexa" [2, 3], and "Okay Google" [4]) or distinguishing common commands (e.g., "yes" or "no") on mobile devices, the response of KWS should be both immediate and accurate. However, it is challenging to implement fast and accurate KWS models that meet the real-time constraint on mobile devices with restricted hardware resources.

Recently, with the success of deep learning in a variety of cognitive tasks, neural network based approaches have become popular for KWS [5, 6, 7, 8, 9, 10]. Especially, KWS studies based on convolutional neural networks (CNNs) show remarkable accuracy [6, 7, 8]. Most CNN-based KWS approaches receive features, such as mel-frequency cepstral coefficients (MFCC), as a 2D input to a convolutional network. Even though such CNN-based KWS approaches offer reliable accuracy, they demand considerable computation to meet a performance requirement. In addition, inference time on mobile devices has not been analyzed quantitatively; instead, indirect metrics have been used as a proxy for latency. Zhang et al. [7] presented the total number of multiplications and additions performed by the whole network. Tang and Lin [8] reported the number of multiplications of their network as a surrogate for inference speed. Unfortunately, it has been pointed out that the number of operations, such as additions and multiplications, is only an indirect alternative to a direct metric such as latency [11, 12, 13]. Neglecting memory access costs, and the fact that different platforms are equipped with varying degrees of optimized operations, are potential sources of this discrepancy. Thus, we focus on the measurement of actual latency on mobile devices.

In this paper, we propose a temporal convolutional neural network for real-time KWS on mobile devices, denoted as TC-ResNet. We apply temporal convolution, i.e., 1D convolution along the temporal dimension, and treat MFCC features as input channels. The proposed model utilizes the advantages of temporal convolution to enhance the accuracy and reduce the latency of mobile models for KWS. Our contributions are as follows:

• We propose TC-ResNet, a fast and accurate convolutional neural network for real-time KWS on mobile devices. According to our experiments on Google Pixel 1, the proposed model shows a 385x speedup and a 0.3%p increase in accuracy compared to the state-of-the-art CNN-based KWS model on the Google Speech Commands Dataset [14].

• We release our models for KWS and implementations of the state-of-the-art CNN-based KWS models [6, 7, 8], together with a complete benchmark tool to evaluate the models on mobile devices. Source code can be found at https://github.com/hyperconnect/TC-ResNet.

• We empirically demonstrate that temporal convolution is indeed responsible for the reduced computation and increased accuracy compared to 2D convolutions in KWS on mobile devices.
2. Network Architecture
Figure 1 is a simplified example illustrating the difference between 2D convolution and temporal convolution for KWS approaches utilizing MFCC as input data. Assuming that the stride is one and zero padding is applied to match the input and the output resolution, given input $X \in \mathbb{R}^{w \times h \times c}$ and weight $W \in \mathbb{R}^{k_w \times k_h \times c \times c'}$, 2D convolution outputs $Y \in \mathbb{R}^{w \times h \times c'}$.

MFCC is widely used for transforming raw audio into a time-frequency representation, $I \in \mathbb{R}^{t \times f}$, where $t$ represents the time axis ($x$-axis in Figure 1a) and $f$ denotes the feature axis extracted from the frequency domain ($y$-axis in Figure 1a). Most of the previous studies [7, 8] use an input tensor $X \in \mathbb{R}^{w \times h \times c}$ where $w = t$, $h = f$ (or vice versa), and $c = 1$ ($X \in \mathbb{R}^{t \times f \times 1}$ in Figure 1b).

CNNs are known to perform a successive transformation of low-level features into higher-level concepts. However, since modern CNNs commonly utilize small kernels, it is difficult to capture informative features from both low and high frequencies with a relatively shallow network (the colored box in Figure 1b only covers a limited range of frequencies). Assuming that one naively stacks $n$ convolutional layers of $3 \times 3$ weights with a stride of one, the receptive field of the network only grows up to $2n + 1$. We can mitigate this problem by increasing the stride or adopting pooling, attention, and recurrent units. However, many models still require a large number of operations even if we apply these methods, and have a hard time running in real time on mobile devices.

In order to implement a fast and accurate model for real-time KWS, we reshape the input from $X$ in Figure 1b to $X$ in Figure 1c. Our main idea is to treat the per-frame MFCC as time series data, rather than an intensity or grayscale image, which is a more natural way to interpret audio. We consider $I$ as one-dimensional sequential data whose features at each time frame are denoted as $f$. In other words, rather than transforming $I$ to $X \in \mathbb{R}^{t \times f \times 1}$, we set $h = 1$ and $c = f$, which results in $X \in \mathbb{R}^{t \times 1 \times f}$, and feed it as an input to the temporal convolution (Figure 1c). The advantages of the proposed method are as follows:
Large receptive field of audio features. In the proposed method, all lower-level features always participate in forming the higher-level features in the next layer. Thus, the model takes advantage of informative features in lower layers (the colored box in Figure 1c covers the whole range of frequencies), thereby avoiding stacking many layers to form higher-level features. This enables us to achieve better performance even with a small number of layers.
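To make the input treatment concrete, here is a minimal sketch (ours, not from the released code) of the two input layouts, using the example dimensions from Figure 1:

```python
import numpy as np

# Illustrative shapes matching the Figure 1 example: t = 98 frames, f = 40 MFCCs.
t, f = 98, 40
mfcc = np.random.randn(t, f).astype(np.float32)  # I in R^{t x f}

# Conventional 2D-convolution input: one grayscale-image-like channel.
x_2d = mfcc.reshape(t, f, 1)   # X in R^{t x f x 1}

# Proposed temporal-convolution input: frequency features become channels.
x_tc = mfcc.reshape(t, 1, f)   # X in R^{t x 1 x f}
```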
Small footprint and low computational complexity.
Applying the proposed method, the two-dimensional feature map shrinks in size if we keep the number of parameters the same, as illustrated in Figures 1b and 1c. Assuming that both the conventional 2D convolution, $W \in \mathbb{R}^{3 \times 3 \times 1 \times c}$, and the proposed temporal convolution, $W \in \mathbb{R}^{3 \times 1 \times f \times c'}$, have the same number of parameters (i.e., $c' = 3c/f$), the proposed temporal convolution requires a smaller number of computations than the 2D convolution (② is smaller than ① in Figure 1). In addition, the output feature map (i.e., the input feature map of the next layer) of the temporal convolution, $Y \in \mathbb{R}^{t \times 1 \times c'}$, is smaller than that of the 2D convolution, $Y \in \mathbb{R}^{t \times f \times c}$. The decrease in feature map size leads to a dramatic reduction of the computational burden and footprint in the following layers, which is key to implementing fast KWS.

We adopt ResNet [15], one of the most widely used CNN architectures, but utilize $m \times 1$ kernels ($m = 3$ for the first layer and $m = 9$ for the other layers) rather than $3 \times 3$ kernels (Figure 2). None of the convolution layers and fully connected layers have biases, and each batch normalization layer [16] has trainable parameters for scaling and shifting. The identity shortcuts can be used directly when the input and the output have matching dimensions (Figure 2a); otherwise, we use an extra conv-BN-ReLU to match the dimensions (Figure 2b). Tang and Lin [8] also adopted the residual network, but they did not employ a temporal convolution and used a conventional $3 \times 3$ kernel. In addition, they replaced strided convolutions with dilated convolutions of stride one. Instead, we employ temporal convolutions to increase the effective receptive field, and follow the original ResNet implementation for the other layers by adopting strided convolutions and excluding dilated convolutions.

Figure 1: A simplified example illustrating the difference between 2D convolution and temporal convolution. (a) MFCC, $I \in \mathbb{R}^{t \times f}$. (b) 2D convolution for conventional CNN-based KWS approaches: input $X \in \mathbb{R}^{t \times f \times 1}$, output $Y \in \mathbb{R}^{t \times f \times c}$, with ① MACs $= 3 \times 3 \times 1 \times f \times t \times c = 5{,}644{,}800$. (c) Proposed temporal convolution: input $X \in \mathbb{R}^{t \times 1 \times f}$, output $Y \in \mathbb{R}^{t \times 1 \times c'}$, with ② MACs $= 3 \times 1 \times f \times t \times 1 \times c' = 141{,}120$. Note that the parameters of the conventional 2D convolution and those of the temporal convolution have the same size in this example by setting $t = 98$, $f = 40$, $c = 160$, and $c' = 12$.
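As a quick arithmetic check of the MAC counts quoted in Figure 1 (a worked example, not part of the released code):

```python
# MAC counts for the Figure 1 example: t = 98, f = 40, c = 160, c' = 12.
t, f, c = 98, 40, 160
c_prime = 3 * c // f  # equal-parameter constraint: 9c = 3*f*c'  ->  c' = 3c/f

macs_2d = (3 * 3 * 1 * c) * (t * f)        # 3x3x1xc kernel slid over a t x f map
macs_tc = (3 * 1 * f * c_prime) * (t * 1)  # 3x1xfxc' kernel slid over a t x 1 map

print(c_prime)  # 12
print(macs_2d)  # 5644800
print(macs_tc)  # 141120
```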
We select TC-ResNet8 (Figure 2c), which has three residual blocks and {16, 24, 32, 48} channels for each layer including the first convolution layer, as our base model. TC-ResNet14 (Figure 2d) expands the network by incorporating twice as many residual blocks as TC-ResNet8. We introduce a width multiplier [17] ($k$ in Figures 2c and 2d) to increase (or decrease) the number of channels at each layer, thereby achieving flexibility in selecting the right capacity model for given constraints. For example, in TC-ResNet8, a width multiplier of 1.5 expands the model to have {24, 36, 48, 72} channels, respectively. We denote such a model by appending a multiplier suffix, e.g., TC-ResNet8-1.5. TC-ResNet14-1.5 is created in the same manner.
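The following is a minimal Keras sketch of TC-ResNet8 written from the description above. Details the text does not pin down are our assumptions and may differ from the released implementation at https://github.com/hyperconnect/TC-ResNet: the 1x1 shortcut kernel and the use of stride 2 in every residual block are read off Figure 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels, stride):
    """TC-ResNet block (Figures 2a/2b): two 9x1 temporal convolutions."""
    y = layers.Conv2D(channels, (9, 1), strides=(stride, 1),
                      padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(channels, (9, 1), strides=(1, 1),
                      padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or x.shape[-1] != channels:
        # Extra conv-BN-ReLU on the shortcut when dimensions differ (Figure 2b);
        # the 1x1 kernel here is our assumption.
        x = layers.Conv2D(channels, (1, 1), strides=(stride, 1),
                          padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return layers.ReLU()(layers.Add()([x, y]))

def tc_resnet8(t=98, f=40, num_classes=12, k=1.0):
    """TC-ResNet8: a 3x1 first conv plus three residual blocks (Figure 2c)."""
    inputs = tf.keras.Input(shape=(t, 1, f))  # MFCC features as channels
    x = layers.Conv2D(int(16 * k), (3, 1), strides=(1, 1),
                      padding="same", use_bias=False)(inputs)
    for c in (24, 32, 48):
        x = residual_block(x, int(c * k), stride=2)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, use_bias=False,
                           activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = tc_resnet8(k=1.0)  # k=1.5 yields TC-ResNet8-1.5
```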
3. Experimental Framework
We evaluated the proposed models and the baselines [6, 7, 8] using the Google Speech Commands Dataset [14]. The dataset contains 64,727 one-second-long utterance files which are recorded and labeled with one of 30 target categories. Following Google's implementation [14], we distinguish 12 classes: "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", silence, and unknown. Using the SHA-1 hashed names of the audio files, we split the dataset into training, validation, and test sets, with 80% training, 10% validation, and 10% test, respectively.
Data augmentation and preprocessing.
We followed Google's preprocessing procedure, which applies a random shift and noise injection to the training data. First, in order to generate background noise, we randomly sample and crop the background noises provided in the dataset, and multiply them by a random coefficient sampled from a uniform distribution $U(0, 0.1)$. The audio file is decoded into a float tensor and shifted by $s$ seconds with zero padding, where $s$ is sampled from $U(-0.1, 0.1)$. Then, it is blended with the background noise. The raw audio is decomposed into a sequence of frames following the settings of the previous study [8], where the window length is 30 ms and the stride is 10 ms for feature extraction. We use 40 MFCC features for each frame and stack them over the time axis.

Figure 2: The building block (denoted Block) of TC-ResNet when (a) stride = 1 and (b) stride = 2. (c) Architecture of TC-ResNet8 and (d) TC-ResNet14, which utilize ResNet8 and ResNet14 as the backbone CNN, respectively. BN and FC denote batch normalization and fully connected layer. Note that 's', 'c', and 'k' indicate stride, channel size, and width multiplier, respectively.
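A minimal sketch of the augmentation and feature-extraction steps just described, assuming 16 kHz one-second clips; the helper names, the 64 mel bins, and the exact log-mel construction are our assumptions (the text only fixes the 30 ms window, 10 ms stride, and 40 MFCCs):

```python
import numpy as np
import tensorflow as tf

SR = 16000  # Speech Commands clips are one second at 16 kHz

def augment(audio, noise):
    """Random shift plus background-noise injection, as described above."""
    # Shift by s seconds, s ~ U(-0.1, 0.1), padding the gap with zeros.
    shift = np.random.randint(-SR // 10, SR // 10 + 1)
    audio = np.roll(audio, shift)
    if shift > 0:
        audio[:shift] = 0.0
    elif shift < 0:
        audio[shift:] = 0.0
    # Blend with a randomly cropped noise clip scaled by U(0, 0.1).
    start = np.random.randint(0, len(noise) - SR + 1)
    return audio + np.random.uniform(0.0, 0.1) * noise[start:start + SR]

def features(audio):
    """40 MFCCs per frame: 30 ms window (480 samples), 10 ms stride (160).
    Yields 98 frames for a one-second clip, matching t = 98 in Figure 1."""
    stft = tf.signal.stft(audio, frame_length=480, frame_step=160)
    mel = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=64, num_spectrogram_bins=stft.shape[-1], sample_rate=SR)
    log_mel = tf.math.log(tf.matmul(tf.abs(stft) ** 2, mel) + 1e-6)
    return tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :40]
```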
Training. We trained and evaluated the models using TensorFlow [18]. We use a weight decay of 0.001 and dropout with a probability of 0.5 to alleviate overfitting. Stochastic gradient descent is used with a momentum of 0.9 on a mini-batch of 100 samples. Models are trained from scratch for 30k iterations. The learning rate starts at 0.1 and is divided by 10 every 10k iterations. We employ early stopping [19] on the validation split.
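A sketch of the corresponding optimizer configuration under these hyperparameters; treating weight decay as an L2 kernel penalty and the dropout placement are our assumptions:

```python
import tensorflow as tf

model = tc_resnet8(k=1.0)  # the TC-ResNet8 sketch from Section 2

# Learning rate 0.1, divided by 10 at 10k and 20k steps (30k steps total).
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[10000, 20000], values=[0.1, 0.01, 0.001])

# SGD with momentum 0.9; the mini-batch size of 100 is set in the input
# pipeline. The 0.001 weight decay would be added as an L2 kernel
# regularizer, and dropout (p = 0.5) before the final FC layer (assumptions).
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```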
Evaluation.
We use accuracy as the main metric to evaluate how well a model performs. We trained each model 15 times and report its average performance. Receiver operating characteristic (ROC) curves, of which the x-axis is the false alarm rate and the y-axis is the false reject rate, are plotted to compare different models. To extend the ROC curve to multiple classes, we perform micro-averaging over the classes in each experiment, then vertically average the curves over the experiments for the final plot.

We report the number of operations and parameters in a way that faithfully reflects the real-world environment for mobile deployment. Unlike previous works, which only reported the numbers for part of the computation, such as the number of multiply operations [8] or the number of multiplications and additions only in the matrix-multiplication operations [7], we include FLOPs [20], computed by the TensorFlow profiling tool [21], and the number of all parameters instead of only the trainable parameters reported by previous studies [8].

Inference speed can be estimated from FLOPs, but it is well known that FLOPs are not always proportional to speed. Therefore, we also measure inference time on a mobile device using the TensorFlow Lite Android benchmark tool [22]. We measured inference time on a Google Pixel 1 and forced the model to be executed on a single little core in order to emulate the always-on nature of KWS. The benchmark program measures the inference time 50 times for each model and reports the average. Note that the inference time is measured from the first layer of the model, which receives MFCC as input, to focus on the performance of the model itself.
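A sketch of the multi-class ROC aggregation described above, as we read it, using scikit-learn (function names are ours):

```python
import numpy as np
from sklearn.metrics import roc_curve

def micro_averaged_roc(scores, labels, num_classes=12):
    """Pool every (example, class) decision into a single binary ROC,
    reported as false-alarm rate vs. false-reject rate (miss rate)."""
    onehot = np.eye(num_classes)[labels]       # (N, C) binary ground truth
    far, tpr, _ = roc_curve(onehot.ravel(), scores.ravel())
    return far, 1.0 - tpr                      # false reject = 1 - TPR

def vertical_average(runs, grid=np.linspace(0.0, 1.0, 1000)):
    """Average the false-reject rates of repeated experiments on a common
    false-alarm grid; `runs` holds one (far, frr) pair per trained model."""
    return grid, np.mean([np.interp(grid, far, frr) for far, frr in runs],
                         axis=0)
```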
We carefully selected baselines and verified the advantages of the proposed models in terms of accuracy, the number of parameters, FLOPs, and inference time on mobile devices. Below are the baseline models:

• CNN-1 and CNN-2 [6]. We followed the implementation of [7], where the window size is 40 ms and the stride is 20 ms, using 40 MFCC features. CNN-1 and CNN-2 represent cnn-trad-fpool3 and cnn-one-fstride4 in [6], respectively.

• DS-CNN-S, DS-CNN-M, and DS-CNN-L [7]. DS-CNN utilizes depthwise separable convolutions. It aims to achieve the best accuracy when memory and computation resources are constrained. We followed the implementation of [7], which utilizes a 40 ms window size with a 20 ms stride and only uses 10 MFCCs to reduce the number of operations. DS-CNN-S, DS-CNN-M, and DS-CNN-L represent the small-, medium-, and large-size models, respectively.

• Res8, Res8-Narrow, Res15, and Res15-Narrow [8]. The Res variants employ a residual architecture for keyword spotting. The number following Res (e.g., 8 and 15) denotes the number of layers, and the -Narrow suffix indicates that the number of channels is reduced. Res15 has shown the best accuracy on the Google Speech Commands Dataset among the CNN-based KWS studies. The window size is 30 ms, the stride is 10 ms, and the MFCC feature size is 40.

We release our end-to-end pipeline codebase for training, evaluating, and benchmarking the baseline models together with the proposed models. It consists of TensorFlow implementations of the models, scripts to convert the models into TensorFlow Lite models that can run on mobile devices, and the pre-built TensorFlow Lite Android benchmark tool.
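For reference, a minimal sketch of the TFLite conversion step in such a pipeline, using the standard TensorFlow converter; the file name is illustrative, and our simplified model sketch stands in for the released implementation:

```python
import tensorflow as tf

# Convert a Keras model (e.g., the TC-ResNet8 sketch from Section 2) into a
# TFLite flatbuffer that the TensorFlow Lite Android benchmark tool can time.
model = tc_resnet8(k=1.0)  # from the earlier sketch; any tf.keras.Model works
converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open("tc_resnet8.tflite", "wb") as fp:
    fp.write(converter.convert())
```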
4. Experimental Results
Table 1 shows the experimental results. Utilizing the advantages of temporal convolutions, we dramatically improve the inference time measured on a mobile device while achieving better accuracy than the baseline KWS models. TC-ResNet8 achieves a 29x speedup while improving accuracy by 5.4%p compared to CNN-1, and improves accuracy by 11.5%p while maintaining a latency comparable to CNN-2. Since DS-CNN is designed for resource-constrained environments, it shows better accuracy than the naive CNN models without using a large number of computations. However, TC-ResNet8 achieves 1.5x / 4.7x / 15.3x speedups, and improves accuracy by 1.7%p / 1.2%p / 0.7%p, compared to DS-CNN-S / DS-CNN-M / DS-CNN-L, respectively. In addition, the proposed models show better accuracy and speed compared to the Res models, which show the best accuracy among the baselines. TC-ResNet8 achieves a 385x speedup while improving accuracy by 0.3%p compared to the deep and complex baseline, Res15.
Res odel Acc. Time FLOPs Params(%) ( ms )CNN-1 90.7 (cid:63)
32 76.1M 524KCNN-2 84.6 (cid:63) (cid:63) (cid:63) (cid:63) (cid:63)
47 143.2M 20KRes8 94.1 (cid:63)
174 795.3M 111KRes15-Narrow 94.0 (cid:63)
107 348.7M 43KRes15 (cid:63)
424 1950.0M 239KTC-ResNet8 96.1
Comparison of the baseline models and the proposedmodels. The numbers marked with (cid:63) are taken from the pa-per. The best result (accuracy and latency) among different ap-proaches are displayed in bold. F a l s e r e j e c t r a t e ( f a l s e n e g a t i v e ) CNN-1 (AUC: 5.22e-03)DS-CNN-L (AUC: 1.68e-03)Res15 (AUC: 1.13e-03)TC-ResNet14-1.5 (AUC: 9.02e-04)
Figure 3: ROC curves (x-axis: false alarm rate; y-axis: false reject rate, i.e., false negatives) for selected models with corresponding values of AUC: CNN-1 (AUC: 5.22e-03), DS-CNN-L (AUC: 1.68e-03), Res15 (AUC: 1.13e-03), and TC-ResNet14-1.5 (AUC: 9.02e-04).
Res15 . Compared to a slimmer
Res baseline,
Res8-Narrow , proposed
TC-ResNet8 achieves 43x speedup while im-proving 6%p accuracy. Note that our wider and deeper mod-els (e.g.,
TC-ResNet8-1.5 , TC-ResNet14 , and
TC-ResNet14-1.5 )achieve better accuracy at the expense of inference speed.We also plot the ROC curves of models which depict thebest accuracy among their variants:
CNN-1 , DS-CNN-L , Res15 ,and
TC-ResNet14-1.5 . As presented in Figure 3,
TC-ResNet14-1.5 is less likely to miss target keywords compared to otherbaselines assuming that the number of incorrectly detected key-words is the same. The small area under the curve (AUC) meansthat the model would miss fewer target keywords on average forvarious false alarm rates.
TC-ResNet14-1.5 shows the smallestAUC, which is critical for good user experience with KWS sys-tem.
We demonstrated that the proposed method effectively improves both accuracy and inference speed compared to the baseline models, which treat the feature map as a 2D image. We further explore the impact of the temporal convolution by comparing variants of TC-ResNet8, named 2D-ResNet8 and 2D-ResNet8-Pool, which adopt a similar network architecture and the same number of parameters but utilize 2D convolutions.

Model            Acc. (%)  Time (ms)  FLOPs   Params
2D-ResNet8       96.1      10.1       35.8M   66K
2D-ResNet8-Pool  94.9      3.5        4.0M    66K

Table 2: Comparison of the TC-ResNet8 variants, 2D-ResNet8 and 2D-ResNet8-Pool, which utilize 2D convolutions while retaining the architecture and the number of parameters of TC-ResNet8.

We designed 2D-ResNet8, whose architecture is identical to TC-ResNet8 except for the use of 3x3 2D convolutions. 2D-ResNet8 (in Table 2) shows comparable accuracy, but is 9.2x slower than TC-ResNet8 (in Table 1). TC-ResNet8-1.5 is able to surpass 2D-ResNet8 while using less computational resources.

We also demonstrate that the use of temporal convolution is superior to other methods of reducing the number of operations in CNNs, such as applying a pooling layer. In order to reduce the number of operations while minimizing the accuracy loss, CNN-1, Res8, and Res8-Narrow adopt average pooling at an early stage, specifically right after the first convolution layer. We inserted an average pooling layer, with both the window size and the stride set to 4, after the first convolution layer of 2D-ResNet8, and named it 2D-ResNet8-Pool. 2D-ResNet8-Pool improves inference time with the same number of parameters; however, it loses 1.2%p accuracy and is still 3.2x slower than TC-ResNet8.
5. Related Works
Recently, there has been wide adoption of CNNs in KWS. Sainath et al. [6] proposed small-footprint CNN models for KWS. Zhang et al. [7] searched for and evaluated proper neural network architectures within memory and computation constraints. Tang and Lin [8] exploited a residual architecture and dilated convolutions to achieve further improvement in accuracy while preserving compact models. In previous studies [6, 7, 8], it has been common to use 2D convolutions for inputs with time-frequency representations. However, there has been an increase in the use of 1D convolutions in the acoustics and speech domain [23, 24]. Unlike previous studies [23, 24], our work applies 1D convolution along the temporal axis of time-frequency representations instead of convolving along the frequency axis or processing raw audio signals.
6. Conclusion
In this investigation, we aimed to implement fast and accurate models for real-time KWS on mobile devices. We measured inference speed on a mobile device, Google Pixel 1, and provided a quantitative analysis of conventional convolution-based KWS models and our models utilizing temporal convolutions. Our proposed model achieved a 385x speedup while improving accuracy by 0.3%p compared to the state-of-the-art model. Through an ablation study, we demonstrated that temporal convolution is indeed responsible for the dramatic speedup while improving the accuracy of the model. Further studies analyzing the efficacy of temporal convolutions for a diverse set of network architectures would be worthwhile.

7. References

[1] S. Sigtia, R. Haynes, H. Richards, E. Marchi, and J. Bridle, "Efficient voice trigger detection for low resource hardware," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018.
[2] M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, "Compressed time delay neural network for small-footprint keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017.
[3] G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, "Model compression applied to small-footprint keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016.
[4] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[5] Z. Wang, X. Li, and J. Zhou, "Small-footprint keyword spotting using deep neural network and connectionist temporal classifier," arXiv preprint arXiv:1709.03665, 2017.
[6] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.
[7] Y. Zhang, N. Suda, L. Lai, and V. Chandra, "Hello edge: Keyword spotting on microcontrollers," arXiv preprint arXiv:1711.07128, 2017.
[8] R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[9] D. C. de Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, "A neural attention model for speech command recognition," arXiv preprint arXiv:1808.08929, 2018.
[10] S. Ö. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, "Convolutional recurrent neural networks for small-footprint keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017.
[11] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[12] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," arXiv preprint arXiv:1807.11626, 2018.
[13] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[14] P. Warden. (2017, August) Launching the Speech Commands Dataset. [Online]. Available: https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the International Conference on Machine Learning (ICML), 2015.
[17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[18] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[19] L. Prechelt, "Early stopping - but when?" in Neural Networks: Tricks of the Trade. Springer, 1998, pp. 55-69.
[20] S. Arik, H. Jun, and G. Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," arXiv preprint arXiv:1808.06719, 2018.
[21] TensorFlow Profiler and Advisor. [Online]. Available: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md
[22] TFLite Model Benchmark Tool. [Online]. Available: https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/lite/tools/benchmark/
[23] H. Lim, J. Park, K. Lee, and Y. Han, "Rare sound event detection using 1D convolutional recurrent neural networks," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017.
[24] K. Choi, G. Fazekas, M. Sandler, and K. Cho, "Convolutional recurrent neural networks for music classification," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.