Neural Architecture Search For Keyword Spotting
Tong Mo∗, Yakun Yu∗, Mohammad Salameh, Di Niu, Shangling Jui
University of Alberta, Edmonton, AB, Canada; Huawei Technologies
{tmo1, yakun2, dniu}@ualberta.ca, {mohammad.salameh, jui.shangling}@huawei.com
∗ Equal contributions, listed in alphabetical order.

Abstract
Deep neural networks have recently become a popular solution to keyword spotting systems, which enable the control of smart devices via voice. In this paper, we apply neural architecture search to search for convolutional neural network models that can help boost the performance of keyword spotting based on features extracted from acoustic signals, while maintaining an acceptable memory footprint. Specifically, we use differentiable architecture search techniques to search for operators and their connections in a predefined cell search space. The found cells are then scaled up in both depth and width to achieve competitive performance. We evaluated the proposed method on Google's Speech Commands Dataset and achieved a state-of-the-art accuracy of over 97% on the setting of 12-class utterance classification commonly reported in the literature.
Index Terms: Keyword Spotting, Neural Architecture Search
1. Introduction
Keyword spotting (KWS) aims to identify a set of keywords in utterances. KWS was traditionally performed in the cloud based on audio recordings uploaded by users [1]. Nowadays, on-device KWS applications are becoming increasingly popular, e.g., Apple's "Siri", Microsoft's "Cortana" and Amazon's "Alexa", which help preserve user privacy and avoid data leakage during transmission. The deployment of KWS models on resource-constrained smart devices requires a small footprint while retaining accuracy. Thus, small-footprint KWS focuses on the recognition of simple commands, such as "yes", "no", "on" and "off", which are sufficient to support frequent user-device interactions.

In recent years, various convolutional neural networks (CNNs) have been applied to KWS and achieved remarkable results. Sainath et al. [2] introduce CNNs into KWS and show that CNNs perform well when limiting the number of parameters. Tang et al. [1] use variants of the deep residual network (ResNet) to build a neural KWS model, and achieve an accuracy of 95.8% with 239K parameters on the Google Speech Commands Dataset (v1) [3] using Res15. Choi et al. [4] combine temporal convolutions with ResNet to construct TC-ResNet models and improve the accuracy to 96.6% with 305K parameters. Mittermaier et al. [5] use parameterized Sinc-convolutions from SincNet to classify keywords based on raw audio, and reduce the number of parameters to 122K while maintaining the accuracy of TC-ResNet. Kao et al. [6] propose a sub-band CNN architecture that applies different convolutional kernels to each feature sub-band, and achieve an accuracy of around 90.0% on the second version of the Google Speech Commands Dataset while reducing the computation by 39.7% compared to a full-band CNN model. Zeng et al. [7] use DenseNet with BiLSTM and achieve an accuracy of 96.2% following Google's setup [2]. Pons et al. [8] propose a model that uses randomly weighted CNNs as feature extractors to conduct audio classification. Chen et al. [9] propose a compact and efficient convolutional network (CENet) for small-footprint KWS, and insert a graph convolutional network (GCN) into CENet for contextual feature augmentation (CENet-GCN), which achieves an accuracy of 96.8% with 72.3K parameters when only using Mel-frequency Cepstrum Coefficient (MFCC) features as the input. Majumdar et al. [10] propose MatchboxNet, which contains residual blocks of 1D time-channel separable convolutions, batch normalization (BN), ReLU, and dropout layers, achieving an accuracy of around 97.48% with 93K parameters, though on a different setting of 30-class utterance classification with the help of data augmentation (while the majority of the literature evaluates a 12-class benchmark).

In this paper, we propose to use Neural Architecture Search (NAS) to automate the neural network architecture design for KWS. NAS is widely used and evaluated for image classification and language modeling tasks. Zoph et al. [11] first use a reinforcement learning approach to train a neural network architecture with the maximum validation accuracy on CIFAR-10. However, this method is computationally expensive, requiring hundreds of GPUs, and the model could not be transferred to a large dataset.
The same authors then design a NASNet search space to search for the best convolutional layer (or "cell") and stack copies of this cell to form a NASNet architecture [12]. Though NASNet is trained faster and able to generalize to larger datasets, the whole search process still takes over four days with 500 GPUs. Other NAS methods, e.g., AmoebaNet [13] and Progressive NAS [14], have been proposed to further optimize the search process. However, all of them search over a discrete domain where more architecture evaluations are required. To make the search space continuous, Liu et al. [15] propose differentiable architecture search (DARTS) and enable the efficient search of neural architectures through gradient descent.

To date, there have been some efforts on NAS for KWS, although not achieving state-of-the-art results. Veniat et al. [16] propose a stochastic adaptive neural architecture search approach for KWS that automatically adapts the architecture by a recurrent neural network (RNN) according to the difficulty of the prediction problem at each time step, and achieve an 86.5% prediction accuracy on the Google Speech Commands Dataset [3]. Anderson et al. [17] propose a performance-oriented neural architecture search approach based on information about the hardware and achieve a 95.11% prediction accuracy on the same dataset.

Figure 1: The composition of the convolutional neural network to be searched for. A stack of six cells is used during search, where each blue rectangle represents a normal cell, while each yellow one represents a reduction cell. Once the best cells are found, the network can be scaled up in both depth and width.

In this paper, we leverage DARTS [15], a gradient-based differentiable NAS technique, to search for the best convolutional network architecture for KWS. A typical NAS process involves searching for the best architecture for a given task, followed by training the found best architecture from scratch. The search process involves three dimensions [18]. The search space defines which architectures are considered and the operations that compose them.
Search strategies define the strategy used to explore the search space, e.g., reinforcement learning (RL) [11, 12, 19, 20], evolutionary algorithms [21, 22, 13], and gradient-based approaches [23, 15, 24, 25]. It is computationally intensive to evaluate each architecture proposed by the search strategy by training it from scratch.
Performance estimation estimates the performance of an architecture without the need to train it fully. Research in NAS aims to improve along these dimensions in order to discover highly performing architectures while minimizing the search cost (in terms of GPU days).

We choose a differentiable NAS approach, DARTS, because of its remarkable efficiency compared to earlier NAS techniques that operate in a discrete domain based on RNN controllers [11, 12]. DARTS can finish searching in a single GPU day. Besides, DARTS does not rely on performance predictors [14] and can find architectures with complex structures in a rich search space.

We evaluate the proposed NAS method on the public Google Speech Commands Dataset [3]. Our experimental results show that the proposed method can find architectures that achieve a state-of-the-art accuracy of over 97% on the common benchmark setting of 12-class utterance classification, which is the same evaluation setting adopted by most KWS literature [26, 1, 4, 9, 5, 16].
2. Method
We search for a convolutional neural network (CNN) to optimize the classification performance based on a matrix of MFCC features extracted from each audio sample. As shown in Figure 1, the CNN we search for is composed of a head layer that performs a preliminary convolution, followed by a sequence of L stacked layers, each called a cell, and finally a stem that performs the classification. The preprocessing procedure that converts audio into MFCC features is described in Section 3. To reduce the complexity of the search, we search for cell architectures rather than searching for the entire network architecture. As illustrated in Figure 1, two types of cells are searched for: normal cells and reduction cells. A normal cell ensures that the size of its output is the same as that of its input by using a stride of one. A reduction cell, on the other hand, doubles the number of channels and halves the height and width of its input. All the normal cells share the same neural architecture, and so do all the reduction cells. Once the best normal cell and reduction cell architectures are found in the search phase, we scale up the depth and width of the network by stacking the found cells and tuning the number of channels at the initial layer. When stacking cells sequentially to form a deeper architecture, the same stacking rule applies: every two normal cells are followed by a reduction cell.
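To make the macro-structure explicit, the following PyTorch-style sketch assembles a network from cell modules under the stacking rule just described (every two normal cells followed by a reduction cell). The `normal_cell`/`reduction_cell` factories, layer sizes, and the plain sequential wiring (each cell seeing only the previous cell's output, rather than the two preceding cells used in the cell design below) are simplifying assumptions for illustration only, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class KWSNet(nn.Module):
    """Sketch of the macro-architecture: head conv -> stacked cells -> classifier."""
    def __init__(self, normal_cell, reduction_cell, num_cells=6, init_channels=16, num_classes=12):
        super().__init__()
        self.head = nn.Conv2d(1, init_channels, kernel_size=3, padding=1)  # preliminary convolution
        cells, c = [], init_channels
        for i in range(num_cells):
            if (i + 1) % 3 == 0:                         # every third cell is a reduction cell
                cells.append(reduction_cell(c, 2 * c))   # doubles channels, halves H and W
                c *= 2
            else:
                cells.append(normal_cell(c, c))          # keeps the feature-map size
        self.cells = nn.Sequential(*cells)
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, num_classes))

    def forward(self, x):                                # x: (batch, 1, 40 MFCCs, 101 frames)
        return self.classifier(self.cells(self.head(x)))

if __name__ == "__main__":
    # Hypothetical plain-conv stand-ins for the searched cells, just to exercise the skeleton.
    normal = lambda c_in, c_out: nn.Conv2d(c_in, c_out, 3, padding=1)
    reduce_ = lambda c_in, c_out: nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
    net = KWSNet(normal, reduce_, num_cells=6, init_channels=16, num_classes=12)
    print(net(torch.randn(2, 1, 40, 101)).shape)         # torch.Size([2, 12])
```

Under this scheme, depth and width are scaled simply by changing `num_cells` and `init_channels`.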
Figure 2: An illustration of the inner search space of a cell. Each circle is an operation in the operation set O, while the solid ones are those finally selected by the search algorithm.

We leverage a cost-efficient differentiable architecture search algorithm, DARTS [15], to find the best normal and reduction cell architectures for KWS. Specifically, a cell can be represented by a directed acyclic graph (DAG) consisting of ordered nodes and directed edges, as shown in Figure 2. There are two inputs to a cell (green), which correspond to the outputs of the previous two cells, while the output of the cell (yellow) is a concatenation of all intermediate nodes.

Each node is a latent representation, while each edge comprises mixed operations from a predefined operation set O, e.g., convolutions of different kernel sizes, max-pooling, etc. A directed edge connecting node i and node j represents the direction of information flow and performs a weighted sum $f_{i,j}(x_i)$ of all possible operations $o(\cdot) \in \mathcal{O}$ applied to the latent representation $x_i$ of node $i$, i.e.,

$$f_{i,j}(x_i) = \sum_{o \in \mathcal{O}} \alpha_{(i,j),o}\, o(x_i).$$

Let $\alpha_{(i,j)}$ denote the vector of $\alpha_{(i,j),o}$'s for all $o(\cdot) \in \mathcal{O}$ on edge $(i,j)$. The weights $\alpha_{(i,j)}$'s are learnable parameters, which encode the cell structure. The latent representation $x_j$ of each intermediate node $j$ is then computed as the sum of the outputs from all its preceding nodes, i.e.,

$$x_j = \sum_{i < j} f_{i,j}(x_i).$$

At the end of the search, the operation $o$ with the highest weight $\alpha_{(i,j),o}$ on edge $(i,j)$ is selected, as illustrated in Figure 2 by the solid circles. Only the selected operations and the edges connected to them are kept to produce the resulting cell architecture.
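The mixed-edge relaxation above can be made concrete with a short PyTorch-style sketch. Following DARTS [15], the mixing weights are obtained by a softmax over unconstrained parameters, and the highest-weighted operation on each edge is kept after search. The operation list here is a reduced, illustrative subset of the search spaces defined later in Table 1, and the module names are our own; this is a sketch of the general mechanism, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def candidate_ops(channels):
    """Illustrative candidate operation set O for one edge (assumed, reduced list)."""
    return nn.ModuleList([
        nn.Identity(),                                      # skip connection
        nn.MaxPool2d(3, stride=1, padding=1),               # 3x3 max pooling
        nn.AvgPool2d(3, stride=1, padding=1),               # 3x3 average pooling
        nn.Sequential(nn.ReLU(),                            # ReLU-Conv-BN block
                      nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                      nn.BatchNorm2d(channels)),
    ])

class MixedEdge(nn.Module):
    """One edge (i, j): f_ij(x_i) = sum_o softmax(alpha)_o * o(x_i)."""
    def __init__(self, channels):
        super().__init__()
        self.ops = candidate_ops(channels)
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))  # learnable architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

    def discretize(self):
        """After search, keep only the operation with the largest weight."""
        return self.ops[int(self.alpha.argmax())]

if __name__ == "__main__":
    edge = MixedEdge(16)
    y = edge(torch.randn(1, 16, 40, 101))   # mixed output used during search
    best_op = edge.discretize()             # operation kept for the final cell
```

During search, the architecture parameters are updated by gradient steps on the validation loss while the ordinary network weights are updated on the training loss, as in DARTS [15].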
3. Performance Evaluation

We evaluate the proposed method for keyword spotting on the Google Speech Commands Dataset (v1) [3]. This dataset contains 65,000 one-second-long audio utterances pertaining to 30 words, with approximately 2,200 samples per word. Following the same setting as [26, 19, 4, 9, 5], we cast the problem as a classification task that distinguishes among 12 classes: "yes", "no", "up", "down", "left", "right", "on", "off", "go", "stop", an unknown class, and a silence class. The unknown class contains utterances sampled from the remaining 20 words other than the above ten words, while the silence class contains utterances with only background noise. We split the entire dataset into 40% training, 40% validation, and 20% testing sets. The training and validation sets are used during architecture search, and are then combined to form a new training set for evaluating the best architecture on the test set.

We follow the preprocessing procedure of Honk [26] to process the acoustic signals, which is adopted by multiple small-footprint KWS studies [26, 1, 2, 4, 9]. To generate training data, we first add background noise to each sample with 80% probability at each epoch, followed by a random time shift of t seconds applied to each sample, where t is drawn from a uniform distribution, to enhance robustness. Then, we apply a 20 Hz/4 kHz filter. Finally, each raw audio file is split into 101 frames using a window size of 30 milliseconds and a frame shift of 10 milliseconds. We extract 40 Mel-frequency cepstral coefficient (MFCC) features for each frame and stack them across the time axis.
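As a rough illustration of this feature-extraction step, the snippet below computes a 40 × 101 MFCC matrix for a one-second utterance. The use of librosa and a 16 kHz sampling rate are our assumptions (the paper follows Honk's pipeline [26]), but the 30 ms window, 10 ms frame shift, and 40 coefficients match the description above; the augmentation steps (noise, time shift, filtering) are omitted.

```python
import numpy as np
import librosa

SAMPLE_RATE = 16000          # assumed 16 kHz audio, one-second clips

def extract_mfcc(waveform: np.ndarray) -> np.ndarray:
    """Return a (40, 101) matrix of MFCC features for a 1-second waveform."""
    return librosa.feature.mfcc(
        y=waveform,
        sr=SAMPLE_RATE,
        n_mfcc=40,                              # 40 coefficients per frame
        win_length=int(0.030 * SAMPLE_RATE),    # 30 ms analysis window
        hop_length=int(0.010 * SAMPLE_RATE),    # 10 ms frame shift
        n_fft=512,
    )                                            # frames are stacked along the time axis

if __name__ == "__main__":
    audio = np.random.randn(SAMPLE_RATE).astype(np.float32)  # stand-in for a loaded utterance
    print(extract_mfcc(audio).shape)            # (40, 101)
```

With a 10 ms hop and centered framing, a one-second clip yields exactly the 101 frames mentioned above.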
During neural architecture search, we set the number of cells to 6 and train the network for 50 epochs. The batch size and the initial number of channels are both set to 16 to ensure that the network fits into one GPU. We use stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay to update the network weights ω; the learning rate for ω is set to 0.025 and follows a cosine annealing schedule. The architecture parameters α are optimized with Adam, using a separate weight decay and initial learning rate.

During the evaluation, we instantiate the network to be tested based on the best cell architectures with the highest validation score found in the search phase, and experiment with depths of 6 and 12. A network of depth 6 is illustrated in Figure 1; a network of depth 12 is obtained by stacking the 6-cell network twice. We randomly re-initialize the weights in the network and re-train it from scratch for 200 epochs to report the evaluation results.

Figure 3: The normal cell found on the NAS1 search space.

Figure 4: The reduction cell found on the NAS1 search space.

In our cell searches, each normal/reduction cell consists of 7 nodes. Table 1 summarizes the candidate operations used. In total, 7 candidate operations have been considered: skip connection (or identity), zero, average pooling, max pooling, dilated convolution, separable convolution (depthwise separable convolution), and regular convolution. Zero means no connection between two nodes, while identity represents an identity mapping. The dilated convolution introduces a dilation rate (set to two in our experiments) into the regular convolution. Each convolution operation follows the sequence ReLU, Convolution, Batch Normalization (BN). Each separable convolution executes two ReLU-Conv-BN sequences.

Table 1: Candidate operations in the two search spaces.

Model | Search space
NAS1  | zero, 3×3 max pooling, 3×3 average pooling, identity, 3×3 and 5×5 dilated convolutions, 3×3, 5×5, and 7×7 separable convolutions
NAS2  | zero, 3×3 max pooling, 3×3 average pooling, identity, 3×3 and 5×5 dilated convolutions, 3×3 regular convolution

As shown in Table 1, we conduct searches over two sets of operators. NAS1 uses separable convolutions, dilated convolutions, and pooling, while NAS2 uses regular convolutions instead of separable convolutions. A separable convolution consists of a depth-wise convolution conducted independently over each channel of the input, followed by a point-wise convolution, i.e., a 1 × 1 convolution, to combine information across channels [27, 28]. Dilated convolutions are known to expand the receptive field exponentially without loss of coverage [29], while separable convolutions reduce the number of parameters and the computational cost [30]. Separable convolutions are also frequently used in the neural KWS literature [5, 10] to improve performance and reduce model size. NAS2, on the other hand, uses the regular convolution, which is the convolutional operation traditionally used in ResNet and has been applied to KWS by [1]. NAS2 thus considers the same operations used in [1], allowing us to test whether our search strategy can produce architectures that beat traditional ResNet models [1] when using similar operations.

We evaluate the models discovered under the NAS1 and NAS2 search spaces in terms of accuracy and model size, under different scaling-up settings, by varying the depth (number of cells) and the initial number of channels. We compare to the following baseline models, which utilize CNN blocks and are evaluated on the same dataset and 12 classes as our method*:

• Res15: a ResNet variant based on regular convolutions achieving the highest accuracy in [1]. It consists of 6 residual blocks and 45 feature maps.
• TC-ResNet14-1.5: a ResNet variant achieving the highest accuracy in [4], which uses temporal convolutions instead of regular convolutions to reduce the footprint. It uses 6 residual blocks, and a width multiplier of 1.5 is applied to expand the number of channels at each layer.
• SincConv+DSConv: the best model reported in [5], which first uses the Sinc-convolution to extract features from raw audio and then applies separable convolutions with a kernel length of 9 to reduce the model size.
• CENet-GCN-40: the best model in [9], which mainly consists of bottleneck blocks and a GCN module. Each bottleneck block is a stack of 1 × 1, 3 × 3, and 1 × 1 convolutions to reduce model complexity. The GCN module is introduced to learn non-local relations of convolutional features.

* MatchboxNet [10] proposes a deep residual network and achieves state-of-the-art results on Google Speech Commands Dataset v1 on 30 keyword classes; its setup is therefore not comparable to ours or the listed baselines. It also uses data augmentation, e.g., time shift perturbations and SpecAugment, to boost performance, which is not used in our method or the listed baselines. Similarly, we do not compare to DenseNet-BiLSTM [7], which relies on an attention BiLSTM.

Figure 5: The normal cell found on the NAS2 search space.

Figure 6: The reduction cell found on the NAS2 search space.

Figures 3-6 illustrate the cells found on each search space. The search costs for NAS1 and NAS2 remain at a low level of 0.58 GPU days and 0.29 GPU days, respectively. Table 2 shows a performance comparison between our models and the baseline models. From this table, we observe that the model found by NAS1 with 6 cells and 16 initial channels outperforms Res15, TC-ResNet, and SincConv+DSConv in terms of both accuracy and the number of parameters, while the remaining, scaled-up NAS1 models achieve even higher accuracy, exceeding 97%.

Table 2: Performance of the models found by the proposed method and baseline models. The numbers marked with † are taken from the corresponding papers. '-' means not available. The best results among different methods are marked in bold.
4. Conclusion

Existing methods for neural keyword spotting rely on manually designed convolutional neural networks and other neural networks. In this paper, we perform differentiable neural architecture search to find CNN architectures that can lead to a high accuracy and a relatively small footprint. Our approach is robust and finds architectures with accuracy over 96% under different sets of operations. The best models found by neural architecture search achieve a state-of-the-art accuracy of over 97% on the Google Speech Commands Dataset, outperforming a range of existing baseline models under the same experimental setup, while maintaining competitive footprints. These observations demonstrate the enormous potential of conducting neural architecture search for keyword spotting, especially toward other types of neural networks and the adoption of KWS-friendly operations, which open up avenues for future investigation.

5. References

[1] R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5484–5488.
[2] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[3] P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018.
[4] S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha, "Temporal convolution for real-time keyword spotting on mobile devices," arXiv preprint arXiv:1904.03814, 2019.
[5] S. Mittermaier, L. Kürzinger, B. Waschneck, and G. Rigoll, "Small-footprint keyword spotting on raw audio data with sinc-convolutions," arXiv preprint arXiv:1911.02086, 2019.
[6] C.-C. Kao, M. Sun, Y. Gao, S. Vitaladevuni, and C. Wang, "Sub-band convolutional neural networks for small-footprint spoken term classification," arXiv preprint arXiv:1907.01448, 2019.
[7] M. Zeng and N. Xiao, "Effective combination of DenseNet and BiLSTM for keyword spotting," IEEE Access, vol. 7, pp. 10 767–10 775, 2019.
[8] J. Pons and X. Serra, "Randomly weighted CNNs for (music) audio classification," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 336–340.
[9] X. Chen, S. Yin, D. Song, P. Ouyang, L. Liu, and S. Wei, "Small-footprint keyword spotting with graph convolutional network," arXiv preprint arXiv:1912.05124, 2019.
[10] S. Majumdar and B. Ginsburg, "MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition," arXiv preprint arXiv:2004.08531, 2020.
[11] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[12] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[13] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
[14] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 19–34.
[15] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," arXiv preprint arXiv:1806.09055, 2018.
[16] T. Véniat, O. Schwander, and L. Denoyer, "Stochastic adaptive neural architecture search for keyword spotting," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2842–2846.
[17] A. Anderson, J. Su, R. Dahyot, and D. Gregg, "Performance-oriented neural architecture search," arXiv preprint arXiv:2001.02976, 2020.
[18] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," arXiv preprint arXiv:1808.05377, 2018.
[19] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," 2018.
[20] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," arXiv preprint arXiv:1802.03268, 2018.
[21] T. Elsken, J. H. Metzen, and F. Hutter, "Efficient multi-objective neural architecture search via Lamarckian evolution," arXiv preprint arXiv:1804.09081, 2018.
[22] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, "Large-scale evolution of image classifiers," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2902–2911.
[23] X. Dong and Y. Yang, "Searching for a robust neural architecture in four GPU hours," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1761–1770.
[24] X. Chen, L. Xie, J. Wu, and Q. Tian, "Progressive differentiable architecture search: Bridging the depth gap between search and evaluation," 2019.
[25] S. Xie, H. Zheng, C. Liu, and L. Lin, "SNAS: Stochastic neural architecture search," CoRR, vol. abs/1812.09926, 2018. [Online]. Available: http://arxiv.org/abs/1812.09926
[26] R. Tang and J. Lin, "Honk: A PyTorch reimplementation of convolutional neural networks for keyword spotting," arXiv preprint arXiv:1710.06554, 2017.
[27] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[28] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[29] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[30] L. Kaiser, A. N. Gomez, and F. Chollet, "Depthwise separable convolutions for neural machine translation," arXiv preprint arXiv:1706.03059, 2017.