Dynamic curriculum learning via data parameters for noise robust keyword spotting
Takuya Higuchi, Shreyas Saxena, Mehrez Souden, Tien Dung Tran, Masood Delfarah, Chandra Dhir
Apple
ABSTRACT
We propose dynamic curriculum learning via data parameters for noise robust keyword spotting. Data parameter learning has recently been introduced for image processing, where weight parameters, so-called data parameters, are introduced for target classes and instances and are optimized along with the model parameters. The data parameters scale logits and control the importance of classes and instances during training, which enables automatic curriculum learning without additional annotations for training data. Similarly, in this paper, we propose using this curriculum learning approach for acoustic modeling, and train an acoustic model on clean and noisy utterances with the data parameters. The proposed approach automatically learns the difficulty of the classes and instances, e.g., due to low signal-to-noise ratio (SNR), during gradient descent optimization and performs curriculum learning. This curriculum learning leads to an overall improvement in the accuracy of the acoustic model. We evaluate the effectiveness of the proposed approach on a keyword spotting task. Experimental results show a 7.7% relative reduction in false reject ratio with the data parameters compared to a baseline model that is simply trained on the multicondition dataset.
Index Terms — Noise robustness, acoustic modeling, keyword spotting, curriculum learning
1. INTRODUCTION
Acoustic modeling is essential for speech applications such as keyword spotting and automatic speech recognition, e.g., [1–8]. Noise robustness of an acoustic model is one of the critical factors for successful acoustic modeling, particularly in far-field speech scenarios [9–12]. Various approaches have been proposed, including front-end speech enhancement [13–17] and data augmentation for acoustic modeling [18–20]. One promising approach for improving noise robustness is data augmentation using noise signals, where the noise signals are added to clean utterances in the training data. This data augmentation, also called multicondition training, artificially introduces difficult training samples, which typically makes the acoustic model more robust against noise. However, it is unclear what degree of difficulty helps the model generalize the most, so it is difficult to understand how to obtain the maximum benefit from multicondition training. Although the augmented training data have various degrees of difficulty for acoustic modeling, e.g., high/low SNRs and easy/hard phone states for classification, it is difficult to design an efficient training curriculum exploiting the variability of the augmented training data.

Recently, data parameters were proposed for curriculum learning, and their effectiveness was demonstrated on image classification [21]. Weight parameters are introduced for target classes and instances, i.e., image samples, in the training data. The data parameters are used to scale logits depending on target classes and instances, and are optimized together with the model parameters. By scaling the logits, we can control the contribution of each class/instance to the gradients. Optimizing the data parameters during training enables class-level and/or instance-level dynamic curriculum learning by dynamically updating the scaling factors.

The main contributions of this paper are: 1) an application of the data parameters to acoustic modeling for keyword spotting, and 2) a combination of the data parameter approach with a data augmentation technique to yield further performance improvement. In an acoustic modeling scenario, class parameters are introduced for the target phoneme classes; they vary depending on time frames and control the contributions of the target classes. On the other hand, instance parameters are introduced per utterance and control the contributions of the utterances based on their hardness, e.g., clean/noisy. In our experimental evaluation on a keyword spotting task, the data parameter approach combined with data augmentation yielded up to 7.7% relative improvement in terms of the false reject ratio (FRR) compared to a baseline model trained on noisy data.
2. DATA PARAMETERS FOR ACOUSTIC MODELING

2.1. Class parameters
Let $\{x_{it}, y_{it}\}$ denote the data, where $x_{it}$ denotes an audio feature vector at time $t \in \{1, ..., T_i\}$ of utterance $i \in \{1, ..., I\}$, and $y_{it} \in \{1, ..., K\}$ denotes its corresponding class label, e.g., a phone state label. During training, an acoustic model, i.e., a deep neural network (DNN), takes the input $x_{it}$ and estimates logits $z_{it}$ as $z_{it} = f_\theta(x_{it})$, where $\theta$ denotes the model parameters and $f_\theta$ denotes the mapping function of the DNN.

Instead of computing the softmax directly from the logits, we introduce the class parameters to scale the logits. Let $\sigma^{class} = \{\sigma^{class}_1, ..., \sigma^{class}_K\}$ denote the class parameters for the target classes $1, ..., K$. The probability for target class $y_{it}$ at time $t$ of utterance $i$ can be written as

$$p_{it,y_{it}} = \frac{\exp(z_{it,y_{it}} / \sigma^{class}_{y_{it}})}{\sum_j \exp(z_{it,j} / \sigma^{class}_{y_{it}})}, \quad (1)$$

where $z_{it,y_{it}}$ and $\sigma^{class}_{y_{it}}$ denote the logit and the class parameter for target class $y_{it}$, respectively. Note that the class parameter used in both the numerator and the denominator is determined by the target class $y_{it}$. Therefore, the class parameter cannot be absorbed into the model parameters, in contrast to approaches that introduce scaling parameters into the model parameters (e.g., [22]). Thus, the class parameters control the contributions of the target classes during training while keeping the model architecture unchanged. When we set all the class parameters to one, i.e., $\sigma^{class}_j = 1, j = 1, ..., K$, Eq. (1) is equivalent to the standard softmax computation.

2.2. Instance parameters

Instance parameters are specific to a particular utterance, so they control the curriculum over instances. Let $\sigma^{inst} = \{\sigma^{inst}_1, ..., \sigma^{inst}_I\}$ denote the instance parameters for utterances $1, ..., I$ in the training data. Using the instance parameters, the scaled probability can be written as

$$p_{it,y_{it}} = \frac{\exp(z_{it,y_{it}} / \sigma^{inst}_i)}{\sum_j \exp(z_{it,j} / \sigma^{inst}_i)}. \quad (2)$$

Note that the same instance parameter is used across time frames of the same utterance, which allows us to perform utterance-wise curriculum learning.

2.3. Joint class and instance parameters

The data parameters are defined as the sum of the class and instance parameters, $\sigma^*_{t,i} = \sigma^{class}_{y_{it}} + \sigma^{inst}_i$. The data parameters are used to scale the logits similarly to Eqs. (1) and (2). Unlike the data parameters for image processing [21], the data parameters for speech processing take different values at different time frames depending on both the class and instance parameters, even if the data points at those time frames come from the same utterance. By optimizing both the class and instance parameters, we can perform curriculum learning over both classes and instances.
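To make the scaling concrete, the following is a minimal PyTorch sketch of the per-frame logit scaling in Eqs. (1) and (2) for the joint case; the tensor names and shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch assumed) of the logit scaling in Eqs. (1)-(2) for the
# joint case; tensor names and shapes are illustrative, not the authors' code.
import torch

def scaled_probabilities(logits, targets, sigma_class, sigma_inst, utt_ids):
    """Return class posteriors with logits scaled by per-frame data parameters.

    logits:      (N, K) frame-level logits z_{it,j} from the acoustic model
    targets:     (N,)   target class labels y_{it}
    sigma_class: (K,)   class parameters sigma^class_1..K
    sigma_inst:  (I,)   instance (utterance) parameters sigma^inst_1..I
    utt_ids:     (N,)   utterance index i of every frame
    """
    # Per-frame temperature: sigma*_{t,i} = sigma^class_{y_it} + sigma^inst_i.
    sigma_star = sigma_class[targets] + sigma_inst[utt_ids]        # (N,)
    # Every logit of a frame is divided by that frame's temperature, then softmax.
    return torch.softmax(logits / sigma_star.unsqueeze(1), dim=1)  # (N, K)
```

Setting all instance parameters to zero in this sketch recovers Eq. (1), and using only the instance parameters recovers Eq. (2).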
2.4. Optimization

The data parameters can be optimized via a gradient descent algorithm, and we use the usual cross entropy loss averaged over the time frames and instances:

$$L = \frac{1}{T^*} \sum_{t,i} L_{it} \quad (3)$$
$$= -\frac{1}{T^*} \sum_{t,i} \log(p_{it,y_{it}}), \quad (4)$$

where $T^*$ denotes the total number of time frames in the training data, and $p_{it,y_{it}}$ is the probability scaled by the data parameter $\sigma^*_{t,i}$. We minimize $L$ with respect to the model parameters $\theta$ as well as the class and instance parameters. The gradient of the loss with respect to the logits can be written as

$$\frac{\partial L_{it}}{\partial z_{it,j}} = \frac{p_{it,j} - \mathbb{1}(j = y_{it})}{\sigma^*_{t,i}}. \quad (5)$$

Thus the gradient gets scaled by $\sigma^*_{t,i}$, where different scaling factors are applied at different data points. The data parameter values control the importance of different time frames and utterances during model parameter optimization.

On the other hand, the gradient with respect to the data parameters can be written as

$$\frac{\partial L_{it}}{\partial \sigma^*_{t,i}} = \frac{1 - p_{it,y_{it}}}{(\sigma^*_{t,i})^2} \Big( z_{it,y_{it}} - \sum_{j \neq y_{it}} q_{it,j}\, z_{it,j} \Big), \quad (6)$$

where $q_{it,j} = p_{it,j} / (1 - p_{it,y_{it}})$ is the probability distribution over the non-target classes $j$, i.e., $p_{it,j}$ rescaled by $1/(1 - p_{it,y_{it}})$. Since $\sum_{j \neq y_{it}} q_{it,j} z_{it,j}$ corresponds to the expected value of the logits over the non-target classes, the data parameter $\sigma^*_{t,i}$ will increase if the logit of the target class $z_{it,y_{it}}$ is smaller than the expected logit of the non-target classes, and vice versa. The gradients for the class and instance parameters can also be written as Eq. (6), since the data parameter is formed by the addition of these parameters.

From Eqs. (5) and (6), if the DNN misclassifies a data point, the data parameter will gradually increase, which will decay the gradient of the logits. Decreasing the data parameter has the opposite effect and accelerates the optimization of the model parameters. This mechanism helps the DNN focus on easy data points at the beginning of training, and automatically leaves the harder cases until later based on the performance of the model.

We optimize $\log(\sigma^{class})$ and $\log(\sigma^{inst})$ instead of $\sigma^{class}$ and $\sigma^{inst}$ to prevent them from taking negative values. In addition, we apply ℓ2 regularization to the data parameters, i.e., $\|\log(\sigma^{class} + \sigma^{inst})\|^2$, to prevent the data parameters from taking very high or low values and to favor the original softmax computation with $\sigma^* = 1$. The contribution of this regularization to the objective function is controlled by a weight decay hyperparameter.
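As a rough illustration of this optimization, the sketch below trains the model parameters and the log-parameterized data parameters jointly with the scaled cross entropy loss and the ℓ2 regularizer on log σ. The two-optimizer setup mirrors Section 3 (SGD with weight decay for the data parameters); the model optimizer, learning rates, weight decay value, clipping bounds, and the toy model are placeholder assumptions, not the paper's settings.

```python
# Illustrative joint training step for model and data parameters; hyperparameter
# values, the toy model, and the clipping bounds below are placeholders.
import torch
import torch.nn.functional as F

K, I = 18, 10_000                          # classes / training utterances (example sizes)
model = torch.nn.Linear(40, K)             # stand-in for the acoustic model f_theta

# Optimize log(sigma) so that the data parameters stay positive (sigma = exp(log_sigma)).
log_sigma_class = torch.zeros(K, requires_grad=True)   # exp(0) = 1, i.e. the unscaled softmax
log_sigma_inst = torch.zeros(I, requires_grad=True)

opt_model = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_data = torch.optim.SGD([log_sigma_class, log_sigma_inst], lr=1e-3)
WD = 0.01                                  # weight on the regularizer (cf. Table 1)
LOG_SIGMA_MIN, LOG_SIGMA_MAX = -2.0, 2.0   # placeholder clipping bounds

def train_step(feats, targets, utt_ids):
    logits = model(feats)                                             # (N, K)
    sigma_star = (log_sigma_class.exp()[targets]
                  + log_sigma_inst.exp()[utt_ids])                    # (N,)
    # Scaled cross entropy, i.e. -log of the probability in Eqs. (1)-(2) with sigma*.
    loss = F.cross_entropy(logits / sigma_star.unsqueeze(1), targets)
    # l2 regularization on log(sigma^class + sigma^inst), favoring sigma* = 1.
    reg = sigma_star.log().pow(2).mean()
    total = loss + WD * reg
    opt_model.zero_grad()
    opt_data.zero_grad()
    total.backward()
    opt_model.step()
    opt_data.step()
    # Clip the (log) data parameters to keep sigma away from zero and from large values.
    with torch.no_grad():
        log_sigma_class.clamp_(LOG_SIGMA_MIN, LOG_SIGMA_MAX)
        log_sigma_inst.clamp_(LOG_SIGMA_MIN, LOG_SIGMA_MAX)
    return total.item()

# Toy usage with random frames:
feats = torch.randn(32, 40)
targets = torch.randint(0, K, (32,))
utt_ids = torch.randint(0, I, (32,))
train_step(feats, targets, utt_ids)
```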
3. EXPERIMENTAL EVALUATION

3.1. Data
Clean training data consists of ∼…. We used 2% of the data for cross validation to tune hyperparameters.

Evaluation data consisted of 13,186 positive utterances which contain the trigger phrase, and 2,000 hours of negative data obtained by playing TV, radio, and podcasts in rooms. All the evaluation data were recorded with the same microphone array. The positive utterances were recorded in realistic room conditions, e.g., with room noise and/or playback at various volumes. Our front-end speech enhancement system [24] was applied to both training and evaluation data.
3.2. Model and training setup

We used a network of 5 fully-connected layers with 64 units each. Batch normalization was applied after each fully-connected layer, followed by the sigmoid activation function. The last layer projected the 64-dimensional hidden activation to logits for the target classes. An acoustic model of this size can be always-on for keyword spotting on streaming audio on a smart speaker. The target classes consisted of 18 classes (3 states × …). The Adam optimizer [26] was used, and its betas were set to [0.9, 0.999]. A learning rate decay of 0.5 was applied when the cross validation loss did not decrease for two consecutive epochs. Early stopping was applied when the cross validation loss did not decrease for 9 consecutive epochs. We used … utterances per mini-batch for training. Two baseline models were trained on the clean and noisy data, respectively, without using the data parameters. Regarding the data parameters, we considered three settings: 1) class parameters only, 2) instance parameters only, and 3) joint class and instance parameters. For the data parameter optimization, stochastic gradient descent (SGD) was used with the weight decay. To avoid the class and instance parameters becoming zero or taking high values during optimization, clipping was applied to the class parameters with a minimum value of … and a maximum value of …, and to the instance parameters with a minimum value of … and a maximum value of …. No learning rate decay was applied to the class and instance parameters. The data parameter implementation was based on the open source software provided by the authors of [21] (https://github.com/apple/ml-data-parameters).

Hyperparameters for the data parameters, i.e., learning rate, initial value, and weight decay, were set as shown in Table 1. The weight decay parameter was multiplied by the ℓ2 regularization term to control its contribution to the objective function. For the joint cases, we kept the same learning rate and the same weight decay as used in the class-parameter-only case, and tuned the learning rate and the initial values for the instance parameters. The initial values were set so that the sum of the class and instance parameters stayed close to 1. All models including the baselines were trained using the same random seed for model initialization and data ordering, and hyperparameter search was performed based on the cross validation loss.

Table 1. Hyperparameters for the data parameters (lr: learning rate, init: initial value, wd: weight decay).

data   case    class lr  class init  inst. lr  inst. init  wd
clean  class   0.001     1           -         -           0.01
clean  inst.   -         -           0.001     1           0.01
clean  joint   0.001     1           0.1       0.01        0.01
noisy  class   0.001     1           -         -           0.01
noisy  inst.   -         -           0.01      1           0.1
noisy  joint   0.001     1           1         0.1         0.01
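For reference, the following is a minimal sketch of an acoustic model with the architecture described above (5 fully-connected layers of 64 units, each followed by batch normalization and a sigmoid, plus a projection to the 18 target classes); the input feature dimension is an assumption, since it is not given in this excerpt.

```python
# Sketch of the small always-on acoustic model described above; the 40-dim
# input feature size is an assumed placeholder.
import torch.nn as nn

class KWSAcousticModel(nn.Module):
    def __init__(self, input_dim=40, hidden_dim=64, num_classes=18, num_layers=5):
        super().__init__()
        layers, in_dim = [], input_dim
        for _ in range(num_layers):
            # Fully-connected layer, then batch normalization, then sigmoid activation.
            layers += [nn.Linear(in_dim, hidden_dim),
                       nn.BatchNorm1d(hidden_dim),
                       nn.Sigmoid()]
            in_dim = hidden_dim
        self.body = nn.Sequential(*layers)
        self.output = nn.Linear(hidden_dim, num_classes)  # logits z_{it}

    def forward(self, x):  # x: (N, input_dim) frame features
        return self.output(self.body(x))
```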
3.3. Results

Fig. 1. A distribution of the class parameters during training. The line in the middle shows the median, and the colored regions have widths of σ, 2σ, and 3σ, respectively, where σ denotes the standard deviation of the distribution. The lowest and highest lines indicate the minimum and maximum values, respectively.

Fig. 2. Means and standard deviations of the distributions of the instance parameters for the clean and noisy utterances.

Figure 1 shows an example of a distribution of the class parameters during training, obtained with the noisy data in the class-parameter-only case. Starting from the initial value of 1, the class parameters spread higher and lower in the early part of training, and then converged to certain values depending on the target classes. This distribution shows that target-class-level curriculum learning via the class parameters was performed during training. Similar distributions were observed with the other settings.

Figure 2 shows the means and standard deviations of the distributions of the instance parameters for the clean and noisy utterances in the noisy data, respectively. The distributions obtained in the joint case are shown. At the beginning of training, the mean for the noisy utterances increases more than the mean for the clean utterances; both then converge to similar values. This means that, on average, the model focused more on the clean utterances at the early stage of training, and then started treating the clean and noisy utterances equally. The standard deviation for the noisy utterances was higher at the beginning and converged to a higher value compared with that for the clean utterances. This broader distribution on the noisy data indicates that the instance parameters controlled the importance of the noisy utterances depending on their degrees of difficulty. A similar tendency was observed with the other settings as well. Note that the clean/noisy annotations were not used during training; they were used only to compute the instance parameter distributions separately on the clean and noisy utterances for this figure.

Table 2 shows the false reject ratios (FRRs) at a certain operating point, i.e., 10 false alarms (FA) per hour, obtained with the clean training data. Although the class parameters improved the performance, we could not get substantial improvements when using the instance parameters with the clean training data. The reason could be that the clean data did not have a moderate diversity of difficulties across data samples; hence, the instance parameters introduced an unnecessarily high degree of freedom, which could result in overfitting.
Note that 10 FA per hour is a reasonable operating point for always-on on-device models of this size, evaluated on this negative data; the FAs can be mitigated by using multi-stage approaches, e.g., [27, 28].
Table 2. False reject ratios at 10 false alarms per hour using the clean training data.

Models         class  inst.  FRRs (rel. impr.) [%]
baseline       -      -      2.37
class param.   ✓      -      …
inst. param.   -      ✓      …
joint          ✓      ✓      …
Table 3. False reject ratios at 10 false alarms per hour using the noisy training data.

Models         class  inst.  FRRs (rel. impr.) [%]
baseline       -      -      2.22
class param.   ✓      -      …
inst. param.   -      ✓      …
joint          ✓      ✓      …

Table 3 shows the FRRs obtained using the noisy training data. By simply training the model on the noisy data, the FRR improved by 6.3% relative to the baseline model trained on the clean data. By using the joint class and instance parameters, we achieved a further 7.7% improvement relative to the model trained on the noisy data without the data parameters, although the improvement from the joint optimization was limited compared to the result with only the class parameters. In contrast to the results with the clean training data, the instance parameters also improved the FRR from the baseline by 4.1% relative. These results show the effectiveness of the proposed approach on the keyword spotting task in a multicondition training scenario.
4. RELATED WORK
Sivasankaran et al. [29] also proposed utterance-wise weight parameters for multicondition training, where they directly scaled the cross entropy loss using the weight parameters. However, in their experiments, they split the training data based on SNR and used subset-wise weight parameters, noting that utterance-wise weights are undesirable and would result in overfitting. In this work, we introduced utterance-wise data parameters and demonstrated how they improved the accuracy without additional annotations, such as SNR labels.
5. CONCLUSIONS
We proposed data parameters for noise robust keyword spotting. The data parameters consisted of the class and instance parameters, which controlled the contributions of the target classes and the instances during training. Experimental results with augmented training data showed that the data parameter approach achieved up to 7.7% relative improvement over the baseline model simply trained on the noisy data.

6. REFERENCES
[1] Guoguo Chen et al., "Small-footprint keyword spotting using deep neural networks," in ICASSP. IEEE, 2014, pp. 4087–4091.
[2] Tara N Sainath et al., "Convolutional neural networks for small-footprint keyword spotting," in Interspeech, 2015.
[3] Siri Team, "Hey Siri: An on-device DNN-powered voice trigger for Apple's personal assistant," Apple Machine Learning Journal, vol. 1, no. 6, 2017.
[4] Yanzhang He et al., "Streaming small-footprint keyword spotting using sequence-to-sequence models," in ASRU. IEEE, 2017, pp. 474–481.
[5] Geoffrey Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[6] Tara N Sainath et al., "Deep convolutional neural networks for LVCSR," in ICASSP. IEEE, 2013, pp. 8614–8618.
[7] Ossama Abdel-Hamid et al., "Convolutional neural networks for speech recognition," IEEE/ACM Trans. ASLP, vol. 22, no. 10, pp. 1533–1545, 2014.
[8] Chung-Cheng Chiu et al., "State-of-the-art speech recognition with sequence-to-sequence models," in ICASSP. IEEE, 2018, pp. 4774–4778.
[9] Keisuke Kinoshita et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in WASPAA. IEEE, 2013, pp. 1–4.
[10] Jon Barker et al., "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in ASRU. IEEE, 2015, pp. 504–511.
[11] Jon Barker et al., "The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," arXiv preprint arXiv:1803.10609, 2018.
[12] Reinhold Haeb-Umbach et al., "Speech processing for digital home assistants: Combining signal processing with deep-learning techniques," IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 111–124, 2019.
[13] Jahn Heymann et al., "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP. IEEE, 2016, pp. 196–200.
[14] Hakan Erdogan et al., "Improved MVDR beamforming using single-channel mask prediction networks," in Interspeech, 2016, pp. 1981–1985.
[15] Takuya Yoshioka et al., "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," in ASRU. IEEE, 2015, pp. 436–443.
[16] Takuya Higuchi et al., "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," in ICASSP. IEEE, 2016, pp. 5210–5214.
[17] Christoph Boeddeker et al., "Front-end processing for the CHiME-5 dinner party scenario," in CHiME5 Workshop, Hyderabad, India, 2018.
[18] Michael L Seltzer et al., "An investigation of deep neural networks for noise robust speech recognition," in ICASSP. IEEE, 2013, pp. 7398–7402.
[19] Yanmin Qian et al., "Very deep convolutional neural networks for noise robust speech recognition," IEEE/ACM Trans. ASLP, vol. 24, no. 12, pp. 2263–2276, 2016.
[20] Xiaodong Cui et al., "Data augmentation for deep neural network acoustic modeling," IEEE/ACM Trans. ASLP, vol. 23, no. 9, pp. 1469–1477, 2015.
[21] Shreyas Saxena et al., "Data parameters: A new family of parameters for learning a differentiable curriculum," in Advances in Neural Information Processing Systems, 2019, pp. 11093–11103.
[22] Pegah Ghahremani et al., "Linearly augmented deep neural network," in ICASSP. IEEE, 2016, pp. 5085–5089.
[23] "BBC Sound Effects," http://bbcsfx.acropolis.org.uk.
[24] Apple Machine Learning Blog, "Optimizing Siri on HomePod in Far-Field Settings," https://machinelearning.apple.com/2018/12/03/optimizing-siri-on-homepod-in-far-field-settings.html, December 2018.
[25] Siddharth Sigtia et al., "Efficient Voice Trigger Detection for Low Resource Hardware," in Interspeech, 2018, pp. 2092–2096.
[26] Diederik P Kingma et al., "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[27] Alexander Gruenstein et al., "A cascade architecture for keyword spotting on mobile devices," arXiv preprint arXiv:1712.03603, 2017.
[28] Apple Machine Learning Blog, "Hey Siri: An On-device DNN-powered Voice Trigger for Apple's Personal Assistant," https://machinelearning.apple.com/2017/10/01/hey-siri.html, October 2017.
[29] Sunit Sivasankaran et al., "Discriminative importance weighting of augmented training data for acoustic model training," in ICASSP. IEEE, 2017.
Interspeech , 2018,pp. 2092–2096.[26] Diederik P Kingma et al., “Adam: A methodfor stochastic optimization,” arXiv preprintarXiv:1412.6980 , 2014.[27] Alexander Gruenstein et al., “A cascade architecture forkeyword spotting on mobile devices,” arXiv preprintarXiv:1712.03603 , 2017.[28] Apple Machine Learning Blog, “Hey Siri: AnOn-device DNN-powered Voice Trigger for Apples Per-sonal Assistant,” https://machinelearning.apple.com/2017/10/01/hey-siri.html ,October 2017.[29] Sunit Sivasankaran et al., “Discriminative importanceweighting of augmented training data for acoustic modeltraining,” in