A bandit approach to curriculum generation for automatic speech recognition
Anastasia Kuznetsova ⋆†   Anurag Kumar ⋆   Francis M. Tyers †

⋆ Indiana University, Bloomington, Computer Science Department
† Indiana University, Bloomington, Department of Linguistics
ABSTRACT
Automated Speech Recognition (ASR) has been a challenging domain, especially in low-data scenarios with few audio examples. This is the main problem in training ASR systems on data from low-resource or marginalized languages. In this paper we present an approach to mitigate the lack of training data by employing Automated Curriculum Learning in combination with an adversarial bandit approach inspired by Reinforcement Learning. The goal of the approach is to optimize the training sequence of mini-batches ranked by level of difficulty, and to compare the resulting ASR performance against a random training sequence and a discrete curriculum. We test our approach on a truly low-resource language and show that the bandit framework yields a good improvement over the baseline transfer-learning model.
Index Terms — Low-resource ASR, Curriculum Learning, Bandits
1. INTRODUCTION
Automated speech recognition (ASR) is the task of transforming a speech signal into text. Lack of training data is problematic in this domain, especially for low-resource languages. By low-resource languages we mean not only endangered languages with few native speakers, but also languages which lack digital presence and sufficiently large corpora online. There are efforts towards addressing this problem by creating public-domain speech data for a growing number of languages [1]. However, despite these efforts the data problem will almost certainly persist into the foreseeable future, as newly emerging neural architectures are more and more data-hungry. There is therefore a need for a solution that mitigates the growing need for data and makes speech recognition accessible to speakers of more languages.

The notion of curriculum learning is not new in machine learning. The first results on the effectiveness of the approach were demonstrated by [2]. The idea behind the method is to mimic human study behaviour by presenting easy examples to the neural network in the initial phases of training and then gradually increasing the difficulty of the examples. [3] conducted experiments on artificially generated tasks and presented proof-of-concept results. Despite the effectiveness of the method, researchers in Natural Language Processing and Speech Recognition have not actively investigated curriculum learning.

Examples of curriculum learning in speech recognition include the work of [4] and [5]. While these studies show that curriculum learning increases the convergence rate, they also establish a need for clean speech signals and a higher volume of data. Since we lacked both, we propose a method of automated curriculum generation that learns the curriculum online.
Reinforcement Learning (RL) is an area of machine learning built on three main concepts: agent, environment and reward. The agent acts in the environment and takes decisions based on the reward it receives from the environment. The goal of the agent is to maximize the expected reward it can collect over time. Reinforcement learning is widely used in control tasks, where the environment may be a video game or a real-world visual input as in self-driving cars. We define these terms more formally in Section 2.1.

Kala and Shinozaki [6] is one of the rare works we encountered combining RL and ASR. The authors train two rival ASR models via the policy gradient method with REINFORCE-style updates.

The remainder of the article is organized as follows: Section 2 formulates the problem in terms of the bandit framework and describes the development of the proposed automated curriculum system; Section 3 discusses the experimental setting, training data and evaluation metrics, and presents the results; we reserve Section 4 for a brief discussion.
2. METHODOLOGY
In reinforcement learning (RL) the idea of curriculum learning is often used in control tasks for games; however, it has not been widely used in speech recognition. Our approach was inspired by the work of [7], where bandit algorithms are used for automated curriculum generation.

2.1. Bandit framework

A k-armed bandit can be formally defined as an agent acting in a reward space R which aims to collect the maximum expected reward in a finite number of trials. Consider a finite sequence of trials t = 1, ..., T:

1. During each trial the agent selects one of the bandit's arms at time t; the decision is based on the expected pay-off received previously. We define the arm as an action a_t ∈ A_t, where A is the set of actions;

2. After selecting the action a at time step t we observe the reward r_t ∈ R;

3. The maximum expected reward defines the value function of each action, q*(a) = E[R_t | A_t = a] (the asterisk means that q(a) is optimal), which is updated for the played action at every time step after observing the reward.

2.2. Automated curriculum generation

According to the curriculum learning approach we need a way of ranking the examples to define their complexity. [4] suggest utterance length as the measure of complexity. We took a different approach and ranked the training examples by the compression ratio (CR) of the raw audio files, computed with the standard gzip utility on the distributed WAVs. The motivation is that the noisier an audio file, the more entropy it has, and thus the harder it is both to compress and to learn from. We define the compression ratio, which shows by how much (in %) the audio file was compressed relative to its uncompressed counterpart, as in (1):

CR = 1 − Size_after / Size_before.    (1)

[Fig. 1: Compression ratio against different noise levels.]

Figure 1 demonstrates the relationship between signal-to-noise ratio (SNR) and compression ratio. The plot is based on the NOIZEUS corpus [8], which contains noisy audio mixtures at different SNRs. The higher the proportion of noise blended into the speech, the less the audio sample is compressed. Following this logic we divide the data set into K levels of complexity, where level 1 is 'easy' and level K is 'hard'.

Assume a set of input sequences X, where every input sequence x_i ∈ X. A task is a set of training examples ranked according to CR. We split the examples into K categories and define the task set as D = {D_1, D_2, D_3, ..., D_K}, where k is the index of a task and x_i is a training batch sampled randomly from D_k. The curriculum is the sequence of tasks selected over the training epochs.

The goal is to find the best sequence (curriculum) of batches that maximizes the training gain. An action a ∈ A (intuitively, 'pick a batch') refers to choosing task D_k from D; this way a = k, i.e. the index of the task can be seen as one of the arms of the multi-armed bandit. At every time step we update the action-value function q(a) after receiving the reward. The reward r(a) is calculated from progress signals received from the network.
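To make the ranking concrete, the sketch below shows one way the gzip-based complexity ranking could be implemented. It is a minimal illustration rather than the released code of this paper; the function names (compression_ratio, build_tasks) and the default of five tasks mirror the setup described above but are our own.

import gzip

def compression_ratio(wav_path: str) -> float:
    """CR = 1 - size_after / size_before (Eq. 1).

    Noisy audio has more entropy, compresses worse, and therefore
    gets a lower CR than clean audio."""
    with open(wav_path, "rb") as f:
        raw = f.read()
    compressed = gzip.compress(raw)  # standard gzip, as in the paper
    return 1.0 - len(compressed) / len(raw)

def build_tasks(wav_paths, num_tasks=5):
    """Rank utterances by CR and split them into K tasks D_1..D_K,
    from 'easy' (high CR, cleaner) to 'hard' (low CR, noisier)."""
    ranked = sorted(wav_paths, key=compression_ratio, reverse=True)
    size = -(-len(ranked) // num_tasks)  # ceiling division
    return [ranked[i * size:(i + 1) * size] for i in range(num_tasks)]

Ranking by CR only requires reading each file once and compressing it, so the whole corpus can be bucketed as a cheap preprocessing step before training begins.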
Algorithm 1: Curriculum Learning

    Initialize: w_i ← 1 (EXP3) or q(a) (UCB1)
    Create tasks D_k
    begin
        for t → T do
            Draw action index a based on w_i or q(a)
            k ← a
            batch ← sample(D_k)
            Train the model on batch
            Observe progress gain ν
            r_t ← Map(ν) ∈ [−1, 1]
            Update w_i or q(a) on r_t

Following the authors in [7] we experiment with loss-based progress gains such as the prediction gain

ν_PG = L(x, θ) − L(x, θ′),    (2)

i.e. the difference between the loss before and after training on the k-th batch. The self-prediction gain is defined in Eq. 3:

ν_SPG = L(x′, θ) − L(x′, θ′),    x′ ∼ D_k.    (3)

SPG assesses the gain on the k-th task by sampling a second batch from the same task, to avoid bias. The progress signals are then used as a reward and are re-scaled to fall in the interval [−1, 1] using the mapping in Eq. 4, where r_t is the reward at time step t, r̂_t is the progress gain at time step t, q_lo is the 0.2 quantile of the gain history and q_hi the 0.8 quantile:

r_t = −1 if r̂_t < q_lo;   1 if r̂_t > q_hi;   2(r̂_t − q_lo)/(q_hi − q_lo) − 1 otherwise.    (4)

In the earlier stages of training the difference between the losses is significantly higher than at the end of training, where the magnitude of the losses is smaller due to convergence; the re-scaling accounts for this fact [7].

In our experiments we use two bandit algorithms: Upper Confidence Bound (UCB1) [9] and the Exponential-weight Algorithm for Exploration and Exploitation (EXP3) [10]. Both algorithms demonstrate strong convergence properties and come with analytical guarantees. This is of paramount importance to us, since many results in machine learning hinge on empirically obtained settings, e.g. hyper-parameter tuning, and/or are not interpretable.

Bandit algorithms address the well-known exploitation vs. exploration dilemma, i.e. how to ensure that the agent selects the best arm most of the time while still exploring other options that could lead to a higher expected reward and the least possible regret. (Regret is the expected loss incurred because the agent does not always select the best action, in favour of exploration.) The UCB1 algorithm is based on the notion of an upper confidence index assigned to each arm [9]. Graves et al. [7] obtained plausible results using EXP3, so we use it as well and compare its performance against UCB1. Due to space constraints we refer the reader to the definitions of the algorithms in the original papers by Auer et al. [9, 10].

A generalized version of Automated Curriculum Learning incorporating both algorithms is shown in Algorithm 1, where the Map function is defined as in Eq. 4 and the training step is done with the Baseline 1 model described in Section 2.3; w_i is the weight vector required by EXP3, q(a) is the action-value function in UCB1, and r_t is the reward received by the agent at time step t.
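For concreteness, the following Python sketch shows one way Algorithm 1 could be realized. It is an illustration under stated assumptions, not the authors' implementation: sample_batch, train_on_batch and loss_on_batch are hypothetical placeholders for the DeepSpeech training and evaluation steps, and the exploration constants are illustrative values. The EXP3 and UCB1 updates follow the standard formulations of Auer et al. [9, 10].

import math
import random

import numpy as np

K = 5          # number of tasks D_1..D_K
GAMMA = 0.2    # EXP3 exploration probability (illustrative value)
C = 0.5        # UCB1 exploration constant (illustrative value)

def map_reward(gain, history):
    """Eq. 4: re-scale a raw progress gain into [-1, 1] using the
    0.2 and 0.8 quantiles of the gain history."""
    q_lo, q_hi = np.quantile(history, 0.2), np.quantile(history, 0.8)
    if gain < q_lo:
        return -1.0
    if gain > q_hi:
        return 1.0
    if q_hi == q_lo:                  # degenerate early history
        return 0.0
    return 2.0 * (gain - q_lo) / (q_hi - q_lo) - 1.0

def exp3_probs(weights):
    """EXP3 action distribution: weights mixed with uniform exploration."""
    total = sum(weights)
    return [(1 - GAMMA) * w / total + GAMMA / K for w in weights]

def ucb1_arm(q, counts, t):
    """UCB1: play each arm once, then pick the arm with the highest
    upper confidence index."""
    for a in range(K):
        if counts[a] == 0:
            return a
    return max(range(K),
               key=lambda a: q[a] + C * math.sqrt(math.log(t) / counts[a]))

def run_curriculum(model, tasks, T, use_exp3=True):
    weights = [1.0] * K               # w_i for EXP3
    q, counts = [0.0] * K, [0] * K    # q(a) and arm counts for UCB1
    history = []
    for t in range(1, T + 1):
        if use_exp3:
            probs = exp3_probs(weights)
            a = random.choices(range(K), weights=probs)[0]
        else:
            a = ucb1_arm(q, counts, t)
        batch = sample_batch(tasks[a])              # x ~ D_k (placeholder)
        loss_before = loss_on_batch(model, batch)   # L(x, theta)
        train_on_batch(model, batch)                # theta -> theta'
        loss_after = loss_on_batch(model, batch)    # L(x, theta')
        gain = loss_before - loss_after             # prediction gain, Eq. 2
        # For self-prediction gain (Eq. 3), evaluate both losses on a
        # fresh batch x' ~ D_k instead of the training batch.
        history.append(gain)
        r = map_reward(gain, history)
        if use_exp3:
            r01 = (r + 1) / 2          # EXP3 expects rewards in [0, 1]
            weights[a] *= math.exp(GAMMA * (r01 / probs[a]) / K)
        else:
            counts[a] += 1
            q[a] += (r - q[a]) / counts[a]   # incremental mean update

As written, the gain history grows without bound; a fixed-size buffer of recent gains would be the obvious refinement for long runs.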
2.3. Model architecture

The architecture of our ASR system is based on Baidu's DeepSpeech end-to-end speech recognition system [11], as implemented by Mozilla (https://github.com/mozilla/DeepSpeech). The model is a 5-layer Recurrent Neural Network (RNN), where the first three layers have ReLU activation, the fourth layer is a bi-directional RNN, and the 5th layer is a non-recurrent softmax output layer predicting the probabilities of the output characters. The network is trained via CTC loss.
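As a structural illustration only (Mozilla's implementation is TensorFlow-based and differs in detail), the topology just described can be sketched in PyTorch as follows; layer sizes are placeholders.

import torch.nn as nn

class DeepSpeechLike(nn.Module):
    """Schematic 5-layer DeepSpeech-style network: three ReLU
    feed-forward layers, one bi-directional RNN, and a linear
    output layer over the character set, trained with CTC."""
    def __init__(self, n_features, n_hidden, n_chars):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.rnn = nn.RNN(n_hidden, n_hidden,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_chars + 1)  # +1: CTC blank

    def forward(self, x):            # x: (batch, time, n_features)
        h = self.ff(x)
        h, _ = self.rnn(h)
        # per-frame log-probabilities, as consumed by nn.CTCLoss
        # (which additionally expects a (time, batch, chars) layout)
        return self.out(h).log_softmax(-1)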
Algorithm     Gain   Tasks   Epochs   WER     CER
Transfer      —      —       25       94.62   39.75
+ SWTSK       —      5       25       85.99   29.85
+ UCB1        PG     5       10       86.53   29.43
+ UCB1        SPG    5       10
+ EXP3        SPG    5       10       85.91   30.33

Table 1: Experimental configurations and results. The first baseline is transfer learning from English, Transfer. The second baseline is SWTSK, a switch-task baseline. PG stands for prediction gain, SPG for self-prediction gain. The UCB1 algorithm has a parameter c that controls the degree of exploration, and EXP3 has a parameter γ, the probability of selecting a random action.

The Baseline 1 system is DeepSpeech with transfer learning, using version 0.5.1 of Mozilla's English model (https://github.com/mozilla/DeepSpeech/releases/tag/v0.5.1) as a source. We slice off the final 2 layers from the model and resume training on the target-language data. For Baseline 1 we select batches for training randomly, with no curriculum. For the purpose of comparison with the curriculum learning systems, the default file-size-based sorting in DeepSpeech was turned off.
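In the same schematic PyTorch terms as the sketch in Section 2.3 (and again only as an illustration of the idea, not of Mozilla's TensorFlow checkpoint handling), slicing off the final two layers before resuming training might look like this; the parameter-name prefixes "rnn." and "out." refer to the hypothetical model sketched above.

import torch

def transfer_from_source(model, source_ckpt, drop_prefixes=("rnn.", "out.")):
    """Copy source-language weights into the model, except for the
    final two layers (the RNN and the output layer), which keep
    their fresh random initialization and are re-trained on the
    target language."""
    state = torch.load(source_ckpt, map_location="cpu")
    kept = {name: tensor for name, tensor in state.items()
            if not name.startswith(drop_prefixes)}
    model.load_state_dict(kept, strict=False)  # strict=False: allow gaps
    return model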
3. EXPERIMENTS
In this section we describe the experimental setting, algorithm parameters and results. We compare the results of the algorithm proposed in Alg. 1 to two baselines: the transfer-learning model mentioned in Section 2.3, and a model trained with the conventional method of applying curriculum learning as described in [4]. Baseline 2 was also trained with transfer learning and is the discrete-curriculum model, where the data was split into 5 tasks. Both baselines use the same training and development batch sizes, learning rate and dropout.

We apply Algorithm 1 to Tatar (ISO 639: tat), a Turkic language spoken in central areas of Russia with around 4.3 million first-language speakers [12]. The dataset consists of short utterances from Mozilla's Common Voice [1]. For Tatar in the 2019-12-10 release there are 27 hours of speech. The data is split into training, development and test subsets with 7,131, 4,815 and 4,855 examples respectively. Note that this dataset split is different from the one in [1]: the precise split from [1] was made on an unpublished version of the corpus which is no longer available (Meyer, p.c.), so the results are unfortunately not directly comparable; the model setup is, however, the same.

The motivation behind choosing a real low-resource language instead of artificially creating one using English data is that real datasets better reflect real-world conditions. For example, Tatar has vowel harmony, the orthography is more transparent than English, and words are in general longer and fewer. The Tatar-speaking community is interested in speech recognition, and we hope that this work will be of direct use to them.
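We evaluate with word error rate (WER) and character error rate (CER). Both are standard edit-distance-based rates; for reference, a minimal implementation (ours, purely for illustration) is shown below. WER operates on word sequences, CER on character sequences.

def edit_distance(ref, hyp):
    """Levenshtein distance via space-optimized dynamic programming."""
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[j] = min(d[j] + 1,           # deletion
                       d[j - 1] + 1,       # insertion
                       prev_diag + cost)   # substitution / match
            prev_diag = cur
    return d[len(hyp)]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance over word tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over characters."""
    return edit_distance(ref, hyp) / len(ref)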
[Fig. 2: Validation loss (CTC loss against epochs) for transfer-learning, transfer-learning+SWTSK, transfer-learning+UCB1+PG, transfer-learning+EXP3+PG, transfer-learning+UCB1+SPG and transfer-learning+EXP3+SPG, together with a comparison of cumulative rewards for the bandit models.]
Table 1 summarizes the experimental configurations and results for the models. Note that the reported configurations are for the most successful experiments only, although we also tested different numbers of tasks and training for more episodes. The best reduction in WER was achieved by combining UCB1 with self-prediction gain, while the best CER was achieved with the EXP3 algorithm and prediction gain. The WER results for the two algorithms are comparable, i.e. the WER does not differ much between them.

By implementing bandit algorithms we obtained a 10% reduction in WER and a 27% reduction in CER over Baseline 1. Compared to Baseline 2, our approach improved WER by approximately 1% and CER by 3.4%. The discrepancy is not that large in terms of the evaluation metrics, but our models managed to achieve the same result in less time, i.e. the bandit algorithms contributed to faster convergence, leading to better loss reduction. The validation curves in Fig. 2 support this claim: both baseline models took 25 training epochs to reach a plateau, while the bandit algorithms converged in only 10 epochs.
4. DISCUSSION
Fig. 3 demonstrates the action selection during the final epoch, when the model has reached convergence. Each plot presents the order of the actions taken by the agent. The actions on the y-axis show the level of complexity on a scale of 0 to 4, where zero is the index of the task with the highest compression ratio and four the lowest. The first row in Figure 3 shows that the action selection is more stable for the UCB1 + PG algorithm, and the agent follows a certain order of tasks. An interesting observation is that the agent exhausts tasks with lower compression ratios first and tasks with higher CRs later, whereas UCB1 + SPG actually learns the curriculum that we designed, i.e. it proceeds through tasks from easy to hard. Our hypothesis is that the nature of the prediction gains impacts the behaviour of the models.

[Fig. 3: Actions selected by the algorithms (UCB1+PG, UCB1+SPG, EXP3+PG, EXP3+SPG) in the course of training; x-axis: steps, y-axis: actions.]

Prediction gain is a biased estimate of the change in the expected loss on a task (Eq. 2). Harder tasks contain longer examples with more noise. We suggest that the model is able to learn better from harder examples, especially during early stages of training, as these examples contain more information, which is captured by PG; the agent therefore prefers harder tasks first.
Self-prediction gain is an unbiased estimate of the change in the expected loss over a task (Eq. 3). When trained on an easier task, the model is able to generalize better to examples sampled from the same, simpler task. The reason for this is the higher loss gradient for easier tasks; the model thus obtains a greater reward, in contrast to harder tasks, where the loss gradient, and hence SPG, tends to be lower.

However, the same trend in action selection is not observable for the EXP3 algorithm. The generated curriculum is not easily interpretable in terms of the complexity metric we have defined. One possibility is that the batch selection follows another metric that we have yet to observe in the data. We intend to investigate this in future work.
5. CONCLUDING REMARKS
In this article we have presented several approaches to automatic curriculum development for ASR using bandit algorithms, which improve absolute performance in terms of WER and CER and decrease training time. In addition we have outlined a novel measure of training-sample complexity, the compression ratio, which captures the amount of noise, or entropy. This measure is trivial to apply and substantially improves over random selection of training examples.

Acknowledgements
We would like to thank Mansur Saykhunov for his help in creating a language model for Tatar. We would also like to thank Nils Hjortnæs and Josh Meyer for their helpful comments.
6. REFERENCES

[1] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber, "Common Voice: A massively-multilingual speech corpus," in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, pp. 4211–4215.

[2] Jeffrey L. Elman, "Learning and development in neural networks: the importance of starting small," Cognition, vol. 48, no. 1, pp. 71–99, 1993.

[3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston, "Curriculum learning," in Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), New York, NY, USA, 2009, pp. 41–48, Association for Computing Machinery.

[4] Suyoun Kim, Michael L. Seltzer, Jinyu Li, and Rui Zhao, "Improved training for online end-to-end speech recognition systems," CoRR, vol. abs/1711.02212, 2017.

[5] Stefan Braun, Daniel Neil, and Shih-Chii Liu, "A curriculum learning method for improved noise robustness in automatic speech recognition," CoRR, vol. abs/1606.06864, 2016.

[6] Taku Kala and Takahiro Shinozaki, "Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection," Apr 2018.

[7] Alex Graves, Marc G. Bellemare, Jacob Menick, Rémi Munos, and Koray Kavukcuoglu, "Automated curriculum learning for neural networks," 2017.

[8] Yi Hu, "Subjective evaluation and comparison of speech enhancement algorithms," Speech Communication, vol. 49, pp. 588–601, 2007.

[9] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, pp. 235–256, 2002.

[10] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM J. Comput., vol. 32, no. 1, pp. 48–77, Jan. 2003.

[11] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," 2014.

[12] David M. Eberhard, Gary F. Simons, and Charles D. Fennig, "Tatar language of Russia," in Ethnologue: Languages of the World, SIL International, Dallas, Texas.