Semi-Supervised Learning with Data Augmentation for End-to-End ASR
Felix Weninger, Franco Mana, Roberto Gemello, Jesús Andrés-Ferrer, Puming Zhan
Nuance Communications, Inc., Burlington, MA, USA; Torino, Italy; Valencia, Spain
{felix.weninger,franco.mana,roberto.gemello,jesusandres.ferrer,puming.zhan}@nuance.com

Abstract
In this paper, we apply Semi-Supervised Learning (SSL) along with Data Augmentation (DA) for improving the accuracy of End-to-End ASR. We focus on the consistency regularization principle, which has been successfully applied to image classification tasks, and present sequence-to-sequence (seq2seq) versions of the FixMatch and Noisy Student algorithms. Specifically, we generate the pseudo labels for the unlabeled data on-the-fly with a seq2seq model after perturbing the input features with DA. We also propose soft label variants of both algorithms to cope with pseudo label errors, showing further performance improvements. We conduct SSL experiments on a conversational speech data set (doctor-patient conversations) with 1.9 kh manually transcribed training data, using only 25 % of the original labels (475 h labeled data). As a result, the Noisy Student algorithm with soft labels and consistency regularization achieves 10.4 % word error rate (WER) reduction when adding 475 h of unlabeled data, corresponding to a recovery rate of 92 %. Furthermore, when iteratively adding 950 h more unlabeled data, our best SSL performance is within 5 % WER increase compared to using the full labeled training set (recovery rate: 78 %).
Index Terms: automatic speech recognition, semi-supervised learning, data augmentation, sequence-to-sequence, end-to-end
1. Introduction
End-to-end (E2E) systems have become a focus of ASR research in recent years, due to their ability to integrate all components of an ASR system in a single deep neural network (DNN), which greatly simplifies and unifies the training and decoding process [1–5]. The Sequence-to-Sequence (seq2seq) model with attention is one of the model architectures for E2E ASR systems which has shown state-of-the-art performance [6–10]. However, a general observation is that E2E ASR needs large amounts of training data to achieve state-of-the-art performance, especially when no language model trained on external text data is included in the system. Data Augmentation (DA) and semi-supervised learning (SSL) are two approaches that can be used to improve E2E model performance with limited amounts of manually transcribed training data.

DA perturbs (usually randomly) the input data without altering the corresponding labels. This not only increases the variety of the data, but also serves as implicit regularization to avoid overfitting [11]. It has been successfully used for both conventional [12–14] and E2E ASR systems [15, 16]. In particular, the SpecAugment approach proposed in [17] has shown impressive improvements for seq2seq based E2E ASR models.

SSL (also called semi-supervised training) aims at leveraging unlabeled data to improve ASR model accuracy. In the self-training paradigm for SSL, a seed model trained with a limited amount of labeled data is used to generate transcriptions (pseudo labels) for unlabeled data (cf. [18]). This procedure can be iterated on additional unlabeled data [19]. Another possible implementation of SSL is via teacher-student training, where a 'student' model is trained to replicate the outputs of a powerful 'teacher' model on the unlabeled data [20].

The central research question of our paper is how to best integrate SSL with DA for training E2E ASR systems.
So far, the use of DA with SSL for E2E ASR has been largely limited to a simple cascade of both, i.e., generating pseudo labels for the unlabeled data and then applying DA [21–24]. In contrast, for image classification, several algorithms have recently been proposed that use DA for teacher-student training [25] and consistency regularization in SSL [26–28]. Consistency regularization stems from the intuition that a small perturbation of an input data sample should not change the output distribution much. However, these SSL algorithms were designed for static classification only and need to be modified to support the seq2seq ASR use case.

In this regard, our paper makes the following contributions: First, we modify the Noisy Student [25] and FixMatch [28] algorithms, which have only been applied to image classification so far, for the seq2seq ASR use case. Second, we show performance improvements for both algorithms by using soft labels and consistency training (via SpecAugment and dropout). Finally, we demonstrate additional gains from iterative generation of pseudo labels by exploiting a larger amount of unlabeled data. We show that our proposed methods outperform the simple approach of applying DA after generating pseudo labels.
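To make the DA side of this combination concrete, the following is a minimal sketch of SpecAugment-style masking on a (time, frequency) log-mel feature array. The function name, mask counts, and mask widths are illustrative assumptions, not the settings used in the paper; the key property shown is that only the input is perturbed, while the label sequence stays unchanged.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=50, rng=None):
    """SpecAugment-style masking (illustrative parameters).

    spec: np.ndarray of shape (time, freq), e.g. log-mel features.
    Returns a masked copy; the labels of the utterance are untouched.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    T, F = out.shape
    # Zero out a few random frequency bands ...
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(F - w, 0) + 1))
        out[:, f0:f0 + w] = 0.0
    # ... and a few random time spans.
    for _ in range(num_time_masks):
        w = int(rng.integers(0, min(time_width, T) + 1))
        t0 = int(rng.integers(0, max(T - w, 0) + 1))
        out[t0:t0 + w, :] = 0.0
    return out
```

Because the perturbation is random per draw, applying it twice to the same utterance yields two different augmented views, which is exactly what consistency-based SSL methods exploit.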
Relation to prior work:
Several studies have recently investigated SSL techniques for E2E ASR, e.g. representation learning [29], the usage of external text data [30], text-to-speech [31], and transcriptions generated by conventional ASR [32]. Regarding self-training for E2E ASR, [22] proposed data filtering and ensemble schemes, generating hard pseudo labels via beam search. In [33], dropout was employed to improve the pseudo label accuracy and confidence measure, due to its well-known model ensembling property. None of these works considered the interaction of SSL with DA, which we investigate in our study. Very recently, [34] proposed teacher-student learning with DA for consistency training, without considering the Noisy Student or FixMatch algorithms or soft labels as in our work.
2. Methods
For our E2E ASR models, we use an encoder-decoder architecture with attention as described in [9], which is similar to Listen-Attend-Spell (LAS) [6]. The ASR task is treated as a seq2seq learning problem: The model M is trained to predict a sequence y_j of symbols (here, we use sub-word units) from a sequence of acoustic features (usually a spectrogram x).

The encoder e creates a hidden representation of the acoustic features. Here, e is implemented as a stack of convolutional (CNN) layers followed by bidirectional Long Short-Term Memory (bLSTM) layers. The decoder is similar to an RNN LM that takes into account a context vector c_j. Bahdanau attention [35] is used to focus c_j on various parts of the encoder output. The output distribution is p_j = p(y_j | y_1, ..., y_{j-1}, x) = M(x, y_{1:j-1}).
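One decoder step of the additive (Bahdanau) attention just described can be sketched as follows. The function name, the single-step interface, and all matrix shapes are illustrative assumptions; the paper's actual model uses CNN/bLSTM encoders and sub-word outputs, which are abstracted away here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_step(enc_out, dec_state, Wq, Wk, v, Wo):
    """One decoder step with additive (Bahdanau) attention.

    enc_out:   (T, H) encoder hidden states for T frames
    dec_state: (D,)   current decoder state s_j
    Wq, Wk, v: attention parameters; Wo: output projection
    Returns the context vector c_j and the output distribution p_j.
    """
    # Additive attention scores: v^T tanh(Wk h_t + Wq s_j) for each frame t
    scores = np.tanh(enc_out @ Wk.T + dec_state @ Wq.T) @ v   # (T,)
    alpha = softmax(scores)                                   # attention weights
    c = alpha @ enc_out                                       # (H,) context vector c_j
    # Output distribution p_j = p(y_j | y_<j, x) from decoder state and context
    logits = Wo @ np.concatenate([dec_state, c])
    return c, softmax(logits)
```

In the real model, dec_state would be the recurrent state of the RNN-LM-like decoder after consuming the previous (sub-word) token, and the step would be applied autoregressively.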
[Figure: Sequence-to-sequence FixMatch algorithm.]
Since PT generation is done only once, it is reasonable to use a large beam size W to improve the PT quality (here, we set W = 16). Furthermore, we apply a heuristic loop filtering technique to the PT utterances similar to the one from [22]. This helps us avoid reinforcing the well-known 'looping' problem, where the seq2seq model keeps repeating the same n-gram.

Semi-supervised training can be performed by directly using the PT transcriptions as pseudo labels ŷ_{i,j} in (7). Alternatively, we can dynamically update the PT transcriptions in the training process as in the FixMatch or Noisy Student approaches.

The FixMatch algorithm is a self-training method that was proposed in [28] for image classification. The method implements consistency training by applying two kinds of data augmentation to the input x in both the unlabeled loss and the pseudo label generation, thus encouraging the outputs to be consistent for both augmented inputs. Here, we present an extension of FixMatch to seq2seq models. Specifically, we generate pseudo labels ỹ_{i,j} for unlabeled training examples x_{u_i} as: ỹ_{i,j} = M_θ(A_w(x_{u_i}), ŷ_{i,1:j-1})
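The loop filter and the weak-augmentation pseudo-labeling step described above can be sketched as follows. This is a minimal illustration: the function names, the repeat threshold, and the `decode`/`weak_aug` callables are hypothetical stand-ins, not the paper's exact filter or augmentation policy.

```python
def has_ngram_loop(tokens, n=3, max_repeats=2):
    """Heuristic loop filter: flag a hypothesis in which the same n-gram
    occurs more than max_repeats times (the seq2seq 'looping' failure)."""
    counts = {}
    for i in range(len(tokens) - n + 1):
        g = tuple(tokens[i:i + n])
        counts[g] = counts.get(g, 0) + 1
        if counts[g] > max_repeats:
            return True
    return False

def fixmatch_pseudo_labels(decode, x_u, weak_aug, loop_n=3):
    """Seq2seq FixMatch-style pseudo labeling: decode the *weakly*
    augmented input to obtain pseudo labels; the unlabeled loss is then
    computed on a *strongly* augmented copy of the same input (not shown).
    Returns None if the hypothesis is rejected by the loop filter."""
    y_tilde = decode(weak_aug(x_u))
    if has_ngram_loop(y_tilde, n=loop_n):
        return None
    return y_tilde
```

Filtered utterances would simply be dropped from (or down-weighted in) the unlabeled loss, so that looping hypotheses are not reinforced during training.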