Enhancing Audio Augmentation Methods with Consistency Learning
Turab Iqbal¹, Karim Helwani², Arvindh Krishnaswamy², Wenwu Wang¹

¹Centre for Vision, Speech and Signal Processing, University of Surrey, UK
²Amazon Web Services, Inc., Palo Alto, CA, USA
{t.iqbal,w.wang}@surrey.ac.uk, {helwk,arvindhk}@amazon.com
ABSTRACT
Data augmentation is an inexpensive way to increase training data diversity, and is commonly achieved via transformations of existing data. For tasks such as classification, there is a good case for learning representations of the data that are invariant to such transformations, yet this is not explicitly enforced by classification losses such as the cross-entropy loss. This paper investigates the use of training objectives that explicitly impose this consistency constraint, and how it can impact downstream audio classification tasks. In the context of deep convolutional neural networks in the supervised setting, we show empirically that certain measures of consistency are not implicitly captured by the cross-entropy loss, and that incorporating such measures into the loss function can improve the performance of tasks such as audio tagging. Put another way, we demonstrate how existing augmentation methods can further improve learning by enforcing consistency.
Index Terms — Audio classification, data augmentation, consistency learning, neural networks
1. INTRODUCTION
For tasks such as audio classification, a de facto practice when training deep neural networks is to use data augmentation, as it is an inexpensive way to increase the amount of training data. The most common approach to data augmentation is to use transformations of existing training data. Examples of such transformations for audio include time-frequency masking, the addition of noise, pitch shifting, equalization, and adding reverberations [1, 2, 3]. These transformations are intended to preserve the semantics of the data, so that for an instance x belonging to a class y, a transformation x′ = T(x) should also map to y. From the perspective of representation learning, it is also desirable for the model's latent representation of the data to capture the data's properties [4]. This means that the representation should remain unchanged or only slightly changed under these transformations too. By learning representations that behave in this way, tasks such as classification can benefit in terms of improved robustness to nuisance factors and better generalization performance [4, 5, 6].

The standard cross-entropy function does not enforce this invariance constraint explicitly. That is, the trained model's representations of instances x and x′ may differ significantly. To impose consistency between similar instances, there has been interest in incorporating suitable similarity measures into the training objective – either as a standalone loss or as an additional loss term. In these contexts, they are sometimes referred to as consistency losses [6, 7] or as stability losses [5]. Closely related to this are contrastive losses and triplet losses [8, 9, 10, 11], where the objective is to cluster instances that are similar while also separating instances that are dissimilar. Unlike the cross-entropy loss, consistency losses and their offshoots do not require ground truth labels, and have thus been adopted in unsupervised and semi-supervised settings [7, 10, 11]. In the image domain, their efficacy has also been demonstrated for supervised learning in terms of improving robustness against distortions [5, 6].

In this paper, we investigate the use of consistency losses for audio classification tasks in the supervised learning setting. We examine several audio transformations that could be used for data augmentation, and impose consistency in a suitable latent space of the model when using these transformations. We are interested in whether enforcing consistency can influence the learned representations of the neural network model in a significant way, and, if so, whether this is beneficial for downstream audio classification tasks. An affirmative outcome would give a new purpose to data augmentation methods and further enhance their utility. To our knowledge, this is the first study in this direction for audio classification.

More concretely, we propose using the Jensen-Shannon divergence as a loss term to constrain the class distribution P(Ŷ|X) of the neural network to not deviate greatly under certain transformations. On the ESC-50 environmental sound dataset [12], the proposed method is shown to bring notable improvements to existing augmentation methods. By tracking the Jensen-Shannon divergence as training progresses, regardless of whether it is minimized, we verify our claims about the consistency of the model outputs.
We also discover that the cross-entropy loss on its own can encourage consistency to some extent if the data pipeline is modified to include x and its transformations in the same training mini-batch – a variation we call batched data augmentation.

1.1. Related Work

In terms of using consistency learning, the closest analog to our work is the AugMix algorithm [6], where the authors also propose the Jensen-Shannon divergence as a consistency loss. Another related work is from Zheng et al. [5], where they use the Kullback-Leibler divergence for class distributions and the L2 distance for feature embeddings. These works look at improving robustness for image recognition when distortions are present, while our work is on general audio recognition. In addition, they only use the transformations to minimize the consistency loss and not the cross-entropy loss. We observed that the benefits of augmentation are partially lost this way. A consistency learning framework was also proposed by Xie et al. [13], but for unsupervised learning of non-audio tasks.

A similar paradigm is contrastive learning, where, using a similarity measure, a margin is maintained between similar instances and dissimilar instances in an unsupervised fashion. In the audio domain, contrastive learning has been explored for unsupervised and semi-supervised learning [10, 11, 14]. Another related concept is virtual adversarial training (VAT) [15, 16], which also promotes consistency, but for adversarial perturbations of the data.
2. CONSISTENCY LEARNING
We first develop the consistency learning framework that our proposed method is based on. A neural network is a function f : X → Z that is composed of several lower-level functions, f_l, such that f = f_L ∘ … ∘ f_1. Each f_l corresponds to a layer in the neural network. Using a learning algorithm, the parameters of f are optimized with respect to a suitable training objective. For a classification task with K classes, f(x) is a vector of K class probabilities, from which the class that x belongs to can be inferred. The objective, in this case, is to minimize the classification error, or rather a surrogate of this error that is feasible to compute. In the supervised setting, this surrogate is most commonly the cross-entropy loss function, ℓ_ce.

Each layer of f produces a latent representation of the data. For certain architectures, including the convolutional neural network architectures popular in image/audio classification, earlier layers tend to capture the low-level properties of the data after training, while the later layers tend to capture the high-level properties [4]. Therefore, when concerned with the data's high-level properties, the output of the penultimate layer, f_{L−1}, or sometimes the output of the final layer, f_L, is considered to be the representation of interest. We will denote as G(x) such a representation of x ∈ X.

Since G(x) is intended to capture high-level features of x, it should be insensitive to small perturbations of x, such that G(x) ≈ G(T(x)) for any T(x) that preserves such features. In particular, this property should hold for the transformations used in data augmentation. The motivation for learning such representations is to improve downstream classification tasks. However, the cross-entropy loss does not explicitly enforce this property of consistency. To enforce consistency, we can define a similarity measure, D(G(x), G(x′)), that is to be minimized when x′ = T(x) is a perturbation of x. To do this, we add the similarity measure as a loss term, giving:

ℓ(x, x′, y) := ℓ_ce(f(x), y) + λ ℓ_sim(x, x′),   (1)
ℓ_sim(x, x′) := D(G(x), G(x′)),   (2)

where y is the ground truth label and λ determines the strength of the new consistency loss term. Note that the cross-entropy loss can also be minimized with respect to x′, since x′ belongs to class y by design. Therefore, we modify (1) to give

ℓ(x, x′, y) := ℓ_jce(x, x′, y) + λ ℓ_sim(x, x′),   (3)
ℓ_jce(x, x′, y) := ½ [ℓ_ce(f(x), y) + ℓ_ce(f(x′), y)].   (4)

The training objective is then to minimize (3) rather than the cross-entropy loss on its own. This formulation can also be generalized to handle multiple transformations x₁, …, x_n of x, provided that the measure D can accommodate them.
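To make the objective concrete, the following is a minimal PyTorch sketch of (3) and (4). It is our illustration rather than the authors' released code: the model is assumed to output unnormalized logits, and the divergence D is left as a pluggable callable (which may apply a softmax internally if it operates on probabilities).

```python
import torch.nn.functional as F

def joint_loss(model, x, x_t, y, lam, divergence):
    """Eq. (3): joint cross-entropy over (x, x') plus a weighted
    consistency term between the two representations, with G = f."""
    logits = model(x)      # f(x)
    logits_t = model(x_t)  # f(x'), the transformed instance
    # Eq. (4): average the cross-entropy terms, since x' shares the label y.
    l_jce = 0.5 * (F.cross_entropy(logits, y) + F.cross_entropy(logits_t, y))
    # Eq. (2): similarity measure D(G(x), G(x')) applied to the outputs.
    l_sim = divergence(logits, logits_t)
    return l_jce + lam * l_sim
```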
3. PROPOSED METHOD
Given the framework based on (3), the main considerations for implementing consistency learning are the choices of the transformations and the exact form of D. These concerns are addressed in the following subsections.

3.1. Transformations

This paper considers three types of transformations, which are:
• Pitch shifting: The pitch of the audio clip is shifted to be higher or lower without affecting the clip's duration. The pitch is randomly shifted by l semitones, where l is drawn from a fixed set of ten non-zero values in half-semitone steps, symmetric about zero.

• Reverberations: Reverberations are added to the audio by convolving the waveform with a randomly-generated room impulse response (RIR). The RIRs were set to have an RT60 in a fixed range starting at 200 ms. We generated a pool of RIRs in advance and selected one at random each time a transformation needed to be applied.
• Time-frequency masking: Regions of the spectrogram are randomly masked out in an effort to encourage the neural network to correctly infer the class despite the missing information. This is done in the same way as the SpecAugment algorithm [2], such that the number, size, and position of the regions is random. This is the only transformation that is applied to the spectrogram rather than the audio waveform directly.

Each type of transformation can produce several variations. A specific variation is selected randomly each time an instance needs to be transformed. We apply two transformations to each instance x, giving the triplet (x, x₁, x₂). The consistency learning objective is then to ensure G(x), G(x₁), and G(x₂) do not diverge from each other. We found that applying two transformations rather than one improved the performance of the trained model by a significant margin. It also allows for greater diversity, because x₁ and x₂ can be generated using different types of transformations.

Since the training instances must be processed in triplets to use our method, the mini-batches used for training must be of size 3N, where N is the number of original instances. As an ablation study, we also examine the case in which this batch arrangement is used without the consistency loss term, giving ℓ_jce(x, x₁, x₂, y) as the loss instead. We refer to this as batched data augmentation (BDA).

3.2. Similarity Measure

The representation G(x) we adopt in this paper is the output of the final layer of the neural network, i.e. G(x) := f(x). This means G(x) represents a class probability distribution, P(Ŷ|X = x), which is the model's estimate of the true class probability distribution, P(Y|X = x). This choice of G(x) has high interpretability, since it is a vector of probabilities associated with the target classes. Other latent representations typically demand some form of metric learning before they can be endowed with a similarity measure and interpreted [9]. Since G(x) is a probability distribution, familiar probability distribution divergences can be used directly.

We propose to use the Jensen-Shannon (JS) divergence as the similarity measure D. Given that we wish to measure the similarity between three distributions – P_x, P_{x₁}, and P_{x₂}, where P_x := P(Ŷ|X = x) – the JS divergence is defined as

JSD(P_x, P_{x₁}, P_{x₂}) := ⅓ [KL(P_x ‖ M) + KL(P_{x₁} ‖ M) + KL(P_{x₂} ‖ M)],   (5)

where M := ⅓ (P_x + P_{x₁} + P_{x₂}) and KL(P ‖ Q) is the Kullback-Leibler (KL) divergence from Q to P. The primary reason for using the JS divergence is that it can handle an arbitrary number of distributions, while other divergences such as the KL divergence are defined for two distributions only.

3.3. Weighting the Consistency Loss

Imposing consistency is arguably less meaningful when the neural network outputs incorrect predictions. In these cases, the consistency loss may negatively affect the learning process. To avoid this, we propose to linearly increase the weight λ from zero to a fixed value for the first m epochs. The rationale is that mispredictions are common at the beginning of training, but less likely as training progresses. Our experiments showed a measurable improvement when using this heuristic.
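As an illustration of (5) and the ramp schedule of Section 3.3, the sketch below shows one way to implement both in PyTorch. The names (js_consistency, lambda_schedule) are ours, and the model is assumed to output logits that are normalized with a softmax.

```python
import torch
import torch.nn.functional as F

def js_consistency(logits_x, logits_x1, logits_x2, eps=1e-8):
    """Eq. (5): JS divergence between the class distributions P_x, P_x1,
    P_x2, obtained by softmax-normalizing the model outputs."""
    p = F.softmax(logits_x, dim=1)
    p1 = F.softmax(logits_x1, dim=1)
    p2 = F.softmax(logits_x2, dim=1)
    m_log = torch.log((p + p1 + p2) / 3.0 + eps)  # log of the mixture M
    # F.kl_div(log_q, p) computes KL(p || q) when log_q holds log-probabilities.
    return (F.kl_div(m_log, p, reduction="batchmean")
            + F.kl_div(m_log, p1, reduction="batchmean")
            + F.kl_div(m_log, p2, reduction="batchmean")) / 3.0

def lambda_schedule(epoch, m=10, lam_max=5.0):
    """Linearly ramp the consistency weight from 0 to lam_max over the
    first m epochs (Section 3.3), then hold it fixed."""
    return lam_max * min(epoch / m, 1.0)
```

With G(x) := f(x), this term plugs directly into the divergence slot of the joint loss sketched in Section 2.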
4. EXPERIMENTS
In this section, we present experiments to evaluate our method. Our intention is to compare the performance of standard data augmentation methods to the proposed method, which uses the same audio transformations but with a different data pipeline and a different loss function. The modified data pipeline on its own corresponds to BDA (see Section 3.1), which is also compared in our experiments. The models used for training are convolutional neural networks (CNNs) with log-scaled mel spectrogram inputs. These CNN models were evaluated on the ESC-50 environmental sound classification dataset [12].
The ESC-50 dataset is comprised of audio recordings for environmental audio classification. There are 50 sound classes, with 40 recordings per class. Each recording is five seconds in duration and is sampled at 44.1 kHz. The recordings are sourced from the Freesound database (https://freesound.org) and are relatively free of noise. The dataset creators split the dataset into five folds for the purpose of cross-validation. To evaluate the systems, we use the given cross-validation setup and report the accuracy, which is the percentage of correct predictions.

The neural network used in our experiments is a CNN based on the VGG architecture [17]. The main differences are the use of batch normalization [18], global average pooling after the convolutional layers, and only one fully-connected layer instead of three. The model contains eight convolutional layers with the following number of output feature maps: 64, 64, 128, 128, 256, 256, 512, 512. The inputs to the neural network are mel spectrograms, generated using a short-time Fourier transform (STFT) with a fixed window size, hop length, and number of mel bins.

The models were trained using the AdamW optimization algorithm [19] with weight decay, using a learning rate that was decayed by 10% after every two epochs. For our proposed method and the BDA method, each mini-batch of size 3N comprised N original instances and 2N transformations. Although some models converged before the end of the training schedule, the performance did not degrade with further training.

The consistency loss term of (3) has one hyperparameter, λ, which is the weight. Following the discussion in Section 3.3, we initially set the weight to zero and linearly increased it after each epoch until the m-th epoch, at which point it remained at a fixed value. In our experiments, m = 10 and the fixed value is λ = 5. These hyperparameter values were selected using a validation set, though we found that rigorous fine-tuning was not necessary (e.g. λ = 7 gave similar results).

Table 1: The experimental results for ESC-50. The average accuracy and standard error are stated along with the absolute improvement compared to using no data augmentation, for the No Augmentation baseline and for the Pitch-Shift, Reverb, TF-Masking, and Combination models in their standard, -BDA, and -CL variants.

For our experiments, we trained 13 types of models, one of which is the CNN without any data augmentation applied. Four of the models apply standard data augmentation, i.e. the training set is simply augmented and instances are sampled from it as normal. They are Pitch-Shift, Reverb, TF-Masking, and Combination. As the names imply, three of these apply just a single type of transformation (cf. Section 3.1).
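For the -BDA and -CL variants, each mini-batch must carry the triplets described in Section 3.1. A sketch of how such a batch could be assembled is given below; the entries of transforms (e.g. a pitch-shift or reverberation callable) are hypothetical stand-ins for the actual augmentation code, and each transform is assumed to preserve the waveform length so the tensors can be stacked.

```python
import random
import torch

def make_triplet_batch(waveforms, transforms):
    """Turn N original waveforms into a 3N-sized batch (x, x1, x2),
    where x1 and x2 are independently sampled transformations of x."""
    originals, first, second = [], [], []
    for x in waveforms:
        t1 = random.choice(transforms)  # e.g. pitch shift
        t2 = random.choice(transforms)  # e.g. reverberation
        originals.append(x)
        first.append(t1(x))
        second.append(t2(x))
    return torch.stack(originals), torch.stack(first), torch.stack(second)
```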
Combination applies either pitch-shifting or reverberations randomly with equal probability. Complementing the aforementioned four models are the BDA variations, which are suffixed with -BDA in the results table; and the variations using our consistency learning method, which are suffixed with -CL.

The results are presented in Table 1. The vanilla CNN achieves an accuracy that matches results presented in the past for such an architecture [20]. The models implementing standard augmentation improve the performance marginally, with an average accuracy increase of 0.36%. For batched data augmentation (BDA), sizable improvements can be observed for all of the transformations (average improvement of 1.38%), including the Combination variant, where the improvement over the vanilla CNN is 2.25%. Using the consistency loss, the improvements are even greater, both on average and in terms of the single largest increase, which was observed when combining transformations. Overall, these results show that consistency learning can benefit audio classification and that BDA is also superior to standard augmentation. It should be noted that using two transformations per instance instead of one made a large difference in our experiments.
Fig. 1: The consistency loss, ℓ_sim, measured after each training epoch for the Combination models on (a) the training set and (b) the test set of the fold 1 split.

To confirm whether the consistency loss term is indeed enforcing consistency more effectively than the cross-entropy loss on its own, we measured the average JS divergence after each training epoch. This can be carried out for any model provided the data is processed in triplets during the validation. Figure 1 plots the progress of the consistency loss term for the training set and the test set of the fold 1 split – specifically for the Combination models, although we observed similar patterns with the other models. The figures show that the cross-entropy loss on its own encourages consistency to some extent, but not as effectively as having an explicit loss term. It is interesting to note that using BDA resulted in a lower JS divergence for the training set than standard data augmentation.
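For reference, the per-epoch measurement behind Fig. 1 can be reproduced for any model with a small evaluation loop. The sketch below is one way to do so under our assumptions: it reuses the js_consistency function from Section 3 and expects a data loader that yields (x, x₁, x₂) triplets.

```python
import torch

@torch.no_grad()
def mean_js_divergence(model, triplet_loader, device="cpu"):
    """Average the JS consistency term over a dataset processed in
    triplets, as plotted after each training epoch in Fig. 1."""
    model.eval()
    total, num_batches = 0.0, 0
    for x, x1, x2 in triplet_loader:
        x, x1, x2 = x.to(device), x1.to(device), x2.to(device)
        total += js_consistency(model(x), model(x1), model(x2)).item()
        num_batches += 1
    return total / max(num_batches, 1)
```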
5. CONCLUSION
In this paper, we investigated consistency learning as a way to regularize the latent space of deep neural networks with respect to input transformations commonly used for data augmentation. We argued that enforcing consistency can benefit tasks such as audio classification. We proposed using the Jensen-Shannon divergence as a consistency loss term and used it to constrain the neural network output for several audio transformations. Experiments on the ESC-50 audio dataset demonstrated that this method can enhance existing data augmentation methods for audio tagging, and confirmed that consistency is enforced more effectively with an explicit loss term.

6. REFERENCES

[1] J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,”
IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.

[2] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proceedings of Interspeech, Graz, Austria, 2019, pp. 2613–2617.

[3] U. Isik, R. Giri, N. Phansalkar, J.-M. Valin, K. Helwani, and A. Krishnaswamy, “PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss,” in Proceedings of Interspeech, 2020.

[4] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug. 2013.

[5] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, “Improving the robustness of deep neural networks via stability training,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 4480–4488.

[6] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “AugMix: A simple data processing method to improve robustness and uncertainty,” in International Conference on Learning Representations (ICLR), 2020.

[7] G. French, S. Laine, T. Aila, M. Mackiewicz, and G. Finlayson, “Semi-supervised semantic segmentation needs strong, varied perturbations,” in British Machine Vision Virtual Conference (BMVC), 2020.

[8] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.

[9] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning (ICML), 2020.

[10] A. Jansen, M. Plakal, R. Pandya, D. P. W. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, “Unsupervised learning of semantic audio representations,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 126–130.

[11] N. Turpault, R. Serizel, and E. Vincent, “Semi-supervised triplet loss based learning of ambient audio embeddings,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 760–764.

[12] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 23rd ACM International Conference on Multimedia, New York, NY, USA, 2015, pp. 1015–1018.

[13] Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le, “Unsupervised data augmentation for consistency training,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.

[14] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” in Proceedings of Interspeech, Graz, Austria, 2019, pp. 3465–3469.

[15] T. Miyato, S. Maeda, M. Koyama, and S. Ishii, “Virtual adversarial training: A regularization method for supervised and semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1979–1993, Jul. 2019.

[16] F. L. Kreyssig and P. C. Woodland, “Cosine-distance virtual adversarial training for semi-supervised speaker-discriminative acoustic embeddings,” in Proceedings of Interspeech, 2020.

[17] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), San Diego, CA, 2015.

[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), Lille, France, 2015, vol. 37, pp. 448–456.

[19] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), New Orleans, LA, 2019.

[20] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.