Improving Deep-learning-based Semi-supervised Audio Tagging with Mixup
Léo Cances, Etienne Labbé, Thomas Pellegrini
IRIT, Université Paul Sabatier, CNRS, Toulouse, France
Abstract—Recently, semi-supervised learning (SSL) methods, in the framework of deep learning (DL), have been shown to provide state-of-the-art results on image datasets by exploiting unlabeled data. Most often tested on object recognition tasks in images, these algorithms are rarely compared when applied to audio tasks. In this article, we adapted four recent SSL methods to the task of audio tagging. The first two methods, namely Deep Co-Training (DCT) and Mean Teacher (MT), involve two collaborative neural networks. The two other algorithms, called MixMatch (MM) and FixMatch (FM), are single-model methods that rely primarily on data augmentation strategies. Using the Wide ResNet 28-2 architecture in all our experiments, 10% of labeled data and the remaining 90% as unlabeled data, we first compare the four methods' accuracy on three standard benchmark audio event datasets: Environmental Sound Classification (ESC-10), UrbanSound8K (UBS8K), and Google Speech Commands (GSC). MM and FM outperformed MT and DCT significantly, MM being the best method in most experiments. On UBS8K and GSC, in particular, MM achieved 18.02% and 3.25% error rates (ER), outperforming models trained with 100% of the available labeled data, which reached 23.29% and 4.94% ER, respectively. Second, we explored the benefits of using the mixup augmentation in the four algorithms. In almost all cases, mixup brought significant gains. For instance, on GSC, FM reached 4.44% and 3.31% ER without and with mixup.
Index Terms—Audio tagging, semi-supervised deep learning.
I. INTRODUCTION

SEMI-SUPERVISED learning (SSL) aims to reduce the dependency of deep learning systems on labeled data by integrating non-labeled data during the learning phase. This is essential since building a large labeled dataset is expensive, dependent on the task to be learned, and time-consuming. On the contrary, the acquisition of non-labeled data is cheaper and quicker regardless of the learning task. Using unlabeled data while maintaining high performance can be done in three different ways:
• Consistency regularization [1], [2], which encourages a model to produce consistent predictions when the input is perturbed;
• Entropy minimization [3]–[5], which encourages the model to output high-confidence predictions on unlabeled files; and
• Standard regularization, using weight decay [6], [7], mixup [8] or adversarial examples [9].

The most direct approach for SSL is pseudo-labeling [5], but since then, many new and better approaches have come out, such as Mean Teacher [10], Deep Co-Training [11], MixMatch [12] and FixMatch [13]. In this work, we adapted these four SSL methods to the task of audio tagging. One difficulty lies in choosing which audio data augmentation techniques to use, that work for different types of sound events and spoken words. We compare the methods' accuracy on three audio datasets with different scopes and sizes:
• the Environmental Sound Classification 10 (ESC-10) Dataset [14], with audio event categories such as human noise not related to speech and natural ambient noise;
• UrbanSound8k (UBS8K) [15], more specific to urban noises; and
• the Google Speech Commands (GSC)
Dataset v2 [16], containing spoken words exclusively.

In the MixMatch algorithm, a successful data augmentation technique called mixup [8] is used. It consists of mixing pairs of samples, both the data samples and the labels, with a random coefficient. We propose to add mixup to the three other SSL approaches, namely MT, DCT and FM, which do not already use it. The results reported in this article will highlight the positive impact of mixup in almost all our experiments.

This paper's contributions are two-fold: i) the application and comparison of recent SSL methods for audio tagging on three different datasets, and ii) the modification of these methods to integrate mixup, which resulted in systematic accuracy gains. We shall see that in most cases, MixMatch outperformed the other methods, closely followed by FixMatch+mixup.

The structure of the paper is as follows. Section II describes the augmentations we used and the mixup mechanism at the core of the present work. Section III describes the four SSL methods, Section IV presents the experimental settings, and finally, Section V presents and discusses the results.

II. AUDIO DATA AUGMENTATION
Augmentations are at the heart of semi-supervised learning mechanisms, which rely on them to take advantage of the non-labeled examples they are provided with. In this section, we begin by describing the mixup mechanism, which we explore in this work, and then the different audio data augmentations used in some of the SSL approaches.
A. mixup
Mixup [8] is a successful data augmentation/regularization technique that proposes to mix pairs of samples (images, audio clips, etc.). If x_1 and x_2 are two different input samples (spectrograms in our case) and y_1, y_2 their respective one-hot encoded labels, then the mixed sample and target are obtained by a simple convex combination:

x_mix = λ · x_1 + (1 − λ) · x_2
y_mix = λ · y_1 + (1 − λ) · y_2

where λ is a scalar sampled from a symmetric Beta distribution at each mini-batch generation: λ ∼ Beta(α, α), where α is a real-valued hyper-parameter to tune (always smaller than 1.0 in our case).

In the original MM algorithm, an "asymmetric" version of mixup is used, in which the maximum value between λ and 1 − λ is retained:

λ = max(λ, 1 − λ)   (1)

This allows the resulting mixed batch to be closer to one of the two original batches (the one weighted by the λ coefficient). This is useful when the method mixes labeled and unlabeled samples.
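As an illustration, here is a minimal PyTorch sketch of the asymmetric mixup operation described above; the function name, tensor shapes and the default α = 0.4 (one of the values later listed in Table III) are our own assumptions, not the authors' code.

```python
# Illustrative sketch of (asymmetric) mixup on a batch of log-Mel spectrograms.
import torch
import numpy as np

def asymmetric_mixup(x, y, alpha=0.4):
    """x: (batch, n_mels, time) spectrograms, y: (batch, n_classes) one-hot targets."""
    lam = np.random.beta(alpha, alpha)   # symmetric Beta(alpha, alpha) draw
    lam = max(lam, 1.0 - lam)            # asymmetric variant: keep lambda >= 0.5 (Eq. 1)
    perm = torch.randperm(x.size(0))     # random pairing of samples within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```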
B. Audio signal augmentation methods

For supervised learning, MM and FM, we use three different augmentations: Occlusion, StretchPadCrop, and CutOutSpec. During training, one of these augmentations is randomly applied to each sample.

• Occlusion: applied to the raw audio signal, Occlusion consists of setting a segment of the file to zero. The size of the segment is randomly chosen up to a user-defined maximum size. The position of the segment is also chosen randomly.
• StretchPadCrop: also applied to the raw audio signal, StretchPadCrop consists of up-sampling or down-sampling the signal. The rate with which the signal is modified is chosen randomly within a predefined interval. The resulting augmented file is either shorter or longer. It is then necessary to apply zero-padding or cropping to keep the shape of the samples constant.
• CutOutSpec: applied to the log-Mel spectrogram, CutOutSpec sets the values within a random rectangular area to the −80 dB value, which corresponds to the silence energy level in our spectrograms. The length and width of the removed sections are randomly chosen from a predefined interval and depend on the spectrogram size.

To select and tune these augmentations, we trained on GSC several Wide ResNet 28-2 models [17], whose architecture is described in detail in Section IV-B. During training, a different augmentation was applied to the input data with a 100% chance. We ran a grid search to tune the parameter values for each augmentation. In addition to the three augmentations described above, we also tried to apply uniform noise on the log-Mel spectrograms, invert the Mel-band axis or the time axis, and apply a frequency and time dropout. Table I details the different augmentations we tested.

TABLE I
LIST OF AUGMENTATION PARAMETERS USED IN AUGMENTED SUPERVISED, MIXMATCH AND FIXMATCH TRAINING.

Name            Parameters   Weak range     Strong range
Occlusion       max size     [0.25, 0.25]   [0.75, 0.75]
StretchPadCrop  rate         [0.50, 1.50]   [0.25, 1.75]
CutOutSpec      scale        [0.10, 0.50]   [0.50, 1.00]

Fig. 1. Example of different augmentations on an audio file from UrbanSound8k, from top to bottom: original, weak augmentation, strong augmentation, and mixup augmentation. The weak and strong augmentations correspond to StretchPadCrop with a factor randomly taken from the interval [0.5, 1.5] and [0.25, 1.75], respectively.

FM makes use of so-called "weak" and "strong" augmentations. The difference between the two lies in the strength and frequency with which an augmentation is applied. The specific parameters of each augmentation are defined in Table I. Figure 1 shows examples of a weak and a strong StretchPadCrop augmentation, as well as two spectrograms mixed using mixup as done in our experiments.
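As a concrete illustration of the waveform-level augmentations listed above, the sketch below implements the Occlusion transform in PyTorch; the function name and the default maximum occluded fraction (taken from the "weak" column of Table I) are our own choices, not code from the paper.

```python
# Illustrative sketch of the Occlusion augmentation: a random segment of the
# raw waveform is set to zero.
import torch

def occlusion(waveform: torch.Tensor, max_size: float = 0.25) -> torch.Tensor:
    """waveform: 1-D tensor of audio samples; max_size: maximum occluded fraction."""
    n = waveform.numel()
    seg_len = int(torch.randint(0, int(max_size * n) + 1, (1,)))  # random segment length
    start = int(torch.randint(0, n - seg_len + 1, (1,)))          # random segment position
    out = waveform.clone()
    out[start:start + seg_len] = 0.0
    return out
```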
III. SEMI-SUPERVISED DEEP LEARNING ALGORITHMS
This section provides a detailed description of each of the methods we decided to experiment with. We chose them for their performance and novelty. Two of these approaches, Mean Teacher (MT) [10] and Deep Co-Training (DCT) [11], use the principle of consistency regularization between the outputs of two models. The other methods, MixMatch (MM) and FixMatch (FM), use only one model and combine the three SSL mechanisms described in the introduction.
Fig. 2. MT workflow. Both models receive as input labeled files x_s and unlabeled files x_u. A supervised loss L_s is computed between the ground truth and the student model predictions, whereas a consistency cost L_cc is computed between the student and teacher model predictions.

We considered a fifth method, ReMixMatch, that was proposed by the same authors to enhance MM, before introducing FM. ReMixMatch uses an additional self-supervised loss term, related to predicting a rotation angle applied to an input image. For audio, we replaced the rotations with horizontal and vertical flips on the spectrograms, but we did not obtain better results compared to MixMatch, so we do not report results with this method.

We provide a figure to illustrate each of the four methods. In Section III-E, we explain how we add mixup to each method, except MM, which already uses mixup in its original version. We included the mixup operation in a green-colored box in the method workflow figures, to show where mixup is optionally integrated. We will refer to the modified methods as "method+mixup", for instance, FM+mixup.
A. Mean-Teacher (MT)
We can find audio applications of MT [10] in the Detection and Classification of Acoustic Scenes and Events (DCASE) task 4 challenges, namely the weakly supervised Sound Event Detection task.

MT uses two neural networks: a "student" f and a "teacher" g, which share the same architecture. The weights ω of the student model are updated using the standard gradient descent algorithm, whereas the weights W of the teacher model are the Exponential Moving Average (EMA) of the student weights. The teacher weights are computed at every mini-batch iteration t, as the convex combination of its weights at t − 1 and the student weights, with a smoothing constant α_ema:

W_t = α_ema · W_{t−1} + (1 − α_ema) · ω_t   (2)

There are two loss functions, applied either on the labeled or on the unlabeled data subset. On the labeled data x_s, the usual cross-entropy (CE) is used between the student model's predictions and the ground truth y_s:

L_sup = CE(f(x_s), y_s)   (3)

The consistency cost is computed from the student prediction f(x_u) and the teacher prediction g(x′_u), where x_u is a sample from the unlabeled subset, and x′_u the same sample but slightly perturbed with Gaussian noise at a 15 dB signal-to-noise ratio. In our case, this cost is a Mean Square Error (MSE) loss:

L_cc = MSE(f(x_u), g(x′_u))   (4)

The final loss function is the sum of the supervised loss function and the consistency cost weighted by a factor λ_cc, which controls its influence:

L_total = L_sup + λ_cc · L_cc   (5)
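For illustration, a minimal PyTorch sketch of the MT update and losses of Eqs. (2)–(5) follows, assuming `student` and `teacher` are two instances of the same architecture; function and variable names are ours, not the authors' code.

```python
# Illustrative sketch of the Mean Teacher EMA update and loss computation.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha_ema=0.999):
    # W_t = alpha_ema * W_{t-1} + (1 - alpha_ema) * w_t   (Eq. 2)
    for w_t, w_s in zip(teacher.parameters(), student.parameters()):
        w_t.mul_(alpha_ema).add_(w_s, alpha=1.0 - alpha_ema)

def mean_teacher_loss(student, teacher, x_s, y_s, x_u, x_u_noisy, lambda_cc=1.0):
    # y_s: integer class indices; x_u_noisy: x_u perturbed with Gaussian noise
    l_sup = F.cross_entropy(student(x_s), y_s)                     # Eq. (3)
    with torch.no_grad():
        teacher_pred = teacher(x_u_noisy).softmax(dim=1)
    l_cc = F.mse_loss(student(x_u).softmax(dim=1), teacher_pred)   # Eq. (4)
    return l_sup + lambda_cc * l_cc                                # Eq. (5)
```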
B. Deep Co-Training (DCT)

DCT has been recently proposed by Qiao and colleagues [11]. It is based on Co-Training (CT), the well-known generic framework for SSL proposed by Blum and colleagues in 1998 [18]. The main idea of Co-Training is based on the assumption that two independent views of a training dataset are available to train two models separately. Ideally, the two views are conditionally independent given the class. The two models are then used to make predictions on the non-labeled data subset. The most confident predictions are selected and added to the labeled subset. This process is iterative, like pseudo-labeling.

DCT is an adaptation of CT in the context of deep learning. Instead of relying on views of the data that are different, DCT makes use of adversarial examples to ensure the independence of the "views" presented to the models. The second difference is that the whole non-labeled dataset is used during training.
Fig. 3. DCT workflow. Each model is trained on its own labeled samples x_i, unlabeled samples x_u and the adversarial examples generated by the other model. Model f makes predictions on x and x_g, and model g on x and x_f. In our DCT+mixup variant, mixup is used on the unlabeled samples only.

Each batch is composed of a supervised and an unsupervised part. Thus, the non-labeled data are directly used, and the iterative aspect of the algorithm is removed. Let S and U be the subsets of labeled and unlabeled data, respectively, and let f and g be the two neural networks that are expected to collaborate.

The DCT loss function is comprised of three terms, as shown in Eq. 6. These terms correspond to loss functions estimated either on S, U, or both. Note that during training, a mini-batch is comprised of labeled and unlabeled samples in a fixed proportion. Furthermore, in a given mini-batch, the labeled examples given to each of the two models are sampled differently.

L = L_sup + λ_cot · L_cot + λ_diff · L_diff   (6)

The first term, L_sup, given in Eq. 7, corresponds to the standard supervised classification loss function for the two models f and g, estimated on examples x_1 and x_2 sampled from S. In our case, we use categorical Cross-Entropy (CE), the standard loss function used in classification tasks with mutually exclusive classes.

L_sup = CE(f(x_1), y_1) + CE(g(x_2), y_2)   (7)

In SSL and Co-Training, the two classifiers are expected to provide consistent and similar predictions on both the labeled and unlabeled data. To encourage this behavior, the Jensen-Shannon (JS) divergence between the two sets of predictions is minimized on examples x_u sampled from the unlabeled subset U only. Indeed, there is no need to minimize this divergence also on S, since L_sup already encourages the two models to have similar predictions on S. Eq. 8 gives the JS analytical expression, with H denoting entropy.

L_cot = H((f(x_u) + g(x_u)) / 2) − (H(f(x_u)) + H(g(x_u))) / 2   (8)

For DCT to work, the two models need to be complementary: on a subset different from S ∪ U, examples misclassified by one model should be correctly classified by the other model [19]. This can be achieved in deep learning by generating adversarial examples with one model and training the other model to be resistant to these adversarial samples. To do so, the L_diff term (Eq. 9) sums the Cross-Entropy loss between the predictions f(x) and g(x_f), where x is sampled from S ∪ U and x_f is the adversarial example generated with model f and x taken as input, and the symmetric term for model g.

L_diff = CE(f(x), g(x_f)) + CE(g(x), f(x_g))   (9)

For the adversarial example generation, we use the Fast Gradient Sign Method (FGSM, [20]), as in Qiao's work. For more in-depth details on the technical aspects of DCT, the reader may refer to [11]. We implemented DCT as precisely as described in Qiao's article, using PyTorch, and made sure to accurately reproduce their results on CIFAR-10: about 90% accuracy when using only 10% of the training data as labeled data (5000 images).

Fig. 4. MM workflow. K augmentations are applied to the unlabeled data x_u, and the averaged model predictions are used as pseudo-labels ŷ_u. The labeled and augmented unlabeled data are mixed up and used to compute the supervised and unsupervised loss values.
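For illustration, the agreement term L_cot of Eq. (8) can be computed as sketched below; p_f and p_g are assumed to be the softmax outputs of models f and g on the same unlabeled batch, and the helper names are ours.

```python
# Illustrative sketch of the Jensen-Shannon agreement term L_cot (Eq. 8).
import torch

def entropy(p: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Mean Shannon entropy of a batch of probability distributions.
    return -(p * (p + eps).log()).sum(dim=1).mean()

def l_cot(p_f: torch.Tensor, p_g: torch.Tensor) -> torch.Tensor:
    m = 0.5 * (p_f + p_g)                                   # mixture distribution
    return entropy(m) - 0.5 * (entropy(p_f) + entropy(p_g))
```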
C. MixMatch

MixMatch [12] (MM) is an SSL approach that uses entropy minimization and standard regularization, namely pseudo-labeling [5], mixup, and weak data augmentation, to leverage the unlabeled data and provide better generalization capabilities. Unlike MT and DCT, this approach uses only one model. The different steps are shown in Fig. 4 and detailed in the following paragraphs.

During the learning phase, each mini-batch is composed of labeled x_s and non-labeled x_u samples in equivalent proportions. The first step consists of applying an augmentation to the labeled part of the mini-batch and k augmentations to the non-labeled part. In the second step, pseudo-labels ŷ_u are generated for the non-labeled files using the model's prediction averaged over these k variants, as shown in Eq. 10, where x′_{u,i} denotes the i-th augmented variant of an unlabeled file.

ŷ_u = (1/k) Σ_{i=1}^{k} f(x′_{u,i})   (10)

To encourage the model to produce confident predictions, a post-processing step is necessary to decrease the output's entropy. To do so, the highest probability is increased and the other ones decreased. This process is called "sharpening" by the method's authors, and it is defined as:

sharpen(p, T)_i := p_i^{1/T} / Σ_{j=1}^{|p|} p_j^{1/T}   (11)

The sharpen function is applied to the pseudo-labels p = ŷ_u. The parameter T, called temperature, controls the strength of the sharpen function. When T tends towards zero, the entropy of the produced distribution is lowered.

Finally, the labeled and unlabeled augmented samples are concatenated and shuffled into a set W, then used as a pool of training samples by the asymmetric mixup function. Asymmetric mixup is applied separately on the labeled and unlabeled parts of the mini-batch, as formulated here:

x′_mix_s = mixup(x_s, W_{1..|x_s|})   (12)
x′_mix_u = mixup(x_u, W_{|x_s|..|W|})   (13)

The W set and the corresponding labels are shuffled in the same order. Each labeled sample is then perturbed by a second labeled or unlabeled sample. Mixing the two is done so that the original labeled sample remains the main component of the resulting sample. The operation has been detailed in Section II-A. The same procedure is applied to the unlabeled files using the remaining samples from W.

The original MixMatch loss function is composed of the standard cross-entropy (CE) for the supervised loss L_s, and a squared l2 norm for the unsupervised loss L_u. We replace the l2 norm with a cross-entropy in all our experiments, as proposed in the ReMixMatch paper; indeed, CE performed better than the l2 norm in our experiments.

L_s = (1/B_s) Σ_{(x′_mix_s, y_mix_s)} CE(f(x′_mix_s), y_mix_s)   (14)

L_u = (1/(k·B_u)) Σ_{(x′_mix_u, ŷ_mix_u)} CE(f(x′_mix_u), ŷ_mix_u)   (15)

The final loss is the sum of the two components, weighted by a hyper-parameter λ_u:

L = L_s + λ_u · L_u   (16)
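A minimal sketch of the pseudo-labelling and sharpening steps of Eqs. (10)–(11) follows; `model` and `augment` are assumed helpers, and the defaults k = 2 and T = 0.5 follow the values given later in Section IV-C.

```python
# Illustrative sketch of MixMatch label guessing (Eq. 10) and sharpening (Eq. 11).
import torch

def sharpen(p: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    # Raise probabilities to 1/T and renormalize: lowers the entropy when T < 1.
    p = p ** (1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

@torch.no_grad()
def guess_labels(model, x_u, augment, k: int = 2, T: float = 0.5) -> torch.Tensor:
    preds = [model(augment(x_u)).softmax(dim=1) for _ in range(k)]
    y_hat = torch.stack(preds).mean(dim=0)   # average over the k augmented variants
    return sharpen(y_hat, T)                 # sharpened pseudo-labels
```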
D. FixMatch

FixMatch [13] (FM) is another SSL method, which proposes a simplification of MM and ReMixMatch. The method also uses one model, removes mixup, and replaces the sharpen function by binary pseudo-labels. FM uses both weak augmentations (weak) and strong augmentations (strong). The strong augmentations can mislead the model predictions by disrupting the training data too much. Figure 5 shows the main pipeline of FixMatch. As in the other method illustrations, we added a mixup box in blue, to indicate where we add it to the algorithm in our modified FM algorithm, thus called FM+mixup.

Fig. 5. FixMatch workflow. A weakly augmented version of x_u is used to compute a pseudo-label ŷ_u and a mask. The strongly augmented variant is used to compute the unlabeled loss term. The mixup component is used on a concatenated set of labeled and unlabeled samples (FixMatch+mixup).

The supervised loss component is the standard cross-entropy applied to the weakly augmented data:

L_s = CE(f(weak(x_s)), y_s)   (17)

Then, we guess the labels of the weakly augmented unlabeled data and apply a binarization (argmax) of these predictions to obtain a "one-hot" encoded label. This label is used as the target for training the model on the strongly augmented unlabeled data. It allows the model to generalize with weak and strong augmentations, and it also uses the guessed label to improve the model accuracy with unlabeled data:

ŷ_u = f(weak(x_u))   (18)

To avoid training on incorrectly guessed labels, FM uses a threshold τ that ensures that the unsupervised cost function is only applied to predictions made with high confidence, i.e., above this threshold. This can be easily implemented in the form of a mask:

mask = (max(ŷ_u) > τ)   (19)
L_u = mask · CE(f(strong(x_u)), argmax(ŷ_u))

As in MixMatch, we sum the loss components to compute the final loss:

L = L_s + λ_u · L_u   (20)
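A minimal sketch of the FixMatch unsupervised term of Eqs. (18)–(19) follows; `weak` and `strong` are assumed augmentation callables, and the default threshold value used here is only illustrative.

```python
# Illustrative sketch of the FixMatch unlabeled loss with confidence masking.
import torch
import torch.nn.functional as F

def fixmatch_unsup_loss(model, x_u, weak, strong, tau: float = 0.8):
    with torch.no_grad():
        y_hat = model(weak(x_u)).softmax(dim=1)   # guessed distribution (Eq. 18)
    conf, target = y_hat.max(dim=1)               # confidence and argmax pseudo-label
    mask = (conf > tau).float()                   # keep only confident samples (Eq. 19)
    loss = F.cross_entropy(model(strong(x_u)), target, reduction="none")
    return (mask * loss).mean()
```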
E. Adding mixup to MT, DCT and FM

As described above, MM already uses mixup in its workflow. In order to measure the impact of mixup in MM, we will also report results when mixup is removed from MM. On the contrary, the three other SSL methods explored in our work (MT, DCT, FM) do not use mixup in their original version. How should we add mixup to them, then? We explored several ways to do so, and retained the best one for each of the three methods. Note that we illustrate where the mixup operation has been added in the figures describing the different methods in the previous section.
Since the labeled and unlabeled data flow is very similar in MM and FM, we added mixup to FM at the same place as in MM: both labeled and unlabeled samples are mixed up. Similarly, it is also the asymmetric mixup variant that we use in MM and FM, since mixup is applied to labeled and unlabeled samples together, as in the original MM method. Using mixup on labeled and unlabeled examples separately seems to hurt performance with these two methods.

In MT, mixup is applied on labeled and unlabeled samples separately, and only for the teacher model. The perturbation with Gaussian noise applied to the unlabeled samples is removed, since no gain was observed when mixup is used instead.

For DCT, mixup is applied on the unlabeled samples only, which are common to both models in each mini-batch during training. Applying mixup on the labeled samples, which are sampled differently for the two models at each training step, led to worse results. It is then not necessary to use the asymmetric variant for MT and DCT.

Finally, in all cases, we apply mixup on the log-Mel spectrograms, which are the input features given to our deep neural networks (feature extraction is detailed in the experiments section).
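For illustration, restricting mixup to the unlabeled inputs, as in our DCT+mixup variant, can be sketched as follows; no targets are involved since the unlabeled samples carry none, and the function name is ours.

```python
# Illustrative sketch: mixup applied to the unlabeled spectrograms only.
import torch
import numpy as np

def mixup_inputs_only(x_u: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    lam = np.random.beta(alpha, alpha)       # symmetric draw, no asymmetric max needed here
    perm = torch.randperm(x_u.size(0))
    return lam * x_u + (1.0 - lam) * x_u[perm]
```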
IV. EXPERIMENTS
In this section, we describe our experimental setup. We give a brief description of the datasets and metrics, and describe the Wide ResNet architecture we used, together with the training strategy details.
A. Datasets and evaluation metrics
Environmental Sound Classification 10 (ESC-10) [14] is a selection of 400 five-second-long recordings of audio events separated into ten balanced categories. The dataset is provided with five uniformly sized cross-validation folds that are used to perform the evaluation. The files are sampled at 44 kHz and are converted into 64-band log-Mel spectrograms.

UrbanSound8k (UBS8K) [15] is a dataset composed of 8742 files between 1 and 4 seconds long, separated into ten balanced categories. The dataset is provided with ten cross-validation folds of uniform size that are used to perform the evaluation. The files are zero-padded to 4 seconds, resampled to 22 kHz, and converted to 64-band log-Mel spectrograms.

Google Speech Commands Dataset v2 (GSC) [16] is an audio dataset of spoken words designed to evaluate keyword spotting systems. The dataset is split into 85511 training files, 10102 validation files, and 4890 testing files. The latter are used for the evaluation of our systems. We ran the task of classifying the 35 word categories of this dataset. The files are zero-padded to 1 second if needed and sampled at 16 kHz before being converted into 64-band log-Mel spectrograms.

In all cases, the 64 Mel coefficients were extracted using a window size of 2048 samples and a hop length of 512 samples. For ESC-10 and UBS8K, we used the official cross-validation folds. We report the average classification Error Rate (ER) along with standard deviations.
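For illustration, the log-Mel front-end described above (64 Mel bands, 2048-sample windows, 512-sample hop) could be implemented as sketched below with torchaudio; the library choice and the −80 dB flooring are our assumptions, the paper does not specify which implementation was used.

```python
# Illustrative sketch of the log-Mel feature extraction.
import torch
import torchaudio

def log_mel(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """waveform: (channels, time) tensor at the dataset's sample rate."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=2048, hop_length=512, n_mels=64
    )(waveform)
    # Convert to dB with an 80 dB dynamic range (silence floored at -80 dB).
    return torchaudio.transforms.AmplitudeToDB(top_db=80.0)(mel)
```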
TABLE II
ARCHITECTURE OF WIDE RESNET 28-2. DOWNSAMPLING IS PERFORMED BY THE FIRST LAYERS IN BLOCK2 AND BLOCK3.

Layer    Architecture
input    Log-Mel spectrogram
conv1    BasicBlock(32), Max pool
block1   [ BasicBlock(32) ] ×
block2   [ BasicBlock(64) ] ×
block3   [ BasicBlock(128) ] ×

B. Models
We used the Wide ResNet 28-2 [17] architecture in all our experiments. This model is very efficient, achieving state-of-the-art performance on the three datasets when trained in a 100% supervised setting. Moreover, its small size, of about 1.4 million parameters, allows us to experiment quickly. Its structure consists of an initial convolutional layer (conv1) followed by three groups of residual blocks (block1, block2, and block3). Finally, an average pooling and a linear layer act as a classifier. The residual blocks, composed of two BasicBlock, are repeated three times, and their structure is defined in Eq. 21. The number of channels of the convolution layers is referred to as l, BN stands for Batch Normalization and ReLU [21] for the Rectified Linear Unit activation function. We used the official implementation available in PyTorch [22].

BasicBlock(l) = (conv 3×3 @ l, BN, ReLU)   (21)
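A minimal PyTorch sketch of the BasicBlock of Eq. (21) follows; it only illustrates this elementary block, not the full Wide ResNet 28-2 assembly used in the paper.

```python
# Illustrative sketch of BasicBlock(l): 3x3 convolution with l channels, BN, ReLU.
import torch.nn as nn

def basic_block(in_channels: int, l: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, l, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(l),
        nn.ReLU(inplace=True),
    )
```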
C. Training configurations

Each model was trained using the ADAM [23] optimizer. Table III shows the hyper-parameter values used for each method, such as the learning rate lr, the mini-batch size bs, the warm-up length wl if used, and the number of epochs e. These parameters are identical regardless of the dataset used, unless otherwise specified.

TABLE III
TRAINING PARAMETERS USED ON THE DATASETS. BS: BATCH SIZE, LR: LEARNING RATE, WL: WARM-UP LENGTH IN EPOCHS, E: NUMBER OF EPOCHS, α: MIXUP BETA PARAMETER.

Method       bs    lr      wl    e    α
Supervised   256   0.001   -     300  -
mixup        256   0.001   -     300  0.4
MT           64    0.001   50    200  -
MT+mixup     64    0.001   50    200  0.40
DCT          64    0.0005  160   300  -
DCT+mixup    64    0.0005  160   300  0.40
MM           256   0.001   -     300  0.75
MM-mixup     256   0.001   -     300  -
FM           256   0.001   -     300  -
FM+mixup     256   0.001   -     300  0.75

For supervised training, MM and FM, the learning rate remains constant throughout training. For MT and DCT, the learning rate is weighted by a descending cosine rule:

lr_t = 0.5 · (1 + cos((t − 1) π / e)) · lr   (22)

All the SSL approaches but FixMatch introduce one or more subsidiary terms to the loss. To alleviate their impact at the beginning of the training, these terms are weighted by a ratio λ, which ramps up to its maximum value within a warm-up length wl. The ramp-up strategy is defined in Eq. 23 for MT and DCT, and is linear in MM during the first 16k learning iterations.

λ = λ_max × e^{−5 (1 − t/wl)²}   (23)

In MT, the maximum value of λ_cc is 1 and α_ema is set to 0.999. In DCT, the maximum values of λ_cot and λ_diff are 1 and 0.5, respectively. In MM, the maximum value of λ_u is 1. In MM, we use two augmentations (k = 2) and the sharpening temperature T is set to 0.5. In FM, we use a threshold τ = 0. on the ESC-10 and GSC datasets, and τ = 0. for UBS8K.

For MM and FM, on ESC-10, the batch size is 60, because ESC-10 is a small dataset of 400 files only. During training, only four folds are used, that is, 320 files. In a 10% configuration, and due to whole-division constraints, this represents only 30 supervised files in total. Each mini-batch must contain as many labeled as unlabeled files, hence the batch size of 60. Moreover, because of this small number of files, the training phase only lasts 2700 iterations, and therefore the warm-up ends prematurely.

For our proposed variants, which include mixup, we kept the same configurations and parameter values.
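A small sketch of the warm-up weighting of Eq. (23), as we reconstruct it, follows; λ ramps up from a near-zero value to λ_max over wl epochs and is then kept constant.

```python
# Illustrative sketch of the exponential ramp-up of the loss weight (Eq. 23).
import math

def lambda_rampup(t: float, wl: float, lambda_max: float = 1.0) -> float:
    """t: current epoch, wl: warm-up length in epochs."""
    if t >= wl:
        return lambda_max
    return lambda_max * math.exp(-5.0 * (1.0 - t / wl) ** 2)
```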
V. RESULTS

We first report the results obtained in a supervised setting, with and without the same data augmentation methods used in the SSL algorithms, including mixup. We then compare the error rates obtained by the four SSL methods and show that adding mixup is beneficial in almost all cases.
A. Supervised learning
This section presents the results obtained with supervised learning in different settings, using either 10% or 100% of the labeled data available. MM and FM use augmentations as their core mechanism. Therefore, for fair comparisons, it seems essential to use the same augmentations in the supervised settings too. Indeed, FM uses weak and strong augmentations, while MM uses a combination of weak augmentations and mixup.

We trained models without any augmentation (Supervised), using mixup alone (mixup), weak augmentations alone (Weak), a combination of weak augmentations and mixup (Weak+mixup), strong augmentations alone (Strong), and finally a combination of strong augmentations with mixup (Strong+mixup). Table IV presents the results on ESC-10, UBS8K, and GSC.

TABLE IV
SUPERVISED LEARNING ERROR RATES (%) ON ESC-10, UBS8K AND GSC.

                   ESC-10                      UBS8K                        GSC
Labeled fraction   10%            100%         10%            100%          10%     100%
Supervised         32.00 ± 6.17   8.00 ± 5.06  33.80 ± 4.82   23.29 ± 5.80  10.01   4.94
+mixup             36.00 ± 5.22   8.33 ± 4.56  31.41 ± 5.56   22.04 ± 5.99   8.83   3.86
+Strong            23.00 ± 5.19   5.00 ± 2.64  25.58 ± 4.15   20.69 ± 4.92   7.60   3.27
+Strong+mixup      24.00 ± 8.71   5.00 ± 4.25  24.73 ± 4.42   18.52 ± 4.38   6.86   2.98
ESC-10.
In the 10% setting, the supervised model reached an ER of 32.00%. The use of Weak yielded the best performance with 22.67% ER, outperforming the supervised model by 9.3 points (29.16% relative). In the 100% setting, the supervised model reached an ER of 8.00%, and the best ER of 4.67% was achieved when using Weak+mixup. The gain is 3.33 points (41.62% relative).
UBS8K.
In the 10% setting, the supervised model reached 33.80% ER, and the best supervised result was obtained with Weak+mixup, with a 23.75% ER. It represents an improvement of 10.05 points (29.73% relative). In the 100% setting, the same augmentation combination reached an ER of 17.96%, outperforming the 23.29% ER of the supervised model by 5.33 points (22.88% relative).
GSC.
In the 10% setting, the supervised model reached 10.01% ER, and Weak+mixup yielded the best ER of 6.58%. It represents an improvement of 3.43 points (34.26% relative). In the 100% setting, Strong+mixup reached an ER of 2.98%, outperforming the 4.94% ER of the supervised model by 1.96 points (39.68% relative).

Overall, we observe that, in a supervised setting, while using a single augmentation improved the accuracy of our systems, the combination of mixup with a weak or a strong augmentation is systematically better than using one of those augmentations individually, except on the ESC-10 dataset.
B. Semi-supervised learning
In this section, we report in Table V the results of the SSL methods with and without the mixup augmentation. For MM, mixup is already used in the original method; thus, we compare MM to MM without mixup (MM-mixup). For the three other methods, we denote by FM+mixup, for instance, the FM algorithm augmented with mixup.
TABLE V
SEMI-SUPERVISED LEARNING RESULTS IN ERROR RATE (%) ON ESC-10, UBS8K AND GSC.

                   ESC-10                      UBS8K                        GSC
Labeled fraction   10%            100%         10%            100%          10%     100%
Supervised         32.00 ± 6.17   8.00 ± 5.06  33.80 ± 4.82   23.29 ± 5.80  10.01   4.94
Best supervised    22.67 ± 3.46   4.67 ± 1.39  23.75 ± 4.73   17.96 ± 3.64   6.58   3.00
MT                 28.28 ± 5.28   -            23.75 ± 4.73   -              8.51   -
MT+mixup           27.81 ± 2.25   -            32.00 ± 5.80   -              8.50   -
DCT                25.16 ± 4.42   -            27.85 ± 4.29   -              6.22   -
DCT+mixup          23.75 ± 2.36   -            25.77 ± 4.73   -              5.63   -
MM                 15.33 ± 5.58   -            18.02          -              3.25   -
MM-mixup           17.33 ± 3.84   -            20.42 ± 4.88   -              4.49   -
FM                 13.33          -            21.44 ± 4.16   -              4.44   -
FM+mixup           14.67 ± 7.21   -            18.27 ± 3.80   -              3.31   -

On all three datasets, the four SSL methods brought ER decreases compared to the 10% supervised learning setup when no augmentation is performed. Only MM and FM performed better than the best supervised training result, which used the weak augmentations. Furthermore, MM and FM significantly outperformed MT and DCT in all cases, showing that single-model SSL methods are more efficient than two-model-based methods, at least on these three datasets.

For ESC-10, in the 10% setting, the lowest ER was achieved by FM with a 13.33% ER, compared to a 22.67% ER for a weakly augmented supervised training. It represents a 9.34 points improvement (41.20% relative). The difference with a fully supervised training using weak augmentations, reaching a 4.67% ER, is still notable, with an 8.66 points difference.

On UBS8K, the best ER was achieved using MM with an 18.02% ER. The difference with the best supervised training, Weak+mixup, reaching a 23.75% ER, represents 5.73 points, a 24.13% relative improvement. The performance of MM is also very close to the best fully supervised training Weak+mixup, which reached an ER of 17.96%; the difference is then only 0.06 points. Similarly to ESC-10, while MT and DCT outperformed the supervised training, they did not score better than a supervised training using the Weak+mixup augmentation.

The GSC dataset results confirm the observations made on UBS8K. The MM method is again the best method with an ER of 3.25%, representing a gain of 6.76 points (67.53%) or 3.33 points (50.61%) compared to supervised training without and with Weak+mixup augmentations, respectively.
C. Impact of mixup
Given that the best SSL method so far was MM, and that mixup is used in MM and not in the three other methods, we decided to add mixup to MT, DCT and FM, in different ways for each method, as explained in Section III-E. In Table V, we report the results when adding mixup to MT, DCT and FM (MT+mixup, DCT+mixup, FM+mixup). We also give the ER when removing mixup from MM, in the row named MM-mixup.

As a first comment, MM-mixup is always worse than MM. For instance, on UBS8K, the ER increased from 18.02% to 20.42%. With the other SSL methods, adding mixup brought performance improvements on all the datasets tested. The only counter-example observed is FM on ESC-10, which went from an ER of 13.33% to an ER of 14.67%. On UBS8K, FM went from 21.44% ER without mixup to 18.27% with mixup. On GSC, MM once again presented the most significant improvement, with 4.49% and 3.25% ERs without and with mixup.

It is also important to note that using mixup allowed reaching ER values very close to the ones obtained with a fully supervised (100% setting) training using augmentations, on UBS8K and GSC. This is observable with MM and FM+mixup, which, compared to a Weak+mixup 100% supervised training, show only 0.06 and 0.31 points difference on UBS8K, and 0.25 and 0.31 points difference on GSC.

When we look at our supervised training performance, we can observe that an improvement in performance does not systematically follow from the use of weak or strong augmentation. However, when combined with mixup, the ER is frequently improved. This can be partly explained by the fact that audio augmentations are often difficult to choose and that their impact is often dependent on the dataset and the task at hand [24]. With this in mind, mixup seems to be beneficial regardless of the dataset used.
VI. CONCLUSION
In this article, we reported monophonic audio tagging experiments in a semi-supervised setting on three standard datasets of different sizes and content: the very small-sized ESC-10 with generic audio events, urban noises with UrbanSound8K, and speech with Google Speech Commands, using only 10% of the labeled data samples and the remaining 90% as unlabeled samples. We adapted and compared four existing SSL algorithms for this task: two methods that use two neural networks in parallel, Mean Teacher and Deep Co-Training, and the two single-model methods MixMatch and FixMatch, which rely in particular on data augmentation.

All four methods brought significant gains compared to a supervised training setting using 10% of labeled data. They performed better than supervised learning without augmentation, and MixMatch and FixMatch were even better than supervised learning with augmentation. On ESC-10, FixMatch reached the best Error Rate of 13.33%. The relative gains were
REFERENCES

[1] M. Sajjadi, M. Javanmardi, and T. Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29. Curran Associates, Inc., 2016, pp. 1163–1171.
[2] S. Laine and T. Aila, "Temporal ensembling for semi-supervised learning," OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=BJ6oOfqge
[3] T. Miyato, S. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," 2018.
[4] Y. Grandvalet and Y. Bengio, "Semi-supervised learning by entropy minimization," in Advances in Neural Information Processing Systems, L. Saul, Y. Weiss, and L. Bottou, Eds., vol. 17. MIT Press, 2005, pp. 529–536.
[5] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," ICML 2013 Workshop: Challenges in Representation Learning (WREPL), 07 2013.
[6] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2019.
[7] G. Zhang, C. Wang, B. Xu, and R. Grosse, "Three mechanisms of weight decay regularization," 2018.
[8] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," 2018.
[9] R. R. Wiyatno, A. Xu, O. Dia, and A. de Berker, "Adversarial examples in modern machine learning: A review," 2019.
[10] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," 2018.
[11] S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille, "Deep co-training for semi-supervised image recognition," in proc. ECCV, Munich, 2018, pp. 135–152.
[12] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel, "MixMatch: A holistic approach to semi-supervised learning," in proc. NeurIPS, Vancouver, 2019, pp. 5049–5059.
[13] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," 2020.
[14] K. J. Piczak, "ESC: Dataset for environmental sound classification," in proc. ACM Multimedia, Brisbane, 2015, pp. 1015–1018. [Online]. Available: https://doi.org/10.1145/2733373.2806390
[15] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in proc. ACM Multimedia, 2014, pp. 1041–1044. [Online]. Available: https://doi.org/10.1145/2647868.2655045
[16] P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," 2018, arXiv:1804.03209.
[17] S. Zagoruyko and N. Komodakis, "Wide residual networks," 2017.
[18] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in proc. COLT, Madison, 1998, pp. 92–100.
[19] M.-A. Krogel and T. Scheffer, "Multi-relational learning, text mining, and semi-supervised learning for functional genomics," in Machine Learning, 2004, pp. 61–81.
[20] I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in proc. ICLR, San Diego, 2015.
[21] A. F. Agarap, "Deep learning using rectified linear units (ReLU)," 2019.
[22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in proc. NeurIPS, 2019, pp. 8026–8037.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.
[24] K. Lu, C.-S. Foo, K. K. Teh, H. D. Tran, and V. R. Chandrasekhar, "Semi-supervised audio classification with consistency-based regularization," in proc. INTERSPEECH, Graz, 2019, pp. 3654–3658.
[25] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in proc. ICASSP, 2017.