Improving Deep-learning-based Semi-supervised Audio Tagging with Mixup
Léo Cances, Etienne Labbé, Thomas Pellegrini
IRIT, Université Paul Sabatier, CNRS, Toulouse, France
Abstract—Recently, semi-supervised learning (SSL) methods, in the framework of deep learning (DL), have been shown to provide state-of-the-art results on image datasets by exploiting unlabeled data. Most often tested on object recognition tasks in images, these algorithms are rarely compared when applied to audio tasks. In this article, we adapted four recent SSL methods to the task of audio tagging. The first two methods, namely Deep Co-Training (DCT) and Mean Teacher (MT), involve two collaborative neural networks. The two other algorithms, called MixMatch (MM) and FixMatch (FM), are single-model methods that rely primarily on data augmentation strategies. Using the Wide ResNet 28-2 architecture in all our experiments, 10% of labeled data and the remaining 90% as unlabeled data, we first compare the four methods' accuracy on three standard benchmark audio event datasets: Environmental Sound Classification (ESC-10), UrbanSound8K (UBS8K), and Google Speech Commands (GSC). MM and FM outperformed MT and DCT significantly, MM being the best method in most experiments. On UBS8K and GSC, in particular, MM achieved 18.02% and 3.25% error rates (ER), outperforming models trained with 100% of the available labeled data, which reached 23.29% and 4.94% ER, respectively. Second, we explored the benefits of using the mixup augmentation in the four algorithms. In almost all cases, mixup brought significant gains. For instance, on GSC, FM reached 4.44% and 3.31% ER without and with mixup.
Index Terms—Audio tagging, semi-supervised deep learning.
I. INTRODUCTION

SEMI-SUPERVISED learning (SSL) aims to reduce the dependency of deep learning systems on labeled data by integrating non-labeled data during the learning phase. This is essential since building a large labeled dataset is expensive, dependent on the task to be learned, and time-consuming. On the contrary, the acquisition of non-labeled data is cheaper and quicker regardless of the learning task. Using unlabeled data while maintaining high performance can be done in three different ways:
• Consistency regularization [1], [2], which encourages a model to produce consistent predictions when the input is perturbed;
• Entropy minimization [3]–[5], which encourages the model to output high-confidence predictions on unlabeled files; and
• Standard regularization, using weight decay [6], [7], mixup [8] or adversarial examples [9].

The most direct approach for SSL is pseudo-labeling [5], but since then, many new and better approaches have come out, such as Mean Teacher [10], Deep Co-Training [11], MixMatch [12] and FixMatch [13]. In this work, we adapted these four SSL methods to the task of audio tagging. One difficulty lies in choosing which audio data augmentation techniques to use, that work for different types of sound events and spoken words. We compare the methods' accuracy on three audio datasets with different scopes and sizes:
• the Environmental Sound Classification 10 (ESC-10) Dataset [14], with audio event categories such as human noise not related to speech and natural ambient noise;
• UrbanSound8k (UBS8K) [15], more specific to urban noises; and
• the Google Speech Commands (GSC)
Dataset v2 [16], containing spoken words exclusively.

In the MixMatch algorithm, a successful data augmentation technique called mixup [8] is used. It consists of mixing pairs of samples, both the data samples and the labels, with a random coefficient. We propose to add mixup to the three other SSL approaches, namely MT, DCT and FM, which do not already use it. The results reported in this article will highlight the positive impact of mixup in almost all our experiments.

This paper's contributions are two-fold: i) the application and comparison of recent SSL methods for audio tagging on three different datasets, and ii) the modification of these methods to integrate mixup, which resulted in systematic accuracy gains. We shall see that in most cases, MixMatch outperformed the other methods, closely followed by FixMatch+mixup.

The structure of the paper is as follows. Section II describes the augmentations we used and the mixup mechanism at the core of the present work. Section III describes the four SSL methods, Section IV presents the experimental settings, and finally, Section V presents and discusses the results.

II. AUDIO DATA AUGMENTATION
Augmentations are at the heart of semi-supervised learning mechanisms, which rely on them to take advantage of the non-labeled examples they are provided with. In this section, we begin by describing the mixup mechanism, which we explore in this work, and then the different audio data augmentations used in some of the SSL approaches.
A. mixup
Mixup [8] is a successful data augmentation/regularization technique that proposes to mix pairs of samples (images, audio clips, etc.). If x_1 and x_2 are two different input samples (spectrograms in our case) and y_1, y_2 their respective one-hot encoded labels, then the mixed sample and target are obtained by a simple convex combination:

x_mix = λ · x_1 + (1 − λ) · x_2
y_mix = λ · y_1 + (1 − λ) · y_2

where λ is a scalar sampled from a symmetric Beta distribution at each mini-batch generation: λ ∼ Beta(α, α), where α is a real-valued hyper-parameter to tune (always smaller than 1.0 in our case).

In the original MM algorithm, an "asymmetric" version of mixup is used, in which the maximum value between λ and 1 − λ is retained:

λ = max(λ, 1 − λ)   (1)

This allows the resulting mixed batch to be closer to one of the two original batches (the one weighted by the λ coefficient). This is useful when the method mixes labeled and unlabeled samples.
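As an illustration, here is a minimal PyTorch sketch of the asymmetric mixup operation described above; the function name, tensor shapes and the default α = 0.4 (one of the values later listed in Table III) are our own assumptions, not the authors' code.

```python
# Illustrative sketch of (asymmetric) mixup on a batch of log-Mel spectrograms.
import torch
import numpy as np

def asymmetric_mixup(x, y, alpha=0.4):
    """x: (batch, n_mels, time) spectrograms, y: (batch, n_classes) one-hot targets."""
    lam = np.random.beta(alpha, alpha)   # symmetric Beta(alpha, alpha) draw
    lam = max(lam, 1.0 - lam)            # asymmetric variant: keep lambda >= 0.5 (Eq. 1)
    perm = torch.randperm(x.size(0))     # random pairing of samples within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```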
B. Audio signal augmentation methods

For supervised learning, MM and FM, we use three different augmentations: Occlusion, StretchPadCrop, and CutOutSpec. During training, one of these augmentations is randomly applied to each sample.

• Occlusion: applied to the raw audio signal, Occlusion consists of setting a segment of the file to zero. The size of the segment is randomly chosen up to a user-defined maximum size. The position of the segment is also chosen randomly.
• StretchPadCrop: also applied to the raw audio signal, StretchPadCrop consists of up-sampling or down-sampling the signal. The rate with which the signal is modified is chosen randomly within a predefined interval. The resulting augmented file is either shorter or longer. It is then necessary to apply zero-padding or cropping to keep the shape of the samples constant.
• CutOutSpec: applied to the log-Mel spectrogram, CutOutSpec sets the values within a random rectangular area to the −80 dB value, which corresponds to the silence energy level in our spectrograms. The length and width of the removed sections are randomly chosen from a predefined interval and depend on the spectrogram size.

To select and tune these augmentations, we trained on GSC several Wide ResNet 28-2 models [17], whose architecture is described in detail in Section IV-B. During training, a different augmentation was applied to the input data with a 100% chance. We ran a grid search to tune the parameter values for each augmentation. In addition to the three augmentations described above, we also tried to apply uniform noise on the log-Mel spectrograms, invert the Mel-band axis or the time axis, and apply a frequency and time dropout. Table I details the different augmentations we tested.

TABLE I
LIST OF AUGMENTATION PARAMETERS USED IN AUGMENTED SUPERVISED, MIXMATCH AND FIXMATCH TRAINING.

Name            Parameters   Weak range     Strong range
Occlusion       max size     [0.25, 0.25]   [0.75, 0.75]
StretchPadCrop  rate         [0.50, 1.50]   [0.25, 1.75]
CutOutSpec      scale        [0.10, 0.50]   [0.50, 1.00]

Fig. 1. Example of different augmentations on an audio file from UrbanSound8k, from top to bottom: original, weak augmentation, strong augmentation, and mixup augmentation. The weak and strong augmentations correspond to StretchPadCrop with a factor randomly taken from the interval [0.5, 1.5] and [0.25, 1.75], respectively.

FM makes use of so-called "weak" and "strong" augmentations. The difference between the two lies in the strength and frequency with which an augmentation is applied. The specific parameters of each augmentation are defined in Table I. Figure 1 shows examples of a weak and a strong StretchPadCrop augmentation, as well as two spectrograms mixed using mixup as done in our experiments.
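As a concrete illustration of the waveform-level augmentations listed above, the sketch below implements the Occlusion transform in PyTorch; the function name and the default maximum occluded fraction (taken from the "weak" column of Table I) are our own choices, not code from the paper.

```python
# Illustrative sketch of the Occlusion augmentation: a random segment of the
# raw waveform is set to zero.
import torch

def occlusion(waveform: torch.Tensor, max_size: float = 0.25) -> torch.Tensor:
    """waveform: 1-D tensor of audio samples; max_size: maximum occluded fraction."""
    n = waveform.numel()
    seg_len = int(torch.randint(0, int(max_size * n) + 1, (1,)))  # random segment length
    start = int(torch.randint(0, n - seg_len + 1, (1,)))          # random segment position
    out = waveform.clone()
    out[start:start + seg_len] = 0.0
    return out
```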
III. SEMI-SUPERVISED DEEP LEARNING ALGORITHMS
This section provides a detailed description of each of the methods we decided to experiment with. We chose them for their performance and novelty. Two of these approaches, Mean Teacher (MT) [10] and Deep Co-Training (DCT) [11], use the principle of consistency regularization between the outputs of two models. The other methods, MixMatch (MM) and FixMatch (FM), use only one model and combine the three SSL mechanisms described in the introduction.
Fig. 2. MT workflow. Both models receive as input labeled files x_s and unlabeled files x_u. A supervised loss L_s is computed between the ground truth and the student model predictions, whereas a consistency cost L_cc is computed between the student and teacher model predictions.

We considered a fifth method, ReMixMatch, that was proposed by the same authors to enhance MM, before introducing FM. ReMixMatch uses an additional self-supervised loss term, related to predicting a rotation angle applied to an input image. For audio, we replaced the rotations with horizontal and vertical flips on the spectrograms, but we did not obtain better results compared to MixMatch, so we do not report results with this method.

We provide a figure to illustrate each of the four methods. In Section III-E, we explain how we add mixup to each method, except MM, which already uses mixup in its original version. We included the mixup operation in a green-colored box in the method workflow figures, to show where mixup is optionally integrated. We will refer to the modified methods as "method+mixup", for instance, FM+mixup.
A. Mean-Teacher (MT)
We can find audio applications of MT [10] in the Detection and Classification of Acoustic Scenes and Events (DCASE) task 4 challenges, namely the weakly supervised Sound Event Detection task.

MT uses two neural networks: a "student" f and a "teacher" g, which share the same architecture. The weights ω of the student model are updated using the standard gradient descent algorithm, whereas the weights W of the teacher model are the Exponential Moving Average (EMA) of the student weights. The teacher weights are computed at every mini-batch iteration t, as the convex combination of its weights at t − 1 and the student weights, with a smoothing constant α_ema:

W_t = α_ema · W_{t−1} + (1 − α_ema) · ω_t   (2)

There are two loss functions, applied either on the labeled or on the unlabeled data subset. On the labeled data x_s, the usual cross-entropy (CE) is used between the student model's predictions and the ground truth y_s:

L_sup = CE(f(x_s), y_s)   (3)

The consistency cost is computed from the student prediction f(x_u) and the teacher prediction g(x′_u), where x_u is a sample from the unlabeled subset, and x′_u the same sample but slightly perturbed with Gaussian noise at a 15 dB signal-to-noise ratio. In our case, this cost is a Mean Square Error (MSE) loss:

L_cc = MSE(f(x_u), g(x′_u))   (4)

The final loss function is the sum of the supervised loss function and the consistency cost weighted by a factor λ_cc, which controls its influence:

L_total = L_sup + λ_cc · L_cc   (5)
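For illustration, a minimal PyTorch sketch of the MT update and losses of Eqs. (2)–(5) follows, assuming `student` and `teacher` are two instances of the same architecture; function and variable names are ours, not the authors' code.

```python
# Illustrative sketch of the Mean Teacher EMA update and loss computation.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha_ema=0.999):
    # W_t = alpha_ema * W_{t-1} + (1 - alpha_ema) * w_t   (Eq. 2)
    for w_t, w_s in zip(teacher.parameters(), student.parameters()):
        w_t.mul_(alpha_ema).add_(w_s, alpha=1.0 - alpha_ema)

def mean_teacher_loss(student, teacher, x_s, y_s, x_u, x_u_noisy, lambda_cc=1.0):
    # y_s: integer class indices; x_u_noisy: x_u perturbed with Gaussian noise
    l_sup = F.cross_entropy(student(x_s), y_s)                     # Eq. (3)
    with torch.no_grad():
        teacher_pred = teacher(x_u_noisy).softmax(dim=1)
    l_cc = F.mse_loss(student(x_u).softmax(dim=1), teacher_pred)   # Eq. (4)
    return l_sup + lambda_cc * l_cc                                # Eq. (5)
```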
B. Deep Co-Training (DCT)

DCT has been recently proposed by Qiao and colleagues [11]. It is based on Co-Training (CT), the well-known generic framework for SSL proposed by Blum and colleagues in 1998 [18]. The main idea of Co-Training is based on the assumption that two independent views of a training dataset are available to train two models separately. Ideally, the two views are conditionally independent given the class. The two models are then used to make predictions on the non-labeled data subset. The most confident predictions are selected and added to the labeled subset. This process is iterative, like pseudo-labeling.

DCT is an adaptation of CT in the context of deep learning. Instead of relying on views of the data that are different, DCT makes use of adversarial examples to ensure the independence of the "views" presented to the models. The second difference is that the whole non-labeled dataset is used during training.
Fig. 3. DCT workflow. Each model is trained on its own labeled samples x_i, unlabeled samples x_u and the adversarial examples generated by the other model. Model f makes predictions on x and x_g, and model g on x and x_f. In our DCT+mixup variant, mixup is used on the unlabeled samples only.

Each batch is composed of a supervised and an unsupervised part. Thus, the non-labeled data are directly used, and the iterative aspect of the algorithm is removed. Let S and U be the subsets of labeled and unlabeled data, respectively, and let f and g be the two neural networks that are expected to collaborate.

The DCT loss function is comprised of three terms, as shown in Eq. 6. These terms correspond to loss functions estimated either on S, U, or both. Note that during training, a mini-batch is comprised of labeled and unlabeled samples in a fixed proportion. Furthermore, in a given mini-batch, the labeled examples given to each of the two models are sampled differently.

L = L_sup + λ_cot · L_cot + λ_diff · L_diff   (6)

The first term, L_sup, given in Eq. 7, corresponds to the standard supervised classification loss function for the two models f and g, estimated on examples x_1 and x_2 sampled from S. In our case, we use categorical Cross-Entropy (CE), the standard loss function used in classification tasks with mutually exclusive classes.

L_sup = CE(f(x_1), y_1) + CE(g(x_2), y_2)   (7)

In SSL and Co-Training, the two classifiers are expected to provide consistent and similar predictions on both the labeled and unlabeled data. To encourage this behavior, the Jensen-Shannon (JS) divergence between the two sets of predictions is minimized on examples x_u sampled from the unlabeled subset U only. Indeed, there is no need to minimize this divergence also on S, since L_sup already encourages the two models to have similar predictions on S. Eq. 8 gives the JS analytical expression, with H denoting entropy.

L_cot = H((f(x_u) + g(x_u)) / 2) − (H(f(x_u)) + H(g(x_u))) / 2   (8)

For DCT to work, the two models need to be complementary: on a subset different from S ∪ U, examples misclassified by one model should be correctly classified by the other model [19]. This can be achieved in deep learning by generating adversarial examples with one model and training the other model to be resistant to these adversarial samples. To do so, the L_diff term (Eq. 9) sums the Cross-Entropy loss between the predictions f(x) and g(x_f), where x is sampled from S ∪ U and x_f is the adversarial example generated with model f and x taken as input, and the symmetric term for model g.

L_diff = CE(f(x), g(x_f)) + CE(g(x), f(x_g))   (9)

For the adversarial example generation, we use the Fast Gradient Sign Method (FGSM, [20]), as in Qiao's work. For more in-depth details on the technical aspects of DCT, the reader may refer to [11]. We implemented DCT as precisely as described in Qiao's article, using PyTorch, and made sure to accurately reproduce their results on CIFAR-10: about 90% accuracy when using only 10% of the training data as labeled data (5000 images).

Fig. 4. MM workflow. K augmentations are applied to the unlabeled data x_u, and the averaged model predictions are used as pseudo-labels ŷ_u. The labeled and augmented unlabeled data are mixed up and used to compute the supervised and unsupervised loss values.
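For illustration, the agreement term L_cot of Eq. (8) can be computed as sketched below; p_f and p_g are assumed to be the softmax outputs of models f and g on the same unlabeled batch, and the helper names are ours.

```python
# Illustrative sketch of the Jensen-Shannon agreement term L_cot (Eq. 8).
import torch

def entropy(p: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Mean Shannon entropy of a batch of probability distributions.
    return -(p * (p + eps).log()).sum(dim=1).mean()

def l_cot(p_f: torch.Tensor, p_g: torch.Tensor) -> torch.Tensor:
    m = 0.5 * (p_f + p_g)                                   # mixture distribution
    return entropy(m) - 0.5 * (entropy(p_f) + entropy(p_g))
```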
C. MixMatch

MixMatch [12] (MM) is an SSL approach that uses entropy minimization and standard regularization, namely pseudo-labeling [5], mixup, and weak data augmentation, to leverage the unlabeled data and provide better generalization capabilities. Unlike MT and DCT, this approach uses only one model. The different steps are shown in Fig. 4 and detailed in the following paragraphs.

During the learning phase, each mini-batch is composed of labeled x_s and non-labeled x_u samples in equivalent proportions. The first step consists of applying an augmentation to the labeled part of the mini-batch and k augmentations to the non-labeled part. In the second step, pseudo-labels ŷ_u are generated for the non-labeled files using the model's prediction averaged over these k variants, as shown in Eq. 10, where x′_{u,i} denotes the i-th augmented variant of an unlabeled file.

ŷ_u = (1/k) Σ_{i=1}^{k} f(x′_{u,i})   (10)

To encourage the model to produce confident predictions, a post-processing step is necessary to decrease the output's entropy. To do so, the highest probability is increased and the other ones decreased. This process is called "sharpening" by the method's authors, and it is defined as:

sharpen(p, T)_i := p_i^{1/T} / Σ_{j=1}^{|p|} p_j^{1/T}   (11)

The sharpen function is applied to the pseudo-labels p = ŷ_u. The parameter T, called temperature, controls the strength of the sharpen function. When T tends towards zero, the entropy of the produced distribution is lowered.

Finally, the labeled and unlabeled augmented samples are concatenated and shuffled into a set W, then used as a pool of training samples by the asymmetric mixup function. Asymmetric mixup is applied separately on the labeled and unlabeled parts of the mini-batch, as formulated here:

x′_mix_s = mixup(x_s, W_{1..|x_s|})   (12)
x′_mix_u = mixup(x_u, W_{|x_s|..|W|})   (13)

The W set and the corresponding labels are shuffled in the same order. Each labeled sample is then perturbed by a second labeled or unlabeled sample. Mixing the two is done so that the original labeled sample remains the main component of the resulting sample. The operation has been detailed in Section II-A. The same procedure is applied to the unlabeled files using the remaining samples from W.

The original MixMatch loss function is composed of the standard cross-entropy (CE) for the supervised loss L_s, and a squared l2 norm for the unsupervised loss L_u. We replace the l2 norm with a cross-entropy in all our experiments, as proposed in the ReMixMatch paper; indeed, CE performed better than the l2 norm in our experiments.

L_s = (1/B_s) Σ_{(x′_mix_s, y_mix_s)} CE(f(x′_mix_s), y_mix_s)   (14)

L_u = (1/(k·B_u)) Σ_{(x′_mix_u, ŷ_mix_u)} CE(f(x′_mix_u), ŷ_mix_u)   (15)

The final loss is the sum of the two components, weighted by a hyper-parameter λ_u:

L = L_s + λ_u · L_u   (16)
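A minimal sketch of the pseudo-labelling and sharpening steps of Eqs. (10)–(11) follows; `model` and `augment` are assumed helpers, and the defaults k = 2 and T = 0.5 follow the values given later in Section IV-C.

```python
# Illustrative sketch of MixMatch label guessing (Eq. 10) and sharpening (Eq. 11).
import torch

def sharpen(p: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    # Raise probabilities to 1/T and renormalize: lowers the entropy when T < 1.
    p = p ** (1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

@torch.no_grad()
def guess_labels(model, x_u, augment, k: int = 2, T: float = 0.5) -> torch.Tensor:
    preds = [model(augment(x_u)).softmax(dim=1) for _ in range(k)]
    y_hat = torch.stack(preds).mean(dim=0)   # average over the k augmented variants
    return sharpen(y_hat, T)                 # sharpened pseudo-labels
```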
D. FixMatch

FixMatch [13] (FM) is another SSL method, which proposes a simplification of MM and ReMixMatch. The method also uses one model, removes mixup, and replaces the sharpen function by binary pseudo-labels. FM uses both weak augmentations (weak) and strong augmentations (strong). The strong augmentations can mislead the model predictions by disrupting the training data too much. Figure 5 shows the main pipeline of FixMatch. As in the other method illustrations, we added a mixup box in blue, to indicate where we add it to the algorithm in our modified FM algorithm, thus called FM+mixup.

Fig. 5. FixMatch workflow. A weakly augmented version of x_u is used to compute a pseudo-label ŷ_u and a mask. The strongly augmented variant is used to compute the unlabeled loss term. The mixup component is used on a concatenated set of labeled and unlabeled samples (FixMatch+mixup).

The supervised loss component is the standard cross-entropy applied to the weakly augmented data:

L_s = CE(f(weak(x_s)), y_s)   (17)

Then, we guess the labels of the weakly augmented unlabeled data and apply a binarization (argmax) of these predictions to obtain a "one-hot" encoded label. This label is used as the target for training the model on the strongly augmented unlabeled data. It allows the model to generalize with weak and strong augmentations, and it also uses the guessed label to improve the model accuracy with unlabeled data:

ŷ_u = f(weak(x_u))   (18)

To avoid training on incorrectly guessed labels, FM uses a threshold τ that ensures that the unsupervised cost function is only applied to predictions made with high confidence, i.e., above this threshold. This can be easily implemented in the form of a mask:

mask = (max(ŷ_u) > τ)   (19)
L_u = mask · CE(f(strong(x_u)), argmax(ŷ_u))

As in MixMatch, we sum the loss components to compute the final loss:

L = L_s + λ_u · L_u   (20)
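A minimal sketch of the FixMatch unsupervised term of Eqs. (18)–(19) follows; `weak` and `strong` are assumed augmentation callables, and the default threshold value used here is only illustrative.

```python
# Illustrative sketch of the FixMatch unlabeled loss with confidence masking.
import torch
import torch.nn.functional as F

def fixmatch_unsup_loss(model, x_u, weak, strong, tau: float = 0.8):
    with torch.no_grad():
        y_hat = model(weak(x_u)).softmax(dim=1)   # guessed distribution (Eq. 18)
    conf, target = y_hat.max(dim=1)               # confidence and argmax pseudo-label
    mask = (conf > tau).float()                   # keep only confident samples (Eq. 19)
    loss = F.cross_entropy(model(strong(x_u)), target, reduction="none")
    return (mask * loss).mean()
```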
E. Adding mixup to MT, DCT and FM

As described above, MM already uses mixup in its workflow. In order to measure the impact of mixup in MM, we will also report results when mixup is removed from MM. On the contrary, the three other SSL methods explored in our work (MT, DCT, FM) do not use mixup in their original version. How should we add mixup to them, then? We explored several ways to do so, and retained the best one for each of the three methods. Note that we illustrate where the mixup operation has been added in the figures describing the different methods in the previous section.
Since the labeled and unlabeled data flow is very similar in MM and FM, we added mixup to FM at the same place as in MM: both labeled and unlabeled samples are mixed up. Similarly, it is also the asymmetric mixup variant that we use in MM and FM, since mixup is applied to labeled and unlabeled samples together, as in the original MM method. Using mixup on labeled and unlabeled examples separately seems to hurt performance with these two methods.

In MT, mixup is applied on labeled and unlabeled samples separately, and only for the teacher model. The perturbation with Gaussian noise applied to the unlabeled samples is removed, since no gain was observed when mixup is used instead.

For DCT, mixup is applied on the unlabeled samples only, which are common to both models in each mini-batch during training. Applying mixup on the labeled samples, which are sampled differently for the two models at each training step, led to worse results. It is then not necessary to use the asymmetric variant for MT and DCT.

Finally, in all cases, we apply mixup on the log-Mel spectrograms, which are the input features given to our deep neural networks (feature extraction is detailed in the experiments section).
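For illustration, restricting mixup to the unlabeled inputs, as in our DCT+mixup variant, can be sketched as follows; no targets are involved since the unlabeled samples carry none, and the function name is ours.

```python
# Illustrative sketch: mixup applied to the unlabeled spectrograms only.
import torch
import numpy as np

def mixup_inputs_only(x_u: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    lam = np.random.beta(alpha, alpha)       # symmetric draw, no asymmetric max needed here
    perm = torch.randperm(x_u.size(0))
    return lam * x_u + (1.0 - lam) * x_u[perm]
```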
IV. EXPERIMENTS
In this section, we describe our experimental setup. We give a brief description of the datasets and metrics, and describe the Wide ResNet architecture we used, together with the training strategy details.
A. Datasets and evaluation metrics
Environmental Sound Classification 10 (ESC-10) [14] is a selection of 400 five-second-long recordings of audio events separated into ten balanced categories. The dataset is provided with five uniformly sized cross-validation folds that are used to perform the evaluation. The files are sampled at 44 kHz and are converted into 64-band log-Mel spectrograms.

UrbanSound8k (UBS8K) [15] is a dataset composed of 8742 files between 1 and 4 seconds long, separated into ten balanced categories. The dataset is provided with ten cross-validation folds of uniform size that are used to perform the evaluation. The files are zero-padded to 4 seconds, resampled to 22 kHz, and converted to 64-band log-Mel spectrograms.

Google Speech Commands Dataset v2 (GSC) [16] is an audio dataset of spoken words designed to evaluate keyword spotting systems. The dataset is split into 85511 training files, 10102 validation files, and 4890 testing files. The latter are used for the evaluation of our systems. We ran the task of classifying the 35 word categories of this dataset. The files are zero-padded to 1 second if needed and sampled at 16 kHz before being converted into 64-band log-Mel spectrograms.

In all cases, the 64 Mel coefficients were extracted using a window size of 2048 samples and a hop length of 512 samples. For ESC-10 and UBS8K, we used the official cross-validation folds. We report the average classification Error Rate (ER) along with standard deviations.
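For illustration, the log-Mel front-end described above (64 Mel bands, 2048-sample windows, 512-sample hop) could be implemented as sketched below with torchaudio; the library choice and the −80 dB flooring are our assumptions, the paper does not specify which implementation was used.

```python
# Illustrative sketch of the log-Mel feature extraction.
import torch
import torchaudio

def log_mel(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """waveform: (channels, time) tensor at the dataset's sample rate."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=2048, hop_length=512, n_mels=64
    )(waveform)
    # Convert to dB with an 80 dB dynamic range (silence floored at -80 dB).
    return torchaudio.transforms.AmplitudeToDB(top_db=80.0)(mel)
```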
TABLE II
ARCHITECTURE OF WIDE RESNET 28-2. DOWNSAMPLING IS PERFORMED BY THE FIRST LAYERS IN BLOCK2 AND BLOCK3.

Layer    Architecture
input    Log-Mel spectrogram
conv1    BasicBlock(32), Max pool
block1   [ BasicBlock(32) ] ×
block2   [ BasicBlock(64) ] ×
block3   [ BasicBlock(128) ] ×

B. Models
We used the Wide ResNet 28-2 [17] architecture in all our experiments. This model is very efficient, achieving state-of-the-art performance on the three datasets when trained in a 100% supervised setting. Moreover, its small size, of about 1.4 million parameters, allows us to experiment quickly. Its structure consists of an initial convolutional layer (conv1) followed by three groups of residual blocks (block1, block2, and block3). Finally, an average pooling and a linear layer act as a classifier. The residual blocks, composed of two BasicBlock, are repeated three times, and their structure is defined in Eq. 21. The number of channels of the convolution layers is referred to as l, BN stands for Batch Normalization and ReLU [21] for the Rectified Linear Unit activation function. We used the official implementation available in PyTorch [22].

BasicBlock(l) = (conv 3×3 @ l, BN, ReLU)   (21)
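A minimal PyTorch sketch of the BasicBlock of Eq. (21) follows; it only illustrates this elementary block, not the full Wide ResNet 28-2 assembly used in the paper.

```python
# Illustrative sketch of BasicBlock(l): 3x3 convolution with l channels, BN, ReLU.
import torch.nn as nn

def basic_block(in_channels: int, l: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, l, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(l),
        nn.ReLU(inplace=True),
    )
```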
C. Training configurations

Each model was trained using the ADAM [23] optimizer. Table III shows the hyper-parameter values used for each method, such as the learning rate lr, the mini-batch size bs, the warm-up length wl if used, and the number of epochs e. These parameters are identical regardless of the dataset used, unless otherwise specified.

TABLE III
TRAINING PARAMETERS USED ON THE DATASETS. BS: BATCH SIZE, LR: LEARNING RATE, WL: WARM-UP LENGTH IN EPOCHS, E: NUMBER OF EPOCHS, α: MIXUP BETA PARAMETER.

Method       bs    lr      wl    e    α
Supervised   256   0.001   -     300  -
mixup        256   0.001   -     300  0.4
MT           64    0.001   50    200  -
MT+mixup     64    0.001   50    200  0.40
DCT          64    0.0005  160   300  -
DCT+mixup    64    0.0005  160   300  0.40
MM           256   0.001   -     300  0.75
MM-mixup     256   0.001   -     300  -
FM           256   0.001   -     300  -
FM+mixup     256   0.001   -     300  0.75

For supervised training, MM and FM, the learning rate remains constant throughout training. For MT and DCT, the learning rate is weighted by a descending cosine rule:

lr_t = 0.5 · (1 + cos((t − 1) π / e)) · lr   (22)

All the SSL approaches but FixMatch introduce one or more subsidiary terms to the loss. To alleviate their impact at the beginning of the training, these terms are weighted by a ratio λ, which ramps up to its maximum value within a warm-up length wl. The ramp-up strategy is defined in Eq. 23 for MT and DCT, and is linear in MM during the first 16k learning iterations.

λ = λ_max × e^{−5 (1 − t/wl)²}   (23)

In MT, the maximum value of λ_cc is 1 and α_ema is set to 0.999. In DCT, the maximum values of λ_cot and λ_diff are 1 and 0.5, respectively. In MM, the maximum value of λ_u is 1. In MM, we use two augmentations (k = 2) and the sharpening temperature T is set to 0.5. In FM, we use a threshold τ = 0. on the ESC-10 and GSC datasets, and τ = 0. for UBS8K.

For MM and FM, on ESC-10, the batch size is 60, because ESC-10 is a small dataset of 400 files only. During training, only four folds are used, that is, 320 files. In a 10% configuration, and due to whole-division constraints, this represents only 30 supervised files in total. Each mini-batch must contain as many labeled as unlabeled files, hence the batch size of 60. Moreover, because of this small number of files, the training phase only lasts 2700 iterations, and therefore the warm-up ends prematurely.

For our proposed variants, which include mixup, we kept the same configurations and parameter values.
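A small sketch of the warm-up weighting of Eq. (23), as we reconstruct it, follows; λ ramps up from a near-zero value to λ_max over wl epochs and is then kept constant.

```python
# Illustrative sketch of the exponential ramp-up of the loss weight (Eq. 23).
import math

def lambda_rampup(t: float, wl: float, lambda_max: float = 1.0) -> float:
    """t: current epoch, wl: warm-up length in epochs."""
    if t >= wl:
        return lambda_max
    return lambda_max * math.exp(-5.0 * (1.0 - t / wl) ** 2)
```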
V. RESULTS

We first report the results obtained in a supervised setting, with and without the same data augmentation methods used in the SSL algorithms, including mixup. We then compare the error rates obtained by the four SSL methods and show that adding mixup is beneficial in almost all cases.
A. Supervised learning
This section presents the results obtained with supervised learning in different settings, using either 10% or 100% of the labeled data available. MM and FM use augmentations as their core mechanism. Therefore, for fair comparisons, it seems essential to use the same augmentations in the supervised settings too. Indeed, FM uses weak and strong augmentations, while MM uses a combination of weak augmentations and mixup.

We trained models without any augmentation (Supervised), using mixup alone (mixup), weak augmentations alone (Weak), a combination of weak augmentations and mixup (Weak+mixup), strong augmentations alone (Strong), and finally a combination of strong augmentations with mixup (Strong+mixup). Table IV presents the results on ESC-10, UBS8K, and GSC.

TABLE IV
SUPERVISED LEARNING ERROR RATES (%) ON ESC-10, UBS8K AND GSC.

                   ESC-10                      UBS8K                        GSC
Labeled fraction   10%            100%         10%            100%          10%     100%
Supervised         32.00 ± 6.17   8.00 ± 5.06  33.80 ± 4.82   23.29 ± 5.80  10.01   4.94
+mixup             36.00 ± 5.22   8.33 ± 4.56  31.41 ± 5.56   22.04 ± 5.99   8.83   3.86
+Strong            23.00 ± 5.19   5.00 ± 2.64  25.58 ± 4.15   20.69 ± 4.92   7.60   3.27
+Strong+mixup      24.00 ± 8.71   5.00 ± 4.25  24.73 ± 4.42   18.52 ± 4.38   6.86   2.98
ESC-10.
In the 10% setting, the supervised model reached an ER of 32.00%. The use of Weak yielded the best performance with 22.67% ER, outperforming the supervised model by 9.3 points (29.16% relative). In the 100% setting, the supervised model reached an ER of 8.00%, and the best ER of 4.67% was achieved when using Weak+mixup. The gain is 3.33 points (41.62% relative).
UBS8K.
In the 10% setting, the supervised model reached 33.80% ER, and the best supervised result was obtained with Weak+mixup, with a 23.75% ER. It represents an improvement of 10.05 points (29.73% relative). In the 100% setting, the same augmentation combination reached an ER of 17.96%, outperforming the 23.29% ER of the supervised model by 5.33 points (22.88% relative).
GSC.
In the 10% setting, the supervised model reached 10.01% ER, and Weak+mixup yielded the best ER of 6.58%. It represents an improvement of 3.43 points (34.26% relative). In the 100% setting, Strong+mixup reached an ER of 2.98%, outperforming the 4.94% ER of the supervised model by 1.96 points (39.68% relative).

Overall, we observe that, in a supervised setting, while using a single augmentation improved the accuracy of our systems, the combination of mixup with a weak or a strong augmentation is systematically better than using one of those augmentations individually, except on the ESC-10 dataset.
B. Semi-supervised learning
In this section, we report in Table V the results of the SSL methods with and without the mixup augmentation. For MM, mixup is already used in the original method; thus, we compare MM to MM without mixup (MM-mixup). For the three other methods, we denote by FM+mixup, for instance, the FM algorithm augmented with mixup.
TABLE V
SEMI-SUPERVISED LEARNING RESULTS IN ERROR RATE (%) ON ESC-10, UBS8K AND GSC.

                   ESC-10                      UBS8K                        GSC
Labeled fraction   10%            100%         10%            100%          10%     100%
Supervised         32.00 ± 6.17   8.00 ± 5.06  33.80 ± 4.82   23.29 ± 5.80  10.01   4.94
Best supervised    22.67 ± 3.46   4.67 ± 1.39  23.75 ± 4.73   17.96 ± 3.64   6.58   3.00
MT                 28.28 ± 5.28   -            23.75 ± 4.73   -              8.51   -
MT+mixup           27.81 ± 2.25   -            32.00 ± 5.80   -              8.50   -
DCT                25.16 ± 4.42   -            27.85 ± 4.29   -              6.22   -
DCT+mixup          23.75 ± 2.36   -            25.77 ± 4.73   -              5.63   -
MM                 15.33 ± 5.58   -            18.02          -              3.25   -
MM-mixup           17.33 ± 3.84   -            20.42 ± 4.88   -              4.49   -
FM                 13.33          -            21.44 ± 4.16   -              4.44   -
FM+mixup           14.67 ± 7.21   -            18.27 ± 3.80   -              3.31   -

On all three datasets, the four SSL methods brought ER decreases compared to the 10% supervised learning setup when no augmentation is performed. Only MM and FM performed better than the best supervised training result, which used the weak augmentations. Furthermore, MM and FM significantly outperformed MT and DCT in all cases, showing that single-model SSL methods are more efficient than two-model-based methods, at least on these three datasets.

For ESC-10, in the 10% setting, the lowest ER was achieved by FM with a 13.33% ER, compared to a 22.67% ER for a weakly augmented supervised training. It represents a 9.34 points improvement (41.20% relative). The difference with a fully supervised training using weak augmentations, reaching a 4.67% ER, is still notable, with an 8.66 points difference.

On UBS8K, the best ER was achieved using MM with an 18.02% ER. The difference with the best supervised training, Weak+mixup, reaching a 23.75% ER, represents 5.73 points, a 24.13% relative improvement. The performance of MM is also very close to the best fully supervised training Weak+mixup, which reached an ER of 17.96%; the difference is then only 0.06 points. Similarly to ESC-10, while MT and DCT outperformed the supervised training, they did not score better than a supervised training using the Weak+mixup augmentation.

The GSC dataset results confirm the observations made on UBS8K. The MM method is again the best method with an ER of 3.25%, representing a gain of 6.76 points (67.53%) or 3.33 points (50.61%) compared to supervised training without and with Weak+mixup augmentations, respectively.
C. Impact of mixup
Given that the best SSL method so far was MM, and that mixup is used in MM and not in the three other methods, we decided to add mixup to MT, DCT and FM, in different ways for each method, as explained in Section III-E. In Table V, we report the results when adding mixup to MT, DCT and FM (MT+mixup, DCT+mixup, FM+mixup). We also give the ER when removing mixup from MM, in the row named MM-mixup.

As a first comment, MM-mixup is always worse than MM. For instance, on UBS8K, the ER increased from 18.02% to 20.42%. With the other SSL methods, adding mixup brought performance improvements on all the datasets tested. The only counter-example observed is FM on ESC-10, which went from an ER of 13.33% to an ER of 14.67%. On UBS8K, FM went from 21.44% ER without mixup to 18.27% with mixup. On GSC, MM once again presented the most significant improvement, with 4.49% and 3.25% ERs without and with mixup.

It is also important to note that using mixup allowed reaching ER values very close to the ones obtained with a fully supervised (100% setting) training using augmentations, on UBS8K and GSC. This is observable with MM and FM+mixup, which, compared to a Weak+mixup 100% supervised training, show only 0.06 and 0.31 points difference on UBS8K, and 0.25 and 0.31 points difference on GSC.

When we look at our supervised training performance, we can observe that an improvement in performance does not systematically follow from the use of weak or strong augmentation. However, when combined with mixup, the ER is frequently improved. This can be partly explained by the fact that audio augmentations are often difficult to choose and that their impact is often dependent on the dataset and the task at hand [24]. With this in mind, mixup seems to be beneficial regardless of the dataset used.
VI. CONCLUSION
In this article, we reported monophonic audio tagging experiments in a semi-supervised setting on three standard datasets of different sizes and content: the very small-sized ESC-10 with generic audio events, urban noises with UrbanSound8K, and speech with Google Speech Commands, using only 10% of the labeled data samples and the remaining 90% as unlabeled samples. We adapted and compared four existing SSL algorithms for this task: two methods that use two neural networks in parallel, Mean Teacher and Deep Co-Training, and the two single-model methods MixMatch and FixMatch, which rely in particular on data augmentation.

All four methods brought significant gains compared to a supervised training setting using 10% of labeled data. They performed better than supervised learning without augmentation, and MixMatch and FixMatch were even better than supervised learning with augmentation. On ESC-10, FixMatch reached the best Error Rate of 13.33%. The relative gains were
REFERENCES

[1] M. Sajjadi, M. Javanmardi, and T. Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29. Curran Associates, Inc., 2016, pp. 1163–1171.
[2] S. Laine and T. Aila, "Temporal ensembling for semi-supervised learning," OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=BJ6oOfqge
[3] T. Miyato, S. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," 2018.
[4] Y. Grandvalet and Y. Bengio, "Semi-supervised learning by entropy minimization," in Advances in Neural Information Processing Systems, L. Saul, Y. Weiss, and L. Bottou, Eds., vol. 17. MIT Press, 2005, pp. 529–536.
[5] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," ICML 2013 Workshop: Challenges in Representation Learning (WREPL), 07 2013.
[6] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2019.
[7] G. Zhang, C. Wang, B. Xu, and R. Grosse, "Three mechanisms of weight decay regularization," 2018.
[8] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," 2018.
[9] R. R. Wiyatno, A. Xu, O. Dia, and A. de Berker, "Adversarial examples in modern machine learning: A review," 2019.
[10] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," 2018.
[11] S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille, "Deep co-training for semi-supervised image recognition," in proc. ECCV, Munich, 2018, pp. 135–152.
[12] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel, "MixMatch: A holistic approach to semi-supervised learning," in proc. NeurIPS, Vancouver, 2019, pp. 5049–5059.
[13] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," 2020.
[14] K. J. Piczak, "ESC: Dataset for environmental sound classification," in proc. ACM Multimedia, Brisbane, 2015, pp. 1015–1018. [Online]. Available: https://doi.org/10.1145/2733373.2806390
[15] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in proc. ACM Multimedia, 2014, pp. 1041–1044. [Online]. Available: https://doi.org/10.1145/2647868.2655045
[16] P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," 2018, arXiv:1804.03209.
[17] S. Zagoruyko and N. Komodakis, "Wide residual networks," 2017.
[18] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in proc. COLT, Madison, 1998, pp. 92–100.
[19] M.-A. Krogel and T. Scheffer, "Multi-relational learning, text mining, and semi-supervised learning for functional genomics," in Machine Learning, 2004, pp. 61–81.
[20] I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in proc. ICLR, San Diego, 2015.
[21] A. F. Agarap, "Deep learning using rectified linear units (ReLU)," 2019.
[22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in proc. NeurIPS, 2019, pp. 8026–8037.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.
[24] K. Lu, C.-S. Foo, K. K. Teh, H. D. Tran, and V. R. Chandrasekhar, "Semi-supervised audio classification with consistency-based regularization," in proc. INTERSPEECH, Graz, 2019, pp. 3654–3658.
[25] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in proc. ICASSP, 2017.