PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation
Yuan Gong, Member, IEEE, Yu-An Chung, Member, IEEE, and James Glass, Fellow, IEEE
Abstract—Audio event classification is an active research area and has a wide range of applications. Since the release of AudioSet, great progress has been made in advancing classification accuracy, which mostly comes from the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio event classification models with AudioSet, but have not received the attention they deserve. To fill the gap, in this work we present PSLA, a collection of training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, model aggregation, and their design choices. By training an EfficientNet with these techniques, we obtain a model that achieves a new state-of-the-art mean average precision (mAP) of 0.474 on AudioSet, outperforming the previous best system of 0.439.
Index Terms—Audio event classification, transfer learning, imbalanced learning, noisy label, ensemble
I. INTRODUCTION
Audio event classification (AEC) aims to identify sound events that occur in a given audio recording, and enables a variety of Artificial Intelligence-based systems to disambiguate sounds and understand the acoustic environment. AEC has a wide range of health and safety applications in the home, office, industry, transportation, etc., and has become an active research topic in the field of acoustic signal processing.

In recent years, AEC research has moved from small and/or constrained datasets such as ESC-50 [1] and CHiME-Home [2] to much larger datasets with a greater variety and range of real-world audio events and substantially more training data. A significant milestone in this field occurred with the release of the AudioSet corpus [3], which contains over 2 million 10-second audio clips extracted from video and tagged at the utterance level with a set of 527 event labels. AudioSet is currently the largest and most comprehensive publicly available dataset for AEC. Not surprisingly, it has subsequently become the primary source of training and evaluation material for AEC research. The availability of AudioSet has encouraged much AEC research that has steadily seen the standard evaluation metric of mean average precision (mAP) increase from, for example, 0.314 with shallow fully-connected networks [3], to 0.392 with a residual network with attention [4], to, most recently, 0.439 with waveform-based convolutional neural networks (CNNs) [5].
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. The authors are with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: [email protected]; [email protected]; [email protected]).
Fig. 1. The proposed Pretraining, Sampling, Labeling, and Aggregation (PSLA) training pipeline. AudioSet is extremely class imbalanced and has prevalent annotation errors; we propose a data augmentation/balanced sampling strategy and a label enhancement strategy to alleviate these two problems. We also pretrain the convolutional neural networks with ImageNet and find that it leads to a noticeable performance improvement. By further aggregating multiple models with weight averaging and ensemble techniques, we obtain a model that performs much better than one trained with a conventional pipeline and achieves a new state-of-the-art mAP of 0.474.

In order to cope with the weakly labeled data, multiple instance learning and attention mechanisms have also been the subject of much investigation [6], [7], [8], [9].

In our AEC experiments using AudioSet we have observed that, in addition to the particular model architecture being evaluated, significant performance improvements can be achieved via training techniques including cross-modal pretraining, data augmentation, label enhancement, and ensemble modeling. Our empirical evaluations show that these techniques lead to significant accuracy improvements, and combining them can further boost model accuracy. Specifically, we train an EfficientNet [10] model with the proposed set of training techniques and achieve a new state-of-the-art mAP of 0.474 on AudioSet. In addition, a single model with 15M parameters also achieves an mAP of 0.444, outperforming the previous best system that contained 81M parameters.

As shown in Figure 1, the training techniques we investigated fall into four main categories. First, we find that cross-modal pretraining with ImageNet [11] improves the performance of AEC CNNs even though AudioSet already contains a substantial amount of in-domain data. Second, we address the AudioSet label imbalance by adopting balanced sampling and data augmentation. Third, we observed that there are numerous annotation errors in AudioSet and thus developed a method to improve training label quality. Fourth, we use weight averaging and ensemble methods to improve the overall
performance. Many of these techniques have been proposed previously in isolation. For example, ImageNet pretraining has been used in [12] for small datasets, balanced sampling and data augmentation have been used in [5], label enhancement has been proposed in [13], and ensemble modeling has been used in [4], [14], [15]. To the best of our knowledge, however, none of the prior efforts have used more than two of these simultaneously, and the particular implementation is often only briefly mentioned in the literature. A more thorough understanding of the benefits of different training techniques should facilitate a more meaningful comparison between various works, because performance differences due to the particular training procedure could overshadow the model architecture or other novel techniques being investigated. The training pipeline we propose is model-agnostic and can serve as a recipe for AudioSet AEC experiments to facilitate fair comparisons with new techniques.

The contributions of this work are summarized as follows:
1) We present a collection of training strategies and design choices for AEC. We quantify the improvement of each component via extensive experimentation.
2) By training a standard EfficientNet model with the proposed training procedure, we achieve a new state-of-the-art mAP of 0.474 on AudioSet, outperforming the best previous system of 0.439.
3) We plan to release the code, model, and enhanced label set after the review process. The training pipeline can serve as a recipe for AudioSet training to facilitate future AEC research.

The paper is organized as follows. We first describe the baseline model architecture in Section II, then we gradually improve the model by adding new training techniques in Sections III, IV, V, and VI. In each section, we first review the corresponding technique and then present our implementation and results. We conclude the paper in Section VII.

TABLE I
THE AUDIOSET [3] STATISTICS.

                   Balanced Train   Full Train   Evaluation
AudioSet               22,176       2,042,985      20,383
Downloaded             20,785       1,953,082      19,185
Downloaded Ratio       93.7%          95.6%         94.1%
II. EXPERIMENT SETTING AND BASELINE MODEL
A. Dataset
In this work, we focus on AudioSet [3], a collection of over 2 million 10-second audio clips excised from YouTube videos and labeled with the sounds that each clip contains, from a set of 527 labels. AudioSet is a weakly labeled and multi-label dataset, i.e., labels are given to a clip with no indication of where in the clip the associated sound occurred, and every clip can, and often does, have multiple labels associated with it. As shown in Table I, the dataset is split into three subsets: balanced train, unbalanced (full) train, and evaluation. The balanced train set is a set of 22,176 recordings, where each class has at least 49 samples, while the full train set contains
the entire 2 million recordings. The evaluation set consists of 20,383 recordings and contains at least 59 examples for each class. To obtain the raw audio, we extracted the dataset from YouTube. Due to the constant change in video availability (videos being removed, taken down, etc.), there is a natural shrinkage (about 5%) from the original dataset [3]. Specifically, we downloaded 20,785 (94%), 1,953,082 (96%), and 19,185 (94%) recordings for the balanced train, full train, and evaluation sets, respectively, which is consistent with previous literature (e.g., [5]). Therefore, we make fair comparisons with previous state-of-the-art models by evaluating on the same subset of the evaluation dataset.

Fig. 2. The AEC model used in this work. The 10-second waveform is first converted to a 1056 × 128 log Mel filterbank (fbank) feature vector and input to the EfficientNet model. The output of the penultimate layer of EfficientNet is a 33 × 4 × 1408 tensor. We apply frequency mean pooling to produce a 33 × 1408 representation that is fed into a 4-headed attention pooling module. In each head, the CNN output is transformed into a 33 × 527 dimensional tensor via a set of 1 × 1 convolutional filters.

B. Training and Evaluation Details
For all experiments in this paper, we train the neural network model with a batch size of 100, the Adam optimizer [16], and binary cross-entropy (BCE) loss. We use a fixed initial learning rate of 1e-3 and 1e-4 and cut it in half every 5 epochs after the 35th and 10th epoch for all balanced set and full set experiments, respectively. We use a linear learning rate warm-up strategy for the first 1,000 iterations. As in previous efforts, we train the model for 60 and 30 epochs for all balanced set and full set experiments, respectively, and report the mean result on the evaluation set over the last 5 epochs.

We use the mean average precision (mAP) over all classes as our main evaluation metric since it is the most commonly used AEC evaluation metric on AudioSet. In the discussion section, we also report the average area under the curve (AUC)
of the receiver operating characteristic curve and the sensitivity index (d-prime) in order to compare our model with previous work that only reports AUC and d-prime.

TABLE II
MEAN AVERAGE PRECISION (mAP) COMPARISON OF THE RESNET MODEL [4] AND THE EFFICIENTNET MODEL USED IN THIS PAPER.
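As a reference, all three metrics can be computed from the per-class scores and binary targets. The following is a minimal sketch using scikit-learn and SciPy (not the code released with this work); it assumes prediction scores and multi-hot targets are available as NumPy arrays, and the d-prime values follow the common convention of converting each per-class AUC with d' = sqrt(2) * Phi^-1(AUC) and then averaging.

import numpy as np
from scipy import stats
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(preds: np.ndarray, targets: np.ndarray):
    """Compute mAP, mean AUC, and mean d-prime over all classes.

    preds:   (num_clips, num_classes) sigmoid scores in [0, 1]
    targets: (num_clips, num_classes) binary multi-hot labels
    """
    ap = average_precision_score(targets, preds, average=None)   # per-class average precision
    auc = roc_auc_score(targets, preds, average=None)            # per-class ROC-AUC
    # d-prime is derived from each per-class AUC through the inverse normal CDF
    d_prime = np.sqrt(2.0) * stats.norm.ppf(auc)
    return float(np.mean(ap)), float(np.mean(auc)), float(np.mean(d_prime))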
C. Baseline Model
In this work, we use a model structure similar to that in [4], illustrated in Figure 2. Each 10-second audio waveform is first converted to a sequence of 128-dimensional log Mel filterbank (fbank) features computed with a 25 ms Hamming window every 10 ms. This results in a 1056 × 128 feature vector that is input to a CNN model. In [4] the CNN was based on the ResNet-50 model [17]. In our work, the CNN is based on the EfficientNet-B2 model [10], since it requires a smaller number of parameters and is faster for training and inference. The EfficientNet model effectively downsamples the time and frequency dimensions by a factor of 32. The penultimate output of the model is a 33 × 4 × 1408 tensor. We apply mean pooling over the 4 frequency dimensions to produce a 33 × 1408 representation that is fed into a multi-head attention module. The attention module consists of an attention branch and a classification branch. Each branch transforms the CNN mean-pooled output into a 33 × 527 dimensional tensor via a set of 1 × 1 convolutional filters. After a sigmoid non-linearity and a normalization on the attention branch, we combine the two branches via an element-wise product. A temporal mean pooling (implemented by summation) is then performed to produce a final 527-dimensional output, one score per class label. Unlike [4], we use a 4-headed attention module instead of a single-head one in this work. We sum the weighted output of each attention head after it has been scaled by a learnable weight to produce the final output.

EfficientNet [10] is a recently proposed convolutional neural network architecture that has shown an advantage in both accuracy and efficiency over previous architectures. This advantage mainly comes from two design choices: first, EfficientNet is based on the mobile inverted bottleneck convolution (MBConv) block [18], [19], an efficient residual convolution block; second, EfficientNet scales the network along all dimensions (i.e., width, depth, and input resolution), which is demonstrated to be a better strategy than scaling only one dimension. In this work, we use EfficientNet-B2, which consists of 9 stages, 339 layers, and 9.2 million parameters. As shown in Table II, the EfficientNet model achieves slightly worse performance than the ResNet-50 model, but has 10.5 million fewer parameters. In the rest of the paper, we keep using the EfficientNet model and show that a significant improvement can be achieved without modifying its model architecture.
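To make the head structure concrete, the following is a simplified PyTorch sketch of the pooling and attention stages. It is an illustration of the design, not the released implementation: the class names are ours, and the backbone is assumed to be a callable returning the (batch, 1408, 33, 4) feature map (channels, time, frequency) described above.

import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One attention-pooling head: normalized attention branch times classification branch."""
    def __init__(self, in_dim=1408, n_classes=527):
        super().__init__()
        # 1x1 convolutions over the time axis map (B, 1408, 33) -> (B, 527, 33)
        self.att = nn.Conv1d(in_dim, n_classes, kernel_size=1)
        self.cla = nn.Conv1d(in_dim, n_classes, kernel_size=1)

    def forward(self, x):                                   # x: (B, 1408, T=33)
        att = torch.sigmoid(self.att(x))
        att = att / (att.sum(dim=2, keepdim=True) + 1e-7)   # normalize attention over time
        cla = torch.sigmoid(self.cla(x))
        return (att * cla).sum(dim=2)                        # temporal pooling -> (B, 527)

class AttentionPoolingClassifier(nn.Module):
    """Frequency mean pooling followed by 4-headed attention pooling, as in Fig. 2."""
    def __init__(self, backbone, in_dim=1408, n_classes=527, n_heads=4):
        super().__init__()
        self.backbone = backbone                             # e.g., an EfficientNet-B2 feature extractor
        self.heads = nn.ModuleList(
            [AttentionHead(in_dim, n_classes) for _ in range(n_heads)])
        self.head_weight = nn.Parameter(torch.ones(n_heads) / n_heads)

    def forward(self, fbank):                                # fbank: (B, 1, 1056, 128)
        feat = self.backbone(fbank)                          # assumed shape (B, 1408, 33, 4)
        feat = feat.mean(dim=3)                              # frequency mean pooling -> (B, 1408, 33)
        outs = torch.stack([h(feat) for h in self.heads])    # (4, B, 527)
        # learnable weighted sum of the head outputs
        return (self.head_weight.view(-1, 1, 1) * outs).sum(dim=0)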
TABLE III
PERFORMANCE IMPACT ON mAP DUE TO PRETRAINING WITH IMAGENET DATA.

                    Balanced Set   Full Set
No pretraining         0.1570       0.3723
With pretraining       0.2385       0.3939

Fig. 3. Comparison of the performance of the ImageNet-pretrained model and the randomly initialized model with different training data volumes.
III. NETWORK PRETRAINING
Transfer learning and network pretraining have been widely used in computer vision, natural language processing, and speech and audio processing in recent years [20], [21], [22]. The typical process is to first train a model on either a large out-of-domain or unlabeled dataset using an auxiliary task, and then fine-tune the model with in-domain data for the main task. The idea is that the knowledge learned from the pretraining task can be transferred to the main task.

For the audio event classification task, both supervised pretraining (e.g., in [5]) and self-supervised pretraining (e.g., in [23], [24], [25], [26], [27]) using audio data have been studied in recent years. Performance improvement is typically achieved when the in-domain dataset is small (e.g., ESC-50 [1], UrbanSound [28], and balanced AudioSet). However, it has not been reported that a pretrained model can outperform a state-of-the-art AEC model trained from scratch using the full AudioSet, possibly because the full AudioSet contains 2 million audio recordings and there is no larger annotated dataset available. While in theory self-supervised pretraining can leverage an unlimited amount of unlabelled audio data, in practice it is non-trivial to find a large-scale dataset with sufficient variety and coverage of the 527 sound classes.

In contrast to the above-mentioned efforts, we find that a noticeable performance improvement can be achieved by pretraining the CNN with the ImageNet dataset [11] used for visual object classification, even when the training data for the end task of audio event classification is the full AudioSet. In our experiments, we initialize the EfficientNet (from the second to the penultimate layer) with 1) ImageNet-pretrained weights (released by the authors of [10]), and 2) random weights (He uniform initialization [29]). We then train both models in exactly the same way as described in Section II-B.
As shown in Table III, ImageNet pretraining leads to a 51.9% and 5.8% relative improvement for the balanced set and full set experiments, respectively. To see the relationship between the performance improvement and the end-task training data volume, we further evaluate the performance when the audio event classification training data volume is 100k, 200k, 300k, and 500k (all comprised of the entire balanced set plus samples randomly taken from the full set). As shown in Figure 3, the performance improvement decreases with the training data volume, but is always noticeable.

In some sense, it is surprising that pretraining a model with data from a different modality can be effective. Transfer learning from computer vision tasks to audio tasks is not new and has been previously studied in [30], [31], [32], [12]. However, we believe this is the first time it has been demonstrated to be effective when the dataset of the audio task is at this scale, indicating that the auxiliary image classification task helps the model learn some complementary knowledge. We hypothesize that the improvements may be due to the model learning to recognize low-level features such as edges during pretraining. Such knowledge could potentially be relevant for finding acoustic "edges" in the spectrogram.

In practice, many commonly used CNN architectures (e.g., Inception [33], ResNet [17], EfficientNet [10]) have off-the-shelf ImageNet-pretrained models for both TensorFlow and PyTorch. It is also straightforward to adapt these off-the-shelf models to audio tasks. The only things that need to be modified are the first convolution layer and the last classification layer. Since the input of vision tasks is a 3-channel image while the input to the audio task is a single-channel spectrogram, we adjust the input channel of the first convolutional layer from 3 to 1 and initialize it with random weights. Since the classification task is essentially different, we abandon the last classification layer of the pretrained model and feed the output of the penultimate layer to our succeeding layers. We implement this using the efficientnet_pytorch package (https://github.com/lukemelas/EfficientNet-PyTorch).

In summary, the advantages of using ImageNet pretraining are as follows. First, no additional in-domain labeled or unlabeled datasets are needed. This is important because currently there is no AEC dataset of comparable size to AudioSet. Second, ImageNet pretraining can lead to a consistent performance improvement even when the in-domain training data size is huge. Third, ImageNet pretraining is practically easy to implement. The limitation is that it is only applicable to models that take 2D image-like input (e.g., a spectrogram). Nevertheless, a majority of deep learning models for audio tasks do fall into this category. In the following sections, we use ImageNet pretraining by default for all experiments.
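A minimal sketch of the adaptation described above is shown below, using the efficientnet_pytorch package. The attribute names (_conv_stem, _fc) follow that package, and replacing the stem with a plain nn.Conv2d (rather than the package's same-padding variant) is a simplification we make here for brevity; this is an illustrative sketch, not the released code.

import torch.nn as nn
from efficientnet_pytorch import EfficientNet

def build_pretrained_backbone():
    """Load an ImageNet-pretrained EfficientNet-B2 and adapt it to single-channel spectrograms."""
    model = EfficientNet.from_pretrained('efficientnet-b2')

    # Replace the 3-channel RGB stem with a randomly initialized 1-channel stem,
    # since the input is a single-channel log Mel spectrogram.
    old_stem = model._conv_stem
    model._conv_stem = nn.Conv2d(1, old_stem.out_channels,
                                 kernel_size=old_stem.kernel_size,
                                 stride=old_stem.stride,
                                 padding=old_stem.padding,
                                 bias=False)

    # Discard the ImageNet classifier; the time-frequency feature map used by the
    # attention pooling module is obtained with model.extract_features(x) instead.
    model._fc = nn.Identity()
    return model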
IV. BALANCED SAMPLING AND DATA AUGMENTATION

A. Balanced Sampling
As might be expected, the frequency of occurrence of different sound events ranges widely, so it is not surprising that a large-scale AEC dataset is class imbalanced. As shown in Figure 4, the most frequent AudioSet class is "Music", which has 949,029 samples, while the most infrequent class, "Toothbrush", has only 61 samples, leading to a ratio of 15,557. Such imbalances can have a large impact on performance, particularly for low-frequency classes [34].

With such a large data imbalance, simple upsampling or downsampling is difficult to implement: upsampling would make the dataset unacceptably large, while downsampling would waste a large portion of the data. Moreover, AudioSet is a multi-label dataset, making it even harder to implement up/downsampling methods. In this work, we propose a random balanced sampling method to alleviate the class imbalance problem. Note that balanced sampling on AudioSet has been used in [6], [8], [5], but it is only briefly mentioned and the details can only be found in the source code.

The proposed random balanced sampling approach is shown in Algorithm 1, lines 1-8. We first count the sample number c_k of each class k over the entire dataset. We then assign a sampling weight to each sample; specifically, the weight w^(i) of the i-th sample is w^(i) = Σ_{k ∈ y^(i)} 1/c_k. This assigns a higher weight to samples containing rare audio events and takes all audio events that appear in the sample into consideration. During training, we still feed N samples (N is the dataset size) to the model in each epoch, but instead of traversing the dataset, we draw samples, with replacement, from the multinomial distribution parameterized by the above-mentioned sample weights. That makes rare sound event samples more likely to be seen by the model. The advantages of the proposed random sampling are: 1) it is a compromise between upsampling and downsampling, wasting fewer samples than downsampling while keeping the number of samples fed to the model every epoch at N; 2) it is applicable to multi-label datasets; and 3) the model sees a different set of data every epoch, so the model checkpoints after every epoch have greater diversity, which is helpful for ensembles [35], [36], as we will discuss in Section VI.

We compare the performance of the model trained with plain dataset traversal (with data reshuffled at every epoch) and with the proposed random sampling. As shown in Table IV, we find that random balanced sampling actually lowers the performance. This result is not surprising because: 1) while better than downsampling, a substantial amount of data is still wasted every epoch; as shown in Figure 5, 40.9% of the data is not seen by the model after 30 training epochs; 2) while low-frequency class samples and high-frequency class samples are roughly equally seen by the model, the low-frequency class samples are actually repeated samples. Both issues increase the risk of model overfitting. Therefore, we explored the use of data augmentation to overcome this problem.

TABLE IV
PERFORMANCE IMPACT ON mAP DUE TO VARIOUS BALANCED SAMPLING AND DATA AUGMENTATION STRATEGIES.

                            Balanced Set   Full Set
Baseline                       0.2385       0.3939
+ Balanced Sampling              -          0.3721
+ Time-Frequency Masking       0.2818       0.4265
+ Mix-up Training              0.3108       0.4397

Fig. 4. Sample count of each class in the full AudioSet (vertical axis in log scale). Note that the sample count of the "Speech" class is substantially larger than the sum of the sample counts of the "Male Speech", "Female Speech", and "Child Speech" classes. Similarly, the sample count of the "Music" class is substantially larger than the sum of the sample counts of the "Happy music" and "Sad music" classes. This indicates a potentially prevalent missing-annotation issue in AudioSet.

Fig. 5. The proportion of unseen samples as a function of training epochs. The mix-up rate is the probability that the sample input to the model is a mixed-up sample. In our implementation, one of the two mixed-up samples is drawn from a uniform distribution, while the other is drawn using the balanced sampling multinomial distribution.

Algorithm 1 Balanced Sampling and Data Augmentation
Require: multi-label dataset D = {x^(i), y^(i)}, i ∈ {1, ..., N}
Procedure 1: Generate Sampling Weights
  Input: label set {y^(i)}
  Output: sample weight set W = {w^(i)}, i ∈ {1, ..., N}
  1: traverse {y^(i)}, count the sample number c_k of each class k
  2: initialize w^(i) = 0, i ∈ {1, ..., N}
  3: for each sample i do
  4:   for each class k ∈ y^(i) do
  5:     w^(i) = w^(i) + 1/c_k
  return W = {w^(i)}
Procedure 2: Sampling and Augmentation in Training
  Input: {x^(i), y^(i)}, W, F, T, M
  6: for every epoch do
  7:   for n ∈ {1, ..., N} do
  8:     draw i ~ multinomial(W)
  9:     if unif(0, 1) < mix-up rate M then
  10:      draw j ~ unif{1, N}
  11:      draw λ ~ Beta(α, α)
  12:      x = λ x^(i) + (1 − λ) x^(j)
  13:      y = λ y^(i) + (1 − λ) y^(j)
  14:    else
  15:      x = x^(i), y = y^(i)
  16:    draw f ~ unif(0, F), f_0 ~ unif(0, 128 − f)
  17:    draw t ~ unif(0, T), t_0 ~ unif(0, 1056 − t)
  18:    x = Masking(f_0, t_0, f, t)(x)
  19:    use (x, y) to train the neural network
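Procedure 1 of Algorithm 1 above maps directly onto PyTorch's WeightedRandomSampler. Below is a minimal sketch under the assumption that the labels are available as an (N, 527) multi-hot matrix; the function and variable names are ours.

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

def make_balanced_sampler(labels: np.ndarray) -> WeightedRandomSampler:
    """labels: (N, 527) multi-hot matrix; returns a sampler implementing Algorithm 1, Procedure 1."""
    class_count = labels.sum(axis=0)                    # c_k: number of clips containing class k
    class_weight = 1.0 / np.maximum(class_count, 1)     # guard against empty classes
    sample_weight = labels @ class_weight                # w(i) = sum over k in y(i) of 1 / c_k
    # draw N indices per epoch from the multinomial defined by the weights, with replacement
    return WeightedRandomSampler(torch.as_tensor(sample_weight, dtype=torch.double),
                                 num_samples=len(sample_weight), replacement=True)

# usage (illustrative): loader = DataLoader(dataset, batch_size=100,
#                                           sampler=make_balanced_sampler(labels))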
B. Time and Frequency Masking
We first consider simple time and frequency masking for data augmentation, which has been found to be effective for AEC [5] and speech recognition [37]. Frequency masking is applied so that f consecutive frequency channels [f_0, f_0 + f) are masked, where f ∼ unif(0, F), f_0 ∼ unif(0, 128 − f), and F is the maximum possible length of the frequency mask. Similarly, time masking is applied so that t consecutive time frames [t_0, t_0 + t) are masked, where t ∼ unif(0, T), t_0 ∼ unif(0, 1056 − t), and T is the maximum possible length of the time mask. Note that 128 and 1056 are the input dimensions of our model. We use the torchaudio.transforms.FrequencyMasking and torchaudio.transforms.TimeMasking implementations with F = 48 and T = 192. The masking parameters f_0, t_0, f, t are sampled on-the-fly for each audio sample during training to minimize the chance of repeated audio samples being fed to the model. As shown in Table IV, time and frequency masking improves AEC performance considerably, with relative improvements of 18.2% and 14.6% achieved for the balanced set and full set experiments, respectively. Note that the overall number of training samples per epoch remains the same. We hypothesize that the effectiveness of masking is due to the reduction of repeated samples in the training data, especially for low-frequency samples.
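A minimal sketch of this masking step with the torchaudio transforms named above (F = 48, T = 192) is shown below. It assumes a (time, frequency) fbank layout; since torchaudio expects spectrograms as (..., frequency, time), the 1056 × 128 fbank is transposed before and after masking.

import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=48)   # F = 48
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=192)       # T = 192

def augment(fbank: torch.Tensor) -> torch.Tensor:
    """fbank: (1056, 128) log Mel spectrogram, laid out as (time, frequency)."""
    spec = fbank.transpose(0, 1).unsqueeze(0)     # -> (1, 128, 1056) = (channel, freq, time)
    spec = freq_mask(spec)                        # mask f ~ unif(0, 48) consecutive frequency bins
    spec = time_mask(spec)                        # mask t ~ unif(0, 192) consecutive time frames
    return spec.squeeze(0).transpose(0, 1)        # back to (1056, 128)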
C. Mix-up Training

An additional form of data augmentation we explored is mix-up training, in which weighted combinations of audio samples are used to make new samples. Mix-up training creates convex combinations of pairs of examples and their corresponding labels. Studies have shown it can improve the performance of image classification, voice command recognition [38], [39], and AEC [5], [40]. Specifically, mix-up training constructs augmented training examples as follows:
x = λ x^(i) + (1 − λ) x^(j)
y = λ y^(i) + (1 − λ) y^(j)

where x^(i) and x^(j) are two different training audio samples, y^(i) and y^(j) are the corresponding labels, λ ∈ [0, 1], x is the mixed-up new audio sample, and y is the resulting label.

Past explanations for why mix-up training improves performance include: 1) it increases the variation of the training data [5], [40]; 2) it leads to an enlargement of Fisher's criterion in the feature space and a regularization of the positional relationship among the feature distributions of the classes [38], [40]; and 3) it reduces the model's memorization of corrupt labels [39].

In addition to these observations, we find mix-up training has an additional advantage for imbalanced datasets. As we discussed in Section IV-A, balanced sampling, while making the low-frequency class samples more prevalent, has the unfortunate side effect of wasting a large number (40.9%) of samples. By adopting the mix-up strategy, the model can see twice the number of samples within the same training epoch. This advantage can be increased if one of the two mixed-up samples is drawn from a uniform distribution, while the other is drawn using the balanced sampling multinomial distribution introduced in the previous section. Intuitively, mixing up a rare sound event (e.g., toothbrush) with a frequent one (e.g., music) is more reasonable than mixing up two rare sound events. Some previous synthetic audio event detection datasets use a similar method to construct samples [41]. As shown in Figure 5, the mix-up strategy can reduce the unseen samples to almost zero in just a few epochs.

We further make two modifications based on previous efforts. First, in prior work λ is drawn from unif(0, 1) or Beta(α, α) with α < 1, so λ has a relatively high likelihood of being close to either 0 or 1. From the perspective of sound mixing and of reducing the number of unseen samples, a λ close to 0.5 could be more reasonable, since it leads to more "evenly" mixed-up samples and the model can see both samples. Second, since samples in the evaluation set are not mixed up, mixing up all samples during training might lead to a gap between training and evaluation. Thus we set a mix-up rate to control the number of samples to mix up during training, so that the model also sees non-synthetic samples. As shown in Figure 5, a mix-up rate of 0.5 results in 95% of samples being seen by the model in 5 epochs. For non-mix-up samples, the data loader only needs to load one audio sample instead of two, so a low mix-up rate also reduces the data loading and pre-processing cost during training, which is non-negligible because it is almost impossible to fit the full AudioSet into memory.

We evaluate the impact of the mix-up rate and α, as shown in Tables V and VI. A larger α and a medium mix-up rate indeed lead to better classification performance. Combining them achieves 0.3108 mAP, which is better than a plain mix-up setting that achieves 0.3079 mAP. We use α = 10 and a mix-up rate of 0.5 in all subsequent experiments.

TABLE V
PERFORMANCE AS A FUNCTION OF MIX-UP RATE (TRAINING ON BALANCED SET WITH α = 10).

Mixup Rate   0        0.2      0.5      0.8      1.0
mAP          0.2818   0.3060   0.3108   0.3119   0.2928

TABLE VI
PERFORMANCE AS A FUNCTION OF α (TRAINING ON BALANCED SET WITH MIX-UP RATE = 0.5).
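Below is a minimal sketch of the mix-up branch of Algorithm 1 (lines 9-13) under the chosen setting of α = 10 and a mix-up rate of 0.5. The dataset object and tensor shapes are assumptions made for illustration; any indexable dataset returning matching-shape feature tensors and float multi-hot label vectors would work.

import numpy as np

def maybe_mixup(x_i, y_i, dataset, alpha=10.0, mixup_rate=0.5):
    """With probability mixup_rate, mix sample (x_i, y_i) with a uniformly drawn second sample.

    x_i: feature tensor (e.g., a 1056 x 128 fbank); y_i: (527,) multi-hot label vector (float).
    """
    if np.random.rand() >= mixup_rate:
        return x_i, y_i                           # use the original sample unchanged
    j = np.random.randint(len(dataset))           # second sample drawn uniformly
    x_j, y_j = dataset[j]
    lam = np.random.beta(alpha, alpha)            # alpha = 10 concentrates lambda near 0.5
    x = lam * x_i + (1.0 - lam) * x_j
    y = lam * y_i + (1.0 - lam) * y_j             # soft, mixed multi-hot targets
    return x, y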
D. Summary

We combine the balanced sampling, masking, and mix-up data augmentation strategies, as described in Algorithm 1, and summarize the contribution of each component in Table IV. It is worth mentioning that while balanced sampling alone lowers the performance, it is helpful when combined with the data augmentation strategies. By adopting balanced sampling and data augmentation, an 11.6% relative improvement and an mAP of 0.4397 are achieved for the full set experiment. We only apply data augmentation for the balanced set experiments, as that data is already roughly balanced, and obtain a 30.3% relative improvement and an mAP of 0.3108, demonstrating the effectiveness of data augmentation for small datasets. Finally, it is worth mentioning that by merely adopting ImageNet pretraining, balanced sampling, and data augmentation with a standard EfficientNet architecture, the model already outperforms the previous best system. In the following sections, we use balanced sampling (for the full AudioSet) and data augmentation as defaults for all experiments.
V. LABEL ENHANCEMENT
In this section, we explore the noisy label aspect of AudioSet: how it impacts AEC performance, and how to alleviate it. This line of research is motivated by observing the model's class-wise performance. In Figure 6, we show the class-wise average precision (AP) of the model trained with the full set. From the figure it is immediately apparent that the AP of each class differs greatly, indicating that the model has a varying ability to recognize different sounds. This is not an issue specific to our model or training pipeline, but has been widely reported in prior work [5], [8], [13], [42], [43]. The ordering of class-specific performance reported by independent research also appears to be similar. For example, the "Male speech", "Bicycle", "Harmonic", "Rattle", and "Scrape" classes are among the 10 worst performing classes in [42], and they are also among the 10 worst performing classes for our model when trained with the balanced set. This consistency suggests that the issue might be due to an intrinsic problem with the data or the task. Since the class-wise AP is not strongly correlated with either the class sample count in the training set or the class annotation quality estimate released by the AudioSet authors (as shown in Table VII), it has been hypothesized that the class-wise performance variation is due in part to the difficulty in reliably tagging the different sound classes themselves [5], [43].

Fig. 6. Sorted class-wise average precision (AP) and its standard deviation for the model trained on the full set. Note that the "Speech" class has a much higher AP than the "Male Speech", "Female Speech", and "Child Speech" classes. Similarly, the "Music" class has a much higher AP than the "Happy Music" and "Sad Music" classes. "Singing" and "Song" have similar definitions but very different AP. Classes with low AP also have a larger AP variance.

TABLE VII
CORRELATION COEFFICIENTS BETWEEN CLASS-WISE AP AND CLASS SAMPLE COUNT / ANNOTATION QUALITY ESTIMATE RELEASED BY THE AUDIOSET AUTHORS.

                                      Balanced Set   Full Set
AP and Sample Count                      0.1692       0.0946
AP and Annotation Quality Estimate       0.2464       0.2629

While we agree that the poor performance of some classes could be due to particular audio events being difficult to identify, this is not true for all poor-performing classes. For example, the "Male Speech", "Female Speech", and "Child Speech" classes have APs of 0.07, 0.09, and 0.45, respectively, while the AP of the "Speech" class is 0.80. This discrepancy cannot be explained by the class difficulty hypothesis, because recognizing speaker gender from speech is a relatively easy task [44], [45], [46], and the performances of the speech classes should not be so disparate. By examining the class sample counts, we find another issue: the sample count of the "Speech" class is substantially larger than the sum of the sample counts of the "Male Speech", "Female Speech", and "Child Speech" classes. Specifically, in the balanced set, there are 5,309 audio clips with the label "Speech" but only 55, 55, and 128 audio clips with the labels "Male Speech", "Female Speech", and "Child Speech", respectively. The same thing happens in the full set (shown in Figure 4): the "Speech" class has 947,009 samples while the sum of the other three classes is 34,878. In other words, only 4.5% and 3.7% of speech samples are labeled as either male, female, or child speech in the balanced and full AudioSet, respectively. This indicates that a large portion of samples are not correctly labeled. Based on these two observations, we hypothesize that the low performance of the male, female, and child speech classes is not due to a small number of samples, or to inherent classification difficulty, but to the fact that they have only a small fraction of correctly labeled data, which ultimately confuses the model. We refer to this phenomenon as a Type I error.

We also find that there are substantial numbers of samples labeled with sub-classes but not with the corresponding parent class defined by the AudioSet ontology. For example, there are 40 and 3,201 audio clips labeled as either "Male Speech", "Female Speech", or "Child Speech", but not labeled as "Speech", in the balanced and full AudioSet, respectively. We refer to this phenomenon as a Type II error.

We formalize the two types of error as follows:
1) Type I error: an audio clip is labeled with a parent class, but not also labeled with a child class when it does in fact contain the audio event of the child class.
2) Type II error: an audio clip is labeled with a child class, but not labeled with the corresponding parent classes.

It is worth mentioning that neither type of error is included in the quality estimate released by the AudioSet authors, because the quality estimate checked 10 random audio clips of each class and verified that they actually contained the corresponding sound event. In other words, the quality estimate counts false positive annotation errors, but not false negatives. As a consequence, the quality estimates of the "Male Speech", "Female Speech", and "Child Speech" classes are 90%, 100%, and 100%, respectively, while they have obvious false negative annotation errors.

Unfortunately, false negatives are prevalent in AudioSet. Another example is the music classes (see Figures 4 and 6 for the sample counts and class-wise AP of the music classes). The reason for these types of errors lies in the AudioSet annotation pipeline.
In this pipeline, a human annotator verifies candidate labels nominated by a series of automatic methods (e.g., using metadata), and the list of candidate labels is limited to ten labels per clip. Since the automatic nomination methods are not perfect, some sound events that are present fail to be nominated, or are nominated but ranked below the top ten, thus leading to missing labels [13], [3].

As seen in the speech class example, annotation error can impact performance, but it has not received much attention. To the best of our knowledge, only a few efforts have covered the missing label issue. In [42], [47], synthetic errors are studied; however, real-world noisy labels are believed to be much harder to deal with than synthetic ones. In [13], the authors propose a loss-masking-based teacher-student model. In this section, we propose an ontology-based label enhancement method to alleviate the noisy label problem. Our approach differs from previous work in three aspects. First, we work on real-world noisy labels rather than synthetically corrupted labels. Second, we explicitly modify the labels of the training data rather than using loss masking during training; thus the enhanced label set can be used in exactly the same way as the original set (there is no need to modify the model or training pipeline), and we plan to release the enhanced label set to facilitate future research. Third, we leverage the AudioSet ontology to constrain label modification, which reduces the chance of incorrect modifications. For example, for an audio clip labeled as "Speech", we only consider adding child or parent labels in the specific "Speech" branch of the ontology.
Algorithm 2 Label Enhancement
Require: teacher model M; dataset D = {x^(i), y^(i)}, i ∈ {1, ..., N}; label ontology O
Procedure 1: Generate Label Modification Thresholds
  Input: M, D
  Output: threshold set T = {t_k}, k ∈ {1, ..., 527}
  1: for k ∈ {1, ..., 527} do
  2:   t_k = Σ_{i=1}^{N} 1{k ∈ y^(i)} M(x^(i))(k) / Σ_{i=1}^{N} 1{k ∈ y^(i)}
  return T = {t_k}
Procedure 2: Enhance the Label Set
  Input: M, D, O, T
  Output: enhanced label set {y'^(i)}, i ∈ {1, ..., N}
  3: initialize {y'^(i)} = {y^(i)}
  4: for i ∈ {1, ..., N} do
  5:   for k ∈ y^(i) do
  6:     for k_n ∈ O(k) do            ▷ parent or child classes of k
  7:       if M(x^(i))(k_n) > t_{k_n} and k_n ∉ y^(i) then
  8:         y'^(i) = y'^(i) ∪ {k_n}
  return {y'^(i)}
TABLE VIII
RESULT OF LABEL ENHANCEMENT ON THE BALANCED SET.

As shown in Algorithm 2, the proposed approach consists of the following steps. First, we train a teacher model using the full AudioSet with the original label set. Second, we set a label modification threshold for each audio event class; specifically, we set the threshold of a class to the teacher model's mean prediction score over all audio clips originally labeled with that class (lines 1-2). We then identify all samples that need to be relabeled. For each sample, we compile all child (Type I) and/or parent (Type II) labels of all original labels as the candidate set, according to the AudioSet ontology (line 6). For each label in the candidate set, if the teacher model's prediction score for the class is greater than the corresponding label modification threshold, we add it to the labels of the sample (lines 7-8). Finally, we retrain the model from scratch with the enhanced label set.

We apply the proposed label enhancement method on the balanced training set and show the results in Table VIII. Note that the model without label enhancement has an mAP of 0.3108.
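A minimal sketch of Algorithm 2 is given below, assuming the teacher's sigmoid scores for the whole training set are precomputed as an (N, 527) array, the original labels are available as a multi-hot matrix, and the ontology is represented as a mapping from each class index to its parent and child class indices. All variable names are ours and the sketch is illustrative rather than the released implementation.

import numpy as np

def enhance_labels(scores: np.ndarray, labels: np.ndarray, neighbors: dict) -> np.ndarray:
    """Ontology-constrained label enhancement (Algorithm 2).

    scores:    (N, K) teacher model sigmoid outputs
    labels:    (N, K) original multi-hot labels
    neighbors: class index -> list of parent/child class indices in the AudioSet ontology
    """
    n_pos = np.maximum(labels.sum(axis=0), 1)
    # Procedure 1: per-class threshold = mean teacher score over clips originally labeled with that class
    thresholds = (scores * labels).sum(axis=0) / n_pos

    enhanced = labels.copy()
    # Procedure 2: for each clip, add missing parent/child labels the teacher is confident about
    for i in range(labels.shape[0]):
        for k in np.flatnonzero(labels[i]):
            for kn in neighbors.get(int(k), []):
                if labels[i, kn] == 0 and scores[i, kn] > thresholds[kn]:
                    enhanced[i, kn] = 1
    return enhanced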
VI. WEIGHT AVERAGING AND ENSEMBLE

A. Model Weight Averaging
In this section, we explore improving model performance by aggregating multiple models. The first strategy we explore is weight averaging [48]. Weight averaging performs an equal average of the weights traversed by the optimizer, which makes the solution fall in the center, rather than at the boundary, of a wide, flat, low-loss region, and thus leads to better generalization than conventional training. Empirically, weight averaging has been shown to improve the performance of various models such as VGG [49], ResNets [17], and DenseNets [50] on a variety of tasks [48], [51]. While weight averaging is usually applied with a high constant or cyclical learning rate, we find it is helpful even when used together with a learning rate decay strategy.

In this work, we simply average all weights of the model checkpoints at multiple epochs. For both the balanced set and full set experiments, we start averaging the model checkpoints of every epoch after the learning rate is decreased to 1/4 of the initial learning rate (i.e., the 41st and the 16th epochs, respectively) until the end of training. As shown in Table IX, weight averaging leads to a 0.9% improvement for both the balanced set and full set experiments. We further find the improvement is not sensitive to exactly when weight averaging begins: as shown in Figure 7, starting averaging at any epoch after the 16th epoch (until the last epoch) can outperform any single checkpoint model for the full set experiment.

In summary, weight averaging is easy to implement, adds no additional cost to training and inference, and can consistently improve model performance. By applying weight averaging to our models, we get our best single model, with an mAP of 0.3192 and 0.4435 for the balanced and full AudioSet experiments, respectively.

TABLE IX
PERFORMANCE IMPACT ON mAP DUE TO WEIGHT AVERAGING.

                            Balanced Set   Full Set
Without Weight Averaging       0.3162       0.4397
With Weight Averaging          0.3192       0.4435

Fig. 7. Weight averaging and prediction (checkpoint) averaging performance as a function of the epoch at which averaging starts (full set experiment). For weight averaging, starting averaging at any epoch after the 16th epoch can outperform any single checkpoint. For prediction averaging, starting averaging from the first epoch leads to the highest mAP, indicating that averaging all checkpoints is optimal, while starting averaging at any epoch can outperform any single checkpoint. However, averaging the predictions of the last few checkpoints barely outperforms single checkpoints, indicating the importance of diversity.
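A minimal sketch of the checkpoint weight averaging described in this subsection is shown below, assuming the per-epoch checkpoints have been saved as state_dict files; the averaged weights are loaded back into a single model, so no extra cost is incurred at inference time. File names and the epoch range in the usage comment are illustrative.

import torch

def average_checkpoints(paths):
    """Equal-weight average of model state_dicts saved at the selected epochs."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location='cpu')
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# usage (illustrative): model.load_state_dict(
#     average_checkpoints([f'ckpt_epoch_{e}.pth' for e in range(16, 31)]))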
B. Ensemble

Finally, we explore a series of ensemble strategies. The goal of ensemble methods is to combine the predictions of several models to improve generalizability and robustness over any single model. Ensembles of AEC models have previously been studied in [14], [52], [53], [54], [15], [12], [4], but typically only one strategy is covered in each of these efforts. In this work, we use a simple voting algorithm, but compare multiple ways of building the model committee. The reason we do not use iterative ensemble methods (e.g., boosting) is that AudioSet training is expensive, making iterative training computationally unreasonable for this work.
TABLE X
RESULTS OF MODEL ENSEMBLE. FOR EACH EXPERIMENT, WE SHOW THE NUMBER OF MODELS IN THE COMMITTEE (#MODELS), THE AVERAGE mAP OF THE MODELS IN THE COMMITTEE (AVG mAP), THE mAP OF THE BEST MODEL IN THE COMMITTEE (BEST mAP), AND THE mAP OF THE ENSEMBLE MODEL (ENSEMBLE mAP). NOTE THAT FOR ALL EXPERIMENTS, THE ENSEMBLE mAP IS HIGHER THAN THE BEST mAP.

                 #Models   Avg mAP   Best mAP   Ensemble mAP
Full-pretrain       2       0.3831    0.3939       0.4006
Full-augment        4       0.4080    0.4396       0.4578
Full-label          4       0.4397    0.4400       0.4653
Full-top5           5       0.4396    0.4405       0.4690
Full-all           10       0.4201    0.4405       0.474
1) Checkpoint Averaging:
The first strategy investigated is checkpoint averaging, whereby the outputs of checkpoint models at multiple epochs are averaged together. The implementation is similar to weight averaging, but is conducted in the model space rather than the weight space. Since we conduct random sampling with replacement during full set training, the combination with checkpoint averaging is the same as bootstrap aggregating (i.e., bagging) [36]. In our experiments, we average the outputs of all checkpoint models (i.e., 60 and 30 checkpoint models for the balanced set and full set, respectively). As shown in the upper part of Table X, this approach works well. Specifically, the ensembled model noticeably outperforms the best checkpoint model in the committee. In addition, as shown in Figure 7, starting averaging from the first epoch leads to the highest mAP, indicating that averaging all checkpoints is optimal. Averaging from any epoch can outperform the best single checkpoint model, which can be a simple alternative. However, this approach greatly increases the computational overhead of inference, which makes it less practical in deployment.
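The voting scheme used throughout this section is simple prediction averaging: the per-class sigmoid scores of the committee members are averaged for each clip. A minimal sketch is shown below; model loading and data iteration are assumed to happen elsewhere.

import torch

@torch.no_grad()
def ensemble_predict(models, fbank_batch):
    """Average the per-class sigmoid scores of all committee members for one batch."""
    preds = [model(fbank_batch) for model in models]     # each: (B, 527) scores in [0, 1]
    return torch.stack(preds, dim=0).mean(dim=0)          # (B, 527) ensemble prediction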
2) Averaging Models Trained with Different Random Seeds:
Previous work suggests that ensembles generalize better when their members form a diverse and accurate set [55]. As shown in Figure 7, starting averaging the checkpoint predictions from the last few epochs can only slightly outperform the best single checkpoint model, even though these checkpoint models are quite accurate, indicating the importance of diversity.
Therefore, we run the experiment three times with exactly the same settings but different random seeds, and average the outputs of the last checkpoint model of each run. As shown in the middle part of Table X, this approach leads to an even larger improvement than checkpoint averaging with only three models in the committee. Therefore, averaging models trained with different random seeds, while increasing the training cost (due to the repeated runs), is more practical for deployment and offers better performance.
3) Averaging Models Trained with Different Settings:
Finally, we explore averaging more models with greater diversity. Specifically, we ensemble models trained with all the different settings tested in this paper, including whether pretraining is used (pretrain), different mix-up rates (mix-up rate), different mix-up α values (mix-up-α), different augmentation settings (augment), and different label enhancement strategies (label). As shown in the lower part of Table X, no matter how the model committee is built, the ensemble always improves the performance and outperforms the best model in the committee. In the literature, diversity is usually introduced with an intuitive motivation. For example, in [14], the authors ensemble models that use different-scale inputs because they believe the optimal input scale varies with the target audio events, and the ensemble allows the model to extract relevant information from inputs with various scales. According to our experimental results, however, the source of the diversity seems to be less important, i.e., diversity caused by any factor is helpful for an ensemble. In addition, we find that the performance of the ensemble model is positively correlated with the accuracy of the models in the committee as well as with the number of models. For both the balanced set and full set experiments, our best model is achieved when all available models form an ensemble.

VII. DISCUSSION AND CONCLUSION
In this paper, we describe several techniques that improve the performance of a CNN-based neural model for AEC. First, we show that an ImageNet-pretrained CNN can noticeably improve performance; while this is straightforward to implement for CNN-based models, it has seldom been used in AEC research. Second, due to the imbalance in sound class samples in AudioSet, we describe several data balancing and augmentation strategies that alleviate the data imbalance issue and help improve performance. We argue that balanced sampling and data augmentation should be a standard component of AudioSet modeling. Third, by observing variation in class-specific performance, we identified a missing label issue with AudioSet and proposed a label enhancement method that shows improvement on the balanced training set. The enhanced label set can be used in the same way as the original label set in future research. We were not able to observe a performance improvement by enhancing the full set labels, possibly due to similar missing labels in the evaluation set. Due to its impact on performance, we believe addressing the noisy label issue is an important research topic for AEC. Finally, we describe weight averaging and ensemble strategies that are both simple and effective for AEC.

By combining all these training techniques, we are able to improve the performance of a standard EfficientNet model substantially. Even the model trained with only the balanced training set (roughly 1% of the full set) outperforms our baseline and many previous models trained with the full set (see Table XI).
TABLE XI
COMPARISON WITH PREVIOUS METHODS (UPPER: BALANCED AUDIOSET EXPERIMENTS, LOWER: FULL AUDIOSET EXPERIMENTS).

                            #Params   mAP      AUC      d'
Wu-minimal [56], 2018        2.6M     -        0.916    1.950
Kumar [57], 2018             -        0.213    0.927    2.056
Wu-best [56], 2018           56M      -        0.927    2.056
Kong [8], 2019               -        0.274    0.949    2.316
PANNs [5], 2020              81M      0.278    0.905    1.853
Our Baseline                 15M      0.1570   0.9108   1.903
Proposed Single Model        15M      0.3192

AudioSet Baseline [3]        -        0.314    0.959    2.452
Kong [6], 2018               -        0.327    0.965    2.558
Yu [7], 2018                 -        0.360    0.970    2.660
TALNet [58], 2019            -        0.362    0.965    2.554
Kong [8], 2019               -        0.369    0.969    2.639
DeepRes [4], 2019            26M      0.392    0.971    2.682
PANNs [5], 2020              81M      0.439    0.973    2.725
Our Baseline                 15M      0.3723   0.9706   2.672
Proposed Single Model        15M      0.4435

Fig. 8. The learning curve of our experiments. Each experiment is run three times, and the standard deviation is shown as the shaded region.

We show the learning curve of our best single model (without weight averaging) in Figure 8. For both the balanced set and full set experiments, we repeat the training process three times with different random seeds and show the standard deviation in the plot. As we can see, training converges, and the performance of the model barely varies with the random seed, i.e., the three runs achieve almost the same result. Finally, we show the AUC and d-prime of our models and compare them with previous efforts in Table XI. The proposed model outperforms previous models on all evaluation metrics.

The work in this paper can serve as a recipe for AudioSet training. Most of the proposed methods are model agnostic and can be combined with various model architectures and attention modules. As we showed in the paper, the same model can perform much better when it is trained with appropriate techniques. We hope this work can facilitate future AEC research by documenting a set of strong and useful training techniques.
REFERENCES

[1] K. J. Piczak, "ESC: Dataset for environmental sound classification," in ACM Multimedia, 2015.
[2] P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley, "CHiME-Home: A dataset for sound source recognition in a domestic environment," in WASPAA, 2015.
[3] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in ICASSP, 2017.
[4] L. Ford, H. Tang, F. Grondin, and J. R. Glass, "A deep residual network for large-scale acoustic scene analysis," in Interspeech, 2019.
[5] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM TASLP, vol. 28, pp. 2880–2894, 2020.
[6] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio set classification with attention model: A probabilistic perspective," in ICASSP, 2018.
[7] C. Yu, K. S. Barsim, Q. Kong, and B. Yang, "Multi-level attention model for weakly supervised audio classification," in DCASE Workshop, 2018.
[8] Q. Kong, C. Yu, Y. Xu, T. Iqbal, W. Wang, and M. D. Plumbley, "Weakly labelled AudioSet tagging with attention neural networks," IEEE/ACM TASLP, vol. 27, no. 11, pp. 1791–1802, 2019.
[9] S. Chen, J. Chen, Q. Jin, and A. Hauptmann, "Class-aware self-attention for audio event recognition," in ICMR, 2018.
[10] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in ICML, 2019.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[12] K. Palanisamy, D. Singhania, and A. Yao, "Rethinking CNN models for audio classification," arXiv preprint arXiv:2007.11154, 2020.
[13] E. Fonseca, S. Hershey, M. Plakal, D. P. Ellis, A. Jansen, and R. C. Moore, "Addressing missing labels in large-scale sound event recognition using a teacher-student framework with loss masking," IEEE Signal Processing Letters, vol. 27, pp. 1235–1239, 2020.
[14] D. Lee, S. Lee, Y. Han, and K. Lee, "Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input," in DCASE Workshop, 2017.
[15] P. Lopez-Meyer, J. A. del Hoyo Ontiveros, G. Stemmer, L. Nachman, and J. Huang, "Ensemble of convolutional neural networks for the DCASE 2020 acoustic scene classification challenge," in DCASE Workshop, 2020.
[16] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in CVPR, 2018.
[19] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," in CVPR, 2019.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019.
[21] K. He, R. Girshick, and P. Dollár, "Rethinking ImageNet pre-training," in ICCV, 2019.
[22] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," in Interspeech, 2019.
[23] A. Saeed, D. Grangier, and N. Zeghidour, "Contrastive learning of general-purpose audio representations," arXiv preprint arXiv:2010.10915, 2020.
[24] J. Shor, A. Jansen, R. Maor, O. Lang, O. Tuval, F. d. C. Quitry, M. Tagliasacchi, I. Shavitt, D. Emanuel, and Y. Haviv, "Towards learning a universal non-semantic representation of speech," in Interspeech, 2020.
[25] M. Tagliasacchi, B. Gfeller, F. de Chaumont Quitry, and D. Roblek, "Pre-training audio representations with self-supervision," IEEE Signal Processing Letters, vol. 27, pp. 600–604, 2020.
[26] A. Jansen, M. Plakal, R. Pandya, D. P. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, "Unsupervised learning of semantic audio representations," in ICASSP, 2018.
[27] L. Wang, K. Kawakami, and A. van den Oord, "Contrastive predictive coding of audio with an adversary," in Interspeech, 2020.
[28] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in ACM Multimedia, 2014.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in ICCV, 2015.
[30] G. Gwardys and D. M. Grzywczak, "Deep image features in music information retrieval," IJET, vol. 60, no. 4, pp. 321–326, 2014.
[31] A. Guzhov, F. Raue, J. Hees, and A. Dengel, "ESResNet: Environmental sound classification based on visual domain models," in ICPR, 2020.
[32] S. Adapa, "Urban sound tagging using convolutional neural networks," in DCASE Workshop, 2019.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[34] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE TKDE, vol. 21, no. 9, pp. 1263–1284, 2009.
[35] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. CRC Press, 1994.
[36] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[37] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Interspeech, 2019.
[38] Y. Tokozume, Y. Ushiku, and T. Harada, "Between-class learning for image classification," in CVPR, 2018.
[39] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in ICLR, 2018.
[40] Y. Tokozume, Y. Ushiku, and T. Harada, "Learning from between-class examples for deep sound recognition," in ICLR, 2018.
[41] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," in DCASE Workshop, 2017.
[42] A. Shah, A. Kumar, A. G. Hauptmann, and B. Raj, "A closer look at weak label learning for audio events," arXiv preprint arXiv:1804.09288, 2018.
[43] L. H. Ford, "Large-scale acoustic scene analysis with deep residual networks," Master's thesis, Massachusetts Institute of Technology, 2019.
[44] K. Wu and D. G. Childers, "Gender recognition from speech. Part I: Coarse analysis," JASA, vol. 90, no. 4, pp. 1828–1840, 1991.
[45] D. G. Childers and K. Wu, "Gender recognition from speech. Part II: Fine analysis," JASA, vol. 90, no. 4, pp. 1841–1856, 1991.
[46] Z.-Q. Wang and I. Tashev, "Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks," in ICASSP, 2017.
[47] M. Meire, P. Karsmakers, and L. Vuegen, "The impact of missing labels and overlapping sound events on multi-label multi-instance learning for sound event classification," in DCASE Workshop, 2019.
[48] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, "Averaging weights leads to wider optima and better generalization," in UAI, 2018.
[49] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[50] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017.
[51] B. Athiwaratkun, M. Finzi, P. Izmailov, and A. G. Wilson, "There are many consistent explanations of unlabeled data: Why you should average," in ICLR, 2019.
[52] Z. Shi, L. Liu, H. Lin, R. Liu, and A. Shi, "HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods," in DCASE Workshop, 2019.
[53] Y. Guo, M. Xu, Z. Wu, J. Wu, and B. Su, "Multi-scale convolutional recurrent neural network with ensemble method for weakly labeled sound event detection," in ACIIW, 2019.
[54] W. Lim, S. Suh, S. Park, and Y. Jeong, "Sound event detection in domestic environments using ensemble of convolutional recurrent neural networks," in DCASE Workshop, 2019.
[55] A. Chandra, H. Chen, and X. Yao, "Trade-off between diversity and accuracy in ensemble generation," in Multi-Objective Machine Learning, 2006, pp. 429–464.
[56] Y. Wu and T. Lee, "Reducing model complexity for DNN based large-scale audio classification," in ICASSP, 2018.
[57] A. Kumar, M. Khadkevich, and C. Fügen, "Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes," in ICASSP, 2018.
[58] Y. Wang, J. Li, and F. Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," in ICASSP, 2019.