A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling
Chieh-Chi Kao, Bowen Shi, Ming Sun, Chao Wang
Alexa Speech, Amazon.com Inc.; Toyota Technological Institute at Chicago
[email protected], [email protected], {mingsun,wngcha}@amazon.com

Abstract
This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a well-known framework using GAP to localize visual objects given image-level labels. While most of the previous works on weakly supervised AED used recurrent layers with an attention-based mechanism to localize acoustic events, the proposed network directly localizes events using the feature map extracted by DenseNet without any recurrent layers. In the audio tagging task of DCASE 2017, our method significantly outperforms the state-of-the-art method in F1 score by 5.3% on the dev set and 6.0% on the eval set in terms of absolute values. For the weakly supervised AED task in DCASE 2018, our model outperforms the state-of-the-art method in event-based F1 by 8.1% on the dev set and 0.5% on the eval set in terms of absolute values, by using data augmentation and tri-training to leverage unlabeled data.
1. Introduction
Audio tagging is the task of detecting the occurrence of certain events based on acoustic signals. Recent releases of public datasets [1, 2, 3] have significantly stimulated research in this field. Hershey et al. [4] benchmarked different convolutional neural network (CNN) architectures for audio tagging using AudioSet, a dataset consisting of over 2 million audio clips from YouTube and an ontology of 527 classes. DCASE 2017 Task 4 subtask A [2] focuses on audio tagging for the application of smart cars. The winner of this challenge used a gated CNN with learnable gated linear units (GLU) to replace the ReLU activation after each convolutional layer [5]. Yan et al. [6] further improved the above-mentioned architecture by inserting a feature selection structure after each GLU to exploit channel relationships.

Besides classifying audio recordings into different classes, AED requires predicting the onset and offset times of sound events. DCASE 2017 Task 2 [2] provides datasets with strong labels for detecting rare sound events (baby crying, glass breaking, and gunshot) within synthesized 30-second clips. Most of the state-of-the-art AED models are based on convolutional recurrent neural networks (CRNN). The winner of this challenge [7] used a 1D CNN with 2 long short-term memory (LSTM) layers to generate frame-level predictions. Kao et al. [8] used a region-based CRNN for AED, which does not require post-processing to convert predictions from the frame level to the event level. Shen et al. [9] used a temporal and a frequential attention model to improve the performance of CRNN. Zhang et al. [10] gathered information at multiple resolutions to generate a time-frequency attention mask, which tells the model where to focus along both the time and frequency axes.

Training such AED models in a fully supervised manner can be very costly, since annotating strong labels (onset/offset times) is labor-intensive and time-consuming. Weakly supervised AED (also called multiple instance learning) is an efficient way to train AED models without using strong labels. It uses weak labels (utterance-level labels) to train a model, where the trained model is still able to predict strong labels (frame-level labels) at inference time. DCASE 2017 Task 4 subtask B [2] provides datasets for weakly supervised AED in driving environments. The winner of the DCASE 2017 challenge used an ensemble of CNNs with various lengths of analysis windows for multiple input scaling [11]. He et al. [12] proposed a hierarchical pooling structure to improve the performance of CRNN. The effect of different pooling/attention methods on AED and audio tagging has also been analyzed in previous works [13, 14, 15]. DCASE 2018 Task 4 [16] further extends weakly supervised AED to domestic environments by incorporating in-domain and out-of-domain unlabeled samples. Lu [17] proposed a mean-teacher model with a context-gating CRNN to utilize unlabeled in-domain data. Liu [18] used a tagging model with pre-set thresholds to mine unlabeled data with high confidence.

Although a GAP layer has been used with a VGG-based feature extractor for both tagging and localization [19, 20], our experimental results on the DCASE 2017 Task 4 dataset show that DenseNet [21] works better as a feature extractor. On the other hand, DenseNet has been used in AED-related tasks, but not with GAP for both tagging and localization. Zhe et al. [22] chunked the input into small segments and fed each segment to DenseNet to generate frame-wise predictions for AED. Jeong et al.
[23] used DenseNet for audio tagging but not for localization. This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised AED. It consists of a modified DenseNet [21] as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. We tested our method on DCASE 2017 Task 4 subtask A for audio tagging, and the proposed method significantly outperforms the state-of-the-art method [6]. We also tested our system for weakly supervised AED in driving environments (DCASE 2017 Task 4 subtask B) and domestic environments (DCASE 2018 Task 4). Our method outperforms the state-of-the-art work [24] of DCASE 2018 Task 4 by using tri-training [25, 26] to leverage unlabeled data.
2. Proposed Method
The proposed network consists of a modified DenseNet [21] as a feature extractor, and a GAP layer for predicting frame-level labels at inference time. In order to generate strong labels with finer time resolution at inference, we modified DenseNet to have fewer pooling operations, which maintains the time resolution of the extracted feature map.
Figure 1: System overview of the proposed architecture for weakly supervised AED.
The exact network configurations we used are shown in Table 1. We used DenseNet-63 on DCASE 2017 Task 4 and DenseNet-120 on DCASE 2018 Task 4; these architectures were chosen based on our experimental results on the dev set.

Given weak labels (i.e., utterance-level labels), the network can be trained under a multi-class classification setting. Since multiple events of different classes can happen within the same utterance, we use sigmoid as the activation function with binary cross-entropy for each class. We use the method proposed by Zhou et al. [27] to generate class activation maps (CAM) for predicting strong labels at inference time. The system overview is shown in Fig. 1. Given an input utterance, a high-level feature map $F$ ($T \times N \times K$) can be extracted by DenseNet (i.e., the input to the GAP layer), where $T$, $N$, $K$ represent the dimensions in time, feature, and channel. For each channel $k$, the GAP layer generates a response $G_k$, which is the average of all features in channel $k$. These responses are further fed into a dense layer to predict the classification probability. For a given class $c$, the input to the sigmoid is $S_c = \sum_k w_k^c G_k$, where $w_k^c$ is the weight in the final dense layer corresponding to class $c$ for channel $k$. The utterance-level prediction for class $c$ is $y_c = \mathrm{sigmoid}(S_c)$. $w_k^c$ controls the contribution of a given channel $k$ to class $c$. The CAM for class $c$ is defined as:

$$M_c = \sum_k w_k^c F_k, \qquad (1)$$

where $F_k$ is channel $k$ of the high-level feature map $F$.

If a clip has utterance-level probability ($y_c$) greater than the utterance-level threshold ($th_c^u$, where $u$ denotes utterance) at inference time, it indicates the occurrence of target class $c$, and we use the CAM to predict strong labels. We first convert the 2D CAM ($T \times N$) to a 1D sequential signal (length $T$) by taking the maximum value across the feature axis. Strong labels of class $c$ are then predicted by binary thresholding on the sequential signal with a frame-level threshold ($th_c^f$). Note that the time resolution of the sequential signal is not the same as one frame of the input feature to the network (10 ms), due to the pooling operations in the network. Both the utterance-level and frame-level thresholds are set by optimizing the F1 score of weakly supervised AED on the development set.

Layers           DenseNet-63 (for DCASE 2017)    DenseNet-120 (for DCASE 2018)
Convolution      7 × 7 conv                      7 × 7 conv
Dense Block (1)  [1 × 1 conv, 3 × 3 conv] ×      [1 × 1 conv, 3 × 3 conv] ×
Transition (1)   1 × 1 conv                      1 × 1 conv
Dense Block (2)  [1 × 1 conv, 3 × 3 conv] ×      [1 × 1 conv, 3 × 3 conv] ×
Transition (2)   1 × 1 conv                      1 × 1 conv
Dense Block (3)  [1 × 1 conv, 3 × 3 conv] ×      [1 × 1 conv, 3 × 3 conv] ×
Transition (3)   1 × 1 conv                      1 × 1 conv
Dense Block (4)  [1 × 1 conv, 3 × 3 conv] ×      [1 × 1 conv, 3 × 3 conv] ×
GAP              global avg. pooling             global avg. pooling
Classification   17D dense, sigmoid              10D dense, sigmoid

Table 1:
DenseNet architectures for audio tagging and weakly supervised acoustic event detection. Note that each "conv" layer in the dense blocks / transition layers corresponds to the sequence BN-ReLU-Conv. We set the growth rate to 32 as proposed in the original DenseNet [21]. Fewer pooling operations are used compared to the original DenseNet in order to have finer resolution in time.
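To make the GAP-plus-CAM inference path described above concrete, the following is a minimal NumPy sketch. It assumes a high-level feature map F of shape (T, N, K) produced by the DenseNet feature extractor and the final dense-layer weights W of shape (C, K); the placeholder threshold values, the bias-free dense layer, and the toy tensor sizes are illustrative assumptions, not the tuned values from our experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tag_and_localize(F, W, th_utt=0.5, th_frame=0.5):
    """Utterance-level tagging and CAM-based strong-label prediction.

    F: high-level feature map from the feature extractor, shape (T, N, K)
    W: weights of the final dense layer, shape (C, K); bias omitted for brevity
    Returns per-class probabilities (C,) and a binary frame mask (C, T).
    """
    T, N, K = F.shape
    # Global average pooling: one response G_k per channel.
    G = F.reshape(T * N, K).mean(axis=0)                 # (K,)
    # Utterance-level prediction y_c = sigmoid(sum_k w_ck * G_k).
    y = sigmoid(W @ G)                                   # (C,)
    # Class activation maps M_c = sum_k w_ck * F_k, shape (C, T, N).
    cam = np.einsum('ck,tnk->ctn', W, F)
    # Collapse the feature axis by taking the maximum -> (C, T).
    signal = cam.max(axis=2)
    # Frame-level labels only for classes tagged at the utterance level.
    # In the paper both thresholds are tuned on the dev set; 0.5 is a placeholder.
    mask = (signal > th_frame) & (y > th_utt)[:, None]
    return y, mask

# Toy usage with random numbers standing in for a real feature map.
rng = np.random.default_rng(0)
F = rng.normal(size=(125, 4, 256))   # T=125, N=4, K=256 (illustrative sizes)
W = rng.normal(size=(17, 256))       # 17 classes as in DCASE 2017 Task 4
probs, strong_labels = tag_and_localize(F, W)
print(probs.shape, strong_labels.shape)
```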
3. Experimental Setups
We tested our method on DCASE 2017 Task 4 [2] and DCASE 2018 Task 4 [16]. Both datasets are subsets of AudioSet [1]. The audio clips are mono-channel and sampled at 44.1 kHz, with a maximum duration of 10 seconds. We decompose each clip into a sequence of 25 ms frames with a 10 ms shift. 64-dimensional log filter bank energies (LFBEs) are calculated for each frame, and we aggregate the LFBEs from all frames to generate the input spectrogram. Note that we train all models in this work from scratch without any pre-training on external datasets, which complies with the task rules of the DCASE Challenge.
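As an illustration of this front end, the sketch below computes 64-dimensional LFBEs with 25 ms frames and a 10 ms shift using librosa; the FFT size and the default mel frequency range are assumptions, since they are not specified in the text.

```python
import librosa
import numpy as np

def extract_lfbe(path, sr=44100, n_mels=64):
    """64-dim log filter bank energies, 25 ms frames with a 10 ms shift."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    win_length = int(0.025 * sr)   # 25 ms -> 1102 samples at 44.1 kHz
    hop_length = int(0.010 * sr)   # 10 ms -> 441 samples
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048,    # n_fft is an assumption
        win_length=win_length, hop_length=hop_length, n_mels=n_mels)
    lfbe = np.log(mel + 1e-10)     # log energies; small floor for numerical stability
    return lfbe.T                  # (num_frames, 64), the network input spectrogram

# spectrogram = extract_lfbe("clip.wav")
```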
3.1. DCASE 2017 Task 4

There are two subtasks in this challenge: (A) audio tagging, and (B) weakly supervised AED. It contains 17 classes of warning and vehicle sounds related to driving environments. The training set has only weak labels denoting the presence of events, while strong labels with timestamps are provided in the dev/eval sets for evaluation. There are 51,172, 488, and 1,103 samples in the train, dev, and eval sets, respectively. We use the same metrics as the challenge to evaluate our method: for audio tagging, classification F1 score is used; for weakly supervised AED, we use segment-based F1 score [28], with the segment length set to 1 second.

We train the DenseNet-63 model shown in Table 1 with the Adam optimizer (adaptive moment estimation), with the initial learning rate set to 0.01. Training is stopped when the classification F1 score on the dev set has stopped improving for 20 epochs. We then finetune the model for 10 epochs with the learning rate decreased to 0.001. The minibatch size is set to 200. For the DCASE 2017 results reported in this paper, we use an ensemble of 5 models obtained by averaging their output probabilities. These 5 models are trained using the same hyper-parameters; the only difference between them is the randomness in weight initialization and in data shuffling during training.
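The two-stage training schedule above can be sketched in PyTorch as follows; here model, train_loader, and dev_f1 are hypothetical placeholders for the DenseNet-63 network, a loader yielding minibatches of 200 LFBE spectrograms with multi-hot labels, and a helper computing classification F1 on the dev set. Restoring the best weights before fine-tuning is our assumption, not a detail stated in the text.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, dev_f1, patience=20):
    # Model outputs logits; the sigmoid is folded into the loss (sigmoid + binary cross-entropy).
    criterion = torch.nn.BCEWithLogitsLoss()
    best_f1 = 0.0
    best_state = copy.deepcopy(model.state_dict())
    bad_epochs = 0

    def run_stage(optimizer, max_epochs, early_stop):
        nonlocal best_f1, best_state, bad_epochs
        bad_epochs = 0
        for _ in range(max_epochs):
            model.train()
            for x, y in train_loader:          # x: LFBE batch, y: multi-hot labels
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()
            f1 = dev_f1(model)                 # classification F1 on the dev set
            if f1 > best_f1:
                best_f1, bad_epochs = f1, 0
                best_state = copy.deepcopy(model.state_dict())
            else:
                bad_epochs += 1
            if early_stop and bad_epochs >= patience:
                break

    # Stage 1: Adam, lr 0.01, stop once dev F1 has not improved for 20 epochs.
    run_stage(torch.optim.Adam(model.parameters(), lr=0.01), max_epochs=1000, early_stop=True)
    # Stage 2: restore the best weights (assumption) and fine-tune 10 epochs at lr 0.001.
    model.load_state_dict(best_state)
    run_stage(torch.optim.Adam(model.parameters(), lr=0.001), max_epochs=10, early_stop=False)
    return model
```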
3.2. DCASE 2018 Task 4

Task 4 of the DCASE 2018 challenge consists of detecting onset/offset timestamps of sound events using both weakly labeled and unlabeled audio data. It contains 10 classes of audio events in domestic environments (e.g., Speech, Dog, Blender, etc.). Three different sets of training data are provided: weakly labeled data, in-domain unlabeled data, and out-of-domain unlabeled data. The weakly labeled training set contains 1,578 clips with 2,244 event occurrences and only utterance-level labels. The in-domain unlabeled training set contains 14,412 clips whose per-class distribution is close to that of the labeled set. In addition, the out-of-domain unlabeled training set is composed of 39,999 clips from classes not considered in this task. Note that event-based F1 is chosen by the challenge organizers as the evaluation metric, which is different from the segment-based F1 used in DCASE 2017 Task 4B.

To utilize the unlabeled in-domain data, we use the tri-training method proposed for audio tagging tasks in [26]. The idea of tri-training is similar to self-training, which takes advantage of a model trained with labeled data only to assign pseudo-labels to unlabeled data. Instead of relying on one model for pseudo-labeling, we train three independent models. To update one of those three models, an unlabeled clip is assigned a pseudo-label and added to the training set if the other two models predict the same label with high confidence on the clip. Generating pseudo-labels using the consensus of multiple models mitigates mistakes made by any specific model. One caveat of tri-training is that the models should differ, so that the predictions of the individual models complement each other. Although the training set is bootstrapped three times for training the three models in [26], we use the same training set while initializing the models with different random seeds rather than bootstrapping. We find this practice leads to better performance, which might be due to the limited amount of labeled data.

While predicting pseudo-labels for unlabeled data, we only infer utterance-level labels. The model is trained with the Adam optimizer with an initial learning rate of 0.001 for 30 epochs, and the learning rate is reduced by half every 10 epochs.

Classification F1               Dev (%)   Eval (%)
Xu et al. [5] (ranked 1st)      57.7      55.6
Lee et al. [11] (ranked 2nd)    57.0      52.6
Iqbal et al. [29]               N/A       58.6
Wang et al. [14]                53.8      N/A
Yan et al. [6]                  59.5      60.1
Ours
Table 2:
Results on DCASE 2017 task 4A: audio tagging for smart cars.
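A minimal sketch of the consensus rule used in our tri-training setup (Sec. 3.2): an unlabeled clip is added to the training pool of one model only when the other two models agree on its label with high confidence. The confidence threshold and the multi-label agreement criterion below (per-class thresholded predictions must match exactly, with all probabilities far from the decision boundary) are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def consensus_pseudo_labels(probs_a, probs_b, tag_threshold=0.5, confidence=0.9):
    """Select pseudo-labeled clips for the third model from the predictions of its two peers.

    probs_a, probs_b: arrays of shape (num_clips, num_classes) holding the
    utterance-level sigmoid outputs of the two peer models.
    Returns indices of accepted clips and their multi-hot pseudo-labels.
    """
    tags_a = probs_a >= tag_threshold
    tags_b = probs_b >= tag_threshold
    agree = np.all(tags_a == tags_b, axis=1)          # identical predicted label sets
    # "High confidence": every class probability is far from the 0.5 boundary.
    conf_a = np.all(np.maximum(probs_a, 1 - probs_a) >= confidence, axis=1)
    conf_b = np.all(np.maximum(probs_b, 1 - probs_b) >= confidence, axis=1)
    selected = np.where(agree & conf_a & conf_b)[0]
    return selected, tags_a[selected].astype(np.float32)
```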
Segment-based F1               Dev (%)   Eval (%)
Lee et al. [11] (ranked 1st)   47.1
Xu et al. [5] (ranked 2nd)     49.7      51.8
Iqbal et al. [29]              N/A       46.3
Wang et al. [14]               46.8      N/A
Yan et al. [6]
Table 3: Results on DCASE 2017 task 4B: weakly supervised AED for smart cars.

We chose the best weights out of the 30 epochs based on classification F1 on the dev set. The batch size is set to 48 due to GPU memory constraints. We also augment the labeled data by (1) circularly shifting the audio by a random number of timesteps, and (2) randomly mixing two audio clips. When two clips are mixed, their labels are also merged. The number of labeled audio clips in the augmented dataset is thereby increased to 3,578. Only the in-domain unlabeled data are used for pseudo-labeling in tri-training. For post-processing, we apply median filtering to the output segmentation mask, and the filter size per event is tuned based on event-based F1 on the dev set.
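The two augmentation operations and the median-filter post-processing described above might look like the following sketch; the equal mixing weight and the per-class filter sizes are illustrative (in our experiments the filter sizes were tuned per event on the dev set).

```python
import numpy as np
from scipy.ndimage import median_filter

def circular_shift(waveform, rng):
    """Circularly shift the audio by a random number of samples."""
    return np.roll(waveform, rng.integers(len(waveform)))

def mix_clips(wave1, labels1, wave2, labels2):
    """Mix two clips and merge (union) their multi-hot utterance labels."""
    n = min(len(wave1), len(wave2))
    mixed = 0.5 * wave1[:n] + 0.5 * wave2[:n]   # equal-weight mix (assumption)
    merged = np.maximum(labels1, labels2)
    return mixed, merged

def smooth_mask(mask, filter_sizes):
    """Median-filter the binary segmentation mask per event class.

    mask: (num_classes, num_frames) binary array; filter_sizes: filter length per class.
    """
    return np.stack([median_filter(mask[c].astype(float), size=filter_sizes[c]) > 0.5
                     for c in range(mask.shape[0])])

# rng = np.random.default_rng(0)  # a NumPy Generator used by circular_shift
```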
4. Experimental Results
Table 2 shows the classification F1 for the audio tagging subtask of DCASE 2017 Task 4 on the development and evaluation sets. While most previous works on joint frameworks for audio tagging and weakly supervised AED use an attention mechanism (e.g., gated CNN [5], attention by capsule routing [29], region-based attention [6], etc.), our method performs best in audio tagging without any attention mechanism. The proposed method outperforms the state-of-the-art method [6] in F1 score by 5.3% on the dev set and 6.0% on the eval set. Based on these results, we argue that an attention mechanism may not be necessary for audio tagging.
Table 3 shows the segment-based F1 for the weakly supervised AED subtask of DCASE 2017 Task 4 on the development and evaluation sets. Although our method performs well on the audio tagging subtask, it does not outperform state-of-the-art methods on the weakly supervised AED subtask. We suspect that the lack of an attention mechanism may cause this performance gap in weakly supervised AED. Exploring the addition of an attention mechanism to our current model is left as future work; we plan to explore whether it can improve the performance on weakly supervised AED, and how it impacts the performance on audio tagging.

Event-based F1                Dev (%)   Eval (%)
Lu et al. [17] (ranked 1st)   25.9      32.4
Liu et al. [18] (ranked 2nd)
Table 4:
Results on DCASE 2018 task 4: weakly supervised AED in domestic environments.
Event-based F1                    Dev (%)   Eval (%)
Labeled data only                 34.9      25.8
 + data aug.                      42.0      29.5
 + data aug. & unlabeled data     44.5      33.0

Table 5: Ablation study of data augmentation methods on DCASE 2018 task 4.
DCASE 2018:
We also tested our method on DCASE 2018 Task 4, and the results are shown in Table 4. In contrast to the results on DCASE 2017 Task 4, our method outperforms the state-of-the-art method [24] in event-based F1 by 8.1% on the dev set and 0.5% on the eval set. To understand which part gives us the performance gain, we performed an ablation study on this task. As shown in Table 5, data augmentation (circular shifting and clip mixing) plays an important role, which might be because the amount of labeled training data is limited while the model architecture is relatively complicated. On top of that, tri-training provides an additional boost, which is complementary to data augmentation. For tri-training, we use an ensemble of six models, consisting of three models trained on labeled data only and three models trained on both labeled data and in-domain unlabeled data. If only labeled data are used, we use an ensemble of three models. Note that the gap between the dev and eval sets, which is also observed in [24, 30, 18], might be due to the disparity between the distributions of the two sets.
5. Ablation Study
To investigate the effect of the feature extractor, we experimented with different architectures for generating the high-level feature map. Three different types were tested: VGG [31], ResNet [32], and DenseNet [21]. We modified each architecture to have a similar number of parameters for a fair comparison. For VGG, the architecture is similar to ConvNet configuration D in [31], with only 4 blocks and 9 conv layers. For ResNet, the architecture is similar to ResNet-18 in [32], with fewer filters in each block (from [64, 128, 256, 512] to [28, 56, 112, 224]). For DenseNet, the architecture is the DenseNet-63 described in Table 1. The numbers of parameters of VGG, ResNet, and DenseNet are 2.33M, 2.71M, and 2.34M, respectively. Table 6 shows the results on the DCASE 2017 Task 4 development set. Note that all these results are based on an ensemble of 5 models, which is the same setup as described in Sec. 3.1. As shown in Table 6, DenseNet outperforms VGG and ResNet on both audio tagging (classification F1) and weakly supervised AED (segment-based F1). Based on these results, we chose DenseNet as the feature extractor throughout our experiments.

            Classification F1 (%)   Segment-based F1 (%)
VGG         63.5                    48.9
ResNet      62.4                    48.9
DenseNet
Table 6:
Ablation study of different feature extractors on the DCASE 2017 task 4 development set.
Event
Alarm/bell/ringing 205 24.4 30.2
Speech 550 42.6
Frying 171 42.7
Running water 343 15.0 17.0
Cat 173 9.5 17.7
Vacuum cleaner 167 37.4 33.6
Electric shaver 103 44.2 46.4
Table 7:
Class-wise ablation study on DCASE 2018 task 4
To disentangle the effects of data augmentation and of using unlabeled data, we performed a further class-wise ablation study (see Table 7). Most events benefit from both methods. As shown in Table 7, data augmentation helps the detection of the "dishes" and "cat" sounds the most. We note that those events are generally short and appear as foreground sounds in the original audio; mixing audio clips provides richer background noise, which helps the model disentangle the foreground sound from other sounds. The gain brought by employing unlabeled data is related to the amount of labeled data, as we do not see a large improvement for the "speech" event, which has the largest amount of labeled data. It is also potentially related to the difficulty of detecting certain events: some events are harder to detect (e.g., alarm/bell/ringing, running water), potentially due to low loudness, ambiguity of definition, and large variation, so a larger amount of training data is required to achieve high performance. As a consequence, those events generally benefit more from methods that increase the amount of data, including our semi-supervised approach and data augmentation.
6. Conclusions
This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised AED. Different from most previous works on weakly supervised AED, which use recurrent layers with an attention-based mechanism to localize acoustic events, the proposed network directly localizes events using the feature map extracted by DenseNet without any recurrent layers. In the audio tagging task of DCASE 2017 [2], our method significantly outperforms the state-of-the-art method [6] by 5.3% on the dev set and 6.0% on the eval set in F1 score. For the weakly supervised AED task in DCASE 2018 [16], our model outperforms the state-of-the-art method [24] by using data augmentation and tri-training [26] to leverage unlabeled data.

7. References

[1] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in IEEE ICASSP, 2017, pp. 776–780.
[2] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," in DCASE, 2017, pp. 85–92.
[3] E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, X. Favory, J. Pons, and X. Serra, "General-purpose tagging of Freesound audio with AudioSet labels: Task description, dataset, and baseline," in DCASE, 2018, pp. 69–73.
[4] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," in IEEE ICASSP, 2017, pp. 131–135.
[5] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, "Large-scale weakly supervised audio classification using gated convolutional neural network," in IEEE ICASSP, 2018, pp. 121–125.
[6] J. Yan, Y. Song, W. Guo, L. Dai, I. McLoughlin, and L. Chen, "A region based attention method for weakly supervised sound event detection and classification," in IEEE ICASSP, 2019, pp. 755–759.
[7] H. Lim, J. Park, and Y. Han, "Rare sound event detection using 1D convolutional recurrent neural networks," DCASE Challenge, Tech. Rep., 2017.
[8] C.-C. Kao, W. Wang, M. Sun, and C. Wang, "R-CRNN: Region-based convolutional recurrent neural network for audio event detection," in INTERSPEECH, 2018, pp. 1358–1362.
[9] Y. Shen, K. He, and W. Zhang, "Learning how to listen: A temporal-frequential attention model for sound event detection," in INTERSPEECH, 2019, pp. 2563–2567.
[10] J. Zhang, W. Ding, J. Kang, and L. He, "Multi-scale time-frequency attention for acoustic event detection," in INTERSPEECH, 2019, pp. 3855–3859.
[11] D. Lee, S. Lee, Y. Han, and K. Lee, "Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input," DCASE Challenge, Tech. Rep., 2017.
[12] K. He, Y. Shen, and W. Zhang, "Hierarchical pooling structure for weakly labeled sound event detection," in INTERSPEECH, 2019, pp. 3624–3628.
[13] W. Wang, C.-C. Kao, and C. Wang, "A simple model for detection of rare sound events," in INTERSPEECH, 2018, pp. 1344–1348.
[14] Y. Wang, J. Li, and F. Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," in IEEE ICASSP, 2019, pp. 31–35.
[15] C.-C. Kao, M. Sun, W. Wang, and C. Wang, "A comparison of pooling methods on LSTM models for rare acoustic event classification," in IEEE ICASSP, 2020, pp. 316–320.
[16] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," in DCASE, 2018, pp. 19–23.
[17] L. JiaKai, "Mean teacher convolution system for DCASE 2018 task 4," DCASE Challenge, Tech. Rep., 2018.
[18] Y. Liu, J. Yan, Y. Song, and J. Du, "USTC-NELSLIP system for DCASE 2018 challenge task 4," DCASE Challenge, Tech. Rep., 2018.
[19] A. Kumar, M. Khadkevich, and C. Fügen, "Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes," in IEEE ICASSP, 2018, pp. 326–330.
[20] A. Kumar and V. K. Ithapu, "SeCoST: Sequential co-supervision for large scale weakly labeled audio event detection," in IEEE ICASSP, 2020, pp. 666–670.
[21] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE CVPR, 2017, pp. 2261–2269.
[22] H. Zhe and L. Ying, "Fully convolutional DenseNet based polyphonic sound event detection," in International Conference on Cloud Computing, Big Data and Blockchain (ICCBB), 2018, pp. 1–6.
[23] I.-Y. Jeong and H. Lim, "Audio tagging system using densely connected convolutional networks," in DCASE, 2018.
[24] H. Dinkel and K. Yu, "Duration robust weakly supervised sound event detection," in IEEE ICASSP, 2020, pp. 311–315.
[25] Z.-H. Zhou and M. Li, "Tri-training: Exploiting unlabeled data using three classifiers," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1529–1541, Nov. 2005.
[26] B. Shi, M. Sun, C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, "Semi-supervised acoustic event detection based on tri-training," in IEEE ICASSP, 2019, pp. 750–754.
[27] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in IEEE CVPR, 2016, pp. 2921–2929.
[28] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[29] T. Iqbal, Y. Xu, Q. Kong, and W. Wang, "Capsule routing for sound event detection," in EUSIPCO, 2018, pp. 2269–2273.
[30] Q. Kong, I. Turab, X. Yong, W. Wang, and M. D. Plumbley, "DCASE 2018 challenge baseline with convolutional neural networks," DCASE Challenge, Tech. Rep., Sep. 2018.
[31] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE CVPR, 2016, pp. 770–778.