Semi-supervised Sound Event Detection using Random Augmentation and Consistency Regularization
Xiaofei Li
School of Engineering, Westlake University, Hangzhou, China
Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, China
ABSTRACT
Sound event detection is a core module for acoustic environmental analysis. Semi-supervised learning allows the training dataset to be largely scaled up without increasing the annotation budget, and has recently attracted substantial research attention. In this work, we study two advanced semi-supervised learning techniques for sound event detection. Data augmentation is important for the success of recent deep learning systems. This work studies the random augmentation of audio signals, which provides an augmentation strategy that can handle a large number of different audio transformations. In addition, consistency regularization is widely adopted in recent state-of-the-art semi-supervised learning methods; it exploits unlabelled data by constraining the predictions for different transformations of one sample to be identical to the prediction for that sample. This work finds that, for semi-supervised sound event detection, consistency regularization is an effective strategy, and the best performance is achieved when it is combined with the MeanTeacher model.
Index Terms — Semi-supervised learning, sound event detection, random augmentation, consistency regularization
1. INTRODUCTION
Sound event detection (SED) temporally locates and recognizes sound events in an audio stream, and plays a critical role in the automatic analysis of acoustic environments [1]. In recent years, the deep neural network has become the dominant technique for SED, since its powerful data representation capability naturally matches the high complexity and diversity of acoustic data. In [2, 3], a convolutional neural network (CNN) was applied to the audio spectrogram to perform sound classification, treating the spectrogram as an image. Sound classification predicts the class label of an audio clip, but not the temporal location of the event, and is thus referred to as weak prediction. [4] proposed to perform temporal sound event detection, namely providing frame-level strong predictions, using only weakly-labelled (clip-level annotated) data. DCASE (Detection and Classification of Acoustic Scenes and Events) 2017 Challenge task 4 [5] released a task similar to [4], namely SED with weakly-labelled data. To largely scale up the dataset without increasing the annotation budget, a large amount of unlabelled data has been included in training since DCASE 2018 [6], which raised the problem of semi-supervised learning, where only a portion of the training data is annotated. The winning system of DCASE 2018 [7] used a convolutional recurrent neural network (CRNN) to model both the local spectra and the temporal dynamics of the audio signal. To perform semi-supervised learning, [7] adopted the MeanTeacher model [8]. This architecture, i.e. CRNN plus MeanTeacher, and its variants are adopted in the baseline and top-performing systems of DCASE 2019 task 4 [9, 10] and of DCASE 2020 task 4 [11]. The major improvements over this architecture include applying data augmentation [10] and using a more powerful network, such as the Transformer [11].

Semi-supervised learning has recently attracted much attention in the deep learning community [12, 13, 14]. Semi-supervised learning needs to provide an artificial label for unlabelled data. Pseudo-label [15] takes the argmax of the current network prediction as the artificial label, which transforms the most probable prediction into a hard label. This principle is also shared by entropy minimization [16] and the label sharpening used in [12, 13]. Another important technique for generating artificial labels is to build a teacher network from the being-trained (student) network, for example by ensembling [17] or exponentially smoothing (MeanTeacher) [8] the networks of previous training steps. The prediction of the teacher network is taken as the artificial label for the student network. Data augmentation largely increases data variability and consistently improves system performance, and is widely adopted in recently developed semi-supervised methods [2, 3, 4, 10, 11, 12, 13, 14]. AutoAugment [18] provides an automated augmentation strategy based on reinforcement learning. RandAugment [24] is a recently proposed, effective, and easy-to-use data augmentation mechanism. It randomly selects one transformation for each sample at each training step, which allows a large number of different types of transformations to be exploited without increasing the training complexity.
Consistency regularization [19] constrains the predictions for different transformations of one sample to be identical to the prediction for that sample, which regularizes the network parameters to be more robust to data disturbance, and has become an important component of recent studies [13, 14, 19, 20].

The augmentation of image data has been intensively investigated for various computer vision tasks [18]. In previous SED methods [2, 3, 4, 10, 11], different audio transformations have been tested, with the scale of each transformation set empirically. In this work, RandAugment is studied for audio data augmentation, considering a large number of widely used audio transformations, including signal speeding, time shifting [10], time stretching [21], pitch shifting [21], dynamic range compression (DRC) [21], time/frequency masking [22] and mixup [23]. MeanTeacher is currently the most popular semi-supervised learning mechanism for SED. In this work, following the research trend of state-of-the-art semi-supervised learning, consistency regularization is studied. It is found that consistency regularization performs well for audio data: on its own it already outperforms MeanTeacher, and it can further improve the performance when combined with MeanTeacher.
2. METHOD
In this work, the multi-class sound event detection problem is considered, which means multiple events can be concurrent. In the time-frequency domain, let $x_t \in \mathbb{R}^{K}, t \in [1,T]$ denote the feature vectors of one utterance, where $T$ and $K$ denote the number of time frames and frequencies, respectively. Three types of data are used:

i) strongly-labelled data $\{x^{(s)}_{n,t} \in \mathbb{R}^{K}, t \in [1,T];\ y^{(s)}_{n,t,c} \in \{0,1\}, t \in [1,T'], c \in [1,C]\}_{n=1}^{N_s}$, where $N_s$ and $C$ denote the number of samples in one batch and the number of classes, respectively. The strong label $y^{(s)}_{n,t,c}$ is given for each time frame $t$. Note that the number of output frames $T'$ can be smaller than that of the input feature, i.e. $T$, to obtain a coarser time resolution;

ii) weakly-labelled data $\{x^{(w)}_{n,t} \in \mathbb{R}^{K}, t \in [1,T];\ y^{(w)}_{n,c} \in \{0,1\}, c \in [1,C]\}_{n=1}^{N_w}$, where only the weak label $y^{(w)}_{n,c}$ is given for the entire utterance;

iii) unlabelled data $\{x^{(u)}_{n,t} \in \mathbb{R}^{K}, t \in [1,T]\}_{n=1}^{N_u}$, where no labels are available.

For any labelled or unlabelled utterance $x_n$, let $\hat{y}^{(s)}_{n,t,c}(x_n) \in [0,1]$ and $\hat{y}^{(w)}_{n,c}(x_n) \in [0,1]$ denote the network predictions of the strong and weak labels for this utterance, respectively. The supervised classification loss considering both strongly-labelled and weakly-labelled data is

$$L_{\text{super}} = \frac{1}{N_s T' C} \sum_{n=1}^{N_s} \sum_{t=1}^{T'} \sum_{c=1}^{C} H\big(y^{(s)}_{n,t,c}, \hat{y}^{(s)}_{n,t,c}(x^{(s)}_n)\big) + \frac{1}{N_w C} \sum_{n=1}^{N_w} \sum_{c=1}^{C} H\big(y^{(w)}_{n,c}, \hat{y}^{(w)}_{n,c}(x^{(w)}_n)\big), \quad (1)$$

where $H(\cdot)$ denotes the binary cross-entropy.

To exploit the unlabelled data, the MeanTeacher model [8] is used to provide pseudo labels. In practice, the pseudo labels are applied not only to the unlabelled data, but also to the labelled data to improve training stability. Let $\tilde{y}^{(s)}_{n,t,c}(x_n) \in [0,1]$ and $\tilde{y}^{(w)}_{n,c}(x_n) \in [0,1]$ denote the MeanTeacher strong and weak pseudo-labels, respectively. The unsupervised mean squared error is then

$$L_{\text{unsuper}} = \frac{1}{N T' C} \sum_{n=1}^{N} \sum_{t=1}^{T'} \sum_{c=1}^{C} \big(\tilde{y}^{(s)}_{n,t,c}(x_n) - \hat{y}^{(s)}_{n,t,c}(x_n)\big)^2 + \frac{1}{N C} \sum_{n=1}^{N} \sum_{c=1}^{C} \big(\tilde{y}^{(w)}_{n,c}(x_n) - \hat{y}^{(w)}_{n,c}(x_n)\big)^2, \quad (2)$$

where $N = N_s + N_w + N_u$.

Data augmentation can largely increase data diversity and thus improve performance. Augmented data are generated by applying signal transformations to the original data, and inherit the labels of the original data. The supervised and unsupervised losses defined in (1) and (2) can be directly applied to the augmented data. The augmented data should be identified as the same class as the corresponding original data, which is implemented by consistency regularization. In practice, it is found that better performance is achieved when consistency regularization is applied to both the labelled and unlabelled data, and to both the strong and weak predictions. For utterance $x_n$, let $\alpha_p(x_n), p \in [1,P]$ denote its $P$ different transformations. The consistency regularization term is defined as

$$L_{\text{cr}} = \frac{1}{N P T' C} \sum_{n=1}^{N} \sum_{p=1}^{P} \sum_{t=1}^{T'} \sum_{c=1}^{C} \mathcal{L}\big(\hat{y}^{(s)}_{n,t,c}(x_n) - \hat{y}^{(s)}_{n,t,c}(\alpha_p(x_n))\big) + \frac{1}{N P C} \sum_{n=1}^{N} \sum_{p=1}^{P} \sum_{c=1}^{C} \mathcal{L}\big(\hat{y}^{(w)}_{n,c}(x_n) - \hat{y}^{(w)}_{n,c}(\alpha_p(x_n))\big). \quad (3)$$

Finally, the overall loss is set to

$$L = L_{\text{super}} + \lambda_{\text{unsuper}} L_{\text{unsuper}} + \lambda_{\text{cr}} L_{\text{cr}}, \quad (4)$$

where $\lambda_{\text{unsuper}}$ and $\lambda_{\text{cr}}$ are predefined weights.
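To make the training objective concrete, the following is a minimal PyTorch sketch of how (1)-(4) could be assembled for one batch, with $P = 1$. All function and tensor names are hypothetical; the squared-error form of $\mathcal{L}$ in (3), the stop-gradient on the clean prediction, and the EMA decay of 0.999 are assumptions rather than the paper's verified settings. Label-altering transformations such as mixup require the extra label handling described later in this paper and are not covered here.

```python
import torch
import torch.nn.functional as F

def total_loss(student, teacher, augment,
               x_s, y_s_strong,      # strongly-labelled clips, frame labels
               x_w, y_w_weak,        # weakly-labelled clips, clip labels
               x_u,                  # unlabelled clips
               lam_unsuper=2.0, lam_cr=2.0):
    """Overall loss L = L_super + lam_unsuper * L_unsuper + lam_cr * L_cr.

    Each model maps inputs of shape (batch, T, K) to a pair of sigmoid
    outputs: strong predictions (batch, T', C) and weak (batch, C).
    `augment` plays the role of alpha_p (in the paper it acts on the
    waveform before feature extraction).
    """
    x_all = torch.cat([x_s, x_w, x_u], dim=0)
    n_s, n_w = x_s.shape[0], x_w.shape[0]

    # Student predictions on the original (non-augmented) data.
    s_strong, s_weak = student(x_all)

    # Eq. (1): frame-level BCE on strongly-labelled data plus clip-level
    # BCE on weakly-labelled data; the default 'mean' reduction
    # reproduces the 1/(N T' C) and 1/(N C) normalizations.
    l_super = (F.binary_cross_entropy(s_strong[:n_s], y_s_strong)
               + F.binary_cross_entropy(s_weak[n_s:n_s + n_w], y_w_weak))

    # Eq. (2): MSE against the MeanTeacher pseudo-labels, applied to
    # labelled and unlabelled data alike.
    with torch.no_grad():
        t_strong, t_weak = teacher(x_all)
    l_unsuper = F.mse_loss(s_strong, t_strong) + F.mse_loss(s_weak, t_weak)

    # Eq. (3) with P = 1: consistency between predictions on x_n and on
    # alpha_p(x_n); detaching the clean prediction is one common
    # implementation choice, not stated in the paper.
    a_strong, a_weak = student(augment(x_all))
    l_cr = (F.mse_loss(a_strong, s_strong.detach())
            + F.mse_loss(a_weak, s_weak.detach()))

    # Eq. (4)
    return l_super + lam_unsuper * l_unsuper + lam_cr * l_cr

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # MeanTeacher [8]: the teacher is an exponential moving average of
    # the student, updated after every optimization step.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)
```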
This work follows the RandAugment [24] principle. The $P$ transformations are randomly selected from a total of $Q$ available transformations with a uniform distribution, so there are $Q^P$ potential policies for one utterance. This random selection is performed independently for each utterance at each training epoch. A number of widely used data transformations are tested. A proper distortion magnitude should be chosen for each transformation, but searching the optimal magnitude for each individual transformation has a very large search space. In RandAugment [24], it was proposed to use a single global distortion magnitude for all transformations, which largely reduces the search space. This work sets 10 integer distortion scales for each transformation, and the optimal global scale is found by grid-searching from 1 to 10. The audio transformation schemes used in this work, sketched in code after this list, include:

• Signal speeding slows down or speeds up the original signal, conducted by up-sampling or down-sampling the signal. This transformation changes the signal length, and also shifts the frequencies. Ten up-sampling factors are set from 1.05 to 1.5 with an increment of 0.05. In addition, the factor is replaced by its reciprocal with probability 0.5 to account for the down-sampling case.

• Time shifting rolls the signal along time [10, 25]. The rolling factor is randomly selected from 0.1 to 0.9. The distortion scale is fixed.

• Time stretching raises or lowers the speed while keeping the original pitch. The audio degradation toolkit [21] is used to conduct this transformation. The stretching factors are set the same as the sampling factors of signal speeding.

• Pitch shifting [21] raises or lowers the pitch while keeping the original signal length. Ten positive shifting scales are set from 0.5 to 5 with an increment of 0.5. The corresponding negative factors are used with probability 0.5.

• Dynamic range compression (DRC) [21]. One mode is randomly chosen for each utterance. The distortion scale is fixed.

• Time masking is a spectral augmentation technique proposed in [22] for speech recognition, which sets a period of consecutive time frames to 0. In this work, a masking unit is defined as a period of consecutive time frames with a duration of 0.05 times the total signal length. The masking scales take 1 to 10 randomly positioned masking units.

• Frequency masking [22] sets frequencies to 0. The masking scales are set following the spirit of time masking.

• Mixup [23] takes the convex combination of two samples (and the corresponding labels) as a new sample (and label). In this work, the sum of two utterances is considered as concurrent sound events, and the two mixed utterances are not rescaled. The mixup label is set by taking the logical OR of the two original labels. The mixup sample is generated from two samples of the same training mini-batch. In the consistency regularization loss (3), the predictions $\hat{y}^{(s)}_{n,t,c}(x_n)$ and $\hat{y}^{(w)}_{n,c}(x_n)$ should actually be the mixup of the predictions of the two mixed samples, computed by first binarizing the predictions of the two mixed samples and then taking the logical OR. The distortion scale is fixed for mixup.
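As a concrete illustration of the random-selection principle, below is a minimal waveform-level sketch of a subset of these transformations. The function name and the scale-to-magnitude mappings follow the increments listed above but are otherwise assumptions. DRC and mixup are omitted since they require codec modes and label handling respectively, and time masking is shown on the raw waveform for simplicity, whereas [22] operates on the spectrogram.

```python
import numpy as np
import librosa

def rand_augment_audio(y, sr, scale, rng=None):
    """Apply one randomly chosen transformation to waveform y.

    `scale` is the global distortion magnitude (1..10); each branch maps
    it to that transformation's parameter range (e.g. speed factor
    1 + 0.05 * scale, pitch steps 0.5 * scale).
    """
    rng = rng or np.random.default_rng()
    transform = rng.choice(["speed", "shift", "stretch", "pitch", "mask_t"])

    if transform == "speed":
        # Resample: changes the signal length and shifts frequencies.
        factor = 1.0 + 0.05 * scale               # 1.05 .. 1.5
        if rng.random() < 0.5:
            factor = 1.0 / factor                 # down-sampling case
        return librosa.resample(y, orig_sr=sr, target_sr=int(sr * factor))

    if transform == "shift":
        # Roll the signal along time; rolling factor in [0.1, 0.9].
        return np.roll(y, int(rng.uniform(0.1, 0.9) * len(y)))

    if transform == "stretch":
        # Change the speed, keep the pitch (same factors as speeding).
        factor = 1.0 + 0.05 * scale
        if rng.random() < 0.5:
            factor = 1.0 / factor
        return librosa.effects.time_stretch(y, rate=factor)

    if transform == "pitch":
        # Shift the pitch, keep the length; 0.5 .. 5 semitones.
        steps = 0.5 * scale
        if rng.random() < 0.5:
            steps = -steps
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)

    # Time masking: zero out `scale` units of 5% of the signal length.
    out = y.copy()
    unit = int(0.05 * len(y))
    for _ in range(scale):
        start = rng.integers(0, len(y) - unit)
        out[start:start + unit] = 0.0
    return out
```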
3. EXPERIMENTS
In this work, we use the dataset of DCASE 2020 task 4, "Sound event detection and separation in domestic environments" [26], which includes 10 domestic sound classes: speech, dog, cat, alarm/bell/ringing, dishes, frying, blender, running water, vacuum cleaner, and electric shaver/toothbrush. The training dataset (as we downloaded it) consists of 1466 weakly-labelled clips, 2584 synthetic strongly-labelled clips and 13343 unlabelled in-domain clips. The validation set includes 1168 real-recorded clips with strong annotations. Out of the 2584 synthetic clips, 517 are used for training validation. The 1168 validation clips are used for testing. All clips are 10 s long.

The DCASE 2020 task 4 official baseline system (https://github.com/turpaultn/dcase20_task4) is adopted to develop the proposed method, which is a modification of [10]. The sampling rate is 16 kHz. A 128-dimensional feature is extracted in the short-time Fourier transform domain (2048 window, 255 hop size) with mel-scale frequency bins. The MeanTeacher model [8] is adopted for semi-supervised learning. The network includes 7 CNN layers and two GRU-RNN layers [10]. A median filter with a duration of 0.45 s is used for post-processing. The batch size is 24, and each batch is composed of 6 weakly-labelled, 6 strongly-labelled and 12 unlabelled samples. The number of training epochs is set to 200. The learning rate is ramped up to its peak value by epoch 50, step-decayed at epoch 100, and decayed further at epoch 150. The weights $\lambda_{\text{unsuper}}$ and $\lambda_{\text{cr}}$ are ramped up to a constant by epoch 50, then kept invariant.

The performance is evaluated with three metrics: i) the macro-averaged event-based collar F1 score [27], with a 200 ms collar on onsets and a collar of 200 ms or 20% of the event length on offsets for the comparison between event predictions and ground truth; ii) the macro-averaged event-based PSDS (polyphonic sound detection score) F1 score; and iii) the cross-trigger (CT) PSDS F1 score [28]. For a reliable evaluation, each of the following experiments is run for three independent trials, and the averaged scores are reported.

The sound event detection results are given in Table 1. MT and MT+RDA stand for the baseline MeanTeacher method without and with random data augmentation (RDA), respectively, for which $\lambda_{\text{unsuper}} = 2$ and $\lambda_{\text{cr}} = 0$. CR+RDA stands for consistency regularization without MeanTeacher, with $\lambda_{\text{unsuper}} = 0$ and $\lambda_{\text{cr}} = 2$, since consistency regularization is itself also an unsupervised learning strategy. Finally, MT+CR+RDA combines MeanTeacher and consistency regularization, with $\lambda_{\text{unsuper}} = 2$ and $\lambda_{\text{cr}} = 2$.

Table 1. Sound event detection results.

F1 score (%)          collar   PSDS   CT PSDS
GLU   MT              37.2     60.8   53.5
      MT+RDA          39.4     64.3   57.5
      CR+RDA          39.3     66.2   60.0
      MT+CR+RDA       40.7     66.5   60.7
CG    MT              38.8     61.9   55.2
      MT+RDA          40.8     66.8   61.2
      CR+RDA          41.2     67.1   61.3
      MT+CR+RDA       43.5     69.5   64.4

We test two network architectures with different activations for the CNN layers, i.e. GLU (Gated Linear Units) and CG (Context Gating). It can be seen that the proposed techniques work well for both architectures, and that CG consistently performs better than GLU. MT+RDA largely improves on MT, especially in the PSDS and CT PSDS scores, which shows the efficacy of random data augmentation. CR+RDA outperforms MT+RDA, meaning that consistency regularization alone exploits unlabelled data even better than MeanTeacher, which is consistent with the image classification results presented in [14].
MT+CR+RDA provides the best scores, which indicates that the two unsupervised losses, i.e. MeanTeacher and consistency regularization, are somewhat complementary.

Besides, we have also studied several other semi-supervised learning techniques, including hard pseudo-labels [14, 15], entropy minimization [16, 29] and information maximization [20], as well as their combinations with the others. However, we did not find a better strategy than the combination of MeanTeacher and consistency regularization. In the literature, many different combinations of semi-supervised learning techniques have been reported to achieve superior performance on various tasks, especially computer vision tasks. However, it seems that no single strategy stays on top across a wide range of tasks; the proper strategy has to be carefully investigated for each specific task.

In the following, the random data augmentation method is studied in more detail. Based on preliminary experiments, the number of transformations applied to each sample, i.e. $P$, is set to 1; this choice is not analyzed in detail due to space limitations. The magnitudes of the audio transformations listed in Section 2 have to be set empirically, and setting them independently for each transformation leads to a very large search space. Following RandAugment, the optimal transformation magnitude is searched with the global scale defined in Section 2. Table 2 lists the grid-search results with the CG MT+CR+RDA method for the scales 3, 4, 5 and 6. Two schemes are tested: a fixed scale, and a random scale with a fixed upper bound (sketched in code below). It can be seen that the random scale outperforms the fixed scale on average.

Table 2. Global grid-search for random augmentation.

F1 score (%)           collar   PSDS   CT PSDS
fixed scale     3      41.9     68.4   63.0
                4      41.8     69.0   64.0
                5      41.5     68.2   62.9
                6      41.6     68.6   63.1
random scale    3      41.9     67.8   62.2
                4      42.1     68.5   63.2
                5
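The two scale schemes compared in Table 2 might be implemented as follows; the helper name is hypothetical, and the uniform draw over {1, ..., upper} is an assumption consistent with the "random scale with a fixed upper bound" description:

```python
import numpy as np

def draw_scale(upper, scheme="random", rng=None):
    """Return the global distortion scale for one utterance.

    scheme="fixed":  always use `upper` itself.
    scheme="random": draw uniformly from {1, ..., upper}, so the upper
    bound caps the distortion while smaller magnitudes still occur.
    """
    rng = rng or np.random.default_rng()
    if scheme == "fixed":
        return upper
    return int(rng.integers(1, upper + 1))
```

Each utterance would then be augmented with, e.g., `rand_augment_audio(y, sr, draw_scale(5))` from the earlier sketch.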
Random scale 5 achieves the best performance, and is thus used in all other experiments. It was demonstrated in [24] that changing the magnitude of a single transformation does not largely affect the performance. In addition, each type of transformation should be verified to play a positive role. This is done by comparing the results using all transformations with the results obtained when excluding each one of them in turn. The results with the GLU CR+RDA method are given in Table 3.

Table 3. Results for excluding one transformation.

F1 score (%)            collar   PSDS   CT PSDS
all                     39.3     66.2   60.0
- Signal speeding       38.3     66.0   60.0
- Time shifting         38.6     65.4   59.7
- Time stretching       38.7     65.8   59.1
- Pitch shifting        37.7     65.7   59.4
- DRC                   38.8     66.5   60.3
- Time masking          38.4     65.5   59.7
- Frequency masking     39.2     66.0   60.3
- Mixup                 37.1     64.0   57.7

It can be seen that excluding mixup or pitch shifting largely degrades the performance relative to the 'all' case, which means they are very useful. Excluding DRC or frequency masking achieves scores similar to the 'all' case, hence they are not really functional in this experiment. The other four transformations improve the performance to a certain extent, and thus have a medium importance.
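For completeness, below is a minimal sketch of the 0.45 s median-filter post-processing used in all experiments above, applied to binarized frame-level predictions. The 0.5 binarization threshold and the frame rate, derived from the stated 16 kHz sampling rate and 255-sample hop while ignoring any pooling from $T$ to $T'$, are assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_predictions(strong_pred, threshold=0.5,
                       sr=16000, hop=255, win_sec=0.45):
    """Binarize frame-level predictions of shape (T', C) and
    median-filter each class track over time to suppress spurious
    short events.
    """
    # Kernel length in frames at the input feature resolution; the
    # CRNN may pool this down, in which case `hop` should be adjusted.
    frames = int(round(win_sec * sr / hop))
    frames += (frames + 1) % 2          # make the kernel length odd
    binary = (strong_pred >= threshold).astype(int)
    # Filter along the time axis only (size 1 on the class axis).
    return median_filter(binary, size=(frames, 1))
```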
4. CONCLUSIONS
This work has studied a random data augmentation strategy with a number of different audio transformations. When proper parameters are chosen, random augmentation noticeably improves the SED performance. For the augmented data, consistency regularization is adopted as an effective unsupervised loss. The combination of consistency regularization and MeanTeacher achieves the best performance. Note that this work focuses only on semi-supervised learning strategies, and many other techniques not adopted here may further improve the performance; for example, the DCASE 2020 winning system [11] uses a better network, a better post-processing median filter, and multi-system ensembling.

5. REFERENCES
[1] Tuomas Virtanen, Mark D. Plumbley, and Dan Ellis, Computational Analysis of Sound Scenes and Events, Springer, 2018.
[2] Karol J. Piczak, "Environmental sound classification with convolutional neural networks," in MLSP, 2015, pp. 1-6.
[3] Justin Salamon and Juan Pablo Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279-283, 2017.
[4] Ting-Wei Su, Jen-Yu Liu, and Yi-Hsuan Yang, "Weakly-supervised audio event detection using event-specific gaussian filters and fully convolutional networks," in ICASSP, 2017, pp. 791-795.
[5] Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," in DCASE Challenge, 2017.
[6] Romain Serizel, Nicolas Turpault, Hamid Eghbal-Zadeh, and Ankit Parag Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," arXiv preprint arXiv:1807.10501, 2018.
[7] Lu JiaKai, "Mean teacher convolution system for DCASE 2018 task 4," DCASE Challenge, 2018.
[8] Antti Tarvainen and Harri Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in Advances in Neural Information Processing Systems, 2017, pp. 1195-1204.
[9] Nicolas Turpault, Romain Serizel, Justin Salamon, and Ankit Parag Shah, "Sound event detection in domestic environments with weakly labeled data and soundscape synthesis," DCASE Challenge, 2019.
[10] Lionel Delphin-Poulat and Cyril Plapous, "Mean teacher with data augmentation for DCASE 2019 task 4," Orange Labs Lannion, France, Tech. Rep., 2019.
[11] Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, and Kazuya Takeda, "Convolution-augmented transformer for semi-supervised sound event detection," Tech. Rep., DCASE Challenge, 2020.
[12] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le, "Unsupervised data augmentation for consistency training," arXiv preprint arXiv:1904.12848, 2019.
[13] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel, "ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring," arXiv preprint arXiv:1911.09785, 2019.
[14] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," arXiv preprint arXiv:2001.07685, 2020.
[15] Dong-Hyun Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in ICML, 2013, vol. 3.
[16] Yves Grandvalet and Yoshua Bengio, "Semi-supervised learning by entropy minimization," in Advances in Neural Information Processing Systems, 2005, pp. 529-536.
[17] Samuli Laine and Timo Aila, "Temporal ensembling for semi-supervised learning," in International Conference on Learning Representations, 2017.
[18] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le, "AutoAugment: Learning augmentation strategies from data," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 113-123.
[19] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," in Advances in Neural Information Processing Systems, 2016, pp. 1163-1171.
[20] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama, "Learning discrete representations via information maximizing self-augmented training," in Proceedings of Machine Learning Research, 2017, vol. 70, pp. 1558-1567.
[21] Brian McFee, Eric J. Humphrey, and Juan Pablo Bello, "A software framework for musical data augmentation," in ISMIR, 2015, pp. 248-254.
[22] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Interspeech, 2019.
[23] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz, "mixup: Beyond empirical risk minimization," in International Conference on Learning Representations, 2018.
[24] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le, "RandAugment: Practical automated data augmentation with a reduced search space," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702-703.
[25] Chih-Yuan Koh, You-Siang Chen, Shang-En Li, Yi-Wen Liu, Jen-Tzung Chien, and Mingsian R. Bai, "Sound event detection by consistency training and pseudo-labeling with feature-pyramid convolutional recurrent neural networks," DCASE Challenge, 2020.
[26] "http://dcase.community/challenge2020/task-sound-event-detection-and-separation-in-domestic-environments".
[27] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[28] Çağdaş Bilen, Giacomo Ferroni, Francesco Tuveri, Juan Azcarreta, and Sacha Krstulović, "A framework for the robust evaluation of sound event detection," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 61-65.
[29] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1979-1993, 2019.