Impact of Sound Duration and Inactive Frames on Sound Event Detection Performance
Keisuke Imoto†, Sakiko Mishima♦, Yumi Arai♦, Reishi Kondo♦
†Doshisha University, ♦NEC Corporation
ABSTRACT
In many methods of sound event detection (SED), a segmented time frame is regarded as one data sample for model training. The durations of sound events greatly depend on the sound event class; e.g., the sound event "fan" has a long duration, whereas the sound event "mouse clicking" is instantaneous. Thus, the difference in duration between sound event classes results in a serious data imbalance in SED. Moreover, most sound events occur only occasionally; therefore, there are many more time frames in which sound events are inactive than frames in which they are active. This also causes a severe data imbalance between active and inactive frames. In this paper, we investigate the impact of sound duration and inactive frames on SED performance by introducing four loss functions: the simple reweighting loss, inverse frequency loss, asymmetric focal loss, and focal batch Tversky loss. We then provide insights into how to tackle this imbalance problem.
Index Terms — Sound event detection, sound duration, inactive frame, data imbalance, asymmetric focal loss
1. INTRODUCTION
Sound event detection (SED), in which the types of sound events are identified and their onsets and offsets in an audio recording are estimated, is one of the principal tasks in environmental sound analysis [1, 2]. Recently, many works have addressed SED because it plays an important role in realizing various applications of artificial intelligence for sounds, e.g., automatic life logging, machine monitoring, automatic surveillance, media retrieval, and biomonitoring systems [3, 4, 5, 6, 7, 8].

For SED, many methods using neural networks, such as the convolutional neural network (CNN) [9], recurrent neural network (RNN) [10], convolutional recurrent neural network (CRNN) [11], and Transformer-based neural network [12, 13], have been proposed. In these methods, an audio clip is segmented into short time frames (e.g., 40 ms frames), and each frame is regarded as one data sample for model training and evaluation. As shown in Fig. 1, sound events vary in duration, and the average length of a sound event varies highly depending on the sound event class. Table 1 and Fig. 2 show, respectively, the average duration of one sound event instance and the total number of frames covered by each sound event in the development datasets used for the evaluation experiments described in Sec. 4 (TUT Sound Events 2016 and 2017 and TUT Acoustic Scenes 2016 and 2017 [14, 15]). In these datasets, the number of frames of the sound event "mouse clicking," which has an average length of 0.15 s, is 1,163, whereas that of the sound event "fan," which has an average length of 29.99 s, is 116,837. Thus, the difference in duration between sound events causes a serious data imbalance in SED. Moreover, Figs. 1 and 2 show that there is an even larger difference in the number of data samples between frames in which sound events are active and frames in which they are inactive.
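To make the frame-level view concrete, the sketch below counts how many 40-ms frames each event class occupies in an annotated clip; the annotation format ((onset, offset, label) tuples) and the helper name count_active_frames are illustrative assumptions rather than part of the datasets' tooling. Statistics such as those in Table 1 and Fig. 2 can be accumulated in this way.

```python
import math
from collections import Counter

FRAME_SHIFT = 0.04  # 40 ms time frames, as assumed in Sec. 1

def count_active_frames(annotations, clip_length):
    """Count active frames per event class and the remaining inactive frames.

    annotations: list of (onset_sec, offset_sec, event_label) tuples
    clip_length: clip duration in seconds
    """
    n_frames = int(math.ceil(clip_length / FRAME_SHIFT))
    counts = Counter()
    active = [False] * n_frames
    for onset, offset, label in annotations:
        start = int(onset / FRAME_SHIFT)
        end = min(n_frames, int(math.ceil(offset / FRAME_SHIFT)))
        counts[label] += max(0, end - start)
        for i in range(start, end):
            active[i] = True
    counts["(inactive)"] = active.count(False)
    return counts

# Example: a 10 s clip with one long event and one instantaneous event
print(count_active_frames([(0.0, 8.0, "fan"), (2.0, 2.15, "mouse clicking")], 10.0))
```

Even in this toy clip, the long "fan" event contributes two orders of magnitude more data samples than the "mouse clicking" event, which is exactly the imbalance discussed above.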
Fig. 1. Examples of active/inactive sound events and the number of data samples (time axis showing the events "people talking," "keyboard typing," "fan," and "mouse clicking"; each segmented time frame is one data sample for model training, and inactive frames are marked).
Table 1. Average duration of one sound event instance in the datasets used for the evaluation experiments (TUT Sound Events 2016 and 2017 and TUT Acoustic Scenes 2016 and 2017 [14, 15])
Sound event          Duration (s)    Sound event          Duration (s)
(object) banging     0.78            drawer               0.80
(object) impact      0.35            fan                  29.99
(object) rustling    2.24            glass jingling       0.80
(object) snapping    0.46            keyboard typing      0.21
(object) squeaking   0.74            large vehicle        14.68
bird singing         7.63            mouse clicking       0.14
brakes squeaking     1.65            mouse wheeling       0.16
breathing            0.43            people talking       4.09
car                  6.88            people walking       6.63
children             6.87            washing dishes       4.15
cupboard             0.65            water tap running    5.92
cutlery              0.74            wind blowing         6.09
dishes               1.24

In the development dataset used for the evaluation experiments, the total number of active frames is much smaller than that of inactive frames; therefore, this difference also causes a serious imbalance problem between active and inactive data samples.

There are some conventional methods of SED for the case of imbalanced data [16, 17, 18]. For example, Chen and Jin have proposed a method of detecting rare sound events using data augmentation [16]. Wang et al. have proposed a method of few-shot SED based on metric learning [17]. Dinkel and Yu have proposed a method of SED using a temporal subsampling method within a CRNN [18]. However, the impact of the data imbalance caused by the difference in duration between sound event classes and between active/inactive frames on SED performance has not been comprehensively investigated in these works. Our contributions in this paper are as follows.

• We clarify how the data imbalance between sound event classes and/or between active and inactive frames impacts the performance of SED.

• We introduce some loss functions into SED tasks, such as the asymmetric focal loss and focal batch Tversky loss.

The remainder of this paper is organized as follows. In Sec. 2, we introduce the conventional SED method using the sigmoid cross-entropy loss function. In Sec. 3, we introduce some reweighting techniques for the cross-entropy loss function. In Sec. 4, we determine the impact of the duration of active/inactive sounds on SED performance on the basis of experimental results, and in Sec. 5, we conclude this paper.
Fig. 2. Number of frames of active/inactive sound events in the dataset used for the evaluation experiments.
2. SOUND EVENT DETECTION USING BINARY CROSS-ENTROPY LOSS
Let us consider a model f with parameter θ. The goal of SED is to predict the sound event labels Ẑ of an unknown sound clip by

\hat{Z} = \mathbb{I}\bigl[ f(X, \theta) \ge \phi \bigr] \in \{0, 1\}^{N \times M},   (1)

where N, M, X, φ, and 𝕀 are the number of time frames in a sound clip, the number of sound event classes, the acoustic feature, the detection threshold, and the indicator function, respectively. The model parameter θ is determined in advance using the training dataset D = {(X_1, Z_1), ..., (X_l, Z_l), ..., (X_L, Z_L)}. Here, X_l is the acoustic feature of the l-th sound clip and Z_l = {z_{l,1}, ..., z_{l,n}, ..., z_{l,N}} is a sequence of multi-hot vectors z_{l,n} ∈ {0, 1}^M over the M sound event classes in the l-th sound clip. For the acoustic feature X_l, the mel-band energy or mel-frequency cepstral coefficients (MFCCs) are often used. For the model f, a CNN, CRNN, or Transformer-based neural network has been applied. The model parameter θ is estimated using the following binary cross-entropy (BCE) loss function E_BCE(θ) and the backpropagation technique:

E_{\mathrm{BCE}}(\theta) = -\sum_{n=1}^{N} \bigl\{ \mathbf{z}_n \cdot \log(\mathbf{y}_n) + (1 - \mathbf{z}_n) \cdot \log(1 - \mathbf{y}_n) \bigr\}
                        = -\sum_{n,m=1}^{N,M} \bigl\{ z_{n,m} \log(y_{n,m}) + (1 - z_{n,m}) \log(1 - y_{n,m}) \bigr\},   (2)

where y_{n,m} is the prediction of sound event m in time frame n, and z_{n,m} is the target label, which is 1 if sound event m is active in time frame n and 0 otherwise. Note that the sound clip index l is omitted to simplify the equations. E_BCE(θ) is calculated by summing the binary cross-entropy over all time frames and sound event classes. Since the duration of sound events varies highly depending on the event class, model parameter estimation using the BCE loss suffers from the data imbalance between sound event classes. Moreover, as shown in Figs. 1 and 2, because the number of inactive frames is much larger than that of active frames, the model parameter estimation tends to be overwhelmed by the inactive frames. Consequently, active frames tend to be ignored in model training.
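For illustration only (not the authors' implementation), Eq. (2) can be written as a few lines of PyTorch, assuming sigmoid outputs y and multi-hot targets z of shape (N, M):

```python
import torch

def bce_loss(y, z, eps=1e-7):
    """Frame-level binary cross-entropy of Eq. (2).

    y: predicted event activities after a sigmoid, shape (N, M)
    z: multi-hot target labels in {0, 1}, shape (N, M)
    """
    y = y.clamp(eps, 1.0 - eps)  # avoid log(0)
    loss = -(z * torch.log(y) + (1.0 - z) * torch.log(1.0 - y))
    return loss.sum()  # summed over all time frames and event classes

# Example with random predictions for N = 4 frames and M = 3 event classes
y = torch.sigmoid(torch.randn(4, 3))
z = torch.randint(0, 2, (4, 3)).float()
print(bce_loss(y, z))
```

Because the sum runs uniformly over all frames and classes, frequent classes and inactive frames dominate the gradient, which is the behavior the reweighted losses in Sec. 3 are designed to counteract.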
3. LOSS FUNCTION CONSIDERING DATA IMBALANCE
In this work, we apply four loss functions that can control the contributions of short/long sound events and active/inactive frames to model training.
To investigate the impact of the very large number of inactive frames on SED performance, we consider the following simple reweighting loss (SRL):

E_{\mathrm{SRL}}(\theta) = -\sum_{n,m=1}^{N,M} \bigl\{ \alpha\, z_{n,m} \log(y_{n,m}) + \beta\, (1 - z_{n,m}) \log(1 - y_{n,m}) \bigr\},   (3)

where α ∈ [0, ∞) and β ∈ [0, ∞) are the reweighting factors. In this work, we set α to 1.0.

To investigate the impact of the data imbalance between sound event classes, we also consider the following reweighted loss based on the inverse frequency of sound events (IFL):

E_{\mathrm{IFL}}(\theta) = -\sum_{n,m=1}^{N,M} \Bigl\{ \Bigl( \frac{C}{N_m + C} \Bigr)^{\gamma} z_{n,m} \log(y_{n,m}) + (1 - z_{n,m}) \log(1 - y_{n,m}) \Bigr\},   (4)

where γ ∈ [0, ∞), N_m, and C are the weighting factor, the number of frames of sound event m in a training batch, and a constant, respectively. The IFL reweights the contribution of each sound event to model training in accordance with its frequency and enables more robust model training with an imbalanced dataset.

Many long-duration sound events (e.g., "fan" and "car") and inactive durations are stationary, that is, they have less variation in their acoustic features, and their models are easy to train. On the other hand, several sound events of short duration (e.g., "object impact" and "keyboard typing") have more than one audio pattern, such as attack, decay, and release parts. To control the training weight of each sound event model in accordance with the ease/difficulty of model training, the use of the focal loss has been proposed [19, 20]. In this paper, we newly introduce the following asymmetric focal loss (AFL), which enables the focusing factors of active and inactive frames to be controlled separately:

E_{\mathrm{AFL}}(\theta) = -\sum_{n,m=1}^{N,M} \bigl\{ (1 - y_{n,m})^{\gamma} z_{n,m} \log(y_{n,m}) + (y_{n,m})^{\zeta} (1 - z_{n,m}) \log(1 - y_{n,m}) \bigr\}.   (5)

Here, γ and ζ are the weighting parameters that control the focusing weights of active and inactive frames, respectively. When we set large values for the weighting parameters γ and ζ, the losses of active and inactive frames are down-weighted depending on the prediction error.
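All three reweighted losses above are element-wise modifications of the BCE in Eq. (2). The sketch below is one possible PyTorch rendering of Eqs. (3)-(5); the function names and the default parameter values (other than α = 1.0 for SRL and C = 500, which appear in the paper) are placeholder assumptions, not the settings used in the experiments.

```python
import torch

def srl_loss(y, z, alpha=1.0, beta=0.5, eps=1e-7):
    """Simple reweighting loss of Eq. (3): scale active/inactive terms by alpha/beta."""
    y = y.clamp(eps, 1.0 - eps)
    return -(alpha * z * torch.log(y)
             + beta * (1.0 - z) * torch.log(1.0 - y)).sum()

def ifl_loss(y, z, gamma=1.0, C=500.0, eps=1e-7):
    """Inverse frequency loss of Eq. (4): weight active terms by (C / (N_m + C))**gamma."""
    y = y.clamp(eps, 1.0 - eps)
    n_m = z.sum(dim=0)            # number of frames of each event m in the batch
    w = (C / (n_m + C)) ** gamma  # per-class weight, shape (M,)
    return -(w * z * torch.log(y)
             + (1.0 - z) * torch.log(1.0 - y)).sum()

def afl_loss(y, z, gamma=0.5, zeta=1.0, eps=1e-7):
    """Asymmetric focal loss of Eq. (5): separate focusing factors for active/inactive frames."""
    y = y.clamp(eps, 1.0 - eps)
    return -(((1.0 - y) ** gamma) * z * torch.log(y)
             + (y ** zeta) * (1.0 - z) * torch.log(1.0 - y)).sum()
```

Setting β < 1 in srl_loss or ζ > 0 in afl_loss down-weights the inactive-frame term, which is the mechanism examined in the experiments of Sec. 4.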
Table 2. Experimental conditions (length of sound clip: 10 s; network: CNN-BiGRU with three CNN layers).
To address the data imbalance between active and inactive frames, we also introduce the focal batch Tversky loss (FBTL) E_FBTL(θ) into the SED task, which is an extended version of the focal dice loss [21] and the Tversky loss [22, 23]. The dice loss and Tversky loss directly train the model to maximize the F-score and do not take true-negative samples into account; that is, these losses can also prevent model training from being overwhelmed by the easy negative frames. In this paper, we introduce the idea of the focal loss into the batch Tversky loss [22, 23] and apply the following FBTL to the SED task:

E_{\mathrm{FBTL}}(\theta) = 1 - \frac{\sum_{l,n,m=1}^{B,N,M} (1 - y^{(1)}_{l,n,m})^{\gamma}\, y^{(1)}_{l,n,m} z^{(1)}_{l,n,m} + \eta}{\sum_{l,n,m=1}^{B,N,M} \alpha (1 - y^{(1)}_{l,n,m})^{\gamma}\, y^{(1)}_{l,n,m} + \sum_{l,n,m=1}^{B,N,M} \beta z^{(1)}_{l,n,m} + \eta},   (6)

where y^{(1)}_{l,n,m} and z^{(1)}_{l,n,m} are the prediction and sound event label for the active frame, respectively. α ∈ [0, 1.0] and β ∈ [0, 1.0] are the tradeoff parameters between false-negative and false-positive samples, where α + β = 1.0. B and η are the number of sound clips in each batch and a smoothing parameter, respectively.
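A rough sketch of Eq. (6) under the same assumptions is given below; it operates on batched predictions and targets of shape (B, N, M), and the default values of α, β, γ, and η are placeholders rather than the settings used in the experiments.

```python
import torch

def fbtl_loss(y, z, alpha=0.5, beta=0.5, gamma=0.5, eta=1.0):
    """Focal batch Tversky loss of Eq. (6), computed over a whole batch.

    y: predicted activities after a sigmoid, shape (B, N, M)
    z: multi-hot targets in {0, 1}, shape (B, N, M)
    """
    focal = (1.0 - y) ** gamma            # focal modulation of the active predictions
    num = (focal * y * z).sum() + eta
    den = alpha * (focal * y).sum() + beta * z.sum() + eta
    return 1.0 - num / den

# Example with a batch of B = 2 clips, N = 4 frames, M = 3 event classes
y = torch.sigmoid(torch.randn(2, 4, 3))
z = torch.randint(0, 2, (2, 4, 3)).float()
print(fbtl_loss(y, z))
```

Because the ratio is formed over the whole batch, true-negative (correctly inactive) frames do not enter the loss, which is why FBTL resists being overwhelmed by easy negatives but, as noted in Sec. 4, does not control the false-positive rate.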
4. EXPERIMENTS

4.1. Experimental Conditions
We evaluated the impact of the data imbalance between sound event classes and between active/inactive frames on event detection performance using various SED networks and loss functions. For the evaluation, we used a dataset composed of parts of TUT Sound Events 2016 and 2017 and TUT Acoustic Scenes 2016 and 2017 [14, 15]. From these datasets, we selected a total of 266 min of sound clips (development set, 192 min; evaluation set, 74 min) including the 25 types of sound event listed in Fig. 2. The details of the dataset can be found in [25]. Note that the datasets were recorded not for the detection of rare sounds but for the analysis of real-life sounds; thus, the analysis of seriously imbalanced data is a general problem in SED.

For the acoustic features, we extracted the 64-dimensional log mel-band energy at a sampling rate of 16 kHz, which was calculated every 40 ms with a 20 ms hop size. As the baseline network, we used CNN-BiGRU, which is widely used as a baseline system of SED, as in DCASE2018 Challenge Task 4 [26]. For each method, we conducted the evaluation experiment 10 times with random initial values for the model parameters. The performance of SED was evaluated using the frame-based macro- and micro-Fscores. Other experimental conditions are listed in Table 2.
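For reference, the log mel-band energy features described above could be extracted roughly as follows; this is an illustrative librosa sketch, and the window length and dB scaling are assumptions not specified in the paper.

```python
import librosa

def log_mel_energy(wav_path):
    """64-dimensional log mel-band energy: 16 kHz audio, 40 ms frames, 20 ms hop."""
    y, sr = librosa.load(wav_path, sr=16000)    # resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=640,        # 40 ms window at 16 kHz (assumed window length)
        hop_length=320,   # 20 ms hop
        n_mels=64)
    return librosa.power_to_db(mel).T           # shape: (n_frames, 64)
```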
Fig. 3. Average Fscores for SRL and the BCE loss with various weighting factors β.

Fig. 4. Average Fscores for AFL and the BCE loss with various weighting factors ζ.

Figures 3 and 4 show the average macro- and micro-Fscores obtained with the BCE loss, SRL, and AFL (γ = 0.) for various reweighting factors. In this experiment, we used CNN-BiGRU as the network architecture. The results show that when the loss of inactive frames is down-weighted, both the macro- and micro-Fscores tend to improve. This indicates that the inactive frames overwhelm model training, which leads to active sound events being ignored in model training.

Figure 5 shows the average macro- and micro-Fscores obtained with the BCE loss, IFL, and AFL (ζ = 0.) for various reweighting factors γ. The results show that even when the losses are reweighted to be more balanced between sound event classes, both the macro- and micro-Fscores do not improve much. This implies that the data imbalance between sound event classes has less impact on SED performance than the imbalance between active and inactive frames. Thus, in SED, the data imbalance between active and inactive frames is a more serious problem and needs to be addressed preferentially. Moreover, when both the imbalance between sound event classes and that between active and inactive frames are reweighted, as shown at the bottom of Table 3, both the macro- and micro-Fscores are improved more than by reweighting only the imbalance between event classes or only that between active and inactive frames.
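As a side note on the metrics, frame-based micro- and macro-Fscores can be computed from binarized frame-level predictions, for example with scikit-learn; this is an illustrative sketch, not necessarily the evaluation code used here.

```python
import numpy as np
from sklearn.metrics import f1_score

# z_true, z_pred: binary frame-level activity matrices of shape (n_frames, n_events)
z_true = np.random.randint(0, 2, (1000, 25))
z_pred = np.random.randint(0, 2, (1000, 25))

micro = f1_score(z_true, z_pred, average="micro", zero_division=0)  # pools all frames and classes
macro = f1_score(z_true, z_pred, average="macro", zero_division=0)  # averages per-class Fscores
print(micro, macro)
```

The micro-Fscore is dominated by frequent, long-duration events, whereas the macro-Fscore weights all classes equally, which is why both are reported in Table 3.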
Table 3. Average SED performance for various loss functions and networks

Method                                                              Micro-Fscore   Macro-Fscore   Micro-ROC AUC   Macro-ROC AUC
[Conventional methods]
CNN-BiGRU w/ BCE loss (Baseline)                                    40.10%         7.39%          89.15%          65.85%
CNN-BiGRU w/ α min-max subsampling & BCE loss                       44.12%         9.35%          90.27%          67.55%
CNN-BiGRU w/ batch dice loss                                        45.06%         9.79%          86.99%          63.89%
MTL of SED & SAD w/ BCE loss                                        43.35%         8.64%          91.40%          70.97%
Transformer w/ BCE loss                                             45.15%         9.27%          90.32%          66.64%
[Loss reweighting between active and inactive frames]
CNN-BiGRU w/ simple reweighting loss (β = 0.)                       46.44%         10.34%         91.07%          69.31%
CNN-BiGRU w/ asymmetric focal loss (γ = 0., ζ = 1.)                 47.78%         10.65%         92.35%          76.18%
CNN-BiGRU w/ focal batch Tversky loss (α = 0., β = 0., γ = 0.)      46.97%         10.28%         87.95%          65.08%
[Loss reweighting between sound event classes]
CNN-BiGRU w/ inverse frequency loss (C = 500)                       41.89%         7.57%          89.89%          66.46%
CNN-BiGRU w/ asymmetric focal loss (γ = 0., ζ = 0.)                 42.33%         8.13%          91.08%          70.46%
[Loss reweighting both between event classes and between active/inactive frames]
CNN-BiGRU w/ asymmetric focal loss (γ = 0., ζ = 1.)                 48.29%         10.46%         92.62%          77.03%
Transformer w/ asymmetric focal loss (γ = 0., ζ = 1.)               —              —              —               —
Fig. 5. Average Fscores for the BCE loss, IFL, and AFL with various weighting factors γ. For AFL, we set ζ = 0.

We compared the SED performance of our methods with those of other conventional methods, namely, SED using an α min-max subsampling method within a CRNN [18], batch dice loss-based SED [23, 27], multitask learning of SED and sound activity detection [28], and Transformer-based SED [12, 13]. For the Transformer-based SED, we used three CNN layers with the same structure as those of the CNN-BiGRU, followed by two Transformer encoder layers and two dense layers. Table 3 shows the micro- and macro-Fscores and the micro- and macro-ROC AUCs. The results show that down-weighting the loss of inactive frames using SRL, AFL, or FBTL improves both the Fscore and the ROC AUC score to a greater extent than the conventional methods. Surprisingly, even SRL with CNN-BiGRU outperforms the BCE loss with the Transformer-based SED system, which is a state-of-the-art technique in SED. However, FBTL does not improve the micro- or macro-ROC AUC. This is because FBTL does not consider true-negative samples during model training; therefore, the false-positive rate tends to be worse than those of the conventional BCE loss-based methods. Overall, the SED performance is improved by 9.04 and 3.72 percentage points in the micro- and macro-Fscores, respectively, by reweighting both types of imbalance using AFL and applying the Transformer network, compared with the baseline system.

To investigate the SED performance in detail, we show the Fscores for selected sound events in Table 4.
Table 4. Average Fscore for each sound event

Method                                 bird singing   car      drawer   washing dishes   water tap running
CNN-BiGRU w/ BCE                       17.79%         43.85%   0.00%    0.41%            43.23%
CNN-BiGRU w/ SRL                       32.69%         49.09%   0.00%    5.09%            69.37%
CNN-BiGRU w/ IFL                       19.28%         44.42%   0.00%    0.39%            32.76%
CNN-BiGRU w/ FBTL                      —              48.88%   0.00%    3.73%            60.31%
CNN-BiGRU w/ AFL (γ = 0., ζ = 1.)      34.50%         47.96%   0.00%    3.94%            73.90%
CNN-BiGRU w/ AFL (γ = 0., ζ = 0.)      20.15%         43.35%   0.00%    0.74%            40.29%
CNN-BiGRU w/ AFL (γ = 0., ζ = 1.)      28.08%         45.71%   0.00%    10.55%           74.64%
Transformer w/ AFL (γ = 0., ζ = 1.)    18.40%         —        —        —                —

In many sound events, IFL and AFL (γ = 0., ζ = 0.) do not markedly improve the SED performance, whereas AFL (γ = 0., ζ = 1.) achieves the best results. This finding also supports the notion that inactive frames have an adverse effect on SED performance, and that reweighting the imbalance between sound event classes and between active and inactive frames improves the performance considerably. However, even when we apply the reweighting methods, sound events that have a quite small number of frames, such as "drawer," are not well detected. These sound events should be considered in future experiments.
5. CONCLUSION
In this work, we studied the impact of the data imbalance between sound event classes and between active and inactive frames on SED performance. To investigate this impact, we introduced four loss functions, SRL, IFL, AFL, and FBTL, which can reweight the losses and relieve the imbalance in the contributions to model training. The experimental results showed that the inactive frames tend to overwhelm model training; consequently, the trained model tends to ignore active sound events. On the other hand, the results also showed that the data imbalance between sound event classes has less impact on SED performance than the imbalance between active and inactive frames. To avoid this negative impact on SED performance, the SED methods using the asymmetric focal loss and focal batch Tversky loss are effective and considerably improve the SED performance.
6. ACKNOWLEDGEMENT
This work was supported by JSPS KAKENHI Grant Number JP19K20304.

7. REFERENCES

[1] T. Virtanen, M. Plumbley, and D. Ellis, Eds., Computational Analysis of Sound Scenes and Events, Springer, 2017.
[2] K. Imoto, "Introduction to acoustic event and scene analysis," Acoustical Science and Technology, vol. 39, no. 3, pp. 182–188, 2018.
[3] K. Imoto, S. Shimauchi, H. Uematsu, and H. Ohmuro, "User activity estimation method based on probabilistic generative model of acoustic event sequence with user activity and its subordinate categories," Proc. INTERSPEECH, 2013.
[4] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, and N. Harada, "DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring," arXiv:2006.05822, pp. 1–5, 2020.
[5] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2009.
[6] Q. Jin, P. F. Schulam, S. Rawat, S. Burger, D. Ding, and F. Metze, "Event-based video retrieval using audio," Proc. INTERSPEECH, pp. 2085–2088, 2012.
[7] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling, "Towards the automatic classification of avian flight calls for bioacoustic monitoring," PLoS One, vol. 11, no. 11, 2016.
[8] Y. Okamoto, K. Imoto, N. Tsukahara, K. Sueda, R. Yamanishi, and Y. Yamashita, "Crow call detection using gated convolutional recurrent neural network," Proc. RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP), pp. 171–174, 2020.
[9] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135, 2017.
[10] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, and K. Takeda, "Duration-controlled LSTM for polyphonic sound event detection," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 11, pp. 2059–2070, 2017.
[11] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 6, pp. 1291–1303, 2017.
[12] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, and K. Takeda, "Weakly-supervised sound event detection with self-attention," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 66–70, 2020.
[13] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Sound event detection of weakly labelled data with CNN-Transformer and automatic threshold optimization," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 2450–2460, 2020.
[14] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," Proc. European Signal Processing Conference (EUSIPCO), pp. 1128–1132, 2016.
[15] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 85–92, 2017.
[16] Y. Chen and H. Jin, "Rare sound event detection using deep learning and data augmentation," Proc. INTERSPEECH, pp. 619–623, 2019.
[17] Y. Wang, J. Salamon, N. J. Bryan, and J. P. Bello, "Few-shot sound event detection," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81–85, 2020.
[18] H. Dinkel and K. Yu, "Duration robust weakly supervised sound event detection," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 311–315, 2020.
[19] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," Proc. IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
[20] K. Noh and J. H. Chang, "Joint optimization of deep neural network-based dereverberation and beamforming for sound event detection in multi-channel environments," Sensors, vol. 20, no. 7, pp. 1–13, 2020.
[21] X. Li, X. Sun, Y. Meng, J. Liang, F. Wu, and J. Li, "Dice loss for data-imbalanced NLP tasks," Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 465–476, 2020.
[22] S. S. M. Salehi, D. Erdogmus, and A. Gholipour, "Tversky loss function for image segmentation using 3D fully convolutional deep networks," Proc. International Workshop on Machine Learning in Medical Imaging (MLMI), pp. 379–387, 2017.
[23] O. Kodym, M. Spanel, and A. Herout, "Segmentation of head and neck organs at risk using CNN with batch dice loss," Proc. German Conference on Pattern Recognition (GCPR), pp. 105–114, 2018.
[24] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the variance of the adaptive learning rate and beyond," Proc. International Conference on Learning Representations (ICLR), 2020.
[26] Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 19–23, 2018.
[27] S. Park, S. Suh, and Y. Jeong, "Sound event localization and detection with various loss functions," Technical report of task 3 of DCASE Challenge 2020, pp. 1–5, 2020.
[28] A. Pankajakshan, H. L. Bear, and E. Benetos, "Polyphonic sound event and sound activity detection: A multi-task approach," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.