Sound Event Detection Based on Curriculum Learning Considering Learning Difficulty of Events
Noriyuki Tonami, Keisuke Imoto, Yuki Okamoto, Takahiro Fukumori, Yoichi Yamashita
Ritsumeikan University, Japan; Doshisha University, Japan
ABSTRACT
In conventional sound event detection (SED) models, two types of events, namely, those that are present and those that do not occur in an acoustic scene, are regarded as the same type of event. The conventional SED methods therefore cannot effectively exploit the difference between the two types of events. All time frames of sound events that do not occur in an acoustic scene only need to be regarded as inactive in the scene; that is, these events are easy to train. The time frames of events that are present in a scene must be classified as active or inactive in the acoustic scene; that is, these events are difficult to train. To take advantage of this difference in training difficulty, we apply curriculum learning to SED, where models are trained from easy-to-train to difficult-to-train events. To this end, we propose a new objective function for SED in which the events are trained from easy-to-train to difficult-to-train events. Experimental results show that the F-score of the proposed method is improved by 10.09 percentage points compared with that of conventional binary cross-entropy-based SED.
Index Terms — Sound event detection, acoustic scene, curriculum learning
1. INTRODUCTION
The analysis of various environmental sounds in everyday life has become an increasingly important area in signal processing [1]. The automatic analysis of environmental sounds will give rise to various applications, such as anomalous sound detection systems [2], automatic life-logging systems [3], monitoring systems [4], and bird-call detection systems [5].

Sound event detection (SED) is the task of recognizing sound event labels and their timestamps from a recording. In SED, models need to recognize multiple overlapping sound events in a time frame. Recently, neural-network-based SED models have seen increasingly rapid advances, such as the convolutional neural network (CNN) [6], recurrent neural network (RNN) [7], and convolutional recurrent neural network (CRNN) [8]. The CNN automatically extracts features and is robust to time and frequency shifts, whereas the RNN is good at modeling the temporal structure of an audio stream. Moreover, some works considering the relationship between sound events and scenes have been proposed. As an example of this relationship, "mouse clicking" occurs indoors, such as in an "office," whereas "car" tends to occur outdoors, such as in a "city center." On the basis of this idea, SED using information on the acoustic scene [9–11] and models combining SED and acoustic scene classification (ASC) [12–16] have been proposed. Heittola et al. [10] proposed an SED model using the results of ASC, where the ASC model is trained in the first stage and the SED model is then trained in the second stage with the ASC results. Tonami et al. [13] proposed multitask-learning-based models combining SED and ASC.

Fig. 1. Example of the difference in training difficulty between sound events (scene: airplane; (a) footsteps, (b) elephant, (c) birdsong).

In the conventional SED methods, two types of events, namely, those that are present and those that do not occur in an acoustic scene, are treated as the same type of event. The conventional SED methods cannot effectively utilize the difference between the two types of events. All time frames of events that do not occur in a scene only need to be treated as inactive in the acoustic scene, as shown in Fig. 1 ("elephant" and "birdsong" in "airplane"); i.e., the training of these easy-to-train events is the task of recognizing one class. On the other hand, the time frames of events that are present in an acoustic scene must be classified as active or inactive in the acoustic scene, as shown in Fig. 1 ("footsteps" in "airplane"); i.e., the training of these difficult-to-train events is a binary classification task.

To utilize this difference in training difficulty between sound events, we employ curriculum learning [17]. Curriculum learning is a method of learning data effectively by considering training difficulty, in which a model learns progressively from easy-to-train to difficult-to-train data. Recently, some works using curriculum learning have been carried out [18–20]. Lotfian and Busso [19] proposed a speech emotion recognition method based on curriculum learning, where the ambiguity of emotion is considered. In this paper, we propose an SED method using curriculum learning, in which strong labels are given for training. In the proposed method, SED models are trained from easy-to-train to difficult-to-train events on the basis of curriculum learning.
More specifically, we present a new objective function for SED that considers the training difficulty of events on the basis of curriculum learning.
Fig. 2. Examples of the early and late stages of training based on curriculum learning
2. CONVENTIONAL METHOD
SED estimates sound event labels and their onset/offset times from an audio clip. Recently, many neural-network-based methods have been studied. In most of them, acoustic features in the time-frequency domain are used as the input to the SED model. To optimize neural-network-based SED models, the binary cross-entropy (BCE) loss is used as follows:

L_{\rm BCE} = - \sum_{n=1}^{N} \sum_{t=1}^{T} \Big\{ z_{n,t} \log \sigma(y_{n,t}) + (1 - z_{n,t}) \log \big(1 - \sigma(y_{n,t})\big) \Big\},   (1)

where N and T indicate the numbers of sound event categories and time frames, respectively. z_{n,t} ∈ {0, 1} is the target label of event n at time t: if the event is active, z_{n,t} is 1; otherwise, z_{n,t} is 0. y_{n,t} represents the output of the network for event n at time t, and σ(·) denotes the sigmoid function.
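For reference, the frame-wise BCE of Eq. 1 can be written as the following minimal sketch (NumPy; the array shapes and function name are our own assumptions, not from the paper):

```python
import numpy as np

def bce_loss(y, z, eps=1e-7):
    """Frame-wise binary cross-entropy of Eq. (1).

    y : (N, T) array of network outputs (logits) for N events over T frames.
    z : (N, T) array of target labels, 1 = active, 0 = inactive.
    """
    p = 1.0 / (1.0 + np.exp(-y))      # sigma(y_{n,t})
    p = np.clip(p, eps, 1.0 - eps)    # avoid log(0)
    return -np.sum(z * np.log(p) + (1.0 - z) * np.log(1.0 - p))
```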
3. PROPOSED METHOD

3.1. Training difficulty of events considering scenes
In the conventional SED methods, two types of events, namely, those that are present and those that do not occur in an acoustic scene, are treated as the same type of event. The conventional SED methods cannot effectively exploit the difference between the two types of sound events. All time frames of events that do not occur in an acoustic scene only need to be regarded as inactive in that scene, as seen in Fig. 1 ("elephant" and "birdsong" in "airplane"). The training of these sound events is the task of recognizing one class (inactive); that is, the events are easy to train. On the other hand, the time frames of sound events that are present in an acoustic scene need to be classified as active or inactive in the acoustic scene, as shown in Fig. 1 ("footsteps" in "airplane"). The training of these sound events is the task of classifying two classes (active or inactive); that is, the events are difficult to train. In short, the sound events that are present in an acoustic scene are harder to train than the events that do not occur in the acoustic scene, as shown in Fig. 2.
3.2. Curriculum-learning-based objective function

As mentioned in Sect. 3.1, there are differences in training difficulty between sound events when acoustic scenes are considered. In the proposed method, we employ curriculum learning to take advantage of this difference. To incorporate the concept of curriculum learning into the BCE, the following loss function is used instead of Eq. 1:

L_{\rm prop} = - \sum_{n=1}^{N} \sum_{t=1}^{T} g_n \Big\{ z_{n,t} \log \sigma(y_{n,t}) + (1 - z_{n,t}) \log \big(1 - \sigma(y_{n,t})\big) \Big\},   (2)

where g_n is a gate function that controls the training weights of the two types of events. More specifically, the gate function is calculated as

g_n = \alpha_s f_n + (1 - \alpha_s)(1 - f_n),   (3)

where α_s is a progressive parameter that is changed from 0 to 1 with time step s (epoch) during training, and f_n is an event flag. If event n occurs at least once in the acoustic scene of the input audio, the flag is 1; otherwise, it is 0.

As shown in Fig. 2, in the early stage of training, only the events that do not occur in an acoustic scene are trained. On the other hand, in the late stage of training, only the events that are present in an acoustic scene are trained. Note that whether an event is difficult or easy to train is determined by the acoustic scene label of each audio clip. For example, suppose a dataset includes a scene A, events a and b occur at least once in scene A, and event c does not occur in scene A. When the scene label of the input audio is A, a and b are regarded as difficult-to-train events, and c is regarded as an easy-to-train event.
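A minimal sketch of the gate function (Eq. 3) and the curriculum-weighted BCE loss (Eq. 2) is shown below (NumPy; function and variable names are ours, and the scheduler follows the exponential schedule of Eq. 4 in Sect. 4):

```python
import numpy as np

def gate(alpha, f):
    """Gate function g_n of Eq. (3).

    alpha : progressive parameter alpha_s in [0, 1].
    f     : (N,) event flags; f[n] = 1 if event n occurs at least once
            in the acoustic scene of the input clip, else 0.
    """
    return alpha * f + (1.0 - alpha) * (1.0 - f)

def curriculum_bce_loss(y, z, f, alpha, eps=1e-7):
    """Curriculum-weighted BCE of Eq. (2) for one clip.

    y, z : (N, T) logits and binary targets for N events over T frames.
    """
    p = np.clip(1.0 / (1.0 + np.exp(-y)), eps, 1.0 - eps)
    frame_bce = z * np.log(p) + (1.0 - z) * np.log(1.0 - p)   # (N, T)
    g = gate(alpha, f)                                        # (N,)
    return -np.sum(g[:, None] * frame_bce)

def alpha_schedule(s, s_max, lam=2.0):
    """Exponential scheduler of Eq. (4): alpha_s = (s / s_max) ** lambda."""
    return (s / s_max) ** lam
```

With this weighting, alpha is close to 0 in the early epochs, so only events that do not occur in the scene (f[n] = 0) contribute to the loss; as alpha approaches 1, training shifts to the events that are present in the scene.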
4. EXPERIMENTS

4.1. Experimental conditions
To evaluate the performance of the proposed method, we conducted evaluation experiments using the TUT Sound Events 2016 [21], TUT Sound Events 2017 [22], TUT Acoustic Scenes 2016 [21], and TUT Acoustic Scenes 2017 [22] datasets.

Fig. 3. Number of frames of sound events on the development set used for our experiments.
Table 1. Experimental conditions

Acoustic feature        Log-mel energy (64 dim.)
Frame length / shift    40 ms / 20 ms
Length of sound clip    10 s
Network architecture    3 CNN + 1 BiGRU + 1 fully connected layer
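A minimal sketch of extracting the log-mel features in Table 1 is given below (assuming the librosa library and a 44.1 kHz sampling rate; the exact log compression used in the paper is not specified, and power_to_db is one common choice):

```python
import librosa

def logmel_features(wav_path, sr=44100, n_mels=64):
    """64-dim log-mel energies with 40 ms frames and a 20 ms shift (Table 1)."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    n_fft = int(0.040 * sr)   # 40 ms frame length
    hop = int(0.020 * sr)     # 20 ms frame shift (50% overlap)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel)   # (n_mels, T)
```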
From these datasets, we selected sound clips including four acoustic scenes: "home" and "residential area" (TUT Sound Events 2016), "city center" (TUT Sound Events 2017, TUT Acoustic Scenes 2017), and "office" (TUT Acoustic Scenes 2016), which contain 266 min of audio (development set, 192 min; evaluation set, 74 min). The acoustic scenes "office" in TUT Acoustic Scenes 2016 and "city center" in TUT Acoustic Scenes 2017 did not have sound event labels; we thus manually annotated these sound clips with sound event labels by the procedure described in [21, 22]. The selected sound clips include 25 types of sound event labels. Fig. 3 shows the number of active time frames of each sound event on the development set that we used. The event labels annotated for our experiments are available in [23].

As acoustic features, we used 64-dimensional log-mel energies calculated for each 40 ms time frame with 50% overlap. This setting follows the baseline system of the DCASE2018 Challenge task 4 [24]. As the baseline SED model, we used the convolutional neural network and bidirectional gated recurrent unit (CNN–BiGRU) [8]. Moreover, to verify the usefulness of the proposed method, we used a model combining SED and sound activity detection (SAD) based on multitask learning (MTL), referred to as "MTL of SED & SAD" [25], and a model combining SED and ASC, referred to as "MTL of SED & ASC" [13]. Sound activity detection is the task of recognizing whether any event is active in a time frame. We chose MTL of SED & SAD because this recent method, which uses no information on the scene, is simple but effective. MTL of SED & ASC is multitask-learning-based SED with ASC, which uses scene labels through ASC. Other experimental conditions are listed in Table 1, where X × Y denotes a filter size of X along the frequency axis by Y along the time axis. We conducted the experiments using ten initial values. As the evaluation metric, a segment-based metric [26] is used, where the size of a segment is set to the frame length, i.e., a frame-based metric.

In this work, we adopt the following exponential scheduler for the progressive parameter in Eq. 3:

\alpha_s = \left( \frac{s}{s_{\rm max}} \right)^{\lambda},   (4)

where s and s_max represent the current and maximum epochs, respectively. λ is tuned using the development dataset and is set to 2.0.

4.2. Experimental results

Table 2 shows the SED performance in terms of the segment-based F-score and error rate. In Table 2, micro and macro indicate the overall and class-average scores, respectively, and the numbers to the right of ± represent standard deviations. "BCE" is the CNN–BiGRU using the BCE loss. "Proposed method" represents the SED performance using Eqs. 2 and 3 with the CNN–BiGRU. "Proposed+MTL of SED & SAD" indicates the SED performance using Eqs. 2 and 3 with SAD. "Proposed+MTL of SED & ASC" denotes multitask-learning-based SED with ASC using the proposed objective function for SED. The results show that the proposed method achieves a better performance than the conventional BCE. Moreover, when using SAD or the model combining SED and ASC with the proposed objective function, the SED performance is better than those of the conventional MTL of SED & SAD and MTL of SED & ASC.
Table 3. SED performance for each of the 25 sound events in terms of the segment-based F-score and error rate for BCE and the proposed method
Table 2. Overall performance of SED

Method                        F-score (micro)   F-score (macro)   Error rate (micro)   Error rate (macro)
BCE                           25.30%            7.44%             1.00                 1.21
MTL of SED & SAD              26.62%            7.36%             1.02                 1.20
MTL of SED & ASC              26.12%            7.46%             0.97                 1.18
Proposed method
Proposed+MTL of SED & SAD
Proposed+MTL of SED & ASC

In particular, "Proposed method" improves the F-score of SED by 10.09 percentage points compared with that of conventional SED using the BCE. These results indicate that the proposed method, which considers the training difficulty of events, enables more effective SED than the conventional method using the BCE.

To investigate the SED performance in detail, we observed the segment-based F-score and error rate for each event, as listed in Table 3. As shown in Table 3, the proposed method outperforms conventional SED using the BCE for many events. In particular, the F-scores for "fan," "washing dishes," and "water tap running" are improved more significantly by the proposed method than by the conventional method. This might be because the active frames of these events occur continuously; that is, these events are relatively easy to detect compared with other events. On the other hand, the F-scores for "(object) rustling," "brakes squeaking," and "wind blowing" do not improve. This is because the numbers of active frames of these events are too small, as shown in Fig. 3. In other words, with the proposed method, these active frames are trained mainly in the late stage of training, which may also lead to the poor results for some events.
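For reference, a minimal sketch of the segment-based F-score and error rate reported in Tables 2 and 3 is given below, following the definitions in [26] (in our own notation; here each segment is one frame, as stated in Sect. 4.1, and in practice the sed_eval toolkit provides a reference implementation):

```python
import numpy as np

def segment_based_metrics(ref, sys):
    """Segment-based (micro) F-score and error rate as in [26].

    ref, sys : (N, T) binary matrices of reference and system activity
               for N event classes over T segments (one frame per segment).
    """
    tp = np.sum((ref == 1) & (sys == 1))
    fp = np.sum((ref == 0) & (sys == 1))
    fn = np.sum((ref == 1) & (sys == 0))
    f_score = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0

    # Per-segment substitutions, deletions, and insertions.
    fn_k = np.sum((ref == 1) & (sys == 0), axis=0)   # misses per segment
    fp_k = np.sum((ref == 0) & (sys == 1), axis=0)   # false alarms per segment
    n_k = np.sum(ref == 1, axis=0)                   # active references per segment
    s = np.minimum(fn_k, fp_k).sum()
    d = np.maximum(0, fn_k - fp_k).sum()
    i = np.maximum(0, fp_k - fn_k).sum()
    error_rate = (s + d + i) / n_k.sum() if n_k.sum() > 0 else 0.0
    return f_score, error_rate
```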
5. CONCLUSION
In this paper, we proposed a curriculum-learning-based objective function for SED. In the proposed method, the difference in training difficulty between sound events, determined by considering acoustic scenes, is incorporated into the conventional BCE loss. More specifically, SED models using the proposed method are trained from easy-to-train to difficult-to-train events during training. The experimental results indicate that the proposed method improves the F-score of SED by 10.09 percentage points compared with that of the conventional CNN–BiGRU using the BCE loss. In future work, we will investigate a more effective SED method considering the relationship between sound events and acoustic scenes.
6. ACKNOWLEDGEMENT
This work was supported by JSPS KAKENHI Grant Number JP19K20304.
7. REFERENCES

[1] K. Imoto, "Introduction to acoustic event and scene analysis," Acoust. Sci. Tech., vol. 39, no. 3, pp. 182–188, 2018.
[2] C. Chan and E. W. M. Yu, "An abnormal sound detection and classification system for surveillance applications," Proc. European Signal Processing Conference (EUSIPCO), pp. 1851–1855, 2010.
[3] J. A. Stork, L. Spinello, J. Silva, and K. O. Arras, "Audio-based human activity recognition using non-Markovian ensemble voting," Proc. IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 509–514, 2012.
[4] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2009.
[5] Y. Okamoto, K. Imoto, N. Tsukahara, K. Sueda, R. Yamanishi, and Y. Yamashita, "Crow call detection using gated convolutional recurrent neural network," Proc. RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP), pp. 171–174, 2020.
[6] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135, 2017.
[7] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, and K. Takeda, "Duration-controlled LSTM for polyphonic sound event detection," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2059–2070, 2017.
[8] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 6, pp. 1291–1303, 2017.
[9] A. Mesaros, T. Heittola, and A. Klapuri, "Latent semantic analysis in sound event detection," Proc. European Signal Processing Conference (EUSIPCO), pp. 1307–1311, 2011.
[10] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Context-dependent sound event detection," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2013, no. 1, pp. 1–13, 2013.
[11] K. Imoto and S. Shimauchi, "Acoustic scene analysis based on hierarchical generative model of acoustic event sequence," IEICE Trans. Inf. Syst., vol. E99-D, no. 10, pp. 2539–2549, 2016.
[12] H. L. Bear, I. Nolasco, and E. Benetos, "Towards joint sound scene and polyphonic sound event recognition," Proc. INTERSPEECH, pp. 4594–4598, 2019.
[13] N. Tonami, K. Imoto, M. Niitsuma, R. Yamanishi, and Y. Yamashita, "Joint analysis of acoustic events and scenes based on multitask learning," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 333–337, 2019.
[14] K. Imoto, N. Tonami, Y. Koizumi, M. Yasuda, R. Yamanishi, and Y. Yamashita, "Sound event detection by multitask learning of sound events and scenes with soft scene labels," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 621–625, 2020.
[15] T. Komatsu, K. Imoto, and M. Togami, "Scene-dependent acoustic event detection with scene conditioning and fake-scene-conditioned loss," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 646–650, 2020.
[16] J. Jung, H. Shin, J. Kim, and H. Yu, "DCASENet: An integrated pretrained deep neural network for detecting and classifying acoustic scenes and events," arXiv preprint arXiv:2009.09642, 2020.
[17] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," Proc. International Conference on Machine Learning (ICML), pp. 41–48, 2009.
[18] S. Braun, D. Neil, and S. Liu, "A curriculum learning method for improved noise robustness in automatic speech recognition," Proc. European Signal Processing Conference (EUSIPCO), pp. 548–552, 2017.
[19] R. Lotfian and C. Busso, "Curriculum learning for speech emotion recognition from crowdsourced labels," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 27, no. 4, pp. 815–826, 2019.
[20] C. Wang, Y. Wu, S. Liu, M. Zhou, and Z. Yan, "Curriculum pre-training for end-to-end speech translation," Proc. Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3728–3738, 2020.
[21] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," Proc. European Signal Processing Conference (EUSIPCO), pp. 1128–1132, 2016.
[22] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 Challenge setup: Tasks, datasets and baseline system," Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 85–92, 2017.
[23]
[24] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 19–23, 2018.
[25] A. Pankajakshan, H. L. Bear, and E. Benetos, "Polyphonic sound event and sound activity detection: A multi-task approach," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 323–327, 2019.
[26] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.