Sound Event Detection Based on Curriculum Learning Considering Learning Difficulty of Events
Noriyuki Tonami, Keisuke Imoto, Yuki Okamoto, Takahiro Fukumori, Yoichi Yamashita
Ritsumeikan University, Japan; Doshisha University, Japan
ABSTRACT
In conventional sound event detection (SED) models, two types of events, namely, those that are present and those that do not occur in an acoustic scene, are regarded as the same type of event. The conventional SED methods therefore cannot effectively exploit the difference between the two types of events. All time frames of sound events that do not occur in an acoustic scene only need to be regarded as inactive in the scene; that is, these events are easy to train. The time frames of events that are present in a scene must be classified as active or inactive in the acoustic scene; that is, these events are difficult to train. To take advantage of this difference in training difficulty, we apply curriculum learning to SED, where models are trained from easy-to-train to difficult-to-train events. To this end, we propose a new objective function for SED in which the events are trained from easy-to-train to difficult-to-train events. Experimental results show that the F-score of the proposed method is improved by 10.09 percentage points compared with that of conventional binary cross-entropy-based SED.
Index Terms — Sound event detection, acoustic scene, curriculum learning
1. INTRODUCTION
The analysis of various environmental sounds in everyday life has become an increasingly important area in signal processing [1]. The automatic analysis of environmental sounds will give rise to various applications, such as anomalous sound detection systems [2], automatic life-logging systems [3], monitoring systems [4], and bird-call detection systems [5].

Sound event detection (SED) is the task of recognizing sound event labels and their timestamps from a recording. In SED, models need to recognize multiple overlapping sound events in a time frame. Recently, neural-network-based SED models have seen increasingly rapid advances, such as the convolutional neural network (CNN) [6], recurrent neural network (RNN) [7], and convolutional recurrent neural network (CRNN) [8]. The CNN automatically extracts features and is robust to time and frequency shifts, whereas the RNN is good at modeling the temporal structure of an audio stream. Moreover, some works considering the relationship between sound events and scenes have been proposed. As an example of this relationship, "mouse clicking" occurs indoors, such as in an "office," whereas "car" tends to occur outdoors, such as in a "city center." On the basis of this idea, SED using information on the acoustic scene [9–11] and models combining SED and acoustic scene classification (ASC) [12–16] have been proposed. Heittola et al. [10] proposed an SED model using the results of ASC, where the ASC model is trained in the first stage and the SED model is then trained in the second stage with the ASC results. Tonami et al. [13] proposed multitask-learning-based models combining SED and ASC.

Fig. 1. Example of the difference in training difficulty between sound events (scene: airplane; (a) footsteps, (b) elephant, (c) birdsong).

In the conventional SED methods, two types of events, namely, those that are present and those that do not occur in an acoustic scene, are treated as the same type of event. The conventional SED methods cannot effectively utilize the difference between the two types of events. All time frames of events that do not occur in a scene only need to be treated as inactive in the acoustic scene, as shown in Fig. 1 ("elephant" and "birdsong" in "airplane"); i.e., the training of these easy-to-train events is the task of recognizing one class. On the other hand, the time frames of events that are present in an acoustic scene must be classified as active or inactive in the acoustic scene, as shown in Fig. 1 ("footsteps" in "airplane"); i.e., the training of these difficult-to-train events is a binary classification task.

To utilize this difference in training difficulty between sound events, we employ curriculum learning [17]. Curriculum learning is a method of learning data effectively by considering training difficulty, in which a model learns progressively from easy-to-train to difficult-to-train data. Recently, some works using curriculum learning have been carried out [18–20]. Lotfian and Busso [19] proposed a speech emotion recognition method based on curriculum learning, where the ambiguity of emotion is considered. In this paper, we propose an SED method using curriculum learning, in which strong labels are given for training. In the proposed method, SED models are trained from easy-to-train to difficult-to-train events on the basis of curriculum learning.
More specifically, we present a new objective function for SED that considers the training difficulty of events on the basis of curriculum learning.
Fig. 2. Examples of the early and late stages of training based on curriculum learning
2. CONVENTIONAL METHOD
SED estimates sound event labels and their onset/offset times from an audio clip. Recently, many neural-network-based methods have been studied. In most of them, acoustic features in the time-frequency domain are used as the input to the SED model. To optimize neural-network-based SED models, the binary cross-entropy (BCE) loss is used as follows:

L_{\rm BCE} = - \sum_{n=1}^{N} \sum_{t=1}^{T} \Big\{ z_{n,t} \log \sigma(y_{n,t}) + (1 - z_{n,t}) \log \big(1 - \sigma(y_{n,t})\big) \Big\},   (1)

where N and T indicate the numbers of sound event categories and time frames, respectively. z_{n,t} ∈ {0, 1} is the target label of event n at time t: if the event is active, z_{n,t} is 1; otherwise, z_{n,t} is 0. y_{n,t} represents the output of the network for event n at time t, and σ(·) denotes the sigmoid function.
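For reference, the frame-wise BCE of Eq. 1 can be written as the following minimal sketch (NumPy; the array shapes and function name are our own assumptions, not from the paper):

```python
import numpy as np

def bce_loss(y, z, eps=1e-7):
    """Frame-wise binary cross-entropy of Eq. (1).

    y : (N, T) array of network outputs (logits) for N events over T frames.
    z : (N, T) array of target labels, 1 = active, 0 = inactive.
    """
    p = 1.0 / (1.0 + np.exp(-y))      # sigma(y_{n,t})
    p = np.clip(p, eps, 1.0 - eps)    # avoid log(0)
    return -np.sum(z * np.log(p) + (1.0 - z) * np.log(1.0 - p))
```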
3. PROPOSED METHOD

3.1. Training difficulty of events considering scenes
In the conventional SED methods, two types of events, namely, those that are present and those that do not occur in an acoustic scene, are treated as the same type of event. The conventional SED methods cannot effectively exploit the difference between the two types of sound events. All time frames of events that do not occur in an acoustic scene only need to be regarded as inactive in that scene, as seen in Fig. 1 ("elephant" and "birdsong" in "airplane"). The training of these sound events is the task of recognizing one class (inactive); that is, the events are easy to train. On the other hand, the time frames of sound events that are present in an acoustic scene need to be classified as active or inactive in the acoustic scene, as shown in Fig. 1 ("footsteps" in "airplane"). The training of these sound events is the task of classifying two classes (active or inactive); that is, the events are difficult to train. In short, the sound events that are present in an acoustic scene are harder to train than the events that do not occur in the acoustic scene, as shown in Fig. 2.
3.2. Curriculum-learning-based objective function

As mentioned in Sect. 3.1, there are differences in training difficulty between sound events when acoustic scenes are considered. In the proposed method, we employ curriculum learning to take advantage of this difference. To incorporate the concept of curriculum learning into the BCE, the following loss function is used instead of Eq. 1:

L_{\rm prop} = - \sum_{n=1}^{N} \sum_{t=1}^{T} g_n \Big\{ z_{n,t} \log \sigma(y_{n,t}) + (1 - z_{n,t}) \log \big(1 - \sigma(y_{n,t})\big) \Big\},   (2)

where g_n is a gate function that controls the training weights of the two types of events. More specifically, the gate function is calculated as

g_n = \alpha_s f_n + (1 - \alpha_s)(1 - f_n),   (3)

where α_s is a progressive parameter that is changed from 0 to 1 with time step s (epoch) during training, and f_n is an event flag. If event n occurs at least once in the acoustic scene of the input audio, the flag is 1; otherwise, it is 0.

As shown in Fig. 2, in the early stage of training, only the events that do not occur in an acoustic scene are trained. On the other hand, in the late stage of training, only the events that are present in an acoustic scene are trained. Note that whether an event is difficult or easy to train is determined by the acoustic scene label of each audio clip. For example, suppose a dataset includes a scene A, events a and b occur at least once in scene A, and event c does not occur in scene A. When the scene label of the input audio is A, a and b are regarded as difficult-to-train events, and c is regarded as an easy-to-train event.
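A minimal sketch of the gate function (Eq. 3) and the curriculum-weighted BCE loss (Eq. 2) is shown below (NumPy; function and variable names are ours, and the scheduler follows the exponential schedule of Eq. 4 in Sect. 4):

```python
import numpy as np

def gate(alpha, f):
    """Gate function g_n of Eq. (3).

    alpha : progressive parameter alpha_s in [0, 1].
    f     : (N,) event flags; f[n] = 1 if event n occurs at least once
            in the acoustic scene of the input clip, else 0.
    """
    return alpha * f + (1.0 - alpha) * (1.0 - f)

def curriculum_bce_loss(y, z, f, alpha, eps=1e-7):
    """Curriculum-weighted BCE of Eq. (2) for one clip.

    y, z : (N, T) logits and binary targets for N events over T frames.
    """
    p = np.clip(1.0 / (1.0 + np.exp(-y)), eps, 1.0 - eps)
    frame_bce = z * np.log(p) + (1.0 - z) * np.log(1.0 - p)   # (N, T)
    g = gate(alpha, f)                                        # (N,)
    return -np.sum(g[:, None] * frame_bce)

def alpha_schedule(s, s_max, lam=2.0):
    """Exponential scheduler of Eq. (4): alpha_s = (s / s_max) ** lambda."""
    return (s / s_max) ** lam
```

With this weighting, alpha is close to 0 in the early epochs, so only events that do not occur in the scene (f[n] = 0) contribute to the loss; as alpha approaches 1, training shifts to the events that are present in the scene.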
4. EXPERIMENTS

4.1. Experimental conditions
To evaluate the performance of the proposed method, we conducted evaluation experiments using the TUT Sound Events 2016 [21], TUT Sound Events 2017 [22], TUT Acoustic Scenes 2016 [21], and TUT Acoustic Scenes 2017 [22] datasets.

Fig. 3. Number of frames of sound events on the development set used for our experiments.
Table 1. Experimental conditions

Acoustic feature        Log-mel energy (64 dim.)
Frame length / shift    40 ms / 20 ms
Length of sound clip    10 s
Network architecture    3 CNN + 1 BiGRU + 1 fully connected layer
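A minimal sketch of extracting the log-mel features in Table 1 is given below (assuming the librosa library and a 44.1 kHz sampling rate; the exact log compression used in the paper is not specified, and power_to_db is one common choice):

```python
import librosa

def logmel_features(wav_path, sr=44100, n_mels=64):
    """64-dim log-mel energies with 40 ms frames and a 20 ms shift (Table 1)."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    n_fft = int(0.040 * sr)   # 40 ms frame length
    hop = int(0.020 * sr)     # 20 ms frame shift (50% overlap)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel)   # (n_mels, T)
```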
From these datasets, we selected sound clips including four acoustic scenes: "home" and "residential area" (TUT Sound Events 2016), "city center" (TUT Sound Events 2017, TUT Acoustic Scenes 2017), and "office" (TUT Acoustic Scenes 2016), which contain 266 min of audio (development set, 192 min; evaluation set, 74 min). The acoustic scenes "office" in TUT Acoustic Scenes 2016 and "city center" in TUT Acoustic Scenes 2017 did not have sound event labels; we thus manually annotated these sound clips with sound event labels by the procedure described in [21, 22]. The selected sound clips include 25 types of sound event labels. Fig. 3 shows the number of active time frames of each sound event on the development set that we used. The event labels annotated for our experiments are available in [23].

As acoustic features, we used 64-dimensional log-mel energies calculated for each 40 ms time frame with 50% overlap. This setting follows the baseline system of the DCASE2018 Challenge task 4 [24]. As the baseline SED model, we used the convolutional neural network and bidirectional gated recurrent unit (CNN–BiGRU) [8]. Moreover, to verify the usefulness of the proposed method, we used a model combining SED and sound activity detection (SAD) based on multitask learning (MTL), referred to as "MTL of SED & SAD" [25], and a model combining SED and ASC, referred to as "MTL of SED & ASC" [13]. Sound activity detection is the task of recognizing whether any event is active in a time frame. We chose MTL of SED & SAD because this recent method, which uses no information on the scene, is simple but effective. MTL of SED & ASC is multitask-learning-based SED with ASC, which uses scene labels through ASC. Other experimental conditions are listed in Table 1, where X × Y denotes a filter size of X along the frequency axis by Y along the time axis. We conducted the experiments using ten initial values. As the evaluation metric, a segment-based metric [26] is used, where the size of a segment is set to the frame length, i.e., a frame-based metric.

In this work, we adopt the following exponential scheduler for the progressive parameter in Eq. 3:

\alpha_s = \left( \frac{s}{s_{\rm max}} \right)^{\lambda},   (4)

where s and s_max represent the current and maximum epochs, respectively. λ is tuned using the development dataset and is set to 2.0.

4.2. Experimental results

Table 2 shows the SED performance in terms of the segment-based F-score and error rate. In Table 2, micro and macro indicate the overall and class-average scores, respectively, and the numbers to the right of ± represent standard deviations. "BCE" is the CNN–BiGRU using the BCE loss. "Proposed method" represents the SED performance using Eqs. 2 and 3 with the CNN–BiGRU. "Proposed+MTL of SED & SAD" indicates the SED performance using Eqs. 2 and 3 with SAD. "Proposed+MTL of SED & ASC" denotes multitask-learning-based SED with ASC using the proposed objective function for SED. The results show that the proposed method achieves a better performance than the conventional BCE. Moreover, when using SAD or the model combining SED and ASC with the proposed objective function, the SED performance is better than those of the conventional MTL of SED & SAD and MTL of SED & ASC.
Table 3. SED performance for each of the 25 sound events in terms of the segment-based F-score and error rate for BCE and the proposed method
Table 2. Overall performance of SED

Method                        F-score (micro)   F-score (macro)   Error rate (micro)   Error rate (macro)
BCE                           25.30%            7.44%             1.00                 1.21
MTL of SED & SAD              26.62%            7.36%             1.02                 1.20
MTL of SED & ASC              26.12%            7.46%             0.97                 1.18
Proposed method
Proposed+MTL of SED & SAD
Proposed+MTL of SED & ASC

In particular, "Proposed method" improves the F-score of SED by 10.09 percentage points compared with that of conventional SED using the BCE. These results indicate that the proposed method, which considers the training difficulty of events, enables more effective SED than the conventional method using the BCE.

To investigate the SED performance in detail, we observed the segment-based F-score and error rate for each event, as listed in Table 3. As shown in Table 3, the proposed method outperforms conventional SED using the BCE for many events. In particular, the F-scores for "fan," "washing dishes," and "water tap running" are improved more significantly by the proposed method than by the conventional method. This might be because the active frames of these events occur continuously; that is, these events are relatively easy to detect compared with other events. On the other hand, the F-scores for "(object) rustling," "brakes squeaking," and "wind blowing" do not improve. This is because the numbers of active frames of these events are too small, as shown in Fig. 3. In other words, with the proposed method, these active frames are trained mainly in the late stage of training, which may also lead to the poor results for some events.
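For reference, a minimal sketch of the segment-based F-score and error rate reported in Tables 2 and 3 is given below, following the definitions in [26] (in our own notation; here each segment is one frame, as stated in Sect. 4.1, and in practice the sed_eval toolkit provides a reference implementation):

```python
import numpy as np

def segment_based_metrics(ref, sys):
    """Segment-based (micro) F-score and error rate as in [26].

    ref, sys : (N, T) binary matrices of reference and system activity
               for N event classes over T segments (one frame per segment).
    """
    tp = np.sum((ref == 1) & (sys == 1))
    fp = np.sum((ref == 0) & (sys == 1))
    fn = np.sum((ref == 1) & (sys == 0))
    f_score = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0

    # Per-segment substitutions, deletions, and insertions.
    fn_k = np.sum((ref == 1) & (sys == 0), axis=0)   # misses per segment
    fp_k = np.sum((ref == 0) & (sys == 1), axis=0)   # false alarms per segment
    n_k = np.sum(ref == 1, axis=0)                   # active references per segment
    s = np.minimum(fn_k, fp_k).sum()
    d = np.maximum(0, fn_k - fp_k).sum()
    i = np.maximum(0, fp_k - fn_k).sum()
    error_rate = (s + d + i) / n_k.sum() if n_k.sum() > 0 else 0.0
    return f_score, error_rate
```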
5. CONCLUSION
In this paper, we proposed a curriculum-learning-based objective function for SED. In the proposed method, the difference in training difficulty between sound events, determined by considering acoustic scenes, is incorporated into the conventional BCE loss. More specifically, SED models using the proposed method are trained from easy-to-train to difficult-to-train events during training. The experimental results indicate that the proposed method improves the F-score of SED by 10.09 percentage points compared with that of the conventional CNN–BiGRU using the BCE loss. In future work, we will investigate a more effective SED method considering the relationship between sound events and acoustic scenes.
6. ACKNOWLEDGEMENT
This work was supported by JSPS KAKENHI Grant Number JP19K20304.
7. REFERENCES

[1] K. Imoto, "Introduction to acoustic event and scene analysis," Acoust. Sci. Tech., vol. 39, no. 3, pp. 182–188, 2018.
[2] C. Chan and E. W. M. Yu, "An abnormal sound detection and classification system for surveillance applications," Proc. European Signal Processing Conference (EUSIPCO), pp. 1851–1855, 2010.
[3] J. A. Stork, L. Spinello, J. Silva, and K. O. Arras, "Audio-based human activity recognition using non-Markovian ensemble voting," Proc. IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 509–514, 2012.
[4] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2009.
[5] Y. Okamoto, K. Imoto, N. Tsukahara, K. Sueda, R. Yamanishi, and Y. Yamashita, "Crow call detection using gated convolutional recurrent neural network," Proc. RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP), pp. 171–174, 2020.
[6] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135, 2017.
[7] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, and K. Takeda, "Duration-controlled LSTM for polyphonic sound event detection," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2059–2070, 2017.
[8] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 6, pp. 1291–1303, 2017.
[9] A. Mesaros, T. Heittola, and A. Klapuri, "Latent semantic analysis in sound event detection," Proc. European Signal Processing Conference (EUSIPCO), pp. 1307–1311, 2011.
[10] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Context-dependent sound event detection," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2013, no. 1, pp. 1–13, 2013.
[11] K. Imoto and S. Shimauchi, "Acoustic scene analysis based on hierarchical generative model of acoustic event sequence," IEICE Trans. Inf. Syst., vol. E99-D, no. 10, pp. 2539–2549, 2016.
[12] H. L. Bear, I. Nolasco, and E. Benetos, "Towards joint sound scene and polyphonic sound event recognition," Proc. INTERSPEECH, pp. 4594–4598, 2019.
[13] N. Tonami, K. Imoto, M. Niitsuma, R. Yamanishi, and Y. Yamashita, "Joint analysis of acoustic events and scenes based on multitask learning," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 333–337, 2019.
[14] K. Imoto, N. Tonami, Y. Koizumi, M. Yasuda, R. Yamanishi, and Y. Yamashita, "Sound event detection by multitask learning of sound events and scenes with soft scene labels," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 621–625, 2020.
[15] T. Komatsu, K. Imoto, and M. Togami, "Scene-dependent acoustic event detection with scene conditioning and fake-scene-conditioned loss," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 646–650, 2020.
[16] J. Jung, H. Shin, J. Kim, and H. Yu, "DCASENet: An integrated pretrained deep neural network for detecting and classifying acoustic scenes and events," arXiv preprint arXiv:2009.09642, 2020.
[17] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," Proc. International Conference on Machine Learning (ICML), pp. 41–48, 2009.
[18] S. Braun, D. Neil, and S. Liu, "A curriculum learning method for improved noise robustness in automatic speech recognition," Proc. European Signal Processing Conference (EUSIPCO), pp. 548–552, 2017.
[19] R. Lotfian and C. Busso, "Curriculum learning for speech emotion recognition from crowdsourced labels," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 27, no. 4, pp. 815–826, 2019.
[20] C. Wang, Y. Wu, S. Liu, M. Zhou, and Z. Yan, "Curriculum pre-training for end-to-end speech translation," Proc. Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3728–3738, 2020.
[21] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," Proc. European Signal Processing Conference (EUSIPCO), pp. 1128–1132, 2016.
[22] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 Challenge setup: Tasks, datasets and baseline system," Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 85–92, 2017.
[23]
[24] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 19–23, 2018.
[25] A. Pankajakshan, H. L. Bear, and E. Benetos, "Polyphonic sound event and sound activity detection: A multi-task approach," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 323–327, 2019.
[26] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.