Duration robust weakly supervised sound event detection
Heinrich Dinkel and Kai Yu
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
[email protected], [email protected]
ABSTRACT
Task 4 of the DCASE2018 challenge demonstrated that substantially more research is needed for a real-world application of sound event detection. Analyzing the challenge results, it can be seen that most successful models are biased towards predicting long (e.g., over 5 s) clips. This work investigates the performance impact of fixed-size window median filter post-processing and advocates double thresholding as a more robust and predictable post-processing method. Further, four different temporal subsampling methods within the CRNN framework are proposed: mean-max, α-mean-max, L_p-norm and convolutional. We show that for this task, subsampling the temporal resolution by a neural network enhances the F1 score as well as its robustness towards short, sporadic sound events. Our best single model achieves 30.1% F1 on the evaluation set and the best fusion model 32.5% F1, while being robust to event length variations.

Index Terms: weakly supervised sound event detection, convolutional neural networks, recurrent neural networks, semi-supervised duration estimation
1. INTRODUCTION
Sound event detection (SED) is concerned with the classifi-cation and localization of a particular audio event (e.g., dogbarking, alarm ringing) such that each event is assigned an on-set (start), offset (end) and label (tagging). In particular, thispaper focuses on weakly-supervised SED (WSSED), a semi-supervised task, which has only access to clip-based labelsduring training, yet needs to predict onsets and offsets duringevaluation.SED can be used for query-based sound retrieval [1],smart cities and homes [2, 3]. In contrast to similar tasks suchas speech/speaker recognition, the recorded audio propertiesare vast, often noisy, can overlap and be assigned multipleevents at once. Recent interest in SED has risen due to chal-lenges such as the Detection and Classification of Acoustic
This work has been supported by the Major Program of the National Social Science Foundation of China (No. 18ZDA293). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.
In this work, we focus on sound event detection within domestic environments, specifically task 4 of the DCASE2018 challenge [4].

Recently, much research attention has been devoted to improving CRNN performance [5, 6]. Kothinti et al. [7] presented an interesting approach by separating SED into an unsupervised onset and offset estimation problem using conditional restricted Boltzmann machines (c-RBM), along with supervised label prediction using a CRNN. Their results of 30% F1 on the development set and 25% F1 on the evaluation set of the DCASE2018 task 4 dataset indicate the robustness of this approach. Wang et al. [8] modified connectionist temporal classification (CTC) [9] in order to capture long and short events effectively. Essential work in [10] analyzed the usage of attention-based high-level feature disentangling. Other studies [11] analyzed the impact of several post-processing methods, specifically the estimation of an event-dependent threshold, with regard to WSSED. Their results indicate that the choice of post-processing is crucial, which we also confirm in this work with an increase of 10.9% absolute in F1 score on the DCASE2018 task 4 dataset. Lastly, [12] introduced a cosine penalty between different time-event predictions of the CRNN, aiming to enhance the per-time-step discriminability of each event. This idea is similar to large-margin softmax (L-softmax) [13] and improved the F1 score on the DCASE2018 task 4 development dataset.

In contrast to previous work, we aim to enhance WSSED performance, specifically regarding short-duration events, by estimating event duration in a robust manner. This paper is organized as follows: Section 2 states current problems of state-of-the-art event-based SED and proposes our ideas to alleviate these problems. Section 3 describes the experimental setup and experiments, culminating in Section 4, where results are shown and analyzed.
2. CONTRIBUTION
In our view, WSSED's main difficulty is the inability to estimate appropriate event durations, due to the lack of any prior event knowledge. Specifically, estimating short, sporadic events such as dog barking is more difficult than estimating long, continuous events. In this work, we identify and analyze three key problems within WSSED: 1) During training, weak label estimates are obtained via mean-pooling the temporal dimension; this approach benefits long events and neglects short ones [14]. 2) During inference, per-frame predictions are post-processed to smooth event predictions using median filtering, which is shown to further benefit long events. 3) Neural network predictions are made on a fine scale (e.g., 20 ms); due to the noisy nature of this task, post-processing is necessary in order to obtain coherent predictions, yet post-processing cannot be learned by the network directly.

We aim to alleviate these problems by 1) incorporating linear softmax [14] as the default temporal pooling method; 2) using double thresholding as a window-independent filtering alternative to median filtering; 3) subsampling the temporal resolution of our neural network predictions in order to learn event boundaries directly. Furthermore, we show that the previous best model submitted to the DCASE2018 task 4 challenge [15] is biased towards long event predictions, and we propose a duration robust alternative.

This work solely uses the event-based F1 score [16] as the primary evaluation metric, which requires predictions to be smooth (contiguous) and penalizes irregular or disjoint predictions. In order to loosen the strictness of this measure, a flexible onset tolerance (t-collar) of 200 ms, as well as an offset tolerance of at most 20% of an event's duration, is considered valid.
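For reference, this metric is implemented in the sed_eval toolbox accompanying [16]; a minimal usage sketch (the event lists and the label subset shown here are illustrative assumptions, not data from this work):

```python
import sed_eval

# Event-based F1 with a 200 ms onset collar and an offset tolerance of
# 20% of the event duration, as used throughout this work.
metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=['Speech', 'Dog'],  # illustrative subset of the 10 classes
    t_collar=0.200,
    percentage_of_length=0.2,
)

reference = [
    {'file': 'a.wav', 'event_label': 'Dog', 'onset': 0.5, 'offset': 1.2},
]
estimated = [
    {'file': 'a.wav', 'event_label': 'Dog', 'onset': 0.6, 'offset': 1.1},
]

# sed_eval is typically fed one file at a time.
metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)
print(metrics.results_overall_metrics()['f_measure'])
```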
3. EXPERIMENTS
Regarding feature extraction, log-mel spectrograms (LMS) were used in this work, extracted via a short-time Fourier transform with a frame shift of 20 ms. Since our network processes each input clip as a whole (convolutions over the temporal and frequency dimensions) rather than frame by frame, padding needs to be applied to ensure a fixed input size within each batch. Batch-wise zero padding is applied, which is essentially equal to padding the entire dataset to the maximal clip length of 10 s. Since the Speech class accounts for nearly 25% of the total training set, random oversampling is utilized by assigning each class a weight inversely proportional to its occurrence count.

Since hard labels are unknown during training, a final temporal pooling function needs to be utilized in order to reduce an input clip's temporal dimension to a single vector representing class probabilities. Work in [14] proposed linear softmax (LS), a parameter-free and effective temporal pooling function for WSSED:

y(c) = \frac{\sum_{t=1}^{T} y_t(c)^2}{\sum_{t=1}^{T} y_t(c)}    (1)

LS is defined in Equation (1), where y_t(c) ∈ [0, 1] is the output probability of event class c at time step t. Since LS depends only on the per-frame probabilities and not on the number of frames, it is more robust to length variations than traditional mean pooling. Standard binary cross entropy (BCE) is used as the training criterion.
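A minimal PyTorch sketch of the pooling in Equation (1); the tensor layout (batch, time, classes) and the small epsilon guarding against an all-zero denominator are illustrative assumptions:

```python
import torch

def linear_softmax_pooling(y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Linear softmax (LS) pooling, Eq. (1).

    y: per-frame probabilities in [0, 1], shape (batch, time, classes).
    Returns clip-level probabilities, shape (batch, classes).
    """
    # Each frame is weighted by its own probability, so confident frames
    # dominate and the estimate does not dilute with clip length.
    return (y ** 2).sum(dim=1) / y.sum(dim=1).clamp(min=eps)

# Example: 8 clips, 500 frames (10 s at a 20 ms shift), 10 event classes.
clip_probs = linear_softmax_pooling(torch.rand(8, 500, 10))  # -> (8, 10)
```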
Our model follows a CRNN approach and can be seen in Table 1, where BiGRU represents a bidirectional GRU recurrent neural network. LS is only used during training, while during inference post-processing methods are applied (see Section 3.2). For each subsampling layer (s_1, s_2, s_3, s_4), we employ a different time subsampling factor, notated as

P : \{(s_1, s_2, s_3, s_4) \in \{1, 2\}^4 : s_i \ge s_j, 1 \le i \le j \le 4\}, \quad (s_1, s_2, s_3, s_4) \mapsto \prod_{i=1}^{4} s_i.

Moreover, the inverse P^{-1} maps a total subsampling factor back to a subsampling layer sequence. Here we subsample by at most a factor of 2 at each layer, while the feature dimension D is always halved. Five different subsampling configurations are introduced by S_k = P^{-1}(k), k = 1, 2, 4, 8, 16, where S_1 represents no temporal subsampling and S_16 represents subsampling by a factor of 16.

Layer        Parameter
Block1       Channel, × Kernel
Subsample1   s_1 ↓
Block2       Channel, × Kernel
Subsample2   s_2 ↓
Block3       Channel, × Kernel
Subsample3   s_3 ↓
Block4       Channel, × Kernel
Subsample4   s_4 ↓
Block5       Channel, × Kernel
Dropout
BiGRU        Units
Linear       Units
LS

Table 1: CRNN architecture used in this work. One block refers to an initial batch normalization, then a convolution and lastly a ReLU activation. All convolutions use padding in order to preserve the input size. The notation s_k ↓ represents subsampling the temporal dimension by s_k as well as halving the feature dimension D.
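The mapping P and its inverse P^{-1} can be made concrete in a few lines of Python; a small sketch (the function name is ours):

```python
from itertools import product
from math import prod

def subsampling_configs(n_layers: int = 4) -> dict:
    """Enumerate non-increasing layer factors (s_1..s_4) in {1, 2}^4 and
    map each total factor P(s) = s_1*s_2*s_3*s_4 to its layer sequence,
    i.e., the inverse mapping P^{-1}."""
    return {
        prod(s): s
        for s in product((2, 1), repeat=n_layers)
        if all(s[i] >= s[i + 1] for i in range(n_layers - 1))
    }

# {16: (2,2,2,2), 8: (2,2,2,1), 4: (2,2,1,1), 2: (2,1,1,1), 1: (1,1,1,1)}
print(subsampling_configs())
```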
Training was done using Adam optimization with an initial learning rate of 0.001. The learning rate was reduced if the model did not improve on the held-out set for 10 epochs. If the learning rate dropped below a set minimum, training was terminated. We used PyTorch [17] as our deep neural network toolkit; code is available at github.com/richermans/dcase2018_pooling.

3.1. Dataset

Experiments are conducted on the DCASE2018 task 4 dataset, sampled from the larger AudioSet [18]. The dataset consists of a weakly-labelled training set, hard-labelled development and evaluation sets, as well as unlabelled in-domain and out-of-domain subsets. This paper only uses the training, development and evaluation subsets. The DCASE2018 task 4 dataset consists of 1578 training clips with 10 class labels, as well as 242 development clips situated in domestic environments. Ten classes need to be estimated: Speech, Cat, Dog, Dishes, Running water, Vacuum cleaner, Frying, Electric shaver/toothbrush, Blender and Alarm bell.
Fig. 1: DCASE18 Task 4 development data length distribution. The average per-class duration (number in box) as well as the average data duration (colored box) are given. Each gray dot represents a single clip duration, and each bar the relative class event occupation within the dataset.
Analyzing the development data distribution in Figure 1 reveals that the dataset can be effectively split into long-duration events (Vacuum cleaner, Running water, Frying, Electric shaver/toothbrush and Blender) and short-duration events (Speech, Dog, Dishes, Cat and Alarm bell).
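Per-class duration statistics of this kind can be recomputed from the strong development labels; a sketch assuming a tab-separated label file with onset, offset and event_label columns (the file path is illustrative):

```python
import pandas as pd

# Strong labels: one row per event with filename, onset, offset, event_label.
labels = pd.read_csv("metadata/test.tsv", sep="\t")  # illustrative path
labels["duration"] = labels["offset"] - labels["onset"]

# Average event duration per class, shortest first.
print(labels.groupby("event_label")["duration"].mean().sort_values())
```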
3.2. Post-processing

During inference, post-processing is applied to smooth each event-class probability sequence. Here we investigate the effect of two post-processing methods and their hyperparameters on model performance: standard median filtering and the proposed double thresholding. Two sets of experiments with mean and max subsampling were run using the default model configuration S_1, meaning no temporal subsampling is applied.

3.2.1. Median filtering

Standard median filtering first preprocesses y_t(c) by a threshold φ, keeping y_t(c) > φ. Then a median filter of size ω is applied in a rolling-window fashion onto the sequence in order to smooth the frame predictions. Here, the threshold value φ is kept fixed, and two different median filter size configurations ω ∈ {1, 51} were investigated. A window size of 1 represents no filtering, while 51 is chosen as the default value.

As our initial experiments (Table 2) suggest, filtering with a window size of ω = 51 leads to an overall performance increase for both mean and max subsampling networks. However, this increase solely stems from a shift in focus from short clips to long ones. For mean subsampling, the short-clip F1 score drops by 8% absolute, while long-clip performance increases by 10% absolute. We therefore conclude that a larger ω correlates with an overall performance increase, at the cost of performance on short clips. One of the major downsides of median filtering is that it can potentially erase model predictions, e.g., high-confidence model estimates, as well as shift previously predicted event boundaries.

             ω = 51                ω = 1
Sub          Short  Long   Avg    Short  Long   Avg
Max          26.18  …      …      …      …      …
Mean         …      …      …      …      …      …

Table 2: Development F1-scores for different window sizes ω of a median filter with respect to long and short clips.

In this work we advocate the use of double thresholding (see Section 3.2.2) in order to ameliorate these problems, i.e., a post-processing filter that neither erases high-confidence estimates nor shifts event boundaries.

3.2.2. Double thresholding

This technique uses two thresholds φ_low, φ_hi. Double thresholding first sweeps over an output probability sequence and marks all values larger than φ_hi as valid predictions. It then enlarges the marked frames by searching for all adjacent, continuous predictions larger than φ_low. Double thresholding can also incorporate a window size ω, but with a different purpose from standard median filtering: here, a window size ω represents the number of frames being connected after thresholding. In this paper, the thresholds φ_low and φ_hi are kept fixed across all experiments.
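A minimal NumPy/SciPy sketch of both post-processing variants for a single event-class probability sequence; the default parameter values shown are illustrative stand-ins, not the paper's exact settings:

```python
import numpy as np
from scipy.ndimage import median_filter

def median_filter_postprocess(prob: np.ndarray, phi: float = 0.5,
                              omega: int = 51) -> np.ndarray:
    """Section 3.2.1: binarize at phi, then smooth with a median filter
    of window size omega (omega = 1 amounts to no filtering)."""
    hard = (prob > phi).astype(float)
    return median_filter(hard, size=omega) > 0.5

def double_threshold(prob: np.ndarray, phi_low: float = 0.2,
                     phi_hi: float = 0.75) -> np.ndarray:
    """Section 3.2.2: frames above phi_hi seed an event; each seed is
    extended over all adjacent, contiguous frames above phi_low, so
    high-confidence estimates are never erased or shifted."""
    candidates = prob > phi_low
    out = np.zeros_like(prob, dtype=bool)
    for t in np.flatnonzero(prob > phi_hi):
        if out[t]:
            continue  # already covered by a previous seed
        lo, hi = t, t
        while lo > 0 and candidates[lo - 1]:
            lo -= 1
        while hi + 1 < len(prob) and candidates[hi + 1]:
            hi += 1
        out[lo:hi + 1] = True
    return out
```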
As the results in Table 3 indicate, double thresholding provides an overall better and duration-robust performance compared to fixed-size median filtering, without being affected by a non-optimal ω. Due to these results, all following experiments are run with ω = 1, effectively neglecting the influence of ω.

             ω = 51                ω = 1
Sub          Short  Long   Avg    Short  Long   Avg
Max          16.82  …      …      …      …      …
Mean         …      …      …      …      …      …

Table 3: Comparison of double thresholding with different window sizes (ω) on the development set.

Both mean and max subsampling seem to exhibit similar performance. Thus this work investigated the use of joint mean and max subsampling, as described in Section 3.3.

3.3. Subsampling

One contribution of this work is to investigate appropriate subsampling layers. Here we propose four subsampling schemes (Table 5), where x represents the region of the image being pooled. The methods consist of averaged mean-max subsampling (MM) as used in the LightCNN framework [20], a learnable convex combination of mean and max denoted α-mean-max (α-MM) [19], L_p-norm subsampling, and a weighted subsampling method implemented as strided convolutions with kernel size K × K and stride K × K. L_p-norm subsampling can be seen as a generalization of mean and max subsampling, where p = 1 equals mean and p = ∞ equals max subsampling.

Name   Formulation
MM     (mean(x) + max(x)) / 2
α-MM   α max(x) + (1 − α) mean(x)
L_p    (mean(x^p))^{1/p}
Conv   W x

Table 5: Proposed subsampling layers. α is learned [19]; p in the L_p-norm is set empirically.
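A PyTorch sketch of the schemes in Table 5; the kernel size, the clamping epsilon and the value of p are illustrative assumptions (in the CRNN of Table 1, a kernel of (s_k, 2) would subsample time by s_k and halve D):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlphaMeanMaxPool2d(nn.Module):
    """alpha-MM subsampling: alpha * max(x) + (1 - alpha) * mean(x) over
    each pooled region, with a learnable scalar alpha [19]. Freezing
    alpha = 0.5 recovers the averaged mean-max (MM) variant."""
    def __init__(self, kernel_size, alpha: float = 0.5):
        super().__init__()
        self.kernel_size = kernel_size
        self.alpha = nn.Parameter(torch.tensor(alpha))

    def forward(self, x):
        return (self.alpha * F.max_pool2d(x, self.kernel_size)
                + (1 - self.alpha) * F.avg_pool2d(x, self.kernel_size))

class LpPool2d(nn.Module):
    """L_p-norm subsampling: (mean(x^p))^(1/p). p = 1 is mean pooling and
    p -> infinity approaches max pooling for non-negative activations."""
    def __init__(self, kernel_size, p: float = 4.0):  # p value assumed
        super().__init__()
        self.kernel_size, self.p = kernel_size, p

    def forward(self, x):
        # Clamping keeps the fractional power well-defined for ReLU outputs.
        return F.avg_pool2d(x.clamp(min=1e-7) ** self.p,
                            self.kernel_size) ** (1.0 / self.p)

# Convolutional subsampling: a strided convolution, here kernel = stride = K.
def conv_subsample(channels: int, k) -> nn.Module:
    return nn.Conv2d(channels, channels, kernel_size=k, stride=k)

# Example: subsample time by 2 and halve the feature dimension D.
pool = AlphaMeanMaxPool2d(kernel_size=(2, 2))
out = pool(torch.rand(8, 32, 500, 64))  # -> (8, 32, 250, 32)
```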
4. EXPERIMENTAL RESULTS
In this work, we compare our proposed approach to the winner of the DCASE2018 task 4 challenge [15] on the development and evaluation sets. The winner system differs from our approach in two major ways: 1) it uses semi-supervised median filtering, where the window length ω is estimated from the labelled development data; 2) it uses the additionally provided unlabelled in-domain dataset for model training. Results in Table 4 indicate that subsampling improves F1-score performance consistently with increasing factors for all subsampling methods except MM, which stops improving earlier. The largest tested factor, S_16, shows no improvement for any of the proposed methods: the temporal resolution reaches 320 ms, which is longer than the t-collar of 200 ms, leading to possible event label skips. More importantly, the majority of our models produce stable scores between the development and evaluation sets, losing only around 5% absolute between the datasets.

             S_1           S_2           S_4           S_8           S_16          Fusion
Subset       dev    eval   dev    eval   dev    eval   dev    eval   dev    eval   dev    eval
Winner-2018  -      -      -      -      -      -      -      -      -      -      25.90  32.40
Conv         27.26  14.97  23.04  19.95  32.05  22.46  24.80  21.13  16.39  17.07  25.26  23.68
LP           28.82  23.29  32.30  27.46  35.34  …      …      …      …      …      …      …
α-MM         23.22  20.13  …      …      …      …      …      …      …      …      …      …
MM           …      …      …      …      …      …      …      …      …      …      …      …

Table 4: Results for all four proposed subsampling types. Fusion is done by averaging the model outputs of k = 2, 4, 8. The Winner-2018 system is a fusion system. Results highlighted in bold are the best in class.

The results in Table 4 also indicate that strided convolution is largely outperformed by the traditional subsampling methods, which we believe is due to its additional parameters given the limited amount of data provided in this challenge. As can be seen in Table 6, the previously best performing system is biased towards predicting long clips, with a large absolute gap between short and long ones. Our proposed systems can reduce the short-long clip gap to as low as 4% F1, while performing equally as well as the 2018-Winner system. In future research, we would like to investigate dynamic subsampling strategies further.

Type         Short  Long   Gap    Avg
2018-Winner  23.32  …      …      …
α-MM         29.66  35.40  5.74   …

Table 6: Short and long clip results for evaluation data. Gap is the absolute difference between long and short clip F1 scores. All shown results are model fusions.
5. CONCLUSIONS
This paper shows that current WSSED models focus on long-duration events in order to enhance performance, while neglecting short events. It is shown that the DCASE2018 Task 4 winner system exhibits a bias towards long events. Three potential reasons for this bias are tracked down: 1) temporal mean pooling; 2) fixed-size median filtering; 3) training neural networks on a fine-grained time scale. We alleviate each problem by advocating the use of 1) linear softmax pooling; 2) double threshold filtering; 3) temporal subsampling to a coarser scale. Double thresholding as a post-processing method is shown to broadly outperform median filtering in terms of both robustness to duration and overall performance, while being unaffected by the choice of window size. The standard CRNN model is modified to subsample the temporal resolution by up to a factor of 16 and is shown to improve performance up to intermediate subsampling factors. Further, variations of mean and max subsampling are shown to enhance performance on the development and evaluation sets. Our best single model achieves 30.8% F1, while the best model fusion obtains 32.5% F1. Furthermore, our proposed method reduces the gap between short and long clips to 4% F1.

6. REFERENCES

[1] Frederic Font, Gerard Roma, and Xavier Serra, Sound Sharing and Retrieval, pp. 279–301, Springer International Publishing, Cham, 2018.

[2] Juan Pablo Bello, Charlie Mydlarz, and Justin Salamon,
Sound Analysis in Smart Cities, pp. 373–397, Springer International Publishing, Cham, 2018.

[3] Sacha Krstulović,
Audio Event Recognition in the Smart Home, pp. 335–371, Springer International Publishing, Cham, 2018.

[4] Romain Serizel, Nicolas Turpault, Hamid Eghbal-Zadeh, and Ankit Parag Shah, “Large-scale weakly labeled semi-supervised sound event detection in domestic environments,” in
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), November 2018, pp. 19–23.

[5] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291–1303, June 2017.

[6] T. Iqbal, Y. Xu, Q. Kong, and W. Wang, “Capsule routing for sound event detection,” in 2018 26th European Signal Processing Conference (EUSIPCO), Sep. 2018, pp. 2255–2259.

[7] S. Kothinti, K. Imoto, D. Chakrabarty, G. Sell, S. Watanabe, and M. Elhilali, “Joint acoustic and class inference for weakly supervised sound event detection,” in
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 36–40.

[8] Y. Wang and F. Metze, “Connectionist temporal localization for sound event detection with sequential labeling,” in
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 745–749.

[9] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in
ICML, 2006, vol. 148 of ACM International Conference Proceeding Series, pp. 369–376, ACM.

[10] Liwei Lin, Xiangdong Wang, Hong Liu, and Yueliang Qian, “Specialized decision surface and disentangled feature for weakly-supervised polyphonic sound event detection,” arXiv preprint arXiv:1905.10091, 2019.

[11] L. Cances, P. Guyot, and T. Pellegrini, “Evaluation of post-processing algorithms for polyphonic sound event detection,” in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct 2019, pp. 318–322.

[12] Thomas Pellegrini and Leo Cances, “Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection,” in
International Joint Conference on Neural Networks (IJCNN 2019), Budapest, HU, 2019, pp. 1–8, INNS: International Neural Network Society.

[13] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang, “Large-margin softmax loss for convolutional neural networks,” in
ICML, 2016, vol. 2, p. 7.

[14] Y. Wang, J. Li, and F. Metze, “A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling,” in
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 31–35.

[15] Lu JiaKai, “Mean teacher convolution system for DCASE 2018 task 4,” Tech. Rep., DCASE2018 Challenge, September 2018.

[16] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, “Metrics for polyphonic sound event detection,”
Applied Sciences, vol. 6, no. 6, pp. 162, 2016.

[17] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” 2017.

[18] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.

[19] Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in
Artificial Intelligence and Statistics, 2016, pp. 464–472.

[20] Xiang Wu, Ran He, Zhenan Sun, and Tieniu Tan, “A light CNN for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, 2018.