HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods
Ziqiang Shi∗, Liu Liu, Huibin Lin, Rujie Liu, and Anyan Shi
Fujitsu Research and Development Center, Beijing, China
Shuangfeng First, Beijing, China
July 18, 2019
Abstract
In this paper, we present a method called HODGEPODGE† for large-scale detection of sound events using the weakly labeled, synthetic, and unlabeled data provided in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge Task 4: Sound event detection in domestic environments. To perform this task, we adopt convolutional recurrent neural networks (CRNN) as our backbone network. In order to deal with the small amount of tagged data and the large amount of unlabeled in-domain data, we focus primarily on how to apply semi-supervised learning methods efficiently to make full use of the limited data. Three semi-supervised learning principles are used in our system: 1) consistency regularization applied to data augmentations; 2) the MixUp regularizer, which requires that the prediction for an interpolation of two inputs be close to the interpolation of the predictions for the individual inputs; 3) MixUp regularization applied to interpolations between data augmentations. We also ensemble various models trained with the different semi-supervised learning principles. Our proposed approach significantly improves the performance of the baseline, achieving an event-based F-measure of 42.0% compared to the 25.8% event-based F-measure of the baseline on the official evaluation dataset. Our submissions ranked third among 18 teams in Task 4.

∗ Corresponding author: [email protected]; [email protected]
† HODGEPODGE has two layers of meaning. The first layer is the variety of training data involved in the method, including weakly labeled, synthetic, and unlabeled data. The second layer refers to the several semi-supervised learning principles involved in our method.

1 Introduction

Sound carries a lot of information about our everyday environment and the physical events that take place there. We can easily perceive the sound scene we are in (busy street, office, etc.) and identify individual sound events (cars, footsteps, etc.).
Herein, we present the method behind our submissions for Task 4 of DCASE 2019. In the following sections, we describe the details of our approach, including feature extraction, the network structure, how to use ICT and MixMatch in the context of the 'Mean Teacher' framework, and how to use the unlabeled data.

2.1 Feature extraction
The dataset for Task 4 is composed of 10-second audio clips either recorded in a domestic environment or synthesized to simulate one. No preprocessing step was applied in the presented framework. The acoustic features used in this system consist of 128-dimensional log mel-band energies extracted from the 44.1 kHz original data using Hanning windows of size 2048 with a hop of 431 points, so the maximum number of frames is 1024. In order to prevent the system from overfitting on the small amount of development data, we added random white noise (before the log operation) to the mel spectrogram of each mini-batch during training. The input to the network is fixed to a 10-second audio clip: if the input audio is shorter than 10 seconds, it is padded to 10 seconds; otherwise it is truncated to 10 seconds.
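As an illustration, the following is a minimal sketch of this feature pipeline using librosa; the noise scale and the exact padding behaviour are assumptions, since the paper does not specify them.

```python
import numpy as np
import librosa

def extract_logmel(path, sr=44100, n_fft=2048, hop=431, n_mels=128, max_frames=1024):
    """Load a clip, pad/truncate to 10 s, and compute a 128-band log mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    target_len = 10 * sr
    y = np.pad(y, (0, max(0, target_len - len(y))))[:target_len]  # pad or truncate to 10 s
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels, window="hann"
    )
    # Additive noise before the log, as a simple training-time regularizer
    # (the noise scale here is an illustrative choice).
    mel = mel + np.abs(np.random.randn(*mel.shape)) * 1e-3
    logmel = librosa.power_to_db(mel)
    return logmel[:, :max_frames]  # shape (128, <=1024)
```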
Figure 1: Architecture of the CRNN in HODGEPODGE.

Figure 1 presents the CRNN network architecture employed in HODGEPODGE. The audio signal is first converted to a [128 × 1024] log mel spectrogram, which is passed through a stack of gated convolutional layers. Each gated linear unit (GLU) computes

$$o = (i \ast W + b) \otimes \sigma(i \ast W_g + b_g),$$

where $i$ and $o$ are the input and output, $W$, $b$, $W_g$, and $b_g$ are learned parameters, $\sigma$ is the sigmoid function, and $\otimes$ is the element-wise product between vectors or matrices.

Figure 2: Architecture of a GLU.

Similar to LSTMs, GLUs play the role of controlling the information passed on in the hierarchy. This special gating mechanism allows us to effectively capture long-range context dependencies by deepening the layers without encountering the problem of vanishing gradients. For the seven gated convolutional layers, the kernel sizes are 3, the paddings are 1, the strides are 1, the numbers of filters are [16, 32, 64, 128, 128, 128, 128], and the poolings are [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2)], respectively. Pooling along the time axis is used when training with the clip-level and frame-level labels.

The gated convolutional blocks are followed by two bidirectional gated recurrent unit (GRU) layers containing 64 units in the forward and backward paths; their outputs are concatenated and passed to the attention and classification layers described below.

As depicted in Figure 1, the output of the bidirectional GRU layers is fed into both a frame-level classification block and an attention block. The frame-level classification block uses a sigmoid activation function to predict the probability of each class occurring at each frame, i.e., the bidirectional GRUs are followed by a dense layer with sigmoid activation that computes posterior probabilities of the different sound classes. The CRNN therefore has two outputs. The output of the bidirectional GRUs followed by the dense sigmoid layer is taken as the sound event detection result and is used to predict event activity probabilities. The other output, the weighted average of the element-wise product of the attention and classification outputs, is taken as the audio tagging result; thus the final prediction for the weak label of each class c is determined by the weighted average of the element-wise product of the attention and classification block outputs for that class.

Our training framework is inspired by the DCASE 2018 Task 4 winning solution [5] and the baseline system [10], both of which use the 'Mean Teacher' model [7]. 'Mean Teacher' is a combination of two models: the student model and the teacher model. At each training step, the student model is trained on synthetic and weakly labeled data with a binary cross-entropy classification cost, while the teacher model's weights are the exponential moving average of the student model's weights. The student model is the final model, and the teacher model is designed to help the student model through a consistency mean-squared-error cost on the frame-level and clip-level predictions of unlabeled audio clips. That is, a good student should output the same class distributions as the teacher for the same unlabeled example, even after it has been augmented. The goal of 'Mean Teacher' is to minimize

$$L = L_w + L_s + w(t)\,L_{cw} + w(t)\,L_{cs},$$

where $L_w$ and $L_s$ are the usual cross-entropy classification losses on the weakly labeled data (weak labels only) and the synthetic data (strong labels only) respectively, $L_{cw}$ and $L_{cs}$ are the teacher-student consistency regularization losses on the unlabeled data with predicted weak and strong labels respectively, and $w(t)$ balances the classification loss and the consistency loss.
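As an illustration of the gated convolution described above, here is a minimal PyTorch sketch of one GLU block and the seven-block stack; the pooling type (average pooling) and the absence of batch normalization or dropout are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class GLUConvBlock(nn.Module):
    """Gated convolution: o = (x * W + b) ⊗ sigmoid(x * W_g + b_g), as in Figure 2."""
    def __init__(self, in_ch, out_ch, pool=(2, 2)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.pool = nn.AvgPool2d(pool)  # pooling type is an assumption

    def forward(self, x):
        return self.pool(self.conv(x) * torch.sigmoid(self.gate(x)))

# Seven gated blocks with the filter counts and poolings listed above.
# Expected input: (batch, 1, mel_bins, frames).
filters = [16, 32, 64, 128, 128, 128, 128]
pools = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2)]
layers, in_ch = [], 1
for f, p in zip(filters, pools):
    layers.append(GLUConvBlock(in_ch, f, p))
    in_ch = f
encoder = nn.Sequential(*layers)
```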
Generally, $w(t)$ changes over time so that the consistency loss initially accounts for a very small proportion of the total and the ratio then slowly grows: in the beginning, neither the student model nor the teacher model gives accurate predictions, so the consistency loss does not make much sense. $w(t)$ has a maximum upper bound, so the proportion of the consistency loss never becomes extremely large. With different maximum upper bounds of the consistency weight $w(t)$, the trained models have different performances; in the next section we ensemble the models trained under different maximum consistency weights to achieve better results.

HODGEPODGE does not change the overall framework of the baseline. It only attempts to combine several of the latest semi-supervised learning methods within this framework.

The first attempt is the interpolation consistency training (ICT) principle [8]. ICT learns a student network in a semi-supervised manner. To this end, ICT uses a 'Mean Teacher' $f_{\theta'}$. During training, the student parameters $\theta$ are updated to encourage consistent predictions

$$f_\theta(\mathrm{Mix}_\lambda(u_j, u_k)) \approx \mathrm{Mix}_\lambda(f_{\theta'}(u_j), f_{\theta'}(u_k)),$$

and correct predictions for labeled examples, where

$$\mathrm{Mix}_\lambda(a, b) = \lambda a + (1 - \lambda) b$$

is called the interpolation or MixUp [11]. In our system, we perform interpolation of sample pairs and their corresponding labels (or pseudo-labels predicted by the CRNNs) in both the supervised loss on labeled examples and the consistency loss on unlabeled examples. In each batch, the weakly labeled data, synthetic data, and unlabeled data are shuffled separately to form a new batch; the ICT principle is then used to generate new augmented data and labels from the corresponding clips in the original and new batches. Note that $\lambda$ is different for each batch. The loss is

$$L_{ict} = L_{w,ict} + L_{s,ict} + w(t)\,L_{cw,ict} + w(t)\,L_{cs,ict},$$

where $L_{w,ict}$ and $L_{s,ict}$ are the classification losses on weakly labeled data (weak labels only) and synthetic data (strong labels only) using ICT respectively, and $L_{cw,ict}$ and $L_{cs,ict}$ are the teacher-student consistency regularization losses with ICT applied to unlabeled data with predicted weak and strong labels respectively.

The second attempt draws on some of the ideas in MixMatch [9], but is not exactly the same. MixMatch introduces a single loss that unifies entropy minimization, consistency regularization, and generic regularization approaches to semi-supervised learning. Unfortunately, MixMatch can only be used with one-hot labels, which is not suitable for Task 4, where there may be several events in a single audio clip, so we did not use MixMatch in its original form. In each batch, K (> 1) different augmentations are generated; the original MixMatch then applies MixUp to all data, regardless of whether the data is weakly labeled, synthetic, or unlabeled. Our experiments found that this does not work well, so we fine-tuned MixMatch to apply MixUp only between augmentations of the same data type. The loss function is similar to the loss in the ICT case; a minimal sketch of the interpolation step shared by both variants is given below.
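The following PyTorch sketch shows the MixUp interpolation and the ICT consistency cost under a 'Mean Teacher'; the Beta(α, α) sampling of λ, the EMA decay value, and the helper names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
import torch

def mixup(a, b, lam):
    """Mix_lambda(a, b) = lam * a + (1 - lam) * b, applied to inputs or (pseudo-)labels."""
    return lam * a + (1.0 - lam) * b

def ict_consistency_loss(student, teacher, unlabeled, alpha=1.0):
    """Interpolation consistency: student(Mix(u_j, u_k)) should match Mix(teacher(u_j), teacher(u_k))."""
    lam = float(np.random.beta(alpha, alpha))      # one lambda per batch
    perm = torch.randperm(unlabeled.size(0))       # pair each clip with a shuffled partner
    u_j, u_k = unlabeled, unlabeled[perm]
    with torch.no_grad():
        target = mixup(torch.sigmoid(teacher(u_j)), torch.sigmoid(teacher(u_k)), lam)
    pred = torch.sigmoid(student(mixup(u_j, u_k, lam)))
    return torch.mean((pred - target) ** 2)        # MSE consistency cost

@torch.no_grad()
def update_teacher(student, teacher, ema_decay=0.999):
    """Mean Teacher: teacher weights are an exponential moving average of the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(ema_decay).add_(s_p, alpha=1.0 - ema_decay)
```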
To further improve the performance of the system, we use ensemble methods to fuse different models. The single models differ along two dimensions: the semi-supervised learning method, and the maximum value of the consistency loss weight. For this challenge, we submitted four prediction results with different model ensembles:

• HODGEPODGE 1: the ensemble is obtained by averaging the outputs of 9 models trained with different maximum consistency coefficients under the 'Mean Teacher' principle. The F-score on the validation data was 0.367. (Corresponding to Shi BossLee task4 1 in the official submissions.)

• HODGEPODGE 2: the ensemble is obtained by averaging the outputs of 9 models trained with different maximum consistency coefficients under the ICT principle. The F-score on the validation data was 0.425. (Corresponding to Shi BossLee task4 2 in the official submissions.)

• HODGEPODGE 3: the ensemble is obtained by averaging the outputs of 6 models trained with different maximum consistency coefficients under the MixMatch principle. The F-score on the validation data was 0.389. (Corresponding to Shi BossLee task4 3 in the official submissions.)

• HODGEPODGE 4: the ensemble is obtained by averaging the outputs of all 24 models from submissions 1, 2, and 3. The F-score on the validation data was 0.417. (Corresponding to Shi BossLee task4 4 in the official submissions.)
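The fusion itself is plain output averaging; a minimal sketch is shown below, assuming each model produces frame-level class posteriors of the same shape (the paper only states that outputs are averaged).

```python
import numpy as np

def ensemble_average(prob_list):
    """Average frame-level posteriors, each shaped (frames, classes), from several models."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)
```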
Sound event detection in domestic environments [11] is a task to detect the onset and offset time steps of sound events in domestic environments. The datasets are drawn from AudioSet [12], FSD [13] and the SINS dataset [14]. The aim of this task is to investigate whether real but weakly annotated data or synthetic data is sufficient for designing sound event detection systems. There are a total of 1578 real audio clips with weak labels, 2045 synthetic audio clips with strong labels, and 14412 unlabeled in-domain audio clips in the development set, while the evaluation set contains 1168 audio clips. Audio recordings are 10 seconds in duration and consist of polyphonic sound events from 10 sound classes.
The evaluation metric for this task is the event-based F-score [15]. The predicted events are compared to a list of reference events by comparing the onset and offset of the predicted event to those of the overlapping reference event. If the onset of the predicted event is within a 200 ms collar of the onset of the reference event, and its offset is within a collar of 200 ms or 20% of the event length around the reference offset, then the predicted event is considered correctly detected and counted as a true positive. If a reference event has no matching predicted event, it is counted as a false negative. If a predicted event does not match any reference event, it is counted as a false positive. In addition, if the system partially predicts an event without accurately detecting its onset and offset, it is penalized twice, as both a false positive and a false negative. The F-score for each class is

$$F_c = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c},$$

where $F_c$, $TP_c$, $FP_c$, and $FN_c$ are the F-score, true positives, false positives, and false negatives of class c, respectively. The final evaluation metric is the average of the F-scores over all classes.

First, we ran experiments to determine the best size of the median window. The median window is used in the post-processing of the posterior probabilities to produce the final events with onsets and offsets; a sketch of this post-processing is given after Table 2. Table 1 shows the performance of the HODGEPODGE systems on the validation set for different median window sizes. Coincidentally, all methods achieve their best performance with a window size of 9.

Table 2 shows the final macro-averaged event-based evaluation results on the evaluation set compared to the baseline system. In fact, HODGEPODGE 1 is an ensemble of baselines; the only differences are that we use a deeper network, a higher sampling rate, and larger features. Both the ICT and MixMatch principles improve performance, especially ICT, which performs best among all HODGEPODGE systems.

Table 1: The performance of HODGEPODGE systems on the validation data set under different median window sizes.

Median window size | 5     | 7     | 9     | 11    | 13
HODGEPODGE 1       | 35.7% | 36.4% | 36.7% | 36.5% | 36.1%
HODGEPODGE 2       | 41.4% | 42.1% | 42.5% | 42.2% | 42.1%
HODGEPODGE 3       | 38.1% | 38.7% | 38.9% | 38.3% | 37.9%
HODGEPODGE 4       | 40.8% | 41.5% | 41.7% | 41.3% | 40.9%

Table 2: The performance of our approach compared to the baseline system.

Method       | Evaluation | Validation
HODGEPODGE 1 | 37.0%      | 36.7%
HODGEPODGE 2 | 42.0%      | 42.5%
HODGEPODGE 3 | 40.9%      | 38.9%
HODGEPODGE 4 | 41.5%      | 41.7%
Baseline     | 25.8%      | 23.7%
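As a sketch of the post-processing referenced above (not the exact implementation), the following thresholds the frame-level posteriors, smooths each class with a median window of 9 frames, and converts the smoothed activity into (class, onset, offset) triples; the threshold value and the frames-to-seconds conversion are assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def decode_events(frame_probs, threshold=0.5, win=9, hop_seconds=431 / 44100):
    """Turn per-frame class posteriors (frames, classes) into [(class, onset_s, offset_s), ...]."""
    events = []
    binarized = frame_probs > threshold
    for c in range(binarized.shape[1]):
        smoothed = median_filter(binarized[:, c].astype(int), size=win)  # median window of 9 frames
        padded = np.concatenate(([0], smoothed, [0]))
        onsets = np.where(np.diff(padded) == 1)[0]    # frame index where activity starts
        offsets = np.where(np.diff(padded) == -1)[0]  # frame index just after activity ends
        for on, off in zip(onsets, offsets):
            events.append((c, on * hop_seconds, off * hop_seconds))
    return events
```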
In this paper, we proposed a method called HODGEPODGE for sound event detection using only weakly labeled, synthetic and unlabeled data. Our approach is based on CRNNs, into which we introduce several recent semi-supervised learning methods, such as interpolation consistency training and MixMatch, within the 'Mean Teacher' framework to leverage the information in audio data that are not accurately labeled. The final F-score of our system on the evaluation set is 42.0%, which is significantly higher than the baseline system's score of 25.8%.
References

[1] A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 2, pp. 379–393, 2018.

[2] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," in DCASE 2017 - Workshop on Detection and Classification of Acoustic Scenes and Events, 2017.

[3] http://dcase.community/challenge2019/.

[4] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," arXiv preprint arXiv:1807.10501, 2018.

[5] L. JiaKai, "Mean teacher convolution system for DCASE 2018 task 4," DCASE2018 Challenge, Tech. Rep., September 2018.

[6] Q. Kong, T. Iqbal, Y. Xu, W. Wang, and M. D. Plumbley, "DCASE 2018 challenge baseline with convolutional neural networks," arXiv preprint arXiv:1808.00773, 2018.

[7] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in Advances in Neural Information Processing Systems, 2017, pp. 1195–1204.

[8] V. Verma, A. Lamb, J. Kannala, Y. Bengio, and D. Lopez-Paz, "Interpolation consistency training for semi-supervised learning," arXiv preprint arXiv:1903.03825, 2019.

[9] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel, "MixMatch: A holistic approach to semi-supervised learning," arXiv preprint arXiv:1905.02249, 2019.

[10] N. Turpault, R. Serizel, A. P. Shah, and J. Salamon, "Sound event detection in domestic environments with weakly labeled data and soundscape synthesis," 2019.

[11] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.

[12] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.

[13] E. Fonseca, J. Pons Puig, X. Favory, F. Font Corbera, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra, "Freesound Datasets: a platform for the creation of open audio datasets," in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 2017, pp. 486–493.

[14] G. Dekkers, S. Lauwereins, B. Thoen, M. W. Adhana, H. Brouckxon, T. van Waterschoot, B. Vanrumste, M. Verhelst, and P. Karsmakers, "The SINS database for detection of daily activities in a home environment using an acoustic sensor network," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany, 2017, pp. 32–36.

[15] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.