ON MULTITASK LOSS FUNCTION FOR AUDIO EVENT DETECTION AND LOCALIZATION
Huy Phan∗, Lam Pham, Philipp Koch, Ngoc Q. K. Duong, Ian McLoughlin, Alfred Mertins

School of Electronic Engineering and Computer Science, Queen Mary University of London, UK
School of Computing, University of Kent, UK
Institute for Signal Processing, University of Lübeck, Germany
InterDigital R&D France, France
Singapore Institute of Technology, Singapore

∗Corresponding email: [email protected]
ABSTRACT
Audio event localization and detection (SELD) has been commonly tackled using multitask models. Such a model usually consists of a multi-label event classification branch with sigmoid cross-entropy loss for event activity detection and a regression branch with mean squared error loss for direction-of-arrival estimation. In this work, we propose a multitask regression model in which both (multi-label) event detection and localization are formulated as regression problems and the mean squared error loss is used homogeneously for model training. We show that the common combination of heterogeneous loss functions causes the network to underfit the data, whereas the homogeneous mean squared error loss leads to better convergence and performance. Experiments on the development and validation sets of the DCASE 2020 SELD task demonstrate that the proposed system also outperforms the DCASE 2020 SELD baseline across all the detection and localization metrics, reducing the overall SELD error (the combined metric) by a considerable absolute margin.
Index Terms — audio event detection, localization, multitask loss, regression, classification
1. INTRODUCTION
Extended from active research on sound (audio) event detection, the sound event localization and detection (SELD) task [1, 2] entangles the what and where questions about occurring sound events. That is, it aims to determine the identities of the events and their spatial locations/trajectories simultaneously. Solving the SELD task would enable a wide range of novel applications in surveillance, human-machine interaction, bioacoustics, and healthcare monitoring, to mention a few.

The joint SELD task can be divided and conquered individually by two separate models, one for sound event detection (SED) [3, 4, 5] and the other for sound source localization (SSL) [6, 7]. The two-stage approach presented in [8] can also be considered to belong to this line of work. Dealing with the joint task in a single model is known to be more challenging. Three main approaches have been proposed: sound-type masked SSL [6], spatially masked SED [9], and joint SELD modeling [10, 2]. Joint sound event detection and localization modeling with multitask deep learning has been most commonly adopted in the latest DCASE challenge [11, 12, 13, 2], demonstrating encouraging results.

In the joint modeling approach with a multitask model, the sigmoid cross-entropy (CE) loss is typically used for event detection (via classification) to handle possible multi-label outputs due to occurrences of multiple events, while the mean squared error (MSE) loss is often employed for direction-of-arrival (DOA) estimation (via regression). These two losses are usually associated with different weights and then combined to form the total loss for network training. However, there exist no established rules to set the weights for the losses; more often than not, they are set to some trivial values without a clear justification. For example, while the DCASE 2019 baseline weighted the MSE loss 50 times larger than the sigmoid CE loss, the current DCASE 2020 baseline even enlarges this multiplication to 1000 times. Furthermore, the two different types of loss functions might progress at different rates and might not converge synchronously, making the fixed weights suboptimal. We will empirically show in a controlled experiment that, for this joint modeling task, the classification based on the CE loss usually experiences underfitting when optimized jointly with regression based on the MSE loss.

In order to avoid this issue, we alternatively propose to formulate both the SED and SSL subtasks as regression problems and homogeneously use the MSE loss for both of them. The proposed multitask-regression network features a convolutional recurrent neural network (CRNN) architecture coupled with a self-attention mechanism [14]. Experiments on the development set of the DCASE 2020 Task 3 show that the proposed multitask-regression network generalizes better than the networks using the combination of the CE loss and the MSE loss. Furthermore, evaluation on the development and evaluation data of the challenge shows that the proposed network outperforms the DCASE 2020 SELD baseline across all the evaluation metrics, some with a large margin.
2. THE PROPOSED NETWORK
The proposed network is illustrated in Figure 1. The network receives time-frequency input S ∈ R^{T×F×C} of T frames, F frequency bins, and C channels. The convolutional part of the network consists of six convolutional layers, each of which is followed by a max pooling layer except the first one. Since we assume that the early convolutional layers are crucial for feature learning, the network is designed with the first two convolutional layers back-to-back. The six convolutional layers share a common kernel size and stride; the gradually increasing numbers of filters in the later convolutional layers compensate for their smaller feature maps in the frequency dimension. Zero-padding (i.e. SAME padding) is used in order to preserve the temporal size. After convolution, batch normalization [15] is applied to the feature maps, followed by Rectified Linear Unit (ReLU) activation [16].

The max pooling layers, except the first one, have a common kernel size that reduces the input size by half in the frequency dimension and, by doing so, gains frequency equivariance in the induced feature maps while keeping the temporal size unchanged. The pooling kernel of the first max pooling layer (max pool 2, cf. Figure 1) is instead set to reduce the time dimension so as to match the frame resolution (100 ms) used for computing the evaluation metrics.

Passing through the convolutional block, the input is transformed into a feature map which is reshaped to form a sequence of feature vectors (x_1, x_2, ..., x_T). A bidirectional recurrent neural network (biRNN) is then employed to iterate through the sequence and encode it into a sequence of output vectors (z_1, z_2, ..., z_T). The biRNN is realized by Gated Recurrent Unit (GRU) cells with a hidden size of 256. To further improve the encoding of the context around a feature z_i, a self-attention mechanism [14] is used. The vectors (z_1, z_2, ..., z_T) can be viewed as a set of key-value pairs (K, V). In the context of this work, both the keys and the values coincide with Z (the concatenation of the vectors z_1, z_2, ..., z_T). We adopt the scaled dot-product attention as in [14], i.e. the attention output at a time index is a weighted sum of z_1, z_2, ..., z_T, where the weights are determined as

\[ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}. \tag{1} \]

Here, Q is the query [14] and also coincides with Z in the context of this work, i.e. Q ≡ K ≡ V ≡ Z. d_k is the dimension into which Q and K are transformed before the dot product to prevent the inner product from becoming too large; d_k is set to 64 in this work.

Figure 1: Overview of the proposed multitask regression self-attention CRNN.

At each time index, the SED and SSL subtasks are accomplished via two network branches, each consisting of two fully connected (fc) layers with 512 units each. The first branch's output layer has Y units with sigmoid activation to perform event activity classification/regression for Y classes. The second branch's output layer has 3Y units with tanh activation to regress the target events' DOA trajectories (three Cartesian coordinates per class).
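To make the attention computation concrete, the following is a minimal TensorFlow sketch of the scaled dot-product self-attention of Eq. (1) with Q ≡ K ≡ V ≡ Z and d_k = 64; the bias-free dense projections and the tensor shapes are illustrative assumptions, not details taken from the paper's implementation.

```python
import tensorflow as tf

def scaled_dot_product_self_attention(z, d_k=64):
    """Self-attention over a biRNN output sequence z of shape
    (batch, T, D), following Eq. (1) with Q = K = V = Z.
    Q and K are linearly projected into d_k dimensions before the
    dot product; the output at each time index is a weighted sum
    of the vectors z_1, ..., z_T."""
    q = tf.keras.layers.Dense(d_k, use_bias=False)(z)  # (batch, T, d_k)
    k = tf.keras.layers.Dense(d_k, use_bias=False)(z)  # (batch, T, d_k)
    # softmax(Q K^T / sqrt(d_k)) gives the attention weights
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(
        tf.cast(d_k, tf.float32))                      # (batch, T, T)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, z)                       # (batch, T, D)
```

Since Q, K, and V all coincide with Z, each output vector summarizes the sequence context around its time index.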
Normally, when the sigmoid CE loss is used for event activity classification and the MSE loss is used for DOA estimation, the network is trained to minimize the following weighted loss:

\[ \mathcal{L}_{\mathrm{CE+MSE}}(\Theta) = -w_{\mathrm{CE}} \sum_{n=1}^{N} \sum_{t=1}^{T} \left( y_{nt}\log \hat{y}_{nt} + (1-y_{nt})\log(1-\hat{y}_{nt}) \right) + w_{\mathrm{MSE}} \sum_{n=1}^{N} \sum_{t=1}^{T} \left\lVert \hat{\mathbf{d}}_{nt}(\Theta) - \mathbf{d}_{nt} \right\rVert^{2}. \tag{2} \]

Here, Θ denotes the network parameters and N denotes the number of training examples. We use ŷ and y to denote the event activity output and ground truth, respectively. In addition, we use d̂ = (x̂, ŷ, ẑ) and d = (x, y, z) to denote the DOA estimation output and ground truth in terms of Cartesian coordinates on the unit sphere, respectively. w_CE and w_MSE indicate the weights given to the corresponding losses.

On the other hand, when the MSE loss is used for both the SED and SSL subtasks, the network is trained to minimize the total MSE loss of the two network branches without weighting:

\[ \mathcal{L}_{\mathrm{MSE}}(\Theta) = \sum_{n=1}^{N} \sum_{t=1}^{T} \left( \left\lVert \hat{\mathbf{y}}_{nt}(\Theta) - \mathbf{y}_{nt} \right\rVert^{2} + \left\lVert \hat{\mathbf{d}}_{nt}(\Theta) - \mathbf{d}_{nt} \right\rVert^{2} \right). \tag{3} \]
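For illustration, the two training objectives can be sketched as follows. This is a minimal TensorFlow rendering of Eqs. (2) and (3), assuming event-activity targets of shape (N, T, Y) and Cartesian DOA targets of shape (N, T, 3Y); the weight defaults and the epsilon guard are assumptions for the sketch.

```python
import tensorflow as tf

def ce_plus_mse_loss(y, y_hat, d, d_hat, w_ce=1.0, w_mse=1000.0, eps=1e-7):
    """Heterogeneous loss of Eq. (2): sigmoid CE on the event-activity
    outputs plus weighted MSE on the DOA outputs. Weights default to
    the DCASE 2020 baseline setting (CE : MSE = 1 : 1000); the eps
    guard against log(0) is standard practice, not part of Eq. (2)."""
    ce = -(y * tf.math.log(y_hat + eps)
           + (1.0 - y) * tf.math.log(1.0 - y_hat + eps))
    mse = tf.square(d_hat - d)
    return w_ce * tf.reduce_sum(ce) + w_mse * tf.reduce_sum(mse)

def homogeneous_mse_loss(y, y_hat, d, d_hat):
    """Homogeneous loss of Eq. (3): unweighted MSE on both branches."""
    return (tf.reduce_sum(tf.square(y_hat - y))
            + tf.reduce_sum(tf.square(d_hat - d)))
```

Note that in Eq. (3) both terms are squared errors on the same scale, so no per-task weights need to be tuned.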
3. EXPERIMENTS

3.1. DCASE 2020 SELD dataset
The database used for the DCASE 2020 SELD task was synthesized in two spatial sound formats: (1) MIC, a 4-channel microphone array format extracted from a subset of the 32-channel Eigenmike format, and (2) FOA, a 4-channel first-order Ambisonics format obtained via a matrix of conversion filters. 714 sound examples from the published NIGENS General Sound Events Database (https://zenodo.org/record/2535878) of 14 event classes, including alarm, crying baby, crash, barking dog, running engine, burning fire, footsteps, knocking on door, female and male speech, female and male scream, ringing phone, and piano, were used for data creation. More information about the data synthesis can be found in [1]. The database was split into eight sets, six of which were used as the development set and the remaining two as the evaluation set.

Experiments on the development set:
We followed the challenge setup to conduct experiments on the development set. That is, the first set of the development data was used as the unseen data for testing purposes, the second set was used as the validation set for model selection, and the remaining four sets were used as the training data.
Experiments on the evaluation set:
To assess performance on the evaluation set, two different systems were trained and submitted to the challenge. The first was trained using the first set of the development data as the validation set for model selection and the remaining five sets as the training data (Submission 1). The second was trained using the entire development data as the training data, i.e. without validation data for model selection (Submission 2).

We extracted log-Mel magnitude spectrograms with a window size of 40 ms, 20 ms overlap, and 64 Mel bands. To encode the phase information, for the FOA data, an acoustic intensity vector was extracted for each Mel band, whereas, for the MIC data, generalized cross-correlation with phase transform (GCC-PHAT) features were computed for each Mel band. Overall, multi-channel spectrogram images resulted for the one-minute FOA and MIC recordings, respectively.
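As a concrete reference for the MIC phase features, the following numpy sketch computes GCC-PHAT for one STFT frame of a microphone pair; retaining as many cross-correlation lags as Mel bands so the feature stacks with the 64-band log-Mel spectrogram is an assumption modelled on common SELD front-ends, not a detail specified above.

```python
import numpy as np

def gcc_phat_frame(spec1, spec2, n_lags=64, eps=1e-8):
    """GCC-PHAT for one STFT frame of a microphone pair.
    spec1, spec2: complex one-sided spectra (np.fft.rfft outputs)
    of the two channels. Returns n_lags cross-correlation values
    centred on lag zero."""
    cross = spec1 * np.conj(spec2)
    cross /= np.abs(cross) + eps   # phase transform: keep phase only
    cc = np.fft.irfft(cross)       # generalized cross-correlation
    # negative lags sit at the end of the irfft output; take a
    # symmetric window of n_lags samples around lag zero
    return np.concatenate((cc[-n_lags // 2:], cc[:n_lags // 2]))
```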
Network implementation was based on the TensorFlow framework. We used spectrogram segments of size T = 600 (equivalent to 12 seconds) as inputs. Dropout was employed to regularize the convolutional layers, the biRNN, and the fully connected layers.

The network was trained using the Adam optimizer [17] for 10000 epochs with a minibatch size of 64. Each spectrogram segment in a minibatch was randomly sampled from a one-minute recording and augmented using spectrogram augmentation [18]. The learning rate was exponentially reduced from its initial value at several milestones during training; in addition, an initial warmup period was used in which the network was trained with a smaller learning rate.

During training, the network snapshot that achieved the lowest combined SELD error on the validation set was retained for evaluation. The retained network was then evaluated on the test recordings with a 2-second segment at a time without overlap. To be able to analyze the effect of using different loss combinations in a controllable manner, no post-processing was carried out. Event activity was determined from the corresponding regression/classification output using a fixed threshold.

The DCASE 2020 challenge evaluated the performance of the SED subtask using the localization-aware detection error rate (ER_20°) and F-score (F_20°) with a spatial threshold of 20° in one-second non-overlapping segments. For sound event localization, only errors between same-class predictions and references were considered. The class-aware localization error (LE_CD) and its corresponding recall (LR_CD) were employed for evaluating localization outputs and were also computed in one-second non-overlapping segments. In addition, we also computed the combined SELD error metric

\[ \mathrm{SELD} = \frac{1}{4}\left( \mathrm{ER}_{20^{\circ}} + (1 - \mathrm{F}_{20^{\circ}}) + \frac{\mathrm{LE}_{\mathrm{CD}}}{180^{\circ}} + (1 - \mathrm{LR}_{\mathrm{CD}}) \right) \tag{4} \]

to give an overall picture about a system.
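The combined metric of Eq. (4) is straightforward to compute from its four components; a minimal sketch, with F_20° and LR_CD expressed in [0, 1] and LE_CD in degrees, is:

```python
def combined_seld_error(er, f, le_cd, lr_cd):
    """Combined SELD error of Eq. (4).
    er:    location-aware detection error rate ER_20
    f:     location-aware F-score F_20 in [0, 1]
    le_cd: class-aware localization error in degrees
    lr_cd: class-aware localization recall LR_CD in [0, 1]"""
    return (er + (1.0 - f) + le_cd / 180.0 + (1.0 - lr_cd)) / 4.0

# Illustrative values: ER = 0.60, F = 0.50, LE = 18 degrees, LR = 0.60
# give (0.60 + 0.50 + 0.10 + 0.40) / 4 = 0.40.
```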
It is a rule of thumb that the CE loss is preferred over the MSE loss for a classification task since, in general, it leads to quicker learning through gradient descent, at least theoretically [19]. However, when it is combined with the MSE loss as in (2), as is common for joint SELD, it apparently underfits the data, as evidenced in Figure 2. When an equal weight is used for the two losses in (2), i.e. w_CE = w_MSE = 1, the CE loss (cf. Figure 2 (c)) and the SED error (cf. Figure 2 (d)) are hard to reduce on both the training and test data (note that the scale of the CE loss in Figure 2 (c) is much larger than that of the MSE loss in Figure 2 (a)). The underfitting effect on the SED subtask is even worse under the skewed weighting scheme used in the DCASE 2020 baseline [1], i.e. the MSE loss was given a weight of 1000.0 and the CE loss a weight of 1.0, since in this case the network further prioritizes optimizing the MSE loss over the CE one.
Table 1: Results obtained by the proposed system and the DCASE 2020 baseline on the development and evaluation sets ("–": not available; F_20° and LR_CD in %, LE_CD in degrees).

FOA format:
System               DOA loss (weight)   SED loss (weight)   LE_CD   LR_CD   ER_20°   F_20°   SELD
Val (DCASE2020)      MSE (1000)          CE (1)              –       –       0.72     37      0.46
Val (CE+MSE)         MSE (1000)          CE (1)              –       –       0.83     41      0.50
Val (CE+MSE)         MSE (1)             CE (1)              –       –       0.78     42      0.45
Val (MSE)            MSE                 MSE                 –       –       0.58     52      0.37
Test (DCASE2020)     MSE (1000)          CE (1)              –       –       0.72     37      0.47
Test (CE+MSE)        MSE (1000)          CE (1)              –       –       0.88     38      0.53
Test (CE+MSE)        MSE (1)             CE (1)              –       –       0.82     39      0.49
Test (MSE)           MSE                 MSE                 –       –       0.60     49      0.39
Eval (DCASE2020)     MSE (1000)          CE (1)              –       –       0.66     43      0.42
Eval (Submission 1)  MSE                 MSE                 –       –       0.52     57      0.33
Eval (Submission 2)  MSE                 MSE                 –       –       0.49     61      0.31

MIC format:
System               DOA loss (weight)   SED loss (weight)   LE_CD   LR_CD   ER_20°   F_20°   SELD
Val (DCASE2020)      MSE (1000)          CE (1)              27°     –       0.74     34      –
Val (CE+MSE)         MSE (1000)          CE (1)              16°     –       0.82     42      –
Val (CE+MSE)         MSE (1)             CE (1)              27°     –       0.86     34      –
Val (MSE)            MSE                 MSE                 17°     –       0.56     53      –
Test (DCASE2020)     MSE (1000)          CE (1)              27°     –       0.78     31      –
Test (CE+MSE)        MSE (1000)          CE (1)              16°     –       0.81     44      –
Test (CE+MSE)        MSE (1)             CE (1)              28°     –       0.93     31      –
Test (MSE)           MSE                 MSE                 18°     –       0.59     50      –
Eval (DCASE2020)     MSE (1000)          CE (1)              21°     –       0.66     44      –
Eval (Submission 1)  MSE                 MSE                 14°     –       0.55     58      –
Eval (Submission 2)  MSE                 MSE                 14°     –       0.53     59      –
We speculate that a similar phenomenon happened to the DCASE 2020 baseline, as it results in limited performance on the SED subtask (cf. Table 1).

In contrast, when the MSE loss is used for both the SED and SSL subtasks as in (3), the SED performance is improved significantly (cf. Figure 2 (d)) while the DOA estimation performance remains comparable to that of the MSE+CE combination (cf. Figure 2 (b)). These results suggest that the SELD multitask network learns more easily when a homogeneous loss is used for all the subtasks than when heterogeneous losses are combined. Although we cannot conclude that the MSE loss is the optimal loss for SELD multitask modeling, these results motivate the quest for one in future work.

The performance obtained by the studied systems on the development and evaluation data is shown in Table 1. As expected, using the MSE loss homogeneously consistently results in much better performance than the MSE+CE combinations. In addition, the proposed system outperforms the DCASE 2020 SELD baseline across the evaluation metrics, particularly on the SED metrics. This is most likely due to the underfitting effect on the SED subtask of the baseline, making it underperform on this subtask. Overall, using FOA and MIC data, the proposed system reduces the combined SELD error on the development data relative to the baseline in absolute terms, and the corresponding error reduction by Submission 2 on the evaluation data is even larger.

Our submission to the DCASE 2020 Task 3 (http://dcase.community/challenge2020/task-sound-event-localization-and-detection-results) was ranked 6th overall. This is an encouraging result given that the submitted systems were compact and relied on neither ensembles nor multiple microphone arrays.

Figure 2: Variation of the MSE loss, the CE loss, the DOA error, and the SED error on the training and test sets of the DCASE 2020 development data with different loss combinations: (a) the DOA estimation (MSE) loss, (b) the DOA error, (c) the SED (CE/MSE) loss, and (d) the SED error. The number in brackets indicates the weight assigned to the corresponding loss.
4. CONCLUSIONS
This work investigated the loss functions used for SELD multitask modeling. We showed empirical evidence that the commonly used combination of the sigmoid CE loss (for the SED subtask) and the MSE loss (for the DOA estimation subtask) often results in an underfitting effect on the former. As an alternative, when the two subtasks were formulated as regression problems and the MSE loss was used for both, the multitask network was able to converge better, resulting in better and more balanced performance. Experimental results on the development and evaluation sets of the DCASE 2020 SELD task showed significant improvements over the DCASE 2020 baseline across all the evaluation metrics.
5. REFERENCES

[1] A. Politis, S. Adavanne, and T. Virtanen, "A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection," arXiv preprint arXiv:2006.01919, 2020.
[2] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2018.
[3] H. Phan, O. Y. Chén, P. Koch, L. Pham, I. McLoughlin, A. Mertins, and M. De Vos, "Unifying isolated and overlapping audio event detection with multi-label multi-task convolutional recurrent neural networks," in Proc. ICASSP, 2019.
[4] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291–1303, 2017.
[5] I. McLoughlin, H. Zhang, Z. Xie, Y. Song, W. Xiao, and H. Phan, "Continuous robust sound event classification using time-frequency features and deep learning," PLoS ONE, vol. 12, no. 9, 2017.
[6] N. Ma, J. A. Gonzalez, and G. J. Brown, "Robust binaural localization of a target sound source by combining spectral source models and deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 2122–2131, 2018.
[7] R. Chakraborty and C. Nadeu, "Sound-model-based acoustic source localization using distributed microphone arrays," in Proc. ICASSP, 2014, pp. 619–623.
[8] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. Plumbley, "Polyphonic sound event detection and localization using a two-stage strategy," in Proc. Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
[9] I. Trowitzsch, C. Schymura, D. Kolossa, and K. Obermayer, "Joining sound event detection and localization through spatial segregation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 487–502, 2020.
[10] W. He, P. Motlicek, and J.-M. Odobez, "Joint localization and classification of multiple sound sources using a multi-task neural network," in Proc. Interspeech, 2018.
[11] F. Grondin, I. Sobieraj, M. Plumbley, and J. Glass, "Sound event localization and detection using CRNN on pairs of microphones," in Proc. Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
[12] H. Cordourier, P. L. Meyer, J. Huang, J. D. H. Ontiveros, and H. Lu, "GCC-PHAT cross-correlation audio features for simultaneous sound event localization and detection (SELD) on multiple rooms," in Proc. Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
[13] S. Kapka and M. Lewandowski, "Sound source detection, localization and classification using consecutive ensemble of CRNN models," in Proc. Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NIPS, 2017, pp. 5998–6008.
[15] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. ICML, 2015, pp. 448–456.
[16] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, 2010.
[17] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015, pp. 1–13.
[18] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech, 2019, pp. 2613–2617.
[19] M. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015.